![Pandas logo](img/pandas.svg)

# Containers

Let us back up a little to understand what Pandas containers really are.  We have mostly given examples of DataFrames in earlier modules, and indeed that is the primary object you will work with.  Moreover, we loosely implied that a Pandas DataFrame is a wrapper around a 2-D NumPy array; that is not a bad starting metaphor, but it is slightly wrong.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from src.training import *

## Series

The basic data container in Pandas is actually a *Series*, not a DataFrame.  A `pd.Series` does wrap a NumPy array, but always a 1-D array.  The principal addition a Pandas Series gives us is labels on its rows.

In [None]:
arr1 = np.arange(10, 15)
arr2 = np.arange(1., 6)
arr3 = np.array(['foo', 'bar', 'baz', 'bam', 'fiz'])
s1, s2, s3 = (pd.Series(arr) for arr in (arr1, arr2, arr3))

In [None]:
print('arr1')
print(arr1)
print('s1')
print(s1)

For the most part, working with a Series is similar to working with an array when the Series has a default index (as in the examples so far). The screen display is a little different, but accessing items by index is the same. Indeed, most code that expects 1-D NumPy arrays will seamlessly handle Pandas Series as well, so you often do not need to think about the difference when writing functions.
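
As a minimal sketch of that interchangeability, the hypothetical helper `rescale()` below (not part of the course code) is written with only array-style operations, yet it accepts either container. The one visible difference is that a Series goes in and a Series, with its labels, comes out.

```python
import numpy as np
import pandas as pd

def rescale(values):
    """Linearly rescale 1-D data to the range [0, 1]."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

arr = np.arange(10, 15)
ser = pd.Series(arr)

scaled_arr = rescale(arr)   # NumPy array in, NumPy array out
scaled_ser = rescale(ser)   # Series in, Series out (row labels kept)

print(type(scaled_arr).__name__, type(scaled_ser).__name__)
print(np.allclose(scaled_arr, scaled_ser.to_numpy()))
```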

In [None]:
# Slice
print('dtype:', arr1.dtype)
arr1[2:4]

In [None]:
# Slice
s1[2:4]

Where things are more evidently different is when the index is something different from successive integers starting at 0.

In [None]:
s_new = pd.Series(arr1, index=arr3)
s_new

In [None]:
# Can access rows by meaningful names
s_new[['foo', 'bam']]

In [None]:
# But can still use "accessor" to access by position
s_new.iloc[[0, 3]]

In [None]:
# To avoid ambiguity, best practice is to use the .loc accessor for labels too
s_new.loc[['foo', 'bam']]
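
The ambiguity is easiest to see when the row labels are themselves integers that do not match the positions. The small Series here is made up for illustration; `.loc` always means "by label" and `.iloc` always means "by position."

```python
import pandas as pd

# Integer labels deliberately out of positional order
s = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the first row, whatever its label

print(by_label, by_position)
```

Spelling out the accessor makes the intent clear to readers of the code as well as to Pandas.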

## DataFrame

A DataFrame is **not** in fact a wrapper around a 2-D array, but more like a "bag" or "mapping" of column names to multiple Pandas Series.  DataFrames carry an additional constraint (and capability): all the Series contained in a DataFrame must share the same index.  This is a powerful abstraction.

One way to understand why DataFrames must be "bags" of 1-D structures is to think about the fact that they often contain different datatypes per column.  In contrast, N-dimensional arrays *always* have a homogeneous datatype across all values inside them.
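
A short sketch of that contrast, using made-up values: NumPy must pick one datatype for the whole array, while each DataFrame column keeps its own.

```python
import numpy as np
import pandas as pd

# A single array needs one dtype, so mixed values are coerced
numbers = np.array([1, 2.5])           # integer is upcast to float
with_text = np.array([1, 2.5, 'foo'])  # everything becomes a string
print(numbers.dtype, with_text.dtype)

# Each DataFrame column is its own Series with its own dtype
frame = pd.DataFrame({'Integers': [1, 2, 3],
                      'Floats': [1.0, 2.5, 3.0],
                      'Words': ['foo', 'bar', 'baz']})
print(frame.dtypes)

# Flattening to one 2-D array forces a lowest-common-denominator dtype
print(frame.to_numpy().dtype)
```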

In [None]:
df = pd.DataFrame({'Words': s3, 'Integers': s1, 'Floats': s2})
df

We can absolutely work with these sequential integers as the index.  But notice that the index is "sticky" in a way it is not for a 2-D NumPy array.

In [None]:
# After operations, index might not be sequential integers
df2 = df.loc[[0, 3]].copy()
df2

In [None]:
# There is an index called '3'
df2.loc[3]

In [None]:
# But there is no index called '1'
try:
    print(df2.loc[1])
except Exception as err:
    print(err.__class__)

In [None]:
arr2d = np.c_[arr1, arr2]
arr2d

In [None]:
arr2d_2 = arr2d[[0, 3]]
arr2d_2

In [None]:
# In contrast to DataFrame, array is always sequentially indexed
arr2d_2[1, :]

DataFrames may be created in numerous ways, both from input data files and with different patterns in the class initialization.  Many variants of these are left to the exercises.

## Index

Let us use one particular column as our index for an example.

In [None]:
df_names = df.set_index('Words')
df_names

In [None]:
df_names.loc[['foo', 'bam']]

We can use more general patterns to get at portions of the data by combining predicates (about rows) with column selectors.  As with NumPy, commas separate "dimensions."  Even when the rows come from the `RangeIndex` that is applied by default, the selected rows keep their original labels after the selection.

In [None]:
cancer = pd.read_csv('data/wisconsin.csv')
focus = cancer.loc[cancer['mean radius'] > 24, 'mean radius':'mean area']
focus
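
The same pattern can be tried without the data file. The toy DataFrame below is invented for illustration (its values are not taken from `wisconsin.csv`); it only borrows the column names used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'mean radius':  [12.0, 25.2, 18.1, 27.4],
    'mean texture': [14.1, 20.3, 17.7, 26.9],
    'mean area':    [450.0, 1980.0, 1020.0, 2250.0],
    'label':        [1, 0, 1, 0],
})

# Row predicate before the comma, column label slice after it
mask = toy['mean radius'] > 24
subset = toy.loc[mask, 'mean radius':'mean area']

print(subset)
print(list(subset.index))
```

Note that a label slice such as `'mean radius':'mean area'` includes both endpoints, unlike a positional slice.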

In [None]:
cancer.index

The `RangeIndex` is an efficient and lazy way to represent a sequence: like Python's `range()`, it stores only a start, a stop, and a step rather than every label.
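
A rough sketch of what "lazy" means here, comparing a `RangeIndex` with an index holding the same labels as an actual array. The size is arbitrary.

```python
import numpy as np
import pandas as pd

n = 100_000
lazy = pd.RangeIndex(n)                 # remembers only start, stop, step
materialized = pd.Index(np.arange(n))   # stores every label explicitly

print(lazy.start, lazy.stop, lazy.step)
print(lazy.nbytes, materialized.nbytes)

# Slicing a RangeIndex can stay lazy too
every_other = lazy[::2]
print(type(every_other).__name__, every_other.step)
```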

Other common **Index** types are

|    Class      |         Type of Values     |
|---------------|----------------------------|
|`Index`        | Anything, often strings    |
|`Int64Index`   | 64-bit integers (removed in Pandas 2.0, where `Index` with an `int64` dtype is used instead) |
|`DatetimeIndex`| `Timestamp` objects        |
|`RangeIndex`   | `range()`-like integers    |
|`MultiIndex`   | Hierarchical values        |

In [None]:
focus.index
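
Most of the index classes in the table can be produced directly, which is a handy way to see what each one looks like. The labels and dates below are arbitrary examples.

```python
import pandas as pd

labels = pd.Index(['foo', 'bar', 'baz'])
dates = pd.date_range('2024-01-01', periods=3, freq='D')
default = pd.DataFrame({'x': [1, 2, 3]}).index
multi = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1)], names=['letter', 'number'])

for idx in (labels, dates, default, multi):
    print(type(idx).__name__)

# Elements of a DatetimeIndex are Timestamp objects
print(type(dates[0]).__name__)
```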

We might also choose an index by indicating it in the initializer, rather than by later calling `.set_index()` on the DataFrame.  Notice the crucial differences in this constructor:

* The columns are indicated with the keyword argument `data` and a dictionary
* The index is indicated with the keyword argument `index`
* We cannot directly use Series that already carry different row labels (index), because Pandas would align on those labels
  * Can use the original NumPy array for values
  * Can use `series.array` for unlabelled values
  * Can cast back to NumPy with `np.array(series)`

In [None]:
pd.DataFrame(data={'Integers': arr1, 'Floats': s2.array}, index=s3)
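
To see why the already-labeled Series is a problem, the sketch below (with made-up values) passes one in anyway. Pandas aligns the Series on its existing labels 0, 1, 2, finds none of them in the requested index, and fills the column with missing values.

```python
import pandas as pd

words = pd.Series(['foo', 'bar', 'baz'])
floats = pd.Series([1.0, 2.0, 3.0])   # carries labels 0, 1, 2

# Labels 0, 1, 2 do not occur in the new index, so nothing lines up
misaligned = pd.DataFrame(data={'Floats': floats}, index=words)

# Stripping the labels first lets the values be assigned by position
aligned = pd.DataFrame(data={'Floats': floats.to_numpy()}, index=words)

print(misaligned)
print(aligned)
```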

### Resetting index

We can always reset the index of a DataFrame, which moves its values back into a regular column and restores the default `RangeIndex`.

In [None]:
df_names

In [None]:
df_names.reset_index()

In [None]:
df_names.loc[['foo', 'bam']].reset_index()

In [None]:
# The column 'index' is probably pointless, but this follows from the rule
df.reset_index()
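
When the old labels really are pointless, `.reset_index()` accepts `drop=True` to discard them instead of keeping them as a column. A small sketch with a made-up DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'Integers': [10, 11, 12, 13]})
subset = frame.loc[[0, 3]]           # row labels are now 0 and 3

kept = subset.reset_index()              # old labels become an 'index' column
dropped = subset.reset_index(drop=True)  # old labels are thrown away

print(kept)
print(dropped)
```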

# Exercises

The exercises below can each be done with a provided Python object.  These objects have a few properties.  Simply echoing the object in a cell produces a "pretty" display that may emphasize some aspect of the data of interest.

Each object has an `obj.data` attribute that contains some sort of source data you should work with.  Each also has an `obj.result` attribute that contains some sort of transformation of the original data, which you are trying to match.  In some cases, an exercise may have additional special attributes or methods that will be discussed in the description.

In [None]:
from pprint import pprint
from src.pandas_exercises import *

Pandas DataFrames may be created from Python dictionaries of several different styles.  For each dictionary below, try to construct a DataFrame that replicates the original.  The *original* DataFrame is the familiar `patients` one.  Note that some of these round-trips are lossy.

Hint: you probably want to utilize the help system built into Jupyter to explore some details not shown in examples.  E.g. run `pd.DataFrame?` in a cell.

*Extra credit*: Think about how you might address the lossiness, where it is an issue.

In [None]:
ex3_1.ideal

In [None]:
# Row oriented
pprint(ex3_1.data)

In [None]:
# Transform the dictionary
...

---

In [None]:
# Record oriented
pprint(ex3_2.data)

In [None]:
# Transform the dictionary
...

---

In [None]:
# Column oriented
pprint(ex3_3.data)

In [None]:
# Transform the dictionary
...

---

Using the same `patients` DataFrame we have used for a number of examples, set its index to an upper case version of the patient name, but retain the mixed case version as an additional column.  Since many names in various languages use varying capitalization rules, the transformation to upper case is a lossy operation and we do not want to discard information.

*Note:* While you are free to discover a more Pandas-native way of solving this, the standard Python `str.upper()` method is enough to arrive at an answer.  However, do not solve the problem purely manually (e.g. by simply typing the four names in upper case; pretend the DataFrame had thousands of records rather than just four).

In [None]:
ex3_4

In [None]:
# Transform data into result
...

---

Create a DataFrame based on a 2-D NumPy array (here of random values).  We want to name the rows with labels 'Alice', 'Bob', 'Carlos', 'Dan', and the columns with labels 'key1', 'key2', 'key3', 'key4'.  These choices are reminiscent of names often used in cryptography, but these specific values have no special connection to that.

In [None]:
ex3_5

In [None]:
# Create a DataFrame based on data and matching result
...

Having solved the problem one way, see if you can play around with Pandas methods for constructing DataFrames to create an equal DataFrame using different APIs.

In [None]:
# Create DataFrame from data, matching result (2nd approach)
...

---

Based on a Series containing platform-native integer values, construct a DataFrame with columns named 'Int16', 'Float32', 'Complex', and 'String'.  The values inside each column should be "the same" as those in the Series, but their types should correspond to the column names.

*Extra credit:* Also set the index of the DataFrame you build to be the values in their original datatype.

In [None]:
ex3_6

In [None]:
print(ex3_6.result.dtypes)
print(ex3_6.result.index)

In [None]:
# Create a DataFrame based on Series data, with columns described
...