## Topics

We'll wrap up our discussion of Scientific Python with `pandas`.

### Pandas

Pandas builds on the structured data tools available in NumPy by giving us a data structure called a `DataFrame`: in essence, a multidimensional array with attached row and column labels that can hold heterogeneous types and missing data.

"As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs." - Python Data Science Handbook, Jake VanderPlas

In [None]:
import pandas as pd

### Pandas: Series Object
A Series is a one-dimensional array of indexed data. It wraps:

- A sequence of values (accessible via the `values` attribute).
- A sequence of indices (accessible via the `index` attribute).

In [None]:
data = pd.Series(range(10,20))
print(data)

In [None]:
print(data.values)
print(data.index)
print(data.dtype)

Values are accessible by index in square brackets. With the default integer index, each label coincides with the element's position.

In [None]:
print(data[0])
print(data[2:5])

We may consider a Pandas Series object as a generalized NumPy array. Whereas a NumPy array has an implicit integer index, a Pandas Series has an explicit index that may consist of values of any type.

In [None]:
data = pd.Series(range(10,20), index=['ten', 'eleven', '12', 13, 14, 15, 16, 17, 18, 19])
print(data['ten'])
print(data[19])
print(data.index)

An index is not required to be sequential.

In [None]:
data = pd.Series(range(10,20), index=[20, 19, 18, 15, 14, 'six', 'eight', 'seven', 9, 10])
print(data['six'])
print(data['eight'])
print(data.index)

We may also consider a Pandas `Series` a specialized dictionary. Whereas a Python `dict` maps a set of arbitrary keys to a set of arbitrary values, a Series maps a set of typed keys to a set of typed values.

"This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas `Series` makes it more efficient than a Python dictionary for certain operations."

In [None]:
star_trek_captains = {
    'Jean-luc Picard': 'STNG',
    'James T. Kirk': 'TOG',
    'Saru': 'Discovery',
    'Benjamin Sisko': 'DS9',
    'Kathryn Janeway': 'Voyager'
}
star_trek_captains_series = pd.Series(star_trek_captains)
print(star_trek_captains_series)
print(star_trek_captains_series['Kathryn Janeway'])
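
To make the "typed" point concrete, here is a minimal sketch. The `ranking` dictionary and the variable names are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical data: a plain dict mapping names to integer rankings.
ranking = {'Picard': 1, 'Sisko': 2, 'Janeway': 3}
ranking_series = pd.Series(ranking)

# Unlike a dict, a Series records a single dtype for all of its values.
print(ranking_series.dtype)

# Because the values share a type, arithmetic applies to the whole Series at once.
doubled = ranking_series * 2
print(doubled)

# The equivalent dict operation needs an explicit Python-level loop.
doubled_dict = {name: rank * 2 for name, rank in ranking.items()}
print(doubled_dict)
```

Both approaches give the same numbers. The difference is that the `Series` version runs as one vectorized operation.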

The Series supports array-style operations, like slicing:

In [None]:
star_trek_captains_series['Saru': 'Kathryn Janeway']

Pandas Series can be created from:

- Lists, NumPy arrays: index defaults to a sequence of integers.
- Dictionaries: index defaults to the keys of the dictionary, in insertion order (older versions of Pandas sorted the keys).
- Scalars: value repeated to fill the given index.

In [None]:
pd.Series(42, index=[1, 2, 3])
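
The three construction routes can be compared side by side. This is a small sketch with made-up values, and it assumes a reasonably recent version of Pandas, where dictionary keys keep their insertion order.

```python
import numpy as np
import pandas as pd

# From a list or NumPy array: the index defaults to 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])
from_array = pd.Series(np.array([1.5, 2.5]))

# From a dict: the keys become the index (insertion order in recent Pandas).
from_dict = pd.Series({'b': 2, 'a': 1})

# From a scalar: the value is repeated for every label in the given index.
from_scalar = pd.Series(42, index=['x', 'y', 'z'])

print(from_list.index)
print(from_array.index)
print(from_dict.index)
print(from_scalar)
```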

"If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names."

"Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by 'aligned' we mean that they share the same index."

### Pandas `DataFrame` Object

In [None]:
star_trek_captains = {
    'Jean-luc Picard': 'STNG',
    'James T. Kirk': 'TOG',
    'Saru': 'Discovery',
    'Benjamin Sisko': 'DS9',
    'Kathryn Janeway': 'Voyager'
}

star_trek_captain_ranking = {
    'Jean-luc Picard': 1,
    'James T. Kirk': 5,
    'Saru': 4,
    'Benjamin Sisko': 2,
    'Kathryn Janeway': 3
}

star_trek_captain_series = pd.Series(star_trek_captains)
star_trek_captain_ranking_series = pd.Series(star_trek_captain_ranking)

star_trek_df = pd.DataFrame({
    'ranking': star_trek_captain_ranking_series,
    'season': star_trek_captain_series
})

star_trek_df
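
The idea of "aligned" `Series` objects can be checked directly. In this sketch, which uses shortened, hypothetical labels, the two `Series` list their labels in different orders, yet each row of the resulting `DataFrame` still pairs the values that share a label.

```python
import pandas as pd

# Two hypothetical Series whose labels appear in different orders.
ranking = pd.Series({'Picard': 1, 'Sisko': 2, 'Janeway': 3})
show = pd.Series({'Janeway': 'Voyager', 'Picard': 'STNG', 'Sisko': 'DS9'})

# Values are matched by label, not by position, when the columns are combined.
aligned_df = pd.DataFrame({'ranking': ranking, 'show': show})
print(aligned_df)

# Each row holds the values that belong to the same label.
print(aligned_df.loc['Janeway'])
```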

A `DataFrame` has attributes:

- `index`: An `Index` object. The values are the row labels.
- `columns`: An `Index` object. The values are the column labels.

In [None]:
print(star_trek_df.index)
print(star_trek_df.columns)

Another way to frame our understanding of the `DataFrame` object is to consider it a specialized dictionary. Whereas a dictionary maps arbitrary keys to arbitrary values, a `DataFrame` maps a column name to a `Series` of column data.

In [None]:
print('===== DataFrame =====')
print(star_trek_df)           # DataFrame
print('===== Season =====')
print(star_trek_df.season)
print('===== Ranking =====')
print(star_trek_df['ranking'])

#### Data Indexing and Selection

Because `DataFrame.__getitem__` returns a column rather than a row, thinking of a `DataFrame` as a two-dimensional ndarray can be misleading. For selection, the specialized dictionary conceptualization is preferable.
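
A short comparison shows the difference. The array and column names below are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['a', 'b'])

# On a two-dimensional ndarray, a single index selects a row.
print(arr[0])

# On a DataFrame, a single key selects a column, as with a dictionary.
print(df['a'])

# Rows are reached through an indexer such as .iloc instead.
print(df.iloc[0])
```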

In [None]:
star_trek_captains = {
    'Jean-luc Picard': 'STNG',
    'James T. Kirk': 'TOG',
    'Saru': 'Discovery',
    'Benjamin Sisko': 'DS9',
    'Kathryn Janeway': 'Voyager'
}

star_trek_captain_series = pd.Series(star_trek_captains)

# Access an element by index like a dictionary
print(star_trek_captain_series['Saru'])

# Access an element by implicit integer position
# (.iloc is explicit; plain [2] on a labeled Series is deprecated in recent Pandas)
print(star_trek_captain_series.iloc[2])

#### Data Indexing and Selection: Series

In [None]:
# Extend series
print('===== Extend series =====')
star_trek_captain_series['John Archer'] = 'Enterprise'
print(star_trek_captain_series)
# Slicing by explicit index (the end label is included)
print('===== Slicing series =====')
print(star_trek_captain_series['Jean-luc Picard':'Saru'])
# Slicing by implicit integer position (the end position is excluded)
print(star_trek_captain_series[0:3])
# Masking 
print('===== Masking series =====')
print(star_trek_captain_series[(star_trek_captain_series != 'TOG')])
# Fancy Indexing!
print('===== Fancy indexing in series =====')
star_trek_captain_series[['Jean-luc Picard', 'Benjamin Sisko']]

#### Data Indexers: Series

In [None]:
print('===== Always reference Explicit Index =====')
print(star_trek_captain_series['Saru'])
print(star_trek_captain_series.loc['Saru'])
print(star_trek_captain_series.loc['Saru': 'Kathryn Janeway'])
print('===== Always reference Implicit Index =====')
print(star_trek_captain_series.iloc[1])
print(star_trek_captain_series.iloc[1:3])
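
The indexers matter most when the explicit index is itself made of integers, because labels and positions can then disagree. The following sketch uses a made-up `letters` series to show the two interpretations.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions 0, 1, 2.
letters = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = letters.loc[1]           # the element labeled 1
by_position = letters.iloc[1]       # the element at position 1 (the second one)

label_slice = letters.loc[1:3]      # labels 1 through 3, end label included
position_slice = letters.iloc[1:3]  # positions 1 and 2, end position excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Using `loc` and `iloc` consistently removes the need to remember which interpretation plain square brackets will pick.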

#### Data Selection: `DataFrame`

In [None]:
star_trek_df = pd.DataFrame({
    'season': ['STNG', 'TOG', 'Discovery', 'DS9', 'Voyager'],
    'ranking': [1, 5, 4, 2, 3],
    'name': ['Jean-luc Picard', 'James T. Kirk', 'Saru', 'Benjamin Sisko', 'Kathryn Janeway']
}, index=['Jean-luc Picard', 'James T. Kirk', 'Saru', 'Benjamin Sisko', 'Kathryn Janeway'])

star_trek_df

In [None]:
print(star_trek_df.columns)
print(star_trek_df['season'])   # dict style index
print(star_trek_df.season)      # attribute-style access; works only when the column name is a valid identifier that does not clash with a DataFrame attribute
print(star_trek_df['season'] is star_trek_df.season)

In [None]:
# Add a new column using dict style assignment
star_trek_df['Full Position'] = "Captain " + star_trek_df["name"] + " is ranked " + star_trek_df.ranking.astype(str) + "."
print(star_trek_df['Full Position'])

In [None]:
print(star_trek_df)
# Transpose
star_trek_df.T

In [None]:
# Values as 2D Array
star_trek_df.values

#### Data Indexing: `DataFrame`s

In [None]:
star_trek_df.loc['Saru']

In [None]:
star_trek_df.loc['Saru', 'Full Position']

In [None]:
star_trek_df.iloc[2,3]

In [None]:
star_trek_df.loc['Saru':'Kathryn Janeway', 'Full Position']

In [None]:
star_trek_df.loc['Saru':'Kathryn Janeway']

In [None]:
star_trek_df.iloc[2:5, 3]

In [None]:
star_trek_df.iloc[2:5, 1:3]

In [None]:
# Masking
star_trek_df[star_trek_df['ranking'] < 4]

In [None]:
# Combining masks with & (each condition needs its own parentheses)
star_trek_df[(star_trek_df['ranking'] < 4) & (star_trek_df.name.str.contains('-'))]

#### `DataFrame` Operations

Pandas `DataFrame` objects inherit efficient element-wise operations from NumPy. Additionally, `DataFrame` objects "include a couple of useful twists":

For unary operations, ... ufuncs will *preserve index and column labels* in the output.

In [None]:
import numpy as np 
df = pd.DataFrame(np.random.randint(0, 10, (3,4)), columns=['A','B','C','D'])
df

In [None]:
np.sin(df * np.pi / 4)
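
Since the cell above uses random data, label preservation is easier to verify on a small fixed example. The `scores` frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string row labels, so the labels are easy to spot.
scores = pd.DataFrame({'A': [0, 1], 'B': [2, 3]}, index=['first', 'second'])

# Applying a NumPy ufunc returns a DataFrame, not a bare ndarray.
result = np.exp(scores)
print(result)

# The row and column labels are carried over unchanged.
print(result.index.equals(scores.index))
print(result.columns.equals(scores.columns))
```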

#### Creating a DataFrame from File

```
df_from_csv = pd.read_csv('/path/to/file.csv')
df_from_web_csv = pd.read_csv('https://data.cityofchicago.org/api/views/x8fc-8rcq/rows.csv?accessType=DOWNLOAD')
df_from_json = pd.read_json('/path/to/file.json')
df_from_web_json = pd.read_json('https://website.com/resource')
```

In [None]:
chicago_libraries_df = pd.read_csv('https://data.cityofchicago.org/api/views/x8fc-8rcq/rows.csv?accessType=DOWNLOAD')
chicago_libraries_df
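
To try `pd.read_csv` without a file on disk or a network connection, the text can be wrapped in an in-memory buffer. This is a minimal sketch; the CSV content is invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content held in a string rather than in a file.
csv_text = """name,ranking
Picard,1
Sisko,2
"""

# read_csv accepts any file-like object, so io.StringIO stands in for a file.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)

# index_col promotes a column to the row index while reading.
by_name_df = pd.read_csv(io.StringIO(csv_text), index_col='name')
print(by_name_df.loc['Sisko', 'ranking'])
```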