## Topics

We'll wrap up our discussion of Scientific Python with `pandas`.

### Pandas

Pandas builds on the structured data tools available in NumPy by giving us a data structure called a `DataFrame`: essentially a multidimensional array with attached row and column labels, often holding heterogeneous types and/or missing data.

"As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs." - Python Data Science Handbook, Jake VanderPlas

In [45]:
import pandas as pd

### Pandas: Series Object
A `Series` is a one-dimensional array of indexed data. It wraps:

- A sequence of values (accessible via the `values` attribute).
- A sequence of indices (accessible via the `index` attribute).

In [12]:
data = pd.Series(range(10,20))
print(data)

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64


In [15]:
print(data.values)
print(data.index)
print(data.dtype)

[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)
int64


Values are accessible by their index label in square brackets. With the default integer index, the label coincides with the positional offset.

In [16]:
print(data[0])
print(data[2:5])

10
2    12
3    13
4    14
dtype: int64


We may consider a Pandas Series object as a generalized NumPy array. Whereas a NumPy array has an implicit integer index, a Pandas Series has an explicit index that may consist of values of any type.

In [19]:
data = pd.Series(range(10,20), index=['ten', 'eleven', '12', 13, 14, 15, 16, 17, 18, 19])
print(data['ten'])
print(data[19])
print(data.index)

10
19
Index(['ten', 'eleven', '12', 13, 14, 15, 16, 17, 18, 19], dtype='object')


An index need not be sequential or sorted.

In [20]:
data = pd.Series(range(10,20), index=[20, 19, 18, 15, 14, 'six', 'eight', 'seven', 9, 10])
print(data['six'])
print(data['eight'])
print(data.index)

15
16
Index([20, 19, 18, 15, 14, 'six', 'eight', 'seven', 9, 10], dtype='object')


We may also consider a Pandas `Series` a specialized dictionary. Whereas a Python `dict` maps a set of arbitrary keys to a set of arbitrary values, a Series maps a set of typed keys to a set of typed values.

"This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type of information of a Pandas `Series` makes it more efficient than a Python dictionary for certain operations."

In [22]:
star_trek_captains = {
    'Jean-luc Picard': 'STNG',
    'James T. Kirk': 'TOG',
    'Saru': 'Discovery',
    'Benjamin Sisko': 'DS9',
    'Kathryn Janeway': 'Voyager'
}
star_trek_captains_series = pd.Series(star_trek_captains)
print(star_trek_captains_series)
print(star_trek_captains_series['Kathryn Janeway'])

Jean-luc Picard         STNG
James T. Kirk            TOG
Saru               Discovery
Benjamin Sisko           DS9
Kathryn Janeway      Voyager
dtype: object
Voyager
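
To make the typing point above concrete, here is a minimal sketch. The `populations` data is made up for illustration and is not part of the original notebook. It shows that a `Series` carries a single dtype for its values, which allows arithmetic on the whole collection at once, whereas a plain `dict` has to be looped over.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
populations = {'Chicago': 2.7, 'Houston': 2.3, 'Phoenix': 1.6}
populations_series = pd.Series(populations)

# The values share one dtype, so vectorized arithmetic works on the whole Series.
print(populations_series.dtype)         # float64
doubled = populations_series * 2
print(doubled['Chicago'])               # 5.4

# A plain dict has no such operation; we loop over the items ourselves.
doubled_dict = {city: value * 2 for city, value in populations.items()}
print(doubled_dict['Chicago'])          # 5.4
```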


The Series supports array-style operations, like slicing:

In [23]:
star_trek_captains_series['Saru': 'Kathryn Janeway']

Saru               Discovery
Benjamin Sisko           DS9
Kathryn Janeway      Voyager
dtype: object

Pandas Series can be created from:

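- Lists, NumPy arrays: index defaults to a sequence of integers.
- Dictionaries: index defaults to the keys of the dictionary. Recent pandas versions keep the keys in insertion order (as the output above shows); older versions sorted them.
- Scalars: value repeated to fill the given index.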

In [24]:
pd.Series(42, index=[1, 2, 3])

1    42
2    42
3    42
dtype: int64
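
The list and dictionary cases from the list above can be sketched the same way. The variable names below are illustrative only. The dictionary example assumes a reasonably recent pandas, where keys keep their insertion order.

```python
import numpy as np
import pandas as pd

# From a list or NumPy array: the index defaults to 0, 1, 2, ...
from_list = pd.Series([2, 4, 6])
from_array = pd.Series(np.array([1.5, 2.5]))
print(list(from_list.index))    # [0, 1, 2]

# From a dictionary: keys become the index, in insertion order.
from_dict = pd.Series({'b': 2, 'a': 1, 'c': 3})
print(list(from_dict.index))    # ['b', 'a', 'c']

# An explicit index selects (and orders) the dictionary keys to keep.
subset = pd.Series({'b': 2, 'a': 1, 'c': 3}, index=['c', 'a'])
print(subset.tolist())          # [3, 1]
```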

"If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names."

"Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by 'aligned' we mean that they share the same index."

### Pandas `DataFrame` Object

In [25]:
star_trek_captains = {
    'Jean-luc Picard': 'STNG',
    'James T. Kirk': 'TOG',
    'Saru': 'Discovery',
    'Benjamin Sisko': 'DS9',
    'Kathryn Janeway': 'Voyager'
}

star_trek_captain_ranking = {
    'Jean-luc Picard': 1,
    'James T. Kirk': 5,
    'Saru': 4,
    'Benjamin Sisko': 2,
    'Kathryn Janeway': 3
}

star_trek_captain_series = pd.Series(star_trek_captains)
star_trek_captain_ranking_series = pd.Series(star_trek_captain_ranking)

star_trek_df = pd.DataFrame({
    'ranking': star_trek_captain_ranking_series,
    'season': star_trek_captain_series
})

star_trek_df

|  | ranking | season |
| --- | --- | --- |
| Jean-luc Picard | 1 | STNG |
| James T. Kirk | 5 | TOG |
| Saru | 4 | Discovery |
| Benjamin Sisko | 2 | DS9 |
| Kathryn Janeway | 3 | Voyager |
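
In the example above both `Series` happen to share exactly the same labels. As a rough sketch of what "aligned" means when they do not, the following uses two shortened, hypothetical `Series` whose indices only partly overlap. Rows are matched by label rather than by position, and missing entries become `NaN`.

```python
import pandas as pd

# Hypothetical Series whose indices only partly overlap.
ranking = pd.Series({'Picard': 1, 'Sisko': 2, 'Janeway': 3})
season = pd.Series({'Janeway': 'Voyager', 'Picard': 'STNG', 'Kirk': 'TOG'})

aligned_df = pd.DataFrame({'ranking': ranking, 'season': season})
print(aligned_df)

# Rows are matched by label, not by position; gaps are filled with NaN.
print(aligned_df.loc['Picard', 'season'])      # STNG
print(aligned_df['ranking'].isna().sum())      # 1 (Kirk has no ranking)
```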


A `DataFrame` has attributes:

- `index`: An `Index` object. The values are the row labels.
- `columns`: An `Index` object. The values are the column labels.

In [26]:
print(star_trek_df.index)
print(star_trek_df.columns)

Index(['Jean-luc Picard', 'James T. Kirk', 'Saru', 'Benjamin Sisko',
       'Kathryn Janeway'],
      dtype='object')
Index(['ranking', 'season'], dtype='object')


Another way to frame our understanding of the `DataFrame` object is to consider it a specialized dictionary. Whereas a dictionary maps arbitrary keys to arbitrary values, a `DataFrame` maps a column name to a `Series` of column data.

In [29]:
print('===== DataFrame =====')
print(star_trek_df)           # DataFrame
print('===== Season =====')
print(star_trek_df.season)
print('===== Ranking =====')
print(star_trek_df['ranking'])

===== DataFrame =====
                 ranking     season
Jean-luc Picard        1       STNG
James T. Kirk          5        TOG
Saru                   4  Discovery
Benjamin Sisko         2        DS9
Kathryn Janeway        3    Voyager
===== Season =====
Jean-luc Picard         STNG
James T. Kirk            TOG
Saru               Discovery
Benjamin Sisko           DS9
Kathryn Janeway      Voyager
Name: season, dtype: object
===== Ranking =====
Jean-luc Picard    1
James T. Kirk      5
Saru               4
Benjamin Sisko     2
Kathryn Janeway    3
Name: ranking, dtype: int64


#### Data Indexing and Selection

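Because the `__getitem__` behavior of a `DataFrame` (square-bracket access with a single label) returns a column, our conceptualization of the `DataFrame` as a two-dimensional ndarray may be misleading: indexing a two-dimensional ndarray with a single key returns a row. For this reason, the specialized dictionary conceptualization is preferable when reasoning about square-bracket access.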
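
A minimal sketch of this behavior, using a small hypothetical frame rather than the notebook's own data:

```python
import pandas as pd

# Small hypothetical frame for illustration.
df = pd.DataFrame(
    {'ranking': [1, 5], 'season': ['STNG', 'TOG']},
    index=['Jean-luc Picard', 'James T. Kirk'],
)

# Square brackets with a single label look up a column, like a dict key.
print(df['ranking'].tolist())          # [1, 5]
print('ranking' in df)                 # True: membership tests check columns
print('James T. Kirk' in df)           # False: row labels are not keys

# A row label in square brackets raises KeyError; use .loc for rows instead.
try:
    df['James T. Kirk']
except KeyError:
    print('no such column')
print(df.loc['James T. Kirk', 'season'])   # TOG
```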

In [38]:
star_trek_captains = {
    'Jean-luc Picard': 'STNG',
    'James T. Kirk': 'TOG',
    'Saru': 'Discovery',
    'Benjamin Sisko': 'DS9',
    'Kathryn Janeway': 'Voyager'
}

star_trek_captain_series = pd.Series(star_trek_captains)

# Access an element by index like a dictionary
print(star_trek_captain_series['Saru'])

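# Access an element by implicit integer position
# (plain integer keys on a labeled Series are deprecated in recent pandas; .iloc is explicit)
print(star_trek_captain_series.iloc[2])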

Discovery
Discovery


#### Data Indexing and Selection: Series

In [52]:
# Extend series
print('===== Extend series =====')
star_trek_captain_series['John Archer'] = 'Enterprise'
print(star_trek_captain_series)
# Slicing by explicit index (the end label is included)
print('===== Slicing series =====')
print(star_trek_captain_series['Jean-luc Picard':'Saru'])
# Slicing by implicit integer position (the end position is excluded)
print(star_trek_captain_series[0:3])
# Masking
print('===== Masking series =====')
print(star_trek_captain_series[(star_trek_captain_series != 'TOG')])
# Fancy indexing
print('===== Fancy indexing in series =====')
star_trek_captain_series[['Jean-luc Picard', 'Benjamin Sisko']]

===== Extend series =====
Jean-luc Picard          STNG
James T. Kirk             TOG
Saru                Discovery
Benjamin Sisko            DS9
Kathryn Janeway       Voyager
John Archer        Enterprise
dtype: object
===== Slicing series =====
Jean-luc Picard         STNG
James T. Kirk            TOG
Saru               Discovery
dtype: object
Jean-luc Picard         STNG
James T. Kirk            TOG
Saru               Discovery
dtype: object
===== Masking series =====
Jean-luc Picard          STNG
Saru                Discovery
Benjamin Sisko            DS9
Kathryn Janeway       Voyager
John Archer        Enterprise
dtype: object
===== Fancy indexing in series =====


Jean-luc Picard    STNG
Benjamin Sisko      DS9
dtype: object

#### Data Indexers: Series

In [56]:
print('===== Always reference Explicit Index =====')
print(star_trek_captain_series['Saru'])
print(star_trek_captain_series.loc['Saru'])
print(star_trek_captain_series.loc['Saru': 'Kathryn Janeway'])
print('===== Always reference Implicit Index =====')
print(star_trek_captain_series.iloc[1])
print(star_trek_captain_series.iloc[1:3])

===== Always reference Explicit Index =====
Discovery
Discovery
Saru               Discovery
Benjamin Sisko           DS9
Kathryn Janeway      Voyager
dtype: object
===== Always reference Implicit Index =====
TOG
James T. Kirk          TOG
Saru             Discovery
dtype: object
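
The distinction matters most when the explicit index is itself made of integers. The following sketch uses a hypothetical `letters` Series (not from the notebook) to show how `.loc` and `.iloc` diverge in that case, and how their slices treat the end point differently.

```python
import pandas as pd

# Hypothetical Series with an explicit integer index that does not start at 0.
letters = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# .loc always uses the explicit labels; .iloc always uses positions.
print(letters.loc[1])      # a  (label 1)
print(letters.iloc[1])     # b  (position 1)

# .loc slices include the end label; .iloc slices exclude the end position.
print(letters.loc[1:3].tolist())     # ['a', 'b']
print(letters.iloc[1:3].tolist())    # ['b', 'c']
```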


#### Data Selection: `DataFrame`

In [4]:
star_trek_df = pd.DataFrame({
    'season': 
    ['STNG',
        'TOG',
        'Discovery',
        'DS9',
        'Voyager'
    ],
    'ranking': [1,5,4,2,3],
    'name': ['Jean-luc Picard','James T. Kirk', 'Saru','Benjamin Sisko','Kathryn Janeway']
}, index=['Jean-luc Picard','James T. Kirk', 'Saru','Benjamin Sisko','Kathryn Janeway'])

star_trek_df

|  | season | ranking | name |
| --- | --- | --- | --- |
| Jean-luc Picard | STNG | 1 | Jean-luc Picard |
| James T. Kirk | TOG | 5 | James T. Kirk |
| Saru | Discovery | 4 | Saru |
| Benjamin Sisko | DS9 | 2 | Benjamin Sisko |
| Kathryn Janeway | Voyager | 3 | Kathryn Janeway |


In [5]:
print(star_trek_df.columns)
print(star_trek_df['season'])   # dict style index
print(star_trek_df.season)      # attribute style access with column names that are strings
print(star_trek_df['season'] is star_trek_df.season)

Index(['season', 'ranking', 'name'], dtype='object')
Jean-luc Picard         STNG
James T. Kirk            TOG
Saru               Discovery
Benjamin Sisko           DS9
Kathryn Janeway      Voyager
Name: season, dtype: object
Jean-luc Picard         STNG
James T. Kirk            TOG
Saru               Discovery
Benjamin Sisko           DS9
Kathryn Janeway      Voyager
Name: season, dtype: object
True
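
Attribute-style access is convenient but has limits. As a sketch with a hypothetical frame (the `count` and `full position` columns are invented for illustration), a column whose name collides with a `DataFrame` method, or that is not a valid Python identifier, is only reachable with square brackets.

```python
import pandas as pd

# Hypothetical frame; 'count' and 'full position' are illustrative column names.
df = pd.DataFrame({
    'season': ['STNG', 'TOG'],
    'count': [7, 3],
    'full position': ['x', 'y'],
})

# Attribute access works for simple column names ...
print(df.season.tolist())          # ['STNG', 'TOG']

# ... but 'count' collides with the DataFrame.count method,
print(callable(df.count))          # True: this is the method, not the column
print(df['count'].tolist())        # [7, 3]

# and names containing spaces are not valid attribute names at all.
print(df['full position'].tolist())    # ['x', 'y']
```

For that reason dict-style access is the safer default, especially when assigning columns.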


In [6]:
# Add a new column using dict style assignment
star_trek_df['Full Position'] = "Captain " + star_trek_df["name"] + " is ranked " + star_trek_df.ranking.astype(str) + "."
print(star_trek_df['Full Position'])

Jean-luc Picard    Captain Jean-luc Picard is ranked 1.
James T. Kirk        Captain James T. Kirk is ranked 5.
Saru                          Captain Saru is ranked 4.
Benjamin Sisko      Captain Benjamin Sisko is ranked 2.
Kathryn Janeway    Captain Kathryn Janeway is ranked 3.
Name: Full Position, dtype: object


In [7]:
print(star_trek_df)
# Transpose
star_trek_df.T

                    season  ranking             name  \
Jean-luc Picard       STNG        1  Jean-luc Picard   
James T. Kirk          TOG        5    James T. Kirk   
Saru             Discovery        4             Saru   
Benjamin Sisko         DS9        2   Benjamin Sisko   
Kathryn Janeway    Voyager        3  Kathryn Janeway   

                                        Full Position  
Jean-luc Picard  Captain Jean-luc Picard is ranked 1.  
James T. Kirk      Captain James T. Kirk is ranked 5.  
Saru                        Captain Saru is ranked 4.  
Benjamin Sisko    Captain Benjamin Sisko is ranked 2.  
Kathryn Janeway  Captain Kathryn Janeway is ranked 3.  


|  | Jean-luc Picard | James T. Kirk | Saru | Benjamin Sisko | Kathryn Janeway |
| --- | --- | --- | --- | --- | --- |
| season | STNG | TOG | Discovery | DS9 | Voyager |
| ranking | 1 | 5 | 4 | 2 | 3 |
| name | Jean-luc Picard | James T. Kirk | Saru | Benjamin Sisko | Kathryn Janeway |
| Full Position | Captain Jean-luc Picard is ranked 1. | Captain James T. Kirk is ranked 5. | Captain Saru is ranked 4. | Captain Benjamin Sisko is ranked 2. | Captain Kathryn Janeway is ranked 3. |


In [8]:
# Values as 2D Array
star_trek_df.values

array([['STNG', 1, 'Jean-luc Picard',
        'Captain Jean-luc Picard is ranked 1.'],
       ['TOG', 5, 'James T. Kirk', 'Captain James T. Kirk is ranked 5.'],
       ['Discovery', 4, 'Saru', 'Captain Saru is ranked 4.'],
       ['DS9', 2, 'Benjamin Sisko',
        'Captain Benjamin Sisko is ranked 2.'],
       ['Voyager', 3, 'Kathryn Janeway',
        'Captain Kathryn Janeway is ranked 3.']], dtype=object)

#### Data Indexing: `DataFrame`s

In [9]:
star_trek_df.loc['Saru']

season                           Discovery
ranking                                  4
name                                  Saru
Full Position    Captain Saru is ranked 4.
Name: Saru, dtype: object

In [10]:
star_trek_df.loc['Saru', 'Full Position']

'Captain Saru is ranked 4.'

In [11]:
star_trek_df.iloc[2,3]

'Captain Saru is ranked 4.'

In [20]:
star_trek_df.loc['Saru':'Kathryn Janeway', 'Full Position']

Saru                          Captain Saru is ranked 4.
Benjamin Sisko      Captain Benjamin Sisko is ranked 2.
Kathryn Janeway    Captain Kathryn Janeway is ranked 3.
Name: Full Position, dtype: object

In [21]:
star_trek_df.loc['Saru':'Kathryn Janeway']

|  | season | ranking | name | Full Position |
| --- | --- | --- | --- | --- |
| Saru | Discovery | 4 | Saru | Captain Saru is ranked 4. |
| Benjamin Sisko | DS9 | 2 | Benjamin Sisko | Captain Benjamin Sisko is ranked 2. |
| Kathryn Janeway | Voyager | 3 | Kathryn Janeway | Captain Kathryn Janeway is ranked 3. |


In [23]:
star_trek_df.iloc[2:5, 3]

Saru                          Captain Saru is ranked 4.
Benjamin Sisko      Captain Benjamin Sisko is ranked 2.
Kathryn Janeway    Captain Kathryn Janeway is ranked 3.
Name: Full Position, dtype: object

In [24]:
star_trek_df.iloc[2:5, 1:3]

|  | ranking | name |
| --- | --- | --- |
| Saru | 4 | Saru |
| Benjamin Sisko | 2 | Benjamin Sisko |
| Kathryn Janeway | 3 | Kathryn Janeway |


In [25]:
# Masking
star_trek_df[star_trek_df['ranking'] < 4]

|  | season | ranking | name | Full Position |
| --- | --- | --- | --- | --- |
| Jean-luc Picard | STNG | 1 | Jean-luc Picard | Captain Jean-luc Picard is ranked 1. |
| Benjamin Sisko | DS9 | 2 | Benjamin Sisko | Captain Benjamin Sisko is ranked 2. |
| Kathryn Janeway | Voyager | 3 | Kathryn Janeway | Captain Kathryn Janeway is ranked 3. |


In [33]:
# Masking with two conditions combined by & (each condition in parentheses)
star_trek_df[(star_trek_df['ranking'] < 4) & (star_trek_df.name.str.contains('-'))]

|  | season | ranking | name | Full Position |
| --- | --- | --- | --- | --- |
| Jean-luc Picard | STNG | 1 | Jean-luc Picard | Captain Jean-luc Picard is ranked 1. |
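
Boolean masks can be built up and combined before they are used. The sketch below rebuilds a reduced, two-column version of the frame so that it runs on its own. The names `top_three` and `has_hyphen` are illustrative. Note that Python's `and`/`or` do not work element-wise on a `Series`; the operators `&`, `|`, and `~` do, and comparisons should be parenthesized because these operators bind more tightly than `<` or `==`.

```python
import pandas as pd

# Reduced copy of the captains data so the example is self-contained.
df = pd.DataFrame({
    'ranking': [1, 5, 4, 2, 3],
    'name': ['Jean-luc Picard', 'James T. Kirk', 'Saru',
             'Benjamin Sisko', 'Kathryn Janeway'],
})

top_three = df['ranking'] < 4
has_hyphen = df['name'].str.contains('-')

# & is element-wise "and", | is "or", ~ is "not".
print(df[top_three & ~has_hyphen]['name'].tolist())
# ['Benjamin Sisko', 'Kathryn Janeway']
print(df[has_hyphen | (df['ranking'] == 5)]['name'].tolist())
# ['Jean-luc Picard', 'James T. Kirk']

# .loc accepts a mask for rows and a label for columns in one step.
print(df.loc[top_three, 'ranking'].tolist())    # [1, 2, 3]
```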


#### `DataFrame` Operations

Pandas `DataFrame` objects inherit efficient element-wise operations from NumPy. Additionally, `DataFrame` objects "include a couple of useful twists":

For unary operations, ... ufuncs will *preserve index and column labels* in the output.

In [36]:
import numpy as np 
df = pd.DataFrame(np.random.randint(0, 10, (3,4)), columns=['A','B','C','D'])
df

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 3 | 9 | 3 | 2 |
| 1 | 4 | 2 | 2 | 7 |
| 2 | 7 | 6 | 8 | 6 |


In [37]:
np.sin(df * np.pi / 4)

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 0.7071068 | 0.707107 | 0.7071068 | 1.0 |
| 1 | 1.224647e-16 | 1.0 | 1.0 | -0.707107 |
| 2 | -0.7071068 | -1.0 | -2.449294e-16 | -1.0 |
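
Because the example above uses random data, the following sketch repeats the idea with fixed, hypothetical values so the results are reproducible. It also illustrates a second behavior that pandas provides for binary operations: operands are aligned on their labels before the computation, and labels without a partner yield `NaN`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3]], index=['x', 'y'], columns=['A', 'B'])

# A unary ufunc returns a DataFrame with the same row and column labels.
squared = np.square(df)
print(list(squared.index), list(squared.columns))   # ['x', 'y'] ['A', 'B']
print(squared.loc['y', 'B'])                        # 9

# For binary operations, pandas aligns on labels before computing.
left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([10, 20], index=['b', 'c'])
total = left + right
print(total['b'])               # 12.0
print(total.isna().sum())       # 2 (labels 'a' and 'c' have no partner)
```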


#### Creating a DataFrame from File

```
df_from_csv = pd.read_csv('/path/to/file.csv')    # local CSV file
df_from_web_csv = pd.read_csv('https://data.cityofchicago.org/api/views/x8fc-8rcq/rows.csv?accessType=DOWNLOAD')    # CSV served over HTTP
df_from_json = pd.read_json('/path/to/file.json')    # local JSON file
df_from_web_json = pd.read_json('https://website.com/resource')    # JSON served over HTTP
```
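
The paths and URLs above are placeholders. For a version that runs without any file or network access, here is a small sketch that feeds `read_csv` an in-memory string instead. The CSV text is a made-up excerpt with hypothetical columns, and `io.StringIO` simply stands in for an open file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk or a URL.
csv_text = """NAME,CITY,ZIP
Albany Park,Chicago,60625
Avalon,Chicago,60617
"""

# read_csv accepts a path, a URL, or any file-like object.
libraries_df = pd.read_csv(io.StringIO(csv_text))
print(libraries_df.shape)               # (2, 3)
print(libraries_df['NAME'].tolist())    # ['Albany Park', 'Avalon']

# Column types are inferred: ZIP is parsed as an integer unless told otherwise.
zip_as_text = pd.read_csv(io.StringIO(csv_text), dtype={'ZIP': str})
print(zip_as_text.loc[0, 'ZIP'])        # 60625
```

Reading identifiers such as ZIP codes as strings is often a sensible choice, since it preserves any leading zeros.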

In [41]:
chicago_libraries_df = pd.read_csv('https://data.cityofchicago.org/api/views/x8fc-8rcq/rows.csv?accessType=DOWNLOAD')
chicago_libraries_df

|  | NAME | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Vodak-East Side | Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... | 3710 E. 106th St. | Chicago | IL | 60617 | (312) 747-5281 | https://www.chipublib.org/locations/71/ | (41.70302747819179, -87.53350318470449) |
| 1 | Albany Park | Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.975456, -87.71409) |
| 2 | Avalon | Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... | 8148 S. Stony Island Ave. | Chicago | IL | 60617 | (312) 747-5234 | https://www.chipublib.org/locations/8/ | (41.74671722160199, -87.58539658719002) |
| 3 | Brainerd | Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ... | 1350 W. 89th St. | Chicago | IL | 60620 | (312) 747-6291 | https://www.chipublib.org/locations/13/ | (41.732316757097045, -87.65723451581835) |
| 4 | Popular Library at Water Works | Mon. - Thu., 9-7; Fri. & Sat., 9-5; Sun., 1-5 | 163 E. Pearson St. | Chicago | IL | 60611 | (312) 742-8811 | https://www.chipublib.org/locations/73/ | (41.89758435399726, -87.62314877110073) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | Woodson Regional Library | Mon. - Thu., 9-9; Fri. & Sat., 9-5; Sun., 1-5 | 9525 S. Halsted St. | Chicago | IL | 60628 | (312) 747-6900 | https://www.chipublib.org/locations/81/ | (41.72093481286736, -87.64309395216542) |
| 77 | Mayfair | Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... | 4400 W. Lawrence Ave. | Chicago | IL | 60630 | (312) 744-1254 | https://www.chipublib.org/locations/49/ | (41.968354456496165, -87.7381618989123) |
| 78 | Chicago Bee | Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... | 3647 S. State St. | Chicago | IL | 60609 | (312) 747-6872 | https://www.chipublib.org/locations/18/ | (41.8284268698988, -87.62640221644719) |
| 79 | Uptown | Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ... | 929 W. Buena Ave. | Chicago | IL | 60613 | (312) 744-8400 | https://www.chipublib.org/locations/70/ | (41.958445207514295, -87.65351012268938) |
