(c) 2016 - present, Enplus Advisors, Inc.

# Programming with Data 
## Part I: Building Blocks of Tabular Data

Goals:

* pandas has a huge API
* Our goal is to distill some wisdom about `pandas`

# Pandas

Expansive library with a huge API.

Distill some `pandas` wisdom.

I try to present a conceptual approach to "Programming with Data",
starting with primitive data types and working up from there.

I like to think about Series in two primary ways (on next slide):


# What is a `Series`?

Series is the building block of `pandas`.

* Ordered key-value pairs with homogeneous data type

* A data array and a label array
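The two views above can be sketched side by side. This is a minimal illustration, not part of the original slides; the name `prices` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical example Series with explicit labels
prices = pd.Series([6, 8, 7, 5], index=["b", "d", "c", "a"])

# View 1: ordered key-value pairs, much like an ordered dict
pairs = [(label, int(value)) for label, value in prices.items()]
print(pairs)

# View 2: two parallel arrays of equal length (labels and data)
labels = list(prices.index)
data = prices.to_numpy()
print(labels, data)
```

Both views describe the same object: position `i` in the label array names position `i` in the data array.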

# A simple `Series`

* These are the standard imports we'll assume.
* Let's create a Series.

In [1]:
import numpy as np
import pandas as pd

s = pd.Series([6, 8, 7, 5])
s

0    6
1    8
2    7
3    5
dtype: int64

`dtype` is the data type shared by every element of the `Series`. It corresponds to a `numpy` dtype.

In [2]:
s.astype(np.float64)

0    6.0
1    8.0
2    7.0
3    5.0
dtype: float64

# Series

A series is:
A mapping from an `index` --> `values`

Note: Index labels are not required to be unique, but some operations are
not supported when there are duplicates, e.g., unstacking.

This is important because the index determines alignment of pandas
objects, as we will see shortly.
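To make the note about duplicate labels concrete, here is a small sketch. The names `dup` and `dup_multi` are hypothetical. The unstacking case assumes a two-level `MultiIndex`, which the slides have not introduced yet.

```python
import pandas as pd

# Hypothetical Series whose index repeats the label "a"
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
print(dup.index.is_unique)

# A repeated label selects every matching row; a unique label gives a scalar
repeated = dup["a"]
single = dup["b"]
print(repeated)
print(single)

# Unstacking needs unique index entries, so duplicates make it fail
idx = pd.MultiIndex.from_tuples([("x", "p"), ("x", "p"), ("y", "q")])
dup_multi = pd.Series([1, 2, 3], index=idx)
try:
    dup_multi.unstack()
    unstack_error = None
except ValueError as exc:
    unstack_error = str(exc)
print(unstack_error)
```

Checking `index.is_unique` before reshaping is a cheap way to catch this early.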

# Series - Implicit Index

In [3]:
s = pd.Series([6, 8, 7, 5])

In [4]:
s.index

RangeIndex(start=0, stop=4, step=1)

If you don't provide explicit labels, you get a `RangeIndex` of increasing integers starting at 0.

In [5]:
s.values

array([6, 8, 7, 5])

Notice that `s.values` is a `numpy` array

# Series - Explicit Index

In [6]:
s1 = pd.Series([6, 8, 7, 5], index=['b', 'd', 'c', 'a'])
s1

b    6
d    8
c    7
a    5
dtype: int64

In [7]:
s1.index

Index(['b', 'd', 'c', 'a'], dtype='object')

Now we have a new kind of `Index`, but the values stay the same.

In [8]:
s1.values

array([6, 8, 7, 5])

# Series

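You can also create a `Series` from a `dict`; the keys become the index.

`dict`s preserve insertion order (a CPython implementation detail in 3.6,
guaranteed by the language since 3.7), so the index follows the order of
the keys and you can no longer rely on `pandas` automatically sorting it.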

In [9]:
s2 = pd.Series({'b': 6, 'd': 8, 'c': 7, 'a': 5})
s2

b    6
d    8
c    7
a    5
dtype: int64
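If you do want a sorted index, you can ask for it explicitly. A minimal sketch, with hypothetical names `by_insertion` and `by_label`:

```python
import pandas as pd

# The index follows the dict's insertion order
by_insertion = pd.Series({"b": 6, "d": 8, "c": 7, "a": 5})
print(list(by_insertion.index))

# sort_index() returns a new Series ordered by label
by_label = by_insertion.sort_index()
print(by_label)
```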

# Series - Selection

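Selection by integer position, labels, and slices.

In [10]:
# Use .iloc for positions; plain s2[0] on a labeled index is deprecated in recent pandas releases
s2.iloc[0]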

6

In [11]:
s2['a']

5

In [12]:
s2[['a', 'c']]

a    5
c    7
dtype: int64

# Series - Selection

In [13]:
s2[0:2]

b    6
d    8
dtype: int64

Note how selections that return a `Series` keep their index labels.
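Plain `[]` mixes positional and label-based lookup, which can be ambiguous. The explicit accessors `.iloc` (position) and `.loc` (label) avoid that. A short sketch follows; the variable names are illustrative only.

```python
import pandas as pd

s2 = pd.Series({"b": 6, "d": 8, "c": 7, "a": 5})

first = s2.iloc[0]              # by position
by_label = s2.loc["a"]          # by label

# Positional slices exclude the end point: rows b, d
pos_slice = s2.iloc[0:2]

# Label slices include the end point: rows b, d, c
label_slice = s2.loc["b":"c"]

print(first, by_label)
print(pos_slice)
print(label_slice)
```

The inclusive end point of label slices is a common surprise for people used to Python list slicing.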

# Series - Filtering

Comparisons and arithmetic are vectorized: they apply element-wise across the whole `Series`.

In [14]:
idx = s2 > 5
idx

b     True
d     True
c     True
a    False
dtype: bool

In [15]:
# Selecting with a boolean vector
s2[idx]

b    6
d    8
c    7
dtype: int64

## Filtering

In [16]:
idx = (6 < s2) & (s2 < 100)
idx

b    False
d     True
c     True
a    False
dtype: bool

`&` binds more tightly than comparison operators so you need parentheses.
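The sketch below shows what happens when the parentheses are left out, along with `Series.between` as an alternative. It assumes `pandas >= 1.3`, where `inclusive` accepts the string `"neither"`. The names `failed`, `mask`, and `same_mask` are illustrative.

```python
import pandas as pd

s2 = pd.Series({"b": 6, "d": 8, "c": 7, "a": 5})

# Without parentheses this parses as 6 < (s2 & s2) < 100, a chained
# comparison that needs the truth value of a Series and so raises.
try:
    _ = 6 < s2 & s2 < 100
    failed = False
except ValueError:
    failed = True
print(failed)

# With parentheses, each comparison is evaluated first
mask = (6 < s2) & (s2 < 100)

# between() expresses the same open interval without operator pitfalls
same_mask = s2.between(6, 100, inclusive="neither")
print(mask.equals(same_mask))
```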

# Filtering - Missing Data

`pandas` has built-in support for missing data, represented as `NaN` in floating-point `Series`.

In [17]:
s3 = pd.Series([6., 8., np.nan, 7.])
s3

0    6.0
1    8.0
2    NaN
3    7.0
dtype: float64

In [18]:
s3.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [19]:
s3[pd.notnull(s3)]

0    6.0
1    8.0
3    7.0
dtype: float64

Note that filtering preserves the labels (index): label 2 is simply absent from the result.
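Two common follow-ups are dropping missing values or replacing them. A minimal sketch, where the fill value `0.0` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

s3 = pd.Series([6.0, 8.0, np.nan, 7.0])

# dropna() is shorthand for selecting with a not-null mask
dropped = s3.dropna()
print(dropped)

# fillna() keeps every label and substitutes a value for NaN
filled = s3.fillna(0.0)
print(filled)
```

Whether to drop or fill depends on the analysis; filling with zero changes sums and means, while dropping changes the number of observations.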

## Types of Missing Data

In [20]:
s = pd.Series([1, 2, 3, np.nan, 5], index=list('abcde'))

What is the `dtype` of `s`?

In [21]:
s

a    1.0
b    2.0
c    3.0
d    NaN
e    5.0
dtype: float64

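WARNING: There is no integer NA type in `pandas < 0.24.0`, so an integer
`Series` containing a missing value is upcast to `float64`. Later versions
add a nullable `Int64` dtype that must be requested explicitly.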
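In `pandas >= 0.24.0` you can opt in to the nullable integer dtype by passing `dtype="Int64"` (note the capital `I`). A small sketch, with the hypothetical name `nullable`:

```python
import pandas as pd

# Request the nullable integer dtype explicitly; the default would be float64
nullable = pd.Series([1, 2, 3, None, 5], index=list("abcde"), dtype="Int64")
print(nullable)
print(nullable.dtype)

# The missing entry stays missing, and the rest remain integers
n_missing = int(nullable.isna().sum())
total = int(nullable.sum())  # missing values are skipped by default
print(n_missing, total)
```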

# Alignment

Operations in `pandas` are implicitly aligned by index!

In [22]:
s1 = pd.Series([6, 8, 7, 5], index=list('abcd'))
s2 = pd.Series([1, 2, 3, np.nan, 5], index=list('abcde'))

In [23]:
s1 + s2

a     7.0
b    10.0
c    10.0
d     NaN
e     NaN
dtype: float64

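Performs an outer join on the index, filling labels missing from either
side with `NaN`. Here `e` is absent from `s1`, and `d` is `NaN` because
`s2['d']` is already missing.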
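When `NaN` is not the result you want, the method form `Series.add` accepts a `fill_value` that stands in for a value missing on one side. A sketch using the same two `Series`; the choice of `0` as the fill is an assumption for illustration:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([6, 8, 7, 5], index=list("abcd"))
s2 = pd.Series([1, 2, 3, np.nan, 5], index=list("abcde"))

# The operator aligns on the union of labels and propagates NaN
plain = s1 + s2
print(plain)

# add() with fill_value substitutes 0 where exactly one side is missing
filled = s1.add(s2, fill_value=0)
print(filled)
```

A label missing on both sides would still produce `NaN`, since there is nothing to add the fill value to.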