# Pandas Notes

`Pandas` is a package built on top of `NumPy`. It is organized around the Series and DataFrame objects and provides an efficient implementation of the DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Pandas adds flexibility that NumPy lacks (e.g., attaching labels to data, working with missing data) and supports operations that do not map well to element-wise broadcasting (e.g., groupings, pivots). Each of these is an important part of analyzing the less structured data available in many forms in the world around us.

* Pandas series
* DataFrame - creation, read from files
* Quick checking DataFrame
* Descriptive stats on DataFrame
* Indexing, slicing, conditional subsetting
* Operations on specific rows/columns

In [1]:
import numpy as np
import pandas as pd

## 1.1. Series Object

The Series object is in many ways interchangeable with a one-dimensional NumPy array. The essential difference is the index: while a NumPy array has an implicitly defined integer index used to access the values, a Pandas Series has an explicitly defined index associated with the values.

In [2]:
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["alice", "bob", "charles", "darwin"])
s

alice      0.25
bob        0.50
charles    0.75
darwin     1.00
dtype: float64

The `values` and `index` attributes give access to the underlying NumPy array and the Index object, respectively:

In [3]:
s.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
s.index

Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')
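The explicit index means a Series can be addressed by label as well as by position. The following is a minimal sketch reusing the same values as `s`; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["alice", "bob", "charles", "darwin"])

by_label = s["bob"]                    # label-based lookup
by_position = s.iloc[1]                # position-based lookup
label_slice = s.loc["bob":"charles"]   # label slices include both endpoints
position_slice = s.iloc[1:2]           # positional slices exclude the stop

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Note that the label slice returns two rows while the positional slice returns one, which is a common source of off-by-one surprises.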

The Pandas Series is also like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [5]:
s = {'color': ['black'],
     'size': ['S'],
     'data': pd.date_range('1/1/2019', periods=1, freq='W'),
     'a': 0,
     'b': 1}
pd.Series(s)

color                                              [black]
size                                                   [S]
data     DatetimeIndex(['2019-01-06'], dtype='datetime6...
a                                                        0
b                                                        1
dtype: object

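By default, the index of the resulting Series is drawn from the dictionary keys. Recent pandas versions keep the keys in insertion order, as the output above shows; older versions sorted them.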

In [6]:
s['size']

['S']
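As a small sketch of how the index is derived from a dictionary, the example below uses a made-up dictionary (`letters` is not part of the notebook). Passing `index=` explicitly selects and reorders keys, and a key missing from the dictionary becomes `NaN`.

```python
import pandas as pd

letters = {"c": 3, "a": 1, "b": 2}   # illustrative data only

by_insertion = pd.Series(letters)                    # keeps the dict's key order
subset = pd.Series(letters, index=["a", "b", "z"])   # "z" is not a key -> NaN

print(by_insertion)
print(subset)
```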

## 1.2. DataFrame Object

The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [7]:
data = {'color': ['black', 'white', 'black', 'white', 'black', 'white', 'black', 'white', 'black', 'white'],
        'size': ['S', 'M', 'L', 'M', 'L', 'S', 'S', 'XL', 'XL', 'M'],
        'data': pd.date_range('1/1/2019', periods=10, freq='W'),
        'a': np.random.randn(10),
        'b': np.random.normal(0.5, 2, 10)}

df = pd.DataFrame(data)
df

,color,size,data,a,b
0,black,S,2019-01-06,-2.00404,2.258491
1,white,M,2019-01-13,0.110237,-1.277053
2,black,L,2019-01-20,-0.607797,5.881012
3,white,M,2019-01-27,-0.368079,1.041533
4,black,L,2019-02-03,0.346664,-0.054605
5,white,S,2019-02-10,0.877276,1.243388
6,black,S,2019-02-17,-0.480872,4.105147
7,white,XL,2019-02-24,1.415476,3.304951
8,black,XL,2019-03-03,2.405514,1.850534
9,white,M,2019-03-10,-0.427718,2.235355


In [8]:
df.values

array([['black', 'S', Timestamp('2019-01-06 00:00:00'),
        -2.0040401838698876, 2.2584906041664516],
       ['white', 'M', Timestamp('2019-01-13 00:00:00'),
        0.11023707024584892, -1.2770529298433477],
       ['black', 'L', Timestamp('2019-01-20 00:00:00'),
        -0.6077972180329199, 5.881012054507084],
       ['white', 'M', Timestamp('2019-01-27 00:00:00'),
        -0.36807858928647536, 1.0415333618574834],
       ['black', 'L', Timestamp('2019-02-03 00:00:00'),
        0.34666430707517915, -0.05460454981876628],
       ['white', 'S', Timestamp('2019-02-10 00:00:00'),
        0.8772756465446854, 1.2433875340571219],
       ['black', 'S', Timestamp('2019-02-17 00:00:00'),
        -0.4808721365590081, 4.105147205273319],
       ['white', 'XL', Timestamp('2019-02-24 00:00:00'),
        1.415475632323723, 3.304951299456659],
       ['black', 'XL', Timestamp('2019-03-03 00:00:00'),
        2.4055144698881317, 1.850533821634467],
       ['white', 'M', Timestamp('2019-03-10 00:0

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In [9]:
df.index

RangeIndex(start=0, stop=10, step=1)

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In [10]:
df.columns

Index(['color', 'size', 'data', 'a', 'b'], dtype='object')
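To illustrate the two views of a DataFrame described above, here is a hedged sketch on a tiny hypothetical frame (`mixed` is not the notebook's `df`). Each column is a Series sharing the frame's index, and the array view takes the `object` dtype when column types are mixed.

```python
import pandas as pd

# Hypothetical two-column frame, not the notebook's `df`.
mixed = pd.DataFrame({"label": ["x", "y"], "value": [1.0, 2.0]})

column = mixed["value"]                        # a single column is a Series
shares_index = column.index.equals(mixed.index)

all_values = mixed.to_numpy()                  # mixed dtypes -> object array
numeric_values = mixed[["value"]].to_numpy()   # numeric only -> float64 array

print(type(column).__name__, shares_index)
print(all_values.dtype, numeric_values.dtype)
```

This is why `df.values` above prints Python objects such as `Timestamp` rather than a compact numeric array.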

## 1.3. Index Object

This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Index objects also have many of the attributes familiar from NumPy arrays:

In [11]:
index = [['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C'], ['JP', 'CN', 'US', 'US', 'US', 'CN', 'CN', 'CA', 'JP', 'CA']]

ind = pd.Index(index)
ind

Index([['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C'], ['JP', 'CN', 'US', 'US', 'US', 'CN', 'CN', 'CA', 'JP', 'CA']], dtype='object')

In [12]:
print(ind[0])

print(ind[0][0])

['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C']
A


In [13]:
print('size:', ind.size)
print('shape:', ind.shape)
print('dim:', ind.ndim)
print('type:', ind.dtype)

size: 2
shape: (2,)
dim: 1
type: object


One difference from NumPy arrays is that Index objects are immutable: their elements cannot be modified in place, whereas the elements of a NumPy array can:

In [14]:
ind[0] = 1

TypeError: Index does not support mutable operations
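Immutability applies to the elements of an Index, not to which Index a DataFrame uses. A minimal sketch, assuming a small hypothetical frame, shows that the index can still be replaced wholesale or relabeled with `rename`:

```python
import pandas as pd

ind = pd.Index(["a", "b", "c"])

try:
    ind[0] = "z"                      # item assignment is rejected
except TypeError as exc:
    print("TypeError:", exc)

frame = pd.DataFrame({"v": [1, 2, 3]}, index=ind)
frame.index = pd.Index(["x", "y", "z"])         # swap in a whole new Index
renamed = frame.rename(index={"x": "first"})    # returns a relabeled copy

print(list(ind), list(frame.index), list(renamed.index))
```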

The Index object follows many of the conventions used by Python's built-in `set` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [15]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

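# Set operations via the named methods; in pandas 2.0+ the &, |, ^
# operators no longer perform set operations on Index objects.
# intersection
print(indA.intersection(indB))
# union
print(indA.union(indB))
# symmetric difference
print(indA.symmetric_difference(indB))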

Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')


Adding a MultiIndex to a DataFrame:

In [16]:
index = pd.MultiIndex.from_arrays(index, names=['class', 'country'])

df = pd.DataFrame(data, index=index)
df

class,country,color,size,data,a,b
A,JP,black,S,2019-01-06,-2.00404,2.258491
B,CN,white,M,2019-01-13,0.110237,-1.277053
B,US,black,L,2019-01-20,-0.607797,5.881012
B,US,white,M,2019-01-27,-0.368079,1.041533
C,US,black,L,2019-02-03,0.346664,-0.054605
A,CN,white,S,2019-02-10,0.877276,1.243388
B,CN,black,S,2019-02-17,-0.480872,4.105147
A,CA,white,XL,2019-02-24,1.415476,3.304951
C,JP,black,XL,2019-03-03,2.405514,1.850534
C,CA,white,M,2019-03-10,-0.427718,2.235355
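With a MultiIndex in place, rows can be selected by a full key, by the first level only, or by an inner level. The sketch below uses a small hypothetical frame with the same level names (`class`, `country`); it is an illustration and not the notebook's `df`.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["A", "B", "B", "C"], ["JP", "CN", "US", "US"]],
    names=["class", "country"],
)
frame = pd.DataFrame({"v": [1, 2, 3, 4]}, index=idx).sort_index()

cell = frame.loc[("B", "US"), "v"]           # full key plus column -> scalar
class_b = frame.loc["B"]                     # first level only -> rows indexed by country
us_rows = frame.xs("US", level="country")    # inner level -> rows indexed by class

print(cell)
print(class_b)
print(us_rows)
```

Sorting the index first is a defensive choice here: lookups on an unsorted MultiIndex generally work but can be slower.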


## 2.1. Query Row Data

In [29]:
# select row by integer position, return Series
df.iloc[0]

color                  black
size                       S
data     2019-01-06 00:00:00
a                   -2.00404
b                    2.25849
Name: (A, JP), dtype: object

In [27]:
# select row by integer position (list of positions), return DataFrame
df.iloc[[0]]

class,country,color,size,data,a,b
A,JP,black,S,2019-01-06,-2.00404,2.258491


In [43]:
# select rows by label on the first index level, return DataFrame
df.loc['B']

country,color,size,data,a,b
CN,white,M,2019-01-13,0.110237,-1.277053
US,black,L,2019-01-20,-0.607797,5.881012
US,white,M,2019-01-27,-0.368079,1.041533
CN,black,S,2019-02-17,-0.480872,4.105147


## 2.2. Query Column Data

In [32]:
# select column by integer position, return Series
print(df.iloc[:, 0])

class  country
A      JP         black
B      CN         white
       US         black
       US         white
C      US         black
A      CN         white
B      CN         black
A      CA         white
C      JP         black
       CA         white
Name: color, dtype: object


In [33]:
# select column by name, return Series
df.loc[:, 'color']

class  country
A      JP         black
B      CN         white
       US         black
       US         white
C      US         black
A      CN         white
B      CN         black
A      CA         white
C      JP         black
       CA         white
Name: color, dtype: object

In [18]:
# select column by integer position (list of positions), return DataFrame
df.iloc[:, [0]].head()

class,country,color
A,JP,black
B,CN,white
B,US,black
B,US,white
C,US,black


In [19]:
# select column by name, return dataframe
df[['color']].head()

class,country,color
A,JP,black
B,CN,white
B,US,black
B,US,white
C,US,black
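The difference between the Series-returning and DataFrame-returning selections above comes down to whether a scalar or a list is passed. A short sketch on a hypothetical two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({"color": ["black", "white"], "a": [1.0, 2.0]})

as_series = frame["color"]      # scalar key -> Series
as_frame = frame[["color"]]     # list of keys -> DataFrame with one column
row_series = frame.iloc[0]      # scalar position -> Series
row_frame = frame.iloc[[0]]     # list of positions -> DataFrame with one row

print(type(as_series).__name__, type(as_frame).__name__)
print(row_series.shape, row_frame.shape)
```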


## 2.3. Query with Condition

In [65]:
df[df.a > 0]

class,country,color,size,data,a,b
B,CN,white,M,2019-01-13,0.110237,-1.277053
C,US,black,L,2019-02-03,0.346664,-0.054605
A,CN,white,S,2019-02-10,0.877276,1.243388
A,CA,white,XL,2019-02-24,1.415476,3.304951
C,JP,black,XL,2019-03-03,2.405514,1.850534


In [62]:
df.loc[(df.color == 'white') & (df['size'] == 'M'), ['a', 'b']]

class,country,a,b
B,CN,0.110237,-1.277053
B,US,-0.368079,1.041533
C,CA,-0.427718,2.235355
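The conditions above are boolean masks combined with `&`; each comparison needs its own parentheses because `&` binds more tightly than `==`. As a hedged sketch on hypothetical data, the same kind of filter can also be written with `query`, and membership tests with `isin`:

```python
import pandas as pd

frame = pd.DataFrame({
    "color": ["black", "white", "white", "black"],
    "size": ["S", "M", "M", "XL"],
    "a": [-1.0, 0.5, -0.2, 2.0],
})

mask = (frame["color"] == "white") & (frame["a"] > 0)
by_mask = frame.loc[mask]
by_query = frame.query("color == 'white' and a > 0")   # same filter as a string

small_or_xl = frame[frame["size"].isin(["S", "XL"])]   # membership test

print(by_mask)
print(small_or_xl)
```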


## 2.4. Appending

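There are two ways to append data records: `append` a dictionary or `concat` a DataFrame. `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below only runs on older versions.

In [20]:
df2 = pd.DataFrame(data)
# requires pandas < 2.0
df2.append({'color': 'green', 'size': 'XS', 'data': '2019-02-01 00:00:00', 'a': 1, 'b': -3}, ignore_index=True)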

,color,size,data,a,b
0,black,S,2019-01-06 00:00:00,-2.00404,2.258491
1,white,M,2019-01-13 00:00:00,0.110237,-1.277053
2,black,L,2019-01-20 00:00:00,-0.607797,5.881012
3,white,M,2019-01-27 00:00:00,-0.368079,1.041533
4,black,L,2019-02-03 00:00:00,0.346664,-0.054605
5,white,S,2019-02-10 00:00:00,0.877276,1.243388
6,black,S,2019-02-17 00:00:00,-0.480872,4.105147
7,white,XL,2019-02-24 00:00:00,1.415476,3.304951
8,black,XL,2019-03-03 00:00:00,2.405514,1.850534
9,white,M,2019-03-10 00:00:00,-0.427718,2.235355


Using `concat` is a more efficient way to append two DataFrames:

In [21]:
temp = {'color': ['green'], 'size': ['XS'], 'data': ['2019-02-01 00:00:00'], 'a': [1], 'b': [-3]}

# append row-wise (only the last expression in a cell is displayed)
pd.concat([df2, pd.DataFrame(temp)], axis=0, ignore_index=True)

# append column-wise
temp = pd.Series(np.linspace(4, 20, 10))
pd.concat([df2, temp], axis=1)

,color,size,data,a,b,0
0,black,S,2019-01-06,-2.00404,2.258491,4.0
1,white,M,2019-01-13,0.110237,-1.277053,5.777778
2,black,L,2019-01-20,-0.607797,5.881012,7.555556
3,white,M,2019-01-27,-0.368079,1.041533,9.333333
4,black,L,2019-02-03,0.346664,-0.054605,11.111111
5,white,S,2019-02-10,0.877276,1.243388,12.888889
6,black,S,2019-02-17,-0.480872,4.105147,14.666667
7,white,XL,2019-02-24,1.415476,3.304951,16.444444
8,black,XL,2019-03-03,2.405514,1.850534,18.222222
9,white,M,2019-03-10,-0.427718,2.235355,20.0
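Since `DataFrame.append` is not available in pandas 2.0 and later, a single record can be appended by wrapping it in a one-row DataFrame and calling `concat`. This is a minimal sketch with hypothetical data; naming the Series also gives the new column a readable label instead of `0`.

```python
import pandas as pd

base = pd.DataFrame({"color": ["black", "white"], "a": [1, 2]})

# Row-wise: wrap the record in a one-row DataFrame, then concatenate.
new_row = pd.DataFrame([{"color": "green", "a": 3}])
rows = pd.concat([base, new_row], ignore_index=True)

# Column-wise: a named Series becomes a column with that name.
extra = pd.Series([10.0, 20.0], name="c")
cols = pd.concat([base, extra], axis=1)

print(rows)
print(cols)
```

When many records are added, collecting them in a list and calling `concat` once avoids copying the frame on every step.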


## 2.5. Evaluating an expression

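A useful feature of pandas is expression evaluation with `eval`. By default it uses the `numexpr` library as its engine when that library is installed; recent pandas versions fall back to the slower Python engine otherwise.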

In [69]:
# quick evaluation of an expression over columns
print(df.eval('a + b'))

# add a derived column; returns a new DataFrame unless inplace=True is passed
df.eval('Bool = a + b > 0')

class  country
A      JP         0.254450
B      CN        -1.166816
       US         5.273215
       US         0.673455
C      US         0.292060
A      CN         2.120663
B      CN         3.624275
A      CA         4.720427
C      JP         4.256048
       CA         1.807637
dtype: float64


class,country,color,size,data,a,b,Bool
A,JP,black,S,2019-01-06,-2.00404,2.258491,True
B,CN,white,M,2019-01-13,0.110237,-1.277053,False
B,US,black,L,2019-01-20,-0.607797,5.881012,True
B,US,white,M,2019-01-27,-0.368079,1.041533,True
C,US,black,L,2019-02-03,0.346664,-0.054605,True
A,CN,white,S,2019-02-10,0.877276,1.243388,True
B,CN,black,S,2019-02-17,-0.480872,4.105147,True
A,CA,white,XL,2019-02-24,1.415476,3.304951,True
C,JP,black,XL,2019-03-03,2.405514,1.850534,True
C,CA,white,M,2019-03-10,-0.427718,2.235355,True
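The distinction between returning a new DataFrame and modifying the existing one can be sketched as follows. The frame and the column name `positive` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [0.5, 0.5, -4.0]})

total = frame.eval("a + b")                        # plain expression -> Series
with_flag = frame.eval("positive = a + b > 0")     # assignment -> new DataFrame
unchanged = "positive" not in frame.columns        # original is left untouched

frame.eval("positive = a + b > 0", inplace=True)   # now modify `frame` itself

print(total.tolist())
print(with_flag["positive"].tolist(), unchanged)
```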


## 3.1. Group by

In [63]:
size_lvl = df.groupby('size')

for i in size_lvl:
    print(i)

('L',                color size       data         a         b
class country                                           
B     US       black    L 2019-01-20  0.386777 -0.366475
C     US       black    L 2019-02-03  0.304819  1.441471)
('M',                color size       data         a         b
class country                                           
B     CN       white    M 2019-01-13 -1.820500  1.562026
      US       white    M 2019-01-27 -1.604988  3.103769
C     CA       white    M 2019-03-10 -0.621649 -0.967063)
('S',                color size       data         a         b
class country                                           
A     JP       black    S 2019-01-06  0.652694  2.277243
      CN       white    S 2019-02-10 -0.356302  0.380898
B     CN       black    S 2019-02-17 -0.440898 -3.134000)
('XL',                color size       data         a        b
class country                                          
A     CA       white   XL 2019-02-24 -1.118385  1.21721
C     

## 3.2. Select specific group

In [65]:
size_lvl.get_group('M')

class,country,color,size,data,a,b
B,CN,white,M,2019-01-13,-1.8205,1.562026
B,US,white,M,2019-01-27,-1.604988,3.103769
C,CA,white,M,2019-03-10,-0.621649,-0.967063
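Iterating over a GroupBy object yields `(key, sub-frame)` pairs, which is what the tuples printed in 3.1 are. A minimal sketch on hypothetical data, unpacking the pairs and checking group sizes before selecting one group:

```python
import pandas as pd

frame = pd.DataFrame({"size": ["S", "M", "S", "L"], "a": [1, 2, 3, 4]})
grouped = frame.groupby("size")

keys = []
for key, group in grouped:        # groups come out sorted by key by default
    keys.append(key)
    print(key, len(group))

counts = grouped.size()           # number of rows per group
m_rows = grouped.get_group("M")   # the sub-frame for one key
```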


## 4.1. Aggregation

In [66]:
# numeric_only=True restricts the sum to numeric columns (needed in pandas 2.0+)
size_lvl.sum(numeric_only=True).add_prefix('sum_')

size,sum_a,sum_b
L,0.691596,1.074996
M,-4.047138,3.698732
S,-0.144507,-0.47586
XL,-0.988178,2.28193


In [67]:
df.groupby(['size', 'color']).agg({'a': 'min', 'b': 'mean'})

size,color,a,b
L,black,0.304819,0.537498
M,white,-1.8205,1.232911
S,black,-0.440898,-0.428379
S,white,-0.356302,0.380898
XL,black,0.130208,1.06472
XL,white,-1.118385,1.21721
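The dictionary form of `agg` keeps the original column names. As an alternative sketch, named aggregation (available since pandas 0.25) lets each output column be named explicitly; the data and the output names `a_min` and `b_mean` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({
    "size": ["S", "S", "M"],
    "a": [1, 3, 5],
    "b": [2.0, 4.0, 6.0],
})

# Each keyword is an output column: (input column, aggregation name).
result = frame.groupby("size").agg(
    a_min=("a", "min"),
    b_mean=("b", "mean"),
)

print(result)
```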


## 5.1. Apply a custom function

In [96]:
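# Transform: the result is aligned with the original rows.
# Non-numeric columns are excluded explicitly, since strings cannot be subtracted.
data_range = lambda x: x.max() - x.min()
df.groupby('size')[['data', 'a', 'b']].transform(data_range)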

class,country,data,a,b
A,JP,42 days,1.093592,5.411243
B,CN,56 days,1.198851,4.070832
B,US,14 days,0.081958,1.807946
B,US,56 days,1.198851,4.070832
C,US,14 days,0.081958,1.807946
A,CN,42 days,1.093592,5.411243
B,CN,42 days,1.093592,5.411243
A,CA,7 days,1.248593,0.152489
C,JP,7 days,1.248593,0.152489
C,CA,56 days,1.198851,4.070832


In [98]:
# Apply
df.groupby('size')['a'].apply(lambda x: x.max() - x.min())

size
L     0.081958
M     1.198851
S     1.093592
XL    1.248593
Name: a, dtype: float64
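The two cells above differ in output shape: `transform` broadcasts each group's result back to every row of that group, while `apply` (or `agg`) with a reducing function returns one value per group. A hedged sketch on hypothetical data, where `value_range` is an illustrative helper:

```python
import pandas as pd

frame = pd.DataFrame({
    "size": ["S", "S", "M", "M"],
    "a": [1.0, 4.0, 10.0, 12.0],
})

def value_range(x):
    """Spread of a group's values (max minus min)."""
    return x.max() - x.min()

per_group = frame.groupby("size")["a"].agg(value_range)       # one value per group
per_row = frame.groupby("size")["a"].transform(value_range)   # one value per row

# Typical use of transform: center each value on its group mean.
centered = frame["a"] - frame.groupby("size")["a"].transform("mean")

print(per_group)
print(per_row.tolist())
print(centered.tolist())
```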

## 6.1. Rolling

`rolling(n)` creates a sliding window of `n` rows within each group and aggregates over it. Where a window does not yet contain `n` rows, the result is `NaN`:

In [90]:
df.groupby('color').rolling(2).a.sum()

color  class  country
black  A      JP              NaN
       B      US         1.039471
       C      US         0.691596
       B      CN        -0.136079
       C      JP        -0.310691
white  B      CN              NaN
              US        -3.425488
       A      CN        -1.961290
              CA        -1.474688
       C      CA        -1.740035
Name: a, dtype: float64
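The leading `NaN` in each group comes from the window not yet being full. A minimal sketch on hypothetical data shows the default behavior next to `min_periods=1`, which accepts partially filled windows:

```python
import pandas as pd

frame = pd.DataFrame({
    "color": ["black", "white", "black", "white"],
    "a": [1.0, 2.0, 3.0, 4.0],
})

# The result is indexed by (color, original row label).
strict = frame.groupby("color")["a"].rolling(2).sum()
relaxed = frame.groupby("color")["a"].rolling(2, min_periods=1).sum()

print(strict)
print(relaxed)
```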

## 7.1. Expanding

`expanding` also aggregates within each group, but the window starts at the group's first row and grows by one row at each step:

In [84]:
# Cumulative sum
df.groupby('color').expanding(1).a.sum()

color  class  country
black  A      JP         0.652694
       B      US         1.039471
       C      US         1.344290
       B      CN         0.903392
       C      JP         1.033599
white  B      CN        -1.820500
              US        -3.425488
       A      CN        -3.781790
              CA        -4.900176
       C      CA        -5.521825
Name: a, dtype: float64
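An expanding sum is a cumulative sum, so the result can be cross-checked against `cumsum`. The sketch below uses hypothetical data; the only difference is the index, since `expanding` adds the group key as an extra level.

```python
import pandas as pd

frame = pd.DataFrame({
    "color": ["black", "white", "black", "white"],
    "a": [1.0, 2.0, 3.0, 4.0],
})

expanding_sum = frame.groupby("color")["a"].expanding(1).sum()
cumulative = frame.groupby("color")["a"].cumsum()   # keeps the original row index

# Drop the group level and restore row order to compare the two.
realigned = expanding_sum.droplevel("color").sort_index()

print(expanding_sum)
print(cumulative.tolist(), realigned.tolist())
```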

## 8.1. Filter

In [91]:
df.groupby('class').filter(lambda x: len(x) > 3)

class,country,color,size,data,a,b
B,CN,white,M,2019-01-13,-1.8205,1.562026
B,US,black,L,2019-01-20,0.386777,-0.366475
B,US,white,M,2019-01-27,-1.604988,3.103769
B,CN,black,S,2019-02-17,-0.440898,-3.134
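`filter` on a GroupBy keeps or drops whole groups: the function receives each group's sub-frame and returns a single boolean. A small sketch on hypothetical data, with one condition on group size and one on a group statistic:

```python
import pandas as pd

frame = pd.DataFrame({
    "class": ["A", "B", "B", "B", "C"],
    "a": [1, 2, 3, 4, 5],
})

# Keep only groups with at least three rows.
big_groups = frame.groupby("class").filter(lambda g: len(g) >= 3)

# Keep only groups whose mean of `a` exceeds 2.
high_mean = frame.groupby("class").filter(lambda g: g["a"].mean() > 2)

print(big_groups)
print(high_mean)
```

The surviving rows keep their original order and index labels.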


## 9.1 Evaluating an expression