# Pandas Tutorial

1. Recap of Numpy Arrays
2. Pandas Series
3. Creating DataFrames
4. Accessing DataFrames
5. Operations on DataFrames
6. Saving and Loading DataFrames (see last workshop)

## 1. Recap of Numpy Arrays

In [5]:
import numpy as np

In [6]:
arr = np.arange(1,21,1)
mat = arr.reshape(5,4)
mat

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16],
       [17, 18, 19, 20]])

In [7]:
mat.sum()

210

In [8]:
mat.sum(axis=0)

array([45, 50, 55, 60])

In [9]:
mat.sum(axis=1)

array([10, 26, 42, 58, 74])

**axis=0**: downwards across rows -> apply the method down each column (**column wise**)  
**axis=1**: horizontally across columns -> apply the method across each row (**row wise**)
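As a quick check of the axis rule, the following minimal sketch rebuilds the same 5x4 matrix and looks at the shape of each result. The names `col_sums` and `row_sums` are only illustrative.

```python
import numpy as np

mat = np.arange(1, 21).reshape(5, 4)

# axis=0 collapses the rows, leaving one value per column
col_sums = mat.sum(axis=0)
# axis=1 collapses the columns, leaving one value per row
row_sums = mat.sum(axis=1)

print(col_sums.shape)  # (4,)
print(row_sums.shape)  # (5,)
```

The axis you pass is the one that disappears from the result, which is why `axis=0` on a 5x4 matrix returns 4 numbers.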

## 2. Pandas Series

These objects are similar to NumPy arrays, but every element carries an index label.
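The practical difference shows up in how elements are looked up and how arithmetic lines values up. This is a small sketch with made-up values; `ser`, `other` and `total` are illustrative names.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
ser = pd.Series(arr, index=['x', 'y', 'z'])

print(arr[1])       # the array only knows positions
print(ser['y'])     # the Series can be addressed by label
print(ser.iloc[1])  # positional access is still available via .iloc

# Arithmetic between two Series matches values by label, not by position
other = pd.Series([1, 2, 3], index=['z', 'y', 'x'])
total = ser + other
print(total)
```

Here `total['x']` combines 10 with 3, because both values carry the label `x` even though they sit at different positions.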

In [10]:
import pandas as pd

### Creating Pandas Series

In [11]:
my_list = [1,2,3,4]
my_arr = np.array(my_list)
my_dict = {'a':1, 'c':2, 'b':3, 'd':4}

In [19]:
pd.Series(my_list)

0    1
1    2
2    3
3    4
dtype: int64

In [20]:
pd.Series(my_arr)

0    1
1    2
2    3
3    4
dtype: int64

In [21]:
pd.Series(my_dict)

a    1
b    3
c    2
d    4
dtype: int64

### Accessing Elements

In [22]:
my_series = pd.Series(my_dict)
my_series

a    1
b    3
c    2
d    4
dtype: int64

In [23]:
# access via position
my_series.iloc[0]

1

In [24]:
# access via named index
my_series['a']

1

In [25]:
my_series[['a','c']]

a    1
c    2
dtype: int64

In [26]:
my_series['a':'c']

a    1
b    3
c    2
dtype: int64

In [27]:
my_series > 2

a    False
b     True
c    False
d     True
dtype: bool

In [28]:
my_series[my_series > 2]

b    3
d    4
dtype: int64

In [29]:
type(my_series)

pandas.core.series.Series

### Some Functions

In [30]:
my_series.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

In [31]:
pd.Series(my_arr).index

RangeIndex(start=0, stop=4, step=1)

In [32]:
my_series**2

a     1
b     9
c     4
d    16
dtype: int64

In [33]:
def squareElement(x):
    return x**2

In [34]:
my_series.apply(squareElement)

a     1
b     9
c     4
d    16
dtype: int64

In [35]:
def multiply(x,y):
    return x*y

In [37]:
my_func = lambda x: multiply(x,10) 

In [38]:
my_func(2)

20

In [36]:
my_series.apply(lambda x: multiply(x,10))

a    10
b    30
c    20
d    40
dtype: int64

In [39]:
my_series.apply(multiply, args=[10])

a    10
b    30
c    20
d    40
dtype: int64

In [40]:
my_series.apply(multiply, y=10)

a    10
b    30
c    20
d    40
dtype: int64
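The three `apply` variants above pass the extra argument in different ways but should produce the same values. The sketch below compares them with the plain vectorized expression, which is usually the simplest choice when one exists; the variable names are illustrative.

```python
import pandas as pd

def multiply(x, y):
    return x * y

s = pd.Series({'a': 1, 'b': 3, 'c': 2, 'd': 4})

via_lambda = s.apply(lambda x: multiply(x, 10))  # wrap the call in a lambda
via_args = s.apply(multiply, args=(10,))         # extra positional arguments
via_kwargs = s.apply(multiply, y=10)             # extra keyword arguments
vectorized = s * 10                              # no Python-level function call

print(via_lambda.tolist() == vectorized.tolist())
```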

In [52]:
my_series = pd.Series(np.arange(2,22,2))
my_series

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int64

In [53]:
my_series % 3

0    2
1    1
2    0
3    2
4    1
5    0
6    2
7    1
8    0
9    2
dtype: int64

In [55]:
my_series.apply(lambda x: x / 0)

ZeroDivisionError: integer division or modulo by zero

In [56]:
my_series / 0

0    inf
1    inf
2    inf
3    inf
4    inf
5    inf
6    inf
7    inf
8    inf
9    inf
dtype: float64

In [57]:
1/0

ZeroDivisionError: integer division or modulo by zero

In [58]:
my_series

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int64

In [59]:
my_series / 1

0     2.0
1     4.0
2     6.0
3     8.0
4    10.0
5    12.0
6    14.0
7    16.0
8    18.0
9    20.0
dtype: float64

In [60]:
my_series.apply(lambda x: x/1)

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int64

In [62]:
5 / 2.

2.5

**Exercise**:  
- create a numpy array and practice accessing different parts and elements
- create a pandas series containing the even integers up to 20
- write a function which returns the remainder of the division by 3
- divide your series by 1 (do you notice something?)
- divide your series by 0 (do you notice something?)
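One possible way to work through the pandas part of the exercise is sketched below. It assumes Python 3, where `/` is always true division, so the element-wise `x / 1` inside `apply` would also give floats there, unlike the Python 2 output shown earlier. The function name `remainder_by_three` is just an illustration.

```python
import numpy as np
import pandas as pd

evens = pd.Series(np.arange(2, 22, 2))

def remainder_by_three(x):
    """Return the remainder of x divided by 3."""
    return x % 3

remainders = evens.apply(remainder_by_three)

by_one = evens / 1   # the dtype changes from int64 to float64
by_zero = evens / 0  # vectorized division does not raise; values become inf

# Element-wise division by zero may raise instead, depending on how
# apply hands the values to the function, so both outcomes are handled here.
try:
    evens.apply(lambda x: x / 0)
    elementwise_raised = False
except ZeroDivisionError:
    elementwise_raised = True

print(by_one.dtype, by_zero.iloc[0], elementwise_raised)
```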

## 3.  Creating Pandas DataFrames
A DataFrame is a two-dimensional table built from Pandas Series that share a common index. Selecting a single row or a single column returns a Pandas Series.  
We can create DataFrames in several different ways.
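As a compact overview, the sketch below builds the same small table from a NumPy array, from a dict of columns, and from a list of row dicts. The values and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# from a 2-D array plus column names
from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# from a dict that maps column name -> column values
from_dict = pd.DataFrame({'a': [0, 3], 'b': [1, 4], 'c': [2, 5]})

# from a list with one dict per row
from_records = pd.DataFrame([{'a': 0, 'b': 1, 'c': 2},
                             {'a': 3, 'b': 4, 'c': 5}])

# a single column and a single row are both Series
print(type(from_dict['a']))
print(type(from_dict.loc[0]))
```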

In [63]:
pd.DataFrame(data=np.arange(1,21,1).reshape(5,4))

    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
4  17  18  19  20


In [64]:
df_data = pd.DataFrame(data=np.arange(1,21,1).reshape(5,4),
                       index=['A', 'B', 'C', 'D', 'E'],
                       columns=['W', 'X', 'Y', 'Z'])
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [None]:
for element in ['A', 'B']:
    print(element)

In [65]:
pd.DataFrame({'Animal': ['cat', 'bird', 'dog'], 'weight': [20,2,30]})

  Animal  weight
0    cat      20
1   bird       2
2    dog      30


In [70]:
test2 = pd.DataFrame({'Animal': ['cat', 'bird', 'dog'], 'weight': [20,2,30]})
test2.index

RangeIndex(start=0, stop=3, step=1)

In [68]:
test = pd.DataFrame({'Animal': ['cat', 'bird', 'dog'], 'weight': [20,2,30]}).set_index('Animal')
test

        weight
Animal
cat         20
bird         2
dog         30


In [69]:
test.index

Index([u'cat', u'bird', u'dog'], dtype='object', name=u'Animal')

In [71]:
df = pd.DataFrame({'Animal': ['cat', 'bird', 'dog', 'cat', 'dog'], 'weight': [20,2,30, 25, 30],
              'sound': ['miau','chirp', 'wuff', 'miau', 'wuff'],
             'name': ['Mila', 'Bernd', 'Walter', 'Milu', 'Bello']})

df

  Animal    name  sound  weight
0    cat    Mila   miau      20
1   bird   Bernd  chirp       2
2    dog  Walter   wuff      30
3    cat    Milu   miau      25
4    dog   Bello   wuff      30


In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
Animal    5 non-null object
name      5 non-null object
sound     5 non-null object
weight    5 non-null int64
dtypes: int64(1), object(3)
memory usage: 232.0+ bytes


In [73]:
df_multiIndex = df.set_index(['Animal','name'])
df_multiIndex

               sound  weight
Animal name
cat    Mila     miau      20
bird   Bernd   chirp       2
dog    Walter   wuff      30
cat    Milu     miau      25
dog    Bello    wuff      30


In [74]:
df_multiIndex.index

MultiIndex(levels=[[u'bird', u'cat', u'dog'], [u'Bello', u'Bernd', u'Mila', u'Milu', u'Walter']],
           labels=[[1, 0, 2, 1, 2], [2, 1, 4, 3, 0]],
           names=[u'Animal', u'name'])

**Exercise**:  
- create two DataFrames: one containing only numerical values and a named index,  
and one containing string and integer values and a multi-level index
- does the index have to be unique?
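For the last question, a short experiment is one way to find out. The sketch below uses an invented `pets` frame with a repeated label and checks what `.loc` returns.

```python
import pandas as pd

# 'cat' appears twice in the index on purpose
pets = pd.DataFrame({'weight': [20, 25, 2]}, index=['cat', 'cat', 'bird'])

print(pets.index.is_unique)

cats = pets.loc['cat']   # repeated label: all matching rows come back
bird = pets.loc['bird']  # unique label: a single row comes back as a Series

print(type(cats).__name__, type(bird).__name__)
```

pandas accepts the duplicate labels, but a label lookup then returns a different type depending on how many rows match, which is worth keeping in mind when choosing an index.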

## 4. Accessing Elements

In [83]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [77]:
df_data['W']

A     1
B     5
C     9
D    13
E    17
Name: W, dtype: int64

In [78]:
type(df_data['W'])

pandas.core.series.Series

In [79]:
df_data.loc['A']

W    1
X    2
Y    3
Z    4
Name: A, dtype: int64

In [109]:
type(df_data.loc['A']) # this is a comment

pandas.core.series.Series

In [84]:
df_data[['W', 'Z']]

    W   Z
A   1   4
B   5   8
C   9  12
D  13  16
E  17  20


In [85]:
df_data.loc[['B', 'C']]

   W   X   Y   Z
B  5   6   7   8
C  9  10  11  12


In [86]:
df_data.loc['B']['W']

5

In [87]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [88]:
df_data[df_data > 10]

      W     X     Y     Z
A   NaN   NaN   NaN   NaN
B   NaN   NaN   NaN   NaN
C   NaN   NaN  11.0  12.0
D  13.0  14.0  15.0  16.0
E  17.0  18.0  19.0  20.0


In [89]:
df_data['X'] > 10

A    False
B    False
C    False
D     True
E     True
Name: X, dtype: bool

In [90]:
df_data[df_data['X'] > 10]

    W   X   Y   Z
D  13  14  15  16
E  17  18  19  20


In [92]:
True or False

True

In [94]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [93]:
df_data[(df_data['X'] > 10) & (df_data['Y'] < 19)]

    W   X   Y   Z
D  13  14  15  16
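Python's `and` / `or` work on single truth values, as in `True or False` above, while row filters need the element-wise operators `&`, `|` and `~`, with each condition in parentheses. A minimal sketch on the same `df_data` values:

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame(np.arange(1, 21).reshape(5, 4),
                       index=list('ABCDE'), columns=list('WXYZ'))

both = df_data[(df_data['X'] > 10) & (df_data['Y'] < 19)]   # element-wise AND
either = df_data[(df_data['X'] > 10) | (df_data['Y'] < 5)]  # element-wise OR
negated = df_data[~(df_data['X'] > 10)]                     # element-wise NOT

# The keyword `and` asks for one truth value of a whole Series and fails
try:
    df_data[(df_data['X'] > 10) and (df_data['Y'] < 19)]
    raised = False
except ValueError as err:
    raised = True
    print(err)
```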


In [98]:
df_data.iloc[0]

W    1
X    2
Y    3
Z    4
Name: A, dtype: int64

In [99]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [101]:
df_data.loc['A','W']

1

In [95]:
df_data.loc['A'] > 3

W    False
X    False
Y    False
Z     True
Name: A, dtype: bool

In [102]:
df_data.loc[:,df_data.loc['A']>3]

    Z
A   4
B   8
C  12
D  16
E  20


In [103]:
df_data.iloc[1,1]

6

In [104]:
df_data.loc['B', 'X']

6
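The two accessors differ in more than labels versus positions: slices with `.loc` include the end label, while slices with `.iloc` exclude the end position. A small sketch, again rebuilding `df_data` so it runs on its own:

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame(np.arange(1, 21).reshape(5, 4),
                       index=list('ABCDE'), columns=list('WXYZ'))

by_label = df_data.loc['B':'D', 'W':'X']  # end labels 'D' and 'X' are included
by_position = df_data.iloc[1:4, 0:2]      # end positions 4 and 2 are excluded

print(by_label.equals(by_position))

# For a single cell, one .loc call with (row, column) avoids the
# intermediate Series created by chained access like .loc['B']['W']
single = df_data.loc['B', 'W']
```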

In [105]:
df_multiIndex

               sound  weight
Animal name
cat    Mila     miau      20
bird   Bernd   chirp       2
dog    Walter   wuff      30
cat    Milu     miau      25
dog    Bello    wuff      30


In [106]:
df_multiIndex.loc['cat','Mila']

sound     miau
weight      20
Name: (cat, Mila), dtype: object

In [107]:
df_multiIndex.xs('Mila', level=1)

       sound  weight
Animal
cat     miau      20


In [108]:
df_multiIndex.xs('dog', level=0)

       sound  weight
name
Walter  wuff      30
Bello   wuff      30
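Index levels can also be addressed by name instead of by number, which tends to read more clearly. The sketch below rebuilds the multi-index frame; sorting the index before the tuple lookup is a precaution taken here, not something the cells above require.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['cat', 'bird', 'dog', 'cat', 'dog'],
                   'name': ['Mila', 'Bernd', 'Walter', 'Milu', 'Bello'],
                   'sound': ['miau', 'chirp', 'wuff', 'miau', 'wuff'],
                   'weight': [20, 2, 30, 25, 30]})
df_multi = df.set_index(['Animal', 'name'])

# select by level name; the selected level is dropped from the result
dogs = df_multi.xs('dog', level='Animal')

# keep the selected level in the result
mila = df_multi.xs('Mila', level='name', drop_level=False)

# a full (Animal, name) tuple plus a column gives a single value
mila_weight = df_multi.sort_index().loc[('cat', 'Mila'), 'weight']
print(mila_weight)
```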


## 5. Operations on DataFrames
Here, we list some useful operations on Pandas DataFrames.

In [111]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [112]:
# add columns
df_data['new'] = df_data['W'] + df_data['Z']

In [113]:
df_data

    W   X   Y   Z  new
A   1   2   3   4    5
B   5   6   7   8   13
C   9  10  11  12   21
D  13  14  15  16   29
E  17  18  19  20   37


In [114]:
# summary statistics
df_data.describe()

              W         X         Y         Z        new
count       5.0       5.0       5.0       5.0        5.0
mean        9.0      10.0      11.0      12.0       21.0
std    6.324555  6.324555  6.324555  6.324555  12.649111
min         1.0       2.0       3.0       4.0        5.0
25%         5.0       6.0       7.0       8.0       13.0
50%         9.0      10.0      11.0      12.0       21.0
75%        13.0      14.0      15.0      16.0       29.0
max        17.0      18.0      19.0      20.0       37.0


In [115]:
# use custom functions
df_data.apply(lambda element: element**2)

     W    X    Y    Z   new
A    1    4    9   16    25
B   25   36   49   64   169
C   81  100  121  144   441
D  169  196  225  256   841
E  289  324  361  400  1369


In [117]:
df_data

    W   X   Y   Z  new
A   1   2   3   4    5
B   5   6   7   8   13
C   9  10  11  12   21
D  13  14  15  16   29
E  17  18  19  20   37


In [126]:
for element in df_data.columns.tolist():
    print(df_data[element])

A     1
B     5
C     9
D    13
E    17
Name: W, dtype: int64
A     2
B     6
C    10
D    14
E    18
Name: X, dtype: int64
A     3
B     7
C    11
D    15
E    19
Name: Y, dtype: int64
A     4
B     8
C    12
D    16
E    20
Name: Z, dtype: int64
A     5
B    13
C    21
D    29
E    37
Name: new, dtype: int64


In [123]:
df_data.columns.tolist()

['W', 'X', 'Y', 'Z', 'new']

In [127]:
# column wise
df_data.sum(axis=0)

W       45
X       50
Y       55
Z       60
new    105
dtype: int64

In [118]:
# row wise
df_data.max(axis=1)

A     5
B    13
C    21
D    29
E    37
dtype: int64

In [128]:
df_multiIndex

               sound  weight
Animal name
cat    Mila     miau      20
bird   Bernd   chirp       2
dog    Walter   wuff      30
cat    Milu     miau      25
dog    Bello    wuff      30


In [129]:
df_multiIndex.nunique()

sound     3
weight    4
dtype: int64

In [130]:
df_multiIndex.groupby('Animal').sum()

        weight
Animal
bird         2
cat         45
dog         60


In [134]:
df_multiIndex.groupby('Animal').agg(['sum', 'max', 'count'])

           sound             weight
             sum   max count    sum max count
Animal
bird       chirp chirp     1      2   2     1
cat     miaumiau  miau     2     45  25     2
dog     wuffwuff  wuff     2     60  30     2
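Summing the `sound` strings just concatenates them, so it often makes sense to pick the columns and functions explicitly. The sketch below uses named aggregation, which assumes pandas 0.25 or newer; the output column names (`total_weight`, `heaviest`, `n_animals`) are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['cat', 'bird', 'dog', 'cat', 'dog'],
                   'name': ['Mila', 'Bernd', 'Walter', 'Milu', 'Bello'],
                   'sound': ['miau', 'chirp', 'wuff', 'miau', 'wuff'],
                   'weight': [20, 2, 30, 25, 30]})

# several functions on one numeric column
weight_stats = df.groupby('Animal')['weight'].agg(['sum', 'max', 'count'])

# named aggregation: new_column=(source_column, function)
summary = df.groupby('Animal').agg(
    total_weight=('weight', 'sum'),
    heaviest=('weight', 'max'),
    n_animals=('name', 'count'),
)
print(summary)
```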


In [135]:
df_data

    W   X   Y   Z  new
A   1   2   3   4    5
B   5   6   7   8   13
C   9  10  11  12   21
D  13  14  15  16   29
E  17  18  19  20   37


In [136]:
df_new = df_data[(df_data> 3) & (df_data < 16)]
df_new

      W     X     Y     Z   new
A   NaN   NaN   NaN   4.0   5.0
B   5.0   6.0   7.0   8.0  13.0
C   9.0  10.0  11.0  12.0   NaN
D  13.0  14.0  15.0   NaN   NaN
E   NaN   NaN   NaN   NaN   NaN


In [137]:
df_new.isnull()

       W      X      Y      Z    new
A   True   True   True  False  False
B  False  False  False  False  False
C  False  False  False  False   True
D  False  False  False   True   True
E   True   True   True   True   True


In [138]:
df_new.isnull().values.sum()

11

In [139]:
df_new.isnull().sum()

W      2
X      2
Y      2
Z      2
new    3
dtype: int64

In [143]:
df_new.isnull().sum(axis=1) / len(df_new.columns.tolist())

A    0.6
B    0.0
C    0.2
D    0.4
E    1.0
dtype: float64

In [144]:
df_new

      W     X     Y     Z   new
A   NaN   NaN   NaN   4.0   5.0
B   5.0   6.0   7.0   8.0  13.0
C   9.0  10.0  11.0  12.0   NaN
D  13.0  14.0  15.0   NaN   NaN
E   NaN   NaN   NaN   NaN   NaN


In [145]:
df_new.apply(pd.Series.nunique)

W      3
X      3
Y      3
Z      3
new    2
dtype: int64

In [146]:
df_new.dropna()

     W    X    Y    Z   new
B  5.0  6.0  7.0  8.0  13.0


In [147]:
df_new.dropna(axis=0)

     W    X    Y    Z   new
B  5.0  6.0  7.0  8.0  13.0


In [148]:
df_new.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [A, B, C, D, E]


In [149]:
df_new.dropna(subset=['new'])

     W    X    Y    Z   new
A  NaN  NaN  NaN  4.0   5.0
B  5.0  6.0  7.0  8.0  13.0


In [154]:
df_new.std()

W      4.000000
X      4.000000
Y      4.000000
Z      4.000000
new    5.656854
dtype: float64

In [150]:
df_new.mean()

W       9.0
X      10.0
Y      11.0
Z       8.0
new     9.0
dtype: float64

In [151]:
df_new.fillna(df_new.mean())

      W     X     Y     Z   new
A   9.0  10.0  11.0   4.0   5.0
B   5.0   6.0   7.0   8.0  13.0
C   9.0  10.0  11.0  12.0   9.0
D  13.0  14.0  15.0   8.0   9.0
E   9.0  10.0  11.0   8.0   9.0
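Between dropping every incomplete row and filling everything with the mean there are some middle options. The sketch below rebuilds `df_new` and tries `how='all'` and `thresh`, two standard `dropna` parameters; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame(np.arange(1, 21).reshape(5, 4),
                       index=list('ABCDE'), columns=list('WXYZ'))
df_data['new'] = df_data['W'] + df_data['Z']
df_new = df_data[(df_data > 3) & (df_data < 16)]

total_missing = df_new.isnull().sum().sum()  # missing cells in the whole frame

no_empty_rows = df_new.dropna(how='all')     # drop rows where every value is NaN
mostly_filled = df_new.dropna(thresh=4)      # keep rows with at least 4 real values

filled = df_new.fillna(df_new.mean())        # replace NaN by the column mean
print(total_missing, no_empty_rows.shape, mostly_filled.shape)
```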


In [None]:
df_new.loc['E','W'] = 10

In [None]:
df_new

In [None]:
pd.merge()

### Notes:

There are many more operations, such as joins (on indices), merges (on columns), concat, pivot, and agg.
Check the online pandas documentation for the full list.
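To give a flavour of the combining operations mentioned here, the following sketch uses two small invented tables, `pets` and `owners`. It shows a merge on a column, a join on the index, and a concat; it is only a starting point.

```python
import pandas as pd

pets = pd.DataFrame({'name': ['Mila', 'Bernd', 'Bello'],
                     'Animal': ['cat', 'bird', 'dog']})
owners = pd.DataFrame({'name': ['Mila', 'Bello', 'Rex'],
                       'owner': ['Anna', 'Tom', 'Kim']})

# merge on a shared column; the default keeps only names found in both tables
inner = pd.merge(pets, owners, on='name')

# how='left' keeps every pet and leaves the owner empty where none is known
left = pd.merge(pets, owners, on='name', how='left')

# join works on the index instead of a column (left join by default)
joined = pets.set_index('name').join(owners.set_index('name'))

# concat stacks frames on top of each other
stacked = pd.concat([pets, pets], ignore_index=True)

print(inner)
```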

In [156]:
# one last important function: one-hot encoding with get_dummies
df_cat = pd.DataFrame({'CarBrand':['Audi', 'Mercedes', 'BMW', 'Audi']})
df_cat

   CarBrand
0      Audi
1  Mercedes
2       BMW
3      Audi


In [157]:
pd.get_dummies(df_cat, prefix_sep='=')

   CarBrand=Audi  CarBrand=BMW  CarBrand=Mercedes
0              1             0                  0
1              0             0                  1
2              0             1                  0
3              1             0                  0


In [159]:
df = pd.DataFrame({'CarBrand':['Audi', 'Mercedes', 'BMW', 'Audi'] , 'PS':[190, 250, 220, 150]})
df

   CarBrand   PS
0      Audi  190
1  Mercedes  250
2       BMW  220
3      Audi  150


In [160]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
CarBrand    4 non-null object
PS          4 non-null int64
dtypes: int64(1), object(1)
memory usage: 136.0+ bytes


In [161]:
pd.get_dummies(df)

    PS  CarBrand_Audi  CarBrand_BMW  CarBrand_Mercedes
0  190              1             0                  0
1  250              0             0                  1
2  220              0             1                  0
3  150              1             0                  0
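Newer pandas releases may show the dummy columns as `True`/`False` rather than `1`/`0`. Passing `dtype=int` is one way to get integer columns regardless of version, and `drop_first=True` removes one redundant column per category. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'CarBrand': ['Audi', 'Mercedes', 'BMW', 'Audi'],
                   'PS': [190, 250, 220, 150]})

# encode only the named column and force integer 0/1 output
dummies = pd.get_dummies(df, columns=['CarBrand'], dtype=int)

# drop the first category; it is implied when all other dummies are 0
reduced = pd.get_dummies(df, columns=['CarBrand'], dtype=int, drop_first=True)

print(dummies.columns.tolist())
print(reduced.columns.tolist())
```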


In [162]:
df.select_dtypes(include=[object])

   CarBrand
0      Audi
1  Mercedes
2       BMW
3      Audi


In [163]:
df.select_dtypes(exclude=[object])

    PS
0  190
1  250
2  220
3  150
