# Pandas Demo

- Pandas is built on top of NumPy and provides the DataFrame
- DataFrames are **multidimensional arrays with attached row and column labels**
- they often contain heterogeneous types and missing data
- Pandas implements a number of data operations familiar to users of databases and spreadsheets
- it is more flexible than NumPy and supports operations that do not map element-wise (grouping, pivoting, etc.)
- Pandas gives efficient access to the so-called *data munging* tasks that occupy much of a data scientist's time
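
As a small illustration of the points above, the following sketch builds a DataFrame with mixed column types and one missing value. The column names and values are made up for this example.

```python
import numpy as np
import pandas as pd

# hypothetical example data: one text column, one numeric column with a gap
pets = pd.DataFrame(
    {"animal": ["cat", "bird", "dog"], "weight": [4.5, np.nan, 30.0]},
    index=["a", "b", "c"],
)

# each column keeps its own dtype
print(pets.dtypes)

# missing values are skipped by default in aggregations such as mean()
mean_weight = pets["weight"].mean()
n_missing = pets["weight"].isna().sum()
print(mean_weight, n_missing)
```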


## Pandas Tutorial
0. Recap of NumPy Arrays
1. Pandas Series
2. Creating DataFrames
3. Accessing DataFrames
4. Operations on DataFrames
5. Saving and Loading DataFrames

## 1. Pandas Series

- A Pandas Series is similar to a one-dimensional NumPy array, but its entries carry named indices.

In [5]:
import pandas as pd
import numpy as np

### Creating Pandas Series

In [6]:
my_list = [1,2,3,4]
my_arr = np.array(my_list)
my_dict = {'a':1, 'c':2, 'b':3, 'd':4}
index = ['A', 'B', 'C', 'D']

In [7]:
pd.Series(my_list)

0    1
1    2
2    3
3    4
dtype: int64

In [8]:
pd.Series(data = my_arr, index=index)

A    1
B    2
C    3
D    4
dtype: int64

In [9]:
pd.Series(my_dict)

a    1
b    3
c    2
d    4
dtype: int64

### Accessing Elements

In [10]:
my_series = pd.Series(my_dict)
my_series

a    1
b    3
c    2
d    4
dtype: int64

In [11]:
# get the indices
my_series.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [12]:
# get the values
my_series.values

array([1, 3, 2, 4])

In [13]:
# get one element by its integer position (recent pandas versions prefer my_series.iloc[0] for this)
my_series[0]

1

In [14]:
# access via the explicitly defined named index
my_series['a']

1

In [15]:
# you can pass a list of indices
my_series[['a','c']]

a    1
c    2
dtype: int64

In [16]:
# a list of integer positions works as well
my_series[[0,2]]

a    1
c    2
dtype: int64

In [17]:
# you can use slicing like in numpy, but a label slice includes the end label
my_series['a':'c']

a    1
b    3
c    2
dtype: int64
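
The difference between label slices and positional slices is easy to miss, so here is a minimal sketch. It rebuilds a small Series (the name `s` is only used for this example) and slices it once by label with `.loc` and once by position with `.iloc`.

```python
import pandas as pd

s = pd.Series([1, 3, 2, 4], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]      # label slice: includes the end label 'c'
by_position = s.iloc[0:2]      # positional slice: excludes position 2

print(by_label.tolist())       # [1, 3, 2]
print(by_position.tolist())    # [1, 3]
```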

In [18]:
# conditions lead to a boolean series
my_series > 2

a    False
b     True
c    False
d     True
dtype: bool

In [19]:
# use conditional indexing
my_series[my_series > 2]

b    3
d    4
dtype: int64

In [20]:
# get the type
type(my_series)

pandas.core.series.Series

### Some Functions

In [21]:
# get the index
my_series.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [22]:
# if no index has been defined
pd.Series(my_arr).index

RangeIndex(start=0, stop=4, step=1)

In [23]:
%%timeit
# square
my_series**2

65.1 µs ± 755 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [24]:
def squareElement(x):
    return x**2

In [25]:
%%timeit
my_series.apply(squareElement)

67.8 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
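
With only four elements, both timings above are dominated by fixed overhead. The sketch below repeats the comparison on a larger Series using the standard `timeit` module; the size of 10,000 elements and the helper name `square_element` are arbitrary choices for this example. On most machines the vectorized operator should be clearly faster, but the exact numbers depend on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

big = pd.Series(np.arange(10_000))

def square_element(x):
    return x ** 2

# both variants compute the same values
vectorized = big ** 2
applied = big.apply(square_element)
same = bool((vectorized == applied).all())

# time 20 repetitions of each variant
t_vec = timeit.timeit(lambda: big ** 2, number=20)
t_apply = timeit.timeit(lambda: big.apply(square_element), number=20)
print(f"same result: {same}, vectorized: {t_vec:.4f} s, apply: {t_apply:.4f} s")
```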


In [26]:
# create a new series
my_series = pd.Series(np.arange(2,22,2))
my_series

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int64

In [27]:
# modulo operation
my_series % 3

0    2
1    1
2    0
3    2
4    1
5    0
6    2
7    1
8    0
9    2
dtype: int64

In [28]:
# does not work: the lambda runs Python's division on each element, which raises
my_series.apply(lambda x: x / 0)

ZeroDivisionError: division by zero

In [29]:
# works, why? Can you think of a situation where this is useful?
my_series / 0

0    inf
1    inf
2    inf
3    inf
4    inf
5    inf
6    inf
7    inf
8    inf
9    inf
dtype: float64
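
One way to think about the question above: the `/` operator is evaluated on the whole array with floating point semantics, where division by zero yields `inf`, `-inf` or `nan` instead of raising. This keeps a long computation running, and the special values can be detected or replaced afterwards. The following is a minimal sketch with made-up values.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 0, -4])

# whole-array division: no exception, special float values instead
ratio = s / 0
print(ratio.tolist())          # [inf, nan, -inf]

# find the problematic entries and turn them into missing values
is_infinite = np.isinf(ratio)
clean = ratio.replace([np.inf, -np.inf], np.nan)
print(is_infinite.tolist(), clean.isna().tolist())
```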

## 3. Pandas DataFrames
- a DataFrame is a **generalization of a two-dimensional array with named indices and columns**
- a DataFrame can be seen as a collection of Pandas Series that share one index
- selecting a single row or a single column returns a Pandas Series
- we can create DataFrames in several different ways
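
To make the "collection of Series" view concrete, here is a small sketch that builds a DataFrame from two Series. The Series names and values are invented for this example; note how the shorter Series is aligned on the index and the missing entry becomes `NaN`.

```python
import pandas as pd

weight = pd.Series([20, 2, 30], index=["cat", "bird", "dog"])
legs = pd.Series([4, 2], index=["cat", "bird"])   # no entry for 'dog'

# the Series are aligned on their index labels
animals = pd.DataFrame({"weight": weight, "legs": legs})
print(animals)

# a single column and a single row both come back as a Series
col = animals["weight"]
row = animals.loc["cat"]
print(type(col), type(row))
```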

In [30]:
# use just a numpy array -> integer indices and columns
pd.DataFrame(data=np.arange(1,21,1).reshape(5,4))

    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
4  17  18  19  20


In [31]:
# adding names
df_data = pd.DataFrame(data=np.arange(1,21,1).reshape(5,4),
                       index=['A', 'B', 'C', 'D', 'E'],
                       columns=['W', 'X', 'Y', 'Z'])
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [32]:
# use a dict
pd.DataFrame({'Animal': ['cat', 'bird', 'dog'], 'weight': [20,2,30]})

   Animal  weight
0     cat      20
1    bird       2
2     dog      30


In [33]:
# use a column as the index (the row order is kept, not sorted)
test = pd.DataFrame({'Animal': ['cat', 'bird', 'dog'], 'weight': [20,2,30]}).set_index('Animal')
test

        weight
Animal
cat         20
bird         2
dog         30


In [34]:
# get the index
test.index

Index(['cat', 'bird', 'dog'], dtype='object', name='Animal')

In [35]:
# create a larger dataframe from a dict
df = pd.DataFrame({'Animal': ['cat', 'bird', 'dog', 'cat', 'dog'], 'weight': [20,2,30, 25, 30],
              'sound': ['miau','chirp', 'wuff', 'miau', 'wuff'],
             'name': ['Mila', 'Bernd', 'Walter', 'Milu', 'Bello']})

df

   Animal    name  sound  weight
0     cat    Mila   miau      20
1    bird   Bernd  chirp       2
2     dog  Walter   wuff      30
3     cat    Milu   miau      25
4     dog   Bello   wuff      30


In [36]:
# info() is usually one of the first methods to call: it lists columns, non-null counts and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
Animal    5 non-null object
name      5 non-null object
sound     5 non-null object
weight    5 non-null int64
dtypes: int64(1), object(3)
memory usage: 240.0+ bytes


## 4. Accessing Elements

In [37]:
# reprint the df from above
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [38]:
# extract one column
df_data['W']

A     1
B     5
C     9
D    13
E    17
Name: W, dtype: int64

In [39]:
# the column is a series
type(df_data['W'])

pandas.core.series.Series

In [40]:
# access a row using loc
# guiding principle: use explicit indexing via loc or iloc --> code is more readable and prevents subtle bugs
df_data.loc['A']

W    1
X    2
Y    3
Z    4
Name: A, dtype: int64

In [41]:
type(df_data.loc['A'])

pandas.core.series.Series

In [42]:
# extract multiple columns with a list
df_data[['W', 'Z']]

    W   Z
A   1   4
B   5   8
C   9  12
D  13  16
E  17  20


In [43]:
# or multiple indices
df_data.loc[['B', 'C']]

   W   X   Y   Z
B  5   6   7   8
C  9  10  11  12


In [44]:
# not so good: chained indexing does two separate lookups (row first, then column)
df_data.loc['B']['W']

5

In [45]:
# better: one lookup with row and column label
df_data.loc['B', 'W']

5
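
The single-step form matters most when values are assigned. The sketch below rebuilds the demo table under the assumed name `df_demo` so that it runs on its own, reads one cell, and then writes to it through a single `.loc` call. The scalar accessor `.at` is shown as an alternative for single cells.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    np.arange(1, 21).reshape(5, 4),
    index=list("ABCDE"),
    columns=list("WXYZ"),
)

# one lookup with row label and column label
value = df_demo.loc["B", "W"]

# assignment through a single .loc call modifies df_demo itself
df_demo.loc["B", "W"] = 50

# .at is a scalar-only accessor for the same kind of lookup
print(value, df_demo.at["B", "W"])
```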

In [46]:
# get the full df again
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [47]:
# conditional indexing
df_data[df_data > 10]

      W     X     Y     Z
A   NaN   NaN   NaN   NaN
B   NaN   NaN   NaN   NaN
C   NaN   NaN  11.0  12.0
D  13.0  14.0  15.0  16.0
E  17.0  18.0  19.0  20.0


In [48]:
# conditional indexing on a column
df_data['X'] > 10

A    False
B    False
C    False
D     True
E     True
Name: X, dtype: bool

In [49]:
# extract only the rows where the entries in column X are larger than 10
df_data[df_data['X'] > 10]

    W   X   Y   Z
D  13  14  15  16
E  17  18  19  20


In [50]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [51]:
# use multiple conditions with & (instead of and) and | (instead of or)
df_data[(df_data['X'] > 10) & (df_data['Y'] < 19)]

    W   X   Y   Z
D  13  14  15  16
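
The other element-wise operators work the same way. The following sketch, again on a rebuilt copy of the table under the assumed name `df_demo`, combines two conditions with `|`, negates a condition with `~`, and shows that the Python keyword `and` raises a `ValueError` on boolean Series. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    np.arange(1, 21).reshape(5, 4),
    index=list("ABCDE"),
    columns=list("WXYZ"),
)

# rows where X is small OR Y is large
either = df_demo[(df_demo["X"] < 5) | (df_demo["Y"] > 16)]
print(either.index.tolist())    # ['A', 'E']

# rows where X is NOT larger than 10
negated = df_demo[~(df_demo["X"] > 10)]
print(negated.index.tolist())   # ['A', 'B', 'C']

# the keyword 'and' asks for a single truth value, which a Series cannot give
try:
    df_demo[(df_demo["X"] > 10) and (df_demo["Y"] < 19)]
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```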


In [52]:
# use the implicitly defined integer index
df_data.iloc[0]

W    1
X    2
Y    3
Z    4
Name: A, dtype: int64

In [53]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [54]:
df_data.loc['A'] > 3

W    False
X    False
Y    False
Z     True
Name: A, dtype: bool

In [55]:
# slicing: get all rows, but only the columns where the entry in row A is larger than 3
df_data.loc[:,df_data.loc['A']>3]

    Z
A   4
B   8
C  12
D  16
E  20


In [56]:
df_data.iloc[1,1]

6

In [57]:
df_data.loc['B', 'X']

6

## 5. Operations on DataFrames
Here, we list some useful operations for Pandas DataFrames.

In [58]:
df_data

    W   X   Y   Z
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16
E  17  18  19  20


In [59]:
# add columns
df_data['new'] = df_data['W'] + df_data['Z']

In [60]:
df_data

    W   X   Y   Z  new
A   1   2   3   4    5
B   5   6   7   8   13
C   9  10  11  12   21
D  13  14  15  16   29
E  17  18  19  20   37


In [61]:
# summary statistics
df_data.describe()

              W         X         Y         Z        new
count       5.0       5.0       5.0       5.0        5.0
mean        9.0      10.0      11.0      12.0       21.0
std    6.324555  6.324555  6.324555  6.324555  12.649111
min         1.0       2.0       3.0       4.0        5.0
25%         5.0       6.0       7.0       8.0       13.0
50%         9.0      10.0      11.0      12.0       21.0
75%        13.0      14.0      15.0      16.0       29.0
max        17.0      18.0      19.0      20.0       37.0


In [62]:
# use custom functions
df_data.apply(lambda element: element**2)

     W    X    Y    Z   new
A    1    4    9   16    25
B   25   36   49   64   169
C   81  100  121  144   441
D  169  196  225  256   841
E  289  324  361  400  1369
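
Note that `DataFrame.apply` hands the function a whole column at a time by default (a Series), not a single element; squaring works above because a Series supports `**`. With `axis=1` the function receives one row at a time instead. A minimal sketch on a rebuilt copy of the table (the name `df_demo` is an assumption of this example):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    np.arange(1, 21).reshape(5, 4),
    index=list("ABCDE"),
    columns=list("WXYZ"),
)

# default axis=0: the function is called once per column
col_range = df_demo.apply(lambda col: col.max() - col.min())
print(col_range.tolist())   # [16, 16, 16, 16]

# axis=1: the function is called once per row
row_range = df_demo.apply(lambda row: row.max() - row.min(), axis=1)
print(row_range.tolist())   # [3, 3, 3, 3, 3]
```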


In [63]:
df_data

    W   X   Y   Z  new
A   1   2   3   4    5
B   5   6   7   8   13
C   9  10  11  12   21
D  13  14  15  16   29
E  17  18  19  20   37


In [64]:
df_data.columns.tolist()

['W', 'X', 'Y', 'Z', 'new']

In [65]:
# column wise sum
df_data.sum(axis=0)

W       45
X       50
Y       55
Z       60
new    105
dtype: int64

In [66]:
# row wise max
df_data.max(axis=1)

A     5
B    13
C    21
D    29
E    37
dtype: int64

In [67]:
df = pd.DataFrame({'Animal': ['cat', 'bird', 'dog', 'cat', 'dog'], 'weight': [20,2,30, 25, 30],
              'sound': ['miau','chirp', 'wuff', 'miau', 'wuff'],
             'name': ['Mila', 'Bernd', 'Walter', 'Milu', 'Bello']})

df

   Animal    name  sound  weight
0     cat    Mila   miau      20
1    bird   Bernd  chirp       2
2     dog  Walter   wuff      30
3     cat    Milu   miau      25
4     dog   Bello   wuff      30


In [68]:
# use multiple aggregations at once
df.groupby('Animal').agg(['sum', 'max', 'count'])

               name                    sound                weight
                sum     max  count       sum    max  count     sum  max  count
Animal
bird          Bernd   Bernd      1     chirp  chirp      1       2    2      1
cat        MilaMilu    Milu      2  miaumiau   miau      2      45   25      2
dog     WalterBello  Walter      2  wuffwuff   wuff      2      60   30      2


In [69]:
# map different aggregations to different columns
df.groupby('Animal').agg({'sound':'sum', 'weight': 'count'})

           sound  weight
Animal
bird       chirp       1
cat     miaumiau       2
dog     wuffwuff       2
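
In the result above, the column `weight` actually holds a count, which can be confusing. Newer pandas versions (0.25 and later, as far as I know) also accept named aggregations, where each output column gets its own name. The sketch below rebuilds the animal table as `df_pets`; the output column names are chosen freely for this example.

```python
import pandas as pd

df_pets = pd.DataFrame({
    "Animal": ["cat", "bird", "dog", "cat", "dog"],
    "weight": [20, 2, 30, 25, 30],
    "name": ["Mila", "Bernd", "Walter", "Milu", "Bello"],
})

# output_column=(input_column, aggregation)
summary = df_pets.groupby("Animal").agg(
    total_weight=("weight", "sum"),
    max_weight=("weight", "max"),
    n_animals=("name", "count"),
)
print(summary)
```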


In [71]:
# compute std
df_data.std()

W       6.324555
X       6.324555
Y       6.324555
Z       6.324555
new    12.649111
dtype: float64

In [72]:
# compute mean
df_data.mean()

W       9.0
X      10.0
Y      11.0
Z      12.0
new    21.0
dtype: float64

### Notes:

There are many more operations, such as joins (on indices), merges (on columns), concat, pivot, and agg.
Check the online documentation for more functions. You will also learn some new operations during the exercise.
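
As a first impression of three of the operations named above, here is a short sketch using `merge`, `pd.concat` and `pivot_table`. All tables, column names and values are made up for this example and are not part of the exercise data.

```python
import pandas as pd

pets = pd.DataFrame({"name": ["Mila", "Bernd", "Walter"],
                     "Animal": ["cat", "bird", "dog"]})
sounds = pd.DataFrame({"Animal": ["cat", "bird", "dog"],
                       "sound": ["miau", "chirp", "wuff"]})

# merge: combine two tables on a shared column
merged = pets.merge(sounds, on="Animal", how="left")
print(merged)

# concat: stack tables with the same columns on top of each other
more_pets = pd.DataFrame({"name": ["Milu"], "Animal": ["cat"]})
stacked = pd.concat([pets, more_pets], ignore_index=True)
print(stacked)

# pivot_table: reshape long data into a wide table (default aggregation: mean)
weights = pd.DataFrame({"Animal": ["cat", "cat", "dog"],
                        "year": [2020, 2021, 2020],
                        "weight": [20, 25, 30]})
wide = weights.pivot_table(index="Animal", columns="year", values="weight")
print(wide)
```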