# Pandas library 

Pandas is the main Python library for data analysis. It adds several data types for working with tabular data and sequences. It is built on top of NumPy, so the two libraries interoperate closely. For plotting, it has its own plotting module built on Matplotlib, although combining it with Seaborn is usually more powerful.



In [65]:
import numpy as np
import pandas as pd

### Pandas Series

Series are the most basic structures defined by pandas and are used to work with sequences of values. A Series behaves like a NumPy array combined with a dictionary: unlike a NumPy array, it can have axis labels, so it can be indexed by label rather than only by integer position. It also does not need to hold numeric data; it can hold arbitrary Python objects.
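
As a minimal sketch of the difference described above, the following compares positional access on a NumPy array with label access on a Series. The names `arr`, `ser` and `mixed` and their values are illustrative only.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only.
arr = np.array([10, 20, 30])

# A Series carries the same values plus a label for each one.
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(arr[1])        # access by position
print(ser["b"])      # access by label, like a dictionary key
print(ser.iloc[1])   # positional access is still available through .iloc

# A Series can also hold arbitrary Python objects; the dtype is then object.
mixed = pd.Series([1, "two", [3, 4]], index=["x", "y", "z"])
print(mixed.dtype)
```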

### Creating a series

In [66]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

In [69]:
# The three calls below build the same Series; only the last one is executed
#pd.Series(data=arr,index=labels) #from a numpy array
#pd.Series(data=d) #from a dictionary
pd.Series(data=my_list,index=labels) # from two lists, data and index

a    10
b    20
c    30
dtype: int64

The data does not have to be numeric; strings, for example, are also allowed.

In [70]:
pd.Series(labels)

0    a
1    b
2    c
dtype: object

### Functionality of Pandas Series

The index usually carries meaning: it determines how we will access the data later.

In [71]:
ser1 = pd.Series([10,20,30,40],index = ['Madrid', 'Barcelona','Sevilla', 'Valencia']) 
ser1

Madrid       10
Barcelona    20
Sevilla      30
Valencia     40
dtype: int64

In [72]:
ser1["Madrid"]

10

### Operations with Series

Addition is supported: values are added label by label, and a NaN is left wherever a label is missing from one of the two Series. Because NaN is a float, the result is converted to float64.

In [74]:
ser2 = pd.Series([10,20,20,40],index = ['Madrid', 'Cádiz','Sevilla', 'Barcelona'])     
ser2

Madrid       10
Cádiz        20
Sevilla      20
Barcelona    40
dtype: int64

In [75]:
ser1 + ser2

Barcelona    60.0
Cádiz         NaN
Madrid       20.0
Sevilla      50.0
Valencia      NaN
dtype: float64
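
If the NaN entries are not wanted, one possible approach is the `add` method with `fill_value`, which treats a label missing from one Series as the given value. This is a sketch that rebuilds `ser1` and `ser2` as defined above; `total` is an illustrative name.

```python
import pandas as pd

ser1 = pd.Series([10, 20, 30, 40], index=["Madrid", "Barcelona", "Sevilla", "Valencia"])
ser2 = pd.Series([10, 20, 20, 40], index=["Madrid", "Cádiz", "Sevilla", "Barcelona"])

# Treat a label that is missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

With this choice Cádiz and Valencia keep the value from the only Series in which they appear.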

## Pandas DataFrames

We can think of a DataFrame as a collection of Series objects that share the same index. DataFrames are inspired by the data frames of the R language and are the main object used in Python data analysis projects.

In [77]:
df = pd.DataFrame(data=np.random.randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657
B,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
D,-1.329175,-0.380292,0.355807,-0.472564
E,-0.599012,1.302905,-0.509144,0.976067


In a DataFrame we have columns in addition to the index, and columns is the main extra parameter in its definition.
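
To illustrate the idea of a DataFrame as several Series sharing one index, here is a small sketch that builds one from a dictionary of Series. The names `w`, `x` and `frame` and the values are made up for the example.

```python
import pandas as pd

index = ["A", "B", "C"]
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([4.0, 5.0, 6.0], index=index)

# Each dictionary key becomes a column name; the Series share the row index.
frame = pd.DataFrame({"W": w, "X": x})

print(frame)
print(type(frame["W"]))      # a single column is a Series
print(list(frame.index))     # row labels
print(list(frame.columns))   # column labels
```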

### Selection of data 

In [80]:
df['W']

A    0.538786
B   -2.202030
C   -1.303315
D   -1.329175
E   -0.599012
Name: W, dtype: float64

In [82]:
df[["W","Y"]]

Unnamed: 0,W,Y
A,0.538786,-2.974288
B,-2.20203,-0.060189
C,-1.303315,-0.554257
D,-1.329175,0.355807
E,-0.599012,-0.509144


Columns support the basic arithmetic operations. A new column is defined as in a dictionary, using its name as the key.

In [93]:
df['new'] = df['W'] + df['Y']

In [94]:
df

Unnamed: 0,W,X,Y,Z,new
A,0.538786,0.793401,-2.974288,-0.768657,-2.435502
B,-2.20203,-1.030051,-0.060189,0.488137,-2.262219
C,-1.303315,-0.115804,-0.554257,1.383544,-1.857572
D,-1.329175,-0.380292,0.355807,-0.472564,-0.973368
E,-0.599012,1.302905,-0.509144,0.976067,-1.108155


### Delete columns 

The axis keyword defines the direction in which an operation is applied. With drop, axis=0 refers to row labels and axis=1 to column labels.

In [95]:
df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657
B,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
D,-1.329175,-0.380292,0.355807,-0.472564
E,-0.599012,1.302905,-0.509144,0.976067


In [89]:
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657
B,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
D,-1.329175,-0.380292,0.355807,-0.472564
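
The axis keyword appears in many other methods as well. As a sketch with an illustrative two-row DataFrame called `small`, this is how it behaves in a reduction such as `sum`:

```python
import pandas as pd

small = pd.DataFrame({"W": [1, 2], "X": [10, 20]}, index=["A", "B"])

# axis=0 runs down the rows, giving one result per column.
col_sums = small.sum(axis=0)

# axis=1 runs across the columns, giving one result per row.
row_sums = small.sum(axis=1)

print(col_sums)
print(row_sums)
```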


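Another important keyword is inplace. By default drop returns a new DataFrame and leaves the original unchanged; with inplace=True the original is modified directly.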

In [16]:
df

Unnamed: 0,W,X,Y,Z,new
A,0.538786,0.793401,-2.974288,-0.768657,-2.435502
B,-2.20203,-1.030051,-0.060189,0.488137,-2.262219
C,-1.303315,-0.115804,-0.554257,1.383544,-1.857572
D,-1.329175,-0.380292,0.355807,-0.472564,-0.973368
E,-0.599012,1.302905,-0.509144,0.976067,-1.108155


In [96]:
df.drop('new',axis=1,inplace=True)
df

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657
B,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
D,-1.329175,-0.380292,0.355807,-0.472564
E,-0.599012,1.302905,-0.509144,0.976067
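
The effect of inplace can be checked on a small example. This is a sketch with an illustrative DataFrame named `frame`; it shows that the default call returns a modified copy, while the inplace call modifies the original and returns None.

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2], "new": [3, 4]}, index=["A", "B"])

# Without inplace, drop returns a modified copy and leaves frame untouched.
trimmed = frame.drop("new", axis=1)
still_there = "new" in frame.columns
print(still_there)

# With inplace=True the original is modified and the call returns None.
result = frame.drop("new", axis=1, inplace=True)
print(result)
print(list(frame.columns))
```

Reassigning the result, as in `frame = frame.drop("new", axis=1)`, is an equivalent alternative to inplace=True.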


### Data selection 

The .loc indexer allows us to select rows or columns by their index and column labels.

In [99]:
df.loc['A']

W    0.538786
X    0.793401
Y   -2.974288
Z   -0.768657
Name: A, dtype: float64

In [108]:
df.loc[:,"Y"]

A   -2.974288
B   -0.060189
C   -0.554257
D    0.355807
E   -0.509144
Name: Y, dtype: float64

Selection can also be done by the integer position of a row, using .iloc.

In [109]:
df.iloc[2]

W   -1.303315
X   -0.115804
Y   -0.554257
Z    1.383544
Name: C, dtype: float64

With .loc we can also select a single cell or a subset of rows and columns.

In [21]:
df.loc['B','Y']

-1.4704714721905507

In [22]:
df.loc[['A','B'],['W','Y']]

Unnamed: 0,W,Y
A,-1.301108,-1.035618
B,-1.718755,-1.470471
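
Both indexers also accept slices, with one difference worth keeping in mind: label slices in .loc include both endpoints, while position slices in .iloc exclude the stop, as in plain Python. A minimal sketch, using an illustrative DataFrame `frame` with integer values:

```python
import pandas as pd

frame = pd.DataFrame(
    {"W": [1, 2, 3], "X": [4, 5, 6], "Y": [7, 8, 9]},
    index=["A", "B", "C"],
)

by_label = frame.loc["A":"B", "W":"X"]   # label slices include both endpoints
by_position = frame.iloc[0:2, 0:2]       # position slices exclude the stop
single = frame.loc["B", "Y"]             # one row label and one column label

print(by_label)
print(by_position)
print(single)
```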


### Conditional selection

Another available feature is selection based on a condition. Given the following DataFrame:

In [110]:
df

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657
B,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
D,-1.329175,-0.380292,0.355807,-0.472564
E,-0.599012,1.302905,-0.509144,0.976067


In [111]:
df>0

Unnamed: 0,W,X,Y,Z
A,True,True,False,False
B,False,False,False,True
C,False,False,False,True
D,False,False,True,False
E,False,True,False,True


In [25]:
df[df>0]

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,,
B,,,,0.488137
C,,,,1.383544
D,,,0.355807,
E,,1.302905,,0.976067


The condition may also be based on a single column, in which case entire rows are selected.

In [112]:
df[df['W']>0]

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657


This can be combined with column selection:

In [113]:
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
A,-2.974288,0.793401
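
The same result can be obtained in a single step by passing the condition and the column list to .loc. This is a sketch on an illustrative DataFrame `frame`; the values are made up.

```python
import pandas as pd

frame = pd.DataFrame(
    {"W": [0.5, -2.2, -1.3], "X": [0.8, -1.0, -0.1], "Y": [-3.0, -0.1, -0.6]},
    index=["A", "B", "C"],
)

# Two chained selections: first filter the rows, then pick the columns.
chained = frame[frame["W"] > 0][["Y", "X"]]

# One .loc call: row condition first, column labels second.
single_step = frame.loc[frame["W"] > 0, ["Y", "X"]]

print(single_step)
print(chained.equals(single_step))
```

The single .loc call is often preferred when the selection is later used for assignment, since chained selections may operate on a temporary copy.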


To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [119]:
df

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657
B,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
D,-1.329175,-0.380292,0.355807,-0.472564
E,-0.599012,1.302905,-0.509144,0.976067


In [118]:
df[(df['W']<0) & (df['Y'] > 0)]

Unnamed: 0,W,X,Y,Z
D,-1.329175,-0.380292,0.355807,-0.472564
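
As a complementary sketch, the following uses | for "or" and ~ for negation on an illustrative DataFrame `frame`. It also shows that the Python keyword `or` does not work here: pandas raises a ValueError because the truth value of a whole Series is ambiguous.

```python
import pandas as pd

frame = pd.DataFrame(
    {"W": [0.5, -2.2, -1.3, -1.3], "Y": [-3.0, -0.1, -0.6, 0.4]},
    index=["A", "B", "C", "D"],
)

# Rows where at least one of the two conditions holds.
either = frame[(frame["W"] > 0) | (frame["Y"] > 0)]

# ~ negates a boolean mask: rows where neither condition holds.
neither = frame[~((frame["W"] > 0) | (frame["Y"] > 0))]

print(either)
print(neither)

# The keyword `or` tries to reduce each Series to a single bool and fails.
try:
    frame[(frame["W"] > 0) or (frame["Y"] > 0)]
except ValueError as err:
    print("ValueError:", err)
```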


### Index

DataFrames support hierarchical indexing. This means we can group our data using several index levels, and move columns into and out of the index.

In [122]:
df

Unnamed: 0,W,X,Y,Z
A,0.538786,0.793401,-2.974288,-0.768657
B,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
D,-1.329175,-0.380292,0.355807,-0.472564
E,-0.599012,1.302905,-0.509144,0.976067


In [121]:
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,0.538786,0.793401,-2.974288,-0.768657
1,B,-2.20203,-1.030051,-0.060189,0.488137
2,C,-1.303315,-0.115804,-0.554257,1.383544
3,D,-1.329175,-0.380292,0.355807,-0.472564
4,E,-0.599012,1.302905,-0.509144,0.976067


We can create a new column and set it as the index:

In [125]:
indice = 'Z X C V B'.split()

In [126]:
df['new_ind'] = indice

In [127]:
df

Unnamed: 0,W,X,Y,Z,new_ind
A,0.538786,0.793401,-2.974288,-0.768657,Z
B,-2.20203,-1.030051,-0.060189,0.488137,X
C,-1.303315,-0.115804,-0.554257,1.383544,C
D,-1.329175,-0.380292,0.355807,-0.472564,V
E,-0.599012,1.302905,-0.509144,0.976067,B


In [128]:
df.set_index('new_ind',inplace=True)

In [131]:
df.index.name = "nuevo"
df

nuevo,W,X,Y,Z
Z,0.538786,0.793401,-2.974288,-0.768657
X,-2.20203,-1.030051,-0.060189,0.488137
C,-1.303315,-0.115804,-0.554257,1.383544
V,-1.329175,-0.380292,0.355807,-0.472564
B,-0.599012,1.302905,-0.509144,0.976067
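
One detail to be aware of is that set_index replaces the existing index, so the previous row labels are discarded unless they were first moved into a column. A small sketch of the round trip, with an illustrative DataFrame `frame` and column name `key`:

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2, 3], "key": ["Z", "X", "C"]}, index=["A", "B", "C"])

# "key" leaves the columns and becomes the index; the labels A, B, C are dropped.
indexed = frame.set_index("key")

# The index goes back to being a column and a default integer index is created.
restored = indexed.reset_index()

print(indexed)
print(restored)
```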


A hierarchical index (MultiIndex) can also be built directly, for example from a list of tuples:

In [36]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [132]:
df_2 = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df_2

Group,Num,A,B
G1,1,-0.183721,0.94795
G1,2,1.013282,-0.15879
G1,3,1.034599,-0.61791
G2,1,0.091992,0.749739
G2,2,-0.382808,-1.31798
G2,3,-0.263826,0.542578


set_index also lets us build a hierarchical index from existing columns, by passing a list of two column names:

In [38]:
df.reset_index().set_index(["W","X"])

W,X,new_ind,Y,Z
-1.301108,-0.015679,Z,-1.035618,-0.303252
-1.718755,0.531014,X,-1.470471,-0.039898
-0.317106,-1.837657,C,1.042565,0.240121
0.752255,-1.055463,V,-0.82166,0.922159
-0.390417,-0.095208,B,-0.910883,-0.160497


Selecting on a MultiIndex lets us pick out a whole group directly:

In [135]:
df_2.loc['G1']

Num,A,B
1,-0.183721,0.94795
2,1.013282,-0.15879
3,1.034599,-0.61791


In [136]:
df_2.loc['G1'].loc[1]

A   -0.183721
B    0.947950
Name: 1, dtype: float64
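
Instead of chaining two .loc calls, a tuple with one label per index level can be passed to .loc. This sketch rebuilds a DataFrame with the same MultiIndex as above; `frame` is an illustrative name and the values are consecutive integers rather than random numbers, so the results are easy to check.

```python
import numpy as np
import pandas as pd

outside = ["G1", "G1", "G1", "G2", "G2", "G2"]
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))
frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=hier_index, columns=["A", "B"])

chained = frame.loc["G1"].loc[1]      # two steps: outer level, then inner level
direct = frame.loc[("G1", 1), :]      # one step: a tuple with both levels
cell = frame.loc[("G2", 3), "B"]      # a tuple for the row plus a column label

print(direct)
print(cell)
```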

In [137]:
df_2.index.names = ['Group','Num']

In [141]:
df_2.columns = ["C","D"]

In [143]:
df_2

Group,Num,C,D
G1,1,-0.183721,0.94795
G1,2,1.013282,-0.15879
G1,3,1.034599,-0.61791
G2,1,0.091992,0.749739
G2,2,-0.382808,-1.31798
G2,3,-0.263826,0.542578


In [145]:
df_2.xs(('G1',2))

C    1.013282
D   -0.158790
Name: (G1, 2), dtype: float64

In [146]:
df_2.xs(1,level='Num')

Group,C,D
G1,-0.183721,0.94795
G2,0.091992,0.749739


# Missing Data

Here are a few convenient methods for dealing with missing data in pandas: fillna and dropna.

In [147]:
df_3 = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [148]:
df_3

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [149]:
df_3.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [48]:
df_3.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


In [49]:
df_3.fillna(value='FILL VALUE')

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,FILL VALUE,2
2,FILL VALUE,FILL VALUE,3
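
In practice the fill value is often computed from the data. As a sketch under that assumption, the following fills column A with its own mean and uses the `thresh` parameter of dropna to keep rows with at least two non-missing values. The names `filled_a` and `at_least_two` are illustrative.

```python
import numpy as np
import pandas as pd

df_3 = pd.DataFrame({"A": [1, 2, np.nan], "B": [5, np.nan, np.nan], "C": [1, 2, 3]})

# Replace missing values in column A with that column's mean (NaN is skipped).
filled_a = df_3["A"].fillna(df_3["A"].mean())

# Keep only the rows that have at least two non-missing values.
at_least_two = df_3.dropna(thresh=2)

print(filled_a)
print(at_least_two)
```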


### Exercises

We reuse the Titanic dataset; .head() lets us inspect the first rows of the DataFrame.

In [55]:
import seaborn as sns

In [56]:
titanic = sns.load_dataset("titanic")
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Select the values where age is greater than 25. Then select the fare column and calculate the mean. Use .mean() to calculate this value.

Calculate the number of survivors (survived == 1) of the Titanic, and the corresponding percentage. len(df), which returns the number of rows, may be useful.

Create a new DataFrame that includes only the sex and fare columns. Then calculate the mean fare for men and women.