# Pandas

> References:

> 1. [Top 25 pandas tricks (Youtube)](https://www.youtube.com/watch?v=RlIiVeig3hc)
> 2. [Pandas in a Nutshell (Kaggle)](https://www.kaggle.com/salaheddinhetalani/pandas-in-a-nutshell)
> 3. [Mastery of Pandas - I (Medium)](https://medium.com/swlh/the-mastery-of-pandas-i-50156db42125)
> 4. [Pandas Pipe Function](https://towardsdatascience.com/using-pandas-pipe-function-to-improve-code-readability-96d66abfaf8)

![pandas-common-methods.png](images/pandas-common-methods.png)

In [3]:
import pandas as pd
import numpy as np
import re 
import random 

# Reading & Creating Data

* From a dictionary (see the API lecture)
* From CSV
* From SQL (see the SQLAlchemy lecture)
* From MongoDB (see the pymongo lecture)

```
df2 = pd.read_csv("data.csv", sep = "|", names = ["Name", "Surname", "Height", "Weight"])
```
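
As a small self-contained sketch of the first two sources in the list above, the snippet below builds a frame from a dictionary and parses pipe-separated text with `read_csv`. The names and numbers are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# From a dictionary: keys become column names, values become the columns
df_dict = pd.DataFrame({"Name": ["Ann", "Bob"], "Height": [165, 180]})

# From CSV text without a header row: supply the column names via `names`
raw = "Ann|Smith|165|60\nBob|Jones|180|75\n"  # hypothetical data
df_csv = pd.read_csv(
    io.StringIO(raw),  # stand-in for a path such as "data.csv"
    sep="|",
    names=["Name", "Surname", "Height", "Weight"],
)
print(df_csv.dtypes)
```

Without `names`, `read_csv` would treat the first data row as the header, which is rarely what you want for header-less files.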

![pandas-read.png](images/pandas-read.png)

In [4]:
df = pd.read_csv('./data/titanic_train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [220]:
# quick way to create a data frame
df_toy = pd.DataFrame(np.random.rand(8, 10), index=range(1,9), columns=list('abcdefghij'))
df_toy

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
1,0.929612,0.556072,0.885567,0.106549,0.170706,0.539939,0.881222,0.391566,0.495007,0.855888
2,0.204568,0.409444,0.735698,0.005126,0.713374,0.716749,0.185039,0.883164,0.892307,0.958264
3,0.612601,0.801677,0.896813,0.47025,0.69383,0.884332,0.014334,0.931482,0.282617,0.059278
4,0.431341,0.54078,0.908973,0.68118,0.11404,0.655396,0.822682,0.3818,0.376231,0.674356
5,0.254098,0.473552,0.603327,0.605616,0.763005,0.751652,0.291799,0.426566,0.004717,0.497453
6,0.94208,0.301579,0.774311,0.555649,0.272629,0.644733,0.448786,0.918053,0.843528,0.945848
7,0.326762,0.430574,0.971986,0.884212,0.922727,0.248512,0.58265,0.052155,0.45954,0.253592
8,0.23941,0.260032,0.463758,0.344647,0.540856,0.197043,0.86514,0.038047,0.744995,0.929759


# Data Exploration
* `.info()`: general information about the dataset: column names, non-null counts, dtypes, and memory usage
* `.describe()`: common summary statistics for the numeric columns
    - `include=['object','float64','int64']`: also summarise the categorical (object) columns
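
A minimal sketch of the `include` option on a made-up two-column frame (the `toy` frame and its values are assumptions for illustration). Passing `include="all"` is a shorthand for listing every dtype.

```python
import pandas as pd

toy = pd.DataFrame({"city": ["A", "B", "A"], "temp": [20.5, 18.0, 22.0]})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # numeric and object columns together
print(everything)
```

For object columns the summary reports `count`, `unique`, `top`, and `freq`; the numeric statistics are `NaN` there, and vice versa.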

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [8]:
df.describe(include=['object','float64','int64'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Nirva, Mr. Iisakki Antino Aijo",male,,,,347082.0,,C23 C25 C27,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [45]:
df.select_dtypes(include='object').head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S


In [49]:
df.Age.astype('object')

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: object

## Slicing

In [191]:
df.set_index('Name', inplace=True)

### Records (Rows)

In [192]:
df[0:5] ## first 5 rows

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,old_women
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,False
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,False
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,False
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1,C123,S,False
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.05,,S,False


In [193]:
df.loc['Braund, Mr. Owen Harris'] # access by a specific index label

PassengerId            1
Survived               0
Pclass                 3
Sex                 male
Age                 22.0
SibSp                  1
Parch                  0
Ticket         A/5 21171
Fare                7.25
Cabin                NaN
Embarked               S
old_women          False
Name: Braund, Mr. Owen Harris, dtype: object

In [194]:
df.iloc[0] # iloc selects by integer position

PassengerId            1
Survived               0
Pclass                 3
Sex                 male
Age                 22.0
SibSp                  1
Parch                  0
Ticket         A/5 21171
Fare                7.25
Cabin                NaN
Embarked               S
old_women          False
Name: Braund, Mr. Owen Harris, dtype: object

### Columns

In [195]:
df['Sex'].iloc[0:5]

Name
Braund, Mr. Owen Harris                                  male
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    female
Heikkinen, Miss. Laina                                 female
Futrelle, Mrs. Jacques Heath (Lily May Peel)           female
Allen, Mr. William Henry                                 male
Name: Sex, dtype: object

In [198]:
df.loc[['Braund, Mr. Owen Harris'],['Sex','Survived']]

Name,Sex,Survived
"Braund, Mr. Owen Harris",male,0


In [199]:
df.loc['Braund, Mr. Owen Harris',['Survived']]

Survived    0
Name: Braund, Mr. Owen Harris, dtype: object

In [202]:
## only want to access the scalar value
df.loc['Braund, Mr. Owen Harris',['Survived']].squeeze()

0
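
As an alternative sketch for pulling out a single cell: passing a plain column label (rather than a one-element list) to `.loc` returns the scalar directly, and `.at` is the accessor intended for single-cell lookups. The tiny `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Survived": [0, 1]}, index=["Braund", "Cumings"])

as_series = toy.loc["Braund", ["Survived"]]  # list selector -> Series of length 1
as_scalar = toy.loc["Braund", "Survived"]    # plain label -> scalar
fast_scalar = toy.at["Braund", "Survived"]   # .at is meant for single-cell access
```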

## Filter

In [234]:
df.Cabin.value_counts().nlargest(5)

C23 C25 C27    4
G6             4
B96 B98        4
C22 C26        3
E101           3
Name: Cabin, dtype: int64

In [235]:
df.Cabin.value_counts().nlargest(5).index

Index(['C23 C25 C27', 'G6', 'B96 B98', 'C22 C26', 'E101'], dtype='object')

In [237]:
df.loc[df.Cabin.isin(df.Cabin.value_counts().nlargest(5).index)].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
123,124,1,2,"Webber, Miss. Susan",female,32.5,0,0,27267,13.0,E101,S
205,206,0,3,"Strom, Miss. Telma Matilda",female,2.0,0,1,347054,10.4625,G6,S


## Sampling into random subsets

In [225]:
df_subset1 = df.sample(frac=0.75, random_state=1234)

In [226]:
df_subset1.shape

(668, 12)

In [227]:
df_subset2 = df.drop(df_subset1.index)

In [228]:
df_subset2.shape 

(223, 12)
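
The sample-then-drop pattern above yields two disjoint subsets as long as the index labels are unique. A small sketch on a made-up ten-row frame:

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

part_a = toy.sample(frac=0.7, random_state=0)  # 7 of the 10 rows, reproducible
part_b = toy.drop(part_a.index)                # the remaining 3 rows

# Every row lands in exactly one of the two parts
assert set(part_a.index).isdisjoint(part_b.index)
```

If the index contains duplicate labels, `drop` removes every row carrying a sampled label, so resetting the index first is the safer route.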

# Data Manipulation

## Structure: Indexes, Columns, Drops, Rename

In [24]:
df.index

Index(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)',
       ...
       'Markun, Mr. Johann', 'Dahlberg, Miss. Gerda Ulrika',
       'Banfield, Mr. Frederick James', 'Sutehall, Mr. Henry Jr',
       'Rice, Mrs. William (Margaret Norton)', 'Montvila, Rev. Juozas',
       'Graham, Miss. Margaret Edith',
       'Johnston, Miss. Catherine Helen "Carrie"', 'Behr, Mr. Karl Howell',
       'Dooley, Mr. Patrick'],
      dtype='object', name='Name', length=891)

In [26]:
df.reset_index(inplace=True)
df 

Unnamed: 0,Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
4,"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
887,"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
888,"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
889,"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C


In [28]:
df0 = df.drop(['PassengerId'], axis=1)
df0.head()

Unnamed: 0,Name,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,"Braund, Mr. Owen Harris",0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,"Heikkinen, Miss. Laina",1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,1,female,35.0,1,0,113803,53.1,C123,S
4,"Allen, Mr. William Henry",0,3,male,35.0,0,0,373450,8.05,,S


In [29]:
df0 = df.drop([0], axis=0)
df0.head()

Unnamed: 0,Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.05,,S
5,"Moran, Mr. James",6,0,3,male,,0,0,330877,8.4583,,Q


In [40]:
df_toy.rename(columns={'a':'A'}).head(3)

Unnamed: 0,A,b,c,d,e,f,g,h,i,j
1,0.258436,0.423116,0.287774,0.964585,0.385787,0.916427,0.79857,0.973122,0.906834,0.564802
2,0.096466,0.96327,0.880865,0.791004,0.541368,0.165742,0.638345,0.722718,0.458387,0.781024
3,0.233432,0.563873,0.543492,0.624769,0.309379,0.25551,0.48116,0.112016,0.978358,0.907759


In [221]:
df_toy.columns = df_toy.columns.str.upper()
df_toy.head()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
1,0.929612,0.556072,0.885567,0.106549,0.170706,0.539939,0.881222,0.391566,0.495007,0.855888
2,0.204568,0.409444,0.735698,0.005126,0.713374,0.716749,0.185039,0.883164,0.892307,0.958264
3,0.612601,0.801677,0.896813,0.47025,0.69383,0.884332,0.014334,0.931482,0.282617,0.059278
4,0.431341,0.54078,0.908973,0.68118,0.11404,0.655396,0.822682,0.3818,0.376231,0.674356
5,0.254098,0.473552,0.603327,0.605616,0.763005,0.751652,0.291799,0.426566,0.004717,0.497453


## Group-wise / Pivot

In [43]:
max_id = df_toy.b.idxmax()
df_toy.loc[max_id]

a    0.076350
b    0.996881
c    0.910751
d    0.745177
e    0.386984
f    0.325438
g    0.003434
h    0.222724
i    0.803945
j    0.612526
Name: 8, dtype: float64

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         891 non-null    object 
 1   PassengerId  891 non-null    int64  
 2   Survived     891 non-null    int64  
 3   Pclass       891 non-null    int64  
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [181]:
df.groupby('Age').Survived.mean() ## length 88: aggregation returns one row per group

Age
0.42     1.0
0.67     1.0
0.75     1.0
0.83     1.0
0.92     1.0
        ... 
70.00    0.0
70.50    0.0
71.00    0.0
74.00    0.0
80.00    1.0
Name: Survived, Length: 88, dtype: float64

In [183]:
df.groupby('Age').Survived.transform('mean') ## length 891: transform keeps the original row count

0      0.407407
1      0.454545
2      0.333333
3      0.611111
4      0.611111
         ...   
886    0.611111
887    0.360000
888         NaN
889    0.333333
890    0.500000
Name: Survived, Length: 891, dtype: float64
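
The difference between aggregating and transforming is easier to see on a tiny frame. The sketch below uses made-up data; note also that rows whose group key is missing (such as the `NaN` ages above) get `NaN` from `transform`, because `groupby` drops missing keys by default.

```python
import pandas as pd

toy = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})

per_group = toy.groupby("group")["value"].mean()           # one row per group
per_row = toy.groupby("group")["value"].transform("mean")  # one row per input row

# A common use of transform: centre each value on its group mean
toy["centred"] = toy["value"] - per_row
print(toy)
```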

In [52]:
df.groupby(['Sex']).Age.agg([len, min, max]) ## apply multiple aggregations at once

Sex,len,min,max
female,314.0,0.75,63.0
male,577.0,0.42,80.0


In [211]:
df 

Unnamed: 0,id,values
0,1,"[1, 2, 3]"
1,2,"[4, 5, 6]"


In [216]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [252]:
df.groupby(['Sex']).Age.agg(lambda s: s.iloc[0])

Sex
female    38.0
male      22.0
Name: Age, dtype: float64

In [255]:
df.pivot_table(index='Sex',columns='Pclass',values='Survived', aggfunc='mean', margins=True)

Pclass,1,2,3,All
Sex,,,,
female,0.968085,0.921053,0.5,0.742038
male,0.368852,0.157407,0.135447,0.188908
All,0.62963,0.472826,0.242363,0.383838


In [293]:
df.groupby(['Sex','Pclass'])['Survived'].mean().unstack()

Pclass,1,2,3
Sex,,,
female,0.968085,0.921053,0.5
male,0.368852,0.157407,0.135447


## Reshape: Long-Wide Form

### Expand a list into multiple columns

In [295]:
df_dumb = pd.DataFrame(data={"id": [1, 2], 
                        "values": [[1, 2, 3], [4, 5, 6]]})
df_dumb

Unnamed: 0,id,values
0,1,"[1, 2, 3]"
1,2,"[4, 5, 6]"


In [296]:
df_dumb['1 2 3'.split()] = df_dumb['values'].apply(pd.Series)
df_dumb 

Unnamed: 0,id,values,1,2,3
0,1,"[1, 2, 3]",1,2,3
1,2,"[4, 5, 6]",4,5,6
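
An alternative sketch for the same expansion, assuming every list has the same length: build the new columns from `tolist()` instead of `apply(pd.Series)`, which avoids constructing one Series per row. The column names `v1`..`v3` are made up for illustration. `explode` is shown as the long-form counterpart.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "values": [[1, 2, 3], [4, 5, 6]]})

# Build the new columns directly from the lists; reuse the index so rows align
expanded = pd.DataFrame(
    toy["values"].tolist(), index=toy.index, columns=["v1", "v2", "v3"]
)
wide = toy.drop(columns="values").join(expanded)

# Going long instead: one row per list element
long = toy.explode("values")
```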


In [299]:
df_wide = df_dumb.drop('values', axis='columns')
df_wide 

Unnamed: 0,id,1,2,3
0,1,1,2,3
1,2,4,5,6


In [302]:
df_long = pd.melt(df_wide, id_vars ='id', value_vars = ['1', '2','3'])
df_long 

Unnamed: 0,id,variable,value
0,1,1,1
1,2,1,4
2,1,2,2
3,2,2,5
4,1,3,3
5,2,3,6


In [321]:
df_reshape = df_long.pivot(index='id',columns='variable',values='value')
df_reshape 

variable,1,2,3
id,,,
1,1,2,3
2,4,5,6


In [325]:
df_reshape.rename_axis(None).rename_axis(None, axis=1).reset_index() # reset_index after pivot

Unnamed: 0,index,1,2,3
0,1,1,2,3
1,2,4,5,6


In [320]:
df_reshape

variable,1,2,3
None,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1,2,3
2,4,5,6


## Sorting

In [56]:
df.sort_values(by=['Age','Sex'], ascending=False).head()

Unnamed: 0,Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,"Barkworth, Mr. Algernon Henry Wilson",631,1,1,male,80.0,0,0,27042,30.0,A23,S
851,"Svensson, Mr. Johan",852,0,3,male,74.0,0,0,347060,7.775,,S
96,"Goldschmidt, Mr. George B",97,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C
493,"Artagaveytia, Mr. Ramon",494,0,1,male,71.0,0,0,PC 17609,49.5042,,C
116,"Connors, Mr. Patrick",117,0,3,male,70.5,0,0,370369,7.75,,Q


## Mapping
* `map()`: `Series -> Series`: applies a function (or a dict/Series lookup) to each value of a **Series**
    - The function receives a single value from the Series
    - It returns the transformed value, so the result is the Series transformed element by element
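
A minimal sketch of the two common ways to call `map`, using a made-up Series: with a dictionary lookup and with a function.

```python
import pandas as pd

sex = pd.Series(["male", "female", "other"])

by_dict = sex.map({"male": 0, "female": 1})  # values missing from the dict become NaN
by_func = sex.map(lambda s: s[0].upper())    # the function is applied to each element
print(by_dict.tolist(), by_func.tolist())
```

The dictionary form is handy for recoding categories, but any value without a key silently turns into `NaN`, so it is worth checking for missing values afterwards.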

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         891 non-null    object 
 1   PassengerId  891 non-null    int64  
 2   Survived     891 non-null    int64  
 3   Pclass       891 non-null    int64  
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [75]:
df.Name.map(lambda name: name.split(' ')[0].replace(',','')).head() # take the last name

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object

In [79]:
df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [90]:
df[(df.Name.map(lambda name: 'mrs' in name.lower()))].head()#.Sex.value_counts()

Unnamed: 0,Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1,C123,S
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",9,1,3,female,27.0,0,2,347742,11.1333,,S
9,"Nasser, Mrs. Nicholas (Adele Achem)",10,1,2,female,14.0,1,0,237736,30.0708,,C
15,"Hewlett, Mrs. (Mary D Kingcome)",16,1,2,female,55.0,0,0,248706,16.0,,S


## Apply
* `apply()`: applies a function along an axis of a **DataFrame**
    - The function receives a whole row (`axis=1` / `axis='columns'`) or a whole column (`axis=0`, the default) as a Series
    - It returns one value per row or column, so the result is usually a Series
* `df.apply(lambda x: func(x['col1'],x['col2']),axis=1)`
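
The sketch below, on a made-up two-row frame, contrasts row-wise `apply` with the equivalent vectorized expression and with the default column-wise behaviour. For simple arithmetic the vectorized form is generally preferable, since row-wise `apply` runs a Python function once per row.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [3, 1], "Age": [22.0, 38.0]})

row_wise = toy.apply(lambda row: row["Pclass"] * row["Age"], axis=1)
vectorized = toy["Pclass"] * toy["Age"]       # same values, no Python-level loop
col_wise = toy.apply(lambda col: col.max())   # axis=0 (default): one value per column
```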

In [121]:
re.findall(r'[A-Z]{2}','PC 1234')

['PC']

In [126]:
df.iloc[0]

Name           Braund, Mr. Owen Harris
PassengerId                          1
Survived                             0
Pclass                               3
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

In [138]:
def manipulate_cat(row):
    # Extract the first run of capital letters from the ticket, if any
    char = re.findall(r'[A-Z]+', row.Ticket)
    return char[0] if len(char) > 0 else None # each row arrives as a Series, so row.Ticket is a single value

df.select_dtypes('object').apply(manipulate_cat,axis='columns').head() # equivalent to mapping over df.Ticket

0       A
1      PC
2    STON
3    None
4    None
dtype: object

In [140]:
def old_women(row):
    return True if row.Sex == 'female' and row.Age > 60 else False

df['old_women'] = df.apply(old_women, axis='columns')
df.head()

Unnamed: 0,Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,old_women
0,"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,False
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,False
2,"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,False
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1,C123,S,False
4,"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.05,,S,False


![pandas-axis.jpeg](images/pandas-axis.jpeg)

In [1]:
# Row-wise product of two columns; requires df to be loaded in an earlier cell
df.apply(lambda x: x.Pclass * x.Age, axis=1)

In [188]:
## Using apply to build a boolean filter
df[df.apply(lambda x: 'mrs' in x.Name.lower(), axis=1)].head()

Unnamed: 0,Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,old_women
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,False
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1,C123,S,False
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",9,1,3,female,27.0,0,2,347742,11.1333,,S,False
9,"Nasser, Mrs. Nicholas (Adele Achem)",10,1,2,female,14.0,1,0,237736,30.0708,,C,False
15,"Hewlett, Mrs. (Mary D Kingcome)",16,1,2,female,55.0,0,0,248706,16.0,,S,False


## Pipe

In [326]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [331]:
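def split_name(df):
    # Split "Surname, Title Given-names" on the first comma into two columns
    df[['firstname','secondname']] = df['Name'].str.split(',', n=1, expand=True)
    return df
def old_women(df):
    # Flag female passengers older than 60; rows with missing Age evaluate to False
    df['old_women'] = df.apply(lambda x: x.Sex=='female' and x.Age > 60, axis='columns')
    return df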

res = (
    df
    .pipe(split_name)
    .pipe(old_women)
    )

res.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,firstname,secondname,old_women
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,False
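
A variation on the same idea, sketched with made-up helper functions (`add_ratio` and `keep_above` are hypothetical names, not part of pandas): each step works on a copy so the input frame is not modified, and extra arguments are passed through `pipe`.

```python
import pandas as pd

def add_ratio(df, num, den, name):
    # Work on a copy so the caller's frame is left untouched
    out = df.copy()
    out[name] = out[num] / out[den]
    return out

def keep_above(df, col, threshold):
    # Keep only the rows whose value in `col` exceeds the threshold
    return df[df[col] > threshold]

toy = pd.DataFrame({"fare": [10.0, 80.0], "size": [1, 4]})
res_toy = (
    toy
    .pipe(add_ratio, "fare", "size", name="fare_per_person")
    .pipe(keep_above, "fare_per_person", 15)
)
```

Returning a new frame from every step keeps the chain free of side effects, which makes individual steps easier to test and reorder.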


## Missing
* `isnull()`, `notnull()`
* `fillna()`
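
A short sketch of both methods on a made-up frame: counting missing values per column, then filling each column with its own replacement by passing a dictionary to `fillna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Cabin": ["C85", None, None]})

missing_per_column = toy.isnull().sum()  # number of missing values in each column
filled = toy.fillna({"Age": toy["Age"].median(), "Cabin": "Unknown"})
print(filled)
```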

In [145]:
df[df.Cabin.isnull()][['Name','Survived','Cabin']].head()

Unnamed: 0,Name,Survived,Cabin
0,"Braund, Mr. Owen Harris",0,
2,"Heikkinen, Miss. Laina",1,
4,"Allen, Mr. William Henry",0,
5,"Moran, Mr. James",0,
7,"Palsson, Master. Gosta Leonard",0,


In [148]:
df0 = df[df.Cabin.isnull()][['Name','Survived','Cabin']].fillna('Unknown').head()
df0 

Unnamed: 0,Name,Survived,Cabin
0,"Braund, Mr. Owen Harris",0,Unknown
2,"Heikkinen, Miss. Laina",1,Unknown
4,"Allen, Mr. William Henry",0,Unknown
5,"Moran, Mr. James",0,Unknown
7,"Palsson, Master. Gosta Leonard",0,Unknown


In [149]:
df0.replace('Unknown','Funny')

Unnamed: 0,Name,Survived,Cabin
0,"Braund, Mr. Owen Harris",0,Funny
2,"Heikkinen, Miss. Laina",1,Funny
4,"Allen, Mr. William Henry",0,Funny
5,"Moran, Mr. James",0,Funny
7,"Palsson, Master. Gosta Leonard",0,Funny


## Vectorize: Split the string to multiple columns

In [244]:
df[['firstname','lastname']] = df.Name.str.split(',', expand=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,"(firstname, lastname)",firstname,lastname
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,"[Braund, Mr. Owen Harris]",Braund,Mr. Owen Harris
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"[Cumings, Mrs. John Bradley (Florence Briggs ...",Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,"[Heikkinen, Miss. Laina]",Heikkinen,Miss. Laina
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,"[Futrelle, Mrs. Jacques Heath (Lily May Peel)]",Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,"[Allen, Mr. William Henry]",Allen,Mr. William Henry


## Numeric to Categorical

In [257]:
pd.cut(df.Age, bins=[0,18,25,99], labels=['child','young adult','adult']).head()

0    young adult
1          adult
2          adult
3          adult
4          adult
Name: Age, dtype: category
Categories (3, object): ['child' < 'young adult' < 'adult']

In [275]:
cat_series = pd.qcut(df.Age, q=5, labels=False).astype('object')
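
To make the difference between the two binning functions concrete, here is a sketch on a made-up Series of ages: `cut` uses the edges you give it (intervals are right-closed by default), while `qcut` derives the edges from quantiles so the bins hold roughly equal counts. The labels are assumptions for illustration.

```python
import pandas as pd

ages = pd.Series([5, 18, 19, 25, 40])

# cut: fixed edges; right-closed intervals mean 18 falls in (0, 18]
by_edges = pd.cut(ages, bins=[0, 18, 25, 99], labels=["child", "young adult", "adult"])

# qcut: edges taken from quantiles (here the median splits the data in two)
by_quantile = pd.qcut(ages, q=2, labels=["lower half", "upper half"])
```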

## Categorical to Numeric

In [276]:
pd.get_dummies(cat_series).head()

Unnamed: 0,0.0,1.0,2.0,3.0,4.0
0,0,1,0,0,0
1,0,0,0,1,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,0,1,0
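
`get_dummies` can also encode selected columns of a whole frame. The sketch below uses made-up data and an assumed prefix `port`; recent pandas versions return boolean indicator columns by default, so `dtype=int` is passed to get the 0/1 form shown above.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "S"], "Fare": [7.25, 71.28, 8.05]})

# Encode only the listed column; the other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Embarked"], prefix="port", dtype=int)
print(encoded)
```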


# Concatenate & Merge

In [164]:
df_toy1 = df_toy[:4]
df_toy1 

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
1,0.258436,0.423116,0.287774,0.964585,0.385787,0.916427,0.79857,0.973122,0.906834,0.564802
2,0.096466,0.96327,0.880865,0.791004,0.541368,0.165742,0.638345,0.722718,0.458387,0.781024
3,0.233432,0.563873,0.543492,0.624769,0.309379,0.25551,0.48116,0.112016,0.978358,0.907759
4,0.702748,0.90963,0.757972,0.771639,0.960781,0.879888,0.194031,0.928934,0.134611,0.200854


In [165]:
df_toy2 = df_toy[4:]
df_toy2

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
5,0.905405,0.866378,0.644015,0.812307,0.071296,0.112941,0.196816,0.380298,0.726505,0.277098
6,0.439242,0.918783,0.079394,0.35236,0.15749,0.041199,0.68922,0.568227,0.983529,0.839613
7,0.004884,0.862336,0.213394,0.313237,0.312096,0.565927,0.394071,0.357896,0.776445,0.929125
8,0.07635,0.996881,0.910751,0.745177,0.386984,0.325438,0.003434,0.222724,0.803945,0.612526


## pd.concat()

In [166]:
# Concat by rows
pd.concat([df_toy1, df_toy2])

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
1,0.258436,0.423116,0.287774,0.964585,0.385787,0.916427,0.79857,0.973122,0.906834,0.564802
2,0.096466,0.96327,0.880865,0.791004,0.541368,0.165742,0.638345,0.722718,0.458387,0.781024
3,0.233432,0.563873,0.543492,0.624769,0.309379,0.25551,0.48116,0.112016,0.978358,0.907759
4,0.702748,0.90963,0.757972,0.771639,0.960781,0.879888,0.194031,0.928934,0.134611,0.200854
5,0.905405,0.866378,0.644015,0.812307,0.071296,0.112941,0.196816,0.380298,0.726505,0.277098
6,0.439242,0.918783,0.079394,0.35236,0.15749,0.041199,0.68922,0.568227,0.983529,0.839613
7,0.004884,0.862336,0.213394,0.313237,0.312096,0.565927,0.394071,0.357896,0.776445,0.929125
8,0.07635,0.996881,0.910751,0.745177,0.386984,0.325438,0.003434,0.222724,0.803945,0.612526


In [169]:
# Concat by columns
pd.concat([df_toy1.add_prefix('toy1_').reset_index(),df_toy2.reset_index()], axis='columns') # try without the reset_index()

Unnamed: 0,index,toy1_a,toy1_b,toy1_c,toy1_d,toy1_e,toy1_f,toy1_g,toy1_h,toy1_i,...,a,b,c,d,e,f,g,h,i,j
0,1,0.258436,0.423116,0.287774,0.964585,0.385787,0.916427,0.79857,0.973122,0.906834,...,0.905405,0.866378,0.644015,0.812307,0.071296,0.112941,0.196816,0.380298,0.726505,0.277098
1,2,0.096466,0.96327,0.880865,0.791004,0.541368,0.165742,0.638345,0.722718,0.458387,...,0.439242,0.918783,0.079394,0.35236,0.15749,0.041199,0.68922,0.568227,0.983529,0.839613
2,3,0.233432,0.563873,0.543492,0.624769,0.309379,0.25551,0.48116,0.112016,0.978358,...,0.004884,0.862336,0.213394,0.313237,0.312096,0.565927,0.394071,0.357896,0.776445,0.929125
3,4,0.702748,0.90963,0.757972,0.771639,0.960781,0.879888,0.194031,0.928934,0.134611,...,0.07635,0.996881,0.910751,0.745177,0.386984,0.325438,0.003434,0.222724,0.803945,0.612526


## pd.merge()
![pandas-join.png](./images/pandas-join.png)

In [None]:
# Merge on the index (a key column could be used instead via `on=`)
pd.merge(df_toy1[['a','b']], df_toy1[['i','j']], how='inner', left_index=True, right_index=True)

In [180]:
# Join on the index and add suffixes to the overlapping column names
df_toy1[['a','b']].join(df_toy1[['a','b']], lsuffix='_left', rsuffix='_right')

Unnamed: 0,a_left,b_left,a_right,b_right
1,0.258436,0.423116,0.258436,0.423116
2,0.096466,0.96327,0.096466,0.96327
3,0.233432,0.563873,0.233432,0.563873
4,0.702748,0.90963,0.702748,0.90963
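
The join types in the diagram can be tried on two small made-up frames that share a `key` column. The `indicator=True` option adds a `_merge` column recording which side each row came from, which is useful for checking a join.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")                  # keys in both frames
outer = pd.merge(left, right, on="key", how="outer", indicator=True)  # keys in either frame
print(outer)
```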


## merge_asof
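* Match each left row to the nearest key in the right frame (usually a timestamp); the `direction` argument chooses which neighbour counts:
    - `backward` (default): last right key at or before the left key
    - `forward`: first right key at or after the left key
    - `nearest`: closest right key in either direction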
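
The three directions can be compared on a minimal made-up example with integer keys (timestamps behave the same way). Both frames must already be sorted by the key.

```python
import pandas as pd

left = pd.DataFrame({"t": [5, 10]})
right = pd.DataFrame({"t": [1, 6, 11], "val": ["a", "b", "c"]})

backward = pd.merge_asof(left, right, on="t", direction="backward")  # last right key <= left key
forward = pd.merge_asof(left, right, on="t", direction="forward")    # first right key >= left key
nearest = pd.merge_asof(left, right, on="t", direction="nearest")    # smallest absolute distance
```

Adding `tolerance=` limits how far away a match may be; left rows with no right key inside the tolerance get `NaN`.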

In [206]:
trades = pd.DataFrame(
       {
           "time": [
               pd.Timestamp("2016-05-25 13:30:00.023"),
               pd.Timestamp("2016-05-25 13:30:00.038"),
               pd.Timestamp("2016-05-25 13:30:00.048"),
               pd.Timestamp("2016-05-25 13:30:00.048"),
               pd.Timestamp("2016-05-25 13:30:00.048")
           ],
           "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
           "price": [51.95, 51.95, 720.77, 720.92, 98.0],
           "quantity": [75, 155, 100, 100, 100]
       }
   )

quotes = pd.DataFrame(
    {
        "time": [
            pd.Timestamp("2016-05-25 13:30:00.023"),
            pd.Timestamp("2016-05-25 13:30:00.023"),
            pd.Timestamp("2016-05-25 13:30:00.030"),
            pd.Timestamp("2016-05-25 13:30:00.041"),
            pd.Timestamp("2016-05-25 13:30:00.048"),
            pd.Timestamp("2016-05-25 13:30:00.049"),
            pd.Timestamp("2016-05-25 13:30:00.072"),
            pd.Timestamp("2016-05-25 13:30:00.075")
        ],
        "ticker": [
               "GOOG",
               "MSFT",
               "MSFT",
               "MSFT",
               "GOOG",
               "AAPL",
               "GOOG",
               "MSFT"
           ],
           "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
           "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]
    }
)

In [208]:
pd.merge_asof(trades, quotes, on='time', by='ticker', tolerance=pd.Timedelta("2ms"))

Unnamed: 0,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,,
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,
