<span>
<img src="http://www.sobigdata.eu/sites/default/files/logo-SoBigData-DEFINITIVO.png" width="180px" align="right"/>
</span>
<span>
<b>Author:</b> <a href="http://about.giuliorossetti.net">Giulio Rossetti</a><br/>
<b>Python version:</b>  3.x<br/>
<b>Last update:</b> 22/01/2018
</span>

<a id='top'></a>
# *Data Manipulation with Pandas*
This notebook contains an overview of basic Pandas functionalities.

**Note:** this notebook is purposely not comprehensive: it only covers the basics you need to get started.

## Table of Contents
1. [Series](#series) 
    1. [Create](#sa)
    2. [Index and Slice](#sb)
    3. [Adding/Merging](#sc)
2. [DataFrames](#dataframes) 
    1. [Create and Access](#da)
    2. [Load a DataFrame from csv file](#db)
    3. [Reshape](#dc)
        1. [Index and Slice Columns](#dc1)
        2. [Select/Index Rows](#dc2)
        3. [Create and Delete Columns/Rows](#dc3)
        4. [Subset](#dc4)
        5. [Conditional Selection](#dc5)
        6. [Re-setting and Setting Index](#dc6)
        7. [Multi-indexing](#dc7)
3. [Data Transformation](#de)
    1. [Missing Values](#de1)
    2. [GroupBy](#de2)
    3. [Concatenation](#de3)
    4. [Merging](#de4)
    5. [Joining](#de5)
    6. [Miscellanea](#de6)

In [1]:
import numpy as np
import pandas as pd

<a id='series'></a>
## 1. Series ([to top](#top))

Pandas Series are **one-dimensional** labeled arrays capable of holding any data type (integers, strings, floating point numbers...) <br/>
The axis labels are collectively referred to as the **index**. 

<a id='sa'></a>
### 1.A Create ([to top](#top))
A pandas Series can be built from several kinds of input data

From numerical data

In [2]:
my_data = [10,20,30]
pd.Series(data=my_data)

0    10
1    20
2    30
dtype: int64

From numerical data and corresponding index (row labels)

In [3]:
labels = ['A','B','C']
pd.Series(data=my_data, index=labels)

A    10
B    20
C    30
dtype: int64

From a pre-defined dictionary (the keys become the index)

In [4]:
d = {'A':10,'B':20,'C':30}
pd.Series(d)

A    10
B    20
C    30
dtype: int64

<a id='sb'></a>
### 1.B Index and Slice ([to top](#top))
Series can be indexed and sliced

In [5]:
ser = pd.Series([1, 2, 3, 4], ['A', 'B', 'C', 'D'])

print("by name, A:", ser['A'])
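# Use .iloc for positional access: ser[0] on a label-indexed Series is deprecated in recent pandas
print("by positional value in the series, A:", ser.iloc[0])
print("by range, B:D\n", ser.iloc[1:4], sep='')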

by name, A: 1
by positional value in the series, A: 1
by range, B:D
B    2
C    3
D    4
dtype: int64
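As a small sketch of the difference between the two explicit indexers (the variable names `by_label` and `by_position` are only illustrative): slicing by label with `.loc` includes both endpoints, while slicing by position with `.iloc` excludes the stop position, as Python lists do.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])

# Label-based slice: both 'B' and 'D' are included
by_label = ser.loc['B':'D']

# Position-based slice: position 3 is excluded
by_position = ser.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```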


<a id='sc'></a>
### 1.C Adding/Merging  ([to top](#top))
Series that share index labels can be combined

In [6]:
ser1 = pd.Series([1, 2, 3, 4], ['A', 'B', 'C', 'D'])
ser2 = pd.Series([1, 2, 5, 4], ['A', 'B', 'E', 'D'])
ser3 = ser1+ser2

After adding the two series, the result looks like this...

In [7]:
ser3

A    2.0
B    4.0
C    NaN
D    8.0
E    NaN
dtype: float64

pandas aligns the two Series on their index labels: values are added where a label appears in both, and NaN is used where a label is missing from either one <br/>
The same happens for the other arithmetic operations (e.g., product)

In [8]:
ser1*ser2

A     1.0
B     4.0
C     NaN
D    16.0
E     NaN
dtype: float64
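If NaN is not the desired result for unmatched labels, one possible approach is the method form of the operator with `fill_value`, which substitutes a default for the missing side before computing. A minimal sketch (the name `total` is illustrative):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
ser2 = pd.Series([1, 2, 5, 4], index=['A', 'B', 'E', 'D'])

# Labels present in only one Series are treated as 0 in the other
total = ser1.add(ser2, fill_value=0)
print(total)
```

Whether 0 is a sensible default depends on the data; for a product, `ser1.mul(ser2, fill_value=1)` would be the analogous choice.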

<a id='dataframes'></a>
## 2. DataFrame ([to top](#top))
A DataFrame is a **2-dimensional** labeled data structure with columns of potentially different types. <br/>
You can think of it like a spreadsheet or SQL table, or a dict of Series objects. <br/>
It is generally the most commonly used pandas object.
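To illustrate the "dict of Series" view, here is a minimal sketch with two made-up Series (`w` and `x` are hypothetical names): each Series becomes a column, and the row index is the union of their labels.

```python
import pandas as pd

w = pd.Series([1.0, 2.0], index=['A', 'B'])
x = pd.Series([3.0, 4.0], index=['B', 'C'])

# Columns are aligned on the index; missing combinations become NaN
frame = pd.DataFrame({'W': w, 'X': x})
print(frame)
```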

In [9]:
from numpy.random import randn as rn

<a id='da'></a>
### 2.A Create and Access ([to top](#top))
We start by generating some random data...

In [10]:
np.random.seed()  # no argument: seeded from OS entropy, so the values below change on every run
matrix_data = rn(5,4)
matrix_data

array([[ 0.26743873,  0.98953223,  0.23225527, -0.44244591],
       [-0.80924616,  0.30080056,  0.20135332,  0.06512674],
       [ 1.02101342, -1.04870364,  0.4461839 , -1.56344312],
       [-0.13577441,  0.48456665,  0.67251112,  0.51181276],
       [-0.22743822, -0.66156282,  0.16536224, -0.49756965]])

Now we can turn this random matrix into a DataFrame

In [11]:
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
df

Unnamed: 0,W,X,Y,Z
A,0.267439,0.989532,0.232255,-0.442446
B,-0.809246,0.300801,0.201353,0.065127
C,1.021013,-1.048704,0.446184,-1.563443
D,-0.135774,0.484567,0.672511,0.511813
E,-0.227438,-0.661563,0.165362,-0.49757


<a id='db'></a>
### 2.B Load a DataFrame from csv file ([to top](#top))
Datasets formatted as csv files can be easily loaded into a DataFrame

In [12]:
titanic = pd.read_csv("data/titanic_passengers.csv")

In [13]:
titanic.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0
4,5,"Allen, Mr. William Henry",male,35.0,0,0
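`read_csv` also accepts file-like objects, which is handy for trying options without a file on disk. The sketch below uses a two-row excerpt typed in by hand (an assumption, not the real file) to show `usecols` and `index_col`.

```python
import io
import pandas as pd

# Hand-written excerpt standing in for the CSV file
csv_text = '''PassengerId,Name,Sex,Age
1,"Braund, Mr. Owen Harris",male,22
6,"Moran, Mr. James",male,
'''

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['PassengerId', 'Name', 'Age'],  # load only these columns
    index_col='PassengerId',                 # use this column as the row index
)
print(sample)
```

The empty Age field is read as NaN, which is why the column ends up as float64 rather than an integer type.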


Simple statistics can be obtained through the *describe* method

In [14]:
titanic.describe()

Unnamed: 0,PassengerId,Age,SibSp,Parch
count,891.0,714.0,891.0,891.0
mean,446.0,29.699118,0.523008,0.381594
std,257.353842,14.526497,1.102743,0.806057
min,1.0,0.42,0.0,0.0
25%,223.5,20.125,0.0,0.0
50%,446.0,28.0,0.0,0.0
75%,668.5,38.0,1.0,0.0
max,891.0,80.0,8.0,6.0


A summary of columns, non-null counts, and dtypes is available through the *info* method

In [15]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Name         891 non-null    object 
 2   Sex          891 non-null    object 
 3   Age          714 non-null    float64
 4   SibSp        891 non-null    int64  
 5   Parch        891 non-null    int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 41.9+ KB


<a id='dc'></a>
### 2.C Reshape ([to top](#top))
DataFrame structures can be reshaped in several ways in order to facilitate the analysis of the data they describe 

<a id='dc1'></a>
### 2.C.a Index and Slice Columns ([to top](#top))

Selecting a single column produces a Series...

In [16]:
titanic['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [17]:
type(titanic['Age'])

pandas.core.series.Series

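An alternative syntax to access a single column is the *dot* notation (it only works when the column name is a valid Python identifier and does not clash with a DataFrame attribute)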

In [18]:
titanic.Age

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

#### Unique values

In [19]:
titanic['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

In [20]:
titanic['Age'].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: Age, Length: 88, dtype: int64
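By default `value_counts` ignores missing values. A short sketch on a made-up Series shows two commonly used options: `dropna=False` to count NaN as its own entry, and `normalize=True` to get relative frequencies.

```python
import numpy as np
import pandas as pd

ages = pd.Series([24.0, 24.0, 22.0, np.nan])

# Include NaN as a separate entry in the counts
counts = ages.value_counts(dropna=False)

# Relative frequencies over the non-missing values
shares = ages.value_counts(normalize=True)

print(counts)
print(shares)
```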

#### DataFrame Filter

To obtain a DataFrame instead of a Series, pass a list of column names (note the double brackets)

In [21]:
titanic[['Age']].head()

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0


In [22]:
type(titanic[['Age']])

pandas.core.frame.DataFrame

In [23]:
titanic[['Age','Sex']].head() # Multiple selection

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male


<a id='dc2'></a>
### 2.C.b Select/index Rows ([to top](#top))
Rows can be selected by **label** (with *loc*) as well as by integer **position** (with *iloc*)

In [24]:
titanic.loc[[1, 2]]

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0


In [25]:
titanic.iloc[[1,2]]

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0
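With a default 0-based index, labels and positions coincide, so the two calls above return the same rows. A minimal sketch with a made-up DataFrame whose labels are 10, 20, 30 makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

# loc looks up index labels
by_label = df.loc[[10, 20]]

# iloc looks up integer positions (0-based)
by_position = df.iloc[[1, 2]]

print(by_label)
print(by_position)
```

Here `df.loc[[1, 2]]` would raise a `KeyError`, because 1 and 2 are not labels of this index.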


<a id='dc3'></a>
### 2.C.c Create and Delete Columns/Rows ([to top](#top))

Adding a new column by combining existing ones

In [26]:
titanic['Family'] = titanic['SibSp'] + titanic['Parch']
titanic.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Family
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,1
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,1
4,5,"Allen, Mr. William Henry",male,35.0,0,0,0


Deleting an existing column

In [27]:
titanic = titanic.drop('Parch', axis=1)
titanic = titanic.drop('SibSp', axis=1)
titanic.head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
0,1,"Braund, Mr. Owen Harris",male,22.0,1
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
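As an alternative sketch, several columns can be removed in one call by passing a list to the `columns` keyword, which avoids having to remember the axis number. The small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'SibSp': [1, 0], 'Parch': [0, 0], 'Family': [1, 0]})

# drop returns a new DataFrame; df itself is left untouched
trimmed = df.drop(columns=['SibSp', 'Parch'])
print(trimmed)
```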


Deleting a row by its index label, using the drop method with axis=0

In [28]:
titanic1 = titanic.drop(0, axis=0)
titanic1.head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
5,6,"Moran, Mr. James",male,,0


Updates can be performed **inplace** (without reassigning to a variable) by setting inplace=True

In [29]:
titanic.drop(0, axis=0, inplace=True)
titanic.head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
5,6,"Moran, Mr. James",male,,0


<a id='dc4'></a>
### 2.C.d Subset ([to top](#top))
Accessing the element at row label 2, column 'Age'

In [30]:
titanic.loc[2,'Age']

26.0

Slicing by rows and columns at the same time

In [31]:
titanic.loc[[2, 3],['Age','Sex']]

Unnamed: 0,Age,Sex
2,26.0,female
3,35.0,female


<a id='dc5'></a>
### 2.C.e Conditional Selection ([to top](#top))
Logical operators can be applied to DataFrame to perform filtering and selections

**Example:** a boolean Series that tells, row by row, whether the age is greater than 20 (missing ages compare as False)

In [89]:
titanic['Age'] > 20

1       True
2       True
3       True
4       True
5      False
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Age, Length: 890, dtype: bool

In [88]:
rslt_df = titanic[titanic['Age'] > 20]
rslt_df.head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
6,7,"McCarthy, Mr. Timothy J",male,54.0,0


#### Conditionally subset a DataFrame with Boolean series
A boolean Series placed inside the brackets keeps only the rows where it is True, so a condition on the cell values is enough to filter the DataFrame

In [34]:
titanic[titanic['Age']>35].head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
6,7,"McCarthy, Mr. Timothy J",male,54.0,0
11,12,"Bonnell, Miss. Elizabeth",female,58.0,0
13,14,"Andersson, Mr. Anders Johan",male,39.0,6
15,16,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0


Multiple conditions can be combined with the element-wise boolean operators `&` (and), `|` (or) and `~` (not); each condition should be wrapped in parentheses

In [35]:
booldf1 = titanic['Age']>35
booldf2 = titanic['Family']>2

In [36]:
booldf2

1      False
2      False
3      False
4      False
5      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Family, Length: 890, dtype: bool

In [37]:
titanic[(booldf1) & (booldf2)].head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
13,14,"Andersson, Mr. Anders Johan",male,39.0,6
25,26,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,6
167,168,"Skoog, Mrs. William (Anna Bernhardina Karlsson)",female,45.0,5
360,361,"Skoog, Mr. Wilhelm",male,40.0,5
390,391,"Carter, Mr. William Ernest",male,36.0,3


Filtered results can be further subset as usual, by selecting rows and columns

In [38]:
titanic[booldf1][['Age','Sex']].head()

Unnamed: 0,Age,Sex
1,38.0,female
6,54.0,male
11,58.0,female
13,39.0,male
15,55.0,female


<a id='dc6'></a>
### 2.C.f Re-setting and Setting Index ([to top](#top))
After dropping the first row, the index of the DataFrame starts at 1. <br/>
We can rebuild a default 0-based index (turning the old one into an ordinary column of the DataFrame) easily

In [39]:
titanic.reset_index().head()

Unnamed: 0,index,PassengerId,Name,Sex,Age,Family
0,1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
1,2,3,"Heikkinen, Miss. Laina",female,26.0,0
2,3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
3,4,5,"Allen, Mr. William Henry",male,35.0,0
4,5,6,"Moran, Mr. James",male,,0


Conversely, if we do not need this additional column, we can discard it in the same call with drop=True

In [40]:
titanic.reset_index(drop=True).head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
0,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
1,3,"Heikkinen, Miss. Laina",female,26.0,0
2,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
3,5,"Allen, Mr. William Henry",male,35.0,0
4,6,"Moran, Mr. James",male,,0


An existing column can also be promoted to index

In [41]:
titanic.set_index('Name').head()

Name,PassengerId,Sex,Age,Family
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,female,38.0,1
"Heikkinen, Miss. Laina",3,female,26.0,0
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,female,35.0,1
"Allen, Mr. William Henry",5,male,35.0,0
"Moran, Mr. James",6,male,,0


<a id='dc7'></a>
### 2.C.g Multi-indexing ([to top](#top))
DataFrame indexes can have multiple levels <br/>
We can define a two level index as follows:

In [42]:
t2 = titanic[['Name', 'Family', 'Age']]
t2.head()

Unnamed: 0,Name,Family,Age
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0
2,"Heikkinen, Miss. Laina",0,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0
4,"Allen, Mr. William Henry",0,35.0
5,"Moran, Mr. James",0,


In [43]:
t2.set_index(['Family', 'Age'], inplace=True)

For the sake of clarity, we can rename the index levels as follows

In [44]:
t2.index.names=['Outer', 'Inner']
t2.head()

Outer,Inner,Name
1,38.0,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
0,26.0,"Heikkinen, Miss. Laina"
1,35.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
0,35.0,"Allen, Mr. William Henry"
0,,"Moran, Mr. James"


Subsetting now becomes trickier, but the syntax remains the same. <br/>
We can select the Name of the rows with Outer=1 and Inner=38.0 by chaining two *loc* selections

In [45]:
t2.loc[1].loc[[38.0]][['Name']]

Inner,Name
38.0,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
38.0,"Hoyt, Mr. Frederick Maxfield"
38.0,"Graham, Mr. George Edward"
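Chained *loc* calls are not the only option. As a sketch on a small made-up two-level index (the names 'p' to 's' are placeholders), a tuple addresses both levels at once, and `xs` takes a cross-section on a single level.

```python
import pandas as pd

names = pd.DataFrame(
    {'Name': ['p', 'q', 'r', 's']},
    index=pd.MultiIndex.from_tuples(
        [(0, 26.0), (0, 35.0), (1, 35.0), (1, 38.0)],
        names=['Outer', 'Inner'],
    ),
)

# A tuple selects on both levels in one step
one_cell = names.loc[(1, 38.0), 'Name']

# Cross-section: all rows whose Inner level equals 35.0
cross_section = names.xs(35.0, level='Inner')

print(one_cell)
print(cross_section)
```

Because every (Outer, Inner) pair is unique in this toy index, the tuple lookup returns a single value; with repeated pairs it would return several rows.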


<a id='de'></a>
## 3. Data Transformation ([to top](#top))
Data stored in a DataFrame can be transformed applying several functions.

<a id='de1'></a>
### 3.A Missing Values ([to top](#top))
When missing values are present, different policies can be adopted

#### Dropping any rows with a NaN value

In [46]:
titanic.dropna(axis=0).head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
6,7,"McCarthy, Mr. Timothy J",male,54.0,0


#### Dropping any columns with NaN value

In [47]:
titanic.dropna(axis=1).head()

Unnamed: 0,PassengerId,Name,Sex,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1
2,3,"Heikkinen, Miss. Laina",female,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1
4,5,"Allen, Mr. William Henry",male,0
5,6,"Moran, Mr. James",male,0


#### Thresholding: keeping only the rows with at least 5 non-NaN values

In [48]:
titanic.dropna(axis=0, thresh=5).head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
6,7,"McCarthy, Mr. Timothy J",male,54.0,0
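Since `thresh` counts the values that are present rather than the missing ones, a tiny made-up example may help: the rows below hold 3, 2, and 1 non-NaN values respectively, and `thresh=2` keeps the first two.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, 6.0, 9.0],
})

# Keep rows that have at least 2 non-NaN values
kept = df.dropna(axis=0, thresh=2)
print(kept)
```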


#### Filling values with a default value

In [49]:
titanic.fillna(value='FILL VALUE').head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
5,6,"Moran, Mr. James",male,FILL VALUE,0


#### Filling values with a computed value (e.g., the mean of the 'Age' column)

In [50]:
titanic.fillna(value={'Age': titanic['Age'].mean()}).head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
5,6,"Moran, Mr. James",male,29.709916,0


<a id='de2'></a>
### 3.B GroupBy ([to top](#top))
DataFrames allow rows to be grouped by column values in order to compute aggregated statistics (e.g., sum, mean...)

In [51]:
t3 = titanic.groupby('Family')
t3[['Age']].mean()

Family,Age
0,32.220297
1,31.459565
2,26.035806
3,18.274815
4,20.818182
5,18.409091
6,15.166667
7,15.666667
10,
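Several statistics can be computed in one pass with `agg`. The sketch below uses made-up values, not the Titanic data; note that missing ages are skipped by `count`, `mean`, and `max`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Family': [0, 0, 1, 1, 1],
    'Age': [20.0, 30.0, 10.0, 40.0, np.nan],
})

# One row per group, one column per requested statistic
summary = df.groupby('Family')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```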


In [52]:
t4 = pd.DataFrame(titanic[['Age', 'Family']].groupby('Family').describe())
t4

,Age,Age,Age,Age,Age,Age,Age,Age
Family,count,mean,std,min,25%,50%,75%,max
0,404.0,32.220297,12.899871,5.0,22.0,29.5,39.0,80.0
1,138.0,31.459565,13.509524,0.42,22.0,29.5,42.0,65.0
2,93.0,26.035806,16.542123,0.67,15.0,27.0,37.0,70.0
3,27.0,18.274815,14.304131,0.75,3.5,23.0,28.0,48.0
4,11.0,20.818182,17.069377,2.0,8.5,18.0,25.0,54.0
5,22.0,18.409091,17.388171,1.0,4.75,12.0,24.0,64.0
6,12.0,15.166667,14.732977,2.0,4.75,9.0,22.25,39.0
7,6.0,15.666667,14.361987,1.0,9.5,12.5,15.5,43.0
10,0.0,,,,,,,


Selecting a single row with *loc* returns a Series. To view it as a one-row DataFrame it is necessary to:
- access it via *loc* and wrap the result in a DataFrame
- transpose the result

In [53]:
titanic_family_1 = pd.DataFrame(t4.loc[1])  # statistics for the group Family == 1
titanic_family_1

,,1
Age,count,138.0
Age,mean,31.459565
Age,std,13.509524
Age,min,0.42
Age,25%,22.0
Age,50%,29.5
Age,75%,42.0
Age,max,65.0


In [54]:
titanic_family_1.transpose()

,Age,Age,Age,Age,Age,Age,Age,Age
,count,mean,std,min,25%,50%,75%,max
1,138.0,31.459565,13.509524,0.42,22.0,29.5,42.0,65.0


Transposing is not required when several rows are selected with a list, since the result is already a DataFrame

In [55]:
titanic[['Age', 'Family']].groupby('Family').describe().loc[[1, 2, 3]]

,Age,Age,Age,Age,Age,Age,Age,Age
Family,count,mean,std,min,25%,50%,75%,max
1,138.0,31.459565,13.509524,0.42,22.0,29.5,42.0,65.0
2,93.0,26.035806,16.542123,0.67,15.0,27.0,37.0,70.0
3,27.0,18.274815,14.304131,0.75,3.5,23.0,28.0,48.0


<a id='de3'></a>
### 3.C Concatenation ([to top](#top))

DataFrames can be easily concatenated by row as well as by column

In [56]:
trip = pd.read_csv("data/titanic_status.csv")
trip.head()

Unnamed: 0,PassengerId,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,2,1,1,C,C85,PC 17599,71.2833
1,3,1,3,S,,STON/O2. 3101282,7.925
2,4,1,1,S,C123,113803,53.1
3,5,0,3,S,,373450,8.05
4,6,0,3,Q,,330877,8.4583


In [57]:
passengers = pd.read_csv("data/titanic_passengers.csv")
passengers.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0
4,5,"Allen, Mr. William Henry",male,35.0,0,0


In [58]:
row_concat = pd.concat([passengers, trip], axis=0)
row_concat.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,1.0,0.0,,,,,,
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1.0,0.0,,,,,,
2,3,"Heikkinen, Miss. Laina",female,26.0,0.0,0.0,,,,,,
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1.0,0.0,,,,,,
4,5,"Allen, Mr. William Henry",male,35.0,0.0,0.0,,,,,,


In [59]:
column_concat = pd.concat([passengers, trip], axis=1)
column_concat.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,PassengerId.1,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,2.0,1.0,1.0,C,C85,PC 17599,71.2833
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,3.0,1.0,3.0,S,,STON/O2. 3101282,7.925
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,4.0,1.0,1.0,S,C123,113803,53.1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,5.0,0.0,3.0,S,,373450,8.05
4,5,"Allen, Mr. William Henry",male,35.0,0,0,6.0,0.0,3.0,Q,,330877,8.4583
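Note that column-wise concatenation pairs rows by index label, not by any key column, which is why PassengerId 1 ends up next to PassengerId 2 in the output above. A minimal sketch with made-up excerpts of the two tables shows the effect and one possible way to align on the key instead: setting it as the index first.

```python
import pandas as pd

passengers = pd.DataFrame({'PassengerId': [1, 2, 3],
                           'Sex': ['male', 'female', 'female']})
trip = pd.DataFrame({'PassengerId': [2, 3], 'Pclass': [1, 3]})

# Rows are paired by index label (0 with 0, 1 with 1), so the ids do not match
misaligned = pd.concat([passengers, trip], axis=1)

# Using the key as index makes concat align on PassengerId
aligned = pd.concat(
    [passengers.set_index('PassengerId'), trip.set_index('PassengerId')],
    axis=1,
)
print(misaligned)
print(aligned)
```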


Filling NaN with a fixed value

In [60]:
column_concat.fillna(value=0, inplace=True)
column_concat.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,PassengerId.1,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,2.0,1.0,1.0,C,C85,PC 17599,71.2833
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,3.0,1.0,3.0,S,0,STON/O2. 3101282,7.925
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,4.0,1.0,1.0,S,C123,113803,53.1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,5.0,0.0,3.0,S,0,373450,8.05
4,5,"Allen, Mr. William Henry",male,35.0,0,0,6.0,0.0,3.0,Q,0,330877,8.4583


<a id='de4'></a>
### 3.D Merging ([to top](#top))
DataFrames can be merged if they share a **common key**. <br/>
The merge function combines DataFrames with a logic similar to SQL joins: the `how` argument selects an inner, outer, left, or right join.

In [61]:
merge1 = pd.merge(passengers, trip, how='inner',on=['PassengerId'])
merge1.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,1,1,C,C85,PC 17599,71.2833
1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,1,3,S,,STON/O2. 3101282,7.925
2,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,1,1,S,C123,113803,53.1
3,5,"Allen, Mr. William Henry",male,35.0,0,0,0,3,S,,373450,8.05
4,6,"Moran, Mr. James",male,,0,0,0,3,Q,,330877,8.4583


In [62]:
pd.merge(passengers, trip, how='outer',on=['PassengerId']).head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,,,,,,
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,1.0,1.0,C,C85,PC 17599,71.2833
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,1.0,3.0,S,,STON/O2. 3101282,7.925
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,1.0,1.0,S,C123,113803,53.1
4,5,"Allen, Mr. William Henry",male,35.0,0,0,0.0,3.0,S,,373450,8.05


In [63]:
pd.merge(passengers, trip, how='left',on=['PassengerId']).head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,,,,,,
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,1.0,1.0,C,C85,PC 17599,71.2833
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,1.0,3.0,S,,STON/O2. 3101282,7.925
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,1.0,1.0,S,C123,113803,53.1
4,5,"Allen, Mr. William Henry",male,35.0,0,0,0.0,3.0,S,,373450,8.05


In [64]:
pd.merge(passengers, trip, how='right',on=['PassengerId']).head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Survived,Pclass,Embarked,Cabin,Ticket,Fare
0,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,1,1,C,C85,PC 17599,71.2833
1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,1,3,S,,STON/O2. 3101282,7.925
2,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,1,1,S,C123,113803,53.1
3,5,"Allen, Mr. William Henry",male,35.0,0,0,0,3,S,,373450,8.05
4,6,"Moran, Mr. James",male,,0,0,0,3,Q,,330877,8.4583
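When comparing join types it can help to see where each row came from. As a sketch on two made-up excerpts, `indicator=True` adds a `_merge` column whose values are `left_only`, `right_only`, or `both`.

```python
import pandas as pd

left = pd.DataFrame({'PassengerId': [1, 2], 'Sex': ['male', 'female']})
right = pd.DataFrame({'PassengerId': [2, 3], 'Pclass': [1, 3]})

# Outer join keeps every key; _merge records the origin of each row
merged = pd.merge(left, right, how='outer', on='PassengerId', indicator=True)
print(merged)
```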


<a id='de5'></a>
### 3.E Joining ([to top](#top))
Joining is a convenient method for combining the columns of two **potentially differently-indexed** DataFrames into a single DataFrame based on 'index keys'.

In [65]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [66]:
left.join(right)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [67]:
left.join(right, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3
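A join on the index can also be written as a merge. The sketch below, on reduced versions of the two frames, shows that `left.join(right)` corresponds to a left merge with `left_index=True` and `right_index=True`.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

joined = left.join(right)  # default how='left'
merged = pd.merge(left, right, left_index=True, right_index=True, how='left')

print(joined.equals(merged))
```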


<a id='de6'></a>
### 3.F Miscellanea ([to top](#top))

#### Applying functions to DataFrame values
The *apply* method calls a function on every value of a column; it accepts any user-defined function...

In [68]:
# Define a function
def adulthood(x):
    # Note: a missing age (NaN) is not < 18, so it is labelled True as well
    if x<18:
        return False
    else:
        return True

In [69]:
passengers['Adult'] = passengers['Age'].apply(adulthood)
passengers.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Adult
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,True
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,True
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,True
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,True
4,5,"Allen, Mr. William Henry",male,35.0,0,0,True


... as well as with **built-in ones**

In [70]:
passengers['Name Length']= passengers['Name'].apply(len)
passengers.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Adult,Name Length
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,True,23
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,True,51
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,True,22
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,True,44
4,5,"Allen, Mr. William Henry",male,35.0,0,0,True,24
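For simple element-wise rules, a vectorized comparison is a common alternative to `apply`. The sketch below uses a made-up Series and also shows one way to keep missing ages as missing instead of assigning them a label; note that, unlike a function that returns True whenever the value is not below 18, the plain comparison maps NaN to False.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 5.0, np.nan])

# Vectorized comparison: NaN >= 18 evaluates to False
adult = ages >= 18

# Keep the flag only where the age is known; elsewhere leave NaN
adult_or_missing = adult.where(ages.notna())

print(adult.tolist())
print(adult_or_missing.isna().tolist())
```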


#### Standard statistical functions

In [71]:
passengers['Age'].max()

80.0

In [72]:
passengers['Age'].mean()

29.69911764705882

In [73]:
passengers['Age'].std()

14.526497332334044

In [74]:
passengers['Age'].min()

0.42

#### Get the list of column and row names

Getting column names (row labels are available in the same way through `passengers.index`)

In [75]:
passengers.columns

Index(['PassengerId', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Adult',
       'Name Length'],
      dtype='object')

#### Deletion by *del* command 
(N.B.: this modifies the DataFrame in place, whereas the drop method returns a new DataFrame unless inplace=True)

In [76]:
del passengers['Name Length']
passengers.head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Adult
0,1,"Braund, Mr. Owen Harris",male,22.0,1,0,True
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,True
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,True
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,True
4,5,"Allen, Mr. William Henry",male,35.0,0,0,True


#### Sorting and Ordering a DataFrame

In [77]:
passengers.sort_values(by='Age').head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Adult
803,804,"Thomas, Master. Assad Alexander",male,0.42,0,1,False
755,756,"Hamalainen, Master. Viljo",male,0.67,1,1,False
644,645,"Baclini, Miss. Eugenie",female,0.75,2,1,False
469,470,"Baclini, Miss. Helene Barbara",female,0.75,2,1,False
78,79,"Caldwell, Master. Alden Gates",male,0.83,0,2,False


In [78]:
passengers.sort_values(by='Age',ascending=False).head()

Unnamed: 0,PassengerId,Name,Sex,Age,SibSp,Parch,Adult
630,631,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,True
851,852,"Svensson, Mr. Johan",male,74.0,0,0,True
493,494,"Artagaveytia, Mr. Ramon",male,71.0,0,0,True
96,97,"Goldschmidt, Mr. George B",male,71.0,0,0,True
116,117,"Connors, Mr. Patrick",male,70.5,0,0,True
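Sorting can also use several columns at once, with a direction per column, and `na_position` controls where missing values end up. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['c', 'a', 'b', 'd'],
    'Age': [30.0, 30.0, np.nan, 5.0],
})

# Age descending, ties broken by Name ascending, missing ages listed first
ordered = df.sort_values(by=['Age', 'Name'],
                         ascending=[False, True],
                         na_position='first')
print(ordered)
```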


#### Find Null Values or Check for Null Values

In [79]:
titanic.isnull().head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,True,False


In [80]:
titanic.fillna('FAKE VALUE').head()

Unnamed: 0,PassengerId,Name,Sex,Age,Family
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,3,"Heikkinen, Miss. Laina",female,26.0,0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,5,"Allen, Mr. William Henry",male,35.0,0
5,6,"Moran, Mr. James",male,FAKE VALUE,0
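The boolean mask returned by `isnull` is often summarized rather than inspected cell by cell. As a sketch on a made-up DataFrame, summing the mask counts the missing values per column, and `any(axis=1)` flags the rows that contain at least one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [38.0, np.nan, np.nan], 'Family': [1, 0, 0]})

# True counts as 1, so the sum is the number of missing values per column
missing_per_column = df.isnull().sum()

# One flag per row: does the row contain any missing value?
any_missing = df.isnull().any(axis=1)

print(missing_per_column)
print(any_missing.tolist())
```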


#### Pivot Table

In [81]:
# Index out of 'Sex', columns from 'SibSp', actual numerical values from 'Age'
passengers.pivot_table(values='Age',index=['Sex'],columns=['SibSp'], aggfunc='mean')

SibSp,0,1,2,3,4,5
Sex,,,,,,
female,28.631944,30.738889,16.541667,16.5,8.333333,16.0
male,32.615443,29.461505,28.230769,8.75,6.416667,8.75


In [82]:
# Index out of 'SibSp' and 'Parch', columns from 'Sex', actual numerical values from 'Age'
passengers.pivot_table(values='Age',index=['SibSp', 'Parch'],columns=['Sex'], fill_value='FILLED', aggfunc='mean')

,Sex,female,male
SibSp,Parch,,
0,0,30.15,32.901316
0,1,27.086957,33.53
0,2,20.705882,21.536667
0,3,24.0,FILLED
0,4,29.0,FILLED
0,5,40.0,FILLED
1,0,31.806122,32.311321
1,1,29.16,25.258621
1,2,21.2,19.417143
1,3,51.0,16.0
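A pivot table with a single aggregation can be read as a group-by followed by a reshape. The sketch below, on made-up rows, computes the same mean ages both ways: with `pivot_table`, and with `groupby` plus `unstack`.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male'],
    'SibSp': [0, 1, 0, 0],
    'Age': [20.0, 30.0, 40.0, 50.0],
})

pivoted = df.pivot_table(values='Age', index='Sex', columns='SibSp', aggfunc='mean')

# Same numbers: group on both keys, then move SibSp from the index to the columns
unstacked = df.groupby(['Sex', 'SibSp'])['Age'].mean().unstack()

print(pivoted)
print(unstacked)
```

Combinations that never occur (here male with SibSp 1) show up as NaN in both results.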


#### Check and reset variable types

In [83]:
passengers.dtypes

PassengerId      int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Adult             bool
dtype: object

In [84]:
passengers.Age = passengers.Age.astype(float)

In [85]:
passengers.dtypes

PassengerId      int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Adult             bool
dtype: object
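The conversion above leaves the dtype unchanged because the column is already float64. A sketch on made-up Series shows two cases where a conversion does change something: `pd.to_numeric` with `errors='coerce'` for text that may not parse, and `astype('category')` for columns with few distinct values.

```python
import pandas as pd

raw = pd.Series(['22', '38', 'unknown'])

# Values that cannot be parsed become NaN instead of raising an error
ages = pd.to_numeric(raw, errors='coerce')

# Categorical dtype stores each distinct label once
sex = pd.Series(['male', 'female', 'female']).astype('category')

print(ages.dtype, sex.dtype)
```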