## Import pandas

In [144]:
import pandas as pd
import numpy as np

# 1. Read Data

###### Read csv data
A comma-separated values (CSV) file is a delimited text file that uses commas to separate values; each line of the file is a data record (https://en.wikipedia.org/wiki/Comma-separated_values). In pandas, the pandas.read_csv() method reads CSV data. Its most important arguments are filepath_or_buffer, which specifies the file path; sep, which sets the delimiter; and engine, which selects the parser: C is faster but supports fewer features, while Python is slower but feature-complete.
usecols defines the columns to read, nrows limits the number of rows read, and chunksize sets how many records are returned at a time. The method accepts many other arguments as well.


In [2]:
titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
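
The arguments described above can be tried without a file on disk. The following is a minimal sketch that assumes a small semicolon-delimited sample held in memory (the sample text and variable names are made up for illustration); a real call would pass a file path instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical in-memory sample; a real call would pass a file path instead.
csv_text = (
    "Survived;Pclass;Name;Age;Fare\n"
    "0;3;Mr. Owen Harris Braund;22.0;7.25\n"
    "1;1;Mrs. Jacques Heath (Lily May Peel) Futrelle;35.0;53.1\n"
    "1;3;Miss. Laina Heikkinen;26.0;7.925\n"
    "0;3;Mr. William Henry Allen;35.0;8.05\n"
)

# sep sets the delimiter, usecols keeps a subset of columns, nrows caps the rows read.
subset_df = pd.read_csv(
    io.StringIO(csv_text), sep=";", usecols=["Survived", "Age", "Fare"], nrows=3
)

# chunksize makes read_csv return an iterator of DataFrames instead of one DataFrame.
chunk_sizes = [
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), sep=";", chunksize=3)
]

print(subset_df)
print(chunk_sizes)
```

Reading in chunks is mainly useful when the file is too large to fit in memory at once, since each chunk can be processed and discarded before the next one is loaded.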


###### Read Excel data

In [3]:
titanic_excel_df=pd.read_excel('titanic.xlsx','Sheet1')
titanic_excel_df

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Read html file
HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser (https://en.wikipedia.org/wiki/HTML). The pandas.read_html() method extracts the tables found in an HTML document and returns them as a list of DataFrames, so the result has to be indexed to pick one table.

In [4]:
gdp_df=pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)#Table')
gdp_df[2].head()

Unnamed: 0_level_0,Country/Territory,Region,IMF[1],IMF[1],United Nations[12],United Nations[12],World Bank[13][14],World Bank[13][14]
Unnamed: 0_level_1,Country/Territory,Region,Estimate,Year,Estimate,Year,Estimate,Year
0,United States,Americas,22675271.0,2021,21433226,2019,20936600.0,2020
1,China,Asia,16642318.0,[n 2]2021,14342933,[n 3]2019,14722731.0,2020
2,Japan,Asia,5378136.0,2021,5082465,2019,4975415.0,2020
3,Germany,Europe,4319286.0,2021,3861123,2019,3806060.0,2020
4,United Kingdom,Europe,3124650.0,2021,2826441,2019,2707744.0,2020


###### Read Json Data

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write (https://www.json.org/json-en.html). The pd.read_json() method converts JSON data to a pandas DataFrame. Its important arguments include path_or_buf, the location of the data source; chunksize, which is useful when fetching large line-delimited data (it requires lines=True); and encoding, which specifies the encoding used to decode the data.
orient defines the expected JSON format and can be any of index, split, records, columns or values. Check the pandas documentation for more details: https://pandas-docs.github.io/pandas-docs-travis/reference/api/pandas.read_json.html


In [5]:
# json_df=pd.read_json("https://raw.githubusercontent.com/BindiChen/machine-learning/master/data-analysis/027-pandas-convert-json/data/simple.json")
json_exchange_rate=pd.read_json("https://api.exchangerate-api.com/v4/latest/USD")
json_exchange_rate=json_exchange_rate[['provider','base','date','time_last_updated','rates']].head()
json_exchange_rate

Unnamed: 0,provider,base,date,time_last_updated,rates
AED,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,3.67
AFN,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,96.34
ALL,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,107.32
AMD,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,487.66
ANG,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,1.79
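
To make the orient argument more concrete, here is a small sketch that reads the same two rows from two different JSON layouts. The payloads and column names are invented for illustration and are not taken from the exchange-rate API used above.

```python
import io

import pandas as pd

# Hypothetical payloads describing the same two rows in two layouts.
records_json = '[{"code": "AED", "rate": 3.67}, {"code": "AFN", "rate": 96.34}]'
index_json = '{"AED": {"rate": 3.67}, "AFN": {"rate": 96.34}}'

# orient="records": a list of row objects; the index becomes 0..n-1.
records_df = pd.read_json(io.StringIO(records_json), orient="records")

# orient="index": outer keys become row labels, inner keys become columns.
index_df = pd.read_json(io.StringIO(index_json), orient="index")

print(records_df)
print(index_df)
```

Picking the orient that matches the payload matters: the wrong choice can either fail or produce a transposed frame.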


###### Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language (https://parquet.apache.org/). It provides a partitioned binary columnar serialization for data frames. The pandas.read_parquet() method reads a Parquet file into a pandas DataFrame.

First we will create a Parquet file and then demonstrate how to read it in pandas.

In [6]:
# Uncomment this command to install pyarrow
# !pip install pyarrow

In [7]:
titanic_df.to_parquet('parquet_sample_data.parquet', engine='pyarrow')

In [8]:
# Read parquet file
parquet_df=pd.read_parquet('parquet_sample_data.parquet', engine='pyarrow')
parquet_df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


###### Read Pickle file
Pickle is a file format that serializes Python objects to a byte stream. In machine learning, pickle files are often used to share models and deploy them in production. Let's first create a pickle file and then demonstrate how to read it in pandas.

In [9]:
json_exchange_rate.to_pickle('pickle_sample_data.pkl')
pickle_df=pd.read_pickle('pickle_sample_data.pkl')
pickle_df.head()

Unnamed: 0,provider,base,date,time_last_updated,rates
AED,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,3.67
AFN,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,96.34
ALL,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,107.32
AMD,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,487.66
ANG,https://www.exchangerate-api.com,USD,2021-12-02,1638403201,1.79
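
The round trip can also be checked programmatically. This is a minimal sketch using a made-up two-row frame and a temporary directory (the file name `sample.pkl` is hypothetical), so it leaves nothing behind on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"base": ["USD", "USD"], "rates": [3.67, 96.34]}, index=["AED", "AFN"]
)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pkl")  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame has the same values, index labels and dtypes.
print(restored.equals(df))
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.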


###### Read SQL Data
SQL is a popular language for querying and manipulating data in databases. In data science and analytics, understanding SQL is a key skill for working with data efficiently and comfortably. Modern databases such as Oracle, MSSQL, GBQ and Amazon Redshift have advanced beyond being data storage containers and now provide capabilities such as writing advanced in-database machine learning models through SQL. Pandas enables us to read data from a database through SQL. To do so, we first need to install the respective Python database driver.

In [10]:
# Let's first create an sqlite database and save titanic dataset to it
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
titanic_df.to_sql('titanic_sqlite_data', engine, chunksize=100)

In [11]:
titanic_sqlite_df=pd.read_sql_table('titanic_sqlite_data', engine)
titanic_sqlite_df.head()

Unnamed: 0,index,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
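
The output above contains an extra index column because to_sql writes the DataFrame index by default. The sketch below is one possible variation, assuming the standard-library sqlite3 driver instead of SQLAlchemy and a hypothetical table name `passengers`; it writes without the index and reads back with a parameterized query.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame(
    {"Pclass": [3, 1, 3], "Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.2833, 7.925]}
)

conn = sqlite3.connect(":memory:")

# index=False keeps the DataFrame index from being stored as an extra "index" column.
df.to_sql("passengers", conn, index=False)  # "passengers" is a hypothetical table name

# read_sql_query runs a SQL statement; params passes the values separately from the SQL text.
third_class = pd.read_sql_query(
    "SELECT Age, Fare FROM passengers WHERE Pclass = ? ORDER BY Age",
    conn,
    params=(3,),
)
conn.close()

print(third_class)
```

Note that read_sql_table, used in the cell above, needs an SQLAlchemy engine, while read_sql_query also accepts a plain sqlite3 connection.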


# 2. Save Data

###### Save data to csv

In [12]:
json_exchange_rate.to_csv("Exchange_Rate.csv")
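
By default to_csv also writes the index. That is useful here because the index holds the currency codes, but for a frame with a plain 0..n-1 index it adds a column that shows up as "Unnamed: 0" when the file is read back. A small sketch with a made-up frame illustrates the difference:

```python
import io

import pandas as pd

df = pd.DataFrame({"base": ["USD", "USD"], "rates": [3.67, 96.34]})

# With no path, to_csv returns the CSV text, which makes the effect easy to inspect.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the first version back yields an extra "Unnamed: 0" column.
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```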

###### Save data to excel

In [13]:
json_exchange_rate.to_excel("Exchange_Rate.xlsx")

###### Save data to html

In [14]:
titanic_df.head().to_html() # returns the HTML as a string; pass a file path to write a file instead

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Survived</th>\n      <th>Pclass</th>\n      <th>Name</th>\n      <th>Sex</th>\n      <th>Age</th>\n      <th>Siblings/Spouses Aboard</th>\n      <th>Parents/Children Aboard</th>\n      <th>Fare</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>3</td>\n      <td>Mr. Owen Harris Braund</td>\n      <td>male</td>\n      <td>22.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>7.2500</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>1</td>\n      <td>Mrs. John Bradley (Florence Briggs Thayer) Cumings</td>\n      <td>female</td>\n      <td>38.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>71.2833</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1</td>\n      <td>3</td>\n      <td>Miss. Laina Heikkinen</td>\n      <td>female</td>\n      <td>26.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td

###### Save data to json

In [15]:
titanic_df.head().to_json(orient='columns') # returns the JSON as a string; pass a file path as the first argument to write a file

'{"Survived":{"0":0,"1":1,"2":1,"3":1,"4":0},"Pclass":{"0":3,"1":1,"2":3,"3":1,"4":3},"Name":{"0":"Mr. Owen Harris Braund","1":"Mrs. John Bradley (Florence Briggs Thayer) Cumings","2":"Miss. Laina Heikkinen","3":"Mrs. Jacques Heath (Lily May Peel) Futrelle","4":"Mr. William Henry Allen"},"Sex":{"0":"male","1":"female","2":"female","3":"female","4":"male"},"Age":{"0":22.0,"1":38.0,"2":26.0,"3":35.0,"4":35.0},"Siblings\\/Spouses Aboard":{"0":1,"1":1,"2":0,"3":1,"4":0},"Parents\\/Children Aboard":{"0":0,"1":0,"2":0,"3":0,"4":0},"Fare":{"0":7.25,"1":71.2833,"2":7.925,"3":53.1,"4":8.05}}'

###### Save data to Parquet

In [16]:
titanic_df.to_parquet('parquet_titanic_data.parquet', engine='pyarrow')

###### Save data to pickle

In [17]:
titanic_df.to_pickle('pickle_titanic_data.pkl')

###### Save data to SQLite

In [18]:
json_exchange_rate.to_sql('exchange_rate_sqlite_data', engine, chunksize=100)

## Inspecting Data

###### Show first 3 rows of data

In [19]:
titanic_df.head(3)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925


###### Show last 3 rows of data

In [20]:
titanic_df.tail(3)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.45
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0
886,0,3,Mr. Patrick Dooley,male,32.0,0,0,7.75


###### Show number of rows and columns

In [21]:
titanic_df.shape

(887, 8)

###### Show data types, columns and memory usage

In [22]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    int64  
 1   Pclass                   887 non-null    int64  
 2   Name                     887 non-null    object 
 3   Sex                      887 non-null    object 
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64  
 6   Parents/Children Aboard  887 non-null    int64  
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB


###### Show data types

In [30]:
titanic_df.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

###### Show all columns

In [24]:
titanic_df.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')

###### Show statistical summary

In [25]:
titanic_df.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [26]:
titanic_df.describe(include='all')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887,887,887.0,887.0,887.0,887.0
unique,,,887,2,,,,
top,,,Mr. Matti Rintamaki,male,,,,
freq,,,1,573,,,,
mean,0.385569,2.305524,,,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,,,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,,,0.42,0.0,0.0,0.0
25%,0.0,2.0,,,20.25,0.0,0.0,7.925
50%,0.0,3.0,,,28.0,0.0,0.0,14.4542
75%,1.0,3.0,,,38.0,1.0,0.0,31.1375


###### Show unique values and their frequency

In [29]:
titanic_df['Sex'].value_counts()

male      573
female    314
Name: Sex, dtype: int64

###### Return the number of dimensions of the data frame

In [32]:
titanic_df.ndim

2

###### Show number of elements in the data frame

In [33]:
titanic_df.size

7096

## Selecting Data

###### Select specific columns and all records

In [40]:
# Select one column only
titanic_df['Survived'] 
# Use . syntax if the column name has no spaces
titanic_df.Survived
# Select more than one column
titanic_df[['Survived','Pclass','Age']]

Unnamed: 0,Survived,Pclass,Age
0,0,3,22.0
1,1,1,38.0
2,1,3,26.0
3,1,1,35.0
4,0,3,35.0
...,...,...,...
882,0,2,27.0
883,1,1,19.0
884,0,3,7.0
885,1,1,26.0


In [43]:
titanic_df.head(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
5,0,3,Mr. James Moran,male,27.0,0,0,8.4583
6,0,1,Mr. Timothy J McCarthy,male,54.0,0,0,51.8625
7,0,3,Master. Gosta Leonard Palsson,male,2.0,3,1,21.075
8,1,3,Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson,female,27.0,0,2,11.1333
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708


###### Select data by position

In [65]:
titanic_df.iloc[:10] # select first 10 records
titanic_df.iloc[-10:] # select last 10 records
titanic_df.iloc[5:8] # select records at positions 5 to 7 (5 included, 8 excluded)
titanic_df.iloc[0:-10] # select all records excluding the last 10
titanic_df.iloc[0:,0:2] # select all records and the first two columns
titanic_df.iloc[0:,3:5] # select all records and the columns at positions 3 and 4
titanic_df.iloc[:,0:-1] # select all records and all columns excluding the last one

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0
4,0,3,Mr. William Henry Allen,male,35.0,0,0
...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0
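
Since only the last expression of a cell is displayed, the effect of each slice above is not visible. The following sketch repeats a few of them on a small made-up frame where the results are easy to verify; positions are zero-based and the end of a slice is excluded.

```python
import pandas as pd

# Hypothetical 10-row frame whose values make positions easy to recognise.
df = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

middle = df.iloc[5:8]              # rows at positions 5, 6, 7 (8 is excluded)
all_but_last_two = df.iloc[:-2]    # every row except the last two
first_two_cols = df.iloc[:, 0:2]   # all rows, columns at positions 0 and 1
single_cell = df.iloc[0, 2]        # row 0, column 2 -> a scalar

print(middle)
print(all_but_last_two.shape, first_two_cols.columns.tolist(), single_cell)
```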


## Missing Data

###### Flag null values (True if null, False otherwise)

In [85]:
titanic_df.isnull()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
882,False,False,False,False,False,False,False,False
883,False,False,False,False,False,False,False,False
884,False,False,False,False,False,False,False,False
885,False,False,False,False,False,False,False,False


###### Show count of null values

In [86]:
titanic_df.isnull().sum()

Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64

###### Flag non-null values (True if not null, False otherwise)

In [87]:
titanic_df.notnull()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...
882,True,True,True,True,True,True,True,True
883,True,True,True,True,True,True,True,True
884,True,True,True,True,True,True,True,True
885,True,True,True,True,True,True,True,True


###### Drop entire rows where all values are null

In [88]:
titanic_df.dropna(how='all')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Drop column with any null value

In [151]:
titanic_df.dropna(how='any',axis=1)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Drop rows where any of the specified columns is null

In [212]:
titanic_df.dropna(subset=['Pclass', 'Age'], how='any')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Drop rows where all of the specified columns are null

In [None]:
titanic_df.dropna(subset=['Pclass', 'Age'], how='all')

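###### Drop columns with fewer than a given number of non-null values

In [90]:
titanic_df.dropna(axis=1,thresh=2) # keep only columns with at least 2 non-null values; use axis=0 to apply the threshold to rows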

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
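
The Titanic data used here has no null values, so the dropna calls above return the frame unchanged. To see what each option actually does, here is a sketch on a small made-up frame that does contain nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely null, rows 2 and 3 are partially null.
df = pd.DataFrame(
    {
        "Pclass": [3, np.nan, 1, np.nan],
        "Age": [22.0, np.nan, np.nan, np.nan],
        "Fare": [7.25, np.nan, 53.1, 8.05],
    }
)

all_null_rows_dropped = df.dropna(how="all")    # drops row 1 only
any_null_rows_dropped = df.dropna(how="any")    # keeps row 0 only
subset_checked = df.dropna(subset=["Pclass", "Age"], how="all")  # drops rows 1 and 3
at_least_two_values = df.dropna(thresh=2)       # keeps rows with >= 2 non-null values
sparse_cols_dropped = df.dropna(axis=1, thresh=2)  # keeps columns with >= 2 non-null values

print(all_null_rows_dropped.index.tolist())
print(sparse_cols_dropped.columns.tolist())
```

Note that thresh counts non-null values that must be present, not the number of nulls that are tolerated.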


###### Replace null values with a scalar value

In [91]:
titanic_df.fillna(-999) # replace null values with -999

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Backward and Forward fill

In [150]:
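titanic_df.bfill() # backward fill: use the next valid value (fillna(method='bfill') is deprecated in recent pandas)
titanic_df.ffill() # forward fill: use the last valid value (fillna(method='ffill') is deprecated in recent pandas)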

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
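
Because the Titanic frame has no gaps, the fills above change nothing. A short sketch on a made-up series with nulls shows the difference between the fill directions and a constant fill.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps in the middle and at the end.
ages = pd.Series([22.0, np.nan, np.nan, 35.0, np.nan])

forward = ages.ffill()        # carry the last valid value forward
backward = ages.bfill()       # pull the next valid value backward
constant = ages.fillna(-999)  # replace every null with a sentinel value

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```

A backward fill leaves trailing nulls in place (and a forward fill leaves leading ones), since there is no neighbouring value to copy from.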


###### Replace missing value with a specific value

In [145]:
titanic_df.replace(np.nan, 0)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Impute null value with statistical measures

In [92]:
titanic_df.fillna({'Fare': titanic_df['Fare'].mean()}) # fill nulls in the Fare column with the mean fare
titanic_df.fillna({'Sex': titanic_df['Sex'].mode()[0]}) # fill nulls in the Sex column with its most frequent value
titanic_df.fillna({'Age': titanic_df['Age'].median()}) # fill nulls in the Age column with the median age

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
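
Passing a dictionary to fillna restricts each fill value to its own column, so several imputations can be combined in one call. The sketch below assumes a small made-up frame with one null per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame(
    {
        "Sex": ["male", "female", None, "female"],
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Fare": [7.25, 71.2833, np.nan, 53.1],
    }
)

fill_values = {
    "Sex": df["Sex"].mode()[0],   # mode() returns a Series; take the first mode
    "Age": df["Age"].median(),    # median of the non-null ages
    "Fare": df["Fare"].mean(),    # mean of the non-null fares
}
imputed = df.fillna(fill_values)

print(imputed)
```

Passing a single scalar such as the mean fare to fillna on the whole frame would instead write that number into every column that has nulls, including Sex and Age.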


###### Interpolate missing values

In [147]:
titanic_df.interpolate(method='linear')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
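
Linear interpolation fills a gap by placing the missing points evenly between the nearest valid neighbours. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive missing values.
fares = pd.Series([10.0, np.nan, np.nan, 40.0])

# method="linear" treats the values as equally spaced and fills along a straight line.
filled = fares.interpolate(method="linear")

print(filled.tolist())
```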


## Pandas String Functions

###### Convert text to Upper Case

In [155]:
titanic_df['Name'].str.upper()

0                                 MR. OWEN HARRIS BRAUND
1      MRS. JOHN BRADLEY (FLORENCE BRIGGS THAYER) CUM...
2                                  MISS. LAINA HEIKKINEN
3            MRS. JACQUES HEATH (LILY MAY PEEL) FUTRELLE
4                                MR. WILLIAM HENRY ALLEN
                             ...                        
882                                 REV. JUOZAS MONTVILA
883                          MISS. MARGARET EDITH GRAHAM
884                       MISS. CATHERINE HELEN JOHNSTON
885                                 MR. KARL HOWELL BEHR
886                                   MR. PATRICK DOOLEY
Name: Name, Length: 887, dtype: object

###### Convert text to Lower Case

In [157]:
titanic_df['Name'].str.lower()

0                                 mr. owen harris braund
1      mrs. john bradley (florence briggs thayer) cum...
2                                  miss. laina heikkinen
3            mrs. jacques heath (lily may peel) futrelle
4                                mr. william henry allen
                             ...                        
882                                 rev. juozas montvila
883                          miss. margaret edith graham
884                       miss. catherine helen johnston
885                                 mr. karl howell behr
886                                   mr. patrick dooley
Name: Name, Length: 887, dtype: object

###### Calculate length of a string

In [165]:
titanic_df['Name'].str.len().head()

0    22
1    50
2    21
3    43
4    23
Name: Name, dtype: int64

###### Strip white spaces

In [168]:
titanic_df['Name'].str.strip().head()

0                               Mr. Owen Harris Braund
1    Mrs. John Bradley (Florence Briggs Thayer) Cum...
2                                Miss. Laina Heikkinen
3          Mrs. Jacques Heath (Lily May Peel) Futrelle
4                              Mr. William Henry Allen
Name: Name, dtype: object

###### Split string

In [171]:
titanic_df['Name'].str.split(' ').head()

0                          [Mr., Owen, Harris, Braund]
1    [Mrs., John, Bradley, (Florence, Briggs, Thaye...
2                            [Miss., Laina, Heikkinen]
3    [Mrs., Jacques, Heath, (Lily, May, Peel), Futr...
4                         [Mr., William, Henry, Allen]
Name: Name, dtype: object
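
Splitting returns a list per row, as shown above. When the pieces should become separate columns, str.split also accepts n (maximum number of splits) and expand=True. The sketch below uses two names from the dataset; the column names `title` and `rest` are chosen here for illustration.

```python
import pandas as pd

names = pd.Series(["Mr. Owen Harris Braund", "Miss. Laina Heikkinen"])

# n=1 splits only at the first space; expand=True returns a DataFrame instead of lists.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["title", "rest"]  # illustrative column names

print(parts)
```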

###### Convert strings to categorical numbers

In [173]:
titanic_df['Sex'].str.get_dummies().head()

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


###### Repeat text

In [175]:
titanic_df['Sex'].str.repeat(5).head()

0              malemalemalemalemale
1    femalefemalefemalefemalefemale
2    femalefemalefemalefemalefemale
3    femalefemalefemalefemalefemale
4              malemalemalemalemale
Name: Sex, dtype: object

###### Count occurrence of certain words

In [181]:
titanic_df['Name'].str.count('Rev.').sum()

6

###### String search startswith and endswith

In [197]:
titanic_df['Name'].str.startswith('Mrs.').sum()
titanic_df['Name'].str.endswith('y').sum()

51

###### Swap Cases

In [199]:
titanic_df['Name'].str.swapcase().head()

0                               mR. oWEN hARRIS bRAUND
1    mRS. jOHN bRADLEY (fLORENCE bRIGGS tHAYER) cUM...
2                                mISS. lAINA hEIKKINEN
3          mRS. jACQUES hEATH (lILY mAY pEEL) fUTRELLE
4                              mR. wILLIAM hENRY aLLEN
Name: Name, dtype: object

###### Filter data with substring

In [160]:
titanic_df['Name'].str.contains('Mr.') # the pattern is a regex by default, so '.' matches any character and 'Mrs.' rows also match

0     True
1     True
2    False
3     True
4     True
Name: Name, dtype: bool
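
In the output above, rows 1 and 3 are True even though they hold "Mrs." names, because str.contains treats the pattern as a regular expression by default and "." matches any character. The following sketch, on three names from the dataset, compares the default with a literal match.

```python
import pandas as pd

names = pd.Series(
    ["Mr. Owen Harris Braund", "Mrs. Jacques Heath Futrelle", "Miss. Laina Heikkinen"]
)

regex_match = names.str.contains("Mr.")                  # "." matches any character
literal_match = names.str.contains("Mr.", regex=False)   # "." must be a literal dot
escaped_match = names.str.contains(r"Mr\.")              # same result via escaping

print(regex_match.tolist())
print(literal_match.tolist())
```

The resulting boolean series can be used directly as a filter, for example `names[literal_match]`.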

###### Replace string

In [163]:
titanic_df['Sex'].str.replace('female','F')

0      male
1         F
2         F
3         F
4      male
       ... 
882    male
883       F
884       F
885    male
886    male
Name: Sex, Length: 887, dtype: object

###### Check if text is lower or upper

In [209]:
titanic_df['Name'].str.islower()
titanic_df['Name'].str.isupper().head()

0    False
1    False
2    False
3    False
4    False
Name: Name, dtype: bool

###### Check if value is numeric

In [211]:
titanic_df['Name'].str.isnumeric().head()

0    False
1    False
2    False
3    False
4    False
Name: Name, dtype: bool

## Statistical functions

###### Summary statistics

In [93]:
titanic_df.describe(include='all')
titanic_df.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


###### Show min value

In [94]:
titanic_df['Age'].min() # min of age
titanic_df.min() # min of every column

Survived                                             0
Pclass                                               1
Name                       Capt. Edward Gifford Crosby
Sex                                             female
Age                                               0.42
Siblings/Spouses Aboard                              0
Parents/Children Aboard                              0
Fare                                                 0
dtype: object

###### Show max value

In [95]:
titanic_df['Age'].max() # max of age
titanic_df.max() # max of every column

Survived                                                                   1
Pclass                                                                     3
Name                       the Countess. of (Lucy Noel Martha Dyer-Edward...
Sex                                                                     male
Age                                                                       80
Siblings/Spouses Aboard                                                    8
Parents/Children Aboard                                                    6
Fare                                                                 512.329
dtype: object

###### Show mode of values

In [100]:
titanic_df['Age'].mode() # mode of Age

0    22.0
dtype: float64

###### Show median of values

In [102]:
titanic_df['Pclass'].median() # median of Pclass

3.0

###### Show sum of values

In [105]:
titanic_df['Fare'].sum() # sum of Fare

28654.907699999996

###### Show frequency of each category

In [108]:
titanic_df['Pclass'].value_counts() # frequency of passengers in each class

3    487
1    216
2    184
Name: Pclass, dtype: int64

###### Calculate mean

In [79]:
titanic_df['Age'].mean()

29.471443066516347

###### Calculate standard deviation

In [114]:
titanic_df['Age'].std() # std for age only
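titanic_df.std(numeric_only=True) # std for all numeric columns (recent pandas versions require numeric_only=True when text columns are present)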

Survived                    0.487004
Pclass                      0.836662
Age                        14.121908
Siblings/Spouses Aboard     1.104669
Parents/Children Aboard     0.807466
Fare                       49.782040
dtype: float64

###### Show Variance

In [120]:
titanic_df['Age'].var() # variance for age column only
titanic_df.var(numeric_only=True) # variance for all numeric columns

Survived                      0.237173
Pclass                        0.700003
Age                         199.428297
Siblings/Spouses Aboard       1.220293
Parents/Children Aboard       0.652001
Fare                       2478.251546
dtype: float64
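
Two details are worth checking with a quick sketch on made-up values: the variance is the square of the standard deviation, and pandas computes the sample variance (dividing by n - 1) by default, whereas NumPy divides by n unless told otherwise.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])

sample_var = ages.var()             # pandas default: ddof=1, divide by n - 1
population_var = ages.var(ddof=0)   # divide by n, which is NumPy's default
numpy_var = np.var(ages.to_numpy())

print(sample_var, ages.std() ** 2)
print(population_var, numpy_var)
```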

###### Show Covariance

In [124]:
titanic_df[['Age','Fare']].cov() # Covariance of Age and Fare
titanic_df.cov(numeric_only=True) # Covariance for all numeric columns

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,0.237173,-0.137121,-0.410343,-0.01995,0.031497,6.210808
Pclass,-0.137121,0.700003,-4.625577,0.078584,0.013681,-22.862898
Age,-0.410343,-4.625577,199.428297,-4.643648,-2.209222,78.968988
Siblings/Spouses Aboard,-0.01995,0.078584,-4.643648,1.220293,0.369498,8.734998
Parents/Children Aboard,0.031497,0.013681,-2.209222,0.369498,0.652001,8.661314
Fare,6.210808,-22.862898,78.968988,8.734998,8.661314,2478.251546


###### Show Correlation
Correlation measures the strength and direction of the relationship between two variables.

###### 1. Pearson Correlation
Measures the strength of the linear relationship between two variables, on a scale from -1 to 1. The Pearson correlation coefficient is the default correlation method for a pandas DataFrame. NOTE: The coefficient captures only linear association; significance tests based on it assume approximately normally distributed data, and it is sensitive to outliers.

In [132]:
titanic_df.corr(method='pearson', numeric_only=True) # numeric_only requires pandas 1.5+
titanic_df.corr(numeric_only=True) # or omit method, since 'pearson' is the default

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,1.0,-0.336528,-0.059665,-0.037082,0.080097,0.256179
Pclass,-0.336528,1.0,-0.391492,0.085026,0.020252,-0.548919
Age,-0.059665,-0.391492,1.0,-0.297669,-0.193741,0.112329
Siblings/Spouses Aboard,-0.037082,0.085026,-0.297669,1.0,0.414244,0.158839
Parents/Children Aboard,0.080097,0.020252,-0.193741,0.414244,1.0,0.21547
Fare,0.256179,-0.548919,0.112329,0.158839,0.21547,1.0
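
The Pearson coefficient is the covariance divided by the product of the two standard deviations. With the figures printed above for the full dataset, 78.968988 / (14.121908 × 49.782040) is approximately 0.1123, which agrees with the Age/Fare entry in the table. The sketch below checks the same identity on the five rows shown by `head()`; the variable names are illustrative only:

```python
import pandas as pd

# Age and Fare from the first five rows of the dataset
df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

cov_age_fare = df["Age"].cov(df["Fare"])
manual_r = cov_age_fare / (df["Age"].std() * df["Fare"].std())
pandas_r = df["Age"].corr(df["Fare"])   # Pearson by default

# Same formula applied to the full-dataset figures printed in this notebook
reported_r = 78.968988 / (14.121908 * 49.782040)

print(manual_r, pandas_r, reported_r)
```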


###### 2. Spearman Rank Correlation
Measures the monotonic relationship between two variables by correlating their ranks. It does not assume that the data is normally distributed. Ranking requires a sort, so the computation takes O(n log n) time.

In [135]:
titanic_df.corr(method='spearman', numeric_only=True)

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,1.0,-0.337648,-0.030265,0.086571,0.13653,0.322264
Pclass,-0.337648,1.0,-0.387982,-0.040348,-0.020617,-0.688234
Age,-0.030265,-0.387982,1.0,-0.199269,-0.254234,0.156062
Siblings/Spouses Aboard,0.086571,-0.040348,-0.199269,1.0,0.449198,0.44598
Parents/Children Aboard,0.13653,-0.020617,-0.254234,0.449198,1.0,0.409202
Fare,0.322264,-0.688234,0.156062,0.44598,0.409202,1.0
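
Spearman correlation can be read as Pearson correlation applied to ranks. A minimal sketch on made-up data (a hypothetical cubic relationship, not from the Titanic file) shows a case where the relationship is perfectly monotonic but not linear:

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Pearson computed on the ranks reproduces the Spearman value
pearson_on_ranks = df.rank().corr(method="pearson").loc["x", "y"]

print(pearson, spearman, pearson_on_ranks)
```

Here Spearman is 1 because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the points do not lie on a line.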


###### 3. Kendall Rank Correlation
Measures the monotonic relationship between two variables by comparing the ordering of every pair of observations. It does not assume that the data is normally distributed. A direct computation over all pairs takes O(n^2) time, so it tends to be a bit slower on large datasets.

In [134]:
titanic_df.corr(method='kendall', numeric_only=True)

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,1.0,-0.321558,-0.024993,0.083671,0.132233,0.264998
Pclass,-0.321558,1.0,-0.308253,-0.037085,-0.018995,-0.573648
Age,-0.024993,-0.308253,1.0,-0.15609,-0.200868,0.107466
Siblings/Spouses Aboard,0.083671,-0.037085,-0.15609,1.0,0.424367,0.357309
Parents/Children Aboard,0.132233,-0.018995,-0.200868,0.424367,1.0,0.329609
Fare,0.264998,-0.573648,0.107466,0.357309,0.329609,1.0
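
To make the pairwise idea concrete, here is a plain-Python sketch of the simplest variant, Kendall's tau-a, on made-up data. The function `kendall_tau_a` is a hypothetical helper written for illustration; it applies no tie correction, so on data with ties it will generally differ from the tie-corrected values a library reports:

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):   # visits every pair: O(n^2)
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1      # both variables move the same way
        elif direction < 0:
            discordant += 1      # the variables move in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: 8 concordant and 2 discordant pairs out of 10
tau = kendall_tau_a([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(tau)
```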


###### Calculate Kurtosis

In [127]:
titanic_df.kurtosis(numeric_only=True)

Survived                   -1.782183
Pclass                     -1.288638
Age                         0.292559
Siblings/Spouses Aboard    17.797537
Parents/Children Aboard     9.723066
Fare                       33.264605
dtype: float64

###### Calculate Skew

In [128]:
titanic_df.skew(numeric_only=True)

Survived                   0.470999
Pclass                    -0.623409
Age                        0.447189
Siblings/Spouses Aboard    3.686760
Parents/Children Aboard    2.741198
Fare                       4.777671
dtype: float64
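
As a rough guide to reading these numbers: positive skew indicates a longer right tail, and pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores 0 and heavy-tailed data scores above 0. The sketch below contrasts two made-up samples (hypothetical values) that differ only in one extreme observation:

```python
import pandas as pd

# Hypothetical samples: evenly spread values vs. the same values with one extreme point
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
right_tailed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

print(symmetric.skew(), symmetric.kurt())         # no skew, flatter than normal
print(right_tailed.skew(), right_tailed.kurt())   # long right tail, heavy tail
print(right_tailed.mean(), right_tailed.median()) # the outlier pulls the mean above the median
```

This is consistent with the large skew and kurtosis reported above for `Fare`, where a few very expensive tickets stretch the right tail.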

###### Compute Percent change
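Calculates the change between each value and the value `periods` rows earlier, so the first `periods` rows are NaN. The result is a fraction, not multiplied by 100 (6.324138 means an increase of roughly 632%). Handle missing values (nulls) before computing the percent change.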

In [138]:
titanic_df['Fare'].pct_change(periods=3)

0           NaN
1           NaN
2           NaN
3      6.324138
4     -0.887070
         ...   
882    0.238095
883    3.255319
884   -0.194850
885    1.307692
886   -0.741667
Name: Fare, Length: 887, dtype: float64
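
The same result can be written with `shift`, which makes the definition explicit: each value is divided by the value three rows earlier, minus 1. A small sketch using the five fares shown by `head()` (variable names are illustrative):

```python
import pandas as pd

# Fares from the first five rows of the dataset
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

via_pct_change = fare.pct_change(periods=3)
via_shift = fare / fare.shift(3) - 1   # current value relative to the value 3 rows back

print(via_pct_change)
print(via_shift)
```

Rows 3 and 4 reproduce the values 6.324138 and -0.887070 shown in the output above.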

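###### Rank data
Ranks the values in each column in ascending order. Tied values share the average of the ranks they span (the default `method='average'`), which is why fractional ranks such as 716.5 appear.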
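
How ties are ranked is controlled by the `method` argument. A minimal sketch on four made-up fares (hypothetical values) compares three of the available options:

```python
import pandas as pd

# Hypothetical fares: two passengers share the lowest fare
fares = pd.Series([7.25, 8.05, 7.25, 53.1])

average_rank = fares.rank()                # default: ties share the average rank
min_rank = fares.rank(method="min")        # ties both get the lowest rank of the group
dense_rank = fares.rank(method="dense")    # like "min", but ranks have no gaps after ties

print(average_rank.tolist())
print(min_rank.tolist())
print(dense_rank.tolist())
```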

In [142]:
titanic_df.rank()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,273.0,644.0,602.0,601.0,277.0,709.0,337.5,77.0
1,716.5,108.5,823.0,157.5,667.5,709.0,337.5,785.0
2,716.5,644.0,172.0,157.5,392.0,302.5,337.5,229.5
3,716.5,108.5,814.0,157.5,615.0,709.0,337.5,744.0
4,273.0,644.0,733.0,601.0,615.0,302.5,337.5,261.0
...,...,...,...,...,...,...,...,...
882,273.0,308.5,883.0,601.0,415.5,302.5,337.5,404.5
883,716.5,108.5,188.0,157.5,183.0,302.5,337.5,650.5
884,273.0,644.0,100.0,157.5,55.0,709.0,832.5,542.5
885,716.5,108.5,525.0,601.0,392.0,302.5,337.5,650.5
