## Import pandas

In [1]:
import pandas as pd
import numpy as np

# 1. Read Data

###### Read csv data
A comma-separated values (CSV) file is a delimited text file that uses commas to separate values; each line of the file is a data record (https://en.wikipedia.org/wiki/Comma-separated_values). In pandas, the pandas.read_csv() function reads CSV data. Its most useful arguments include filepath_or_buffer, which specifies the file path; sep, the delimiter to use; and engine, which selects the parser: C is faster but supports fewer features, while Python is slower but more feature-complete.
Other common arguments are usecols, which defines the columns to read; nrows, which limits the number of rows read; and chunksize, which sets how many rows are returned at a time when the file is read in pieces.
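
As a minimal sketch of the arguments described above, the cell below reads a small in-memory CSV instead of a file on disk. The data, the `;` delimiter and the variable names are made up for illustration and are not part of the Titanic dataset.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a file on disk (illustrative data).
csv_text = "id;name;age;fare\n1;Ann;22;7.25\n2;Bob;38;71.28\n3;Cy;26;7.92\n4;Di;35;53.10\n"

# sep sets the delimiter, usecols picks the columns, nrows caps the rows read.
subset_df = pd.read_csv(io.StringIO(csv_text), sep=";", usecols=["name", "fare"], nrows=3)
print(subset_df)

# chunksize returns an iterator of DataFrames instead of a single DataFrame.
chunk_sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), sep=";", chunksize=2)]
print(chunk_sizes)
```

Reading in chunks is mainly useful when the file is too large to fit in memory comfortably; each chunk can be processed and discarded in turn.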


In [2]:
titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


###### Read Excel data

In [3]:
titanic_excel_df=pd.read_excel('titanic.xlsx','Sheet1')
titanic_excel_df

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Read HTML file
The HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser (https://en.wikipedia.org/wiki/HTML). pandas.read_html() extracts the tables found in an HTML page and returns them as a list of DataFrames, so the example below selects one table by its position in that list.

In [4]:
gdp_df=pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)#Table')[2]
gdp_df.head()

Unnamed: 0_level_0,Country/Territory,Region,IMF[1],IMF[1],United Nations[12],United Nations[12],World Bank[13][14],World Bank[13][14]
Unnamed: 0_level_1,Country/Territory,Region,Estimate,Year,Estimate,Year,Estimate,Year
0,United States,Americas,22675271.0,2021,21433226,2019,20936600.0,2020
1,China,Asia,16642318.0,[n 2]2021,14342933,[n 3]2019,14722731.0,2020
2,Japan,Asia,5378136.0,2021,5082465,2019,4975415.0,2020
3,Germany,Europe,4319286.0,2021,3861123,2019,3806060.0,2020
4,United Kingdom,Europe,3124650.0,2021,2826441,2019,2707744.0,2020


###### Read Json Data

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write (https://www.json.org/json-en.html). The pd.read_json() function converts JSON data to a pandas DataFrame. Its important arguments include path_or_buf, the location of the data source; chunksize, which is useful when fetching large data; and encoding, the encoding used to decode the data.
orient defines the expected JSON format and can take any of index, split, records, columns or values. Check the pandas documentation for more details: https://pandas-docs.github.io/pandas-docs-travis/reference/api/pandas.read_json.html
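
To make the orient argument more concrete, here is a small sketch that reads the same two-row table from two different JSON layouts. The JSON strings are invented for illustration and do not come from the exchange-rate API used below.

```python
import io

import pandas as pd

# Two illustrative JSON layouts describing the same two-row table.
records_json = '[{"code": "AED", "rate": 3.67}, {"code": "AFN", "rate": 96.25}]'
columns_json = '{"code": {"0": "AED", "1": "AFN"}, "rate": {"0": 3.67, "1": 96.25}}'

# orient tells pandas which layout to expect.
records_df = pd.read_json(io.StringIO(records_json), orient="records")
columns_df = pd.read_json(io.StringIO(columns_json), orient="columns")

print(records_df)
print(columns_df)
```

Wrapping the string in io.StringIO is a deliberate choice here: it behaves like a file, which keeps the sketch independent of any file on disk.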


In [5]:
# json_df=pd.read_json("https://raw.githubusercontent.com/BindiChen/machine-learning/master/data-analysis/027-pandas-convert-json/data/simple.json")
json_exchange_rate=pd.read_json("https://api.exchangerate-api.com/v4/latest/USD")
json_exchange_rate=json_exchange_rate[['provider','base','date','time_last_updated','rates']].head()
json_exchange_rate

Unnamed: 0,provider,base,date,time_last_updated,rates
AED,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,3.67
AFN,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,96.25
ALL,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,107.26
AMD,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,489.44
ANG,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,1.79


###### Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language (https://parquet.apache.org/). The pandas.read_parquet() function reads a Parquet file into a pandas DataFrame. The format provides a partitioned binary columnar serialization for data frames.

First we will create a Parquet file and then demonstrate how to read it in pandas.

In [6]:
# Uncomment this command to install pyarrow
# !pip install pyarrow

In [7]:
titanic_df.to_parquet('parquet_sample_data.parquet', engine='pyarrow')

In [8]:
# Read parquet file
parquet_df=pd.read_parquet('parquet_sample_data.parquet', engine='pyarrow')
parquet_df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


###### Read Pickle file
Pickle is a file format that serializes Python objects to a byte stream. In machine learning, pickle files are often used to share models and deploy them in production. Let's first create a pickle file and then demonstrate how to read it in pandas.

In [9]:
json_exchange_rate.to_pickle('pickle_sample_data.pkl')
pickle_df=pd.read_pickle('pickle_sample_data.pkl')
pickle_df.head()

Unnamed: 0,provider,base,date,time_last_updated,rates
AED,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,3.67
AFN,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,96.25
ALL,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,107.26
AMD,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,489.44
ANG,https://www.exchangerate-api.com,USD,2021-12-06,1638748801,1.79
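
The round trip above can also be sketched on a tiny, made-up DataFrame written to a temporary directory, which makes it easy to check that nothing is lost. The data and file name here are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Illustrative data with a non-default index.
rates_df = pd.DataFrame({"base": ["USD", "USD"], "rate": [3.67, 96.25]}, index=["AED", "AFN"])

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "rates.pkl")
    rates_df.to_pickle(pickle_path)
    restored_df = pd.read_pickle(pickle_path)

# The index and dtypes come back unchanged after the round trip.
print(restored_df.equals(rates_df))
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.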


###### Read SQL Data
SQL is a popular language for querying and manipulating data in databases. In data science and analytics, understanding SQL is a key skill for working with data efficiently and comfortably. Modern databases such as Oracle, MSSQL, GBQ and Amazon Redshift have advanced beyond being data storage containers and now provide capabilities such as writing advanced in-database machine learning models through SQL. Pandas enables us to read data from a database through SQL. To do so, we first need to install the Python driver for the respective database.

In [10]:
# Let's first create an sqlite database and save titanic dataset to it
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
titanic_df.to_sql('titanic_sqlite_data', engine, chunksize=100)

In [11]:
titanic_sqlite_df=pd.read_sql_table('titanic_sqlite_data', engine)
titanic_sqlite_df.head()

Unnamed: 0,index,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
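
Besides reading a whole table, pandas can also run a SQL query and load only the result. The sketch below uses Python's built-in sqlite3 module rather than SQLAlchemy, and the table name, columns and rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# Illustrative data, not the Titanic dataset.
passengers_df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "pclass": [1, 3, 3], "fare": [71.28, 7.25, 8.05]}
)

# The standard-library sqlite3 connection works without extra drivers.
conn = sqlite3.connect(":memory:")
passengers_df.to_sql("passengers", conn, index=False)

# Filter in SQL so that only the matching rows are loaded into pandas.
third_class_df = pd.read_sql_query(
    "SELECT name, fare FROM passengers WHERE pclass = ? ORDER BY fare",
    conn,
    params=(3,),
)
conn.close()
print(third_class_df)
```

The extra `index` column in the output above appears because to_sql() writes the DataFrame index by default; passing index=False, as in this sketch, skips it.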


# 2. Save Data

###### Save data to csv

In [12]:
json_exchange_rate.to_csv("Exchange_Rate.csv")

###### Save data to excel

In [13]:
json_exchange_rate.to_excel("Exchange_Rate.xlsx")

###### Save data to html

In [14]:
titanic_df.head().to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Survived</th>\n      <th>Pclass</th>\n      <th>Name</th>\n      <th>Sex</th>\n      <th>Age</th>\n      <th>Siblings/Spouses Aboard</th>\n      <th>Parents/Children Aboard</th>\n      <th>Fare</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>3</td>\n      <td>Mr. Owen Harris Braund</td>\n      <td>male</td>\n      <td>22.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>7.2500</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>1</td>\n      <td>Mrs. John Bradley (Florence Briggs Thayer) Cumings</td>\n      <td>female</td>\n      <td>38.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>71.2833</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1</td>\n      <td>3</td>\n      <td>Miss. Laina Heikkinen</td>\n      <td>female</td>\n      <td>26.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td

###### Save data to json

In [15]:
titanic_df.head().to_json(orient='columns')

'{"Survived":{"0":0,"1":1,"2":1,"3":1,"4":0},"Pclass":{"0":3,"1":1,"2":3,"3":1,"4":3},"Name":{"0":"Mr. Owen Harris Braund","1":"Mrs. John Bradley (Florence Briggs Thayer) Cumings","2":"Miss. Laina Heikkinen","3":"Mrs. Jacques Heath (Lily May Peel) Futrelle","4":"Mr. William Henry Allen"},"Sex":{"0":"male","1":"female","2":"female","3":"female","4":"male"},"Age":{"0":22.0,"1":38.0,"2":26.0,"3":35.0,"4":35.0},"Siblings\\/Spouses Aboard":{"0":1,"1":1,"2":0,"3":1,"4":0},"Parents\\/Children Aboard":{"0":0,"1":0,"2":0,"3":0,"4":0},"Fare":{"0":7.25,"1":71.2833,"2":7.925,"3":53.1,"4":8.05}}'

###### Save data to Parquet

In [16]:
titanic_df.to_parquet('parquet_titanic_data.parquet', engine='pyarrow')

###### Save data to pickle

In [17]:
titanic_df.to_pickle('pickle_titanic_data.pkl')

###### Save data to SQLite

In [18]:
json_exchange_rate.to_sql('exchange_rate_sqlite_data', engine, chunksize=100)

# 3. Inspecting Data

###### Show first 3 rows of data

In [19]:
titanic_df.head(3)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925


###### Show last 3 rows of data

In [20]:
titanic_df.tail(3)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.45
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0
886,0,3,Mr. Patrick Dooley,male,32.0,0,0,7.75


###### Show number of rows and columns

In [21]:
titanic_df.shape

(887, 8)

###### Show data types, columns and memory usage

In [22]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    int64  
 1   Pclass                   887 non-null    int64  
 2   Name                     887 non-null    object 
 3   Sex                      887 non-null    object 
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64  
 6   Parents/Children Aboard  887 non-null    int64  
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB


###### Show data types

In [23]:
titanic_df.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

###### Show all columns

In [24]:
titanic_df.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')

###### Show statistical summary

In [25]:
titanic_df.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [26]:
titanic_df.describe(include='all')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887,887,887.0,887.0,887.0,887.0
unique,,,887,2,,,,
top,,,Miss. Mari Aina Jussila,male,,,,
freq,,,1,573,,,,
mean,0.385569,2.305524,,,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,,,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,,,0.42,0.0,0.0,0.0
25%,0.0,2.0,,,20.25,0.0,0.0,7.925
50%,0.0,3.0,,,28.0,0.0,0.0,14.4542
75%,1.0,3.0,,,38.0,1.0,0.0,31.1375


###### Show unique values and their frequency

In [27]:
titanic_df['Sex'].value_counts()

male      573
female    314
Name: Sex, dtype: int64

###### Return number of dimensions of the data frame

In [28]:
titanic_df.ndim

2

###### Show number of elements in the data frame

In [29]:
titanic_df.size

7096

###### Rename columns

In [30]:
gdp_df.columns=['country','continent','imf_estimate','imf_year','un_estimate','un_year','wb_estimate','wb_year']
gdp_df.head()

Unnamed: 0,country,continent,imf_estimate,imf_year,un_estimate,un_year,wb_estimate,wb_year
0,United States,Americas,22675271.0,2021,21433226,2019,20936600.0,2020
1,China,Asia,16642318.0,[n 2]2021,14342933,[n 3]2019,14722731.0,2020
2,Japan,Asia,5378136.0,2021,5082465,2019,4975415.0,2020
3,Germany,Europe,4319286.0,2021,3861123,2019,3806060.0,2020
4,United Kingdom,Europe,3124650.0,2021,2826441,2019,2707744.0,2020


###### Create new column

In [232]:
gdp_df['imf_estimate_log']=np.log(gdp_df['imf_estimate'])
gdp_df.head()

Unnamed: 0,country,continent,imf_estimate,imf_year,un_estimate,un_year,wb_estimate,wb_year,imf_estimate_log
0,United States,Americas,22675271.0,2021,21433226,2019,20936600.0,2020,16.936786
1,China,Asia,16642318.0,[n 2]2021,14342933,[n 3]2019,14722731.0,2020,16.627459
2,Japan,Asia,5378136.0,2021,5082465,2019,4975415.0,2020,15.497852
3,Germany,Europe,4319286.0,2021,3861123,2019,3806060.0,2020,15.278601
4,United Kingdom,Europe,3124650.0,2021,2826441,2019,2707744.0,2020,14.954833


# 4. Pandas Data Type Conversions

In [31]:
datatypes_df = pd.DataFrame(
    {
        "Students": ["Tom", "Peter", "Mary", "Smith"],
        "Reg_No": [1790, 1731, 1780, 1755],
        "Reg_Date": ["15/01/2021", "16/01/2021", "19/01/2021", "27/01/2021"],
        "Math": ["79.00", "67.00", "84.00", "70.00"],
        "Physics": ["60", "70", "50", "90"],
        "Computer": ["65.80", "80", "70", "75"],
    }
)

datatypes_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,15/01/2021,79.0,60,65.8
1,Peter,1731,16/01/2021,67.0,70,80.0
2,Mary,1780,19/01/2021,84.0,50,70.0
3,Smith,1755,27/01/2021,70.0,90,75.0


###### Check for data types

In [32]:
datatypes_df.dtypes

Students    object
Reg_No       int64
Reg_Date    object
Math        object
Physics     object
Computer    object
dtype: object

###### Convert object to integer
Let's convert the Physics column to integer

In [33]:
datatypes_df['Physics']=datatypes_df['Physics'].astype(int) # np.int was removed in NumPy 1.24; the built-in int is equivalent

In [34]:
datatypes_df.dtypes

Students    object
Reg_No       int64
Reg_Date    object
Math        object
Physics      int32
Computer    object
dtype: object

###### Convert object to float
Let's convert the Math and Computer columns to float

In [35]:
datatypes_df[['Math','Computer']]=datatypes_df[['Math','Computer']].astype(float) # np.float was removed in NumPy 1.24; the built-in float is equivalent

In [36]:
datatypes_df.dtypes

Students     object
Reg_No        int64
Reg_Date     object
Math        float64
Physics       int32
Computer    float64
dtype: object
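
astype() raises an error as soon as one value cannot be converted. When a column may contain such values, pd.to_numeric() with errors="coerce" is one possible alternative. The sketch below uses made-up scores, including a deliberately invalid entry.

```python
import pandas as pd

# Illustrative scores stored as text; "n/a" cannot be parsed as a number.
scores = pd.Series(["65.80", "80", "n/a", "75"])

# astype(float) would raise on "n/a"; to_numeric can coerce it to NaN instead.
clean_scores = pd.to_numeric(scores, errors="coerce")
print(clean_scores)
```

Coercing is convenient for exploration, but it silently hides bad input, so it is worth counting the resulting NaN values before moving on.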

###### Convert object to Date
Let's convert the Reg_Date column to a valid pandas datetime

In [37]:
# datatypes_df['Reg_Date']=pd.to_datetime(datatypes_df['Reg_Date']) # Without format, pandas infers the date layout
datatypes_df['Reg_Date']=pd.to_datetime(datatypes_df['Reg_Date'],format='%d/%m/%Y') # format matches day/month/year strings such as 15/01/2021

In [38]:
datatypes_df.dtypes

Students            object
Reg_No               int64
Reg_Date    datetime64[ns]
Math               float64
Physics              int32
Computer           float64
dtype: object
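
As a small sketch of how the format argument behaves with imperfect data, the cell below parses day/month/year strings and turns an unparseable entry into NaT. The sample values are invented for illustration.

```python
import pandas as pd

# Illustrative day/month/year strings plus one invalid entry.
raw_dates = pd.Series(["15/01/2021", "16/01/2021", "not a date"])

# format spells out day/month/year; errors="coerce" turns bad values into NaT.
parsed_dates = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")
print(parsed_dates)
```

Without errors="coerce", the invalid entry would raise an error instead, which is often the safer default when the data is expected to be clean.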

###### Convert int to String/object
Let's convert the Reg_No column to string/object

In [39]:
datatypes_df['Reg_No']=datatypes_df['Reg_No'].astype(str)

In [40]:
datatypes_df.dtypes

Students            object
Reg_No              object
Reg_Date    datetime64[ns]
Math               float64
Physics              int32
Computer           float64
dtype: object

###### Convert float to integer
Let's convert the Math column to integer

In [41]:
datatypes_df['Math']=datatypes_df['Math'].astype(int) # np.int was removed in NumPy 1.24; the built-in int is equivalent

In [42]:
datatypes_df.dtypes

Students            object
Reg_No              object
Reg_Date    datetime64[ns]
Math                 int32
Physics              int32
Computer           float64
dtype: object

In [43]:
datatypes_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,2021-01-15,79,60,65.8
1,Peter,1731,2021-01-16,67,70,80.0
2,Mary,1780,2021-01-19,84,50,70.0
3,Smith,1755,2021-01-27,70,90,75.0


# 5. Select, Sort and Filter Data

###### Select specific columns and all records

In [44]:
# Select one column only
titanic_df['Survived'] 
# Use dot syntax if the column name is a valid Python identifier (for example, no spaces)
titanic_df.Survived
# Select more than one column
titanic_df[['Survived','Pclass','Age']]

Unnamed: 0,Survived,Pclass,Age
0,0,3,22.0
1,1,1,38.0
2,1,3,26.0
3,1,1,35.0
4,0,3,35.0
...,...,...,...
882,0,2,27.0
883,1,1,19.0
884,0,3,7.0
885,1,1,26.0


In [45]:
titanic_df.head(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
5,0,3,Mr. James Moran,male,27.0,0,0,8.4583
6,0,1,Mr. Timothy J McCarthy,male,54.0,0,0,51.8625
7,0,3,Master. Gosta Leonard Palsson,male,2.0,3,1,21.075
8,1,3,Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson,female,27.0,0,2,11.1333
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708


###### Select data by position

In [46]:
titanic_df.iloc[:10] # select first 10 records
titanic_df.iloc[-10:] # select last 10 records
titanic_df.iloc[5:8] # select records at positions 5 to 7 (5 included, 8 excluded)
titanic_df.iloc[0:-10] # select all records except the last 10
titanic_df.iloc[0:,0:2] # select all records and the first two columns
titanic_df.iloc[0:,3:5] # select all records and the columns at positions 3 and 4
titanic_df.iloc[:5,0:-1] # select first 5 records and all columns except the last one

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0
4,0,3,Mr. William Henry Allen,male,35.0,0,0
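
Position-based selection with iloc is easy to confuse with label-based selection with loc, especially because the two treat the end of a slice differently. The sketch below shows the difference on a small, made-up DataFrame with a non-default index.

```python
import pandas as pd

# Illustrative data with index labels that differ from the row positions.
people_df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy", "Di"], "age": [22, 38, 26, 35]},
    index=[10, 20, 30, 40],
)

# iloc slices by position: the end position is excluded.
by_position = people_df.iloc[1:3]

# loc slices by index label: the end label is included.
by_label = people_df.loc[20:40]

print(by_position)
print(by_label)
```

With the default RangeIndex used by titanic_df, positions and labels coincide, which is why the difference only becomes visible after filtering, sorting or setting a custom index.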


###### Sort data ascending

In [47]:
titanic_df.sort_values(by='Age').head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
799,1,3,Master. Assad Alexander Thomas,male,0.42,0,1,8.5167
751,1,2,Master. Viljo Hamalainen,male,0.67,1,1,14.5
641,1,3,Miss. Eugenie Baclini,female,0.75,2,1,19.2583
466,1,3,Miss. Helene Barbara Baclini,female,0.75,2,1,19.2583
827,1,2,Master. George Sibley Richards,male,0.83,1,1,18.75


###### Sort descending

In [48]:
titanic_df.sort_values(by='Age', ascending=False).head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
627,1,1,Mr. Algernon Henry Wilson Barkworth,male,80.0,0,0,30.0
847,0,3,Mr. Johan Svensson,male,74.0,0,0,7.775
490,0,1,Mr. Ramon Artagaveytia,male,71.0,0,0,49.5042
95,0,1,Mr. George B Goldschmidt,male,71.0,0,0,34.6542
115,0,3,Mr. Patrick Connors,male,70.5,0,0,7.75


###### Sort data by two columns

In [49]:
titanic_df.sort_values(by=['Age','Fare'], ascending=[True,False]).head() # Sort by Age ascending, then by Fare descending

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
799,1,3,Master. Assad Alexander Thomas,male,0.42,0,1,8.5167
751,1,2,Master. Viljo Hamalainen,male,0.67,1,1,14.5
466,1,3,Miss. Helene Barbara Baclini,female,0.75,2,1,19.2583
641,1,3,Miss. Eugenie Baclini,female,0.75,2,1,19.2583
77,1,2,Master. Alden Gates Caldwell,male,0.83,0,2,29.0


###### Filter data where age > 50

In [50]:
titanic_df[titanic_df['Age']>50].head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
6,0,1,Mr. Timothy J McCarthy,male,54.0,0,0,51.8625
11,1,1,Miss. Elizabeth Bonnell,female,58.0,0,0,26.55
15,1,2,Mrs. (Mary D Kingcome) Hewlett,female,55.0,0,0,16.0
33,0,2,Mr. Edward H Wheadon,male,66.0,0,0,10.5
53,0,1,Mr. Engelhart Cornelius Ostby,male,65.0,0,1,61.9792


###### Filter data where age is between 30 and 40

In [51]:
titanic_df[(titanic_df['Age']>30) & (titanic_df['Age']<40)].head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
13,0,3,Mr. Anders Johan Andersson,male,39.0,1,5,31.275
18,0,3,Mrs. Julius (Emelia Maria Vandemoortele) Vande...,female,31.0,1,0,18.0
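
The boolean mask above uses strict bounds, so ages of exactly 30 and 40 are excluded. As a sketch of two alternatives, the cell below compares the mask with Series.between(), which is inclusive by default, and with DataFrame.query(). The data is made up so that the boundary cases are visible.

```python
import pandas as pd

# Illustrative ages chosen to include both boundary values.
ages_df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Di"], "age": [30, 31, 39, 40]})

# Boolean mask with strict bounds, as in the cell above.
mask_df = ages_df[(ages_df["age"] > 30) & (ages_df["age"] < 40)]

# between() includes both end points by default.
between_df = ages_df[ages_df["age"].between(30, 40)]

# query() expresses the same strict filter as a string.
query_df = ages_df.query("age > 30 and age < 40")

print(mask_df)
print(between_df)
print(query_df)
```

Which form to use is mostly a matter of readability; the important part is being explicit about whether the end points are included.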


###### Using IN function

In [52]:
titanic_df[titanic_df['Sex'].isin(['male'])].head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
5,0,3,Mr. James Moran,male,27.0,0,0,8.4583
6,0,1,Mr. Timothy J McCarthy,male,54.0,0,0,51.8625
7,0,3,Master. Gosta Leonard Palsson,male,2.0,3,1,21.075


###### Using NOT IN

In [53]:
titanic_df[~titanic_df['Sex'].isin(['male'])].head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
8,1,3,Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson,female,27.0,0,2,11.1333
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708


# 6. Reshaping DataFrames and Pivot Tables

In [54]:
gdp_df.head()

Unnamed: 0,country,continent,imf_estimate,imf_year,un_estimate,un_year,wb_estimate,wb_year
0,United States,Americas,22675271.0,2021,21433226,2019,20936600.0,2020
1,China,Asia,16642318.0,[n 2]2021,14342933,[n 3]2019,14722731.0,2020
2,Japan,Asia,5378136.0,2021,5082465,2019,4975415.0,2020
3,Germany,Europe,4319286.0,2021,3861123,2019,3806060.0,2020
4,United Kingdom,Europe,3124650.0,2021,2826441,2019,2707744.0,2020


###### Groupby

In [55]:
groups=gdp_df.groupby(['continent'])
groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000227DB8D4BE0>

###### Iterate through groups

In [56]:
for i,j in groups:
    print(i)
#     print(j)

Africa
Americas
Asia
Europe
Oceania


###### Select groups

In [57]:
groups.get_group('Asia').head()

Unnamed: 0,country,continent,imf_estimate,imf_year,un_estimate,un_year,wb_estimate,wb_year
1,China,Asia,16642318.0,[n 2]2021,14342933,[n 3]2019,14722731.0,2020
2,Japan,Asia,5378136.0,2021,5082465,2019,4975415.0,2020
5,India,Asia,3049704.0,2021,2891582,2019,2622984.0,2020
9,South Korea,Asia,1806707.0,2021,1646539,2019,1630525.0,2020
15,Indonesia,Asia,1158783.0,2021,1119190,2019,1058424.0,2020


###### Groupby with aggregation

In [58]:
gdp_df.groupby(['continent']).aggregate(np.sum) # first approach
gdp_df.groupby(['continent']).sum() #second approach

Unnamed: 0_level_0,imf_estimate,un_estimate,wb_estimate
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Africa,2605979.0,2460570,2376597.0
Americas,29317319.0,28619051,27562046.0
Asia,36574093.0,33165597,32307365.0
Europe,23596659.0,21644976,20850775.0
Oceania,1895098.0,1638761,1582488.0


###### Groupby and passing multiple aggregation

In [59]:
gdp_df.groupby(['continent']).agg([np.sum,np.mean,np.std]).reset_index()

Unnamed: 0_level_0,continent,imf_estimate,imf_estimate,imf_estimate,un_estimate,un_estimate,un_estimate,wb_estimate,wb_estimate,wb_estimate
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,std,sum,mean,std,sum,mean,std
0,Africa,2605979.0,48258.87037,96793.19,2460570,44737.636364,89544.46,2376597.0,44011.055556,85798.16
1,Americas,29317319.0,814369.972222,3771881.0,28619051,622153.282609,3162716.0,27562046.0,640977.813953,3189404.0
2,Asia,36574093.0,731481.86,2462533.0,33165597,650305.823529,2125590.0,32307365.0,646147.3,2186434.0
3,Europe,23596659.0,575528.268293,963029.5,21644976,491931.272727,859351.7,20850775.0,473881.25,826905.3
4,Oceania,1895098.0,135364.142857,431433.7,1638761,96397.705882,334534.4,1582488.0,98905.5,332719.0
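
Passing a list of functions produces two-level column names, as in the output above. One alternative is named aggregation, which gives flat, self-chosen column names. The sketch below uses a tiny, made-up sample, and the output column names total_gdp and mean_gdp are illustrative choices.

```python
import pandas as pd

# Illustrative sample, not the real GDP table.
gdp_sample_df = pd.DataFrame(
    {
        "continent": ["Asia", "Asia", "Europe", "Europe"],
        "imf_estimate": [100.0, 300.0, 50.0, 150.0],
    }
)

# Each keyword is an output column: (input column, aggregation name).
summary_df = (
    gdp_sample_df.groupby("continent")
    .agg(
        total_gdp=("imf_estimate", "sum"),
        mean_gdp=("imf_estimate", "mean"),
    )
    .reset_index()
)
print(summary_df)
```

The aggregations are given as strings ("sum", "mean") rather than NumPy functions; both styles select the same pandas implementations here.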


###### Groupby with aggregation and reset index

In [60]:
gdp_df.groupby(['continent'],as_index=False).sum() #first approach
gdp_df.groupby(['continent']).sum().reset_index() #second approach

Unnamed: 0,continent,imf_estimate,un_estimate,wb_estimate
0,Africa,2605979.0,2460570,2376597.0
1,Americas,29317319.0,28619051,27562046.0
2,Asia,36574093.0,33165597,32307365.0
3,Europe,23596659.0,21644976,20850775.0
4,Oceania,1895098.0,1638761,1582488.0


###### Pivot
Reshape a DataFrame from long format to wide format.
NOTE: pivot() cannot handle repeated index/column combinations and raises an error when it finds them. For such data, use a pivot table instead.

In [61]:
continental_temperature_df=pd.read_csv('continental_temperature.csv')

In [62]:
# Add temperature in Fahrenheit
continental_temperature_df['AvgTemperature_FH']=continental_temperature_df['AvgTemperature']*(9/5)+32

continental_temperature_df.head()

Unnamed: 0,Region,Year,AvgTemperature,AvgTemperature_FH
0,Africa,1995,52.976737,127.358127
1,Africa,1996,48.34838,119.027084
2,Africa,1997,37.282495,99.108491
3,Africa,1998,30.327075,86.588735
4,Africa,1999,34.153869,93.476964


In [63]:
continental_temperature_df.pivot(index='Year',columns='Region',values=['AvgTemperature']).head()

Unnamed: 0_level_0,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature
Region,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean
Year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
1995,52.976737,51.184689,61.126347,38.634474,55.416305,54.884315,36.252988
1996,48.34838,60.293226,60.811293,36.53759,57.235675,51.071101,53.665272
1997,37.282495,57.963918,61.577032,41.291805,58.599941,53.795868,55.117885
1998,30.327075,53.736634,49.237078,40.936384,58.497397,54.517819,49.526641
1999,34.153869,55.595147,61.653699,42.250435,59.681566,55.64553,58.425118


###### Pivot with more than one measure

In [230]:
continental_temperature_df.pivot(index='Year',columns='Region',values=['AvgTemperature','AvgTemperature_FH']).head()

Unnamed: 0_level_0,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH
Region,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean
Year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2
1995,52.976737,51.184689,61.126347,38.634474,55.416305,54.884315,36.252988,127.358127,124.132439,142.027425,101.542053,131.749348,130.791768,97.255378
1996,48.34838,60.293226,60.811293,36.53759,57.235675,51.071101,53.665272,119.027084,140.527806,141.460328,97.767661,135.024215,123.927983,128.597489
1997,37.282495,57.963918,61.577032,41.291805,58.599941,53.795868,55.117885,99.108491,136.335052,142.838658,106.325249,137.479894,128.832563,131.212193
1998,30.327075,53.736634,49.237078,40.936384,58.497397,54.517819,49.526641,86.588735,128.725941,120.62674,105.68549,137.295315,130.132075,121.147954
1999,34.153869,55.595147,61.653699,42.250435,59.681566,55.64553,58.425118,93.476964,132.071264,142.976658,108.050784,139.426818,132.161954,137.165212


###### Pivot Table
Create a spreadsheet-style pivot table as a DataFrame. When index/column combinations are repeated, we specify an aggregation function to combine them.
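
The difference between pivot() and pivot_table() is easiest to see on data with a repeated combination. In the made-up sample below, the pair (1995, Africa) appears twice, so pivot() fails while pivot_table() averages the two values.

```python
import pandas as pd

# Illustrative long-format data with one duplicated (Year, Region) pair.
temps_df = pd.DataFrame(
    {
        "Year": [1995, 1995, 1995, 1996],
        "Region": ["Africa", "Africa", "Asia", "Africa"],
        "AvgTemperature": [50.0, 54.0, 51.0, 48.0],
    }
)

# pivot() refuses to reshape because (1995, Africa) appears twice.
try:
    temps_df.pivot(index="Year", columns="Region", values="AvgTemperature")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (mean here).
wide_df = temps_df.pivot_table(
    index="Year", columns="Region", values="AvgTemperature", aggfunc="mean"
)
print(pivot_failed)
print(wide_df)
```

Combinations that never occur in the data, such as (1996, Asia) in this sample, show up as NaN in the wide table.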

###### Pivot Table with one aggregation 

In [65]:
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values='AvgTemperature',aggfunc=np.mean).head()

Region,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1995,52.976737,51.184689,61.126347,38.634474,55.416305,54.884315,36.252988
1996,48.34838,60.293226,60.811293,36.53759,57.235675,51.071101,53.665272
1997,37.282495,57.963918,61.577032,41.291805,58.599941,53.795868,55.117885
1998,30.327075,53.736634,49.237078,40.936384,58.497397,54.517819,49.526641
1999,34.153869,55.595147,61.653699,42.250435,59.681566,55.64553,58.425118


###### Pivot Table with Totals

In [231]:
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values='AvgTemperature',
               aggfunc=np.mean,margins=True,margins_name='Mean')

Region,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean,Mean
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1995,52.976737,51.184689,61.126347,38.634474,55.416305,54.884315,36.252988,50.067979
1996,48.34838,60.293226,60.811293,36.53759,57.235675,51.071101,53.665272,52.566077
1997,37.282495,57.963918,61.577032,41.291805,58.599941,53.795868,55.117885,52.232706
1998,30.327075,53.736634,49.237078,40.936384,58.497397,54.517819,49.526641,48.11129
1999,34.153869,55.595147,61.653699,42.250435,59.681566,55.64553,58.425118,52.48648
2000,30.256454,54.118454,61.784699,46.42167,60.211397,55.084793,61.089126,52.709513
2001,39.146197,60.571804,61.309498,44.922228,72.698571,55.915848,64.430904,56.999293
2002,34.032093,59.585393,59.302055,43.831409,71.000646,53.367261,62.904055,54.860416
2003,34.280538,63.260892,60.405982,43.504275,73.234932,54.524121,60.304953,55.645099
2004,41.053458,62.384676,61.856512,47.620571,72.056284,55.114735,57.01471,56.728707


###### Pivot Table with multiple aggregations

In [67]:
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values=['AvgTemperature','AvgTemperature_FH'],
               aggfunc={'AvgTemperature':np.mean,'AvgTemperature_FH':[np.min,np.max,np.mean,np.std]}).head()

Unnamed: 0_level_0,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH,AvgTemperature_FH
Unnamed: 0_level_1,mean,mean,mean,mean,mean,mean,mean,amax,amax,amax,...,amin,amin,amin,mean,mean,mean,mean,mean,mean,mean
Region,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean,Africa,Asia,Australia/South Pacific,...,Middle East,North America,South/Central America & Carribean,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean
Year,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
1995,52.976737,51.184689,61.126347,38.634474,55.416305,54.884315,36.252988,127.358127,124.132439,142.027425,...,131.749348,130.791768,97.255378,127.358127,124.132439,142.027425,101.542053,131.749348,130.791768,97.255378
1996,48.34838,60.293226,60.811293,36.53759,57.235675,51.071101,53.665272,119.027084,140.527806,141.460328,...,135.024215,123.927983,128.597489,119.027084,140.527806,141.460328,97.767661,135.024215,123.927983,128.597489
1997,37.282495,57.963918,61.577032,41.291805,58.599941,53.795868,55.117885,99.108491,136.335052,142.838658,...,137.479894,128.832563,131.212193,99.108491,136.335052,142.838658,106.325249,137.479894,128.832563,131.212193
1998,30.327075,53.736634,49.237078,40.936384,58.497397,54.517819,49.526641,86.588735,128.725941,120.62674,...,137.295315,130.132075,121.147954,86.588735,128.725941,120.62674,105.68549,137.295315,130.132075,121.147954
1999,34.153869,55.595147,61.653699,42.250435,59.681566,55.64553,58.425118,93.476964,132.071264,142.976658,...,139.426818,132.161954,137.165212,93.476964,132.071264,142.976658,108.050784,139.426818,132.161954,137.165212


In [68]:
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values=['AvgTemperature','AvgTemperature_FH'],
               aggfunc={'AvgTemperature':np.mean,'AvgTemperature_FH':[np.min,np.max,np.mean,np.std]}).head().T

Unnamed: 0_level_0,Unnamed: 1_level_0,Year,1995,1996,1997,1998,1999
Unnamed: 0_level_1,Unnamed: 1_level_1,Region,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AvgTemperature,mean,Africa,52.976737,48.34838,37.282495,30.327075,34.153869
AvgTemperature,mean,Asia,51.184689,60.293226,57.963918,53.736634,55.595147
AvgTemperature,mean,Australia/South Pacific,61.126347,60.811293,61.577032,49.237078,61.653699
AvgTemperature,mean,Europe,38.634474,36.53759,41.291805,40.936384,42.250435
AvgTemperature,mean,Middle East,55.416305,57.235675,58.599941,58.497397,59.681566
AvgTemperature,mean,North America,54.884315,51.071101,53.795868,54.517819,55.64553
AvgTemperature,mean,South/Central America & Carribean,36.252988,53.665272,55.117885,49.526641,58.425118
AvgTemperature_FH,amax,Africa,127.358127,119.027084,99.108491,86.588735,93.476964
AvgTemperature_FH,amax,Asia,124.132439,140.527806,136.335052,128.725941,132.071264
AvgTemperature_FH,amax,Australia/South Pacific,142.027425,141.460328,142.838658,120.62674,142.976658


###### Stack

In [69]:
# Let's first create a dataframe that we can stack
continental_temp_df=pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],
                                   values='AvgTemperature',aggfunc=np.mean)
continental_temp_df.head()

Region,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1995,52.976737,51.184689,61.126347,38.634474,55.416305,54.884315,36.252988
1996,48.34838,60.293226,60.811293,36.53759,57.235675,51.071101,53.665272
1997,37.282495,57.963918,61.577032,41.291805,58.599941,53.795868,55.117885
1998,30.327075,53.736634,49.237078,40.936384,58.497397,54.517819,49.526641
1999,34.153869,55.595147,61.653699,42.250435,59.681566,55.64553,58.425118


In [70]:
# Let's now stack the above dataframe with Year and Region
continental_temp_df.stack(0).reset_index().head()

Unnamed: 0,Year,Region,0
0,1995,Africa,52.976737
1,1995,Asia,51.184689
2,1995,Australia/South Pacific,61.126347
3,1995,Europe,38.634474
4,1995,Middle East,55.416305


###### Unstack

In [71]:
stacked_df=gdp_df.groupby(['continent']).sum()
stacked_df.head()

Unnamed: 0_level_0,imf_estimate,un_estimate,wb_estimate
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Africa,2605979.0,2460570,2376597.0
Americas,29317319.0,28619051,27562046.0
Asia,36574093.0,33165597,32307365.0
Europe,23596659.0,21644976,20850775.0
Oceania,1895098.0,1638761,1582488.0
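
The cell above builds a grouped frame but never calls `unstack` itself. As a minimal sketch on hypothetical toy data (the `toy_df` frame and its values are invented for illustration and are not part of the notebook's datasets), `stack` moves the column labels into the inner index level and `unstack` moves them back:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame(
    {"Africa": [50.0, 48.0], "Asia": [51.0, 60.0]},
    index=pd.Index([1995, 1996], name="Year"),
)
toy_df.columns.name = "Region"

# stack(): wide -> long; Region becomes the inner level of a MultiIndex
stacked = toy_df.stack()

# unstack(): long -> wide; the inner index level (Region) returns to the columns
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```

Passing a level name, for example `stacked.unstack('Year')`, would pivot that level into the columns instead of the innermost one.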


###### Melt
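`melt` unpivots a DataFrame from wide format to long format: the columns listed in `value_vars` become rows, identified by the columns listed in `id_vars`.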

In [72]:
continental_temp_df.reset_index().head()

Region,Year,Africa,Asia,Australia/South Pacific,Europe,Middle East,North America,South/Central America & Carribean
0,1995,52.976737,51.184689,61.126347,38.634474,55.416305,54.884315,36.252988
1,1996,48.34838,60.293226,60.811293,36.53759,57.235675,51.071101,53.665272
2,1997,37.282495,57.963918,61.577032,41.291805,58.599941,53.795868,55.117885
3,1998,30.327075,53.736634,49.237078,40.936384,58.497397,54.517819,49.526641
4,1999,34.153869,55.595147,61.653699,42.250435,59.681566,55.64553,58.425118


In [73]:
year_region_temp_df=pd.melt(continental_temp_df.reset_index().head(),id_vars=['Year'],value_vars=['Asia','Africa','Europe'],
        value_name='Temperature_C')
year_region_temp_df

Unnamed: 0,Year,Region,Temperature_C
0,1995,Asia,51.184689
1,1996,Asia,60.293226
2,1997,Asia,57.963918
3,1998,Asia,53.736634
4,1999,Asia,55.595147
5,1995,Africa,52.976737
6,1996,Africa,48.34838
7,1997,Africa,37.282495
8,1998,Africa,30.327075
9,1999,Africa,34.153869


###### Cross-Tabulation
The pandas `crosstab` function summarizes data in a way similar to `pivot` and `pivot_table`. By default it counts how often each combination of row and column values occurs.

In [74]:
# This is our original data before applying crosstab function
year_region_temp_df.head()

Unnamed: 0,Year,Region,Temperature_C
0,1995,Asia,51.184689
1,1996,Asia,60.293226
2,1997,Asia,57.963918
3,1998,Asia,53.736634
4,1999,Asia,55.595147


In [75]:
'''Let's apply the crosstab function and round the results to 2 decimal places. Also note that we are using mean as
    our aggregation function '''
pd.crosstab(year_region_temp_df['Year'],year_region_temp_df['Region'],
            values=year_region_temp_df['Temperature_C'],aggfunc='mean').round(2)

Region,Africa,Asia,Europe
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1995,52.98,51.18,38.63
1996,48.35,60.29,36.54
1997,37.28,57.96,41.29
1998,30.33,53.74,40.94
1999,34.15,55.6,42.25


###### Adding Totals to crosstab table

In [76]:
pd.crosstab(year_region_temp_df['Year'],year_region_temp_df['Region'],values=year_region_temp_df['Temperature_C'],
            aggfunc='mean', margins=True, margins_name='Mean').round(2)

Region,Africa,Asia,Europe,Mean
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1995,52.98,51.18,38.63,47.6
1996,48.35,60.29,36.54,48.39
1997,37.28,57.96,41.29,45.51
1998,30.33,53.74,40.94,41.67
1999,34.15,55.6,42.25,44.0
Mean,40.62,55.75,39.93,45.43


###### Cross Tabulation with Normalization

In [77]:
titanic_df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [78]:
'''Let's use the titanic dataset and count the number of passengers who survived, broken down by their gender.'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'])

Sex,female,male
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,81,464
1,233,109


In [79]:
'''Now let's add totals and see how it looks.'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total')

Sex,female,male,Total
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,81,464,545
1,233,109,342
Total,314,573,887


In [80]:
'''Let's normalize and multiply by 100 to get percentages. normalize=True computes each cell's share of the
    grand total. For example, for female and Survived=0 we take (81/887)*100, which gives 9.131905'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize=True)*100

Sex,female,male,Total
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,9.131905,52.311161,61.443067
1,26.26832,12.288613,38.556933
Total,35.400225,64.599775,100.0


###### Crosstab and Normalize across rows

In [81]:
'''Let's normalize row-wise by setting normalize='index'. For example, to get the first value (female, Survived=0) we take
    (81/545)*100 to get 14.862385'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize='index')*100

Sex,female,male
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,14.862385,85.137615
1,68.128655,31.871345
Total,35.400225,64.599775


###### Crosstab and Normalize across columns

In [82]:
'''Let's normalize column-wise by setting normalize='columns'. For example, to get the first value (female, Survived=0) we take
    (81/314)*100 to get 25.796178'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize='columns')*100

Sex,female,male,Total
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,25.796178,80.977312,61.443067
1,74.203822,19.022688,38.556933
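
The three `normalize` modes above differ only in the denominator: the grand total, the row total, or the column total. The sketch below rebuilds the 2x2 counts from the outputs above (81, 464, 233, 109) as a hypothetical one-row-per-passenger frame named `passengers`, so that it runs without `titanic.csv`, and compares each mode with the hand calculation:

```python
import pandas as pd

# Rebuild the counts shown above as one row per passenger (illustrative stand-in for titanic_df)
counts = {(0, "female"): 81, (0, "male"): 464, (1, "female"): 233, (1, "male"): 109}
rows = [(survived, sex) for (survived, sex), n in counts.items() for _ in range(n)]
passengers = pd.DataFrame(rows, columns=["Survived", "Sex"])

overall = pd.crosstab(passengers["Survived"], passengers["Sex"], normalize=True) * 100
by_row = pd.crosstab(passengers["Survived"], passengers["Sex"], normalize="index") * 100
by_col = pd.crosstab(passengers["Survived"], passengers["Sex"], normalize="columns") * 100

# Same cell (female, Survived=0) under the three denominators
print(overall.loc[0, "female"], 81 / 887 * 100)  # share of all passengers
print(by_row.loc[0, "female"], 81 / 545 * 100)   # share of the Survived=0 row
print(by_col.loc[0, "female"], 81 / 314 * 100)   # share of the female column
```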


###### Add Grouping to our Crosstab function

In [83]:
titanic_df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [84]:
'''Now let's group the rows by survival and passenger class and the columns by gender and siblings/spouses aboard, with totals.'''
pd.crosstab([titanic_df['Survived'],titanic_df['Pclass']],[titanic_df['Sex'],titanic_df['Siblings/Spouses Aboard']], 
            rownames=['Survival','Passenger Class'], colnames=['Gender','Siblings Aboard'], margins=True, margins_name='Total')

Unnamed: 0_level_0,Gender,female,female,female,female,female,female,female,male,male,male,male,male,male,male,Total
Unnamed: 0_level_1,Siblings Aboard,0,1,2,3,4,5,8,0,1,2,3,4,5,8,Unnamed: 16_level_1
Survival,Passenger Class,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0,1.0,1,2,0,0,0,0,0,59,16,1,1,0,0,0,80
0,2.0,3,3,0,0,0,0,0,67,20,4,0,0,0,0,97
0,3.0,33,21,3,7,4,1,3,231,35,7,4,11,4,4,368
1,1.0,48,38,3,2,0,0,0,29,15,1,0,0,0,0,136
1,2.0,41,25,3,1,0,0,0,9,7,1,0,0,0,0,87
1,3.0,48,17,4,1,2,0,0,35,10,1,0,1,0,0,119
Total,,174,106,13,11,6,1,3,430,103,15,5,12,4,4,887


# 7. Missing Data

###### Flag null values (True if null, False otherwise)

In [85]:
titanic_df.isnull().head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False


###### Show count of null values

In [86]:
titanic_df.isnull().sum()

Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64

###### Flag non-null values (True if not null, False otherwise)

In [87]:
titanic_df.notnull().head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True


###### Drop entire rows where all values are null

In [88]:
titanic_df.dropna(how='all')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Drop column with any null value

In [89]:
titanic_df.dropna(how='any',axis=1)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Drop rows where any of the specified columns is null

In [90]:
titanic_df.dropna(subset=['Pclass', 'Age'], how='any')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Drop rows where all of the specified columns are null

In [91]:
titanic_df.dropna(subset=['Pclass', 'Age'], how='all')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


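###### Drop columns with fewer than a given number of non-null values

In [92]:
titanic_df.dropna(axis=1,thresh=2) # axis=1 works on columns; thresh=2 keeps columns with at least 2 non-null values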

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
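
Because `titanic_df` contains no nulls (the null counts above are all zero), the `dropna` calls in this section return the frame unchanged. To see what each option does, here is a minimal sketch on a hypothetical frame `toy_df` with values invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values, for illustration only
toy_df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [np.nan, np.nan, 6.0, np.nan],
        "c": [7.0, 8.0, 9.0, np.nan],
    }
)

rows_not_all_null = toy_df.dropna(how="all")    # drops only row 3, where every value is null
rows_complete = toy_df.dropna(how="any")        # keeps only row 2, the one row without nulls
cols_min_two = toy_df.dropna(axis=1, thresh=2)  # keeps columns with at least 2 non-null values
rows_min_two = toy_df.dropna(axis=0, thresh=2)  # keeps rows with at least 2 non-null values

print(cols_min_two.columns.tolist())
print(rows_min_two.index.tolist())
```

Note that `thresh` counts non-null values to keep, not null values to drop.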


###### Replace null values with a scalar value

In [93]:
titanic_df.fillna(-999) # replace null values with -999

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Backward and Forward fill

In [94]:
titanic_df.bfill() # backward fill: copy the next valid value into each gap
titanic_df.ffill() # forward fill: copy the previous valid value into each gap

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Replace missing value with a specific value

In [95]:
titanic_df.replace(np.nan, 0)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


###### Impute null value with statistical measures

In [96]:
titanic_df.fillna({'Fare': titanic_df['Fare'].mean()}) # fill nulls in the Fare column with the mean fare
titanic_df.fillna({'Sex': titanic_df['Sex'].mode()[0]}) # fill nulls in the Sex column with the most frequent value
titanic_df.fillna({'Age': titanic_df['Age'].median()}) # fill nulls in the Age column with the median age

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
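
A scalar passed to `DataFrame.fillna` is applied to every column, whereas a dict keyed by column name limits each fill value to its own column. The following sketch uses a hypothetical frame `toy_df` (values invented for illustration) to impute three columns with three different statistics in one call:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column, for illustration only
toy_df = pd.DataFrame(
    {
        "Fare": [10.0, np.nan, 30.0],
        "Sex": ["male", None, "male"],
        "Age": [20.0, 40.0, np.nan],
    }
)

fill_values = {
    "Fare": toy_df["Fare"].mean(),   # mean of the non-null fares
    "Sex": toy_df["Sex"].mode()[0],  # mode() returns a Series, so take its first entry
    "Age": toy_df["Age"].median(),   # median of the non-null ages
}
imputed = toy_df.fillna(fill_values)
print(imputed)
```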


###### Interpolate missing values

In [97]:
titanic_df.interpolate(method='linear')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
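
Since `titanic_df` has no gaps, the call above leaves it unchanged. As a small sketch on a hypothetical Series, linear interpolation spaces the missing values evenly between their nearest valid neighbours:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive gaps, for illustration only
series_with_gaps = pd.Series([1.0, np.nan, np.nan, 4.0])

# Each gap is filled on the straight line between 1.0 and 4.0
filled = series_with_gaps.interpolate(method="linear")
print(filled.tolist())
```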


# 8. Date and Time

### Pandas to_datetime function

In [98]:
clv_df=pd.read_csv('clv.csv')
clv_df['text_with_date']='Some text '+clv_df['InvoiceDate'] +'Other text'
clv_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,text_with_date
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010,2.55,17850.0,United Kingdom,Some text 01/12/2010Other text
1,536365,71053,WHITE METAL LANTERN,6,01/12/2010,3.39,17850.0,United Kingdom,Some text 01/12/2010Other text
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01/12/2010,2.75,17850.0,United Kingdom,Some text 01/12/2010Other text
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01/12/2010,3.39,17850.0,United Kingdom,Some text 01/12/2010Other text
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01/12/2010,3.39,17850.0,United Kingdom,Some text 01/12/2010Other text


###### Check whether the date column has a datetime type

In [99]:
clv_df.dtypes

InvoiceNo          object
StockCode          object
Description        object
Quantity            int64
InvoiceDate        object
UnitPrice         float64
CustomerID        float64
Country            object
text_with_date     object
dtype: object

###### Convert a String Column to a Datetime object

In [100]:
# clv_df['valid_invoice_date_object']=pd.to_datetime(clv_df['InvoiceDate'])
clv_df['valid_invoice_date_object']=pd.to_datetime(clv_df['InvoiceDate'],format='%d/%m/%Y') # add format
clv_df['date_only']=pd.to_datetime(clv_df['text_with_date'],format='Some text %d/%m/%YOther text') # add format
clv_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,text_with_date,valid_invoice_date_object,date_only
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010,2.55,17850.0,United Kingdom,Some text 01/12/2010Other text,2010-12-01,2010-12-01
1,536365,71053,WHITE METAL LANTERN,6,01/12/2010,3.39,17850.0,United Kingdom,Some text 01/12/2010Other text,2010-12-01,2010-12-01
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01/12/2010,2.75,17850.0,United Kingdom,Some text 01/12/2010Other text,2010-12-01,2010-12-01
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01/12/2010,3.39,17850.0,United Kingdom,Some text 01/12/2010Other text,2010-12-01,2010-12-01
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01/12/2010,3.39,17850.0,United Kingdom,Some text 01/12/2010Other text,2010-12-01,2010-12-01


In [101]:
clv_df.dtypes

InvoiceNo                            object
StockCode                            object
Description                          object
Quantity                              int64
InvoiceDate                          object
UnitPrice                           float64
CustomerID                          float64
Country                              object
text_with_date                       object
valid_invoice_date_object    datetime64[ns]
date_only                    datetime64[ns]
dtype: object
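
The explicit `format` matters here because `01/12/2010` is ambiguous: with `%d/%m/%Y` it is read as 1 December 2010. The sketch below, on a hypothetical Series `raw`, also shows `errors='coerce'`, which turns values that do not match the format into `NaT` instead of raising:

```python
import pandas as pd

# Hypothetical raw strings, for illustration only; the last one is not a date
raw = pd.Series(["01/12/2010", "15/01/2011", "not a date"])

# Day-first format; unparseable entries become NaT rather than raising an error
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of strings that could not be parsed
```

Counting the `NaT` values after conversion is a quick way to check how many rows failed to parse.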

###### Extract Year from date 

In [102]:
clv_df['valid_invoice_date_object'].dt.year.head()

0    2010
1    2010
2    2010
3    2010
4    2010
Name: valid_invoice_date_object, dtype: int64

###### Extract Month from date 

In [103]:
clv_df['valid_invoice_date_object'].dt.month.head()

0    12
1    12
2    12
3    12
4    12
Name: valid_invoice_date_object, dtype: int64

###### Extract Day from date 

In [104]:
clv_df['valid_invoice_date_object'].dt.day.head()

0    1
1    1
2    1
3    1
4    1
Name: valid_invoice_date_object, dtype: int64

###### Extract Day of Year from date 

In [105]:
clv_df['valid_invoice_date_object'].dt.dayofyear.head()

0    335
1    335
2    335
3    335
4    335
Name: valid_invoice_date_object, dtype: int64

###### Extract Week of Year from date 

In [106]:
clv_df['valid_invoice_date_object'].dt.isocalendar().week.head()

0    48
1    48
2    48
3    48
4    48
Name: week, dtype: UInt32

###### Extract Day of Week from date

In [107]:
# clv_df['valid_invoice_date_object'].dt.dayofweek.head() # zero-based: Monday=0, Sunday=6
clv_df['valid_invoice_date_object'].dt.isocalendar().day.head() # ISO weekday, one-based: Monday=1, Sunday=7

0    3
1    3
2    3
3    3
4    3
Name: day, dtype: UInt32

###### Name of Day

In [108]:
def week_day(x):
    return pd.Timestamp(x).day_name()

In [109]:
clv_df['valid_invoice_date_object'].apply(lambda x: week_day(x)).head()

0    Wednesday
1    Wednesday
2    Wednesday
3    Wednesday
4    Wednesday
Name: valid_invoice_date_object, dtype: object
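
The helper above applies `pd.Timestamp.day_name` row by row. As an alternative sketch, the `.dt` accessor offers a vectorized `day_name()` that should give the same labels without `apply`. The two dates below are hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample dates, for illustration only
sample_dates = pd.Series(pd.to_datetime(["2010-12-01", "2010-12-05"]))

day_names = sample_dates.dt.day_name()  # weekday names as strings
day_numbers = sample_dates.dt.dayofweek  # zero-based weekday numbers, Monday=0

print(day_names.tolist())
print(day_numbers.tolist())
```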

###### Extract Quarter of Year from date 

In [110]:
clv_df['valid_invoice_date_object'].dt.quarter.head()

0    4
1    4
2    4
3    4
4    4
Name: valid_invoice_date_object, dtype: int64

###### Extract Number of Days in a month

In [111]:
clv_df['valid_invoice_date_object'].dt.days_in_month.head()

0    31
1    31
2    31
3    31
4    31
Name: valid_invoice_date_object, dtype: int64

###### Check for leap year

In [112]:
clv_df['valid_invoice_date_object'].dt.is_leap_year.head()

0    False
1    False
2    False
3    False
4    False
Name: valid_invoice_date_object, dtype: bool

###### Extract Hour from date 

In [113]:
dates=pd.date_range('1/12/2022', periods = 100, freq ='H')
date_df=pd.DataFrame(dates,columns=['DatetimeColumn'])
date_df.head()

Unnamed: 0,DatetimeColumn
0,2022-01-12 00:00:00
1,2022-01-12 01:00:00
2,2022-01-12 02:00:00
3,2022-01-12 03:00:00
4,2022-01-12 04:00:00


In [114]:
date_df['DatetimeColumn'].dt.hour.head()

0    0
1    1
2    2
3    3
4    4
Name: DatetimeColumn, dtype: int64

###### Extract Minute from date 

In [115]:
date_df['DatetimeColumn'].dt.minute.head()

0    0
1    0
2    0
3    0
4    0
Name: DatetimeColumn, dtype: int64

###### Extract Second from date 

In [116]:
date_df['DatetimeColumn'].dt.second.head()

0    0
1    0
2    0
3    0
4    0
Name: DatetimeColumn, dtype: int64

### Python Date Functions

###### Python date

In [117]:
from datetime import date

In [118]:
python_date=date.today()
print(python_date)
python_date.day
python_date.month
python_date.year

2021-12-06


2021

###### Python time

In [119]:
from datetime import time

In [120]:
python_time=time(15,12,23,56)
print(python_time)
python_time.hour
python_time.minute
python_time.second

15:12:23.000056


23

###### Python datetime

In [121]:
from datetime import datetime

In [122]:
python_datetime=datetime.now()
print(python_datetime)
python_datetime.date()
python_datetime.time()
python_datetime.strftime('%A') # Pass other format codes to extract more features. Refer to the Python documentation
# for more details: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

2021-12-06 22:02:32.543810


'Monday'
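
To make the format codes concrete, here is a short sketch that formats a fixed timestamp with `strftime` and parses a string back with `strptime`. The timestamp and the format strings are arbitrary choices for illustration, and the weekday and month names depend on the active locale:

```python
from datetime import datetime

# Fixed timestamp so the result does not depend on when the cell runs
fixed = datetime(2021, 12, 6, 22, 2, 32)

# datetime -> string: weekday name, day, month name, year, 24-hour time
label = fixed.strftime("%A %d %B %Y, %H:%M")

# string -> datetime: the format must mirror the layout of the input text
parsed = datetime.strptime("06/12/2021 22:02", "%d/%m/%Y %H:%M")

print(label)
print(parsed)
```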

# 9. Combining Data in Pandas

### 1. Concatenating Data Frames
This involves joining data frames along a specified axis (row-wise or column-wise).

In [123]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
)

df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
)

df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
)

In [124]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [125]:
df2

Unnamed: 0,A,B,C,D
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


In [126]:
df3

Unnamed: 0,A,B,C,D
0,A8,B8,C8,D8
1,A9,B9,C9,D9
2,A10,B10,C10,D10
3,A11,B11,C11,D11


In [127]:
df4

Unnamed: 0,B,D,F
0,B2,D2,F2
1,B3,D3,F3
2,B6,D6,F6
3,B7,D7,F7


###### Stack dataframes on top of each other (row-wise/axis=0)
We set ignore_index=True to replace the original indexes of the dataframes with one unified index.

In [128]:
# pd.concat([df1,df2], ignore_index=True) # By default, concat stacks dataframes on axis=0 (row-wise)
pd.concat([df1,df2,df3], ignore_index=True,axis=0)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


###### Pandas concat and set keys

In [129]:
pd.concat([df1,df2,df3],axis=0,keys=['df1','df2','df3'])

Unnamed: 0,Unnamed: 1,A,B,C,D
df1,0,A0,B0,C0,D0
df1,1,A1,B1,C1,D1
df1,2,A2,B2,C2,D2
df1,3,A3,B3,C3,D3
df2,0,A4,B4,C4,D4
df2,1,A5,B5,C5,D5
df2,2,A6,B6,C6,D6
df2,3,A7,B7,C7,D7
df3,0,A8,B8,C8,D8
df3,1,A9,B9,C9,D9


###### Stack dataframes next to each other (column-wise/axis=1)

In [130]:
pd.concat([df1,df4],axis=1)

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,B2,D2,F2
1,A1,B1,C1,D1,B3,D3,F3
2,A2,B2,C2,D2,B6,D6,F6
3,A3,B3,C3,D3,B7,D7,F7


###### Stack dataframes next to each other (column-wise/axis=1) with keys

In [131]:
pd.concat([df1,df4],axis=1,keys=['df1','df4'])

Unnamed: 0_level_0,df1,df1,df1,df1,df4,df4,df4
Unnamed: 0_level_1,A,B,C,D,B,D,F
0,A0,B0,C0,D0,B2,D2,F2
1,A1,B1,C1,D1,B3,D3,F3
2,A2,B2,C2,D2,B6,D6,F6
3,A3,B3,C3,D3,B7,D7,F7


###### Pandas append function

In [132]:
# Works on pandas < 2.0 only: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0
df1.append([df2,df3],ignore_index=True)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


###### Pandas merge function

In [133]:
hr_departments_df=pd.read_csv('hr_departments.csv')
hr_employees_df=pd.read_csv('hr_employees.csv')

In [134]:
hr_departments_df.head()

Unnamed: 0,DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID
0,10,Administration,200.0,1700
1,20,Marketing,201.0,1800
2,30,Purchasing,114.0,1700
3,40,Human Resources,203.0,2400
4,50,Shipping,121.0,1500


In [135]:
hr_employees_df.head()

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
0,100,Steven,King,SKING,515.123.4567,17-JUN-03,AD_PRES,24000,,,90.0
1,101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-05,AD_VP,17000,,100.0,90.0
2,102,Lex,De Haan,LDEHAAN,515.123.4569,13-JAN-01,AD_VP,17000,,100.0,90.0
3,103,Alexander,Hunold,AHUNOLD,590.423.4567,03-JAN-06,IT_PROG,9000,,102.0,60.0
4,104,Bruce,Ernst,BERNST,590.423.4568,21-MAY-07,IT_PROG,6000,,103.0,60.0


###### Pandas merge function with keys

In [136]:
pd.merge(hr_employees_df,hr_departments_df,on=['DEPARTMENT_ID']).head()

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID_x,DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID_y,LOCATION_ID
0,100,Steven,King,SKING,515.123.4567,17-JUN-03,AD_PRES,24000,,,90.0,Executive,100.0,1700
1,101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-05,AD_VP,17000,,100.0,90.0,Executive,100.0,1700
2,102,Lex,De Haan,LDEHAAN,515.123.4569,13-JAN-01,AD_VP,17000,,100.0,90.0,Executive,100.0,1700
3,103,Alexander,Hunold,AHUNOLD,590.423.4567,03-JAN-06,IT_PROG,9000,,102.0,60.0,IT,103.0,1400
4,104,Bruce,Ernst,BERNST,590.423.4568,21-MAY-07,IT_PROG,6000,,103.0,60.0,IT,103.0,1400


###### Pandas merge function with keys and inner join criteria
Inner join is equivalent to SQL INNER JOIN. The result is the intersection of the data: only rows whose keys match in both dataframes are kept.

In [137]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

In [138]:
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [139]:
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


###### Merge on left join
A left join resembles SQL LEFT OUTER JOIN: the resulting dataframe contains all the keys from the left dataframe, and rows without a match get NaN in the columns coming from the right dataframe.

In [140]:
pd.merge(left,right,on=['key1','key2'],how='left')

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


###### Merge on right join
A right join resembles SQL RIGHT OUTER JOIN: the resulting dataframe contains all the keys from the right dataframe, and rows without a match get NaN in the columns coming from the left dataframe.

In [141]:
pd.merge(left,right,on=['key1','key2'],how='right')

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


###### Merge on inner join
An inner join resembles SQL INNER JOIN and results in the intersection of the keys from both dataframes.

In [142]:
pd.merge(left,right,on=['key1','key2'],how='inner')

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


###### Remove duplicates during joining
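Removes duplicate rows from the specified dataframe before joining. The left dataframe has no duplicate rows here, so the result is the same as the plain inner join.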

In [143]:
pd.merge(left.drop_duplicates(),right,on=['key1','key2'],how='inner')

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


###### Validate join
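The validate argument checks the key relationship before merging. It accepts one_to_one, one_to_many, many_to_one and many_to_many. If the keys do not satisfy the stated relationship, pandas raises a MergeError. Note that many_to_many is accepted but performs no check.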

In [144]:
pd.merge(left.drop_duplicates(),right,on=['key1','key2'],how='inner',validate='many_to_many')

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
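
A minimal sketch of a validation that fails, assuming the same toy keys as above (only columns A and C are kept for brevity). The pair (K1, K0) appears twice in right, so a one_to_many check passes while a one_to_one check raises an error.

```python
import pandas as pd

# Same toy keys as above: left has unique (key1, key2) pairs,
# right repeats the pair (K1, K0).
left = pd.DataFrame({
    "key1": ["K0", "K0", "K1", "K2"],
    "key2": ["K0", "K1", "K0", "K1"],
    "A": ["A0", "A1", "A2", "A3"],
})
right = pd.DataFrame({
    "key1": ["K0", "K1", "K1", "K2"],
    "key2": ["K0", "K0", "K0", "K0"],
    "C": ["C0", "C1", "C2", "C3"],
})

# one_to_many passes: the keys are unique on the left side.
ok = pd.merge(left, right, on=["key1", "key2"], how="inner", validate="one_to_many")

# one_to_one fails because (K1, K0) is duplicated in right.
try:
    pd.merge(left, right, on=["key1", "key2"], how="inner", validate="one_to_one")
    one_to_one_failed = False
except pd.errors.MergeError as err:
    one_to_one_failed = True
    print("MergeError:", err)
```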


###### Merge with outer join criteria
An outer join resembles SQL FULL OUTER JOIN. It results in the union of the keys from both dataframes, with NaN wherever one side has no match.

In [145]:
pd.merge(left,right,on=['key1','key2'],how='outer')

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3


###### Merge with outer join criteria and indicator
The argument indicator=True adds a _merge column that tells where each row came from: left_only, right_only or both.

In [146]:
pd.merge(left,right,on=['key1','key2'],how='outer',indicator=True)

Unnamed: 0,key1,key2,A,B,C,D,_merge
0,K0,K0,A0,B0,C0,D0,both
1,K0,K1,A1,B1,,,left_only
2,K1,K0,A2,B2,C1,D1,both
3,K1,K0,A2,B2,C2,D2,both
4,K2,K1,A3,B3,,,left_only
5,K2,K0,,,C3,D3,right_only


###### Merge on cross join
This is synonymous with CROSS JOIN in SQL. It returns the Cartesian product of the rows of both dataframes. The how='cross' option is available in pandas 1.2 and later, so we first check the installed pandas version. If it is older than 1.2, it needs to be upgraded.

In [147]:
# Check pandas version
pd.__version__

'1.1.5'
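
As a hedged sketch, a cross join could look like the following once pandas 1.2 or later is installed. The sizes and colors frames are made up for illustration, and the constant helper column _key is a hypothetical name used to emulate the same result on older versions.

```python
import pandas as pd

# Illustrative frames (not part of the HR data used above)
sizes = pd.DataFrame({"size": ["S", "M", "L"]})
colors = pd.DataFrame({"color": ["red", "blue"]})

# pandas >= 1.2: built-in cross join, no key columns needed
cross = pd.merge(sizes, colors, how="cross")

# Older pandas: emulate it by joining on a constant helper key
cross_legacy = (
    sizes.assign(_key=1)
    .merge(colors.assign(_key=1), on="_key")
    .drop(columns="_key")
)

print(cross)
```

Every row of sizes is paired with every row of colors, so the result has 3 x 2 = 6 rows.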

# 10. Pandas Apply Function

###### Using map function

In [148]:
titanic_df['Survived'].map({0:'N',1:'Y'}).head()

0    N
1    Y
2    Y
3    Y
4    N
Name: Survived, dtype: object

###### Using apply function

In [149]:
titanic_df['Fare'].apply(int).head()

0     7
1    71
2     7
3    53
4     8
Name: Fare, dtype: int64

###### Using applymap

In [150]:
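# applymap() is a DataFrame method (a Series has no applymap), so select the column with double brackets:
# titanic_df[['Fare']].applymap(int).head()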
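
A small sketch of element-wise application on a DataFrame, using a made-up two-column frame. It assumes that newer pandas releases (2.1 and later) provide DataFrame.map as the replacement for applymap, and falls back to applymap when map is not available.

```python
import pandas as pd

# Illustrative data, loosely modelled on the Titanic columns
fares = pd.DataFrame({"Fare": [7.25, 71.2833], "Age": [22.0, 38.0]})

# Use DataFrame.map where it exists, otherwise the older applymap
elementwise = fares.map if hasattr(fares, "map") else fares.applymap

as_int = elementwise(int)  # int() is applied to every cell
print(as_int)
```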

###### Using lambda function

In [151]:
titanic_df['Fare'].apply(lambda x: x*100).head()

0     725.00
1    7128.33
2     792.50
3    5310.00
4     805.00
Name: Fare, dtype: float64

###### Using python defined functions with lambda

In [152]:
def upper_case(x):
    return x.upper()

In [153]:
titanic_df['Sex'].apply(lambda x: upper_case(x)).head()

0      MALE
1    FEMALE
2    FEMALE
3    FEMALE
4      MALE
Name: Sex, dtype: object

# 11. Pandas String Functions

###### Convert text to Upper Case

In [154]:
titanic_df['Name'].str.upper()

0                                 MR. OWEN HARRIS BRAUND
1      MRS. JOHN BRADLEY (FLORENCE BRIGGS THAYER) CUM...
2                                  MISS. LAINA HEIKKINEN
3            MRS. JACQUES HEATH (LILY MAY PEEL) FUTRELLE
4                                MR. WILLIAM HENRY ALLEN
                             ...                        
882                                 REV. JUOZAS MONTVILA
883                          MISS. MARGARET EDITH GRAHAM
884                       MISS. CATHERINE HELEN JOHNSTON
885                                 MR. KARL HOWELL BEHR
886                                   MR. PATRICK DOOLEY
Name: Name, Length: 887, dtype: object

###### Convert text to Lower Case

In [155]:
titanic_df['Name'].str.lower()

0                                 mr. owen harris braund
1      mrs. john bradley (florence briggs thayer) cum...
2                                  miss. laina heikkinen
3            mrs. jacques heath (lily may peel) futrelle
4                                mr. william henry allen
                             ...                        
882                                 rev. juozas montvila
883                          miss. margaret edith graham
884                       miss. catherine helen johnston
885                                 mr. karl howell behr
886                                   mr. patrick dooley
Name: Name, Length: 887, dtype: object

###### Calculate length of a string

In [156]:
titanic_df['Name'].str.len().head()

0    22
1    50
2    21
3    43
4    23
Name: Name, dtype: int64

###### Strip white spaces

In [157]:
titanic_df['Name'].str.strip().head()

0                               Mr. Owen Harris Braund
1    Mrs. John Bradley (Florence Briggs Thayer) Cum...
2                                Miss. Laina Heikkinen
3          Mrs. Jacques Heath (Lily May Peel) Futrelle
4                              Mr. William Henry Allen
Name: Name, dtype: object

###### Split string

In [158]:
titanic_df['Name'].str.split(' ').head()

0                          [Mr., Owen, Harris, Braund]
1    [Mrs., John, Bradley, (Florence, Briggs, Thaye...
2                            [Miss., Laina, Heikkinen]
3    [Mrs., Jacques, Heath, (Lily, May, Peel), Futr...
4                         [Mr., William, Henry, Allen]
Name: Name, dtype: object

###### Convert strings to categorical numbers

In [159]:
titanic_df['Sex'].str.get_dummies().head()

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


###### Repeat text

In [160]:
titanic_df['Sex'].str.repeat(5).head()

0              malemalemalemalemale
1    femalefemalefemalefemalefemale
2    femalefemalefemalefemalefemale
3    femalefemalefemalefemalefemale
4              malemalemalemalemale
Name: Sex, dtype: object

###### Count occurrence of certain words

In [161]:
titanic_df['Name'].str.count('Rev.').sum() # the pattern is a regular expression, so '.' matches any character

6

###### String search startswith and endswith

In [162]:
titanic_df['Name'].str.startswith('Mrs.').sum()
titanic_df['Name'].str.endswith('y').sum()

51

###### Swap Cases

In [163]:
titanic_df['Name'].str.swapcase().head()

0                               mR. oWEN hARRIS bRAUND
1    mRS. jOHN bRADLEY (fLORENCE bRIGGS tHAYER) cUM...
2                                mISS. lAINA hEIKKINEN
3          mRS. jACQUES hEATH (lILY mAY pEEL) fUTRELLE
4                              mR. wILLIAM hENRY aLLEN
Name: Name, dtype: object

###### Filter data with substring

In [164]:
titanic_df['Name'].str.contains('Mr.') # the pattern is a regular expression, so 'Mrs' also matches 'Mr.'

0       True
1       True
2      False
3       True
4       True
       ...  
882    False
883    False
884    False
885     True
886     True
Name: Name, Length: 887, dtype: bool
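
str.contains treats its pattern as a regular expression by default, which is why row 1 ("Mrs. ...") is True above. The sketch below, on three names taken from the dataset, contrasts the default with a literal substring match and with an escaped dot.

```python
import pandas as pd

names = pd.Series(["Mr. Owen Harris Braund", "Mrs. John Bradley", "Miss. Laina Heikkinen"])

as_regex = names.str.contains("Mr.")                 # '.' matches any character, so 'Mrs' matches too
as_literal = names.str.contains("Mr.", regex=False)  # plain substring match
escaped = names.str.contains(r"Mr\.")                # regular expression with the dot escaped

print(pd.DataFrame({"name": names, "regex": as_regex, "literal": as_literal}))
```

The resulting boolean Series can be used directly as a filter, for example names[as_literal].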

###### Replace string

In [165]:
titanic_df['Sex'].str.replace('female','F')

0      male
1         F
2         F
3         F
4      male
       ... 
882    male
883       F
884       F
885    male
886    male
Name: Sex, Length: 887, dtype: object

###### Check if text is lower or upper

In [166]:
titanic_df['Name'].str.islower()
titanic_df['Name'].str.isupper().head()

0    False
1    False
2    False
3    False
4    False
Name: Name, dtype: bool

###### Check if value is numeric

In [167]:
titanic_df['Name'].str.isnumeric().head()

0    False
1    False
2    False
3    False
4    False
Name: Name, dtype: bool

# 12. Mathematical Operations

In [168]:
students_score_df = pd.DataFrame(
    {
        "Students": ["Tom", "Peter", "Mary", "Smith"],
        "Reg_No": [1790, 1731, 1780, 1755],
        "Reg_Date": ["15/01/2021", "16/01/2021", "19/01/2021", "27/01/2021"],
        "Math": ["79.00", "67.00", "84.00", "70.00"],
        "Physics": ["60", "70", "50", "90"],
        "Computer": ["65.80", "80", "70", "75"],
    }
)

students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,15/01/2021,79.0,60,65.8
1,Peter,1731,16/01/2021,67.0,70,80.0
2,Mary,1780,19/01/2021,84.0,50,70.0
3,Smith,1755,27/01/2021,70.0,90,75.0


In [169]:
# Check the data type of each column; columns stored with the wrong type must be converted before doing arithmetic
students_score_df.dtypes

Students    object
Reg_No       int64
Reg_Date    object
Math        object
Physics     object
Computer    object
dtype: object

In [170]:
# Convert data types to proper representation
students_score_df[['Math','Physics','Computer']]=students_score_df[['Math','Physics','Computer']].astype(float) # Math, Physics and Computer need to be float
students_score_df['Reg_No']=students_score_df['Reg_No'].astype(str) # Reg_No is an identifier, so it needs to be a string (object)
students_score_df['Reg_Date']=pd.to_datetime(students_score_df['Reg_Date'], format='%d/%m/%Y') # Reg_Date needs to be a valid pandas datetime; the strings are day-first

In [171]:
students_score_df.dtypes

Students            object
Reg_No              object
Reg_Date    datetime64[ns]
Math               float64
Physics            float64
Computer           float64
dtype: object
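
Day-first date strings deserve some care when converting. The sketch below uses a made-up ambiguous date to show how the default parsing differs from dayfirst=True or an explicit format.

```python
import pandas as pd

ambiguous = "03/01/2021"  # 3 January or 1 March?

month_first = pd.to_datetime(ambiguous)                  # default reads it as MM/DD/YYYY
day_first = pd.to_datetime(ambiguous, dayfirst=True)     # reads it as DD/MM/YYYY
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")  # explicit format, no guessing

print(month_first.date(), day_first.date(), explicit.date())
```

An explicit format is usually the safer choice for a whole column, because every value is then parsed the same way.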

###### Scalar Addition
Add a scalar value to every numeric element in the dataframe

In [172]:
students_score_df[['Math','Physics','Computer']]=students_score_df[['Math','Physics','Computer']].add(5)
students_score_df.head()

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,2021-01-15,84.0,65.0,70.8
1,Peter,1731,2021-01-16,72.0,75.0,85.0
2,Mary,1780,2021-01-19,89.0,55.0,75.0
3,Smith,1755,2021-01-27,75.0,95.0,80.0


###### Elementwise addition
Add two dataframes elementwise

In [173]:
score_1_df=students_score_df[['Math','Physics','Computer']]-90
score_2_df=students_score_df[['Math','Physics','Computer']]-85

In [174]:
score_1_df

Unnamed: 0,Math,Physics,Computer
0,-6.0,-25.0,-19.2
1,-18.0,-15.0,-5.0
2,-1.0,-35.0,-15.0
3,-15.0,5.0,-10.0


In [175]:
score_2_df

Unnamed: 0,Math,Physics,Computer
0,-1.0,-20.0,-14.2
1,-13.0,-10.0,0.0
2,4.0,-30.0,-10.0
3,-10.0,10.0,-5.0


In [176]:
# add the two dataframes
score_3_df=score_1_df.add(score_2_df)
score_3_df

Unnamed: 0,Math,Physics,Computer
0,-7.0,-45.0,-33.4
1,-31.0,-25.0,-5.0
2,3.0,-65.0,-25.0
3,-25.0,15.0,-15.0


###### Subtraction with Scalar value

In [177]:
students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,2021-01-15,84.0,65.0,70.8
1,Peter,1731,2021-01-16,72.0,75.0,85.0
2,Mary,1780,2021-01-19,89.0,55.0,75.0
3,Smith,1755,2021-01-27,75.0,95.0,80.0


In [178]:
students_score_df[['Math','Physics','Computer']]=students_score_df[['Math','Physics','Computer']]-50

In [179]:
students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,2021-01-15,34.0,15.0,20.8
1,Peter,1731,2021-01-16,22.0,25.0,35.0
2,Mary,1780,2021-01-19,39.0,5.0,25.0
3,Smith,1755,2021-01-27,25.0,45.0,30.0


###### Elementwise Subtraction
Subtract two dataframes elementwise

In [180]:
# score_2_df-score_1_df # Option 1
score_4_df=score_2_df.sub(score_1_df) # Option 2
score_4_df

Unnamed: 0,Math,Physics,Computer
0,5.0,5.0,5.0
1,5.0,5.0,5.0
2,5.0,5.0,5.0
3,5.0,5.0,5.0


###### Multiplication with Scalar value

In [181]:
students_score_df[['Math','Physics','Computer']]=students_score_df[['Math','Physics','Computer']]*-3
students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,2021-01-15,-102.0,-45.0,-62.4
1,Peter,1731,2021-01-16,-66.0,-75.0,-105.0
2,Mary,1780,2021-01-19,-117.0,-15.0,-75.0
3,Smith,1755,2021-01-27,-75.0,-135.0,-90.0


###### Elementwise Multiplication
Multiply two dataframes elementwise

In [182]:
# score_1_df * score_2_df # Option 1
score_1_df.mul(score_2_df) # Option 2

Unnamed: 0,Math,Physics,Computer
0,6.0,500.0,272.64
1,234.0,150.0,-0.0
2,-4.0,1050.0,150.0
3,150.0,50.0,50.0


###### Division with Scalar value

In [183]:
students_score_df[['Math','Physics','Computer']]=students_score_df[['Math','Physics','Computer']]/3
students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,2021-01-15,-34.0,-15.0,-20.8
1,Peter,1731,2021-01-16,-22.0,-25.0,-35.0
2,Mary,1780,2021-01-19,-39.0,-5.0,-25.0
3,Smith,1755,2021-01-27,-25.0,-45.0,-30.0


###### Elementwise Division
Divide two dataframes elementwise. Division by zero does not raise an error; it produces inf or -inf.

In [184]:
# score_1_df / score_2_df # Option 1
score_1_df.div(score_2_df) # Option 2

Unnamed: 0,Math,Physics,Computer
0,6.0,1.25,1.352113
1,1.384615,1.5,-inf
2,-0.25,1.166667,1.5
3,1.5,0.5,2.0
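
The -inf in the output above comes from dividing by 0.0. One possible way to handle it, sketched here on two values taken from the Computer column, is to treat infinities as missing values after the division.

```python
import numpy as np
import pandas as pd

num = pd.DataFrame({"Computer": [-19.2, -5.0]})
den = pd.DataFrame({"Computer": [-14.2, 0.0]})

ratio = num.div(den)                                # second row is -inf
cleaned = ratio.replace([np.inf, -np.inf], np.nan)  # treat infinities as missing
print(cleaned)
```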


###### Pandas power function

###### Using ** to raise element to specified power

In [185]:
score_1_df**2 # Raise each element to power 2

Unnamed: 0,Math,Physics,Computer
0,36.0,625.0,368.64
1,324.0,225.0,25.0
2,1.0,1225.0,225.0
3,225.0,25.0,100.0


###### Using power pow() function
The pow() function raises the dataframe to the power of another object (a scalar, series or dataframe), element-wise. It is equivalent to the ** operator, but its fill_value argument lets you substitute a value for missing data in one of the inputs.

In [186]:
score_1_df=score_1_df.pow(2)
score_1_df

Unnamed: 0,Math,Physics,Computer
0,36.0,625.0,368.64
1,324.0,225.0,25.0
2,1.0,1225.0,225.0
3,225.0,25.0,100.0
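
A minimal sketch of how fill_value changes the result, using made-up frames with one missing value on each side. The choice of 1 as the fill value is only for illustration.

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({"x": [2.0, np.nan, 4.0]})
exponent = pd.DataFrame({"x": [3.0, 2.0, np.nan]})

with_operator = base ** exponent              # NaN wherever either side is missing
with_fill = base.pow(exponent, fill_value=1)  # a value missing on one side is replaced by 1 first

print(with_operator["x"].tolist(), with_fill["x"].tolist())
```

If a value is missing in both inputs at the same position, the result stays missing even with fill_value.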


###### Elementwise power along specified axis
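When raising a dataframe to the power of a series, the axis argument controls whether the series is matched against the column labels (axis='columns', the default) or against the row labels (axis='index').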
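
A short sketch of both axis choices, using a small made-up dataframe and two exponent series (per_column and per_row are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

per_column = pd.Series({"a": 2, "b": 3})     # one exponent per column label
per_row = pd.Series([1, 2], index=df.index)  # one exponent per row label

by_columns = df.pow(per_column, axis="columns")  # column a squared, column b cubed
by_rows = df.pow(per_row, axis="index")          # row 0 to the power 1, row 1 to the power 2

print(by_columns)
print(by_rows)
```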

###### Logarithm on base 2
NumPy's log functions (np.log2, np.log10, np.log) work element-wise on a dataframe or on a single column

In [187]:
score_1_df['Log2_Computer']=np.log2(score_1_df['Computer'])
score_1_df

Unnamed: 0,Math,Physics,Computer,Log2_Computer
0,36.0,625.0,368.64,8.526069
1,324.0,225.0,25.0,4.643856
2,1.0,1225.0,225.0,7.813781
3,225.0,25.0,100.0,6.643856


###### Logarithm on base 10

In [188]:
score_1_df['Log10_Computer']=np.log10(score_1_df['Computer'])
score_1_df

Unnamed: 0,Math,Physics,Computer,Log2_Computer,Log10_Computer
0,36.0,625.0,368.64,8.526069,2.566602
1,324.0,225.0,25.0,4.643856,1.39794
2,1.0,1225.0,225.0,7.813781,2.352183
3,225.0,25.0,100.0,6.643856,2.0


###### Natural logarithm

In [189]:
score_1_df['Natural_Log_Computer']=np.log(score_1_df['Computer'])
score_1_df

Unnamed: 0,Math,Physics,Computer,Log2_Computer,Log10_Computer,Natural_Log_Computer
0,36.0,625.0,368.64,8.526069,2.566602,5.909821
1,324.0,225.0,25.0,4.643856,1.39794,3.218876
2,1.0,1225.0,225.0,7.813781,2.352183,5.4161
3,225.0,25.0,100.0,6.643856,2.0,4.60517


###### Pandas aggregate agg function

In [190]:
score_1_df

Unnamed: 0,Math,Physics,Computer,Log2_Computer,Log10_Computer,Natural_Log_Computer
0,36.0,625.0,368.64,8.526069,2.566602,5.909821
1,324.0,225.0,25.0,4.643856,1.39794,3.218876
2,1.0,1225.0,225.0,7.813781,2.352183,5.4161
3,225.0,25.0,100.0,6.643856,2.0,4.60517


In [191]:
score_1_df.agg('mean')

Math                    146.500000
Physics                 525.000000
Computer                179.660000
Log2_Computer             6.906891
Log10_Computer            2.079181
Natural_Log_Computer      4.787492
dtype: float64

###### Aggregate values along a specified axis

In [192]:
score_1_df.agg('mean',axis=1) # axis=1 aggregates across the columns, giving one value per row

0    174.440415
1     97.210112
2    244.430344
3     60.541504
dtype: float64
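
The two axis settings are easy to mix up, so here is a small sketch on a made-up two-by-two frame that shows what each one returns.

```python
import pandas as pd

scores = pd.DataFrame({"Math": [1.0, 2.0], "Physics": [3.0, 6.0]})

per_column = scores.agg("mean")       # axis=0 (default): one value per column
per_row = scores.agg("mean", axis=1)  # axis=1: one value per row

print(per_column)
print(per_row)
```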

###### Nesting multiple aggregations

In [193]:
score_1_df.agg(['min','max','sum','mean','std','var'])

Unnamed: 0,Math,Physics,Computer,Log2_Computer,Log10_Computer,Natural_Log_Computer
min,1.0,25.0,25.0,4.643856,1.39794,3.218876
max,324.0,1225.0,368.64,8.526069,2.566602,5.909821
sum,586.0,2100.0,718.64,27.627562,8.316725,19.149967
mean,146.5,525.0,179.66,6.906891,2.079181,4.787492
std,153.89282,529.150262,150.592814,1.696536,0.510708,1.175949
var,23683.0,280000.0,22678.195733,2.878233,0.260823,1.382856


###### Using different agg() functions on each column

In [194]:
score_1_df.agg({'Math':['sum','min','max'],'Log2_Computer':['mean'],'Log10_Computer':['std'],'Natural_Log_Computer':['var']})

Unnamed: 0,Math,Log2_Computer,Log10_Computer,Natural_Log_Computer
max,324.0,,,
mean,,6.906891,,
min,1.0,,,
std,,,0.510708,
sum,586.0,,,
var,,,,1.382856


###### Subtract one year from the date

In [195]:
students_score_df['Reg_Date_Less_1_Yr']=students_score_df['Reg_Date']-pd.DateOffset(years=1)
students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Reg_Date_Less_1_Yr
0,Tom,1790,2021-01-15,-34.0,-15.0,-20.8,2020-01-15
1,Peter,1731,2021-01-16,-22.0,-25.0,-35.0,2020-01-16
2,Mary,1780,2021-01-19,-39.0,-5.0,-25.0,2020-01-19
3,Smith,1755,2021-01-27,-25.0,-45.0,-30.0,2020-01-27


###### Subtract one month from the date

In [196]:
students_score_df['Reg_Date_Less_1_Mn']=students_score_df['Reg_Date']-pd.DateOffset(months=1)
students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Reg_Date_Less_1_Yr,Reg_Date_Less_1_Mn
0,Tom,1790,2021-01-15,-34.0,-15.0,-20.8,2020-01-15,2020-12-15
1,Peter,1731,2021-01-16,-22.0,-25.0,-35.0,2020-01-16,2020-12-16
2,Mary,1780,2021-01-19,-39.0,-5.0,-25.0,2020-01-19,2020-12-19
3,Smith,1755,2021-01-27,-25.0,-45.0,-30.0,2020-01-27,2020-12-27


###### Subtract one day from the date

In [197]:
students_score_df['Reg_Date_Less_1_Day']=students_score_df['Reg_Date']-pd.DateOffset(days=1)
students_score_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Reg_Date_Less_1_Yr,Reg_Date_Less_1_Mn,Reg_Date_Less_1_Day
0,Tom,1790,2021-01-15,-34.0,-15.0,-20.8,2020-01-15,2020-12-15,2021-01-14
1,Peter,1731,2021-01-16,-22.0,-25.0,-35.0,2020-01-16,2020-12-16,2021-01-15
2,Mary,1780,2021-01-19,-39.0,-5.0,-25.0,2020-01-19,2020-12-19,2021-01-18
3,Smith,1755,2021-01-27,-25.0,-45.0,-30.0,2020-01-27,2020-12-27,2021-01-26


# 13. Statistical functions

###### Summary statistics

In [198]:
titanic_df.describe(include='all')
titanic_df.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


###### Show min value

In [199]:
titanic_df['Age'].min() # min of age
titanic_df.min() # min of every column

Survived                                             0
Pclass                                               1
Name                       Capt. Edward Gifford Crosby
Sex                                             female
Age                                               0.42
Siblings/Spouses Aboard                              0
Parents/Children Aboard                              0
Fare                                                 0
dtype: object

###### Show max value

In [200]:
titanic_df['Age'].max() # max of age
titanic_df.max() # max of every column

Survived                                                                   1
Pclass                                                                     3
Name                       the Countess. of (Lucy Noel Martha Dyer-Edward...
Sex                                                                     male
Age                                                                       80
Siblings/Spouses Aboard                                                    8
Parents/Children Aboard                                                    6
Fare                                                                 512.329
dtype: object

###### Show mode of values

In [201]:
titanic_df['Age'].mode() # mode of Age

0    22.0
dtype: float64

###### Show median of values

In [202]:
titanic_df['Pclass'].median() # median of Pclass

3.0

###### Show sum of values

In [203]:
titanic_df['Fare'].sum() # sum of Fare

28654.907699999996

###### Show frequency of each category

In [204]:
titanic_df['Pclass'].value_counts() # frequency of passengers in each class

3    487
1    216
2    184
Name: Pclass, dtype: int64

###### Calculate mean

In [205]:
titanic_df['Age'].mean()

29.471443066516347

###### Calculate standard deviation

In [206]:
titanic_df['Age'].std() # std for age only
titanic_df.std() # std for all columns

Survived                    0.487004
Pclass                      0.836662
Age                        14.121908
Siblings/Spouses Aboard     1.104669
Parents/Children Aboard     0.807466
Fare                       49.782040
dtype: float64

###### Show Variance

In [207]:
titanic_df['Age'].var() # variance for age column only
titanic_df.var() # variance for all numeric columns

Survived                      0.237173
Pclass                        0.700003
Age                         199.428297
Siblings/Spouses Aboard       1.220293
Parents/Children Aboard       0.652001
Fare                       2478.251546
dtype: float64

###### Show Covariance

In [208]:
titanic_df[['Age','Fare']].cov() # Covariance of Age and Fare
titanic_df.cov() # Covariance for entire dataframe

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,0.237173,-0.137121,-0.410343,-0.01995,0.031497,6.210808
Pclass,-0.137121,0.700003,-4.625577,0.078584,0.013681,-22.862898
Age,-0.410343,-4.625577,199.428297,-4.643648,-2.209222,78.968988
Siblings/Spouses Aboard,-0.01995,0.078584,-4.643648,1.220293,0.369498,8.734998
Parents/Children Aboard,0.031497,0.013681,-2.209222,0.369498,0.652001,8.661314
Fare,6.210808,-22.862898,78.968988,8.734998,8.661314,2478.251546


###### Show Correlation
Correlation measures the strength and direction of the relationship between two variables.

###### 1. Pearson Correlation
Measures the linear relationship between two variables. The Pearson correlation coefficient is the default correlation method for a pandas DataFrame. NOTE: significance tests on the Pearson coefficient usually assume approximately normally distributed data, and the coefficient is sensitive to outliers.

In [209]:
titanic_df.corr(method='pearson') 
titanic_df.corr() # Or don't specify the method, since pearson is the default

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,1.0,-0.336528,-0.059665,-0.037082,0.080097,0.256179
Pclass,-0.336528,1.0,-0.391492,0.085026,0.020252,-0.548919
Age,-0.059665,-0.391492,1.0,-0.297669,-0.193741,0.112329
Siblings/Spouses Aboard,-0.037082,0.085026,-0.297669,1.0,0.414244,0.158839
Parents/Children Aboard,0.080097,0.020252,-0.193741,0.414244,1.0,0.21547
Fare,0.256179,-0.548919,0.112329,0.158839,0.21547,1.0


###### 2. Spearman Rank Correlation
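Measures the monotonic relationship between two variables by computing the Pearson correlation of their ranks. It does not assume that the data is normally distributed. Its time complexity is O(n log n), dominated by the sorting needed for ranking.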

In [210]:
titanic_df.corr(method='spearman') 

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,1.0,-0.337648,-0.030265,0.086571,0.13653,0.322264
Pclass,-0.337648,1.0,-0.387982,-0.040348,-0.020617,-0.688234
Age,-0.030265,-0.387982,1.0,-0.199269,-0.254234,0.156062
Siblings/Spouses Aboard,0.086571,-0.040348,-0.199269,1.0,0.449198,0.44598
Parents/Children Aboard,0.13653,-0.020617,-0.254234,0.449198,1.0,0.409202
Fare,0.322264,-0.688234,0.156062,0.44598,0.409202,1.0
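
To illustrate the difference between linear and monotonic association, the sketch below builds a made-up relationship y = x^3. It is strictly increasing but not a straight line, so the Spearman coefficient is 1 while the Pearson coefficient is lower.

```python
import pandas as pd

x = pd.Series(range(1, 11), dtype="float64")
data = pd.DataFrame({"x": x, "y": x ** 3})  # strictly increasing, but not linear

pearson = data.corr(method="pearson").loc["x", "y"]
spearman = data.corr(method="spearman").loc["x", "y"]

print(round(pearson, 4), round(spearman, 4))
```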


###### 3. Kendall Rank Correlation
It measures the ordinal association between two variables, based on the number of concordant and discordant pairs. It does not assume that the data is normally distributed. A straightforward implementation has a time complexity of O(n^2), so it tends to be a bit slower on large datasets.

In [211]:
titanic_df.corr(method='kendall') 

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Survived,1.0,-0.321558,-0.024993,0.083671,0.132233,0.264998
Pclass,-0.321558,1.0,-0.308253,-0.037085,-0.018995,-0.573648
Age,-0.024993,-0.308253,1.0,-0.15609,-0.200868,0.107466
Siblings/Spouses Aboard,0.083671,-0.037085,-0.15609,1.0,0.424367,0.357309
Parents/Children Aboard,0.132233,-0.018995,-0.200868,0.424367,1.0,0.329609
Fare,0.264998,-0.573648,0.107466,0.357309,0.329609,1.0


###### Calculate Kurtosis

In [212]:
titanic_df.kurtosis()

Survived                   -1.782183
Pclass                     -1.288638
Age                         0.292559
Siblings/Spouses Aboard    17.797537
Parents/Children Aboard     9.723066
Fare                       33.264605
dtype: float64

###### Calculate Skew

In [213]:
titanic_df.skew()

Survived                   0.470999
Pclass                    -0.623409
Age                        0.447189
Siblings/Spouses Aboard    3.686760
Parents/Children Aboard    2.741198
Fare                       4.777671
dtype: float64

###### Compute Percent change
Calculates the fractional change between each element and the element a given number of periods earlier (6.32 means +632%). Handle missing values (nulls) before computing the percent change.

In [214]:
titanic_df['Fare'].pct_change(periods=3)

0           NaN
1           NaN
2           NaN
3      6.324138
4     -0.887070
         ...   
882    0.238095
883    3.255319
884   -0.194850
885    1.307692
886   -0.741667
Name: Fare, Length: 887, dtype: float64
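
The definition can be written out by hand with shift. The sketch below uses four made-up values and periods=2 to show that both forms agree.

```python
import pandas as pd

fares = pd.Series([10.0, 20.0, 30.0, 60.0])

built_in = fares.pct_change(periods=2)
by_hand = fares / fares.shift(2) - 1  # current value relative to the value two rows earlier

print(pd.DataFrame({"built_in": built_in, "by_hand": by_hand}))
```

The first two rows are NaN because there is no value two rows earlier to compare against.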

###### Rank data
Ranks the values in each column. By default, tied values receive the average of their ranks, which is why fractional ranks such as 716.5 appear.

In [215]:
titanic_df.rank().head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,273.0,644.0,602.0,601.0,277.0,709.0,337.5,77.0
1,716.5,108.5,823.0,157.5,667.5,709.0,337.5,785.0
2,716.5,644.0,172.0,157.5,392.0,302.5,337.5,229.5
3,716.5,108.5,814.0,157.5,615.0,709.0,337.5,744.0
4,273.0,644.0,733.0,601.0,615.0,302.5,337.5,261.0
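
How ties are ranked depends on the method argument. A small sketch on made-up values with one tie:

```python
import pandas as pd

ages = pd.Series([10, 20, 20, 30])

average = ages.rank()              # default method="average": ties share the mean rank
minimum = ages.rank(method="min")  # ties take the lowest rank in the group
dense = ages.rank(method="dense")  # like "min", but the ranks have no gaps

print(pd.DataFrame({"age": ages, "average": average, "min": minimum, "dense": dense}))
```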


# 14. Window Functions

###### Lag in Pandas
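It's best practice to sort the values first based on a key column such as date. Here Reg_Date is still a string, so the sort is alphabetical; it matches chronological order only because all dates fall in the same month and year. Because the example sorts in descending order, shift(1) returns the value from the row with the next later date.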

In [216]:
windows_data_df = pd.DataFrame(
    {
        "Students": ["Tom", "Peter", "Mary", "Smith"],
        "Reg_No": ["1790","1731","1780", "1755"],
        "Reg_Date": ["15/01/2021", "13/01/2021", "14/01/2021", "27/01/2021"],
        "Math": [79.00, 67.00, 84.00, 70.00],
        "Physics": [60, 70, 50, 90],
        "Computer": [65.80, 80, 70, 75],
    }
)

windows_data_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer
0,Tom,1790,15/01/2021,79.0,60,65.8
1,Peter,1731,13/01/2021,67.0,70,80.0
2,Mary,1780,14/01/2021,84.0,50,70.0
3,Smith,1755,27/01/2021,70.0,90,75.0


In [217]:
windows_data_df=windows_data_df.sort_values(by='Reg_Date', ascending=False)
windows_data_df['Previous_Computer'] = windows_data_df['Computer'].shift(1)
windows_data_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Previous_Computer
3,Smith,1755,27/01/2021,70.0,90,75.0,
0,Tom,1790,15/01/2021,79.0,60,65.8,75.0
2,Mary,1780,14/01/2021,84.0,50,70.0,65.8
1,Peter,1731,13/01/2021,67.0,70,80.0,70.0


###### Lead in Pandas
As with lag, sort the values first based on a key column such as date. shift(-1) then pulls the value from the following row.

In [218]:
windows_data_df=windows_data_df.sort_values(by='Reg_Date', ascending=False)
windows_data_df['Next_Computer'] = windows_data_df['Computer'].shift(-1)
windows_data_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Previous_Computer,Next_Computer
3,Smith,1755,27/01/2021,70.0,90,75.0,,65.8
0,Tom,1790,15/01/2021,79.0,60,65.8,75.0,70.0
2,Mary,1780,14/01/2021,84.0,50,70.0,65.8,80.0
1,Peter,1731,13/01/2021,67.0,70,80.0,70.0,
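
For real data it is usually safer to parse the dates before sorting. The following sketch repeats lag and lead on a reduced copy of the same data (named scores here for illustration), with Reg_Date converted to datetime and sorted in ascending order so that "previous" means the earlier date.

```python
import pandas as pd

scores = pd.DataFrame({
    "Students": ["Tom", "Peter", "Mary", "Smith"],
    "Reg_Date": ["15/01/2021", "13/01/2021", "14/01/2021", "27/01/2021"],
    "Computer": [65.8, 80.0, 70.0, 75.0],
})

# Parse the day-first strings so the sort is chronological, not alphabetical
scores["Reg_Date"] = pd.to_datetime(scores["Reg_Date"], format="%d/%m/%Y")
scores = scores.sort_values("Reg_Date").reset_index(drop=True)

scores["Previous_Computer"] = scores["Computer"].shift(1)  # lag: value from the earlier date
scores["Next_Computer"] = scores["Computer"].shift(-1)     # lead: value from the later date

print(scores)
```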


###### Rolling window
Generic fixed or variable sliding window over the values.

In [219]:
# The Rolling_Window_Physics column adds the previous row's value to the current one,
# because window=2. The first row has no previous value, so it is NaN.
windows_data_df['Rolling_Window_Physics']=windows_data_df['Physics'].rolling(window=2).sum()
windows_data_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Previous_Computer,Next_Computer,Rolling_Window_Physics
3,Smith,1755,27/01/2021,70.0,90,75.0,,65.8,
0,Tom,1790,15/01/2021,79.0,60,65.8,75.0,70.0,150.0
2,Mary,1780,14/01/2021,84.0,50,70.0,65.8,80.0,110.0
1,Peter,1731,13/01/2021,67.0,70,80.0,70.0,,120.0
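
The same rolling sum can be written out with shift, which may help to see what the window does. This sketch uses the Physics values in the order shown above; min_periods=1 is an optional setting that allows incomplete windows.

```python
import pandas as pd

physics = pd.Series([90, 60, 50, 70])

rolled = physics.rolling(window=2).sum()                  # NaN until the window is full
by_hand = physics + physics.shift(1)                      # current value plus the previous one
partial = physics.rolling(window=2, min_periods=1).sum()  # accept a window with a single value

print(pd.DataFrame({"rolled": rolled, "by_hand": by_hand, "partial": partial}))
```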


###### Specifying Weights in the Windows 
Passing win_type to .rolling produces a rolling window computation that is weighted according to win_type (weighted windows require SciPy). Refer to the documentation for more details: https://pandas-docs.github.io/pandas-docs-travis/user_guide/computation.html#window-functions

In [220]:
windows_data_df['Rolling_Window_Physics']=windows_data_df['Physics'].rolling(window=2, win_type='triang').sum()
windows_data_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Previous_Computer,Next_Computer,Rolling_Window_Physics
3,Smith,1755,27/01/2021,70.0,90,75.0,,65.8,
0,Tom,1790,15/01/2021,79.0,60,65.8,75.0,70.0,75.0
2,Mary,1780,14/01/2021,84.0,50,70.0,65.8,80.0,55.0
1,Peter,1731,13/01/2021,67.0,70,80.0,70.0,,60.0


###### Expanding window
Accumulating window over the values. An expanding window yields the value of an aggregation statistic with all the data available up to that point in time.

In [221]:
'''An expanding window with min_periods=1 and the sum function resembles a cumulative sum.
    Here we use mean; try other aggregation functions as well.'''
windows_data_df['Expanding_Window_Physics']=windows_data_df['Physics'].expanding(min_periods=1).mean()
windows_data_df

Unnamed: 0,Students,Reg_No,Reg_Date,Math,Physics,Computer,Previous_Computer,Next_Computer,Rolling_Window_Physics,Expanding_Window_Physics
3,Smith,1755,27/01/2021,70.0,90,75.0,,65.8,,90.0
0,Tom,1790,15/01/2021,79.0,60,65.8,75.0,70.0,75.0,75.0
2,Mary,1780,14/01/2021,84.0,50,70.0,65.8,80.0,55.0,66.666667
1,Peter,1731,13/01/2021,67.0,70,80.0,70.0,,60.0,67.5
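
A quick sketch of the claim that an expanding sum behaves like a cumulative sum, again using the Physics values in the order shown above.

```python
import pandas as pd

physics = pd.Series([90, 60, 50, 70])

expanding_sum = physics.expanding(min_periods=1).sum()    # sum of everything seen so far
cumulative = physics.cumsum()                             # the same numbers via cumsum
expanding_mean = physics.expanding(min_periods=1).mean()  # running average

print(pd.DataFrame({"expanding_sum": expanding_sum, "cumsum": cumulative, "expanding_mean": expanding_mean}))
```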


###### Exponentially Weighted window
An exponentially weighted window accumulates all values up to the current point, like an expanding window, but each earlier point is weighted down exponentially relative to the current point. The EW functions include mean(), var(), std(), corr() and cov().

In [222]:
# Let's load our sales data
clv_df=pd.read_csv('clv.csv')

In [223]:
clv_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01/12/2010,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01/12/2010,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01/12/2010,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01/12/2010,3.39,17850.0,United Kingdom


In [224]:
# Let's create a sales column by multiplying Quantity by UnitPrice
clv_df['sales']=clv_df['Quantity']*clv_df['UnitPrice']
clv_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,sales
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,01/12/2010,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01/12/2010,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01/12/2010,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01/12/2010,3.39,17850.0,United Kingdom,20.34


In [225]:
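# Let's order our data in ascending order based on InvoiceDate
# (InvoiceDate is still a string here, so the sort is alphabetical rather than chronological;
# parse it with pd.to_datetime first for a true date order)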
clv_df=clv_df.sort_values(by='InvoiceDate',ascending=True)
clv_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,sales
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010,2.55,17850.0,United Kingdom,15.3
2066,536557,84029E,RED WOOLLY HOTTIE WHITE HEART.,1,01/12/2010,3.75,17841.0,United Kingdom,3.75
2067,536557,22678,FRENCH BLUE METAL DOOR SIGN 3,3,01/12/2010,1.25,17841.0,United Kingdom,3.75
2068,536557,22686,FRENCH BLUE METAL DOOR SIGN No,1,01/12/2010,1.25,17841.0,United Kingdom,1.25
2069,536557,22468,BABUSHKA LIGHTS STRING OF 10,1,01/12/2010,6.75,17841.0,United Kingdom,6.75


###### Exponentially weighted moving average
Exponentially weighted moving average gives more weight to recent observations, which makes it powerful to capture recent trends more quickly.

In [226]:
# Exponentially weighted moving average of sales (span=5 counts rows, not calendar days)
clv_df['5_day_Sales_EWM'] = clv_df['sales'].ewm(span=5).mean()
clv_df.head(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,sales,5_day_Sales_EWM
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010,2.55,17850.0,United Kingdom,15.3,15.3
2066,536557,84029E,RED WOOLLY HOTTIE WHITE HEART.,1,01/12/2010,3.75,17841.0,United Kingdom,3.75,8.37
2067,536557,22678,FRENCH BLUE METAL DOOR SIGN 3,3,01/12/2010,1.25,17841.0,United Kingdom,3.75,6.181579
2068,536557,22686,FRENCH BLUE METAL DOOR SIGN No,1,01/12/2010,1.25,17841.0,United Kingdom,1.25,4.133077
2069,536557,22468,BABUSHKA LIGHTS STRING OF 10,1,01/12/2010,6.75,17841.0,United Kingdom,6.75,5.137678
2070,536557,85232B,SET OF 3 BABUSHKA STACKING TINS,1,01/12/2010,4.95,17841.0,United Kingdom,4.95,5.069098
2071,536557,21479,WHITE SKULL HOT WATER BOTTLE,5,01/12/2010,3.75,17841.0,United Kingdom,18.75,9.912895
2072,536557,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,2,01/12/2010,3.75,17841.0,United Kingdom,7.5,9.07594
2073,536557,22837,HOT WATER BOTTLE BABUSHKA,5,01/12/2010,4.65,17841.0,United Kingdom,23.25,13.926809
2074,536557,22112,CHOCOLATE HOT WATER BOTTLE,2,01/12/2010,4.95,17841.0,United Kingdom,9.9,12.560851
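
To see where these numbers come from, the sketch below recomputes the first three values by hand. It assumes the pandas defaults, where span is converted to a smoothing factor alpha = 2 / (span + 1) and adjust=True normalises the weights by their sum.

```python
import pandas as pd

values = [15.3, 3.75, 3.75]  # first three sales values from the output above
sales = pd.Series(values)
span = 5
alpha = 2 / (span + 1)       # smoothing factor implied by span

built_in = sales.ewm(span=span).mean()  # adjust=True is the default

# Written out: the value k steps back gets weight (1 - alpha)**k,
# and the weighted sum is divided by the sum of the weights.
by_hand = []
for t in range(len(values)):
    weights = [(1 - alpha) ** (t - i) for i in range(t + 1)]
    weighted_sum = sum(w * x for w, x in zip(weights, values[: t + 1]))
    by_hand.append(weighted_sum / sum(weights))

print(built_in.round(6).tolist())
print([round(v, 6) for v in by_hand])
```

Both lists start with 15.3 and 8.37, matching the 5_day_Sales_EWM column.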
