# Advanced Data Analysis Techniques with Python & Pandas

<a target="_blank" href="https://colab.research.google.com/github/JovianHQ/notebooks/blob/main/data-analysis-and-visualization-with-python/advanced-data-analysis-techniques/advanced-data-analysis-pandas.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial is a part of the [Zero to Data Science Bootcamp by Jovian](https://zerotodatascience.com).

![](https://i.imgur.com/jspPDKJ.png)

Pandas is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). Pandas offers several easy-to-use and efficient utilities for loading, processing, cleaning and analyzing large tabular datasets. Datasets containing millions of records can be processed using Pandas in a matter of minutes.

This tutorial covers the following topics:

- Downloading datasets from online sources
- Processing massive datasets using Pandas
- Working with categorical data
- Handling missing and duplicate data
- Transforming data with type-specific functions
- Data frame concatenation and merging

### How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Colab**. [Follow these instructions](https://jovian.ai/docs/user-guide/run.html#run-on-colab) to connect your Google Drive with Jovian.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>  **Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of _cells_. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.

Let's install and import the required libraries.

In [1]:
# Restart the kernel after installation
!pip install numpy pandas-profiling jovian --upgrade --quiet

In [2]:
import pandas as pd
import numpy as np
import jovian

## Finding and downloading datasets from online sources

There are many great sources for finding datasets online:

- [Kaggle datasets](http://kaggle.com/datasets)
- [World Bank Open Data](https://data.worldbank.org)
- [Yahoo Finance](https://finance.yahoo.com)
- [Google Dataset Search](https://datasetsearch.research.google.com)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php)
- [FastAI datasets](https://course.fast.ai/datasets)
- and many more

While some of these provide public URLs to an easily downloadable dataset archive, others require login to limit abuse. As an example, let's look at Kaggle, which contains over 60,000 community-curated datasets. We'll download the [US Accidents dataset](https://www.kaggle.com/sobhanmoosavi/us-accidents), which contains nearly 3 million records.
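For sources that do expose a direct download link, the Python standard library is enough. The sketch below is only an illustration: `download_file` is a hypothetical helper (not part of any library used in this tutorial), and the demo uses a local `file://` URL so that it runs offline. Replace it with the public URL of a real dataset archive.

```python
import shutil
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path


def download_file(url, dest_dir='.'):
    """Download `url` into `dest_dir` and return the local path (hypothetical helper)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name
    dest_path = dest_dir / Path(urllib.parse.urlparse(url).path).name
    with urllib.request.urlopen(url) as response, open(dest_path, 'wb') as f:
        # Stream to disk so that a large archive never has to fit in memory
        shutil.copyfileobj(response, f)
    return dest_path


# Demo with a local file:// URL standing in for a public dataset URL
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source' / 'population.csv'
    source.parent.mkdir()
    source.write_bytes(b'country,population\nA,100\nB,250\n')
    local_path = download_file(source.as_uri(), Path(tmp) / 'data')
    downloaded_bytes = local_path.read_bytes()
    downloaded_name = local_path.name

print(downloaded_name)
```

This approach does not work for sources that require authentication, which is the case discussed next.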


We can't use `requests` directly to download a dataset from Kaggle, because it doesn't provide a raw URL for the dataset. We'll use the `opendatasets` library, which can download a Kaggle dataset using an API token.

In [3]:
!pip install opendatasets --upgrade --quiet

In [4]:
import opendatasets as od

We'll use the `od.download` function to download the dataset.

In [5]:
help(od.download)

Help on function download in module opendatasets:

download(dataset_id_or_url, data_dir='.', force=False, dry_run=False, **kwargs)



In [6]:
us_accidents_url = 'https://www.kaggle.com/sobhanmoosavi/us-accidents'

In [7]:
od.download(us_accidents_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: aakashns
Your Kaggle Key: ········


  0%|          | 0.00/290M [00:00<?, ?B/s]

Downloading us-accidents.zip to ./us-accidents


100%|██████████| 290M/290M [00:52<00:00, 5.82MB/s] 





To download the dataset, you'll need to supply your Kaggle credentials, as explained here: https://github.com/jovianml/opendatasets#kaggle-credentials

The data has been downloaded and unzipped to the folder `./us-accidents`

In [8]:
!ls -lh us-accidents

total 2261120
-rw-r--r--  1 aakashns  staff   1.1G May 15 17:45 US_Accidents_Dec20_Updated.csv


It consists of just one file, `US_Accidents_Dec20_Updated.csv`, which is over 1 GB in size. We can also check the number of lines in the file using the `wc` terminal command (only works on Linux and Mac).


**NOTE**: The latest version of the `us-accidents` dataset is over 500 MB in size.

In [9]:
!wc -l us-accidents/US_Accidents_Dec20_Updated.csv

 2906611 us-accidents/US_Accidents_Dec20_Updated.csv


The file consists of over 2.9 million records! You can learn more about the dataset by reading the dataset description on Kaggle: https://www.kaggle.com/sobhanmoosavi/us-accidents .

**NOTE**: The latest version of the `us-accidents` dataset has 1.5 million records. 
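Since `wc` isn't available on Windows, the same count can be obtained in plain Python. Here is a minimal sketch, assuming a hypothetical helper named `count_lines`. It reads the file in binary chunks, so even a multi-gigabyte file never has to fit in memory. Like `wc -l`, the count includes the header row, so the number of records is one less.

```python
import tempfile
from pathlib import Path


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading it in 1 MB binary chunks."""
    count = 0
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b'\n')
    return count


# Demo on a small temporary file: one header row followed by three records
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo.csv'
    demo_path.write_bytes(b'ID,Severity\nA-1,2\nA-2,2\nA-3,3\n')
    n_lines = count_lines(demo_path)

print(n_lines - 1, 'records')
```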


Try downloading a few other datasets from the sources listed above.

> **EXERCISE**: Find and download a dataset providing country-wise population for the last 50 years. Use it to identify the countries with the highest percentage growth in population. What other insights can you gather from this data? Experiment with it in a new notebook.
>
> *Hint*: Visit https://data.worldbank.org .



> **EXERCISE**: Download the historical monthly stock price data for Apple Inc. (AAPL) since 1988. If you had bought Apple shares worth $100 on Jan 1, 1991, what would they be worth on Jan 1, 2021? What other insights can you gather from this data? Experiment with it in a new notebook.
>
> *Hint*: Visit https://finance.yahoo.com .

> **EXERCISE**: Learn about and download the dataset from https://archive.ics.uci.edu/ml/datasets/Air+quality . Show the trend of CO concentration using a line chart. What other insights can you gather from this data? Experiment with it in a new notebook.

In [None]:
jovian.commit()

[jovian] Attempting to save notebook..


## Processing massive datasets using Pandas

Let's load the US accidents data into a Pandas dataframe, and track the amount of time it takes using the `%%time` Jupyter magic command.

In [12]:
us_accidents_csv = 'us-accidents/US_Accidents_Dec20_Updated.csv'

In [13]:
%%time
accidents_df = pd.read_csv(us_accidents_csv)

CPU times: user 18.4 s, sys: 2.25 s, total: 20.6 s
Wall time: 20.8 s


While the exact time for this operation depends on the hardware configuration of your computer, you will likely find that it takes less than a minute for Pandas to process a 1.1 GB file containing over 2.9 million records. Isn't that impressive?

**NOTE**: The latest version of the `us-accidents` dataset is over 500 MB in size containing over 1.5 million records. 

Let's take a look at the first few rows, and gather some information about the dataset.



In [14]:
accidents_df.head()

Unnamed: 0,ID,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,2,2019-05-21 08:29:55,2019-05-21 09:29:40,34.808868,-82.269157,34.808868,-82.269157,0.0,Accident on Tanner Rd at Pennbrooke Ln.,...,False,False,False,False,False,False,Day,Day,Day,Day
1,A-2,2,2019-10-07 17:43:09,2019-10-07 19:42:50,35.09008,-80.74556,35.09008,-80.74556,0.0,Accident on Houston Branch Rd at Providence Br...,...,False,False,False,False,False,False,Day,Day,Day,Day
2,A-3,2,2020-12-13 21:53:00,2020-12-13 22:44:00,37.14573,-121.985052,37.16585,-121.988062,1.4,Stationary traffic on CA-17 from Summit Rd (CA...,...,False,False,False,False,False,False,Night,Night,Night,Night
3,A-4,2,2018-04-17 16:51:23,2018-04-17 17:50:46,39.11039,-119.773781,39.11039,-119.773781,0.0,Accident on US-395 Southbound at Topsy Ln.,...,False,False,False,False,True,False,Day,Day,Day,Day
4,A-5,3,2016-08-31 17:40:49,2016-08-31 18:10:49,26.102942,-80.265091,26.102942,-80.265091,0.0,Accident on I-595 Westbound at Exit 4 / Pine I...,...,False,False,False,False,True,False,Day,Day,Day,Day


In [15]:
accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906610 entries, 0 to 2906609
Data columns (total 47 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   ID                     object 
 1   Severity               int64  
 2   Start_Time             object 
 3   End_Time               object 
 4   Start_Lat              float64
 5   Start_Lng              float64
 6   End_Lat                float64
 7   End_Lng                float64
 8   Distance(mi)           float64
 9   Description            object 
 10  Number                 float64
 11  Street                 object 
 12  Side                   object 
 13  City                   object 
 14  County                 object 
 15  State                  object 
 16  Zipcode                object 
 17  Country                object 
 18  Timezone               object 
 19  Airport_Code           object 
 20  Weather_Timestamp      object 
 21  Temperature(F)         float64
 22  Wind_Chill(F)     

The dataset contains 2.9 million rows and 47 columns, and occupies 790 MB of memory (RAM). Let's look at some strategies to load the data faster and use less memory.

**NOTE**: The latest version of the `us-accidents` contains 1.5 million rows, 46 columns and occupies 412 MB of memory (RAM).

### Load only the required columns

You can provide the `usecols` argument to `read_csv` to create a dataframe with just the given columns. This reduces the loading time and uses less memory.

In [16]:
selected_cols = ['Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
                 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'City', 'State', 
                 'Timezone', 'Weather_Condition']

In [17]:
%%time
accidents_df2 = pd.read_csv(us_accidents_csv, usecols=selected_cols)

CPU times: user 10.3 s, sys: 711 ms, total: 11 s
Wall time: 11.2 s


In [18]:
accidents_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906610 entries, 0 to 2906609
Data columns (total 13 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Severity           int64  
 1   Start_Time         object 
 2   End_Time           object 
 3   Start_Lat          float64
 4   Start_Lng          float64
 5   End_Lat            float64
 6   End_Lng            float64
 7   Distance(mi)       float64
 8   Description        object 
 9   City               object 
 10  State              object 
 11  Timezone           object 
 12  Weather_Condition  object 
dtypes: float64(5), int64(1), object(7)
memory usage: 288.3+ MB


We've reduced the load time by over 40% and the memory usage by over 60%. 
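To decide which columns to keep, it helps to list the column names without loading any data, and to measure memory precisely. The following is a small sketch on a made-up, in-memory CSV (the rows are illustrative, not taken from the real file). Note that the `+` in `memory usage: 288.3+ MB` above means the strings in `object` columns are not fully counted; `memory_usage(deep=True)` includes them.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for the large file on disk
csv_text = """ID,Severity,City,State,Description
A-1,2,Greenville,SC,Accident on Tanner Rd
A-2,2,Charlotte,NC,Accident on Houston Branch Rd
A-3,3,Fort Lauderdale,FL,Accident on I-595 Westbound
"""

# nrows=0 parses only the header row, a cheap way to list the columns
available_cols = list(pd.read_csv(io.StringIO(csv_text), nrows=0).columns)
print(available_cols)

full_df = pd.read_csv(io.StringIO(csv_text))
slim_df = pd.read_csv(io.StringIO(csv_text), usecols=['Severity', 'State'])

# usecols also accepts a function that receives each column name
no_text_df = pd.read_csv(io.StringIO(csv_text),
                         usecols=lambda name: name != 'Description')

# deep=True includes the memory used by the strings themselves
full_bytes = full_df.memory_usage(deep=True).sum()
slim_bytes = slim_df.memory_usage(deep=True).sum()
print(f'All columns: {full_bytes} bytes, selected columns: {slim_bytes} bytes')
```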

In [19]:
accidents_df2.head()

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,City,State,Timezone,Weather_Condition
0,2,2019-05-21 08:29:55,2019-05-21 09:29:40,34.808868,-82.269157,34.808868,-82.269157,0.0,Accident on Tanner Rd at Pennbrooke Ln.,Greenville,SC,US/Eastern,Fair
1,2,2019-10-07 17:43:09,2019-10-07 19:42:50,35.09008,-80.74556,35.09008,-80.74556,0.0,Accident on Houston Branch Rd at Providence Br...,Charlotte,NC,US/Eastern,Cloudy
2,2,2020-12-13 21:53:00,2020-12-13 22:44:00,37.14573,-121.985052,37.16585,-121.988062,1.4,Stationary traffic on CA-17 from Summit Rd (CA...,Los Gatos,CA,US/Pacific,Fair
3,2,2018-04-17 16:51:23,2018-04-17 17:50:46,39.11039,-119.773781,39.11039,-119.773781,0.0,Accident on US-395 Southbound at Topsy Ln.,Carson City,NV,US/Pacific,Clear
4,3,2016-08-31 17:40:49,2016-08-31 18:10:49,26.102942,-80.265091,26.102942,-80.265091,0.0,Accident on I-595 Westbound at Exit 4 / Pine I...,Fort Lauderdale,FL,US/Eastern,Overcast


### Use smaller data types

By default, Pandas uses large datatypes like `int64` and `float64` for numerical data. However, in many cases the data in the CSV file can be represented using a smaller data type such as `int32`, `float32`, `int16` etc. 

Date columns can be specified using the `parse_dates` argument.
 

In [20]:
selected_cols = ['Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
                 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'City', 
                 'State', 'Timezone','Weather_Condition']

selected_dtypes = {
    'Severity': 'int16',
    'Start_Lat': 'float32',
    'Start_Lng': 'float32',
    'End_Lat': 'float32',
    'End_Lng': 'float32',
    'Distance(mi)': 'float32'   
}

In [21]:
%%time
accidents_df3 = pd.read_csv(us_accidents_csv, 
                            usecols=selected_cols, 
                            dtype=selected_dtypes, 
                            parse_dates=['Start_Time', 'End_Time'])

CPU times: user 10.4 s, sys: 670 ms, total: 11 s
Wall time: 11.1 s


In [22]:
accidents_df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906610 entries, 0 to 2906609
Data columns (total 13 columns):
 #   Column             Dtype         
---  ------             -----         
 0   Severity           int16         
 1   Start_Time         datetime64[ns]
 2   End_Time           datetime64[ns]
 3   Start_Lat          float32       
 4   Start_Lng          float32       
 5   End_Lat            float32       
 6   End_Lng            float32       
 7   Distance(mi)       float32       
 8   Description        object        
 9   City               object        
 10  State              object        
 11  Timezone           object        
 12  Weather_Condition  object        
dtypes: datetime64[ns](2), float32(5), int16(1), object(5)
memory usage: 216.2+ MB


The load time and memory gains depend on the nature of the dataset. In this case, it leads to a 25% reduction in memory usage, with about the same load time. However, keep in mind that we no longer need to parse the date columns separately, which itself would take a few seconds for this dataset.
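Smaller types come with two caveats: integer values must fit within the range of the chosen type, and `float32` keeps only about 7 significant digits. Here is a short sketch of how both can be checked on a small, made-up data frame before committing to a `dtype` mapping (the column names mirror the dataset, the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Severity': [2, 2, 3, 4],
    'Start_Lng': [-82.269157, -80.74556, -121.985052, -119.773781],
})

# Check that every integer value fits into the smaller type
limits = np.iinfo('int16')
fits_int16 = df['Severity'].between(limits.min, limits.max).all()

small_df = df.astype({'Severity': 'int16', 'Start_Lng': 'float32'})

# Measure how much precision the float32 conversion loses
max_error = (small_df['Start_Lng'].astype('float64') - df['Start_Lng']).abs().max()

# Compare the memory used by the column data (index excluded)
bytes_before = df.memory_usage(index=False).sum()
bytes_after = small_df.memory_usage(index=False).sum()
print(fits_int16, max_error, bytes_before, bytes_after)
```

For coordinates, an error of this size corresponds to roughly a meter or less, which is usually acceptable for analysis but worth knowing about.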

> **EXERCISE**: Parse the `Start_Time` and `End_Time` columns of `accidents_df2` as dates using `pd.to_datetime`. Measure the time taken for the conversion.

### Using binary formats for intermediate results

Since CSVs are plain text files that carry no type information, they often take longer to read than binary formats, which store the tabular structure and column types of the data. Files can be saved and loaded using the `feather` and `parquet` formats for more compact storage and faster processing.

Let's save `accidents_df` to the feather format and load it back. It requires the `pyarrow` library to be installed.

In [23]:
!pip install pyarrow --upgrade --quiet

In [24]:
%%time
accidents_df.to_feather('us-accidents.feather')

CPU times: user 6.17 s, sys: 2.57 s, total: 8.75 s
Wall time: 4.87 s


In [25]:
!ls -lh us-accidents.feather

-rw-r--r--  1 aakashns  staff   633M May 15 17:47 us-accidents.feather


The feather file is over 40% smaller than the CSV file.

In [26]:
%%time
accidents_df4 = pd.read_feather('us-accidents.feather')

CPU times: user 7.15 s, sys: 4.44 s, total: 11.6 s
Wall time: 8.03 s


Notice that reading a feather file is 60% faster compared to reading a CSV file.  It's a good idea to save the intermediate results of your analysis in the feather format, so that you can load the file faster and avoid recomputing results when you resume your work.

Check out a comparison of the feather and parquet formats here: https://ursalabs.org/blog/2020-feather-v2/
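Besides speed, binary formats preserve column types, whereas a CSV round trip loses them. The sketch below illustrates this on a made-up data frame. To avoid an extra dependency it uses Pandas' built-in pickle format as the binary format, which is a substitution on my part; with `pyarrow` installed, `to_feather` and `read_feather` can be swapped in. Pickle files should only be loaded from trusted sources.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'Severity': pd.Series([2, 3, 2], dtype='int16'),
    'Start_Time': pd.to_datetime(['2019-05-21 08:29:55',
                                  '2016-08-31 17:40:49',
                                  '2020-12-13 21:53:00']),
    'State': pd.Series(['SC', 'FL', 'CA'], dtype='category'),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'intermediate.csv'
    bin_path = Path(tmp) / 'intermediate.pkl'
    df.to_csv(csv_path, index=False)
    df.to_pickle(bin_path)
    from_csv = pd.read_csv(csv_path)      # types are inferred again from text
    from_bin = pd.read_pickle(bin_path)   # types are restored as saved

print(from_csv.dtypes)
print(from_bin.dtypes)
```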

### Working with a sample

When working with a large dataset, sometimes it's better to work with a sample to set up your notebook, and then repeat your analysis with the entire dataset, to save time. You can use the `nrows` argument to supply the number of rows to be read.

In [27]:
%%time
accidents_sample_df = pd.read_csv(us_accidents_csv, 
                                  usecols=selected_cols, 
                                  dtype=selected_dtypes, 
                                  nrows=1000,
                                  parse_dates=['Start_Time', 'End_Time'])

CPU times: user 19.3 ms, sys: 22.5 ms, total: 41.8 ms
Wall time: 40.7 ms


Reading the first 1000 rows takes just a few milliseconds. 
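Keep in mind that the first rows of a file may not be representative of the whole dataset (for example, if the file is sorted by date or state). One way to draw a random sample while reading is to pass a function to `skiprows`. The following is a sketch on a synthetic in-memory CSV; `skip_row` and `keep_fraction` are names chosen for this illustration.

```python
import io
import random

import pandas as pd

# Build a small CSV in memory to stand in for a large file on disk
csv_text = 'id,value\n' + ''.join(f'{i},{i * i}\n' for i in range(1000))

rng = random.Random(42)   # fixed seed so that the sample is reproducible
keep_fraction = 0.1       # keep roughly 10% of the rows


def skip_row(row_index):
    # Row 0 is the header and must always be kept
    return row_index > 0 and rng.random() > keep_fraction


random_sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)
print(len(random_sample_df))
```

Unlike `nrows`, this still scans the entire file, so it is slower, but the skipped rows are never turned into a data frame.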

### Using dask for parallelism and memory efficiency

Dask uses parallel processing to speed up data loading.

In [28]:
!pip install "dask[dataframe]" --quiet --upgrade

In [29]:
import dask.dataframe as dd

In [30]:
%%time
accidents_dask_df = dd.read_csv(us_accidents_csv)

CPU times: user 22.5 ms, sys: 5.1 ms, total: 27.6 ms
Wall time: 26.5 ms


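Notice that `dd.read_csv` returns almost instantly: Dask is lazy, so it only reads a small sample of the file to infer the column types and defers loading the data until a result is actually needed. Many Pandas operations are implemented in Dask so that they run in parallel over partitions of the data.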

In [31]:
accidents_dask_df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 47 entries, ID to Astronomical_Twilight
dtypes: object(20), bool(13), float64(13), int64(1)

To compute the memory usage, we need to provide `memory_usage=True`. Warning: This may take a while.

In [33]:
%%time
accidents_dask_df.info(memory_usage=True)

<class 'dask.dataframe.core.DataFrame'>
Columns: 47 entries, ID to Astronomical_Twilight
dtypes: object(20), bool(13), float64(13), int64(1)
memory usage: 790.0 MB
CPU times: user 34.6 s, sys: 5.07 s, total: 39.7 s
Wall time: 19.2 s


Keep in mind that dask has a slightly different API compared to Pandas, and not all Pandas functions will work the same way. Check out the documentation of Dask to learn more: https://docs.dask.org/en/latest/dataframe.html
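The idea of processing a file piece by piece can also be approximated in plain Pandas using the `chunksize` argument of `read_csv`, which returns an iterator of data frames instead of loading everything at once. This is a minimal sketch on a made-up in-memory CSV, not a replacement for Dask: the chunks are processed one after another, without parallelism.

```python
import io
from collections import Counter

import pandas as pd

rows = [('CA', 2), ('TX', 3), ('CA', 4), ('NC', 2), ('CA', 2), ('TX', 2)]
csv_text = 'State,Severity\n' + ''.join(f'{s},{sev}\n' for s, sev in rows)

state_counts = Counter()
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Each chunk is a regular DataFrame holding at most `chunksize` rows
    state_counts.update(chunk['State'].value_counts().to_dict())
    n_chunks += 1

print(n_chunks, dict(state_counts))
```

Only one chunk is held in memory at a time, so this pattern works for files larger than the available RAM, as long as the result being accumulated stays small.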

> **EXERCISE**: List the various file types supported by Pandas for reading & writing. Demonstrate their usage with some examples. Use the official documentation for reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

> **EXERCISE**: Save the contents of `accidents_df3` into various file formats like CSV, JSON, HTML, Excel, SQLite, Parquet, Feather etc. and read the files back using Pandas. Compare the writing time, size of created file and reading time for different formats.


> **EXERCISE**: Download the New York Taxi Fare Prediction dataset from https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data . Pick 7 columns of the dataset and save it to an efficient intermediate format. How much improvement can you achieve in the file size, memory usage and reading time using the techniques listed above? 
> 
> *Warning*: This dataset is quite large (> 10 GB after uncompressing). Make sure you have enough disk space before downloading it, or use an online platform like Google Colab.

In [34]:
jovian.commit()

[jovian] Attempting to save notebook..
[jovian] Updating notebook "aakashns/advanced-data-analysis-pandas" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aakashns/advanced-data-analysis-pandas


'https://jovian.ai/aakashns/advanced-data-analysis-pandas'

## Working with Categorical Data

Consider the `Weather_Condition` column of the `accidents_sample_df`. While the values in the column are strings, there are only a limited number of values or _categories_ that occur in the column. `Weather_Condition` is a _categorical column_.

In [35]:
weather = accidents_sample_df.Weather_Condition

In [36]:
weather

0               Fair
1             Cloudy
2               Fair
3              Clear
4           Overcast
           ...      
995       Light Snow
996            Clear
997    Mostly Cloudy
998    Mostly Cloudy
999             Haze
Name: Weather_Condition, Length: 1000, dtype: object

We can list all the values in the column using the `.unique` method.

In [37]:
weather.unique()

array(['Fair', 'Cloudy', 'Clear', 'Overcast', 'Light Snow',
       'Mostly Cloudy', 'Partly Cloudy', 'Scattered Clouds', 'Wintry Mix',
       'Shallow Fog', 'Fog', 'Haze', nan, 'Light Rain', 'Smoke', 'Rain',
       'Cloudy / Windy', 'Light Drizzle', 'Heavy Snow', 'Snow',
       'Thunderstorm', 'Light Rain Shower', 'Heavy Rain', 'Mist',
       'Thunderstorms and Rain', 'Fair / Windy', 'Light Freezing Rain',
       'Light Thunderstorms and Rain', 'Light Snow / Windy',
       'Thunder in the Vicinity', 'Drizzle', 'Rain / Windy', 'Thunder',
       'Drizzle and Fog'], dtype=object)

To check the number of unique values, use `nunique`.

In [38]:
weather.nunique()

33

We can see the number of occurrences of each value using `.value_counts()`.

In [39]:
weather.value_counts()

Fair                            252
Clear                           159
Mostly Cloudy                   134
Partly Cloudy                    95
Overcast                         84
Cloudy                           84
Light Rain                       43
Scattered Clouds                 35
Rain                             19
Light Snow                       16
Haze                             10
Fog                               7
Snow                              4
Cloudy / Windy                    4
Heavy Rain                        4
Light Drizzle                     3
Thunderstorms and Rain            3
Smoke                             3
Light Freezing Rain               3
Light Snow / Windy                2
Shallow Fog                       2
Wintry Mix                        2
Thunder in the Vicinity           1
Thunderstorm                      1
Mist                              1
Thunder                           1
Light Rain Shower                 1
Fair / Windy                

We can convert the string column to a categorical column in Pandas by changing its data type.

In [40]:
accidents_sample_df['Weather_Condition'] = accidents_sample_df['Weather_Condition'].astype('category')

In [41]:
accidents_sample_df['Weather_Condition']

0               Fair
1             Cloudy
2               Fair
3              Clear
4           Overcast
           ...      
995       Light Snow
996            Clear
997    Mostly Cloudy
998    Mostly Cloudy
999             Haze
Name: Weather_Condition, Length: 1000, dtype: category
Categories (33, object): ['Clear', 'Cloudy', 'Cloudy / Windy', 'Drizzle', ..., 'Thunder in the Vicinity', 'Thunderstorm', 'Thunderstorms and Rain', 'Wintry Mix']

While there's no visible change, the conversion allows Pandas to optimize the storage & querying for the column by representing each category internally using a numeric code.

We can view the codes for each row as follows:

In [42]:
accidents_sample_df['Weather_Condition'].cat.codes

0       5
1       1
2       5
3       0
4      20
       ..
995    15
996     0
997    19
998    19
999     8
Length: 1000, dtype: int8

The category code is the index of the category in the following list:

In [43]:
accidents_sample_df['Weather_Condition'].cat.categories

Index(['Clear', 'Cloudy', 'Cloudy / Windy', 'Drizzle', 'Drizzle and Fog',
       'Fair', 'Fair / Windy', 'Fog', 'Haze', 'Heavy Rain', 'Heavy Snow',
       'Light Drizzle', 'Light Freezing Rain', 'Light Rain',
       'Light Rain Shower', 'Light Snow', 'Light Snow / Windy',
       'Light Thunderstorms and Rain', 'Mist', 'Mostly Cloudy', 'Overcast',
       'Partly Cloudy', 'Rain', 'Rain / Windy', 'Scattered Clouds',
       'Shallow Fog', 'Smoke', 'Snow', 'Thunder', 'Thunder in the Vicinity',
       'Thunderstorm', 'Thunderstorms and Rain', 'Wintry Mix'],
      dtype='object')

Categorical columns are often replaced with their numeric codes before passing data into a machine learning algorithm which can only work with numbers. 
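The following sketch, on a made-up series of weather labels, shows the memory saved by the conversion, how missing values are encoded, and how codes can be mapped back to labels:

```python
import pandas as pd

# 10,000 labels with only three distinct values and some missing entries
weather = pd.Series(['Fair', 'Cloudy', 'Fair', 'Clear', None] * 2000)
weather_cat = weather.astype('category')

string_bytes = weather.memory_usage(deep=True)
category_bytes = weather_cat.memory_usage(deep=True)
print(f'strings: {string_bytes} bytes, category: {category_bytes} bytes')

codes = weather_cat.cat.codes
categories = weather_cat.cat.categories
first_codes = codes.head().tolist()
print(first_codes)  # missing values are encoded as -1

# Codes can be mapped back to labels using the list of categories
decoded = pd.Categorical.from_codes(codes, categories=categories)
```

Since missing values get the code `-1`, they usually need to be handled before the codes are passed to a machine learning algorithm.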

### Numeric Categorical Columns

The column `Severity` consists of categories too, even though its values are numeric.

In [44]:
accidents_sample_df.Severity.value_counts()

2    750
3    207
4     35
1      8
Name: Severity, dtype: int64

Let's convert it into a categorical column.

In [45]:
accidents_sample_df.Severity = accidents_sample_df.Severity.astype('category')

In [46]:
accidents_sample_df.Severity

0      2
1      2
2      2
3      2
4      3
      ..
995    3
996    3
997    2
998    2
999    2
Name: Severity, Length: 1000, dtype: category
Categories (4, int64): [1, 2, 3, 4]

In [47]:
accidents_sample_df.Severity.cat.categories

Int64Index([1, 2, 3, 4], dtype='int64')
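Severity levels also have a natural order. If that order matters for the analysis, the column can be declared as an *ordered* categorical, which keeps comparisons and `min`/`max` available. A brief sketch on made-up values, assuming that a higher number means a more severe accident:

```python
import pandas as pd

severity = pd.Series([2, 2, 3, 4, 1, 2])

# Declare the categories explicitly, from least to most severe
severity_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True)
severity_cat = severity.astype(severity_dtype)

severe_count = (severity_cat >= 3).sum()  # number of rows with severity 3 or 4
most_severe = severity_cat.max()
print(severe_count, most_severe)
```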

### One Hot Encoding

![](https://i.imgur.com/n8GuiOO.png)

Sometimes it's useful to create a new column for each category of a categorical column, and set the value in the column to `1` if the row belongs to the category and `0` otherwise. This technique is known as one-hot encoding and is commonly applied before passing data into machine learning algorithms.

We can use the `pd.get_dummies` function to create a new column for each category of a categorical column.

In [48]:
accidents_sample_df.Severity

0      2
1      2
2      2
3      2
4      3
      ..
995    3
996    3
997    2
998    2
999    2
Name: Severity, Length: 1000, dtype: category
Categories (4, int64): [1, 2, 3, 4]

In [49]:
severity_onehot_df = pd.get_dummies(accidents_sample_df.Severity)
severity_onehot_df

Unnamed: 0,1,2,3,4
0,0,1,0,0
1,0,1,0,0
2,0,1,0,0
3,0,1,0,0
4,0,0,1,0
...,...,...,...,...
995,0,0,1,0
996,0,0,1,0
997,0,1,0,0
998,0,1,0,0


The new columns can be added to the original data frame using the `pd.concat` function (we'll learn more about it later).

In [51]:
combined_df = pd.concat((accidents_sample_df, severity_onehot_df), axis=1)
combined_df.sample(5)

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,City,State,Timezone,Weather_Condition,1,2,3,4
26,3,2017-09-07 20:37:46,2017-09-07 21:07:21,32.87109,-80.010628,32.87109,-80.010628,0.0,Accident on I-26 Eastbound at Exit 213B Montag...,North Charleston,SC,US/Eastern,Partly Cloudy,0,0,1,0
799,3,2018-12-04 15:17:05,2018-12-04 15:45:57,30.008568,-90.219955,30.008568,-90.219955,0.0,2 right lane blocked due to accident on I-10 W...,Metairie,LA,US/Central,Partly Cloudy,0,0,1,0
108,2,2016-07-25 09:12:43,2016-07-25 09:58:41,29.849245,-95.411613,29.849245,-95.411613,0.0,Accident on TX-261 Spur Shepherd Dr at Montgom...,Houston,TX,US/Central,Mostly Cloudy,0,1,0,0
113,3,2020-06-14 20:38:00,2020-06-14 21:07:50,39.565201,-104.872261,39.565201,-104.872261,0.0,At County Line Rd/Exit 195 - Accident.,Englewood,CO,US/Mountain,Mostly Cloudy,0,0,1,0
780,2,2019-12-24 15:47:00,2019-12-24 16:53:27,33.787949,-117.880074,33.787949,-117.880074,0.0,At Chapman Ave (Orange) - Accident.,Orange,CA,US/Pacific,Mostly Cloudy,0,1,0,0
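Column names like `1`, `2`, `3` and `4` are easy to confuse with other columns after concatenation. The `prefix` argument of `pd.get_dummies` produces more descriptive names, and `dtype` controls the type of the new columns (since Pandas 2.0 the default is `bool` rather than 0/1 integers). Here is a sketch on a small, made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Severity': pd.Categorical([2, 3, 2, 4], categories=[1, 2, 3, 4]),
})

onehot_df = pd.get_dummies(df['Severity'], prefix='Severity', dtype='int8')
combined = pd.concat([df, onehot_df], axis=1)
print(combined)
```

Because `Severity` is a categorical column, a column is created for every declared category, including `Severity_1`, which does not occur in these four rows.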


> **EXERCISE**: Repeat the above steps with `accidents_df` and `accidents_dask_df`. Track and compare the times taken for each operation.

> **EXERCISE**: Perform one-hot encoding for the `Weather_Condition` column of the dataframe `accidents_sample_df`.

Learn more about working with categorical data in Pandas here: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

In [52]:
jovian.commit()

[jovian] Attempting to save notebook..
[jovian] Updating notebook "aakashns/advanced-data-analysis-pandas" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aakashns/advanced-data-analysis-pandas


'https://jovian.ai/aakashns/advanced-data-analysis-pandas'

## Handling missing & duplicate data

Missing data in Pandas is indicated using `np.nan`. We can find the number of missing values in each column of a dataframe using the following expression:

In [53]:
accidents_sample_df.isna().sum()

Severity              0
Start_Time            0
End_Time              0
Start_Lat             0
Start_Lng             0
End_Lat              96
End_Lng              96
Distance(mi)          0
Description           0
City                  0
State                 0
Timezone              0
Weather_Condition    21
dtype: int64

The `End_Lat` and `End_Lng` columns have 96 missing values, and the `Weather_Condition` column has 21 missing values.

**NOTE**: The latest version of the `us-accidents` dataset has no missing values in the `End_Lat` and `End_Lng` columns. Only the `Weather_Condition` column has 9 missing values.

> **EXERCISE**: What is the output of the `isna` method of a Pandas data frame or series? Demonstrate with examples.

We have the following options for dealing with missing values in numerical columns:

1. Leave them as is, if they won't affect your analysis
2. Replace them with an average 
3. Replace them with some other fixed value
4. Remove the rows containing missing values
5. Use the values from other rows & columns to estimate the missing value (imputation)

Here's how approach 4 can be applied:

In [54]:
fixed_df = accidents_sample_df.dropna(subset=['End_Lng', 'End_Lat'])

In [55]:
fixed_df.isna().sum()

Severity              0
Start_Time            0
End_Time              0
Start_Lat             0
Start_Lng             0
End_Lat               0
End_Lng               0
Distance(mi)          0
Description           0
City                  0
State                 0
Timezone              0
Weather_Condition    19
dtype: int64
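Approach 5 can be as simple as borrowing a value from a related column. The sketch below fills a missing `End_Lat` with the `Start_Lat` of the same row, on made-up values. It assumes that the accident ended close to where it started, which is plausible when `Distance(mi)` is near zero but should be verified for your own analysis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Start_Lat': [34.81, 30.42, 37.15],
    'End_Lat': [34.81, np.nan, 37.17],
})

# Record which values were estimated before overwriting them
df['End_Lat_Missing'] = df['End_Lat'].isna()

# fillna accepts a Series and aligns it with the column row by row
df['End_Lat'] = df['End_Lat'].fillna(df['Start_Lat'])
print(df)
```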

> **EXERCISE**: Replace the missing values in the columns `End_Lng` and `End_Lat` using the average value in each column. Hint: Use the function `.fillna`.

For categorical columns, we have the following options for dealing with missing values:

1. Leave them as is, if they won't affect your analysis
2. Create a new category for missing values
3. Replace them with the most frequent category (or by some other fixed value)
4. Replace them & add a new binary column indicating whether the value was missing
5. Replace the columns with one-hot encoded columns

In [56]:
weather = accidents_sample_df.Weather_Condition

In [57]:
weather.isna().sum()

21

Let's apply technique 4, i.e., replace the null values with the most common value (the mode) and add a binary column that tracks which values were missing.

In [58]:
# Create a copy of the original data frame
temp_df = accidents_sample_df.copy()

In [67]:
# Create a column to track missing values
temp_df['Weather_Missing'] = temp_df.Weather_Condition.isna()

In [64]:
# Check the most frequently occurring values
temp_df.Weather_Condition.value_counts().head(5)

Fair             252
Clear            159
Mostly Cloudy    134
Partly Cloudy     95
Cloudy            84
Name: Weather_Condition, dtype: int64

In [65]:
# Get the single most frequently occurring value
most_common_weather = temp_df.Weather_Condition.mode()[0]
most_common_weather

'Fair'

In [71]:
# Replace missing values with the most frequent value
temp_df['Weather_Condition'] = temp_df['Weather_Condition'].fillna(most_common_weather)

In [82]:
temp_df.sample(5)

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,City,State,Timezone,Weather_Condition,Weather_Missing
446,3,2018-09-25 21:02:21,2018-09-25 21:32:10,33.551689,-117.672981,33.551689,-117.672981,0.0,Lane blocked due to accident on I-5 Southbound...,Mission Viejo,CA,US/Pacific,Partly Cloudy,False
387,2,2018-10-28 10:39:11,2018-10-28 11:23:49,37.282066,-121.808708,37.282066,-121.808708,0.0,#2 lane blocked and right hand shoulder blocke...,San Jose,CA,US/Pacific,Scattered Clouds,False
5,3,2018-10-17 16:40:36,2018-10-17 17:10:18,35.34824,-80.847221,35.34824,-80.847221,0.0,Three lanes blocked due to accident on I-77 No...,Charlotte,NC,US/Eastern,Clear,False
320,2,2018-07-05 17:36:56,2018-07-05 23:36:56,33.99633,-117.927223,33.994061,-117.900612,1.532,Between Azusa Ave and Fullerton Rd - Accident.,Rowland Heights,CA,US/Pacific,Clear,False
152,3,2017-03-14 22:51:49,2017-03-14 23:21:30,30.424627,-97.671585,,,0.01,Accident on I-35 Service Rd Northbound at Exit...,Pflugerville,TX,US/Central,Clear,False


In [75]:
# Check for missing values again
temp_df.Weather_Condition.isna().sum()

0

In [76]:
# Check value counts
temp_df.Weather_Condition.value_counts().head(5)

Fair             273
Clear            159
Mostly Cloudy    134
Partly Cloudy     95
Cloudy            84
Name: Weather_Condition, dtype: int64

> **EXERCISE**: Apply the other techniques listed above to handle missing values in the dataframe `accidents_sample_df`.

> **EXERCISE**: Repeat the operations performed in the above section with `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

In [108]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Attempting to save notebook..
[jovian] Updating notebook "aakashns/advanced-data-analysis-pandas" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aakashns/advanced-data-analysis-pandas


'https://jovian.ai/aakashns/advanced-data-analysis-pandas'

### Duplicate Data

In [85]:
accidents_sample_df.duplicated().sum()

0

In [86]:
candies_df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

In [88]:
candies_df

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


In [87]:
candies_df.duplicated()

0    False
1     True
2    False
3    False
4    False
dtype: bool

If required, duplicate rows can be removed using the `.drop_duplicates` method.

In [89]:
candies_df.drop_duplicates()

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


Think carefully about how the data was collected before removing duplicates. Removing duplicates may not always be the right approach.
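
Both `.duplicated` and `.drop_duplicates` accept `subset` and `keep` arguments, which control what counts as a duplicate and which copy survives. The sketch below uses a hypothetical `ratings_df` with the same shape as the table above; the variable names are for illustration only.

```python
import pandas as pd

# Hypothetical ratings table, similar in shape to the example above
ratings_df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# Flag every copy of a fully duplicated row (not just the later ones)
all_copies = ratings_df.duplicated(keep=False)

# Treat rows as duplicates when only brand and style match
same_product = ratings_df.duplicated(subset=['brand', 'style'])

# Keep the last row for each brand/style pair instead of the first
latest = ratings_df.drop_duplicates(subset=['brand', 'style'], keep='last')

print(all_copies.tolist())
print(same_product.tolist())
print(latest)
```

Whether two rows with the same brand and style but different ratings are "duplicates" depends on what each row represents, which is why it helps to know how the data was collected.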

> **EXERCISE**: Check for duplicates in `accidents_df` and remove them if required.

> **EXERCISE**: Repeat the exercises in this section with `accidents_df` and `accidents_dask_df` and track the time taken by each operation.

In [90]:
jovian.commit()


## Transforming and aggregating data with type-specific functions

Pandas offers several methods for working with specific types of data. We can also use NumPy functions to perform operations on Pandas series. Let's look at some utility methods for three types of data: numbers, strings and dates.

### Numbers

Here are some functions useful for transforming and aggregating numeric data.

In [91]:
accidents_sample_df.head(5)

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,City,State,Timezone,Weather_Condition
0,2,2019-05-21 08:29:55,2019-05-21 09:29:40,34.808868,-82.269157,34.808868,-82.269157,0.0,Accident on Tanner Rd at Pennbrooke Ln.,Greenville,SC,US/Eastern,Fair
1,2,2019-10-07 17:43:09,2019-10-07 19:42:50,35.09008,-80.74556,35.09008,-80.74556,0.0,Accident on Houston Branch Rd at Providence Br...,Charlotte,NC,US/Eastern,Cloudy
2,2,2020-12-13 21:53:00,2020-12-13 22:44:00,37.145729,-121.985054,37.165852,-121.98806,1.4,Stationary traffic on CA-17 from Summit Rd (CA...,Los Gatos,CA,US/Pacific,Fair
3,2,2018-04-17 16:51:23,2018-04-17 17:50:46,39.11039,-119.773781,39.11039,-119.773781,0.0,Accident on US-395 Southbound at Topsy Ln.,Carson City,NV,US/Pacific,Clear
4,3,2016-08-31 17:40:49,2016-08-31 18:10:49,26.102942,-80.265091,26.102942,-80.265091,0.0,Accident on I-595 Westbound at Exit 4 / Pine I...,Fort Lauderdale,FL,US/Eastern,Overcast


In [92]:
distance = accidents_sample_df['Distance(mi)']

In [105]:
# Sum
distance.sum()

422.59003

In [94]:
# Average
distance.mean()

0.42259002

In [95]:
# Standard deviation
distance.std()

1.3762158

In [96]:
# Median
distance.median()

0.0

We can also apply NumPy functions to Pandas series.

In [101]:
# Square root
np.sqrt(distance)

0      0.000000
1      0.000000
2      1.183216
3      0.000000
4      0.000000
         ...   
995    0.000000
996    0.000000
997    0.100000
998    0.277489
999    0.568331
Name: Distance(mi), Length: 1000, dtype: float32

In [106]:
# Power
np.power(distance, 2)

0      0.000000
1      0.000000
2      1.960000
3      0.000000
4      0.000000
         ...   
995    0.000000
996    0.000000
997    0.000100
998    0.005929
999    0.104329
Name: Distance(mi), Length: 1000, dtype: float32

In [107]:
# Variance
np.var(distance)

1.892076
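
The variance above is slightly smaller than the square of the standard deviation computed earlier (1.3762158² ≈ 1.894). The likely reason is a difference in defaults: `np.var` divides by `N` (`ddof=0`), while the `.std` and `.var` methods of a Pandas series divide by `N - 1` (`ddof=1`). A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # hypothetical data

sample_var = values.var()            # divides by N - 1 (ddof=1)
population_var = values.var(ddof=0)  # divides by N
numpy_var = np.var(values)           # NumPy's default is ddof=0

print(sample_var, population_var, numpy_var)
```

Passing `ddof` explicitly avoids surprises when mixing NumPy functions and Pandas methods.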

Pandas series also support arithmetic operators.

In [102]:
# Addition
distance + 2

0      2.000
1      2.000
2      3.400
3      2.000
4      2.000
       ...  
995    2.000
996    2.000
997    2.010
998    2.077
999    2.323
Name: Distance(mi), Length: 1000, dtype: float32

In [103]:
# Multiplication
distance_km = distance * 1.6

In [104]:
distance_km

0      0.0000
1      0.0000
2      2.2400
3      0.0000
4      0.0000
        ...  
995    0.0000
996    0.0000
997    0.0160
998    0.1232
999    0.5168
Name: Distance(mi), Length: 1000, dtype: float32

> **EXERCISE**: Try out some more arithmetic operations with other numeric columns of `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

### Strings

The `.str` property of a Pandas series provides several utility functions for manipulating string data.

In [109]:
accidents_sample_df.head(5)

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,City,State,Timezone,Weather_Condition
0,2,2019-05-21 08:29:55,2019-05-21 09:29:40,34.808868,-82.269157,34.808868,-82.269157,0.0,Accident on Tanner Rd at Pennbrooke Ln.,Greenville,SC,US/Eastern,Fair
1,2,2019-10-07 17:43:09,2019-10-07 19:42:50,35.09008,-80.74556,35.09008,-80.74556,0.0,Accident on Houston Branch Rd at Providence Br...,Charlotte,NC,US/Eastern,Cloudy
2,2,2020-12-13 21:53:00,2020-12-13 22:44:00,37.145729,-121.985054,37.165852,-121.98806,1.4,Stationary traffic on CA-17 from Summit Rd (CA...,Los Gatos,CA,US/Pacific,Fair
3,2,2018-04-17 16:51:23,2018-04-17 17:50:46,39.11039,-119.773781,39.11039,-119.773781,0.0,Accident on US-395 Southbound at Topsy Ln.,Carson City,NV,US/Pacific,Clear
4,3,2016-08-31 17:40:49,2016-08-31 18:10:49,26.102942,-80.265091,26.102942,-80.265091,0.0,Accident on I-595 Westbound at Exit 4 / Pine I...,Fort Lauderdale,FL,US/Eastern,Overcast


In [110]:
# Change to lowercase
accidents_sample_df.Description.str.lower()

0                accident on tanner rd at pennbrooke ln.
1      accident on houston branch rd at providence br...
2      stationary traffic on ca-17 from summit rd (ca...
3             accident on us-395 southbound at topsy ln.
4      accident on i-595 westbound at exit 4 / pine i...
                             ...                        
995    two lanes blocked due to accident on i-90 west...
996    accident on i-694 eastbound at exits 35a 35b 3...
997    lane blocked and very slow traffic due to acci...
998    stationary traffic on m-11 from chamberlain av...
999    a disabled vehicle has the left lane closed sb...
Name: Description, Length: 1000, dtype: object

In [111]:
# Convert a string into a list of words
words = accidents_sample_df.Description.str.split(' ')
words

0        [Accident, on, Tanner, Rd, at, Pennbrooke, Ln.]
1      [Accident, on, Houston, Branch, Rd, at, Provid...
2      [Stationary, traffic, on, CA-17, from, Summit,...
3      [Accident, on, US-395, Southbound, at, Topsy, ...
4      [Accident, on, I-595, Westbound, at, Exit, 4, ...
                             ...                        
995    [Two, lanes, blocked, due, to, accident, on, I...
996    [Accident, on, I-694, Eastbound, at, Exits, 35...
997    [Lane, blocked, and, very, slow, traffic, due,...
998    [Stationary, traffic, on, M-11, from, Chamberl...
999    [A, disabled, vehicle, has, the, left, lane, c...
Name: Description, Length: 1000, dtype: object

In [112]:
words[0]

['Accident', 'on', 'Tanner', 'Rd', 'at', 'Pennbrooke', 'Ln.']

In [113]:
# Replacing a substring
accidents_sample_df.Description.str.replace('Accident', 'ACCIDENT')

0                ACCIDENT on Tanner Rd at Pennbrooke Ln.
1      ACCIDENT on Houston Branch Rd at Providence Br...
2      Stationary traffic on CA-17 from Summit Rd (CA...
3             ACCIDENT on US-395 Southbound at Topsy Ln.
4      ACCIDENT on I-595 Westbound at Exit 4 / Pine I...
                             ...                        
995    Two lanes blocked due to accident on I-90 West...
996    ACCIDENT on I-694 Eastbound at Exits 35A 35B 3...
997    Lane blocked and very slow traffic due to acci...
998    Stationary traffic on M-11 from Chamberlain Av...
999    A disabled vehicle has the left lane closed SB...
Name: Description, Length: 1000, dtype: object

In [114]:
# Remove whitespace
idx = pd.Index(["    jack", "jill     ", " jesse   ", "    frank    montana   "])
idx.str.strip()

Index(['jack', 'jill', 'jesse', 'frank    montana'], dtype='object')

In [115]:
# Checking the presence of a substring
accidents_sample_df.Description.str.contains("Accident")

0       True
1       True
2      False
3       True
4       True
       ...  
995    False
996     True
997    False
998    False
999    False
Name: Description, Length: 1000, dtype: bool
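
Note that `str.contains` is case-sensitive by default, which is why row 995 (containing a lowercase "accident") is reported as `False` above. Here is a small sketch, using hypothetical descriptions, of case-insensitive matching with `case=False` and of using the resulting boolean series as a filter:

```python
import pandas as pd

# Hypothetical descriptions; note the lowercase "accident" in the second one
descriptions = pd.Series([
    'Accident on Tanner Rd at Pennbrooke Ln.',
    'Two lanes blocked due to accident on I-90',
    'Stationary traffic on CA-17',
])

case_sensitive = descriptions.str.contains('Accident')
case_insensitive = descriptions.str.contains('accident', case=False)

# A boolean series can be used to select the matching rows
matching = descriptions[case_insensitive]

print(case_sensitive.tolist())
print(case_insensitive.tolist())
print(matching)
```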

> **EXERCISE**: Explore other string methods supported by Pandas data frames and series: https://pandas.pydata.org/docs/user_guide/text.html#string-methods . Demonstrate their usage with examples.

### Date & Time

The `.dt` property of a Pandas series provides utility methods for working with dates and times.

In [117]:
start_time = accidents_sample_df.Start_Time

In [118]:
start_time

0     2019-05-21 08:29:55
1     2019-10-07 17:43:09
2     2020-12-13 21:53:00
3     2018-04-17 16:51:23
4     2016-08-31 17:40:49
              ...        
995   2018-01-15 08:15:01
996   2018-12-15 20:23:10
997   2017-01-13 20:42:28
998   2020-09-22 12:46:00
999   2020-09-11 02:00:41
Name: Start_Time, Length: 1000, dtype: datetime64[ns]

Let's extract different parts of the date.

In [120]:
# Year
start_time.dt.year

0      2019
1      2019
2      2020
3      2018
4      2016
       ... 
995    2018
996    2018
997    2017
998    2020
999    2020
Name: Start_Time, Length: 1000, dtype: int64

In [121]:
# Month
start_time.dt.month

0       5
1      10
2      12
3       4
4       8
       ..
995     1
996    12
997     1
998     9
999     9
Name: Start_Time, Length: 1000, dtype: int64

In [122]:
# Day
start_time.dt.day

0      21
1       7
2      13
3      17
4      31
       ..
995    15
996    15
997    13
998    22
999    11
Name: Start_Time, Length: 1000, dtype: int64

In [126]:
# Convert to date string
start_time.dt.strftime('%Y-%m-%d')

0      2019-05-21
1      2019-10-07
2      2020-12-13
3      2018-04-17
4      2016-08-31
          ...    
995    2018-01-15
996    2018-12-15
997    2017-01-13
998    2020-09-22
999    2020-09-11
Name: Start_Time, Length: 1000, dtype: object
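
The `.dt` accessor exposes several other components as well. The sketch below, built on two timestamps copied from the sample output above, shows the hour, the day of the week as a number, and the day name:

```python
import pandas as pd

# Two timestamps copied from the sample output above
times = pd.Series(pd.to_datetime(['2019-05-21 08:29:55', '2020-12-13 21:53:00']))

hours = times.dt.hour            # hour of the day (0-23)
weekdays = times.dt.dayofweek    # Monday=0 ... Sunday=6
day_names = times.dt.day_name()  # e.g. 'Tuesday'

print(hours.tolist())
print(weekdays.tolist())
print(day_names.tolist())
```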

> **EXERCISE**: Explore other date methods supported by Pandas series: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dt-accessors . Demonstrate their usage with examples.

### `map` and `apply`

The `map` method of a column/series can be used to apply a custom function to each element of a series. Let's use it to convert the distance from miles to kilometres.

In [130]:
def convert_to_km(dist_miles):
    return dist_miles * 1.6

In [132]:
distance_km = distance.map(convert_to_km)
distance_km.name = 'Distance(km)'
distance_km

0      0.0000
1      0.0000
2      2.2400
3      0.0000
4      0.0000
        ...  
995    0.0000
996    0.0000
997    0.0160
998    0.1232
999    0.5168
Name: Distance(km), Length: 1000, dtype: float64
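
`map` also accepts a dictionary, which is handy for translating codes into labels. In the sketch below, the severity codes and their labels are hypothetical and chosen only for illustration. Values without an entry in the dictionary become `NaN`.

```python
import pandas as pd

severity = pd.Series([2, 3, 2, 4])  # hypothetical severity codes

# Hypothetical labels; the code 4 is deliberately left out
labels = {1: 'low', 2: 'moderate', 3: 'high'}

severity_label = severity.map(labels)
print(severity_label)
```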

The `apply` method can be used to apply a custom function to each column or row of a dataframe (pass `axis=1` to process rows). Let's use it to compute the duration of each event.

In [135]:
accidents_sample_df.head(3)

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,City,State,Timezone,Weather_Condition
0,2,2019-05-21 08:29:55,2019-05-21 09:29:40,34.808868,-82.269157,34.808868,-82.269157,0.0,Accident on Tanner Rd at Pennbrooke Ln.,Greenville,SC,US/Eastern,Fair
1,2,2019-10-07 17:43:09,2019-10-07 19:42:50,35.09008,-80.74556,35.09008,-80.74556,0.0,Accident on Houston Branch Rd at Providence Br...,Charlotte,NC,US/Eastern,Cloudy
2,2,2020-12-13 21:53:00,2020-12-13 22:44:00,37.145729,-121.985054,37.165852,-121.98806,1.4,Stationary traffic on CA-17 from Summit Rd (CA...,Los Gatos,CA,US/Pacific,Fair


In [144]:
def compute_duration(row):
    return (row.End_Time - row.Start_Time).total_seconds()

In [145]:
accidents_sample_df['Duration'] = accidents_sample_df.apply(compute_duration, axis=1)

In [146]:
accidents_sample_df.head(3)

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,City,State,Timezone,Weather_Condition,Duration
0,2,2019-05-21 08:29:55,2019-05-21 09:29:40,34.808868,-82.269157,34.808868,-82.269157,0.0,Accident on Tanner Rd at Pennbrooke Ln.,Greenville,SC,US/Eastern,Fair,3585.0
1,2,2019-10-07 17:43:09,2019-10-07 19:42:50,35.09008,-80.74556,35.09008,-80.74556,0.0,Accident on Houston Branch Rd at Providence Br...,Charlotte,NC,US/Eastern,Cloudy,7181.0
2,2,2020-12-13 21:53:00,2020-12-13 22:44:00,37.145729,-121.985054,37.165852,-121.98806,1.4,Stationary traffic on CA-17 from Summit Rd (CA...,Los Gatos,CA,US/Pacific,Fair,3060.0
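
Row-wise `apply` calls a Python function once per row, which can be slow on large data frames. For simple column arithmetic like this, a vectorized expression is usually faster. A minimal sketch, using two rows copied from the sample above (the `events` data frame is constructed here only for illustration):

```python
import pandas as pd

# Two rows copied from the sample above
events = pd.DataFrame({
    'Start_Time': pd.to_datetime(['2019-05-21 08:29:55', '2020-12-13 21:53:00']),
    'End_Time': pd.to_datetime(['2019-05-21 09:29:40', '2020-12-13 22:44:00']),
})

# Row by row, as in the example above
by_apply = events.apply(lambda row: (row.End_Time - row.Start_Time).total_seconds(), axis=1)

# Vectorized: subtract whole columns, then convert the timedeltas to seconds
vectorized = (events.End_Time - events.Start_Time).dt.total_seconds()

print(by_apply.tolist())
print(vectorized.tolist())
```

Both approaches give the same durations; `apply` remains useful when the per-row logic can't be expressed with column operations.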


> **EXERCISE**: Look up the documentation for the `applymap` method of a data frame (in Pandas 2.1 and later, this method is named `DataFrame.map`, and `applymap` is deprecated). How is it different from the `apply` and `map` methods? Demonstrate with examples.

> **EXERCISE**: Repeat the operations performed in this section (type-specific functions) with `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

Learn more about `map` and `apply` here: https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff

In [None]:
jovian.commit()

## Data frame concatenation and merging

Pandas provides various utilities for combining multiple data frames. We'll look at two examples in this section: concatenation and merging.

### Concatenation

Concatenation is the process of stacking two or more dataframes vertically or horizontally. When concatenating vertically, columns are lined up together. Here's what vertical concatenation looks like:

![](https://i.imgur.com/ti195t3.png)

In [147]:
df1 = pd.DataFrame(
    {
         "A": ["A0", "A1", "A2", "A3"],
         "B": ["B0", "B1", "B2", "B3"],
         "C": ["C0", "C1", "C2", "C3"],
         "D": ["D0", "D1", "D2", "D3"],
    }, index=[0, 1, 2, 3])
 

df2 = pd.DataFrame(
    {
         "A": ["A4", "A5", "A6", "A7"],
         "B": ["B4", "B5", "B6", "B7"],
         "C": ["C4", "C5", "C6", "C7"],
         "D": ["D4", "D5", "D6", "D7"],
    }, index=[4, 5, 6, 7])
 

df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    }, index=[8, 9, 10, 11])
 

We can now concatenate these along axis 0, i.e., vertically, using `pd.concat`.

In [148]:
pd.concat([df1, df2, df3], axis=0)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


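In Pandas versions before 2.0, this operation can also be performed using the `.append` method of a dataframe. The method was deprecated in Pandas 1.4 and removed in 2.0, so prefer `pd.concat` in new code.

In [149]:
# Requires Pandas < 2.0; in newer versions use pd.concat([df1, df2, df3])
df1.append([df2, df3])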

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
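
Vertical concatenation keeps each frame's original index labels. When the frames being stacked have overlapping labels, the result contains repeated labels unless you pass `ignore_index=True`. A small sketch with two hypothetical frames (`top` and `bottom` are not part of the original notebook):

```python
import pandas as pd

# Hypothetical frames; note that both use the index labels 0 and 1
top = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
bottom = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

stacked = pd.concat([top, bottom])                        # keeps the original labels
renumbered = pd.concat([top, bottom], ignore_index=True)  # assigns fresh labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
```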


> **EXERCISE**: Remove the column `D` from `df3`. How does it affect the result of vertical concatenation? Try passing the argument `join="inner"` to `pd.concat`. Do you observe any change?

> **EXERCISE**: Create two dataframes that don't have any common columns and concatenate them vertically. What do you observe? Try providing the arguments `join="outer"` and `join="inner"`. How do they affect the results?

> **EXERCISE**: Explore the arguments supported by `pd.concat` and come up with some examples to demonstrate the purpose of each argument.

Concatenation can also be performed horizontally by providing the argument `axis=1` to `pd.concat`. Rows are lined up together using the index.



In [150]:
df1 = pd.DataFrame(
    {
         "A": ["A0", "A1", "A2", "A3"],
         "B": ["B0", "B1", "B2", "B3"],
         "C": ["C0", "C1", "C2", "C3"],
         "D": ["D0", "D1", "D2", "D3"],
    }, index=[0, 1, 2, 3])

df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [151]:
df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    }, index=[2, 3, 6, 7])

df4

Unnamed: 0,B,D,F
2,B2,D2,F2
3,B3,D3,F3
6,B6,D6,F6
7,B7,D7,F7


In [152]:
pd.concat([df1, df4], axis=1)

Unnamed: 0,A,B,C,D,B,D,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


`pd.concat` performs an "outer" join by default, which retains all the index labels from both data frames. An "inner" join retains only the labels common to both.

In [153]:
pd.concat([df1, df4], axis=1, join="inner")

Unnamed: 0,A,B,C,D,B,D,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


Learn more about dataframe concatenation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects

In [154]:
jovian.commit()


### Merging

Two Pandas dataframes can be merged row-wise on one or more columns using the `.merge` method of a dataframe (or the equivalent `pd.merge` function). A merge can be performed in several ways:

![](https://i.imgur.com/p2fXTFs.png)

> **EXERCISE**: Demonstrate the four types of join listed above using the following dataframes. Use the `key` column for merging.

In [155]:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    })

right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K4", "K5"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    })

In [156]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [157]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K4,C2,D2
3,K5,C3,D3


In [158]:
# Inner join
pd.merge(left, right, how="inner", on="key")

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
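
To get a feel for how the `how` argument changes which keys survive a merge, here is a sketch on two hypothetical tables (`orders` and `customers` are made up and unrelated to `left` and `right` above). It counts the rows produced by each join type and uses `indicator=True` to show where each row of an outer join came from.

```python
import pandas as pd

# Hypothetical tables sharing a customer_id column
orders = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [10, 20, 30]})
customers = pd.DataFrame({'customer_id': [2, 3, 4], 'name': ['Ann', 'Bob', 'Cy']})

# Number of rows produced by each type of join
row_counts = {
    how: len(pd.merge(orders, customers, how=how, on='customer_id'))
    for how in ['inner', 'left', 'right', 'outer']
}

# indicator=True adds a `_merge` column showing where each row came from
outer = pd.merge(orders, customers, how='outer', on='customer_id', indicator=True)

print(row_counts)
print(outer)
```

Rows whose key appears in only one table have missing values in the columns that come from the other table.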


In [159]:
# Left join


In [160]:
# Right join


In [161]:
# Outer join


> **EXERCISE**: Show an example of merging two dataframes on two columns.
> 
> *Hint*: Read the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging

> **EXERCISE**: Look up the documentation for the `.join` method of a dataframe. How is it different from `pd.merge`? Demonstrate with examples. Hint: By default, a join is performed on the index.

In [None]:
jovian.commit()


## Summary and Further Reading

We've covered the following topics in this tutorial:

- Downloading datasets from online sources
- Processing massive datasets using Pandas
- Handling missing, incorrect & duplicate data
- Transforming data with type-specific functions
- Techniques for encoding categorical data
- Data frame concatenation and merging

As an exercise, you can apply the above to other datasets, from the following sources:

- [Kaggle datasets](http://kaggle.com/datasets)
- [World Bank Open Data](https://data.worldbank.org)
- [Yahoo Finance](https://finance.yahoo.com)
- [Google Dataset Search](https://datasetsearch.research.google.com)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php)
- [FastAI datasets](https://course.fast.ai/datasets)


Check out the following resources to learn more:

- Working with categorical data in Pandas: https://jovian.ai/himani007/categorical-data-with-pandas
- Working with large datasets in Pandas: https://jovian.ai/himani007/pandas1-large-datasets
- Python for Data Analysis: https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython-ebook/dp/B075X4LT6K
- Pandas API reference: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
- Merging Pandas dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
- Advanced Pandas tutorial notebooks: https://www.kaggle.com/residentmario/welcome-to-advanced-pandas
- Dask dataframes documentation: https://docs.dask.org/en/latest/dataframe.html
- [How to load CSV files 10x faster and use 10x less memory](https://towardsdatascience.com/%EF%B8%8F-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-%EF%B8%8F-e93b485086c7)


## Questions for Revision

1. How do you download a dataset from Kaggle?
2. How do you check the length of a file on Windows?
3. What is the purpose of `%%time`?
4. What are the different methods and functions you can use to get information about the data in a dataframe?
5. How do you load only the required columns from a large dataset?
6. What is the purpose of using smaller data types?
7. How is `parse_dates` different from `pd.to_datetime`?
8. What are the different formats one can use when loading CSV files for better memory efficiency and faster processing?
9. How does working with a sample of your data first help with analysis?
10. What is Dask?
11. What is categorical data? How do you deal with it during analysis?
12. What is one-hot encoding?
13. What are the different techniques to handle missing values?
14. Why should one be careful when removing duplicates from the data?
15. What are the different methods you can use on numeric, string, and date type data?
16. How is `map()` different from `apply()`?
17. What is `applymap()`?
18. What is the `axis` parameter in Pandas?
19. How do `join='inner'` and `join='outer'` work?
20. What are the several ways to perform `merge()`?
21. What is the `on` parameter in `merge()`?
22. How is `concat()` different from `merge()`?