# Tutorial 5: Data Cleaning
*Also called: data pre-processing, data wrangling*

## Objectives

After this tutorial you will be able to:

*   Identify and handle missing values
*   Remove duplicates
*   Standardize data
*   Validate the cleaned data to confirm it is accurate and complete

<h2>Table of Contents</h2>

<ol>
    <li>
        <a href="#import">Import the dataset</a>
    </li>
    <br>
    <li>
        <details>
            <summary><a href="#clean">Identify and handle common data cleaning problems</a></summary>
            <ul>
                <li><a href="#clean-missing">Handle missing values</a></li>
                <li><a href="#clean-duplicates">Remove duplicates</a></li>
                <li><a href="#clean-standardize">Standardize data</a></li>
            </ul>
        </details>
    </li>
    <br>
    <li>
        <a href="#validate">Validate cleaned data</a>
    </li>
    <br>    
</ol>


<hr id="import">

<h2>1. Import the dataset</h2>

Import the `pandas` library

In [None]:
import pandas as pd

Read the data from `data.csv` into a pandas `DataFrame`

In [None]:
df = pd.read_csv('data.csv')
df
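
If `data.csv` is not at hand, a small stand-in dataset is enough to follow the remaining steps. The sketch below is an assumption, not the tutorial's data: the column names are taken from the later cells, while every value and the name `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; the values are invented for illustration only.
csv_text = """Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
2024-01-01 08:00,CSTR,25.0,1.0,10.0
2024-01-01 09:00,Tubular,,1.2,12.0
2024-01-01 10:00,,30.0,1.1,
2024-01-01 11:00,MFR,28.0,1.3,11.0
2024-01-01 11:00,MFR,28.0,1.3,11.0
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

Empty fields are read as `NaN`, and the last two rows are identical, so this sample contains each of the problems handled in the next section.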

<hr id="clean">

<h2>2. Identify and handle common data cleaning problems</h2>

<h5 id="clean-missing">Handle missing values</h5>

Identify missing values

In [None]:
# summarize the columns: non-null counts and data types
df.info()

In [None]:
# find the number of missing values in each column
df.isna().sum()
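
Absolute counts can be hard to judge on a large dataset, so it may help to look at the share of missing values per column as well. A minimal sketch, using a hypothetical `demo` frame with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'Temperature [C]': [25.0, np.nan, 30.0, 28.0],
    'Flow Rate [l/min]': [10.0, 12.0, np.nan, np.nan],
})

# isna() yields booleans; their mean is the fraction of missing entries
missing_pct = demo.isna().mean() * 100
print(missing_pct)
```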

Drop rows that have "NaN" in certain columns

In [None]:
# drop the rows with missing values
df.dropna(subset=['Flow Rate [l/min]'], inplace=True)
df
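
`dropna` can also return a new `DataFrame` instead of modifying the original, and its `how` parameter controls how strict the filter is. The sketch below uses a hypothetical `demo` frame with invented values to compare two variants:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'Temperature [C]': [25.0, np.nan, np.nan],
    'Flow Rate [l/min]': [10.0, 12.0, np.nan],
})

# keep rows where 'Flow Rate [l/min]' is present; demo itself is unchanged
has_flow = demo.dropna(subset=['Flow Rate [l/min]'])

# drop only the rows in which every column is missing
not_empty = demo.dropna(how='all')

print(len(demo), len(has_flow), len(not_empty))
```

Working on a copy keeps the raw data available for comparison, which can be convenient while experimenting with different cleaning rules.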

Replace "NaN" with the column mean (for numeric data)

In [None]:
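# replace the missing values with the mean of the column
avg_temp = df['Temperature [C]'].mean()
print(avg_temp)

# assign the result back to the column: calling fillna(..., inplace=True) on a
# selected column is chained assignment, which recent pandas versions warn about
# and which may leave df unchanged
df['Temperature [C]'] = df['Temperature [C]'].fillna(avg_temp)
df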
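
The mean is sensitive to outliers, so for skewed measurements the median is sometimes a safer fill value. The sketch below is an illustration with invented numbers, not a recommendation for this dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one extreme value (500.0)
temps = pd.Series([20.0, 22.0, 21.0, 500.0, np.nan])

# both statistics skip NaN by default
mean_temp = temps.mean()
median_temp = temps.median()
print(mean_temp, median_temp)

# fill the gap with the median, which the outlier barely affects
filled = temps.fillna(median_temp)
```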

Replace "NaN" with the mode (for categorical data)

In [None]:
# count how often each value occurs in the column
df['Reactor Type'].value_counts()

In [None]:
# replace the missing values with the most frequent value
mode_type = df['Reactor Type'].mode()[0]
mode_type

In [None]:
# assign the result back instead of using inplace=True on a selected column
df['Reactor Type'] = df['Reactor Type'].fillna(mode_type)
df

<h5 id="clean-duplicates">Remove duplicates</h5>

In [None]:
# find the number of duplicate rows
df.duplicated().sum()

In [None]:
# drop the duplicate rows
df.drop_duplicates(inplace=True)
df
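
By default only rows that match in every column count as duplicates. When two readings share a key such as the timestamp but differ elsewhere, the `subset` and `keep` parameters of `drop_duplicates` can be used. A minimal sketch on a hypothetical `demo` frame with invented values:

```python
import pandas as pd

# Hypothetical data: two readings share the same timestamp
demo = pd.DataFrame({
    'Timestamp': ['08:00', '08:00', '09:00'],
    'Pressure [bar]': [1.0, 1.1, 1.2],
})

# no row is an exact copy of another, so the default check finds nothing
exact = demo.duplicated().sum()

# treat rows with the same Timestamp as duplicates and keep the last reading
deduped = demo.drop_duplicates(subset=['Timestamp'], keep='last')
print(exact)
print(deduped)
```

Whether the first or the last reading should survive depends on how the data was recorded, so the choice of `keep='last'` here is only an assumption.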

<h5 id="clean-standardize">Standardize data</h5>

In [None]:
# convert column to float
df['Temperature [C]'] = df['Temperature [C]'].astype('float64')

# convert multiple columns to float
df[['Pressure [bar]', 'Flow Rate [l/min]']] = df[['Pressure [bar]', 'Flow Rate [l/min]']].astype('float64')

# convert column to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
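
`astype('float64')` raises an error if a column contains text that cannot be parsed as a number. When that may happen, `pd.to_numeric` with `errors='coerce'` turns such entries into `NaN`, which can then be handled like any other missing value. A sketch with made-up input:

```python
import pandas as pd

# Hypothetical raw column containing a placeholder string
raw = pd.Series(['25.0', '30', 'n/a', '28.5'])

# unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
```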

In [None]:
# standardize the values in a column, one replacement at a time
df['Reactor Type'] = df['Reactor Type'].replace('Tubular', 'PFR')
df['Reactor Type'] = df['Reactor Type'].replace('MFR', 'CSTR')

# equivalently, a dictionary replaces multiple values in a single call
# (the two lines above have already done the work, so this changes nothing here)
df['Reactor Type'] = df['Reactor Type'].replace({
    'Tubular': 'PFR',
    'MFR': 'CSTR',
})
df
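
`replace` only matches exact strings, so stray whitespace or inconsistent capitalization would slip through. One possible approach is to normalize the text first with the `.str` accessor. The labels below are invented for illustration:

```python
import pandas as pd

# Hypothetical labels with inconsistent spacing and capitalization
types = pd.Series([' cstr', 'CSTR ', 'Tubular', 'tubular', 'MFR'])

# trim whitespace and use a single case before mapping synonyms
cleaned = types.str.strip().str.upper()
cleaned = cleaned.replace({'TUBULAR': 'PFR', 'MFR': 'CSTR'})
print(cleaned.value_counts())
```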

In [None]:
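# apply a function to a column: convert Celsius to Kelvin (K = C + 273.15)
df['Temperature [C]'] = df['Temperature [C]'].apply(lambda x: x + 273.15)
# rename the column so its label matches the new unit
df.rename(columns={'Temperature [C]': 'Temperature [K]'}, inplace=True)
df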
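
For simple arithmetic like this unit conversion, `apply` is not strictly needed: pandas operates on whole columns at once, which is usually shorter and faster. A sketch with hypothetical values, using the Celsius to Kelvin offset of 273.15:

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures in degrees Celsius
temps_c = pd.Series([25.0, 30.0, 0.0])

# vectorized arithmetic: the offset is added to every element
temps_k = temps_c + 273.15

# same result as the element-wise apply
temps_k_apply = temps_c.apply(lambda x: x + 273.15)
print(np.allclose(temps_k, temps_k_apply))
```

`apply` remains useful when the transformation cannot be written as column arithmetic.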

<hr id="validate">

<h2>3. Validate cleaned data</h2>

In [None]:
# check data types
df.dtypes

In [None]:
# check for missing values
df.isna().sum()

**OR** use the `info()` method, which shows both the data types and the non-null count of each column in one call

In [None]:
df.info()

In [None]:
# check for duplicates
df.duplicated().sum()
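
The checks above are read by eye. To make them fail loudly when something is wrong, they can be written as assertions. This is a sketch on a hypothetical `checked` frame; the allowed reactor types and the temperature rule are assumptions chosen to match the earlier cleaning steps:

```python
import pandas as pd

# Hypothetical cleaned data for illustration only
checked = pd.DataFrame({
    'Reactor Type': ['CSTR', 'PFR', 'CSTR'],
    'Temperature [K]': [298.15, 303.15, 301.15],
})

# each assertion raises an AssertionError with the given message if it fails
assert checked.isna().sum().sum() == 0, 'missing values remain'
assert not checked.duplicated().any(), 'duplicate rows remain'
assert checked['Reactor Type'].isin(['CSTR', 'PFR']).all(), 'unexpected reactor type'
assert (checked['Temperature [K]'] > 0).all(), 'non-physical temperature'
print('all checks passed')
```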

In [None]:
# save the cleaned data for future use
df.to_csv('cleaned_data.csv', index=False)
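
CSV files store everything as text, so a datetime column does not keep its type when the file is read again. Passing `parse_dates` to `read_csv` restores it. The sketch below writes to an in-memory buffer instead of a file and uses invented values:

```python
import io

import pandas as pd

# Hypothetical cleaned data with a datetime column
demo = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2024-01-01 08:00', '2024-01-01 09:00']),
    'Pressure [bar]': [1.0, 1.2],
})

# write to an in-memory buffer, which stands in for 'cleaned_data.csv'
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the listed columns back to datetime on load
reloaded = pd.read_csv(buffer, parse_dates=['Timestamp'])
print(reloaded.dtypes)
```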

<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>

<h2>References</h2>
<a href="https://www.w3schools.com/python/default.asp">w3schools.com</a>