# Tutorial 5: Data Cleaning
*Also called: data pre-processing, data wrangling*

## Objectives

After this tutorial you will be able to:

*   Identify and handle missing values
*   Remove duplicates
*   Standardize data
*   Validate the cleaned data to ensure that it is accurate and complete

<h2>Table of Contents</h2>

<ol>
    <li>
        <a href="#import">Import the dataset</a>
    </li>
    <br>
    <li>
        <details>
            <summary><a href="#clean">Identify and handle common data cleaning problems</a></summary>
            <ul>
                <li><a href="#clean-missing">Handle missing values</a></li>
                <li><a href="#clean-duplicates">Remove duplicates</a></li>
                <li><a href="#clean-standardize">Standardize data</a></li>
            </ul>
        </details>
    </li>
    <br>
    <li>
        <a href="#validate">Validate cleaned data</a>
    </li>
    <br>    
</ol>


<hr id="import">

<h2>1. Import the dataset</h2>

Import the `pandas` library

In [1]:
import pandas as pd

Read the data from `data.csv` into a pandas `DataFrame`

In [2]:
df = pd.read_csv('data.csv')
df

Unnamed: 0,Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
0,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
1,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2,2023-10-21 10:00:05,CSTR,76.0,10.1,51.0
3,2023-10-21 10:00:10,PFR,77.0,10.0,
4,2023-10-21 10:00:15,PFR,,17.0,60.0
5,2023-10-21 10:00:20,Tubular,78.0,17.0,60.0
6,2023-10-21 10:00:25,,79.0,17.0,61.0
7,2023-10-21 10:00:30,MFR,80.0,18.0,62.0
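
If `data.csv` is not at hand, the table above can be rebuilt in memory to follow along. The sketch below is a stand-in, not the original file: the text in `csv_text` is typed in from the output shown above, and it assumes the file has no separate index column (consistent with the five columns used in the rest of the tutorial).

```python
import io
import pandas as pd

# Stand-in for data.csv, copied by hand from the table above (an assumption,
# not the original file). Empty fields are read as NaN.
csv_text = """Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2023-10-21 10:00:05,CSTR,76.0,10.1,51.0
2023-10-21 10:00:10,PFR,77.0,10.0,
2023-10-21 10:00:15,PFR,,17.0,60.0
2023-10-21 10:00:20,Tubular,78.0,17.0,60.0
2023-10-21 10:00:25,,79.0,17.0,61.0
2023-10-21 10:00:30,MFR,80.0,18.0,62.0
"""

# read_csv accepts a file-like object, so StringIO can replace a file path
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```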


<hr id="clean">

<h2>2. Identify and handle common data cleaning problems</h2>

<h5 id="clean-missing">Handle missing values</h5>

Identify missing values

In [3]:
# get more info about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Timestamp          8 non-null      object 
 1   Reactor Type       7 non-null      object 
 2   Temperature [C]    7 non-null      float64
 3   Pressure [bar]     8 non-null      float64
 4   Flow Rate [l/min]  7 non-null      float64
dtypes: float64(3), object(2)
memory usage: 452.0+ bytes


In [4]:
# find the number of missing values in each column
df.isna().sum()

Timestamp            0
Reactor Type         1
Temperature [C]      1
Pressure [bar]       0
Flow Rate [l/min]    1
dtype: int64
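
Besides counting missing values per column, it can help to look at the affected rows themselves. A minimal sketch on a small made-up frame (`df_demo` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the tutorial dataset
df_demo = pd.DataFrame({
    'Reactor Type': ['CSTR', 'PFR', None, 'MFR'],
    'Temperature [C]': [75.0, np.nan, 79.0, 80.0],
    'Pressure [bar]': [10.0, 17.0, 17.0, 18.0],
})

# Boolean mask: True for rows with at least one missing value
has_missing = df_demo.isna().any(axis=1)
rows_with_missing = df_demo[has_missing]
print(rows_with_missing)

# Fraction of missing values per column (mean of True/False values)
missing_share = df_demo.isna().mean()
print(missing_share)
```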

Drop rows that have `NaN` in specific columns

In [5]:
# drop the rows where the flow rate is missing
df.dropna(subset=['Flow Rate [l/min]'], inplace=True)
df

Unnamed: 0,Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
0,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
1,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2,2023-10-21 10:00:05,CSTR,76.0,10.1,51.0
4,2023-10-21 10:00:15,PFR,,17.0,60.0
5,2023-10-21 10:00:20,Tubular,78.0,17.0,60.0
6,2023-10-21 10:00:25,,79.0,17.0,61.0
7,2023-10-21 10:00:30,MFR,80.0,18.0,62.0
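
`dropna` has a few other options that control which rows are removed. The sketch below compares them on a made-up frame (`df_demo` is illustrative only) and returns new frames instead of using `inplace=True`, so the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Illustrative frame: rows 0..3 have 3, 2, 1 and 0 non-missing values
df_demo = pd.DataFrame({
    'Temperature [C]': [75.0, np.nan, 77.0, np.nan],
    'Pressure [bar]': [10.0, 17.0, np.nan, np.nan],
    'Flow Rate [l/min]': [50.0, 60.0, np.nan, np.nan],
})

# Drop rows where one specific column is missing
only_flow = df_demo.dropna(subset=['Flow Rate [l/min]'])

# Drop rows with a missing value in any column (the default, how='any')
any_missing = df_demo.dropna()

# Drop only rows where every column is missing
all_missing = df_demo.dropna(how='all')

# Keep rows that have at least two non-missing values
at_least_two = df_demo.dropna(thresh=2)
```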


Replace `NaN` with the mean (for numeric data)

In [6]:
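# replace the missing values with the mean of the column
avg_temp = df['Temperature [C]'].mean()
# assign the result back to the column; fillna(..., inplace=True) on a selected
# column raises a warning in recent pandas versions and may not update df
df['Temperature [C]'] = df['Temperature [C]'].fillna(avg_temp)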
df

Unnamed: 0,Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
0,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
1,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2,2023-10-21 10:00:05,CSTR,76.0,10.1,51.0
4,2023-10-21 10:00:15,PFR,77.166667,17.0,60.0
5,2023-10-21 10:00:20,Tubular,78.0,17.0,60.0
6,2023-10-21 10:00:25,,79.0,17.0,61.0
7,2023-10-21 10:00:30,MFR,80.0,18.0,62.0
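
The value 77.166667 comes from the six remaining temperatures, because `mean()` skips `NaN` by default: (75 + 75 + 76 + 78 + 79 + 80) / 6 = 463 / 6. The sketch below reproduces this on a standalone `Series` and also shows the median, which is sometimes preferred when outliers are present (that choice is a suggestion, not part of the tutorial's workflow).

```python
import numpy as np
import pandas as pd

# Temperatures as they stand after the flow-rate row was dropped
temps = pd.Series([75.0, 75.0, 76.0, np.nan, 78.0, 79.0, 80.0],
                  name='Temperature [C]')

avg_temp = temps.mean()      # NaN is skipped: 463 / 6
med_temp = temps.median()    # middle of the six known values

# fillna returns a new Series; the original is left unchanged
filled = temps.fillna(avg_temp)
print(round(avg_temp, 6))
```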


Replace `NaN` with the mode (for categorical data)

In [7]:
# count how often each value occurs in the column
df['Reactor Type'].value_counts()

Reactor Type
CSTR       3
PFR        1
Tubular    1
MFR        1
Name: count, dtype: int64

In [8]:
# replace the missing values with the most frequent value
mode_type = df['Reactor Type'].mode()[0]
df['Reactor Type'] = df['Reactor Type'].fillna(mode_type)
df

Unnamed: 0,Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
0,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
1,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2,2023-10-21 10:00:05,CSTR,76.0,10.1,51.0
4,2023-10-21 10:00:15,PFR,77.166667,17.0,60.0
5,2023-10-21 10:00:20,Tubular,78.0,17.0,60.0
6,2023-10-21 10:00:25,CSTR,79.0,17.0,61.0
7,2023-10-21 10:00:30,MFR,80.0,18.0,62.0
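
The mean and mode replacements can also be combined into a single `fillna` call by passing a dictionary that maps column names to fill values. A minimal sketch on made-up data (`df_demo` and `fill_values` are illustrative names):

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column
df_demo = pd.DataFrame({
    'Reactor Type': ['CSTR', 'CSTR', 'PFR', None],
    'Temperature [C]': [75.0, np.nan, 78.0, 79.0],
})

# One fill value per column: mode for categorical, mean for numeric.
# mode() returns all most-frequent values in sorted order, so [0] picks
# the first one when there is a tie.
fill_values = {
    'Reactor Type': df_demo['Reactor Type'].mode()[0],
    'Temperature [C]': df_demo['Temperature [C]'].mean(),
}

df_filled = df_demo.fillna(fill_values)
print(df_filled)
```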


<h5 id="clean-duplicates">Remove duplicates</h5>

In [9]:
# find the number of duplicate rows
df.duplicated().sum()

1

In [10]:
# drop the duplicate rows
df.drop_duplicates(inplace=True)
df

Unnamed: 0,Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
0,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2,2023-10-21 10:00:05,CSTR,76.0,10.1,51.0
4,2023-10-21 10:00:15,PFR,77.166667,17.0,60.0
5,2023-10-21 10:00:20,Tubular,78.0,17.0,60.0
6,2023-10-21 10:00:25,CSTR,79.0,17.0,61.0
7,2023-10-21 10:00:30,MFR,80.0,18.0,62.0
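
By default, two rows count as duplicates only when every column matches. When one column should identify a record (for example, the timestamp), `subset` and `keep` give finer control. A sketch on made-up readings (the values are illustrative):

```python
import pandas as pd

# Two readings share a timestamp but differ in temperature
df_demo = pd.DataFrame({
    'Timestamp': ['2023-10-21 10:00:00', '2023-10-21 10:00:00',
                  '2023-10-21 10:00:05'],
    'Temperature [C]': [75.0, 75.5, 76.0],
})

# No full-row duplicates, because the temperatures differ
n_full = df_demo.duplicated().sum()

# Compare on the timestamp only: the second row repeats the first
dup_mask = df_demo.duplicated(subset=['Timestamp'])

# Keep the last reading for each timestamp instead of the first
keep_last = df_demo.drop_duplicates(subset=['Timestamp'], keep='last')
print(keep_last)
```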


<h5 id="clean-standardize">Standardize data</h5>

In [11]:
# convert column to float
df['Temperature [C]'] = df['Temperature [C]'].astype('float64')

# convert multiple columns to float
df[['Pressure [bar]', 'Flow Rate [l/min]']] = df[['Pressure [bar]', 'Flow Rate [l/min]']].astype('float64')

# convert column to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
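
The conversions above assume every value can be parsed. If a column might contain unparseable entries, `errors='coerce'` turns them into missing values instead of raising an error. A sketch with made-up bad entries (`'not a date'` and `'n/a'` are illustrative, not from the dataset):

```python
import pandas as pd

# Illustrative raw data where each column has one unparseable entry
raw = pd.DataFrame({
    'Timestamp': ['2023-10-21 10:00:00', 'not a date'],
    'Pressure [bar]': ['10.1', 'n/a'],
})

# Unparseable timestamps become NaT, unparseable numbers become NaN
raw['Timestamp'] = pd.to_datetime(raw['Timestamp'], errors='coerce')
raw['Pressure [bar]'] = pd.to_numeric(raw['Pressure [bar]'], errors='coerce')

print(raw.dtypes)
print(raw.isna().sum())
```

The coerced entries can then be handled with the missing-value techniques from the previous subsection.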

In [12]:
# standardize the values in a column
df['Reactor Type'] = df['Reactor Type'].replace('Tubular', 'PFR')
df['Reactor Type'] = df['Reactor Type'].replace('MFR', 'CSTR')
df

Unnamed: 0,Timestamp,Reactor Type,Temperature [C],Pressure [bar],Flow Rate [l/min]
0,2023-10-21 10:00:00,CSTR,75.0,10.0,50.0
2,2023-10-21 10:00:05,CSTR,76.0,10.1,51.0
4,2023-10-21 10:00:15,PFR,77.166667,17.0,60.0
5,2023-10-21 10:00:20,PFR,78.0,17.0,60.0
6,2023-10-21 10:00:25,CSTR,79.0,17.0,61.0
7,2023-10-21 10:00:30,CSTR,80.0,18.0,62.0
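
When several labels need to be mapped, `replace` also accepts a dictionary, which keeps all aliases in one place. A minimal sketch (the name `aliases` is illustrative):

```python
import pandas as pd

types = pd.Series(['CSTR', 'Tubular', 'MFR', 'PFR'], name='Reactor Type')

# Map each alternative label to the standard name used in this tutorial
aliases = {'Tubular': 'PFR', 'MFR': 'CSTR'}
standardized = types.replace(aliases)

print(standardized.tolist())
```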


In [13]:
# apply a function to a column: convert °C to K (273 is a rounded value of the exact offset 273.15)
df['Temperature [C]'] = df['Temperature [C]'].apply(lambda x: x + 273)
df.rename(columns={'Temperature [C]': 'Temperature [K]'}, inplace=True)
df

Unnamed: 0,Timestamp,Reactor Type,Temperature [K],Pressure [bar],Flow Rate [l/min]
0,2023-10-21 10:00:00,CSTR,348.0,10.0,50.0
2,2023-10-21 10:00:05,CSTR,349.0,10.1,51.0
4,2023-10-21 10:00:15,PFR,350.166667,17.0,60.0
5,2023-10-21 10:00:20,PFR,351.0,17.0,60.0
6,2023-10-21 10:00:25,CSTR,352.0,17.0,61.0
7,2023-10-21 10:00:30,CSTR,353.0,18.0,62.0
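
For simple arithmetic like this, the operation can also be written directly on the column without `apply`, since pandas applies it element by element. The sketch below uses the exact offset of 273.15, so its results differ from the rounded values shown above by 0.15 K.

```python
import pandas as pd

temps_c = pd.Series([75.0, 76.0, 80.0], name='Temperature [C]')

# Vectorized conversion: the addition is applied to every element
temps_k = (temps_c + 273.15).rename('Temperature [K]')

print(temps_k)
```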


<hr id="validate">

<h2>3. Validate cleaned data</h2>

In [14]:
# check data types
df.dtypes

Timestamp            datetime64[ns]
Reactor Type                 object
Temperature [K]             float64
Pressure [bar]              float64
Flow Rate [l/min]           float64
dtype: object

In [15]:
# check for duplicates
df.duplicated().sum()

0

In [16]:
# check for missing values
df.isna().sum()

Timestamp            0
Reactor Type         0
Temperature [K]      0
Pressure [bar]       0
Flow Rate [l/min]    0
dtype: int64
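
These checks can be collected into one reusable function so they are easy to rerun after every change. The sketch below is one possible arrangement: `validate`, the allowed reactor types, and the positivity check on temperature are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical set of accepted labels after standardization
ALLOWED_TYPES = {'CSTR', 'PFR'}

def validate(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append('missing values')
    if df.duplicated().any():
        problems.append('duplicate rows')
    if not set(df['Reactor Type']).issubset(ALLOWED_TYPES):
        problems.append('unexpected reactor type')
    if (df['Temperature [K]'] <= 0).any():
        problems.append('non-positive absolute temperature')
    return problems

# Small illustrative frame shaped like the cleaned data
df_clean = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2023-10-21 10:00:00', '2023-10-21 10:00:05']),
    'Reactor Type': ['CSTR', 'PFR'],
    'Temperature [K]': [348.0, 349.0],
    'Pressure [bar]': [10.0, 10.1],
})

print(validate(df_clean))
```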

In [17]:
# save the cleaned data for future use
df.to_csv('cleaned_data.csv', index=False)
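
A CSV file stores only text, so column types are not preserved: when the file is read back, timestamps arrive as strings unless they are parsed again. The sketch below shows the round trip using an in-memory buffer instead of a file on disk (an assumption made to keep the example self-contained).

```python
import io
import pandas as pd

df_clean = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2023-10-21 10:00:00', '2023-10-21 10:00:05']),
    'Temperature [K]': [348.0, 349.0],
})

# Write to an in-memory buffer in place of 'cleaned_data.csv'
buffer = io.StringIO()
df_clean.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the listed columns back to datetime on load
restored = pd.read_csv(buffer, parse_dates=['Timestamp'])
print(restored.dtypes)
```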

<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>

<h2>References</h2>
<a href="https://www.w3schools.com/python/default.asp">w3schools.com</a>