# DATA PREPROCESSING
Data preprocessing is a key step in any data analysis or machine learning pipeline, and it is often interleaved with **Data Exploration**. It involves cleaning, transforming and organizing raw data to ensure it is accurate, consistent and ready for modeling. It has a large impact on model building in several ways:

- Clean and well-structured data allows models to learn meaningful patterns rather than noise.

- Properly processed data prevents misleading inputs, leading to more reliable predictions.

- Organized data makes it simpler to create useful inputs for the model, enhancing model performance.

- Organized data supports better Exploratory Data Analysis (EDA), making patterns and trends more interpretable.

**Steps in Data Preprocessing:** The key steps are **Data Cleaning**, **Data Integration**, **Data Transformation**, **Data Reduction** and **Feature Engineering**.

## I. Data Cleaning
Data cleaning is the process of correcting errors and inconsistencies, normalizing values, and handling the missing, duplicated or irrelevant data that we discovered during the **Data Exploration** step. Clean data is essential for effective analysis, as it improves the quality of results and enhances the performance of data models. We separate data cleaning into **six key steps**:

- **Syntax Errors**: Correcting issues such as **typos, incorrect data types, and invalid characters**.

- **Format Data**: Standardizing data formats, including **dates, numeric values, text cases, and units of measurement**.

- **Irrelevant Data**: Removing data fields or records that do not contribute to the analytical objective.

- **Handling Missing Data**: Addressing missing values through **removal, imputation, or estimation**.

- **Handling Duplicated Data**: Eliminating duplicate records.

- **Handling Outliers**: Managing extreme or abnormal values that have been detected.

Through insights gained from **Data Understanding of the Raw Dataset**, we recognize that the collected data is relatively clean, with no missing values or duplicates. Therefore, in this section, we focus on addressing the remaining issues through the **Irrelevant Data** and **Format Data** steps.
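The "no missing values or duplicates" observation can be confirmed with a quick check. The sketch below is only illustrative: it uses a small toy frame with made-up column names and values rather than the real dataset, and the same two calls would be applied to `df` after loading.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical columns and values).
toy = pd.DataFrame({
    "city": ["Hanoi", "Hanoi", "Da Nang", "Hue"],
    "pm25": [35.0, 35.0, None, 18.5],
})

# Number of missing (NaN) values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(f"duplicate rows: {duplicate_rows}")
```

If both counts are zero on the real data, the **Handling Missing Data** and **Handling Duplicated Data** steps can be skipped, as we do here.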

As in any step, we begin by importing the libraries and loading the data with `pandas`.

**Import Libraries**

In [None]:
import os
import sys
import numpy as np
import pandas as pd

sys.path.append(os.path.abspath(os.path.join('..')))
from src import preprocessing as dp

**Load the Dataset**

In [15]:
path = "../data/raw/vietnam_air_quality.csv"
try:
    df = pd.read_csv(path)
    print("[SUCCESS]: Loading dataset successful")
except Exception as e:
    print(f"[ERROR]: Failed to load dataset: {e}")

[SUCCESS]: Loading dataset successful


### 1. Format Data
As we saw when exploring the data types during **Data Exploration**, most columns already have the right data type; the exception is `timestamp`. In this section we therefore convert it to the right data type, `datetime`.

In [16]:
try:
    # errors='raise' stops on the first value that cannot be parsed as a date
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='raise')
    print("[SUCCESS] Formatting data was successful")
except Exception as e:
    print(f"[ERROR] Formatting data failed: {e}")

[SUCCESS] Formatting data was successful
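To confirm that a conversion like this worked, we can check the resulting dtype. As an alternative to `errors='raise'`, `errors='coerce'` turns unparsable values into `NaT` so they can be counted instead of stopping the cell. The sketch below assumes a small made-up series, not the real `timestamp` column.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately not a valid date.
raw = pd.Series(["2024-01-01 00:00:00", "2024-01-01 01:00:00", "not a date"])

# errors="coerce" replaces values that cannot be parsed with NaT.
parsed = pd.to_datetime(raw, errors="coerce")

# Count the values that failed to parse and confirm the new dtype.
n_unparsed = int(parsed.isna().sum())
is_datetime = pd.api.types.is_datetime64_any_dtype(parsed)

print(f"unparsed values: {n_unparsed}, datetime dtype: {is_datetime}")
```

Using `errors='raise'`, as in the cell above, is the stricter choice: it fails loudly rather than silently introducing missing timestamps.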


### 2. Irrelevant Data
The goal of this step is to reduce memory usage and noise by removing columns that are not essential for the analysis or for building the model.

Of the **20 columns** in our dataset, we drop `lat` and `lon`, because these columns were only used for mapping to create the **`city`** column.
 

In [17]:
try:
    df.drop(columns=["lat", "lon"], inplace=True)
    print("[SUCCESS] Irrelevant columns dropped successfully")
except Exception as e:
    print(f"[ERROR] Dropping irrelevant columns failed: {e}")

[SUCCESS] Irrelevant columns dropped successfully
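The memory saving from dropping columns can be measured with `DataFrame.memory_usage`. The sketch below uses a toy frame with invented values (only the `lat`, `lon` and `city` names come from our dataset), so the numbers it prints are illustrative and not a measurement of the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 1000 rows is an arbitrary size.
toy = pd.DataFrame({
    "city": ["Hanoi"] * 1000,
    "lat": np.full(1000, 21.03),
    "lon": np.full(1000, 105.85),
    "pm25": np.linspace(10, 60, 1000),
})

# deep=True also counts the memory held by string objects.
before = toy.memory_usage(deep=True).sum()

# drop returns a new frame here, leaving the original untouched.
trimmed = toy.drop(columns=["lat", "lon"])
after = trimmed.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```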


### 3. Handling Outliers
Before handling outliers, we need to save the data processed in the previous steps so that exploration can continue on it. The saved data is used to examine the distributions, and from those we can gain insight into the outliers.

### 4. Save Processed Data

In [18]:
path = "../data/processed/processed_data.csv"
df.to_csv(path, index=False)
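One thing to keep in mind when saving to CSV is that the file stores `timestamp` as plain text, so the `datetime` type set in **Format Data** is not preserved on reload unless `parse_dates` is passed to `pd.read_csv`. The sketch below shows this on a toy frame, using an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

# Toy frame with a datetime column (made-up values).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 01:00:00"]),
    "pm25": [35.0, 18.5],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without parse_dates leaves the column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with parse_dates restores the datetime type.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["timestamp"])

print(plain["timestamp"].dtype, restored["timestamp"].dtype)
```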

After checking for outliers in **Data Exploration**, we decided not to handle outliers in the **Data Cleaning** process, for the following reasons:

...

However, if the model's accuracy or efficiency turns out to be poor in the **Modeling** step, we will consider returning to this step - **Handling Outliers**.
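If we do return to this step, one common way to flag outliers is the interquartile range (IQR) rule. The sketch below is an assumption about how that could look, not the method used in **Data Exploration**. The values are invented, and the 1.5 multiplier is the conventional default rather than a value tuned for this dataset.

```python
import pandas as pd

# Made-up readings with one obviously extreme value.
values = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 120.0], name="pm25")

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])
```

Flagging is kept separate from removal on purpose: in air quality data an extreme reading may be a real pollution event, so whether to drop, cap or keep flagged values is a modeling decision.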


## II. Data Integration
Data integration involves merging data from various sources into a single, unified dataset. It can be challenging due to differences in data formats, structures, and meanings. However, this dataset is stored in a single `csv` file, so no record linkage or data fusion is needed at this step.
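For reference, if a second source were added later, a basic integration step could look like the sketch below. Both frames, the `region` column and all values are hypothetical; only the `city` key name comes from our dataset.

```python
import pandas as pd

# Hypothetical first source: pollutant readings per city.
readings = pd.DataFrame({"city": ["Hanoi", "Hue"], "pm25": [35.0, 18.5]})

# Hypothetical second source: metadata per city.
stations = pd.DataFrame({"city": ["Hanoi", "Da Nang"], "region": ["North", "Central"]})

# A left join keeps every reading; indicator=True adds a "_merge" column
# showing whether a match was found in the second source.
merged = readings.merge(stations, on="city", how="left", indicator=True)

print(merged)
```

Rows marked `left_only` in `_merge` have no match in the second source, which is where mismatched formats or spellings between sources usually show up.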