# Preprocessing

This notebook preprocesses the raw data: it cleans the data and prepares it for exploratory data analysis and model training.

Preprocessing puts the data into a consistent format that is easy to analyze and train on, and it removes noise and irrelevant information.

The preprocessing steps may include:
- Removing null values
- Removing duplicates
- Removing outliers
- Normalizing the data
- Encoding categorical variables
- Feature engineering

Depending on the specific dataset and the problem you are trying to solve, you may need to perform additional preprocessing steps or skip some steps entirely. The goal of this notebook is to give you a starting point for preprocessing your data, and you are encouraged to explore more advanced techniques on your own.

In [5]:
# Import necessary libraries
import pandas as pd

In [None]:
# Load the raw data
RAW_DATA_PATH = "../data/raw/raw_data.csv"
data = pd.read_csv(RAW_DATA_PATH)

In [None]:
# See the first few rows of the data
data.head()

In [None]:
# Check for common noise in the data such as null values and duplicates
print(f"Number of null values: {data.isnull().sum().sum()}\nNumber of duplicate rows: {data.duplicated().sum()}")

In [None]:
# Remove null values and duplicates
data = data.dropna()
data = data.drop_duplicates()
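
To make the effect of this step concrete, here is a minimal sketch on a small hypothetical frame (the `toy` data and its column names are assumptions for illustration, not part of the raw dataset). Note that `dropna()` drops every row containing at least one missing value, which can discard a lot of data if nulls are spread across many columns.

```python
import pandas as pd

# Hypothetical toy frame: one exact duplicate row and one row with a missing value
toy = pd.DataFrame({
    "age": [25, 25, None, 40],
    "city": ["Oslo", "Oslo", "Rome", "Lima"],
})

# Same two calls as in the cell above, chained
cleaned = toy.dropna().drop_duplicates()

print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Comparing the row counts before and after is a quick sanity check that the cleaning removed only what you expected.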

In [None]:
# Check for outliers in the data using box plots
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(data=data)
plt.show()

In [None]:
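# Remove outliers using the IQR method (numeric columns only; quantiles are undefined for text columns)
numeric = data.select_dtypes(include="number")
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
data = data[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]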
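
The IQR rule can also be wrapped in a small helper, which makes the threshold easy to adjust. The sketch below is one possible way to write it; the function name `remove_iqr_outliers`, the parameter `k`, and the `toy` data are hypothetical and chosen only for illustration.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any numeric column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # True wherever a value falls below the lower fence or above the upper fence
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return df[~outside.any(axis=1)]

# Hypothetical toy data: 100 is far from the other values
toy = pd.DataFrame({
    "value": [10, 12, 11, 13, 12, 100],
    "label": ["a", "b", "a", "b", "a", "b"],
})
filtered = remove_iqr_outliers(toy)
print(filtered)
```

For this toy column, Q1 is 11.25 and Q3 is 12.75, so the fences are 9.0 and 15.0 and only the row with 100 is dropped. The non-numeric `label` column is ignored when computing the fences but kept in the result.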

In [None]:
# Normalize the numeric columns using min-max scaling; categorical columns pass through unchanged
from sklearn.preprocessing import MinMaxScaler
numeric_cols = data.select_dtypes(include="number").columns
scaler = MinMaxScaler()
data_scaled = data.copy()
data_scaled[numeric_cols] = scaler.fit_transform(data[numeric_cols])
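
With its default settings, min-max scaling maps each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below reproduces that arithmetic with plain pandas on a hypothetical `toy` frame, purely to show what the transformation does; it is not a replacement for `MinMaxScaler`.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "group": ["x", "y", "x"],
})

numeric_cols = toy.select_dtypes(include="number").columns
col_min = toy[numeric_cols].min()
col_max = toy[numeric_cols].max()

# Apply (x - min) / (max - min) column by column; "group" is left untouched
scaled = toy.copy()
scaled[numeric_cols] = (toy[numeric_cols] - col_min) / (col_max - col_min)
print(scaled)
```

In this plain-pandas version a constant column would yield NaN, because `max - min` is zero. If you plan to train a model, consider fitting the scaler on the training split only and reusing it on the test split, so that test data does not influence the scaling.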

In [None]:
# Encode categorical variables using one-hot encoding
data_encoded = pd.get_dummies(data_scaled, columns=["categorical_feature1", "categorical_feature2"])
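
As an illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a hypothetical two-row frame. The column names `size` and `color` are assumptions standing in for your own features.

```python
import pandas as pd

# Hypothetical toy data: "color" is the categorical column to encode
toy = pd.DataFrame({
    "size": [1.0, 0.5],
    "color": ["red", "blue"],
})

# Each distinct value of "color" becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["color"])
print(encoded.columns.tolist())
```

The original `color` column is replaced by one indicator column per category, named `<column>_<value>`. For columns with many distinct values this can produce a very wide table, so it is worth checking the number of categories first.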

In [None]:
# Rename the columns, replacing spaces and hyphens with underscores
data_encoded.columns = data_encoded.columns.str.replace(" ", "_").str.replace("-", "_")
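
If column names may contain other special characters (parentheses, percent signs, and so on), a regular expression covers more cases than chained replacements. This is a sketch under that assumption; the helper name `sanitize` and the example column names are hypothetical.

```python
import re

import pandas as pd

def sanitize(name: str) -> str:
    """Replace each run of non-alphanumeric characters with a single underscore."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")

# Hypothetical column names containing spaces, hyphens, and symbols
toy = pd.DataFrame(columns=["unit price", "categorical_feature1_low-income", "rate (%)"])
toy.columns = [sanitize(c) for c in toy.columns]
print(list(toy.columns))
```

Two different original names can collapse to the same sanitized name, so it is worth checking `toy.columns.is_unique` after renaming.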

In [None]:
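# Save the processed data, creating the output directory if it does not exist
import os
PROCESSED_DATA_PATH = "../data/processed/processed_data.csv"
os.makedirs(os.path.dirname(PROCESSED_DATA_PATH), exist_ok=True)
data_encoded.to_csv(PROCESSED_DATA_PATH, index=False)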