02 – DATA PREPROCESSING
 
The purpose of this notebook is to clean and prepare the raw data so it’s suitable for analysis and modeling. This step includes loading the dataset, checking for issues like missing values or duplicates, and making sure the data types are correct. It also covers encoding categorical variables, scaling features, and preparing the train/test split.

1. Load the Datasets
   
Instead of loading a single file, this step loads every CSV file in the data/raw/ folder into a dictionary of pandas DataFrames. Each file is read in chunks of 5,000 rows that are then concatenated, and the resulting DataFrame is stored under its filename (without the .csv extension) as the key. This makes it easy to access and inspect several datasets side by side.

In [1]:
import pandas as pd
import os
import glob

# Path to the raw data folder
raw_data_path = 'data/raw/'

# Get a list of all CSV files in the folder (sorted, so the load order is deterministic)
csv_files = sorted(glob.glob(os.path.join(raw_data_path, '*.csv')))

# Create a dictionary of DataFrames with the filename (without extension) as the key
dataframes = {}

# Function to load CSV files in chunks
def load_csv_in_chunks(file_path, chunk_size=5000):
    chunk_list = []
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        chunk_list.append(chunk)
    return pd.concat(chunk_list, axis=0)

# Loop through all CSV files
for file in csv_files:
    print(f"Processing {file}")
    name = os.path.splitext(os.path.basename(file))[0]  # filename without path or extension
    df = load_csv_in_chunks(file, chunk_size=5000)
    dataframes[name] = df
    print(f"Loaded {name} with shape {df.shape}")
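
Once the dictionary is built, each dataset can be picked out by its key. The sketch below illustrates the same filename-as-key pattern on a throwaway folder; the file names (orders, customers) and their contents are made up for illustration and are not part of this project's data.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with two tiny CSV files (hypothetical names and contents)
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders.csv"), index=False
)
pd.DataFrame({"id": [1], "name": ["Ana"]}).to_csv(
    os.path.join(tmp_dir, "customers.csv"), index=False
)

# Same pattern as the loading cell: one DataFrame per file, keyed by filename
demo_frames = {
    os.path.splitext(os.path.basename(path))[0]: pd.read_csv(path)
    for path in sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
}

# List what was loaded, then pick one dataset explicitly by its key
for key, frame in demo_frames.items():
    print(f"{key}: {frame.shape[0]} rows, {frame.shape[1]} columns")

orders = demo_frames["orders"]
```

Selecting a dataset by key is safer than relying on whichever DataFrame was assigned last inside a loading loop.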



2. Initial Checks: Data Types, Missing Values, and Duplicates
This step checks for missing values, incorrect data types, and duplicate records, all common issues that need to be addressed before proceeding.

In [None]:
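# NOTE: `df` is the last DataFrame assigned in the loading loop above;
# select a specific dataset from `dataframes` if a different one is needed.

# Overview of the dataset
df.info()

# Count missing values per column (printed, since only a cell's last expression is displayed)
print(df.isnull().sum())

# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")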
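
The three checks above can also be gathered into one small report per dataset. The following is a minimal sketch on toy data; the helper name `summarize_quality` and the columns `age` and `city` are hypothetical and only serve the illustration.

```python
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and missing share."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "pct_missing": (frame.isnull().mean() * 100).round(1),
    })

# Toy data (hypothetical) with one missing value and one duplicated row
toy = pd.DataFrame({
    "age": [25, 32, None, 32],
    "city": ["Lima", "Quito", "Lima", "Quito"],
})

report = summarize_quality(toy)
n_duplicates = int(toy.duplicated().sum())
print(report)
print(f"Duplicate rows: {n_duplicates}")
```

With several datasets loaded, the same helper could be called in a loop over the dictionary of DataFrames.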

3. Handle Missing Values
Rows with a missing value in the target column are dropped, since they can’t be used for training. Missing values in numerical columns are filled with the column median, which is less sensitive to outliers than the mean.

In [None]:
# Drop rows where the target is missing
df = df.dropna(subset=['target_column'])

# Fill missing values in numerical columns
df['numerical_column'] = df['numerical_column'].fillna(df['numerical_column'].median())
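
When several numerical columns have gaps, the same median fill can be applied to all of them at once. This is a sketch under assumptions: the column names (`income`, `age`, `target`) are placeholders, and it assumes every numerical feature should be imputed the same way.

```python
import pandas as pd

# Toy data (hypothetical column names); 'target' stands in for the target column
toy = pd.DataFrame({
    "income": [30.0, None, 50.0, 1000.0],
    "age": [20.0, 40.0, None, 60.0],
    "target": [1.0, 0.0, 1.0, None],
})

# Drop rows with a missing target first, so they do not influence the medians
toy = toy.dropna(subset=["target"])

# Fill every numerical feature column with its own median
feature_cols = toy.select_dtypes(include="number").columns.drop("target")
toy[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].median())

print(toy)
```

Dropping the unusable rows before computing the medians keeps the extreme income value in the discarded row from shifting the fill value.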

4. Fix Data Types
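Some columns are not stored in the correct format for processing. Converting them to the appropriate data types ensures compatibility with later transformations and modeling steps. Note that errors='coerce' replaces any value that cannot be parsed as a number with NaN, so missing values should be re-checked after this conversion.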

In [None]:
# Convert values to numeric
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
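
Because coercion silently turns unparseable entries into NaN, it can help to count how many values were affected. A small sketch on a made-up `price` column (the name and values are illustrative only):

```python
import pandas as pd

# Toy column (hypothetical) stored as strings, with two entries that are not numbers
toy = pd.DataFrame({"price": ["10.5", "7", "n/a", "12", "unknown"]})

missing_before = toy["price"].isnull().sum()
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")
missing_after = toy["price"].isnull().sum()

# Values that failed to parse are now NaN and need the same handling as other gaps
n_coerced = int(missing_after - missing_before)
print(f"Values coerced to NaN: {n_coerced}")
print(toy["price"].dtype)
```

If the count is unexpectedly high, the raw strings are worth inspecting before any imputation is applied.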

5. Remove Duplicates
Duplicate rows are removed to avoid bias and redundancy during analysis or training.

In [None]:
# Drop duplicates
df = df.drop_duplicates()
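
By default, drop_duplicates only removes rows that match in every column. If duplicates should instead be defined by a key column, the `subset` and `keep` parameters can be used. The sketch below uses a hypothetical `customer_id` key on toy data to show the difference.

```python
import pandas as pd

# Toy data (hypothetical): row 2 repeats row 0 exactly; row 3 repeats only the id
toy = pd.DataFrame({
    "customer_id": [1, 2, 1, 2],
    "amount": [9.99, 5.00, 9.99, 7.50],
})

# Exact duplicates: every column must match
exact = toy.drop_duplicates()

# Key-based duplicates: keep the first row seen for each customer_id
by_key = toy.drop_duplicates(subset=["customer_id"], keep="first")

print(len(toy), len(exact), len(by_key))
```

Which definition is appropriate depends on the dataset; key-based removal discards rows that differ in other columns, so it should be a deliberate choice.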

6. Encode Categorical Variables
Categorical columns are one-hot encoded to convert them into a numerical format suitable for machine learning models.

In [None]:
# One-hot encode selected categorical column
df = pd.get_dummies(df, columns=['categorical_column'])
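
To make the effect of one-hot encoding concrete, here is a sketch on a toy `color` column (a hypothetical name). It also shows two optional get_dummies parameters, `dtype` and `drop_first`, which are not used in the cell above.

```python
import pandas as pd

# Toy data (hypothetical column names and categories)
toy = pd.DataFrame({"color": ["red", "green", "red", "blue"], "size": [1, 2, 3, 4]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)

# drop_first=True removes one category per column to avoid redundant indicators
encoded_reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(list(encoded.columns))
print(list(encoded_reduced.columns))
```

Dropping the first category is mainly useful for linear models, where a full set of indicators is perfectly collinear with the intercept.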

7. Scale Numerical Features
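Numerical features are scaled with StandardScaler, which rescales each column to zero mean and unit variance so that features measured on larger scales do not dominate scale-sensitive models. Note that fitting the scaler on the full dataset before splitting lets test-set statistics leak into training; for a strict evaluation, the scaler should be fitted on the training split only.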

In [None]:
from sklearn.preprocessing import StandardScaler

# Apply standard scaling
scaler = StandardScaler()
df[['numerical_column1', 'numerical_column2']] = scaler.fit_transform(df[['numerical_column1', 'numerical_column2']])

8. Split Dataset into Train/Test
The cleaned and prepared dataset is split into training and testing sets to allow model validation on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop('target_column', axis=1)
y = df['target_column']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
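
One possible way to combine the split with scaling, without letting test data influence the scaler, is to split first and fit StandardScaler on the training portion only. The sketch below uses synthetic data with hypothetical column names; `stratify=y` is an optional addition that assumes a classification target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical column names) standing in for the cleaned dataset
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature_a": rng.normal(50, 10, size=100),
    "feature_b": rng.normal(0, 1, size=100),
    "target": rng.integers(0, 2, size=100),
})

X = toy.drop("target", axis=1)
y = toy["target"]

# stratify=y keeps the class proportions similar in both splits (classification only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Learn the scaling parameters from the training split only, then reuse them
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The test split is transformed with the training mean and standard deviation, so its scaled columns will generally not have exactly zero mean or unit variance; that is expected.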
