# Preparing Data

In this section, we prepared the dataset for training using a preprocessing pipeline. We scaled the numeric features with a standard scaler, one-hot encoded the nominal variables, and ordinal-encoded the ordered categorical variables.

## Preparing the Environment

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

In [2]:
# Import user-defined modules
import sys
from pathlib import Path
SRC_DIR = Path.cwd().parent / "src"
sys.path.append(str(SRC_DIR))

import data_utils

# Set global variables
RAW_DATA_DIR, PROCESSED_DATA_DIR = data_utils.get_data_directories()

Data directories successfully set.


We listed the numeric, nominal, and ordinal features, along with the category order for each ordinal feature. These columns come from the dataset we cleaned during the exploratory data analysis.

In [3]:
# Define columns
numeric_features = ['age', 'study_hours_per_day', 'attendance_percentage', 
                    'sleep_hours', 'exercise_frequency', 'mental_health_rating', 
                    'total_screen_time']

nominal_features = ['gender', 'part_time_job', 'internet_quality', 'extracurricular_participation']

ordinal_features = ['diet_quality', 'parental_education_level']
ordinal_categories = [
    ['Poor', 'Fair', 'Good'],  # diet_quality
    ['No education', 'High School', 'Bachelor', 'Master']  # parental_education_level
]

## Creating Pipelines and Transformer

We used a standard scaler because the exploratory data analysis showed that our numeric features follow an approximately normal distribution. For the nominal features, we used a one-hot encoder: it is a straightforward way to encode categorical data, and it suits these features because each has only a few unique values. For the ordinal features, diet quality and parental education level, we used an ordinal encoder with an explicit category order, so that higher-ranked categories map to larger values. We did not add an imputation step because missing values were already handled when we cleaned the data.

In [4]:
# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Nominal pipeline
nominal_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Ordinal pipeline
ordinal_transformer = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(categories=ordinal_categories))
])

# Combine all
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('nom', nominal_transformer, nominal_features),
    ('ord', ordinal_transformer, ordinal_features)
])
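
To make the encoder behavior described above concrete, here is a minimal sketch on toy data (the rows and the `"Unknown"` value are made up for illustration and are not taken from the project dataset). It shows that an `OrdinalEncoder` with an explicit category list assigns larger values to higher-ranked categories, and that a `OneHotEncoder` with `handle_unknown='ignore'` encodes a category it never saw during fitting as an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data for illustration only (not the project dataset)
toy_diet = pd.DataFrame({"diet_quality": ["Good", "Poor", "Fair", "Poor"]})

# Explicit order: Poor < Fair < Good
ordinal = OrdinalEncoder(categories=[["Poor", "Fair", "Good"]])
encoded = ordinal.fit_transform(toy_diet)
print(encoded.ravel().tolist())  # [2.0, 0.0, 1.0, 0.0]

# Fit the one-hot encoder on two known categories
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(pd.DataFrame({"gender": ["Female", "Male"]}))

# "Unknown" is a hypothetical value that was not present during fitting
unseen = onehot.transform(pd.DataFrame({"gender": ["Unknown"]})).toarray()
print(unseen.tolist())  # [[0.0, 0.0]]
```

Note that an `OrdinalEncoder` configured this way raises an error by default if it meets a value outside the listed categories, so the category lists need to cover every value in the cleaned data.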

We used this `ColumnTransformer` to scale and encode the features in the cleaned dataset. We dropped the `exam_score` column first because it is not a feature but the label we will try to predict in later training runs.

In [5]:
cleaned_df = pd.read_csv(f"{PROCESSED_DATA_DIR}/cleaned_dataset.csv")

X = cleaned_df.drop(columns=['exam_score'])

# Fit + transform
X_preprocessed = preprocessor.fit_transform(X)

We then saved this preprocessed dataset as `scaled_encoded_global_dataset.csv` for future use.

In [6]:
# Get feature names
feature_names = preprocessor.get_feature_names_out()

# Convert to DataFrame
X_preprocessed_df = pd.DataFrame(X_preprocessed, columns=feature_names)

# Save to CSV
data_utils.save_dataset_to_csv(X_preprocessed_df, "scaled_encoded_global_dataset")

Data directories successfully set.
Dataset saved in /home/asimov/Projects/lifestyle_learning/data/preprocessed


A copy of the transformer is also available in `data_utils`, so the logic does not have to be reimplemented: the same pipeline works for every dataset cluster derived from the original cleaned dataset.

In [7]:
# data_utils.preprocessor produces the same output as the
# ColumnTransformer defined above (element-wise comparison)
X_preprocessed == data_utils.preprocessor(X)

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]], shape=(1000, 19))
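
The element-wise comparison above prints a truncated array, so it does not directly confirm that every entry matches. One possible way to reduce such a check to a single boolean is sketched below on toy arrays (not the project data); it also shows why `np.allclose` can be a safer choice than `==` for floating-point output, since values that differ only by rounding error compare as unequal under exact equality.

```python
import numpy as np

# Toy arrays for illustration: 0.1 + 0.2 differs from 0.3 by rounding error
a = np.array([[0.1 + 0.2, 1.0], [2.0, 3.0]])
b = np.array([[0.3, 1.0], [2.0, 3.0]])

exact_match = bool((a == b).all())   # exact element-wise equality
close_match = bool(np.allclose(a, b))  # equality within a small tolerance

print(exact_match)  # False
print(close_match)  # True
```

Applied to this notebook, the same idea would be `np.allclose(X_preprocessed, data_utils.preprocessor(X))`, assuming both results are dense NumPy arrays as the output above suggests.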