Data Preprocessing

This notebook prepares the dataset for machine learning by:
- Separating features and target
- Encoding categorical variables
- Scaling numerical variables
- Splitting the data into training and test sets

In [1]:
# Load the dataset
import pandas as pd
df = pd.read_csv("../data/raw/student-mat.csv", sep=";")

# Recreate the target variable: a final grade (G3) below 10 marks a student as at risk
df['at_risk'] = (df['G3'] < 10).astype(int)
df.head()


,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,at_risk
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,3,4,1,1,3,6,5,6,6,1
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,3,1,1,3,4,5,5,6,1
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,3,2,2,3,3,10,7,8,10,0
3,GP,F,15,U,GT3,T,4,2,health,services,...,2,2,1,1,5,2,15,14,15,0
4,GP,F,16,U,GT3,T,3,3,other,other,...,3,2,1,2,5,4,6,10,10,0


Removing Leakage-Prone Columns
The column `G3` is the final grade and was used to create the target variable `at_risk`, so it must be removed from the input features to avoid **data leakage**. The earlier period grades `G1` and `G2` are dropped as well, since they are closely tied to `G3`.

In [2]:
# Drop columns that would cause data to leak
df_model = df.drop(columns=['G3','G1','G2'])
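A small guard can make the leakage rule explicit. The sketch below uses a made-up toy frame (`toy`, `leaky_cols`, and the values are illustrative assumptions, not the real data) and checks that no grade column survives the drop:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "G1": [5, 12, 8, 15],
    "G2": [6, 11, 9, 14],
    "G3": [6, 12, 9, 15],
    "absences": [6, 0, 10, 2],
})
toy["at_risk"] = (toy["G3"] < 10).astype(int)

leaky_cols = ["G1", "G2", "G3"]   # grade columns tied to the target
toy_model = toy.drop(columns=leaky_cols)

# Guard: none of the grade columns may remain in the modelling frame
remaining = set(leaky_cols) & set(toy_model.columns)
assert not remaining, f"Leaky columns still present: {remaining}"
print(toy_model.columns.tolist())
```

The same assertion could be run against `df_model` before the features and target are separated.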

In [3]:
# Separate features and target
# x: input features, y: target variable ('at_risk')
x = df_model.drop(columns=['at_risk'])
y = df_model['at_risk']

print("X Shape:", x.shape)
print("Y Shape:", y.shape)


X Shape: (395, 30)
Y Shape: (395,)


In [4]:
# Identifying categorical & numerical columns
categorical_cols = x.select_dtypes(include='object').columns.tolist()
numerical_cols = x.select_dtypes(exclude='object').columns.tolist()

print("Categorical columns:", categorical_cols)
print("Numerical columns:", numerical_cols)

Categorical columns: ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
Numerical columns: ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


In [5]:
print(len(categorical_cols), len(numerical_cols))

17 13


Identifying Feature Types

The dataset contains a mixture of categorical and numerical features.

- **Categorical features (17 columns):**
  - These represent qualitative attributes such as sex, school, parental job, and support indicators.
  - They will be converted into numerical form using **One-Hot Encoding**.

- **Numerical features (13 columns):**
  - These represent quantitative measurements such as age, number of absences, and study time.
  - They will be scaled using `StandardScaler` so that they have zero mean and unit variance.

In total, the feature set contains 30 input variables.

To ensure clean, reproducible, and leakage-free preprocessing, we will use:
- `ColumnTransformer` to apply different transformations to different columns.
- `Pipeline` to chain preprocessing steps together.

In [6]:
# Import Preprocessing Tools

from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.preprocessing import StandardScaler, OneHotEncoder   # StandardScaler scales numeric features; OneHotEncoder converts categories into numeric vectors
from sklearn.compose import ColumnTransformer   # applies different transformations to different columns
from sklearn.pipeline import Pipeline   # chains steps together safely


Defining Feature Transformers

Two transformers are defined:

- A numerical transformer that scales numerical features.
- A categorical transformer that one-hot encodes categorical features.

In [8]:
# Transformer for numerical features (a Pipeline applies its steps in sequence)
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# Transformer for categorical features: OneHotEncoder converts categories to numbers;
# handle_unknown='ignore' avoids errors if unseen categories appear in the test data
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
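To illustrate what `handle_unknown='ignore'` does, here is a minimal sketch on a toy `Mjob` column (the frames `seen` and `new`, and the choice of categories, are assumptions for illustration). A category that was not present during fitting is encoded as a row of zeros instead of raising an error:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on three known categories (toy data, not the real column)
seen = pd.DataFrame({"Mjob": ["teacher", "health", "other"]})
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(seen)

# "health" was seen during fit; "services" was not
new = pd.DataFrame({"Mjob": ["health", "services"]})
encoded = enc.transform(new).toarray()

print(enc.categories_[0].tolist())
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError` for the unseen category.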

Combining Transformers Using **ColumnTransformer**

The numerical and categorical transformers are combined into a single preprocessing object.


In [9]:
# Combine preprocessing for numerical & categorical features
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols),
])
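One reason a single preprocessing object helps with leakage is that it can be placed in front of an estimator inside one `Pipeline`, so that cross-validation refits the scaling and encoding on each training fold. The sketch below is only an illustration on toy data: the `LogisticRegression` model, the names `toy_X`, `toy_y`, `toy_clf`, and the values are assumptions, not part of this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with two numerical columns and one categorical column
toy_X = pd.DataFrame({
    "age": [15, 16, 17, 18, 15, 16, 17, 18, 15, 16, 17, 18],
    "absences": [0, 4, 10, 2, 6, 8, 1, 12, 3, 9, 0, 7],
    "school": ["GP", "GP", "MS", "MS", "GP", "MS",
               "GP", "MS", "GP", "MS", "GP", "MS"],
})
toy_y = pd.Series([0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1])

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "absences"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["school"]),
])

# Preprocessing and model chained together (the model choice is an assumption)
toy_clf = Pipeline(steps=[
    ("preprocess", toy_preprocessor),
    ("model", LogisticRegression()),
])

# Each fold fits the scaler and encoder on its own training portion only
scores = cross_val_score(toy_clf, toy_X, toy_y, cv=3)
print(scores)
```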

Splitting the Dataset

The data is split into:
- Training set: 80%
- Test set: 20%


In [10]:
# test_size=0.2 holds out 20% for testing, random_state=42 makes the split reproducible,
# and stratify=y preserves the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape: ", X_train.shape)
print("Test set shape: ", X_test.shape)

Training set shape:  (316, 30)
Test set shape:  (79, 30)
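The effect of `stratify` can be checked by comparing the share of the positive class in each split. A minimal sketch on synthetic labels (the 30/70 class mix and the names `toy_X`, `toy_y` are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30 of them in the positive class
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, both splits keep the overall positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

The same check on the real split would compare `y_train.mean()` and `y_test.mean()` with `y.mean()`.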


Applying the Preprocessing Pipeline

- Fit the preprocessing pipeline on the training data only, so that no test-set statistics leak into the scaling or encoding
- Transform both training and testing data with the fitted pipeline

In [12]:
# Fit preprocessor on training data and transform both sets
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("Processed training shape: ", X_train_processed.shape)
print("Processed test shape: ", X_test_processed.shape)

Processed training shape:  (316, 56)
Processed test shape:  (79, 56)


Saving the Processed Data

The processed datasets are saved so they can be reused for model training and evaluation.


In [13]:
import numpy as np

np.save("../data/preprocessed/X_train.npy", X_train_processed)
np.save("../data/preprocessed/X_test.npy", X_test_processed)
np.save("../data/preprocessed/y_train.npy", y_train.values)     # .values converts pandas Series → NumPy array
np.save("../data/preprocessed/y_test.npy", y_test.values)       # .values converts pandas Series → NumPy array
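`np.save` does not create missing directories, so the output folder has to exist first. The sketch below shows one way to create it and to reload an array later, using a temporary directory as a stand-in for `../data/preprocessed`. It also persists a fitted transformer with `joblib`, which is an optional extra not done in this notebook; the file names and the `demo` array are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Temporary folder standing in for ../data/preprocessed
out_dir = Path(tempfile.mkdtemp()) / "preprocessed"
out_dir.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing

# Save and reload a small demo array
demo = np.arange(6, dtype=float).reshape(3, 2)
np.save(out_dir / "X_demo.npy", demo)
loaded = np.load(out_dir / "X_demo.npy")

# Optionally keep the fitted transformer so new data can be processed the same way
scaler = StandardScaler().fit(demo)
joblib.dump(scaler, out_dir / "scaler.joblib")
restored = joblib.load(out_dir / "scaler.joblib")

print(loaded.shape, np.allclose(restored.mean_, scaler.mean_))
```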
