## 8. Observations

- Numeric features are standardized (mean 0, variance 1).  
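- Categorical features are one-hot encoded with `drop="first"`, so a feature with k categories yields k - 1 dummy columns.  
- The final feature matrix has 41 columns, up from the 32 original features: 15 numeric columns plus 26 dummy columns from the 17 categorical features.  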
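
The first two observations can be checked on a small frame. The sketch below is illustrative only: the `toy` data, `num_cols`, and `cat_cols` are hypothetical and are not taken from the student dataset, and `sparse_threshold=0` is set just to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the student dataset)
toy = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "absences": [0, 2, 4, 10],
    "school": ["GP", "GP", "MS", "MS"],
    "Mjob": ["teacher", "other", "health", "other"],
})
num_cols = ["age", "absences"]
cat_cols = ["school", "Mjob"]

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(drop="first"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)
Xt = ct.fit_transform(toy)

# Numeric columns come first because "num" is listed first
num_part = Xt[:, :len(num_cols)]
print("Means close to 0:", np.allclose(num_part.mean(axis=0), 0))
print("Std devs close to 1:", np.allclose(num_part.std(axis=0), 1))

# 2 numeric + (2 - 1) school dummies + (3 - 1) Mjob dummies
print("Transformed shape:", Xt.shape)
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `std`, so the check above uses matching definitions.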


## 7. Inspect Transformed Features


In [13]:
# Extract the fitted OneHotEncoder from the preprocessor
ohe = preprocessor.named_transformers_["cat"]

# Get encoded feature names for categorical columns
ohe_features = ohe.get_feature_names_out(categorical_cols)

# Combine numeric + encoded categorical features
all_features = numeric_cols + list(ohe_features)

print("Total features after preprocessing:", len(all_features))
print("First 20 features:", all_features[:20])


Total features after preprocessing: 41
First 20 features: ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'school_MS', 'sex_M', 'address_U', 'famsize_LE3', 'Pstatus_T']


## 6. Fit and Transform Sample


In [14]:
# 1. Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Build the preprocessor
preprocessor = build_preprocessor(numeric_cols, categorical_cols)

# 3. Fit the preprocessor
preprocessor.fit(X_train)

# 4. Transform a small sample
X_train_transformed = preprocessor.transform(X_train.head())
print("Transformed shape:", X_train_transformed.shape)


Transformed shape: (5, 41)


In [15]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

def build_preprocessor(numeric_cols, categorical_cols):
    """
    Build a ColumnTransformer for preprocessing:
    - Standardize numeric columns
    - One-hot encode categorical columns
    """
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), categorical_cols),
        ]
    )




## 5. Build Preprocessor


In [16]:
numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()

print("Numeric columns:", numeric_cols[:10], "... total:", len(numeric_cols))
print("Categorical columns:", categorical_cols[:10], "... total:", len(categorical_cols))


Numeric columns: ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc'] ... total: 15
Categorical columns: ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup'] ... total: 17
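
Selecting by the exact dtypes `int64` and `float64` works for this dataset, but it would skip columns stored as, say, `int32`. A slightly more general sketch, on a hypothetical frame, uses the `"number"` selector and treats everything else as categorical:

```python
import pandas as pd

# Hypothetical frame with a mix of integer widths
toy = pd.DataFrame({
    "age": pd.Series([15, 16], dtype="int32"),
    "absences": pd.Series([0, 4], dtype="int64"),
    "school": ["GP", "MS"],
})

# Exact dtype match misses the int32 column
narrow = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# "number" matches every numeric dtype
broad = toy.select_dtypes(include="number").columns.tolist()

# Everything non-numeric is treated as categorical in this sketch
other = toy.select_dtypes(exclude="number").columns.tolist()

print("Exact dtypes:", narrow)
print("All numeric:", broad)
print("Non-numeric:", other)
```

Comparing `len(broad) + len(other)` against the number of columns in the frame is a quick way to confirm that no column was left out of both lists.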


## 4. Identify Numeric and Categorical Columns


In [17]:
y = df["G3"]
X = df.drop("G3", axis=1)

print("Features shape:", X.shape)
print("Target shape:", y.shape)


Features shape: (395, 32)
Target shape: (395,)


## 3. Split Features and Target


In [18]:
mat, por = load_data()
df = mat.copy()

print("Dataset shape:", df.shape)
df.head()


2025-09-20 20:13:56,734 [INFO] Math dataset loaded successfully with shape (395, 33)
2025-09-20 20:13:56,735 [INFO] Portuguese dataset loaded successfully with shape (649, 33)


Dataset shape: (395, 33)


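|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |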


## 2. Load Data


In [19]:
import os, sys
import pandas as pd

# Add ../src to Python path
sys.path.append(os.path.abspath(os.path.join("..", "src")))

from data_loader import load_data
from preprocessing import build_preprocessor
from sklearn.model_selection import train_test_split



# 02 Feature Engineering

This notebook demonstrates preprocessing for the student performance dataset.  
We identify feature types (numeric vs categorical), build transformations,  
and preview the transformed feature matrix.
