# Students Performance Prediction - Model Training and Evaluation

**Goals:**
* Split the dataset into training and testing sets
* Build and fit the models
* Tune hyperparameters
* Evaluate the models
* Save the final model

## Importing libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor
)
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score
)

In [2]:
students = pd.read_csv("../data/processed/clean_data.csv")
students.head()

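| | Hours_Studied | Attendance | Sleep_Hours | Previous_Scores | Tutoring_Sessions | Physical_Activity | Exam_Score | Parental_Involvement_encoded | Access_to_Resources_encoded | Extracurricular_Activities_encoded | Motivation_Level_encoded | Internet_Access_encoded | Family_Income_encoded | Teacher_Quality_encoded | School_Type_encoded | Peer_Influence_encoded | Learning_Disabilities_encoded | Parental_Education_Level_encoded | Distance_from_Home_encoded | Gender_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 84 | 7 | 73 | 0 | 3 | 67 | 1 | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 2 | 0 | 1 | 2 | 1 |
| 1 | 19 | 64 | 8 | 59 | 2 | 4 | 61 | 1 | 2 | 0 | 1 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 24 | 98 | 7 | 91 | 2 | 4 | 74 | 2 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | 0 | 2 | 2 | 1 |
| 3 | 29 | 89 | 8 | 98 | 1 | 4 | 71 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 0 | 0 | 1 | 1 | 1 |
| 4 | 19 | 92 | 6 | 65 | 3 | 4 | 70 | 2 | 2 | 1 | 2 | 1 | 2 | 0 | 1 | 1 | 0 | 0 | 2 | 0 |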


## Train and test split

In [3]:
X = students.drop('Exam_Score', axis=1)
y = students['Exam_Score']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Training and evaluation without hyperparameter tuning

#### Training Linear Regression Model

In [5]:
base_model = LinearRegression()
base_model.fit(X_train, y_train)

| Parameter | Value |
|------|-----|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


#### Training RandomForestRegressor Model

In [6]:
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

rf_model.fit(X_train, y_train)

| Parameter | Value |
|------|-----|
| n_estimators | 100 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 1.0 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


#### Training GradientBoostingRegressor Model

In [7]:
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)

| Parameter | Value |
|------|-----|
| loss | 'squared_error' |
| learning_rate | 0.1 |
| n_estimators | 100 |
| subsample | 1.0 |
| criterion | 'friedman_mse' |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_depth | 3 |
| min_impurity_decrease | 0.0 |


#### Prediction and evaluation of models

In [8]:
y_pred_base = base_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)

In [10]:
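def evaluate(y_true, y_pred):
    """Return (MAE, RMSE, R²) for the given true and predicted values."""
    mae = mean_absolute_error(y_true, y_pred)
    # RMSE is the square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2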
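
To make the three metrics concrete, the following sketch recomputes them by hand with NumPy on four toy values. The arrays are made up for easy arithmetic and are not taken from the dataset; the formulas are the standard definitions that the scikit-learn functions above implement.

```python
import numpy as np

# Toy values chosen for easy arithmetic (assumption: not from the dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_pred                      # [1, 0, -1, -2]

mae = np.mean(np.abs(residuals))                 # (1 + 0 + 1 + 2) / 4 = 1.0
rmse = np.sqrt(np.mean(residuals ** 2))          # sqrt(6 / 4) = sqrt(1.5)

ss_res = np.sum(residuals ** 2)                  # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20
r2 = 1.0 - ss_res / ss_tot                       # 1 - 6/20 = 0.7

print(mae, rmse, r2)
```

MAE and RMSE are in the same units as `Exam_Score`, while R² is the fraction of the target's variance that the predictions account for.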

In [11]:
results = []
# Test-set predictions of each untuned model, keyed by display name
predictions = {
    "Linear Regression": y_pred_base,
    "Random Forest": y_pred_rf,
    "Gradient Boosting": y_pred_gb
}

for name, preds in predictions.items():
    mae, rmse, r2 = evaluate(y_test, preds)
    results.append([name, mae, rmse, r2])

results_df = pd.DataFrame(
    results,
    columns=["Model", "MAE", "RMSE", "R2"]
)
results_df

| | Model | MAE | RMSE | R2 |
|---|------|-----|------|----|
| 0 | Linear Regression | 1.064214 | 2.28368 | 0.664387 |
| 1 | Random Forest | 1.182837 | 2.42818 | 0.620572 |
| 2 | Gradient Boosting | 0.849548 | 2.181163 | 0.693843 |


## Training and evaluation of Models with Hyperparameter Tuning

In [12]:
from sklearn.model_selection import GridSearchCV

#### Ridge Regression model

In [13]:
from sklearn.linear_model import Ridge

ridge = Ridge()

param_grid_ridge = {
    'alpha': [0.01, 0.1, 1, 10, 100]
}

ridge_grid = GridSearchCV(
    ridge,
    param_grid_ridge,
    cv=5,
    scoring='r2'
)

ridge_grid.fit(X_train, y_train)

best_ridge = ridge_grid.best_estimator_

In [14]:
ridge_preds = best_ridge.predict(X_test)
evaluate(y_test, ridge_preds)

(1.064534013713852, 2.283908355436834, 0.6643202082197928)
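
The notebook does not display which `alpha` the grid search selected for Ridge. A minimal sketch of how that could be inspected is shown below; it runs on synthetic data from `make_regression` as a stand-in for `X_train` and `y_train`, and the names `X_demo`, `y_demo`, `demo_grid`, and `cv_table` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption: not the notebook's data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

alphas = [0.01, 0.1, 1, 10, 100]
demo_grid = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="r2")
demo_grid.fit(X_demo, y_demo)

# The alpha with the highest mean cross-validated R², and that score
print(demo_grid.best_params_)
print(demo_grid.best_score_)

# Per-candidate summary: one row per alpha in the grid
cv_table = pd.DataFrame(demo_grid.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score"]
]
print(cv_table)
```

In the notebook itself, the same attributes are available on `ridge_grid`. If the mean scores are almost flat across `alpha`, that would fit the observation that regularization changes little here.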

#### Random Forest Regressor Model

In [15]:
rf = RandomForestRegressor(random_state=42)

param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

In [16]:
rf_grid = GridSearchCV(
    rf,
    param_grid_rf,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

best_rf = rf_grid.best_estimator_

In [17]:
rf_grid.best_params_

{'max_depth': 20,
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 200}

In [18]:
rf_preds = best_rf.predict(X_test)
evaluate(y_test, rf_preds)

(1.137242279533423, 2.3610744676568585, 0.6412538738656766)

#### GradientBoostingRegressor Model

In [19]:
gbr = GradientBoostingRegressor(random_state=42)

param_grid_gbr = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0]
}

In [20]:
gbr_grid = GridSearchCV(
    gbr,
    param_grid_gbr,
    cv=5,
    scoring='r2'
)

gbr_grid.fit(X_train, y_train)

best_gbr = gbr_grid.best_estimator_

In [21]:
gbr_grid.best_params_

{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1.0}

In [22]:
gbr_preds = best_gbr.predict(X_test)
evaluate(y_test, gbr_preds)

(0.714752154994543, 2.1232317638086955, 0.7098900289025668)

In [23]:
results = []

results.append(['Ridge Regression', *evaluate(y_test, ridge_preds)])
results.append(['Random Forest (Tuned)', *evaluate(y_test, rf_preds)])
results.append(['Gradient Boosting (Tuned)', *evaluate(y_test, gbr_preds)])

tuned_results_df = pd.DataFrame(
    results,
    columns=['Model', 'MAE', 'RMSE', 'R2']
)

tuned_results_df

| | Model | MAE | RMSE | R2 |
|---|------|-----|------|----|
| 0 | Ridge Regression | 1.064534 | 2.283908 | 0.66432 |
| 1 | Random Forest (Tuned) | 1.137242 | 2.361074 | 0.641254 |
| 2 | Gradient Boosting (Tuned) | 0.714752 | 2.123232 | 0.70989 |


## Final Model Comparison & Conclusions

The models were evaluated on the held-out test set (20% of the data) **before and after hyperparameter tuning** using MAE, RMSE, and R².

---

### Before Hyperparameter Tuning
| Model | MAE | RMSE | R² |
|------|-----|------|----|
| Linear Regression | 1.06 | 2.28 | 0.66 |
| Random Forest | 1.18 | 2.43 | 0.62 |
| Gradient Boosting | 0.85 | 2.18 | 0.69 |

**Observations:**
- Gradient Boosting had the best score on all three metrics even before tuning.
- Random Forest with default settings scored below Linear Regression on all three metrics.
- Linear Regression is a strong baseline: its R² (0.66) is within 0.03 of untuned Gradient Boosting (0.69).

---

### After Hyperparameter Tuning
| Model | MAE | RMSE | R² |
|------|-----|------|----|
| Ridge Regression | 1.06 | 2.28 | 0.66 |
| Random Forest (Tuned) | 1.14 | 2.36 | 0.64 |
| Gradient Boosting (Tuned) | **0.71** | **2.12** | **0.71** |
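
The effect of tuning can be quantified from the two tables. The sketch below copies the unrounded test-set metrics from the result DataFrames above and computes the percentage change in MAE and RMSE and the absolute gain in R². The helper `relative_change` is a hypothetical name introduced for this illustration, and Linear Regression and Ridge are left out because they are different estimators rather than a tuned and untuned pair.

```python
# Test-set metrics copied from the two result tables in this notebook
untuned = {
    "Random Forest": {"MAE": 1.182837, "RMSE": 2.42818, "R2": 0.620572},
    "Gradient Boosting": {"MAE": 0.849548, "RMSE": 2.181163, "R2": 0.693843},
}
tuned = {
    "Random Forest": {"MAE": 1.137242, "RMSE": 2.361074, "R2": 0.641254},
    "Gradient Boosting": {"MAE": 0.714752, "RMSE": 2.123232, "R2": 0.70989},
}


def relative_change(before, after):
    """Percentage change from before to after (negative means a decrease)."""
    return 100.0 * (after - before) / before


changes = {}
for model in untuned:
    changes[model] = {
        metric: relative_change(untuned[model][metric], tuned[model][metric])
        for metric in ("MAE", "RMSE")
    }
    # R² is already a proportion, so report the absolute gain instead
    changes[model]["R2_gain"] = tuned[model]["R2"] - untuned[model]["R2"]
    print(model, changes[model])
```

For Gradient Boosting this gives an MAE reduction of roughly 16%, while the RMSE reduction is under 3%, so tuning mainly reduced the typical error rather than the largest ones.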

---

### Model-wise Interpretation

#### **Ridge Regression**
- Performance is nearly identical to Linear Regression (R² 0.664 for both).
- Regularization does not measurably improve results.
- This is consistent with low multicollinearity among the features, although the notebook does not test for it directly.
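
One way to examine the multicollinearity mentioned above is the variance inflation factor (VIF). The sketch below is an illustration on synthetic data, not part of the original analysis; it uses the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The function name `variance_inflation_factors` and the demo arrays are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF per column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic data (assumption: for illustration only)
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)               # independent of x1
x3 = x1 + 0.1 * rng.normal(size=500)    # nearly a copy of x1
X_demo = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(X_demo)
print(vif)  # x1 and x3 get large values, x2 stays close to 1
```

Applied to the notebook's data, this could be called as `variance_inflation_factors(X_train.to_numpy())`. Values near 1 for every feature would support the low-multicollinearity reading; a common rule of thumb treats values above 5 to 10 as a warning sign. A constant column has an undefined correlation and would need to be dropped first.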

#### **Random Forest (Tuned)**
- Improves slightly on the untuned Random Forest (R² 0.62 → 0.64, MAE 1.18 → 1.14).
- Still underperforms Gradient Boosting, and remains below the linear models.
- Suggests that bagging is less effective than boosting for this dataset.

#### **Gradient Boosting (Tuned)**
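- Achieves the **lowest MAE and RMSE**.
- Achieves the **highest R² (≈ 0.71)**.
- Tuning reduced MAE from 0.85 to 0.71; the selected parameters differ from the defaults only in `n_estimators` (200 instead of 100).
- Its lead over the linear models suggests it captures some non-linear relationships that they miss.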

---

### Final Model Selection

**Gradient Boosting Regressor (Tuned)** is selected as the **final model** because:
- It produces the most accurate predictions on the test set (lowest MAE and RMSE).
- It explains the most variance in exam scores (highest R²).
- It outperforms the other models both before and after tuning.

This model is therefore chosen for **deployment in the Streamlit application**.
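
The comparison above rests on a single 80/20 split. One way to check whether a model ranking is stable is to score each candidate with k-fold cross-validation and look at the spread across folds. The sketch below shows the pattern on synthetic data from `make_regression`; the data, the `candidates` dictionary, and the chosen settings are assumptions for illustration, so the printed scores say nothing about the student dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (assumption: not the notebook's data)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=42
)

# Shuffled 5-fold split, fixed seed so every model sees the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=200, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

In the notebook, the same loop could be run with `X_train`, `y_train`, and the tuned estimators. A gap between models that is larger than the fold-to-fold standard deviation would be stronger evidence than a single test score.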

---

### Next Step
A pipeline built with the tuned Gradient Boosting hyperparameters is refit on a subset of the features, saved, and will be loaded in the Streamlit app for real-time student performance prediction.

## Saving the model

In [25]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib

# Hyperparameters selected by the grid search above
best_params = gbr_grid.best_params_

# Numeric features are passed through unchanged
num_cols = [
    'Hours_Studied', 'Attendance', 'Sleep_Hours',
    'Previous_Scores', 'Tutoring_Sessions', 'Physical_Activity'
]

# Integer-coded categorical features are one-hot encoded
cat_cols = [
    'Gender_encoded', 'School_Type_encoded', 'Parental_Involvement_encoded',
    'Internet_Access_encoded', 'Motivation_Level_encoded'
]

# Columns not listed above are dropped (ColumnTransformer defaults to
# remainder="drop"), so this pipeline uses 11 of the 19 features
preprocessor = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    # random_state=42 matches the estimator used during the grid search
    ("model", GradientBoostingRegressor(**best_params, random_state=42))
])

# Refit on the training split, then persist the whole pipeline
pipeline.fit(X_train, y_train)

joblib.dump(pipeline, "../model.pkl")

['../model.pkl']
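
The saved file can later be restored with `joblib.load`. The following sketch shows the round trip on a small synthetic pipeline of the same shape; the data values, the temporary file name `demo_model.pkl`, and the names `demo_pipeline` and `new_rows` are hypothetical, and only two of the notebook's column names are used to keep the example short.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic frame reusing two of the notebook's column names (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Hours_Studied": rng.integers(1, 40, size=60),
    "Gender_encoded": rng.integers(0, 2, size=60),
})
target = 50 + 0.5 * demo["Hours_Studied"] + rng.normal(scale=1.0, size=60)

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", "passthrough", ["Hours_Studied"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender_encoded"]),
    ])),
    ("model", GradientBoostingRegressor(n_estimators=20, random_state=42)),
])
demo_pipeline.fit(demo, target)

# Save to a temporary directory, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_pipeline, path)
    loaded = joblib.load(path)

# New rows must be a DataFrame with the column names used at fit time
new_rows = pd.DataFrame({"Hours_Studied": [10, 30], "Gender_encoded": [0, 1]})
preds_original = demo_pipeline.predict(new_rows)
preds_loaded = loaded.predict(new_rows)
print(preds_loaded)
```

For the notebook's pipeline, the input DataFrame would need the columns listed in `num_cols` and `cat_cols`. Because that pipeline is fit on fewer features than the model tuned above, calling `evaluate(y_test, pipeline.predict(X_test))` before deployment would report its own test-set metrics.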