# STUDENT PERFORMANCE PREDICTION

## Task
Develop a machine learning pipeline to predict student performance using Random Forest and Gradient Boosting models on the "student_data.csv" dataset. The pipeline should include data loading, preprocessing, model training, and evaluation.

In [1]:
# Install the required libraries.
# The archive is extracted in a later cell, once it has been uploaded.
get_ipython().system('pip install pandas numpy scikit-learn xgboost matplotlib seaborn')


In [7]:
# uploading the file
from google.colab import files
uploaded = files.upload()

Saving archive.zip to archive.zip


In [8]:
# Unzip the uploaded archive
get_ipython().system('unzip archive.zip')

Archive:  archive.zip
  inflating: student_data.csv        


## LOADING THE DATASET

### Subtask:
Load the `student_data.csv` file into a pandas DataFrame.


**Reasoning**:
The first step is to load the data into a pandas DataFrame for further processing.



In [9]:
import pandas as pd

df = pd.read_csv('student_data.csv')
display(df.head())

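|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |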


## PREPROCESSING THE DATA

### Subtask:
Handle categorical features and split the data into training and testing sets.


**Reasoning**:
Identify categorical features, apply one-hot encoding, separate features and target, and split the data into training and testing sets.



In [10]:
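from sklearn.model_selection import train_test_split

# One-hot encode every string-typed (categorical) column.
# drop_first=True drops one level per feature to avoid redundant dummy columns.
categorical_cols = df.select_dtypes(include=['object']).columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# G3 is the prediction target; every other column is used as a feature.
X = df_encoded.drop('G3', axis=1)
y = df_encoded['G3']

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)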

Shape of X_train: (316, 41)
Shape of X_test: (79, 41)
Shape of y_train: (316,)
Shape of y_test: (79,)
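
To make the encoding step concrete, here is a minimal sketch on a made-up five-row frame. The column names mimic the dataset, but the values and the `toy` variable are assumptions for illustration only. It shows which dummy columns `drop_first=True` keeps and how a 20% test split rounds on a tiny sample.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: names resemble the dataset, values are invented.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F"],
    "Mjob": ["teacher", "other", "health", "other", "teacher"],
    "age": [18, 17, 15, 16, 15],
    "G3": [6, 10, 15, 10, 12],
})

# Categorical columns are listed explicitly here to keep the sketch short.
cat_cols = ["sex", "Mjob"]

# drop_first=True removes the first level of each feature
# ("F" for sex, "health" for Mjob), which becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)
print(list(encoded.columns))

# Separate the target from the features, then hold out 20% of the rows.
X_toy = encoded.drop(columns="G3")
y_toy = encoded["G3"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

A three-level feature such as `Mjob` yields two dummy columns, which is why the encoded frame is wider than the original but not by the full number of category levels.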


## MODEL TRAINING

### Subtask:
Train Random Forest and Gradient Boosting models on the training data.


**Reasoning**:
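Import the necessary classes and train a Random Forest regressor and a Gradient Boosting regressor (XGBoost's `XGBRegressor`) on the training data. Both models use their default hyperparameters, with only `random_state` fixed for reproducibility.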



In [11]:
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rf_model = RandomForestRegressor(random_state=42)
xgb_model = XGBRegressor(random_state=42)

rf_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

## EVALUATING

### Subtask:
Evaluate the performance of both models on the testing data.


**Reasoning**:
Evaluate the performance of both models on the testing data using MSE and R-squared metrics.



In [12]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

y_pred_xgb = xgb_model.predict(X_test)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"Random Forest - MSE: {mse_rf:.4f}, R-squared: {r2_rf:.4f}")
print(f"Gradient Boosting - MSE: {mse_xgb:.4f}, R-squared: {r2_xgb:.4f}")

Random Forest - MSE: 3.7977, R-squared: 0.8148
Gradient Boosting - MSE: 4.5518, R-squared: 0.7780
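
As a rough illustration of what the two metrics measure, the sketch below computes MSE and R-squared by hand on four made-up values (the helper names `mse` and `r2` and the sample numbers are assumptions, not part of the notebook). It also converts the reported MSE values to RMSE, which is expressed in the same units as `G3`.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Invented example values, not taken from the dataset.
y_true_demo = [6, 10, 15, 10]
y_pred_demo = [7, 9, 14, 12]

demo_mse = mse(y_true_demo, y_pred_demo)
demo_r2 = r2(y_true_demo, y_pred_demo)
print(f"MSE: {demo_mse:.4f}, R-squared: {demo_r2:.4f}")

# RMSE is the square root of MSE; applied to the MSE values reported above.
rmse_rf = math.sqrt(3.7977)
rmse_xgb = math.sqrt(4.5518)
print(f"RMSE (Random Forest): {rmse_rf:.2f}")
print(f"RMSE (Gradient Boosting): {rmse_xgb:.2f}")
```

Read this way, the Random Forest's typical error is a little under two grade points and the Gradient Boosting model's is a little over two.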


## SAVING AND DOWNLOADING THE MODELS

The following cell saves both trained models to disk with the `pickle` library.

In [13]:
import pickle
import os

# Create a directory to save the models if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save the Random Forest model
with open('models/rf_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

# Save the Gradient Boosting model
with open('models/xgb_model.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)

print("Models saved successfully in the 'models' directory.")

Models saved successfully in the 'models' directory.


The models have been saved as `rf_model.pkl` and `xgb_model.pkl` in the `models` directory.
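
A saved model is only useful if it can be loaded again. The sketch below shows one way to check a pickle round trip. It is an illustration under assumptions: it trains a small stand-in model on synthetic data and writes to a temporary directory rather than reading the notebook's `models/rf_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 50 rows, 3 numeric features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_demo, y_demo)

# Write the model to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

Two practical caveats apply to any pickle file: only unpickle files from a trusted source, and load them with the same library versions that were used to save them.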

To download the files, you can use the file browser on the left sidebar in Colab, navigate to the `models` directory, right-click on each file, and select "Download".

Alternatively, the following code downloads the files programmatically:

In [15]:
from google.colab import files

files.download('models/rf_model.pkl')
files.download('models/xgb_model.pkl')


## SUMMARY OF PROJECT

### Data Analysis Key Findings

*   Categorical features in the dataset were successfully handled using one-hot encoding, resulting in a transformed dataset `df_encoded`.
*   The data was split into training and testing sets, with 80% (316 rows) allocated for training and 20% (79 rows) held out for testing.
*   Both a Random Forest Regressor and an XGBoost Regressor model were successfully trained on the preprocessed training data.
*   The Random Forest model achieved a Mean Squared Error (MSE) of 3.7977 and an R-squared score of 0.8148 on the test set.
*   The Gradient Boosting model achieved a Mean Squared Error (MSE) of 4.5518 and an R-squared score of 0.7780 on the test set.

### Insights or Next Steps

*   On this single train/test split, the Random Forest model outperformed the Gradient Boosting model (MSE 3.7977 vs. 4.5518; R-squared 0.8148 vs. 0.7780). Both models were trained with default hyperparameters, so tuning could improve the performance of either one.
*   Investigating feature importance from the trained models could provide insights into which student attributes are most influential in predicting performance.
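
The two next steps above could be sketched roughly as follows. This is a hedged illustration on synthetic data, not a result from the notebook: the frame `X_demo`, the target `y_demo`, and the small parameter grid are all assumptions chosen to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): the target depends mostly on "G2".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(120, 3)), columns=["G2", "absences", "age"]
)
y_demo = (
    3.0 * X_demo["G2"]
    + 0.5 * X_demo["absences"]
    + rng.normal(scale=0.1, size=120)
)

# A deliberately small, illustrative grid; real searches are usually wider.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)

# Impurity-based importances of the refitted best model, largest first.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook itself, the same pattern would use `X_train` and `y_train` in place of the synthetic data, keeping the test set untouched until the final evaluation.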
