# Student Performance Prediction

This project uses a dataset describing students' study habits, attendance, parental involvement, and other factors that influence academic success. The purpose of this project is to develop a model that accurately predicts the students' final exam scores.

The dataset was published by a Kaggle user (https://www.kaggle.com/datasets/lainguyn123/student-performance-factors). We will use Python with machine learning algorithms, preprocessing steps, and pipelines to build the model.

## Import Libraries

In [124]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
from matplotlib import pyplot as plt 
import seaborn as sns
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error, r2_score 
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

## Load and Inspect Data 

In [125]:
df = pd.read_csv('StudentPerformanceFactors.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

| | Hours_Studied | Attendance | Parental_Involvement | Access_to_Resources | Extracurricular_Activities | Sleep_Hours | Previous_Scores | Motivation_Level | Internet_Access | Tutoring_Sessions | Family_Income | Teacher_Quality | School_Type | Peer_Influence | Physical_Activity | Learning_Disabilities | Parental_Education_Level | Distance_from_Home | Gender | Exam_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 84 | Low | High | No | 7 | 73 | Low | Yes | 0 | Low | Medium | Public | Positive | 3 | No | High School | Near | Male | 67 |
| 1 | 19 | 64 | Low | Medium | No | 8 | 59 | Low | Yes | 2 | Medium | Medium | Public | Negative | 4 | No | College | Moderate | Female | 61 |
| 2 | 24 | 98 | Medium | Medium | Yes | 7 | 91 | Medium | Yes | 2 | Medium | Medium | Public | Neutral | 4 | No | Postgraduate | Near | Male | 74 |
| 3 | 29 | 89 | Low | Medium | Yes | 8 | 98 | Medium | Yes | 1 | Medium | Medium | Public | Negative | 4 | No | High School | Moderate | Male | 71 |
| 4 | 19 | 92 | Medium | Medium | Yes | 6 | 65 | Medium | Yes | 3 | Medium | High | Public | Neutral | 4 | No | College | Near | Female | 70 |


The dataset contains 6607 rows and 20 columns (19 features plus the 'Exam_Score' target) with information on student performance factors. It seems that 'Teacher_Quality', 'Distance_from_Home' and 'Parental_Education_Level' are the only columns that have null values.
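One way to confirm which columns contain nulls is to count them per column. The sketch below uses a small hypothetical frame (`toy`) in place of the real dataset; the values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset
toy = pd.DataFrame({
    'Hours_Studied': [23, 19, 24, 29],
    'Teacher_Quality': ['Medium', np.nan, 'High', 'Medium'],
    'Distance_from_Home': ['Near', 'Moderate', np.nan, np.nan],
})

# Count missing values per column, then keep only the columns that have any
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

Running the same two lines on `df` should list every column with missing values along with how many rows are affected.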

Here's a summary of all the columns:

- **Hours_Studied**:	Number of hours spent studying per week.
- **Attendance**:	Percentage of classes attended.
- **Parental_Involvement**:	Level of parental involvement in the student's education (Low, Medium, High).
- **Access_to_Resources**:	Availability of educational resources (Low, Medium, High).
- **Extracurricular_Activities**:	Participation in extracurricular activities (Yes, No).
- **Sleep_Hours**:	Average number of hours of sleep per night.
- **Previous_Scores**:	Scores from previous exams.
- **Motivation_Level**:	Student's level of motivation (Low, Medium, High).
- **Internet_Access**:	Availability of internet access (Yes, No).
- **Tutoring_Sessions**:	Number of tutoring sessions attended per month.
- **Family_Income**:	Family income level (Low, Medium, High).
- **Teacher_Quality**:	Quality of the teachers (Low, Medium, High).
- **School_Type**:	Type of school attended (Public, Private).
- **Peer_Influence**:	Influence of peers on academic performance (Positive, Neutral, Negative).
- **Physical_Activity**:	Average number of hours of physical activity per week.
- **Learning_Disabilities**:	Presence of learning disabilities (Yes, No).
- **Parental_Education_Level**:	Highest education level of parents (High School, College, Postgraduate).
- **Distance_from_Home**:	Distance from home to school (Near, Moderate, Far).
- **Gender**:	Gender of the student (Male, Female).
- **Exam_Score**:	Final exam score.

## Data Cleaning and Preparation

### Dealing with null values 

We will use backward fill to replace the null values in the 'Teacher_Quality', 'Distance_from_Home' and 'Parental_Education_Level' columns.

In [126]:
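# Assign the result back rather than calling bfill(inplace=True) on a column
# selection, which may not modify df in newer pandas versions
df['Teacher_Quality'] = df['Teacher_Quality'].bfill()
df['Distance_from_Home'] = df['Distance_from_Home'].bfill()
df['Parental_Education_Level'] = df['Parental_Education_Level'].bfill()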
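Backward fill copies the next valid value upward into each gap, so a null in the last row has nothing to copy from and stays null. A minimal sketch on a hypothetical Series; the forward-fill fallback at the end is an assumption for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a gap in the middle and one at the very end
s = pd.Series(['High', np.nan, 'Low', np.nan], name='Teacher_Quality')

# Backward fill: each NaN takes the next non-null value below it
back_filled = s.bfill()
print(back_filled.tolist())

# The trailing NaN survives bfill; a forward fill afterwards would close it
fully_filled = back_filled.ffill()
print(fully_filled.tolist())
```

Checking `df.isnull().sum()` after the fill is a quick way to see whether any trailing nulls remain.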

## Feature Engineering

In [127]:
features = df.drop(columns = ['Exam_Score']).columns
target = ['Exam_Score']
X = df[features]
y = df[target]

cat_cols = X.select_dtypes(include = 'object').columns
num_cols = X.select_dtypes(include = 'int').columns

preprocessor = ColumnTransformer(
    transformers = [
        ('num_vals', StandardScaler(), num_cols),
        # `sparse_output` replaces the `sparse` argument, which was deprecated in scikit-learn 1.2
        ('cat_vals', OneHotEncoder(sparse_output = False, drop = 'first'), cat_cols)
    ]
)

In [128]:
# Apply the transformations to the full dataset (before splitting; used only for the feature-importance check below)
X_preprocessed = preprocessor.fit_transform(X)
X_preprocessed = pd.DataFrame(X_preprocessed, columns=preprocessor.get_feature_names_out())

# Split the data into train and test sets
x_train_processed, x_test_processed, y_train_processed, y_test_processed = train_test_split(X_preprocessed, y, test_size=0.2, random_state=0)

In [129]:
# Initialize the DecisionTreeRegressor
rgr = DecisionTreeRegressor(criterion = 'squared_error')

# Fit the model to the training data
rgr.fit(x_train_processed, y_train_processed)

In [130]:
# Get feature importances
importances = rgr.feature_importances_

# Create a DataFrame to view feature importances
feature_importances = pd.DataFrame({'feature': x_train_processed.columns, 'importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='importance', ascending=False)

# Print all features ranked by importance
print(feature_importances)

                                            feature  importance
1                              num_vals__Attendance    0.375035
0                           num_vals__Hours_Studied    0.242800
3                         num_vals__Previous_Scores    0.103598
4                       num_vals__Tutoring_Sessions    0.034364
2                             num_vals__Sleep_Hours    0.031008
5                       num_vals__Physical_Activity    0.027029
6                cat_vals__Parental_Involvement_Low    0.023485
8                 cat_vals__Access_to_Resources_Low    0.018865
18                     cat_vals__School_Type_Public    0.016921
10         cat_vals__Extracurricular_Activities_Yes    0.014043
14                      cat_vals__Family_Income_Low    0.012281
9              cat_vals__Access_to_Resources_Medium    0.012093
25                cat_vals__Distance_from_Home_Near    0.012013
21              cat_vals__Learning_Disabilities_Yes    0.011464
15                   cat_vals__Family_In

As shown above, the numerical columns have higher importance than the categorical columns. Attendance has the highest importance of all the features, followed by hours studied and previous scores.
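To put a number on the numerical versus categorical comparison, the importances can be summed by transformer prefix, since `get_feature_names_out()` names each feature `<transformer>__<column>`. A minimal sketch with made-up importance values; `toy_importances` is hypothetical and does not reproduce the table above.

```python
import pandas as pd

# Hypothetical importances; names follow the '<transformer>__<column>' pattern
toy_importances = pd.DataFrame({
    'feature': ['num_vals__Attendance', 'num_vals__Hours_Studied',
                'cat_vals__School_Type_Public', 'cat_vals__Family_Income_Low'],
    'importance': [0.5, 0.3, 0.15, 0.05],
})

# The text before '__' identifies the transformer (numeric or categorical)
toy_importances['group'] = toy_importances['feature'].str.split('__').str[0]

# Total importance contributed by each group of columns
group_totals = toy_importances.groupby('group')['importance'].sum()
print(group_totals)
```

Applied to `feature_importances`, the same grouping would show what share of the tree's total importance comes from each column type.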

## Model Selection and Evaluation

### Decision Tree Regressor 

In [131]:
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

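# Initializing DecisionTreeRegressor (dtr was not defined in the earlier cells)
dtr = DecisionTreeRegressor()

pipeline_dtr = Pipeline([('preprocessing', preprocessor), ('regressor', dtr)])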

In [132]:
# Fit the pipeline on the training data and predict on test set
pipeline_dtr.fit(x_train, y_train)

# Predicting y values from test data
y_pred = pipeline_dtr.predict(x_test)

#Pipeline score
train_score = pipeline_dtr.score(x_train, y_train)
test_score = pipeline_dtr.score(x_test, y_test)
print(f'Decision Tree Regressor Train Score: {train_score}')
print(f'Decision Tree Regressor Test Score: {test_score}')

Decision Tree Regressor Train Score: 1.0
Decision Tree Regressor Test Score: 0.14996692164505254
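
A training score of 1.0 next to a test score near 0.15 suggests the unrestricted tree is memorizing the training rows. The sketch below illustrates the effect on synthetic data (not the student dataset) by comparing an unrestricted tree with a depth-limited one; `max_depth=3` is an arbitrary illustrative value, not a tuned setting.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Unrestricted tree: grows until every training sample is fit exactly
deep_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

# Depth-limited tree: cannot memorize individual samples
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xd_train, yd_train)

for name, model in [('deep', deep_tree), ('shallow', shallow_tree)]:
    print(name, model.score(Xd_train, yd_train), model.score(Xd_test, yd_test))
```

The gap between the train and test columns is the quantity to watch; which depth works best on the student data would need to be checked separately.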


### Random Forest Regressor

In [133]:
# Initializing RandomForestRegressor
rfr = RandomForestRegressor()

pipeline_rfr = Pipeline([('preprocessing', preprocessor), ('regressor', rfr)])

In [137]:
# Fit the pipeline on the training data and predict on test set
pipeline_rfr.fit(x_train, y_train.values.ravel())

# Predicting y values from test data
y_pred = pipeline_rfr.predict(x_test)

#Pipeline score
train_score = pipeline_rfr.score(x_train, y_train)
test_score = pipeline_rfr.score(x_test, y_test)
print(f'Random Forest Regressor Train Score: {train_score}')
print(f'Random Forest Regressor Test Score: {test_score}')

Random Forest Regressor Train Score: 0.9495235797971369
Random Forest Regressor Test Score: 0.5786620051459379


### Gradient Boosting Regressor

In [135]:
# Initializing GradientBoostingRegressor
gbr = GradientBoostingRegressor()

pipeline_gbr = Pipeline([('preprocessing', preprocessor), ('regressor', gbr)])

In [138]:
# Fit the pipeline on the training data and predict on test set
pipeline_gbr.fit(x_train, y_train.values.ravel())

# Predicting y values from test data
y_pred = pipeline_gbr.predict(x_test)

# Pipeline score (use pipeline_gbr here, not pipeline_rfr)
train_score = pipeline_gbr.score(x_train, y_train)
test_score = pipeline_gbr.score(x_test, y_test)
print(f'Gradient Boosting Regressor Train Score: {train_score}')
print(f'Gradient Boosting Regressor Test Score: {test_score}')

The output recorded for this cell (train 0.9495235797971369, test 0.5786620051459379) was computed with `pipeline_rfr` and is identical to the Random Forest scores above, so it does not reflect the Gradient Boosting model.
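
Each cell in this section computes `y_pred` but reports only `.score()`, which for scikit-learn regressors is the R² value. The imported `mean_squared_error` and `r2_score` could be applied to the predictions to also report the error in score units. A minimal sketch on synthetic data (not the student dataset); `demo_gbr` and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_gbr = GradientBoostingRegressor(random_state=0).fit(Xd_train, yd_train)
yd_pred = demo_gbr.predict(Xd_test)

# RMSE is in the same units as the target; R^2 matches what .score() returns
rmse = np.sqrt(mean_squared_error(yd_test, yd_pred))
r2 = r2_score(yd_test, yd_pred)
print(f'RMSE: {rmse:.3f}, R^2: {r2:.3f}')
```

On the student data, the RMSE would read directly as a typical error in exam score points.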


## Hyperparameter Tuning