# **Prediction model**

## Objectives

* Build a model to predict students' performance (pass or fail) from the available data.

## Inputs

* Cleaned data saved in [edu_data_cleaned.csv](https://github.com/8osco/academic_performance_analysis/blob/main/data/inputs/cleaned/edu_data_cleaned.csv)
* Initial analyses performed in [1_data_etl.ipynb](https://github.com/8osco/academic_performance_analysis/blob/main/jupyter_notebooks/1_data_etl.ipynb)
* Exploratory data analysis performed in [2_exploratory_analysis.ipynb](https://github.com/8osco/academic_performance_analysis/blob/main/jupyter_notebooks/2_exploratory_analysis.ipynb)
* Hypothesis testing performed in [3_hypothesis_testing.ipynb](https://github.com/8osco/academic_performance_analysis/blob/main/jupyter_notebooks/3_hypothesis_testing.ipynb)


## Outputs

* Prediction model and key drivers.

# 1 Import packages


Import relevant packages required for data analysis and visualisation.

In [1]:
# Import NumPy, Pandas, Matplotlib, Seaborn and Plotly
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 2 Data extract and familiarisation

Read in the cleaned CSV file and get familiar with its structure at a high level, using various DataFrame methods:

In [2]:
# Load the dataset and display the first few rows
df = pd.read_csv('../data/inputs/cleaned/edu_data_cleaned.csv')
df.head()

| | gender | nationality | place_of_birth | education_stage | grade | classroom_id | subject | semester | parent_involved | raised_hands | resource_visits | announcements_viewed | discussion_participation | parent_answered_survey | parent_school_satisfaction | absence_category | pass_fail_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | kuwait | kuwait | lowerschool | g-04 | a | it | first | father | 15 | 16 | 2 | 20 | yes | good | low | pass |
| 1 | male | kuwait | kuwait | lowerschool | g-04 | a | it | first | father | 20 | 20 | 3 | 25 | yes | good | low | pass |
| 2 | male | kuwait | kuwait | lowerschool | g-04 | a | it | first | father | 10 | 7 | 0 | 30 | no | bad | high | fail |
| 3 | male | kuwait | kuwait | lowerschool | g-04 | a | it | first | father | 30 | 25 | 5 | 35 | no | bad | high | fail |
| 4 | male | kuwait | kuwait | lowerschool | g-04 | a | it | first | father | 40 | 50 | 12 | 50 | no | bad | high | pass |


In [3]:
# Display the shape of the DataFrame
df.shape

(478, 17)

In [4]:
# Display the column data types and check for null or missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   gender                      478 non-null    object
 1   nationality                 478 non-null    object
 2   place_of_birth              478 non-null    object
 3   education_stage             478 non-null    object
 4   grade                       478 non-null    object
 5   classroom_id                478 non-null    object
 6   subject                     478 non-null    object
 7   semester                    478 non-null    object
 8   parent_involved             478 non-null    object
 9   raised_hands                478 non-null    int64 
 10  resource_visits             478 non-null    int64 
 11  announcements_viewed        478 non-null    int64 
 12  discussion_participation    478 non-null    int64 
 13  parent_answered_survey      478 non-null    object

# 3 Model building

The goal is to predict whether a student will pass or fail based on the independent variables available in the dataset. Because the target variable is categorical (pass or fail), we will build a classification model.

Several classification models could be used; we will start with logistic regression.

Step 1 - data readiness check

We have already converted student performance into pass and fail categories in the data cleaning section (refer to 1_data_etl.ipynb), so we can move on to the next steps.

We could also consider feature selection to improve model performance, e.g. dropping the grade column, since the education_stage column may be sufficiently representative and is more generic for modelling purposes. We hold off making changes at this stage and can examine later how feature selection affects model performance. We should also remind ourselves that one data record has a potential typo/misclassification in grade, which has not been treated yet.

Feature scaling (e.g. normalisation) is not required, as the comparison in 1_data_etl.ipynb indicated that the value ranges are similar amongst the numerical variables.
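As a rough illustration of this readiness check, the sketch below compares the value ranges of the numerical columns and looks at the class balance of the target. The `sample` DataFrame is a hypothetical stand-in with invented values, not the real dataset; only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the cleaned dataset; the values
# are invented for illustration and are not taken from the real file.
sample = pd.DataFrame({
    "raised_hands": [15, 20, 10, 30, 40, 70],
    "resource_visits": [16, 20, 7, 25, 50, 80],
    "announcements_viewed": [2, 3, 0, 5, 12, 40],
    "discussion_participation": [20, 25, 30, 35, 50, 60],
    "pass_fail_status": ["pass", "pass", "fail", "fail", "pass", "pass"],
})

num_cols = sample.select_dtypes(include="number").columns

# Minimum, maximum and range per numerical column: broadly similar ranges
# suggest that scaling is unlikely to change the picture much.
ranges = sample[num_cols].agg(["min", "max"]).T
ranges["range"] = ranges["max"] - ranges["min"]
print(ranges)

# Share of each class in the target, to spot any strong imbalance.
class_share = sample["pass_fail_status"].value_counts(normalize=True)
print(class_share)
```

The same two checks could be run on `df` directly; the toy frame is used here only so that the example is self-contained.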

Step 2 - split the data into train and test sets

In [5]:
# import train_test_split function to split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['pass_fail_status'], axis=1),  # features
    df['pass_fail_status'],                 # target
    test_size=0.2,     # split 80% for training and 20% for testing
    random_state=101   # set random state for reproducibility
)
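The split above is purely random. One optional variant, which this notebook does not use, is a stratified split that keeps the pass/fail proportions the same in the training and test sets. The sketch below shows the idea on hypothetical toy data (`toy` and its values are invented); applying it to `df` would change the results reported later.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a 70/30 pass/fail split.
toy = pd.DataFrame({
    "raised_hands": range(100),
    "pass_fail_status": ["pass"] * 70 + ["fail"] * 30,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="pass_fail_status"),
    toy["pass_fail_status"],
    test_size=0.2,
    random_state=101,
    stratify=toy["pass_fail_status"],  # keep class proportions in both sets
)

# Both sets should show the same 70/30 proportions as the full toy data.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Stratification tends to matter most when one class is rare or the dataset is small, since a random split can then leave the test set with very few examples of the minority class.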

Step 3 - encoding categorical variables for the machine learning model

As we will be using machine learning tools, the categorical variables need to be numerically encoded.

We did not do this in the data cleaning section because, as mentioned there, we prioritised keeping the column values descriptive.

We will use feature_engine to encode the variables automatically and create a Scikit-learn pipeline that runs both the encoding and the logistic regression model. One-hot encoding is chosen because it is commonly used with logistic regression and does not impose an artificial ordering on the categories.

In [6]:
from sklearn.pipeline import Pipeline
from feature_engine.encoding import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Define categorical variables, excluding pass_fail_status as it is the target variable
cat_cols = X_train.select_dtypes(include='object').columns.tolist()

# Create pipeline for one-hot encoding and logistic regression model
pipe = Pipeline([
    ('ohe', OneHotEncoder(variables=cat_cols, drop_last=True)), # one-hot encoding for categorical variables, dropping the last category as it is redundant
    ('model', LogisticRegression(max_iter=1000)) # logistic regression model with increased max_iter for convergence
])
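To make concrete what the encoding step does, here is a minimal sketch that uses `pandas.get_dummies` as a stand-in for the feature_engine encoder. The `toy` frame is hypothetical. Note that pandas drops the first category of each variable, whereas the pipeline above drops the last; the principle of keeping k-1 indicator columns for k categories is the same.

```python
import pandas as pd

# Hypothetical toy frame; the category labels mirror the kind of values
# seen in the dataset, but the rows are invented.
toy = pd.DataFrame({
    "semester": ["first", "second", "first", "second"],
    "absence_category": ["low", "high", "high", "low"],
})

# Full one-hot encoding: one 0/1 column per category.
full = pd.get_dummies(toy, dtype=int)

# k-1 encoding: one category per variable is dropped because it is implied
# when all remaining indicator columns for that variable are 0.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one category per variable avoids perfectly collinear indicator columns, which keeps the logistic regression coefficients easier to interpret.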

Step 4 - fit the model and evaluate its performance

In [7]:
# import classification report, confusion matrix, and accuracy score for model evaluation
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Fit the pipeline on the training data
pipe.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipe.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred)) # print accuracy score
print("\nClassification Report:\n", classification_report(y_test, y_pred)) # print classification report
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred)) # print confusion matrix

Accuracy: 0.9375

Classification Report:
               precision    recall  f1-score   support

        fail       0.83      0.96      0.89        26
        pass       0.98      0.93      0.96        70

    accuracy                           0.94        96
   macro avg       0.91      0.95      0.92        96
weighted avg       0.94      0.94      0.94        96


Confusion Matrix:
 [[25  1]
 [ 5 65]]
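The figures in the classification report can be reproduced by hand from the confusion matrix, which is a useful sanity check on how to read it. The sketch below assumes scikit-learn's default layout: rows are true classes, columns are predicted classes, and labels are sorted alphabetically (fail, then pass).

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are
# predicted classes, both in the order [fail, pass].
cm = np.array([[25, 1],
               [5, 65]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # correct / column totals (predicted)
recall = np.diag(cm) / cm.sum(axis=1)     # correct / row totals (true)
f1 = 2 * precision * recall / (precision + recall)

for label, p, r, f in zip(["fail", "pass"], precision, recall, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the model catches 25 of the 26 failing students in the test set (recall 0.96 for fail), while 5 of the 30 students it flags as failing actually passed (precision 0.83 for fail).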


Find the sweet spot....