# Predicting Student Performance in Mathematics

The objective of this project is to build and compare three binary classifiers for predicting student performance in Mathematics, using data collected from two public schools in Portugal during the 2005/06 school year. The dataset was retrieved from the UCI Machine Learning Repository (Cortez & Silva, 2008). The descriptive features comprise 5 numeric, 17 nominal and 10 ordinal features.
The target feature, G3, is a numeric variable that records the final grade. Its values range from 0 to 20, where 0 is the lowest grade and 20 is full marks. For binary classification, G3 values greater than or equal to 10 are labelled "Pass" and all other values "Fail".
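
The pass/fail rule can be sketched as follows. The grades are made up for illustration and `g3` is a hypothetical name; the CSV loaded later in this notebook already contains the binarised `target` column.

```python
import numpy as np

# Hypothetical final grades on the 0-20 scale (not taken from the dataset)
g3 = np.array([0, 9, 10, 15, 20])

# Grades of 10 or above count as "pass"; everything below is "fail"
labels = np.where(g3 >= 10, "pass", "fail")
print(labels.tolist())
```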

## Outline:
- [Section 1 (Overview)](#1)
- [Section 2 (Data Preparation)](#2)  
- [Section 3 (Hyperparameter Tuning)](#3) 
- [Section 4 (Performance Comparison)](#4) 
- [Section 5 (Summary)](#5) 
- [Section 6 (References)](#6) 

# Overview <a class="anchor" id="1"></a> 

## Methodology

I build three classification models, K-Nearest Neighbors (KNN), Decision Trees (DT) and Naive Bayes (NB), to predict whether students pass or fail in Mathematics.

I start by transforming the dataset. The categorical features are encoded as numerical features, and all descriptive features are scaled using Min-Max Scaling. The dataset is then partitioned at a 70:30 ratio into training and test sets.

Then, given the large number of columns of descriptive features after transformation, applying feature selection could be beneficial before fitting the model. I select the top 10 features by Random Forest Importance and F-Score. Then, I compare the performance of these two feature selection methods and continue with the better one for further model fitting.

After feature selection, I tune the hyperparameters of each model in a pipeline, using repeated stratified 5-fold cross-validation on the training data, once with the full feature set and once with only the top 10 features selected in the previous stage.

Stratification is necessary throughout the model fitting and selection as the binary target classes are imbalanced.
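
As a minimal sketch of what stratification does, the example below splits toy labels with a 2:1 class imbalance and checks that both partitions keep that ratio. The arrays are synthetic placeholders, not the student data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 100 "pass" (1) and 50 "fail" (0); values are illustrative
y = np.array([1] * 100 + [0] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Both partitions keep the original 2:1 pass/fail ratio
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the class proportions in a small test set can drift away from those of the full data purely by chance.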

In the end, I fit the best models identified from the hyperparameter search on the test data with a 5-fold repeated stratified cross-validation and compare the model performance by a paired t-test to determine if these models yield any significant differences. The comparison is initially based on the metric area under curve (AUC), and I integrate other evaluation metrics, such as recall, precision, and F1-score, for a comprehensive and in-depth comparison.

# Data Preparation <a class="anchor" id="2"></a> 

## Loading Dataset

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
np.random.seed(999)

In [None]:
math = pd.read_csv('../input/matcsv/mat.csv')
math.head()

In [None]:
print(math.shape)
math.columns.values

The dataset contains 395 observations. The numeric target feature "G3" has been renamed as "target" and transformed into a binary categorical feature with two levels "pass" and "fail".

## Checking for Missing Values

In [None]:
math.isna().sum()

## Summary Statistics

In [None]:
math.describe(include='all')

## Encoding Categorical Features

All categorical features must be encoded as numerical features, because scikit-learn estimators require numeric input.

Before encoding the target feature, the descriptive features and the target feature need to be partitioned. 

In [None]:
data = math.drop(columns='target')
target = math['target']

### Target Feature

The target classes are imbalanced: there are roughly twice as many "pass" observations as "fail" observations.

The positive target feature level "pass" is encoded as "1". 

In [None]:
print(target.value_counts())
target = target.replace({'pass':1,'fail':0})
target.value_counts()

### Categorical Descriptive Features

There are two types of categorical descriptive features in the dataset.

#### Nominal:

13 out of the 17 nominal features have only two levels, therefore they are simply encoded into a single column of 0 and 1. The remaining 4 features have more than two levels, therefore applying one-hot-encoding is necessary as it can create a binary column for each unique value under these multi-level nominal features.

1. sex: binary - female or male
2. school: binary - Gabriel Pereira or Mousinho da Silveira
3. address: binary - urban or rural
4. Pstatus: binary - living together or apart
5. Mjob: 5 levels
6. Fjob: 5 levels
7. guardian: 3 levels
8. famsize: binary - ≤ 3 or > 3
9. reason: 4 levels
10. schoolsup: binary - yes or no
11. famsup: binary - yes or no
12. activities: binary - yes or no
13. paidclass: binary - yes or no
14. internet: binary - yes or no
15. nursery: binary - yes or no
16. higher: binary - yes or no
17. romantic: binary - yes or no
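
The two encoding strategies described above can be illustrated on a toy frame. The rows are invented and `toy` is a hypothetical name; the notebook applies the same idea to `data` in the cells that follow.

```python
import pandas as pd

# Toy frame with one binary and one multi-level nominal feature (values are illustrative)
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Mjob": ["teacher", "health", "other"],
})

# Binary feature: a single 0/1 column is enough (the first level, "F", becomes 0)
toy["sex"] = pd.get_dummies(toy["sex"], drop_first=True, dtype=int).iloc[:, 0]

# Multi-level feature: one indicator column per level
toy = pd.get_dummies(toy, dtype=int)
print(toy)
```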

In [None]:
nominal_cols = data.columns[data.dtypes==object].tolist()
nominal_cols

#### Ordinal:

The ordinal categorical features have been encoded into numbers in the original dataset and therefore there is no need to further transform them. The numbers under each ordinal categorical feature are meaningful. For example, under the feature "Medu" (mother's education level), 0 is "none"; 1 is "primary education"; 2 is "5th to 9th grade"; 3 is "secondary education"; 4 is "higher education". The larger the number, the higher the education level.

1. Medu: 0 to 4; the larger the number, the higher the education level
2. Fedu: 0 to 4; the larger the number, the higher the education level
3. famrel: 1 to 5; the larger the number, the higher the quality of family relationship
4. traveltime: 1 to 4; the larger the number, the longer the travel time to school
5. studytime: 1 to 4; the larger the number, the longer the weekly study time
6. freetime: 1 to 5; the larger the number, the more free time after school
7. goout: 1 to 5; the larger the number, the more frequent going out with friends
8. Walc: 1 to 5; the larger the number, the more weekend alcohol consumption
9. Dalc: 1 to 5; the larger the number, the more workday alcohol consumption
10. health: 1 to 5; the larger the number, the healthier

In [None]:
for col in nominal_cols:
    # Binary nominal features become a single 0/1 integer column
    if data[col].nunique() == 2:
        data[col] = pd.get_dummies(data[col], drop_first=True, dtype=int).iloc[:, 0]

In [None]:
data = pd.get_dummies(data)

After performing one-hot encoding on those 4 nominal features, the number of descriptive feature columns in the dataset increases from 32 to 45.

## Feature Scaling

Scaling the descriptive features is beneficial because it brings variables with different ranges onto a common scale and can help speed up processing in the algorithm.
Min-Max Scaling is applied to scale each descriptive feature to the range from 0 to 1. Binary features remain binary after scaling.
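
A minimal sketch of the Min-Max formula on two toy columns, written in plain NumPy rather than with `MinMaxScaler`, shows why a 0/1 column is left unchanged. The values are illustrative only.

```python
import numpy as np

# Two toy columns: a numeric feature (for example, age) and a 0/1 binary feature
X = np.array([[15.0, 0.0],
              [18.0, 1.0],
              [22.0, 0.0]])

# Min-Max Scaling: x' = (x - min) / (max - min), applied column by column
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```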

In [None]:
from sklearn import preprocessing
data_unscaled=data.values
data_scaled = preprocessing.MinMaxScaler().fit_transform(data_unscaled)

In [None]:
pd.DataFrame(data_scaled, columns=data.columns).head()

## Feature Selection & Ranking

The 1-nearest neighbor classifier is used as a wrapper to compare the performance of feature selection methods: F-Score and Random Forest Importance. I also use stratified 5-fold cross validation with 3 repetitions for assessment.
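
As a rough illustration of this cross-validation scheme, the sketch below counts the validation folds produced by 5 splits with 3 repetitions and checks the class balance in each. The labels are synthetic and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels with a 2:1 imbalance; the feature matrix is a placeholder
y = np.array([1] * 20 + [0] * 10)
X = np.zeros((len(y), 1))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

# Each repetition reshuffles the data and yields 5 folds, so 15 validation folds in total
fold_pass_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print(len(fold_pass_rates), fold_pass_rates[:5])
```

Each call to `cross_val_score` with this splitter therefore returns 15 scores, which are the values later compared in the paired t-tests.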

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=1)
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=3,random_state=999)

### Full Set of Features
Before applying the feature selection methods, I assess the performance using all descriptive features in the dataset.

In [None]:
from sklearn.model_selection import cross_val_score
cv_results_full = cross_val_score(estimator=clf,
                             X=data_scaled,
                             y=target, 
                             cv=cv_method, 
                             scoring='roc_auc')
cv_results_full.mean()

The AUC score for the full features is very low, at 0.556. There are probably some irrelevant features in the dataset that weaken the performance of the model.
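
For context, AUC can be read as the probability that a randomly chosen "pass" observation receives a higher score than a randomly chosen "fail" observation, so a value near 0.5 is close to random ranking. A minimal sketch of that pairwise view, using made-up labels and scores:

```python
import numpy as np

# Toy labels and classifier scores (illustrative values only)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs in which the positive is scored higher;
# ties count as half a pair
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc = wins / (len(pos) * len(neg))
print(auc)
```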

### Random Forest Importance

In [None]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(999)
model_rfi = RandomForestClassifier(n_estimators=100)
model_rfi.fit(data_scaled, target)
fs_indices_rfi = np.argsort(model_rfi.feature_importances_)[::-1][0:10]

best_features_rfi = data.columns[fs_indices_rfi].values
print(best_features_rfi)

feature_importances_rfi = model_rfi.feature_importances_[fs_indices_rfi]
print(feature_importances_rfi)

In [None]:
cv_results_rfi = cross_val_score(estimator=clf,
                             X=data_scaled[:, fs_indices_rfi],
                             y=target, 
                             cv=cv_method, 
                             scoring='roc_auc')
cv_results_rfi.mean()

The AUC score for the top 10 features selected by Random Forest Importance is 0.753.

### F-Score

In [None]:
from sklearn import feature_selection as fs
np.random.seed(999)
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=10)
fs_fit_fscore.fit_transform(data_scaled, target)
fs_indices_fscore = np.argsort(fs_fit_fscore.scores_)[::-1][0:10]
best_features_fscore = data.columns[fs_indices_fscore].values
print(best_features_fscore)
feature_importances_fscore = fs_fit_fscore.scores_[fs_indices_fscore]
print(feature_importances_fscore)

In [None]:
cv_results_fscore = cross_val_score(estimator=clf,
                             X=data_scaled[:, fs_indices_fscore],
                             y=target, 
                             cv=cv_method, 
                             scoring='roc_auc')
cv_results_fscore.mean()

The AUC score for the top 10 features selected by F-Score is 0.812.

### Performance Comparison Using Paired T-Tests

In [None]:
from scipy import stats
print(stats.ttest_rel(cv_results_full, cv_results_fscore).pvalue.round(3))
print(stats.ttest_rel(cv_results_full, cv_results_rfi).pvalue.round(3))
print(stats.ttest_rel(cv_results_rfi, cv_results_fscore).pvalue.round(3))

The performance after feature selection, with either Random Forest Importance or F-Score, is statistically better than the performance based on the full features, at the 5% level of significance.
Meanwhile, the difference between F-Score and Random Forest Importance is also statistically significant. Therefore, I continue the analysis with the top 10 features selected by F-Score, shown in the figure below. The figure shows that importance drops sharply after the top 2 features; after that, the differences are marginal through to the last feature.

In [None]:
import seaborn as sns

feature_ranking = pd.DataFrame({'Feature': best_features_fscore,
                                'Importance': list(feature_importances_fscore)},
                               columns=['Feature', 'Importance'])
sns.barplot(x="Feature", y="Importance",
            color='blue', data=feature_ranking)

## Train-Test Splitting

The dataset is split into train and test at a 70:30 partition ratio by stratification:

* Training (70%): X_train (descriptive), y_train (target)
* Testing (30%): X_test (descriptive), y_test (target)

I also create X_train_10 and X_test_10, which contain the same rows as X_train and X_test but only the top 10 features selected by F-Score in the previous step.

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data_scaled,
                                                 target.values,
                                                 test_size=0.3,
                                                 random_state=999,
                                                 stratify=target.values)
print(X_train.shape)
print(X_test.shape)

In [None]:
X_train_10 = pd.DataFrame(X_train, columns=data.columns)
X_train_10 = X_train_10[best_features_fscore].values
X_test_10 = pd.DataFrame(X_test, columns=data.columns)
X_test_10 = X_test_10[best_features_fscore].values
print(X_train_10.shape)
print(X_test_10.shape)

# Hyperparameter Tuning<a class="anchor" id="3"></a>

In this section, I train and fine-tune the models based on the 276 rows of training data. I also compare the performance of models with the full features and with the top 10 features.

Each model is evaluated by 5-fold stratified cross-validation with 3 repetitions for hyperparameter tuning.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
cv_method = RepeatedStratifiedKFold(n_splits = 5,n_repeats=3,random_state=999)

## 1. K-Nearest Neighbors

I use grid search for hyperparameter tuning in a pipeline and train the KNN model with different k-nearest neighbors and distance types.
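
To give a sense of the search size, the sketch below counts the candidate settings with scikit-learn's `ParameterGrid` helper. The name `grid_knn` is hypothetical and mirrors the grid defined in the next cell. With scikit-learn's default Minkowski metric, `p=1` corresponds to Manhattan distance and `p=2` to Euclidean distance.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the KNN search: 14 neighbourhood sizes x 2 Minkowski powers
grid_knn = {"n_neighbors": list(range(2, 16)), "p": [1, 2]}

n_candidates = len(ParameterGrid(grid_knn))
n_fits = n_candidates * 5 * 3  # 5 folds x 3 repetitions per candidate
print(n_candidates, n_fits)
```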

In [None]:
params_knn = {'n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
              'p': [1, 2]}
gs_knn = GridSearchCV(estimator=KNeighborsClassifier(), 
                      param_grid=params_knn, 
                      cv=cv_method,
                      verbose=1, 
                      scoring='roc_auc')

In [None]:
gs_knn.fit(X_train, y_train)
print(gs_knn.best_params_)
print(gs_knn.best_score_)
knn_best = gs_knn.best_estimator_
print(knn_best)

In [None]:
gs_knn.fit(X_train_10, y_train)
print(gs_knn.best_params_)
print(gs_knn.best_score_)
knn_best10 = gs_knn.best_estimator_
print(knn_best10)

* The optimal KNN model based on the full features has a mean AUC score of 0.725 with 14 nearest neighbors and with Manhattan distance.
* The optimal KNN model based on the top 10 features has a mean AUC score of 0.934 with 14 nearest neighbors and with Manhattan distance.

In general, the performance of KNN models seems to have improved after feature selection. 

## 2. Decision Tree

To find the optimal Decision Tree model, I include different split criteria (Gini index and entropy), maximum depths and minimum samples per split in the grid search.
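
The two split criteria measure node impurity in different ways. A minimal sketch, assuming a node that holds one third "fail" and two thirds "pass" observations (roughly the class balance described earlier); `gini` and `entropy` are hypothetical helper names, not scikit-learn functions:

```python
import numpy as np

def gini(p):
    """Gini index of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (in bits) of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# A node with one third "fail" and two thirds "pass"
node = [1 / 3, 2 / 3]
print(gini(node), entropy(node))
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed, so the tree prefers splits that reduce them the most.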

In [None]:
from sklearn.tree import DecisionTreeClassifier
params_dt = {'criterion':['gini','entropy'],
             'max_depth':[3,4,5,6,7,8,9,10],
             'min_samples_split':[2,3,4,5]}
    
gs_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=999), 
                      param_grid=params_dt, 
                      cv=cv_method,
                      verbose=1, 
                      scoring='roc_auc')

In [None]:
gs_dt.fit(X_train, y_train)
print(gs_dt.best_params_)
print(gs_dt.best_score_)
dt_best = gs_dt.best_estimator_
print(dt_best)

In [None]:
gs_dt.fit(X_train_10, y_train)
print(gs_dt.best_params_)
print(gs_dt.best_score_)
dt_best10 = gs_dt.best_estimator_
print(dt_best10)

* The optimal Decision Tree model based on the full features has a mean AUC score of 0.956 using the Gini index, with a maximum depth of 3 and a minimum split size of 2 samples.
* The optimal Decision Tree model based on the top 10 features has a mean AUC score of 0.963 using entropy, also with a maximum depth of 3 and a minimum split size of 2 samples.

The performance of the Decision Tree models does not seem to improve much after feature selection, but I will confirm this with a paired t-test later.

## 3. (Gaussian) Naive Bayes

To find the optimal Gaussian Naive Bayes model, I search over 100 values of var_smoothing, logarithmically spaced from 1 down to 10^-9. Before fitting, I apply a power transformation so that each descriptive feature follows a Gaussian distribution more closely.
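
As a minimal sketch of these two steps, the example below builds the `var_smoothing` grid and applies `PowerTransformer` to a synthetic right-skewed feature. The variable names and the exponential sample are assumptions for illustration, not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Candidate var_smoothing values: 100 points evenly spaced on a log scale from 1 down to 1e-9
grid = np.logspace(0, -9, num=100)
print(grid[0], grid[-1], len(grid))

# A right-skewed toy feature (synthetic, not from the dataset)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(500, 1))

# PowerTransformer defaults to the Yeo-Johnson method and standardises the output
transformed = PowerTransformer().fit_transform(skewed)
print(transformed.mean(), transformed.std())
```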

In [None]:
from sklearn.preprocessing import PowerTransformer
X_train_trans = PowerTransformer().fit_transform(X_train)
X_train_10_trans = PowerTransformer().fit_transform(X_train_10)

In [None]:
from sklearn.naive_bayes import GaussianNB

np.random.seed(999)
params_nb = {'var_smoothing': np.logspace(0, -9, num=100)}
gs_nb = GridSearchCV(estimator=GaussianNB(),
                     param_grid=params_nb,
                     cv=cv_method,
                     verbose=1,
                     scoring='roc_auc')

In [None]:
gs_nb.fit(X_train_trans, y_train)
print(gs_nb.best_params_)
print(gs_nb.best_score_)
nb_best = gs_nb.best_estimator_
print(nb_best)

In [None]:
gs_nb.fit(X_train_10_trans, y_train)
print(gs_nb.best_params_)
print(gs_nb.best_score_)
nb_best10 = gs_nb.best_estimator_
print(nb_best10)

The optimal Naive Bayes model based on the full features yields a mean AUC score of 0.889, while the optimal one based on the top 10 features has a mean AUC score of 0.931.

# Performance Comparison <a class="anchor" id="4"></a> 

In this section, I fit the optimal models from the above analyses on the test data with 5-fold stratified cross validation and 3 repetitions. Then, I compare the performance of models by paired t-test:
* DT (full features) vs. DT (top 10 features)
* KNN (full features) vs. KNN (top 10 features)
* NB (full features) vs. NB (top 10 features)

Full Features:
* DT vs. KNN
* DT vs. NB
* KNN vs. NB

Top 10 Features:
* DT vs. KNN
* DT vs. NB
* KNN vs. NB
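
The pairing is valid because every model is scored on the same folds, produced by one splitter with a fixed `random_state`. A minimal sketch with made-up fold scores shows that the paired test is equivalent to a one-sample t-test on the per-fold differences:

```python
import numpy as np
from scipy import stats

# Made-up AUC scores for two models on the same five folds (illustrative only)
scores_a = np.array([0.80, 0.78, 0.85, 0.82, 0.79])
scores_b = np.array([0.76, 0.77, 0.80, 0.79, 0.78])

# Paired t-test: scores are matched fold by fold
paired = stats.ttest_rel(scores_a, scores_b)

# Equivalent view: one-sample t-test on the per-fold differences
one_sample = stats.ttest_1samp(scores_a - scores_b, 0.0)
print(paired.pvalue, one_sample.pvalue)
```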

In [None]:
from sklearn.model_selection import cross_val_score

cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,n_repeats=3,
                                          random_state=999)

In [None]:
cv_results_knn = cross_val_score(estimator=knn_best,
                                 X=X_test,
                                 y=y_test, 
                                 cv=cv_method_ttest, 
                                 scoring='roc_auc')
cv_results_knn_10 = cross_val_score(estimator=knn_best10,
                                 X=X_test_10,
                                 y=y_test, 
                                 cv=cv_method_ttest, 
                                 scoring='roc_auc')
cv_results_dt = cross_val_score(estimator=dt_best,
                                X=X_test,
                                y=y_test, 
                                cv=cv_method_ttest, 
                                scoring='roc_auc')
cv_results_dt_10 = cross_val_score(estimator=dt_best10,
                                X=X_test_10,
                                y=y_test, 
                                cv=cv_method_ttest, 
                                scoring='roc_auc')
cv_results_nb = cross_val_score(estimator=nb_best,
                                X=X_test,
                                y=y_test, 
                                cv=cv_method_ttest, 
                                scoring='roc_auc')
cv_results_nb_10 = cross_val_score(estimator=nb_best10,
                                X=X_test_10,
                                y=y_test, 
                                cv=cv_method_ttest, 
                                scoring='roc_auc')
print("KNN(full features):",cv_results_knn.mean())
print("KNN(top 10 features):",cv_results_knn_10.mean())
print("DT(full features):",cv_results_dt.mean())
print("DT(top 10 features):",cv_results_dt_10.mean())
print("NB(full features):",cv_results_nb.mean())
print("NB(top 10 features):",cv_results_nb_10.mean())

In [None]:
print(stats.ttest_rel(cv_results_dt, cv_results_dt_10))
print(stats.ttest_rel(cv_results_nb, cv_results_nb_10))
print(stats.ttest_rel(cv_results_knn, cv_results_knn_10))

The above results show that all models perform significantly better with the top 10 features selected by F-Score, than with the full features, at a 5% level of significance.

In [None]:
print(stats.ttest_rel(cv_results_dt, cv_results_knn))
print(stats.ttest_rel(cv_results_dt, cv_results_nb))
print(stats.ttest_rel(cv_results_knn, cv_results_nb))

For models based on the full features, the Decision Tree model performs significantly better than KNN and Naive Bayes models, at a 5% level of significance. KNN and Naive Bayes models perform at similar levels.

In [None]:
print(stats.ttest_rel(cv_results_dt_10, cv_results_knn_10))
print(stats.ttest_rel(cv_results_dt_10, cv_results_nb_10))
print(stats.ttest_rel(cv_results_knn_10, cv_results_nb_10))

For models based on the top 10 features selected by F-Score, the KNN model performs significantly better than the Naive Bayes model. The Decision Tree model performs similarly to both the KNN and Naive Bayes models. It is hard to choose the best model at this stage with only the AUC score available, but the comparisons above indicate that all models perform better on the test data with the top 10 features. Therefore, for the further evaluation of accuracy, precision, recall, F1-score and the confusion matrix, I consider only the models with the top 10 features.

In [None]:
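# Predict with the tuned top-10-feature models (each refit on the full training set by GridSearchCV)
pred_knn = knn_best10.predict(X_test_10)
pred_dt = dt_best10.predict(X_test_10)
# Apply the power transformation learned from the training data instead of refitting it on the test set
X_test_10_trans = PowerTransformer().fit(X_train_10).transform(X_test_10)
pred_nb = nb_best10.predict(X_test_10_trans)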

In [None]:
from sklearn import metrics
print("\nKNN: Confusion matrix") 
print(metrics.confusion_matrix(y_test, pred_knn))
print("\nKNN: Classification report") 
print(metrics.classification_report(y_test, pred_knn))

print("\nDT: Confusion matrix") 
print(metrics.confusion_matrix(y_test, pred_dt))
print("\nDT: Classification report") 
print(metrics.classification_report(y_test, pred_dt))

print("\nNB: Confusion matrix") 
print(metrics.confusion_matrix(y_test, pred_nb))
print("\nNB: Classification report") 
print(metrics.classification_report(y_test, pred_nb))

Suppose the school wants to identify students who are likely to fail the Mathematics course so that it can offer them extra support in advance. In this case, the recall of class "0" (fail) is the key metric, and the Decision Tree model is the best choice.

However, if the school wants to identify students who are likely to pass, for example to select strong students for Mathematics competitions, the recall of class "1" (pass) should be emphasised, and the KNN model is the best choice.

Overall, the Decision Tree model can be regarded as the best model in terms of F1-score, which is the harmonic mean of precision and recall.
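
To make these metrics concrete, the sketch below derives them by hand from a toy confusion matrix. The counts are made up for illustration and are not the results reported above.

```python
# Toy confusion matrix in scikit-learn's layout: rows are true classes, columns are predictions
# [[TN, FP],
#  [FN, TP]]  with 0 = fail, 1 = pass; the counts are made up for illustration
tn, fp, fn, tp = 3, 1, 2, 4

recall_fail = tn / (tn + fp)      # share of failing students that are flagged
recall_pass = tp / (tp + fn)      # share of passing students that are recognised
precision_pass = tp / (tp + fp)   # share of predicted passes that are correct

# F1 for the "pass" class: harmonic mean of its precision and recall
f1_pass = 2 * precision_pass * recall_pass / (precision_pass + recall_pass)
print(recall_fail, recall_pass, precision_pass, round(f1_pass, 3))
```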

# Summary <a class="anchor" id="5"></a> 

The Decision Tree model based on the top 10 features selected by F-Score is the best model for prediction in this analysis, given the limited hyperparameter tuning and feature selection approach. Although there is not enough statistical evidence, at a 5% level of significance, to show that this Decision Tree model performs better than the KNN and Naive Bayes models, it yields the highest AUC score (0.916) and F1-score (0.9). In general, all models with only the top 10 features selected by F-Score perform significantly better than those with the full features, at a 5% level of significance.

# References <a class="anchor" id="6"></a> 
* P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. 