# Exam results prediction
This notebook analyzes the data and attempts to predict the students' exam results.
The dataset is taken from [Kaggle](https://www.kaggle.com/datasets/rkiattisak/student-performance-in-mathematics).

## Importing the libraries

In [None]:
from os import stat
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns
%matplotlib inline

## Importing the dataset

In [None]:
# Size of the file

filename = 'data/exams.csv'

print(f'File size: {stat(filename).st_size / 1024} kB.')

# Read the data
df = pd.read_csv(filename)

## Information about the data

In [None]:
df.info()  # info() prints its summary directly and returns None
display(df.head(10))
print(df.describe())

## Data visualization
We are now going to visualize the data to get a better understanding of it.
### Gender distribution

In [None]:
plt.figure(figsize=(10, 7))
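gender_counts = df['gender'].value_counts()
# Take labels and heights from the same Series so that they stay aligned
plt.bar(gender_counts.index, gender_counts.values, color='blue')
plt.title('Gender distribution')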
plt.show()
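
One detail worth checking when building such bar charts by hand: `unique()` lists categories in order of first appearance, whereas `value_counts()` sorts them by decreasing frequency, so labels and heights should both be read from the `value_counts()` result. A minimal sketch on a made-up toy column (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.Series(['male', 'female', 'female', 'female', 'male'])

labels_by_appearance = list(toy.unique())  # order of first appearance
counts = toy.value_counts()                # sorted by decreasing count
labels_by_count = list(counts.index)

print(labels_by_appearance)
print(labels_by_count)
print(counts.to_dict())
```

Here the two label orders differ, so pairing `unique()` with the heights from `value_counts()` would attach each count to the wrong category.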

### Parental level of education

In [None]:
plt.figure(figsize=(15, 11))
education_counts = df['parental level of education'].value_counts()
# Take labels and heights from the same Series so that they stay aligned
plt.bar(education_counts.index, education_counts.values, color='blue')
plt.title('Parental level of education distribution')
plt.show()

### Lunch

In [None]:
plt.figure(figsize=(10, 6))
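lunch_counts = df['lunch'].value_counts()
# Take labels and heights from the same Series so that they stay aligned
plt.bar(lunch_counts.index, lunch_counts.values, color='blue')
plt.title('Lunch distribution')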
plt.show()

### Test preparation course

In [None]:
plt.figure(figsize=(10, 6))
prep_counts = df['test preparation course'].value_counts()
# Take labels and heights from the same Series so that they stay aligned
plt.bar(prep_counts.index, prep_counts.values, color='blue')
plt.title('Test preparation course distribution')
plt.show()

## Data preprocessing
We are now going to preprocess the data to make it ready for the machine learning model.
Since the categorical variables are stored as strings, we encode them as integers.
Every column other than the three score columns is encoded, which includes:
- `gender`;
- `parental level of education`;
- `lunch`;
- `test preparation course`.

In [None]:
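score_cols = ['math score', 'reading score', 'writing score']
for col in df.columns:
    if col not in score_cols:
        # Map each category to an integer, numbered in order of first appearance
        mapping = {category: code for code, category in enumerate(df[col].unique())}
        df[col] = df[col].map(mapping)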

display(df.head(10))
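
As a side note, the same first-appearance integer encoding can be obtained with `pd.factorize`. The sketch below compares the two approaches on a made-up toy column (the values are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.Series(['red', 'blue', 'red', 'green', 'blue'])

# Manual mapping: categories numbered in order of first appearance
mapping = {category: code for code, category in enumerate(toy.unique())}
manual_codes = toy.map(mapping)

# pd.factorize returns the integer codes and the array of unique categories
codes, uniques = pd.factorize(toy)

print(list(manual_codes))
print(list(codes))
print(list(uniques))
```

Keep in mind that integer codes impose an arbitrary order on categories that have none. One-hot encoding (for example with `pd.get_dummies`) is a common alternative when that order could mislead the model.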

## Exploratory data analysis
### Correlation matrix

In [None]:
display(df.corr(numeric_only=False))
plt.figure(figsize=(13,10))
sns.heatmap(df.corr(), cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation heatmap of the dataset')
plt.show()

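This correlation matrix shows that the explanatory variables are essentially uncorrelated with each other, so none of them is redundant.
We will therefore keep all of them in the model.
The target variables also show little correlation with the explanatory variables, which could be a problem for prediction.
For the same reason, PCA would be of little use here: when the features are uncorrelated, no small set of components captures most of the variance.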
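
The remark about PCA can be illustrated numerically. The sketch below uses synthetic, independent features (an assumption for illustration, not the notebook's data) and computes the explained variance ratios with NumPy's eigendecomposition of the covariance matrix rather than with scikit-learn's `PCA` class:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data, for illustration only: 5000 samples, 4 independent features
X = rng.standard_normal((5000, 4))

# PCA's explained variance ratios are the normalized eigenvalues
# of the feature covariance matrix, sorted in decreasing order
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
explained_variance_ratio = eigenvalues / eigenvalues.sum()

print(explained_variance_ratio)
```

With uncorrelated features of similar scale, each component explains roughly the same share of the variance, so dropping any of them loses about as much information as dropping an original feature.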

### Boxplots

In [None]:
plt.figure(figsize=(10, 7))
sns.boxplot(x = 'math score', data = df.drop(['reading score', 'writing score'], axis = 1))
plt.title('Math score boxplot')
plt.show()

plt.figure(figsize=(10, 7))
sns.boxplot(x = 'reading score', data = df.drop(['math score', 'writing score'], axis = 1))
plt.title('Reading score boxplot')
plt.show()

plt.figure(figsize=(10, 7))
sns.boxplot(x = 'writing score', data = df.drop(['math score', 'reading score'], axis = 1))
plt.title('Writing score boxplot')
plt.show()

### Stem and leaf plot

In [None]:
import stemgraphic as sg
for col in ['math score', 'reading score', 'writing score']:
    sg.stem_graphic(df[col], scale = 10)

## Splitting the dataset into the Training set and Test set
We are going to split our dataset into a training set (80%) and a test set (20%).

In [None]:
from sklearn.model_selection import train_test_split

target_cols = ['math score', 'reading score', 'writing score']
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(target_cols, axis=1), df[target_cols], test_size=0.2, random_state=42
)
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')

## Training the Multiple Linear Regression model on the Training set
### Training the model
We are now going to train our model on the training set.

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

### Testing the model
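We now evaluate the model on the test set.
The `score` method returns the coefficient of determination (R²), averaged uniformly over the three targets.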

In [None]:
print(f'Score prediction: {reg.score(X_test, y_test)}')
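
To make the meaning of this score concrete, the sketch below recomputes R² by hand on synthetic single-target data (the data and coefficients are made up for illustration) and compares it with `r2_score` and with the model's `score` method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic single-target data, for illustration only
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(200)

model = LinearRegression().fit(X[:150], y[:150])
y_true, y_pred = y[150:], model.predict(X[150:])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred), model.score(X[150:], y_true))
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the test targets, and negative values mean it does worse than that.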

### Cross validation
We are going to use the `cross_val_score` function to evaluate our model.
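
As a rough sketch of what `cross_val_score` does for a regressor when `cv=5`, the loop below splits synthetic data (made up for illustration) into five consecutive, unshuffled folds with `KFold`, fits a fresh copy of the model on four of them, and scores it on the remaining one:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic data, for illustration only
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(100)

model = LinearRegression()

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Fit an unfitted copy on the training folds, score on the held-out fold
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

library_scores = cross_val_score(model, X, y, cv=5)
print(np.array(manual_scores))
print(library_scores)
```

Because the folds are not shuffled by default, passing `cv=KFold(n_splits=5, shuffle=True, random_state=42)` may be worth considering if the rows of the file follow some order.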

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(reg, df.drop(['math score', 'reading score', 'writing score'], axis = 1), df[['math score', 'reading score', 'writing score']], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

The cross-validation scores confirm that the linear regression model performs poorly.
We are going to try another model.

## Training the Decision Tree Regression model on the Training set
We are going to train our decision tree regression model on the training set.
### Training the model

In [None]:
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor(random_state=42)  # fixed seed for reproducible results
reg.fit(X_train, y_train)

### Testing the model

In [None]:
print(f'Score prediction: {reg.score(X_test, y_test)}')

### Cross validation

In [None]:
scores = cross_val_score(reg, df.drop(['math score', 'reading score', 'writing score'], axis = 1), df[['math score', 'reading score', 'writing score']], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

The cross-validation scores confirm that the decision tree regression model also performs poorly.
We are now going to try a random forest.

## Training the Random Forest Regression model on the Training set
We are now going to train a random forest regression model on the training set.
### Training the model

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=150, max_depth=10, random_state=42)
reg.fit(X_train, y_train)

### Testing the model
We use the `score` method, which returns the R² coefficient of determination, to evaluate the model.

In [None]:
print(f'Score prediction: {reg.score(X_test, y_test)}')

### Cross validation

In [None]:
scores = cross_val_score(reg, df.drop(['math score', 'reading score', 'writing score'], axis = 1), df[['math score', 'reading score', 'writing score']], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

The scores are slightly better than those of the multiple linear regression model.
However, this still cannot be considered a good model.

## Training the Support Vector Regression model on the Training set
We are now going to train a support vector regression model on the training set.
### Training the model

In [None]:
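from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# SVR only supports a single target, so fit one SVR per score column
reg = MultiOutputRegressor(SVR())
reg.fit(X_train, y_train)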

### Testing the model
We use the `score` method, which returns the R² coefficient of determination, to evaluate the model.

In [None]:
print(f'Score prediction: {reg.score(X_test, y_test)}')

### Cross validation

In [None]:
scores = cross_val_score(reg, df.drop(['math score', 'reading score', 'writing score'], axis = 1), df[['math score', 'reading score', 'writing score']], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

This model is no better than the random forest regression model.
We are going to try a boosting model.

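## Training the Gradient Boosting model on the Training set
We are now going to train a gradient boosting model on the training set, using scikit-learn's `GradientBoostingRegressor` (not the XGBoost library). Since this estimator predicts a single target, we fit one model per score.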
### Training the model

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

reg_dict =  {
    "math" : GradientBoostingRegressor(n_estimators=150, max_depth=10, random_state=42),
    "reading" : GradientBoostingRegressor(n_estimators=150, max_depth=10, random_state=42),
    "writing" : GradientBoostingRegressor(n_estimators=150, max_depth=10, random_state=42)
}

for col in ['math', 'reading', 'writing']:
    reg_dict[col].fit(X_train, y_train[f'{col} score'])
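
The per-target dictionary above can also be expressed with scikit-learn's `MultiOutputRegressor`, which clones the base estimator and fits one copy per target column. The sketch below uses synthetic data and smaller, made-up hyperparameters (both are assumptions for illustration) to check that the two approaches give the same predictions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
# Synthetic data with two targets, for illustration only
X = rng.standard_normal((120, 4))
Y = np.column_stack([
    X[:, 0] + rng.standard_normal(120),
    X[:, 1] - X[:, 2] + rng.standard_normal(120),
])

# Hypothetical hyperparameters, kept small so the sketch runs quickly
params = dict(n_estimators=20, max_depth=3, random_state=42)

# One independent model per target column
separate = np.column_stack([
    GradientBoostingRegressor(**params).fit(X, Y[:, i]).predict(X)
    for i in range(Y.shape[1])
])

# The wrapper does the same cloning and per-column fitting internally
wrapped = MultiOutputRegressor(GradientBoostingRegressor(**params)).fit(X, Y).predict(X)

print(np.allclose(separate, wrapped))
```

The wrapper keeps the same `fit`, `predict` and `score` interface as the multi-output models used earlier, at the cost of hiding the individual per-target scores.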

### Testing the model

In [None]:
for col in ['math', 'reading', 'writing']:
    print(f'{col.capitalize()} score prediction: {reg_dict[col].score(X_test, y_test[f"{col} score"])}')

The score is better than that of the random forest regression model for `math score`, but not for `reading score` or `writing score`.

### Cross validation

In [None]:
for col in ['math', 'reading', 'writing']:
    # Pass the target as a 1-D Series; a one-column DataFrame triggers a shape warning
    scores = cross_val_score(reg_dict[col], df.drop(['math score', 'reading score', 'writing score'], axis = 1), df[f'{col} score'], cv=5)
    print(f'{col.capitalize()} score cross validation scores: {scores}')
    print(f'{col.capitalize()} score cross validation mean score: {scores.mean()}')
    print(f'{col.capitalize()} score cross validation standard deviation: {scores.std()}')

This model is not better than the random forest regression model.
We are going to try a neural network.

## Training the Artificial Neural Network on the Training set
We are now going to train an artificial neural network on the training set.
### Training the model

In [None]:
from sklearn.neural_network import MLPRegressor