# Exam results prediction
This notebook analyzes a student performance dataset and then predicts the students' exam results from it.
The dataset is taken from [Kaggle](https://www.kaggle.com/datasets/rkiattisak/student-performance-in-mathematics).

## Importing the libraries

In [None]:
from os import stat
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline

## Importing the dataset

In [None]:
# Size of the file
filename = 'data/exams.csv'
print(f'File size: {stat(filename).st_size / 1024:.1f} kB.')

# Read the data
df = pd.read_csv(filename)

## Information about the data

In [None]:
df.info()  # prints its summary directly and returns None, so it is not wrapped in display()
display(df.head(10))
display(df.describe())

## Data visualization
We are now going to visualize the data to get a better understanding of it.
### Gender distribution

In [None]:
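# Take labels and heights from the same Series: unique() and value_counts()
# order their results differently, so pairing them can mislabel the bars.
counts = df['gender'].value_counts()
plt.figure(figsize=(10, 7))
plt.bar(counts.index, counts.values, color='blue')
plt.title('Gender distribution')
plt.show()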
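
One detail worth knowing when drawing such bar charts: `Series.unique()` lists values in order of first appearance, whereas `Series.value_counts()` sorts them by descending count. The two are only safely paired when labels and heights both come from the same `value_counts()` result. A minimal sketch on a made-up series (not the exam data):

```python
import pandas as pd

# Toy column with made-up values: 'b' appears first but is not the most frequent.
toy = pd.Series(['b', 'a', 'a', 'a', 'b', 'c', 'a'])

labels_by_appearance = list(toy.unique())  # order of first appearance
counts = toy.value_counts()                # sorted by count, descending

print(labels_by_appearance)
print(counts)

# Safe pairing: each label travels with its own count.
safe_pairs = list(zip(counts.index, counts.values))
print(safe_pairs)
```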

### Parental level of education

In [None]:
# Labels and heights both come from value_counts(), so they stay aligned
counts = df['parental level of education'].value_counts()
plt.figure(figsize=(15, 11))
plt.bar(counts.index, counts.values, color='blue')
plt.title('Parental level of education distribution')
plt.show()

### Lunch

In [None]:
# Labels and heights both come from value_counts(), so they stay aligned
counts = df['lunch'].value_counts()
plt.figure(figsize=(10, 6))
plt.bar(counts.index, counts.values, color='blue')
plt.title('Lunch distribution')
plt.show()

### Test preparation course

In [None]:
# Labels and heights both come from value_counts(), so they stay aligned
counts = df['test preparation course'].value_counts()
plt.figure(figsize=(10, 6))
plt.bar(counts.index, counts.values, color='blue')
plt.title('Test preparation course distribution')
plt.show()
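
The four plotting cells above follow the same pattern, so they could be folded into one function. The sketch below is only an illustration: `plot_category_counts` is a hypothetical helper name, and the toy frame uses made-up values in place of `df`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(frame, column, figsize=(10, 6)):
    """Draw a bar chart of how often each value of `column` occurs (hypothetical helper)."""
    counts = frame[column].value_counts()  # labels and heights from one Series
    fig, ax = plt.subplots(figsize=figsize)
    ax.bar(counts.index.astype(str), counts.values, color='blue')
    ax.set_title(f'{column.capitalize()} distribution')
    ax.set_ylabel('Number of students')
    return ax


# Toy frame with made-up values, standing in for df
toy = pd.DataFrame({'lunch': ['standard', 'standard', 'free/reduced']})
ax = plot_category_counts(toy, 'lunch')
```

In the notebook this would be called as, for example, `plot_category_counts(df, 'gender')`, followed by `plt.show()`.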

## Data preprocessing
We are now going to preprocess the data to make it ready for the machine learning model.
Since several variables are categorical, we encode each category as an integer, numbered in order of first appearance.
This applies to every column other than the three score columns, including:
- `gender`;
- `parental level of education`;
- `lunch`;
- `test preparation course`.

In [None]:
# Encode every non-score column as integers, numbered in order of first appearance
score_cols = ['math score', 'reading score', 'writing score']
for col in df.columns.drop(score_cols):
    mapping = {value: code for code, value in enumerate(df[col].unique())}
    df[col] = df[col].map(mapping)

display(df.head(10))
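
Integer codes put categories on an arbitrary numeric scale (one category becomes 0, another 2), and a linear model will treat that scale as meaningful. One common alternative, which this notebook does not use, is one-hot encoding. The sketch below assumes `pd.get_dummies` on a made-up two-column frame; the values are toy data chosen for illustration.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the exam data
toy = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard'],
    'math score': [70, 55, 88],
})

# One indicator column per category; drop_first removes one redundant column
# so the indicators are not perfectly collinear with the intercept.
encoded = pd.get_dummies(toy, columns=['lunch'], drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True`, a two-valued column such as this one collapses to a single 0/1 indicator, which is equivalent to the integer encoding; the difference only shows for columns with three or more categories.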

## Splitting the dataset into the Training set and Test set
We are going to split our dataset into a training set (80%) and a test set (20%).

In [None]:
from sklearn.model_selection import train_test_split

target_cols = ['math score', 'reading score', 'writing score']
X = df.drop(columns=target_cols)  # features: the encoded categorical columns
y = df[target_cols]               # targets: the three exam scores

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')
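
To illustrate what `test_size=0.2` and `random_state` do, here is a small sketch on a made-up ten-row frame (not the exam data). It checks that the split is 80/20, that the two parts do not overlap, and that the same seed reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up ten-row frame, standing in for the encoded exam data
toy = pd.DataFrame({'feature': range(10), 'score': range(50, 60)})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[['feature']], toy['score'], test_size=0.2, random_state=42
)

# A second call with the same seed selects the same rows
X_tr2, X_te2, _, _ = train_test_split(
    toy[['feature']], toy['score'], test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
print(X_te.index.equals(X_te2.index))
```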

## Training the Multiple Linear Regression model on the Training set
We are now going to train our model on the training set.

In [None]:
from sklearn.linear_model import LinearRegression
