# Predicting students' performance on exams

### Imports

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Introduction

This notebook builds a model on the "Students Performance in Exams" dataset (available on Kaggle at this [link](https://www.kaggle.com/spscientist/students-performance-in-exams)). The model predicts a student's mean score (the average of the math, reading, and writing scores) from demographic, social, and academic features.

## 1. Load data

The dataset is loaded and stored in `student_performance`. Its features are displayed below.

In [None]:
student_performance = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

In [None]:
student_performance.head()

## 2. Exploratory Data Analysis

The students' performance dataset holds 1000 samples with 8 columns: gender, race/ethnicity, parental level of education, lunch type, whether a test preparation course was completed, and scores in math, reading, and writing.

In [None]:
student_performance.shape

The mean score in each of the three subjects is around 66-69, with a standard deviation of 14-15 points. Scores range between 0 and 100, but no student scored zero in reading or writing.

In [None]:
student_performance.describe().T

The line below confirms that there are no missing values in the dataset.

In [None]:
student_performance.isna().any()
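
As a complement to the check above, here is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not taken from the dataset). It shows how `isna().any()` flags a column with a gap and how `isna().sum()` counts the gaps per column.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; one math score is missing on purpose
toy = pd.DataFrame({
    "math score": [70, np.nan, 55],
    "lunch": ["standard", "standard", "free/reduced"],
})

has_missing = toy.isna().any()  # True for columns with at least one gap
n_missing = toy.isna().sum()    # number of gaps per column

print(has_missing)
print(n_missing)
```

If every entry of `has_missing` is `False`, as in the real dataset, no imputation step is needed.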

Slightly more female than male students took the exams. Among both girls and boys, "group C" is the most common race/ethnicity category, followed by "group D".

In [None]:
student_performance["gender"].value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (8, 5)) 
sns.countplot(data = student_performance, x = "gender", hue = "race/ethnicity", palette = "Blues")
ax.set(title = "Students by gender and race/ethnicity", ylabel = "number")
plt.show()

The data also show that about two thirds of students did not complete a test preparation course. Judging by the plot below, this choice does not appear to depend on the parents' level of education.

In [None]:
student_performance["test preparation course"].value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (8, 5)) 
sns.countplot(data = student_performance, x = "test preparation course", hue = "parental level of education", palette = "Blues")
ax.set(title = "Students by test preparation and parental level of education", ylabel = "number")
plt.show()
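
A count plot compares absolute numbers, which can be hard to read when the groups differ in size. One possible way to compare groups directly is a row-normalised cross-tabulation. The sketch below uses a made-up toy frame (its values are assumptions for illustration only) and `pd.crosstab` with `normalize="index"`.

```python
import pandas as pd

# Toy frame with hypothetical values: three students per education level
toy = pd.DataFrame({
    "parental level of education": ["high school"] * 3 + ["bachelor's degree"] * 3,
    "test preparation course": ["none", "none", "completed"] * 2,
})

# Share of "completed" / "none" within each education level (rows sum to 1)
shares = pd.crosstab(
    toy["parental level of education"],
    toy["test preparation course"],
    normalize="index",
)
print(shares)
```

Similar shares across rows would support the reading that course attendance is unrelated to parental education.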

The correlation matrix below shows a strong positive (linear) correlation only among the three exam scores. No other pair of variables is strongly correlated.

In [None]:
# Encode categorical variables
sp = student_performance.copy()
col = ["gender", "race/ethnicity", "parental level of education", "lunch", "test preparation course"]
for title in col:
    sp[title] = LabelEncoder().fit_transform(sp[title])
corr = sp.corr()
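
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a list of category labels (the list `levels` is an illustrative assumption). The integer codes follow the alphabetical order of the labels, not any natural ranking, so correlations computed on codes of multi-level nominal columns should be read with caution.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of education levels
levels = ["some college", "high school", "master's degree", "high school"]

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

print(list(encoder.classes_))  # labels sorted alphabetically
print(codes.tolist())          # integer code of each input label
```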

In [None]:
plt.figure(figsize = (8, 6))
sns.heatmap(corr, fmt = ".2f", cmap = "Blues", annot = True,
           linewidths = 2, vmin = -1.0, vmax = 1.0)
plt.show()

Scores in all three subjects have a similar distribution.

In [None]:
math = student_performance["math score"]
reading = student_performance["reading score"]
writing = student_performance["writing score"]

In [None]:
# Display score distribution in three subjects
sns.set_palette("Paired")
plt.figure(figsize=(8,5))
plt.hist(math, alpha = 0.8, label = "math")
plt.hist(reading, alpha = 0.8, label = "reading")
plt.hist(writing, alpha = 0.8, label = "writing")
plt.xlabel("Scores")
plt.ylabel("Number of students")
plt.title("Students' scores in math, reading and writing")
plt.legend()
plt.show()

## 3. Preprocessing

Data should be pre-processed before being passed to a modelling algorithm. The first step is to calculate the target variable: a mean score computed as the average of the math, reading, and writing scores.

In [None]:
col = student_performance.loc[: , "math score":"writing score"]

In [None]:
student_performance["mean_score"] = col.mean(axis = 1)
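
The row-wise average can be checked on a tiny example. The sketch below uses a made-up two-row frame (the scores are illustrative assumptions) and the same label-based slice and `mean(axis=1)` call as above.

```python
import pandas as pd

# Toy frame with hypothetical scores for two students
toy = pd.DataFrame({
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

# Label-based slices include both end columns
scores = toy.loc[:, "math score":"writing score"]
toy["mean_score"] = scores.mean(axis=1)  # average across columns, per row

print(toy)
```

For the first student this gives (72 + 72 + 74) / 3, and for the second (69 + 90 + 88) / 3.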

The dataset with mean scores is displayed below.

In [None]:
student_performance.head()

The next step is to make all features numeric. To that end, all categorical variables are passed through `pd.get_dummies` and thus converted into numeric indicator columns.

In [None]:
student_performance = pd.get_dummies(student_performance)
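
As an illustration of what one-hot encoding produces, here is a minimal sketch on a made-up frame (the values are assumptions). It also passes the `dtype` argument of `pd.get_dummies`, which is one possible way to obtain floating-point indicator columns directly. Note that this affects only the dummy columns; numeric columns keep their original type.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# Each category becomes its own indicator column named "<column>_<category>"
encoded = pd.get_dummies(toy, dtype=np.float32)

print(encoded.columns.tolist())
print(encoded.dtypes)
```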

In [None]:
student_performance.head()

Most columns hold "uint8" or "int64" values. All features should, however, be floating-point numbers, so their type is changed to "float32".

In [None]:
student_performance.dtypes

In [None]:
student_performance = student_performance.astype("float32")

Furthermore, the columns are rearranged because `get_dummies` placed the dummy columns on the right side of the table. The scores and their mean are now moved to the end.

In [None]:
student_performance = student_performance[["gender_female", "gender_male", "race/ethnicity_group A", "race/ethnicity_group B",
                          "race/ethnicity_group C", "race/ethnicity_group D", "race/ethnicity_group E",
                          "parental level of education_associate's degree", 
                          "parental level of education_bachelor's degree",
                          "parental level of education_high school", 
                          "parental level of education_master's degree", "parental level of education_some college",
                          "parental level of education_some high school", "lunch_free/reduced", "lunch_standard",
                          "test preparation course_completed", "test preparation course_none", 
                          "math score", "reading score", "writing score", "mean_score"]]

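Predicting values is a supervised learning task, which means that the algorithm expects independent variables (features) and a dependent variable (target). In this case, "mean_score" is the target: it is dropped from the table, the remaining columns are stored in `features`, and the target itself is stored in `target`. Note that the features still include the math, reading, and writing scores from which the target is computed.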

In [None]:
features = student_performance.drop("mean_score", axis = 1).values

In [None]:
target = student_performance["mean_score"].values

Features should be scaled before being passed to a machine learning algorithm. `scikit-learn`'s `StandardScaler()` does this job: it standardizes each feature to zero mean and unit variance (the resulting values are not confined to the range between 0 and 1).

In [None]:
scaler = StandardScaler()

In [None]:
features_scaled = scaler.fit_transform(features)
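
To illustrate what standardization does to a column, here is a minimal sketch on made-up scores (the array `scores` is an illustrative assumption). Each value has the column mean subtracted and is divided by the column standard deviation, so the result has mean 0 and standard deviation 1 and can be negative or larger than 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature column
scores = np.array([[40.0], [60.0], [80.0], [100.0]])

scaled = StandardScaler().fit_transform(scores)

print(scaled.ravel())               # standardized values
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```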

The last preprocessing step is to split the data into training and testing sets (a validation set is skipped here) and to check that all sets have the proper shape. Only 10% of the samples are withheld for testing.

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features_scaled, target,
                                                                        test_size = 0.1,
                                                                        random_state = 42)

In [None]:
features_train.shape, target_train.shape, features_test.shape, target_test.shape
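
The shapes to expect can be sketched with placeholder arrays of the same size as the data: 1000 samples and 20 feature columns (17 dummy columns plus the three scores). The arrays `X` and `y` below are zero-filled stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same dimensions as the features and target
X = np.zeros((1000, 20))
y = np.zeros(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

With `test_size = 0.1`, 900 samples go to the training set and 100 to the testing set.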

## 4. Modelling

Linear regression is (perhaps) the simplest modelling algorithm. In its ordinary least squares form it has no regularization hyperparameters and requires no specific fine-tuning. Thus, the estimator is instantiated with its default settings and fitted on the training data (features and target).

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(features_train, target_train)
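
It may help to see what a linear regression recovers when the target is an exact linear combination of some of its features, as the mean score is of the three exam scores. The sketch below uses randomly generated scores (synthetic data, an assumption for illustration) rather than the dataset itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic scores for 200 hypothetical students in three subjects
rng = np.random.default_rng(0)
scores = rng.integers(0, 101, size=(200, 3)).astype(float)

# The target is the row-wise mean, i.e. an exact linear function of the features
target = scores.mean(axis=1)

model = LinearRegression().fit(scores, target)

print(model.coef_)                   # each weight is close to 1/3
print(model.intercept_)              # close to 0
print(model.score(scores, target))   # R-squared close to 1
```

In such a setting a near-perfect fit reflects the definition of the target rather than the predictive value of the other features.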

The model's performance is evaluated by applying `predict` to the testing data.

In [None]:
predicted = lr.predict(features_test)

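The performance metrics below suggest that the model fits very well: the Root Mean Squared Error is negligible and R-squared is almost 100%. This is to be expected, since the features include the three exam scores and the target is simply their average.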

In [None]:
print(f"MSE, testing set: {mean_squared_error(target_test, predicted)}")
print(f"RMSE, testing set: {np.sqrt(mean_squared_error(target_test, predicted))}")
print(f"R-squared on testing set: {r2_score(target_test, predicted)}")
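
For reference, the three metrics can be reproduced by hand. The sketch below uses made-up true and predicted values (illustrative assumptions) and compares the manual formulas with `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted mean scores
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 69.0, 79.0, 92.0])

# MSE: average squared residual; RMSE: its square root (same units as the scores)
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

Here the squared residuals are 4, 1, 1, and 4, so the MSE is 2.5, and R-squared is 1 - 10/500 = 0.98.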

Thus, a student's background together with the individual exam scores predicts the mean score almost perfectly. That accuracy stems largely from the exam scores themselves, since the mean is computed from them.