# Student Performance Prediction

Welcome to the practice session of Lecture 1. In this tutorial, we will use a dataset from [Kaggle](https://www.kaggle.com/datasets/larsen0966/student-performance-data-set) to illustrate how to build an ML model. The dataset describes students at two Portuguese schools. Using this data, we want to predict each student's final grade (`G3`).

Description of the dataset:
- `school`: student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
- `sex`: student's sex (binary: 'F' - female or 'M' - male)
- `age`: student's age (numeric: from 15 to 22)
- `address`: student's home address type (binary: 'U' - urban or 'R' - rural)
- `famsize`: family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
- `Pstatus`: parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
- `Medu`: mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- `Fedu`: father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- `Mjob`: mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- `Fjob`: father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- `reason`: reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
- `guardian`: student's guardian (nominal: 'mother', 'father' or 'other')
- `traveltime`: home to school travel time (numeric: 1 is < 15 min., 2 is 15 to 30 min., 3 is 30 min. to 1 hour, or 4 is > 1 hour)
- `studytime`: weekly study time (numeric: 1 is < 2 hours, 2 is 2 to 5 hours, 3 is 5 to 10 hours, or 4 is >10 hours)
- `failures`: number of past class failures (numeric: n if 1 ≤ n < 3, else 4)
- `schoolsup`: extra educational support (binary: yes or no)
- `famsup`: family educational support (binary: yes or no)
- `paid`: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- `activities`: extra-curricular activities (binary: yes or no)
- `nursery`: attended nursery school (binary: yes or no)
- `higher`: wants to take higher education (binary: yes or no)
- `internet`: Internet access at home (binary: yes or no)
- `romantic`: in a romantic relationship (binary: yes or no)
- `famrel`: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- `freetime`: free time after school (numeric: from 1 - very low to 5 - very high)
- `goout`: going out with friends (numeric: from 1 - very low to 5 - very high)
- `Dalc`: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- `Walc`: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- `health`: current health status (numeric: from 1 - very bad to 5 - very good)
- `absences`: number of school absences (numeric: from 0 to 93)
- `G1`: first period grade (numeric: from 0 to 20)
- `G2`: second period grade (numeric: from 0 to 20)
- `G3`: final period grade (numeric: from 0 to 20)

First, we need to download the dataset to the Colab instance.

In [None]:
# download the dataset
!wget https://fully-connected-graph.github.io/datasets/student-performance/dataset.csv

If you are running this code on your local machine, uncomment and run this cell to install the necessary libraries.

In [None]:
# !pip install numpy pandas scikit-learn lightgbm

To verify that the data has been downloaded, click the folder icon in the left panel to view the files on the instance.

Now that we have our dataset, we can start working with Python to create our model. Let's import all the necessary functions and libraries.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as mse

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from lightgbm import LGBMRegressor
from sklearn.dummy import DummyRegressor

Saving and using the random state allows you to reproduce the results later.

In [None]:
RANDOM_STATE=17112022
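
To make the idea of a fixed seed concrete, here is a minimal sketch using only Python's standard `random` module rather than scikit-learn. The helper `shuffled_indices` is a hypothetical name introduced for this illustration; it is not part of the notebook.

```python
import random

SEED = 17112022  # same value as RANDOM_STATE in this notebook


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order fixed by `seed`."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first_run = shuffled_indices(10, SEED)
second_run = shuffled_indices(10, SEED)

print(first_run == second_run)  # True: same seed, same order
```

scikit-learn's `random_state` argument plays the same role: passing the same value makes shuffling and other random choices repeat from run to run.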

After executing the cell above, we can load the dataset and look at its first few rows.

In [None]:
data = pd.read_csv("dataset.csv")

data.head()

## Exploratory Data Analysis

Data acquired in the real world is often messy and incomplete. Before we create our model, it is a good idea to check if any column is missing data. Thankfully, pandas comes to our rescue and allows us to see a summary of the dataset.

In [None]:
data.info()

Each column has 649 non-null values out of 649, which means that no data is missing. Awesome!
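
For comparison, here is a minimal sketch of the same kind of check on a frame that does contain a gap. The `toy` frame and its values are made up for this example; they are not rows from `dataset.csv`.

```python
import pandas as pd

# Toy frame, invented for illustration, with one missing age.
toy = pd.DataFrame({
    "age": [15, None, 17],
    "school": ["GP", "MS", "GP"],
})

# isna() marks missing cells; summing counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running `data.isna().sum()` on the real dataset is a compact alternative to reading the `data.info()` summary.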

## Data preprocessing

Let's prepare the dataset for training the machine learning models. As we saw previously:
- We don't have to deal with missing values (the data is complete)
- Some data is categorical, and we will have to one-hot encode it
- We have to split the data into train and test sets

Let's begin by one-hot encoding the categorical data. We first need to specify the categorical features.

<details>
<summary style="font-size:1.5rem"> Hint
</summary>

Look at the description of the data at the top and infer what data is categorical.

<details>
<summary style="font-size:1.5rem"> Answer
</summary>

Here we included the features that are definitely categorical. Features such as `health`, `Walc`, `freetime`, etc. are ranked (ordinal). You can either add them to the list to treat them as categorical, or leave them out and treat them as numerical.

<pre>
categorical_feat = [
    "school",
    "sex",
    "address",
    "famsize",
    "Pstatus",
    "Mjob",
    "Fjob",
    "reason",
    "guardian",
    "schoolsup",
    "famsup",
    "paid",
    "activities",
    "nursery",
    "higher",
    "internet",
    "romantic"
]
</pre>

</details>

</details>

In [None]:
"""
Task 1. Identify which features are categorical and list their names as strings below.
"""

categorical_feat = [
  # write your code here
]

Now that we know the categorical features, we need to isolate them from the rest of the dataset and one-hot encode them. In pandas, one way to do it is by calling the [`pd.get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function.

<details>
<summary style="font-size:1.5rem"> Hint
</summary>

To select the categorical columns, use `data[categorical_feat]`.

To avoid redundancy, set the `drop_first` argument to `True`. That way you get one column fewer for each feature: if all of a feature's remaining columns are zero, the dropped category is the one that applies.


<details>
<summary style="font-size:1.5rem"> Hint 2
</summary>

<pre>
ohe_data = pd.get_dummies(
    data[categorical_feat],
    drop_first=True
)
</pre>

</details>

</details>

In [None]:
# one-hot encode the data (also known as creating dummy variables)

"""
Task 2. One-hot encode the categorical features. To select them, index the data variable with the list of names.
See the pd.get_dummies documentation on how to pass the data to it.
"""

ohe_data = pd.get_dummies(
    # write code here
)

ohe_data.head()
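
To see what `drop_first` changes, here is a minimal sketch on a made-up frame. The `toy` frame and its values are assumptions for illustration only; they are not rows from `dataset.csv`.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "sex": ["F", "M", "M", "F"],
    "Mjob": ["teacher", "health", "other", "teacher"],
})

# drop_first=True drops the alphabetically first category of each column:
# "F" for sex and "health" for Mjob.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))  # ['sex_M', 'Mjob_other', 'Mjob_teacher']
```

A row with `Mjob_other` and `Mjob_teacher` both zero (or `False`, depending on the pandas version) can only be the dropped `health` category, so no information is lost.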

Now that the categorical data is in a workable form, we can work towards recombining the dataset. The categorical data is already isolated, so we need to do the same for the numerical data. We'll use the [`df.drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method for that.

<details>
<summary style="font-size:1.5rem"> Hint
</summary>

Just pass the names of the categorical columns. The remaining columns will be the numerical ones.

<details>
<summary style="font-size:1.5rem"> Hint 2
</summary>

<pre>
num_data = data.drop(categorical_feat, axis=1)
</pre>

</details>

</details>

In [None]:
# get the numerical features

"""
Task 3. Separate the numerical data from the categorical data
"""

num_data = data.drop(
    # write code here, 
    axis=1
)

num_data.head()

Finally, we concatenate the OHE and numerical features back together.

In [None]:
prep_data = pd.concat([
    ohe_data,
    num_data
], axis=1)

prep_data.head()

Our dataset can now be used for machine learning. We can proceed to the next pre-processing step, which is to split the dataset into training and test sets.

First, we need to split the pre-processed data into the features and the target.

<details>
<summary>Hint</summary>
<pre>
features = prep_data.drop(
    "G3", 
    axis=1
)
target = prep_data["G3"]
</pre>
</details>


In [None]:
"""
Task 4. Separate the target feature (G3) from the rest of the data
"""

features = prep_data.drop(
    # your code here, 
    axis=1
)
target = prep_data[
    # your code here
]

Now, let's split the data into train and test sets in an 80:20 ratio.

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features,
    target,
    test_size = .2,
    random_state=RANDOM_STATE
)

To determine how the splitting took place, we can look at the shape of the features to get their dimensions.

In [None]:
print(features_train.shape)
print(features_test.shape)

There are 519 rows in the train set and 130 in the test set.
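
These counts can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the number of test rows up and gives the remainder to the train set; that rounding rule is stated here as an assumption, and it does reproduce the shapes printed above.

```python
import math

n_samples = 649  # rows reported by data.info()
test_size = 0.2  # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # 129.8 rounds up to 130
n_train = n_samples - n_test               # the remainder goes to training

print(n_train, n_test)  # 519 130
```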

With this step, we are done with pre-processing and can now work on training the model. 

## Model training

We will train 4 models on the train set and compare them using the [MSE](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error) metric, also computed on the train set. In addition, we will use something called cross validation (CV), but you don't have to worry about it for now. In the next section, we will test the best model on the test set.
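
As a reminder of what the metric measures, here is a minimal hand-written sketch of MSE. The function name `mse_by_hand` and the grades are hypothetical and exist only for this illustration; the notebook itself uses scikit-learn's implementation.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical grades, invented for illustration.
true_grades = [10, 12, 15, 8]
predicted_grades = [11, 12, 13, 10]

# Squared errors are 1, 0, 4 and 4, so the mean is 9 / 4.
print(mse_by_hand(true_grades, predicted_grades))  # 2.25
```

Because the errors are squared, a lower MSE is better and large mistakes are penalised more heavily than small ones.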

The four models we will examine are:
- Decision trees
- Random forest 
- Linear regression
- Gradient boosting

To make comparison between the models easier, we will create a dataframe to store the performance of each model.

In [None]:
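# results table plus a helper that records and displays each model's score
# (scikit-learn reports negated MSE, so add_result flips the sign back)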

results = pd.DataFrame([], columns=['Model', 'Best parameters', 'MSE'])
results.set_index('Model', inplace=True)

def add_result(model_name, best_params, neg_result):
    results.loc[model_name] = [best_params, -neg_result]
    display(results.loc[[model_name]])

results.head()


### Linear Regression

Linear regression differs from the other models: we do not need to tune any hyperparameters for it to work. So, we will just use cross validation to get its performance.

In [None]:
lr_model = LinearRegression()

lr_score = cross_val_score(
    lr_model,
    features_train,
    target_train,
    scoring="neg_mean_squared_error"
).mean()

add_result("Linear Regression", {}, lr_score)
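
The scoring name `"neg_mean_squared_error"` explains why `add_result` flips the sign: scikit-learn scorers follow a "higher is better" convention, so MSE is reported negated. The sketch below shows this on synthetic data; the arrays `X` and `y` are invented for illustration, and `cv=5` is written out explicitly on the assumption that it matches the default of recent scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, invented for illustration: y is linear in X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    cv=5,  # one score per fold
    scoring="neg_mean_squared_error",
)

print(fold_scores)          # five non-positive numbers
print(-fold_scores.mean())  # sign flipped back to an ordinary MSE
```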

The tree-based models have hyperparameters that control how they are trained. If not specified, they use default values, but we can search for the best ones using [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

The best parameters for a model are unique to each dataset. Grid search takes a list of candidate values for each parameter and evaluates the model on every combination to find the best one.
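
To get a feel for how quickly a grid grows, the following sketch enumerates the combinations by hand with `itertools.product`. It reuses the decision tree grid from this notebook; the fold count of 5 is an assumption about the scikit-learn default.

```python
from itertools import product

# Same grid as the decision tree cell in this notebook.
dt_params = {
    "max_depth": [3, 5, 8],
    "max_features": [3, 5, 8],
}

names = list(dt_params)
combinations = [
    dict(zip(names, values))
    for values in product(*(dt_params[name] for name in names))
]

n_folds = 5  # assumption: default number of cross-validation folds
print(len(combinations))            # 9
print(len(combinations) * n_folds)  # 45 model fits during the search
```

Adding a third parameter with four values, as in the random forest grid, multiplies the number of combinations by four, so larger grids get slow quickly.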

### Decision tree
We need to pass in the parameters for the maximum depth and features.

In [None]:
dt_params = {
    "max_depth": [3, 5, 8],
    "max_features": [3, 5, 8]
}

dt_gs = GridSearchCV(
    DecisionTreeRegressor(random_state=RANDOM_STATE),
    dt_params,
    scoring="neg_mean_squared_error"
)

dt_gs.fit(
    features_train,
    target_train
)

add_result("Decision Tree", dt_gs.best_params_, dt_gs.best_score_)

### Random forest
We need to pass in parameters for the maximum depth and features along with the number of estimators to be used.


<details>
<summary>
Hint
</summary>
Look at the imports at the beginning to see what the random forest model is called.
<details>
<summary>
Hint 2
</summary>

<pre>
rf_gs = GridSearchCV(
    RandomForestRegressor(random_state=RANDOM_STATE),
    rf_params,
    scoring="neg_mean_squared_error"
)
</pre>
</details>

</details>

In [None]:
rf_params = {
    "max_depth": [3, 5, 8],
    "max_features": [3, 5, 8],
    "n_estimators": [50, 100, 150, 200]
}

"""
Task 5. Write the correct model into GridSearch, don't forget about the random state :)
"""

rf_gs = GridSearchCV(
    # your code here,
    rf_params,
    scoring="neg_mean_squared_error"
)

rf_gs.fit(
    features_train,
    target_train
)

add_result("Random Forest", rf_gs.best_params_, rf_gs.best_score_)

### Gradient Boosting
We need to pass in the learning rate, number of estimators and the maximum depth of the tree.


<details>
<summary>
Hint
</summary>
Repeat the same thing we did previously
<details>
<summary>
Hint 2
</summary>

<pre>
gb_gs = GridSearchCV(
    LGBMRegressor(),
    gb_params,
    scoring="neg_mean_squared_error"
)

gb_gs.fit(
    features_train,
    target_train
)
</pre>
</details>

</details>

In [None]:
"""
Task 6. Define the GridSearch and fit it for gradient boosting yourself. 
        The search params for grid search are given. Look at the imports
        for the name of the gradient boosting model.
"""

gb_params = {
    "learning_rate": [0.1, 0.01, 0.001],
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [3, 5, 8]
}

# write code here

add_result("Gradient Boosting", gb_gs.best_params_, gb_gs.best_score_)

Congratulations! You (may) have created your first set of ML models. Unfortunately, not all models are created equal and some will do better than others on a particular task.

## Model Selection

Let's compare the trained models and select the best one.

<details>
<summary>
Hint
</summary>
The metric is MSE.
<details>
<summary>
Hint 2
</summary>

<pre>
results.sort_values(
    by = "MSE",
    ascending=True
)
</pre>
</details>

</details>

In [None]:
"""
Task 7. Display the results table sorted by the metric 
"""

results.sort_values(
    by = # write code here, 
    ascending=True
)

As we can see, the __________ model has the lowest MSE on cross validation, which means the best performance. Therefore, we will proceed with this model.

## Testing
Now that we have our model, we can test it using the testing data that we set aside earlier.

First, let's recreate the model.


<details>
<summary>
Hint
</summary>
You can get the best model from the grid search by doing <pre>gs.best_estimator_</pre> But the best model (technically) should be the linear regression, so just define it.
<details>
<summary>
Hint 2
</summary>
Ideally it would be
<pre>best_model = gs.best_estimator_</pre>

but we didn't do grid search for linear regression, so just do
<pre>best_model = LinearRegression()</pre>
</details>

</details>

In [None]:
"""
Task 8. Select the best model
"""

best_model = # write code here

best_model.fit(
    features_train,
    target_train
)
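
If the winner had come from a grid search instead, the fitted model could be taken from the search object directly. The sketch below shows this on synthetic data; `toy_gs`, `X` and `y` are hypothetical names and values invented for illustration, not objects from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=60)

toy_gs = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [1, 3]},
    scoring="neg_mean_squared_error",
)
toy_gs.fit(X, y)

# With the default refit=True, the winning configuration is refitted on all
# the data passed to fit() and exposed as best_estimator_.
toy_best = toy_gs.best_estimator_
toy_preds = toy_best.predict(X[:5])

print(toy_gs.best_params_)
```

Since `best_estimator_` is already fitted, calling `fit` on it again is not needed before predicting.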

Next, we can begin testing.

<details>
<summary>
Hint
</summary>
Pass in the features of the test set.
<details>
<summary>
Hint 2
</summary>
<pre>features_test</pre>
</details>

</details>

In [None]:
"""
Task 9. Make predictions on the test features
"""

preds_test = best_model.predict(
    # write code here
)

In order to determine the performance of the model on the test set, we can observe the mean squared error.

In [None]:
mse(target_test, preds_test)

To check whether the model did actually learn anything, we will compare it to a dummy model which returns a constant prediction (median of the target value).

In [None]:
dummy_model = DummyRegressor(
    strategy="median"
)

dummy_model.fit(
    features_train,
    target_train
)

dummy_preds_test = dummy_model.predict(
    features_test
)

mse(target_test, dummy_preds_test)

Our model performs better than the dummy model (its MSE is much smaller), so the sanity check passed! All good :)
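
What the median baseline does can be reproduced by hand. This is a minimal sketch with made-up grades; `mse_by_hand` is a hypothetical helper written for this example, not the scikit-learn function used above.

```python
from statistics import median


def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical grades, invented for illustration.
train_grades = [10, 11, 12, 13, 18]
test_grades = [9, 12, 15]

# The baseline is learned from the training targets only; features are ignored.
baseline = median(train_grades)
baseline_preds = [baseline] * len(test_grades)

# Squared errors are 9, 0 and 9, so the mean is 18 / 3.
print(mse_by_hand(test_grades, baseline_preds))  # 6.0
```

Any real model should beat this number on the same test set; if it does not, the features are not helping.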

## [Bonus] Interpreting the model

Let's find the most important features.

<details>
<summary>
Hint
</summary>

Try looking at the `feature_importances_` attribute of the best model
- [Example for forests](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)
- For a linear model you can look at `coef_` attribute instead of `feature_importances_` ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html))


</details>
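
As a starting point, here is a minimal sketch of ranking features with both attributes on synthetic data. The feature names, data and model settings are invented for illustration; on the real data you would use `features_train.columns` and the fitted `best_model`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: the first feature drives the target, the last is pure noise.
rng = np.random.default_rng(0)
feature_names = ["strong", "weak", "noise"]  # hypothetical names
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Linear model: rank by absolute coefficient.
linear_ranking = sorted(
    zip(feature_names, np.abs(linear.coef_)),
    key=lambda pair: pair[1],
    reverse=True,
)

# Forest: rank by impurity-based importance (the values sum to 1).
forest_ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

# The "strong" feature should top both rankings.
print(linear_ranking[0][0], forest_ranking[0][0])
```

Coefficient magnitudes are only comparable when the features are on similar scales, so for the student data it may help to standardise the numerical columns before reading too much into `coef_`.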

## Conclusion

_Write a conclusion: which models did you try? Which model performed best? What are its parameters? What is the best score you got?_
