# Predict student performance in secondary education (high school).
From: https://archive-beta.ics.uci.edu/ml/datasets/student+performance

## Main operations carried out:
1. Cleaned the data
2. Dummy encoded the categorical features
3. Simple Linear Regression: (mae, mse, r2) = (1.047997908821082, 3.2033972398230905, 0.7473179461540751). Note that this r2 was computed as r2(preds, y_test), the reverse of sklearn's (y_true, y_pred) order; the conventionally ordered value, 0.792805, appears as r2_test for LIN in the results table further down.
4. Compared six regression models (Linear, Ridge, Lasso, Random Forest and Decision Tree from sklearn, plus XGBoost) with default parameters. Lasso had the lowest MAE (0.89) with MSE 3.07, both better than Simple Linear. Random Forest Regressor was a very close runner-up with MAE 0.97 and the lowest MSE, 2.99.
5. Used GridSearchCV on every model except plain Linear Regression to find the best parameters, then predicted and checked the scores. Random Forest and Lasso turned out to be the top ones.
6. Finally, dedicated the rest of this notebook to XGBoost hyperparameter tuning, done manually in stages. The results were not very encouraging: only the MSE score improved.

In [1]:
# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
# 1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
# 2 sex - student's sex (binary: "F" - female or "M" - male)
# 3 age - student's age (numeric: from 15 to 22)
# 4 address - student's home address type (binary: "U" - urban or "R" - rural)
# 5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
# 6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
# 7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
# 8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
# 9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
# 10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
# 11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
# 12 guardian - student's guardian (nominal: "mother", "father" or "other")
# 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
# 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
# 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
# 16 schoolsup - extra educational support (binary: yes or no)
# 17 famsup - family educational support (binary: yes or no)
# 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
# 19 activities - extra-curricular activities (binary: yes or no)
# 20 nursery - attended nursery school (binary: yes or no)
# 21 higher - wants to take higher education (binary: yes or no)
# 22 internet - Internet access at home (binary: yes or no)
# 23 romantic - with a romantic relationship (binary: yes or no)
# 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
# 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
# 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
# 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
# 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
# 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
# 30 absences - number of school absences (numeric: from 0 to 93)

# these grades are related with the course subject, Math or Portuguese:
# 31 G1 - first period grade (numeric: from 0 to 20)
# 32 G2 - second period grade (numeric: from 0 to 20)
# 33 G3 - final grade (numeric: from 0 to 20, output target)

In [2]:
# Setting the path for the two school datasets
path_mat = '/home/sandeep/Development/Datasets/Education/student/student-mat.csv'
path_por = '/home/sandeep/Development/Datasets/Education/student/student-por.csv'

In [3]:
# Loading the initial libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
# Reading the csv files into dataframes
df_mat = pd.read_csv(path_mat, sep=';')
df_por = pd.read_csv(path_por, sep=';')

In [5]:
# See all columns of dataframe in the notebook
pd.set_option('display.max_columns', None)

In [6]:
df_mat.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


In [7]:
df_por.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,fatherd,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,0,yes,no,no,no,yes,yes,yes,no,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,no,yes,yes,yes,yes,yes,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,no,no,yes,yes,no,no,4,3,2,1,2,5,0,11,13,13


In [8]:
# Check if columns of both datasets match
sum(list(df_mat.columns != df_por.columns))

1

In [9]:
# Find the column which is different
i = list(df_mat.columns == df_por.columns).index(False)
df_mat.columns[i], df_por.columns[i], i

('paid', 'fatherd', 17)
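
The positional check above reports only the first mismatch and assumes both frames have the same number of columns. A minimal sketch of a more general comparison follows, using plain lists as stand-ins for `df_mat.columns` and `df_por.columns`. The shortened column lists and the helper name `column_mismatches` are illustrative assumptions, not part of the notebook.

```python
def column_mismatches(cols_a, cols_b):
    """Return (position, name_a, name_b) for every position where the names differ."""
    if len(cols_a) != len(cols_b):
        raise ValueError("column lists differ in length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(cols_a, cols_b)) if a != b]


# Shortened, illustrative column lists (the real frames have 33 columns)
cols_mat = ["school", "famsup", "paid", "activities"]
cols_por = ["school", "famsup", "fatherd", "activities"]

mismatches = column_mismatches(cols_mat, cols_por)
print(mismatches)
```

With the real frames, the equivalent call would be `column_mismatches(list(df_mat.columns), list(df_por.columns))`.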

In [10]:
# Fix the column name that was a typo
df_por.rename(columns={'fatherd':'paid'}, inplace=True)
df_por.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,0,yes,no,no,no,yes,yes,yes,no,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,no,yes,yes,yes,yes,yes,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,no,no,yes,yes,no,no,4,3,2,1,2,5,0,11,13,13


In [11]:
sum(list(df_mat.columns != df_por.columns))

0

In [12]:
df_mat.shape, df_por.shape

((395, 33), (649, 33))

In [13]:
# Concatenate both dataframes into a single dataframe
df = pd.concat([df_mat, df_por], ignore_index=True)
df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039,MS,F,19,R,GT3,T,2,3,services,other,course,mother,1,3,1,no,no,no,yes,no,yes,yes,no,5,4,2,1,2,5,4,10,11,10
1040,MS,F,18,U,LE3,T,3,1,teacher,services,course,mother,1,2,0,no,yes,no,no,yes,yes,yes,no,4,3,4,1,1,1,4,15,15,16
1041,MS,F,18,U,GT3,T,1,1,other,other,course,mother,2,2,0,no,no,no,yes,yes,yes,no,no,1,1,1,1,1,5,6,11,12,9
1042,MS,M,17,U,LE3,T,3,1,services,services,course,mother,2,1,0,no,no,no,no,no,yes,yes,no,2,4,5,3,4,2,6,10,10,10


In [14]:
# Check for nan values
df.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

In [15]:
# Select all columns
columns_all = df.columns
columns_all

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

In [16]:
df.head(1)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6


In [17]:
# Check dtype of all columns
df.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

In [18]:
# Select the categorical columns
columns_cat = df.select_dtypes(include=['object']).columns
columns_cat

Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
       'nursery', 'higher', 'internet', 'romantic'],
      dtype='object')

In [19]:
# Select the numerical columns
columns_num = [col for col in columns_all if col not in columns_cat]
columns_num

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'famrel',
 'freetime',
 'goout',
 'Dalc',
 'Walc',
 'health',
 'absences',
 'G1',
 'G2',
 'G3']

In [20]:
# Encode the categorical columns using dummies
df_encoded = pd.get_dummies(df, columns=columns_cat, drop_first=True)
df_encoded

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,school_MS,sex_M,address_U,famsize_LE3,Pstatus_T,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_home,reason_other,reason_reputation,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,1,3,6,5,6,6,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0
1,17,1,1,1,2,0,5,3,3,1,1,3,4,5,5,6,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0
2,15,1,1,1,2,3,4,3,2,2,3,3,10,7,8,10,0,0,1,1,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,1,0
3,15,4,2,1,3,0,3,2,2,1,1,5,2,15,14,15,0,0,1,0,1,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,1,1,1,1,1,1
4,16,3,3,1,2,0,4,3,2,1,2,5,4,6,10,10,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,1,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039,19,2,3,1,3,1,5,4,2,1,2,5,4,10,11,10,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0
1040,18,3,1,1,2,0,4,3,4,1,1,1,4,15,15,16,1,0,1,1,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,1,1,1,0
1041,18,1,1,2,2,0,1,1,1,1,1,5,6,11,12,9,1,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0
1042,17,3,1,2,1,0,2,4,5,3,4,2,6,10,10,10,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0


In [21]:
# Select the features column for the model
X = df_encoded.drop('G3', axis=1)
X.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,school_MS,sex_M,address_U,famsize_LE3,Pstatus_T,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_home,reason_other,reason_reputation,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,1,3,6,5,6,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0
1,17,1,1,1,2,0,5,3,3,1,1,3,4,5,5,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0
2,15,1,1,1,2,3,4,3,2,2,3,3,10,7,8,0,0,1,1,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,1,0
3,15,4,2,1,3,0,3,2,2,1,1,5,2,15,14,0,0,1,0,1,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,1,1,1,1,1,1
4,16,3,3,1,2,0,4,3,2,1,2,5,4,6,10,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,1,0,1,1,0,0


In [22]:
# Select the target column for the model
y = df_encoded['G3']
y.head()

0     6
1     6
2    10
3    15
4    10
Name: G3, dtype: int64

In [23]:
# Split the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
# Select the Linear Regression model
from sklearn.linear_model import LinearRegression

In [25]:
# Instantiate the model
model = LinearRegression()

In [26]:
# Train the model using training data
model.fit(X_train, y_train)

LinearRegression()

In [27]:
model.get_params()

{'copy_X': True,
 'fit_intercept': True,
 'n_jobs': None,
 'normalize': 'deprecated',
 'positive': False}

In [28]:
# Predict the outcome of test data
preds = model.predict(X_test)

In [29]:
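# Check the performance metrics for linear regression
# Note: sklearn metrics expect (y_true, y_pred). MAE and MSE are symmetric in their arguments,
# but r2 is not, so the r2 printed below (arguments reversed) differs from r2(y_test, preds).
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, r2_score as r2
mae(preds, y_test), mse(preds, y_test), r2(preds, y_test)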

(1.047997908821082, 3.2033972398230905, 0.7473179461540751)

In [30]:
# Compare sample results with the test labels
y_test[:5], preds[:5]

(971    11
 280     8
 536    13
 824    11
 644    12
 Name: G3, dtype: int64,
 array([10.49273145,  8.13112871, 14.25245916, 11.00068375, 12.06797902]))
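
scikit-learn's metric functions take their arguments as `(y_true, y_pred)`. MAE and MSE give the same value either way, but R2 does not, because it normalizes by the variance of whichever array is passed first. The sketch below illustrates this with a hand-written R2 applied to the five sample labels and predictions printed above. The helper `r2_manual` is an illustrative assumption, not sklearn code.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken around mean(y_true)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# First five test labels and predictions, copied from the output above
y_sample = [11, 8, 13, 11, 12]
preds_sample = [10.49273145, 8.13112871, 14.25245916, 11.00068375, 12.06797902]

r2_conventional = r2_manual(y_sample, preds_sample)  # (y_true, y_pred)
r2_swapped = r2_manual(preds_sample, y_sample)       # arguments reversed
print(r2_conventional, r2_swapped)
```

The two values differ, which is why `r2(preds, y_test)` in cell 29 does not match the `r2_test` value reported for `LIN` in the grid-search results table.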

# Now we will use all models in turn for comparison

In [33]:
# Build all regression models for comparison

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

models = []
models.append(('LIN', LinearRegression()))
models.append(('RID', Ridge()))
models.append(('LAS', Lasso()))
models.append(('RFR', RandomForestRegressor()))
models.append(('DT', DecisionTreeRegressor()))
models.append(('XGB', XGBRegressor()))

# Evaluate each model in turn

results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='neg_mean_absolute_error')
    results.append(cv_results)
    names.append(name)
    print(f"{name}: {cv_results.mean():.2f} {cv_results.std():.2f}")

LIN: -1.00 0.06
RID: -1.00 0.06
LAS: -0.87 0.04
RFR: -0.92 0.07
DT: -1.23 0.08
XGB: -0.99 0.05


With default parameters, Lasso regression gives the lowest cross-validated mean absolute error (0.87) and also the smallest standard deviation across folds (0.04).
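
Because `cross_val_score` with `scoring='neg_mean_absolute_error'` returns negated errors, the score closest to zero is the best. A small sketch that ranks the models from the printed output above; the dictionary simply transcribes those means and standard deviations.

```python
# (mean negated MAE, standard deviation) per model, from the cross-validation output
cv_scores = {
    "LIN": (-1.00, 0.06),
    "RID": (-1.00, 0.06),
    "LAS": (-0.87, 0.04),
    "RFR": (-0.92, 0.07),
    "DT": (-1.23, 0.08),
    "XGB": (-0.99, 0.05),
}

# Flip the sign to get MAE, then sort ascending (lower error is better)
ranking = sorted(cv_scores, key=lambda name: -cv_scores[name][0])
for name in ranking:
    mean_score, std = cv_scores[name]
    print(f"{name}: MAE {-mean_score:.2f} +/- {std:.2f}")
```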

In [34]:
# Let's compare the errors on the test data

final_results = []
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse
for name, model in models:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}, mae {mae(preds, y_test):.2f}, mse {mse(preds, y_test):.2f}")
    final_results.append({'name':name, 'mae':mae(preds, y_test), 'mse':mse(preds, y_test)})

LIN, mae 1.05, mse 3.20
RID, mae 1.05, mse 3.20
LAS, mae 0.89, mse 3.07
RFR, mae 0.97, mse 2.99
DT, mae 1.20, mse 4.89
XGB, mae 1.02, mse 3.10


As expected from the cross-validation stage, Lasso gave the lowest MAE on the test data as well. Random Forest Regressor is a strong competitor, though: it has the lowest MSE (2.99 against 3.07 for Lasso).

# Now let's use all models with parameters using Grid search

In [24]:
# Build all regression models for comparison

from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score as r2, mean_absolute_error as mae, mean_squared_error as mse
import numpy as np
import datetime as dt

In [35]:
# Define the six models for regression

models = {
'LIN': LinearRegression(),
'RID': Ridge(),
'LAS': Lasso(),
'RFR': RandomForestRegressor(),
'DT': DecisionTreeRegressor(),
'XGB': XGBRegressor()
}

In [42]:
# Define the Grid Search function

def grid(models, X_train, y_train):
    # Declare the parameters to iterate upon
    params = {
    'RID' : { 'alpha' : [1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 50, 55, 100, 200, 300]},
    'LAS': { 'alpha' : [0.02, 0.025, 0.03, 0.035, 0.05, 0.08, 0.1, 0.15, 0.2, 0.3, 0.5, 0.8, 1, 2, 5, 10, 20]},
    'RFR': { 'n_estimators' : [50, 100, 200, 300, 400], 'max_depth': [3, 4, 5, 6, 7]},
    'DT': { 'max_leaf_nodes' : [100, 200, 300, 400]},
    'XGB': {"learning_rate"    : [0.10, 0.20, 0.3] ,
             "max_depth"        : [ 3, 5, 7],
             "min_child_weight" : [ 1, 2, 3],
             "gamma"            : [ 0.0, 0.5, 1]
    }}
    
    grid_Model = {}
    # Grid Search operation on every model
    for i in models:
        if i=='LIN':
            grid_Model[i] = models[i]
        else:
            t1=dt.datetime.now()
            print(f"Starting Grid Search of model: {i} parameters")
            grid_Model[i] = GridSearchCV(models[i], param_grid=params[i], 
                            scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train).best_estimator_
            t2=dt.datetime.now()
            print(f"Completed Grid Search of model: {i} parameters in time: {t2-t1}")
            
    return grid_Model

In [43]:
# Run the Grid Search function

t3=dt.datetime.now()
print("Starting the Grid Search process")
best_Models = grid(models, X_train, y_train)
t4=dt.datetime.now()
print(f"Grid search for all models is complete in total time: {t4-t3}")

Starting the Grid Search process
Starting Grid Search of model: RID parameters
Fitting 5 folds for each of 14 candidates, totalling 70 fits
Completed Grid Search of model: RID parameters in time: 0:00:00.750103
Starting Grid Search of model: LAS parameters
Fitting 5 folds for each of 17 candidates, totalling 85 fits
Completed Grid Search of model: LAS parameters in time: 0:00:01.072348
Starting Grid Search of model: RFR parameters
Fitting 5 folds for each of 25 candidates, totalling 125 fits
Completed Grid Search of model: RFR parameters in time: 0:01:46.095805
Starting Grid Search of model: DT parameters
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Completed Grid Search of model: DT parameters in time: 0:00:00.331860
Starting Grid Search of model: XGB parameters
Fitting 5 folds for each of 81 candidates, totalling 405 fits
Completed Grid Search of model: XGB parameters in time: 0:02:15.515905
Grid search for all models is complete in total time: 0:04:03.770645
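
The `grid` function keeps only `best_estimator_`, so the winning parameter values are not displayed. A minimal sketch of reading them from a fitted `GridSearchCV` follows, run on synthetic data with a deliberately small Lasso grid. The data, the grid, and the variable names are assumptions for illustration, not the notebook's.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: two informative features out of five
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.01, 0.1, 1.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]  # parameter value chosen by the search
best_mae = -search.best_score_             # flip the sign to get the mean CV MAE
print(best_alpha, best_mae)
```

For the models returned by `grid`, calling `get_params()` on an entry such as `best_Models['LAS']` would show the values that were selected.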


In [44]:
# Function to conduct the training, prediction, and scoring

def test(models, X_train, X_test, y_train, y_test, iterations = 1):
    results = {}
    for i in models:
        print(f"Starting model {i}")
        t1 = dt.datetime.now()
        r2_train, r2_test, mae_train, mae_test, mse_train, mse_test = [], [], [], [], [], []
        for j in range(iterations):
            print(f"Starting iteration {j+1} of model {i}")
            # Fit once, then predict on both splits with the same fitted model
            models[i].fit(X_train, y_train)
            preds_train = models[i].predict(X_train)
            preds_test = models[i].predict(X_test)
            r2_train.append(r2(y_train, preds_train))
            r2_test.append(r2(y_test, preds_test))
            mae_train.append(mae(y_train, preds_train))
            mae_test.append(mae(y_test, preds_test))
            mse_train.append(mse(y_train, preds_train))
            mse_test.append(mse(y_test, preds_test))
            print(f"Completed iteration {j+1} of model {i}")
            
        results[i] = [np.mean(r2_train), np.mean(r2_test), np.mean(mae_train), np.mean(mae_test), 
                      np.mean(mse_train), np.mean(mse_test)]
        t2 = dt.datetime.now()
        print(f"Completed scoring model: {i} in time: {t2-t1}")
        
    index = ['r2_train', 'r2_test', 'mae_train', 'mae_test', 'mse_train', 'mse_test']
    return pd.DataFrame(results, index=index)

In [45]:
# Run the prediction tests
t3=dt.datetime.now()
print("Starting the training, prediction and scoring process")
result = test(best_Models, X_train, X_test, y_train, y_test)
t4=dt.datetime.now()
print(f"Training, prediction and scoring for all models is complete in total time: {t4-t3}")

Starting the training, prediction and scoring process
Starting model LIN
Starting iteration 1 of model LIN
Completed iteration 1 of model LIN
Completed scoring model: LIN in time: 0:00:00.037056
Starting model RID
Starting iteration 1 of model RID
Completed iteration 1 of model RID
Completed scoring model: RID in time: 0:00:00.035135
Starting model LAS
Starting iteration 1 of model LAS
Completed iteration 1 of model LAS
Completed scoring model: LAS in time: 0:00:00.031636
Starting model RFR
Starting iteration 1 of model RFR
Completed iteration 1 of model RFR
Completed scoring model: RFR in time: 0:00:00.615580
Starting model DT
Starting iteration 1 of model DT
Completed iteration 1 of model DT
Completed scoring model: DT in time: 0:00:00.040522
Starting model XGB
Starting iteration 1 of model XGB
Completed iteration 1 of model XGB
Completed scoring model: XGB in time: 0:00:00.933559
Training, prediction and scoring for all models is complete in total time: 0:00:01.699809


In [46]:
# Print the results

result

Unnamed: 0,LIN,RID,LAS,RFR,DT,XGB
r2_train,0.850318,0.844957,0.830735,0.915006,0.978808,0.984552
r2_test,0.792805,0.801172,0.801565,0.820758,0.705457,0.814933
mae_train,0.929532,0.915336,0.860948,0.723076,0.420201,0.369237
mae_test,1.047998,0.986693,0.887243,0.924003,1.075832,0.945806
mse_train,2.209638,2.288785,2.498727,1.2547,0.312843,0.228046
mse_test,3.203397,3.074048,3.067965,2.77123,4.553877,2.86129


In [41]:
# With cv=3
# result_cv3 = result.copy()
result_cv3

Unnamed: 0,LIN,RID,LAS,RFR,DT,XGB
r2_train,0.850318,0.844957,0.830735,0.885376,0.978938,0.92408
r2_test,0.792805,0.801172,0.801565,0.83032,0.724194,0.828537
mae_train,0.929532,0.915336,0.860948,0.799063,0.420474,0.678326
mae_test,1.047998,0.986693,0.887243,0.881153,1.059914,0.928093
mse_train,2.209638,2.288785,2.498727,1.692109,0.310916,1.12075
mse_test,3.203397,3.074048,3.067965,2.623399,4.264179,2.650951


In [47]:
# With cv=5
result_cv5 = result.copy()
result_cv5

Unnamed: 0,LIN,RID,LAS,RFR,DT,XGB
r2_train,0.850318,0.844957,0.830735,0.915006,0.978808,0.984552
r2_test,0.792805,0.801172,0.801565,0.820758,0.705457,0.814933
mae_train,0.929532,0.915336,0.860948,0.723076,0.420201,0.369237
mae_test,1.047998,0.986693,0.887243,0.924003,1.075832,0.945806
mse_train,2.209638,2.288785,2.498727,1.2547,0.312843,0.228046
mse_test,3.203397,3.074048,3.067965,2.77123,4.553877,2.86129


## We can therefore conclude that Random Forest wins on all test metrics with cv=3. In another trial with cv=5, Lasso gives the best MAE, while Random Forest wins on R2 and MSE. XGBoost's performance is very close to these two.
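
The per-metric winners can also be read off programmatically. A sketch using the cv=5 test metrics transcribed from the table above (higher is better for R2, lower is better for MAE and MSE); the dictionary layout is just one convenient way to hold the numbers.

```python
test_metrics_cv5 = {
    "r2_test": {"LIN": 0.792805, "RID": 0.801172, "LAS": 0.801565,
                "RFR": 0.820758, "DT": 0.705457, "XGB": 0.814933},
    "mae_test": {"LIN": 1.047998, "RID": 0.986693, "LAS": 0.887243,
                 "RFR": 0.924003, "DT": 1.075832, "XGB": 0.945806},
    "mse_test": {"LIN": 3.203397, "RID": 3.074048, "LAS": 3.067965,
                 "RFR": 2.77123, "DT": 4.553877, "XGB": 2.86129},
}
higher_is_better = {"r2_test": True, "mae_test": False, "mse_test": False}

best = {}
for metric, scores in test_metrics_cv5.items():
    pick = max if higher_is_better[metric] else min
    best[metric] = pick(scores, key=scores.get)
print(best)
```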

# Still not giving up hope on XGBoost: it was a very close runner-up in the previous round of hyperparameter tuning. Now we will tune XGB in stages

In [25]:
params = {'learning_rate':[0.1],
          'n_estimators':[100, 400, 600],
          'max_depth':[5],
         'min_child_weight':[1],
         'gamma':[0],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
         }
model_search0 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search0.best_score_, model_search0.best_params_

Fitting 5 folds for each of 3 candidates, totalling 15 fits


(-0.9310380259584523,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.1,
  'max_depth': 5,
  'min_child_weight': 1,
  'n_estimators': 100,
  'subsample': 0.8})

In [26]:
# Let's change the learning rate and reduce n_estimators
params = {'learning_rate':[0.05, 0.1, 0.15],
          'n_estimators':[50,100, 200],
          'max_depth':[5],
         'min_child_weight':[1],
         'gamma':[0],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
         }
model_search1 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search1.best_score_, model_search1.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


(-0.8973042003620796,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 5,
  'min_child_weight': 1,
  'n_estimators': 100,
  'subsample': 0.8})

In [27]:
# Now we will fix n_estimators at 100, and reduce learning rate further
params = {'learning_rate':[0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
          'n_estimators':[100],
          'max_depth':[5],
         'min_child_weight':[1],
         'gamma':[0],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
         }
model_search2 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search2.best_score_, model_search2.best_params_

Fitting 5 folds for each of 6 candidates, totalling 30 fits


(-0.8973042003620796,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 5,
  'min_child_weight': 1,
  'n_estimators': 100,
  'subsample': 0.8})

In [28]:
# Alright, so we fix learning rate at 0.05, n_estimators at 100 and start changing the others
params = {'learning_rate':[0.05],
          'n_estimators':[100],
          'max_depth':[4, 5, 6],
         'min_child_weight':[0.5, 1, 1.5],
         'gamma':[0],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
         }
model_search3 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search3.best_score_, model_search3.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


(-0.8824756787900261,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 4,
  'min_child_weight': 1.5,
  'n_estimators': 100,
  'subsample': 0.8})

In [29]:
# Reduce max depth further, increase min child weight further
params = {'learning_rate':[0.05],
          'n_estimators':[100],
          'max_depth':[2,3,4],
         'min_child_weight':[1.5, 2,3],
         'gamma':[0],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
         }
model_search4 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search4.best_score_, model_search4.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


(-0.8654205156568281,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 3,
  'min_child_weight': 3,
  'n_estimators': 100,
  'subsample': 0.8})

In [30]:
# Fix max depth at 3, increase min child weight
params = {'learning_rate':[0.05],
          'n_estimators':[100],
          'max_depth':[3],
         'min_child_weight':[3, 4, 5, 6],
         'gamma':[0],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
         }
model_search5 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search5.best_score_, model_search5.best_params_

Fitting 5 folds for each of 4 candidates, totalling 20 fits


(-0.8643913078718557,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 3,
  'min_child_weight': 6,
  'n_estimators': 100,
  'subsample': 0.8})

In [31]:
# Very small improvement in the score, further increase the min child weight, let's also change gamma
params = {'learning_rate':[0.05],
          'n_estimators':[100],
          'max_depth':[3],
         'min_child_weight':[6,7,8,9,10],
         'gamma':[0,1,2,3],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
         }
model_search6 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search6.best_score_, model_search6.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits


(-0.8587671503870787,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 3,
  'min_child_weight': 9,
  'n_estimators': 100,
  'subsample': 0.8})

In [33]:
# Cool! so we fix min child weight at 9, use smaller gamma numbers, and also change subsample
params = {'learning_rate':[0.05],
          'n_estimators':[100],
          'max_depth':[3],
         'min_child_weight':[9],
         'gamma':[0,0.1,0.2],
         'subsample':[0.1,0.5,0.8,0.9,1],
         'colsample_bytree':[0.8],
         }
model_search7 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search7.best_score_, model_search7.best_params_

Fitting 5 folds for each of 15 candidates, totalling 75 fits


(-0.8587671503870787,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 3,
  'min_child_weight': 9,
  'n_estimators': 100,
  'subsample': 0.8})

In [35]:
# Subsample is fixed at 0.8; now try finer gamma values and vary colsample_bytree
params = {'learning_rate':[0.05],
          'n_estimators':[100],
          'max_depth':[3],
         'min_child_weight':[9],
         'gamma':[0,0.01,0.02],
         'subsample':[0.8],
         'colsample_bytree':[0.7,0.8,0.9],
         }
model_search7 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search7.best_score_, model_search7.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


(-0.8587671503870787,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 3,
  'min_child_weight': 9,
  'n_estimators': 100,
  'subsample': 0.8})

In [36]:
# looks like all parameters are tuned! Let's try n_estimators once again, and change reg_alpha
params = {'learning_rate':[0.05],
          'n_estimators':[75,100,125,150],
          'max_depth':[3],
         'min_child_weight':[9],
         'gamma':[0],
         'subsample':[0.8],
         'colsample_bytree':[0.8],
          'reg_alpha':[0, 0.01, 0.02, 0.03]
         }
model_search8 = GridSearchCV(XGBRegressor(), param_grid=params, scoring='neg_mean_absolute_error', 
                            cv=5, verbose=1).fit(X_train, y_train)
model_search8.best_score_, model_search8.best_params_

Fitting 5 folds for each of 16 candidates, totalling 80 fits


(-0.8587671503870787,
 {'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.05,
  'max_depth': 3,
  'min_child_weight': 9,
  'n_estimators': 100,
  'reg_alpha': 0,
  'subsample': 0.8})
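
To see how much each stage contributed, the best cross-validated scores from the successive searches can be compared directly. A sketch using the `best_score_` values printed above, rounded to four decimals; the stage labels are simply the cell numbers.

```python
stage_scores = [
    ("In [25]", -0.9310),
    ("In [26]", -0.8973),
    ("In [27]", -0.8973),
    ("In [28]", -0.8825),
    ("In [29]", -0.8654),
    ("In [30]", -0.8644),
    ("In [31]", -0.8588),
    ("In [33]", -0.8588),
    ("In [35]", -0.8588),
    ("In [36]", -0.8588),
]

previous_mae = None
for label, score in stage_scores:
    mae_cv = -score  # scores are negated MAE, so flip the sign
    gain = 0.0 if previous_mae is None else previous_mae - mae_cv
    print(f"{label}: CV MAE {mae_cv:.4f} (improvement {gain:.4f})")
    previous_mae = mae_cv

# Overall reduction in cross-validated MAE from the first to the last stage
total_gain = -stage_scores[0][1] - (-stage_scores[-1][1])
print(f"Total improvement: {total_gain:.4f}")
```

The gain comes from lowering the learning rate and then constraining tree size (`max_depth`, `min_child_weight`); the searches over `gamma`, `subsample`, `colsample_bytree` and `reg_alpha` leave the score unchanged.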

In [37]:
# Now let's use the final model to train, predict, and score
preds = model_search8.best_estimator_.fit(X_train, y_train).predict(X_test)

In [38]:
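# Note: arguments are passed as (preds, y_test); r2 is not symmetric, so this r2 is not comparable with r2_test in the tables above
r2(preds, y_test), mae(preds, y_test), mse(preds, y_test)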

(0.7794410151655299, 0.9445553800563493, 2.5780292552915833)

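## Had high hopes for this exercise; however, only the MSE score clearly improved, and MAE remained almost the same. The R2 of 0.779 was computed with the arguments swapped, so it cannot be compared directly with the r2_test values in the earlier tables.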
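
On a fixed test set, R2 = 1 - MSE / Var(y_test), so the test-set variance can be recovered from any model's `r2_test` and `mse_test` in the results table, and an R2 in the conventional `(y_true, y_pred)` order can then be derived from the tuned model's MSE. A sketch of that arithmetic, assuming the table values and the tuned MSE above refer to the same `X_test`/`y_test` split:

```python
# cv=5 table values for Random Forest on the test set
r2_rfr, mse_rfr = 0.820758, 2.77123

# Rearranging r2 = 1 - mse / var gives the (population) variance of y_test
var_y_test = mse_rfr / (1.0 - r2_rfr)

# Test MSE of the stage-tuned XGBoost model, from the output above
mse_xgb_tuned = 2.5780292552915833
r2_xgb_tuned = 1.0 - mse_xgb_tuned / var_y_test
print(f"var(y_test) = {var_y_test:.2f}, derived r2 = {r2_xgb_tuned:.3f}")
```

Under that assumption the derived R2 is roughly 0.83, above the 0.814933 reported for XGBoost in the cv=5 table.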