# Student Grades Regression Model

In this notebook you will build a regression model to predict student grades.

## Imports

In [1]:
import pandas as pd

## 1. Dataset

This dataset comes from Kaggle. It records students' grades and alcohol consumption, along with information about their families:

https://www.kaggle.com/uciml/student-alcohol-consumption/kernels

In [2]:
raw_data = pd.read_csv('/data/student-alcohol-consumption/student-mat.csv')

In [3]:
raw_data.head()

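| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |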


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
school        395 non-null object
sex           395 non-null object
age           395 non-null int64
address       395 non-null object
famsize       395 non-null object
Pstatus       395 non-null object
Medu          395 non-null int64
Fedu          395 non-null int64
Mjob          395 non-null object
Fjob          395 non-null object
reason        395 non-null object
guardian      395 non-null object
traveltime    395 non-null int64
studytime     395 non-null int64
failures      395 non-null int64
schoolsup     395 non-null object
famsup        395 non-null object
paid          395 non-null object
activities    395 non-null object
nursery       395 non-null object
higher        395 non-null object
internet      395 non-null object
romantic      395 non-null object
famrel        395 non-null int64
freetime      395 non-null int64
goout         395 non-null int64
Dalc          395 no

In [5]:
raw_data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

## 2. Features

Create a feature `DataFrame`, `X`, with the following columns, in this order:

* `Dalc` (weekday alcohol consumption)
* `Walc` (weekend alcohol consumption)
* `Medu` (mother's education level)
* `Fedu` (father's education level)
* `traveltime`
* `studytime`
* `goout`
* `higher` (one-hot encoded, with `get_dummies` and `drop_first=True, prefix='higher'`)
* `sex` (one-hot encoded, with `get_dummies` and `drop_first=True, prefix='sex'`)
* `romantic` (one-hot encoded, with `get_dummies` and `drop_first=True, prefix='romantic'`)

In [6]:
# Numeric features are used as-is.
X = raw_data[['Dalc', 'Walc', 'Medu', 'Fedu', 'traveltime', 'studytime', 'goout']]
# One-hot encode each binary categorical column and attach it by row index.
# drop_first=True keeps a single indicator per column (e.g. 'sex_M').
for col in ['higher', 'sex', 'romantic']:
    X = X.join(pd.get_dummies(raw_data[col], drop_first=True, prefix=col))

In [7]:
X.columns

Index(['Dalc', 'Walc', 'Medu', 'Fedu', 'traveltime', 'studytime', 'goout',
       'higher_yes', 'sex_M', 'romantic_yes'],
      dtype='object')

In [8]:
assert list(X.columns)==['Dalc', 'Walc', 'Medu', 'Fedu', 'traveltime', 'studytime', 'goout',
       'higher_yes', 'sex_M', 'romantic_yes']
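
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data and the `X_toy` name are illustrative assumptions, not part of the dataset). It shows that a two-level column produces two redundant indicator columns by default, and only one when the first level is dropped.

```python
import pandas as pd

# Toy data standing in for raw_data; the values are made up for illustration.
toy = pd.DataFrame({'studytime': [2, 3, 1], 'sex': ['F', 'M', 'F']})

# Default behaviour: one indicator column per level, so the two columns
# carry the same information (sex_F is always 1 - sex_M).
full = pd.get_dummies(toy['sex'], prefix='sex')

# drop_first=True drops the first level ('F') and keeps only 'sex_M'.
# dtype=int gives 0/1 values regardless of the pandas version's default.
reduced = pd.get_dummies(toy['sex'], drop_first=True, prefix='sex', dtype=int)

# Attach the indicator to the numeric feature by row index.
X_toy = toy[['studytime']].join(reduced)

print(list(full.columns))
print(X_toy)
```

Dropping one level avoids a perfectly collinear pair of columns, which matters most for the unregularized linear model.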

In [9]:
X.head()

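| | Dalc | Walc | Medu | Fedu | traveltime | studytime | goout | higher_yes | sex_M | romantic_yes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | 4 | 2 | 2 | 4 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 1 | 0 | 0 |
| 2 | 2 | 3 | 1 | 1 | 1 | 2 | 2 | 1 | 0 | 0 |
| 3 | 1 | 1 | 4 | 2 | 1 | 3 | 2 | 1 | 0 | 1 |
| 4 | 1 | 2 | 3 | 3 | 1 | 2 | 2 | 1 | 0 | 0 |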


Create the target column `y` from the `G3` column (final grade):

In [10]:
y = raw_data['G3']

In [11]:
assert list(y.value_counts().values)==[56, 47, 38, 33, 32, 31, 31, 28, 27, 16, 15, 12,  9,  7,  6,  5,  1,
        1]

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.3)

In [14]:
assert Xtrain.shape==(276,10)
assert Xtest.shape==(119,10)
assert ytrain.shape==(276,)
assert ytest.shape==(119,)
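
The split above does not set `random_state`, so the rows that land in each partition, and every score computed from them, will change from run to run. The sketch below uses synthetic arrays of the same shape (395 rows, 10 features) to show that fixing the seed makes the split repeatable while keeping the 276/119 sizes. The seed value `0` and the `*_demo` names are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same shape as X and y (395 rows, 10 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(395, 10))
y_demo = rng.normal(size=395)

# Any fixed integer works as random_state; 0 is an arbitrary choice.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

Xtr, Xte, ytr, yte = split_a
print(Xtr.shape, Xte.shape)

# With the same seed, both calls select exactly the same rows.
same_split = np.array_equal(Xtr, split_b[0]) and np.array_equal(yte, split_b[3])
print(same_split)
```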

## 3. Regression model

In [15]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In the following cells, create and tune a regression model with each of the following estimators:

* `LinearRegression`
* `RandomForestRegressor`
* `Ridge`

For each of the models:

* Create a pipeline with a `PolynomialFeatures` preprocessor first and model second.
* Compute the $R^2$ score for both the training and test datasets.
* Tune the model parameters, including the polynomial degree, to balance the bias and variance of the model.
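
The three steps above can be sketched as a small loop over polynomial degrees. This is only an illustration on synthetic data with a known quadratic signal (the `*_demo` arrays, the degrees tried, and the seed are all assumptions). The point is the pattern of scoring both partitions so that the gap between training and test $R^2$ is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic term, a linear term, and Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.5, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scores = {}
for degree in (1, 2, 3, 4):
    # Preprocessor first, model second.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(Xtr, ytr)
    # Pipeline.score delegates to the final regressor, which reports R^2.
    scores[degree] = (model.score(Xtr, ytr), model.score(Xte, yte))
    print(degree, scores[degree])
```

A training score that keeps rising while the test score stalls or falls suggests the model is moving from high bias toward high variance.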

Create, fit, tune and predict using the `LinearRegression` model here:

In [16]:
model_lr = make_pipeline(PolynomialFeatures(1), LinearRegression())
model_lr.fit(Xtrain, ytrain)
ypred = model_lr.predict(Xtest)

Compute and print the training and test $R^2$ score here: 

In [17]:
# R^2 on the held-out test set (the training-set score is not computed here).
r2_lr = r2_score(ytest, ypred)
print(r2_lr)

0.0401610155764
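
To help read scores like this one, and the negative scores further down, here is a minimal sketch of how $R^2$ is defined: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The helper name `r2_manual` and the toy vectors are assumptions for illustration. A score of 0 means the model does no better than always predicting the mean, and a negative score means it does worse.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean
    return 1.0 - ss_res / ss_tot

y_true = [1, 2, 3, 4, 5]
mean_pred = [3, 3, 3, 3, 3]   # always predict the mean of y_true
bad_pred = [5, 4, 3, 2, 1]    # predictions in reversed order

print(r2_manual(y_true, mean_pred))  # 0.0
print(r2_manual(y_true, bad_pred))   # -3.0
print(r2_score(y_true, bad_pred))    # same value from scikit-learn
```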


Create, fit, tune and predict using the `RandomForestRegressor` model here:

In [18]:
model_rf = make_pipeline(PolynomialFeatures(5), RandomForestRegressor())
model_rf.fit(Xtrain, ytrain)
ypred = model_rf.predict(Xtest)

Compute and print the training and test $R^2$ score here: 

In [19]:
# R^2 on the held-out test set (the training-set score is not computed here).
r2_rf = r2_score(ytest, ypred)
print(r2_rf)

-0.250672055075
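
One plausible contributor to a negative test score here is the size of the expanded feature space. The sketch below counts how many columns `PolynomialFeatures` produces for 10 input features at the degrees used in this notebook, and checks the count against the closed form $\binom{n + d}{d}$. The all-zeros `placeholder` array is an assumption used only to give `fit` a shape.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10
# fit() only needs the number of columns, so the values do not matter.
placeholder = np.zeros((1, n_features))

counts = {}
for degree in (1, 2, 5):
    poly = PolynomialFeatures(degree).fit(placeholder)
    counts[degree] = poly.n_output_features_
    # Closed form for all monomials up to `degree`, including the bias term.
    assert counts[degree] == comb(n_features + degree, degree)

print(counts)  # {1: 11, 2: 66, 5: 3003}
```

At degree 5 there are 3003 columns for only 276 training rows, so a lower degree is a reasonable first thing to try when tuning.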


Create, fit, tune and predict using the `Ridge` model here:

In [52]:
model_ridge = make_pipeline(PolynomialFeatures(2), Ridge(alpha=.2))
model_ridge.fit(Xtrain, ytrain)
ypred = model_ridge.predict(Xtest)

Compute and print the training and test $R^2$ score here: 

In [53]:
# R^2 on the held-out test set (the training-set score is not computed here).
r2_ridge = r2_score(ytest, ypred)
print(r2_ridge)

-0.373844967079
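
The `Ridge` pipeline above fixes the degree at 2 and `alpha` at 0.2 by hand. As a hedged sketch of a more systematic approach, the example below tunes both with `GridSearchCV` using 5-fold cross-validation. The synthetic data, the grid values, and the fold count are assumptions, and the step names `polynomialfeatures` and `ridge` are the lowercase class names that `make_pipeline` assigns automatically.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic signal, standing in for Xtrain / ytrain.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(150, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.5, size=150)

pipeline = make_pipeline(PolynomialFeatures(), Ridge())

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'polynomialfeatures__degree': [1, 2, 3],
    'ridge__alpha': [0.1, 1.0, 10.0],
}

# Each combination is scored by mean cross-validated R^2 on the training data.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Because the search only uses cross-validation folds of the data it is given, the test set can stay untouched until the final evaluation.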
