# How accurately can a math score be predicted?

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import plotly.express as px

from sklearn.linear_model import SGDRegressor, LinearRegression, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

In [2]:
scoreData = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/CASchools.csv")
scoreData = scoreData.iloc[:, 5:]
scoreData.isnull().sum()

students       0
teachers       0
calworks       0
lunch          0
computer       0
expenditure    0
income         0
english        0
read           0
math           0
dtype: int64

# Data Pre-Processing

In [3]:
# Standardize every predictor; "calworks" is dropped and "math" is the target
features = scoreData.drop(["calworks", "math"], axis=1)
scoreDataSC = pd.DataFrame(StandardScaler().fit_transform(features),
                           columns=features.columns)

scoreDataSC

        students    teachers       lunch    computer expenditure      income     english        read
0      -0.622701   -0.629592   -1.574852   -0.536241    1.693832    1.021633   -0.863339    1.823812
1      -0.611188   -0.628260    0.118543   -0.459111   -0.336438   -0.761033   -0.612392    0.275319
2      -0.276016   -0.245978    1.167077   -0.304852    0.299356   -0.878251    0.779223   -0.929619
3      -0.610420   -0.613075    1.193898   -0.495407    2.826081   -0.878251   -0.863339   -0.152880
4      -0.331025   -0.306717    1.244756   -0.300315   -0.120692   -0.864073   -0.104603   -0.655769
...          ...         ...         ...         ...         ...         ...         ...         ...
415    -0.420831   -0.369427   -1.518889   -0.245870    3.123796    1.856711   -0.535049    2.286869
416     0.280216    0.423108   -1.594674    0.947373    0.677618    3.660314   -0.604575    2.441220
417    -0.559761   -0.580308   -0.277266   -0.586148   -1.436516    1.166147    0.465113   -0.332129
418    -0.646752   -0.661027    0.542639   -0.656473   -0.846630   -0.743298   -0.700709    0.643773


# Data Splitting

In [4]:
X_train, X_test, y_train, y_test = train_test_split(scoreDataSC, scoreData["math"], 
                                                    test_size = .2)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(336, 8)
(84, 8)
(336,)
(84,)
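The cells above scale the full dataset before splitting, so the test rows contribute to the scaler's mean and variance, and the split changes on every run because no `random_state` is set. Below is a minimal sketch of an alternative, not what this notebook ran. It assumes a small synthetic frame `demo` standing in for `scoreData` and an arbitrary seed of 42, and it fits the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scoreData (assumption: 420 rows, 3 features + target)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(420, 3)), columns=["income", "lunch", "english"])
demo["math"] = 650 + 10 * demo["income"] - 5 * demo["lunch"] + rng.normal(size=420)

X = demo.drop("math", axis=1)
y = demo["math"]

# Fixed seed (assumption) so the split, and every MSE computed from it, is repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_sc = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test_sc = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(X_train_sc.shape, X_test_sc.shape)
```

With only 420 rows the effect of the leakage is probably small, but fitting on the training rows keeps the testing MSE an honest estimate.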


# Model Building and Evaluation

In [5]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

print("Training: ", metrics.mean_squared_error(y_train, y_pred_train))
print("Testing: ", metrics.mean_squared_error(y_test, y_pred_test))

Training:  47.363763084793014
Testing:  42.108884205375205


#### Judging by the MSE above, there is no evidence of overfitting: the testing error is lower than the training error.  Because the split is random and unseeded, the size of that gap will vary from run to run.  There may be some underfitting, but it is not yet clear whether a linear model can address it, so the next step is to tune the hyperparameters of the regularized variants.  If tuning does not help, that suggests a linear model may not be as good as other models like decision trees, random forests, etc.  Taking the square root of each figure gives an RMSE of about 6.9 points on the training data and 6.5 points on the testing data, so the model's predictions of the math score are typically off by between 6 and 7 points.
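The square-root step mentioned above can be checked directly. A minimal sketch using the two MSE values printed by the OLS cell (the variable names are illustrative, not from the notebook):

```python
import math

# MSE values copied from the OLS output above
mse_train = 47.363763084793014
mse_test = 42.108884205375205

# RMSE is expressed in the same units as the math score (points)
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"Training RMSE: {rmse_train:.2f}")  # 6.88
print(f"Testing RMSE:  {rmse_test:.2f}")   # 6.49
```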

In [6]:
lasso = Lasso()
lasso.fit(X_train, y_train)
y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)

print("Training: ", metrics.mean_squared_error(y_train, y_pred_train))
print("Testing: ", metrics.mean_squared_error(y_test, y_pred_test))

Training:  52.02357621210231
Testing:  47.90628245791123


In [7]:
ridge = Ridge()
ridge.fit(X_train, y_train)
y_pred_train = ridge.predict(X_train)
y_pred_test = ridge.predict(X_test)

print("Training: ", metrics.mean_squared_error(y_train, y_pred_train))
print("Testing: ", metrics.mean_squared_error(y_test, y_pred_test))

Training:  47.37948104938662
Testing:  42.34883857366002


#### Comparing the OLS, Lasso, and Ridge regressions above, the Lasso with its default settings did the worst (testing MSE of about 47.9).  The MSE scores for the OLS and the Ridge were very close (about 42.1 and 42.3 on the testing data), with the OLS doing just slightly better.
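One way to see why the default Lasso trails the other two is to compare training error and coefficients side by side. OLS minimizes the training MSE by construction, so an L1 or L2 penalty can only raise it, and the Lasso penalty at `alpha=1.0` may push some coefficients all the way to zero. This sketch uses synthetic data from `make_regression` (an assumption, not the CASchools data), so the numbers will not match the cells above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption): 8 features, 4 of them informative
X, y = make_regression(n_samples=336, n_features=8, n_informative=4,
                       noise=7.0, random_state=0)

models = {"OLS": LinearRegression(), "Lasso": Lasso(alpha=1.0), "Ridge": Ridge(alpha=1.0)}
train_mse = {}
for name, model in models.items():
    model.fit(X, y)
    train_mse[name] = mean_squared_error(y, model.predict(X))
    # Count coefficients that the penalty shrank exactly to zero
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name:5s}  train MSE = {train_mse[name]:8.3f}  zero coefficients = {n_zero}")
```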

In [8]:
Lasso().get_params()

{'alpha': 1.0,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': 1000,
 'normalize': False,
 'positive': False,
 'precompute': False,
 'random_state': None,
 'selection': 'cyclic',
 'tol': 0.0001,
 'warm_start': False}

In [9]:
param_grid = {"alpha": np.arange(0.01, 1, .01),
              "max_iter": [1000, 5000, 10000, 100000, 1000000],
              "random_state": [42],
              }

gs = GridSearchCV(estimator = lasso, param_grid = param_grid, 
                  scoring = "neg_mean_squared_error", cv=5)

gs.fit(X_train, y_train)

print("Best Parameters: ", gs.best_params_)
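# With scoring="neg_mean_squared_error", gs.score returns the negated MSE,
# so the values printed below are negative; flip the sign to read them as MSE.
print("Train MSE: ", gs.score(X_train, y_train))
print("Test MSE: ", gs.score(X_test, y_test))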

Best Parameters:  {'alpha': 0.06999999999999999, 'max_iter': 1000, 'random_state': 42}
Train MSE:  -47.43920955977542
Test MSE:  -42.47366631255777
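The negative "MSE" values above come from the scorer, not from the model. scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` reports the MSE with its sign flipped. A small sketch on synthetic data (an assumption; the grid of three alphas is also illustrative) shows how to recover the ordinary MSE:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption), used only to illustrate the sign convention
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

gs = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0]},
                  scoring="neg_mean_squared_error", cv=5)
gs.fit(X, y)

neg_mse = gs.score(X, y)                    # negated MSE of the refit best model
mse = mean_squared_error(y, gs.predict(X))  # ordinary (positive) MSE

print("gs.score:            ", neg_mse)
print("MSE:                 ", mse)
print("CV MSE of best alpha:", -gs.best_score_)
```

Negating `gs.score(...)` or `gs.best_score_` gives figures that can be compared directly with `metrics.mean_squared_error`.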


In [10]:
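# Note: the grid search output above reports a best alpha of about 0.07;
# this refit uses 0.05, which gives nearly the same MSE.
lasso2 = Lasso(alpha=0.05, random_state = 42)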
lasso2.fit(X_train, y_train)
y_pred_train = lasso2.predict(X_train)
y_pred_test = lasso2.predict(X_test)

print("Training: ", metrics.mean_squared_error(y_train, y_pred_train))
print("Testing: ", metrics.mean_squared_error(y_test, y_pred_test))

Training:  47.41862340236622
Testing:  42.382394821042894


In [11]:
Ridge().get_params()

{'alpha': 1.0,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': None,
 'normalize': False,
 'random_state': None,
 'solver': 'auto',
 'tol': 0.001}

In [12]:
param_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              "max_iter": [100, 500, 1000, 5000, 10000],
              "random_state": [42],
              }

gs = GridSearchCV(estimator = ridge, param_grid = param_grid, 
                  scoring = "neg_mean_squared_error", cv=5)

gs.fit(X_train, y_train)

print("Best Parameters: ", gs.best_params_)
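# With scoring="neg_mean_squared_error", gs.score returns the negated MSE,
# so the values printed below are negative; flip the sign to read them as MSE.
print("Train MSE: ", gs.score(X_train, y_train))
print("Test MSE: ", gs.score(X_test, y_test))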

Best Parameters:  {'alpha': 1, 'max_iter': 100, 'random_state': 42}
Train MSE:  -47.37948104938662
Test MSE:  -42.34883857366002
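The grid above also varies `max_iter`, and the search simply reports the first value it tried (100). A likely reason, stated as my understanding of scikit-learn rather than something this notebook verifies: with `solver='auto'` and dense input, `Ridge` uses a direct (non-iterative) solver, so `max_iter` and `random_state` have no effect on the fit. The sketch below checks that on synthetic data (an assumption).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic dense data (assumption), same shape as the training split above
X, y = make_regression(n_samples=336, n_features=8, noise=7.0, random_state=0)

# Same alpha, very different iteration caps and seeds
ridge_a = Ridge(alpha=1.0, max_iter=100, random_state=42).fit(X, y)
ridge_b = Ridge(alpha=1.0, max_iter=10000, random_state=7).fit(X, y)

# If max_iter and random_state are unused, both fits are the same
same_fit = np.allclose(ridge_a.coef_, ridge_b.coef_)
print("Identical coefficients:", same_fit)
```

If that holds, the ridge grid could be reduced to `alpha` alone without changing the result.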


In [13]:
ridge2 = Ridge(alpha=1, random_state=42, max_iter = 100)
ridge2.fit(X_train, y_train)
y_pred_train = ridge2.predict(X_train)
y_pred_test = ridge2.predict(X_test)

print("Training: ", metrics.mean_squared_error(y_train, y_pred_train))
print("Testing: ", metrics.mean_squared_error(y_test, y_pred_test))

Training:  47.37948104938662
Testing:  42.34883857366002


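#### The Lasso and Ridge regressions were tuned with a grid search over a variety of alpha values, and the hyperparameters selected are noted above.  For the lasso, the grid search output reports a best alpha of about 0.07, while the refit in In [10] uses 0.05.  Both are far below the default of 1.0, and the refit brings the lasso's MSE (47.42 training, 42.38 testing) to nearly the same level as the OLS.  For the ridge, the best alpha was 1, which is also the default value, so tuning changed nothing.  Neither tuned model beats the plain OLS on the testing data.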
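The tuning described above could be made a little tighter. The sketch below is one hedged variant, not the procedure this notebook ran. It puts the scaler inside a `Pipeline` so each cross-validation fold is scaled using only its own training rows, searches alpha on a log-spaced grid (which avoids float artifacts such as `0.06999999999999999` from `np.arange`), and drops `max_iter` and `random_state` from the grid. The synthetic data, the step names `scale` and `lasso`, and the grid bounds are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) with the same row and column counts as the notebook
X, y = make_regression(n_samples=420, n_features=8, noise=7.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling happens inside the pipeline, so it is refit on each CV training fold
pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(max_iter=10000))])

# Nine log-spaced alphas from 0.001 to 10
param_grid = {"lasso__alpha": np.logspace(-3, 1, 9)}

gs = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
gs.fit(X_train, y_train)

best_alpha = gs.best_params_["lasso__alpha"]
test_mse = -gs.score(X_test, y_test)  # negate to get an ordinary MSE

print("Best alpha:", best_alpha)
print("Test MSE:  ", test_mse)
```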