
# Codealong - Train, Val, and Test

In this codealong, you will separate your data into train, val, and test sets and try out models of higher polynomial degree.

First, we import the libraries needed for this codealong.

In [None]:
import numpy as np
import pandas as pd
import sklearn 
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics

Next, we load the data we're going to use (the same data as in 4.2's last codealong, but we keep only the first 500 rows so that some aspects of our analysis work better).

In [None]:
POLY_DATA_URL = "https://raw.githubusercontent.com/alextsun/ssi-ds-bootcamp-2020/master/Poly_Data.txt"

data = pd.read_csv(POLY_DATA_URL, names=["X", "y"], header=0, sep=" ")
data = data.to_numpy()
data = data[:500, :]

## Separating into train, val, and test sets

Write some code to split our data into the train, val, and test sets from lecture. Write this as a function that we can pass any array and it will return a train, val, and test set. Make sure to shuffle the data first (outside of the function).

*Hint: Use `np.split()` and `np.random.shuffle(arr)`*

In [None]:
np.random.seed(19)
np.random.shuffle(data)

X, y = data[:, 0], data[:, 1]
print(X.shape)

# make it a column vector. -1 chooses the right number of rows auto!
X = X.reshape((-1, 1))
print(X.shape)

def train_val_test_split(dataset):
  # Returns a tuple of 3 sub-datasets. 
  # The first 80% train, next 10% val, last 10% test.
  return np.split(dataset, [int(dataset.shape[0] * 0.8), int(dataset.shape[0] * 0.9)])

X_train, X_val, X_test = train_val_test_split(X)
y_train, y_val, y_test = train_val_test_split(y)

(500,)
(500, 1)
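To see why shuffling once and then applying the same split to `X` and `y` keeps every input paired with its target, here is a small sketch on a toy array. The array `toy` and its values are made up for illustration; the split function mirrors the one above.

```python
import numpy as np

def train_val_test_split(dataset):
    # Same 80/10/10 split as above: cut at 80% and 90% of the rows.
    n = dataset.shape[0]
    return np.split(dataset, [int(n * 0.8), int(n * 0.9)])

# Hypothetical toy data: 10 rows where y is always 10 * X, so pairs are easy to check.
toy = np.column_stack((np.arange(10.0), 10 * np.arange(10.0)))

np.random.seed(0)
np.random.shuffle(toy)  # shuffles whole rows in place, so X and y move together

toy_X, toy_y = toy[:, 0], toy[:, 1]
X_parts = train_val_test_split(toy_X)
y_parts = train_val_test_split(toy_y)

print([part.shape[0] for part in X_parts])  # sizes of train, val, test
# Every split still satisfies y == 10 * X, so the pairing survived.
print(all(np.array_equal(yp, 10 * xp) for xp, yp in zip(X_parts, y_parts)))
```

Shuffling `X` and `y` separately, after pulling them apart, would break this pairing, which is why the shuffle happens on the combined array first.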


## Generating the linear model

Create a linear regression model and fit it to our `X_train` and `y_train`. Print the weights and the bias from this model.

In [None]:
model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)
w = model.coef_
b = model.intercept_
print(w)
print(b)

[730.17253364]
-4677.941116980869
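As a rough illustration of what the weights and bias mean, the sketch below rebuilds a model's predictions by hand as `X @ w + b`. The data here is a hypothetical exact line, not the codealong's dataset, and the names `X_demo`, `y_demo`, and `demo_model` are made up for this example.

```python
import numpy as np
import sklearn.linear_model

# Hypothetical data on an exact line y = 3x - 2 (not the codealong's dataset).
X_demo = np.arange(6.0).reshape((-1, 1))
y_demo = 3 * X_demo[:, 0] - 2

demo_model = sklearn.linear_model.LinearRegression()
demo_model.fit(X_demo, y_demo)

w = demo_model.coef_       # one weight per feature column
b = demo_model.intercept_  # the bias

# A prediction is just the weighted sum of the features plus the bias.
manual = X_demo @ w + b
print(np.allclose(manual, demo_model.predict(X_demo)))
```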


### Calculating MSE for the train and val set

Calculate the model's MSE (mean squared error) from its predictions on both the `X_train` set and the `X_val` set.

*Hint: Use `sklearn.metrics.mean_squared_error()`*

In [None]:
yhat_train = model.predict(X_train)
yhat_val = model.predict(X_val)

In [None]:
# mean_squared_error expects (y_true, y_pred)
print(sklearn.metrics.mean_squared_error(y_train, yhat_train))
print(sklearn.metrics.mean_squared_error(y_val, yhat_val))

10120.096075647876
10803.018770692204
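For reference, MSE is the average of the squared differences between targets and predictions. The sketch below computes it by hand on a few made-up numbers (`y_true` and `y_pred` are hypothetical) and compares the result with `sklearn.metrics.mean_squared_error`.

```python
import numpy as np
import sklearn.metrics

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 5.0, 0.0])

# Squared errors are 0, 0, 4, 16, so the mean is 20 / 4 = 5.
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = sklearn.metrics.mean_squared_error(y_true, y_pred)

print(manual_mse, sklearn_mse)
```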


## Generating the 3rd-degree model

Create a new array `trans_X` that is a modified copy of the original `X` array. Modify it by adding columns for $X^3$ and $X^2$.

*Hint: We already did this for the second Codealong in 4.2 :)*

In [None]:
X2 = X ** 2
X3 = X ** 3

trans_X = np.hstack((X3, X2, X))
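The same columns can also be generated with scikit-learn's `PolynomialFeatures`, shown here as an alternative sketch rather than the codealong's method. Note that it orders the columns as $X$, $X^2$, $X^3$, the reverse of the `np.hstack` call above; `X_demo` is a hypothetical stand-in for `X`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical column vector standing in for X.
X_demo = np.array([[1.0], [2.0], [3.0]])

# Manual version, as in the cell above: columns ordered X^3, X^2, X.
manual = np.hstack((X_demo ** 3, X_demo ** 2, X_demo))

# include_bias=False drops the constant column of ones; columns come out as X, X^2, X^3.
poly = PolynomialFeatures(degree=3, include_bias=False)
generated = poly.fit_transform(X_demo)

# Reversing the column order makes the two layouts line up.
print(np.allclose(generated[:, ::-1], manual))
```

Column order only changes the order of the fitted weights, not the predictions.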

Then, split that new dataset into Train, Val, and Test sets again (use the function we defined before!)

In [None]:
X_train, X_val, X_test = train_val_test_split(trans_X)
y_train, y_val, y_test = train_val_test_split(y)

Finally, fit our model on this new `X_train` and `y_train`, and once again print the weights and bias.

In [None]:
model.fit(X_train, y_train)
w = model.coef_
b = model.intercept_
print(w)
print(b)

[1.99790323 5.05363468 2.5668831 ]
31.0868662125049


### Calculating MSE for the train and val set (again)

Calculate the model's MSE (mean squared error) from its predictions on both the `X_train` set and the `X_val` set.

*Hint: Use `sklearn.metrics.mean_squared_error()`*

In [None]:
yhat_train = model.predict(X_train)
yhat_val = model.predict(X_val)

In [None]:
# mean_squared_error expects (y_true, y_pred)
print(sklearn.metrics.mean_squared_error(y_train, yhat_train))
print(sklearn.metrics.mean_squared_error(y_val, yhat_val))

0.056931370289016466
0.06614913785813024


## Generating the 21st-degree model

Follow the same process you used for the 3rd-degree model, but this time for the 21st degree!

In [None]:
DEGREE = 21

First, transform the data! Use the above constant to make modifying the code easier down the line.

In [None]:
n = X.shape[0]
trans_X = np.repeat(X, DEGREE).reshape((n, DEGREE))

for i in range(2, DEGREE+1):
  trans_X[:, i-1] **= i
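The loop above fills column `i-1` with $X^i$, so the columns run from $X^1$ to $X^{21}$. One possible shortcut, sketched here on hypothetical inputs (`X_demo` is made up), is to let NumPy broadcasting raise the column vector to a whole row of exponents at once.

```python
import numpy as np

DEGREE = 21
# Hypothetical small inputs, kept below 1 in magnitude so high powers stay tiny.
X_demo = np.array([[0.5], [-0.25], [0.9]])

# Loop version from the cell above.
n = X_demo.shape[0]
looped = np.repeat(X_demo, DEGREE).reshape((n, DEGREE))
for i in range(2, DEGREE + 1):
    looped[:, i - 1] **= i

# Broadcasting version: (n, 1) raised to exponents of shape (DEGREE,) gives (n, DEGREE).
exponents = np.arange(1, DEGREE + 1)
broadcast = X_demo ** exponents

print(broadcast.shape)
print(np.allclose(looped, broadcast))
```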

Split it into Train, Val, and Test... (still use the function!)

In [None]:
X_train, X_val, X_test = train_val_test_split(trans_X)
y_train, y_val, y_test = train_val_test_split(y)

...and train it again!

In [None]:
model.fit(X_train, y_train)
w = model.coef_
b = model.intercept_
print(w)
print(b)

[ 9.62538339e-17 -1.02856239e-09  4.73315057e-12  2.43260193e-14
  6.03737525e-13  3.97442851e-12  2.42343764e-11  1.36332790e-10
  6.99442269e-10  3.20618591e-09  1.26859900e-08  4.07426134e-08
  9.34264488e-08  1.02396855e-07 -7.93610055e-08  2.16248103e-08
 -3.14962093e-09  2.71736208e-10 -1.39936668e-11  3.99417460e-13
 -4.88029120e-15]
567.3410349531378


### Calculating MSE for the train and val set (for the last time)

Calculate the model's MSE (mean squared error) from its predictions on both the `X_train` set and the `X_val` set.

*Hint: Use `sklearn.metrics.mean_squared_error()`*

In [None]:
yhat_train = model.predict(X_train)
yhat_val = model.predict(X_val)

In [None]:
# mean_squared_error expects (y_true, y_pred)
print(sklearn.metrics.mean_squared_error(y_train, yhat_train))
print(sklearn.metrics.mean_squared_error(y_val, yhat_val))

1.2423548220377996
2.9523730255719975
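All three experiments follow the same recipe: build the powers of `X`, split, fit, and score on train and val. The sketch below wraps that recipe in one helper so different degrees can be compared side by side. The helper name `fit_and_score` and the synthetic cubic data are assumptions made for this example, not part of the codealong.

```python
import numpy as np
import sklearn.linear_model
import sklearn.metrics

def fit_and_score(X, y, degree):
    # Build columns X^1 ... X^degree via broadcasting, then split 80/10/10.
    features = X ** np.arange(1, degree + 1)
    n = features.shape[0]
    cuts = [int(n * 0.8), int(n * 0.9)]
    X_train, X_val, _ = np.split(features, cuts)
    y_train, y_val, _ = np.split(y, cuts)

    model = sklearn.linear_model.LinearRegression()
    model.fit(X_train, y_train)
    train_mse = sklearn.metrics.mean_squared_error(y_train, model.predict(X_train))
    val_mse = sklearn.metrics.mean_squared_error(y_val, model.predict(X_val))
    return train_mse, val_mse

# Hypothetical cubic data with a little noise (not the codealong's Poly_Data.txt).
# The rows are drawn at random, so no extra shuffle is needed before splitting.
rng = np.random.default_rng(19)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 2 * X_demo[:, 0] ** 3 - X_demo[:, 0] + rng.normal(0, 0.1, size=200)

scores = {degree: fit_and_score(X_demo, y_demo, degree) for degree in (1, 3)}
for degree, (train_mse, val_mse) in scores.items():
    print(degree, train_mse, val_mse)
```

Comparing the val MSE across degrees, rather than the train MSE alone, is what the val set is for.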


In [None]:
lasso = sklearn.linear_model.Lasso(alpha=1000000000)
lasso.fit(X_train, y_train)
w = lasso.coef_
b = lasso.intercept_
print(w)
print(b)
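The effect of `alpha` in a Lasso model is easiest to see on a tiny problem. The sketch below, on hypothetical data (`X_demo`, `y_demo`, `weak`, and `strong` are made-up names), fits one Lasso with a small `alpha` and one with the very large `alpha` used above. The regularized model only takes effect if `fit` is called on the Lasso object itself.

```python
import numpy as np
import sklearn.linear_model

# Hypothetical data on a noisy line (not the codealong's dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(100, 1))
y_demo = 4 * X_demo[:, 0] + 1 + rng.normal(0, 0.1, size=100)

# alpha scales the L1 penalty on the weights; the bias is not penalized.
weak = sklearn.linear_model.Lasso(alpha=0.001)
strong = sklearn.linear_model.Lasso(alpha=1000000000)

weak.fit(X_demo, y_demo)
strong.fit(X_demo, y_demo)

print(weak.coef_, weak.intercept_)      # close to the unregularized fit
print(strong.coef_, strong.intercept_)  # weight pushed to zero, bias is the mean of y
```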
