### Preprocessing

In [0]:
# import statistical packages
import numpy as np
import pandas as pd

In [0]:
# import data visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

*I do not need to set aside a separate 50% training dataset by hand. Instead, I use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from sklearn, which splits the data into training and test sets at random.*
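To make the 50/50 split concrete, here is a minimal sketch on toy arrays. The names `X_demo`, `y_demo` and the toy values are hypothetical stand-ins, not the Auto data. It shows that `test_size=0.5` halves the rows and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical), not the Auto dataset
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.5 puts half of the rows in each set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1
)

# Repeating the call with the same random_state gives the same split
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_tr, X_tr_again))
```

Changing `random_state` shuffles the rows differently, which is what the two regression sections in this notebook rely on.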

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
# import and preprocess data
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Auto.csv"
df = spark.read.option("header", "true").csv(url).toPandas()

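# "name" is the only text column; every other column is numeric
str_cols = ["name"]
num_cols = [col for col in df.columns if col not in str_cols]
# Without schema inference the Spark CSV reader returns strings, so cast explicitly
df[str_cols] = df[str_cols].astype(str)
df[num_cols] = df[num_cols].astype(float)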

In [0]:
df.head()

### Regressions using random state = 1

**Simple Linear Regression**

In [0]:
X = df[['horsepower']]
y = df['mpg']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

In [0]:
X_train.shape

In [0]:
y_train.shape

In [0]:
X_test.shape

In [0]:
y_test.shape

In [0]:
df.shape

*The Auto dataset used here contains 397 rows, whereas the same dataset in the book example contains 392 rows. The difference arises because some rows have missing values and were deleted in the book version. I have imputed those values instead, so I keep the same number of rows as the original dataset. More information about imputation of missing values can be found [here](http://www.stat.columbia.edu/~gelman/arm/missing.pdf). In any case, the difference does not matter much, since the main purpose of the chapter is to show the relative differences in predictive ability between methods. As long as those relative differences are roughly the same, the point still stands.*
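The notebook does not show how the missing values were imputed, so the following is only an illustrative sketch. It assumes median imputation on a hypothetical toy frame `toy`, and contrasts that with dropping incomplete rows, which is what shrinks the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing horsepower values
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 22.0],
    "horsepower": [130.0, np.nan, 65.0, np.nan],
})

# Option 1: delete incomplete rows (fewer rows remain)
dropped = toy.dropna()

# Option 2 (assumed here): fill gaps with the column median (all rows remain)
imputed = toy.fillna({"horsepower": toy["horsepower"].median()})

print(len(toy), len(dropped), len(imputed))
```

Median imputation is only one of several options discussed in the linked chapter. The choice changes the fitted coefficients slightly but not the row count.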

In [0]:
from sklearn.linear_model import LinearRegression

In [0]:
lmfit = LinearRegression().fit(X_train, y_train)

In [0]:
lmpred = lmfit.predict(X_test)

In [0]:
from sklearn.metrics import mean_squared_error

In [0]:
MSE = mean_squared_error(y_test, lmpred)

In [0]:
round(MSE, 2)

**Polynomial Regression (horsepower$^2$)**

In [0]:
from sklearn.preprocessing import PolynomialFeatures as PF

In [0]:
X = df[['horsepower']]
X_ = pd.DataFrame(PF(2).fit_transform(X))
y = df[['mpg']]

In [0]:
X_.head()

In [0]:
X_.drop(columns=0, inplace=True)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.5, random_state=1)

In [0]:
lmfit2 = LinearRegression().fit(X_train, y_train)

In [0]:
lmpred2 = lmfit2.predict(X_test)

In [0]:
MSE2 = mean_squared_error(y_test, lmpred2)

In [0]:
round(MSE2, 2)

**Polynomial Regression (horsepower$^3$)**

In [0]:
from sklearn.preprocessing import PolynomialFeatures as PF

In [0]:
X = df[['horsepower']]
X_ = pd.DataFrame(PF(3).fit_transform(X))
y = df[['mpg']]

In [0]:
X_.head()

In [0]:
X_.drop(columns=0, inplace=True)

In [0]:
X_.head()

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.5, random_state=1)

In [0]:
lmfit3 = LinearRegression().fit(X_train, y_train)

In [0]:
lmpred3 = lmfit3.predict(X_test)

In [0]:
MSE3 = mean_squared_error(y_test, lmpred3)

In [0]:
round(MSE3, 2)

### Regressions using random state = 2

**Simple Linear Regression**

In [0]:
X = df[['horsepower']]
y = df['mpg']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)

In [0]:
from sklearn.linear_model import LinearRegression

In [0]:
lmfit = LinearRegression().fit(X_train, y_train)

In [0]:
lmpred = lmfit.predict(X_test)

In [0]:
MSE = mean_squared_error(y_test, lmpred)

In [0]:
round(MSE, 2)

**Polynomial Regression (horsepower$^2$)**

In [0]:
from sklearn.preprocessing import PolynomialFeatures as PF

In [0]:
X = df[['horsepower']]
X_ = pd.DataFrame(PF(2).fit_transform(X))
y = df[['mpg']]

In [0]:
X_.head()

In [0]:
X_.drop(columns=0, inplace=True)

In [0]:
X_.head()

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.5, random_state=2)

In [0]:
lmfit2 = LinearRegression().fit(X_train, y_train)

In [0]:
lmpred2 = lmfit2.predict(X_test)

In [0]:
MSE2 = mean_squared_error(y_test, lmpred2)

In [0]:
round(MSE2, 2)

**Polynomial Regression (horsepower$^3$)**

In [0]:
from sklearn.preprocessing import PolynomialFeatures as PF

In [0]:
X = df[['horsepower']]
X_ = pd.DataFrame(PF(3).fit_transform(X))
y = df[['mpg']]

In [0]:
X_.head()

In [0]:
X_.drop(columns=0, inplace=True)

In [0]:
X_.head()

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.5, random_state=2)

In [0]:
lmfit3 = LinearRegression().fit(X_train, y_train)

In [0]:
lmpred3 = lmfit3.predict(X_test)

In [0]:
MSE3 = mean_squared_error(y_test, lmpred3)

In [0]:
round(MSE3, 2)

**Thus we see that the test errors change when a different random split produces a different training set.**
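The repeated cells above can be condensed into one loop. The sketch below runs on synthetic data, because the Auto file is not reachable outside this workspace. The coefficients, the noise level and the helper `validation_mse` are all hypothetical. It fits polynomial degrees 1 to 3 for `random_state` 1 and 2 and prints the test MSE of each fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Auto data (hypothetical coefficients and noise)
rng = np.random.default_rng(0)
hp = rng.uniform(50, 200, size=400)
mpg = 65 - 0.5 * hp + 0.0015 * hp**2 + rng.normal(0, 2, size=400)
X_demo = hp.reshape(-1, 1)


def validation_mse(degree, seed):
    """Fit a polynomial of the given degree on one half, score on the other."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, mpg, test_size=0.5, random_state=seed
    )
    # include_bias=False skips the constant column, so nothing needs dropping
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))


results = {
    (seed, degree): validation_mse(degree, seed)
    for seed in (1, 2)
    for degree in (1, 2, 3)
}
for (seed, degree), mse in results.items():
    print(f"random_state={seed}, degree={degree}: MSE={mse:.2f}")
```

Using `include_bias=False` plays the same role as the `X_.drop(columns=0, inplace=True)` cells above, since `LinearRegression` already fits its own intercept.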