# Business Analytics - Assignment 3
#### **Student Name:** Koorosh Shakoori

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read the data and build the predictor X and the response y as column arrays of shape (n, 1).
data = pd.read_csv('MiniExam3DataSet.csv')
data.head()
X = data.x.values.reshape(-1,1)
y = data.y.values.reshape(-1,1)

## (a)
In this part of the assignment, the dataset is split into training and test sets with a test ratio of 0.2.

We use the train_test_split function from scikit-learn for this task.

To keep the split reproducible across runs, we pass a fixed random_state value to the function.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
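
As a small illustration of why a fixed random_state matters, the sketch below splits a made-up array twice with the same seed and checks that both splits are identical. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the assignment's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the assignment reads MiniExam3DataSet.csv instead.
X_demo = np.arange(20).reshape(-1, 1)
y_demo = (2 * X_demo).ravel()

# Two calls with the same random_state should produce identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
n_test = len(split_a[1])  # split_a is [X_train, X_test, y_train, y_test]
print(same_split, n_test)
```

Omitting random_state would draw a different shuffle on each run, so the selected degree and the reported errors could change between executions.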

## (b)
This section compares several polynomial degrees using the mean squared error (MSE) from leave-one-out cross-validation (LOOCV) on the training set.

The degree with the lowest validation error is then selected.

For this purpose, we investigate degrees 1 through 9.

In [4]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score

In [5]:
loocv_scores = []
for i in range(1, 10):
    # Expand the training data into polynomial features of degree i before fitting the model.
    polynomial_feature = PolynomialFeatures(degree=i, include_bias=False)
    X_poly_train = polynomial_feature.fit_transform(X_train)
    model = LinearRegression()
    
    # cross_val_score follows the convention that higher scores are better (as with accuracy),
    # so the MSE is returned as a negative value. The built-in abs() recovers the positive MSE.
    # LOOCV returns one score per held-out sample, so the mean summarizes each degree's performance.
    loo_score = abs(cross_val_score(model, X_poly_train, y_train, cv=LeaveOneOut(), scoring='neg_mean_squared_error').mean())
    loocv_scores.append((i, loo_score))

scores = pd.DataFrame(loocv_scores, columns = ['degree', 'MSE'])
loo_ideal_degree = scores.degree[scores.MSE.idxmin()]
print(f'a polynomial function with {loo_ideal_degree} degrees results in minimum LOO cross validation MSE')
print(scores)

a polynomial function with 3 degrees results in minimum LOO cross validation MSE
   degree         MSE
0       1  519.422915
1       2  624.974783
2       3   10.768963
3       4   14.050213
4       5   20.232977
5       6   17.073647
6       7   26.696216
7       8  102.643631
8       9   56.064143
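
To make the LOOCV computation concrete, here is a minimal sketch that reproduces the same quantity by hand on synthetic data and compares it with cross_val_score. The cubic data, the names `X_demo` and `y_demo`, and the use of a pipeline are assumptions for illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic data (an assumption; not the assignment's dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(30, 1))
y_demo = X_demo[:, 0] ** 3 - 2 * X_demo[:, 0] + rng.normal(0, 1, size=30)

pipe = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                     LinearRegression())

# Manual LOOCV: hold out one sample, fit on the rest, record its squared error.
squared_errors = []
for train_idx, test_idx in LeaveOneOut().split(X_demo):
    pipe.fit(X_demo[train_idx], y_demo[train_idx])
    pred = pipe.predict(X_demo[test_idx])
    squared_errors.append((y_demo[test_idx][0] - pred[0]) ** 2)
manual_mse = float(np.mean(squared_errors))

# The same quantity via cross_val_score; negating recovers a positive MSE.
cvs_mse = -cross_val_score(pipe, X_demo, y_demo, cv=LeaveOneOut(),
                           scoring='neg_mean_squared_error').mean()
print(manual_mse, cvs_mse)
```

The two numbers should agree up to floating-point error, since each LOOCV score is the squared error of a single held-out sample.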


## (c)
The results above show that leave-one-out cross-validation favors a degree-3 polynomial, which has the lowest validation MSE (about 10.77).

This value is stored in loo_ideal_degree and is used below.

As the problem requests, we refit the model with this degree on the full training set and compute the test MSE.

In [6]:
# Transform both train and test data into polynomial features with the ideal degree found above.
# The transformer is fitted on the training data only and then applied to the test data.
polynomial_feature = PolynomialFeatures(degree=loo_ideal_degree, include_bias=False)
X_poly_train = polynomial_feature.fit_transform(X_train)
X_poly_test = polynomial_feature.transform(X_test)

#Initializing and training the model
model = LinearRegression()
model.fit(X_poly_train, y_train)

#Making predictions with test dataset to determine its respective MSE
test_prediction = model.predict(X_poly_test)
test_MSE = mean_squared_error(y_test, test_prediction)
print(f'The Mean Squared Error over the test dataset is :{test_MSE}')

The Mean Squared Error over the test dataset is :7.38539544607036
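
One possible way to keep the polynomial expansion and the regression together is a scikit-learn pipeline, which fits the transformer on the training data and reuses it unchanged at prediction time. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `pipe`); it is an alternative arrangement under those assumptions, not the code used for the result above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic data (an assumption; not the assignment's dataset).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(50, 1))
y_demo = X_demo[:, 0] ** 3 + rng.normal(0, 1, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# The pipeline fits PolynomialFeatures and LinearRegression on the training data only.
pipe = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                     LinearRegression())
pipe.fit(X_tr, y_tr)

# predict() applies the already-fitted transformer to the test inputs.
demo_test_mse = mean_squared_error(y_te, pipe.predict(X_te))
print(demo_test_mse)
```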


## (d)
In this block, the same approach as in part (b) is followed, this time with 5-fold cross-validation.

Again, the degree between 1 and 9 with the minimum cross-validation MSE is chosen.

In [7]:
kfold_scores = []
for i in range(1, 10):
    polynomial_feature = PolynomialFeatures(degree=i, include_bias=False)
    X_poly_train = polynomial_feature.fit_transform(X_train)
    model = LinearRegression()
    
    # As above, the MSE returned by the scoring is negative, hence the abs() function.
    # This time .mean() averages the MSE over the 5 folds.
    kfold_score = abs(cross_val_score(model, X_poly_train, y_train, cv=5, scoring='neg_mean_squared_error').mean())
    kfold_scores.append((i, kfold_score))

scores = pd.DataFrame(kfold_scores, columns = ['degree', 'MSE'])
kfold_ideal_degree = scores.degree[scores.MSE.idxmin()]
print(f'a polynomial function with {kfold_ideal_degree} degrees results in minimum 5-fold cross validation MSE')
print(scores)

a polynomial function with 3 degrees results in minimum 5-fold cross validation MSE
   degree         MSE
0       1  493.015798
1       2  599.487773
2       3   11.669149
3       4   13.458086
4       5   17.678115
5       6   13.379101
6       7   26.355376
7       8  426.319639
8       9   40.997351
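
Passing cv=5 for a regressor uses consecutive, unshuffled folds. If the rows of a dataset happen to be ordered, a shuffled splitter may give more representative folds. The sketch below shows that variant on synthetic data; the data, the seed, and the names `kf` and `pipe` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic data (an assumption; not the assignment's dataset).
rng = np.random.default_rng(2)
X_demo = rng.uniform(-3, 3, size=(40, 1))
y_demo = X_demo[:, 0] ** 3 + rng.normal(0, 1, size=40)

pipe = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                     LinearRegression())

# Explicit 5-fold splitter with shuffling and a fixed seed for reproducibility.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf,
                              scoring='neg_mean_squared_error')
kfold_mse = -fold_scores.mean()

# Each fold holds out 40 / 5 = 8 samples.
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X_demo)]
print(kfold_mse, fold_sizes)
```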


## (e)
Once again, with the optimal polynomial degree at hand (stored in kfold_ideal_degree), we refit the model on the training set.

This time, in addition to the MSE on the test set, we also compute the R2 score, as the problem requires.

In [8]:
# Transform both train and test data into polynomial features with the ideal degree found above.
# The transformer is fitted on the training data only and then applied to the test data.
polynomial_feature = PolynomialFeatures(degree=kfold_ideal_degree, include_bias=False)
X_poly_train = polynomial_feature.fit_transform(X_train)
X_poly_test = polynomial_feature.transform(X_test)

#Initializing and training the model
model = LinearRegression()
model.fit(X_poly_train, y_train)

#Making predictions with test dataset to determine its respective MSE and R2score
test_prediction = model.predict(X_poly_test)
test_MSE = mean_squared_error(y_test, test_prediction)
print(f'The Mean Squared Error over the test dataset is :{test_MSE}')
#The R2 score is available as a method of the sklearn model we used in this assignment.
R2score = model.score(X_poly_test, y_test)
print(f'The R2 score over the test dataset is :{R2score}')

The Mean Squared Error over the test dataset is :7.38539544607036
The R2 score over the test dataset is :0.9819137069010943
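
For reference, the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares. The short sketch below computes it by hand on made-up numbers and compares the result with sklearn's r2_score; the arrays `y_true` and `y_pred` are hypothetical values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions (not taken from the assignment's data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
manual_r2 = 1 - ss_res / ss_tot                 # 0.9625

print(manual_r2, r2_score(y_true, y_pred))
```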


## (f)
Both cross-validation methods select the same polynomial degree, **3**. Without a fixed seed, an occasional split can produce a training set whose distribution is highly skewed, and the two methods may then disagree. Passing a random_state argument to train_test_split makes the split reproducible, and with the value used here both methods choose the same degree. Because the selected degree is identical, the refitted models in parts (c) and (e) are the same and yield the same test MSE (about 7.385).

In [9]:
print(f'a polynomial function with {loo_ideal_degree} degrees results in minimum LOO cross validation MSE')
print(f'a polynomial function with {kfold_ideal_degree} degrees results in minimum 5-fold cross validation MSE')

a polynomial function with 3 degrees results in minimum LOO cross validation MSE
a polynomial function with 3 degrees results in minimum 5-fold cross validation MSE
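
One way to probe how often the two methods agree is to repeat the selection over several split seeds. The sketch below does this on synthetic data; the helper `best_degree`, the data, and the choice of five seeds are all hypothetical, and the outcome on the assignment's dataset may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic data (an assumption; not the assignment's dataset).
rng = np.random.default_rng(3)
X_demo = rng.uniform(-3, 3, size=(40, 1))
y_demo = X_demo[:, 0] ** 3 + rng.normal(0, 1, size=40)

def best_degree(X, y, cv, degrees=range(1, 10)):
    """Hypothetical helper: return the degree with the lowest cross-validated MSE."""
    mse_by_degree = {}
    for d in degrees:
        pipe = make_pipeline(PolynomialFeatures(degree=d, include_bias=False),
                             LinearRegression())
        scores = cross_val_score(pipe, X, y, cv=cv,
                                 scoring='neg_mean_squared_error')
        mse_by_degree[d] = -scores.mean()
    return min(mse_by_degree, key=mse_by_degree.get)

# Compare the LOOCV and 5-fold choices across several train/test split seeds.
results = []
for seed in range(5):
    X_tr, _, y_tr, _ = train_test_split(X_demo, y_demo,
                                        test_size=0.2, random_state=seed)
    results.append((seed,
                    best_degree(X_tr, y_tr, LeaveOneOut()),
                    best_degree(X_tr, y_tr, 5)))

for seed, loo_deg, kfold_deg in results:
    print(f'seed={seed}: LOOCV degree={loo_deg}, 5-fold degree={kfold_deg}')
```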
