# Homework #3: Cross-Validation and Norms
by Francisco Reveriano

Cross-validation is used for both model selection and hyperparameter selection, to ensure that the chosen model and/or hyperparameter(s) are not too highly tuned ("overfit") to the data. Here you are going to explore the use of cross-validation to select a model that predicts a car's price from its characteristics. 

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from tqdm.notebook import trange, tqdm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Problem-1" data-toc-modified-id="Problem-1-1">Problem 1</a></span><ul class="toc-item"><li><span><a href="#Reload-Data" data-toc-modified-id="Reload-Data-1.1">Reload Data</a></span></li><li><span><a href="#Problem-A" data-toc-modified-id="Problem-A-1.2">Problem A</a></span></li></ul></li></ul></div>

## Problem 1

Continuing with the 13 continuous predictor variables from the Automobile Data Set from the UCI Machine Learning Repository that you used in Homework #2 to predict a car's price from its characteristics, you are going to further explore the 3 models you proposed in problem 2(a) in Homework #2. 

### Reload Data

In [38]:
# The first step is reading the dataset. The file is comma-separated, so pandas can read it like a .csv file.
data = pd.read_csv("imports-85.data", header=None)

# At this point the dataset has no header. Adding column names makes it easier to select columns.
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

# We add the headers to the table to make it readable. 
data.columns = headers

# The table has 26 columns, but only the 13 continuous predictors and the price are needed.
# Drop the remaining 12 columns with the pandas drop function.
data = data.drop(columns=["symboling", "normalized-losses", "make", "fuel-type", "aspiration", 
                          "num-of-doors", "body-style", "drive-wheels", "engine-location"
                          ,"engine-type", "num-of-cylinders", "fuel-system"])

# Convert the columns that use '?' for missing values to numeric; errors='coerce' turns each '?' into NaN.
# peak-rpm is included because the raw file also marks its missing values with '?'.
for column in ["price", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm"]:
    data[column] = pd.to_numeric(data[column], errors="coerce")

# Drop every row with a missing value in any remaining column (this also removes rows without a price).
data = data.dropna()

# Make a working copy of the cleaned data
Model = data.copy()
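
The '?' markers can also be handled when the file is read. The sketch below is a minimal alternative, not the approach used above: it assumes the same '?' convention and uses a small inline sample (hypothetical rows, three columns only) in place of `imports-85.data` so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for imports-85.data;
# '?' marks a missing value, as in the raw file.
sample = io.StringIO(
    "3.47,111,13495\n"
    "?,154,16500\n"
    "3.19,?,?\n"
    "3.19,102,13950\n"
)

# na_values="?" makes pandas parse '?' as NaN, so the columns come out numeric
# without a separate pd.to_numeric pass.
demo = pd.read_csv(sample, header=None, names=["bore", "horsepower", "price"],
                   na_values="?")

# Drop every row that still has a missing value.
demo_clean = demo.dropna()
print(demo_clean.dtypes)
print(len(demo_clean), "complete rows")
```

With the real file, the same `na_values="?"` argument could be passed to the `pd.read_csv` call above, which would make the per-column conversion step unnecessary.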

### Problem A

Remind us what your proposed model #1 is (write down the equation price = f(features, w), with the parameters w unspecified). 

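price = w_0 + w_1·wheel-base + w_2·length + w_3·width + w_4·height + w_5·curb-weight + w_6·engine-size + w_7·bore + w_8·stroke + w_9·compression-ratio + w_10·horsepower + w_11·peak-rpm + w_12·city-mpg + w_13·highway-mpg

Here w_0 is the intercept and w_1, ..., w_13 are the weights of the 13 continuous predictors.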
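
For a linear model, f(features, w) is an intercept plus a weighted sum of the features. The sketch below illustrates that correspondence with scikit-learn's `LinearRegression`, whose `intercept_` and `coef_` attributes hold the fitted weights. It uses synthetic stand-in data (an assumption for illustration, not the automobile data), and the names `X_demo`, `y_demo`, and `manual` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the automobile data): 50 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Evaluate the equation by hand: intercept plus the weighted sum of the features.
manual = model.intercept_ + X_demo @ model.coef_

# The manual sum agrees with model.predict up to floating-point error.
print(np.allclose(manual, model.predict(X_demo)))
```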

In [39]:
# Call Linear Regression
Linear_Model_1 = LinearRegression()

# We first set a new dataframe
Model_1 = Model.copy()

# Create the X Variables in our model
X = Model_1[["wheel-base", "length", "width", "height", "curb-weight", "engine-size", "bore", "stroke",
               "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg"]]

# Create the Y Variable in our model
Y = Model_1["price"]

# Now We Proceed to run the linear regression
Linear_Model_1.fit(X,Y)

print("Regular Regression Model")
print("Intercept:", Linear_Model_1.intercept_)
print("Coefficients:", Linear_Model_1.coef_)

Regular Regression Model
Intercept: -62068.15319037426
Coefficients: [ 7.04671241e+01 -8.97337480e+01  6.20846258e+02  3.19938816e+02
  1.71246392e+00  1.26674808e+02 -9.18710926e+02 -2.96297261e+03
  2.39724757e+02  3.80152790e+01  2.08564561e+00 -3.08035124e+02
  2.83956094e+02]


For your proposed model #1, perform linear regression with 3×10-fold cross-validation (3 independent repetitions of 10-fold cross-validation) to evaluate the consistency of both the estimated model and the model performance. 
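
One way to set up 3 independent repetitions of 10-fold cross-validation is scikit-learn's `RepeatedKFold` combined with `cross_validate`. The following is a minimal sketch under stated assumptions, not necessarily the procedure the assignment intends: it runs on synthetic stand-in data, and the names `X_demo`, `y_demo`, `scores_demo`, and `coefs_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-in data (not the automobile data): 100 rows, 4 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=100)

# 3 repetitions of 10-fold CV; each repetition uses a different random split.
rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)

# return_estimator=True keeps the fitted model from every fold.
results = cross_validate(LinearRegression(), X_demo, y_demo, cv=rkf,
                         scoring="r2", return_estimator=True)

scores_demo = results["test_score"]                                  # 30 R^2 values
coefs_demo = np.array([est.coef_ for est in results["estimator"]])   # shape (30, 4)

# Consistency of model performance: spread of R^2 across the 30 folds.
print("R^2 mean:", scores_demo.mean(), "std:", scores_demo.std())
# Consistency of the estimated model: spread of each coefficient across the 30 fits.
print("coef mean:", coefs_demo.mean(axis=0))
print("coef std: ", coefs_demo.std(axis=0))
```

Passing `X` and `Y` from the cell above in place of the synthetic arrays would apply the same procedure to model #1.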

In [41]:
# One pass of 10-fold cross-validation (consecutive, unshuffled folds); each score is the R^2 on one held-out fold
scores = cross_val_score(Linear_Model_1, X, Y, cv=10)
print(scores)

[ 0.62206027 -0.19951547  0.80881162  0.86720684 -0.44829771  0.66247507
  0.65921197 -0.5014665  -0.32265775  0.59960085]
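
The per-fold scores can be summarized to judge how consistent the performance is. The sketch below copies the ten R² values printed above into a NumPy array (the name `fold_scores` is hypothetical) and reports their mean, standard deviation, and the number of negative folds. A negative R² means the model predicted that fold worse than a constant equal to the fold's mean price would have. Because `cv=10` splits the rows into consecutive blocks without shuffling, the order of the rows in the file may contribute to the spread; that is a possible explanation, not something verified here.

```python
import numpy as np

# Per-fold R^2 scores copied from the output above.
fold_scores = np.array([0.62206027, -0.19951547, 0.80881162, 0.86720684, -0.44829771,
                        0.66247507, 0.65921197, -0.5014665, -0.32265775, 0.59960085])

print("mean R^2:", fold_scores.mean())
print("std  R^2:", fold_scores.std())
print("folds with negative R^2:", int((fold_scores < 0).sum()))
```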
