## Homework Assignment 3 - Evan Callaghan

### 1. You are given a dataset having more variables than observations. Assuming that there seems to be a linear relationship between the target variable and the input variables in the dataset, why is ordinary least squares (OLS) a bad option to estimate the model parameters? Which technique would be best to use? Why?

#### Given a dataset with more variables than observations, ordinary least squares is a bad option because the least squares coefficient estimates are not unique: with more columns than rows, the matrix XᵀX is singular, so infinitely many coefficient vectors fit the training data equally well. A better technique in this situation is Lasso. Lasso shrinks some coefficient estimates to exactly zero, so it acts as a variable selector. This gives a sparse model built on the variables that are most influential in predicting the target variable, and that model is much easier to interpret.
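The non-uniqueness argument can be checked numerically. The following is a minimal sketch on synthetic data (the sizes, seed, and variable names are assumptions for illustration, not part of the assignment): it shows that XᵀX is rank-deficient when there are more variables than observations, and that two different coefficient vectors reproduce the training targets equally well.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 5 observations, 8 variables.
rng = np.random.default_rng(0)
n_obs, n_vars = 5, 8
X = rng.normal(size=(n_obs, n_vars))
y = rng.normal(size=n_obs)

# The normal equations need the inverse of X^T X, but its rank is at most n_obs.
gram = X.T @ X
gram_rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", gram_rank, "out of", n_vars)

# Two different coefficient vectors with the same fit:
# the minimum-norm solution, and that solution plus a null-space direction of X.
beta_a = np.linalg.pinv(X) @ y
null_direction = np.linalg.svd(X)[2][-1]  # last right-singular vector, X @ v is ~0
beta_b = beta_a + 10.0 * null_direction

print("max |X beta_a - y|:", np.max(np.abs(X @ beta_a - y)))
print("max |X beta_b - y|:", np.max(np.abs(X @ beta_b - y)))
```

Both residuals are numerically zero even though `beta_a` and `beta_b` differ, which is why OLS alone cannot single out one set of estimates here.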

### 2. For Ridge regression, if the regularization parameter, λ, is equal to 0, what are the implications?

#### F. Large coefficients in the linear model are not penalized and the objective function is the same as the ordinary least squares objective function.
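As a quick check of this statement, here is a small sketch using the closed-form ridge solution without an intercept, on synthetic data. The helper `ridge_coefficients` and the data are hypothetical and only for illustration; with λ = 0 the penalty term vanishes and the estimates coincide with least squares.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution (no intercept): (X^T X + lam * I)^{-1} X^T y."""
    n_vars = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_vars), X.T @ y)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]       # ordinary least squares
beta_ridge_0 = ridge_coefficients(X, y, lam=0.0)      # ridge with lambda = 0
beta_ridge_10 = ridge_coefficients(X, y, lam=10.0)    # ridge with a real penalty

print("OLS:            ", beta_ols)
print("Ridge, lam = 0: ", beta_ridge_0)
print("Ridge, lam = 10:", beta_ridge_10)
```

The first two rows agree up to floating-point error, while the λ = 10 estimates are pulled toward zero.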

### 3. For Lasso Regression, if the regularization parameter, λ, is very high, which options are true? Select all that apply.

#### F. Can be used to select important features of a dataset and shrinks the coefficients of less important features to exactly 0.
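To see why a large λ produces exact zeros, consider the special case of an orthonormal design, where the Lasso estimate (for the objective ½‖y − Xβ‖² + λ‖β‖₁) is the soft-thresholded least squares estimate. The sketch below assumes that special case; the starting coefficients are made-up numbers, not values from the car price data.

```python
import numpy as np

def soft_threshold(beta, lam):
    """Shrink each coefficient toward zero by lam and clip at exactly zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

# Hypothetical least squares estimates, from most to least influential.
beta_ols = np.array([4.0, -2.5, 0.8, -0.3, 0.05])

for lam in [0.1, 1.0, 3.0, 10.0]:
    shrunk = soft_threshold(beta_ols, lam)
    n_zero = int(np.sum(shrunk == 0))
    print(f"lambda = {lam:>4}: {shrunk}  (zeros: {n_zero})")
```

Small coefficients are removed first, and once λ exceeds the largest absolute coefficient every estimate is zero, which is the extreme case of a very high λ.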

### 4. Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization parameter, λ, or reduce it?

#### If the training and validation errors are almost equal and fairly high, the model is under-fitting and therefore suffers from high bias. High bias typically indicates a model that is over-simplified. As a solution, we should reduce the regularization parameter, λ, which relaxes the penalty on the coefficients and lets the model become more flexible.
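The effect of reducing λ on the fit can be illustrated with the closed-form ridge solution (no intercept) on synthetic data. Everything in this sketch, including the grid of λ values, is an assumption for illustration; it only shows that the training error does not increase as the penalty is relaxed.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=100)

lambdas = [1000.0, 100.0, 10.0, 1.0, 0.1]  # from heavy to light regularization
train_mse = []

for lam in lambdas:
    # Closed-form ridge estimate for this lambda
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    train_mse.append(float(np.mean((X @ beta - y) ** 2)))

for lam, mse in zip(lambdas, train_mse):
    print(f"lambda = {lam:>6}: training MSE = {mse:.4f}")
```

Whether the validation error also improves has to be checked on held-out data; a lower training error alone does not guarantee it.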

In [1]:
## 5. a) Using the pandas library to read the csv data file and create a data-frame called car_price

import boto3
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV

## Defining the bucket
s3 = boto3.resource('s3')
bucket_name = 'data-445-bucket-callaghan'
bucket = s3.Bucket(bucket_name)

## Defining the csv file
file_key = 'CarPrice_Assignment.csv'

bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

car_price = pd.read_csv(file_content_stream)

car_price.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [2]:
## b) Using wheelbase, enginesize, compressionratio, horsepower, peakrpm, citympg, and highwaympg as the predictor variables,
## and price as the target variable, do the following:
##     (1) Split the data into train (80%) and test (20%)
##     (2) Using the train dataset:
##       (i) Estimate the optimal lambda using default values for lambda in scikit-learn and
##           5-folds. Make sure to normalize the data (normalize = True).
##       (ii) Perform LASSO as a variable selector (using the optimal lambda from previous
##           step (i)). Make sure to normalize the data (normalize = True).


## Defining input and target variables
X = car_price[['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']]
Y = car_price['price']

## Defining coefficients data frame
coefficients = pd.DataFrame(columns = ['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg'])

for i in range(0, 1000):
    
    ## Splitting the data
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

    ## Estimating the optimal lambda
    lasso_cv = LassoCV(normalize = True, cv = 5).fit(X_train, Y_train)

    ## Extracting the optimal alpha 
    cv_alpha = lasso_cv.alpha_

    ## LASSO as a variable selector
    lasso_md = Lasso(alpha = cv_alpha, normalize = True, max_iter = 10000).fit(X_train, Y_train)
    
    ## Extracting the estimated coefficients and putting them into a data frame
    coefs = pd.DataFrame(lasso_md.coef_).T
    coefs.columns = ['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']
    
    ## Appending the coefficients into the data frame for all coefficients
    dataframes = [coefficients, coefs]
    coefficients = pd.concat(dataframes)


coefficients = coefficients.reset_index().drop(columns = ['index'])

## Repeating steps (1) and (2) 1000 times and storing the estimated model coefficients of each iteration in a data-frame. 

In [3]:
## Removing the variables whose estimated coefficient is 0 more than 500 times from the training and testing datasets.

n = coefficients.shape[0]
m = coefficients.shape[1]

for i in range(0, m):
    
    zero_count = 0
    
    for j in range(0, n):
        
        if (coefficients.iloc[j, i] == 0):
            zero_count += 1
            
    print("Column in Coefficients:", i)
    print("Zero Count:", zero_count)
    
## Since the column "highwaympg" has been set to zero 728 times, we are dropping it from the train and test sets

Column in Coefficients: 0
Zero Count: 0
Column in Coefficients: 1
Zero Count: 0
Column in Coefficients: 2
Zero Count: 0
Column in Coefficients: 3
Zero Count: 0
Column in Coefficients: 4
Zero Count: 0
Column in Coefficients: 5
Zero Count: 27
Column in Coefficients: 6
Zero Count: 728
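The same count can be obtained without nested loops. The sketch below uses a small hypothetical stand-in for the `coefficients` data frame (four made-up rows instead of 1000) and applies the "zero in more than half of the iterations" rule; the names `coefficients_demo`, `threshold`, and `to_drop` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the `coefficients` data frame (made-up values).
coefficients_demo = pd.DataFrame({
    'citympg':    [-12.5, 0.0, -3.1, -8.4],
    'highwaympg': [0.0, 0.0, 0.0, 5.2],
})

# Number of iterations in which each coefficient was exactly zero
zero_counts = (coefficients_demo == 0).sum()

# Variables set to zero in more than half of the iterations (500 of 1000 in the assignment)
threshold = len(coefficients_demo) / 2
to_drop = zero_counts[zero_counts > threshold].index.tolist()

print(zero_counts)
print("Variables to drop:", to_drop)
```

Reporting the counts by column name also avoids having to map a column position back to a variable by hand.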


In [4]:
## c) Splitting the data into train (80%) and test (20%). Then, normalizing the input variables of the train and test datasets using the L2 normalization

X = car_price[['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg']]
Y = car_price['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

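def l2_normalization(X):
    ## Centering the values and dividing by the L2 norm of the original (uncentered) values
    X_mean = np.mean(X)
    l2 = np.sqrt(np.sum(X**2))
    l2_norm = (X - X_mean) / l2
    return l2_norm

## Note: axis = 1 passes each row (one car) to the function, so the values are normalized across
## the six features of an observation rather than within each feature column
X_train = X_train.apply(l2_normalization, axis = 1)
X_test = X_test.apply(l2_normalization, axis = 1)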
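The cell above calls `apply(..., axis = 1)`, which hands each row to the function. If the intent is to normalize each input variable instead, one possible column-wise reading of L2 normalization is sketched below: center each column with the training mean and divide by the L2 norm of the centered training column, then reuse those training statistics on the test set. The helper names and the tiny demo frames are hypothetical, and this is not necessarily the exact scaling the assignment expects.

```python
import numpy as np
import pandas as pd

def fit_l2_scaler(train):
    """Return the per-column mean and the L2 norm of the centered training columns."""
    mean = train.mean(axis=0)
    norm = np.sqrt(((train - mean) ** 2).sum(axis=0))
    return mean, norm

def apply_l2_scaler(data, mean, norm):
    """Scale any data frame using statistics estimated on the training set."""
    return (data - mean) / norm

# Small illustrative frames (not the actual train/test split).
train_demo = pd.DataFrame({'enginesize': [130.0, 152.0, 109.0, 136.0],
                           'horsepower': [111.0, 154.0, 102.0, 115.0]})
test_demo = pd.DataFrame({'enginesize': [120.0], 'horsepower': [110.0]})

mean, norm = fit_l2_scaler(train_demo)
train_scaled = apply_l2_scaler(train_demo, mean, norm)
test_scaled = apply_l2_scaler(test_demo, mean, norm)

print(train_scaled)
print(test_scaled)
```

Using the training mean and norm for the test set keeps both sets on the same scale and avoids leaking test information into the preprocessing. Switching to this scaling would change the MSE values reported in the later cells, so they would need to be recomputed.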

In [5]:
## d) Using the train dataset to build a linear regression model

## Building the model
lm_md = LinearRegression().fit(X_train, Y_train)

## Predicting on the test set
lm_md_preds = lm_md.predict(X_test)

## Calculating the MSE
mse = np.mean(np.power(lm_md_preds - Y_test, 2))

## Reporting the MSE of this model.
mse

## The MSE of this linear regression model is about 14,943,727

14943727.910315841

In [8]:
## e) Using the train dataset to build a Ridge regression model as follows:
##     (i) Using the train dataset to estimate the optimal lambda from the following set [0.001, 0.01, 0.1, 1, 10, 100] and using 5-folds.
##     (ii) Repeating (i) 100 times and storing the optimal lambda of each iteration.

lambda_estimates = []

for i in range(0, 100):
    
    ## Estimating alpha
    ridge_cv = RidgeCV(alphas = [0.001, 0.01, 0.1, 1, 10, 100], cv = 5).fit(X_train, Y_train)

    lambda_estimates.append(ridge_cv.alpha_)

lambda_estimates = pd.DataFrame(lambda_estimates)

lambda_estimates.value_counts()

0.001    100
dtype: int64
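All 100 iterations return the same value, which is expected: an integer `cv` uses unshuffled K-fold splits, so refitting on the same training set reproduces the same folds every time. One way to make the repetitions informative is to reshuffle the folds in each iteration, as in this sketch on synthetic data (the data, the number of repeats, and the names `X_demo`, `y_demo`, and `chosen` are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(80, 6))
y_demo = X_demo @ rng.normal(size=6) + rng.normal(scale=2.0, size=80)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
chosen = []

for i in range(20):
    # A different shuffled 5-fold split in every iteration
    folds = KFold(n_splits=5, shuffle=True, random_state=i)
    ridge_cv = RidgeCV(alphas=alphas, cv=folds).fit(X_demo, y_demo)
    chosen.append(ridge_cv.alpha_)

# Most frequently selected lambda across the repetitions
values, counts = np.unique(chosen, return_counts=True)
most_common_alpha = values[np.argmax(counts)]
print(dict(zip(values.tolist(), counts.tolist())))
print("Most common lambda:", most_common_alpha)
```

An alternative with the same effect is to draw a new train/test split inside the loop, as was done for the Lasso step in part (b).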

In [9]:
## Using the most common lambda of the 100 optimal lambdas and the train dataset to build a Ridge regression model

ridge_alpha = 0.001

## Building the ridge model
ridge_md = Ridge(alpha = ridge_alpha).fit(X_train, Y_train)

## Predicting on the test set
ridge_md_preds = ridge_md.predict(X_test)

## Calculating the MSE of ridge model
mse2 = np.mean(np.power(ridge_md_preds - Y_test, 2))

## Reporting the MSE of this model
mse2

## The MSE of this ridge regression model is about 17,263,443

17263443.605159502

In [None]:
## f) Using the results from parts (d) and (e), we would select the linear regression model to predict car prices because it
## has the smaller test MSE (about 14.9 million versus about 17.3 million for the Ridge model).