### Exercise 1 - You are given a dataset with more variables than observations. Assuming there is a linear relationship between the target variable and the input variables, why is ordinary least squares (OLS) a bad option for estimating the model parameters? Which technique would be best to use? Why?

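**Answer:** OLS is a bad option because, with more variables than observations, there is no unique set of least squares coefficient estimates: infinitely many coefficient vectors fit the training data perfectly, so the fit has very high variance and overfits. Ridge regression or LASSO is a better choice, since both shrink the regression coefficients towards zero and give a stable solution. LASSO can also set some coefficients exactly to zero, which removes variables from the model.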
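
The non-uniqueness can be checked numerically. The following is a minimal sketch on made-up data (the sizes `n = 5`, `p = 8`, the seed, and `lam = 1.0` are assumptions for illustration, not part of the exercise): it builds two different coefficient vectors that both fit the data exactly, and then shows that adding the ridge penalty makes the system solvable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with more variables (p = 8) than observations (n = 5).
n, p = 5, 8
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X'X is p x p but has rank at most n, so it is singular and the
# normal equations have infinitely many solutions.
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)

# Two different coefficient vectors that both fit the data exactly:
# the minimum-norm solution, and that solution shifted along a
# direction that X maps to (numerically) zero.
beta_min_norm = np.linalg.pinv(X) @ y
null_direction = np.linalg.svd(X)[2][-1]  # last right singular vector
beta_other = beta_min_norm + 3.0 * null_direction

# Ridge adds lam * I to X'X, which makes the system invertible,
# so the ridge solution is unique.
lam = 1.0
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)

print("rank of X'X:", rank, "out of", p)
print("residual norms:",
      np.linalg.norm(X @ beta_min_norm - y),
      np.linalg.norm(X @ beta_other - y))
print("ridge coefficients:", beta_ridge)
```

Both residual norms are zero up to floating-point error even though the two coefficient vectors differ, which is exactly why OLS cannot pick one of them.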

### Exercise 2 - For Ridge regression, if the regularization parameter, λ, is equal to 0, what are the implications?

**Answer:** D) All the above
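
The answer options are not reproduced here, but the core fact behind this exercise is that Ridge with λ = 0 has no penalty and reduces to OLS. A minimal sketch on synthetic data (the data, the seed, and the helper `ridge_coefficients` are assumptions for illustration) using the closed-form ridge solution without an intercept:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with more observations than variables, so OLS is well defined.
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution (X'X + lam * I)^(-1) X'y, no intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_ridge_0 = ridge_coefficients(X, y, lam=0.0)    # no penalty
beta_ridge_10 = ridge_coefficients(X, y, lam=10.0)  # some shrinkage

print("OLS:            ", beta_ols)
print("Ridge, lam = 0: ", beta_ridge_0)
print("Ridge, lam = 10:", beta_ridge_10)
```

With λ = 0 the two coefficient vectors coincide; with λ = 10 the coefficient vector has a smaller norm than the OLS one.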

### Exercise 3 - For Lasso Regression, if the regularization parameter, λ, is very high, which options are true? Select all that apply.

**Answer:** A and B are true, so F is the answer
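
The answer options are not reproduced here, but the behaviour for a very high λ can be illustrated directly: once λ is large enough, LASSO sets every coefficient exactly to zero. The sketch below uses a small hand-written coordinate descent (an assumption for illustration, not scikit-learn's implementation) for the objective (1/(2n))·||y − Xβ||² + λ·||β||₁ on synthetic data.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of variable j added back.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=100)
y = y - y.mean()  # centre the target since no intercept is fitted

# Smallest lambda for which the all-zero vector is the solution.
lam_max = np.max(np.abs(X.T @ y)) / len(y)

beta_small = lasso_coordinate_descent(X, y, lam=0.01)
beta_large = lasso_coordinate_descent(X, y, lam=10 * lam_max)

print("small lambda:", beta_small)
print("large lambda:", beta_large)
```

With the large λ the model predicts a constant, so it underfits: all variables are dropped and the bias is high.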

### Exercise 4 - An important theoretical result of statistics and Machine Learning is the fact that a model’s generalization error can be expressed as the sum of two very different errors:

• Bias: This part of the generalization error is due to wrong assumptions, such as assuming
that the data is linear when it is actually quadratic. A high-bias model is most likely to
under-fit the training data.

• Variance: This part is due to the model’s excessive sensitivity to small variations in the
training data. A model with many degrees of freedom (such as a high-degree polynomial
model) is likely to have high variance and thus overfit the training data.

Suppose you are using Ridge Regression and you notice that the training error and
the validation error are almost equal and fairly high. Would you say that the model suffers from
high bias or high variance? Should you increase the regularization parameter, λ, or reduce it?


**Answer:** High bias, and you should reduce the regularization parameter λ
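
This pattern can be reproduced with a minimal sketch on synthetic data (the data, the seed, and the two λ values are assumptions for illustration). With a very large λ the coefficients are shrunk almost to zero, so the training and validation errors are both high and close to each other; reducing λ lowers both.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear data: 200 training and 200 validation observations.
n_train, n_val, p = 200, 200, 5
true_beta = np.array([4.0, -3.0, 2.0, -1.0, 0.5])
X_train = rng.normal(size=(n_train, p))
X_val = rng.normal(size=(n_val, p))
y_train = X_train @ true_beta + rng.normal(size=n_train)
y_val = X_val @ true_beta + rng.normal(size=n_val)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution without an intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, beta):
    """Mean squared error of the linear predictions X @ beta."""
    return np.mean((y - X @ beta) ** 2)

errors = {}
for lam in [10000.0, 1.0]:
    beta = ridge_fit(X_train, y_train, lam)
    errors[lam] = (mse(X_train, y_train, beta), mse(X_val, y_val, beta))
    print(f"lambda = {lam:>8}: train MSE = {errors[lam][0]:.2f}, "
          f"validation MSE = {errors[lam][1]:.2f}")
```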

### Exercise 5 - Consider the CarPrice Assignment.csv data file. This data has information on cars (characteristics related to car dimensions, engine and more). The goal is to use car information to predict the price of the car. In Python, answer the following:

##### 5a - Using the pandas library, read the csv data file and create a data-frame called car_price.<br>
##### 5b - Split the data into train (80%) and test (20%).<br>
##### 5c - Using wheelbase, enginesize, compressionratio, horsepower, peakrpm, citympg, and highwaympg as the predictor variables (from the train dataset) and price as the target variable (from the train dataset), do the following:<br>
###### (1) Using the train dataset (make sure to normalize the input data with StandardScaler and Pipeline), (i) estimate the optimal lambda in the LASSO model using the scikit-learn default values for lambda (that is, you don’t need to define the range of values for alphas) and 5 folds. (ii) Perform LASSO as a variable selector (using the optimal lambda from the previous step). Repeat step (1) 1000 times. Store the estimated model coefficients of each iteration in a data-frame. Remove the variables whose estimated coefficient is 0 more than 500 times from the training and testing datasets. <br>
##### 5d - Using the results from part (c), build a linear regression model. After that, use this model to predict on the test dataset. Report the MSE of this model. Make sure to normalize the input data with StandardScaler and Pipeline.<br>
##### 5e - Using the results from part (c), build a Ridge regression model as follows:
###### Using the train dataset, estimate the optimal lambda from the following set [0.001, 0.01, 0.1, 1, 10, 100] using 5 folds. Make sure to normalize the input data with StandardScaler and Pipeline. Repeat this step 100 times and store the optimal lambda of each iteration. Use the most common of the 100 optimal lambdas to build a Ridge regression model. After that, use this model to predict on the test dataset. Report the MSE of this model. Make sure to normalize the input data with StandardScaler and Pipeline.<br>
##### 5f - Using the results from parts (d) and (e), what model would you use to predict car price? Explain.






In [49]:
import pandas as pd
car_price = pd.read_csv('CarPrice_Assignment(6).csv')
car_price.head()

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |


In [50]:
import numpy as np
from tqdm import tqdm
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.metrics import mean_squared_error

In [51]:
#Input variables
X = car_price[['wheelbase', 'enginesize', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']]

#Target variable
Y = car_price['price']

#Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

In [52]:
lasso_coef = list()

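#Repeat 1000 times. LASSO as a variable selector
for i in tqdm(range(0, 1000)):
    #5-fold splitter, reshuffled with a different seed in each iteration
    kf = KFold(n_splits = 5, shuffle = True, random_state = i)

    #(i) Estimate the optimal lambda with the default grid of alphas
    lasso_cv = Pipeline([('scaler', StandardScaler()),
                         ('model', LassoCV(cv = kf))]).fit(X_train, Y_train)

    lasso_lambda = lasso_cv.named_steps['model'].alpha_

    #(ii) Fit LASSO with the optimal lambda
    lasso_md = Pipeline([('scaler', StandardScaler()),
                         ('model', Lasso(alpha = lasso_lambda))]).fit(X_train, Y_train)

    #Store the estimated coefficients with .append
    lasso_coef.append(lasso_md.named_steps['model'].coef_)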

100%|██████████| 1000/1000 [02:17<00:00,  7.26it/s]


In [53]:
lasso_coef = pd.DataFrame(lasso_coef)
lasso_coef.columns = X.columns.tolist()
lasso_coef.head()

| | wheelbase | enginesize | compressionratio | horsepower | peakrpm | citympg | highwaympg |
|---|---|---|---|---|---|---|---|
| 0 | 1152.40009 | 4515.157914 | 1305.553496 | 1234.262201 | 704.601631 | -1161.262878 | -80.390506 |
| 1 | 1154.083824 | 4530.958694 | 1319.422579 | 1221.468942 | 721.846824 | -1180.073669 | -70.619115 |
| 2 | 1157.589115 | 4563.955924 | 1348.3851 | 1194.775556 | 757.869101 | -1219.121536 | -50.432917 |
| 3 | 1155.095132 | 4540.147626 | 1327.553812 | 1214.048151 | 731.943176 | -1191.06273 | -64.905037 |
| 4 | 1150.034878 | 4494.666604 | 1287.229642 | 1250.791252 | 681.905031 | -1136.247247 | -93.566204 |


In [54]:
#Total times it equals 0
(lasso_coef == 0).sum(axis = 0)

wheelbase           0
enginesize          0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
dtype: int64

In [55]:
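#Rule: remove variables whose coefficient equals 0 in more than 500 of the 1000 iterations.
#'highwaympg' is removed from both datasets here; note that the counts printed above show
#no zero coefficients for this (unseeded) split, so re-check the counts before dropping.
X_train = X_train.drop(columns = 'highwaympg')
X_test = X_test.drop(columns = 'highwaympg')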
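
Because the train/test split is not seeded, the zero counts can change from run to run, so it may be safer to derive the columns to drop from the counts instead of hard-coding a name. A minimal sketch on a made-up coefficient table (the `toy_` frames and their values are hypothetical; in the notebook the same logic would be applied to `lasso_coef`, `X_train`, and `X_test`):

```python
import pandas as pd

# Hypothetical coefficient table: one row per repetition, one column per predictor.
toy_coef = pd.DataFrame({
    'enginesize': [4500.0, 4510.0, 4490.0, 4520.0],
    'horsepower': [1200.0, 0.0, 1210.0, 1190.0],
    'highwaympg': [0.0, 0.0, 0.0, -50.0],
})

# Hypothetical training inputs with the same columns.
toy_X_train = pd.DataFrame({
    'enginesize': [130, 152],
    'horsepower': [111, 154],
    'highwaympg': [27, 26],
})

# "More than half of the repetitions" (500 when there are 1000 repetitions).
threshold = len(toy_coef) / 2

# Count how often each coefficient is exactly 0 and pick the columns over the threshold.
zero_counts = (toy_coef == 0).sum(axis=0)
to_drop = zero_counts[zero_counts > threshold].index.tolist()

toy_X_train_reduced = toy_X_train.drop(columns=to_drop)

print("zero counts:\n", zero_counts)
print("dropped:", to_drop)
print("remaining columns:", toy_X_train_reduced.columns.tolist())
```

If no count exceeds the threshold, `to_drop` is empty and `drop` leaves the data unchanged.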

In [56]:
#Build model
lm_md = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, Y_train)

#Predict
lm_pred = lm_md.predict(X_test)

#MSE
lm_mse = mean_squared_error(Y_test, lm_pred)
print('Linear Regression MSE is ', lm_mse)

Linear Regression MSE is  16647977.984586375


In [57]:
ridge_lambda = list()

#Repeat 100 times. Ridge regression
for i in tqdm(range(0, 100)):

    #5-fold splitter, reshuffled with a different seed in each iteration
    kf = KFold(n_splits = 5, shuffle = True, random_state = i)

    #Estimate the optimal lambda from the given set
    ridge_cv = Pipeline([('scaler', StandardScaler()),
                         ('model', RidgeCV(alphas = [0.001, 0.01, 0.1, 1, 10, 100],
                                           cv = kf))]).fit(X_train, Y_train)

    #Store lambda with .append
    ridge_lambda.append(ridge_cv.named_steps['model'].alpha_)

100%|██████████| 100/100 [00:08<00:00, 12.34it/s]
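
The most common of the stored lambdas can be read off with a frequency count. A minimal sketch, using a short made-up list in place of `ridge_lambda` (the values in `example_lambdas` are hypothetical, not the results of the loop above):

```python
from collections import Counter

# Hypothetical list of optimal lambdas, one per repetition.
example_lambdas = [10, 1, 10, 100, 10, 1, 10]

# most_common(1) returns a list with the single (value, count) pair
# that occurs most often.
most_common_lambda, count = Counter(example_lambdas).most_common(1)[0]

print("most common lambda:", most_common_lambda, "chosen", count, "times")
```

The resulting value is what would then be passed as `alpha` to `Ridge`.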


In [58]:
#Build model with the chosen lambda (alpha = 10)
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha = 10)).fit(X_train, Y_train)

#Predict
ridge_pred = ridge_model.predict(X_test)

#MSE
ridge_mse = mean_squared_error(Y_test, ridge_pred)
print('Ridge mse is ', ridge_mse)

Ridge mse is  17153297.00556623


**Answer:** Of the two models above, I would use the linear regression model, since its test MSE (about 16.65 million) is slightly lower than that of the Ridge model (about 17.15 million).
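
To put the two test errors on the scale of the price variable, the MSE values printed above can be converted to RMSE. This is a small arithmetic sketch that only reuses the two reported numbers; it is not an additional model evaluation.

```python
import math

# Test MSE values as printed by the two cells above.
lm_mse = 16647977.984586375
ridge_mse = 17153297.00556623

# RMSE is in the same units as price, which makes the gap easier to read.
lm_rmse = math.sqrt(lm_mse)
ridge_rmse = math.sqrt(ridge_mse)

# Relative increase in MSE when going from linear regression to Ridge.
relative_gap = (ridge_mse - lm_mse) / lm_mse

print(f"Linear regression RMSE: {lm_rmse:.2f}")
print(f"Ridge RMSE:             {ridge_rmse:.2f}")
print(f"Ridge MSE is {relative_gap:.1%} higher")
```

The gap is about 3% in MSE on a single unseeded split, so the preference for linear regression is a mild one and could change with a different split.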