# HW3 Problem 1 (15 points): Linear Regression [TA: Vodelina Samatova]

In this exercise, you will apply linear regression and Lasso regression to the supplied dataset and then compare their results to determine whether Lasso regression is needed for this dataset. Additionally, you will use sklearn's pipeline framework, which is helpful when you have a sequence of transforms (e.g. normalization) followed by estimators (e.g. classifiers or regressors).

**Note**: This assignment will have less provided code - you have to write most of it yourself. Remember to use prior homeworks as examples, and **always use the suggested random seed** to ensure the test cases work as provided.

**Dataset description**: You are provided a dataset with 20 variables. Variables $x1\ -\ x19$ refer to the independent variables, while variable $y$ is your dependent variable. Training data is stored in the file `./regression-train.csv`.

**Note on Test Cases**: TAs will use a test set to verify your solution. The format (independent variables $x1\ -\ x19$, dependent variable $y$) will be the same, but the TAs' file may contain a different number of data points than the test split you create from the training set. Please take this into account, and do not hard-code any dimensions.


In [None]:
import warnings
warnings.filterwarnings('ignore')

## Part 0: Add necessary imports

As you work through the homework, don't forget to add imports.
We often put imports at the top of the file.
For this assignment, you'll likely want to import pandas and numpy.

In [None]:
# Import necessary libraries


## Part 1: Linear Regression and Lasso Regression

You will write code to normalize and train simple linear regression and Lasso Regression using scikit-learn.

### 1.1 Loading Dataset

Load the dataset from the file `./regression-train.csv` into a pandas DataFrame `df`, assign columns $x1\ -\ x19$ to a variable `X`, and assign column $y$ to a variable `y`.
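
As an illustration of the pattern only, here is a minimal sketch on a tiny in-memory CSV. The three-feature layout, the values, and the `toy_` names are made up for the example, and it assumes the file has a header row with a target column named `y`. Selecting columns by name keeps the code independent of the number of rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file: three feature columns and a target "y".
csv_text = "x1,x2,x3,y\n1.0,2.0,3.0,10.0\n4.0,5.0,6.0,20.0\n7.0,8.0,9.0,30.0\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

# Take the target by name and treat every other column as a feature,
# so nothing is hard-coded about how many rows the file contains.
toy_X = toy_df.drop(columns=["y"])
toy_y = toy_df["y"]

print(toy_X.shape, toy_y.shape)
```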

In [None]:
#TODO: Read the data

df = None
X = None
y = None

#TODO: Output the data
df

In [None]:
import numpy as np
np.testing.assert_equal(df.shape, (132,20))

In [None]:
# Note: we will run hidden test cases too


### 1.2 Train/Test Split

Create an 80% train / 20% test split, using **0** as the random state.
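
For reference, a minimal sketch of how `train_test_split` behaves on a small synthetic array (the `toy_` arrays are hypothetical and unrelated to the homework data). The split sizes follow from `test_size`, so the same call works for any number of rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

print(toy_X_train.shape, toy_X_test.shape)
```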

In [None]:
from sklearn.model_selection import train_test_split

#TODO
X_train = None
X_test = None
y_train = None
y_test = None


In [None]:
np.testing.assert_equal(X_train.shape, (105,19))
np.testing.assert_equal(X_test.shape, (27,19))

In [None]:
# Note: we will run hidden test cases too


### 1.3 Linear Regression vs. Lasso CV

In this section you will compare a LinearRegression with standard hyperparameters to a LassoCV model.

Before you begin, read the documentation on sklearn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) - a Lasso regression model that uses CV to tune its hyperparameters.

**Note** that the LassoCV model has *built-in* cross-validation, which it performs on the training dataset provided, to select the shrinkage coefficient (the regularization strength `alpha`) that performs best on the held-out validation folds.

For regression, it is particularly important to normalize our data before training the model (ensuring all variables are on the same scale), so we can better interpret our coefficients. For both models, make sure the data is scaled first using **a standard scaler**, fit to the training data. Hint: you can use sklearn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to achieve this.

Note that we could use a pipeline for this process, but to make things easier, we will instead normalize our `X_train` and `X_test` variables.

Complete the following:
1. Fit a StandardScaler to the training dataset, and then normalize the training and test datasets. (**Note**: we fit the scaler only to the training dataset - just like our model - we cannot use the test dataset to fit any part of our pipeline).
2. Create both a LinearRegression model, and a LassoCV that uses **10 folds** for cross-validation and has a random state of **0**. 
3. Then fit both models to the normalized training dataset.
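
The three steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the reference solution: the `toy_` arrays, their sizes, and the generated target are hypothetical, and only the calls to `StandardScaler`, `LinearRegression`, and `LassoCV` mirror the instructions.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 4 features, only the first one drives the target.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=3.0, size=(80, 4))
toy_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))
toy_target = 2.0 * toy_train[:, 0] + rng.normal(scale=0.1, size=80)

# Step 1: learn the mean and standard deviation from the training rows only,
# then apply that same transformation to both splits.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Steps 2 and 3: create both models and fit them to the scaled training data.
toy_linear = LinearRegression().fit(toy_train_scaled, toy_target)
toy_lasso = LassoCV(cv=10, random_state=0).fit(toy_train_scaled, toy_target)

print("alpha chosen by cross-validation:", toy_lasso.alpha_)
```

Because the scaler is fit on the training rows alone, the scaled test columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1; that is expected.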

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV

#TODO
scaler = None
linear_regressor = None
Lasso_regressor = None



In [None]:
# Test that the training dataset has been normalized
# Go through each attribute
for i in range(X_train.shape[1]):
    # Assert that the mean is near 0 and the standard deviation is near 1
    np.testing.assert_almost_equal(np.mean(X_train[:,i]), 0)
    np.testing.assert_almost_equal(np.std(X_train[:,i]), 1)

np.testing.assert_almost_equal(scaler.n_features_in_, 19)
np.testing.assert_almost_equal(linear_regressor.n_features_in_, 19)
np.testing.assert_almost_equal(Lasso_regressor.n_features_in_, 19)

In [None]:
# Note: we will run hidden test cases too


### 1.4 Inference and Evaluation
Calculate the training and testing RMSE for both models and assign them to the corresponding variables.
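
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between true and predicted values. A minimal NumPy sketch follows; the helper name `rmse_by_hand` and the sample numbers are hypothetical and exist only for this illustration.

```python
import numpy as np


def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Residuals are 0, 0, and 2, so the mean squared error is 4/3.
print(rmse_by_hand([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```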

Which model do you expect will have lower training error? What about testing error? Why?

**ANSWER HERE**

In [None]:
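# Note: You can use this function to calculate RMSE when the true values and predicted values are known.
import math
# Import the function directly; a bare `import sklearn` does not guarantee that sklearn.metrics is loaded.
from sklearn.metrics import mean_squared_error

def calculate_rmse(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))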

#TODO
LR_training_RMSE = None
LR_testing_RMSE = None
Lasso_training_RMSE = None
Lasso_testing_RMSE = None


print(f"Linear: Training RMSE = {LR_training_RMSE}; Testing RMSE = {LR_testing_RMSE}\nLasso:  Training RMSE = {Lasso_training_RMSE}; Testing RMSE = {Lasso_testing_RMSE}")

In [None]:
np.testing.assert_almost_equal(LR_training_RMSE, 524.9532838526169)
np.testing.assert_almost_equal(Lasso_training_RMSE, 541.6957360523041)

In [None]:
# Note: we will run hidden test cases too


Review your prediction above. Were you correct?

## Part 2: Parameters of Estimators

You can access the parameters specific to the estimators. If you have been using a pipeline, see the documentation on [Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html).
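
If the scaler and the model were wrapped in a pipeline, the fitted estimator could be reached through `named_steps`. The sketch below uses synthetic data and hypothetical `toy_` names; it relies on `make_pipeline` naming each step after its lowercased class name.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 3 features, only the first one drives the target.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 3))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=60)

# make_pipeline names the steps "standardscaler" and "lassocv".
toy_pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
toy_pipe.fit(toy_X, toy_y)

# Reach into the fitted pipeline to read the coefficients of the final estimator.
toy_coefs = toy_pipe.named_steps["lassocv"].coef_
print(toy_coefs)
```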

### 2.1 Parameters

Save the parameters of each model to the corresponding variable using the model's `.coef_` attribute.

In [None]:
#TODO
LR_parameter = None
Lasso_parameter = None

print("Linear Regression parameters:")
print(LR_parameter)
print("\nLasso Regression parameters:")
print(Lasso_parameter)

In [None]:
np.testing.assert_almost_equal(LR_parameter[10], -23.517789172238473)
np.testing.assert_almost_equal(Lasso_parameter[10], 0.0)

In [None]:
# Note: we will run hidden test cases too


From the results, compare the two regression models, including the training and testing RMSE and the coefficients. Use these outputs to answer the following questions:

1. The dataset contains 19 attributes. Are all 19 attributes useful for predicting the dependent variable? Why or why not? Use your results to justify the answer.
2. If not all attributes are predictive, use your Lasso model to perform feature selection. Which attributes should be kept? Use a correlation coefficient and/or a scatter plot to justify your answer for at least one attribute (in a new cell below).
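
One possible way to read a Lasso fit for feature selection is to keep the attributes whose coefficients are non-zero, then check an attribute against the target with a correlation coefficient. The sketch below uses made-up coefficient values and synthetic data; none of the numbers come from the homework dataset.

```python
import numpy as np

# Hypothetical coefficient vector, standing in for a fitted Lasso model's coef_.
toy_names = np.array(["x1", "x2", "x3", "x4"])
toy_lasso_coef = np.array([12.5, 0.0, -3.1, 0.0])

# Lasso drives uninformative coefficients to exactly zero,
# so a non-zero coefficient marks an attribute the model kept.
kept = toy_names[toy_lasso_coef != 0]
dropped = toy_names[toy_lasso_coef == 0]
print("kept:", kept, "dropped:", dropped)

# Pearson correlation between one synthetic feature and a synthetic target.
rng = np.random.default_rng(0)
feature = rng.normal(size=200)
target = 2.0 * feature + rng.normal(scale=0.5, size=200)
r = np.corrcoef(feature, target)[0, 1]
print("correlation:", r)
```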

**ANSWER HERE**