# Assignment 6: Feature selection and regularization

# Total: /100

## Instructions

* Complete the assignment

* Once the notebook is complete, **restart** your kernel and **rerun** your cells

* Submit this notebook to OWL by the deadline

* You may use any Python library functions you wish to complete the assignment.

In [57]:
# You may need these
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
import sklearn.linear_model as skl
from sklearn import preprocessing
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt
from IPython.display import display

seed = 2023
np.random.seed(seed)


## Question 1: /20 pts


Customer Lifetime Value (CLV) is the total income a business can expect from a customer over the entire period of their relationship. It’s an important metric as it costs less to keep existing customers than it does to acquire new ones, so increasing the value of your existing customers is a great way to drive growth. We want to predict CLV for an auto insurance company.

1. Read in the `Vehicle_Insurance.csv` dataset and display the last 5 rows.
2. Conduct the required data preparation.

### 1.1 Read the dataset and display the last 5 rows

In [58]:
df = pd.read_csv("Vehicle_Insurance.csv")
df.tail()


Unnamed: 0,clv,Coverage,Gender,Income,Marital.Status,Monthly.Premium.Auto,Number.of.Open.Complaints,Number.of.Policies,Renew.Offer.Type,Total.Claim.Amount,Vehicle.Class
8625,4100.398533,Premium,F,47761,Single,104,0,1,Offer1,541.282007,Four-Door Car
8626,3096.511217,Extended,F,21604,Divorced,79,0,1,Offer1,379.2,Four-Door Car
8627,8163.890428,Extended,M,0,Single,85,3,2,Offer1,790.784983,Four-Door Car
8628,7524.442436,Extended,M,21941,Married,96,0,3,Offer3,691.2,Four-Door Car
8629,2611.836866,Extended,M,0,Single,77,0,1,Offer4,369.6,Two-Door Car


### 1.2 Remove the rows with "clv" $> 16000$ as well as those with "clv" $< 2200$ from the dataset. What's the shape of the dataframe now?

In [59]:
df = df[(df["clv"] <= 16000) & (df["clv"] >= 2200)]  # remove outliers
print("Dataframe shape is now", df.shape)


Dataframe shape is now (8212, 11)


### 1.3 Using `preprocessing.OneHotEncoder()`, convert all categorical features. Make sure not to add collinear features during the encoding process. Then, display the first 3 rows.

In [60]:
categorical_features = [
    "Coverage",
    "Gender",
    "Marital.Status",
    "Renew.Offer.Type",
    "Vehicle.Class",
]

encoder = preprocessing.OneHotEncoder(drop="first", sparse_output=False)

encoded_data = encoder.fit_transform(df[categorical_features])

encoded_df = pd.DataFrame(
    encoded_data, columns=encoder.get_feature_names_out(categorical_features)
)

other_cols = df.drop(columns=categorical_features)

df = pd.concat(
    [other_cols.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1
)

df.head(3)


Unnamed: 0,clv,Income,Monthly.Premium.Auto,Number.of.Open.Complaints,Number.of.Policies,Total.Claim.Amount,Coverage_Extended,Coverage_Premium,Gender_M,Marital.Status_Married,Marital.Status_Single,Renew.Offer.Type_Offer2,Renew.Offer.Type_Offer3,Renew.Offer.Type_Offer4,Vehicle.Class_Luxury Car,Vehicle.Class_Luxury SUV,Vehicle.Class_SUV,Vehicle.Class_Sports Car,Vehicle.Class_Two-Door Car
0,2763.519279,56274,69,0,1,384.811147,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,6979.535903,0,94,0,8,1131.464935,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,12887.43165,48767,108,0,2,566.472247,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### 1.4 Use `pandas.DataFrame.apply` to apply a square-root transformation to "Total.Claim.Amount" and a log transformation to the target variable. Then create your `X` and `y`. (No training/test splitting yet.)

In [61]:
df["Total.Claim.Amount"] = df["Total.Claim.Amount"].apply(np.sqrt)
df["clv"] = df["clv"].apply(np.log)

X = df.drop("clv", axis=1)
y = df["clv"]


### 1.5 Build a new design matrix by applying polynomial expansion on the `X` from Question 1.4.

Hint: Specify degree=2 and do NOT include the column with power 0 (i.e., the column with all elements being 1)

In [62]:
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out())


### 1.6 Standardize your design matrix (from Question 1.5) with `StandardScaler()`, and store the result in a pandas DataFrame.

In [63]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
X_scaled_df = pd.DataFrame(X_scaled, columns=poly.get_feature_names_out())


### 1.7 What is the shape of the resultant DataFrame obtained from question 1.6?

In [64]:
X_scaled_df.shape


(8212, 189)
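The column count can be checked by hand. A degree-2 expansion of $p$ features without the bias column produces $p$ linear terms plus $p(p+1)/2$ squares and pairwise products. The sketch below assumes $p = 18$, which is the number of predictor columns shown in the 1.3 output once `clv` is removed; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 18  # assumed from the 1.3 output: 5 numeric + 13 one-hot columns

# Count the output columns on a dummy row; only the shape matters here.
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
n_out = poly_demo.fit_transform(np.zeros((1, n_features))).shape[1]

# Closed form: linear terms + (squares and pairwise interactions).
expected = n_features + n_features * (n_features + 1) // 2

print(n_out, expected)
```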

## Question 2: /7 pts

Split the data into training and test sets. Hold out 30% of the observations as the test set. Pass `random_state=seed` to `train_test_split` to ensure you get the same sets on every run. The design matrix to pass to the splitter function is the DataFrame you obtained in Question 1.6. The target is the one you created in Question 1.4.

In [65]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled_df, y, test_size=0.3, random_state=seed
)

print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

print(round(y_train.mean(), 2))


Training set shape: (5748, 189) (5748,)
Test set shape: (2464, 189) (2464,)
8.62
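The split sizes above follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the training set receives the remainder. A small sketch of that arithmetic, using a plain index array as a stand-in for the real data:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8212  # number of rows after the filtering step in 1.2

# A fractional test_size is rounded up; the rest goes to the training set.
n_test_expected = math.ceil(0.3 * n_rows)
n_train_expected = n_rows - n_test_expected

# Stand-in data: only the lengths of the two pieces matter here.
rows = np.arange(n_rows)
train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=2023)

print(len(train_rows), len(test_rows))
```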


How many observations are in your training data set? What is the average value of the target variable in the training data set (rounded to 2 decimal places)?

**YOUR ANSWER HERE:** [2pts]

The training data set contains 5748 observations. The average value of the target variable in the training data set is 8.62.

## Question 3: /23 pts

### 3.1 Create a scikit-learn `Ridge` regression object. Using this object, run a ridge regression of the target variable against all the transformed predictor variables using your training data. Include the argument `alpha=4.0`. In addition, the ridge regression should be fitted with an intercept.

In [None]:
# Ridge regression


### 3.2 Vary the ridge coefficient `alpha` according to the hint. Use `cross_val_score()` to select the best `alpha` based on the mean squared error. Include the argument `cv=5`. Report the `alpha` that yields the smallest mean squared error.

In [None]:
# hint: lam = np.exp(np.linspace(-4,1,10))
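One point that is easy to miss: scikit-learn scorers follow a "higher is better" convention, so the mean squared error is exposed as `scoring="neg_mean_squared_error"` and the sign has to be flipped back. A possible search loop, sketched on synthetic data (all `*_demo` names and the data are assumptions, not part of the assignment):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=5.0, random_state=0
)

lam = np.exp(np.linspace(-4, 1, 10))  # grid from the hint

cv_mse = []
for alpha in lam:
    # Scores are negated MSEs, one per fold.
    scores = cross_val_score(
        Ridge(alpha=alpha), X_demo, y_demo,
        cv=5, scoring="neg_mean_squared_error",
    )
    cv_mse.append(-scores.mean())  # flip the sign to get a positive MSE

best_alpha = lam[int(np.argmin(cv_mse))]
print(best_alpha)
```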


### 3.3 Re-fit the ridge regression with `alpha` being the value obtained in the previous question. `Print` the first 3 parameters of your model.

In [None]:
#


### 3.4 Fit the linear regression without any penalty, and the regression should be fitted with the intercept. `Print` the first 3 parameters of your model.

In [None]:
#
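A general property that may help when comparing the two fits: for any `alpha > 0`, the ridge coefficient vector has a smaller Euclidean norm than the unpenalized least-squares vector fitted to the same data. The sketch below checks this on synthetic data; it says nothing about the specific values you will see on the assignment data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in data: few samples relative to features,
# so the shrinkage effect is easy to see.
X_demo, y_demo = make_regression(
    n_samples=60, n_features=20, noise=10.0, random_state=0
)

ols_demo = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=4.0).fit(X_demo, y_demo)

# Compare the overall size of the two coefficient vectors.
ols_norm = np.linalg.norm(ols_demo.coef_)
ridge_norm = np.linalg.norm(ridge_demo.coef_)

print(ols_norm, ridge_norm)
```

Individual coefficients do not all have to shrink; the guarantee is on the norm of the whole vector.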


Comparing the parameters that you obtain in questions 3.3 and 3.4, what do you find?

**YOUR ANSWER HERE:** [2pts]

...

### 3.5 Use your trained model from Question 3.4 to predict over the test set and `print` the first 5 prediction values.

In [None]:
#


## Question 4: /25 pts

### 4.1 Consider fitting a Lasso regression to the training data set. Use `lasso_path()` to show the full path of the first 20 coefficients of the Lasso regression. Include the arguments `eps=8e-3` and `n_alphas=50`.

In [None]:
# Draw a plot to show the path. Legend is not required in the plot.
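As a rough guide to the shapes involved, `lasso_path()` returns the grid of penalties (largest first) and a coefficient array with one row per feature and one column per penalty. The sketch below uses synthetic, manually standardized data; the names and the plotting choices (log-scaled x-axis) are assumptions rather than requirements.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Synthetic stand-in data, standardized by hand for this illustration.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=8, noise=5.0, random_state=0
)
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
y_demo = y_demo - y_demo.mean()

# alphas: shape (n_alphas,), decreasing; coefs: shape (n_features, n_alphas).
alphas, coefs, _ = lasso_path(X_demo, y_demo, eps=8e-3, n_alphas=50)

# One line per coefficient, first 20 features only.
fig, ax = plt.subplots()
ax.plot(alphas, coefs[:20].T)
ax.set_xscale("log")
ax.set_xlabel("alpha")
ax.set_ylabel("coefficient")
```

Here `eps` is the ratio between the smallest and the largest penalty on the grid.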


Describe the trend shown in your figure.

**YOUR ANSWER HERE:** [3pts] 

...

### 4.2 Use scikit-learn's cross-validated Lasso to automatically search for the best tuning parameter of the Lasso regression, fitted with an intercept, on the training set. Include the arguments `eps=8e-3`, `n_alphas=30`, `tol=0.001`, `cv=5`, and `random_state=seed`. Report the best tuning parameter and the number of non-zero coefficients in the model.

In [None]:
#
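After fitting, `LassoCV` stores the selected penalty in `alpha_` and the refitted coefficients in `coef_`, so both requested quantities can be read off directly. A sketch on synthetic data (the `*_demo` names and the data are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

seed = 2023

# Synthetic stand-in data with only a few informative features.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=40, n_informative=6, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

lasso_cv_demo = LassoCV(
    eps=8e-3, n_alphas=30, tol=0.001, cv=5, random_state=seed
)
lasso_cv_demo.fit(X_demo, y_demo)

best_alpha = lasso_cv_demo.alpha_                    # penalty chosen by CV
n_nonzero = np.count_nonzero(lasso_cv_demo.coef_)    # features kept

print(best_alpha, n_nonzero)
```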


### 4.3 Use scikit-learn's cross-validated ElasticNet to automatically search for the best tuning parameters of the ElasticNet regression, fitted with an intercept, on the training data set. Include the same arguments as in question 4.2 as well as `l1_ratio=[.7, .9, .95, .99,1]`. Report the best tuning parameters.

In [None]:
#
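`ElasticNetCV` searches over both the penalty strength and the L1/L2 mix, and exposes the winners as `alpha_` and `l1_ratio_`. When `l1_ratio` equals 1 the penalty is purely L1, which is the Lasso penalty. The sketch below runs on synthetic data, so the values it selects carry no information about the assignment data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

seed = 2023
l1_grid = [0.7, 0.9, 0.95, 0.99, 1]

# Synthetic stand-in data for illustration.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=40, n_informative=6, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

enet_cv_demo = ElasticNetCV(
    l1_ratio=l1_grid, eps=8e-3, n_alphas=30, tol=0.001, cv=5,
    random_state=seed,
)
enet_cv_demo.fit(X_demo, y_demo)

# Both tuning parameters selected by cross-validation.
print(enet_cv_demo.alpha_, enet_cv_demo.l1_ratio_)
```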


Based on the tuning parameters obtained, is the ElasticNet regression model equivalent to the Lasso regression? Briefly explain why.

**YOUR ANSWER HERE:** [3pts]

...

## Question 5: /16 pts

### 5.1 Starting from the regression model in question 3.4, use `SequentialFeatureSelector()` to conduct forward selection on the features of the regression model. Include the argument `n_features_to_select=20`. Report the indices of the selected features.

FYI: Running this using 8 physical cores took about 1 minute for me.

In [None]:
#
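The selector wraps an unfitted estimator and, once fitted, reports its choices through `get_support()`. A small sketch on synthetic data, with a deliberately small feature count so it runs quickly (the names, data, and `n_features_to_select=4` are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; far smaller than the assignment's design matrix.
X_demo, y_demo = make_regression(
    n_samples=150, n_features=12, n_informative=4, noise=5.0, random_state=0
)

# Forward selection: start empty and add one feature at a time,
# keeping the feature that most improves the cross-validated score.
sfs_demo = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
)
sfs_demo.fit(X_demo, y_demo)

selected_idx = sfs_demo.get_support(indices=True)  # integer column indices
X_selected = sfs_demo.transform(X_demo)            # reduced design matrix

print(selected_idx)
```

On larger problems the `n_jobs` argument can spread the cross-validation work across cores.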


Which argument of your selector would you need to change in order to conduct backward selection instead?

**YOUR ANSWER HERE:** [2pts] 

...

### 5.2 Re-fit the regular linear regression on the training set using the features selected in question 5.1. Report the first 3 parameters of your model using the `print` function.

In [None]:
#


## Question 6: /9 pts

### 6.1 Make predictions on the test set using your models in questions 3.3, 4.2, 4.3, and 5.2, respectively. Together with the predicted values obtained in question 3.5, report the first 5 rows of predicted values obtained from different models in a single DataFrame.

In [None]:
#


### 6.2 Use `mean_squared_error()` to assess the performance of different models based on all the predicted values mentioned in Question 6.1.  

In [None]:
#
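One convenient pattern for this kind of comparison is to keep each model's test predictions as a column of a single DataFrame and then compute the error column by column. The sketch below uses three arbitrary models on synthetic data; the model names, penalties, and data are placeholders, not the models fitted in the earlier questions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and split.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=15, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Placeholder models; swap in the fitted models from earlier questions.
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

# One column of test-set predictions per model.
preds = pd.DataFrame(
    {name: model.fit(X_tr, y_tr).predict(X_te) for name, model in models.items()}
)

# Test MSE for every column, then the name of the smallest.
test_mse = preds.apply(lambda col: mean_squared_error(y_te, col))
best_model = test_mse.idxmin()

print(preds.head())
print(test_mse)
```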


Which model yields the smallest mean squared error on the test dataset?

**YOUR ANSWER HERE:** [2pts]

...