In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score

# Comparison of Regularization Techniques

The purpose of this exercise is to compare Lasso and Ridge regularization and observe how their results differ on a given dataset. Two datasets will be compared to see the role that their features play when making predictions. The measure compared is the [$R^2$ score](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination) of each model, which expresses how much of the variance in the dependent (target) variable is captured by the model. Tasks 2-7 explain the models that will be used to explore the dataset.
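
As a quick illustration of what the score measures, the sketch below computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand and checks it against `r2_score`. The arrays `y_true` and `y_pred` are made-up values for illustration only, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions (illustrative values only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# Both values should agree
print(r2_manual, r2_score(y_true, y_pred))
```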

# Dataset I: California Housing Dataset

The first set of models will predict median housing prices from a set of features of houses in the California housing dataset. This dataset is available through the scikit-learn library.

In [3]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
# returns a Bunch object
# housing

### Examine the Bunch Object 
Print the description of the dataset. How many of the features will be relevant for predicting housing prices?

In [4]:
lines = (housing.DESCR).split('\n')
for line in lines:
    print(line)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

### Task 1:
Use the housing data as the features to train the models (X) and the median house values as the target (y). Divide these collections into training and test sets. For grading, be sure to set the random state to 0.

In [5]:
X = housing.data
y = housing.target
len(y)

20640

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
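
When no `test_size` is given, `train_test_split` holds out 25% of the rows by default, and a fixed `random_state` makes the shuffle reproducible. A small sketch on placeholder arrays (`X_demo` and `y_demo` are illustrative names, not part of the exercise):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # 15 training rows and 5 test rows

# Repeating the call with the same random_state yields the same split
_, _, _, y_te_again = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(y_te, y_te_again))
```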

### Task 2: Linear Regression
Remember that a Linear Regression model assumes a linear relationship in the data:

$$ \begin{aligned}\widehat{y} = \widehat{w}_1 x_1 + \widehat{w}_2 x_2 + \cdots + \widehat{w}_p x_p + \widehat{b}\end{aligned}$$

where $p$ is the number of features. The weights $w$ and the intercept $b$ are estimated by minimizing the residual sum of squares (RSS), the sum of squared differences between the predicted and actual target values over the $n$ training samples:

$$ \begin{aligned}RSS(w,b) = \displaystyle\sum_{i=1}^n(y_i - (wx_i + b))^2 \end{aligned}$$
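
To make the minimization concrete, here is a minimal sketch on synthetic data (the arrays and the "true" coefficients are made up). It suggests that `LinearRegression` recovers the same weights as an ordinary least-squares solve with `np.linalg.lstsq`. The helper `rss` is a hypothetical name that simply evaluates the objective above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])   # assumed coefficients for the demo
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

def rss(w, b, X, y):
    """Residual sum of squares for weights w and intercept b."""
    return np.sum((y - (X @ w + b)) ** 2)

model = LinearRegression().fit(X_demo, y_demo)

# Same problem solved directly: a column of ones plays the role of the intercept
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(model.coef_, model.intercept_)
print(solution)
print(rss(model.coef_, model.intercept_, X_demo, y_demo))
```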

Using a Linear Regression model, produce an $R^2$ score for the predictions on the **y_test** values. Store this in a variable called **r2_linear_regression**.

In [7]:
reg = LinearRegression()
reg.fit(X_train, y_train)
pred_linReg = reg.predict(X_test)
r2_linear_regression = r2_score(y_test, pred_linReg)
r2_linear_regression

0.5911695436410476

In [8]:
# for use building the dataframe
linReg = {'r2_linReg': r2_linear_regression}

### Task 3: Lasso Regression

Lasso regularization reduces the potential for overfitting a model to the data by adding a penalty term to the RSS, based on the L1 norm of the model's weight vector:

$$ \begin{aligned}RSS_{LASSO}(w,b) = \displaystyle\sum_{i=1}^n(y_i - (wx_i + b))^2 + \alpha \sum_{j=1}^p |w_j|\end{aligned}$$

Fortunately, this model is implemented in the scikit-learn library.

Here $p$ is the number of features. Because the penalty acts on the absolute values of the weights, Lasso can shrink some weights to exactly zero, which effectively removes those features from the model. How strongly the model is adjusted depends on the scalar $\alpha$ that multiplies the norm. The default value in scikit-learn's `Lasso` is 1. (scikit-learn's `Lasso` additionally scales the RSS term by $1/(2n)$, which changes the effective size of $\alpha$ but not the idea.)
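
The feature-removal effect can be seen in a small sketch on synthetic data, where only the first two of six features influence the target. The data, coefficients, and variable names are assumptions made for illustration. Ordinary least squares assigns a small non-zero weight to every feature, while Lasso with the default $\alpha$ is expected to set the irrelevant weights to exactly zero and shrink the remaining ones.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
# Only the first two features influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso().fit(X_demo, y_demo)   # default alpha=1.0

print(np.round(ols.coef_, 3))
print(np.round(lasso_demo.coef_, 3))
print("zeroed by Lasso:", int(np.sum(lasso_demo.coef_ == 0)))
```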

Train a model using Lasso Regression and store the $R^2$ score of its predictions on the **y_test** values in a variable called **r2_lasso**.

In [9]:
reg_lso = Lasso()
reg_lso.fit(X_train, y_train)
pred_lso = reg_lso.predict(X_test)
r2_lasso = r2_score(y_test, pred_lso)
r2_lasso

0.28490402733386166

In [10]:
# for building the dataframe
lasso = {'r2_lasso': r2_lasso}

### Task 4: Lasso Regression, reduced alpha
Next, train a second Lasso Regression model with an $\alpha$ of 0.5 and store its $R^2$ value in a variable called **r2_lasso_half**.

In [11]:
reg_lso_2 = Lasso(alpha=0.5)
reg_lso_2.fit(X_train, y_train)
pred_lso_2 = reg_lso_2.predict(X_test)
r2_lasso_half = r2_score(y_test, pred_lso_2)
r2_lasso_half

0.44351557737688474
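
The jump in $R^2$ between $\alpha = 1$ and $\alpha = 0.5$ suggests the score is sensitive to the penalty strength. One way to explore this, sketched here on synthetic data rather than the housing set, is to sweep several $\alpha$ values and record how many coefficients remain non-zero. The $\alpha$ grid, the data, and the names below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 8))
w_demo = np.array([4.0, -3.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0])  # assumed weights
y_demo = X_demo @ w_demo + rng.normal(scale=0.5, size=300)

alphas = [0.01, 0.1, 0.5, 1.0, 10.0]
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))

# Larger alpha values leave fewer features in the model
for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha:<5} non-zero coefficients: {count}")
```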

In [12]:
lasso_2 = {'r2_lasso_half': r2_lasso_half}

### Task 5: Ridge Regression

The next two tasks involve creating a prediction model for the California housing data using Ridge Regression. Ridge Regression differs from Lasso Regression in the penalty term added to the RSS, which uses the squared weights (the squared L2 norm) instead of their absolute values:

$$ \begin{aligned}RSS_{RIDGE}(w,b) = \displaystyle\sum_{i=1}^n(y_i - (wx_i + b))^2 + \alpha \sum_{j=1}^p w_j^2\end{aligned}$$

Again, Ridge Regression models are part of the scikit-learn library. As with the Lasso Regression model, the degree to which the penalty term influences the predictions is controlled by the scalar value $\alpha$.
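
For intuition, when no intercept is fitted the ridge objective has the closed-form minimizer $w = (X^TX + \alpha I)^{-1}X^Ty$. The sketch below checks this against `Ridge(fit_intercept=False)` on synthetic data. It is an illustration under those assumptions, not part of the graded tasks, and all names in it are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)

alpha = 1.0
# Closed-form ridge solution without an intercept
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(4), X_demo.T @ y_demo)

ridge_demo = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo)

# The two coefficient vectors should agree
print(np.round(w_closed, 4))
print(np.round(ridge_demo.coef_, 4))
```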

Train a model using Ridge Regression and store the $R^2$ score of its predictions on the **y_test** values in a variable called **r2_ridge**.

In [13]:
reg_ridge = Ridge()
reg_ridge.fit(X_train, y_train)
pred_ridge = reg_ridge.predict(X_test)
r2_ridge = r2_score(y_test, pred_ridge)
r2_ridge

0.5911615930747933

In [14]:
ridge = {'r2_ridge': r2_ridge}

### Task 6: Ridge Regression, reduced alpha

Next, train a second Ridge Regression model with an $\alpha$ value of 0.5 and store its $R^2$ value in a variable called **r2_ridge_half**.

In [15]:
reg_ridge_half = Ridge(alpha=0.5)
reg_ridge_half.fit(X_train, y_train)
pred_ridge_half = reg_ridge_half.predict(X_test)
r2_ridge_half = r2_score(y_test, pred_ridge_half)
r2_ridge_half

0.5911655745845577
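
The two ridge scores differ only around the sixth decimal place, which suggests that $\alpha$ values of this size barely change the fit here. A sketch on synthetic data (arbitrary $\alpha$ grid and made-up names) indicates that much larger $\alpha$ values are needed before the coefficient norm shrinks noticeably, and that ridge coefficients typically shrink without becoming exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

alphas = [0.5, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms.append(float(np.linalg.norm(coef)))   # L2 norm of the weight vector
    zeros = int(np.sum(coef == 0))
    print(f"alpha={alpha:<8} ||w|| = {norms[-1]:.4f}  exact zeros: {zeros}")
```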

In [16]:
ridge_2 = {'r2_ridge_half': r2_ridge_half}

### Task 7: DataFrame to Compare

Next, assemble the collection of $R^2$ values into a DataFrame where the indices are the names of the $R^2$ scores and the single column is labeled **"R2 California Housing"**. Sort the values in ascending order.

In [17]:
obj = [linReg, lasso, lasso_2, ridge, ridge_2]
o2 = {}
names = []
for d in obj:
    for k,v in d.items():
        names.append(k)
        o2[k] = v

In [18]:
d2 = pd.DataFrame(o2, index=['R2 California Housing'])

In [19]:
d2 = d2.T
d3 = d2.sort_values(by='R2 California Housing')
d3

| | R2 California Housing |
|---|---|
| r2_lasso | 0.284904 |
| r2_lasso_half | 0.443516 |
| r2_ridge | 0.591162 |
| r2_ridge_half | 0.591166 |
| r2_linReg | 0.591170 |
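
A more compact way to build the same kind of table, offered as a sketch, is to collect the scores in one dictionary and go through a named `pd.Series`. The values below are the rounded scores from the table above, and the names `scores` and `comparison` are illustrative.

```python
import pandas as pd

# Rounded R^2 scores copied from the results above
scores = {
    'r2_linReg': 0.591170,
    'r2_lasso': 0.284904,
    'r2_lasso_half': 0.443516,
    'r2_ridge': 0.591162,
    'r2_ridge_half': 0.591166,
}

# The Series name becomes the column label after to_frame()
comparison = (
    pd.Series(scores, name='R2 California Housing')
    .sort_values()
    .to_frame()
)
print(comparison)
```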


**Reflection:** What can you conclude from the order of the values?

# Augmented California Housing Dataset

Next we'll perform the same analysis on a different dataset and observe the difference in performance of the various regressors.  Using the ... dataset construct a DataFrame with a single column of the $R^2$ values and the model regressors as indices. 

In [20]:
housing_2 = fetch_california_housing()

In [21]:
# Use the freshly loaded copy of the dataset
X = housing_2.data
y = housing_2.target
df = pd.DataFrame(X, columns=housing_2.feature_names)

In [22]:
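# Keep a single feature: column index 4 is 'Population' in housing.feature_names
df_ = df.iloc[:, [4]]
most_important_array = df_.values  # 2D array of shape (n_samples, 1)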

In [23]:
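def augment_X_random(X):
    """Append 100 random integers in [0, 1000) to every row of X."""
    augmented_array = []

    for sample in X:
        sample = list(sample)
        # Noise features unrelated to the target; unseeded, so values differ per run
        augmentation = np.random.randint(1000, size=100)
        sample.extend(augmentation)
        augmented_array.append(sample)

    return np.array(augmented_array)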
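
The same augmentation can also be written without per-row Python lists. A possible vectorized variant is sketched below with a seeded NumPy `Generator` so the noise columns are reproducible. The function name `augment_X_random_vectorized` and its `n_noise`, `high`, and `seed` parameters are assumptions, not part of the original notebook, and a seeded generator will not reproduce the scores reported in this notebook.

```python
import numpy as np

def augment_X_random_vectorized(X, n_noise=100, high=1000, seed=0):
    """Append n_noise columns of random integers in [0, high) to X."""
    rng = np.random.default_rng(seed)
    noise = rng.integers(0, high, size=(X.shape[0], n_noise))
    # hstack keeps the original columns first, followed by the noise columns
    return np.hstack([X, noise])

# Small placeholder input: 5 samples with a single feature
X_small = np.arange(5, dtype=float).reshape(5, 1)
X_small_aug = augment_X_random_vectorized(X_small)
print(X_small_aug.shape)  # (5, 101)
```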

In [24]:
X_aug = augment_X_random(most_important_array)

In [25]:
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_aug, y, random_state=0)

## Task 2a: Linear Regression

In [26]:
reg_a = LinearRegression()
reg_a.fit(X_train_a, y_train_a)
pred_linReg_a = reg_a.predict(X_test_a)
r2_linear_regression_a = r2_score(y_test_a, pred_linReg_a)
r2_linear_regression_a

-0.00663445991388123

In [27]:
linReg_a = {'r2_linReg_a': r2_linear_regression_a}

## Task 3a: Lasso Regression

In [28]:
reg_lso_a = Lasso()
reg_lso_a.fit(X_train_a, y_train_a)
pred_lso_a = reg_lso_a.predict(X_test_a)
r2_lasso_a = r2_score(y_test_a, pred_lso_a)
r2_lasso_a

-0.0024213714868543956

In [29]:
lasso_a = {'r2_lasso_a': r2_lasso_a}

## Task 4a: Lasso Regression, reduced alpha

In [30]:
reg_lso_2_a = Lasso(alpha=0.5)
reg_lso_2_a.fit(X_train_a, y_train_a)
pred_lso_2_a = reg_lso_2_a.predict(X_test_a)
r2_lasso_half_a = r2_score(y_test_a, pred_lso_2_a)
r2_lasso_half_a

-0.004166446344924912

In [31]:
lasso_2_a = {'r2_lasso_half_a': r2_lasso_half_a}

## Task 5a: Ridge Regression

In [32]:
reg_ridge_a = Ridge()
reg_ridge_a.fit(X_train_a, y_train_a)
pred_ridge_a = reg_ridge_a.predict(X_test_a)
r2_ridge_a = r2_score(y_test_a, pred_ridge_a)
r2_ridge_a

-0.0066344599024348305

In [33]:
ridge_a = {'r2_ridge_a': r2_ridge_a}

## Task 6a: Ridge Regression, reduced alpha

In [34]:
reg_ridge_half_a = Ridge(alpha=0.5)
reg_ridge_half_a.fit(X_train_a, y_train_a)
pred_ridge_half_a = reg_ridge_half_a.predict(X_test_a)
r2_ridge_half_a = r2_score(y_test_a, pred_ridge_half_a)
r2_ridge_half_a

-0.006634459908157808

In [35]:
ridge_2_a = {'r2_ridge_half_a': r2_ridge_half_a}

## Task 7a: DataFrame to Compare

In [36]:
obj_a = [linReg_a, lasso_a, lasso_2_a, ridge_a, ridge_2_a]
o2_a = {}
names_a = []
for d in obj_a:
    for k,v in d.items():
        names_a.append(k)
        o2_a[k] = v

In [37]:
d2_a = pd.DataFrame(o2_a, index=['R2 2D Altered Random California Housing'])

In [38]:
d2_a = d2_a.T
d3_a = d2_a.sort_values(by='R2 2D Altered Random California Housing')
d3_a

| | R2 2D Altered Random California Housing |
|---|---|
| r2_linReg_a | -0.006634 |
| r2_ridge_half_a | -0.006634 |
| r2_ridge_a | -0.006634 |
| r2_lasso_half_a | -0.004166 |
| r2_lasso_a | -0.002421 |
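
A negative $R^2$ means a model predicts the test targets slightly worse than simply predicting their mean. The sketch below reproduces that effect on purely synthetic data, not the housing data, and is only an analogy: it does not claim that the housing models above removed all of their coefficients. Here the target is independent of the features, so the default Lasso penalty is expected to remove every coefficient, the model falls back to the training mean, and a constant prediction cannot score above 0 on the test set.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(1000, 10))   # features carry no signal
y_noise = rng.normal(size=1000)         # target independent of X_noise

X_tr, X_te, y_tr, y_te = train_test_split(X_noise, y_noise, random_state=0)

model = Lasso().fit(X_tr, y_tr)         # default alpha=1.0
pred = model.predict(X_te)

print(np.count_nonzero(model.coef_))    # number of surviving coefficients
print(r2_score(y_te, pred))             # score of the constant training-mean prediction
print(r2_score(y_te, np.full_like(y_te, y_te.mean())))  # baseline: predicting the test mean
```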
