# Data Science 1 Tutorial 5.1 - Regression

In this tutorial, we will use the [California Housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) dataset that is available from [Scikit-Learn](https://scikit-learn.org/).

In [None]:
# Run this cell
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Preparation

In [None]:
# Complete the following

# load the dataset into housing_data
housing_data = ________

# Inspect the attributes in the output of the following
housing_data

# What is the response variable? What is the dimension of the features?

<u>Let's print only the description for easier reading. Look at the output above to find the attribute you need.</u>

In [None]:
# Print only the description
print(housing_data.________)

<u>Now let's put the data into a DataFrame called `housing_df`.<br/>
You should have 8 predictive variables as given by `feature_names` plus the target variable; call it `medprice`, for the median price.</u>

In [None]:
# Complete this as specified.
housing_df = pd.DataFrame(housing_data.________, #just the features
                          columns=housing_data.feature_names)
housing_df['medprice'] = pd.Series(housing_data.________)

# print a few rows of housing_df
housing_df.head()
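
The same construction pattern can be sketched on made-up data. The array values and the column names `a`, `b`, and `target` below are illustrative assumptions, not part of the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a feature matrix, its column names, and a target vector
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
names = ["a", "b"]
target = np.array([0.5, 1.5, 2.5])

# One column per feature, then the target appended as an extra column
toy_df = pd.DataFrame(features, columns=names)
toy_df["target"] = pd.Series(target)

print(toy_df.shape)
print(list(toy_df.columns))
```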

<u>For a quick exploration, display a pairplot of `housing_df`</u>

In [None]:
# Use pairplot from seaborn
sns.pairplot(housing_df)
#plt.savefig('housing.png')
plt.show()

## Simple Linear Regression

Let's first look at the simple linear regression, where there is only one explanatory variable. Let's start with `MedInc`.

**Q** Write down the simple linear regression model and define the variables.

**A**

_________

In [None]:
# Assign x and y accordingly
x = housing_df['________']
y = housing_df['________']

<u>The NumPy function `polyfit` computes a least-squares polynomial fit. With degree 1, it solves the simple linear regression problem.</u>

In [None]:
# Find the estimates of w1 and w0
w1,w0 = np.polyfit(x,y,1)

# print out the estimates
w1,w0
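
To see the order in which `polyfit` returns its coefficients, here is a minimal sketch on noise-free synthetic data. The names `x_demo`, `y_demo`, `slope`, and `intercept` are illustrative and not part of the tutorial.

```python
import numpy as np

# Noise-free synthetic data generated from y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0

# polyfit returns coefficients from the highest degree to the lowest,
# so for degree 1 the order is (slope, intercept)
slope, intercept = np.polyfit(x_demo, y_demo, 1)
print(slope, intercept)
```

Because the data lie exactly on a line, the fitted slope and intercept should agree with 2 and 1 up to floating-point error.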

In [None]:
# With the following new data points:
xnew = np.linspace(1,10,20)

# Find the predictions ynew
ynew = xnew*w1 + w0

In [None]:
# use regplot in seaborn to plot x and y and the linear regression line
# on the same figure, plot the new estimated data points
sns.regplot(x=x,y=y)
plt.plot(xnew, ynew, 'rx')
plt.show()

We will now calculate the performance of the simple linear model using the following metrics:
1. Mean absolute error (MAE)
2. Mean absolute percentage error (MAPE)
3. Mean squared error (MSE)
4. Root mean squared error (RMSE)
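
The four metrics above can be sketched by hand with NumPy on a tiny made-up example. The arrays `y_true` and `y_pred` are illustrative assumptions. MAPE is computed here as a fraction rather than multiplied by 100, which is also the convention of scikit-learn's `mean_absolute_percentage_error`.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([2.0, 4.0, 5.0, 10.0])
y_pred = np.array([1.0, 5.0, 5.0, 8.0])

err = y_true - y_pred                                # [1, -1, 0, 2]
mae_demo = np.mean(np.abs(err))                      # (1 + 1 + 0 + 2) / 4
mape_demo = np.mean(np.abs(err) / np.abs(y_true))    # (0.5 + 0.25 + 0 + 0.2) / 4
mse_demo = np.mean(err ** 2)                         # (1 + 1 + 0 + 4) / 4
rmse_demo = np.sqrt(mse_demo)                        # square root of the MSE

print(mae_demo, mape_demo, mse_demo, rmse_demo)
```

Note that MAPE divides by the true values, so it is undefined when a true value is zero.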

In [None]:
# First calculate the estimated y values using x and the estimated parameters
yhat = x*w1 + w0

In [None]:
from sklearn.metrics import mean_absolute_error as mae, \
mean_squared_error as mse, \
mean_absolute_percentage_error as mape

# Complete the following and print the results
#
mae_SLR = ________
mape_SLR = ________
mse_SLR = ________
rmse_SLR = ________
print("MAE:", mae_SLR)
print("MAPE:", mape_SLR)
print("MSE:", mse_SLR)
print("RMSE:", rmse_SLR)

## Polynomial Regression

We will still use one feature, `MedInc`, but we will see if increasing the complexity of the model improves our prediction.

Consider the 2nd degree polynomial model $$ y=w_{2}x^{2}+w_{1}x+w_{0}+\varepsilon, $$ where $x$ is still the feature `MedInc`. We can estimate the parameters using the polyfit function in Numpy.
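
Although the model is quadratic in $x$, it is still linear in the parameters $w_2, w_1, w_0$, so ordinary least squares applies. The sketch below illustrates this on synthetic data by building the design matrix with columns $[x^2, x, 1]$ and solving with `np.linalg.lstsq`. The data and the names `x_demo`, `y_demo`, and `A` are assumptions made for illustration.

```python
import numpy as np

# Noise-free synthetic data from y = 0.5 x^2 - 1.0 x + 3.0
x_demo = np.linspace(-2.0, 2.0, 9)
y_demo = 0.5 * x_demo**2 - 1.0 * x_demo + 3.0

# Design matrix with columns [x^2, x, 1] (np.vander uses decreasing powers by default)
A = np.vander(x_demo, 3)

# Least-squares solution for (w2, w1, w0)
coeffs, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w2_demo, w1_demo, w0_demo = coeffs
print(w2_demo, w1_demo, w0_demo)
```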

In [None]:
# Estimate w0, w1, and w2 using the polyfit function
# then estimate the house prices at the same MedInc values, xnew
w2,w1,w0 = np.________

ynew = w2*xnew**2 + w1*xnew + w0

In [None]:
# Again use regplot in seaborn to plot x and y and
# on the same figure, plot the new estimated data points
sns.regplot(________)
plt.plot(xnew, ynew, 'rx')
plt.show()

In [None]:
# And print out the MAE, MAPE, MSE, and RMSE
yhat = w2*x**2 + w1*x + w0
mae_PR = ________
mape_PR = ________
mse_PR = ________
rmse_PR = ________
print("MAE:", mae_PR)
print("MAPE:", mape_PR)
print("MSE:", mse_PR)
print("RMSE:", rmse_PR)

## Polynomial Regression with Interaction

To illustrate polynomial regression with interaction, we take two features, `MedInc` and `AveOccup`, and put them into a new DataFrame $X$.

In [None]:
X = housing_df[__________]
X

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly_converter = _________(degree=2, include_bias=False)
poly_features = poly_converter.fit_transform(X)
poly_features

# PolynomialFeatures will also take interaction terms into account
# With input [x1, x2], the degree-2 polynomial features are
# [1, x1, x2, x1^2, x1x2, x2^2] if include_bias=True

In [None]:
# Illustration
print(np.arange(1,4).reshape(1,3))
poly_converter.fit_transform(np.arange(1,4).reshape(1,3))

This time we'll use `sklearn` to solve for the coefficients.

In [None]:
from sklearn.linear_model import LinearRegression
model_PRI = _________()
model_PRI.fit(poly_features,y)
yhat = model_PRI.predict(poly_features)
mae_PRI = mae(y, yhat)
mape_PRI = mape(y, yhat)
mse_PRI = mse(y, yhat)
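# RMSE as the square root of the MSE (the `squared` argument is not available in recent scikit-learn versions)
rmse_PRI = np.sqrt(mse(y, yhat))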
print("MAE:", mae_PRI)
print("MAPE:", mape_PRI)
print("MSE:", mse_PRI)
print("RMSE:", rmse_PRI)

In [None]:
print(model_PRI.intercept_)
print(model_PRI.coef_)

## Train-test split and model selection

Up to now, we have used the whole dataset both to fit the model and to report its performance. This gives an overly optimistic estimate, because the model has already seen the data it is evaluated on. In this section we split the data so that a held-out test set simulates unseen data when we report the performance of our model.

Referring to this page: [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html),
find out how you can implement an 80-20 training-test split below. Include all 8 features. Specify the `random_state` for reproducibility, say `random_state=123`.
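
As a small sketch of what the split does, the example below uses a made-up array of 10 rows rather than the housing data. The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features; row i is [2i, 2i + 1] and its target is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123
)
print(np.array_equal(y_te, y_te_again))
```

Rows are shuffled before splitting, but each feature row stays paired with its own target value.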

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(housing_data.__________,#features
                                                    housing_data.__________,#target
                                                    test_size = 0.20,
                                                    random_state=123)

# Alternative from a DataFrame:
X = housing_df.drop('medprice', axis=1)
y = housing_df['medprice']
X_train, X_test, y_train, y_test = train_test_split(__________,
                                                    __________,
                                                    test_size = 0.20,
                                                    random_state=123)

In [None]:
train_mapes = []
test_mapes = []

for d in range(1,6):
    # new features with degree "d"
    poly_converter = PolynomialFeatures(degree=d,include_bias=False)
    poly_features = poly_converter.fit_transform(X)

    # SPLIT THIS NEW POLY DATA SET
    X_train, X_test, y_train, y_test = train_test_split(poly_features, y,
                                                        test_size=0.2, random_state=123)

    # TRAIN ON THIS NEW POLY SET
    model = LinearRegression()
    model.fit(X_train,y_train)

    # PREDICT ON BOTH TRAIN AND TEST
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Calculate Errors

    # Errors on Train Set
    train_MAPE = mape(y_train,train_pred)

    # Errors on Test Set
    test_MAPE = mape(y_test,test_pred)

    # Append errors to lists for plotting later

    train_mapes.append(train_MAPE)
    test_mapes.append(test_MAPE)

In [None]:
X_train.shape

In [None]:
plt.plot(range(1,6), train_mapes, label="Training MAPE")
plt.plot(range(1,6), test_mapes, label="Testing MAPE")
plt.ylabel("MAPE")
plt.xlabel("Degree of polynomials")
plt.legend()
plt.show()

In [None]:
# Train the final model with the whole dataset
# Keep the default intercept: with include_bias=False below, the features contain no constant column
final_model_PR = LinearRegression()

poly_converter = PolynomialFeatures(degree=2,include_bias=False)
poly_features = poly_converter.fit_transform(housing_data.data)

final_model_PR.fit(poly_features, housing_data.target)

# Print both, since a notebook cell only displays its last expression
print(final_model_PR.intercept_)
print(final_model_PR.coef_)

Reference: [joblib](https://joblib.readthedocs.io/en/latest/)

In [None]:
# Saving the model
from joblib import dump, load
dump(final_model_PR, 'final_medhouseprice_model_PR.joblib')

In [None]:
# Loading the model
loaded_final_PR_model = load('final_medhouseprice_model_PR.joblib')
loaded_final_PR_model.coef_
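
A quick way to confirm that a saved model survives the round trip is to compare the predictions before and after loading. The sketch below does this for a small made-up model written to a temporary directory; the file name `demo_model.joblib` and the data are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Small made-up regression problem: y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(demo_model, path)
    restored = load(path)

# The restored model should carry the same coefficients and give the same predictions
same_coef = np.allclose(restored.coef_, demo_model.coef_)
same_pred = np.allclose(restored.predict(X_demo), demo_model.predict(X_demo))
print(same_coef, same_pred)
```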

## Feature Scaling, Cross Validation, and Regularization

Features should be standardized before fitting Lasso or Ridge regression. In this section we use Ridge regression as the example.

**Q**: Write down the error (objective) function of ridge regression. Why is it necessary to scale the features?

**A**:_________

In [None]:
X.shape, y.shape

In [None]:
# Do 80-20 split, specify the random_state=123
X_train, X_test, y_train, y_test = train_test_split(_________, _________,
                                                    _________,
                                                    _________)

In [None]:
X_train.shape, X_test.shape

In [None]:
# Run this cell to use StandardScaler from sklearn for the feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
# We fit to the training data, not the whole data, to avoid data leakage
scaler.fit(X_train)
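
The idea of fitting the scaler on the training data only can be sketched by hand with NumPy: the mean and standard deviation come from the training set and are then reused, unchanged, on the test set. The random data and the names `X_tr`, `X_te`, `mu`, and `sigma` are assumptions for illustration.

```python
import numpy as np

# Made-up training and test features drawn from the same distribution
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(80, 2))
X_te = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

# Statistics are estimated from the training set only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# The same training statistics are applied to both sets
X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns are only approximately standardized, which is expected because no test information was used.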

In [None]:
# Scale the training and test features
scaled_X_train = _________._________(X_train)
scaled_X_test = _________._________(X_test)

In [None]:
from sklearn.linear_model import Ridge

# Optional
help(Ridge)

In [None]:
# Alpha is a hyperparameter; later on we'll use cross validation to choose its value
model_ridge = Ridge(alpha=0.1)

# Fit the training set
model_ridge.fit(scaled_X_train, y_train)

# Prediction on the test set
test_pred = _________._________(scaled_X_test)

# Performance
print('MAPE: ', mape(y_test, test_pred))
print('RMSE: ', np.sqrt(mse(y_test, test_pred)))


We now choose $\alpha$ by cross validation.
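
The idea behind choosing $\alpha$ by cross validation can be sketched manually: for each candidate value, estimate the validation error with K-fold cross validation and keep the value with the lowest error. The synthetic data, the candidate grid `alphas`, and the choice of 5 folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with known coefficients and a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hypothetical candidate values for the regularization strength
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

cv_mse = []
for a in alphas:
    # scikit-learn scorers are "higher is better", so the MSE is returned negated
    scores = cross_val_score(
        Ridge(alpha=a), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse.append(-scores.mean())

# Keep the alpha with the lowest average validation MSE
best_alpha = alphas[int(np.argmin(cv_mse))]
print(best_alpha)
```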

In [None]:
from sklearn.linear_model import RidgeCV

# Try changing alpha values
model_ridgecv = _________()

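# To list the available scoring names (the SCORERS dict was removed in scikit-learn 1.3):
#from sklearn.metrics import get_scorer_names
#get_scorer_names()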
_________.fit(_________, _________)

In [None]:
model_ridgecv.alpha_

In [None]:
# Fit the training set
model_ridgecv.fit(scaled_X_train, y_train)

# Prediction on the test set
test_pred = model_ridgecv.predict(scaled_X_test)

# Print the MAPE and RMSE
print('MAPE: ', mape(y_test, test_pred))
print('RMSE: ', np.sqrt(mse(y_test, test_pred)))

In [None]:
model_ridgecv.coef_

Now, instead of `RidgeCV`, use `LassoCV` to find the optimal alpha for the Lasso-regularized model, and print out the MAPE and RMSE.