# Part 1: The impact of regularization

This part highlights the impact of using ridge regularization. The required imports are shown in the following cell. The random seed is set to 42 for reproducibility. Optional plot-styling settings are included in the cell but are commented out.

In [None]:
import math
import numpy as np
import seaborn as sns
import pandas as pd
import sklearn as sk
import sklearn.linear_model as skl
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt
#sns.set_context('paper')
#plt.style.use('seaborn')
%matplotlib inline
np.random.seed(42)

<p>The first step is to generate two sets of data. The premise of this exercise is to mimic a real-world situation in which the relationship between input and output is obscured by noise. To do this, we add Gaussian noise (mean 0, standard deviation 0.3) to a sample of 10 data points drawn from the function y = sin(2πx). To visualize the effect of the noise, we also generate 100 evenly spaced points on the noise-free curve y = sin(2πx).</p>

In [None]:
# Generate data points for y = sin(2πx) + noise
x = np.random.random_sample(10)
y_orig = np.sin(2*math.pi*x)
noise = np.random.normal(0, 0.3, 10)  # Gaussian noise: mean 0, standard deviation 0.3
y_noise = y_orig + noise

#Generate Curve for y = sin (2πx)
x2 = np.linspace(0,1,100)
y2 = np.sin(2*math.pi*x2)

<p>Below is a visualization of the original function (orange) along with the noisy data points (blue). A small number of data points was chosen deliberately: fitting regression models to very little data makes both overfitting and underfitting easy to see.</p>

In [None]:
plt.plot(x2,y2, label = 'y = sin (2πx)', color='orange')
plt.scatter(x, y_noise, label = 'y = sin (2πx) + noise ')
plt.legend()
plt.xlabel("x")
plt.ylabel('y')
plt.title('y = sin (2πx) vs. y = sin (2πx) + noise ')
plt.show()

<p>For our first model, we use linear regression with polynomial features, which amounts to polynomial regression. The first stage is to create the polynomial features. Each data point currently has a single input feature x, so nine more features are generated by raising x to the powers 2, 3, ..., 10.</p>

In [None]:
#generate polynomial features up to degree 10
data = pd.DataFrame(x, columns = ['x'])   ## These data points will be used to train the model
for i in range(2,11):  
    colname = 'x_%d'%i      
    data[colname] = data['x']**i

Test_Data = np.linspace(np.sort(x)[0], np.sort(x)[-1], num =50)   # 50 data points that will be used to demonstrate the relationship built by the model (extreme case) 
Test_Data = pd.DataFrame(Test_Data, columns = ['x'])
for i in range(2,11): 
    colname = 'x_%d'%i 
    Test_Data[colname] = Test_Data['x']**i

data.head()

<p>The following code creates 10 subplots, one for each polynomial degree. For each degree, only the associated polynomial terms are used to build the linear regression model (e.g. a polynomial of degree 4 uses four terms: x, x², x³, x⁴). As seen in these figures, the first two degrees underfit, whereas degrees 7 and above show clear signs of overfitting.</p>

In [None]:
coefs = []  ## Python list
rss = []
fig, axs = plt.subplots(2,5, figsize = (25,12.5))
for i in range(0,2):
    for j in range (0,5):
        LeastSquaresModel = Pipeline([('scaling', StandardScaler()),
                                      ('linreg', LinearRegression())])
        LeastSquaresModel.fit(data.iloc[:,0:5*i+j+1], y_noise)
        Test_Data_pred_curve = LeastSquaresModel.predict(Test_Data.iloc[:,0:5*i+j+1]) ## apply the model to the 50 test points
        y_pred_points = LeastSquaresModel.predict(data.iloc[:,0:5*i+j+1])  ## predictions on the training data points
        rss.append(np.sum(np.square(y_noise - y_pred_points))) ## Training RSS
        coefs.append(LeastSquaresModel[1].coef_)  ## Model Coefficients
        axs[i,j].plot(x2,y2, label = 'y = sin (2πx)', color='orange') ## the sin function
        axs[i,j].plot(x, y_noise,"o", color='b') # the 10 training points (with noise)
        axs[i,j].plot(Test_Data.x, Test_Data_pred_curve, color='g')  ## visualization of the model
        axs[i,j].title.set_text("n = " + str(5*i+j+1))

<p>As seen above, as the model complexity increases, so too does the level of overfitting. To further illustrate a consequence of the higher-order polynomials, we will explore the coefficients of each model. </p>

In [None]:
#Visualizing Size of Coefficients
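coef_mat = pd.DataFrame(coefs)  ## coefficient matrix (one row per polynomial degree)
pd.options.display.float_format = '{:,.2g}'.format
coef_mat.index = range(1, len(coefs) + 1)  ## label rows by degree (1-10) rather than 0-9
coef_mat.index.name = 'polynomial rank'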
coef_mat.columns = ['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
coef_mat

<p>What we observe above is often called coefficient explosion: the magnitude of the coefficients grows rapidly, by orders of magnitude, as model complexity increases. This is a key indication of overfitting. To further verify this, we plot the training Residual Sum of Squares (RSS) for each polynomial degree. As seen in this plot, the highest-rank polynomials have an RSS of essentially 0: with 10 training points, a polynomial of degree 9 or higher has enough free parameters to pass through every point.</p>

In [None]:
plt.plot(range(1, len(rss) + 1), rss)  # x-axis: polynomial rank 1-10
plt.xlabel('Polynomial Rank')
plt.ylabel('Residual Sum of Squares Loss')
plt.title("The effect of Polynomial Rank on RSS")
plt.show()
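The claim that enough polynomial terms drive the training RSS to zero can be checked on a smaller problem. The sketch below is an illustration only: it uses its own synthetic data (6 points, a hypothetical seed) and a plain NumPy least-squares fit rather than the notebook's pipeline. The names `x_demo`, `y_demo` and `training_rss` are introduced here for the example.

```python
import numpy as np

rng = np.random.default_rng(1)                 # hypothetical seed, independent of the notebook
x_demo = np.linspace(0.05, 0.95, 6)            # 6 well-separated inputs
y_demo = np.sin(2 * np.pi * x_demo) + rng.normal(0, 0.3, 6)

def training_rss(degree):
    """Least-squares polynomial fit of the given degree; returns the training RSS."""
    V = np.vander(x_demo, degree + 1)          # columns x^degree, ..., x, 1
    w, *_ = np.linalg.lstsq(V, y_demo, rcond=None)
    return float(np.sum((y_demo - V @ w) ** 2))

rss_line = training_rss(1)     # 2 parameters for 6 points: cannot follow the noise
rss_interp = training_rss(5)   # 6 parameters for 6 points: passes through every point
print(rss_line, rss_interp)
```

A training RSS near zero says nothing about how the curve behaves between the points, which is why the plots above are needed alongside this number.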

<p>To illustrate coefficient explosion, we plot the mean absolute magnitude of the coefficients for each polynomial rank on a logarithmic scale. As seen, the average magnitude of the coefficients increases drastically with model complexity.</p>

In [None]:
abs(coef_mat).mean(axis=1).plot(logy=True)
plt.ylabel("Mean Absolute Magnitude of Coefficients")
plt.title("The effect of model complexity on the mean absolute magnitude of coefficients")
plt.show()

To mitigate overfitting at high polynomial rank, we use Ridge Regression. In Ridge Regression, the parameter $α$ sets the strength of regularization: $α = 0$ means no regularization (ordinary least squares), whereas as $α$ approaches $\infty$ the penalty dominates and all coefficients are shrunk toward 0. The following code illustrates the effect of this regularization on the degree-10 polynomial developed above.

In [None]:
coefs = []
rss = []
fig, axs = plt.subplots(2,5, figsize = (25,12.5))
alphas = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 100]
for i in range(0,2):
    for j in range (0,5):
        RidgeModel = Pipeline([('scaling', StandardScaler()),
                               ('ridge_reg', Ridge(alpha = alphas[5*i+j]))])
        RidgeModel.fit(data, y_noise)  # train the model on the 10 data points
        TestData_pred_curve = RidgeModel.predict(Test_Data) ## Model applied to the test data
        y_pred_points = RidgeModel.predict(data)  ## Model applied to the training data
        rss.append(np.sum(np.square(y_noise - y_pred_points))) ## training RSS
        coefs.append(RidgeModel[1].coef_)
        axs[i,j].plot(x2,y2, label = 'y = sin (2πx)', color='orange') ## the sin function
        axs[i,j].plot(x, y_noise,"o" , color='b') # plot the 10 training data points
        axs[i,j].plot(Test_Data.x, TestData_pred_curve, color = 'g') # plot the predicted output for the 50 test data points
        axs[i,j].title.set_text("α = " + str(alphas[5*i+j]))

<p>As seen above, as the value of α increases, there is a transition from overfitting to underfitting. This shows the power of regularization to counteract overfitting, but it also highlights the importance of choosing α carefully, as too high a value results in severe underfitting. We can see the effect in the size of the coefficients for the various values of α below.</p>

In [None]:
coef_mat = pd.DataFrame(coefs)
pd.options.display.float_format = '{:,.2g}'.format
coef_mat.index = alphas  # reuse the alpha values defined in the previous cell
coef_mat.index.name = 'alpha'
coef_mat.columns = ['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
coef_mat

The following plot visualizes the effect of the ridge penalty: the mean absolute size of the coefficients drops drastically and approaches zero as α approaches $\infty$.

In [None]:
abs(coef_mat).mean(axis=1).plot(logy=True)
plt.ylabel("Mean Absolute Magnitude of Coefficients")
plt.title("The effect of regularization term α on the mean absolute magnitude of coefficients")
plt.show()
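The shrinkage can also be reproduced from the closed-form ridge solution $w = (X^\top X + \alpha I)^{-1} X^\top y$. The sketch below is a minimal illustration under stated assumptions: it generates its own data with a hypothetical seed, uses degree 5 instead of 10 to keep the linear system well conditioned, and standardizes the features and centers the target by hand so that the intercept is not penalized. The names `ridge_coefs` and `demo_alphas` are introduced here and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(2)                  # hypothetical seed
x_demo = rng.random(10)
y_demo = np.sin(2 * np.pi * x_demo) + rng.normal(0, 0.3, 10)

# Degree-5 polynomial features, standardized column by column
X = np.column_stack([x_demo**k for k in range(1, 6)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y_c = y_demo - y_demo.mean()                    # centering y stands in for an unpenalized intercept

def ridge_coefs(alpha):
    """Closed-form ridge solution w = (X^T X + alpha I)^{-1} X^T y."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y_c)

demo_alphas = [1e-4, 1e-2, 1, 10, 100]
norms = [float(np.linalg.norm(ridge_coefs(a))) for a in demo_alphas]
for a, n in zip(demo_alphas, norms):
    print(f"alpha = {a:g}: ||w|| = {n:.3g}")
```

The Euclidean norm of the coefficient vector decreases as α grows, which is the same trend the plot above shows for the mean absolute coefficient.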

# Part 2: Feature expansion and regularization

This part highlights feature expansion and the different regularization techniques.

## Polynomial expansion 

In [None]:
# Let's load a data set and look at it.
D=pd.read_csv('regression_data.csv')
plt.plot(D.x, D.y, 'bo')
plt.show()

In [None]:
# Build a design matrix with polynomial expansion on X
x=
x=
poly = 
X = 


plt.imshow(X, cmap='gray') # drop the cmap flag to get color
plt.colorbar()
plt.show()

XDF = pd.DataFrame(X)
XDF.head()
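As a reference point for the cell above, here is one way a polynomial design matrix could be built with scikit-learn's `PolynomialFeatures`. This is a sketch under assumptions, not the notebook's intended solution: the data are synthetic stand-ins for `D.x` and `D.y` (the contents of `regression_data.csv` are not reproduced here), and the degree of 5 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the columns D.x and D.y
rng = np.random.default_rng(3)                  # hypothetical seed
x_demo = np.sort(rng.uniform(-1, 1, 30))
y_demo = np.sin(3 * x_demo) + rng.normal(0, 0.2, 30)

# scikit-learn expects a 2-D array of shape (n_samples, n_features)
x_col = x_demo.reshape(-1, 1)

# include_bias=True adds a leading column of ones (an explicit intercept term)
poly_demo = PolynomialFeatures(degree=5, include_bias=True)
X_demo = poly_demo.fit_transform(x_col)

print(X_demo.shape)   # (30, 6): a column of ones plus x, x^2, ..., x^5
```

Whether the column of ones is included matters once a penalty is added, because a penalty applied to every column of the design matrix also shrinks the intercept.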

## Ridge regression

In [None]:
# Now fit a standard linear model to the data and plot prediction 
reg = 
reg
yp=

print(reg)

plt.plot(x, D.y, 'k.', x, yp, 'r-')
plt.show()

In [None]:
# This seems very wiggly.
# Could we do better with Ridge regression? 
# Let's regularize a lot
print('Coef. matrix shape (%i x 1)' % reg.coef_.shape)

ridge = 
ridge
ypp=

plt.plot(x,D.y,'k.',x,yp,'r-',x,ypp,'b-')
plt.show()

### Fix the intercept problem

In [None]:
# What happened?
# The problem is that the ridge penalty was also applied to the intercept.
# Sometimes this is desired, but usually it is not.
# In this case we do not want the intercept to be among
# the regressors that are regularized.
poly = 
X = 
scaler = 

X = 

pd.DataFrame(X).head()
plt.imshow(X, cmap='gray')

plt.colorbar()
plt.show()
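The effect of penalizing the intercept can be seen in isolation on a toy problem. The sketch below uses synthetic data with a deliberately large offset, a degree-2 expansion, and `alpha=100`. All of these are assumptions chosen to make the effect visible, not values taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.linspace(0, 1, 50).reshape(-1, 1)
y_demo = 10 + np.sin(2 * np.pi * x_demo.ravel())   # large offset so the effect is visible

# (a) Constant column inside X with fit_intercept=False: the intercept is penalized
X_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x_demo)
penalized = Ridge(alpha=100, fit_intercept=False).fit(X_bias, y_demo)

# (b) No constant column with fit_intercept=True: the intercept is left unpenalized
X_nobias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_demo)
unpenalized = Ridge(alpha=100, fit_intercept=True).fit(X_nobias, y_demo)

mean_pen = penalized.predict(X_bias).mean()
mean_unpen = unpenalized.predict(X_nobias).mean()
print(y_demo.mean(), mean_pen, mean_unpen)
```

In variant (a) the predictions are pulled toward zero as a whole, so their mean falls well below the mean of the targets. In variant (b) the mean prediction on the training data matches the mean of the targets, and only the shape of the curve is regularized.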

In [None]:
# If we set fit_intercept = True (default),
# ridge regression fits the intercept separately and does not penalize it
ridge = 
ridge
ypp=

plt.plot(x,D.y,'k.',x,yp,'r-',x,ypp,'b-')
plt.show()

In [None]:
# Also redo the linear regression 
reg = skl.LinearRegression(fit_intercept=True)
reg.fit(X,D.y)

In [None]:
# Now inspect the coefficients: No explicit intercept is fitted - ridge coefficients are smaller 
(reg.coef_,ridge.coef_)

### How to set the regularization coefficient?

In [None]:
# So, how should we tune the regularization coefficient?
# Let's use cross-validation
cv_scores = 
-cv_scores

In [None]:
# Systematically vary the ridge coefficient on a log-scale
lam = np.exp(np.linspace(-4,2,10))
mse = np.zeros(10)

for i in range(lam.size):
    cv_scores = 
    mse[i]=-cv_scores.mean()

In [None]:
# Determine lowest value 
plt.scatter(np.log(lam),mse)
plt.show()
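For reference, a cross-validated sweep over the ridge coefficient might look like the sketch below. It is an illustration under assumptions: the data are synthetic, the model is wrapped in a pipeline so that scaling is refit inside each fold, and the folds are shuffled with a fixed seed. The names `model_for`, `lam_demo` and `mse_demo` are hypothetical and are not the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(4)                  # hypothetical seed
x_demo = rng.uniform(-1, 1, 60).reshape(-1, 1)
y_demo = np.sin(3 * x_demo.ravel()) + rng.normal(0, 0.2, 60)

def model_for(alpha):
    """Polynomial expansion, scaling and ridge regression in one estimator."""
    return make_pipeline(PolynomialFeatures(degree=5, include_bias=False),
                         StandardScaler(),
                         Ridge(alpha=alpha))

lam_demo = np.exp(np.linspace(-4, 2, 10))       # same grid shape as in the cell above
folds = KFold(n_splits=5, shuffle=True, random_state=0)
mse_demo = np.zeros(lam_demo.size)
for i, a in enumerate(lam_demo):
    # The scorer returns the negated MSE, so flip the sign back
    scores = cross_val_score(model_for(a), x_demo, y_demo,
                             cv=folds, scoring='neg_mean_squared_error')
    mse_demo[i] = -scores.mean()

best = lam_demo[np.argmin(mse_demo)]
print(f"best alpha = {best:.3g}, CV MSE = {mse_demo.min():.3g}")
```

Shuffling matters when the rows are sorted by `x`: unshuffled folds would then hold out contiguous ranges and test extrapolation rather than interpolation.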

In [None]:
# So now let's look at the cross-validation error for the best setting of lambda
cv_scores = 

print('CV score for alpha=exp(0.8): %.3f' % -cv_scores.mean())

## Lasso 

In [None]:
las = 
las.
yl=

plt.plot(x, D.y, 'k.', label='_nolegend_')
plt.plot(x, ypp, 'r-', label='Ridge')
plt.plot(x, yl, 'b-', label='LASSO')
plt.legend()
plt.show()

In [None]:
# Let's check the coefficients. 
# What do you notice compared to the ridge? 
pd.DataFrame(

# Ridge shrinks all coefficients towards zero; the lasso tends to give
# a set of zero coefficients and leads to a sparse solution.
# Note that for both ridge regression and the lasso the regression coefficients
# can move from positive to negative values as they are shrunk toward zero.
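The difference in sparsity described in the comments can be checked on a small synthetic problem in which only two of ten features matter. The data, the seed and the penalty strengths below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)                  # hypothetical seed
X_demo = rng.normal(size=(100, 10))
# Only the first two features carry signal
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 0.1, 100)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
print(np.round(lasso_demo.coef_, 2))
print(np.round(ridge_demo.coef_, 2))
```

The lasso sets the coefficients of irrelevant features to exactly zero, while ridge leaves them small but nonzero.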

### Lasso Path 

In [None]:
# Get a full path for Lasso
eps = 5e-3 # The smaller eps, the longer the path  
lambda_lasso, coefs_lasso, _ = 

print('minimum regularization parameter : %.3f' % np.amin(lambda_lasso))
print('maximum regularization parameter : %.3f' % np.amax(lambda_lasso))
#plt.plot(lambda_lasso)


colors = ['b', 'r', 'g', 'c', 'k', 'm']  # six distinct colors, one per coefficient
neg_log_lambda = 

for i in range(6):
    plt.plot(neg_log_lambda, coefs_lasso[i], c=colors[i])

plt.xlabel('Neg. Log. Lambda')
plt.ylabel('Coef.')
plt.legend(['Coef. 1', 'Coef. 2', 'Coef. 3',
           'Coef. 4', 'Coef. 5', 'Coef. 6'])
plt.show()
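One way to compute such a path is scikit-learn's `lasso_path`, sketched below on synthetic data. This is an assumption about the intended tool, not a confirmed solution to the cell above. The point of the sketch is the shape of the returned arrays and the role of `eps`, which sets the ratio between the smallest and largest regularization value on the grid.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(6)                  # hypothetical seed
X_demo = rng.normal(size=(80, 6))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_w + rng.normal(0, 0.1, 80)

eps_demo = 5e-3
alphas_path, coefs_path, _ = lasso_path(X_demo, y_demo, eps=eps_demo)

# One row per feature, one column per regularization value
print(coefs_path.shape)
# Ratio of smallest to largest alpha on the grid, approximately eps_demo
print(alphas_path.min() / alphas_path.max())

# x-axis for a path plot: larger values mean weaker regularization
neg_log_alpha = -np.log10(alphas_path)
```

Because each row of `coefs_path` is the trajectory of one coefficient, plotting row `i` against `neg_log_alpha` gives one line per feature.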

## ElasticNet (and automatic search for best hyperparameters)

In [None]:
# Create ElasticNet object
ElasticNet = 

In [None]:
# Fit it
ElasticNet.

In [None]:
# Print best coefficients
print(

In [None]:
# Plot fit
yen=

plt.plot(x, D.y, 'k.', label='_nolegend_')
plt.plot(x, ypp, 'r-', label='Ridge')
plt.plot(x, yl, 'b-', label='LASSO')
plt.plot(x, yen, 'g-.', label='ElasticNet')
plt.legend()
plt.show()
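For the automatic hyperparameter search, one option is scikit-learn's `ElasticNetCV`, which cross-validates over a grid of `alpha` values and a list of `l1_ratio` candidates. The sketch below assumes that approach. The notebook may intend a different tool such as a generic grid search, and the data, seed and candidate ratios here are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(7)                  # hypothetical seed
X_demo = rng.normal(size=(100, 8))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 3] + rng.normal(0, 0.1, 100)

# l1_ratio = 1.0 corresponds to the lasso penalty; smaller values mix in more of the ridge penalty
ratios = [0.1, 0.5, 0.9, 1.0]
enet_cv = ElasticNetCV(l1_ratio=ratios, cv=5)
enet_cv.fit(X_demo, y_demo)

print(enet_cv.alpha_, enet_cv.l1_ratio_)        # selected hyperparameters
print(np.round(enet_cv.coef_, 2))               # coefficients refit with the selected values
```

The sketch names the fitted object `enet_cv` rather than `ElasticNet`, since binding a variable to the class name would shadow the class if it were imported by name.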

In [None]:
# Compare coefficients
pd.DataFrame(