<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_05_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 5: Regularization and Dropout**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* **Part 5.1: Introduction to Regularization: Ridge and Lasso**
* Part 5.2: Using K-Fold Cross Validation with Keras
* Part 5.3: Using L1 and L2 Regularization with Keras to Decrease Overfitting
* Part 5.4: Drop Out for Keras to Decrease Overfitting
* Part 5.5: Benchmarking Keras Deep Learning Regularization Techniques



## Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.
Running the following code will map your GDrive to `/content/drive`.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
except:
    print("Note: not using Google CoLab")
    COLAB = False

### Lesson Setup

Run the next code cell to load necessary packages

In [None]:
# You MUST run this code cell first
import sklearn
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet

from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import train_test_split 

import pandas as pd
import numpy as np

import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

# Part 5.1: Avoiding the problem of overfitting

A common problem that can occur while training a neural network is called **_overfitting_**. Overfitting occurs when a neural network learns the training data too well. 

At first glance, it might seem strange that learning something "too well" would be bad. However, imagine a neural network trying to learn the difference between apples and oranges based on their color. If you have a neural network with too many parameters and you show it a training set where all the red fruits are apples and all the orange fruits are oranges, the network may learn to categorize fruits solely based on color.

The problem comes when you show your trained neural network a new, unseen red fruit that is actually an orange. Your network may struggle to correctly classify it because its learning was too focused on the color feature during training. This would be an example of overfitting, where the network learned the training data too well at the expense of generalizing to new data. 

In technical terms, overfitting occurs when a neural network starts to memorize noise and outliers in the data rather than generalizing patterns. This can lead to poor performance on new, unseen data because the network is too specialized to that specific training set. Overfitting is more likely to occur when you build a "fancy" neural network with a lot of parameters (e.g. many hidden layers with many neurons) and then train it on a relatively small dataset for far too many epochs. Overall, training a small network on a lot of data promotes better generalization, computational efficiency, robustness, and scalability compared to training a large network on a small amount of data. 
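
The effect of having too many parameters for a small dataset can be sketched without a neural network. The following is a minimal illustration, assuming synthetic data drawn from a straight line and using polynomial fits as a stand-in for models of different complexity (the data, the noise level, and the polynomial degrees are all hypothetical choices, not part of the lesson's dataset).

```python
import numpy as np

rng = np.random.default_rng(0)

# Small, noisy training set drawn from a straight line y = 2x + 1
x_train = np.linspace(-1, 1, 10)
y_train = 2 * x_train + 1 + rng.normal(scale=0.3, size=x_train.size)

# Larger unseen test set drawn from the same line
x_test = np.linspace(-1, 1, 50)
y_test = 2 * x_test + 1 + rng.normal(scale=0.3, size=x_test.size)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # 2 parameters
flexible = np.polyfit(x_train, y_train, deg=7)  # 8 parameters

print("train MSE (simple, flexible):",
      mse(simple, x_train, y_train), mse(flexible, x_train, y_train))
print("test  MSE (simple, flexible):",
      mse(simple, x_test, y_test), mse(flexible, x_test, y_test))
```

The flexible model always fits the training points at least as closely as the simple one; comparing the two test errors shows how well each one generalizes.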

In this lesson we look at a group of techniques that can reduce overfitting by promoting something called _regularization_. 

## Regularization: Ridge and Lasso

Regularization is a technique used to prevent overfitting by adding a **_penalty term_** to the loss function. The goal of regularization is to discourage the model from becoming too complex and instead promote simpler and more generalizable models. There are two common types of regularization used in neural networks:

* **L1 Regularization (Lasso):** In L1 regularization, the penalty term added to the loss function is the sum of the absolute values of the weights (L1 norm). This encourages sparsity in the weights, effectively reducing the number of parameters and promoting a simpler model.
* **L2 Regularization (Ridge):** In L2 regularization, the penalty term added to the loss function is the sum of the squares of the weights (L2 norm). This encourages smaller weights overall, effectively smoothing out the model and preventing large weight values that can lead to overfitting.
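
As a small worked example of the two penalty terms described above, the sketch below computes both for a made-up weight vector. The weights and the penalty strength `alpha` are hypothetical values chosen only for illustration.

```python
import numpy as np

weights = np.array([0.5, -1.5, 0.0, 2.0])  # hypothetical model weights
alpha = 0.1                                # hypothetical penalty strength

# L1 penalty: alpha times the sum of absolute values (0.1 * 4.0)
l1_penalty = alpha * np.sum(np.abs(weights))

# L2 penalty: alpha times the sum of squares (0.1 * 6.5)
l2_penalty = alpha * np.sum(weights ** 2)

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)
```

Note that the largest weight (2.0) contributes half of the L1 penalty but well over half of the L2 penalty, which is why L2 is especially hard on large weights.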

Humans are capable of overfitting as well. Human programmers often take certification exams to show their competence in a given programming language. To help prepare for these exams, the test makers often make practice exams available. Consider a programmer who enters a loop of taking the practice exam, studying more, and then retaking the practice exam. The programmer has **_memorized_** much of the practice exam at some point rather than learning the techniques necessary to figure out the individual questions. The programmer has now overfitted for the practice exam. When this programmer takes the real exam, his actual score will likely be lower than what he earned on the practice exam.

Although a neural network received a high score on its training data, this result does not mean that the same neural network will score high on data that was not inside the training set. A computer can overfit as well. Regularization is one of the techniques that can prevent overfitting. Several different regularization techniques exist. Most work by analyzing and potentially modifying the weights of a neural network as it trains.  

## L1 and L2 Regularization

As mentioned above, L1 and L2 regularization are two standard regularization techniques that can reduce the effects of overfitting. These algorithms can either work with an objective function or as part of the backpropagation algorithm. The regularization algorithm is attached to the training algorithm by adding an objective in both cases.  

These algorithms work by adding a **_weight penalty_** to the neural network training. This penalty encourages the neural network to keep the weights to small values. Both L1 and L2 calculate this penalty differently. You can add this penalty calculation to the calculated gradients for gradient-descent-based algorithms, such as backpropagation. The penalty is negatively combined with the objective score for objective-function-based training, such as simulated annealing.
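
To make the idea of adding the penalty to the gradients concrete, here is a minimal sketch of a single gradient-descent update. The function name `penalized_step`, the learning rate, and the penalty strength are assumptions made for this illustration; real training libraries handle this step internally.

```python
import numpy as np

def penalized_step(w, loss_grad, lr=0.1, alpha=0.01, kind="l2"):
    """One gradient-descent update with an L1 or L2 weight penalty added."""
    if kind == "l1":
        penalty_grad = alpha * np.sign(w)  # derivative of alpha * sum(|w|)
    else:
        penalty_grad = 2 * alpha * w       # derivative of alpha * sum(w**2)
    return w - lr * (loss_grad + penalty_grad)

w = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(w)  # pretend the data loss is flat to isolate the penalty

print("after L1 step:", penalized_step(w, zero_grad, kind="l1"))
print("after L2 step:", penalized_step(w, zero_grad, kind="l2"))
```

With the data gradient set to zero, the L1 step moves every nonzero weight toward 0 by the same fixed amount, while the L2 step shrinks each weight in proportion to its size.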

In this lesson we will see how L1 and L2 regularization work using linear regression models to analyze fruit (apple) quality. The following code sets up the Apple Quality dataset for this purpose.

### Example 1: Read the datafile and create a DataFrame

The code in the cell below reads the Apple Quality dataset `apple_quality.csv` from the course HTTPS server and creates a new DataFrame called `sweetDF`. It then prints out 6 rows and all 9 columns of the new DataFrame. 

In [None]:
# Example 1: Read the datafile and create a DataFrame

# Read the data
sweetDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/apple_quality.csv", 
    na_values=['NA', '?'])

# Set display options
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 9)

# Display DataFrame
display(sweetDF)

If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05_1_Exm1.png)


### **Exercise 1: Read the datafile and create a DataFrame** 

In the cell below write the code to read the Apple Quality dataset `apple_quality.csv` from the course HTTPS server and create a new DataFrame called `acidDF`. Print out 6 rows and all 9 columns of `acidDF`. 

In [None]:
# Insert your code for Exercise 1 here



If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05_1_Exm1.png)


### Example 2: Preprocess and split data for neural network training

The code in the cell below preprocesses the data in the DataFrame `sweetDF`. In the first step it maps the two categorical values `good` and `bad` in the column `Quality` to the integers `1` and `0` respectively. 

A slightly different approach is used to create the X-values compared to the approach used in previous lessons. Rather than create a variable (e.g. `x_columns`) that holds all of the column names except for particular columns that were dropped (e.g. the response column), the code below uses a more direct approach. A Python list of column names called `sweetColumnNames` is created manually by adding individual column names to the list using the following code: 

~~~text
# Select column names for X
sweetColumnNames = ['Size', 'Weight', 'Crunchiness', 'Juiciness',
       'Ripeness', 'Acidity', 'Quality']
~~~

Notice that the column `A_id` was **not** one of the names added to the column name list since it would not have added any useful information to the regression analysis. More importantly, you should also note that the column name `Sweetness` is not included either. That is because `Sweetness` is going to be our response column (Y-values), which we want our regression model to predict.

The list `sweetColumnNames` is then used to generate the X-values using this line of code:

> `sweetX = sweetDF[sweetColumnNames].values`

To generate the Y-values, we simply use the numeric values in the column `Sweetness`. 

After splitting the data into training and test sets, the X-values for the first 4 apples in the test set are printed out for verification.

In [None]:
# Example 2: Preprocess and split data for neural network training

# Map categorical values to ints
mapping = {'good': 1, 'bad': 0}
sweetDF['Quality'] = sweetDF['Quality'].map(mapping)

# Select column names for X
sweetColumnNames = ['Size', 'Weight', 'Crunchiness', 'Juiciness',
       'Ripeness', 'Acidity', 'Quality']

# Generate X 
sweetX = sweetDF[sweetColumnNames].values
sweetX = np.asarray(sweetX).astype('float32')

# Generate Y from response column
sweetY = sweetDF['Sweetness'].values 
sweetY = np.asarray(sweetY).astype('float32')

# Split into train/test
sweetX_train, sweetX_test, sweetY_train, sweetY_test = train_test_split(    
    sweetX, sweetY, test_size=0.25, random_state=45)

# Print x_test
print(sweetX_test[0:4])

If your code is correct you should see the following output:

~~~text
[[-0.6360764  -3.7468562  -0.88680995 -0.6558604   2.5874615  -0.4822437
   1.        ]
 [-0.40847746 -1.1501935   0.8439847  -2.3181307   2.4774127  -1.1014463
   0.        ]
 [-2.458768   -0.8214087   1.4252522   0.42937222  1.8534137  -2.0206604
   0.        ]
 [ 0.4894509   0.3779261   0.5097401  -0.16708393  2.2951744   0.42738742
   0.        ]]
~~~

However, you might see the following output. Note the presence of `nan` as the last element in each array.

~~~text
[[-0.6360764  -3.7468562  -0.88680995 -0.6558604   2.5874615  -0.4822437
          nan]
 [-0.40847746 -1.1501935   0.8439847  -2.3181307   2.4774127  -1.1014463
          nan]
 [-2.458768   -0.8214087   1.4252522   0.42937222  1.8534137  -2.0206604
          nan]
 [ 0.4894509   0.3779261   0.5097401  -0.16708393  2.2951744   0.42738742
          nan]]
~~~

This error will occur if you run Example 2 twice during debugging. The solution is to simply go back to Example 1, re-read the datafile and recreate the DataFrame `sweetDF` **BEFORE** you run Example 2. 
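
The reason for the `nan` values can be reproduced on a tiny stand-in DataFrame. This is a minimal sketch using made-up data rather than the Apple Quality file: once the strings have been replaced by `1` and `0`, a second call to `map()` finds none of its keys (`good`, `bad`) and returns `NaN` for every row.

```python
import pandas as pd

demo = pd.DataFrame({"Quality": ["good", "bad", "good"]})  # toy stand-in data
mapping = {"good": 1, "bad": 0}

first = demo["Quality"].map(mapping)  # strings become 1 or 0
second = first.map(mapping)           # 1 and 0 are not keys in mapping, so NaN

print(first.tolist())
print(second.tolist())
```

One possible guard (an assumption, not part of the lesson code) is to apply the mapping only while the column still holds strings.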

### **Exercise 2: Preprocess and split data for neural network training**

In the cell below preprocess the data in the DataFrame `acidDF`. Map the two categorical values `good` and `bad` in the column `Quality` to the integers `1` and `0` respectively. 

Create a Python list called `acidColumnNames` by adding individual column names to the list for generating the X-values for your neural network. Since the goal of your neural network will be to predict the `Acidity` of individual apples, do **not** include the name `Acidity` in your list. Instead, substitute the name `Sweetness` in your list. 

Create a variable called `acidX` to hold your X-values and a variable called `acidY` from the column `Acidity` to hold your Y-values (response variable). 

Split your data into testing and training datasets using `test_size=0.25`. Call your training datasets `acidX_train` and `acidY_train` and your testing datasets `acidX_test` and `acidY_test`. Then print out the X-values of the first 4 apples in the test set. 

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see the following output:

~~~text
[[-0.6360764  -3.7468562   4.377382   -0.88680995 -0.6558604   2.5874615
   1.        ]
 [-0.40847746 -1.1501935  -3.031134    0.8439847  -2.3181307   2.4774127
   0.        ]
 [-2.458768   -0.8214087  -2.4106574   1.4252522   0.42937222  1.8534137
   0.        ]
 [ 0.4894509   0.3779261  -2.8841462   0.5097401  -0.16708393  2.2951744
   0.        ]]
~~~

However, you might see the following output. Note the presence of `nan` as the last element in each array.

~~~text
[[-0.6360764  -3.7468562   4.377382   -0.88680995 -0.6558604   2.5874615
          nan]
 [-0.40847746 -1.1501935  -3.031134    0.8439847  -2.3181307   2.4774127
          nan]
 [-2.458768   -0.8214087  -2.4106574   1.4252522   0.42937222  1.8534137
          nan]
 [ 0.4894509   0.3779261  -2.8841462   0.5097401  -0.16708393  2.2951744
          nan]]
~~~

This error will occur if you run your **Exercise 2** twice during debugging. The solution is to simply go back to **Exercise 1**, re-read the datafile and recreate the DataFrame `acidDF` **BEFORE** you run **Exercise 2**. 

## Linear Regression

In a **_linear regression_** model, coefficients represent the relationships between the independent variables and the dependent variable. Each coefficient indicates the change in the dependent variable when the corresponding independent variable increases by one unit, while holding all other variables constant. The coefficients determine the slope of the line in a linear regression model and are estimated using statistical methods to best fit the data.

We will use the data just loaded for several examples. The first examples in this part use several forms of linear regression. For linear regression, it is helpful to examine the model's coefficients. The following function is utilized to display these coefficients.

In [None]:
# Simple function to evaluate the coefficients of a regression

%matplotlib inline    
from IPython.display import display, HTML    

def report_coef(names,coef,intercept):
    r = pd.DataFrame( { 'coef': coef, 'positive': coef>=0  }, index = names )
    r = r.sort_values(by=['coef'])
    display(r)
    print(f"Intercept: {intercept}")
    r['coef'].plot(kind='barh', color=r['positive'].map(
        {True: 'b', False: 'r'}))

## L1/L2 Regularization with Linear Regression

Before examining L1/L2 regularization for neural networks, we'll begin with linear regression.  Researchers first introduced the L1/L2 form of regularization for [linear regression](https://en.wikipedia.org/wiki/Linear_regression).  


### Classical Linear Regression

The **_classical_** mathematical procedure for performing linear regression, also known as ordinary least squares (OLS) regression, dates back to the early 19th century, when Adrien-Marie Legendre and Carl Friedrich Gauss published the method of least squares. The term "regression" itself came later in that century from Sir Francis Galton, a British polymath, long before anyone had any idea of building an electronic computer, let alone artificial neural networks. 

Galton developed the concept of regression analysis while studying the relationship between the heights of parents and their children. He coined the term "regression" to describe the phenomenon of offspring tending to move towards the average height of the population, rather than inheriting the exact height of their parents.

Regression analysis is a commonly used statistical method for modeling the relationship between a dependent variable (Y) and one or more independent variables (X). The goal of classical linear regression is to find the best-fitting line through the data points that minimizes the sum of squared differences between the observed values and the values predicted by the model.

The key assumptions of classical linear regression include linearity (the relationship between variables can be approximated by a straight line), independence of errors (the errors in the model are not correlated with each other), homoscedasticity (the variance of the errors is constant across all levels of the independent variable), and normality of errors (the errors are normally distributed).

The coefficients in classical linear regression represent the estimated effect of each independent variable on the dependent variable, while the intercept is the value of the dependent variable when all independent variables are set to zero. The model is typically evaluated using measures such as R-squared, which indicates the proportion of variance in the dependent variable explained by the independent variables.
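
The least-squares fit and the R-squared measure described above can be sketched directly with NumPy. This is an illustration on synthetic data with known coefficients (the data and the true values 4, 2 and -3 are assumptions chosen for the example), not the code used later in this lesson.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 4 + 2*x1 - 3*x2 + small noise
X = rng.normal(size=(200, 2))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first fitted value is the intercept
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# R-squared: proportion of the variance in y explained by the fit
pred = X1 @ beta
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("intercept and coefficients:", beta)
print("R-squared:", r_squared)
```

The fitted values should land close to the intercept and slopes used to generate the data, and R-squared should be close to 1 because the noise is small.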



### Example 3: Classical Linear Regression

The following code uses classical linear regression with the Apple Quality dataset to find the coefficients $ \beta_{n} $ for each of the factors (X's) that predict $ Sweetness $ using the following linear regression equation where $ \alpha $ is the Y-intercept.  

$ {Sweetness_t} = \alpha + \beta_{1}Size_{t} +\beta_{2}Weight_{t} + \beta_{3}Ripeness_{t}+\beta_{4}Juiciness_{t}+\beta_{5}Crunchiness_{t}+\beta_{6}Acidity_{t}+\beta_{7}Quality_{t} $

Rather than use a neural network, the code in the cell below uses a direct **_least-squares calculation_** to determine the best value (number) for each of the 7 coefficients ( $ \beta_{n} $ ) in the above equation. 

In a regression neural network, training would change the connection weights between the neurons in the different layers to give a similar prediction of $ Sweetness $.  

The code in the cell below also computes the Root Mean Square Error (RMSE) for the linear regression and prints out its value.

In [None]:
# Example 3: Classical Linear Regression

# Set display options
pd.set_option('display.max_rows', 8)

# Create linear regression
regressor = sklearn.linear_model.LinearRegression()

# Fit/train linear regression
regressor.fit(sweetX_train,sweetY_train)

# Use regression to predict 
pred = regressor.predict(sweetX_test)

# Compare actual and predicted to measure RMSE error
score = np.sqrt(metrics.mean_squared_error(pred,sweetY_test))
print(f"Final score (RMSE): {score}")

report_coef(
  sweetColumnNames,
  regressor.coef_,
  regressor.intercept_)

If your code is correct you should see the following table of coefficients:

![__](https://biologicslab.co/BIO1173/images/class_05_Exm3A.png)

And the following chart:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exm3B.png)

From the output above, we can rewrite our linear regression equation with the actual coefficients as follows:

$ Sweetness =  -1.44 -0.52Size -0.41Weight -0.34Ripeness -0.14Juiciness -0.12Crunchiness +0.15Acidity +1.25Quality $

The results of the regression are quite interesting. Not unexpectedly, the sweetness of an apple is directly correlated with its "Quality". A "good" apple is a sweet apple. What is a bit surprising is that acidity is also positively correlated with sweetness (the Acidity coefficient is positive). Apparently, the more sugar an apple produces, the more acid it produces as well. The rest of the "factors" (e.g. size, weight, etc.) are **_negatively_** correlated with sweetness. This suggests that smaller apples are sweeter than larger apples.  

The linear regression equation using the above coefficients gave an RMSE = `1.5093181133270264`
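
For readers who want to see what the RMSE value measures, here is a minimal worked example on four made-up predictions (the numbers are hypothetical and unrelated to the apple data). RMSE is the square root of the average squared difference between predicted and actual values.

```python
import numpy as np

actual = np.array([3.0, -1.0, 2.0, 0.0])     # toy observed values
predicted = np.array([2.0, -1.0, 4.0, 1.0])  # toy model predictions

errors = predicted - actual            # [-1, 0, 2, 1]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 0 + 4 + 1) / 4)
print(rmse)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.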

### **Exercise 3: Classical Linear Regression**

In the cell below use classical linear regression with your testing and training data (`acidX_train`, `acidY_train`, `acidX_test`, `acidY_test`) to determine the best values for the coefficients in the following regression equation as was done in Example 3.

$ {Acidity_t} = \alpha + \beta_{1}Size_{t} +\beta_{2}Weight_{t} + \beta_{3}Ripeness_{t}+\beta_{4}Juiciness_{t}+\beta_{5}Crunchiness_{t}+\beta_{6}Sweetness_{t}+\beta_{7}Quality_{t} $

Print out the RMSE for your classical linear regression. 

In [None]:
# Insert your code for Exercise 3 here



If your code is correct you should see the following table of coefficients:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exe3A.png)

And the following chart:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exe3B.png)

If your code was correct you should have found the following values for the coefficients:

$ Acidity =  0.76 +0.35Size +0.15Weight -0.09Ripeness +0.36Juiciness +0.13Crunchiness +0.23Sweetness -1.01Quality $

Perhaps not unexpectedly, the relationships of the factors to the response variable are very different when the response variable is $ Acidity $ instead of $ Sweetness $. 

In particular, the acidity of an apple is **_inversely_** correlated with its "Quality". An apple with a high acidity is a "bad" apple. This is, in fact, literally true! Rotten apples become acidic because of the buildup of bacteria and enzymes during the decomposition process. These microorganisms break down the sugars in the apple, converting them into acids such as acetic acid.

The linear regression equation using the above coefficients gave an RMSE = `1.9228757619857788`. This is higher (worse) than the RMSE = `1.5093181133270264` produced by the regression analysis for $ Sweetness $ in the previous example.

## L1 (Lasso) Regularization

L1 regularization, also called **_LASSO_** (Least Absolute Shrinkage and Selection Operator) should be used to create sparsity in the neural network. In other words, the L1 algorithm will push many weight connections to near 0. When the weight is near 0, the program drops it from the network. Dropping weighted connections will create a **_sparse_** neural network. A sparse neural network is a type of neural network where most of the connections between neurons have zero weights, resulting in a network with fewer connections and parameters compared to a dense neural network.

Feature selection is a useful byproduct of sparse neural networks. Features are the values that the training set provides to the input neurons. Once all the weights of an input neuron reach 0, the neural network training determines that the feature is unnecessary. If your data set has many unnecessary input features, L1 regularization can help the neural network detect and ignore unnecessary features.

L1 is implemented by adding the following error to the objective to minimize:

$$ E_1 = \alpha \sum_w{ |w| } $$


The following code demonstrates lasso regression. Notice the effect on the coefficients compared to the previous section that used linear regression.

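### Example 4: L1 (Lasso) Regularization

The code in the cell below changes the `regressor` to `Lasso` with `alpha=0.1` and computes the coefficients for the regression equation that predicts $ Sweetness $.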

In [None]:
# Example 4: L1 (Lasso) Regularization

import sklearn
from sklearn.linear_model import Lasso

# Create LASSO regressor
regressor = Lasso(random_state=0,alpha=0.1)

# Fit/train LASSO
regressor.fit(sweetX_train,sweetY_train)
# Predict
pred = regressor.predict(sweetX_test)

# Measure RMSE error.  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(pred,sweetY_test))
print(f"Final score (RMSE): {score}")

report_coef(
  sweetColumnNames,
  regressor.coef_,
  regressor.intercept_)


If your code is correct you should see the following table of coefficients:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exm5A.png)

And the following chart:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exm5B.png)

The following table compares the results of the Classical Linear regression with L1 (LASSO).

|             | Classical | L1 (LASSO) |
|-------------|-----------|------------|
| Size        | -0.519299 | -0.434176  |
| Weight      | -0.407031 | -0.319322  |
| Ripeness    | -0.341094 | -0.309975  |
| Juiciness   | -0.137281 | -0.028268  |
| Crunchiness | -0.119313 | -0.026203  |
| Acidity     |  0.146360 |  0.083409  |
| Quality     |  1.253890 |  0.696090  |
| RMSE        |  1.509318 |  1.560978  |
| Intercept   | -1.441211 | -1.191548  |

In general, L1 (LASSO) reduced the _magnitude_ of the coefficients, without changing their sign. The RMSE was slightly higher with L1. 

L1 Lasso regularization can significantly impact the coefficients generated by ordinary least squares (OLS) linear regression. By adding a penalty term based on the sum of the absolute values of the coefficients, L1 Lasso encourages sparsity in the model and can shrink coefficients towards zero or eliminate them entirely. This regularization technique promotes feature selection by reducing the impact of less important variables, leading to a simpler and more interpretable model compared to OLS linear regression.
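
The way L1 eliminates coefficients as the penalty grows can be seen on a small synthetic problem. The sketch below is an illustration only: the data are made up so that just the first two of five features matter, and the three `alpha` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Synthetic data: only the first two of five features influence y
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

for alpha in (0.01, 0.5, 10.0):
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha}: coef={np.round(model.coef_, 3)}, zeroed={n_zero}")
```

With a small `alpha` the two informative coefficients stay close to their true values; with a very large `alpha` the penalty dominates and every coefficient is set to zero.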

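### **Exercise 4: L1 (Lasso) Regularization**

In the cell below change the `regressor` to `Lasso` with `alpha=0.1` and compute the coefficients for the regression equation that predicts $ Acidity $.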

In [None]:
# Insert your code for Exercise 4 here




If your code is correct you should see the following table of coefficients:

![__](class_05_1_Exe4A.png)

And the following chart:

![__](class_05_1_Exe4B.png)


## L2 (Ridge) Regularization

You should use Tikhonov/Ridge/L2 regularization when you are less concerned about creating a sparse network and are more concerned about low weight values.  The lower weight values will typically lead to less overfitting. 

$$ E_2 = \alpha \sum_w{ w^2 } $$

Like the L1 algorithm, the $\alpha$ value determines how important the L2 objective is compared to the neural network’s error.  Typical L2 values are below 0.1 (10%).  The main calculation performed by L2 is the summing of the squares of all of the weights.  The algorithm will not sum bias values.
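
For linear regression, the L2 penalty leads to a simple closed-form solution, which makes its shrinking effect easy to check. The sketch below assumes synthetic data and switches off the intercept so that the textbook formula and scikit-learn's `Ridge` solve the same problem; it is an illustration, not the lesson's example code.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=100)

alpha = 1.0

# Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's Ridge with the intercept switched off, for comparison
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

# Ordinary least squares (alpha = 0) for reference
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("ridge (closed form):", w_closed)
print("ridge (sklearn):    ", w_sklearn)
print("OLS:                ", w_ols)
```

The two ridge results should agree, and the ridge weight vector is shorter (smaller norm) than the OLS one, which is the shrinkage the penalty is designed to produce.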

Generally, L2 regularization will produce better overall performance than L1.  However, L1 might be useful in situations with many inputs, where you can prune some of the weaker inputs.

The following code uses L2 with linear regression (Ridge regression):

### Example 5: L2 (Ridge) Regularization

The code in the cell below changes the `regressor` to `Ridge` with `alpha=1` and computes the coefficients for the regression equation that predicts $ Sweetness $. 


In [None]:
# Example 5: L2 (Ridge) Regularization

# Create Ridge regressor
regressor = Ridge(alpha=1)

# Fit/train Ridge
regressor.fit(sweetX_train,sweetY_train)

# Predict
pred = regressor.predict(sweetX_test)

# Measure RMSE error.  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(pred,sweetY_test))
print(f"Final score (RMSE): {score}")

report_coef(
  sweetColumnNames,
  regressor.coef_,
  regressor.intercept_)


If your code is correct you should see the following table of coefficients:

![__](class_05_1_Exm6A.png)

And the following chart:

![__](class_05_1_Exm6B.png)

The following table now compares the results of the Classical, L1 (LASSO) and L2 (Ridge) regressions.

|             | Classical | L1 (LASSO) | L2 (Ridge) |
|-------------|-----------|------------|------------|
| Size        | -0.519299 | -0.434176  | -0.519081  |
| Weight      | -0.407031 | -0.319322  | -0.406918  |
| Ripeness    | -0.341094 | -0.309975  | -0.341152  |
| Juiciness   | -0.137281 | -0.028268  | -0.137084  |
| Crunchiness | -0.119313 | -0.026203  | -0.119269  |
| Acidity     |  0.146360 |  0.083409  |  0.146248  |
| Quality     |  1.253890 |  0.696090  |  1.251762  |
| RMSE        |  1.509318 |  1.560978  |  1.509346  |
| Intercept   | -1.441211 | -1.191548  | -1.440038  |

L2 Ridge regularization affects the coefficients generated by OLS linear regression by adding a penalty term that is proportional to the sum of the squared values of the regression coefficients. This penalty encourages smaller coefficients overall and helps prevent overfitting by reducing the impact of large coefficient values. Consequently, L2 Ridge regularization can lead to coefficients that are more stable and less sensitive to multicollinearity compared to OLS linear regression, resulting in a more robust and generalizable model.

### **Exercise 5: L2 (Ridge) Regularization**

In the cell below change the `regressor` to `Ridge` with `alpha=1` and compute the coefficients for the regression equation that predicts $ Acidity $. 


In [None]:
# Insert your code for Exercise 5 here




If your code is correct you should see the following table of coefficients:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exe6A.png)

And the following chart:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exe6B2.png)

## ElasticNet Regularization

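ElasticNet regression combines both L1 and L2: both penalties are applied, each with its own weight ($ a $ and $ b $ below).  In scikit-learn, these two weights are set through the parameters `alpha` (the overall penalty strength) and `l1_ratio` (the share of the penalty given to L1).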

$$ a * {\rm L}1 + b * {\rm L}2 $$
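
In scikit-learn's `ElasticNet`, the L1 weight works out to `alpha * l1_ratio` and the L2 weight to `0.5 * alpha * (1 - l1_ratio)`. One consequence is that `l1_ratio=1.0` leaves only the L1 term, so the model should behave like `Lasso`. The sketch below checks this on synthetic data (the data and parameter values are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200)

# l1_ratio=1.0 puts all of the penalty on the L1 term, so this should match Lasso
enet_all_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# l1_ratio=0.1 gives a penalty that is mostly L2
enet_mostly_l2 = ElasticNet(alpha=0.1, l1_ratio=0.1).fit(X, y)

print("ElasticNet (l1_ratio=1.0):", np.round(enet_all_l1.coef_, 4))
print("Lasso:                    ", np.round(lasso.coef_, 4))
print("ElasticNet (l1_ratio=0.1):", np.round(enet_mostly_l2.coef_, 4))
```

Moving `l1_ratio` between 0 and 1 therefore slides the model between ridge-like and lasso-like behavior while `alpha` controls how strong the combined penalty is.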

### Example 6: ElasticNet Regularization

The code in the cell below uses `ElasticNet` regularization as the `regressor` function with the following parameters: $ \alpha = 0.1 $ and $ \beta = 0.1 $. The argument `l1_ratio` is $ \beta $. The code computes the coefficients for the regression equation that predicts $ Sweetness $. 

In [None]:
# Example 6: ElasticNet Regularization

# Create ElasticNet regressor
regressor = ElasticNet(alpha=0.1, l1_ratio=0.1)

# Fit/train ElasticNet
regressor.fit(sweetX_train,sweetY_train)
# Predict
pred = regressor.predict(sweetX_test)

# Measure RMSE error.  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(pred,sweetY_test))
print(f"Final score (RMSE): {score}")

report_coef(
  sweetColumnNames,
  regressor.coef_,
  regressor.intercept_)

If your code is correct you should see the following table of coefficients:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exm7A.png)

And the following chart:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exm7B.png)

The following table now compares the results of the Classical, L1 (LASSO), L2 (Ridge) and ElasticNet regressions.

|            |Classical  | L1 (LASSO)| L2 (Ridge) |ElasticNet |  
|------------|-----------|-----------|------------|-----------|
|Size        | -0.519299 | -0.434176 | -0.519081  | -0.467359 | 
|Weight      | -0.407031 | -0.319322 | -0.406918  | -0.371605 |
|Ripeness    | -0.341094 | -0.309975 | -0.341152  | -0.342934 |
|Juiciness   | -0.137281 | -0.028268 | -0.137084  | -0.099195 | 
|Crunchiness | -0.119313 | -0.026203 | -0.119269  | -0.089476 | 
|Acidity 	 |  0.146360 |  0.083409 |  0.146248  |  0.119367 |
|Quality 	 |  1.253890 |  0.696090 |  1.251762  |  0.820881 |
|RMSE        |  1.509318 |  1.560978 |  1.509346  |  1.531453 |
|Intercept   | -1.441211 | -1.191548 | -1.440038  | -1.205765 |
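
One way to read the table is to express each regularized coefficient as a fraction of its Classical value: a ratio near 1 means the coefficient was left almost unchanged, and a ratio near 0 means it was shrunk almost to nothing. The sketch below does this with the values copied from the table; the helper name `shrinkage` is hypothetical.

```python
# Coefficients copied from the comparison table (RMSE and Intercept omitted)
classical = {"Size": -0.519299, "Weight": -0.407031, "Ripeness": -0.341094,
             "Juiciness": -0.137281, "Crunchiness": -0.119313,
             "Acidity": 0.146360, "Quality": 1.253890}
lasso = {"Size": -0.434176, "Weight": -0.319322, "Ripeness": -0.309975,
         "Juiciness": -0.028268, "Crunchiness": -0.026203,
         "Acidity": 0.083409, "Quality": 0.696090}
ridge = {"Size": -0.519081, "Weight": -0.406918, "Ripeness": -0.341152,
         "Juiciness": -0.137084, "Crunchiness": -0.119269,
         "Acidity": 0.146248, "Quality": 1.251762}
elasticnet = {"Size": -0.467359, "Weight": -0.371605, "Ripeness": -0.342934,
              "Juiciness": -0.099195, "Crunchiness": -0.089476,
              "Acidity": 0.119367, "Quality": 0.820881}


def shrinkage(regularized, baseline):
    """Ratio of each regularized coefficient to its baseline value."""
    return {name: regularized[name] / baseline[name] for name in baseline}


for label, coefs in [("L1 (LASSO)", lasso), ("L2 (Ridge)", ridge),
                     ("ElasticNet", elasticnet)]:
    ratios = shrinkage(coefs, classical)
    summary = ", ".join(f"{name}={ratio:.2f}" for name, ratio in ratios.items())
    print(f"{label}: {summary}")
```

Read this way, the L2 (Ridge) column stays within a fraction of a percent of the Classical values, while the L1 (LASSO) column shrinks the smallest coefficients (Juiciness and Crunchiness) the most.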



### **Exercise 6: ElasticNet Regularization**

In the cell below, use `ElasticNet` regularization as the `regressor` function with the following parameters: $ \alpha = 0.1 $ and $ \beta = 0.1 $. Compute the coefficients for the regression equation that predicts $ Acidity $.

In [None]:
# Insert your code for Exercise 6 here



If your code is correct you should see the following table of coefficients:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exe7A.png)

And the following chart:

![__](https://biologicslab.co/BIO1173/images/class_05_1_Exe7B.png)

## Summary 

In this lesson we looked at 3 different regularization methods for reducing overfitting, each demonstrated here on a linear regression model:

* **L1 (Lasso):** Lasso stands for Least Absolute Shrinkage and Selection Operator. It is frequently used in machine learning to handle high-dimensional data because it performs automatic feature selection. The L1 penalty promotes sparsity by driving some coefficients to exactly zero, which can help with both multicollinearity and overfitting.
* **L2 (Ridge):** Ridge regression specifically corrects for multicollinearity in regression analysis. It shrinks all coefficients toward zero without eliminating any of them, which is useful for machine learning models that have a large number of parameters, particularly if those parameters also have high weights.
* **ElasticNet:** The elastic net method addresses limitations of the LASSO (least absolute shrinkage and selection operator) method. For example, if there is a group of highly correlated variables, the LASSO tends to select one variable from the group and ignore the others. To overcome this, the elastic net adds a quadratic part to the penalty, which when used alone is ridge regression (also known as Tikhonov regularization).
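
The difference between LASSO and ElasticNet on correlated predictors can be sketched with a small synthetic experiment. The example below is an illustration under stated assumptions, not part of the lesson's dataset: the data are randomly generated, the two predictor columns are exact copies of each other (the extreme case of correlation), and the `alpha` and `l1_ratio` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (hypothetical): one underlying signal, duplicated into two columns
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])                      # perfectly correlated predictors
y = 3.0 * x + rng.normal(scale=0.1, size=200)    # target depends on the shared signal

# Fit both models with the same overall penalty strength
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:     ", lasso.coef_)
print("ElasticNet coefficients:", enet.coef_)
```

In a run like this one would expect the LASSO to place essentially all of the weight on one of the two columns, while ElasticNet spreads the weight roughly evenly across both. The exact numbers depend on the solver and the random data.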

These regularization techniques were originally developed for linear regression, in part as a way to deal with the problem of [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity).

Multicollinearity refers to a situation where the predictor variables have a nearly exact linear relationship with one another. It is a problem for linear regression analysis because it complicates both the estimation of the coefficients and the interpretation of the model:
* **Inflated Standard Errors:** High correlations between predictor variables inflate the standard errors of the regression coefficients. This makes it difficult to determine the true significance of the predictors in the model.
* **Unstable Coefficients:** Multicollinearity can cause instability in the estimated coefficients, making them sensitive to small changes in the data. This makes it challenging to interpret the individual effects of each predictor on the target variable.
* **Difficulty in Feature Selection:** Multicollinearity makes it hard to distinguish between the relative importance of correlated predictors. It can lead to difficulties in identifying which predictors are truly contributing to the model and which ones are redundant.
* **Decreased Predictive Accuracy:** Multicollinearity can reduce the predictive accuracy of the model as it introduces noise and biases in the estimation process. This may result in a less reliable model for making predictions on new data.

Overall, multicollinearity complicates the interpretation of the model, reduces the precision of coefficient estimates, and hinders the predictive performance of a linear regression model.
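
The inflation of standard errors can be quantified with the variance inflation factor (VIF), a standard diagnostic that is not otherwise used in this lesson. For a model with only two predictors, it reduces to $1 / (1 - r^2)$, where $r$ is the correlation between them. The sketch below uses small made-up measurements; the helper names `pearson_r` and `vif_two_predictors` and the data values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up measurements for illustration only
size = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
weight = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]      # nearly a copy of size
ripeness = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0]  # largely unrelated to size

print(f"VIF (size vs weight):   {vif_two_predictors(size, weight):.1f}")
print(f"VIF (size vs ripeness): {vif_two_predictors(size, ripeness):.2f}")
```

A VIF close to 1 means the coefficient variance is essentially unaffected by the other predictor, while a large VIF indicates that the coefficient's variance is inflated many times over by the overlap between the two predictors.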

It turns out the solutions for reducing multicollinearity in mathematical regression equations are also useful for preventing overfitting in neural network models.


## **Lesson Turn-in**

When you have completed all of the code cells and run them in sequential order (the last code cell should be number 15), use **File --> Print... --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_05_1.lastname.pdf`, where _lastname_ is your last name, and upload the file to Canvas.