# [Title of Module]


## Overview
In previous submodules we looked at classification and segmentation, where the goal is to predict a label drawn from a finite set of distinct values. That is not always the case: sometimes the predicted value must be continuous and can take on infinitely many values. In this case we use a method called `Regression`. 

## Learning Objectives
+ `Understand the concept of regression`: Learn what regression is, its purpose (estimating relationships between variables), and different types (linear, non-linear, univariate, multivariate).
+ `Read and preprocess tabular data`: Learn how to read data from a CSV file using pandas, drop unnecessary columns, and convert categorical data to numerical data for regression.
+ `Evaluate regression models`: Understand and apply metrics for evaluating regression models, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
+ `Implement regression using scikit-learn`:  Learn how to use scikit-learn's LinearRegression, DecisionTreeRegressor, and RandomForestRegressor to build and train regression models.
+ `Implement regression using PyTorch`: Learn how to build and train a linear regression model using PyTorch. 
+ `Perform feature selection`: Understand the importance of feature selection and apply techniques like correlation statistics (f_regression) and mutual information statistics (mutual_info_regression) to select the most relevant features for regression modeling using SelectKBest.
+ `Visualize and interpret results`:  Learn how to visualize correlation matrices using heatmaps and interpret the results of regression experiments.


## Prerequisites

**Data**

+ `Kaggle Breast Cancer Dataset`

**Main libraries**

* `scikit-learn`: this library is specifically designed for data analysis. It includes functions for classification, clustering, and regression as well as functions for preprocessing data [2].
* `pandas`: this library is for storing and retrieving data.
* `numpy`: this library is for converting data into vectors and matrices and performing matrix operations (like multiplication, addition, subtraction, etc.)
* `pytorch`: this library focuses on designing machine learning models
* `matplotlib` and `seaborn`: libraries used for plotting and visual analysis of data




## Get Started

Regression is a statistical process that estimates the relationship between a dependent variable (also called the *response* or *label*) and one or more independent variables (also called *predictors*, *covariates* or *features*). Regression can be linear or non-linear depending on the relationship between the dependent and independent variables. Regression is useful for estimating, predicting, or forecasting a continuous value, for example the next value in a sequence. Linear regression can be represented as:

$ \textbf{y} = m\textbf{x} + c $

![Figure 1: reg](reg0.png)

In the equation, $ \textbf{y} $ is the dependent variable and $ \textbf{x} $ is the independent variable. The objective of regression is to estimate the slope $ m $ and the intercept $ c $. If there is only one independent variable, then it is referred to as *univariate* regression and if there are multiple independent variables then it is referred to as *multivariate* regression.
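
As a small illustration of what "estimating $m$ and $c$" means, the sketch below fits a line to synthetic data with the closed-form least-squares formulas for one independent variable. The data, the true values (2 and 1) and the variable names are assumptions made for this example only; they are not part of the tutorial's dataset.

```python
import numpy as np

# Synthetic data generated from y = 2x + 1 plus a little noise (illustrative values)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# Closed-form least-squares estimates for a single independent variable
m_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c_hat = y.mean() - m_hat * x.mean()

print("estimated m: %.3f, estimated c: %.3f" % (m_hat, c_hat))
```

The estimates should land close to the values used to generate the data, which is all regression promises here: it recovers the line that best explains the observed points.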

Regression is usually implemented on tabular data. For example, we may extract specific features from an image and present them in a tabular form. We could then use the extracted features to predict other values.

This tutorial covers basic regression and its concepts. We use Python libraries to implement regression, beginning with a simple example on tabular data. For tabular data we use Python’s `scikit-learn` library, a very popular library for data analysis and implementing regression. Regression can also be implemented as a trainable machine learning model; for that we use the `pytorch` library. 

The following topics will be covered in this tutorial:

* <a href="#reading">Reading tabular data</a></br>
* <a href="#scikit-learn">Regression using ``scikit-learn``</a></br>
* <a href="#pytorch">Regression using ``pytorch``</a></br>
* <a href="#feature">Feature selection</a></br>
* <a href="#conclusion">Conclusion</a></br>
* <a href="#ref">References</a></br>
* <a href="#quiz">Self assessment</a></br>

## <a name="reading">Reading tabular data</a>
---

We begin the tutorial by first reading the tabular data that will be used for demonstrating regression. The tabular data represents breast cancer diagnosis. 

The tabular data is read in as a pandas dataframe.

In [None]:
# To read csv files requires the Pandas library
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('kaggle_breastcancer_data.csv')

# IDs column is not required so it is dropped
df.drop(['id'], axis=1, inplace=True)

# Visualize first 4 rows
df.head(4)

Next, we separate the independent variables and dependent variables.

The *radius_mean* is the dependent variable which is stored as *label*. All other columns in the table are independent variables which are stored as *features*. The dependent and independent variables can be viewed using the following commands:

In [None]:
# This is the column we want to predict (dependent variable)
TARGET_COLUMN = 'radius_mean'

# We drop the dependent variable; the remaining columns are the independent variables
features = df.drop(TARGET_COLUMN, axis=1)

label = df[TARGET_COLUMN]

# We need to convert "M" and "B" to numerical values 1 and 0 for regression
features['diagnosis'] = features['diagnosis'].apply(lambda x: 0 if x == 'B' else 1)

# Display the features (independent variables) and label (dependent variable)
print("List of Features: %s" % ', '.join(features))
print("\nLabel: %s" % TARGET_COLUMN)

### Metrics for model evaluation
---

Once a regression model is created, the next task is to determine how well the model works. There are many different metrics that can be used to measure the effectiveness of the model. 
* **Mean Absolute Error (MAE)**: measures the absolute difference between the actual and the predicted values (the difference itself is also called the *residual*) and then computes the mean.

MAE = $\frac{1}{N}\sum \limits_{i=1}^{N} |y_i - \hat{y_i}|$
* **Mean Squared Error (MSE)**: measures the square of the difference between the actual and the predicted values and then computes the mean.

MSE = $\frac{1}{N}\sum \limits_{i=1}^{N} (y_i - \hat{y_i})^2$
* **Root Mean Squared Error (RMSE)**: measures the mean of the square of the residual and then computes the square root of that mean.  

RMSE = $\sqrt{\frac{1}{N}\sum \limits_{i=1}^{N} (y_i - \hat{y_i})^2}$

Where $y_i$ is the $i$th actual value, $\hat{y_i}$ is the $i$th predicted value, and $N$ is the number of points.

Ideally, the MAE, MSE and RMSE should be as low as possible.

To evaluate the model, we use the ``mean_absolute_error()`` and ``mean_squared_error()`` functions from ``scikit-learn`` to calculate these metrics.
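
To make the three metrics concrete, here is a minimal worked example that computes them by hand with `numpy`. The four actual and predicted values are made up for illustration: MAE averages the absolute residuals, MSE averages the squared residuals, and RMSE is the square root of the MSE.

```python
import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

residual = y_true - y_pred           # [1, 0, -1, -3]
mae = np.mean(np.abs(residual))      # (1 + 0 + 1 + 3) / 4 = 1.25
mse = np.mean(residual ** 2)         # (1 + 0 + 1 + 9) / 4 = 2.75
rmse = np.sqrt(mse)                  # square root of 2.75, about 1.658

print("MAE %.3f, MSE %.3f, RMSE %.3f" % (mae, mse, rmse))
```

Because the residuals are squared before averaging, the single large error (3) dominates the MSE and RMSE far more than it dominates the MAE.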

## <a name="scikit-learn">Regression using ``scikit-learn``</a>
---

We create a basic regression function. The function takes as input a regression model, the features (independent variables), the label (dependent variable), and the number of iterations over which the model is repeatedly trained and evaluated.
Within the function, the features and labels are divided into a training set and a testing set using the ``train_test_split()`` function from the ``scikit-learn`` library. Each iteration draws a new random split. The testing size is $1/3$ of the data size. The training set is used for training the regression model and the testing set is used for evaluating the regression model.
We use the MAE and RMSE metrics to measure the effectiveness of the model.
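
The splitting step can be tried in isolation. The sketch below uses a synthetic array of 90 samples (an assumption for illustration, not the tutorial's dataset) and fixes `random_state` so that the split is reproducible; the tutorial's function leaves it unset so that every iteration gets a different split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90 synthetic samples with 2 features each (illustrative values)
X = np.arange(180, dtype=float).reshape(90, 2)
y = np.arange(90, dtype=float)

# One third of the samples go to the test set, the rest to the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

print(X_train.shape, X_test.shape)
```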

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def SciKitRegressionModel(model, features, label, iterations = 100):
    
    '''
        Function for running Regression using SciKit-learn library
        Parameters
        ----------
        model : SciKit-learn regression function, e.g. LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor
        features : independent variable, matrix of shape = [dim, n_features]
        label : dependent variable, vector of shape = [dim, ]
        iterations : number of iterations to run
        
        Returns
        ----------
        maeReg : MAE values from regression model after evaluation, list of shape = [iterations, ]
        rmseReg: RMSE values from regression model after evaluation, list of shape = [iterations, ]
    ''' 
    
    maeReg, rmseReg = [], []

    for i in range(iterations):

        RegMod = model
        XTrain, XTest, yTrain, yTest = train_test_split(features, label, test_size=1 / 3)
        RegMod.fit(XTrain, yTrain)

        reslist = RegMod.predict(XTest).tolist()
        truthlist = yTest.tolist()

        # scikit-learn metrics take the arguments in the order (y_true, y_pred)
        mae = mean_absolute_error(truthlist, reslist)
        rmse = mean_squared_error(truthlist, reslist) ** .5

        maeReg.append(mae)
        rmseReg.append(rmse)

    print("Model    : " + str(model))
    print("MAE      : Mean %.4f Deviation %.4f" % (np.mean(maeReg), np.std(maeReg)))
    print("RMSE     : Mean %.4f Deviation %.4f" % (np.mean(rmseReg), np.std(rmseReg)))
    
    return maeReg, rmseReg

In [None]:
from sklearn.linear_model import LinearRegression

maeLin, rmseLin = SciKitRegressionModel(LinearRegression(), features, label, 100)

In addition to the linear regression model, ``scikit-learn`` provides many other regression methods, such as decision trees and ensemble methods. The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm to improve generalizability and robustness over a single estimator. In this tutorial we will use a single decision tree and a random forest, which is an ensemble of decision trees.

Decision trees create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree is built through a process known as binary recursive partitioning, which is an iterative process that splits the data into partitions or branches, and then continues splitting each partition into smaller groups as the method moves down each branch. The random forest regressor is an ensemble of randomized decision trees: each tree in the ensemble is built from a sample drawn with replacement from the training set, and the predictions of the individual trees are averaged.
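
The relationship between a random forest and its trees can be checked directly. The following sketch, on synthetic data invented for this example, fits a small `RandomForestRegressor` and compares its prediction with the plain average of the predictions of the individual trees stored in its `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Predict with every fitted tree separately, then average across trees
X_new = X[:5]
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_new)))
```

Averaging many trees that were each trained on a different bootstrap sample is what smooths out the high variance of a single deep tree.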

We will use the ``DecisionTreeRegressor`` and ``RandomForestRegressor`` classes from ``scikit-learn`` to implement decision tree and random forest regression [2].

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

maeDtree, rmseDTree = SciKitRegressionModel(DecisionTreeRegressor(), features, label, 100)
maeRfr, rmseRfr = SciKitRegressionModel(RandomForestRegressor(), features, label, 100)

## <a name="pytorch">Regression using ``pytorch``</a>
---

A regression model can also be implemented using `pytorch`. To do so we use the `nn.Linear` module.

`nn.Linear(a,b)` is a module that creates a single-layer feed-forward network with `a` inputs and `b` outputs. Mathematically, this module is designed to calculate the linear equation $ \textbf{y} = m\textbf{x} + c $ where $ \textbf{x} $ is the input, $ \textbf{y} $ is the output, $m$ is the weight and $c$ is the bias. In our case `a` is the number of features of the data frame and `b` is 1, which is the number of labels we are estimating. The weight and bias are learned during the training phase.
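
The sketch below checks this description numerically on random inputs. The sizes (3 inputs, 1 output, 4 samples) are arbitrary choices for illustration. It compares the output of the layer with a manual computation using the layer's own `weight` and `bias` attributes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)      # a = 3 inputs, b = 1 output
x = torch.randn(4, 3)        # 4 samples with 3 features each

# The weight is stored with shape [b, a] and the bias with shape [b],
# so the manual equivalent of the layer is x @ W^T + bias
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(layer(x), manual))
```

With several features, $m$ is therefore a row of weights (one per feature) rather than a single number, and each output is a weighted sum of the features plus the bias.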

The model also requires defining some hyperparameters: the learning rate, the loss function and the optimizer. 

Before we implement the regression model in Pytorch, we will perform some preprocessing. This includes splitting the data into training and testing sets and converting the pandas dataframe into a numpy array. The numpy array is then normalized. The goal of normalization is to bring all features to a common scale so that features with large numeric ranges do not dominate the others. Normalization also improves the numerical stability of the model. We use the popular *min-max* normalization. *min-max* is defined as:

$ x[:,i] = \frac{x[:,i] - min(x[:,i])}{max(x[:,i]) - min(x[:,i])} $ 

where $i$ indexes the $i$th feature (column) of $x$.
Normalization can also be implemented using ``MinMaxScaler()`` from scikit-learn. However, we write our own normalization function to show how min-max normalization works.
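
A small worked example of the formula, using made-up numbers: the minimum and maximum are computed per column on the training array only and then reused for the test array. A consequence worth noticing is that test values outside the training range fall outside the interval from 0 to 1.

```python
import numpy as np

# Made-up training and test data with two feature columns
train = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 40.0]])
test = np.array([[2.0, 25.0],
                 [6.0, 5.0]])

col_min = train.min(axis=0)   # [1, 10]
col_max = train.max(axis=0)   # [5, 40]

# Apply the training statistics to both arrays
train_norm = (train - col_min) / (col_max - col_min)
test_norm = (test - col_min) / (col_max - col_min)

print(train_norm)
print(test_norm)
```

Here the second test row maps to 1.25 in the first column and to a negative value in the second, because 6 is above the training maximum and 5 is below the training minimum.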

In [None]:
# Split the data into training and testing sets

featuresnp = features.to_numpy()
labelnp = label.to_numpy()

XTrainNP, XTestNP, yTrainNP, yTestNP = train_test_split(featuresnp, labelnp, test_size=1 / 3)

In [None]:
# Normalization function
def normalizer(x, y):
    
    '''
        Function to perform normalization
        Parameters
        ----------
        x : numpy array to normalize, matrix of shape [dim, n_features]
        y : numpy array used to normalize x, matrix of shape [dim, n_features]
        
        Returns
        ----------
        normX : normalized x values between 0 and 1, matrix of shape [dim, n_features]
    ''' 
    
    minRange = np.min(y, axis=0)
    maxRange = np.max(y, axis=0)
    normX = (x - minRange) / (maxRange - minRange)
    return normX

<div class="alert alert-block alert-info"> <b>EXERCISE</b> We use a custom function for normalization. Try using <code>MinMaxScaler()</code> to perform normalization and use the normalized data as input to the Pytorch model. Is the result the same?

The scaler can be used with the following commands:
    
<code>from sklearn.preprocessing import MinMaxScaler</code>
    
<code>sc = MinMaxScaler()</code>
    
<code>normalizedData = sc.fit_transform(data)</code> </div>

In [None]:
XTrainNorm = normalizer(XTrainNP, XTrainNP)
yTrainNorm = normalizer(yTrainNP, yTrainNP).reshape(-1,1)

XTestNorm = normalizer(XTestNP, XTrainNP)
yTestNorm = normalizer(yTestNP, yTrainNP).reshape(-1,1)

<div class="alert alert-block alert-info"> <b>Knowledge Check</b> </div>

In [None]:
!pip install jupyterquiz==2.0.7 --quiet
from jupyterquiz import display_quiz
display_quiz('../quiz_files/submodule_04/kc1.json')

### Training the model
---

First we define a Torch model using Pytorch.

In [None]:
# If not already installed: 
# !pip install tqdm torchshow torch 

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn

input_size = features.shape[1]
output_size = 1

class LinearRegressionModel(nn.Module):
    
     # Class for implementing Linear Regression in Pytorch

    def __init__(self, input_size , output_size):
        
        '''
        Function for creating linear regression model
        Parameters
        ----------
        input_size : input size for model, integer = n_features
        output_size : output size of model, integer = number of predicted values (1 here)
        ''' 
        
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(input_size , output_size)  

    def forward(self, x):
        
        '''
        Forward pass of the linear regression model
        Parameters
        ----------
        x : independent variable, matrix of shape [dim, n_features]
        
        Returns
        ----------
        y_pred : predicted labels, matrix of shape [dim, output_size]
        ''' 
        
        y_pred = self.linear(x)
        
        return y_pred

model = LinearRegressionModel(input_size , output_size)

To enter data into a Pytorch model, it has to be converted into a tensor object. In the next cell we convert the numpy data into torch tensor objects.

In [None]:
# Convert numpy to torch for training the model
XTrainTensor = torch.from_numpy(XTrainNorm.astype(np.float32))
yTrainTensor = torch.from_numpy(yTrainNorm.astype(np.float32))

Like the regression models from before, the Pytorch model also has to be trained. We define a training loop that trains on the training data by minimizing the MSE. The training is done using an algorithm called *gradient descent*. Gradient descent is an iterative algorithm that finds a local minimum of a differentiable function. The idea is to compute the gradient of the loss function at the current parameter values and then take a small step in the opposite direction, which moves the parameters toward a minimum of the loss. The size of each step is determined by the *learning rate* parameter.

The weights and bias are updated accordingly. Gradient descent is implemented using the following steps:

* Compute the loss function (in this case we are using MSE)
* Calculate the gradient of the loss with respect to the model parameters (the weights and bias) by using the command ``.backward()``. This is referred to as *backpropagation*
* Update the weights and bias using the ``step()`` command
* Repeat the above steps

At each epoch, we measure the MSE loss. The loss typically decreases as we train for more epochs. However, after a certain number of epochs, the loss will plateau and not reduce any further. At this point it is not useful to train the model for any more epochs and in most cases the training is stopped.
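
The steps above can be sketched without Pytorch for a single feature, which makes the gradient and the update rule explicit. Everything in this example (the synthetic data, the true values 3 and 0.5, the learning rate of 0.5 and the 500 epochs) is an assumption chosen for illustration, not the configuration used in this tutorial.

```python
import numpy as np

# Synthetic data generated from y = 3x + 0.5 plus noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.05, size=100)

m, c = 0.0, 0.0          # initial weight and bias
learning_rate = 0.5      # illustrative value
losses = []

for epoch in range(500):
    y_pred = m * x + c
    error = y_pred - y
    losses.append(np.mean(error ** 2))   # MSE loss for this epoch

    # Gradients of the MSE with respect to the weight and the bias
    grad_m = 2.0 * np.mean(error * x)
    grad_c = 2.0 * np.mean(error)

    # Step against the gradient, scaled by the learning rate
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print("m = %.3f, c = %.3f, final loss = %.5f" % (m, c, losses[-1]))
```

Plotting `losses` against the epoch number shows the behaviour described above: a fast initial drop followed by a plateau once the parameters are close to the minimum.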

In [None]:
learning_rate = 0.0001
loss_function = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate )

num_epochs = 100

maeML, rmseML = [], []

for epoch in range(num_epochs):
    #forward feed (the inputs do not need gradients; only the model parameters are trained)
    y_pred = model(XTrainTensor)

    #calculate the loss
    loss = loss_function(y_pred, yTrainTensor)
    
    mae = mean_absolute_error(y_pred.detach().numpy(), yTrainTensor.detach().numpy())
    rmse = mean_squared_error(y_pred.detach().numpy(), yTrainTensor.detach().numpy()) ** .5
    
    maeML.append(mae)
    rmseML.append(rmse)

    #backward propagation: calculate gradients
    loss.backward()

    #update the weights
    optimizer.step()

    #clear out the gradients from the last step loss.backward()
    optimizer.zero_grad()

### Evaluating the model
---

In this section we test the model trained in the previous section. The test set is first converted from numpy arrays to torch tensors.

In [None]:
# Convert numpy to torch for evaluating the model
XTestTensor = torch.from_numpy(XTestNorm.astype(np.float32))
yTestTensor = torch.from_numpy(yTestNorm.astype(np.float32))

In [None]:
yTestHat = model(XTestTensor)

In [None]:
maeMLTest = mean_absolute_error(yTestHat.detach().numpy(), yTestTensor.detach().numpy())
rmseMLTest = mean_squared_error(yTestHat.detach().numpy(), yTestTensor.detach().numpy()) ** .5

print("Model    : Pytorch Linear Regression")
print("MAE      : Mean %.4f" % (maeMLTest))
print("RMSE     : Mean %.4f"  % (rmseMLTest))

<div class="alert alert-block alert-info"> <b>EXERCISE</b> Learning rate is an important hyperparameter that affects the training speed and accuracy of the model. Try changing the learning rate (lr) to 0.01 and 0.1. How does it affect MAE and RMSE? </div>

Below is a complete function that runs the entire Pytorch pipeline and returns the MAE and RMSE. We combine all the steps from before and repeat them 100 times, each time with a different train-test split, which results in different MAE and RMSE values. Finally, we report the mean and standard deviation of the MAE and RMSE over the 100 iterations.

In [None]:
def PytorchRegressionModel(features, label, iterations = 100):
    
    '''
        Function for running Regression using Pytorch
        Parameters
        ----------
        features : independent variable, numpy matrix of shape = [dim, n_features]
        label : dependent variable, numpy vector of shape = [dim, ]
        iterations : number of iterations to run
        
        Returns
        ----------
        maeReg : MAE values from regression model after evaluation, list of shape = [iterations, ]
        rmseReg: RMSE values from regression model after evaluation, list of shape = [iterations, ]
    ''' 
    
    input_size = features.shape[1]
    output_size = 1
    
    maeML, rmseML = [], []
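    
    for i in range(iterations):
        
        # Create a fresh model for each split so that weights learned on one
        # split (including its test samples) do not carry over to the next
        model = LinearRegressionModel(input_size , output_size)
        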
        # Split the data into training and testing sets
        XTrainNP, XTestNP, yTrainNP, yTestNP = train_test_split(features, label, test_size=1 / 3)

        XTrainNorm = normalizer(XTrainNP, XTrainNP)
        yTrainNorm = normalizer(yTrainNP, yTrainNP).reshape(-1,1)

        XTestNorm = normalizer(XTestNP, XTrainNP)
        yTestNorm = normalizer(yTestNP, yTrainNP).reshape(-1,1)

        # Convert numpy to torch for training the model
        XTrainTensor = torch.from_numpy(XTrainNorm.astype(np.float32))
        yTrainTensor = torch.from_numpy(yTrainNorm.astype(np.float32))
       
        learning_rate = 0.0001
        loss_function = nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate )

        num_epochs = 100

        # Per-epoch training metrics go into separate lists so that the per-split
        # test metrics collected in maeML and rmseML are not reset on every split
        maeTrain, rmseTrain = [], []

        for epoch in range(num_epochs):
            #forward feed
            y_pred = model(XTrainTensor)

            #calculate the loss
            loss = loss_function(y_pred, yTrainTensor)

            mae = mean_absolute_error(y_pred.detach().numpy(), yTrainTensor.detach().numpy())
            rmse = mean_squared_error(y_pred.detach().numpy(), yTrainTensor.detach().numpy()) ** .5

            maeTrain.append(mae)
            rmseTrain.append(rmse)

            #backward propagation: calculate gradients
            loss.backward()

            #update the weights
            optimizer.step()

            #clear out the gradients from the last step loss.backward()
            optimizer.zero_grad()
            #if epoch % 100 == 0:
                #print('epoch {}, loss {}'.format(epoch, loss.item()))

        # Convert numpy to torch for evaluating the model
        XTestTensor = torch.from_numpy(XTestNorm.astype(np.float32))
        yTestTensor = torch.from_numpy(yTestNorm.astype(np.float32))

        yTestHat = model(XTestTensor)

        maeMLTest = mean_absolute_error(yTestHat.detach().numpy(), yTestTensor.detach().numpy())
        rmseMLTest = mean_squared_error(yTestHat.detach().numpy(), yTestTensor.detach().numpy()) ** .5
        
        maeML.append(maeMLTest)
        rmseML.append(rmseMLTest)
                         
    print("Model    : " + 'Pytorch Regression')
    print("MAE      : Mean %.4f Deviation %.4f" % (np.mean(maeML), np.std(maeML)))
    print("RMSE     : Mean %.4f Deviation %.4f" % (np.mean(rmseML), np.std(rmseML)))
        
    return maeML, rmseML

maeML, rmseML = PytorchRegressionModel(features.to_numpy(), label.to_numpy())

<div class="alert alert-block alert-danger"> <b>CHALLENGE</b> Use more linear layers in the model. Observe how MAE and RMSE change. </div>

<div class="alert alert-block alert-info"> <b>Knowledge Check</b> </div>

In [None]:
display_quiz('../quiz_files/submodule_04/kc2.json')

## <a name="feature">Feature selection</a>
---

**Feature selection** is the process of identifying and selecting a subset of input variables that are most relevant to the target variable.

The simplest case of feature selection is one which involves numerical input variables and a numerical target for regression predictive modeling. This is because the strength of the relationship between each input variable and the target can be calculated (referred to as correlation) and compared to each other. By selecting the features that are highly correlated with the target, the MAE and RMSE can often be reduced.

Feature selection can be done using machine learning. However, here we will use the ``scikit-learn`` library for feature selection. There are two popular feature selection techniques that can be used for numerical input data and a numerical target variable.

* Correlation Statistics: Correlation is a measure of how closely two variables change together. The stronger the relationship, the more likely the feature should be selected for modeling. In ``scikit-learn``, this is implemented using the ``f_regression()`` function.
* Mutual Information Statistics: Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable. In ``scikit-learn``, this is implemented using the ``mutual_info_regression()`` function.
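
To show the idea behind correlation-based scoring, the sketch below ranks four synthetic features by their Pearson correlation with the target, converted to the F-statistic of a univariate linear fit. This is a hand-written illustration that mirrors what ``f_regression()`` is documented to compute but does not call it; the data, the coefficients and the value of `k` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
# The target depends strongly on column 0, moderately on column 2,
# and not at all on columns 1 and 3
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Pearson correlation between each feature column and the target
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# F-statistic of a one-feature linear fit with n - 2 degrees of freedom
f_scores = r ** 2 / (1.0 - r ** 2) * (n - 2)

# Keep the k features with the highest scores
k = 2
top_k = np.argsort(f_scores)[::-1][:k]

print("scores:", np.round(f_scores, 2))
print("selected columns:", top_k)
```

Because the F-statistic grows with the squared correlation, ranking by score is the same as ranking by the absolute value of the correlation. Note that this only captures linear relationships, which is one reason mutual information is offered as an alternative.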

We create a function that can be used for feature selection. The function takes as input the features, the label, the score function to use for feature selection and the number of top features to select ($k$).

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression

# feature selection
def select_features(features, label, score_function, k = 5):
    
    '''
        Function for selecting top k features
        Parameters
        ----------
        features : independent variable, numpy matrix of shape = [dim, n_features]
        label : dependent variable, numpy vector of shape = [dim, ]
        score_function : function to use for feature selection, can be f_regression or mutual_info_regression
        k : number of top-scoring features to keep, integer
        
        Returns
        ----------
        features_fs : independent variable, numpy matrix of shape = [dim, k]
        fs: output object from SelectKBest() function
    ''' 
        
    # configure to select the top k features
    fs = SelectKBest(score_func=score_function, k=k)
    # learn relationship from data
    fs.fit(features, label)
    # transform input data
    features_fs = fs.transform(features)
    return features_fs, fs

We perform feature selection in this tutorial using the ``f_regression()`` function. However, ``mutual_info_regression()`` can be passed to the same function instead.

In [None]:
features_fs, fs = select_features(features, label, f_regression, 5)

In [None]:
from matplotlib import pyplot

for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()

We select the top 5 features by choosing the features with the highest scores.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
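# Use only the numeric columns: 'diagnosis' in df still holds the strings "M" and "B"
corr_matrix = df.select_dtypes(include="number").corr(method="pearson")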
fig, ax = plt.subplots(figsize=(17,17)) 
sns.heatmap(corr_matrix, cmap="YlGnBu", annot=True, cbar=True, linewidths=0.5, ax=ax)
# sns.heatmap(corr_matrix, vmin=-1., vmax=1., annot=True, fmt='.2f', cmap="YlGnBu", cbar=True, linewidths=0.5)
plt.title("pearson correlation")

The correlation can also be viewed by plotting the correlation matrix as a heatmap. Dark blue cells indicate a strong positive correlation between a pair of columns. This matches the result obtained when we selected the top 5 features using the feature selection function. We will now use the selected features for training and evaluating our regression models.

In [None]:
maeLinfs, rmseLinfs = SciKitRegressionModel(LinearRegression(), features_fs, label, 100)
maeDtreefs, rmseDTreefs = SciKitRegressionModel(DecisionTreeRegressor(), features_fs, label, 100)
maeRfrfs, rmseRfrfs = SciKitRegressionModel(RandomForestRegressor(), features_fs, label, 100)
maeMLfs, rmseMLfs = PytorchRegressionModel(features_fs,  label.to_numpy())

It can be observed that considering only the top $k$ features reduced the MAE and RMSE for the decision tree, random forest and Pytorch models. In the case of linear regression, the MAE and RMSE increased. A likely reason is that the linear model was already making good use of all the features, so removing some of them discarded useful information rather than improving performance. 

We summarize the MAE and RMSE values for all the experiments and present them in boxplots. Each box shows the spread of the MAE or RMSE values over the iterations of one experiment. The colored line inside each box is the median of those values.

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(7,7))
axs[0, 0].boxplot([maeLin, maeDtree, maeRfr, maeML], labels=['Linear', 'DTree', 'RF', 'Pytorch'])
axs[0, 0].set_title('MAE all Features')
axs[0, 1].boxplot([maeLinfs, maeDtreefs, maeRfrfs, maeMLfs], labels=['Linear', 'DTree', 'RF', 'Pytorch'])
axs[0, 1].set_title('MAE top k Features')
axs[1, 0].boxplot([rmseLin, rmseDTree, rmseRfr, rmseML], labels=['Linear', 'DTree', 'RF', 'Pytorch'])
axs[1, 0].set_title('RMSE all Features')
axs[1, 1].boxplot([rmseLinfs, rmseDTreefs, rmseRfrfs, rmseMLfs], labels=['Linear', 'DTree', 'RF', 'Pytorch'])
axs[1, 1].set_title('RMSE top k Features')
plt.tight_layout()

<div class="alert alert-block alert-info"> <b>EXERCISE</b> Implement feature selection using ``mutual_info_regression()``. Are the top 5 features the same? How does it affect MAE and RMSE? What if the value of k is increased from 5 to 10? How does it affect MAE and RMSE?</div>

<div class="alert alert-block alert-info"> <b>Knowledge Check</b> </div>

In [None]:
display_quiz('../quiz_files/submodule_04/kc3.json')

## Conclusion
In this module, we implemented regression using different methods. We studied tabular data and estimated the mean radius of a tumor. We looked at how the most relevant features can be selected and how these features affect the MAE and RMSE metrics. Feature selection significantly improved the MAE and RMSE for the random forest, decision tree and Pytorch regression models, but increased the MAE and RMSE for linear regression. 

## Clean up
To keep your workspace organized, remember to: 

1. Save your work.
2. Shut down any notebooks and active sessions to avoid extra charges.

### References
---
[1] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification". arXiv preprint arXiv:2110.14795, 2021.

[2] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011