[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## Supervised Learning - Classification - Logistic Regression

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.001 | 31/03/2024 | Royi Avital | More remarks on the derivation                                     |
| 1.0.000 | 20/03/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0041LogisticRegressionSolution.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Image Processing

# Machine Learning

# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Set, Tuple, Union

# Visualization
import matplotlib as mpl
from matplotlib.colors import LogNorm, Normalize
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider
from ipywidgets import interact

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

In [None]:
# Courses Packages

from DataVisualization import PlotLabelsHistogram, PlotMnistImages


In [None]:
# General Auxiliary Functions


## Logistic Regression

In this exercise we'll use the Logistic Regression model as a classifier.  
The SciKit Learn library implements it with the [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class.

In this exercise we'll do the following:

1. Load the [MNIST Data set](https://en.wikipedia.org/wiki/MNIST_database) using `fetch_openml()`.
2. Train a Logistic Regression model on the training data.
3. Optimize the parameters: `penalty` and `C` by the `roc_auc` score.
4. Interpret the model using its weights.

* <font color='brown'>(**#**)</font> The model is a linear model, hence its weights are easy to interpret.

### Cross Entropy Loss vs. MSE for Probabilistic Predictions

The Logistic Regression is based on the [Cross Entropy Loss](https://en.wikipedia.org/wiki/Cross-entropy) which measures the dissimilarity between distributions.  
In the context of classification it measures the distance between 2 _discrete_ distributions.

Consider the true probabilities and 2 estimations for data with 6 categories:

$$ \boldsymbol{y} = {\left[ 0, 1, 0, 0, 0, 0 \right]}^{T}, \; \hat{\boldsymbol{y}}_{1} = {\left[ 0.16, 0.2, 0.16, 0.16, 0.16, 0.16 \right]}^{T}, \; \hat{\boldsymbol{y}}_{2} = {\left[ 0.5, 0.4, 0.1, 0.0, 0.0, 0.0 \right]}^{T} $$

One could use the [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error) to measure the distance between the vectors (Called [Brier Score](https://en.wikipedia.org/wiki/Brier_score) in this context) as an alternative to the CE which will yield:

$$ MSE \left( \boldsymbol{y}, \hat{\boldsymbol{y}}_{1} \right) = 0.128, \; MSE \left( \boldsymbol{y}, \hat{\boldsymbol{y}}_{2} \right) = 0.103 $$

Yet, in $\hat{\boldsymbol{y}}_{2}$ which has a lower error the most probable class is not the correct one while in $\hat{\boldsymbol{y}}_{1}$ it is.  
The CE in contrast only "cares" about the error in the index of the _correct_ class and minimizes that.  
Another advantage of the CE is that minimizing it is equivalent to [_Maximum Likelihood Estimation_](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation), which ensures some useful properties.

Yet there are some empirical advantages to the MSE loss in this context as given by [Evaluation of Neural Architectures Trained with Square Loss vs Cross Entropy in Classification Tasks](https://arxiv.org/abs/2006.07322) which makes it a legitimate choice as well.

See:

 * [Cross Entropy Loss vs. MSE for Multi Class Classification](https://stats.stackexchange.com/questions/573944).
 * [Disadvantages of Using a Regression Loss Function in Multi Class Classification](https://stats.stackexchange.com/questions/568238).
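
The numbers of the comparison above can be reproduced with a short NumPy sketch. The function names `CalcMse()` and `CalcCrossEntropy()` are hypothetical helpers written for this illustration. Since $\boldsymbol{y}$ is one hot, the CE reduces to the negative log of the probability assigned to the correct class, which also avoids evaluating $\log 0$ for the zero entries of $\hat{\boldsymbol{y}}_{2}$.

```python
import numpy as np

# The vectors of the example above
vY     = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
vYHat1 = np.array([0.16, 0.2, 0.16, 0.16, 0.16, 0.16])
vYHat2 = np.array([0.5, 0.4, 0.1, 0.0, 0.0, 0.0])

def CalcMse( vY: np.ndarray, vYHat: np.ndarray ) -> float:
    # Mean of the squared differences over all categories
    return float(np.mean(np.square(vY - vYHat)))

def CalcCrossEntropy( vY: np.ndarray, vYHat: np.ndarray ) -> float:
    # For a one hot `vY` only the entry of the correct class contributes
    return float(-np.log(vYHat[np.argmax(vY)]))

mse1 = CalcMse(vY, vYHat1)
mse2 = CalcMse(vY, vYHat2)
ce1  = CalcCrossEntropy(vY, vYHat1)
ce2  = CalcCrossEntropy(vY, vYHat2)

print(f'MSE: {mse1:0.3f}, {mse2:0.3f}')
print(f'CE : {ce1:0.3f}, {ce2:0.3f}')
print(f'Most probable class: {np.argmax(vYHat1)}, {np.argmax(vYHat2)} (Correct: {np.argmax(vY)})')
```

Note that the CE depends only on the probability given to the correct class (`0.2` vs. `0.4` here), regardless of how the remaining mass is spread over the other classes.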

In [None]:
# Parameters

numSamplesTrain = 1_500
numSamplesTest  = 1_000

numImg = 3

#===========================Fill This===========================#
# 1. Set the options for the `penalty` parameter (Use: 'l1' and 'l2').
# 2. Set the options for the `C` parameter (~25 values, According to computer speed).
lPenalty    = ???
lC          = ???
#===============================================================#


## Generate / Load Data

Loading the _MNIST_ data set using SciKit Learn.


In [None]:
# Generate Data 

mX, vY = fetch_openml('mnist_784', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
vY = vY.astype(np.int_) #<! The labels are strings, convert to integer

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')

In [None]:
# Pre Processing

# The image is in the range {0, 1, ..., 255}
# We scale it into [0, 1]

#===========================Fill This===========================#
# 1. Scale the values into the [0, 1] range.
mX = ???

#===============================================================#
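
One possible way to do the scaling, shown on a small random stand-in array (`mXRaw` is a hypothetical name, not the actual MNIST data), is dividing by the maximum possible pixel value:

```python
import numpy as np

# Hypothetical stand-in for the MNIST features: values in {0, 1, ..., 255}
oRng  = np.random.default_rng(0)
mXRaw = oRng.integers(0, 256, size = (5, 784)).astype(np.float64)

mXScaled = mXRaw / 255.0 #<! Maps {0, 1, ..., 255} into [0, 1]

print(f'Range after scaling: [{mXScaled.min()}, {mXScaled.max()}]')
```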

In [None]:
# Train Test Split

#===========================Fill This===========================#
# 1. Split the data such that the Train Data has `numSamplesTrain`.
# 2. Split the data such that the Test Data has `numSamplesTest`.
# 3. The distribution of the classes must match the original data.

mXTrain, mXTest, vYTrain, vYTest = train_test_split(???)

#===============================================================#

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')
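
A minimal sketch of a stratified split with fixed sample counts, assuming the `train_size`, `test_size` and `stratify` parameters of `train_test_split()` are used. The data and the names with the `Demo` suffix are synthetic stand-ins, not the MNIST arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 10 classes
oRng   = np.random.default_rng(0)
mXDemo = oRng.random((1_000, 20))
vYDemo = oRng.integers(0, 10, size = 1_000)

numTrainDemo = 150
numTestDemo  = 100

# `stratify` keeps the class proportions of `vYDemo` in both subsets
mXTrainDemo, mXTestDemo, vYTrainDemo, vYTestDemo = train_test_split(
    mXDemo, vYDemo, train_size = numTrainDemo, test_size = numTestDemo,
    shuffle = True, stratify = vYDemo)

print(mXTrainDemo.shape, mXTestDemo.shape)
```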

### Plot Data

In [None]:
# Plot the Data

hF = PlotMnistImages(mX, vY, numImg)

### Distribution of Labels

When dealing with classification, it is important to know the balance between the labels within the data set.

In [None]:
# Distribution of Labels

hA = PlotLabelsHistogram(vY)
plt.show()

## Logistic Regression

The _logistic regression_ can be derived in many forms.  
We'll illustrate several of them.

### Derivation 001

One intuitive path is saying that we're after calculating the probability: $p \left( y = 1 \mid \boldsymbol{x} \right)$.  
Since it is a probability function it must obey some rules. The first one is being in the range $\left[ 0, 1 \right]$.  

A function which maps $\left( -\infty, \infty \right) \to \left[0, 1 \right]$ is the [Sigmoid Function](https://en.wikipedia.org/wiki/Sigmoid_function): $\sigma \left( z \right) = \frac{1}{1 + \exp \left( -z \right)}$.

So now we can say that: $p \left( y = 1 \mid \boldsymbol{x} \right) = \sigma \left( {z}_{i} \right)$.  
Now the problem is modeling the parameter ${z}_{i}$, which in the linear case is modeled as ${z}_{i} = \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i}$.  
With a linear model and the log of the Sigmoid Function as the objective, the objective function is convex in $\boldsymbol{w}_{i}$ and ${b}_{i}$:

![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Exam_pass_logistic_curve.svg/640px-Exam_pass_logistic_curve.svg.png)

* <font color='brown'>(**#**)</font> The function is convex due to the way it is formulated using the negative logarithm.  
  See [Logistic Regression - Prove the Convexity of the Loss Function](https://math.stackexchange.com/questions/1582452).
* <font color='brown'>(**#**)</font> It is guaranteed to converge only if the problem is not linearly separable.  
  In the separable scenario there is an incentive for $\left\| \boldsymbol{w} \right\|$ to get larger to sharpen the distinction between the 2 classes.  
  If $\boldsymbol{w}$ is doubled, the elements of class 1 get larger log odds and the elements of class 0 get smaller log odds.  
  In this case the parameter $\boldsymbol{w}$ won't converge, but the direction will converge. In particular, $\frac{\boldsymbol{w}}{\left\| \boldsymbol{w} \right\|}$ converges.
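
The behavior on separable data can be observed with a small experiment. The sketch below is an illustration under assumptions of its own: a hand made 2D separable data set, no bias term, and plain gradient descent with a fixed step size (`TrainLogReg()` is a hypothetical helper, not the solver used by SciKit Learn).

```python
import numpy as np

def Sigmoid( vZ: np.ndarray ) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-vZ))

def TrainLogReg( mX: np.ndarray, vY: np.ndarray, numIter: int, stepSize: float = 0.5 ) -> np.ndarray:
    # Gradient descent on the mean negative log likelihood (No bias, no regularization)
    vW = np.zeros(mX.shape[1])
    for _ in range(numIter):
        vGrad = mX.T @ (Sigmoid(mX @ vW) - vY) / mX.shape[0]
        vW   -= stepSize * vGrad
    return vW

# Linearly separable toy data: the sign of the first feature separates the classes
mXSep = np.array([[-2.0, -1.0], [-1.0, 1.0], [-3.0, 0.0], [2.0, 1.0], [1.0, -1.0], [3.0, 0.0]])
vYSep = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

vWShort = TrainLogReg(mXSep, vYSep, 200)
vWLong  = TrainLogReg(mXSep, vYSep, 2_000)

normShort = np.linalg.norm(vWShort)
normLong  = np.linalg.norm(vWLong)
cosSim    = (vWShort @ vWLong) / (normShort * normLong)

print(f'Norm after  200 iterations: {normShort:0.3f}')
print(f'Norm after 2000 iterations: {normLong:0.3f}')
print(f'Cosine similarity of the 2 directions: {cosSim:0.4f}')
```

The norm keeps growing with the number of iterations while the direction of the 2 solutions stays similar, which is the motivation for adding regularization (The `C` parameter) in practice.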

If we expand the above to multi class we'll get the [Softmax Function](https://en.wikipedia.org/wiki/Softmax_function) as in the slides.

### Derivation 002

By _Bayes Theorem_ for the $L$ classes model:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) & = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \right) } && \text{} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ \sum_{j = 1}^{L} p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right) } && \text{Expanding by the law of total probability} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) + p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right) } && \text{Expanding by the law of total probability} \\
& = \frac{ 1 }{ 1 + \frac{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)} } && \text{Dividing by $p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)$} \\
& = \frac{ 1 }{ 1 + {e}^{\log \frac{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}} } && \text{For $x \in \left( 0, \infty \right) \Rightarrow x = \exp \log x$} \\
& = \frac{ 1 }{ 1 + {e}^{-\log \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right) }} } && \text{$\log x = - \log \frac{1}{x}$} \\
\end{aligned}
$$

Now, if we model the log of the likelihood ratio of the ${L}_{i}$ label with a linear model:

$$ \log \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right) } = \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} $$

So we get:

$$ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{1}{ 1 + {e}^{- \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right)} } $$

Yet, since $1 = {e}^{- \log \frac{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}}$ the above can be written as:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{1}{ 1 + {e}^{- \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right)} }
\end{aligned}
$$
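
The result can be checked numerically in a case where the log ratio is exactly linear. The sketch below assumes a setting which is not part of the derivation above: 2 classes, a 1D feature and Gaussian class conditional densities sharing the same standard deviation. In that case $w = \frac{ {\mu}_{1} - {\mu}_{0} }{ {\sigma}^{2} }$ and $b = \frac{ {\mu}_{0}^{2} - {\mu}_{1}^{2} }{ 2 {\sigma}^{2} } + \log \frac{ {\pi}_{1} }{ {\pi}_{0} }$, where ${\pi}_{0}, {\pi}_{1}$ are the class priors. All names and values are chosen for the illustration.

```python
import numpy as np

def GaussPdf( vX: np.ndarray, mu: float, sigma: float ) -> np.ndarray:
    return np.exp(-0.5 * np.square((vX - mu) / sigma)) / (sigma * np.sqrt(2.0 * np.pi))

# Assumed model: 2 classes, shared standard deviation (Values are arbitrary)
mu0, mu1, sigma = -1.0, 2.0, 1.5
prior0, prior1  = 0.3, 0.7
vX = np.linspace(-5.0, 5.0, 11)

# Posterior of class 1 by Bayes Theorem
vNum       = GaussPdf(vX, mu1, sigma) * prior1
vDen       = vNum + GaussPdf(vX, mu0, sigma) * prior0
vPostBayes = vNum / vDen

# Posterior of class 1 by the Sigmoid of the linear model of the log ratio
w = (mu1 - mu0) / (sigma ** 2)
b = ((mu0 ** 2) - (mu1 ** 2)) / (2.0 * (sigma ** 2)) + np.log(prior1 / prior0)
vPostSigmoid = 1.0 / (1.0 + np.exp(-(w * vX + b)))

print(f'Maximum absolute difference: {np.max(np.abs(vPostBayes - vPostSigmoid))}')
```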

### Derivation 003

By _Bayes Theorem_ for the $L$ classes model:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) & = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \right) } && \text{} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ \sum_{j = 1}^{L} p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right) } && \text{Expanding by the law of total probability} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right) + \sum_{j \neq k} p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right) } && \text{} \\
& = \frac{ \frac{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)} }{ 1 + \sum_{j \neq k} \frac{p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)} } && \text{Dividing by $p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)$} \\
\end{aligned}
$$

As above, if we model the Log Likelihood Ratio by a linear function of $\boldsymbol{x}$ we'll get:

$$ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{ \exp{\left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} \right)} }{ 1 + \sum_{j \neq k} \exp{\left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} \right)}} $$

Since $1 = \exp{ \left( \log{ \frac{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)} } \right)}$ we can write:

$$ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{ \exp{\left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} \right)} }{ \sum_{j} \exp{\left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} \right)}} $$

### Derivation 004

Given $L$ classes, we can choose a reference class: ${L}_{k}$. Then define the linear model of the log likelihood ratio compared to it:

$$ \log{ \left( \frac{ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) }{ p \left( {y} = {L}_{k} \mid \boldsymbol{x} \right) } \right) } = \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} $$

By definition $p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) }$

Then:

$$
\begin{aligned}
1 - p \left( y = {L}_{k} \mid \boldsymbol{x} \right) & = \sum_{j \neq k} p \left( y = {L}_{j} \mid \boldsymbol{x} \right) && \text{} \\
& = \sum_{j \neq k} p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) } && \text{Since $p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) }$} \\
& = p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) } && \text{} \\
& \Rightarrow p \left( y = {L}_{k} \mid \boldsymbol{x} \right) = \frac{1}{1 + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} \\
& \Rightarrow p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{1 + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} && \text{}
\end{aligned}
$$

Since $1 = \exp{\left( \log{ \frac{ p \left( y = {L}_{k} \mid \boldsymbol{x} \right) }{ p \left( y = {L}_{k} \mid \boldsymbol{x} \right) } } \right)}$ we can write:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) & = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{1 + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} \\
& = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{\exp{ \left( \boldsymbol{w}_{k}^{T} \boldsymbol{x} + {b}_{k} \right) } + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} \\
& = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{ \sum_{j} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }}
\end{aligned}
$$
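
The equivalence between the reference class form (With the `1 +` in the denominator) and the Softmax form can be verified numerically. The sketch below uses random weights and a hypothetical reference index `refIdx`. Subtracting the logit of the reference class from all logits sets its own logit to zero, hence its term in the sum becomes $\exp \left( 0 \right) = 1$.

```python
import numpy as np

def Softmax( vZ: np.ndarray ) -> np.ndarray:
    vE = np.exp(vZ - np.max(vZ)) #<! Subtracting the maximum for numerical stability
    return vE / np.sum(vE)

oRng = np.random.default_rng(0)
numClasses, numFeatures, refIdx = 4, 3, 0

mW = oRng.standard_normal((numClasses, numFeatures)) #<! Random weights (Illustration only)
vB = oRng.standard_normal(numClasses)
vX = oRng.standard_normal(numFeatures)

vZ = mW @ vX + vB
vP = Softmax(vZ) #<! Softmax form

# Reference class form: Logits relative to the reference class
vZRef  = vZ - vZ[refIdx]
vMask  = np.arange(numClasses) != refIdx
denRef = 1.0 + np.sum(np.exp(vZRef[vMask]))
vPRef  = np.exp(vZRef) / denRef

print(vP)
print(vPRef)
```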

### Summary

While there are many ways to derive the logistic regression (for instance, also by assuming Binomial Distribution), the main motivation is its numerical properties.  
Namely being convex with an easy to calculate gradient.

* <font color='brown'>(**#**)</font> The main motivation for _Logistic Regression_ based model is the natural expandability of the probability of the prediction.
* <font color='brown'>(**#**)</font> The lecture slides derive the [Multinomial Logistic Regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression).
* <font color='brown'>(**#**)</font> The first "Deep Learning" models were actually chaining many _Logistic Regression_ layers.
* <font color='brown'>(**#**)</font> Most classification layers in Deep Learning models are basically _Logistic Regression_.
* <font color='brown'>(**#**)</font> The _Logistic Regression_ model is linear, yet can be extended by _Polynomial Transform_.  
  See [Logistic Regression for Non Linear Separable Data](https://datascience.stackexchange.com/questions/21896).
* <font color='brown'>(**#**)</font> The concept of Logistic Regression can also be used as pure regression for continuous data bounded in the range $\left[ a, b \right]$.

### Manual Grid Search Hyper Parameter Optimization

1. Create a data frame with 3 columns:
  - `Penalty` - The value of the `penalty` parameter.
  - `C` - The value of `C` parameter.
  - `ROC AUC` - The value of `roc_auc_score()` of the model.
   
   The number of rows should match the number of combinations.
2. Iterate over all combinations and measure the score on the test set.
3. Plot a heatmap (2D) for the combinations of hyper parameters and the resulting AUC.
4. Extract the best model.

In [None]:
# Creating the Data Frame

#===========================Fill This===========================#
# 1. Calculate the number of combinations.
# 2. Create a nested loop to create the combinations between the parameters.
# 3. Store the combinations as the columns of a data frame.

# For Advanced Python users: Use `itertools` to create the Cartesian product
numComb = ???
dData   = ???

for ii, paramPenalty in enumerate(lPenalty):
    for jj, paramC in enumerate(lC):
        ?????
#===============================================================#

dfModelScore = pd.DataFrame(data = dData)
dfModelScore
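
For reference, a sketch of the Cartesian product approach on a small hypothetical grid. The lists `lPenaltyDemo` and `lCDemo` are illustrative values only, not the values expected by the exercise:

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical small grid (Illustration only)
lPenaltyDemo = ['l1', 'l2']
lCDemo       = list(np.logspace(-2, 2, 5))

numCombDemo = len(lPenaltyDemo) * len(lCDemo)

# Each element is a tuple: (penalty, C)
lComb = list(itertools.product(lPenaltyDemo, lCDemo))

dfGrid = pd.DataFrame(lComb, columns = ['Penalty', 'C'])
dfGrid['ROC AUC'] = np.nan #<! To be filled by the evaluation loop

print(dfGrid)
```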



In [None]:
# Optimize the Model

#===========================Fill This===========================#
# 1. Iterate over each row of the data frame `dfModelScore`.  
#    Each row defines a set of hyper parameters to evaluate.
# 2. Construct the model.
# 3. Train it on the Train Data Set.
# 4. Calculate its AUC ROC score on the train set, save it to the `ROC AUC Train`.
# 5. Calculate its AUC ROC score on the test set, save it to the `ROC AUC Test`.
# !! Make sure to choose the `saga` solver as it is the only one supporting all the regularization options.
# !! Set the parameter `tol` to ~5e-3 to ensure convergence in a reasonable time.
# !! Set the parameter `max_iter` to a high value (10_000 or so) to ensure convergence of the model.

for ii in range(numComb):
    paramPenalty    = ???
    paramC          = ???

    if paramPenalty == 'None':
        paramPenalty = None

    print(f'Processing model {ii + 1:03d} out of {numComb} with `penalty` = {paramPenalty} and `C` = {paramC}.')

    oLogRegCls = ???
    oLogRegCls = ???

    dfModelScore.loc[ii, 'ROC AUC Train']   = roc_auc_score(???)
    dfModelScore.loc[ii, 'ROC AUC Test']    = roc_auc_score(???)
#===============================================================#

* <font color='brown'>(**#**)</font> If one watches the timing of the iterations above, one will see that higher regularization (Smaller `C`) is also faster to calculate as the weights are less "crazy".
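
The evaluation loop can be sketched on a small synthetic problem. The data generated by `make_classification()`, the names and the tiny grid below are assumptions made for the illustration, not the MNIST setup. For multi class labels `roc_auc_score()` requires the class probabilities and an explicit `multi_class` strategy, here `'ovr'` is used as one possible choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3 classes problem (Stand-in for the real data)
mXDemo, vYDemo = make_classification(n_samples = 300, n_features = 10, n_informative = 4,
                                     n_classes = 3, n_clusters_per_class = 1, class_sep = 2.0,
                                     random_state = 0)
mXTr, mXTe, vYTr, vYTe = train_test_split(mXDemo, vYDemo, test_size = 0.3, stratify = vYDemo, random_state = 0)

dScore = {}
for paramPenalty in ['l1', 'l2']:
    for paramC in [0.1, 1.0]:
        oCls = LogisticRegression(penalty = paramPenalty, C = paramC, solver = 'saga', tol = 5e-3, max_iter = 10_000)
        oCls = oCls.fit(mXTr, vYTr)
        # The multi class AUC is calculated from the predicted probabilities
        aucTrain = roc_auc_score(vYTr, oCls.predict_proba(mXTr), multi_class = 'ovr')
        aucTest  = roc_auc_score(vYTe, oCls.predict_proba(mXTe), multi_class = 'ovr')
        dScore[(paramPenalty, paramC)] = (aucTrain, aucTest)

for tuKey, (aucTrain, aucTest) in dScore.items():
    print(f'{tuKey}: Train AUC = {aucTrain:0.3f}, Test AUC = {aucTest:0.3f}')
```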

In [None]:
# Display the Results
# Display the results sorted (Test).
# Pandas allows sorting data by any column using the `sort_values()` method.
# The `head()` allows us to see only the first values.
dfModelScore.sort_values(by = ['ROC AUC Test'], ascending = False).head(10)

In [None]:
# Display the Results
# Display the results sorted (Train).
dfModelScore.sort_values(by = ['ROC AUC Train'], ascending = False).head(10)

* <font color='red'>(**?**)</font> Can you see cases of Under / Over Fit?

In [None]:
# Train Data ROC AUC Heatmap
# Plotting the Train Data ROC AUC as a Heat Map.
# We can pivot the data set created to have a 2D matrix of the ROC AUC as a function of `C` and the `Penalty`.

hA = sns.heatmap(data = dfModelScore.pivot(index = 'C', columns = 'Penalty', values = 'ROC AUC Train'), robust = True, linewidths = 1, annot = True, fmt = '0.2%', norm = LogNorm())
hA.set_title('ROC AUC of the Train Data')
plt.show()

In [None]:
# Test Data ROC AUC Heatmap
# Plotting the Test Data ROC AUC as a Heat Map.
# We can pivot the data set created to have a 2D matrix of the ROC AUC as a function of `C` and the `Penalty`.

hA = sns.heatmap(data = dfModelScore.pivot(index = 'C', columns = 'Penalty', values = 'ROC AUC Test'), robust = True, linewidths = 1, annot = True, fmt = '0.2%', norm = LogNorm())
hA.set_title('ROC AUC of the Test Data')
plt.show()

In [None]:
# Extract the Optimal Hyper Parameters

#===========================Fill This===========================#
# 1. Extract the index of row of the maximum value of `ROC AUC Test`.
# 2. Use the index of the row to extract the hyper parameters which were optimized.
# !! You may find the `idxmax()` method of a Pandas data frame useful.

idxArgMax = ???
#===============================================================#

optimalPenalty  = dfModelScore.loc[idxArgMax, 'Penalty']
optimalC        = dfModelScore.loc[idxArgMax, 'C']

print(f'The optimal hyper parameters are: `penalty` = {optimalPenalty}, `C` = {optimalC}')
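
A minimal sketch of the `idxmax()` approach on a made up score table (The values in `dfDemo` are invented for the illustration and are not results of the model):

```python
import pandas as pd

# Made up scores (Illustration only)
dfDemo = pd.DataFrame({
    'Penalty'      : ['l1', 'l1', 'l2', 'l2'],
    'C'            : [0.1, 1.0, 0.1, 1.0],
    'ROC AUC Test' : [0.91, 0.95, 0.93, 0.94],
})

idxBest     = dfDemo['ROC AUC Test'].idxmax() #<! Index label of the row with the maximum value
bestPenalty = dfDemo.loc[idxBest, 'Penalty']
bestC       = dfDemo.loc[idxBest, 'C']

print(f'Best row: {idxBest}, `penalty` = {bestPenalty}, `C` = {bestC}')
```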



### Optimal Model

In this section we'll extract the best model and retrain it on the whole data (`mX`).  
We need to export the model which has the best Test values.

In [None]:
# Construct the Optimal Model & Train on the Whole Data

#===========================Fill This===========================#
# 1. Construct the logistic regression model. Use the same `tol`, `solver` and `max_iter` as above.
# 2. Fit the model on the whole data set (mX).
oLogRegCls = ???
oLogRegCls = ???
#===============================================================#

In [None]:
# Model Score (Accuracy)

print(f'The model score (Accuracy) is: {oLogRegCls.score(mX, vY):0.2%}.')

* <font color='red'>(**?**)</font> Does it match the results above? Why?

## Explain / Interpret the Model

Linear models, which work mostly on correlation, are relatively easy to interpret / explain.  
In this section we'll show how to interpret the weights of the classifier.

In [None]:
# Extract the Weights of the Classes

#===========================Fill This===========================#
# 1. Extract the weights of the model using the `coef_` attribute.
mW = ??? #<! The model weights (Without the biases)
#===============================================================#

print(f'The coefficients / weights matrix has the dimensions: {mW.shape}')

Since the weights basically match each pixel of the input image (As a vector), we can display them as an image.

In [None]:
# Plot the Weights as Images

#===========================Fill This===========================#
# 1. Convert the weights into the form of an image.
# 2. Plot it using `imshow()` of Matplotlib.
# !! You may use `PlotMnistImages()` to do this for you, look at its code.

?????
#===============================================================#
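
One possible sketch of the reshaping and plotting, assuming the weights matrix has the shape `(10, 784)` (One row per class, one column per pixel of a 28x28 image). The random matrix `mWDemo` is a hypothetical stand-in for the `coef_` attribute of a trained model:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for `coef_` of a model trained on MNIST
oRng   = np.random.default_rng(0)
mWDemo = oRng.standard_normal((10, 784))

# Each row (784 weights) becomes a 28x28 image
tW = np.reshape(mWDemo, (10, 28, 28))

hF, vHa = plt.subplots(nrows = 2, ncols = 5, figsize = (10, 4))
for ii, hA in enumerate(vHa.flat):
    hA.imshow(tW[ii], cmap = 'gray')
    hA.set_title(f'Class {ii}')
    hA.axis('off')
```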

* <font color='red'>(**?**)</font> Could you explain the results and how the model works?
* <font color='brown'>(**#**)</font> Usually, for linear models, it is important to have zero mean features.
* <font color='blue'>(**!**)</font> Run the above using the `StandardScaler()` as part of the pipeline (Don't alter the images themselves!)
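
A sketch of the pipeline approach on synthetic data, assuming the `Pipeline` and `StandardScaler` classes of SciKit Learn. The data, the step names (`'Scaler'`, `'Classifier'`) and the hyper parameter values are illustrative choices. The scaler is fitted inside the pipeline, so the input array itself is left unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with features in [0, 1] (Stand-in for the images)
oRng   = np.random.default_rng(0)
mXDemo = oRng.random((200, 16))
vYDemo = (mXDemo[:, 0] + mXDemo[:, 1] > 1.0).astype(int)

oPipe = Pipeline([
    ('Scaler', StandardScaler()), #<! Zero mean, unit variance per feature
    ('Classifier', LogisticRegression(C = 1.0, solver = 'saga', tol = 5e-3, max_iter = 10_000)),
])
oPipe = oPipe.fit(mXDemo, vYDemo)

mWPipe = oPipe.named_steps['Classifier'].coef_ #<! Weights with respect to the scaled features

print(f'Accuracy on the training data: {oPipe.score(mXDemo, vYDemo):0.2%}')
print(f'The weights matrix shape: {mWPipe.shape}')
```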