![](https://i.imgur.com/qkg2E2D.png)

# UnSupervised Learning Methods

## Exercise 003 - Part II

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 28/05/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/UnSupervisedLearningMethods/2023_08/Exercise0003Part002.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml, load_breast_cancer, load_digits, load_iris, load_wine

# Computer Vision

# Miscellaneous
import os
import math
from platform import python_version
import random
import time
import urllib.request

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

DATA_FILE_URL   = r'None'
DATA_FILE_NAME  = r'None'

T_MNIST_IMG_SIZE = (28, 28)

TOTAL_RUN_TIME = 30 #<! Don't touch it!


In [None]:
# Auxiliary Functions

def BalancedSubSample( dfX: pd.DataFrame, colName: str, numSamples: int ):
    
    # TODO: Validate the number of samples
    # TODO: Validate the column name (Existence and categorical values)
    return dfX.groupby(colName, as_index = False, group_keys = False).apply(lambda dfS: dfS.sample(numSamples, replace = False))

## Guidelines

 - Fill the full names and ID's of the team members in the `Team Members` section.
 - Answer all questions / tasks within the Jupyter Notebook.
 - Use MarkDown + MathJaX + Code to answer.
 - Verify the rendering on VS Code.
 - Submission in groups (Single submission per group).
 - You may and _should_ use the forums for questions.
 - Don't use `pip install` on the submitted notebook!  
   If you need a package that is not imported make it clear by a comment.
 - Good Luck!

<font color='red'>Total run time must be **less than `TOTAL_RUN_TIME` seconds**</font>.

In [None]:
# Run Time
print(f'The total run time must not exceed: {TOTAL_RUN_TIME} [Sec]')
startTime = time.time()

## Team Members

 - `<FULL>_<NAME>_<ID001>`.
 - `<FULL>_<NAME>_<ID002>`.

* <font color='brown'>(**#**)</font> The `Import Packages` section above imports most of the tools needed for the work. Please use it.
* <font color='brown'>(**#**)</font> You may replace the suggested functions to use with functions from other packages.
* <font color='brown'>(**#**)</font> Anything you are not explicitly asked to implement may be taken from 3rd party packages.
* <font color='brown'>(**#**)</font> The total run time of this notebook must be **lower than `TOTAL_RUN_TIME` seconds**.

In [None]:
# Students Packages to Import
# If you need a package not listed above, use this cell
# Do not use `pip install` in the submitted notebook



## Generate / Load Data

In [None]:
# Download Data
# This section downloads data from the given URL if needed.

if (DATA_FILE_NAME != 'None') and (not os.path.exists(DATA_FILE_NAME)):
    urllib.request.urlretrieve(DATA_FILE_URL, DATA_FILE_NAME)

## 3. PCA

### 3.1. PCA Algorithm

In this section we'll implement a SciKit Learn API compatible class for the PCA.  
The class should implement the following methods:

1. `__init__()` - The object constructor, parameterized by the encoder dimension.  
2. `fit()` - Given a data set builds the encoder / decoder.  
3. `transform()` - Applies the encoding on the input data.  
4. `inverse_transform()` - Applies the decoding on the input data.  

* <font color='brown'>(**#**)</font> You may use the [SciKit Learn's PCA module](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) as a reference.
* <font color='brown'>(**#**)</font> Both encoding and decoding are applied as out of sample encoding / decoding.
* <font color='brown'>(**#**)</font> Pay attention to data structure (`N x D`).
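
To make the `N x D` convention concrete, here is a minimal sketch on toy data, assuming an SVD based construction. The function name `FitPcaSketch` and the random data are hypothetical, and the sketch does not address choosing the calculation by the dimensions of `mX`.

```python
import numpy as np

def FitPcaSketch(mX: np.ndarray, d: int):
    # Hypothetical helper (Not part of the exercise API).
    # mX has shape N x D: samples along rows, features along columns.
    vMean = np.mean(mX, axis = 0)                                #<! Shape (D,)
    _, _, mVh = np.linalg.svd(mX - vMean, full_matrices = False) #<! Rows of mVh are the principal directions
    mUd = mVh[:d].T                                              #<! Shape D x d, orthonormal columns
    return vMean, mUd

rng = np.random.default_rng(0)
mX  = rng.standard_normal((100, 5)) #<! Toy data, N = 100, D = 5

vMean, mUd = FitPcaSketch(mX, 2)
mZ  = (mX - vMean) @ mUd     #<! Encoding, shape N x d
mXr = (mZ @ mUd.T) + vMean   #<! Decoding, shape N x D

print(mZ.shape, mXr.shape)
```

Since the samples are rows, the encoder multiplies from the right, which is the transpose of a formulation where the samples are columns.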


In [None]:
class PCA:
    def __init__(self, d: int = 2):
        '''
        Constructing the object.
        Args:
            d - Number of dimensions of the encoder output.
        '''
        #===========================Fill This===========================#
        # 1. Keep the model parameters.

        ?????
        #===============================================================#
        
    def fit(self, mX: np.ndarray):
        '''
        Fitting model parameters to the input.
        Args:
            mX - Input data with shape N x D.
        Output:
            self
        '''
        #===========================Fill This===========================#
        # 1. Build the model encoder.
        # 2. Build the model decoder.
        # 3. Optimize calculation by the dimensions of `mX`.
        # !! You may find `scipy.sparse.linalg.svds()` useful.
        # !! You may find `scipy.sparse.linalg.eigsh()` useful.

        ?????
        #===============================================================# 
        
        return self
    
    def transform(self, mX: np.ndarray) -> np.ndarray:
        '''
        Applies (Out of sample) encoding
        Args:
            mX - Input data with shape N x D.
        Output:
            mZ - Low dimensional representation (embeddings) with shape N x d.
        '''
        #===========================Fill This===========================#
        # 1. Encode data using the model encoder.
        # return (mX - np.atleast_1d(self.vMean)) @ self.mUd
        
        ?????
        #===============================================================#

        return mZ
    
    def inverse_transform(self, mZ: np.ndarray) -> np.ndarray:
        '''
        Applies (Out of sample) decoding
        Args:
            mZ - Low dimensional representation (embeddings) with shape N x d.
        Output:
            mX - Reconstructed data with shape N x D.
        '''
        #===========================Fill This===========================#
        # 1. Decode data using the model decoder.
        # return (mZ @ self.mUd.T) + np.atleast_1d(self.vMean)
        
        ?????
        #===============================================================#

        return mX


* <font color='red'>(**?**)</font> In the class we use _out of sample_ encoding / decoding. What if we use the same `mX` for training and the encoding?  
Make sure to understand this before proceeding.

### 3.2. PCA Application

In this section the PCA (Using the above class) will be applied on several data sets:

 * Breast Cancer Dataset - Loaded using `load_breast_cancer()`.
 * Digits Dataset - Loaded using `load_digits()`.
 * Iris Dataset - Loaded using `load_iris()`.
 * Wine Dataset - Loaded using `load_wine()`.

For each data set:

1. Make yourself familiar with the data set:
    * How many features are there ($D$).
    * How many samples are there ($N$).
    * Do all features have the same unit?
2. Apply a Pre Process Step  
   In ML, usually, if the features do not have the same unit they are normalized.  
   Namely, each feature is shifted and scaled to have zero mean and unit standard deviation.   
   Write a function to normalize input data.
3. Apply the PCA  
   Set `d` to be visualization friendly and apply PCA from $D$ to $d$.  
   The obtained low dimensional data is represented by $\boldsymbol{Z} \in \mathbb{R}^{d \times N}$.
4. Plot Low Dimensional Data  
   Make a scatter plot of $\boldsymbol{Z} \in \mathbb{R}^{d \times N}$ and color the data points according to the data labels.  
   For each data set show result with the normalization step and without it.
5. Calculate Lost Energy  
   For each plot, show the value of ${\left\| \tilde{\boldsymbol{X}} - \boldsymbol{X} \right\|}_{F}^{2}$.  
   Do this by applying `inverse_transform()` on the low dimensional data and calculate the norm.


* <font color='brown'>(**#**)</font> Pay attention to the difference in dimensions of the data to the derived Math formulations.
* <font color='brown'>(**#**)</font> The output should be a plot for each data set (With and without normalization). Show them in a single figure using sub plots.
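
As a sanity check for the lost energy step, the sketch below uses toy data and assumes an SVD based PCA with the samples along rows. It compares the squared Frobenius norm of the reconstruction error with the sum of the squared discarded singular values of the centered data. The data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data, N x D, with a different scale per feature
mX = rng.standard_normal((200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
d  = 2

vMean = np.mean(mX, axis = 0)
_, vS, mVh = np.linalg.svd(mX - vMean, full_matrices = False)
mUd = mVh[:d].T #<! D x d

mZ    = (mX - vMean) @ mUd    #<! N x d (The transpose of Z in the math notation)
mXRec = (mZ @ mUd.T) + vMean  #<! N x D

lostEnergy = np.linalg.norm(mXRec - mX, 'fro') ** 2
print(lostEnergy, np.sum(vS[d:] ** 2)) #<! The two values should agree up to numerical precision
```

The norm is the same whether the data is stored as `N x D` or `D x N`, so only the encoding / decoding steps need care with the transpose.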

In [None]:
#===========================Fill This===========================#
# 1. Implement the normalization function.
# !! Make sure to address the remark.

def NormalizeData(mX: np.ndarray) -> np.ndarray:
    '''
    Normalize data so each feature has zero mean and unit standard deviation.
    Args:
        mX  - Input data with shape N x D.
    Output:
        mY  - Output data with shape N x D.
    Remarks:
        - Features with zero standard deviation are not scaled (Only centered).
    '''

    ????

    return mY
#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Set parameter `d`.
# 2. Load each data set.
# 3. Apply PCA to each data set with and without normalization.
# 4. Display results as scatter data.
# !! The figure should be 2 x numDataSets.

????

#===============================================================#

### 3.3. Question

In the above, why are the results of the normalized and non normalized data different?  
Address the geometry of the results and the value of the reconstruction error.

### 3.3. Solution

<font color='red'>??? Fill the answer here ???</font>

---

## 4. Image Denoising

In this section the PCA algorithm will be used for denoising images from the [MNIST Dataset](https://en.wikipedia.org/wiki/MNIST_database).  
In this section:

 1. Load Data  
    Load the MNIST data set and sub sample it.  
    We'll have a perfectly balanced data set.
    The data will be in `mX` and labels in `vY`.  
    This is already implemented.
 2. Add Noise  
    We'll add noise to the data.  
    The noise of the data will be modeled as a Poisson Noise (Also known as [_Shot Noise_](https://en.wikipedia.org/wiki/Shot_noise)).  
    The _Shot Noise_ is a classic model of noise gathered by imaging sensors.  
    This is already implemented.
 3. Analyze the Data  
    Analyze the spectrum of the data and choose an appropriate set of parameters for denoising.
 4. Apply Denoising  
    Apply denoising by utilizing the PCA algorithm.
 5. Analyze Result  
    Show the results as a function of the parameters.
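
To get a feel for the noise step, the following sketch adds Poisson noise with a fixed rate to a toy array and clips the result, in the spirit of the noise cells below. The array `mI` is a hypothetical stand in for image rows, and the generator API used here is a choice of this sketch rather than the one used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(2)
λ   = 35

mI = rng.integers(0, 256, size = (4, 784)) #<! Hypothetical stand in for 4 flattened images
mN = rng.poisson(λ, size = mI.shape)       #<! Same rate for every pixel
mINoisy = np.minimum(mI + mN, 255)         #<! Keep values in the {0, 1, ..., 255} range

# A Poisson variable has mean and variance equal to its rate
vN = rng.poisson(λ, size = 100_000)
print(np.mean(vN), np.var(vN))
```

Note that such noise is not zero mean: on average it raises each pixel by about `λ` before the clipping.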

In [None]:
# Parameters
numSamplesClass = 600
λ               = 35

In [None]:
# Load Data
# If you get SSL error, uncomment the following 2 lines
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context
dfX, dfY = fetch_openml(name = 'mnist_784', version = 1, return_X_y = True, as_frame = True, parser = 'auto')


In [None]:
# Sub Sample Data
dfData = pd.concat((dfX, dfY), axis = 1)

# Balanced Sub Sample
# End Result: 'numSamplesClass' samples per digit
dfData = BalancedSubSample(dfData, 'class', numSamplesClass)
vY = dfData['class'].to_numpy(dtype = np.uint8)
mX = dfData.drop(columns = ['class']).to_numpy()

In [None]:
# Add Poisson Noise
mN = np.random.poisson(λ, size = mX.shape) #<! Noise samples

In [None]:
# Add Noise
# Make sure values are in {0, 1, 2, ..., 255} range
mXRef = mX.copy() #<! Reference with no noise
mXRef = mXRef / 255

mX += mN
mX = np.minimum(mX, 255)


In [None]:
# Show Samples

lIdx = [np.flatnonzero(vY == ii)[0] for ii in range(10)]

_, mHA = plt.subplots(1, 10, figsize = (16, 4))
for ii, hA in enumerate(mHA.flat):
    idx = lIdx[ii]
    mI  = np.reshape(mX[idx], T_MNIST_IMG_SIZE)
    # mI  = np.clip(mI, 0, 1)
    hA.imshow(mI, cmap = 'gray')
    hA.axis('off')
    
plt.tight_layout()
plt.show()

### 4.1. The Data Spectrum

In this section:

 1. Pre Process the data (Optional).  
    Do this step if you think it is needed.
 2. Plot the Spectrum of the Eigen Values of the data.
 3. Choose **a range** (5 values) of `d` for the low dimensionality reduction.
 4. For each `d` value, calculate the **relative energy loss**.
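
One possible reading of the spectrum and the relative energy loss is sketched below on toy data. It assumes the eigen values are those of the sample covariance of the centered data, and that the relative energy loss for a given `d` is the fraction of the total eigen value sum which is not captured by the first `d` components. Both are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
mX  = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 8)) #<! Toy data, N x D

mXc = mX - np.mean(mX, axis = 0)
vS  = np.linalg.svd(mXc, compute_uv = False)   #<! Singular values, descending order
vEigVal = (vS ** 2) / (mX.shape[0] - 1)        #<! Eigen values of the sample covariance

vRelEnergy = np.cumsum(vEigVal) / np.sum(vEigVal)
vRelLoss   = 1.0 - vRelEnergy #<! vRelLoss[d - 1] is the loss when keeping d components

for d in (1, 2, 4, 8):
    print(d, vRelLoss[d - 1])
```

The loss never increases with `d` and vanishes, up to numerical precision, when `d` equals the number of features.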


In [None]:
#===========================Fill This===========================#
# 1. Pre Process Data (Optional).
# !! Make sure to keep the name of the data `mX`.
# !! Don't change the order of the data so it matches `vY`.

?????

#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Calculate the spectrum of the Eigen Values of the data.

????

#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Display the Spectrum.
# !! You may show both the spectrum and the relative energy.

????

#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Choose a range of `d` values.
# 2. Per `d` plot / display the relative energy loss.
# !! Don't choose too many, keep running time and visualization reasonable.
# !! The choice should be in order to show the effect of `d` on the results and not only the optimal `d`.

????

#===============================================================#

### 4.2. PCA Based Denoising

In this section, per `d` value:

 1. Build the _Encoder_ and _Decoder_. 
 2. Denoise the images listed in the index list `lIdx`.
 3. Show results per `d`
      * For each image show the reconstruction error vs. the noisy sample (`mX`).
      * For each image show the estimation error vs. the non noisy sample (`mXRef`).

* <font color='brown'>(**#**)</font> Be clear about when you use the whole data (`mX`) and when you use only the sub set to analyze.
* <font color='brown'>(**#**)</font> For the PCA you may only use `mX`.
* <font color='brown'>(**#**)</font> The output should be 10 images per row, where the number of rows is the number of `d` values + 2 (For the reference / noisy images).
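
The difference between the two errors can be illustrated on synthetic data, as in the sketch below. It assumes low rank clean data with additive Gaussian noise, which is not the noise model of this exercise, and all names and sizes are hypothetical. The PCA is fitted on the noisy data only, and the per sample errors are measured against both the noisy and the clean samples.

```python
import numpy as np

rng = np.random.default_rng(4)
numSamples, dataDim, latentDim = 500, 50, 3

# Toy model: clean data on a 3 dimensional subspace + Gaussian noise
mXClean = rng.standard_normal((numSamples, latentDim)) @ rng.standard_normal((latentDim, dataDim))
mXNoisy = mXClean + rng.standard_normal((numSamples, dataDim))

# Fit using the noisy data only
vMean = np.mean(mXNoisy, axis = 0)
_, _, mVh = np.linalg.svd(mXNoisy - vMean, full_matrices = False)

lIdx = [0, 1, 2] #<! Sub set of samples to analyze
dRecErr, dEstErr = {}, {}
for d in (1, 3, 20):
    mUd   = mVh[:d].T
    mXDen = ((mXNoisy[lIdx] - vMean) @ mUd) @ mUd.T + vMean           #<! Encode, then decode
    dRecErr[d] = np.sum((mXDen - mXNoisy[lIdx]) ** 2, axis = 1)       #<! Vs. the noisy samples
    dEstErr[d] = np.sum((mXDen - mXClean[lIdx]) ** 2, axis = 1)       #<! Vs. the clean samples
    print(d, dRecErr[d], dEstErr[d])
```

The reconstruction error can only decrease as `d` grows, since a larger subspace contains the smaller one. The estimation error has no such guarantee, which is worth keeping in mind when choosing `d`.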

In [None]:
#===========================Fill This===========================#
# 1. Build the encoder / decoder using the `PCA` class above.
# 2. Per `d` denoise the images in `lIdx`.
# !! Only use `mX` for the PCA step.

????
    


#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Create a subplot of `len(d) + 2 x 10` plots.
# 2. In the 1st row, show the clean images (`mXRef`).
# 3. In the 2nd row, show the noisy images (`mX`).
# 4. In the next rows show the sample per different `d`.  
#    Per row, show `d`.

????

#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Create 2 sub plots where the `x` is the image index {0, 1, ..., 9}.
# 2. The 1st plot, per `d`, shows the reconstruction error.
# 3. The 2nd plot, per `d`, shows the estimation error.

????

#===============================================================#

### 4.3. Question

Address the following remarks:

 - How does the noise model affect the performance of the denoising?  
   Specifically, if the noise model was Gaussian with the same variance, what would change?
 - Would you use the reconstruction error as an estimation of the estimation error?  
   Answer in general and specifically for Images.
 - Explain the idea behind the PCA denoising.  
   Specifically address the trade off between small and large values of `d`.
 - If the data was 1D (Audio instead of Image), would you expect it to perform better?  
   Think if the model has any knowledge about the data being 2D.

### 4.3. Solution

<font color='red'>??? Fill the answer here ???</font>

---

### 4.4. PCA Denoising with Labels

In the above we used no knowledge of the labels of the images.  
In this section you should use the label information in order to improve the results.

 1. Write code which takes advantage of the labels `vY` (Be creative).
 2. Show the plots of the reconstruction and estimation error.
 3. Explain, in words, your idea.
 4. Explain, in words, the results.

In [None]:
#===========================Fill This===========================#
# 1. Choose the maximum `d` used in the previous section.
# 2. Apply PCA Denoising on the list of images.

#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Display the reconstruction and estimation error per image.
# 2. Compare to the previous result for the same `d`.

#===============================================================#

### 4.4.3. Solution

<font color='red'>??? Fill the answer here ???</font>

---

### 4.4.4. Solution

<font color='red'>??? Fill the answer here ???</font>

---

In [None]:
# Run Time
# Check Total Run Time.
# Don't change this!

endTime = time.time()

totalRunTime = endTime - startTime
print(f'Total Run Time: {totalRunTime} [Sec].')

if (totalRunTime > TOTAL_RUN_TIME):
    raise ValueError(f'You have exceeded the allowed run time as {totalRunTime} > {TOTAL_RUN_TIME}')