![](https://i.imgur.com/qkg2E2D.png)

# UnSupervised Learning Methods

## Exercise 003 - Part III

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.001 | 11/06/2023 | Royi Avital | Fixed questions numbering                                          |
| 0.1.000 | 23/05/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/UnSupervisedLearningMethods/2023_03/Exercise0002Part002.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml, load_breast_cancer, load_digits, load_iris, load_wine, make_s_curve

# Computer Vision

# Miscellaneous
import os
import math
from platform import python_version
import random
import time
import urllib.request

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

DATA_FILE_URL   = r'None'
DATA_FILE_NAME  = r'None'

T_MNIST_IMG_SIZE = (28, 28)


In [None]:
# Auxiliary Functions

def BalancedSubSample( dfX: pd.DataFrame, colName: str, numSamples: int ):
    
    # TODO: Validate the number of samples
    # TODO: Validate the column name (Existence and categorical values)
    return dfX.groupby(colName, as_index = False, group_keys = False).apply(lambda dfS: dfS.sample(numSamples, replace = False))

## Guidelines

 - Fill the full names and IDs of the team members in the `Team Members` section.
 - Answer all questions / tasks within the Jupyter Notebook.
 - Use MarkDown + MathJaX + Code to answer.
 - Verify the rendering in VS Code.
 - Submission in groups (Single submission per group).
 - You may and _should_ use the forums for questions.
 - Good Luck!

* <font color='brown'>(**#**)</font> The `Import Packages` section above imports most of the tools needed for the work. Please use it.
* <font color='brown'>(**#**)</font> You may replace the suggested functions with functions from other packages.
* <font color='brown'>(**#**)</font> Anything not explicitly required to be implemented may be taken from 3rd party packages.
* <font color='brown'>(**#**)</font> The total run time of this notebook must be **lower than 60 [Sec]**.

## Team Members

 - `<FULL>_<NAME>_<ID001>`.
 - `<FULL>_<NAME>_<ID002>`.

## Generate / Load Data

In [None]:
# Download Data
# This section downloads data from the given URL if needed.

if (DATA_FILE_NAME != 'None') and (not os.path.exists(DATA_FILE_NAME)):
    urllib.request.urlretrieve(DATA_FILE_URL, DATA_FILE_NAME)

## 5. Kernel PCA (K-PCA / KPCA)

### 5.1. Kernel PCA Algorithm

In this section we'll implement a SciKit Learn API compatible class for the Kernel PCA.  
The class should implement the following methods:

1. `__init__()` - The object constructor.  
   The input includes the encoder dimension `d` and a callable function for the kernel.
2. `fit()` - Given a data set builds the encoder.  
3. `transform()` - Applies the encoding on the input data.   

* <font color='brown'>(**#**)</font> You may use the [SciKit Learn's Kernel PCA module](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) as a reference.
* <font color='brown'>(**#**)</font> The encoding is applied as an out of sample encoding.
* <font color='brown'>(**#**)</font> Pay attention to data structure (`N x D`).
* <font color='brown'>(**#**)</font> You may assume the kernel function `k` ($ k : \, \mathbb{R}^{D} \times \mathbb{R}^{D} \to \mathbb{R} $) has the following signature:

```python
def k(mX1: np.ndarray, mX2: np.ndarray) -> np.ndarray:
    '''
    Computes the kernel function between two sets of vectors.
    Args:
        mX1 - Input data with shape N1 x D.
        mX2 - Input data with shape N2 x D.
    Output:
        mKx - Output kernel matrix with shape N1 x N2.
    '''
```
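
To make the signature concrete, here is a minimal sketch of a kernel that follows it. The Laplacian kernel, the function name `KernelLaplacian` and the parameter `paramGamma` are illustrative assumptions; this kernel is not one of those required by the exercise.

```python
import numpy as np
from scipy.spatial.distance import cdist

def KernelLaplacian(mX1: np.ndarray, mX2: np.ndarray, paramGamma: float = 1.0) -> np.ndarray:
    '''
    Hypothetical example kernel: k(x, y) = exp(-paramGamma * ||x - y||_1).
    Args:
        mX1        - Input data with shape N1 x D.
        mX2        - Input data with shape N2 x D.
        paramGamma - Assumed scale parameter (Positive).
    Output:
        mKx        - Output kernel matrix with shape N1 x N2.
    '''
    mD = cdist(mX1, mX2, metric = 'cityblock') # L1 distances, shape N1 x N2
    return np.exp(-paramGamma * mD)

# Usage on synthetic data (Values are arbitrary)
oRng = np.random.default_rng(0)
mA   = oRng.normal(size = (5, 3)) # N1 = 5, D = 3
mB   = oRng.normal(size = (4, 3)) # N2 = 4, D = 3
mKx  = KernelLaplacian(mA, mB)
print(mKx.shape)
```

Any callable with this input / output contract can be passed as `k`, which is what allows the same class to be used with different kernels.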


In [None]:
class KPCA:
    def __init__(self, d: int = 2, k: Callable = lambda mX1, mX2: mX1 @ mX2.T):
        '''
        Constructing the object.
        Args:
            d - Number of dimensions of the encoder output.
            k - A kernel function (Callable).
        '''
        #===========================Fill This===========================#
        # 1. Keep the model parameters.

        ?????
        #===============================================================#
        
    def fit(self, mX: np.ndarray):
        '''
        Fitting model parameters to the input.
        Args:
            mX - Input data with shape N x D.
        Output:
            self
        '''
        #===========================Fill This===========================#
        # 1. Build the model encoder.
        # 2. Optimize calculation by the dimensions of `mX`.
        # !! You may find `scipy.sparse.linalg.svds()` useful.
        # !! You may find `scipy.sparse.linalg.eigsh()` useful.
        # !! Do not apply the centering matrix `J` as an explicit matrix multiplication.

        ?????
        #===============================================================# 
        return self
    
    def transform(self, mX: np.ndarray) -> np.ndarray:
        '''
        Applies (Out of sample) encoding
        Args:
            mX - Input data with shape N x D.
        Output:
            mZ - Low dimensional representation (embeddings) with shape N x d.
        '''
        #===========================Fill This===========================#
        # 1. Encode data using the model encoder.
        
        ?????
        #===============================================================#

        return mZ
    


* <font color='red'>(**?**)</font> Why is `inverse_transform()` not implemented? You may read about SciKit Learn's `inverse_transform()`.
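
The hint in `fit()` about avoiding an explicit multiplication by `J` can be illustrated on a small example. The sketch below assumes `J` is the centering matrix $ \boldsymbol{J} = \boldsymbol{I} - \frac{1}{N} \boldsymbol{1} \boldsymbol{1}^{T} $ and that the kernel matrix is square. The name `CenterKernel` is hypothetical and this is only one possible way to do it.

```python
import numpy as np

def CenterKernel(mK: np.ndarray) -> np.ndarray:
    '''
    Centers a square kernel matrix without building `J`.
    Equivalent to J @ mK @ J where J = I - (1 / N) * 1 1^T (Assumed definition).
    '''
    vRowMean = np.mean(mK, axis = 1, keepdims = True) # N x 1, mean of each row
    vColMean = np.mean(mK, axis = 0, keepdims = True) # 1 x N, mean of each column
    return mK - vRowMean - vColMean + np.mean(mK)

# Compare to the explicit form on a small synthetic kernel matrix
numSamples = 6
oRng  = np.random.default_rng(0)
mX    = oRng.normal(size = (numSamples, 3))
mK    = mX @ mX.T
mJ    = np.eye(numSamples) - (np.ones((numSamples, numSamples)) / numSamples)
mKRef = mJ @ mK @ mJ

print(np.allclose(CenterKernel(mK), mKRef))
```

The explicit form costs two `N x N` matrix products, while the mean based form only needs element wise operations, which matters when `N` is large.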

### 5.2. K-PCA Application

In this section the K-PCA (Using the above class) will be applied on several data sets:

 * Breast Cancer Dataset - Loaded using `load_breast_cancer()`.
 * Digits Dataset - Loaded using `load_digits()`.
 * Iris Dataset - Loaded using `load_iris()`.
 * Wine Dataset - Loaded using `load_wine()`.

For each data set:

1. Make yourself familiar with the data set:
    * How many features are there ($D$).
    * How many samples are there ($N$).
    * Do all features have the same unit?
2. Apply a Pre Process Step  
   In ML, usually, if the features do not have the same unit they are normalized.  
   Namely, each feature is shifted and scaled to have zero mean and unit standard deviation.   
   Write a function to normalize input data.
3. Apply the K-PCA  
   Set `d` to be visualization friendly and apply the K-PCA from $D$ to $d$.  
   The obtained low dimensional data is represented by $\boldsymbol{Z} \in \mathbb{R}^{d \times N}$.  
   You should use the following kernels (Implemented by yourself):
     * $k \left( \boldsymbol{x}_{i}, \boldsymbol{x}_{j} \right) = \boldsymbol{x}_{i}^{T} \boldsymbol{x}_{j}$.
     * $k \left( \boldsymbol{x}_{i}, \boldsymbol{x}_{j} \right) = \left(1 + \boldsymbol{x}_{i}^{T} \boldsymbol{x}_{j} \right)^{p}$.  
       You should set a reasonable $p$.
     * $k \left( \boldsymbol{x}_{i}, \boldsymbol{x}_{j} \right) = \exp \left( - \frac{\left\| \boldsymbol{x}_{i} - \boldsymbol{x}_{j} \right\|_{2}^{2}}{2 {\sigma}^{2}} \right)$.  
       You should set a reasonable $\sigma$.
4. Plot Low Dimensional Data  
   Make a scatter plot of $\boldsymbol{Z} \in \mathbb{R}^{d \times N}$ and color the data points according to the data labels.  
   For each data set show the result with and without the normalization step.


* <font color='brown'>(**#**)</font> Pay attention to the difference between the dimensions of the data in code (`N x D`) and the dimensions used in the derived math formulations.
* <font color='brown'>(**#**)</font> The output should be 2 figures for each data set and kernel. You may show them in a single plot using sub plots.
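
As a reference for the flow of steps 2 and 3, the sketch below uses SciKit Learn's own `StandardScaler` and `KernelPCA` rather than the classes and kernels to be implemented here. The choice of the Iris data set, the Gaussian kernel and the value of `paramSigma` are assumptions for illustration only. SciKit Learn parameterizes the Gaussian kernel by `gamma`, which matches $ \frac{1}{2 {\sigma}^{2}} $ in the formula above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

mX, vY = load_iris(return_X_y = True) # mX: N x D (150 x 4), vY: labels

d          = 2   # Visualization friendly dimension
paramSigma = 2.0 # Assumed value, should be tuned per data set

dZ = {} # Embeddings keyed by whether normalization was applied
for normalizeData in (False, True):
    # Step 2: Zero mean and unit standard deviation per feature (Optional)
    mXIn = StandardScaler().fit_transform(mX) if normalizeData else mX
    # Step 3: Kernel PCA with a Gaussian kernel, gamma = 1 / (2 * sigma^2)
    oKpca = KernelPCA(n_components = d, kernel = 'rbf', gamma = 1.0 / (2.0 * paramSigma ** 2))
    dZ[normalizeData] = oKpca.fit_transform(mXIn) # N x d (Rows are samples)
    print(f'Normalized: {normalizeData}, Embedding shape: {dZ[normalizeData].shape}')
```

Each entry of `dZ` can then be drawn with `plt.scatter(mZ[:, 0], mZ[:, 1], c = vY)` for step 4. Note that the output follows the `N x d` convention, the transpose of $\boldsymbol{Z} \in \mathbb{R}^{d \times N}$.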

In [None]:
#===========================Fill This===========================#
# 1. Implement the normalization function.
# !! Make sure to address the remark.

def NormalizeData(mX: np.ndarray) -> np.ndarray:
    '''
    Normalize data so each feature has zero mean and unit standard deviation.
    Args:
        mX  - Input data with shape N x D.
    Output:
        mY  - Output data with shape N x D.
    Remarks:
        - Features with zero standard deviation are not scaled (Only centered).
    '''

    ?????

    return mY
#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Implement the 3 kernels.
# !! Make sure to address the remarks.
# !! Pay attention that `np.dot(mA.T, mA)` is faster (Symmetric aware) than `mA.T @ mA`.

def KernelInnerProduct( mX1: np.ndarray, mX2: np.ndarray ) -> np.ndarray:
    '''
    Calculates the kernel matrix of the Inner Product kernel.
    Args:
        mX1 - Input data with shape N1 x D.
        mX2 - Input data with shape N2 x D.
    Output:
        mKx - Output data with shape N1 x N2.
    Remarks:
        - The function is implemented without explicit loops.
    '''

    ?????
    
    return mKx

def KernelPolynomial( mX1: np.ndarray, mX2: np.ndarray, p: int = 2 ) -> np.ndarray:
    '''
    Calculates the kernel matrix of the Polynomial kernel.
    Args:
        mX1 - Input data with shape N1 x D.
        mX2 - Input data with shape N2 x D.
        p   - The degree of the model.
    Output:
        mKx - Output data with shape N1 x N2.
    Remarks:
        - The function is implemented without explicit loops.
    '''

    ?????
    
    return mKx

def KernelGaussian( mX1: np.ndarray, mX2: np.ndarray, σ2: float = None ) -> np.ndarray:
    '''
    Calculates the kernel matrix of the Gaussian kernel.
    Args:
        mX1 - Input data with shape N1 x D.
        mX2 - Input data with shape N2 x D.
        σ2  - The variance of the model.
    Output:
        mKx - Output data with shape N1 x N2.
    Remarks:
        - The function is implemented without explicit loops.
    '''

    ?????
    
    return mKx

#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Set parameter `d`.
# 2. Load each data set.
# 3. Apply the K-PCA (For each kernel) to each data set with and without normalization.
# 4. Display the results as scatter plots.

?????

#===============================================================#

### 5.3. Question

In the above, compare the results of the _Inner Product_ kernel to the PCA from the previous part.  
Explain the results.

### 5.3. Solution

<font color='red'>??? Fill the answer here ???</font>

---

### 5.4. Kernel PCA with Geodesic Distance (Bonus 4 Points)

In this question we'll build a pseudo _geodesic distance_ and apply the Kernel PCA.

In this section:

 1. Generate 750 samples of the S Curve manifold (2D in 3D) using SciKit Learn's [`make_s_curve()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_s_curve.html).  
    Make sure to read about its output, specifically `t`.    
    This is already implemented.
 2. Build a pair wise distance function utilizing both the data coordinates and the `vT` variable.  
    Since the `vT` variable holds location data, this is a geodesic like distance.
 3. Show the distance for a few different points (Set by `numDispPts`).  
    This is already implemented.
 4. Apply a Kernel PCA from 3D to 2D on the data utilizing the distance function.
 5. Show the results in the 2D space.
 6. Explain the results (In words).

* <font color='brown'>(**#**)</font> Since the case above uses a pre computed distance function, you do not have to use the K-PCA class and may use the PCA instead. You may use SciKit Learn's PCA or your own implementation.
* <font color='brown'>(**#**)</font> With some tweaking of the parameters and the distance function one may get the following result:

![](https://i.imgur.com/CYVzYnF.png)
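
The mechanics of step 4, turning a pair wise distance matrix into a kernel matrix and embedding it, can be sketched as follows. This is one possible approach and not necessarily the intended one: it assumes the classic double centering $ \boldsymbol{K} = -\frac{1}{2} \boldsymbol{J} \boldsymbol{D}^{\circ 2} \boldsymbol{J} $ (with $ \boldsymbol{D}^{\circ 2} $ the element wise squared distances) and uses SciKit Learn's `KernelPCA` with `kernel = 'precomputed'`. The distance here is a plain Euclidean placeholder, not the geodesic like distance the exercise asks for, and the sample count is reduced for brevity.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_s_curve
from sklearn.decomposition import KernelPCA

mX, vT = make_s_curve(200, random_state = 0) # mX: N x 3, vT: position along the curve

# Placeholder distance: Euclidean (NOT the geodesic like distance of the exercise)
mD = cdist(mX, mX) # N x N

# Double centering of the squared distances without building `J` explicitly
mD2 = mD ** 2
mK  = -0.5 * (mD2 - np.mean(mD2, axis = 1, keepdims = True) - np.mean(mD2, axis = 0, keepdims = True) + np.mean(mD2))

# Kernel PCA on the pre computed kernel matrix
oKpca = KernelPCA(n_components = 2, kernel = 'precomputed')
mZ    = oKpca.fit_transform(mK) # N x 2

print(f'The embedding has shape of {mZ.shape}')
```

With the Euclidean placeholder the S shape is not expected to unfold. Replacing `mD` by a distance which also uses `vT` changes the kernel matrix and hence the embedding.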

In [None]:
# Generate the Data

N      = 750
mX, vT = make_s_curve(N)

numDispPts = 4

print(f'The data has shape of {mX.shape}')

In [None]:
# Display the Data

hF = plt.figure(figsize = (8, 8))
hA = hF.add_subplot(projection = '3d')
hA.scatter(mX[:, 0], mX[:, 1], mX[:, 2], s = 50, c = vT, edgecolor = 'k', alpha = 1)
hA.set_xlim([-2, 2])
hA.set_ylim([-2, 2])
hA.set_zlim([-2, 2])
hA.set_xlabel('$x_1$')
hA.set_ylabel('$x_2$')
hA.set_zlabel('$x_3$')
hA.set_title('The S Curve Colored by `vT`')
plt.show()

In [None]:
#===========================Fill This===========================#
# 1. Generate a pair wise distance function.
# !! You may and should utilize the parameter `vT`.
# !! Since we use the location data `vT` this is a geodesic like distance.
# !! You may add any parameters you need to the function.

def DistanceFunction( mX: np.ndarray, vT: np.ndarray ) -> np.ndarray:
    '''
    Calculates the pair wise (Geodesic like) distance matrix.
    Args:
        mX - Input data with shape N x D.
        vT - Input data (Location) with shape N.
    Output:
        mD - Pair wise distance matrix with shape N x N.
    Remarks:
        - You may use SciPy's `cdist()` and / or `pdist()`.
    '''

    ?????
    
    return mD
#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Calculate the Distance Matrix `mD`.
# !! You may add any parameters you need to the function.

mD = DistanceFunction(???)

#===============================================================#

In [None]:
# Display the Distance Function for few Points
# The result should look like a local distance along the surface of the S curve.

vIdx = np.random.choice(N, numDispPts, replace = False)

hF = plt.figure(figsize = (20, 6))

for ii, idx in enumerate(vIdx):
    
    hA  = hF.add_subplot(1, numDispPts, ii + 1, projection = '3d')
    hA.scatter(*mX.T, s = 50, c = mD[idx, :], edgecolor = 'k', alpha = 0.8)
    hA.scatter(*mX[idx], s = 500, c = 'r', edgecolor = 'k', alpha = 1)
    hA.set_xlim([-2, 2])
    hA.set_ylim([-2, 2])
    hA.set_zlim([-2, 2])
    hA.set_xlabel('$x_1$')
    hA.set_ylabel('$x_2$')
    hA.set_zlabel('$x_3$')
    
    hA.view_init(elev = 15, azim = 300)

hF.suptitle('Geodesic Distance from the Red Point')

plt.show()

In [None]:
#===========================Fill This===========================#
# 1. Create a Kernel Matrix from the distance matrix.
# 2. Apply the K-PCA (Manually or using SciKit Learn).

?????

#===============================================================#

In [None]:
#===========================Fill This===========================#
# 1. Display the low dimension encoding of the data.

?????

#===============================================================#

### 5.4.6. Solution

<font color='red'>??? Fill the answer here ???</font>

---