[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io/)

# Machine Learning Methods

## Exercise 006 - Clustering

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 28/02/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/Exercise0006ClusteringSolution.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture

# Image Processing
from skimage.color import rgb2lab
from skimage.io import imread
from skimage.metrics import structural_similarity
from skimage.transform import downscale_local_mean

# Miscellaneous
import itertools
import json
import os
from platform import python_version
import random
import urllib.request
import re

# Typing
from typing import Callable, List, Tuple

# Visualization
from matplotlib.colors import LogNorm, Normalize
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

DATA_FILE_URL = r'https://i.imgur.com/1TvYQ2R.png'

In [None]:
# Fixel Algorithms Packages


## Exercise

In this exercise we'll use clustering in order to quantize the colors of an image.
The level of quantization is inversely proportional to the number of clusters.

This exercise introduces:

 - Exploring Image Colors in 3D.
 - Resizing image to reduce memory resources.
 - Using clustering methods for non uniform, data adaptive, quantization.
 - Using the [Structural Similarity](https://en.wikipedia.org/wiki/Structural_similarity) (SSIM) as an image reconstruction score.

The objective is to compare clustering methods for image color quantization.

In this exercise:

1. Download the data (Automatically by the code).
2. Pre Process the image: Scale into `[0, 1]` and resize.
3. Set the hyper parameters and methods to explore.
4. Run the analysis.
5. Explore results.

One should achieve `SSIM > 0.75` in this task.

* <font color='red'>(**?**)</font> Why can't we use the DBSCAN method above?
* <font color='red'>(**?**)</font> What are the limitations of the Agglomerative method in this context? Think of the number of pixels, the distance matrix and memory.
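
As a rough aid for the memory question, the sketch below estimates the storage of a condensed pairwise distance matrix. The image sizes (`384 x 512` and its factor 4 downscale) and the 8 bytes per `float64` element are assumptions for illustration; the actual memory use of a specific clustering implementation may differ.

```python
def CondensedDistanceMemoryGiB( numRows: int, numCols: int, bytesPerElement: int = 8 ) -> float:
    # Memory (GiB) of a condensed pairwise distance matrix over all pixels.
    # Assumption: one element per unordered pixel pair, stored as float64.
    numPx    = numRows * numCols
    numPairs = numPx * (numPx - 1) // 2
    return numPairs * bytesPerElement / (2 ** 30)

# Hypothetical image sizes (not read from the actual data)
giBFull  = CondensedDistanceMemoryGiB(384, 512) #<! Full resolution
giBSmall = CondensedDistanceMemoryGiB(96, 128)  #<! Downscaled by 4 along each axis

print(f'Full resolution: {giBFull:0.2f} [GiB]')
print(f'Downscaled     : {giBSmall:0.2f} [GiB]')
```

Downscaling by 4 along each axis reduces the number of pixels by a factor of 16 and the number of pixel pairs by a factor of about 256.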

In [None]:
# Parameters

lNumColors          = list(range(3, 15)) #<! On first tries use (3, 5)
lClusterMethod      = [AgglomerativeClustering(), GaussianMixture(n_init = 15), KMeans(n_init = 15)]
lClusterMethodStr   = ['Agglomerative', 'Gaussian Mixture', 'K-Means']


In [None]:
# Auxiliary Functions

def ImageHistogram3D( mI: np.ndarray, numBins: int = 12, hA: plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, lAxisLbl = ['Red', 'Green', 'Blue'] ) -> plt.Axes:
    """
    Visualize a 3D Histogram of an Image

    Parameters
    ----------

    mI: Input Image (m x n x 3)
    numBins: Number of Bins per Channel
    """

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize, subplot_kw = {'projection': '3d'})
    else:
        hF = hA.get_figure()

    tH, lEdges = np.histogramdd(mI.reshape(-1, 3), bins = numBins)

    tR, tG, tB = np.meshgrid(lEdges[0][:-1], lEdges[1][:-1], lEdges[2][:-1], indexing = 'ij')
    # Make the representing color the middle of the bin
    tR += np.diff(lEdges[0])[0] / 2
    tG += np.diff(lEdges[1])[0] / 2
    tB += np.diff(lEdges[2])[0] / 2

    mColors = np.column_stack((tR.flatten(), tG.flatten(), tB.flatten()))
    tH = tH / np.max(tH)

    hA.scatter(tR.flatten(), tG.flatten(), tB.flatten(), s = (tH.flatten() ** 2) * 7500, c = mColors)

    hA.set_xlabel(lAxisLbl[0])
    hA.set_ylabel(lAxisLbl[1])
    hA.set_zlabel(lAxisLbl[2])

    return hA


def ImageColorsScatter( mI: np.ndarray, hA: plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, lAxisLbl = ['Red', 'Green', 'Blue'] ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize, subplot_kw = {'projection': '3d'})
    else:
        hF = hA.get_figure()

    mC = np.reshape(mI, (-1, 3))

    hA.scatter(mC[:, 0], mC[:, 1], mC[:, 2], c = mC)

    hA.set_xlabel(lAxisLbl[0])
    hA.set_ylabel(lAxisLbl[1])
    hA.set_zlabel(lAxisLbl[2])

    return hA

def ConvertRgbToLab( mRgb: np.ndarray ) -> np.ndarray:
    # Converts sets of RGB features into LAB features.
    # Input (numPx x 3)
    # Output: (numPx x 3)
    mRgb3D = np.reshape(mRgb, (1, -1, 3))
    mLab3D = rgb2lab(mRgb3D)

    return np.reshape(mLab3D, (-1, 3))

def AssignColorsByLabel( mX: np.ndarray, vL: np.ndarray ) -> np.ndarray:

    mC = mX.copy()

    vU = np.unique(vL)
    vIdx = np.full(shape = (mX.shape[0]), fill_value = False)

    for iLabel in vU:

        vIdx = np.equal(vL, iLabel, out = vIdx)
        
        mC[vIdx, :] = np.mean(mX[vIdx, :], axis = 0)
    
    return mC

## Generate / Load Data

The image we'll be using in this notebook is the Peppers image from MATLAB:

![](https://i.imgur.com/1TvYQ2R.png)


In [None]:
# Loading / Generating Data

mI = imread(DATA_FILE_URL)
mI = mI[:, :, :3] #<! Remove the alpha channel

print(f'The image shape: {mI.shape}')


### Pre Process Data

Scale the pixels into the [0, 1] range.

In [None]:
# Pre Process Data

mI = mI / 255

In [None]:
# Down Scale the Image

# In order to allow the Agglomerative method to work we need to downscale the image by a factor of 4 along each spatial axis.

mI = downscale_local_mean(mI, (4, 4, 1))

print(f'The image shape: {mI.shape}')
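
To illustrate what downscaling by a local mean does, here is a minimal NumPy sketch. The helper `BlockMean()` is hypothetical and assumes the image dimensions are divisible by the factor; it is meant to mirror the behavior of `downscale_local_mean()` in that case only, not to replace it.

```python
import numpy as np

def BlockMean( mI: np.ndarray, factor: int ) -> np.ndarray:
    # Averages non overlapping (factor x factor) blocks per channel.
    # Assumption: The number of rows and columns are divisible by `factor`.
    numRows, numCols, numChannels = mI.shape
    if (numRows % factor) or (numCols % factor):
        raise ValueError('The image dimensions must be divisible by `factor`')
    tI = mI.reshape(numRows // factor, factor, numCols // factor, factor, numChannels)
    return tI.mean(axis = (1, 3))

# Toy image (4 x 4 x 3)
mToy   = np.arange(4 * 4 * 3, dtype = float).reshape(4, 4, 3)
mSmall = BlockMean(mToy, 2)

print(f'The toy image shape       : {mToy.shape}')
print(f'The downscaled image shape: {mSmall.shape}')
```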

### Plot the Data

In [None]:
# Plot the Image

hF, hA = plt.subplots(figsize = (4, 4))
hA.imshow(mI)

plt.show()

In [None]:
# Plot the Image Colors Scatter
hA = ImageColorsScatter(mI)

plt.show()

In [None]:
# Plot the Color Histogram
# This is basically a uniform quantization of the data.
hA = ImageHistogram3D(mI, numBins = 10)

plt.show()

* <font color='red'>(**?**)</font> Can we expect the clustering methods to generate better clusters than the uniform quantization above? Why?
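
For reference, uniform quantization can be sketched as below. The helper `UniformQuantize()` is hypothetical and assumes pixel values in `[0, 1]` with fixed, equally sized bins. Note that `ImageHistogram3D()` derives its bin edges from the data, so the two are similar in spirit rather than identical.

```python
import numpy as np

def UniformQuantize( mI: np.ndarray, numBins: int ) -> np.ndarray:
    # Maps each value in [0, 1] to the center of its bin (per channel).
    mIdx = np.minimum(np.floor(mI * numBins), numBins - 1) #<! Bin index, the value 1.0 falls into the last bin
    return (mIdx + 0.5) / numBins

oRng   = np.random.default_rng(0)
mRand  = oRng.random((8, 8, 3)) #<! Synthetic image for illustration
mQuant = UniformQuantize(mRand, 4)

print(f'Number of unique levels: {np.unique(mQuant).size}')
print(f'Maximum absolute error : {np.max(np.abs(mQuant - mRand)):0.3f}')
```

The bins are fixed regardless of where the colors actually concentrate, which is the contrast with the data adaptive clustering approach of this exercise.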

## Pre Processing

In this section we'll rearrange the data into:

1. A data set of shape `numPixels x 3`.
2. Create a variant of the data in LAB using `ConvertRgbToLab()`.

In [None]:
# Convert the Data into the numPixels x 3 Form

#===========================Fill This===========================#
# 1. Convert the image into `(numPixels x 3)` form.
# 2. Create a LAB color space variant of the data using `ConvertRgbToLab()`.
mX      = np.reshape(mI, (-1, 3))
mXLab   = ConvertRgbToLab(mX)
#===============================================================#
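
The reshape above keeps the pixels in row major order, so the image can be recovered by reshaping back. A small sketch on a toy array (the values are arbitrary):

```python
import numpy as np

# Toy image (2 x 3 x 3)
mToyImg = np.arange(2 * 3 * 3, dtype = float).reshape(2, 3, 3) / 17

mToyX   = np.reshape(mToyImg, (-1, 3))         #<! (numPixels x 3)
mToyRec = np.reshape(mToyX, mToyImg.shape)     #<! Back to (numRows x numCols x 3)

# The pixel at (row, col) is at index `row * numCols + col`
print(mToyX.shape)
print(np.array_equal(mToyX[1 * 3 + 1], mToyImg[1, 1]))
```

This is why the clustered colors can later be displayed with `np.reshape(mC, mI.shape)`.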

## Clustering

In [None]:
# Creating the Data Frame

#===========================Fill This===========================#
# 1. Calculate the number of combinations.
# 2. Create a nested loop to create the combinations between the parameters.
# 3. Store the combinations as the columns of a data frame.

# For advanced Python users: Use `itertools` to create the Cartesian product.
numComb = len(lClusterMethod) * len(lNumColors)
dData   = {'Cluster Method': [], 'Number of Colors': [], 'SSIM': [0.0] * numComb}

for ii, clusterMethod in enumerate(lClusterMethodStr):
    for jj, numColors in enumerate(lNumColors):
        dData['Cluster Method'].append(clusterMethod)
        dData['Number of Colors'].append(numColors)
#===============================================================#

dfModelScore = pd.DataFrame(data = dData)
dfModelScore
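
As an alternative to the nested loop, the combinations can be generated with `itertools.product()`. The sketch below uses a shortened, illustrative list of colors; the resulting dictionary has the same layout as `dData` and can be passed to `pd.DataFrame()` in the same way.

```python
import itertools

# Illustrative values (shorter than the lists used in the notebook)
lNumColorsDemo        = [3, 4, 5]
lClusterMethodStrDemo = ['Agglomerative', 'Gaussian Mixture', 'K-Means']

# The first argument varies slowest, matching the nested loop order
lComb = list(itertools.product(lClusterMethodStrDemo, lNumColorsDemo))

dDataDemo = {
    'Cluster Method'  : [clusterMethod for clusterMethod, _ in lComb],
    'Number of Colors': [numColors for _, numColors in lComb],
    'SSIM'            : [0.0] * len(lComb),
}

print(f'Number of combinations: {len(lComb)}')
```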

In [None]:
# Optimize the Model

#===========================Fill This===========================#
# 1. Iterate over each row of the data frame `dfModelScore`. Each row defines the hyper parameters.
# 2. Construct the model.
# 3. Fit it on the data (`mX`).
# 4. Calculate the score.
# 5. Store the score into the data frame column.

for ii in range(numComb):
    clusterMethod   = dfModelScore.loc[ii, 'Cluster Method']
    numColors       = dfModelScore.loc[ii, 'Number of Colors']

    print(f'Processing model {ii + 1:03d} out of {numComb} with `Cluster Method` = {clusterMethod} and `Number of Colors` = {numColors}.')

    oModelCluster = lClusterMethod[lClusterMethodStr.index(clusterMethod)]
    if (clusterMethod == 'Agglomerative') or (clusterMethod == 'K-Means'):
        oModelCluster.set_params(**{'n_clusters': numColors})
    else:
        oModelCluster.set_params(**{'n_components': numColors})
    
    vY = oModelCluster.fit_predict(mX)

    mC = AssignColorsByLabel(mX, vY)

    ssimScore = structural_similarity(mI, np.reshape(mC, mI.shape), data_range = 1, channel_axis = 2)
    dfModelScore.loc[ii, 'SSIM'] = ssimScore
    print(f'Finished processing model {ii + 1:03d} with `SSIM` = {ssimScore}.')

    hF, hA = plt.subplots(figsize = (4, 4))
    hA.imshow(np.reshape(mC, mI.shape))
    hA.set_title(f'Quantized Image: Method = {clusterMethod}, Colors = {numColors}, SSIM = {ssimScore:0.3f}')
    plt.show()
#===============================================================#
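
To make the quantization step concrete, here is a minimal NumPy sketch of clustering based color quantization using plain Lloyd iterations. It is not the scikit-learn `KMeans` implementation used above: the helper `KMeansQuantize()`, its random initialization, the fixed number of iterations and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def KMeansQuantize( mX: np.ndarray, numColors: int, numIter: int = 20, seedNum: int = 0 ):
    # Quantizes the rows of `mX` (numPx x 3) into at most `numColors` colors.
    # Simplified Lloyd iterations: Random initialization, fixed number of iterations.
    oRng       = np.random.default_rng(seedNum)
    mCentroids = mX[oRng.choice(mX.shape[0], size = numColors, replace = False)].copy()

    for _ in range(numIter):
        # Squared Euclidean distance of each pixel to each centroid (numPx x numColors)
        mD = np.sum((mX[:, None, :] - mCentroids[None, :, :]) ** 2, axis = 2)
        vL = np.argmin(mD, axis = 1)
        for kk in range(numColors):
            vIdx = vL == kk
            if np.any(vIdx): #<! An empty cluster keeps its previous centroid
                mCentroids[kk] = np.mean(mX[vIdx], axis = 0)

    # Final assignment to the updated centroids
    mD = np.sum((mX[:, None, :] - mCentroids[None, :, :]) ** 2, axis = 2)
    vL = np.argmin(mD, axis = 1)

    return mCentroids[vL], vL

# Synthetic colors: Two tight groups around 0.2 and 0.8
oRng    = np.random.default_rng(1)
mColors = np.vstack((0.2 + 0.01 * oRng.standard_normal((50, 3)), 0.8 + 0.01 * oRng.standard_normal((50, 3))))

mColorsQ, vLabels = KMeansQuantize(mColors, numColors = 2)
print(f'Number of unique colors after quantization: {np.unique(mColorsQ, axis = 0).shape[0]}')
```

Each pixel is replaced by the representative color of its cluster, so the number of distinct colors is bounded by the number of clusters.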

In [None]:
# Display Sorted Results (Descending)
# Pandas allows sorting data by any column using the `sort_values()` method
# The `head()` method allows us to see only the first values
dfModelScore.sort_values(by = ['SSIM'], ascending = False).head(10)

In [None]:
# Plotting the Scores as a Heat Map
# We can pivot the data set created to have a 2D matrix of the score as a function of the hyper parameters.

hA = sns.heatmap(data = dfModelScore.pivot(index = 'Number of Colors', columns = 'Cluster Method', values = 'SSIM'), robust = True, linewidths = 1, annot = True, fmt = '0.2%', norm = LogNorm())
hA.set_title('SSIM')
plt.show()

In [None]:
hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)

sns.lineplot(data = dfModelScore, x = 'Number of Colors', y = 'SSIM', hue = 'Cluster Method', ax = hA)
hA.set_title('SSIM as a Function of Number of Colors')

plt.show()

* <font color='red'>(**?**)</font> How will the graph behave for larger and larger numbers of clusters?
* <font color='brown'>(**#**)</font> In many cases we graph the performance using MMSE as a function of the number of clusters. One method to optimize `K` is by looking at the _elbow_. 
* <font color='green'>(**@**)</font> Apply the clustering on LAB Color Space (The SSIM still should be calculated on RGB) and compare results.
* <font color='green'>(**@**)</font> In the implementation above we used the mean to represent the cluster, you may try other variations.
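
One possible way to try other cluster representatives is to generalize `AssignColorsByLabel()` with a reduction function. The helper `AssignColorsByLabelReduce()` and its `hReduce` parameter are hypothetical names, and the median is only one of several options.

```python
import numpy as np
from typing import Callable

def AssignColorsByLabelReduce( mX: np.ndarray, vL: np.ndarray, hReduce: Callable = np.mean ) -> np.ndarray:
    # Replaces each sample by the representative of its cluster.
    # `hReduce` must accept the `axis` keyword (e.g. `np.mean`, `np.median`).
    mC = np.empty_like(mX)
    for iLabel in np.unique(vL):
        vIdx = vL == iLabel
        mC[vIdx, :] = hReduce(mX[vIdx, :], axis = 0)

    return mC

# Toy data: The first cluster contains an outlier (1.0, 1.0, 1.0)
mToyColors = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [0.9, 0.8, 0.7]])
vToyLabels = np.array([0, 0, 0, 1])

mCMean   = AssignColorsByLabelReduce(mToyColors, vToyLabels, np.mean)
mCMedian = AssignColorsByLabelReduce(mToyColors, vToyLabels, np.median)

print(mCMean[0])
print(mCMedian[0])
```

The median is less sensitive to outlier colors within a cluster, which may or may not improve the SSIM on a given image.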