[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - UnSupervised Learning - Dimensionality Reduction - Principal Component Analysis (PCA) - Exercise

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 13/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0063DimensionalityReductionPCA.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Courses Packages

from DataVisualization import PlotScatterData


In [None]:
# General Auxiliary Functions

hOrdinalNum = lambda n: '%d%s' % (n, 'tsnrhtdd'[(((math.floor(n / 10) %10) != 1) * ((n % 10) < 4) * (n % 10))::4])


## Dimensionality Reduction by PCA

In this exercise we'll use the PCA approach for dimensionality reduction within a pipeline.

This exercise introduces:

1. Working with the [Breast Cancer Wisconsin Data Set](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html).
2. Combining PCA, as a transformer, with a linear classifier in a pipeline to predict the binary class of the data.
3. Selecting the best features using a sequential approach.

The objective is to optimize the feature selection in order to get the best classification accuracy.


* <font color='brown'>(**#**)</font> PCA is the most basic dimensionality reduction operator.
* <font color='brown'>(**#**)</font> The PCA output is a linear combination of the input.
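
The linearity of PCA can be checked numerically. The following is a minimal sketch on synthetic data (the array `mA` and the other names are illustrative assumptions, not part of the exercise). With the default `whiten = False`, projecting the centered data onto the rows of `components_` should reproduce the output of `transform()`. Strictly speaking the map is affine, since the mean is removed first.

```python
import numpy as np
from sklearn.decomposition import PCA

oRng = np.random.default_rng(0)
mA   = oRng.normal(size = (100, 5)) #<! Synthetic data (Illustration only)

oPcaDemo = PCA(n_components = 2).fit(mA)

mZ1 = oPcaDemo.transform(mA) #<! Output of SciKit Learn
mZ2 = (mA - oPcaDemo.mean_) @ oPcaDemo.components_.T #<! Explicit centering + linear projection

print(f'The two outputs match: {np.allclose(mZ1, mZ2)}')
```

Each column of `mZ2` is a weighted sum of the centered input features, with the weights given by the matching row of `components_`.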

In [None]:
# Parameters

# Data

# Model
numComp  = 2
paramC   = 1
numKFold = 5
numFeat  = 5


## Generate / Load Data

In this notebook we'll use the [Breast Cancer Wisconsin Data Set](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html).

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.


![](https://i.imgur.com/4LE2biE.png)

In [None]:
# Load Data

mX, vY   = load_breast_cancer(return_X_y = True)
dfX, dsY = load_breast_cancer(return_X_y = True, as_frame = True)

print(f'The features data shape: {mX.shape}')


In [None]:
# Merge Label Data
dfData = dfX.copy()
dfData['Label'] = pd.Categorical(dsY)

dfData

## Exploratory Data Analysis (EDA)

### Correlation Matrix

The correlation matrix is an appropriate tool for filtering out features which are highly correlated.  
It is less effective for _feature selection_.
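
As a hedged illustration of such filtering, the sketch below lists the feature pairs with the largest absolute Pearson correlation. The names (`dfFeat`, `dsPairs`) and the threshold `corrThr` are assumptions made for this example only, and the threshold is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dfFeat, _ = load_breast_cancer(return_X_y = True, as_frame = True)

mCorrAbs = dfFeat.corr(method = 'pearson').abs()
mMask    = np.triu(np.ones(mCorrAbs.shape, dtype = bool), k = 1) #<! Strict upper triangle: each pair once
dsPairs  = mCorrAbs.where(mMask).stack().dropna().sort_values(ascending = False)

corrThr = 0.95 #<! Arbitrary threshold (Assumption)
print(dsPairs[dsPairs > corrThr])
```

For each listed pair, dropping one of the two features is a candidate filtering step. Whether that actually helps the classifier still has to be validated.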

In [None]:
# Correlation Matrix
hF, hA = plt.subplots(figsize = (14, 14))
dfData['Label'] = pd.to_numeric(dfData['Label'])
mC = dfData.corr(method = 'pearson')
sns.heatmap(mC.abs(), cmap = 'coolwarm', annot = True, fmt = '2.1f', ax = hA)

plt.show()

* <font color='red'>(**?**)</font> Are there redundant features? Think in the context of PCA.

## Pre Processing

In [None]:
# Standardize the Data
# Make each feature: Zero Mean, Unit Variance.

#===========================Fill This===========================#
# 1. Construct the standard scaler.
# 2. Apply it to data.
?????
#===============================================================#

## Applying Dimensionality Reduction - PCA 

Common uses of _Dimensionality Reduction_:

1. Noise Reduction (Increase SNR).
2. Compute Efficiency.
3. Visualization.
4. Feature Engineering Step (Usually as _Manifold Learning_).
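
One way to get a feel for how much can be discarded (relevant to the noise reduction and compute efficiency uses) is the cumulative explained variance. The sketch below is an illustration under assumptions: the 95% target and the variable names are arbitrary choices for this example, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

mFeat, _ = load_breast_cancer(return_X_y = True)
mFeatStd = StandardScaler().fit_transform(mFeat) #<! Zero mean, unit variance per feature

oPcaFull = PCA().fit(mFeatStd) #<! Keep all components
vCumVar  = np.cumsum(oPcaFull.explained_variance_ratio_)

varTarget   = 0.95 #<! Arbitrary target (Assumption)
numCompDemo = int(np.searchsorted(vCumVar, varTarget) + 1) #<! First count reaching the target

print(f'Variance explained by the first 2 components: {vCumVar[1]:0.2%}')
print(f'Components needed for {varTarget:0.0%} of the variance: {numCompDemo}')
```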

In [None]:
# Applying the PCA Model

#===========================Fill This===========================#
# 1. Construct the PCA model.
# 2. Apply it to data.
oPCA = ???
mZ   = ???
#===============================================================#

### Plot Data in 2D

One useful application of _Dimensionality Reduction_ is visualizing _high dimensional_ data.

In [None]:
# Plot the 2D Result

hA = PlotScatterData(mZ, vY)

* <font color='brown'>(**#**)</font> The _optimal_ Dimensionality Reduction is the perfect feature engineering.
* <font color='brown'>(**#**)</font> Dimensionality Reduction is usually used as a step in a pipeline.
* <font color='red'>(**?**)</font> Can we use _Clustering_ as a dimensionality reduction?
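
One possible angle on the clustering question, given as a sketch rather than a definitive answer: the `transform()` method of SciKit Learn's `KMeans` maps each sample to its distances from the cluster centroids, so a sample with 30 features becomes a vector with one entry per cluster. The number of clusters and the variable names below are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

mFeat, _ = load_breast_cancer(return_X_y = True)
mFeatStd = StandardScaler().fit_transform(mFeat)

numClusters = 4 #<! Arbitrary choice (Assumption)
oKMeans = KMeans(n_clusters = numClusters, n_init = 10, random_state = 0)
mDist   = oKMeans.fit_transform(mFeatStd) #<! Distance of each sample to each centroid

print(f'The input shape: {mFeatStd.shape}, the output shape: {mDist.shape}')
```

Unlike PCA, this mapping is not linear in the input.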

## Pipeline with PCA

In this section we'll build a simple pipeline:

 - Apply `PCA` with 2 components.
 - Apply Linear Classifier.

We'll tweak the model by selecting the best features as the input to the `PCA`.  
To do that we'll use the [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector) object of SciKit Learn.  

Selecting features sequentially is a compute intensive operation.  
Hence we can use it when the following assumptions hold:

1. The number of features is modest (< 100).
2. The cross validation loop (The estimator / pipeline `fit()` and `predict()`) process is fast.

Of course the time budget and computing budget are also main factors.
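
To make the cost concrete, the sketch below counts the pipeline fits a greedy sequential search performs, assuming each step evaluates every remaining candidate with a full cross validation. The helper `CountSelectorFits` is hypothetical (not a SciKit Learn function), and the count ignores any final refit.

```python
def CountSelectorFits(numFeatures: int, numSelect: int, numFolds: int, direction: str = 'forward') -> int:
    # Hypothetical helper: counts fits of a greedy sequential feature search.
    if direction == 'forward':
        numSteps = numSelect #<! Add one feature per step
    else:
        numSteps = numFeatures - numSelect #<! Remove one feature per step
    # At step `ii` there are `numFeatures - ii` candidates to add / remove
    numCandidates = sum(numFeatures - ii for ii in range(numSteps))

    return numCandidates * numFolds

# 30 features (Breast Cancer data), select 5 features, 5 folds
print(CountSelectorFits(30, 5, 5, direction = 'forward'))  #<! 700
print(CountSelectorFits(30, 5, 5, direction = 'backward')) #<! 2250
```

Under these assumptions, selecting a small subset is cheaper in the forward direction. A forward search starts from a single feature, though, which matters when the estimator needs a minimal number of input features.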


In [None]:
# Building the Pipeline

#===========================Fill This===========================#
# 1. Construct a pipeline with the first operation being PCA followed by a Logistic Regression classifier.
# 2. Set the `n_components` and `C` hyper parameters.
oPipeCls = Pipeline([('PCA', PCA(n_components = ???)), ('Classifier', LogisticRegression(C = ???))])
#===============================================================#

In [None]:
# Base Line Score

#===========================Fill This===========================#
# 1. Compute the base line score (Accuracy) as the mean of the output of `cross_val_score`.
scoreAccBase = ???
#===============================================================#

* <font color='red'>(**?**)</font> What are the issues with `cross_val_score`? Think of cases where the folds are not evenly divided or the score is not linear.
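
One way to probe this question is to compare the mean of the per fold scores with a single score computed over the pooled out of fold predictions. The sketch below is a minimal illustration under assumptions: the pipeline (a scaler followed by a logistic regression), the splitter settings and the variable names are choices made for this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mFeat, vLabel = load_breast_cancer(return_X_y = True) #<! 569 samples: 5 folds can't be equal in size

oPipeDemo = Pipeline([('Scaler', StandardScaler()), ('Classifier', LogisticRegression(C = 1))])
oCv       = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0) #<! Same splits for both scores

vFoldScore = cross_val_score(oPipeDemo, mFeat, vLabel, cv = oCv) #<! One accuracy per fold
scoreMean  = np.mean(vFoldScore) #<! Each fold gets equal weight

vLabelPred = cross_val_predict(oPipeDemo, mFeat, vLabel, cv = oCv) #<! Out of fold prediction per sample
scorePool  = accuracy_score(vLabel, vLabelPred) #<! Each sample gets equal weight

print(f'Mean of fold scores: {scoreMean:0.4%}')
print(f'Pooled score       : {scorePool:0.4%}')
```

For accuracy with nearly equal folds the two values should be close. For scores which are not an average over samples (for instance the F1 score), the gap can be larger.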

In [None]:
# Selecting the Features

#===========================Fill This===========================#
# 1. Construct the `SequentialFeatureSelector` object by setting the following (Use the parameters defined above):
#   - `estimator`.
#   - `n_features_to_select`.
#   - `direction`.
#   - `cv`.
# 2. Fit it to data.
# !! Set `direction` wisely. Pay attention that `PCA` with `numComp` components requires at least `numComp` features (Assuming `numSamples` > `numFeatures`).
oFeatSelector = SequentialFeatureSelector(estimator = ???, n_features_to_select = ???, direction = ???, cv = ???)
oFeatSelector = ???
#===============================================================#

In [None]:
# Extracting Selected Features
vSelectedFeat = oFeatSelector.get_support()

* <font color='red'>(**?**)</font> How should we use the above results in production?
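
One possible approach, sketched under assumptions, is to make the selector a step of the deployed pipeline so that scaling, feature selection, PCA and classification are fitted together and applied consistently to new data. To keep the run time short, the example uses only the first 10 features and 3 folds. Those settings, the hold out split and all variable names are illustrative, not part of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mFeat, vLabel = load_breast_cancer(return_X_y = True)
mFeat = mFeat[:, :10] #<! First 10 features only, to keep the run time short (Assumption)

mFeatTrain, mFeatTest, vLabelTrain, vLabelTest = train_test_split(
    mFeat, vLabel, test_size = 0.25, random_state = 0, stratify = vLabel)

# The estimator which scores each candidate subset of features
oInnerPipe = Pipeline([('PCA', PCA(n_components = 2)), ('Classifier', LogisticRegression(C = 1))])
oSelector  = SequentialFeatureSelector(estimator = oInnerPipe, n_features_to_select = 5,
                                       direction = 'backward', cv = 3)

# The selection is a step of the pipeline, hence learned only from the training data
oProdPipe = Pipeline([
    ('Scaler', StandardScaler()),
    ('Selector', oSelector),
    ('PCA', PCA(n_components = 2)),
    ('Classifier', LogisticRegression(C = 1)),
])
oProdPipe = oProdPipe.fit(mFeatTrain, vLabelTrain)

vSupport  = oProdPipe.named_steps['Selector'].get_support() #<! Boolean mask of the kept features
scoreTest = oProdPipe.score(mFeatTest, vLabelTest) #<! Accuracy on held out data

print(f'Number of selected features: {vSupport.sum()}')
print(f'Held out accuracy: {scoreTest:0.2%}')
```

The fitted pipeline accepts raw feature vectors with the same columns as the training data, so the selection mask does not have to be tracked separately.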

In [None]:
# Optimized Score

#===========================Fill This===========================#
# 1. Compute the optimized score (Accuracy) as the mean of the output of `cross_val_score`.
# 2. Select the features from `vSelectedFeat`.
scoreAccOpt = ???
#===============================================================#

In [None]:
# Comparing Results

print(f'The base line score (Accuracy): {scoreAccBase:0.2%}.')
print(f'The optimized score (Accuracy): {scoreAccOpt:0.2%}.')

In [None]:
# The Selected Features

dfX.columns[vSelectedFeat]

* <font color='red'>(**?**)</font> Look at the correlation matrix. How correlated are the selected features relative to the others?
* <font color='red'>(**?**)</font> Given the pipeline above, can we think of a more efficient way to select features?
* <font color='green'>(**@**)</font> Optimize all hyper parameters of the model: `n_features_to_select`, `n_components` and `C`.