# Generative Models  <img src="https://imt-nord-europe.fr/wp-content/uploads/2021/08/cropped-IMT_Nord_Europe_.png" alt="Drawing" style="float: right; width: 150px;"/>

UV SDATA 2021 

```
from scipy import linalg
import numpy as np
import pylab as pl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE
```

## Data Set Information

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s
paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for
example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of
iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable
from each other. The predicted attribute (the output) is the class of iris plant.

## Exercises
In this lab session (TP), you will experiment with different classification techniques:

1. Data visualisation
    1. Show data using `seaborn.pairplot`.
    1. T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization, available as `sklearn.manifold.TSNE`. Embed the data (not the labels) into two dimensions. Plot the embedded points, using their labels to color them. Then change the perplexity parameter and observe the effect.
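
The t-SNE step can be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the data is loaded from scikit-learn's bundled copy of iris via `load_iris`, and the helper name `tsne_embed` and the perplexity values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Assumption: the data comes from scikit-learn's bundled copy of iris.
iris = load_iris()
X, y = iris.data, iris.target  # features (150, 4) and integer labels (150,)


def tsne_embed(X, perplexity, seed=0):
    """Embed the feature vectors (labels are not used) into two dimensions."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(X)


# One embedding per perplexity value; perplexity must stay below the sample count.
embeddings = {p: tsne_embed(X, p) for p in (5, 30, 50)}
for p, Z in embeddings.items():
    print(f"perplexity={p}: embedding shape {Z.shape}")
```

Each embedding `Z` can then be drawn with `plt.scatter(Z[:, 0], Z[:, 1], c=y)`, so that the labels only serve to color the points.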

2. Dataset preprocessing
    1. Build two numpy arrays $X$ and $y$ containing, respectively, the feature vectors and their labels.
    1. Map the labels to integers using `LabelEncoder` from `sklearn.preprocessing`. You now have the set of feature vectors and their corresponding true labels!
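
A possible sketch of the preprocessing step is given below. It assumes the data comes from `sklearn.datasets.load_iris` and rebuilds string labels from `target_names` only to have something to encode; with a CSV file, the label column would play the role of `labels`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

# Assumption: iris is loaded from scikit-learn rather than from a CSV file.
iris = load_iris()
X = np.asarray(iris.data, dtype=float)     # feature vectors, shape (150, 4)
labels = iris.target_names[iris.target]    # string labels such as "setosa"

# LabelEncoder maps each distinct label to an integer (classes sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

print(X.shape, y.shape)
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```

`encoder.inverse_transform` converts integer predictions back to the original label names.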

3. Implement Linear Discriminant Analysis (LDA): we will implement LDA to project the samples onto a new subspace, then use the result to make predictions.
    1. Compute the mean vector of each class:
    $$
    m_i = \frac{1}{n_i} \sum\limits_{x \in D_i} x,
    $$
    where $D_i$ is the set of samples of class $i$ and $n_i$ its size.
    1. Compute the within-class scatter matrix:
    $$
    S_W = \sum\limits_{i=1}^{c} S_i,
    $$
    where $S_i = \sum\limits_{x \in D_i} (x-m_i)(x-m_i)^T$ and $c$ is the number of classes.
    1. Compute the between-class scatter matrix:
    $$
    S_B = \sum\limits_{i=1}^{c} n_i (m_i-m)(m_i-m)^T,
    $$
    where $m$ is the overall mean, and $m_i$ and $n_i$ are the sample mean and size of class $i$.
    1. Solve the eigenvalue problem for the matrix $S_W^{-1}S_B$ using `linalg` from numpy.
    1. Sort the eigenvectors by decreasing eigenvalue.
    1. Transform the samples onto the new subspace (by default with the same number of dimensions, or with a smaller number given as a parameter).
    
__Note__: we work with scatter matrices rather than covariance matrices. Adding a scaling
factor would turn them into covariance matrices, but the resulting eigenspace is identical
(same eigenvectors; only the eigenvalues are scaled by a constant factor).
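
The steps of exercise 3 can be sketched as follows. This is one possible implementation under stated assumptions: the data is loaded with `load_iris`, and the function names `lda_fit` and `lda_transform` are hypothetical. Since $S_W^{-1}S_B$ is not symmetric, `np.linalg.eig` may return values with a negligible imaginary part, so only the real part is kept.

```python
import numpy as np
from sklearn.datasets import load_iris


def lda_fit(X, y, n_components=None):
    """Return (W, eigvals, S_W, S_B); columns of W are the sorted discriminant directions."""
    d = X.shape[1]
    m = X.mean(axis=0)                       # overall mean
    S_W = np.zeros((d, d))                   # within-class scatter
    S_B = np.zeros((d, d))                   # between-class scatter
    for c in np.unique(y):
        X_c = X[y == c]
        m_c = X_c.mean(axis=0)               # class mean m_i
        centered = X_c - m_c
        S_W += centered.T @ centered         # adds S_i
        S_B += X_c.shape[0] * np.outer(m_c - m, m_c - m)
    # Eigen-decomposition of S_W^{-1} S_B, keeping the real part only.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1]        # high to low
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = d if n_components is None else n_components
    return eigvecs[:, :k], eigvals, S_W, S_B


def lda_transform(X, W):
    """Project the samples onto the subspace spanned by the columns of W."""
    return X @ W


X, y = load_iris(return_X_y=True)            # assumption: scikit-learn's copy of iris
W, eigvals, S_W, S_B = lda_fit(X, y, n_components=2)
Z = lda_transform(X, W)
print("eigenvalues:", eigvals)
print("projected shape:", Z.shape)
```

With $c = 3$ classes, $S_B$ has rank at most $c - 1 = 2$, so at most two eigenvalues are meaningfully different from zero; this is why two components are enough here.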

4. Compare your results with `sklearn.discriminant_analysis.LinearDiscriminantAnalysis`.
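
One way to carry out this comparison is sketched below. Discriminant axes are only defined up to sign and scale, so the sketch compares each axis through its absolute correlation with the scikit-learn projection instead of comparing raw coordinates. The helper `scatter_lda_directions` is a hypothetical, compact version of the scatter-matrix procedure of exercise 3.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def scatter_lda_directions(X, y, n_components):
    """Leading eigenvectors of S_W^{-1} S_B (hypothetical helper)."""
    d = X.shape[1]
    m = X.mean(axis=0)
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        m_c = X_c.mean(axis=0)
        S_W += (X_c - m_c).T @ (X_c - m_c)
        S_B += len(X_c) * np.outer(m_c - m, m_c - m)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs.real[:, order]


X, y = load_iris(return_X_y=True)
Z_own = X @ scatter_lda_directions(X, y, 2)
Z_sk = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Absolute correlation close to 1 means both axes carry the same information.
corrs = [abs(np.corrcoef(Z_own[:, j], Z_sk[:, j])[0, 1]) for j in range(2)]
print("absolute correlation per axis:", corrs)
```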

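5. Define a function that predicts the label of a given set of feature vectors:
\begin{align}
    \hat{C}(x) & = \arg\max_{k}~p(C_k|x)\\
    & = \arg\max_{k}~\left[ x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log p(C_k) \right]
\end{align}
where $\mu_k$ is the mean of class $k$, $\Sigma$ is the covariance matrix shared by all classes, and $p(C_k)$ is the prior probability of class $k$.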

6. Predict the class of X and compare your results with `sklearn`. 
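
Exercises 5 and 6 can be sketched together as follows. The sketch assumes that $\Sigma$ is estimated by the pooled within-class covariance $S_W / (n - c)$ and that the priors are the empirical class frequencies; the function names `lda_fit_params` and `lda_predict` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def lda_fit_params(X, y):
    """Estimate class means, pooled covariance and priors (hypothetical helper)."""
    classes = np.unique(y)
    n, d = X.shape
    means = np.array([X[y == c].mean(axis=0) for c in classes])   # shape (c, d)
    priors = np.array([np.mean(y == c) for c in classes])         # empirical p(C_k)
    S_W = np.zeros((d, d))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        S_W += centered.T @ centered
    sigma = S_W / (n - len(classes))     # assumption: unbiased pooled covariance
    return classes, means, sigma, priors


def lda_predict(X, classes, means, sigma, priors):
    """Return the class maximising the linear discriminant score for each row of X."""
    sigma_inv_mu = np.linalg.solve(sigma, means.T)              # Sigma^{-1} mu_k, shape (d, c)
    linear = X @ sigma_inv_mu                                   # x^T Sigma^{-1} mu_k
    offset = 0.5 * np.sum(means.T * sigma_inv_mu, axis=0)       # 1/2 mu_k^T Sigma^{-1} mu_k
    scores = linear - offset + np.log(priors)
    return classes[np.argmax(scores, axis=1)]


X, y = load_iris(return_X_y=True)
y_own = lda_predict(X, *lda_fit_params(X, y))
y_sk = LinearDiscriminantAnalysis().fit(X, y).predict(X)

agreement = np.mean(y_own == y_sk)
accuracy = np.mean(y_own == y)
print(f"agreement with sklearn: {agreement:.3f}, training accuracy: {accuracy:.3f}")
```

Using `np.linalg.solve` avoids forming $\Sigma^{-1}$ explicitly, which is generally the more stable choice.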

7. Assess your method on the iris dataset using cross-validation (`from sklearn.model_selection import train_test_split, StratifiedKFold`).
    - Compute `f1_score`,
    - Plot confusion matrix,
    - $\dots$
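
A possible cross-validation loop is sketched below. It uses scikit-learn's `LinearDiscriminantAnalysis` as a stand-in classifier; a custom implementation with `fit` and `predict` methods could be substituted. The number of folds, the random seed and the macro averaging of the F1 score are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions identical in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = np.empty_like(y)
fold_f1 = []
for train_idx, test_idx in skf.split(X, y):
    clf = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = clf.predict(X[test_idx])
    fold_f1.append(f1_score(y[test_idx], y_pred[test_idx], average="macro"))

# Confusion matrix over all out-of-fold predictions (rows: true, columns: predicted).
cm = confusion_matrix(y, y_pred)
print("macro F1 per fold:", np.round(fold_f1, 3))
print(cm)
```

The matrix `cm` can be displayed with `sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix=cm).plot()`.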

8. Plot the results of the LDA transform with dim = 2. To display the decision regions, you may use:
```
xx, yy = np.meshgrid(np.linspace(4, 8.5, 200), np.linspace(1.5, 4.5, 200))
X_grid = np.c_[xx.ravel(), yy.ravel()]
```
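
The grid above can be used as in the following sketch. It assumes the classifier is trained on the first two features only (sepal length and sepal width), which is what the grid bounds suggest, and it uses scikit-learn's `LinearDiscriminantAnalysis` as a stand-in for a custom implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X2 = X[:, :2]   # assumption: sepal length and sepal width only

clf = LinearDiscriminantAnalysis().fit(X2, y)

# Evaluate the classifier on every node of the grid.
xx, yy = np.meshgrid(np.linspace(4, 8.5, 200), np.linspace(1.5, 4.5, 200))
X_grid = np.c_[xx.ravel(), yy.ravel()]
Z = clf.predict(X_grid).reshape(xx.shape)   # predicted class at each grid node

print("classes present on the grid:", np.unique(Z))
```

The regions can then be drawn with `plt.contourf(xx, yy, Z, alpha=0.3)`, followed by `plt.scatter(X2[:, 0], X2[:, 1], c=y)` to overlay the samples.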

9. QDA
    1. Implement Quadratic Discriminant Analysis
    1. Compare your results with `sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis`.
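
A minimal QDA sketch follows. Unlike LDA, each class keeps its own covariance matrix $\Sigma_k$, and the score of class $k$ is $-\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) + \log p(C_k)$. The function names `qda_fit` and `qda_predict` are hypothetical, and the sketch assumes unbiased per-class covariance estimates and empirical priors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


def qda_fit(X, y):
    """Estimate one mean, one covariance and one prior per class."""
    classes = np.unique(y)
    means = [X[y == c].mean(axis=0) for c in classes]
    covs = [np.cov(X[y == c], rowvar=False) for c in classes]   # unbiased estimates
    priors = [np.mean(y == c) for c in classes]
    return classes, means, covs, priors


def qda_predict(X, classes, means, covs, priors):
    """Return the class with the highest quadratic discriminant score."""
    scores = []
    for mu, cov, prior in zip(means, covs, priors):
        diff = X - mu
        maha = np.sum((diff @ np.linalg.inv(cov)) * diff, axis=1)   # squared Mahalanobis distance
        _, logdet = np.linalg.slogdet(cov)                          # log |Sigma_k|
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))
    return classes[np.argmax(np.column_stack(scores), axis=1)]


X, y = load_iris(return_X_y=True)
y_own = qda_predict(X, *qda_fit(X, y))
y_sk = QuadraticDiscriminantAnalysis().fit(X, y).predict(X)

agreement = np.mean(y_own == y_sk)
accuracy = np.mean(y_own == y)
print(f"agreement with sklearn: {agreement:.3f}, training accuracy: {accuracy:.3f}")
```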

__Note__ that this dataset has no separate test set, so we will use cross-validation to select the best classification technique (and its best parameters). Plot the samples in 2D or 3D to build intuition before applying the classification
methods.

### Features Information

1. sepal length in cm
1. sepal width in cm
1. petal length in cm
1. petal width in cm
1. classes: Iris Setosa, Iris Versicolour, Iris Virginica

### References

[1] Fisher, R.A. “The use of multiple measurements in taxonomic problems”. Annals of Eugenics, 7,
Part II, 179-188 (1936); also in “Contributions to Mathematical Statistics” (John Wiley, NY, 1950).

[2] Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John
Wiley & Sons. ISBN 0-471-22361-1. See page 218.

[3] Dasarathy, B.V. (1980) “Nosing Around the Neighborhood: A New System Structure and
Classification Rule for Recognition in Partially Exposed Environments”. IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.

[4] Gates, G.W. (1972) “The Reduced Nearest Neighbor Rule”. IEEE Transactions on Information
Theory, May 1972, 431-433.