## Dimensionality Reduction Assignment

- Generate a data set of 900 three-dimensional data vectors that stem from two classes.
- The first 100 vectors (class 1) come from a zero-mean Gaussian distribution with covariance matrix:
    - S1 = [0.5 0 0; 0 0.5 0; 0 0 0.01]
- The remaining 800 vectors (class 2) are split into 8 groups of 100 vectors, each drawn from its own Gaussian distribution.
- All eight distributions share the covariance matrix:
    - S2 = [1 0 0; 0 1 0; 0 0 0.01], while their means are:

    ![means](images/image1.png "means")

In [21]:
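# Note: sklearn.datasets.make_blobs generates isotropic Gaussian blobs only, so it cannot
# reproduce the anisotropic covariances S1 and S2; the data below is sampled with NumPy instead.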
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

from sklearn.datasets import make_blobs
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [2]:
# S1 = [[0.5, 0, 0], [0, 0.5, 0], [0, 0, 0.01]]
# S2 = [[1, 0, 0], [0, 1, 0], [0, 0, 0.01]]

# Equivalent, more compact construction: both matrices are diagonal
S1 = np.diag([0.5, 0.5, 0.01])
S2 = np.diag([1, 1, 0.01]) 

In [3]:
# NumPy provides many functions for generating random numbers, including samples from Gaussian distributions.
# https://numpy.org/doc/stable/reference/random/index.html

rng = np.random.default_rng(42)
a = 20.0

mean_c1 = np.array([0.0, 0.0, 0.0])
means_c2 = np.array([
    [ a,    0,   0],
    [ a/2,  a/2, 0],
    [ 0,    a,   0],
    [-a/2,  a/2, 0],
    [-a,    0,   0],
    [-a/2, -a/2, 0],
    [ 0,   -a,   0],
    [ a/2, -a/2, 0]
], dtype=float)

In [7]:
# 100 samples from a zero-mean Gaussian distribution with covariance matrix S1
X1 = rng.multivariate_normal(mean_c1, S1, 100)

# 800 samples from 8 groups of 100 vectors, each group from a Gaussian distribution with covariance matrix S2 and means from means_c2
groups = []
for mean in means_c2:
    group = rng.multivariate_normal(mean, S2, 100)
    groups.append(group)
X2 = np.vstack(groups)

# Final set
X = np.vstack((X1, X2))
y = np.array([0] * 100 + [1] * 800)

print(f"X.shape: {X.shape}, y.shape: {y.shape}")

X.shape: (900, 3), y.shape: (900,)
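
The shape check confirms the sizes but not the statistics. As a rough sanity check, the sketch below regenerates the samples the same way (same seed, `a = 20.0`, and means as in the cells above) and compares sample means and covariances with the specification. The helper `max_abs_dev` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
a = 20.0  # value used in the notebook
S1 = np.diag([0.5, 0.5, 0.01])
S2 = np.diag([1.0, 1.0, 0.01])
means_c2 = np.array([
    [a, 0, 0], [a / 2, a / 2, 0], [0, a, 0], [-a / 2, a / 2, 0],
    [-a, 0, 0], [-a / 2, -a / 2, 0], [0, -a, 0], [a / 2, -a / 2, 0],
])

# Same sampling order as the notebook: class 1 first, then the 8 groups
X1 = rng.multivariate_normal(np.zeros(3), S1, 100)
groups = [rng.multivariate_normal(m, S2, 100) for m in means_c2]

def max_abs_dev(estimate, target):
    """Largest absolute entry-wise deviation between an estimate and its target."""
    return float(np.max(np.abs(estimate - target)))

cov1_dev = max_abs_dev(np.cov(X1, rowvar=False), S1)
mean_devs = [max_abs_dev(g.mean(axis=0), m) for g, m in zip(groups, means_c2)]
cov_devs = [max_abs_dev(np.cov(g, rowvar=False), S2) for g in groups]

print(f"class 1 covariance deviation:       {cov1_dev:.3f}")
print(f"largest group mean deviation:       {max(mean_devs):.3f}")
print(f"largest group covariance deviation: {max(cov_devs):.3f}")
```

With only 100 samples per group, deviations of a few tenths are to be expected; much larger values would suggest a mistake in the means or covariance matrices.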


In [11]:
# to pandas DataFrame to use plotly express
df = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
df['y'] = y

In [27]:
# a) Plot the 3-dimensional data set and view it from different angles to get a feeling of how the data is spread in the 3-dimensional space

fig = px.scatter_3d(
    df,
    x='X1', y='X2', z='X3',
    color=df['y'].astype(str),  # treat class labels as categories, not a continuous colour scale
    title='3D Distribution'
)

fig.show()
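
Task a) asks for views from different angles. Besides rotating the figure by hand, the camera position can be set programmatically. The sketch below only computes candidate camera positions from viewing angles; `camera_eye` and the `views` dictionary are hypothetical helpers, and the radius of 2.0 is an arbitrary choice.

```python
import math

def camera_eye(azimuth_deg, elevation_deg, radius=2.0):
    """Convert viewing angles (degrees) to a Plotly-style camera 'eye' position."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return dict(
        x=radius * math.cos(el) * math.cos(az),
        y=radius * math.cos(el) * math.sin(az),
        z=radius * math.sin(el),
    )

views = {
    "top": camera_eye(0, 90),      # looking down the z-axis onto the x-y plane
    "side": camera_eye(0, 0),      # looking along the x-axis, z pointing up
    "oblique": camera_eye(45, 30), # a generic three-quarter view
}

for name, eye in views.items():
    print(name, {k: round(v, 3) for k, v in eye.items()})
```

With the notebook's `fig`, a view could then be applied with `fig.update_layout(scene_camera=dict(eye=views["side"]))` before calling `fig.show()`. The side view is the one where the small variance along X3 should be most visible.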

In [None]:
# b) Perform LDA on the previous dataset. Project the data on the subspace spanned by the eigenvectors that correspond to the nonzero eigenvalues of the matrix product S_w^{-1} S_b. Comment on the results.
# With two classes, S_b has rank at most 1, so there is at most one nonzero eigenvalue and one component.
lda = LDA(n_components=1)
Z = lda.fit_transform(X, y).ravel()   # 1D projection
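
Part b) refers to the eigenvalues of S_w^{-1} S_b, which scikit-learn does not expose directly. As a rough cross-check, the sketch below builds both scatter matrices by hand with NumPy and counts the nonzero eigenvalues. The helper `scatter_matrices` is a hypothetical name, the relative tolerance `1e-8` is an arbitrary choice, and the data is regenerated the same way as in the cells above so that the block runs on its own.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return the within-class (S_w) and between-class (S_b) scatter matrices."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_w += centered.T @ centered          # spread around the class mean
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)      # spread of class means around the global mean
    return S_w, S_b

# Regenerate the data set (same seed, a = 20.0, and means as in the notebook)
rng = np.random.default_rng(42)
a = 20.0
means_c2 = np.array([
    [a, 0, 0], [a / 2, a / 2, 0], [0, a, 0], [-a / 2, a / 2, 0],
    [-a, 0, 0], [-a / 2, -a / 2, 0], [0, -a, 0], [a / 2, -a / 2, 0],
])
X1 = rng.multivariate_normal(np.zeros(3), np.diag([0.5, 0.5, 0.01]), 100)
X2 = np.vstack([rng.multivariate_normal(m, np.diag([1.0, 1.0, 0.01]), 100) for m in means_c2])
X = np.vstack((X1, X2))
y = np.array([0] * 100 + [1] * 800)

S_w, S_b = scatter_matrices(X, y)

# Eigenvalues of S_w^{-1} S_b, computed via a linear solve instead of an explicit inverse
eigvals = np.linalg.eigvals(np.linalg.solve(S_w, S_b)).real
n_nonzero = int(np.sum(eigvals > 1e-8 * eigvals.max()))

# Distance between the two class means (small values mean little to separate on)
mean_gap = float(np.linalg.norm(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)))

print("eigenvalues of S_w^-1 S_b:", np.sort(eigvals)[::-1])
print("nonzero eigenvalues:", n_nonzero)
print("distance between class means:", round(mean_gap, 4))
```

The eight group means listed above sum to zero, so the class-2 mean lies close to the origin, just like the class-1 mean. The between-class scatter is therefore expected to be small relative to the within-class scatter, which would explain a 1D projection in which the two classes overlap heavily.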


In [29]:
df_proj = pd.DataFrame({
    "LDA_1D": Z,
    "Class": np.where(y==0, "Class 1", "Class 2")
})
fig = px.histogram(
    df_proj, x="LDA_1D", color="Class",
    barmode="overlay", nbins=60,
    title="LDA - 1D Projection",
)
fig.show()
