# Mathematics and Multivariate Statistics.
---
<b>MADS-MMS Portfolio-Exam Part 1<br>
Janosch Höfer, 938969</b>

## Table of contents

- [Imports](#intro) <br>
- [1. Exercise](#ex1) <br>
- [2. Exercise](#ex2) <br>
- [3. Exercise](#ex3) <br>
- [4. Exercise](#ex4) <br>
- [References](#ref)

## Imports

In [None]:
# Standard libraries

# Installed libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_classification
from sklearn.cluster import KMeans
from sympy import symbols
from sympy.matrices import Matrix
from sympy.solvers.solveset import linsolve

# Own classes and functions
from helper_functions.plot_clusters import draw_plot

In [None]:
import seaborn as sns

In [None]:
sns.mpl_palette("jet", 6)

In [None]:
plt.cm.jet(np.linspace(0, 1, 6))

---
<a id='ex1'></a>

## 1. Exercise
### 1.1. Explain the return value of the sklearn function make_blobs. Address both structure and meaning.

The <i>make_blobs</i> function [[1]](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) creates artificial clusters. The return value is a tuple.

In [None]:
random_state = 1

In [None]:
artificial_blobs = make_blobs(n_samples=100, centers=3, n_features=2, random_state=random_state)
print(f"Return type:\t'{type(artificial_blobs)}'\nReturn #items:\t{len(artificial_blobs)}")
for idx, item in enumerate(artificial_blobs):
    print(f"Item[{idx}]:\t{type(item)}")

Depending on whether the cluster centers are to be returned, the tuple either contains two Numpy arrays...

In [None]:
artificial_blobs = make_blobs(
    n_samples=100, centers=3, n_features=2, random_state=random_state, return_centers=True
)
print(f"Return type:\t'{type(artificial_blobs)}'\nReturn #items:\t{len(artificial_blobs)}")
for idx, item in enumerate(artificial_blobs):
    print(f"Item[{idx}]:\t{type(item)}")

... or three NumPy arrays. In the latter case the additional NumPy array contains the cluster centers.<br>

The first NumPy array contains the samples. In our example we have created 100 samples with two features each. In order to create the 100 samples, first the centers are drawn uniformly from the interval (-10.0, 10.0) (the default <i>center_box</i>) using the <i>numpy.random.Generator.uniform</i> function [[2]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.uniform.html).

In [None]:
random_generator = np.random.RandomState(random_state)
random_generator.uniform(-10.0, 10.0, (3, 2))

In [None]:
artificial_blobs[2]

Next the number of samples per center is calculated, and then the samples for each center are drawn from a normal (Gaussian) distribution around that center [[3]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.normal.html).

In [None]:
artificial_blobs[0][:10]

The second Numpy array contains the labels for the samples. The labels are just the indexes of the previously generated centers.

In [None]:
artificial_blobs[1][:10]
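The structure described above can be checked directly. The following is a minimal sketch that unpacks the return value of <i>make_blobs</i> into three names (`samples`, `labels` and `centers` are chosen here for illustration) and inspects their shapes, the number of samples per center, and how close the per-label sample means are to the generating centers.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Unpack the tuple: samples, labels and (because return_centers=True) centers.
samples, labels, centers = make_blobs(
    n_samples=100, centers=3, n_features=2, random_state=1, return_centers=True
)

print(samples.shape)  # (n_samples, n_features)
print(labels.shape)   # one label per sample
print(centers.shape)  # (n_centers, n_features)

# Number of samples generated for each center.
samples_per_center = np.bincount(labels)
print(samples_per_center)

# The mean of the samples sharing a label should lie near the generating center.
center_estimates = np.array(
    [samples[labels == idx].mean(axis=0) for idx in range(len(centers))]
)
print(np.abs(center_estimates - centers).max())
```

Each label is an index into `centers`, so `centers[labels]` gives the generating center of every sample.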

### 1.2. Compare the result of make_blobs to the data typically available in a real-world clustering task. What would be the main difference?

In [None]:
draw_plot(
    artificial_blobs[0],
    hue=artificial_blobs[1],
    title="Artificial clusters generated by make_blobs",
)

The clusters generated by <i>make_blobs</i> are very clearly separated. Real-world data is usually not that well separated, making it harder to identify the different clusters. A more "realistic" dataset would look like the example below, which was taken from the scikit-learn website [[4]](https://scikit-learn.org/stable/auto_examples/datasets/plot_random_dataset.html).

In [None]:
X1, Y1 = make_classification(
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    n_classes=3,
    random_state=random_state,
)
draw_plot(
    X1,
    hue=Y1,
    title="Artificial clusters generated by make_classification",
)
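The visual impression of separation can also be put into a number. The sketch below is an illustration only: it computes the silhouette coefficient of both artificial datasets with respect to their generated labels. Using `silhouette_score` for this purpose is an assumption of this example, and real-world clustering data would normally not come with such labels at all.

```python
from sklearn.datasets import make_blobs, make_classification
from sklearn.metrics import silhouette_score

# Same parameters as in the cells above.
X_blobs, y_blobs = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)
X_cls, y_cls = make_classification(
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    n_classes=3,
    random_state=1,
)

# Silhouette coefficient of the generated grouping: values near 1 indicate
# compact, well-separated groups; values near 0 indicate overlapping groups.
score_blobs = silhouette_score(X_blobs, y_blobs)
score_cls = silhouette_score(X_cls, y_cls)
print(f"make_blobs:          {score_blobs:.3f}")
print(f"make_classification: {score_cls:.3f}")
```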

---
<a id='ex2'></a>

## 2. Exercise
Use the function make_blobs to create two datasets 𝐴 and 𝐵 with the following specifications:<br>
Each dataset should contain 500 samples. Additionally, use shuffling, a random seed of 1, and the following parameters to specify the datasets:

| **dataset** | **# features** |                          cluster_centers                         |    cluster_std   |
|:-----------:|:--------------:|:----------------------------------------------------------------:|:----------------:|
|      A      |        2       |                          [[1,2], [5,7]]                          | [[0.1,1], [2,3]] |
|      B      |        4       | [[1,1,1,0], [6,1,1,3], [1,7,2,1],<br>[1.5,2,5,5], [10,11,12,13]] |        .6        |

### 2.1. Create the dataset as described above.

In [None]:
n_samples = 500
shuffle = True

In [None]:
dataset_a, labels_a, centers_a = make_blobs(
    n_samples=n_samples,
    centers=[(1, 2), (5, 7)],
    random_state=random_state,
    shuffle=shuffle,
    cluster_std=[[0.1, 1], [2, 3]],
    return_centers=True,
)
centers_a

In [None]:
dataset_b, labels_b, centers_b = make_blobs(
    n_samples=n_samples,
    centers=[(1, 1, 1, 0), (6, 1, 1, 3), (1, 7, 2, 1), (1.5, 2, 5, 5), (10, 11, 12, 13)],
    random_state=random_state,
    shuffle=shuffle,
    cluster_std=0.6,
    return_centers=True,
)
centers_b

### 2.2. Plot the data in (one or more) suitable ways such that the structure of the blobs can be read from the visualizations. Use diagrams with the same scaling for all axes. You may use the full result of make_blobs to plot the points and highlight the generating structure.

In [None]:
draw_plot(
    dataset_a,
    hue=labels_a,
    alpha=0.75,
    labels=["Feature0", "Feature1"],
    title="Artificial clusters from dataset A",
)

In [None]:
draw_plot(
    dataset_a,
    hue=labels_a,
    plot_type="boxplot",
    figsize=(8, 4),
    grid_size=(1, 2),
    labels=["Feature", "Feature"],
    title="Class distribution per feature for dataset A",
)

In [None]:
draw_plot(
    dataset_b,
    plot_type="grid",
    hue=labels_b,
    alpha=0.80,
    figsize=(16, 8),
    grid_size=(2, 3),
    labels=["Feature", "Feature"],
    legend_loc="lower right",
    shareaxes=True,
    title="Artificial clusters from 'dataset B'",
)

In [None]:
draw_plot(
    dataset_b,
    hue=labels_b,
    plot_type="boxplot",
    figsize=(16, 4),
    grid_size=(1, 4),
    labels=["Feature", "Feature"],
    title="Class distribution per feature for dataset B",
)

### 2.3. Compute k-Means clusterings of the dataset for different choices of $k$ : 2, 3, . . . , 10. For each $k$ compute the silhouette coefficient and plot them against $k$ in a diagram. Describe and interpret the diagram.

In [None]:
k_range = range(2, 11)

In [None]:
draw_plot(
    dataset_a,
    plot_type="ksscore",
    ks=k_range,
    random_state=random_state,
    labels=["K", "Silhouette Coefficient"],
    title="Silhouette Score for different Ks using dataset A",
)

The graph above shows the silhouette score for different numbers of clusters $k$ using dataset A. The best score (0.617) was achieved for $k=2$. For $k=3$, $4$ and $5$ the scores are lower but very similar to each other. After $k=6$ the score decreases much more strongly.<br>
This suggests that the best number of clusters for this dataset is two, which we know is true for this artificial dataset. Interestingly, even the best score is only moderate, hinting that the clusters are not well separated.

In [None]:
draw_plot(
    dataset_b,
    plot_type="ksscore",
    ks=k_range,
    random_state=random_state,
    labels=["K", "Silhouette Coefficient"],
    title="Silhouette Score for different Ks using dataset B",
)

For dataset B the best score (0.771) was achieved for five clusters, which matches the number of blobs used to generate the dataset.
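The helper <i>draw_plot</i> is not shown in this notebook, so the following is only a sketch of the computation such a plot is based on, not the helper's actual implementation. It regenerates dataset B, fits <i>KMeans</i> for every $k$ from 2 to 10 and stores the silhouette coefficient per $k$. The value `n_init=10` is an assumption made here to keep the result stable across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers_b = [(1, 1, 1, 0), (6, 1, 1, 3), (1, 7, 2, 1), (1.5, 2, 5, 5), (10, 11, 12, 13)]
X, _ = make_blobs(
    n_samples=500, centers=centers_b, cluster_std=0.6, shuffle=True, random_state=1
)

# One silhouette coefficient per candidate number of clusters.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k:2d}  silhouette={score:.3f}")
print("best k:", best_k)
```

Plotting `scores.keys()` against `scores.values()` gives the kind of diagram discussed above.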

### 2.4. Choose $k$ according to your result in 2.3 and create the silhouette plot for the clustering. Describe and interpret the diagram.

In [None]:
draw_plot(
    dataset_a,
    plot_type="silhouette",
    ks=2,
    random_state=random_state,
    labels=["The silhouette coefficient values", "Cluster label"],
    title="Silhouette analysis for KMeans clustering on dataset A with n_clusters = 2",
)

In [None]:
draw_plot(
    dataset_b,
    plot_type="silhouette",
    ks=5,
    random_state=random_state,
    labels=["The silhouette coefficient values", "Cluster label"],
    title="Silhouette analysis for KMeans clustering on dataset B with n_clusters = 5",
)
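A silhouette plot shows the silhouette value of every single sample, grouped by cluster. As a hedged sketch of the numbers behind such a plot (again not the implementation of <i>draw_plot</i>), the per-sample values can be computed with `silhouette_samples`. The dataset is regenerated here so that the block is self-contained, and `n_init=10` is an assumption of this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

centers_b = [(1, 1, 1, 0), (6, 1, 1, 3), (1, 7, 2, 1), (1.5, 2, 5, 5), (10, 11, 12, 13)]
X, _ = make_blobs(
    n_samples=500, centers=centers_b, cluster_std=0.6, shuffle=True, random_state=1
)
labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X)

# One silhouette value per sample; the silhouette plot draws these as
# sorted horizontal bars, one band per cluster.
sample_values = silhouette_samples(X, labels)

# Mean silhouette value per cluster (the "thickness" of each band is its size).
cluster_means = {
    int(c): float(sample_values[labels == c].mean()) for c in np.unique(labels)
}
overall = silhouette_score(X, labels)

print(cluster_means)
print(overall)
```

The overall silhouette coefficient is the mean of the per-sample values, which is why it usually appears as a vertical reference line in the plot.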

### 2.5. For the same $k$ plot the data in (one or more) scatter plots. Visualize the clustering using colors. Additionally visualize the cluster centers.

In [None]:
kkm_2 = KMeans(n_clusters=2, random_state=random_state, init="k-means++", max_iter=300, tol=0.0001)
predicted_labels_a = kkm_2.fit_predict(dataset_a)
centers_2 = kkm_2.cluster_centers_

In [None]:
draw_plot(
    dataset_a,
    hue=predicted_labels_a,
    centers=centers_2,
    alpha=0.75,
    labels=["Feature0", "Feature1"],
    title="Predicted clusters from dataset A",
)

In [None]:
kkm_5 = KMeans(n_clusters=5, random_state=random_state, init="k-means++", max_iter=300, tol=0.0001)
predicted_labels_b = kkm_5.fit_predict(dataset_b)
centers_5 = kkm_5.cluster_centers_

In [None]:
draw_plot(
    dataset_b,
    plot_type="grid",
    hue=predicted_labels_b,
    centers=centers_5,
    alpha=0.80,
    figsize=(16, 8),
    grid_size=(2, 3),
    labels=["Feature", "Feature"],
    legend_loc="lower right",
    shareaxes=True,
    title="Predicted clusters from 'dataset B'",
)

### 2.6.  Compare the properties used to create the datasets and the generated groups with results of $k$-means. Among others, address: differences between the number of blobs and clusters, differences between blob centers and cluster centers, group (blob vs. cluster) assignments – here a short textual overall description is sufficient, no instance-wise comparison is necessary –, the dataset’s suitability for $k$-means, possible obstacles, . . . ).


Although the artificial clusters in dataset B appeared to be much harder to correctly cluster, they produced the better results.<br>
The silhouette scores in $2.3.$ also showed that, although five clusters gave the best result, the second-best result was two clusters. Using $k=2$ would have separated the small cluster displayed in the top right from the other four clusters, which would have been grouped together.
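The comparison between blobs and clusters can also be made quantitative. The sketch below regenerates dataset B, measures how far each generating blob center lies from its nearest k-means cluster center, and compares the blob labels with the cluster labels using the adjusted Rand index. Both the use of `adjusted_rand_score` and the value `n_init=10` are choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

true_centers = np.array(
    [(1, 1, 1, 0), (6, 1, 1, 3), (1, 7, 2, 1), (1.5, 2, 5, 5), (10, 11, 12, 13)],
    dtype=float,
)
X, y_true = make_blobs(
    n_samples=500, centers=true_centers, cluster_std=0.6, shuffle=True, random_state=1
)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)

# Pairwise Euclidean distances: rows are blob centers, columns cluster centers.
dist = np.linalg.norm(
    true_centers[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
nearest_cluster = dist.argmin(axis=1)  # closest cluster center per blob center
offsets = dist.min(axis=1)             # distance to that cluster center

# Agreement of blob and cluster assignments, independent of label numbering.
ari = adjusted_rand_score(y_true, kmeans.labels_)

print("nearest cluster per blob:", nearest_cluster)
print("center offsets:", np.round(offsets, 3))
print("adjusted Rand index:", round(ari, 3))
```

Because k-means numbers its clusters arbitrarily, the labels themselves cannot be compared one to one; the adjusted Rand index only looks at which samples end up in the same group.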

---
<a id='ex3'></a>

## 3. Exercise

### 3.1. Verify that the following five vectors form a base of the vector space $\mathbb{R}^5$.

$$
a = \left(\begin{aligned}
        1 \\ 4 \\ 7 \\ 6 \\ 9 \\
    \end{aligned}\right),
b = \left(\begin{aligned}
        7 \\ 8 \\ 9 \\ 2 \\ 1 \\
    \end{aligned}\right),
c = \left(\begin{aligned}
        1 \\ 5 \\ 8 \\ 9 \\ 0 \\
    \end{aligned}\right),
d = \left(\begin{aligned}
        9 \\ 9 \\ 5 \\ 3 \\ 6 \\
    \end{aligned}\right),
e = \left(\begin{aligned}
        0 \\ 0 \\ 2 \\ 3 \\ 1 \\
    \end{aligned}\right)
$$

In [None]:
a = np.array([1, 4, 7, 6, 9])
b = np.array([7, 8, 9, 2, 1])
c = np.array([1, 5, 8, 9, 0])
d = np.array([9, 9, 5, 3, 6])
e = np.array([0, 0, 2, 3, 1])

To verify whether these five vectors form a basis of the vector space $\mathbb{R}^5$, we can combine them into a matrix (one vector per column) and calculate its rank.

In [None]:
m = np.column_stack((a, b, c, d, e))
m

In [None]:
np.linalg.matrix_rank(m)

Because the matrix $m$ has full rank (5), the five vectors are linearly independent. Five linearly independent vectors in $\mathbb{R}^5$ also span the space and therefore form a basis of $\mathbb{R}^5$.
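As a cross-check of the rank argument, a square matrix has full rank exactly when its determinant is non-zero. The following sketch rebuilds the matrix from the five vectors and tests both criteria; the names `rank`, `det` and `is_basis` are introduced here for illustration only.

```python
import numpy as np

# One vector per column, as in the cell above.
m = np.column_stack(
    (
        [1, 4, 7, 6, 9],
        [7, 8, 9, 2, 1],
        [1, 5, 8, 9, 0],
        [9, 9, 5, 3, 6],
        [0, 0, 2, 3, 1],
    )
)

rank = np.linalg.matrix_rank(m)
det = np.linalg.det(m)

# n vectors form a basis of R^n iff the n x n matrix built from them has rank n.
is_basis = bool(rank == m.shape[0] == m.shape[1])

print("rank:", rank)
print("determinant:", det)
print("forms a basis:", is_basis)
```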

### 3.2. Describe the vector $f$ as a linear combination of the above five vectors.
$$
f = \left(\begin{aligned}
        1 \\ 0 \\ 0 \\ 0 \\ 0
    \end{aligned}\right)
$$

In [None]:
def lincomb_vec(vector, matrix):
    inv_matrix = np.linalg.inv(matrix)
    return np.matmul(inv_matrix, vector)

In order to calculate the coefficients $x_1, \dots, x_5$ of the vectors $a, b, c, d$ and $e$ we can solve for <br>
$$\vec{x} = \left(\begin{aligned} x_1\\ x_2 \\ x_3\\ x_4 \\ x_5\\ \end{aligned}\right)$$ <br>

using the following equation:

$$
\begin{aligned}
    m * \vec{x} &= \vec{f} \\
    m^{-1} * m * \vec{x} &= m^{-1} * \vec{f} \\
    \text{with: } &\,\, m^{-1} * m = I \\
    \vec{x} &= m^{-1} * \vec{f}
\end{aligned}
$$

With $I$ being the identity matrix.

In [None]:
f = np.array([1, 0, 0, 0, 0])
x = lincomb_vec(f, m)
x

Vector $f$ can be described as follows:

$$
\left(\begin{aligned}
        1 \\ 0 \\ 0 \\ 0 \\ 0 \\
    \end{aligned}\right) = 
-0.14163172 * \left(\begin{aligned}
        1 \\ 4 \\ 7 \\ 6 \\ 9 \\
    \end{aligned}\right) +
0.05775829 * \left(\begin{aligned}
        7 \\ 8 \\ 9 \\ 2 \\ 1 \\
    \end{aligned}\right) +
(-0.15821578) * \left(\begin{aligned}
        1 \\ 5 \\ 8 \\ 9 \\ 0 \\
    \end{aligned}\right) +
0.09950438 * \left(\begin{aligned}
        9 \\ 9 \\ 5 \\ 3 \\ 6 \\
    \end{aligned}\right) +
0.61990088 * \left(\begin{aligned}
        0 \\ 0 \\ 2 \\ 3 \\ 1 \\
    \end{aligned}\right)
$$

In [None]:
x[0] * a + x[1] * b + x[2] * c + x[3] * d + x[4] * e

The result is correct. The small deviations from exactly 1 and 0 are rounding errors caused by floating-point arithmetic.
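Instead of forming the inverse explicitly, the same system $m * \vec{x} = \vec{f}$ can be solved directly. The sketch below uses `np.linalg.solve` as an alternative to the `lincomb_vec` helper above and uses `np.allclose` to compare with a tolerance, so that floating-point rounding does not matter.

```python
import numpy as np

# Basis vectors as columns, and the target vector f.
m = np.column_stack(
    (
        [1, 4, 7, 6, 9],
        [7, 8, 9, 2, 1],
        [1, 5, 8, 9, 0],
        [9, 9, 5, 3, 6],
        [0, 0, 2, 3, 1],
    )
)
f = np.array([1, 0, 0, 0, 0], dtype=float)

# Solve m @ x = f without computing the inverse of m.
x = np.linalg.solve(m, f)

# Rebuild f from the coefficients and compare with a tolerance.
reconstructed = m @ x
print(x)
print(np.allclose(reconstructed, f))
```

Solving the system directly is generally preferred over multiplying by the inverse, because it needs fewer operations and tends to accumulate less rounding error.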

---
<a id='ex4'></a>

## 4. Exercise

### 4.1. Verify that the set containing the following five vectors is not linear independent.
$$
a = \left(\begin{aligned}
        8 \\ 4 \\ 2 \\ 1 \\ 9 \\
    \end{aligned}\right),
b = \left(\begin{aligned}
        15 \\ 19 \\ 1 \\ 9 \\ 21 \\
    \end{aligned}\right),
c = \left(\begin{aligned}
        1 \\ 4 \\ 8 \\ 5 \\ 0 \\
    \end{aligned}\right),
d = \left(\begin{aligned}
        9 \\ 9 \\ 5 \\ 3 \\ 6 \\
    \end{aligned}\right),
e = \left(\begin{aligned}
        4 \\ 2 \\ 8 \\ 2 \\ 1 \\
    \end{aligned}\right)
$$

In [None]:
a_4 = np.array([8, 4, 2, 1, 9])
b_4 = np.array([15, 19, 1, 9, 21])
c_4 = np.array([1, 4, 8, 5, 0])
d_4 = np.array([9, 9, 5, 3, 6])
e_4 = np.array([4, 2, 8, 2, 1])

In [None]:
m_4 = np.column_stack((a_4, b_4, c_4, d_4, e_4))
m_4

In [None]:
np.linalg.matrix_rank(m_4)

The matrix built from the five vectors does not have full rank, so the vectors are not linearly independent.

### 4.2.  The vectors $a$, $b$, $c$ and $d$ form the base of a vector subspace of $\mathbb{R}^5$. Verify that $f$ is not a member of that subspace.

$$
f = \left(\begin{aligned}
        1 \\ 0 \\ 0 \\ 0 \\ 0
    \end{aligned}\right)
$$

Here we can again calculate the rank of the matrix made up of $a, b, c, d$ and $f$. If it has full rank, $f$ cannot be represented as a linear combination of the other vectors and is therefore linearly independent of them.

In [None]:
m_42 = np.column_stack((a_4, b_4, c_4, d_4, f))
m_42

In [None]:
np.linalg.matrix_rank(m_42)

Because the matrix containing $a, b, c, d$ and $f$ has full rank, $f$ is not a member of the subspace spanned by $a, b, c$ and $d$.
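The same rank argument works as a general membership test: a vector lies in the span of a set of vectors exactly when appending it does not increase the rank. The helper `in_span` below is a hypothetical function written for this illustration. It confirms that $f$ is outside the subspace, while $e$ is inside it.

```python
import numpy as np

# a, b, c, d as columns: they span a four-dimensional subspace of R^5.
basis = np.column_stack(
    (
        [8, 4, 2, 1, 9],
        [15, 19, 1, 9, 21],
        [1, 4, 8, 5, 0],
        [9, 9, 5, 3, 6],
    )
)
e_vec = np.array([4, 2, 8, 2, 1])
f_vec = np.array([1, 0, 0, 0, 0])


def in_span(basis, vector):
    """Return True if `vector` is a linear combination of the columns of `basis`."""
    rank_basis = np.linalg.matrix_rank(basis)
    rank_augmented = np.linalg.matrix_rank(np.column_stack((basis, vector)))
    # Appending a vector that is already in the span leaves the rank unchanged.
    return bool(rank_augmented == rank_basis)


print("e in span:", in_span(basis, e_vec))
print("f in span:", in_span(basis, f_vec))
```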

### 4.3. Use the sympy library to compute a linear combination of $a$, $b$, $c$ and $d$ for $e$.


First we create the variables for which we want a solution.

In [None]:
a, b, c, d = symbols("a,b,c,d")

Next we create the augmented matrix that has to be solved: its first four columns are the vectors $a, b, c, d$ and its last column is $e$.

In [None]:
sym_m = Matrix([a_4, b_4, c_4, d_4, e_4]).T
sym_m

In [None]:
res = linsolve(sym_m, (a, b, c, d))
res

In [None]:
a, b, c, d = list(res)[0]

Lastly we verify the results.

In [None]:
(a * Matrix([a_4]) + b * Matrix([b_4]) + c * Matrix([c_4]) + d * Matrix([d_4])).T
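Because sympy works with exact rational numbers, the coefficients can be obtained without any rounding. The following self-contained sketch passes the system to `linsolve` in the form `(A, b)` instead of as an augmented matrix. The symbol names `alpha`, `beta`, `gamma` and `delta` are chosen here for illustration so that they do not shadow the vector names.

```python
from sympy import Matrix, Rational, linsolve, symbols

# Columns are the vectors a, b, c and d.
basis = Matrix(
    [
        [8, 15, 1, 9],
        [4, 19, 4, 9],
        [2, 1, 8, 5],
        [1, 9, 5, 3],
        [9, 21, 0, 6],
    ]
)
e_vec = Matrix([4, 2, 8, 2, 1])

alpha, beta, gamma, delta = symbols("alpha beta gamma delta")

# Solve basis * (alpha, beta, gamma, delta)^T = e exactly.
solution_set = linsolve((basis, e_vec), alpha, beta, gamma, delta)
coefficients = tuple(next(iter(solution_set)))
print(coefficients)

# Verify: the linear combination reproduces e exactly.
reconstructed = basis * Matrix(list(coefficients))
print(reconstructed == e_vec)
```

In other words, $e = \tfrac{2}{3} a - \tfrac{1}{3} b + \tfrac{2}{3} c + \tfrac{1}{3} d$.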

---
<a id='ref'></a>

## References

<p> [1] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
<p> [2] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.uniform.html
<p> [3] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.normal.html
<p> [4] https://scikit-learn.org/stable/auto_examples/datasets/plot_random_dataset.html