# 1. Unsupervised learning - Clustering

<img src="https://lh3.googleusercontent.com/DS4BHTkXT_9FzxuOd67PNjJT-o87kdtvP42wq_JUzQz8oWhzOOxWKu0CAkTSzBzyLKrYNWAF8dAY6FUSgjLJFBBrMjHz_cdk9-i0QhAOnIdo8Nq3192BdGxlEUwRRpCzkp_iBiIK" width="400"/>

<img src="https://miro.medium.com/max/694/1*5RDVF1xW0LfXjoxZp6jI1Q.png" />

## K-means clustering

**K-means** is one of the most popular **clustering** algorithms. It stores `K` centroids that define the clusters: a point belongs to a cluster if it is closer to that cluster's centroid than to any other centroid. K-means searches for good centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) recomputing each centroid (the mean of the points in a cluster) based on the current assignment of data points to clusters. The result is a local optimum that depends on the starting centroids.

<img src="https://media.geeksforgeeks.org/wp-content/uploads/merge3cluster.jpg" width="300"/>

**The K-means algorithm**
1.  Select `K` random starting cluster centroids
2.  Compute the distance between each observation and each centroid
3.  Assign each observation to its nearest centroid, then recompute each centroid as the mean of its assigned observations
4.  Repeat steps 2 and 3 until the labels stay constant and no observation needs to be reassigned
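
The steps above can be sketched in plain NumPy. This is a minimal illustration, not how `sklearn.cluster.KMeans` is implemented: the function name `kmeans`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch (hypothetical helper, for illustration only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1. Pick k distinct observations as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # 2. Distance from every observation to every centroid, shape (n_samples, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # 3a. Assign each observation to its nearest centroid
        new_labels = distances.argmin(axis=1)
        # 4. Stop once the labels no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3b. Recompute each centroid as the mean of its assigned observations
        #     (an empty cluster keeps its previous centroid)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two well-separated 2D blobs (arbitrary illustrative values)
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0, 0.5, size=(50, 2)),
    data_rng.normal(5, 0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Because the starting centroids are random, different seeds can converge to different solutions on harder datasets, which is why library implementations usually run several initializations.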


Voronoi diagram: 
- http://paperjs.org/examples/voronoi/
- http://www.raymondhill.net/voronoi/rhill-voronoi.html




## Other Clustering algorithms

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_0011.png" />

### Example: Bakery dataset in sklearn

- https://github.com/boyander/datamad-1019/blob/master/w7-d4-unsupervised-pipeline/w7-d4-unsupervised-learning.ipynb

## Unsupervised learning cluster metrics
- **Silhouette score**: Measures how dense and how well separated the clusters are. For each sample it contrasts the mean distance to elements in the same cluster (`a`) with the mean distance to elements in the nearest other cluster (`b`).
  - This score favors convex clusters; for many non-convex datasets it will give an artificially low score
  - Per sample it is `(b - a) / max(a, b)`, ranging from -1 to 1 (higher is better); the reported score is the mean over all samples
  
<img src="https://image3.slideserve.com/6607494/limitations-of-k-means-non-convex-shapes-l.jpg" width="300"/>

**Note:** A convex polygon is a simple polygon (not self-intersecting) in which no line segment between two points on the boundary ever goes outside the polygon. Equivalently, it is a simple polygon whose interior is a convex set. In a convex polygon, all interior angles are less than or equal to 180 degrees, while in a strictly convex polygon all interior angles are strictly less than 180 degrees.
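
To see the convexity bias of the silhouette score, here is a small sketch that scores the ground-truth labels of a convex dataset and a non-convex one. The datasets (`make_blobs`, `make_moons`), the blob centers, and the noise level are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import silhouette_score

# Two convex, well-separated blobs (centers are arbitrary illustrative values)
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)
# Two interleaving half-moons: non-convex clusters
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

# Score the true labels in both cases
blobs_score = silhouette_score(X_blobs, y_blobs)
moons_score = silhouette_score(X_moons, y_moons)
print(f"blobs: {blobs_score:.3f}  moons: {moons_score:.3f}")
```

Both labelings are "correct", yet the moons should score noticeably lower, since points at the tips of one moon sit close to the other moon.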
  
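- **Distortion**: Sum of squared distances of samples to their closest cluster center. In scikit-learn this is the `inertia_` attribute: `KMeans(3).fit(iris.data).inertia_`

**Note:** Smaller `distortion` means denser clusters. It also tends to shrink as `K` grows, so it is only comparable between models with the same number of clusters.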
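
As a quick check of the definition, the following sketch recomputes the distortion by hand and compares it with `inertia_`. The choice of `n_init=10` and `random_state=0` is an assumption made only to keep the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Squared distance from each sample to the centroid of its assigned cluster
assigned_centroids = km.cluster_centers_[km.labels_]
manual_distortion = np.sum((iris.data - assigned_centroids) ** 2)

print(km.inertia_, manual_distortion)
```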

- **Calinski-Harabasz**: The score is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion; higher values indicate denser, better separated clusters. `sklearn.metrics.calinski_harabasz_score`
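
A sketch of the Calinski-Harabasz computation, assuming the usual normalization `(between / (k - 1)) / (within / (n - k))` for `n` samples and `k` clusters, compared against the scikit-learn function. The variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import calinski_harabasz_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

n_samples, k = len(X), len(np.unique(labels))
overall_mean = X.mean(axis=0)

between = 0.0  # between-cluster dispersion
within = 0.0   # within-cluster dispersion
for j in np.unique(labels):
    members = X[labels == j]
    center = members.mean(axis=0)
    # Cluster centers far from the overall mean increase the score
    between += len(members) * np.sum((center - overall_mean) ** 2)
    # Samples far from their own cluster center decrease the score
    within += np.sum((members - center) ** 2)

manual_score = (between / (k - 1)) / (within / (n_samples - k))
sklearn_score = calinski_harabasz_score(X, labels)
print(manual_score, sklearn_score)
```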


- **t-SNE plots**: Cluster visualization. Like PCA, t-SNE embeds N dimensions into 2D space, but it uses a nonlinear mapping that tries to keep neighboring points close together.
    - https://scikit-learn.org/stable/auto_examples/manifold/plot_manifold_sphere.html#sphx-glr-auto-examples-manifold-plot-manifold-sphere-py
    - https://www.youtube.com/watch?v=NEaUSP4YerM
    
<img src="https://i.stack.imgur.com/OxEW5.png" width="400"/>
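
A minimal t-SNE sketch with scikit-learn, assuming the iris dataset and a perplexity of 30 as illustrative choices. It only computes the 2D embedding; plotting is left out.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data  # 150 samples, 4 features

# Embed the 4-dimensional samples into 2 dimensions
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (150, 2)
```

The two columns of `embedding` can then be passed to a scatter plot and colored by cluster label. Distances between far-apart groups in a t-SNE plot should be read with care, since the method focuses on local neighborhoods.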



- **Elbow method**: Helps select the optimal number of clusters by fitting the model with a range of values for `K` and picking the point where the metric curve bends (the "elbow")

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

from yellowbrick.cluster import KElbowVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(
    model, k=(4,12), metric='calinski_harabasz', timings=False
)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
```
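
For intuition about what such a visualizer computes, here is a hedged sketch of the same idea using only scikit-learn, with distortion (`inertia_`) as the metric. The range of `K` values and the `n_init` setting are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same kind of synthetic dataset: 8 blobs in 12 dimensions
X, _ = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

k_values = range(4, 12)
distortions = []
for k in k_values:
    # Fit one model per candidate K and record its distortion
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    distortions.append(km.inertia_)

for k, d in zip(k_values, distortions):
    print(k, round(d, 1))
```

Plotting `distortions` against `k_values` gives the elbow curve: the value of `K` after which the distortion stops dropping sharply is the usual candidate.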

### Example: Bakery dataset in Apache Spark

- https://github.com/boyander/datamad-1019/blob/master/w7-d5-spark-intro/spark-intro.ipynb

# 2. Generative Models - Unsupervised learning

- **Generative GMM:** https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html
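
A Gaussian mixture model is generative in the sense that, once fitted, it can produce new samples. A minimal sketch with `sklearn.mixture.GaussianMixture`, assuming a synthetic 2D dataset with three blobs:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2D data with three groups (illustrative values)
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Fit a mixture of three Gaussians to the data
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

# Draw 100 new points from the fitted distribution;
# `components` holds the index of the Gaussian each point came from
X_new, components = gmm.sample(100)
print(X_new.shape, components.shape)  # (100, 2) (100,)
```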

Articles:
- https://openai.com/blog/generative-models/
- https://towardsdatascience.com/generative-deep-learning-lets-seek-how-ai-extending-not-replacing-creative-process-fded15b0561b
- https://www.youtube.com/watch?v=G5JT16flZwM&feature=emb_logo
- https://pathmind.com/wiki/generative-adversarial-network-gan

Articles on autoencoders:
- https://medium.com/intuitive-deep-learning/autoencoders-neural-networks-for-unsupervised-learning-83af5f092f0b
- https://towardsdatascience.com/pca-vs-autoencoders-1ba08362f450