<font size=4 color='blue'>
    
# <center> Class 6, October 27, 2021 </center>

<font size=4 color='blue'>
    
# <center> Machines that use Unsupervised Learning </center>

<font size=5 color='blue'>
Extracting information from a dataset using clustering

<font size=4 color='black'>

[Article about Kmeans](./Literature/Kmeans_article.pdf)

<font size=5 color='blue'>

Kmeans Algorithm


<font size=4 color='black'>

The K-means algorithm is an iterative algorithm that partitions the dataset into K predefined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one cluster.

It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible.

It assigns data points to clusters so that the sum of squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points in that cluster) is minimized.

The less variation there is within a cluster, the more homogeneous (similar) its data points are.

<font size=4 color='black'>

The K-means algorithm works as follows:

    Specify the number of clusters K.
    
    Initialize the centroids by shuffling the dataset and then randomly selecting K 
    data points as centroids, without replacement.
    
    Repeat the following steps until the centroids stop changing, i.e. until the assignment of data points to clusters no longer changes:

        Compute the squared distance between each data point and every centroid.
        Assign each data point to the closest centroid (cluster).
        Recompute each centroid as the average of all the data points assigned to its cluster.
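
The loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal K-means loop (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid
        labels = dist2.argmin(axis=1)
        # Recompute each centroid as the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups (made-up data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids_toy, labels_toy = kmeans_sketch(X_toy, k=2)
print(labels_toy)
print(centroids_toy)
```

On data this well separated, the first three points should end up in one cluster and the last three in the other, whichever points are drawn as initial centroids.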

<font size=4 color='black'>

K-means minimizes the following objective function:

    
$$ J = \sum_{i=1}^m \sum_{k=1}^K w_{ik} \, ||x_i - \mu_k||^2 $$
    
where $w_{ik} = 1$ if data point $x_i$ belongs to cluster $k$ and $w_{ik} = 0$ otherwise, $\mu_k$ is the centroid of cluster $k$, $m$ is the number of data points, and $K$ is the number of clusters.
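
As a small worked example, the objective can be evaluated directly for a hand-made assignment. The helper name `kmeans_objective` and the toy values are assumptions for illustration; a fitted scikit-learn `KMeans` model exposes the same quantity as its `inertia_` attribute.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    # centroids[labels] picks, for every point, the centroid of its cluster,
    # which plays the role of the 0/1 membership weights in the formula
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

# Made-up one-dimensional example embedded in 2D
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [14.0, 0.0]])
labels_toy = np.array([0, 0, 1, 1])
centroids_toy = np.array([[1.0, 0.0], [12.0, 0.0]])

J = kmeans_objective(X_toy, labels_toy, centroids_toy)
print(J)  # 1 + 1 + 4 + 4 = 10.0
```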


<font size=5 color='blue'>

Examples of problems with information that can be separated into clusters

<font size=4 color='blue'>
    
[Machine learning for data-driven discovery in solid Earth geoscience](https://science.sciencemag.org/content/363/6433/eaau0323)

<img src="./images/Picture1.png" width=420 height=420 align = "center" >

<font size=4 color='blue'>

# <center> Geyser’s Eruptions </center>

<font size=4 color='black'>
The features (variables $\textbf X$) that characterize the geyser's eruptions are the waiting time between eruptions ($X_1$) and the duration of the eruption ($X_2$) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

In [None]:
from IPython.display import HTML

HTML("""
<video width="520" height="340" controls>
  <source src="yell-InDepth-Geysers2_640x360.mp4" type="video/mp4">
</video>
""")


In [None]:
! pip install scikit-learn

In [None]:
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.image import imread
import pandas as pd
import seaborn as sns
import sklearn 

In [None]:
matplotlib.__version__

In [None]:
sklearn.__version__

In [None]:
#from sklearn.datasets.samples_generator import (make_blobs,
#                                                make_circles,
#                                                make_moons)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score

%matplotlib inline

<font size=4 color='blue'>

[sklearn_paper](./Literature/Scikit-learn_2011.pdf)

<font size=4 color='black'>

The scikit-learn implementation of KMeans is in the file _kmeans.py.

<font size=5 color='blue'>

Reading Geyser’s Eruptions data set

<font size=4 color='blue'>
    
[Geyser’s Eruptions data set](https://www.kaggle.com/janithwanni/old-faithful)

In [None]:
# Import the data
df = pd.read_csv('old_faithful.csv')

In [None]:
# Plot the data
plt.figure(figsize=(6, 6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1])
plt.xlabel('Eruption time in minutes')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of raw data');

In [None]:
print(type(df))
print(df.shape)

In [None]:
df.head()

<font size=5 color='blue'>
Standardizing the data

In [None]:
# Standardize the data
X_std = StandardScaler().fit_transform(df)
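
To make explicit what the standardization step does, the sketch below compares `StandardScaler` with the manual computation (subtract the column mean, divide by the column standard deviation). The values in `X_demo` are made up for illustration and are not taken from the data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two columns on different scales
X_demo = np.array([[1.8, 54.0], [3.3, 74.0], [4.5, 85.0], [2.0, 51.0]])

X_scaled = StandardScaler().fit_transform(X_demo)

# Same result by hand: subtract the column mean, divide by the column std
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances used by K-means.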

In [None]:
print(type(X_std))
print(X_std.shape)

In [None]:
print(X_std[0:5])

In [None]:
# Plot the data
plt.figure(figsize=(6, 6))
plt.scatter(X_std[:, 0], X_std[:, 1])
plt.xlabel('Eruption time')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of standardized data');

<font size=5 color='blue'>
Generating the architecture of the learning model

In [None]:
# Configure scikit-learn's KMeans: 2 clusters, random initialization, best of 10 runs
model = KMeans(n_clusters=2, max_iter=100, init='random',n_init=10)

<font size=5 color='blue'>
Building a Machine that learns to find clusters using this model

In [None]:
# Fit the model: find the centroids and assign each point to a cluster
model.fit(X_std)

<font size=5 color='blue'>
Extracting information of the recognized clusters

In [None]:
# Coordinates of the cluster centroids
centroids = model.cluster_centers_

# Cluster label assigned to each data point
labels = model.labels_

<font size=5 color='blue'>
Plotting the clusters

In [None]:
# Plot the clustered data
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_std[labels == 0, 0], X_std[labels == 0, 1],
            c='green', label='cluster 1')
plt.scatter(X_std[labels == 1, 0], X_std[labels == 1, 1],
            c='blue', label='cluster 2')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=300,
            c='r', label='centroid')
plt.legend()
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.xlabel('Eruption time')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of clustered data', fontweight='bold')
ax.set_aspect('equal');
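
The notebook imports `silhouette_score` but does not use it. One possible use, sketched here, is to check whether K = 2 is a reasonable choice by comparing the mean silhouette for several values of K. Because the geyser file is not available in this sketch, it uses two synthetic blobs from `make_blobs` as a stand-in; the blob centers and the range of K are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a two-group data set (not the geyser data)
X_blobs, _ = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                        cluster_std=1.0, random_state=0)
X_blobs = StandardScaler().fit_transform(X_blobs)

scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    # Mean silhouette: close to 1 means compact, well-separated clusters
    scores[k] = silhouette_score(X_blobs, km.labels_)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print('best K by silhouette:', best_k)
```

The same loop could be run on `X_std` to see whether the silhouette also favors two clusters for the geyser data.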

<img src="./images/Picture1.png" width=420 height=420 align = "center" >

<font size=4 color='blue'>
    
# <center> Iris flowers Clustering </center>

<img src="iris.jpg">

<font size=5 color='blue'>
Features (attributes) that define Iris flowers:

$X_0$ = sepal length in cm,
$X_1$ = sepal width in cm,
$X_2$ = petal length in cm,
$X_3$ = petal width in cm,
$X_4$ = class: Iris Setosa, Iris Versicolour, Iris Virginica

<font size=4 color='black'>
To visualize the clusters, we will use only three of the variables associated with the iris flowers: petal width (X[:,3]), sepal length (X[:,0]), and petal length (X[:,2]).

In [None]:
import numpy as np
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets

In [None]:
np.__version__

<font size=5 color='blue'>

Reading Iris data set

<font size=4 color='black'>
    
[Iris Data set](https://archive.ics.uci.edu/ml/datasets/iris)

In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

# y is only used to generate the picture of the Ground Truth

In [None]:
print(X.shape)
print(y.shape)

In [None]:
print(X[0:5])

<font size=5 color='blue'>
Generating the architecture for 3 different learning models: different numbers of clusters and different centroid initializations

In [None]:
np.random.seed(5)

models = [('k_means_iris_8', KMeans(n_clusters=8, n_init=10)),
          ('k_means_iris_3', KMeans(n_clusters=3, n_init=10)),
          ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                           init='random'))]
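
To see why a single random initialization can be a "bad" one, the sketch below compares the final objective value (`inertia_`) of several single-initialization runs with that of a best-of-ten run. The seeds and variable names are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X_iris = datasets.load_iris().data

# One random initialization per run: the result depends on the seed
single_runs = [
    KMeans(n_clusters=3, init='random', n_init=1,
           random_state=s).fit(X_iris).inertia_
    for s in range(5)
]

# Ten k-means++ initializations, keeping the run with the lowest inertia
best_of_ten = KMeans(n_clusters=3, init='k-means++', n_init=10,
                     random_state=0).fit(X_iris).inertia_

print([round(v, 2) for v in single_runs])
print(round(best_of_ten, 2))
```

A lower inertia means tighter clusters. Running several initializations and keeping the best one reduces the chance of stopping in a poor local minimum.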

<font size=5 color='blue'>
Machines that learn to find clusters using the three different models

In [None]:
fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']

for name, model in models:
    fig = plt.figure(fignum, figsize=(6, 4))
    # Create the 3D axes through the figure (works across matplotlib versions)
    ax = fig.add_subplot(projection='3d', elev=48, azim=134)
    model.fit(X)
    labels = model.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2],
               c=labels.astype(np.float64), edgecolor='k')

    # Hide the tick labels (w_xaxis and friends are deprecated aliases of xaxis, yaxis, zaxis)
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(titles[fignum - 1])
    ax.dist = 12
    fignum = fignum + 1

<font size=5 color='blue'>
Ground Truth 
<font size=4 color='black'>
   
That is, comparing the machine learning result with the true, known class of each flower.

In [None]:
# Plot the ground truth (the true class of each flower)

fig = plt.figure(fignum, figsize=(8, 6))
# Create the 3D axes through the figure (works across matplotlib versions)
ax = fig.add_subplot(projection='3d', elev=48, azim=134)

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean(),
              X[y == label, 2].mean() + 2, name,
              horizontalalignment='center',
              bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))

# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(np.float64)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')

# Hide the tick labels (w_xaxis and friends are deprecated aliases of xaxis, yaxis, zaxis)
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width', size=16)
ax.set_ylabel('Sepal length', size=16)
ax.set_zlabel('Petal length', size=16)
ax.set_title('Ground Truth', size=20)
ax.dist = 11
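
Besides comparing the plots by eye, the agreement between the clusters and the ground truth can be summarized with a number. The sketch below uses the adjusted Rand index from scikit-learn, which is a choice made for this example and not something the notebook prescribes. It reloads the data so that it does not depend on the reordered `y` above.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris_demo = datasets.load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_demo.data)

# ARI is 1.0 for identical partitions and close to 0.0 for random ones;
# it does not depend on how the cluster ids are numbered
ari = adjusted_rand_score(iris_demo.target, km.labels_)
print(round(ari, 3))
```

Because the index ignores the numbering of the clusters, no relabeling step such as `np.choose` is needed before computing it.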


<img src="./images/Picture1.png" width=420 height=420 align = "center" >

<font size=4 color='blue'>

# <center> Digits clustering </center>

<font size=4>
The goal is to separate images of handwritten digits into groups, one per digit; for example, the digits of a postal zip code.
$$ $$    
The features for this problem are the 64 pixel intensities of each 8×8 image, stored in the variable $\textbf X$. The ten digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) are the classes that the clusters are expected to recover.

In [None]:
import numpy as np
from PIL import Image
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt


<font size=5 color='blue'>

Reading digits data set

<font size=4 color='black'>

[digits data set](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)

In [None]:
digits = load_digits()

data = digits.data


In [None]:
print(data.shape)

In [None]:
sample_num = 10

print(data[sample_num,:])

In [None]:
image_0=data[sample_num]

In [None]:
image=image_0.reshape(8,8)

In [None]:
print(image.shape)

In [None]:
plt.imshow(image, cmap=plt.cm.gray)
plt.axis('off')

In [None]:
# Invert the image colors (pixel values in this data set range from 0 to 16)
data = 16.0 - data

In [None]:
plt.imshow(data[sample_num,:].reshape(8,8), cmap=plt.cm.gray)
plt.axis('off')

<font size=5 color='blue'>
Generating the architecture of the learning model

In [None]:
np.random.seed(1)
model = KMeans(n_clusters=10, init='random', n_init=10)

<font size=5 color='blue'>
Machine that uses this model to find the clusters

In [None]:
model.fit(data)

<font size=5 color='blue'>

Predicting the closest cluster to which each sample belongs in the data

In [None]:
my_cluster = model.predict(data)
print(my_cluster.shape)
print(my_cluster)

<font size=5 color='blue'>

Showing the predicted digits clusters

In [None]:
for i in range(0,10):  
    
    row = np.where(my_cluster==i)[0]  # indices of the samples in cluster i
    num = row.shape[0]                # number of samples in cluster i
    
    r = num // 10                     # number of full rows of 10 images
    print("cluster " + str(i))
    print(str(num) + " elements")

    plt.figure(figsize = (10,10))

    # Showing the digits in each cluster
    for k in range(0, num):
        
        plt.subplot(r+1, 10, k+1)
               
        imagen = data[row[k], ]
        
        imagen = imagen.reshape(8, 8)
        
        plt.imshow(imagen, cmap=plt.cm.gray)
        plt.axis('off')

    plt.show()
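
The cluster ids returned by K-means are arbitrary, so cluster 3 does not have to contain the digit 3. One way to quantify how clean the clusters are, sketched below, is to name each cluster after its most common true digit and measure the fraction of samples that match. This uses `digits.target`, which the clustering itself never sees; the variable names and the `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits_demo = load_digits()
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits_demo.data)

# Name each cluster after the most common true digit among its members
cluster_to_digit = {
    c: int(np.bincount(digits_demo.target[km.labels_ == c]).argmax())
    for c in range(10)
}

# Fraction of samples whose cluster name matches their true digit
predicted = np.array([cluster_to_digit[c] for c in km.labels_])
purity = float((predicted == digits_demo.target).mean())

print(cluster_to_digit)
print(round(purity, 3))
```

Two clusters can map to the same digit, in which case some other digit has no cluster of its own; this is a sign that K-means is splitting or merging classes.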