# CLUSTERING

We will now discuss clustering.  Luckily, this is a foundational topic in machine learning, and the internet is full of great tools and resources.  Some of the ones we like are listed below.

The idea of clustering is to convert mathematical features of images or objects (or, in general, any numerical data) into a form that a computer can use to classify the object.

We already started this process by extracting features from the objects in an image (Weeks 1-5).  Feature extraction turns each object into quantifiable numbers that a computer can use to sort objects.  By "clustering" these values, the computer can say that any object whose features fall within the range of a given type of object is that type of object.
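As a rough illustration of that last idea, the sketch below assigns a new object to whichever cluster center is nearest in feature space. The feature values, the cluster centers, and the names `cluster_centers` and `assign_to_cluster` are made up for illustration; they are not taken from the Week 1-5 images.

```python
import numpy as np

# Hypothetical cluster centers in a 2-feature space (number of sides, aspect ratio).
# These values are invented for illustration only.
cluster_centers = np.array([
    [3.0, 0.9],   # cluster 0: triangle-like objects
    [4.0, 1.0],   # cluster 1: square-like objects
    [4.0, 0.5],   # cluster 2: elongated rectangles
])

def assign_to_cluster(features, centers):
    """Return the index of the center closest (Euclidean distance) to `features`."""
    distances = np.linalg.norm(centers - features, axis=1)
    return int(np.argmin(distances))

new_object = np.array([4.0, 0.55])  # four sides, roughly twice as long as wide
print(assign_to_cluster(new_object, cluster_centers))
```

Here the new object lands in cluster 2 because its features sit closest to that center. A clustering algorithm such as k-means is what finds the centers in the first place.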

Videos:

https://www.youtube.com/watch?v=EItlUEPCIzM

https://www.youtube.com/watch?v=H_L7V_BH9pc

Websites:

https://realpython.com/k-means-clustering-python/

https://scikit-learn.org/stable/modules/clustering.html

https://machinelearningmastery.com/clustering-algorithms-with-python/

https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

Google Colab:

https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.11-K-Means.ipynb

https://colab.research.google.com/github/SANTOSHMAHER/Machine-Learning-Algorithams/blob/master/K_Means_algorithm_using_Python_from_scratch_.ipynb

Need some practice with NumPy:

https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/numpy_ultraquick_tutorial.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=mlcc-prework&hl=en

Need some practice with Pandas:

https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=mlcc-prework&hl=en

Here is a complete crash course on machine learning (via Google Developers):

https://developers.google.com/machine-learning/crash-course




Please watch the videos and review the materials above.  In this section, we will focus on one type of clustering strategy (described above) called k-means, but there are many ways to cluster your data.  
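Before using the library version, it may help to see the two steps that k-means repeats: assign every point to its nearest center, then move each center to the mean of its points. The sketch below is a minimal NumPy version under simple assumptions (random data points as starting centers, a fixed iteration cap). The function name `kmeans_sketch` and the synthetic data are hypothetical, and this is not how scikit-learn implements it internally.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment and update steps until centers stop moving."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic groups (made-up data).
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])
centers, labels = kmeans_sketch(points, n_clusters=2)
print(np.round(centers, 1))
```

The printed centers should sit near the two group means. The order of the clusters is arbitrary, which is also true of the labels returned by scikit-learn.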

Below we will:

1) Walk through k-means using this site as a template:

https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

2) Walk through 3D k-means using this site as a template:

https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html#sphx-glr-auto-examples-cluster-plot-cluster-iris-py

3) Apply clustering to the shapes picture we covered in Week 5


In [None]:
from google.colab import drive
import matplotlib.pyplot as plt
from IPython import display
import cv2
import numpy as np
from google.colab.patches import cv2_imshow
import argparse
import seaborn as sns; sns.set()  # for plot styling
import imutils  #this is open source image tools (utilities) for image processing, https://anaconda.org/conda-forge/imutils, made by the creator of PyimageSearch.com
drive.mount('/content/drive')

In [None]:
from sklearn.datasets import make_blobs  #this function generates a data set with a specified number of clusters (centers), number of points (n_samples), and spread (cluster_std)
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);  #this plots the data, what is s?

#play around with the make_blobs arguments, then return to the original values (n_samples=300, centers=4, cluster_std=0.60, random_state=0)

#the x and y axes here are just arbitrary numbers; in our case, they will represent numeric values (features) that we extracted from the pixels of our image


In [None]:
from sklearn.cluster import KMeans  #this is the k-means clustering class
kmeans = KMeans(n_clusters=4) #this creates an object of class KMeans called kmeans and sets the number of clusters to look for (n_clusters)
kmeans.fit(X) #this calculates the centroids of the n_clusters clusters.  X is our array of data, not the x-axis
y_kmeans = kmeans.predict(X) #predict the closest cluster each sample in X belongs to

#look at the variable explorer to the right, {x}: what is returned in y_kmeans?  It is an array with one entry per row of X; each entry is the number of the cluster that row was assigned to.


In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')  #plots data X

centers = kmeans.cluster_centers_  #this gives the coordinates of the n_clusters defined in the object kmeans of class KMeans
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);  #plots the center of the clusters

In [None]:
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
from ipywidgets import interact
from sklearn.metrics import pairwise_distances_argmin

#this is more advanced code that shows how the algorithm works.  Here you can change the number of clusters and step through the algorithm to see how the clusters evolve along the way

def plot_kmeans_interactive(min_clusters=1, max_clusters=6):    
    X, y = make_blobs(n_samples=300, centers=4,
                      random_state=0, cluster_std=0.60)
        
    def plot_points(X, labels, n_clusters):
        plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis',
                    vmin=0, vmax=n_clusters - 1);
            
    def plot_centers(centers):
        plt.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c=np.arange(centers.shape[0]),
                    s=200, cmap='viridis')
        plt.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c='red', s=50)
            

    def _kmeans_step(frame=0, n_clusters=4):
        rng = np.random.RandomState(2)
        labels = np.zeros(X.shape[0])
        centers = rng.randn(n_clusters, 2)

        nsteps = frame // 3

        for i in range(nsteps + 1):
            old_centers = centers
            if i < nsteps or frame % 3 > 0:
                labels = pairwise_distances_argmin(X, centers)

            if i < nsteps or frame % 3 > 1:
                centers = np.array([X[labels == j].mean(0)
                                    for j in range(n_clusters)])
                nans = np.isnan(centers)
                centers[nans] = old_centers[nans]

        # plot the data and cluster centers
        plot_points(X, labels, n_clusters)
        plot_centers(old_centers)

        # plot new centers if third frame
        if frame % 3 == 2:
            for i in range(n_clusters):
                plt.annotate('', centers[i], old_centers[i], 
                             arrowprops=dict(arrowstyle='->', linewidth=1))
            plot_centers(centers)

        plt.xlim(-4, 4)
        plt.ylim(-2, 10)

        if frame % 3 == 1:
            plt.text(3.8, 9.5, "1. Reassign points to nearest centroid",
                     ha='right', va='top', size=14)
        elif frame % 3 == 2:
            plt.text(3.8, 9.5, "2. Update centroids to cluster means",
                     ha='right', va='top', size=14)
    
    return interact(_kmeans_step, frame=list(range(0,50,10)),
                    n_clusters=list(range(min_clusters, max_clusters+1)))

plot_kmeans_interactive();

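#here the frame number tracks progress through the algorithm: frame // 3 is the number of completed iterations, and frame % 3 selects which step (reassign or update) is shown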

In [None]:
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 42,
}

# A list holds the SSE values for each k
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)
plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse, '-o')
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()

#the plot above shows the "elbow method".  The break in the curve (the elbow) at 4 suggests the number of clusters to use.  More quantitatively, it marks the point where adding clusters no longer improves
#the residuals (SSE values) by much
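To put a number on the elbow, one option is to print how much the SSE falls each time a cluster is added. The sketch below regenerates the same blob data so that it runs on its own; the names `X_demo` and `sse_demo` are introduced here only for illustration, and where exactly to call the drop "small" remains a judgment call.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# SSE (inertia_) for k = 1..10, using the same settings as the elbow plot.
sse_demo = [
    KMeans(n_clusters=k, init="random", n_init=10, max_iter=300, random_state=42)
    .fit(X_demo)
    .inertia_
    for k in range(1, 11)
]

# Percentage of the previous SSE removed by adding one more cluster.
for k in range(2, 11):
    drop = 100 * (sse_demo[k - 2] - sse_demo[k - 1]) / sse_demo[k - 2]
    print(f"k={k - 1} -> k={k}: SSE drops by {drop:.1f}%")
```

Large drops followed by consistently small ones mark the elbow.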


The above example is just one way to determine how many clusters to use.  There are others, and ultimately the right choice depends on the nature of the data.
Often clustering is done on 2D scatter plots to help visualize the result, but it may be done in any number of dimensions (2 or more).
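One such alternative is the silhouette score, which is higher when points sit close to their own cluster and far from the other clusters. The sketch below regenerates the same four-blob data so that it runs on its own; the names `X_demo`, `scores`, and `best_k` are introduced here only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# The silhouette score needs at least 2 clusters, so start at k = 2.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")

best_k = max(scores, key=scores.get)  # k with the highest score
print("candidate number of clusters:", best_k)
```

The k with the highest score is a candidate, not a guarantee; it is worth comparing it against the elbow plot and against what you know about the data.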

Below we use an example of clustering in 3D from the scikit-learn website:

https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html#sphx-glr-auto-examples-cluster-plot-cluster-iris-py

It uses real data from the Iris data set.  Each flower has four measured features, three of which are plotted here.  The example shows how the number of clusters and a poor initial guess affect the result.  Poor initial guesses matter more in data sets with more than 2 features, where "local minima" become more likely as the number of variables grows.
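The effect of the initial guess can also be seen in numbers, without a plot. The sketch below runs k-means on the Iris data with a single random start for several seeds and compares the final SSE with a run that keeps the best of ten starts. The seed range and the names `X_iris`, `single_start`, and `multi_start` are arbitrary choices made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data

# One random start per run: the final SSE (inertia_) can differ between seeds
# when a run gets stuck in a local minimum.
single_start = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X_iris).inertia_
    for seed in range(20)
]

# Ten random starts in one run: scikit-learn keeps the result with the lowest SSE.
multi_start = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X_iris).inertia_

print("single-start SSE range:", round(min(single_start), 2), "to", round(max(single_start), 2))
print("best-of-ten SSE:", round(multi_start, 2))
```

If the single-start range is wide, some of those runs converged to a poorer solution, which is the situation the "bad initialization" panel is meant to show.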

In [None]:
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

# Though the following import is not directly being used, it is required
# for 3D projection to work with matplotlib < 3.2
import mpl_toolkits.mplot3d  # noqa: F401

from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = [
    ("k_means_iris_8", KMeans(n_clusters=8)),
    ("k_means_iris_3", KMeans(n_clusters=3)),
    ("k_means_iris_bad_init", KMeans(n_clusters=3, n_init=1, init="random")),
]

fignum = 1
titles = ["8 clusters", "3 clusters", "3 clusters, bad initialization"]
for name, est in estimators:
    fig = plt.figure(fignum, figsize=(8, 6))
    ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)
    ax.set_position([0, 0, 0.95, 1])
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float), edgecolor="k")

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel("Petal width")
    ax.set_ylabel("Sepal length")
    ax.set_zlabel("Petal length")
    ax.set_title(titles[fignum - 1])
    ax.dist = 12
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)
ax.set_position([0, 0, 0.95, 1])

for name, label in [("Setosa", 0), ("Versicolour", 1), ("Virginica", 2)]:
    ax.text3D(
        X[y == label, 3].mean(),
        X[y == label, 0].mean(),
        X[y == label, 2].mean() + 2,
        name,
        horizontalalignment="center",
        bbox=dict(alpha=0.2, edgecolor="w", facecolor="w"),
    )
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor="k")

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel("Petal width")
ax.set_ylabel("Sepal length")
ax.set_zlabel("Petal length")
ax.set_title("Ground Truth")
ax.dist = 12

plt.show()

In [None]:
Shapes = r'/content/drive/MyDrive/SCIP_DATA/Images/shapes.png' 
colorIM = cv2.imread(Shapes) #read the color image of the different shapes
cv2_imshow(colorIM)

In [None]:
# Our image is too large for the algorithm to work well.  The goal of this section is to 1) resize the image and 2) perform detection
IM = colorIM  # our working image
resizedIM = imutils.resize(IM, width=300)  #using the open source imutils tool to resize an image
ratio = IM.shape[0] / float(resizedIM.shape[0])  #we record the scale factor so we can map results back to the original size later - why does this work?

# convert the resized image to grayscale, blur it, and then perform threshold 
grayIM = cv2.cvtColor(resizedIM, cv2.COLOR_BGR2GRAY)
blurredIM = cv2.medianBlur(grayIM,5)
ret,threshIM = cv2.threshold(blurredIM, 60, 255, cv2.THRESH_BINARY)
cv2_imshow(threshIM)

# find contours in the thresholded image
contourList, hierarchy = cv2.findContours(threshIM.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
#contourIM = cv2.findContours(threshIM.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
#contourIM = imutils.grab_contours(contourIM)


In this section we first compute the center of the contour, then use the above detect_shape function to determine the shape of the object based on the contour of that object. The goal of this section is to highlight the challenges of clustering using an image that is clear to a human observer.

In [None]:
IM_2 = IM.copy() #make a copy of the image
area = []
sides = []
aspect_ratio = []
for contour in contourList:
    area.append(cv2.contourArea(contour))  #area enclosed by the contour, in pixels
    peri = cv2.arcLength(contour, True)  #perimeter of the closed contour
    x, y, w, h = cv2.boundingRect(contour)  #upright bounding box of the contour
    sides.append(min(len(cv2.approxPolyDP(contour, 0.04 * peri, True)), 7))  #caps the vertex count at 7, so 7 = circle
    aspect_ratio.append(min(float(w) / h, h / float(w), 3))  #short side over long side, so the value is the same whether the shape is wider or taller
area=np.array(area)
sides=np.array(sides)
aspect_ratio = np.array(aspect_ratio)
data_features = np.column_stack((area, sides, aspect_ratio))


In [None]:
#let's see how sides and aspect_ratio correlate on an xy plot
#let's see how clustering into 5 groups works
X2 = np.column_stack((data_features[:, 1], data_features[:, 2]))
kmeans2 = KMeans(n_clusters=5) 
kmeans2.fit(X2) 
y_kmeans2 = kmeans2.predict(X2)
centers2 = kmeans2.cluster_centers_  #this gives the coordinates of the cluster centers found by the object kmeans2 of class KMeans
plt.scatter(centers2[:, 0], centers2[:, 1], c='red', s=200, alpha=0.5);
plt.scatter(X2[:, 0], X2[:, 1], c=y_kmeans2, s=50, cmap='viridis')
#this works okay but likely confuses one rectangle with a square.  Which one?
#would this work with area versus sides?

 

In [None]:
#would this work with Area versus sides?

X3 = np.column_stack((data_features[:, 1], data_features[:, 0]))
kmeans3 = KMeans(n_clusters=5) 
kmeans3.fit(X3) 
y_kmeans3 = kmeans3.predict(X3)
centers3 = kmeans3.cluster_centers_  #this gives the coordinates of the cluster centers found by the object kmeans3 of class KMeans
plt.scatter(centers3[:, 0], centers3[:, 1], c='red', s=200, alpha=0.5);
plt.scatter(X3[:, 0], X3[:, 1], c=y_kmeans3, s=50, cmap='viridis')

#very poorly clustered.  
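One likely reason for the poor result is scale: area is measured in pixels and can run into the thousands, while the number of sides only ranges from 3 to 7, so the distance between two points is dominated by area. A common remedy is to standardize each feature before clustering. The sketch below uses made-up (sides, area) values, not the features extracted from shapes.png, and the names `features`, `true_groups`, `raw_labels`, and `scaled_labels` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Made-up (sides, area) pairs for four groups: small/large triangles and squares.
features = np.array([
    [3, 400.0], [3, 420.0], [3, 440.0],      # small triangles
    [4, 410.0], [4, 430.0], [4, 450.0],      # small squares
    [3, 1600.0], [3, 1620.0], [3, 1640.0],   # large triangles
    [4, 1610.0], [4, 1630.0], [4, 1650.0],   # large squares
])
true_groups = np.repeat([0, 1, 2, 3], 3)

# Without scaling, distances are dominated by area.
raw_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# With scaling, each feature has zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)
scaled_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

# Adjusted Rand score: 1.0 means the clusters match the true groups exactly.
print("agreement without scaling:", adjusted_rand_score(true_groups, raw_labels))
print("agreement with scaling:   ", adjusted_rand_score(true_groups, scaled_labels))
```

In this toy data the scaled version should recover all four groups, while the unscaled version mixes triangles with squares of similar area. Scaling puts the features on an equal footing; it does not decide which features are worth using.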


This completes this section on clustering.  It is a first step on a long journey of learning how we can use machines to solve problems.


In [None]:
%reset -f