## Section A: Unsupervised Machine Learning

### Types of Machine Learning

▪ Machine learning can be broadly divided into two types: supervised and unsupervised learning.

![](sml_usml.png)

<img src="super.png" width="500">

https://blog.bismart.com/en/classification-vs.-clustering-a-practical-explanation

### Unsupervised Machine Learning

▪ Unsupervised algorithms make inferences from datasets using only **input vectors without referring to known, or labelled, outcomes**.

▪ The algorithm must work out the structure of the data on its own; the objective is to discover interesting patterns in the data.

▪ For instance, are there any subgroups or **"clusters"** among the data instances?

<img src="unsuper.png" width="500">

### Clustering

▪ Clustering, the task of grouping similar data points together, is often considered the most important unsupervised learning problem.

▪ A **centroid** is the imaginary or real location representing the center of the cluster.

<img src="centroid.png" width="600">
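As a minimal sketch of the idea, the centroid of a cluster can be computed as the coordinate-wise mean of its points (this is the definition K-means uses). The `points` array below is hypothetical and not part of the iris data.

```python
import numpy as np

# Hypothetical 2-D points that belong to one cluster
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# The centroid is the mean of each coordinate across all points
centroid = points.mean(axis=0)
print(centroid)  # [3. 4.]
```

Note that the centroid does not have to coincide with any actual data point, which is why it is described as an "imaginary or real" location.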

### Example

The following 3 general clusters of consumers can let us understand our customers better:

▪ Older customers spent very little

▪ Middle-aged/older customers spent a lot

▪ Middle-aged/younger customers spent a medium amount

<img src="clustering.png" width="350">

https://blog.dataiku.com/clustering-how-it-works-in-plain-english

## Section B: Exploratory Data Analysis

### Understanding the Iris Dataset

▪ **Iris Dataset**: The data set contains 3 classes with 50 instances each, and 150 instances in total, where each class refers to a type of iris plant.

▪ **Class**: Iris Setosa, Iris Versicolour, Iris Virginica

▪ **Data Format**: (sepal length, sepal width, petal length, petal width)

https://www.analyticsvidhya.com/blog/2021/06/analyzing-decision-tree-and-k-means-clustering-using-iris-dataset/

<img src="iris_flowers.png" width="700">

<img src="iris.png" width="600">

### Loading the Dataset into Dataframe

In [None]:
import pandas as pd

iris = pd.read_csv('iris_data2.csv')

### Viewing Sample Data

In [None]:
iris.head()

In [None]:
iris.tail()

<span style = "color:red">
    
**Exercise \#1: Write python code that prints the first 5 records for species \'Iris-versicolor\'.** 

</span>

### Identifying Number of Features and Samples

In [None]:
iris.shape

### Printing Basic Information about the Dataframe

In [None]:
iris.info()

### Describing Data with Descriptive Statistics

In [None]:
iris.describe()

### Identifying Missing or Null Values

In [None]:
iris.isnull().sum()

### Identifying Duplicated Records

In [None]:
iris.duplicated().sum()

In [None]:
iris_dup = iris[iris.duplicated(keep='first')]
iris_dup

In [None]:
iris_dup = iris[iris.duplicated(keep='last')]
iris_dup

In [None]:
iris_dup = iris[iris.duplicated(keep=False)]
iris_dup
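The three `keep` options differ only in which copies of a repeated row are flagged. The sketch below shows this on a hypothetical three-row DataFrame (`toy` is an illustrative name, not part of the notebook data).

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({"a": [1, 2, 1], "b": ["x", "y", "x"]})

first = toy.duplicated(keep="first").tolist()  # flags later repeats only
last = toy.duplicated(keep="last").tolist()    # flags earlier repeats only
both = toy.duplicated(keep=False).tolist()     # flags every copy

print(first)  # [False, False, True]
print(last)   # [True, False, False]
print(both)   # [True, False, True]

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates()
print(len(deduped))  # 2
```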

### Identifying Unique Values for Each Column

In [None]:
iris.nunique()

In [None]:
iris['species'].unique()

In [None]:
iris.groupby('species').size() 

### Visualizing Correlation Between Variables Via Scatter Plot

▪ A scatter plot is a graph in which the values of two variables are plotted along two axes.

▪ It is the most basic type of plot that helps you visualize the relationship between two variables.

In [None]:
import matplotlib.pyplot as plt 

# Plot two different variables against each other
iris.plot(kind = "scatter", x = "sepal_length", y = "petal_width")
plt.show()

<span style = "color:red">
    
**Exercise \#2: Write python code that shows correlation between \'petal_length\' and \'petal_width\' via scatter plot.** 

</span>

<span style = "color:red">
    
**Exercise \#3: Write python code that shows correlation between \'petal_length\' and \'sepal_length\' via scatter plot.** 

</span>

<span style = "color:red">
    
**Exercise \#4: Write python code that shows correlation between \'sepal_width\' and \'sepal_length\' via scatter plot.** 

</span>

### Finding Correlation Among Variables

In [None]:
# Pearson correlation between two different columns
corr = iris['sepal_length'].corr(iris['petal_width'])
print("Correlation:", round(corr, 2))

<span style = "color:red">
    
**Exercise \#5: Write python code that calculates correlation value between \'petal_length\' and \'petal_width\'.** 

</span>

<span style = "color:red">
    
**Exercise \#6: Write python code that calculates correlation value between \'petal_length\' and \'sepal_length\'.** 

</span>

<span style = "color:red">
    
**Exercise \#7: Write python code that calculates correlation value between \'sepal_width\' and \'sepal_length\'.** 

</span>

### Dropping Label for Correlation Calculation

In [None]:
# Remove 'species' column from data
iris_2 = iris.drop(['species'], axis = 1)
iris_2

### Constructing A Correlation Heatmap

In [None]:
import seaborn as sns

corr = iris_2.corr()
sns.heatmap(corr, cmap='coolwarm', annot=True)

### Dropping Highly Correlated Features

In [None]:
# Remove 'petal_length' column from data
iris_3 = iris.drop(['petal_length'], axis = 1)
iris_3.info()
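Which column to drop can also be read off the correlation matrix programmatically. The sketch below is one possible approach on hypothetical toy data; the 0.9 threshold and the rule of dropping the later column of each highly correlated pair are assumptions, not something prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is an exact multiple of 'a', 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

threshold = 0.9  # assumed cut-off for "highly correlated"
corr_abs = toy.corr().abs()

# Keep only the upper triangle so that each pair is considered once
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(mask)

# A column is a drop candidate if it is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)  # ['b']
```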

### Visualizing Multivariate Data with Andrews curves 

▪ We can use andrews_curves() to visualize high-dimensional (multivariate) data as Andrews curves.

▪ Each row of the DataFrame is drawn as a single curve, colored by its class.

In [None]:
from pandas.plotting import andrews_curves

plt.figure(figsize = (15, 8)) 
andrews_curves(iris, "species")
plt.show()
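For intuition, an Andrews curve turns one row $(x_1, x_2, x_3, x_4, \dots)$ into the function $f(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + \dots$ for $t$ between $-\pi$ and $\pi$. The helper below is a hand-written sketch of that formula (`andrews_curve` is a hypothetical name, not a pandas function), evaluated on one illustrative row of measurements.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation at the angles t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature contributes a constant offset
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    # Remaining features alternate between sine and cosine terms
    for i, x in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += x * np.sin(harmonic * t)
        else:
            result += x * np.cos(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 5)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
print(curve.shape)  # (5,)
```

Rows with similar feature values produce similar curves, which is why curves of the same species tend to bundle together in the plot.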

## Section C: Clustering with K-Means

▪ K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

▪ K-Means will be used to find groups in the flower data.

▪ The term "K" is a number that indicates how many clusters we need to create. E.g., K = 2 refers to two clusters.

<img src="centroid.png" width="600">

▪ The term "means" refers to averaging of the data; that is, finding the centroid.

https://www.simplilearn.com/tutorials/machine-learning-tutorial/k-means-clustering-algorithm
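To make the "K" and "means" parts concrete, here is a minimal NumPy sketch of the standard K-means iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is an illustration only, not the scikit-learn implementation used later; `kmeans_sketch`, the fixed iteration count and the toy data are all assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Very small K-means loop for illustration purposes."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Hypothetical toy data with two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_labels)
print(toy_centroids)
```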

### Avoiding memory leak when dealing with KMeans

▪ A memory leak is the incorrect management of memory allocations by a computer program where the unneeded memory isn't released. 

In [None]:
import os

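# On Windows with MKL, scikit-learn may warn about a KMeans memory leak.
# If it does, uncomment the line below (ideally before importing scikit-learn).
# os.environ["OMP_NUM_THREADS"] = "1"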

### Splitting Data into Feature and Label

In [None]:
# Inputs into model
X = iris.iloc[:, 0:4]

# The label can be used to evaluate the model
y = iris.species

In [None]:
X.head()

In [None]:
y.head()

In [None]:
print(len(X))
print(len(y))

### Building the K-means model: Instructions

▪ Step 1: Import KMeans from **sklearn.cluster**.

▪ Step 2: Use KMeans() to create a KMeans instance called **km** to find the 3 clusters. 

▪ Step 3: To specify the number of clusters, use the **n_clusters** keyword argument.

▪ Step 4: Use the **.fit()** method of the model to fit it to the array of data points.

▪ Step 5: Use the **.predict()** method of the model to predict the cluster labels of new data points, assigning the result to labels.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
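Steps 1 to 5 can be sketched end to end on a small synthetic dataset. The arrays `points` and `new_points` and the name `km_demo` are hypothetical, and `random_state` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training points forming two obvious groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Hypothetical unseen points, one near each group
new_points = np.array([[1.1, 1.0], [8.1, 8.0]])

# Steps 2 and 3: create the model with the desired number of clusters
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0)

# Step 4: fit the model to the training points
km_demo.fit(points)

# Step 5: assign each new point to the nearest learned centroid
new_labels = km_demo.predict(new_points)
print(new_labels)
```

The cluster numbers themselves are arbitrary; what matters is that each new point receives the same label as the training points it sits next to.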

### Fitting the Model with Data

https://towardsdatascience.com/explain-ml-in-a-simple-way-k-means-clustering-e925d019743b

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters
# n_init indicates the number of times K-Means would run with different sets of starting points.
km = KMeans(n_clusters = 3, n_init = 10)

### Accessing the Clusters of Data via labels_

In [None]:
# Fit the model to the input training instances
km.fit(X)

labels = km.labels_

# Print the cluster label assigned to each training sample
print(labels)

In [None]:
# Use fit_predict to fit model and obtain cluster labels: labels
km_labels = km.fit_predict(X)

print(km_labels)

### cluster_centers_

▪ We use the **cluster_centers_** attribute to see the cluster centers (also called centroids). 

In [None]:
# Identify the center points of the data
centers = km.cluster_centers_
print(centers)

### Scatter plot

In [None]:
# s = 100 represents the marker size 
# c = 'black' represents the marker colors
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 100, c = 'black')

In [None]:
# Create a DataFrame with km_labels and species as columns: df
df = pd.DataFrame({'km_labels': km_labels, 'species': y})
df

### Correspondence with iris species

### crosstab()

▪ Use the **pd.crosstab()** function on df['km_labels'] and df['species'] to count the number of times each iris species coincides with each cluster label. 

In [None]:
# Create crosstab: ct
ct = pd.crosstab(df['km_labels'], y)
#ct = pd.crosstab(df['km_labels'], df['species'])
ct

In [None]:
# Create crosstab: ct
ct = pd.crosstab(df['km_labels'], df['species'], margins = True)
ct
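A crosstab like this can also be summarized in code. The sketch below, on hypothetical toy labels, reads off the majority species of each cluster with `idxmax` and computes the share of samples that fall into their cluster's majority species (often called purity). The toy data and variable names are assumptions.

```python
import pandas as pd

# Hypothetical cluster labels and true species for nine samples
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "species": ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "virginica",
                "virginica", "virginica", "versicolor"],
})

ct_toy = pd.crosstab(toy["cluster"], toy["species"])

# Majority species per cluster (row-wise column with the largest count)
majority = ct_toy.idxmax(axis=1)
print(majority)

# Fraction of samples that belong to the majority species of their cluster
purity = ct_toy.max(axis=1).sum() / len(toy)
print(round(purity, 3))
```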

### Different Types of Validation Techniques

▪ Two kinds of techniques are used to validate clustering results:

\>>> **External validation**: This type of result validation can be carried out if true cluster labels are available.

\>>> **Internal validation**: Most of the methods of internal validation combine cohesion and separation to estimate the validation score.

**[More Info: Clustering Metrics](https://scikit-learn.org/stable/modules/classes.html)**
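As one illustration of internal validation, the silhouette score from scikit-learn combines cohesion (how close a sample is to its own cluster) and separation (how far it is from the nearest other cluster) without needing true labels. The sketch below uses hypothetical toy points; the silhouette score is offered as an example of the idea and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical points forming two tight, well-separated groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

demo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Values close to 1 indicate compact, well-separated clusters
score = silhouette_score(points, demo_labels)
print(round(score, 3))
```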

### External Validation: V-measure

▪ **homogeneity**: each cluster contains only members of a single class.

▪ **completeness**: all members of a given class are assigned to the same cluster.

▪ **V-measure score** can be interpreted as the harmonic mean of the other two measures: homogeneity and completeness.

https://www.kaggle.com/code/sashr07/unsupervised-learning-tutorial/notebook

### Example

In [None]:
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred_1 = [0, 0, 1, 1, 2, 2]
labels_pred_2 = [0, 0, 0, 1, 2, 2]

In [None]:
homogeneity = metrics.homogeneity_score(labels_true, labels_pred_1)  
print(homogeneity)

completeness = metrics.completeness_score(labels_true, labels_pred_1) 
print(completeness)

from sklearn.metrics.cluster import v_measure_score
v_measure = v_measure_score(labels_true, labels_pred_1)  
print(v_measure)

In [None]:
# Output order: Homogeneity, completeness and V-measure
print('labels_pred_1:', metrics.homogeneity_completeness_v_measure(labels_true, labels_pred_1))

In [None]:
# The following clustering is perfectly homogeneous but not complete
print('labels_pred_2:', metrics.homogeneity_completeness_v_measure(labels_true, labels_pred_2))
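With its default settings, the V-measure is the harmonic mean of homogeneity and completeness. The sketch below checks this by hand on small hypothetical label lists (`labels_true_demo` and `labels_pred_demo` are illustrative names).

```python
from sklearn import metrics

# Hypothetical true classes and predicted clusters
labels_true_demo = [0, 0, 0, 1, 1, 1]
labels_pred_demo = [0, 0, 1, 1, 2, 2]

h = metrics.homogeneity_score(labels_true_demo, labels_pred_demo)
c = metrics.completeness_score(labels_true_demo, labels_pred_demo)

# Harmonic mean of homogeneity and completeness
v_manual = 2 * h * c / (h + c)

v_sklearn = metrics.v_measure_score(labels_true_demo, labels_pred_demo)
print(v_manual, v_sklearn)
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a clustering only scores well when it is both homogeneous and complete.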

### Back to Iris-Dataset Clustering

<span style = "color:red">
    
**Exercise \#8: Write python code that calculates homogeneity of clustering.** 

</span>

<span style = "color:red">
    
**Exercise \#9: Write python code that calculates completeness of clustering.** 

</span>

<span style = "color:red">
    
**Exercise \#10: Write python code that calculates v-measure score of clustering.** 

</span>

<span style = "color:red">
    
**Exercise \#11: Write python code that calculates homogeneity, completeness and v-measure score of clustering at once.** 

</span>

### Measuring The Quality of Clustering

▪ A good clustering has tight clusters, where all samples in each cluster are bunched together.

### Inertia

▪ After fitting a model with fit(), an attribute called **inertia_** is available.

▪ Inertia is the sum of squared distances from each point to the centroid of its assigned cluster.

▪ The smaller the inertia value, the tighter (more coherent) the clusters.

https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/

In [None]:
km = KMeans(n_clusters = 3, n_init = 10)
km.fit(X)

print(km.inertia_)
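To see what this number measures, inertia can be recomputed by hand from the fitted labels and cluster centers. The sketch below does so on hypothetical toy points and compares the result with `inertia_`; `points` and `km_demo` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points forming two groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = np.sum((points - assigned_centers) ** 2)

print(manual_inertia, km_demo.inertia_)
```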

### How to find the optimal K?

▪ A good model is one with low inertia AND a low number of clusters (K). 

▪ However, this is a tradeoff because as K increases, inertia decreases.

▪ To find the optimal K for a dataset, use the **Elbow method** to find the point where the decrease in inertia begins to slow. 

<img src="elbow.png" width="500">

https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/

### Instructions

▪ For each of the given values of k, perform the following steps:

\>>> Create a KMeans instance called model with k clusters.

\>>> Fit the model to the iris feature data X.

\>>> Append the value of the inertia_ attribute of model to the list inertias.

https://www.kaggle.com/sashr07/unsupervised-learning-tutorial

In [None]:
ks = range(1, 7)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters = k, n_init = 10)
    
    # Fit model to samples
    model.fit(X)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

## Detecting Outliers

In [None]:
iris.describe()

### Creating a Boxplot for Numerical Column

In [None]:
plt.figure(figsize = (15, 10))    

plt.subplot(2,2,1)    
sns.boxplot(x = 'species', y = 'sepal_length', data=iris)   

plt.subplot(2,2,2)    
sns.boxplot(x = 'species', y = 'sepal_width', data=iris)   

plt.subplot(2,2,3)    
sns.boxplot(x = 'species', y = 'petal_length', data=iris)  

plt.subplot(2,2,4)    
sns.boxplot(x = 'species', y = 'petal_width', data=iris) 

### Creating a Seaborn Boxplot without Outliers

In [None]:
plt.figure(figsize = (15, 10))    

plt.subplot(2,2,1)    
sns.boxplot(x = 'species', y = 'sepal_length', data=iris, showfliers=False)   

plt.subplot(2,2,2)    
sns.boxplot(x = 'species', y = 'sepal_width', data=iris, showfliers=False)   

plt.subplot(2,2,3)    
sns.boxplot(x = 'species', y = 'petal_length', data=iris, showfliers=False)  

plt.subplot(2,2,4)    
sns.boxplot(x = 'species', y = 'petal_width', data=iris, showfliers=False) 

### Identifying Outliers (Sepal Length)

In [None]:
sns.boxplot(x = 'species', y = 'sepal_length', data=iris)   

In [None]:
# Boolean mask selecting one species
virginica = iris['species'] == 'Iris-virginica'
quartile_1 = iris.loc[virginica, 'sepal_length'].quantile(0.25)
quartile_3 = iris.loc[virginica, 'sepal_length'].quantile(0.75)

# iqr means interquartile range
iqr = quartile_3 - quartile_1
lowest = quartile_1 - 1.5 * iqr
highest = quartile_3 + 1.5 * iqr

# Rows below the lower fence or above the upper fence are outliers
iris.loc[virginica & ((iris['sepal_length'] < lowest) | (iris['sepal_length'] > highest))]
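The same 1.5 × IQR rule can be wrapped in a small helper so it is easy to reuse. This is a sketch on a hypothetical Series of measurements; `iqr_outliers` and `toy_values` are illustrative names, not part of pandas or of the iris data.

```python
import pandas as pd

def iqr_outliers(values):
    """Return the entries of a numeric Series outside the 1.5 * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lowest = q1 - 1.5 * iqr
    highest = q3 + 1.5 * iqr
    return values[(values < lowest) | (values > highest)]

# Hypothetical measurements with one unusually large value
toy_values = pd.Series([5.8, 6.0, 6.1, 6.3, 6.4, 6.5, 6.7, 9.9])
outliers = iqr_outliers(toy_values)
print(outliers.tolist())  # [9.9]
```

These fences are the same ones a boxplot uses to decide which points to draw individually beyond the whiskers.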

<span style = "color:red">
    
**Exercise \#12: Write python code that detects outliers based on \'sepal_width\'.** 

</span>

<span style = "color:red">
    
**Exercise \#13: Write python code that detects outliers based on \'petal_length\'.** 

</span>

<span style = "color:red">
    
**Exercise \#14: Write python code that detects outliers based on \'petal_width\'.** 

</span>

## Section D: Exercises 

### \#1: Clustering with Mean Shift

### Import the library

### Find out the number of estimated clusters by Mean Shift

### Fit Mean Shift model and generate the ct

### Calculate the score using v_measure_score()

__Example output:__ 0.6994

### \#2: Clustering with Gaussian mixture models (GMM)

### Import the library

### Fit GMM model

### Generate the ct

### Calculate the score using *v_measure_score()*

__Example output:__ 0.8997

### \#3: Clustering with Agglomerative Hierarchical Clustering

### Import the library

### Fit the model

### Generate the ct

__Example output:__ 0.7701