# Lab 06 - Unsupervised Clustering with KNN and the Iris Dataset

Start by copying this lab notebook into your notebook folder, and run it step by step from there.

In this notebook, we explore unsupervised clustering with the K-Nearest Neighbors (KNN) algorithm on the well-known Iris dataset.

## Table of Contents
1. Introduction to Machine Learning
    * Supervised vs. Unsupervised Learning
2. Clustering, KNN, and Distance Metrics
3. Working with the Iris Dataset
4. Varying `k` and Number of Classes
5. The Curse of Dimensionality
6. Summary
7. References

## Imports and Data Loading

First, let's import the necessary libraries and load the Iris dataset.

In [None]:
import pandas as pd
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['target_name'] = df['target'].apply(lambda x: iris.target_names[x])
df.head()


## 1. Introduction to Machine Learning

Machine Learning (ML) is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" from data, without being explicitly programmed.

### Supervised vs. Unsupervised Learning
- **Supervised Learning**: The algorithm is trained on a labeled dataset, which means that each training example is paired with an output label. The goal is to map the input data to the output labels.
- **Unsupervised Learning**: The algorithm is given data without explicit instructions on what to do with it. The goal is to explore the data and find some intrinsic patterns within.
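
To make the distinction concrete, here is a minimal sketch on the Iris data. The supervised model is fitted with both features and labels, while the unsupervised model is fitted with features only. The choice of `KNeighborsClassifier`, `n_neighbors=5`, and `n_clusters=3` is an assumption made for illustration and is not part of the lab's own cells.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the model is given the features X and the labels y.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
supervised_pred = clf.predict(X)

# Unsupervised: the model is given X only and produces its own group ids.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
unsupervised_labels = km.fit_predict(X)

print(supervised_pred[:5])      # species ids learned from y
print(unsupervised_labels[:5])  # arbitrary cluster ids, not species ids
```

The cluster ids from KMeans are arbitrary: cluster 0 does not have to correspond to species 0.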

## 2. Clustering, KNN, and Distance Metrics

### Clustering
Clustering is an unsupervised learning technique that involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).

### K-Nearest Neighbors (KNN)
KNN is traditionally a supervised learning algorithm: it predicts a point's label from the labels of its `k` nearest neighbors. In this lab we adapt the same neighbor search to group the data.
The core idea is to assign each point the most common label among its `k` nearest neighbors. Because this step reads labels from the neighbors, it works here only because the Iris samples are already labeled.

### Distance Metrics
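The distance metric determines which points count as "nearest", so it directly influences the grouping. Common metrics include:
- Euclidean Distance: the straight-line distance between two points
- Manhattan Distance: the sum of the absolute differences along each feature
- Minkowski Distance: a generalization with an order parameter `p`, where `p = 1` gives Manhattan and `p = 2` gives Euclidean
- Hamming Distance: the proportion of positions at which two vectors differ, typically used for categorical or binary features

For the Iris dataset, whose four features are all continuous measurements in centimeters, Euclidean distance is a natural default.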
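
The metrics listed above can be computed by hand in a few lines of NumPy. This is a sketch on made-up vectors: `a`, `b`, `x`, `z`, and the helper `minkowski` are hypothetical names introduced here, not part of the lab.

```python
import numpy as np

# Two illustrative four-feature vectors (values are made up for the example)
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.2, 2.9, 4.3, 1.3])

def minkowski(u, v, p):
    """Minkowski distance of order p between two numeric vectors."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

manhattan = minkowski(a, b, p=1)  # p = 1: sum of absolute differences
euclidean = minkowski(a, b, p=2)  # p = 2: straight-line distance

# Hamming distance: proportion of positions that differ (binary example)
x = np.array([1, 0, 1, 1])
z = np.array([1, 1, 0, 1])
hamming = float(np.mean(x != z))

print(f"Manhattan: {manhattan:.4f}")
print(f"Euclidean: {euclidean:.4f}")
print(f"Hamming:   {hamming:.2f}")
```

In scikit-learn, `NearestNeighbors` accepts a `metric` argument (for example `metric='manhattan'`) and a `p` argument for the Minkowski order, so the same comparison can be made without hand-written distance code.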

## 3. Working with the Iris Dataset

### Visualizing Iris Data
Let's create a scatter matrix (pair plot) to visualize every pair of features, colored by species.


In [None]:
# Pairplot to visualize the dataset
fig = px.scatter_matrix(df, dimensions=iris.feature_names, color='target_name')
fig.show()


### Applying KNN Clustering
Now, let's use the K-Nearest Neighbors algorithm to perform clustering on the Iris dataset.


In [None]:
# Initialize the KNN model
k = 3
knn = NearestNeighbors(n_neighbors=k)

# Fitting the model
knn.fit(df[iris.feature_names])
distances, indices = knn.kneighbors(df[iris.feature_names])

def assign_clusters(indices, targets):
    """
    Assign each point the most common label among its k nearest neighbors.

    This relies on the neighbors already being labeled, as is the case with the Iris dataset.
    If your data is unlabeled, use a true clustering method such as KMeans to assign cluster ids.

    Parameters:
    indices (ndarray): A 2D array where each row contains the indices of the k nearest neighbors for a given data point.
    targets (Series or ndarray): Non-negative integer labels of the data points.

    Returns:
    list: The label assigned to each data point by majority vote among its k nearest neighbors.
    """
    labels = np.asarray(targets)  # Index by position, regardless of the Series index
    clusters = []  # One assigned label per data point
    for row in indices:
        neighbors_labels = labels[row]  # Labels of this point's nearest neighbors
        most_common = np.bincount(neighbors_labels).argmax()  # Most frequent label; ties go to the smallest label
        clusters.append(int(most_common))
    return clusters

df['cluster'] = assign_clusters(indices, df['target'])

# Plotting the clusters (using first two features)
fig = px.scatter(df, x='sepal length (cm)', y='sepal width (cm)', color='cluster', 
                 title='KNN Clustering with k=3 (Sepal Length vs Sepal Width)')
fig.show()
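
One detail of the cell above is worth checking: when `kneighbors` is queried with the same points the model was fitted on, each point's first neighbor is at distance 0, usually the point itself. The sketch below compares that with calling `kneighbors()` without arguments, which scikit-learn documents as returning the neighbors of each indexed point without counting the point as its own neighbor. The helper `majority_vote` and the variable names are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X, y = load_iris(return_X_y=True)
knn = NearestNeighbors(n_neighbors=3).fit(X)

# Querying the fitted points: the first neighbor of every row is at distance 0.
dist_self, idx_self = knn.kneighbors(X)

# No query argument: a point is never listed among its own neighbors.
dist_loo, idx_loo = knn.kneighbors()

def majority_vote(indices, labels):
    """Most common label among each row of neighbor indices."""
    return np.array([np.bincount(labels[row]).argmax() for row in indices])

agree_self = float(np.mean(majority_vote(idx_self, y) == y))
agree_loo = float(np.mean(majority_vote(idx_loo, y) == y))

print(f"Agreement with true labels, self included: {agree_self:.3f}")
print(f"Agreement with true labels, self excluded: {agree_loo:.3f}")
```

Excluding the point itself gives a fairer picture of how well the neighborhood alone predicts a point's label.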


## 4. Varying `k` and Number of Classes

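Let's explore how varying the `k` value affects clustering. Because each point is queried against the same data the model was fitted on, it counts as its own nearest neighbor. With `k = 1`, the vote therefore reduces to the point's own label (or the label of an exact duplicate).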

### Experiment with Different `k` Values


In [None]:
k_values = [1, 3, 5]
figs = []

for k in k_values:
    knn = NearestNeighbors(n_neighbors=k)
    knn.fit(df[iris.feature_names])
    distances, indices = knn.kneighbors(df[iris.feature_names])
    df['cluster'] = assign_clusters(indices, df['target'])
  
    fig = px.scatter(df, x='sepal length (cm)', y='sepal width (cm)', color='cluster', 
                     title=f"Clustering with k={k} (Sepal Length vs Sepal Width)")
    figs.append(fig)

for fig in figs:
    fig.show()

### Experiment with Different Numbers of Clusters

Even though the Iris dataset has 3 classes, we can use various cluster counts (e.g., 2, 3, 4, 5) to see how the clustering results change.

First, let's define a function that fits KMeans with a chosen number of clusters and plots the result. KNN has no parameter for the number of clusters, since `k` only sets how many neighbors vote, so a dedicated clustering algorithm is needed here.



In [None]:
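def visualize_clusters(n_clusters, data, feature_x, feature_y, title):
    # Fit KMeans on the feature columns only; no labels are used
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(data)

    # Plot from a copy of `data` so the function neither reads nor modifies the global df
    plot_df = data.copy()
    plot_df['cluster'] = kmeans.labels_.astype(str)  # Strings give discrete colors in Plotly

    fig = px.scatter(plot_df, x=feature_x, y=feature_y, color='cluster', title=title)
    fig.show()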

# Visualize clusters for different n_clusters
n_clusters_values = [2, 3, 4, 5]
for n_clusters in n_clusters_values:
    visualize_clusters(n_clusters, df[iris.feature_names], 'sepal length (cm)', 'sepal width (cm)', 
                       f'Clustering with n_clusters={n_clusters} (Sepal Length vs Sepal Width)')

    # visualize_clusters(n_clusters, df[iris.feature_names], 'petal length (cm)', 'petal width (cm)', 
    #                    f'Clustering with n_clusters={n_clusters} (Petal Length vs Petal Width)')
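
The scatter plots show only two of the four features, so it can help to also tabulate how the KMeans clusters line up with the known species. The sketch below uses `pd.crosstab` for that; the variable names `species`, `labels`, and `table` are illustrative choices, not part of the lab.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
species = pd.Series(iris.target_names[iris.target], name="species")

for n_clusters in [2, 3, 4, 5]:
    # Fit on features only, then count species per cluster
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X)
    table = pd.crosstab(species, pd.Series(labels, name="cluster"))
    print(f"n_clusters={n_clusters}")
    print(table, end="\n\n")
```

Each row of a table is a species and each column a cluster, so a species spread over several columns has been split, and a column containing several species has merged them.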

## 5. The Curse of Dimensionality

The "curse of dimensionality" refers to the various phenomena that arise when analyzing and organizing data in high-dimensional spaces. Often, the intuition we have for low-dimensional spaces does not apply.

### Implications for KNN
- As dimensionality increases, the distances between pairs of points tend to concentrate around the same value, so a point's nearest and farthest neighbors become hard to tell apart.
- This can degrade the performance of KNN because the nearest neighbor of a given point may be barely closer, and therefore barely more similar, than an arbitrary point.
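
The concentration of distances can be observed directly on random data. The following sketch draws uniform random points and measures the relative contrast, `(max - min) / min`, of the distances from one random query point. The helper `relative_contrast`, the sample size of 1000, and the chosen dimensions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(max - min) / min of the distances from one random query to uniform random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return float((d.max() - d.min()) / d.min())

contrasts = {dim: relative_contrast(1000, dim) for dim in [2, 10, 100, 1000]}
for dim, c in contrasts.items():
    print(f"dims={dim:>4}  relative contrast={c:.2f}")
```

A large contrast means the nearest point is much closer than the farthest one. As the number of dimensions grows the contrast shrinks, which is why "nearest" carries less information in high-dimensional spaces.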

### Dimensionality Reduction
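- Techniques like PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding) can be used to reduce the dimensionality of data before applying KNN. Iris has only four features, so the cell below illustrates the workflow rather than a case where reduction is strictly needed.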


In [None]:
# Reducing dimensions to 2 using PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(df[iris.feature_names])

# Applying KNN on reduced data
knn = NearestNeighbors(n_neighbors=3)
knn.fit(reduced_data)
distances, indices = knn.kneighbors(reduced_data)
df['cluster_pca'] = assign_clusters(indices, df['target'])

# Plotting the reduced clusters
fig = px.scatter(df, x=reduced_data[:, 0], y=reduced_data[:, 1], color='cluster_pca', 
                 title="Clustering on Reduced Data (PCA Components)")
fig.show()
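
Before relying on a two-dimensional projection, it can help to check how much of the original variance the kept components retain. The sketch below reads PCA's `explained_variance_ratio_` attribute; the variable name `ratios` is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each kept component
ratios = pca.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.1%} of the variance")
print(f"Total retained: {ratios.sum():.1%}")
```

If the retained share is low, neighbors found in the reduced space may differ noticeably from neighbors in the original feature space.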


## 6. Summary

- **Dimensionality Reduction Matters**: Reducing dimensions can improve the performance of KNN on high-dimensional data.
- **Distance Metrics Matter**: The choice of distance metric determines which points are neighbors, and therefore how points are grouped.
- **`k` Matters**: The value of `k` controls how many neighbors vote on each point's label, and so how smooth the resulting groups are.
- **Number of Classes Matters**: Requesting more or fewer clusters than the natural number of classes in the data splits or merges the natural groups.

By understanding these key aspects, we can apply clustering methods more effectively to any dataset.

## 7. References
- [Scikit-learn Documentation](https://scikit-learn.org/stable/documentation.html)
- [Plotly Express Documentation](https://plotly.com/python/plotly-express/)
- [Curse of Dimensionality Article on Wikipedia](https://en.wikipedia.org/wiki/Curse_of_dimensionality)

## Wrap up

Update your Overleaf with lessons learned.