
At the end of the experiment, you will be able to:

*  find groups or clusters using the Hierarchical Clustering algorithm
*  visualize the clusters using a dendrogram


## Dataset

### Description

The dataset consists of the following 7 columns (culmen length and culmen depth are two separate columns):

- **species:** penguin species (Chinstrap, Adélie, or Gentoo)
- **culmen length & depth:** The culmen is the upper ridge of a bird's beak
- **flipper_length_mm:** flipper length
- **body_mass_g:** body mass
- **island:** island name (Dream, Torgersen, or Biscoe)
- **sex:** penguin sex

## AI/ML Technique

### Hierarchical Clustering

Hierarchical clustering is an algorithm that builds a hierarchy of clusters. It starts with every data point assigned to a cluster of its own. At each step, the two nearest clusters are merged into a single cluster. The algorithm terminates when only one cluster is left.
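The merge loop described above can be sketched in a few lines of NumPy. This is only an illustration, not how scikit-learn or SciPy implement it: the function name `naive_agglomerative` and the toy points are made up, and "nearest" is assumed to mean single linkage (the distance between the closest pair of members).

```python
import numpy as np

def naive_agglomerative(points):
    """Repeatedly merge the two closest clusters until one remains.

    Assumption: closeness is measured with single linkage.
    Returns a list of (left_members, right_members, distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between any two members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        # Remove the two merged clusters and add their union
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical toy data: two tight pairs that are far apart
toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]])
merges = naive_agglomerative(toy)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

For `n` points the loop always performs `n - 1` merges, which is why the hierarchy can be drawn as a binary tree.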

Why use hierarchical clustering instead of K-means? K-means works well when the clusters are roughly hyper-spherical (circular in 2 dimensions). If the dataset contains clusters that are not spherical, K-means is probably not a good choice.

K-means starts with a random choice of cluster centers, so different runs can lead to different clustering results, and the algorithm usually has to be run several times. Its results may therefore not be repeatable. Hierarchical clustering has no random initialization: for a fixed dataset, distance metric, and linkage method, it gives the same clustering on every run.
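A quick way to see the repeatability point is to run both algorithms twice on the same data. The sketch below uses synthetic points rather than the penguin data, and restricts K-means to a single initialization (`n_init=1`) so that the seed is the only thing that varies. Whether the two K-means runs actually differ depends on the data, so no particular output is claimed for them.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 2))  # synthetic points, not the penguin data

# K-means with one random initialization per run and two different seeds
km_a = KMeans(n_clusters=3, n_init=1, random_state=1).fit(data)
km_b = KMeans(n_clusters=3, n_init=1, random_state=2).fit(data)
print("K-means inertia per run:", km_a.inertia_, km_b.inertia_)

# Agglomerative clustering has no seed: two runs give identical labels
agg_a = AgglomerativeClustering(n_clusters=3).fit_predict(data)
agg_b = AgglomerativeClustering(n_clusters=3).fit_predict(data)
same_agglomerative = bool(np.array_equal(agg_a, agg_b))
print("Agglomerative labels identical:", same_agglomerative)
```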

K-means requires prior knowledge of K (the number of clusters), whereas in hierarchical clustering we can stop at any level of the hierarchy, that is, at any number of clusters we wish.
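As a minimal sketch of "stopping at any level", the hierarchy can be built once and then cut into different numbers of clusters without refitting. The example assumes SciPy's `fcluster` with `criterion="maxclust"`, which returns at most the requested number of flat clusters; the blob data is synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)
# Three synthetic, well-separated blobs of 10 points each
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])

Z = linkage(points, "ward")  # build the full hierarchy once

counts = {}
for k in (2, 3, 4):
    # Cut the same hierarchy so that at most k flat clusters remain
    labels = fcluster(Z, t=k, criterion="maxclust")
    counts[k] = len(np.unique(labels))
print(counts)
```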


Hierarchical clustering is of two types:

**Agglomerative**: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

**Divisive**: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In this experiment we will use Agglomerative Clustering.

A dendrogram is a tree-like structure that shows the hierarchical relationship between objects. It is most commonly created as an output of hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.


Hierarchical clustering gives insight into each step at which clusters are merged, and it produces a dendrogram that helps you figure out which cluster combinations make sense and where you want to stop.
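One common heuristic for deciding where to stop is to look for the largest jump between consecutive merge heights and cut the tree inside that jump. This is a rule of thumb rather than part of the algorithm, and the sketch below applies it to synthetic blobs with a Ward linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(7)
# Three synthetic blobs of 15 points each
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(15, 2))
                   for c in ([0, 0], [6, 0], [3, 6])])

Z = linkage(blobs, "ward")
heights = Z[:, 2]            # merge distance of every step, non-decreasing for Ward
gaps = np.diff(heights)      # jump between consecutive merges
i = int(gaps.argmax())
cut = (heights[i] + heights[i + 1]) / 2  # cut halfway through the largest jump

# Keep every merge below the cut, discard the ones above it
labels = fcluster(Z, t=cut, criterion="distance")
n_found = len(np.unique(labels))
print("clusters found:", n_found)
```

On the dendrogram this corresponds to drawing a horizontal line through the tallest stretch of vertical branches and counting how many branches it crosses.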

## Import Required Packages

In [None]:
import requests

def download_file(url, filename):
  """Downloads a file from a given URL.

  Args:
    url: The URL of the file to download.
    filename: The name of the file to save the downloaded data to.
  """
  response = requests.get(url, stream=True)
  if response.status_code == 200:
    with open(filename, 'wb') as f:
      for chunk in response.iter_content(chunk_size=1024):
        if chunk:
          f.write(chunk)
    print(f"Downloaded {filename} successfully.")
  else:
    print(f"Failed to download {url}. Status code: {response.status_code}")

# Example usage:
url = "https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Penguin.csv"
filename = "Penguin.csv"
download_file(url, filename)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

## Load the data

In [None]:
df = pd.read_csv('Penguin.csv')
df.head()

In [None]:
# Count NaN values in each column of the dataframe
df.isna().sum()

In [None]:
# Drop the records where sex column has NaN values
df.dropna(subset = ['sex'], inplace = True)

# Print the unique() elements from the sex column after dropping
print("Unique values after dropping NA values : ",df.sex.unique())

## Convert categorical values to numerical

In [None]:
LE = preprocessing.LabelEncoder()

In [None]:
df['island'] = LE.fit_transform(df['island'])
df['sex'] = LE.fit_transform(df['sex'])
df['species'] = LE.fit_transform(df['species'])
df.head()
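Because the cell above reuses a single `LabelEncoder`, the encoder only remembers the classes of the last column it was fitted on. If the integer codes need to be mapped back to their names later, one option is to keep one encoder per column, as in this sketch. The toy frame and its values are hypothetical stand-ins for the penguin data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature version of the categorical columns
toy = pd.DataFrame({"island": ["Dream", "Biscoe", "Dream"],
                    "sex": ["MALE", "FEMALE", "MALE"]})

encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # classes are sorted, then numbered from 0
    encoders[col] = enc                     # keep the fitted encoder for decoding

print(toy)
print(list(encoders["island"].classes_))

# Map the integer codes back to the original names
decoded = list(encoders["island"].inverse_transform(toy["island"]))
print(decoded)
```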

## Store the data and labels

In [None]:
X = df.drop(['species'], axis=1)
y = df['species']

In [None]:
# Select the first 100 rows and the two columns at positions 1 and 2 of X
X1 = X.iloc[:100,1:3].values
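`iloc` selects by position, and its slices exclude the end index, so `1:3` picks the columns at positions 1 and 2. The sketch below shows this on a toy frame and compares it with selecting by name. The column names `culmen_length_mm` and `culmen_depth_mm` and all values are assumptions made for illustration; check `X.columns` for the real names and order.

```python
import pandas as pd

# Hypothetical frame; column names and values are illustrative only
toy = pd.DataFrame({
    "island": [0, 1, 2],
    "culmen_length_mm": [39.1, 46.5, 50.0],
    "culmen_depth_mm": [18.7, 17.9, 15.2],
    "flipper_length_mm": [181, 192, 220],
})

# Positional selection: first 2 rows, columns at positions 1 and 2
subset = toy.iloc[:2, 1:3]
print(list(subset.columns))

# Equivalent selection by column name, which does not depend on column order
same = toy.loc[toy.index[:2], ["culmen_length_mm", "culmen_depth_mm"]]
print(subset.equals(same))
```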

In [None]:
X1.shape

## Apply Agglomerative Clustering

**Note:** Refer to the [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) documentation.

In [None]:
# Create an AgglomerativeClustering estimator from scikit-learn
# n_clusters : the number of clusters to find (linkage defaults to 'ward')
clustering = AgglomerativeClustering(n_clusters = 3)

# Fit Hierarchical Clustering to the data
Y_preds = clustering.fit_predict(X1)

# Plot the results
plt.figure(figsize = (8,5))
plt.scatter(X1[Y_preds == 0 , 0] , X1[Y_preds == 0 , 1] , c = 'red')
plt.scatter(X1[Y_preds == 1 , 0] , X1[Y_preds == 1 , 1] , c = 'blue')
plt.scatter(X1[Y_preds == 2 , 0] , X1[Y_preds == 2 , 1] , c = 'green')
plt.show()
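Besides plotting, the fitted estimator can be inspected directly. The sketch below uses synthetic blobs as a stand-in for `X1` and looks at the cluster sizes, the `labels_` attribute, and the `children_` attribute that records which two nodes were merged at each step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# Synthetic stand-in for X1: three blobs of 20 points each
demo = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])

model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(demo)

sizes = np.bincount(labels)  # number of points assigned to each cluster
print("cluster sizes:", sizes)

# fit_predict returns the same array that is stored in labels_
print("matches labels_:", np.array_equal(labels, model.labels_))

# One row per merge, so n_samples - 1 rows in total
print("children_ shape:", model.children_.shape)
```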

## Visualize the dendrogram

**Note:** Refer to the [scipy.cluster.hierarchy.dendrogram](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html) documentation.

In [None]:
# With the 'ward' method, 'linkage' merges at each step the pair of clusters
# that gives the smallest increase in total within-cluster variance
clusters = linkage(X1, 'ward')
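The array returned by `linkage` is what the dendrogram is drawn from. A small sketch on three hand-picked points (hypothetical values) shows its layout: one row per merge, holding the two merged cluster indices, the merge distance, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three hypothetical points: two close together, one far away
pts = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
Z = linkage(pts, "ward")

# Each row: [index of cluster i, index of cluster j, merge distance, new cluster size]
# Indices 0..n-1 are the original points; index n + r is the cluster formed in row r
print(Z)
print("shape:", Z.shape)  # (n - 1, 4)
```

Here the first row merges points 0 and 1, and the second row merges point 2 with the cluster created in the first row (index 3).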

In [None]:
plt.figure(figsize=(20,20))
dendrogram(clusters)
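With 100 leaves the full dendrogram can be hard to read. One option is to show only the top of the tree through the `truncate_mode` and `p` parameters of `dendrogram`. The sketch below uses synthetic data as a stand-in for `X1`, and the choice of `p=10` is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(3)
sample = rng.normal(size=(100, 2))  # synthetic stand-in for X1
Z = linkage(sample, "ward")

fig, ax = plt.subplots(figsize=(8, 5))
# truncate_mode="lastp" keeps only the last p merged clusters as leaves
result = dendrogram(Z, truncate_mode="lastp", p=10, ax=ax)
ax.set_xlabel("cluster size in parentheses, or sample index")
ax.set_ylabel("Ward merge distance")
plt.show()
```

The returned dictionary holds the plotted structure, for example the leaf labels under the `"ivl"` key.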
plt.show()