
Clustering and hierarchical clustering are both unsupervised learning techniques that group similar data points together based on shared characteristics or patterns. They differ in approach and in the use cases they suit best.

# 1. Clustering

Clustering is the process of dividing a dataset into distinct groups, or clusters, where data points in the same cluster are more similar to each other than to those in other clusters.

**Popular Methods:**

1. **K-means**: Partitions data into K clusters by minimizing the within-cluster variance, that is, the sum of squared distances from each point to its cluster centroid.
2. **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**: Groups points that lie in dense regions, so it can find clusters of arbitrary shape, and labels points in sparse regions as noise (outliers).
3. **Gaussian Mixture Models (GMM)**: Assumes the data is generated from a mixture of several Gaussian distributions, fits those distributions by maximizing the likelihood of the data, and assigns each point to its most probable component.
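
The three methods above can be tried side by side on a small synthetic dataset. This is a minimal sketch using scikit-learn; the blob centers, `eps`, `min_samples`, and the choice of three clusters are illustrative assumptions, not values taken from the dataset used later in this notebook.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated 2-D blobs (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.6,
    random_state=0,
)

# K-means and GMM need the number of clusters/components up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density; label -1 means noise
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

n_kmeans = len(set(kmeans_labels))
n_gmm = len(set(gmm_labels))
n_dbscan = len(set(dbscan_labels) - {-1})

print("K-means clusters:", n_kmeans)
print("GMM components used:", n_gmm)
print("DBSCAN clusters (excluding noise):", n_dbscan)
```

Note that only DBSCAN can return the label `-1`; K-means and GMM always assign every point to some cluster.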

**When to Use:** Clustering is useful for finding patterns in data when there are no predefined labels. It’s used in customer segmentation, image compression, and document clustering, among other applications.

**Limitations:** Many clustering methods, like K-means, struggle with clusters that are non-spherical or vary in density and size. Additionally, choosing the right number of clusters can be challenging.
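
To illustrate the non-spherical case, the sketch below compares K-means and DBSCAN on scikit-learn's two-moons toy data. The dataset, `eps=0.2`, and the use of the adjusted Rand index against the known generating labels are assumptions made for this illustration only; real clustering tasks usually have no ground-truth labels to score against.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are clearly non-spherical
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-means splits the plane with a straight boundary between two centroids
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the dense curved bands instead
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```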

# Importing Libraries and Mounting

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Reading Dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Autumn 2024/DM_CSE426_8/05 - Clustering and Hierarchical Clustering/countries_of_the_world.csv')


In [None]:
df.head()

# Preprocessing

In [None]:
df.isnull().sum()

**Normalization**

In [None]:
df.columns

In [None]:
numeric_columns = ['Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service']

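# Convert columns stored as strings with a comma decimal separator to floats.
# Columns that are already numeric are left unchanged.
for col in numeric_columns:
    if df[col].dtype == object:
        df[col] = pd.to_numeric(df[col].str.replace(',', '.', regex=False))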

df.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit and transform the selected columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

df.head()

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),0.023631,0.03792,0.00295,0.0,1.0,0.851138,0.003663,0.223301,0.002897,0.195299,0.004341,0.814759,1,0.904926,0.657559,0.494148,0.248307,0.356502
1,Albania,EASTERN EUROPE,0.00272,0.001683,0.007658,0.001447,0.364586,0.1018,0.07326,0.836165,0.068573,0.339559,0.087214,0.617369,3,0.180018,0.10674,0.301691,0.189616,0.579596
2,Algeria,NORTHERN AFRICA,0.025056,0.139485,0.000848,4.6e-05,0.46765,0.151985,0.100733,0.635922,0.075237,0.051844,0.004933,0.947953,1,0.22675,0.084517,0.131339,0.654628,0.264574
3,American Samoa,OCEANIA,3.9e-05,1.2e-05,0.017847,0.066949,0.006356,0.036951,0.137363,0.963592,0.250435,0.161005,0.295975,0.625019,2,0.349217,0.035701,,,
4,Andorra,WESTERN EUROPE,4.9e-05,2.7e-05,0.009348,0.0,0.626334,0.009317,0.338828,1.0,0.480008,0.035743,0.0,0.966702,3,0.032689,0.144262,,,
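
As a small check of what `MinMaxScaler` computes, the sketch below scales a toy column by hand with (x - min) / (max - min) and compares the result with the scaler's output. The toy values are made up for illustration. It also shows that missing values stay missing after scaling, which is consistent with the empty Agriculture, Industry, and Service entries in the output above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data with one missing value (illustrative only)
values = np.array([[10.0], [20.0], [np.nan], [50.0]])

# MinMaxScaler ignores NaN when fitting and leaves it as NaN when transforming
scaled = MinMaxScaler().fit_transform(values)

# The same computation by hand, skipping NaN when taking min and max
col_min = np.nanmin(values)
col_max = np.nanmax(values)
manual = (values - col_min) / (col_max - col_min)

print(scaled.ravel())
print(manual.ravel())
```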


# Clustering

In [None]:
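from sklearn.cluster import DBSCAN

# scikit-learn's DBSCAN rejects NaN values, and some columns (for example
# Agriculture, Industry, Service) still contain missing entries.
# Median imputation is an assumption made here so the cell runs;
# choose a strategy that suits your analysis.
features = df[numeric_columns].fillna(df[numeric_columns].median())

dbscan = DBSCAN(eps=0.5, min_samples=5)  # Adjust eps and min_samples as needed
clusters = dbscan.fit_predict(features)

# Add cluster labels to the DataFrame (-1 marks noise points)
df['DBSCAN_Cluster'] = clusters

# Count the number of points in each cluster
print(df['DBSCAN_Cluster'].value_counts())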
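
One common heuristic for picking `eps` is the k-distance curve: compute each point's distance to its k-th nearest neighbour, sort those distances, and look for the "elbow" where they start rising sharply. The sketch below shows the computation on synthetic blobs; the data and `min_samples=5` are assumptions for illustration, and the elbow still has to be judged by eye or by a rule of your own choosing.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for a scaled feature matrix (illustrative only)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)

min_samples = 5

# When the training data is passed back to kneighbors, each point is returned
# as its own nearest neighbour at distance 0, which matches DBSCAN's
# convention that min_samples counts the point itself.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = neighbors.kneighbors(X)

# Distance to the farthest of the min_samples neighbours, sorted ascending
k_distances = np.sort(distances[:, -1])

# Inspect a few quantiles; plotting k_distances makes the elbow easier to see
for q in (50, 90, 99):
    print(f"{q}th percentile k-distance: {np.percentile(k_distances, q):.3f}")
```

Values of `eps` near the elbow tend to separate dense regions from sparse ones; values well below it mark most points as noise, and values well above it merge clusters.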

In [None]:
!pip install hdbscan

In [None]:
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=5) # Adjust min_cluster_size as needed
clusterer.fit(df[numeric_columns])
df['HDBSCAN_cluster'] = clusterer.labels_
df

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))  # Adjust figure size as needed
plt.scatter(df['GDP ($ per capita)'], df['Literacy (%)'], c=df['HDBSCAN_cluster'], cmap='viridis', s=50)
plt.xlabel('GDP ($ per capita)')
plt.ylabel('Literacy (%)')
plt.title('Cluster Plot (HDBSCAN)')
plt.colorbar(label='Cluster Label')
plt.show()

In [None]:
unique_hdb_clusters = df['HDBSCAN_cluster'].unique()
unique_hdb_clusters

In [None]:
cluster_groups = df.groupby('HDBSCAN_cluster')

# Iterate through each cluster and print the countries
for cluster_label, cluster_data in cluster_groups:
    print(f"Cluster {cluster_label}:")
    countries = cluster_data['Country'].tolist()
    print(countries)
    print()