
Clustering and hierarchical clustering are both unsupervised learning techniques used to group similar data points together based on certain characteristics or patterns. However, they differ in approach and use cases.

# 1. Clustering

Clustering is the process of dividing a dataset into distinct groups, or clusters, where data points in the same cluster are more similar to each other than to those in other clusters.

**Popular Methods:**

1. **K-means**: Partitions data into K clusters by minimizing the within-cluster sum of squared distances between each point and its cluster centroid.
2. **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**: Groups points that lie in dense regions, which lets it find clusters of arbitrary shape without fixing the number of clusters in advance. Points in sparse regions are labeled as noise (outliers).
3. **Gaussian Mixture Models (GMM)**: Assumes data is generated from a mixture of several Gaussian distributions and clusters by maximizing the likelihood of the data given these distributions.

**When to Use:** Clustering is useful for finding patterns in data when there are no predefined labels. It's used in customer segmentation, image compression, and document clustering, among other applications.

**Limitations:** Many clustering methods, like K-means, struggle with clusters that are non-spherical or vary in density and size. Additionally, choosing the right number of clusters can be challenging.
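
The limitation with non-spherical clusters can be illustrated on synthetic data. The sketch below is not part of the original notebook: it assumes scikit-learn's `make_moons` toy dataset, hand-picked parameters (`eps=0.2`, `min_samples=5`), and the adjusted Rand index as an agreement score against the known moon labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Two interleaving half-circles: a classic non-spherical toy dataset
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# Fit the three methods listed above (parameters are illustrative guesses)
labels = {
    "K-means": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5).fit_predict(X),
    "GMM": GaussianMixture(n_components=2, random_state=0).fit_predict(X),
}

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
scores = {name: adjusted_rand_score(y_true, pred) for name, pred in labels.items()}
for name, score in scores.items():
    print(f"{name}: ARI = {score:.2f}")
```

On data like this, the density-based method typically recovers the two moons while K-means cuts them with a straight boundary. The exact scores depend on the noise level and on the chosen `eps`.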

# Importing Libraries and Mounting

In [2]:
import pandas as pd

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Reading Dataset

In [4]:
df = pd.read_csv('/content/drive/MyDrive/CSE 432 ML/Clustering/countries_of_the_world.csv')

In [5]:
df.head()

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


# Preprocessing

In [6]:
df.isnull().sum()

Unnamed: 0,0
Country,0
Region,0
Population,0
Area (sq. mi.),0
Pop. Density (per sq. mi.),0
Coastline (coast/area ratio),0
Net migration,3
Infant mortality (per 1000 births),3
GDP ($ per capita),1
Literacy (%),18


**Normalization**

In [7]:
df.columns

Index(['Country', 'Region', 'Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service'],
      dtype='object')

In [8]:
numeric_columns = ['Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service']

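# Convert decimal-comma strings (e.g. '48,0') to floats, column by column
for col in numeric_columns:
  try:
    df[col] = pd.to_numeric(df[col].str.replace(',', '.'))
  except AttributeError:
    # Column was already parsed as numeric, so it has no .str accessor
    pass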

df.head()

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3,8.71,6.25,,,
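
The comma-to-dot conversion can also be requested when the file is loaded, through the `decimal` parameter of `pd.read_csv`. The sketch below uses a small in-memory CSV in place of the real file. The two rows reuse values from the output above, and the quoting of the decimal-comma fields is an assumption about how the raw file stores them.

```python
import io

import pandas as pd

# Stand-in for the real CSV: decimal commas inside quoted fields (assumed layout)
raw = io.StringIO(
    'Country,Pop. Density (per sq. mi.),Birthrate\n'
    'Afghanistan,"48,0","46,6"\n'
    'Albania,"124,6","15,11"\n'
)

# decimal=',' tells the parser to treat ',' as the decimal separator
sample = pd.read_csv(raw, decimal=',')
print(sample.dtypes)
print(sample)
```

With this option the numeric columns arrive as floats, so the per-column conversion loop is no longer needed.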


In [9]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit and transform the selected columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

df.head()

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),0.023631,0.03792,0.00295,0.0,1.0,0.851138,0.003663,0.223301,0.002897,0.195299,0.004341,0.814759,1,0.904926,0.657559,0.494148,0.248307,0.356502
1,Albania,EASTERN EUROPE,0.00272,0.001683,0.007658,0.001447,0.364586,0.1018,0.07326,0.836165,0.068573,0.339559,0.087214,0.617369,3,0.180018,0.10674,0.301691,0.189616,0.579596
2,Algeria,NORTHERN AFRICA,0.025056,0.139485,0.000848,4.6e-05,0.46765,0.151985,0.100733,0.635922,0.075237,0.051844,0.004933,0.947953,1,0.22675,0.084517,0.131339,0.654628,0.264574
3,American Samoa,OCEANIA,3.9e-05,1.2e-05,0.017847,0.066949,0.006356,0.036951,0.137363,0.963592,0.250435,0.161005,0.295975,0.625019,2,0.349217,0.035701,,,
4,Andorra,WESTERN EUROPE,4.9e-05,2.7e-05,0.009348,0.0,0.626334,0.009317,0.338828,1.0,0.480008,0.035743,0.0,0.966702,3,0.032689,0.144262,,,
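
For reference, `MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. Missing values are ignored when the minimum and maximum are computed and are passed through unchanged, which is why the `NaN` entries are still present in the output above. A minimal check on made-up values (the array below is illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix with one missing entry (values are illustrative only)
X = np.array([
    [700.0, 36.0],
    [4500.0, 86.5],
    [6000.0, 70.0],
    [8000.0, np.nan],
])

scaled = MinMaxScaler().fit_transform(X)

# Same computation by hand, skipping NaN when taking the column min and max
col_min = np.nanmin(X, axis=0)
col_max = np.nanmax(X, axis=0)
manual = (X - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual, equal_nan=True))
```

Because the scaler leaves missing values in place, they still have to be handled before the data reaches an estimator that rejects `NaN`.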


# Clustering

In [11]:
import pandas as pd
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)  # Adjust eps and min_samples as needed
clusters = dbscan.fit_predict(df[numeric_columns])

# Add cluster labels to the DataFrame
df['DBSCAN_Cluster'] = clusters

# Analyze the results (e.g., count the number of points in each cluster)
print(df['DBSCAN_Cluster'].value_counts())

ValueError: Input X contains NaN.
DBSCAN does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
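
As the error message suggests, the missing values have to be dropped or imputed before DBSCAN can run. The sketch below shows both options on a small synthetic frame. The column names, the positions of the missing values, and the choice of median imputation are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.impute import SimpleImputer

# Synthetic stand-in for the scaled numeric columns, with a few NaNs injected
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((60, 3)), columns=['a', 'b', 'c'])
toy.iloc[[3, 17, 42], 1] = np.nan

# Option 1: drop rows that contain any missing value
complete = toy.dropna()
labels_drop = DBSCAN(eps=0.5, min_samples=5).fit_predict(complete)

# Option 2: fill each missing value with its column median, keeping every row
imputed = SimpleImputer(strategy='median').fit_transform(toy)
labels_imputed = DBSCAN(eps=0.5, min_samples=5).fit_predict(imputed)

print(len(labels_drop), len(labels_imputed))
```

Dropping rows removes those countries from the clustering entirely, while imputation keeps them at the cost of inventing values, so the better choice depends on how many rows are affected.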

In [None]:
!pip install hdbscan

In [None]:
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=5) # Adjust min_cluster_size as needed
clusterer.fit(df[numeric_columns])
df['HDBSCAN_cluster'] = clusterer.labels_
df

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))  # Adjust figure size as needed
plt.scatter(df['GDP ($ per capita)'], df['Literacy (%)'], c=df['HDBSCAN_cluster'], cmap='viridis', s=50)
plt.xlabel('GDP ($ per capita)')
plt.ylabel('Literacy (%)')
plt.title('Cluster Plot (HDBSCAN)')
plt.colorbar(label='Cluster Label')
plt.show()

In [None]:
unique_hdb_clusters = df['HDBSCAN_cluster'].unique()
unique_hdb_clusters

In [None]:
cluster_groups = df.groupby('HDBSCAN_cluster')

# Iterate through each cluster and print the countries
for cluster_label, cluster_data in cluster_groups:
    print(f"Cluster {cluster_label}:")
    countries = cluster_data['Country'].tolist()
    print(countries)
    print()

# Task

Apply k-Means and k-Medoid Clustering
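
As a starting point for the task, here is a hedged sketch on synthetic blobs rather than the country data. `KMeans` is scikit-learn's implementation. The `k_medoids` function is a hypothetical helper written for this example: a simple alternating assign-and-update scheme, not the full PAM algorithm. Library implementations exist (for example `KMedoids` in scikit-learn-extra) but are not used here, to keep the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances


def k_medoids(X, k, max_iter=100, seed=0):
    """Alternating k-medoids sketch: assign points, then re-pick each medoid."""
    rng = np.random.default_rng(seed)
    dist = pairwise_distances(X)  # Euclidean distances between all pairs
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # Assign every point to its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # New medoid: the member with the smallest total distance to the rest
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels


# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
medoid_idx, kmedoid_labels = k_medoids(X, k=3)

print(np.bincount(kmeans_labels))
print(medoid_idx, np.bincount(kmedoid_labels))
```

Unlike a K-means centroid, each medoid is an actual data point, which for this dataset would mean a representative country per cluster. Both methods require complete data, so the missing values need to be handled first.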