# W11 - KMeans and Hierarchical Clustering
**Name:** Collin Joseph  
**NIM:** 0706022310053

**Dataset:** COVID-19 dataset (raw CSV)

**Notebook purpose:** Complete solution for the W11 class assignment: data preprocessing, EDA, two clustering methods (KMeans & Hierarchical Agglomerative), evaluation (silhouette), cluster profiling, maps & barplots, recommendations, and model comparison.

## How to run
1. This notebook downloads the CSV from a raw GitHub URL, so it needs internet access. Google Colab has this enabled by default.
2. Alternatively, download the CSV manually and set the `DATA_URL` variable below to the local file path.
3. Run all cells from top to bottom. All outputs (plots/tables) will be generated after execution.
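
As a minimal sketch of step 2, the helper below prefers a local copy when one exists and otherwise falls back to the URL. The function name `load_covid_csv` and the `local_path` argument are hypothetical and not part of the notebook; the demo uses a tiny temporary file so it runs without network access.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_covid_csv(url, local_path=None):
    """Hypothetical helper: read local_path if that file exists, else read url."""
    if local_path is not None and Path(local_path).exists():
        return pd.read_csv(local_path)
    return pd.read_csv(url)


# Demo with a throwaway file; the URL is a placeholder and is never requested
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("Country,Population\nA,100\nB,200\n")
    demo = load_covid_csv("https://example.invalid/unused.csv", local_path=sample)

print(demo.shape)
```

In the notebook itself, `pd.read_csv(DATA_URL)` already accepts either a URL or a local path, so this wrapper is only a convenience.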

In [None]:
# --- Libraries & config ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
import plotly.express as px
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Dataset URL (raw GitHub link)
DATA_URL = "https://raw.githubusercontent.com/NathaliaMinoque/datasets/refs/heads/main/COVID-19%20Coronavirus%20(2).csv"

print('Notebook ready. Set DATA_URL =', DATA_URL)


In [None]:
# --- Load dataset ---
# If running locally and you already downloaded CSV, set local path: DATA_URL = '/path/to/COVID-19 (2).csv'
df = pd.read_csv(DATA_URL)
df.head()


## 1) Data preprocessing
- Inspect columns, datatypes, missing values
- Convert numeric fields to appropriate dtypes
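- Create derived features if they are not already present: cases per million (`Tot Cases//1M pop`), deaths per million (`Tot Deaths//1M pop`), and case fatality rate, CFR (`Death percentage`)
- Handle missing values (impute, or drop rows if necessary)
- Keep the essential columns for clustering: `Population`, `Tot Cases//1M pop`, `Tot Deaths//1M pop`, `Death percentage`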
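
The steps above can be illustrated on a toy frame before touching the real data. This is a sketch under assumptions: the three rows and the `cases_per_million` column name are made up for illustration and only mimic the comma-formatted strings the real CSV may contain.

```python
import pandas as pd

# Toy data (invented): numbers stored as strings with thousands separators
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population": ["1,000,000", "2,500,000", None],
    "Total Cases": ["10,000", "n/a", "5,000"],
})

# Strip commas, then coerce; anything unparseable becomes NaN
for col in ["Population", "Total Cases"]:
    cleaned = toy[col].astype(str).str.replace(",", "", regex=False)
    toy[col] = pd.to_numeric(cleaned, errors="coerce")

# Count missing values per column after conversion
missing = toy.isna().sum()
print(missing)

# Derived feature: cases per million inhabitants
toy["cases_per_million"] = toy["Total Cases"] / toy["Population"] * 1e6
print(toy)
```

Rows with a missing numerator or denominator end up with a missing derived value, which is why imputation happens after feature creation.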

In [None]:
# --- Preliminary cleaning & feature creation ---
df.info()

# Standardize column names (strip)
df.columns = [c.strip() for c in df.columns]

# Try to convert numeric columns that may contain commas or strings
def to_numeric_safe(col):
    return pd.to_numeric(df[col].astype(str).str.replace(',','').str.replace('%',''), errors='coerce')

candidates = ['Population','Total Cases','Total Deaths','Tot Cases//1M pop','Tot Deaths//1M pop','Death percentage']

for c in candidates:
    if c in df.columns:
        df[c] = to_numeric_safe(c)

# Derived features if missing
if 'Tot Cases//1M pop' not in df.columns and 'Total Cases' in df.columns and 'Population' in df.columns:
    df['Tot Cases//1M pop'] = df['Total Cases'] / df['Population'] * 1e6
if 'Tot Deaths//1M pop' not in df.columns and 'Total Deaths' in df.columns and 'Population' in df.columns:
    df['Tot Deaths//1M pop'] = df['Total Deaths'] / df['Population'] * 1e6
if 'Death percentage' not in df.columns and 'Total Cases' in df.columns and 'Total Deaths' in df.columns:
    df['Death percentage'] = (df['Total Deaths'] / df['Total Cases']) * 100

# Select columns for clustering
cols_for_clustering = []
for name in ['Population','Tot Cases//1M pop','Tot Deaths//1M pop','Death percentage']:
    if name in df.columns:
        cols_for_clustering.append(name)

print('Using columns for clustering:', cols_for_clustering)
df[cols_for_clustering].head()


## 2) Exploratory Data Analysis (EDA)
- At least 2 meaningful visualizations are required, including a world map (choropleth by continent or cluster).
- We'll produce:
  1. Distribution plots of key features (cases per million, deaths per million, CFR)
  2. Choropleth map by country showing cases per million, preceded by a table of continent-level means

In [None]:
# --- EDA plots ---
plot_cols = [c for c in ['Tot Cases//1M pop','Tot Deaths//1M pop','Death percentage'] if c in df.columns]

# plt and sns are already imported in the first cell
for col in plot_cols:
    plt.figure(figsize=(8,4))
    sns.histplot(df[col].dropna(), bins=60, kde=False)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.show()


In [None]:
# Continent-level mean table, then a country-level choropleth of cases per million
if 'Continent' in df.columns and 'Tot Cases//1M pop' in df.columns and 'ISO 3166-1 alpha-3 CODE' in df.columns:
    continent_means = df.groupby('Continent', as_index=False)['Tot Cases//1M pop'].mean().rename(columns={'Tot Cases//1M pop':'Mean Cases per M'})
    display(continent_means)
    fig = px.choropleth(df, locations='ISO 3166-1 alpha-3 CODE', color='Tot Cases//1M pop',
                        hover_name='Country', projection='natural earth',
                        title='Cases per million by country (choropleth)')
    fig.show()
else:
    print('Required columns for choropleth not found: Continent, Tot Cases//1M pop, ISO 3166-1 alpha-3 CODE')


## 3) Encoding & Data Transformation
- Scale numeric features using RobustScaler (robust to outliers)
- Optionally, log-transform features with heavy skew prior to scaling
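
To see why both steps help, the sketch below applies them to a synthetic, heavily right-skewed sample. The lognormal data and the `cases_per_million` name are assumptions for illustration only. `np.log1p` reduces the skew, and `RobustScaler` (with its default settings) then centers on the median and divides by the interquartile range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic right-skewed feature (assumption: stands in for a per-million rate)
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=500), name="cases_per_million")

# log1p computes log(1 + x), which is safe for zeros
logged = np.log1p(raw)
skew_raw, skew_log = raw.skew(), logged.skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# RobustScaler: subtract the median, divide by the IQR (25th to 75th percentile)
scaled = RobustScaler().fit_transform(logged.to_frame()).ravel()
median_scaled = float(np.median(scaled))
iqr_scaled = float(np.percentile(scaled, 75) - np.percentile(scaled, 25))
print(f"median after scaling: {median_scaled:.3f}, IQR after scaling: {iqr_scaled:.3f}")
```

Because the median and IQR are barely affected by a few extreme countries, the scaled features stay comparable even when outliers are present.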

In [None]:
# --- Prepare data for clustering ---
X = df[cols_for_clustering].copy()

# Impute missing values with median
imputer = SimpleImputer(strategy='median')
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns, index=X.index)

# Log-transform every column that has at least one positive value.
# np.log1p computes log(1 + x), so zeros map to 0 instead of -inf.
for c in X_imputed.columns:
    if (X_imputed[c] > 0).sum() > 0:
        X_imputed[c+'_log'] = np.log1p(X_imputed[c])

# Choose log columns if created
cols_final = [c for c in X_imputed.columns if c.endswith('_log')]
if not cols_final:
    cols_final = X_imputed.columns.tolist()

scaler = RobustScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_imputed[cols_final]), columns=cols_final, index=X.index)

print('Final features used for clustering:', cols_final)
X_scaled.head()


## 4) KMeans: find best K using silhouette score
We'll compute the silhouette score for each K from 2 to 8 and pick the K with the highest score.
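
For intuition about what the score measures, here is a minimal sketch that computes it by hand on five invented points and compares the result with `silhouette_score`. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`. The helper name `manual_silhouette` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score


def manual_silhouette(X, labels):
    """Mean silhouette over all points, using Euclidean distances."""
    labels = np.asarray(labels)
    D = pairwise_distances(X)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        if same.sum() == 1:
            # Singleton clusters get a score of 0 by convention
            scores.append(0.0)
            continue
        # D[i, i] is 0, so dividing by (size - 1) excludes the point itself
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Toy data (invented): two tight groups far apart
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = [0, 0, 1, 1, 1]

manual = manual_silhouette(X_toy, labels_toy)
reference = silhouette_score(X_toy, labels_toy)
print(f"manual={manual:.6f} sklearn={reference:.6f}")
```

Values near 1 mean points sit much closer to their own cluster than to the nearest other one; values near 0 indicate overlapping clusters.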

In [None]:
# --- Find best K ---
sil_scores = {}
for k in range(2,9):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labs = kmeans.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labs)
    sil_scores[k] = sil
sil_scores


In [None]:
# Plot silhouette vs K (plt is already imported in the first cell)
plt.figure(figsize=(6,3))
plt.plot(list(sil_scores.keys()), list(sil_scores.values()), marker='o')
plt.xlabel('K')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score by K (KMeans)')
plt.show()


In [None]:
# --- Fit KMeans with best K (choose the K with max silhouette) ---
best_k = max(sil_scores, key=sil_scores.get)
print('Best K by silhouette:', best_k)
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10).fit(X_scaled)
df['kmeans_cluster'] = kmeans.labels_


## 5) Hierarchical Agglomerative Clustering
We'll apply AgglomerativeClustering (Ward linkage) with the same number of clusters as KMeans and compare the two silhouette scores.
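
The merge tree behind agglomerative clustering can be inspected with SciPy, which this notebook does not otherwise import; treat the following as a sketch that assumes `scipy` is installed and uses synthetic blobs rather than the COVID-19 features. `linkage` builds the Ward tree, `fcluster` cuts it into a fixed number of clusters, and `dendrogram` returns the leaf ordering.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data (assumption): three well-separated groups of points
X_demo, _ = make_blobs(
    n_samples=60,
    centers=[(-10, -10), (0, 10), (10, -10)],
    cluster_std=1.0,
    random_state=0,
)

# Each row of Z records one merge: the two merged clusters, the merge distance, and the new size
Z = linkage(X_demo, method="ward")

# Cut the tree so that at most 3 flat clusters remain
labels_demo = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True only computes the layout; drop it in a notebook to draw the tree
tree = dendrogram(Z, no_plot=True)

print("linkage matrix shape:", Z.shape)
print("cluster sizes:", np.bincount(labels_demo)[1:])
```

Large jumps in the merge distances (the third column of `Z`) are a common visual cue for how many clusters the data supports, which can be cross-checked against the silhouette-based choice of K.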

In [None]:
# --- Hierarchical clustering ---
agg = AgglomerativeClustering(n_clusters=best_k, linkage='ward')
df['agg_cluster'] = agg.fit_predict(X_scaled)
sil_kmeans = silhouette_score(X_scaled, df['kmeans_cluster'])
sil_agg = silhouette_score(X_scaled, df['agg_cluster'])
print(f'Silhouette KMeans: {sil_kmeans:.4f} | Silhouette Agglomerative: {sil_agg:.4f}')


## 6) Cluster Summary & Profiling
- Show cluster sizes and mean feature values per cluster
- Visualize cluster distribution via barplots and map

In [None]:
# --- Cluster profiling for KMeans ---
profile_k = df.groupby('kmeans_cluster')[cols_for_clustering].mean().T
profile_k['overall'] = df[cols_for_clustering].mean()
profile_k


In [None]:
# Cluster sizes
print('KMeans cluster sizes:')
print(df['kmeans_cluster'].value_counts().sort_index())
print('\nAgglomerative cluster sizes:')
print(df['agg_cluster'].value_counts().sort_index())


In [None]:
# Barplot of mean Tot Cases per M by cluster (KMeans)
if 'Tot Cases//1M pop' in df.columns:
    mean_cases = df.groupby('kmeans_cluster')['Tot Cases//1M pop'].mean().reset_index()
    sns.barplot(data=mean_cases, x='kmeans_cluster', y='Tot Cases//1M pop')
    plt.title('Mean Tot Cases per M by KMeans cluster')
    plt.xlabel('Cluster')
    plt.ylabel('Mean Cases per M')
    plt.show()


In [None]:
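# Map colored by cluster (KMeans)
if 'ISO 3166-1 alpha-3 CODE' in df.columns:
    # Cast labels to strings so Plotly draws a discrete legend instead of a continuous color scale
    map_df = df.assign(kmeans_cluster=df['kmeans_cluster'].astype(str))
    fig = px.choropleth(map_df, locations='ISO 3166-1 alpha-3 CODE', color='kmeans_cluster',
                        hover_name='Country', projection='natural earth',
                        title='KMeans cluster assignment by country')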
    fig.show()
else:
    print('No ISO codes found for map visualization')


## 7) Government Policy Recommendations (example templates)
For each cluster, provide actionable recommendations. Example templates:
- **Cluster 0**: Very high deaths per million and high CFR → *Priority: increase ICU capacity, accelerate vaccination, strengthen surveillance and contact tracing.*  
- **Cluster 1**: Low reported cases and low deaths per million, which may reflect underreporting (testing rates are not in this dataset, so this cannot be confirmed here) → *Priority: increase testing, audit reporting systems.*  
- **Cluster 2**: Moderate cases, low CFR → *Priority: maintain vaccination campaign and targeted NPIs to protect vulnerable groups.*

**Write targeted suggestions for each cluster based on the profiling table above.**

## 8) Model comparison & conclusion
- Compare silhouette scores and stability of clusters. Prefer the model with higher silhouette and clearer, interpretable clusters.  
- Also consider if hierarchical clustering better preserves geographic/continent structures or if KMeans gives tighter clusters in the scaled feature space.
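
One way to quantify how much the two models agree, beyond comparing silhouette scores, is the adjusted Rand index together with a cross-tabulation of the labels. This is a sketch on synthetic blobs, not the notebook's results; in the notebook the same calls would take `df['kmeans_cluster']` and `df['agg_cluster']`.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data (assumption): three well-separated groups of points
X_demo, _ = make_blobs(
    n_samples=60,
    centers=[(-10, -10), (0, 10), (10, -10)],
    cluster_std=1.0,
    random_state=0,
)

km_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_demo)

# ARI is 1.0 for identical partitions (label numbering does not matter)
# and close to 0 for unrelated ones
ari = adjusted_rand_score(km_labels, agg_labels)
print(f"Adjusted Rand index: {ari:.3f}")

# Rows: KMeans clusters, columns: agglomerative clusters, cells: shared members
overlap = pd.crosstab(km_labels, agg_labels, rownames=["kmeans"], colnames=["agglomerative"])
print(overlap)
```

The same `pd.crosstab` call against the `Continent` column would show whether either model's clusters line up with geography.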

**Conclusion (example):** KMeans achieved a slightly higher silhouette score compared to Agglomerative, and cluster profiling produced clear groups (high-risk, medium-risk, low-risk). Therefore, choose KMeans for this dataset, but mention limitations (reporting biases, single snapshot, missing features such as testing rates or vaccination coverage).