# Phase 3: Unsupervised Learning - K-Means Clustering

**Objective**: Discover natural groupings of countries based on macroeconomic indicators and compare with actual credit ratings.

**Questions to answer**:
1. Do countries naturally group by risk level?
2. How many distinct risk profiles exist?
3. Do discovered clusters correspond to credit ratings?
4. Which countries are outliers?

## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples, adjusted_rand_score, normalized_mutual_info_score

import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

In [None]:
# Load data
df = pd.read_csv('../data/processed/merged_dataset_labels.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Prepare data for clustering
identifiers = df[['Country', 'Year']]
labels = df['Credit_Rating_Label']
X = df.drop(['Country', 'Year', 'Credit_Rating_Label'], axis=1)

feature_names = X.columns.tolist()
print(f"Features for clustering: {feature_names}")
print(f"Number of observations: {len(X)}")
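
K-Means in scikit-learn expects a fully numeric matrix with no missing values, so it may be worth inspecting `X` before scaling. The following is a minimal sketch only: `check_clustering_input` is a hypothetical helper, and the toy DataFrame uses invented column names rather than the real dataset.

```python
import numpy as np
import pandas as pd


def check_clustering_input(X: pd.DataFrame) -> dict:
    """Report columns that would break K-Means (hypothetical helper)."""
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    missing = X.isna().sum()
    return {
        "non_numeric": non_numeric,
        "missing": missing[missing > 0].to_dict(),
    }


# Toy data with invented column names, for illustration only
toy = pd.DataFrame({
    "gdp_growth": [2.1, np.nan, 3.4],
    "inflation": [1.5, 7.2, 2.0],
    "region": ["EU", "LATAM", "ASIA"],
})

report = check_clustering_input(toy)
print(report)
```

If the report is non-empty, the affected columns would need to be dropped, encoded, or imputed before the scaling step.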

## 2. Elbow Method - Finding Optimal K

In [None]:
# Normalize features (important for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Features normalized with StandardScaler")
print(f"Mean: {X_scaled.mean(axis=0).round(2)}")
print(f"Std: {X_scaled.std(axis=0).round(2)}")
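
To illustrate why normalization matters for K-Means, here is a small sketch on synthetic data (not the country dataset). One feature carries the group structure on a small scale, the other is large-scale noise. Because K-Means uses Euclidean distance, the unscaled noise feature is expected to dominate the partition. All names below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic groups that differ only in the small-scale feature
group = np.repeat([0, 1], 100)
small = np.where(group == 0, 0.0, 5.0) + rng.normal(0, 0.3, 200)
large = rng.normal(0, 1000.0, 200)  # large-scale noise, unrelated to the groups
X_toy = np.column_stack([small, large])


def cluster(X):
    """Fit a 2-cluster K-Means and return the assigned labels."""
    return KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)


ari_raw = adjusted_rand_score(group, cluster(X_toy))
ari_scaled = adjusted_rand_score(group, cluster(StandardScaler().fit_transform(X_toy)))

print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```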

In [None]:
# Test different K values
k_range = [2, 3, 4, 5, 6, 7, 8]
inertias = []
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(X_scaled)
    
    inertia = kmeans.inertia_
    silhouette = silhouette_score(X_scaled, clusters)
    
    inertias.append(inertia)
    silhouette_scores.append(silhouette)
    
    print(f"K={k}: Inertia={inertia:.2f}, Silhouette={silhouette:.4f}")

optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"\n✓ Optimal K = {optimal_k} (Silhouette Score: {max(silhouette_scores):.4f})")
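
As a sanity check of the selection rule used above (pick the K with the highest silhouette score), the same loop can be run on synthetic data where the number of groups is known in advance. This is a sketch under that assumption; the blob centers and spread are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the country dataset)
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels_k)

# Same rule as in the notebook: the K with the highest silhouette score
best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print(f"K with the highest silhouette: {best_k}")
```

On real data the silhouette curve is usually much flatter than on clean blobs, so the selected K is best read together with the inertia plot.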

In [None]:
# Visualize Elbow Method
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Inertia
axes[0].plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Inertia', fontsize=12, fontweight='bold')
axes[0].set_title('Elbow Method - Inertia', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].axvline(x=optimal_k, color='red', linestyle='--', alpha=0.5, label=f'Optimal K={optimal_k}')
axes[0].legend()

# Plot 2: Silhouette Score
axes[1].plot(k_range, silhouette_scores, 'go-', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Clusters (K)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Silhouette Score', fontsize=12, fontweight='bold')
axes[1].set_title('Silhouette Score by K', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].axhline(y=0.5, color='orange', linestyle='--', alpha=0.5, label='Good threshold (0.5)')
axes[1].axvline(x=optimal_k, color='red', linestyle='--', alpha=0.5, label=f'Optimal K={optimal_k}')
axes[1].legend()
axes[1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

## 3. K-Means Clustering with Optimal K

In [None]:
# Apply K-Means with optimal K
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Calculate metrics
inertia = kmeans.inertia_
silhouette = silhouette_score(X_scaled, clusters)
silhouette_vals = silhouette_samples(X_scaled, clusters)

print(f"K-Means Clustering (K={optimal_k})")
print(f"  Inertia: {inertia:.2f}")
print(f"  Silhouette Score: {silhouette:.4f}")
print(f"\nCluster distribution:")
print(pd.Series(clusters).value_counts().sort_index())

In [None]:
# Create results DataFrame
results_df = pd.DataFrame({
    'Country': identifiers['Country'],
    'Year': identifiers['Year'],
    'Credit_Rating_Label': labels,
    'Cluster': clusters,
    'Silhouette_Score': silhouette_vals
})

# Add original features
for col in feature_names:
    results_df[col] = df[col].values

results_df.head(10)

## 4. Cluster Profiles Analysis

In [None]:
# Analyze each cluster
for cluster_id in range(optimal_k):
    cluster_data = results_df[results_df['Cluster'] == cluster_id]
    n_obs = len(cluster_data)
    
    print(f"\n{'='*80}")
    print(f"CLUSTER {cluster_id}: {n_obs} observations")
    print(f"{'='*80}")
    
    # Mean values for each feature
    print("\nMean Economic Indicators:")
    for feature in feature_names:
        mean_val = cluster_data[feature].mean()
        print(f"  {feature:20s}: {mean_val:8.2f}")
    
    # Rating distribution
    print("\nCredit Rating Distribution:")
    rating_counts = cluster_data['Credit_Rating_Label'].value_counts().head(5)
    for rating, count in rating_counts.items():
        print(f"  {rating:10s}: {count:3d} ({count/n_obs*100:.1f}%)")
    
    # Sample countries (most recent year)
    latest_year = cluster_data['Year'].max()
    recent_countries = cluster_data[cluster_data['Year'] == latest_year]['Country'].tolist()[:10]
    print(f"\nSample Countries ({latest_year}):")
    print(f"  {', '.join(recent_countries)}")

In [None]:
# Visualize cluster profiles
cluster_profiles = []
for cluster_id in range(optimal_k):
    cluster_data = results_df[results_df['Cluster'] == cluster_id]
    profile = {'Cluster': f'Cluster {cluster_id}'}
    for feature in feature_names:
        profile[feature] = cluster_data[feature].mean()
    cluster_profiles.append(profile)

profiles_df = pd.DataFrame(cluster_profiles)
profiles_df = profiles_df.set_index('Cluster')

# Heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(profiles_df.T, annot=True, fmt='.2f', cmap='RdYlGn_r', center=0, 
            cbar_kws={'label': 'Mean Value'})
plt.title(f'Cluster Profiles - Mean Economic Indicators (K={optimal_k})', 
          fontsize=14, fontweight='bold')
plt.xlabel('Cluster', fontsize=12, fontweight='bold')
plt.ylabel('Feature', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

## 5. Comparison with Credit Ratings

In [None]:
# Group ratings into categories
def rating_to_category(rating):
    if rating in ['AAA', 'AA+', 'AA', 'AA-']:
        return 'High Grade'
    elif rating in ['A+', 'A', 'A-', 'BBB+', 'BBB', 'BBB-']:
        return 'Medium Grade'
    elif rating in ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-']:
        return 'Speculative'
    else:
        return 'High Risk'

results_df['Rating_Category'] = results_df['Credit_Rating_Label'].apply(rating_to_category)

# Cross-tabulation
crosstab = pd.crosstab(results_df['Cluster'], results_df['Rating_Category'])
print("Cross-Tabulation: Cluster vs Rating Category")
print("="*80)
print(crosstab)
print()

In [None]:
# Visualize cross-tabulation
plt.figure(figsize=(10, 6))
sns.heatmap(crosstab, annot=True, fmt='d', cmap='YlOrRd', cbar_kws={'label': 'Count'})
plt.title(f'K-Means Clusters vs Credit Rating Categories (K={optimal_k})', 
          fontsize=14, fontweight='bold')
plt.xlabel('Rating Category', fontsize=12, fontweight='bold')
plt.ylabel('Cluster', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Calculate similarity metrics
from sklearn.preprocessing import LabelEncoder

le_cluster = LabelEncoder()
le_rating = LabelEncoder()

clusters_encoded = le_cluster.fit_transform(results_df['Cluster'])
ratings_encoded = le_rating.fit_transform(results_df['Rating_Category'])

ari = adjusted_rand_score(ratings_encoded, clusters_encoded)
nmi = normalized_mutual_info_score(ratings_encoded, clusters_encoded)

print("Similarity Metrics:")
print(f"  Adjusted Rand Index (ARI): {ari:.4f}")
print(f"  Normalized Mutual Information (NMI): {nmi:.4f}")
print()

if ari > 0.6:
    interpretation = 'Strong correspondence'
elif ari > 0.3:
    interpretation = 'Moderate correspondence'
else:
    interpretation = 'Weak correspondence'

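# ARI is a chance-corrected index (about 0 for random agreement, 1 for identical partitions), not a percentage
print(f"Interpretation: {interpretation} (ARI = {ari:.2f})")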
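
For reading these two metrics, it helps to see how they behave in two reference cases: an identical partition with renamed labels, and a purely random labelling. The sketch below uses synthetic labels only; the 0.6 and 0.3 cut-offs used above are the notebook's own convention and are not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = np.repeat([0, 1, 2], 200)

# Same partition with renamed labels: both metrics ignore label names
relabelled = (truth + 1) % 3
ari_same = adjusted_rand_score(truth, relabelled)
nmi_same = normalized_mutual_info_score(truth, relabelled)

# Labels drawn at random: ARI should be near 0 and can be slightly negative
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=truth.size)
ari_random = adjusted_rand_score(truth, random_labels)

print(f"Renamed labels: ARI={ari_same:.3f}, NMI={nmi_same:.3f}")
print(f"Random labels:  ARI={ari_random:.3f}")
```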

## 6. Outlier Detection

In [None]:
# Identify outliers (countries in unexpected clusters)
outliers = []
for cluster_id in range(optimal_k):
    cluster_data = results_df[results_df['Cluster'] == cluster_id]
    rating_counts = cluster_data['Rating_Category'].value_counts()
    
    if len(rating_counts) > 0:
        dominant_category = rating_counts.index[0]
        mismatched = cluster_data[cluster_data['Rating_Category'] != dominant_category]
        
        for _, row in mismatched.iterrows():
            outliers.append({
                'Country': row['Country'],
                'Year': row['Year'],
                'Rating': row['Credit_Rating_Label'],
                'Rating_Category': row['Rating_Category'],
                'Cluster': cluster_id,
                'Expected_Category': dominant_category,
                'Silhouette_Score': row['Silhouette_Score']
            })

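outliers_df = pd.DataFrame(outliers)
# Sorting an empty frame would raise a KeyError, so only sort when rows were collected
if not outliers_df.empty:
    outliers_df = outliers_df.sort_values('Silhouette_Score')

print("Top 20 Outliers (lowest silhouette scores):")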
print("="*80)
print(outliers_df.head(20).to_string(index=False))
print(f"\nTotal outliers detected: {len(outliers)}")
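
The cell above defines an outlier as an observation whose rating category differs from the dominant category of its cluster. A complementary view, sketched here on synthetic data, flags points that lie unusually far from their own centroid, regardless of rating. The cutoff (mean plus three standard deviations) is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two tight synthetic groups plus one point placed far from both
X_toy = np.vstack([
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([10, 10], 0.5, size=(50, 2)),
    [[5.0, -8.0]],  # deliberately isolated point, index 100
])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_toy)

# transform() returns the distance from every sample to every centroid;
# pick the distance to the centroid each sample was assigned to
dist_to_own = km.transform(X_toy)[np.arange(len(X_toy)), km.labels_]

# Flag points beyond an arbitrary cutoff (mean + 3 standard deviations)
threshold = dist_to_own.mean() + 3 * dist_to_own.std()
flagged = np.where(dist_to_own > threshold)[0]
print(f"Flagged indices: {flagged.tolist()}")
```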

## 7. PCA Visualization (2D)

In [None]:
# Apply PCA (8D → 2D)
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

variance_explained = pca.explained_variance_ratio_
print(f"PCA Variance Explained:")
print(f"  PC1: {variance_explained[0]:.1%}")
print(f"  PC2: {variance_explained[1]:.1%}")
print(f"  Total: {variance_explained.sum():.1%}")
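
The explained-variance ratios say how much information the two components retain, but not which indicators drive them. One common way to inspect that is the loading matrix from `pca.components_`. The sketch below uses toy data with invented feature names, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data with invented feature names; two features are strongly correlated
base = rng.normal(size=200)
toy = pd.DataFrame({
    "inflation": base + rng.normal(scale=0.1, size=200),
    "interest_rate": base + rng.normal(scale=0.1, size=200),
    "gdp_growth": rng.normal(size=200),
})

pca_toy = PCA(n_components=2, random_state=42)
pca_toy.fit(StandardScaler().fit_transform(toy))

# Rows of components_ are the principal axes; columns follow the feature order
loadings = pd.DataFrame(
    pca_toy.components_.T,
    index=toy.columns,
    columns=["PC1", "PC2"],
)
print(loadings.round(2))
```

Features with large absolute loadings on a component contribute most to it. The sign of a component is arbitrary, so only relative signs within a column are meaningful.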

In [None]:
# Visualize clusters in 2D
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Colored by Cluster
scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, 
                           cmap='viridis', s=50, alpha=0.6, edgecolors='black', linewidth=0.5)

# Add cluster centers
centers_pca = pca.transform(kmeans.cluster_centers_)
axes[0].scatter(centers_pca[:, 0], centers_pca[:, 1], 
               c='red', marker='*', s=500, edgecolors='black', linewidth=2,
               label='Cluster Centers')

axes[0].set_xlabel(f'PC1 ({variance_explained[0]:.1%} variance)', fontsize=12, fontweight='bold')
axes[0].set_ylabel(f'PC2 ({variance_explained[1]:.1%} variance)', fontsize=12, fontweight='bold')
axes[0].set_title('K-Means Clusters (PCA 2D)', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='Cluster')

# Plot 2: Colored by Rating
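# Ratings ordered from lowest to highest credit quality (CC ranks below CCC-)
rating_order = ['CC', 'CCC-', 'CCC', 'CCC+', 'B-', 'B', 'B+', 
               'BB-', 'BB', 'BB+', 'BBB-', 'BBB', 'BBB+',
               'A-', 'A', 'A+', 'AA-', 'AA', 'AA+', 'AAA']

def rating_to_numeric(rating):
    try:
        return rating_order.index(rating)
    except ValueError:
        # Unrecognized or missing labels fall back to the lowest rank
        return 0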

ratings_numeric = results_df['Credit_Rating_Label'].apply(rating_to_numeric).values
scatter2 = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=ratings_numeric,
                          cmap='RdYlGn', s=50, alpha=0.6, edgecolors='black', linewidth=0.5)

axes[1].set_xlabel(f'PC1 ({variance_explained[0]:.1%} variance)', fontsize=12, fontweight='bold')
axes[1].set_ylabel(f'PC2 ({variance_explained[1]:.1%} variance)', fontsize=12, fontweight='bold')
axes[1].set_title('Credit Ratings (PCA 2D)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
cbar = plt.colorbar(scatter2, ax=axes[1], label='Rating')
cbar.set_label('Rating (Low → High)', fontsize=10)

plt.tight_layout()
plt.show()

## 8. Silhouette Analysis

In [None]:
# Silhouette plot
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

y_lower = 10
for i in range(optimal_k):
    cluster_silhouette_vals = silhouette_vals[clusters == i]
    cluster_silhouette_vals.sort()
    
    size_cluster_i = cluster_silhouette_vals.shape[0]
    y_upper = y_lower + size_cluster_i
    
    color = plt.cm.viridis(float(i) / optimal_k)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                      0, cluster_silhouette_vals,
                      facecolor=color, edgecolor=color, alpha=0.7)
    
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10

ax.set_xlabel('Silhouette Coefficient', fontsize=12, fontweight='bold')
ax.set_ylabel('Cluster', fontsize=12, fontweight='bold')
ax.set_title(f'Silhouette Plot (K={optimal_k})', fontsize=14, fontweight='bold')
ax.axvline(x=silhouette, color='red', linestyle='--', label=f'Average: {silhouette:.4f}')
ax.legend()
ax.set_yticks([])
ax.set_xlim([-0.3, 1])
plt.tight_layout()
plt.show()
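
The silhouette plot can be backed by a small per-cluster table: cluster size, mean silhouette, and the share of samples with a negative coefficient (samples that sit closer to another cluster than to their own). This is a sketch on synthetic blobs standing in for the scaled feature matrix.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data standing in for the scaled feature matrix
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
labels_toy = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_blobs)
sil = silhouette_samples(X_blobs, labels_toy)

# One row per cluster: size, mean silhouette, share of negative coefficients
summary = (
    pd.DataFrame({"cluster": labels_toy, "silhouette": sil})
    .groupby("cluster")["silhouette"]
    .agg(size="count", mean="mean", share_negative=lambda s: (s < 0).mean())
)
print(summary.round(3))
```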

## 9. Key Insights and Conclusions

### Summary of Findings:

1. **Optimal Number of Clusters**: **K=3** gave the highest silhouette score among the values tested (K=2 to 8)

2. **Cluster Characteristics**:
   - **Cluster 0**: Medium risk profile (largest group)
   - **Cluster 1**: Low risk profile (stable economies)
   - **Cluster 2**: High risk profile (high inflation/interest rates)

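3. **Correspondence with Credit Ratings**: 
   - An ARI of about 0.07 indicates **weak correspondence** (ARI is a chance-corrected agreement score where 0 means random agreement and 1 means identical partitions; it is not a percentage)
   - This suggests credit ratings incorporate factors beyond what distance-based grouping of macroeconomic indicators captures
   - Qualitative factors (political stability, institutions, etc.) may play a significant role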

4. **Outliers**: 
   - Several countries show economic profiles inconsistent with their ratings
   - These may represent:
     - Recent economic changes not yet reflected in ratings
     - Special circumstances (e.g., natural resources, geopolitical factors)
     - Rating agency subjective judgment

5. **Comparison with Supervised Learning**:
   - Random Forest (Phase 2): 79.68% accuracy → Can predict ratings from features
   - K-Means (Phase 3): ARI ≈ 0.07 → Natural groups ≠ rating categories
   - **Insight**: Ratings are predictable from the indicators, but countries with similar indicators do not necessarily share a rating category
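
Accuracy and ARI are measured on different scales, so the two figures in item 5 are not directly comparable. One rough way to put a clustering on an accuracy-like scale is cluster purity: let each cluster "predict" its most frequent rating category and count the share of observations that match. The sketch below uses invented labels, not the notebook's results.

```python
import pandas as pd

# Invented toy labels, for illustration only
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1],
    "Rating_Category": ["High Grade", "High Grade", "Speculative",
                        "Speculative", "Speculative", "High Risk"],
})

crosstab_toy = pd.crosstab(toy["Cluster"], toy["Rating_Category"])

# Purity: sum of each cluster's majority-category count over all observations
purity = crosstab_toy.max(axis=1).sum() / len(toy)
print(f"Purity: {purity:.2f}")
```

Purity rises mechanically as the number of clusters grows, so it is best read alongside ARI rather than instead of it.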

### Implications:

- Credit ratings appear to be **complex, non-linear functions** of economic indicators
- K-Means clustering on these indicators does not reproduce rating agency decisions
- Supervised models (Random Forest) can learn these complex patterns
- Unsupervised learning suggests that economic similarity ≠ credit risk similarity

## 10. Load Saved Results

In [None]:
# Load saved clustering results
cluster_profiles = pd.read_csv('../results/clustering/cluster_profiles.csv')
country_clusters = pd.read_csv('../results/clustering/country_clusters.csv')

print("Cluster Profiles:")
print(cluster_profiles)
print("\nCountry Clusters (sample):")
print(country_clusters.head(20))