# 🎯 Stock Clustering by Risk Profile

**Goal**: Group NSE stocks into 4 risk categories using K-Means clustering.

**Why clustering?**
- Helps investors find stocks matching their risk tolerance
- Identifies natural risk patterns in the market
- Creates diversified portfolio buckets

**What makes good clusters?**
- **Silhouette Score > 0.5**: Stocks within a cluster are similar, and different clusters are distinct
- **Balanced sizes**: No cluster with too few stocks
- **Clear interpretation**: Each cluster maps to a distinct risk profile
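
The first two criteria can be checked numerically. The sketch below is illustrative only: it uses synthetic blobs as a stand-in for scaled stock features, and the 5% minimum cluster share is a hypothetical threshold, not a value taken from this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled stock features: 120 "stocks" in 4 groups
X, _ = make_blobs(n_samples=120, centers=4, cluster_std=0.8, random_state=42)

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Criterion 1: silhouette score (ranges from -1 to 1, higher is better)
score = silhouette_score(X, labels)

# Criterion 2: cluster sizes, flagging clusters below a hypothetical 5% share
sizes = np.bincount(labels, minlength=4)
min_share = 0.05  # assumed threshold, tune to taste
too_small = sizes < min_share * len(X)

print(f"Silhouette score: {score:.3f}")
print(f"Cluster sizes: {sizes.tolist()}")
print(f"Clusters below {min_share:.0%} of stocks: {np.flatnonzero(too_small).tolist()}")
```

The third criterion, interpretability, is a judgment call made by inspecting per-cluster averages rather than a single number.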

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import sys
sys.path.append('../src')

from clustering import find_optimal_clusters, StockClusterer

# Styling
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

## 1️⃣ Load Features

In [None]:
df = pd.read_csv('../Data/Processed/nse_features.csv')
print(f"Loaded {len(df)} stocks with {len(df.columns)} features")
print(f"\nFeature names:")
print(df.columns.tolist())

## 2️⃣ Find Optimal Number of Clusters

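**Elbow Method**: Look for the "elbow" where inertia stops dropping sharply as clusters are added

**Silhouette Score**: Measures how well each stock fits its own cluster versus the nearest other cluster, on a scale from -1 to 1 (0.5+ is good)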
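
For readers without access to `../src/clustering`, here is a minimal sketch of what a sweep over cluster counts could look like. The function name `sweep_cluster_counts` is hypothetical and the real `find_optimal_clusters` may work differently; only the output column names (`n_clusters`, `inertia`, `silhouette`) are taken from the plotting code in this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def sweep_cluster_counts(X, max_clusters=8, random_state=42):
    """Fit K-Means for k = 2..max_clusters and record inertia and silhouette."""
    # Standardize so no single feature dominates the distance computation
    X_scaled = StandardScaler().fit_transform(X)
    rows = []
    for k in range(2, max_clusters + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        rows.append({
            "n_clusters": k,
            "inertia": model.inertia_,  # within-cluster sum of squared distances
            "silhouette": silhouette_score(X_scaled, labels),
        })
    return pd.DataFrame(rows)


# Synthetic data standing in for the stock feature matrix
X_demo, _ = make_blobs(n_samples=150, centers=4, random_state=0)
metrics = sweep_cluster_counts(X_demo, max_clusters=8)
print(metrics.round(3))
```

The sweep starts at k = 2 because the silhouette score is undefined for a single cluster.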

In [None]:
# Candidate features for clustering
feature_cols = [
    'volatility_mean', 'volatility_max', 'downside_deviation',
    'std_return', 'var_95', 'max_drawdown',
    'sharpe_ratio', 'return_skew', 'return_kurtosis',
    'rsi_mean', 'bb_width_mean', 'macd_volatility',
    'momentum_30d', 'momentum_90d', 'trend_strength',
    'trading_frequency', 'amihud_illiquidity',
    'volume_volatility', 'avg_recovery_days'
]

# Only use features that exist
feature_cols = [col for col in feature_cols if col in df.columns]
print(f"Using {len(feature_cols)} features for clustering\n")

cluster_metrics = find_optimal_clusters(df, feature_cols, max_clusters=8)
print(cluster_metrics)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Elbow plot
ax1.plot(cluster_metrics['n_clusters'], cluster_metrics['inertia'], 'bo-', linewidth=2, markersize=8)
ax1.set_title('Elbow Method - Find Inertia Drop', fontweight='bold', fontsize=14)
ax1.set_xlabel('Number of Clusters', fontsize=12)
ax1.set_ylabel('Inertia (within-cluster sum of squares)', fontsize=12)
ax1.grid(True, alpha=0.3)

# Silhouette plot
ax2.plot(cluster_metrics['n_clusters'], cluster_metrics['silhouette'], 'ro-', linewidth=2, markersize=8)
ax2.axhline(y=0.5, color='g', linestyle='--', label='Good threshold (0.5)', linewidth=2)
ax2.set_title('Silhouette Score - Cluster Quality', fontweight='bold', fontsize=14)
ax2.set_xlabel('Number of Clusters', fontsize=12)
ax2.set_ylabel('Silhouette Score', fontsize=12)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

best_k = cluster_metrics.loc[cluster_metrics['silhouette'].idxmax(), 'n_clusters']
print(f"\n🎯 Recommended: {int(best_k)} clusters (highest silhouette score)")

## 3️⃣ Perform Clustering

Using **4 clusters** to match the four risk profiles in the goal, regardless of the silhouette-recommended value:
1. Low Risk
2. Medium-Low Risk
3. Medium-High Risk
4. High Risk
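
K-Means returns arbitrary integer cluster ids, so some rule has to turn them into ordered risk names. One plausible rule, shown below as a sketch, ranks clusters by their mean volatility. The helper `label_clusters_by_volatility` and the `cluster` column are hypothetical; how `StockClusterer` actually assigns `Risk_Profile` is not shown in this notebook.

```python
import pandas as pd

RISK_LABELS = ["Low Risk", "Medium-Low Risk", "Medium-High Risk", "High Risk"]


def label_clusters_by_volatility(df, cluster_col="cluster", risk_col="volatility_mean"):
    """Rank clusters by their mean of `risk_col` and name them from low to high risk."""
    # Sort cluster ids from calmest to most volatile
    cluster_means = df.groupby(cluster_col)[risk_col].mean().sort_values()
    id_to_label = dict(zip(cluster_means.index, RISK_LABELS))
    return df[cluster_col].map(id_to_label)


# Toy data: cluster ids are deliberately out of risk order
toy = pd.DataFrame({
    "cluster": [2, 2, 0, 0, 3, 3, 1, 1],
    "volatility_mean": [0.01, 0.02, 0.05, 0.06, 0.03, 0.04, 0.09, 0.10],
})
toy["Risk_Profile"] = label_clusters_by_volatility(toy)
print(toy)
```

Here cluster 2 has the lowest mean volatility and becomes "Low Risk", while cluster 1 has the highest and becomes "High Risk".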

In [None]:
clusterer = StockClusterer(n_clusters=4, random_state=42)
df_clustered = clusterer.fit_predict(df)

print(f"\n📊 Cluster Distribution:")
print(df_clustered['Risk_Profile'].value_counts().sort_index())

## 4️⃣ Visualize Clusters

**PCA** projects the scaled feature space (up to 19 dimensions) onto 2 components for plotting

In [None]:
# Prepare data for PCA
X = df_clustered[clusterer.feature_columns].fillna(df_clustered[clusterer.feature_columns].median())
X_scaled = clusterer.scaler.transform(X)

# PCA to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(14, 9))
colors = ['green', 'blue', 'orange', 'red']
labels = ['Low Risk', 'Medium-Low Risk', 'Medium-High Risk', 'High Risk']

for color, label in zip(colors, labels):
    # Boolean NumPy mask so the rows of X_pca are selected by position
    mask = (df_clustered['Risk_Profile'] == label).to_numpy()
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1],
                c=color, label=f"{label} ({mask.sum()})",
                alpha=0.7, s=150, edgecolors='black', linewidth=1.5)

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=13, fontweight='bold')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=13, fontweight='bold')
plt.title('NSE Stock Risk Clusters (PCA Projection)', fontsize=16, fontweight='bold')
plt.legend(title='Risk Profile', title_fontsize=12, fontsize=11, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n💡 PCA explains {pca.explained_variance_ratio_[:2].sum():.1%} of total variance")
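
To read the PCA axes, it can help to look at the component loadings, which show how much each feature contributes to PC1 and PC2. The sketch below uses made-up data with three of the feature names from this notebook; the values are random and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: two strongly correlated risk features and one independent feature
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "volatility_mean": base + rng.normal(scale=0.1, size=200),
    "max_drawdown": base + rng.normal(scale=0.1, size=200),
    "sharpe_ratio": rng.normal(size=200),
})

demo_scaled = StandardScaler().fit_transform(demo)
demo_pca = PCA(n_components=2).fit(demo_scaled)

# Rows are features, columns are components; large absolute values drive that axis
loadings = pd.DataFrame(demo_pca.components_.T, index=demo.columns, columns=["PC1", "PC2"])
print(loadings.round(2))
print(f"Explained variance: {demo_pca.explained_variance_ratio_.sum():.1%}")
```

In this toy case the two correlated features dominate PC1, which is the pattern to look for when several risk metrics move together.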

## 5️⃣ Cluster Profiles

Compare clusters across key metrics
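
As a rough idea of what such a comparison involves, the sketch below builds a per-cluster table with a plain pandas `groupby`. It is an assumption about the kind of output `get_cluster_summary` returns, not its actual implementation, and the toy values are invented.

```python
import pandas as pd

# Invented values for two risk profiles
toy = pd.DataFrame({
    "Risk_Profile": ["Low Risk", "Low Risk", "High Risk", "High Risk"],
    "volatility_mean": [0.010, 0.014, 0.050, 0.070],
    "sharpe_ratio": [0.9, 1.1, 0.2, 0.4],
})

# One row per risk profile: stock count plus the mean of each key metric
summary_sketch = (
    toy.groupby("Risk_Profile")
       .agg(n_stocks=("volatility_mean", "size"),
            volatility_mean=("volatility_mean", "mean"),
            sharpe_ratio=("sharpe_ratio", "mean"))
       .sort_values("volatility_mean")
)
print(summary_sketch)
```

Sorting by mean volatility puts the profiles in risk order, which makes it easy to confirm that the labels match the numbers.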

In [None]:
summary = clusterer.get_cluster_summary(df_clustered)
print("\n📈 Cluster Summary:")
print(summary)

## 6️⃣ Sample Stocks by Risk

In [None]:
for risk in ['Low Risk', 'Medium-Low Risk', 'Medium-High Risk', 'High Risk']:
    subset = df_clustered[df_clustered['Risk_Profile'] == risk]
    if len(subset) > 0:
        print(f"\n{'='*60}")
        print(f"{risk.upper()} ({len(subset)} stocks)")
        print('='*60)
        
        cols = ['Stock_code', 'Name', 'Sector', 'volatility_mean', 'sharpe_ratio']
        available_cols = [c for c in cols if c in subset.columns]
        
        sample = subset.nsmallest(5, 'volatility_mean')[available_cols]
        print(sample.to_string(index=False))

## 7️⃣ Save Results

In [None]:
# Save clustered data
df_clustered.to_csv('../Data/Processed/nse_clustered.csv', index=False)
print("✅ Saved clustered data")

# Save model
clusterer.save_model('../models/stock_clusterer.pkl')
print("✅ Saved trained model")

---

## 📚 Summary

**What we did**:
1. ✅ Tested 2-8 clusters using elbow method and silhouette scores
2. ✅ Chose 4 clusters for risk profiles
3. ✅ Trained K-Means with 19 advanced features
4. ✅ Visualized clusters in 2D using PCA
5. ✅ Analyzed cluster characteristics
6. ✅ Saved model and results

**Key improvement**: Clustering on advanced features (Sharpe ratio, technical indicators, risk metrics) captures more dimensions of risk than basic volatility alone.

**Next**: Evaluate model performance and generate insights! 🎯