# SMT-WEEX Notebook 3: Evaluation & Insights
**Project:** smt-weex-2025
**Author:** Jannet Ekka

This notebook covers:
1. Deep evaluation of all models
2. Confusion matrices
3. Per-class performance
4. Feature importance analysis
5. Error analysis
6. Trading signal insights

## 1. Setup

In [None]:
!pip install -q catboost xgboost lightgbm scikit-learn pandas numpy matplotlib seaborn shap

In [None]:
from google.colab import auth
auth.authenticate_user()

PROJECT_ID = 'smt-weex-2025'
BUCKET = 'smt-weex-2025-models'

!gcloud config set project {PROJECT_ID}

In [None]:
import pandas as pd
import numpy as np
import json
import pickle

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score,
    precision_recall_curve, average_precision_score, balanced_accuracy_score
)

from catboost import CatBoostClassifier

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded")

## 2. Load Models and Data from GCS

In [None]:
# Download from GCS
!mkdir -p /content/models
!gsutil -m cp gs://{BUCKET}/models/initial/* /content/models/
!gsutil cp gs://{BUCKET}/data/data_splits.npz /content/
!gsutil cp gs://{BUCKET}/data/feature_config.json /content/
!gsutil cp gs://{BUCKET}/data/whale_features_cleaned.csv /content/

In [None]:
# Load data splits
splits = np.load('/content/data_splits.npz')
X_train, y_train = splits['X_train'], splits['y_train']
X_val, y_val = splits['X_val'], splits['y_val']
X_test, y_test = splits['X_test'], splits['y_test']

# Load feature config
with open('/content/feature_config.json', 'r') as f:
    config = json.load(f)
FEATURES = config['features']

# Load label encoder
with open('/content/models/label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)

label_mapping = {i: label for i, label in enumerate(le.classes_)}
labels = list(label_mapping.values())
print(f"Labels: {label_mapping}")
print(f"Test set: {len(X_test)} samples")

In [None]:
# Load models
models = {}

# CatBoost
models['CatBoost'] = CatBoostClassifier()
models['CatBoost'].load_model('/content/models/catboost_whale_classifier.cbm')

# Others
with open('/content/models/xgboost_whale_classifier.pkl', 'rb') as f:
    models['XGBoost'] = pickle.load(f)

with open('/content/models/randomforest_whale_classifier.pkl', 'rb') as f:
    models['RandomForest'] = pickle.load(f)

with open('/content/models/lightgbm_whale_classifier.pkl', 'rb') as f:
    models['LightGBM'] = pickle.load(f)

print(f"Loaded {len(models)} models")

## 3. Confusion Matrices

In [None]:
def plot_confusion_matrix(y_true, y_pred, labels, title):
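    """Plot a confusion matrix heatmap and return the raw count matrix."""
    # Pin the class order so rows and columns line up with the tick labels,
    # even if a class is absent from y_true or y_pred
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(len(labels)))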
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)
    plt.title(title)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
    
    return cm

In [None]:
# Plot confusion matrices for all models
confusion_matrices = {}
for name, model in models.items():
    # Flatten: CatBoost returns multiclass predictions with shape (n_samples, 1)
    y_pred = np.asarray(model.predict(X_test)).ravel()
    cm = plot_confusion_matrix(y_test, y_pred, labels, f'{name} Confusion Matrix')
    confusion_matrices[name] = cm
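
Raw counts are hard to compare when class sizes differ. A minimal sketch of row normalization follows: each row is divided by its true-class count, so the diagonal reads as per-class recall. The helper name `row_normalize` and the toy matrix are illustrative assumptions, not part of the project code.

```python
import numpy as np

def row_normalize(cm):
    """Divide each row of a confusion matrix by its row total (true-class count)."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1, keepdims=True)
    # Rows with no true samples stay all-zero instead of producing NaN
    return np.divide(cm, row_totals, out=np.zeros_like(cm), where=row_totals > 0)

# Toy 3-class matrix (illustrative counts, not results from this project)
toy_cm = np.array([[8, 2, 0],
                   [1, 3, 1],
                   [0, 0, 0]])
cm_norm = row_normalize(toy_cm)
print(cm_norm.round(2))
```

In this notebook, `row_normalize(confusion_matrices['CatBoost'])` could be passed to `sns.heatmap` with `fmt='.2f'` in place of `fmt='d'`.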

## 4. Per-Class Performance

In [None]:
# Detailed classification report for best model (CatBoost)
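# Flatten: CatBoost returns multiclass predictions with shape (n_samples, 1)
y_pred_catboost = np.asarray(models['CatBoost'].predict(X_test)).ravel()

print("=" * 60)
print("CatBoost Classification Report")
print("=" * 60)
# Pass labels explicitly so target_names stay aligned if a class is missing from the test set
print(classification_report(y_test, y_pred_catboost, labels=np.arange(len(labels)),
                            target_names=labels, zero_division=0))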

In [None]:
# Per-class metrics comparison across models
per_class_metrics = {}

for model_name, model in models.items():
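    y_pred = np.asarray(model.predict(X_test)).ravel()
    class_ids = np.arange(len(labels))  # fixed class order, matches `labels`
    
    precision = precision_score(y_test, y_pred, labels=class_ids, average=None, zero_division=0)
    recall = recall_score(y_test, y_pred, labels=class_ids, average=None, zero_division=0)
    f1 = f1_score(y_test, y_pred, labels=class_ids, average=None, zero_division=0)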
    
    per_class_metrics[model_name] = {
        'precision': dict(zip(labels, precision)),
        'recall': dict(zip(labels, recall)),
        'f1': dict(zip(labels, f1))
    }

# Show per-class F1 for CatBoost
print("\n=== Per-Class F1 Scores (CatBoost) ===")
for label, f1_val in per_class_metrics['CatBoost']['f1'].items():
    count = (y_test == le.transform([label])[0]).sum()
    print(f"{label:15s}: {f1_val:.4f} ({count} samples)")
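
Per-class F1 scores are often summarized as a macro average (every class counts equally) or a support-weighted average (large classes dominate). The sketch below shows how far the two can diverge on imbalanced data. The function name and the toy numbers are assumptions chosen for illustration.

```python
import numpy as np

def macro_and_weighted(per_class_scores, support):
    """Return (macro average, support-weighted average) of per-class scores."""
    scores = np.asarray(per_class_scores, dtype=float)
    support = np.asarray(support, dtype=float)
    macro = float(scores.mean())
    weighted = float(np.average(scores, weights=support))
    return macro, weighted

# Toy values: one large easy class, two small hard classes (not project results)
toy_f1 = [0.9, 0.5, 0.1]
toy_support = [80, 15, 5]
macro, weighted = macro_and_weighted(toy_f1, toy_support)
print(f"macro F1: {macro:.2f}, weighted F1: {weighted:.2f}")
```

A large gap between the two averages suggests that small classes are being classified much worse than large ones.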

In [None]:
# Visualize per-class F1 across models
fig, ax = plt.subplots(figsize=(14, 6))

x = np.arange(len(labels))
n_models = len(per_class_metrics)
width = 0.8 / n_models  # bars for all models share 80% of each class slot

for i, (model_name, metrics) in enumerate(per_class_metrics.items()):
    f1_values = [metrics['f1'][label] for label in labels]
    ax.bar(x + i*width, f1_values, width, label=model_name)

ax.set_ylabel('F1 Score')
ax.set_title('Per-Class F1 Score Comparison')
# Center each tick under its group of bars
ax.set_xticks(x + width * (n_models - 1) / 2)
ax.set_xticklabels(labels, rotation=45, ha='right')
ax.legend()
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()

## 5. Feature Importance

In [None]:
# CatBoost feature importance
catboost_importance = models['CatBoost'].get_feature_importance()
importance_df = pd.DataFrame({
    'feature': FEATURES,
    'importance': catboost_importance
}).sort_values('importance', ascending=False)

print("=== CatBoost Feature Importance (Top 15) ===")
print(importance_df.head(15).to_string(index=False))

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 10))
top_n = 20
top_features = importance_df.head(top_n)

plt.barh(range(len(top_features)), top_features['importance'].values, color='steelblue')
plt.yticks(range(len(top_features)), top_features['feature'].values)
plt.xlabel('Importance')
plt.title(f'Top {top_n} Most Important Features (CatBoost)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# Compare feature importance across models
rf_importance = models['RandomForest'].feature_importances_
xgb_importance = models['XGBoost'].feature_importances_

importance_comparison = pd.DataFrame({
    'feature': FEATURES,
    'CatBoost': catboost_importance / catboost_importance.sum(),
    'RandomForest': rf_importance / rf_importance.sum(),
    'XGBoost': xgb_importance / xgb_importance.sum()
})

importance_comparison['avg'] = importance_comparison[['CatBoost', 'RandomForest', 'XGBoost']].mean(axis=1)
importance_comparison = importance_comparison.sort_values('avg', ascending=False)

print("=== Consensus Top Features (All Models) ===")
print(importance_comparison[['feature', 'avg', 'CatBoost', 'RandomForest', 'XGBoost']].head(15).to_string(index=False))
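
Averaging normalized importances hides whether the models agree on the ordering of features. One possible check is a Spearman rank correlation between two importance vectors. The sketch below is NumPy-only and assumes there are no tied values (ties would need average ranks); the helper name and the toy vectors are illustrative.

```python
import numpy as np

def spearman_no_ties(a, b):
    """Spearman rank correlation of two score vectors, assuming no tied values."""
    # argsort of argsort turns scores into 0-based ranks
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    # Pearson correlation of the ranks equals Spearman's rho
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Toy normalized importances for four features (not project results)
toy_a = np.array([0.50, 0.30, 0.15, 0.05])
toy_b = np.array([0.40, 0.35, 0.05, 0.20])
rho = spearman_no_ties(toy_a, toy_b)
print(f"rank agreement: {rho:.2f}")
```

Applied to two columns of `importance_comparison`, a value near 1 would indicate that the models rank features similarly.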

## 6. Error Analysis

In [None]:
# Analyze misclassifications
from collections import Counter

# CatBoost returns multiclass predictions with shape (n_samples, 1);
# flatten so the element-wise comparison with y_test does not broadcast
y_pred = np.asarray(models['CatBoost'].predict(X_test)).ravel()
misclassified_idx = np.where(y_test != y_pred)[0]

print(f"Total misclassifications: {len(misclassified_idx)} / {len(y_test)} ({len(misclassified_idx)/len(y_test)*100:.1f}%)")

# Most common confusion pairs
confusion_pairs = []
for idx in misclassified_idx:
    true_label = label_mapping[int(y_test[idx])]
    pred_label = label_mapping[int(y_pred[idx])]
    confusion_pairs.append((true_label, pred_label))

confusion_counts = Counter(confusion_pairs)

print("\n=== Most Common Misclassifications ===")
for (true_l, pred_l), count in confusion_counts.most_common(10):
    print(f"{true_l} -> {pred_l}: {count} times")
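
The same ranking can also be read directly off a confusion matrix, without looping over samples. The sketch below zeroes the diagonal and sorts the remaining cells. The function name, class names, and counts are toy assumptions.

```python
import numpy as np

def top_confusions(cm, class_names, k=3):
    """Return up to k (true, predicted, count) tuples for the largest off-diagonal cells."""
    cm = np.asarray(cm).copy()
    np.fill_diagonal(cm, 0)  # drop correct predictions
    flat_order = np.argsort(cm, axis=None)[::-1][:k]  # largest cells first
    rows, cols = np.unravel_index(flat_order, cm.shape)
    return [(class_names[r], class_names[c], int(cm[r, c]))
            for r, c in zip(rows, cols) if cm[r, c] > 0]

# Toy matrix with hypothetical class names (not project results)
toy_cm = np.array([[8, 2, 0],
                   [1, 3, 4],
                   [0, 0, 5]])
pairs = top_confusions(toy_cm, ["A", "B", "C"])
for true_l, pred_l, count in pairs:
    print(f"{true_l} -> {pred_l}: {count} times")
```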

In [None]:
# Low confidence predictions
y_proba = models['CatBoost'].predict_proba(X_test)
max_proba = y_proba.max(axis=1)

print("\n=== Prediction Confidence Distribution ===")
print(f"Mean confidence: {max_proba.mean():.4f}")
print(f"Min confidence: {max_proba.min():.4f}")
print(f"Max confidence: {max_proba.max():.4f}")

# Low confidence threshold
low_conf_threshold = 0.5
low_conf_idx = max_proba < low_conf_threshold
print(f"\nPredictions with confidence < {low_conf_threshold}: {low_conf_idx.sum()} ({low_conf_idx.sum()/len(y_test)*100:.1f}%)")
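
A single 0.5 cutoff gives one data point. To see the trade-off, it can help to sweep the threshold and track coverage (the share of predictions kept) against accuracy on the kept predictions. The sketch below assumes a boolean array marking which predictions were correct; the function name and toy data are illustrative.

```python
import numpy as np

def coverage_accuracy(max_proba, is_correct, threshold):
    """Return (coverage, accuracy) when only predictions with confidence >= threshold are kept."""
    max_proba = np.asarray(max_proba, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    kept = max_proba >= threshold
    coverage = float(kept.mean())
    # Accuracy is undefined when nothing passes the threshold
    accuracy = float(is_correct[kept].mean()) if kept.any() else float("nan")
    return coverage, accuracy

# Toy confidences and correctness flags (not project results)
toy_conf = np.array([0.95, 0.80, 0.60, 0.45, 0.40])
toy_correct = np.array([True, True, False, True, False])

for t in (0.0, 0.5, 0.9):
    cov, acc = coverage_accuracy(toy_conf, toy_correct, t)
    print(f"threshold {t:.1f}: coverage {cov:.2f}, accuracy {acc:.2f}")
```

If accuracy rises as coverage falls, confidence is informative and a threshold can be used to filter weak predictions before they feed into signals.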

In [None]:
# Confidence distribution plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overall confidence distribution
axes[0].hist(max_proba, bins=20, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].axvline(x=0.5, color='red', linestyle='--', label='50% threshold')
axes[0].set_xlabel('Confidence')
axes[0].set_ylabel('Count')
axes[0].set_title('Prediction Confidence Distribution')
axes[0].legend()

# Confidence by correct/incorrect
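# Flatten both sides so the comparison is element-wise even if predictions are (n_samples, 1)
correct_mask = np.asarray(y_test).ravel() == np.asarray(y_pred).ravel()
shared_bins = np.linspace(0, 1, 21)  # same bin edges so the two histograms are comparable
axes[1].hist(max_proba[correct_mask], bins=shared_bins, alpha=0.7, label='Correct', color='green')
axes[1].hist(max_proba[~correct_mask], bins=shared_bins, alpha=0.7, label='Incorrect', color='red')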
axes[1].set_xlabel('Confidence')
axes[1].set_ylabel('Count')
axes[1].set_title('Confidence: Correct vs Incorrect Predictions')
axes[1].legend()

plt.tight_layout()
plt.show()

## 7. Trading Signal Insights

In [None]:
# Key insights for trading signals
print("=" * 60)
print("TRADING SIGNAL INSIGHTS")
print("=" * 60)

# Category-specific feature patterns
df_full = pd.read_csv('/content/whale_features_cleaned.csv')

# Features most relevant for trading signals
signal_features = ['net_flow_eth_signed_log', 'tx_ratio_out_in', 'defi_interactions', 'cex_interactions', 'erc20_ratio']
available_signal_features = [f for f in signal_features if f in df_full.columns]

print("\n=== Category Feature Patterns (for signals) ===")
category_patterns = df_full.groupby('category')[available_signal_features].agg(['mean', 'std']).round(4)
print(category_patterns)

In [None]:
# Signal logic based on classification
print("\n=== Trading Signal Logic ===")
print("""
Based on the model's feature importance and category patterns:

1. CEX_Wallet:
   - High incoming_count + low defi_interactions = Exchange deposit hub
   - Signal: Large inflow spike = BEARISH (whales depositing to sell)
   - Signal: Large outflow spike = BULLISH (whales withdrawing to hold)

2. DeFi_Trader:
   - High erc20_ratio + high defi_interactions + high internal_ratio
   - Signal: Sudden protocol exit = VOLATILITY WARNING
   - Signal: Large DEX swap = Follow direction (copy trade)

3. Staker:
   - Low tx_per_day + interactions with Lido/RocketPool
   - Signal: Unstaking = BEARISH (need liquidity, might sell)
   - Signal: Staking = BULLISH (long-term commitment)

4. Miner:
   - High outgoing_count + low erc20_ratio + regular timing
   - Signal: Selling acceleration = BEARISH
   - Signal: Accumulating (not selling) = BULLISH

5. Institutional:
   - High max_tx_value + business hours activity
   - Signal: Follow their direction (they have alpha)
   - Large buy = BULLISH, Large sell = BEARISH

6. Exploiter:
   - Unusual patterns, high gas, rapid movement
   - Signal: AVOID - ignore their movements for trading
""")
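
The rules printed above could be encoded as a lookup table keyed by wallet category and observed event. This is only a sketch. The event names (`large_inflow`, `protocol_exit`, and so on) are hypothetical, the category strings are assumed to match the classifier's label names, and the mapping simply restates the heuristics listed above rather than any validated strategy.

```python
# (category, event) -> signal, transcribed from the heuristics printed above.
# Event names are hypothetical placeholders for whatever the pipeline detects.
SIGNAL_RULES = {
    ("CEX_Wallet", "large_inflow"): "BEARISH",
    ("CEX_Wallet", "large_outflow"): "BULLISH",
    ("DeFi_Trader", "protocol_exit"): "VOLATILITY_WARNING",
    ("DeFi_Trader", "large_dex_swap"): "FOLLOW_DIRECTION",
    ("Staker", "unstaking"): "BEARISH",
    ("Staker", "staking"): "BULLISH",
    ("Miner", "selling_acceleration"): "BEARISH",
    ("Miner", "accumulating"): "BULLISH",
    ("Institutional", "large_buy"): "BULLISH",
    ("Institutional", "large_sell"): "BEARISH",
}

# Categories whose movements are ignored entirely
IGNORED_CATEGORIES = {"Exploiter"}

def lookup_signal(category, event):
    """Map a predicted wallet category and an observed event to a signal string."""
    if category in IGNORED_CATEGORIES:
        return "IGNORE"
    return SIGNAL_RULES.get((category, event), "NO_SIGNAL")

print(lookup_signal("CEX_Wallet", "large_inflow"))
print(lookup_signal("Exploiter", "large_outflow"))
```

Keeping the rules in data rather than in branching code makes them easy to review and to revise as the category patterns are re-examined.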

## 8. Save Evaluation Results

In [None]:
# Convert confusion_counts to serializable format
confusion_counts_dict = {f"{k[0]}->{k[1]}": v for k, v in confusion_counts.items()}

# Save all evaluation results
evaluation_results = {
    'per_class_metrics': {k: {metric: {str(label): float(val) for label, val in values.items()} 
                              for metric, values in v.items()} 
                          for k, v in per_class_metrics.items()},
    'feature_importance': importance_df.to_dict(orient='records'),
    'confusion_pairs': confusion_counts_dict,
    'best_model': 'CatBoost',
    'timestamp': str(pd.Timestamp.now())
}

with open('/content/evaluation_results.json', 'w') as f:
    json.dump(evaluation_results, f, indent=2, default=str)

!gsutil cp /content/evaluation_results.json gs://{BUCKET}/results/evaluation_results.json
print("Evaluation results saved")
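
One caveat with `default=str` is that any NumPy integer or array reaching `json.dump` is written as a string rather than a number. A possible alternative is a small converter that unwraps NumPy types first. The helper name `to_builtin` and the toy payload below are assumptions for illustration.

```python
import json
import numpy as np

def to_builtin(obj):
    """json `default` hook: unwrap NumPy scalars and arrays, stringify anything else."""
    if isinstance(obj, np.generic):
        return obj.item()    # NumPy scalar -> Python scalar
    if isinstance(obj, np.ndarray):
        return obj.tolist()  # array -> nested lists
    return str(obj)

# Toy payload with NumPy types (not project results)
payload = {"f1": np.float64(0.75), "counts": np.array([3, 1]), "n": np.int64(4)}
encoded = json.dumps(payload, default=to_builtin)
decoded = json.loads(encoded)
print(decoded)
```

With this hook, numeric values stay numeric after a round trip, which keeps downstream comparisons and plots from having to parse strings.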

## Summary

Evaluation completed:
1. Confusion matrices for all models
2. Per-class performance analysis
3. Feature importance ranking
4. Error analysis (misclassification patterns)
5. Trading signal insights

**Key Findings:**
- Top features: [See output above]
- Hardest classes: [See output above]
- Most common confusions: [See output above]

**Next:** Run Notebook 4 for hyperparameter tuning with RandomSearch.