# 🚀 Have I Been Rekt - AI Training Pipeline

**Enhanced cryptocurrency fraud detection with multi-source intelligence**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pretty-Good-OSINT-Protocol/Have-I-Been-Rekt/blob/main/ai-training/HIBR_AI_Training.ipynb)

---

## 📋 Overview

This notebook trains machine learning models to detect cryptocurrency fraud using:
- **Ethereum fraud dataset** (9,486 labeled addresses)
- **Tree-based ML algorithms** (Random Forest, XGBoost, LightGBM)
- **Probability-based risk scoring** mapped to five risk levels, from MINIMAL to CRITICAL

Multi-source threat intelligence (HIBP, Shodan, VirusTotal) is not used in this notebook; integrating it is listed under Next Steps.

## 🛠️ Setup and Installation

In [None]:
# Install required packages
!pip install -q pandas numpy scikit-learn xgboost lightgbm matplotlib seaborn plotly
!pip install -q imbalanced-learn shap kaggle datasets requests aiohttp
!pip install -q web3 networkx python-dotenv pydantic structlog

print("✅ Dependencies installed successfully!")

## 🔑 Configuration Setup

In [None]:
import os
from google.colab import files, drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

# Configuration
CONFIG = {
    'random_state': 42,
    'test_size': 0.2,
    'validation_size': 0.2,
    'n_splits': 5,
    'scoring': 'roc_auc'
}

print(f"🎯 Configuration: {CONFIG}")
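
The `n_splits` and `scoring` entries in `CONFIG` are not read by any later cell, which relies on a single validation split. A minimal sketch of how they might drive stratified k-fold cross-validation is shown here. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `cv_scores` are assumptions for illustration, not part of the original pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same keys as the notebook's CONFIG; values copied from it.
CONFIG = {'random_state': 42, 'n_splits': 5, 'scoring': 'roc_auc'}

# Synthetic, imbalanced stand-in for the real feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2],
    random_state=CONFIG['random_state'],
)

# Stratified folds keep the fraud rate roughly equal in every fold.
cv = StratifiedKFold(
    n_splits=CONFIG['n_splits'], shuffle=True,
    random_state=CONFIG['random_state'],
)
model = RandomForestClassifier(
    n_estimators=50, random_state=CONFIG['random_state']
)

# One ROC AUC value per fold.
cv_scores = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring=CONFIG['scoring']
)
print(f"CV ROC AUC: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

Reporting the spread across folds alongside the mean gives a rough sense of how stable a score is before comparing models.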

## 📊 Data Loading and Exploration

In [None]:
# Upload the dataset manually via the Colab file picker
print("📁 Upload your transaction_dataset.csv file:")
uploaded = files.upload()

# Load the dataset (the uploaded file must be named transaction_dataset.csv)
df = pd.read_csv('transaction_dataset.csv')

print(f"✅ Dataset loaded: {df.shape[0]} addresses, {df.shape[1]} features")
print(f"📊 Fraud distribution: {df['FLAG'].value_counts()}")

# Display basic info
df.head()
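
Later cells index the DataFrame by exact column names, so a renamed or missing column only surfaces as a `KeyError` partway through the run. The sketch below shows one way to check for the expected columns right after loading. The helper `find_missing_columns` and the toy frame are hypothetical; the column names are the ones this notebook references.

```python
import pandas as pd

# Columns the later cells index by name (copied from this notebook).
REQUIRED_COLUMNS = [
    'FLAG',
    'Sent tnx',
    'Received Tnx',
    'total Ether sent',
    'total ether received',
    'total ether balance',
    'Unique Sent To Addresses',
    'Unique Received From Addresses',
    'total transactions (including tnx to create contract',
]

def find_missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from frame."""
    present = set(frame.columns)
    return [name for name in required if name not in present]

# Toy frame that lacks most of the expected columns.
toy = pd.DataFrame({'FLAG': [0, 1], 'Sent tnx': [3, 7]})
missing = find_missing_columns(toy)
print(f"Missing columns: {missing}")
```

On the real data, an empty list from `find_missing_columns(df)` would indicate that the visualization and feature engineering cells can find every column they need.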

In [None]:
# Data exploration
print("📈 Dataset Overview:")
print(f"Total addresses: {len(df):,}")
print(f"Features: {len(df.columns)-1}")
print(f"Fraud addresses: {df['FLAG'].sum():,} ({df['FLAG'].mean():.2%})")
print(f"Legitimate addresses: {(df['FLAG']==0).sum():,} ({(df['FLAG']==0).mean():.2%})")

# Check for missing values
missing_data = df.isnull().sum()
if missing_data.sum() > 0:
    print(f"\n⚠️ Missing values found: {missing_data.sum()}")
    print(missing_data[missing_data > 0])
else:
    print("\n✅ No missing values detected")

# Basic statistics
df.describe()

## 📊 Data Visualization

In [None]:
# Create visualization dashboard
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('🔍 Ethereum Address Analysis Dashboard', fontsize=16, fontweight='bold')

# 1. Fraud distribution
# sort_index() puts class 0 first so the counts line up with the labels below
fraud_counts = df['FLAG'].value_counts().sort_index()
axes[0,0].pie(fraud_counts.values, labels=['Legitimate', 'Fraud'], 
              autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])
axes[0,0].set_title('🎯 Fraud vs Legitimate Distribution')

# 2. Transaction count distribution
axes[0,1].hist(df['total transactions (including tnx to create contract'], 
               bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,1].set_title('📊 Transaction Count Distribution')
axes[0,1].set_xlabel('Total Transactions')
axes[0,1].set_ylabel('Frequency')

# 3. Ether balance distribution by fraud status
fraud_balance = df[df['FLAG']==1]['total ether balance']
legit_balance = df[df['FLAG']==0]['total ether balance']

axes[1,0].hist([legit_balance, fraud_balance], bins=50, alpha=0.7, 
               label=['Legitimate', 'Fraud'], color=['lightgreen', 'lightcoral'])
axes[1,0].set_title('💰 Ether Balance by Address Type')
axes[1,0].set_xlabel('Total Ether Balance')
axes[1,0].set_ylabel('Frequency')
axes[1,0].legend()

# 4. Correlation heatmap (first 10 numeric columns, not ranked by importance)
numeric_cols = df.select_dtypes(include=[np.number]).columns[:10]
corr_matrix = df[numeric_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, ax=axes[1,1])
axes[1,1].set_title('🔥 Feature Correlation Matrix')

plt.tight_layout()
plt.show()

print("📊 Visualization complete!")

## ⚙️ Feature Engineering and Preprocessing

In [None]:
# Prepare features and target
print("🔧 Starting feature engineering...")

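# Remove non-feature columns and keep numeric features only
# (text columns cannot be median-imputed or passed to the models as-is)
feature_columns = (
    df.drop(columns=['Address', 'FLAG', 'Index', 'Unnamed: 0'], errors='ignore')
      .select_dtypes(include=[np.number])
      .columns
)
X = df[feature_columns].copy()
y = df['FLAG'].copy()

print(f"✅ Features selected: {len(feature_columns)}")
print(f"🎯 Target variable: {y.name} (fraud detection)")

# Handle missing values with per-column medians
X = X.fillna(X.median())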

# Create additional engineered features
X['transaction_frequency'] = X['Sent tnx'] + X['Received Tnx']
X['balance_ratio'] = X['total ether received'] / (X['total Ether sent'] + 1e-8)
X['avg_transaction_value'] = (X['total Ether sent'] + X['total ether received']) / (X['transaction_frequency'] + 1e-8)
X['unique_interaction_ratio'] = (X['Unique Sent To Addresses'] + X['Unique Received From Addresses']) / (X['transaction_frequency'] + 1e-8)

print(f"🚀 Enhanced features: {len(X.columns)} (added {len(X.columns) - len(feature_columns)} engineered features)")

# Remove infinite and NaN values
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(0)

print("✅ Feature engineering complete!")
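
The ratio features above add `1e-8` to each denominator. When the denominator is zero and the numerator is not, that produces a very large finite value instead of `inf`, so the later `inf` replacement never catches it. The sketch below contrasts that behavior with one possible alternative that maps zero-denominator rows to 0. The helper `safe_ratio` and the toy values are hypothetical, and whether 0 is the right fill value depends on how the feature should be read.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields 0.0 where the denominator is 0."""
    denom = denominator.replace(0, np.nan)
    return (numerator / denom).fillna(0.0)

# Toy rows: normal case, zero sent with nonzero received, both zero.
toy = pd.DataFrame({
    'total ether received': [10.0, 5.0, 0.0],
    'total Ether sent': [4.0, 0.0, 0.0],
})

# Epsilon approach used in the cell above.
epsilon_ratio = toy['total ether received'] / (toy['total Ether sent'] + 1e-8)

# Masked approach: zero denominators become 0.0 instead of a huge value.
masked_ratio = safe_ratio(toy['total ether received'], toy['total Ether sent'])

print(pd.DataFrame({'epsilon': epsilon_ratio, 'masked': masked_ratio}))
```

In the second row the epsilon version is on the order of 5e8 while the masked version is 0.0. Tree-based models tolerate such outliers fairly well, but scale-sensitive models and the `StandardScaler` statistics do not.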

## 🎯 Train-Test Split and Scaling

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=CONFIG['test_size'], 
    random_state=CONFIG['random_state'], 
    stratify=y
)

# Further split training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=CONFIG['validation_size'], 
    random_state=CONFIG['random_state'], 
    stratify=y_train
)

print(f"📊 Data split complete:")
print(f"  Training: {X_train.shape[0]:,} samples ({y_train.mean():.2%} fraud)")
print(f"  Validation: {X_val.shape[0]:,} samples ({y_val.mean():.2%} fraud)")
print(f"  Test: {X_test.shape[0]:,} samples ({y_test.mean():.2%} fraud)")

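# Scale features (fitted on the training split only).
# The tree-based models below are trained on the unscaled features;
# the scaled arrays are only needed for scale-sensitive models.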
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print("✅ Feature scaling complete!")
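
Because the validation split is taken from what remains after the test split, `validation_size=0.2` means 20% of the remaining 80%, which is 16% of the full dataset. The resulting proportions are roughly 64% train, 16% validation, and 20% test. The sketch below checks that arithmetic on synthetic data; `X_demo`, `y_demo`, and the 1,000-row size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

test_size = 0.2        # CONFIG['test_size']
validation_size = 0.2  # CONFIG['validation_size']

# Synthetic stand-in: 1,000 rows, 20% positive class.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([0] * 800 + [1] * 200)

# First split: hold out the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=test_size, random_state=42, stratify=y_demo
)

# Second split: carve the validation set out of the remainder.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=validation_size, random_state=42, stratify=y_rest
)

fractions = {
    'train': len(X_train) / len(X_demo),
    'validation': len(X_val) / len(X_demo),
    'test': len(X_test) / len(X_demo),
}
print(fractions)
```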

## 🤖 Model Training and Evaluation

In [None]:
# Initialize models
models = {
    'Random Forest': RandomForestClassifier(
        n_estimators=100, 
        random_state=CONFIG['random_state'],
        n_jobs=-1
    ),
    'XGBoost': xgb.XGBClassifier(
        random_state=CONFIG['random_state'],
        eval_metric='logloss'
    ),
    'LightGBM': lgb.LGBMClassifier(
        random_state=CONFIG['random_state'],
        verbose=-1
    )
}

results = {}
trained_models = {}

print("🚀 Starting model training...\n")

for name, model in models.items():
    print(f"⏳ Training {name}...")
    
    # Train model
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_val)
    y_pred_proba = model.predict_proba(X_val)[:, 1]
    
    # Metrics
    auc_score = roc_auc_score(y_val, y_pred_proba)
    
    results[name] = {
        'auc_score': auc_score,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    trained_models[name] = model
    
    print(f"✅ {name} - AUC Score: {auc_score:.4f}")
    print(f"📊 Classification Report:")
    print(classification_report(y_val, y_pred))
    print("-" * 50)

print("🎉 All models trained successfully!")
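
The models above are trained with default class weighting. If the fraud class turns out to be a minority in the training split, class weights are one common way to compensate. The sketch below computes the two quantities usually involved: the negative-to-positive ratio that `XGBClassifier` accepts as `scale_pos_weight`, and the per-class weights that scikit-learn derives for `class_weight='balanced'`. The label counts are hypothetical, and whether weighting helps on this dataset would need to be checked on the validation split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 780 legitimate (0) and 220 fraud (1).
y_train_demo = np.array([0] * 780 + [1] * 220)

n_negative = int((y_train_demo == 0).sum())
n_positive = int((y_train_demo == 1).sum())

# Ratio commonly passed as XGBClassifier(scale_pos_weight=...).
scale_pos_weight = n_negative / n_positive

# Weights scikit-learn uses for class_weight='balanced':
# n_samples / (n_classes * class_count).
balanced = compute_class_weight(
    'balanced', classes=np.array([0, 1]), y=y_train_demo
)
class_weight = {0: float(balanced[0]), 1: float(balanced[1])}

print(f"scale_pos_weight: {scale_pos_weight:.3f}")
print(f"balanced class weights: {class_weight}")
```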

## 📊 Model Comparison and Selection

In [None]:
# Compare model performance
performance_df = pd.DataFrame({
    'Model': list(results.keys()),
    'AUC Score': [results[model]['auc_score'] for model in results.keys()]
})

performance_df = performance_df.sort_values('AUC Score', ascending=False)

print("🏆 Model Performance Ranking:")
print(performance_df.to_string(index=False))

# Visualize model comparison
plt.figure(figsize=(10, 6))
bars = plt.bar(performance_df['Model'], performance_df['AUC Score'], 
               color=['gold', 'silver', '#CD7F32'])
plt.title('🎯 Model Performance Comparison (AUC Score)', fontsize=14, fontweight='bold')
plt.ylabel('AUC Score')
plt.ylim(0.8, 1.0)

# Add value labels on bars
for bar, score in zip(bars, performance_df['AUC Score']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
             f'{score:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Select best model
best_model_name = performance_df.iloc[0]['Model']
best_model = trained_models[best_model_name]
print(f"\n🥇 Best performing model: {best_model_name} (AUC: {performance_df.iloc[0]['AUC Score']:.4f})")

## 🧪 Final Model Evaluation

In [None]:
# Final evaluation on test set
print(f"🧪 Final evaluation of {best_model_name} on test set...\n")

# Test set predictions
y_test_pred = best_model.predict(X_test)
y_test_proba = best_model.predict_proba(X_test)[:, 1]

# Final metrics
final_auc = roc_auc_score(y_test, y_test_proba)

print(f"🎯 Final Test AUC Score: {final_auc:.4f}")
print("\n📊 Detailed Classification Report:")
print(classification_report(y_test, y_test_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Legitimate', 'Fraud'],
            yticklabels=['Legitimate', 'Fraud'])
plt.title(f'🎯 Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

print(f"\n✅ Model evaluation complete!")
print(f"🎉 {best_model_name} achieved {final_auc:.1%} AUC score on test data!")
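
ROC AUC measures how well the model ranks fraud above legitimate addresses across all thresholds, so it is not the same number as accuracy, which depends on one chosen threshold. The sketch below illustrates the difference on ten hand-made scores. The arrays `y_true` and `y_score` are toy values, not model output.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)

# Toy labels and predicted fraud probabilities (6 legitimate, 4 fraud).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45,
                    0.40, 0.55, 0.70, 0.90])

# Threshold-free ranking quality.
auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC: {auc:.4f}")

# Threshold-dependent metrics at a few cut-offs.
metrics_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    metrics_by_threshold[threshold] = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
    }
    print(threshold, metrics_by_threshold[threshold])
```

Lowering the threshold raises recall at the cost of precision, which is usually the trade-off that matters when deciding how aggressively to flag addresses.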

## 💾 Model Export and Deployment Prep

In [None]:
# Save the best model and scaler
import joblib

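# Save model and preprocessing objects
# Note: the models above were trained on unscaled features, so the saved
# scaler should not be applied before calling this model at inference time.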
joblib.dump(best_model, 'hibr_fraud_detection_model.pkl')
joblib.dump(scaler, 'hibr_feature_scaler.pkl')
joblib.dump(list(X.columns), 'hibr_feature_names.pkl')

print("💾 Model artifacts saved:")
print("  - hibr_fraud_detection_model.pkl (trained model)")
print("  - hibr_feature_scaler.pkl (feature scaler)")
print("  - hibr_feature_names.pkl (feature list)")

# Create model summary
model_summary = {
    'model_type': best_model_name,
    'training_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
    'dataset_size': len(df),
    'features_count': len(X.columns),
    'test_auc_score': final_auc,
    'fraud_rate': y.mean(),
    'feature_names': list(X.columns)
}

import json
with open('model_summary.json', 'w') as f:
    json.dump(model_summary, f, indent=2)

print("\n📄 Model summary created: model_summary.json")
print(f"\n🎯 Model Performance Summary:")
print(f"   Algorithm: {best_model_name}")
print(f"   AUC Score: {final_auc:.4f}")
print(f"   Trained on: {len(df):,} Ethereum addresses")
print(f"   Features: {len(X.columns)} (including engineered features)")
print(f"   Test accuracy: {(y_test_pred == y_test).mean():.1%}")

# Download files
files.download('hibr_fraud_detection_model.pkl')
files.download('hibr_feature_scaler.pkl')
files.download('hibr_feature_names.pkl')
files.download('model_summary.json')

print("\n✅ Model training pipeline complete!")
print("🚀 Ready for deployment to HuggingFace Spaces!")
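
The saved feature-name list matters at inference time because the model expects the same columns in the same order as during training. The sketch below shows one way a consumer of these artifacts might reload them and align an incoming row. The tiny model, the three-column feature list, and the temporary file names are stand-ins for illustration. Filling absent columns with 0 mirrors the `fillna(0)` step in the feature engineering cell, but it is an assumption about the desired behavior.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in for the trained model and its feature list.
feature_names = ['Sent tnx', 'Received Tnx', 'total ether balance']
train = pd.DataFrame(
    [[1, 2, 0.5], [40, 3, 0.0], [2, 30, 9.0], [55, 1, 0.1]],
    columns=feature_names,
)
labels = [0, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(train, labels)

# Round-trip the artifacts through joblib, as the export cell does.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'model.pkl')
    names_path = os.path.join(tmp, 'feature_names.pkl')
    joblib.dump(model, model_path)
    joblib.dump(feature_names, names_path)
    loaded_model = joblib.load(model_path)
    loaded_names = joblib.load(names_path)

# Incoming row with shuffled columns, one missing and one extra column.
incoming = pd.DataFrame({
    'total ether balance': [0.2],
    'Sent tnx': [50],
    'extra': [1],
})

# Reorder to the training layout; absent columns are filled with 0.
aligned = incoming.reindex(columns=loaded_names, fill_value=0)
fraud_probability = loaded_model.predict_proba(aligned)[0, 1]
print(f"Fraud probability: {fraud_probability:.2f}")
```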

## 🎯 Quick Model Testing

In [None]:
# Test the model with sample predictions
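def predict_fraud_risk(model, scaler, features, sample_idx=None):
    """Predict fraud risk for one row of `features` and compare it with y_test.

    `features` must be row-aligned with the global `y_test`. `scaler` is kept
    in the signature but not applied, because the tree-based models were
    trained on unscaled features.
    """
    
    if sample_idx is None:
        sample_idx = np.random.choice(len(features))
    
    sample_features = features.iloc[sample_idx:sample_idx+1]
    actual_label = y_test.iloc[sample_idx]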
    
    # Get prediction
    fraud_probability = model.predict_proba(sample_features)[0][1]
    prediction = model.predict(sample_features)[0]
    
    # Risk level
    if fraud_probability >= 0.8:
        risk_level = "🔴 CRITICAL"
    elif fraud_probability >= 0.6:
        risk_level = "🟠 HIGH"
    elif fraud_probability >= 0.4:
        risk_level = "🟡 MEDIUM"
    elif fraud_probability >= 0.2:
        risk_level = "🟢 LOW"
    else:
        risk_level = "⚪ MINIMAL"
    
    print(f"🎯 Sample Address Analysis (Index: {sample_idx})")
    print(f"   Fraud Probability: {fraud_probability:.1%}")
    print(f"   Risk Level: {risk_level}")
    print(f"   Prediction: {'FRAUD' if prediction else 'LEGITIMATE'}")
    print(f"   Actual Label: {'FRAUD' if actual_label else 'LEGITIMATE'}")
    print(f"   Correct: {'✅' if prediction == actual_label else '❌'}")
    
    return fraud_probability, prediction, actual_label

# Test with several random samples
print("🧪 Testing model with random samples...\n")
for i in range(5):
    predict_fraud_risk(best_model, scaler, X_test)
    print("-" * 40)

print("\n🎉 Model testing complete!")
print("\n🚀 Your HIBR AI fraud detection model is ready for production!")
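
The risk banding inside `predict_fraud_risk` can also be pulled out into a standalone function, which makes the thresholds easier to test and reuse. The sketch below uses the same cut-offs (0.8, 0.6, 0.4, 0.2); the function name `risk_level` and the input validation are additions for illustration.

```python
def risk_level(fraud_probability):
    """Map a fraud probability in [0, 1] to the notebook's risk bands."""
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")

    # Lower bounds checked from the highest band down.
    bands = [
        (0.8, "CRITICAL"),
        (0.6, "HIGH"),
        (0.4, "MEDIUM"),
        (0.2, "LOW"),
    ]
    for lower_bound, label in bands:
        if fraud_probability >= lower_bound:
            return label
    return "MINIMAL"

for probability in (0.95, 0.6, 0.39, 0.05):
    print(f"{probability:.2f} -> {risk_level(probability)}")
```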

---

## 📋 Next Steps

### 🚀 **Deployment Options:**
1. **HuggingFace Spaces**: Deploy as web app with Gradio interface
2. **API Server**: Create REST API for integration
3. **Desktop Application**: Package as standalone tool

### 🔄 **Model Enhancement:**
1. **Additional Datasets**: Integrate Elliptic++, Bitcoin data
2. **Live APIs**: Add HIBP, Shodan, VirusTotal integration
3. **Real-time Features**: Blockchain query capabilities

### 🎯 **Production Features:**
1. **Batch Analysis**: Process multiple addresses
2. **Risk Reporting**: Generate detailed threat reports
3. **Alert System**: Monitor addresses for changes

---

**🎉 Congratulations! You've successfully trained an AI model for cryptocurrency fraud detection!**

*This notebook was created by the Have I Been Rekt project, which builds open source blockchain investigation tools.*