# Sentinel-1: Ethereum On-Chain Fraud Detection Pipeline
This notebook contains the full training pipeline for the Sentinel-1 project and is designed to run on Kaggle. It covers feature selection, data cleaning, training a Random Forest model as specified in the project PRD, evaluation on a held-out split, and export of the trained model.

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

# Use seaborn's whitegrid style for all plots
sns.set(style="whitegrid")

## 1. Data Ingestion & Feature Selection
We isolate the four behavioral pillars defined in the PRD:
- **Velocity**: `Avg min between sent tnx`
- **Lifespan**: `Time Diff between first and last (Mins)`
- **Outflow**: `Sent tnx`
- **Inflow**: `Received Tnx`

In [None]:
# Load data
file_path = '/kaggle/input/ethereum-frauddetection-dataset/transaction_dataset.csv' # Adjust path for local run
try:
    df = pd.read_csv(file_path)
except FileNotFoundError:
    # Fallback to current directory for local testing
    df = pd.read_csv('transaction_dataset.csv')

selected_features = [
    'Avg min between sent tnx',
    'Time Diff between first and last (Mins)',
    'Sent tnx',
    'Received Tnx'
]
target = 'FLAG'

print(f"Initial shape: {df.shape}")
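
Before slicing the frame, it can help to confirm that every selected column is actually present, since a single misspelled header would otherwise raise a `KeyError` later. The snippet below is a minimal sketch of that check; `find_missing_columns` is a hypothetical helper and the tiny `toy` frame is made-up data standing in for the real CSV.

```python
import pandas as pd

# Column names copied from the feature list above.
SELECTED_FEATURES = [
    "Avg min between sent tnx",
    "Time Diff between first and last (Mins)",
    "Sent tnx",
    "Received Tnx",
]
TARGET = "FLAG"


def find_missing_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from `frame`.

    Hypothetical helper for illustration; not part of the original pipeline.
    """
    return [name for name in required if name not in frame.columns]


# Made-up rows standing in for the real CSV; "Received Tnx" is omitted on purpose.
toy = pd.DataFrame(
    {
        "Avg min between sent tnx": [12.5, 0.0],
        "Time Diff between first and last (Mins)": [1440.0, 3.2],
        "Sent tnx": [10, 1],
        "FLAG": [0, 1],
    }
)

missing = find_missing_columns(toy, SELECTED_FEATURES + [TARGET])
print(missing)  # ['Received Tnx']
```

On the real data the same call would be `find_missing_columns(df, selected_features + [target])`, and an empty list means the selection is safe.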

## 2. Data Quality & Cleaning
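The PRD requires an automated audit to detect and drop NaN values. We report the missing-value count per column, then drop every row that has a NaN in any selected feature or in the target.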

In [None]:
# Clean data
df_clean = df[selected_features + [target]].copy()
print(f"Missing values before cleaning:\n{df_clean.isnull().sum()}")

df_clean = df_clean.dropna()
print(f"Final shape after cleaning: {df_clean.shape}")

X = df_clean[selected_features]
y = df_clean[target]
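
The audit-then-drop step above can also be wrapped in a small function so the NaN counts are returned rather than only printed. This is a sketch under assumptions: `audit_and_drop_nans` is a hypothetical name, and the `toy` frame is invented data used only to show the behavior.

```python
import numpy as np
import pandas as pd


def audit_and_drop_nans(frame: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Return (frame without NaN rows, per-column NaN counts).

    Hypothetical helper for illustration; not part of the original pipeline.
    """
    nan_counts = frame.isnull().sum()  # audit: how many NaNs per column
    cleaned = frame.dropna()           # drop any row containing a NaN
    return cleaned, nan_counts


# Invented rows: one NaN in each feature column, none in the target.
toy = pd.DataFrame(
    {
        "Sent tnx": [10, np.nan, 3, 7],
        "Received Tnx": [4, 2, np.nan, 1],
        "FLAG": [0, 1, 0, 1],
    }
)

cleaned, nan_counts = audit_and_drop_nans(toy)
rows_dropped = len(toy) - len(cleaned)

print(nan_counts)
print(f"Rows dropped: {rows_dropped}")
```

Returning the counts makes it easy to log them or to fail the run if too many rows would be lost.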

## 3. Model Training (Random Forest)
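We use a `RandomForestClassifier` with `max_depth=5`. Capping tree depth limits how closely each tree can fit the training data, which reduces (but does not eliminate) the risk of overfitting.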

In [None]:
# Hold out 20% of the rows for evaluation (not strictly required by the PRD, but good practice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and Train
model = RandomForestClassifier(max_depth=5, random_state=42)
print("Training model...")
model.fit(X_train, y_train)

print("Training Complete.")
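
To see what the depth cap does in practice, one option is to fit a capped and an uncapped forest on the same split and compare tree depth with train and test accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in for the four features, so the numbers it prints say nothing about the real dataset; `n_estimators=50` is an arbitrary choice to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with four features; NOT the Ethereum dataset.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in (5, None):  # None lets every tree grow until its leaves are pure
    forest = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=42)
    forest.fit(X_train, y_train)
    results[depth] = {
        "deepest_tree": max(tree.get_depth() for tree in forest.estimators_),
        "train_acc": forest.score(X_train, y_train),
        "test_acc": forest.score(X_test, y_test),
    }

for depth, stats in results.items():
    print(
        f"max_depth={depth}: deepest tree={stats['deepest_tree']}, "
        f"train acc={stats['train_acc']:.3f}, test acc={stats['test_acc']:.3f}"
    )
```

A large gap between train and test accuracy is the usual sign of overfitting, so comparing the two rows gives a rough sense of whether the cap is helping.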

## 4. Evaluation
We evaluate the model on the held-out test set with a classification report and a confusion matrix.

In [None]:
y_pred = model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
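
The precision and recall figures in the classification report can be read directly off the confusion matrix. The worked example below uses ten invented labels (1 = fraud) rather than real model output, and assumes the binary layout scikit-learn uses, where `ravel()` yields `tn, fp, fn, tp`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration only; 1 marks a fraudulent account.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the accounts flagged as fraud, how many were fraud
recall = tp / (tp + fn)     # of the fraudulent accounts, how many were flagged

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")

# The hand-computed values match scikit-learn's own metrics.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```

Here one legitimate account is wrongly flagged and one fraudulent account is missed, so both precision and recall come out to 3/4.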

## 5. Artifact Serialization
We serialize the trained model with `joblib` so that the Streamlit application can load it for inference.

In [None]:
artifact_name = 'eth_fraud_model.pkl'
joblib.dump(model, artifact_name)
print(f"Model artifact saved as {artifact_name}")
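
On the consuming side, the artifact would be read back with `joblib.load` and given a frame whose columns match the training features. The round-trip sketch below makes several assumptions: the training rows are invented, the model is a small stand-in rather than the one trained above, and a temporary directory is used so nothing is written next to the notebook. How the Streamlit app actually loads the file is not shown in this notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "Avg min between sent tnx",
    "Time Diff between first and last (Mins)",
    "Sent tnx",
    "Received Tnx",
]

# Invented training rows; a stand-in model, not the real Sentinel-1 artifact.
train = pd.DataFrame(
    [
        [12.5, 1440.0, 10, 4],
        [0.0, 3.2, 1, 0],
        [30.0, 50000.0, 25, 40],
        [0.1, 5.0, 2, 0],
    ],
    columns=feature_names,
)
labels = [0, 1, 0, 1]
model = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=42)
model.fit(train, labels)

# Dump and reload through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "eth_fraud_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# Score one made-up account; the columns must match the training features.
new_account = pd.DataFrame([[0.05, 4.0, 1, 0]], columns=feature_names)
same_prediction = bool(
    (model.predict(new_account) == restored.predict(new_account)).all()
)
print(f"Restored model agrees with original: {same_prediction}")
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, so pinning the version in the app's requirements is worth considering.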