# 🚀 Multiclass Fraud Detection Training on Kaggle

This notebook is optimized for Kaggle's environment with GPU acceleration and **multiclass classification**.

## 📊 Dataset
- Upload your `final_fraud_detection_dataset.csv`
- **NEW: Supports 9-class classification** (8 scam types + legitimate)
- Classes: `legitimate`, `phishing`, `popup_scam`, `sms_spam`, `reward_scam`, `tech_support_scam`, `refund_scam`, `ssn_scam`, `job_scam`

## 🎯 Models
- Traditional ML: TF-IDF + Logistic Regression/SVM (multiclass)
- Deep Learning: BERT-based classifier (9 classes)

## ⚡ Kaggle Advantages
- Free GPU access (for example, Tesla P100 or T4, depending on the accelerator assigned)
- Pre-installed ML libraries
- Easy dataset upload
- Community sharing

## 🎪 Multiclass Benefits
- **Granular fraud detection**: Identify specific scam types
- **Better actionable insights**: Know which type of fraud to defend against
- **Improved model interpretability**: Understand fraud patterns by category

In [19]:
# Install additional packages if needed
!pip install transformers torch --quiet

# Import libraries
import pandas as pd
import numpy as np
import torch
import warnings
warnings.filterwarnings('ignore')

print("✅ Environment ready!")
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

✅ Environment ready!
GPU Available: True
GPU: Tesla T4


In [20]:
# Load your dataset
try:
    df = pd.read_csv('/kaggle/input/fraud-detection-dataset/final_fraud_detection_dataset.csv')
    print(f"✅ Dataset loaded: {len(df)} samples")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Label distribution: {df['binary_label'].value_counts()}")
except FileNotFoundError:
    print("❌ Dataset not found. Please upload your CSV file.")
    # Create sample data for demonstration
    print("📝 Using sample data instead...")
    # [Sample data creation code here]

✅ Dataset loaded: 194913 samples
Columns: ['text', 'binary_label', 'detailed_category', 'data_type']
Label distribution: binary_label
0    101717
1     93196
Name: count, dtype: int64


In [21]:
# Data preprocessing for MULTICLASS classification
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Use detailed_category for multiclass classification (9 classes)
print("📊 Dataset Overview:")
print(f"Total samples: {len(df)}")
print(f"Classes available: {df['detailed_category'].unique()}")
print(f"Class distribution:\n{df['detailed_category'].value_counts()}")

# Split data using detailed_category for multiclass
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['detailed_category'],
    test_size=0.2,
    random_state=42,
    stratify=df['detailed_category']
)

print(f"\n🔄 Data Split:")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print(f"Training class distribution:\n{y_train.value_counts()}")

📊 Dataset Overview:
Total samples: 194913
Classes available: ['job_scam' 'legitimate' 'phishing' 'popup_scam' 'refund_scam'
 'reward_scam' 'sms_spam' 'ssn_scam' 'tech_support_scam']
Class distribution:
detailed_category
legitimate           101717
phishing              71857
popup_scam            11333
sms_spam               6988
reward_scam             606
tech_support_scam       605
refund_scam             604
ssn_scam                604
job_scam                599
Name: count, dtype: int64

🔄 Data Split:
Training samples: 155930
Testing samples: 38983
Training class distribution:
detailed_category
legitimate           81374
phishing             57486
popup_scam            9066
sms_spam              5590
reward_scam            485
tech_support_scam      484
refund_scam            483
ssn_scam               483
job_scam               479
Name: count, dtype: int64


In [None]:
# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF features: {X_train_tfidf.shape[1]}")
print("✅ Text vectorization complete!")
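
For intuition, here is a toy pure-Python version of what TF-IDF computes. It is a sketch, assuming scikit-learn's default smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation. It uses plain whitespace tokenisation and skips the stop-word removal and `max_features` cap used above; the three example messages are made up.

```python
import math
from collections import Counter

# Made-up messages, for illustration only.
docs = [
    "verify your account now",
    "your package has shipped",
    "claim your free reward now",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(tok for doc in tokenized for tok in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_vector(tokens):
    """Return an L2-normalised TF-IDF vector over `vocab`."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]

# The term that appears in every message gets the lowest IDF.
print(min(idf, key=idf.get))
```

Terms shared by every message carry little weight, while rarer terms such as "verify" or "reward" dominate each vector. This is why TF-IDF features can separate scam categories that use distinctive vocabulary.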

In [None]:
# Train traditional ML models for MULTICLASS classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Encode labels for multiclass (9 classes)
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

print(f"📋 Label encoding:")
print(f"Number of classes: {len(le.classes_)}")
print(f"Classes: {le.classes_}")

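# Logistic Regression for multiclass
# Note: the multi_class argument is deprecated as of scikit-learn 1.5; on newer
# versions, wrap the estimator in sklearn.multiclass.OneVsRestClassifier instead.
lr_model = LogisticRegression(
    random_state=42,
    max_iter=1000,  # Increased iterations for multiclass
    multi_class='ovr'  # One-vs-Rest for multiclass
)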
lr_model.fit(X_train_tfidf, y_train_encoded)

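# SVM for multiclass
# Note: SVC trains one-vs-one classifiers internally; decision_function_shape='ovr'
# only changes the shape of decision_function's output. On ~156k samples a kernel
# SVC with probability=True can be very slow; LinearSVC is a faster alternative.
svm_model = SVC(
    kernel='linear',
    probability=True,
    random_state=42,
    decision_function_shape='ovr'  # One-vs-rest shaped decision values
)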
svm_model.fit(X_train_tfidf, y_train_encoded)

print("✅ Multiclass models trained!")

In [None]:
# Evaluate MULTICLASS models
models = {'Logistic Regression': lr_model, 'SVM': svm_model}

for name, model in models.items():
    y_pred = model.predict(X_test_tfidf)
    print(f"\n🔍 {name} Results (Multiclass):")
    print(classification_report(y_test_encoded, y_pred, target_names=le.classes_))
    
    # Confusion Matrix
    cm = confusion_matrix(y_test_encoded, y_pred)
    print(f"\nConfusion Matrix shape: {cm.shape}")
    print(cm)  # 9x9 matrix; rows are true classes, columns are predicted classes
    
    # Overall accuracy and F1 scores
    from sklearn.metrics import accuracy_score, f1_score
    accuracy = accuracy_score(y_test_encoded, y_pred)
    f1_macro = f1_score(y_test_encoded, y_pred, average='macro')
    f1_weighted = f1_score(y_test_encoded, y_pred, average='weighted')
    
    print(f"📊 Overall Metrics:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1-Score (Macro): {f1_macro:.4f}")
    print(f"F1-Score (Weighted): {f1_weighted:.4f}")

In [22]:
# BERT Training for MULTICLASS classification (GPU accelerated)
import torch
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    get_linear_schedule_with_warmup
)
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader

class FraudDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        # Handle both pandas Series and numpy arrays
        if hasattr(self.texts, 'iloc'):
            text = str(self.texts.iloc[idx])
        else:
            text = str(self.texts[idx])
            
        if hasattr(self.labels, 'iloc'):
            label = self.labels.iloc[idx]
        else:
            label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

print("🚀 Initializing BERT for MULTICLASS classification...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# IMPORTANT: Set num_labels to the number of classes (9 here: 8 scam types + legitimate)
num_classes = len(le.classes_)
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', 
    num_labels=num_classes
)

# Move to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(f"Using device: {device}")
print(f"🎯 Multiclass setup: {num_classes} classes")
print(f"Classes: {', '.join(le.classes_)}")

🚀 Initializing BERT for MULTICLASS classification...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using device: cuda
🎯 Multiclass setup: 9 classes
Classes: job_scam, legitimate, phishing, popup_scam, refund_scam, reward_scam, sms_spam, ssn_scam, tech_support_scam


In [23]:
# Prepare BERT datasets
train_dataset = FraudDataset(X_train, y_train_encoded, tokenizer)
test_dataset = FraudDataset(X_test, y_test_encoded, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

print(f"Training batches: {len(train_loader)}")
print(f"Testing batches: {len(test_loader)}")

Training batches: 9746
Testing batches: 2437


In [24]:
# Training loop for MULTICLASS BERT
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_loader) * 3
)

model.train()
for epoch in range(3):
    print(f"\n🚀 Epoch {epoch + 1}/3")
    total_loss = 0
    
    for batch in train_loader:
        optimizer.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        # CrossEntropyLoss handles multiclass automatically
        loss = outputs.loss
        total_loss += loss.item()
        
        loss.backward()
        optimizer.step()
        scheduler.step()
    
    avg_loss = total_loss / len(train_loader)
    print(f"Average loss: {avg_loss:.4f}")

print("✅ BERT multiclass training complete!")


🚀 Epoch 1/3
Average loss: 0.1282

🚀 Epoch 2/3
Average loss: 0.0393

🚀 Epoch 3/3
Average loss: 0.0130
✅ BERT multiclass training complete!
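
The scheduler in the training cell decays the learning rate linearly from 2e-5 to zero over all optimizer steps. Below is a pure-Python sketch of that multiplier, assuming the usual linear warmup-then-decay rule. `linear_lr_factor` is a hypothetical name, and the step count follows from the 155,930 training samples and batch size 16 reported earlier.

```python
import math


def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Hypothetical helper: multiplier applied to the base learning rate."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


batches_per_epoch = math.ceil(155930 / 16)  # matches len(train_loader)
total_steps = batches_per_epoch * 3         # three epochs
base_lr = 2e-5

# Learning rate at the start, midpoint, and end of training (no warmup).
for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_lr_factor(step, 0, total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

The notebook uses `num_warmup_steps=0`, so training starts at the full learning rate. A short warmup, for example a few percent of the total steps, is a common alternative when early training is unstable.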


In [25]:
# Evaluate BERT MULTICLASS model
model.eval()
predictions = []
true_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label']
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=1)
        
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(labels.numpy())

print("\n🎯 BERT Multiclass Evaluation Results:")
print(classification_report(true_labels, predictions, target_names=le.classes_))

# Confusion Matrix
cm = confusion_matrix(true_labels, predictions)
print(f"\nConfusion Matrix shape: {cm.shape}")

# Overall metrics
from sklearn.metrics import accuracy_score, f1_score
accuracy = accuracy_score(true_labels, predictions)
f1_macro = f1_score(true_labels, predictions, average='macro')
f1_weighted = f1_score(true_labels, predictions, average='weighted')

print(f"\n📊 BERT Overall Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1-Score (Macro): {f1_macro:.4f}")
print(f"F1-Score (Weighted): {f1_weighted:.4f}")

# Show per-class performance
print(f"\n🏷️ Per-Class F1 Scores:")
f1_per_class = f1_score(true_labels, predictions, average=None)
for i, class_name in enumerate(le.classes_):
    print(f"{class_name}: {f1_per_class[i]:.4f}")


🎯 BERT Multiclass Evaluation Results:
                   precision    recall  f1-score   support

         job_scam       0.67      0.58      0.62       120
       legitimate       0.98      0.98      0.98     20343
         phishing       0.98      0.97      0.98     14371
       popup_scam       1.00      1.00      1.00      2267
      refund_scam       0.99      1.00      1.00       121
      reward_scam       1.00      1.00      1.00       121
         sms_spam       0.99      0.99      0.99      1398
         ssn_scam       1.00      1.00      1.00       121
tech_support_scam       1.00      0.99      1.00       121

         accuracy                           0.98     38983
        macro avg       0.96      0.95      0.95     38983
     weighted avg       0.98      0.98      0.98     38983


Confusion Matrix shape: (9, 9)

📊 BERT Overall Metrics:
Accuracy: 0.9804
F1-Score (Macro): 0.9514
F1-Score (Weighted): 0.9803

🏷️ Per-Class F1 Scores:
job_scam: 0.6222
legitimate: 0.9817
phi
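
The gap between macro F1 (0.95) and weighted F1 (0.98) comes from `job_scam`: macro averaging counts each class equally, while weighted averaging scales each class by its support. The sketch below recomputes both from the rounded per-class values in the classification report, so the results only approximate the exact metrics printed by scikit-learn.

```python
# (f1-score, support) per class, copied from the classification report above.
report = {
    "job_scam": (0.62, 120),
    "legitimate": (0.98, 20343),
    "phishing": (0.98, 14371),
    "popup_scam": (1.00, 2267),
    "refund_scam": (1.00, 121),
    "reward_scam": (1.00, 121),
    "sms_spam": (0.99, 1398),
    "ssn_scam": (1.00, 121),
    "tech_support_scam": (1.00, 121),
}

total_support = sum(n for _, n in report.values())

# Macro: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
# Weighted: mean over classes, weighted by the number of test samples.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"macro F1    ~ {macro_f1:.3f}")
print(f"weighted F1 ~ {weighted_f1:.3f}")
```

Because `job_scam` has only 120 test samples, its low score barely moves accuracy or weighted F1. Macro F1 is the more informative number when rare scam types matter.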

In [26]:
# Save models for download
import joblib
import os

# Create output directory
os.makedirs('/kaggle/working/models', exist_ok=True)

# Save traditional ML models (uncomment if the TF-IDF cells above were run)
# joblib.dump(lr_model, '/kaggle/working/models/logistic_regression.pkl')
# joblib.dump(svm_model, '/kaggle/working/models/svm.pkl')
# joblib.dump(tfidf, '/kaggle/working/models/tfidf_vectorizer.pkl')

# Save the label encoder; it maps BERT's predicted indices back to class names
joblib.dump(le, '/kaggle/working/models/label_encoder.pkl')

# Save BERT model
model.save_pretrained('/kaggle/working/models/bert_model')
tokenizer.save_pretrained('/kaggle/working/models/bert_tokenizer')

print("💾 Models saved to /kaggle/working/models/")
print("Download them from the Output tab!")

💾 Models saved to /kaggle/working/models/
Download them from the Output tab!


# 📊 Results Summary

## 🎯 Performance Comparison
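- Compare all models' F1-scores, precision, and recall, per class as well as overall
- In this run, BERT reached 0.9804 accuracy and 0.9514 macro F1 on the test set; `job_scam` was the weakest class (F1 0.62), while every other class scored 0.98 or higher
- BERT needs a GPU and far more compute than the TF-IDF baselines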

## 💡 Next Steps
1. **Download Models**: Get your trained models from the Output tab
2. **Deploy**: Use the saved models in production
3. **Experiment**: Try different hyperparameters
4. **Share**: Publish your notebook to the Kaggle community
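
For the deploy step, the saved BERT model outputs one logit per class, and the index order is the alphabetical order the `LabelEncoder` printed earlier. Below is a minimal sketch of turning one row of logits into a label and a confidence. `decode_prediction` and the example logits are hypothetical; in practice the logits would come from the reloaded model.

```python
import math

# Class order as printed by the label encoder in this notebook.
CLASSES = [
    "job_scam", "legitimate", "phishing", "popup_scam", "refund_scam",
    "reward_scam", "sms_spam", "ssn_scam", "tech_support_scam",
]


def decode_prediction(logits, classes=CLASSES):
    """Hypothetical helper: softmax over logits, then pick the top class."""
    if len(logits) != len(classes):
        raise ValueError("expected one logit per class")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]


# Made-up logits, for illustration only.
example_logits = [-1.2, 0.3, 4.1, -0.5, -2.0, -1.7, 0.9, -2.2, -1.9]
label, confidence = decode_prediction(example_logits)
print(label, round(confidence, 3))
```

Saving the fitted label encoder alongside the model avoids hard-coding this class list in the serving code.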

## ⚡ Kaggle Tips
- Use GPU accelerator for faster training
- Save models regularly to avoid losing progress
- Monitor memory usage with large datasets
- Use the Discussion forum for questions