<a href="https://colab.research.google.com/github/BalaAnbalagan/autogluon-assignment/blob/master/part1-kaggle/ieee-fraud-detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IEEE-CIS Fraud Detection (Binary Classification)

## 🎯 Objective
Build an AutoML binary classifier to detect fraudulent transactions using AutoGluon.

**Task**: Binary Classification  
**Dataset**: IEEE-CIS Fraud Detection (Kaggle)  
**Target**: `isFraud`  
**Metric**: ROC-AUC  

## 📋 What This Notebook Does
1. Install AutoGluon and dependencies
2. Load transaction and identity data from Kaggle
3. Merge datasets and prepare features
4. Train AutoGluon predictor with automatic model selection
5. Show leaderboard and feature importance
6. Generate predictions and save artifacts

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 📦 Install Dependencies

In [None]:
!pip install -q torch torchvision torchaudio
!pip install -q autogluon kaggle

## 📚 Import Libraries

In [None]:
import os
import time
import zipfile
import shutil
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor

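# Seed NumPy's global RNG. This alone does not make AutoGluon training
# deterministic (time-limited fits can still differ between runs).
np.random.seed(42)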

## 📥 Load Dataset

### Option A: Kaggle API (Recommended)
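1. Go to https://www.kaggle.com/settings/account
2. Click "Create New API Token" to download `kaggle.json`
3. Accept the competition rules on the competition page (the API download fails with a 403 error otherwise)
4. Upload `kaggle.json` when prompted below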

### Option B: Manual Upload
1. Download these 4 CSVs from the [Kaggle competition page](https://www.kaggle.com/c/ieee-fraud-detection/data)
   - train_transaction.csv
   - train_identity.csv
   - test_transaction.csv
   - test_identity.csv
2. Upload them when prompted below

In [None]:
# Choose data loading method
USE_KAGGLE_API = False  # True: download via Kaggle API (Option A); False: manual upload (Option B)
COMPETITION = "ieee-fraud-detection"

if USE_KAGGLE_API:
    # Upload kaggle.json
    from google.colab import files
    print("📤 Upload your kaggle.json file:")
    uploaded = files.upload()

    # Set up Kaggle credentials
    os.makedirs('/root/.kaggle', exist_ok=True)
    !mv kaggle.json /root/.kaggle/kaggle.json
    !chmod 600 /root/.kaggle/kaggle.json

    # Download competition data
    os.makedirs('data', exist_ok=True)
    print(f"\n📥 Downloading {COMPETITION} dataset...")
    !kaggle competitions download -c $COMPETITION -p data

    # Unzip all archives
    print("\n📂 Extracting files...")
    for filename in os.listdir('data'):
        if filename.endswith('.zip'):
            with zipfile.ZipFile(os.path.join('data', filename), 'r') as zip_ref:
                zip_ref.extractall('data')
    print("✅ Data downloaded and extracted!")

else:
    # Manual upload
    from google.colab import files
    print("📤 Upload these 4 files:")
    print("   1. train_transaction.csv")
    print("   2. train_identity.csv")
    print("   3. test_transaction.csv")
    print("   4. test_identity.csv")
    uploaded = files.upload()

    # Move files to data directory
    os.makedirs('data', exist_ok=True)
    for filename in uploaded.keys():
        shutil.move(filename, os.path.join('data', filename))
    print("✅ Files uploaded successfully!")
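
Either option should leave the four CSVs in `data/`. A minimal sketch of a check for that, assuming a hypothetical helper `missing_files` (not part of the original notebook) and demonstrated on a temporary directory:

```python
import os
import tempfile

EXPECTED_FILES = [
    "train_transaction.csv",
    "train_identity.csv",
    "test_transaction.csv",
    "test_identity.csv",
]

def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]

# Demo on a temporary directory that holds only one of the four files.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "train_transaction.csv"), "w").close()
    demo_missing = missing_files(demo_dir)

print("Missing:", demo_missing)
```

In the notebook, `missing_files('data')` returning an empty list would indicate that loading can proceed.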

## 🔧 Load and Merge Data

The dataset has two parts:
- **Transaction data**: Payment details, amounts, cards
- **Identity data**: Device and network information

We'll merge them on `TransactionID`.
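
A minimal sketch of the left join on toy frames (all values are made up for illustration). The `validate` and `indicator` arguments are optional pandas features that the notebook itself does not use. They are shown here as one way to confirm that `TransactionID` is unique on both sides and to see how many transactions have identity data:

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [50.0, 20.0, 75.5],
})
identity = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

# Left join keeps every transaction; rows without identity data get NaN.
# validate= raises MergeError if TransactionID is duplicated on either side.
merged = transactions.merge(
    identity, on="TransactionID", how="left",
    validate="one_to_one", indicator=True,
)

# Share of transactions that have a matching identity row.
identity_coverage = (merged["_merge"] == "both").mean()
print(merged)
print(f"Identity coverage: {identity_coverage:.2f}")
```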

In [None]:
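# Load transaction and identity tables
print("📖 Loading transaction data...")
train_transaction = pd.read_csv('data/train_transaction.csv')
train_identity = pd.read_csv('data/train_identity.csv')
test_transaction = pd.read_csv('data/test_transaction.csv')
test_identity = pd.read_csv('data/test_identity.csv')

# Merge transaction and identity data (left join keeps every transaction)
print("🔗 Merging datasets...")
train = train_transaction.merge(train_identity, on='TransactionID', how='left')
test = test_transaction.merge(test_identity, on='TransactionID', how='left')

# test_identity.csv names its identity columns `id-01`, ... while the train file
# uses `id_01`, ...; align them so the predictor sees matching feature names.
test.columns = test.columns.str.replace('-', '_', regex=False)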

print(f"\n✅ Data loaded successfully!")
print(f"   Train shape: {train.shape}")
print(f"   Test shape: {test.shape}")
print(f"\n📊 Target distribution:")
print(train['isFraud'].value_counts(normalize=True))

## 🎯 Set Target Label and Problem Type

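AutoGluon can infer from the two-valued `isFraud` column that this is a binary classification problem. Its default metric for classification is accuracy, so ROC-AUC is set explicitly when the predictor is created.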

In [None]:
# Define target label
LABEL = "isFraud"

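# AutoGluon can infer the problem type (binary classification) from the label,
# but its default classification metric is accuracy, so ROC-AUC is set
# explicitly via eval_metric when the predictor is created.
print(f"🎯 Target Label: {LABEL}")
print("📈 Metric: ROC-AUC (set explicitly via eval_metric)")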
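
ROC-AUC scores how well the predicted probabilities rank fraud above non-fraud, independent of any decision threshold, which is why it suits an imbalanced target like `isFraud`. A minimal sketch of that definition on made-up values follows. The `roc_auc` function is a hypothetical illustration, not the implementation AutoGluon uses:

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    This O(n_pos * n_neg) loop is for illustration, not for large data.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels (1 = fraud) and predicted fraud probabilities.
y_true = [0, 0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3]
auc = roc_auc(y_true, y_score)
print(f"ROC-AUC: {auc:.3f}")  # 4 of 6 positive/negative pairs are ordered correctly
```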

## 🚀 Train AutoGluon Model

AutoGluon will:
- Automatically handle missing values
- Preprocess features (type inference, categorical encoding)
- Train multiple models (LightGBM, CatBoost, Neural Networks, etc.)
- Combine the best models in a weighted ensemble

In [None]:
# Create save directory with timestamp
save_dir = f"ag-{int(time.time())}-ieee-fraud"

# Initialize predictor
predictor = TabularPredictor(
    label=LABEL,
    problem_type="binary",  # Explicitly set for clarity
    eval_metric="roc_auc",  # ROC-AUC for binary classification
    path=save_dir
)

# Train the model
print("🏋️ Training AutoGluon models...")
print("This may take 15-20 minutes...\n")

predictor = predictor.fit(
    train,
    presets="medium_quality",  # Balance between speed and accuracy
    time_limit=900,            # 15 minutes (adjust as needed)
    verbosity=2                # Show detailed progress
)

print("\n✅ Training complete!")

## 📊 Model Leaderboard

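Lists every trained model with its validation score (`score_val`). Because the training data is passed in, the `score_test` column is computed on rows the models have already seen and will look optimistic:

In [None]:
# Get leaderboard. `score_val` comes from AutoGluon's internal validation split;
# `score_test` here is scored on `train` itself, so rank models by `score_val`.
leaderboard = predictor.leaderboard(train, silent=True)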

print("🏆 Top 10 Models:")
display(leaderboard.head(10))

# Save leaderboard
leaderboard.to_csv('leaderboard.csv', index=False)
print("\n💾 Saved: leaderboard.csv")
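
For a less optimistic estimate, part of the labeled data can be held out before training and used for scoring afterwards. A minimal sketch, assuming a hypothetical helper `stratified_holdout` (not part of the original notebook) and a toy frame with made-up values:

```python
import pandas as pd

def stratified_holdout(df, label, frac=0.2, seed=42):
    """Split df into (fit_part, holdout), sampling `frac` of each class for the holdout.

    Assumes df has a unique index, as a freshly merged frame does.
    """
    holdout = df.groupby(label).sample(frac=frac, random_state=seed)
    fit_part = df.drop(index=holdout.index)
    return fit_part, holdout

# Toy frame with 10% positives (values are made up for illustration).
toy = pd.DataFrame({
    "TransactionAmt": range(100),
    "isFraud": [1] * 10 + [0] * 90,
})
fit_part, holdout = stratified_holdout(toy, "isFraud")
print(len(fit_part), len(holdout), holdout["isFraud"].mean())
```

In the notebook, the predictor would then be fit on `fit_part` and scored with `predictor.leaderboard(holdout)` or `predictor.evaluate(holdout)`. Since the competition's test transactions come later in time than the training ones, a split ordered by `TransactionDT` may reflect the task more closely than a random one.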

## 🔍 Feature Importance

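AutoGluon estimates importance by permutation: it shuffles one feature at a time and measures how much the evaluation metric drops. Higher values mean the model leans on that feature more:

In [None]:
# Get permutation feature importance.
# Note: it is computed on the training data here, so it shows what the fitted
# models rely on rather than what generalizes to unseen transactions.
feature_importance = predictor.feature_importance(train)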

print("🔍 Top 20 Most Important Features:")
display(feature_importance.head(20))

# Save feature importance
feature_importance.to_csv('feature_importance.csv')
print("\n💾 Saved: feature_importance.csv")
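
The idea behind permutation importance can be sketched on synthetic data. Everything below is illustrative: `toy_model` is a hypothetical stand-in for a trained predictor, the data is randomly generated, and the helper functions are not AutoGluon's implementation:

```python
import numpy as np

def auc_score(y, scores):
    """Rank-based ROC-AUC (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def permutation_importance(score_fn, X, y, rng):
    """Drop in AUC after shuffling each column of X, one at a time."""
    baseline = auc_score(y, score_fn(X))
    drops = []
    for col in range(X.shape[1]):
        X_shuffled = X.copy()
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        drops.append(baseline - auc_score(y, score_fn(X_shuffled)))
    return np.array(drops)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
# Synthetic label: depends on column 0 only; column 1 is pure noise.
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.0).astype(int)

# Hypothetical stand-in for a trained model: scores rows by column 0 alone.
def toy_model(features):
    return features[:, 0]

importances = permutation_importance(toy_model, X, y, rng)
print(importances)
```

Shuffling column 0 breaks the link between scores and labels, so its importance is large. Column 1 is never read by `toy_model`, so shuffling it changes nothing and its importance is zero.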

## 🔮 Generate Predictions

Create submission file for Kaggle:

In [None]:
# Predict probabilities for the positive class (fraud)
print("🔮 Generating predictions...")
predictions = predictor.predict_proba(test)

# For binary classification, get probability of class 1 (fraud)
if isinstance(predictions, pd.DataFrame):
    fraud_proba = predictions[1]  # Probability of fraud
else:
    fraud_proba = predictions

# Create submission file
submission = pd.DataFrame({
    'TransactionID': test['TransactionID'],
    'isFraud': fraud_proba
})

submission.to_csv('submission.csv', index=False)
print("✅ Predictions generated!")
print("\n📊 Sample predictions:")
display(submission.head(10))
print("\n💾 Saved: submission.csv")
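
Before uploading, a few sanity checks on the submission frame can catch common mistakes. A minimal sketch, assuming a hypothetical helper `check_submission` (not part of the original notebook) and toy frames with made-up values:

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != ["TransactionID", "isFraud"]:
        problems.append("columns must be exactly TransactionID, isFraud")
        return problems
    if submission["TransactionID"].duplicated().any():
        problems.append("duplicate TransactionID values")
    if set(submission["TransactionID"]) != set(expected_ids):
        problems.append("TransactionID values do not match the test set")
    if submission["isFraud"].isna().any():
        problems.append("missing probabilities")
    elif not submission["isFraud"].between(0.0, 1.0).all():
        problems.append("probabilities outside [0, 1]")
    return problems

# Toy frames (values are made up for illustration).
good = pd.DataFrame({"TransactionID": [10, 11, 12], "isFraud": [0.02, 0.91, 0.10]})
bad = pd.DataFrame({"TransactionID": [10, 11, 11], "isFraud": [0.02, 1.30, 0.10]})

print(check_submission(good, expected_ids=[10, 11, 12]))
print(check_submission(bad, expected_ids=[10, 11, 12]))
```

In the notebook, the equivalent call would be `check_submission(submission, test['TransactionID'])`.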

## 💾 Save Model Artifacts

Package everything for download:

In [None]:
# Create model archive
print("📦 Creating model archive...")
shutil.make_archive('autogluon_model', 'zip', save_dir)

print("\n✅ All artifacts saved!")
print("\n📥 Download these files:")
print("   ✓ autogluon_model.zip    - Trained model")
print("   ✓ leaderboard.csv         - Model comparison")
print("   ✓ feature_importance.csv  - Important features")
print("   ✓ submission.csv          - Kaggle submission")
print("\n💡 Use the Files panel (📁) to download")
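
To reuse the downloaded archive later, it can be unpacked and the predictor reloaded from the restored directory. The sketch below runs the zip and unzip round trip on a placeholder directory. The directory and file names are made up, and the reload step appears only as a comment because it needs AutoGluon and a real trained model:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the predictor's save directory (contents are placeholders).
    model_dir = os.path.join(workdir, "ag-demo")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "predictor.pkl"), "wb") as f:
        f.write(b"placeholder")

    # Same call as in the notebook: zip the contents of the directory.
    archive_path = shutil.make_archive(
        os.path.join(workdir, "autogluon_model"), "zip", model_dir
    )

    # Restore into a fresh directory.
    restored_dir = os.path.join(workdir, "restored")
    shutil.unpack_archive(archive_path, restored_dir)
    restored_files = sorted(os.listdir(restored_dir))

    # With AutoGluon installed and a real archive, the predictor could then be
    # reloaded with: TabularPredictor.load(restored_dir)

print(restored_files)
```

Reloading generally works best with the same AutoGluon version that trained the model.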

## 🎓 Summary

This notebook demonstrated:
1. ✅ Loading Kaggle competition data
2. ✅ Merging transaction and identity datasets
3. ✅ Training AutoGluon with automatic model selection
4. ✅ Evaluating model performance via leaderboard
5. ✅ Analyzing feature importance
6. ✅ Generating Kaggle submission file

**Next Steps:**
- Submit `submission.csv` to the Kaggle competition
- Try different presets (`best_quality`, `high_quality`)
- Increase `time_limit`, which gives AutoGluon room to train more models
- Experiment with feature engineering