<a href="https://colab.research.google.com/github/BalaAnbalagan/autogluon-assignment/blob/master/part1-kaggle/ieee-fraud-detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IEEE-CIS Fraud Detection (Binary Classification)

## 🎯 Objective
Build an AutoML binary classifier to detect fraudulent transactions using AutoGluon.

**Task**: Binary Classification  
**Dataset**: IEEE-CIS Fraud Detection (Kaggle)  
**Target**: `isFraud`  
**Metric**: ROC-AUC  

## 📋 What This Notebook Does
1. Install AutoGluon and dependencies
2. Load transaction and identity data from Kaggle
3. Merge datasets and prepare features
4. Train AutoGluon predictor with automatic model selection
5. Show leaderboard and feature importance
6. Generate predictions and save artifacts

## 📦 Install Dependencies

In [None]:
!pip install -q torch torchvision torchaudio
!pip install -q autogluon kaggle

## 📚 Import Libraries

In [1]:
import os
import time
import zipfile
import shutil
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor

# Seed NumPy's global RNG so any NumPy-based sampling in this notebook is repeatable
np.random.seed(42)

## 📥 Load Dataset

### Option A: Kaggle API (Recommended)
1. Go to https://www.kaggle.com/settings/account
2. Click "Create New API Token" to download `kaggle.json`
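3. Place `kaggle.json` in `~/.kaggle/` and download the competition files into the `data/` directory (the next cell only checks that they exist)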

### Option B: Manual Upload
1. Download these 4 CSVs from [Kaggle Competition](https://www.kaggle.com/c/ieee-fraud-detection/data)
   - train_transaction.csv
   - train_identity.csv
   - test_transaction.csv
   - test_identity.csv
2. Place them in the `data/` directory next to this notebook (the next cell only checks that they exist)
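If the competition data arrives as a single zip archive, the four CSVs listed above can be unpacked into `data/` with the standard library. The following is a minimal sketch, not part of the original notebook: the helper name `extract_competition_files` is hypothetical, and it assumes the CSVs sit at the top level of the archive.

```python
import zipfile
from pathlib import Path

EXPECTED_FILES = [
    "train_transaction.csv",
    "train_identity.csv",
    "test_transaction.csv",
    "test_identity.csv",
]

def extract_competition_files(archive_path, data_dir="data"):
    """Extract the expected CSVs from a zip archive into data_dir.

    Returns the expected file names that are still missing afterwards.
    (Hypothetical helper; assumes the CSVs are at the archive's top level.)
    """
    data_path = Path(data_dir)
    data_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        available = set(archive.namelist())
        for name in EXPECTED_FILES:
            if name in available:
                archive.extract(name, data_path)
    return [name for name in EXPECTED_FILES if not (data_path / name).exists()]
```

Calling `extract_competition_files("ieee-fraud-detection.zip")` (the archive name here is an assumption) returns an empty list when all four files are in place.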

In [2]:
# This cell assumes the four CSVs are already in the local 'data' directory
# (for example, copied there after a manual download)

import os

# Verify data files exist
data_files = [
    'data/train_transaction.csv',
    'data/train_identity.csv',
    'data/test_transaction.csv',
    'data/test_identity.csv'
]

print("📂 Checking data files...")
missing = []
for file in data_files:
    if os.path.exists(file):
        size_mb = os.path.getsize(file) / (1024 * 1024)
        print(f"   ✓ {file} ({size_mb:.1f} MB)")
    else:
        print(f"   ✗ {file} NOT FOUND")
        missing.append(file)

# Report success only when every file was found
if missing:
    print(f"\n❌ Missing {len(missing)} file(s); add them to 'data/' before continuing.")
else:
    print("\n✅ All data files ready!")

📂 Checking data files...
   ✓ data/train_transaction.csv (651.7 MB)
   ✓ data/train_identity.csv (25.3 MB)
   ✓ data/test_transaction.csv (584.8 MB)
   ✓ data/test_identity.csv (24.6 MB)

✅ All data files ready!


## 🎯 Set Target Label and Problem Type

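The target column is `isFraud`. AutoGluon can infer that this is a binary classification problem from the label values, but the training cell sets `problem_type` and `eval_metric` explicitly so that models are scored with ROC-AUC.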

## 🔧 Load and Merge Data

The dataset has two parts:
- **Transaction data**: Payment details, amounts, cards
- **Identity data**: Device and network information

We'll merge them on `TransactionID`.
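Not every transaction has an identity record, which is why a left join is used: every transaction row is kept, and identity columns are filled with `NaN` where no match exists. A small sketch with made-up values (the `validate` argument is an optional extra check, not something the notebook uses):

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up)
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [50.0, 20.0, 75.5],
})
identity = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

# how="left" keeps every transaction; validate="one_to_one" raises an error
# if TransactionID is duplicated on either side
merged = transactions.merge(
    identity, on="TransactionID", how="left", validate="one_to_one"
)

print(merged)
print("Rows without identity data:", merged["DeviceType"].isna().sum())
```

The merged frame has the same number of rows as `transactions`, so the join cannot silently drop or duplicate transactions.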

In [3]:
# Load transaction data
print("📖 Loading transaction data...")
train_transaction = pd.read_csv('data/train_transaction.csv')
train_identity = pd.read_csv('data/train_identity.csv')
test_transaction = pd.read_csv('data/test_transaction.csv')
test_identity = pd.read_csv('data/test_identity.csv')

# Merge transaction and identity data
print("🔗 Merging datasets...")
train = train_transaction.merge(train_identity, on='TransactionID', how='left')
test = test_transaction.merge(test_identity, on='TransactionID', how='left')

print(f"\n✅ Data loaded successfully!")
print(f"   Train shape: {train.shape}")
print(f"   Test shape: {test.shape}")
print(f"\n📊 Target distribution:")
print(train['isFraud'].value_counts(normalize=True))

📖 Loading transaction data...
🔗 Merging datasets...

✅ Data loaded successfully!
   Train shape: (590540, 434)
   Test shape: (506691, 433)

📊 Target distribution:
isFraud
0    0.96501
1    0.03499
Name: proportion, dtype: float64


In [4]:
# Define target label
LABEL = "isFraud"

# problem_type and eval_metric are passed explicitly to TabularPredictor below;
# AutoGluon's default metric for binary classification is accuracy, not ROC-AUC
print(f"🎯 Target Label: {LABEL}")
print("📈 Metric: ROC-AUC (set explicitly via eval_metric)")

🎯 Target Label: isFraud
📈 Metric: ROC-AUC (set explicitly via eval_metric)
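With only about 3.5% of transactions labelled as fraud, accuracy rewards a model that never flags anything, whereas ROC-AUC measures how well fraud is ranked above non-fraud. The sketch below illustrates this on made-up labels; it is a plain-Python pairwise implementation for illustration, not the routine AutoGluon uses internally.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Made-up data: 1 fraud case among 20 transactions
labels = [0] * 19 + [1]
constant_scores = [0.0] * 20                           # never flags fraud
ranked_scores = [i / 100 for i in range(19)] + [0.3]   # fraud scored highest

print("accuracy, constant model:", accuracy(labels, constant_scores))
print("ROC-AUC, constant model: ", roc_auc(labels, constant_scores))
print("ROC-AUC, ranking model:  ", roc_auc(labels, ranked_scores))
```

The constant model reaches high accuracy while its ROC-AUC stays at chance level, which is the reason for preferring ROC-AUC here.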


## 🚀 Train AutoGluon Model

AutoGluon will:
- Automatically handle missing values
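- Apply automatic feature preprocessing (type inference, category encoding, dropping duplicate columns)
- Train multiple candidate models (LightGBM, CatBoost, neural networks, etc.) as far as the time limit allows; in the run recorded here, only Random Forest and Extra Trees models appear in the leaderboard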
- Create an ensemble of the best models

In [5]:
# Create save directory with timestamp
save_dir = f"ag-{int(time.time())}-ieee-fraud"

# Initialize predictor
predictor = TabularPredictor(
    label=LABEL,
    problem_type="binary",  # Explicitly set for clarity
    eval_metric="roc_auc",  # ROC-AUC for binary classification
    path=save_dir
)

# Train the model
print("🏋️ Training AutoGluon models...")
print("This may take 15-20 minutes...\n")

predictor = predictor.fit(
    train,
    presets="medium_quality",  # Balance between speed and accuracy
    time_limit=900,            # 15 minutes (adjust as needed)
    verbosity=2                # Show detailed progress
)

print("\n✅ Training complete!")

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.9.6
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 25.0.0: Wed Sep 17 21:42:08 PDT 2025; root:xnu-12377.1.9~141/RELEASE_ARM64_T8132
CPU Count:          10
Memory Avail:       3.64 GB / 16.00 GB (22.8%)
Disk Space Avail:   106.82 GB / 228.27 GB (46.8%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 900s
AutoGluon will save models to "/Users/banbalagan/Projects/autogluon-assignment/part1-kaggle/ag-1761508585-ieee-fraud"
Train Data Rows:    590540
Train Data Columns: 433
Label Column:       isFraud
Problem Type:       binary
Preprocessing data ...


🏋️ Training AutoGluon models...
This may take 15-20 minutes...



Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    5469.54 MB
	Train Data (Original)  Memory Usage: 2590.15 MB (47.4% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Unused Original Features (Count: 4): ['V28', 'V154', 'V155', 'V156']
		These features were not used to generate any of the output features. Add a feature generator compatible with these featu


✅ Training complete!


## 📊 Model Leaderboard

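Lists every trained model with its scores. The leaderboard is computed on the training data here, so `score_test` is a training-set score and is optimistic; `score_val`, measured on AutoGluon's internal validation split, is the better estimate of generalization: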

In [6]:
# Get leaderboard
leaderboard = predictor.leaderboard(train, silent=True)

print("🏆 Top 10 Models:")
display(leaderboard.head(10))

# Save leaderboard
leaderboard.to_csv('leaderboard.csv', index=False)
print("\n💾 Saved: leaderboard.csv")

🏆 Top 10 Models:


| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestEntr | 0.999446 | 0.93545 | roc_auc | 6.349913 | 0.069796 | 77.495818 | 6.349913 | 0.069796 | 77.495818 | 1 | True | 2 |
| 1 | WeightedEnsemble_L2 | 0.999406 | 0.938443 | roc_auc | 14.134962 | 0.145917 | 189.898799 | 0.008244 | 0.000569 | 0.041871 | 2 | True | 5 |
| 2 | RandomForestGini | 0.9992 | 0.936562 | roc_auc | 7.776805 | 0.075552 | 112.36111 | 7.776805 | 0.075552 | 112.36111 | 1 | True | 1 |
| 3 | ExtraTreesEntr | 0.989869 | 0.915163 | roc_auc | 6.651335 | 0.086086 | 63.723692 | 6.651335 | 0.086086 | 63.723692 | 1 | True | 4 |
| 4 | ExtraTreesGini | 0.986381 | 0.898914 | roc_auc | 6.975684 | 0.063401 | 67.072055 | 6.975684 | 0.063401 | 67.072055 | 1 | True | 3 |



💾 Saved: leaderboard.csv
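The gap between `score_test` (about 0.999) and `score_val` (about 0.94) comes from scoring on rows the models were trained on. One way to get an honest `score_test` is to hold out the most recent transactions before fitting and pass that holdout to the leaderboard. The helper below is a hypothetical sketch of such a split, assuming `TransactionDT` orders transactions in time; it is not part of the original notebook.

```python
import pandas as pd

def time_based_holdout(df, time_col="TransactionDT", holdout_frac=0.2):
    """Split df so the most recent holdout_frac of rows (by time_col) is held out."""
    if not 0 < holdout_frac < 1:
        raise ValueError("holdout_frac must be between 0 and 1")
    ordered = df.sort_values(time_col, kind="stable")
    n_holdout = max(1, round(len(ordered) * holdout_frac))
    cutoff = len(ordered) - n_holdout
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy frame standing in for `train` (values are made up)
toy = pd.DataFrame({
    "TransactionDT": [500, 100, 400, 200, 300],
    "isFraud": [0, 0, 1, 0, 0],
})
fit_part, holdout = time_based_holdout(toy, holdout_frac=0.4)
print(len(fit_part), "rows for fitting,", len(holdout), "rows held out")

# In the notebook this would become, roughly:
#   fit_part, holdout = time_based_holdout(train)
#   predictor.fit(fit_part, ...)
#   predictor.leaderboard(holdout)
```

A time-ordered split is a design choice that mimics scoring on future transactions; a random split would also remove the training-set bias.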


## 🔍 Feature Importance

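Ranks features by permutation importance, i.e. the drop in ROC-AUC when a feature's values are shuffled. The scores are computed on a sample of the training data, so treat them as indicative rather than as held-out estimates: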

In [7]:
# Get feature importance
feature_importance = predictor.feature_importance(train)

print("🔍 Top 20 Most Important Features:")
display(feature_importance.head(20))

# Save feature importance
feature_importance.to_csv('feature_importance.csv')
print("\n💾 Saved: feature_importance.csv")

These features in provided data are not utilized by the predictor and will be ignored: ['V28', 'V154', 'V155', 'V156']
Computing feature importance via permutation shuffling for 429 features using 5000 rows with 5 shuffle sets...
	1529.47s	= Expected runtime (305.89s per shuffle set)
	202.08s	= Actual runtime (Completed 5 of 5 shuffle sets)


🔍 Top 20 Most Important Features:


| feature | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| TransactionAmt | 0.003703 | 0.001584 | 0.003196 | 5 | 0.006964 | 0.000442 |
| C1 | 0.001206 | 0.000563 | 0.004359 | 5 | 0.002365 | 4.7e-05 |
| TransactionDT | 0.001141 | 0.000655 | 0.008794 | 5 | 0.002489 | -0.000207 |
| TransactionID | 0.000987 | 0.000445 | 0.003867 | 5 | 0.001904 | 7e-05 |
| card1 | 0.000971 | 0.000688 | 0.017152 | 5 | 0.002387 | -0.000445 |
| M5 | 0.000844 | 0.00047 | 0.007954 | 5 | 0.001812 | -0.000123 |
| C13 | 0.000841 | 0.000283 | 0.001338 | 5 | 0.001425 | 0.000258 |
| addr1 | 0.000801 | 0.000271 | 0.001362 | 5 | 0.00136 | 0.000243 |
| C11 | 0.000771 | 0.000465 | 0.010337 | 5 | 0.001729 | -0.000186 |
| C6 | 0.000771 | 0.000505 | 0.013425 | 5 | 0.00181 | -0.000268 |



💾 Saved: feature_importance.csv
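The log above mentions "permutation shuffling" with 5 shuffle sets. The idea can be sketched in plain Python: shuffle one feature column at a time, re-score the model, and record how much the score drops. This is a simplified illustration under stated assumptions (a toy rule-based model and accuracy instead of ROC-AUC); it is not AutoGluon's implementation.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, n_features, n_shuffles=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_shuffles):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value
            shuffled = [
                row[:j] + (value,) + row[j + 1:]
                for row, value in zip(rows, column)
            ]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_shuffles)
    return importances

# Toy data: the label depends only on feature 0; feature 1 is noise
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]

def model(row):
    """Hypothetical 'trained' model that uses only feature 0."""
    return 1 if row[0] > 0.5 else 0

importances = permutation_importance(model, rows, labels, n_features=2)
print(importances)
```

The feature the model ignores gets an importance of exactly zero, while the feature it relies on shows a large drop. A highly ranked identifier such as `TransactionID` in the table above deserves scrutiny for the same reason: the score only says the model uses the column, not that the column is a sound predictor.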


## 🔮 Generate Predictions

Create submission file for Kaggle:

In [9]:
# The test identity columns use hyphens (e.g. id-01) while the training
# columns use underscores (id_01); rename so the test columns match
test.columns = test.columns.str.replace('-', '_')
print("✅ Fixed test column names to match training data")

# Predict probabilities for the positive class (fraud)
print("🔮 Generating predictions...")
predictions = predictor.predict_proba(test)

# For binary classification, get probability of class 1 (fraud)
if isinstance(predictions, pd.DataFrame):
    fraud_proba = predictions[1]  # Probability of fraud
else:
    fraud_proba = predictions

# Create submission file
submission = pd.DataFrame({
    'TransactionID': test['TransactionID'],
    'isFraud': fraud_proba
})

submission.to_csv('submission.csv', index=False)
print("✅ Predictions generated!")
print("\n📊 Sample predictions:")
display(submission.head(10))
print("\n💾 Saved: submission.csv")

✅ Fixed test column names to match training data
🔮 Generating predictions...
✅ Predictions generated!

📊 Sample predictions:


| | TransactionID | isFraud |
|---|---|---|
| 0 | 3663549 | 0.015044 |
| 1 | 3663550 | 0.022846 |
| 2 | 3663551 | 0.050838 |
| 3 | 3663552 | 0.007528 |
| 4 | 3663553 | 0.009528 |
| 5 | 3663554 | 0.013515 |
| 6 | 3663555 | 0.082482 |
| 7 | 3663556 | 0.054694 |
| 8 | 3663557 | 0.010237 |
| 9 | 3663558 | 0.038265 |



💾 Saved: submission.csv
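Before uploading, it can help to sanity-check the submission: one row per test transaction, no duplicate IDs, and probabilities within [0, 1]. The function below is a hypothetical helper sketched for that purpose, working on plain `(TransactionID, probability)` pairs.

```python
def validate_submission(rows, expected_ids):
    """Return a list of problems found in (transaction_id, probability) pairs."""
    problems = []
    ids = [transaction_id for transaction_id, _ in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate TransactionID values")
    if set(ids) != set(expected_ids):
        problems.append("TransactionID values do not match the test set")
    # p != p is True only for NaN
    if any(p != p or not 0.0 <= p <= 1.0 for _, p in rows):
        problems.append("probabilities outside [0, 1] or NaN")
    return problems

# First three rows of the sample predictions shown above
sample = [(3663549, 0.015044), (3663550, 0.022846), (3663551, 0.050838)]
print(validate_submission(sample, expected_ids=[3663549, 3663550, 3663551]))
```

In the notebook, the pairs could be built with `list(zip(submission['TransactionID'], submission['isFraud']))` and compared against `test['TransactionID']`; an empty list means no problems were found.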


## 💾 Save Model Artifacts

Package everything for download:

In [10]:
# Create model archive
print("📦 Creating model archive...")
shutil.make_archive('autogluon_model', 'zip', save_dir)

print("\n✅ All artifacts saved!")
print("\n📥 Download these files:")
print("   ✓ autogluon_model.zip    - Trained model")
print("   ✓ leaderboard.csv         - Model comparison")
print("   ✓ feature_importance.csv  - Important features")
print("   ✓ submission.csv          - Kaggle submission")
print("\n💡 Use the Files panel (📁) to download")

📦 Creating model archive...

✅ All artifacts saved!

📥 Download these files:
   ✓ autogluon_model.zip    - Trained model
   ✓ leaderboard.csv         - Model comparison
   ✓ feature_importance.csv  - Important features
   ✓ submission.csv          - Kaggle submission

💡 Use the Files panel (📁) to download
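To reuse the archived model later, the zip has to be unpacked back into a directory first. The round trip can be sketched with the standard library as follows; the directory and file names are placeholders, and the stand-in directory is not a real AutoGluon model.

```python
import shutil
import tempfile
from pathlib import Path

def archive_and_restore(model_dir, work_dir):
    """Zip model_dir, unpack it under work_dir, and return the restored path."""
    archive_base = str(Path(work_dir) / "autogluon_model")
    archive_path = shutil.make_archive(archive_base, "zip", model_dir)
    restored = Path(work_dir) / "restored_model"
    shutil.unpack_archive(archive_path, restored)
    return restored

# Demonstration with a stand-in directory (placeholder contents)
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "ag-demo"
    fake_model.mkdir()
    (fake_model / "predictor.pkl").write_text("placeholder")
    restored = archive_and_restore(fake_model, tmp)
    restored_files = sorted(p.name for p in restored.iterdir())
    print(restored_files)

# With a real model, the restored directory could then be passed to
# TabularPredictor.load(str(restored)).
```

Loading works best with the same AutoGluon version that trained the model (1.4.0 in the log above).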


## 🎓 Summary

This notebook demonstrated:
1. ✅ Loading Kaggle competition data
2. ✅ Merging transaction and identity datasets
3. ✅ Training AutoGluon with automatic model selection
4. ✅ Evaluating model performance via leaderboard
5. ✅ Analyzing feature importance
6. ✅ Generating Kaggle submission file

**Next Steps:**
- Submit `submission.csv` to Kaggle competition
- Try different presets (`best_quality`, `high_quality`)
- Increase `time_limit` for better results
- Experiment with feature engineering