# 🚀 AutoGluon + Kaggle: IEEE-CIS Fraud Detection (End-to-End)

This Colab walks you through a **no-fuss** pipeline to compete seriously on Kaggle's [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/) with **AutoGluon**.

**What you'll do**
1. Install dependencies (`kaggle`, `autogluon`)
2. Configure the Kaggle API (upload `kaggle.json`)
3. Download the competition data
4. Merge CSVs into clean train/test tables
5. Train AutoGluon with a strong baseline
6. Generate predictions & build `submission.csv`
7. Submit to Kaggle from Colab & check leaderboard

> **Note**: This competition's dataset is large. If you hit RAM limits, set `USE_SAMPLE = True` in the optional sampling cell (after Step 4) to train on a subset for a quick baseline.


In [None]:
#@title ⏬ Step 1: Install dependencies (AutoGluon + Kaggle)
!pip -q install --upgrade pip
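# The [all] extra pulls in the optional model backends; quote it so the shell does not expand the brackets.
!pip -q install kaggle "autogluon.tabular[all]" pandas numpy scikit-learn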
import sys, os, pandas as pd, numpy as np
print('Python:', sys.version)
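# `autogluon` is a namespace package, so read the version from the installed distribution instead.
from importlib.metadata import version; print('AutoGluon:', version('autogluon.tabular'))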

## 🔐 Step 2: Set up your Kaggle API key
1. Visit https://www.kaggle.com/account
2. Click **Create New API Token** → this downloads `kaggle.json`
3. Upload it in the next cell when prompted


In [None]:
#@title 🔑 Upload kaggle.json and configure permissions
from google.colab import files
from pathlib import Path

uploaded = files.upload()  # select kaggle.json
if 'kaggle.json' not in uploaded:
    raise SystemExit('Please upload kaggle.json to continue.')

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

print('✅ Kaggle API is set up.')
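
If the API calls in later steps fail with authentication errors, it can help to inspect the token file itself. The sketch below is a hypothetical helper (`check_kaggle_credentials` is not part of the Kaggle CLI). It assumes the token is a JSON object with `username` and `key` fields and that the file should not be readable by other users. It runs on a throwaway file here, so no real token is needed.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    problems = []
    for field in ("username", "key"):
        if not isinstance(creds, dict) or not creds.get(field):
            problems.append(f"missing or empty field: {field}")
    # Flag any permission bits granted to group/others (i.e. not chmod 600).
    if stat.S_IMODE(path.stat().st_mode) & 0o077:
        problems.append("file is accessible by group/others; run chmod 600")
    return problems


# Demo on a throwaway file rather than a real token.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "kaggle.json"
    demo.write_text(json.dumps({"username": "demo-user", "key": "not-a-real-key"}))
    os.chmod(demo, 0o600)
    demo_problems = check_kaggle_credentials(demo)
    print(demo_problems)
```

In Colab you would call it as `check_kaggle_credentials(Path.home() / '.kaggle' / 'kaggle.json')` after the upload cell.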

In [None]:
#@title 📦 Step 3: Download the IEEE-CIS Fraud Detection data via Kaggle API
competition = 'ieee-fraud-detection'
!mkdir -p data
!kaggle competitions download -c {competition} -p data
!unzip -oq data/{competition}.zip -d data
print('✅ Data downloaded and extracted to ./data')
!ls -lh data | head -n 20

## 🧹 Step 4: Load & merge CSVs into train/test tables

- Training = `train_transaction.csv` **left-joined** with `train_identity.csv` on `TransactionID`
- Test = `test_transaction.csv` **left-joined** with `test_identity.csv` on `TransactionID`

We'll also **downcast** numeric dtypes to reduce memory usage.
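
To see what the left join does before running it on the full files, here is a small sketch on toy tables. The values and the `DeviceType` column are illustrative stand-ins, and `validate='one_to_one'` assumes each `TransactionID` appears at most once per table, which is worth confirming on the real data.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (hypothetical values).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [10.0, 25.5, 7.0, 99.9],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

merged = transactions.merge(
    identity,
    on="TransactionID",
    how="left",
    validate="one_to_one",  # raise if either side has duplicate TransactionIDs
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# A left join never drops transactions; unmatched rows get NaN identity fields.
assert len(merged) == len(transactions)
matched_share = (merged["_merge"] == "both").mean()
print(merged)
print(f"Share of transactions with identity info: {matched_share:.0%}")
```

Rows without an identity record keep their transaction fields and get `NaN` in the identity columns, so integer identity columns become floats after the merge.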

In [None]:
#@title Load CSVs, merge, and reduce memory usage
import numpy as np  # used by reduce_mem_usage below
import pandas as pd

DATA_DIR = 'data'

def reduce_mem_usage(df: pd.DataFrame, verbose=True) -> pd.DataFrame:
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_numeric_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            if pd.api.types.is_integer_dtype(col_type):
                if c_min >= 0:
                    # Inclusive bounds: a max of exactly 255 still fits in uint8, etc.
                    if c_max <= np.iinfo(np.uint8).max:
                        df[col] = df[col].astype(np.uint8)
                    elif c_max <= np.iinfo(np.uint16).max:
                        df[col] = df[col].astype(np.uint16)
                    elif c_max <= np.iinfo(np.uint32).max:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    else:
                        df[col] = df[col].astype(np.int64)
            else:
                df[col] = pd.to_numeric(df[col], downcast='float')
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose:
        print(f'Memory: {start_mem:,.2f} MB -> {end_mem:,.2f} MB (↓{start_mem-end_mem:,.2f} MB)')
    return df

train_tr = pd.read_csv(f"{DATA_DIR}/train_transaction.csv", low_memory=False)
train_id = pd.read_csv(f"{DATA_DIR}/train_identity.csv", low_memory=False)
test_tr  = pd.read_csv(f"{DATA_DIR}/test_transaction.csv", low_memory=False)
test_id  = pd.read_csv(f"{DATA_DIR}/test_identity.csv", low_memory=False)

print('Merging train...')
train = train_tr.merge(train_id, on='TransactionID', how='left')
print('Merging test...')
test  = test_tr.merge(test_id, on='TransactionID', how='left')

del train_tr, train_id, test_tr, test_id
print('Reducing memory (train)...')
train = reduce_mem_usage(train)
print('Reducing memory (test)...')
test = reduce_mem_usage(test)

print('Train shape:', train.shape)
print('Test shape :', test.shape)
train.head(3)
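
As a point of comparison, pandas can pick the smallest fitting dtype itself through `pd.to_numeric(..., downcast=...)`. The toy frame below is illustrative only. It sketches a shorter alternative to the explicit range checks, not a replacement for the cell above.

```python
import pandas as pd

# Toy columns chosen to land in different target dtypes (hypothetical data).
toy = pd.DataFrame({
    "small_unsigned": [0, 17, 255],
    "small_signed": [-100, 0, 100],
    "ratio": [0.5, 1.25, 2.0],
})

compact = toy.copy()
for col in compact.columns:
    if pd.api.types.is_integer_dtype(compact[col]):
        # Non-negative columns can use unsigned types; others need signed ones.
        kind = "unsigned" if compact[col].min() >= 0 else "integer"
        compact[col] = pd.to_numeric(compact[col], downcast=kind)
    elif pd.api.types.is_float_dtype(compact[col]):
        # Floats are downcast to float32 at most.
        compact[col] = pd.to_numeric(compact[col], downcast="float")

print(toy.dtypes)
print(compact.dtypes)
```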

In [None]:
#@title (Optional) Use a sample of the data for quicker training
USE_SAMPLE = False  #@param {type:"boolean"}
SAMPLE_FRACTION = 0.25  #@param {type:"number"}

if USE_SAMPLE:
    train = train.sample(frac=SAMPLE_FRACTION, random_state=42)
    print('Sampled train shape:', train.shape)
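
Because fraud cases are typically a small minority, a plain random sample can shift the positive rate slightly. A stratified variant samples the same fraction within each label value. This is a sketch on toy data: `stratified_sample` is a hypothetical helper and the 4% positive rate is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame with a rare positive class (hypothetical 4% positive rate).
toy = pd.DataFrame({
    "isFraud": np.r_[np.ones(40, dtype=int), np.zeros(960, dtype=int)],
    "amount": rng.gamma(2.0, 50.0, size=1000),
})


def stratified_sample(df, label, frac, seed=42):
    """Sample `frac` of the rows within each value of `label`."""
    return df.groupby(label, group_keys=False).sample(frac=frac, random_state=seed)


sampled = stratified_sample(toy, "isFraud", frac=0.25)
print(len(sampled), sampled["isFraud"].mean())
```

On the real data this would be `train = stratified_sample(train, 'isFraud', SAMPLE_FRACTION)` in place of the plain `train.sample(...)` call.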

## 🤖 Step 5: Train AutoGluon
- Target/label column: **`isFraud`**
- We exclude **`TransactionID`** from features
- Metric: **ROC AUC** (the competition's official evaluation metric, and a common choice for imbalanced problems like fraud detection)
- Preset: `medium_quality` (good quality vs. training time tradeoff)
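
ROC AUC depends only on how the scores rank positives against negatives, which is why the submission should contain probabilities rather than hard 0/1 labels. A tiny illustration with scikit-learn on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four toy examples: two legitimate (0) and two fraudulent (1).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, scores)
# A strictly increasing transform preserves the ranking, so AUC is unchanged.
auc_rescaled = roc_auc_score(y_true, scores ** 3)
# Thresholding to hard labels throws ranking information away.
auc_hard = roc_auc_score(y_true, (scores >= 0.5).astype(int))

print(auc, auc_rescaled, auc_hard)
```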


In [None]:
#@title Train AutoGluon TabularPredictor
from autogluon.tabular import TabularPredictor

label = 'isFraud'
if label not in train.columns:
    raise SystemExit(f"Label column '{label}' not found in training data.")

# Drop ID-like columns that shouldn't be used as signals
cols_to_drop = ['TransactionID']
train = train.drop(columns=[c for c in cols_to_drop if c in train.columns])
test_features = test.drop(columns=[c for c in cols_to_drop if c in test.columns])

predictor = TabularPredictor(label=label, eval_metric='roc_auc', problem_type='binary', path='ag_models').fit(
    train_data=train,
    presets='medium_quality',
    time_limit=None  # set to an int (seconds) if you want to cap training time
)
leaderboard = predictor.leaderboard(silent=True)
leaderboard.head(10)

## 📈 Step 6: Predict on test & build `submission.csv`
Kaggle expects two columns in `submission.csv`:
- `TransactionID`
- `isFraud` (probability for class `1`)


In [None]:
#@title Generate predictions and create submission file
proba = predictor.predict_proba(test_features)

import pandas as pd
test_ids = test['TransactionID']
if isinstance(proba, pd.DataFrame):
    # Select the positive-class column via AutoGluon's inferred positive label
    # (expected to be 1 for `isFraud`) rather than guessing from column names.
    preds_pos = proba[predictor.positive_class]
else:
    preds_pos = proba  # Series of positive-class probabilities

sub = pd.DataFrame({'TransactionID': test_ids, 'isFraud': preds_pos})
sub_path = 'submission.csv'
sub.to_csv(sub_path, index=False)
print('✅ Saved:', sub_path)
!head -n 5 submission.csv
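
Before submitting, a quick structural check can catch a malformed file early. The sketch below uses a hypothetical helper, `check_submission`. Its checks are assumptions based on the two-column format described above, not Kaggle's own validation logic.

```python
import pandas as pd


def check_submission(sub, expected_ids):
    """Raise ValueError if `sub` does not look like a valid submission frame."""
    if list(sub.columns) != ["TransactionID", "isFraud"]:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    if sub["isFraud"].isna().any():
        raise ValueError("isFraud contains missing values")
    if not sub["isFraud"].between(0.0, 1.0).all():
        raise ValueError("isFraud must be a probability in [0, 1]")
    if sub["TransactionID"].duplicated().any():
        raise ValueError("duplicate TransactionID values")
    if set(sub["TransactionID"]) != set(expected_ids):
        raise ValueError("TransactionID values do not match the test set")
    return True


# Toy example with made-up IDs and probabilities.
toy_ids = pd.Series([101, 102, 103])
toy_sub = pd.DataFrame({"TransactionID": toy_ids, "isFraud": [0.02, 0.91, 0.40]})
ok = check_submission(toy_sub, toy_ids)
print("Submission looks valid:", ok)
```

In this notebook the call would be `check_submission(sub, test['TransactionID'])`.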

## 🚚 Step 7: Submit to Kaggle (from Colab)
If you haven't accepted the competition rules, the data download in Step 3 and the submission here will both fail. Open the [competition page](https://www.kaggle.com/c/ieee-fraud-detection) and click **"I Understand and Accept"** first.

In [None]:
#@title Submit `submission.csv` to Kaggle
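import subprocess

message = "AutoGluon baseline"  #@param {type:"string"}
competition = 'ieee-fraud-detection'
# A failing `!kaggle ...` shell command does not raise a Python exception,
# so call the CLI through subprocess and inspect its exit code instead.
result = subprocess.run(
    ['kaggle', 'competitions', 'submit', '-c', competition, '-f', 'submission.csv', '-m', message],
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode != 0:
    print('\n⚠️ Submission failed. Common causes:')
    print('- You must accept competition rules on the website first')
    print('- Kaggle API rate limits / auth issues')
    print('- Missing `kaggle.json` or wrong permissions (chmod 600)')
    print('\nError:', result.stderr)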

## 🏁 Step 8: (Optional) Peek at the leaderboard from here

In [None]:
#@title Show competition leaderboard (optional)
competition = 'ieee-fraud-detection'
!kaggle competitions leaderboard {competition} --show | head -n 30

---
### ✅ Tips for better scores
- Increase training time or try `presets='best_quality'` (more accurate, slower)
- Feature engineering: convert dates, count encodings, interactions
- Tune with `predictor.fit(..., hyperparameters=...)`
- Use k-fold bagging (AutoGluon's built-in cross-validation) via `num_bag_folds` / `num_bag_sets`
- Enrich categorical handling with target encoding (careful with leakage)
- Explore AutoGluon `leaderboard()` to inspect model performance
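
Two of the feature-engineering tips above, count encoding and leakage-aware target encoding, can be sketched on toy data as follows. The column name `card_type` and both helper functions are hypothetical. The out-of-fold scheme shown is one common way to limit leakage, not the only one.

```python
import pandas as pd
from sklearn.model_selection import KFold


def count_encode(train_col, test_col):
    """Map each category to its frequency in train + test combined."""
    counts = pd.concat([train_col, test_col]).value_counts()
    encode = lambda col: col.map(counts).fillna(0).astype(int)
    return encode(train_col), encode(test_col)


def oof_target_encode(cats, target, n_splits=5, seed=42):
    """Out-of-fold target mean: each row is encoded without using its own label."""
    encoded = pd.Series(float("nan"), index=cats.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cats):
        fit_target = target.iloc[fit_idx]
        # Category means are computed on the other folds only.
        means = fit_target.groupby(cats.iloc[fit_idx]).mean()
        # Categories unseen in the fitting folds fall back to the fold's base rate.
        fold_values = cats.iloc[enc_idx].map(means).fillna(fit_target.mean())
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Toy data (hypothetical categorical column and labels).
toy = pd.DataFrame({
    "card_type": ["a", "a", "b", "b", "b", "c", "a", "b", "c", "a"],
    "isFraud":   [0,   1,   0,   0,   1,   0,   0,   0,   1,   0],
})
toy_test = pd.Series(["a", "b", "d"], name="card_type")

train_counts, test_counts = count_encode(toy["card_type"], toy_test)
toy["card_type_te"] = oof_target_encode(toy["card_type"], toy["isFraud"])
print(test_counts.tolist())
print(toy)
```

For the test set, the target encoding would be fitted on the full training data, since test rows have no labels to leak.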
