<a href="https://colab.research.google.com/github/Likhithagandham/Student-Training-Program/blob/main/Credit_Card_Fraud_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Fraud Detection

This project demonstrates different strategies to detect fraudulent credit card transactions using machine learning.  
Since fraud datasets are highly imbalanced, the project compares two main approaches:

1. **Class-weighted models** – models are trained with built-in class balancing.  
2. **Random undersampling** – the majority class (non-fraud) is reduced to match the minority class (fraud).
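
To make the two approaches concrete, here is a minimal standard-library sketch. It assumes the common "balanced" weighting heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The helper names and the toy data are hypothetical and are not taken from the notebook.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_undersample(rows, labels, seed=42):
    """Randomly drop rows until every class is as small as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 95 non-fraud (0) and 5 fraud (1) transactions.
labels = [0] * 95 + [1] * 5
rows = [[float(i)] for i in range(100)]

weights = balanced_class_weights(labels)
sampled_rows, sampled_labels = random_undersample(rows, labels)

print("class weights:", weights)
print("class counts after undersampling:", Counter(sampled_labels))
```

Class weighting keeps all 100 rows and makes each fraud row count more during training, while undersampling keeps only 10 rows in this toy case. That trade-off is what the two strategies compare.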

## Features
- Dataset: Synthetic dataset (`synthetic_creditcard.csv`) with imbalanced classes.
- Models used:
  - Logistic Regression
  - Random Forest
  - XGBoost
- Evaluation metrics:
  - ROC-AUC
  - PR-AUC (average precision)
  - Confusion Matrix
  - Classification Report (precision, recall, F1-score)
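
The sketch below shows how the confusion matrix and the fraud-class precision, recall, and F1 relate to each other, and why plain accuracy can mislead on imbalanced data. It is a hand-rolled illustration using only the standard library; the labels and predictions are hypothetical, not results from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means fraud."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall_f1(y_true, y_pred):
    """Fraud-class precision, recall and F1; 0.0 when a ratio is undefined."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: 990 non-fraud followed by 10 fraud transactions.
y_true = [0] * 990 + [1] * 10

# Model 1 never flags fraud.
always_negative = [0] * 1000
# Model 2 raises 4 false alarms and catches 8 of the 10 frauds.
some_detection = [0] * 986 + [1] * 4 + [1] * 8 + [0] * 2

for name, y_pred in [("always negative", always_negative),
                     ("some detection", some_detection)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(name, "accuracy:", accuracy(y_true, y_pred),
          "precision:", round(p, 4), "recall:", round(r, 4), "F1:", round(f1, 4))
```

The model that never flags fraud still reaches 99% accuracy on this toy data while its recall is zero, which is why the project reports recall-aware metrics rather than accuracy alone.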

## Workflow
1. Upload the dataset (`synthetic_creditcard.csv`) to Colab.
2. Preprocess data and split into train/test sets.
3. Train models with:
   - **Strategy A:** Class weights
   - **Strategy B:** Random undersampling
4. Compare results to see which approach works better.
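
Step 2 matters more than usual on imbalanced data, because a plain random split can leave the test set with very few fraud cases. The following is a rough standard-library sketch of a stratified split, which splits each class separately at the same ratio. The function name and toy labels are hypothetical, and this is not how scikit-learn implements `train_test_split(..., stratify=y)` internally.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 950 non-fraud (0) and 50 fraud (1).
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels)

train_fraud = sum(labels[i] for i in train_idx)
test_fraud = sum(labels[i] for i in test_idx)
print("train size:", len(train_idx), "fraud in train:", train_fraud)
print("test size:", len(test_idx), "fraud in test:", test_fraud)
```

Both splits keep the 5% fraud rate of the toy labels, so metrics computed on the test split reflect the same class balance the model was trained on.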

## How to Run
1. Open the notebook in Google Colab.
2. Upload your dataset (CSV).
3. Run all cells to train models and view results.

## Example Results
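- Strategy A (class weights) performs well even when the imbalance is extreme, since no training data is discarded.
- Strategy B (undersampling) provides more balanced recall across the two classes, but may lose overall accuracy because most non-fraud examples are dropped before training.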

## Future Improvements
- Try SMOTE/ADASYN oversampling if the dataset size allows.
- Add ROC curve plots for better visual comparison.
- Experiment with deep learning models (e.g., simple neural networks).
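
As a starting point for the oversampling idea, here is a simplified sketch of the interpolation step behind SMOTE: a synthetic fraud sample is placed at a random position on the line between a minority point and one of its nearest minority neighbors. This is an illustration only, not the `imbalanced-learn` implementation; the function name, the value of `k`, and the toy points are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class points with two features each.
fraud_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(fraud_points, n_new=6)

for point in new_points:
    print([round(v, 3) for v in point])
```

Oversampling of this kind should only be applied to the training split, so that the test set still contains real transactions only.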

---


In [None]:
!pip install -q imbalanced-learn xgboost

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import files
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    confusion_matrix,
    classification_report,
    precision_recall_curve,
    roc_curve,
    auc,
    f1_score,
)

# upload file
uploaded = files.upload()
filename = list(uploaded.keys())[0]
df = pd.read_csv(filename)
print("shape:", df.shape)
print(df['Class'].value_counts())

# features & target
X = df.drop(columns=['Class'])
y = df['Class']

# stratified split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ---------- BalancedRandomForest ----------
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
brf.fit(X_train, y_train)

y_proba_brf = brf.predict_proba(X_test)[:,1]
print("\nBalancedRandomForest ROC-AUC:", round(roc_auc_score(y_test, y_proba_brf),4))
print("BalancedRandomForest PR-AUC :", round(average_precision_score(y_test, y_proba_brf),4))

# pick the threshold that maximizes F1
# note: the threshold is tuned on the test set itself, so the reported F1 is optimistic
prec, rec, th = precision_recall_curve(y_test, y_proba_brf)
f1_scores = 2*prec*rec/(prec+rec+1e-12)
best_idx = np.nanargmax(f1_scores)
best_thr = th[best_idx] if best_idx < len(th) else 0.5
y_pred_thr = (y_proba_brf >= best_thr).astype(int)
print("BRF best threshold:", round(best_thr,4), "F1:", round(f1_scores[best_idx],4))
print("Confusion matrix (BRF, tuned):\n", confusion_matrix(y_test, y_pred_thr))
print(classification_report(y_test, y_pred_thr, digits=4))

# ---------- EasyEnsembleClassifier ----------
from imblearn.ensemble import EasyEnsembleClassifier
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42, n_jobs=-1)
eec.fit(X_train, y_train)

# EasyEnsemble returns an ensemble that supports predict_proba
y_proba_eec = eec.predict_proba(X_test)[:,1]
print("\nEasyEnsemble ROC-AUC:", round(roc_auc_score(y_test, y_proba_eec),4))
print("EasyEnsemble PR-AUC :", round(average_precision_score(y_test, y_proba_eec),4))

prec2, rec2, th2 = precision_recall_curve(y_test, y_proba_eec)
f1_scores2 = 2*prec2*rec2/(prec2+rec2+1e-12)
best_idx2 = np.nanargmax(f1_scores2)
best_thr2 = th2[best_idx2] if best_idx2 < len(th2) else 0.5
y_pred_eec = (y_proba_eec >= best_thr2).astype(int)
print("EEC best threshold:", round(best_thr2,4), "F1:", round(f1_scores2[best_idx2],4))
print("Confusion matrix (EEC, tuned):\n", confusion_matrix(y_test, y_pred_eec))
print(classification_report(y_test, y_pred_eec, digits=4))

# ---------- Optional: XGBoost with scale_pos_weight (baseline) ----------
from xgboost import XGBClassifier
neg = (y_train==0).sum()
pos = (y_train==1).sum()
scale_pos_weight = max(1, neg/pos)
xgb = XGBClassifier(eval_metric='logloss', scale_pos_weight=scale_pos_weight, random_state=42)
xgb.fit(X_train, y_train)
y_proba_xgb = xgb.predict_proba(X_test)[:,1]
print("\nXGBoost ROC-AUC:", round(roc_auc_score(y_test, y_proba_xgb),4))
print("XGBoost PR-AUC :", round(average_precision_score(y_test, y_proba_xgb),4))

# threshold tune XGBoost (same method)
prec3, rec3, th3 = precision_recall_curve(y_test, y_proba_xgb)
f1_scores3 = 2*prec3*rec3/(prec3+rec3+1e-12)
best_idx3 = np.nanargmax(f1_scores3)
best_thr3 = th3[best_idx3] if best_idx3 < len(th3) else 0.5
y_pred_xgb = (y_proba_xgb >= best_thr3).astype(int)
print("XGB best threshold:", round(best_thr3,4), "F1:", round(f1_scores3[best_idx3],4))
print("Confusion matrix (XGB, tuned):\n", confusion_matrix(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb, digits=4))

# ---------- Plot Precision-Recall curves ----------
plt.figure(figsize=(8,6))
plt.plot(rec, prec, label=f'BRF (AP={average_precision_score(y_test,y_proba_brf):.3f})')
plt.plot(rec2, prec2, label=f'EasyEnsemble (AP={average_precision_score(y_test,y_proba_eec):.3f})')
plt.plot(rec3, prec3, label=f'XGBoost (AP={average_precision_score(y_test,y_proba_xgb):.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()

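# ---------- Predicted-class counts (sanity check) ----------
# If a model predicts no frauds at its tuned threshold, precision is undefined and
# scikit-learn emits an "undefined metric" warning; these counts reveal that case.
# minlength=2 keeps both classes in the output even when one is never predicted.
print("\nPred counts (BRF tuned):", np.bincount(y_pred_thr, minlength=2))
print("Pred counts (EEC tuned):", np.bincount(y_pred_eec, minlength=2))
print("Pred counts (XGB tuned):", np.bincount(y_pred_xgb, minlength=2))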
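
The cell above picks each decision threshold on the same test data it then reports on. One possible refinement, sketched here with the standard library only, is to choose the F1-maximizing threshold on a separate validation split and keep the test set for the final report. The function names, labels, and scores below are hypothetical and are not part of the notebook.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the fraud class when scores >= threshold are flagged as fraud."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores):
    """Try every distinct score as a threshold; return the best (lowest on ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda thr: f1_at_threshold(y_true, scores, thr))


# Hypothetical validation labels and predicted fraud probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
val_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold = best_f1_threshold(y_val, val_scores)
val_f1 = f1_at_threshold(y_val, val_scores, threshold)
print("chosen threshold:", threshold, "validation F1:", round(val_f1, 4))
```

The chosen threshold would then be applied unchanged to the test-set probabilities, for example `(y_proba_brf >= threshold).astype(int)`, so that the test metrics are not influenced by the tuning step.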
