# 💼 Problem Statement
### Contributed By: Sandeep kumar
With the growing adoption of digital payments, credit card fraud has emerged as a critical threat to both consumers and financial institutions. In FY’19 alone, over 52,000 cases of credit/debit card fraud were reported, reflecting the urgency of implementing proactive fraud detection mechanisms. Each fraudulent transaction results in a direct financial loss for the bank and undermines customer trust — making timely detection not just a security concern but a business imperative.

This project aims to build an effective fraud detection system using machine learning techniques. By analyzing anonymized transaction data, addressing extreme class imbalance through SMOTE, and evaluating multiple classification models (Logistic Regression, Random Forest, and Gradient Boosting), we seek to accurately identify fraudulent transactions in real time while minimizing false positives. The ultimate goal is to enhance financial security, protect consumers, and support banks in maintaining operational integrity.



## Step 1. Import Libraries

In [None]:
# Importing necessary libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE

In [None]:
# Loading data
path = "c:\\Users\\sanrkin\\Downloads\\creditcard.csv"
df = pd.read_csv(path)

In [None]:
print("\n=== Dataset Overview ===")

In [None]:
# First few rows
df.head()

PCA transformation helps protect sensitive financial data while retaining its analytical value for fraud detection.

The V1-V28 features are principal components of the original transaction attributes, which makes it much harder for malicious actors to reverse-engineer sensitive information.

In [None]:
# Last few rows
df.tail()

In [None]:
print("\n=== Data Info ===")

In [None]:
df.info()

In [None]:
# Statistical info
df.describe()

In [None]:
# columns in the dataset 
df.columns

In [None]:
# Shape of the dataset
df.shape

In [None]:
# Distribution of Fraud vs Normal Transactions
df['Target'].value_counts()

In [None]:
# Plot the distribution of fraud vs non-fraud transactions

plt.figure(figsize=(8, 6))
sns.countplot(x='Target', data=df, palette=['#2ecc71', '#e74c3c'],  # Green for 0, Red for 1
width=0.5)

plt.title('Distribution of Target Classes (Fraud vs Normal)',fontsize=14, pad=15)
plt.xlabel('Target Class', fontsize=12)
plt.ylabel('Count', fontsize=12)

for i in plt.gca().containers:
    plt.gca().bar_label(i, fmt='%d', padding=3)

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.gca().set_axisbelow(True)

plt.xticks([0, 1], ['Normal (0)', 'Fraud (1)'])

from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='#2ecc71', label='Normal Transactions'),
                  Patch(facecolor='#e74c3c', label='Fraudulent Transactions')]
plt.legend(handles=legend_elements, loc='upper right')

# Adjust layout
plt.tight_layout()

plt.show()

In [None]:
# Heat map

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

Low Multicollinearity: Most features (V1-V28) show low correlation with each other, indicated by the predominantly light-colored squares. This is expected for PCA output, since principal components are uncorrelated on the data they were fitted to, and it suggests that each feature contributes largely non-redundant information.
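
A minimal sketch on synthetic data of why PCA output tends to produce such a heatmap. The `raw` array and its `mixing` matrix are invented for illustration and have nothing to do with the actual dataset; PCA is done here with a plain SVD rather than any particular library class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated "raw" features: 3 columns mixed from 3 independent sources.
sources = rng.normal(size=(1000, 3))
mixing = np.array([[1.0, 0.8, 0.3],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
raw = sources @ mixing

# PCA via SVD of the centered data: project onto the right singular vectors.
centered = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = centered @ vt.T

raw_corr = np.corrcoef(raw, rowvar=False)
pca_corr = np.corrcoef(components, rowvar=False)

# Largest absolute off-diagonal correlation before and after PCA.
off_diag = ~np.eye(3, dtype=bool)
max_raw = np.abs(raw_corr[off_diag]).max()
max_pca = np.abs(pca_corr[off_diag]).max()
print(f"max |corr| raw: {max_raw:.3f}, after PCA: {max_pca:.2e}")
```

The raw columns are clearly correlated, while the projected components are uncorrelated up to floating-point error.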

In [None]:
# Missing values 
df.isnull().sum()

In [None]:
X = df.drop('Target',axis=1)
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Handle imbalance using SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)


We used SMOTE to handle the extreme class imbalance in the dataset (~0.17% fraud). In such cases, random oversampling (which just duplicates existing fraud cases) can lead to overfitting, and undersampling (which removes normal cases) risks losing important information.

Instead, SMOTE generates synthetic fraud examples by interpolating between an existing fraud case and one of its nearest fraud neighbors in feature space. This gives the model a broader view of what fraud looks like than duplicated rows would, so it is less likely to simply memorize the few real cases.
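
A simplified sketch of how such synthetic points can be built: pick a fraud sample, pick one of its nearest fraud neighbors, and take a random point on the line between them. This is an illustration of the idea only, not imbalanced-learn's implementation; `smote_like_sample` and the toy `fraud_points` array are hypothetical names introduced here.

```python
import numpy as np

def smote_like_sample(minority, k=5, rng=None):
    """Create one synthetic point by interpolating toward a near neighbor."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick a random minority sample as the starting point.
    i = rng.integers(len(minority))
    base = minority[i]
    # Find its k nearest minority neighbors (Euclidean), skipping itself at rank 0.
    dists = np.linalg.norm(minority - base, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    neighbor = minority[rng.choice(neighbors)]
    # Place the new point at a random position on the segment base -> neighbor.
    gap = rng.random()
    return base + gap * (neighbor - base)

rng = np.random.default_rng(42)
fraud_points = rng.normal(loc=3.0, size=(20, 2))   # made-up minority samples
synthetic = np.array([smote_like_sample(fraud_points, rng=rng) for _ in range(100)])
print(synthetic.shape)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region already occupied by the fraud class.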

Training on the balanced set also tends to improve recall, which answers the question:

"Out of all real frauds, how many did we correctly catch?"

Improving recall is especially important in fraud detection because missing a fraud is much more costly than flagging a normal transaction by mistake.
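
Recall and its counterpart, precision, both come straight from confusion-matrix counts. The sketch below uses hypothetical counts (not the notebook's output), and `precision_recall` is a helper name introduced only for this illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, not taken from the notebook's output:
# 8 frauds caught, 2 frauds missed, 4 normal transactions wrongly flagged.
precision, recall = precision_recall(tp=8, fp=4, fn=2)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Recall only counts missed frauds against the model, while precision counts false alarms, which is why the two are reported together in the evaluation.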

In [None]:
# Plot the distribution of fraud vs non-fraud transactions before and after SMOTE
plt.figure(figsize=(12, 5))

# Plot 1: Before SMOTE
plt.subplot(1, 2, 1)
sns.countplot(x=y_train, palette=['#2ecc71', '#e74c3c'])
plt.title('Class Distribution Before SMOTE')
plt.xlabel('Target Class')
plt.ylabel('Count')
plt.xticks([0, 1], ['Normal (0)', 'Fraud (1)'])
plt.grid(axis='y', linestyle='--', alpha=0.7)


for i in plt.gca().containers:
    plt.gca().bar_label(i)

# Plot 2: After SMOTE
plt.subplot(1, 2, 2)
sns.countplot(x=y_train_resampled, palette=['#2ecc71', '#e74c3c'])
plt.title('Class Distribution After SMOTE')
plt.xlabel('Target Class')
plt.ylabel('Count')
plt.xticks([0, 1], ['Normal (0)', 'Fraud (1)'])
plt.grid(axis='y', linestyle='--', alpha=0.7)


for i in plt.gca().containers:
    plt.gca().bar_label(i)

plt.tight_layout()
plt.show()

print("\nClass distribution before SMOTE:")
print(pd.Series(y_train).value_counts())
print("\nClass distribution after SMOTE:")
print(pd.Series(y_train_resampled).value_counts())


In [None]:
# Model Building and Training
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# Dictionary to store results
results = {}

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\n===== {model_name} =====")
    
    # Train
    model.fit(X_train_resampled, y_train_resampled)
    
    # Predict
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    # Evaluation metrics
    cm = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    
    # Store results
    results[model_name] = {'cm': cm, 'report': report, 'auc': auc}
    
    # Print results
    print("Confusion Matrix:\n", cm)
    print("\nClassification Report:\n", report)
    print(f"ROC AUC Score: {auc:.4f}")
    
    # Plot confusion matrix
    plt.figure()
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Normal', 'Fraud'],
                yticklabels=['Normal', 'Fraud'])
    plt.title(f'Confusion Matrix - {model_name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()



In [None]:


# Plot ROC curves for all trained models on the untouched test set
plt.figure(figsize=(8, 6))
for model_name, model in models.items():
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {results[model_name]["auc"]:.3f})')

plt.plot([0, 1], [0, 1], 'k--')  # diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.grid(True)
plt.show()

### 🔍 1. Data Overview and Exploration
Dataset size: 56,962 rows × 30 columns.

Target Distribution:

Normal (0): 56,864 transactions

Fraud (1): 98 transactions (only ~0.17% → extremely imbalanced)



### 🧪 PCA-Transformed Features:
Features V1 to V28 are PCA components → real transaction features anonymized.

These preserve pattern and variance but obscure sensitive data — a common step in financial datasets for privacy.



### 🧼 Data Quality:
✅ No missing values

🔁 Low multicollinearity among features (confirmed via correlation heatmap)


### 💡 Insight:
The dataset is highly imbalanced and its features are already PCA-transformed. PCA helps with privacy but does nothing about the imbalance, so techniques like SMOTE are still needed for better fraud detection.

### ⚖️ 2. Class Imbalance Handling
Problem: Very few fraud cases in training data → risk of models being biased toward "normal" transactions.

Solution: Applied SMOTE (Synthetic Minority Oversampling Technique)

Before SMOTE:
Normal: 45,487 | Fraud: 82

After SMOTE:
Normal: 45,487 | Fraud: 45,487 (Balanced dataset)


#### 📌 Insight:
SMOTE creates synthetic fraud examples in the training set only, so the models see many more fraud patterns while the test set keeps its original, imbalanced distribution. This matters most for recall (catching actual frauds).
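
With so few fraud cases, a plain random split can shift the fraud rate between the training and test sets. One common option, which this notebook does not use, is to pass `stratify=y` to `train_test_split` so that both splits keep the same class ratio. A minimal sketch on synthetic data (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 10,000 rows, 20 of them fraud (0.2%).
X = rng.normal(size=(10_000, 5))
y = np.zeros(10_000, dtype=int)
y[:20] = 1

# stratify=y keeps the fraud proportion the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train frauds:", y_train.sum(), "test frauds:", y_test.sum())
```

Applying this to the notebook's own split would change the class counts and metrics reported here, so the figures in this summary refer to the unstratified split.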


### 🧠 3. Models Used
We trained three supervised learning models:

Logistic Regression

Random Forest

Gradient Boosting


### 📈 4. Model Evaluation
✅ Logistic Regression
Recall (Fraud): 0.88

Precision (Fraud): 0.06

AUC: 0.9374

⚠️ Warning: The solver hit its iteration limit before converging (consider increasing max_iter or scaling the features)


High recall but very low precision: only about 6% of the transactions it flags are actual fraud. This is common when a linear model trained on SMOTE-balanced data is evaluated on the original imbalanced distribution, or when the true decision boundary is non-linear.
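
One way to address the convergence warning is to standardize the features and raise the iteration cap. The sketch below uses synthetic data from `make_classification`, not the fraud dataset; the inflated first column and `max_iter=1000` are assumptions chosen for illustration, and results on the real data may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for a training set.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=42)
# Inflate one column's scale to mimic a feature measured in much larger units.
X[:, 0] *= 1000

# Scaling first helps the solver converge; max_iter=1000 is an arbitrary, generous cap.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
clf.fit(X, y)

n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print(f"solver iterations used: {n_iter}")
```

Wrapping the scaler and the model in one pipeline also ensures the scaling parameters are learned from the training data only.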

✅ Random Forest
Recall (Fraud): 0.81

Precision (Fraud): 0.87

AUC: 0.9296

Accuracy: ~100% (not informative here: labeling every transaction as normal would already score about 99.8%)

Insight:

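The best-balanced model at the default threshold: it catches most frauds while raising few false alarms, giving strong precision and recall. Its AUC is the lowest of the three, so it is a reasonable candidate for further validation rather than a finished production model.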

✅ Gradient Boosting
Recall (Fraud): 0.88

Precision (Fraud): 0.29

AUC: 0.9809 (highest)

False Positives: 34 (considerably more than Random Forest, as the 0.29 vs 0.87 precision shows)

Insight:

Best AUC and joint-highest recall (tied with Logistic Regression), but much lower precision than Random Forest. At the default threshold, Gradient Boosting flags fraud aggressively and pays for it with more false positives.
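
The precision and recall figures above come from `predict`, which for these binary classifiers effectively applies a 0.5 cutoff to the predicted fraud probability. Moving that cutoff trades recall for precision. A minimal sketch with made-up labels and scores (`precision_recall_at` is a hypothetical helper, and the numbers are not from the notebook):

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of the fraud class when flagging scores >= threshold."""
    flagged = scores >= threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    fn = np.sum(~flagged & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted fraud probabilities for ten transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.70, 0.45, 0.65, 0.80, 0.95])

for threshold in (0.3, 0.5, 0.75):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In the notebook, the same idea would apply to `y_proba` from `predict_proba`, with the threshold chosen according to how costly a missed fraud is relative to a false alarm.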

### 📊 5. ROC Curve Analysis
Gradient Boosting clearly has the best ROC curve (AUC = 0.981), meaning it's best at separating fraud from non-fraud across all thresholds.

Random Forest has the lowest AUC of the three (0.930), even though it has the best precision at the default threshold.

Logistic Regression sits in between on AUC (0.937) but suffers in precision.
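
With fraud this rare, ROC AUC can look strong even when precision is poor, because the false positive rate is computed over a very large number of normal transactions. Average precision, which summarizes the precision-recall curve, is a useful complement. A minimal sketch on synthetic scores (all numbers below are made up, and this metric is not part of the notebook's evaluation):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: 10,000 normal and 20 fraud transactions (made-up numbers).
y_true = np.concatenate([np.zeros(10_000, dtype=int), np.ones(20, dtype=int)])
scores = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=10_000),  # normal transactions
    rng.normal(loc=2.5, scale=1.0, size=20),      # frauds score higher on average
])

roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)
print(f"ROC AUC: {roc_auc:.3f}, average precision: {avg_precision:.3f}")
```

In the evaluation loop, the equivalent call would be `average_precision_score(y_test, y_proba)` alongside `roc_auc_score`.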

