# Project 8: Credit Card Fraud Detection

This notebook tackles the problem of credit card fraud detection. This is a classic example of a machine learning problem with a **highly imbalanced dataset**, where the number of fraudulent transactions is far lower than the number of legitimate ones.

The focus of this project is one technique for handling this imbalance, **Random Undersampling**, and on evaluating the model with metrics suited to rare positive classes, such as the **Precision-Recall Curve**.

## 1. Setup and Library Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.datasets import fetch_datasets
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, average_precision_score

## 2. Data Loading and EDA

In [None]:
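# Fetch the dataset. fetch_datasets() from imbalanced-learn has no 'creditcard_fraud'
# entry, so this cell assumes the OpenML copy of the credit card fraud data (data_id=1597).
from sklearn.datasets import fetch_openml

fraud_data = fetch_openml(data_id=1597, as_frame=True)

# Create a pandas DataFrame; the target is loaded as a categorical, so cast it to int (0/1)
df = fraud_data.frame.copy()
df['Class'] = df['Class'].astype(int)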

print("Dataset shape:", df.shape)
df.head()

In [None]:
# Check the class distribution
class_counts = df['Class'].value_counts()
print("Class Distribution:")
print(class_counts)

# Visualize the imbalance
plt.figure(figsize=(8, 6))
sns.countplot(x='Class', data=df)
plt.title('Class Distribution (0: Legitimate, 1: Fraud)')
plt.show()

As we can see, the dataset is extremely imbalanced. This is the core challenge we need to address.
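
To make the imbalance concrete, it helps to express it as a fraud rate and compare it with the accuracy of a model that never predicts fraud. The sketch below uses a synthetic label vector; the counts (9,980 legitimate, 20 fraud) are made up for illustration and are not taken from this dataset.

```python
import numpy as np

# Hypothetical labels for illustration only: 9,980 legitimate (0) and 20 fraud (1)
y_demo = np.array([0] * 9980 + [1] * 20)

fraud_rate = y_demo.mean()            # fraction of positive (fraud) labels
baseline_accuracy = 1.0 - fraud_rate  # accuracy of always predicting "legitimate"

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of an always-legitimate model: {baseline_accuracy:.4%}")
```

A model that catches no fraud at all still scores very high accuracy here, which is why accuracy is a poor metric for this problem. Assuming `Class` is coded 0/1 as in the plot title, the same rate for the notebook's data would be `df['Class'].mean()`.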

## 3. Data Preprocessing and Splitting

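Most features in this dataset are already scaled because they are the output of a PCA transformation, so the notebook applies no further scaling. (If the loaded frame also contains raw columns such as the transaction amount, those would benefit from standardization before fitting.) We only need to define the features (X) and target (y) and split the data. The split happens before any resampling so that the test set keeps the original class ratio.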

In [None]:
X = df.drop('Class', axis=1)
y = df['Class']

# Split the data into training and testing sets BEFORE resampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
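
The `stratify=y` argument is what keeps the rare class represented in both parts of the split. A small check of that behavior, on synthetic data whose sizes (1,000 rows, 2% positives) are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1,000 rows with a 2% positive rate
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 20 + [0] * 980)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the original positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without stratification, a random 20% sample of so few positives could by chance contain very few fraud cases, which would make the test metrics unstable.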

## 4. Handling Imbalance with Random Undersampling

In [None]:
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

print("Original training set class counts:")
print(y_train.value_counts())
print("\nResampled training set class counts:")
print(pd.Series(y_train_resampled).value_counts())

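With its default sampling strategy, `RandomUnderSampler` randomly discards majority-class rows until both classes have the same count. The training data is now perfectly balanced, with an equal number of fraud and legitimate samples, at the cost of throwing away most of the legitimate transactions.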
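
The idea behind random undersampling can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not imbalanced-learn's actual code; the helper name `random_undersample` and the toy data are hypothetical.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Hypothetical helper: shrink every class to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    keep_idx = []
    for cls in classes:
        cls_idx = np.flatnonzero(y == cls)
        # Sample without replacement so no row is duplicated
        keep_idx.append(rng.choice(cls_idx, size=n_keep, replace=False))
    keep_idx = np.sort(np.concatenate(keep_idx))
    return X[keep_idx], y[keep_idx]

# Illustrative data: 95 legitimate rows, 5 fraud rows
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = random_undersample(X_demo, y_demo)
print(np.bincount(y_bal))  # one count per class
```

All minority rows survive because the sample size equals the minority count, while the majority class is cut down to match.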

## 5. Model Training

In [None]:
# Train a Logistic Regression model on the balanced data
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train_resampled, y_train_resampled)

## 6. Model Evaluation

In [None]:
# Make predictions on the original, imbalanced test set
y_pred = model.predict(X_test)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Legitimate', 'Fraud'], yticklabels=['Legitimate', 'Fraud'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

In [None]:
# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Legitimate (0)', 'Fraud (1)']))
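
The fraud-class precision and recall in the report come straight from the confusion matrix cells. A worked example with a made-up matrix (the four counts below are hypothetical, not this notebook's results):

```python
import numpy as np

# Hypothetical confusion matrix, rows = true label, columns = predicted label
# [[TN, FP],
#  [FN, TP]]
cm_demo = np.array([[9500, 400],
                    [10, 90]])

tn, fp, fn, tp = cm_demo.ravel()

precision_fraud = tp / (tp + fp)  # of all flagged transactions, the share that is fraud
recall_fraud = tp / (tp + fn)     # of all fraud cases, the share that was flagged

print(f"precision={precision_fraud:.3f}, recall={recall_fraud:.3f}")
```

In this made-up case recall is high while precision is low, the same pattern that training on undersampled data and testing on imbalanced data tends to produce.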

### Precision-Recall Curve

In [None]:
# Get decision scores for the positive class (fraud); the PR curve only needs a ranking,
# so these signed distances work as well as predicted probabilities
y_scores = model.decision_function(X_test)

precision, recall, _ = precision_recall_curve(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)

plt.figure(figsize=(8, 6))
plt.step(recall, precision, where='post', label=f'AP={avg_precision:0.2f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall Curve')
plt.legend(loc='best')
plt.show()
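
The curve can also be used to pick an operating point instead of relying on the default threshold. The sketch below chooses the highest threshold that still meets a target recall. The scores, labels, and the `target_recall` value are toy assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])

prec, rec, thresholds = precision_recall_curve(y_true_demo, scores_demo)

target_recall = 0.6  # assumed business requirement

# prec[i] and rec[i] describe the rule "score >= thresholds[i]";
# the final (precision=1, recall=0) point has no threshold, so drop it
meets_target = rec[:-1] >= target_recall

# thresholds are ascending and recall never increases with the threshold,
# so the last qualifying index is the strictest threshold that meets the target
best = np.flatnonzero(meets_target)[-1]
chosen_threshold = thresholds[best]
precision_at = prec[best]
recall_at = rec[best]

print(f"threshold={chosen_threshold}, precision={precision_at:.2f}, recall={recall_at:.2f}")

# Apply the chosen threshold to get hard predictions
y_pred_demo = (scores_demo >= chosen_threshold).astype(int)
```

For a model scored with `decision_function`, `predict` corresponds to a threshold of 0 on those scores, so a threshold chosen this way replaces that default.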

## 7. Conclusion

This notebook demonstrated how to approach a highly imbalanced classification problem. By using **Random Undersampling**, we trained a model that achieves a **high recall** for the fraud class, at the price of a high number of false positives (low precision).

In a real-world fraud detection system, high recall is often prioritized: it is generally better to flag a legitimate transaction for review (a false positive) than to miss a fraudulent one (a false negative). The Precision-Recall curve shows this trade-off across all decision thresholds. In this run, the model identifies over 90% of the fraudulent transactions in the test set, which is a strong result for recall, although the low precision means many legitimate transactions would also be sent for review.
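
One way to reason about this trade-off is to turn precision into review workload: at precision p, roughly 1/p flagged transactions are reviewed for every true fraud found. A minimal sketch, where the helper name and the precision values are illustrative and not results from this notebook:

```python
def alerts_per_caught_fraud(precision):
    """Hypothetical helper: average flagged transactions reviewed per true fraud found."""
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return 1.0 / precision

# Illustrative precision values only
for p in (0.05, 0.20, 0.80):
    print(f"precision={p:.2f} -> {alerts_per_caught_fraud(p):.1f} alerts per caught fraud")
```

Whether a given workload is acceptable depends on the cost of a manual review relative to the cost of a missed fraud, which is a business decision rather than a modeling one.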