
# Credit Card Fraud Detection Analysis

## 1. Introduction and Brief Description of the Dataset

In this analysis, we explore the **Credit Card Fraud Detection** dataset from OpenML (https://www.openml.org/d/1597). The dataset contains anonymized credit card transaction data from European cardholders, collected over two days in September 2013. The goal is to predict whether a given transaction is fraudulent or not.

**Dataset Features**:
- **Time**: Number of seconds elapsed between this transaction and the first transaction in the dataset.
- **Amount**: Transaction amount.
- **V1-V28**: Principal Component Analysis (PCA) transformed features.
- **Class**: Binary target, where 1 indicates a fraudulent transaction, and 0 indicates a legitimate transaction.

The dataset is highly imbalanced, with only 0.1727% of transactions marked as fraudulent. Handling this imbalance is a critical aspect of the analysis.
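
The imbalance can be checked directly from the `Class` column. The sketch below is illustrative only: `fraud_share` is a hypothetical helper (not part of pandas), and `toy` is a made-up DataFrame standing in for the real data.

```python
import pandas as pd


def fraud_share(df: pd.DataFrame, target: str = "Class") -> float:
    """Return the percentage of rows whose target label equals 1."""
    # Cast to int in case the label column was read as strings or categories.
    return float((df[target].astype(int) == 1).mean() * 100)


# Hypothetical toy data: 2 fraudulent rows out of 1,000.
toy = pd.DataFrame({"Class": [1] * 2 + [0] * 998})
toy_share = fraud_share(toy)
print(f"Fraud share: {toy_share:.4f}%")
```

Once the real dataset is loaded, the same helper could be called as `fraud_share(data)`.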


In [None]:

import pandas as pd
import numpy as np

# Load the dataset from OpenML
url = 'https://www.openml.org/data/get_csv/1597/dataset_1597_creditcard.csv'
data = pd.read_csv(url)

# Display the first few rows of the dataset for initial inspection
data.head()



## 2. Objectives

The main objective of this analysis is to build a classification model that distinguishes fraudulent from legitimate transactions. Because missing a fraudulent transaction (a false negative) is more costly than flagging a legitimate one (a false positive), we prioritize recall on the fraud class while keeping precision high enough that flagged transactions remain worth reviewing.
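
To make the two error types concrete, the sketch below computes precision, recall, and an F2 score (which weights recall more heavily than precision) with scikit-learn. The labels `y_true` and `y_pred` are hypothetical and chosen only for illustration; using F2 here is an assumption about how one might encode the cost asymmetry, not a metric used elsewhere in this analysis.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels: 1 = fraud, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f2 = fbeta_score(y_true, y_pred, beta=2)     # beta > 1 favors recall

print(f"precision={precision:.3f} recall={recall:.3f} F2={f2:.3f}")
```

Here two of the four frauds are missed, so recall is 0.5 even though two of the three flagged transactions are truly fraudulent.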



## 3. Model Selection and Training

We will train three classification models:
1. **Logistic Regression**: A baseline model to provide a simple benchmark.
2. **Random Forest Classifier**: An ensemble model that can improve prediction performance.
3. **Gradient Boosting Classifier**: Another ensemble model that builds trees sequentially, each one correcting the errors of its predecessors, and often performs well on tabular data.

We will start by preparing the data. All features are numeric, so no categorical encoding is needed; we only split the data and standardize the features.
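
Because fraud cases are so rare, the split should preserve the class ratio in both partitions, which is what the `stratify` argument of `train_test_split` does. A minimal sketch on hypothetical synthetic data (`X_demo` and `y_demo` are illustrative names, not part of the dataset) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20 positives among 10,000 samples.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the positive rate (nearly) identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"train positive rate: {y_tr.mean():.4%}")
print(f"test positive rate:  {y_te.mean():.4%}")
```

Without stratification, a random 30% sample of so few positives could easily over- or under-represent fraud in the test set.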


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Splitting the dataset into features and target
X = data.drop('Class', axis=1)
y = data['Class']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Standardizing the features (the scaler is fit on the training set only, so no test statistics leak into training)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:

# First Model: Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred_log_reg = log_reg.predict(X_test_scaled)

# Model evaluation
log_reg_report = classification_report(y_test, y_pred_log_reg)
log_reg_cm = confusion_matrix(y_test, y_pred_log_reg)

# Print both so the report's line breaks render instead of showing as "\n"
print(log_reg_report)
print(log_reg_cm)


In [None]:

# Second Model: Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train_scaled, y_train)

# Predictions
y_pred_rf = rf_clf.predict(X_test_scaled)

# Model evaluation
rf_report = classification_report(y_test, y_pred_rf)
rf_cm = confusion_matrix(y_test, y_pred_rf)

# Print both so the report's line breaks render instead of showing as "\n"
print(rf_report)
print(rf_cm)


In [None]:

from sklearn.ensemble import GradientBoostingClassifier

# Third Model: Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(random_state=42)
gb_clf.fit(X_train_scaled, y_train)

# Predictions
y_pred_gb = gb_clf.predict(X_test_scaled)

# Model evaluation
gb_report = classification_report(y_test, y_pred_gb)
gb_cm = confusion_matrix(y_test, y_pred_gb)

# Print both so the report's line breaks render instead of showing as "\n"
print(gb_report)
print(gb_cm)



## 4. Insights and Key Findings

The results of the models are summarized below:

- **Logistic Regression**: A simple model with moderate performance, struggling with detecting fraudulent transactions.
- **Random Forest Classifier**: Shows improvement in precision but still lacks recall for detecting fraudulent cases.
- **Gradient Boosting Classifier**: Provides the best balance between precision and recall, but further improvements are possible.

Based on these results, we recommend the Gradient Boosting Classifier for this dataset, as it offers the best trade-off between precision and recall on the fraud class.
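
One way to compare models on this trade-off without fixing a decision threshold is the precision-recall curve and its summary, average precision. The sketch below is a hedged illustration on synthetic data from `make_classification`, with a plain logistic regression as a stand-in model; the 80% recall target is an arbitrary assumption, and none of the numbers it prints relate to the credit card results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 2% positives).
X_syn, y_syn = make_classification(
    n_samples=5_000, n_features=10, weights=[0.98, 0.02], random_state=42
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)
scores = model.predict_proba(X_b)[:, 1]  # probability of the positive class

ap = average_precision_score(y_b, scores)
precisions, recalls, thresholds = precision_recall_curve(y_b, scores)

# Highest threshold that still reaches at least 80% recall (illustrative target).
meets_target = np.flatnonzero(recalls[:-1] >= 0.8)
idx = meets_target[-1]

print(f"average precision: {ap:.3f}")
print(
    f"threshold {thresholds[idx]:.3f}: "
    f"precision {precisions[idx]:.3f}, recall {recalls[idx]:.3f}"
)
```

Applied to the three fitted models, the same calls would take `predict_proba(X_test_scaled)[:, 1]` as the score for each classifier.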



## 5. Next Steps

Here are some steps to improve the model's performance:

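1. **Handling Class Imbalance**: Apply resampling methods such as SMOTE (Synthetic Minority Over-sampling Technique), or class weighting, to the training set only, so the models see more weight on fraud examples without contaminating the test set.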
2. **Hyperparameter Tuning**: Use techniques like GridSearchCV to fine-tune hyperparameters, particularly for the Random Forest and Gradient Boosting models.
3. **Feature Engineering**: Explore feature creation or selection methods to improve model performance.
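
As a rough sketch of how the first two steps might be combined, the example below runs `GridSearchCV` over a small Random Forest grid on synthetic data. Note the substitution: it uses scikit-learn's `class_weight="balanced"` option rather than SMOTE (which lives in the separate imbalanced-learn package), and the grid values and `average_precision` scoring are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (roughly 5% positives).
X_syn, y_syn = make_classification(
    n_samples=2_000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Small illustrative grid; class_weight lets the search decide whether
# reweighting the minority class helps.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="average_precision",  # threshold-free metric suited to rare positives
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(f"best CV average precision: {search.best_score_:.3f}")
```

Stratified folds keep the positive rate stable across splits, which matters when there are only a few hundred fraud cases to go around.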
