## Credit Card Fraud Detection Project

This project focuses on building a machine learning model to detect fraudulent transactions. We will explore, preprocess, and model the data to classify transactions as fraud or non-fraud using a dataset of anonymized credit card transactions.


In [10]:
# Import necessary libraries for data manipulation, visualization, and modeling
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE

# Display settings for better visualization in notebook
%matplotlib inline
sns.set(style='whitegrid')


## Step 1: Load the Dataset and Basic Exploration

Load the dataset and display basic information to understand its structure and column types.


In [11]:
# Load the dataset and examine basic information to understand its structure
dataset = pd.read_csv('creditcard.csv')

dataset.head()  # Display the first few rows of the dataset


In [12]:
dataset.info()  # Overview of dataset structure and types

## Step 2: Data Exploration - Class Distribution

Exploring the distribution of the target variable `Class` to understand class imbalance (fraud vs non-fraud).


In [13]:
# Checking the distribution of the target variable 'Class'
sns.countplot(x='Class', data=dataset)
plt.title('Distribution of Classes (Fraud vs Non-Fraud)')
plt.show()

# Calculate and print the class distribution percentage
class_distribution = dataset['Class'].value_counts(normalize=True) * 100
print("Class Distribution:\n", class_distribution)
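
Because fraud cases are a small share of all transactions, plain accuracy can look excellent for a model that never flags anything. The sketch below uses hypothetical counts (not taken from `creditcard.csv`) and a hypothetical helper, `majority_baseline`, to show what an always-non-fraud classifier would score.

```python
# Hypothetical counts for illustration only; in the notebook these would
# come from dataset['Class'].value_counts()
n_non_fraud = 9_950
n_fraud = 50

def majority_baseline(n_negative, n_positive):
    """Accuracy and fraud recall of a model that always predicts the negative class."""
    total = n_negative + n_positive
    accuracy = n_negative / total  # every negative is right, every positive is wrong
    recall = 0.0                   # no fraud case is ever flagged
    return accuracy, recall

accuracy, recall = majority_baseline(n_non_fraud, n_fraud)
print(f"Accuracy: {accuracy:.3f}, fraud recall: {recall:.1f}")
```

This is why the evaluation later relies on the confusion matrix, per-class precision and recall, and ROC AUC rather than accuracy alone.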


In [None]:
# Number of transactions in each class
dataset['Class'].value_counts()

In [14]:
# Filter the dataframe for fraudulent transactions
fraud_transactions = dataset[dataset['Class'] == 1]
# Calculate the total amount across fraudulent transactions
total_fraudulent_amount = fraud_transactions['Amount'].sum()
# Print the total amount of fraudulent transactions
print("Total amount of fraudulent transactions:", total_fraudulent_amount)



## Step 3: Data Preprocessing

Handling missing values, normalizing the `Amount` and `Time` features, and preparing the data for modeling.


In [6]:
# Handle missing values (if any) by filling numeric columns with their median
dataset = dataset.fillna(dataset.median(numeric_only=True))

# Scale 'Amount' and 'Time' features since they have a different scale compared to other features
scaler = StandardScaler()
dataset[['Amount', 'Time']] = scaler.fit_transform(dataset[['Amount', 'Time']])
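
One caveat with scaling the full dataset up front is that test-set statistics end up influencing the training features. A minimal sketch of the alternative, assuming the split is done first and the scaler is fitted on the training rows only. The synthetic arrays and the `demo_` names are stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the 'Amount' and 'Time' columns (assumed, not real data)
demo_features = rng.exponential(scale=100.0, size=(1000, 2))
demo_labels = (rng.random(1000) < 0.05).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_labels, test_size=0.2, random_state=42, stratify=demo_labels
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is the expected behaviour.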





## Step 4: Exploratory Data Analysis (EDA)

Let's explore key features such as `Amount` and `Time` to observe their distributions. We will also examine how `Amount` varies between fraud and non-fraud transactions.


In [17]:
# Summary statistics
dataset.describe()

In [19]:
# Count of transactions per class
dataset['Class'].value_counts()

In [18]:
# Plot distributions for 'Amount' and 'Time' to see their characteristics
plt.figure(figsize=(14,6))
plt.subplot(1, 2, 1)
sns.histplot(dataset['Amount'], bins=100, kde=True)
plt.title('Transaction Amount Distribution')

plt.subplot(1, 2, 2)
sns.histplot(dataset['Time'], bins=50, kde=True)
plt.title('Transaction Time Distribution')
plt.show()

# Compare 'Amount' for Fraud vs Non-Fraud
plt.figure(figsize=(10, 5))
sns.boxplot(x='Class', y='Amount', data=dataset)
plt.title('Amount Distribution by Class')
plt.xlabel('Class (0 = Non-Fraud, 1 = Fraud)')
plt.ylabel('Amount')
plt.show()




### Visualisation

In [None]:
# Checking the distribution of the target variable 'Class'
plt.figure(figsize=(6, 4))
sns.countplot(x='Class', data=dataset)
plt.title('Distribution of Classes (Fraud vs Non-Fraud)')
plt.xlabel('Class (0 = Non-Fraud, 1 = Fraud)')
plt.ylabel('Count')
plt.show()

# Calculate and print the class distribution percentage
class_distribution = dataset['Class'].value_counts(normalize=True) * 100
print("Class Distribution:\n", class_distribution)


In [None]:
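# Compute pairwise correlations between all columns
corr = dataset.corr()

# Plot the correlation matrix as an annotated heatmap
plt.figure(figsize=(24, 18))
sns.heatmap(corr, cmap="coolwarm", annot=True)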
plt.show()
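
A full heatmap can be hard to read with many columns. One possible complement is to rank features by their absolute correlation with `Class`. The sketch below runs on a synthetic frame; the column names `V1` and `V2` are hypothetical placeholders for the anonymized features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic frame for illustration; not the real dataset
demo = pd.DataFrame({"Class": (rng.random(n) < 0.2).astype(int)})
demo["V1"] = demo["Class"] * 2.0 + rng.normal(size=n)  # built to correlate with Class
demo["V2"] = rng.normal(size=n)                        # pure noise

# Absolute correlation of every feature with the target, strongest first
target_corr = demo.corr()["Class"].drop("Class").abs().sort_values(ascending=False)
print(target_corr)
```

In the notebook, the same expression applied to `dataset.corr()` would list the features most linearly related to fraud.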

## Step 5: Data Preparation for Modeling

Splitting the data into features and target variables, then into training and test sets. Next, we handle class imbalance using SMOTE to oversample the minority class.


In [7]:
# Separate features and target variable
X = dataset.drop('Class', axis=1)
y = dataset['Class']

# Split into training and testing sets (80-20 split, stratified)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Handle class imbalance in the training data using SMOTE
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
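
To give an intuition for what SMOTE does, here is a simplified NumPy sketch of its core idea: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbours. This is not imblearn's implementation; the function `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=3, n_new=5, seed=42):
    """Create synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distances from point i to every minority point
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # Skip position 0, the point itself (assumes no duplicate rows)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy minority-class points for illustration
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_sample(minority)
print(new_points.shape)
```

Because every new point lies on a segment between two real minority samples, SMOTE is applied to the training split only, as in the cell above, so that no synthetic rows reach the test set.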


## Step 6: Model Building and Training

Training a Random Forest classifier on the resampled data to classify transactions as fraudulent or non-fraudulent.


In [None]:
# Train a Random Forest classifier on the resampled data
Model = RandomForestClassifier(random_state=42)
Model.fit(X_train_res, y_train_res)


## Step 7: Model Evaluation

Evaluating the model using a confusion matrix, classification report, and ROC AUC score to assess its performance on the test set.


In [None]:
# Predictions and probability scores on the test set
y_pred = Model.predict(X_test)
y_pred_prob = Model.predict_proba(X_test)[:, 1]

# Confusion Matrix and Classification Report
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)

# ROC AUC Score
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC AUC Score:", roc_auc)

# Confusion Matrix Visualization
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Class')
plt.ylabel('Actual Class')
plt.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
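
To make the confusion matrix easier to interpret, the fraud-class precision, recall, and F1 score can be derived from its four cells. The matrix below is hypothetical and only illustrates the arithmetic; it follows scikit-learn's layout of rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (not a result from this notebook):
# [[TN, FP],
#  [FN, TP]]
example_matrix = np.array([[990, 10],
                           [2, 8]])

tn, fp, fn, tp = example_matrix.ravel()
precision = tp / (tp + fp)  # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)     # share of fraud cases that were caught
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

These are the same quantities `classification_report` prints for class 1, so the two outputs can be cross-checked against each other.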


## Conclusion

In this project, I built a Random Forest model to detect fraudulent transactions. The model achieved an ROC AUC score of X.XX on the test set, which measures how well it ranks fraudulent transactions above non-fraudulent ones. Further improvements could come from exploring other models and tuning hyperparameters.

Future ideas:
- Try alternative models such as Gradient Boosting or XGBoost for potentially improved results.
- Use GridSearchCV for hyperparameter tuning to refine the Random Forest model.
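
As a starting point for the tuning idea above, here is a minimal sketch of `GridSearchCV` around a Random Forest. It runs on synthetic imbalanced data from `make_classification` as a stand-in for the real training set, and the grid values are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data as a stand-in for the real training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Illustrative grid; these values are not tuned recommendations
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # same metric as the one reported in Step 7
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

On the full dataset a grid search over a Random Forest can be slow, so a small grid or a subsample may be a reasonable first pass.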
