# Credit Card Fraud Detection
Credit card fraud is one of the most significant challenges facing the financial industry today, with losses in the UK amounting to **£551.3 million in 2023 alone**. Fraudulent transactions are rare but highly impactful, which makes them difficult to detect.

From a machine learning perspective, this presents a **highly imbalanced classification problem**:
- The vast majority of transactions are legitimate.
- Fraudulent transactions make up a very small fraction (**<0.2% in the dataset to be used**).
- A naive model that predicts “not fraud” for everything would achieve 99%+ accuracy, but it would **completely fail at its actual purpose** — detecting fraud.

This project aims to build and evaluate machine learning models that can detect fraudulent transactions with **high recall** (catch as many frauds as possible) while maintaining **precision** (limiting false alarms).
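
To make the accuracy trap concrete, here is a minimal sketch using hypothetical counts (100,000 transactions, 170 of them fraudulent, roughly in line with the <0.2% rate above). The counts and the `metrics` helper are illustrative assumptions, not values taken from the dataset.

```python
def metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined when their denominators are zero;
    # report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical dataset: 100,000 transactions, 170 fraudulent
n_fraud, n_genuine = 170, 99_830

# Naive model: predicts "not fraud" for everything
naive = metrics(tp=0, fp=0, fn=n_fraud, tn=n_genuine)
print("naive model  accuracy=%.4f precision=%.2f recall=%.2f" % naive)

# Hypothetical detector: catches 140 frauds, raises 60 false alarms
detector = metrics(tp=140, fp=60, fn=30, tn=n_genuine - 60)
print("detector     accuracy=%.4f precision=%.2f recall=%.2f" % detector)
```

Both models score above 99.8% accuracy, yet only the second one catches any fraud, which is why recall and precision are the metrics tracked in the rest of this notebook.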

To achieve this, I implemented:
- **Supervised Learning Models** (Logistic Regression, Random Forest, XGBoost) to learn from labelled fraud cases.
- **Anomaly Detection Approaches** (Isolation Forest, Autoencoders) to detect unusual patterns without labels.
- **Cost-Sensitive Learning** to penalise false negatives more heavily, since missing a fraud case is much more costly than flagging a legitimate transaction.

The key business problem:
**How can we detect fraudulent transactions effectively in real-time without overwhelming investigators with too many false positives?**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

***
## Library Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
#from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
#from imblearn.over_sampling import SMOTE
#from imblearn.under_sampling import RandomUnderSampler
#import xgboost as xgb
from scipy.stats import mannwhitneyu
from scipy.stats import ks_2samp
from scipy.stats import chi2_contingency

***
## Data Loading & Initial Exploration 

In [None]:
df = pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")
df.head()

28 of the 31 columns have been anonymised for confidentiality. The remaining 3 columns are:
- `Time`: number of seconds elapsed between each transaction and the first transaction in the dataset.
- `Amount`: the transaction amount.
- `Class`: our target variable representing whether the transaction is fraudulent (1) or genuine (0).




In [None]:
df.dtypes.value_counts()

The target variable `Class` takes only the binary values 1 and 0, making it a categorical variable. All 30 features are numerical, so categorical encoding won't be required.

In [None]:
df.isnull().sum()

There are no missing values, so imputation won't be required either.

In [None]:
df.describe()

Here are some observations I made from the table above and the conclusions I came to as a result:

- `Class` mean is **0.001727** → dataset is very imbalanced (only ~0.17% of transactions are fraudulent) → must handle with resampling (**SMOTE/undersampling**) or **cost-sensitive learning**.  
- Consequently, accuracy won't be a reliable metric → better to use **precision, recall, F1-score and AUC**.  

- `Amount` median is **22** and 75% of transactions are **<77** → most purchases are small → fraudsters often test cards with small amounts before large transactions, so the **distribution of fraud vs. non-fraud by amount** is worth investigating.  

- The PCA-transformed features have a mean ≈ 0 because PCA centres the data, but their ranges vary significantly → some components capture extreme variations → non-linear models like **XGBoost** may capture these relationships better.  

- Some extreme values in PCA features (e.g., **V3, V25**) might be **outliers** worth handling.  

***
## Exploratory Data Analysis (continued)

In [None]:
# Class Distribution
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
df['Class'].value_counts().plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Class (0=Genuine, 1=Fraud)')


# Transaction Amount Distribution
plt.subplot(1, 2, 2)
plt.hist(df['Amount'], bins=50, alpha=0.7)
plt.title('Transaction Amount Distribution')
plt.xlabel('Amount')
plt.tight_layout()
plt.show()

1. The left plot reinforces the findings from the earlier `.describe()` output: the bar for fraudulent transactions is barely visible.
2. The right plot shows that the `Amount` feature is right-skewed → it may be useful to **log transform** it during feature engineering to reduce skewness and stabilise variance, which tends to help scale-sensitive models such as logistic regression.
<br>
Next, I should investigate the distribution of `Amount` for fraud vs non-fraud classes in order to reveal further important patterns.
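
As a quick illustration of the log-transform idea, the sketch below measures skewness before and after `np.log1p` on synthetic log-normal data. The data and the `skewness` helper are assumptions standing in for the real `Amount` column, so the printed numbers will differ from those of the dataset.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean cubed z-score."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed amounts (log-normal), standing in for `Amount`
rng = np.random.default_rng(42)
amounts = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

# log1p(x) = log(1 + x), which is safe for zero-valued amounts
log_amounts = np.log1p(amounts)

skew_raw = skewness(amounts)
skew_log = skewness(log_amounts)
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

`log1p` is used rather than a plain logarithm because transaction amounts can be zero.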


In [None]:
plt.figure(figsize=(12, 5))

# Genuine transactions
plt.subplot(1, 2, 1)  # 1 row, 2 columns, position 1
df[df['Class'] == 0]['Amount'].hist(bins=50, alpha=0.7, color='blue')
plt.title('Genuine Transactions')
plt.xlabel('Amount')
plt.ylabel('Count')

# Fraud transactions
plt.subplot(1, 2, 2)  # 1 row, 2 columns, position 2
df[df['Class'] == 1]['Amount'].hist(bins=50, alpha=0.7, color='red')
plt.title('Fraudulent Transactions')
plt.xlabel('Amount')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

1. Overlap exists between fraud and genuine amounts at small values, so amount alone is not a perfect fraud predictor.
2. Fraud transactions show more variability and heavier tails → unusual amounts (too small or too large) could signal risk.
3. A log transformation of `Amount` could help models distinguish subtle differences between fraud and genuine transactions.
4. This observation motivates combining `Amount` with other features (e.g., PCA features, `Time`, etc.) for stronger fraud signals.

In [None]:
fraud_amounts = df[df['Class'] == 1]['Amount']
genuine_amounts = df[df['Class'] == 0]['Amount']

# Tests whether distribution of Amount differs significantly between fraud & genuine transactions
stat, p = mannwhitneyu(fraud_amounts, genuine_amounts, alternative='two-sided')
print(f"Mann–Whitney U Test: statistic={stat}, p-value={p}")

# Compares entire distributions of both samples
stat, p = ks_2samp(fraud_amounts, genuine_amounts)
print(f"K-S Test: statistic={stat}, p-value={p}")

To statistically verify the difference, I conducted the **Mann–Whitney U test** and the **Kolmogorov–Smirnov test**. Both tests yielded **p-values < 0.001**, confirming that transaction amounts for fraud and genuine transactions follow **significantly** different distributions. This supports our earlier visualisation and highlights `Amount` as a valuable predictive feature (particularly after log transformation).

In [None]:
# Create 'Hour' feature from 'Time' (seconds → hours)
df['Hour'] = (df['Time'] // 3600) % 24  # modulo 24 to wrap around daily cycle

# Fraud rate per hour
hourly_fraud_rate = df.groupby('Hour')['Class'].mean()

# Plot
plt.figure(figsize=(12,6))
hourly_fraud_rate.plot(kind='bar', color='red', alpha=0.7)
plt.title("Fraud Rate by Hour of Day")
plt.xlabel("Hour of Day")
plt.ylabel("Fraud Rate")
plt.xticks(rotation=0)
plt.show()

- Fraudulent transactions are **not evenly distributed** over time: the fraud rate peaks during **hours 2–4**. Since `Time` counts seconds from the first transaction in the dataset, these hours correspond to 2–4 AM only if recording started around midnight. If so, this pattern could suggest:
	- Fraudsters may attempt transactions at times when victims are less likely to notice (e.g. asleep).
	- Or it could reflect systematic vulnerabilities in processing/monitoring at those hours.
- Time of day could therefore be a useful feature in my model, either directly (e.g. `Hour`) or in interaction with other variables like `Amount`.

To check whether the association between `Hour` and `Class` is statistically significant, I will carry out a Chi-Square test of independence.


In [None]:
# Create contingency table: rows=Hour, cols=Class (0=Normal, 1=Fraud)
contingency_table = pd.crosstab(df['Hour'], df['Class'])

# Run Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-Square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)

# Interpretation
if p < 0.05:
    print("❌ Reject H0: Fraud distribution IS dependent on hour (fraud patterns vary by time).")
else:
    print("✅ Fail to reject H0: No strong evidence that fraud depends on hour.")

In [None]:
# Compute correlations with 'Class' (dropping the trivial self-correlation of 1.0)
correlations = df.corr()['Class'].drop('Class').sort_values(ascending=False)

# Select top 5 positive and top 5 negative correlations
top_features = pd.concat([correlations.head(5), correlations.tail(5)])

plt.figure(figsize=(14,6))

# Barplot
plt.subplot(1, 2, 1)
sns.barplot(x=top_features.values, y=top_features.index, palette="coolwarm")
plt.title("Top Features Correlated with Fraud (Class)")
plt.xlabel("Correlation coefficient")
plt.ylabel("Features")

# Heatmap
plt.subplot(1, 2, 2)
sns.heatmap(top_features.to_frame(), annot=True, cmap="coolwarm", center=0)
plt.title("Top Features Correlated with Fraud (Class)")

plt.tight_layout()
plt.show()

The features with the strongest absolute correlation with `Class` are:
- `V17`
- `V14`
- `V12`
- `V11`
- `V10`
- `V4`

In [None]:
top_features = ['V17', 'V14', 'V12','V10','V11','V4'] 

plt.figure(figsize=(14, 10))
for i, feature in enumerate(top_features, 1):
    plt.subplot(2, 3, i)
    sns.kdeplot(data=df[df['Class'] == 0], x=feature, label="Genuine", fill=True, alpha=0.5)
    sns.kdeplot(data=df[df['Class'] == 1], x=feature, label="Fraud", fill=True, alpha=0.5)
    plt.title(f"Distribution of {feature} by Class")
    plt.legend()
plt.tight_layout()
plt.show()

- Fraudulent transactions generally produce **broader, shifted distributions**, while genuine transactions cluster tightly around 0. 
- Features **V17, V14, V12, V10** are particularly powerful fraud indicators → they show clear separation and could be prioritised in feature selection.
- Because the distributions differ so distinctly, **non-linear models** (XGBoost, Random Forest, Neural Nets) will likely exploit these differences more effectively than purely linear models.

In [None]:
features_to_check = ['V3', 'V25'] # features I found to have extreme values in previous section

plt.figure(figsize=(12,6))
for i, feature in enumerate(features_to_check, 1):
    plt.subplot(1, 2, i)
    sns.boxplot(x='Class', y=feature, data=df, palette="Set2")
    plt.title(f"Boxplot of {feature} by Class")
plt.tight_layout()
plt.show()

**V3** (left plot):
- For genuine transactions (Class = 0), the distribution of V3 is **centred around 0** but has many extreme **negative outliers**.
- For fraudulent transactions (Class = 1), the **median is lower** and the **interquartile range is shifted downward** compared to genuine transactions.

⸻

**V25** (right plot):
- For genuine transactions, V25 is tightly **distributed around 0**, with a few outliers in both directions.
- For fraudulent transactions, the **median is slightly higher than 0**, with **more spread** in the distribution.

***
## Class Imbalance Treatment Strategies
As mentioned earlier, one of the key challenges with fraud detection datasets is the sheer imbalance between genuine and fraudulent cases. To address this, I will apply 2 sampling techniques to ensure the model pays proper attention to the minority class:
- Undersampling: Randomly reduce the number of non-fraud cases to balance the dataset → prevents model from being overwhelmed by the majority class, but comes at the cost of losing some information.
- SMOTE: Generate synthetic fraud samples by interpolating between existing fraud cases → helps balance dataset without discarding genuine data.
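
The two strategies can be sketched with plain NumPy on synthetic data. This is a simplified illustration of the ideas, not the `imblearn` implementation; the `smote_like` helper, the array names and the sample sizes are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training set: 1,000 genuine rows and 20 fraud rows, 3 features
X_genuine = rng.normal(0.0, 1.0, size=(1000, 3))
X_fraud = rng.normal(3.0, 1.0, size=(20, 3))

# Random undersampling: keep as many genuine rows as there are fraud rows
keep = rng.choice(len(X_genuine), size=len(X_fraud), replace=False)
X_under = np.vstack([X_genuine[keep], X_fraud])
y_under = np.array([0] * len(keep) + [1] * len(X_fraud))

def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority
    row and one of its k nearest minority neighbours (the SMOTE idea)."""
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_synth = smote_like(X_fraud, n_new=980, k=5, rng=rng)
X_over = np.vstack([X_genuine, X_fraud, X_synth])
y_over = np.array([0] * len(X_genuine) + [1] * (len(X_fraud) + len(X_synth)))

print("undersampled class counts:", np.bincount(y_under))
print("oversampled class counts: ", np.bincount(y_over))
```

Either technique should be applied to the training split only, so that the test set keeps the real-world class ratio.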

In [None]:
# Train-Test Split
X = df.drop(columns=['Class'])
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    stratify=y, random_state=42)

***
## Supervised Learning Models
First, I will test how a baseline logistic regression model performs. I will start by scaling the features: this helps logistic regression converge and puts its coefficients on a comparable scale, but it is not necessary for the tree-based models.

In [None]:
# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

It's time to define the model, fit it on the training data, generate predictions and produce a classification report. I will repeat these 4 steps for the other models too.

In [None]:
# Logistic Regression Evaluation
lr_model = LogisticRegression(max_iter=1000,
                              solver="lbfgs",
                              random_state=42)
lr_model.fit(X_train_scaled, y_train)
preds = lr_model.predict(X_test_scaled)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, preds, target_names=['Not Fraud', 'Fraud']))

The performance of the baseline logistic regression model is quite mediocre as expected with a **high precision**, but **low recall and f1-score**. Now, I will progress to using an ensemble decision tree model (random forest) to see if the performance improves.
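
One possible remedy for the low recall, which the cell above does not use, is to weight the classes inside logistic regression via scikit-learn's `class_weight="balanced"` option. The sketch below shows the effect on one-dimensional synthetic data; the `make_data` helper and the distributions are assumptions chosen only to mimic a roughly 1% fraud rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def make_data(n_neg, n_pos):
    """One-feature toy data: genuine ~ N(0, 1), fraud ~ N(2, 1)."""
    X = np.concatenate([rng.normal(0, 1, n_neg), rng.normal(2, 1, n_pos)])
    y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    return X.reshape(-1, 1), y

X_tr, y_tr = make_data(20_000, 200)
X_te, y_te = make_data(20_000, 200)

# Unweighted model: the rare class barely influences the decision boundary
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Weighted model: errors on the rare class are penalised more heavily
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall without weights: {recall_plain:.2f}")
print(f"recall with class_weight='balanced': {recall_weighted:.2f}")
```

The gain in recall typically comes with a drop in precision, so the trade-off should be checked on the real data before adopting the weighted model.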

In [None]:
# Random Forest Evaluation 
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, preds, target_names=['Not Fraud', 'Fraud']))

The random forest model has a significant boost in performance in comparison to the baseline logistic regression. **Precision, recall and f1-score all exceed 0.8**. Also, there is a smaller difference between recall and precision this time. Next, I will experiment with a gradient-boosting decision tree model: XGBoost. 

In [None]:
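# Calculate scale_pos_weight = (number of negative class samples) / (number of positive class samples)
# Note: this value is not passed to the default model below; class weighting is explored in the tuning section
neg, pos = np.bincount(y_train)  # y_train should be binary (0 = normal, 1 = fraud)
scale_pos_weight = neg / pos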

# XGBoost evaluation
xgb = XGBClassifier(random_state=42)
xgb.fit(X_train, y_train)
preds = xgb.predict(X_test)
print("XGBoost Classification Report:")
print(classification_report(y_test, preds, target_names=['Not Fraud', 'Fraud']))

As predicted, **XGBoost is the best performing model**. It significantly outperforms the baseline logistic regression and narrowly beats random forest in all aspects.

***
## Feature Engineering
Through research about the domain and some general knowledge, I was able to derive the following features that may be able to improve the model's performance:

In [None]:
df_features = df.copy()

# Custom features derived from Amount
df_features['Log_Amount'] = np.log1p(df_features['Amount']) # less useful for xgboost compared to logistic regression
amount_counts = df['Amount'].value_counts()
df_features['Amount_Frequency'] = df_features['Amount'].map(amount_counts).fillna(1)
df_features['Amount_Rarity'] = 1 / (df_features['Amount_Frequency'] + 1)
df_features['Rolling_Amount_Mean'] = df_features['Amount'].rolling(window=100, min_periods=1).mean() # proxy for sudden anomalies


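# Isolation Forest anomaly flag
# Use a separate scaler so the one fitted on X_train above is not overwritten
iso_scaler = StandardScaler()
X_scaled = iso_scaler.fit_transform(df_features.drop(columns=['Class']))
iso = IsolationForest(n_estimators=100, contamination=0.001, random_state=42)
# fit_predict returns -1 for anomalies and 1 for normal points; negating gives 1 = anomaly, -1 = normal
df_features['Anomaly_Score'] = -iso.fit_predict(X_scaled)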


# Time-based features 
df_features['Hour'] = (df_features['Time'] // 3600) % 24
df_features['Day'] = df_features['Time'] // (3600 * 24)
df_features['Is_Night'] = df_features['Hour'].apply(lambda x: 1 if (x < 6 or x > 22) else 0)
hourly_mean = df_features.groupby('Hour')['Amount'].transform('mean')
df_features['Rel_Amount_Hour'] = df_features['Amount'] / (hourly_mean + 1e-8) # relative transaction size within the same hour


# Interaction features 
df_features['V1_V2_Prod'] = df_features['V1'] * df_features['V2']
df_features['V1_V3_Ratio'] = df_features['V1'] / (df_features['V3'] + 1e-8)
df_features['V14_V12_Sum'] = df_features['V14'] + df_features['V12']
df_features['V10_V11_Diff'] = df_features['V10'] - df_features['V11']
df_features['V17_Abs'] = df_features['V17'].abs()
df_features['V4_Sq'] = df_features['V4'] ** 2
df_features['V14_V4_Prod'] = df_features['V14'] * df_features['V4']

In [None]:
# Train-test split including feature engineering
X = df_features.drop(columns=['Class'])
y = df_features['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    stratify=y, random_state=42)
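
One caveat: the feature cell above computes `Amount_Frequency`, the hourly means and the Isolation Forest output on the full dataset before splitting, so information from test rows reaches the features. A possible alternative is to learn such statistics from the training rows only and then apply them to both splits. The sketch below shows this for the frequency feature on toy data; the `add_amount_frequency` helper and the toy frames are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the training and test splits
train = pd.DataFrame({"Amount": [1.0, 1.0, 1.0, 9.99, 9.99, 250.0]})
test = pd.DataFrame({"Amount": [1.0, 9.99, 777.0]})

# Learn the frequency table from the training rows only
train_counts = train["Amount"].value_counts()

def add_amount_frequency(frame, counts):
    """Map each Amount to its training-set count; unseen amounts get 0."""
    out = frame.copy()
    out["Amount_Frequency"] = out["Amount"].map(counts).fillna(0).astype(int)
    out["Amount_Rarity"] = 1 / (out["Amount_Frequency"] + 1)
    return out

train_fe = add_amount_frequency(train, train_counts)
test_fe = add_amount_frequency(test, train_counts)
print(test_fe)
```

The same pattern (fit on train, apply to test) would carry over to the hourly means and the anomaly detector.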

Next, I will conduct a SHAP feature analysis to find out whether my newly engineered features positively impact the model or not. 

In [None]:
# Model trained on enhanced dataset
xgb_enhanced = XGBClassifier(random_state=42)
engineered_cols = [col for col in df_features.columns if col not in ['Class']]
xgb_enhanced.fit(X_train[engineered_cols], y_train)

# SHAP explainer
explainer = shap.TreeExplainer(xgb_enhanced)
shap_values = explainer.shap_values(X_test)

# Global importance plot
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Detailed summary (shows feature impact direction too)
shap.summary_plot(shap_values, X_test)

From the SHAP feature analysis, I noticed that:
- `V4` is the single **most predictive** feature.
- Feature engineering worked well, with `V14_V12_Sum`, `V1_V2_Prod` and `V10_V11_Diff` all ranking among the top features, suggesting that **combining PCA components captures meaningful patterns**.
- `Amount` has low importance, indicating that **transaction amounts alone aren't strong fraud predictors** → this motivates my amount-based engineered features.
- While the model performs well, **explaining fraud decisions to business stakeholders may be difficult** since the most important features are PCA-transformed components rather than intuitive business metrics.
<br>
<br>
Now, I will calculate different metrics like AUC, recall and f1-score to quantify the improvements made through feature engineering. 

In [None]:
# Baseline using only the original columns
original_cols = [col for col in df.columns if col not in ['Class']]

xgb_baseline = XGBClassifier(random_state=42)
xgb_baseline.fit(X_train[original_cols], y_train)
baseline_preds = xgb_baseline.predict(X_test[original_cols])
# Note: AUC is computed from hard 0/1 predictions here, so it equals balanced accuracy;
# pass predict_proba(...)[:, 1] instead for a threshold-independent ROC AUC
baseline_auc = roc_auc_score(y_test, baseline_preds)

# Enhanced using original + feature engineering columns
enhanced_preds = xgb_enhanced.predict(X_test[engineered_cols])
enhanced_auc = roc_auc_score(y_test, enhanced_preds)

# Results
print(f"Baseline AUC:  {baseline_auc:.4f}")
print("\nClassification Report (Baseline):")
print(classification_report(y_test, baseline_preds, digits=4,target_names=['Not Fraud', 'Fraud']))
print(f"\nEnhanced AUC:  {enhanced_auc:.4f}")
print("\nClassification Report (Enhanced):")
print(classification_report(y_test, enhanced_preds, digits=4,target_names=['Not Fraud', 'Fraud']))

The results show that feature engineering improved the model's performance, with the enhanced model having a:
- **Higher AUC** (0.9234 > 0.9132, computed from hard class predictions) → better discrimination at the default threshold.
- **Higher Precision** (0.9432 > 0.9101) → fewer legitimate transactions flagged as fraud.
- **Higher Recall** (0.8469 > 0.8265) → catches more actual fraud cases.
- **Higher f1-score** (0.8925 > 0.8663) → better balance of precision + recall.
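
A note on the AUC figures: they are computed from `.predict()` output, and an ROC AUC built from hard 0/1 labels reduces to balanced accuracy, i.e. the average of recall and specificity. The sketch below illustrates the difference on a tiny hand-made example; the labels and scores are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Tiny hand-made example: 4 genuine and 2 fraud transactions
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])   # predicted fraud probabilities
labels = (scores >= 0.5).astype(int)                  # hard predictions at 0.5

auc_from_labels = roc_auc_score(y_true, labels)
auc_from_scores = roc_auc_score(y_true, scores)
bal_acc = balanced_accuracy_score(y_true, labels)

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"Balanced accuracy:      {bal_acc:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

In this notebook, a threshold-independent figure could be obtained with something like `roc_auc_score(y_test, xgb_enhanced.predict_proba(X_test[engineered_cols])[:, 1])`.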

***
## Anomaly Detection
- Isolation forest
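
As a starting point for this section, here is a minimal sketch of unsupervised scoring with scikit-learn's `IsolationForest` on synthetic two-dimensional data. The data, the `contamination` value and the variable names are assumptions; on the real dataset the scores would be compared against `Class` only after fitting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data: 1,000 "genuine" points near the origin, 10 "fraud" points far away
X_inliers = rng.normal(0.0, 1.0, size=(1000, 2))
X_outliers = rng.normal(6.0, 0.5, size=(10, 2))
X_all = np.vstack([X_inliers, X_outliers])

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
iso.fit(X_all)  # unsupervised: no labels are passed

# score_samples is higher for normal points, so negate it to get
# a continuous score where larger means more anomalous
anomaly_score = -iso.score_samples(X_all)

inlier_mean = anomaly_score[:1000].mean()
outlier_mean = anomaly_score[1000:].mean()
print(f"mean anomaly score, inliers:  {inlier_mean:.3f}")
print(f"mean anomaly score, outliers: {outlier_mean:.3f}")
```

A continuous score like this can be thresholded or ranked, which gives more control than the binary output of `fit_predict`.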

***
## Hyperparameter Tuning & Cost-Sensitive Learning
In this section, I will optimise the model by choosing the best values for the main XGBoost parameters (up to a certain degree of accuracy). This will also allow me to implement **cost-sensitive learning** via the `scale_pos_weight` parameter → model can **penalise misclassified fraud cases** → **higher AUC, recall and f1** (at the cost of precision). 

Since I will be using a large parameter space, I will use **Randomized Search** instead of Grid Search so the program executes within an appropriate time frame. Based on the results of each iteration, I will change the parameter distribution values until I reach an AUC score I'm satisfied with.

In [None]:
# Calculate baseline scale_pos_weight = (number of negative class samples) / (number of positive class samples) for cost-sensitive learning
neg, pos = np.bincount(y_train)  
base_scale = neg / pos

# Base model
xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    use_label_encoder=False,
    random_state=42,
    scale_pos_weight=base_scale  # keep imbalance adjustment
)

# Parameter space
param_dist = {
    "n_estimators": [100, 150, 200, 250],
    "max_depth": [5, 6, 7],  # must be integers
    "learning_rate": [0.005, 0.01, 0.015],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.8, 0.9, 1.0],  # must lie in (0, 1]
    "gamma": [0, 0.1, 0.2, 0.3],
    "min_child_weight": [4, 5, 6],
    "scale_pos_weight": [base_scale*0.5, base_scale, base_scale*2, base_scale*5]
}

# Random search (consider n_iter=100 and cv=5 if results aren't good enough)
random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_dist,
    n_iter=50,                
    scoring="roc_auc",        
    n_jobs=-1,                
    cv=3,                     
    verbose=1,
    random_state=42
)

# Fit search
random_search.fit(X_train, y_train)

# Best params and score
print("Best Parameters:", random_search.best_params_)
print("Best AUC Score:", random_search.best_score_)

***
## Model Evaluation
- Confusion matrices: Visualize true/false positives and negatives
- Classification reports: Precision, recall, F1-score for each model
- ROC curves and AUC: Model discrimination ability
- Precision-Recall curves: More appropriate for imbalanced data
- Feature importance analysis: Which features drive fraud detection (may not be possible with hidden column names)
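
To illustrate why precision-recall metrics are listed alongside ROC AUC, the sketch below scores synthetic, heavily imbalanced data with both. The score distributions are assumptions made for the example; in the notebook the inputs would be `y_test` and predicted fraud probabilities.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: 10,000 genuine and 20 fraud cases with overlapping score ranges
y_true = np.concatenate([np.zeros(10_000, dtype=int), np.ones(20, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, 10_000), rng.normal(2.5, 1.0, 20)])

roc_auc = roc_auc_score(y_true, scores)
# Average precision summarises the precision-recall curve in one number
avg_precision = average_precision_score(y_true, scores)
precision, recall, thresholds = precision_recall_curve(y_true, scores)

print(f"ROC AUC:           {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
print(f"No-skill baseline for average precision: {y_true.mean():.4f}")
```

ROC AUC can look strong while average precision stays modest, because the large number of genuine transactions keeps the false positive rate low even when false alarms outnumber detected frauds.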

***
## Conclusion & Future Work 
- best model
- business impact quantification: Expected fraud prevention and cost savings
- limitations
- improvements e.g. ensemble of best models, hyperparameter tuning (if not implemented), API deployment
- deployment considerations e.g. real-time flagging & monitoring
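
For the business impact item, one simple way to frame the comparison is a cost function over false negatives and false positives. The sketch below uses entirely hypothetical unit costs and confusion-matrix counts; real figures would have to come from the business and from the final model's test results.

```python
def expected_cost(fn, fp, loss_per_missed_fraud, cost_per_review):
    """Total cost = missed fraud losses + investigator time on false alarms."""
    return fn * loss_per_missed_fraud + fp * cost_per_review

# Hypothetical unit costs (assumptions, not taken from the dataset)
LOSS_PER_MISSED_FRAUD = 120.0   # average amount lost per undetected fraud
COST_PER_REVIEW = 5.0           # analyst cost per flagged genuine transaction

# Hypothetical confusion-matrix counts for two operating points of one model
operating_points = {
    "high precision": {"fn": 20, "fp": 5},
    "high recall": {"fn": 8, "fp": 150},
}

costs = {
    name: expected_cost(c["fn"], c["fp"], LOSS_PER_MISSED_FRAUD, COST_PER_REVIEW)
    for name, c in operating_points.items()
}
for name, cost in costs.items():
    print(f"{name}: £{cost:,.2f}")

best = min(costs, key=costs.get)
print("cheapest operating point:", best)
```

With these assumed costs the high-recall setting is cheaper, but the ranking flips if reviews become expensive relative to fraud losses, so the unit costs drive the choice of threshold.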