# Credit Card Fraud Detection
Credit card fraud is one of the most significant challenges facing the financial industry today, with losses in the UK amounting to **£551.3 million in 2023 alone**. Fraudulent transactions are rare but costly, and their rarity is what makes them difficult to detect.

From a machine learning perspective, this presents a **highly imbalanced classification problem**:
- The vast majority of transactions are legitimate.
- Fraudulent transactions make up a very small fraction (**<0.2% in the dataset to be used**).
- A naive model that predicts “not fraud” for everything would achieve over 99.8% accuracy, yet it would **completely fail at its actual purpose**: detecting fraud.
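
The accuracy trap described in the list can be illustrated with a few lines of NumPy. This is a minimal sketch on synthetic labels, assuming a 0.17% fraud rate; the array names are illustrative and nothing here touches the real dataset.

```python
import numpy as np

# Synthetic labels: 10,000 transactions, 17 of them fraud (0.17%, an assumed rate).
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1

# A "model" that always predicts the majority class (not fraud).
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int(y_true.sum())

print(f"accuracy = {accuracy:.4f}, recall = {recall:.2f}")
```

Accuracy is high only because the negatives dominate; recall on the fraud class is zero, which is why the rest of the notebook focuses on recall and precision.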

This project aims to build and evaluate machine learning models that can detect fraudulent transactions with **high recall** (catch as many frauds as possible) while maintaining **precision** (limiting false alarms).

To achieve this, I implemented:
- **Supervised Learning Models** (Logistic Regression, Random Forest, XGBoost) to learn from labelled fraud cases.
- **Anomaly Detection Approaches** (Isolation Forest, Autoencoders) to detect unusual patterns without labels.
- **Cost-Sensitive Learning** to penalise false negatives more heavily, since missing a fraud case is much more costly than flagging a legitimate transaction.

The key business problem:
**How can we detect fraudulent transactions effectively in real-time without overwhelming investigators with too many false positives?**

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

***
## Library Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import xgboost as xgb

***
## Data Loading & Initial Exploration 
- load dataset + print head
- column data types (numerical + categorical)
- summary stats: shape, describe
- missing values 
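
The checks in this list could be gathered in one small helper. The sketch below runs on a tiny stand-in frame; the column names `Amount`, `Class` and `note`, and the helper name `summarise_frame`, are assumptions for illustration. The real notebook would pass the loaded dataset instead.

```python
import pandas as pd

def summarise_frame(df: pd.DataFrame) -> dict:
    """Collect shape, column types and missing-value counts for a DataFrame."""
    return {
        "shape": df.shape,
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
        "categorical_columns": df.select_dtypes(exclude="number").columns.tolist(),
        "missing_per_column": df.isna().sum().to_dict(),
    }

# Tiny stand-in frame (hypothetical columns), used instead of the real CSV.
demo = pd.DataFrame({
    "Amount": [12.5, None, 3.0],
    "Class": [0, 0, 1],
    "note": ["a", "b", None],
})

print(demo.head())      # first rows
print(demo.describe())  # summary statistics for numeric columns
summary = summarise_frame(demo)
print(summary)
```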

***
## Exploratory Data Analysis 
- Class distribution: Calculate and visualize fraud rate (expect ~0.17%)
- Class imbalance visualization: Charts showing normal vs fraud distribution
- Transaction amount analysis: Compare amount distributions between classes
- Time pattern analysis: Fraud occurrence by hour/day patterns
- Feature correlation analysis: Heatmaps and correlation with target variable
- Outlier identification: Box plots and statistical outlier detection
- Key insights summary: Document important patterns discovered
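
Two of the steps above, the fraud rate and the per-class amount comparison, might look like the following sketch. It assumes a binary target column called `Class` (1 = fraud) and an `Amount` column; both names and the toy data are assumptions.

```python
import pandas as pd

def fraud_rate(df, target="Class"):
    """Share of rows labelled as fraud (target == 1)."""
    return float((df[target] == 1).mean())

def amount_by_class(df, amount="Amount", target="Class"):
    """Summary statistics of the amount column, split by class."""
    return df.groupby(target)[amount].describe()

# Toy data for illustration only.
demo = pd.DataFrame({
    "Amount": [10.0, 25.0, 40.0, 5.0, 300.0],
    "Class": [0, 0, 0, 0, 1],
})

rate = fraud_rate(demo)
stats = amount_by_class(demo)
print(f"fraud rate: {rate:.2%}")
print(stats[["count", "mean", "50%"]])
```

The same `groupby` result can feed a box plot or histogram per class once plotting is added.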

***
## Feature Engineering
- Time-based features: Extract hour, day, time-since-last-transaction features
- Amount-based features: Log transforms, scaling, statistical ratios
- Open question: check whether these features are feasible, given that the column names are hidden
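
If the dataset exposes a raw time column and an amount column, the time-based and amount-based features could be derived roughly as below. This is a sketch under assumptions: the column names `Time` and `Amount` are hypothetical, `Time` is assumed to count seconds from the first transaction, and the gap feature is computed across all rows because no card identifier is assumed to exist.

```python
import numpy as np
import pandas as pd

def add_basic_features(df, time_col="Time", amount_col="Amount"):
    """Return a copy of df with hour, log-amount and time-gap features added."""
    out = df.copy()
    # Hour of day, assuming time_col counts seconds from the first transaction.
    out["hour"] = (out[time_col] // 3600 % 24).astype(int)
    # log1p compresses the long right tail of amounts and is defined at 0.
    out["log_amount"] = np.log1p(out[amount_col])
    # Seconds since the previous row; only meaningful if rows are time-ordered.
    out["secs_since_prev"] = out[time_col].diff().fillna(0)
    return out

# Toy rows for illustration only.
demo = pd.DataFrame({"Time": [0.0, 3600.0, 90000.0], "Amount": [0.0, 9.0, 99.0]})
feats = add_basic_features(demo)
print(feats)
```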

***
## Train-Test Split & Data Scaling
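
A minimal sketch of this step, assuming a stratified 80/20 split and `RobustScaler` (one of the two scalers imported above). The data is synthetic and the split ratio is an assumption. The scaler is fitted on the training set only so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data: 2,000 rows, 1% positives (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.zeros(2000, dtype=int)
y[:20] = 1

# stratify=y keeps the fraud rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(f"train fraud rate: {y_train.mean():.3f}, test fraud rate: {y_test.mean():.3f}")
```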

***
## Class Imbalance Treatment Strategies
- Baseline approach: Train on original imbalanced data
- SMOTE oversampling: Generate synthetic minority class samples
- Class weight adjustment: Use built-in class weighting in algorithms
- Strategy comparison: Compare distributions after each approach
- Strategy selection: Choose best approach based on validation performance
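
For the class-weight strategy, scikit-learn's `compute_class_weight` shows what the `"balanced"` setting does: each class receives the weight `n_samples / (n_classes * class_count)`. The sketch below uses synthetic labels with an assumed 1% fraud rate. SMOTE is not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 990 legitimate, 10 fraud (illustration only).
y = np.array([0] * 990 + [1] * 10)

classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_map = {int(c): float(w) for c, w in zip(classes, weights)}
print(weight_map)

# The same mapping can be passed explicitly to most scikit-learn classifiers.
clf = LogisticRegression(class_weight=weight_map, max_iter=1000)
```

With these weights, each fraud row counts about 99 times as much as a legitimate row in the loss, which is one way to penalise false negatives more heavily.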

***
## Supervised Learning Models
- Baseline logistic regression (with class weights)
- Random forest
- XGBoost
- Compare performances & find best model
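
A comparison loop for the supervised models might be structured as follows. This sketch uses synthetic data from `make_classification` with an assumed 2% positive rate and covers only the two scikit-learn models; an XGBoost classifier could be added to the same dictionary. The hyperparameters are placeholders, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (illustration only).
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5, weights=[0.98], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]  # probability of the fraud class
    results[name] = {
        "average_precision": average_precision_score(y_te, scores),
        "recall": recall_score(y_te, model.predict(X_te)),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```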

***
## Unsupervised Learning Models (Anomaly Detection)
- Isolation forest
- Autoencoder
- Compare performances & find best model
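
The Isolation Forest step could be sketched as below. The data is synthetic, with a well-separated outlier cluster, and the `contamination` value is an assumption. Labels are used only to evaluate the scores, never for fitting. The autoencoder is not shown.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score

# Synthetic data: 2,000 normal points and 20 shifted outliers (illustration only).
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(2000, 4))
outliers = rng.normal(6, 1, size=(20, 4))
X = np.vstack([normal, outliers])
y = np.r_[np.zeros(2000, dtype=int), np.ones(20, dtype=int)]

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
iso.fit(X)  # unsupervised: y is not passed

# score_samples is lower for more abnormal points, so negate it
# to get a score where higher means more anomalous.
anomaly_score = -iso.score_samples(X)
ap = average_precision_score(y, anomaly_score)
print(f"average precision: {ap:.3f}")
```

Scoring the anomaly detectors with the same metric as the supervised models keeps the comparison between the two families consistent.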

***
## Model Evaluation
- Confusion matrices: Visualize true/false positives and negatives
- Classification reports: Precision, recall, F1-score for each model
- ROC curves and AUC: Model discrimination ability
- Precision-Recall curves: More appropriate for imbalanced data
- Feature importance analysis: Which features drive fraud detection (importances can still be computed, but interpretation is limited when column names are hidden)
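
The confusion matrix and precision-recall metrics can be computed with scikit-learn as in this sketch. The ten labels and scores are made up for illustration, and the 0.5 threshold is an assumption; in practice the threshold would be chosen from the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_recall_curve,
)

# Made-up labels and model scores (illustration only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90])

# Precision and recall at every candidate threshold, plus the summary AP.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

# Confusion matrix at one fixed threshold.
threshold = 0.5
y_pred = (scores >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}, average precision={pr_auc:.3f}")
```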

***
## Hyperparameter tuning? (for top 2 models)
- grid search or randomized search
- not essential
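
If tuning is included, a randomized search scored on average precision is one option. The sketch below uses synthetic data and a deliberately small search space; the parameter grid, `n_iter` and the number of folds are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data (illustration only).
X, y = make_classification(n_samples=1500, n_features=8, weights=[0.95], random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [3, 6, None],
        "min_samples_leaf": [1, 5],
    },
    n_iter=4,                     # number of sampled parameter combinations
    scoring="average_precision",  # better suited to imbalanced data than accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV average precision: {search.best_score_:.3f}")
```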

***
## Deployment Considerations
- real-time flagging & monitoring
- can merge with next section
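
Real-time flagging could be as simple as scoring each incoming transaction with a fitted pipeline and comparing the probability to a threshold. This is a sketch under assumptions: the data is synthetic, the function name `flag_transaction` is hypothetical, and the 0.5 threshold is a placeholder that would normally be set from the precision-recall trade-off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic training data: 500 legitimate rows and 25 shifted fraud rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal(4, 1, (25, 3))])
y = np.r_[np.zeros(500, dtype=int), np.ones(25, dtype=int)]

# Bundling the scaler with the model applies identical preprocessing at inference.
pipeline = make_pipeline(
    RobustScaler(), LogisticRegression(class_weight="balanced", max_iter=1000)
)
pipeline.fit(X, y)

def flag_transaction(features, model=pipeline, threshold=0.5):
    """Score one transaction and decide whether to route it to review."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    proba = float(model.predict_proba(row)[0, 1])
    return {"fraud_probability": proba, "flag": proba >= threshold}

decision = flag_transaction([4.2, 3.9, 4.1])
print(decision)
```

Monitoring would then track how the flag rate and the score distribution drift over time.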

***
## Conclusion & Future Work 
- best model
- business impact quantification: Expected fraud prevention and cost savings
- limitations
- improvements, e.g. an ensemble of the best models, hyperparameter tuning (if not implemented), API deploym
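
The business impact quantification could start from confusion-matrix counts and two cost assumptions. In the sketch below every number is hypothetical (the counts, the average loss per fraud and the cost per manual review), and the helper name `net_savings` is illustrative.

```python
def net_savings(tp, fp, fn, avg_fraud_loss, review_cost):
    """Estimate savings from a fraud model given confusion-matrix counts.

    Assumes every flagged transaction (TP + FP) is manually reviewed and
    every caught fraud (TP) prevents the full average loss.
    """
    prevented = tp * avg_fraud_loss
    missed = fn * avg_fraud_loss
    review = (tp + fp) * review_cost
    return {
        "prevented_loss": prevented,
        "missed_loss": missed,
        "review_cost": review,
        "net_savings": prevented - review,
    }

# Hypothetical inputs: 80 frauds caught, 200 false alarms, 20 frauds missed,
# 120.0 average loss per fraud, 5.0 cost per manual review.
impact = net_savings(tp=80, fp=200, fn=20, avg_fraud_loss=120.0, review_cost=5.0)
print(impact)
```

Running this at several thresholds makes the trade-off between missed fraud and investigator workload explicit.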