# Credit Card Fraud Detection

## Main Goals

- Detect fraudulent credit card transactions using transaction data.
- Handle a highly imbalanced dataset where fraud is a rare event.
- Create new features, including error and ratio-based features, to expose fraudulent patterns.
- Apply PCA for dimensionality reduction.
- Evaluate models using metrics appropriate for imbalanced classification (e.g., Precision, Recall).

### Context

In the digital economy, the speed and volume of financial transactions present a significant challenge for banks and payment processors in distinguishing legitimate activity from fraud. The ability to detect fraudulent transactions in real-time is crucial for minimizing financial losses, protecting customers, and maintaining trust in the financial system. In the field of data science, anomaly detection and classification models offer a robust tool for sifting through millions of transactions to identify the subtle patterns that signal fraudulent activity. This project leverages a large-scale synthetic dataset based on real-world financial logs to build a model that can identify fraud, enabling a more data-driven and automated approach to transaction security.

## 1. Loading in the Data

For this project, we will use the [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/datasets/ealaxi/paysim1) from Kaggle. In accordance with Kaggle licenses, please directly visit the Kaggle website and download the `PS_20174392719_1491204439457_log.csv` dataset for this activity, and then upload the file to the same directory as the notebook file.

We can start by loading the dataset into a pandas DataFrame and printing a summary of it. This confirms that the file loaded correctly and shows us what the features are and how the target is represented. To do this, we first need to import pandas.

It's worth mentioning that anytime you have a dataset from an external source, such as Kaggle, you can and should refer back to the source of the data to clear up misconceptions and also to get a better understanding of the data.

In [1]:
#Import pandas
import pandas as pd

#Read the CSV file
df = pd.read_csv('PS_20174392719_1491204439457_log.csv')

#Inspect the data
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
None
               step        amount  oldbalanceOrg  newbalanceOrig  \
count  6.362620e+06  6.362620e+06   6.362620e+06    6.362620e+06   
mean   2.433972e+02  1.798619e+05   8.338831e+05    8.551137e+05   
std    1.423320e+02  6.038582e+05   2.888243e+06    2.924049e+06   
min    1.000000e+00  0.000000e+00   0.000000e+00    0.000000e+00   
25%    1.560000e+02  1.338957e+04   0.000000e+00    0.000000e+00   
50%    2.390000e+02  7.487194e+04

Having taken a look at our data, there is a lot to take note of. Let's clarify the features below using information from Kaggle.

* `step`: An integer representing a unit of time, where 1 step equals 1 hour. Kaggle lists a total of 744 steps and describes this as a 30-day simulation, although 744 hours works out to 31 days.
* `type`: The type of transaction. This is a categorical feature with five possible values: 'CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', and 'TRANSFER'.
* `amount`: A numerical value representing the total amount of the transaction in the local currency.
* `nameOrig`: The identifier for the customer who initiated the transaction. This is a categorical feature.
* `oldbalanceOrg`: The initial balance of the originator's account before the transaction occurred.
* `newbalanceOrig`: The balance of the originator's account after the transaction was completed.
* `nameDest`: The identifier for the recipient of the transaction. This is a categorical feature.
* `oldbalanceDest`: The initial balance of the recipient's account before the transaction.
* `newbalanceDest`: The balance of the recipient's account after the transaction.
* `isFraud`: The target variable. A binary flag where `1` indicates the transaction was fraudulent and `0` indicates it was legitimate.
* `isFlaggedFraud`: A supplementary binary flag set by the system's business rules. It flags transactions where more than 200,000 is transferred in a single attempt.

Something else worth noting is that, since this is a fraud dataset, our target is likely to be imbalanced: legitimate transactions are far more common than fraudulent ones. We can confirm this by counting the number of values in each class of the target.

In [2]:
#Check target imbalance
print("Target Class Distribution:\n", df['isFraud'].value_counts())

Target Class Distribution:
 isFraud
0    6354407
1       8213
Name: count, dtype: int64


We can see that, as we thought, there is a significant imbalance in the target class. There are only 8,213 cases of fraud among well over 6 million cases without fraud. This might create issues for our model during training, so we'll keep this in mind.
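
To put that imbalance in numbers, the short calculation below uses the two counts printed above. It is only an illustrative sketch, and the variable names are made up for this example.

```python
# Counts copied from the value_counts() output above
n_legit = 6354407
n_fraud = 8213
n_total = n_legit + n_fraud

# Share of all transactions that are fraudulent
fraud_rate = n_fraud / n_total

# Accuracy of a naive "model" that labels every transaction as legitimate
naive_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Accuracy of always predicting 'not fraud': {naive_accuracy:.2%}")
```

Because a model that never predicts fraud would already look highly accurate on this data, accuracy by itself tells us very little here. That is why precision and recall are the metrics to watch when we evaluate the model.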

Having taken a good look at our data, let's move on to preprocessing the data.

## 2. Initial Preprocessing
Let's now start to clean our data. We want to get the data ready before any form of feature engineering or model creation.

### Handling Null Entries
A good place to start when handling preprocessing for any dataset is to deal with the missing values. Having missing values in our data can cause major issues with our model and feature engineering later on, so it's important to deal with them quickly. Let's first inspect our dataframe to check for null entries, and then handle them appropriately.

In [3]:
#Check for missing values
print(df.isnull().sum())

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


Fortunately, there were no null values in our dataset, so we can proceed normally.

### Encoding Categorical Features
Right now, certain features are in categorical format, stored as strings with the `object` dtype. Most models expect the data to be in a numerical format, so let's go ahead and encode our categorical data.

To prepare our data for the machine learning model, we need to handle our categorical features. Our primary focus will be the `type` column, which contains text values like 'CASH_OUT' and 'TRANSFER'. Since machine learning algorithms require numerical inputs to perform mathematical operations, we must convert these categories into a numerical format. We will accomplish this using **one-hot encoding**, which creates new binary columns for each transaction type. This method effectively represents the category of a transaction without creating a false or misleading ordinal relationship between the different types, ensuring our model can interpret the data correctly.
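
As a small illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a tiny made-up sample. The `toy` DataFrame and its values are assumptions for demonstration only and are not taken from the full dataset.

```python
import pandas as pd

# Tiny made-up sample, used only to illustrate the encoding
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 181.00, 181.00, 1864.28],
})

# One binary column is created per category found in 'type'
toy_encoded = pd.get_dummies(toy, columns=["type"])
type_columns = [c for c in toy_encoded.columns if c.startswith("type_")]

print(toy_encoded)

# Each row has exactly one active type column
print(toy_encoded[type_columns].sum(axis=1).tolist())
```

No column is "larger" than another in this representation, which is what avoids the false ordering that a single integer code (for example 0, 1, 2) would suggest.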

We will drop the `nameOrig` and `nameDest` columns because they contain millions of unique customer IDs, making them high-cardinality features. Encoding these is computationally impractical and would cause the model to simply memorize individual users rather than learn the general patterns of fraudulent transactions. By removing them, we force the model to focus on the transactional data itself, which is a more robust and effective approach. Essentially, we'll be focusing on common patterns of fraud as opposed to what might not be in line for a specific person. This is effective since there are often similar patterns to fraud, and since it is difficult to get precise purchasing information due to privacy, security, and legal reasons.

Let's start by one-hot encoding the `type` feature, and then dropping the `nameOrig` and `nameDest` columns.

In [4]:
#one-hot encode the 'type' column
df = pd.get_dummies(df, columns=['type'])

#Drop the 'nameOrig' and 'nameDest' columns
df = df.drop(columns=['nameOrig', 'nameDest'])

#Display the modified DataFrame
display(df)

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,1,9839.64,170136.00,160296.36,0.00,0.00,0,0,False,False,False,True,False
1,1,1864.28,21249.00,19384.72,0.00,0.00,0,0,False,False,False,True,False
2,1,181.00,181.00,0.00,0.00,0.00,1,0,False,False,False,False,True
3,1,181.00,181.00,0.00,21182.00,0.00,1,0,False,True,False,False,False
4,1,11668.14,41554.00,29885.86,0.00,0.00,0,0,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,339682.13,339682.13,0.00,0.00,339682.13,1,0,False,True,False,False,False
6362616,743,6311409.28,6311409.28,0.00,0.00,0.00,1,0,False,False,False,False,True
6362617,743,6311409.28,6311409.28,0.00,68488.84,6379898.11,1,0,False,True,False,False,False
6362618,743,850002.52,850002.52,0.00,0.00,0.00,1,0,False,False,False,False,True


With that, we were able to successfully encode the `type` column and remove the `nameOrig` and `nameDest` columns. With basic preprocessing complete, we can move on to feature engineering.

## 3. Feature Engineering

To improve our model's ability to detect fraud, our next step is feature engineering. The goal is to create new data points that highlight the unusual mathematical signatures of fraudulent transactions, making these rare events stand out more clearly from normal activity. We'll accomplish this by creating error features that calculate the discrepancy between the expected account balance and the actual final balance after a transaction. We will also create ratio features, such as dividing the transaction `amount` by the account's original balance, to contextualize the transaction's size relative to the account's history.

Additionally, we'll extract time-based features from the `step` column. Since each step represents one hour, we can calculate the `hour_of_day` for each transaction to help the model learn if fraudulent activities occur more frequently at specific times, like late at night. These engineered features provide crucial behavioral context that isn't available in the raw data, giving our anomaly detection model a much stronger signal to learn from.
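
Before applying these ideas to the whole DataFrame, it can help to work through the arithmetic on a single transaction. The sketch below uses the values from the first row of the DataFrame displayed earlier, plus step 743 from the last rows. The variable names are chosen for this example only.

```python
# Values copied from the first row of the DataFrame shown earlier
amount = 9839.64
oldbalanceOrg, newbalanceOrig = 170136.00, 160296.36
oldbalanceDest, newbalanceDest = 0.00, 0.00

# Error features: how far the recorded balances are from what the amount implies
error_orig = round(newbalanceOrig + amount - oldbalanceOrg, 2)
error_dest = round(oldbalanceDest + amount - newbalanceDest, 2)

# Ratio feature, guarding against a zero starting balance
ratio = amount / oldbalanceOrg if oldbalanceOrg > 0 else 0.0

# Hour of day for the last step that appears in the data (step 743)
hour_of_day = 743 % 24

print(error_orig, error_dest, round(ratio, 6), hour_of_day)
```

For this payment the originator's balances are consistent with the amount, while the destination balance never changes, so the full amount shows up as a destination-side discrepancy.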

In [5]:
#Import numpy for the .where function
import numpy as np

#Create Error-Based Features
#These features capture discrepancies in account balances that often signal fraud.
#A non-zero error is a potential red flag.
df['errorBalanceOrg'] = df['newbalanceOrig'] + df['amount'] - df['oldbalanceOrg']
df['errorBalanceDest'] = df['oldbalanceDest'] + df['amount'] - df['newbalanceDest']

#Create Ratio-Based Features
#This feature contextualizes the transaction amount relative to the account balance.
#We use np.where to avoid division by zero if the original balance is 0.
df['amount_to_balance_ratio'] = np.where(
    df['oldbalanceOrg'] > 0, 
    df['amount'] / df['oldbalanceOrg'], 
    0
)

#Create another ratio to compare the transaction amount to the destination account's balance.
#This helps identify if the transaction amount is unusually large compared to the destination account's balance.
#Again, we use np.where to handle cases where the destination balance is 0.
df['amount_to_dest_balance_ratio'] = np.where(
    df['oldbalanceDest'] > 0, 
    df['amount'] / df['oldbalanceDest'], 
    0
)

#Create Time-Based Features
#This converts the 'step' (which represents 1 hour) into an 'hour_of_day' feature.
df['hour_of_day'] = df['step'] % 24

#Display the Results
print("--- Feature Engineering Complete ---")
print("Preview of the DataFrame with new features:")

#Display the DataFrame, including the new feature columns
display(df)

--- Feature Engineering Complete ---
Preview of the DataFrame with new features:


Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,errorBalanceOrg,errorBalanceDest,amount_to_balance_ratio,amount_to_dest_balance_ratio,hour_of_day
0,1,9839.64,170136.00,160296.36,0.00,0.00,0,0,False,False,False,True,False,0.0,9.839640e+03,0.057834,0.000000,1
1,1,1864.28,21249.00,19384.72,0.00,0.00,0,0,False,False,False,True,False,0.0,1.864280e+03,0.087735,0.000000,1
2,1,181.00,181.00,0.00,0.00,0.00,1,0,False,False,False,False,True,0.0,1.810000e+02,1.000000,0.000000,1
3,1,181.00,181.00,0.00,21182.00,0.00,1,0,False,True,False,False,False,0.0,2.136300e+04,1.000000,0.008545,1
4,1,11668.14,41554.00,29885.86,0.00,0.00,0,0,False,False,False,True,False,0.0,1.166814e+04,0.280795,0.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,339682.13,339682.13,0.00,0.00,339682.13,1,0,False,True,False,False,False,0.0,0.000000e+00,1.000000,0.000000,23
6362616,743,6311409.28,6311409.28,0.00,0.00,0.00,1,0,False,False,False,False,True,0.0,6.311409e+06,1.000000,0.000000,23
6362617,743,6311409.28,6311409.28,0.00,68488.84,6379898.11,1,0,False,True,False,False,False,0.0,1.000000e-02,1.000000,92.152375,23
6362618,743,850002.52,850002.52,0.00,0.00,0.00,1,0,False,False,False,False,True,0.0,8.500025e+05,1.000000,0.000000,23


With that, our feature engineering is complete. Note that while we encoded the hour of day from the step feature, we'll still be keeping the step feature since it represents time in a different manner than our newly created feature.

## 4. Train-Test Split
With our feature engineering complete, it's time to move on to the train-test split. We want to split our data into training and testing sets so that there is one set of data our model can learn from and a separate set to evaluate it against. We can do this simply using the `train_test_split` function from Sklearn.

It's at this point in the project that we should also properly identify our target and split it from the rest of our data. For this project and data, our target is the `isFraud` feature.

As such, we'll separate our target from the remaining features, and then use the `train_test_split` function to perform the split.

In [6]:
#Import train_test_split from sklearn
from sklearn.model_selection import train_test_split

#Separate the target variable and the features
X = df.drop(columns=['isFraud'])
y = df['isFraud']

#perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=64)
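
The split above is purely random. With a target this rare, one optional variation, which this notebook does not use, is to pass `stratify=y` so that both splits keep the same share of fraud cases. The sketch below shows the effect on made-up data. `X_toy` and `y_toy` are hypothetical stand-ins, not the fraud dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 1,000 rows with 20 positives (2%)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=64, stratify=y_toy
)

print("Positives in train:", int(y_tr.sum()), "of", len(y_tr))
print("Positives in test: ", int(y_te.sum()), "of", len(y_te))
```

With millions of rows a random split usually lands close to the true proportions anyway, so stratification matters most on smaller samples.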

## 5. Scaling
Now that our data has been split into training and testing sets, we can perform our next preprocessing task: standardization. We'll rescale every feature so that its mean is 0 and its standard deviation is 1. Many techniques, including the PCA step that follows, are sensitive to the magnitude of the inputs, so features whose values run into the hundreds of thousands, like `amount` and the balance columns, would otherwise carry more weight than they should.

Fortunately, Sklearn handles this process for us. All we need to do is import the `StandardScaler`, fit it on the training data, and transform both the training and testing data. We wait until after the train-test split because fitting the scaler on the training data alone avoids leaking information from the test set.

In [7]:
#import StandardScaler from sklearn
from sklearn.preprocessing import StandardScaler

#Initialize the StandardScaler
scaler = StandardScaler()

#Fit the scaler on the training data, then transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
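
To make the "fit on train, transform both" idea concrete, here is a minimal sketch on made-up numbers. The arrays and the `toy_scaler` name are assumptions for illustration. The point is that the test row is scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values on very different scales
train = np.array([[100.0, 0.1], [200.0, 0.2], [300.0, 0.3], [400.0, 0.4]])
test = np.array([[250.0, 0.5]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)   # learns mean and std from train only
test_scaled = toy_scaler.transform(test)         # reuses the training mean and std

# The same result by hand: (x - training mean) / training std
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled, manual_test)
```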

## 6. Principal Component Analysis

With our data now split and scaled, we will perform **Principal Component Analysis (PCA)** as the final step before modeling. PCA is a dimensionality reduction technique used to distill a large set of features into a smaller, more efficient set of new features called principal components. We are doing this to reduce the complexity of our dataset, which can help our model train faster and sometimes perform better by removing redundant information and noise. It is essential that we apply PCA *after* scaling because the algorithm works by identifying the directions of maximum variance, and it would be biased by features with large, arbitrary scales if the data were not first standardized. To implement this, we will use Sklearn's PCA to analyze our scaled training data and transform our extensive feature set into a smaller number of principal components that still capture the vast majority of the original information.

In [8]:
from sklearn.decomposition import PCA

#Initialize and Fit PCA
#We'll choose n_components=0.95, which tells PCA to automatically select the
#number of components needed to retain 95% of the original variance.
pca = PCA(n_components=0.95)

#Fit PCA ONLY on the training data to learn the principal components.
#Same idea as fitting the scaler, we want to avoid data leakage.
pca.fit(X_train_scaled)

#Apply the learned transformation to both the training and testing sets
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)


#Check the Results
print(f"Original number of features: {X_train_scaled.shape[1]}")
print(f"Reduced number of features (Principal Components): {pca.n_components_}")
print(f"Variance explained by new components: {np.sum(pca.explained_variance_ratio_):.4f}")

Original number of features: 17
Reduced number of features (Principal Components): 12
Variance explained by new components: 0.9604
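
To see how a fractional `n_components` picks the number of components, the sketch below runs PCA on a small made-up dataset in which two of the four features are near-copies of the other two. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: four features, but two of them are near-copies of the others
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
noise = 0.01 * rng.normal(size=(500, 2))
X_toy = np.hstack([base, base + noise])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit with all components to see the cumulative explained variance
cumulative = np.cumsum(PCA().fit(X_toy_scaled).explained_variance_ratio_)

# A float n_components keeps the smallest number of components that reaches the target
toy_pca = PCA(n_components=0.95).fit(X_toy_scaled)

print("Cumulative variance:", np.round(cumulative, 4))
print("Components kept:", toy_pca.n_components_)
```

Because components are added one at a time, the retained variance usually overshoots the target slightly, which is why the notebook output above reports 0.9604 rather than exactly 0.95.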


It's worth noting that in addition to dimensionality reduction, PCA serves a crucial secondary purpose in industries like banking: data privacy and anonymization. The process transforms concrete, sensitive features like account balances into a new set of abstract, mathematical components that are not directly human-readable. This allows an organization to share a feature-rich dataset for modeling and analysis without exposing the raw, private customer information, making PCA a powerful tool for both simplifying data and securing it. This is actually part of the reason we are using a synthetic dataset for this project. Most of the uploaded banking datasets already have PCA applied for privacy and security purposes, so by grabbing a synthetic dataset like this, we're able to practice applying PCA ourselves.

## 7. Data Balancing

This project is designed for learning, and training a model on the full six million rows can be computationally intensive for a standard laptop. This dataset also presents a severe class imbalance, with fraudulent transactions being extremely rare. Therefore, we'll address both issues at once by creating a smaller, perfectly balanced training subsample.

To increase the representation of our rare fraud cases, we'll use an advanced oversampling technique called **SMOTE (Synthetic Minority Over-sampling Technique)**. Instead of simply duplicating existing fraud data, SMOTE intelligently creates new, artificial fraud samples by generating data points along the line segments that connect real fraud cases. This results in a richer, more diverse set of fraud examples for our model to learn from without creating exact copies.

Conversely, to handle the overwhelmingly large 'non-fraud' class, we'll use **random undersampling**. This technique works by simply removing random samples from the majority class until its size is reduced to a more manageable number. The main benefit is that it significantly speeds up training time and prevents the model from being biased by the sheer volume of non-fraudulent examples.

To combine these two methods in a clean way (and to save us some trouble), we'll use a pipeline object from the imblearn library. A pipeline is an object that chains multiple data processing steps together into a single workflow. This ensures that our oversampling and undersampling steps are applied in the correct sequence automatically, preventing logical errors.

Our specific strategy will be to use the pipeline to first apply SMOTE to increase the number of fraud cases to a more substantial level, and then immediately apply random undersampling to reduce the number of non-fraud cases to match. This hybrid approach gives us the best of both worlds: a perfectly balanced dataset that is large enough to learn from but small enough to train on efficiently.
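
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration and not imblearn's implementation: it always uses the single nearest neighbour, whereas SMOTE chooses at random among several nearest neighbours. The points and the `synthetic_point` helper are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthetic_point(points, index, rng):
    """Create one synthetic sample between points[index] and its nearest neighbour."""
    x = points[index]
    distances = np.linalg.norm(points - x, axis=1)
    distances[index] = np.inf              # ignore the point itself
    neighbour = points[np.argmin(distances)]
    lam = rng.uniform(0.0, 1.0)            # position along the segment
    return x + lam * (neighbour - x), neighbour

new_point, neighbour = synthetic_point(minority, 0, rng)
print("Real point:     ", minority[0])
print("Neighbour:      ", neighbour)
print("Synthetic point:", new_point)
```

The synthetic point always lies on the straight line between two real minority samples, so it resembles real fraud cases without duplicating either of them.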

In [9]:
#Import necessary libraries for resampling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

#Define the resampling strategy
#Use SMOTE to increase the minority class to a more substantial number
smote = SMOTE(sampling_strategy={1: 1000000}, random_state=42)

#Use RandomUnderSampler to make both classes equal
under = RandomUnderSampler(sampling_strategy={0: 1000000}, random_state=42)

#Create the pipeline that chains these steps
pipeline = Pipeline(steps=[('smote', smote), ('under', under)])

#Apply the pipeline to the full training data
X_train_resampled, y_train_resampled = pipeline.fit_resample(X_train_pca, y_train)

#Print out the counts to verify results
print("New balanced training set class distribution:")
print(y_train_resampled.value_counts())

New balanced training set class distribution:
isFraud
0    1000000
1    1000000
Name: count, dtype: int64



## 8. Building and Training the Model
With our data completely prepared, it's time to build our model. For this project, we'll be using a random forest for its robustness and its ability to predict binary targets such as the one in this dataset. Please note that as of right now, it might still take upwards of 20 minutes to fit the data. In the previous step, we ensured our dataset had 1 million cases of both fraudulent and non-fraudulent transactions. For the sake of time, you're free to set that down to 100,000, or another value that might be more reasonable in terms of run-time. While the model does in fact benefit from having more data to train from, this project is for educational purposes, so it's OK to simply understand the technique and move on.

We'll import the model from Sklearn, initialize it, and then train it on the training data.

In [None]:
#Import the random forest classifier from sklearn
from sklearn.ensemble import RandomForestClassifier

#Initialize the random forest model
model = RandomForestClassifier(random_state=64)

#Fit the random forest on the training data
model.fit(X_train_resampled, y_train_resampled)


Again, if no parameters were changed, this will take on average around 20 minutes to fit. With larger and more complex data, this tends to happen. But besides that, everything went as it should. Let's move on.
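
If the fitting time is a problem, two standard `RandomForestClassifier` parameters are worth knowing about besides shrinking the training set. The sketch below is a hedged illustration on made-up data (`X_demo` and `y_demo` are stand-ins for the resampled training set), and the specific values are assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Made-up stand-in for the resampled training data
X_demo, y_demo = make_classification(n_samples=2000, n_features=12, random_state=64)

# n_jobs=-1 builds trees on all available CPU cores;
# fewer trees (the default is 100) trades some stability for speed
fast_model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=64)
fast_model.fit(X_demo, y_demo)

print("Trees in the forest:", len(fast_model.estimators_))
print("Training accuracy:", fast_model.score(X_demo, y_demo))
```

Changing these settings would change the results reported in the next section, so treat them as options to experiment with.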

## 9. Evaluating the Model

With our model trained on our balanced data, it's time to check how well it performs at identifying credit card fraud. We'll import standard evaluation metrics from Sklearn, compute them on the model's predictions for the test set (which still has the original class imbalance), and analyze the results.

In [11]:
#Import metrics for evaluation
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#Make predictions on the test set
pred = model.predict(X_test_pca)

#Evaluate the Random Forest model
print("Random Forest Accuracy:", accuracy_score(y_test, pred))
print("Random Forest Confusion Matrix:\n", confusion_matrix(y_test, pred))
print("\nRandom Forest Classification Report:\n", classification_report(y_test, pred))

Random Forest Accuracy: 0.99747666841647
Random Forest Confusion Matrix:
 [[1267874    2960]
 [    251    1439]]

Random Forest Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270834
           1       0.33      0.85      0.47      1690

    accuracy                           1.00   1272524
   macro avg       0.66      0.92      0.74   1272524
weighted avg       1.00      1.00      1.00   1272524



### Analysis

#### Confusion Matrix

The confusion matrix reveals how well the Random Forest model differentiates between legitimate transactions (class 0) and fraudulent ones (class 1), which is critical in a real-world fraud detection scenario.

The Random Forest model demonstrates good overall performance. It correctly identifies 1,267,874 legitimate transactions and 1,439 fraudulent ones. However, it does misclassify 2,960 legitimate transactions as fraudulent (false positives) and misses 251 fraud cases (false negatives). Given the sheer volume of class 0 examples, this outcome represents a significant achievement in model sensitivity, especially for the rare class 1 events.

These results highlight a model that is both cautious and effective. While it produces some false alarms, its ability to catch a large proportion of actual fraud cases signals a strong fit for high-risk financial applications. 

#### Classification Report

The model’s performance metrics provide further insight into this tradeoff:

* **Precision**:
  For legitimate transactions (class 0), precision rounds to 1.00, meaning that a transaction the model labels as legitimate is almost never actually fraudulent. For fraudulent transactions (class 1), the precision drops to 0.33: of the 4,399 transactions flagged as fraud, only 1,439 were actually fraudulent, so roughly two out of every three alerts are false alarms. This is often an acceptable trade-off in fraud detection, where it is better to be cautious.

* **Recall**:
  Recall for class 0 is again nearly perfect at 1.00, showing that almost every non-fraudulent transaction is correctly identified. More importantly, the recall for class 1 reaches 0.85, meaning the model successfully identifies 85% of fraudulent transactions. This high recall rate is essential in fraud detection, where the cost of missing fraudulent activity can be far greater than dealing with false alarms.

* **F1-score**:
  The F1-score balances the tradeoff between precision and recall. For class 0, the score remains perfect at 1.00. For class 1, the F1-score is 0.47, reflecting moderate effectiveness in balancing precision and recall for fraudulent transactions. This performance is strong given the extreme class imbalance and high-stakes context.

Macro averages, which weight both classes equally, come to 0.66 (precision), 0.92 (recall), and 0.74 (F1), and show that the fraud class lowers overall performance mainly through its low precision. The weighted averages remain at 1.00 because of the overwhelming presence of class 0, so they say little about how the model handles the minority class.
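
The class 1 figures in the report can be reproduced directly from the confusion matrix. The short calculation below uses the four counts printed above. The variable names are chosen for this example.

```python
# Counts copied from the confusion matrix above
tn, fp = 1267874, 2960
fn, tp = 251, 1439

precision = tp / (tp + fp)          # of all fraud alerts, how many were real fraud
recall = tp / (tp + fn)             # of all real fraud, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```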

#### Overall Analysis

The Random Forest model proves to be an effective fraud detection tool in this case. Its overall accuracy of 99.75% is not the interesting number, since labeling every test transaction as legitimate would already score about 99.87% on this skewed dataset. What matters is the recall of 0.85 on the minority class: the model identifies 1,439 of the 1,690 fraudulent transactions in the test set and misses 251.

In the realm of credit card fraud, this kind of performance is not only acceptable, it is desirable. The cost of a false positive (flagging a legitimate transaction as fraud) typically results in a brief inconvenience for the customer, such as a declined charge or a verification step. On the other hand, the cost of a false negative (failing to catch actual fraud) can result in significant financial loss, reputational damage, and downstream risk.

That is why high recall on the fraud class is more important than high precision. Catching more fraud, even if it means incorrectly flagging some legitimate transactions, can be quickly managed through manual review or automated checks. In contrast, failing to detect fraud has much more severe consequences.
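
One common way to act on this preference for recall, which this notebook does not do, is to lower the probability cut-off used to flag a transaction. The sketch below illustrates the effect on made-up imbalanced data. The dataset, the `clf` model, and the two thresholds are assumptions for demonstration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Made-up imbalanced data (about 5% positives), not the fraud dataset
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=64
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=64, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=64).fit(X_tr, y_tr)
fraud_probability = clf.predict_proba(X_te)[:, 1]

# For two classes, predict() effectively uses a 0.5 cut-off;
# a lower cut-off flags more transactions
results = {}
for threshold in (0.5, 0.2):
    flagged = (fraud_probability >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, flagged, zero_division=0),
        recall_score(y_te, flagged),
    )
    print(threshold, results[threshold])
```

Lowering the cut-off can only keep recall the same or raise it, usually at the cost of precision, so the right value depends on how expensive a missed fraud is compared with a false alarm.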

This model delivers on that principle. With a recall of 0.85 on fraudulent transactions and the robustness of ensemble learning behind it, the Random Forest model stands out as a strong candidate for deployment in a real-world fraud detection pipeline. Note, however, that in a real-world scenario, you'll likely have more data to work with, as well as a stronger machine. For an educational project, this teaches the techniques necessary to detect fraud, but there is still room to improve and alter the model specifically to catch more fraud in real-world scenarios.

Great job reaching this point in the project. You not only tackled an imbalanced and challenging dataset, but you also applied advanced techniques like SMOTE and performed a thorough evaluation across meaningful metrics. This is a well-executed step toward mastering classification problems in real-world, high-impact domains. Give yourself a pat on the back.