**Fraud Detection in Financial Transactions Using the IEEE-CIS Fraud Detection Dataset**

Financial fraud costs banks, businesses, and consumers billions of dollars annually. This project aims to build a machine learning model that detects fraudulent transactions in real time, reducing financial losses and improving transaction security.


The objective of this project is to develop a machine learning model that accurately detects fraudulent transactions using the IEEE-CIS Fraud Detection dataset. I will:

1. Preprocess the data by handling missing values, removing duplicates, and transforming categorical variables.
2. Perform exploratory data analysis (EDA) to identify key fraud indicators.
3. Engineer relevant features to enhance model performance.
4. Train and evaluate machine learning models, addressing class imbalance and optimizing for precision and recall.
5. Deploy the best model and visualize insights to aid in real-time fraud detection.


I will use the IEEE-CIS Fraud Detection dataset available on Kaggle, specifically the train_transaction.csv and train_identity.csv files. These files provide comprehensive, anonymized transaction data and supplementary identity information that can be merged on the TransactionID field.

**Data Collection**

In [2]:
import pandas as pd
import numpy as np
import os

In [4]:
transaction_path = 'ieee-fraud-detection_project/data/raw/train_transaction.csv'
identity_path = 'ieee-fraud-detection_project/data/raw/train_identity.csv'

In [6]:
print("train_transaction.csv exists:", os.path.exists(transaction_path))
print("train_identity.csv exists:", os.path.exists(identity_path))

train_transaction.csv exists: True
train_identity.csv exists: True


In [8]:
# Load the raw data files
df_transaction = pd.read_csv(transaction_path)
df_identity = pd.read_csv(identity_path)

print("Transaction dataset shape:", df_transaction.shape)
print("Identity dataset shape:", df_identity.shape)

Transaction dataset shape: (590540, 394)
Identity dataset shape: (144233, 41)


In [10]:
# Merge the two datasets on 'TransactionID' using a left join
df_merged = pd.merge(df_transaction, df_identity, on='TransactionID', how='left')
print("Merged dataframe shape:", df_merged.shape)

Merged dataframe shape: (590540, 434)
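
A left join keeps every transaction row, which is why the merged frame has the same 590,540 rows as the transaction table and 394 + 41 - 1 = 434 columns (the shared `TransactionID` key is counted once). The following is a minimal sketch on made-up toy frames, not the real files, of how the `validate` and `indicator` arguments of `pd.merge` could be used to confirm that the key is unique on both sides and to count how many transactions found identity data. The frame names `toy_transaction` and `toy_identity` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative, not from the dataset).
toy_transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],
})
toy_identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# validate="one_to_one" raises MergeError if TransactionID repeats on either side;
# indicator=True adds a "_merge" column recording where each row came from.
toy_merged = pd.merge(
    toy_transaction,
    toy_identity,
    on="TransactionID",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows tagged "both" have identity data; "left_only" rows do not.
match_counts = toy_merged["_merge"].value_counts()
print(match_counts)
```

On the real data, the `left_only` count would show how many transactions carry no identity record at all.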


**Data Definition**

In [22]:
#Displaying the first few rows
print("First 5 rows of merged data:")
display(df_merged.head())

First 5 rows of merged data:


,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M


In [24]:
# Overview of the dataframe: index range, column count, dtype breakdown, and memory usage
print("\nDataFrame Info:")
df_merged.info()


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(399), int64(4), object(31)
memory usage: 1.9+ GB
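
The merged frame takes up roughly 1.9 GB, most of it in the 399 `float64` columns. One possible way to shrink it, sketched below on toy data, is to downcast numeric columns with `pd.to_numeric(..., downcast=...)`. The helper name `downcast_numeric` is hypothetical, and note that `float32` keeps only about seven significant digits, so this is an assumption worth checking against columns such as `TransactionAmt` before applying it broadly.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns stored in the smallest fitting dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy frame with the default 64-bit dtypes pandas assigns on load.
toy = pd.DataFrame({
    "card1": np.array([13926, 2755, 4663], dtype="int64"),
    "card2": np.array([np.nan, 404.0, 490.0], dtype="float64"),
})
toy_small = downcast_numeric(toy)

print(toy_small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", toy_small.memory_usage(deep=True).sum())
```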


In [26]:
# Summary statistics for the numerical columns
print("\nSummary Statistics (Numerical Columns):")
display(df_merged.describe())


Summary Statistics (Numerical Columns):


,TransactionID,isFraud,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,...,id_17,id_18,id_19,id_20,id_21,id_22,id_24,id_25,id_26,id_32
count,590540.0,590540.0,590540.0,590540.0,590540.0,581607.0,588975.0,586281.0,524834.0,524834.0,...,139369.0,45113.0,139318.0,139261.0,5159.0,5169.0,4747.0,5132.0,5163.0,77586.0
mean,3282270.0,0.03499,7372311.0,135.027176,9898.734658,362.555488,153.194925,199.278897,290.733794,86.80063,...,189.451377,14.237337,353.128174,403.882666,368.26982,16.002708,12.800927,329.608924,149.070308,26.508597
std,170474.4,0.183755,4617224.0,239.162522,4901.170153,157.793246,11.336444,41.244453,101.741072,2.690623,...,30.37536,1.561302,141.095343,152.160327,198.847038,6.897665,2.372447,97.461089,32.101995,3.737502
min,2987000.0,0.0,86400.0,0.251,1000.0,100.0,100.0,100.0,100.0,10.0,...,100.0,10.0,100.0,100.0,100.0,10.0,11.0,100.0,100.0,0.0
25%,3134635.0,0.0,3027058.0,43.321,6019.0,214.0,150.0,166.0,204.0,87.0,...,166.0,13.0,266.0,256.0,252.0,14.0,11.0,321.0,119.0,24.0
50%,3282270.0,0.0,7306528.0,68.769,9678.0,361.0,150.0,226.0,299.0,87.0,...,166.0,15.0,341.0,472.0,252.0,14.0,11.0,321.0,149.0,24.0
75%,3429904.0,0.0,11246620.0,125.0,14184.0,512.0,150.0,226.0,330.0,87.0,...,225.0,15.0,427.0,533.0,486.5,14.0,15.0,371.0,169.0,32.0
max,3577539.0,1.0,15811130.0,31937.391,18396.0,600.0,231.0,237.0,540.0,102.0,...,229.0,29.0,671.0,661.0,854.0,44.0,26.0,548.0,216.0,32.0
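
Because `isFraud` is a 0/1 label, its mean in the summary (0.03499) is the fraud rate: about 3.5% of transactions are fraudulent, so the classes are heavily imbalanced. A small sketch of how the imbalance might be quantified follows. The toy label vector is constructed to mirror that rate, and the negatives-per-positive ratio is one common heuristic for setting class weights, not something this notebook has adopted.

```python
import pandas as pd

# Toy label column: 7 fraud cases in 200 rows (3.5%), mirroring the reported rate.
toy_labels = pd.Series([1] * 7 + [0] * 193, name="isFraud")

# Share of each class and the overall fraud rate.
class_share = toy_labels.value_counts(normalize=True)
fraud_rate = toy_labels.mean()

# Ratio of negatives to positives, a common starting point for class weights.
neg_to_pos = (toy_labels == 0).sum() / (toy_labels == 1).sum()

print(class_share)
print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Negatives per positive: {neg_to_pos:.1f}")
```

With a rate this low, plain accuracy is uninformative (predicting "not fraud" everywhere would already score about 96.5%), which is why the project objective targets precision and recall.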


In [28]:
#Counting unique values per column
unique_counts = df_merged.nunique()
display(unique_counts)

TransactionID     590540
isFraud                2
TransactionDT     573349
TransactionAmt     20902
ProductCD              5
                   ...  
id_36                  2
id_37                  2
id_38                  2
DeviceType             2
DeviceInfo          1786
Length: 434, dtype: int64
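
Unique counts help decide how each categorical column could be encoded later: a column like `ProductCD` (5 levels) is cheap to one-hot encode, while `DeviceInfo` (1,786 levels) is not. The sketch below splits columns by cardinality on a toy frame. The cut-off `MAX_ONE_HOT_LEVELS` is a hypothetical value chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ProductCD": ["W", "W", "H", "C", "W", "H"],
    "DeviceInfo": ["SM-G892A", "iOS Device", "Windows", "MacOS", "SM-J700M", "Trident/7.0"],
    "id_36": ["F", "F", "T", "F", "T", "F"],
})

unique_counts = toy.nunique()

# Hypothetical cut-off separating columns suitable for one-hot encoding
# from those that likely need a more compact encoding.
MAX_ONE_HOT_LEVELS = 3
low_cardinality = unique_counts[unique_counts <= MAX_ONE_HOT_LEVELS].index.tolist()
high_cardinality = unique_counts[unique_counts > MAX_ONE_HOT_LEVELS].index.tolist()

print("One-hot candidates:", low_cardinality)
print("High-cardinality columns:", high_cardinality)
```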

In [32]:
# Next, check for missing values in the dataset
print("\nMissing values per column:")
missing_values = df_merged.isnull().sum()
display(missing_values[missing_values > 0])


Missing values per column:


card2           8933
card3           1565
card4           1577
card5           4259
card6           1571
               ...  
id_36         449555
id_37         449555
id_38         449555
DeviceType    449730
DeviceInfo    471874
Length: 414, dtype: int64
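
Raw counts are hard to compare across 414 columns, so it can help to express them as fractions. For example, `id_36` is missing in 449,555 of 590,540 rows, about 76%, which follows from only 144,233 transactions having an identity record. A minimal sketch on toy data of ranking columns by missing share is shown below. `REVIEW_THRESHOLD` is a hypothetical cut-off, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "card2": [361.0, np.nan, 490.0, 567.0],
    "id_36": [np.nan, np.nan, np.nan, "F"],
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns that are more than half empty for review.
REVIEW_THRESHOLD = 0.5
mostly_missing = missing_share[missing_share > REVIEW_THRESHOLD].index.tolist()

print(missing_share)
print("Columns to review:", mostly_missing)
```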

In [36]:
# Example: if 'TransactionAmt' exists, check its min and max
if 'TransactionAmt' in df_merged.columns:
    min_amt = df_merged['TransactionAmt'].min()
    max_amt = df_merged['TransactionAmt'].max()
    print(f"\nTransactionAmt range: {min_amt} to {max_amt}")


TransactionAmt range: 0.251 to 31937.391
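
The range runs from 0.251 to 31,937.391 while the median in the summary statistics is only 68.769, so `TransactionAmt` is strongly right-skewed. One common way to tame such a tail is a `log1p` transform, sketched below. The column name `TransactionAmt_log` is an assumption for illustration, and whether the transform helps depends on the model chosen later.

```python
import numpy as np
import pandas as pd

# Minimum, median, 75th percentile and maximum taken from the summary statistics.
toy = pd.DataFrame({"TransactionAmt": [0.251, 68.769, 125.0, 31937.391]})

# log1p(x) = log(1 + x) compresses the long right tail and is safe near zero.
toy["TransactionAmt_log"] = np.log1p(toy["TransactionAmt"])

print(toy)
```

The transform is monotonic, so the ordering of amounts is preserved, and `np.expm1` recovers the original values if needed.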


**Data Cleaning**

In [39]:
# Work on a copy so the merged dataframe stays unchanged while handling missing values and duplicates
df_clean = df_merged.copy()

In [41]:
#Next step, identify the numeric and categorical columns
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
categorical_cols = df_clean.select_dtypes(include=['object']).columns

In [45]:
# Handle missing values: fill numeric columns with their median and categorical columns with 'Unknown'
df_clean[numeric_cols] = df_clean[numeric_cols].fillna(df_clean[numeric_cols].median())
df_clean[categorical_cols] = df_clean[categorical_cols].fillna('Unknown')
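
Median filling has two side effects worth keeping in mind: it hides the fact that a value was missing, and medians computed on the full dataset use rows that may later end up in a validation split. The sketch below shows one possible alternative on toy data: add a 0/1 missing flag before filling, and take the median from the training rows only. The split point and the column name `card2_was_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "card2": [361.0, np.nan, 490.0, 567.0, np.nan, 404.0],
    "card4": ["visa", None, "mastercard", "visa", None, "discover"],
})

# Hypothetical time-ordered split: first four rows train, last two hold-out.
train, holdout = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Medians come from the training rows only, so the hold-out set leaks nothing.
train_medians = train[["card2"]].median()

for part in (train, holdout):
    # Keep a 0/1 flag so the model can still see that the value was absent.
    part["card2_was_missing"] = part["card2"].isnull().astype("int8")
    part["card2"] = part["card2"].fillna(train_medians["card2"])
    part["card4"] = part["card4"].fillna("Unknown")

print(train)
print(holdout)
```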

In [47]:
#Next, removing the duplicates
duplicate_count = df_clean.duplicated().sum()
print("\nNumber of duplicate rows:", duplicate_count)
if duplicate_count > 0:
    df_clean.drop_duplicates(inplace=True)
    print("Duplicates removed. New shape:", df_clean.shape)


Number of duplicate rows: 0
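
`duplicated()` with no arguments only catches rows that match in every column. Since `TransactionID` is meant to identify a transaction, a key-level check can be a useful complement. Here the unique count of `TransactionID` (590,540) already equals the row count, so the key is unique. The toy example below, with a deliberately repeated ID, illustrates the difference between the two checks.

```python
import pandas as pd

# Toy frame where TransactionID 2 appears twice with different amounts.
toy = pd.DataFrame({
    "TransactionID": [1, 2, 2, 3],
    "TransactionAmt": [68.5, 29.0, 29.5, 59.0],
})

# No two rows are identical across all columns...
full_row_duplicates = toy.duplicated().sum()

# ...but the key column alone does repeat.
key_duplicates = toy.duplicated(subset=["TransactionID"]).sum()

print("Full-row duplicates:", full_row_duplicates)
print("Duplicate TransactionIDs:", key_duplicates)
```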


In [49]:
#For memory efficiency and optimization, we can convert certain columns to 'category'
for col in categorical_cols:
    df_clean[col] = df_clean[col].astype('category')
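
The `category` dtype stores each distinct string once and replaces the column values with small integer codes, which is where the memory saving comes from. A quick sketch on a toy column, similar in spirit to `card4`, of how the effect could be measured:

```python
import pandas as pd

# Low-cardinality toy column stored as Python strings.
toy_object = pd.Series(["visa", "mastercard", "visa", "discover"] * 2500, dtype="object")
toy_category = toy_object.astype("category")

# deep=True counts the string payloads, not just the pointers.
object_bytes = toy_object.memory_usage(deep=True)
category_bytes = toy_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print("categories:", list(toy_category.cat.categories))
```

The saving is largest for columns with few distinct values. For a high-cardinality column such as `DeviceInfo`, the gain is smaller.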

In [51]:
#Final check after cleaning this data
print("\nCleaned Dataframe Info:")
df_clean.info()


Cleaned Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: category(31), float64(399), int64(4)
memory usage: 1.8 GB


In [53]:
print("\nPreview of cleaned data:")
display(df_clean.head())


Preview of cleaned data:


,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,361.0,150.0,discover,142.0,...,Unknown,24.0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,Unknown,24.0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,Unknown,24.0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,Unknown,24.0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M


In [55]:
#Saving the cleaned data
processed_dir = 'ieee-fraud-detection_project/data/processed'
os.makedirs(processed_dir, exist_ok=True)

cleaned_file_path = os.path.join(processed_dir, 'cleaned_merged_data.csv')
df_clean.to_csv(cleaned_file_path, index=False)

print(f"Cleaned data saved to: {cleaned_file_path}")

Cleaned data saved to: ieee-fraud-detection_project/data/processed/cleaned_merged_data.csv
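
One caveat about saving to CSV: the format stores only text, so the `category` dtypes set during cleaning are not preserved when the file is read back. The sketch below demonstrates this with an in-memory buffer instead of the real file and shows that passing `dtype` to `pd.read_csv` is one way to restore the type at load time. The toy frame is illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "TransactionID": [2987000, 2987001],
    "card4": pd.Categorical(["discover", "mastercard"]),
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read_csv brings card4 back as generic strings, not as a category.
buffer.seek(0)
reloaded_plain = pd.read_csv(buffer)

# Passing dtype restores the categorical type at load time.
buffer.seek(0)
reloaded_typed = pd.read_csv(buffer, dtype={"card4": "category"})

print(reloaded_plain.dtypes)
print(reloaded_typed.dtypes)
```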
