
Fraud Detection in Banking Transactions Using Unsupervised Learning

# Step 1: Load the Dataset

In [None]:
import pandas as pd
df = pd.read_csv('fraudTrain.csv')
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Index                  1048575 non-null  int64  
 1   trans_date_trans_time  1048575 non-null  object 
 2   cc_num                 1048575 non-null  float64
 3   merchant               1048575 non-null  object 
 4   category               1048575 non-null  object 
 5   amt                    1048575 non-null  float64
 6   first                  1048575 non-null  object 
 7   last                   1048575 non-null  object 
 8   gender                 1048575 non-null  object 
 9   street                 1048575 non-null  object 
 10  city                   1048575 non-null  object 
 11  state                  1048575 non-null  object 
 12  zip                    1048575 non-null  int64  
 13  lat                    1048575 non-null  float64
 14  long              

# Drop Unnecessary Columns

Step 2.1: Remove Unnecessary Columns

Not every column is useful for fraud detection. The columns fall into four groups:


Personal information (first, last, street, city, zip, job, dob) – Dropped; it is not expected to contribute to fraud detection.

Transaction ID (trans_num) – Dropped, together with the row counter (Index); a unique ID carries no signal for machine learning.

Timestamps (trans_date_trans_time, unix_time) – Kept for now; hour, day of week, and month are extracted from trans_date_trans_time in the feature engineering step.

Location data (lat, long, merch_lat, merch_long) – These may be useful, so the cell below keeps them; testing without them is left as a follow-up experiment.

In [None]:
# Drop unnecessary columns
df.drop(columns=['Index', 'first', 'last', 'street', 'city', 'zip', 'job', 'dob', 'trans_num'], inplace=True)


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 14 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   trans_date_trans_time  1048575 non-null  object 
 1   cc_num                 1048575 non-null  float64
 2   merchant               1048575 non-null  object 
 3   category               1048575 non-null  object 
 4   amt                    1048575 non-null  float64
 5   gender                 1048575 non-null  object 
 6   state                  1048575 non-null  object 
 7   lat                    1048575 non-null  float64
 8   long                   1048575 non-null  float64
 9   city_pop               1048575 non-null  int64  
 10  unix_time              1048575 non-null  int64  
 11  merch_lat              1048575 non-null  float64
 12  merch_long             1048575 non-null  float64
 13  is_fraud               1048575 non-null  int64  
dtypes: float64(6), int

# Handle Missing Values

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

Series([], dtype: int64)


# Convert Categorical Features

Encode Categorical Variables (One-Hot Encoding)

Categorical columns: merchant, category, gender, state

Convert them into numerical values using one-hot encoding.


In [None]:
df = pd.get_dummies(df, columns=['merchant', 'category', 'gender', 'state'], drop_first=True)
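
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The rows and category values are hypothetical and only illustrate the mechanics; they are not taken from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values for two of the categorical columns
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "category": ["grocery_pos", "travel", "gas_transport"],
    "amt": [12.5, 310.0, 45.2],
})

encoded = pd.get_dummies(toy, columns=["gender", "category"], drop_first=True)

# drop_first=True removes the first level of each column (sorted order),
# so gender_F and category_gas_transport become the implicit baseline.
print(encoded.columns.tolist())
```

Each encoded column with k distinct values becomes k-1 indicator columns, so a high-cardinality column such as merchant can widen the frame considerably.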

# Feature Engineering

Extract Date-Time Features

Since trans_date_trans_time is a timestamp, we extract:

Hour – Fraudulent transactions might be more frequent at odd hours.

Day of Week – Some days may have more fraud cases.

Month – Seasonal fraud patterns.

In [None]:
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])

# Extract time-based features
df['hour'] = df['trans_date_trans_time'].dt.hour
df['day_of_week'] = df['trans_date_trans_time'].dt.dayofweek
df['month'] = df['trans_date_trans_time'].dt.month

# Drop original timestamp column
df.drop(columns=['trans_date_trans_time'], inplace=True)
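
As a quick check of what the `.dt` accessors return, the sketch below runs the same extraction on two made-up timestamps. The timestamp format is an assumption; the actual format of trans_date_trans_time is not shown in this notebook.

```python
import pandas as pd

# Hypothetical timestamps; the real column format may differ
ts = pd.to_datetime(pd.Series(["2019-01-01 00:00:18", "2019-06-15 23:45:00"]))

features = pd.DataFrame({
    "hour": ts.dt.hour,               # 0-23
    "day_of_week": ts.dt.dayofweek,   # Monday=0, Sunday=6
    "month": ts.dt.month,             # 1-12
})
print(features)
```

Note that `dayofweek` counts from Monday as 0, so a value of 5 or 6 marks a weekend transaction.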


# Scale Numerical Data

Scaling is necessary because amt, city_pop, and unix_time span very different ranges, and features with large values would otherwise dominate scale-sensitive models such as One-Class SVM.

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numerical columns
num_cols = ['amt', 'city_pop', 'unix_time']

# Standardize numerical features
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
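
For reference, StandardScaler applies z = (x - mean) / std to each column. The sketch below checks this on a few hypothetical amounts (not values from the dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts with one large value
amounts = np.array([[5.0], [20.0], [35.0], [1000.0]])

scaled = StandardScaler().fit_transform(amounts)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
manual = (amounts - amounts.mean()) / amounts.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Standardization does not remove outliers: the large amount still ends up far from the rest, only on a comparable scale.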


# Define Features and Target

Features (X): All columns except is_fraud

Target (y): is_fraud (1 = Fraud, 0 = Legitimate)

In [None]:
# Separate features and target
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']


Now that our data is preprocessed, we will train an Isolation Forest model to detect fraudulent transactions.

# Train the Isolation Forest Model

In [None]:
# from sklearn.ensemble import IsolationForest

# # Train Isolation Forest
# iso_forest = IsolationForest(n_estimators=200, contamination=0.02, random_state=42, verbose=1)
# iso_forest.fit(X)

# # Predict anomaly labels (-1 = anomaly, 1 = normal)
# df['Anomaly_Score'] = iso_forest.predict(X)

# # Convert -1 (fraud) and 1 (legit) to 1 (fraud) and 0 (legit)
# df['Anomaly_Score'] = df['Anomaly_Score'].map({1: 0, -1: 1})

# # Count detected fraud cases
# print(df['Anomaly_Score'].value_counts())
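
The same fit, predict, and remap pattern can be tried quickly on synthetic data. The sketch below is illustrative only: the points are randomly generated, not drawn from the fraud dataset, and `X_toy`, `labels`, `scores`, and `flags` are names introduced here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data: 500 points near the origin plus 10 far-away points
inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X_toy = np.vstack([inliers, outliers])

model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = model.fit_predict(X_toy)         # 1 = inlier, -1 = anomaly
scores = model.decision_function(X_toy)   # lower = more anomalous

# Same remapping as in the cell above: 1 = flagged, 0 = normal
flags = (labels == -1).astype(int)
print(flags.sum(), "of", len(flags), "points flagged")
```

`contamination` sets the share of training points that end up flagged, so it controls how many alerts the model raises rather than how many of them are correct.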


# Evaluate the Model

Since we have the actual fraud labels (is_fraud), we can evaluate the model.

In [None]:
# from sklearn.metrics import classification_report, confusion_matrix

# # True fraud labels
# y_true = y  # Original fraud labels
# y_pred = df['Anomaly_Score']  # Predictions from Isolation Forest

# # Print classification report
# print(classification_report(y_true, y_pred))

# # Display confusion matrix
# import seaborn as sns
# import matplotlib.pyplot as plt

# plt.figure(figsize=(6,4))
# sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt='d', cmap='Blues', xticklabels=['Legit', 'Fraud'], yticklabels=['Legit', 'Fraud'])
# plt.xlabel("Predicted")
# plt.ylabel("Actual")
# plt.title("Confusion Matrix")
# plt.show()


Confusion Matrix:

True Negatives (TN): 1,032,221 (Correctly classified legitimate transactions)

False Positives (FP): 10,348 (Legitimate transactions incorrectly marked as fraud)

False Negatives (FN): 5,868 (Fraud cases missed)

True Positives (TP): 138 (Correctly identified fraud cases)
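
Precision and recall follow directly from these four counts. A short calculation, using only the numbers listed above:

```python
# Confusion-matrix counts reported above
tn, fp, fn, tp = 1_032_221, 10_348, 5_868, 138

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)
flag_rate = (tp + fp) / (tn + fp + fn + tp)  # share of all transactions flagged

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} flag_rate={flag_rate:.4f}")
```

Precision is about 1.3% and recall about 2.3%, so most flagged transactions are legitimate and most fraud cases are missed.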



# Visualizing Fraud vs. Legit Transactions

In [None]:
# Box Plot of Transaction Amount for Fraud vs. Legit
# plt.figure(figsize=(10,6))
# sns.boxplot(x=df['Anomaly_Score'], y=df['amt'])
# plt.xlabel("Fraud (1) vs. Legit (0)")
# plt.ylabel("Transaction Amount")
# plt.title("Transaction Amounts of Fraudulent and Legitimate Transactions")
# plt.show()


In [None]:
# Scatter Plot of City Population vs. Transaction Amount
# plt.figure(figsize=(10,6))
# sns.scatterplot(x=df['city_pop'], y=df['amt'], hue=df['Anomaly_Score'], palette="viridis", alpha=0.6)
# plt.xlabel("City Population")
# plt.ylabel("Transaction Amount")
# plt.title("Fraud vs. Legit Transactions")
# plt.legend(title="Fraud")
# plt.show()


# Save the Model for Deployment

In [None]:
# import pickle

# # Save trained model
# pickle.dump(iso_forest, open("fraud_detection_model.pkl", "wb"))
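
A save-and-reload round trip is a cheap way to confirm that the pickled model behaves the same after loading. The sketch below uses a small stand-in model on synthetic data; `demo_model`, `X_demo`, and the file name are hypothetical and are not the notebook's `iso_forest`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import IsolationForest

# Small stand-in model fitted on synthetic data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
demo_model = IsolationForest(n_estimators=50, random_state=0).fit(X_demo)

path = os.path.join(tempfile.gettempdir(), "demo_fraud_model.pkl")

# Context managers close the file handles even if an error occurs
with open(path, "wb") as f:
    pickle.dump(demo_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)
```

For deployment, the fitted `scaler` would likely need to be saved the same way, since new transactions must be transformed with the training statistics before prediction.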


# Try Another Model for Improvement

Step 1: Apply One-Class SVM for Fraud Detection

One-Class SVM is useful for anomaly detection when fraud cases are rare.

It learns the normal transaction patterns and flags outliers as fraud.

In [None]:
from sklearn.svm import OneClassSVM

# Train One-Class SVM
one_class_svm = OneClassSVM(nu=0.02, kernel="rbf", gamma="auto")  # Adjust `nu` (0.01 - 0.05)
one_class_svm.fit(X)

# Predict anomalies
df["Anomaly_Score"] = one_class_svm.predict(X)

# Convert One-Class SVM output: 1 (Normal) → 0,  -1 (Anomaly/Fraud) → 1
df["Anomaly_Score"] = df["Anomaly_Score"].map({1: 0, -1: 1})

# Count detected fraud cases
print(df["Anomaly_Score"].value_counts())
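
Kernel SVM training tends to slow down sharply as the number of rows grows, so fitting on all 1,048,575 transactions may take a very long time. One possible workaround, sketched here on synthetic data, is to fit on a random subsample and predict on every row. The subsample size and the `X_all` stand-in matrix are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic stand-in for the preprocessed feature matrix X
X_all = rng.normal(size=(20_000, 5))

# Fit on a random subsample to keep training time manageable
sample_idx = rng.choice(len(X_all), size=2_000, replace=False)
svm = OneClassSVM(nu=0.02, kernel="rbf", gamma="auto")
svm.fit(X_all[sample_idx])

# Predict on every row: 1 = normal, -1 = anomaly; remap to 0/1 as above
flags = (svm.predict(X_all) == -1).astype(int)
print(f"flagged share: {flags.mean():.3f}")
```

`nu` bounds the fraction of training points treated as outliers, so the flagged share on the full data can differ somewhat from 0.02.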


Model Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# True fraud labels
y_true = y  # Original fraud labels
y_pred = df['Anomaly_Score']  # Predictions from One-Class SVM

# Print classification report
print(classification_report(y_true, y_pred))

# Display confusion matrix
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt='d', cmap='Blues', xticklabels=['Legit', 'Fraud'], yticklabels=['Legit', 'Fraud'])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()