### Made By: Stefanie K Saji

# **Dataset Description**
---

Source: Dataset by [@rohan-chandrashekar](https://huggingface.co/datasets/rohan-chandrashekar/) from [Hugging Face](https://huggingface.co/datasets/rohan-chandrashekar/Financial_Fraud_Detection)

Size: 3,713,576 rows x 14 columns

Target Variable: isFraud (binary classification)

Nature: Structured financial transaction data

# **Fraud Detection Using Logistic Regression**
---

## **Description:**
 This project implements a machine learning pipeline to detect fraudulent transactions.

## **Detailed workflow consists of the following steps:**

**Data Preparation:**

1. Loaded the dataset and removed columns not used for training (nameOrig, nameDest, isUnauthorizedOverdraft, and the action__* indicator columns).

2. Created a roughly balanced dataset by sampling 8,000 fraud and 10,000 non-fraud transactions from the training split to address class imbalance, so the model can learn patterns from both classes.

**Feature Processing:**

1. Identified numeric features (amount, oldBalanceOrig, newBalanceOrig, oldBalanceDest, newBalanceDest). The list of categorical features is left empty, because the action__* columns are dropped during data preparation.

2. Applied StandardScaler to numeric features to standardize them to zero mean and unit variance. OneHotEncoder(drop='first') is configured for categorical features to avoid multicollinearity, although no categorical columns are passed to it in this notebook.

3. Used a ColumnTransformer to apply these transformations consistently in a single step.

**Model Training:**

1. Built a Pipeline combining the preprocessing step and a LogisticRegression classifier with class_weight='balanced' to further compensate for class imbalance.

2. Trained the pipeline on the training data (X_train, y_train) so the model learns the relationship between features and fraud labels.

**Evaluation:**

1. The model was first validated on a 30% hold-out split of the balanced training sample (5,400 rows), achieving 97.74% accuracy.

2. On the unseen final test sample (80,000 fraud and 100,000 non-fraud rows), the model achieved 97.92% accuracy, in line with the validation result, which suggests it generalizes to transactions it was not trained on.

3. Precision, recall, and F1-score were also calculated. On the final test sample the fraud class reached 0.96 precision and 1.00 recall (F1-score 0.98), so the model misses very few fraudulent transactions while raising a small share of false alarms.

In [None]:
# Using the Colab library, we connect the notebook (runtime) to Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Importing relevant libraries.
import pandas as pd
import sklearn

In [None]:
#Reading the CSV file from the drive and storing the data in the df variable.
#Change the path to your respective drive path (see README.md of the GitHub repository).
#Path Example: "/content/drive/MyDrive/ABC/Financial_Fraud_Dataset.csv"
df = pd.read_csv("YOUR_PATH_HERE")

In [None]:
#Removing columns that are not useful for training the ML model.
df_model = df.drop(["nameOrig", "nameDest", "isUnauthorizedOverdraft", "action__CASH_IN", "action__CASH_OUT", "action__DEBIT", "action__PAYMENT", "action__TRANSFER"], axis=1)

In [None]:
#Assigning a unique ID to each row.
df_model = df_model.reset_index().rename(columns={'index': 'row_id'})

In [None]:
df_model.head()

| | row_id | amount | oldBalanceOrig | newBalanceOrig | oldBalanceDest | newBalanceDest | isFraud |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 89.24 | 89.24 | 0.0 | 0.0 | 89.24 | 1 |
| 1 | 1 | 89.24 | 89.24 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 2 | 70.62 | 70.62 | 0.0 | 0.0 | 70.62 | 1 |
| 3 | 3 | 70.62 | 70.62 | 0.0 | 0.0 | 0.0 | 1 |
| 4 | 4 | 22695.14 | 22695.14 | 0.0 | 0.0 | 22695.14 | 1 |

In [None]:
from sklearn.model_selection import train_test_split
#Splitting the data into disjoint training and testing datasets.
train_df, test_df = train_test_split(
    df_model,
    test_size=0.25, #25% of the rows go into test and 75% into train
    stratify=df_model['isFraud'], #Preserves the fraud/non-fraud class proportions in both splits
    random_state=42 #Fixes the shuffle seed so the split is reproducible
)
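
The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical toy frame (`toy`, 400 rows with a 10% fraud share), not the notebook's data, and compares the fraud share in both parts of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical data): 400 rows, 10% labelled as fraud.
toy = pd.DataFrame({
    "amount": range(400),
    "isFraud": [1] * 40 + [0] * 360,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,
    stratify=toy["isFraud"],  # keep the 10% fraud share in both parts
    random_state=42,          # fixed seed, so the split is reproducible
)

# Share of fraud rows in each part of the split.
train_share = toy_train["isFraud"].mean()
test_share = toy_test["isFraud"].mean()
print(train_share, test_share)
```

Both shares should match the overall fraud share, which is the property the split above relies on.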

In [None]:
# Checking the number of fraud and non fraud in train data.
train_df.value_counts("isFraud")

| isFraud | count |
|---|---|
| 0 | 2532591 |
| 1 | 252591 |


In [None]:
test_df.value_counts("isFraud")

| isFraud | count |
|---|---|
| 0 | 844197 |
| 1 | 84197 |


In [None]:
# Created a balanced training dataset to manage minority data(isFraud=1) to avoid bias.
train_fraud = train_df[train_df['isFraud'] == 1] #Selects all rows where isFraud is 1, dataframe with only fraudulent transactions
train_normal = train_df[train_df['isFraud'] == 0] #Selects all rows where isFraud is 0, dataframe with only non fraudulent transactions

train_fraud_sample = train_fraud.sample(n=8000, random_state=42) #Randomly samples 8,000 fraud rows from train_fraud
train_normal_sample = train_normal.sample(n=10000, random_state=42) #Randomly samples 10,000 normal rows from train_normal

balanced_training_dataset = (
    pd.concat([train_fraud_sample, train_normal_sample]) #Combines the sampled fraud and normal rows into one DataFrame
      .sample(frac=1, random_state=42) #Shuffles the rows so fraud and normal rows are mixed, not grouped separately
      .reset_index(drop=True) #Renumbers the rows after shuffling to avoid misalignment later
)

In [None]:
# Created a testing dataset resampled in the same way (80,000 fraud and 100,000 non-fraud rows).
test_fraud = test_df[test_df['isFraud'] == 1]
test_normal = test_df[test_df['isFraud'] == 0]

test_fraud_sample = test_fraud.sample(n=80000, random_state=21)
test_normal_sample = test_normal.sample(n=100000, random_state=21)

balanced_testing_dataset = (
    pd.concat([test_fraud_sample, test_normal_sample])
      .sample(frac=1, random_state=21)
      .reset_index(drop=True)
)
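
The training and testing samples are built with the same three steps: sample each class, concatenate, shuffle. As a sketch only, those steps could be wrapped in one helper. The function name `sample_per_class` and the toy frame are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def sample_per_class(df, label_col, counts, seed):
    """Draw counts[label] rows for each label, then shuffle the result."""
    parts = [
        df[df[label_col] == label].sample(n=n, random_state=seed)
        for label, n in counts.items()
    ]
    return (
        pd.concat(parts)
          .sample(frac=1, random_state=seed)  # shuffle so the classes are mixed
          .reset_index(drop=True)             # renumber rows after shuffling
    )

# Toy frame (hypothetical data): 10 fraud rows and 40 normal rows.
toy = pd.DataFrame({"row_id": range(50), "isFraud": [1] * 10 + [0] * 40})
toy_balanced = sample_per_class(toy, "isFraud", {1: 8, 0: 10}, seed=42)
print(toy_balanced["isFraud"].value_counts())
```

With such a helper, only the source frame, the per-class counts, and the seed would differ between the two samples.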


In [None]:
#Checks for overlap between the two datasets. A result of 0 means no row appears in both.
overlap = pd.merge( #Joins the training and testing sets by matching rows on all shared columns
    balanced_training_dataset,
    balanced_testing_dataset,
    how='inner' #Keeps only rows that appear in both DataFrames (duplicates)
)
print(len(overlap)) #Prints how many duplicates exist

0
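
When `pd.merge` is called without an `on` argument, it joins on every column name the two frames share, so a row counts as overlapping only if all of its values match. The sketch below, on two small hypothetical frames, shows that behaviour next to a check on `row_id` alone.

```python
import pandas as pd

# Hypothetical stand-ins for the balanced training and testing frames.
train_part = pd.DataFrame({"row_id": [0, 1, 2], "amount": [10.0, 20.0, 30.0]})
test_part = pd.DataFrame({"row_id": [3, 4], "amount": [10.0, 50.0]})

# Inner merge with no `on` argument joins on every shared column name,
# here both row_id and amount.
overlap_rows = pd.merge(train_part, test_part, how="inner")

# A check on the identifier alone: which row_id values occur in both frames?
shared_ids = set(train_part["row_id"]) & set(test_part["row_id"])

print(len(overlap_rows), len(shared_ids))
```

Here the amount 10.0 occurs in both frames, but under different `row_id` values, so neither check reports an overlap.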


In [None]:
#Provides information on how many missing values exist in each column.
balanced_training_dataset.isnull().sum()

| Column | Missing values |
|---|---|
| row_id | 0 |
| amount | 0 |
| oldBalanceOrig | 0 |
| newBalanceOrig | 0 |
| oldBalanceDest | 0 |
| newBalanceDest | 0 |
| isFraud | 0 |


In [None]:
#Displays a summary of balanced training dataset, including column names, data types, non-null counts, and memory usage.
balanced_training_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   row_id          18000 non-null  int64  
 1   amount          18000 non-null  float64
 2   oldBalanceOrig  18000 non-null  float64
 3   newBalanceOrig  18000 non-null  float64
 4   oldBalanceDest  18000 non-null  float64
 5   newBalanceDest  18000 non-null  float64
 6   isFraud         18000 non-null  int64  
dtypes: float64(5), int64(2)
memory usage: 984.5 KB


In [None]:
# Defines and differentiates the categorical and numerical columns.
categorical = []
numeric = ["amount","oldBalanceOrig","newBalanceOrig","oldBalanceDest","newBalanceDest"]

In [None]:
# Separates the target variable isFraud into y and the input features (the remaining 6 columns) into X for model training.
y = balanced_training_dataset["isFraud"]
X = balanced_training_dataset.drop("isFraud", axis=1)

In [None]:
# Separates the target variable isFraud into yT and the input features (the remaining 6 columns) into XT for model testing.
yT = balanced_testing_dataset["isFraud"]
XT = balanced_testing_dataset.drop("isFraud", axis=1)

In [None]:
# Further splits X and y into training and validation sets. X_test and y_test here are the validation split, used before touching the unseen final test data (XT, yT).
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=42, stratify=y)

In [None]:
#Checks for overlap between the training and validation features. A result of 0 means no row appears in both.
overlap1 = pd.merge( #Joins the training and validation sets by matching rows on all shared columns
    X_train,
    X_test,
    how='inner' #Keeps only rows that appear in both DataFrames (duplicates)
)
print(len(overlap1)) #Prints how many duplicates exist

0


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
preprocessor = ColumnTransformer(                         #Applies different preprocessing to different columns in a single step
    transformers=[                                        #List of (name, transformer, columns) entries
        ("num", StandardScaler(), numeric),               #Standardizes numeric columns to zero mean and unit variance
        ("cat", OneHotEncoder(drop="first"), categorical) #One-hot encodes categorical columns, dropping the first level to reduce multicollinearity (categorical is empty here)
    ],
    remainder="drop"                                      #Columns not listed in numeric or categorical (such as row_id) are dropped
)
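
To illustrate what a `ColumnTransformer` with `StandardScaler` and `remainder="drop"` produces, here is a minimal sketch on a hypothetical four-row frame. The names `toy` and `toy_prep` are made up for this example, and only two of the numeric columns are included.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical data) with an identifier and two numeric features.
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "amount": [10.0, 20.0, 30.0, 40.0],
    "oldBalanceOrig": [100.0, 100.0, 300.0, 300.0],
})

toy_prep = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["amount", "oldBalanceOrig"])],
    remainder="drop",  # row_id is not listed, so it is dropped
)
scaled = toy_prep.fit_transform(toy)

print(scaled.shape)                             # two columns remain
print(scaled.mean(axis=0), scaled.std(axis=0))  # per-column mean and std
```

After scaling, each remaining column has mean 0 and standard deviation 1 (up to floating-point error), and the identifier column no longer reaches the model.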

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
#Combining the preprocessor and the logistic regression model into a single pipeline.
pipeline = Pipeline([
    ("prep", preprocessor), #Adds the preprocessor defined above as the step named "prep"
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)) #Adds the classifier; class_weight="balanced" weights each class inversely to its frequency
])
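
With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below works this out for the label counts implied by the 70/30 stratified split of the 18,000-row sample (7,000 non-fraud and 5,600 fraud rows in the training part). The array `y_counts` is a stand-in built from those counts, not the notebook's `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels: 7,000 non-fraud (0) and 5,600 fraud (1) rows,
# the counts implied by the 70/30 stratified split of 18,000 rows.
y_counts = np.array([0] * 7000 + [1] * 5600)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)

# The same formula written out by hand.
manual = len(y_counts) / (2 * np.bincount(y_counts))

print(weights, manual)
```

Because the sample is already close to balanced, the two weights stay near 1 (0.9 for non-fraud, 1.125 for fraud), so the option only adds a mild correction here.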

In [None]:
pipeline.fit(X_train, y_train) #Now the model is trained

In [None]:
#Predicting on the validation set to check performance before the final test.
y_pred = pipeline.predict(X_test)

In [None]:
#Checking performance on final test set.
yT_pred = pipeline.predict(XT)

In [None]:
from sklearn.metrics import classification_report
#Prints a detailed performance report of the model on the validation set.
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      0.96      0.98      3000
           1       0.95      1.00      0.98      2400

    accuracy                           0.98      5400
   macro avg       0.98      0.98      0.98      5400
weighted avg       0.98      0.98      0.98      5400



In [None]:
#Prints the classification report for the final test set.
print(classification_report(yT,yT_pred))

              precision    recall  f1-score   support

           0       1.00      0.96      0.98    100000
           1       0.96      1.00      0.98     80000

    accuracy                           0.98    180000
   macro avg       0.98      0.98      0.98    180000
weighted avg       0.98      0.98      0.98    180000
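
The precision, recall, and f1-score columns in the reports above are all derived from the confusion matrix. As a sketch, the snippet below recomputes them for the positive class from hypothetical labels (`y_true`, `y_hat`), which are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted fraud that is really fraud
recall = tp / (tp + fn)     # share of real fraud that was caught
f1 = 2 * precision * recall / (precision + recall)

# Cross-check against scikit-learn's own metric functions.
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

High recall on class 1 means few missed fraud cases, while precision below 1 on class 1 reflects legitimate transactions flagged as fraud.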



In [None]:
#Computes the accuracy of the model on the validation set.
pipeline.score(X_test, y_test) * 100

97.74074074074073

In [None]:
#Computes the accuracy of the model on the final test set.
pipeline.score(XT, yT) * 100

97.91888888888889