<h3>-- Model Building: Logistic Regression --</h3>

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Load the Cleaned Data

In [6]:
data = pd.read_csv(r"C:\Users\Gebruiker\Desktop\IronHack\Projects\Online Payments Fraud Detection\cleaned_fraud_dataset.csv")

In [7]:
df = data.copy()

Define Features & Target

In [8]:
X = df.drop("isFraud", axis=1)
y = df["isFraud"]

Train/Test Split

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Feature Scaling 

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


Train the model

In [8]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100)


Make Predictions

In [9]:
y_pred = model.predict(X_test_scaled)


Evaluate the model

In [10]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9992039442871019
Confusion Matrix:
 [[1270852      52]
 [    961     659]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00   1270904
           1       0.93      0.41      0.57      1620

    accuracy                           1.00   1272524
   macro avg       0.96      0.70      0.78   1272524
weighted avg       1.00      1.00      1.00   1272524



- The dataset is heavily imbalanced: the test set contains 1,270,904 class 0 (legitimate) transactions and only 1,620 class 1 (fraud) transactions, about 0.13% of the total.
- Of the 1,620 actual fraud cases, the model correctly identified only 659 (recall = 0.41).
- It missed the other 961 fraud cases (false negatives), which is a serious problem for fraud detection.
- When the model does predict fraud, it is right 93% of the time (only 52 false positives), but that precision comes at the cost of missing most of the real fraud.
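
As a check on the figures above, the class 1 metrics can be recomputed directly from the confusion matrix. This is a minimal sketch that uses only the four counts printed in the output; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix printed above
# (rows = actual class, columns = predicted class).
tn, fp = 1_270_852, 52
fn, tp = 961, 659

total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of fraud predictions that are correct
recall = tp / (tp + fn)      # share of actual fraud cases that are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of a trivial model that always predicts class 0
always_zero_accuracy = (tn + fp) / total

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} always-predict-0 accuracy={always_zero_accuracy:.4f}")
```

A model that always predicts class 0 would already score about 99.87% accuracy on this test set, which is why accuracy alone says little about fraud detection here.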

***Solutions for Imbalanced Data***

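- Class weighting: setting class_weight='balanced' weights each class inversely to its frequency, so errors on the rare fraud class cost more during training.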

In [12]:
model = LogisticRegression(class_weight='balanced')
model.fit(X_train_scaled, y_train)


LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight='balanced', random_state=None, solver='lbfgs', max_iter=100)


Make Predictions

In [13]:
y_pred = model.predict(X_test_scaled)

Evaluate the model

In [14]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9518437373283333
Confusion Matrix:
 [[1209662   61242]
 [     38    1582]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.98   1270904
           1       0.03      0.98      0.05      1620

    accuracy                           0.95   1272524
   macro avg       0.51      0.96      0.51   1272524
weighted avg       1.00      0.95      0.97   1272524



- Recall for class 1 jumped from 0.41 to 0.98: the model now catches 1,582 of the 1,620 actual fraud cases and misses only 38, which is what matters most in fraud detection.
- Precision for class 1 dropped from 0.93 to 0.03: the model now predicts class 1 far too often, and only about 3% of the transactions it flags are actually fraud.
- False positives (class 0 predicted as 1) rose from 52 to 61,242, roughly 4.8% of all legitimate transactions in the test set.
- In effect, the model flags anything that looks suspicious, including many harmless transactions, so that it rarely misses actual fraud.
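
To make the effect of class_weight='balanced' concrete, the sketch below applies the formula given in the scikit-learn documentation: n_samples / (n_classes * count of each class). The training-set class counts are not shown in this notebook, so the test-set counts from the report are used as a stand-in (an assumption; the training split should have similar proportions). The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Return per-class weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Assumption: test-set supports from the report, used as a proxy for the training set.
counts = {0: 1_270_904, 1: 1_620}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"class {label}: weight = {weight:.3f}")

# How many times more a class 1 error costs than a class 0 error
print(f"weight ratio (class 1 / class 0) = {weights[1] / weights[0]:.1f}")
```

With these proportions each fraud example counts several hundred times more than a legitimate one in the loss, which is consistent with the sharp shift toward predicting class 1.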

***SMOTE (Synthetic Minority Over-sampling Technique) with Logistic Regression***
SMOTE is a widely used technique for imbalanced datasets. It generates synthetic minority-class examples by interpolating between existing minority samples and their nearest minority neighbours, which balances the training set without simply duplicating rows.
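
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not imbalanced-learn's implementation: the function name `smote_like_sample`, the toy points, and the choice of `k=2` are all assumptions made for the example.

```python
import math
import random

def smote_like_sample(minority_points, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority_points)
    # k nearest minority neighbours of the chosen point (excluding the point itself)
    others = [p for p in minority_points if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # random position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Toy minority-class points with two features (hypothetical values)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]

rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a line segment between two real minority points, so the new rows are plausible variations rather than exact copies.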

***Import SMOTE and needed libraries***

In [10]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


* Rerun just the top cell to import the necessary libraries and load the dataset.

Define Features & Target

In [11]:
X = df.drop("isFraud", axis=1)
y = df["isFraud"]

Split data

In [12]:
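# No test_size is given, so the default 25% test split applies (the earlier runs used test_size=0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)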

Apply SMOTE only to training data

In [13]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Train model on the resampled data

In [14]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_resampled, y_resampled)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Predict on the test set - evaluate using the original (unseen) test set

In [15]:
y_pred = model.predict(X_test)

Evaluate model performance

In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9144660532925116
[[1452772  135830]
 [    225    1828]]
              precision    recall  f1-score   support

           0       1.00      0.91      0.96   1588602
           1       0.01      0.89      0.03      2053

    accuracy                           0.91   1590655
   macro avg       0.51      0.90      0.49   1590655
weighted avg       1.00      0.91      0.95   1590655



***Before SMOTE (class-weighted model)***

Accuracy: 95.18%
* Confusion Matrix:

True Negatives (TN): 1,209,662

False Positives (FP): 61,242

False Negatives (FN): 38

True Positives (TP): 1,582

* Key Metrics:

Class 0 (majority): Precision = 1.00, Recall = 0.95

Class 1 (minority): Precision = 0.03, Recall = 0.98, F1-score = 0.05

* Conclusion:

Recall for the rare fraud class is very high (0.98), but precision is extremely low (0.03): the model flags 61,242 legitimate transactions as fraud in order to catch 1,582 real cases.

As a result, the F1-score for the minority class is poor (0.05).

The 95.18% accuracy is misleading because it is dominated by the majority class.

***After SMOTE***

Accuracy: 91.45%

* Confusion Matrix:

TN: 1,452,772

FP: 135,830

FN: 225

TP: 1,828

* Key Metrics:

Class 0: Precision = 1.00, Recall = 0.91

Class 1: Precision = 0.01, Recall = 0.89, F1-score = 0.03

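* Conclusion:

Compared with the unweighted baseline (recall 0.41), SMOTE raised minority-class recall to 0.89, but that is lower than the 0.98 reached with class weighting.

Precision remains very low (0.01): 135,830 legitimate transactions are flagged in order to catch 1,828 fraud cases.

Accuracy dropped to 91.45% because training on a balanced set pushes the model to predict class 1 far more often, so more majority-class transactions are misclassified.

F1-score for the minority class is still low (0.03).

The comparison is not like-for-like: the SMOTE model was trained on unscaled features, did not converge within the default 100 iterations, and was evaluated on a different test split (25% of the data, 1,590,655 rows, instead of 20%, 1,272,524 rows).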
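
The three confusion matrices reported in this notebook can be put on a common footing by comparing rates instead of raw counts. The sketch below uses only the printed counts; the dictionary keys, the `summarize` helper, and the "false alarms per fraud caught" ratio are illustrative choices, not part of the original analysis.

```python
# Confusion-matrix counts (tn, fp, fn, tp) copied from the three outputs in this notebook.
runs = {
    "default": (1_270_852, 52, 961, 659),
    "class_weight='balanced'": (1_209_662, 61_242, 38, 1_582),
    "SMOTE": (1_452_772, 135_830, 225, 1_828),
}

def summarize(tn, fp, fn, tp):
    """Compute size-independent rates for the fraud class."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_alarms_per_fraud_caught": fp / tp,
    }

summary = {name: summarize(*counts) for name, counts in runs.items()}

for name, m in summary.items():
    print(
        f"{name:<24} recall={m['recall']:.2f} precision={m['precision']:.2f} "
        f"FPR={m['false_positive_rate']:.4f} "
        f"false alarms per fraud caught={m['false_alarms_per_fraud_caught']:.1f}"
    )
```

Rates are unaffected by the different test-set sizes (1,272,524 vs 1,590,655 rows), although the splits themselves still differ, so small gaps between the runs should be read with caution.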