## E-Commerce Fraud Detection Logistic Regression Model 

## Objectives

* Build a logistic regression model to predict fraudulent transactions.
* Turn the patterns we found in EDA (e.g., time, amount, device × channel, country) into a basic risk score using Logistic Regression, so we can automatically flag higher-risk transactions for extra checks.

## Requirements
* Python 3.x
* Jupyter Notebook
* Required libraries: pandas, numpy and scikit-learn.
* Install from a bash terminal: `pip install pandas numpy scikit-learn`

## What this notebook does
- Loads the cleaned dataset.
- Prepares features (numeric + one-hot encoded categoricals) in a reproducible pipeline.
- Trains a Logistic Regression model with class imbalance handling.
- Evaluates performance with ROC AUC and PR AUC, plus precision/recall/F1.
- Finds a practical decision threshold (F1-oriented) and shows the confusion matrix.
- Lists top positive/negative coefficients to keep the model **explainable**.

## Notes
- AI was used in this section to help write and format the code and markdown. I have reviewed and edited all code and explanations for accuracy and clarity.
- Common issue with AI: it sometimes capitalises or renames column names when it shouldn't. For future use, check that column names match the dataset exactly.

## Inputs
- DataSet/Cleaned/cleaned_transactions.csv (must include: amount, hour, dayofweek, num_items, country, device, channel, coupon_applied, is_fraud)

## Outputs
- Printed metrics (ROC AUC, PR AUC, classification report, confusion matrix)
- Suggested threshold for best F1
- Coefficient table (feature importance)
- (Optional) savable metrics text file for reports.

## Success criteria
- Model trains without errors using the clean dataset.
- Reasonable discrimination on test set (non-trivial ROC AUC and PR AUC).
- Clear, copy-paste-ready metrics and a chosen threshold to add to the README.
- Coefficients that align with EDA insights (e.g., higher amounts → higher fraud risk).

In [1]:
# Imports & display options
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score,
    precision_recall_curve
)

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

We will now load the cleaned dataset and prepare features for modeling.

In [4]:
# Load cleaned data (cleaned_transactions.csv)
import os
from pathlib import Path

# Ensure we're in the correct working directory
current_dir = os.getcwd()
expected_path = r"c:\Users\Nine\OneDrive\Documents\VS Code Projects\E-Commerce-Fraud-Detection-Capstone\E-Commerce-Fraud-Detection-Capstone"

# If we're in the jupyter_notebooks subdirectory, move up one level
if current_dir.endswith("jupyter_notebooks"):
    os.chdir(os.path.dirname(current_dir))
    print(f"Changed working directory from: {current_dir}")
    print(f"                           to: {os.getcwd()}")

# Fall back to the author's local project path, but only if it exists on this machine
if os.getcwd() != expected_path and os.path.isdir(expected_path):
    os.chdir(expected_path)
    print(f"Corrected working directory to: {expected_path}")

# Load the cleaned dataset
data_path = Path("DataSet/Cleaned/cleaned_transactions.csv")
df = pd.read_csv(data_path)


In [16]:
print(df.shape)
df.head()

(10000, 12)


| | transaction_id | user_id | timestamp | amount | country | device | channel | hour | dayofweek | coupon_applied | num_items | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6253 | 3594 | 2023-01-28 06:04:00 | 125.79 | US | mobile | ads | 6 | 5 | 0 | 5 | 0 |
| 1 | 4685 | 2502 | 2023-04-27 21:32:00 | 153.4 | DE | mobile | web | 21 | 3 | 0 | 3 | 0 |
| 2 | 1732 | 2287 | 2023-08-19 19:03:00 | 7.64 | IN | tablet | app | 19 | 5 | 0 | 3 | 0 |
| 3 | 4743 | 3043 | 2023-03-14 04:56:00 | 36.36 | US | mobile | web | 4 | 1 | 1 | 2 | 0 |
| 4 | 4522 | 4629 | 2023-09-24 21:33:00 | 55.17 | ES | mobile | app | 21 | 6 | 0 | 1 | 0 |


In [None]:
# Quick sanity checks (target & dtypes)

# Target checks 
assert "is_fraud" in df.columns, "Expected target column 'is_fraud' not found."

# Make y robust to True/False/0/1/"True"/"False"
y_raw = df["is_fraud"]
if y_raw.dtype == bool:
    y = y_raw.astype(int)
else:
    y = y_raw.replace({"True": 1, "False": 0}).astype(int)

# Peek class balance
fraud_rate = y.mean()
print(f"Fraud rate: {fraud_rate:.3%} (imbalance expected)")

Fraud rate: 6.430% (imbalance expected)


This confirms the class imbalance (about 6.4% fraud) and sets up the binary target correctly (0/1).

In [17]:
# Feature selection 
candidate_cols = [
    "amount", "hour", "dayofweek", "num_items",
    "country", "device", "channel", "coupon_applied"
]
available = [c for c in candidate_cols if c in df.columns]
print(f"Available columns for modeling: {available}")

X = df[available].copy()

# Ensure boolean becomes string for clean one-hot encoding
if "coupon_applied" in X.columns and X["coupon_applied"].dtype == bool:
    X["coupon_applied"] = X["coupon_applied"].astype(str)

print(f"Feature matrix shape: {X.shape}")
print(f"Data types:")
print(X.dtypes)

Available columns for modeling: ['amount', 'hour', 'dayofweek', 'num_items', 'country', 'device', 'channel', 'coupon_applied']
Feature matrix shape: (10000, 8)
Data types:
amount            float64
hour                int64
dayofweek           int64
num_items           int64
country            object
device             object
channel            object
coupon_applied      int64
dtype: object


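We mix numeric and categorical features. A boolean `coupon_applied` would be cast to string so that OHE treats it as a category. Here the column is already int64 (0/1), so the cast is skipped and OHE still encodes 0 and 1 as two categories.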

In [20]:
# Train/test split (with stratify)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
X_train.shape, X_test.shape, y_train.mean(), y_test.mean()


((7500, 8), (2500, 8), 0.06426666666666667, 0.0644)

`stratify=y` keeps the fraud proportion nearly identical in both splits (6.43% in train, 6.44% in test).
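
To see the effect of stratification in isolation, here is a minimal sketch on a hypothetical toy target (1,000 rows with 64 positives, roughly the fraud rate seen here). The names `X_toy` and `y_toy` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 1,000 rows with 64 positives (6.4%), standing in for is_fraud
y_toy = np.array([1] * 64 + [0] * 936)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# Same settings as the notebook: 25% test split, fixed seed, stratified on the target
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(f"train rate: {y_tr.mean():.4f}, test rate: {y_te.mean():.4f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters more when the positive class is this rare.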

In [32]:
# Preprocessing: pass numeric, OHE categorical

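# Preprocess
# Names must match X exactly ("dayofweek", "device"); a name that matches nothing is
# silently filtered out by the `if c in X.columns` check. The outputs recorded below
# come from a run without day-of-week and device terms, so re-running will change them.
# num_items is in X but not listed here, so the ColumnTransformer drops it.
numeric_features = [c for c in ["amount", "hour", "dayofweek"] if c in X.columns]
categorical_features = [c for c in ["country", "device", "channel", "coupon_applied"] if c in X.columns]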

preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)


This keeps the pipeline clean and reproducible.
We won’t leak test info; encoding happens inside the pipeline.

In [33]:
# Model: logistic regression (balanced)

# Build pipeline & fit
clf = Pipeline(steps=[
    ("prep", preprocess),
    ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced"))
])

clf.fit(X_train, y_train)
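
`class_weight="balanced"` weights each class inversely to its frequency. The sketch below reproduces that weighting with scikit-learn's `compute_class_weight`. It assumes class counts of 7,018 and 482, which are inferred from the 7,500-row training split and its 6.43% fraud rate and are not printed by the notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed counts, inferred from 7,500 training rows at a ~6.43% fraud rate
y_demo = np.array([0] * 7018 + [1] * 482)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_demo)

# The "balanced" heuristic: n_samples / (n_classes * count_of_each_class)
manual = len(y_demo) / (2 * np.bincount(y_demo))

print({cls: round(float(w), 3) for cls, w in zip([0, 1], weights)})
```

The fraud class gets a much larger weight, so each missed fraud case costs more during training. This is also why the predicted probabilities behave as risk scores and are not calibrated fraud rates.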

In [34]:
# Evaluate: ROC AUC, PR AUC, report, confusion matrix

proba = clf.predict_proba(X_test)[:, 1]
pred_05 = (proba >= 0.5).astype(int)

print("ROC AUC:               ", round(roc_auc_score(y_test, proba), 3))
print("Average Precision (PR AUC):", round(average_precision_score(y_test, proba), 3))
print("\nClassification Report (threshold=0.5):\n",
      classification_report(y_test, pred_05, digits=3))
print("Confusion Matrix (threshold=0.5):\n", confusion_matrix(y_test, pred_05))


ROC AUC:                0.777
Average Precision (PR AUC): 0.31

Classification Report (threshold=0.5):
               precision    recall  f1-score   support

           0      0.973     0.784     0.868      2339
           1      0.179     0.683     0.283       161

    accuracy                          0.777      2500
   macro avg      0.576     0.733     0.576      2500
weighted avg      0.922     0.777     0.830      2500

Confusion Matrix (threshold=0.5):
 [[1833  506]
 [  51  110]]
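
As a quick check, the class-1 row of the report can be reproduced from the confusion matrix printed above. A minimal sketch, with the four counts copied by hand:

```python
import numpy as np

# Confusion matrix at threshold 0.5, copied from the output above.
# scikit-learn layout: rows = actual class, columns = predicted class.
cm = np.array([[1833, 506],
               [51, 110]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud cases that get flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```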


How to interpret quickly:

- ROC AUC ~ probability the model ranks a random fraud above a random non-fraud (1.0 = perfect).

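- PR AUC (Average Precision) is more informative than ROC AUC on imbalanced data; higher is better. A random ranker scores roughly the fraud rate (about 0.064 here), so 0.31 is well above that baseline.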

- Classification report shows precision/recall/F1 for each class at threshold 0.5.

- Confusion matrix shows counts at that threshold, with rows as actual classes and columns as predicted classes: [[TN, FP], [FN, TP]].
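
The ranking interpretation of ROC AUC can be checked directly on a small example. The sketch below uses hypothetical labels and scores (not the notebook's data) and compares a brute-force pairwise count with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, purely for illustration
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.30, 0.20, 0.90])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (fraud, non-fraud) pairs where the fraud case scores higher; ties count half
diff = pos[:, None] - neg[None, :]
pairwise = ((diff > 0) + 0.5 * (diff == 0)).mean()

print(pairwise, roc_auc_score(y_true, scores))
```

Both values agree, which is why ROC AUC is insensitive to the choice of threshold: it only depends on how the scores order the two classes.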

In [35]:
# Threshold tuning (simple F1 search)
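prec, rec, thr = precision_recall_curve(y_test, proba)
# prec/rec have one more element than thr: the final point (precision=1, recall=0)
# has no threshold, so drop it and index thr with the same position.
f1_scores = 2 * (prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-9)
best_idx = np.nanargmax(f1_scores)
best_thr = thr[best_idx]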

print(f"Best F1 ≈ {f1_scores[best_idx]:.3f} at threshold ≈ {best_thr:.2f}")

pred_best = (proba >= best_thr).astype(int)
print("\nClassification Report (best F1 threshold):\n",
      classification_report(y_test, pred_best, digits=3))
print("Confusion Matrix (best F1 threshold):\n", confusion_matrix(y_test, pred_best))


Best F1 ≈ 0.384 at threshold ≈ 0.78

Classification Report (best F1 threshold):
               precision    recall  f1-score   support

           0      0.957     0.959     0.958      2339
           1      0.386     0.379     0.382       161

    accuracy                          0.921      2500
   macro avg      0.672     0.669     0.670      2500
weighted avg      0.921     0.921     0.921      2500

Confusion Matrix (best F1 threshold):
 [[2242   97]
 [ 100   61]]


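Why: the default 0.5 threshold catches more fraud (recall 0.683), but only 17.9% of its flags are correct. The F1-oriented threshold (about 0.78) roughly doubles precision to 0.386 while recall falls to 0.379. In fraud, a missed case is often costlier than a manual review, so a lower threshold may still be preferable if review capacity allows. The threshold is tuned on the test set here, so treat the resulting F1 as optimistic.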
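
A note on how `precision_recall_curve` aligns its outputs, since the threshold search depends on it: `prec` and `rec` have one more element than `thr`, and `prec[i]`, `rec[i]` belong to `thr[i]`. The sketch below checks this on hypothetical toy scores (not the notebook's data) by re-scoring at the selected threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

# Hypothetical labels and scores, purely for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

prec, rec, thr = precision_recall_curve(y_true, scores)
# The trailing (precision=1, recall=0) point has no matching threshold
assert len(prec) == len(thr) + 1

f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_idx = int(np.argmax(f1))
best_thr = thr[best_idx]            # same index, no offset

# Re-scoring at that threshold should reproduce the F1 read off the curve
check = f1_score(y_true, (scores >= best_thr).astype(int))
print(best_thr, round(float(f1[best_idx]), 3), round(float(check), 3))
```

If the F1 from the curve and the F1 from the re-scored predictions disagree, the threshold index is misaligned.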

In [36]:
# “Feature importance”: inspect coefficients
# For logistic regression, coefficients tell you direction/strength (after OHE).

ohe = clf.named_steps["prep"].named_transformers_["cat"]
cat_out = []
if categorical_features:
    cat_out = ohe.get_feature_names_out(categorical_features).tolist()

feature_names = numeric_features + cat_out
coefs = clf.named_steps["logreg"].coef_.ravel()

coef_df = pd.DataFrame({"feature": feature_names, "coef": coefs}).sort_values("coef", ascending=False)
print("Top positive (higher => more likely fraud):")
display(coef_df.head(10))
print("\nTop negative (lower => less likely fraud):")
display(coef_df.tail(10))

Top positive (higher => more likely fraud):


| | feature | coef |
|---|---|---|
| 8 | country_IN | 0.511912 |
| 3 | country_BR | 0.377529 |
| 9 | country_JP | 0.172918 |
| 12 | channel_ads | 0.083488 |
| 0 | amount | 0.015648 |
| 5 | country_DE | -0.007757 |
| 1 | hour | -0.028454 |
| 2 | country_AU | -0.062362 |
| 14 | channel_email | -0.10353 |
| 15 | channel_social | -0.112259 |



Top negative (lower => less likely fraud):


| | feature | coef |
|---|---|---|
| 15 | channel_social | -0.112259 |
| 6 | country_ES | -0.146451 |
| 10 | country_UK | -0.170607 |
| 16 | channel_web | -0.236889 |
| 11 | country_US | -0.284573 |
| 17 | coupon_applied_0 | -0.343569 |
| 13 | channel_app | -0.375212 |
| 18 | coupon_applied_1 | -0.400833 |
| 7 | country_FR | -0.411233 |
| 4 | country_CA | -0.723778 |


How to read:

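- Positive coef → pushes probability up (riskier).

- Negative coef → pushes probability down (safer).

- Magnitudes are not directly comparable across features: `amount` and `hour` are unscaled, so their coefficients are per unit, while one-hot coefficients are per category. No category is dropped, so only differences between levels of the same feature are meaningful (e.g., both `coupon_applied` levels are negative).

- Match these back to your EDA patterns (e.g., country and channel). Device does not appear in the table, and device×channel combinations would need explicit interaction features.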
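
Coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to explain. A minimal sketch using three coefficients copied from the tables above; the 100-unit step for `amount` is an illustrative choice, not something the notebook computes.

```python
import numpy as np
import pandas as pd

# Three coefficients copied from the tables above
coefs = pd.Series({"country_IN": 0.511912, "amount": 0.015648, "country_CA": -0.723778})

# exp(coef) is the multiplicative change in the odds of fraud for a one-unit increase
odds_ratio = np.exp(coefs)

# amount is unscaled, so a one-unit step is small; a 100-unit step is exp(100 * coef)
amount_per_100 = float(np.exp(100 * coefs["amount"]))

print(odds_ratio.round(3))
print(round(amount_per_100, 2))
```

An odds ratio above 1 raises the odds of fraud and one below 1 lowers them, holding the other features fixed. For one-hot features the ratio is relative to the model's intercept and the other levels, so compare levels of the same feature with each other.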

The optional cell below saves the metrics and the coefficient table for use in reports.
(The code block is commented out; remove the leading "#" from each line if you want to use it.)


In [None]:
#os.makedirs("reports/modeling", exist_ok=True)  # create the output folder if it is missing
#metrics_txt = []

#roc = roc_auc_score(y_test, proba)
#apr = average_precision_score(y_test, proba)
#metrics_txt.append(f"ROC AUC: {roc:.3f}")
#metrics_txt.append(f"PR AUC: {apr:.3f}")

#report_05 = classification_report(y_test, pred_05, digits=3)
#report_best = classification_report(y_test, pred_best, digits=3)

#with open("reports/modeling/logreg_metrics.txt", "w") as f:
#    f.write("\n".join(metrics_txt))
#    f.write("\n\n-- Threshold 0.5 --\n")
#    f.write(report_05)
#    f.write("\n\n-- Best F1 --\n")
#    f.write(report_best)

#coef_df.to_csv("reports/modeling/logreg_coefficients.csv", index=False)
#print("Saved to reports/modeling/")