Importing the libraries I need for data analysis, visualization and machine learning. Other libraries will be installed as the need arises during the process. 👇

### Setup
I import the libraries I need (pandas/NumPy for data, matplotlib/seaborn for plots, scikit‑learn for ML) and silence non‑critical warnings.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings
warnings.filterwarnings('ignore')

## IMPORTING & INSPECTING DATASET

#### At this stage, I am only performing a light inspection of the dataset to understand its shape, missing values, and distributions. I will postpone deeper analysis (skewness, scaling needs, final feature selection) until after I clean the data and impute missing values.

### Load data
I load the training CSV and preview the first rows and column info to confirm it reads correctly.

In [None]:
dt = pd.read_csv('/content/drive/MyDrive/fraud_transactions_train_10000_with_missing.csv')
dt.head()

In [None]:
dt.info()

## DROPPING ID COLUMNS

#### ID columns are not features and are not useful for predictive analysis, so I'll be dropping them. 👇

In [None]:
dt.drop(['transaction_id', 'customer_id'], axis=1, inplace=True)

#### Defining a function that helps me check for the percentage of missingness across the entire dataset.👇

### Missing values quick check
I compute % missing per column so I can plan imputation (remember: keep missingness as signal).

In [None]:
def perc_missing(df):
    """Return the percentage of missing values for each column that has any."""
    # (null count / total rows) * 100, rounded to 3 decimal places
    missing = round((df.isnull().sum() / len(df)) * 100, 3)
    # keep only the columns with at least one missing value, sorted ascending
    return missing[missing > 0].sort_values()

In [None]:
perc_missing(dt)

#### From the outcome, it can be observed that three columns have missing values, with missingness of 2%, 3% and 5% respectively.👆

#### Inspecting the value counts of each categorical column to decide the best encoding methods later on. 👇

In [None]:
for col in dt.select_dtypes(include='object').columns:
    print(f"\n{col} value counts:")
    print(dt[col].value_counts().head(10))

## SPLITTING THE DATASET AS EARLY AS POSSIBLE

#### Splitting to X and Y, Train and Test

In [None]:
X = dt.iloc[:, :-1]   # all columns except the last are features
y = dt.iloc[:, -1]    # the last column is the target

In [None]:
from sklearn.model_selection import train_test_split

#### I'll split to 80/20 so that I will have more data to train on since the fraud cases are usually rare. 👇

### Early split to avoid leakage
I split into train/test now so any fitting (imputation/encoding/scaling) only learns from train.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.head(2)

In [None]:
X_test.head(2)

In [None]:
X_train.info()

In [None]:
X_test.info()
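
Because fraud cases are rare, a plain random split can leave the train and test partitions with slightly different fraud rates. The split above does not stratify; the sketch below, run on synthetic labels rather than this dataset, shows how passing `stratify` to `train_test_split` keeps the class ratio equal in both partitions. The 4% rate and the `*_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic labels: exactly 4% positives (assumed rate, for illustration only)
n = 10_000
y_demo = np.zeros(n, dtype=int)
y_demo[:400] = 1
rng.shuffle(y_demo)
X_demo = rng.normal(size=(n, 3))

# stratify=y_demo keeps the positive rate the same in train and test
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = ytr.mean()
test_rate = yte.mean()
print(f"train fraud rate: {train_rate:.4f}, test fraud rate: {test_rate:.4f}")
```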

## CLEANING DATASET

## Handling Missing Values

#### In this fraud prediction project, I decided not to drop any rows or columns that contain missing values. The reason is that every transaction record is potentially important for identifying fraudulent activity, and removing rows may eliminate rare but critical fraud cases.

#### Similarly, dropping columns is not advisable because even features with missing values can carry useful signals. For example, the fact that a customer did not provide income information, or that device trust data is unavailable, could itself correlate with fraudulent behavior.

#### Instead of dropping, I will handle missing values through imputation strategies (such as median filling for numerical features and special categories/flags for categorical ones). This ensures that:

	•	No valuable transaction records are lost.
	•	Missingness itself can be captured and used by the model as a potential fraud indicator.
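
One way to realise "median filling plus a flag" is scikit-learn's `SimpleImputer` with `add_indicator=True`, which appends a 0/1 column marking the rows that were originally missing. This is a minimal sketch on a toy frame; the values and the `_missing` column name are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing entries (illustrative values)
demo = pd.DataFrame(
    {"customer_income_monthly": [2500.0, np.nan, 4100.0, np.nan, 3300.0]}
)

# Median fill, plus an extra indicator column for rows that were missing
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(demo)

result = pd.DataFrame(
    filled,
    columns=["customer_income_monthly", "customer_income_monthly_missing"],
)
print(result)
```

The indicator column lets a model treat "value was missing" as its own feature, independently of the filled value.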


#### In this project, I decided not to apply feature selection before training. The dataset contains 27 features, and in fraud detection every feature can potentially hold weak but important signals of fraudulent behavior. Dropping features too early may lead to losing valuable information, especially since fraud cases are rare and subtle.

#### Instead, I will train the models using all 27 features. After training, I will rely on model-based interpretability methods such as feature importance (from tree-based models), coefficients (from logistic regression), and SHAP values to analyze which features contributed most to fraud detection.

#### This approach ensures that I do not prematurely discard useful signals. It also allows me to provide insights later about which features were most influential in predicting fraud, without limiting the learning ability of the model at the start.
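
As a rough illustration of the coefficient-based interpretability mentioned above, the sketch below fits a logistic regression on synthetic standardized features and ranks them by absolute coefficient. The feature names and the data-generating rule are invented for the example; they are not the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: feat_a matters most, feat_b a little, feat_c is pure noise
X_demo = pd.DataFrame(
    rng.normal(size=(2000, 3)), columns=["feat_a", "feat_b", "feat_c"]
)
logits = 2.0 * X_demo["feat_a"] - 0.5 * X_demo["feat_b"]
y_demo = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

# Standardize first so coefficient magnitudes are comparable across features
model = LogisticRegression(max_iter=1000)
model.fit(StandardScaler().fit_transform(X_demo), y_demo)

importance = pd.Series(
    np.abs(model.coef_[0]), index=X_demo.columns
).sort_values(ascending=False)
print(importance)
```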

In [None]:
# First, I'll group the columns into categorical and numerical columns

# Categorical columns are all object type columns
cat_cols = X_train.select_dtypes(include='object').columns.tolist()

# Numerical columns are all int and float type columns
num_cols = X_train.select_dtypes(include=np.number).columns.tolist()

#### The three columns with missing values are Customer Income Monthly, Average Transaction Amount (30 days) and Device Trust Score.

## IN MY OPINION

#### I think filling missing numerical values in a fraud detection dataset with the median or mean would hide useful information, because missingness can itself be a signal of fraudulent activity.

#### I will examine the range of values in each column to decide which out-of-range value to fill the missing rows with, so that the model can recognise them as a distinct group during training.

In [None]:
cols_with_missing = ["customer_income_monthly",
                     "avg_transaction_amount_30d",
                     "device_trust_score"]

for col in cols_with_missing:
    print(f"\nColumn: {col}")
    print("Minimum value:", X_train[col].min())
    print("Maximum value:", X_train[col].max())

In [None]:
for col in cols_with_missing:
    print(f"\nColumn: {col}")
    print("Minimum value:", X_test[col].min())
    print("Maximum value:", X_test[col].max())

#### **From the outcome, I will assume the following range for each column:**

#### Customer Income Monthly (0 to 20000) - best outlier value (99999)

#### Average Transaction Amount (0 to 5000) - best outlier value (99999)

#### Device Trust Score (0 to 1) - best outlier value (-1)
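
A sentinel only works as a missingness marker if it can never collide with a real value. The sketch below is a small sanity check of that, using a toy frame whose values mirror the assumed ranges above; the `observed` frame is illustrative, not the real data.

```python
import numpy as np
import pandas as pd

# Toy data covering the assumed ranges, with some missing entries
observed = pd.DataFrame({
    "customer_income_monthly": [1200.0, np.nan, 20000.0],
    "avg_transaction_amount_30d": [0.0, 5000.0, np.nan],
    "device_trust_score": [0.0, 1.0, np.nan],
})

sentinels = {
    "customer_income_monthly": 99999,
    "avg_transaction_amount_30d": 99999,
    "device_trust_score": -1,
}

# A sentinel is safe if it lies outside [min, max] of the observed values
# (min/max skip NaN by default)
outside = {
    col: not (observed[col].min() <= value <= observed[col].max())
    for col, value in sentinels.items()
}
print(outside)
```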

In [None]:
# Importing imputation library

from sklearn.impute import SimpleImputer

### Impute with sentinel values
For selected numeric columns, I fill missing with out‑of‑range sentinels (e.g., 99999) so the model can learn the pattern of missingness.

In [None]:
# Filling with outliers to represent missing values

# For Customer Income Monthly (99999)

imp_income = SimpleImputer(strategy="constant", fill_value=99999)

X_train[["customer_income_monthly"]] = imp_income.fit_transform(X_train[["customer_income_monthly"]])
X_test[["customer_income_monthly"]] = imp_income.transform(X_test[["customer_income_monthly"]])


In [None]:
# For Average Transaction Amount 30 days (99999)

imp_avg = SimpleImputer(strategy="constant", fill_value=99999)

X_train[["avg_transaction_amount_30d"]] = imp_avg.fit_transform(X_train[["avg_transaction_amount_30d"]])
X_test[["avg_transaction_amount_30d"]] = imp_avg.transform(X_test[["avg_transaction_amount_30d"]])

In [None]:
# For Device Trust Score (-1)

imp_trust = SimpleImputer(strategy="constant", fill_value=-1)

X_train[["device_trust_score"]] = imp_trust.fit_transform(X_train[["device_trust_score"]])
X_test[["device_trust_score"]] = imp_trust.transform(X_test[["device_trust_score"]])

In [None]:
# Confirming

X_train.info()

In [None]:
X_test.info()

## ENCODING CATEGORICAL COLUMNS

#### Since the machine only understands numbers, converting categorical columns to number identifiers will be the next step.

### Encoding Choice

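#### For all my categorical columns, I will be using OrdinalEncoder. After inspecting the dataset, I observed that none of the categorical features have a natural order or hierarchy (e.g., “first class > business class > economy class”), so the encoder here acts like label encoding, mapping each category to a unique integer. The integers carry an arbitrary order: tree-based models are largely unaffected by it, but linear models such as Logistic Regression may read it as a real ranking, so this is a trade-off made for compactness.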

#### I chose OrdinalEncoder instead of:
	•	OneHotEncoder → this would increase the dimensionality significantly, since my dataset already has many features. I want to avoid unnecessary feature expansion.
	•	LabelEncoder → mainly designed for target labels and not ideal for multiple feature columns. It also does not handle unseen categories well.
	•	Other encoders (e.g., Target Encoding) → while powerful, they bring higher risk of data leakage if not carefully cross-validated.

#### OrdinalEncoder is simple, compact, and integrates smoothly into a pipeline, which is important since I intend to deploy the final model on Streamlit. This makes it easier to save, reload, and apply the exact same preprocessing during deployment.
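
To make the unseen-category behaviour concrete, here is a minimal sketch of `OrdinalEncoder` with `handle_unknown="use_encoded_value"`. The `channel` column and its values are hypothetical; the point is only that a category absent from the training data maps to `unknown_value` instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column; "kiosk" never appears in training
train = pd.DataFrame({"channel": ["web", "mobile", "pos", "web"]})
new = pd.DataFrame({"channel": ["mobile", "kiosk"]})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)

# Categories are sorted alphabetically: mobile=0, pos=1, web=2
codes = enc.transform(new).ravel().tolist()
print(codes)  # [0.0, -1.0]
```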

In [None]:
# Importing library for encoding

from sklearn.preprocessing import OrdinalEncoder

I have already defined all the 'Object' datatype columns as cat_cols, so I can go ahead to encode.

In [None]:
# Encoding

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

X_train[cat_cols] = encoder.fit_transform(X_train[cat_cols])
X_test[cat_cols] = encoder.transform(X_test[cat_cols])

In [None]:
X_train.head()

In [None]:
X_test.head()

## SCALING

#### In this project, I intend to create two versions of the dataset:
#### 1. Unscaled Data (raw values):

	•	Used for tree-based models like Random Forest and XGBoost.
	•	These models do not require scaling because they split features based on thresholds.

#### 2. Scaled Data (standardized features):

	•	Standardized to mean = 0 and standard deviation = 1.
	•	Used for linear models (e.g., Logistic Regression, SVM) and Neural Networks, which are sensitive to feature magnitudes.
	•	Standardization ensures that no single feature dominates the learning process simply due to its scale.

#### I will train models on both datasets:

	•	Tree models on both unscaled and scaled data (to confirm they are robust to scaling).
	•	Linear/NN models on the scaled data (since they require it).

#### This approach allows me to compare performance across algorithm families while ensuring each model receives data in the form that best suits its learning mechanism.

#### Also, to avoid distorting the columns that hold the missing-value sentinels, I will exempt them from the columns to be scaled. 👇

#### I will also avoid scaling the encoded columns and scale only the genuine continuous numeric columns.
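
The sketch below illustrates what standardization does and why the scaler is fitted on the training split only: the training column ends up with mean 0 and standard deviation 1, while the test column is transformed with the training statistics and so lands close to, but not exactly on, those values. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic "amount"-like feature for a train and a test split
train = rng.normal(loc=500, scale=120, size=(800, 1))
test = rng.normal(loc=500, scale=120, size=(200, 1))

# Fit on train only, then apply the same mean/std to test
scaler = StandardScaler().fit(train)
train_s = scaler.transform(train)
test_s = scaler.transform(test)

train_mean, train_std = train_s.mean(), train_s.std()
print(f"train: mean={train_mean:.3f}, std={train_std:.3f}")
print(f"test:  mean={test_s.mean():.3f}, std={test_s.std():.3f}")
```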

In [None]:
# Identifying outlier columns

outlier_cols = ["customer_income_monthly", "avg_transaction_amount_30d", "device_trust_score"]

I have already defined all the 'Float' and 'Int' datatype columns as num_cols, so I can go ahead to scale.

In [None]:
# Identifying the genuine continuous numeric columns of the dataset

scale_cols = [c for c in num_cols if c not in outlier_cols]

In [None]:
# Making a copy of the two paths

X_train_unscaled = X_train.copy()
X_test_unscaled  = X_test.copy()

X_train_scaled = X_train.copy()
X_test_scaled  = X_test.copy()

In [None]:
# Importing library for standard scaling

from sklearn.preprocessing import StandardScaler

In [None]:
# Scaling data

sc = StandardScaler()

X_train_scaled[scale_cols] = sc.fit_transform(X_train_scaled[scale_cols])
X_test_scaled[scale_cols]  = sc.transform(X_test_scaled[scale_cols])

In [None]:
# Confirming

X_train_scaled.head()

In [None]:
X_test_scaled.head()

In [None]:
%pip install lazypredict

### Baseline model comparison
I use LazyPredict to fit many classifiers with default settings and rank them, first on the unscaled data.

In [None]:
from lazypredict.Supervised import LazyClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

#X_train_unscaled, X_test_unscaled, y_train, y_test
clf_us = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None, random_state=42)
models_us, preds_us = clf_us.fit(X_train_unscaled, X_test_unscaled, y_train, y_test)

print("=== LazyPredict on UN-SCALED data (good for trees) ===")
print(models_us.sort_values(by=["ROC AUC","Accuracy"], ascending=False).head(20))

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### LazyPredict on scaled data
I repeat the same comparison on the scaled features.

In [None]:
#X_train_scaled, X_test_scaled, y_train, y_test
clf_sc = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None, random_state=42)
models_sc, preds_sc = clf_sc.fit(X_train_scaled, X_test_scaled, y_train, y_test)

print("=== LazyPredict on SCALED data (good for linear models) ===")
print(models_sc.sort_values(by=["ROC AUC","Accuracy"], ascending=False).head(20))

### Evaluate
I report Accuracy, Balanced Accuracy, Precision, Recall, F1, ROC AUC, and PR AUC — focusing on recall/PR AUC for fraud.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, f1_score,
                             classification_report)

### Metrics helper
I define a helper that prints every evaluation metric for a given set of predictions.

In [None]:
# ===============================
# Print metrics nicely
# ===============================

def print_metrics(y_true, proba, preds, header=""):
    print("\n" + "="*len(header))
    print(header)
    print("="*len(header))
    print(f"Accuracy:           {accuracy_score(y_true, preds):.4f}")
    print(f"Balanced Accuracy:  {balanced_accuracy_score(y_true, preds):.4f}")
    print(f"Precision:          {precision_score(y_true, preds, zero_division=0):.4f}")
    print(f"Recall:             {recall_score(y_true, preds, zero_division=0):.4f}")
    print(f"F1:                 {f1_score(y_true, preds, zero_division=0):.4f}")
    print(f"ROC AUC:            {roc_auc_score(y_true, proba):.4f}")
    print(f"PR  AUC:            {average_precision_score(y_true, proba):.4f}")
    print("\nClassification report:\n", classification_report(y_true, preds, digits=4))

### Threshold sweep
I scan several probability cutoffs and pick one that boosts recall at acceptable precision (I later settled around 0.45).

In [None]:
# ===============================
# Threshold sweep (see trade-offs)
# ===============================

def threshold_sweep(y_true, proba, thresholds=(0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5)):
    rows = []
    for t in thresholds:
        preds = (proba >= t).astype(int)
        rows.append({
            "threshold": t,
            "precision": precision_score(y_true, preds, zero_division=0),
            "recall":    recall_score(y_true, preds, zero_division=0),
            "f1":        f1_score(y_true, preds, zero_division=0),
            "bal_acc":   balanced_accuracy_score(y_true, preds)
        })
    return pd.DataFrame(rows).sort_values("threshold")
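
The sweep leaves the final choice to reading the table by eye. As a hedged sketch, one way to automate the rule "highest recall subject to a minimum precision" is shown below using `precision_recall_curve`. The helper name `pick_threshold` and the `min_precision` floor are hypothetical, and the labels and scores are toy values.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, proba, min_precision):
    """Return the cutoff with the highest recall whose precision meets the floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # The final precision/recall pair has no matching threshold, so drop it
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return None
    best_recall = recall[ok].max()
    # Among ties on recall, take the highest threshold (fewest false alarms)
    candidates = np.flatnonzero(ok & (recall == best_recall))
    return float(thresholds[candidates[-1]])

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80])

chosen = pick_threshold(y_true, proba, min_precision=0.5)
print(chosen)  # 0.35
```

Choosing the cutoff on a separate validation split, rather than on the test set, keeps the reported test metrics unbiased.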

### Handle class imbalance Random Forest
I set `class_weight='balanced'` so the model pays more attention to rare fraud cases.

In [None]:
# ===============================
# 1) RandomForest (unscaled) + class_weight
# ===============================

rf = RandomForestClassifier(
    n_estimators=400,
    max_depth=None,              # you can tune later (e.g., 8, 12, 16)
    class_weight="balanced",     # <<< imbalance handling
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train_unscaled, y_train)
proba_rf = rf.predict_proba(X_test_unscaled)[:, 1]
preds_rf = (proba_rf >= 0.5).astype(int)
print_metrics(y_test, proba_rf, preds_rf, header="RandomForest (UNSCALED) + class_weight='balanced'")

print("\nThreshold sweep (RF):")
display(threshold_sweep(y_test, proba_rf))

### Handle class imbalance Logistic Regression
I set `class_weight='balanced'` so the model pays more attention to rare fraud cases.

In [None]:
# ===============================
# 2) Logistic Regression (scaled) + class_weight
# ===============================
lr = LogisticRegression(
    C=0.1907,
    solver="lbfgs",
    penalty="l2",
    class_weight="balanced",
    max_iter=900,
    n_jobs=-1
)
lr.fit(X_train_scaled, y_train)
proba_lr = lr.predict_proba(X_test_scaled)[:, 1]
preds_lr = (proba_lr >= 0.5).astype(int)
print_metrics(y_test, proba_lr, preds_lr, header="LogisticRegression (SCALED) + class_weight='balanced'")

print("\nThreshold sweep (LR):")
display(threshold_sweep(y_test, proba_lr))

### Handle class imbalance XGBoost
For XGBoost I use `scale_pos_weight` (negatives / positives in the training set) rather than `class_weight`, so the model pays more attention to rare fraud cases.

In [None]:
# ===============================
# 3) XGBoost (unscaled) + scale_pos_weight  (optional)
# ===============================

# scale_pos_weight ≈ negatives / positives in TRAIN
pos = y_train.sum()
neg = len(y_train) - pos
spw = (neg / pos) if pos > 0 else 1.0

xgb = XGBClassifier(
    n_estimators=600,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    reg_lambda=1.0,
    random_state=42,
    n_jobs=-1,
    scale_pos_weight=spw,     # <<< key imbalance control
    objective="binary:logistic",
    eval_metric="auc"
)
xgb.fit(X_train_unscaled, y_train)
proba_xgb = xgb.predict_proba(X_test_unscaled)[:, 1]
preds_xgb = (proba_xgb >= 0.5).astype(int)
print_metrics(y_test, proba_xgb, preds_xgb, header=f"XGBoost (UNSCALED) + scale_pos_weight={spw:.2f}")

print("\nThreshold sweep (XGB):")
display(threshold_sweep(y_test, proba_xgb))
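
The two imbalance controls used above are closely related. With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the ratio of the fraud weight to the non-fraud weight equals negatives / positives, which is the same quantity passed to `scale_pos_weight`. A minimal check on synthetic labels (the 960/40 split is an assumed example):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 960 legitimate, 40 fraud (assumed 4% rate)
y_demo = np.array([0] * 960 + [1] * 40)

# Balanced weights: n_samples / (n_classes * count of each class)
w0, w1 = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# XGBoost-style ratio: negatives / positives
spw = (y_demo == 0).sum() / (y_demo == 1).sum()

print(f"w0={w0:.4f}, w1={w1:.4f}, w1/w0={w1 / w0:.1f}, scale_pos_weight={spw:.1f}")
```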

### Hyperparameter tuning
I run a randomized search over Logistic Regression's `C` and solver, scored by PR AUC with stratified 5-fold cross-validation.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from scipy.stats import loguniform

# Cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Search space: C (inverse regularization strength) and the solver
param_dist = {
    "C": loguniform(1e-3, 1e2),           # sample C between 0.001 and 100
    "solver": ["lbfgs", "liblinear"],
}


# Base Logistic Regression
lr = LogisticRegression(
    penalty="l2",
    class_weight="balanced",
    max_iter=2000,
    n_jobs=-1
)

# Randomized search
rs = RandomizedSearchCV(
    lr,
    param_distributions=param_dist,
    n_iter=20,                   # number of random draws
    scoring="average_precision", # PR-AUC scoring
    cv=cv,
    n_jobs=-1,
    verbose=1,
    refit=True,
    random_state=42
)

# Fit
rs.fit(X_train_scaled, y_train)

print("Best params:", rs.best_params_)
print("Best CV PR-AUC:", rs.best_score_)

### Model Selection and Hyperparameter Tuning for Fraud Detection

At the onset of this project, I used **LazyPredict** to run multiple algorithms on the dataset with default hyperparameters. The purpose of this was not to accept those results at face value, but to quickly summarize and compare which models showed initial promise. Interestingly, some models reported very high accuracies (around **0.95**).

However, in fraud detection, a high accuracy does **not** necessarily mean a good model. This is because fraudulent transactions form a very small minority (around 4% of the dataset). A model could achieve >95% accuracy by simply predicting **“non-fraud”** for almost everything. That is dangerous, because it means many fraudulent activities would be missed.
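
The point about misleading accuracy can be reproduced in a few lines. The sketch below scores a "model" that labels every transaction as non-fraud on synthetic labels with an assumed 4% fraud rate: accuracy looks excellent while recall is zero.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, recall_score)

# Synthetic labels: 40 fraud, 960 legitimate (assumed 4% rate)
y_true = np.array([1] * 40 + [0] * 960)

# A trivial "model" that predicts non-fraud for everything
always_legit = np.zeros_like(y_true)

acc = accuracy_score(y_true, always_legit)
rec = recall_score(y_true, always_legit, zero_division=0)
bal = balanced_accuracy_score(y_true, always_legit)
# Constant scores give a PR AUC equal to the fraud rate
ap = average_precision_score(y_true, np.zeros(len(y_true)))

print(f"accuracy={acc:.2f}, recall={rec:.2f}, balanced_acc={bal:.2f}, PR AUC={ap:.2f}")
```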

The real goal in fraud detection is not just to predict the majority class correctly, **but to force the model to pay more attention to the minority fraudulent class.** In other words, it is better for the model to sometimes flag a genuine transaction as fraudulent (false positive) than to wrongly classify an actual fraudulent transaction as genuine (false negative). For this reason, I moved to **class_weight="balanced"** in Logistic Regression, so that the algorithm could give more weight to fraud cases during training.

⸻

### Metrics Focus

Because of the imbalanced nature of the dataset, I evaluated models not just on plain accuracy but on multiple metrics that give a clearer picture:

	•	Accuracy: Overall correct predictions. In fraud analysis, this number can be misleading if used alone. As a rough rule of thumb, I treated 0.70–0.85 as a reasonable range (since forcing the model to detect fraud usually reduces accuracy).
		•	My result: 0.77 (within that range).

	•	Balanced Accuracy: Accounts for imbalance by averaging recall across classes; 0.50 is the random baseline. My target for a useful fraud model was above 0.60.
		•	My result: 0.63 (above the 0.60 target, showing the model is learning fraud patterns).

	•	Precision (fraud class): Of all predicted frauds, how many were actually fraud. Precision is usually low in fraud problems, often below 0.2, because the model prefers to “over-flag.”
		•	My result: 0.097 (low but acceptable in a fraud context, since recall is prioritized).

	•	Recall (fraud class): Of all actual frauds, how many were caught. This is the critical metric in fraud detection; I considered values around 0.40–0.60 realistic for first models.
		•	My result: 0.48 (good, the model catches nearly half of frauds).

	•	F1 Score: Harmonic mean of precision and recall. Expected to be low when fraud is rare, but still useful as a balance check.
		•	My result: 0.16 (low, but consistent with the recall–precision trade-off).

	•	ROC AUC: Measures the ability to rank frauds above non-frauds. The baseline is 0.50 (random). I treated values between 0.60 and 0.70 as acceptable for early fraud work.
		•	My result: 0.64 (the model is better than random and shows a signal).

	•	PR AUC: More honest for rare classes because it focuses on the precision–recall trade-off. The baseline equals the fraud rate (~0.04). I took anything above 0.07–0.08 as a sign the model is learning.
		•	My result: 0.078 (almost double the baseline, good progress).

	•	Classification Report: Gave a detailed breakdown for each class, confirming that the model sacrifices precision to improve recall, which is the safer option in fraud detection.
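
The reported numbers can be cross-checked against each other using the standard definitions: F1 is the harmonic mean of precision and recall, and balanced accuracy is the average of recall and specificity. The short sketch below plugs in the rounded results listed above, so the outputs are approximate.

```python
# Rounded results reported above
precision, recall, bal_acc = 0.097, 0.48, 0.63

# F1 = 2 * P * R / (P + R)
f1 = 2 * precision * recall / (precision + recall)

# balanced accuracy = (recall + specificity) / 2, solved for specificity
specificity = 2 * bal_acc - recall

print(f"F1 = {f1:.3f}")                    # about 0.161, matching the reported 0.16
print(f"specificity = {specificity:.2f}")  # about 0.78 of legitimate transactions pass
```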

⸻

### Summary

After comparing multiple models, I found that **Logistic Regression with class_weight="balanced"** was the best-performing and most interpretable model for this task. Hyperparameter tuning (specifically on the C parameter) further improved performance. The final model reached:

	•	Accuracy = 0.77
	•	Balanced Accuracy = 0.63
	•	Recall (fraud class) = 0.48
	•	ROC AUC = 0.64
	•	PR AUC = 0.078

These results are consistent with what is expected in fraud prediction:

	•	Not extremely high accuracy (because we forced it to detect fraud).
	•	Reasonable recall (almost half of frauds caught).
	•	PR AUC above the baseline fraud rate, showing the model has learned useful patterns.

⸻

This reasoning explains why Logistic Regression was chosen as the final model and why its metrics support using it for this fraud detection task.

## PIPELINE

In [None]:
from sklearn.pipeline import Pipeline as SkPipe
from sklearn.compose import ColumnTransformer

**Preprocess 👇**

### Preprocessing in one ColumnTransformer
I bundle the same steps as above (ordinal encoding, sentinel imputation, scaling) into one transformer so the exact preprocessing can be reapplied at deployment.

In [None]:
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols),
        ("imp_income", SimpleImputer(strategy="constant", fill_value=99999), ["customer_income_monthly"]),
        ("imp_avg30",  SimpleImputer(strategy="constant", fill_value=99999), ["avg_transaction_amount_30d"]),
        ("imp_trust",  SimpleImputer(strategy="constant", fill_value=-1),    ["device_trust_score"]),
        ("scale_num",  SkPipe([("scaler", StandardScaler())]),               scale_cols),
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

**Classifier 👇**

### Handle class imbalance
I set `class_weight='balanced'` so the model pays more attention to rare fraud cases.

In [None]:
BEST_C = 0.1907
clf = LogisticRegression(
    solver="lbfgs",
    penalty="l2",
    class_weight="balanced",
    C=BEST_C,
    max_iter=2000,
    n_jobs=-1
)

**Preprocess to model; fit & quick evaluation 👇**

### Train the model
I fit the chosen model/pipeline on the training data.

In [None]:
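# X_train/X_test were already imputed and encoded in place above, so I recreate the raw
# split (same random_state, same rows) to let the pipeline learn the original categories.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = SkPipe([("prep", preprocess), ("clf", clf)])
pipe.fit(X_train_raw, y_train)

proba = pipe.predict_proba(X_test_raw)[:, 1]
preds = (proba >= 0.50).astype(int)   # default; you can change later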

print("\n=== Logistic Regression Pipeline (t=0.50) ===")
print("Accuracy:", round(accuracy_score(y_test, preds), 4))
print("Balanced Acc:", round(balanced_accuracy_score(y_test, preds), 4))
print("Precision:", round(precision_score(y_test, preds, zero_division=0), 4))
print("Recall:", round(recall_score(y_test, preds, zero_division=0), 4))
print("F1:", round(f1_score(y_test, preds, zero_division=0), 4))
print("ROC AUC:", round(roc_auc_score(y_test, proba), 4))
print("PR  AUC:", round(average_precision_score(y_test, proba), 4))
print("\nReport:\n", classification_report(y_test, preds, digits=4))

### Threshold sweep
I scan several probability cutoffs and pick one that boosts recall at acceptable precision (I later settled around 0.45).

In [None]:
for t in [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]:
    p = (proba >= t).astype(int)
    print(f"t={t:.2f}  Prec={precision_score(y_test,p,zero_division=0):.3f}  "
          f"Rec={recall_score(y_test,p,zero_division=0):.3f}  "
          f"BalAcc={balanced_accuracy_score(y_test,p):.3f}")

**Saving the model + threshold 👇**


In [None]:
import pickle

### Saving pipeline in pkl for deployment

In [None]:
CHOSEN_THRESHOLD = 0.45   # I selected the best threshold from the threshold sweep result.

# combining pipeline + threshold together

artifacts = {
    "pipeline": pipe,
    "threshold": CHOSEN_THRESHOLD
}

ARTIFACT_PATH = "fraud_threshold.pkl"

with open(ARTIFACT_PATH, "wb") as f:
    pickle.dump(artifacts, f)

print(f"Saved {ARTIFACT_PATH} (pipeline + threshold together)")
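
At deployment time the saved dictionary has to be loaded and the stored threshold applied to the predicted probabilities. The sketch below shows that round trip with a toy pipeline and an in-memory pickle standing in for the file on disk. The helper `predict_fraud` and the toy data are assumptions for illustration, not the notebook's deployment code.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the real fraud pipeline
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)
toy_pipe = Pipeline(
    [("scaler", StandardScaler()), ("clf", LogisticRegression())]
).fit(X_demo, y_demo)

# In-memory stand-in for writing and reading the .pkl file
blob = pickle.dumps({"pipeline": toy_pipe, "threshold": 0.45})
artifacts = pickle.loads(blob)

def predict_fraud(artifacts, X_new):
    """Apply the stored threshold to the pipeline's fraud probabilities."""
    proba = artifacts["pipeline"].predict_proba(X_new)[:, 1]
    return (proba >= artifacts["threshold"]).astype(int)

flags = predict_fraud(artifacts, np.array([[3.0, 0.0], [-3.0, 0.0]]))
print(flags)  # [1 0]
```

Keeping the threshold next to the pipeline means the app cannot silently fall back to the default 0.5 cutoff of `predict`.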

## QUICK DRIFT TEST CHECK ON SOME IMPORTANT COLUMNS

In [None]:
from scipy.stats import ks_2samp

features_to_check = ["transaction_amount", "customer_income_monthly", "device_trust_score"]

for col in features_to_check:
    stat, p = ks_2samp(X_train[col].dropna(), X_test[col].dropna())
    print(f"{col} → KS test p-value: {p:.4f}")
    if p < 0.05:
        print("  ⚠️ Possible drift detected")
    else:
        print("  ✅ No significant drift")

    # plot histogram
    plt.figure(figsize=(6,3))
    plt.hist(X_train[col], bins=30, alpha=0.5, label='Train')
    plt.hist(X_test[col], bins=30, alpha=0.5, label='Test')
    plt.title(f"Distribution comparison for {col}")
    plt.legend()
    plt.show()