# Modeling — Ad Fraud Detection

**Goal:** Train baseline fraud models on daily click features built from `clean.clicks` and `clean.ad_performance`.

**We’ll do:**
1) Connect to PostgreSQL and sanity-check the `clean` tables
2) Build daily click features and attach a **label** from `ad_performance.fraud`, matched per `ad_id` on the nearest date
3) Split train/test (stratified 75/25)
4) **Supervised baselines:** LogisticRegression and RandomForest (balanced class weights)
5) **Unsupervised baseline:** IsolationForest (rank anomalies)
6) Evaluate with ROC AUC, PR AUC, a PR-curve threshold, and permutation importance
7) Save artifacts (models, metrics JSON, and `bi` tables)

In [22]:
# 📦 Import core libraries
import os
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine, text

# 📂 Load environment variables (DB credentials)
load_dotenv()

DB_USER = os.getenv("DB_USER") or os.getenv("PGUSER")
DB_PASSWORD = os.getenv("DB_PASSWORD", "")
DB_HOST = os.getenv("DB_HOST") or os.getenv("PGHOST", "localhost")
DB_PORT = os.getenv("DB_PORT") or os.getenv("PGPORT", "5432")
DB_NAME = os.getenv("DB_NAME") or os.getenv("PGDATABASE")

# Create engine for PostgreSQL connection
if DB_PASSWORD:
    DATABASE_URL = f"postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
else:
    DATABASE_URL = f"postgresql+psycopg2://{DB_USER}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

engine = create_engine(DATABASE_URL)

# Test connection
with engine.connect() as conn:
    version = conn.execute(text("SELECT version()")).scalar()
print("✅ Connected to:", version)


✅ Connected to: PostgreSQL 17.5 (Postgres.app) on aarch64-apple-darwin23.6.0, compiled by Apple clang version 15.0.0 (clang-1500.3.9.4), 64-bit
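
The connection string above is assembled with an f-string, which can break when a password contains characters such as `@`, `:` or `/`. A minimal sketch of an alternative, assuming SQLAlchemy 1.4 or later: `URL.create` takes the parts separately and escapes them when rendering. The credentials, host, and database name below are hypothetical placeholders, not values from this project.

```python
from sqlalchemy.engine import URL, make_url

# Hypothetical credentials, for illustration only
url = URL.create(
    drivername="postgresql+psycopg2",
    username="analyst",
    password="p@ss:w/rd",  # characters that would confuse a hand-built URL
    host="localhost",
    port=5432,
    database="adfraud",
)

# Rendering escapes the password; parsing the rendered string recovers it
rendered = url.render_as_string(hide_password=False)
round_trip = make_url(rendered)

# Safe form for logs: the password is masked
print(url.render_as_string(hide_password=True))
```

The `url` object can be passed to `create_engine(url)` in place of the string.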


In [23]:
# Quick sanity check to avoid empty joins later
print("== clean.clicks ==")
print(pd.read_sql("SELECT COUNT(*) AS n, MIN(click_time) min_t, MAX(click_time) max_t FROM clean.clicks", engine))

print("\n== clean.ad_performance ==")
print(pd.read_sql("SELECT COUNT(*) AS n, MIN(date) min_d, MAX(date) max_d FROM clean.ad_performance", engine))

print("\n== clean.connections ==")
print(pd.read_sql("SELECT COUNT(*) AS n FROM clean.connections", engine))

print("\n== Example clicks ==")
print(pd.read_sql("SELECT ad_id, click_time FROM clean.clicks ORDER BY click_time DESC LIMIT 5", engine))

print("\n== Example perf ==")
print(pd.read_sql("SELECT ad_id, date, clicks, conversions FROM clean.ad_performance ORDER BY date DESC LIMIT 5", engine))


== clean.clicks ==
        n                      min_t                      max_t
0  520000 2015-08-13 01:48:29.530750 2025-08-17 02:03:05.037790

== clean.ad_performance ==
      n       min_d       max_d
0  1200  2025-07-08  2025-08-10

== clean.connections ==
      n
0  2000

== Example clicks ==
   ad_id                 click_time
0  AD005 2025-08-17 02:03:05.037790
1  AD002 2025-08-17 02:03:05.037790
2  AD003 2025-08-17 02:03:05.037790
3  AD003 2025-08-17 02:03:05.037790
4  AD008 2025-08-17 02:03:05.037790

== Example perf ==
   ad_id        date  clicks  conversions
0  AD004  2025-08-10     315          211
1  AD002  2025-08-10    2932         1012
2  AD001  2025-08-10     666           94
3  AD003  2025-08-10    1173         1018
4  AD005  2025-08-10     113           10


In [25]:
# Daily click counts per ad, LEFT JOINed to ad_performance on the exact date
query = """
WITH clicks_day AS (
  SELECT 
    ad_id,
    (click_time::date) AS as_of_date,
    COUNT(*) AS clicks_day
  FROM clean.clicks
  GROUP BY ad_id, as_of_date
)
SELECT 
  cd.ad_id,
  cd.as_of_date,
  cd.clicks_day,
  ap.impressions,
  ap.clicks AS perf_clicks,
  ap.conversions,
  ap.ctr,
  ap.conversion_rate,
  ap.bounce_rate
FROM clicks_day cd
LEFT JOIN clean.ad_performance ap
  ON ap.ad_id = cd.ad_id
 AND ap.date  = cd.as_of_date
LIMIT 5000;
"""
df = pd.read_sql(query, engine)
print("✅ Rows fetched:", df.shape[0])
df.head()


✅ Rows fetched: 60


Unnamed: 0,ad_id,as_of_date,clicks_day,impressions,perf_clicks,conversions,ctr,conversion_rate,bounce_rate
0,AD001,2017-11-07,16682,,,,,,
1,AD002,2025-08-17,320,,,,,,
2,AD001,2017-11-09,14761,,,,,,
3,AD003,2017-11-09,14826,,,,,,
4,AD010,2017-11-06,2607,,,,,,
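
The performance columns come back empty in the rows shown because an exact-date join needs the two tables to share dates, and the sanity check shows `clean.ad_performance` covering only 2025-07-08 to 2025-08-10 while the click dates shown fall outside that window. A minimal sketch of how the match rate could be measured in pandas with `indicator=True`; the two frames are toy stand-ins, not real rows from the database.

```python
import pandas as pd

# Toy stand-ins for the two SQL tables (illustrative values only)
clicks_day = pd.DataFrame({
    "ad_id": ["AD001", "AD002", "AD003"],
    "as_of_date": pd.to_datetime(["2017-11-07", "2025-08-17", "2025-08-10"]),
})
perf = pd.DataFrame({
    "ad_id": ["AD001", "AD002", "AD003"],
    "date": pd.to_datetime(["2025-08-10", "2025-08-10", "2025-08-10"]),
    "impressions": [2405, 4026, 4092],
})

# indicator=True adds a _merge column: "both" means the exact-date join matched
checked = clicks_day.merge(
    perf,
    left_on=["ad_id", "as_of_date"],
    right_on=["ad_id", "date"],
    how="left",
    indicator=True,
)
match_rate = (checked["_merge"] == "both").mean()
print(f"Rows with a performance match: {match_rate:.0%}")
```

A low match rate is a signal to revisit the join key before modeling, rather than filling the missing features with zeros.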


In [31]:
nearest_sql = """
WITH clicks_day AS (
  SELECT ad_id, (click_time::date) AS as_of_date, COUNT(*) AS clicks_day
  FROM clean.clicks
  GROUP BY ad_id, as_of_date
),
joined AS (
  SELECT
    cd.*,
    ap.impressions, ap.clicks AS perf_clicks, ap.conversions, ap.ctr, ap.conversion_rate, ap.bounce_rate,
    ap.fraud,
    ROW_NUMBER() OVER (
      PARTITION BY cd.ad_id, cd.as_of_date
      ORDER BY ABS(ap.date - cd.as_of_date)
    ) AS rn
  FROM clicks_day cd
  LEFT JOIN clean.ad_performance ap
    ON ap.ad_id = cd.ad_id
)
SELECT *
FROM joined
WHERE rn = 1
LIMIT 50000;  -- keep it reasonable for a laptop
"""
df = pd.read_sql(nearest_sql, engine)
print("Rows:", len(df))
df.head()


Rows: 60


Unnamed: 0,ad_id,as_of_date,clicks_day,impressions,perf_clicks,conversions,ctr,conversion_rate,bounce_rate,fraud,rn
0,AD001,2015-08-13,161,2405,1358,243,0.5647,0.1789,0.117,True,1
1,AD001,2017-11-06,2468,2405,1358,243,0.5647,0.1789,0.117,True,1
2,AD001,2017-11-07,16682,2405,1358,243,0.5647,0.1789,0.117,True,1
3,AD001,2017-11-08,17369,2405,1358,243,0.5647,0.1789,0.117,True,1
4,AD001,2017-11-09,14761,2405,1358,243,0.5647,0.1789,0.117,True,1
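
The nearest-date join above attaches a performance row no matter how far away it is in time, so a 2015 click day can inherit a 2025 label. A minimal sketch of the same idea in pandas with a cap on the gap, using `pd.merge_asof` with `direction="nearest"` and a `tolerance`. The 7-day tolerance and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for clicks_day and clean.ad_performance (illustrative values)
clicks_day = pd.DataFrame({
    "ad_id": ["AD001", "AD001", "AD002"],
    "as_of_date": pd.to_datetime(["2017-11-07", "2025-08-12", "2025-08-14"]),
    "clicks_day": [16682, 300, 320],
})
perf = pd.DataFrame({
    "ad_id": ["AD001", "AD001", "AD002"],
    "date": pd.to_datetime(["2025-08-09", "2025-08-10", "2025-08-10"]),
    "fraud": [False, True, False],
})

# merge_asof requires both frames sorted by their date key
nearest = pd.merge_asof(
    clicks_day.sort_values("as_of_date"),
    perf.sort_values("date"),
    left_on="as_of_date",
    right_on="date",
    by="ad_id",                      # match within the same ad only
    direction="nearest",             # closest date before or after
    tolerance=pd.Timedelta(days=7),  # assumed cap; farther matches become NaN
)
nearest["gap_days"] = (nearest["as_of_date"] - nearest["date"]).abs().dt.days
matched = nearest.dropna(subset=["date"])
print(matched[["ad_id", "as_of_date", "date", "gap_days", "fraud"]])
```

Rows left unmatched can then be dropped or treated as unlabeled instead of receiving a label from years away.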


In [32]:
import numpy as np

# 6.1) Create / validate the binary label
# Prefer the 'fraud' column when df has it (it comes from ad_performance.fraud).
if "label" not in df.columns:
    if "fraud" in df.columns:
        df["label"] = df["fraud"].fillna(0).astype(int)
    else:
        # Simple, explainable fallback when 'fraud' is unavailable:
        # "suspect" if many clicks and 0 conversions
        df["label"] = ((df.get("clicks_day", 0) >= 50) & (df.get("conversions", 0) == 0)).astype(int)

# 6.2) Candidate numeric columns (keep only those present)
candidate_num = [
    "clicks_day", "impressions", "perf_clicks", "conversions",
    "ctr", "conversion_rate", "bounce_rate"
]
num_cols = [c for c in candidate_num if c in df.columns]

# 6.3) Categorical: use the normalized ad category
# (if it is not there yet, fetch it from clean.ads)
# Note: this merge yields one row per matching clean.ads row, which is why
# the row count printed below grows from 60 to 240.
if "category_norm" not in df.columns:
    ads = pd.read_sql("SELECT ad_id, category FROM clean.ads;", engine)
    ads["category_norm"] = (
        ads["category"].fillna("unknown").astype(str).str.strip()
           .str.lower().str.replace(r"\s+", " ", regex=True)
    )
    df = df.merge(ads[["ad_id", "category_norm"]], on="ad_id", how="left")

df["category_norm"] = df["category_norm"].fillna("unknown").astype(str)
cat_cols = ["category_norm"]

# 6.4) NA cleanup
for c in num_cols:
    df[c] = df[c].astype(float).fillna(0.0)

df = df.dropna(subset=["label"])
df["label"] = df["label"].astype(int)

# 6.5) Meta columns (optional)
meta_cols = [c for c in ["as_of_date","ad_id","ip_address"] if c in df.columns]

print("✅ Label distribution -> positives:", int(df["label"].sum()), "/", len(df))
print("Numeric features:", num_cols)
print("Categorical:", cat_cols)
df[ (meta_cols + ["label"] + num_cols + cat_cols) ].head(3)


✅ Label distribution -> positives: 64 / 240
Numeric features: ['clicks_day', 'impressions', 'perf_clicks', 'conversions', 'ctr', 'conversion_rate', 'bounce_rate']
Categorical: ['category_norm']


Unnamed: 0,as_of_date,ad_id,label,clicks_day,impressions,perf_clicks,conversions,ctr,conversion_rate,bounce_rate,category_norm
0,2015-08-13,AD001,1,161.0,2405.0,1358.0,243.0,0.5647,0.1789,0.117,tech
1,2015-08-13,AD001,1,161.0,2405.0,1358.0,243.0,0.5647,0.1789,0.117,finance
2,2015-08-13,AD001,1,161.0,2405.0,1358.0,243.0,0.5647,0.1789,0.117,finance
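
The first three rows above are identical except for `category_norm`, which suggests `clean.ads` holds several rows per `ad_id` and the merge multiplied the feature rows. A minimal sketch of one way to guard against that: deduplicate the lookup table first and let pandas enforce the relationship with `validate="many_to_one"`. Keeping the first category per ad is an arbitrary assumption here, and the frames are toy data.

```python
import pandas as pd

# Toy feature rows and a lookup table with a duplicated ad_id (illustrative)
features = pd.DataFrame({"ad_id": ["AD001", "AD002"], "clicks_day": [161, 320]})
ads = pd.DataFrame({
    "ad_id": ["AD001", "AD001", "AD002"],
    "category": ["Tech", " finance ", "Retail"],
})
ads["category_norm"] = ads["category"].fillna("unknown").str.strip().str.lower()

# Naive merge: AD001 appears twice in ads, so its feature row is duplicated
fanned_out = features.merge(ads[["ad_id", "category_norm"]], on="ad_id", how="left")

# Guarded merge: one row per ad_id in the lookup; validate raises if not unique
ads_unique = ads.drop_duplicates(subset="ad_id", keep="first")
safe = features.merge(
    ads_unique[["ad_id", "category_norm"]],
    on="ad_id",
    how="left",
    validate="many_to_one",
)
print(len(features), len(fanned_out), len(safe))
```

With `validate="many_to_one"`, a non-unique lookup fails loudly with a `MergeError` instead of silently inflating the dataset.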


In [33]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 7.1) Dataset X / y
feature_cols = num_cols + cat_cols
X = df[feature_cols].copy()
y = df["label"].values

# 7.2) Preprocessor: one-hot encode the category, pass numeric columns through
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False
)

# 7.3) Two pipelines: LogReg (baseline) and RandomForest (non-linear)
logreg_clf = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=200, class_weight="balanced"))
])

rf_clf = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=200,
        max_depth=None,
        n_jobs=-1,
        class_weight="balanced_subsample",
        random_state=42
    ))
])

# 7.4) Split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("Train size:", X_train.shape, " Test size:", X_test.shape)


Train size: (180, 8)  Test size: (60, 8)
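
The split above is a stratified random split, so rows from the same day can land in both train and test. A minimal sketch of a time-aware alternative that holds out the most recent days as validation; the 7-day window (`VALID_DAYS`) and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one row per day of interest (illustrative values)
toy = pd.DataFrame({
    "as_of_date": pd.to_datetime([
        "2025-08-01", "2025-08-05", "2025-08-09",
        "2025-08-12", "2025-08-15", "2025-08-17",
    ]),
    "clicks_day": [120, 340, 95, 410, 288, 327],
    "label": [0, 1, 0, 0, 1, 0],
})

VALID_DAYS = 7  # assumed validation window
cutoff = toy["as_of_date"].max() - pd.Timedelta(days=VALID_DAYS)

# Everything up to the cutoff trains; everything after it validates
train = toy[toy["as_of_date"] <= cutoff]
valid = toy[toy["as_of_date"] > cutoff]

print("Cutoff:", cutoff.date(), "| train rows:", len(train), "| valid rows:", len(valid))
```

Before trusting metrics from such a split, it is worth checking that both classes appear in the validation window.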


In [34]:
from sklearn.metrics import roc_auc_score, average_precision_score, classification_report, confusion_matrix

def evaluate_model(name, model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
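    # Positive-class probability (or a raw score when predict_proba is unavailable)
    if hasattr(model.named_steps["clf"], "predict_proba"):
        y_proba = model.predict_proba(X_te)[:, 1]
        y_pred = (y_proba >= 0.5).astype(int)
    else:
        # Fallback for models without predict_proba: decision_function scores are
        # not probabilities, so take hard labels from predict() instead of a 0.5 cut
        y_proba = model.decision_function(X_te) if hasattr(model, "decision_function") else model.predict(X_te)
        y_pred = model.predict(X_te)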

    metrics = {
        "roc_auc": roc_auc_score(y_te, y_proba) if len(np.unique(y_te))>1 else None,
        "pr_auc": average_precision_score(y_te, y_proba) if len(np.unique(y_te))>1 else None,
        "report": classification_report(y_te, y_pred, output_dict=True),
        "confusion_matrix": confusion_matrix(y_te, y_pred).tolist()
    }
    print(f"\n=== {name} ===")
    print("ROC-AUC:", metrics["roc_auc"])
    print("PR-AUC :", metrics["pr_auc"])
    print(classification_report(y_te, y_pred))
    print("Confusion matrix:\n", metrics["confusion_matrix"])
    return metrics, y_proba, y_pred

metrics_logreg, proba_logreg, pred_logreg = evaluate_model("LogisticRegression", logreg_clf, X_train, y_train, X_test, y_test)
metrics_rf,     proba_rf,     pred_rf     = evaluate_model("RandomForest",       rf_clf,     X_train, y_train, X_test, y_test)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



=== LogisticRegression ===
ROC-AUC: 0.8082386363636365
PR-AUC : 0.48108477227163293
              precision    recall  f1-score   support

           0       0.94      0.75      0.84        44
           1       0.56      0.88      0.68        16

    accuracy                           0.78        60
   macro avg       0.75      0.81      0.76        60
weighted avg       0.84      0.78      0.79        60

Confusion matrix:
 [[33, 11], [2, 14]]

=== RandomForest ===
ROC-AUC: 1.0
PR-AUC : 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        44
           1       1.00      1.00      1.00        16

    accuracy                           1.00        60
   macro avg       1.00      1.00      1.00        60
weighted avg       1.00      1.00      1.00        60

Confusion matrix:
 [[44, 0], [0, 16]]
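
The convergence warning printed for LogisticRegression suggests scaling the data. A minimal sketch of a variant pipeline that standardizes the numeric columns and raises `max_iter`; the synthetic data, the column subset, and the names `scaled_prep` and `scaled_logreg` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with features on very different scales
rng = np.random.default_rng(42)
n = 200
toy_X = pd.DataFrame({
    "clicks_day": rng.integers(50, 20000, size=n).astype(float),
    "ctr": rng.random(n),
    "category_norm": rng.choice(["tech", "finance"], size=n),
})
toy_y = (toy_X["ctr"] > 0.6).astype(int).to_numpy()

# Scale numeric columns, one-hot encode the categorical one
scaled_prep = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["clicks_day", "ctr"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category_norm"]),
])
scaled_logreg = Pipeline(steps=[
    ("prep", scaled_prep),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

scaled_logreg.fit(toy_X, toy_y)
toy_proba = scaled_logreg.predict_proba(toy_X)[:, 1]
print("Iterations used:", scaled_logreg.named_steps["clf"].n_iter_[0])
```

Tree-based models such as RandomForest do not need the scaling step, so it can stay specific to the linear pipeline.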


In [35]:
import json, joblib
from pathlib import Path

Path("models").mkdir(exist_ok=True)
Path("reports").mkdir(exist_ok=True)

joblib.dump(logreg_clf, "models/logreg_pipeline.joblib")
joblib.dump(rf_clf,     "models/rf_pipeline.joblib")

with open("reports/logreg_metrics.json", "w") as f:
    json.dump(metrics_logreg, f, indent=2)
with open("reports/rf_metrics.json", "w") as f:
    json.dump(metrics_rf, f, indent=2)

print("✅ Saved models to models/*.joblib and metrics to reports/*.json")


✅ Saved models to models/*.joblib and metrics to reports/*.json


In [40]:
# Permutation importance on the SAME input columns used to train the pipeline
from sklearn.inspection import permutation_importance
import numpy as np
import pandas as pd

# use the same features you trained on
feature_cols = num_cols + cat_cols

# sample to keep it light
rng = np.random.RandomState(42)
sample_idx = rng.choice(len(X_test), size=min(2000, len(X_test)), replace=False)
X_te_sample = X_test.iloc[sample_idx]
y_te_sample = y_test[sample_idx]

rf_clf.fit(X_train, y_train)

result = permutation_importance(
    rf_clf, X_te_sample, y_te_sample,
    n_repeats=10, random_state=42, n_jobs=-1
)

imp = pd.DataFrame({
    "feature": feature_cols,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std
}).sort_values("importance_mean", ascending=False)

imp.head(20)


Unnamed: 0,feature,importance_mean,importance_std
1,impressions,0.055,0.029861
6,bounce_rate,0.015,0.008975
4,ctr,0.01,0.008165
0,clicks_day,0.0,0.0
3,conversions,0.0,0.0
2,perf_clicks,0.0,0.0
5,conversion_rate,0.0,0.0
7,category_norm,0.0,0.0


In [41]:
from sklearn.ensemble import IsolationForest

# Use only the numeric features (no NAs)
X_num = df[num_cols].fillna(0.0).astype(float)

iso = IsolationForest(
    n_estimators=200, contamination=0.02, random_state=42, n_jobs=-1
)
iso.fit(X_num)
anomaly_score = -iso.decision_function(X_num)  # larger => more anomalous
is_outlier = iso.predict(X_num) == -1

anoms = df[meta_cols + num_cols].copy() if meta_cols else df[num_cols].copy()
anoms["anomaly_score"] = anomaly_score
anoms["is_outlier"] = is_outlier
anoms_sorted = anoms.sort_values("anomaly_score", ascending=False).head(20)

print("Top anomalies:")
anoms_sorted.head(10)

Top anomalies:


Unnamed: 0,as_of_date,ad_id,clicks_day,impressions,perf_clicks,conversions,ctr,conversion_rate,bounce_rate,anomaly_score,is_outlier
212,2025-08-17,AD009,327.0,4026.0,3080.0,3067.0,0.765,0.9958,0.9172,0.044317,True
213,2025-08-17,AD009,327.0,4026.0,3080.0,3067.0,0.765,0.9958,0.9172,0.044317,True
214,2025-08-17,AD009,327.0,4026.0,3080.0,3067.0,0.765,0.9958,0.9172,0.044317,True
215,2025-08-17,AD009,327.0,4026.0,3080.0,3067.0,0.765,0.9958,0.9172,0.044317,True
20,2025-08-17,AD001,303.0,4621.0,2457.0,1965.0,0.5317,0.7998,0.6118,-0.0,False
21,2025-08-17,AD001,303.0,4621.0,2457.0,1965.0,0.5317,0.7998,0.6118,-0.0,False
22,2025-08-17,AD001,303.0,4621.0,2457.0,1965.0,0.5317,0.7998,0.6118,-0.0,False
23,2025-08-17,AD001,303.0,4621.0,2457.0,1965.0,0.5317,0.7998,0.6118,-0.0,False
188,2025-08-17,AD008,297.0,4092.0,2951.0,2860.0,0.7212,0.9692,0.2581,-0.008061,False
189,2025-08-17,AD008,297.0,4092.0,2951.0,2860.0,0.7212,0.9692,0.2581,-0.008061,False
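
The cell above fits and scores IsolationForest on the same rows. A minimal sketch of the train/validation variant, where the forest is fitted on training rows only and then ranks unseen rows; the synthetic Gaussian data and the planted outlier are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: training rows, plus validation rows with one planted outlier
rng = np.random.default_rng(42)
X_train_toy = rng.normal(0.0, 1.0, size=(200, 2))
X_valid_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), [[8.0, 8.0]]])

# Fit on training rows only
iso_toy = IsolationForest(n_estimators=200, contamination=0.02, random_state=42)
iso_toy.fit(X_train_toy)

# Score validation rows; negate so that larger means more anomalous
valid_scores = -iso_toy.decision_function(X_valid_toy)
ranking = np.argsort(-valid_scores)  # validation row indices, most anomalous first

print("Planted outlier index:", len(X_valid_toy) - 1)
print("Top 5 ranked indices:", ranking[:5].tolist())
```

Fitting on training rows only keeps the anomaly ranking comparable with the supervised evaluation, since neither model has seen the validation rows.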


In [42]:
# Store the test results (RandomForest) for Looker/Tableau
pred_frame = df.loc[X_test.index, meta_cols].copy() if meta_cols else pd.DataFrame(index=X_test.index)
pred_frame["label_true"] = y_test
pred_frame["proba_rf"]   = proba_rf
pred_frame["pred_rf"]    = pred_rf

# Save to the DB (bi schema)
pred_frame.to_sql("model_predictions", engine, schema="bi", if_exists="replace", index=False)
print("✅ Saved predictions to bi.model_predictions")


✅ Saved predictions to bi.model_predictions


In [44]:
from sklearn.metrics import precision_recall_curve
import numpy as np

p, r, t = precision_recall_curve(y_test, proba_rf)

# Align arrays: thresholds correspond to p[:-1], r[:-1]
p_t = p[:-1]
r_t = r[:-1]
t_t = t

# Target recall ≥ 0.80 (change if you want)
TARGET_RECALL = 0.80
mask = r_t >= TARGET_RECALL

if mask.any():
    # Among points meeting recall target, pick the one with best precision
    best_local_idx = np.argmax(p_t[mask])
    # Map back to global index
    global_idx = np.flatnonzero(mask)[best_local_idx]
    best_thr = t_t[global_idx]
    print(f"Chosen threshold: {best_thr:.6f} | Precision: {p_t[global_idx]:.3f} | Recall: {r_t[global_idx]:.3f}")
else:
    # Fallback: maximize F1 across all thresholds
    f1 = 2 * p_t * r_t / (p_t + r_t + 1e-9)
    global_idx = np.nanargmax(f1)
    best_thr = t_t[global_idx]
    print(
        "No point reached the target recall; using F1-optimal threshold instead.\n"
        f"Chosen threshold: {best_thr:.6f} | Precision: {p_t[global_idx]:.3f} | Recall: {r_t[global_idx]:.3f} | F1: {f1[global_idx]:.3f}"
    )

# If you want predictions at that threshold:
y_pred_best = (proba_rf >= best_thr).astype(int)


Chosen threshold: 0.885000 | Precision: 1.000 | Recall: 1.000
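
A single threshold is one way to act on scores. When only a fixed number of cases can be reviewed, precision and recall among the K highest-scored rows may be more informative. A minimal sketch with a hypothetical helper `precision_recall_at_k` and toy arrays (neither comes from the notebook):

```python
import numpy as np

def precision_recall_at_k(y_true, scores, k):
    """Precision and recall among the k highest-scored rows (hypothetical helper)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = min(k, len(scores))
    # Indices of the k largest scores; stable sort keeps tie order deterministic
    top_idx = np.argsort(-scores, kind="stable")[:k]
    hits = int(y_true[top_idx].sum())
    total_pos = int(y_true.sum())
    precision = hits / k if k else 0.0
    recall = hits / total_pos if total_pos else 0.0
    return precision, recall

# Toy labels and scores (illustrative only)
y_true_toy = [1, 0, 1, 0, 0, 1]
scores_toy = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

p_at_3, r_at_3 = precision_recall_at_k(y_true_toy, scores_toy, k=3)
print(f"precision@3={p_at_3:.3f} recall@3={r_at_3:.3f}")
```

In the notebook this could be called with `y_test` and `proba_rf`, with K set to the review capacity.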


In [45]:
from sklearn.calibration import CalibratedClassifierCV
rf_cal = CalibratedClassifierCV(rf_clf, method="isotonic", cv=3)
rf_cal.fit(X_train, y_train)
proba_rf_cal = rf_cal.predict_proba(X_test)[:,1]


In [46]:
from sklearn.model_selection import RandomizedSearchCV
params = {
  "clf__n_estimators": [200, 300, 500],
  "clf__max_depth": [None, 10, 20],
  "clf__min_samples_split": [2, 5, 10],
  "clf__min_samples_leaf": [1, 2, 4]
}
search = RandomizedSearchCV(rf_clf, params, n_iter=12, scoring="average_precision", n_jobs=-1, cv=3, random_state=42)
search.fit(X_train, y_train)
best_model = search.best_estimator_


In [47]:
anoms_sorted.to_sql("anomalies_isoforest", engine, schema="bi", if_exists="replace", index=False)


20

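## ✅ Baselines complete

**Supervised (LogisticRegression, RandomForest):**
- Stratified random 75/25 split (not time-aware yet)
- Balanced class weights
- Metrics: ROC AUC, PR AUC, threshold selection via PR curve
- Permutation importance for interpretability
- Artifacts saved (models, metrics, predictions)
- The perfect RandomForest scores (ROC AUC 1.0) are suspect: the category merge grew the data from 60 to 240 rows, so near-identical rows likely sit on both sides of the split

**Unsupervised (IsolationForest):**
- Trained on all rows, numeric features only (`contamination=0.02`)
- Ranked anomalies by score; top 20 saved to `bi.anomalies_isoforest`
- Useful when labels are scarce/noisy

**Next ideas:**
- Switch to a time-aware split (e.g., last 7 days as validation)
- Deduplicate `clean.ads` before merging the category
- Calibrate threshold for business target (e.g., 90% precision)
- Add **graph features** (e.g., IP ↔ ad community metrics)
- Add **IP reputation / geo** enrichment
- Try **LightGBM** and **XGBoost** and compare with the RandomForest baseline
- Add **SHAP** for feature attribution (if needed)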
