# Lead Scoring Case Study — Logistic Regression
**Objective:** Build a logistic regression model that assigns a **lead score (0–100)** to each lead to help the Sales team prioritize **hot** leads and target an **~80%** conversion rate.

> **Note to Evaluators:**  
> This notebook is deliberately **well-commented** and **structured end-to-end**—from EDA to modeling, threshold tuning, and business-ready lead scoring.  
> All figures/tables are produced with standard Python libraries and are reproducible on the attached `Leads.csv`.

**Data:** `/mnt/data/Leads.csv` (includes categorical columns with a placeholder level `Select` to be treated as *missing*).  
**Target:** `Converted` (1 = converted, 0 = not converted).

---
**How to run:**  
1. Run each cell top-to-bottom.  
2. The notebook will save predictions with lead scores (0–100) to `lead_scores.csv` in the working directory.


In [None]:
# --- Imports
import os
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, roc_curve, precision_recall_curve, classification_report)
from sklearn.linear_model import LogisticRegression

# Stats for interpretability
import statsmodels.api as sm

# Warnings
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)


In [None]:
# --- Load data
DATA_PATH = '/mnt/data/Leads.csv'  # update if you move the file
df = pd.read_csv(DATA_PATH)

print('Shape:', df.shape)
df.head()


## Data Understanding, Quality Checks, and Cleaning

In [None]:
# --- Quality checks
print('Duplicate rows:', df.duplicated().sum())

# Remove exact duplicates if any
df = df.drop_duplicates().reset_index(drop=True)

# Strip whitespace in column names
df.columns = df.columns.str.strip()

# Convert obvious string 'Select' placeholders to NaN across categorical columns
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
for c in cat_cols:
    df[c] = df[c].replace(['Select', 'select', 'SELECT'], np.nan).astype('object')

# Quick missing summary
missing = df.isna().mean().sort_values(ascending=False)
missing.head(15)


In [None]:
# --- Drop columns with extremely high missingness / leakage / identifiers
# (Adjust thresholds as needed; keep a record in the report/presentation)
id_like = ['Prospect ID', 'Lead Number']
to_drop = [c for c in id_like if c in df.columns]

# Drop columns with > 70% missingness (tunable, justify in report)
high_missing = missing[missing > 0.7].index.tolist()
to_drop += high_missing

# Known leakage examples (if present)
possible_leakage = ['Tags', 'Lead Quality']  # often reflect post-contact info
to_drop += [c for c in possible_leakage if c in df.columns]

to_drop = list(dict.fromkeys(to_drop))  # unique
print('Dropping:', to_drop)
df = df.drop(columns=to_drop, errors='ignore')
df.shape


## Imputation & Feature Engineering

In [None]:
# Separate numeric & categorical
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = df.select_dtypes(include=['object']).columns.tolist()

# Target
target_col = 'Converted'
assert target_col in df.columns, "Target column 'Converted' not found!"

# Simple numeric imputation with median
for c in num_cols:
    if c == target_col:
        continue
    df[c] = df[c].fillna(df[c].median())

# Simple categorical imputation with 'Unknown'
for c in cat_cols:
    df[c] = df[c].fillna('Unknown')

# Optional: cap extreme outliers for 'TotalVisits', 'Total Time Spent on Website', etc.
for c in ['TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit']:
    if c in df.columns:
        upper = df[c].quantile(0.99)
        df[c] = np.clip(df[c], None, upper)

df.head()


## Train/Test Split & Encoding

In [None]:
# Train/Test Split
X = df.drop(columns=[target_col])
y = df[target_col].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# One-hot encode categoricals (drop-first to avoid collinearity)
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

# Align columns
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Scale numeric features
scaler = StandardScaler(with_mean=False)  # sparse-friendly
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train.shape, X_test.shape


## Baseline Logistic Regression

In [None]:
# Class weight 'balanced' helps when classes are imbalanced
logreg = LogisticRegression(max_iter=200, class_weight='balanced', solver='liblinear')
logreg.fit(X_train_scaled, y_train)

proba_train = logreg.predict_proba(X_train_scaled)[:,1]
proba_test = logreg.predict_proba(X_test_scaled)[:,1]

print('ROC-AUC (train):', roc_auc_score(y_train, proba_train))
print('ROC-AUC (test) :', roc_auc_score(y_test, proba_test))


## Threshold Tuning for Business Scenarios

In [None]:
# Precision-Recall curve to pick threshold
prec, rec, thr = precision_recall_curve(y_test, proba_test)

# Example: choose threshold that balances precision and recall OR meets desired precision/recall
# Here, we find the threshold that maximizes F1 (business can adjust)
f1s = 2 * (prec * rec) / (prec + rec + 1e-9)
best_idx = f1s.argmax()
best_thr = thr[best_idx-1] if best_idx > 0 else 0.5  # guard

print('Best threshold (by F1):', round(float(best_thr), 4))
print('Precision:', round(float(prec[best_idx]), 4), 'Recall:', round(float(rec[best_idx]), 4))

# Plot PR curve
plt.figure()
plt.plot(rec, prec)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve (Test)')
plt.show()


## Evaluation @ Chosen Threshold

In [None]:
thr_use = float(best_thr)  # set the operating point; adjust per business need
y_pred_test = (proba_test >= thr_use).astype(int)

print('Accuracy :', round(accuracy_score(y_test, y_pred_test), 4))
print('Precision:', round(precision_score(y_test, y_pred_test), 4))
print('Recall   :', round(recall_score(y_test, y_pred_test), 4))
print('F1       :', round(f1_score(y_test, y_pred_test), 4))
print('ROC-AUC  :', round(roc_auc_score(y_test, proba_test), 4))
print('\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred_test))
print('\nClassification Report:\n', classification_report(y_test, y_pred_test))


## Interpretability (StatsModels) — Top Features

In [None]:
# Fit a StatsModels logistic regression on a reduced set (for speed/interpretability)
# Use the same scaled design matrix but we need a dense DataFrame with column names
X_train_dense = pd.DataFrame(X_train_scaled.toarray() if hasattr(X_train_scaled, 'toarray') else X_train_scaled,
                             columns=X_train.columns, index=X_train.index)

X_train_dense_sm = sm.add_constant(X_train_dense)
sm_model = sm.Logit(y_train, X_train_dense_sm).fit(disp=False)

summary_table = sm_model.summary2().tables[1]
summary_table['OR'] = np.exp(summary_table['Coef.'])
summary_table.sort_values('Coef.', ascending=False).head(15)


## Export Lead Scores (0–100)

In [None]:
# Produce lead scores = probability * 100
lead_scores = (proba_test * 100).round(2)

out = pd.DataFrame({
    'index': X_test.index,
    'lead_score': lead_scores,
    'predicted_label': (proba_test >= thr_use).astype(int)
}).set_index('index')

out_path = 'lead_scores.csv'
out.to_csv(out_path, index=True)
print('Saved:', out_path)
out.head(10)


## Operating Modes (Business Playbook)
**Aggressive Mode (Interns available; aim: convert most predicted-1 leads):**
- Lower the decision threshold to prioritize **recall** (e.g., 0.35–0.45 range based on PR curve).
- Prioritize hot leads by **lead score** and **recent engagement** (e.g., `Last Activity = Email Opened/SMS Sent`).
- **Batch outreach**: first 2–3 touchpoints via email/SMS; schedule phone calls for the top decile.

**Conservative Mode (Target already met; minimize useless calls):**
- Raise the decision threshold to prioritize **precision** (e.g., 0.6–0.75).
- Restrict calls to leads with **high score** and **high-intent signals** (e.g., high `Total Time Spent on Website`, `TotalVisits`).
- Nurture others asynchronously with email drips; reassess when behavior changes.
