# LightGBM Ads Tutorial

End-to-end tutorial using synthetic keyword-ads performance data.

**Models covered:**
1. CTR prediction (regression)
2. Conversion prediction (binary classification)
3. Keyword ranking function (score-based)
4. Learning-to-Rank with LambdaMART (group-split version)
5. Feature importance

## 0) Install dependencies

In [None]:
# !pip install lightgbm scikit-learn pandas numpy

## 1) Create synthetic ads dataset

Each row represents one `(given_word, keyword)` pair with features:
- `similarity` – cosine-like similarity between the two words
- `competition`, `impressions`, `clicks`, `cpc`, `cost`, `device`, `hour`

Targets:
- `ctr` – click-through rate (regression)
- `has_conversion` – did it convert at least once? (binary classification)

In [None]:
from src.dataset import make_ads_dataset

df = make_ads_dataset(n=100_000)
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
df.describe()
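
The generator itself lives in `src.dataset` and is not shown in this notebook. As a rough illustration of the column layout described above, a toy generator might look like the sketch below. The function name `make_toy_ads`, the click and conversion probabilities, and the word lists are all assumptions made for this example; the real `make_ads_dataset` may work differently.

```python
import numpy as np
import pandas as pd


def make_toy_ads(n: int = 1_000, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in for make_ads_dataset; the real generator may differ."""
    rng = np.random.default_rng(seed)
    given_words = ["sneakers", "laptop", "headphones", "yoga"]
    keywords = ["white sneakers", "laptop bag", "wireless earbuds", "yoga mat"]

    df = pd.DataFrame({
        "given_word": rng.choice(given_words, size=n),
        "keyword": rng.choice(keywords, size=n),
        "similarity": rng.uniform(0.0, 1.0, size=n),
        "competition": rng.uniform(0.0, 1.0, size=n),
        "impressions": rng.integers(100, 10_000, size=n),
        "cpc": rng.uniform(0.1, 5.0, size=n),
        "device": rng.choice(["mobile", "desktop", "tablet"], size=n),
        "hour": rng.integers(0, 24, size=n),
    })

    # Assumed click model: click probability rises linearly with similarity.
    p_click = (0.01 + 0.09 * df["similarity"]).to_numpy()
    df["clicks"] = rng.binomial(df["impressions"].to_numpy(), p_click)
    df["cost"] = df["clicks"] * df["cpc"]
    df["ctr"] = df["clicks"] / df["impressions"]

    # Assumed conversion model: each click converts with a fixed 5% chance.
    conversions = rng.binomial(df["clicks"].to_numpy(), 0.05)
    df["has_conversion"] = (conversions > 0).astype(int)
    return df


toy = make_toy_ads()
print(toy.head())
```

Deriving `ctr` from simulated `clicks` and `impressions` keeps the target inside [0, 1] by construction.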

## 2) Prepare features

LightGBM handles categorical features natively when the columns use the pandas `category` dtype, so no one-hot encoding is needed.
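
As a small illustration of the dtype conversion, the sketch below casts one column on a toy frame. The `cat_cols` list is an assumption for this example; the tutorial's own `CAT_COLS` may contain other columns.

```python
import pandas as pd

# Toy frame; "device" is the only categorical column in this sketch.
X = pd.DataFrame({
    "similarity": [0.9, 0.4, 0.7],
    "device": ["mobile", "desktop", "mobile"],
    "hour": [21, 9, 14],
})

cat_cols = ["device"]  # assumed list; the tutorial's CAT_COLS may differ
for col in cat_cols:
    X[col] = X[col].astype("category")

print(X.dtypes)
print(X["device"].cat.categories.tolist())
```

With the default `categorical_feature="auto"`, LightGBM picks up such columns from a pandas DataFrame without further configuration.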

In [None]:
from src.models import prepare_features

X_train, X_test, y_ctr_train, y_ctr_test, y_conv_train, y_conv_test = prepare_features(df)

print(f"Train: {X_train.shape}  |  Test: {X_test.shape}")
print(f"Conversion rate (train): {y_conv_train.mean():.3f}")

## 3) Model A — CTR prediction (regression)

CTR is a continuous value bounded in [0, 1]. Each sample is weighted by `impressions`, so high-volume rows, whose observed CTR is less noisy, have more influence on the fit.

In [None]:
from src.models import train_ctr_model

reg = train_ctr_model(X_train, X_test, y_ctr_train, y_ctr_test)

## 4) Model B — Conversion prediction (binary classification)

In [None]:
from src.models import train_conversion_model

clf = train_conversion_model(X_train, X_test, y_conv_train, y_conv_test)

## 5) Rank keywords for a given word

For a new `given_word`, score a list of candidate keywords using:
- `pred_ctr` from the regression model
- `pred_conv_prob` from the classifier
- `score = pred_ctr × pred_conv_prob` (customize to ROAS, profit, etc.)
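
The scoring rule above can be sketched on a few made-up predictions. The numbers below are hypothetical model outputs, not results from the models trained in this notebook.

```python
import pandas as pd

# Hypothetical model outputs for three candidate keywords.
scored = pd.DataFrame({
    "keyword": ["white sneakers", "running shoes", "yoga mat"],
    "pred_ctr": [0.08, 0.05, 0.01],
    "pred_conv_prob": [0.30, 0.60, 0.20],
})

# Combined score: high only when both predictions are high.
scored["score"] = scored["pred_ctr"] * scored["pred_conv_prob"]

ranked_sketch = scored.sort_values("score", ascending=False).reset_index(drop=True)
print(ranked_sketch)
```

Here "running shoes" outranks "white sneakers" despite a lower predicted CTR, because its predicted conversion probability is twice as high. Swapping in a value-weighted score (for example, multiplying by an expected order value) would follow the same pattern.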

In [None]:
from src.ranking import rank_keywords_for_given

candidates = ["white sneakers", "running shoes", "hiking boots", "leather wallet",
              "laptop bag", "wireless earbuds", "gaming mouse", "yoga mat"]

base = {
    "similarity":  0.7,
    "competition": 0.6,
    "impressions": 5000,
    "clicks":      0,
    "cpc":         2.0,
    "cost":        0.0,
    "device":      "mobile",
    "hour":        21,
}

ranked = rank_keywords_for_given("sneakers", candidates, base, reg, clf)
ranked

## 6) Learning-to-Rank with LambdaMART

A proper LambdaMART setup requires:
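1. **Group-based train/test split**: keep all rows for a `given_word` in the same split, so no query leaks from train into test.
2. **Group sizes array**: number of candidate keywords per query, in the same order as the rows.
3. **Relevance labels**: here derived from `ctr`; in production use ROAS or conversions. LightGBM's `lambdarank` objective expects non-negative integer grades, so a continuous value such as `ctr` has to be binned first.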
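
The three requirements above can be sketched with pandas and NumPy on a toy table. The quartile binning and the names `holdout_words`, `train_groups`, and `train_labels` are assumptions for this example; `train_ranker` may prepare its inputs differently.

```python
import numpy as np
import pandas as pd

# Toy query/candidate table: 4 queries with 3, 2, 4 and 3 candidates.
df_rank = pd.DataFrame({
    "given_word": ["shoes"] * 3 + ["laptop"] * 2 + ["yoga"] * 4 + ["audio"] * 3,
    "ctr": [0.09, 0.02, 0.05, 0.01, 0.07, 0.03, 0.08, 0.00, 0.04, 0.06, 0.02, 0.10],
})

# 1. Group-based split: sample whole queries, never individual rows.
rng = np.random.default_rng(0)
words = df_rank["given_word"].unique().tolist()
holdout_words = set(rng.choice(words, size=1, replace=False).tolist())
is_test = df_rank["given_word"].isin(holdout_words)
train, test = df_rank[~is_test], df_rank[is_test]

# 2. Group sizes in row order; sort=False keeps first-appearance order,
#    which matches the rows here because each query's rows are contiguous.
train_groups = train.groupby("given_word", sort=False).size().to_numpy()

# 3. Integer relevance grades (0-3) from CTR quartile bins (assumed scheme).
bins = train["ctr"].quantile([0.25, 0.5, 0.75]).to_numpy()
train_labels = np.digitize(train["ctr"].to_numpy(), bins)

print("held-out queries:", holdout_words)
print("group sizes:", train_groups)
print("labels:", train_labels)
```

Arrays shaped like these are what `lgb.LGBMRanker().fit(X, y, group=train_groups)` expects: one label per row and one size per query, with the sizes summing to the number of rows.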

In [None]:
from src.models import train_ranker

ranker, test_words = train_ranker(df)

In [None]:
# Inspect ranker scores for one held-out given_word
from src.models import FEATURE_COLS, CAT_COLS

# sorted() makes the pick reproducible if test_words is an unordered set
sample_word = sorted(test_words)[0]
df_sample = df[df["given_word"] == sample_word].copy()

# Rebuild the feature frame with categorical columns cast to category dtype
Xs = df_sample[FEATURE_COLS].copy()
for c in CAT_COLS:
    Xs[c] = Xs[c].astype("category")

# A higher ranker_score places the keyword higher for this query
df_sample["ranker_score"] = ranker.predict(Xs)
(
    df_sample[["given_word", "keyword", "ctr", "ranker_score"]]
    .sort_values("ranker_score", ascending=False)
    .head(10)
    .reset_index(drop=True)
)

## 7) Feature importance

Importance is measured by **gain**: the total reduction in training loss from all splits that use a given feature.

In [None]:
from src.models import feature_importance

fi_reg    = feature_importance(reg,    "CTR Regression")
fi_clf    = feature_importance(clf,    "Conversion Classifier")
fi_ranker = feature_importance(ranker, "LambdaMART Ranker")

## Quick-reference: choosing the right setup

| Success metric | Target variable | LightGBM objective | Eval metric |
|---|---|---|---|
| CTR | `ctr` (float) | `regression` | RMSE / MAE |
| Conversion | `has_conversion` (0/1) | `binary` | AUC / PR-AUC |
| ROAS / Profit | continuous value | `regression`, or `tweedie` for non-negative targets | RMSE |
| Click volume | `clicks` (count) | `poisson` | — |
| Keyword ranking | any relevance label | `lambdarank` | NDCG@k |
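
NDCG@k, the ranking metric in the last row, can be computed by hand to see what it rewards. The sketch below uses the common `2^rel - 1` gain and a log2 position discount; the relevance grades are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with the common 2^rel - 1 gain."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades listed in the order the model ranked the keywords.
model_order = [3, 2, 3, 0, 1]
ndcg = ndcg_at_k(model_order, 3)
print(f"NDCG@3 = {ndcg:.4f}")
```

A ranking that already lists the grades in descending order scores exactly 1.0; placing a grade-2 keyword ahead of a grade-3 one, as here, lowers the score slightly.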