# LightGBM Ads Tutorial

End-to-end tutorial using synthetic keyword-ads performance data.

**Topics covered:**
1. CTR prediction (regression)
2. Conversion prediction (binary classification)
3. Keyword ranking function (score-based)
4. Learning-to-Rank with LambdaMART (group-split version)
5. Feature importance

## 0) Install dependencies

In [1]:
# !pip install lightgbm scikit-learn pandas numpy sentence-transformers

## 1) Create synthetic ads dataset

Each row represents one `(given_word, keyword)` pair with features:
- `similarity`: cosine similarity between the sentence embeddings of the two words, clipped to [0, 1]
- `competition`, `impressions`, `clicks`, `cpc`, `cost`, `device`, `hour`

Targets:
- `ctr`: click-through rate (regression)
- `has_conversion`: whether the row converted at least once (binary classification)
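
Because the embeddings are L2-normalised, cosine similarity reduces to a row-wise dot product. A minimal sketch with hand-made 2-D vectors; the vectors and the names `toy_vecs`, `toy_index`, and `pair_similarity` are made up for illustration and are not real model output:

```python
import numpy as np

# Toy "embeddings" (hypothetical values, not produced by a real model).
toy_vecs = {
    "sneakers":      np.array([3.0, 4.0]),
    "running shoes": np.array([4.0, 3.0]),
    "yoga mat":      np.array([-4.0, 3.0]),
}
# L2-normalise so that a plain dot product equals the cosine similarity.
toy_index = {w: v / np.linalg.norm(v) for w, v in toy_vecs.items()}

def pair_similarity(given, kw):
    """Row-wise cosine similarity for aligned lists of words, clipped to [0, 1]."""
    g = np.stack([toy_index[w] for w in given])
    k = np.stack([toy_index[w] for w in kw])
    return np.clip((g * k).sum(axis=1), 0.0, 1.0)

sims = pair_similarity(["sneakers", "sneakers"], ["running shoes", "yoga mat"])
print(sims)  # first pair is close (0.96), second pair is orthogonal (0.0)
```

The clip to [0, 1] discards negative cosine values, which is why some rows in the generated dataset show a similarity of exactly 0.0.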

In [2]:
import json
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)

# Load word lists from files
with open("given_words.json") as f:
    given_words = np.array(json.load(f))

with open("keywords.json") as f:
    keywords = np.array(json.load(f))

print(f"given_words : {len(given_words)}")
print(f"keywords    : {len(keywords)}")

# ── Precompute real similarities ──────────────────────────────────────────────
print("Encoding embeddings...")
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

all_words = np.unique(np.concatenate([given_words, keywords]))
vecs      = embed_model.encode(all_words.tolist(), normalize_embeddings=True, show_progress_bar=True)
vec_index = dict(zip(all_words, vecs))

def real_similarity(given: np.ndarray, kw: np.ndarray) -> np.ndarray:
    """Vectorised cosine similarity for arrays of given/keyword strings."""
    g_vecs = np.stack([vec_index[g] for g in given])
    k_vecs = np.stack([vec_index[k] for k in kw])
    # row-wise dot product (vectors are already normalised)
    return np.clip((g_vecs * k_vecs).sum(axis=1), 0.0, 1.0)

print("Done.")

def make_ads_dataset(n=100_000):
    given = rng.choice(given_words, size=n)
    kw    = rng.choice(keywords, size=n)

    similarity  = real_similarity(given, kw)

    impressions = rng.integers(50, 20000, size=n)
    device      = rng.choice(["mobile", "desktop"], size=n, p=[0.7, 0.3])
    hour        = rng.integers(0, 24, size=n)
    competition = rng.uniform(0.1, 1.0, size=n)

    cpc = np.clip(
        0.2 + 2.0 * competition + 0.5 * (1 - similarity) + rng.normal(0, 0.15, size=n),
        0.05, None
    )

    device_boost = np.where(device == "mobile", 0.02, 0.0)
    hour_boost   = np.where((hour >= 19) & (hour <= 23), 0.01, 0.0)
    ctr = np.clip(
        0.01 + 0.10 * similarity + device_boost + hour_boost + rng.normal(0, 0.01, size=n),
        0.0005, 0.30
    )

    clicks = rng.binomial(impressions, p=ctr)
    cost   = clicks * cpc

    conv_p = 1 / (1 + np.exp(-(-2.0 + 4.0 * similarity - 0.4 * cpc)))
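    # NOTE: np.maximum(clicks, 1) gives every row at least one conversion trial,
    # so a row with zero clicks can still record a conversion.
    conversions    = rng.binomial(np.maximum(clicks, 1), p=np.clip(conv_p, 0.0001, 0.8))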
    has_conversion = (conversions > 0).astype(int)

    return pd.DataFrame({
        "given_word":     given,
        "keyword":        kw,
        "similarity":     similarity,
        "competition":    competition,
        "impressions":    impressions,
        "clicks":         clicks,
        "cpc":            cpc,
        "cost":           cost,
        "device":         device,
        "hour":           hour,
        "ctr":            np.where(impressions > 0, clicks / impressions, 0.0),
        "has_conversion": has_conversion,
        "conversions":    conversions,
    })

df = make_ads_dataset(100_000)
print(f"\nDataset shape: {df.shape}")
df.head()



given_words : 50
keywords    : 1030
Encoding embeddings...


Done.

Dataset shape: (100000, 13)


| | given_word | keyword | similarity | competition | impressions | clicks | cpc | cost | device | hour | ctr | has_conversion | conversions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | watch | vacuum for pet hair | 0.0 | 0.531397 | 19901 | 172 | 1.804093 | 310.304063 | desktop | 13 | 0.008643 | 1 | 9 |
| 1 | supplement | slide guitar | 0.0 | 0.470275 | 18780 | 169 | 1.616793 | 273.238024 | desktop | 1 | 0.008999 | 1 | 14 |
| 2 | mattress | horseshoe necklace | 0.036682 | 0.78998 | 19265 | 734 | 2.33296 | 1712.392556 | mobile | 19 | 0.0381 | 1 | 45 |
| 3 | sunglasses | monitor arm | 0.21458 | 0.7666 | 18955 | 1246 | 2.328759 | 2901.633413 | mobile | 23 | 0.065735 | 1 | 149 |
| 4 | sunglasses | pixel phone | 0.228574 | 0.773931 | 17678 | 584 | 2.094634 | 1223.26628 | desktop | 8 | 0.033035 | 1 | 70 |


In [3]:
df.describe()

| | similarity | competition | impressions | clicks | cpc | cost | hour | ctr | has_conversion | conversions |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.197328 | 0.548878 | 10008.15778 | 458.05521 | 1.698776 | 771.320205 | 11.47293 | 0.045756 | 0.98557 | 73.91203 |
| std | 0.115925 | 0.259774 | 5753.572191 | 338.596249 | 0.544315 | 635.426607 | 6.897224 | 0.018704 | 0.119256 | 102.030089 |
| min | 0.0 | 0.100003 | 50.0 | 0.0 | 0.278226 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.12296 | 0.323847 | 5015.75 | 183.0 | 1.248251 | 280.721762 | 6.0 | 0.033165 | 1.0 | 20.0 |
| 50% | 0.184707 | 0.548076 | 9987.5 | 397.0 | 1.699685 | 615.572365 | 11.0 | 0.045356 | 1.0 | 47.0 |
| 75% | 0.250362 | 0.773605 | 14981.0 | 670.0 | 2.14942 | 1100.535014 | 17.0 | 0.057424 | 1.0 | 92.0 |
| max | 0.903308 | 0.999987 | 19999.0 | 2560.0 | 3.152396 | 5077.750616 | 23.0 | 0.166667 | 1.0 | 1931.0 |


## 2) Prepare features

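LightGBM handles categorical features natively when the columns use the pandas `category` dtype.

Note that `clicks` and `cost` are observed outcomes, not inputs known in advance: `ctr` is computed as `clicks / impressions`, so keeping both `clicks` and `impressions` in the feature set lets the models reconstruct the CTR target almost exactly. Drop these columns if the goal is to predict performance before a keyword has run.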
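
A small sketch of what the `category` dtype stores, using a made-up three-row frame. Each category maps to an integer code, and a value outside the known category set becomes missing once the categories are aligned. The names `toy`, `train_categories`, and `new` are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"device": ["mobile", "desktop", "mobile"]})
toy["device"] = toy["device"].astype("category")

print(toy["device"].cat.categories.tolist())  # ['desktop', 'mobile'] (sorted)
print(toy["device"].cat.codes.tolist())       # [1, 0, 1]

# Align new data to the categories seen earlier: unseen values become NaN,
# which shows up as code -1.
train_categories = toy["device"].cat.categories
new = pd.Series(["mobile", "tablet"]).astype(
    pd.CategoricalDtype(categories=train_categories)
)
print(new.cat.codes.tolist())                 # [1, -1]
```

This matters for high-cardinality columns such as `keyword`: a candidate keyword that never appeared in training carries no category information at prediction time.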

In [4]:
from sklearn.model_selection import train_test_split

FEATURE_COLS = [
    "given_word", "keyword", "similarity", "competition",
    "impressions", "clicks", "cpc", "cost", "device", "hour"
]
CAT_COLS = ["given_word", "keyword", "device"]

X = df[FEATURE_COLS].copy()
for c in CAT_COLS:
    X[c] = X[c].astype("category")

y_ctr  = df["ctr"].values
y_conv = df["has_conversion"].values

# Split the features and both targets in one call so every array shares the same rows
X_train, X_test, y_ctr_train, y_ctr_test, y_conv_train, y_conv_test = train_test_split(
    X, y_ctr, y_conv, test_size=0.2, random_state=42
)

print(f"Train: {X_train.shape}  |  Test: {X_test.shape}")
print(f"Conversion rate (train): {y_conv_train.mean():.3f}")

Train: (80000, 10)  |  Test: (20000, 10)
Conversion rate (train): 0.985


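## 3) Model A: CTR prediction (regression)

CTR is continuous and bounded in [0, 1]. Each sample is weighted by `impressions`, so high-volume rows, whose observed CTR is less noisy, have more influence. The test set also serves as the early-stopping validation set here, so the reported error is slightly optimistic; hold out a separate validation split for an unbiased estimate.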
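
The training loss and the `l2` evaluation metric in the next cell are impression-weighted, while the final RMSE is computed with `mean_squared_error` without weights. A minimal sketch of the difference, with made-up numbers and a hypothetical helper `weighted_rmse`:

```python
import numpy as np

def weighted_rmse(y_true, y_pred, weights):
    """Square root of the weight-averaged squared error."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    return float(np.sqrt(np.average((y_true - y_pred) ** 2, weights=weights)))

y_true = np.array([0.02, 0.10])
y_pred = np.array([0.02, 0.06])     # only the second row is mispredicted
impr   = np.array([9_000, 1_000])   # ...and that row has few impressions

plain    = weighted_rmse(y_true, y_pred, np.ones_like(impr))
weighted = weighted_rmse(y_true, y_pred, impr)
print(f"unweighted RMSE: {plain:.6f}")
print(f"weighted RMSE  : {weighted:.6f}")
```

The weighted figure is smaller here because the error sits on the low-volume row. Passing `sample_weight` to `mean_squared_error` would make the reported RMSE consistent with the training objective.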

In [5]:
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

reg = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    verbose=-1,
)

reg.fit(
    X_train, y_ctr_train,
    sample_weight=X_train["impressions"],
    eval_set=[(X_test, y_ctr_test)],
    eval_sample_weight=[X_test["impressions"]],
    eval_metric="l2",
    categorical_feature=CAT_COLS,
    callbacks=[lgb.early_stopping(stopping_rounds=80, verbose=False),
               lgb.log_evaluation(period=200)],
)

pred_ctr = reg.predict(X_test)
rmse = mean_squared_error(y_ctr_test, pred_ctr) ** 0.5
print(f"\nCTR RMSE : {rmse:.6f}")
print(f"Best iter: {reg.best_iteration_}")

[200]	valid_0's l2: 2.67262e-06
[400]	valid_0's l2: 1.22088e-06
[600]	valid_0's l2: 9.41585e-07
[800]	valid_0's l2: 8.53782e-07
[1000]	valid_0's l2: 8.20225e-07
[1200]	valid_0's l2: 8.00581e-07
[1400]	valid_0's l2: 7.89481e-07
[1600]	valid_0's l2: 7.82909e-07
[1800]	valid_0's l2: 7.77716e-07
[2000]	valid_0's l2: 7.739e-07

CTR RMSE : 0.001495
Best iter: 1992


## 4) Model B: Conversion prediction (binary classification)

In [6]:
from sklearn.metrics import roc_auc_score, average_precision_score

clf = lgb.LGBMClassifier(
    n_estimators=3000,
    learning_rate=0.03,
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    verbose=-1,
)

clf.fit(
    X_train, y_conv_train,
    eval_set=[(X_test, y_conv_test)],
    eval_metric="auc",
    categorical_feature=CAT_COLS,
    callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False),
               lgb.log_evaluation(period=200)],
)

proba = clf.predict_proba(X_test)[:, 1]
print(f"\nAUC   : {roc_auc_score(y_conv_test, proba):.4f}")
print(f"PR-AUC: {average_precision_score(y_conv_test, proba):.4f}")


AUC   : 0.9935
PR-AUC: 0.9999


## 5) Rank keywords for a given word

For a new `given_word`, score a list of candidate keywords using:
- `pred_ctr` from the regression model
- `pred_conv_prob` from the classifier
- `score = pred_ctr × pred_conv_prob` (customize to ROAS, profit, etc.)
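
One possible customization of the score, sketched under stated assumptions: `value_per_conversion` is a hypothetical business input, `conv_rate` is assumed to be a per-click conversion rate, and expected profit per impression is approximated as CTR × (conversion rate × value − CPC). The classifier in this notebook predicts the probability of at least one conversion per row rather than a per-click rate, so treat this as an illustration of the idea, not a drop-in replacement:

```python
import numpy as np

def expected_profit_per_impression(pred_ctr, conv_rate, cpc, value_per_conversion):
    """CTR * (conversion rate * value per conversion - CPC), element-wise."""
    pred_ctr, conv_rate, cpc = map(np.asarray, (pred_ctr, conv_rate, cpc))
    return pred_ctr * (conv_rate * value_per_conversion - cpc)

profit = expected_profit_per_impression(
    pred_ctr=[0.03, 0.01],
    conv_rate=[0.10, 0.30],
    cpc=[2.0, 2.0],
    value_per_conversion=50.0,   # hypothetical revenue per conversion
)
order = np.argsort(-profit)      # candidate indices from best to worst
print(profit, order)
```

In this toy case the second candidate ranks first despite its lower CTR, because its higher conversion rate outweighs the click cost.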

In [7]:
def rank_keywords_for_given(given_word: str, candidates: list, base_features: dict) -> pd.DataFrame:
    """Score and rank candidate keywords for a given word.

    Parameters
    ----------
    given_word    : The query / seed word.
    candidates    : List of keyword strings to evaluate.
    base_features : Dict of feature values shared across all candidates
                    (all cols except given_word, keyword, similarity).

    Returns
    -------
    DataFrame sorted by score descending.
    """
    # Compute real similarities using sentence-transformers
    texts     = [given_word] + candidates
    vecs      = embed_model.encode(texts, normalize_embeddings=True)
    given_vec = vecs[0:1]
    kw_vecs   = vecs[1:]
    sims      = cosine_similarity(given_vec, kw_vecs)[0]

    rows = [
        {**base_features, "given_word": given_word, "keyword": kw, "similarity": float(sim)}
        for kw, sim in zip(candidates, sims)
    ]
    Xcand = pd.DataFrame(rows)[FEATURE_COLS]
    for c in CAT_COLS:
        Xcand[c] = Xcand[c].astype("category")

    ctr_hat  = reg.predict(Xcand)
    conv_hat = clf.predict_proba(Xcand)[:, 1]

    return pd.DataFrame({
        "given_word":     given_word,
        "keyword":        candidates,
        "similarity":     sims,
        "pred_ctr":       ctr_hat,
        "pred_conv_prob": conv_hat,
        "score":          ctr_hat * conv_hat,
    }).sort_values("score", ascending=False).reset_index(drop=True)


candidates = [
    "white sneakers", "running shoes", "canvas shoes",
    "hiking boots", "yoga mat", "leather wallet",
    "wireless earbuds", "gaming mouse",
]

base = {
    "competition": 0.6,
    "impressions": 5000,
    "clicks":      0,
    "cpc":         2.0,
    "cost":        0.0,
    "device":      "mobile",
    "hour":        21,
}

ranked = rank_keywords_for_given("sneakers", candidates, base)
ranked

| | given_word | keyword | similarity | pred_ctr | pred_conv_prob | score |
|---|---|---|---|---|---|---|
| 0 | sneakers | white sneakers | 0.85892 | 0.033682 | 0.334675 | 0.011272 |
| 1 | sneakers | running shoes | 0.669356 | 0.029309 | 0.334675 | 0.009809 |
| 2 | sneakers | canvas shoes | 0.649999 | 0.028859 | 0.334675 | 0.009658 |
| 3 | sneakers | hiking boots | 0.582462 | 0.026393 | 0.334675 | 0.008833 |
| 4 | sneakers | leather wallet | 0.331565 | 0.01557 | 0.334675 | 0.005211 |
| 5 | sneakers | gaming mouse | 0.274997 | 0.011825 | 0.261867 | 0.003097 |
| 6 | sneakers | wireless earbuds | 0.180678 | 0.006974 | 0.268278 | 0.001871 |
| 7 | sneakers | yoga mat | 0.166128 | 0.00635 | 0.247677 | 0.001573 |


## 6) Learning-to-Rank with LambdaMART

A proper LambdaMART setup requires:
1. **Group-based train/test split**: keep all rows for a `given_word` in the same split.
2. **Group sizes array**: the number of candidate keywords per query, in the same order as the rows.
3. **Relevance labels**: integer grades. Here `ctr` is binned into five quantile grades (0 to 4); in production, use ROAS or conversions.
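
Points 2 and 3 can be sketched on a made-up five-row frame. The rows, the three-grade binning, and the variable names are illustrative assumptions; the notebook cells below use five grades on the real data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "given_word": ["tent", "sofa", "tent", "sofa", "tent"],
    "keyword":    ["camping tent", "sofa bed", "gaming mouse", "yoga mat", "tent stakes"],
    "ctr":        [0.09, 0.08, 0.01, 0.02, 0.07],
})

# Rows of one query must be contiguous, so sort by the query key first.
toy = toy.sort_values("given_word", kind="stable").reset_index(drop=True)

# Group sizes, listed in the same order as the sorted rows.
groups = toy.groupby("given_word", sort=True).size().tolist()

# Integer relevance grades from CTR quantiles (3 grades: 0, 1, 2).
ctr    = toy["ctr"].to_numpy()
edges  = np.quantile(ctr, [1 / 3, 2 / 3])
grades = np.digitize(ctr, edges)

print(groups)           # [2, 3] -> 2 rows for "sofa", then 3 rows for "tent"
print(grades.tolist())  # [2, 0, 2, 0, 1]
```

The group sizes must sum to the number of rows, and their order must match the row order of the feature matrix passed to the ranker.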

In [8]:
from lightgbm import LGBMRanker

# ── 6a) Group-based train/test split ──────────────────────────────────────────
unique_given = df["given_word"].unique()
rng_split    = np.random.default_rng(0)
rng_split.shuffle(unique_given)

split_idx    = int(len(unique_given) * 0.8)
train_words  = set(unique_given[:split_idx])
test_words   = set(unique_given[split_idx:])

df_rank = df.sort_values("given_word").copy()

mask_train = df_rank["given_word"].isin(train_words)
df_r_train = df_rank[mask_train].copy()
df_r_test  = df_rank[~mask_train].copy()

print(f"Ranker train rows: {len(df_r_train)}  |  test rows: {len(df_r_test)}")
print(f"Train given_words: {sorted(train_words)}")
print(f"Test  given_words: {sorted(test_words)}")

Ranker train rows: 80092  |  test rows: 19908
Train given_words: ['boots', 'camera', 'camping', 'candle', 'coffee', 'desk', 'dress', 'fishing', 'gaming', 'gift', 'guitar', 'handbag', 'headphones', 'jacket', 'jewelry', 'keyboard', 'laptop', 'luggage', 'makeup', 'mattress', 'monitor', 'necklace', 'pants', 'perfume', 'pet', 'phone', 'plant', 'printer', 'protein', 'running', 'shoes', 'skincare', 'sneakers', 'sunglasses', 'supplement', 'toy', 'vitamin', 'wallet', 'watch', 'yoga']
Test  given_words: ['baby', 'backpack', 'bicycle', 'blender', 'book', 'ring', 'shirt', 'sofa', 'tent', 'vacuum']


In [9]:
# ── 6b) Build feature matrices and group size arrays ──────────────────────────
def bin_ctr(ctr_values: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Convert continuous CTR into integer relevance grades (0 to n_bins-1)."""
    bins = np.quantile(ctr_values, np.linspace(0, 1, n_bins + 1))
    bins = np.unique(bins)  # remove duplicates if any
    return np.digitize(ctr_values, bins[1:-1]).astype(int)

def build_rank_arrays(subset: pd.DataFrame):
    Xr = subset[FEATURE_COLS].copy()
    for c in CAT_COLS:
        Xr[c] = Xr[c].astype("category")
    y      = bin_ctr(subset["ctr"].values)   # integer grades required by LambdaMART
    groups = subset.groupby("given_word", sort=True).size().tolist()
    return Xr, y, groups

Xr_train, yr_train, groups_train = build_rank_arrays(df_r_train)
Xr_test,  yr_test,  groups_test  = build_rank_arrays(df_r_test)

print(f"Label range: {yr_train.min()} – {yr_train.max()}  (grades 0–4)")
print(f"Group sizes (train): {groups_train}")
print(f"Group sizes (test) : {groups_test}")

Label range: 0 – 4  (grades 0–4)
Group sizes (train): [2077, 1977, 2026, 1925, 2017, 1916, 1993, 2019, 2022, 1946, 1978, 1976, 2089, 1906, 2006, 1985, 2069, 2056, 1932, 2027, 2093, 2046, 1986, 2097, 2021, 1973, 2031, 2016, 2092, 1976, 1963, 2021, 1938, 1998, 1975, 1984, 1962, 1979, 1967, 2032]
Group sizes (test) : [2002, 1956, 2011, 2028, 1987, 1920, 2081, 1967, 1957, 1999]


In [10]:
# ── 6c) Train LambdaMART ranker ───────────────────────────────────────────────
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    verbose=-1,
)

ranker.fit(
    Xr_train, yr_train,
    group=groups_train,
    eval_set=[(Xr_test, yr_test)],
    eval_group=[groups_test],
    eval_at=[3, 5, 10],
    categorical_feature=CAT_COLS,
    callbacks=[lgb.early_stopping(stopping_rounds=80, verbose=False),
               lgb.log_evaluation(period=200)],
)

print(f"\nBest iteration: {ranker.best_iteration_}")


Best iteration: 18


In [11]:
# ── 6d) Inspect ranker scores for one test given_word ─────────────────────────
# NOTE: set iteration order is arbitrary, so this pick can differ between runs;
# use sorted(test_words)[0] for a reproducible choice.
sample_word = list(test_words)[0]
df_sample   = df_r_test[df_r_test["given_word"] == sample_word].copy()

Xs = df_sample[FEATURE_COLS].copy()
for c in CAT_COLS:
    Xs[c] = Xs[c].astype("category")

df_sample["ranker_score"] = ranker.predict(Xs)
df_sample[["given_word", "keyword", "ctr", "ranker_score"]] \
    .sort_values("ranker_score", ascending=False) \
    .reset_index(drop=True)

| | given_word | keyword | ctr | ranker_score |
|---|---|---|---|---|
| 0 | blender | ninja blender | 0.098982 | 0.808492 |
| 1 | blender | travel blender | 0.104976 | 0.808492 |
| 2 | blender | beauty blender | 0.103904 | 0.808492 |
| 3 | blender | countertop blender | 0.079826 | 0.808492 |
| 4 | blender | beauty blender | 0.096589 | 0.808492 |
| ... | ... | ... | ... | ... |
| 2023 | blender | open back headphones | 0.038789 | -0.804714 |
| 2024 | blender | leather jacket | 0.047831 | -0.804868 |
| 2025 | blender | flannel shirt | 0.045409 | -0.805315 |
| 2026 | blender | firm mattress | 0.051174 | -0.806465 |


## 7) Feature importance

Importance is reported as **gain**: the total reduction in loss attributed to splits on each feature.

In [12]:
def show_importance(model, title: str):
    fi = pd.DataFrame({
        "feature":    model.feature_name_,
        "importance": model.booster_.feature_importance(importance_type="gain"),
    }).sort_values("importance", ascending=False).reset_index(drop=True)
    print(f"\n=== {title} ===")
    print(fi.to_string(index=False))
    return fi

fi_reg    = show_importance(reg,    "CTR Regression")
fi_clf    = show_importance(clf,    "Conversion Classifier")
fi_ranker = show_importance(ranker, "LambdaMART Ranker")


=== CTR Regression ===
    feature   importance
     clicks 2.376324e+06
impressions 8.974916e+05
 similarity 5.670585e+05
     device 5.235285e+05
       cost 1.272965e+05
        cpc 4.867025e+04
       hour 3.862799e+04
    keyword 1.774280e+04
 given_word 6.949172e+03
competition 2.458634e+03

=== Conversion Classifier ===
    feature    importance
     clicks 114545.429480
       cost  17831.133654
 similarity   7060.908456
        cpc   5954.600694
competition   4271.569825
impressions   3338.817583
       hour   2226.812948
    keyword   1600.451376
 given_word    522.214152
     device    284.730831

=== LambdaMART Ranker ===
    feature  importance
     clicks 7653.807459
       cost 1088.720889
 similarity  460.325612
impressions  195.929314
        cpc  168.420746
     device  159.497969
competition   28.083892
    keyword   24.707920
       hour   23.041285
 given_word    0.000000
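
Raw gain totals depend on the scale of each model's loss, so they are hard to compare across the three tables. Converting them to shares makes the picture easier to read. A small sketch using the ranker gains printed above; the dictionary copies those numbers by hand:

```python
# Gain values copied from the "LambdaMART Ranker" output above.
ranker_gain = {
    "clicks": 7653.807459, "cost": 1088.720889, "similarity": 460.325612,
    "impressions": 195.929314, "cpc": 168.420746, "device": 159.497969,
    "competition": 28.083892, "keyword": 24.707920, "hour": 23.041285,
    "given_word": 0.0,
}

total  = sum(ranker_gain.values())
shares = {name: gain / total for name, gain in ranker_gain.items()}

# Print features from largest to smallest share of total gain.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12} {share:6.1%}")
```

`clicks` accounts for roughly 78% of the ranker's total gain, which is consistent with the relevance label being derived from `clicks / impressions`.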


## Quick-reference: choosing the right setup

The table below covers all common setups. Rows marked ✅ are implemented in this notebook; rows marked 📖 are for reference only.

| Status | Success metric | Target variable | LightGBM objective | Eval metric |
|---|---|---|---|---|
| ✅ | CTR | `ctr` (float) | `regression` | RMSE / MAE |
| ✅ | Conversion | `has_conversion` (0/1) | `binary` | AUC / PR-AUC |
| ✅ | Keyword ranking | any relevance label | `lambdarank` | NDCG@k |
| 📖 | ROAS / Profit | continuous value | `regression` or `tweedie` | RMSE |
| 📖 | Click volume | `clicks` (count) | `poisson` | - |
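
For the two reference-only rows, the parameter sets might look like the sketch below. The dictionary name, the `tweedie_variance_power` value, and the choice of `poisson` as the evaluation metric are assumptions for illustration; none of them is tuned or used elsewhere in this notebook:

```python
# Hypothetical LightGBM parameter sets for the reference-only rows of the table.
reference_setups = {
    # Tweedie expects non-negative targets, so it suits ROAS or revenue;
    # profit can be negative and would need plain "regression" instead.
    "roas": {
        "objective": "tweedie",
        "tweedie_variance_power": 1.5,  # between 1 (Poisson-like) and 2 (gamma-like)
        "metric": "rmse",
    },
    # Count target such as `clicks`.
    "click_volume": {
        "objective": "poisson",
        "metric": "poisson",            # Poisson negative log-likelihood
    },
}

for name, params in reference_setups.items():
    print(name, params)
```

Each dictionary can be unpacked into the scikit-learn wrapper, for example `lgb.LGBMRegressor(**reference_setups["click_volume"])`.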