# Clustering ad campaigns

Case available here

## 1. Importing the dataset

In [1]:
import pandas as pd

campaigns_df = pd.read_csv(
    "data/_SELECT_c_id_as_campaign_id_c_product_name_as_product_name_c_sho_202411281333.csv"
)
campaigns_df

Unnamed: 0,campaign_id,product_name,short_description,long_description
0,51831d3f-cb5f-4ce1-94df-f56b7b9fea23,Beyond Body - Personalized Wellness Book,Embark on a personalized wellness journey with...,"Beyond Body is more than just a wellness book,..."
1,aff29b17-7d00-43a9-accc-28342c15faa0,Better In Person Dating App,Seeking creators to create UGC-style videos hi...,Better in Person is a dating app for intention...
2,bfe17c54-6a0b-4357-9c42-f2c47d647312,ISM,"Play around with Ism Lens for that cool twist,...",Unlock the potential of your creativity with I...
3,b047aaef-f3ce-4758-84ea-784ee41026cf,Officiel QI Test,"Promouvoir Officiel QI Test, un service de tes...","Dans cette campagne, nous voulons promouvoir O..."
4,7ff00182-260c-44de-ac4a-068ce407e64c,Joko,Dans cette campagne on commence par un hook où...,Dans cette campagne on commence par un hook où...
...,...,...,...,...
701,ab45a400-995d-4a25-a92a-3d36f15e2226,GoTrendier México,Motiva a tus seguidores a vender sus prendas e...,GoTrendier Colombia busca creadores de conteni...
702,0be162b9-a5d0-43aa-a4a6-416d2d96ff6d,Be Fit: Gym & At Home Workouts,"Enhance your fitness journey with Be Fit, an a...","We're launching a new fitness app, Be Fit: Gym..."
703,4ea06c4a-415c-4d91-ba5d-72600668c299,VoiceTasker,Experience task management like never before w...,The ad campaign seeks to promote VoiceTasker –...
704,982e4143-b50e-4d3a-baa0-9fb225a7aaba,Jelly Juice,"Erfreuen Sie sich an Jelly Juice, einem zauber...",Jelly Juice DE ist ein unterhaltsames match-3-...


The sample already shows at least three different languages. Clustering with a technique like TF-IDF would therefore probably separate the campaigns by language first, since texts in different languages share very little vocabulary. To cluster by topic instead, we start by detecting the language of each campaign, then translate everything to English.

In [2]:
import langdetect
from langdetect import LangDetectException
import numpy as np


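def detect_language(x: str):
    """Return the detected language code, or NaN when detection fails."""
    try:
        return langdetect.detect(x)
    except LangDetectException:
        # raised when the text has no usable features (e.g. empty string)
        return np.nan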


campaigns_df.loc[:, "language"] = campaigns_df.long_description.apply(detect_language)

In [3]:
campaigns_df.language.value_counts()

language
en    460
fr    166
de     24
es     20
pt     12
ar      5
it      4
pl      4
nl      2
af      1
ro      1
et      1
id      1
so      1
ja      1
vi      1
Name: count, dtype: int64

Five languages dominate (English, French, German, Spanish, and Portuguese). The rest are either detection errors or severely underrepresented languages, each below 1% of the dataset.
Let us drop those campaigns and translate everything else to English.
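
The 1% cutoff can be checked directly against the counts above. This is a small sketch in plain Python; `min_share` is a hypothetical name for the threshold and does not appear in the notebook.

```python
# Language counts copied from the value_counts() output above.
language_counts = {
    "en": 460, "fr": 166, "de": 24, "es": 20, "pt": 12, "ar": 5,
    "it": 4, "pl": 4, "nl": 2, "af": 1, "ro": 1, "et": 1,
    "id": 1, "so": 1, "ja": 1, "vi": 1,
}

min_share = 0.01  # assumption: keep languages covering at least 1% of detected rows
total = sum(language_counts.values())

# Keep only the languages whose share of the dataset reaches the threshold.
kept_languages = [
    lang for lang, count in language_counts.items() if count / total >= min_share
]
print(kept_languages)
```

With these counts, the threshold keeps exactly the five languages used in the filter below.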

In [4]:
# .copy() makes the filtered frame independent, so the column assignment below
# does not trigger pandas' SettingWithCopyWarning
campaigns_df = campaigns_df[
    campaigns_df.language.isin(["en", "fr", "de", "es", "pt"])
].copy()
campaigns_df.loc[campaigns_df.language == "en", "translated_desc"] = campaigns_df.loc[
    campaigns_df.language == "en", "long_description"
]
campaigns_df


Unnamed: 0,campaign_id,product_name,short_description,long_description,language,translated_desc
0,51831d3f-cb5f-4ce1-94df-f56b7b9fea23,Beyond Body - Personalized Wellness Book,Embark on a personalized wellness journey with...,"Beyond Body is more than just a wellness book,...",en,"Beyond Body is more than just a wellness book,..."
1,aff29b17-7d00-43a9-accc-28342c15faa0,Better In Person Dating App,Seeking creators to create UGC-style videos hi...,Better in Person is a dating app for intention...,en,Better in Person is a dating app for intention...
2,bfe17c54-6a0b-4357-9c42-f2c47d647312,ISM,"Play around with Ism Lens for that cool twist,...",Unlock the potential of your creativity with I...,en,Unlock the potential of your creativity with I...
3,b047aaef-f3ce-4758-84ea-784ee41026cf,Officiel QI Test,"Promouvoir Officiel QI Test, un service de tes...","Dans cette campagne, nous voulons promouvoir O...",fr,
4,7ff00182-260c-44de-ac4a-068ce407e64c,Joko,Dans cette campagne on commence par un hook où...,Dans cette campagne on commence par un hook où...,fr,
...,...,...,...,...,...,...
701,ab45a400-995d-4a25-a92a-3d36f15e2226,GoTrendier México,Motiva a tus seguidores a vender sus prendas e...,GoTrendier Colombia busca creadores de conteni...,es,
702,0be162b9-a5d0-43aa-a4a6-416d2d96ff6d,Be Fit: Gym & At Home Workouts,"Enhance your fitness journey with Be Fit, an a...","We're launching a new fitness app, Be Fit: Gym...",en,"We're launching a new fitness app, Be Fit: Gym..."
703,4ea06c4a-415c-4d91-ba5d-72600668c299,VoiceTasker,Experience task management like never before w...,The ad campaign seeks to promote VoiceTasker –...,en,The ad campaign seeks to promote VoiceTasker –...
704,982e4143-b50e-4d3a-baa0-9fb225a7aaba,Jelly Juice,"Erfreuen Sie sich an Jelly Juice, einem zauber...",Jelly Juice DE ist ein unterhaltsames match-3-...,de,


## 2. Translation

In [59]:
from deep_translator import GoogleTranslator
from tqdm.notebook import tqdm

# total number of characters across all long descriptions (English included)
print(float(campaigns_df.long_description.apply(len).sum()))

translator = GoogleTranslator(source="auto", target="en")


out = []
for sentence in tqdm(
    zip(
        campaigns_df.loc[campaigns_df.language != "en", "short_description"].values,
        campaigns_df.loc[campaigns_df.language != "en", "long_description"].values,
        strict=False,
    ),
    total=campaigns_df.loc[campaigns_df.language != "en"].shape[0],
):
    out.append(" ".join(translator.translate_batch(list(sentence), dest="en")))
campaigns_df.loc[campaigns_df.language != "en", "translated_desc"] = out
# campaigns_df.loc[campaigns_df.language != "en", "translated_desc"] = campaigns_df[campaigns_df.language != "en"].apply(translate_row, axis=1)

342748.0


  0%|          | 0/222 [00:00<?, ?it/s]

In [61]:
campaigns_df.to_parquet("data/campaigns_df.parquet")

## 3. TFIDF-based clustering

TF-IDF gives a high weight to words that appear often in a given document but rarely in the rest of the corpus, which makes those words useful features for an ML model.
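
As a rough illustration of that weighting, here is a hand-rolled TF-IDF on a toy corpus. The three documents are invented for the example, and the formula is the plain `tf * log(N / df)` variant, not the smoothed and normalised one that scikit-learn's `TfidfVectorizer` applies.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "fitness app for daily workout".split(),
    "dating app for real meetings".split(),
    "puzzle game app for friends".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf(doc: list[str]) -> dict[str, float]:
    """Plain tf * log(N / df) weights for one tokenised document."""
    term_freq = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }


weights = tfidf(docs[0])
# "app" and "for" occur in every document, so log(3 / 3) = 0 cancels them out.
print(weights["app"], weights["fitness"])
```

Words shared by every document get a zero weight, while a word specific to one document ("fitness") keeps a positive one. The same effect is why generic terms such as "campaign" or "app" carry little signal here.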

In [67]:
from sklearn.cluster import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from src.model.preprocessing import extract_nouns_batch, remove_words_from_col


# train_size + test_size = 0.9, so 10% of the rows end up in neither split
campaign_df_train, campaign_df_test = train_test_split(
    campaigns_df, train_size=0.75, test_size=0.15, random_state=42
)

vectorizer = TfidfVectorizer(
    stop_words="english", max_df=0.7, min_df=0.05, ngram_range=(1, 1)
)


preprocessed_col = pd.Series(
    extract_nouns_batch(campaign_df_train.translated_desc), dtype="str"
)

# handpicked keywords that are too generic to be in a cluster
preprocessed_col = remove_words_from_col(
    preprocessed_col,
    [
        "campaign",
        "users",
        "video",
        "app",
        "wi",
        "ll",
        "th",
        "new",
        "eir",
        "brief",
        "foow",
        "product",
        "lication",
        # "money",
        "user",
        "experience",
        # "community",
        "platform",
        "world",
        "fun",
        "life",
        "day",
        "people",
        "photos",
        "goal",
    ],
)
X = vectorizer.fit_transform(preprocessed_col)

dbscan = HDBSCAN(
    metric="cosine", min_cluster_size=20, cluster_selection_epsilon=0.25, alpha=1.0
)  # Use cosine distance for text
campaign_df_train.loc[:, "cluster"] = dbscan.fit_predict(X)

print(campaign_df_train.cluster.value_counts())

cluster
-1    209
 0     92
 6     65
 1     26
 7     25
 3     24
 5     24
 2     23
 4     23
Name: count, dtype: int64


In [68]:
terms = vectorizer.get_feature_names_out()
top_n = 10
for cluster_id in sorted(campaign_df_train.cluster.unique()):
    cluster_X = X[campaign_df_train.cluster == cluster_id]
    cluster_tfidf = cluster_X.mean(axis=0).A1
    top_terms = sorted(zip(cluster_tfidf, terms), reverse=True)[:top_n]
    print(cluster_id, cluster_X.shape[0], [e[1] for e in top_terms])

-1 209 ['way', 'features', 'creators', 'space', 'fitness', 'journey', 'feature', 'community', 'style', 'media']
0 92 ['way', 'time', 'style', 'space', 'money', 'media', 'journey', 'game', 'friends', 'fitness']
1 26 ['way', 'time', 'style', 'space', 'money', 'media', 'journey', 'game', 'friends', 'fitness']
2 23 ['ai', 'style', 'features', 'media', 'feature', 'friends', 'way', 'time', 'community', 'content']
3 24 ['money', 'way', 'ai', 'time', 'creators', 'friends', 'fitness', 'style', 'space', 'media']
4 23 ['time', 'features', 'money', 'game', 'way', 'style', 'space', 'media', 'journey', 'friends']
5 24 ['content', 'time', 'creators', 'media', 'game', 'space', 'ai', 'way', 'style', 'money']
6 65 ['game', 'friends', 'time', 'community', 'way', 'journey', 'style', 'feature', 'content', 'space']
7 25 ['community', 'style', 'journey', 'ai', 'time', 'way', 'space', 'money', 'media', 'game']


In [40]:
terms

array(['access', 'ai', 'aims', 'aows', 'best', 'better', 'choose',
       'community', 'content', 'create', 'creators', 'daily', 'day',
       'designed', 'different', 'discover', 'download', 'easily', 'easy',
       'em', 'engaging', 'enjoy', 'experience', 'ey', 'feature',
       'features', 'feel', 'fitness', 'focus', 'free', 'friends', 'fun',
       'game', 'goal', 'help', 'helps', 'highlight', 'join', 'journey',
       'just', 'let', 'lication', 'life', 'like', 'live', 'looking',
       'make', 'making', 'money', 'need', 'oer', 'offers', 'online',
       'people', 'perfect', 'personal', 'personalized', 'photos',
       'platform', 'product', 'promote', 'real', 'rough', 'safe', 'share',
       'sharing', 'showcase', 'simple', 'social', 'space', 'start',
       'tiktok', 'time', 'today', 'ultimate', 'unique', 'use', 'user',
       'using', 'want', 'way', 'wheer', 'wi', 'world'], dtype=object)
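
Fragments such as 'aows', 'oer', 'wheer', and 'lication' in this vocabulary are what one would get if short strings like "ll", "th", and "app" were deleted inside longer words ("allows", "other", "whether", "application"). The implementation of `remove_words_from_col` is not shown, so this is only a guess. If that is the cause, matching on word boundaries avoids it. `remove_whole_words` below is a hypothetical helper, not the function from `src.model.preprocessing`.

```python
import re


def remove_whole_words(text: str, words: list[str]) -> str:
    """Drop the given words only when they appear as standalone tokens."""
    # \b anchors the match at word boundaries, so "th" cannot match inside "with".
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    # Collapse the whitespace left behind by removed words.
    return " ".join(pattern.sub(" ", text).split())


stop_list = ["app", "th", "ll", "new"]
sample = "the new app allows users to follow others with their application"

# Naive substring removal, shown only to reproduce the kind of damage seen above.
naive = sample
for word in stop_list:
    naive = naive.replace(word, "")

print(remove_whole_words(sample, stop_list))
print(" ".join(naive.split()))
```

The first line printed keeps "allows", "follow", "with", and "application" intact, while the naive version produces tokens like "aows", "foow", "wi", and "lication".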

The output is not very convincing:
- first, 209 of the 511 training campaigns (41%) are labelled -1, meaning they could not be assigned to any cluster
- second, the clusters overlap heavily in their top keywords (clusters 0 and 1 even list exactly the same ten terms) and are therefore hard to interpret
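
One thing worth checking when reading the keyword lists: `sorted(zip(scores, terms), reverse=True)` breaks ties by term, in reverse alphabetical order, so terms whose mean weight is zero can still fill up a top 10. Several of the lists printed earlier end in (or consist entirely of) terms in that order, which suggests part of the overlap may be padding. A minimal sketch that keeps only terms with a positive score follows; the scores are invented for illustration and `top_terms` is a hypothetical helper.

```python
def top_terms(scores: list[float], terms: list[str], top_n: int = 10) -> list[str]:
    """Return up to top_n terms by descending score, skipping zero-weight terms."""
    ranked = sorted(zip(scores, terms), reverse=True)
    return [term for score, term in ranked[:top_n] if score > 0]


# Invented mean TF-IDF weights for one cluster.
terms = ["ai", "fitness", "game", "money", "style", "way"]
scores = [0.0, 0.0, 0.42, 0.0, 0.0, 0.17]

# Unfiltered ranking pads the list with zero-weight terms in reverse alphabetical order.
padded = [term for _, term in sorted(zip(scores, terms), reverse=True)]
print(padded)
print(top_terms(scores, terms))
```

Here the unfiltered ranking returns all six terms, whereas the filtered one returns only "game" and "way", the two terms that actually carry weight.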

## 4. Using sentence-embeddings
Another approach is to cluster sentence embeddings instead of TF-IDF vectors:

In [39]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    campaign_df_train.translated_desc.values, show_progress_bar=True
)

dbscan_e = HDBSCAN(
    metric="cosine", min_cluster_size=10, cluster_selection_epsilon=0.0, alpha=1.0
)  # Use cosine distance for text
campaign_df_train.loc[:, "cluster"] = dbscan_e.fit_predict(embeddings).astype(int)
print(campaign_df_train.cluster.value_counts())

Batches:   0%|          | 0/16 [00:00<?, ?it/s]

cluster
-1    406
 5     22
 2     18
 4     18
 1     17
 3     17
 0     13
Name: count, dtype: int64


In [42]:
import pickle

with open("data/embeddings.pkl", "wb") as file:
    pickle.dump(embeddings, file)

# with open("data/embeddings.pkl", "rb") as file:
#     embeddings = pickle.load(file)


In [70]:
# Data inputs
from src.model.train import generate_cluster_names

sentences = campaign_df_train["translated_desc"].to_numpy()
clusters = campaign_df_train["cluster"].to_numpy()


# Generate cluster names
cluster_names = generate_cluster_names(sentences, embeddings, clusters)

# Display the names for each cluster
for cluster_id, cluster_name in cluster_names.items():
    print(f"Cluster {cluster_id}: {cluster_name}")

Cluster -1: ['creators', 'fitness', 'just']
Cluster 0: ['application', 'brief', 'create']
Cluster 1: ['allianz', 'free', 'help']
Cluster 2: ['ai', 'ava', 'create', 'dating']
Cluster 3: ['allianz', 'auto', 'best', 'earn']
Cluster 4: ['beauty', 'box', 'choose']
Cluster 5: ['children', 'content', 'create']
Cluster 6: ['experience', 'friends', 'fun', 'game']
Cluster 7: ['ability', 'community', 'connect', 'join']


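Some topics emerge more clearly: "ai", "fitness", "auto", "beauty", "children", "game", "networking". Two caveats apply, though. The embedding run above left 406 of the 511 training campaigns (79%) unclustered. The names also cover cluster ids -1 to 7, whereas that run produced only clusters 0 to 5; judging by the cell numbers (In [39], In [67], In [70]), the `cluster` column probably held the TF-IDF labels when the names were generated. The result can serve as a starting point, but it is not usable as is.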

## 5. Conclusion

Clustering based on the long description is not very convincing here.

In practice, I would suggest defining the clusters manually from domain knowledge, for example "dating", "fitness", "auto", "beauty", "children", "game", "networking", and so on, each with 10 examples (or ideally more).

Then, to classify any sentence, I would compute the centroid of each manually defined cluster (the mean embedding of its examples), measure the distance from the sentence's embedding to each centroid, and suggest the cluster with the shortest distance. If there is no clear winner, I would predict "-1".
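
The procedure could be sketched as follows, in plain Python to stay dependency-free. The 2-D vectors are invented stand-ins for real embeddings (which would come from `model.encode`), and `min_margin`, the gap required between the best and second-best distance, is an assumed way of defining "no clear winner".

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def classify(embedding, centroids, min_margin=0.05):
    """Return the closest cluster name, or -1 when the two best are too close."""
    ranked = sorted(
        (cosine_distance(embedding, c), name) for name, c in centroids.items()
    )
    best, runner_up = ranked[0], ranked[1]
    if runner_up[0] - best[0] < min_margin:
        return -1
    return best[1]


# Toy 2-D "embeddings", invented for illustration only.
examples = {
    "fitness": [[1.0, 0.1], [0.9, 0.0]],
    "game": [[0.0, 1.0], [0.1, 0.9]],
}
centroids = {name: centroid(vectors) for name, vectors in examples.items()}

print(classify([0.8, 0.1], centroids))  # close to the "fitness" examples
print(classify([0.5, 0.5], centroids))  # equidistant from both centroids
```

The margin would need tuning on real data: too small and ambiguous campaigns get forced into a cluster, too large and most of them fall back to "-1".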