### Challenge

Your goal is to create an **accurate representation of a user** based on their Google search history.

The data is in `./search_history.json`. This contains a list of searches made by a single person over time.

### What does "accurate" mean?

**Accurate** means understanding which searches are **signal** and which are **noise**. Not every search reflects who someone is. Your job is to separate the meaningful from the incidental and build a coherent picture of this person.

A strong solution might surface insights like:
- **Fashion preferences**: What styles, brands, or aesthetics do they gravitate toward?
- **Travel**: Where have they been? Where are they planning to go?
- **Daily life**: What occupies their time—at work and for leisure?
- **Life transitions**: Are they moving? Starting a new job? Planning a wedding?
- **Location**: Where do they live?

This is not an exhaustive list. The point is to go beyond surface-level keyword extraction and demonstrate that you *actually understand* this person.

### What could a "representation" look like?

There are many ways to represent a user. A few examples:
- A **personal knowledge graph** capturing entities, relationships, and context
- A **single embedding** that encodes the user's preferences in a vector space
- An **LLM fine-tuned** on the user's data
- An **agent** that uses RAG to answer questions about the user

These are just starting points—come up with your own if you have a better idea. The specific representation you choose matters less than **why** you chose it and how well it captures what's meaningful about this person.

### Dummy approach

The following is what we consider a **dummy** approach:
1. Embed all searches
2. Cluster them by topic
3. Label each cluster and call it a "user interest"

This is mechanical. It doesn't distinguish signal from noise, doesn't capture nuance, and doesn't produce insights that feel *true* about a real person.

### What makes an interesting approach?

We're not looking for a "correct" answer; there probably isn't one. We're looking for **evidence of thinking**:
- Why did you choose this method over alternatives?
- What assumptions are you making, and why are they reasonable?
- How do you handle ambiguity in the data?
- What did you try that didn't work?

**The reasoning behind your approach is as important as the solution itself.** Show your work. Explain your decisions. If you explored dead ends, include them.

Make sure to include the cell output in the final commit. We will **not** execute the notebook ourselves.

# My Solution: User Intent Modelling

### Overview

The main goal is a method that encodes the crucial information about a user from a source of data; here, that source is Google search data.

I want to approach this in a way that is scalable, robust to noise, and close to how it would actually be used in production. For Fabric's use case in particular, I treat this as a context engineering problem. When a user chats with their favourite LLM, the LLM decides, depending on the user's query, whether to fetch context from Fabric's MCP (or similar) server. In that case we must provide the most relevant and up-to-date context, specific to that person. We need a system that does this for each user.

The important thing is that this context changes over time. Things like travel plans, favourite restaurants, fashion choices, the user's job, and health information change at different frequencies. This makes approaches like fine-tuning an LLM on the user's data inefficient. Fine-tuning can also hide important information, because it encodes the knowledge in the LLM's parameters and everything downstream depends on that black box.

Using the searches as documents for RAG keeps the knowledge intact, but the data contains a lot of noise, so we cannot rely on the semantic similarity of vector search alone: noisy matches pollute the context. We need to filter and process aggressively before the RAG step.

The method should also scale across users and support continuous updates as new data arrives from the user in production. So any method that relies heavily on thresholds, hardcoded cluster labels, and the like should be avoided (or, at best, serve as a baseline).

##### **Considering all these, I would like to formalize the general solution framework as below:**

Each user has a set of themes that they engage with in life. Because these themes are difficult to observe directly, we model them as latent variables. At any given moment a small number of themes are active, and these generate the user data observed at that instant. Some themes correspond to noise, such as random bursts or frequent but uninformative events (constant Gmail, Google, or weather visits). Relevant topics such as fashion, health, or career form their own themes.

We learn the themes from a static dataset collected over an initial burn-in period (possibly historical data) and then use them to filter incoming searches, so each theme acts like a latent cluster that is learned from the data. All original searches are stored with their theme label, and we infer what each theme means from a summary of the searches assigned to it.

At inference time, depending on the user query, we use RAG to first retrieve the themes that best match the context, then fetch the most relevant and recent data within them (every event has a timestamp), and add it, either directly or as a summary, to the context. The deeper information about the person is thus "mined" on demand from properly grouped data. This is the key idea.
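
The inference step of this framework can be sketched in a few lines. This is only an illustration under stated assumptions: each stored search is assumed to already carry a theme label, an embedding, and a timestamp, and the function name `retrieve_context`, the `half_life_days` parameter, and the exponential recency weighting are hypothetical choices rather than part of the implemented pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_context(query_vec, theme_centroids, events, now,
                     top_themes=1, top_events=2, half_life_days=30.0):
    """Pick the best-matching themes first, then rank their events by
    similarity weighted by an exponential recency decay (assumed scheme)."""
    ranked = sorted(theme_centroids,
                    key=lambda t: cosine(query_vec, theme_centroids[t]),
                    reverse=True)
    chosen = set(ranked[:top_themes])

    def score(ev):
        age_days = (now - ev["timestamp"]) / 86400.0
        recency = 0.5 ** (age_days / half_life_days)
        return cosine(query_vec, ev["embedding"]) * recency

    candidates = [ev for ev in events if ev["theme"] in chosen]
    return sorted(candidates, key=score, reverse=True)[:top_events]

# Toy data (made up): two themes, 2-d embeddings, timestamps in seconds.
DAY = 86400.0
theme_centroids = {"travel": [1.0, 0.0], "noise": [0.0, 1.0]}
events = [
    {"description": "flights to lisbon", "theme": "travel",
     "embedding": [1.0, 0.1], "timestamp": 100 * DAY},
    {"description": "hotels in rome", "theme": "travel",
     "embedding": [1.0, 0.1], "timestamp": 10 * DAY},
    {"description": "gmail login", "theme": "noise",
     "embedding": [0.0, 1.0], "timestamp": 100 * DAY},
]
context = retrieve_context([1.0, 0.0], theme_centroids, events, now=101 * DAY)
print([ev["description"] for ev in context])
```

Filtering by theme before ranking keeps the noise theme out of the context entirely, and the recency factor puts the newer of two equally similar searches first.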

To fill in the details of each step above, I tried different methods at each stage, which I explain in the sections below. Given my very tight compute and other resource constraints, I focus here on the methodology, reasoning, and trade-offs more than on the performance of the solutions. I believe the approaches show promise and can be improved with adequate resources; I aim to do this in the future.

Because my own machine was out of action, I ran the experiments on the Google Colab free GPU tier from multiple accounts and merged the code here.




# Stage -1 : Data Preprocessing



## Purpose

This module **normalizes raw user search events** (from Google or other sources) into a **lightweight, text-based representation** that can be fed into downstream models such as embedding or latent-variable models. Each raw event is a heterogeneous JSON object containing the fields `title`, `time`, and `header` (common to all events) plus other keys that vary from event to event. Instead of using an expensive method such as an LLM to summarize each search result, we rely on **simple, deterministic text processing** to produce a semantic description.

---

## Key steps

1. **Timestamp normalization**

   * Converts raw ISO-8601 strings to UTC timestamps using `parse_timestamp` or `datetime.fromisoformat`.
   * Ensures all events can be sorted or split temporally.

2. **Text normalization**

   * Removes URLs from titles (`remove_urls`), lowercases, and tokenizes text (`tokenize`).
   * Numeric-only tokens are discarded.
   * Produces a clean, consistent sequence of words describing the search.

3. **URL semantic extraction**

   * Resolves Google redirect URLs (`resolve_google_redirect`) to their target destination.
   * Extracts meaningful tokens from domain and path (`normalize_url`).
   * Captures semantic hints from URLs without fetching content.

4. **Event cleaning**

   * Combines title tokens + URL tokens, removes duplicates, and produces a **single `description` string** per event.
   * Preserves timestamp and event type for downstream use.

5. **Document-frequency filtering**

   * Computes token frequency across the corpus (`compute_document_frequency`).
   * Removes overly common tokens (`df_filter_sentence`) that are uninformative, keeping descriptions more discriminative.
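
The five steps above can be sketched with the standard library alone. The functions below are hypothetical stand-ins, not the actual code in `preprocess.prep`, which may differ. Also assumed here: the redirect target sits in the `q` query parameter, the URL is stored under a `titleUrl` key, and a token is "overly common" when it appears in more than `max_ratio` of the documents.

```python
import re
from collections import Counter
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z0-9]+")

def to_utc(raw_time):
    """Parse an ISO-8601 string (optionally ending in 'Z') into UTC."""
    dt = datetime.fromisoformat(raw_time.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc)

def text_tokens(text):
    """Strip URLs, lowercase, tokenize, and drop numeric-only tokens."""
    words = TOKEN_RE.findall(URL_RE.sub(" ", text).lower())
    return [w for w in words if not w.isdigit()]

def url_tokens(url):
    """Unwrap a Google redirect (assumed `q` parameter) and tokenize
    the domain and path of the destination, without fetching it."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q", [url])[0]
        parsed = urlparse(target)
    return text_tokens(parsed.netloc + " " + parsed.path)

def describe(event):
    """Merge title and URL tokens into one de-duplicated description."""
    tokens = text_tokens(event.get("title", ""))
    tokens += url_tokens(event.get("titleUrl", ""))
    return " ".join(dict.fromkeys(tokens))  # keeps first-seen order

def df_filter(docs, max_ratio=0.5):
    """Drop tokens that occur in more than `max_ratio` of the documents."""
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    limit = max_ratio * len(docs)
    return [" ".join(t for t in doc.split() if df[t] <= limit) for doc in docs]

# Toy events (made up) in the raw shape described above.
raw = [
    {"title": "Searched for running shoes 2024",
     "titleUrl": "https://www.google.com/url?q=https://example.com/shoes/running",
     "time": "2024-01-05T10:15:30Z"},
    {"title": "Searched for lisbon flights", "time": "2024-01-06T08:00:00Z"},
    {"title": "Searched for weather", "time": "2024-01-07T08:00:00Z"},
]
toy_events = [{"timestamp": to_utc(e["time"]), "description": describe(e)}
              for e in raw]
filtered = df_filter([e["description"] for e in toy_events])
print(filtered)
```

In this toy corpus the tokens `searched` and `for` occur in every document, so the document-frequency filter removes them and leaves only the discriminative words.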

---

## Assumptions and rationale

* **Heterogeneous raw events** can be mapped to a uniform textual representation using only metadata (title + URL).
* **Full semantic summarization is expensive**, so we rely on tokenization and URL parsing instead of LLMs.
* **Frequent/common tokens are less informative**, so DF filtering helps models focus on distinguishing words.
* Output descriptions are compact, deterministic, and suitable for **embedding models, clustering, or latent-variable models**.



In [None]:
import json
from pathlib import Path  # needed for ARTIFACTS_DIR below

from preprocess.prep import (
    clean_events,
    compute_document_frequency,
    df_filter_sentence,
)

# Create artifacts directory
ARTIFACTS_DIR = Path("artifacts")
ARTIFACTS_DIR.mkdir(exist_ok=True)

# Read the search history as a list of dicts
with open("search_history.json", "r") as f:
    search_history = json.load(f)

# Clean raw events
## Each raw event is a dict with the mandatory keys "time" (ISO-8601 str), "header" (str) and "title" (str); other keys vary from event to event.
## We clean them into a uniform format with the keys "timestamp" (datetime) and "description" (str).
## The description comes from basic cleaning of the event dict: a list of keywords with URLs, stopwords and the most common domain names removed.

events = clean_events(search_history)
events = sorted(events, key=lambda x: x["timestamp"])

# Build corpus
docs = [e["description"] for e in events]
df = compute_document_frequency(docs)

# DF filtering
cleaned_events = [
    {
        "timestamp": e["timestamp"].timestamp(),
        "description": df_filter_sentence(e["description"], df, len(docs)),
    }
    for e in events
]

# Save artifacts from preprocessing step
with open("artifacts/events.json", "w") as f:
    json.dump(events, f, default=str)

with open("artifacts/cleaned_events.json", "w") as f:
    json.dump(cleaned_events, f)


# Baseline Model

This is similar to the "dummy" model mentioned. Specifically, we do the following:

1) Embed the cleaned description of each event using a Sentence Transformer (all-MiniLM-L6).
2) Split into train/val/test in chronological order, in the ratio 70/15/15.
3) On the train set, start from the first event, and loop over all events. Find the dot product similarity of each event with the previous ones, and if it is similar to any one of the previous events, then they form a cluster. Else the event forms its own cluster.
4) Keep accumulating the events into clusters and find the running stats.
5) At the end, each cluster is considered as a "theme"

This is done mainly for comparison with the other approaches. It is obviously very superficial: it relies mostly on the quality of the sentence embeddings and has no notion of a user-profiling objective. Also, the number of clusters grows almost linearly with time, which is not how human interests usually behave.

We compare other approaches with this to see the improvement.
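
Steps 3 to 5 amount to a greedy single-pass clustering, sketched below. This is an illustration only, not the code in `BaselineUserModel`. The `threshold` value is an assumed parameter, and joining the cluster of the single most similar earlier event is one possible reading of "similar to any one of the previous events".

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def incremental_clusters(embeddings, threshold=0.8):
    """Greedy single pass: an event joins the cluster of its most similar
    earlier event if the dot product reaches `threshold`; otherwise it
    opens a new cluster."""
    labels = []
    for i, emb in enumerate(embeddings):
        best_j, best_sim = None, threshold
        for j in range(i):
            sim = dot(emb, embeddings[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is None:
            labels.append(len(set(labels)))  # next unused cluster id
        else:
            labels.append(labels[best_j])
    return labels

# Toy embeddings (made up, roughly unit length).
toy = [[1.0, 0.0], [0.9, 0.436], [0.0, 1.0], [0.1, 0.995]]
cluster_ids = incremental_clusters(toy)
print(cluster_ids)
```

Because every event that falls below the threshold opens a new cluster, the cluster count can only grow, which matches the near-linear growth noted above.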

In [None]:
from baseline.model import BaselineUserModel
from baseline.vis_helpers import visualize_theme_timeline
import pickle
from pathlib import Path

# Create and load events
model = BaselineUserModel()
model.load_events(cleaned_events)

# Train, val, test split respecting the time
train, val, test = model.time_split()

# Train the model
model.train()


In [None]:
# Visualization of the baseline themes
visualize_theme_timeline(model)

In [None]:
# Save the trained baseline model
with open(ARTIFACTS_DIR / "baseline_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Transformer model



Now we move closer to the initial idea of modelling latent intents. This approach didn't work out, as I explain below, but all the related code can be found in the `transformers` folder of this repo.

## High-level goal

This model was designed to infer **latent “threads” of user intent** from a sequence of search embeddings, while explicitly modeling **time dynamics** and **thread persistence/decay**. The intended use case is a *user intent regression* problem where intent is:

* **latent** (not directly supervised),
* **temporally structured** (intents evolve over time, but slowly),
* **multi-threaded** (multiple intents can coexist, but only at nearby times),
* and **softly assigned** (probabilistic, not hard clustering). I expected this to be realistic and easy to train because the assignments are continuous.

The model tries to reconstruct the original search embeddings using a low-dimensional, time-aware latent representation, while also producing a continuous **signal score** indicating the strength or relevance of intent at each timestep.

---

## Core assumptions (explicit)

This approach rests on the following **strong assumptions**:

1. **User intent can be represented as a fixed number `K` of global latent threads**
   Each thread corresponds to a reusable intent prototype shared across users and time.

2. **At each timestep, the user state is a convex mixture over threads**
   → enforced via a softmax over `K` latent variables.

3. **Thread assignment evolves smoothly over time**
   → captured via transformer self-attention + time embeddings.

4. **Time effects are multiplicative and stationary**
   → encoded via learnable decay/boost parameters per thread.

5. **Reconstruction of embeddings is sufficient supervision**
   → no explicit intent labels, only embedding reconstruction.

6. **Search embeddings are linear in intent space**
   → decoder is a single linear projection from thread probabilities.

These assumptions are *very strong* and, as explained later, several are violated in realistic user behavior.

---

## Workflow explanation (step-by-step)

### 1. Inputs

The model consumes:

* `embeds ∈ ℝ[B, T, D]`
  Pre-computed search/query embeddings.

* `timestamps, deltas`
  Absolute and relative time information per event.

* `padding_mask`
  For variable-length sequences.

No explicit intent labels are used.

---

### 2. Time encoding (`MultiScaleTimeEncoder`)

Time is embedded into the same `d_model` space as the transformer via multiple temporal features (e.g. absolute time, relative deltas, possibly log-scaled or periodic).

**Assumption**:
Temporal effects are *additive* and can be injected as positional bias:

```python
x = input_proj(embeds) + time_embeds
```

This assumes time affects *how* intent is expressed, not *which* intent exists — a subtle but important constraint.
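
To make "multiple temporal features" concrete, here is a minimal sketch of the kind of features such an encoder could start from. It is not the actual `MultiScaleTimeEncoder`: the choice of a log-scaled gap plus daily and weekly sine/cosine terms is an assumption. In the model, a learned projection would map features like these to `d_model` before the addition shown above.

```python
import math

DAY = 86400.0
WEEK = 7 * DAY

def time_features(timestamps):
    """Per-event features at several time scales: log-scaled gap to the
    previous event, plus sine/cosine of the position in the day and week."""
    feats = []
    prev = timestamps[0]
    for ts in timestamps:
        delta = max(ts - prev, 0.0)
        day_angle = 2 * math.pi * (ts % DAY) / DAY
        week_angle = 2 * math.pi * (ts % WEEK) / WEEK
        feats.append([
            math.log1p(delta),                       # relative, log-scaled
            math.sin(day_angle), math.cos(day_angle),    # daily periodicity
            math.sin(week_angle), math.cos(week_angle),  # weekly periodicity
        ])
        prev = ts
    return feats

# Two events one hour apart (timestamps in seconds).
feats = time_features([0.0, 3600.0])
print(len(feats), len(feats[0]))
```

The log scaling compresses the large range between gaps of a few seconds and gaps of several weeks.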

---

### 3. Transformer encoder (contextualization)

The transformer processes the sequence to produce contextualized representations:

* Captures:

  * co-occurrence patterns,
  * temporal continuity,
  * local bursts of activity.

However, the transformer is **not autoregressive** and **not causal**, meaning:

* Future queries influence earlier representations unless masked.

**Implicit assumption**:
Intent is symmetric in time within the window — which is rarely true for real user behavior.
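
For reference, the masking mentioned above is a small change. The sketch below builds a causal mask as nested lists, using the convention that `True` marks a pair that must not attend; this is the convention of boolean attention masks in PyTorch's attention layers. It is shown as a possible remedy, not as something the model described here used.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i must NOT attend to position j,
    i.e. when j lies in the future of i."""
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(3)
for row in mask:
    print(row)
```

With such a mask, the representation of an event depends only on that event and the events before it.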

---

### 4. Latent thread inference (variational)

The `thread_head` outputs:

* `mean, logvar ∈ ℝ[B, T, K]`

A Gaussian latent variable is sampled during training:

```python
z = mean + ε · exp(0.5 · logvar)
```

Then:

```python
probs = softmax(z)
```

This is a **VAE-style relaxation of discrete thread assignment**.

**Assumptions here**:

* Gaussian noise is an appropriate uncertainty model.
* Threads lie on a simplex.
* KL regularization (presumably elsewhere) meaningfully shapes the latent space.

In practice, this often leads to:

* posterior collapse,
* over-smooth thread assignments,
* entropy domination (uniform softmax).
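
The sampling step and the uniform-softmax failure mode can be reproduced in plain Python. This is a sketch of the technique, not the model code; `sample_thread_probs` and `entropy` are hypothetical helper names.

```python
import math
import random

def sample_thread_probs(mean, logvar, rng, training=True):
    """Reparameterized sample z = mean + eps * exp(0.5 * logvar),
    followed by a softmax over the K threads."""
    if training:
        z = [m + rng.gauss(0.0, 1.0) * math.exp(0.5 * lv)
             for m, lv in zip(mean, logvar)]
    else:
        z = list(mean)  # no noise at evaluation time
    peak = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; log(K) is the maximum for K threads."""
    return -sum(p * math.log(p) for p in probs if p > 0)

rng = random.Random(0)
noisy = sample_thread_probs([2.0, 0.0, 0.0, 0.0], [0.0] * 4, rng)
collapsed = sample_thread_probs([0.0] * 4, [0.0] * 4, rng, training=False)
print(entropy(collapsed), math.log(4))
```

When the means collapse to the same value, the assignment is uniform and its entropy equals `log(K)`, so no thread carries any information about the event.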

---

### 5. Thread dynamics (decay & boost)

Each thread has learnable global parameters:

* `decay_k = exp(log_decay_k)`
* `boost_k = exp(log_boost_k)`

These are intended to model:

* how quickly a thread fades over time,
* how strongly it re-activates when matched.

**Critical assumption**:

> Thread dynamics are *stationary*, *global*, and *independent of user context*.

This is almost certainly false:

* user intent decay is user-specific,
* depends on topic,
* and depends on external events.

As a result, these parameters often either:

* collapse to trivial values, or
* get ignored by the rest of the network.
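
One way to read these parameters is as a per-thread activation that fades exponentially between events and is topped up when a thread matches. The update rule below is an assumption made for illustration; the text above states only the intent of the parameters, not the exact formula used in the model.

```python
import math

def update_activation(activation, probs, delta_t, log_decay, log_boost):
    """Assumed rule: a_k <- a_k * exp(-decay_k * delta_t) + boost_k * p_k,
    with decay_k = exp(log_decay_k) and boost_k = exp(log_boost_k)."""
    updated = []
    for a, p, ld, lb in zip(activation, probs, log_decay, log_boost):
        decay = math.exp(ld)  # exponentiation keeps the rate positive
        boost = math.exp(lb)
        updated.append(a * math.exp(-decay * delta_t) + boost * p)
    return updated

# Two threads, one time unit elapsed, the new event matches thread 1 only.
state = update_activation([1.0, 1.0], [0.0, 1.0], 1.0, [0.0, 0.0], [0.0, 0.0])
print(state)
```

Because `log_decay` and `log_boost` are single global values per thread, every user and every context shares the same fade and re-activation rates, which is exactly the stationarity assumption criticized above.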

---

### 6. Reconstruction objective

Reconstruction is performed as:

```python
recon = Linear(K → D)(probs)
```

This implies:

> Search embeddings lie in the **linear span of K intent prototypes**.

This is the **most brittle assumption** in the entire model.

Modern sentence embeddings:

* are highly nonlinear,
* encode syntax, semantics, and discourse,
* are not additive mixtures of “intent vectors”.

Thus reconstruction loss encourages:

* blurry averages,
* loss of discriminative information,
* or trivial collapse to mean embeddings.
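
The collapse to a mean is easy to see once the decoder is written out. In the sketch below the bias term is omitted for clarity and the prototype vectors are made up; `decode` is a hypothetical helper, not the model's decoder.

```python
def decode(probs, prototypes):
    """Bias-free linear decoder: recon = sum_k probs[k] * prototypes[k]."""
    dim = len(prototypes[0])
    return [sum(p * proto[d] for p, proto in zip(probs, prototypes))
            for d in range(dim)]

# Three made-up intent prototypes in a 2-d embedding space.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

one_hot = decode([1.0, 0.0, 0.0], prototypes)   # recovers prototype 0
uniform = decode([1 / 3] * 3, prototypes)       # average of all prototypes
print(one_hot, uniform)
```

Since `probs` comes from a softmax, every reconstruction is a weighted average of the `K` prototypes, so a near-uniform assignment can only reproduce their mean.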

---

### 7. Signal head

A scalar `signal ∈ (0, 1)` is predicted per timestep.

Intended meaning:

* confidence,
* intent salience,
* or downstream regression signal.

But:

* it is weakly supervised (or unsupervised),
* competes with reconstruction gradients,
* and has no explicit semantic grounding.

---

## Why this approach failed (root causes)

### 1. **Intent is not a linear mixture problem**

User intent is:

* hierarchical,
* compositional,
* often discontinuous.

Forcing it into a softmax over `K` static threads destroys structure.

---

### 2. **Reconstruction is the wrong objective**

Reconstructing embeddings:

* rewards lexical similarity,
* not behavioral intent,
* encourages shortcut solutions.

The model can minimize loss without learning meaningful threads.

---

### 3. **Time modeling is too weak and too global**

* Additive time embeddings are insufficient.
* Global decay parameters ignore context.
* No explicit state transition model exists.

---

### 4. **Variational noise hurts discrete structure**

Gaussian VAEs are poorly suited for:

* categorical latent structure,
* competition between threads,
* sparse activation.

This leads to:

* entropy domination,
* posterior collapse,
* unused threads.

---

### 5. **No grounding signal**

Without:

* click data,
* task completion,
* session boundaries,
* or downstream labels,

the latent space is **unidentifiable**.

Multiple radically different thread decompositions yield the same loss.

---

## Overall 

This model is **architecturally elegant but statistically under-constrained**.

It assumes:

* linearity where none exists,
* stationarity where behavior is contextual,
* smoothness where intent is bursty,
* and reconstructability where abstraction is required.

As a result, it tends to learn soft, meaningless mixtures, collapse threads, and overfit temporal artifacts.


# Mixture of Experts + Variational AutoEncoder Model

This section walks through the **Latent Variable AE → MoE-VAE pipeline**: what it does, the assumptions it makes, why it worked "okay-ish," and where its limitations come from.

---

## High-level goal

This workflow was designed to discover **latent structure in event embeddings**, compressing high-dimensional sentence embeddings into a **smaller latent space** where **discrete or soft clusters (“themes”)** could emerge.

The intended purpose is **concept separation / theme discovery**, similar to identifying latent topics or user intent clusters, **without explicit labels**. It combines:

1. **Sentence embedding pre-processing**
   Represent each event as a dense embedding.
2. **Stage-0 Deep AutoEncoder (AE)**
   Compress embeddings while preserving most information.
3. **Stage-1 MoE-VAE**
   Learn a mixture-of-experts latent representation where each “expert” can represent a soft concept, encouraging sparsity and separation.

---

## Step-by-step workflow

### 1. Temporal splitting

* Events are **chronologically sorted and split** into train, validation, and test.
* Assumption: Temporal structure is meaningful; e.g., the model is trained only on past data relative to validation/test events.
* Effect: Avoids data leakage and ensures that latent themes reflect sequential progression.
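
A chronological split can be sketched as follows. This is a hypothetical stand-in for `split_by_time`, whose actual implementation may differ; the default ratios follow the 70/15/15 split used for the baseline.

```python
def chronological_split(events, train_ratio=0.70, val_ratio=0.15):
    """Sort by timestamp, then cut into train / val / test without shuffling,
    so every validation and test event is later than every training event."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    n_train = int(len(ordered) * train_ratio)
    n_val = int(len(ordered) * val_ratio)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

# Ten toy events with shuffled, distinct timestamps.
toy_events = [{"timestamp": t} for t in (5, 3, 9, 1, 7, 2, 8, 4, 6, 0)]
train_part, val_part, test_part = chronological_split(toy_events)
print(len(train_part), len(val_part), len(test_part))
```

The cut points are taken after sorting, so the split never mixes future events into the training portion.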

---

### 2. Embedding

* Uses **sentence-transformers** to encode the `description` of each event.
* Assumption: High-dimensional embeddings capture semantic similarity, and linear combinations of embeddings can approximate “concept mixtures.”
* Output: `(N_events, D)` embeddings ready for compression.

---

### 3. Stage-0 Deep AutoEncoder (AE)

* Compresses embeddings to a latent vector of size `LATENT_DIM = D / COMPRESSION_FACTOR`.
* Uses **deep, symmetric encoder/decoder with GELU + LayerNorm + dropout**.
* Purpose:

  * Reduce dimensionality to a manageable size for MoE-VAE.
  * Remove noise and redundancy while keeping essential semantic information.
* Assumptions:

  * Event embeddings lie on a lower-dimensional manifold.
  * Nonlinear transformations can reconstruct the original embeddings.
* Observed effect:

  * AE often learns a **smoothed, “denoised” representation** that removes minor differences.
  * Compressing embeddings makes later latent mixture modeling more stable.

---

### 4. MoE-VAE

* Core idea: Each latent vector is **modeled as a mixture of K “experts”**, where each expert is a Gaussian latent distribution.
* Steps:

  1. **Encoder MLP**: maps AE-compressed embeddings → hidden space.
  2. **Router**: computes **soft selection probabilities** over `num_experts`.
  3. **Expert parameters**: each expert has `μ` and `logvar` for its Gaussian latent.
  4. **Top-k sampling**: selects k most likely experts per embedding.
  5. **Decoder MLP**: reconstructs the AE latent embeddings from the weighted expert latent vectors.
* Loss components:

  * Reconstruction (MSE)
  * KL divergence (latent Gaussian regularization)
  * Router entropy (sparsity encouragement)
  * Load balancing (ensure all experts are used)
* Assumptions:

  * The latent structure is **compositional**, and events can belong partially to multiple latent themes.
  * Sparse selection of experts encourages **interpretable separation**.
  * Gaussian latents are sufficient to model latent variability.
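
The routing step and three of the loss terms are simple enough to write out in plain Python. The functions below are hypothetical illustrations, not the code in `latent_variables.moe_vae` or `latent_variables.losses`. In particular, the squared-deviation load-balancing penalty is one simple choice assumed here; the actual loss may use a different form.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Keep the k most probable experts and renormalize their weights."""
    probs = softmax(router_logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))

def router_entropy(probs):
    """Low entropy means the router commits to few experts."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def load_balance_penalty(batch_probs):
    """Squared deviation of mean expert usage from uniform (assumed form)."""
    n_experts = len(batch_probs[0])
    usage = [sum(row[e] for row in batch_probs) / len(batch_probs)
             for e in range(n_experts)]
    return sum((u - 1.0 / n_experts) ** 2 for u in usage)

weights = top_k_route([2.0, 0.0, 1.0, -1.0], k=2)
hard_label = max(weights, key=weights.get)  # hard theme assignment
print(sorted(weights), hard_label)
```

The dictionary returned by `top_k_route` is the soft assignment; taking its arg-max gives the hard label used for grouping events.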

---

### 5. Inference & cluster assignment

* `infer_theme_probs` produces **soft and hard assignments** to latent experts for each event.
* Observed effect:

  * **Some concept separation emerges**: clusters correspond to different semantic themes in embeddings.
  * Hard assignments allow downstream analysis (e.g., plotting, event grouping).
* Limitation:

  * Experts may still overlap; sparsity constraints cannot fully enforce semantic orthogonality.

---

## Why it worked “okay-ish”

1. **AE compression helped stabilize latent modeling**

   * Reduced dimensionality from ~768–1024 → 96, giving MoE-VAE a tractable input space.
   * Removed minor embedding noise that could confuse mixture assignments.

2. **MoE-VAE captures multi-modal latent structure**

   * Top-k routing encourages sparsity.
   * Load balancing prevents expert collapse (though only partially).
   * Soft assignments enable partial membership of events to multiple latent themes, which is realistic for semantic data.

3. **End-to-end latent space is semantically meaningful**

   * Events with similar descriptions often cluster together.
   * Themes are interpretable in aggregate (e.g., “finance events,” “technical updates”), even without supervision.

---

## Limitations

1. **Reconstruction-based supervision**

   * The model is trained only to reconstruct embeddings, not explicitly to maximize separability. Explicit guidance needs labels; a set of initial noisy labels generated by a good LLM might help here.
   * May produce **blurred / overlapping latent clusters**, limiting downstream utility.

2. **Gaussian experts + top-k routing are approximate**

   * Softmax weighting of Gaussian samples can mix distinct concepts.
   * Variance in latent space can reduce clarity of cluster boundaries.

3. **Hyperparameters matter a lot**

   * Number of experts, top-k, latent size, hidden dims, AE compression all control trade-offs between separation vs. reconstruction.
   * Small dataset + limited batch size can make sparse expert activation noisy.

4. **No temporal dynamics modeled**

   * Unlike the Transformer model described above, this pipeline ignores event timestamps beyond splitting.
   * Could miss sequential patterns or evolving themes over time.

5. **Interpretability depends on scale**

   * Larger K may overfit to small differences.
   * Smaller K may merge unrelated themes.

---

## Summary

* The pipeline works “okay-ish” because:

  1. AE reduces embedding noise.
  2. MoE-VAE captures multi-modal latent structure.
  3. Sparse routing encourages partial concept separation.

* It is limited because:

  * Only reconstructs embeddings (weak supervision).
  * Gaussian latent assumption + soft top-k may blur clusters.
  * Temporal/sequential dependencies are ignored.
  * Hyperparameter sensitivity and small dataset limit expert specialization.

Overall, it is a **reasonably strong unsupervised latent concept discovery method** that produces more interpretable clusters than the baseline, but its performance is still far from perfect. It could improve with better latent supervision, stronger regularization, or temporal modeling, but even as-is it reveals some meaningful structure in the embeddings.





In [None]:
from latent_variables import config as cfg
from latent_variables.data_split import split_by_time
from latent_variables.embedding import embed_events
from latent_variables.autoencoder import DeepAutoEncoder
from latent_variables.training import train_autoencoder, encode_all
from latent_variables.datasets import EmbeddingDataset
from latent_variables.moe_vae import MoEVAE
from latent_variables.losses import moe_vae_loss
from latent_variables.inference import infer_theme_probs

from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader
import torch

# ---- Split
train_e, val_e, test_e = split_by_time(
    cleaned_events, cfg.TRAIN_RATIO, cfg.VAL_RATIO
)

# ---- Embeddings
embedder = SentenceTransformer(cfg.EMBEDDING_MODEL, device=cfg.DEVICE)
train_emb = embed_events(train_e, embedder, cfg.EMBED_BATCH_SIZE)
val_emb   = embed_events(val_e, embedder, cfg.EMBED_BATCH_SIZE)
test_emb  = embed_events(test_e, embedder, cfg.EMBED_BATCH_SIZE)

# ---- Stage-0 : AE
latent_dim = train_emb.shape[1] // cfg.AE_COMPRESSION_FACTOR
ae = DeepAutoEncoder(
    train_emb.shape[1],
    latent_dim,
    cfg.AE_HIDDEN_MULTIPLIERS,
    cfg.AE_DROPOUT
).to(cfg.DEVICE)

opt = torch.optim.AdamW(ae.parameters(), lr=cfg.AE_LR)

train_loader = DataLoader(
    EmbeddingDataset(train_emb),
    batch_size=cfg.AE_BATCH_SIZE,
    shuffle=True
)
val_loader = DataLoader(
    EmbeddingDataset(val_emb),
    batch_size=cfg.AE_BATCH_SIZE
)

train_autoencoder(ae, opt, train_loader, val_loader, cfg.DEVICE, cfg.AE_EPOCHS)

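# Encode with an unshuffled loader so that row i of `compressed_train`
# still corresponds to event i of `train_e` (train_loader shuffles).
encode_loader = DataLoader(
    EmbeddingDataset(train_emb),
    batch_size=cfg.AE_BATCH_SIZE,
    shuffle=False
)
compressed_train = encode_all(ae, encode_loader, cfg.DEVICE)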

# ---- Stage 1 : MoE-VAE
model = MoEVAE(cfg).to(cfg.DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=cfg.MOE_LR)

loader = DataLoader(
    EmbeddingDataset(compressed_train),
    batch_size=cfg.MOE_BATCH_SIZE,
    shuffle=True
)

for epoch in range(cfg.MOE_EPOCHS):
    model.train()
    for x in loader:
        x = x.to(cfg.DEVICE)
        recon, mu, logvar, probs = model(x)
        loss, *_ = moe_vae_loss(x, recon, mu, logvar, probs, cfg)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# ---- Inference
labels = infer_theme_probs(model, compressed_train, cfg.DEVICE, cfg.EPS)


In [None]:
# Save the trained MoEVAE model
with open(ARTIFACTS_DIR / "moe_vae_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Metrics

# Further Directions

* **Resource-dependent improvements**

  * Clear gains over the baseline are expected with **more compute, larger datasets, and additional development time**. I was limited by computational resources and personal circumstances, and I plan to dedicate more time to taking this further.

* **Evaluation enhancements**

  * Explore **better metrics** for automated evaluation:

    * LLM-based scoring/judging.
    * Metrics derived from **behavioral studies**.

* **Inference pipeline**

  * Use the trained model to **group events into latent themes**.
  * Integrate a **RAG layer** for context-aware retrieval:

    * Options include **vector databases** or **graph-based RAG**.
  * This lets **user-facing LLMs** retrieve relevant context efficiently.

* **Periodic retraining & scalability**

  * Retrain the model periodically (every few months to a year) on **new user events**.
  * Merge **old and new themes** to capture evolving patterns.
  * Enables a **scalable, incremental update strategy** while maintaining efficiency.
