# Enhancing Query Recommendations Through User Behavior Analysis

This notebook re-implements the main ideas from the paper *Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion (K-LaMP)*.  
The goal is to build a lightweight entity-centric knowledge store from user search histories and browsing activities, and use it to enhance Large Language Models with personalized context. 

## Table of Contents

[1. Import libraries & Loading of the Datasets](#1-import-libraries--loading-of-the-datasets)  
[2. Setup & small text utils](#2-setup--small-text-utils)  
[3. Memory Stream Construction](#3-memory-stream-construction)  
[4. Entity Store Construction](#4-entity-store-construction)  
[5. User & Session Modeling](#5-user--session-modeling)  
[6. Pick K_entities from session history and context](#6-pick-k_entities-from-session-history-and-context)  
[7. Prompt Builder for Gemini K LaMP style](#7-prompt-builder-for-gemini-k-lamp-style)    
[8. Gemini API Setup](#8-gemini-api-setup)  
[9. User Initialization and Session Logging](#9-user-initialization-and-session-logging)  
[10. Next Query Generation](#10-next-query-generation)  
[11. Conclusions & Next Steps](#11-conclusions--next-steps)


## 1. Import libraries & Loading of the Datasets

This section imports libraries and loads the datasets used for our model.

**Libraries**

In [19]:
import os
import json, re, time
import random
from pathlib import Path
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from typing import List

import pandas as pd
import numpy as np
import spacy
import google.generativeai as genai


**Data Loading**

This section loads the three core datasets required for the project:
- POI information (`poi_info_updated.csv`) containing points of interest (POI) and related metadata.
- Descriptions (`data_descr_en_updated.csv`) with English textual descriptions of the POIs. 
- User profiles (`User Profiles_updated.csv`) with ORCID information, past queries, and personal details. 

JSON‑like columns are parsed into Python lists to preserve multi‑valued fields (e.g., ORCID keywords, previous queries, POI answers).

In [20]:
# Load CSV files
poi_df = pd.read_csv("Datasets/poi_info_updated.csv")
descr_df = pd.read_csv("Datasets/data_descr_en_updated.csv")
users_df = pd.read_csv("Datasets/User Profiles_updated.csv")

# Parse JSON array cells into Python lists
users_df["orcid__keywords"] = users_df["orcid__keywords"].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
users_df["previous_queries"] = users_df["previous_queries"].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
users_df["Personal Interest"] = users_df["Personal Interest"].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
users_df["POI_answers"] = users_df["POI_answers"].apply(lambda x: json.loads(x) if isinstance(x, str) else [])

# Display dataset shapes and first few rows
print("POI dataset shape:", poi_df.shape)
print("Descriptions dataset shape:", descr_df.shape)
display(poi_df.head())
display(users_df.head())



POI dataset shape: (78, 6)
Descriptions dataset shape: (77, 4)


Unnamed: 0,poi_id,poi_name,category_id,category_name,longitude,latitude
0,54,Basilica di Santa Anastasia,1,Chiese,16673.45.00,45.445.176.000.000.000
1,52,complesso del Duomo,1,Chiese,166142.21.00,4.544.707.660.000.000
2,70,Chiesa di San Bernardino,1,Chiese,163530.46.00,73195.54.00
3,74,Chiesa di Santa Maria in Organo,1,Chiese,729.57.00,74111.56.00
4,51,Chiesa di San Lorenzo,1,Chiese,165261.42.00,73567.17.00


Unnamed: 0,user_id,nationality,previous_queries,orcid__id,orcid__keywords,Personal Interest,Profession,POI_answers
0,u1,England,"[Best restaurants in Verona, Top-rated museums...",0000-0002-1825-0097,"[machine learning, natural language processing...","[Artificial Intelligence, technology, travel, ...",mathematics professor,"[Trattoria al Pompiere, Museo di Castelvecchio..."
1,u2,Italy,"[Best ice cream in Verona, Is the Castel San P...",0000-0001-6092-6831,"[database, data science, ethics in data manage...","[photograph, museum, history, books, reading, ...",database management professor,"[Gelateria La Romana, Funicolare di Castel San..."
2,u3,Italy,[Is there a tourist information office near th...,0000-0002-9809-1005,"[diabetes, metabolism, pancreatic beta cell fu...","[nature, hiking, pets, religion, dogs, walking]",Physician,"[IAT Verona Centro, Museo Lapidario Maffeiano,..."
3,u4,USA,[24/7 parking near the Arena for an evening sh...,,[],"[sports, nba, Movies, Computer, beer]",Student,"[Parcheggio Arena (SABA), Stadio Marcantonio B..."
4,u5,India,"[Best ice cream in Verona, Is the Castel San P...",,"[Researcher, Sport Teacher, handball and Bodyb...",[],Sport science researcher,"[Gelateria La Romana, Funicolare di Castel San..."
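The JSON parsing step above assumes every string cell holds valid JSON. As a minimal sketch of a more defensive variant, the helper below (`parse_json_list` is a hypothetical name, not part of the notebook) returns an empty list for missing values, malformed JSON, and JSON values that are not arrays.

```python
import json

def parse_json_list(cell):
    """Parse a JSON-array cell into a Python list; return [] for anything else."""
    if not isinstance(cell, str):
        return []  # NaN, None, or other non-string values
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        return []  # malformed JSON is treated as "no values"
    return value if isinstance(value, list) else []

print(parse_json_list('["machine learning", "nlp"]'))
print(parse_json_list(float("nan")))
print(parse_json_list("not json"))
```

Under this assumption, each `users_df[col].apply(lambda x: ...)` call could become `users_df[col].apply(parse_json_list)`, so a single bad cell would no longer stop the load.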


Merge the POI and description datasets on their shared IDs (`poi_id` in the POI table, `classref` in the descriptions table).

In [21]:
# Find common IDs between POI and descriptions
common_ids = set(poi_df["poi_id"]).intersection(set(descr_df["classref"]))

# Merge datasets on matching IDs
merged_df = pd.merge(
    poi_df[poi_df["poi_id"].isin(common_ids)],
    descr_df[descr_df["classref"].isin(common_ids)],
    left_on="poi_id",
    right_on="classref",
    how="inner"
)

# Keep only relevant columns
merged_df = merged_df[[
    "poi_id", "poi_name", "category_name", "descr_trad_value"
]]

# Show result
print(f"POIs with description available: {merged_df.shape[0]}")


POIs with description available: 78


In [22]:
display(merged_df)

Unnamed: 0,poi_id,poi_name,category_name,descr_trad_value
0,54,Basilica di Santa Anastasia,Chiese,The church of St. Anastasia is a fine example ...
1,52,complesso del Duomo,Chiese,"The Cathedral, which is dedicated to Santa Mar..."
2,70,Chiesa di San Bernardino,Chiese,The Church of San Bernardino is a Catholic pla...
3,74,Chiesa di Santa Maria in Organo,Chiese,"The church, near the Organo gate, already exis..."
4,51,Chiesa di San Lorenzo,Chiese,San Lorenzo is a Romanesque Roman Catholic chu...
...,...,...,...,...
73,33,Multisala Rivoli,Cinema,"A multi-screen cinema just off Piazza Bra, Riv..."
74,34,Cinema Fiume,Cinema,Part of a local network of art-house and first...
75,35,A.M.E.N,Discoteca,Set on the Torricelle hillside above the histo...
76,36,Berfi’s Club,Discoteca,"A staple of Verona’s club scene for decades, B..."


## 2. Setup & small text utils

This section sets up the storage paths for the personal knowledge base (memory stream and entity store, both saved as parquet files), loads the spaCy English model for entity recognition, and defines utility functions for:  

- extracting named entities from text,  
- loading and saving parquet files,  
- resetting the memory and entity stores (full or partial reset).  

In [23]:
# Where to store the personal knowledge base (parquet files)
DATA_DIR = Path("data_store")
DATA_DIR.mkdir(exist_ok=True)
MEM_PATH = DATA_DIR / "memory_stream.parquet"
ENT_PATH = DATA_DIR / "entity_store.parquet"

# Load English model for spaCy NER
nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[str]:
    """
    Extract named entities using spaCy NER.
    Filters out less useful types (dates, numbers, ordinals).
    Returns a list of unique entity strings in lowercase.
    """
    if not text:
        return []

    doc = nlp(text)
    entities = []

    for ent in doc.ents:
        # ent.text = the actual entity string (e.g. "Verona")
        # ent.label_ = the entity type (e.g. GPE, ORG, DATE)
        if ent.label_ not in {"DATE", "TIME", "CARDINAL", "ORDINAL"}:
            entities.append(ent.text.lower())

    # Deduplicate by converting to set, then back to list
    return list(set(entities))

def _load_parquet(path: Path) -> pd.DataFrame:
    """Load a parquet file if it exists; otherwise return an empty DataFrame."""
    if path.exists():
        return pd.read_parquet(path)
    return pd.DataFrame()

def _save_parquet(df: pd.DataFrame, path: Path):
    """Save DataFrame to parquet (creates/overwrites)."""
    df.to_parquet(path, index=False)

def reset_entity_store(full: bool = True):
    """
    Reset the entity store and, optionally, the memory stream.
    - full=True  -> reset both ENT_PATH and MEM_PATH
    - full=False -> reset only ENT_PATH (the memory stream is left untouched)
    """
    ent_path = Path(ENT_PATH)
    if ent_path.exists():
        ent_path.unlink()
        print(f"[OK] Entity store {ent_path} deleted.")
    else:
        print(f"[INFO] Entity store {ent_path} already empty.")

    # Recreate an empty entity store with the expected schema
    empty_ent = pd.DataFrame(columns=["user_id","entity","count","first_seen","last_seen"])
    _save_parquet(empty_ent, ENT_PATH)

    if full:
        mem_path = Path(MEM_PATH)
        if mem_path.exists():
            mem_path.unlink()
            print(f"[OK] Memory stream {mem_path} deleted.")
        else:
            print(f"[INFO] Memory stream {mem_path} already empty.")

        # Recreate an empty memory stream: user_id, timestamp, text, meta (JSON)
        empty_mem = pd.DataFrame(columns=["user_id","timestamp","text","meta"])
        _save_parquet(empty_mem, MEM_PATH)

    print("[DONE] Store successfully reset.")
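`extract_entities` deduplicates with `list(set(...))`, which does not guarantee a stable order between runs. The sketch below shows one order-preserving alternative on plain `(text, label)` pairs standing in for spaCy's `ent.text` and `ent.label_`; `filter_entities` and the sample pairs are illustrative assumptions, not part of the notebook.

```python
EXCLUDED_LABELS = {"DATE", "TIME", "CARDINAL", "ORDINAL"}

def filter_entities(pairs):
    """pairs: iterable of (text, label).
    Returns unique lowercase entity texts in first-seen order."""
    kept = [text.lower() for text, label in pairs if label not in EXCLUDED_LABELS]
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(kept))

sample = [("Verona", "GPE"), ("2024", "DATE"), ("Arena", "FAC"), ("verona", "GPE")]
print(filter_entities(sample))  # ['verona', 'arena']
```

A stable order matters mainly for reproducibility: the sampler in section 6 takes a `seed`, but its result also depends on the order in which candidates arrive.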

## 3. Memory Stream Construction

This section defines the functions for constructing and populating the memory stream. 
It appends user interactions (queries, POI views, ORCID keywords) as timestamped events, storing both raw text and metadata.

These records form the basis for building the entity-centric knowledge store.

In [24]:
# === Memory stream appenders ===

def append_memory(user_id: str,
                  text: str,
                  meta: dict | None = None,
                  ts: datetime | None = None):
    """Append a single event to the memory stream."""
    
    mem = _load_parquet(MEM_PATH)
    row = {
        "user_id": user_id,
        "timestamp": pd.to_datetime(ts or datetime.now(timezone.utc)),
        "text": text or "",
        "meta": json.dumps(meta or {}, ensure_ascii=False),
    }
    mem = pd.concat([mem, pd.DataFrame([row])], ignore_index=True)
    _save_parquet(mem, MEM_PATH)

# === ORCID keyword extraction and insertion ===

def extract_keywords_from_orcid(user, merged_df) -> list[str]:
    """Return keywords from the user profile (ORCID), previous queries, and POI answers."""
    
    kws = []
    orcid_keywords = user["orcid__keywords"] if "orcid__keywords" in user else []
    kws.extend([kw for kw in orcid_keywords if kw])

    # Extract keywords from previous queries
    previous_queries = user["previous_queries"] if "previous_queries" in user else []
    for query in previous_queries:
        if query:
            kws.extend(extract_entities(query))
    
    # Extract keywords from POI answers
    poi_answers = user["POI_answers"] if "POI_answers" in user else []
    for answer in poi_answers:
        if answer:
            text = merged_df.loc[merged_df["poi_name"] == answer, "descr_trad_value"]
            if text.empty:
                continue  # POI name not found in merged_df
            kws.extend(extract_entities(str(text.iloc[0])))  # robust to multiple matches

    return kws
    
def insert_orcid_keywords_to_memory(user_id, keywords) -> list[str]:
    """
    Persist ORCID keywords into the memory stream without entity extraction.

    Each keyword is saved as-is (after trimming), tagged with meta.src="orcid".
    Downstream, `rebuild_entity_store()` treats these entries as already-clean
    entities and bypasses the linker.
    """
    for term in keywords:
        if isinstance(term, str) and term.strip():
            append_memory(user_id=user_id, text=term.strip(), meta={"src": "orcid"})
    return keywords


def insert_previous_queries_and_POI_to_memory(user, merged_df) -> list[str]:
    """
    Persist full-text previous queries and viewed-POI descriptions into memory.

    For each user, this function:
    - stores every previous query as full text with meta.src="query";
    - looks up each answered/viewed POI by name in `merged_df` and stores
      its `descr_trad_value` (full text) with meta.src="poi".
    """
    stored = []

    # Previous queries (full text)
    for q in user.get("previous_queries", []):
        if isinstance(q, str) and q.strip():
            append_memory(user_id=user["user_id"], text=q.strip(), meta={"src": "query"})
            stored.append(q.strip())

    # POI answers -> use the translated/clean description field
    for ans in user.get("POI_answers", []):
        if ans:
            ser = merged_df.loc[merged_df["poi_name"] == ans, "descr_trad_value"]
            if ser.empty:
                continue
            txt = str(ser.iloc[0]).strip()  # robust to multiple matches
            if txt:
                append_memory(user_id=user["user_id"], text=txt, meta={"src": "poi"})
                stored.append(txt)

    return stored
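`append_memory` reloads and rewrites the whole parquet file for every event, which gets slower as the stream grows. One possible mitigation, sketched here under the assumption that events can be prepared in bulk, is to build all rows first and write them once. `make_event` and `make_events` are hypothetical helpers that mirror the four-field row schema used above.

```python
import json
from datetime import datetime, timezone

def make_event(user_id, text, meta=None, ts=None):
    """Build one memory-stream row (user_id, timestamp, text, meta as JSON)."""
    return {
        "user_id": user_id,
        "timestamp": ts or datetime.now(timezone.utc),
        "text": text or "",
        "meta": json.dumps(meta or {}, ensure_ascii=False),
    }

def make_events(user_id, texts, src):
    """Build rows for many texts at once, skipping blank or non-string entries."""
    return [
        make_event(user_id, t.strip(), {"src": src})
        for t in texts
        if isinstance(t, str) and t.strip()
    ]

rows = make_events("u1", ["Best restaurants in Verona", "  ", "Top-rated museums"], "query")
print(len(rows), json.loads(rows[0]["meta"])["src"])
```

The resulting list could then be appended with a single `pd.concat([mem, pd.DataFrame(rows)], ignore_index=True)` followed by one `_save_parquet` call.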


## 4. Entity Store Construction

This section rebuilds the entity store from the memory stream. 

Entities are extracted from user interactions, normalized (lowercased, spaces collapsed), and aggregated per user.  
The resulting store tracks counts and first/last occurrence timestamps, enabling long-term personalization and keeping entities up to date over time.

In [25]:
# Rebuild entity store from memory stream
def rebuild_entity_store():
    """
    Build the per-user entity store from the raw memory stream.

    This function reads the event memory (MEM_PATH), where each row contains
    a free-form text snippet and optional metadata, extracts entities per row,
    and then prepares them for aggregation into a compact entity store with
    counts and first/last seen timestamps per (user_id, entity).
    """
    mem = _load_parquet(MEM_PATH)
    if mem.empty:
        ent = pd.DataFrame(columns=["user_id", "entity", "count", "first_seen", "last_seen"])
        _save_parquet(ent, ENT_PATH)
        return

    # Normalize timestamp and text columns
    mem["timestamp"] = pd.to_datetime(mem["timestamp"], utc=True, errors="coerce")
    mem["text"] = mem["text"].astype(str)

    # Extract a lightweight source tag from the JSON `meta` (if present)
    def _get_src(m):
        try:
            d = json.loads(m) if isinstance(m, str) else (m or {})
            return d.get("src", None)
        except Exception:
            return None
    mem["src"] = mem["meta"].apply(_get_src)

    # Normalize text: lowercase, collapse spaces
    def _norm(s: str) -> str:
        return " ".join(s.lower().split())

    # Row-wise entity extraction with an ORCID-specific fast path
    def _extract_row(r) -> list[str]:
        t = r["text"].strip()
        if not t:
            return []
        if r["src"] == "orcid":
            # Trust ORCID keywords as already curated; store as a single normalized token
            return [_norm(t)]
        ents = extract_entities(t)
        return [_norm(e) for e in ents if isinstance(e, str) and e.strip()]

    mem["ents"] = mem.apply(_extract_row, axis=1)

    # flatten
    df = mem[["user_id", "timestamp", "ents"]].explode("ents", ignore_index=True)
    df = df.dropna(subset=["ents"]).rename(columns={"ents": "entity"})

    # Aggregate counts and first/last seen
    ent = (df.groupby(["user_id", "entity"], dropna=False)
             .agg(count=("entity", "size"),
                  first_seen=("timestamp", "min"),
                  last_seen=("timestamp", "max"))
             .reset_index())

    ent = (ent.sort_values(["user_id", "entity", "last_seen"])
             .drop_duplicates(["user_id", "entity"], keep="last"))

    _save_parquet(ent, ENT_PATH)


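# Inspect the current entity store (empty until rebuild_entity_store() runs on a populated memory stream)
ent = _load_parquet(ENT_PATH)
display(ent.head())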

Unnamed: 0,user_id,entity,count,first_seen,last_seen
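To make the aggregation step concrete, here is a minimal pure-Python sketch of the same idea: normalize each entity, then track a count and first/last timestamps per `(user_id, entity)` pair. `aggregate_entities` and the sample events are illustrative assumptions; the notebook itself does this with `groupby(...).agg(...)`.

```python
from datetime import datetime, timezone

def aggregate_entities(events):
    """events: iterable of (user_id, entity, timestamp).
    Returns {(user_id, entity): {"count", "first_seen", "last_seen"}}."""
    store = {}
    for user_id, entity, ts in events:
        # Same normalization as _norm: lowercase and collapse whitespace
        key = (user_id, " ".join(entity.lower().split()))
        rec = store.get(key)
        if rec is None:
            store[key] = {"count": 1, "first_seen": ts, "last_seen": ts}
        else:
            rec["count"] += 1
            rec["first_seen"] = min(rec["first_seen"], ts)
            rec["last_seen"] = max(rec["last_seen"], ts)
    return store

t1 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2025, 1, 5, tzinfo=timezone.utc)
store = aggregate_entities([("u1", "Verona", t1), ("u1", "verona ", t2), ("u1", "Arena", t2)])
print(store[("u1", "verona")]["count"])  # 2
```

Both spellings of "Verona" collapse into one entry, which is why normalization has to happen before grouping.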


## 5. User & Session Modeling


This section models user behavior through a memory stream of logged events, including queries and POI page views.

The utilities defined here enable capturing recent interactions to build a session context that complements long-term profiles for personalization.

In [26]:
# === Logging helpers for queries and pages ===

def log_query_event(current_query: str, user_id: str):
    """Append a user query into the memory stream."""
    append_memory(user_id=user_id, text=current_query.strip(), meta={"src": "query"})

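def log_page_viewed_event(poi_row: pd.Series, user_id: str):
    """
    Append a 'page view' event built from a merged_df row.
    Only the POI description is stored in the 'text' field; the POI name is
    kept in meta so that get_latest_article() can recover the page title.
    """
    text = str(poi_row["descr_trad_value"])
    meta = {"src": "poi", "poi_name": str(poi_row.get("poi_name", "") or "")}
    append_memory(user_id=user_id, text=text, meta=meta)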

## 6. Pick K_entities from session history and context

This section selects up to *k* personal entities from the current context, where context entities are extracted from the current query and page text.
  
It supports three K-LaMP strategies: **familiar** (probability ∝ past counts, unseen entities excluded), **unfamiliar** (probability ∝ 1/(count+1), unseen entities included), and **lapsed** (entities last seen more than `lapsed_days` days ago, probability ∝ past counts). All three use weighted sampling without replacement to balance relevance and novelty.
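As a sketch of what these weights mean for the first draw, write $c_e$ for the stored count of a context entity $e$ and $C$ for the set of context entities (notation introduced here for illustration only). Later draws renormalize over the entities that remain.

```latex
P_{\text{familiar}}(e) = \frac{c_e}{\sum_{e' \in C,\; c_{e'} > 0} c_{e'}} \quad (c_e > 0),
\qquad
P_{\text{unfamiliar}}(e) = \frac{1/(c_e + 1)}{\sum_{e' \in C} 1/(c_{e'} + 1)}
```

The lapsed strategy uses the familiar formula, restricted to entities whose last occurrence is older than the cutoff. For example, with counts $(3, 1, 0)$ the familiar probabilities are $(3/4, 1/4)$ over the two seen entities, while the unfamiliar weights $(1/4, 1/2, 1)$ normalize to $(1/7, 2/7, 4/7)$, so the unseen entity is the most likely first pick.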

In [27]:
# === Retrieve personal entities for the current context ===

def pick_personal_entities_k_lamp(user_id: str,
                         query: str,
                         page_text: str,
                         strategy: str = "familiar",   # "familiar" | "unfamiliar" | "lapsed"
                         k: int = 5,
                         lapsed_days: int = 14,
                         seed: int | None = None) -> List[str]:
    """
    K-LaMP selection:
      - Context entities = entities(query) ∪ entities(page)
      - Look up (count, last_seen) in user's entity store
      - Sample k entities according to strategy:
          familiar:     sample ∝ count (exclude count==0)
          unfamiliar:   sample ∝ 1/(count+1)  (include unseen with high prob)
          lapsed:       keep last_seen < now - lapsed_days, sample ∝ count
    Sampling is WITHOUT replacement. Use `seed` for reproducibility.
    """
    # context entities (order-preserving unique, canonicalized)
    ctx_raw = (extract_entities(query) or []) + (extract_entities(page_text) or [])
    ctx = []
    for e in ctx_raw:
        ce = " ".join((e or "").lower().split())
        if ce and ce not in ctx:
            ctx.append(ce)
    if not ctx:
        return []

    # user store lookup
    ent = _load_parquet(ENT_PATH)
    if ent.empty:
        ent_user = pd.DataFrame(columns=["entity","count","last_seen"])
    else:
        ent_user = ent[ent["user_id"] == user_id].copy()
        # canonicalize entity and ensure datetime
        ent_user["entity"] = ent_user["entity"].astype(str).str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
        ent_user["last_seen"] = pd.to_datetime(ent_user["last_seen"], utc=True, errors="coerce")
        # keep latest row per entity if duplicates exist
        ent_user = (ent_user.sort_values(["entity","last_seen"])
                             .drop_duplicates(subset=["entity"], keep="last"))

    ent_user.set_index("entity", inplace=True, drop=False)

    now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=lapsed_days)

    items = []  # (entity, count, last_seen)
    for e in ctx:
        if e in ent_user.index:   # check if entity exists in user's store
            row = ent_user.loc[e]
            # if multiple rows (edge case), take the last one
            if isinstance(row, pd.DataFrame):
                row = row.sort_values("last_seen").iloc[-1]
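            # `nan or 0` evaluates to nan, so check for missing counts explicitly
            cnt_val = pd.to_numeric(row.get("count", 0), errors="coerce")
            cnt = 0 if pd.isna(cnt_val) else int(cnt_val)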
            last_seen = pd.to_datetime(row.get("last_seen"), utc=True, errors="coerce")
        else:
            cnt = 0
            last_seen = None
        items.append((e, cnt, last_seen))

    # candidates + weights
    if strategy == "familiar":
        cand = [(e, c, ls) for (e, c, ls) in items if c > 0]
        weights = [float(c) for (_, c, _) in cand]  # ∝ count

    elif strategy == "unfamiliar":
        cand = items[:]  # include unseen
        weights = [1.0 / (c + 1.0) for (_, c, _) in cand]  # ∝ 1/(count+1)

    elif strategy == "lapsed":
        cand = [(e, c, ls) for (e, c, ls) in items if (ls is not None and pd.notna(ls) and ls < cutoff)]
        weights = [float(c) for (_, c, _) in cand]  # ∝ count

    else:
        raise ValueError("strategy must be 'familiar', 'unfamiliar', or 'lapsed'")

    if not cand:  # if no candidates after filtering
        return []

    if sum(weights) <= 0:
        weights = [1.0] * len(cand)  # fallback uniform

    # weighted sampling without replacement
    rng = random.Random(seed)
    chosen: List[str] = []
    cand_e = [e for (e, _, _) in cand]  # candidate entities
    cand_w = [float(w) for w in weights]  # candidate weights

    for _ in range(min(k, len(cand_e))):
        total = sum(cand_w)
        if total <= 0:
            idx = rng.randrange(len(cand_e))
        else:
            r = rng.random() * total
            acc = 0.0
            idx = len(cand_w) - 1  # fallback to the last item if float rounding leaves r > acc
            for i, w in enumerate(cand_w):
                acc += w
                if r <= acc:
                    idx = i
                    break
        chosen.append(cand_e.pop(idx))
        cand_w.pop(idx)

    return chosen
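The sampling loop at the end of `pick_personal_entities_k_lamp` can be studied in isolation. The sketch below is a standalone variant that delegates each draw to the standard library's `random.choices`. The function name and the toy counts are assumptions made for illustration, and all weights are assumed to be strictly positive.

```python
import random

def weighted_sample_without_replacement(items, weights, k, seed=None):
    """Draw up to k distinct items; each draw is proportional to the remaining weights."""
    rng = random.Random(seed)
    pool = list(items)
    w = [float(x) for x in weights]
    chosen = []
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(idx))  # remove the pick so it cannot be drawn again
        w.pop(idx)
    return chosen

counts = {"verona": 3, "arena": 1, "juliet": 0}

# familiar: only seen entities, weight = count
familiar = {e: c for e, c in counts.items() if c > 0}
print(weighted_sample_without_replacement(list(familiar), list(familiar.values()), k=2, seed=42))

# unfamiliar: all entities, weight = 1 / (count + 1)
unfamiliar_w = [1.0 / (c + 1.0) for c in counts.values()]
print(weighted_sample_without_replacement(list(counts), unfamiliar_w, k=2, seed=42))
```

Fixing `seed` makes a run repeatable for a given candidate order, which is what the prompt builders rely on when they pass `seed=42`.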

## 7. Prompt Builder for Gemini K LaMP style

This section implements utilities to build prompts for Gemini in line with the K-LaMP framework.  

It reconstructs the user’s session (queries, viewed pages) and combines it with personal entities to form structured **system** and **user** messages, ensuring that query suggestions are contextual, personalized, and aligned with long-term interests.

In [28]:
# === Prompt builder for Gemini ===

def _load_memory() -> pd.DataFrame:
    """Load memory stream with a stable schema (no 'source' column)."""
    expected = ["user_id", "timestamp", "text", "meta"]
    mem = _load_parquet(MEM_PATH)
    for c in expected:
        if c not in mem.columns:
            mem[c] = pd.Series(dtype="object")
    mem["timestamp"] = pd.to_datetime(mem["timestamp"], utc=True, errors="coerce")
    return mem[expected]

def _get_src(meta_val):
    """Extract 'src' from JSON-encoded meta field."""
    try:
        d = json.loads(meta_val) if isinstance(meta_val, str) else (meta_val or {})
        return d.get("src", None)
    except Exception:
        return None

def get_session_queries(user_id: str,
                        n: int | None = None,
                        hours: int | None = None,
                        order: str = "desc") -> list[str]:
    """Return the user's session queries (meta['src'] == 'query')."""
    mem = _load_memory()
    mem["src"] = mem["meta"].apply(_get_src)

    q = mem[(mem["user_id"] == user_id) & (mem["src"] == "query")].copy()

    if hours is not None:
        cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
        q = q[q["timestamp"] >= cutoff]

    q = q.sort_values("timestamp", ascending=(order == "asc"))
    queries = [str(t) for t in q["text"].tolist()]

    if n is not None:
        queries = queries[:n]
    return queries

def get_latest_article(user_id: str) -> tuple[str, str]:
    """Return (title, text) of the most recent POI/page event (meta['src'] == 'poi')."""
    mem = _load_memory()
    mem["src"] = mem["meta"].apply(_get_src)

    pages = mem[(mem["user_id"] == user_id) & (mem["src"] == "poi")].copy()
    if pages.empty:
        return "", ""

    r = pages.sort_values("timestamp", ascending=False).iloc[0]
    try:
        meta = json.loads(r["meta"] or "{}")
    except Exception:
        meta = {}
    title = meta.get("poi_name") or (str(r["text"]).split(":")[0][:120] if isinstance(r["text"], str) else "")
    text = str(r["text"] or "")
    return title, text
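The filtering logic in `get_session_queries` (user, source tag, optional time window, ordering, optional cap) can be checked without parquet files. The sketch below applies the same steps to a list of dicts; `recent_queries` and the sample events are hypothetical, and `now` is passed in explicitly so the result is deterministic.

```python
from datetime import datetime, timedelta, timezone

def recent_queries(events, user_id, now, hours=None, n=None):
    """events: list of dicts with user_id, timestamp, text, src.
    Returns the user's query texts, newest first."""
    rows = [e for e in events if e["user_id"] == user_id and e["src"] == "query"]
    if hours is not None:
        cutoff = now - timedelta(hours=hours)
        rows = [e for e in rows if e["timestamp"] >= cutoff]
    rows.sort(key=lambda e: e["timestamp"], reverse=True)
    texts = [e["text"] for e in rows]
    return texts if n is None else texts[:n]

now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
events = [
    {"user_id": "u1", "timestamp": now - timedelta(hours=30), "text": "Best ice cream in Verona", "src": "query"},
    {"user_id": "u1", "timestamp": now - timedelta(hours=2), "text": "Parking near the Arena", "src": "query"},
    {"user_id": "u1", "timestamp": now - timedelta(hours=1), "text": "A POI description", "src": "poi"},
    {"user_id": "u2", "timestamp": now - timedelta(hours=1), "text": "Museums in Verona", "src": "query"},
]
print(recent_queries(events, "u1", now, hours=24))  # ['Parking near the Arena']
print(recent_queries(events, "u1", now))
```

Passing `hours` is what turns the full query history into a session-like window. With `hours=None`, every stored query of the user is returned, including the historical ones loaded from the profile.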


**Prompt for K-LaMP (paper-style) + ORCID**

This section follows the K-LaMP design while also including long-term ORCID keywords. 
It composes structured **system** and **user** messages that combine:  
- the current query,  
- the recent session history,  
- the current article (title + text),  
- sampled personal entities from the knowledge store,  
- long-term ORCID keywords.  

The output is a prompt dictionary `{system, user}` to be used with the Gemini model for next-query generation.


In [29]:
def build_k_lamp_prompt_paper(user_row: dict | pd.Series,
                                     current_query: str,
                                     page_title: str,
                                     page_text: str,
                                     strategy: str = "familiar",
                                     k_entities: int = 5,
                                     personal_keywords: list[str] | None = None,
                                     n_session: int | None = None,     # None -> all queries
                                     max_article_chars: int = 1200) -> dict:
    """
    Build {system,user} messages as in K-LaMP, with 'Personal Entities' = entities
    sampled from the current context [query · page] according to `strategy`.
    """

    if isinstance(user_row, pd.Series):
        user_row = user_row.to_dict()
    user_id = user_row["user_id"]

    # System message (rules)
    system_msg = (
        "You are an AI assistant whose primary goal is to suggest a next search query, in order to help a user search and find information better on the search engine."
        " Two different queries and entities are separated by the token '|'. For example, 'Microsoft' and 'Google' would appear as 'Microsoft' | 'Google'.\n"
    )

    # Session (last N or all)
    session_list = get_session_queries(user_id=user_id, n=n_session, order="desc")
    session_str = " | ".join(session_list or [])

    # Article
    art_title = page_title or ""
    art_text = (page_text or "")[:max_article_chars]

    # Personal Entities via K-LaMP sampler (context-dependent)
    personal_ents = pick_personal_entities_k_lamp(
        user_id=user_id,
        query=current_query,
        page_text=page_text,
        strategy=strategy,
        lapsed_days=14,
        k=k_entities,
        seed=42,  # optional reproducibility
    )
    personal_str = " | ".join(personal_ents)

    # Personal Keywords (static, e.g., ORCID)
    personal_keywords_str = " | ".join(personal_keywords or [])

    # User message (payload)
    user_msg = (
        "You are going to suggest a search query that the user would search next based on the current query, the current session, the current article, and the personal entities. "
        "The explanations of the query, session, article, and personal entities are as follows:\n"
        "- The query is a specific set of phrases that the user enters into the search engine to find the information or resources related to a particular topic, question, or interest.\n"
        "- The session refers to a sequence of queries requested by the user on the search engine, within a certain period of time or with regard to the completion of a task.\n"
        "- The article refers to a specific webpage that the user clicks and reads from several search results displayed by the search engine in response to the requested query.\n"
        "- The personal entity refers to a topic, keyword, person, event, or any subject that is specifically relevant or appealing to the individual user based on their personal interests.\n"
        "- The ORCID keywords refer to self-declared research topics and academic interests from the user's ORCID profile. They provide a stable signal of long-term expertise or focus areas, complementing the personal entities extracted from recent interactions.\n\n"

        "Read the following query, session, article, and personal entities of the user as the context information, which might be helpful and relevant to suggest the next query.\n\n"
        f"Query: {current_query}\n"
        f"Session: {session_str}\n"
        f"Article Title: {art_title}\n"
        f"Article Text: {art_text}\n\n"
        f"Personal Entities: {personal_str}\n\n"
        f"ORCID Keywords: {personal_keywords_str}\n\n"
        "Based on the above query, session, article, personal entities, and ORCID keywords, please generate one next query suggestion with the "
        "rationale, in the format of\n"
        "Query Suggestion:\n"
        "Rationale:"
    )
    return {"system": system_msg, "user": user_msg}
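The prompt asks the model to answer in a `Query Suggestion:` / `Rationale:` format. A small parser for that format might look like the sketch below; `parse_suggestion` is a hypothetical helper and the sample reply is made up for illustration, not an actual model output.

```python
import re

def parse_suggestion(response_text):
    """Split a 'Query Suggestion: ... Rationale: ...' reply into its two parts."""
    match = re.search(
        r"Query Suggestion:\s*(?P<query>.*?)\s*Rationale:\s*(?P<rationale>.*)",
        response_text or "",
        flags=re.DOTALL | re.IGNORECASE,
    )
    if not match:
        # The reply did not follow the requested format
        return {"query": "", "rationale": ""}
    return {
        "query": match.group("query").strip(),
        "rationale": match.group("rationale").strip(),
    }

reply = "Query Suggestion: Roman history tours in Verona\nRationale: The session focuses on the Arena."
print(parse_suggestion(reply))
```

Returning empty strings instead of raising keeps a batch run going when a reply ignores the format; callers can then count or log the empty results.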

**Prompt for Enhanced K-LaMP with Profile Integration**

This variant extends the original K-LaMP prompt builder by integrating **user profile attributes** (profession, nationality, personal interests) together with **ORCID keywords** and context-dependent personal entities.  

The system and user messages are designed to:  
- prioritize long-term signals from ORCID and professional persona,  
- maintain session continuity,  
- use the current article as supporting context,  
- and generate personalized next-query suggestions that balance both short-term and long-term relevance.  


This enhanced formulation aims to stress-test whether integrating explicit profile metadata improves personalization beyond entity-based signals alone.

In [30]:
def build_k_lamp_prompt_enhanced(user_row: dict | pd.Series,
                                     current_query: str,
                                     page_title: str,
                                     page_text: str,
                                     strategy: str = "familiar",
                                     k_entities: int = 5,
                                     personal_keywords: list[str] | None = None,
                                     n_session: int | None = None,     # None -> all queries
                                     max_article_chars: int = 1200) -> dict:
    """
    Build {system,user} messages in the K-LaMP style, extended with profile
    attributes (personal interests, profession, nationality) and ORCID keywords.
    'Personal Entities' are sampled from the current context [query · page]
    according to `strategy`.
    """

    if isinstance(user_row, pd.Series):
        user_row = user_row.to_dict()
    user_id = user_row["user_id"]

    # System message (rules)
    system_msg = (
        "You are an AI assistant whose primary goal is to suggest a next search query, "
        "to help the user search and find information better on the search engine. "
        "Different queries and entities are separated by '|'."
    )

    # Session (last N or all)
    session_list = get_session_queries(user_id=user_id, n=n_session, order="desc")
    session_str = " | ".join(session_list or [])

    # Article
    art_title = page_title or ""
    art_text = (page_text or "")[:max_article_chars]

    # Personal Entities via K-LaMP sampler (context-dependent)
    personal_ents = pick_personal_entities_k_lamp(
        user_id=user_id,
        query=current_query,
        page_text=page_text,
        strategy=strategy,
        lapsed_days=14,
        k=k_entities,
        seed=42,  # optional reproducibility
    )
    personal_str = " | ".join(personal_ents)

    # Personal Keywords (static, e.g., ORCID)
    personal_keywords_str = " | ".join(personal_keywords or [])

    # Personal interests
    personal_interests_str = " | ".join(user_row.get("Personal Interest", []))

    # User message (payload)
    user_msg = (
        "You are going to suggest ONE next search query based on the current query, the current session, "
        "the current article, the user's personal entities, and the user's ORCID keywords.\n\n"

        "Guidance and priorities:\n"
        "- Prioritize long-term user relevance from ORCID keywords and profession/persona (≈50%).\n"
        "- Maintain session intent continuity without lexical repetition (≈25%).\n"
        "- Use the article only as supporting context (cap ≈25%).\n"
        "- Favor novelty and depth over logistics unless the session explicitly shows planning intent.\n"
        "- ORCID Keywords have higher priority than Personal Entities; use entities only as complementary, short-term signals.\n\n"
        #"- Always ensure the next query is a natural continuation of the clicked article. If the article is a venue (e.g., restaurant, gelateria, shop), the next query must logically involve details such as reviews, opening hours, similar venues, or cultural context. "
        #"- Weigh the importance of article, session, and ORCID profile dynamically.""
        #"- If the article is strongly aligned with the ORCID profile, integrate both.""
        #"- If the article is unrelated, prefer session continuity over ORCID.""
        #"- Always avoid suggestions that feel unnatural compared to the clicked article."

        "Explanations:\n"
        "- Query: the phrase the user types next.\n"
        "- Session: the recent sequence of queries tied to the same task.\n"
        "- Article: the page the user just read/clicked.\n"
        "- Personal Entities: short-term topics/entities extracted from recent user context.\n"
        "- ORCID Keywords: self-declared, long-term academic/professional interests from the user's ORCID profile; "
        "they should steer personalization beyond the current article.\n\n"

        "CONTEXT:\n"
        f"Query: {current_query}\n"
        f"Session: {session_str}\n"
        f"Article Title: {art_title}\n"
        f"Article Text: {art_text}\n\n"
        f"Personal Entities: {personal_str}\n\n"
        f"User personal interests: {personal_interests_str}\n"
        f"User profession: {user_row.get('Profession')}\n"
        f"User nationality: {user_row.get('nationality')}\n"
        f"ORCID Keywords: {personal_keywords_str}\n\n"


        "Based on the above query, session, article, user context entities, user profile keywords, and the user characteristics please generate one next query "
        "suggestion with the rationale, in the format of\n"
        "Query Suggestion:\n"
        "Rationale:"
    )
    return {"system": system_msg, "user": user_msg}
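
The prompt above asks the model to answer in a `Query Suggestion:` / `Rationale:` layout. A minimal sketch of how such a reply could be split into its two parts is shown below. The helper name `parse_suggestion` and the sample reply are hypothetical (they do not appear elsewhere in this notebook), and the model is not guaranteed to follow the requested layout, so the function falls back to treating the whole reply as the query.

```python
import re
from typing import Dict

# Matches the "Query Suggestion: ... Rationale: ..." layout requested by the prompt.
# The rationale part is optional so that partial replies are still parsed.
_REPLY_RE = re.compile(
    r"Query Suggestion:\s*(?P<query>.*?)\s*(?:Rationale:\s*(?P<rationale>.*))?$",
    flags=re.IGNORECASE | re.DOTALL,
)


def parse_suggestion(raw: str) -> Dict[str, str]:
    """Split a model reply into 'query' and 'rationale' (hypothetical helper)."""
    text = raw or ""
    match = _REPLY_RE.search(text)
    if match is None:
        # Layout not followed: keep the whole reply as the query.
        return {"query": text.strip(), "rationale": ""}
    # Strip whitespace plus markdown emphasis or quotes the model may add.
    query = match.group("query").strip(" \t\r\n*\"'")
    rationale = (match.group("rationale") or "").strip(" \t\r\n*")
    return {"query": query, "rationale": rationale}


# Made-up reply, for illustration only
reply = "Query Suggestion: opera tickets at the Arena\nRationale: Continues the evening plan."
print(parse_suggestion(reply))
```

Parsing the reply into fields makes it easier to compare suggestions across users or prompt variants than printing the raw text.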

## 8. Gemini API Setup

This section sets up access to the Gemini API.

The API key is loaded securely from environment variables, and the latest chat-style Gemini model is initialized for use in query generation.

In [31]:
# === Gemini model setup ===

# Simple sanity check for the API key
if not os.getenv("GEMINI_API_KEY"):
    print("[WARN] GEMINI_API_KEY is not set; Gemini calls will fail.")

# Use the API key from environment variable
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

# Load Gemini model (chat-style)
model = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")

## 9. User Initialization and Session Logging

**9.1 Populate memory with profiles**

This subsection initializes the user profiles (u1–u5) from the dataset and logs their ORCID keywords, previous queries, and POI descriptions into the memory stream.  

After populating the stream, the entity store is rebuilt to incorporate these signals.


In [32]:
#reset_entity_store(full=True)  # Reset both entity store and memory stream

In [33]:
# Set user_id as index for easy access
users_df = users_df.set_index("user_id", drop=False)

u1 = users_df.loc["u1"]
u2 = users_df.loc["u2"]
u3 = users_df.loc["u3"]
u4 = users_df.loc["u4"]
u5 = users_df.loc["u5"]

# Save for each user the ORCID keywords, previous queries + POI descriptions in the memory stream
for user in [u1, u2, u3, u4, u5]:
    insert_orcid_keywords_to_memory(user.user_id, user.orcid__keywords)
    insert_previous_queries_and_POI_to_memory(user, merged_df)

rebuild_entity_store()
ent = pd.read_parquet(ENT_PATH)
display(ent)



| | user_id | entity | count | first_seen | last_seen |
|---|---|---|---|---|---|
| 0 | u1 | adige | 2 | 2025-09-23 09:10:27.057797+00:00 | 2025-09-23 09:10:27.066284+00:00 |
| 1 | u1 | allied | 1 | 2025-09-23 09:10:27.057797+00:00 | 2025-09-23 09:10:27.057797+00:00 |
| 2 | u1 | amedeo mantellato | 1 | 2025-09-23 09:10:27.057797+00:00 | 2025-09-23 09:10:27.057797+00:00 |
| 3 | u1 | arena | 2 | 2025-09-23 09:10:26.985288+00:00 | 2025-09-23 09:10:27.074610+00:00 |
| 4 | u1 | austrian | 1 | 2025-09-23 09:10:27.039743+00:00 | 2025-09-23 09:10:27.039743+00:00 |
| ... | ... | ... | ... | ... | ... |
| 381 | u5 | verona villafranca airport | 1 | 2025-09-23 09:10:28.323525+00:00 | 2025-09-23 09:10:28.323525+00:00 |
| 382 | u5 | verona villafranca s.p.a. | 1 | 2025-09-23 09:10:28.323525+00:00 | 2025-09-23 09:10:28.323525+00:00 |
| 383 | u5 | veronese | 1 | 2025-09-23 09:10:28.352803+00:00 | 2025-09-23 09:10:28.352803+00:00 |
| 384 | u5 | vicenza | 1 | 2025-09-23 09:10:28.323525+00:00 | 2025-09-23 09:10:28.323525+00:00 |
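
The entity store shown above has one row per user and entity, with a mention count and first/last timestamps. A minimal sketch of how such a table could be derived from per-mention rows is given below. It assumes a hypothetical input frame with `user_id`, `entity`, and `timestamp` columns. The function name `aggregate_entities` and the toy data are illustrative, and the notebook's own `rebuild_entity_store` may work differently.

```python
import pandas as pd


def aggregate_entities(events: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-mention rows into one row per (user_id, entity)."""
    return (
        events.groupby(["user_id", "entity"], as_index=False)
        .agg(
            count=("timestamp", "size"),      # number of mentions
            first_seen=("timestamp", "min"),  # earliest mention
            last_seen=("timestamp", "max"),   # most recent mention
        )
        .sort_values(["user_id", "entity"], ignore_index=True)
    )


# Toy mentions (hypothetical data, not taken from the notebook's memory stream)
events = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2"],
        "entity": ["arena", "adige", "arena", "arena"],
        "timestamp": pd.to_datetime(
            [
                "2025-09-23 09:10:26",
                "2025-09-23 09:10:27",
                "2025-09-23 09:10:28",
                "2025-09-23 09:10:29",
            ],
            utc=True,
        ),
    }
)
store = aggregate_entities(events)
print(store)
```

Keeping `first_seen` and `last_seen` alongside the count is what lets recency-based strategies such as `lapsed` filter entities by how long ago they were last mentioned.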


**9.2 Simulate session logging**


This subsection simulates a user session by logging a new query and a POI page view for each user.

The entity store is then rebuilt, and the article context (title and description) is prepared for prompt construction.


In [34]:
# Choose a current query and a visited page
current_query = "What to do in Verona?"
poi_row = merged_df.iloc[60]  # Gelateria Savoia

# Log to memory
for user in [u1, u2, u3, u4, u5]:
    log_query_event(current_query, user.user_id)
    log_page_viewed_event(poi_row, user.user_id)

#  Rebuild entity store
rebuild_entity_store()

# Build page title/text
page_title = poi_row["poi_name"]
page_text  = f"{poi_row['poi_name']} ({poi_row['category_name']}): {poi_row['descr_trad_value']}"
display(page_title, page_text[:300])
entities_extracted = extract_entities(page_text)
display(entities_extracted)

'Gelateria Savoia'

'Gelateria Savoia (Ristorazione): An artisanal gelateria founded in 1939 in the historic center, Gelateria Savoia has been a local reference point for classic gelato ever since. Its story begins under the clock at Piazza Bra with founders Luigia Savoia and Vittorio Bonvicini, and today the shop conti'

['luigia savoia', 'gelateria savoia', 'vittorio bonvicini', 'piazza bra']

## 10. Next Query Generation

**10.1 Paper Scenario (Original K-LaMP)**

This subsection reproduces the original K-LaMP setup.  
For each sample user, it builds the **paper-style prompt**, displays the full *system* and *user* messages, invokes Gemini to generate **one next search query**, and outputs the suggested query for inspection.


In [35]:
# === Model 1: K-LaMP prompt from paper plus ORCID Keywords ===

print("*****************************************************************************")
print("=== PAPER SCENARIO: NEXT QUERY GENERATION WITH K-LAMP PROMPT ===\n")
for user in [u1, u2, u3, u4, u5]:
    print(f"\033[36m=== USER {user['user_id']} - {user['Profession']} ===\033[0m")
    # Build messages with strategy = familiar | unfamiliar | lapsed
    msgs = build_k_lamp_prompt_paper(
        user_row=user,
        current_query=current_query,
        page_title=page_title,
        page_text=page_text,
        strategy="familiar",  # "familiar" | "unfamiliar" | "lapsed"
        k_entities=5,
        personal_keywords=user["orcid__keywords"],  # from ORCID
        n_session=None,
    )

    # Print FULL messages
    print("\n=== SYSTEM ===\n")
    print(msgs["system"])

    print("=== USER ===\n")
    print(msgs["user"])

    # Call Gemini
    model = genai.GenerativeModel(
        model_name="models/gemini-1.5-pro-latest",
        system_instruction=msgs["system"]
    )
    resp = model.generate_content(msgs["user"])

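    # resp.text raises ValueError when the response carries no text part
    # (for example, a blocked candidate), so getattr alone is not enough
    try:
        out_text = resp.text or ""
    except ValueError:
        out_text = ""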
    print(f"\n\033[33mGemini next\033[0m {out_text.strip()}\n")

*****************************************************************************
=== PAPER SCENARIO: NEXT QUERY GENERATION WITH K-LAMP PROMPT ===

=== USER u1 - mathematics professor ===

=== SYSTEM ===

You are an AI assistant whose primary goal is to suggest a next search query, in order to help a user search and find information better on the search engine. Two different queries and entities are separated by the token '|'. For example,'Microsoft' and 'Google' would appear as 'Microsoft' | 'Google'.

=== USER ===

You are going to suggest a search query that the user would search next based on the current query, the current session, the current article, and the personal entities.The explanations of the query, session, article, and personal entities are as follows:
- The query is a specific set of phrases that the user enters into the search engine to find the information or resources related to a particular topic, question, or interest.
- The session refers to a sequence of que

**10.2 Enhanced Scenario (Profile-Aware K-LaMP)**

This subsection demonstrates the **enhanced K-LaMP prompt**, which integrates ORCID keywords and additional user profile attributes (profession, nationality, interests). 

For each sample user, it builds the enriched prompt, displays the full *system* and *user* messages, and invokes Gemini to generate a personalized next search query.  

Compared to the original paper scenario, this setup emphasizes **long-term profile signals** while still maintaining session continuity and contextual grounding in the current article.

In [36]:
# === Model 3: K-LaMP with prompt-weighted ORCID + personal interests ===

print("*****************************************************************************")
print("=== ORCID PROMPT ENHANCED CASE: NEXT QUERY GENERATION WITH K-LAMP PROMPT ===\n")
for user in [u1, u2, u3, u4]:
    print(f"\033[36m=== USER {user['user_id']} - {user['Profession']} ===\033[0m")
    # Build messages with strategy = familiar | unfamiliar | lapsed
    msgs = build_k_lamp_prompt_enhanced(
        user_row=user,
        current_query=current_query,
        page_title=page_title,
        page_text=page_text,
        strategy="familiar",  # "familiar" | "unfamiliar" | "lapsed"
        k_entities=5,
        personal_keywords=user["orcid__keywords"],  # from ORCID
        n_session=None,
    )

    # Print FULL messages
    print("\n=== SYSTEM ===\n")
    print(msgs["system"])

    print("\n=== USER ===\n")
    print(msgs["user"])

    # Call Gemini
    model = genai.GenerativeModel(
        model_name="models/gemini-1.5-pro-latest",
        system_instruction=msgs["system"]
    )
    resp = model.generate_content(msgs["user"])

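    # resp.text raises ValueError when the response carries no text part
    # (for example, a blocked candidate), so getattr alone is not enough
    try:
        out_text = resp.text or ""
    except ValueError:
        out_text = ""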
    print(f"\n\033[33mGemini next\033[0m {out_text.strip()}\n")

*****************************************************************************
=== ORCID PROMPT ENHANCED CASE: NEXT QUERY GENERATION WITH K-LAMP PROMPT ===

=== USER u1 - mathematics professor ===

=== SYSTEM ===

You are an AI assistant whose primary goal is to suggest a next search query, to help the user search and find information better on the search engine. Different queries and entities are separated by '|'.

=== USER ===

You are going to suggest ONE next search query based on the current query, the current session, the current article, the user's personal entities, and the user's ORCID keywords.

Guidance and priorities:
- Prioritize long-term user relevance from ORCID keywords and profession/persona (≈50%).
- Maintain session intent continuity without lexical repetition (≈25%).
- Use the article only as supporting context (cap ≈25%).
- Favor novelty and depth over logistics unless the session explicitly shows planning intent.
- ORCID Keywords have higher priority than

## 11. Conclusions & Next Steps

This notebook re-implemented the K-LaMP framework for **personalized contextual query suggestion**.  
The workflow demonstrated:  
1. Loading and merging datasets (POIs, descriptions, user profiles).  
2. Building a **memory stream** of user interactions (queries, POI views, ORCID keywords).  
3. Constructing an **entity store** to aggregate entities with counts and timestamps.  
4. Designing prompt builders (original K-LaMP vs. enhanced version with profile integration).  
5. Using the **Gemini API** to generate next-query suggestions under different scenarios.  

**Key takeaway:**  

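In the qualitative examples above, entity-centric personalization, especially when combined with **ORCID profiles** and user attributes, produced query suggestions that appear richer and more tailored to each user than those from the baseline paper-style prompt. This impression has not yet been quantified.  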

**Future directions:**  
- Develop automatic evaluation metrics to quantify personalization quality.   
- Investigate alternative retrieval strategies (familiar, unfamiliar, lapsed) and dynamic weighting schemes.  
- Assess scalability on larger and more diverse user datasets, including longer interaction histories.  
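
As one possible starting point for the first direction above, the sketch below computes a simple proxy for personalization: the fraction of a user's profile keywords that share at least one token with the suggested query. This is an illustrative assumption, not a metric used in this notebook or in the K-LaMP paper. The names `keyword_overlap` and `_tokens` are hypothetical, and lexical overlap would miss semantically related suggestions.

```python
import re
from typing import Iterable, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(suggestion: str, profile_keywords: Iterable[str]) -> float:
    """Fraction of profile keywords sharing at least one token with the suggestion."""
    keywords = [k for k in profile_keywords if k and k.strip()]
    if not keywords:
        # No profile keywords (e.g. an empty ORCID list): nothing to match against.
        return 0.0
    suggestion_tokens = _tokens(suggestion)
    hits = sum(1 for k in keywords if _tokens(k) & suggestion_tokens)
    return hits / len(keywords)


# Made-up inputs, for illustration only
score = keyword_overlap("graph theory exhibitions in Verona", ["graph theory", "topology"])
print(score)
```

A score like this could be averaged per user and per prompt variant to give a first, rough comparison between the paper-style and enhanced prompts.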


## 12. Sandbox for Next-Query Experiments

**12.1 Sandbox setup**

In [37]:
# === 12.1 Helper function  ===

from typing import Literal, Tuple

def _resolve_poi(poi_selector) -> Tuple[str, str]:
    """
    Resolve a POI from `merged_df` given either:
    - an integer row index
    - a string 'poi_name'
    Returns (page_title, page_text).
    """
    if isinstance(poi_selector, int):
        row = merged_df.iloc[poi_selector]
    else:
        # fallback: first match by name (case-insensitive)
        m = merged_df[merged_df["poi_name"].str.lower() == str(poi_selector).lower()]
        if m.empty:
            # try substring search
            # regex=False: treat the selector as literal text (names may contain "(" or ".")
            m = merged_df[merged_df["poi_name"].str.lower().str.contains(str(poi_selector).lower(), na=False, regex=False)]
        if m.empty:
            raise ValueError(f"POI '{poi_selector}' not found. Use an index (int) or exact/partial name (str).")
        row = m.iloc[0]
    title = row["poi_name"]
    text  = f"{row['poi_name']} ({row['category_name']}): {row['descr_trad_value']}"
    return title, text


def run_next_query_experiment(
    user_id: str,
    current_query: str,
    poi_selector,                             # int index or poi_name string
    strategy: Literal["familiar","unfamiliar","lapsed"] = "familiar",
    k_entities: int = 5,
    n_session: int | None = None,
    mode: Literal["enhanced","paper"] = "enhanced",
    persist: bool = False,                    # if True, log query/page into memory and rebuild ENT store
    seed: int = 42
) -> dict:
    """
    Run a single next-query experiment with minimal side effects.
    - If persist=False (default), it will NOT log anything to memory.
    - If persist=True, it logs query+page and rebuilds the entity store.
    Returns a dict with inputs, prompt, and Gemini output.
    """
    # Resolve article/page
    page_title, page_text = _resolve_poi(poi_selector)

    # If requested, persist this interaction to the memory stream
    if persist:
        log_query_event(current_query, user_id)
        # log_page_viewed_event reads only `descr_trad_value`, so a plain dict is enough
        log_page_viewed_event({"descr_trad_value": page_text}, user_id)
        rebuild_entity_store()

    # Pick the user row from users_df (already loaded earlier)
    try:
        user_row = users_df.loc[user_id]
    except KeyError as e:
        raise ValueError(f"Unknown user_id '{user_id}'. Available: {list(users_df['user_id'])}") from e

    # Prepare ORCID keywords if present
    personal_keywords = user_row.get("orcid__keywords", [])

    # Build prompt
    # Both builders take the same arguments, so select one and call it once
    builder = build_k_lamp_prompt_enhanced if mode == "enhanced" else build_k_lamp_prompt_paper
    msgs = builder(
        user_row=user_row,
        current_query=current_query,
        page_title=page_title,
        page_text=page_text,
        strategy=strategy,
        k_entities=k_entities,
        personal_keywords=personal_keywords,
        n_session=n_session,
    )

    # Call Gemini with a model instance bound to this prompt's system message
    _model = genai.GenerativeModel(
        model_name="models/gemini-1.5-pro-latest",
        system_instruction=msgs["system"],
    )

    # Generate using only the user message
    resp = _model.generate_content(msgs["user"])
    # resp.text raises ValueError when the response carries no text part
    try:
        out_text = resp.text or ""
    except ValueError:
        out_text = ""

    if mode == "enhanced":
        print(f"\033[36m=== USER {user_id} - {user_row.get('Profession')} ===\033[0m")
        print("\n=== CONTEXT ===\n")
        print(f"\033[31mCurrent query:\033[0m {current_query}")
        print(f"\033[31mPage title:\033[0m {page_title}") 
        print(f"\033[31mPage text:\033[0m {page_text[:300]}")
        print(f"\033[31mStrategy:\033[0m {strategy}")
        print(f"\033[31mPersonal keywords (ORCID):\033[0m {personal_keywords}")
        print(f"\033[31mSession queries (last {n_session or 'all'}):\033[0m {get_session_queries(user_id, n=n_session, order='desc')}")
        print(f"\033[31mPersonal entities (K-LaMP):\033[0m {pick_personal_entities_k_lamp(user_id, current_query, page_text, strategy=strategy, k=k_entities, seed=seed)}")
        print(f"\033[31mPersona interests:\033[0m {user_row.get('Personal Interest')}")
          
        print(f"\n\033[33mGemini next\033[0m {out_text.strip()}\n")
    
    elif mode == "paper":
        print(f"\033[36m=== USER {user_id}\033[0m")
        print("\n=== CONTEXT ===\n")
        print(f"\033[31mCurrent query:\033[0m {current_query}")
        print(f"\033[31mPage title:\033[0m {page_title}") 
        print(f"\033[31mPage text:\033[0m {page_text[:300]}")
        print(f"\033[31mStrategy:\033[0m {strategy}")
        print(f"\033[31mPersonal keywords (ORCID):\033[0m {personal_keywords}")
        print(f"\033[31mSession queries (last {n_session or 'all'}):\033[0m {get_session_queries(user_id, n=n_session, order='desc')}")
        print(f"\033[31mPersonal entities (K-LaMP):\033[0m {pick_personal_entities_k_lamp(user_id, current_query, page_text, strategy=strategy, k=k_entities, seed=seed)}")


          
        print(f"\n\033[33mGemini next\033[0m {out_text.strip()}\n")


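    # Return what the docstring promises: inputs, prompt, and Gemini output
    return {
        "user_id": user_id,
        "current_query": current_query,
        "page_title": page_title,
        "strategy": strategy,
        "mode": mode,
        "prompt": msgs,
        "output": out_text.strip(),
    }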


**12.2 Sandbox**

Enhanced: Prompt and Personal Interests (Model 3)

In [38]:
# Enhanced, by POI index
run_next_query_experiment(
    user_id="u4",
    current_query="What to do tonight",
    poi_selector=10,                   # row index
    strategy="familiar",
    k_entities=5,
    mode="enhanced",
    persist=False
)

=== USER u4 - Student ===

=== CONTEXT ===

Current query: What to do tonight
Page title: Piazza Bra
Page text: Piazza Bra (Monumenti): This square gives a magnificent glance on the history of the city: it is the foyer of the Arena on occasion of the opera, which draws thousand of spectators. It is the place where the people from Verona like strolling for more than two centuries – along the Liston: between 17
Strategy: familiar
Personal keywords (ORCID): []
Session queries (last all): ['What to do in Verona?', 'Where to drink something in the evening?', 'Which exhibition center hosts Vinitaly in Verona?', 'Best cinema in Verona', 'Where does the Castel San Pietro funicular depart from?', 'Pastry shop in Verona known for the traditional Nadalin', 'Historic wine bar in the center', 'Artisanal gelato a minute from Piazza Bra', 'Nightclub in Verona with two levels', 'Nightclub with a panoramic terrace overlooking the city', 'Wh

Paper (Model 1)

In [39]:
# Paper-style prompt, by POI index
run_next_query_experiment(
    user_id="u4",
    current_query="What to do tonight",
    poi_selector=10,    
    strategy="familiar",
    k_entities=5,
    mode="paper",
    persist=False
)

=== USER u4

=== CONTEXT ===

Current query: What to do tonight
Page title: Piazza Bra
Page text: Piazza Bra (Monumenti): This square gives a magnificent glance on the history of the city: it is the foyer of the Arena on occasion of the opera, which draws thousand of spectators. It is the place where the people from Verona like strolling for more than two centuries – along the Liston: between 17
Strategy: familiar
Personal keywords (ORCID): []
Session queries (last all): ['What to do in Verona?', 'Where to drink something in the evening?', 'Which exhibition center hosts Vinitaly in Verona?', 'Best cinema in Verona', 'Where does the Castel San Pietro funicular depart from?', 'Pastry shop in Verona known for the traditional Nadalin', 'Historic wine bar in the center', 'Artisanal gelato a minute from Piazza Bra', 'Nightclub in Verona with two levels', 'Nightclub with a panoramic terrace overlooking the city', 'Where to watch a