LLM rerank + explanations

Imports & paths

In [19]:
import os
import json
from pathlib import Path

import numpy as np
import pandas as pd
from openai import OpenAI

PROJECT_ROOT = Path(".").resolve()
DATA_DIR = PROJECT_ROOT / "data"

CANDIDATES_EMB_PKL_PATH = DATA_DIR / "candidates_with_embeddings.pkl"
MY_BOOKS_ENRICHED_PATH = DATA_DIR / "my_rated_books_enriched.csv"

CANDIDATES_EMB_PKL_PATH, MY_BOOKS_ENRICHED_PATH


(WindowsPath('C:/Users/brethm01/book-nlp/data/candidates_with_embeddings.pkl'),
 WindowsPath('C:/Users/brethm01/book-nlp/data/my_rated_books_enriched.csv'))

In [20]:
df_cand.head()

Unnamed: 0,title_llm,author_llm,isbn13_llm,why_match_llm,ol_work_key,ol_title,ol_author_name,ol_isbn_any,ol_first_publish_year,ol_language,ol_subjects,ol_description,text_for_embedding,embedding,sim_to_taste
0,The Elegance of the Hedgehog,Muriel Barbery,9781933372005,A philosophical novel that delves into the liv...,/works/OL13351631W,The Elegance of the Hedgehog,Muriel Barbery,0753186128,2008,eng,Apartment dwellers; class in fiction; Apartmen...,EA novel by the French professor of philosophy...,EA novel by the French professor of philosophy...,"[-0.0017430434, 0.04466726, -0.023971178, -0.0...",0.513164
1,The Brief Wondrous Life of Oscar Wao,Junot Díaz,9781594489587,A multi-layered narrative that combines histor...,/works/OL7990014W,The Brief Wondrous Life of Oscar Wao,Junot Díaz,1594483590,2007,heb,National Book Critics Circle Award Winner; awa...,Things have never been easy for Oscar. A ghett...,Things have never been easy for Oscar. A ghett...,"[0.02750994, 0.013716656, -0.068149045, 0.0361...",0.48431
2,The Unbearable Lightness of Being,Milan Kundera,9780061148520,Explores philosophical themes of love and exis...,/works/OL28766571W,Nesnesitelná lehkost bytí = The Unbearable Lig...,Milan Kundera,4087603512,1998,,,,'Nesnesitelná lehkost bytí = The Unbearable Li...,"[-0.011193863, 0.05371249, -0.048612062, -0.03...",0.531284
3,The Book Thief,Markus Zusak,9780375842207,A unique narrative perspective on life during ...,/works/OL5819456W,The Book Thief,Markus Zusak,9780399556524,1998,ger,nyt:young-adult-paperback-monthly=2022-09-04; ...,"The extraordinary, beloved novel about the abi...","The extraordinary, beloved novel about the abi...","[-0.0086158775, 0.04252238, -0.008823275, 0.02...",0.571621
4,The Night Circus,Erin Morgenstern,9780385534635,"A fantastical tale of a magical competition, r...",/works/OL16086747W,The Night Circus,Erin Morgenstern,354828549X,2011,chi,New York Times bestseller; Fiction; Magicians;...,The circus arrives without warning. No announc...,The circus arrives without warning. No announc...,"[-0.022252439, 0.032399315, -0.06338484, -0.03...",0.424309


Load candidates + your books (for profile text)

In [21]:
df_cand = pd.read_pickle(CANDIDATES_EMB_PKL_PATH)
df_my = pd.read_csv(MY_BOOKS_ENRICHED_PATH)

print("Candidates:", df_cand.shape)
print("My books:", df_my.shape)

df_cand[["ol_title", "ol_author_name", "sim_to_taste"]].head()


Candidates: (8, 15)
My books: (88, 17)


Unnamed: 0,ol_title,ol_author_name,sim_to_taste
0,The Book Thief,Markus Zusak,0.584993
1,The Kite Runner,Khaled Hosseini,0.438882
2,The Nightingale,Kristin Hannah,0.328327
3,The Shadow of the Wind,Carlos Ruiz Zafón,0.494834
4,Life of Pi,Yann Martel,0.361335


Sort by similarity and pick a top-N pool
We’ll let the LLM rerank only the most promising candidates, say top 40.

In [22]:
df_ranked = df_cand.sort_values("sim_to_taste", ascending=False).reset_index(drop=True)

TOP_N_POOL = min(40, len(df_ranked))  # can tweak
df_pool = df_ranked.head(TOP_N_POOL).copy()

df_pool[["ol_title", "ol_author_name", "sim_to_taste"]].head(10)


Unnamed: 0,ol_title,ol_author_name,sim_to_taste
0,The Book Thief,Markus Zusak,0.584993
1,The Goldfinch,Donna Tartt,0.553718
2,The Shadow of the Wind,Carlos Ruiz Zafón,0.494834
3,The Help,Kathryn Stockett,0.444609
4,The Kite Runner,Khaled Hosseini,0.438882
5,Life of Pi,Yann Martel,0.361335
6,The Immortal Life of Henrietta Lacks,Rebecca Skloot,0.337923
7,The Nightingale,Kristin Hannah,0.328327


Build / reuse a textual “taste profile”

In [23]:
df_top = df_my.sort_values("my_rating", ascending=False).head(40)

lines = []
for _, r in df_top.iterrows():
    title = r.get("title", "")
    author = r.get("author", "")
    rating = r.get("my_rating", "")
    desc = r.get("ol_description") or r.get("my_review") or ""
    desc_short = (desc[:200] + "...") if isinstance(desc, str) and len(desc) > 200 else desc
    lines.append(f"- '{title}' by {author} (I rated it {rating}/5). {desc_short}")

taste_profile_text = (
    "Here are some books I have read and how I felt about them:\n\n" +
    "\n".join(lines)
)

print(taste_profile_text[:800])


Here are some books I have read and how I felt about them:

- 'Man's Search for Meaning' by Viktor E. Frankl (I rated it 5/5). Psychiatrist Viktor Frankl's memoir has riveted generations of readers with its descriptions of life in Nazi death camps and its lessons for spiritual survival. Based on his own experience and the sto...
- 'His Dark Materials (His Dark Materials #1-3)' by Philip Pullman (I rated it 5/5). nan
- 'Lord of the Flies' by William Golding (I rated it 5/5). Lord of the Flies is a 1954 novel by Nobel Prize–winning British author William Golding. The book focuses on a group of British boys stranded on an uninhabited island and their disastrous attempt to g...
- 'The Catcher in the Rye' by J.D. Salinger (I rated it 5/5). nan
- 'A Little History of the World' by E.H. Gombrich 


OpenAI client

In [24]:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set OPENAI_API_KEY in your environment.")

client = OpenAI(api_key=api_key)
RERANK_MODEL = "gpt-4o-mini"


Prepare candidate list for the LLM (with IDs)
We’ll give the LLM a stable numeric ID for each candidate, so it can refer to them cleanly in JSON.

In [25]:
# create a stable ID column for the pool
df_pool = df_pool.reset_index(drop=True).copy()
df_pool["cand_id"] = df_pool.index.astype(int)

cols_for_llm = [
    "cand_id",
    "ol_title",
    "ol_author_name",
    "ol_first_publish_year",
    "ol_language",
    "text_for_embedding",
    "sim_to_taste",
    "why_match_llm"
]

df_pool[cols_for_llm].head()


Unnamed: 0,cand_id,ol_title,ol_author_name,ol_first_publish_year,ol_language,text_for_embedding,sim_to_taste,why_match_llm
0,0,The Book Thief,Markus Zusak,1998,ger,"The extraordinary, beloved novel about the abi...",0.584993,This novel set in Nazi Germany explores themes...
1,1,The Goldfinch,Donna Tartt,2013,eng,"""The Goldfinch is a rarity that comes along pe...",0.553718,This Pulitzer Prize-winning novel explores the...
2,2,The Shadow of the Wind,Carlos Ruiz Zafón,2009,,'The Shadow of the Wind' by Carlos Ruiz Zafón....,0.494834,A literary mystery set in post-war Barcelona t...
3,3,The Help,Kathryn Stockett,2009,eng,Three ordinary women are about to take one ext...,0.444609,A compelling story set in the 1960s American S...
4,4,The Kite Runner,Khaled Hosseini,2003,kor,"The unforgettable, heartbreaking story of the ...",0.438882,A powerful story of friendship and redemption ...


Define LLM rerank function
We’ll:

send cand_id, title, author, short summary & similarity score

ask for JSON like {"recommendations": [{"cand_id": 3, "rank": 1, "reason": "..."}]}

In [26]:
def llm_rerank_candidates(df_pool: pd.DataFrame,
                          taste_profile_text: str,
                          top_k: int = 10) -> pd.DataFrame:
    """
    Reranks a pool of candidates using an LLM.
    Returns a DataFrame with columns: cand_id, rank, reason.
    """

    # build a compact list of dicts for the prompt
    candidate_items = []
    for _, r in df_pool.iterrows():
        candidate_items.append({
            "cand_id": int(r["cand_id"]),
            "title": r.get("ol_title") or r.get("title_llm"),
            "author": r.get("ol_author_name") or r.get("author_llm"),
            "year": int(r["ol_first_publish_year"]) if not pd.isna(r.get("ol_first_publish_year")) else None,
            "language": r.get("ol_language"),
            "similarity_score": float(r.get("sim_to_taste", 0.0)),
            "summary": (r.get("text_for_embedding") or "")[:500],  # truncate for context
            "why_match_initial": r.get("why_match_llm"),
        })

    system_msg = """
You are a careful book recommendation assistant.
You receive a list of real candidate books and a description of the user's reading taste.
Your job is to pick and rank the best recommendations for this user, and explain why.
You MUST respond ONLY with valid JSON with this structure:

{
  "recommendations": [
    {
      "cand_id": <integer>,
      "rank": <1-based integer, lower is better>,
      "reason": "short explanation in 1-3 sentences"
    }
  ]
}

Do not output anything outside the JSON.
"""

    user_msg = f"""
User reading taste:
{taste_profile_text}

You are given the following candidate books (as JSON):

{json.dumps(candidate_items, indent=2)}

Instructions:
- Consider both similarity_score and the textual summaries.
- Prefer books that match the user's themes, tone, and depth.
- Try to provide some diversity (not all the same author or exact vibe).
- Avoid recommending more than one book from the same series if possible.
- Return up to {top_k} candidates.
"""

    resp = client.chat.completions.create(
        model=RERANK_MODEL,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}
        ],
        temperature=0.4,
    )

    raw = resp.choices[0].message.content.strip()

    # handle ```json fences just in case
    if raw.startswith("```"):
        raw = raw.strip("`").strip()
        if raw.lower().startswith("json"):
            raw = raw[4:].strip()

    # print(raw)  # uncomment if debugging
    data = json.loads(raw)

    recs = data.get("recommendations", [])
    df_recs = pd.DataFrame(recs)

    # basic sanity guards
    if "cand_id" not in df_recs.columns:
        raise ValueError("LLM output missing 'cand_id' field")
    if "rank" not in df_recs.columns:
        # if rank missing, create based on order
        df_recs["rank"] = range(1, len(df_recs) + 1)

    df_recs = df_recs.sort_values("rank", ascending=True).reset_index(drop=True)
    return df_recs


In [27]:
df_recs = llm_rerank_candidates(
    df_pool=df_pool,
    taste_profile_text=taste_profile_text,
    top_k=10
)

df_recs


Unnamed: 0,cand_id,rank,reason
0,0,1,This novel set in Nazi Germany explores themes...
1,4,2,A powerful story of friendship and redemption ...
2,7,3,This historical novel about two sisters in Naz...
3,1,4,This Pulitzer Prize-winning novel explores the...
4,3,5,A compelling story set in the 1960s American S...
5,5,6,A philosophical adventure about survival and f...
6,2,7,A literary mystery set in post-war Barcelona t...
7,6,8,This non-fiction book blends science and perso...


Join LLM ranking back with full metadata

In [28]:
df_final = df_recs.merge(df_pool, on="cand_id", how="left")

df_final[
    ["rank", "ol_title", "ol_author_name", "ol_first_publish_year", "sim_to_taste", "reason"]
].sort_values("rank").reset_index(drop=True)


Unnamed: 0,rank,ol_title,ol_author_name,ol_first_publish_year,sim_to_taste,reason
0,1,The Book Thief,Markus Zusak,1998,0.584993,This novel set in Nazi Germany explores themes...
1,2,The Kite Runner,Khaled Hosseini,2003,0.438882,A powerful story of friendship and redemption ...
2,3,The Nightingale,Kristin Hannah,2000,0.328327,This historical novel about two sisters in Naz...
3,4,The Goldfinch,Donna Tartt,2013,0.553718,This Pulitzer Prize-winning novel explores the...
4,5,The Help,Kathryn Stockett,2009,0.444609,A compelling story set in the 1960s American S...
5,6,Life of Pi,Yann Martel,2000,0.361335,A philosophical adventure about survival and f...
6,7,The Shadow of the Wind,Carlos Ruiz Zafón,2009,0.494834,A literary mystery set in post-war Barcelona t...
7,8,The Immortal Life of Henrietta Lacks,Rebecca Skloot,2009,0.337923,This non-fiction book blends science and perso...


In [29]:
FINAL_RECS_PATH = DATA_DIR / "final_recommendations_llm.csv"
df_final.to_csv(FINAL_RECS_PATH, index=False)
FINAL_RECS_PATH


WindowsPath('C:/Users/brethm01/book-nlp/data/final_recommendations_llm.csv')