Use your taste vector + profile to:

Ask the LLM to propose candidate books (with filters),

Validate + enrich them via Open Library,

Save them for later similarity + LLM-rerank.

Imports & paths

In [73]:
import os
import json
import time
from pathlib import Path

import numpy as np
import pandas as pd
import requests

from openai import OpenAI

PROJECT_ROOT = Path(".").resolve()
DATA_DIR = PROJECT_ROOT / "data"

ENRICHED_MY_BOOKS_PATH = DATA_DIR / "my_rated_books_enriched.csv"
TASTE_VECTOR_NPY_PATH = DATA_DIR / "taste_vector.npy"
CANDIDATES_RAW_PATH = DATA_DIR / "candidates_raw_llm.csv"
CANDIDATES_ENRICHED_PATH = DATA_DIR / "candidates_enriched.csv"

ENRICHED_MY_BOOKS_PATH, TASTE_VECTOR_NPY_PATH


(WindowsPath('C:/Users/brethm01/book-nlp/data/my_rated_books_enriched.csv'),
 WindowsPath('C:/Users/brethm01/book-nlp/data/taste_vector.npy'))

In [74]:
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

client = OpenAI(api_key=api_key)

Load your enriched books + taste vector
We’ll use your enriched file to build a taste description for the prompt.

In [75]:
df_my = pd.read_csv(ENRICHED_MY_BOOKS_PATH)
taste_vector = np.load(TASTE_VECTOR_NPY_PATH)

print("My books:", df_my.shape)
df_my[["title", "author", "my_rating"]].head()

My books: (88, 17)


                      title            author  my_rating
0             The Alchemist      Paulo Coelho          2
1           Of Mice and Men    John Steinbeck          4
2     To Kill a Mockingbird        Harper Lee          4
3   A Brief History of Time   Stephen Hawking          4
4  Man's Search for Meaning  Viktor E. Frankl          5


I don't want the LLM to recommend books I have already read, so
create a normalized "title || author" key for every book in my list and use it to filter candidates later.

In [76]:
import re
import unicodedata

def norm_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    # remove accents
    s = ''.join(c for c in unicodedata.normalize("NFD", s)
                if unicodedata.category(c) != "Mn")
    # lowercase, remove punctuation, collapse spaces
    s = re.sub(r"[^a-zA-Z0-9 ]+", " ", s.lower())
    s = re.sub(r"\s+", " ", s).strip()
    return s

# key = normalized "title || author"
df_my["read_key"] = df_my.apply(
    lambda r: f"{norm_text(r.get('title', ''))}||{norm_text(r.get('author', ''))}",
    axis=1
)
READ_KEYS = set(df_my["read_key"])
len(READ_KEYS)


88
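
To see what the normalized key absorbs and what it does not, here is a small standalone sketch. It repeats norm_text from the cell above; the helper name make_key and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re
import unicodedata

def norm_text(s) -> str:
    # Same logic as the notebook's norm_text: strip accents, lowercase,
    # replace punctuation with spaces, collapse whitespace.
    if not isinstance(s, str):
        return ""
    s = "".join(c for c in unicodedata.normalize("NFD", s)
                if unicodedata.category(c) != "Mn")
    s = re.sub(r"[^a-zA-Z0-9 ]+", " ", s.lower())
    return re.sub(r"\s+", " ", s).strip()

def make_key(title, author) -> str:
    # Hypothetical helper name; builds the same "title||author" key as above.
    return f"{norm_text(title)}||{norm_text(author)}"

# Accent, case and punctuation differences collapse to the same key
print(make_key("The Shadow of the Wind", "Carlos Ruiz Zafón"))
print(make_key("the shadow of the wind!", "Carlos Ruiz Zafon"))

# Initials written without punctuation produce a different key
print(make_key("The Catcher in the Rye", "J.D. Salinger"))
print(make_key("The Catcher in the Rye", "JD Salinger"))
```

Exact key matching will therefore miss some variants (different initials, or titles carrying a series suffix such as "His Dark Materials (His Dark Materials #1-3)"), so it is best treated as a first-pass filter.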

Build a textual “taste profile” for the LLM
We’ll describe your taste using your top-rated books.

In [77]:
# Sort by rating (and maybe by recency later if you want)
df_top = df_my.sort_values("my_rating", ascending=False).head(20)

lines = []
for _, r in df_top.iterrows():
    title = r.get("title", "")
    author = r.get("author", "")
    rating = r.get("my_rating", "")
    desc = r.get("ol_description") or r.get("my_review") or ""
    desc_short = (desc[:200] + "...") if isinstance(desc, str) and len(desc) > 200 else desc
    lines.append(
        f"- '{title}' by {author} (I rated it {rating}/5). {desc_short}"
    )

taste_profile_text = (
    "Here are some books I enjoyed and how I felt about them:\n\n" +
    "\n".join(lines)
)

print(taste_profile_text)


Here are some books I enjoyed and how I felt about them:

- 'Man's Search for Meaning' by Viktor E. Frankl (I rated it 5/5). Psychiatrist Viktor Frankl's memoir has riveted generations of readers with its descriptions of life in Nazi death camps and its lessons for spiritual survival. Based on his own experience and the sto...
- 'His Dark Materials (His Dark Materials #1-3)' by Philip Pullman (I rated it 5/5). nan
- 'Lord of the Flies' by William Golding (I rated it 5/5). Lord of the Flies is a 1954 novel by Nobel Prize–winning British author William Golding. The book focuses on a group of British boys stranded on an uninhabited island and their disastrous attempt to g...
- 'The Catcher in the Rye' by J.D. Salinger (I rated it 5/5). nan
- 'A Little History of the World' by E.H. Gombrich (I rated it 5/5). nan
- 'Outliers: The Story of Success' by Malcolm Gladwell (I rated it 5/5). In this stunning new book, Malcolm Gladwell takes us on an intellectual journey through the world of "outli
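
The "nan" entries in the output above come from missing CSV cells: pandas loads them as float NaN, and NaN is truthy in Python, so `r.get("ol_description") or r.get("my_review") or ""` stops at the NaN instead of falling through to the next option. A minimal sketch of one way to guard against this, using a hypothetical helper first_text (not part of the original notebook):

```python
def first_text(*values, max_len: int = 200) -> str:
    """Return the first value that is a non-empty string, shortened to max_len."""
    for v in values:
        # NaN and None are not strings, so they are skipped here
        if isinstance(v, str) and v.strip():
            text = v.strip()
            return text[:max_len] + "..." if len(text) > max_len else text
    return ""

nan = float("nan")
print(repr(first_text(nan, "Loved it")))   # falls through to the review text
print(repr(first_text(nan, None)))         # nothing usable, so an empty string
```

In the loop above, `desc_short = first_text(r.get("ol_description"), r.get("my_review"))` could stand in for the two lines that compute desc and desc_short.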

LLM candidate generator 
We’ll ask the model for a list of real books matching your taste + filters, in strict JSON.

In [79]:
import json
import pandas as pd

def generate_candidates_llm(
    user_profile_text: str,
    hard_filters: dict,
    k: int = 40
) -> pd.DataFrame:
    """
    Uses the new OpenAI SDK (client.chat.completions.create)
    and prompts the model to return JSON.
    Less strict about filters to avoid empty results.
    Also tries to avoid books the user has already read.
    """
    filters_txt = (
        "None (no hard filters)" if not hard_filters
        else ", ".join([f"{key}={val}" for key, val in hard_filters.items()])
    )

    # assumes df_my exists in the notebook
    already_read_titles = df_my["title"].dropna().unique().tolist()
    already_read_blob = ", ".join(already_read_titles[:50])  # limit to first 50

    system_msg = f"""
You are a book recommendation engine.
Recommend ONLY real, published books (no invented titles or authors).
Always return at least 5 candidates if possible.

You MUST NOT recommend any of the following books, because the user has already read them:
{already_read_blob}

If a book is very similar but clearly a different work, it's allowed.

You MUST respond ONLY with valid JSON.
The JSON must have a top-level key "candidates",
which is a list of objects with keys: "title", "author", "isbn13", "why_match".
Do not output anything outside the JSON.
"""

    user_msg = f"""
User reading taste:
{user_profile_text}

Hard filters (try to respect them, but do not return an empty list if uncertain):
{filters_txt}

Return up to {k} books in this exact JSON structure:
{{
  "candidates": [
    {{
      "title": "Book title",
      "author": "Author name",
      "isbn13": "optional ISBN-13 as string or null",
      "why_match": "Short explanation why this fits the user's taste and filters, including any uncertainty."
    }}
  ]
}}

Do not include any extra keys or any text outside this JSON.
If you are unsure about isbn13, set it to null.
"""

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}
        ],
        temperature=0.5,
    )

    raw_text = resp.choices[0].message.content.strip()

    # In case the model wraps output in ```json ... ``` fences
    if raw_text.startswith("```"):
        raw_text = raw_text.strip("`").strip()
        if raw_text.lower().startswith("json"):
            raw_text = raw_text[4:].strip()

    # print(raw_text)  # uncomment for debugging if JSON parsing fails

    data = json.loads(raw_text)
    candidates = data.get("candidates", [])
    return pd.DataFrame(candidates)
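
The parsing at the end of generate_candidates_llm raises if the model returns anything that is not clean JSON. Below is a minimal sketch of a more forgiving parser; the function name parse_candidates and the choice to return an empty list on failure are assumptions for illustration, not the notebook's behavior.

```python
import json
import re

def parse_candidates(raw_text: str) -> list:
    """Strip optional Markdown code fences, parse JSON, return the candidate dicts."""
    text = raw_text.strip()
    # Drop a leading fence (with optional language tag) and a trailing fence
    text = re.sub(r"^```[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # assumption: treat unparseable output as "no candidates"
    if not isinstance(data, dict):
        return []
    candidates = data.get("candidates", [])
    if not isinstance(candidates, list):
        return []
    # Keep only objects that at least carry a title
    return [c for c in candidates if isinstance(c, dict) and c.get("title")]

fenced = "```json\n" + '{"candidates": [{"title": "Life of Pi", "author": "Yann Martel"}]}' + "\n```"
print(parse_candidates(fenced))
print(parse_candidates("Sorry, I cannot help with that."))
```

The Chat Completions API also offers a JSON mode (`response_format={"type": "json_object"}`), which may make the fence handling unnecessary; that option is not used in this notebook.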


In [80]:
#hard_filters = {"author_nationality": "India"}
hard_filters = {}

df_test = generate_candidates_llm(
    user_profile_text="I like literary fiction, magical realism and social satire.",
    hard_filters=hard_filters,
    k=10
)

df_test, df_test.shape

(                                       title                 author  \
 0                 The Wind-Up Bird Chronicle        Haruki Murakami   
 1       The Brief Wondrous Life of Oscar Wao             Junot Díaz   
 2                   The Master and Margarita       Mikhail Bulgakov   
 3  The Amazing Adventures of Kavalier & Clay         Michael Chabon   
 4                           The Night Circus       Erin Morgenstern   
 5                     The Book of Chameleons  José Eduardo Agualusa   
 6                A Visit from the Goon Squad          Jennifer Egan   
 7                   The House of the Spirits         Isabel Allende   
 8              The Brief History of the Dead       Kevin Brockmeier   
 
           isbn13                                          why_match  
 0  9780099448822  This novel blends magical realism with deep li...  
 1  9781594483295  This book combines elements of magical realism...  
 2  9780143108271  A classic of magical realism, this novel criti

In [81]:
df_test.shape

(9, 4)

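Test LLM candidate generation on a small batch
You could, for example, ask only for Indian authors via hard_filters (commented out below); this run uses no hard filters and the full taste profile.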

In [82]:
#user_profile_text="I like literary fiction, magical realism and social satire."

#hard_filters = {
#    "author_nationality": "India"
#}

hard_filters = {}

df_cand_raw = generate_candidates_llm(
    user_profile_text=taste_profile_text,
    hard_filters=hard_filters,
    k=40
)

print("Raw LLM candidates:", df_cand_raw.shape)
df_cand_raw.head(10)


Raw LLM candidates: (8, 4)


                                  title             author         isbn13                                          why_match
0                        The Book Thief       Markus Zusak  9780375842207  This novel set in Nazi Germany explores themes...
1                       The Kite Runner    Khaled Hosseini  9781594631931  A powerful story of friendship and redemption ...
2                       The Nightingale     Kristin Hannah  9780399170943  This historical novel about two sisters in Naz...
3                The Shadow of the Wind  Carlos Ruiz Zafón  9780143034902  A literary mystery set in post-war Barcelona t...
4                            Life of Pi        Yann Martel  9780156027328  A philosophical adventure about survival and f...
5                         The Goldfinch        Donna Tartt  9780316055444  This Pulitzer Prize-winning novel explores the...
6  The Immortal Life of Henrietta Lacks     Rebecca Skloot  9781400052189  This non-fiction book blends science and perso...
7                              The Help   Kathryn Stockett  9780425232200  A compelling story set in the 1960s American S...


In [83]:
# Remove books I have already read 

def add_key(df):
    df = df.copy()
    df["key"] = df.apply(
        lambda r: f"{norm_text(r.get('title', ''))}||{norm_text(r.get('author', ''))}",
        axis=1
    )
    return df

df_cand_raw = add_key(df_cand_raw)

# keep only books that are NOT in your read list
df_cand_raw = df_cand_raw[~df_cand_raw["key"].isin(READ_KEYS)].reset_index(drop=True)

# drop helper column if you like
df_cand_raw = df_cand_raw.drop(columns=["key"])

df_cand_raw.head(), df_cand_raw.shape


(                    title             author         isbn13  \
 0          The Book Thief       Markus Zusak  9780375842207   
 1         The Kite Runner    Khaled Hosseini  9781594631931   
 2         The Nightingale     Kristin Hannah  9780399170943   
 3  The Shadow of the Wind  Carlos Ruiz Zafón  9780143034902   
 4              Life of Pi        Yann Martel  9780156027328   
 
                                            why_match  
 0  This novel set in Nazi Germany explores themes...  
 1  A powerful story of friendship and redemption ...  
 2  This historical novel about two sisters in Naz...  
 3  A literary mystery set in post-war Barcelona t...  
 4  A philosophical adventure about survival and f...  ,
 (8, 4))
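
The filter above only removes exact key matches, so a candidate whose author is spelled slightly differently (for example "Viktor Frankl" against "Viktor E. Frankl" in the read list) would slip through. A minimal sketch of a fuzzy check follows; the function name is_already_read and the 0.9 threshold are assumptions that would need tuning on real data.

```python
from difflib import SequenceMatcher

def is_already_read(key: str, read_keys, threshold: float = 0.9) -> bool:
    """True if key equals, or is very similar to, any key in read_keys."""
    if key in read_keys:
        return True
    return any(
        SequenceMatcher(None, key, rk).ratio() >= threshold
        for rk in read_keys
    )

# Keys in the same normalized "title||author" form used above
read_keys = {"man s search for meaning||viktor e frankl"}

print(is_already_read("man s search for meaning||viktor frankl", read_keys))
print(is_already_read("the book thief||markus zusak", read_keys))
```

With 88 read books and a few dozen candidates, the pairwise comparison is cheap; for much larger lists an exact-match pass first and a fuzzy pass on the remainder would be the more economical order.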

In [84]:
df_cand_raw.to_csv(CANDIDATES_RAW_PATH, index=False)
CANDIDATES_RAW_PATH


WindowsPath('C:/Users/brethm01/book-nlp/data/candidates_raw_llm.csv')

Open Library helpers (reused from Step 2)

In [85]:
BASE_SEARCH_URL = "https://openlibrary.org/search.json"
BASE_WORK_URL = "https://openlibrary.org"

import unicodedata
import re
from difflib import SequenceMatcher

def normalize(text):
    if not isinstance(text, str):
        return ""
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')  # remove accents
    text = re.sub(r"[^a-zA-Z0-9 ]+", "", text.lower())
    return text.strip()

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def search_open_library(title: str, author: str, max_retries: int = 3):
    """
    Improved 'exact-ish' search using title= and author=,
    then ranking candidates by similarity.
    """
    if not isinstance(title, str) or not isinstance(author, str):
        return None
    
    q_title = normalize(title)
    q_author = normalize(author)

    params = {
        "title": title,
        "author": author,
        "fields": "title,author_name,key,isbn,first_publish_year,language,subject",
        "limit": 10
    }

    for attempt in range(max_retries):
        try:
            resp = requests.get(BASE_SEARCH_URL, params=params, timeout=15)
            if resp.status_code != 200:
                time.sleep(1)
                continue
            
            data = resp.json()
            docs = data.get("docs", [])
            if not docs:
                return None
            
            best_doc = None
            best_score = -1

            for d in docs:
                cand_title = normalize(d.get("title", ""))
                cand_authors = d.get("author_name", [])
                cand_author = normalize(cand_authors[0]) if cand_authors else ""

                title_score = similarity(q_title, cand_title)
                author_score = similarity(q_author, cand_author)
                score = 0.7 * title_score + 0.3 * author_score

                if score > best_score:
                    best_score = score
                    best_doc = d
            
            return best_doc
        
        except Exception:
            time.sleep(1)
            continue
    
    return None


def get_work_details(work_key: str, max_retries: int = 3):
    if not isinstance(work_key, str):
        return None
    
    url = f"{BASE_WORK_URL}{work_key}.json"
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=15)
            if resp.status_code != 200:
                time.sleep(1)
                continue
            return resp.json()
        except Exception:
            time.sleep(1)
            continue
    return None


def extract_description(work_json):
    if not work_json:
        return None
    desc = work_json.get("description")
    if isinstance(desc, str):
        return desc.strip()
    if isinstance(desc, dict):
        return str(desc.get("value", "")).strip() or None
    return None
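
Note that search_open_library returns the best-scoring document however low its score is, so a weak match still counts as a hit. A minimal sketch of adding a cutoff, reusing the same 0.7/0.3 weighting; the function name pick_best_doc, the min_score value of 0.6 and the sample docs are assumptions for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(text) -> str:
    # Same idea as normalize() above: strip accents and punctuation, lowercase
    if not isinstance(text, str):
        return ""
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")
    return re.sub(r"[^a-zA-Z0-9 ]+", "", text.lower()).strip()

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def pick_best_doc(docs, title, author, min_score: float = 0.6):
    """Return (doc, score) for the best match, or (None, score) if it is below min_score."""
    q_title, q_author = normalize(title), normalize(author)
    best_doc, best_score = None, -1.0
    for d in docs:
        authors = d.get("author_name") or []
        cand_author = normalize(authors[0]) if authors else ""
        score = (0.7 * similarity(q_title, normalize(d.get("title", "")))
                 + 0.3 * similarity(q_author, cand_author))
        if score > best_score:
            best_doc, best_score = d, score
    if best_score < min_score:
        return None, best_score
    return best_doc, best_score

# Made-up sample documents in the shape of Open Library search results
docs = [
    {"title": "Life of Pi", "author_name": ["Yann Martel"]},
    {"title": "Life of Pi: A Study Guide", "author_name": ["Someone Else"]},
]
doc, score = pick_best_doc(docs, "Life of Pi", "Yann Martel")
print(doc["title"], round(score, 2))
```

Returning the score alongside the document also makes it possible to store it as a column and inspect borderline matches by hand.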


Validate + enrich LLM candidates
Now we cross-check the LLM’s titles against Open Library and pull metadata + descriptions.

In [86]:
def validate_and_enrich_candidates(df_cand: pd.DataFrame, polite_delay: float = 0.3) -> pd.DataFrame:
    rows = []
    for idx, row in df_cand.iterrows():
        title = row.get("title", "")
        author = row.get("author", "")
        why_match = row.get("why_match", "")
        isbn13_llm = row.get("isbn13", None)

        print(f"[{idx+1}/{len(df_cand)}] Validating: {title} — {author}")

        meta = search_open_library(title, author)
        if meta is None:
            print("   -> No Open Library match, skipping.")
            continue  # drop hallucinations / no matches

        work_key = meta.get("key")
        ol_title = meta.get("title")
        author_names = meta.get("author_name") or []
        ol_author_name = author_names[0] if author_names else None
        isbn_list = meta.get("isbn") or []
        ol_isbn_any = isbn_list[0] if isbn_list else None
        ol_year = meta.get("first_publish_year")
        languages = meta.get("language") or []
        ol_language = languages[0] if languages else None
        subjects = meta.get("subject") or []
        subjects_str = "; ".join(subjects) if subjects else None

        work_json = get_work_details(work_key) if work_key else None
        description = extract_description(work_json)

        rows.append({
            "title_llm": title,
            "author_llm": author,
            "isbn13_llm": isbn13_llm,
            "why_match_llm": why_match,
            "ol_work_key": work_key,
            "ol_title": ol_title,
            "ol_author_name": ol_author_name,
            "ol_isbn_any": ol_isbn_any,
            "ol_first_publish_year": ol_year,
            "ol_language": ol_language,
            "ol_subjects": subjects_str,
            "ol_description": description
        })

        time.sleep(polite_delay)

    df_enriched = pd.DataFrame(rows)
    return df_enriched
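
Because the function takes isbn_list[0], the ol_isbn_any column can end up holding an ISBN-10 or a shorter value from an arbitrary edition. A minimal sketch of picking a well-formed ISBN-13 instead; the helper names is_valid_isbn13 and pick_isbn13 are hypothetical and not part of the notebook.

```python
def is_valid_isbn13(value) -> bool:
    """Check length, digits, the 978/979 prefix and the ISBN-13 check digit."""
    s = str(value).replace("-", "").strip()
    if len(s) != 13 or not (s.isascii() and s.isdigit()):
        return False
    if s[:3] not in ("978", "979"):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be divisible by 10
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
    return total % 10 == 0

def pick_isbn13(isbn_list):
    """Return the first well-formed ISBN-13 in the list, or None if there is none."""
    for isbn in isbn_list or []:
        if is_valid_isbn13(isbn):
            return str(isbn).replace("-", "").strip()
    return None

print(is_valid_isbn13("9780375842207"))
print(pick_isbn13(["1439569746", "9780143034902"]))
print(pick_isbn13(["606269312"]))
```

A passing check digit only shows that the number is well formed, not that it belongs to the intended book, so the same check applied to isbn13_llm would catch malformed values but not a valid ISBN for a different title.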


Run validation on your small candidate set

In [87]:
df_cand_enriched = validate_and_enrich_candidates(df_cand_raw, polite_delay=0.3)

print("Enriched candidates:", df_cand_enriched.shape)
df_cand_enriched.head()


[1/8] Validating: The Book Thief — Markus Zusak
[2/8] Validating: The Kite Runner — Khaled Hosseini
[3/8] Validating: The Nightingale — Kristin Hannah
[4/8] Validating: The Shadow of the Wind — Carlos Ruiz Zafón
[5/8] Validating: Life of Pi — Yann Martel
[6/8] Validating: The Goldfinch — Donna Tartt
[7/8] Validating: The Immortal Life of Henrietta Lacks — Rebecca Skloot
[8/8] Validating: The Help — Kathryn Stockett
Enriched candidates: (8, 12)


Unnamed: 0,title_llm,author_llm,isbn13_llm,why_match_llm,ol_work_key,ol_title,ol_author_name,ol_isbn_any,ol_first_publish_year,ol_language,ol_subjects,ol_description
0,The Book Thief,Markus Zusak,9780375842207,This novel set in Nazi Germany explores themes...,/works/OL5819456W,The Book Thief,Markus Zusak,9780399556524,1998,ger,nyt:young-adult-paperback-monthly=2022-09-04; ...,"The extraordinary, beloved novel about the abi..."
1,The Kite Runner,Khaled Hosseini,9781594631931,A powerful story of friendship and redemption ...,/works/OL5781992W,The Kite Runner,Khaled Hosseini,9787542036346,2003,kor,New York Times bestseller; nyt:trade_fiction_p...,"The unforgettable, heartbreaking story of the ..."
2,The Nightingale,Kristin Hannah,9780399170943,This historical novel about two sisters in Naz...,/works/OL17116910W,The Nightingale,Kristin Hannah,9786555650853,2000,spa,Civilians in war; Fiction; FICTION / Contempor...,"Despite their differences, sisters Vianne and ..."
3,The Shadow of the Wind,Carlos Ruiz Zafón,9780143034902,A literary mystery set in post-war Barcelona t...,/works/OL36433603W,The Shadow of the Wind,Carlos Ruiz Zafón,1439569746,2009,,,
4,Life of Pi,Yann Martel,9780156027328,A philosophical adventure about survival and f...,/works/OL2827199W,Life of Pi,Yann Martel,606269312,2000,heb,Teenage boys; Zoo animals; Fiction; Literature...,"After the tragic sinking of a cargo ship, one ..."


In [88]:
df_cand_enriched.to_csv(CANDIDATES_ENRICHED_PATH, index=False)
CANDIDATES_ENRICHED_PATH


WindowsPath('C:/Users/brethm01/book-nlp/data/candidates_enriched.csv')

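What we have:

candidates_raw_llm.csv
→ books proposed by the LLM (possibly noisy).

candidates_enriched.csv
→ only the candidates that returned an Open Library match, with:

Open Library title, author, ISBN,

first publication year,

language, subjects,

ol_description,

and the LLM’s original why_match_llm.

Note that the ISBN and language are the first entries Open Library returns, so they may belong to a translated edition rather than the original.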