# How to ask a book a question?

## Who is this for?

Students who know basic Python and want AI to answer questions about long documents (books, PDFs, notes).

You’ll learn the core RAG pipeline and build intuition for each step.

## Learning outcomes

- Explain what RAG is and why we use it
- Load and chunk a PDF or book, embed the chunks, store and search them, then ask targeted questions
- Avoid common pitfalls (bad chunking, poor retrieval, prompt mistakes)

## What is RAG?

RAG = Retrieval-Augmented Generation. It works in three steps:

- **Retrieve:** pull the most relevant snippets ("chunks") of your book from a vector database using semantic similarity.
- **Augment:** give those snippets to an LLM as context.
- **Generate:** the LLM writes an answer grounded in the retrieved text.
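
To make the three steps concrete, here is a toy sketch that runs without any API key. It is not how the rest of this notebook retrieves text: plain word overlap stands in for semantic similarity, and `fake_llm` is a hypothetical placeholder for a real model call. All function names are made up for this example.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieve: rank chunks by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]

def augment(question: str, context: list[str]) -> str:
    """Augment: place the retrieved chunks in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

def fake_llm(prompt: str) -> str:
    """Generate: hypothetical stand-in for an LLM call; it only reports the prompt size."""
    return f"(an LLM would answer here, given a {len(prompt)}-character prompt)"

toy_chunks = [
    "Apache Flink is a framework for stream processing.",
    "Bananas are rich in potassium.",
    "Flink handles late and out-of-order events.",
]
toy_question = "What is Flink?"

top_chunks = retrieve(toy_question, toy_chunks, k=2)
toy_prompt = augment(toy_question, top_chunks)
print(toy_prompt)
print(fake_llm(toy_prompt))
```

The unrelated banana sentence never reaches the prompt, which is the whole point of the retrieval step: the model only sees text that is likely to matter for the question.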

### Why not just ask the LLM?

LLMs may hallucinate and can’t “remember” entire books.

RAG anchors answers in your exact sources (cite or include snippets).

## What will we learn here?

This notebook walks through the following steps to show how Retrieval-Augmented Generation (RAG) works:

1. Load the document (PDF/HTML/Markdown)

2. Split into overlapping chunks

3. Create embeddings (vector representation) of chunks

4. Store vectors in ChromaDB (local) or Pinecone (cloud)

5. Query: embed your question → find top‑k similar chunks

6. Prompt the LLM with those chunks + your question

7. Answer with citations/snippets

![Fig. 1: How RAG works.](https://drive.google.com/uc?export=view&id=11xecWxSs5qiUQQHeY3lpkST0YLaOrjSw)

Fig. 1: How RAG works.


## 1. Install/Upgrade Dependencies (Modern Stack)

- Installs/updates packages
- Removes RAPIDS (optional) in Google Colab to avoid pyarrow pinning
- Installs the core stack (OpenAI/LangChain/Chroma/Pinecone etc.)

Refs:
- pip user guide: https://pip.pypa.io/
- LangChain split pkgs: https://python.langchain.com/

In [None]:
# Environment setup (targets recent package versions)
# 1) Optional: remove RAPIDS if present to avoid pyarrow constraints
# !pip uninstall -y cudf-cu12 pylibcudf-cu12 || true

# 2) Upgrade build tools
!pip install -qU pip setuptools wheel

# 3) Base pins for this environment (latest compatible)
#    - requests is pinned to 2.32.5 (note: Google Colab sometimes pins 2.32.4)
!pip install -qU "requests==2.32.5" "jedi>=0.16"

# 4) Install core stack (no old pinecone-client)
#    - Use pyarrow>=21 to satisfy modern datasets
!pip install -qU \
  "pyarrow>=21" \
  "transformers>=4.45,<5" \
  "tokenizers>=0.20.1,<0.21" \
  "sentence-transformers>=3,<4" \
  "langchain>=0.3.27,<0.4" \
  "langchain-community>=0.3.7" \
  "langchain-openai>=0.2.12" \
  "datasets>=4.1.1" \
  "accelerate>=0.33" \
  einops \
  pypdf \
  pdfminer.six \
  pinecone \
  chromadb

# (Optional GPU extras)
# pip install -qU xformers bitsandbytes
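
After installing, it can help to confirm which versions actually ended up in the environment, since Colab sometimes overrides pins. The following is a small sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list is just the subset of names from the install cell that this notebook imports directly.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def installed_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    found: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

to_check = ["langchain", "langchain-community", "langchain-openai", "chromadb", "pinecone", "pypdf"]
for name, ver in installed_versions(to_check).items():
    print(f"{name:22s} {ver or 'NOT INSTALLED'}")
```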

## 2. PDF loader options (unstructured/PyPDF) and text splitting
Shows two strategies to load PDFs (Unstructured vs PyPDF) and a text splitter configuration.


*   **Unstructured loader** can parse complex PDFs but sometimes needs optional system deps.

*   **PyPDFLoader** is simpler/reliable for most text PDFs.

*   A splitter (e.g., **RecursiveCharacterTextSplitter**, used below) controls chunk size and overlap for embeddings.

See the LangChain documentation for more information: https://python.langchain.com/docs/integrations/document_loaders/



In [None]:
# PDF loaders. If Unstructured gives you a hard time, try PyPDFLoader.
# (Loaders live in langchain_community in current LangChain releases.)
from langchain_community.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader

from langchain_text_splitters import RecursiveCharacterTextSplitter
import os

In [None]:
loader = PyPDFLoader("Flink in Action.pdf")
# Other loader options:
# loader = PyPDFLoader("field-guide-to-data-science.pdf")
# loader = UnstructuredPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")

In [None]:
data = loader.load()

In [None]:
# Note: PyPDFLoader already splits the PDF into one Document per page
print(f'You have {len(data)} document(s) in your data')
# data[30] is the 31st page; the count below is for that single page
print(f'There are {len(data[30].page_content)} characters in your document')

You have 35 document(s) in your data
There are 2388 characters in your document


(Optional) Load several files into a single `data` list

In [None]:
# Use new document loader
from langchain_community.document_loaders import (
    PyPDFLoader, OnlinePDFLoader, UnstructuredPDFLoader
)

loaders = [
    PyPDFLoader("Flink in Action.pdf"),
    PyPDFLoader("field-guide-to-data-science.pdf"),
    OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf"),
    UnstructuredPDFLoader("Learning_Apache_Flink.pdf"),
    # add more files/URL...
]

data = []
for ld in loaders:
    try:
        docs = ld.load()          # every loader returns List[Document]
        data.extend(docs)         # add to data
    except Exception as e:
        print(f"[warn] skip {ld.__class__.__name__}: {e}")

print(f"Loaded {len(data)} Document chunks from {len(loaders)} loaders.")
# Optional: see the source
sources = {d.metadata.get('source') for d in data}
print("Unique sources:", len(sources))

## 3. Chunk your data up into smaller documents
* **How it works:** A text splitter's `split_documents(docs)` converts long page texts into many shorter Document chunks.

* **Why:** Each embedding then represents one focused passage rather than a whole page, which generally makes semantic search more precise.
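
A minimal sketch of what chunk size and overlap mean, assuming a plain sliding window over characters. This is a simplification, not LangChain's implementation: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks. `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window shares chunk_overlap characters with the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(sample, chunk_size=10, chunk_overlap=0))
print(chunk_text(sample, chunk_size=10, chunk_overlap=3))
```

With `chunk_overlap=3`, each chunk starts with the last three characters of the previous one, so text near a boundary appears in both neighbours instead of being cut off from its context.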

In [None]:
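# Note: PyPDFLoader has already split the PDF by page, so this splits a second time.
# This is optional; test it out on your own data.
# chunk_overlap=0 gives non-overlapping chunks; set it above 0 (e.g. 200)
# if you want neighbouring chunks to share some context.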

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [None]:
# Check an example
print(texts[1])

page_content='welcome 
Thank you for purchasing the MEAP for Flink in Action . We are excited to deliver this book as 
large-scale stream processing using Flink and Google Data Flow is fast gaining in popularity. 
Stream processing is much more than just processing records one at a time as they arrive. 
True stream processing needs support for concepts such as event time processing to ensure 
stream processing systems are just as accurate as the batch processing system. There is a 
need for one system the performs both stream a nd batch processing. Apache Flink is that 
system.  
As we started exploring Apache Flink, we discovered the subtle challenges that are 
inherent in stream processing. These challenges are intrinsic to how stream processing is 
performed. Unlike batch processing, where all data is available when processing begins, 
stream processing must be able to handle incomplete data, late arrivals, and out -of-order 
arrivals—without compromising performance or accuracy —an

In [None]:
# Check the contents
print (f'Now you have {len(texts)} documents')

Now you have 51 documents


## 4. Create embeddings (OpenAI)
Create embeddings of your documents to get ready for semantic search.

*   **Purpose:** Initializes `OpenAIEmbeddings` for vectorization.

*   **How it works:** Uses your `OPENAI_API_KEY` and an OpenAI embedding model (for example `text-embedding-3-small`, which returns 1536-dimensional vectors) to turn each chunk into a vector that supports similarity search. The cell below does not pass `model`, so the library default is used.

*   **Note:** Make sure the environment, or a `.env` file in your working directory, contains `OPENAI_API_KEY`.
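
To see what "supporting similarity" means, here is a small sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from the model, not from hand-picked numbers.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 = same direction, near 0.0 = unrelated."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings"
flink_intro = [0.9, 0.1, 0.0]
flink_windows = [0.8, 0.3, 0.1]
banana_recipe = [0.0, 0.2, 0.9]

print(round(cosine_similarity(flink_intro, flink_windows), 3))  # two Flink passages: high
print(round(cosine_similarity(flink_intro, banana_recipe), 3))  # unrelated passage: low
```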

Import Environment Vars & Embeddings

In [None]:
import os

# load OPENAI_API_KEY
try:
    from dotenv import load_dotenv
    load_dotenv()
except Exception:
    pass

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
assert OPENAI_API_KEY, "Missing OPENAI_API_KEY. Put it in environment or a .env file."

# Use OpenAI Embedding
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

print("Chroma-only setup: embeddings ready.")

Chroma-only setup: embeddings ready.


## 5. Init ChromaDB
This section does the following. See the Chroma documentation for details: https://docs.trychroma.com/docs/overview/introduction


1.   Open a local, persistent ChromaDB collection
2.   Convert the text chunks to vectors and insert them into the collection
3.   Set up a retriever
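
As a mental model for these three steps, the sketch below implements a tiny in-memory "vector store". It is a hypothetical stand-in, not Chroma: `TinyVectorStore` and `toy_embed` are invented for this example, the method names merely mirror the LangChain ones used in this notebook, and the keyword-count "embedding" replaces a real embedding model.

```python
import math
import re

def toy_embed(text: str) -> list[float]:
    """Toy 'embedding': how often each of three keywords occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in ("flink", "stream", "banana")]

class TinyVectorStore:
    """Keeps (text, vector) pairs in a list and searches them by Euclidean distance."""

    def __init__(self, embed):
        self.embed = embed  # function mapping str -> list[float]
        self.rows: list[tuple[str, list[float]]] = []

    def add_texts(self, texts: list[str]) -> None:
        """Embed each text and store it next to its vector."""
        for text in texts:
            self.rows.append((text, self.embed(text)))

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored texts whose vectors are closest to the query vector."""
        query_vec = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: math.dist(query_vec, row[1]))
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore(toy_embed)
store.add_texts([
    "Flink is a stream processing framework",
    "A stream of water",
    "Banana bread recipe",
])
print(store.similarity_search("Flink stream processing", k=2))
```

A real vector database does the same job at scale: it persists the vectors, uses an approximate nearest-neighbour index instead of sorting every row, and lets you attach IDs and metadata.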



In [None]:
CHROMA_DIR = "./chroma_store"     # Local directory where Chroma persists its data
collection_name = "llama-2-rag"   # Name of the collection

# Open the collection (it is created if it does not exist yet)
chroma = Chroma(
    collection_name=collection_name,
    embedding_function=embeddings,
    persist_directory=CHROMA_DIR,
)

print(f"Chroma collection ready: {collection_name} (persist at {CHROMA_DIR})")

Chroma collection ready: llama-2-rag (persist at ./chroma_store)


(Optional) Insert the chunks into the Chroma collection

In [None]:
# Embed the chunks and upsert them into Chroma
from typing import Iterable

def _to_strings(items: Iterable):
    out = []
    for t in items:
        if hasattr(t, "page_content"):
            out.append(t.page_content)
        else:
            out.append(str(t))
    return out

# `texts` is the list of chunks created in section 3
assert "texts" in globals(), "Variable `texts` not found. Please define it before running this cell."

text_list = _to_strings(texts)
# To set ids or metadatas manually, pass them to this call
chroma.add_texts(texts=text_list)      # metadatas=..., ids=...
# Kept for older Chroma versions; Chroma 0.4+ persists automatically,
# so this call may only emit a deprecation warning.
chroma.persist()

print(f"Upserted {len(text_list)} texts into Chroma and persisted.")

Upserted 51 texts into Chroma and persisted.


In [None]:
retriever = chroma.as_retriever(search_kwargs={"k": 5})
print("Retriever ready (Chroma).")

Retriever ready (Chroma).


## 6. (Danger/Optional) Clear Data Helper

Helpers that delete vectors from the Chroma collection: by ID, by metadata filter, or everything at once.

In [None]:
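def chroma_delete_by_ids(id_list):
    """Delete vectors by ID."""
    chroma.delete(ids=id_list)
    chroma.persist()
    print(f"Deleted {len(id_list)} ids from Chroma.")

def chroma_delete_where(where: dict):
    """Delete vectors whose metadata matches a filter."""
    # Uses the underlying chromadb collection (a private attribute of the LangChain wrapper)
    _ = chroma._collection.delete(where=where)
    chroma.persist()
    print(f"Deleted by condition: {where}")

def chroma_clear_all():
    """Delete every vector in the collection."""
    # An empty `where` filter may be rejected by recent Chroma versions,
    # so look up all stored IDs and delete by ID instead.
    all_ids = chroma.get()["ids"]
    if all_ids:
        chroma.delete(ids=all_ids)
    chroma.persist()
    print(f"Cleared all data in Chroma collection '{collection_name}'.")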


# chroma_delete_by_ids(["doc-1", "doc-2"])
# chroma_delete_where({"source": {"$eq": "note"}})
# chroma_clear_all()


## 7. Interact with ChromaDB

In [None]:
from langchain_community.vectorstores import Chroma

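# Build the vector store with ChromaDB (local; the collection name and directory can be changed)
# Note: this reuses the collection from section 5, so if you already ran the optional
# insert cell there, the same chunks may be added a second time.
docsearch = Chroma.from_texts(
    texts=[t.page_content for t in texts],
    embedding=embeddings,
    collection_name=collection_name,
    persist_directory="./chroma_store",  # local storage path
)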

In [None]:
query = "What is flink?"
docs = docsearch.similarity_search(query, k=5)
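# Returns the k chunks whose embeddings are closest to the query embedding.
# The distance metric is whatever the collection is configured with
# (Chroma's default is L2 distance, not cosine).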
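
Whether a store ranks by cosine similarity or by Euclidean (L2) distance matters less than it may seem when the embeddings have unit length, because then the squared distance equals 2 minus twice the cosine, so both give the same ordering. OpenAI's documentation describes its embeddings as normalized to length 1. The sketch below checks the equivalence on made-up toy vectors; all names and numbers are illustrative.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

q_vec = normalize([1.0, 0.2, 0.0])
toy_vectors = {
    "intro": normalize([0.9, 0.1, 0.0]),
    "windows": normalize([0.5, 0.5, 0.1]),
    "recipe": normalize([0.0, 0.2, 0.9]),
}

# Cosine similarity: higher means closer
by_cosine = sorted(toy_vectors, key=lambda name: dot(q_vec, toy_vectors[name]), reverse=True)
# Euclidean (L2) distance: lower means closer
by_l2 = sorted(toy_vectors, key=lambda name: math.dist(q_vec, toy_vectors[name]))

print("cosine ranking:", by_cosine)
print("L2 ranking:    ", by_l2)
```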

In [None]:
# Here's one of the returned documents (index 3 is the fourth result)
print(docs[3].page_content[:1000])

MEAP Edition 
Manning Early Access Program 
Flink in Action 
Version 2 
 
 
 
 
 
Copyright 2016 Manning Publications 
 
 
For more information on this and other Manning titles go to  
www.manning.com 
©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and 
other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders. 
https://forums.manning.com/forums/flink-in-action


In [None]:
print(docs)

[Document(metadata={}, page_content='https://forums.manning.com/forums/flink-in-action\n4'), Document(metadata={}, page_content='brief contents \nPART 1:  STREAM PROCESSING USING FLINK \n  1  Introducing Apache Flink \n  2  Getting started with Flink  \n  3  Batch processing using the DataSet API \n  4  Stream processing using the DataStream API \n  5  Basics of event time processing  \nPART 2:  ADVANCED STREAM PROCESSING USING FLINK \n  6  Session windows and custom windows \n  7  Using the Flink API in practice  \n  8  Using Kafka with Flink \n  9  Fault tolerance in Flink   \nPART 3:  OUT IN THE WILD  \n10  Domain-specific libraries in Flink – CEP and Streaming SQL \n11  Apache Beam and Flink \nAPPENDIXES: \nA   Setting up your local Flink environment \nB   Installing Apache Kafka \n \n©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and \nother simple mistakes. These will be cleaned up during production of the book by copyedit

## 8. Query those docs to get your answer back

In [None]:
# Import
from langchain_openai import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [None]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

Note: `load_qa_chain` is deprecated in recent LangChain releases. Migration guides for each `chain_type`:

- stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
- map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
- refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
- map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also the guides on retrieval and question answering: https://python.langchain.com/docs/how_to/#qa-with-rag
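
As a rough illustration of the "stuff" strategy: every retrieved chunk is placed into a single prompt, followed by the question. The sketch below is not LangChain's internal template; `build_stuff_prompt`, the wording of the instructions, the numbered source tags, and the `max_chars` budget are all assumptions made for this example, and the sample sources are made up.

```python
def build_stuff_prompt(question: str, chunks: list[tuple[str, str]], max_chars: int = 6000) -> str:
    """Concatenate (source, text) chunks into one prompt, followed by the question."""
    parts = []
    used = 0
    for i, (source, text) in enumerate(chunks, start=1):
        block = f"[{i}] (source: {source})\n{text.strip()}"
        if used + len(block) > max_chars:
            break  # assumption of this sketch: drop chunks that would exceed the budget
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Use only the numbered context below to answer. "
        "Cite the chunk numbers you relied on, e.g. [1].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Made-up sample data standing in for retrieved chunks
sample_chunks = [
    ("book.pdf, first hit", "Apache Flink is a stream processing system."),
    ("book.pdf, second hit", "Flink supports event time processing."),
]
print(build_stuff_prompt("What is Flink?", sample_chunks))
```

Numbering the chunks gives the model something to cite, which is one simple way to get the "answer with citations/snippets" behaviour from the pipeline overview. If the retrieved text does not fit in one prompt, the other chain types (map_reduce, refine) process chunks in several calls instead.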


In [None]:
query = "What is Flink?"
docs = docsearch.similarity_search(query)

In [None]:
chain.run(input_documents=docs, question=query)


' Flink is a stream processing system used for handling streaming data in real-time. It is commonly used in businesses to analyze and make decisions based on constantly generated events.'

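## 9. (Optional) Using Pinecone as a remote vector database
Install minimal dependencies (Pinecone v3 client + LangChain split packages). Section 1 installs the newer `pinecone` package instead; use one or the other in a given environment, not both.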

In [None]:
!pip3 install -qU "pinecone-client==3" langchain-community langchain-openai tiktoken python-dotenv

Init Pinecone (v3) — connect & (optionally) create serverless index

In [None]:
print("### Init Pinecone (v3) — connect & ensure index")

import os
from typing import List
from pinecone import Pinecone, ServerlessSpec

# --- Index configuration ---
pc_api_key = os.getenv("PINECONE_API_KEY")
assert pc_api_key, "Missing PINECONE_API_KEY in environment."

index_name_v3 = "llama-2-rag"    # index name
metric_v3 = "cosine"             # similarity metric
dimension_v3 = 1536              # must match the embedding model's output size
cloud_v3 = "aws"                 # serverless cloud provider
region_v3 = "us-east-1"          # serverless region

# Init client
pc = Pinecone(api_key=pc_api_key)

# Create if not exists (serverless)
existing_names = [it["name"] for it in pc.list_indexes()]
if index_name_v3 not in existing_names:
    print(f"Creating serverless index '{index_name_v3}' (dim={dimension_v3}, metric={metric_v3}, {cloud_v3}/{region_v3})")
    pc.create_index(
        name=index_name_v3,
        dimension=dimension_v3,
        metric=metric_v3,
        spec=ServerlessSpec(cloud=cloud_v3, region=region_v3),
    )
else:
    print(f"Index '{index_name_v3}' already exists.")

# Get an index handle (using name is fine; host can be used too)
index_v3 = pc.Index(index_name_v3)

# Show resolved host/summary
info = pc.describe_index(index_name_v3)
print("Connected to:", info.get("name"), "| host:", info.get("host"), "| metric:", info.get("metric"), "| dim:", info.get("dimension"))


### Init Pinecone (v3) — connect & ensure index
Index 'llama-2-rag' already exists.
Connected to: llama-2-rag | host: llama-2-rag-ac71174.svc.aped-4627-b74a.pinecone.io | metric: cosine | dim: 1536


Upsert texts into Pinecone (v3) — using your corpus

In [None]:
print("### Upsert — embed & write your existing corpus into Pinecone (v3)")

from typing import Iterable

# Reuse your existing embeddings if present; otherwise create one that matches dim=1536
if "embeddings" not in globals():
    from langchain_openai import OpenAIEmbeddings
    openai_key = os.getenv("OPENAI_API_KEY")
    assert openai_key, "Missing OPENAI_API_KEY; set it in environment or .env."
    # text-embedding-3-small → 1536 dims (matches your index)
    embeddings = OpenAIEmbeddings(openai_api_key=openai_key, model="text-embedding-3-small")
    print("Created OpenAIEmbeddings(model='text-embedding-3-small').")

# Ensure your corpus exists
assert "texts" in globals(), "Variable `texts` not found. Please define your corpus upstream."

def to_str_list(items: Iterable) -> list:
    out = []
    for x in items:
        out.append(x.page_content if hasattr(x, "page_content") else str(x))
    return out

payloads: list = to_str_list(texts)

# Embed and upsert
vectors_upsert = []
# Use embed_documents for batch embedding
doc_vectors: List[List[float]] = embeddings.embed_documents(payloads)
for i, (txt, vec) in enumerate(zip(payloads, doc_vectors)):
    vectors_upsert.append({"id": f"pc-doc-{i}", "values": vec, "metadata": {"text": txt}})

resp = index_v3.upsert(vectors=vectors_upsert)
print("Upserted:", len(vectors_upsert), "vectors.", "Result:", getattr(resp, "upserted_count", "ok"))

### Upsert — embed & write your existing corpus into Pinecone (v3)
Upserted: 51 vectors. Result: 51
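
The cell above sends all 51 vectors in one request, which is fine at this size. For a larger corpus you would typically validate the vectors and upload them in batches. The sketch below shows one way to do that with plain Python; `check_dimension` and `batched` are hypothetical helpers, the batch size is an arbitrary choice (check Pinecone's current request limits), and the fake records only imitate the shape of the dicts built in the upsert cell.

```python
from typing import Iterator

def check_dimension(records: list[dict], expected_dim: int) -> None:
    """Raise ValueError if any record's vector length differs from the index dimension."""
    for rec in records:
        if len(rec["values"]) != expected_dim:
            raise ValueError(
                f"{rec['id']}: got {len(rec['values'])} dimensions, expected {expected_dim}"
            )

def batched(records: list[dict], batch_size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Fake records with zero vectors, shaped like the ones in the upsert cell
fake_records = [
    {"id": f"pc-doc-{i}", "values": [0.0] * 1536, "metadata": {"text": f"chunk {i}"}}
    for i in range(51)
]

check_dimension(fake_records, expected_dim=1536)
for batch in batched(fake_records, batch_size=20):
    # In the notebook, this is where index_v3.upsert(vectors=batch) would go
    print(f"would upsert {len(batch)} vectors")
```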


Similarity search (v3 query) — same “What is Flink?” case

In [None]:
print("### Similarity Search — top-k for: 'What is Flink?'")

query_text_v3 = "What is Flink?"
# Single-vector query
qvec = embeddings.embed_query(query_text_v3)

res = index_v3.query(
    vector=qvec,
    top_k=4,
    include_metadata=True,
)

matches = res.get("matches", []) if isinstance(res, dict) else getattr(res, "matches", [])
if not matches:
    print("No matches found.")
else:
    for rank, m in enumerate(matches, 1):
        md = m.get("metadata", {}) if isinstance(m, dict) else getattr(m, "metadata", {})
        snippet = (md.get("text", "") or "")[:200].replace("\n", " ")
        score = m.get("score", None) if isinstance(m, dict) else getattr(m, "score", None)
        print(f"[{rank}/4] id={m.get('id') if isinstance(m, dict) else getattr(m, 'id', None)} score={score} :: {snippet}{'...' if len(snippet)>=200 else ''}")


### Similarity Search — top-k for: 'What is Flink?'
[1/4] id=pc-doc-4 score=0.839233398 :: 1   Introducing Apache Flink  This chapter covers   • Why stream processing is important  • What is Apache Flink  • Apache Flink in the context of a real world example  This book is about handling str...
[2/4] id=pc-doc-9 score=0.835510254 :: https://forums.manning.com/forums/flink-in-action 4
[3/4] id=pc-doc-50 score=0.834777832 :: streaming provides Flink with a more fine -grained ability to process data.  In the next chapter we will show you how to install Flink and write simple programs in Flink  using the DataSet, DataStream...
[4/4] id=pc-doc-3 score=0.831787109 :: brief contents  PART 1:  STREAM PROCESSING USING FLINK    1  Introducing Apache Flink    2  Getting started with Flink     3  Batch processing using the DataSet API    4  Stream processing using the D...


Retriever-style helper (v3) — returns LangChain Documents for QA

In [None]:
print("### Retriever Helper — convert v3 results to LangChain Documents")

from langchain_core.documents import Document

def pinecone_v3_retrieve(question: str, k: int = 4) -> list[Document]:
    """Embed the question, query Pinecone, and wrap each match as a LangChain Document."""
    qv = embeddings.embed_query(question)
    out = index_v3.query(vector=qv, top_k=k, include_metadata=True)
    matches = out.get("matches", []) if isinstance(out, dict) else getattr(out, "matches", [])
    docs = []
    for match in matches:
        meta = match.get("metadata", {}) if isinstance(match, dict) else getattr(match, "metadata", {})
        meta = meta or {}
        txt = meta.get("text", "")
        # Keep every metadata field except the raw text, which becomes page_content
        other_meta = {key: val for key, val in meta.items() if key != "text"}
        docs.append(Document(page_content=txt, metadata=other_meta))
    return docs

# Preview (flatten newlines so each document prints on one line)
_preview = pinecone_v3_retrieve("What is Flink?", k=3)
for i, d in enumerate(_preview, 1):
    flat = d.page_content[:120].replace("\n", " ")
    print(f"[doc {i}] {flat}{'...' if len(d.page_content) > 120 else ''}")


### Retriever Helper — convert v3 results to LangChain Documents
[doc 1] 1   Introducing Apache Flink  This chapter covers   • Why stream processing is important  • What is Apache Flink  • Apac...
[doc 2] https://forums.manning.com/forums/flink-in-action 4
[doc 3] streaming provides Flink with a more fine -grained ability to process data.  In the next chapter we will show you how to...


(Optional) QA chain — same “What is Flink?” final answer

In [None]:
print("### QA — run the same final example with LangChain QA chain")

from langchain_openai import OpenAI as OpenAICompletion
from langchain.chains.question_answering import load_qa_chain

qa_llm = OpenAICompletion(temperature=0)  # uses OPENAI_API_KEY from env
qa_chain_v3 = load_qa_chain(qa_llm, chain_type="stuff")

question_v3 = "What is Flink?"
docs_v3 = pinecone_v3_retrieve(question_v3, k=4)
if not docs_v3:
    print("No documents retrieved for QA.")
else:
    answer_v3 = qa_chain_v3.run(input_documents=docs_v3, question=question_v3)
    print("\n### Answer")
    print(answer_v3)

### QA — run the same final example with LangChain QA chain

### Answer
 Flink is a stream processing system used for handling streaming data in real-time. It is commonly used in businesses to analyze and make decisions based on constantly generated events.


(Danger/Optional) Clear namespace or selective delete (v3)

In [None]:
print("### Clear — delete all vectors in the default namespace (no explicit namespace)")

def pinecone_v3_clear_default():
    # On this API version, do NOT pass namespace="__default__".
    index_v3.delete(delete_all=True)   # ← no namespace arg
    print(f"Cleared all vectors in the default namespace on index '{index_name_v3}'.")

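# Danger: this call runs as soon as the cell executes and wipes the default namespace.
# Comment it out if you want to keep your vectors.
pinecone_v3_clear_default()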

### Clear — delete all vectors in the default namespace (no explicit namespace)
Cleared all vectors in the default namespace on index 'llama-2-rag'.


(Optional) Relink an existing index handle (v3) — using host or name

In [None]:
print("### Relink — show how to get a fresh handle to the same serverless index")

# Option 1: by name (simple for serverless)
index_again = pc.Index(index_name_v3)

# Option 2: by host (useful if you prefer explicit host)
desc = pc.describe_index(index_name_v3)
host_url = desc.get("host")
index_by_host = pc.Index(host=host_url)

print("Relinked: name-handle ok, host-handle ok →", host_url)

### Relink — show how to get a fresh handle to the same serverless index
Relinked: name-handle ok, host-handle ok → llama-2-rag-ac71174.svc.aped-4627-b74a.pinecone.io
