# Top-K Similarity Search - Ask A Book A Question

In this tutorial we will walk through a simple example of basic retrieval via Top-K similarity search.

In [2]:
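# pip install langchain langchain-community --upgrade
# Version: the notebook originally listed 0.0.164, but the `langchain_community`
# imports below need a newer release that ships the separate langchain-community package.

# !pip install pypdf jq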

In [3]:
# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader
from langchain_community.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader, TextLoader
from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
import os

load_dotenv()


True

### Load your data

Next let's load up some data. I've imported a few loaders above that read data from different sources. Feel free to use the one that suits you. This example uses `JSONLoader` on a JSON file of computer graphics notes. Constructing the loader only stages it; the data isn't read until we call `.load()`.

Then let's go ahead and actually load the data.

In [4]:
from langchain_community.document_loaders import JSONLoader  # same package as the imports above

loader = JSONLoader(
    file_path="computer_graphics_notes.json",
    jq_schema='.',  # or specify the structure like '.content' or '.pages[]'
    text_content=False
)

data = loader.load()
data

[Document(metadata={'source': 'C:\\Users\\anand\\Desktop\\ananda github\\Langchain-tutorial\\projects\\chatwithbook\\computer_graphics_notes.json', 'seq_num': 1}, page_content='[{"title": "Introduction to Computer Graphics", "content": "https://collegenote.pythonanywhere.com                                     Prepared By: Jayanta Poudel \\n \\n1 Computer Graphics (Reference Note)                                                                      BSc.CSIT                                                                                                   \\nUnit 1 \\nIntroduction of Computer Graphics \\nComputer graphics is a field related to the generation of graphics using computer. It includes \\nthe creation, storage and manipulation of images of object. These objects come from diverse \\nfield such as medicine, physical, mathematical, engineering, architecture, entertainment, \\nadvertisement. \\n- It is related to the generation and the representation of graphics by a computer usi

In [5]:
data = loader.load()

Then let's check what's been loaded.

In [6]:
# Note: If you're using PyPDFLoader then it will split by page for you already
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your sample document')
print(f'Here is a sample: {data[0].page_content[:200]}')

You have 1 document(s) in your data
There are 37960 characters in your sample document
Here is a sample: [{"title": "Introduction to Computer Graphics", "content": "https://collegenote.pythonanywhere.com                                     Prepared By: Jayanta Poudel \n \n1 Computer Graphics (Reference N
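
The sample above shows that the whole JSON array arrived as a single document string. If you'd rather have one document per record, one option is to parse the file yourself. Here is a minimal sketch using only the standard library. The helper `records_to_docs` and the inline sample data are hypothetical; the `title` and `content` field names are taken from the output above.

```python
import json

def records_to_docs(raw_json: str) -> list[dict]:
    """Turn a JSON array of {"title", "content", ...} records into one dict per record."""
    records = json.loads(raw_json)
    docs = []
    for i, rec in enumerate(records):
        docs.append({
            "page_content": rec["content"],  # the text we want to search over
            "metadata": {"title": rec.get("title", ""), "seq_num": i + 1},
        })
    return docs

# Made-up stand-in for the contents of the notes file
sample = json.dumps([
    {"title": "Introduction to Computer Graphics", "content": "Computer graphics is ..."},
    {"title": "Software standards", "content": "Primary goal of standardized ..."},
])

docs = records_to_docs(sample)
print(len(docs), docs[0]["metadata"]["title"])
```

Splitting per record keeps the JSON punctuation and the other fields out of the text that later gets embedded.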


### Chunk your data up into smaller documents

While we could pass the entire document to a model with a long context window, we want to be picky about which information we share with our model. The better the signal-to-noise ratio, the more likely we are to get the right answer.

The first thing we'll do is chunk up our document into smaller pieces. The goal will be to take only a few of those smaller pieces and pass them to the LLM.

In [7]:
# We'll split our data into chunks around 500 characters each with a 50 character overlap. These are relatively small.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(data)
texts

[Document(metadata={'source': 'C:\\Users\\anand\\Desktop\\ananda github\\Langchain-tutorial\\projects\\chatwithbook\\computer_graphics_notes.json', 'seq_num': 1}, page_content='[{"title": "Introduction to Computer Graphics", "content": "https://collegenote.pythonanywhere.com                                     Prepared By: Jayanta Poudel \\n \\n1 Computer Graphics (Reference Note)                                                                      BSc.CSIT                                                                                                   \\nUnit 1 \\nIntroduction of Computer Graphics \\nComputer graphics is a field related to the generation of graphics using'),
 Document(metadata={'source': 'C:\\Users\\anand\\Desktop\\ananda github\\Langchain-tutorial\\projects\\chatwithbook\\computer_graphics_notes.json', 'seq_num': 1}, page_content='field related to the generation of graphics using computer. It includes \\nthe creation, storage and manipulation of images of object. Th

In [8]:
# Let's see how many small chunks we have
print(f'Now you have {len(texts)} documents')

Now you have 85 documents
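
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines, so its chunks vary in length); the function `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "x" * 1200
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, as long as it is shorter than the overlap.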


### Create embeddings of your documents to get ready for semantic search

Next up we need to prepare for similarity searches. The way we do this is through embedding our documents (getting a vector per document).

This will help us compare documents later on.
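
To make "compare" concrete: one common way to compare two embedding vectors is cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors (real embeddings have hundreds of dimensions), and the vector store you pick may use a different distance measure, so treat this as an illustration rather than what the library does.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings"
v_graphics = [0.9, 0.1, 0.0]
v_rendering = [0.8, 0.2, 0.1]
v_cooking = [0.0, 0.1, 0.9]

print(cosine_similarity(v_graphics, v_rendering))  # high: related topics
print(cosine_similarity(v_graphics, v_cooking))    # low: unrelated topics
```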

In [9]:
from langchain_community.embeddings import HuggingFaceEmbeddings

def download_embeddings():
    """
    Download and return the HuggingFace embeddings model.
    """
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name
    )
    return embeddings

embedding = download_embeddings()




Check whether your API keys are available as environment variables (for example from the `.env` file read by `load_dotenv()` above). If not, set them directly in the cells below.

### Option #1: Chroma (for local)

I like Chroma because it's local and easy to set up without an account.

First we'll pass our texts to Chroma via `.from_documents`. This will 1) embed the documents to get a vector for each, then 2) add them to the vectorstore for retrieval later.

In [10]:
# load it into Chroma
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(texts, embedding)

Let's test it out. I want to see which documents are most closely related to a query.



In [11]:
query = "What peter thiel thinks about startups?"
docs = vectorstore.similarity_search(query)

Then we can check them out. In theory, the texts deemed most similar should hold the answer to our question.
Keep in mind that our query just happens to be a question; it could be any statement or sentence and the search would still work. It also means a query unrelated to the data, like this one, still returns the nearest chunks, even though none of them answers it.

In [12]:
# Print every document that was returned
for doc in docs:
    print(f"{doc.page_content}\n")

and M.Pauline Baker, \u201cComputer Graphics, C Versions.\u201d Prentice \nHall", "keywords": ["References", "Computer Graphics", "C Versions"], "sentiment": "Neutral", "topic_group": "Computer Graphics", "chunk_id": 18}]

own terms. \n\uf0a7 Example: artist's painting programs and various business, medical, and CAD \nsystems. \n \nSoftware standards \n \nPrimary goal of standardized graphics software is portability. When packages are designed \nwith standard graphics functions, software can he moved easily from one hardware system to \nanother and used in different implementations and applications. International and national \nstandards planning organizations in many countries have cooperated in an effort to

in many countries have cooperated in an effort to develop a \ngenerally accepted standard for computer graphics. After considerable effort, this work led to \nfollowing standards: \n \n\uf0b7 GKS (Graphical Kernel System): This system was adopted as the first graphics software \n
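
The search step above can be sketched as a tiny in-memory store: score every stored vector against the query vector and keep the `k` best. This is only an illustration of the Top-K idea; `ToyVectorStore`, its methods, and the 2-dimensional vectors are all made up, and real vector stores use indexes that avoid scoring every item.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

class ToyVectorStore:
    """Hypothetical brute-force store: keeps (vector, text) pairs in a list."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((vector, text))

    def similarity_search(self, query_vector: list[float], k: int = 4) -> list[str]:
        # Score every stored item, then keep the k highest scores, best first.
        scored = ((cosine(query_vector, vec), text) for vec, text in self._items)
        top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        return [text for _, text in top]

store = ToyVectorStore()
store.add([1.0, 0.0], "software standards")
store.add([0.9, 0.1], "graphics packages")
store.add([0.0, 1.0], "references")

print(store.similarity_search([1.0, 0.05], k=2))
```

Note that the search always returns `k` items (or fewer if the store is smaller), whether or not any of them is actually relevant to the query.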

### Option #2: Pinecone (for cloud)
If you want to use Pinecone instead of the Chroma setup above, uncomment and run the code below; if not, skip to the next section. You must go to [Pinecone.io](https://www.pinecone.io/) and set up an account.

In [13]:
# Legacy Pinecone client (v2-style `pinecone.init`); newer clients use a different setup.
# import pinecone
# from langchain_community.vectorstores import Pinecone

# PINECONE_API_KEY = os.getenv('PINECONE_API_KEY', 'YourAPIKey')
# PINECONE_API_ENV = os.getenv('PINECONE_API_ENV', 'us-east1-gcp') # You may need to switch with your env

# # initialize pinecone
# pinecone.init(
#     api_key=PINECONE_API_KEY,  # find at app.pinecone.io
#     environment=PINECONE_API_ENV  # next to api key in console
# )
# index_name = "langchaintest" # put in the name of your pinecone index here

# # `embedding` is the HuggingFace model created earlier
# docsearch = Pinecone.from_texts([t.page_content for t in texts], embedding, index_name=index_name)

### Query those docs to get your answer back

Great, those are the docs that should hold our answer. Now we can pass them to a LangChain chain to query the LLM.

We could do this manually, but a chain is a convenient helper for us.

In [14]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains.question_answering import load_qa_chain
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash",
    temperature=0.7)

In [19]:
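from langchain.chains.question_answering import load_qa_chain

# Build the question-answering chain that `chain.run` relies on further down.
# Assumption: the "stuff" chain type, which puts all retrieved documents into one prompt.
chain = load_qa_chain(llm, chain_type="stuff")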


In [16]:
query = "software standars ?"
docs = vectorstore.similarity_search(query,k=10)
docs

[Document(metadata={'source': 'C:\\Users\\anand\\Desktop\\ananda github\\Langchain-tutorial\\projects\\chatwithbook\\computer_graphics_notes.json', 'seq_num': 1}, page_content="own terms. \\n\\uf0a7 Example: artist's painting programs and various business, medical, and CAD \\nsystems. \\n \\nSoftware standards \\n \\nPrimary goal of standardized graphics software is portability. When packages are designed \\nwith standard graphics functions, software can he moved easily from one hardware system to \\nanother and used in different implementations and applications. International and national \\nstandards planning organizations in many countries have cooperated in an effort to"),
 Document(metadata={'seq_num': 1, 'source': 'C:\\Users\\anand\\Desktop\\ananda github\\Langchain-tutorial\\projects\\chatwithbook\\computer_graphics_notes.json'}, page_content='\\nLibrary \\nGraphics \\nsoftware \\nGraphics \\nmonitor \\nI/O device", "keywords": ["Applications", "Computer Graphics", "GUI", "Enter

Awesome! We just queried an external data source. Now let's hand the retrieved documents and the question to the chain to get an answer.

In [20]:
chain.run(input_documents=docs, question=query)

'According to the book, the primary goal of standardized graphics software is **portability**.\n\nWhen graphics packages are designed with standard graphics functions, the software can be easily moved from one hardware system to another and used in different implementations and applications. International and national standards planning organizations in many countries cooperate in an effort to achieve this.'
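
As mentioned earlier, we could also do this step manually. A "stuff"-style approach simply pastes the retrieved chunks into one prompt together with the question. Below is a minimal sketch of that idea; `build_stuff_prompt` is a hypothetical helper, the prompt wording is illustrative rather than LangChain's exact template, and the chunks are shortened stand-ins for the retrieved documents.

```python
def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Concatenate retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# Shortened stand-ins for the retrieved chunks
chunks = [
    "Primary goal of standardized graphics software is portability.",
    "GKS (Graphical Kernel System) ...",
]
prompt = build_stuff_prompt(chunks, "software standards?")
print(prompt)
```

With the real data you would pass `[d.page_content for d in docs]` as the chunks and send the resulting string to the model, for example with `llm.invoke(prompt)`. The chain saves you from writing and maintaining this glue yourself.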