# Top-K Similarity Search - Ask A Book A Question

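In this tutorial we'll walk through a simple example of retrieval using Top-K similarity search: we embed a book, find the chunks most similar to a question, and pass them to an LLM to answer it.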

In [None]:
!pip install langchain
!pip install pypdf
!pip install openai
!pip install chromadb
!pip install tiktoken

In [12]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

### Load your data

Next let's load some data. This step only sets up the loader; it doesn't read the file yet.

In [14]:
loader = PyPDFLoader("./naval.pdf")

Then let's go ahead and actually load the data.

In [15]:
data = loader.load()

Then let's check out what's been loaded.

In [16]:
# Note: PyPDFLoader already splits the PDF by page, so each page is one document
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your sample document')
print(f'Here is a sample: {data[0].page_content[:200]}')

You have 242 document(s) in your data
There are 30 characters in your sample document
Here is a sample: THE ALMANACK OF NAVAL RAVIKANT


### Chunk your data up into smaller documents

While we could pass the entire book to a model with a long context window, we want to be picky about which information we share with our model.

The first thing we'll do is chunk up our book into smaller pieces. The goal will be to take only a few of those smaller pieces and pass them to the LLM.

In [17]:
# Split the data into chunks of roughly 500 characters each, with a 50-character overlap. These are relatively small.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(data)

In [18]:
# Let's see how many small chunks we have
print(f'Now you have {len(texts)} documents')

Now you have 685 documents
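
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch of fixed-width sliding-window chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and newlines first); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Hypothetical helper: fixed-width chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy input: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary still shows up whole in at least one chunk.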


### Create embeddings of your documents to get ready for semantic search

Next up we need to prepare for similarity searches. We do this by embedding our documents (getting one vector per document).

This will help us compare documents later on.
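
As a rough illustration of how vectors let us compare documents, the sketch below computes cosine similarity between hand-made 3-dimensional toy vectors. Real embeddings have many more dimensions, the numbers here are invented for the example, and cosine similarity is only one common choice of metric; the vector store may use a different one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made-up values, purely illustrative)
toy_vectors = {
    "earning money": [0.9, 0.1, 0.0],
    "building wealth": [0.8, 0.2, 0.1],
    "meditation": [0.0, 0.2, 0.9],
}

query_vec = toy_vectors["earning money"]
for name, vec in toy_vectors.items():
    print(f"{name}: {cosine_similarity(query_vec, vec):.3f}")
```

Vectors that point in a similar direction score close to 1, while unrelated ones score close to 0.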

In [19]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

Let's define the OpenAI API key.

In [20]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', '')  # read the key from the environment instead of hard-coding it

Then we'll create our embeddings object using OpenAI's embedding model.

In [None]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

Chroma


First we'll pass our texts to Chroma via `.from_documents`. This will 1) embed each document to get a vector, then 2) add the vectors to the vectorstore for retrieval later.

In [27]:
# load it into Chroma
vectorstore = Chroma.from_documents(texts, embeddings)

Let's test it out and see which documents are most closely related to a query. By default, `similarity_search` returns the 4 closest matches (`k=4`).



In [28]:
query = "How to earn money"
docs = vectorstore.similarity_search(query)

Then we can check them out. In theory, the texts deemed most similar should hold the answer to our question.
Keep in mind that our query just happens to be a question. It could be any statement or sentence and the search would still work.

In [None]:
# Print every document that was returned
for doc in docs:
    print(f"{doc.page_content}\n")
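
The "Top-K" part of the search can be sketched in a few lines: score every stored vector against the query vector and keep the `k` best. This is a brute-force illustration with made-up 2-dimensional unit vectors and a hypothetical `top_k` helper, not how Chroma implements its index.

```python
import heapq


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (score, doc_id) pairs with the highest dot-product score."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored)


# Made-up unit vectors standing in for embedded chunks
doc_vecs = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.8, 0.6],
    "chunk-c": [0.0, 1.0],
    "chunk-d": [0.6, 0.8],
}

results = top_k([1.0, 0.0], doc_vecs, k=2)
for score, doc_id in results:
    print(doc_id, score)
```

With the real vectorstore, the equivalent knob is the `k` argument, for example `vectorstore.similarity_search(query, k=2)`.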

### Query those docs to get your answer back

Those are just the docs which should hold our answer. Now we can pass those to a LangChain chain to query the LLM.

We could do this manually, but a chain is a convenient helper for us.

In [30]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

In [None]:
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_with_sources_chain(llm, chain_type="stuff")
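
The `stuff` chain type places all of the retrieved documents into a single prompt. To show roughly what doing this manually would look like, here is a minimal sketch. The `Doc` class, the `build_stuffed_prompt` helper, and the prompt wording are all assumptions made for illustration; LangChain's actual prompt template is different.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_stuffed_prompt(docs, question):
    """Concatenate every document into one prompt (illustrative wording only)."""
    context = "\n\n".join(
        f"Content: {d.page_content}\nSource: {d.metadata.get('source', 'unknown')}"
        for d in docs
    )
    return (
        "Answer the question using only the extracts below and list the sources used.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Placeholder chunks; in the notebook these would be the retrieved docs
sample_docs = [
    Doc("First retrieved chunk of text.", {"source": "./naval.pdf"}),
    Doc("Second retrieved chunk of text.", {"source": "./naval.pdf"}),
]

prompt = build_stuffed_prompt(sample_docs, "How to earn money?")
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason we kept the chunks small.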

In [32]:
query = "How to earn money?"
docs = vectorstore.similarity_search(query)

In [33]:
chain.run(input_documents=docs, question=query)


"To earn money, you need to provide society with something it wants but doesn't know how to get elsewhere. This can be done by delivering a product or service at scale. Another way is to have passive income that covers your expenses. Additionally, doing something you love can also lead to financial success.\nSOURCES: ./naval.pdf"

In [None]:
chain.run(input_documents=docs, question="Explain in 1 line")

"To get rich without relying on luck, focus on delivering a product or service that society wants but doesn't know how to get elsewhere."

In [None]:
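# The chain keeps no memory of earlier calls, so the model can only infer the earlier question from the docs
chain.run(input_documents=docs, question="What have i asked?")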

'You have asked about how to get rich without getting lucky.'