# Project: Question Answering over the Content of a Private Document

## Splitting and Embedding Text Using LangChain

When dealing with very long text, splitting it into chunks is necessary. You also want to keep semantically related pieces of text together. 

We can load the data to be split from many document types, such as text files, PDFs, CSV files, spreadsheets, SQL databases, Evernote, Facebook chats, JSON files, PowerPoint, and so on. There are **LangChain loaders** that can load data from almost any type of document. There are also **public service loaders** for services such as Project Gutenberg, Hacker News, Wikipedia, and YouTube. Additionally, there are **proprietary service loaders** for Amazon, Azure, Google, and others.

The recommended text splitter is called RecursiveCharacterTextSplitter. By default, it tries to split on a double newline (`\n\n`) first, then a single newline (`\n`), then a space, and finally falls back to splitting between individual characters.
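To make the splitting order concrete, here is a minimal pure-Python sketch of the recursive idea. It is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and the sketch ignores chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut between individual characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # This part is too big on its own: flush what we have,
            # then retry the part with the next, finer separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
        elif not current:
            current = part
        elif len(current) + len(sep) + len(part) <= chunk_size:
            # Keep related text together while it still fits.
            current += sep + part
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
sample_chunks = recursive_split(sample, chunk_size=30)
for piece in sample_chunks:
    print(repr(piece))
```

Paragraph boundaries are preferred, and only the paragraph that exceeds the limit is broken at spaces.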

In [34]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [35]:
pip install -r ./requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


In [36]:
pip install -q pinecone-client

Note: you may need to restart the kernel to use updated packages.


In [37]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Open and read the contents of the file as a string
with open('APJ_Kalam.txt', encoding="utf-8") as f:
    APJ_Kalam = f.read()

# Alternative: read the file as a list of lines instead of a single string
# with open('APJ_Kalam.txt', 'r') as file:
#     APJ_Kalam = file.readlines()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,      # Maximum size of each chunk; experiment with different values to see which works best
    chunk_overlap = 20,    # Maximum overlap between adjacent chunks, which keeps some continuity between them
    length_function = len  # How chunk length is measured. len counts characters; because LLMs work with tokens, it is common to pass a token counter here instead
)
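The effect of chunk_size and chunk_overlap is easiest to see on a tiny string. The sketch below uses a plain fixed-size sliding window, which is an assumption made for illustration: the real splitter overlaps whole pieces between separators, so its overlap is "at most" chunk_overlap rather than exact. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the previous window has already reached the end of the text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

Each window repeats the last two characters of the previous one, which is the continuity the overlap setting is meant to provide.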

In [38]:
chunks = text_splitter.create_documents([APJ_Kalam])    #Creating the chunks
print(chunks[155])                                                

page_content="July.[49] Kalam was the third President of India to have been honoured with a Bharat Ratna, India's"


In [39]:
# To see only the text, print the page_content attribute
print(chunks[155].page_content)

July.[49] Kalam was the third President of India to have been honoured with a Bharat Ratna, India's


In [40]:
# Let's see how many chunks we have
print(f'Now I have {len(chunks)} chunks')

Now I have 496 chunks


### Embedding Cost

We'll be using OpenAI's text-embedding-ada-002, which is a paid model. It's a good idea to calculate the embedding cost in advance to avoid any surprises. We'll use the tiktoken library to count the tokens.

In [41]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')  # Newer versions are text-embedding-3-small and text-embedding-3-large
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD:{total_tokens / 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)

Total Tokens: 10087
Embedding Cost in USD:0.004035
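The cost figure is plain arithmetic: tokens divided by 1000, multiplied by the price per 1000 tokens. The sketch below reproduces the number above without calling tiktoken. The price of 0.0004 USD per 1K tokens is the value hard-coded in this notebook, so treat it as an assumption and check the current pricing before relying on it. The function name `embedding_cost_usd` is hypothetical.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Estimated embedding cost for a given token count and per-1K-token price."""
    return total_tokens / 1000 * usd_per_1k_tokens


# 10087 tokens is the count reported above; 0.0004 is the notebook's assumed price.
cost = embedding_cost_usd(10087, 0.0004)
print(f"Embedding Cost in USD: {cost:.6f}")
```

Keeping the price as a parameter makes it easy to re-run the estimate for a different model or an updated price.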


**The next step is to import and instantiate OpenAIEmbeddings to embed text to vectors**

In [42]:
from langchain_openai import OpenAIEmbeddings

# The OpenAIEmbeddings class embeds text into vectors (lists of floating-point numbers).
# If the API key is loaded as an environment variable, it does not need to be passed to the constructor.
# embedding = OpenAIEmbeddings(model='text-embedding-3-small')

embedding = OpenAIEmbeddings()

In [43]:
#vector = embedding.embed_query('abc')  #the argument can be any text
vector = embedding.embed_query(chunks[155].page_content)
print(vector)

[-0.019802802293095452, -0.00404324030066155, -0.002888749031187924, -0.007615945679647185, -0.003001341279429844, 0.018781068734910672, -0.047403043624212216, -0.030867097567462993, -0.01817609646159929, -0.007595779750938949, 0.007978929602427586, 0.009632524580631557, -0.006691680156003802, -0.015500767925210539, -0.007595779750938949, 0.018458417600869366, -0.004117181263156235, -0.017799667034142364, 0.013396803954038888, 0.017369464901205542, -0.0025039186080300486, 0.008368801740593176, 0.0004327241836977872, 0.011138236702523537, -0.005794302847745608, -0.005942185238396288, -0.003824777625193353, -0.050575795297662075, -0.00013464864545505826, 0.004083571692416714, 0.006348861696270328, -0.002021620150330775, -0.000946951727778839, -0.007488228752043432, -0.02250501811354677, 0.005155719606710776, -0.01240868089791625, 0.0035054860046759367, 0.02352675167173155, 0.016925816332269573, 0.03312566668162359, -0.007696608463157511, 0.009034272731351886, -0.0073269027193614686, -0.0

## Inserting the Embeddings into a Pinecone Index 

In [44]:
import pinecone 
from langchain_community.vectorstores import Pinecone

pc = pinecone.Pinecone()

Now we have to create a Pinecone index. We can also use an existing index if we want. Because we are currently on the free plan, we are limited to one index and one project, and trying to create a second index raises an error. So it's a good idea to delete all existing indexes before creating a new one.

In [45]:
for i in pc.list_indexes().names():
    print('Deleting all indexes....', end ='')
    pc.delete_index(i)
    print('Done')

Deleting all indexes....Done


Now I am creating a new index for these embeddings called apj-kalam. Its dimension is 1536 because text-embedding-ada-002 produces 1536-dimensional vectors.

In [46]:

index_name = 'apj-kalam'

if index_name not in pc.list_indexes().names():
    print(f'Creating index {index_name}....')
    pc.create_index(
        name=index_name,
        dimension=1536, 
        metric='cosine', 
        spec=pinecone.ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ) 
    )
    print('Index created!')

Creating index apj-kalam....
Index created!


Next, we will upload the vectors to Pinecone using LangChain by calling **Pinecone.from_documents(chunks, embedding, index_name=index_name)**. This method takes three arguments:

1. **chunks** is the list of documents obtained from the RecursiveCharacterTextSplitter. These smaller chunks are indexed in Pinecone to make it easier to search and retrieve relevant information later on.
2. **embedding** is an instance of the OpenAIEmbeddings class, which converts text into embeddings using OpenAI's embedding model. These embeddings are stored in the Pinecone index and used for similarity search.
3. **index_name** is a string naming the Pinecone index. It identifies the index in the Pinecone database, and the index must already exist.

We have already defined all of these objects. In a nutshell, Pinecone.from_documents() processes the input documents, generates embeddings using the provided OpenAIEmbeddings instance, and returns a vector store object initialized from those documents and embeddings.

The resulting vector store object can perform similarity searches and retrieve relevant documents based on user queries.

We run Pinecone.from_documents(chunks, embedding, index_name=index_name) only once, when we split the document into chunks and embed those chunks into Pinecone. Once the index is populated with the embeddings, we just query it. Calling **Pinecone.from_documents(chunks, embedding, index_name=index_name)** again inserts the vectors again and creates duplicate entries.
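One common way to make a repeated upload harmless is to give every chunk a deterministic ID, because writing a vector under an ID that already exists overwrites it instead of adding a new entry. The sketch below shows the idea with a plain dictionary standing in for the index. The helper names `chunk_id` and `upsert` are hypothetical, and whether your LangChain version lets you pass explicit IDs when uploading is an assumption to verify against its documentation.

```python
import hashlib


def chunk_id(text):
    """Derive a stable ID from the chunk text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# A dict stands in for the vector index: writing an existing key overwrites it.
fake_index = {}


def upsert(texts):
    for text in texts:
        fake_index[chunk_id(text)] = text


sample_texts = ["A. P. J. Abdul Kalam", "President of India (2002-2007)"]
upsert(sample_texts)
upsert(sample_texts)  # a second run overwrites rather than duplicates
print(len(fake_index))
```

With random IDs, the second call would have doubled the number of entries. With content-derived IDs, the count stays the same.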

In [47]:
vector_store = Pinecone.from_documents(chunks, embedding, index_name=index_name) 

In [48]:
# Loading the vector_store  from an existing index
vector_store = Pinecone.from_existing_index(index_name='apj-kalam', embedding=embedding)

## Asking Questions (Similarity search)

So far, we've split the text into chunks and embedded them into vectors, which were then inserted into a Pinecone index. Now let's see **how to ask questions and do similarity searches.**

**The user defines a query. The query is embedded into a vector. A similarity search is performed in the vector database, and the text behind the most similar vectors is the answer to the user's question.**
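The steps above can be sketched with toy data. The three-dimensional vectors and chunk labels below are made up for illustration (real embeddings here have 1536 dimensions), and the function names are hypothetical. The similarity measure is cosine similarity, the same metric the index was created with.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k):
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks.
toy_store = [
    ("chunk about the presidency", [0.9, 0.1, 0.0]),
    ("chunk about childhood", [0.1, 0.9, 0.0]),
    ("chunk about awards", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend this is the embedded user query

toy_result = top_k(toy_query, toy_store, k=2)
print(toy_result)
```

A vector database does the same ranking, but with indexing structures that avoid comparing the query against every stored vector.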

In [49]:
# Defining a query.
query = 'When apj abdul kalam became president of india'

# Now I am extracting all the relevant chunks
result = vector_store.similarity_search(query)
print(result)

[Document(page_content='This article is part of\na series about\nA. P. J. Abdul Kalam\nPresident of India\n(2002-2007)'), Document(page_content='A. P. J. Abdul Kalam'), Document(page_content='President Dr. APJ Abdul Kalam", and highlighted his achievements as a scientist and as a statesman,'), Document(page_content='A P J Abdul Kalam[25]')]


In [50]:
# Iterate over the chunks and print only the chunk text.
for r in result:
    print(r.page_content)
    print('-' * 50)   # print 50 dashes after each chunk for readability

This article is part of
a series about
A. P. J. Abdul Kalam
President of India
(2002-2007)
--------------------------------------------------
A. P. J. Abdul Kalam
--------------------------------------------------
President Dr. APJ Abdul Kalam", and highlighted his achievements as a scientist and as a statesman,
--------------------------------------------------
A P J Abdul Kalam[25]
--------------------------------------------------


The chunks above contain the answer, but we can't hand them to users like this. We need the answer in natural language. That's where the LLM comes in: we retrieve the most relevant chunks of text and feed them to the language model, which produces the final answer. The following code does this:

In [51]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)

# Expose the index through a retriever interface, a generic interface that makes it easy to combine documents with language models.
# search_kwargs takes a dictionary; the key 'k' maps to an integer.
# {'k': 3} means the retriever returns the three chunks most similar to the user's query.
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})

# Finally, creating a chain to answer the questions. The default chain_type="stuff" uses all of the text from the documents in the prompt.
chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
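Conceptually, the "stuff" strategy pastes all retrieved chunks into a single prompt together with the question. The sketch below shows that idea only. The prompt wording and the function name `build_stuff_prompt` are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Combine every retrieved chunk and the question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Chunk texts taken from the similarity search output shown earlier.
example_chunks = [
    "This article is part of\na series about\nA. P. J. Abdul Kalam\nPresident of India\n(2002-2007)",
    "A. P. J. Abdul Kalam",
]
example_prompt = build_stuff_prompt(
    "When apj abdul kalam became president of india", example_chunks
)
print(example_prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason to keep k small.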



Now let's ask some questions.

In [53]:
#query = 'When apj abdul kalam became president of india'
#query = 'When apj abdul kalam was born?'
query = 'What apj abdul kalam used to do as a kid to support his family?'
answer = chain.invoke(query)
print(answer)

{'query': 'What apj abdul kalam used to do as a kid to support his family?', 'result': 'As a child, A. P. J. Abdul Kalam used to sell newspapers to support his family.'}


In [54]:
print(answer['result'])

As a child, A. P. J. Abdul Kalam used to sell newspapers to support his family.
