# Splitting and Embedding Text Using LangChain (Similarity Search)

In [1]:
%%capture
!pip install -r requirements.txt -q

In [2]:
import os
import openai
import getpass
from pinecone import Pinecone, ServerlessSpec

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OPENAI_API_KEY: ")
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LANGCHAIN_API_KEY: ")
os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your PINECONE_API_KEY: ")

Enter your OPENAI_API_KEY:  ········
Enter your LANGCHAIN_API_KEY:  ········
Enter your PINECONE_API_KEY:  ········
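
Re-running the cell above prompts for all three keys every time. A minimal sketch of a helper that prompts only when a variable is missing; `ensure_env` and `EXAMPLE_API_KEY` are hypothetical names used for illustration, not part of any library in this notebook.

```python
import getpass
import os


def ensure_env(name: str) -> str:
    """Return os.environ[name], prompting for it only if it is not already set."""
    if not os.environ.get(name):
        # getpass hides the typed value, so the key never shows up in the cell output
        os.environ[name] = getpass.getpass(f"Enter your {name}: ")
    return os.environ[name]


# Demo with a placeholder variable that is already set, so no prompt is triggered
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
print(ensure_env("EXAMPLE_API_KEY"))
```

In the notebook this would be called once per key, for example `ensure_env("OPENAI_API_KEY")`.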


## Splitting text
- [LangChain document loaders](https://python.langchain.com/docs/how_to/#document-loaders)

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter



In [7]:
with open("files/churchill_speech.txt", encoding="utf-8") as f:
    churchill_speech = f.read()

In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
    length_function=len
)

In [11]:
chunks = text_splitter.create_documents([churchill_speech])
print(chunks[0])

page_content='Winston Churchill Speech - We Shall Fight on the Beaches
We Shall Fight on the Beaches
June 4, 1940
House of Commons
From the moment that the French defenses at Sedan and on the Meuse were broken at the end of the
second week of May, only a rapid retreat to Amiens and the south could have saved the British and
French Armies who had entered Belgium at the appeal of the Belgian King; but this strategic fact was'


In [12]:
print(chunks[-1])

page_content='streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a moment
believe, this Island or a large part of it were subjugated and starving, then our Empire beyond the
seas, armed and guarded by the British Fleet, would carry on the struggle, until, in God's good time,
the New World, with all its power and might, steps forth to the rescue and the liberation of the old.'


In [14]:
print(chunks[-1].page_content)

streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a moment
believe, this Island or a large part of it were subjugated and starving, then our Empire beyond the
seas, armed and guarded by the British Fleet, would carry on the struggle, until, in God's good time,
the New World, with all its power and might, steps forth to the rescue and the liberation of the old.


In [13]:
len(chunks)

47
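
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators such as paragraph and line breaks before merging pieces); `sliding_chunks` is a hypothetical helper that only illustrates the two parameters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)
```

Each window starts with the last character of the previous one, which is the role the 20-character overlap plays for the 500-character chunks above: text cut at a boundary still appears with some of its context in the next chunk.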

## Embedding cost
- [How to count tokens with Tiktoken](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken)
- [tiktoken](https://github.com/openai/tiktoken)

In [15]:
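def print_embedding_cost(texts):
    """Print the token count of `texts` and the estimated embedding cost."""
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-3-small')
    # count tokens per chunk and add them up
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    # 0.00002 USD per 1,000 tokens is the rate assumed here for text-embedding-3-small;
    # check current prices at https://openai.com/pricing
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.00002:.6f}')
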
print_embedding_cost(chunks)

Total Tokens: 4646
Embedding Cost in USD: 0.000093
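
The cost formula is plain arithmetic: the token count divided by the pricing unit, times the price per unit. A sketch that takes the price as a parameter instead of hard-coding it; `embedding_cost_usd` is a hypothetical helper, and its default of 0.02 USD per million tokens is simply the rate implied by the cell above (0.00002 USD per 1,000 tokens), which may be out of date.

```python
def embedding_cost_usd(total_tokens: int, usd_per_million_tokens: float = 0.02) -> float:
    """Estimate the embedding cost in USD for a given number of tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


cost = embedding_cost_usd(4646)  # token count reported above
print(f"Embedding Cost in USD: {cost:.6f}")  # prints 0.000093, matching the cell above
```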


## Create Embeddings

In [16]:
from langchain_openai import OpenAIEmbeddings

In [17]:
embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)  # 512 also works, but the Pinecone index dimension must match

In [19]:
vector = embeddings.embed_query(chunks[0].page_content)
len(vector), vector[:2]

(1536, [0.046349696815013885, 0.052983976900577545])

## Inserting the Embeddings into a Pinecone Index
- [Vector stores](https://python.langchain.com/docs/concepts/vectorstores/)
- [Pinecone](https://python.langchain.com/docs/integrations/vectorstores/pinecone/)

In [20]:
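import pinecone
# Note: this LangChain `Pinecone` vector store class shadows the `Pinecone` client class
# imported from `pinecone` earlier, so the client is reached as `pinecone.Pinecone` below.
from langchain_community.vectorstores import Pinecone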

In [21]:
pc = pinecone.Pinecone()

In [22]:
# delete every existing index in this Pinecone project (destructive, cannot be undone)
indexes = pc.list_indexes().names()
for i in indexes:
    print('Deleting all indexes ... ', end='')
    pc.delete_index(i)
    print('Done')

Deleting all indexes ... Done


In [23]:
# creating an index
from pinecone import ServerlessSpec

index_name = 'churchill-speech'
if index_name not in pc.list_indexes().names():
    print(f'Creating index {index_name}')
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric='cosine',
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ) 
    )
    print('Index created! 😊')
else:
    print(f'Index {index_name} already exists!')

Creating index churchill-speech
Index created! 😊


In [24]:
# Embed the chunks with the `OpenAIEmbeddings` instance, insert the vectors into the index,
# and return a new Pinecone vector store object.
vectore_store = Pinecone.from_documents(chunks, embedding=embeddings, index_name=index_name)

In [25]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 47}},
 'total_vector_count': 47}

## Asking Questions (Similarity Search)

In [26]:
query = "where should we fight?"
result = vectore_store.similarity_search(query)
result

[Document(metadata={}, page_content='the Gestapo and all the odious apparatus of Nazi rule, we shall not flag or fail. We shall go on to the\nend, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing\nconfidence and growing strength in the air, we shall defend our Island, whatever the cost may be, we\nshall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the'),
 Document(metadata={}, page_content="streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a moment\nbelieve, this Island or a large part of it were subjugated and starving, then our Empire beyond the\nseas, armed and guarded by the British Fleet, would carry on the struggle, until, in God's good time,\nthe New World, with all its power and might, steps forth to the rescue and the liberation of the old."),
 Document(metadata={}, page_content='Winston Churchill Speech - We Shall Fight on the Beaches\n

In [28]:
for r in result:
    print(r.page_content)
    print("-" * 100)

the Gestapo and all the odious apparatus of Nazi rule, we shall not flag or fail. We shall go on to the
end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing
confidence and growing strength in the air, we shall defend our Island, whatever the cost may be, we
shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the
----------------------------------------------------------------------------------------------------
streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a moment
believe, this Island or a large part of it were subjugated and starving, then our Empire beyond the
seas, armed and guarded by the British Fleet, would carry on the struggle, until, in God's good time,
the New World, with all its power and might, steps forth to the rescue and the liberation of the old.
--------------------------------------------------------------------------------
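
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors score highest under the index metric (cosine here). A brute-force sketch on toy 3-dimensional vectors; the vectors are made up for illustration, `cosine_similarity` and `top_k` are hypothetical helpers, and a managed index such as Pinecone does this at scale with its own data structures.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


# Toy "embeddings": the first axis loosely stands for "fighting", the last for "parliament"
stored = {
    "we shall fight on the beaches": [0.9, 0.1, 0.0],
    "House of Commons": [0.0, 0.2, 0.9],
    "we shall fight in the hills": [0.8, 0.3, 0.1],
}
matches = top_k([1.0, 0.0, 0.0], stored, k=2)
print(matches)
```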

## Answering questions in Natural Language using an LLM

In [49]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Initialize the LLM with the specified model and temperature
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.2)

# Use the vector store as a retriever with similarity search, returning the top 15 results
retriever = vectore_store.as_retriever(search_type='similarity', search_kwargs={'k': 15})

# Create a RetrievalQA chain using the defined LLM, chain type 'stuff', and retriever
chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)

In [50]:
query = 'Answer only from the provided input. Where should we fight?'
answer = chain.invoke(query)
print(answer.get("result"))

We shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our Island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills.


In [51]:
query = 'Who was the king of Belgium at that time?'
answer = chain.invoke(query)
# answer
print(answer.get("result"))

The king of Belgium at that time was King Leopold III.


In [52]:
query = 'What about the French Armies??'
answer = chain.invoke(query)
print(answer.get("result"))

The text indicates that the French Army was significantly weakened during the military operations in France and Belgium. The German forces executed a successful penetration that cut off communications between the British and French Armies, leading to a colossal military disaster for the French. The French High Command hoped to close the gap created by the German advance, but the situation deteriorated, and the French Army faced severe challenges, including being outnumbered and surrounded. Ultimately, the French Army was largely cast back and disturbed by the German onslaught.
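
The `stuff` chain type used above places all retrieved chunks into a single prompt together with the question. A rough sketch of that idea; the template wording and `build_stuff_prompt` are illustrative assumptions, not LangChain's actual prompt.

```python
def build_stuff_prompt(question: str, documents: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs = [
    "we shall fight on the beaches, we shall fight on the landing grounds",
    "we shall fight in the hills; we shall never surrender",
]
prompt = build_stuff_prompt("Where should we fight?", docs)
print(prompt)
```

Because every chunk ends up in the same prompt, the value of `k` in the retriever directly controls how much text the LLM receives per question.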
