In [None]:
# create virtual env: python -m venv llmenv
# activate venv (Windows): llmenv\Scripts\activate
# activate venv (macOS/Linux): source llmenv/bin/activate
# install libraries:
# !pip install langchain
# !pip install openai
# !pip install unstructured
# !pip install tiktoken
# !pip install chromadb
# !pip install flask

In [38]:

from langchain.document_loaders import DirectoryLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [25]:
from configparser import ConfigParser
config_object = ConfigParser()
config_object.read("config.ini")

# Get the OpenAI API key
openai_config = config_object["openai"]
open_ai_key = openai_config['key']
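
The cell above expects a `config.ini` with an `[openai]` section containing a `key` entry. A minimal sketch of that layout and of a more forgiving loader follows. The `load_openai_key` and `mask` helpers, the placeholder key value, and the fallback to an `OPENAI_API_KEY` environment variable are assumptions for illustration, not part of the original notebook.

```python
import os
from configparser import ConfigParser

# Assumed layout of config.ini (the key value is a placeholder, not a real key).
SAMPLE_CONFIG = """
[openai]
key = sk-your-key-here
"""

def load_openai_key(config_text=None, path="config.ini"):
    """Hypothetical helper: read [openai] key, else fall back to the environment."""
    parser = ConfigParser()
    if config_text is not None:
        parser.read_string(config_text)
    else:
        parser.read(path)  # returns an empty list if the file does not exist
    # fallback=None avoids an exception when the section or option is missing
    key = parser.get("openai", "key", fallback=None)
    return key or os.environ.get("OPENAI_API_KEY")

def mask(secret, visible=3):
    """Keep only the first few characters so the value is safe to display."""
    return secret[:visible] + "*" * (len(secret) - visible)

key = load_openai_key(SAMPLE_CONFIG)
print(mask(key))
```

Displaying `mask(key)` rather than the key itself confirms that the value loaded without writing the secret into the notebook output.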


In [39]:
loader = DirectoryLoader(r"/Users/a200031115/Documents/Ashutosh/Git/custom_QnA/data2")
documents = loader.load()

In [40]:
documents

[Document(page_content='Current Affairs | May 2023\n\n1\n\nwww.bankersadda.com\n\n|\n\nwww.sscadda.com\n\n|\n\nwww.careerpower.in\n\n|\n\nAdda247 App\n\nCurrent Affairs | May 2023\n\n2\n\nwww.bankersadda.com\n\n|\n\nwww.sscadda.com\n\n|\n\nwww.careerpower.in\n\n|\n\nAdda247 App\n\nCurrent Affairs | May 2023\n\nIPL 2023 final Chennai Super Kings (CSK) clinched their fifth Indian Premier League (IPL) title, equaling a record with Mumbai Indians. They secured a five-wicket victory over the Gujarat Titans (GT) amidst a backdrop of fireworks and jubilant celebrations. CSK’s captain, Dhoni, received the IPL trophy and subsequently handed it over to Rayudu and Jadeja. Opting to bat first, the Gujarat Titans managed to score 214 for four, with B Sai Sudharsan playing an outstanding innings of 96 runs off 47 balls. However, due to the rain interruption, CSK’s target was adjusted to 171 runs to be chased in 15 overs. IPL 2023-Full list of Award Winners • Eden Gardens and Wankhede Stadium share t

In [41]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

Created a chunk of size 1261, which is longer than the specified 1000
Created a chunk of size 1005, which is longer than the specified 1000
Created a chunk of size 1461, which is longer than the specified 1000
Created a chunk of size 1417, which is longer than the specified 1000
Created a chunk of size 1032, which is longer than the specified 1000
Created a chunk of size 1219, which is longer than the specified 1000
Created a chunk of size 1327, which is longer than the specified 1000
Created a chunk of size 1023, which is longer than the specified 1000
Created a chunk of size 1412, which is longer than the specified 1000
Created a chunk of size 1407, which is longer than the specified 1000
Created a chunk of size 1481, which is longer than the specified 1000
Created a chunk of size 1195, which is longer than the specified 1000
Created a chunk of size 1464, which is longer than the specified 1000
Created a chunk of size 1170, which is longer than the specified 1000
Created a chunk of s
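
These warnings appear because `CharacterTextSplitter` splits on a single separator (by default a blank line, `"\n\n"`) and then merges the pieces up to `chunk_size`. A piece that is already longer than `chunk_size` is kept whole. The sketch below imitates that behaviour in plain Python; it is a simplified illustration that ignores `chunk_overlap`, not LangChain's actual implementation, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks (no overlap handling)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut, so it may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Three paragraphs; the middle one is longer than the requested chunk size.
text = "a" * 30 + "\n\n" + "b" * 120 + "\n\n" + "c" * 40
chunks = split_on_separator(text, chunk_size=100)
print([len(c) for c in chunks])  # [30, 120, 40]
```

If uniform chunk sizes matter, `RecursiveCharacterTextSplitter` from `langchain.text_splitter` falls back through a list of separators, which typically avoids oversized chunks like these.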

In [42]:
len(texts)

227

In [28]:
open_ai_key

'sk-<redacted>'

In [43]:

embeddings = OpenAIEmbeddings(openai_api_key=open_ai_key)
docsearch = Chroma.from_documents(texts, embeddings)

In [44]:
docsearch

<langchain_community.vectorstores.chroma.Chroma at 0x7f8d4dd034f0>
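
Conceptually, the vector store keeps one embedding vector per chunk and answers a query by returning the chunks whose vectors are closest to the query's vector. The sketch below shows that idea with cosine similarity over made-up two-dimensional vectors. The vectors, the labels, and the `top_k` helper are assumptions for illustration; the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    """indexed is a list of (text, vector); return the k most similar texts."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy "embeddings": real ones have hundreds or thousands of dimensions.
indexed = [("cricket", [1.0, 0.0]), ("chess", [0.0, 1.0]), ("f1", [0.7, 0.7])]
print(top_k([0.9, 0.1], indexed, k=2))  # ['cricket', 'f1']
```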

In [45]:
model_name = "gpt-3.5-turbo"
llm = ChatOpenAI(openai_api_key=open_ai_key,
                 model_name=model_name, temperature=0)
retriever = docsearch.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=retriever, verbose=True)

The default chain_type="stuff" puts ALL of the text from the retrieved documents into a single prompt. With long documents this often exceeds the model's token limit and can cause rate-limiting errors.
In that case we have to use another chain type, for example "map_reduce".

    map_reduce: Splits the texts into batches (for example, you can set the batch size with llm=OpenAI(batch_size=5)), sends each batch to the LLM together with the question, and builds the final answer from the per-batch answers.

    refine: Splits the texts into batches, sends the first batch to the LLM, then sends that answer together with the second batch, and so on. The answer is refined as the chain works through all the batches.

    map_rerank: Splits the texts into batches, sends each batch to the LLM, which returns an answer with a score for how fully it answers the question, and picks the final answer from the highest-scoring batch answers.

    Reference: https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a
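
The difference between the chain types is mostly in how many LLM calls are made and what each prompt contains. The sketch below makes that concrete for "stuff", "map_reduce" and "refine". It is a simplified illustration, not LangChain's implementation: `llm` is assumed to be any callable that maps a prompt string to an answer string, and the prompt wording is made up.

```python
def stuff(llm, docs, question):
    """One call: every document goes into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce(llm, docs, question):
    """One call per document, then one more call to combine the partial answers."""
    partials = [llm(f"Context:\n{d}\n\nQuestion: {question}") for d in docs]
    summary = "\n".join(partials)
    return llm(f"Partial answers:\n{summary}\n\nQuestion: {question}")

def refine(llm, docs, question):
    """Sequential calls: each call sees the previous answer plus one more document."""
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for d in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context:\n{d}\n\nQuestion: {question}"
        )
    return answer

# Stand-in LLM that records every prompt it receives (an assumption for the demo).
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"answer {len(calls)}"

docs = ["doc one", "doc two", "doc three"]
stuff(fake_llm, docs, "who won?")
print(len(calls))  # 1
```

With three documents, "stuff" makes one large call, "map_reduce" makes four smaller calls, and "refine" makes three calls that must run in order. That is the trade-off between prompt size and number of requests.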


In [46]:
query = "who won the ipl 2023"
response = qa_chain.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [48]:
response

'Chennai Super Kings (CSK) won the IPL 2023.'

In [34]:
qa_chain2 = load_qa_chain(llm, chain_type="stuff", verbose=True)

In [35]:
matching_documents = docsearch.similarity_search(query)
matching_documents

[Document(page_content='Yashasvi Jaiswal, a player for the Rajasthan Royals, has been performing exceptionally well in the IPL 2023. During a match against the Kolkata Knight Riders at Eden Gardens, he set a new record by scoring the fastest 50 in IPL history, achieving this feat in just 13 balls. (Click here to read the article) Telangana’s Vuppala Prraneeth became India’s 82nd Grandmaster V. Prraneeth, a 15-year-old chess player from Telangana, achieved the title of Grandmaster, becoming the sixth from the state and the 82nd in India. (Click here to read the article)\n\nMax Verstappen wins the Miami Grand Prix 2023 World champion Max Verstappen powered from ninth on the grid to beat Red Bull team-mate Sergio Perez and win the Miami Grand Prix 2023. The victory extends Verstappen’s lead at the top of the standings and follows his triumph in the inaugural Miami race last year. (Click here to read the article)', metadata={'source': '/Users/a200031115/Documents/Ashutosh/Git/custom_QnA/da

In [36]:
response = qa_chain2.run(input_documents=matching_documents, question=query)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Yashasvi Jaiswal, a player for the Rajasthan Royals, has been performing exceptionally well in the IPL 2023. During a match against the Kolkata Knight Riders at Eden Gardens, he set a new record by scoring the fastest 50 in IPL history, achieving this feat in just 13 balls. (Click here to read the article) Telangana’s Vuppala Prraneeth became India’s 82nd Grandmaster V. Prraneeth, a 15-year-old chess player from Telangana, achieved the title of Grandmaster, becoming the sixth from the state and the 82nd in India. (Click here to read the article)

Max Verstappen wins the Miami Grand Prix 2023 World champion Max Verstappen powered from ninth on the grid to beat Red Bull tea

In [37]:
response

'Chennai Super Kings (CSK) won the IPL 2023.'
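
Both approaches in this notebook follow the same two steps: retrieve the matching chunks, then answer from them. `RetrievalQA` bundles the two, while `similarity_search` plus `load_qa_chain` runs them by hand. The sketch below shows that composition with plain callables. `make_retrieval_qa` and the two stub functions are hypothetical names used only for illustration.

```python
def make_retrieval_qa(retrieve, answer_from_docs):
    """Compose a retriever and a document-QA step into one callable."""
    def run(question):
        docs = retrieve(question)               # step 1: fetch matching chunks
        return answer_from_docs(docs, question)  # step 2: answer from those chunks
    return run

# Stubs standing in for the vector store and the LLM chain.
def stub_retrieve(question):
    return ["chunk A", "chunk B"]

def stub_answer(docs, question):
    return f"{len(docs)} chunks used for: {question}"

qa = make_retrieval_qa(stub_retrieve, stub_answer)
print(qa("who won the ipl 2023"))  # 2 chunks used for: who won the ipl 2023
```

Running the steps separately is useful for debugging, because the retrieved chunks can be inspected before they reach the LLM.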