## LangChain Demo
Author: Payam Mousavi
Last updated: April 17, 2023
Ideas were borrowed from https://github.com/gkamradt/langchain-tutorials/

This tutorial loads a relatively large PDF file from the web, splits it into chunks, creates embeddings for the chunks, stores them in a vector database (Pinecone), and uses the OpenAI API to query the document. A simple tkinter GUI is created to interact with the document.

In [1]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
import pinecone
from dotenv import load_dotenv

# Pinecone environment (region) that hosts the index:
PINECONE_API_ENV = "northamerica-northeast1-gcp"

# Load variables (e.g., API keys) from a local .env file into the environment.
# load_dotenv() returns a bool, not the loaded values.
config = load_dotenv()

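Because the setup cell relies on `load_dotenv()` to supply credentials, it can help to check that the expected variables are present before any API call is made. The following is a minimal sketch, not part of the original notebook: the helper `missing_env_vars` is hypothetical, and the variable names `OPENAI_API_KEY` and `PINECONE_API_KEY` are assumptions about what the `.env` file contains.

```python
import os

# Assumed variable names; adjust to match the keys in your own .env file.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
    else:
        print("All required environment variables are set.")
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.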


## Loading the data and chunking it:

In [2]:
# Load a local copy of the PDF (OnlinePDFLoader accepts a URL instead):
loader = UnstructuredPDFLoader("./data/Emergent_Abilities.pdf")
data = loader.load()
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

# Chunking: split into pieces of at most 1000 characters, with no overlap between pieces.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
print(f'Now you have {len(texts)} documents')

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.


You have 1 document(s) in your data
There are 97646 characters in your document
Now you have 103 documents
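
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-width chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# A placeholder string with the same length as the document loaded above.
sample = "a" * 97646
print(len(chunk_text(sample, chunk_size=1000, chunk_overlap=0)))
```

A purely fixed-width split of 97,646 characters yields 98 chunks. The notebook reports 103, which is consistent with the splitter breaking at natural boundaries and therefore producing some chunks shorter than 1000 characters.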


## Creating embeddings and storing them in Pinecone:

In [3]:
# Creating embeddings:
embeddings = OpenAIEmbeddings()

# Initialize Pinecone:
pinecone.init(environment=PINECONE_API_ENV)
index_name = "langchan-demo"

# Embed the chunks and store them in the Pinecone index, which is later searched by cosine similarity:
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
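
The idea behind the vector store can be illustrated without any external service. The sketch below ranks stored chunks by cosine similarity to a query vector using plain Python. The three-dimensional vectors, the chunk labels, and the helper names `cosine_similarity` and `top_k` are all made up for illustration; real OpenAI embeddings have far more dimensions, and Pinecone performs this search on its own servers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "index": (chunk text, embedding) pairs with hand-written vectors.
stored = [
    ("chunk about scaling laws", [1.0, 0.0, 0.0]),
    ("chunk about chess", [0.0, 1.0, 0.0]),
    ("chunk about limitations", [0.6, 0.8, 0.0]),
]

results = top_k([0.0, 1.0, 0.1], stored, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Cosine similarity compares direction rather than magnitude, which is why it is a common choice for comparing text embeddings.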

## Running queries and getting answers from OpenAI:

In [4]:
# query = "What are examples of some emergent abilities of large language models?"
# docs = docsearch.similarity_search(query, include_metadata=True)

query = "What are some limitations of large language models?"
docs = docsearch.similarity_search(query, include_metadata=True)

llm = OpenAI(temperature=0, model_name="text-davinci-003")
chain = load_qa_chain(llm, chain_type="stuff")

output = chain.run(input_documents=docs, question=query)
print(output)


 Large language models cannot perform many tasks with above-random accuracy, such as abstract reasoning tasks like playing Chess and challenging math. They also have difficulty with multilingual emergence tasks, requiring both model scale and training data to perform well.
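
The `"stuff"` chain type used above places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording and the helper `build_stuff_prompt` are hypothetical and are not LangChain's actual template; the chunk strings are placeholders.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the documents returned by the similarity search.
retrieved = [
    "First retrieved chunk of the paper.",
    "Second retrieved chunk of the paper.",
]
prompt = build_stuff_prompt(retrieved, "What are some limitations of large language models?")
print(prompt)
```

Because every chunk goes into one prompt, the combined text has to fit within the model's context window. That is one reason to keep both the chunk size and the number of retrieved chunks small.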
