<a href="https://colab.research.google.com/github/Rukaya-lab/OpenAI_ex/blob/main/Quering_Multiple_PDF_from_Vector_Store_using_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken

In [None]:
!apt-get install poppler-utils

!pip install pdf2image
!pip install pytesseract

In [None]:

!pip install glove

### Load Helper Libraries


In [None]:
import os


In [9]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

In [21]:

# Get your API key from OpenAI; you will need to create an account first.
# Account link: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = "Use Your Open AI API Key"
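Hardcoding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you are happy to type the key into a hidden prompt when it is not already set; `resolve_api_key` is a hypothetical helper name, not part of OpenAI or LangChain:

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from `env`, prompting only if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Hidden prompt, so the key is not echoed or stored in the notebook.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("An OpenAI API key is required.")
    return key
```

In the notebook this would be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.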

In [11]:
# Folder on Google Drive that holds the PDFs (assumes Drive is mounted at /content/drive)

root_dir = "/content/drive/MyDrive/Original"

In [12]:
pdf_folder_path = f'{root_dir}/'
os.listdir(pdf_folder_path)

['VMD0036.pdf', 'VMD0037.pdf', 'VT0004.pdf']

In [13]:
# Create one loader per file in the PDF folder.
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]
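The list comprehension above creates a loader for every entry that `os.listdir` returns, so a stray non-PDF file or subfolder would also get a loader. A small sketch of a filter, assuming only files ending in `.pdf` should be loaded; `list_pdf_paths` is a hypothetical helper name:

```python
import os


def list_pdf_paths(folder):
    """Return the sorted full paths of the .pdf files directly inside `folder`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        # Match the extension case-insensitively and skip directories.
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(folder, name))
    )
```

The loaders could then be built with `[UnstructuredPDFLoader(p) for p in list_pdf_paths(pdf_folder_path)]`.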

In [14]:
loaders

[<langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7fd471943700>,
 <langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7fd471943790>,
 <langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7fd470c2fc40>]

### Vector Store as a Wrapper

Chroma is used as the vector store to index and search the embeddings.

Three main steps happen after the documents are loaded:

1. Splitting the documents into chunks
2. Creating an embedding for each chunk
3. Storing the chunks and their embeddings in a vector store
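To make the three steps concrete, here is a toy sketch in plain Python. It is not what Chroma or OpenAI do internally: the "embedding" is just a word-count vector, and `split_into_chunks`, `embed`, `cosine` and `ToyVectorStore` are hypothetical names used only for illustration.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Step 1: split on blank lines (a stand-in for a real text splitter)."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def embed(text):
    """Step 2: toy embedding, a sparse word-count vector (not a neural embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Step 3: keep (chunk, vector) pairs and search them by similarity."""

    def __init__(self):
        self.rows = []

    def add(self, chunks):
        self.rows.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore()
store.add(split_into_chunks("Leakage is estimated per commodity.\n\nForests store carbon."))
print(store.search("How much carbon do forests store?"))
```

A real vector store replaces the word counts with dense embeddings and the linear scan with an index, but the flow of split, embed, store and search is the same.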


### VectorstoreIndexCreator is just a wrapper around all this logic

Written out explicitly, it amounts to this:

  

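```
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
)
```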



In [22]:
index = VectorstoreIndexCreator().from_loaders(loaders)
index

VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7fd42895f910>)

In [32]:
query = "what are the applicability conditions"
index.query(query)

' This module is applicable to jurisdictional programs seeking to estimate a global commodity leakage value as referenced by the VCS JNR Leakage Tool. The module is applicable under the following conditions: jurisdictional proponent may follow the method to determine the area of production subject to leakage in accordance VCS module VMD0036 Global Commodity Leakage Module: Effective Area Approach or follow the method to determine the amount of production subject to leakage in accordance with VCS module VMD0037 Global Commodity Leakage Module: Production Approach, or justify an alternative method to demonstrate the jurisdictional program is substantially maintaining production. To qualify for mitigation criterion (b), provide evidence that the production of relevant domestic commodities is substantially maintained or that the jurisdictional program does not impact the production of relevant domestic commodities within the jurisdiction. To qualify for mitigation criterion (c), provide ev

### Long Route

Instead of using the VectorstoreIndexCreator wrapper, we can run each step ourselves.

In [33]:
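# Load every PDF and collect the resulting documents in one list.
documents = [doc for loader in loaders for doc in loader.load()]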

In [35]:
# Next, split the documents into chunks.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
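The `chunk_size` and `chunk_overlap` arguments are easier to picture with a tiny example. The sketch below is a simplified fixed-window splitter written for illustration only; LangChain's `CharacterTextSplitter` works differently in detail (it splits on a separator and merges the pieces), and `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4))
# ['abcd', 'efgh', 'ij']
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_overlap=0`, as used in this notebook, no text is repeated between chunks; a non-zero overlap repeats some text so that a sentence cut at a boundary still appears whole in one of the chunks.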

In [36]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [37]:
# Now create the vector store to use as the index.

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)

In [39]:
# That completes the index. Next, expose it through a retriever interface.

retriever = db.as_retriever()

#### Create the chain to answer questions

In [44]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
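The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM in one call. A rough sketch of the idea follows; the prompt wording is made up for illustration and is not LangChain's actual template, and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(question, chunks):
    """Join every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt("What is leakage?", ["First chunk.", "Second chunk."]))
```

Because everything goes into one prompt, this approach is simple and needs only one LLM call, but the retrieved chunks together with the question must fit within the model's context window.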


In [45]:
query = "What are the analysis of leakage categories?"
qa.run(query)

' The analysis of leakage categories includes observed production for commodity j within the jurisdiction, proportion of leakage resulting in increased supply outside the jurisdiction, proportion of increased supply coming from new land brought into production, share of unaccounted leakage within the country, share of recent global deforestation (and degradation) emissions occurring within the country, share of global forest carbon stocks at-risk of deforestation (and degradation) existing within the country, amount of deforestation (and degradation) already accounted for by other jurisdictional REDD+ programs within the country, and average carbon stocks of forests within the country.'

### Conclusion

We have now loaded multiple PDF files from a folder, built a vector store from chunks of their text and the corresponding embeddings, and queried that store to retrieve answers.
