<a href="https://colab.research.google.com/github/Khalidsid/chatgpt-retrieval/blob/main/Multiple_PDF_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [None]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken

Collecting langchain
  Downloading langchain-0.0.223-py3-none-any.whl (1.2 MB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.9-py3-none-any.whl (26 kB)
Collecting langchainplus-sdk<0.0.21,>=0.0.20 (from langchain)
  Downloading langchainplus_sdk-0.0.20-py3-none-any.whl (25 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain)
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)
Collecting marshmallow<4.0.0,>=3.3.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain)
  Downloading marshmallow-3.19.0-py3-none-any.whl (49 kB)

### Load Required Packages

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

### OpenAI API Key

In [None]:
# Get your API key from OpenAI; you will need to create an account first.
# Link to your OpenAI account: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = "YOUR KEY"
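
A key written directly into a notebook is easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the key is either already set in the environment or typed in interactively. `resolve_api_key` is a hypothetical helper written for illustration; it is not part of the `openai` or `langchain` packages.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from `env`, prompting for it only if it is missing.

    Illustrative helper (an assumption of this sketch), not a library function.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed key so it is not echoed into the cell output.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key


# In the notebook you would then run:
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without a real key.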

### Connect Google Drive

In [None]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


In [None]:
pdf_folder_path = os.path.join(root_dir, 'data')  # root_dir already ends with '/', so join avoids a double slash
os.listdir(pdf_folder_path)

['1-s2.0-S0378775305016502-main.pdf',
 'NIPS-2017-attention-is-all-you-need-Paper.pdf']

### Load Multiple PDF files

In [None]:
# Build one loader per PDF in the folder; non-PDF files are skipped.
loaders = [
    UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn))
    for fn in sorted(os.listdir(pdf_folder_path))
    if fn.lower().endswith('.pdf')
]

In [None]:
loaders  # one loader per PDF file in the folder

[<langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7fefbda6bb20>,
 <langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7fefbcbb5f60>]

### Vector Store
Chroma is used as the vectorstore to index and search the embeddings.


After the documents are loaded, `VectorstoreIndexCreator` performs three main steps:

- Splitting the documents into chunks

- Creating an embedding for each chunk

- Storing the chunks and their embeddings in a vectorstore
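
The three steps above can be sketched in plain Python. This is a toy illustration under loud assumptions, not what `VectorstoreIndexCreator` actually runs: the splitter is a fixed-size character window, the "embedding" is a bag-of-words count vector standing in for OpenAI embeddings, and `ToyVectorStore` is a hypothetical in-memory stand-in for Chroma. The file names are made up.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=0):
    """Step 1: cut a long text into windows of at most `chunk_size` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Step 2 (toy stand-in): word counts instead of a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Step 3: keep each chunk next to its vector and search by similarity."""

    def __init__(self):
        self.entries = []  # list of (chunk, vector, source)

    def add(self, chunks, source):
        for chunk in chunks:
            self.entries.append((chunk, embed(chunk), source))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [(chunk, source) for chunk, _, source in ranked[:k]]


store = ToyVectorStore()
store.add(split_into_chunks("The Transformer relies entirely on attention.", 60), "paper_a.pdf")
store.add(split_into_chunks("Hybrid and electric vehicles are compared on cost.", 60), "paper_b.pdf")
top_chunk, top_source = store.search("attention in the Transformer")[0]
```

Keeping the source next to each chunk is what lets a query report where its answer came from.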


In [None]:
index = VectorstoreIndexCreator().from_loaders(loaders)
index

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7fef8d6934f0>)

In [None]:
index.query('What is the summary of the paper?')

' This paper evaluates economic and environmental indicators for vehicle production and utilization stages and compares four kinds of vehicles: conventional, hybrid, electric and hydrogen fuel cell. The purpose of the paper is to provide information to assist in the design and development of a contemporary light-duty car with superior economic and environmental attributes.'

In [None]:
pdf_folder_path = '/content/gdrive/My Drive/data/'
os.listdir(pdf_folder_path)

['1-s2.0-S0378775305016502-main.pdf',
 'NIPS-2017-attention-is-all-you-need-Paper.pdf']

In [None]:
# Build one loader per PDF in the folder; non-PDF files are skipped.
loaders = [
    UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn))
    for fn in sorted(os.listdir(pdf_folder_path))
    if fn.lower().endswith('.pdf')
]
index = VectorstoreIndexCreator().from_loaders(loaders)
index

VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7fef789fffa0>)

In [None]:
index.query('What are the titles of both the papers?')

' The titles of the papers are Economic and Environmental Comparison of Conventional, Hybrid, Electric and Hydrogen Fuel Cell Vehicles and Application of Oxygen Ion-Conductive Membranes for Simultaneous Electricity and Hydrogen Generation.'

In [None]:
index.query_with_sources('How was the transformer model trained?')

{'question': 'How was the transformer model trained?',
 'answer': ' The Transformer model was trained using dropout rate Pdrop = 0.1, beam search with a beam size of 4 and length penalty α = 0.6, and checkpoint averaging.\n',
 'sources': '/content/gdrive/My Drive/data/NIPS-2017-attention-is-all-you-need-Paper.pdf'}
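
As the output above shows, `query_with_sources` returns a plain dictionary whose `answer` carries leading and trailing whitespace. Below is a small sketch for tidying it before display. It assumes that `sources` is a single string and that several sources, if present, would be comma-separated (so file names containing commas would be split incorrectly). `tidy_result` is a hypothetical helper, and the answer text in the example is shortened.

```python
import os


def tidy_result(result):
    """Strip whitespace from the answer and turn `sources` into a list of file names."""
    raw_sources = result.get("sources", "")
    # Assumption: multiple sources arrive as one comma-separated string.
    names = [os.path.basename(s.strip()) for s in raw_sources.split(",") if s.strip()]
    return {
        "question": result["question"].strip(),
        "answer": result["answer"].strip(),
        "sources": names,
    }


# Example dictionary shaped like the output above (answer shortened).
example = {
    "question": "How was the transformer model trained?",
    "answer": " The Transformer model was trained using dropout rate Pdrop = 0.1.\n",
    "sources": "/content/gdrive/My Drive/data/NIPS-2017-attention-is-all-you-need-Paper.pdf",
}
tidy = tidy_result(example)
```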

In [None]:
index.query_with_sources('Who wrote the battery paper? ')

{'question': 'Who wrote the battery paper? ',
 'answer': ' Mikhail Granovskii, Ibrahim Dincer, and Marc A. Rosen wrote the battery paper.\n',
 'sources': '/content/gdrive/My Drive/data/1-s2.0-S0378775305016502-main.pdf'}

In [None]:
index.query_with_sources('How was the GPT4all model trained?')

{'question': 'How was the GPT4all model trained?',
 'answer': ' The GPT4all model was trained with LoRA on 437,605 post-processed examples for four epochs. Detailed model hyper-parameters and training code can be found in the associated repository and model training log.\n',
 'sources': '/content/gdrive/My Drive/data_2/2023_GPT4All_Technical_Report.pdf'}

In [None]:
index.query_with_sources('Who wrote the lip sync paper? ')

{'question': 'Who wrote the lip sync paper? ',
 'answer': ' The lip sync paper was written by K R Prajwal, Vinay P. Namboodiri, Rudrabha Mukhopadhyay, and C V Jawahar.\n',
 'sources': '/content/gdrive/My Drive/data_2/2008.10010.pdf'}
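
The last two answers cite files under `data_2`, while `pdf_folder_path` in the cells above points at `data`. Below is a minimal sketch for flagging cited sources that fall outside the folder you believe was indexed. It assumes `sources` is a comma-separated string of POSIX-style paths; `sources_outside` is a hypothetical helper.

```python
import posixpath


def sources_outside(result, folder):
    """Return the cited source paths that do not live directly in `folder`."""
    folder = posixpath.normpath(folder)
    cited = [s.strip() for s in result.get("sources", "").split(",") if s.strip()]
    return [s for s in cited if posixpath.dirname(posixpath.normpath(s)) != folder]


# Dictionary shaped like the output above (answer omitted).
result = {
    "question": "Who wrote the lip sync paper? ",
    "sources": "/content/gdrive/My Drive/data_2/2008.10010.pdf",
}
unexpected = sources_outside(result, "/content/gdrive/My Drive/data/")
```

A non-empty `unexpected` list suggests the index was built from a different folder than the one currently assigned to `pdf_folder_path`.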

## Disclaimer:
Note: OpenAI provides a free API key for initial testing. Once you move to a paid subscription, calling the API as shown in this example will incur charges. Refer to OpenAI's pricing information for details.

Be aware that information you submit, such as files passed to OpenAI's LLM, can become public when used the way this demo shows. Refer to OpenAI's usage policy for details.

Do not use for actual tax filing purposes. This demo is for educational purposes only and for demonstrating machine learning methods. The author makes no claims that the outcomes shown here or any outcomes that could be produced by this method are accurate or reliable.

In [None]:
index.query_with_sources('Summarize the battery paper with problem statement and results. ')

{'question': 'Summarize the battery paper with problem statement and results. ',
 'answer': ' This paper compares the economic and environmental performance of conventional, hybrid, electric and hydrogen fuel cell vehicles. It concludes that an electric car with capability for on-board electricity generation is a beneficial option worthy of further investigation. The paper also presents an optimization to obtain the optimal relationship between capacities of batteries and a gas turbine engine, and suggests that if electricity is generated with an efficiency of 50-60%, the electric car becomes superior.\n',
 'sources': '/content/gdrive/My Drive/data/1-s2.0-S0378775305016502-main.pdf'}

In [None]:
index.query_with_sources('Summarize the Attention is all you need paper with problem and result summary. ')

{'question': 'Summarize the Attention is all you need paper with problem and result summary. ',
 'answer': ' The Attention is All You Need paper proposes the Transformer, a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.\n',
 'sources': '/content/gdrive/My Drive/data/NIPS-2017-attention-is-all-you-need-Paper.pdf'}