<a href="https://colab.research.google.com/github/Anand-G-Murugan/LLM-PDF-QA/blob/main/Pdf_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF QA Bot using OpenAI, FAISS, and LangChain

* The program uses LangChain's text splitter to split the PDF's text into chunks.
* These chunks are embedded using a sentence-transformers model from Hugging Face (`all-MiniLM-L6-v2`).
* The vectors are then stored in a FAISS index.
* We then take a question from the user.
* The program uses vector similarity search to find the chunks of the PDF most relevant to the user's question.
* These chunks are sent to the LLM (OpenAI's GPT-3) along with the user's question.
* The LLM then generates an answer.
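The split, embed, and search steps above can be sketched in plain Python. This is only a toy illustration, not what the app runs: `split_with_overlap`, `embed`, `cosine`, and `top_k` are hypothetical helpers, the bag-of-words vector stands in for the Hugging Face embedding model, and the linear scan stands in for the FAISS index.

```python
import math
import re
from collections import Counter


def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Toy fixed-width splitter: consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=4):
    """Return the k chunks most similar to the question (linear scan, no index)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


example_chunks = [
    "FAISS stores vectors and supports nearest-neighbour search.",
    "Streamlit renders the upload widget and the question box.",
    "The PDF text is split into overlapping chunks.",
]
best = top_k("How is the PDF text split?", example_chunks, k=1)
print(best)
```

The overlap exists so that a sentence falling on a chunk boundary still appears whole in at least one chunk. A real embedding model also matches paraphrases, which this word-count stand-in cannot do.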

In [1]:
!pip install -q streamlit PyPDF2 python-dotenv faiss-cpu langchain altair openai tiktoken sentence_transformers


In [2]:
!npm install localtunnel

[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35msaveError[0m ENOENT: no such file or directory, open '/content/package.json'
[K[?25h[37;40mnpm[0m [0m[34;40mnotice[0m[35m[0m created a lockfile as package-lock.json. You should commit this file.
[0m[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35menoent[0m ENOENT: no such file or directory, open '/content/package.json'
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No description
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No repository field.
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No README data
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No license field.
[0m
+ localtunnel@2.0.2
added 22 packages from 22 contributors and audited 22 packages in 2.668s

3 packages are looking for funding
  run `npm fund` for details

found [92m0[0m vulnerabilities

[K[?25h

In [3]:
%%writefile app.py
# Hugging Face embeddings
# OpenAI LLM
# FAISS vector store

import os
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback


def main():
    load_dotenv()
    st.set_page_config(page_title="PDF QA")
    st.header("PDF QA")

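    # get OpenAI API key; only override the environment when a key is entered,
    # so a key loaded from .env by load_dotenv() is not replaced by an empty string
    api_key = st.text_input("Enter your OpenAI API key (sk-...)", type="password")
    if api_key:
      os.environ["OPENAI_API_KEY"] = api_key
      st.write("OpenAI key has been entered!")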

    # upload file
    pdf = st.file_uploader("Upload your PDF", type="pdf")

    # extract the text from the pdf
    if pdf is not None:
      pdf_reader = PdfReader(pdf)
      text = ""
      for page in pdf_reader.pages:
        text += page.extract_text() or ""  # tolerate pages with no extractable text

      # split into chunks
      text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
      )
      chunks = text_splitter.split_text(text)

      # define embedding function
      embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2") # embedding model from Hugging Face
      knowledge_base = FAISS.from_texts(chunks, embedding_function)

      # get user input
      user_question = st.text_input("Ask a question about your PDF:")
      if user_question:
        docs = knowledge_base.similarity_search(user_question)

        # selecting LLM
        llm = OpenAI() # by default -> GPT-3 davinci

        chain = load_qa_chain(llm, chain_type="stuff")
        with get_openai_callback() as cb:
          response = chain.run(input_documents=docs, question=user_question)
          print(cb)

        st.write(response)
        st.write(cb)


if __name__ == '__main__':
    main()

Writing app.py
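
In `app.py`, `load_qa_chain(llm, chain_type="stuff")` builds a chain that places all retrieved chunks into a single prompt together with the question. A rough sketch of that idea follows, assuming a hypothetical helper `build_stuff_prompt`; the template wording is illustrative and is not LangChain's actual prompt.

```python
def build_stuff_prompt(chunks, question):
    """Join every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Chunk one of the PDF.", "Chunk two of the PDF."],
    "What does the PDF say?",
)
print(prompt)
```

Because everything goes into one prompt, the combined length of the chunks plus the question has to fit in the model's context window, which is one reason the chunk size is kept modest.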


In [4]:
!streamlit run app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.145.200.10
[K[?25hnpx: installed 22 in 4.277s
your url is: https://brave-bees-tell.loca.lt


* Copy the endpoint IP (line 1 of the output above).
* Go to the link (line 3 of the output above).
* Enter the endpoint IP on that page.
* You're at the Streamlit application!
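
If the link does not load the app, one thing worth checking is whether Streamlit is actually listening on port 8501, the port passed to localtunnel above. Here is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and it assumes the check runs on the same machine as the Streamlit process.

```python
import socket


def port_is_open(host="127.0.0.1", port=8501, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    print("Streamlit reachable:", port_is_open())
```

If this prints `False`, the Streamlit output redirected to `/content/logs.txt` is the next place to look.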