# PDF Query QA System using LangChain

- Developing a QA system that answers questions about the study and lecture material from the CMPE258 Deep Learning class at SJSU.
- Using the Astra DB vector database, which is built on Apache Cassandra, to store the embeddings.
- Using an OpenAI LLM and OpenAI embeddings to query the PDF.
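
The bullets above describe a retrieval pipeline: text chunks are embedded, stored in a vector database, and the chunks closest to a question are retrieved as context for the LLM. A minimal, dependency-free sketch of the retrieval step is shown here. The `embed`, `cosine`, and `top_k` helpers and the sample chunks are hypothetical. Word counts stand in for OpenAI embeddings, and a plain Python list stands in for Astra DB.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical chunks, loosely modelled on the lecture topics.
chunks = [
    "resnet uses residual connections",
    "densenet encourages feature reuse",
    "googlenet uses inception modules",
]
results = top_k("what is resnet", chunks, k=1)
print(results[0][1])
```

In the notebook itself, the vector store performs the similarity search and the LLM turns the retrieved chunks into an answer.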

## Prerequisites

- Astra DB API key
- OpenAI API key

In [1]:
!pip install -q cassio datasets langchain openai tiktoken pyPDF2

In [2]:
from langchain.vectorstores.cassandra import Cassandra # Astra DB is built on Apache Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

from PyPDF2 import PdfReader
import cassio

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
ASTRA_DB_APPLICATION_TOKEN = '' ## Paste your own Astra DB application token as a string
ASTRA_DB_ID = '' ## Paste your own Astra DB database ID as a string

OPENAI_API_KEY = '' ## Paste your own OpenAI API key as a string
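
As an alternative to pasting keys into the notebook, the three credentials could be read from environment variables. This is a sketch only: the `read_secret` helper is hypothetical, and the environment variable names are an assumption chosen to match the Python variable names above.

```python
import os


def read_secret(name, default=""):
    """Return the environment variable `name` (stripped), or `default` if unset."""
    return os.environ.get(name, default).strip()


# Assumption: the environment variables share the names of the notebook variables.
ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
OPENAI_API_KEY = read_secret("OPENAI_API_KEY")

credentials = {
    "ASTRA_DB_APPLICATION_TOKEN": ASTRA_DB_APPLICATION_TOKEN,
    "ASTRA_DB_ID": ASTRA_DB_ID,
    "OPENAI_API_KEY": OPENAI_API_KEY,
}
# Report which credentials are still empty instead of failing later with an opaque error.
missing = [name for name, value in credentials.items() if not value]
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Keeping the keys out of the notebook source also avoids committing them by accident when the notebook is shared.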

In [6]:
pdfreader = PdfReader('/content/drive/MyDrive/CMPE258 Slides/CMPE258 1-13 merged.pdf')

In [7]:
# Concatenate the extracted text of every page; pages with no extractable text are skipped.
raw_text = ''
for page in pdfreader.pages:
  content = page.extract_text()
  if content:
    raw_text += content

In [8]:
raw_text

'CMPE 258 -01 \nDeep Learning\nDr. Kaikai Liu, Ph.D. Associate Professor\nDepartment of Computer Engineering\nSan Jose State University \nEmail: kaikai.liu@sjsu.edu\nWebsite: https://www.sjsu.edu/cmpe/faculty/tenure -\nline/kaikai -liu.phpSpring 2024What is Deep Learning\n•Deep Learning is a part of machine learning that deals with algorithms \ninspired by the structure and function of the human brain. It uses \nartificial neural networks to build intelligent models and solve complex \nproblems. \nWhat is Artificial Intelligence –old answer\n•AI textbooks list\n•http://aima.cs.berkeley.edu/2nd -ed/books.html\n•Artificial Intelligence: A Modern Approach\n•http://aima.cs.berkeley.edu/2nd -ed/books.html\n•http://aima.cs.berkeley.edu/newchap00.pdf\nWhat is Artificial Intelligence –old answer\nhttps://aima.cs.berkeley.edu•Peter Norvig is a Director of Research at Google Inc\n•http://www.norvig.com\n•SJSU EIAC Membership\nWhat is Artificial Intelligence –new answer\n•The Present advancements

In [9]:
cassio.init(token = ASTRA_DB_APPLICATION_TOKEN, database_id = ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(136747931850672) 467263e6-e41c-46a9-b7fc-fde12fc3c6aa-us-east1.db.astra.datastax.com:29042:9a6df4e1-13db-40ef-a4fa-23ead13c781f> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


In [10]:
llm = OpenAI(openai_api_key = OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key = OPENAI_API_KEY)



In [11]:
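astra_vector_store = Cassandra(
    embedding = embedding,
    table_name = 'qa_mini_demo',
    session = None,   # None: fall back to the session set up by cassio.init()
    keyspace = None   # None: fall back to the keyspace set up by cassio.init()
)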

In [12]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator = '\n',
    chunk_size = 900,
    chunk_overlap = 200,
    length_function = len
)

texts = text_splitter.split_text(raw_text)

In [13]:
texts[:10]

['CMPE 258 -01 \nDeep Learning\nDr. Kaikai Liu, Ph.D. Associate Professor\nDepartment of Computer Engineering\nSan Jose State University \nEmail: kaikai.liu@sjsu.edu\nWebsite: https://www.sjsu.edu/cmpe/faculty/tenure -\nline/kaikai -liu.phpSpring 2024What is Deep Learning\n•Deep Learning is a part of machine learning that deals with algorithms \ninspired by the structure and function of the human brain. It uses \nartificial neural networks to build intelligent models and solve complex \nproblems. \nWhat is Artificial Intelligence –old answer\n•AI textbooks list\n•http://aima.cs.berkeley.edu/2nd -ed/books.html\n•Artificial Intelligence: A Modern Approach\n•http://aima.cs.berkeley.edu/2nd -ed/books.html\n•http://aima.cs.berkeley.edu/newchap00.pdf\nWhat is Artificial Intelligence –old answer\nhttps://aima.cs.berkeley.edu•Peter Norvig is a Director of Research at Google Inc\n•http://www.norvig.com',
 '•http://aima.cs.berkeley.edu/newchap00.pdf\nWhat is Artificial Intelligence –old answer\n
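
The first two chunks above share some of their lines, which is the effect of `chunk_overlap`. The sketch below illustrates the idea with a simplified line-based splitter. It is not LangChain's implementation: `split_with_overlap` is a hypothetical helper, and the small `chunk_size` and `chunk_overlap` values are chosen only to keep the example readable.

```python
def split_with_overlap(text, separator="\n", chunk_size=14, chunk_overlap=5):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap trailing characters
    (whole pieces only) into the next chunk. A single piece longer than
    chunk_size is emitted as its own oversized chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits
            # within chunk_overlap and leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd\neeee")
for chunk in demo_chunks:
    print(repr(chunk))
```

Overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.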

In [15]:
astra_vector_store.add_texts(texts)
print('Inserted %i headlines.' % len(texts))

astra_vector_index = VectorStoreIndexWrapper(vectorstore = astra_vector_store)

Inserted 241 headlines.
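
Re-running the insert cell can add the same chunks a second time, which may be why one passage appears twice with an identical score in the results further down. One possible guard, sketched here, is to derive a deterministic ID from each chunk's content and drop repeated chunks before inserting. `chunk_id` and `unique_chunks` are hypothetical helpers, and passing `ids=` to `add_texts` is an assumption about the installed LangChain version that should be checked against its documentation.

```python
import hashlib


def chunk_id(text):
    """Deterministic ID: the first 32 hex characters of the chunk's SHA-256 digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


def unique_chunks(texts):
    """Return (ids, texts) with repeated chunks removed, preserving first-seen order."""
    seen = {}
    for text in texts:
        seen.setdefault(chunk_id(text), text)
    return list(seen.keys()), list(seen.values())


ids, unique = unique_chunks(["alpha", "beta", "alpha"])
print(len(unique))

# Assumption: the vector store accepts explicit ids, so a re-run overwrites
# existing rows instead of adding new ones.
# astra_vector_store.add_texts(unique, ids=ids)
```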


In [16]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): what is resnet

QUESTION: "what is resnet"

ANSWER: "ResNet stands for Residual Network, which is a type of deep neural network architecture that uses residual connections to facilitate training of very deep networks. It was first introduced in 2015 and has since become a popular choice for image recognition and classification tasks, often outperforming other architectures. ResNet is known for its ability to train very deep networks (up to 152 layers) without suffering from the vanishing gradient problem."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8874] "top 5 error)
•Swept all classification and  detection 
competitions in  ILSVRC’15 an ..."
    [0.8837] "•ResNet  V2 (bottom) avoids all 
nonlinearities in the residual 
pathway 
•Up to 152 ..."
    [0.8723] "sequential convolutions  respectively. This improves computational speed. This is 
t ..."
    [0.8718] "def resnet50(*, weights: Optional[ResNet50_Weights] = None, progress: bool = True, * ..."

What's your next question (or type 'quit' to exit): what is the core idea behind den

ANSWER: "The core idea behind DenseNet is feature reuse, where the input of each layer consists of the feature maps of all earlier layers, and its output is passed to each subsequent layer. This encourages feature reuse and makes the network highly parameter-efficient."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9067] "CNN
•Paper: https://arxiv.org/abs/1608.06993
•The input of each layer consists of th ..."
    [0.8963] "whilst keeping in mind VGG’s philosophy of 
repeating blocks and branches with ident ..."
    [0.8817] "a result it requires fewer parameters than other CNNs, as there are no repeated 
fea ..."
    [0.8808] "ImageNet classification with deep convolutional 
neural networks , Neural Informatio ..."

What's your next question (or type 'quit' to exit): what are auxiliary classifiers in googlenet

QUESTION: "what are auxiliary classifiers in googlenet"

ANSWER: "Auxiliary classifiers in GoogLeNet are additional classifiers added to the intermediate layers of the architecture, specifically the third (Inception 4[a]) and sixth (Inception4[d]) layers. They are only used during training and removed during inference. Their purpose is to perform classification based on the inputs within the network's midsection and add the calculated loss back to the total loss of the network. This helps prevent vanishing gradient descent and overfitting in extensive networks. In GoogLeNet, there are a total of 9 inception modules, with 22 layers in total, and the input layer takes in an image of dimension 224x224. The "Inception" module design involves using kernels of various sizes in parallel to extract both bigger and smaller features simultaneously, increasing the width of the model instead of depth."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8918] "architecture, namely the third(Inception 4[a]) and sixth (Inception4[d])
•Auxilary   ..."
    [0.8915] "•Goo

QUESTION: "quuit"

ANSWER: "I don't see any mention of quitting or dropping the class in the given context. It seems like the policies and protocols mentioned are regarding assignments, grades, and conduct in the classroom. If you are considering quitting the class, I suggest speaking with the instructor directly to discuss your options."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8685] "acceptable to turnitin (such as WORD and PDF); otherwise the reports will be 
consid ..."
    [0.8685] "acceptable to turnitin (such as WORD and PDF); otherwise the reports will be 
consid ..."
    [0.8637] "Website: https://www.sjsu.edu/cmpe/faculty/tenure -
line/kaikai -liu.phpSpring 2024D ..."
    [0.8632] "required policies, which means what you perceive as a low score by summing up 
all y ..."

What's your next question (or type 'quit' to exit): quit
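
In the session above, the mistyped `quuit` was sent to the LLM as a question because the loop only exits on an exact match with `quit`. A tolerant check could catch such near misses. The sketch below uses the standard library's `difflib`. The `is_quit_command` helper, the `QUIT_WORDS` tuple, and the `0.8` cutoff are hypothetical choices, not part of the original notebook.

```python
import difflib

# Hypothetical set of commands that should end the session.
QUIT_WORDS = ("quit", "exit")


def is_quit_command(text, cutoff=0.8):
    """Return True if `text` is a single word that closely matches a quit command."""
    word = text.strip().lower()
    # Multi-word input is treated as a real question, never as a command.
    if not word or " " in word:
        return False
    matches = difflib.get_close_matches(word, QUIT_WORDS, n=1, cutoff=cutoff)
    return bool(matches)


for candidate in ["quuit", " QUIT ", "what is resnet"]:
    print(repr(candidate), is_quit_command(candidate))
```

In the loop, `if is_quit_command(query_text): break` would replace the exact comparison. The trade-off is that a one-word question resembling `quit` would also end the session, so asking for confirmation before exiting may be the safer variant.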
