# Querying a PDF Using Astra DB and LangChain

First, create a database in Astra DB. Once it exists, we can start.

### Importing Necessary Libraries

In [1]:
from langchain_openai import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from datasets import load_dataset  # Loads datasets from Hugging Face
import cassio  # Integrates Astra DB with LangChain and initializes the DB connection



In [2]:
from PyPDF2 import PdfReader  # Reads text from PDF files

### Setup

In [3]:
import os
astra_api = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_ID = "b0533966-3a51-434b-8371-e28d8f5c401d"
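
`os.getenv` returns `None` when a variable is missing, and that `None` would only surface later as a connection error. A minimal sketch of a fail-early check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Hypothetical usage: fail here instead of passing None to cassio.init() later.
# astra_api = require_env("ASTRA_DB_APPLICATION_TOKEN")
```

The same helper could guard `OPENAI_API_KEY` before the LLM and embedding objects are created.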


In [4]:
# Let's read our document
pdfreader = PdfReader('Document/USMLERxStep1.pdf')

In [5]:
# Read the text of every page in the PDF into one string
raw_text = ""
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [6]:
print(raw_text)

FOR  
THE® 
New York / Chicago / San Francisco / Athens / London / Madrid / Mexico City  
Milan / New Delhi / Singapore / Sydney / TorontoUSMLE  
STEP 1  
2023FIRST  AID
TAO LE, MD, MHS
Founder, ScholarRx
Associate Clinical Professor, Department of MedicineUniversity of Louisville School of MedicineVIKAS BHUSHAN, MD
Founder, First Aid for the USMLE Step 1Boracay, Philippines
CONNIE QIU, MD, P hD
Resident, Department of Dermatology
Johns Hopkins Hospital
PANAGIOTIS KAPARALIOTIS, MD
University of Athens Medical School, Greece 
KIMBERLY KALLIANOS, MD
Assistant Professor, Department of Radiology and Biomedical ImagingUniversity of California, San Francisco School of MedicineANUP CHALISE, MBBS, MS, MRCSE d
Kathmandu, Nepal 
CAROLINE COLEMAN, MD
Resident, Department of MedicineEmory University School of Medicine
FAS1_2023_00_Frontmatter.indd   1FAS1_2023_00_Frontmatter.indd   1 11/18/22   5:38 PM11/18/22   5:38 PMFirst Aid for the® USMLE Step 1 2023: A Student-to-Student Guide 
Copyright © 2

### Connecting to Astra DB

In [7]:
cassio.init(token=astra_api, database_id=ASTRA_DB_ID)

DriverException: Unable to connect to the metadata service at https://b0533966-3a51-434b-8371-e28d8f5c401d-us-east1.db.astra.datastax.com:29080/metadata. Check the cluster status in the cloud console. 

In [None]:
# Create the LLM and the embedding model

llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
embeddings = OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY"))

In [None]:
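# Create the LangChain vector store backed by Astra DB

astra_vector_db = Cassandra(
    embedding=embeddings,
    table_name="qa_mini_demo",
    session=None,   # None: fall back to the session set up by cassio.init()
    keyspace=None   # None: fall back to the keyspace set up by cassio.init()
)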

In [None]:
# Split the text into chunks with CharacterTextSplitter so that each chunk stays within the model's token limit
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=2325,
    chunk_overlap=200,
    length_function=len
)
texts = text_splitter.split_text(raw_text)
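
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is a plain fixed-width sliding window over characters, not LangChain's separator-aware algorithm, and the function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end of the text
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.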

In [None]:
texts[1]

'First Aid for the® is a registered trademark of McGraw Hill. 1 2 3 4 5 6 7 8 9   LMN   27 26 25 24 23 22ISBN 978-1-264-94662-4\nMHID 1-264-94662-7\nNotice \nMedicine is an ever-changing science. As new research and clinical experience broaden our knowledge, changes in treatment and drug therapy are required. The authors and the publisher of this work have checked with sources believed to be reliable in their efforts to provide information that is complete and generally in accord with the standards accepted at the time of publication. However, in view of the possibility of human error or changes in medical sciences, neither the authors nor the publisher nor any other party who has been involved in the preparation or publication of this work warrants that the information contained herein is in every respect accurate or complete, and they disclaim all responsibility for any errors or omissions or for the results obtained from use of the information contained in this work. Readers are enc

### Loading the Chunks and Their Embeddings into the Vector DB

In [None]:
astra_vector_db.add_texts(texts)
print("Inserted %i chunks" % len(texts))
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_db)

Inserted 1053 chunks
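
Conceptually, `add_texts` runs each chunk through the embedding model and stores the resulting vector next to the text, so later queries can be ranked by vector similarity. The toy in-memory store below sketches that idea only. `ToyVectorStore`, `toy_embed`, and `cosine` are hypothetical names, the bag-of-words "embedding" is a stand-in for `OpenAIEmbeddings`, and cosine similarity is assumed as the scoring function.

```python
import math
from collections import Counter
from typing import List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts (NOT what OpenAIEmbeddings produces)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) rows in a list and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Counter]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self._rows.append((text, toy_embed(text)))

    def similarity_search_with_score(self, query: str, k: int = 2) -> List[Tuple[str, float]]:
        query_vec = toy_embed(query)
        scored = [(text, cosine(query_vec, vec)) for text, vec in self._rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        return scored[:k]


store = ToyVectorStore()
store.add_texts([
    "reflux causes heartburn",
    "the heart pumps blood",
    "barrett esophagus follows reflux",
])
best_text, best_score = store.similarity_search_with_score("barrett esophagus", k=1)[0]
print(best_text)  # barrett esophagus follows reflux
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, but the embed, store, and rank steps are the same in outline.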


### Testing

Run the cell below and ask a question, or type "quit" to stop. (You can also stop execution with the "▪" button on the top toolbar.)

In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_db.similarity_search_with_score(query_text, k=2):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content))


QUESTION: "What is darret's esophagus?"
ANSWER: "Barrett's esophagus is a condition in which the cells of the esophagus change and become similar to the cells of the intestines. It is often caused by chronic gastroesophageal reflux disease (GERD) and is associated with an increased risk of esophageal adenocarcinoma."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8988] "Other esophageal pathologies
Gastroesophageal 
reflux diseaseTransient decreases in LES tone. Commonly presents as heartburn , regurgitation , dysphagia . May 
also present as chronic cough , hoarseness  (laryngopharyngeal  reflux). Associated with asthma. 
Complications include erosive esophagitis, strictures, and Barrett esophagus.
Esophagitis Inflammation of esophageal mucosa. Presents with odynophagia and/or dysphagia. Ty pes:
 Reflu
x (erosive) esophagitis —most common type. 2° to GERD.
 Me
dication-induced esophagitis —2° to bisphosphonates, tetracyclines, NSAIDs, ferrous 
sulfate, potassium chloride.
 Eo
sinophilic es
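
The loop prints each matching chunk in full, which can run to a couple of thousand characters. A small helper like the sketch below could shorten each chunk to a one-line preview; the name `preview` and the default width of 80 are assumptions, not part of the original notebook.

```python
def preview(text: str, width: int = 80) -> str:
    """Collapse whitespace and cut `text` to at most `width` characters plus a ' ...' marker."""
    flat = " ".join(text.split())  # newlines and repeated spaces become single spaces
    if len(flat) <= width:
        return flat
    return flat[:width].rstrip() + " ..."


print(preview("Other esophageal pathologies\nGastroesophageal \nreflux disease", width=28))
# Other esophageal pathologies ...
```

In the loop, `doc.page_content` would be passed through `preview(...)` before printing.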
