# Quickstart: Querying a PDF With Astra DB and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As described in the [Astra DB vector search quickstart](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), create a DB Token with the _Database Administrator_ role and copy your Database ID; you will need both in the Setup section.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

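- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Load: extract the text of a PDF, split it into chunks, and store the chunks in the vector store;
- Run a question-answering loop that retrieves the relevant text chunks and has an LLM construct the answer.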

Install the required dependencies:

Import the packages you'll need:

In [1]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms.openai import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [2]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [None]:
ASTRA_DB_APPLICATION_TOKEN = ""  # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = ""  # enter your Database ID

OPENAI_API_KEY = ""  # enter your OpenAI key
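
Hard-coding secrets in a notebook makes them easy to leak when the file is shared. As one possible alternative, the sketch below reads a value from an environment variable and falls back to an interactive prompt when it is unset. The helper name `read_secret` is hypothetical (it is not part of CassIO or LangChain), and it assumes the environment variables carry the same names as the variables above.

```python
import os
from getpass import getpass


def read_secret(name, env=None, prompt=getpass):
    """Return the secret called `name` from `env`, or prompt for it if unset.

    Hypothetical helper for illustration; not part of cassio or LangChain.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # Nothing in the environment: ask interactively without echoing the input.
        value = prompt(f"{name}: ").strip()
    return value
```

With this in place, a cell could assign `ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")`, and likewise for the other two values.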

In [4]:
# Provide the path to your PDF file.
pdfreader = PdfReader('metamorphosis.pdf')

In [5]:
# Read the text of every page in the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content
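
The loop above appends each page's text directly to the previous one, so the last word on a page can run into the first word of the next. A minimal variation, assuming only that each page object exposes `extract_text()` as it does in the cell above, joins the pages with a newline instead. The function name `join_pages` is hypothetical.

```python
def join_pages(pages, separator="\n"):
    """Concatenate the text of each page, skipping pages with no extractable text."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:
            parts.append(content)
    # Joining with a separator keeps words at page boundaries from fusing together.
    return separator.join(parts)
```

It could be used as `raw_text = join_pages(pdfreader.pages)`.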

In [6]:
raw_text



Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [7]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later usage:

In [8]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


Create your LangChain vector store ... backed by Astra DB!

In [9]:
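astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,   # None: use the connection set up by cassio.init() above
    keyspace=None,  # None: use the keyspace set up by cassio.init() above
)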

In [10]:
from langchain.text_splitter import CharacterTextSplitter

# Split the text into overlapping chunks so that each one stays well within the model's token limit
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
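
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of overlap-based splitting. It is an illustration under stated assumptions, not LangChain's actual implementation: the function name `split_with_overlap` is hypothetical, lengths are measured in characters (as with `length_function=len`), and a single piece longer than `chunk_size` is kept whole rather than cut.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into overlapping chunks.

    Simplified illustration only; not LangChain's CharacterTextSplitter.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep a tail of the emitted chunk as overlap, dropping pieces from
            # the front until the tail fits the overlap budget and leaves room
            # for the new piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current = current[1:]
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks
```

For example, splitting `"aaaa\nbbbb\ncccc\ndddd"` with `chunk_size=9` and `chunk_overlap=4` gives three chunks in which each chunk repeats the last line of the previous one. That repetition is the same effect visible between the first two entries of `texts`.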

In [11]:
texts[:50]

['METAMORPHOSIS  \nBY \nFRANZ KAFKA  \n \n \n \n \n \n \n \n \n \n \n \n \n \n I \nOne morning, when Gregor Samsa woke from troubled dreams, he \nfound himself transformed in his bed into a horrible vermin. He lay on \nhis armour -like back, and if he lifted his head a little he could see his \nbrown belly, slightly domed and divided by arches into stiff sections. \nThe bedding was hardly able to cover it and seemed ready to slide off \nany moment. His many legs, piti fully thin compared with the size of the \nrest of him, waved about helplessly as he looked.  \n"What\'s happened to me?" he thought. It wasn\'t a dream. His room, a \nproper human room although a little too small, lay peacefully between \nits four familiar walls. A collection of textile samples lay spread out on',
 'proper human room although a little too small, lay peacefully between \nits four familiar walls. A collection of textile samples lay spread out on \nthe table - Samsa was a travelling salesman - and above it 

### Load the dataset into the vector store



In [12]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 chunks.
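
Once the chunks are stored, answering a question starts with a similarity search: the question is embedded, and the stored vectors closest to it are returned. The sketch below shows the idea with brute-force cosine similarity over an in-memory list. It is only a conceptual illustration: the names `cosine_similarity` and `top_k` and the toy two-dimensional vectors are made up, the source does not state which similarity metric or score scaling the store uses, and a real vector database typically relies on an index rather than scanning every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: (chunk text, embedding) pairs with made-up 2-dimensional vectors.
stored = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
for text, score in top_k([1.0, 0.0], stored, k=2):
    print("[%0.4f] %s" % (score, text))
```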


### Run the QA cycle

Simply run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- Why was Gregor Samsa not coming out of his room?
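
Conceptually, each pass of the question-answering loop retrieves the chunks closest to the question and hands them to the LLM together with the question. The exact prompt that `VectorStoreIndexWrapper.query` sends is not shown in this notebook, so the template below is purely an assumption for illustration, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(question, chunks):
    """Assemble one prompt containing the retrieved chunks and the question.

    Illustrative template only; not the prompt LangChain actually uses.
    """
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because the model only sees the retrieved chunks, the quality of the answer depends directly on how well the similarity search matches the question.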


In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:200]))


QUESTION: "Who was gregor samsa not coming out of his room?"
ANSWER: "Gregor Samsa was not coming out of his room because he was sick and barricading himself in."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9403] "clerk called "Good morning, Mr. Samsa". "He isn't well", said his mother 
to the chi ..."
    [0.9373] "proper human room although a little too small, lay peacefully between 
its four fami ..."
    [0.9373] "it seemed to Gregor much more sensible to le ave him now in peace 
instead of distur ..."
    [0.9372] "could now be heard in the adjoining room. From the room on his right, 
Gregor's sist ..."

QUESTION: "Explain all about gregor samsa and what happends to him in the end?"
ANSWER: "Gregor Samsa is the main character in the story "Metamorphosis" by Franz Kafka. One morning, he wakes up and finds himself transformed into a horrible vermin. He is unable to move and his family is shocked and confused by his transformation. Despite his new form, Gregor still retains his human thou