#### Prerequisites:

###### You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), get a DB Token with the _Database Administrator_ role and copy your Database ID; you will need these connection parameters shortly.


#### What you will do:

###### - Setup: import dependencies, provide secrets, create the LangChain vector store;
###### - Run a question-answering loop that retrieves the relevant chunks of the PDF and has an LLM construct the answer.

###### Vector Search enhances machine learning models by allowing similarity comparisons of embeddings, which are mathematical representations of high dimensional data.
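
###### As a rough illustration of what a similarity comparison of embeddings means, the sketch below ranks two toy vectors against a query vector by cosine similarity. The three-dimensional vectors and the names `query`, `docs` and `cosine_similarity` are made up for this example; cosine similarity is only one common metric and is not necessarily the one your database is configured to use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", invented for illustration only.
# Real embedding models produce much longer vectors.
query = [1.0, 0.0, 1.0]
docs = {"doc_a": [0.9, 0.1, 0.8], "doc_b": [0.0, 1.0, 0.0]}

# Rank document names from most to least similar to the query.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query, docs[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b']
```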

###### As a capability of Astra DB, Vector Search supports various Large Language Models (LLMs). Since these LLMs are stateless, they rely on a vector database like Astra DB to store their embeddings. Serverless Cassandra with Vector Search speeds up vector-based similarity searches, making it easier to develop LLM-powered applications.

In [50]:
# Install the required dependencies (python-dotenv provides the `dotenv` module used below)
!pip install -q astrapy langchain langchain-community langchain-core ollama python-dotenv


In [29]:
!pip install PyPDF2



In [30]:
# Import the packages you will need

# LangChain components to use
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import AstraDB

from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


In [31]:
from PyPDF2 import PdfReader

In [32]:
from dotenv import load_dotenv
load_dotenv()

True

In [33]:
# Setup
import os
ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
ASTRA_DB_ID = "20fa6bf3-e94f-4efa-a363-4af5b05f9341" # enter your Database ID


In [34]:
pdfreader = PdfReader("apjspeech.pdf")

In [35]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [36]:
raw_text

'A P J Abdul Kalam Departing speech \n \n \nFriends, I am delighted to address you all, in the country and those livi ng abroad, after \nworking with you and completing five beautiful and eventful years in Rashtrapati \nBhavan. Today, it is indeed a thanks giving occasion. I would like to narr ate, how I \nenjoyed every minute of my tenure enriched by the wonderful assoc iation from each one \nof you, hailing from different walks of life, be it politics, sci ence and technology, \nacademics, arts, literature, business, judiciary, administration, local bodies, farming, \nhome makers, special children, media and above all from the youth and st udent \ncommunity who are the future wealth of our country. During my intera ction at \nRashtrapati Bhavan in Delhi and at every state and union territor y as well as through my \nonline interactions, I have many unique experiences to share with you, which signify the \nfollowing important messages: \n \n1. Accelerate development : Aspiration of th

###### Initialize the connection to your database:

In [37]:
!pip install astrapy



In [38]:
from astrapy import DataAPIClient
# Initialize the client
client = DataAPIClient(ASTRA_DB_APPLICATION_TOKEN)
db = client.get_database_by_api_endpoint(
  "https://20fa6bf3-e94f-4efa-a363-4af5b05f9341-us-east1.apps.astra.datastax.com"
)

print(f"Connected to Astra DB: {db.list_collection_names()}")

Connected to Astra DB: ['pdf_rag_collection']
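
The endpoint passed to `get_database_by_api_endpoint` above appears to be the Database ID followed by a region suffix. A minimal sketch, assuming that pattern holds for your database, builds the URL from its parts so the ID is written only once. The helper `astra_api_endpoint` and the variable `ASTRA_DB_REGION` are hypothetical names introduced here, and the pattern is inferred from the URL in this notebook; if in doubt, copy the endpoint from your Astra DB dashboard instead.

```python
def astra_api_endpoint(database_id: str, region: str) -> str:
    """Build an endpoint URL in the form used in this notebook (assumed pattern)."""
    return f"https://{database_id}-{region}.apps.astra.datastax.com"


# Values taken from the cells above; replace them with your own.
ASTRA_DB_ID = "20fa6bf3-e94f-4efa-a363-4af5b05f9341"
ASTRA_DB_REGION = "us-east1"  # hypothetical variable name

ASTRA_DB_API_ENDPOINT = astra_api_endpoint(ASTRA_DB_ID, ASTRA_DB_REGION)
print(ASTRA_DB_API_ENDPOINT)
```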


In [39]:
llm = Ollama(model = "gemma:2b")

embedding = OllamaEmbeddings(
    model="nomic-embed-text"
)

###### Create the LangChain vector store, backed by Astra DB

In [40]:
ASTRA_DB_API_ENDPOINT = "https://20fa6bf3-e94f-4efa-a363-4af5b05f9341-us-east1.apps.astra.datastax.com"

In [41]:
!pip uninstall astrapy -y
!pip install astrapy==1.4.1

Found existing installation: astrapy 1.4.1
Uninstalling astrapy-1.4.1:
  Successfully uninstalled astrapy-1.4.1
Collecting astrapy==1.4.1
  Using cached astrapy-1.4.1-py3-none-any.whl.metadata (17 kB)
Using cached astrapy-1.4.1-py3-none-any.whl (156 kB)
Installing collected packages: astrapy
Successfully installed astrapy-1.4.1


In [42]:
astra_vector_store = AstraDB(
    embedding=embedding,
    collection_name="pdf_rag_collection",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace="default_keyspace",
)

print("Vector store ready")

Vector store ready


In [43]:
from langchain_text_splitters import CharacterTextSplitter
# Split the text into overlapping chunks so that each piece stays small enough for the embedding model
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap = 200,
    length_function = len,
)

texts = text_splitter.split_text(raw_text)
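
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch in plain Python. It is not how `CharacterTextSplitter` works internally (that class splits on the separator first and then merges pieces), and `sliding_chunks` is a hypothetical helper written only for this illustration. It shows the basic idea that consecutive chunks share some text, so a sentence cut at a chunk boundary still appears whole in one of them.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk starts with the last four characters of the previous one, which mirrors (on a much smaller scale) the 200-character overlap configured in the cell above.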

In [44]:
texts[:50]

['A P J Abdul Kalam Departing speech \n \n \nFriends, I am delighted to address you all, in the country and those livi ng abroad, after \nworking with you and completing five beautiful and eventful years in Rashtrapati \nBhavan. Today, it is indeed a thanks giving occasion. I would like to narr ate, how I \nenjoyed every minute of my tenure enriched by the wonderful assoc iation from each one \nof you, hailing from different walks of life, be it politics, sci ence and technology, \nacademics, arts, literature, business, judiciary, administration, local bodies, farming, \nhome makers, special children, media and above all from the youth and st udent \ncommunity who are the future wealth of our country. During my intera ction at',
 'home makers, special children, media and above all from the youth and st udent \ncommunity who are the future wealth of our country. During my intera ction at \nRashtrapati Bhavan in Delhi and at every state and union territor y as well as through my \nonline

###### Load the text chunks into the vector store

In [45]:
astra_vector_store.add_texts(texts[:50])
print(f"Inserted {len(texts[:50])} chunks.")

Inserted 31 chunks.
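
Each run of the cell above calls `add_texts` again, so re-running the notebook may store the same chunks more than once; the repeated entries among the retrieved documents further down are consistent with that. One possible way to make the insert repeatable is to derive a stable ID from each chunk's text and pass the IDs along. The sketch below only computes the IDs. `chunk_ids` is a hypothetical helper, and the commented-out call assumes that `add_texts` accepts an `ids` argument and that the store overwrites a document whose ID already exists; check both against the version of the vector store you have installed.

```python
import hashlib


def chunk_ids(chunks):
    """Return one stable ID per chunk, derived from the chunk text."""
    return [hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks]


sample_chunks = ["first chunk of text", "second chunk of text"]
ids = chunk_ids(sample_chunks)

# The same text always yields the same ID, so a second run produces no new IDs.
print(ids == chunk_ids(sample_chunks))  # True

# Assumed usage with the store created above (not executed here):
# astra_vector_store.add_texts(texts, ids=chunk_ids(texts))
```

Note that two chunks with identical text would receive the same ID under this scheme.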


### Run the QA cycle

Run the cell and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the current GDP?_
- _How much the agriculture target will be increased to and what the focus will be_


In [49]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == 'quit':
        break
    if query_text == "":
        continue

    first_question = False
    print("\nQUESTION : \"%s\"" % query_text)

    # -------------------------------
    # Retrieve top 3 similar chunks
    # -------------------------------
    results = astra_vector_store.similarity_search(query_text, k=3)

    # Combine chunks as context for LLM
    context = " ".join([doc.page_content for doc in results])

    # -------------------------------
    # Generate answer using LLM
    # -------------------------------
    prompt = f"Answer the question based on the following context:\n{context}\n\nQuestion: {query_text}"
    
    # Generate answer using Ollama
    output = llm.generate([prompt])
    
    answer = output.generations[0][0].text

    print("ANSWER : \"%s\"\n" % answer)

    # -------------------------------
    # Show retrieved chunks
    # -------------------------------
    print("FIRST DOCUMENTS BY RELEVANCE : ")
    for doc in results:
        print("\t- %s" % doc.page_content[:84].replace("\n", " "))



QUESTION : "How much the agriculture target will be increased to and what the focus will be"
ANSWER : "The context does not provide information about how much the agriculture target will be increased or what the focus will be, so I cannot answer this question from the context."

FIRST DOCUMENTS BY RELEVANCE : 
	- manpower and improve the economic conditions of the nation through the principle  of
	- manpower and improve the economic conditions of the nation through the principle  of
	- 6000 farmers from different States and Union Territories visitin g Rashtrapati Bhava

QUESTION : "What is the current GDP?"
ANSWER : "The context does not provide information about the current GDP, so I cannot answer this question from the provided context."

FIRST DOCUMENTS BY RELEVANCE : 
	- 2. A Nation where there is an equitable distribution and adequate acce ss to energy 
	- 2. A Nation where there is an equitable distribution and adequate acce ss to energy 
	- give a glimpse of the richness of our
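
The output above lists the same chunk twice for each question, which means the same text is also sent to the LLM twice. A minimal sketch of the prompt-building step from the loop, with repeated chunks dropped before they are joined, could look like this. `build_prompt` is a hypothetical helper, not part of LangChain; the prompt wording is copied from the loop above.

```python
def build_prompt(chunks, question):
    """Join unique chunks (first occurrence kept, order preserved) into the QA prompt."""
    unique_chunks = list(dict.fromkeys(chunks))
    context = " ".join(unique_chunks)
    return (
        "Answer the question based on the following context:\n"
        f"{context}\n\nQuestion: {question}"
    )


# Stand-in strings for retrieved chunk texts, invented for illustration.
retrieved = ["chunk A", "chunk A", "chunk B"]
prompt = build_prompt(retrieved, "What is chunk A?")
print(prompt)
```

Inside the loop, this would be called as `build_prompt([doc.page_content for doc in results], query_text)`.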