
In [7]:
# Install required packages (if not already installed)
!pip install --quiet langchain langchain-community faiss-cpu sentence-transformers pymupdf gradio langchain-groq


In [8]:
# Imports
import os
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_groq import ChatGroq
import gradio as gr
from google.colab import userdata


In [9]:
# Set Groq API key (from Colab secrets)
os.environ["GROQ_API_KEY"] = userdata.get("groq_key")


In [10]:
# Load the source document (upload it to /content first).
# Note: the path points to a .txt file even though the names below say "pdf".
pdf_path = "/content/The National Health Mission (NHM),.txt"

def extract_text_from_pdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

raw_text = extract_text_from_pdf(pdf_path)
print("PDF loaded and text extracted.")


✅ PDF loaded and text extracted.


In [11]:
# Split text into chunks of up to 500 characters, with a 50-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.create_documents([raw_text])
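
To make the `chunk_size=500, chunk_overlap=50` settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraph and line breaks, so its chunks will generally differ from these. The helper name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; not the splitter's actual algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 1200-character sample text.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])  # neighbouring chunks share 50 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, provided it is shorter than the overlap.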


In [12]:
# Create FAISS vector store with sentence-transformer embeddings
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = FAISS.from_documents(docs, embedding)
retriever = vectordb.as_retriever()
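
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force search over toy 3-dimensional vectors using Euclidean distance. The vectors, the `top_k` helper, and `k=2` are made up for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy "embeddings" for three chunks (hypothetical values).
doc_vecs = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.8, 0.2, 0.1],  # chunk 2
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs))
```

Chunks 0 and 2 point in nearly the same direction as the query, so they are returned ahead of chunk 1.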



In [13]:
# Initialize Groq LLM (LLaMA3)
llm = ChatGroq(
    groq_api_key=os.environ["GROQ_API_KEY"],
    model_name="llama3-8b-8192"
)


In [14]:
# Set up conversational retrieval chain with memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)
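
As a rough mental model, a conversational retrieval chain handles each turn in three steps: rewrite the follow-up into a standalone question using the chat history, retrieve chunks for that standalone question, then answer from the retrieved chunks. The sketch below mimics that flow with plain-Python stand-ins. Every function here is hypothetical; in the real chain the rewriting and answering steps are LLM calls and retrieval goes through the vector store.

```python
def condense_question(chat_history, question):
    """Stand-in for the LLM call that rewrites a follow-up as a standalone question."""
    if not chat_history:
        return question
    previous_question, _ = chat_history[-1]
    return f"{question} (follow-up to: {previous_question})"


def retrieve(standalone_question, chunks):
    """Stand-in retriever: keep chunks sharing at least one word with the question."""
    words = set(standalone_question.lower().split())
    return [c for c in chunks if words & set(c.lower().split())]


def answer(standalone_question, context_chunks):
    """Stand-in for the answering LLM call: simply echoes the retrieved context."""
    return " ".join(context_chunks) if context_chunks else "I don't know."


def chat_turn(question, chat_history, chunks):
    standalone = condense_question(chat_history, question)
    context = retrieve(standalone, chunks)
    reply = answer(standalone, context)
    chat_history.append((question, reply))  # memory grows with every turn
    return reply


# Hypothetical two-chunk knowledge base.
chunks = [
    "NHM covers rural and urban health missions.",
    "Diabetes is a non-communicable disease.",
]
history = []
first = chat_turn("What does NHM cover?", history, chunks)
second = chat_turn("Which areas?", history, chunks)
print(first)
print(second)
```

On its own, "Which areas?" shares no words with either chunk; it only retrieves the NHM chunk because the previous question is folded into it, which is the role the memory plays.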



In [18]:
# Gradio chatbot interface
def chat_interface(message, history):
    # `history` is supplied by Gradio but unused here: the chain keeps its own memory.
    result = qa_chain.invoke({"question": message})
    return result["answer"]

gr.ChatInterface(
    fn=chat_interface,
    title="Government AI Assistant",
    description="Ask me anything about Government healthcare policies. Based on Operational Guidelines for CPHC.",
    theme="soft"
).launch()




In [16]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_similarity(sentence1, sentence2):
    """Return the cosine similarity of two sentences as a one-element array."""
    # Load the sentence transformer model (reloaded on every call)
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Compute the embeddings for the two sentences
    sentence1_embedding = model.encode([sentence1])  # reference
    sentence2_embedding = model.encode([sentence2])  # prediction

    # cosine_similarity returns a 1x1 matrix; [0] keeps its single row
    similarity = cosine_similarity(sentence1_embedding, sentence2_embedding)[0]

    return similarity
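
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. The following is a minimal pure-Python sketch of that formula on toy 2-dimensional vectors; the `cosine` helper and the vectors are illustrative and are not part of the evaluation above.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel, different lengths
```

Scores near 1 indicate that two embeddings point the same way, which is why a high score is read as the bot's answer being semantically close to the reference answer.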

In [17]:
my_ans="""non-communicable diseases such as
cardiovascular diseases, diabetes, cancer, respiratory, and other chronic diseases, account for over 60% of
total mortality"""
question="""what are the non-communicable diseases which account for 60% of total mortality?"""
bot_ans="""According to the provided context, the non-communicable diseases that account for over 60% of total mortality in India include:
1.Cardiovascular diseases
2.Diabetes
3.Cancer
4.Respiratory diseases"""

similarity_score = sentence_similarity(my_ans, bot_ans)

print(f"Cosine similarity between the two sentences: {similarity_score}")

Cosine similarity between the two sentences: [0.8342185]
