Retrieval-Augmented Generation (RAG) Model for QA Bot

In [None]:
!pip install PyPDF2 pinecone-client sentence-transformers cohere

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Collecting cohere
  Downloading cohere-5.9.4-py3-none-any.whl.metadata (3.4 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone-client)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Collecting boto3<2.0.0,>=1.34.0 (from cohere)
  Downloading boto3-1.35.24-py3-none-any.whl.metadata (6.6 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.9.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting httpx>=0.21.2 (from cohere)
  Downloading httpx-0.

In [None]:
from PyPDF2 import PdfReader
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
import cohere
import numpy as np
from google.colab import userdata

# Read the API key from Colab secrets (store your own key under the name 'Pinecone').
pc = Pinecone(api_key=userdata.get('Pinecone'))
index_name = "qa-bot"

# Create the index only if it does not already exist.
# dimension=768 matches the output size of all-mpnet-base-v2.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)

# Store your own Cohere key under the name 'Cohere' in Colab secrets.
cohere_client = cohere.Client(api_key=userdata.get('Cohere'))

embedding_model = SentenceTransformer('all-mpnet-base-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
def normalize_embedding(embedding):
    return embedding / np.linalg.norm(embedding)

def create_embeddings(text):
    embeddings = embedding_model.encode(text)
    normalized_embeddings = [normalize_embedding(emb) for emb in embeddings]
    return [embedding.tolist() for embedding in normalized_embeddings]

def store_embeddings_in_pinecone(text_chunks):
    embeddings = create_embeddings(text_chunks)
    vectors = [{"id": str(i), "values": embedding, "metadata": {"text": chunk}} for i, (embedding, chunk) in enumerate(zip(embeddings, text_chunks))]
    index.upsert(vectors=vectors, namespace="document-namespace")

def chunk_text(text, max_chunk_size=512):
    sentences = text.split('. ')
    chunks = []
    current_chunk = ""

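    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_chunk_size:
            current_chunk += sentence + ". "
        else:
            # Skip the append when nothing has accumulated yet
            # (e.g. the very first sentence is longer than max_chunk_size).
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "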

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def query_pinecone(query_text):
    query_embedding = normalize_embedding(embedding_model.encode([query_text])[0])
    query_results = index.query(vector=query_embedding.tolist(), top_k=5, include_metadata=True, namespace="document-namespace")
    return [match['metadata']['text'] for match in query_results['matches']]

def generate_answer_from_chunks(query, retrieved_chunks):
    combined_text = " ".join(retrieved_chunks)
    prompt = f"{combined_text}\n\nAnswer the following question: {query}"
    response = cohere_client.generate(
        model='c4ai-aya-23-35b',
        prompt=prompt,
        max_tokens=300
    )
    return response.generations[0].text
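
The sentence-packing idea behind `chunk_text` can be tried on a small string without any of the notebook's dependencies. The sketch below is a standalone variant, not the notebook's function: the name `pack_sentences` is hypothetical, and the guards against empty sentences and doubled periods are assumptions added for illustration.

```python
def pack_sentences(text, max_chunk_size=512):
    """Greedily pack '. '-separated sentences into chunks of at most max_chunk_size characters.

    A single sentence longer than max_chunk_size is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for sentence in text.split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue  # ignore empty fragments
        # Restore the period removed by split(), without doubling it.
        piece = sentence if sentence.endswith(".") else sentence + "."
        if current and len(current) + 1 + len(piece) > max_chunk_size:
            # The next sentence would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks


sample = "RAG retrieves context. The context is added to the prompt. The model then answers."
for chunk in pack_sentences(sample, max_chunk_size=40):
    print(repr(chunk))
```

With a 40-character limit each sentence ends up in its own chunk; with the default limit of 512 the whole sample fits in a single chunk.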

In [None]:
# A PDF was uploaded to the Colab runtime for this run; upload your own PDF and update pdf_path.
pdf_path = "/content/VITA_Research_Paper.pdf"

reader = PdfReader(pdf_path)
# Join the text of every page; `or ''` guards against pages with no extractable text.
document_text = '\n'.join(page.extract_text() or '' for page in reader.pages)

text_chunks = chunk_text(document_text)
store_embeddings_in_pinecone(text_chunks)
print("Embeddings stored successfully!")

query = "what is VITA and how it is useful to students"
retrieved_chunks = query_pinecone(query)
generated_answer = generate_answer_from_chunks(query, retrieved_chunks)

print("\nRetrieved Document Sections:")
for chunk in retrieved_chunks:
    print(chunk)
    print("---")

print("\nGenerated Answer:")
print(generated_answer)

Embeddings stored successfully!

Retrieved Document Sections:
This shows that the platform is
particularly helpful in helping students grasp hard ideas
through targeted material delivery.
C.User Satisfaction
User satisfaction was measured by surveys that gathered
qualitative comments on the user experience, perceived effec-
tiveness, and overall happiness with the VITA platform.•Positive User Feedback: The majority of students (85%)
indicated a high level of satisfaction with the platform,
citing the simplicity of use, relevancy of individualized
content, and the participatory nature of the learning
experience as crucial factors.
•Perceived Effectiveness: 78% of students believed that
VITA made learning more interesting and effective com-
pared to traditional approaches.
---
This suggests that the individualized
learning strategy offered by VITA considerably boosted
student understanding and retention of material.
Fig. 5. Improvement in Test Scores (Pre-test vs. Post-test)
•Enhanced Le

APPROACH:

1. Document processing: extract the text from the PDF file with PyPDF2.
2. Text chunking: split the extracted text into smaller chunks of up to roughly 512 characters, breaking at sentence boundaries.
3. Text embeddings: convert each chunk into a vector embedding with the 'all-mpnet-base-v2' model and normalize it.
4. Storing in Pinecone: upsert the embeddings into a Pinecone index, with each chunk's text stored as metadata, so that the relevant chunks can be retrieved for a query.
5. Querying: when a question is asked, embed it with the same model and ask Pinecone for the 5 most similar embeddings.
6. Answer generation: pass the text of the retrieved chunks, together with the question, to the 'c4ai-aya-23-35b' model through the Cohere API and ask it to generate the answer.
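
The querying step relies on cosine similarity: once vectors are normalized to unit length, the cosine similarity of two vectors is simply their dot product. The sketch below illustrates this ranking in plain Python so it runs without Pinecone or the embedding model. The three-dimensional vectors, the chunk labels, and the helper names `normalize` and `top_k` are made up for illustration; this is not how Pinecone is implemented internally.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, stored, k=2):
    """Return the k stored chunks most similar to the query, best first."""
    q = normalize(query_vec)
    scored = []
    for chunk, vec in stored:
        v = normalize(vec)
        # For unit vectors the dot product equals the cosine similarity.
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk, round(score, 3)) for score, chunk in scored[:k]]


# Toy "index": (chunk label, embedding) pairs with invented 3-dimensional embeddings.
toy_store = [
    ("chunk about user satisfaction", [0.9, 0.1, 0.0]),
    ("chunk about test scores", [0.1, 0.9, 0.0]),
    ("chunk about references", [0.0, 0.1, 0.9]),
]

results = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(results)
```

The query vector points mostly along the first axis, so the first toy chunk ranks highest. In the notebook the same ranking is delegated to Pinecone through `index.query(..., top_k=5)`.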