Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot
 Problem Statement:
 Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)
 bot for a business. Use a vector database like Pinecone DB and a generative model like
 Cohere API (or any other available alternative). The QA bot should be able to retrieve
 relevant information from a dataset and generate coherent answers.
 Task Requirements:
 1. Implement a RAG-based model that can handle questions related to a provided
 document or dataset.
 2. Use a vector database (such as Pinecone) to store and retrieve document
 embeddings efficiently.
 3. Test the model with several queries and show how well it retrieves and generates
 accurate answers from the document.
 Deliverables:
 ● A Colab notebook demonstrating the entire pipeline, from data loading to question
 answering.
 ● Documentation explaining the model architecture, approach to retrieval, and how
 generative responses are created.
 ● Provide several example queries and the corresponding outputs.


In [1]:
pip install pinecone

Collecting pinecone
  Downloading pinecone-5.3.1-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.1.0 (from pinecone)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-5.3.1-py3-none-any.whl (419 kB)
Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl (85 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone
Successfully installed pinecone-5.3.1 pinecone-plugin-inference-1.1.0 pinecone-plugin

In [2]:
!pip install PyMuPDF
!pip install sentence-transformers
!pip install cohere
# The Pinecone SDK is already installed above as `pinecone`; adding `pinecone-client` as well would be redundant

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.10 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.10 PyMuPDFb-1.24.10
Collecting sentence-transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.1.1-py3-none-any.whl (245 kB)

In [4]:
import fitz  # PyMuPDF for PDF text extraction

# Extract the plain text of every page in a PDF file
def extract_text_from_pdf(pdf_path):
    # The context manager closes the document once extraction is done
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text("text") for page in doc)

# Load the PDF and extract text
pdf_file = "/content/Black hole article.pdf"
document_text = extract_text_from_pdf(pdf_file)

# Optional: Preview the first 1000 characters of the extracted text
print(document_text[:1000])

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/1818209
Black Holes : A General Introduction
Article  in  Lecture Notes in Physics · February 1998
DOI: 10.1007/978-3-540-49535-2_1 · Source: arXiv
CITATIONS
24
READS
44,724
1 author:
Jean-Pierre Luminet
Laboratoire d'Astrophysique de Marseille
139 PUBLICATIONS   3,119 CITATIONS   
SEE PROFILE
All content following this page was uploaded by Jean-Pierre Luminet on 18 February 2013.
The user has requested enhancement of the downloaded file.
arXiv:astro-ph/9801252v1  26 Jan 1998
Black Holes: A General Introduction
Jean-Pierre Luminet
Observatoire de Paris-Meudon, D´epartement d’Astrophysique Relativiste et de
Cosmologie, CNRS UPR-176, F-92195 Meudon Cedex, France
Abstract. Our understanding of space and time is probed to its depths by black holes.
These objects, which appear as a natural consequence of general relativity, provide a
powerful analytical tool able to examine macrosco

In [5]:
import nltk
nltk.download('punkt')

# Split the text into sentences for embedding generation
sentences = nltk.sent_tokenize(document_text)
print(f"Number of sentences: {len(sentences)}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Number of sentences: 441
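
The extracted text keeps the PDF's hard line breaks, and the cover page contributes short fragments such as "CITATIONS" or "24" that end up inside or between sentences. A minimal sketch of a cleanup pass that could run on `sentences` before embedding is shown below. The helper name `clean_sentences` and the `min_words` threshold are assumptions for illustration, not part of the original notebook.

```python
import re

def clean_sentences(raw_sentences, min_words=5):
    """Collapse internal whitespace and drop very short fragments.

    min_words is a hypothetical threshold; tune it for the document at hand.
    """
    cleaned = []
    for sentence in raw_sentences:
        # Replace newlines and runs of spaces with a single space
        normalized = re.sub(r"\s+", " ", sentence).strip()
        # Keep only sentences long enough to carry some meaning
        if len(normalized.split()) >= min_words:
            cleaned.append(normalized)
    return cleaned

example = clean_sentences(
    ["Black holes\nare  regions of\nspacetime.", "24", "SEE PROFILE"]
)
print(example)
```

If a pass like this is used, the vector IDs stored later should be positions in the cleaned list, so that ID lookups still point at the right sentence.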


In [6]:
from sentence_transformers import SentenceTransformer

# Initialize the model for embedding generation
model = SentenceTransformer('all-mpnet-base-v2')

# Generate embeddings for the sentences
sentence_embeddings = model.encode(sentences)

# Example: Check the embedding for the first sentence
print(f"First sentence: {sentences[0]}")
print(f"First sentence embedding shape: {sentence_embeddings[0].shape}")

First sentence: See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/1818209
Black Holes : A General Introduction
Article  in  Lecture Notes in Physics · February 1998
DOI: 10.1007/978-3-540-49535-2_1 · Source: arXiv
CITATIONS
24
READS
44,724
1 author:
Jean-Pierre Luminet
Laboratoire d'Astrophysique de Marseille
139 PUBLICATIONS   3,119 CITATIONS   
SEE PROFILE
All content following this page was uploaded by Jean-Pierre Luminet on 18 February 2013.
First sentence embedding shape: (768,)
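
Before sending vectors to a hosted index, it can help to sanity-check retrieval locally. The following is a rough sketch of brute-force cosine-similarity search in plain Python. The function names `cosine_similarity` and `top_k_local` are hypothetical, and at 441 sentences a linear scan like this is assumed to be fast enough.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_local(query_vec, vectors, k=5):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors stand in for sentence embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_local([1.0, 0.0], toy_vectors, k=2))
```

In the notebook, `query_vec` would be `model.encode([query])[0]` and `vectors` would be `sentence_embeddings`. The returned indices can be compared with the IDs that the vector database returns for the same query.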


In [7]:
import os
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone; the key is read from the environment instead of being hardcoded
# (the variable name PINECONE_API_KEY is an assumption; set it before running this cell)
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index if it does not already exist
index_name = 'blackhole'
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # Must match the embedding size: all-mpnet-base-v2 produces 768-dimensional vectors
        metric='euclidean',  # Distance metric (euclidean, cosine, or dotproduct)
        spec=ServerlessSpec(
            cloud='aws',  # Cloud provider
            region='us-east-1'  # Region available on the free plan
        )
    )

# Connect to the Pinecone index
index = pc.Index(index_name)
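
The cells shown here never write the sentence embeddings to the index, yet the query cell looks sentences up by integer ID. A minimal sketch of the missing upsert step follows. It assumes each vector's ID is the stringified position of its sentence in `sentences`; the helper names and the batch size of 100 are hypothetical choices.

```python
def build_upsert_batches(embeddings, batch_size=100):
    """Yield lists of (id, values) tuples, using the sentence position as the ID."""
    batch = []
    for position, embedding in enumerate(embeddings):
        # IDs are strings; values are plain Python floats
        batch.append((str(position), [float(x) for x in embedding]))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def upsert_sentences(index, embeddings, batch_size=100):
    """Send all embeddings to the index in batches; return how many were sent."""
    total = 0
    for batch in build_upsert_batches(embeddings, batch_size):
        index.upsert(vectors=batch)
        total += len(batch)
    return total
```

In the notebook this would be called as `upsert_sentences(index, sentence_embeddings)` after the index is connected. Batching keeps each request small; the right batch size depends on vector dimension and request limits.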

In [8]:
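import os
import cohere

# Initialize the Cohere client for answer generation; the key is read from the environment
# (the variable name COHERE_API_KEY is an assumption; set it before running this cell)
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])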

# Example query
query = "what is black hole theory"

# Embed the query with the same SentenceTransformer model used for the sentences
query_embedding = model.encode([query])[0].tolist()  # Pinecone expects a plain list, not a NumPy array

# Retrieve relevant document sections from Pinecone
result = index.query(vector=query_embedding, top_k=5)

# Map each match ID back to its sentence (IDs are expected to be positions in `sentences`)
retrieved_sentences = [sentences[int(match['id'])] for match in result['matches']]

# Use Cohere to generate an answer based on the retrieved text
response = cohere_client.generate(
    prompt=f"Question: {query}\n\nContext: {' '.join(retrieved_sentences)}\n\nAnswer:",
    max_tokens=100
)

# Display the generated answer
print(f"Generated Answer: {response.generations[0].text}")

Generated Answer:  Black hole theory is a scientific theory that explains the phenomena and mechanics of black holes. 

Black holes are regions of space where an enormous amount of mass is packed into a tiny volume. This creates a gravitational pull so strong that not even electromagnetic waves (i.e. light) can escape. 

The theory explains that black holes form when very massive stars collapse at the end of their life cycle. This process can also happen when galaxies merge. What is left behind is an extremely dense object
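
The prompt in the cell above joins the retrieved sentences into one string and does not tell the model to stay within them, so the answer may draw on the model's own knowledge rather than the document. One possible refinement, sketched under assumptions, is a prompt builder that numbers the passages, caps the context length, and asks for an answer grounded in the context only. The function name `build_prompt`, the instruction wording, and the `max_chars` limit are illustrative, not part of the original notebook.

```python
def build_prompt(question, passages, max_chars=2000):
    """Build a grounded QA prompt from numbered context passages.

    max_chars is a hypothetical cap on the total length of the context entries.
    """
    entries = []
    used = 0
    for number, passage in enumerate(passages, start=1):
        entry = f"[{number}] {passage.strip()}"
        # Stop adding passages once the context budget would be exceeded
        if used + len(entry) > max_chars:
            break
        entries.append(entry)
        used += len(entry)
    context = "\n".join(entries)
    return (
        "Answer the question using only the numbered context passages. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )

print(build_prompt("what is black hole theory", ["First passage.", "Second passage."]))
```

The result could be passed as the `prompt` argument in place of the inline f-string, for example `build_prompt(query, retrieved_sentences)`.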
