Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot
 Problem Statement:
 Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)
 bot for a business. Use a vector database like Pinecone DB and a generative model like
 Cohere API (or any other available alternative). The QA bot should be able to retrieve
 relevant information from a dataset and generate coherent answers.
 Task Requirements:
 1. Implement a RAG-based model that can handle questions related to a provided
 document or dataset.
 2. Use a vector database (such as Pinecone) to store and retrieve document
 embeddings efficiently.
 3. Test the model with several queries and show how well it retrieves and generates
 accurate answers from the document.
 Deliverables:
 ● A Colab notebook demonstrating the entire pipeline, from data loading to question
 answering.
 ● Documentation explaining the model architecture, approach to retrieval, and how
 generative responses are created.
 ● Provide several example queries and the corresponding outputs.


In [1]:
!pip install chromadb sentence-transformers cohere

Collecting chromadb
  Downloading chromadb-0.5.5-py3-none-any.whl.metadata (6.8 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.1.0-py3-none-any.whl.metadata (23 kB)
Collecting cohere
  Downloading cohere-5.9.2-py3-none-any.whl.metadata (3.4 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.114.2-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.30.6-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.6.5-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.19.2-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  D

In [2]:
pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.11.4-py3-none-any.whl.metadata (41 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.4-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Downloading pypdfium2-4.30

In [6]:
import pdfplumber

# Open the PDF and extract its text page by page
with pdfplumber.open('/content/Black hole article.pdf') as pdf:
    text = ''
    for page in pdf.pages:
        # extract_text() can return None for pages without a text layer
        text += page.extract_text() or ''

# Now you have the full text in the variable `text`

In [7]:
len(text)

63842

In [8]:
pip install langchain_text_splitters

Collecting langchain_text_splitters
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langchain-core<0.4.0,>=0.3.0 (from langchain_text_splitters)
  Downloading langchain_core-0.3.0-py3-none-any.whl.metadata (6.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.4.0,>=0.3.0->langchain_text_splitters)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langsmith<0.2.0,>=0.1.117 (from langchain-core<0.4.0,>=0.3.0->langchain_text_splitters)
  Downloading langsmith-0.1.120-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain-core<0.4.0,>=0.3.0->langchain_text_splitters)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpointer>=1.9 (from jsonpatch<2.0,>=1.33->langchain-core<0.4.0,>=0.3.0->langchain_text_splitters)
  Downloading jsonpointer-3.0.0-py2.py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_text_splitters-0.3.0-py3-none-any.whl (25 

In [9]:
# Split the document text into paragraph-level chunks on blank lines, dropping empty chunks
document_chunks = [chunk.strip() for chunk in text.split('\n\n') if chunk.strip()]
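
Splitting on blank lines only works if the extracted text actually contains '\n\n'. PDF extractors often emit single newlines, in which case the split can return one very large chunk. The following is a minimal sketch of a fallback, assuming fixed-size word windows with some overlap are acceptable for this document. The function name `chunk_words` and the window sizes are illustrative and not part of the original notebook.

```python
def chunk_words(text, chunk_size=120, overlap=20):
    """Split text into overlapping windows of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Toy input: 250 placeholder words
sample = " ".join(f"w{i}" for i in range(250))
demo_chunks = chunk_words(sample, chunk_size=100, overlap=20)
print(len(demo_chunks))
```

In the notebook this could be applied as `document_chunks = chunk_words(text)`; the overlap keeps sentences that straddle a boundary visible in both neighbouring chunks.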

In [10]:
from sentence_transformers import SentenceTransformer

# Load pre-trained SBERT model (From HuggingFace sentence-similarity models)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Generate embeddings for each chunk
embeddings = model.encode(document_chunks, convert_to_tensor=True)


In [11]:
import chromadb

# Initialize an in-memory Chroma client
client = chromadb.Client()

# Create the collection (similar to a table), or reuse it if this cell is re-run
collection = client.get_or_create_collection("My_collection")

In [12]:
# Create unique IDs for each document chunk
ids = [f'doc_{i}' for i in range(len(document_chunks))]

# Store each chunk's text as metadata so it can be read back from the query results
metadatas = [{"text": chunk} for chunk in document_chunks]

# Insert embeddings and their corresponding document chunks into the Chroma DB collection
collection.add(
    ids=ids,               # Unique IDs
    embeddings=embeddings.tolist(),  # Convert embeddings tensor to list
    metadatas=metadatas   # Store original text with the embedding
)

In [15]:
# Sample query (user's question)
query = "what is black hole theory"

# Embed the query using the same model
query_embedding = model.encode(query)

# Retrieve the most similar document chunks
results = collection.query(
    query_embeddings=[query_embedding.tolist()],  # Pass the query embedding
    n_results=3  # Number of relevant chunks to return
)

# Display the retrieved chunks (results["metadatas"] holds one list per query):
# for metadata in results["metadatas"][0]:
#     print(metadata["text"])
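
Conceptually, the query above ranks the stored chunk embeddings by closeness to the query embedding and returns the nearest `n_results`. The sketch below illustrates that idea with cosine similarity over plain Python lists. It is for illustration only: Chroma uses an approximate nearest-neighbour index, and its default distance setting may differ from cosine. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
ranked = top_k([1.0, 0.1], toy_docs, k=2)
print(ranked)
```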

In [17]:
import os
import cohere

# Initialize the Cohere client. The API key is read from the environment instead of
# being hard-coded (the variable name COHERE_API_KEY is an assumption).
co = cohere.Client(os.environ["COHERE_API_KEY"])

# Combine the retrieved chunks; results["metadatas"] holds one list per query
context = " ".join(metadata["text"] for metadata in results["metadatas"][0])

# Generate a response based on the context.
# The model ID 'xlarge' used originally returned a 404 NotFoundError
# ("model 'xlarge' not found"), so the model argument is omitted and the API default is used.
response = co.generate(
    prompt=f"Based on the following information:\n{context}\n\nQuestion: {query}\nAnswer:",
    max_tokens=200
)

# Output the generated answer
print(response.generations[0].text.strip())
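
The generation step amounts to placing the retrieved chunks and the question into a single prompt. Below is a minimal sketch of a prompt builder, assuming a simple "context, then question" template and a character budget for the context. `build_prompt` and `max_context_chars` are illustrative names and not part of the Cohere SDK.

```python
def build_prompt(question, chunks, max_context_chars=4000):
    """Assemble a RAG prompt from retrieved chunks, skipping blanks and capping length."""
    context_parts = []
    used = 0
    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue  # ignore empty or whitespace-only chunks
        if used + len(chunk) > max_context_chars:
            break  # stop before exceeding the context budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )


demo_prompt = build_prompt("What is an event horizon?", ["Chunk A.", "  ", "Chunk B."])
print(demo_prompt)
```

Telling the model to rely only on the supplied context is one common way to reduce answers that come from the model's general knowledge rather than from the document.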


In [18]:
pip install pinecone

Collecting pinecone
  Downloading pinecone-5.1.0-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-5.1.0-py3-none-any.whl (245 kB)
Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl (85 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone
Successfully installed pinecone-5.1.0 pinecone-plugin-inference-1.1.0 pinecone-plugin

In [None]:
# Import necessary libraries
import os
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer
import cohere

# Initialize Pinecone and Cohere. pinecone.init() is not available in the 5.x client
# installed above; the Pinecone class is used instead. Keys are read from the
# environment (the variable names are assumptions).
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Load and preprocess your dataset/document
document = "Your document text goes here"
sentences = [s.strip() for s in document.split(".") if s.strip()]  # Example split into sentences

# Generate embeddings using a Sentence Transformer
model = SentenceTransformer('all-mpnet-base-v2')
sentence_embeddings = model.encode(sentences)

# Store embeddings in Pinecone, using each sentence's position as its ID
index = pc.Index('your_index_name')
index.upsert(vectors=[(str(i), emb.tolist()) for i, emb in enumerate(sentence_embeddings)])

# Example query processing
query = "Your user question here"
query_embedding = model.encode([query])[0].tolist()
result = index.query(vector=query_embedding, top_k=5)

# Look up the matched sentences by ID and pass them to Cohere for answer generation
retrieved_sentences = [sentences[int(match['id'])] for match in result['matches']]

# Use Cohere to generate an answer
response = cohere_client.generate(
    prompt=f"Question: {query}\n\nContext: {' '.join(retrieved_sentences)}\n\nAnswer:",
    max_tokens=100
)
print(response.generations[0].text)


In [19]:
!pip install PyMuPDF
!pip install sentence-transformers
!pip install pinecone-client
!pip install cohere

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.10 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.10 PyMuPDFb-1.24.10
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)

In [20]:
import fitz  # PyMuPDF for PDF text extraction

# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    # The context manager closes the file once all pages have been read
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text("text") for page in doc)

# Load the PDF and extract text
pdf_file = "/content/Black hole article.pdf"
document_text = extract_text_from_pdf(pdf_file)

# Optional: Preview the first 1000 characters of the extracted text
print(document_text[:1000])

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/1818209
Black Holes : A General Introduction
Article  in  Lecture Notes in Physics · February 1998
DOI: 10.1007/978-3-540-49535-2_1 · Source: arXiv
CITATIONS
24
READS
44,724
1 author:
Jean-Pierre Luminet
Laboratoire d'Astrophysique de Marseille
139 PUBLICATIONS   3,119 CITATIONS   
SEE PROFILE
All content following this page was uploaded by Jean-Pierre Luminet on 18 February 2013.
The user has requested enhancement of the downloaded file.
arXiv:astro-ph/9801252v1  26 Jan 1998
Black Holes: A General Introduction
Jean-Pierre Luminet
Observatoire de Paris-Meudon, D´epartement d’Astrophysique Relativiste et de
Cosmologie, CNRS UPR-176, F-92195 Meudon Cedex, France
Abstract. Our understanding of space and time is probed to its depths by black holes.
These objects, which appear as a natural consequence of general relativity, provide a
powerful analytical tool able to examine macrosco

In [21]:
import nltk
nltk.download('punkt')

# Split the text into sentences for embedding generation
sentences = nltk.sent_tokenize(document_text)
print(f"Number of sentences: {len(sentences)}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Number of sentences: 441


In [22]:
from sentence_transformers import SentenceTransformer

# Initialize the model for embedding generation
model = SentenceTransformer('all-mpnet-base-v2')

# Generate embeddings for the sentences
sentence_embeddings = model.encode(sentences)

# Example: Check the embedding for the first sentence
print(f"First sentence: {sentences[0]}")
print(f"First sentence embedding shape: {sentence_embeddings[0].shape}")


First sentence: See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/1818209
Black Holes : A General Introduction
Article  in  Lecture Notes in Physics · February 1998
DOI: 10.1007/978-3-540-49535-2_1 · Source: arXiv
CITATIONS
24
READS
44,724
1 author:
Jean-Pierre Luminet
Laboratoire d'Astrophysique de Marseille
139 PUBLICATIONS   3,119 CITATIONS   
SEE PROFILE
All content following this page was uploaded by Jean-Pierre Luminet on 18 February 2013.
First sentence embedding shape: (768,)


In [25]:
import os
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone; the API key is read from the environment instead of being
# hard-coded (the variable name PINECONE_API_KEY is an assumption)
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index if it does not exist yet
index_name = 'blackhole'
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # Must match the embedding size of all-mpnet-base-v2, shape (768,) above
        metric='euclidean',  # One of: euclidean, cosine, dotproduct
        spec=ServerlessSpec(
            cloud='aws',  # Cloud provider
            region='us-east-1'  # Region available on the free plan
        )
    )

# Connect to the Pinecone index
index = pc.Index(index_name)
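
The cells shown here create the index but do not include the step that writes the sentence embeddings into it. The sketch below shows how that step might look, assuming each vector is stored under its sentence position as a string ID, which is what a lookup such as `sentences[int(match['id'])]` expects. `batched_upsert` and the batch size are hypothetical, and the stand-in `FakeIndex` exists only so the example runs without a Pinecone account.

```python
def batched_upsert(index, embeddings, batch_size=100):
    """Upsert (id, values) pairs in batches; returns the number of vectors sent."""
    total = 0
    batch = []
    for i, embedding in enumerate(embeddings):
        values = [float(x) for x in embedding]  # plain floats, not numpy scalars
        batch.append((str(i), values))
        if len(batch) == batch_size:
            index.upsert(vectors=batch)
            total += len(batch)
            batch = []
    if batch:  # send the final, partially filled batch
        index.upsert(vectors=batch)
        total += len(batch)
    return total


class FakeIndex:
    """Stand-in that records calls; it only mimics an upsert(vectors=...) method."""

    def __init__(self):
        self.calls = []

    def upsert(self, vectors):
        self.calls.append(list(vectors))


fake = FakeIndex()
sent = batched_upsert(fake, [[0.1, 0.2]] * 5, batch_size=2)
print(sent, [len(call) for call in fake.calls])
```

With the real index, the call would presumably be `batched_upsert(index, sentence_embeddings)`; batching keeps each request small rather than sending all vectors at once.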

In [27]:
import os
import cohere

# Initialize the Cohere client for answer generation. The API key is read from the
# environment instead of being hard-coded (the variable name is an assumption).
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Example query
query = "what is black hole theory"

# Embed the query with the same SentenceTransformer model used for the sentences
query_embedding = model.encode([query])[0].tolist()  # Convert numpy array to list

# Retrieve relevant document sections from Pinecone
result = index.query(vector=query_embedding, top_k=5)

# Extract the document IDs and their corresponding sentences
retrieved_sentences = [sentences[int(match['id'])] for match in result['matches']]

# Use Cohere to generate an answer based on the retrieved text
response = cohere_client.generate(
    prompt=f"Question: {query}\n\nContext: {' '.join(retrieved_sentences)}\n\nAnswer:",
    max_tokens=100
)

# Display the generated answer
print(f"Generated Answer: {response.generations[0].text}")

Generated Answer:  Black hole theory is a scientific theory that explains the formation, behavior, and characteristics of black holes, which are extremely dense regions in space from which no light can escape. The theory is based on general relativity and proposes that black holes are formed when massive stars collapse at the end of their life cycle or through other gravitational phenomena. 

The key components of black hole theory include: 

1. Event Horizon: This is the boundary around a black hole where the gravitational pull becomes so strong that nothing
