# **Retrieval-Augmented Generation (RAG) for a Question Answering (QA) bot using a vector database (Pinecone) and a generative model (Cohere)**

# **Setting Up the Environment**

#### Install Dependencies

In [None]:
!pip install pinecone-client cohere openai transformers sentence-transformers


In [None]:
!pip install PyMuPDF pandas



#### Import Libraries

In [None]:
# Document processing libraries
import fitz  # For PDF extraction
import pandas as pd

from sentence_transformers import SentenceTransformer

import pinecone
import cohere
import openai

# Upload files from Colab
from google.colab import files

#### Set Up Pinecone

In [None]:
import os
from pinecone import Pinecone, ServerlessSpec

# Set your Pinecone API key (read from an environment variable, never hard-coded) and environment
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_ENV = 'us-east-1'

pc = Pinecone(api_key=PINECONE_API_KEY)

index = pc.Index('ragbot')

# verify the connection
index_stats = index.describe_index_stats()
print(index_stats)

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
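
The cell above assumes that an index named `ragbot` already exists with dimension 384. A minimal sketch of creating it when it is missing is shown below. `ensure_index` is a hypothetical helper, and the cosine metric and the `aws` cloud in the usage comment are assumptions, not settings taken from this notebook.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index `name` if the client `pc` does not list it yet.

    Returns True when a new index was created, False when it already existed.
    """
    existing = pc.list_indexes().names()
    if name in existing:
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


# Usage with the objects defined above (cloud and metric are assumptions):
# spec = ServerlessSpec(cloud="aws", region=PINECONE_ENV)
# ensure_index(pc, "ragbot", dimension=384, spec=spec)
# index = pc.Index("ragbot")
```

The dimension must match the embedding model used later; 384 is the size reported in the index stats above.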


#### Set Up Cohere


In [None]:
# Read the Cohere API key from an environment variable instead of hard-coding it
co = cohere.Client(os.environ['COHERE_API_KEY'])

# Verify if the connection works
try:
    response = co.generate(prompt="Hello")
    print("Cohere API connected successfully!")
except Exception as e:
    print(f"Error connecting to Cohere: {e}")

Cohere API connected successfully!


# **Upload Data Files in Colab**

*After executing this code, you’ll see an upload button in the output area where you can select your PDF, TXT, or CSV files.*

In [None]:
from google.colab import files

# Upload files using the Colab upload widget
uploaded_files = files.upload()

print(f"Uploaded Files: {uploaded_files.keys()}")


Saving Cover Letter(Pijush Pathak).pdf to Cover Letter(Pijush Pathak).pdf
Uploaded Files: dict_keys(['Cover Letter(Pijush Pathak).pdf'])


#### Extract Text from Uploaded Files
*Once files are uploaded, the following code will extract text from PDFs, TXT, and CSV files.*

In [None]:
import fitz  # PyMuPDF for reading PDFs
import pandas as pd

def extract_text_from_pdf(file_path):
    """Extract text from a PDF file."""
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

def extract_text_from_txt(file_path):
    """Extract text from a TXT file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def extract_text_from_csv(file_path):
    """Extract text from a CSV file."""
    df = pd.read_csv(file_path)
    return df.to_string()

# Extract text from all uploaded files
document_texts = []
for filename, file_content in uploaded_files.items():
    with open(filename, 'wb') as f:
        f.write(file_content)

    if filename.endswith('.pdf'):
        document_texts.append(extract_text_from_pdf(filename))
    elif filename.endswith('.txt'):
        document_texts.append(extract_text_from_txt(filename))
    elif filename.endswith('.csv'):
        document_texts.append(extract_text_from_csv(filename))

# Combine all extracted texts
full_text = " ".join(document_texts)
print(f"Extracted Text (First 500 chars):\n{full_text[:500]}...")


Extracted Text (First 500 chars):
Pijush Pathak 
Chennai, Tamil Nadu, India - 603203 
Ph: +91-6000839087 
Email: pijushpathak94@gmail.com 
Date:02/10/2024 
Hiring Manager 
Microsoft 
Dear Hiring Manager, 
I am excited to apply for the Research Fellowship opportunity at Microsoft. With a background in AI, 
machine learning, and data analytics, coupled with a passion for contributing to cutting-edge 
research, I am eager to be part of a team that is solving global problems and advancing AI 
technologies. 
I hold a B.Tech. in Compu...
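
The extraction loop above matches file extensions case-sensitively and silently ignores unsupported files. One possible way to make that explicit is a lookup table keyed by the lowercased extension. `pick_extractor` and the stand-in lambdas are hypothetical; in the notebook the table would point at `extract_text_from_pdf`, `extract_text_from_txt`, and `extract_text_from_csv`.

```python
from pathlib import Path


def pick_extractor(filename, extractors):
    """Return the extractor registered for the file's extension, or None.

    The suffix is lowercased, so 'Report.PDF' matches the '.pdf' entry.
    """
    return extractors.get(Path(filename).suffix.lower())


# Hypothetical stand-ins so the sketch runs without any files on disk.
extractors = {
    ".pdf": lambda path: f"<pdf text of {path}>",
    ".txt": lambda path: f"<txt text of {path}>",
    ".csv": lambda path: f"<csv text of {path}>",
}

for name in ["Report.PDF", "notes.txt", "image.png"]:
    extractor = pick_extractor(name, extractors)
    if extractor is None:
        print(f"Skipping unsupported file: {name}")
    else:
        print(extractor(name))
```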


# **Chunking and Embedding the Text**

#### Split Text into Chunks
*We’ll split the extracted text into chunks of 500 words to keep embeddings concise and manageable.*

In [None]:
def split_into_chunks(text, max_length=500):
    """Splits text into smaller chunks, each with a maximum of `max_length` words."""
    words = text.split()
    for i in range(0, len(words), max_length):
        yield ' '.join(words[i:i + max_length])

chunks = list(split_into_chunks(full_text))
print(f"Total Chunks: {len(chunks)}")

Total Chunks: 1
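
The splitter above cuts at fixed word boundaries, so a sentence that straddles two chunks is divided between them. A common variation is to let consecutive chunks share a few words. The sketch below is an assumption about how that could look, not part of the original pipeline; `split_with_overlap` and the `overlap` parameter are hypothetical names.

```python
def split_with_overlap(text, max_length=500, overlap=50):
    """Yield chunks of at most `max_length` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    words = text.split()
    step = max_length - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + max_length])
        if start + max_length >= len(words):
            break  # the last chunk already reaches the end of the text


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_with_overlap(sample, max_length=4, overlap=1):
    print(chunk)
```

With `overlap=0` the function produces the same chunks as `split_into_chunks`.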


#### Generate Embeddings for Each Chunk
- Now, we’ll use Sentence Transformers to generate embeddings for each chunk.

#### Load the Sentence Transformer model:
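- The all-MiniLM-L6-v2 model is fast and optimized for short text embeddings. By default it truncates inputs longer than 256 word pieces, so a 500-word chunk is only partially embedded.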

#### Generate embeddings for all chunks

In [None]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for each chunk
chunk_embeddings = [embedder.encode(chunk) for chunk in chunks]

print(f"Generated {len(chunk_embeddings)} embeddings.")

Generated 1 embeddings.


#### Verify the Embeddings
*Check the shape and content of one of the generated embeddings.*

In [None]:
# Display the shape of the first embedding vector
print(f"Shape of one embedding: {len(chunk_embeddings[0])} dimensions")

# Display the first chunk and its embedding (optional)
print(f"First Chunk: {chunks[0][:100]}...")
print(f"First Embedding: {chunk_embeddings[0][:10]}...")

Shape of one embedding: 384 dimensions
First Chunk: Pijush Pathak Chennai, Tamil Nadu, India - 603203 Ph: +91-6000839087 Email: pijushpathak94@gmail.com...
First Embedding: [-0.11093947  0.00760809 -0.01098286  0.01738251  0.01476165 -0.06121903
  0.03457382 -0.02001056  0.00828611 -0.03767524]...
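
The embedding length has to match the dimension of the Pinecone index (384 in the index stats printed earlier), otherwise the upload will be rejected. A small check along these lines can catch a mismatch before uploading; `check_dimensions` is a hypothetical helper, and plain lists stand in for the embedder's output.

```python
def check_dimensions(embeddings, expected_dim):
    """Return the positions of embeddings whose length differs from `expected_dim`."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]


# Plain lists stand in for the vectors produced by the embedder.
sample = [[0.0] * 384, [0.0] * 384, [0.0] * 383]
print(check_dimensions(sample, expected_dim=384))  # prints [2]
```

In the notebook this could be called as `check_dimensions(chunk_embeddings, index_stats['dimension'])`; an empty list means every vector fits the index.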


# **Initialize Pinecone and Upload Embeddings**

#### Prepare and Upload Embeddings

In [None]:
# Prepare data for uploading: list of (id, embedding, metadata) tuples.
# Embeddings are converted from NumPy arrays to plain lists of floats.
embedding_data = [
    (f"chunk-{i}", chunk_embeddings[i].tolist(), {"text": chunks[i]})
    for i in range(len(chunks))
]

# Upload embeddings to the Pinecone index in batches for efficiency
BATCH_SIZE = 100

for i in range(0, len(embedding_data), BATCH_SIZE):
    batch = embedding_data[i:i + BATCH_SIZE]
    index.upsert(vectors=batch)

print("All embeddings uploaded to Pinecone.")

All embeddings uploaded to Pinecone.
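
The upload step can also be split into two small helpers: one that builds records and one that slices them into batches. This is a sketch under the assumption that records are passed as dictionaries with `id`, `values`, and `metadata` keys, which Pinecone's `upsert` accepts alongside tuples; `build_records` and `batched` are hypothetical names.

```python
def build_records(chunks, embeddings, prefix="chunk"):
    """Pair each chunk with its embedding as an {id, values, metadata} dict."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {
            "id": f"{prefix}-{i}",
            "values": [float(x) for x in vec],  # plain Python floats
            "metadata": {"text": text},
        }
        for i, (text, vec) in enumerate(zip(chunks, embeddings))
    ]


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


records = build_records(["a", "b", "c"], [[1, 2], [3, 4], [5, 6]])
for batch in batched(records, 2):
    print([record["id"] for record in batch])
```

In the notebook, each `batch` would be passed to `index.upsert(vectors=batch)`.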


#### Verify the Upload
*You can query Pinecone to ensure that the embeddings were uploaded correctly.*

In [None]:
# Check the number of vectors in the index
index_stats = index.describe_index_stats()
print(f"Total vectors in index: {index_stats['total_vector_count']}")

Total vectors in index: 1


# **Retrieve Relevant Chunks Based on User Query**

#### Define a Function to Handle Queries
- *Get the embedding for the user query.*

- *Retrieve similar chunks from Pinecone using the query embedding.*

In [None]:
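def retrieve_relevant_chunks(query, top_k=3):
    """Embed the query and return the text of the `top_k` most similar chunks."""
    query_embedding = embedder.encode(query).tolist()  # Convert to list
    query_response = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    # Each match carries the chunk text that was stored as metadata at upload time
    relevant_chunks = [result['metadata']['text'] for result in query_response['matches']]
    return relevant_chunks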
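
Conceptually, the query ranks the stored vectors by how similar they are to the query vector and returns the best `top_k`. The brute-force sketch below illustrates that idea with cosine similarity on plain lists. It assumes the index uses the cosine metric, which this notebook does not show, and it is not how Pinecone implements search internally; `cosine_similarity` and `top_k_chunks` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, chunk_texts, top_k=3):
    """Return the `top_k` (score, text) pairs with the highest similarity to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(chunk_vecs, chunk_texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = ["a", "b", "c"]
for score, text in top_k_chunks([1.0, 0.0], vectors, texts, top_k=2):
    print(f"{text}: {score:.3f}")
```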

#### Verify with an Example Query

In [None]:
# Example usage
user_query = "What are the main points of the document?"
retrieved_chunks = retrieve_relevant_chunks(user_query)

print("Retrieved Chunks:")
for chunk in retrieved_chunks:
    print(f"- {chunk}")

Retrieved Chunks:
- Pijush Pathak Chennai, Tamil Nadu, India - 603203 Ph: +91-6000839087 Email: pijushpathak94@gmail.com Date:02/10/2024 Hiring Manager Microsoft Dear Hiring Manager, I am excited to apply for the Research Fellowship opportunity at Microsoft. With a background in AI, machine learning, and data analytics, coupled with a passion for contributing to cutting-edge research, I am eager to be part of a team that is solving global problems and advancing AI technologies. I hold a B.Tech. in Computer Science and Engineering with a specialization in Big Data Analytics from SRM Institute of Science & Technology. During my internship at Blackcoffer as a Data Scientist Associate, I had the opportunity to work on AI-driven projects that involved developing and fine- tuning machine learning models for various applications. This experience allowed me to collaborate closely with cross-functional teams and reinforced my commitment to solving real-world problems with innovative technologic

# **Use Cohere to Generate Answers**

In [None]:

def generate_answer_with_cohere(query, retrieved_chunks):
    """Generates an answer to a query using Cohere, based on retrieved chunks."""

    # Separate chunks with a blank line so their text does not run together
    context = "\n\n".join(retrieved_chunks)

    prompt = f"""
    Given the following context, answer the question: {query}

    Context:
    {context}

    Answer:
    """

    try:
        response = co.generate(
            model='command-xlarge-nightly',
            prompt=prompt,
            max_tokens=200,
            temperature=0.7,
            k=0,
            p=0.75,
            frequency_penalty=0,
            presence_penalty=0,
            stop_sequences=[],
            return_likelihoods='NONE'
        )
        return response.generations[0].text
    except Exception as e:
        print(f"Error generating answer with Cohere: {e}")
        return "I'm sorry, I couldn't generate an answer at the moment."


user_query = "What are the main points of the document?"
retrieved_chunks = retrieve_relevant_chunks(user_query)
answer = generate_answer_with_cohere(user_query, retrieved_chunks)
print(f"Answer: {answer}")

Answer: The main points of this cover letter are:

- The applicant, Pijush Pathak, is applying for a Research Fellowship at Microsoft, highlighting their background in AI, machine learning, and data analytics.

- They hold a B.Tech. in Computer Science and Engineering with a specialization in Big Data Analytics and have experience as a Data Scientist Associate at Blackcoffer, where they worked on AI-driven projects.

- Pijush has published research on Sentiment Analysis and Text Extraction, demonstrating their ability to contribute to AI research.

- The applicant is attracted to the fellowship because it offers a balance between research and real-world application, collaboration with experts, and the opportunity to contribute to program synthesis research.

- Key qualifications include proficiency in Python and machine learning/deep learning frameworks, experience in model development and optimization, strong communication skills, and a passion for advancing AI research.

- Pijush exp
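
The prompt is the only place where the retrieved chunks reach the model, so its layout matters. The sketch below shows one possible prompt builder that numbers the passages and caps the total context length. `build_prompt`, the instruction wording, and the 4000-character budget are all assumptions chosen for illustration, not values from this notebook.

```python
def build_prompt(query, chunks, max_context_chars=4000):
    """Build a QA prompt with numbered context passages, capped at `max_context_chars`."""
    passages = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        passage = f"[{number}] {chunk.strip()}"
        if used + len(passage) > max_context_chars:
            break  # stop before exceeding the context budget
        passages.append(passage)
        used += len(passage)
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


print(build_prompt("What is the main topic?", ["First passage. ", "Second passage."]))
```

The returned string could be passed as the `prompt` argument of `co.generate` in place of the inline f-string.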

# **Build the Complete QA Pipeline**

In [None]:
# This function will integrate retrieval and generation to provide the final answer
def qa_pipeline(query):
    """
    Complete QA pipeline integrating retrieval and generation.
    """
    retrieved_chunks = retrieve_relevant_chunks(query)
    answer = generate_answer_with_cohere(query, retrieved_chunks)
    return answer

# Example usage
user_query = "What is the main topic of the document?"
answer = qa_pipeline(user_query)
print(f"Answer: {answer}")

Answer: The main topic of the document is a cover letter for a research fellowship application.


#### Test with Multiple Queries
You can now test the bot with several questions to check that its answers are accurate.

In [None]:
def qa_pipeline_multiple_queries(queries):
    """
    Complete QA pipeline for handling multiple queries.
    """
    answers = []
    for query in queries:
        retrieved_chunks = retrieve_relevant_chunks(query)
        answer = generate_answer_with_cohere(query, retrieved_chunks)
        answers.append((query, answer))
    return answers

multiple_queries = [
    "What is the main topic of the document?",
    "What are some key takeaways from the document?",
    "Can you summarize the document briefly?",
    # Add more queries here...
]

results = qa_pipeline_multiple_queries(multiple_queries)

for query, answer in results:
    print(f"Query: {query}")
    print(f"Answer: {answer}")
    print("-" * 20)

Query: What is the main topic of the document?
Answer: The main topic of the document is a job application for a Research Fellowship at Microsoft.
--------------------
Query: What are some key takeaways from the document?
Answer: Here are some key takeaways from the document:

- Pijush Pathak is a recent graduate with a B.Tech. in Computer Science and Engineering with a specialization in Big Data Analytics from SRM Institute of Science & Technology.
- They have a strong background in AI, machine learning, and data analytics, with practical experience in developing and fine-tuning machine learning models during an internship at Blackcoffer as a Data Scientist Associate.
- Pathak has published research on Sentiment Analysis and Text Extraction from Tweets using SpaCy NER, demonstrating their ability to contribute to AI research.
- Their key skills include proficiency in Python, machine learning, and deep learning frameworks (Keras, TensorFlow), experience in model development and optimiz
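
Reading the answers by eye works for a few queries. For a repeatable check, one simple option is to list keywords that each answer is expected to contain and report the ones that are missing. This is a rough sketch, not a full evaluation method; `keyword_check` is a hypothetical helper, and `fake_qa` is a stub standing in for `qa_pipeline` so the example runs without API access.

```python
def keyword_check(qa_fn, cases):
    """Run `qa_fn` on each query and report which expected keywords are missing.

    `cases` maps a query to a list of keywords its answer should contain.
    Matching is case-insensitive.
    """
    report = []
    for query, keywords in cases.items():
        answer = qa_fn(query)
        missing = [kw for kw in keywords if kw.lower() not in answer.lower()]
        report.append({"query": query, "passed": not missing, "missing": missing})
    return report


# Stub standing in for `qa_pipeline`.
def fake_qa(query):
    return "The document is a cover letter for a Research Fellowship at Microsoft."


cases = {
    "What is the main topic of the document?": ["cover letter", "Microsoft"],
    "Who is the employer?": ["Google"],
}
for row in keyword_check(fake_qa, cases):
    print(row)
```

In the notebook, `keyword_check(qa_pipeline, cases)` would run the same checks against the real pipeline.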