Data Scientist Assignment

As a Data Scientist, you are at the core of the RAG bot's "intelligence." You will be responsible for defining how the bot understands questions, retrieves information, and how the LLM ultimately uses that information to generate answers. Your focus will be on the agent's logic and the data pipeline.
Phase 1: Preparation
Your primary goal is to research, understand, and propose an architecture for the RAG agent, and a high-level workflow for its data.
Actions:
Deep Dive into RAG:
Review the provided MongoDB workshop notebook: https://github.com/mongodb-developer/genai-devday-notebooks/blob/main/notebooks/ai-rag-lab.ipynb
Understand each step: chunking, embedding, ingestion, vector search, RAG, and memory. Pay close attention to the "from scratch" approach.
Research different embedding models (e.g., Sentence Transformers, OpenAI Embeddings, Cohere Embeddings, models from Hugging Face). Think through their strengths, weaknesses, and how they are typically used.
Research vector databases/stores (e.g., Faiss for local prototyping, Pinecone, Weaviate, Milvus, and specifically MongoDB Atlas Vector Search as it aligns with the workshop's context).
Familiarize yourself with basic concepts of LLMs and how they consume context.

Architecture & Workflow Diagram:
Based on your research and the workshop, design a conceptual architecture diagram for the RAG agent. This should illustrate the major components and how they interact.
Create a workflow diagram showing the flow of data when a user asks a question to the Discord bot, through your RAG system, and back to the user.
Consider:
Where will the knowledge base reside?
How will documents be chunked and embedded?
Where will embeddings be stored?
How will a user query be embedded and used for search?
How will retrieved context be passed to an LLM?
Which LLM will you propose (e.g., a free/open-source option for a PoC like ollama with a local model, or a cloud API like Gemini, OpenAI if you get access)?

Office Hours (optional):
Use Office Hours to work through roadblocks, not as a gate you must pass before starting. You can begin developing your code right away instead of waiting for Office Hours.
Some things to think about:
Your proposed RAG agent architecture diagram.
Your RAG workflow diagram.
Your rationale for choosing specific embedding models, vector store approaches, and LLM providers.
Any challenges or open questions you've identified (e.g., how to handle large documents, real-time updates to the knowledge base, cost implications of LLMs).
How you plan to collaborate with the Backend Engineers on API design for your RAG components.
🏆 Phase 2: Development
You can start development even if you have not attended Office Hours. Focus first on implementing the core RAG logic.
Actions:
Data Ingestion Pipeline:
Implement the process of chunking your knowledge base documents.
Generate embeddings for each chunk.
Ingest these embeddings into your chosen vector store (e.g., a simple in-memory list for initial prototyping, a local vector store like FAISS, or integrate with MongoDB Atlas Vector Search).
Retrieval Logic:
Implement the logic to convert an incoming user query into an embedding.
Perform a vector search against your stored embeddings to retrieve the most relevant chunks.
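The retrieval step above can be sketched with an in-memory store before moving to a real vector database. This is a minimal illustration using only the standard library and toy 3-dimensional vectors; the names `cosine_similarity`, `top_k_chunks`, and `toy_store` are hypothetical, and in practice the vectors would come from the embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, store, k=3):
    """store is a list of (chunk_text, embedding) pairs; returns (score, text), best first."""
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (assumption: illustration only)
toy_store = [
    ("chunking splits documents", [1.0, 0.0, 0.0]),
    ("embeddings are vectors", [0.0, 1.0, 0.0]),
    ("chunk overlap keeps context", [0.9, 0.1, 0.0]),
]
results = top_k_chunks([1.0, 0.0, 0.0], toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is usually fine for a few thousand chunks; a vector index becomes worthwhile as the knowledge base grows.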


Augmentation & Generation (RAG Chain):
Construct the prompt for the LLM by combining the user's original query and the retrieved context.
Make an API call to the chosen LLM to generate the final answer.
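One possible way to combine the query and the retrieved context is sketched below. The function name `build_rag_messages`, the `max_context_chars` budget, and the wording of the instructions are assumptions for illustration; the returned list follows the role/content message shape used by chat-style LLM APIs.

```python
def build_rag_messages(query, retrieved_chunks, max_context_chars=4000):
    """Build chat messages that ground the LLM in the retrieved chunks."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(retrieved_chunks, 1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # Stop adding chunks once the character budget is spent
        context_parts.append(entry)
        used += len(entry)

    context = "\n\n".join(context_parts) if context_parts else "(no context retrieved)"
    system = (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know."
    )
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

example_messages = build_rag_messages(
    "What is chunking?",
    ["  Chunking splits long documents into smaller pieces. ", "Overlap preserves context."],
)
print(example_messages[1]["content"])
```

Numbering the chunks makes it easier to ask the model to cite which chunk supported its answer, which helps when checking faithfulness later.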


Collaborate with Backend Engineers (Time Permitting):
If time allows: Work closely with the Backend Engineers to define the API endpoints your RAG components will expose (e.g., an endpoint to receive a query and return an answer, or an endpoint for document ingestion).


Agent Evaluation (Time Permitting):
If time allows, research and implement basic metrics for evaluating your RAG system's performance. Consider:
Relevance/Precision: How often are the retrieved chunks actually relevant to the query?
Faithfulness: Does the generated answer only use information from the retrieved context, or does it hallucinate?
Answer Correctness: Is the final answer factually accurate based on the knowledge base?
You might collect a small set of sample questions and their expected answers from your knowledge base to test against.
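For the relevance/precision metric, a small hand-labelled test set is enough to get a first number. The sketch below computes precision@k per question and averages it; the chunk IDs and the `eval_set` structure are made up for illustration, and faithfulness or answer correctness would need a separate check (for example manual review).

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunk IDs that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant_ids)
    return hits / k

def mean_precision_at_k(eval_set, k):
    """Average precision@k over a list of {'retrieved': [...], 'relevant': {...}} items."""
    if not eval_set:
        return 0.0
    scores = [precision_at_k(item["retrieved"], item["relevant"], k) for item in eval_set]
    return sum(scores) / len(scores)

# Hypothetical labelled examples: retrieved chunk IDs vs. the IDs a human marked relevant
eval_set = [
    {"retrieved": ["c1", "c7", "c3"], "relevant": {"c1", "c3"}},
    {"retrieved": ["c2", "c9", "c4"], "relevant": {"c5"}},
]
print(f"mean precision@3 = {mean_precision_at_k(eval_set, 3):.3f}")
```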
Remember to start simple and iterate. Focus on getting a basic end-to-end flow working before optimizing or adding complex features.


In [None]:
!pip install -U datasets pandas pymongo sentence_transformers accelerate

Collecting pymongo
  Downloading pymongo-4.16.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (10.0 kB)
Collecting dnspython<3.0.0,>=2.6.1 (from pymongo)
  Downloading dnspython-2.8.0-py3-none-any.whl.metadata (5.7 kB)
Downloading pymongo-4.16.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)
Downloading dnspython-2.8.0-py3-none-any.whl (331 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.8.0 pymongo-4.16.0
Collecting sentence_transformers
  Downloading sente

In [None]:
from datasets import load_dataset
import pandas as pd

# Load an open-source Discord chat dataset from the Hugging Face Hub
dataset = load_dataset("breadlicker45/discord-chat")
dataset_df = pd.DataFrame(dataset['train'])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


output_file.csv:   0%|          | 0.00/18.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/11136 [00:00<?, ? examples/s]

In [None]:
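# Define chunking function
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the step below would be zero or negative and the loop would never end
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

# Apply chunking to every row of the 'data' column
dataset_df['chunks'] = dataset_df['data'].apply(lambda x: chunk_text(x, chunk_size=500, overlap=50))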

In [None]:
print(dataset_df.isnull().sum())


data      0
chunks    0
dtype: int64


In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def query_to_embedding(query):
    # Optionally preprocess the query here
    embedding = model.encode(query)
    return embedding

# Example usage
user_query = "How do I implement chunking for knowledge base documents?"
embedding = query_to_embedding(user_query)
print(embedding)  # This is a numpy array (vector)


[-2.65025925e-02  9.39666703e-02 -4.64301882e-03 -4.13934421e-03
  3.79471312e-04 -4.95037362e-02 -6.39533019e-03  4.03315946e-02
 -5.31265773e-02  5.61714359e-02 -1.84585117e-02 -7.10261846e-03
  5.19485809e-02  4.35996940e-03  3.01876497e-02  5.00825606e-02
 -5.73472902e-02  4.88801003e-02 -2.27210596e-02 -3.89257520e-02
  5.87890223e-02  2.41284724e-02 -3.87907624e-02  5.49254306e-02
  1.43315876e-02  5.29158041e-02 -5.79639487e-02 -2.95056235e-02
  6.17661588e-02 -2.16665808e-02  7.69351721e-02 -2.21673083e-02
  7.37912655e-02  7.03761727e-02  2.16310192e-02  4.70994674e-02
  1.93249085e-03  4.83957119e-02 -4.21897024e-02  1.65772978e-02
  1.33455861e-02 -2.66822334e-03 -2.95120962e-02  3.57294306e-02
  8.87363851e-02  5.78362420e-02 -5.13134077e-02  6.45370781e-02
  4.08332199e-02 -4.55107773e-03 -1.42568067e-01  8.13647732e-03
 -2.22886764e-02  9.07299668e-02  5.81126735e-02  2.55906172e-02
 -7.38101825e-02 -1.56094832e-02 -1.39959559e-01 -3.00114993e-02
 -3.85767221e-02 -2.22385

In [None]:
!pip install pymongo



In [None]:
from requests import get
ip = get('https://api.ipify.org').text
print(f"Your Colab IP: {ip}")

Your Colab IP: 35.221.152.39


In [None]:
import os
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

def get_mongo_client(input_mongo_uri):
    """Establish the connection to MongoDB"""
    try:
        client = MongoClient(input_mongo_uri, server_api=ServerApi('1'))
        client.admin.command('ping')
        print("Connection to MongoDB successful")
        return client
    except Exception as e:
        print(f"Connection failed: {e}")
        return None

# Read the connection string from an environment variable (or Colab secrets)
# instead of hardcoding credentials in the notebook.
# MONGODB_URI is an assumed variable name; expected form:
# mongodb+srv://<user>:<password>@<cluster-host>/?appName=<app>
working_mongo_uri = os.environ["MONGODB_URI"]

mongo_client = get_mongo_client(working_mongo_uri)
if mongo_client is None:
    raise RuntimeError("Could not connect to MongoDB; check MONGODB_URI and the Atlas IP access list")

# Select the database and collection that will act as the vector store
db = mongo_client["chatbot"]
collection = db["embeddings"]

Connection to MongoDB successful


In [None]:
import os
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
# Read the connection string from the environment (MONGODB_URI is an assumed name) rather than hardcoding credentials
uri = os.environ["MONGODB_URI"]
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [None]:
collection.delete_many({})

DeleteResult({'n': 11136, 'electionId': ObjectId('7fffffff0000000000000334'), 'opTime': {'ts': Timestamp(1769611633, 244), 't': 820}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1769611634, 60), 'signature': {'hash': b'=\x9a\xa6m\xee\xc3\xca\xa9\x15\x0fIt\x1eD\\\x1e\xb8\x03\x96\x8c', 'keyId': 7539235469406502913}}, 'operationTime': Timestamp(1769611633, 244)}, acknowledged=True)

How to Use MongoDB as Vector Store for RAG - Atlas Vector Search Index


In [None]:
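# Convert each DataFrame row into a dict and insert it as one MongoDB document.
# Note: these records hold the raw text and its chunks; no embedding field is stored yet.
documents = dataset_df.to_dict('records')
collection.insert_many(documents)
print("Data Ingestion to MongoDB inserted successfully")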

Data Ingestion to MongoDB inserted successfully
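The ingestion cell above stores the raw rows, so a vector index would have nothing to search. One possible shape for searchable documents is one record per chunk with its embedding attached. This is a sketch under assumptions: the field names `source_id`, `chunk_index`, `text`, and `embedding` are hypothetical, and `embed_fn` stands in for something like `model.encode`.

```python
def build_chunk_documents(records, embed_fn):
    """Flatten rows with a 'chunks' list into one document per chunk, each with an embedding."""
    docs = []
    for source_id, record in enumerate(records):
        for chunk_index, chunk in enumerate(record["chunks"]):
            docs.append({
                "source_id": source_id,      # Which original row the chunk came from
                "chunk_index": chunk_index,  # Position of the chunk within that row
                "text": chunk,
                # Store plain Python floats so the vector is a simple list in the document
                "embedding": [float(x) for x in embed_fn(chunk)],
            })
    return docs

# Stand-in embedding function so the sketch runs without downloading a model
def fake_embed(text):
    return [len(text), 0]

sample_records = [
    {"data": "ab", "chunks": ["a", "b"]},
    {"data": "c", "chunks": ["c"]},
]
chunk_docs = build_chunk_documents(sample_records, fake_embed)
print(chunk_docs[0])

# In the notebook this might become (not run here):
# collection.insert_many(build_chunk_documents(dataset_df.to_dict('records'), model.encode))
```

Keeping `source_id` and `chunk_index` on each document makes it possible to trace a retrieved chunk back to the conversation it came from.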


In [None]:
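# Atlas Vector Search index definition (create it in the Atlas UI or via the API).
# "path" must name the document field that holds the embedding.
# numDimensions must match the embedding model: all-MiniLM-L6-v2 outputs 384-dimensional vectors.
{
  "fields": [
    {
      "type": "vector",
      "path": "your_embedding_field",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}



{'fields': [{'type': 'vector',
   'path': 'your_embedding_field',
   'numDimensions': 384,
   'similarity': 'cosine'}]}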
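Once an index like the one above exists, it can be queried with the `$vectorSearch` aggregation stage. The helper below only builds the pipeline so it can be inspected without a live cluster; the index name `vector_index` and the projected `text` field are assumptions, and `path` must match whatever field the index definition names.

```python
def build_vector_search_pipeline(query_vector, index_name="vector_index",
                                 path="your_embedding_field", limit=5, num_candidates=100):
    """Build an aggregation pipeline for Atlas Vector Search."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": [float(x) for x in query_vector],
                "numCandidates": num_candidates,  # Candidates considered before ranking
                "limit": limit,                   # Number of results returned
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,  # Hypothetical field holding the chunk text
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=3)
print(pipeline[0]["$vectorSearch"]["limit"])

# With a live collection this might be run as (not executed here):
# results = list(collection.aggregate(pipeline))
```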

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Flatten the list of lists of chunks into a single list of all chunks
chunks = [chunk for sublist in dataset_df['chunks'] for chunk in sublist]

# Load the model and encode the query
# (the model was loaded in an earlier cell; it is reloaded here so this cell can run on its own)
model = SentenceTransformer('all-MiniLM-L6-v2')
user_query = "How do I implement chunking for knowledge base documents?"
query_embedding = model.encode(user_query).reshape(1, -1)

# Generate embeddings for all chunks
chunk_embeddings = model.encode(chunks)

# Compute similarities
similarities = cosine_similarity(query_embedding, chunk_embeddings)[0]

# Get top N relevant chunks
top_n = 5
top_indices = np.argsort(similarities)[::-1][:top_n]
relevant_chunks = [chunks[i] for i in top_indices]

# Print results
for i, chunk in enumerate(relevant_chunks, 1):
    print(f"Result {i}:\n{chunk}\n")

Result 1:
ginning of the document, and whether it's therefore OK to apply the model to such chunks)
triggerhappygandi#0001: cc @kindiana
kindiana#1016: All documents are concatenated after being shuffled and the whole thing is split into 2049 chunks with the final one dropped

Result 2:
𓅬 gabriel_syme 𓅬#3220: as in slicing a small document in content_length bytes and then...averaging smh?
𓅬 gabriel_syme 𓅬#3220: I guess one can use a 2048 context, should be enough for small documents
rom1504#5008: You can also index all the sentences independently
𓅬 gabriel_syme 𓅬#3220: yeah that makes more sense probably, increase search refinement
etk934#4704: mean pooling document chunks works surprisingly well
Brady#0053: I haven't found Ghostwriter very helpful yet, personally
ilovescience#

Result 3:
m trying to understand if I should propagate the metadata for each doc to the paragraphs I (necessarily have to) split it into.

So for example, should each section/paragraph/chunk h

LLM (OpenAI): I purchased an OpenAI API key to get output from gpt-3.5-turbo.

In [None]:
import os
from openai import OpenAI

# 1. Set up the OpenAI client with your API key
# Read the key from the OPENAI_API_KEY environment variable rather than hardcoding it in the notebook.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# 2. Define the system prompt to give the chatbot its persona
# This is where you make it a Discord Q&A chatbot assistant.
system_prompt = """
You are a helpful and expert Discord Q&A chatbot.
Your role is to assist users with questions about data science, including:
- Explaining concepts clearly and concisely
- Debugging code and identifying issues
- Explaining algorithms step-by-step
- Providing project advice and best practices

Always respond with:
- Clear, easy-to-understand explanations
- Code snippets when relevant, formatted in Markdown code blocks
- Step-by-step reasoning or examples if applicable

Be friendly, professional, and supportive in your responses.
"""

# A list to store the conversation history, which helps the bot remember context (memory)
messages = [{"role": "system", "content": system_prompt}]

# 3. Function to get a response from the API
def get_chatbot_response(user_input):
    messages.append({"role": "user", "content": user_input})

    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # Or a more advanced model like "gpt-4" or "gpt-4o"
            messages=messages,
            temperature=0.7,  # Controls randomness (0.0 is deterministic, 1.0 is creative)
            max_tokens=500
        )
        bot_response = response.choices[0].message.content
        messages.append({"role": "assistant", "content": bot_response})
        return bot_response
    except Exception as e:
        messages.pop()  # Drop the unanswered user message so the history stays consistent
        return f"An error occurred: {e}"

# 4. Main chat loop to interact with the chatbot
print("Discord Q&A Chatbot: Hello! How can I assist you with your Discord Q&A today? Type 'exit' to end the chat.")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        print("Discord Q&A Chatbot: Goodbye!")
        break

    response = get_chatbot_response(user_input)
    print(f"Discord Q&A Chatbot: {response}")



Discord Q&A Chatbot: Hello! How can I assist you with your Discord Q&A today? Type 'exit' to end the chat.
You: what is data science ?
Discord Q&A Chatbot: Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines expertise from various domains such as statistics, computer science, machine learning, and domain-specific knowledge to analyze and interpret complex data sets.

In simpler terms, data science involves collecting, cleaning, analyzing, and visualizing data to uncover patterns, trends, and insights that can be used to make informed decisions and predictions. Data scientists use tools like programming languages (e.g., Python, R), statistical techniques, machine learning algorithms, and data visualization to derive valuable information from data.

Overall, data science helps businesses and organizations make data-driven decisions, improve processes

KeyboardInterrupt: Interrupted by user
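The chat loop above appends every turn to `messages`, so a long session will eventually exceed the model's context window. One simple option is to keep the system prompt plus only the most recent messages. The helper name `trim_history` and the default of 10 messages are assumptions for illustration.

```python
def trim_history(messages, max_messages=10):
    """Keep the first system message plus the last max_messages non-system messages."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]

history = [
    {"role": "system", "content": "You are a helpful Discord Q&A chatbot."},
    {"role": "user", "content": "what is data science ?"},
    {"role": "assistant", "content": "Data science is a multidisciplinary field."},
    {"role": "user", "content": "what is chunking?"},
    {"role": "assistant", "content": "Chunking splits documents into pieces."},
]
trimmed = trim_history(history, max_messages=2)
print([m["role"] for m in trimmed])
```

Passing `trim_history(messages)` instead of `messages` to the completion call would bound the prompt size while leaving the full history available for logging.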