# Summary of the Jupyter Notebook

This notebook demonstrates how to build a retrieval-augmented generation (RAG) pipeline using Google Gemini and ChromaDB.

## Workflow Overview

1. **Environment Setup**
    - Loads environment variables and API keys.

2. **Database Initialization**
    - Initializes a persistent ChromaDB client.
    - Creates a collection for storing documents.

3. **Document Ingestion**
    - Adds a set of technology-related articles to the ChromaDB collection.

4. **Semantic Search**
    - Performs semantic search on the collection using both text and embedding-based queries.

5. **Embedding Generation**
    - Generates document embeddings with Gemini's embedding model.
    - Stores embeddings in a new ChromaDB collection.

6. **Knowledge Base Querying**
    - Queries the knowledge base using embedded queries to retrieve relevant documents.

7. **Prompt Construction & LLM Inference**
    - Constructs prompts from retrieved documents.
    - Uses Gemini's LLM to generate answers based on the provided context.

---

This notebook showcases the integration of document storage, semantic search, and large language model inference for question answering.
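
The retrieve-then-generate flow listed above can be sketched without any external services. This is a minimal illustration only: the word-count `embed` function and the in-memory `retrieve` helper are hypothetical stand-ins for Gemini embeddings and ChromaDB, and the three toy documents are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], n_results: int = 2) -> list[str]:
    """Return the n_results documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, embed(d)), reverse=True)
    return ranked[:n_results]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Combine the retrieved documents and the question into one prompt string."""
    context = "\n\n".join(retrieved)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer the question using only the information from the context above."
    )


toy_docs = [
    "Cloud computing provides scalable, on-demand access to computing resources.",
    "Machine learning frameworks include TensorFlow, PyTorch, and scikit-learn.",
    "Blockchain is a distributed ledger linked by cryptographic hashes.",
]
toy_query = "Which machine learning frameworks are popular?"

top_docs = retrieve(toy_query, toy_docs, n_results=1)
toy_prompt = build_prompt(toy_query, top_docs)
print(toy_prompt)  # in the real pipeline, this string is what gets sent to the LLM
```

The notebook cells below follow the same three steps, with ChromaDB handling storage and similarity search and Gemini producing the embeddings and the final answer.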

In [None]:
import os
from dotenv import load_dotenv
from google import genai
import chromadb

In [6]:
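load_dotenv(override=True)  # values from .env override any already set in the shell
GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY")
# Fail early with a clear message instead of passing None to the client later
if not GEMINI_API_KEY:
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file.")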

In [7]:
chroma_client = chromadb.PersistentClient("./chroma_db")
collection = chroma_client.get_or_create_collection(name="tech_articles")

In [8]:
documents = [
    """
    Artificial Intelligence and Machine Learning have revolutionized how we approach data analysis 
    and automation. Machine learning algorithms can identify patterns in large datasets that would 
    be impossible for humans to detect manually. Deep learning, a subset of machine learning, uses 
    neural networks with multiple layers to process complex data like images, text, and audio. 
    Popular frameworks like TensorFlow, PyTorch, and scikit-learn have made it easier for developers 
    to implement AI solutions. The applications range from recommendation systems and fraud detection 
    to autonomous vehicles and medical diagnosis.
    """,
    
    """
    Cloud computing has transformed the technology landscape by providing scalable, on-demand access 
    to computing resources. Major cloud providers like Amazon Web Services (AWS), Microsoft Azure, 
    and Google Cloud Platform offer services ranging from simple storage to complex machine learning 
    platforms. The three main service models are Infrastructure as a Service (IaaS), Platform as a 
    Service (PaaS), and Software as a Service (SaaS). Benefits include cost efficiency, scalability, 
    reliability, and global accessibility. However, organizations must also consider security, 
    compliance, and vendor lock-in when adopting cloud solutions.
    """,
    
    """
    Cybersecurity is becoming increasingly critical as our digital footprint expands. Common threats 
    include malware, phishing attacks, ransomware, and social engineering. Organizations implement 
    multiple layers of security including firewalls, intrusion detection systems, encryption, and 
    multi-factor authentication. The CIA triad - Confidentiality, Integrity, and Availability - 
    forms the foundation of information security principles. Regular security audits, employee 
    training, and incident response plans are essential components of a comprehensive cybersecurity 
    strategy. Zero-trust architecture is gaining popularity as a security model that assumes no 
    implicit trust within the network.
    """,
    
    """
    Blockchain technology, originally developed for Bitcoin, has found applications beyond 
    cryptocurrency. A blockchain is a distributed ledger that maintains a continuously growing 
    list of records, called blocks, which are linked using cryptographic hashes. Key features 
    include decentralization, transparency, immutability, and consensus mechanisms. Applications 
    include supply chain tracking, digital identity verification, smart contracts, and 
    decentralized finance (DeFi). Popular blockchain platforms include Ethereum, Hyperledger 
    Fabric, and Binance Smart Chain. Challenges include scalability, energy consumption, and 
    regulatory uncertainty.
    """,
    
    """
    The Internet of Things (IoT) refers to the network of physical devices embedded with sensors, 
    software, and connectivity that enables them to collect and exchange data. IoT devices range 
    from simple sensors to complex industrial machines. Applications include smart homes, wearable 
    devices, connected cars, industrial automation, and smart cities. Key technologies include 
    RFID, Wi-Fi, Bluetooth, cellular networks, and edge computing. Challenges include device 
    security, data privacy, interoperability, and managing the massive scale of connected devices. 
    Edge computing is becoming crucial for processing IoT data closer to its source.
    """,
    
    """
    DevOps is a cultural and technical movement that emphasizes collaboration between development 
    and operations teams. It aims to shorten the development lifecycle while delivering features, 
    fixes, and updates frequently and reliably. Key practices include continuous integration (CI), 
    continuous deployment (CD), infrastructure as code, monitoring, and automated testing. Popular 
    tools include Docker for containerization, Kubernetes for orchestration, Jenkins for CI/CD, 
    and Terraform for infrastructure management. Benefits include faster time to market, improved 
    collaboration, higher quality software, and better system reliability.
    """,
    
    """
    Quantum computing represents a fundamental shift in computational paradigms, leveraging quantum 
    mechanical phenomena like superposition and entanglement. Unlike classical bits that exist in 
    states of 0 or 1, quantum bits (qubits) can exist in multiple states simultaneously. This 
    enables quantum computers to perform certain calculations exponentially faster than classical 
    computers. Applications include cryptography, drug discovery, financial modeling, and 
    optimization problems. Companies like IBM, Google, and Microsoft are developing quantum 
    systems. Challenges include quantum error correction, maintaining quantum coherence, and 
    developing quantum algorithms.
    """,
    
    """
    Data Science combines statistics, mathematics, programming, and domain expertise to extract 
    insights from data. The data science process typically involves data collection, cleaning, 
    exploration, modeling, and visualization. Python and R are popular programming languages, 
    with libraries like pandas, NumPy, matplotlib, and scikit-learn. Big data technologies like 
    Hadoop, Spark, and NoSQL databases handle large-scale data processing. Data scientists work 
    on problems like predictive analytics, customer segmentation, recommendation systems, and 
    business intelligence. The field requires strong analytical skills and the ability to 
    communicate findings to stakeholders.
    """
]
len(documents)

8

In [9]:
collection.add(ids=["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"],
               documents=documents
               )

In [10]:
example_queries = [
    "What are the benefits of cloud computing?",
    "How does machine learning work?",
    "What are common cybersecurity threats?",
    "Explain blockchain technology applications",
    "What is DevOps and its key practices?",
    "How do IoT devices communicate?",
    "What makes quantum computing different?",
    "What tools do data scientists use?"
]


In [11]:
query = "What are machine learning frameworks?"
results = collection.query(
    query_texts=query,
    n_results=2
)

In [12]:
print(results['documents'][0])

['\n    Artificial Intelligence and Machine Learning have revolutionized how we approach data analysis \n    and automation. Machine learning algorithms can identify patterns in large datasets that would \n    be impossible for humans to detect manually. Deep learning, a subset of machine learning, uses \n    neural networks with multiple layers to process complex data like images, text, and audio. \n    Popular frameworks like TensorFlow, PyTorch, and scikit-learn have made it easier for developers \n    to implement AI solutions. The applications range from recommendation systems and fraud detection \n    to autonomous vehicles and medical diagnosis.\n    ', '\n    Data Science combines statistics, mathematics, programming, and domain expertise to extract \n    insights from data. The data science process typically involves data collection, cleaning, \n    exploration, modeling, and visualization. Python and R are popular programming languages, \n    with libraries like pandas, NumPy

In [13]:
retrieved_docs = results['documents'][0]

context = "\n\n".join(retrieved_docs)

In [14]:
print(context)


    Artificial Intelligence and Machine Learning have revolutionized how we approach data analysis 
    and automation. Machine learning algorithms can identify patterns in large datasets that would 
    be impossible for humans to detect manually. Deep learning, a subset of machine learning, uses 
    neural networks with multiple layers to process complex data like images, text, and audio. 
    Popular frameworks like TensorFlow, PyTorch, and scikit-learn have made it easier for developers 
    to implement AI solutions. The applications range from recommendation systems and fraud detection 
    to autonomous vehicles and medical diagnosis.
    


    Data Science combines statistics, mathematics, programming, and domain expertise to extract 
    insights from data. The data science process typically involves data collection, cleaning, 
    exploration, modeling, and visualization. Python and R are popular programming languages, 
    with libraries like pandas, NumPy, matplotlib, an
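
The printed context keeps the newlines and four-space indentation of the triple-quoted source strings. If a tidier prompt is preferred, the whitespace can be collapsed before joining. A minimal sketch, where `normalize_whitespace` is a hypothetical helper that is not part of the notebook and `raw_doc` is a shortened sample:

```python
def normalize_whitespace(doc: str) -> str:
    """Collapse newlines, indentation, and repeated spaces into single spaces."""
    return " ".join(doc.split())


# Shortened sample in the same triple-quoted, indented style as the documents list
raw_doc = """
    Cloud computing has transformed the technology landscape by providing scalable,
    on-demand access to computing resources.
    """

cleaned = normalize_whitespace(raw_doc)
print(cleaned)
```

In the notebook this could be applied as `context = "\n\n".join(normalize_whitespace(d) for d in retrieved_docs)`, which leaves the stored documents untouched and only changes what goes into the prompt.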

In [15]:
# Formulate the prompt for the LLM
prompt = f"""Use the following context to answer the question.

Context:
{context}

Question:
{query}

Answer the question using only the information from the context above."""
print(prompt)

Use the following context to answer the question.

Context:

    Artificial Intelligence and Machine Learning have revolutionized how we approach data analysis 
    and automation. Machine learning algorithms can identify patterns in large datasets that would 
    be impossible for humans to detect manually. Deep learning, a subset of machine learning, uses 
    neural networks with multiple layers to process complex data like images, text, and audio. 
    Popular frameworks like TensorFlow, PyTorch, and scikit-learn have made it easier for developers 
    to implement AI solutions. The applications range from recommendation systems and fraud detection 
    to autonomous vehicles and medical diagnosis.
    


    Data Science combines statistics, mathematics, programming, and domain expertise to extract 
    insights from data. The data science process typically involves data collection, cleaning, 
    exploration, modeling, and visualization. Python and R are popular programming langu

In [16]:
client = genai.Client(api_key=GEMINI_API_KEY)

response = client.models.generate_content(
    model="gemini-2.5-flash", contents=prompt
)
print(response.text)

Machine learning frameworks are tools like TensorFlow, PyTorch, and scikit-learn that have made it easier for developers to implement AI solutions.


In [17]:
from chromadb.utils import embedding_functions
from google.genai.types import EmbedContentConfig

In [18]:
embed_model = "gemini-embedding-001"
response = client.models.embed_content(
    model=embed_model,
    contents=documents,
    config=EmbedContentConfig(taskType="RETRIEVAL_DOCUMENT")
)

doc_embeddings = [emb.values for emb in response.embeddings]

print(f"Generated {len(doc_embeddings)} embeddings")
print(f"Dimension of first embedding: {len(doc_embeddings[0])}")

Generated 8 embeddings
Dimension of first embedding: 3072


In [None]:
# get_or_create_collection and upsert keep this cell safe to re-run against the persistent store
collection = chroma_client.get_or_create_collection(name="knowledge_base")

collection.upsert(documents=documents,
                  embeddings=doc_embeddings,
                  ids=["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"]
                  )

print("Documents in collection:", collection.count())

In [None]:
query = "What are the benefits of cloud computing?"

query_response = client.models.embed_content(
    model=embed_model,
    contents=[query],
    config=EmbedContentConfig(taskType="RETRIEVAL_QUERY")
)

query_vector = query_response.embeddings[0].values

result = collection.query(query_embeddings=[query_vector],
                          n_results=2,
                          include=["documents", "distances"])

result

{'ids': [['doc2', 'doc6']],
 'embeddings': None,
 'documents': [['\n    Cloud computing has transformed the technology landscape by providing scalable, on-demand access \n    to computing resources. Major cloud providers like Amazon Web Services (AWS), Microsoft Azure, \n    and Google Cloud Platform offer services ranging from simple storage to complex machine learning \n    platforms. The three main service models are Infrastructure as a Service (IaaS), Platform as a \n    Service (PaaS), and Software as a Service (SaaS). Benefits include cost efficiency, scalability, \n    reliability, and global accessibility. However, organizations must also consider security, \n    compliance, and vendor lock-in when adopting cloud solutions.\n    ',
   '\n    DevOps is a cultural and technical movement that emphasizes collaboration between development \n    and operations teams. It aims to shorten the development lifecycle while delivering features, \n    fixes, and updates frequently and reliab
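
Each field in the result is a list of lists, with one inner list per query, so `[0]` selects the matches for the single query sent here. A minimal sketch of pairing the fields up, using a hand-written dictionary in the same layout. `top_matches` is a hypothetical helper, the document strings are abbreviated, and the distance values are made-up placeholders rather than real output:

```python
# Hand-written stand-in with the same nested-list layout as the query result.
# The distances are placeholders chosen for illustration, not values from the notebook.
sample_result = {
    "ids": [["doc2", "doc6"]],
    "documents": [[
        "Cloud computing has transformed the technology landscape ...",
        "DevOps is a cultural and technical movement ...",
    ]],
    "distances": [[0.25, 0.60]],
}


def top_matches(result: dict, query_index: int = 0) -> list[tuple[str, float, str]]:
    """Pair each retrieved id with its distance and document for one query."""
    ids = result["ids"][query_index]
    distances = result["distances"][query_index]
    docs = result["documents"][query_index]
    return list(zip(ids, distances, docs))


for doc_id, distance, doc in top_matches(sample_result):
    # A smaller distance means the document is closer to the query vector
    print(f"{doc_id}  distance={distance:.2f}  {doc[:40]}")
```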

In [None]:
retrieved_docs = result['documents'][0]
print(retrieved_docs)

['\n    Cloud computing has transformed the technology landscape by providing scalable, on-demand access \n    to computing resources. Major cloud providers like Amazon Web Services (AWS), Microsoft Azure, \n    and Google Cloud Platform offer services ranging from simple storage to complex machine learning \n    platforms. The three main service models are Infrastructure as a Service (IaaS), Platform as a \n    Service (PaaS), and Software as a Service (SaaS). Benefits include cost efficiency, scalability, \n    reliability, and global accessibility. However, organizations must also consider security, \n    compliance, and vendor lock-in when adopting cloud solutions.\n    ', '\n    DevOps is a cultural and technical movement that emphasizes collaboration between development \n    and operations teams. It aims to shorten the development lifecycle while delivering features, \n    fixes, and updates frequently and reliably. Key practices include continuous integration (CI), \n    contin

In [None]:
context = "\n\n".join(retrieved_docs)

In [None]:
print(context)


    Cloud computing has transformed the technology landscape by providing scalable, on-demand access 
    to computing resources. Major cloud providers like Amazon Web Services (AWS), Microsoft Azure, 
    and Google Cloud Platform offer services ranging from simple storage to complex machine learning 
    platforms. The three main service models are Infrastructure as a Service (IaaS), Platform as a 
    Service (PaaS), and Software as a Service (SaaS). Benefits include cost efficiency, scalability, 
    reliability, and global accessibility. However, organizations must also consider security, 
    compliance, and vendor lock-in when adopting cloud solutions.
    


    DevOps is a cultural and technical movement that emphasizes collaboration between development 
    and operations teams. It aims to shorten the development lifecycle while delivering features, 
    fixes, and updates frequently and reliably. Key practices include continuous integration (CI), 
    continuous deployment 

In [None]:
prompt = f"""Use the following context to answer the question.

Context:
{context}

Question:
{query}

Answer the question using only the information from the context above."""
print(prompt)

Use the following context to answer the question.

Context:

    Cloud computing has transformed the technology landscape by providing scalable, on-demand access 
    to computing resources. Major cloud providers like Amazon Web Services (AWS), Microsoft Azure, 
    and Google Cloud Platform offer services ranging from simple storage to complex machine learning 
    platforms. The three main service models are Infrastructure as a Service (IaaS), Platform as a 
    Service (PaaS), and Software as a Service (SaaS). Benefits include cost efficiency, scalability, 
    reliability, and global accessibility. However, organizations must also consider security, 
    compliance, and vendor lock-in when adopting cloud solutions.
    


    DevOps is a cultural and technical movement that emphasizes collaboration between development 
    and operations teams. It aims to shorten the development lifecycle while delivering features, 
    fixes, and updates frequently and reliably. Key practices incl

In [None]:
client = genai.Client(api_key=GEMINI_API_KEY)

response = client.models.generate_content(
    model="gemini-2.5-flash", contents=prompt
)
print(response.text)

The benefits of cloud computing include cost efficiency, scalability, reliability, and global accessibility.


# Exercise: Build a RAG-Based Resume Chatbot

In this exercise, you will create a simple Retrieval-Augmented Generation (RAG) system that lets you chat with your own resume or profile. You will reuse the tools and code structure demonstrated in this notebook and adapt them to your own resume data.

---

## Objective

Build a chatbot that can answer questions about your resume or profile by retrieving relevant information and generating responses using a language model.

---

## Instructions

### 1. Prepare Your Resume/Profile Data

- Write your resume or profile as a list of text sections (e.g., Education, Experience, Skills, Projects).
- Example:
    ```python
    resume_sections = [
        "Education: B.Sc. in Computer Science from XYZ University, 2021.",
        "Experience: Software Engineering Intern at ABC Corp, Summer 2020.",
        "Skills: Python, Machine Learning, Data Analysis, SQL.",
        "Projects: Built a personal finance tracker app using Flask and SQLite."
    ]
    ```
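
If your resume already exists as a single block of plain text, the list can be produced by splitting on blank lines instead of typing each section by hand. A minimal sketch, assuming sections are separated by at least one blank line; `split_into_sections` is a hypothetical helper and `raw_resume` is sample data:

```python
import re


def split_into_sections(resume_text: str) -> list[str]:
    """Split raw resume text on blank lines and tidy the whitespace in each section."""
    blocks = re.split(r"\n\s*\n", resume_text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


raw_resume = """
Education: B.Sc. in Computer Science from XYZ University, 2021.

Skills: Python, Machine Learning,
Data Analysis, SQL.
"""

resume_sections = split_into_sections(raw_resume)
for section in resume_sections:
    print(section)
```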

### 2. Initialize ChromaDB and Create a Collection

- Use the existing ChromaDB client to create a new collection for your resume.
    ```python
    resume_collection = chroma_client.create_collection(name="student_resume")
    ```

### 3. Generate Embeddings for Each Section

- Use the Gemini embedding model to generate embeddings for each section.
    ```python
    resume_embeddings_response = client.models.embed_content(
        model=embed_model,
        contents=resume_sections,
        config=EmbedContentConfig(taskType="RETRIEVAL_DOCUMENT")
    )
    resume_embeddings = [emb.values for emb in resume_embeddings_response.embeddings]
    ```

### 4. Add Sections and Embeddings to the Collection

- Add your resume sections and their embeddings to the ChromaDB collection.
    ```python
    resume_collection.add(
        documents=resume_sections,
        embeddings=resume_embeddings,
        ids=[f"section{i+1}" for i in range(len(resume_sections))]
    )
    ```

### 5. Ask Questions About Your Resume

- Write a question you want to ask about your resume.
    ```python
    question = "What projects have I worked on?"
    ```

- Generate an embedding for your question.
    ```python
    question_embedding_response = client.models.embed_content(
        model=embed_model,
        contents=[question],
        config=EmbedContentConfig(taskType="RETRIEVAL_QUERY")
    )
    question_embedding = question_embedding_response.embeddings[0].values
    ```

- Query the collection to retrieve relevant sections.
    ```python
    result = resume_collection.query(
        query_embeddings=[question_embedding],
        n_results=2,
        include=["documents"]
    )
    retrieved_sections = result['documents'][0]
    ```

### 6. Construct a Prompt and Get an Answer

- Build a prompt using the retrieved sections and your question.
    ```python
    context = "\n\n".join(retrieved_sections)
    prompt = f"""Use the following context to answer the question.

    Context:
    {context}

    Question:
    {question}

    Answer the question using only the information from the context above."""
    ```

- Generate an answer using the Gemini model.
    ```python
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=prompt
    )
    print(response.text)
    ```

---

## Tips

- You can add more sections or details to your resume for richer responses.
- Try asking different types of questions (e.g., about skills, education, experience).
- Experiment with the number of results (`n_results`) to see how it affects the chatbot's answers.
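
To compare settings such as `n_results` quickly, steps 5 and 6 can be wrapped in one function. The sketch below is one possible arrangement, not the notebook's own code: `answer_question` takes the retrieval and generation steps as callables, and `fake_retrieve` and `echo_generate` are hypothetical offline stand-ins for the ChromaDB query and the Gemini call.

```python
from typing import Callable

PROMPT_TEMPLATE = """Use the following context to answer the question.

Context:
{context}

Question:
{question}

Answer the question using only the information from the context above."""


def answer_question(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    n_results: int = 2,
) -> str:
    """Retrieve context for a question, build the prompt, and return the generated text."""
    sections = retrieve(question, n_results)
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(sections), question=question)
    return generate(prompt)


def fake_retrieve(question: str, n_results: int) -> list[str]:
    """Offline stand-in for embedding the question and querying the collection."""
    sections = [
        "Projects: Built a personal finance tracker app using Flask and SQLite.",
        "Skills: Python, Machine Learning, Data Analysis, SQL.",
    ]
    return sections[:n_results]


def echo_generate(prompt: str) -> str:
    """Offline stand-in for the model call: returns the prompt so it can be inspected."""
    return prompt


for n in (1, 2):
    output = answer_question("What projects have I worked on?", fake_retrieve, echo_generate, n_results=n)
    print(f"--- n_results={n} ---")
    print(output)
```

Swapping the two stand-ins for functions that call `resume_collection.query` and `client.models.generate_content` turns the same loop into a side-by-side comparison of real answers.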

---

## Goal

By completing this exercise, you will learn how to build a personalized RAG-based chatbot that can answer questions about your own resume or profile using retrieval and generation techniques!