# **Week 3 Assignment: Building an Advanced RAG System**
---

### **Objective**

The goal of this assignment is to build, evaluate, and iteratively improve a Retrieval-Augmented Generation (RAG) system using a state-of-the-art Large Language Model from Google's Gemini family. You will move beyond a basic pipeline to implement advanced techniques like reranking, with the final application answering complex questions from a real-world financial document.

### **Problem Statement**

You are an AI Engineer at a top financial services firm. Your team has been tasked with creating a tool to help financial analysts quickly extract key information from lengthy, complex annual reports (10-K filings). Manually searching these 100+ page documents for specific figures or risk assessments is slow and error-prone.

Your task is to build a RAG-based Q&A system that allows an analyst to ask natural language questions about a company's 10-K report and receive accurate, grounded answers powered by Gemini.

### **Dataset**

You will be using the official 2022 10-K annual report for **Microsoft**. A 10-K report is a comprehensive summary of a company's financial performance.
*   **Download Link:** [Microsoft Corp. 2022 10-K Report (HTML)](https://www.sec.gov/Archives/edgar/data/789019/000156459022026876/msft-10k_20220630.htm)
    *   *Instructions: Go to the link, and save the webpage as a `.txt` file or copy-paste the relevant sections into a text file for easier processing.*

---

### **Tasks & Instructions**

Structure your work in a Jupyter Notebook (`.ipynb`) or Python files. Use markdown cells or comments (in case of Python file-based submissions) to explain your methodology, justify your choices, and present your findings at each stage.

**Part 1: Setup and API Configuration**
*   **Objective:** To configure your environment to use the Google Gemini API (or an equivalent model).
*   **Tasks:**
    1.  **Get Your API Key:**
        *   Go to [Google AI Studio](https://aistudio.google.com/).
        *   Sign in with your Google account.
        *   Click on **"Get API key"** and create a new API key. **Treat this key like a password and do not share it publicly.**
    2.  **Environment Setup:**
        *   In your development environment (for example, a Google Colab notebook or VS Code on your local machine), install the necessary libraries: `pip install -q -U google-generativeai langchain-google-genai langchain chromadb sentence-transformers`.
        *   If you're using Colab, use the "Secrets" feature (look for the key icon 🔑 on the left sidebar) to securely store your API key. Create a new secret named `GEMINI_API_KEY` and paste your key there.
    3.  **Configure the LLM:** In your code, import the necessary libraries and configure your LLM. For example, if you're using Colab:
        ```python
        import google.generativeai as genai
        from langchain_google_genai import ChatGoogleGenerativeAI
        from google.colab import userdata

        # Configure the API key
        api_key = userdata.get('GEMINI_API_KEY')
        genai.configure(api_key=api_key)

        # Instantiate the Gemini model
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
        ```
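
The example above only works inside Colab. For a local setup such as VS Code, the sketch below shows one possible way to read the same key; it assumes you have exported an environment variable with the same name, `GEMINI_API_KEY`. The helper name `load_gemini_api_key` is hypothetical and not part of any library.

```python
import os

def load_gemini_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the key from Colab Secrets when available, else from the environment."""
    try:
        # google.colab can only be imported inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(env_var)
    except Exception:
        # Not in Colab, or the secret is missing: fall back to an environment variable
        key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; add it as a Colab secret or an environment variable."
        )
    return key
```

The returned string can then be passed to `genai.configure(api_key=...)` exactly as in the Colab example.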

**Part 2: Building the Baseline RAG System**
*   **Objective:** To construct a standard, vector-search-only RAG pipeline using Gemini (or an equivalent model) as the generator.
*   **Tasks:**
    1.  **Document Loading:** Load the Microsoft 10-K report into your application.
    2.  **Chunking:** Split the document into chunks. **In a markdown cell (or in a comment, if using Python instead of Jupyter), explicitly state your chosen `chunk_size` and `chunk_overlap` and briefly explain why you chose those values.**
    3.  **Vector Store:** Create embeddings for your chunks using an open-source model (e.g., `sentence-transformers/all-MiniLM-L6-v2`) and store them in a vector database (e.g., ChromaDB).
    4.  **QA Chain:** Create a standard `RetrievalQA` chain using the `llm` object (Gemini 2.5 Flash or equivalent) you configured in Part 1.
    5.  **Initial Test:** Test your baseline system with the following question: `"What were the company's total revenues for the fiscal year that ended on June 30, 2022?"`. Display the answer.

**Part 3: Evaluating the Baseline**
*   **Objective:** To quantitatively and qualitatively assess the performance of your LLM-powered system.
*   **Tasks:**
    1.  **Create a Test Set:** Create a small evaluation set of at least **five** questions. These questions should be a mix of:
        *   **Specific Fact Retrieval:** (e.g., "What is the name of the company's independent registered public accounting firm?")
        *   **Summarization:** (e.g., "Summarize the key risks related to competition.")
        *   **Keyword-Dependent:** (e.g., "What does the report say about 'Azure'?")
    2.  **Qualitative Evaluation:** Run your five questions through the baseline RAG system. For each question, display the generated answer and the source chunks that were retrieved.
    3.  **Analysis:** In a markdown cell (or in a comment, if using Python instead of Jupyter), write a brief analysis. Did the system answer correctly? Were the retrieved chunks relevant? Did you notice any failures?

**Part 4: Implementing an Advanced RAG Technique**
*   **Objective:** To improve upon the baseline by implementing a reranker.
*   **Tasks:**
    1.  **Implement a Reranker:** Add a reranker (e.g., using `CohereRerank` or a Hugging Face cross-encoder model) into your pipeline. The flow should be: retrieve the top 10 documents -> rerank them and keep the best 3 -> pass only these 3 to the LLM for the final answer.
    2.  **Re-Evaluation:** Run your same five evaluation questions through your new, advanced RAG pipeline. Display the generated answer and the final source chunks for each.
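
The retrieve-then-rerank flow described in Part 4 can be sketched without any model dependencies. In the minimal example below, `rerank` and `overlap_scorer` are hypothetical names, the passages are toy text, and the word-overlap scorer is only a stand-in: in a real pipeline you would likely pass a function backed by a cross-encoder (for example the `predict` method of a `sentence_transformers.CrossEncoder`, which accepts a list of (query, passage) pairs) and feed it the 10 retrieved chunks.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(query: str,
           candidates: Sequence[str],
           score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
           top_n: int = 3) -> List[str]:
    """Score every (query, candidate) pair and keep the top_n highest-scoring candidates."""
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

def overlap_scorer(pairs):
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        q_words = set(query.lower().split())
        p_words = set(passage.lower().split())
        scores.append(len(q_words & p_words) / max(len(q_words), 1))
    return scores

# Toy passages standing in for the chunks returned by the vector search
candidates = [
    "azure revenue grew during the fiscal year",
    "the company competes with many firms",
    "total revenue for fiscal year 2022 is reported in the income statement",
    "employees are located worldwide",
]
top = rerank("total revenue fiscal year 2022", candidates, overlap_scorer, top_n=2)
print(top)
```

Only the strings in `top` would be placed into the prompt, which keeps the context short and focused on the passages the scorer considers most relevant.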

**Part 5: Final Analysis and Conclusion**
*   **Objective:** To compare the baseline and advanced systems and articulate the value of the advanced technique.
*   **Tasks:**
    1.  **Comparison:** In a markdown cell (or in a comment, if using Python instead of Jupyter), create a simple table or a structured list comparing the answers from the **Baseline RAG** vs. the **Advanced RAG** for your five evaluation questions.
    2.  **Conclusion:** Write a concluding paragraph answering the following:
        *   Did adding the reranker improve the results? How?
        *   Based on your experience, what is the biggest challenge in building a reliable RAG system for dense documents?
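
One possible way to produce the comparison table is to generate the markdown from the two lists of answers. This is a minimal sketch; the function name `comparison_table` and its arguments are hypothetical, and it assumes you have collected the baseline and advanced answers as plain strings in the same order as your questions.

```python
def comparison_table(questions, baseline_answers, advanced_answers):
    """Build a markdown table with one row per evaluation question."""
    def cell(text):
        # Pipes and newlines would break a markdown table row
        return str(text).replace("|", "\\|").replace("\n", " ").strip()

    lines = [
        "| # | Question | Baseline RAG | Advanced RAG |",
        "|---|---|---|---|",
    ]
    rows = zip(questions, baseline_answers, advanced_answers)
    for i, (question, baseline, advanced) in enumerate(rows, 1):
        lines.append(f"| {i} | {cell(question)} | {cell(baseline)} | {cell(advanced)} |")
    return "\n".join(lines)
```

In a notebook, the returned string can be rendered with `IPython.display.Markdown`, or simply pasted into a markdown cell.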

**Bonus Section (Optional)**
*   **Objective:** To demonstrate a deeper understanding by implementing more complex features.
*   **Choose any of the following to implement:**
    *   **Implement Query Rewriting:** Before the retrieval step, use Gemini itself to rewrite the user's query to be more effective for a financial document.
    *   **Automated Evaluation with RAGAS:** Use the `ragas` library to automatically score the faithfulness and relevance of your baseline vs. your advanced system.
    *   **Source Citing:** Modify your pipeline to not only return the answer but also explicitly cite the source chunk(s) it used.
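
For the source-citing option, one simple approach is to number the retrieved chunks in the prompt and ask the model to cite those numbers. The sketch below is only an illustration: the function name `build_cited_prompt` and the prompt wording are assumptions, not a prescribed format.

```python
def build_cited_prompt(question: str, chunks: list[str]) -> str:
    """Number each chunk so the model can refer to it as [1], [2], ..."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "After each claim, cite the supporting source number in square brackets, e.g. [2]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical usage, assuming `docs` holds the retrieved documents:
# response = llm.invoke(build_cited_prompt(query, [d.page_content for d in docs]))
```

Because the chunk numbers are assigned in your own code, you can map each citation in the answer back to the exact chunk that was shown to the model.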

---

### **Submission Instructions**

1.  **Deadline:** You have **two weeks** from the assignment release date to submit your work.
2.  **Platform:** All submissions must be made to your allocated private GitLab repository. You **must** submit your work in a branch named `week_3`.
3.  **Format:** You can submit your work as either a Jupyter Notebook (`.ipynb`) or a collection of Python scripts (`.py`).
4.  After pushing, verify that your branch and files are visible on the GitLab web interface; no further action is needed. The trainers will review all submissions on the `week_3` branch after the deadline. Assignments submitted after the deadline will not be reviewed, and this will be reflected in your course score.
5.  The use of LLMs is encouraged, but do not copy solutions blindly. Always review, test, and understand any generated code, and adapt it to the specific requirements of your assignment. Your submission should demonstrate your own comprehension, problem-solving process, and coding style, not just the unedited output of an AI tool.

### Part 1 - Setup

In [1]:
# Pin the opentelemetry packages to 1.37.0 and requests to 2.32.5 to satisfy existing dependencies
!pip install -q opentelemetry-api==1.37.0 opentelemetry-sdk==1.37.0 opentelemetry-proto==1.37.0 opentelemetry-exporter-otlp-proto-common==1.37.0 opentelemetry-exporter-otlp-proto-grpc==1.37.0 requests==2.32.5

In [2]:
!pip install -q -U google-generativeai langchain-google-genai langchain chromadb sentence-transformers langchain_community

In [3]:
from google.colab import auth
auth.authenticate_user()

In [4]:
import google.generativeai as genai
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata

# Configure the API key
api_key = userdata.get('GEMINI_API_KEY')
genai.configure(api_key=api_key)

# Instantiate the Gemini model
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=api_key)

print("✅ Gemini model configured successfully!")

✅ Gemini model configured successfully!


In [5]:
try:
    # Make a simple call to the Gemini model
    response = llm.invoke("Hello, Gemini!")

    # Print the response to confirm the connection
    print("✅ Gemini model connection successful!")
    print("Model response:", response.content)

except Exception as e:
    print(f"❌ Failed to connect to Gemini model: {e}")
    print("Please double-check your API key in Google AI Studio and the GEMINI_API_KEY secret in Colab.")

✅ Gemini model connection successful!
Model response: Hello! How can I help you today?


### Part 2 - Building the Baseline RAG System

##### Document Loading

In [6]:
file_path = "/content/sample_data/msft-10K-data.txt"

with open(file_path, "r", encoding="utf-8") as f:
    text_data = f.read()

print("✅ Microsoft 10-K file loaded successfully!")
print(f"Total characters in document: {len(text_data)}\n")

✅ Microsoft 10-K file loaded successfully!
Total characters in document: 392586



##### Chunking

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 1200
chunk_overlap = 200

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", " ", ""]
)

texts = text_splitter.split_text(text_data)
print(f"✅ Total chunks created: {len(texts)}")

✅ Total chunks created: 443
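
As a sanity check on the chunk count, the arithmetic below gives a rough estimate under the simplifying assumption that every chunk is exactly `chunk_size` characters long and overlaps its predecessor by exactly `chunk_overlap`. The helper name `estimate_chunk_count` is hypothetical.

```python
import math

def estimate_chunk_count(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of fixed-size, fixed-overlap windows needed to cover the text."""
    if total_chars <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters contributed by each extra chunk
    return math.ceil((total_chars - chunk_size) / stride) + 1

print(estimate_chunk_count(392586, 1200, 200))  # 393
```

The splitter reported 443 chunks rather than 393, presumably because `RecursiveCharacterTextSplitter` cuts at separator boundaries (paragraphs, lines, spaces), so most chunks end up shorter than the 1200-character maximum.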


##### Vector Store

In [8]:
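# These integrations live in langchain_community; importing them from `langchain` is deprecated
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma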

embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddings(model_name=embed_model_name)

persist_directory = "./msft_chroma_store"

vectordb = Chroma.from_texts(
    texts=texts,
    embedding=embedding_function,
    persist_directory=persist_directory
)
vectordb.persist()

print("✅ Embeddings created and stored in ChromaDB.\n")

✅ Embeddings created and stored in ChromaDB.




##### QA Chain

In [9]:
from langchain.chains import RetrievalQA

retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 4})

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                     # Gemini 2.5 Flash
    chain_type="stuff",          # simplest type
    retriever=retriever,
    return_source_documents=True
)

print("✅ QA chain ready!\n")

✅ QA chain ready!



##### Initial Test

In [10]:
query = "What were the company's total revenues for the fiscal year that ended on June 30, 2022?"

result = qa_chain.invoke({"query": query})

# Display the answer
print("=== 💬 MODEL ANSWER ===")
print(result["result"])
print("\n")

# Show the source snippets used by the model
print("=== 📄 SOURCE DOCUMENTS ===")
for i, doc in enumerate(result["source_documents"], 1):
    print(f"\n--- Source {i} ---")
    print(doc.page_content[:400].replace("\n", " "), "...")

=== 💬 MODEL ANSWER ===
The company's total revenues for the fiscal year that ended on June 30, 2022, were $198,270 million.


=== 📄 SOURCE DOCUMENTS ===

--- Source 1 ---
Year Ended June 30,     2022        2021        2020                 United States (a)     $  100,218        $  83,953        $  73,160     Other countries        98,052           84,135           69,855                                                                          Total     $  198,270        $  168,088        $  143,015                                               (a)  Includes billin ...

--- Source 2 ---
Year Ended June 30,     2022        2021        2020                 United States (a)     $  100,218        $  83,953        $  73,160     Other countries        98,052           84,135           69,855                                                                          Total     $  198,270        $  168,088        $  143,015                                               (a)  Includes billin ...
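
The two sources above are identical. One possible explanation, which the output does not confirm, is that the same chunks were written to the persisted Chroma store more than once, for example by re-running the vector store cell against the same `persist_directory`. Whatever the cause, a minimal sketch for dropping repeated chunks before they reach the LLM could look like this (the helper name `unique_chunks` is hypothetical):

```python
def unique_chunks(chunks):
    """Drop repeated chunk texts while keeping the original retrieval order."""
    seen = set()
    unique = []
    for text in chunks:
        key = " ".join(text.split())  # normalise whitespace before comparing
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Hypothetical usage with the result of the QA chain:
# texts_used = unique_chunks([d.page_content for d in result["source_documents"]])
```

Removing duplicates frees the retrieved slots for distinct passages, which matters when only a few chunks are passed to the model.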



### Part 3 - Evaluating the Baseline

##### Creating a test set

In [11]:
test_questions = [
    # Specific Fact Retrieval
    "What is the name of the company's independent registered public accounting firm?",

    # Specific Fact Retrieval
    "Who are Microsoft's key executives mentioned in the report?",

    # Summarization
    "Summarize the key risks related to competition.",

    # Keyword-Dependent
    "What does the report say about Azure?",

    # Summarization / Analysis
    "Summarize the company's financial performance for the fiscal year."
]

##### Qualitative Evaluation

In [12]:
for i, q in enumerate(test_questions, 1):
    print(f"\n❓ Question {i}: {q}\n{'-'*80}")
    result = qa_chain.invoke({"query": q})

    # Display answer
    print(f"💬 **Answer:**\n{result['result']}\n")

    # Display retrieved source chunks
    print("📚 **Retrieved Source Chunks:**")
    for idx, doc in enumerate(result["source_documents"], 1):
        print(f"\nChunk {idx}:\n{doc.page_content[:500]}...")  # show first 500 chars



❓ Question 1: What is the name of the company's independent registered public accounting firm?
--------------------------------------------------------------------------------
💬 **Answer:**
The company's independent registered public accounting firm is Deloitte & Touche LLP.

📚 **Retrieved Source Chunks:**

Chunk 1:
The Company engaged Deloitte & Touche LLP, an independent registered public accounting firm, to audit and render an opinion on the consolidated financial statements and internal control over financial reporting in accordance with the standards of the Public Company Accounting Oversight Board (United States).

The Board of Directors, through its Audit Committee, consisting solely of independent directors of the Company, meets periodically with management, internal auditors, and our independent re...

Chunk 2:
The Company engaged Deloitte & Touche LLP, an independent registered public accounting firm, to audit and render an opinion on the consolidated financial statements an