# RepoMind RAG Evaluation Notebook

This notebook evaluates the performance of the RAG pipeline on a target code repository. It performs the following steps:

1.  **Setup & Ingestion**: Clones the target repository and ingests it into the vector database.
2.  **Define Evaluation Dataset**: Creates a set of questions and ground truth answers to test the RAG system's understanding of the codebase.
3.  **Run Pipeline**: Runs the full retrieval and generation pipeline for each question.
4.  **Evaluate with `ragas`**: Uses the `ragas` library to calculate key metrics like faithfulness, context precision, context recall, and answer relevancy.

**Target Repository**: `https://github.com/psf/requests`

## 1. Setup and Ingestion

In [1]:
import os
import sys
import pandas as pd
from datasets import Dataset

# Add the src directory to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.database import initialize_database
from src.ingestion import ingest_repo
from src.retrieval import Retriever
from src.llm import LLMEngine

# --- Configuration ---
TEST_REPO_URL = "https://github.com/psf/requests"

print("🔧 Initializing database...")
initialize_database()

print(f"\n🚀 Ingesting repository: {TEST_REPO_URL}...")
try:
    ingest_repo(TEST_REPO_URL, force_clone=False) # Set force_clone=True to re-download
except Exception as e:
    print(f"❌ Ingestion failed: {e}")

2025-12-01 22:36:12,394 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5


🔧 Initializing database...
🔄 Loading Embedding Model: BAAI/bge-small-en-v1.5...


2025-12-01 22:36:18,775 - INFO - 1 prompt is loaded, with the key: query
2025-12-01 22:36:19,415 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


✅ Embedding Model Loaded.

🚀 Ingesting repository: https://github.com/psf/requests...

🚀 Starting ingestion for: https://github.com/psf/requests

📂 Repo already exists at c:\My Projects\RepoMind\data\cloned_repos\requests, skipping clone...
🔍 Scanning files in c:\My Projects\RepoMind\data\cloned_repos\requests...
✅ Loaded 45 documents from 45 code files.
⚠️ Skipped 2 files.

📄 Processing 45 documents...
✂️ Chunking code files (AST-aware when possible)...
  ⚠️ Skipping empty document: __init__.py
  ⚠️ AST parsing failed for custom.css, using text fallback for this file
  ⚠️ AST parsing failed for hacks.html, using text fallback for this file
  ⚠️ AST parsing failed for sidebarintro.html, using text fallback for this file
  ⚠️ AST parsing failed for sidebarlogo.html, using text fallback for this file
  ✓ AST-aware chunked 44 python files into 259 nodes
🧩 Created 259 semantic chunks.

💾 Saving to Vector Database (this may take a while)...


Generating embeddings:   0%|          | 0/259 [00:00<?, ?it/s]


✅ Ingestion Complete! Embeddings stored in ChromaDB.
   Repository: requests
   Documents: 45
   Chunks: 259



## 2. Initialize RAG Components

In [2]:
try:
    retriever = Retriever(use_reranker=True)
    llm_engine = LLMEngine()
    print("✅ RAG components loaded successfully.")
except Exception as e:
    print(f"❌ Failed to initialize RAG components: {e}")

2025-12-01 22:36:36,501 - INFO - Loading all indices.


📂 Loading Index from c:\My Projects\RepoMind\data\chromadb...
⚠️ No index metadata found, creating index from vector store...
✅ Index created from existing vector store
🚀 Initializing Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2...
🧠 Initializing LLM: openai/gpt-oss-120b...
✅ RAG components loaded successfully.


## 3. Define Evaluation Questions

Here we define a list of questions to ask the RAG model. We also provide `ground_truth` answers, which are required by `ragas` to calculate `context_recall`. The ground truth should be a concise, factual statement that is expected to be found in the source documents.

In [3]:
eval_questions = [
    {
        "question": "What is the main purpose of the requests library?",
        "ground_truth": "Requests is an HTTP library for Python, built for human beings. It allows you to send HTTP/1.1 requests extremely easily."
    },
    {
        "question": "How does the Session object persist parameters across requests?",
        "ground_truth": "A Session object has a variety of methods for customizing requests, such as setting headers, auth, cookies, and proxies. These settings are persisted across all requests made with that session instance."
    },
    {
        "question": "What is the role of the `requests.adapters.HTTPAdapter`?",
        "ground_truth": "The HTTPAdapter is responsible for the actual transport of the request. It sends the request to the target server and handles connection pooling."
    },
    {
        "question": "How can you specify a timeout for a request?",
        "ground_truth": "You can tell Requests to stop waiting for a response after a given number of seconds with the `timeout` parameter. It can be a float for a connect and read timeout, or a tuple `(connect_timeout, read_timeout)`."
    },
    {
        "question": "How are cookies handled in the requests library?",
        "ground_truth": "Cookies are returned in a `RequestsCookieJar`, which acts like a dictionary but also works across domains and paths. Session objects also persist cookies across all requests."
    },
    {
        "question": "What file defines the main `requests.get` function?",
        "ground_truth": "The `get` function is a wrapper defined in `requests/api.py` that calls the `request` function with the method set to 'GET'."
    }
]
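
Since `context_recall` depends on the `ground_truth` strings, it can help to sanity-check the list before running the pipeline. The sketch below is one possible check and is not part of RepoMind: `validate_eval_items` is a hypothetical helper name, and it only verifies that each entry has a non-empty `question` and `ground_truth` and that no question is repeated.

```python
def validate_eval_items(items):
    """Return a list of human-readable problems; an empty list means no issues were found."""
    problems = []
    seen_questions = set()
    for index, item in enumerate(items):
        # Both fields must be present and contain non-whitespace text.
        for key in ("question", "ground_truth"):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty '{key}'")
        # Flag repeated questions (case-insensitive, ignoring outer whitespace).
        question = item.get("question")
        if isinstance(question, str) and question.strip():
            normalized = question.strip().lower()
            if normalized in seen_questions:
                problems.append(f"item {index}: duplicate question")
            seen_questions.add(normalized)
    return problems


# Small illustrative input (hypothetical entries, not the evaluation set itself).
sample_items = [
    {"question": "How can you specify a timeout for a request?", "ground_truth": "Use the `timeout` parameter."},
    {"question": "How can you specify a timeout for a request?", "ground_truth": ""},
]
for problem in validate_eval_items(sample_items):
    print(problem)
```

Running `validate_eval_items(eval_questions)` before Section 4 would surface such problems before any LLM calls are made.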

## 4. Run the RAG Pipeline and Collect Results

In [4]:
results = []
for item in eval_questions:
    question = item["question"]
    print(f"\nProcessing question: {question}")
    
    # 1. Retrieve context
    context_nodes = retriever.search(question)
    contexts = [node.get_content() for node in context_nodes]
    
    # 2. Generate answer
    answer = llm_engine.chat(question, context_nodes)
    
    results.append({
        "question": question,
        "answer": answer,
        "contexts": contexts,
        "ground_truth": item["ground_truth"]
    })
    
    print(f"  -> Answer generated.")

# Convert to Hugging Face Dataset
results_df = pd.DataFrame(results)
eval_dataset = Dataset.from_pandas(results_df)

print("\n✅ Pipeline execution complete.")
eval_dataset


Processing question: What is the main purpose of the requests library?
🔍 Searching for: 'What is the main purpose of the requests library?'
   📊 Found 10 vector matches...
   ✨ Reranking to top 5 results...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Reranked to 5 results


2025-12-01 22:36:49,350 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  -> Answer generated.

Processing question: How does the Session object persist parameters across requests?
🔍 Searching for: 'How does the Session object persist parameters across requests?'
   📊 Found 10 vector matches...
   ✨ Reranking to top 5 results...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Reranked to 5 results


2025-12-01 22:36:51,647 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  -> Answer generated.

Processing question: What is the role of the `requests.adapters.HTTPAdapter`?
🔍 Searching for: 'What is the role of the `requests.adapters.HTTPAdapter`?'
   📊 Found 10 vector matches...
   ✨ Reranking to top 5 results...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-01 22:36:52,151 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-12-01 22:36:52,153 - INFO - Retrying request to /chat/completions in 7.000000 seconds


   ✅ Reranked to 5 results


2025-12-01 22:37:00,386 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  -> Answer generated.

Processing question: How can you specify a timeout for a request?
🔍 Searching for: 'How can you specify a timeout for a request?'
   📊 Found 10 vector matches...
   ✨ Reranking to top 5 results...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-01 22:37:00,926 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-12-01 22:37:00,926 - INFO - Retrying request to /chat/completions in 20.000000 seconds


   ✅ Reranked to 5 results


2025-12-01 22:37:23,041 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  -> Answer generated.

Processing question: How are cookies handled in the requests library?
🔍 Searching for: 'How are cookies handled in the requests library?'
   📊 Found 10 vector matches...
   ✨ Reranking to top 5 results...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-01 22:37:23,603 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"


   ✅ Reranked to 5 results


2025-12-01 22:37:23,604 - INFO - Retrying request to /chat/completions in 22.000000 seconds
2025-12-01 22:37:48,987 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  -> Answer generated.

Processing question: What file defines the main `requests.get` function?
🔍 Searching for: 'What file defines the main `requests.get` function?'
   📊 Found 10 vector matches...
   ✨ Reranking to top 5 results...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-01 22:37:49,522 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-12-01 22:37:49,522 - INFO - Retrying request to /chat/completions in 25.000000 seconds


   ✅ Reranked to 5 results


2025-12-01 22:38:15,827 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  -> Answer generated.

✅ Pipeline execution complete.


Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 6
})
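
The log above shows several `429 Too Many Requests` responses, each followed by an automatic retry after a delay of up to 25 seconds. One way to reduce these might be to pace the calls ourselves. The following is a minimal sketch under that assumption: `Throttle` and the 10-second interval are hypothetical, not part of RepoMind or the LLM client, and a suitable interval depends on the account's actual rate limit.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait().

    Hypothetical helper for illustration; clock and sleep are injectable
    so the behaviour can be checked without real delays.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous wait(), if any

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                # Too soon after the previous call: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Assumed value: at most one generation request every 10 seconds.
throttle = Throttle(min_interval=10.0)
```

In the loop above, `throttle.wait()` would go just before `llm_engine.chat(question, context_nodes)`, so retrieval and reranking are not delayed.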

## 5. Evaluate with `ragas`

Now we use `ragas` to evaluate the collected responses. We will measure:

- **Faithfulness**: How factually consistent the answer is with the retrieved context.
- **Answer Relevancy**: How relevant the answer is to the question.
- **Context Precision**: How relevant the retrieved contexts are, with relevant chunks expected to rank near the top.
- **Context Recall**: Whether all the necessary information from the `ground_truth` was retrieved.
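
To make the two context metrics more concrete, here is a rough sketch of the arithmetic behind them, assuming the definitions given in the `ragas` documentation: context precision averages precision@k over the ranks that hold a relevant chunk, and context recall is the share of ground-truth statements supported by the retrieved context. In `ragas` the per-chunk and per-statement verdicts come from the judge LLM. The 0/1 lists below are hypothetical inputs, and the function names are ours, not the library's.

```python
def context_precision_at_k(relevance):
    """Average of precision@k over the ranks k where the chunk was judged relevant.

    relevance: list of 0/1 verdicts, one per retrieved chunk, in ranked order.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at a relevant position
    return precision_sum / hits if hits else 0.0


def context_recall(supported):
    """Fraction of ground-truth statements judged attributable to the contexts.

    supported: list of 0/1 verdicts, one per statement in the ground truth.
    """
    return sum(supported) / len(supported) if supported else 0.0


# Hypothetical verdicts: the second of five ranked chunks is irrelevant.
print(round(context_precision_at_k([1, 0, 1, 1, 1]), 6))  # 0.804167
# Hypothetical verdicts: one of two ground-truth sentences is supported.
print(context_recall([1, 0]))  # 0.5
```

An irrelevant chunk costs more the higher it ranks, which is why context precision can fall below 1.0 even when most of the retrieved chunks are useful.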

In [29]:
# ========== FULL RAGAS EVALUATION (ALL SAMPLES) ==========
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import HuggingFaceEmbeddings
from ragas.run_config import RunConfig

from langchain_ollama import ChatOllama

# -----------------------------------------------------------
# 1. PREPARE THE DATASET
# -----------------------------------------------------------
ds = eval_dataset  # already created earlier

# Rename our columns to the names ragas expects (old → new)
rename_map = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}

for old, new in rename_map.items():
    if old in ds.column_names and new not in ds.column_names:
        ds = ds.rename_column(old, new)

# Ensure retrieved_contexts is list[str]
def ensure_list(example):
    rc = example.get("retrieved_contexts")
    if rc is None:
        example["retrieved_contexts"] = []
    elif isinstance(rc, str):
        example["retrieved_contexts"] = [rc]
    return example

if "retrieved_contexts" in ds.column_names:
    ds = ds.map(ensure_list)

# -----------------------------------------------------------
# 2. JUDGE (LLAMA VIA OLLAMA)
# -----------------------------------------------------------
judge_chat = ChatOllama(
    model="llama3.1",
    base_url="http://localhost:11434",
    temperature=0.0,
)

judge_llm = LangchainLLMWrapper(judge_chat)

# -----------------------------------------------------------
# 3. EMBEDDINGS (FOR SIMILARITY METRICS)
# -----------------------------------------------------------
emb = HuggingFaceEmbeddings(model="BAAI/bge-small-en-v1.5")

# -----------------------------------------------------------
# 4. RUN CONFIG
# -----------------------------------------------------------
run_cfg = RunConfig(
    max_workers=1,       # sequential (stable)
    timeout=300,         # 5 min per call to avoid timeouts
    max_wait=200,
    max_retries=1,
)

# -----------------------------------------------------------
# 5. RUN FULL EVALUATION
# -----------------------------------------------------------
print("\n🚀 Running FULL RAGAS evaluation on all samples...")
result = evaluate(
    dataset=ds,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    llm=judge_llm,
    embeddings=emb,
    run_config=run_cfg,
    raise_exceptions=False,   # Full run → don't stop mid-way
)

print("\n✅ Evaluation complete.")

# -----------------------------------------------------------
# 6. SHOW RESULTS
# -----------------------------------------------------------
df = result.to_pandas()
print("\n=== Per-sample metrics ===")
print(df[['user_input', 'context_precision', 'context_recall', 'faithfulness', 'answer_relevancy']])

print("\n=== Averages ===")
print(df[['context_precision', 'context_recall', 'faithfulness', 'answer_relevancy']].mean())

df


Map:   0%|          | 0/6 [00:00<?, ? examples/s]

  judge_llm = LangchainLLMWrapper(judge_chat)
2025-12-02 00:54:49,038 - INFO - Use pytorch device_name: cuda:0
2025-12-02 00:54:49,039 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5



🚀 Running FULL RAGAS evaluation on all samples...


Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

2025-12-02 00:54:59,153 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 00:55:20,849 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 00:55:40,640 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 00:57:09,982 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 00:59:18,927 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 01:00:21,113 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 01:02:00,424 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 01:03:47,225 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 01:06:01,924 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-02 01:06:06,226 - INFO - HTTP Request: POST http://localhost:11434/api/cha


✅ Evaluation complete.

=== Per-sample metrics ===
                                                        user_input  context_precision  context_recall  faithfulness  answer_relevancy
0                What is the main purpose of the requests library?           0.916667             1.0      0.714286          0.989356
1  How does the Session object persist parameters across requests?           0.804167             1.0           NaN          0.937942
2         What is the role of the `requests.adapters.HTTPAdapter`?           0.679167             0.5      1.000000          0.994467
3                     How can you specify a timeout for a request?           0.887500             1.0      0.833333          0.921742
4                 How are cookies handled in the requests library?           1.000000             0.5           NaN          0.950434
5              What file defines the main `requests.get` function?           0.950000             0.0      0.333333          0.938500

=== Ave

Unnamed: 0,user_input,retrieved_contexts,response,reference,context_precision,context_recall,faithfulness,answer_relevancy
0,What is the main purpose of the requests library?,"[# Requests\r\n\r\n**Requests** is a simple, yet elegant, HTTP library.\r\n\r\n```python\r\n>>> import requests\r\n>>> r = requests.get('https://httpbin.org/basic-auth/user/pass', auth=('user', 'pass'))\r\n>>> r.status_code\r\n200\r\n>>> r.headers['content-type']\r\n'application/json; charset=utf8'\r\n>>> r.encoding\r\n'utf-8'\r\n>>> r.text\r\n'{""authenticated"": true, ...'\r\n>>> r.json()\r\n{...","The **Requests** library is a Python HTTP client whose main purpose is to provide a simple, human‑friendly way to send HTTP/1.1 requests (GET, POST, etc.) and handle the responses.\n\n- The README describes it as “a simple, yet elegant, HTTP library.” \n- The module docstring in `src\requests\__init__.py` says: “Requests is an HTTP library, written in Python, for human beings.” \n\nThus, its...","Requests is an HTTP library for Python, built for human beings. It allows you to send HTTP/1.1 requests extremely easily.",0.916667,1.0,0.714286,0.989356
1,How does the Session object persist parameters across requests?,"[""""""\r\nrequests.sessions\r\n~~~~~~~~~~~~~~~~~\r\n\r\nThis module provides a Session object to manage and persist settings across\r\nrequests (cookies, auth, proxies).\r\n""""""\r\nimport os\r\nimport sys\r\nimport time\r\nfrom collections import OrderedDict\r\nfrom datetime import timedelta\r\n\r\nfrom ._internal_utils import to_native_string\r\nfrom .adapters import HTTPAdapter\r\nfrom .auth im...",**How the `Session` object keeps parameters alive between calls**\n\nIn `src\requests\sessions.py` the `Session` class stores a set of attributes that act as the *default* values for every request made through that session:\n\n```python\ndef __init__(self):\n self.headers = default_headers() # default headers\n self.auth = None # default auth tuple / ob...,"A Session object has a variety of methods for customizing requests, such as setting headers, auth, cookies, and proxies. These settings are persisted across all requests made with that session instance.",0.804167,1.0,,0.937942
2,What is the role of the `requests.adapters.HTTPAdapter`?,"[def request_url(self, request, proxies):\r\n """"""Obtain the url to use when making the final request.\r\n\r\n If the message is being sent through a HTTP proxy, the full URL has to\r\n be used. Otherwise, we should only use the path portion of the URL.\r\n\r\n This should not be called from user code, and is only exposed for use\r\n when subclassing the\r\n ...","**Role of `requests.adapters.HTTPAdapter`**\n\n`HTTPAdapter` is the built‑in transport adapter that lets a **Requests** `Session` actually talk to HTTP/HTTPS endpoints. \n\n- It implements the *Transport Adapter* interface required by a `Session`. \n- It creates and manages urllib3 connection pools (via `init_poolmanager`) and, when needed, proxy managers. \n- It handles proxy selection, UR...",The HTTPAdapter is responsible for the actual transport of the request. It sends the request to the target server and handles connection pooling.,0.679167,0.5,1.0,0.994467
3,How can you specify a timeout for a request?,"[class TestTimeout:\r\n def test_stream_timeout(self, httpbin):\r\n try:\r\n requests.get(httpbin(""delay/10""), timeout=2.0)\r\n except requests.exceptions.Timeout as e:\r\n assert ""Read timed out"" in e.args[0].args[0]\r\n\r\n @pytest.mark.parametrize(\r\n ""timeout, error_text"",\r\n (\r\n ((3, 4, 5), ""(connect, read)""),\r\n ...","You set the timeout by passing a **`timeout`** argument to the request‑making function (e.g. `requests.get`, `requests.post`, `requests.request`, or the `Session.request` method).\n\n**What you can pass**\n\n| Type | Meaning |\n|------|---------|\n| `float` (or `int`) | The same number of seconds is used for both the *connect* and *read* phases. |\n| `tuple` `(connect, read)` | Separate values...","You can tell Requests to stop waiting for a response after a given number of seconds with the `timeout` parameter. It can be a float for a connect and read timeout, or a tuple `(connect_timeout, read_timeout)`.",0.8875,1.0,0.833333,0.921742
4,How are cookies handled in the requests library?,"[""""""\r\nrequests.cookies\r\n~~~~~~~~~~~~~~~~\r\n\r\nCompatibility code to be able to use `http.cookiejar.CookieJar` with requests.\r\n\r\nrequests.utils imports from here, so be careful with imports.\r\n""""""\r\n\r\nimport calendar\r\nimport copy\r\nimport time\r\n\r\nfrom ._internal_utils import to_native_string\r\nfrom .compat import Morsel, MutableMapping, cookielib, urlparse, urlunparse\r\n\...",**Cookie handling in `requests` is built on top of Python’s `http.cookiejar.CookieJar` but wrapped with a few helper classes so that the rest of the library can work with a simple dict‑like API.** \n\nBelow is a concise walk‑through of the relevant pieces that appear in the source files you provided.\n\n---\n\n### 1. The underlying jar – `RequestsCookieJar`\n\n*File: `src\requests\cookies.py`...,"Cookies are returned in a `RequestsCookieJar`, which acts like a dictionary but also works across domains and paths. Session objects also persist cookies across all requests.",1.0,0.5,,0.950434
5,What file defines the main `requests.get` function?,"[def request(method, url, **kwargs):\r\n """"""Constructs and sends a :class:`Request <Request>`.\r\n\r\n :param method: method for the new :class:`Request` object: ``GET``, ``OPTIONS``, ``HEAD``, ``POST``, ``PUT``, ``PATCH``, or ``DELETE``.\r\n :param url: URL for the new :class:`Request` object.\r\n :param params: (optional) Dictionary, list of tuples or bytes to send\r\n in ...","The top‑level **`requests.get`** function is defined in **`src\requests\api.py`**:\n\n```python\ndef get(url, params=None, **kwargs):\n """"""Sends a GET request.""""""\n return request(""get"", url, params=params, **kwargs)\n```\n\n(See the `get` definition in `src\requests\api.py`.)",The `get` function is a wrapper defined in `requests/api.py` that calls the `request` function with the method set to 'GET'.,0.95,0.0,0.333333,0.9385


In [32]:
import numpy as np
import pandas as pd

# df = evaluation_df   # ensure this variable exists

summary_file = "ragas_metrics_summary.txt"

# Select only the four RAGAS metric columns (ignore everything else)
metric_cols = ["context_precision", "context_recall", "faithfulness", "answer_relevancy"]

# Compute averages safely (skip missing columns)
metrics_present = [m for m in metric_cols if m in df.columns]
metric_means = df[metrics_present].mean(skipna=True)

# Write to txt
with open(summary_file, "w", encoding="utf-8") as f:
    f.write("=== RAGAS Evaluation Metrics Summary ===\n\n")
    for metric, value in metric_means.items():
        f.write(f"{metric}: {value:.4f}\n")

print(f"✅ Metrics written to {summary_file}")


✅ Metrics written to ragas_metrics_summary.txt
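
The per-sample table shows `NaN` for `faithfulness` on two of the six questions, so a mean computed with `skipna=True` covers only the four remaining samples. It may be worth reporting how many samples each average is based on. The sketch below illustrates this in plain Python using the `faithfulness` values copied from the table above; `nan_aware_summary` is a hypothetical helper, not part of `ragas` or pandas.

```python
import math


def nan_aware_summary(values):
    """Mean over the non-NaN values, plus how many samples contributed."""
    valid = [v for v in values if not math.isnan(v)]
    mean = sum(valid) / len(valid) if valid else math.nan
    return {"mean": mean, "n_valid": len(valid), "n_total": len(values)}


# Faithfulness scores from the per-sample metrics table (NaN = no score produced).
faithfulness_scores = [0.714286, math.nan, 1.0, 0.833333, math.nan, 0.333333]

summary = nan_aware_summary(faithfulness_scores)
print(f"faithfulness: {summary['mean']:.4f} "
      f"(based on {summary['n_valid']} of {summary['n_total']} samples)")
```

Writing the sample count next to each average in `ragas_metrics_summary.txt` would make it clear when a metric reflects only part of the evaluation set.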
