## **1. Introduction**

In the previous notebooks, I processed a complex PDF, created two sets of text chunks in order to test two different chunking strategies, generated vector embeddings, and indexed everything in a Weaviate database. Now, we've reached a crucial stage: **retrieval**. How well can we find the right information to answer a user's query?

In this notebook, I run two key experiments to answer that question:

1.  **Chunking Strategy Comparison**: I'll test `fixed-size` chunks against `section-based` chunks to see which strategy yields more relevant context.
2.  **Search Method Comparison**: I'll compare the performance of different search algorithms—keyword (BM25), pure vector (semantic), and hybrid search—to see which performs best for this use case.

---

## **2. The Ground Truth Dataset**

To provide an objective benchmark for these experiments, I created a custom **ground truth dataset**, which is available in the project at `data/evaluation/ground_truth.json`.

This file contains a list of questions a translator might realistically ask. For each question, I manually identified the exact passages from the style guide that should be retrieved to provide a comprehensive answer. Each of these expected passages is also assigned a relevance level ('Primary', 'Secondary', or 'Tertiary'), allowing for a more nuanced, weighted evaluation.

This dataset acts as our "golden set." By comparing the retrieved chunks from each strategy against these ideal answers, we can quantitatively measure performance using metrics like F1-score and make data-driven decisions.
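
For orientation, a single entry in that file might look roughly like the sketch below. The field names (`questions`, `id`, `question`, `ground_truth`, `section_title`, `expected_text`, `relevance`) are the ones the loading and scoring code in this notebook reads. The values are placeholders, not actual content from the file, and `validate_ground_truth` is a hypothetical helper added only for illustration.

```python
# Hypothetical ground truth entry: field names follow the notebook's code,
# values are placeholders rather than real file content.
example_ground_truth = {
    "questions": [
        {
            "id": 1,
            "question": "When is it correct to use square brackets?",
            "ground_truth": [
                {
                    "section_title": "<section heading in the style guide>",
                    "expected_text": "<exact passage that should be retrieved>",
                    "relevance": "Primary",
                },
                {
                    "section_title": "<another section heading>",
                    "expected_text": "<supporting passage>",
                    "relevance": "Secondary",
                },
            ],
        }
    ]
}

ALLOWED_RELEVANCE = {"Primary", "Secondary", "Tertiary"}


def validate_ground_truth(data):
    """Checks required keys and relevance labels; returns the number of entries checked."""
    checked = 0
    for q in data["questions"]:
        assert "id" in q and "question" in q, "each question needs an id and a question"
        for entry in q["ground_truth"]:
            assert entry["relevance"] in ALLOWED_RELEVANCE, "unknown relevance label"
            assert entry["expected_text"], "expected_text must not be empty"
            checked += 1
    return checked


print(validate_ground_truth(example_ground_truth))  # prints 2
```

A check like this can catch a mistyped relevance label before it raises a `KeyError` deep inside the scoring code.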

---


## **3. Setup and Configuration**

This first cell imports the necessary libraries and sets up the key constants for our evaluation experiments.

* **Libraries**:
    * `weaviate` and `weaviate.classes.query as wq`: These are used to connect to our database and construct the different types of search queries we want to test.
    * `sentence_transformers` & `torch`: We need these again to load the *same* `BGE` embedding model. To perform a vector search, the user's question must be converted into a vector using the exact same model that was used to embed the documents.
    * `json`, `os`, `datetime`: Standard libraries for loading our ground truth evaluation file and for saving the timestamped results of our experiments.
* **Constants**:
    * `COLLECTION_NAME`: Specifies the `StyleGuide` collection in Weaviate that we will be querying.
    * `MODEL_NAME`: Ensures we load the correct embedding model to create vectors for our queries.

In [1]:
# === IMPORTS AND CONFIGURATION ===
import weaviate
import weaviate.classes.query as wq
from sentence_transformers import SentenceTransformer
import torch
import json
import os
from datetime import datetime

# --- Project Constants ---
COLLECTION_NAME = 'StyleGuide'
MODEL_NAME = 'BAAI/bge-large-en-v1.5'

The second cell prepares the environment for the evaluation experiments. It performs three key actions:
1.  **Checks for a GPU** to ensure the embedding process for our queries runs as fast as possible.
2.  **Loads the `BGE` embedding model** into memory, making it ready to convert questions into vectors.
3.  **Connects to the Weaviate database** and accesses the `StyleGuide` collection, so we can start running queries against it.

In [2]:
# === INITIALIZE MODEL AND CONNECT TO WEAVIATE ===
print('--- Initializing Model and Connecting to Weaviate ---')

if torch.cuda.is_available():
    device = 'cuda'
    print(f'✅ CUDA available. Using GPU: {torch.cuda.get_device_name(0)}')
else:
    device = 'cpu'
    print('⚠️ CUDA not available. Using CPU.')

print(f"Loading embedding model '{MODEL_NAME}'...")
embedding_model = SentenceTransformer(MODEL_NAME, device=device)
print('✅ Embedding model loaded successfully')

client = weaviate.connect_to_local()
collection = client.collections.get(COLLECTION_NAME)
print(f"✅ Connected to Weaviate collection '{COLLECTION_NAME}'.")

--- Initializing Model and Connecting to Weaviate ---
✅ CUDA available. Using GPU: NVIDIA GeForce GTX 1050 Ti
Loading embedding model 'BAAI/bge-large-en-v1.5'...
✅ Embedding model loaded successfully
✅ Connected to Weaviate collection 'StyleGuide'.


---

## **4. Defining the Search Functions**

Now that the environment is ready, the next step is to define the functions that will perform the actual searches. To find the most effective retrieval method, I've implemented three different strategies supported by Weaviate:

1.  **Pure Vector Search (`query_semantic_vector`)**: This function takes a question, converts it into a vector embedding, and searches for the chunks with the most similar vectors in the database based on cosine similarity. This method is excellent at finding semantically related content, even if the keywords don't match exactly.
2.  **Keyword Search (`query_semantic_bm25`)**: This function uses the classic BM25 algorithm, a powerful keyword-based search method. It's very effective when the user's query contains the exact terms present in the source document.
3.  **Hybrid Search (`query_semantic_hybrid`)**: This function combines the strengths of both previous methods. It runs a vector search and a keyword search, then fuses the two result lists into a single ranking. The `alpha` parameter controls the balance: `alpha=0` is pure keyword search, `alpha=1` is pure vector search, and values in between weight the two accordingly.

Each function is also designed to accept a `method_filter`, which allows us to restrict the search to only fixed-size or section-based chunks, a crucial feature for our first experiment.
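
To build intuition for what `alpha` does, here is a toy sketch of score blending. It is a simplified illustration, not Weaviate's actual fusion algorithm, which differs in detail. The helper names (`min_max_normalize`, `blend`) and all scores are made up for this example.

```python
def min_max_normalize(scores):
    """Rescales scores to [0, 1]; if all scores are equal, every item gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def blend(vector_scores, keyword_scores, alpha):
    """Weighted sum of normalized scores: alpha weights vector, (1 - alpha) weights keyword."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    fused = {
        chunk_id: alpha * v.get(chunk_id, 0.0) + (1 - alpha) * k.get(chunk_id, 0.0)
        for chunk_id in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: cosine similarities on the left, BM25 scores on the right.
vector_scores = {"chunk_a": 0.82, "chunk_b": 0.75, "chunk_c": 0.60}
keyword_scores = {"chunk_b": 7.1, "chunk_c": 9.4, "chunk_d": 3.0}

for alpha in (0.0, 0.5, 1.0):
    top_id, top_score = blend(vector_scores, keyword_scores, alpha)[0]
    print(f"alpha={alpha}: top result is {top_id} ({top_score:.3f})")
```

With these numbers the top result is `chunk_c` at `alpha=0`, `chunk_b` at `alpha=0.5`, and `chunk_a` at `alpha=1`. A chunk that ranks reasonably well in both lists can win under a blended weighting even though it tops neither list.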

In [3]:
# === SEARCH FUNCTIONS ===
def query_semantic_vector(question: str, method_filter: str, limit: int = 5):
    """Performs a pure vector similarity search in Weaviate."""
    question_vector = embedding_model.encode(question).tolist()
    
    query_filter = wq.Filter.by_property('method').like(method_filter) if method_filter else None

    response = collection.query.near_vector(
        near_vector=question_vector,
        limit=limit,
        filters=query_filter,
        return_metadata=wq.MetadataQuery(distance=True),
        return_properties=['text', 'part', 'chapter', 'section', 'subsection', 'method', 'chunk_id', 'page_number']
    )

    if not response.objects:
        return []

    return [{
        'chunk_id': obj.properties['chunk_id'],
        'text': obj.properties['text'],
        'part': obj.properties.get('part'),
        'chapter': obj.properties.get('chapter'),
        'section': obj.properties.get('section'),
        'subsection': obj.properties.get('subsection'),
        'page_number': obj.properties.get('page_number', 'N/A'),
        'method': obj.properties.get('method'),
        'relevance_score': 1 - obj.metadata.distance,
        'search_method': 'Vector'
    } for obj in response.objects]

def query_semantic_hybrid(question: str, method_filter: str, limit: int = 5, alpha: float = 0.5):
    """Performs a hybrid search in Weaviate, combining vector and keyword methods."""
    question_vector = embedding_model.encode(question).tolist()
    
    query_filter = wq.Filter.by_property('method').like(method_filter) if method_filter else None

    response = collection.query.hybrid(
        query=question,
        vector=question_vector,
        alpha=alpha,
        limit=limit,
        filters=query_filter,
        return_metadata=wq.MetadataQuery(score=True),
        return_properties=['text', 'part', 'chapter', 'section', 'subsection', 'method', 'chunk_id', 'page_number']
    )

    if not response.objects:
        return []

    return [{
        'chunk_id': obj.properties['chunk_id'],
        'text': obj.properties['text'],
        'part': obj.properties.get('part'),
        'chapter': obj.properties.get('chapter'),
        'section': obj.properties.get('section'),
        'subsection': obj.properties.get('subsection'),
        'page_number': obj.properties.get('page_number', 'N/A'),
        'method': obj.properties.get('method'),
        'relevance_score': obj.metadata.score,
        'search_method': 'Hybrid'
    } for obj in response.objects]

def query_semantic_bm25(question: str, method_filter: str, limit: int = 5):
    """Performs a pure keyword (BM25) search in Weaviate."""
    query_filter = wq.Filter.by_property('method').like(method_filter) if method_filter else None

    response = collection.query.bm25(
        query=question,
        limit=limit,
        filters=query_filter,
        return_metadata=wq.MetadataQuery(score=True),
        return_properties=['text', 'part', 'chapter', 'section', 'subsection', 'method', 'chunk_id', 'page_number']
    )

    if not response.objects:
        return []

    return [{
        'chunk_id': obj.properties['chunk_id'],
        'text': obj.properties['text'],
        'part': obj.properties.get('part'),
        'chapter': obj.properties.get('chapter'),
        'section': obj.properties.get('section'),
        'subsection': obj.properties.get('subsection'),
        'page_number': obj.properties.get('page_number', 'N/A'),
        'method': obj.properties.get('method'),
        'relevance_score': obj.metadata.score,
        'search_method': 'BM25'
    } for obj in response.objects]
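
The vector search function reports `relevance_score` as `1 - distance`. Assuming the collection uses cosine distance (which matches the cosine-similarity description above but is not shown in this notebook), that expression recovers the cosine similarity. The following pure-Python sketch illustrates the relationship with made-up two-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # unrelated direction

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    distance = cosine_distance(query, vec)
    print(f"{name}: distance={distance:.3f}, relevance_score={1 - distance:.3f}")
```

Vector length does not matter here, only direction, which is why `same_direction` gets a relevance score of 1 even though it is twice as long as the query.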

---

## **5. Retrieval Evaluation Methodology**

After defining the search functions, we need to objectively measure their performance. This section defines the tools for our evaluation: a function to load our ground truth dataset and the functions that will calculate the performance scores.

### **5.1. Loading the Ground Truth Data**

The `load_ground_truth` function loads our manually created `ground_truth.json` file. This file acts as our "answer key" for the experiments, containing the questions we will ask and the exact text passages we expect to be retrieved for each one.

In [4]:
# === GROUND TRUTH LOADING ===
def load_ground_truth():
    """Loads ground truth data from the dedicated JSON file."""
    with open('../data/evaluation/ground_truth.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    return {
        str(q['id']): {'question': q['question'], 'entries': q['ground_truth']}
        for q in data['questions']
    }

### **5.2. Defining the Performance Metrics**

To get a quantitative score for each retrieval strategy, I've implemented three key metric functions:
* **`calculate_text_coverage`**: This is effectively our **recall** score. It measures what fraction of the unique words in an expected passage also appear in the retrieved chunks.
* **`calculate_content_precision`**: This measures **precision**. Of all the unique words in the retrieved chunks, it calculates what fraction also appear in the expected passages.
* **`calculate_normalized_score`**: This is the main scoring function. It weights each passage's coverage by its relevance level ('Primary' = 5, 'Secondary' = 3, 'Tertiary' = 1), normalizes by the maximum achievable weight, and then combines that weighted coverage with precision into a single **F1-score**.
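
The toy example below shows how word-level overlap behaves when a retrieved chunk contains extra, irrelevant text. The functions are simplified stand-ins for the metric functions in this notebook, and the strings are made up (loosely modeled on the river-name question), not taken from the style guide.

```python
def word_set(text):
    """Lowercases and splits on whitespace, keeping unique words only."""
    return set(text.lower().split())


def coverage(retrieved, expected):
    """Fraction of expected words found in the retrieved text (recall-like)."""
    expected_words = word_set(expected)
    if not expected_words:
        return 0.0
    return len(expected_words & word_set(retrieved)) / len(expected_words)


def precision(retrieved, expected):
    """Fraction of retrieved words that also occur in the expected text."""
    retrieved_words = word_set(retrieved)
    if not retrieved_words:
        return 0.0
    return len(retrieved_words & word_set(expected)) / len(retrieved_words)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


expected = "write rhine for rhein"
focused = "write rhine for rhein and rhin"
diluted = focused + " development areas are written with an initial capital"

for name, chunk in [("focused", focused), ("diluted", diluted)]:
    p, r = precision(chunk, expected), coverage(chunk, expected)
    print(f"{name}: precision={p:.3f} coverage={r:.3f} f1={f1(p, r):.3f}")
# focused: precision=0.667 coverage=1.000 f1=0.800
# diluted: precision=0.286 coverage=1.000 f1=0.444
```

Both chunks contain the full answer, so coverage is identical. Only precision drops for the diluted chunk, and that alone pulls its F1 down.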

In [5]:
# === EVALUATION METRIC FUNCTIONS ===
def calculate_content_precision(retrieved_results, ground_truth_entries):
    """Calculates the percentage of retrieved content that is relevant."""
    if not retrieved_results: return 0.0
    
    all_retrieved_text = ' '.join([result['text'] for result in retrieved_results])
    retrieved_words = set(all_retrieved_text.lower().split())
    if not retrieved_words: return 0.0
    
    all_expected_text = ' '.join([entry['expected_text'] for entry in ground_truth_entries])
    expected_words = set(all_expected_text.lower().split())
    
    relevant_words = retrieved_words & expected_words
    return len(relevant_words) / len(retrieved_words)

def calculate_text_coverage(retrieved_results, expected_text):
    """Calculates the percentage of a single expected text that was retrieved."""
    if not retrieved_results or not expected_text: return 0.0
    
    retrieved_text = ' '.join([result['text'] for result in retrieved_results]).lower().strip()
    expected_text = expected_text.lower().strip()
    
    expected_words = set(expected_text.split())
    retrieved_words = set(retrieved_text.split())
    
    overlap = len(expected_words & retrieved_words)
    coverage = overlap / len(expected_words) if expected_words else 0
    return min(coverage, 1.0)

def calculate_normalized_score(retrieved_results, ground_truth_entries, k=5):
    """Calculates composite evaluation metrics (Coverage, Precision, F1)."""
    relevance_scores = {'Primary': 5, 'Secondary': 3, 'Tertiary': 1}
    coverage_details = []
    total_weighted_coverage = 0
    
    for gt_entry in ground_truth_entries:
        coverage = calculate_text_coverage(retrieved_results, gt_entry['expected_text'])
        relevance_weight = relevance_scores[gt_entry['relevance']]
        weighted_coverage = coverage * relevance_weight
        total_weighted_coverage += weighted_coverage
        
        coverage_details.append({
            'ground_truth_section': gt_entry['section_title'],
            'relevance': gt_entry['relevance'],
            'coverage': coverage,
            'weighted_score': weighted_coverage,
            'max_possible': relevance_weight
        })
    
    max_possible_coverage = sum(relevance_scores[entry['relevance']] for entry in ground_truth_entries)
    coverage_score = total_weighted_coverage / max_possible_coverage if max_possible_coverage > 0 else 0
    content_precision = calculate_content_precision(retrieved_results, ground_truth_entries)
    content_f1 = 2 * (content_precision * coverage_score) / (content_precision + coverage_score) if (content_precision + coverage_score) > 0 else 0
    
    return {
        'coverage_score': coverage_score,
        'content_precision': content_precision,
        'content_f1': content_f1,
        'raw_coverage': total_weighted_coverage,
        'max_possible': max_possible_coverage,
        'coverage_details': coverage_details
    }
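
As a worked example of the relevance weighting, the sketch below recomputes the weighted coverage for a hypothetical question with one passage of each relevance level. The weights (5, 3, 1) come from the code above; the per-passage coverage values and the precision of 0.5 are invented for illustration.

```python
RELEVANCE_WEIGHTS = {"Primary": 5, "Secondary": 3, "Tertiary": 1}


def weighted_coverage(per_entry):
    """per_entry: list of (relevance_label, coverage) pairs for one question."""
    earned = sum(RELEVANCE_WEIGHTS[label] * cov for label, cov in per_entry)
    possible = sum(RELEVANCE_WEIGHTS[label] for label, _ in per_entry)
    return earned / possible if possible else 0.0


def f1(precision, recall):
    """Harmonic mean, returning 0 when both inputs are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# Invented coverage values: Primary fully found, Secondary half found, Tertiary missed.
example = [("Primary", 1.0), ("Secondary", 0.5), ("Tertiary", 0.0)]

score = weighted_coverage(example)          # (5*1.0 + 3*0.5 + 1*0.0) / 9
print(f"weighted coverage: {score:.3f}")    # weighted coverage: 0.722
print(f"F1 at precision 0.5: {f1(0.5, score):.3f}")  # F1 at precision 0.5: 0.591
```

Missing the Tertiary passage entirely costs only 1 of 9 possible points, whereas missing the Primary passage would cost 5.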

### **5.3. Evaluation and Comparison Functions**

With our search methods and evaluation metrics defined, these final two functions execute the experiments and compile the results.

* The **`evaluate_system`** function is the core of the retrieval evaluation pipeline. It takes a specific search function (e.g., `query_semantic_vector`) as input and runs it against every question in our ground truth dataset. For each question, it calculates the performance metrics we defined earlier and returns a detailed dictionary with the scores and the retrieved chunks.
* The **`compare_methods`** function presents the final results. It takes the output from multiple runs of `evaluate_system` and formats it into a clean comparison table, making it easy to see the F1-score for each strategy and the percentage improvement over our baseline.

In [6]:
# === SYSTEM EVALUATION AND COMPARISON FUNCTIONS ===
def evaluate_system(search_function, method_name, ground_truth, k=5):
    """Runs a search function against all ground truth questions and calculates metrics."""
    results = {'method': method_name, 'question_scores': {}, 'total_questions': len(ground_truth)}
    total_f1 = 0

    print(f'Evaluating {method_name}...')
    print('-' * 50)

    for q_id, gt_data in ground_truth.items():
        method_filter = 'section_based*' if 'section' in method_name.lower() else 'fixed_size'
        search_results = search_function(gt_data['question'], method_filter, k)
        metrics = calculate_normalized_score(search_results, gt_data['entries'], k)
        total_f1 += metrics['content_f1']

        results['question_scores'][q_id] = {
            'question': gt_data['question'],
            'metrics': metrics,
            'retrieved_context': [f"{r.get('chapter', '')} > {r.get('section', '')}" for r in search_results],
            'retrieved_chunks': search_results
        }
        print(f"Q{q_id}: {gt_data['question'][:60]}...")
        print(f"  F1: {metrics['content_f1']:.3f}")
    
    results['overall_metrics'] = {'avg_content_f1': total_f1 / len(ground_truth)}
    print(f"\nOverall Average F1: {results['overall_metrics']['avg_content_f1']:.3f}")
    print('=' * 50, '\n')
    return results

def compare_methods(results_list):
    """Displays a comparison table of evaluation results for different methods."""
    print('=== METHOD COMPARISON ===')
    print(f"{'Method':<40} {'F1 Score':<10} {'Improvement'}")
    print('-' * 70)
    
    baseline_f1 = results_list[0]['overall_metrics']['avg_content_f1']
    for i, results in enumerate(results_list):
        f1_score = results['overall_metrics']['avg_content_f1']
        improvement = f'{((f1_score - baseline_f1) / baseline_f1 * 100):+.1f}%' if i > 0 else 'Baseline'
        print(f"{results['method']:<40} {f1_score:<10.3f} {improvement}")
    
    print('-' * 70, '\n')

---

## **6. Evaluation of Chunking Methods**

Now that the evaluation framework is ready, the final step before running the experiments is to load our ground truth dataset. The cell below handles this, loading the questions and the "answer key" that we will use to score our retrieval methods.

In [7]:
# === LOAD GROUND TRUTH DATA ===
print('Loading ground truth data...')
ground_truth = load_ground_truth()
print(f'Loaded {len(ground_truth)} questions with ground truth labels.')
print('='*50)

Loading ground truth data...
Loaded 5 questions with ground truth labels.


With the ground truth data loaded, we can now run our first experiment. The goal is to answer our first key question: **What is the impact of different chunking strategies on retrieval quality?**

The code below evaluates both the `fixed-size` and `section-based` strategies. To ensure a fair comparison, both will be tested using the same **pure vector search** method. This isolates the chunking strategy as the only variable. The `fixed-size` method will serve as our baseline, against which we'll measure any improvement.

In [8]:
# === EXPERIMENT 1: CHUNKING STRATEGY COMPARISON ===
print('\nEXPERIMENT 1: CHUNKING STRATEGY COMPARISON')
print('Testing: Fixed-size vs Section-based chunks (both using Vector search)')
print('='*80)

results_fixed = evaluate_system(query_semantic_vector, 'Fixed-size + Vector', ground_truth)
results_section = evaluate_system(query_semantic_vector, 'Section-based + Vector', ground_truth)

compare_methods([results_fixed, results_section])


EXPERIMENT 1: CHUNKING STRATEGY COMPARISON
Testing: Fixed-size vs Section-based chunks (both using Vector search)
Evaluating Fixed-size + Vector...
--------------------------------------------------
Q1: When is it correct to use square brackets?...
  F1: 0.637
Q2: What’s the proper usage of an en dash?...
  F1: 0.662
Q3: Which spelling is correct: Rhein or Rhin?...
  F1: 0.322
Q4: Is it better to use the symbol '*' or 'x' for multiplication...
  F1: 0.558
Q5: Is there a space before or after suspension points?...
  F1: 0.490

Overall Average F1: 0.534

Evaluating Section-based + Vector...
--------------------------------------------------
Q1: When is it correct to use square brackets?...
  F1: 0.781
Q2: What’s the proper usage of an en dash?...
  F1: 0.705
Q3: Which spelling is correct: Rhein or Rhin?...
  F1: 0.399
Q4: Is it better to use the symbol '*' or 'x' for multiplication...
  F1: 0.868
Q5: Is there a space before or after suspension points?...
  F1: 0.799

Overall Average F1: 0.710

### **6.1. Chunking Strategies: Analysis of Results**

The results are very clear: the **`section-based`** chunking strategy significantly outperforms the `fixed-size` baseline, achieving an average F1-score of **0.710** compared to **0.534**. That is a **33% improvement**.
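
These averages can be checked directly from the per-question scores printed above. The short sketch below does so; `mean` and `relative_improvement` are illustrative helpers, not functions from the notebook.

```python
# Per-question F1 scores copied from the Experiment 1 output above.
fixed_f1 = [0.637, 0.662, 0.322, 0.558, 0.490]
section_f1 = [0.781, 0.705, 0.399, 0.868, 0.799]


def mean(values):
    return sum(values) / len(values)


def relative_improvement(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100


fixed_avg = mean(fixed_f1)
section_avg = mean(section_f1)

print(f"Fixed-size average F1:    {fixed_avg:.3f}")
print(f"Section-based average F1: {section_avg:.3f}")
print(f"Relative improvement:     {relative_improvement(section_avg, fixed_avg):+.1f}%")
```

Because the inputs are already rounded to three decimals, the last digit of the improvement may differ slightly from a figure computed on unrounded scores.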

This outcome suggests that preserving the document's logical structure is critical for retrieval quality. By creating chunks that align with the document's chapters and sections, we generate more contextually complete and relevant units of information. The `fixed-size` approach, while simpler and faster, often retrieves fragments that contain both relevant and irrelevant text, which hurts its precision and overall F1-score.

Based on this result, **we will use the `section-based` chunks for all subsequent experiments** in this project.

---

## **7. Evaluation of Search Methods**

Now that we've confirmed that the `section-based` chunking strategy works better, we can move on to our second key question: **Which retrieval method yields the most relevant information?**

This experiment will test different points along the search spectrum, from pure keyword matching to pure semantic similarity. We will use the winning `section-based` chunks as the data source for all tests. The methods being compared are:
* **Pure BM25**: A traditional keyword search (`alpha=0`).
* **Hybrid Search (Keyword-Leaning)**: A mix that gives more weight to keywords (`alpha=0.3`).
* **Hybrid Search (Vector-Leaning)**: A mix that gives more weight to semantic meaning (`alpha=0.7`).
* **Pure Vector Search**: A pure semantic search (`alpha=1.0`).
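
If more points along the `alpha` spectrum are of interest, the search configurations could be generated rather than written out by hand. The sketch below uses `functools.partial` with a stand-in function, `hybrid_search_stub`, which is hypothetical and only echoes its arguments. It mirrors the `(question, method_filter, limit, alpha)` argument order of the notebook's hybrid search function.

```python
from functools import partial


def hybrid_search_stub(question, method_filter, limit=5, alpha=0.5):
    """Stand-in for a hybrid search function; returns its arguments for inspection."""
    return {"question": question, "filter": method_filter, "limit": limit, "alpha": alpha}


alphas = [0.3, 0.7]

# partial() binds alpha when each config is created, so every entry keeps its own value.
configs = [
    (partial(hybrid_search_stub, alpha=a), f"Hybrid (alpha={a})")
    for a in alphas
]

for search_fn, label in configs:
    result = search_fn("example question", "section_based*", 5)
    print(label, "->", result["alpha"])
```

Binding `alpha` with `partial` avoids a common pitfall of lambdas created inside a loop, where every lambda ends up seeing the loop variable's final value.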

In [9]:
# === EXPERIMENT 2: SEARCH METHOD SPECTRUM ANALYSIS ===
print('\nEXPERIMENT 2: SEARCH METHOD SPECTRUM ANALYSIS')
print('Testing: BM25 → Hybrid → Vector search (all using section-based chunks)')
print('='*80)

search_configs = [
    (query_semantic_bm25, 'Pure BM25', 'Pure keyword search'),
    (lambda q, f, l: query_semantic_hybrid(q, f, l, 0.3), 'Hybrid – keyword-leaning', 'Keyword-focused hybrid'),
    (lambda q, f, l: query_semantic_hybrid(q, f, l, 0.7), 'Hybrid – vector-leaning', 'Vector-focused hybrid'),
    (query_semantic_vector, 'Pure Vector', 'Pure semantic search')
]

results_all_methods = []
for search_func, method_name, _ in search_configs:
    results = evaluate_system(search_func, f'Section-based + {method_name}', ground_truth)
    results_all_methods.append(results)


EXPERIMENT 2: SEARCH METHOD SPECTRUM ANALYSIS
Testing: BM25 → Hybrid → Vector search (all using section-based chunks)
Evaluating Section-based + Pure BM25...
--------------------------------------------------
Q1: When is it correct to use square brackets?...
  F1: 0.805
Q2: What’s the proper usage of an en dash?...
  F1: 0.422
Q3: Which spelling is correct: Rhein or Rhin?...
  F1: 0.150
Q4: Is it better to use the symbol '*' or 'x' for multiplication...
  F1: 0.628
Q5: Is there a space before or after suspension points?...
  F1: 0.192

Overall Average F1: 0.439

Evaluating Section-based + Hybrid – keyword-leaning...
--------------------------------------------------
Q1: When is it correct to use square brackets?...
  F1: 0.805
Q2: What’s the proper usage of an en dash?...
  F1: 0.504
Q3: Which spelling is correct: Rhein or Rhin?...
  F1: 0.602
Q4: Is it better to use the symbol '*' or 'x' for multiplication...
  F1: 0.811
Q5: Is there a space before or after suspension points?...
  F1

In [10]:
# === SEARCH METHOD ANALYSIS AND COMPARISON ===
print('\nSEARCH METHOD ANALYSIS')
print('='*100)
print(f"{'Method':<30} {'Alpha':<6} {'F1 Score':<10} {'Improvement'}")
print('-' * 100)

alpha_values = [0, 0.3, 0.7, 1.0]
search_types = ['Keyword', 'Hybrid – keyword-leaning', 'Hybrid – vector-leaning', 'Semantic']
baseline_f1 = results_all_methods[0]['overall_metrics']['avg_content_f1']

for i, (_, method_name, _) in enumerate(search_configs):
    f1_score = results_all_methods[i]['overall_metrics']['avg_content_f1']
    improvement = f'{((f1_score - baseline_f1) / baseline_f1 * 100):+.1f}%' if i > 0 else 'Baseline'
    print(f'{method_name:<30} {alpha_values[i]:<6} {f1_score:<10.3f} {improvement}')

print('-' * 100)

best_idx = max(range(len(results_all_methods)), key=lambda i: results_all_methods[i]['overall_metrics']['avg_content_f1'])
best_f1 = results_all_methods[best_idx]['overall_metrics']['avg_content_f1']
print(f"\nBest performing method: {search_types[best_idx]} search (α={alpha_values[best_idx]}, F1={best_f1:.3f})")


SEARCH METHOD ANALYSIS
Method                         Alpha  F1 Score   Improvement
----------------------------------------------------------------------------------------------------
Pure BM25                      0      0.439      Baseline
Hybrid – keyword-leaning       0.3    0.649      +47.7%
Hybrid – vector-leaning        0.7    0.733      +66.9%
Pure Vector                    1.0    0.710      +61.7%
----------------------------------------------------------------------------------------------------

Best performing method: Hybrid – vector-leaning search (α=0.7, F1=0.733)


### **7.1. Search Methods: Analysis of Results**

The individual F1-scores for each question show a marked trend. The **Pure BM25** keyword search struggles significantly with questions where the phrasing doesn't exactly match the text in the style guide (e.g., Q3: "Rhein or Rhin" and Q5: "suspension points"), resulting in a low overall F1-score of **0.439**.

As soon as we begin to incorporate vector search, performance improves dramatically. Both hybrid methods and the pure vector search show substantial gains over the keyword-only baseline, confirming that semantic understanding is crucial for this task. 

While all semantic-aware methods performed well, the best results came from the **hybrid search with `alpha=0.7`**. This method, which weights vector similarity at 0.7 and keyword matching at 0.3, achieved the highest F1-score of **0.733**. This represents a substantial **66.9% improvement** over the pure keyword search baseline.

---

## **8. Qualitative Retrieval Analysis**

Quantitative metrics like the F1-score are essential for measuring performance, but they don't tell the whole story. To truly understand *why* our optimized system is better, we need to look at the actual text it retrieves.

This final analysis provides a **qualitative comparison** for two sample questions. We will compare the initial, un-optimized baseline (`Fixed-size + Vector`) against our final, fully optimized system (`Section-based + Hybrid – vector-leaning`). Questions 3 and 4 were chosen as they provide excellent examples of different retrieval challenges.

In [11]:
# === SAMPLE RETRIEVED CHUNKS ANALYSIS ===
print('\nSAMPLE RETRIEVAL ANALYSIS: BASELINE vs OPTIMIZED')
print('='*80)

def display_retrieval_comparison(question_id, baseline_results, optimized_results, max_chars=700):
    """Compares retrieved chunks between baseline and optimized methods."""
    baseline_q = baseline_results['question_scores'][question_id]
    optimized_q = optimized_results['question_scores'][question_id]
    
    print(f"\nQuestion {question_id}: {baseline_q['question']}")
    print(f"Baseline F1: {baseline_q['metrics']['content_f1']:.3f} → Optimized F1: {optimized_q['metrics']['content_f1']:.3f}")
    print('=' * 75)
    
    print('BASELINE (Fixed-size + Vector):')
    for i, chunk in enumerate(baseline_q['retrieved_chunks'], 1):
        print(f"  {i}. Relevance: {chunk['relevance_score']:.3f} | Text: {chunk['text'][:max_chars]}...")
    
    print(f"\nOPTIMIZED (Section-based + {search_types[best_idx]}):")
    for i, chunk in enumerate(optimized_q['retrieved_chunks'], 1):
        context_str = f"{chunk.get('chapter', 'N/A')} > {chunk.get('section', 'N/A')}"
        print(f"  {i}. Context: {context_str} | Relevance: {chunk['relevance_score']:.3f}")
        print(f"     Text: {chunk['text'][:max_chars]}...")
    print('=' * 75)

display_retrieval_comparison('3', results_fixed, results_all_methods[best_idx])
display_retrieval_comparison('4', results_fixed, results_all_methods[best_idx])


SAMPLE RETRIEVAL ANALYSIS: BASELINE vs OPTIMIZED

Question 3: Which spelling is correct: Rhein or Rhin?
Baseline F1: 0.322 → Optimized F1: 0.539
BASELINE (Fixed-size + Vector):
  1. Relevance: 0.664 | Text: useful to add ‘region’ or ‘area’ in such cases), Lüneburger Heide 
♦ Officially designated development areas. Designated development areas are 
mostly derived from names of administrative units or from traditional 
geographical names, often with a defining adjective. Follow the appropriate 
rule above, e.g.: 
Lower Bavaria; the Charentes development area 
The name of the cross-border region Euregio is written with an initial capital 
only. 
5.22. 
Rivers. Use the forms Meuse (Maas only if the context is solely Flanders or the 
Netherlands) and Moselle (Mosel only if the context is solely Germany). Write 
Rhine for Rhein, Rhin, and Rijn, and Rhineland for Rheinland. Also: Oder for 
Odra (Poli...
  2. Relevance: 0.613 | Text: Нн 
n 
Њњ 
- 
- 
nj 
- 
- 
nj 
Оо 
o 
Пп 
p 
 
1  
When pr

### **8.1. Qualitative Analysis: Results**

#### **Question 3 ("Which spelling is correct: Rhein or Rhin?"):**
* **The baseline problem is the chunk, not the retrieval:** The baseline system correctly retrieves the relevant passage about river names as its top result. The issue is the `fixed-size` chunking strategy. The correct answer is buried in a chunk diluted with irrelevant text about development areas, which harms the precision score and brings down the overall F1-score.
* **The optimized system is precise:** The optimized system excels because its top `section-based` chunk for this topic contains *only* the relevant text on "Rivers." This clean, focused chunk is easily found by the hybrid search, resulting in a much higher F1-score (0.539 vs 0.322). This demonstrates that a precise chunking strategy is just as critical as the search algorithm.

#### **Question 4 ("Is it better to use the symbol '*' or 'x' for multiplication?"):**
* **The baseline is noisy and fragmented:** The baseline system again finds the correct information, but it's presented along with unrelated rules. This noise explains why the F1-score is only 0.558.
* **The optimized system is focused:** The optimized system retrieves chunks that are all from the correct chapter ("Abbreviations, symbols and units of measurement") and section ("Mathematical symbols"). The top result is the exact passage answering the question, earning a perfect relevance score of 1.000. This precision leads to a far superior F1-score of 0.868.

---

## **9. Final Summary and Key Findings**

Now that the qualitative analysis is complete, the next cell provides a high-level summary of the entire evaluation. It quantifies the total improvement from our initial baseline to our final optimized system, summarizing the key findings of this notebook.

In [12]:
# === FINAL COMPARISON AND KEY FINDINGS ===
print('\nFINAL COMPARISON: BASELINE vs OPTIMIZED')
print('='*85)

baseline_metrics = results_fixed['overall_metrics']
optimized_metrics = results_all_methods[best_idx]['overall_metrics']
improvement = f"{((optimized_metrics['avg_content_f1'] - baseline_metrics['avg_content_f1']) / baseline_metrics['avg_content_f1'] * 100):+.1f}%"
optimized_name = f'Section-based + {search_types[best_idx]}'

print(f"{'System':<40} {'F1 Score':<9} {'Improvement'}")
print('-' * 85)
print(f"{'Fixed-size + Vector':<40} {baseline_metrics['avg_content_f1']:<9.3f} {'Baseline'}")
print(f'{optimized_name:<40} {optimized_metrics["avg_content_f1"]:<9.3f} {improvement}')
print('-' * 85)

print('\nEXPERIMENTS COMPLETE!')
print('='*85)
print('Key Findings:')
print('1. Chunking Strategy: Section-based chunks improved F1 score over fixed-size')
print(f'2. Search Method: {search_types[best_idx]} (α={alpha_values[best_idx]}) outperformed pure approaches')
print(f'3. Overall Optimization: {improvement} improvement from baseline to optimized system')
print('4. Next Step: Implement RAG with prompt engineering in notebook 04')


FINAL COMPARISON: BASELINE vs OPTIMIZED
System                                   F1 Score  Improvement
-------------------------------------------------------------------------------------
Fixed-size + Vector                      0.534     Baseline
Section-based + Hybrid – vector-leaning  0.733     +37.3%
-------------------------------------------------------------------------------------

EXPERIMENTS COMPLETE!
Key Findings:
1. Chunking Strategy: Section-based chunks improved F1 score over fixed-size
2. Search Method: Hybrid – vector-leaning (α=0.7) outperformed pure approaches
3. Overall Optimization: +37.3% improvement from baseline to optimized system
4. Next Step: Implement RAG with prompt engineering in notebook 04


---

By moving from a simple `Fixed-size + Vector` approach to a more sophisticated `Section-based + Hybrid` system, we achieved a significant performance boost. The final F1-score of **0.733** represents a **+37.3% improvement** over our initial baseline. 

Here are the two main findings of our experiments:
1.  **Chunking Strategy Matters:** `Section-based` chunks, which preserve the document's logical structure, are considerably more effective than simple `fixed-size` chunks.
2.  **Hybrid Search is Superior:** A `Hybrid – vector-leaning` search (`α=0.7`) outperforms both pure keyword and pure vector approaches, blending the benefits of both.

As a final step, the full results of this notebook are exported to a JSON file, with the export time recorded in its `timestamp` field, before the connection to Weaviate is closed. With these data-driven decisions made, we are now ready to build the final RAG pipeline in the next notebook.
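
Note that the export cell writes to a fixed filename, so each run overwrites the previous file. If one file per run is preferred, the filename itself could carry the timestamp. The sketch below shows one way to do that; `timestamped_path` is a hypothetical helper, and the demo writes to a temporary directory rather than the project's results folder.

```python
import json
import os
import tempfile
from datetime import datetime


def timestamped_path(directory, stem, now=None):
    """Builds '<directory>/<stem>_YYYYMMDD_HHMMSS.json' from the given or current time."""
    now = now or datetime.now()
    return os.path.join(directory, f"{stem}_{now.strftime('%Y%m%d_%H%M%S')}.json")


# Demo with a fixed datetime so the resulting name is predictable.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = timestamped_path(tmp_dir, "retrieval_evaluation_results", datetime(2024, 1, 2, 3, 4, 5))
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"example": True}, f, indent=2)
    print(os.path.basename(path))  # retrieval_evaluation_results_20240102_030405.json
```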

In [13]:
# === EXPORT RESULTS AND CLOSE CONNECTION ===
def export_retrieval_analysis():
    """Exports complete experimental results to a JSON file."""
    analysis = {
        'timestamp': datetime.now().isoformat(),
        'notebook': '03_chunking_and_retrieval_evaluation',
        'experiments': {
            'chunking_comparison': {
                'fixed_size': results_fixed,
                'section_based': results_section
            },
            'search_method_comparison': {
                'methods': results_all_methods,
                'best_method_idx': best_idx,
                'best_method_name': search_types[best_idx]
            }
        },
        'ground_truth_summary': {
            'total_questions': len(ground_truth),
            'questions': {q_id: data['question'] for q_id, data in ground_truth.items()}
        }
    }
    
    os.makedirs('../results', exist_ok=True)
    with open('../results/retrieval_evaluation_results.json', 'w', encoding='utf-8') as f:
        json.dump(analysis, f, indent=2, ensure_ascii=False)
    
    print('Results exported to: ../results/retrieval_evaluation_results.json')

export_retrieval_analysis()

client.close()
print('Weaviate client connection closed.')

Results exported to: ../results/retrieval_evaluation_results.json
Weaviate client connection closed.
