# <center>Panongbene Sawadogo</center>

📩 **Contact** : amet1900@gmail.com

🌐 **Linkedin** : https://www.linkedin.com/in/panongbene-jean-mohamed-sawadogo-33234a168/

🗓️ **Dernière modification** : 10 August 2025

# <center>Interview  Blue Bridge Group </center>

# Challenge: Context-Aware QA Under Token Constraints

## Objective

Build a question-answering system that selects and presents the most relevant context to a local
language model under a strict context token limit. The LLM can only “see” a small slice of the
knowledge base, so your system must make smart decisions about what to show.

## Provided Materials
You will receive:
- A fictional knowledge base (docs) consisting of 10 short .md files
  - Domain: sci-fi technical manual
- A file containing 5 natural-language questions (questions.json)
- A token budget (max 1024 tokens of context per question)
## Your Tasks
- 1. Chunking
Split the documents into meaningful text chunks and design a way to retrieve and score them while
staying within budget.
- 2. Answer Generation
You may use a local small LLM (1.7B-2B).
- 3. Evaluation &amp; Justification
For each question, choose what to show and how to justify why your method works well or could fail.
You may elaborate on what may or may have been helpful to facilitate evaluation.
## Deliverable
- main.py or notebook.ipynb - to be delivered at least 1 day before interview
- a powerpoint to explain your approach, results, model choices, difficulties, areas of improvement
etc.

# Libraries

In [1]:
#!pip install numpy
#!pip install pandas
#!pip install matplotlib
#!pip install bitsandbytes
#!pip install -U FlagEmbedding
#!pip install --upgrade transformers

In [2]:
import os
import re
import sys
import json
import torch
import threading
import numpy as np
import pandas as pd
from typing import List
from rich.panel import Panel
from rich.syntax import Syntax
import matplotlib.pyplot as plt
from rich.console import Console
from FlagEmbedding import BGEM3FlagModel
from IPython.display import Markdown, display
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, TextIteratorStreamer, BitsAndBytesConfig

## Useful functions

In [3]:
def load_all_md_files(directory):
    """
    """
    md_files = [f for f in os.listdir(directory) if f.endswith('.md')]
    docs = {}
    for filename in md_files:
        path = os.path.join(directory, filename)
        with open(path, 'r', encoding='windows-1252') as f:
            docs[filename] = f.read()
    return docs

In [4]:
def chunk_bullet_points(text: str) -> List[str]:
    """
    """
    lines = text.strip().split('\n')
    chunks = []

    current_chunk = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("*") or line.startswith("#") or line.startswith("-"):
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = line
        else:
            current_chunk += " " + line
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

In [5]:
def get_embedding(list_chunk, model):
    """
    """
    output = model.encode(list_chunk, return_dense=True, return_sparse=True, return_colbert_vecs=True)
    return output

## Loading data

In [6]:
#Documents
all_docs = load_all_md_files("docs")
print(len(all_docs))

10


In [7]:
display(Markdown(list(all_docs.values())[4]))

# Notes and Logs

* Log Entry 2371: AI subsystem restarted unexpectedly at 03:14 UTC.
* Note from Cmdr. Morales: Override protocols reviewed on 2025-06-01.
* Firmware patch 7-F applied to fix outbound packet leak.
* Dr. Doss's memo on ethical AI constraints pending approval.
* Maintenance window scheduled for 2025-07-20.

*This file serves no operational purpose and is for reference only.*



In [8]:
chunk_bullet_points(list(all_docs.values())[4])

['# Notes and Logs',
 '* Log Entry 2371: AI subsystem restarted unexpectedly at 03:14 UTC.',
 '* Note from Cmdr. Morales: Override protocols reviewed on 2025-06-01.',
 '* Firmware patch 7-F applied to fix outbound packet leak.',
 "* Dr. Doss's memo on ethical AI constraints pending approval.",
 '* Maintenance window scheduled for 2025-07-20.',
 '*This file serves no operational purpose and is for reference only.*']

In [9]:
#Questions
with open("docs/questions.json", "r", encoding="utf-8") as f:
    questions = json.load(f)

In [10]:
questions

{'questions': [{'id': 'q1',
   'question': 'If the QRC fails during boot, what is the expected system behavior and recommended recovery steps?'},
  {'id': 'q2',
   'question': "Explain the ambiguity around the term 'Safe Mode' and how it might affect operator decisions."},
  {'id': 'q3',
   'question': 'What contradictions exist in network policies regarding AI outbound communication?'},
  {'id': 'q4',
   'question': 'Describe the procedure to manually restore communications modules if Isolation Mode was triggered.'},
  {'id': 'q5',
   'question': 'Who is authorized to override network restrictions, and what conflicts might arise?'}]}

## Loading LLM Model Answer

<span style='color: green;font-weight: bold;'>I used the Qwen3-1.7B model, a small-sized open-source reasoning model that meets the project’s constraints.</span>

In [11]:
model_name = "Qwen/Qwen3-1.7B"
model_name = "Qwen/Qwen3-0.6B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Loading Embedding Rag Model

<span style='color: green;font-weight: bold;'>
The BGEM3FlagModel("BAAI/bge-m3") is a multilingual embedding model developed by BAAI, optimized for **retrieval, reranking, and semantic similarity tasks ...
<span>

In [12]:
model_BGEM3 = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True) 

Fetching 30 files:   0%|          | 0/30 [00:00<?, ?it/s]

## Create embedding data base (docs) 

<span style='color: green;font-weight: bold;'>For this project, I did not use vector databases such as ChromaDB, FAISS, or Elasticsearch, as the dataset’s modest size made their implementation unnecessary. This also avoids heavy dependencies and keeps the environment simple and reproducible. The goal is to focus on the core principles of embeddings without added technical complexity. For larger datasets, these tools could be integrated: ChromaDB for fast vector management, FAISS for high-performance approximate searches, and Elasticsearch for combined text and vector search.<span>

In [13]:
%%time
chunks_database = []
for one_docs in all_docs:
    chunks_database += chunk_bullet_points(all_docs[one_docs])

CPU times: user 134 μs, sys: 27 μs, total: 161 μs
Wall time: 55.1 μs


In [14]:
print('Numbers chunks : ', len(chunks_database))

Numbers chunks :  61


In [15]:
%%time
#create embedding database
database_embedding = get_embedding(chunks_database, model_BGEM3)

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


CPU times: user 776 ms, sys: 1.01 s, total: 1.79 s
Wall time: 7.5 s


# Retrieval, prompt generation

<span style='color: green;font-weight: bold;'>
For each question, an embedding vector is generated and then used to measure similarity with the vectors stored in the database.</span></br>

In [16]:
question = questions['questions'][0]['question']
print(question)

If the QRC fails during boot, what is the expected system behavior and recommended recovery steps?


In [17]:
%%time
#create embedding question
question_embedding = get_embedding([question], model_BGEM3)

CPU times: user 173 ms, sys: 29.7 ms, total: 203 ms
Wall time: 215 ms


<span style='color: green;font-weight: bold;'>ColBERT (Contextualized Late Interaction): an advanced technique where similarity is computed token-by-token and then aggregated</span></br>
<span style='color: green;font-weight: bold;'>
We use this distance to calculate which chunks are closest to the question asked.
</span>

In [26]:
database_embedding

{'dense_vecs': array([[-0.0738   ,  0.003437 , -0.007114 , ..., -0.0004618,  0.005165 ,
          0.0346   ],
        [-0.04794  , -0.01212  , -0.045    , ...,  0.002329 , -0.0244   ,
         -0.02559  ],
        [-0.01439  ,  0.007076 , -0.03156  , ..., -0.007336 , -0.015495 ,
         -0.02563  ],
        ...,
        [-0.05194  ,  0.02383  , -0.04108  , ...,  0.001542 , -0.00458  ,
          0.01439  ],
        [ 0.002794 , -0.04257  ,  0.02838  , ...,  0.00799  ,  0.01129  ,
         -0.003778 ],
        [-0.04895  , -0.00334  , -0.0266   , ..., -0.013794 ,  0.0217   ,
         -0.00797  ]], dtype=float16),
 'lexical_weights': [defaultdict(int,
              {'468': 0.0953, '180282': 0.3083, '92669': 0.305}),
  defaultdict(int,
              {'20': 0.11127,
               '313': 0.05377,
               '39': 0.0885,
               '7569': 0.1461,
               '5': 0.07855,
               '567': 0.1505,
               '979': 0.2338,
               '124984': 0.2289,
              

In [27]:
question_embedding['colbert_vecs']

[array([[-0.0034515 ,  0.0318363 , -0.0365462 , ...,  0.03245134,
          0.0416931 ,  0.02675415],
        [-0.01022364,  0.01011241, -0.03062457, ..., -0.01656359,
          0.01760171,  0.04471335],
        [-0.02898744,  0.02025042, -0.03119914, ...,  0.02580321,
          0.05744155,  0.03717545],
        ...,
        [-0.0027031 ,  0.04539758, -0.05225852, ..., -0.00102894,
          0.04462814, -0.00771054],
        [-0.01682237,  0.01473343, -0.0274376 , ...,  0.01036797,
          0.05187395,  0.0168309 ],
        [ 0.00715364,  0.01498455, -0.04339351, ...,  0.0397411 ,
          0.01798388, -0.00816349]], dtype=float32)]

In [28]:
help(model_BGEM3.colbert_score)

Help on method colbert_score in module FlagEmbedding.inference.embedder.encoder_only.m3:

colbert_score(q_reps, p_reps) method of FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder instance
    Compute colbert scores of input queries and passages.
    
    Args:
        q_reps (np.ndarray): Multi-vector embeddings for queries.
        p_reps (np.ndarray): Multi-vector embeddings for passages/corpus.
    
    Returns:
        torch.Tensor: Computed colbert scores.



In [18]:
%%time
results_retrieved_embedding = dict()
for index_database in range(len(database_embedding['colbert_vecs'])):
    results_retrieved_embedding[index_database] =  float(model_BGEM3.colbert_score(question_embedding['colbert_vecs'][0], database_embedding['colbert_vecs'][index_database]))
results_retrieved_embedding = dict(sorted(results_retrieved_embedding.items(), key=lambda item: item[1], reverse=True))

CPU times: user 35 ms, sys: 19.4 ms, total: 54.3 ms
Wall time: 14.5 ms


In [19]:
dict(list(results_retrieved_embedding.items())[:10])  

{28: 0.6413642764091492,
 20: 0.6169828772544861,
 14: 0.5619165301322937,
 19: 0.5405619144439697,
 32: 0.46172624826431274,
 17: 0.4208328127861023,
 12: 0.41692787408828735,
 0: 0.41197484731674194,
 24: 0.40662792325019836,
 47: 0.40173280239105225}

<span style='color: green;font-weight: bold;'>We use the two chunks closest to the asked question (to avoid exceeding the token budget). Then, we answer the question by providing these chunks to the LLM as context, in addition to the question itself.</span>

In [20]:
context_llm = '''Contexte : __CONTEXT__
Consigne : En te basant uniquement sur le contexte ci-dessus, réponds à la question suivante de manière fluide et naturelle, comme si tu connaissais le sujet sans devoir citer ou rappeler le contexte.
Question : __QUESTION__
'''
context_llm = context_llm.replace('__CONTEXT__', 
                                chunks_database[list(results_retrieved_embedding.keys())[0]]+'\n'+
                                chunks_database[list(results_retrieved_embedding.keys())[1]]) ### Take the two best
context_llm = context_llm.replace('__QUESTION__', question)

In [21]:
context_llm

'Contexte : # Emergency Startup Procedure (Full) This procedure should be followed strictly to minimize downtime: 1. Manually cut power to Zones 1 and 4 circuit breakers. 2. Activate SafeBoot mode via the Maintenance Shell interface. 3. Manually execute `/opt/zshell/scripts/init_ai_safe.py` to initialize AI diagnostics. 4. Inspect QRC diagnostic output with `qrc_diag --trace`. 5. If diagnostics fail, disable AI Layer autoload completely and reboot with minimal profile. 6. After successful boot, escalate privileges to Commander level to verify override routes. 7. Manually restore communications modules if Isolation Mode was triggered. 8. Re-enable AI Layer only after successful network sync.\n# Boot Sequence ZentroSoft initializes modules in this order: 1. Core Services 2. Sensor & Navigation Systems 3. AI Layer 4. User Interface Layer 5. External Network Bridge If the QRC fails, the boot halts at phase 3. In v4.1, SafeBoot skips the AI Layer, but this causes errors in some edge cases w

In [22]:
messages = [
    {"role": "system", "content": "You are a Question-Answering Expert, created by Panongbene to provide accurate, clear, and well-structured answers to user queries."},
    {"role": "user", "content": context_llm}
]

In [23]:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

num_input_tokens = len(model_inputs["input_ids"][0])
print(f"Number of tokens in the input : {num_input_tokens}")


streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(model_inputs, streamer=streamer, max_new_tokens=123000, repetition_penalty=1.2)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs, daemon=True)
thread.start()

Number of tokens in the input : 382


In [24]:
#### Streaming
generated_text = ""
for token in streamer:
    generated_text += token
    sys.stdout.write(token)
    sys.stdout.flush()
thread.join()

<think>
Okay, let's tackle this question about the system behavior when the QRC fails during boot based on the provided context.

First, I need to recall the information given. The context mentions that if QRC fails, the boot halts at phase 3. Then it adds a note saying "In v4.1, SafeBoot skips the AI Layer..." which might cause issues in certain edge cases. However, there's also a part stating that some legacy configurations require AI Layer to start in all safe modes, conflicting with official documentation. 

The user is asking for the expected system behavior and recommended recovery steps when QRC fails. From the text, the main point seems to be that failing QRC stops the boot process at step 3. So the expected behavior would be halt here. For recovery steps, they mention disabling AI Layer autoloader and rebooting with minimal profile as a backup option. Also, restoring communication modules if isolation mode was triggered and re-enabling AI Layer post-sync are mentioned too. The

In [25]:
console = Console()

if "</think>" in generated_text:
    thinking, content = generated_text.split("</think>", 1)
    thinking = thinking.replace("<think>", "").strip()
    content = content.strip()
else:
    thinking = ""

# b) split off the python code
code_start = content.find("```python")
if code_start != -1:
    code_end = content.find("```", code_start + 1)
    python_code = content[code_start + 9:code_end].strip()
    insight_text = content[:code_start].strip()
else:
    python_code = ""
    insight_text = content

python_code = python_code.replace("`", "")

# 5) Render panels
console.print(Panel(thinking, title="Reasoning", style="yellow"))
console.print(Panel(insight_text, title="Response", style="green"))

## <span style='color: green;font-weight: bold;'>Description of the methodology</span>

The approach used in this technique consists of four steps:

1. **Splitting the knowledge base into chunks:**
   The database (docs) is segmented into chunks based on sentences. Each chunk corresponds to a single sentence whose size does not exceed 500 tokens.

2. **Transforming chunks into vector representations:**
   Each chunk is converted into an embedding, a dense vector, using the **BGEM3FlagModel("BAAI/bge-m3")**, specialized in this type of task.

3. **Transforming the question into an embedding and calculating similarities:**
   When a question is asked, it is transformed into an embedding using the same model as for the chunks. Then, the distance between the question embedding and the chunk embeddings is calculated using the **model\_BGEM3.colbert\_score** function to identify the closest chunks.

4. **Selecting chunks and generating the answer:**
   The two chunks with embeddings closest to the question embedding are selected, combined with the question, and provided to the LLM to generate the answer.


## <span style='color: green;font-weight: bold;'>Why is this chunking relevant?</span>

This chunking method is highly relevant for precise information retrieval. Indeed, each chunk contains a single unit of information that can be easily exploited during retrieval.

For questions about a date, a patch, a restart, or an ethical constraint, each chunk is short enough to be injected individually into the context.

This approach enables efficient and fast scoring in both vector and lexical retrieval.

## <span style='color: green;font-weight: bold;'>What happens when you ask a question whose answer requires a global analysis of a document?</span>

**When a question requires comprehensive document analysis to answer, our designed model may fail to provide satisfactory responses. This occurs because the chunks we use offer only partial document visibility, limiting the available context for the LLM and potentially leading to irrelevant answers.**

**To address this issue,** we can enhance our knowledge base by adding a one-sentence summary for each document. This summary will serve as a chunk representing the document's overview, enabling the model to access global context and improve response relevance.

**This approach offers several advantages:**

* It provides concise global context, enhancing the model's overall document understanding  
* It helps guide the search for most relevant chunks by combining summaries with detailed content  
* It remains token-efficient, which is crucial for respecting the model's token budget constraints  

**For implementation,** automated summarization models can generate these synthetic sentences. The summaries can then be integrated into the vector search index or directly added to the chunk database, ensuring better contextual coverage during queries.

## <span style='color: green;font-weight: bold;'>Common problems with this type of segmentation?</span>

**High indexing volume:**
Chunking the text into very small pieces, such as individual sentences, leads to a significant increase in the number of chunks to be indexed. This means that the vector or lexical database must manage a much larger set of elements to store and compare. As a result, search and scoring operations become more computationally expensive and resource-intensive, which can degrade the overall system performance, especially at scale.

**Increased sensitivity to noise:**
Individual sentences may contain noise, such as incomplete thoughts, typos, ambiguous phrasing, or out-of-context information. This noise makes it harder for the Retrieval-Augmented Generation (RAG) system to extract relevant information when relying solely on these small fragments. Excessive fragmentation can therefore harm the quality of generated answers, as the model may be misled by partial or unclear information.

In [31]:
import string


In [48]:
def patterne(n):
    """
    """

    result = []

    alphabet = list(string.ascii_lowercase)
    lower_values = alphabet[0:n]
    #result = [[lower_values[one].upper()+one_ for one_ in lower_values] for one in range(n)]
    result = [lower_values for one in range(n)]
    return result

In [49]:
patterne(3)

[['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]

In [52]:
" ".join(['Aa', 'Ab', 'Ac'])

'Aa Ab Ac'

In [None]:
#Aa Ab Ac ... Az
#Ba Bb Bc ... Bz
Ca Cb Cc ... Cz
...
Za Zb Zc ... Zz

In [None]:
n = 2
Aa Ab Ac
Ba Bb

Ca C

In [None]:
a b c
a b c
a b c
a b c

