In [1]:
# Lab 1.6 – Retrieval-Augmented Generation (RAG) Extension
### Master Thesis: *Subject Matter Language Recognition Using Training*

# In this notebook, I extend my end-to-end AI system by integrating a Retrieval-Augmented Generation (RAG) component.

# The goal is to:
# - Generate a dataset of at least 1,000 (X, y) examples related to domain-specific Lithuanian speech transcription.
# - Store all examples in a ChromaDB vector database using embeddings for semantic similarity search.
# - For any new input X, automatically retrieve the 3 most similar examples.
# - Use these retrieved examples to augment the prompt for the Gemini model.
# - Demonstrate the full pipeline: **Input X → Retrieval → Augmented Prompt → Output y**.

In [2]:
!pip install -q "google-generativeai>=0.7.0" chromadb sentence-transformers

In [3]:
from google.colab import userdata
import google.generativeai as genai

GEMINI_KEY = userdata.get("GEMINI_KEY")
genai.configure(api_key=GEMINI_KEY)

model = genai.GenerativeModel("gemini-2.5-flash")
print("Gemini ready.")

Gemini ready.


In [4]:
import chromadb
from sentence_transformers import SentenceTransformer

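# Sentence-transformer model for semantic embeddings. all-MiniLM-L6-v2 is trained mainly on English text, so similarity between Lithuanian sentences is only approximate.
# Note: st_model is not passed to the collection below, so ChromaDB embeds documents with its own default embedding function.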
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

client = chromadb.Client()

# if the collection already exists, fetch it; otherwise create it
collection = client.get_or_create_collection(name="asr_examples")

print("ChromaDB collection ready.")





ChromaDB collection ready.
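
## As a rough illustration of what a vector-store query does, the sketch below ranks a handful of stored vectors by cosine similarity to a query vector and returns the closest ids.
## The 3-dimensional vectors, the ids, and the helper names are hypothetical; ChromaDB uses an approximate index and its configured distance metric may differ from cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=3):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
toy_store = [
    ("medical_1", [0.9, 0.1, 0.0]),
    ("medical_2", [0.8, 0.2, 0.1]),
    ("it_1", [0.1, 0.9, 0.2]),
    ("it_2", [0.0, 0.8, 0.3]),
]
toy_query = [0.85, 0.15, 0.05]

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```

## The query vector points mostly along the first axis, so the two "medical" vectors rank ahead of the "it" vectors.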


In [5]:
## Generating (X, y) examples

## In a real system, these (X, y) pairs would come from real audio recordings and their transcriptions.

## For this lab, I generate synthetic but domain-related examples using several templates:
## - Lithuanian medical dictation
## - IT / information system events
## - Short domain-specific sentences

## This still satisfies the requirement of having at least 1,000 examples for retrieval-based similarity.

In [6]:
import random

seed_templates = [
    ("Audio: \"Pacientas skundžiasi {symptom} jau {duration}.\"",
     "Pacientas skundžiasi {symptom} jau {duration}."),
    ("Audio: \"Atlikta {procedure} dėl įtariamos {condition}.\"",
     "Atlikta {procedure} dėl įtariamos {condition}."),
    ("Audio: \"Pacientui paskirta {treatment} terapija.\"",
     "Pacientui paskirta {treatment} terapija."),
    ("Audio: \"Duomenų bazės serveris {server_action}.\"",
     "Duomenų bazės serveris {server_action}."),
    ("Audio: \"Įvykis užregistruotas informacinėje sistemoje: {event}.\"",
     "Įvykis užregistruotas informacinėje sistemoje: {event}.")
]

symptoms   = ["krūtinės skausmu", "dusuliu", "galvos skausmais", "pykinimu", "silpnumu"]
durations  = ["kelias dienas", "savaitę", "dvi dienas", "kelis mėnesius"]
procedures = ["kompiuterinė tomografija", "echokardiograma", "magnetinio rezonanso tomografija"]
conditions = ["širdies funkcijos sutrikimo", "insulto", "naviko", "plaučių embolijos"]
treatments = ["intraveninė", "antibiotikų", "antikoaguliantų", "analgetikų"]
server_actions = ["sėkmingai paleistas", "sustabdytas planiniam atnaujinimui", "patyrė klaidą", "atkurtas po gedimo"]
events = ["naudotojo prisijungimas", "sistemos klaida", "naujo įrašo sukūrimas", "duomenų atnaujinimas"]

examples = []
num_examples = 1200   # >1000 to be safe

for i in range(num_examples):
    template_x, template_y = random.choice(seed_templates)
    data = {
        "symptom": random.choice(symptoms),
        "duration": random.choice(durations),
        "procedure": random.choice(procedures),
        "condition": random.choice(conditions),
        "treatment": random.choice(treatments),
        "server_action": random.choice(server_actions),
        "event": random.choice(events),
    }
    x = template_x.format(**data)
    y = template_y.format(**data)
    examples.append({"id": f"ex_{i:04d}", "x": x, "y": y})

len(examples)

1200
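
## The 1,200 generated rows are random draws from a small template space, so it is worth checking how many distinct sentences the templates can produce.
## The sketch below copies the target-side templates and the vocabulary sizes from the generation cell and multiplies the sizes of the slots each template actually uses. The names `slot_sizes` and `count_distinct` are illustrative.

```python
from string import Formatter

# Target-side templates copied from the generation cell above.
templates = [
    "Pacientas skundžiasi {symptom} jau {duration}.",
    "Atlikta {procedure} dėl įtariamos {condition}.",
    "Pacientui paskirta {treatment} terapija.",
    "Duomenų bazės serveris {server_action}.",
    "Įvykis užregistruotas informacinėje sistemoje: {event}.",
]

# Number of options in each vocabulary list from the generation cell.
slot_sizes = {
    "symptom": 5, "duration": 4, "procedure": 3, "condition": 4,
    "treatment": 4, "server_action": 4, "event": 4,
}

def count_distinct(template):
    """Multiply the vocabulary sizes of the slots that appear in the template."""
    fields = [name for _, name, _, _ in Formatter().parse(template) if name]
    total = 1
    for name in fields:
        total *= slot_sizes[name]
    return total

per_template = [count_distinct(t) for t in templates]
distinct_total = sum(per_template)
print(per_template, distinct_total)
```

## With these vocabularies the total is 44 distinct sentences, so 1,200 draws necessarily contain many exact duplicates. Enlarging the vocabularies or adding templates would be needed to reach 1,000 distinct examples.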

In [7]:
ids = [e["id"] for e in examples]
docs = [e["x"] for e in examples]
metas = [{"y": e["y"]} for e in examples]

collection.add(
    ids=ids,
    documents=docs,
    metadatas=metas
)

print("Inserted", len(ids), "examples into ChromaDB.")

Inserted 1200 examples into ChromaDB.


In [8]:
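def retrieve_similar_examples(x_query, k=3):
    """Return the k stored (x, y) pairs closest to x_query in embedding space."""
    result = collection.query(
        query_texts=[x_query],
        n_results=k
    )
    # query() returns one result list per query text; a single query was sent, so take index 0
    docs = result["documents"][0]
    metas = result["metadatas"][0]
    return list(zip(docs, [m["y"] for m in metas]))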

# quick test
test_x = 'Audio: "Pacientas skundžiasi krūtinės skausmu jau kelias dienas."'
retrieved = retrieve_similar_examples(test_x, k=3)
for i, (rx, ry) in enumerate(retrieved, 1):
    print(f"\nExample {i}:")
    print("X:", rx)
    print("Y:", ry)


Example 1:
X: Audio: "Pacientas skundžiasi krūtinės skausmu jau kelias dienas."
Y: Pacientas skundžiasi krūtinės skausmu jau kelias dienas.

Example 2:
X: Audio: "Pacientas skundžiasi krūtinės skausmu jau kelias dienas."
Y: Pacientas skundžiasi krūtinės skausmu jau kelias dienas.

Example 3:
X: Audio: "Pacientas skundžiasi krūtinės skausmu jau kelias dienas."
Y: Pacientas skundžiasi krūtinės skausmu jau kelias dienas.
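
## All three retrieved examples in the output above are identical, so they give the model one example repeated three times.
## A minimal sketch of one possible remedy: fetch more results than needed and keep the first k with distinct X, preserving the retrieval order. `dedupe_pairs` and the sample data are hypothetical.

```python
def dedupe_pairs(pairs, k=3):
    """Keep the first k (x, y) pairs with distinct x, preserving their order."""
    seen = set()
    unique = []
    for x, y in pairs:
        if x in seen:
            continue  # exact duplicate of an example already kept
        seen.add(x)
        unique.append((x, y))
        if len(unique) == k:
            break
    return unique

# Hypothetical over-fetched result list, already sorted by similarity.
fetched = [
    ('Audio: "A."', "A."),
    ('Audio: "A."', "A."),
    ('Audio: "B."', "B."),
    ('Audio: "A."', "A."),
    ('Audio: "C."', "C."),
    ('Audio: "D."', "D."),
]

distinct = dedupe_pairs(fetched, k=3)
print(distinct)
```

## In this notebook the helper could wrap an over-fetching call such as `dedupe_pairs(retrieve_similar_examples(x, k=30), k=3)`. Removing duplicates before inserting into the collection would be an alternative.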


In [9]:
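def build_rag_prompt(x_new):
    """Build a few-shot prompt from the 3 nearest stored examples; return (prompt, retrieved pairs)."""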
    retrieved = retrieve_similar_examples(x_new, k=3)

    # Start prompt
    prompt = (
        "You are an AI system that transcribes Lithuanian expert dictation.\n\n"
        "Below are some similar past examples (X, y) from a vector database.\n"
        "Use them as in-context guidance to generate a consistent and accurate transcription.\n\n"
    )

    # Add retrieved examples
    for i, (rx, ry) in enumerate(retrieved, 1):
        prompt += f"Example {i}:\nX: {rx}\nY: {ry}\n\n"

    # Add new input X
    prompt += (
        "Now transcribe the new input.\n\n"
        f"New input X:\n{x_new}\n\n"
        "Return JSON:\n"
        '{ "transcription": "..." }\n'
    )

    return prompt, retrieved


In [10]:
## RAG Pipeline Demonstration

## We now demonstrate the complete RAG-enabled pipeline:

## 1. Take a new input X (a Lithuanian domain-specific sentence).
## 2. Retrieve the 3 most similar examples from the ChromaDB vector database.
## 3. Construct an augmented prompt including those examples.
## 4. Call Gemini with the augmented prompt.
## 5. Observe the generated output y.

## The goal is to show how the retrieved examples are added to the prompt as in-context guidance for the model.

In [11]:
# New input X for demo
new_x = 'Audio: "Atsarginė duomenų kopija sėkmingai sukurta ir patikrinta."'

rag_prompt, retrieved_examples = build_rag_prompt(new_x)

print("=== Retrieved examples ===")
for i, (rx, ry) in enumerate(retrieved_examples, 1):
    print(f"\nExample {i}")
    print("X:", rx)
    print("Y:", ry)

print("\n\n=== RAG Prompt sent to Gemini ===\n")
print(rag_prompt)

response_rag = model.generate_content(rag_prompt)

print("\n\n=== Gemini RAG Output ===")
print(response_rag.text)

=== Retrieved examples ===

Example 1
X: Audio: "Duomenų bazės serveris sustabdytas planiniam atnaujinimui."
Y: Duomenų bazės serveris sustabdytas planiniam atnaujinimui.

Example 2
X: Audio: "Duomenų bazės serveris sustabdytas planiniam atnaujinimui."
Y: Duomenų bazės serveris sustabdytas planiniam atnaujinimui.

Example 3
X: Audio: "Duomenų bazės serveris sustabdytas planiniam atnaujinimui."
Y: Duomenų bazės serveris sustabdytas planiniam atnaujinimui.


=== RAG Prompt sent to Gemini ===

You are an AI system that transcribes Lithuanian expert dictation.

Below are some similar past examples (X, y) from a vector database.
Use them as in-context guidance to generate a consistent and accurate transcription.

Example 1:
X: Audio: "Duomenų bazės serveris sustabdytas planiniam atnaujinimui."
Y: Duomenų bazės serveris sustabdytas planiniam atnaujinimui.

Example 2:
X: Audio: "Duomenų bazės serveris sustabdytas planiniam atnaujinimui."
Y: Duomenų bazės serveris sustabdytas planiniam atnauji

In [12]:
# Plain zero-shot prompt without retrieval
plain_prompt = f"""
You are an AI system that transcribes Lithuanian expert dictation.

New input X:
{new_x}

Return JSON:
{{ "transcription": "..." }}
"""

response_plain = model.generate_content(plain_prompt)

print("=== Zero-shot output (no RAG) ===")
print(response_plain.text)
print("\n=== RAG output ===")
print(response_rag.text)

=== Zero-shot output (no RAG) ===
```json
{
  "transcription": "Atsarginė duomenų kopija sėkmingai sukurta ir patikrinta."
}
```

=== RAG output ===
```json
{
  "transcription": "Atsarginė duomenų kopija sėkmingai sukurta ir patikrinta."
}
```
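
## Both replies above wrap the JSON in a Markdown code fence, so passing `response.text` straight to `json.loads` would fail.
## The sketch below strips an optional fence before parsing. `parse_transcription` is a hypothetical helper, and it assumes the reply contains a single JSON object with a "transcription" key.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_transcription(raw_text):
    """Return the "transcription" value from a reply that may be fenced."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with its optional language tag) ...
        text = text.split("\n", 1)[1]
        # ... and everything from the closing fence onwards.
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)["transcription"]

# Sample reply shaped like the model output shown above.
sample_reply = (
    FENCE + "json\n"
    '{\n  "transcription": "Atsarginė duomenų kopija sėkmingai sukurta ir patikrinta."\n}\n'
    + FENCE
)

parsed = parse_transcription(sample_reply)
print(parsed)
```

## The same function also accepts an unfenced JSON string, so the zero-shot and RAG replies can be parsed and compared as plain strings.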


In [13]:
## Reflection on RAG Extension

## In this lab, I extended my end-to-end AI system with a Retrieval-Augmented Generation (RAG) component.
## Instead of sending only the new input X to Gemini, I first retrieved the three most similar examples from a vector database built with ChromaDB.
## These examples provided in-context guidance about typical phrasing, domain terminology, and structure used in previous transcriptions.
## In the single comparison shown above, however, the RAG and zero-shot outputs were identical, so this notebook alone does not demonstrate a gain in consistency or domain appropriateness.
## Two factors likely contribute: the input X already contains the target sentence as text, and the three retrieved examples were duplicates of one another, so they added little new context.
## The retrieval step did make the behaviour of the model more explainable, because I could inspect which examples were supplied as context.
## A limitation of this setup is that the examples are synthetic; in future work I would like to replace them with real audio–transcription pairs from my thesis data.
## Overall, the RAG pipeline runs end to end and makes the context given to my thesis-based AI agent inspectable; whether it improves reliability still has to be tested on harder inputs and a deduplicated example store.
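
## One way to put a number on the zero-shot versus RAG comparison would be word error rate (WER), a common transcription metric: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words.
## The sketch below is a plain-Python version under that definition; the function name and the second test sentence are illustrative, not results from this notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words (row 0: nothing processed yet).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "Atsarginė duomenų kopija sėkmingai sukurta ir patikrinta."
wer_same = word_error_rate(reference, reference)
wer_drop = word_error_rate(reference, "Atsarginė duomenų kopija sukurta ir patikrinta.")
print(wer_same, wer_drop)
```

## An identical hypothesis scores 0.0, and dropping one of the seven reference words scores 1/7. Averaging this over a held-out set for both prompts would give a like-for-like comparison.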
