<a href="https://colab.research.google.com/github/NickDee96/common-voice-embedding/blob/main/docs/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Leveraging Common Voice Transcriptions for Robust Text-Based RAG: Enhancing Query Diversity Handling

This notebook demonstrates how to leverage Mozilla Common Voice transcriptions to enhance text-based Retrieval-Augmented Generation (RAG) systems. We address two critical limitations:

1. **Handling knowledge base gaps** for niche topics or underrepresented languages
2. **Mitigating coverage issues** by utilizing speech transcriptions with inherent linguistic diversity

We go through four main sections:
1. **Data Preparation**: Loading and preprocessing Common Voice Swahili transcriptions
2. **Embedding Fine-tuning**: Training models to better interpret paraphrased and ambiguous queries
3. **Hybrid Knowledge Base**: Augmenting traditional text corpora with speech-derived data
4. **Evaluation**: Measuring query diversity robustness and coverage breadth

By repurposing Common Voice's speech data for text-based RAG, we enable systems to better align with how users *actually speak* rather than how they write, while expanding access to non-dominant languages.

## Preparing Speech-Derived Text Corpus

We create our corpus using **Mozilla Common Voice Swahili transcriptions**—leveraging the linguistic diversity and speaker variability inherent in speech data. Unlike traditional text corpora, these transcriptions capture:

- **Natural language variations**: How people actually speak vs. formal writing
- **Vernacular expressions**: Colloquial terms and phrasings
- **Linguistic diversity**: Multiple ways of expressing the same concepts
- **Underrepresented language patterns**: Authentic usage in non-dominant languages

This approach bridges the "formality gap" between curated knowledge bases and real-world user interactions, making RAG systems more robust to query diversity.

In [None]:
%pip install datasets
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning
%pip install llama-index-readers-file
%pip install llama-index-embeddings-huggingface
%pip install "transformers[torch]"
%pip install pandas

In [None]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

### Download Data

In [None]:
from datasets import load_dataset
import pandas as pd

# Load the Swahili subset of Common Voice 17.0
print("Loading Common Voice Swahili dataset...")
train_dataset = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")
test_dataset = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="test")

# Extract the text column from both splits
train_texts = train_dataset["sentence"]
test_texts = test_dataset["sentence"]

# Create pandas DataFrames
train_df = pd.DataFrame({"text": train_texts})
test_df = pd.DataFrame({"text": test_texts})

# Remove missing values and duplicates (blank strings are filtered out later in load_corpus)
train_df = train_df.dropna().drop_duplicates().reset_index(drop=True)
test_df = test_df.dropna().drop_duplicates().reset_index(drop=True)

# Save the DataFrames to CSV files
train_df.to_csv("common_voice_swahili_train.csv", index=False)
test_df.to_csv("common_voice_swahili_test.csv", index=False)

print(f"Train dataset: {len(train_df)} sentences saved to common_voice_swahili_train.csv")
print(f"Test dataset: {len(test_df)} sentences saved to common_voice_swahili_test.csv")
print("\nTrain sample:")
print(train_df.head())
print("\nTest sample:")
print(test_df.head())

In [None]:
TRAIN_FILES = ["common_voice_swahili_train.csv"]
VAL_FILES = ["common_voice_swahili_test.csv"]

TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
VAL_CORPUS_FPATH = "./data/val_corpus.json"

import uuid

import pandas as pd
from llama_index.core.schema import TextNode


def load_corpus(files, verbose=False):
    """Load transcriptions from the given files and wrap each one in a TextNode."""
    if verbose:
        print(f"Loading files {files}")

    all_texts = []
    for file_path in files:
        if file_path.endswith(".csv"):
            # CSV files hold one Common Voice transcription per row in the "text" column
            df = pd.read_csv(file_path)
            texts = df["text"].dropna().tolist()
            all_texts.extend(texts)
            if verbose:
                print(f"Loaded {len(texts)} texts from {file_path}")
        else:
            # Fallback for other file types
            docs = SimpleDirectoryReader(input_files=[file_path]).load_data()
            all_texts.extend(doc.text for doc in docs)
            if verbose:
                print(f"Loaded {len(docs)} docs from {file_path}")

    # Create one TextNode per non-empty transcription, each with a fresh UUID
    nodes = [
        TextNode(text=text.strip(), id_=str(uuid.uuid4()))
        for text in all_texts
        if text.strip()
    ]

    if verbose:
        print(f"Created {len(nodes)} nodes")

    return nodes

We use the Common Voice Swahili dataset with its native train/test splits. The train split is used for training the embedding model, and the test split is used for validation.
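
Because the validation split doubles as our evaluation set, it can be worth checking that no sentence appears in both splits. The snippet below is a minimal sketch of such a check; the `normalize` and `split_overlap` helpers and the toy sentences are our own illustrative assumptions, not part of Common Voice or LlamaIndex.

```python
def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so near-identical sentences compare equal."""
    return " ".join(sentence.lower().split())


def split_overlap(train_texts, val_texts):
    """Return the normalized sentences that appear in both splits."""
    train_set = {normalize(t) for t in train_texts}
    val_set = {normalize(t) for t in val_texts}
    return train_set & val_set


# Hypothetical toy sentences standing in for the contents of the two CSV files.
toy_train = ["Habari ya asubuhi", "Ninapenda kusoma vitabu"]
toy_val = ["habari  ya asubuhi", "Leo kuna mvua"]

shared = split_overlap(toy_train, toy_val)
print(f"{len(shared)} sentence(s) appear in both splits: {sorted(shared)}")
```

In the notebook, the same check could be run on `train_df["text"]` and `test_df["text"]`; any shared sentences could then be dropped from one split before building nodes.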

In [None]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)


### Generate synthetic queries for robust diversity handling

Now, we use an LLM (gpt-4o) to generate questions using each text chunk in the corpus as context. This process is crucial for creating training data that captures the **query diversity robustness** we aim to achieve.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset. By training on speech-derived transcriptions, our embedding model learns to handle:
- **Paraphrased queries**: Multiple ways users might express the same information need
- **Vernacular expressions**: Colloquial and informal language patterns
- **Ambiguous phrasing**: Natural speech patterns that differ from formal text

This approach bridges the gap between how users *actually speak* and how traditional knowledge bases are structured.
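
To make the shape of the finetuning data concrete, here is a small sketch that assembles the three mappings the evaluation code later reads from the dataset (`queries`, `corpus`, and `relevant_docs`). The `build_pairs` helper and the hand-written Swahili questions are hypothetical stand-ins: in the notebook the questions come from the LLM, not from a list we type by hand.

```python
import uuid


def build_pairs(chunks, questions_per_chunk):
    """Assemble queries, corpus and relevant_docs mappings from pre-written questions."""
    # Each text chunk gets its own ID, mirroring one TextNode per transcription.
    corpus = {str(uuid.uuid4()): text for text in chunks}
    queries, relevant_docs = {}, {}
    for node_id, questions in zip(corpus, questions_per_chunk):
        for question in questions:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Every question points back to the single chunk it was written from.
            relevant_docs[query_id] = [node_id]
    return queries, corpus, relevant_docs


# Hypothetical example: one chunk and two differently phrased questions about it.
toy_chunks = ["Nairobi ni mji mkuu wa Kenya."]
toy_questions = [["Mji mkuu wa Kenya ni upi?", "Nairobi iko nchi gani?"]]

queries, corpus, relevant_docs = build_pairs(toy_chunks, toy_questions)
print(len(queries), "queries for", len(corpus), "chunk")
```

Several questions mapping to the same chunk is what exposes the embedding model to paraphrases of one information need.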

In [None]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [None]:
import os

# Placeholder: replace with your own OpenAI API key (avoid committing real keys)
OPENAI_API_KEY = "sk-"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
from llama_index.llms.openai import OpenAI


train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-4o"),
    nodes=train_nodes,
    output_path="train_dataset.json",
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-4o"),
    nodes=val_nodes,
    output_path="val_dataset.json",
)


In [None]:
# [Optional] Reload the generated datasets from disk instead of regenerating them
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

## Fine-tune Embeddings for Query Diversity Robustness

We fine-tune our embedding model specifically to handle the linguistic diversity present in speech-derived text. This enables better alignment with real-world user queries that often differ significantly from formal written text.
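
The notebook does not state which training objective is used. As an illustration only, the sketch below assumes an in-batch-negatives contrastive objective of the kind commonly used for (query, passage) pairs, such as sentence-transformers' `MultipleNegativesRankingLoss`. The function names, the toy vectors, and the `scale=20.0` value are our assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where query i should rank doc i above all other docs in the batch."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch of candidate documents.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)


# Toy 2-d embeddings (assumed): in the first batch each query matches its own doc,
# in the second the docs are swapped so each query is closest to the wrong doc.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(toy_queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss(toy_queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Under this objective, every other passage in the batch acts as a negative, which is why only positive (question, chunk) pairs are needed in the dataset.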

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [None]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

In [None]:
finetune_engine.finetune()

In [None]:
embed_model = finetune_engine.get_finetuned_model()

In [None]:
embed_model

HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x2cc3d5cd0>, tokenizer_name='test_model', max_length=512, pooling=<Pooling.CLS: 'cls'>, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

## Evaluate Model on Query Diversity and Coverage

Our evaluation focuses on the two properties named in the introduction: **query diversity robustness** and **coverage breadth**. We assess how well our fine-tuned model handles the linguistic variations present in speech-derived data compared to off-the-shelf embedding models.

In this section, we evaluate 3 different embedding models:
1. proprietary OpenAI embedding,
2. open source `BAAI/bge-small-en`, and
3. our finetuned embedding model.

We consider 2 evaluation approaches:
1. a simple custom **hit rate** metric
2. using `InformationRetrievalEvaluator` from sentence_transformers

We show that fine-tuning on a synthetic (LLM-generated) dataset significantly improves upon an open-source embedding model.

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve top-k documents with the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.
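
As a quick worked example of the metric itself, independent of any embedding model, the sketch below computes the hit rate from already-retrieved IDs. The `hit_rate` helper and the toy IDs are illustrative assumptions.

```python
def hit_rate(retrieved_by_query, expected_by_query, top_k=5):
    """Fraction of queries whose expected document appears in the top-k retrieved IDs."""
    hits = 0
    for query_id, expected_id in expected_by_query.items():
        top = retrieved_by_query[query_id][:top_k]
        hits += expected_id in top
    return hits / len(expected_by_query)


# Hypothetical retrieval results: ranked document IDs per query.
toy_retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d4", "d9", "d2"],
    "q3": ["d5", "d6", "d8"],
}
# One expected (relevant) document per query; "d0" is never retrieved for q3.
toy_expected = {"q1": "d1", "q2": "d2", "q3": "d0"}

rate_at_3 = hit_rate(toy_retrieved, toy_expected, top_k=3)
rate_at_2 = hit_rate(toy_retrieved, toy_expected, top_k=2)
print(rate_at_3, rate_at_2)
```

Shrinking `top_k` can only lower the hit rate, so comparisons between models are meaningful only at the same `top_k`.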

In [None]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results

**Option 2**: We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a more comprehensive suite of metrics, but we can only run it against models compatible with sentence-transformers (the open-source model and our fine-tuned model, *not* the OpenAI embedding model).
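
Unlike the hit rate, rank-aware metrics such as MRR@k and NDCG@k reward placing the relevant document higher in the list. The sketch below shows one common binary-relevance formulation of both; the helper names and toy ranking are our assumptions, and the exact definitions used inside sentence-transformers may differ in detail.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k (0.0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: discounted gain of the ranking divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking for one query whose only relevant document sits at rank 2.
toy_ranked = ["d4", "d2", "d9"]
toy_relevant = {"d2"}

mrr = mrr_at_k(toy_ranked, toy_relevant)
ndcg = ndcg_at_k(toy_ranked, toy_relevant)
print(mrr, round(ndcg, 4))
```

A hit-rate evaluation would score this query as a plain hit, whereas both metrics here penalize the relevant document for not being ranked first.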

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed both the corpus and the queries.

In [None]:
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)

In [None]:
df_ada = pd.DataFrame(ada_val_results)

In [None]:
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada

0.8779904306220095

#### BAAI/bge-small-en

In [None]:
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)

In [None]:
df_bge = pd.DataFrame(bge_val_results)

In [None]:
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge

0.7930622009569378

In [None]:
evaluate_st(val_dataset, "BAAI/bge-small-en", name="bge")

FileNotFoundError: [Errno 2] No such file or directory: 'results/Information-Retrieval_evaluation_bge_results.csv'

#### Finetuned

In [None]:
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)

In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [None]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

In [None]:
evaluate_st(val_dataset, "test_model", name="finetuned")

### Summary of Results

#### Hit rate

In [None]:
df_ada["model"] = "ada"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"

We can see that fine-tuning our small open-source embedding model on Common Voice transcriptions dramatically improves its retrieval quality! The speech-derived training data enables the model to better handle linguistic diversity and natural language variations, approaching the quality of proprietary OpenAI embeddings while specifically addressing query diversity robustness.

In [None]:
df_all = pd.concat([df_ada, df_bge, df_finetuned])
# Average the boolean is_hit column per model to get each model's hit rate
df_all.groupby("model")["is_hit"].mean()

#### InformationRetrievalEvaluator

In [None]:
df_st_bge = pd.read_csv(
    "results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
    "results/Information-Retrieval_evaluation_finetuned_results.csv"
)

The results demonstrate that embedding fine-tuning on Common Voice transcriptions improves metrics consistently across the evaluation suite. This validates our approach of leveraging speech-derived text to bridge the "formality gap" between curated knowledge bases and real-world user interactions, enhancing both query diversity robustness and coverage breadth.

In [None]:
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all