# 📊 Evaluating Untested STT + MT Combinations Based on Efficiency

## 🎯 Goal

We want to estimate which untested STT+MT pairs are likely to offer the best trade-off between **quality** and **speed**, so that we know which ones to test next.

We combine:

- **Whisper models** (STT): eight variants, from tiny to large-turbo-v3.
- **Translation models** (MT): sixteen backends, including Google Translate, DeepL, and GPT-4o.

Each model has:
- A **quality score** (higher is better)
- A **generation time** in seconds (lower is better)

The main metric is:

> **Efficiency = (STT Score + MT Score) / (STT Time + MT Time)**
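
As a quick check of the formula, here is a minimal sketch that computes the efficiency of a single pair from two `(score, time)` tuples. The helper name `pair_efficiency` is hypothetical and is not part of the notebook; the numbers are the Whisper `small` and Gemini 2.0 Flash entries from the data in the code cell.

```python
def pair_efficiency(stt, mt):
    """Return (STT score + MT score) / (STT time + MT time).

    Each argument is a (quality_score, time_in_seconds) tuple.
    `pair_efficiency` is an illustrative helper, not part of the notebook.
    """
    stt_score, stt_time = stt
    mt_score, mt_time = mt
    total_time = stt_time + mt_time
    if total_time <= 0:
        raise ValueError("combined time must be positive")
    return (stt_score + mt_score) / total_time


# Whisper "small" is (16, 8) and "Gemini 2.0 Flash" is (15, 2):
# (16 + 15) / (8 + 2) = 31 / 10
print(pair_efficiency((16, 8), (15, 2)))  # 3.1
```

Because the times sit in the denominator, a one-second change on a fast pair moves the result far more than the same change on a slow pair.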

In [1]:
import pandas as pd
from IPython.display import display, Markdown

# === STT (Speech-to-Text) models: name -> (quality score, generation time in seconds)
whisper_data = {
    "tiny": (13, 12),
    "base": (15, 8),
    "small": (16, 8),
    "medium": (15, 17),
    "large-v1": (17, 36),
    "large-v2": (19, 17),
    "large-v3": (19, 33),
    "large-turbo-v3": (18, 21)
}

# === MT (Machine Translation) models: name -> (quality score, generation time in seconds)
translation_data = {
    "Google Translate V1": (10, 1),
    "DeepL V2": (13, 3),
    "ChatGPT mini 3o": (17, 4),
    "GPT-4o mini": (18, 8),
    "GPT-4o": (18, 6),
    "GPT-4o turbo": (18, 8),
    "GPT-3.5 turbo": (14, 5),
    "GPT-4": (18, 11),
    "MyMemory": (17, 3),
    "Groq": (13, 3),
    "Winstxnhdw": (10, 24),
    "Ollama": (14, 4),
    "DeepSeek R1": (12, 58),
    "Gemma 3": (13, 6),
    "ZongweiGemma3": (11, 5),
    "Gemini 2.0 Flash": (15, 2)
}

# === Already tested combinations (to be excluded)
tested_combinations = {
    ("tiny", "Google Translate V1"),
    ("base", "DeepL V2"),
    ("small", "ChatGPT mini 3o"),
    ("small", "GPT-4o mini"),
    ("small", "GPT-4o"),
    ("small", "GPT-4o turbo"),
    ("small", "GPT-3.5 turbo"),
    ("small", "GPT-4"),
    ("medium", "MyMemory"),
    ("large-v1", "Groq"),
    ("large-v2", "Winstxnhdw"),
    ("large-v3", "Ollama"),
    ("large-v3", "DeepSeek R1"),
    ("large-v3", "Gemma 3"),
    ("large-v3", "ZongweiGemma3"),
    ("large-turbo-v3", "Gemini 2.0 Flash")
}

# === Build and evaluate all untested combinations
combinations = []
for whisper, (stt_score, stt_time) in whisper_data.items():
    for trans, (mt_score, mt_time) in translation_data.items():
        if (whisper, trans) in tested_combinations:
            continue
        total_score = stt_score + mt_score
        total_time = stt_time + mt_time  # seconds
        efficiency = round(total_score / total_time, 3)
        combinations.append({
            "STT Model (Whisper)": whisper,
            "MT Model (Translator)": trans,
            "STT Quality Score": stt_score,
            "MT Quality Score": mt_score,
            "Total Score (Quality)": total_score,
            "Total Time (sec)": total_time,
            "Efficiency (Score ÷ Time)": efficiency
        })

# === Sort and display
df = pd.DataFrame(combinations)
df_sorted = df.sort_values("Efficiency (Score ÷ Time)", ascending=False).reset_index(drop=True)

# === Explanation block
display(Markdown("""
## 🔍 Suggested Untested Model Combinations Ranked by Efficiency

This table shows untested combinations of **STT** (Speech-to-Text) and **MT** (Machine Translation) models.
They are ranked by **efficiency**, calculated using the formula:

> **Efficiency = (STT Score + MT Score) / (STT Time + MT Time)**  
> *Time is measured in seconds.*

- **STT Score**: Quality of converting spoken German to written German.
- **MT Score**: Quality of translating German text to English.
- **Time (sec)**: Combined generation time for both STT and MT.
- **Efficiency**: Higher means better quality per second.

Use this to prioritize the most promising model pairs to test next.
"""))

# === Final styled output (Styler.background_gradient requires matplotlib to be installed)
display(df_sorted.style.background_gradient(cmap="Greens", subset=["Efficiency (Score ÷ Time)"]))



## 🔍 Suggested Untested Model Combinations Ranked by Efficiency

This table shows untested combinations of **STT** (Speech-to-Text) and **MT** (Machine Translation) models.
They are ranked by **efficiency**, calculated using the formula:

> **Efficiency = (STT Score + MT Score) / (STT Time + MT Time)**  
> *Time is measured in seconds.*

- **STT Score**: Quality of converting spoken German to written German.
- **MT Score**: Quality of translating German text to English.
- **Time (sec)**: Combined generation time for both STT and MT.
- **Efficiency**: Higher means better quality per second.

Use this to prioritize the most promising model pairs to test next.


| | STT Model (Whisper) | MT Model (Translator) | STT Quality Score | MT Quality Score | Total Score (Quality) | Total Time (sec) | Efficiency (Score ÷ Time) |
|---|---|---|---|---|---|---|---|
| 0 | small | Gemini 2.0 Flash | 16 | 15 | 31 | 10 | 3.1 |
| 1 | base | Gemini 2.0 Flash | 15 | 15 | 30 | 10 | 3.0 |
| 2 | small | MyMemory | 16 | 17 | 33 | 11 | 3.0 |
| 3 | base | MyMemory | 15 | 17 | 32 | 11 | 2.909 |
| 4 | small | Google Translate V1 | 16 | 10 | 26 | 9 | 2.889 |
| 5 | base | Google Translate V1 | 15 | 10 | 25 | 9 | 2.778 |
| 6 | base | ChatGPT mini 3o | 15 | 17 | 32 | 12 | 2.667 |
| 7 | small | Groq | 16 | 13 | 29 | 11 | 2.636 |
| 8 | small | DeepL V2 | 16 | 13 | 29 | 11 | 2.636 |
| 9 | base | Groq | 15 | 13 | 28 | 11 | 2.545 |
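
Two things stand out in these rows. First, some pairs tie: `base` + Gemini 2.0 Flash and `small` + MyMemory both score 3.0, and their relative order is then left to the sort. Second, a faster pair can outrank a higher-quality one: `small` + Google Translate V1 (total score 26) sits above `base` + ChatGPT mini 3o (total score 32). One possible way to make ties deterministic is to sort by efficiency first and by total score second. The sketch below does this in plain Python on four rows copied from the table; the tie-break rule and the name `rank_pairs` are assumptions for illustration, not something the notebook does.

```python
# Four rows copied from the table above:
# (STT model, MT model, total score, total time in seconds)
rows = [
    ("base", "Gemini 2.0 Flash", 30, 10),
    ("small", "MyMemory", 33, 11),
    ("small", "Google Translate V1", 26, 9),
    ("base", "ChatGPT mini 3o", 32, 12),
]


def rank_pairs(rows):
    """Sort by efficiency (descending), breaking ties by total score (descending).

    Efficiency is rounded to 3 decimals, as in the notebook, so that
    30/10 and 33/11 compare equal and the tie-break applies.
    `rank_pairs` is a hypothetical helper, not part of the notebook.
    """
    def key(row):
        _, _, score, time_s = row
        return (-round(score / time_s, 3), -score)

    return sorted(rows, key=key)


for stt, mt, score, time_s in rank_pairs(rows):
    print(f"{stt:6} {mt:20} {score:3} {time_s:3} {score / time_s:.3f}")
```

With this rule, `small` + MyMemory moves ahead of `base` + Gemini 2.0 Flash because it reaches the same efficiency with a higher total score. In pandas, a comparable effect could be had by passing both column names to `sort_values` with `ascending=[False, False]`.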
