# Jupyter Notebook for Project "Comparison of LLM Prompting Techniques"

_Copyright 2025 Aldenkirchs & Reichert_

_All code in this notebook is licensed under the same license as specified in the LICENSE file in the root directory of this project (see LICENSE)._

In [15]:
import pandas as pd
import mlflow
import mlflow.pyfunc
import sacrebleu
from llama_cpp import Llama
import time
from enum import Enum
from rouge_score import rouge_scorer
import subprocess
import json
import os
import gc


## 1 Data Loading
In the first step, we load the given translations into a pandas DataFrame and print a quick overview of it.

In [16]:
data = pd.read_pickle('machine_translation.pkl')
data

|   | complexity | text_german | text_english |
|---|---|---|---|
| 0 | easy | Felix hat es satt: Ständig ist Mama unterwegs.... | Felix is fed up: Mom is always on the go. But ... |
| 1 | news_gen | Die rund 1.400 eingesetzten Beamten haben demn... | The approximately 1,400 deployed officers have... |
| 2 | news_spec | Der Staatschef hat zugleich aber das Recht, vo... | The head of state also has the right to appoin... |
| 3 | pop_science | Dass der Klimawandel die Hitzewellen in Südasi... | There is no question that climate change is in... |
| 4 | science | Der DSA-110, der sich am Owens Valley Radio Ob... | The DSA-110, situated at the Owens Valley Radi... |


In [17]:
data_info = pd.DataFrame()
data_info['complexity'] = data['complexity']
data_info['text_german_length'] = data['text_german'].str.len()
data_info['text_english_length'] = data['text_english'].str.len()
data_info

|   | complexity | text_german_length | text_english_length |
|---|---|---|---|
| 0 | easy | 485 | 415 |
| 1 | news_gen | 296 | 280 |
| 2 | news_spec | 518 | 484 |
| 3 | pop_science | 542 | 521 |
| 4 | science | 1003 | 827 |


Here we collect some constants and enums for improved maintainability and
code reliability. `ESTIMATED_TOKENS_BUFFER` and `MLFLOW_TRACKING_URI` can be adjusted to the individual setup.

In [18]:
class Language(Enum):
    ENGLISH = 'English'
    GERMAN = 'German'


class Complexity(Enum):
    EASY = 'easy'
    NEWS_GEN = 'news_gen'
    NEWS_SPEC = 'news_spec'
    POP_SCIENCE = 'pop_science'
    SCIENCE = 'science'


ALL_COMPLEXITIES = list(Complexity)

# this constant value is later used to calculate the estimated tokens and context size
# -> it gets later multiplied by the token length of the prompt template + source text + reference text
#       and should be > 1 but not too big
# we identified 1.5 as a good heuristic
ESTIMATED_TOKENS_BUFFER = 1.5

# this is the default mlflow tracking uri and needs to be adjusted
#    if mlflow is available under a different/ remote uri
MLFLOW_TRACKING_URI = 'http://127.0.0.1:5000'
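
Because each enum value matches a string in the `complexity` column, a member can be looked up directly by its value. The following is a minimal sketch of that lookup; `Complexity` is redefined only so the snippet runs on its own, and `parse_complexity` is a hypothetical helper that is not part of the notebook.

```python
from enum import Enum


class Complexity(Enum):
    EASY = 'easy'
    NEWS_GEN = 'news_gen'
    NEWS_SPEC = 'news_spec'
    POP_SCIENCE = 'pop_science'
    SCIENCE = 'science'


def parse_complexity(label):
    """Hypothetical helper: map a raw column value to a Complexity member."""
    try:
        return Complexity(label)  # Enum lookup by value
    except ValueError:
        return None  # label is not a known complexity level


print(parse_complexity('news_gen'))
print(parse_complexity('unknown'))
```

Returning `None` for unknown labels is just one possible choice; raising an error would surface data problems earlier.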

***
## 2 Model Loading
In the second step, we load the language models specified in the task. To do so, we use the `llama-cpp-python` library (further documentation can be found [here](https://github.com/abetlen/llama-cpp-python)) and download the models directly from [Hugging Face](https://huggingface.co/).

In case of problems, a quick overview and installation guide for llama.cpp can be found here:
- https://www.datacamp.com/tutorial/llama-cpp-tutorial
- https://christophergs.com/blog/running-open-source-llms-in-python

In [19]:
# Configuration of the models
MODELS = {
    'gemma': {
        'repo_id': 'lmstudio-ai/gemma-2b-it-GGUF',
        'filename': 'gemma-2b-it-q8_0.gguf',
    },
    'llama32': {
        'repo_id': 'hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF',
        'filename': 'llama-3.2-3b-instruct-q8_0.gguf',
    },
    'llama31': {
        'repo_id': 'lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF',
        'filename': 'Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf',
    },
    'aya23': {
        'repo_id': 'bartowski/aya-23-35B-GGUF',
        'filename': 'aya-23-35B-Q5_K_M.gguf',
    },
}

In [20]:
def create_llama_model(repo_id, filename, n_ctx=None):
    """
    Loads and creates the Llama model from the specified repository and file.

    Args:
        repo_id: repository ID of the model.
        filename: filename of the model.
        n_ctx: context window size for the model. Defaults to 512 if None.

    Returns:
        The loaded Llama model, or None if an error occurs.
    """
    try:
        if n_ctx is None:
            # default of llama_cpp
            n_ctx = 512
        if repo_id is not None and filename is not None:
            model = Llama.from_pretrained(
                repo_id=repo_id,
                filename=filename,
                n_ctx=n_ctx,
                # these parameters can be set individually based on the running system
                #n_gpu_layers=n_gpu_layers,
                #n_threads=8,
                verbose=False,
            )
            print(f"Model {repo_id} successfully loaded with n_ctx={n_ctx}")
            return model
        else:
            return None
    except Exception as e:
        print(f"Error occurred when loading the model from file: {filename}: {e}")
        return None

***

## 3 Pipeline

### 3.1 Model Interaction

In [21]:
def translate(model, prompt, reference_translation):
    """
    Translates the given prompt using the provided model.
    Estimates the needed max_tokens based on the lengths of the prompt and the reference translation.

    Args:
        model: translation model to be used.
        prompt: text to be translated.
        reference_translation: reference translation used to estimate max_tokens.

    Returns:
        The translated text.
    """
    # we estimate the needed max_tokens based on the tokenized prompt and reference_translation
    token_length_ref = len(model.tokenize(reference_translation.encode('utf-8')))
    token_length_prompt = len(model.tokenize(prompt.encode('utf-8')))
    # the model should not need more tokens than this; max_tokens has to be an integer
    estimated_max_tokens = int((token_length_prompt + token_length_ref) * ESTIMATED_TOKENS_BUFFER)

    response = model(prompt, max_tokens=estimated_max_tokens, echo=False)
    return response['choices'][0]['text']
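
The token budget used above can be tried out without loading a model. The sketch below is only an illustration: `WhitespaceTokenizer` is a hypothetical stand-in that counts whitespace-separated words, so its counts will differ from those of a real llama.cpp tokenizer, and `estimate_max_tokens` is a hypothetical helper name.

```python
ESTIMATED_TOKENS_BUFFER = 1.5  # same heuristic value as in the constants cell


class WhitespaceTokenizer:
    """Stand-in for a model: one token per whitespace-separated word."""

    def tokenize(self, data: bytes):
        return data.decode('utf-8').split()


def estimate_max_tokens(model, prompt, reference):
    """Token budget: (prompt tokens + reference tokens) * buffer, as an integer."""
    prompt_tokens = len(model.tokenize(prompt.encode('utf-8')))
    reference_tokens = len(model.tokenize(reference.encode('utf-8')))
    return int((prompt_tokens + reference_tokens) * ESTIMATED_TOKENS_BUFFER)


budget = estimate_max_tokens(
    WhitespaceTokenizer(),
    'Translate to English: "Guten Morgen"',  # 5 whitespace tokens
    'Good morning',                          # 2 whitespace tokens
)
print(budget)  # (5 + 2) * 1.5 = 10.5, truncated to 10
```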

### 3.2 Metrics Calculation

These two code cells calculate the metric scores from the hypothesis and the reference translation. We also integrated MetricX into the project (see the [MetricX GitHub repository](https://github.com/google-research/metricx)).

In [23]:
def evaluate_translation(source, reference, hypothesis):
    """
    Evaluates the quality of a translation (hypothesis) against a reference translation,
    calculating BLEU, chrF, MetricX, and RougeL scores.

    Args:
        source: source text.
        reference: reference translation.
        hypothesis: hypothesis.

    Returns:
        A dictionary containing the BLEU, chrF, RougeL, and MetricX scores.  BLEU, chrF, and RougeL
        are scaled to be between 0 and 100. MetricX will be -1 if it cannot be calculated.
    """

    # Note: BLEU and chrF are often reported on a scale from 0 to 1,
    #   but sacrebleu returns floats between 0 and 100
    bleu_score = sacrebleu.corpus_bleu([hypothesis], [[reference]]).score
    chrf_score = sacrebleu.corpus_chrf([hypothesis], [[reference]]).score

    metricx_score = calculate_metricx_score(source, reference, hypothesis)
    if metricx_score is None:
        metricx_score = -1

    rougel_scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    rougel_score = rougel_scorer.score(reference, hypothesis)

    return {'BLEU': bleu_score,
            'chrF': chrf_score,
            # we also edited the rougeL score to lie between 0 and 100 (to be similar to BLEU and chrF)
            'rougeL': (rougel_score['rougeL'].fmeasure * 100),
            'MetricX': metricx_score}
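
To make the ROUGE-L number more concrete: it is the F-measure over the longest common subsequence (LCS) of the reference and hypothesis tokens. The following is a simplified sketch, not the `rouge_score` implementation. It lowercases and splits on whitespace and skips the stemming and tokenization the library applies, so its values can differ. `lcs_length` and `rouge_l_f` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f(reference, hypothesis):
    """ROUGE-L F-measure scaled to 0..100, on lowercased whitespace tokens."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    lcs = lcs_length(ref_tokens, hyp_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return 100 * 2 * precision * recall / (precision + recall)


# LCS is "the cat on the mat" (5 tokens); both texts have 6 tokens
score = rouge_l_f('the cat sat on the mat', 'the cat is on the mat')
print(round(score, 2))
```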


In [22]:
def calculate_metricx_score(source, reference, hypothesis):
    '''
    Calculates the MetricX-score based on source, reference, and hypothesis using metricx24.
    We are currently using the metricx-24-hybrid-large-v2p6-bfloat16 model but there are also other options
        as can be seen here: https://github.com/google-research/metricx

    Args:
        source: The source text (String).
        reference: The reference translation (String).
        hypothesis: The hypothesis translation (String).

    Returns:
        The calculated score as a float or None in case of an error.
    '''

    # Create temporary JSONL files
    input_file = './temp_input.jsonl'
    output_file = './temp_output.jsonl'
    # this is the model that is used for evaluation
    model = 'google/metricx-24-hybrid-large-v2p6-bfloat16'

    tmp_data = [{'id': '1', 'source': source, 'reference': reference, 'hypothesis': hypothesis}]
    try:
        with open(input_file, 'w', encoding='utf-8') as f:
            for entry in tmp_data:
                json.dump(entry, f)
                f.write('\n')

        command = [
            'python', '-m', 'metricx24.predict',
            '--tokenizer', 'google/mt5-xl',
            '--model_name_or_path', model,
            '--max_input_length', '1536',
            '--batch_size', '1',
            '--input_file', input_file,
            '--output_file', output_file
        ]

        # subprocess.run() reads stdout/stderr while it waits, so the child process
        # cannot block on a full pipe (which Popen + wait() with PIPE can cause)
        process = subprocess.run(
            command,
            capture_output=True,
            text=True,
        )

        # Print captured output and errors (optional, can be useful for debugging)
        #print(process.stdout)
        #print(process.stderr)

        if process.returncode != 0:
            print(f'Error executing metricx24. Return code: {process.returncode}')
            return None

        # Read score from the output file
        with open(output_file, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    output_data = json.loads(line)
                    score = float(output_data.get('prediction'))
                    return score
                # TypeError covers a missing 'prediction' field (float(None))
                except (json.JSONDecodeError, ValueError, AttributeError, TypeError):
                    print('Error parsing the output file.')
                    return None

        return None  # If no valid line was found in the output file

    finally:
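        # Remove temporary files; each one is handled separately so that
        # a missing input file does not prevent removing the output file
        for tmp_file in (input_file, output_file):
            try:
                os.remove(tmp_file)
            except FileNotFoundError:
                pass  # the file does not exist, nothing to clean up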
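
The file exchange with MetricX is a plain JSONL round trip: one JSON object per line goes in, and the score is read back from a `prediction` field. The sketch below illustrates only that file handling and does not call MetricX. The output line is made up to mimic the shape the function above reads, `write_jsonl` and `read_first_prediction` are hypothetical helper names, and using `tempfile.TemporaryDirectory` instead of fixed file names is a suggestion rather than what the notebook does.

```python
import json
import os
import tempfile


def write_jsonl(path, entries):
    """Write one JSON object per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + '\n')


def read_first_prediction(path):
    """Return the 'prediction' of the first parsable line as float, else None."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                return float(json.loads(line)['prediction'])
            except (json.JSONDecodeError, KeyError, TypeError, ValueError):
                continue  # skip malformed lines
    return None


# The directory and its files are removed automatically when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = os.path.join(tmp_dir, 'output.jsonl')
    # Made-up output entry: the input fields plus a 'prediction' value
    write_jsonl(output_file, [{'id': '1', 'source': 'Hallo', 'reference': 'Hello',
                               'hypothesis': 'Hi', 'prediction': 3.25}])
    score = read_first_prediction(output_file)

print(score)
```

A per-call temporary directory would also avoid collisions if several evaluations ever ran at the same time.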




### 3.3 Logging to MLflow

In [25]:
def log_to_mlflow(experiment_name, template_name, metrics, prompt_type, model_name, complexity, target_language,
                  tmp_result,
                  prompt_language):
    """
    Logs results of a run to MLflow.
    Creates the respective experiment if it does not already exist.

    Args:
        experiment_name: name of the MLflow experiment.
        template_name: name of the prompt template used.
        metrics: dictionary of metrics to log.
        prompt_type: prompting technique that was used.
        model_name: name of the model.
        complexity: complexity level of the source text.
        target_language: target language of the translation.
        tmp_result: Pandas DataFrame containing temporary results.
        prompt_language: language of the prompt.
    """
    experiment = mlflow.get_experiment_by_name(experiment_name)

    if experiment:
        if experiment.lifecycle_stage == 'deleted':
            mlflow.tracking.MlflowClient().restore_experiment(experiment.experiment_id)
    else:
        mlflow.create_experiment(experiment_name)

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(run_name=f'{model_name}/{complexity}/{template_name}'):
        mlflow.log_param('model', model_name)
        mlflow.log_param('complexity', complexity)
        mlflow.log_param('prompt_type', prompt_type)
        mlflow.log_param('target_language', target_language)
        mlflow.log_param('prompt_language', prompt_language)
        for key, value in metrics.items():
            mlflow.log_metric(key, value)

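        # orient='records' stores one JSON object per row and leaves out the index
        # (index=False with the default orient is rejected by some pandas versions)
        tmp_result.to_json('tmp_results.json', orient='records')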
        mlflow.log_artifact('tmp_results.json')
        mlflow.end_run()


### 3.4 Pipeline Composition

This is the main part of our pipeline where all the code snippets from above come together.

In [26]:
def run_pipeline(texts):
    """
    Runs the translation pipeline for a given set of texts, iterating through the different MODELS (from above),
    complexities, and PROMPT_TEMPLATES. Also logs the results to MLflow and returns them as a DataFrame.

    Args:
        texts: Pandas DataFrame containing the source and reference texts, as well as the
            complexity level for each text.

    Returns:
        Pandas DataFrame containing the results of all translation runs.
    """
    # this is the result Dataframe where all runs are stored
    results = pd.DataFrame(
        columns=['model', 'complexity', 'prompt_type', 'prompt', 'source_text', 'hypothesis', 'reference_text', 'metrics',
                 'prompt_language'])

    # this is just for mlflow and can be changed individually
    mlflow.set_tracking_uri(uri=MLFLOW_TRACKING_URI)

    for model_name, model_config in MODELS.items():
        for _, row in texts.iterrows():
            model = createModel(model_config, row)
            complexity_enum = next(c for c in Complexity if c.value == row['complexity'])

            # translations German -> English
            for template_name, template_data in PROMPT_TEMPLATES_GERMAN_ENGLISH.items():
                if pd.notna(row['text_german']) and complexity_enum in template_data['complexities']:
                    results = execute_mlflow_run(template_name, complexity_enum.value, model, model_name,
                                                 Language.ENGLISH, results,
                                                 row['text_german'], row['text_english'], template_data)

            # translations English -> German
            for template_name, template_data in PROMPT_TEMPLATES_ENGLISH_GERMAN.items():
                if pd.notna(row['text_english']) and complexity_enum in template_data['complexities']:
                    results = execute_mlflow_run(template_name, complexity_enum.value, model, model_name,
                                                 Language.GERMAN, results,
                                                 row['text_english'], row['text_german'], template_data)

            # we don't need the model anymore, so we delete it to free up memory
            del model
            gc.collect()
    return results


def createModel(model_config, row):
    """
    Creates the desired model with a context window (n_ctx) that is estimated
    based on the token length of the prompt, source and reference text.

    Args:
        model_config: configuration for the language model.
        row: row from the input DataFrame containing the source and reference texts (needed for token estimation).

    Returns:
        The created language model.
    """
    # at first we just use the dummyModel for the tokenization of text
    dummyModel = create_llama_model(model_config['repo_id'], model_config['filename'])

    # then we determine the minimal tokens needed for a translation (prompt + source text + reference text)
    combined_text = f"{row['text_german']} {row['text_english']}"
    text_tokens = len(dummyModel.tokenize(combined_text.encode('utf-8')))
    # we want to tokenize the longest template/prompt
    max_prompt_template = max(
        (t['template'] for d in (PROMPT_TEMPLATES_GERMAN_ENGLISH, PROMPT_TEMPLATES_ENGLISH_GERMAN) for t in d.values()),
        key=len)
    prompt_tokens = len(dummyModel.tokenize(max_prompt_template.encode('utf-8')))

    # now we delete the dummyModel to free up memory
    del dummyModel
    gc.collect()

    # and then create the final model based on the estimated_max_tokens
    estimated_max_tokens = (text_tokens + prompt_tokens) * ESTIMATED_TOKENS_BUFFER
    n_ctx = int(estimated_max_tokens * 1.1)
    print(f"estimated_max_tokens: {estimated_max_tokens}; n_ctx: {n_ctx}")
    model = create_llama_model(model_config['repo_id'], model_config['filename'], n_ctx=n_ctx)
    return model


def execute_mlflow_run(template_name, complexity, model, model_name, target_language: Language, results, source_text,
                       reference_text, template_data):
    """
    Executes a single translation run, including prompt creation, translation, evaluation, and logging to MLflow.

    Args:
        template_name: name of the prompt template used.
        complexity: complexity level of the source text.
        model: language model used for translation.
        model_name: name of the model.
        target_language: target language of the translation.
        results: Pandas DataFrame to store the results.
        source_text: source text to be translated.
        reference_text: reference translation.
        template_data: data associated with the prompt template.

    Returns:
        the updated results DataFrame.
    """
    # this is the actual composition of the prompt where '{text}' gets replaced with the source text
    prompt = template_data['template'].format(text=source_text)

    start_time_translation = time.time()
    hypothesis = translate(model, prompt, reference_text)
    end_time_translation = time.time()
    print('Prompt finished in (seconds): ', round(end_time_translation - start_time_translation, 2))

    metrics = evaluate_translation(source=source_text, reference=reference_text, hypothesis=hypothesis)
    print('Metric Calculation finished in (seconds): ', round(time.time() - end_time_translation, 2))

    prompt_language = template_data['prompt_language']
    prompt_type = template_data['prompt_type']
    tmp_result = pd.DataFrame([{
        'model': model_name,
        'complexity': complexity,
        'prompt_type': prompt_type,
        'prompt': prompt,
        'source_text': source_text,
        'hypothesis': hypothesis,
        'reference_text': reference_text,
        'metrics': metrics,
        'prompt_language': prompt_language.value  # .value for the string value
    }])

    experiment_name = f'{model_name}_{complexity}'

    log_to_mlflow(experiment_name, template_name, metrics, prompt_type, model_name, complexity, target_language.value,
                  tmp_result,
                  prompt_language.value)

    # add tmp_results Dataframe to overall results
    results = pd.concat([
        results,
        tmp_result
    ], ignore_index=True)
    return results

### 3.5 Prompt Composition

In this cell, the prompt templates are defined. We provide two short examples per translation direction; the full collection of our templates and prompts can be found in the `prompt_templates_few_shot.ipynb` and `prompt_templates_zero_shot.ipynb` notebooks in this project.


In [27]:
# Example for only using specific complexities: 'complexities': [Complexity.EASY, Complexity.NEWS_GEN],

PROMPT_TEMPLATES_ENGLISH_GERMAN = {
    'zero_shot_to-de_en_1': {
        'template': 'Please translate the following text from English to German: \"{text}\"',
        'prompt_language': Language.ENGLISH,
        'prompt_type': 'zero_shot',
        'complexities': ALL_COMPLEXITIES
    },
    'zero_shot_to-de_de_1': {
        'template': 'Bitte übersetze diesen Text von Englisch nach Deutsch: \"{text}\"',
        'prompt_language': Language.GERMAN,
        'prompt_type': 'zero_shot',
        'complexities': ALL_COMPLEXITIES
    },
}

PROMPT_TEMPLATES_GERMAN_ENGLISH = {
    'zero_shot_to-en_en_1': {
        'template': 'Please translate the following text from German to English: \"{text}\"',
        'prompt_language': Language.ENGLISH,
        'prompt_type': 'zero_shot',
        'complexities': ALL_COMPLEXITIES
    },
    'zero_shot_to-en_de_1': {
        'template': 'Bitte übersetze diesen Text von Deutsch nach Englisch: \"{text}\"',
        'prompt_language': Language.GERMAN,
        'prompt_type': 'zero_shot',
        'complexities': ALL_COMPLEXITIES
    },
}
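
How the `complexities` entry and the `{text}` placeholder work together can be sketched in isolation. The templates below are hypothetical and only follow the dictionary layout of the cell above; `build_prompts` is a hypothetical helper, and `Complexity` is shortened to two members so the snippet runs on its own.

```python
from enum import Enum


class Complexity(Enum):
    EASY = 'easy'
    SCIENCE = 'science'


# Hypothetical templates; only the dictionary layout follows the notebook
TEMPLATES = {
    'easy_only': {
        'template': 'Translate into simple English: "{text}"',
        'complexities': [Complexity.EASY],
    },
    'all_levels': {
        'template': 'Translate from German to English: "{text}"',
        'complexities': list(Complexity),
    },
}


def build_prompts(templates, complexity, source_text):
    """Return {template_name: prompt} for templates that allow this complexity."""
    return {
        name: data['template'].format(text=source_text)
        for name, data in templates.items()
        if complexity in data['complexities']
    }


prompts = build_prompts(TEMPLATES, Complexity.SCIENCE, 'Guten Morgen')
print(prompts)
```

Since `str.format` treats every `{...}` as a placeholder, literal braces inside a template (for example in few-shot examples that contain JSON) would need to be doubled as `{{` and `}}`.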

***
## 4 Execute Pipeline

Here, we simply execute the pipeline. Depending on the number of prompt templates and complexities, this can take a long time. Generating a single model output usually takes from about 20 seconds on the small Gemma model up to 15 minutes on the large aya-23 model. The metric calculation should take about 30 to 60 seconds.

In [None]:
translation_results = run_pipeline(data)
translation_results.to_csv('translation_results.csv', sep=';')
print('Pipeline finished. Results saved.')
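
When the saved file is read back later, the `metrics` column arrives as text rather than as a dictionary. The sketch below shows one way to parse it, assuming the dictionaries were written as their Python string representation. The CSV content is made up and only mimics the layout of `translation_results.csv`, and the standard `csv` module is used here so the snippet has no dependencies.

```python
import ast
import csv
import io

# Made-up row in the layout of translation_results.csv (values are illustrative only)
csv_text = (
    ";model;complexity;metrics\n"
    "0;gemma;easy;{'BLEU': 10.0, 'chrF': 40.0, 'rougeL': 35.0, 'MetricX': -1}\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text), delimiter=';'))
for row in rows:
    # literal_eval parses the stored dictionary text without executing code
    row['metrics'] = ast.literal_eval(row['metrics'])

print(rows[0]['metrics']['chrF'])
```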