# Yerevan Winter School Tutorial 4
## Cross Lingual Prompting: Zero Shot, Few Shot, and Chain of Thought

**Context.**  
In this hands-on session, you will experiment with prompt design for simple tasks in English and in a chosen low resource language.  
You will compare zero shot and few shot prompting, explore Chain of Thought (CoT) instructions, and rate model outputs on several dimensions.

**What you will do.**

1. Select one or two simple tasks (for example text classification or question answering).  
2. Design prompts in English and in a low resource language of your choice.  
3. Compare zero shot and few shot prompting.  
4. Add Chain of Thought instructions and see whether the model follows reasoning steps across languages.  
5. Rate outputs for correctness, fluency, and cultural appropriateness using a small evaluation sheet.  
6. Discuss which prompt design choices improved or harmed performance for your target language.

**Important note.**  
You will run LLM queries outside this notebook (for example in Poe, ChatGPT, or another interface) and paste outputs back into the tables here.  
The notebook acts as a structured worksheet and analysis tool.


## 0. Setup

Run the following cells to install and import the Python libraries we will use to organize and summarize your observations.


In [None]:
!pip install -q pandas matplotlib

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

print("Pandas version:", pd.__version__)

## 1. Choose tasks and languages

You will work with:

- At least one task type (for example classification, question answering, translation quality judgement).  
- English (`en`).  
- At least one low resource language of your choice (for example Armenian `hy`, Luxembourgish `lb`, Kurdish `ku`, etc.).

Edit the configuration below to match your choices.


In [None]:
# Define languages you will use.
# You must include English (`en`) plus at least one low resource language.

languages = [
    {"code": "en", "name": "English"},
    {"code": "lb", "name": "Luxembourgish"},  # change to your low resource language if you prefer
]

languages_df = pd.DataFrame(languages)
languages_df

In [None]:
# Define the tasks you want to explore.
# Example tasks; you can modify names and descriptions.

tasks = [
    {
        "task_id": 1,
        "task_name": "sentiment_classification",
        "description": "Classify a short review as positive or negative.",
    },
    {
        "task_id": 2,
        "task_name": "factoid_qa",
        "description": "Answer a short factual question based on background knowledge.",
    },
]

tasks_df = pd.DataFrame(tasks)
tasks_df

You can modify `languages` and `tasks` above to fit your interests.

- Add or remove rows in the lists.  
- Make sure each `task_id` is unique.  
- Use task names that are short and descriptive.
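
The uniqueness requirement on `task_id` can be checked automatically. The sketch below is one possible way to do it; `example_tasks` is a hypothetical stand-in for your own `tasks` list, and the check relies only on the pandas `duplicated()` method.

```python
import pandas as pd

# Hypothetical stand-in for the `tasks` list defined above.
example_tasks = [
    {"task_id": 1, "task_name": "sentiment_classification"},
    {"task_id": 2, "task_name": "factoid_qa"},
]
example_tasks_df = pd.DataFrame(example_tasks)

# duplicated() marks every repeat of an earlier task_id.
duplicate_ids = example_tasks_df.loc[
    example_tasks_df["task_id"].duplicated(), "task_id"
].tolist()

if duplicate_ids:
    print("Duplicate task_id values:", duplicate_ids)
else:
    print("All task_id values are unique.")
```

To check your own configuration, replace `example_tasks` with `tasks`.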


## 2. Define input items for each task and language

For each `(task, language)` pair, you will define a small set of input items and their expected answers or labels.

Examples:

- For sentiment classification, the input might be a short review and the label is `positive` or `negative`.  
- For question answering, the input might be a question and the expected answer is a short phrase.

Fill in the table below with your own items. The examples are just placeholders.


In [None]:
# Define input items per task and language.
# You can replace the examples with your own entries.

items = [
    # Sentiment classification. English.
    {
        "item_id": 1,
        "task_id": 1,
        "language_code": "en",
        "input_text": "The restaurant was amazing, the food was fresh and the staff were friendly.",
        "expected_output": "positive",
    },
    {
        "item_id": 2,
        "task_id": 1,
        "language_code": "en",
        "input_text": "The movie was too long and incredibly boring.",
        "expected_output": "negative",
    },
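    # Sentiment classification. Luxembourgish (or your low resource language).
    # Note: labels are stored in English here, while the lb templates in section 3 ask for 'positiv' or 'negativ'.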
    {
        "item_id": 3,
        "task_id": 1,
        "language_code": "lb",
        "input_text": "D'Iessen am Restaurant war super, an d'Personal war ganz frëndlech.",
        "expected_output": "positive",
    },
    {
        "item_id": 4,
        "task_id": 1,
        "language_code": "lb",
        "input_text": "De Film war ze laang an immens langweileg.",
        "expected_output": "negative",
    },
    # Factoid QA. English.
    {
        "item_id": 5,
        "task_id": 2,
        "language_code": "en",
        "input_text": "Which city is the capital of Armenia?",
        "expected_output": "Yerevan",
    },
    # Factoid QA. Luxembourgish (or your low resource language).
    {
        "item_id": 6,
        "task_id": 2,
        "language_code": "lb",
        "input_text": "Wéi eng Stad ass d'Haaptstad vun Armenien?",
        "expected_output": "Yerevan",
    },
]

items_df = pd.DataFrame(items)
items_df

Feel free to:

- Add more `item_id` rows for each `(task, language)`.  
- Change the texts to match domains relevant to your context.  
- Keep items short so that you can run experiments quickly.
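
Before moving on, it can help to confirm that every item points at a declared task and language, and to see how many items each `(task, language)` pair has. The sketch below uses small hypothetical tables (`demo_languages`, `demo_tasks`, `demo_items`) that mirror the column names used in this notebook; it is an illustration, not a required step.

```python
import pandas as pd

# Hypothetical miniature versions of the tables defined above.
demo_languages = pd.DataFrame({"code": ["en", "lb"]})
demo_tasks = pd.DataFrame({"task_id": [1, 2]})
demo_items = pd.DataFrame(
    [
        {"item_id": 1, "task_id": 1, "language_code": "en"},
        {"item_id": 2, "task_id": 1, "language_code": "lb"},
        {"item_id": 3, "task_id": 2, "language_code": "en"},
    ]
)

# Rows whose task_id or language_code is not declared in the configuration.
unknown_refs = demo_items[
    ~demo_items["task_id"].isin(demo_tasks["task_id"])
    | ~demo_items["language_code"].isin(demo_languages["code"])
]
print("Items with unknown task or language:", len(unknown_refs))

# Number of items per (task, language); pairs without items show up as 0.
coverage = (
    demo_items.groupby(["task_id", "language_code"])
    .size()
    .unstack(fill_value=0)
)
print(coverage)
```

A zero in the coverage table (here task 2 in `lb`) marks a pair for which you still need to write items.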


## 3. Prompt design: zero shot, few shot, and Chain of Thought

For each task and language, you will design three categories of prompts.

1. **Zero shot**. The model sees only the task instruction and the input.  
2. **Few shot**. The model sees the instruction plus a few example pairs (input and correct output).  
3. **Chain of Thought (CoT)**. The model sees an instruction that explicitly asks for step by step reasoning before the final answer.

You can define prompt *templates* that you will adapt to each input item.
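
Adapting a template to an item amounts to filling its `{input_text}` placeholder. A minimal sketch, assuming the template contains no other curly braces; `fill_template` and `demo_template` are hypothetical names introduced only for this illustration.

```python
# A template in the same style as the ones used in this section.
demo_template = (
    "Classify the sentiment of the following review as 'positive' or 'negative'.\n\n"
    "Review. {input_text}\n"
    "Sentiment."
)


def fill_template(template_text: str, input_text: str) -> str:
    """Return the prompt with the {input_text} placeholder replaced."""
    return template_text.format(input_text=input_text)


demo_prompt = fill_template(
    demo_template, "The movie was too long and incredibly boring."
)
print(demo_prompt)
```

The resulting string is what you would paste into the chat interface and later store as `prompt_text`.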


In [None]:
# Define base instruction templates for each task and language.
# These are generic instructions that you will adapt for zero shot, few shot, and CoT variants.

prompt_templates = [
    {
        "template_id": 1,
        "task_id": 1,
        "language_code": "en",
        "prompt_type": "zero_shot",
        "template_text": (
            "You are a helpful assistant.\n"
            "Classify the sentiment of the following review as 'positive' or 'negative'.\n\n"
            "Review. {input_text}\n"
            "Sentiment."
        ),
    },
    {
        "template_id": 2,
        "task_id": 1,
        "language_code": "en",
        "prompt_type": "cot",
        "template_text": (
            "You are a helpful assistant.\n"
            "First, briefly explain why the review is positive or negative.\n"
            "Then, on a new line, answer with 'positive' or 'negative'.\n\n"
            "Review. {input_text}\n"
            "Explanation and answer."
        ),
    },
    {
        "template_id": 3,
        "task_id": 1,
        "language_code": "lb",
        "prompt_type": "zero_shot",
        "template_text": (
            "Du bass en hëllefsbereeten Assistent.\n"
            "Klassifizéier d'Stëmmung vun dëser Kritik als 'positiv' oder 'negativ'.\n\n"
            "Kritik. {input_text}\n"
            "Stëmmung."
        ),
    },
    {
        "template_id": 4,
        "task_id": 1,
        "language_code": "lb",
        "prompt_type": "cot",
        "template_text": (
            "Du bass en hëllefsbereeten Assistent.\n"
            "Erklär kuerz firwat d'Kritik positiv oder negativ ass.\n"
            "Duerno, op enger neier Zeil, äntwere just mat 'positiv' oder 'negativ'.\n\n"
            "Kritik. {input_text}\n"
            "Erklärung an Äntwert."
        ),
    },
    {
        "template_id": 5,
        "task_id": 2,
        "language_code": "en",
        "prompt_type": "zero_shot",
        "template_text": (
            "You are a helpful assistant.\n"
            "Answer the question briefly and accurately.\n\n"
            "Question. {input_text}\n"
            "Answer."
        ),
    },
    {
        "template_id": 6,
        "task_id": 2,
        "language_code": "en",
        "prompt_type": "cot",
        "template_text": (
            "You are a helpful assistant.\n"
            "Think step by step to answer the question, then give a short final answer on a new line.\n\n"
            "Question. {input_text}\n"
            "Reasoning and final answer."
        ),
    },
    {
        "template_id": 7,
        "task_id": 2,
        "language_code": "lb",
        "prompt_type": "zero_shot",
        "template_text": (
            "Du bass en hëllefsbereeten Assistent.\n"
            "Beäntwer d'Fro kuerz a präzis.\n\n"
            "Fro. {input_text}\n"
            "Äntwert."
        ),
    },
    {
        "template_id": 8,
        "task_id": 2,
        "language_code": "lb",
        "prompt_type": "cot",
        "template_text": (
            "Du bass en hëllefsbereeten Assistent.\n"
            "Denke Schrëtt fir Schrëtt fir d'Fro ze beäntweren, an duerno ginn eng kuerz Äntwert op enger neier Zeil.\n\n"
            "Fro. {input_text}\n"
            "Iwwerleeung an Äntwert."
        ),
    },
]

templates_df = pd.DataFrame(prompt_templates)
templates_df

Few shot prompts are typically created by adding one or more example pairs above the current input.

You can construct few shot prompts manually in the experiments table later, or you can add explicit few shot templates with placeholders for examples.

For this tutorial, we keep few shot prompts flexible and let you write them directly in the experiments table.
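
If you prefer to assemble few shot prompts programmatically, one possible approach is sketched below. The helper `build_few_shot_prompt`, its parameters, and the two example reviews are hypothetical and exist only for illustration; the layout (instruction, solved examples, then the unsolved input) follows the description above.

```python
def build_few_shot_prompt(instruction, examples, input_text,
                          input_label="Review", output_label="Sentiment"):
    """Assemble instruction + solved examples + the unsolved input."""
    parts = [instruction, ""]
    for example_input, example_output in examples:
        parts.append(f"{input_label}. {example_input}")
        parts.append(f"{output_label}. {example_output}")
        parts.append("")  # blank line between examples
    parts.append(f"{input_label}. {input_text}")
    parts.append(f"{output_label}.")
    return "\n".join(parts)


# Made-up example pairs; replace them with examples in your target language.
demo_examples = [
    ("The hotel room was clean and quiet.", "positive"),
    ("The service was slow and the food was cold.", "negative"),
]

few_shot_prompt = build_few_shot_prompt(
    "Classify the sentiment of the following review as 'positive' or 'negative'.",
    demo_examples,
    "The restaurant was amazing, the food was fresh and the staff were friendly.",
)
shots_used = len(demo_examples)  # value to record in the experiments table
print(few_shot_prompt)
```

Keep the example pairs distinct from the items you evaluate, otherwise the model sees the answer it is being tested on.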


## 4. Running experiments and recording outputs

You will now define an **experiments table** where each row represents one model run.

Each row stores:

- Task and language.  
- Model name.  
- Prompting strategy (zero shot, few shot, CoT).  
- The exact prompt you used.  
- The model output.  
- Your ratings for correctness, fluency, and cultural appropriateness.

Start with a few example rows, then duplicate and edit them as you run more experiments.


In [None]:
# Define the schema for experiments with a small example.
# Replace the placeholder values as you run real experiments.

experiments = [
    {
        "experiment_id": 1,
        "task_id": 1,
        "language_code": "en",
        "item_id": 1,
        "model_name": "ModelA",          # e.g. "gpt4o", "claude", "llama3", etc.
        "prompt_type": "zero_shot",      # "zero_shot", "few_shot", or "cot"
        "shots_used": 0,                 # number of in context examples (0 for zero shot)
        "prompt_text": (
            "You are a helpful assistant.\n"
            "Classify the sentiment of the following review as 'positive' or 'negative'.\n\n"
            "Review. The restaurant was amazing, the food was fresh and the staff were friendly.\n"
            "Sentiment."
        ),
        "model_output": "[paste output here]",
        # Ratings (see section 5 for guidance).
        "correctness": None,               # 0 = incorrect, 1 = partially correct, 2 = fully correct
        "fluency": None,                   # 0 = poor, 1 = acceptable, 2 = very fluent
        "cultural_appropriateness": None,  # 0 = problematic, 1 = acceptable, 2 = very appropriate
        "notes": "",
    },
    # Add more rows here for other prompt types, models, languages, and items.
]

experiments_df = pd.DataFrame(experiments)
experiments_df

For each new experiment you run:

1. Duplicate an existing row in `experiments`.  
2. Update `experiment_id` to a new unique number.  
3. Set `task_id`, `language_code`, `item_id`, `model_name`, `prompt_type`, and `shots_used`.  
4. Copy the exact `prompt_text` you used.  
5. Paste the `model_output`.  
6. Fill in the rating fields after reading the output carefully.
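
As an alternative to duplicating rows by hand, the steps above can be wrapped in a small helper. This is only a sketch: `add_experiment` is a hypothetical function, and `demo_experiments` stands in for your own `experiments` list.

```python
def add_experiment(experiments, **fields):
    """Append one run to the list, assigning the next unused experiment_id."""
    next_id = max((row["experiment_id"] for row in experiments), default=0) + 1
    row = {
        "experiment_id": next_id,
        "shots_used": 0,
        "model_output": "",
        "correctness": None,
        "fluency": None,
        "cultural_appropriateness": None,
        "notes": "",
    }
    row.update(fields)  # caller-supplied values override the defaults
    experiments.append(row)
    return row


# Hypothetical stand-in for the `experiments` list defined above.
demo_experiments = [{"experiment_id": 1, "task_id": 1, "language_code": "en"}]

new_row = add_experiment(
    demo_experiments,
    task_id=1,
    language_code="lb",
    item_id=3,
    model_name="ModelA",
    prompt_type="cot",
    prompt_text="[exact prompt you used]",
    model_output="[paste output here]",
)
print(new_row["experiment_id"])
```

After adding rows this way, rebuild the table with `pd.DataFrame(experiments)` so that later cells see the new runs.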


## 5. Rating guidelines

Use the following scales when annotating each experiment.

### 5.1 Correctness (`correctness`)

- **2. fully correct**. The answer matches the expected label or answer, with no major errors.  
- **1. partially correct**. The answer is close but not perfect (for example correct label but with an incorrect explanation, or correct core fact with minor mistakes).  
- **0. incorrect**. The answer does not match the expected output or is clearly wrong.

### 5.2 Fluency (`fluency`)

- **2. very fluent**. The output is natural and grammatical in the target language.  
- **1. acceptable**. Some minor issues, but overall understandable.  
- **0. poor**. The output has many errors or is difficult to understand.

### 5.3 Cultural appropriateness (`cultural_appropriateness`)

- **2. very appropriate**. The output is respectful and fits the cultural context well.  
- **1. acceptable**. No serious issues, but maybe slightly awkward.  
- **0. problematic**. The output feels insensitive, inappropriate, or culturally misleading.

Use the `notes` field to document interesting cases (for example when CoT reasoning is correct in English but fails in your low resource language).
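
Because ratings are typed in by hand, a quick consistency check can catch values outside the 0 to 2 scale and rows that are still unrated. The sketch below uses a hypothetical `demo_ratings` table with the same rating columns as the experiments table.

```python
import pandas as pd

RATING_COLUMNS = ["correctness", "fluency", "cultural_appropriateness"]

# Hypothetical rows; the second one has an out-of-range score and a missing one.
demo_ratings = pd.DataFrame(
    [
        {"experiment_id": 1, "correctness": 2, "fluency": 2, "cultural_appropriateness": 1},
        {"experiment_id": 2, "correctness": 3, "fluency": 1, "cultural_appropriateness": None},
    ]
)

# Turn the ratings into numbers; None and non-numeric entries become NaN.
numeric = demo_ratings[RATING_COLUMNS].apply(pd.to_numeric, errors="coerce")

# A cell is invalid if it is filled in but is not one of 0, 1, 2.
invalid = numeric.notna() & ~numeric.isin([0, 1, 2])
missing = numeric.isna()

out_of_range_ids = demo_ratings.loc[invalid.any(axis=1), "experiment_id"].tolist()
unrated_ids = demo_ratings.loc[missing.any(axis=1), "experiment_id"].tolist()

print("Rows with out-of-range ratings:", out_of_range_ids)
print("Rows with missing ratings:", unrated_ids)
```

To check your own annotations, replace `demo_ratings` with `experiments_df`.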


## 6. Summarizing results

Once you have added several rows to `experiments` and re-run the cell in section 4 to rebuild `experiments_df`, you can compute simple summaries by language, model, and prompt type.

Run the cell below for an overview.


In [None]:
if len(experiments_df) == 0:
    print("experiments_df is empty. Please add some experiments first.")
else:
    rating_cols = ["correctness", "fluency", "cultural_appropriateness"]
    group_cols = ["model_name", "language_code", "prompt_type"]

    # Unrated rows hold None, which gives the columns object dtype.
    # Coerce to numbers (None becomes NaN) so that mean() works and skips them.
    scores = experiments_df.copy()
    scores[rating_cols] = scores[rating_cols].apply(pd.to_numeric, errors="coerce")

    summary = scores.groupby(group_cols)[rating_cols].mean()
    count = scores.groupby(group_cols)["experiment_id"].count().rename("num_examples")

    summary = summary.join(count)
    print("Average scores by model, language, and prompt type (scale 0-2):")
    display(summary)

### 6.1 Visualizing the effect of prompt type

If you have enough experiments, you can plot average correctness by prompt type for a given model and language.

Edit the variables in the cell below to match your setup.


In [None]:
# Configure which model and language you want to visualize.
target_model = "ModelA"   # change this to a real model name used in your experiments
target_language = "en"    # e.g. "lb" for your low resource language

subset = experiments_df[
    (experiments_df["model_name"] == target_model)
    & (experiments_df["language_code"] == target_language)
]

if len(subset) == 0:
    print("No experiments found for the chosen model and language.")
else:
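    rating_cols = ["correctness", "fluency", "cultural_appropriateness"]

    # Coerce ratings to numbers so that unrated rows (None) become NaN
    # instead of breaking mean().
    avg_by_prompt = (
        subset[rating_cols]
        .apply(pd.to_numeric, errors="coerce")
        .groupby(subset["prompt_type"])
        .mean()
    )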

    avg_by_prompt = avg_by_prompt.reindex(["zero_shot", "few_shot", "cot"]).dropna(how="all")

    if avg_by_prompt.empty:
        print("No data to plot. Check that you used 'zero_shot', 'few_shot', and 'cot' labels.")
    else:
        avg_by_prompt["correctness"].plot(kind="bar")
        plt.ylim(0, 2)
        plt.ylabel("Average correctness (0-2)")
        plt.title(f"Effect of prompt type on correctness ({target_model}, {target_language})")
        plt.tight_layout()
        plt.show()

You can duplicate and adapt this cell to compare:

- English vs your low resource language for the same model.  
- Different models for the same language and prompt type.
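
For the first comparison, a pivot table puts English and the low resource language side by side. The sketch below is one way to build it; the ratings in `demo_runs` are made up purely to show the table layout and are not results.

```python
import pandas as pd

# Hypothetical ratings, invented only to illustrate the layout.
demo_runs = pd.DataFrame(
    [
        {"model_name": "ModelA", "language_code": "en", "prompt_type": "zero_shot", "correctness": 2},
        {"model_name": "ModelA", "language_code": "en", "prompt_type": "cot", "correctness": 2},
        {"model_name": "ModelA", "language_code": "lb", "prompt_type": "zero_shot", "correctness": 1},
        {"model_name": "ModelA", "language_code": "lb", "prompt_type": "cot", "correctness": 0},
    ]
)

# One row per prompt type, one column per language, mean correctness in each cell.
comparison = demo_runs[demo_runs["model_name"] == "ModelA"].pivot_table(
    index="prompt_type",
    columns="language_code",
    values="correctness",
    aggfunc="mean",
)
print(comparison)
```

Calling `comparison.plot(kind="bar")` on this table draws one group of bars per prompt type, with one bar per language.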


## 7. Exporting your annotations

If you want to share your results or combine them with other groups, you can export the experiments table to a CSV file.


In [None]:
output_path = "cross_lingual_prompting_experiments.csv"
experiments_df.to_csv(output_path, index=False)
print(f"Experiments exported to {output_path}")
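
To combine exports from several groups, the CSV files can be read back and concatenated. The sketch below is an illustration under assumptions: the file names `group_a.csv` and `group_b.csv` and the extra `group` column are hypothetical, and the demo writes its own small files to a temporary folder so that it runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical exports from two groups, written to a temporary folder for the demo.
export_dir = Path(tempfile.mkdtemp())
pd.DataFrame([{"experiment_id": 1, "correctness": 2}]).to_csv(
    export_dir / "group_a.csv", index=False
)
pd.DataFrame([{"experiment_id": 1, "correctness": 0}]).to_csv(
    export_dir / "group_b.csv", index=False
)

frames = []
for csv_path in sorted(export_dir.glob("*.csv")):
    frame = pd.read_csv(csv_path)
    # experiment_id values may repeat across groups, so record the source file.
    frame["group"] = csv_path.stem
    frames.append(frame)

combined_df = pd.concat(frames, ignore_index=True)
print(combined_df)
```

The `group` column keeps rows traceable, since each group numbers its experiments independently.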

## 8. Reflection and small group discussion

Use the questions below for your final discussion.

1. **Prompt type effects.**  
   - In your experiments, did few shot prompting improve correctness compared to zero shot for your low resource language?  
   - Did Chain of Thought instructions help or confuse the model output in your language?

2. **Cross lingual differences.**  
   - Did the same prompt design work equally well in English and in your low resource language?  
   - Were there cases where reasoning was correct in English but failed when you translated the prompt?

3. **Fluency and cultural appropriateness.**  
   - Did CoT prompts degrade fluency or make outputs more verbose than needed?  
   - Were there outputs that felt culturally inappropriate or out of place in your language?

4. **Practical recommendations.**  
   - Based on your observations, what simple rules of thumb would you give to someone designing prompts for your language?  
   - Which combinations of prompt type, language, and model would you recommend for real applications?

You can use the markdown cell below to write down key points from your discussion.


### 8.1 Notes from your group

Use this space to summarize your main takeaways.

- What worked well (prompt types, models, languages).  
- What did not work well.  
- Surprising observations.  
- Open questions you would like to explore in more systematic studies.
