## Observation

The initial analysis showed that the meta-llama/Meta-Llama-3.1-8B-Instruct model outperformed all the other models on both Task A and Task B with multistage fallacy detection. Keeping this model fixed, we experimented with different prompts and compared their performance. All runs below use meta-llama/Meta-Llama-3.1-8B-Instruct with 3-stage fallacy detection.
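As a rough illustration of what multistage detection can look like, the sketch below runs one classification call per stage and merges the results. The stage grouping in `STAGES`, the `detect_multistage` helper, and the `fake_classifier` stand-in are all hypothetical: the notebook does not show how labels were split across the three stages or how stage outputs were combined, so this is only one plausible arrangement.

```python
from typing import Callable, Dict, List

# Hypothetical split of labels into three stages (not taken from the notebook).
STAGES: Dict[str, List[str]] = {
    "Stage 1": ["Ad-hominem", "Name-calling-or-labelling", "Doubt"],
    "Stage 2": ["Appeal-to-emotion", "Loaded-language", "Flag-waving"],
    "Stage 3": ["Strawman", "Red-herring", "False-dilemma"],
}


def detect_multistage(
    text: str,
    stages: Dict[str, List[str]],
    classify_stage: Callable[[str, str, List[str]], List[str]],
) -> List[str]:
    """Run each stage independently and merge the predicted labels."""
    found: List[str] = []
    for stage_name, stage_labels in stages.items():
        predicted = classify_stage(text, stage_name, stage_labels)
        for label in predicted:
            # Keep only labels that belong to the current stage, without duplicates.
            if label in stage_labels and label not in found:
                found.append(label)
    return found


def fake_classifier(text: str, stage_name: str, stage_labels: List[str]) -> List[str]:
    """Stand-in for the LLM call; returns canned predictions per stage."""
    canned = {
        "Stage 1": ["Doubt"],
        "Stage 2": ["Strawman"],              # out-of-stage label, gets dropped
        "Stage 3": ["Strawman", "Strawman"],  # duplicate, kept once
    }
    return canned[stage_name]


print(detect_multistage("some post", STAGES, fake_classifier))
```

In a real run, `classify_stage` would build the stage prompt, call the model, and parse the response.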

# Task A

## Prompt1

In [None]:
from typing import Dict, List

def create_stage_prompt(text: str, stage_labels: List[str],
                           examples: Dict[str, List[str]], 
                           stage_name: str,
                           max_examples: int = 3) -> str:
        
        label_descriptions = {
            "Ad-hominem": "Attacking the person instead of the argument",
            "Appeal-to-authority": "Using authority as evidence",
            "Appeal-to-emotion": "Manipulating emotions instead of using logic",
            "Causal-oversimplification": "Oversimplifying cause-effect relationships",
            "Cherry-picking": "Selecting only favorable evidence",
            "Circular-reasoning": "Conclusion is assumed in the premise",
            "Doubt": "Questioning credibility without evidence",
            "Evading-the-burden-of-proof": "Avoiding providing evidence",
            "False-analogy": "Comparing incomparable things",
            "False-dilemma": "Presenting only two options when more exist",
            "Flag-waving": "Appealing to nationalism/patriotism",
            "Hasty-generalization": "Drawing conclusions from insufficient evidence",
            "Loaded-language": "Using emotionally charged words",
            "Name-calling-or-labelling": "Using derogatory labels",
            "Red-herring": "Introducing irrelevant information",
            "Slippery-slope": "Claiming one thing will lead to extreme consequences",
            "Slogan": "Using catchy phrases instead of arguments",
            "Strawman": "Misrepresenting opponent's argument",
            "Thought-terminating-cliches": "Using clichés to stop critical thinking",
            "Vagueness": "Being intentionally unclear"
        }
        
        prompt = f"""You are an expert in identifying logical fallacies and propaganda techniques in text.

STAGE: {stage_name}

Your task is to identify which fallacy types from the list below are present in the given text.
ONLY consider the fallacies listed for this stage. A text can have multiple fallacies.

FALLACY TYPES FOR THIS STAGE:
"""
        for label in stage_labels:
            prompt += f"- {label}: {label_descriptions.get(label, 'Propaganda technique')}\n"
        
        prompt += "\n"
        if examples:
            prompt += "EXAMPLES:\n\n"
            example_posts = []
            for label in stage_labels:
                if label in examples and examples[label]:
                    example_posts.append((examples[label][0], label))
            
            import random
            if len(example_posts) > max_examples:
                random.seed(42)
                example_posts = random.sample(example_posts, max_examples)
            
            for idx, (example_text, label) in enumerate(example_posts, 1):
                prompt += f"Example {idx}:\n"
                prompt += f"Text: \"{example_text}\"\n"
                prompt += f"Contains: {label}\n\n"
        
        prompt += f"""Now analyze this text and identify which fallacies from the {stage_name} list are present:

TEXT: "{text}"

CRITICAL INSTRUCTIONS:
1. ONLY consider fallacies from the list above for this stage
2. Identify ALL applicable fallacies (can be multiple)
3. If no fallacies from this stage are found, return empty array
4. Be precise - only return labels that clearly apply

Output format: Return ONLY a JSON array of labels:
["Label1", "Label2"]

If no fallacies from this stage: []

Your response (JSON array only):"""
        
        return prompt

## Result
<pre>
            prec       rec       f1
a1          32.1      45.5     37.64
a2          34.11     45.48    38.98
avg         33.1      45.49    38.31
</pre>
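Prompt 1 asks for a bare JSON array, so the response still has to be turned into a list of labels. The helper below is a minimal sketch of one way to do that; `parse_label_array` is a hypothetical name, and the notebook does not show the parsing code that was actually used.

```python
import json
import re
from typing import List


def parse_label_array(response: str, stage_labels: List[str]) -> List[str]:
    """Extract the first JSON array from a response and keep known stage labels."""
    match = re.search(r"\[.*?\]", response, flags=re.DOTALL)
    if match is None:
        return []
    try:
        candidates = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    allowed = set(stage_labels)
    labels: List[str] = []
    for item in candidates:
        # Drop non-strings, labels outside this stage, and duplicates.
        if isinstance(item, str) and item in allowed and item not in labels:
            labels.append(item)
    return labels


print(parse_label_array('Sure! ["Slogan", "Ad-hominem", "Slogan"]', ["Slogan", "Doubt"]))
```

Filtering against `stage_labels` enforces the prompt's "ONLY consider fallacies from the list above" rule on the parsing side as well.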


## Prompt2

In [None]:
def create_stage_prompt(self, text: str, stage_labels: List[str], 
                           examples: Dict[str, List[str]], 
                           stage_name: str,
                           max_examples: int = 3) -> str:
        
        label_descriptions = {
            "Ad-hominem": "Attacking the person instead of the argument",
            "Appeal-to-authority": "Using authority as evidence",
            "Appeal-to-emotion": "Manipulating emotions instead of using logic",
            "Causal-oversimplification": "Oversimplifying cause-effect relationships",
            "Cherry-picking": "Selecting only favorable evidence",
            "Circular-reasoning": "Conclusion is assumed in the premise",
            "Doubt": "Questioning credibility without evidence",
            "Evading-the-burden-of-proof": "Avoiding providing evidence",
            "False-analogy": "Comparing incomparable things",
            "False-dilemma": "Presenting only two options when more exist",
            "Flag-waving": "Appealing to nationalism/patriotism",
            "Hasty-generalization": "Drawing conclusions from insufficient evidence",
            "Loaded-language": "Using emotionally charged words",
            "Name-calling-or-labelling": "Using derogatory labels",
            "Red-herring": "Introducing irrelevant information",
            "Slippery-slope": "Claiming one thing will lead to extreme consequences",
            "Slogan": "Using catchy phrases instead of arguments",
            "Strawman": "Misrepresenting opponent's argument",
            "Thought-terminating-cliches": "Using clichés to stop critical thinking",
            "Vagueness": "Being intentionally unclear"
        }
        
        prompt = f"""You are an expert in identifying logical fallacies and propaganda techniques.

TASK: Analyze the text below for {stage_name} fallacies. Think step-by-step.

FALLACY TYPES TO CONSIDER:
"""
        for label in stage_labels:
            prompt += f"- {label}: {label_descriptions.get(label, 'Propaganda technique')}\n"

        prompt += "\n"

        if examples:
            prompt += "EXAMPLES OF REASONING:\n\n"
            example_posts = list(examples.items())[:2]
            for idx, (label, texts) in enumerate(example_posts, 1):
                if texts:
                    prompt += f"Example {idx}:\n"
                    prompt += f"Text: \"{texts[0][:150]}...\"\n"
                    prompt += f"Reasoning: This text contains {label} because [brief explanation]\n"
                    prompt += f"Label: {label}\n\n"

        prompt += f"""TEXT TO ANALYZE: "{text}"

    INSTRUCTIONS:
    1. Read the text carefully
    2. For each fallacy type, ask: "Is this fallacy clearly present?"
    3. Provide brief reasoning for fallacies you identify
    4. List all applicable fallacies

    FORMAT YOUR RESPONSE AS:
    Reasoning: [Your step-by-step analysis]
    Labels: ["Label1", "Label2"] or []

    Your response:"""

        return prompt

## Result
<pre>
            prec       rec       f1
a1          12.92     60.66     21.3
a2          14.42     63.70     23.52
avg         13.67     62.18     22.41
</pre>
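Prompt 2 switches the output format to free-form reasoning followed by a `Labels:` line. A parser for that format might look like the sketch below. `parse_labels_line` is a hypothetical helper, not the notebook's own code. It reads only the array after the last `Labels:` marker, so labels that are merely mentioned in the reasoning are not counted as predictions.

```python
import json
import re
from typing import List


def parse_labels_line(response: str, stage_labels: List[str]) -> List[str]:
    """Read the JSON array that follows the last 'Labels:' marker."""
    matches = re.findall(r"Labels:\s*(\[.*?\])", response, flags=re.DOTALL)
    if not matches:
        return []
    try:
        candidates = json.loads(matches[-1])
    except json.JSONDecodeError:
        return []
    labels: List[str] = []
    for item in candidates:
        # Keep only known stage labels, once each.
        if isinstance(item, str) and item in stage_labels and item not in labels:
            labels.append(item)
    return labels


response = (
    'Reasoning: The author insults the speaker; "Doubt" does not apply.\n'
    'Labels: ["Ad-hominem", "Vagueness"]'
)
print(parse_labels_line(response, ["Ad-hominem", "Doubt"]))
```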


## Prompt3

In [None]:
def create_stage_prompt(self, text: str, stage_labels: List[str],
                        examples: Dict[str, List[str]],
                        stage_name: str,
                        max_examples: int = 3) -> str:

    label_descriptions = {
        "Ad-hominem": "Attacking the person instead of the argument",
        "Appeal-to-authority": "Using authority as evidence",
        "Appeal-to-emotion": "Manipulating emotions instead of using logic",
        "Causal-oversimplification": "Oversimplifying cause-effect relationships",
        "Cherry-picking": "Selecting only favorable evidence",
        "Circular-reasoning": "Conclusion is assumed in the premise",
        "Doubt": "Questioning credibility without evidence",
        "Evading-the-burden-of-proof": "Avoiding providing evidence",
        "False-analogy": "Comparing incomparable things",
        "False-dilemma": "Presenting only two options when more exist",
        "Flag-waving": "Appealing to nationalism/patriotism",
        "Hasty-generalization": "Drawing conclusions from insufficient evidence",
        "Loaded-language": "Using emotionally charged words",
        "Name-calling-or-labelling": "Using derogatory labels",
        "Red-herring": "Introducing irrelevant information",
        "Slippery-slope": "Claiming one thing will lead to extreme consequences",
        "Slogan": "Using catchy phrases instead of arguments",
        "Strawman": "Misrepresenting opponent's argument",
        "Thought-terminating-cliches": "Using clichés to stop critical thinking",
        "Vagueness": "Being intentionally unclear"
    }

    prompt = f"""You are an expert in identifying logical fallacies and propaganda techniques.

TASK: Analyze the text below for possible {stage_name} fallacies.

FALLACY TYPES TO CONSIDER:
"""
    for label in stage_labels:
        prompt += f"- {label}: {label_descriptions.get(label, 'Propaganda technique')}\n"

    prompt += "\n"

    if examples:
        prompt += "EXAMPLES (follow this exact output style):\n\n"
        count = 0
        for label, texts in examples.items():
            if count >= max_examples:
                break
            if not texts:
                continue  # skip labels with no example text (avoids IndexError on texts[0])
            example_text = texts[0].replace("\n", " ").strip()
            if len(example_text) > 200:
                example_text = example_text[:200] + "..."
            prompt += f"Example {count+1}:\n"
            prompt += f'Text: "{example_text}"\n'
            prompt += f'Reasoning: This text contains {label} because [short justification].\n'
            prompt += f'Labels: ["{label}"]\n\n'
            count += 1

    prompt += f"""TEXT TO ANALYZE:
"{text}"

INSTRUCTIONS:
1. Identify only the fallacies explicitly listed above.
2. Explain briefly why if you detect one.
3. If none, output an empty list.

FORMAT YOUR RESPONSE EXACTLY LIKE THIS:
Reasoning: [short explanation or "No fallacies found."]
Labels: ["Label1", "Label2"]  OR  Labels: []

Your response:
"""

    return prompt


## Result
<pre>
            prec       rec       f1
a1          24.65     42.64     31.24
a2          28.56     46.47     35.38
avg         26.61     44.56     33.31
</pre>


## Prompt4

In [None]:
def create_stage_prompt(self, text: str, stage_labels: List[str],
                        examples: Dict[str, List[str]],
                        stage_name: str,
                        max_examples: int = 3) -> str:

        label_descriptions = {
            "Ad-hominem": "Attacking the person instead of the argument",
            "Appeal-to-authority": "Using authority as evidence",
            "Appeal-to-emotion": "Manipulating emotions instead of using logic",
            "Causal-oversimplification": "Oversimplifying cause-effect relationships",
            "Cherry-picking": "Selecting only favorable evidence",
            "Circular-reasoning": "Conclusion is assumed in the premise",
            "Doubt": "Questioning credibility without evidence",
            "Evading-the-burden-of-proof": "Avoiding providing evidence",
            "False-analogy": "Comparing incomparable things",
            "False-dilemma": "Presenting only two options when more exist",
            "Flag-waving": "Appealing to nationalism/patriotism",
            "Hasty-generalization": "Drawing conclusions from insufficient evidence",
            "Loaded-language": "Using emotionally charged words",
            "Name-calling-or-labelling": "Using derogatory labels",
            "Red-herring": "Introducing irrelevant information",
            "Slippery-slope": "Claiming one thing will lead to extreme consequences",
            "Slogan": "Using catchy phrases instead of arguments",
            "Strawman": "Misrepresenting opponent's argument",
            "Thought-terminating-cliches": "Using clichés to stop critical thinking",
            "Vagueness": "Being intentionally unclear"
        }

        prompt = f"""You are a highly-attuned **Logical Fallacy and Propaganda Identifier**. Your sole function is to perform precise span-extraction for a specific set of fallacies.
##  STAGE: {stage_name}

Your task is to analyze the provided text and **extract all spans** that contain ONLY the logical fallacy types listed below for this stage.

**STRICT RULE:** You **MUST IGNORE** any fallacy type not present in the list below. Focus exclusively on the defined set.

### FALLACY TYPES FOR THIS STAGE:
"""
        for label in stage_labels:
            prompt += f"- **{label}**: {label_descriptions.get(label, 'Propaganda technique')}\n"

        if examples:
            prompt += "EXAMPLES (follow this exact output style):\n\n"
            count = 0
            for label, texts in examples.items():
                if count >= max_examples:
                    break
                if not texts:
                    continue  # skip labels with no example text (avoids IndexError on texts[0])
                example_text = texts[0].replace("\n", " ").strip()
                if len(example_text) > 200:
                    example_text = example_text[:200] + "..."
                prompt += f"Example {count+1}:\n"
                prompt += f'Text: "{example_text}"\n'
                prompt += f'Reasoning: This text contains {label} because [short justification].\n'
                prompt += f'Labels: ["{label}"]\n\n'
                count += 1

        prompt += f"""
##  INPUT TEXT

Analyze the following text for the **{stage_name}** fallacies:

TEXT: "{text}"

---

##  CRITICAL EXTRACTION INSTRUCTIONS:

1.  **SCOPE:** Identify **ALL** text spans exhibiting a fallacy from the **FALLACY TYPES FOR THIS STAGE** list.
2.  **EXCLUSIVITY:** **DO NOT** extract or label any other fallacy type (e.g., if 'Ad-hominem' is in the text but not in the list, you must ignore it).
3.  **PRECISION:** The extracted **"text"** must be the **exact, unedited substring** from the original input text.
4.  **SPAN SIZE:** A span can be a phrase, a single sentence, or a sequence of sentences. Multiple distinct spans can share the same label.

FORMAT YOUR RESPONSE EXACTLY LIKE THIS:
Reasoning: [short explanation or "No fallacies found."]
Labels: ["Label1", "Label2"]  OR  Labels: []

Your response:
"""

        return prompt

## Result
<pre>
            prec       rec       f1
a1          27.93     43.99     34.17
a2          31.36     46.47     37.45
avg         29.65     45.23     35.81
</pre>


### Performance Analysis

In [7]:
import pandas as pd
import altair as alt

# Average (a1/a2) Task A metrics per prompt, copied from the result tables above
df = pd.DataFrame({
    "Prompt": ["Prompt1", "Prompt2", "Prompt3", "Prompt4"],
    "prec": [33.1, 13.67, 26.61, 29.65],
    "rec":  [45.49, 62.18, 44.56, 45.23],
    "f1":   [38.31, 22.41, 33.31, 35.81],
})

df_melt = df.melt(id_vars="Prompt", var_name="Metric", value_name="Value")

chart = (
    alt.Chart(df_melt)
    .mark_line(point=True)
    .encode(
        x=alt.X("Prompt:N", title="Prompt",
                axis=alt.Axis(labelAngle=0)),   
        y=alt.Y("Value:Q", title="Score"),
        color="Metric:N",
        tooltip=["Prompt", "Metric", "Value"]
    )
    .properties(
        width=500,
        height=300,
        title="Metrics Across Prompts"
    )
)

chart

### Description 

The comparison across the four prompts shows that Prompt 1 achieves the highest F1 score (38.31),
giving the best overall balance between precision and recall. Even though its recall (45.49) is well
below that of Prompt 2 (62.18), Prompt 1 has by far the higher precision (33.1 vs. 13.67), which yields
the strongest combined performance.
Prompt 2 has very low precision, which pulls its F1 score down to 22.41 despite the highest recall.
Prompts 3 and 4 reach recall close to Prompt 1's (44.56 and 45.23) but lower precision (26.61 and 29.65), so they do not outperform Prompt 1.
Overall, Prompt 1 gives the best Task A results as measured by F1.
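The effect of low precision on F1 follows from F1 being the harmonic mean of precision and recall. The short sketch below recomputes F1 for the `a1` rows of Prompt 1 and Prompt 2 from the reported precision and recall; `f1_score` is a helper written for this illustration, not part of the evaluation script.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for annotator a1, taken from the Task A result tables.
rows = {
    "Prompt1 a1": (32.1, 45.5),
    "Prompt2 a1": (12.92, 60.66),
}

for name, (prec, rec) in rows.items():
    print(name, round(f1_score(prec, rec), 2))
# Prompt1 a1 37.64
# Prompt2 a1 21.3
```

Because the harmonic mean is dominated by the smaller of the two values, a recall of about 60 cannot compensate for a precision of about 13.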

# Task B

## Prompt1

In [None]:
def create_stage_prompt(self, text: str, stage_labels: List[str],
                           examples: Dict[str, List[str]],
                           stage_name: str,
                           max_examples_per_label: int = 3) -> str:
        
        label_descriptions = {
            "Ad-hominem": "Attacking the person instead of the argument",
            "Appeal-to-authority": "Using authority as evidence",
            "Appeal-to-emotion": "Manipulating emotions instead of using logic",
            "Causal-oversimplification": "Oversimplifying cause-effect relationships",
            "Cherry-picking": "Selecting only favorable evidence",
            "Circular-reasoning": "Conclusion is assumed in the premise",
            "Doubt": "Questioning credibility without evidence",
            "Evading-the-burden-of-proof": "Avoiding providing evidence",
            "False-analogy": "Comparing incomparable things",
            "False-dilemma": "Presenting only two options when more exist",
            "Flag-waving": "Appealing to nationalism/patriotism",
            "Hasty-generalization": "Drawing conclusions from insufficient evidence",
            "Loaded-language": "Using emotionally charged words",
            "Name-calling-or-labelling": "Using derogatory labels",
            "Red-herring": "Introducing irrelevant information",
            "Slippery-slope": "Claiming one thing will lead to extreme consequences",
            "Slogan": "Using catchy phrases instead of arguments",
            "Strawman": "Misrepresenting opponent's argument",
            "Thought-terminating-cliches": "Using clichés to stop critical thinking",
            "Vagueness": "Being intentionally unclear"
        }
        
        prompt = f"""You are an expert in identifying logical fallacies in text.

STAGE: {stage_name}

Your task is to extract text spans that contain ONLY the fallacy types listed below.
IGNORE all other fallacy types - focus only on this stage's fallacies.

FALLACY TYPES FOR THIS STAGE:
"""
        
        for label in stage_labels:
            prompt += f"- {label}: {label_descriptions.get(label, 'Propaganda technique')}\n"
        
        # Add examples
        if examples:
            prompt += "\nEXAMPLES:\n"
            for label in stage_labels:
                if label in examples and examples[label]:
                    prompt += f"\n{label}:\n"
                    for ex in examples[label][:max_examples_per_label]:
                        prompt += f'  - "{ex}"\n'
        
        prompt += f"""
Now extract spans containing ONLY the {stage_name} fallacies from this text:

TEXT: "{text}"

CRITICAL INSTRUCTIONS:
1. Extract ALL spans containing fallacies from the list above
2. IGNORE fallacies not in this stage's list
3. A span can be a phrase, sentence, or multiple sentences
4. Multiple spans can have the same label
5. Return exact text from the input

Output format (JSON only):
{{"spans": [{{"text": "exact text span", "label": "Label-name"}}]}}

If no fallacies from this stage: {{"spans": []}}

Your response (JSON only):"""
        
        return prompt

## Result
<pre>
[strict evaluation]
        prec    rec     f1
a1      20.03   10.7    13.95
a2      20.17   10.18   13.53
avg     20.1    10.44   13.74

[soft evaluation]
        prec    rec     f1
a1      27.19   14.73   19.11
a2      27.37   14.35   18.83
avg     27.28   14.54   18.97
</pre>
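The Task B prompts ask for a JSON object with a `spans` list and require each span to be exact text from the input. The sketch below shows one way these constraints could be enforced after generation. `parse_spans` is a hypothetical helper, and the notebook does not show the post-processing that was actually applied. It also tolerates extra prose around the JSON object.

```python
import json
from typing import Dict, List


def parse_spans(response: str, text: str, stage_labels: List[str]) -> List[Dict[str, str]]:
    """Decode the first JSON object in the response and keep only valid spans."""
    decoder = json.JSONDecoder()
    payload = None
    start = response.find("{")
    while start != -1:
        try:
            payload, _ = decoder.raw_decode(response, start)
            break
        except json.JSONDecodeError:
            start = response.find("{", start + 1)
    if not isinstance(payload, dict):
        return []
    spans = payload.get("spans")
    if not isinstance(spans, list):
        return []
    valid: List[Dict[str, str]] = []
    for span in spans:
        if not isinstance(span, dict):
            continue
        span_text, label = span.get("text"), span.get("label")
        # Keep spans that are exact substrings of the input and use a stage label.
        if isinstance(span_text, str) and span_text and span_text in text and label in stage_labels:
            valid.append({"text": span_text, "label": label})
    return valid


post = "They are all crooks. Trust me, everyone knows it."
reply = (
    'Here is the result: {"spans": ['
    '{"text": "They are all crooks.", "label": "Name-calling-or-labelling"}, '
    '{"text": "they are crooks", "label": "Name-calling-or-labelling"}, '
    '{"text": "Trust me", "label": "Slogan"}]}'
)
print(parse_spans(reply, post, ["Name-calling-or-labelling", "Doubt"]))
```

In the example, the second span is dropped because it is not an exact substring of the post, and the third because `Slogan` is not in the stage's label list.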


## Prompt2

In [None]:
def create_stage_prompt(self, text: str, stage_labels: List[str],
                        examples: Dict[str, List[str]],
                        stage_name: str,
                        max_examples_per_label: int = 3) -> str:

    label_descriptions = {
        "Ad-hominem": "Attacking the speaker instead of the argument",
        "Appeal-to-authority": "Claiming something is true just because an authority says so",
        "Appeal-to-emotion": "Manipulating feelings instead of providing logic",
        "Causal-oversimplification": "Reducing complex cause-effect relationships to a simple claim",
        "Cherry-picking": "Mentioning only evidence that supports a claim",
        "Circular-reasoning": "The conclusion is also used as a premise",
        "Doubt": "Casting suspicion without offering evidence",
        "Evading-the-burden-of-proof": "Making claims without backing them up",
        "False-analogy": "Comparing two unrelated things as if they are similar",
        "False-dilemma": "Claiming only two options exist when more do",
        "Flag-waving": "Appealing to patriotism or national identity",
        "Hasty-generalization": "Concluding based on very limited data",
        "Loaded-language": "Using emotionally charged language",
        "Name-calling-or-labelling": "Using insult or label instead of argument",
        "Red-herring": "Bringing up irrelevant information to distract",
        "Slippery-slope": "Claiming one event will definitely lead to a dramatic outcome",
        "Slogan": "Repeating a catchy phrase instead of reasoning",
        "Strawman": "Distorting someone’s argument to attack it more easily",
        "Thought-terminating-cliches": "Phrases that shut down critical thinking",
        "Vagueness": "Statements that are unclear and undefined"
    }

    prompt = f"""You are a fallacy detection expert.

STAGE: {stage_name}

GOAL:
Extract ONLY spans of text that contain one or more of the fallacy types listed below.
Do NOT mark anything else as a fallacy.

FALLACY TYPES IN THIS STAGE:
"""
    for label in stage_labels:
        prompt += f"- {label}: {label_descriptions.get(label, 'Fallacy type')}\n"

    # Few-shot examples section
    if examples:
        prompt += "\nREFERENCE EXAMPLES (Learn the style):\n"
        for label in stage_labels:
            if label in examples and examples[label]:
                prompt += f"\n{label} examples:\n"
                for ex in examples[label][:max_examples_per_label]:
                    clean = ex.replace("\n", " ").strip()
                    prompt += f' • "{clean}"\n'

    prompt += f"""
TEXT TO ANALYZE:
"{text}"

IMPORTANT RULES:
Extract spans ONLY if they fit a fallacy in this STAGE
Spans must be copied EXACTLY as written in the text
A text may contain multiple fallacies or none
Include a separate entry for each labeled span
Be precise — avoid over-long spans

EXPECTED OUTPUT FORMAT — JSON ONLY:
{{
  "spans": [
    {{
      "text": "exact span copied from text",
      "label": "Label-name"
    }}
  ]
}}

If no valid fallacies are found:
{{"spans": []}}

DO NOT OUTPUT ANYTHING OTHER THAN THE JSON.

Your response:
"""

    return prompt

## Result
<pre>
[strict evaluation]
        prec    rec     f1
a1      14.39   20.11   16.78
a2      13.45   19.83   16.02
avg     13.92   19.97   16.4

[soft evaluation]
        prec     rec     f1
a1      17.31   25.56   20.64
a2      16.82   26.5    20.58
avg     17.06   26.03   20.61
</pre>


## Prompt3

In [None]:
def create_stage_prompt(self, text: str, stage_labels: List[str],
                               examples: Dict[str, List[str]],
                               stage_name: str,
                               max_examples_per_label: int = 3) -> str:

        label_descriptions = {
            "Ad-hominem": "Attacking the speaker instead of the argument",
            "Appeal-to-authority": "Claiming something is true just because an authority says so",
            "Appeal-to-emotion": "Manipulating feelings instead of providing logic",
            "Causal-oversimplification": "Reducing complex cause-effect relationships to a simple claim",
            "Cherry-picking": "Mentioning only evidence that supports a claim",
            "Circular-reasoning": "The conclusion is also used as a premise",
            "Doubt": "Casting suspicion without offering evidence",
            "Evading-the-burden-of-proof": "Making claims without backing them up",
            "False-analogy": "Comparing two unrelated things as if they are similar",
            "False-dilemma": "Claiming only two options exist when more do",
            "Flag-waving": "Appealing to patriotism or national identity",
            "Hasty-generalization": "Concluding based on very limited data",
            "Loaded-language": "Using emotionally charged language",
            "Name-calling-or-labelling": "Using insult or label instead of argument",
            "Red-herring": "Bringing up irrelevant information to distract",
            "Slippery-slope": "Claiming one event will definitely lead to a dramatic outcome",
            "Slogan": "Repeating a catchy phrase instead of reasoning",
            "Strawman": "Distorting someone’s argument to attack it more easily",
            "Thought-terminating-cliches": "Phrases that shut down critical thinking",
            "Vagueness": "Statements that are unclear and undefined"

        }

        prompt = f"""You are an expert in identifying logical fallacies in text.

    STAGE: {stage_name}

    Target Fallacies:
    """
        for label in stage_labels:
            prompt += f"- {label}: {label_descriptions.get(label, 'Propaganda technique')}\n"

        if examples:
            prompt += "\nExamples:\n"
            for label in stage_labels:
                if label in examples and examples[label]:
                    prompt += f"\n{label}:\n"
                    for ex in examples[label][:max_examples_per_label]:
                        prompt += f'  - "{ex}"\n'

        prompt += f"""
    TEXT TO ANALYZE:
    "{text}"

    ANALYSIS PROCESS:
    1. First, read the entire text carefully
    2. For each sentence/phrase, ask: "Does this match any of the {len(stage_labels)} target fallacies?"
    3. If yes, identify which specific fallacy and extract the exact span
    4. Ignore any fallacies NOT in the target list above
    5. Compile all matches

    Think through your analysis step by step, then provide the final JSON output:

    {{"reasoning": "brief explanation of what you found", "spans": [{{"text": "exact span", "label": "Label-name"}}]}}

    Your response:"""

        return prompt

## Result
<pre>
[strict evaluation]
        prec     rec     f1
a1      17.5    19.84   18.6
a2      17.19   19.88   18.44
avg     17.34   19.86   18.52

[soft evaluation]
        prec     rec     f1
a1      21.85   26.13   23.8
a2      21.79   26.79   24.03
avg     21.82   26.46   23.92
</pre>


### Performance Analysis

In [8]:
import pandas as pd
import altair as alt

# Average (a1/a2) Task B metrics per prompt, soft evaluation only
df = pd.DataFrame({
    "Prompt": ["Prompt1", "Prompt2", "Prompt3"],
    "prec": [27.28, 17.06, 21.82],
    "rec":  [14.54, 26.03, 26.46],
    "f1":   [18.97, 20.61, 23.92],
})

df_melt = df.melt(id_vars="Prompt", var_name="Metric", value_name="Value")

chart = (
    alt.Chart(df_melt)
    .mark_line(point=True)
    .encode(
        x=alt.X("Prompt:N", title="Prompt",
                axis=alt.Axis(labelAngle=0)),
        y=alt.Y("Value:Q", title="Score"),
        color=alt.Color(
            "Metric:N",
            scale=alt.Scale(
                range=["#1f77b4", "#ff7f0e", "#2ca02c"]  
            ),
            title="Metric"
        ),
        tooltip=["Prompt", "Metric", "Value"]
    )
    .properties(
        width=500,
        height=300,
        title="Metrics Across Prompts"
    )
)

chart

### Description

The comparison across the three prompts for Task B (soft evaluation) highlights distinct trade-offs among precision, recall, and F1. Prompt 3 shows the strongest overall balance, achieving the highest F1 score (23.92) alongside the highest recall (26.46). Prompt 1 has the highest precision (27.28) but the lowest recall (14.54), so it extracts far fewer of the annotated spans and ends up with the lowest F1 score (18.97). Prompt 2 sits in between: its recall rises sharply compared to Prompt 1 (26.03), but this gain is partly offset by a drop in precision to 17.06, which keeps its F1 score (20.61) below that of Prompt 3.

Overall, Prompt 3 gives the best-balanced performance on Task B, followed by Prompt 2 and then Prompt 1. The strict evaluation yields the same F1 ranking (18.52, 16.4, and 13.74, respectively).
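The F1 ranking can be checked directly against the averaged values in the Task B result tables. The snippet below is a small sketch that sorts the prompts by average F1 under both evaluations; `rank_by_f1` is a helper introduced here for illustration.

```python
from typing import Dict, List

# Average F1 per prompt, copied from the Task B result tables above.
strict_f1: Dict[str, float] = {"Prompt1": 13.74, "Prompt2": 16.4, "Prompt3": 18.52}
soft_f1: Dict[str, float] = {"Prompt1": 18.97, "Prompt2": 20.61, "Prompt3": 23.92}


def rank_by_f1(scores: Dict[str, float]) -> List[str]:
    """Return prompt names ordered from highest to lowest F1."""
    return sorted(scores, key=scores.get, reverse=True)


print("strict:", rank_by_f1(strict_f1))
print("soft:  ", rank_by_f1(soft_f1))
```

Both evaluations give the same order, so the choice between strict and soft matching does not change which prompt comes out ahead.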