After playing with Qwen2.5 14B at the suggestion of Gemini, I figured i should write some first benchmarks. (not AI generated code, unlike the first play with Qwen)

In [1]:
import json
import ollama

In [3]:
SYSTEM_PROMPT = """
You are an expert logician and formal argumentation extractor specializing in the ASPIC+ framework. 

Your sole objective is to analyze natural language text and decompose it into a structured Argumentation Framework. 

You must output ONLY valid, parsable JSON matching the exact schema below. Do not include markdown code blocks (e.g., ```json), conversational filler, explanations, or preamble. Start exactly with '{' and end exactly with '}'.

### JSON SCHEMA
{
  "atoms": [
    {"id": "a1", "text": "<Minimal, atomic logical proposition>"}
  ],
  "rules": [
    {"id": "r1", "type": "<strict|defeasible>", "premises": ["<atom_id>"], "conclusion": "<atom_id>"}
  ],
  "arguments": [
    {"id": "arg1", "premises": ["<atom_id>"], "applied_rules": ["<rule_id>"], "conclusion": "<atom_id>", "sub_arguments": ["<arg_id>"]}
  ],
  "attacks": [
    {"from_arg": "<arg_id>", "to_arg": "<arg_id>", "type": "<rebut|undercut|undermine>"}
  ]
}

### ASPIC+ DEFINITIONS & EXTRACTION RULES

1. ATOMS (Propositions):
   - Break text into the smallest possible, independent logical propositions.
   - Resolve pronouns (e.g., convert "He lied" to "John lied" if John is the subject).
   - Each distinct entity, claim, or fact must be a separate atom.
   - Do NOT merge claims. 

2. RULES (Inferences):
   - "strict": Deductive certainty, logical necessity, or physical laws (e.g., "If x is a bachelor, x is unmarried").
   - "defeasible": Presumptive, typical, or fallible reasoning (e.g., "If x is a bird, x usually flies", "If witness says X, X is true").
   - A rule maps a set of premise atoms to a conclusion atom. 
   - Extract enthymemes (implicit rules) ONLY if strictly necessary to connect stated claims.
   - Crucial: If a conclusion requires multiple facts to be true simultaneously, list ALL those facts in a single 'premises' array for one rule.

3. ARGUMENTS (Tree Structure):
   - An argument applies rules to premises to reach a conclusion.
   - RECURSION: Arguments are tree-like. If an argument uses the conclusion of another argument as its premise, you MUST include the parent argument's ID in the "sub_arguments" array.
   - Base arguments (relying only on direct atoms) have an empty "sub_arguments" list.

4. ATTACKS (Defeats):
   - "rebut": Argument A attacks the CONCLUSION of Argument B (Arg A claims 'not X', Arg B claims 'X'). Only applies if the targeted conclusion comes from a defeasible rule.
   - "undercut": Argument A attacks the RULE APPLICATION / INFERENCE of Argument B (e.g., Arg B uses a witness; Arg A points out the witness is blind. The premise isn't attacked, the conclusion isn't directly attacked, but the link is broken).
   - "undermine": Argument A attacks a PREMISE (atom) used in Argument B.

### STRICT FORMATTING CONSTRAINTS
- IDs must be deterministic and strictly formatted: a1, a2... r1, r2... arg1, arg2...
- Missing elements: If no rules, arguments, or attacks exist, return empty arrays: [].
- NEVER invent information not explicitly stated or logically necessitated by the text.
- JSON key names must perfectly match the schema.

### FEW-SHOT EXAMPLE
Input Text:
"John says the car is red, so the car is red. However, John is colorblind. Also, the car registration says it is blue, meaning it cannot be red."

Expected JSON Output:
{
  "atoms": [
    {"id": "a1", "text": "John says the car is red"},
    {"id": "a2", "text": "The car is red"},
    {"id": "a3", "text": "John is colorblind"},
    {"id": "a4", "text": "The car registration says it is blue"},
    {"id": "a5", "text": "The car is blue"},
    {"id": "a6", "text": "The car cannot be red"}
  ],
  "rules": [
    {"id": "r1", "type": "defeasible", "premises": ["a1"], "conclusion": "a2"},
    {"id": "r2", "type": "defeasible", "premises": ["a4"], "conclusion": "a5"},
    {"id": "r3", "type": "strict", "premises": ["a5"], "conclusion": "a6"}
  ],
  "arguments": [
    {"id": "arg1", "premises": ["a1"], "applied_rules": ["r1"], "conclusion": "a2", "sub_arguments": []},
    {"id": "arg2", "premises": ["a3"], "applied_rules": [], "conclusion": "a3", "sub_arguments": []},
    {"id": "arg3", "premises": ["a4"], "applied_rules": ["r2"], "conclusion": "a5", "sub_arguments": []},
    {"id": "arg4", "premises": [], "applied_rules": ["r3"], "conclusion": "a6", "sub_arguments": ["arg3"]}
  ],
  "attacks": [
    {"from_arg": "arg2", "to_arg": "arg1", "type": "undercut"},
    {"from_arg": "arg4", "to_arg": "arg1", "type": "rebut"}
  ]
}

### LOGIC CHECK - VERIFY BEFORE OUTPUT:
1. Every 'conclusion' in the 'arguments' list must be a text-id from the 'atoms' list.
2. Every 'sub_argument' must be a valid 'arg_id' defined elsewhere in the 'arguments' list.
3. If Arg A attacks Arg B, identify EXACTLY what is being attacked:
   - If A contradicts B's conclusion: type = "rebut".
   - If A attacks a premise in B: type = "undermine".
   - If A attacks the rule linking B's premise to its conclusion: type = "undercut".
4. Ensure no argument is its own premise (Avoid ID circularity: arg1 cannot depend on arg1).

PROCESS THE FOLLOWING TEXT:
"""

Before fine-tuning Qwen I shall define some metrics to test the stock model, so that we can compare it to the fine-tuned version, and test it's reliability.

The 5 (and their corresponding derived versions) are generated by Gemini and ChatGPT.

I'll check variance by determining the change between the n outputs of the same prompt, the stability with the structural change between subtly changed prompts, and correctness aganist a predefined output.

In [4]:
from collections import Counter

def get_structure(json_data):
    try:
        data = json.loads(json_data)
        return {
            "atoms_count": len(data.get("atoms", [])),
            "rules_count": len(data.get("rules", [])),
            "strict_rules": len([r for r in data.get("rules", []) if r.get("type") == "strict"]),
            "defeasible_rules": len([r for r in data.get("rules", []) if r.get("type") == "defeasible"]),
            "attacks_count": len(data.get("attacks", [])),
            "attack_types": Counter([a.get("type") for a in data.get("attacks", [])])
        }
    except:
        return "Invalid JSON"

In [5]:
prompts=[]
with open("data/variance/prompts.txt", "r") as f:
    prompts = f.read()
for i in prompts.split("@"):
    outputs=[]
    print("Prompt:", i)
    for j in range(5):
        response = ollama.chat(
            model="qwen2.5:14b",
            format="json",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": i}
            ],
            options={"temperature": 0}
        )
        outputs.append(response["message"]["content"].strip())
        print("Output", j+1, "processed.")
    for k in outputs:
        print(get_structure(k))
    

Prompt: John is guilty of trespassing because he entered the private property without permission. However, John claims he was invited by the owner, which means he is not guilty of trespassing.

Output 1 processed.
Output 2 processed.
Output 3 processed.
Output 4 processed.
Output 5 processed.
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
Prompt: 
The sec

In [6]:
prompts=[]
with open("data/stability/prompts.txt", "r") as f:
    prompts = f.read()
ct=1
for i in prompts.split(";"):
    outputs=[]
    print("Prompt set ", ct, "\n")
    ct2=1
    for j in i.split("@"):
        response = ollama.chat(
            model="qwen2.5:14b",
            format="json",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": j}
            ],
            options={"temperature": 0}
        )
        outputs.append(response["message"]["content"].strip())
        print("Output", (ct-1)*5+ct2, "processed.")
        ct2+=1
    ct+=1
    for k in outputs:
        print(get_structure(k))
    

Prompt set  1 

Output 1 processed.
Output 2 processed.
Output 3 processed.
Output 4 processed.
Output 5 processed.
{'atoms_count': 5, 'rules_count': 3, 'strict_rules': 2, 'defeasible_rules': 1, 'attacks_count': 2, 'attack_types': Counter({'undercut': 1, 'rebut': 1})}
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
{'atoms_count': 4, 'rules_count': 1, 'strict_rules': 1, 'defeasible_rules': 0, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
{'atoms_count': 4, 'rules_count': 1, 'strict_rules': 1, 'defeasible_rules': 0, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
Prompt set  2 

Output 6 processed.
Output 7 processed.
Output 8 processed.
Output 9 processed.
Output 10 processed.
{'atoms_count': 4, 'rules_count': 2, 'strict_rules': 0, 'defea

In [7]:
prompts=[]
valid_outputs=[]
with open("data/correctness/prompts.txt", "r") as f:
    prompts = f.read()
with open("data/correctness/outputs.txt", "r") as f:
    valid_outputs = f.read()
ct=0
for i in prompts.split("@"):
    print("Prompt:", i)
    response = ollama.chat(
        model="qwen2.5:14b",
        format="json",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": i}
        ],
        options={"temperature": 0}
    )
    print("Computed output: ",get_structure(response["message"]["content"].strip()))
    print("Valid output: ", get_structure(valid_outputs.split("@")[ct]),"\n")

Prompt: John is guilty of trespassing because he entered the private property without permission. However, John claims he was invited by the owner, which means he is not guilty of trespassing.

Computed output:  {'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
Valid output:  {'atoms_count': 4, 'rules_count': 2, 'strict_rules': 0, 'defeasible_rules': 2, 'attacks_count': 1, 'attack_types': Counter({'rebut': 1})}
Prompt: 
The security footage shows a person in a red hat stealing the bike, so the thief wore a red hat. But the camera was low-resolution and the colors are distorted, so the footage cannot reliably identify the color of the hat.

Computed output:  {'atoms_count': 4, 'rules_count': 2, 'strict_rules': 1, 'defeasible_rules': 1, 'attacks_count': 1, 'attack_types': Counter({'undercut': 1})}
Valid output:  {'atoms_count': 4, 'rules_count': 2, 'strict_rules': 0, 'defeasible_rules': 2, 'attacks_co

All of the data and results can be found in the corresponding directory in the data folder.