# Finetuned Model QWK Calculation and Scoring
## PERSUADE 2.0 Dataset Notebook

This notebook computes the quadratic weighted kappa (QWK) for the baseline Gemma 3 model and for the finetuned checkpoint. These results were used in the project report. Both models are loaded with Hugging Face libraries rather than Ollama.
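
For reference, QWK penalizes each disagreement by the squared distance between the two scores and compares the total against what chance agreement would produce. The following is a minimal pure-Python sketch of that idea, assuming integer scores on the 1 to 6 scale; the function name `quadratic_weighted_kappa` is hypothetical, and the notebook itself relies on scikit-learn's `cohen_kappa_score` with `weights="quadratic"`. Because this sketch fixes the score range up front, its result may differ slightly from scikit-learn's when some score levels never occur in the data.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating=1, max_rating=6):
    """Cohen's kappa with quadratic weights for two equal-length lists of integer ratings."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    if any(not (min_rating <= r <= max_rating) for r in list(rater_a) + list(rater_b)):
        raise ValueError("ratings must lie within [min_rating, max_rating]")

    n = len(rater_a)
    k = max_rating - min_rating + 1
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts per rater
    hist_b = Counter(rater_b)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected under independence
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # both raters gave one identical constant score; kappa is undefined
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.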

### Calculate QWK of Baseline Gemma 3 and Finetuned Checkpoint

In [1]:
import pandas as pd

df = pd.read_csv('data/slim_persuade_test.csv')
df = df.drop_duplicates()
df

Unnamed: 0,essay_id_comp,full_text,holistic_essay_score
0,BC75783F96E3,This essay will explain if drivers should or s...,4
12,74C8BC7417DE,Driving while the use of cellular devices\n\nT...,2
17,97C1CFD04E4B,Cell phone use should not be legal while drivi...,4
28,2CE1FE38D0E7,Phones and Driving\n\nDriving is a good way to...,5
47,30A8FB981469,PHONES AND DRIVING\n\nIn this world in which w...,4
...,...,...,...
112045,18409261F5C2,80% of Americans believe seeking multiple opin...,5
112060,D46BCB48440A,"When people ask for advice,they sometimes talk...",4
112074,0FB0700DAF44,"During a group project, have you ever asked a ...",4
112088,D72CB1C11673,Making choices in life can be very difficult. ...,4


In [2]:
# strings to be used in building the evaluation prompt
RUBRIC = """
- SCORE OF 6: An essay in this category demonstrates clear and consistent mastery, although it may have a few minor errors. A typical essay effectively and insightfully develops a point of view on the issue and demonstrates outstanding critical thinking, using clearly appropriate examples, reasons, and other evidence to support its position; the essay is well organized and clearly focused, demonstrating clear coherence and smooth progression of ideas; the essay exhibits skillful use of language, using a varied, accurate, and apt vocabulary and demonstrates meaningful variety in sentence structure; the essay is free of most errors in grammar, usage, and mechanics.
- SCORE OF 5: An essay in this category demonstrates reasonably consistent mastery, although it will have occasional errors or lapses in quality. A typical essay effectively develops a point of view on the issue and demonstrates strong critical thinking, generally using appropriate examples, reasons, and other evidence to support its position; the essay is well organized and focused, demonstrating coherence and progression of ideas; the essay exhibits facility in the use of language, using appropriate vocabulary demonstrates variety in sentence structure; the essay is generally free of most errors in grammar, usage, and mechanics.
- SCORE OF 4: An essay in this category demonstrates adequate mastery, although it will have lapses in quality. A typical essay develops a point of view on the issue and demonstrates competent critical thinking, using adequate examples, reasons, and other evidence to support its position; the essay is generally organized and focused, demonstrating some coherence and progression of ideas exhibits adequate; the essay may demonstrate inconsistent facility in the use of language, using generally appropriate vocabulary demonstrates some variety in sentence structure; the essay may have some errors in grammar, usage, and mechanics.
- SCORE OF 3: An essay in this category demonstrates developing mastery, and is marked by ONE OR MORE of the following weaknesses: develops a point of view on the issue, demonstrating some critical thinking, but may do so inconsistently or use inadequate examples, reasons, or other evidence to support its position; the essay is limited in its organization or focus, or may demonstrate some lapses in coherence or progression of ideas displays; the essay may demonstrate facility in the use of language, but sometimes uses weak vocabulary or inappropriate word choice and/or lacks variety or demonstrates problems in sentence structure; the essay may contain an accumulation of errors in grammar, usage, and mechanics.
- SCORE OF 2: An essay in this category demonstrates little mastery, and is flawed by ONE OR MORE of the following weaknesses: develops a point of view on the issue that is vague or seriously limited, and demonstrates weak critical thinking, providing inappropriate or insufficient examples, reasons, or other evidence to support its position; the essay is poorly organized and/or focused, or demonstrates serious problems with coherence or progression of ideas; the essay displays very little facility in the use of language, using very limited vocabulary or incorrect word choice and/or demonstrates frequent problems in sentence structure; the essay contains errors in grammar, usage, and mechanics so serious that meaning is somewhat obscured.
- SCORE OF 1: An essay in this category demonstrates very little or no mastery, and is severely flawed by ONE OR MORE of the following weaknesses: develops no viable point of view on the issue, or provides little or no evidence to support its position; the essay is disorganized or unfocused, resulting in a disjointed or incoherent essay; the essay displays fundamental errors in vocabulary and/or demonstrates severe flaws in sentence structure; the essay contains pervasive errors in grammar, usage, or mechanics that persistently interfere with meaning.
"""

FEW_SHOT_EVAL = """
Example essay 1 of score "4":\n
"phones and driving\n\nin this world in which we live in, cell phones are a growing market as well as cars. the fact that we depend on cell phones throught the course of our day for numerous reasons.\n\ndrivers should be able to use their cell phones while driving because its easier to operate a phone in your hand than a cell phone that is not in your hand. emergencies can occur while driving and you need to report the emergerncy while driving. cell phones does not cause as big of a distrachon that other things that is done while driving.\n\na lot of people are in an uproar about driving and using a cell phone and somehmes it is overrated. trying to operate a cell phone with one hand while maintaining driving has no danger to it. a phone that is mounted on the holder that begins to ring is harder to operate answering, dialing and switching calls.\n\nmost hmes the mounhng is not secure or has defected parts which can cause more of an issue than having it to your ear. most people have older cars and cannot a? ord to upgrade and dont have the speaker ophon, which usually allows everyone surrounding your car to know you business.\n\ni cannot say it enough, emergencies happen in a ?ash. driving and you see a accident that needs emergency alenhon you need to be able to use the cell phone to call it in. things happen so quickly now and it could be your phone call that saves their life. also if a person is lost and no gps signal but calls can be made a person should be able to dial someone by hand and get direchons out of harms way. emergency is a big deal in cell phone usage.\n\ncell phones do not cause as much as a distrachon than people make out to be. as a driver a person should always be aware of the road and be able to mulh-task when using a cell phone. most hmes its other things besides a cell phone that cause a distrachon but blames the cell phone as a scapegoat. 
there could be test on the driver liscence test that we should take to see if were able to drive and talk on the phone instead of just banning it.\n\ntalking on a cell phone and driving a car may be a distrachon for some but not all. it should be our call on if we are focused enough to drive while talking. although some people are said to have had bad behavior while talking and driving some are very responsible. texhng and driving is di? erent from talking and driving and that should be the boundary. each person should be accountable for their achons just as speeding."\n\n
Example Essay 2 of score "3":
"phones\n\nmodern humans today are always on their phone. they are always on their phone more than 5 hours a day no stop .all they do is text back and forward and just have group chats on social media. they even do it while driving. they are some really bad consequences when stuff happens when it comes to a phone. some certain areas in the united states ban phones from class rooms just because of it.\n\nwhen people have phones, they know about certain apps that they have .apps like facebook twitter instagram and snapchat. so like if a friend moves away and you want to be in contact you can still be in contact by posting videos or text messages. people always have different ways how to communicate with a phone. phones have changed due to our generation.\n\ndriving is one of the way how to get around. people always be on their phones while doing it. which can cause serious problems. that's why there's a thing that's called no texting while driving. that's a really important thing to remember. some people still do it because they think it's stupid. no matter what they do they still have to obey it because that's the only way how did he save.\n\nsometimes on the news there is either an accident or a suicide. it might involve someone not looking where they're going or tweet that someone sent. it either injury or death. if a mysterious number says i'm going to kill you and they know where you live but you don't know the person's contact ,it makes you puzzled and make you start to freak out. which can end up really badly.\n\nphones are fine to use and it's also the best way to come over help. if you go through a problem and you can't find help you ,always have a phone there with you. even though phones are used almost every day as long as you're safe it would come into use if you get into trouble. make sure you do not be like this phone while you're in the middle of driving. the news always updated when people do something stupid around that involves their phones. 
the safest way is the best way to stay safe."\n\n
Example Essay 3 of score "2":\n
"should people drive with their cellphone or not?\n\nwe all love our phones now a days and basically ever since they were invented. everyone has one or wants one or the newest latest one .ever since they were invented in 1700s people have taken the full advantage of the telephone presidents and war sargents used them to win battles and etc. now the phone has evolved from tube looking phones to brick phones to flip phones to iphones and to smartphones. with all the new technology and apps like facetime snapchat3 games music social media there's almost no way anyone could ever put their phones down. but according to statistics in 2013 about 3,154 people died in accidents 424,000 were injured. in 2013, 10% of all drivers ages 15 to 19 involved in fatal accidents were reported to be distracted at the time of the crash.\n\nso the question stands do we need to drive while on your cellphone? with all the crashes maybe we should drive without our phones but with the increments in technology we mind as well just use them cars are able to call people just like phones, so if you can call people in your car and be distracted anyways when your calling whom ever. there putting almost smartphones in the car so you can most defiantly be distracted with that. the fact that they are and already are making cars able to drive by themselves is anther reason that is very distracting just as in tulsa the electrically powered car has pilot mode so it drives itself. so many people have been distracted from the road and driving because the car can drive itself even though the car can drive itself you still need to pay attention to the road.\n\ni do understand that you don't need to be looking down when driving and i do understand that phones are distracting. so you picking up a cellphone and calling someone can be distracting. also like i said in another essay i did there is a responsibility factor to firstly own a car and secondly to even own a phone. 
it might seem like it is not a responsibility owning a phone but it is when you own a phone you have personal information in it. if you were lose your phone and have never put a password or security code on it you can lose your information to whoever has your phone could be a hacker could be a random thieve.\n\nso really my answer is that it really depends on your responsibility level if you can't stay off your phone and it's life then you need to let someone drive but if you can be responsible while driving then i'd say you should us your phone responsibly. cause if you drive irresponsibly you should have your phone."\n
"""

# builds the evaluation prompt to be fed to the LLM evaluator
def build_eval_prompt(text):
    return (
        f"Read and evaluate the essay: \n\n{FEW_SHOT_EVAL}\n"
        f"Essay to score:\n{text}\n\n"
        f"Assign it a score from 1 to 6, in increments of 1, based on this rubric:\n\n{RUBRIC}\n\n"
        f"Your response should be only a numeric value representing the score you gave."
    )

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_ID   = "google/gemma-3-12b-it"
ADAPTER   = "checkpoints/gemma3/checkpoint-2000"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_ID, use_fast=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

def load_model(base_only: bool):
    m = AutoModelForCausalLM.from_pretrained(
        BASE_ID,
        quantization_config=bnb,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    if not base_only:
        m = PeftModel.from_pretrained(m, ADAPTER, device_map="auto")
    m.eval()
    m.config.use_cache = True
    return m

# load one model at a time: uncomment the line for the model you want to evaluate and comment out the other
#base_model = load_model(base_only=True)
tuned_model = load_model(base_only=False)


In [7]:
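SYSTEM = "You are an essay rater specializing in the evaluation of essays written by students from 6th to 12th grade."

# wraps the evaluation prompt in the instruction template used for scoring
def prompt(txt):
    return f"<s>[INST] {SYSTEM} [/INST]\n{txt.strip()}\n</s>"

# generates a single new token (the score) for one evaluation prompt and returns it as text
@torch.inference_mode()
def score_fn(model, text):
    toks = tokenizer(prompt(text), return_tensors="pt").to(model.device)
    # near-zero temperature keeps decoding effectively greedy (it only applies when sampling is enabled)
    out  = model.generate(**toks, max_new_tokens=1, temperature=0.0001,
                          pad_token_id=tokenizer.eos_token_id,
                          eos_token_id=tokenizer.eos_token_id)
    # decode only the newly generated token, skipping the prompt tokens
    ans  = tokenizer.decode(out[0][toks['input_ids'].shape[1]:],
                            skip_special_tokens=True).strip()
    return ans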
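
Since `score_fn` returns raw text, the caller has to turn it into a number. A small parsing helper is one way to make that step tolerant of stray whitespace or a short prefix. This is a sketch only: `parse_score` is a hypothetical name, not part of the notebook, and it assumes the first run of digits in the response is the intended score.

```python
import re


def parse_score(response, low=1, high=6):
    """Return the first run of digits in `response` as an int if it lies in [low, high], else None."""
    match = re.search(r"\d+", response or "")
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


for raw in ["4", " 5\n", "Score: 3", "7", "n/a", ""]:
    print(repr(raw), "->", parse_score(raw))
```

Returning `None` for anything outside the rubric's range mirrors how invalid responses are recorded elsewhere in this notebook.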

In [20]:
# function to evaluate all essays from a df; returns a scored copy with a "score_llm" column
def evaluate_essays(df, model):
    df = df.copy()  # work on a copy so separate runs do not overwrite each other's scores
    scores = []

    for i, (idx, row) in enumerate(df.iterrows(), start=1):
        eval_prompt = build_eval_prompt(row['full_text'])
        # human score, used for logging only; the test split stores it as holistic_essay_score
        score_og = row.get('score_og', row.get('holistic_essay_score'))

        try:
            print("----------------")
            print("evaluating essay...")
            response = score_fn(model, eval_prompt)
            score = int(response.strip())

            if 1 <= score <= 6:
                # score is within the rubric's range
                print(f"[{idx}] evaluated. ({i}/{len(df)}) score: {score}. score_og: {score_og}")
                scores.append(score)
            else:
                print(f"invalid score from LLM at idx {idx}: {response}")
                scores.append(None)
        except Exception as e:
            print(f"error scoring essay at idx {idx}: {e}")
            scores.append(None)

    df["score_llm"] = scores

    return df
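
Because `evaluate_essays` records `None` whenever the model's response cannot be used, the `score_llm` column may contain missing values, and these become `NaN` after a round trip through CSV. Kappa is only defined over paired scores, so one option is to drop incomplete rows first. The sketch below assumes the column names used in this notebook; `valid_score_pairs` is a hypothetical helper, and the tiny frame is made-up illustration data rather than real results.

```python
import pandas as pd


def valid_score_pairs(scored_df, human_col="holistic_essay_score", llm_col="score_llm"):
    """Keep only rows where both scores are present and return the two columns as ints."""
    pairs = scored_df[[human_col, llm_col]].dropna()
    return pairs.astype(int)


# illustration data only: the second essay has no usable model score
example = pd.DataFrame({
    "holistic_essay_score": [4, 2, 5],
    "score_llm": [4, None, 5],
})
pairs = valid_score_pairs(example)
print(f"kept {len(pairs)} of {len(example)} rows")
```

Reporting how many rows were dropped alongside the kappa keeps the comparison between models honest, since each model may fail on different essays.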

In [21]:
base_eval_df = evaluate_essays(df, base_model)

----------------
evaluating essay...
[0] evaluated. (1/10402) score: 4. score_og: 4
----------------
evaluating essay...
[12] evaluated. (2/10402) score: 4. score_og: 2
----------------
evaluating essay...
[17] evaluated. (3/10402) score: 4. score_og: 4
----------------
evaluating essay...
[28] evaluated. (4/10402) score: 4. score_og: 5
----------------
evaluating essay...
[47] evaluated. (5/10402) score: 4. score_og: 4
----------------
evaluating essay...
[63] evaluated. (6/10402) score: 4. score_og: 3
----------------
evaluating essay...
[75] evaluated. (7/10402) score: 4. score_og: 4
----------------
evaluating essay...
[86] evaluated. (8/10402) score: 4. score_og: 3
----------------
evaluating essay...
[96] evaluated. (9/10402) score: 4. score_og: 4
----------------
evaluating essay...
[104] evaluated. (10/10402) score: 4. score_og: 3
----------------
evaluating essay...
[113] evaluated. (11/10402) score: 3. score_og: 3
----------------
evaluating essay...
[120] evaluated. (12/1040

In [22]:
base_eval_df.to_csv("checkpoints/base_eval.csv", index=False)

In [9]:
from sklearn.metrics import cohen_kappa_score

base_eval_df = pd.read_csv('checkpoints/base_eval.csv')
base_kappa = cohen_kappa_score(base_eval_df["holistic_essay_score"], base_eval_df["score_llm"], weights="quadratic")

print(f"base gemma 3 12b kappa: {base_kappa}")

base gemma 3 12b kappa: 0.37531115188271924
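
A single kappa value does not show where the disagreements fall. In the log excerpt above, the base model predicts 4 for ten of the first eleven essays, and a contingency table of human versus model scores makes that kind of pattern visible. This is a minimal sketch with a hypothetical helper name, `score_crosstab`; the eleven score pairs are copied from the visible part of the base-model log, not from the full test set.

```python
from collections import Counter


def score_crosstab(human_scores, model_scores, low=1, high=6):
    """Return a square table of counts; rows are human scores, columns are model scores."""
    counts = Counter(zip(human_scores, model_scores))
    levels = range(low, high + 1)
    return [[counts[(h, m)] for m in levels] for h in levels]


# first eleven (score_og, score) pairs from the base-model log excerpt
human = [4, 2, 4, 5, 4, 3, 4, 3, 4, 3, 3]
model = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3]

table = score_crosstab(human, model)
for human_score, row in zip(range(1, 7), table):
    print(f"human {human_score}: {row}")
```

Counts concentrated in a single column indicate that the model rarely uses the rest of the scale, which limits agreement no matter how the human scores are distributed.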


In [10]:
tuned_eval_df = evaluate_essays(df, tuned_model)

----------------
evaluating essay...
[0] evaluated. (1/10402) score: 5. score_og: 4
----------------
evaluating essay...
[12] evaluated. (2/10402) score: 4. score_og: 2
----------------
evaluating essay...
[17] evaluated. (3/10402) score: 5. score_og: 4
----------------
evaluating essay...
[28] evaluated. (4/10402) score: 5. score_og: 5
----------------
evaluating essay...
[47] evaluated. (5/10402) score: 3. score_og: 4
----------------
evaluating essay...
[63] evaluated. (6/10402) score: 4. score_og: 3
----------------
evaluating essay...
[75] evaluated. (7/10402) score: 4. score_og: 4
----------------
evaluating essay...
[86] evaluated. (8/10402) score: 4. score_og: 3
----------------
evaluating essay...
[96] evaluated. (9/10402) score: 4. score_og: 4
----------------
evaluating essay...
[104] evaluated. (10/10402) score: 4. score_og: 3
----------------
evaluating essay...
[113] evaluated. (11/10402) score: 3. score_og: 3
----------------
evaluating essay...
[120] evaluated. (12/1040

In [11]:
tuned_eval_df.to_csv("checkpoints/tuned_eval.csv", index=False)

In [12]:
from sklearn.metrics import cohen_kappa_score

tuned_eval_df = pd.read_csv('checkpoints/tuned_eval.csv')
tuned_kappa = cohen_kappa_score(tuned_eval_df["holistic_essay_score"], tuned_eval_df["score_llm"], weights="quadratic")

print(f"tuned gemma 3 12b kappa: {tuned_kappa}")

tuned gemma 3 12b kappa: 0.7165229384924516


### Evaluate Originals and Counterfactuals (CFs) with Finetuned Model

In [13]:
# load datasets

import pandas as pd

stance_pro_to_con_df = pd.read_csv('counterfactuals/stance_pro_to_con.csv')
stance_con_to_pro_df = pd.read_csv('counterfactuals/stance_con_to_pro.csv')

sentiment_positive_to_negative_df = pd.read_csv('counterfactuals/sentiment_positive_to_negative.csv')
sentiment_negative_to_positive_df = pd.read_csv('counterfactuals/sentiment_negative_to_positive.csv')

formality_formal_to_informal_df = pd.read_csv('counterfactuals/formality_formal_to_informal.csv')
formality_informal_to_formal_df = pd.read_csv('counterfactuals/formality_informal_to_formal.csv')

#### Stance:

In [21]:
# evaluate essays using finetuned gemma 3

stance_pro_to_con_scored_gemma3_ft_df = evaluate_essays(stance_pro_to_con_df, tuned_model)
stance_con_to_pro_scored_gemma3_ft_df = evaluate_essays(stance_con_to_pro_df, tuned_model)

----------------
evaluating essay...
[0] evaluated. (1/200) score: 2. score_og: 1
----------------
evaluating essay...
[1] evaluated. (2/200) score: 1. score_og: 1
----------------
evaluating essay...
[2] evaluated. (3/200) score: 1. score_og: 1
----------------
evaluating essay...
[3] evaluated. (4/200) score: 3. score_og: 1
----------------
evaluating essay...
[4] evaluated. (5/200) score: 2. score_og: 1
----------------
evaluating essay...
[5] evaluated. (6/200) score: 3. score_og: 1
----------------
evaluating essay...
[6] evaluated. (7/200) score: 1. score_og: 1
----------------
evaluating essay...
[7] evaluated. (8/200) score: 3. score_og: 1
----------------
evaluating essay...
[8] evaluated. (9/200) score: 1. score_og: 2
----------------
evaluating essay...
[9] evaluated. (10/200) score: 1. score_og: 2
----------------
evaluating essay...
[10] evaluated. (11/200) score: 2. score_og: 2
----------------
evaluating essay...
[11] evaluated. (12/200) score: 1. score_og: 2
-----------

#### Sentiment:

In [22]:
sentiment_positive_to_negative_scored_gemma3_ft_df = evaluate_essays(sentiment_positive_to_negative_df, tuned_model)
sentiment_negative_to_positive_scored_gemma3_ft_df = evaluate_essays(sentiment_negative_to_positive_df, tuned_model)

----------------
evaluating essay...
[0] evaluated. (1/200) score: 1. score_og: 1
----------------
evaluating essay...
[1] evaluated. (2/200) score: 1. score_og: 1
----------------
evaluating essay...
[2] evaluated. (3/200) score: 1. score_og: 1
----------------
evaluating essay...
[3] evaluated. (4/200) score: 1. score_og: 1
----------------
evaluating essay...
[4] evaluated. (5/200) score: 1. score_og: 1
----------------
evaluating essay...
[5] evaluated. (6/200) score: 3. score_og: 1
----------------
evaluating essay...
[6] evaluated. (7/200) score: 2. score_og: 1
----------------
evaluating essay...
[7] evaluated. (8/200) score: 2. score_og: 1
----------------
evaluating essay...
[8] evaluated. (9/200) score: 3. score_og: 2
----------------
evaluating essay...
[9] evaluated. (10/200) score: 2. score_og: 2
----------------
evaluating essay...
[10] evaluated. (11/200) score: 2. score_og: 2
----------------
evaluating essay...
[11] evaluated. (12/200) score: 3. score_og: 2
-----------

#### Formality:

In [23]:
formality_formal_to_informal_scored_gemma3_ft_df = evaluate_essays(formality_formal_to_informal_df, tuned_model)
formality_informal_to_formal_scored_gemma3_ft_df = evaluate_essays(formality_informal_to_formal_df, tuned_model)

----------------
evaluating essay...
[0] evaluated. (1/200) score: 1. score_og: 1
----------------
evaluating essay...
[1] evaluated. (2/200) score: 1. score_og: 1
----------------
evaluating essay...
[2] evaluated. (3/200) score: 3. score_og: 1
----------------
evaluating essay...
[3] evaluated. (4/200) score: 2. score_og: 1
----------------
evaluating essay...
[4] evaluated. (5/200) score: 2. score_og: 1
----------------
evaluating essay...
[5] evaluated. (6/200) score: 3. score_og: 1
----------------
evaluating essay...
[6] evaluated. (7/200) score: 1. score_og: 1
----------------
evaluating essay...
[7] evaluated. (8/200) score: 3. score_og: 1
----------------
evaluating essay...
[8] evaluated. (9/200) score: 3. score_og: 2
----------------
evaluating essay...
[9] evaluated. (10/200) score: 3. score_og: 2
----------------
evaluating essay...
[10] evaluated. (11/200) score: 2. score_og: 2
----------------
evaluating essay...
[11] evaluated. (12/200) score: 2. score_og: 2
-----------

### Save new scored counterfactuals to folder

In [26]:
import os

# save scored counterfactuals to their own folder
os.makedirs("counterfactuals_scored/persuade/gemma3_ft/", exist_ok=True)

stance_pro_to_con_scored_gemma3_ft_df.to_csv("counterfactuals_scored/persuade/gemma3_ft/stance_pro_to_con_scored.csv", index=False)
stance_con_to_pro_scored_gemma3_ft_df.to_csv("counterfactuals_scored/persuade/gemma3_ft/stance_con_to_pro_scored.csv", index=False)

sentiment_positive_to_negative_scored_gemma3_ft_df.to_csv("counterfactuals_scored/persuade/gemma3_ft/sentiment_positive_to_negative_scored.csv", index=False)
sentiment_negative_to_positive_scored_gemma3_ft_df.to_csv("counterfactuals_scored/persuade/gemma3_ft/sentiment_negative_to_positive_scored.csv", index=False)

formality_formal_to_informal_scored_gemma3_ft_df.to_csv("counterfactuals_scored/persuade/gemma3_ft/formality_formal_to_informal_scored.csv", index=False)
formality_informal_to_formal_scored_gemma3_ft_df.to_csv("counterfactuals_scored/persuade/gemma3_ft/formality_informal_to_formal_scored.csv", index=False)

### Summarize kappas by flip

In [29]:
pro_to_con_kappa = cohen_kappa_score(stance_pro_to_con_scored_gemma3_ft_df["score_og"], stance_pro_to_con_scored_gemma3_ft_df["score_llm"], weights="quadratic")
con_to_pro_kappa = cohen_kappa_score(stance_con_to_pro_scored_gemma3_ft_df["score_og"], stance_con_to_pro_scored_gemma3_ft_df["score_llm"], weights="quadratic")
pos_to_neg_kappa = cohen_kappa_score(sentiment_positive_to_negative_scored_gemma3_ft_df["score_og"], sentiment_positive_to_negative_scored_gemma3_ft_df["score_llm"], weights="quadratic")
neg_to_pos_kappa = cohen_kappa_score(sentiment_negative_to_positive_scored_gemma3_ft_df["score_og"], sentiment_negative_to_positive_scored_gemma3_ft_df["score_llm"], weights="quadratic")
for_to_inf_kappa = cohen_kappa_score(formality_formal_to_informal_scored_gemma3_ft_df["score_og"], formality_formal_to_informal_scored_gemma3_ft_df["score_llm"], weights="quadratic")
inf_to_for_kappa = cohen_kappa_score(formality_informal_to_formal_scored_gemma3_ft_df["score_og"], formality_informal_to_formal_scored_gemma3_ft_df["score_llm"], weights="quadratic")

print(f"QWKs: \n"
      f"Pro to Con: {pro_to_con_kappa:.3f}\n"
      f"Con to Pro: {con_to_pro_kappa:.3f}\n\n"
      f"Pos to Neg: {pos_to_neg_kappa:.3f}\n"
      f"Neg to Pos: {neg_to_pos_kappa:.3f}\n\n"
      f"For to Inf: {for_to_inf_kappa:.3f}\n"
      f"Inf to For: {inf_to_for_kappa:.3f}\n")

QWKs: 
Pro to Con: 0.690
Con to Pro: 0.705

Pos to Neg: 0.689
Neg to Pos: 0.670

For to Inf: 0.722
Inf to For: 0.611

