# CAES Dataset Notebook
## 3. Evaluator

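This section scores the CAES essay counterfactuals (stance, sentiment, and formality flips) with several LLM evaluators served by a local Ollama instance: Gemma 3, Llama 3, Qwen3, and DeepSeek-R1.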

### Load dataset

In [1]:
# load datasets

import pandas as pd

stance_pro_to_con_df = pd.read_csv('counterfactuals/caes/stance_pro_to_con.csv')
stance_con_to_pro_df = pd.read_csv('counterfactuals/caes/stance_con_to_pro.csv')

sentiment_positive_to_negative_df = pd.read_csv('counterfactuals/caes/sentiment_positive_to_negative.csv')
sentiment_negative_to_positive_df = pd.read_csv('counterfactuals/caes/sentiment_negative_to_positive.csv')

formality_formal_to_informal_df = pd.read_csv('counterfactuals/caes/formality_formal_to_informal.csv')
formality_informal_to_formal_df = pd.read_csv('counterfactuals/caes/formality_informal_to_formal.csv')
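
The evaluation cell further down reads the `full_text` and `score_og` columns from each frame, so a quick check before any model call can catch a malformed CSV early. The following is a minimal sketch, assuming those two columns are the only ones required; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the toy frame stands in for one of the loaded CSVs.

```python
import pandas as pd

# Columns read by the evaluation cell; assumed to be the only required ones.
REQUIRED_COLUMNS = ("full_text", "score_og")

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for one of the loaded counterfactual CSVs.
toy_df = pd.DataFrame({"full_text": ["hola , me llamo ana ."], "score_og": ["A1"]})

print(missing_columns(toy_df))
print(missing_columns(toy_df.drop(columns=["score_og"])))
```

An empty list means the frame has everything the evaluator needs; the same call could be applied to each of the six loaded frames.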

### Essay evaluation

In [6]:
import requests

# strings to be used in building the evaluation prompt
RUBRIC = """
Here is a high level description of the knowledge that students of each CEFR level are expected to have.
- A student of level pre-A1 can give basic personal information (e.g. name, address, nationality), perhaps with the use of a dictionary.
- A student of level A1 can give information about matters of personal relevance (e.g. likes and dislikes, family, pets) using simple words/signs and basic expressions; can produce simple isolated phrases and sentences.
- A student of level A2 can produce a series of simple phrases and sentences linked with simple connectors like “and”, “but” and “because”.
- A student of level B1 can produce straightforward connected texts on a range of familiar subjects within their field of interest, by linking a series of shorter discrete elements into a linear sequence.
- A student of level B2 can produce clear, detailed texts on a variety of subjects related to their field of interest, synthesising and evaluating information and arguments from a number of sources.
- A student of level C1 can produce clear, well-structured texts of complex subjects, underlining the relevant issues, expanding and supporting points of view at some length with subsidiary points, reasons and relevant examples, and rounding off with an appropriate conclusion; can employ the structure, vocabulary and register of genres, varying the tone, style and structure in an appropriate way for the target reader.
- A student of level C2 can write clear, smoothly flowing, complex texts in an appropriate and effective style and a logical structure which helps the reader identify significant points.
"""

FEW_SHOT_EVAL = """
Example essay 1 of score "B2":\n
"el cigarro es uma droga que contiene muchas substancias que causan vicio y adicción a las personas . por ser una droga lícita , hay un gran numero de personas que fuman y muchas vezes ni estan preocupada con su salud ni con la salud de las otras persona . en muchos lugares de el mundo existem reglas a cerca_de la utilizacíon de el cigarro . una de estas reglas se traduz en poder fumar en lugares publicos o no . en mi opinión , no deveria ser permitido que las personas fumasen en lugares publicos , pues se aprobarmos la liberacíon de el uso de el cigarro en publico , estaremos ayudando , a estas personas a hacer daño a ellas mismas y aún a las personas que circulan en estes ambientes comunes . el humo producido por el cigarro puede causar cancer de pulmón , de lengua , de laringe , además acarretar daños a el corazón , desgatar los dientes , producir substancias inflamatorias en los vasos sanguineos , depresión , abuso de otras drogras , etc. diante_de tantos problemas y maleficios que el cigarro puede causar , hago la seguiente pregunta : por que las personas fuman ? será que tienem consciencia de las consequencias de el uso de el cigarro ? será que no imaginan que estan haciendo daño a si mismas y tambien a las otras personas con que conviven ? concluo esta mi opinión , decindo que para mi , no debe ser permitido fumar en lugares publicos , porque ademas de hacer daño a los individuos en general , también estaremos haciendo daño a el medio_ambiente . intenten cambiar este habito por otro saludable ! se preocupen no solamente con sus vidas , pero tambien con la de las otras personas , con la naturaleza ! fumar es un tipo de suicidio asistido , digan no a esta adiccíon !"\n\n
Example Essay 2 of score "B1":
"he estado en españa por casi dos meses y te echo de menos mucho , la vida aquí es tranquila , además , la gente aquí es muy amable , no me te preocupes , pero tengo que estudiar todos los dias , lo que me deja muy cansada . te escribo para pedir algunos aconsejos sobre mi vida despúes_de graduar me . tengo dos opciones , estudiar para el master o trabajar en un país de latino_américa . por un lado , ya sabes llevo dos años estado con mi novio y él prefiere hacer un master en china depúes_de graduar se . estoy muy preocupada que nos separemos si trabajo en otro páis . por que , desde mi punto de la vista , la distancia entre las parejas influye mucho sus relaciones . por_otro_lado , trabajar en un páis que hanla español es mi sueño desde el momento que esdutié español , también es muy difícil para mi hacer master en mi carrera en las universidades en china . me parece que tengo que eligir uno entre sueño y amor , y a el pensar de esto , estoy muy triste . si tu fueras yo , qué harías ? lo siento mucho por preguntar te , por que sé que estás muy ocupada todos los días en la ciudad grande . un beso ! seseria"\n\n
Example Essay 3 of score "A2":\n
"gabriela_kraft es mi mejor amiga . somos muy proximas desde siempre a_causa_de sermos hermanas . ella tiene vienteuno años y yo tengo veinte . miéntras tenermos edads muy proximas somos muy diferentes . le gusta trabajar con las personas , es muy simpatica e divertida . le gusta tambíen los perros y otros animales en_general . pero que más le gusta es el chocolate ! hablar sobre su familia es hablar sobre mi , entonces puedo decir que tenemos una buena familia . es claro que han muchos problemas , pero esto hay en todas las casas . nuestra madre es una mujer muy decididada y fuerte , siempre nos dice que tenemos que hacer el mejor que podremos . empiezas su dia muy temprano y solo vuelta a casa muy tarde . ! yo a admiro a_causa_de ella ser muy inteligente ! una de as personas más inteligentes e trabajadoras que he conocido en mi vida . cuando era pequeña , gabriela estudiava mucho y siempre me ayudava con las tareas de casa . mi hermana tiene estatura mediana , es delgada y tiene pelo largo y castanõ . sus ojos son muy bonitos , son verdes ."\n
"""

possible_scores = ["A1", "A2", "B1", "B2", "C1", "C2"]

# calls the local Ollama API with a text prompt and returns the model's reply
def query_ollama(prompt, model, temperature=0):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model,
              "prompt": prompt,
              "stream": False,
              "think": False,
              "options": {
                  "temperature": temperature,
              }
        }
    )
    # surface HTTP errors here instead of failing later with a KeyError
    response.raise_for_status()
    return response.json()["response"].strip()


# builds the evaluation prompt to be fed to the LLM evaluator
def build_eval_prompt(text):
    return (
        f"You are an essay rater specializing in the evaluation of essays written by students. "
        f"Read and evaluate the essay: \n\n{FEW_SHOT_EVAL}\n"
        f"Essay to score:\n{text}\n\n"
        f"Assign it a level from A1 to C2, based on this rubric:\n\n{RUBRIC}\n\n"
        f"Your response should be only a letter + number combination representing the level you gave."
    )

# function to evaluate all essays from a df
def evaluate_essays(df, model="llama3:8b"):
    scores = []

    for idx, row in df.iterrows():
        prompt = build_eval_prompt(row['full_text'])
        score_og = row['score_og']

        try:
            print("----------------")
            print("evaluating essay...")
            response = query_ollama(prompt, model=model)
            score = response.strip()

            if score in possible_scores:
                # score is a valid CEFR label (not necessarily the correct one)
                print(f"[{idx}] evaluated. score: {score}. score_og: {score_og}")
                scores.append(score)
            else:
                print(f"invalid score from LLM at idx {idx}: {response}")
                scores.append(None)
        except Exception as e:
            print(f"error scoring essay at idx {idx}: {e}")
            scores.append(None)

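    # write the scores to a copy so that scoring the same df with another
    # model does not overwrite an earlier model's results
    scored_df = df.copy()
    scored_df["score_llm"] = scores

    return scored_df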
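
The membership check in `evaluate_essays` accepts a reply only when it is exactly one of the labels in `possible_scores`, so a verbose reply such as "Based on the rubric, I would assign this essay to level B1." is recorded as `None`. One possible way to recover the label from such replies is sketched below. It is not part of the notebook's pipeline: `extract_level` and `LEVEL_PATTERN` are hypothetical names, and the rule of returning `None` when a reply mentions more than one distinct level is an assumption chosen to stay conservative.

```python
import re

# Valid labels, copied from possible_scores in the notebook.
POSSIBLE_SCORES = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Match a label only when it is not embedded in a longer alphanumeric token.
LEVEL_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])(" + "|".join(POSSIBLE_SCORES) + r")(?![A-Za-z0-9])"
)

def extract_level(response):
    """Return the CEFR label if the reply names exactly one level, else None."""
    found = set(LEVEL_PATTERN.findall(response))
    return found.pop() if len(found) == 1 else None

print(extract_level("B1"))
print(extract_level("Based on the rubric, I would assign this essay to level B1."))
print(extract_level('Essay 1 (score "B2") is more detailed; I would assign "B1".'))
```

Replies that compare the essay against several example levels stay unscored under this rule, which avoids guessing which of the mentioned levels was the verdict.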

### Evaluation

#### A. Evaluate with Gemma3

#### Stance:

In [7]:
# score flipped stance counterfactuals and their originals

stance_pro_to_con_scored_gemma3_df = evaluate_essays(stance_pro_to_con_df, 'gemma3:12b')
stance_con_to_pro_scored_gemma3_df = evaluate_essays(stance_con_to_pro_df, 'gemma3:12b')

----------------
evaluating essay...
[0] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: A2. score_og: A1
---------

#### Sentiment:

In [8]:
# score flipped sentiment counterfactuals and their originals

sentiment_positive_to_negative_scored_gemma3_df = evaluate_essays(sentiment_positive_to_negative_df, 'gemma3:12b')
sentiment_negative_to_positive_scored_gemma3_df = evaluate_essays(sentiment_negative_to_positive_df, 'gemma3:12b')

----------------
evaluating essay...
[0] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: A2. score_og: A1
---------

#### Formality:

In [9]:
# score flipped formality counterfactuals and their originals

formality_formal_to_informal_scored_gemma3_df = evaluate_essays(formality_formal_to_informal_df, 'gemma3:12b')
formality_informal_to_formal_scored_gemma3_df = evaluate_essays(formality_informal_to_formal_df, 'gemma3:12b')

----------------
evaluating essay...
[0] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: A2. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: A2. score_og: A1
---------

#### Saving to folder

In [10]:
import os

# save scored counterfactuals to their own folder
os.makedirs("counterfactuals_scored/caes/gemma3/", exist_ok=True)

stance_pro_to_con_scored_gemma3_df.to_csv("counterfactuals_scored/caes/gemma3/stance_pro_to_con_scored.csv", index=False)
stance_con_to_pro_scored_gemma3_df.to_csv("counterfactuals_scored/caes/gemma3/stance_con_to_pro_scored.csv", index=False)

sentiment_positive_to_negative_scored_gemma3_df.to_csv("counterfactuals_scored/caes/gemma3/sentiment_positive_to_negative_scored.csv", index=False)
sentiment_negative_to_positive_scored_gemma3_df.to_csv("counterfactuals_scored/caes/gemma3/sentiment_negative_to_positive_scored.csv", index=False)

formality_formal_to_informal_scored_gemma3_df.to_csv("counterfactuals_scored/caes/gemma3/formality_formal_to_informal_scored.csv", index=False)
formality_informal_to_formal_scored_gemma3_df.to_csv("counterfactuals_scored/caes/gemma3/formality_informal_to_formal_scored.csv", index=False)

#### B. Evaluate with Llama 3

#### Stance:

In [11]:
# evaluate essays using llama3:8b LLM

stance_pro_to_con_scored_llama3_df = evaluate_essays(stance_pro_to_con_df)
stance_con_to_pro_scored_llama3_df = evaluate_essays(stance_con_to_pro_df)

----------------
evaluating essay...
invalid score from LLM at idx 0: Based on the rubric, I would assign this essay to level B1.

Here's why:

* The essay is written in simple sentences with basic vocabulary.
* It lacks cohesion and coherence, with no clear introduction, body, or conclusion.
* The writer jumps from one idea to another without connecting them logically.
* There are no complex structures, such as subordinate clauses or relative pronouns.
* The text is mostly descriptive, with little analysis or evaluation of ideas.

Overall, the essay shows some basic writing skills, but lacks the complexity and coherence expected at higher levels.
----------------
evaluating essay...
invalid score from LLM at idx 1: Based on the rubric, I would assign this essay to level B1.

Here's why:

* Essay 1 (B2) is too advanced for this essay, as it requires more complex sentence structures and vocabulary.
* Essay 2 (A2) is not a good match because it lacks the coherence and cohesion required a

#### Sentiment:

In [12]:
sentiment_positive_to_negative_scored_llama3_df = evaluate_essays(sentiment_positive_to_negative_df)
sentiment_negative_to_positive_scored_llama3_df = evaluate_essays(sentiment_negative_to_positive_df)

----------------
evaluating essay...
invalid score from LLM at idx 0: Based on the rubric, I would assign the essay a score of "B2".
----------------
evaluating essay...
invalid score from LLM at idx 1: Based on the rubric, I would assign the essay to score "B1".

Here's why:

* Essay 1 (Example essay 1 of score "B2") is more detailed and coherent than this essay, with clear arguments and supporting evidence. This essay lacks that level of detail and coherence.
* Essay 2 (Example Essay 2 of score "B1") is similar in terms of its structure and content to this essay, so I would expect a similar level of proficiency.
* Essay 3 (Example Essay 3 of score "A2") is less developed than this essay, with more simplistic language and ideas.

Overall, I think the essay demonstrates a good understanding of basic sentence structures and vocabulary, but lacks the complexity and coherence expected at higher levels.
----------------
evaluating essay...
invalid score from LLM at idx 2: Based on the rubr

#### Formality:

In [13]:
formality_formal_to_informal_scored_llama3_df = evaluate_essays(formality_formal_to_informal_df)
formality_informal_to_formal_scored_llama3_df = evaluate_essays(formality_informal_to_formal_df)

----------------
evaluating essay...
invalid score from LLM at idx 0: Based on the rubric provided, I would assign this essay to level B1.

Here's why:

* The essay is written in a straightforward and clear manner, with a logical structure that allows the reader to follow the writer's thoughts.
* The text is connected by simple connectors like "and", "but" and "because", which indicates a level of cohesion and coherence.
* The language used is relatively simple, but still conveys the writer's ideas and opinions effectively.
* There are some minor errors in grammar, vocabulary, and sentence structure, but they do not significantly impede the reader's understanding.

Overall, while the essay may not be perfect, it demonstrates a level of proficiency that is typical of a B1 student.
----------------
evaluating essay...
invalid score from LLM at idx 1: Based on the rubric, I would assign this essay to level B1.

Here's why:

* The student is able to produce a straightforward connected text

#### Saving to folder

In [14]:
# save scored counterfactuals to their own folder
os.makedirs("counterfactuals_scored/caes/llama3/", exist_ok=True)

stance_pro_to_con_scored_llama3_df.to_csv("counterfactuals_scored/caes/llama3/stance_pro_to_con_scored.csv", index=False)
stance_con_to_pro_scored_llama3_df.to_csv("counterfactuals_scored/caes/llama3/stance_con_to_pro_scored.csv", index=False)

sentiment_positive_to_negative_scored_llama3_df.to_csv("counterfactuals_scored/caes/llama3/sentiment_positive_to_negative_scored.csv", index=False)
sentiment_negative_to_positive_scored_llama3_df.to_csv("counterfactuals_scored/caes/llama3/sentiment_negative_to_positive_scored.csv", index=False)

formality_formal_to_informal_scored_llama3_df.to_csv("counterfactuals_scored/caes/llama3/formality_formal_to_informal_scored.csv", index=False)
formality_informal_to_formal_scored_llama3_df.to_csv("counterfactuals_scored/caes/llama3/formality_informal_to_formal_scored.csv", index=False)

#### C. Evaluate with Qwen3

#### Stance:

In [15]:
# evaluate essays using qwen3:8b LLM

stance_pro_to_con_scored_qwen3_df = evaluate_essays(stance_pro_to_con_df, model="qwen3:8b")
stance_con_to_pro_scored_qwen3_df = evaluate_essays(stance_con_to_pro_df, model="qwen3:8b")

----------------
evaluating essay...
[0] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: B1. score_og: A1
---------

#### Sentiment:

In [16]:
sentiment_positive_to_negative_scored_qwen3_df = evaluate_essays(sentiment_positive_to_negative_df, model="qwen3:8b")
sentiment_negative_to_positive_scored_qwen3_df = evaluate_essays(sentiment_negative_to_positive_df, model="qwen3:8b")

----------------
evaluating essay...
[0] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: B1. score_og: A1
---------

#### Formality:

In [17]:
formality_formal_to_informal_scored_qwen3_df = evaluate_essays(formality_formal_to_informal_df, model="qwen3:8b")
formality_informal_to_formal_scored_qwen3_df = evaluate_essays(formality_informal_to_formal_df, model="qwen3:8b")

----------------
evaluating essay...
[0] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: B2. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: B2. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: B1. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: B1. score_og: A1
---------

#### Saving to folder

In [18]:
# save scored counterfactuals to their own folder
os.makedirs("counterfactuals_scored/caes/qwen3/", exist_ok=True)

stance_pro_to_con_scored_qwen3_df.to_csv("counterfactuals_scored/caes/qwen3/stance_pro_to_con_scored.csv", index=False)
stance_con_to_pro_scored_qwen3_df.to_csv("counterfactuals_scored/caes/qwen3/stance_con_to_pro_scored.csv", index=False)

sentiment_positive_to_negative_scored_qwen3_df.to_csv("counterfactuals_scored/caes/qwen3/sentiment_positive_to_negative_scored.csv", index=False)
sentiment_negative_to_positive_scored_qwen3_df.to_csv("counterfactuals_scored/caes/qwen3/sentiment_negative_to_positive_scored.csv", index=False)

formality_formal_to_informal_scored_qwen3_df.to_csv("counterfactuals_scored/caes/qwen3/formality_formal_to_informal_scored.csv", index=False)
formality_informal_to_formal_scored_qwen3_df.to_csv("counterfactuals_scored/caes/qwen3/formality_informal_to_formal_scored.csv", index=False)

#### D. Evaluate with DeepSeek

#### Stance:

In [19]:
# evaluate essays using deepseek-r1:7b LLM

stance_pro_to_con_scored_deepseek_df = evaluate_essays(stance_pro_to_con_df, model="deepseek-r1:7b")
stance_con_to_pro_scored_deepseek_df = evaluate_essays(stance_con_to_pro_df, model="deepseek-r1:7b")

----------------
evaluating essay...
[0] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: A1. score_og: A1
---------

#### Sentiment:

In [20]:
sentiment_positive_to_negative_scored_deepseek_df = evaluate_essays(sentiment_positive_to_negative_df, model="deepseek-r1:7b")
sentiment_negative_to_positive_scored_deepseek_df = evaluate_essays(sentiment_negative_to_positive_df, model="deepseek-r1:7b")

----------------
evaluating essay...
[0] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: A1. score_og: A1
---------

#### Formality:

In [21]:
formality_formal_to_informal_scored_deepseek_df = evaluate_essays(formality_formal_to_informal_df, model="deepseek-r1:7b")
formality_informal_to_formal_scored_deepseek_df = evaluate_essays(formality_informal_to_formal_df, model="deepseek-r1:7b")

----------------
evaluating essay...
[0] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[1] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[2] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[3] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[4] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[5] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[6] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[7] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[8] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[9] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[10] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[11] evaluated. score: A1. score_og: A1
----------------
evaluating essay...
[12] evaluated. score: A1. score_og: A1
---------

#### Saving to folder

In [22]:
# save scored counterfactuals to their own folder
os.makedirs("counterfactuals_scored/caes/deepseek/", exist_ok=True)

stance_pro_to_con_scored_deepseek_df.to_csv("counterfactuals_scored/caes/deepseek/stance_pro_to_con_scored.csv", index=False)
stance_con_to_pro_scored_deepseek_df.to_csv("counterfactuals_scored/caes/deepseek/stance_con_to_pro_scored.csv", index=False)

sentiment_positive_to_negative_scored_deepseek_df.to_csv("counterfactuals_scored/caes/deepseek/sentiment_positive_to_negative_scored.csv", index=False)
sentiment_negative_to_positive_scored_deepseek_df.to_csv("counterfactuals_scored/caes/deepseek/sentiment_negative_to_positive_scored.csv", index=False)

formality_formal_to_informal_scored_deepseek_df.to_csv("counterfactuals_scored/caes/deepseek/formality_formal_to_informal_scored.csv", index=False)
formality_informal_to_formal_scored_deepseek_df.to_csv("counterfactuals_scored/caes/deepseek/formality_informal_to_formal_scored.csv", index=False)
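
The logs print each evaluator score next to `score_og`, which invites a compact comparison per model. The sketch below is one way to do that under stated assumptions: it treats the CEFR labels as an ordered scale, and reports the exact agreement rate and the mean signed offset in CEFR steps. `summarize`, `CEFR_ORDER`, and `LEVEL_INDEX` are hypothetical names. The input lists are only the first 13 rows visible in each model's stance log above, not the full runs, so the numbers describe those rows alone.

```python
# CEFR labels in ascending order, matching possible_scores in the notebook.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_INDEX = {label: i for i, label in enumerate(CEFR_ORDER)}

def summarize(llm_scores, og_scores):
    """Return (exact agreement rate, mean signed offset in CEFR steps).

    Pairs where the evaluator gave no valid label (None) are skipped.
    """
    pairs = [(p, g) for p, g in zip(llm_scores, og_scores) if p is not None]
    if not pairs:
        return None, None
    exact = sum(p == g for p, g in pairs) / len(pairs)
    offset = sum(LEVEL_INDEX[p] - LEVEL_INDEX[g] for p, g in pairs) / len(pairs)
    return exact, offset

# First 13 rows shown in each model's stance log (score_og is A1 in all of them).
score_og = ["A1"] * 13
logged = {
    "gemma3": ["A2", "A2", "B1", "B1", "A2", "A2", "A2",
               "A2", "B1", "A2", "A2", "A2", "A2"],
    "qwen3": ["B1"] * 13,
    "deepseek": ["A1"] * 13,
}

for model_tag, llm_scores in logged.items():
    exact, offset = summarize(llm_scores, score_og)
    print(f"{model_tag}: exact={exact:.2f}, mean offset={offset:+.2f}")
```

With the saved frames, the same function could be called with `df["score_llm"].tolist()` and `df["score_og"].tolist()`, provided missing scores are converted to `None` first.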