# Tagging Module
## CAES Dataset Notebook

This module tags the dataset with content labels for three attributes: sentiment, stance, and formality.

(The datasets are not included with the code because the original files are too large. The notebook assumes the CSV files are in a folder called `data`.)

In [5]:
import pandas as pd

df = pd.read_csv("data/caes.raw.csv")
df

Unnamed: 0,estudiante,tarea,tipologia,tema,edad,sexo,pais,l1,estudios,edad_inicio,meses_estudio,contactos_habla_es,nivel,oraciones,oraciones_longitud,tokens,texto
0,417.0,1156.0,Carta,Carta amigo,26.0,Mujer,Brasil,Portugués,Universidad,25.0,8.0,Amigos,B1,"['Hola Carlos !', 'Hace mucho que no te veo ho...",18,212,Hola Carlos ! Hace mucho que no te veo hombre ...
1,1387.0,3934.0,Carta,Carta amigo,20.0,Mujer,Marruecos,Árabe,Universidad,15.0,30.0,No,A2,"['Hola querida Sara ,', 'espero que estes muy ...",9,175,"Hola querida Sara , espero que estes muy bien ..."
2,2273.0,6370.0,Postal,Postal vacaciones,18.0,Mujer,Otro,Ruso,Universidad,18.0,8.0,No,A1,"['Como estais ?', '!', 'Hola amigos ! ?', 'Esp...",13,104,Como estais ? ! Hola amigos ! ? Espero que tod...
3,1522.0,4307.0,Postal,Postal vacaciones,22.0,Hombre,China,Chino mandarín,Universidad,21.0,2.0,No,A2,"['Cuando llegué a el destino , le propuso que ...",12,135,"Cuando llegué a el destino , le propuso que si..."
4,775.0,2159.0,Correo electrónico,Familia,27.0,Mujer,Francia,Francés,Otros,20.0,6.0,Amigos,A1,"['¿ Que tal ?', 'un beso', 'Espero que todo va...",8,80,¿ Que tal ? un beso Espero que todo va bien po...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30995,2366.0,6592.0,Carta,Carta amigo,19.0,Mujer,China,Chino mandarín,Universidad,7.0,70.0,Otros,B1,"['Buenas tardes !', '29_de_mayo_de_2019 salama...",8,299,Buenas tardes ! 29_de_mayo_de_2019 salamanca Q...
30996,1497.0,4236.0,Carta,Solicitud admisión,33.0,Hombre,China,Chino mandarín,Universidad,27.0,12.0,Amigos&Familiares,B2,['Tengo la seguridad de mí mismo para contener...,13,157,Tengo la seguridad de mí mismo para contener e...
30997,1538.0,4350.0,Correo electrónico,Reclamación compañía aérea,21.0,Hombre,China,Chino mandarín,Primaria,19.0,24.0,Amigos,B1,"['Distinguidos señor / señra :', 'Hola , geren...",15,178,"Distinguidos señor / señra : Hola , gerente . ..."
30998,2122.0,5976.0,Carta,Solicitud admisión,19.0,Mujer,Portugal,Portugués,Universidad,15.0,36.0,No,B2,['Le escribo esta carta porque me gustaría inm...,14,303,Le escribo esta carta porque me gustaría inmen...


In [8]:
# clean dataset: keep and rename the columns we need
df = df.rename(
    columns={
        "texto": "full_text",
        "nivel": "score",
        "tema":  "prompt_name"
    }
)[["full_text", "score", "prompt_name"]]

# remove nan (done after renaming, so the new column names exist)
df = df.dropna(subset=['full_text', 'score'])

# insert essay ID column
df.insert(0, "essay_id", range(len(df)))

# get word count
df["word_count"] = df["full_text"].str.split().str.len().astype(int)

# organize columns
df = df[["essay_id", "full_text", "score", "word_count", "prompt_name"]]

# lowercase and filter
df['full_text'] = df['full_text'].str.lower()
df = df[(df['word_count'] >= 80) & (df['word_count'] <= 850)].copy()
df.reset_index(drop=True, inplace=True)

# write to csv
df.to_csv("caes/cleaned_caes.csv", index=False)

df

Unnamed: 0,essay_id,full_text,score,word_count,prompt_name
0,0,hola carlos ! hace mucho que no te veo hombre ...,B1,212,Carta amigo
1,1,"hola querida sara , espero que estes muy bien ...",A2,175,Carta amigo
2,2,como estais ? ! hola amigos ! ? espero que tod...,A1,104,Postal vacaciones
3,3,"cuando llegué a el destino , le propuso que si...",A2,135,Postal vacaciones
4,4,¿ que tal ? un beso espero que todo va bien po...,A1,80,Familia
...,...,...,...,...,...
26648,26648,buenas tardes ! 29_de_mayo_de_2019 salamanca q...,B1,299,Carta amigo
26649,26649,tengo la seguridad de mí mismo para contener e...,B2,157,Solicitud admisión
26650,26650,"distinguidos señor / señra : hola , gerente . ...",B1,178,Reclamación compañía aérea
26651,26651,le escribo esta carta porque me gustaría inmen...,B2,303,Solicitud admisión
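
The cleaning steps above can be tried on a tiny frame before running them on the full corpus. This is a minimal sketch with invented rows, not the real CAES data; it assumes only the three raw column names used in the cell above (`texto`, `nivel`, `tema`) and shows the columns being renamed before any step refers to the new names.

```python
import pandas as pd

# Toy stand-in for the raw CAES columns used above (values are invented).
raw = pd.DataFrame({
    "texto": ["Hola amigo " * 50, None, "Muy corto"],
    "nivel": ["B1", "A2", "A1"],
    "tema": ["Carta amigo", "Familia", "Postal vacaciones"],
})

# Rename first, so the new column names exist before they are referenced.
clean = raw.rename(
    columns={"texto": "full_text", "nivel": "score", "tema": "prompt_name"}
)[["full_text", "score", "prompt_name"]]

# Drop rows with missing text or level, then count whitespace-separated words.
clean = clean.dropna(subset=["full_text", "score"]).copy()
clean["word_count"] = clean["full_text"].str.split().str.len().astype(int)

# Keep essays between 80 and 850 words, inclusive on both ends.
clean = clean[clean["word_count"].between(80, 850)].reset_index(drop=True)
print(clean[["score", "word_count"]])
```

Only the first toy row survives: the second has no text and the third is shorter than 80 words.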


## Tagging

Tag essays for stance, formality and sentiment.
* Stance uses Mistral-7B
* Formality uses formality-classifier-mdeberta-v3-base
* Sentiment uses roberta-large-multilingual-sentiment

### Stance

* Pro: **0**
* Con: **1**
* Neutral: **2**

In [11]:
import requests

# number of essays to analyze through the pipeline (limit)
n = len(df)

# convert results list to numeric labels
stance_mapping = {
    'PRO': 0,
    'CON': 1,
    'NEUTRAL': 2
}

# calls the local Ollama API (Mistral instance) with a given text prompt
def query_ollama(prompt, model, temperature=0):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model,
              "prompt": prompt,
              "stream": False,
              "options": {
                  "temperature": temperature
              }
        }
    )
    # fail loudly on HTTP errors instead of a confusing KeyError below
    response.raise_for_status()
    return response.json()["response"].strip()

# stance prompt with few-shot examples taken from related work
def get_stance_prompt(essay):
    return (
        "Stance classification is the task of determining the expressed or implied opinion, or stance, of a statement"
        " toward a certain, specified target. The following statements are social media posts expressing opinions about entities.\n"
        "Each statement can either be 'PRO' or 'CON' toward their associated entity.\n"
        "entity: Atheism\n"
        "statement: Leaving Christianity enables you to love the people you once rejected. #freethinker #Christianity #SemST\n"
        "stance: PRO\n"
        "entity: Feminist Movement\n"
        "statement: Always a delight to see chest-drumming alpha males hiss and scuttle backwards up the wall when a feminist enters the room. #manly #SemST\n"
        "stance: PRO\n"
        "entity: Christianity\n"
        "statement: AlharbiF I’ll bomb anything I can get my hands on, especially if THEY aren’t christian. #graham2016 #GOP #SemST\n"
        "stance: CON\n"
        "entity: Hillary Clinton\n"
        "statement: Would you wanna be in a long term relationship with some bitch that hides her emails, & lies to your face? Then #Dontvote #SemST\n"
        "stance: CON\n"
        "Analyze the following statement and determine its stance towards the entity.\n"
        "Respond with a single word: 'PRO' or 'CON'. Only return the stance as a single word, and no other text.\n"
        f"statement:\n{essay}\n"
        "stance:"
    )

# standalone stance classifier function.
# classifies a text into PRO, CON or NEUTRAL
def stance_classifier(text):
    prompt = get_stance_prompt(text)
    response = query_ollama(prompt, model="mistral")

    response_upper = response.upper()
    if 'CON' in response_upper:
        return stance_mapping['CON']
    elif 'PRO' in response_upper:
        return stance_mapping['PRO']
    elif 'NEUTRAL' in response_upper:
        return stance_mapping['NEUTRAL']
    else:
        return None  # unclear
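
The substring checks in `stance_classifier` will also fire on longer words that happen to contain a label (for example "CON" inside "CONSIDERED"). As a hedged alternative, the sketch below matches labels only as whole words. `parse_stance`, `STANCE_LABELS` and `_LABEL_RE` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import re

# Same numeric codes as stance_mapping above.
STANCE_LABELS = {"PRO": 0, "CON": 1, "NEUTRAL": 2}

# Match a label only as a whole word, so "CONSIDERED" or "PROBABLY" do not count.
_LABEL_RE = re.compile(r"\b(PRO|CON|NEUTRAL)\b")

def parse_stance(response):
    """Return the numeric label for the first whole-word match, or None."""
    match = _LABEL_RE.search(response.upper())
    return STANCE_LABELS[match.group(1)] if match else None

print(parse_stance("stance: PRO"))                  # 0
print(parse_stance("Probably a considered reply"))  # None
print(parse_stance("neutral"))                      # 2
```

With the original substring test, the second example would be tagged as CON. Whether the stricter rule is preferable depends on how verbose the model's replies are in practice.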

Loop for full stance classification:

In [14]:
# send data to LLM for stance tagging. get results in list.
# Main loop for tagging stance
stance_results = []
for i in range(n):
    essay_text = df["full_text"].iloc[i]
    response = stance_classifier(essay_text)
    stance_results.append(response)
    print(f"Tagging stance # {i}/{n} - {response}")

stance_results

Tagging stance # 0/26653 - 0
Tagging stance # 1/26653 - 0
Tagging stance # 2/26653 - 2
Tagging stance # 3/26653 - 0
Tagging stance # 4/26653 - 0
Tagging stance # 5/26653 - 0
Tagging stance # 6/26653 - 1
Tagging stance # 7/26653 - 0
Tagging stance # 8/26653 - 0
Tagging stance # 9/26653 - 2
Tagging stance # 10/26653 - 0
Tagging stance # 11/26653 - 2
Tagging stance # 12/26653 - 2
Tagging stance # 13/26653 - 1
Tagging stance # 14/26653 - 0
Tagging stance # 15/26653 - 0
Tagging stance # 16/26653 - 0
Tagging stance # 17/26653 - 1
Tagging stance # 18/26653 - 1
Tagging stance # 19/26653 - 2
Tagging stance # 20/26653 - 1
Tagging stance # 21/26653 - 0
Tagging stance # 22/26653 - 2
Tagging stance # 23/26653 - 2
Tagging stance # 24/26653 - 0
Tagging stance # 25/26653 - 0
Tagging stance # 26/26653 - 0
Tagging stance # 27/26653 - 2
Tagging stance # 28/26653 - 0
Tagging stance # 29/26653 - 2
Tagging stance # 30/26653 - 2
Tagging stance # 31/26653 - 2
Tagging stance # 32/26653 - 0
Tagging stance # 33/

[0,
 0,
 2,
 0,
 0,
 0,
 1,
 0,
 0,
 2,
 0,
 2,
 2,
 1,
 0,
 0,
 0,
 1,
 1,
 2,
 1,
 0,
 2,
 2,
 0,
 0,
 0,
 2,
 0,
 2,
 2,
 2,
 0,
 0,
 2,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 2,
 None,
 None,
 0,
 0,
 1,
 0,
 1,
 0,
 2,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 2,
 0,
 1,
 2,
 0,
 2,
 1,
 2,
 0,
 1,
 2,
 2,
 2,
 1,
 0,
 0,
 0,
 0,
 1,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 2,
 2,
 0,
 0,
 2,
 0,
 0,
 0,
 1,
 2,
 2,
 0,
 1,
 0,
 None,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 2,
 2,
 2,
 0,
 0,
 2,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 0,
 0,
 0,
 2,
 0,
 0,
 0,
 1,
 2,
 2,
 0,
 2,
 0,
 0,
 2,
 0,
 0,
 1,
 2,
 0,
 0,
 0,
 2,
 0,
 0,
 2,
 0,
 1,
 1,
 2,
 1,
 0,
 0,
 2,
 0,
 2,
 0,
 2,
 0,
 2,
 0,
 2,
 2,
 0,
 0,
 1,
 0,
 None,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 2,
 0,
 0,
 2,
 0,
 0,
 0,
 0,
 1,
 0,
 2,
 0,
 0,
 None,
 2,
 None,
 None,
 1,
 0,
 None,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 2,
 0,
 0,
 2,
 1,
 0,
 0,
 0,
 0,
 0,
 2,
 2,
 0,
 2,
 0,
 1,
 0,
 2,


### Formality

* Formal: **0**
* Informal: **1**
* Neutral: **2**

In [64]:
from transformers import pipeline

# load model
pipe = pipeline("text-classification", model="LenDigLearn/formality-classifier-mdeberta-v3-base")

id2formality = {"formal": 0, "informal": 1, "neutral": 2}

# extract n texts from full dataframe
texts = df["full_text"].iloc[:n].tolist()

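def formality_classifier(text, threshold=0.50):
    # read the prediction by key rather than relying on dict ordering
    pred = pipe(text)[0]
    label, score = pred["label"], pred["score"]
    label_idx = id2formality[label]

    # keep the label only when the model is confident enough
    if score > threshold:
        return label_idx
    else:
        return None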

Device set to use cpu
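
`formality_classifier` handles one text per call. A text-classification pipeline can also be called with a list of texts, which may be faster on large inputs. The sketch below shows that shape under stated assumptions: `classify_formality_batch` and `label2id` are hypothetical names, `batch_size` and `truncation` are passed as call options to the pipeline, and `fake_pipe` is a stand-in so the example runs without downloading the model. In the notebook, `pipe` would be passed in place of `fake_pipe`.

```python
def classify_formality_batch(pipe, texts, label2id, threshold=0.5, batch_size=16):
    """Classify many texts in one pipeline call; low-confidence results become None."""
    # A text-classification pipeline returns one {"label", "score"} dict per input text.
    preds = pipe(texts, batch_size=batch_size, truncation=True)
    return [
        label2id[p["label"]] if p["score"] > threshold else None
        for p in preds
    ]

# Stand-in for the real pipeline (invented scores), used only to make the sketch runnable.
def fake_pipe(texts, batch_size=16, truncation=True):
    return [
        {"label": "informal", "score": 0.97} if "hola" in t
        else {"label": "formal", "score": 0.41}
        for t in texts
    ]

label2id = {"formal": 0, "informal": 1, "neutral": 2}
sample = ["hola carlos !", "distinguidos señores :"]
print(classify_formality_batch(fake_pipe, sample, label2id))  # [1, None]
```

The second text falls below the 0.5 threshold in the stand-in, so it is left unlabeled, mirroring the `None` branch of `formality_classifier`.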


Loop for full formality classification:

In [65]:
# send data to the formality classifier. get results in list.
# Main loop for tagging formality

formality_results = []

for i in range(n):
    result = formality_classifier(texts[i])
    formality_results.append(result)
    print(f"Tagging formality # {i}/{n} - {result}")

formality_results

Tagging formality # 0/26653 - 1
Tagging formality # 1/26653 - 1
Tagging formality # 2/26653 - 1
Tagging formality # 3/26653 - 1
Tagging formality # 4/26653 - 1
Tagging formality # 5/26653 - 1
Tagging formality # 6/26653 - 1
Tagging formality # 7/26653 - 1
Tagging formality # 8/26653 - 1
Tagging formality # 9/26653 - 1
Tagging formality # 10/26653 - 1
Tagging formality # 11/26653 - 1
Tagging formality # 12/26653 - 1
Tagging formality # 13/26653 - 1
Tagging formality # 14/26653 - 1
Tagging formality # 15/26653 - 1
Tagging formality # 16/26653 - 1
Tagging formality # 17/26653 - 1
Tagging formality # 18/26653 - 0
Tagging formality # 19/26653 - 1
Tagging formality # 20/26653 - 1
Tagging formality # 21/26653 - 1
Tagging formality # 22/26653 - 1
Tagging formality # 23/26653 - 1
Tagging formality # 24/26653 - 1
Tagging formality # 25/26653 - 1
Tagging formality # 26/26653 - 1
Tagging formality # 27/26653 - 1
Tagging formality # 28/26653 - 1
Tagging formality # 29/26653 - 1
Tagging formality # 

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 None,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 None,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1

### Sentiment Analysis

* Positive: **0**
* Negative: **1**
* Neutral: **2**

In [70]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "clapAI/roberta-large-multilingual-sentiment"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.float16)

model.to(device)
model.eval()

# Retrieve labels from the model's configuration
id2label = model.config.id2label
id2sentiment = {"positive": 0, "negative": 1, "neutral": 2}

def sentiment_classifier(text, threshold=0.90):
    # NOTE: `threshold` is currently unused; the argmax label is always returned
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)

    # Run the forward pass without gradient tracking
    with torch.inference_mode():
        outputs = model(**inputs)
        predictions = outputs.logits.argmax(dim=-1)
    result_str = id2label[predictions.item()]
    return id2sentiment[result_str]
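
`sentiment_classifier` accepts a `threshold` argument but returns the argmax label regardless of confidence. If a confidence cut-off is wanted, as in the formality tagger, one option is to convert the logits to probabilities with a softmax and reject low-confidence predictions. The sketch below uses plain Python lists so it runs without a model; `label_with_threshold` is a hypothetical helper, and the returned index is the model's class index, which would still need to go through `id2label` and `id2sentiment`.

```python
import math

def label_with_threshold(logits, threshold=0.90):
    """Return (class index, probability) for the top class, or (None, probability) if below threshold."""
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > threshold:
        return best, probs[best]
    return None, probs[best]

print(label_with_threshold([5.0, 0.0, 0.0]))  # confident: class 0 is kept
print(label_with_threshold([1.0, 0.5, 0.0]))  # uncertain: label is None
```

Applying this to the notebook would change the results shown below, since some essays would then be left without a sentiment label.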


Loop for full sentiment classification:

In [71]:
# send data to the sentiment model. get results in list.
# Main loop for tagging sentiment

sentiment_results = []

for i in range(n):
    result = sentiment_classifier(texts[i])
    sentiment_results.append(result)
    print(f"Tagging sentiment # {i}/{n} - {result}")

sentiment_results

Tagging sentiment # 0/26653 - 1
Tagging sentiment # 1/26653 - 0
Tagging sentiment # 2/26653 - 0
Tagging sentiment # 3/26653 - 0
Tagging sentiment # 4/26653 - 0
Tagging sentiment # 5/26653 - 0
Tagging sentiment # 6/26653 - 1
Tagging sentiment # 7/26653 - 2
Tagging sentiment # 8/26653 - 0
Tagging sentiment # 9/26653 - 2
Tagging sentiment # 10/26653 - 0
Tagging sentiment # 11/26653 - 2
Tagging sentiment # 12/26653 - 0
Tagging sentiment # 13/26653 - 0
Tagging sentiment # 14/26653 - 0
Tagging sentiment # 15/26653 - 0
Tagging sentiment # 16/26653 - 0
Tagging sentiment # 17/26653 - 2
Tagging sentiment # 18/26653 - 2
Tagging sentiment # 19/26653 - 2
Tagging sentiment # 20/26653 - 1
Tagging sentiment # 21/26653 - 0
Tagging sentiment # 22/26653 - 2
Tagging sentiment # 23/26653 - 2
Tagging sentiment # 24/26653 - 2
Tagging sentiment # 25/26653 - 2
Tagging sentiment # 26/26653 - 0
Tagging sentiment # 27/26653 - 0
Tagging sentiment # 28/26653 - 2
Tagging sentiment # 29/26653 - 2
Tagging sentiment # 

[1,
 0,
 0,
 0,
 0,
 0,
 1,
 2,
 0,
 2,
 0,
 2,
 0,
 0,
 0,
 0,
 0,
 2,
 2,
 2,
 1,
 0,
 2,
 2,
 2,
 2,
 0,
 0,
 2,
 2,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 2,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 2,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 2,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 2,
 2,
 2,
 0,
 1,
 2,
 0,
 1,
 2,
 2,
 2,
 2,
 0,
 0,
 1,
 0,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 2,
 0,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 2,
 0,
 1,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 2,
 2,
 1,
 1,
 1,
 0,
 2,
 1,
 0,
 0,
 0,
 2,
 0,
 2,
 2,
 2,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 2,
 1,
 0,
 2,
 0,
 2,
 2,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 2,
 2,
 1,
 2,
 2,
 2,
 0,
 1,
 0,
 0,
 2,
 2,
 0,
 0,
 0,
 0,
 2,
 1,
 0,
 2,
 0,
 0,
 2,
 1,
 0,
 0,
 1,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 2,
 2,
 1,
 0,
 1,
 0,
 2,
 0,
 2,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 2,
 0,
 0,
 0,
 0,
 2,
 0,
 2,
 2,
 2,
 1,
 2,
 1,
 0,
 2,
 0,
 0,
 2,
 0,
 2,
 0,


### Add columns to the dataframe

In [72]:
# cut the dataframe to our n
df_copy = df.iloc[:n].copy()

# add each list to a column
df_copy['stance'] = stance_results
df_copy['formality'] = formality_results
df_copy['sentiment'] = sentiment_results

# store tagged dataset in csv
df_copy.to_csv('caes/tagged_caes.csv', index=False)
df_copy

Unnamed: 0,essay_id,full_text,score,word_count,prompt_name,stance,formality,sentiment
0,0,hola carlos ! hace mucho que no te veo hombre ...,B1,212,Carta amigo,0.0,1.0,1
1,1,"hola querida sara , espero que estes muy bien ...",A2,175,Carta amigo,0.0,1.0,0
2,2,como estais ? ! hola amigos ! ? espero que tod...,A1,104,Postal vacaciones,2.0,1.0,0
3,3,"cuando llegué a el destino , le propuso que si...",A2,135,Postal vacaciones,0.0,1.0,0
4,4,¿ que tal ? un beso espero que todo va bien po...,A1,80,Familia,0.0,1.0,0
...,...,...,...,...,...,...,...,...
26648,26648,buenas tardes ! 29_de_mayo_de_2019 salamanca q...,B1,299,Carta amigo,0.0,1.0,0
26649,26649,tengo la seguridad de mí mismo para contener e...,B2,157,Solicitud admisión,0.0,,0
26650,26650,"distinguidos señor / señra : hola , gerente . ...",B1,178,Reclamación compañía aérea,0.0,1.0,1
26651,26651,le escribo esta carta porque me gustaría inmen...,B2,303,Solicitud admisión,0.0,0.0,2
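
In the table above, `stance` and `formality` appear as `0.0`/`1.0` and some cells are empty. This is what pandas typically does when a numeric column contains `None`: the gaps are stored as NaN and the column becomes floating point. The sketch below, on invented values, shows one way to keep whole-number labels by casting to the nullable `Int64` dtype and to count how many essays were left unlabeled; `tags` is a hypothetical name.

```python
import pandas as pd

# Toy results in the same shape as stance_results / formality_results (values invented).
tags = pd.DataFrame({
    "stance": [0, None, 2],
    "formality": [1, 1, None],
})
print(tags.dtypes)  # None is stored as NaN, which forces a float dtype

# The nullable integer dtype keeps whole-number labels and marks gaps as <NA>.
tags = tags.astype("Int64")
print(tags)

# Number of unlabeled essays per column.
print(tags.isna().sum())
```

Note that writing such a frame with `to_csv` still leaves the missing cells empty; the cast only affects how the labels are typed and displayed.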
