# Politics of Emotions or Propaganda?
_Franco Reinaldo Bonifacini (41540A)_

## Libraries (Requirements)

In [51]:
import sys
!{sys.executable} -m pip install -r requirements.txt






In [52]:
import pandas as pd
import numpy as np
import torch
from joblib import Parallel, delayed

In [28]:
print(f"CUDA: {torch.cuda.is_available()}")
print(f"MPS: {torch.mps.is_available()}")

CUDA: False
MPS: False


My computer does not have an NVIDIA GPU, so CUDA is not available. PyTorch will therefore run on the CPU, which is slower.

In [30]:
# Use the GPU (CUDA) if one is available; otherwise fall back to the CPU (on this machine, always the CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dataset
The dataset was scraped beforehand from the official website of the Argentine presidency (https://www.casarosada.gob.ar/informacion/discursos), cleaned automatically with the "Raw Data Manipulation" code, and finally cleaned by hand, since some particular cases could not be handled by that code.

In [13]:
df_speeches = pd.read_excel(r"C:\Users\franc\Documents\GitHub\nlp_project\df_final.xlsx")

In [14]:
df_speeches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1274 entries, 0 to 1273
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             1274 non-null   datetime64[ns]
 1   president        1274 non-null   object        
 2   content          1274 non-null   object        
 3   cleaned_content  1274 non-null   object        
dtypes: datetime64[ns](1), object(3)
memory usage: 39.9+ KB


In [15]:
df_speeches

| | date | president | content | cleaned_content |
|---|---|---|---|---|
| 0 | 2024-01-17 | Javier Milei | Buenas tardes, muchas gracias: hoy estoy acá p... | Buenas tardes, muchas gracias: hoy estoy acá p... |
| 1 | 2024-04-02 | Javier Milei | Hoy estamos aquí reunidos a 42 años del inicio... | Hoy estamos aquí reunidos a 42 años del inicio... |
| 2 | 2024-02-24 | Javier Milei | Hola a todos. Yo soy el león. Yo también los a... | Hola a todos. Yo soy el león. Yo también los a... |
| 3 | 2024-01-26 | Javier Milei | En primer lugar, quiero comenzar por agradecer... | En primer lugar, quiero comenzar por agradecer... |
| 4 | 2024-02-08 | Javier Milei | Buenos días. Quiero compartir con ustedes una ... | Buenos días. Quiero compartir con ustedes una ... |
| ... | ... | ... | ... | ... |
| 1269 | 2023-12-08 | Alberto Fernández | Querido Pueblo Argentino: Hace exactamente 40 ... | Querido Pueblo Argentino: Hace exactamente 40 ... |
| 1270 | 2023-12-10 | Javier Milei | Hola a todos. Señores ministros de la Corte, s... | Hola a todos. Señores ministros de la Corte, s... |
| 1271 | 2023-12-10 | Javier Milei | Hola a todos. ¡Viva la libertad, carajo! ¡Viva... | Hola a todos. ¡Viva la libertad, carajo! ¡Viva... |
| 1272 | 2023-12-20 | Javier Milei | Argentinos, hoy es un día histórico para nuest... | Argentinos, hoy es un día histórico para nuest... |


# Data Pre-Processing

## Split Speeches into Sentences

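Each speech is split into sentences, and any sentence longer than 50 words is split further into chunks of at most 50 words. This keeps the inputs short enough for the models to process efficiently and avoids truncating the speeches at the models' input limit.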

In [20]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\franc\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [21]:
def split_speeches(text, max_words=50, min_words=5):
    """
    Splits the text into sentences using NLTK, and further splits long sentences into word-based chunks.
    Sentences with fewer than `min_words` are merged with the previous chunk if possible.

    Parameters:
        text (str): The full input text.
        max_words (int): Maximum number of words per chunk.
        min_words (int): Minimum number of words per chunk. Shorter ones are merged with the previous.

    Returns:
        List[str]: A list of cleaned text chunks.
    """
    sentences = sent_tokenize(text)
    chunks = []

    for sentence in sentences:
        words = sentence.strip().split()

        if not words:
            continue

        # Split long sentences
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]

        # Remaining words
        leftover = " ".join(words)

        if len(words) < min_words:
            if chunks:
                # Merge with previous chunk
                chunks[-1] = chunks[-1] + " " + leftover
            else:
                # If it's the first sentence, just keep it
                chunks.append(leftover)
        else:
            chunks.append(leftover)

    return chunks
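
The chunking and merging rules can be checked on a toy input. The following is a minimal sketch, not the function used in this notebook: `chunk_sentences` mirrors the loop in `split_speeches`, and `naive_sentences` is a hypothetical regex-based splitter that stands in for NLTK's `sent_tokenize` so the example runs without downloading tokenizer data. The small `max_words` and `min_words` values are chosen only to make the behaviour visible.

```python
import re


def naive_sentences(text):
    """Hypothetical stand-in for sent_tokenize: split after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, max_words=50, min_words=5):
    """Apply the same chunking rules as split_speeches to a list of sentences."""
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        # Cut long sentences into pieces of at most max_words words
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        leftover = " ".join(words)
        # Very short pieces are appended to the previous chunk when there is one
        if len(words) < min_words and chunks:
            chunks[-1] = chunks[-1] + " " + leftover
        else:
            chunks.append(leftover)
    return chunks


text = "Hola a todos. Hoy estamos aquí reunidos para hablar de la libertad. Gracias."
demo = chunk_sentences(naive_sentences(text), max_words=5, min_words=3)
for chunk in demo:
    print(chunk)
```

In this toy run the nine-word sentence is cut after five words, and the one-word sentence "Gracias." is merged into the preceding chunk. Because short leftovers are merged after cutting, a chunk can end up slightly longer than `max_words`.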

In [22]:
df_speeches['sentences'] = df_speeches['cleaned_content'].apply(split_speeches)

In [23]:
df_split_speeches = df_speeches.explode('sentences')[['date', 'president', 'sentences']].reset_index(drop=True)

In [24]:
df_split_speeches

| | date | president | sentences |
|---|---|---|---|
| 0 | 2024-01-17 | Javier Milei | Buenas tardes, muchas gracias: hoy estoy acá p... |
| 1 | 2024-01-17 | Javier Milei | Lamentablemente en las últimas décadas, motiva... |
| 2 | 2024-01-17 | Javier Milei | Nosotros estamos, acá, para decirles que los e... |
| 3 | 2024-01-17 | Javier Milei | Créanme, nadie mejor que nosotros los argentin... |
| 4 | 2024-01-17 | Javier Milei | Cuando adoptamos el modelo de la libertad – al... |
| ... | ... | ... | ... |
| 51644 | 2023-12-30 | Javier Milei | Para finalizar, quiero una vez más, desearles ... |
| 51645 | 2023-12-30 | Javier Milei | Espero que puedan pasarlo en compañía de su fa... |
| 51646 | 2023-12-30 | Javier Milei | Este puede ser el año en que demos vuelta un s... |
| 51647 | 2023-12-30 | Javier Milei | Mi deseo - para este Nuevo Año - es que la dir... |


## Translate Sentences into English

In [41]:
from transformers import MarianMTModel, MarianTokenizer
from tqdm import tqdm

In [26]:
# Load the Spanish-to-English translation model, since the emotion classifier used later works on English text

language_model_name = "Helsinki-NLP/opus-mt-es-en"
language_tokenizer = MarianTokenizer.from_pretrained(language_model_name)
language_model = MarianMTModel.from_pretrained(language_model_name)

In [38]:
def translate(texts):
    """
    Translates a list of texts from Spanish to English using the MarianMT translation model loaded above.

    Parameters:
        texts (list of str): List of input text strings to be translated.

    Returns:
        list of str: List of translated text strings corresponding to each input text, in the same order.
    """

    language_model.to(device)

    inputs = language_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

    # Move tokenized tensors to the selected device (GPU if available, otherwise CPU)
    inputs = {key: val.to(device) for key, val in inputs.items()}

    translated = language_model.generate(**inputs)
    return [language_tokenizer.decode(t, skip_special_tokens=True) for t in translated]

In [53]:
def batch_translate(df, text_column, batch_size=100, n_jobs=None):
    """
    Translates texts in the `text_column` of the DataFrame `df` in batches,
    processing batches in parallel.

    Parameters:
        df (pandas.DataFrame): DataFrame containing the texts.
        text_column (str): Name of the column with the texts to translate.
        batch_size (int): Number of texts per batch for translation.
        n_jobs (int or None): Number of parallel joblib jobs. None (the default) normally means a single job; -1 uses all CPU cores.

    Returns:
        list: List with all translated texts.
    """

    texts = df[text_column].tolist()
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    results = Parallel(n_jobs=n_jobs)(
        delayed(translate)(batch) for batch in tqdm(batches, desc="Translating batches")
    )

    # Flatten the list of lists into a single list
    all_translations = [item for sublist in results for item in sublist]

    return all_translations
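
The batching and flattening logic can be verified without loading the translation model. This is a minimal sketch under stated assumptions: `make_batches` repeats the slicing used in `batch_translate`, and `fake_translate` is a hypothetical stand-in for `translate` that only tags each text. The last lines check the number of batches expected for the 51,648 sentences in `df_split_speeches` with `batch_size=1000`.

```python
import math


def make_batches(items, batch_size):
    """Slice a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def fake_translate(batch):
    """Hypothetical stand-in for translate(): tags each text instead of translating it."""
    return [f"EN({text})" for text in batch]


texts = [f"oración {i}" for i in range(7)]
batches = make_batches(texts, batch_size=3)

# Process each batch, then flatten the list of lists while keeping the input order
results = [fake_translate(batch) for batch in batches]
flat = [item for sublist in results for item in sublist]

print([len(batch) for batch in batches])
print(flat[0], flat[-1])

# Number of batches for 51,648 sentences in batches of 1,000
n_batches = math.ceil(51648 / 1000)
print(n_batches)
```

Because the results are collected in batch order, the flattened list lines up row by row with the original DataFrame, which is what allows the translations to be assigned back as a new column.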

In [54]:
translated_texts = batch_translate(df_split_speeches, text_column='sentences', batch_size=1000, n_jobs=-1)

Translating batches:  15%|█▌        | 8/52 [00:17<00:00, 57.88it/s]

KeyboardInterrupt: 

# Model

In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

In [None]:
# Create variables containing the model and the tokenizer
model_name = "SamLowe/roberta-base-go_emotions"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, attn_implementation="eager")

In [37]:
# Create the emotion classification pipeline
emotion_classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    device=device
)

Device set to use cpu


In [38]:
# Try a sentence produced by the translation model
example = ['Long live the fucking freedom.']

model_output = emotion_classifier(example)
model_output[0]

[{'label': 'anger', 'score': 0.6907389163970947},
 {'label': 'neutral', 'score': 0.14974887669086456},
 {'label': 'annoyance', 'score': 0.10762118548154831},
 {'label': 'sadness', 'score': 0.019939061254262924},
 {'label': 'joy', 'score': 0.01638145186007023},
 {'label': 'disgust', 'score': 0.01580151915550232},
 {'label': 'disapproval', 'score': 0.010370961390435696},
 {'label': 'disappointment', 'score': 0.009592698886990547},
 {'label': 'approval', 'score': 0.008287952281534672},
 {'label': 'excitement', 'score': 0.006921317894011736},
 {'label': 'fear', 'score': 0.006091337651014328},
 {'label': 'admiration', 'score': 0.005420122295618057},
 {'label': 'caring', 'score': 0.005269885994493961},
 {'label': 'amusement', 'score': 0.0051732794381678104},
 {'label': 'realization', 'score': 0.0050010597333312035},
 {'label': 'love', 'score': 0.004166232421994209},
 {'label': 'surprise', 'score': 0.0035567530430853367},
 {'label': 'grief', 'score': 0.002567428397014737},
 {'label': 'desire'

With these two models created (translation and classifier), I will translate and classify the text, and return the label and score of the most probable emotion in the given text.

In [None]:
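def top_emotion(text, classifier):
    """Return the label and score of the highest-scoring emotion for `text`."""
    # With top_k=None the pipeline returns a score for every label; [0] selects the scores of the first input
    results = classifier(text)[0]
    top = max(results, key=lambda x: x['score'])
    return pd.Series([top['label'], top['score']])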
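
The selection step can be checked without loading the classifier. This is a minimal sketch, assuming the classifier output is a list of label/score dictionaries as in the example output above. `sample_scores` is a rounded, hand-copied subset of that output, not a new model result, and `pick_top` is a hypothetical helper that repeats the `max` call used in `top_emotion`.

```python
# Rounded subset of the scores printed above for 'Long live the fucking freedom.'
sample_scores = [
    {"label": "anger", "score": 0.6907},
    {"label": "neutral", "score": 0.1497},
    {"label": "annoyance", "score": 0.1076},
    {"label": "sadness", "score": 0.0199},
]


def pick_top(scores):
    """Return (label, score) for the entry with the highest score."""
    top = max(scores, key=lambda item: item["score"])
    return top["label"], top["score"]


label, score = pick_top(sample_scores)
print(label, score)
```

Keeping only the top label discards the remaining scores, so a sentence with two similarly strong emotions is reported under just one of them.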

In [None]:
# Not run yet: apply top_emotion (defined above) to each text
#df_speeches[['emotion', 'score']] = df_speeches['content'].apply(top_emotion, classifier=emotion_classifier)