In this notebook, I train multiple classifiers to detect hope and nostalgia.

# Package Import

In [3]:
# General Packages
import pandas as pd
import numpy as np
import re # for text-cleaning
from google.colab import data_table, files
data_table.enable_dataframe_formatter() # to have tables which enable reading the text in full
import warnings
from joblib import dump, load

# Data Handling
!git config --global credential.helper store # To upload a Model to Huggingface
!sudo apt-get install git-lfs # To upload a Model to Huggingface
#huggingface_login =
#!huggingface-cli login # To upload a Model to Huggingface

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.naive_bayes import MultinomialNB

# set random seed for reproducibility
SEED_GLOBAL = 1984
np.random.seed(SEED_GLOBAL)

# NLP
import nltk
nltk.download('punkt')
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Transformer Packages (Laurer, 2023)
!pip install datasets
!pip install transformers==4.40.0 # in Colab, the Trainer raised an error when I did not install a recent transformers version
!pip install accelerate -U
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from transformers import pipeline, TrainingArguments, Trainer, logging
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use GPU (cuda) if available, otherwise use CPU



Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed dataset

# Data

### Dataset Import
The PolNos datasets for Nostalgia expressions are open source. You can download them via the [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L198GI). The Polyhope Dataset is permitted to use or redistribute only for non-commercial or academic-research purposes. It can be downloaded from the [HOPE at IberLEF 2024](https://codalab.lisn.upsaclay.fr/competitions/17714#participate-get_starting_kit) competition.

To make handling easier, I loaded them into my Github and will use them from there.

In [4]:
#https://www.geeksforgeeks.org/how-to-upload-folders-to-google-colab/
!git clone https://github.com/BeJa1996/political_hope_nostalgia/
!unzip political_hope_nostalgia/Training_Datasets.zip

Cloning into 'political_hope_nostalgia'...
remote: Enumerating objects: 60, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 60 (delta 21), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (60/60), 7.16 MiB | 6.78 MiB/s, done.
Resolving deltas: 100% (21/21), done.
Archive:  political_hope_nostalgia/Training_Datasets.zip
   creating: Training_Datasets/
  inflating: Training_Datasets/data_polnos_handcoding.csv  
  inflating: Training_Datasets/Task 2_Test_with_labels_English_PolyHope.csv  
  inflating: Training_Datasets/data_polnos_handcoding_validation.csv  


## Nostalgia Dataset

Müller and Proksch (2023a) created two datasets which I use to train and validate my classifiers. One is '*data_polnos_handcoding*', and the other is '*data_polnos_handcoding_validation*'. Both come from Müller and Proksch (2023b). The handcoding dataset contains 1200 sentences, each coded by four coders for whether it contains nostalgia. The validation dataset contains 3515 sentences that were classified as nostalgic by one of the authors' annotation methods and then manually validated by two coders.

Because only 219 sentences in the first dataset were coded as nostalgic by at least two human coders, I add the sentences from the second dataset that both coders coded as nostalgic. This also reduces the class imbalance of the dataset.
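
To make the labelling rule concrete, here is a minimal sketch (not the authors' code) of how a sentence can be marked as nostalgic once at least two coders agree. The function name `label_from_votes` and the toy votes are illustrative assumptions.

```python
def label_from_votes(votes, min_agreement=2):
    """Return 1 if at least `min_agreement` coders marked the sentence as nostalgic."""
    return int(sum(votes) >= min_agreement)

# Toy votes from four coders (1 = nostalgic, 0 = not nostalgic); illustrative only.
toy_votes = {
    "sentence_a": [0, 0, 0, 0],
    "sentence_b": [1, 0, 0, 0],
    "sentence_c": [1, 1, 0, 0],
    "sentence_d": [1, 1, 1, 1],
}
labels = {name: label_from_votes(votes) for name, votes in toy_votes.items()}
print(labels)
```

A sentence flagged by a single coder stays in the negative class under this rule, which trades some recall for more reliable positive examples.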

### Dataset Import

In [5]:
nost_handcoding = pd.read_csv('/content/Training_Datasets/data_polnos_handcoding.csv')
nost_validation = pd.read_csv('/content/Training_Datasets/data_polnos_handcoding_validation.csv')

### Dataset Overview

I want to first inspect the structure and content of the datasets.

#### Handcoding

In [6]:
nost_handcoding.shape

(1200, 29)

In [7]:
nost_handcoding.dtypes

doc_id                          object
countryname                     object
party                            int64
manifesto_id                    object
text                            object
cmp_code                        object
nostalgic_at_least_1             int64
nostalgic_at_least_2             int64
nostalgic_at_least_3             int64
nostalgic_at_least_4             int64
translation_at_least_1           int64
translation_at_least_2           int64
translation_at_least_3           int64
translation_at_least_4           int64
nostalgic_coder1                 int64
nostalgic_coder2                 int64
nostalgic_coder3                 int64
nostalgic_coder4                 int64
translation_coder1               int64
translation_coder2               int64
translation_coder3               int64
translation_coder4               int64
translation_agreement_coders     int64
nostalgia_sum                    int64
nostalgia_sum_emb                int64
nostalgia_emb            

The column nostalgia_agreement_coders indicates how many coders agreed that a text is nostalgic. Let us see how the texts differ by level of agreement.

In [8]:
nost_handcoding.loc[nost_handcoding['nostalgia_agreement_coders'] == 0,
 ['text', 'nostalgia_agreement_coders']].head()

Unnamed: 0,text,nostalgia_agreement_coders
0,"Economic injustice and cultural, historical an...",0
1,"a new Packaging Act will be adopted, which, am...",0
2,Nationals of countries with which France has n...,0
4,The ULA condemns the complete failure of the g...,0
5,legal certainty - necessary for citizens and i...,0


In [9]:
nost_handcoding.loc[nost_handcoding['nostalgia_agreement_coders'] == 1,
 ['text', 'nostalgia_agreement_coders']].head()

Unnamed: 0,text,nostalgia_agreement_coders
8,The NATO returned to the main and most importa...,1
11,Erase from our streets and squares any honorab...,1
29,Which shows the effectiveness of what we do fr...,1
31,Germany is a successful integration of the cou...,1
35,The past four years of work was also made a se...,1


In [10]:
nost_handcoding.loc[nost_handcoding['nostalgia_agreement_coders'] == 2,
 ['text', 'nostalgia_agreement_coders']].head()

Unnamed: 0,text,nostalgia_agreement_coders
3,7. The re-emerging LSDP their active involveme...,2
7,Restore bilateral migration strategy cooperati...,2
9,11) Restoring the minimum wage at €8.65 an hour.,2
34,The result has been a mere administrative dece...,2
42,Popular Alliance proposes: - Ending the cyclic...,2


In [11]:
nost_handcoding.loc[nost_handcoding['nostalgia_agreement_coders'] > 2,
 ['text', 'nostalgia_agreement_coders']].head()

Unnamed: 0,text,nostalgia_agreement_coders
13,"• Modern, on the personal development focused ...",3
15,We disagree with the globalization that aims t...,4
32,2. Socialist ideas emerged Lithuania Lithuania...,4
41,In the history of civilization at a time when ...,4
86,The course of national history is indispensabl...,4


In [12]:
print(nost_handcoding.groupby(['nostalgic_at_least_2'])['text'].count())

nostalgic_at_least_2
0    981
1    219
Name: text, dtype: int64


#### Validation

In [13]:
nost_validation.shape

(3515, 17)

In [14]:
nost_validation.dtypes

manifesto_id                     object
countryname                      object
text_pre                         object
text                             object
text_post                        object
party                             int64
party_family_recoded             object
cmp_code                         object
nostalgia_sentence_dummy_emb      int64
nostalgia_sentence_bert           int64
nostalgia_sentence_svm            int64
nostalgic_coding_coder1           int64
nostalgic_coding_coder2           int64
nostalgia_coded_both              int64
nostalgia_coded_at_least_one      int64
score_gpt                       float64
justification_gpt                object
dtype: object

In [15]:
nost_validation.loc[nost_validation['nostalgia_coded_both'] == 1,['text']].head()

Unnamed: 0,text
7,Here for the first time since independence in ...
14,Strengthen the United Kingdom and protect and ...
16,Danish book rental and ground traffic control ...
23,The culture is not only considered one of the ...
32,Compulsory school age Restore: restoring lasti...


In [16]:
nost_validation.groupby('nostalgia_coded_both')['nostalgia_coded_both'].count()

nostalgia_coded_both
0    3269
1     246
Name: nostalgia_coded_both, dtype: int64

### Dataset Preparation

I create the columns label and label_text, because the NLI pipeline from Laurer et al. (2023) requires them. Then I concatenate the two datasets and lowercase the texts.

In [17]:
nost_handcoding['label'] = np.where(nost_handcoding['nostalgia_agreement_coders'] >= 2, 1,0) # https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
nost_handcoding['label_text'] = np.where(nost_handcoding['nostalgia_agreement_coders'] >= 2,
                                         'Nostalgia','Not Nostalgia') # https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
nost_validation['label'] = np.where(nost_validation['nostalgia_coded_both'] == 1, 1,0) # https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
nost_validation['label_text'] = np.where(nost_validation['nostalgia_coded_both'] == 1,
                                         'Nostalgia','Not Nostalgia') # https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
nostalgia = nost_validation.loc[nost_validation['label'] == 1,['text', 'label', 'label_text']]
nostalgia = pd.concat([nostalgia, nost_handcoding[['text', 'label', 'label_text']]])
nostalgia.shape

The text does not contain many special characters, so lowercasing it is probably enough.

In [21]:
nostalgia['text'] = nostalgia.apply(lambda row: row['text'].lower(),axis=1)
nostalgia.groupby('label_text')['text'].count()

label_text
Nostalgia        465
Not Nostalgia    981
Name: text, dtype: int64

### Train, Validation, Test Split

In [22]:
nost_train, nost_test = train_test_split(nostalgia, test_size = .2, stratify = nostalgia['label'] )
nost_test, nost_validation = train_test_split(nost_test, test_size = .5, stratify = nost_test['label'] )

In [23]:
nost_train.groupby('label_text')['label_text'].count()

label_text
Nostalgia        372
Not Nostalgia    784
Name: label_text, dtype: int64

In [24]:
nost_test.groupby('label_text')['label_text'].count()

label_text
Nostalgia        46
Not Nostalgia    99
Name: label_text, dtype: int64

In [25]:
nost_validation.groupby('label_text')['label_text'].count()

label_text
Nostalgia        47
Not Nostalgia    98
Name: label_text, dtype: int64
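
The group sizes above follow from simple arithmetic. The sketch below reproduces them, assuming that `train_test_split` rounds a fractional test share up and gives the remainder to the training part; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of (first part, second part) for a fractional test_size, test share rounded up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 465 + 981  # nostalgic + non-nostalgic sentences
n_train, n_holdout = split_sizes(n_total, 0.2)
n_test, n_validation = split_sizes(n_holdout, 0.5)
print(n_train, n_test, n_validation)  # 1156 145 145
```

The result is an 80/10/10 split. Because the splits are stratified, each part keeps roughly the same share of nostalgic sentences.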

## Hope
The dataset comes from *PolyHope: Two-level hope speech detection from tweets* (Balouchzahi et al., 2023). It can be downloaded through the [HOPE at IberLEF 2024](https://codalab.lisn.upsaclay.fr/competitions/17714#learn_the_details-terms_and_conditions) competition.

### Dataset Import

In [26]:
hope = pd.read_csv('/content/Training_Datasets/Task 2_Test_with_labels_English_PolyHope.csv')

### Dataset Overview

Again, I first want to get an overview of the data.

In [27]:
hope.shape

(6192, 4)

In [28]:
hope.dtypes

text          object
binary        object
multiclass    object
id             int64
dtype: object

In [29]:
hope.groupby('binary')['binary'].count()

binary
Hope        3104
Not Hope    3088
Name: binary, dtype: int64

### Dataset Preparation

The texts contain many hashtags and emojis. I need to remove them because they will not appear in the YouTube content, yet the algorithm might learn from them. I want the algorithm to concentrate only on features that will also be present in the final dataset.

In [31]:
# from https://www.kaggle.com/code/tariqsays/tweets-cleaning-with-python
hope['old_text'] = hope['text']
hope['text'] = hope.apply(lambda row: row['text'].lower(),axis=1)
hope['text'] = hope.apply(lambda row: re.sub("@[A-Za-z0-9_]+","",
                                                     row['text']),axis=1)
hope['text'] = hope.apply(lambda row: re.sub("#[A-Za-z0-9_]+","",
                                                   row['text']),axis=1)
# I need a second pass, because anonymization led to #USER#, whose
# trailing '#' is not removed by the hashtag pattern above
hope['text'] = hope.apply(lambda row: re.sub("#","",
                                                   row['text']),axis=1)
hope['text'] = hope.apply(lambda row: re.sub(r"http\S+","",
                                                     row['text']),axis=1)
# escape the dot so that only a literal "www." prefix matches
hope['text'] = hope.apply(lambda row: re.sub(r"www\.\S+","",
                                                  row['text']),axis=1)
# keep basic punctuation (.,;:) and replace every other special character with a space
hope['text'] = hope.apply(lambda row: re.sub(r"[^a-z0-9.,;:]"," ",
                                                     row['text']),axis=1)

hope.head()

Unnamed: 0,text,binary,multiclass,id,old_text
0,"i m really liking this project, let s work t...",Hope,Realistic Hope,5820,"#USER# #USER# I'm really liking this project, ..."
1,oh shit really i would hope they d shed some...,Hope,Generalized Hope,4061,#USER# Oh shit really? I would hope they'd she...
2,"good morning, bud another good decision fr...",Hope,Generalized Hope,1621,"#USER# Good morning, Bud! 🥰 Another good decis..."
3,i aspire to have the level of delusion to beli...,Hope,Unrealistic Hope,1754,i aspire to have the level of delusion to beli...
4,projects are continuously attacked by hacker...,Not Hope,Not Hope,401,#USER# #USER# Projects are continuously attack...
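
For reuse on new texts, the cleaning steps can also be bundled into a single function that works on one string. This is a sketch under assumptions: `clean_tweet` is a hypothetical name, the example sentence is made up, and the dot in the `www` pattern is escaped so that only a literal dot matches.

```python
import re

def clean_tweet(text):
    """Apply the tweet-cleaning steps to a single string."""
    text = text.lower()
    text = re.sub(r"@[a-z0-9_]+", "", text)      # user mentions
    text = re.sub(r"#[a-z0-9_]+", "", text)      # hashtags
    text = text.replace("#", "")                 # leftover '#' from '#USER#'
    text = re.sub(r"http\S+", "", text)          # links starting with http
    text = re.sub(r"www\.\S+", "", text)         # links starting with www.
    text = re.sub(r"[^a-z0-9.,;:]", " ", text)   # keep letters, digits and .,;:
    return text

example = "#USER# Good morning! See https://example.org #hope"
print(repr(clean_tweet(example)))
```

Such a function could be applied to a column with `hope['text'].map(clean_tweet)` and reused unchanged on the texts that are classified later.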


In [32]:
hope['label'] = np.where(hope['binary'] == 'Hope', 1,0) # https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
hope['label_text'] = np.where(hope['binary'] == 'Hope', 'Hope','Not Hope') # https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
hope_prepared = hope[['text', 'label', 'label_text']]
len(hope_prepared)

6192

In [33]:
hope_prepared = hope_prepared[~hope_prepared.duplicated('text', keep = 'first')]
len(hope_prepared)

6185

### Train, Validation, Test Split

In [34]:
hope_train, hope_test_validation = train_test_split(hope_prepared, test_size = .2, stratify = hope_prepared['label'] )
hope_test, hope_validation = train_test_split(hope_test_validation, test_size = .5, stratify = hope_test_validation['label'] )

In [35]:
hope_train.groupby('label_text')['label_text'].count()

label_text
Hope        2480
Not Hope    2468
Name: label_text, dtype: int64

In [36]:
hope_test.groupby('label_text')['label_text'].count()

label_text
Hope        310
Not Hope    308
Name: label_text, dtype: int64

In [37]:
hope_validation.groupby('label_text')['label_text'].count()

label_text
Hope        310
Not Hope    309
Name: label_text, dtype: int64

# Analyses
I will try multiple models and run them separately, because I want to see the individual results and possibly adjust the number of epochs. I perform hyperparameter tuning only for Naive Bayes: according to Laurer (2023), it has little impact on transformer models, and it is too computationally intensive for this assignment. For the transformer models, I might increase the number of epochs if peak performance is reached in the last epoch, because that suggests additional training could improve performance further.
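
The epoch rule described above can be written as a small check. This is only a sketch: `peak_is_last_epoch` is a hypothetical helper, and the scores are made-up values, not results from this notebook.

```python
def peak_is_last_epoch(scores):
    """True if the best validation score was reached in the final epoch."""
    best_epoch = max(range(len(scores)), key=scores.__getitem__)
    return best_epoch == len(scores) - 1

# Hypothetical macro-F1 values per epoch.
still_improving = [0.71, 0.78, 0.81, 0.84]
plateaued = [0.71, 0.83, 0.84, 0.82]
print(peak_is_last_epoch(still_improving), peak_is_last_epoch(plateaued))
```

If the check returns True, training longer is worth trying. Otherwise the best checkpoint already lies inside the trained range.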

## Functions


### nb_training
Grid search to find the best classifier (from Atteveldt et al., 2021).

In [38]:
def nb_training(emotion, train_df):
  pipe = Pipeline( #https://cssbook.net/content/chapter11.html
    steps=[
        ("vectorizer", None),
        ("classifier",  MultinomialNB()),
      ]
  )
  grid = {
      "vectorizer": [CountVectorizer(stop_words='english'),
                    TfidfVectorizer(stop_words='english')],
      "vectorizer__tokenizer": [None,
                              TreebankWordTokenizer().tokenize,
                              WhitespaceTokenizer().tokenize],
      "vectorizer__ngram_range": [(1, 1), (1, 2), (1,3)],
      "vectorizer__max_df": [0.5, 1.0],
      "vectorizer__min_df": [1, 5, 10],
  }

  nb = GridSearchCV(
      estimator=pipe, n_jobs=-1, param_grid=grid, scoring="f1_macro",
      cv=5, refit=True #https://stackoverflow.com/a/71140866
  )
  nb.fit(train_df['text'], train_df['label_text'])

  # Download Model
  # First lines from https://stackoverflow.com/a/71140866
  estimator = nb.best_estimator_
  #dump(estimator, f"{emotion}_nb_model.joblib")
  #files.download(f"{emotion}_nb_model.joblib") # https://stackoverflow.com/a/50192371

  return nb
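
To get a feeling for the cost of this grid search, the following sketch counts the candidate settings and model fits. The option counts are read off the grid in `nb_training`; the variable names are illustrative.

```python
from itertools import product

# Number of options per hyperparameter, read off the grid in nb_training.
grid_options = {
    "vectorizer": 2,
    "vectorizer__tokenizer": 3,
    "vectorizer__ngram_range": 3,
    "vectorizer__max_df": 2,
    "vectorizer__min_df": 3,
}
n_candidates = sum(1 for _ in product(*(range(n) for n in grid_options.values())))
n_fits = n_candidates * 5  # cv=5 folds per candidate
print(n_candidates, n_fits)  # 108 540
```

With `refit=True`, one more fit on the full training data follows for the best candidate. For Naive Bayes this remains cheap, which is why tuning is feasible here.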

### clean_memory
Memory Cleaner (Laurer, 2023)

In [39]:
# helper function to clean memory and reduce risk of out-of-memory error
import gc
def clean_memory():
  #del(model)
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
  gc.collect()

clean_memory()

### Reformatting Functions
The functions are from Laurer (2023).

In [40]:
## function for reformatting the train set
def format_nli_trainset(df_train=None, hypo_label_dic=None, random_seed=42):
  print(f"Length of df_train before formatting step: {len(df_train)}.")
  length_original_data_train = len(df_train)

  df_train_lst = []
  for label_text, hypothesis in hypo_label_dic.items():
    ## entailment
    df_train_step = df_train[df_train.label_text == label_text].copy(deep=True)
    df_train_step["hypothesis"] = [hypothesis] * len(df_train_step)
    df_train_step["label"] = [0] * len(df_train_step)
    ## not_entailment
    df_train_step_not_entail = df_train[df_train.label_text != label_text].copy(deep=True)
    df_train_step_not_entail = df_train_step_not_entail.sample(n=min(len(df_train_step), len(df_train_step_not_entail)), random_state=random_seed)
    df_train_step_not_entail["hypothesis"] = [hypothesis] * len(df_train_step_not_entail)
    df_train_step_not_entail["label"] = [1] * len(df_train_step_not_entail)
    # append
    df_train_lst.append(pd.concat([df_train_step, df_train_step_not_entail]))
  df_train = pd.concat(df_train_lst)

  # shuffle
  df_train = df_train.sample(frac=1, random_state=random_seed)
  df_train["label"] = df_train.label.apply(int)
  df_train["label_nli_explicit"] = ["True" if label == 0 else "Not-True" for label in df_train["label"]]  # adding this just to simplify readability

  print(f"After adding not_entailment training examples, the training data was augmented to {len(df_train)} texts.")
  print(f"Max augmentation could be: len(df_train) * 2 = {length_original_data_train*2}. It can also be lower, if there are more entail examples than not-entail for a majority class.")

  return df_train.copy(deep=True)


## function for reformatting the test set
def format_nli_testset(df_test=None, hypo_label_dic=None):
  ## explode test dataset for N hypotheses
  hypothesis_lst = [value for key, value in hypo_label_dic.items()]
  print("Number of hypotheses/classes: ", len(hypothesis_lst))

  # label lists with 0 at alphabetical position of their true hypo, 1 for not-true hypos
  label_text_label_dic_explode = {}
  for key, value in hypo_label_dic.items():
    label_lst = [0 if value == hypo else 1 for hypo in hypothesis_lst]
    label_text_label_dic_explode[key] = label_lst

  df_test["label"] = df_test.label_text.map(label_text_label_dic_explode)
  df_test["hypothesis"] = [hypothesis_lst] * len(df_test)
  print(f"Original test set size: {len(df_test)}")

  # explode dataset to have K-1 additional rows with not_entail label and K-1 other hypotheses
  # ! after exploding, cannot sample anymore, because distorts the order to true label values, which needs to be preserved for evaluation code
  df_test = df_test.explode(["hypothesis", "label"])  # multi-column explode requires pd.__version__ >= '1.3.0'
  print(f"Test set size for NLI classification: {len(df_test)}\n")

  df_test["label_nli_explicit"] = ["True" if label == 0 else "Not-True" for label in df_test["label"]]  # adding this just to simplify readability

  return df_test.copy(deep=True)
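
The core idea of the test-set reformatting is that every text is paired with every hypothesis and that label 0 marks the true hypothesis. The sketch below shows this for one text in plain Python. The hypotheses are hypothetical wordings and `explode_for_nli` is an illustrative helper, not part of Laurer's code.

```python
# Hypothetical hypotheses; the wording used for the actual models may differ.
hypo_label_dic = {
    "Nostalgia": "The text expresses nostalgia.",
    "Not Nostalgia": "The text does not express nostalgia.",
}

def explode_for_nli(text, label_text, hypo_label_dic):
    """Pair one text with every hypothesis; label 0 = entailment, 1 = not entailment."""
    rows = []
    for candidate_label, hypothesis in hypo_label_dic.items():
        rows.append({
            "text": text,
            "hypothesis": hypothesis,
            "label": 0 if candidate_label == label_text else 1,
        })
    return rows

rows = explode_for_nli("we will restore the old minimum wage.", "Nostalgia", hypo_label_dic)
for row in rows:
    print(row["label"], row["hypothesis"])
```

With two classes, the test set therefore doubles in size, and the row order within each text must be preserved for the evaluation.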


### Metrics
Adapted from Laurer (2023). I moved the core part of Laurer's function into its own function, because I need it independently of the compute_metrics functions.

In [41]:
def metrics_core (true_labels, predicted_labels):
  precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
  precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='micro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
  acc_balanced = balanced_accuracy_score(true_labels, predicted_labels)
  acc_not_balanced = accuracy_score(true_labels, predicted_labels)

  metrics = {
      'accuracy': acc_not_balanced,
      'accuracy_balanced': acc_balanced,
      'f1_macro': f1_macro,
      'f1_micro': f1_micro,
      'precision_macro': precision_macro,
      'recall_macro': recall_macro,
      'precision_micro': precision_micro,
      'recall_micro': recall_micro,
  }
  return metrics
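
To show what these metrics measure on imbalanced data, here is a hand-computed sketch in plain Python. `binary_report` is a hypothetical helper and the labels are toy values. scikit-learn handles some edge cases differently, so this is an illustration rather than a replacement.

```python
def binary_report(y_true, y_pred):
    """Accuracy, balanced accuracy and macro F1, computed by hand."""
    classes = sorted(set(y_true) | set(y_pred))
    recalls, f1_scores = [], []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1_scores.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "accuracy_balanced": sum(recalls) / len(recalls),
        "f1_macro": sum(f1_scores) / len(f1_scores),
    }

# Toy labels: two positives among eight texts, one of them missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
report = binary_report(y_true, y_pred)
print({name: round(value, 3) for name, value in report.items()})
```

Plain accuracy looks high in this toy case because the majority class dominates, while balanced accuracy and macro F1 weight both classes equally.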

In [42]:
def compute_metrics_standard(eval_pred): # https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/3_tune_bert.ipynb
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax on each row (axis=1) in the tensor

        # metrics
        metrics = metrics_core(labels, preds_max) # taken into extra function #bj

        print("Detailed metrics: \n",
              classification_report(labels,
                                    preds_max,
                                    digits=2,
                                zero_division='warn'), "\n")

        return metrics


def compute_metrics_nli_binary(eval_pred, label_text_alphabetical=None):
    predictions, labels = eval_pred

    ### reformat model output to enable calculation of standard metrics
    # split in chunks with predictions for each hypothesis for one unique premise
    def chunks(lst, n):  # Yield successive n-sized chunks from lst. https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    # for each chunk/premise, select the most likely hypothesis
    softmax = torch.nn.Softmax(dim=1)
    prediction_chunks_lst = list(chunks(predictions, len(set(label_text_alphabetical)) ))
    hypo_position_highest_prob = []
    for i, chunk in enumerate(prediction_chunks_lst):
        hypo_position_highest_prob.append(np.argmax(np.array(chunk)[:, 0]))  # only accesses the first column of the array, i.e. the entailment/true prediction logit of all hypos and takes the highest one

    label_chunks_lst = list(chunks(labels, len(set(label_text_alphabetical)) ))
    label_position_gold = []
    for chunk in label_chunks_lst:
        label_position_gold.append(np.argmin(chunk))  # argmin to detect the position of the 0 among the 1s

    ### calculate standard metrics
    metrics = metrics_core(label_position_gold, hypo_position_highest_prob) # taken into extra function #bj

    print("Detailed metrics: \n",
          classification_report(label_position_gold,
                                hypo_position_highest_prob,
                                digits=2,
                                zero_division='warn'), "\n")
    return metrics
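
The chunking logic in `compute_metrics_nli_binary` can be traced with a toy example in plain Python. The logits are made up. Each row holds the entailment and not-entailment logit for one text-hypothesis pair.

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Hypothetical logits [entailment, not_entailment] for 2 texts x 2 hypotheses.
logits = [
    [2.1, -1.0],   # text 1, hypothesis 0
    [-0.5, 1.3],   # text 1, hypothesis 1
    [-1.2, 0.9],   # text 2, hypothesis 0
    [1.7, -0.4],   # text 2, hypothesis 1
]
labels = [0, 1, 1, 0]  # 0 marks the true hypothesis within each chunk
n_hypotheses = 2

# Per text: position of the hypothesis with the highest entailment logit.
predicted = [
    max(range(len(chunk)), key=lambda i: chunk[i][0])
    for chunk in chunks(logits, n_hypotheses)
]
# Per text: position of the 0 among the 1s.
gold = [chunk.index(min(chunk)) for chunk in chunks(labels, n_hypotheses)]
print(predicted, gold)  # [0, 1] [0, 1]
```

Once both lists contain hypothesis positions, they can be compared with the standard classification metrics.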


### model_finetuning
In the following function, I consolidate the fine-tuning steps that Laurer (2023) uses for the NLI and the non-NLI setting. Where I took code from elsewhere, I indicate the source. I mark my own comments with (bj).

In [43]:
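def model_finetuning(emotion, df_train, df_test, model_name, seed, epochs = 10, nli = False, hypo_label_dict = {} ):
  """
    Fine-tunes and evaluates a transformer model from Hugging Face.

    With nli=True, the data is first reformatted into premise-hypothesis
    pairs using hypo_label_dict.

    Returns
    -------
    trainer
        The Trainer after training (the best checkpoint is loaded at the end)
    model
        The fine-tuned model
  """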
  clean_memory()
  model_short = model_name.split("/")[1]
  label_text_alphabetical = np.sort(df_train.label_text.unique())

  # Prepare Data (bj)
  print('\nData Preparation and Tokenizer Download\n')
  ## Extra Preparation NLI (bj)
  if nli == True:
      if not hypo_label_dict: # https://stackoverflow.com/a/23177452/23292830
        raise Exception("No Dictionary") #https://www.w3schools.com/python/gloss_python_raise.asp

      df_train = format_nli_trainset(df_train=df_train,
                                           hypo_label_dic=hypo_label_dict,
                                           random_seed=seed)
      df_test = format_nli_testset(df_test=df_test,
                                         hypo_label_dic=hypo_label_dict)
      config = None #bj
  else:
    # i changed the following dictionary creation, because the ids were wrong
    label2id = dict(zip(df_test['label_text'].unique(), df_test['label'].unique().tolist()))
    id2label = dict(zip(df_test['label'].unique().tolist(), df_test['label_text'].unique()))
    config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));
    print("\n", label2id, "\n")

  dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
  })

  # Tokenize (bj)
  tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

  def tokenize(examples):
    if nli == True:
      return tokenizer(examples["text"], examples["hypothesis"], truncation=True, max_length=512)
    return tokenizer(examples["text"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

  dataset = dataset.map(tokenize, batched=True)

  # Set Training Arguments and Hyperparameter (bj)
  fp16_bool = True if torch.cuda.is_available() else False
  warmup_ratio = 0.25 if nli == True else 0.06

  print('\nSetting Training Arguments and Hyperparameter\n')
  train_args = TrainingArguments(
      num_train_epochs=epochs,  # this can be increased, but higher values increase training time. Good values for NLI are between 3 and 20.
      learning_rate=2e-5,
      per_device_train_batch_size=16,  # if you get an out-of-memory error, reduce this value to 8 or 4 and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values here are always a multiple of 8.
      per_device_eval_batch_size=64,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
      #gradient_accumulation_steps=2, # Can be used in case of memory problems to reduce effective batch size. accumulates gradients over X steps, only then backward/update. decreases memory usage, but also slightly speed. (!adapt/halve batch size accordingly)
      warmup_ratio=warmup_ratio,  # a good normal default value is 0.06 for normal BERT-base models, but since we want to reuse prior NLI knowledge and avoid catastrophic forgetting, we set the value higher
      weight_decay=0.1,
      seed=seed,  # use the seed passed to the function instead of the global constant
      load_best_model_at_end=True,
      metric_for_best_model="f1_macro",
      fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
      fp16_full_eval=fp16_bool,
      evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
      #eval_steps=10_000,  # evaluate after n steps if evaluation_strategy!='steps'. defaults to logging_steps
      save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
      save_total_limit=5,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
      logging_strategy="epoch",
      report_to="all",  # "all"  # logging
      #push_to_hub=True, #relevant for uploads (bj)
      #hub_model_id=f"{model_short}_finetuned_{emotion}", # new parameter (bj)
      output_dir=f'./results/{emotion}/{model_short}',
      logging_dir=f'./logs/{emotion}/{model_short}',
  )

  # Train
  print("\nDownloading the Model\n")
  model = AutoModelForSequenceClassification.from_pretrained(model_name, config = config);

  print("\nTraining the Model\n")
  if nli == True:
    trainer = Trainer(
      model=model,
      tokenizer=tokenizer,
      args=train_args,
      train_dataset=dataset["train"],
      eval_dataset=dataset["test"],
      compute_metrics=lambda eval_pred: compute_metrics_nli_binary(eval_pred, label_text_alphabetical)
    )
  else:
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=train_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        compute_metrics=lambda eval_pred: compute_metrics_standard(eval_pred)
    )

  trainer.train()

  return(trainer, model)
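
To see what `warmup_ratio` means in practice, the sketch below estimates the number of optimizer steps and warm-up steps for the non-NLI nostalgia model. It assumes a single device, no gradient accumulation and rounding up at both steps; the exact numbers depend on the Trainer internals of the installed version. `warmup_steps` is a hypothetical helper.

```python
import math

def warmup_steps(n_train, batch_size, epochs, warmup_ratio):
    """Approximate (total steps, warm-up steps) on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, math.ceil(total_steps * warmup_ratio)

# 1156 training sentences, batch size 16, 10 epochs, warm-up ratio 0.06
print(warmup_steps(n_train=1156, batch_size=16, epochs=10, warmup_ratio=0.06))
```

Under these assumptions, the learning rate rises during roughly the first 44 of 730 steps. With the NLI ratio of 0.25, the warm-up phase covers a quarter of training.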


### inference
I created the following function to consolidate some of the steps Laurer (2023) goes through and to make them easier to reuse. I thought about merging it with the fine-tuning function, but the differences are quite large, so it is safer to have two functions. I mark my own additions with a #bj.

In [44]:
def inference(name, df_test, model_name, seed = None, nli = False, hypo_label_dict = {}):

  clean_memory()

  print('Initializing Tokenizer') #bj
  tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

  print('\nInitializing Pipeline')
  if nli == True:
    if not hypo_label_dict: # https://stackoverflow.com/a/23177452/23292830
        raise Exception("No Dictionary") #https://www.w3schools.com/python/gloss_python_raise.asp
    hypothesis_lst = list(hypo_label_dict.values())

    pipe_classifier = pipeline(
      "zero-shot-classification",
      model= model_name,
      tokenizer=tokenizer,
      framework="pt",
      device=device,
      set_seed = seed
    )
  else:
    pipe_classifier = pipeline(
      "text-classification",
      model=model_name,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
      tokenizer=tokenizer,
      framework="pt",
      device=device,
    )

  # Work on a copy so that the input dataframe is left unchanged  #bj
  df_inference = df_test.copy(deep=True)
  text_lst = df_inference["text"].tolist()

  # use the pipeline with your chosen model for inference (prediction)
  print('\nPredicting ...') #bj

  if nli == True:
    pipe_output = pipe_classifier(
        text_lst,  # input any list of texts here
        candidate_labels=hypothesis_lst,
        hypothesis_template="{}",
        multi_label=False,  # here you can decide if, for your task, only one hypothesis can be true, or multiple can be true
        batch_size=32  # reduce this number to 8 or 16 if you get an out-of-memory error
    )

    # extract the top-ranked hypothesis and its score from pipe_output
    hypothesis_pred_true_probability = []
    hypothesis_pred_true = []
    for dic in pipe_output:
      hypothesis_pred_true_probability.append(dic["scores"][0])
      hypothesis_pred_true.append(dic["labels"][0])

    # map the long hypotheses to their corresponding short label names
    hypothesis_label_dic_inference_inverted = {value: key for key, value in hypo_label_dict.items()}
    label_pred = [hypothesis_label_dic_inference_inverted[hypo] for hypo in hypothesis_pred_true]

    # add inference data to your original dataframe
    df_inference[f"{name}_pred"] = label_pred
    df_inference[f"{name}_prob"] = hypothesis_pred_true_probability

  else:
    pipe_output = pipe_classifier(
      text_lst,  # input any list of texts here
      batch_size=32  # reduce this number to 8 or 16 if you get an out-of-memory error
    )
    df_output = pd.DataFrame(pipe_output)

    df_inference[f"{name}_pred"] = df_output["label"].tolist()
    df_inference[f"{name}_prob"] = df_output["score"].round(2).tolist()


  # printing the classification report #bj
  print("\n")
  print(classification_report(df_inference["label_text"],  #bj
                              df_inference[f"{name}_pred"])) #bj

  return df_inference
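
To make the post-processing in the non-NLI branch of `inference` concrete, here is a minimal sketch that skips the model entirely. The `mock_output` list is invented for illustration and only mimics the list of `{"label", "score"}` dictionaries that a `text-classification` pipeline returns; the two texts and the `demo` prefix are likewise assumptions.

```python
import pandas as pd

# Hypothetical stand-in for pipe_classifier(text_lst): one dict per input text.
mock_output = [
    {"label": "Nostalgia", "score": 0.9712},
    {"label": "Not Nostalgia", "score": 0.8849},
]

# Invented example rows with the same columns the notebook's dataframes use.
df_demo = pd.DataFrame({
    "text": ["Those were the good old days.", "We will raise the budget next year."],
    "label_text": ["Nostalgia", "Not Nostalgia"],
})

name = "demo"  # illustrative column prefix
df_output = pd.DataFrame(mock_output)

# Same column logic as in inference(): predicted label plus its rounded score.
df_demo[f"{name}_pred"] = df_output["label"].tolist()
df_demo[f"{name}_prob"] = df_output["score"].round(2).tolist()

print(df_demo[["label_text", "demo_pred", "demo_prob"]])
```

Note that the `_prob` column holds the score of the predicted label, not the probability of one fixed class.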

## Nostalgia Analysis

### Naive Bayes
First, I use Naive Bayes as a 'classic' baseline (Atteveldt et al. 2021).
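
For readers unfamiliar with this kind of baseline, the sketch below shows one common way to build it: a TF-IDF bag-of-words representation fed into a multinomial Naive Bayes classifier. It is not the `nb_training` function used in this notebook (which is defined earlier and may, for example, tune hyperparameters); the toy texts and the name `baseline` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data; the real notebook uses nost_train['text'] and nost_train['label_text'].
toy_texts = [
    "I miss the good old days of our village",
    "Back then everything was better",
    "The committee will vote on the budget tomorrow",
    "The new law takes effect next month",
]
toy_labels = ["Nostalgia", "Nostalgia", "Not Nostalgia", "Not Nostalgia"]

# Vectorizer and classifier are chained so that raw text can be passed to fit/predict.
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(toy_texts, toy_labels)

toy_pred = baseline.predict(["I miss the old days"])
print(baseline.classes_, toy_pred)
```

Because the pipeline accepts raw strings, it can be evaluated with `classification_report` in the same way as the transformer models.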

In [45]:
nb = nb_training('nostalgia', nost_train)



I uploaded the estimator from 18.05 to my GitHub repository, and it was fetched by the initial download in this notebook. For comparability, I use the downloaded version from here on.
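
The persistence step relies on `joblib`. As a minimal sketch of the round trip, the example below saves a toy estimator to a temporary file and restores it; the file name `toy_nb_model.joblib` and the toy data are hypothetical and unrelated to the model stored in the repository.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data, only used to obtain a fitted estimator.
texts = ["I miss the old days", "The vote is tomorrow"]
labels = ["Nostalgia", "Not Nostalgia"]
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_nb_model.joblib")  # hypothetical file name
    dump(toy_model, path)   # serialize the fitted pipeline
    restored = load(path)   # read it back from disk

# The restored estimator should reproduce the original predictions.
same_predictions = list(toy_model.predict(texts)) == list(restored.predict(texts))
print(same_predictions)
```

Loading a stored estimator instead of retraining keeps the reported numbers stable across sessions, which is the point of using the downloaded version.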

In [48]:
# https://stackoverflow.com/a/71140866
nb = load("/content/political_hope_nostalgia/nostalgia_nb_model.joblib")

In [49]:
pred = nb.predict(nost_validation['text'])
print(classification_report(nost_validation['label_text'], pred))

               precision    recall  f1-score   support

    Nostalgia       0.74      0.79      0.76        47
Not Nostalgia       0.89      0.87      0.88        98

     accuracy                           0.84       145
    macro avg       0.82      0.83      0.82       145
 weighted avg       0.84      0.84      0.84       145



### Distilbert
First, I take the same model as Müller and Proksch (2023a) to see whether I can replicate their performance.

In [None]:
model_name = "distilbert/distilbert-base-uncased"
emotion = 'nostalgia'
distilbert_trainer, distilbert_model= model_finetuning(emotion,
                      nost_train,
                      nost_validation,
                      model_name,
                      SEED_GLOBAL)
nostalgia_distilbert_results = distilbert_trainer.evaluate()


Data Preparation and Tokenizer Download





 {'Not Nostalgia': 0, 'Nostalgia': 1} 





Setting Training Arguments and Hyperparameter


Downloading the Model






Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training the Model



Epoch,Training Loss,Validation Loss,Accuracy,Accuracy Balanced,F1 Macro,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,0.5306,0.258545,0.917241,0.905558,0.905558,0.917241,0.905558,0.905558,0.917241,0.917241
2,0.254,0.241304,0.924138,0.894051,0.909688,0.924138,0.932143,0.894051,0.924138,0.924138
3,0.178,0.218262,0.931034,0.910226,0.919444,0.931034,0.930803,0.910226,0.931034,0.931034
4,0.1101,0.212514,0.931034,0.921298,0.921298,0.931034,0.921298,0.921298,0.931034,0.931034
5,0.0664,0.247803,0.937931,0.931937,0.929555,0.937931,0.927298,0.931937,0.937931,0.937931
6,0.0353,0.285832,0.924138,0.921733,0.914802,0.924138,0.908947,0.921733,0.924138,0.924138
7,0.0136,0.294199,0.951724,0.942141,0.944599,0.951724,0.947189,0.942141,0.951724,0.951724
8,0.0066,0.324278,0.937931,0.9264,0.92877,0.937931,0.931269,0.9264,0.937931,0.937931
9,0.0026,0.331688,0.931034,0.921298,0.921298,0.931034,0.921298,0.921298,0.931034,0.931034
10,0.002,0.334916,0.931034,0.921298,0.921298,0.931034,0.921298,0.921298,0.931034,0.931034


Detailed metrics: 
               precision    recall  f1-score   support

           0       0.94      0.94      0.94        98
           1       0.87      0.87      0.87        47

    accuracy                           0.92       145
   macro avg       0.91      0.91      0.91       145
weighted avg       0.92      0.92      0.92       145
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.91      0.98      0.95        98
           1       0.95      0.81      0.87        47

    accuracy                           0.92       145
   macro avg       0.93      0.89      0.91       145
weighted avg       0.93      0.92      0.92       145
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.93      0.97      0.95        98
           1       0.93      0.85      0.89        47

    accuracy                           0.93       145
   macro avg       0.93      0.91      0.92       145
weighted a

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.96      0.97      0.96        98
           1       0.93      0.91      0.92        47

    accuracy                           0.95       145
   macro avg       0.95      0.94      0.94       145
weighted avg       0.95      0.95      0.95       145
 



Because the best model was saved to the Hugging Face Hub and I need to free up space, I delete the local results and logs here.

In [None]:
!rm -rf 'results/nostalgia/distilbert-base-uncased'
!rm -rf 'logs/nostalgia/distilbert-base-uncased'

### Deberta-v3
DeBERTa is the best openly available model according to the GLUE and SuperGLUE benchmarks (Wang et al., 2018, 2019). Therefore, I try this model as well.

In [None]:
model_name = "microsoft/deberta-v3-base"
emotion = 'nostalgia'
deberta_trainer, deberta_model= model_finetuning(emotion,
                      nost_train,
                      nost_validation,
                      model_name,
                      SEED_GLOBAL)
deberta_results = deberta_trainer.evaluate()


Data Preparation and Tokenizer Download







 {'Not Nostalgia': 0, 'Nostalgia': 1} 





Setting Training Arguments and Hyperparameter


Downloading the Model






Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training the Model



Epoch,Training Loss,Validation Loss,Accuracy,Accuracy Balanced,F1 Macro,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,0.5777,0.402753,0.675862,0.5,0.403292,0.675862,0.337931,0.5,0.675862,0.675862
2,0.3516,0.282566,0.910345,0.917065,0.901252,0.910345,0.890924,0.917065,0.910345,0.910345
3,0.2194,0.256059,0.903448,0.917499,0.895058,0.903448,0.883373,0.917499,0.903448,0.903448
4,0.1532,0.355665,0.917241,0.922167,0.908421,0.917241,0.89879,0.922167,0.917241,0.917241
5,0.0704,0.394519,0.917241,0.905558,0.905558,0.917241,0.905558,0.905558,0.917241,0.917241
6,0.0503,0.433093,0.924138,0.91066,0.912941,0.924138,0.915349,0.91066,0.924138,0.924138
7,0.0209,0.505534,0.924138,0.916196,0.9139,0.924138,0.911727,0.916196,0.924138,0.924138
8,0.0119,0.507633,0.931034,0.921298,0.921298,0.931034,0.921298,0.921298,0.931034,0.931034
9,0.0066,0.545054,0.924138,0.916196,0.9139,0.924138,0.911727,0.916196,0.924138,0.924138
10,0.0009,0.556739,0.924138,0.916196,0.9139,0.924138,0.911727,0.916196,0.924138,0.924138


Detailed metrics: 
               precision    recall  f1-score   support

           0       0.68      1.00      0.81        98
           1       0.00      0.00      0.00        47

    accuracy                           0.68       145
   macro avg       0.34      0.50      0.40       145
weighted avg       0.46      0.68      0.55       145
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.97      0.90      0.93        98
           1       0.81      0.94      0.87        47

    accuracy                           0.91       145
   macro avg       0.89      0.92      0.90       145
weighted avg       0.92      0.91      0.91       145
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.98      0.88      0.92        98
           1       0.79      0.96      0.87        47

    accuracy                           0.90       145
   macro avg       0.88      0.92      0.90       145
weighted a

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.95      0.95      0.95        98
           1       0.89      0.89      0.89        47

    accuracy                           0.93       145
   macro avg       0.92      0.92      0.92       145
weighted avg       0.93      0.93      0.93       145
 



In [None]:
!rm -rf 'results/nostalgia/deberta-v3-base'
!rm -rf 'logs/nostalgia/deberta-v3-base'

### Zero-Shot Model
NLI models have the advantage that the hypothesis lets me add some context to the classification (Laurer, 2024). The zero-shot model below is built on NLI tasks.
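
The sketch below illustrates how a hypothesis carries that context and how the pipeline output is mapped back to a short label. No model is loaded: `mock_output` is invented and only mimics the `sequence` / `labels` / `scores` structure that a `zero-shot-classification` pipeline returns, with labels sorted by descending score. The shortened hypotheses are illustrative, not the ones used in this notebook.

```python
# Short label -> hypothesis shown to the NLI model (shortened, illustrative wording).
hypotheses = {
    "Nostalgia": "The text expresses nostalgia.",
    "Not Nostalgia": "The text does not express nostalgia.",
}

# Hypothetical pipeline output for two texts; the scores are invented.
mock_output = [
    {
        "sequence": "Those were the good old days.",
        "labels": [hypotheses["Nostalgia"], hypotheses["Not Nostalgia"]],
        "scores": [0.91, 0.09],
    },
    {
        "sequence": "We will raise the budget next year.",
        "labels": [hypotheses["Not Nostalgia"], hypotheses["Nostalgia"]],
        "scores": [0.78, 0.22],
    },
]

# Invert the dictionary so the winning hypothesis can be translated back to its label.
hypothesis_to_label = {hypothesis: label for label, hypothesis in hypotheses.items()}

# The first entry of "labels"/"scores" is the top-ranked hypothesis.
predictions = [
    (hypothesis_to_label[out["labels"][0]], out["scores"][0]) for out in mock_output
]
print(predictions)
```

Changing the wording of a hypothesis changes what the model is asked, which is how task context enters the classification.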

#### Zero-Shot

In [None]:
nost_hypotheses = {
    "Nostalgia": "The text expresses nostalgia, it speaks positive about events and objects in the past or the past in general.",
    "Not Nostalgia": "The text does not express nostalgia."
    }
model = "MoritzLaurer/deberta-v3-base-zeroshot-v2.0"

dataset = inference(name = 'nost_zeroshot',
                    df_test = nost_train,
                    model_name = model,
                    seed = SEED_GLOBAL,
                    nli = True,
                    hypo_label_dict = nost_hypotheses)

Initializing Tokenizer






Initializing Pipeline




Predicting ...


               precision    recall  f1-score   support

    Nostalgia       0.74      0.20      0.32       372
Not Nostalgia       0.72      0.97      0.82       784

     accuracy                           0.72      1156
    macro avg       0.73      0.58      0.57      1156
 weighted avg       0.73      0.72      0.66      1156



#### Fine-Tuning

As the zero-shot results above show, some fine-tuning is nonetheless necessary.
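
Fine-tuning an NLI model requires the data to be recast as text-hypothesis pairs. The sketch below is one way to do this that is consistent with the sizes in the training log (1900 training pairs and 290 test pairs). It is an assumption about the formatting step, not the notebook's actual function, and the names `nli_format_train`, `nli_format_test` and the 0/1 label coding are hypothetical. The placeholder dataframes only reproduce the class counts reported for the nostalgia training and validation sets.

```python
import pandas as pd

ENTAILMENT, NOT_ENTAILMENT = 0, 1  # assumed label coding for the NLI task


def nli_format_train(df, hypo_label_dict, seed=1984):
    """Pair texts with hypotheses for NLI fine-tuning (illustrative sketch)."""
    parts = []
    for label_text, hypothesis in hypo_label_dict.items():
        # Texts of this class entail the class hypothesis.
        df_entail = df[df["label_text"] == label_text].copy()
        df_entail["hypothesis"] = hypothesis
        df_entail["label"] = ENTAILMENT
        # Sample at most as many texts from the other classes as counter-examples.
        df_other = df[df["label_text"] != label_text]
        n_sample = min(len(df_entail), len(df_other))
        df_not = df_other.sample(n=n_sample, random_state=seed).copy()
        df_not["hypothesis"] = hypothesis
        df_not["label"] = NOT_ENTAILMENT
        parts += [df_entail, df_not]
    return pd.concat(parts, ignore_index=True)


def nli_format_test(df, hypo_label_dict):
    """Pair every test text with every hypothesis."""
    parts = []
    for label_text, hypothesis in hypo_label_dict.items():
        df_h = df.copy()
        df_h["hypothesis"] = hypothesis
        df_h["label"] = (df_h["label_text"] != label_text).astype(int)
        parts.append(df_h)
    return pd.concat(parts, ignore_index=True)


hypotheses = {  # shortened for the example
    "Nostalgia": "The text expresses nostalgia.",
    "Not Nostalgia": "The text does not express nostalgia.",
}

# Placeholder rows with the reported class counts (372/784 and 47/98).
train_demo = pd.DataFrame({"label_text": ["Nostalgia"] * 372 + ["Not Nostalgia"] * 784})
train_demo["text"] = [f"placeholder text {i}" for i in range(len(train_demo))]
test_demo = pd.DataFrame({"label_text": ["Nostalgia"] * 47 + ["Not Nostalgia"] * 98})
test_demo["text"] = [f"placeholder text {i}" for i in range(len(test_demo))]

n_train_nli = len(nli_format_train(train_demo, hypotheses))
n_test_nli = len(nli_format_test(test_demo, hypotheses))
print(n_train_nli, n_test_nli)  # 1900 290
```

The training set grows to fewer than twice its size because the minority class can only supply 372 counter-examples for the majority-class hypothesis, whereas the test set is exactly doubled.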

In [None]:
zero_shot_trainer, zero_shot_model = model_finetuning('nostalgia',
                 nost_train,
                 nost_validation,
                 model,
                 SEED_GLOBAL,
                 epochs = 5,
                 nli = True,
                 hypo_label_dict = nost_hypotheses)



Data Preparation and Tokenizer Download

Length of df_train before formatting step: 1156.
After adding not_entailment training examples, the training data was augmented to 1900 texts.
Max augmentation could be: len(df_train) * 2 = 2312. It can also be lower, if there are more entail examples than not-entail for a majority class.
Number of hypotheses/classes:  2
Original test set size: 145
Test set size for NLI classification: 290







Setting Training Arguments and Hyperparameter


Downloading the Model






Training the Model



Epoch,Training Loss,Validation Loss,Accuracy,Accuracy Balanced,F1 Macro,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,0.4493,0.236019,0.910345,0.905992,0.899311,0.910345,0.893684,0.905992,0.910345,0.910345
2,0.1918,0.27187,0.931034,0.90469,0.918429,0.931034,0.937148,0.90469,0.931034,0.931034
3,0.072,0.342698,0.937931,0.937473,0.930292,0.937931,0.924211,0.937473,0.937931,0.937931
4,0.0208,0.382283,0.944828,0.942575,0.937715,0.944828,0.933355,0.942575,0.944828,0.944828
5,0.0079,0.424039,0.931034,0.915762,0.9204,0.931034,0.925556,0.915762,0.931034,0.931034


Detailed metrics: 
               precision    recall  f1-score   support

           0       0.84      0.89      0.87        47
           1       0.95      0.92      0.93        98

    accuracy                           0.91       145
   macro avg       0.89      0.91      0.90       145
weighted avg       0.91      0.91      0.91       145
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.95      0.83      0.89        47
           1       0.92      0.98      0.95        98

    accuracy                           0.93       145
   macro avg       0.94      0.90      0.92       145
weighted avg       0.93      0.93      0.93       145
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.88      0.94      0.91        47
           1       0.97      0.94      0.95        98

    accuracy                           0.94       145
   macro avg       0.92      0.94      0.93       145
weighted a

In [None]:
zero_shot_results = zero_shot_trainer.evaluate()
print(zero_shot_results)

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.90      0.94      0.92        47
           1       0.97      0.95      0.96        98

    accuracy                           0.94       145
   macro avg       0.93      0.94      0.94       145
weighted avg       0.95      0.94      0.95       145
 

{'eval_loss': 0.3823595643043518, 'eval_accuracy': 0.9448275862068966, 'eval_accuracy_balanced': 0.9425749023013461, 'eval_f1_macro': 0.9377147766323024, 'eval_f1_micro': 0.9448275862068966, 'eval_precision_macro': 0.9333545918367347, 'eval_recall_macro': 0.9425749023013461, 'eval_precision_micro': 0.9448275862068966, 'eval_recall_micro': 0.9448275862068966, 'eval_runtime': 0.6761, 'eval_samples_per_second': 428.912, 'eval_steps_per_second': 7.395, 'epoch': 5.0}


In [None]:
!rm -rf 'results/nostalgia/deberta-v3-base-zeroshot-v2.0'
!rm -rf 'logs/nostalgia/deberta-v3-base-zeroshot-v2.0'

### Evaluating the finetuned models
Now, I will evaluate the models on the test set.

In [50]:
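nost_nb_pred = nb.predict(nost_test['text'])
# Use nost_nb_pred here: `pred` still holds the validation-set predictions from In [49].
# The report shown below was printed with `pred` and therefore does not reflect
# test performance (compare the nb row of the summary table further down).
print(classification_report(nost_test['label_text'], nost_nb_pred))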

               precision    recall  f1-score   support

    Nostalgia       0.30      0.33      0.31        46
Not Nostalgia       0.67      0.65      0.66        99

     accuracy                           0.54       145
    macro avg       0.49      0.49      0.49       145
 weighted avg       0.56      0.54      0.55       145



In [None]:
model = "beja1996/distilbert-base-uncased_finetuned_nostalgia" # Getting the Fine-Tuned Model from Huggingface

nost_distilbert_eval = inference('distilbert', nost_test, model)

Initializing Tokenizer






Initializing Pipeline




Predicting ...


               precision    recall  f1-score   support

    Nostalgia       0.93      0.89      0.91        46
Not Nostalgia       0.95      0.97      0.96        99

     accuracy                           0.94       145
    macro avg       0.94      0.93      0.94       145
 weighted avg       0.94      0.94      0.94       145



In [None]:
model = "beja1996/deberta-v3-base_finetuned_nostalgia" # Getting the Fine-Tuned Model from Huggingface

nost_deberta_eval = inference('deberta', nost_test, model)

Initializing Tokenizer






Initializing Pipeline




Predicting ...


               precision    recall  f1-score   support

    Nostalgia       0.85      0.89      0.87        46
Not Nostalgia       0.95      0.93      0.94        99

     accuracy                           0.92       145
    macro avg       0.90      0.91      0.91       145
 weighted avg       0.92      0.92      0.92       145



In [None]:
model = "beja1996/deberta-v3-base-zeroshot-v2.0_finetuned_nostalgia" # Getting the Fine-Tuned Model from Huggingface

nost_zeroshot_eval = inference('zeroshot', nost_test, model, SEED_GLOBAL, True, nost_hypotheses)

Initializing Tokenizer






Initializing Pipeline




Predicting ...


               precision    recall  f1-score   support

    Nostalgia       0.88      0.91      0.89        46
Not Nostalgia       0.96      0.94      0.95        99

     accuracy                           0.93       145
    macro avg       0.92      0.93      0.92       145
 weighted avg       0.93      0.93      0.93       145



I did not wrap the following code in a function, because creating one would have taken more time than writing the code once for hope and once for nostalgia.

In [51]:
inferences = nost_test.copy()
inferences['nb_pred'] = nost_nb_pred
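# Column 1 of predict_proba follows nb.classes_, which scikit-learn sorts alphabetically,
# so nb_prob is P('Not Nostalgia'); the transformer *_prob columns hold the score of the predicted label.
inferences['nb_prob'] = nb.predict_proba(nost_test['text'])[:, 1].round(2)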
inferences = pd.concat([inferences,
           nost_distilbert_eval[['distilbert_pred', 'distilbert_prob']],
           nost_deberta_eval[['deberta_pred', 'deberta_prob']],
           nost_zeroshot_eval[['zeroshot_pred', 'zeroshot_prob']]
           ],
           axis= 1)

NameError: name 'nost_distilbert_eval' is not defined

In [None]:
models = ['nb', 'distilbert', 'deberta', 'zeroshot']
table = []
for model in models:
  metrics = metrics_core(inferences['label_text'], inferences[f'{model}_pred'])
  table.append(metrics)
nost_metrics = pd.DataFrame(table).round(2)
nost_metrics.index = models
nost_metrics

model,accuracy,accuracy_balanced,f1_macro,f1_micro,precision_macro,recall_macro,precision_micro,recall_micro
nb,0.86,0.84,0.84,0.86,0.84,0.84,0.86,0.86
distilbert,0.94,0.93,0.94,0.94,0.94,0.93,0.94,0.94
deberta,0.92,0.91,0.91,0.92,0.9,0.91,0.92,0.92
zeroshot,0.93,0.93,0.92,0.93,0.92,0.93,0.93,0.93
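
The `metrics_core` helper used for this table is defined elsewhere in the notebook. As a rough sketch of how metrics with these column names could be computed with scikit-learn, the hypothetical function `metrics_sketch` below returns the same eight keys; it is an assumption about the approach, not the notebook's implementation, and the toy labels are invented.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_recall_fscore_support,
)


def metrics_sketch(y_true, y_pred):
    """Return accuracy plus macro- and micro-averaged precision, recall and F1."""
    p_macro, r_macro, f1_macro, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    p_micro, r_micro, f1_micro, _ = precision_recall_fscore_support(
        y_true, y_pred, average="micro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "accuracy_balanced": balanced_accuracy_score(y_true, y_pred),
        "f1_macro": f1_macro,
        "f1_micro": f1_micro,
        "precision_macro": p_macro,
        "recall_macro": r_macro,
        "precision_micro": p_micro,
        "recall_micro": r_micro,
    }


# Invented toy labels: one of the two 'Nostalgia' texts is missed.
y_true = ["Nostalgia", "Nostalgia", "Not Nostalgia", "Not Nostalgia"]
y_pred = ["Nostalgia", "Not Nostalgia", "Not Nostalgia", "Not Nostalgia"]

toy_metrics = metrics_sketch(y_true, y_pred)
print({key: round(value, 2) for key, value in toy_metrics.items()})
```

For single-label classification the micro-averaged precision, recall and F1 all equal the accuracy, which is why those columns repeat in the table above; the macro averages are the ones that respond to class imbalance.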


In [None]:
nost_metrics.to_csv('nostalgia_metrics.csv')
files.download('nostalgia_metrics.csv')


## Hope Analysis

### Naive Bayes

In [52]:
nb = nb_training('hope', hope_train)



I uploaded the estimator from 18.05 to my GitHub repository, and it was fetched by the initial download in this notebook. For comparability, I use the downloaded version from here on.

In [53]:
# https://stackoverflow.com/a/71140866
nb = load("/content/political_hope_nostalgia/hope_nb_model.joblib")

In [54]:
pred = nb.predict(hope_validation['text'])
print(classification_report(hope_validation['label_text'], pred))

              precision    recall  f1-score   support

        Hope       0.79      0.84      0.81       310
    Not Hope       0.83      0.78      0.80       309

    accuracy                           0.81       619
   macro avg       0.81      0.81      0.81       619
weighted avg       0.81      0.81      0.81       619



### Distilbert

In [55]:
model_name = "distilbert/distilbert-base-uncased"
emotion = 'hope'
distilbert_trainer, distilbert_model= model_finetuning(emotion,
                      hope_train,
                      hope_validation,
                      model_name,
                      SEED_GLOBAL)
hope_distilbert_results = distilbert_trainer.evaluate()


Data Preparation and Tokenizer Download





 {'Hope': 1, 'Not Hope': 0} 





Setting Training Arguments and Hyperparameter


Downloading the Model






KeyboardInterrupt: 

Because the best model was saved to the Hugging Face Hub and I need to free up space, I delete the local results and logs here.

In [None]:
!rm -rf 'results/hope/distilbert-base-uncased'
!rm -rf 'logs/hope/distilbert-base-uncased'

### Deberta-v3

In [None]:
model_name = "microsoft/deberta-v3-base"
emotion = 'hope'
deberta_trainer, deberta_model= model_finetuning(emotion,
                      hope_train,
                      hope_validation,
                      model_name,
                      SEED_GLOBAL)
deberta_results = deberta_trainer.evaluate()


Data Preparation and Tokenizer Download


 {'Hope': 1, 'Not Hope': 0} 







Setting Training Arguments and Hyperparameter


Downloading the Model



Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training the Model



Epoch,Training Loss,Validation Loss,Accuracy,Accuracy Balanced,F1 Macro,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,0.5343,0.353415,0.862682,0.862647,0.862611,0.862682,0.863338,0.862647,0.862682,0.862682
2,0.331,0.37394,0.867528,0.867544,0.86752,0.867528,0.867659,0.867544,0.867528,0.867528
3,0.2223,0.417851,0.861066,0.861066,0.861066,0.861066,0.861066,0.861066,0.861066,0.861066
4,0.1519,0.51631,0.859451,0.859369,0.859074,0.859451,0.86313,0.859369,0.859451,0.859451
5,0.1061,0.647962,0.859451,0.859406,0.859332,0.859451,0.860558,0.859406,0.859451,0.859451
6,0.0627,0.76499,0.848142,0.848121,0.84811,0.848142,0.848383,0.848121,0.848142,0.848142
7,0.0431,0.889692,0.844911,0.844906,0.844908,0.844911,0.844927,0.844906,0.844911,0.844911
8,0.0378,0.912316,0.85622,0.856206,0.856206,0.85622,0.856318,0.856206,0.85622,0.85622
9,0.0291,0.927313,0.849758,0.849749,0.849751,0.849758,0.849793,0.849749,0.849758,0.849758
10,0.0233,0.92032,0.852989,0.852944,0.852864,0.852989,0.854075,0.852944,0.852989,0.852989


Detailed metrics: 
               precision    recall  f1-score   support

           0       0.88      0.84      0.86       309
           1       0.85      0.88      0.87       310

    accuracy                           0.86       619
   macro avg       0.86      0.86      0.86       619
weighted avg       0.86      0.86      0.86       619
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.86      0.88      0.87       309
           1       0.88      0.86      0.87       310

    accuracy                           0.87       619
   macro avg       0.87      0.87      0.87       619
weighted avg       0.87      0.87      0.87       619
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.86      0.86      0.86       309
           1       0.86      0.86      0.86       310

    accuracy                           0.86       619
   macro avg       0.86      0.86      0.86       619
weighted a

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.86      0.88      0.87       309
           1       0.88      0.86      0.87       310

    accuracy                           0.87       619
   macro avg       0.87      0.87      0.87       619
weighted avg       0.87      0.87      0.87       619
 



In [None]:
!rm -rf 'results/hope/deberta-v3-base'
!rm -rf 'logs/hope/deberta-v3-base'

### Zero-Shot Model

#### Zero-Shot

In [None]:
hope_hypotheses = {
    "Hope": "The text expresses hope, a future-oriented expectation, desire or wish towards a general or specific event.",
    "Not Hope": "The text does not express hope, wish, desire, or future-oriented expectation."
}
model = "MoritzLaurer/deberta-v3-base-zeroshot-v2.0"

dataset = inference(name = 'hope_zeroshot',
                    df_test = hope_train,
                    model_name = model,
                    seed = SEED_GLOBAL,
                    nli = True,
                    hypo_label_dict = hope_hypotheses)

Initializing Tokenizer





Initializing Pipeline

Predicting ...


              precision    recall  f1-score   support

        Hope       0.79      0.55      0.65      2480
    Not Hope       0.65      0.85      0.74      2468

    accuracy                           0.70      4948
   macro avg       0.72      0.70      0.69      4948
weighted avg       0.72      0.70      0.69      4948



#### Fine-Tuning

As the zero-shot results above show, some fine-tuning is nonetheless necessary.

In [None]:
zero_shot_trainer, zero_shot_model = model_finetuning('hope',
                 hope_train,
                 hope_validation,
                 model,
                 SEED_GLOBAL,
                 epochs = 5,
                 nli = True,
                 hypo_label_dict = hope_hypotheses)


Data Preparation and Tokenizer Download

Length of df_train before formatting step: 4948.
After adding not_entailment training examples, the training data was augmented to 9884 texts.
Max augmentation could be: len(df_train) * 2 = 9896. It can also be lower, if there are more entail examples than not-entail for a majority class.
Number of hypotheses/classes:  2
Original test set size: 619
Test set size for NLI classification: 1238







Setting Training Arguments and Hyperparameter


Downloading the Model






Training the Model



Epoch,Training Loss,Validation Loss,Accuracy,Accuracy Balanced,F1 Macro,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,0.4349,0.312319,0.87399,0.873995,0.87399,0.87399,0.874003,0.873995,0.87399,0.87399
2,0.2656,0.410221,0.862682,0.862637,0.862566,0.862682,0.863799,0.862637,0.862682,0.862682
3,0.1169,0.758412,0.861066,0.861045,0.861037,0.861066,0.861317,0.861045,0.861066,0.861066
4,0.0493,0.922697,0.852989,0.852887,0.852372,0.852989,0.858729,0.852887,0.852989,0.852989
5,0.0209,0.954101,0.848142,0.848069,0.847808,0.848142,0.851044,0.848069,0.848142,0.848142


Detailed metrics: 
               precision    recall  f1-score   support

           0       0.88      0.87      0.87       310
           1       0.87      0.88      0.87       309

    accuracy                           0.87       619
   macro avg       0.87      0.87      0.87       619
weighted avg       0.87      0.87      0.87       619
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.84      0.89      0.87       310
           1       0.88      0.83      0.86       309

    accuracy                           0.86       619
   macro avg       0.86      0.86      0.86       619
weighted avg       0.86      0.86      0.86       619
 

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.85      0.87      0.86       310
           1       0.87      0.85      0.86       309

    accuracy                           0.86       619
   macro avg       0.86      0.86      0.86       619
weighted a

In [None]:
zero_shot_results = zero_shot_trainer.evaluate()
print(zero_shot_results)

Detailed metrics: 
               precision    recall  f1-score   support

           0       0.88      0.87      0.87       310
           1       0.87      0.88      0.87       309

    accuracy                           0.87       619
   macro avg       0.87      0.87      0.87       619
weighted avg       0.87      0.87      0.87       619
 

{'eval_loss': 0.31230267882347107, 'eval_accuracy': 0.8739903069466882, 'eval_accuracy_balanced': 0.8739951978285834, 'eval_f1_macro': 0.8739899780770435, 'eval_f1_micro': 0.8739903069466882, 'eval_precision_macro': 0.8740030066396626, 'eval_recall_macro': 0.8739951978285834, 'eval_precision_micro': 0.8739903069466882, 'eval_recall_micro': 0.8739903069466882, 'eval_runtime': 2.5908, 'eval_samples_per_second': 477.85, 'eval_steps_per_second': 7.72, 'epoch': 5.0}


In [None]:
!rm -rf '/content/results/hope/deberta-v3-base-zeroshot-v2.0'
!rm -rf '/content/logs/hope/deberta-v3-base-zeroshot-v2.0'

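### Evaluating the fine-tuned models
Now I evaluate the four fine-tuned hope classifiers (Naive Bayes, DistilBERT, DeBERTa-v3, and the DeBERTa-v3 zero-shot model) on the held-out test set.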

In [None]:
# Load the fitted Naive Bayes model saved with joblib (see https://stackoverflow.com/a/71140866)
nb = load("/content/political_hope_nostalgia/hope_nb_model.joblib")

In [None]:
hope_nb_pred = nb.predict(hope_test['text'])
print(classification_report(hope_test['label_text'], hope_nb_pred))

              precision    recall  f1-score   support

        Hope       0.78      0.78      0.78       310
    Not Hope       0.78      0.78      0.78       308

    accuracy                           0.78       618
   macro avg       0.78      0.78      0.78       618
weighted avg       0.78      0.78      0.78       618



In [None]:
model = "beja1996/distilbert-base-uncased_finetuned_hope" # Getting the Fine-Tuned Model from Huggingface

hope_distilbert_eval = inference('distilbert', hope_test, model)

Initializing Tokenizer






Initializing Pipeline




Predicting ...


              precision    recall  f1-score   support

        Hope       0.84      0.88      0.86       310
    Not Hope       0.87      0.83      0.85       308

    accuracy                           0.86       618
   macro avg       0.86      0.86      0.86       618
weighted avg       0.86      0.86      0.86       618



In [None]:
model = "beja1996/deberta-v3-base_finetuned_hope" # Getting the Fine-Tuned Model from Huggingface

hope_deberta_eval = inference('deberta', hope_test, model)

Initializing Tokenizer






Initializing Pipeline




Predicting ...


              precision    recall  f1-score   support

        Hope       0.82      0.88      0.85       310
    Not Hope       0.87      0.80      0.83       308

    accuracy                           0.84       618
   macro avg       0.84      0.84      0.84       618
weighted avg       0.84      0.84      0.84       618



In [None]:
hope_hypotheses = {
    "Hope": "The text expresses hope, a future-oriented expectation, desire or wish towards a general or specific event.",
    "Not Hope": "The text does not express hope, wish, desire, or future-oriented expectation."
}
model = "beja1996/deberta-v3-base-zeroshot-v2.0_finetuned_hope" # Getting the Fine-Tuned Model from Huggingface

hope_zeroshot_eval = inference('zeroshot', hope_test, model, SEED_GLOBAL, True, hope_hypotheses)

Initializing Tokenizer






Initializing Pipeline




Predicting ...


              precision    recall  f1-score   support

        Hope       0.81      0.88      0.84       310
    Not Hope       0.87      0.80      0.83       308

    accuracy                           0.84       618
   macro avg       0.84      0.84      0.84       618
weighted avg       0.84      0.84      0.84       618
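
The hypotheses above turn the binary task into natural language inference: each text is paired with a class hypothesis, and the class whose hypothesis is judged most entailed is chosen. The `inference` helper used in this notebook is defined earlier and may differ in detail, so the following is only a minimal sketch of that general idea. The function names `build_nli_pairs` and `pick_label`, the example sentence, and the scores are hypothetical, and no model is loaded.

```python
hope_hypotheses = {
    "Hope": "The text expresses hope, a future-oriented expectation, desire or wish towards a general or specific event.",
    "Not Hope": "The text does not express hope, wish, desire, or future-oriented expectation.",
}


def build_nli_pairs(text, hypotheses):
    """Pair one text (the premise) with every class hypothesis."""
    return [(label, text, hypothesis) for label, hypothesis in hypotheses.items()]


def pick_label(entailment_scores):
    """Return the label whose hypothesis received the highest entailment score."""
    return max(entailment_scores, key=entailment_scores.get)


# Hypothetical example sentence, not taken from the data set.
pairs = build_nli_pairs("I wish for a better future.", hope_hypotheses)

# Hypothetical entailment scores standing in for the output of an NLI model.
example_scores = {"Hope": 0.91, "Not Hope": 0.07}
predicted = pick_label(example_scores)
print(predicted)
```

In this framing the wording of each hypothesis is part of the model input, which is why the same `hope_hypotheses` dictionary has to be passed at inference time.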



In [None]:
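inferences = hope_test.copy()
inferences['nb_pred'] = hope_nb_pred
# Column 1 of predict_proba is the probability of nb.classes_[1]. scikit-learn sorts
# string labels, so this should be 'Not Hope' (check nb.classes_ to confirm).
inferences['nb_prob'] = nb.predict_proba(hope_test['text'])[:, 1].round(2)
# pd.concat(axis=1) aligns rows on the index, so the evaluation frames
# must share the index of hope_test.
inferences = pd.concat(
    [
        inferences,
        hope_distilbert_eval[['distilbert_pred', 'distilbert_prob']],
        hope_deberta_eval[['deberta_pred', 'deberta_prob']],
        hope_zeroshot_eval[['zeroshot_pred', 'zeroshot_prob']],
    ],
    axis=1,
)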

In [None]:
models = ['nb', 'distilbert', 'deberta', 'zeroshot']
table = []
for model_name in models:  # separate name, so the `model` variable set above is not overwritten
    metrics = metrics_core(inferences['label_text'], inferences[f'{model_name}_pred'])
    table.append(metrics)
hope_metrics = pd.DataFrame(table).round(2)  # one row per classifier
hope_metrics.index = models
hope_metrics

| model | accuracy | accuracy_balanced | f1_macro | f1_micro | precision_macro | recall_macro | precision_micro | recall_micro |
|---|---|---|---|---|---|---|---|---|
| nb | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 |
| distilbert | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 |
| deberta | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 |
| zeroshot | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 |


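It looks suspicious that all metrics for a given classifier are identical, but the values are correct. In single-label classification, micro-averaged precision, recall, and F1 always equal accuracy, and balanced accuracy is by definition the macro-averaged recall. Because the two classes are almost perfectly balanced (310 vs. 308 examples), the macro-averaged scores also land within a fraction of a percentage point of accuracy, so everything coincides after rounding to two decimals. The next cell confirms this on the unrounded values for the zero-shot model.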

In [None]:
# Cross-check the zero-shot metrics on unrounded values.
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
true_labels = inferences['label_text']
predicted_labels = inferences['zeroshot_pred']
# Each call returns (precision, recall, f1, support); support is None when averaging.
print(precision_recall_fscore_support(true_labels, predicted_labels, average='macro'))
print(precision_recall_fscore_support(true_labels, predicted_labels, average='micro'))
print(balanced_accuracy_score(true_labels, predicted_labels))
print(accuracy_score(true_labels, predicted_labels))

(0.8402842202918107, 0.8380603267700042, 0.8379007889877456, None)
(0.8381877022653722, 0.8381877022653722, 0.8381877022653722, None)
0.8380603267700042
0.8381877022653722
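
To see why these identities hold, the metrics can be computed by hand from per-class counts. The sketch below uses a small, deliberately imbalanced toy example (the labels are hypothetical and not taken from the test set) and plain Python instead of scikit-learn. Every wrong prediction counts once as a false positive for the predicted class and once as a false negative for the true class, so the pooled (micro) counts reduce to "correct / total".

```python
# Toy labels (hypothetical): four "Hope" and two "Not Hope" examples.
y_true = ["Hope", "Hope", "Hope", "Hope", "Not Hope", "Not Hope"]
y_pred = ["Hope", "Hope", "Hope", "Not Hope", "Not Hope", "Hope"]

labels = sorted(set(y_true))


def counts(label):
    """Return (tp, fp, fn) with `label` treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


per_class = {label: counts(label) for label in labels}

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Micro-averaging pools the counts over all classes before dividing.
tp_sum = sum(tp for tp, _, _ in per_class.values())
fp_sum = sum(fp for _, fp, _ in per_class.values())
fn_sum = sum(fn for _, _, fn in per_class.values())
precision_micro = tp_sum / (tp_sum + fp_sum)
recall_micro = tp_sum / (tp_sum + fn_sum)
f1_micro = 2 * precision_micro * recall_micro / (precision_micro + recall_micro)

# Macro-averaging computes recall per class and takes the unweighted mean;
# this is the quantity reported as balanced accuracy.
recall_macro = sum(tp / (tp + fn) for tp, _, fn in per_class.values()) / len(labels)

print(accuracy, precision_micro, recall_micro, f1_micro, recall_macro)
```

On this toy data the three micro-averaged scores match accuracy, while the macro-averaged recall differs because the classes are imbalanced. With the nearly balanced hope test set, that difference shrinks below the rounding threshold.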


In [None]:
hope_metrics.to_csv('hope_metrics.csv')
files.download('hope_metrics.csv')


# References
- Van Atteveldt, W., Trilling, D., & Arcíla, C. (2021). Computational analysis of communication: A practical introduction to the analysis of texts, networks, and images with code examples in Python and R. John Wiley & Sons.
- Balouchzahi, F., Sidorov, G., & Gelbukh, A. (2023). PolyHope: Two-level hope speech detection from tweets. Expert Systems with Applications, 225, 120078. https://doi.org/10.1016/j.eswa.2023.120078
- Laurer, M., Van Atteveldt, W., Casas, A., & Welbers, K. (2024). Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI. Political Analysis, 32(1), 84–100. https://doi.org/10.1017/pan.2023.20
- Laurer, M. (2023). Fine-tuning BERT-NLI. Data Science Summer School 2023, Berlin. https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/4_tune_bert_nli.ipynb
- Müller, S., & Proksch, S.-O. (2023a). Nostalgia in European Party Politics: A Text-Based Measurement Approach. British Journal of Political Science, 1–13. https://doi.org/10.1017/S0007123423000571
- Müller, S., & Proksch, S.-O. (2023b). PolNos: Political nostalgia in party manifestos [Data set]. https://doi.org/10.7910/DVN/L198GI
- Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. https://doi.org/10.48550/ARXIV.1905.00537
- Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. https://doi.org/10.48550/ARXIV.1804.07461
