# Advanced Social Data Science 2

*By Carl, Asger, & Esben*

# 

# Part 1 - Measuring grandstanding in parliaments

In the article, the authors investigate when and why politicians use emotive rhetoric in the legislative arena. They argue that politicians are more inclined to employ such rhetoric when their speeches are delivered to a large general audience, using it as a strategic tool to appeal to voters. Consistent with their expectations, they find that politicians use emotive rhetoric more in high-profile legislative debates (Osnabrügge et al. 2021: 885). In the following sections, we will assess their choice of methods for quantifying emotive rhetoric in parliamentary speeches. First, we outline how the authors justify their choice of a dictionary-based approach and discuss its limitations in this context. Subsequently, we turn to how the authors use word embeddings to augment the ANEW dictionary with domain specificity, thereby addressing and rectifying some of the shortcomings associated with relying solely on a dictionary-based approach. Lastly, we present and discuss alternative approaches. 

## How do OHR justify their dictionary-based approach and what are its weaknesses in this context?
In order to investigate when politicians use emotive rhetoric, it is essential for the authors to establish an accurate measure. To achieve this, OHR utilizes the ANEW dictionary in combination with word-embedding techniques to take a dictionary-based approach, i.e. every word in the text corpus is mapped to a dictionary that contains a specific emotive value for the word. One of the main reasons that the authors chose a dictionary-based method is that it is cost-effective. The alternative of using supervised learning, where humans hand-annotate speeches based on their level of emotiveness, can be quite costly. In contrast, the ANEW dictionary is already constructed, freely available, and well-established within the scientific community. It also facilitates the application of word-embedding techniques due to the inclusion of complete words, and the fact that it is free allows for replication (Osnabrügge et al. 2021: 890). 

The authors also justify their approach by stating that the ANEW dictionary is “not contaminated by partisan attitudes and political predispositions” (Osnabrügge et al. 2021: 890). However, going a bit more into the context of ANEW’s construction, it can be questioned whether the dictionary is completely free from latent - possibly political - biases. The ratings in ANEW were acquired using an affective rating system where subjects were asked to rate words on the dimensions of pleasure, arousal, and dominance (Bradley & Lang 1999: 1). The respondents were “Introductory Psychology Class” students who participated as part of a course requirement (Bradley & Lang 1999: 2). Though they were balanced on gender, it can still be questioned whether psychology students represent an unbiased social group of annotators or if some form of annotation bias is embedded in the ANEW dictionary (Hovy 2022: 20). 

Relying on ANEW for a dictionary-based approach in a political context, it is worth noting that ANEW is not designed with the political context in mind. This leads to two related challenges: The first is that by using a dictionary constructed in a different context, we risk assigning emotive values to words that may not be considered equally emotive in the political context. The second is that we - if we solely rely on the ANEW dictionary - risk excluding relevant emotive words from the political context, as they are not part of the dictionary’s vocabulary. As the next section (1.2) will show, the authors solves the latter problem by using word-embedding techniques. 

However, the former problem with context specificity remains unsolved and is not a neglectable issue. Delatorre et al. (2019: 1) demonstrate in their study of the ANEW dictionary, that the affective ratings of words are highly influenced by the context they appear in. Thus, words that were originally considered emotive by the annotators are not necessarily considered emotive when they appear within the political arena. Consequently, as the authors also state in the article, we run the risk of generating measurement errors (Osnabrügge et al. 2021: 890). 

To Delatorre et al. (2019), the lack-of-context issue is a theoretical consideration, pertaining to the varying connotative interpretation of words within different social arenas. However, it also presents a more methodological issue. Dictionary-based approaches disregard the sentence-level context of the words in its vocabulary. In ANEW, the particular emotive value of a word is constant across all imaginable sentences and contexts. In section 1.3, we will discuss alternative methodological approaches that address this weakness of the dictionary-based approach.


## How do OHR augment the ANEW dictionary, and how does this contribute to domain specificity? 
The authors augment the ANEW dictionary by leveraging word-embedding techniques, thereby adding context-specific out-of-vocabulary words to the ANEW dictionary along with an emotive score.

First, they identify so-called seed words from the ANEW dictionary, i.e. words that appear in both ANEW and the parliamentary speeches and have a sufficiently high emotive or neutral score with a relatively small standard deviation. Next, OHR uses a skip-gram model with hierarchical softmax to construct word embeddings, thus vectorizing all words in their corpus. They can then calculate an emotive score for each non-seed word - i.e. corpus words that do not appear in the ANEW dictionary - based on their cosine-similarity to seed words in the embedding space. To calculate an emotive score, the sum of a non-seed word’s cosine-similarity with all neutral seed words is subtracted from the sum of the word’s cosine-similarity with all emotive seed words. Put simply, they calculate whether a word is positioned generally closer to emotive words or neutral words in the embedding space and how much closer. They add the non-seed words scoring above the 0.975 and under the 0.025 quantile to ANEW - that is, the most emotive and most neutral words, respectively. This expands the vocabulary of ANEW to comprise 2.015 emotive words and 2.095 neutral words. The exact emotive/neutral values of the newly added vocabulary words are less important, as OHR later determine the degree of emotion in a speech by simply subtracting the percentage of neutral from the percentage of emotive words.

In OHR’s own words, the utilization of this embedding method makes their approach more “domain-specific”, as they can estimate an emotive score of words that appear in the political domain of the parliament speeches but have not been annotated in the default ANEW (Osnabrügge et al. 2021: 890). Thus, they are able to extract more information and capture a broader range of emotive nuances from the parliamentary speeches, while still retaining a dictionary-based approach - even without having to construct the dictionary from scratch themselves.

This approach solves one of the two dictionary-based problems described in the previous section, namely the risk of excluding relevant emotive information about the out-of-vocabulary words appearing in the political context. However, we are still left with the risk that the emotive and neutral words in ANEW have different connotations when they are contextualized in a political setting or depending on the semantic context of the sentences they appear in. Downstream, this concern would also affect the embedding-based augmentation of the ANEW dictionary, as these would inherit the lack of political context when they are assigned an emotive score based on the “contextless” ANEW words. In the next section, we address alternative approaches to this potential issue.


## Alternative approaches and their main advantages and disadvantages
We have mainly criticized the approach of OHR for the lack of context associated with their dictionary-based approach. Here, we present two alternative approaches to potentially counter this issue.

One alternative approach to measure emotive rhetoric in parliaments, which actually captures the semantic context of the words, could be to fine-tune a BERT model that has been pre-trained on parliament speeches for domain specificity. One of the main advantages of using BERT is its ability to handle sequential data inputs, thus allowing it to understand the context of words or phrases within the given sentences. At the core of BERT’s context awareness is the “self-attention” mechanism from the Transformer architecture, entailing multi-head attention and Query-Key-Value parameterization (Raschka & Mirjalili 2019: 616). Taking a bidirectional approach, this architecture allows BERT to pay attention to other parts of a sentence - both left and right - as it processes each individual word. Thus, a word deemed emotive in one sentence might not be regarded as emotive in another. Though this method would provide a much more nuanced understanding of the emotiveness of speeches, there are two main caveats to this approach. First, OHR would either need to find a BERT model that is already pre-trained on text derived from the political context or pre-train the BERT themselves, which requires a lot of data, is computationally expensive, and can be rather time consuming (Bender et al. 2021: 612). Having found or pre-trained a BERT, it would still need fine-tuning for the task it needs to perform. In OHR, the authors “compute the level of emotive rhetoric by subtracting the percentage of neutral from the percentage of emotive words” (Osnabrügge et al. 2021: 891). Using BERT, one could take a classification approach, where the model would assign either a binary or score-based value to entire sentences, paragraphs, or speeches, circumventing the count-words-and-summate measure in OHR. This leads us to the second caveat. However one defines the classification task of the BERT, fine-tuning would require data annotation that the model can be trained on for task specificity. As OHR mentioned in their article, this can be costly. 

A second alternative approach, which could work as a supplementary contextualization of the emotive words in OHR’s dictionary-based approach, would be to include audio and/or image data. Knox & Lucas (2021) have criticized the extensive focus on word usage rather than word delivery in political studies, advocating for audio-as-data approaches. Relevant to OHR’s study of emotive rhetoric in parliaments, Dietrich et al. (2019) have constructed a measure for emotional intensity among legislators in Congressional floor speeches based on vocal pitch. They present their work as a demonstration that audio-as-data can be both feasible and informative when studying political speeches. OHR could combine such vocal pitch measures of emotional intensity to scale the emotive scores in their augmented ANEW dictionary, based on how the word was vocally delivered. Turning to image/video data, OHR could further enrich their measure of emotive rhetoric by looking at images of the ones who utter an emotive or neutral ANEW word, when they deliver the words. Facial expression and body language could both carry relevant information regarding the degree of emotiveness associated with the utterance of a word. The conventional approach to image-as-data tasks within the social sciences is to use convolutional neural networks (Torres & Francisco 2022, Williams et al. 2020). As with audio, OHR could utilize image classification to contextualize and scale the emotive scores in their augmented ANEW dictionary.

Naturally, the inclusion of either audio or images would entail a more extensive data collection process of gathering the possibly non-existing or sparse audio and video material from parliamentary speeches over time. Furthermore, they would have to familiarize themselves with entirely new modeling approaches, matching and applying Dietrich et al.'s (2019) emotional intensity measure and implementing convolutional neural networks for image classification. Also, a caveat working with these two data types is that both are computational demanding data, especially regarding data storage (Rheault 2022: 2; Williams 2020: 13; Torres 2022: 123). Finally, both audio and especially images entail privacy concerns, which is important to be fully aware of before processing the data (Williams 2020: 14).

As OHR have already expressed their concern toward the cost of annotating data for supervised learning approaches, neither of the above approaches seem within the scope of OHR’s design. However, both approaches - BERT and audiovisual data - would most likely yield more accurate representations of the degree of emotive rhetoric in parliament speeches.  


In [None]:
# !pip install datasets==2.2.1 transformers==4.19.1

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
import gensim.downloader
import torch
import re
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_metric, load_dataset
import numpy as np
from google.colab import drive
import nltk
import pandas as pd
import gensim.downloader
import pandas as pd
import numpy as np
import torch

nltk.download('punkt')
nltk.download('stopwords')

# Part 3 Supervised text classification

In this section, we will train a range of models to predict whether a tweet could be classified as hate speech or not. We will use a ridge regularized Logistic Regression and two BERT models with different pre-trainings - a general purpose BERT and a BERT specialized to detect hate speech on Twitter.  

## Fitting a logistic regression to predict the hate speech status of the tweets, using TF-IDF features.

First, we load the `tweet_eval` dataset and split the data into train, validation, and test. In the data `0` signifies non-hate and `1` indicates hate.

In [None]:
train_data=load_dataset("tweet_eval", "hate", split="train")
val_data=load_dataset("tweet_eval", "hate", split="validation")
test_data=load_dataset("tweet_eval", "hate", split="test")

Split text and labels into seperate lists.

In [17]:
train_corpus=[tweet["text"] for tweet in train_data]
train_labels=[tweet["label"] for tweet in train_data]
test_corpus=[tweet["text"] for tweet in test_data]
test_labels=[tweet["label"] for tweet in test_data]
val_corpus=[tweet["text"] for tweet in val_data]
val_labels=[tweet["label"] for tweet in val_data]

Examine number of observations and class distribution in the different splits.

In [18]:
label_dict={0:"non-hate", 1:"hate"}

print(f"Label distribtion in train: {dict(Counter([label_dict[label] for label in train_labels]))}")
print(f"Label distribtion in validataion: {dict(Counter([label_dict[label] for label in val_labels]))}")
print(f"Label distribtion in test: {dict(Counter([label_dict[label] for label in test_labels]))}")

Label distribtion in train: {'non-hate': 5217, 'hate': 3783}
Label distribtion in validataion: {'non-hate': 573, 'hate': 427}
Label distribtion in test: {'non-hate': 1718, 'hate': 1252}


Preprocessing the data. The authors of the `TweetEval Hate Speech` dataset have already done a bit of text preprocessing, converting username, indicated by a leading `@`, into a standard `@username` token and convert all URLs, indicated by a leading `http` into a standard `http` token.

In [19]:
def preproc(_str):
    # remove twittertags + lowercase
    _str=re.sub(r'@\w+', "", _str.lower())
    # remove numbers
    _str=re.sub(r'\d+', "", _str)
    # remove punctuations
    _str=_str.translate(str.maketrans("", "", string.punctuation.replace("!","")))
    # Remove extra whitespaces
    _str=re.sub(r'\s+', ' ', _str.strip())
    # tokenize text - we do not use TweetTokenize as we have removed @ either way
    tokens=word_tokenize(_str)
    # remove stopwords and stem
    stemmer=PorterStemmer()
    tokens=[stemmer.stem(word) for word in tokens if word not in stopwords.words('english')]

    return ' '.join(tokens) # join words back into a string

In [20]:
train_corpus_preproc=[preproc(tweet) for tweet in train_corpus]
test_corpus_preproc=[preproc(tweet) for tweet in test_corpus]
val_corpus_preproc=[preproc(tweet) for tweet in val_corpus]

As preprocessing is more of an assessment than a hard science, we have made some decisions that we think are best suited for the task at hand, i.e. to classify hate-speech on twitter. Some of these are:
- Retaining `!` when removing punctuation, as exclamations could be useful for the model when predicting hate-speech.
- Though capital letters could convey some meaning when predicting hate-speech, We have lower-cased all words, to make the classification more about semantic meaning than letter-capitalization.
- We only stem the tokens, though lemmatization arguably might improve model performance with proper Part-of-speech tags.

Ultimately, preprocessing data is about conveying as many relevant nuances as possible in the most simplified way. For instance, if stopwords do not provide any information on hate speech to the model, they should be removed. Should the model be used in production or published, we would have engaged in a process of trial-and-error, going back and forth between different preprocessing steps to see which preprocessing yields the best model performance.

Convert the documents into a TF-IDF feature matrix. We use both `unigrams` and `bigrams` to give the model a little more context about the context of the words - i.e. neighboring words.

In [None]:
vectorizer=TfidfVectorizer(analyzer="word",
                             ngram_range=(1,2))

train_features_lr=vectorizer.fit_transform(train_corpus_preproc)
# only transform() on val and test, to make the evaluation resemble "unseen data" more
val_features_lr=vectorizer.transform(val_corpus_preproc)
test_features_lr=vectorizer.transform(test_corpus_preproc)

### Logistic regression with ridge regularization


To optimize the hyperparameter - `C` - for regularizing the model to avoid overfitting, we carry out a `GridSearchCV`, with 5 folds and spanning `np.logspace(-2, 2, 50)` values of `C`. We use ridge-regression for regularization.

In [22]:
param_grid={"C": np.logspace(-2, 2, 50)} # search parameters to optimize

lr_grid=GridSearchCV(LogisticRegression(penalty="l2", random_state=0, max_iter=300),
                        param_grid=param_grid,
                        cv=5,
                        n_jobs=-1,
                        scoring="accuracy")

lr_grid.fit(train_features_lr, train_labels)

print("Best cross-validation score: {:.2f}".format(lr_grid.best_score_))
print("Best parameters: ", lr_grid.best_params_)
print("Best estimator: ", lr_grid.best_estimator_)

Best cross-validation score: 0.77
Best parameters:  {'C': 39.06939937054613}
Best estimator:  LogisticRegression(C=39.06939937054613, max_iter=300, random_state=0)


## A Sequential Architecture to predict the hate speech status of the tweets

The above Logistic Regression Model uses a TF-IDF matrix as input, meaning that the word ordering of the sentences is disregarded, had we not included bigrams. Next, we want to train a model that takes and *understands* sequential data inputs, as language exactly is sequential and context dependent. For instance, the word `not` is part of nltk's `stopwords.words('english')`-package. `not` can shift the meaning of a sentence completely, depending on its semantic position, but non-sequential models have no way of discerning which part of the sentence the `not` relates to. Sequential models, however, do. 

Both **Recurrent Neural Networks** (RNN) and **Long Short-Term Memory** (LSTM) have a sequential architecture, making them rather fitting for modeling the semantic meaning of language, compared to the Logistic Regression (Hovy 2022b: 73-75). The LSTM tries to mitigate the risk of exploding/vanishing gradients that the RNN might suffer from when the input sequence is long (Raschka & Mirjalili 2019:581). Thus, LSTMs are better when it comes to longer sentences. However, tweets are characterized by being short sentences with a limit of 280 characters, and our dataset might not have the proper sentence lengths for the LSTM architecture to fully flourish, compared to the RNN and Logistic Regression. Though both models could be potential candidates for our hate speech identification task, state-of-the-art language classification no longer relies on RNN or LSTM, but on their evolutionary successor, BERT (Hovy 2022b: 81). In this section, we fine-tune two BERT models with different pre-trainings - a general purpose BERT and a BERT specialized to detect hate speech on Twitter.  

### A fine-tuned BERT model

Below, we define a function to load a pre-trained BERT model, tokenize our text corpus based on the model's pre-trained embeddings, and ultimately fine-tune the BERT to predict hate speech in our Twitter data.  

In [35]:
metric_f1=load_metric("f1")
metric_acc=load_metric("accuracy")

# performance function
def compute_metrics(eval_pred):
      outputs, labels=eval_pred
      predictions=np.argmax(outputs, axis=-1)
      f1=metric_f1.compute(predictions=predictions, references=labels)
      acc=metric_acc.compute(predictions=predictions, references=labels)
      return f1 | acc

def BERT_hate_classifier(model_name):
  # allow model to access the GPU
  device="cuda:0" if torch.cuda.is_available() else "cpu"
  # Set up the tokenizer we want to use
  tokenizer=AutoTokenizer.from_pretrained(model_name)
  # Moving tokenizer to work on GPU
  tokenizer.to_device=device
  # Apply the tokenizer to each row in the dataset
  tokenized_train_dataset=train_data.map(lambda tweet: tokenizer(tweet["text"]), batched=True).remove_columns("text")
  tokenized_val_dataset=val_data.map(lambda tweet: tokenizer(tweet["text"]), batched=True).remove_columns("text")
  tokenized_test_dataset=test_data.map(lambda tweet: tokenizer(tweet["text"]), batched=True).remove_columns("text")
  # Specify task for pretrained model
  hate_classifier=AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
  # Moving model to GPU
  hate_classifier.to(device)
  # Specify training parameters
  training_args=TrainingArguments(output_dir="bert_hatespeech",
                                    evaluation_strategy="steps",
                                    num_train_epochs=5,
                                    per_device_train_batch_size=16,
                                    logging_steps=500,
                                    eval_steps=500)

  trainer=Trainer(model=hate_classifier,
                    args=training_args,
                    compute_metrics=compute_metrics,
                    train_dataset=tokenized_train_dataset,
                    eval_dataset=tokenized_val_dataset,
                    tokenizer=tokenizer)
  # fine-tune model
  trainer.train()

  return trainer, tokenized_test_dataset

We train a BERT model, that is not specialized in twitter hate speech, but a more general purpose BERT trained on Wikipedia and the BookCorpus (Devlin et al. 2017).

In [None]:
trainer, tokenized_test_dataset=  BERT_hate_classifier("bert-base-uncased")

In [11]:
eval_on_val_data=trainer.evaluate()

print(f"Accuracy of fine-tuned BERT model: {eval_on_val_data['eval_accuracy']:.2f}")
print(f"F1 of fine-tuned BERT model: {eval_on_val_data['eval_f1']:.2f}")

Accuracy of fine-tuned BERT model: 0.77
F1 of fine-tuned BERT model: 0.74


Next, to see how much of a difference the pre-trained BERT model makes, we also train a BERT that is developed by the authors behind the TweetEval Hate Speech dataset, pre-trained to detect hate speech on twitter (Basile et al. 2019).

In [None]:
trainer_tweeteval, tokenized_test_dataset_tweeteval= BERT_hate_classifier("cardiffnlp/twitter-roberta-base-hate-latest")

In [13]:
eval_on_val_data=trainer_tweeteval.evaluate()

print(f"Accuracy of fine-tuned BERT model: {eval_on_val_data['eval_accuracy']:.2f}")
print(f"F1 of fine-tuned BERT model: {eval_on_val_data['eval_f1']:.2f}")

Accuracy of fine-tuned BERT model: 0.81
F1 of fine-tuned BERT model: 0.79


## For each of the models you ran in question 1. and 2., briefly discuss (two–four sentences) in what ways the model is a good choice for the current task and data set, plus any downsides the model might have for this application.

**Logistic Regression**: Since Logistic Regression uses a TF-IDF vectorization to convert words and sentences into numbers, it disregards the order with which the words appear in the sentences. That word ordering does not matter to the meaning of a sentence is a somewhat crude assumption, though with regard to this particular task of detecting hate speech, simply identifying the presence or prevalence of a "hateful word" could suffice. This model is also less computationally intensive to train than the rest. 

**BERT**: As oppose to the Logistic Regression Model, BERT allow for sequential data input, which is fitting for language classification. We used two pre-trained BERT-models; One that is trained on twitter data with the objective to detect hate speech, and one that is trained for more general language processing. Both BERT models had similar or better performances on during training than the Logistic Regression, with regard to `accuracy` and `F1`. The BERT model specialized on twitter hate speech performed slightly better than the general purpose BERT. Such pre-trained, specialized model is very convenient, but it also gives the model a natural head-start compared to other models. Being able to utilize the relatively better performance of BERT thus depends on the existence of a pre-trained model and also what that model is trained for. Alternatively, we would have had to pre-train the BERT-model from scratch, which would have been a much more extensive task.

## Which is the best-performing model in terms of F1, and should we prefer F1 over accuracy as an indicator of model performance here?


We calculate the `accuracy` and `F1` score for each model on the `test data` and print the results.

In [16]:
test_pred_lr=lr_grid.predict(test_features_lr) # LR

output, true_labels, eval=trainer.predict(tokenized_test_dataset) # BERT General
acc_bert, f1_bert=eval["test_accuracy"], eval["test_f1"]

output, true_labels, eval_tweeteval=trainer_tweeteval.predict(tokenized_test_dataset_tweeteval) # TweetEval BERT
acc_bert_tweeteval, f1_bert_tweeteval=eval_tweeteval["test_accuracy"], eval_tweeteval["test_f1"]

pd.DataFrame([[accuracy_score(test_pred_lr, test_labels),
               f1_score(test_pred_lr, test_labels)],
              [acc_bert_tweeteval, f1_bert_tweeteval],
              [acc_bert, f1_bert]],
             index=["Logistic Regression", "BERT TweetEval", "BERT General"],
             columns=["Accuracy", "F1"]).round(2)

Unnamed: 0,Accuracy,F1
Logistic Regression,0.49,0.61
BERT TweetEval,0.57,0.65
BERT General,0.53,0.64


First of all, we notice that all models perform significantly worse on the test data, compared to their performance on the validation data. This could obviously be a modeling issue, but it could also encourage a closer look at the test datasets, to see if it comes down to an issue of noise or messy data. The results, however, are close to the one's measured by the authors of the `TweetEval Hate Speech` dataset (Basile et al. 2019). Also, whether various forms of preprocessing could have increased model performance would have to be explored if one were to further fine-tune the models for production or publication.

Deciding how to weigh the importance of either the `accuracy` or `F1` depends on the task and what one wants to achieve - or not to achieve - with one's classification model. Accuracy has straightforward interpretability and is a good metric when the outcome classes are reasonably balanced, as is the case in our dataset. The `F1-score`, on the other hand, is less intuitively interpretable, but is a better pick for imbalanced classes. `F1` takes the `False Positives` and `False Negatives` into account, assessing both error types relative to the number of `True Positives`, rather than the `True Negatives` (Hovy 2022b: 25f). This also means that *F1* changes, depending on which class we "focus on" - that is, which class we assign to be a `True Positive`. In our case, we focus on hate rather than non-hate. So, is that diserable? Since our ultimate goal is hate speech detection, *F1* could be a more adequate measure, in spite of the balanced classes. Adhering to best practice, we present both metrics to provide a comprehensive evaluation of our model´s performance, and, either way, the BERT pre-trained on twitter hate speech has the better performance on both metrics, with an accuracy of 0.57 and F1 of 0.65. 

## Based on the paper by Basile et al. (2019), What further info might you have liked to have about the data selection process and/or the annotation process for this data set? Why?


In the **data selection process**, the term `identified hater` is not further defined (Basile et al. 2019: 55). Thus, the exact sampling strategy is unclear which makes replication difficult.

In the **annotation process**, the authors try to counter the potential inconsistency of the individual annotator by using majority voting on each tweet among three annotators (crowds). The tweet is also annotated by two experts, whereafter the final label is a majority voting between crowd, expert 1, and expert 2. The exact qualifications of the experts remains unclear in the article, aside from language capability and general field knowledge. The article provides a measure of the average confidence (AC) between the annotations, but it is unclear if this measure only relates to the crowd. It would have been informative to explicitly have both the AC between the crowd and between the crowd and the experts. Though the aggregate AC measure is informative, it could have been interesting to include a measure of agreement/reliability for each tweet in the dataset, as to get an indication of which tweets have been unanimously annotated and which one's have been more ambiguous annotated.

Finally, we would have liked if the authors were more explicit about the constitution of their crowdsourcing provider and annotators, especially since crowdsourcing of annotative labor has a history of irresponsible working conditions. This also raise a discursive consideration, as it could also be interesting to know more about the social characteristics of the annotators and experts. Are they all from the same country or do they all identify as the same gender? There might be discursive reminiscents of these social dimensions latently embedded in the annotation of what is hate speech and what is not in the data. 


# Part 4: Word embeddings

The Word2vec algorithm is a widely used embedding implementation introduced by Le and Milov (2014) (Grimmer et al. 2022: 86; Hovy 2022b: 19). The assumption behind word2vec is that words that appear in similar contexts have similar meaning. Based on this assumption, the algorithm creates an n-dimensional vector space, where the meaning of each word is reflected by its relative position to other words. In Section 1, we saw how OHR utilized this to identify the emotiveness of non-labelled words based on their relative proximity to either emotive or neutral words in an embedding space. In this section, we will train a word2vec embedding model on the United Nations General Debate Corpus, and explore the strengths and weaknesses of word embedding for semantic analysis. 

## Train your own word2vec vectors on the dataset. 

We load, tokenize, and lowercase the data.

In [5]:
def load_and_process_data(filename):
    # to hold all sentences in the corpus
    corpus=[]
    
    # Open the file 
    with open(filename, "r", encoding="utf-8") as f:
        for line in f: # iterate over lines
            # is line empty
            if line.strip() != '':
                # Tokenize and lowercase 
                encoded_text=[word.lower() for word in word_tokenize(line)]
                # Add tokens to the corpus
                corpus.append(encoded_text)
    
    return corpus

# load and process the datab
path="Data"
speeches=load_and_process_data(path+"/allspeeches_77_2022.txt")

Next, we train our word2vec model.

In [6]:
def vec_function(seed):
    speech2vec=gensim.models.Word2Vec(
        speeches,         # the corpus object we've loaded
        vector_size=200,  # the dimensionality of the target vectors
        window=5,         # window ngram size
        min_count=4,      # ignoring low-frequency words
        epochs=3,         # how many training passes to have
        sg=1,           # 1 for skip-gram model
        seed=seed)        # seed for replication
    return(speech2vec)

vec_seed13=vec_function(seed=13)

There are two architectures which can be implemented in the Word2Vec; Skip-Gram(SG) or Continous Bag of Words (CBOW). In general SG outperforms CBOW, so we will use that approach for this exercise (Ogundepo 2021). Further, we choose the default of negative sampling for updating the weights in the neural network during training, to minimize the computational intensity. 

Aside from the parameters provided in the assignment description, we set a seed for replication, select the skip-gram model and negative samping, and choose to ignore low frequency words. 

## Print out the ten words that are most similar to: “climate”, “pandemic”, “terrorism” and “future” (or choose your own four words of interest). Briefly discuss anything you find noteworthy about the associations that the model has picked up.

We use the `most_similar` method, which measures the distance between words in the vector space using cosine-similarity,  

In [10]:
word_of_interest=["gender", "elizabeth", "russia", "ukraine"] 

#  func to find top ten most similar words
def find_top_words(word_of_interest, model):
    df=pd.DataFrame() 
    for word in word_of_interest:
        # Using the most_similar function to find the 10 words
        similar_words=list(map(lambda x: x[0], model.wv.most_similar(word, 10)))
        df[word]=similar_words
    return(df) # Returning a dataframe

top_words_seed13=find_top_words(word_of_interest=word_of_interest, model= vec_seed13)
top_words_seed13

Unnamed: 0,gender,elizabeth,russia,ukraine
0,protection,king,ukraine,russia
1,empowerment,majesty,illegal,military
2,women,excellency,military,aggression
3,services,abdulla,aggression,illegal
4,girls,shahid,russian,russian
5,equality,abdullah,invasion,forces
6,employment,charles,federation,armed
7,promotion,csaba,forces,crimes
8,public,predecessor,iran,war
9,participation,maldives,war,federation


Starting with `gender`, we see that *empowerment*, *protection*, and *women* are the most similar words to gender. This would indicate that these words often appear in the similar contexts. It is interesting that *men* is not one of the most similar word to gender, while *women* is. This could indicate that debates on gender in the UN is mainly concerning empowerment and protection of women rather than men. A rough Beauvoirian interpretation of this could be that men are considered the *default* gender, and women is the *other*.

Looking at `elisabeth`, it makes sense that *king*, *charles*, and *iii* are closely positioned in the vector space since they probably refer to the Queens death in 2022 and her son and successor King Charles III. Oddly, the words *abdulla(h)* and *shahid* also appear as similar, which probably refers to Abdullah Shahid, a key politician within the UN, who is often addressed as "his exellence" in the text corpus. 

Both `russia` and `ukraine` respectively have each other as the most similar words. It makes sense then, that they also share many of the same words as their most similar ones, e.g. military, aggresion, and war. Since they are both each others most similar word, they are positioned rather close to each other in the vector space, therefore also sharing their proximity to all the other words. This makes it difficult to discern which words are targeting which of the countries, e.g. is the similarity between *illigal* and *ukrain* mainly because ukrain is close to russia in the vector space? And which one of the countries are commiting the *aggresion*?  

## Suppose that we would like to use these word embeddings as input in a supervised model, detecting whether the speech comes from a country in the global North or the global South. Briefly discuss the upside(s) and downside(s) of training your word embeddings locally, on the speeches themselves, versus using pre-trained embeddings.

Deciding whether to train our own word embeddings locally or use pre-trained embeddings, it is important to consider how similar or different one's text corpus of interest is to the copera of the pre-trained embeddings (Rodriguez & Spirling 2022: 6). Using pre-trained embeddings comes with the risk of inaccuracy due to lacking domain specificity compared to one's corpus of interest - say we load embeddings trained on Wikipedia to use for our UN speech analysis (Rodriguez & Spirling 2022: 6). One could also encounter that words are not within the pre-trained embedding's vocabulary, thus losing the potential information in out-of-vocabulary words (Grimmer et al. 2022: 84f). On the other, data accessibility and computing power can be a problem for locally trained embeddings. The pre-trained embeddings are often trained on hundreds of billions of tokens, while we only have 77 speeches in the UN. If the copora of the pre-trained embeddings is related to one's local corpus - for instance if both are texts on political speecehs - one can save some computation by opting for pre-trained embeddings. In general Rodriguez & Spirling find that the pre-trained embeddings often do better than the locally trained models (Rodriguez & Spirling 2022: 15-16). 

A limitation to the embedding framework is that the vector representations are fixed when training is completed, meaning that the exact context of the words are average out (Grimmer et al. 2022: 88). Thus, discerning between global North or global South could be difficult using word2vec's vector representations, as the potential contextual nuaces and differences of a particular word between North and South is aggregated across all speeches and fixed into its average contextual meaning. In newer models such as BERT, they have implemented contextual embeddings, which are better at capturing semantic nuances in the language. 

## (optional) Train the model again. Are the word embeddings stable?

Not answered due to character limitation.

## (optional) Conduct an informal validation of the embeddings from your first run, by checking their ability to find the “odd one out” in three different series of four–five terms related to international relations and current events (e.g. “covid”, “pandemic”, “disease”, “vaccine”, “environment”). Briefly discuss how you might validate the embeddings more systematically, if you had more time and resources. 

In [15]:
list_of_lists_of_words={
    "war": ["bombs", "guns", "ammo", "tanks" "cake"],
    "climate": ["floods", "drought", "heatwave", "hallo"],
    "nations": ["china", "america", "russia", "hammer"],
    "covid 19": ["covid", "pandemic", "disease", "vaccine", "environment"]
}
for key, value in list_of_lists_of_words.items():
    print(f"For '{key}' the odd one out is: {vec_seed13.wv.doesnt_match(value)}")

For 'war' the odd one out is: guns
For 'climate' the odd one out is: floods
For 'nations' the odd one out is: russia
For 'covid 19' the odd one out is: environment


The model does not do a particularly good job of identifying the "odd one out". It only gets ´environment´ right in the covid 19 list. If one had the time and ressources, a more systematic validation could be conducted using the approach presented by Rodriguez & Spirling (2021). For a given target word, they construct two seperate lists of the 10 most similar words according to either human annotators or their embedding model (Rodriguez & Spirling 2021: 11). They then ask a seperate group of humans to choose whether the embedding model or the human annotators have provided the most fitting list of similar words. By applying this approach to a subset of relevant target words, it could provide a somewhat systematic, yet rather costly validation of one's embedding model. 

## Literature

**Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F. M., Rosso, P., & Sanguinetti, M.** (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In *Proceedings of the 13th International Workshop on Semantic Evaluation* (pp. 54-63). Minneapolis, Minnesota, USA: Association for Computational Linguistics. [Link](https://www.aclweb.org/anthology/S19-2007) (DOI: [10.18653/v1/S19-2007](https://doi.org/10.18653/v1/S19-2007))

**Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova-** 2018. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." CoRR, abs/1810.04805. Retrieved from http://arxiv.org/abs/1810.04805.

**Raschka, Sebastian, V. Mirjalili** (2019). Python Machine Learning;Machine Learning and Deep Learning with Python Scikit-Learn and Tensorflow 2 3rd Edition. https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=2329991. Accessed June 14 2023.

**Ogundepo, Odunayo**  (2021): "Understanding Word2Vec", Medium, https://medium.com/analytics-vidhya/understanding-word2vec-39fabe660705