# Advanced Social Data Science 2

*By Carl, Asger, & Esben*

In [None]:
# !pip install datasets==2.2.1 transformers==4.19.1

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score
import gensim.downloader
import torch
import re
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification
import psutil
import platform
from datasets import load_metric, load_dataset
import numpy as np
from tqdm import tqdm
from google.colab import drive
import nltk
import pandas as pd

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Part 3 Supervised text classification

In this section, we train a range of models to predict whether a tweet should be classified as hate speech. We use a ridge-regularized logistic regression, a Recurrent Neural Network (RNN), a deep RNN, a Long Short-Term Memory model (LSTM), and two BERT models with different pre-training: a general-purpose BERT and a BERT specialized in detecting hate speech on Twitter.

## Use the training data set to fit at least one of the following models to try and predict the hate speech status of the tweets, using TF-IDF features.

First, we load the `tweet_eval` dataset and split the data into train, validation, and test. In the data `0` signifies non-hate and `1` indicates hate.

In [None]:
train_data = load_dataset("tweet_eval", "hate", split = "train")
val_data = load_dataset("tweet_eval", "hate", split = "validation")
test_data = load_dataset("tweet_eval", "hate", split = "test")

Check the number of observations within each split.

In [16]:
print(f"Tweets in train: {train_data.shape[0]}")
print(f"Tweets in validation: {val_data.shape[0]}")
print(f"Tweets in test: {test_data.shape[0]}")

Tweets in train: 9000
Tweets in validation: 1000
Tweets in test: 2970


Split text and labels into separate lists.

In [17]:
train_corpus = [tweet["text"] for tweet in train_data]
train_labels = [tweet["label"] for tweet in train_data]
test_corpus = [tweet["text"] for tweet in test_data]
test_labels = [tweet["label"] for tweet in test_data]
val_corpus = [tweet["text"] for tweet in val_data]
val_labels = [tweet["label"] for tweet in val_data]

Examine class distribution in the different splits.

In [18]:
label_dict = {0:"non-hate", 1:"hate"}

print(f"Label distribution in train: {dict(Counter([label_dict[label] for label in train_labels]))}")
print(f"Label distribution in validation: {dict(Counter([label_dict[label] for label in val_labels]))}")
print(f"Label distribution in test: {dict(Counter([label_dict[label] for label in test_labels]))}")

Label distribution in train: {'non-hate': 5217, 'hate': 3783}
Label distribution in validation: {'non-hate': 573, 'hate': 427}
Label distribution in test: {'non-hate': 1718, 'hate': 1252}


Next, we preprocess the data. To avoid the extensive process of supplying accurate part-of-speech tags to a lemmatizer, we only stem the tokens, although lemmatization might arguably improve model performance.

In [19]:
# build the stopword set and the stemmer once, instead of once per token
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preproc(_str):
    # remove Twitter user tags (@...) and lowercase
    _str = re.sub(r'@\w+', "", _str.lower())
    # remove numbers
    _str = re.sub(r'\d+', "", _str)
    # remove punctuation, but keep "!"
    _str = _str.translate(str.maketrans("", "", string.punctuation.replace("!","")))
    # remove extra whitespace
    _str = re.sub(r'\s+', ' ', _str.strip())
    # tokenize text - we do not use TweetTokenizer, as the @-tags are already removed
    tokens = word_tokenize(_str)
    # remove stopwords and stem
    tokens = [STEMMER.stem(word) for word in tokens if word not in STOPWORDS]

    return ' '.join(tokens) # join words back into a string

In [20]:
train_corpus_preproc = [preproc(tweet) for tweet in train_corpus]
test_corpus_preproc = [preproc(tweet) for tweet in test_corpus]
val_corpus_preproc = [preproc(tweet) for tweet in val_corpus]
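
To make the effect of the cleaning steps concrete, the following minimal sketch replays only the regular-expression and punctuation steps of `preproc` on a made-up tweet. The helper name `clean_only` and the example string are hypothetical, and tokenization, stopword removal, and stemming are left out so that the snippet runs without `nltk`.

```python
import re
import string

def clean_only(text):
    """Hypothetical helper: the regex/punctuation part of preproc, nothing else."""
    # remove user tags and lowercase
    text = re.sub(r'@\w+', "", text.lower())
    # remove numbers
    text = re.sub(r'\d+', "", text)
    # remove all punctuation except "!"
    text = text.translate(str.maketrans("", "", string.punctuation.replace("!", "")))
    # collapse repeated whitespace
    return re.sub(r'\s+', ' ', text.strip())

example = "@user STOP it!!! 2 many http://t.co links..."
cleaned = clean_only(example)
print(cleaned)
```

Note that stripping punctuation also fuses the pieces of a URL into a single token, which is one of the trade-offs of this kind of cleaning.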

**Considerations on the preprocessing**: As preprocessing is more a matter of judgment than a hard science, we have made some decisions that we think are best suited for the task at hand, i.e. classifying hate speech on Twitter. Some of these are:
- Retaining `!` when removing punctuation, as exclamations could be useful for the model when predicting hate speech.
- Though capital letters could convey some meaning when predicting hate speech, we have lower-cased all words to make the classification more about semantic meaning than letter capitalization.

Ultimately, preprocessing data is about conveying as many relevant nuances as possible in the most simplified way. For instance, if stopwords do not provide any information on hate speech to the model, they should be removed. Were the model to be used in production or published, we would have engaged in a process of trial and error, going back and forth between different preprocessing steps to see which yields the best model performance.

For information, the authors of the `TweetEval Hate Speech` dataset have already done some text preprocessing: usernames, indicated by a leading `@`, are converted into a standard `@username` token, and all URLs, indicated by a leading `http`, are converted into a standard `http` token.

Use `TfidfVectorizer()` to convert the documents into a TF-IDF feature matrix. We use both unigrams and bigrams to give the model some information about the context of each word, i.e. its neighboring words.

In [None]:
vectorizer = TfidfVectorizer(analyzer="word",
                             ngram_range = (1,2))

train_features_lr = vectorizer.fit_transform(train_corpus_preproc)
# only transform() on val and test, to make the evaluation resemble "unseen data" more
val_features_lr = vectorizer.transform(val_corpus_preproc)
test_features_lr = vectorizer.transform(test_corpus_preproc)
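
As a rough illustration of what the unigram and bigram TF-IDF features contain, the sketch below computes them by hand for three toy documents. It assumes the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is our understanding of scikit-learn's default settings, and it tokenizes by whitespace only, so it will not reproduce `TfidfVectorizer` exactly. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalized rows for whitespace-tokenized documents."""
    term_lists = [ngrams(doc.split()) for doc in docs]
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for terms in term_lists for term in set(terms))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for terms in term_lists:
        weights = {term: tf * idf[term] for term, tf in Counter(terms).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = ["go home", "go away now", "welcome home"]
toy_rows = tfidf(toy_docs)
for term, weight in sorted(toy_rows[0].items()):
    print(f"{term!r}: {weight:.3f}")
```

In the first toy document, the bigram `go home` occurs in only one document and therefore receives a higher weight than the unigrams `go` and `home`, which each occur in two.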

### Logistic regression with ridge regularization (with a `C` parameter tuned via cross-validation on the train set)


To optimize the hyperparameter `C`, which regularizes the model to avoid overfitting, we carry out a `GridSearchCV` with 5 folds over the 50 values of `C` in `np.logspace(-2, 2, 50)`. We use an L2 (ridge) penalty for regularization.

In [22]:
param_grid = {"C": np.logspace(-2, 2, 50)} # search parameters to optimize

lr_grid = GridSearchCV(LogisticRegression(penalty="l2", random_state=0, max_iter=300),
                        param_grid = param_grid,
                        cv=5,
                        n_jobs=-1,
                        scoring = "accuracy")

lr_grid.fit(train_features_lr, train_labels)

print("Best cross-validation score: {:.2f}".format(lr_grid.best_score_))
print("Best parameters: ", lr_grid.best_params_)
print("Best estimator: ", lr_grid.best_estimator_)

Best cross-validation score: 0.77
Best parameters:  {'C': 39.06939937054613}
Best estimator:  LogisticRegression(C=39.06939937054613, max_iter=300, random_state=0)
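
To see where the selected value sits in the search grid, the following sketch rebuilds the 50 log-spaced candidates in plain Python and looks up the position of the reported optimum. It is only a reading aid for the output above; the variable names are hypothetical.

```python
# Pure-Python equivalent of np.logspace(-2, 2, 50):
# 50 exponents evenly spaced on [-2, 2], used as powers of 10
c_grid = [10 ** (-2 + 4 * i / 49) for i in range(50)]

best_c = 39.06939937054613  # value reported by the grid search above

# position of the grid value closest to the reported optimum
best_index = min(range(len(c_grid)), key=lambda i: abs(c_grid[i] - best_c))
print(f"grid runs from {c_grid[0]} to {c_grid[-1]}")
print(f"selected C is grid point {best_index} of {len(c_grid) - 1}")
```

Since `C` in scikit-learn is the inverse of the regularization strength, a selected value in the upper part of the grid means that the cross-validation favored comparatively weak regularization.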


## Also fit at least one of the following models to try and predict the hate speech status of the tweets

All of the following models take and *understand* sequential data inputs, as opposed to the logistic regression model, which used a TF-IDF matrix as input. We therefore have to preprocess the input data differently, since word order now matters. For instance, the word `not` is part of nltk's `stopwords.words('english')` list. `not` can shift the meaning of a sentence completely, depending on its position, but non-sequential models have no way of discerning which part of the sentence the `not` relates to. Sequential models, however, do. Thus, we carry out a much less extensive cleaning process in the following preprocessing phase and only lower-case the words. As previously mentioned, the authors of the dataset have already standardized user tags and URLs.
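
The point about `not` can be illustrated with a small sketch. The stopword set below is a hypothetical four-word subset chosen for the example, not the full nltk list.

```python
# Hypothetical subset of a stopword list; assumed here to contain "not"
toy_stopwords = {"i", "do", "not", "them"}

def drop_stopwords(sentence):
    """Lower-case, split on whitespace, and remove the toy stopwords."""
    return [word for word in sentence.lower().split() if word not in toy_stopwords]

negated = drop_stopwords("I do not hate them")
plain = drop_stopwords("I do hate them")
print(negated, plain)
```

After stopword removal, both sentences are reduced to the same single token, so a model that only sees the cleaned text cannot tell them apart.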

We want to represent each token through an embedding, rather than as a TF-IDF score. We use the pretrained embeddings from `glove-twitter-200` to embed the tokens in our vocabulary into a 200-dimensional vector space. One of the main advantages of this kind of text vectorization, for instance compared to one-hot encoding, is that it reduces the dimensionality of the feature space and lessens the effect of the curse of dimensionality (Raschka & Mirjalili 2019:590). Compared to TF-IDF matrices, we also decrease the number of dimensions and avoid a sparse matrix of mainly zeros.

In [25]:
# load the pretrained embeddings (these can be used as the embedding argument in create_embedding_matrix)
glove = gensim.downloader.load('glove-twitter-200')

First, we establish our total vocabulary, where we lowercase all tokens.

In [26]:
# creating the full list of vocabulary in the tweet_eval data
total_vocabulary = set()
for tweet in train_corpus+val_corpus+test_corpus:
    tokens = word_tokenize(tweet)
    for t in tokens:
        total_vocabulary.add(t.lower()) # lower.case
total_vocabulary = sorted(list(total_vocabulary))

We include a padding-token, "", as the first entry, which will be the first row in the embedding matrix. We do this to later ensure that the input sequences - i.e. tokenized sentences - have the same length. 

In [27]:
total_vocabulary = [""]+total_vocabulary

We create an embedding matrix with the dimension `vocabulary_length x embedding_dimension`. These 200-dimensional vectors will serve as representations of the tokens and be the input to the RNN and LSTM. Tokens that are not in the pretrained embeddings will have all their values set to zero.

In [39]:
def create_embedding_matrix(tokens, embedding):
    """creates an embedding matrix from pre-trained embeddings for a new vocabulary. It also adds an extra vector
    of zeroes in row 0 to embed the padding token, and initializes missing tokens as vectors of 0s"""
    oov = set()
    size = embedding.vector_size
    # note the extra zero vector that will be used for padding
    embedding_matrix=np.zeros((len(tokens),size))
    c = 0
    for i in range(1,len(tokens)):
        try:
            embedding_matrix[i]=embedding[tokens[i]]
        except KeyError: #to catch the words missing in the embeddings
            try:
                embedding_matrix[i]=embedding[tokens[i].lower()]
            except KeyError:
                #if the token does not have an embedding, we initialize it as a vector of 0s
                embedding_matrix[i] = np.zeros(size)
                #we keep track of the out of vocabulary tokens
                oov.add(tokens[i])
                c +=1
    print(f'{c/len(tokens)*100:.2f} % of tokens are out of vocabulary')
    return embedding_matrix, oov

embedding_matrix, oov = create_embedding_matrix(total_vocabulary, glove)

33.14 % of tokens are out of vocabulary


Convert the text sentences into a list of indices, each of which corresponds to the words' indices in the total vocabulary and in the embedding matrix. Also, we use padding on the feature vectors to ensure that all vector representations of the input sentences have the same length.

In [29]:
def text_to_indices(text, total_vocabulary):
    """turns the input text into a vector of indices in total_vocabulary that corresponds to the tokenized words in the input text"""
    encoded_text = []
    tokens = word_tokenize(text)
    for t in tokens:
        index = total_vocabulary.index(t.lower())
        encoded_text.append(index)
    return encoded_text

def add_padding(vector, max_length, padding_index):
    """adds copies of the padding token to make the input vector the max_length size, so that all inputs are the same length
    (the length of tweet with most words)"""
    if len(vector) < max_length:
        vector = [padding_index for _ in range(max_length-len(vector))] + vector
    return vector

# vectorize sentences to indices
train_features = [text_to_indices(text, total_vocabulary) for text in train_corpus]
val_features = [text_to_indices(text, total_vocabulary) for text in val_corpus]
test_features = [text_to_indices(text, total_vocabulary) for text in test_corpus]

# Find the length of the longest tweet
longest_tweet = max(train_features+val_features+test_features, key=len)
max_length = len(longest_tweet)
padding_index = 0 #position 0 is where we had put the padding token in our vocabulary and embedding matrix

# padding the feature vectors
train_features = [add_padding(vector, max_length, padding_index) for vector in train_features]
val_features = [add_padding(vector, max_length, padding_index) for vector in val_features]
test_features = [add_padding(vector, max_length, padding_index) for vector in test_features]
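
The encoding and padding steps can be sketched on a toy vocabulary as follows. Unlike `text_to_indices`, which searches the vocabulary list with `.index()` for every token, this sketch uses a dictionary for the lookup, which returns the same indices without scanning the list. All names and the toy vocabulary are hypothetical.

```python
def build_index(vocabulary):
    """Map each token to its position in the vocabulary list."""
    return {token: i for i, token in enumerate(vocabulary)}

def encode(tokens, token_to_index):
    """Replace each token by its vocabulary index (tokens are lower-cased first)."""
    return [token_to_index[t.lower()] for t in tokens]

def left_pad(indices, max_length, padding_index=0):
    """Prepend padding indices until the sequence has max_length entries."""
    return [padding_index] * max(0, max_length - len(indices)) + indices

# position 0 holds the padding token, as in the vocabulary above
toy_vocabulary = ["", "hate", "i", "love", "you"]
token_to_index = build_index(toy_vocabulary)

encoded = encode(["I", "love", "you"], token_to_index)
padded = left_pad(encoded, max_length=5)
print(encoded, padded)
```

Padding on the left keeps the actual words at the end of the sequence, so the final hidden state of a recurrent model is computed right after the last real token.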

Next, we convert all splits of the data into a PyTorch `Dataset` and wrap each in a `DataLoader`, to perform mini-batch gradient descent when training the model.

In [30]:
class TweetEvalTrain(torch.utils.data.Dataset):
    # defining the sources of the data
    def __init__(self, features, labels):
        self.X = torch.LongTensor(features)
        self.y = torch.from_numpy(np.array(labels))

    def __getitem__(self, index):
        X = self.X[index]
        y = self.y[index].unsqueeze(0)
        return X, y

    def __len__(self):
        return len(self.y)

data_train = TweetEvalTrain(train_features, train_labels)
data_val = TweetEvalTrain(val_features, val_labels)
data_test = TweetEvalTrain(test_features, test_labels)

train_loader = torch.utils.data.DataLoader(data_train, batch_size=64)
val_loader = torch.utils.data.DataLoader(data_val, batch_size=64)
test_loader = torch.utils.data.DataLoader(data_test, batch_size=64)

The data is now ready to be fed into the RNN and LSTM.

### Many-to-one RNN with pre-trained word embeddings as the inputs

First, we define a custom RNN/LSTM class, a `training_loop` function, and an `evaluate` function. Note that we do not need to explicitly define a softmax activation for the outputs, as our loss function, `CrossEntropyLoss`, automatically applies a softmax transformation before calculating the loss.

In [31]:
# defining the embedding step and RNN model

class RNN_LSTM(torch.nn.Module):
    def __init__(self, rnn_size, n_classes, embedding_matrix, num_layers=1, type_="RNN"):
        super().__init__()

        #applying the embeddings to the inputs. Tokens corresponding to the padding_idx will not be included in the training of the model
        self.embedding = torch.nn.Embedding.from_pretrained(torch.FloatTensor(embedding_matrix), padding_idx=0, freeze=True)
        emb_dim = embedding_matrix.shape[1] # Each word is represented through a 200 dimensional vector
        self.num_layers = num_layers
        self.type = type_

        if self.type == "RNN":
            self.rnn = torch.nn.RNN(input_size=emb_dim, hidden_size=rnn_size, num_layers=self.num_layers, batch_first=True)
        elif self.type == "LSTM":
            self.rnn = torch.nn.LSTM(input_size=emb_dim, hidden_size=rnn_size, bidirectional=False, num_layers=self.num_layers, batch_first=True)
        else:
            raise LookupError("Only RNN and LSTM are supported.")

        #applies a linear transformation to the RNN/LSTM
        self.linear_outputs = torch.nn.Linear(rnn_size, n_classes)

    def forward(self, inputs):
        # encode the input vectors
        encoded_inputs = self.embedding(inputs)

        # In this many-to-one model, we only need the final hidden state,
        # which summarizes all prior information in the sequence
        if self.type == "LSTM":
            all_states, (final_state, c_n) = self.rnn(encoded_inputs)
        else:
            all_states, final_state = self.rnn(encoded_inputs)

        # final_state has shape (num_layers, batch, rnn_size); we use the last layer,
        # which also covers deep models and batches of size 1
        outputs = self.linear_outputs(final_state[-1])

        return outputs

# training loop
def training_loop(model, num_epochs):
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(num_epochs):
        losses = []
        for batch_index, (inputs, targets) in enumerate(train_loader):
            optimizer.zero_grad() # zero the gradients from the previous opt step
            outputs = model(inputs) # compute outputs (logits), shape (batch, n_classes)
            targets = targets.squeeze(1).long() # shape (batch, 1) -> (batch,)
            loss = loss_function(outputs, targets) #loss function
            loss.backward() # backpropagation - get deriv
            optimizer.step() # optimize based on derivative
            losses.append(loss.item()) # append batch loss
        print(f'Epoch {epoch+1}: loss {np.mean(losses)}')
    return model

def evaluate(model, val_loader):
    predictions = []
    labels = []
    with torch.no_grad(): # don't backpropagate or update weights anymore
        for batch_index, (inputs, targets) in enumerate(val_loader):
            # apply softmax
            outputs = torch.softmax(model(inputs), 1 )
            # indices highest softmax values
            vals, indices = torch.max(outputs, 1)
            # accumulating the predictions
            predictions += indices.tolist()
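            # accumulating true labels, flattened from shape (batch, 1) to (batch,)
            labels += targets.squeeze(1).tolist()

    # scikit-learn expects the arguments in the order (y_true, y_pred)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions)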
    print(f'Model accuracy: {acc:.2f}')
    print(f'Model F1: {f1:.2f}')
    return acc, f1, predictions
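
The remark that the loss function applies the softmax itself can be checked with a small pure-Python sketch. It is a stand-in for illustration, not PyTorch's implementation: it computes the cross-entropy directly from raw outputs (logits) and compares it with the negative log of the softmax probability. The function names and example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Cross-entropy of one example, computed as log-sum-exp minus the target logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, -1.0]  # raw model outputs for the classes non-hate (0) and hate (1)
probs = softmax(logits)
loss = cross_entropy_from_logits(logits, target=0)
print(probs, loss, -math.log(probs[0]))
```

Because the softmax preserves the ordering of the logits, the predicted class in `evaluate` would be the same with or without the explicit `torch.softmax` call.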

We start by training an RNN with a single hidden layer and 100 nodes in the layer.

In [32]:
# initializing and training the model:
myRNN = RNN_LSTM(rnn_size=100, n_classes=2, num_layers = 1, embedding_matrix=embedding_matrix)

myRNN = training_loop(myRNN, num_epochs = 5)
acc, f1, preds = evaluate(myRNN, val_loader)

Epoch 1: loss 0.6058707241470932
Epoch 2: loss 0.550754392189337
Epoch 3: loss 0.5269700645977724
Epoch 4: loss 0.5149752362400082
Epoch 5: loss 0.5122840057873557
Model accuracy: 0.68
Model F1: 0.66



### LSTM

A single hidden layer LSTM with 100 nodes in the layer.

In [34]:
myLSTM = RNN_LSTM(rnn_size=100, n_classes=2, num_layers = 1, embedding_matrix=embedding_matrix, type_="LSTM")

myLSTM = training_loop(myLSTM, num_epochs = 3)
acc, f1, preds = evaluate(myLSTM, val_loader)

Epoch 1: loss 0.5880376453518023
Epoch 2: loss 0.5178072105908225
Epoch 3: loss 0.4880447679377617
Model accuracy: 0.72
Model F1: 0.64


### A fine-tuned BERT model

In [35]:
metric_f1 = load_metric("f1") # load_metric from the datasets library loads the F1 metric from sklearn
metric_acc = load_metric("accuracy")

# Compute F1 and accuracy from a tuple of (model outputs, labels)
def compute_metrics(eval_pred):
      outputs, labels = eval_pred
      predictions = np.argmax(outputs, axis=-1)
      f1 = metric_f1.compute(predictions=predictions, references=labels)
      acc = metric_acc.compute(predictions=predictions, references=labels)
      return f1 | acc

def BERT_hate_classifier(model_name):
  # allow model to access the GPU
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
  # Set up the tokenizer we want to use
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  # The tokenizer itself runs on the CPU; the Trainer moves each tokenized batch to the model's device
  # Apply the tokenizer to each row in the dataset
  tokenized_train_dataset = train_data.map(lambda tweet: tokenizer(tweet["text"]), batched=True).remove_columns("text")
  tokenized_val_dataset = val_data.map(lambda tweet: tokenizer(tweet["text"]), batched=True).remove_columns("text")
  tokenized_test_dataset = test_data.map(lambda tweet: tokenizer(tweet["text"]), batched=True).remove_columns("text")
  ## Specify task for pretrained model
  hate_classifier = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 2)
  # Moving model to GPU
  hate_classifier.to(device)
  # Specify training parameters
  training_args = TrainingArguments(output_dir="bert_hatespeech",
                                    evaluation_strategy = "steps",
                                    num_train_epochs=5,
                                    per_device_train_batch_size=16,
                                    logging_steps=500,
                                    eval_steps=500)

  trainer = Trainer(model = hate_classifier,
                    args = training_args,
                    compute_metrics = compute_metrics,
                    train_dataset = tokenized_train_dataset,
                    eval_dataset = tokenized_val_dataset,
                    tokenizer = tokenizer)
  # fine-tune model
  trainer.train()

  return trainer, tokenized_test_dataset

We first fine-tune a BERT model that is not specialized in Twitter hate speech, but is a more general-purpose BERT trained on Wikipedia and the BookCorpus.

In [None]:
trainer, tokenized_test_dataset=  BERT_hate_classifier("bert-base-uncased")

In [11]:
eval_on_val_data = trainer.evaluate()

print(f"Accuracy of fine-tuned BERT model: {eval_on_val_data['eval_accuracy']:.2f}")
print(f"F1 of fine-tuned BERT model: {eval_on_val_data['eval_f1']:.2f}")

Accuracy of fine-tuned BERT model: 0.77
F1 of fine-tuned BERT model: 0.74


Next, to see how much the choice of pre-trained model matters, we also fine-tune `cardiffnlp/twitter-roberta-base-hate-latest`, a RoBERTa-based model (a BERT variant) released by the Cardiff NLP group behind TweetEval, which is already trained to detect hate speech on Twitter (Basile et al. 2019).

In [None]:
trainer_tweeteval, tokenized_test_dataset_tweeteval =  BERT_hate_classifier("cardiffnlp/twitter-roberta-base-hate-latest")

In [13]:
eval_on_val_data = trainer_tweeteval.evaluate()

print(f"Accuracy of fine-tuned BERT model: {eval_on_val_data['eval_accuracy']:.2f}")
print(f"F1 of fine-tuned BERT model: {eval_on_val_data['eval_f1']:.2f}")

Accuracy of fine-tuned BERT model: 0.81
F1 of fine-tuned BERT model: 0.79


## For each of the models you ran in question 1. and 2., briefly discuss (two–four sentences) in what ways the model is a good choice for the current task and data set, plus any downsides the model might have for this application.

**Logistic Regression**: Since our logistic regression uses a TF-IDF vectorization to convert words and sentences into numbers, it disregards the order in which the words appear in the sentences. That word order does not matter to the meaning of a sentence is a somewhat crude assumption, though for this particular task of detecting hate speech, simply identifying the presence or prevalence of "hateful words" could suffice.

**RNN and LSTM**: The sequential architecture of both the RNN and the LSTM makes them well suited to modeling the semantic meaning of language, compared to the TF-IDF vectorization used for the logistic regression. The LSTM tries to mitigate the risk of exploding or vanishing gradients that the RNN might suffer from when the input sequence is long. Thus, LSTMs are better suited to longer sentences, as the core concept of an LSTM is to decide what to remember from the previous parts of the sequence and what to forget. However, tweets are short texts with a limit of 280 characters, and our dataset might not have the sentence lengths needed for the LSTM architecture to fully flourish, compared to the RNN and the logistic regression.
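
The vanishing and exploding gradient problem can be illustrated with a deliberately simplified sketch. It assumes a scalar, linear recurrence `h_t = w * h_(t-1)` without inputs or activation function, which is far simpler than the RNN trained above; under that assumption the gradient of the final state with respect to the initial state is `w` raised to the number of time steps. The function name is hypothetical.

```python
def gradient_factor(recurrent_weight, n_steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_(t-1)."""
    return recurrent_weight ** n_steps

# compare a shrinking, a neutral, and a growing recurrent weight over 20 time steps
for w in (0.5, 1.0, 1.5):
    print(f"w = {w}: gradient factor after 20 steps = {gradient_factor(w, 20):.3g}")
```

With a weight below one the factor shrinks towards zero as the sequence grows, and with a weight above one it blows up, which is the behavior the gating mechanism of the LSTM is designed to dampen.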

Because we used pretrained embeddings for the text vectorization in the RNN and LSTM, the performance of these models is influenced by the corpus used to construct the embedding space. For instance, the models may inherit biases from the data on which the embeddings were trained, and if the vocabulary of the embeddings is small, many of the tokens in our own dataset cannot be embedded. We saw that around 33% of the tokens in our vocabulary were not in the pretrained GloVe vocabulary, so about a third of our vocabulary is represented by a zero vector and carries no token-specific information during training or prediction.

**BERT**: As with RNN and LSTM, BERT models take sequential input, which is fitting for language classification. The fine-tuned BERT models had the best performance in terms of both `accuracy` and `F1`. We used two pre-trained models: one that is trained on Twitter data with the objective of detecting hate speech, and one that is trained for more general language processing. The model specialized in Twitter hate speech is very convenient for our task, but it also gives that model a natural head start over the other models, where we used general pretrained word embeddings or simple preprocessing and TF-IDF. In our case, the relatively better performance of the BERT models thus depends on the existence of a pre-trained model, and also on what that model was trained for. Otherwise, we would have had to pre-train the BERT model from scratch, which would have been a much more extensive task.

## Present the performance of your models on the test set in terms of both F1 and accuracy. Which is the best-performing model in terms of F1? In your opinion, should we prefer F1 over accuracy as an indicator of model performance here, and why?


First, we calculate the `accuracy` and `F1` score for each model on the `test data`.

In [None]:
test_pred_lr = lr_grid.predict(test_features_lr) # LLR
acc_rnn, f1_rnn, preds = evaluate(myRNN, test_loader) # RNN
acc_deep_rnn, f1_deep_rnn, preds = evaluate(my_deep_RNN, test_loader) # Deep RNN
acc_lstm, f1_lstm, preds = evaluate(myLSTM, test_loader) # LSTM

output, true_labels, eval_bert = trainer.predict(tokenized_test_dataset) # BERT (eval_bert avoids shadowing the built-in eval)
acc_bert, f1_bert = eval_bert["test_accuracy"], eval_bert["test_f1"]

output, true_labels, eval_tweeteval = trainer_tweeteval.predict(tokenized_test_dataset_tweeteval) # TweetEval BERT
acc_bert_tweeteval, f1_bert_tweeteval = eval_tweeteval["test_accuracy"], eval_tweeteval["test_f1"]

In [37]:
pd.DataFrame([[accuracy_score(test_labels, test_pred_lr),  # scikit-learn expects (y_true, y_pred)
               f1_score(test_labels, test_pred_lr)],
              [acc_rnn, f1_rnn],
              [acc_deep_rnn, f1_deep_rnn],
              [acc_lstm, f1_lstm],
              [acc_bert_tweeteval, f1_bert_tweeteval],
              [acc_bert, f1_bert]],
             index = ["Logistic Regression", "RNN", "Deep RNN", "LSTM", "BERT TweetEval", "BERT"],
             columns = ["Accuracy", "F1"]).round(2)

| | Accuracy | F1 |
|---|---|---|
| Logistic Regression | 0.49 | 0.61 |
| RNN | 0.48 | 0.58 |
| Deep RNN | 0.52 | 0.60 |
| LSTM | 0.53 | 0.60 |
| BERT TweetEval | 0.57 | 0.65 |
| BERT | 0.53 | 0.64 |


### New proposal
First of all, we notice that all models perform markedly worse on the test data than on the validation data. This could be a modeling issue, but it could also warrant a closer look at the test set, to see whether it comes down to noise or messy data. The results, however, are close to the ones reported by the authors of the `TweetEval Hate Speech` dataset (Basile et al. 2019).

Deciding how to weigh `accuracy` against `F1` depends on the task and on what one wants to achieve, or avoid, with the classification model. Accuracy is an appropriate performance metric with a straightforward interpretation, i.e. the share of tweets, hate or non-hate, that the classifier predicted correctly. It is a good indicator especially when the classes in the dataset are reasonably balanced. The `F1-score`, on the other hand, is less intuitively interpretable, but is a better pick if the classes are imbalanced. This is because the `F1-score` depends on `False Positives` and `False Negatives`, i.e. whether the model incorrectly classifies something as hate speech or fails to classify actual hate speech as such. Both types of error are assessed relative to the number of `True Positives`, revealing how the model handles different forms of misclassification (Hovy 2022b: 25f). As we are trying to predict hate speech in Twitter data, we are mostly interested in a model that scores well on both accuracy and F1, but if either False Negatives or False Positives were catastrophic for the application, one could tune the model to take this into consideration.

In our case, the classes `hate` and `non-hate` in the TweetEval hate speech data are fairly evenly distributed, and thus the benefits of the F1 score are less pronounced. However, adhering to best practice, we present both metrics to provide a comprehensive evaluation of our models' performance.
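
To make the difference between the two metrics concrete, the sketch below computes both from confusion-matrix counts. As a reference point it uses a hypothetical classifier that labels every test tweet as hate; the class counts are taken from the test split shown earlier, while the classifier itself is an assumption made for illustration and is not one of our models.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); true negatives do not enter the formula."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical classifier that predicts "hate" for every test tweet
# (test split: 1252 hate tweets, 1718 non-hate tweets)
tp, fp, fn, tn = 1252, 1718, 0, 0

baseline_acc = accuracy(tp, fp, fn, tn)
baseline_f1 = f1(tp, fp, fn)
print(f"accuracy: {baseline_acc:.2f}, F1: {baseline_f1:.2f}")
```

Under these assumptions the trivial classifier reaches an F1 of about 0.59 at an accuracy of about 0.42, which is a useful reference point when reading F1 scores of around 0.6 on this split.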

Also, whether various forms of preprocessing could have increased model performance would have to be explored if one were to further fine-tune the models for production or publication.

### The old text

Deciding how to weigh the importance of either the `accuracy` or `F1` depends on the task and what one wants to achieve - or not to achieve - with one's classification model. As previously presented, the class-balance between `hate` and `non-hate` is fairly evenly distributed, and accuracy can then serve as an appropriate performance metric along with its straightforward interpretability, i.e. how many tweets did the classifier predict correctly.

The `F1-score`, on the other hand, is less intuitively interpretable, but would likely have been a better pick had the classes been more imbalanced (Hovy 2022b: 25f). The F1 score also depends on how the classes have been numerically encoded, meaning which class is a `Positive` and which is a `Negative`. In our case, we have assigned hate speech to be the positive class, i.e. having a value of one. That is because the `F1-score` depends on `False Positives` and `False Negatives`, i.e. whether the model incorrectly classifies something as hate speech or fails to classify hate speech as such, both in relation to how many `True Positives` the model has classified. Thus, the F1-score is sensitive to which type of error the model makes.


## Based on the paper by Basile et al. (2019), What further info might you have liked to have about the data selection process and/or the annotation process for this data set? Why?


Hate Speech is an abstract and arguably ambiguous concept, and human labeling is not necessarily a highly systematic and reliable endeavor.

With regard to the **data selection process**, the term `identified hater` is not further defined (Basile et al. 2019: 55). Thus, the exact sampling strategy is unclear.

In the **annotation process**, the authors try to counter the potential inconsistency of the individual annotator by using majority voting on each tweet among three annotators (the crowd). The tweet is also annotated by two experts, whereafter the final label is a majority vote between the crowd, expert 1, and expert 2. The exact qualifications of the experts remain unclear in the article, aside from language capability and general field knowledge. The article provides a measure of the average confidence (AC) between the annotations, but it is unclear whether this measure only relates to the crowd. It would have been informative to explicitly have both the AC within the crowd and the AC between the crowd and the experts. Though the aggregate AC measure is informative, it would also have been interesting to include a measure of agreement or reliability for each tweet in the dataset, to get an indication of which tweets were unanimously annotated and which ones were more ambiguously annotated.
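
The two-stage voting described above, and the per-tweet agreement measure we would have liked to see, can be sketched as follows. This follows our reading of the procedure and is not taken from the authors' code; the function names and the example votes are hypothetical.

```python
from collections import Counter

def majority(votes):
    """Return the most common label among an odd number of binary votes."""
    return Counter(votes).most_common(1)[0][0]

def final_label(crowd_votes, expert_1, expert_2):
    """Majority vote between the crowd's majority label and the two experts."""
    crowd_label = majority(crowd_votes)
    return majority([crowd_label, expert_1, expert_2])

def agreement(votes):
    """Share of votes that match the majority label (1.0 means unanimous)."""
    label = majority(votes)
    return sum(vote == label for vote in votes) / len(votes)

# Hypothetical tweet: two of three crowd annotators say hate (1), both experts say non-hate (0)
crowd_votes = [1, 1, 0]
label = final_label(crowd_votes, expert_1=0, expert_2=0)
crowd_agreement = agreement(crowd_votes)
print(label, round(crowd_agreement, 2))
```

In this example the experts overrule the crowd, and the crowd itself was split two to one. A per-tweet score like `agreement` would flag such tweets as ambiguous, which the aggregate AC measure cannot do.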

The authors should explicitly outline their approach to safeguarding the working conditions of their crowdsourced labor. Upholding rigorous ethical guidelines in scientific research requires clarity about the measures taken to ensure fair pay and appropriate working conditions, especially since crowdsourcing has a history of exploiting its workers. As a general discursive consideration, it would also be interesting to know more about the social characteristics of the annotators and experts. Are they all from the same country, or do they all identify as the same gender? Discursive remnants of these social dimensions might be latently embedded in the annotation of what is and is not hate speech in the data.


## Literature

**Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F. M., Rosso, P., & Sanguinetti, M.** (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In *Proceedings of the 13th International Workshop on Semantic Evaluation* (pp. 54-63). Minneapolis, Minnesota, USA: Association for Computational Linguistics. [Link](https://www.aclweb.org/anthology/S19-2007) (DOI: [10.18653/v1/S19-2007](https://doi.org/10.18653/v1/S19-2007))

**Raschka, S., & Mirjalili, V.** (2019). *Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2* (3rd ed.). [Link](https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=2329991). Accessed June 14, 2023.
