# Advanced Social Data Science 2

*By Carl, Asger, & Esben*

In [14]:
import re
import string
from collections import Counter

import gensim.downloader
import nltk
import numpy as np
import torch
from datasets import load_dataset
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from tqdm import tqdm


# We also set a random state in order to make reproducible results
RANDOM_STATE = 1111
np.random.seed(RANDOM_STATE)

# Part 3 Supervised text classification

## Use the training data set to fit at least one of the following models to try and predict the hate speech status of the tweets, using TF-IDF features.

First, we load the `tweet_eval` dataset, split into train, validation, and test. `0 = non-hate` and `1 = hate`.  

In [2]:
train_data = load_dataset("tweet_eval", "hate", split = "train")
val_data = load_dataset("tweet_eval", "hate", split = "validation")
test_data = load_dataset("tweet_eval", "hate", split = "test")

Found cached dataset tweet_eval (C:/Users/45242/.cache/huggingface/datasets/tweet_eval/hate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)
Found cached dataset tweet_eval (C:/Users/45242/.cache/huggingface/datasets/tweet_eval/hate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)
Found cached dataset tweet_eval (C:/Users/45242/.cache/huggingface/datasets/tweet_eval/hate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Check the number of observations within each split

In [3]:
print(f"Tweets in train: {train_data.shape[0]}")
print(f"Tweets in validation: {val_data.shape[0]}")
print(f"Tweets in test: {test_data.shape[0]}")

Tweets in train: 9000
Tweets in validation: 1000
Tweets in test: 2970


Split text and labels into separate lists

In [4]:
train_corpus = [tweet["text"] for tweet in train_data]
train_labels = [tweet["label"] for tweet in train_data] 
test_corpus = [tweet["text"] for tweet in test_data]
test_labels = [tweet["label"] for tweet in test_data] 
val_corpus = [tweet["text"] for tweet in val_data]
val_labels = [tweet["label"] for tweet in val_data]

Examine class distribution in the different splits

In [5]:
label_dict = {0:"non-hate", 1:"hate"}

print(f"Label distribution in train: {dict(Counter([label_dict[label] for label in train_labels]))}")
print(f"Label distribution in validation: {dict(Counter([label_dict[label] for label in val_labels]))}")
print(f"Label distribution in test: {dict(Counter([label_dict[label] for label in test_labels]))}")

Label distribution in train: {'non-hate': 5217, 'hate': 3783}
Label distribution in validation: {'non-hate': 573, 'hate': 427}
Label distribution in test: {'non-hate': 1718, 'hate': 1252}


Preprocessing: we only stem the tokens rather than lemmatize them. Lemmatization would require accurate part-of-speech tags for every token and comes at a higher computational cost.

In [6]:
STOPWORDS = set(stopwords.words('english'))  # built once; set membership checks are fast
STEMMER = PorterStemmer()

def preproc(_str):
    # remove Twitter @-mentions and lowercase
    _str = re.sub(r'@\w+', "", _str.lower())
    # remove numbers
    _str = re.sub(r'\d+', "", _str)
    # remove punctuation, but keep "!"
    _str = _str.translate(str.maketrans("", "", string.punctuation.replace("!", "")))
    # collapse extra whitespace
    _str = re.sub(r'\s+', ' ', _str.strip())
    # tokenize text - we do not use TweetTokenizer, as the @-mentions are already removed
    tokens = word_tokenize(_str)
    # remove stopwords and stem
    tokens = [STEMMER.stem(word) for word in tokens if word not in STOPWORDS]

    return ' '.join(tokens) # join the tokens back into a string

In [7]:
train_corpus_preproc = [preproc(tweet) for tweet in train_corpus]
test_corpus_preproc = [preproc(tweet) for tweet in test_corpus]
val_corpus_preproc = [preproc(tweet) for tweet in val_corpus]

**Considerations on the preprocessing**: As preprocessing is more a matter of judgment than a hard science, we have made some decisions that we think are best suited to the task at hand, i.e. classifying hate speech on Twitter. Some of these are:
- Retaining `!` when removing punctuation, as exclamation marks could be useful for the model when predicting hate speech.
- Lowercasing all words. Although capital letters could convey some meaning when predicting hate speech, we want the classification to be about semantic meaning rather than capitalization.

Ultimately, preprocessing is about conveying the relevant nuances in the data in the simplest possible way, e.g. by removing stopwords if these do not provide any information to the model.

Use `TfidfVectorizer()` to convert the documents into a TF-IDF feature matrix. We use both unigrams and bigrams to give the model a little more information about the context of each word, i.e. its neighboring words.

In [8]:
vectorizer = TfidfVectorizer(analyzer="word",
                             ngram_range = (1,2))
                             
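# NOTE: the vectorizer is fitted on the raw tweets; pass train_corpus_preproc,
# val_corpus_preproc and test_corpus_preproc instead to use the preprocessing above
train_features = vectorizer.fit_transform(train_corpus)
# only transform() on val and test, so that they are treated as unseen data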
val_features = vectorizer.transform(val_corpus)  
test_features = vectorizer.transform(test_corpus)
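
To make the unigram-plus-bigram features concrete, here is a small pure-Python sketch of what `ngram_range=(1, 2)` extracts from a tokenized document. The helper `extract_ngrams` and the toy sentence are hypothetical and only illustrate the idea; scikit-learn's own tokenization and TF-IDF weighting are more involved.

```python
def extract_ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams with n_min <= n <= n_max, shortest first."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


# toy document (an assumption for illustration, not taken from tweet_eval)
tokens = "this is not good".split()
features = extract_ngrams(tokens)
print(features)
```

The bigram `not good` is kept as a feature of its own, which a unigram-only model could not distinguish from `good` appearing in a positive context.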

### Logistic regression with ridge regularization (with a `C` parameter tuned via cross-validation on the train set)


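To tune the regularization hyperparameter `C`, which controls overfitting, we run a `GridSearchCV` with 5 folds over the 100 values of `C` in `np.logspace(-2, 2, 100)`. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.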

In [178]:
param_grid = {"C": np.logspace(-2, 2, 100)} # search from 10^-2 to 10^2 with 100 steps

lr_grid = GridSearchCV(LogisticRegression(penalty="l2", random_state=0), 
                        param_grid = param_grid, 
                        cv=5, 
                        n_jobs=-1, 
                        scoring = "accuracy")

lr_grid.fit(train_features, train_labels)

print("Best cross-validation score: {:.2f}".format(lr_grid.best_score_))
print("Best parameters: ", lr_grid.best_params_)
print("Best estimator: ", lr_grid.best_estimator_)

Best cross-validation score: 0.77
Best parameters:  {'C': 10.722672220103236}
Best estimator:  LogisticRegression(C=10.722672220103236, random_state=0)
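
As a sanity check on the grid, the values produced by `np.logspace(-2, 2, 100)` can be reproduced by hand: they are powers of ten with evenly spaced exponents. The sketch below uses plain Python instead of NumPy, and the helper name `logspace` is our own. It locates the reported best `C` in the grid, which shows that the optimum lies in the interior of the search range rather than at its edge.

```python
def logspace(start, stop, num):
    """Powers of ten with `num` evenly spaced exponents from start to stop (inclusive)."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


grid = logspace(-2, 2, 100)
best_c = 10.722672220103236  # best value reported by the grid search above

# index of the grid value closest to the reported best C
best_index = min(range(len(grid)), key=lambda i: abs(grid[i] - best_c))
print(best_index, grid[best_index])
```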


### SVM with a linear kernel (with a `C` parameter tuned via cross-validation on the train set)

In [179]:
param_grid = {"C": np.logspace(0, 2, 100)} # search from 10^0 to 10^2 with 100 steps

lscv_grid = GridSearchCV(LinearSVC(penalty="l2", random_state=0), 
                        param_grid = param_grid, 
                        cv=5, 
                        n_jobs=-1, 
                        scoring = "accuracy")

lscv_grid.fit(train_features, train_labels)

print("Best cross-validation score: {:.2f}".format(lscv_grid.best_score_))
print("Best parameters: ", lscv_grid.best_params_)
print("Best estimator: ", lscv_grid.best_estimator_)

Best cross-validation score: 0.77
Best parameters:  {'C': 1.0974987654930561}
Best estimator:  LinearSVC(C=1.0974987654930561, random_state=0)


### Multinomial Naive Bayes

Multinomial Naive Bayes models word counts, so instead of the TF-IDF features from `TfidfVectorizer()` we use `CountVectorizer()`, i.e. the raw word counts.

In [186]:
counterizer = CountVectorizer()
train_counts = counterizer.fit_transform(train_corpus)
val_counts = counterizer.transform(val_corpus)
test_counts = counterizer.transform(test_corpus)

nb_score = cross_val_score(MultinomialNB(), 
                           train_counts.toarray(), 
                           train_labels, 
                           cv=5, 
                           scoring="accuracy", 
                           verbose=2)

print("Naive Bayes cross-validation accuracy: ", np.mean(nb_score))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END .................................................... total time=   2.7s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.8s remaining:    0.0s


[CV] END .................................................... total time=   2.5s
[CV] END .................................................... total time=   2.5s
[CV] END .................................................... total time=   2.5s
[CV] END .................................................... total time=   2.6s
Naive Bayes cross-validation accuracy:  0.7411111111111112


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   13.5s finished


## Also fit at least one of the following models to try and predict the hate speech status of the tweets

All of the following models take sequential input and can thus *"understand"* word order, as opposed to the previous models, which use a bag-of-words or TF-IDF matrix as input, where the order of the words in the sentence is ignored. Thus, we have to preprocess the input data differently. For instance, the word `not` is part of nltk's `stopwords.words('english')` list. Depending on its position, `not` can shift the meaning of a sentence completely. Non-sequential data structures and models have no way of discerning which part of the sentence the `not` relates to, but sequential models do. We therefore clean the data much less extensively in the preprocessing phase and only lowercase the tokens. Stemming and lemmatizing would also strip away some of the semantic meaning of the words, so we apply neither of these steps.

In [199]:
"not" in stopwords.words('english')

True
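
The effect is easy to see on a toy example. The sketch below uses a tiny hand-written stopword set (an assumption for illustration, not nltk's full list) and shows two sentences with opposite meanings collapsing to the same bag of words once `not` is removed.

```python
from collections import Counter

# hypothetical mini stopword list; nltk's English list is much longer
TOY_STOPWORDS = {"i", "do", "not", "this", "the"}


def bag_of_words(text, stopwords):
    """Lowercase, split on whitespace, drop stopwords, and count the remaining tokens."""
    return Counter(t for t in text.lower().split() if t not in stopwords)


positive = "I like this"
negative = "I do not like this"

# with stopword removal both sentences reduce to the single token "like"
same_after_removal = bag_of_words(positive, TOY_STOPWORDS) == bag_of_words(negative, TOY_STOPWORDS)
# without stopword removal the two bags differ
same_without_removal = bag_of_words(positive, set()) == bag_of_words(negative, set())
print(same_after_removal, same_without_removal)
```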

Since we want to represent each token by an embedding, rather than by a TF-IDF score or word count, we use the pretrained `glove-twitter-200` embeddings to map the tokens in our vocabulary into a 200-dimensional vector space. One of the main advantages of this kind of text vectorization, e.g. compared to one-hot encoding, is that it reduces the dimensionality of the feature space and thereby the effect of the curse of dimensionality (Raschka & Mirjalili 2019:590). With TF-IDF, the feature matrix had the dimensions `n_documents x vocabulary_length` (9,000 x 88,987 in the above models, where both unigrams and bigrams were included) and was a sparse matrix containing mostly zeros. This is not the case with embeddings, where the embedding matrix is dense and has the dimensions `vocabulary_length x embedding_dimension` (20,620 x 200).

In [15]:
# load the pretrained embeddings (these can be used as the embedding argument in create_embedding_matrix)
glove = gensim.downloader.load('glove-twitter-200')

First, we want to establish our total vocabulary. We lowercase all tokens.

In [38]:
# creating the full list of vocabulary in the tweet_eval data
total_vocabulary = set()
for tweet in train_corpus + val_corpus+test_corpus:
    tokens = word_tokenize(tweet)
    for t in tokens:
        total_vocabulary.add(t.lower()) # lower.case
total_vocabulary = sorted(list(total_vocabulary))

Include a padding token, `""`, as the first entry of the vocabulary, so that it corresponds to the first row of the embedding matrix. All input sequences, i.e. tokenized tweets, must have the same length, which we ensure by prepending the padding token to every tweet that is shorter than the longest tweet.

In [39]:
# appending an empty padding token at the beginning of the vocabulary
total_vocabulary = [""]+total_vocabulary

We create an embedding matrix with the dimensions `vocabulary_length x embedding_dimension`. These 200-dimensional vectors serve as representations of the tokens and are the input to the RNN and LSTM. Tokens that are not in the pretrained embeddings get a vector of zeros.

In [42]:
def create_embedding_matrix(tokens, embedding):
    """creates an embedding matrix from pre-trained embeddings for a new vocabulary. It also adds an extra vector
    of zeroes in row 0 to embed the padding token, and initializes missing tokens as vectors of 0s"""
    oov = set()
    size = embedding.vector_size
    # note the extra zero vector in row 0 that will be used for padding
    embedding_matrix=np.zeros((len(tokens),size))
    c = 0
    for i in tqdm(range(1,len(tokens))):
        try:
            embedding_matrix[i]=embedding[tokens[i]]
        except KeyError: #to catch the words missing in the embeddings
            try:
                embedding_matrix[i]=embedding[tokens[i].lower()]
            except KeyError:
                #if the token does not have an embedding, we initialize it as a vector of 0s
                embedding_matrix[i] = np.zeros(size)
                #we keep track of the out of vocabulary tokens
                oov.add(tokens[i])
                c +=1
    print(f'{c/len(tokens)*100} % of tokens are out of vocabulary')
    return embedding_matrix, oov

#get the embedding matrix and out of vocabulary words for our tweet_eval vocabulary
embedding_matrix, oov = create_embedding_matrix(total_vocabulary, glove)

100%|██████████| 24019/24019 [00:00<00:00, 161210.19it/s]

33.1390507910075 % of tokens are out of vocabulary
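
The lookup logic can be illustrated without downloading GloVe. In the sketch below, `toy_embedding` is a made-up dictionary standing in for the pretrained vectors (3 dimensions instead of 200), and `build_matrix` is a simplified stand-in for `create_embedding_matrix`. Row 0 is reserved for the padding token and unknown tokens stay as zero vectors, mirroring the approach above.

```python
# made-up 3-dimensional "pretrained" vectors, for illustration only
toy_embedding = {"hate": [0.9, 0.1, 0.0], "love": [0.1, 0.9, 0.0]}
toy_vocabulary = ["", "hate", "love", "zzzunknown"]  # index 0 is the padding token


def build_matrix(vocabulary, embedding, dim):
    """Return one row per vocabulary entry plus the set of out-of-vocabulary tokens."""
    matrix = [[0.0] * dim for _ in vocabulary]
    oov = set()
    for i, token in enumerate(vocabulary):
        if i == 0:
            continue  # the padding row stays all zeros
        if token in embedding:
            matrix[i] = list(embedding[token])
        else:
            oov.add(token)  # the row stays all zeros
    return matrix, oov


toy_matrix, toy_oov = build_matrix(toy_vocabulary, toy_embedding, dim=3)
oov_share = len(toy_oov) / len(toy_vocabulary) * 100
print(toy_matrix, toy_oov, oov_share)
```

Note that a zero vector for an out-of-vocabulary token is indistinguishable from the padding vector, so such tokens contribute nothing to the input of the network.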





Convert each tweet into a list of indices, where each index is the position of the corresponding token in the total vocabulary and thereby its row in the embedding matrix. We also pad the index vectors to ensure that all vector representations of the input tweets have the same length.

In [43]:
def text_to_indices(text, total_vocabulary):
    """turns the input text (one tweet) into a vector of indices in total_vocabulary that corresponds to the tokenized words in the input text"""
    encoded_text = []
    tokens = word_tokenize(text)
    for t in tokens:
        index = total_vocabulary.index(t.lower())
        encoded_text.append(index)
    return encoded_text

def add_padding(vector, max_length, padding_index):
    """adds copies of the padding token to make the input vector the max_length size, so that all inputs are the same length
    (the length of tweet with most words)"""
    if len(vector) < max_length:
        vector = [padding_index for _ in range(max_length-len(vector))] + vector
    return vector

# vectorize sentences to indices
train_features = [text_to_indices(text, total_vocabulary) for text in train_corpus]
val_features = [text_to_indices(text, total_vocabulary) for text in val_corpus]
test_features = [text_to_indices(text, total_vocabulary) for text in test_corpus]

# calc length of the longest tweet across all splits, so that every split can be padded to the same length
longest_tweet = max(train_features + val_features + test_features, key=len)
max_length = len(longest_tweet)
padding_index = 0 #position 0 is where we had put the padding token in our vocabulary and embedding matrix

# padding the feature vectors
train_features = [add_padding(vector, max_length, padding_index) for vector in train_features]
val_features = [add_padding(vector, max_length, padding_index) for vector in val_features]
test_features = [add_padding(vector, max_length, padding_index) for vector in test_features]
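
`list.index` scans the vocabulary from the start for every token, which becomes slow for a vocabulary of this size. A possible alternative, sketched below with a toy vocabulary and whitespace tokenization (both assumptions for illustration), is to build a token-to-index dictionary once and then pad on the left as above. The names `encode` and `left_pad` are hypothetical helpers, not part of the notebook.

```python
toy_vocabulary = ["", "a", "bad", "day", "good", "what"]  # index 0 is the padding token
token_to_index = {token: i for i, token in enumerate(toy_vocabulary)}


def encode(text, token_to_index):
    """Map each whitespace-separated, lowercased token to its vocabulary index."""
    return [token_to_index[t] for t in text.lower().split()]


def left_pad(vector, max_length, padding_index=0):
    """Prepend padding indices until the vector has max_length entries."""
    return [padding_index] * (max_length - len(vector)) + vector


encoded = [encode(t, token_to_index) for t in ["What a good day", "bad day"]]
toy_max_length = max(len(v) for v in encoded)
padded = [left_pad(v, toy_max_length) for v in encoded]
print(padded)
```

Padding on the left keeps the real tokens at the end of the sequence, so the final hidden state of a recurrent model is computed right after the last real token.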

Next, we wrap each split of the data in a PyTorch `Dataset` and a `DataLoader`, so that the model can be trained with mini-batch gradient descent.

In [47]:
class TweetEvalTrain(torch.utils.data.Dataset):
    # defining the sources of the data
    def __init__(self, features, labels):
        self.X = torch.LongTensor(features)
        self.y = torch.from_numpy(np.array(labels))

    def __getitem__(self, index):
        X = self.X[index]
        y = self.y[index].unsqueeze(0)
        return X, y

    def __len__(self):
        return len(self.y)

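data_train = TweetEvalTrain(train_features, train_labels)
data_val = TweetEvalTrain(val_features, val_labels)
# pad/truncate the test vectors to max_length (a no-op if they already have that length)
test_padded = [add_padding(vector, max_length, padding_index)[-max_length:] for vector in test_features]
data_test = TweetEvalTrain(test_padded, test_labels)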

train_loader = torch.utils.data.DataLoader(data_train, batch_size=64)
val_loader = torch.utils.data.DataLoader(data_val, batch_size=64)
test_loader = torch.utils.data.DataLoader(data_test, batch_size=64)

Now, the data is all set to be fed into the RNN.

### Many-to-one RNN with pre-trained word embeddings as the inputs

First, we define both a custom RNN/LSTM class, a training_loop function and an evaluation function. Note that we do not need to explicitly define a softmax-activation for the outputs to be passed through, as our loss function, `CrossEntropyLoss`, automatically applies a softmax transformation before calculating the loss. 
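
To see why no explicit softmax layer is needed, the sketch below computes the cross-entropy for a single example by hand in plain Python: the loss is the negative log of the softmax probability assigned to the true class. The logits are made-up numbers, and this is only an illustration of the formula, not PyTorch's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability that softmax assigns to the true class."""
    return -math.log(softmax(logits)[target])


logits = [2.0, 0.5]  # hypothetical raw scores for the classes 0 (non-hate) and 1 (hate)
probs = softmax(logits)
loss_if_non_hate = cross_entropy(logits, target=0)
loss_if_hate = cross_entropy(logits, target=1)
print(probs, loss_if_non_hate, loss_if_hate)
```

The loss is small when the true class receives the larger logit and grows when the model favors the wrong class, which is the signal the optimizer uses during training.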

In [None]:
# advanced version supporting multiple types of RNN layers

class RNN_or_LSTM(torch.nn.Module):
    def __init__(self, rnn_size, n_classes, embedding_matrix, type="RNN"):
        # initialize the model with a certain dimension of the RNN unit activations (this is rnn_size)
        # and a certain number of output classes
        
        super().__init__()
        
        #applying the embeddings to the inputs
        self.embedding = torch.nn.Embedding.from_pretrained(torch.FloatTensor(embedding_matrix), padding_idx=0, freeze=True)
        emb_dim = embedding_matrix.shape[1] # = 200 in this case --> this is our input_size; it represents just one word, but through 200 features. That is why there are also 200 weights between the input layer and each node in the hidden layer
        self.type = type # stored so that forward() can select the matching branch
        
        #remember the batch_first=True argument
        if type == "RNN":
            self.rnn = torch.nn.RNN(input_size=emb_dim, hidden_size=rnn_size, num_layers=1, batch_first=True)
        elif type == "LSTM":
            self.rnn = torch.nn.LSTM(input_size=emb_dim, hidden_size=rnn_size, bidirectional=False, num_layers=1, batch_first=True)
        else:
            raise LookupError("Only RNN and LSTM are supported.")
        self.linear_output = torch.nn.Linear(rnn_size, n_classes)

    def forward(self, inputs):
        
        # encode the input vectors
        encoded_inputs = self.embedding(inputs)

        #apply the RNN or LSTM
        if self.type == "RNN":
            all_states, final_state = self.rnn(encoded_inputs)
        else:
            # the LSTM also returns the final cell state c_n, so its output is unpacked differently, see documentation for details
            all_states, (final_state, c_n) = self.rnn(encoded_inputs)
        
        # run the final states through the output layer (drop the leading num_layers dimension of size 1)
        outputs = self.linear_output(final_state.squeeze(0))
        return outputs

Define our RNN and LSTM class, where the number of hidden layers can be changed using the `num_layers` argument.

In [199]:
# defining the embedding step and RNN model

class RNN_LSTM(torch.nn.Module):
    def __init__(self, rnn_size, n_classes, embedding_matrix, num_layers=1, type_="RNN"):
        # initialize the model with a certain dimension of the RNN unit activations (this is rnn_size)
        # and a certain number of output classes
        
        super().__init__()
        
        #applying the embeddings to the inputs
        self.embedding = torch.nn.Embedding.from_pretrained(torch.FloatTensor(embedding_matrix), padding_idx=0, freeze=True) # the tokens corresponding to the padding_idx will not be included in the training of the model 
        emb_dim = embedding_matrix.shape[1] # = 200 in this case --> It is our input_size, and it represents just one word, but through 200 features! That's why there are also 200 weights between the input layer and each node in the hidden layer.
        self.num_layers = num_layers
        self.type = type_ 

        #remember the batch_first=True argument
        if self.type == "RNN":
            self.rnn = torch.nn.RNN(input_size=emb_dim, hidden_size=rnn_size, num_layers=self.num_layers, batch_first=True)
        elif self.type == "LSTM":
            self.rnn = torch.nn.LSTM(input_size=emb_dim, hidden_size=rnn_size, bidirectional=False, num_layers=self.num_layers, batch_first=True)
        else:
            raise LookupError("Only RNN and LSTM are supported.")

        #define the output layer (no softmax needed here; we will apply softmax as part of the loss calculation)
        #applies a linear transformation to the RNN
        #final layer state and outputs scores for the n classes
        self.linear_outputs = torch.nn.Linear(rnn_size, n_classes)                      
        
    def forward(self, inputs):
        # encode the input vectors
        encoded_inputs = self.embedding(inputs)

        # The RNN returns two tensors: one with the hidden states at all positions,
        # and one with only the final hidden states (one per layer).
        # In this many-to-one model, we only need the final hidden state,
        # i.e. the state at the end of the sequence, which summarizes all prior tokens
        if self.type == "LSTM":
            # the LSTM additionally returns the final cell state c_n, see documentation for details
            all_states, (final_state, c_n) = self.rnn(encoded_inputs)
        else:
            all_states, final_state = self.rnn(encoded_inputs)

        # final_state has shape (num_layers, batch_size, rnn_size); we run the state of the
        # last layer in the stack through the output layer
        outputs = self.linear_outputs(final_state[-1])

        return outputs

# training loop
def training_loop(model, num_epochs):
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(num_epochs):
        losses = []
        for batch_index, (inputs, targets) in enumerate(train_loader):
            optimizer.zero_grad() # zero the gradients that are stored from the previous optimization step
            outputs = model(inputs).squeeze() # compute the outputs of the model
            targets = targets.squeeze().long() # CrossEntropyLoss expects class indices of type long
            loss = loss_function(outputs, targets) # cross-entropy loss compares the model outputs with the true labels
            loss.backward() # backpropagation -> compute the gradients of the loss
            optimizer.step() # Adam updates the weights based on the gradients
            losses.append(loss.item()) #add this batch's loss to the losses for this epoch
        print(f'Epoch {epoch+1}: loss {np.mean(losses)}')
    return model

def evaluate(model, val_loader):
    predictions = []
    labels = []
    with torch.no_grad(): # for evaluation we don't backpropagate and update weights anymore
        for batch_index, (inputs, targets) in enumerate(val_loader):
            outputs = torch.softmax(model(inputs), 1) # apply softmax to turn the logits into probabilities
            # getting the index of the highest probability, which corresponds to the predicted class (label 0 or 1)
            vals, indices = torch.max(outputs, 1)
            # accumulating the predictions
            predictions += indices.tolist()
            # accumulating the true labels, flattened from shape (batch_size, 1) to (batch_size,)
            labels += targets.flatten().tolist()
    
    acc = accuracy_score(labels, predictions)
    print(f'Model accuracy: {acc}')
    return acc, predictions
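
The many-to-one idea can be reduced to a few lines of plain Python. The sketch below runs a recurrent cell with a single hidden unit over a sequence of scalar inputs and keeps only the final hidden state, which is what the output layer above receives. The weights and inputs are arbitrary illustrative values, not trained parameters, and `rnn_final_state` is a hypothetical helper.

```python
import math


def rnn_final_state(inputs, w_x, w_h, b, h0=0.0):
    """Elman-style recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).

    Returns the hidden states at all positions and the final hidden state.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states, h


states, final_state = rnn_final_state([1.0, 0.0, -1.0], w_x=0.5, w_h=0.8, b=0.0)
# same inputs in reverse order: the final state changes, unlike a bag-of-words representation
_, reversed_final = rnn_final_state([-1.0, 0.0, 1.0], w_x=0.5, w_h=0.8, b=0.0)
print(final_state, reversed_final)
```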

We train an RNN with a single hidden layer.

In [62]:
# initializing and training the model:
myRNN = RNN_LSTM(rnn_size=100, n_classes=2, num_layers = 1, embedding_matrix=embedding_matrix)

myRNN = training_loop(myRNN, num_epochs = 5)
acc, preds = evaluate(myRNN, val_loader)

Epoch 1: loss 0.6019936258911242
Epoch 2: loss 0.5577724246268577
Epoch 3: loss 0.532719999974501
Epoch 4: loss 0.512544610821609
Epoch 5: loss 0.5200163743174668
Model accuracy: 0.674


We train a deep RNN with three stacked recurrent layers.

In [65]:
# initializing and training the model:
myRNN = RNN_LSTM(rnn_size=100, n_classes=2, num_layers = 3, embedding_matrix=embedding_matrix)

myRNN = training_loop(myRNN, num_epochs = 5)
acc, preds = evaluate(myRNN, val_loader)

Epoch 1: loss 0.603381429369568
Epoch 2: loss 0.5530780082476054
Epoch 3: loss 0.5255337934544746
Epoch 4: loss 0.5062031760706124
Epoch 5: loss 0.49184963931428627
Model accuracy: 0.692


In [188]:
myLSTM = RNN_LSTM(rnn_size=100, n_classes=2, num_layers = 1, embedding_matrix=embedding_matrix, type_="LSTM")

myLSTM = training_loop(myLSTM, num_epochs = 3)
acc, preds = evaluate(myLSTM, val_loader)


### LSTM

In [None]:

myLSTM = RNN_LSTM(rnn_size=100, n_classes=2, num_layers = 2, embedding_matrix=embedding_matrix, type_="LSTM")

myLSTM = training_loop(myLSTM, num_epochs = 5)
acc, preds = evaluate(myLSTM, val_loader)

### A fine-tuned BERT model

## For each of the models you ran in question 1. and 2., briefly discuss (two–four sentences) in what ways the model is a good choice for the current task and data set, plus any downsides the model might have for this application.


When using pretrained embeddings for text vectorization, the corpus on which the embeddings were trained matters a great deal for the performance of our model. The embeddings may carry biases inherent in that corpus, and if their vocabulary is small, many tokens from our own dataset cannot be embedded. We saw that around 33% of our tokens were not in the pretrained vocabulary. These tokens are all mapped to the zero vector, so about a third of our vocabulary carries no information for the model during training or prediction.

## Present the performance of your models on the test set in terms of both F1 and accuracy. Which is the best-performing model in terms of F1? In your opinion, should we prefer F1 over accuracy as an indicator of model performance here, and why?


## Based on the paper by Basile et al. (2019), What further info might you have liked to have about the data selection process and/or the annotation process for this data set? Why?
