# Introduction to Natural Language Processing 2 Lab04

**This lab is mainly about data and model analysis. There is very little code. Make sure you send back a proper report with your code, guideline, annotated sheets, and theoretical answers.**

## Introduction (1 point)

Your company wants to sell a moderation API tackling toxic content on Twitter. They ask you to come up with a model which detects toxic tweets. You remember your NLP classes, start looking for existing models or datasets, and find a collection of [academic Twitter datasets on the HuggingFace hub](https://huggingface.co/datasets/tweet_eval). In particular, the `hate` and `offensive` datasets seem close to what you are looking for.

1. (1 point) Pick one of the datasets between hate and offensive, and justify your choice. Remember that it is for a commercial application (there is a good and a bad answer).

Let's check some crucial points to choose between the two datasets:

* `Relevance`: Is the dataset relevant to the problem we are trying to solve?

* `Quality`: Is the dataset of good quality? Are the labels accurate and consistent?

* `Size`: Is the dataset big enough to train a model?

* `Diversity`: Is the dataset diverse enough to make the model robust and generalizable?

Considering these points, let's compare the two datasets:

The `hate` dataset will most likely contain tweets that express hatred, which is certainly a form of toxic content.
However, the scope of this dataset might be limited, as there are other forms of toxic content beyond expressions of hate.

On the other hand, the `offensive` dataset is likely to cover a broader range of toxic content, including not only hate speech but also other forms of offensive language such as insults, or obscene content. This makes it more relevant to the task at hand. Additionally, this broader dataset will help train a model that is more robust and able to generalize to a wide range of toxic content.

Furthermore, we noticed that the `hate` dataset is not usable for commercial purposes, as it is licensed under CC BY-NC-SA 4.0. This is not the case for the `offensive` dataset, which is licensed under CC BY-SA 4.0.

For these reasons, we will choose the `offensive` dataset.

## Evaluating the dataset (5 points)

Before using the data to train a model, you have the right reflex and start with a data analysis.

1. (1 point) Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

In [30]:
# load our offensive dataset from https://huggingface.co/datasets/tweet_eval

from datasets import load_dataset

dataset_offensive = load_dataset("tweet_eval", "offensive")

dataset_offensive


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11916
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 860
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1324
    })
})

In [31]:
# extract the training, test, and validation sets

train_dataset = dataset_offensive["train"]
test_dataset = dataset_offensive["test"]
validation_dataset = dataset_offensive["validation"]

# compute the proportion of offensive tweets in each dataset

print("Proportion of offensive tweets in the training set: ", sum(train_dataset["label"])/len(train_dataset["label"]))
print("Proportion of offensive tweets in the test set: ", sum(test_dataset["label"])/len(test_dataset["label"]))
print("Proportion of offensive tweets in the validation set: ", sum(validation_dataset["label"])/len(validation_dataset["label"]))


Proportion of offensive tweets in the training set:  0.3307317891910037
Proportion of offensive tweets in the test set:  0.27906976744186046
Proportion of offensive tweets in the validation set:  0.3466767371601209


We can see that there are three splits in the dataset: `train`, `validation` and `test`. The `train` split contains 11916 tweets, the `validation` split contains 1324 tweets, and the `test` split contains 860 tweets. The dataset is imbalanced: offensive tweets make up about 33% of `train`, 35% of `validation`, and 28% of `test`, so roughly two thirds of the tweets are not offensive.

At first glance we can see that the tweets contain a lot of hashtags, mentions, and emojis. We can also see that there are a lot of spelling mistakes and abbreviations.

2. (3 points) Use [BERTopic](https://github.com/MaartenGr/BERTopic) to extract the topics within the data, and the main topics within each class. Please, think about [fixing the random seed](https://stackoverflow.com/questions/71320201/how-to-fix-random-seed-for-bertopic).
    * A [good model](https://github.com/MaartenGr/BERTopic#embedding-models) for sentence similarity is `all-MiniLM-L6-v2`, as it is [fast, light, and pretty accurate](https://www.sbert.net/docs/pretrained_models.html). You can use another one, but make sure to document your choice.
    * [This](https://maartengr.github.io/BERTopic/api/plotting/topics_per_class.html) might help.

In [32]:
from bertopic import BERTopic
from umap import UMAP

# Our BERTopic model uses a pre-trained embedding model and a UMAP model for dimensionality reduction; fixing random_state makes the run reproducible
model = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=UMAP(random_state=42), verbose=True)

topics, _ = model.fit_transform(train_dataset["text"])

# Extract the main topics within each label
topics_per_class = model.topics_per_class(train_dataset["text"], train_dataset["label"])

# Visualize topics per class
model.visualize_topics_per_class(topics_per_class)


2023-06-14 20:21:11,148 - BERTopic - Transformed documents to Embeddings
2023-06-14 20:21:17,728 - BERTopic - Reduced dimensionality
2023-06-14 20:21:17,902 - BERTopic - Clustered reduced embeddings


In [47]:
# we can access the frequent topics that were generated by the model
model.get_topic_info().head(5)

| Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|
| -1 | 3069 | `-1_the_to_and_of` | [the, to, and, of, is, user, for, maga, in, he] | [@user I just said that very same thing. That ... |
| 0 | 3939 | `0_she_you_is_he` | [she, you, is, he, are, user, so, her, my, and] | [@user @user He is, @user She is 😭😭😭, @user Sh... |
| 1 | 991 | `1_gun_control_guns_laws` | [gun, control, guns, laws, the, to, about, in,... | [@user But gun control, @user Gun control is n... |
| 2 | 382 | `2_liberals_they_user_their` | [liberals, they, user, their, the, are, libera... | [@user Liberals be like, @user @user @user @us... |
| 3 | 317 | `3_antifa_user_they_your` | [antifa, user, they, your, to, you, of, them, ... | [@user @user No that is Antifa, @user Like ANT... |


3. (1 point) What do you think about the results? How do you think it could impact a model trained on these data?

We can see that the topics extracted from the dataset are very similar for both classes. This is not surprising, as the topics are extracted from the whole dataset, and not from each class separately. This means that the topics extracted from the dataset are not specific to the class, and therefore will not help the model distinguish between the two classes.

4. **Bonus** By default, BERTopic extracts single keywords. Play with the model to extract bigrams or more. See if you can go deeper in your analysis.

By default, BERTopic uses a `CountVectorizer` that extracts unigrams to build the document-term matrix. However, it can be configured to extract bigrams or longer n-grams instead, either through BERTopic's `n_gram_range` parameter or by passing a custom `vectorizer_model`:

The `n_gram_range` parameter accepts a tuple (min_n, max_n), where `min_n` is the lower and `max_n` is the upper boundary of the range of n-values for different n-grams to be extracted.
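To make the `(min_n, max_n)` range concrete, here is a minimal pure-Python sketch that enumerates word n-grams for a given range. The helper `extract_ngrams` is hypothetical and only illustrates the idea; it tokenizes on whitespace and is not how `CountVectorizer` is implemented.

```python
def extract_ngrams(text, ngram_range=(1, 2)):
    """Return every word n-gram of `text` with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens over the sentence
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


example = extract_ngrams("gun control is not about guns", ngram_range=(1, 2))
print(example)
```

With a range of `(1, 2)` the result contains both single words such as `gun` and pairs such as `gun control`, which is why bigram topics can look more specific than unigram ones.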



We can see that the topics extracted from the dataset are more specific when using bigrams or n-grams, and therefore will help the model distinguish between the two classes.

In [42]:
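# import the CountVectorizer from scikit-learn and UMAP from umap-learn
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# despite its name, model_bigram extracts every n-gram from unigrams up to 5-grams (ngram_range=(1, 5))
model_bigram = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=UMAP(random_state=42), vectorizer_model=CountVectorizer(ngram_range=(1, 5)), verbose=True)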

topics, _ = model_bigram.fit_transform(train_dataset["text"])

2023-06-14 22:45:36,805 - BERTopic - Transformed documents to Embeddings
2023-06-14 22:45:43,523 - BERTopic - Reduced dimensionality
2023-06-14 22:45:43,700 - BERTopic - Clustered reduced embeddings


In [44]:
# show the topics of the model

model_bigram.get_topic_info().head(5)


| Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|
| -1 | 3069 | `-1_the_user_to_user user` | [the, user, to, user user, is, and, of, user u... | [@user @user @user @user @user @user @user @us... |
| 0 | 3939 | `0_user_is_you_she` | [user, is, you, she, you are, she is, he, are,... | [@user @user He is, @user She is 😭😭😭, @user Sh... |
| 1 | 991 | `1_gun_control_gun control_user` | [gun, control, gun control, user, the, user us... | [@user Gun control is not about guns. Gun cont... |
| 2 | 382 | `2_liberals_user_user user_user liberals` | [liberals, user, user user, user liberals, the... | [@user @user @user @user @user @user @user @us... |
| 3 | 317 | `3_antifa_user_antifa user_user antifa` | [antifa, user, antifa user, user antifa, user ... | [Antifa Caught Off Guard After Getting Confron... |


## Evaluating the model (8 points)



1. (2 points) Evaluate the model on the test split of the dataset you picked, using precision, recall, and F1-score.

In [41]:
# evaluate the model on the test set, using precision, recall and f1-score

from sklearn.metrics import classification_report

test_topics, _ = model.transform(test_dataset["text"])


print(classification_report(test_dataset["label"], test_topics))


2023-06-14 22:40:45,414 - BERTopic - Reduced dimensionality
2023-06-14 22:40:45,432 - BERTopic - Predicted clusters


              precision    recall  f1-score   support

          -1       0.00      0.00      0.00         0
           0       0.77      0.21      0.33       620
           1       0.17      0.04      0.07       240
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         0
           4       0.00      0.00      0.00         0
           5       0.00      0.00      0.00         0
           6       0.00      0.00      0.00         0
           7       0.00      0.00      0.00         0
           8       0.00      0.00      0.00         0
           9       0.00      0.00      0.00         0
          10       0.00      0.00      0.00         0
          12       0.00      0.00      0.00         0
          13       0.00      0.00      0.00         0
          14       0.00      0.00      0.00         0
          15       0.00      0.00      0.00         0
          16       0.00      0.00      0.00         0
          17       0.00    
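Note that `BERTopic.transform` returns topic ids rather than offensive/neutral labels, which is why the report above has one row per topic id. Precision, recall, and F1 are only meaningful once the predictions live in the same label space as the gold labels (0 = not offensive, 1 = offensive). The sketch below shows the intended computation on a small hand-made example; `binary_prf`, `y_true`, and `y_pred` are hypothetical names and values, not results from this lab.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# illustrative labels only: 1 = offensive, 0 = not offensive
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = binary_prf(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Once a classifier's 0/1 predictions are available, passing them to `classification_report(test_dataset["label"], predictions)` reports the same quantities per class.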

2. (2 points) Look for prediction failures. Extract the top 5 misclassified tweets (highest score in wrong class) for each class and discuss what could be wrong with the model.

In [53]:
# For each class, extract the top 5 misclassified tweets (highest probability of belonging to the wrong class)

AttributeError: 'BERTopic' object has no attribute 'get_top_n_missclassified'
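BERTopic does not expose such a method, so the ranking has to be done by hand. A minimal sketch, assuming a classifier that returns a probability of the offensive class for each tweet: the helper `top_misclassified` and the toy `texts`, `labels`, and `probs` are hypothetical and stand in for real model outputs.

```python
def top_misclassified(texts, labels, probs, true_class, k=5):
    """Return the k tweets of `true_class` that the model got most wrong.

    probs[i] is the assumed P(offensive) for texts[i]; a tweet is predicted
    offensive when that probability is at least 0.5.
    """
    errors = []
    for text, label, p_offensive in zip(texts, labels, probs):
        predicted = 1 if p_offensive >= 0.5 else 0
        if label == true_class and predicted != label:
            # score the model gave to the wrong class
            wrong_score = p_offensive if label == 0 else 1.0 - p_offensive
            errors.append((wrong_score, text))
    errors.sort(key=lambda item: item[0], reverse=True)
    return errors[:k]


# toy placeholders, not real tweets or real model scores
texts = ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"]
labels = [0, 0, 1, 1, 0]
probs = [0.9, 0.6, 0.1, 0.8, 0.2]

false_positives = top_misclassified(texts, labels, probs, true_class=0)
false_negatives = top_misclassified(texts, labels, probs, true_class=1)
print(false_positives)
print(false_negatives)
```

Reading the highest-scoring errors for each class is usually the quickest way to spot systematic issues such as sensitivity to specific keywords.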

3. (2 points) Extract the top 10 tweets your model is most confident about in the target class (offensive or hateful), the top 10 in the neutral class, and the top 10 your model is most uncertain about. Do you believe the model is doing a great job?
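One possible way to build the three lists, sketched under the assumption that the model outputs a probability of the target class per tweet. Uncertainty is taken here as the distance of that probability from 0.5, which is one reasonable definition among others; `rank_by_confidence` and the toy inputs are hypothetical.

```python
def rank_by_confidence(texts, probs, k=10):
    """Split tweets into most confident target, most confident neutral,
    and most uncertain, given probs[i] = assumed P(target class)."""
    scored = list(zip(probs, texts))
    most_target = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    most_neutral = sorted(scored, key=lambda s: s[0])[:k]
    # closest to 0.5 means the model hesitates the most
    most_uncertain = sorted(scored, key=lambda s: abs(s[0] - 0.5))[:k]
    return most_target, most_neutral, most_uncertain


# toy placeholders, not real tweets or real model scores
sample_texts = ["t0", "t1", "t2", "t3", "t4", "t5"]
sample_probs = [0.95, 0.05, 0.52, 0.70, 0.30, 0.49]

target, neutral, uncertain = rank_by_confidence(sample_texts, sample_probs, k=2)
print(target, neutral, uncertain)
```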


4. **Bonus** Use [SHAP](https://github.com/slundberg/shap/tree/45b85c1837283fdaeed7440ec6365a886af4a333#natural-language-example-transformers) on the provided tweets, or manually written texts, to see if you can find topics on which the model is biased.


5. (2 points) What are the advantages of using a pre-trained transformer vs naive Bayes?
    * Think about training, and usage in production.

6. **Bonus** Train a naive Bayes model on the data, and compare its results with this model.
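As a starting point for this comparison, here is a from-scratch sketch of a multinomial naive Bayes classifier with Laplace smoothing. The class `NaiveBayes`, the whitespace tokenization, and the four training sentences are illustrative assumptions; in practice scikit-learn's `MultinomialNB` on top of a `CountVectorizer` would be the usual choice.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, Laplace-smoothed."""

    def fit(self, texts, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# tiny illustrative training set: 1 = offensive, 0 = not offensive
train_texts = ["you are an idiot", "what a stupid idiot",
               "have a nice day", "thank you for the nice gift"]
train_labels = [1, 1, 0, 0]

nb_model = NaiveBayes().fit(train_texts, train_labels)
print(nb_model.predict("stupid idiot"), nb_model.predict("nice day"))
```

Training here is a single counting pass over the data, which is the main practical contrast with fine-tuning a transformer.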

## Annotate data (7 points)

1. (1 point) Extract about 100 tweets containing at least 20% of your target class (offensive/hateful), from the 10K tweets provided. You can use the pretrained model to help you find tweets in the target class.

In [None]:
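# extract 100 tweets from the dataset, 30% of them offensive (an arbitrary choice above the required 20%)
from datasets import concatenate_datasets

offensive_tweets = train_dataset.filter(lambda row: row["label"] == 1).shuffle(seed=42).select(range(30))
neutral_tweets = train_dataset.filter(lambda row: row["label"] == 0).shuffle(seed=42).select(range(70))
annotation_sample = concatenate_datasets([offensive_tweets, neutral_tweets]).shuffle(seed=42)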

2. (3 points) Altogether, write down an annotation guideline (which should be at least 2/3 of a page long).
    * What does the target class look like?
    * Any examples you could provide for ambiguous cases?
    * Keep a "Can't tell / not annotable" class. Make sure you document what this class means in your guideline.

3. (1 point) Every person in your group is going to annotate these tweets separately. So if you are 3, annotate them 3 times.
    * Typically, create a Google sheet or an Excel document, one tab per person, and in each tab one column for the text and another for the class.


4. (2 points) Evaluate your inter-annotator agreement using Fleiss' Kappa.
    * statsmodels provides an easy-to-use [implementation](https://www.statsmodels.org/stable/generated/statsmodels.stats.inter_rater.fleiss_kappa.html#statsmodels.stats.inter_rater.fleiss_kappa).
    * What does the score mean? Are you doing a good job annotating the data and, if not, why?
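To show what the score measures, here is a dependency-free sketch of Fleiss' Kappa computed from a table of per-item category counts. The helpers `count_table` and `fleiss_kappa_from_counts`, the category names, and the four annotated items are hypothetical; the statsmodels functions `aggregate_raters` and `fleiss_kappa` linked above are the ready-made equivalent.

```python
def count_table(annotations, categories):
    """One row per item, one column per category: number of votes."""
    return [[votes.count(c) for c in categories] for votes in annotations]


def fleiss_kappa_from_counts(table):
    """Fleiss' Kappa; every item must have the same number of annotators."""
    n_items = len(table)
    n_raters = sum(table[0])
    n_total = n_items * n_raters
    n_cats = len(table[0])
    # share of all votes that went to each category
    p_cat = [sum(row[j] for row in table) / n_total for j in range(n_cats)]
    # observed agreement per item
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in table]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)


# illustrative annotations from 3 annotators on 4 tweets
categories = ["offensive", "neutral", "cant_tell"]
annotations = [
    ["offensive", "offensive", "offensive"],
    ["neutral", "neutral", "neutral"],
    ["offensive", "neutral", "offensive"],
    ["neutral", "neutral", "cant_tell"],
]

kappa = fleiss_kappa_from_counts(count_table(annotations, categories))
print(round(kappa, 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; on the commonly cited Landis and Koch scale, values between 0.41 and 0.60 are read as moderate agreement.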


5. **Bonus** Iterate on your annotation guideline with what you learned. Please send both versions in your report.


6. **Bonus** Evaluate the model on your data. Use a majority vote for labels (remove majority "can't tell") and compute the precision, recall, and F1-score.
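The majority vote could be sketched as follows, assuming each tweet's annotations are stored as a list of label strings and that ties (no strict majority) are dropped along with majority "can't tell" items. The helper `majority_labels` and the label names are hypothetical.

```python
from collections import Counter


def majority_labels(annotations, skip="cant_tell"):
    """Map item index -> majority label, dropping ties and `skip` majorities."""
    kept = {}
    for index, votes in enumerate(annotations):
        label, count = Counter(votes).most_common(1)[0]
        # require a strict majority so that ties are not resolved arbitrarily
        if count > len(votes) / 2 and label != skip:
            kept[index] = label
    return kept


# illustrative annotations from 3 annotators on 4 tweets
votes_per_tweet = [
    ["offensive", "offensive", "neutral"],
    ["cant_tell", "cant_tell", "neutral"],
    ["neutral", "neutral", "neutral"],
    ["offensive", "neutral", "cant_tell"],
]

gold = majority_labels(votes_per_tweet)
print(gold)
```

The surviving indices identify the tweets to score, and their majority labels serve as the gold labels against which the model's predictions are compared.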