# Introduction to Natural Language Processing 2 Lab04

## Introduction

Your company wants to sell a moderation API tackling toxic content on Twitter. They ask you to come up with a model which detects toxic tweets. You remember your NLP classes, start looking for existing models or datasets, and find a collection of [academic Twitter datasets on the HuggingFace hub](https://huggingface.co/datasets/tweet_eval). In particular, the `hate` and `offensive` datasets seem close to what you are looking for.

In [71]:
# Get the datasets from HuggingFace
from datasets import load_dataset

hate_dataset = load_dataset("tweet_eval", "hate")
offensive_dataset = load_dataset("tweet_eval", "offensive")

train_hate = hate_dataset["train"]
valid_hate = hate_dataset["validation"]
test_hate = hate_dataset["test"]
train_offensive = offensive_dataset["train"]
valid_offensive = offensive_dataset["validation"]
test_offensive = offensive_dataset["test"]


### 1. (1 point) Pick one of the datasets between `hate` and `offensive`, and justify your choice. Remember that it is for a commercial application (there is a good and a bad answer).

The model needs to detect toxic tweets.

In [72]:
from IPython.display import Markdown, display

# Get a preview of the data
display(Markdown("### Hate dataset"))
display(Markdown("#### Label 0"))
display(train_hate.filter(lambda example: example["label"] == 0).select(range(5))["text"])
display(Markdown("#### Label 1"))
display(train_hate.filter(lambda example: example["label"] == 1).select(range(5))["text"])
display(Markdown("### Offensive dataset"))
display(Markdown("#### Label 0"))
display(train_offensive.filter(lambda example: example["label"] == 0).select(range(5))["text"])
display(Markdown("#### Label 1"))
display(train_offensive.filter(lambda example: example["label"] == 1).select(range(5))["text"])

### Hate dataset

#### Label 0

['@user nice new signage. Are you not concerned by Beatlemania -style hysterical crowds crongregating on you…',
 'Hysterical woman like @user',
 'Me flirting- So tell me about your father...',
 'The Philippine Catholic bishops\' work for migrant workers should focus on families who are "paying the great...',
 'When cuffin season is finally over']

#### Label 1

['A woman who you fucked multiple times saying yo dick small is a compliment you know u hit that spot 😎',
 '@user @user real talk do you have eyes or were they gouged out by a rapefugee?',
 'your girlfriend lookin at me like a groupie in this bitch!',
 "I AM NOT GOING AFTER YOUR EX BF YOU LIEING SACK OF SHIT ! I'm done with you dude that's why I dumped your ass cause your a lieing 😂😡 bitch",
 'Send home migrants not in need of protection, Peter Dutton tells UN, HEY DUTTON HOW ABOUT THE ONES THAT HAVE STAYED AND NOT LEFT THE COUNTRY WHEN THEY SHOULD OVERSTAYERS ? WHY DONT YOU GO AND ROUND ALL THEM UP ?']

### Offensive dataset

#### Label 0

['@user Bono... who cares. Soon people will understand that they gain nothing from following a phony celebrity. Become a Leader of your people instead or help and support your fellow countrymen.',
 '@user Get him some line help. He is gonna be just fine. As the game went on you could see him progressing more with his reads. He brought what has been missing. The deep ball presence. Now he just needs a little more time',
 '@user @user She is great. Hi Fiona!',
 '@user @user @user @user @user @user @user @user @user @user @user @user @user @user @user This is the VetsResistSquadron"" is Bullshit.. They are girl scout veterans, I have never met any other veterans or served with anyone that was a gun control advocate? Have you?""',
 '@user @user Lol. Except he’s the most successful president in our lifetimes. He’s undone most of the damage Obummer did and set America on the right path again. #MAGA']

#### Label 1

['@user Eight years the republicans denied obama’s picks. Breitbarters outrage is as phony as their fake president.',
 "@user She has become a parody unto herself? She has certainly taken some heat for being such an....well idiot. Could be optic too  Who know with Liberals  They're all optics.  No substance",
 '@user Your looking more like a plant #maga #walkaway',
 '@user Antifa would burn a Conservatives house down and CNN would be there lighting the torches &amp; throwing gas on the flames.',
 '@user They cite Jones being banned for violating Twitter\'s ToS. There are blue checkmarks spewing the same, if not worse, kind of shit. If you are going to play the anyone can get banned"" card. Shouldn\'t these people also receive bans and suspensions? #VerifiedHate""']

From the preview of the two datasets, we can see that the `hate` dataset contains more severely toxic messages than the `offensive` one.  
However, we choose `offensive` because we want broader coverage in order to detect more toxic tweets: the fact that the tweets in the `hate` dataset seem more harmful does not mean that those in the `offensive` dataset are acceptable.

In [118]:
chosen = offensive_dataset
train = chosen["train"]
valid = chosen["validation"]
test = chosen["test"]

## Evaluating the dataset

Let's start with a data analysis.  
### 1. (1 point) Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

In [119]:
import pandas as pd

# Get proportion of splits and classes
print("Training split:", len(train), "\n" + str(pd.Series(train["label"]).value_counts()),
      "\nValidation split:", len(valid), "\n" + str(pd.Series(valid["label"]).value_counts()),
      "\nTest split:", len(test), "\n" + str(pd.Series(test["label"]).value_counts()))

Training split: 11916 
0    7975
1    3941
Name: count, dtype: int64 
Validation split: 1324 
0    865
1    459
Name: count, dtype: int64 
Test split: 860 
0    620
1    240
Name: count, dtype: int64


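From the proportion of classes we can see that there are fewer offensive tweets (label 1) than non-offensive tweets (label 0): roughly one third of each split is offensive (about 33% of the training split, 35% of the validation split and 28% of the test split).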
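The class balance can be quantified directly from the counts printed above. The following is a minimal sketch, assuming the counts are copied by hand from that output; the names `split_counts` and `offensive_share` are introduced here for illustration only.

```python
# Class counts copied from the value_counts() output above.
split_counts = {
    "train": {0: 7975, 1: 3941},
    "validation": {0: 865, 1: 459},
    "test": {0: 620, 1: 240},
}


def offensive_share(counts):
    """Return the fraction of tweets labelled 1 (offensive) in one split."""
    total = sum(counts.values())
    return counts[1] / total


# Share of offensive tweets in each split.
shares = {name: offensive_share(counts) for name, counts in split_counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} offensive")
```

Note that the test split has a somewhat lower share of offensive tweets than the training and validation splits, which is worth keeping in mind when comparing metrics across splits.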

### 2. (3 points) Use BERTopic to extract the topics within the data, and the main topics within each class. Please, think about fixing the random seed


#### with the raw data

In [120]:
import bertopic
from umap import UMAP

# fix the seed for reproducibility
umap_model = UMAP(random_state=42)

# Use BERTopic model
# we use all-MiniLM-L6-v2
topic_model = bertopic.BERTopic(language="english", calculate_probabilities=True, verbose=True, embedding_model="all-MiniLM-L6-v2", umap_model=umap_model)
topics, probs = topic_model.fit_transform(train["text"])


2023-06-09 16:52:18,644 - BERTopic - Transformed documents to Embeddings
2023-06-09 16:52:23,596 - BERTopic - Reduced dimensionality
2023-06-09 16:52:28,042 - BERTopic - Clustered reduced embeddings


In [121]:
# get topic per class
topics_per_class = topic_model.topics_per_class(train["text"], train["label"])

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)


#### with balanced data

In [122]:
import random
from datasets import Dataset
random.seed(42)

# Separate the samples based on their labels
label_0_samples = [sample for sample in train if sample["label"] == 0]
label_1_samples = [sample for sample in train if sample["label"] == 1]

# Randomly select 3000 samples from each label
balanced_train = random.sample(label_0_samples, 3000) + random.sample(label_1_samples, 3000)

# Shuffle the order of the balanced train set
random.shuffle(balanced_train)
balanced_train_dataset = Dataset.from_dict({"text": [sample["text"] for sample in balanced_train], "label": [sample["label"] for sample in balanced_train]})
print("Balanced training set:", len(balanced_train_dataset), "\n" + str(pd.Series(balanced_train_dataset["label"]).value_counts()))

Balanced training set: 6000 
0    3000
1    3000
Name: count, dtype: int64
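The balancing step above could also be wrapped in a small reusable helper. This is only a sketch under assumptions: `downsample_per_class` is a hypothetical name, it works on plain lists of dicts rather than on a `Dataset`, and the toy data is invented for illustration.

```python
import random
from collections import Counter


def downsample_per_class(samples, n_per_class, seed=42):
    """Return a shuffled list with n_per_class randomly chosen samples per label."""
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample["label"], []).append(sample)
    balanced = []
    for label in sorted(by_label):
        # rng.sample raises ValueError if a class has fewer than n_per_class items
        balanced.extend(rng.sample(by_label[label], n_per_class))
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 20 tweets with label 0 and 10 with label 1.
toy = [{"text": f"tweet {i}", "label": 0 if i % 3 else 1} for i in range(30)]
balanced_toy = downsample_per_class(toy, n_per_class=5)
print(Counter(sample["label"] for sample in balanced_toy))
```

Using a local `random.Random(seed)` instead of `random.seed` keeps the sampling reproducible even if other cells consume random numbers in between.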


In [123]:
# fix the seed for reproducibility
umap_model = UMAP(random_state=42)

# Use BERTopic model
# we use all-MiniLM-L6-v2
topic_model = bertopic.BERTopic(language="english", calculate_probabilities=True, verbose=True, embedding_model="all-MiniLM-L6-v2", umap_model=umap_model)
topics, probs = topic_model.fit_transform(balanced_train_dataset["text"])

2023-06-09 16:52:30,624 - BERTopic - Transformed documents to Embeddings
2023-06-09 16:52:36,539 - BERTopic - Reduced dimensionality
2023-06-09 16:52:37,315 - BERTopic - Clustered reduced embeddings


In [124]:
# get topic per class
topics_per_class = topic_model.topics_per_class(balanced_train_dataset["text"], balanced_train_dataset["label"])

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)


### 3. (1 point) What do you think about the results? How do you think it could impact a model trained on these data?

We plotted a second graph with only 3000 samples per class to see whether balancing has an impact on the topic frequencies (and it does).

Since the raw data is imbalanced, we cannot really compare the frequencies of the topics between classes; in any case, they are very close between label 0 and label 1.

But we can still look at the topics:  
some topics are only represented in label 1 (offensive), such as `boobs`, `tits`, `dead`, `suck`, `fascist`;  
on the other hand, `lmfao` and `lol` are only represented in label 0.  
However, most of the topics are represented in both classes, so a model trained on these data will probably have difficulty detecting offensive content.
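An alternative to downsampling, when comparing topic frequencies across imbalanced classes, is to divide each count by the size of its class. The sketch below illustrates the idea; the class sizes come from the training split shown earlier, but the topic names and counts in `topic_counts` are hypothetical values invented for illustration, not results of the BERTopic run.

```python
# Class sizes of the raw training split (from the value_counts() output earlier).
class_sizes = {0: 7975, 1: 3941}

# HYPOTHETICAL topic counts per class, invented purely for illustration;
# they are not results from the BERTopic run.
topic_counts = {
    "politics": {0: 400, 1: 200},
    "insults": {0: 80, 1: 160},
}


def relative_frequencies(counts, sizes):
    """Divide each per-class topic count by the number of tweets in that class."""
    return {label: counts[label] / sizes[label] for label in sizes}


for topic, counts in topic_counts.items():
    rel = relative_frequencies(counts, class_sizes)
    print(topic, {label: f"{share:.2%}" for label, share in rel.items()})
```

With relative frequencies, a topic that looks twice as frequent in label 0 in absolute terms can turn out to be equally common in both classes, which is the kind of overlap discussed above.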


## Evaluate a model