# Introduction to Natural Language Processing 2 Lab04

## Introduction

We want to sell a moderation API that tackles toxic content on Twitter. We found a collection of labeled tweets on [HuggingFace](https://huggingface.co/datasets/tweet_eval).  
We want to train a model to predict the toxicity of a tweet. Two datasets seem close to our needs: `hate` and `offensive`.

We will use the `hate` dataset because it targets the most severe kind of toxicity. Our moderation API should first detect hate speech, and only afterwards language that is merely offensive.

## Load the dataset


In [2]:
from datasets import load_dataset
dataset = load_dataset('tweet_eval', 'hate')


## Evaluating the dataset

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2970
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})

The dataset is composed of 3 splits: `train`, `test` and `validation`.  
The `train` split is composed of **9,000** tweets.  
The `test` split is composed of **2,970** tweets.  
The `validation` split is composed of **1,000** tweets.  


Each split is composed of two features: `text` and `label`.


In [11]:
print("Number of non hate tweets in each split:")
print(dataset.filter(lambda example: example['label'] == 0).num_rows)
print("Number of hate tweets in each split:")
print(dataset.filter(lambda example: example['label'] == 1).num_rows)


Number of non hate tweets in each split:
{'train': 5217, 'test': 1718, 'validation': 573}
Number of hate tweets in each split:
{'train': 3783, 'test': 1252, 'validation': 427}


The labels are binary: `0` for non-hate tweets and `1` for hate tweets.  
The dataset is moderately imbalanced: about 42% of the tweets in each split are labeled as hate (for example, 3,783 of the 9,000 training tweets).
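
To quantify the imbalance, here is a minimal sketch that computes the share of hate tweets per split. The counts are copied from the cell output above; the helper name `hate_ratio` is ours and is not part of the `datasets` library.

```python
# Counts copied from the output of the filtering cell above.
non_hate = {'train': 5217, 'test': 1718, 'validation': 573}
hate = {'train': 3783, 'test': 1252, 'validation': 427}

def hate_ratio(split):
    """Fraction of tweets labeled 1 (hate) in the given split."""
    total = non_hate[split] + hate[split]
    return hate[split] / total

for split in hate:
    print(f"{split}: {hate_ratio(split):.1%} hate")
```

The ratio is similar across the three splits, so a metric computed on `validation` should be roughly comparable to one computed on `test`.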

Let's see some examples of tweets and their labels.

In [32]:
sample = dataset['train'][:10]  # slicing returns a dict of columns: {'text': [...], 'label': [...]}
for text, label in zip(sample['text'], sample['label']):
    print(f"Label: {label} | {text}")

Label: 0 | @user nice new signage. Are you not concerned by Beatlemania -style hysterical crowds crongregating on you…
Label: 1 | A woman who you fucked multiple times saying yo dick small is a compliment you know u hit that spot 😎
Label: 1 | @user @user real talk do you have eyes or were they gouged out by a rapefugee?
Label: 1 | your girlfriend lookin at me like a groupie in this bitch!
Label: 0 | Hysterical woman like @user
Label: 0 | Me flirting- So tell me about your father...
Label: 0 | The Philippine Catholic bishops' work for migrant workers should focus on families who are "paying the great...
Label: 1 | I AM NOT GOING AFTER YOUR EX BF YOU LIEING SACK OF SHIT ! I'm done with you dude that's why I dumped your ass cause your a lieing 😂😡 bitch
Label: 0 | When cuffin season is finally over
Label: 1 | Send home migrants not in need of protection, Peter Dutton tells UN, HEY DUTTON HOW ABOUT THE ONES THAT HAVE STAYED AND NOT LEFT THE COUNTRY WHEN THEY SHOULD OVERSTAYERS ? WHY DONT YO

Most of the hate tweets contain profanity, insults, and vulgarity. Words in capital letters also appear to be used to emphasize the hate.
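
The remark about capital letters could be checked with a simple feature. The sketch below is only an illustration: `uppercase_ratio` is a hypothetical helper, and the two strings are excerpts from the examples printed above.

```python
def uppercase_ratio(text):
    """Share of alphabetic characters that are uppercase (0.0 if there are none)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

# Excerpts from the sample tweets shown above.
shouting = "HEY DUTTON HOW ABOUT THE ONES THAT HAVE STAYED"
calm = "When cuffin season is finally over"

print(uppercase_ratio(shouting), uppercase_ratio(calm))
```

Comparing the average of this ratio between the two classes would show whether the impression holds beyond these ten examples.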

Now let's use [BERTopic](https://github.com/MaartenGr/BERTopic) to extract the topics within the data, and the main topics within each class.

In [33]:
! pip install bertopic


In [34]:
from bertopic import BERTopic
from umap import UMAP

umap_model = UMAP(random_state=42)
topic_model = BERTopic(umap_model=umap_model, embedding_model="all-MiniLM-L6-v2")



In [35]:
topics, _ = topic_model.fit_transform(dataset['train']['text'])



In [36]:
topic_model.visualize_topics()

In [38]:
topic_model.visualize_barchart()

In [39]:
topic_model.get_topic_info()

| Topic | Count | Name |
|---|---|---|
| -1 | 1942 | -1_you_bitch_women_your |
| 0 | 4433 | 0_the_to_of_in |
| 1 | 349 | 1_bitch_cunt_user_you |
| 2 | 273 | 2_rape_women_woman_user |
| 3 | 112 | 3_men_all_not_women |
| 4 | 106 | 4_hoe_hoes_ho_you |
| 5 | 99 | 5_bitch_whore_shit_stupid |
| 6 | 87 | 6_dick_my_bitches_you |
| 7 | 84 | 7_skank_you_user_re |
| 8 | 75 | 8_me_when_someone_ever |
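
The cells above fit the topic model on all training tweets but do not break the topics down by class. A minimal sketch of such a breakdown follows, assuming `topics` is the list returned by `fit_transform` and the labels come from `dataset['train']['label']`. The helper `topics_by_label` and the toy data are hypothetical and only illustrate the counting step. BERTopic also has a `topics_per_class` method; check its documentation for the signature in your version.

```python
from collections import Counter, defaultdict

def topics_by_label(topics, labels, top_n=3):
    """Return the top_n most frequent topic ids for each label, ignoring outliers (-1)."""
    counts = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        if topic != -1:  # BERTopic assigns -1 to documents it treats as outliers
            counts[label][topic] += 1
    return {label: counter.most_common(top_n) for label, counter in counts.items()}

# Toy, made-up assignments; in the notebook these would be
# `topics` and dataset['train']['label'].
toy_topics = [0, 0, 1, -1, 2, 1, 0, 2]
toy_labels = [0, 0, 1, 1, 1, 1, 0, 0]

result = topics_by_label(toy_topics, toy_labels)
print(result)
```

A topic that appears almost exclusively under one label is a hint that a classifier could learn the topic vocabulary rather than the hateful intent.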


TODO: What do you think about the results? How could they affect a model trained on this data?

## Evaluate a model

In [40]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# `task` selects both the model checkpoint and the label mapping file
task = 'hate'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Download the label mapping (tab-separated rows: id, label name)
labels = []
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")



1) not-hate 0.9168
2) hate 0.0832
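
A single prediction does not tell us how well the model performs. One possible next step is to predict a label for every tweet in the `test` split and compare the predictions with `dataset['test']['label']`. The sketch below shows only the scoring part, in plain Python. Using macro-averaged F1 is an assumption on our side (it weights both classes equally despite the imbalance), and the label lists are made up for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up labels and predictions, not results from the model above.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(round(macro_f1(y_true, y_pred), 4))
```

With real predictions, `y_pred` would be the index of the highest score (`np.argmax(scores)`) for each preprocessed test tweet.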
