# Introduction to Natural Language Processing 2 Lab04

**This lab is mainly about data and model analysis. There is very little code. Make sure you send back a proper report with your code, guideline, annotated sheets, and theoretical answers.**

## Introduction (1 point)

Your company wants to sell a moderation API tackling toxic content on Twitter. They ask you to come up with a model which detects toxic tweets. You remember your NLP classes, and start looking for existing models or datasets, and find a collection of [academic Twitter datasets on the HuggingFace hub](https://huggingface.co/datasets/tweet_eval). In particular, the `hate` and `offensive` datasets seem close to what you are looking for.

1. (1 point) Pick one of the datasets between hate and offensive, and justify your choice. Remember that it is for a commercial application (there is a good and a bad answer).

Let's check some crucial points to choose between the two datasets:

* `Relevance`: Is the dataset relevant to the problem we are trying to solve?

* `Quality`: Is the dataset of good quality? How accurate are the labels?

* `Size`: Is the dataset big enough to train a model?

* `Diversity`: Is the dataset diverse enough to make the model robust and generalizable?

Considering these points, let's compare the two datasets:

The `hate` dataset will most likely contain tweets that express hatred, which is certainly a form of toxic content.
However, the scope of this dataset might be limited, as there are other forms of toxic content beyond expressions of hate.

On the other hand, the `offensive` dataset is likely to cover a broader range of toxic content, including not only hate speech but also other forms of offensive language such as insults, or obscene content. This makes it more relevant to the task at hand. Additionally, this broader dataset will help train a model that is more robust and able to generalize to a wide range of toxic content.

Furthermore, we noticed that the `hate` dataset is not usable for commercial purposes, as it is licensed under CC BY-NC-SA 4.0. This is not the case for the `offensive` dataset, which is licensed under CC BY-SA 4.0.

For these reasons, we will choose the `offensive` dataset.

## Imports

In [47]:
from datasets import load_dataset
from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_fscore_support
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification,  AutoTokenizer, pipeline
import statsmodels
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters
import numpy as np
import pandas as pd
from scipy.special import softmax
import csv
import urllib.request
import json
import torch

## Evaluating the dataset (5 points)

Before using the data to train a model, you have the right reflex and start with a data analysis.

1. (1 point) Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

In [3]:
dataset_offensive = load_dataset("tweet_eval", "offensive")
dataset_offensive


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11916
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 860
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1324
    })
})

In [4]:
# we extract the training test and validation sets

train_dataset = dataset_offensive["train"]
test_dataset = dataset_offensive["test"]
validation_dataset = dataset_offensive["validation"]

# compute the proportion of offensive tweets in each dataset

print("Proportion of offensive tweets in the training set: ", sum(train_dataset["label"])/len(train_dataset["label"]))
print("Proportion of offensive tweets in the test set: ", sum(test_dataset["label"])/len(test_dataset["label"]))
print("Proportion of offensive tweets in the validation set: ", sum(validation_dataset["label"])/len(validation_dataset["label"]))

Proportion of offensive tweets in the training set:  0.3307317891910037
Proportion of offensive tweets in the test set:  0.27906976744186046
Proportion of offensive tweets in the validation set:  0.3466767371601209


We can see that there are three splits in the dataset: `train`, `validation` and `test`. The `train` split contains 11916 tweets, the `validation` split contains 1324 tweets, and the `test` split contains 860 tweets. The dataset is imbalanced: roughly a third of the tweets are offensive (33% in `train`, 35% in `validation`, 28% in `test`), so the `test` split has a slightly lower share of offensive tweets than the other two.

At first glance we can see that the tweets contain a lot of hashtags, mentions, and emojis. We can also see that there are a lot of spelling mistakes and abbreviations.

2. (3 points) Use [BERTopic](https://github.com/MaartenGr/BERTopic) to extract the topics within the data, and the main topics within each class. Please, think about [fixing the random seed](https://stackoverflow.com/questions/71320201/how-to-fix-random-seed-for-bertopic).
    * A [good model](https://github.com/MaartenGr/BERTopic#embedding-models) for sentence similarity is `all-MiniLM-L6-v2`, as it is [fast, light, and pretty accurate](https://www.sbert.net/docs/pretrained_models.html). You can use another one, but make sure to document your choice.
    * [This](https://maartengr.github.io/BERTopic/api/plotting/topics_per_class.html) might help.

In [5]:
# Our BERTopic model will contain a pre-trained embedding model and a UMAP reproducible model for dimensionality reduction
model = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=UMAP(random_state=42), verbose=True)

topics, _ = model.fit_transform(train_dataset["text"])

2023-06-25 20:52:21,042 - BERTopic - Transformed documents to Embeddings
2023-06-25 20:52:34,491 - BERTopic - Reduced dimensionality
2023-06-25 20:52:34,649 - BERTopic - Clustered reduced embeddings


In [6]:
topic_info = model.get_topic_info()
topic_info.head(10)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 3069 | -1_the_to_and_of | [the, to, and, of, is, user, for, maga, in, he] | [@user I just said that very same thing. That ... |
| 1 | 0 | 3939 | 0_she_you_is_he | [she, you, is, he, are, user, so, her, my, and] | [@user @user He is, @user She is 😭😭😭, @user Sh... |
| 2 | 1 | 991 | 1_gun_control_guns_laws | [gun, control, guns, laws, the, to, about, in,... | [@user But gun control, @user Gun control is n... |
| 3 | 2 | 382 | 2_liberals_they_user_their | [liberals, they, user, their, the, are, libera... | [@user Liberals be like, @user @user @user @us... |
| 4 | 3 | 317 | 3_antifa_user_they_your | [antifa, user, they, your, to, you, of, them, ... | [@user @user No that is Antifa, @user Like ANT... |
| 5 | 4 | 262 | 4_conservatives_they_the_and | [conservatives, they, the, and, are, conservat... | [@user all conservatives are bad people, @user... |
| 6 | 5 | 195 | 5_kavanaugh_maga_judge_vote | [kavanaugh, maga, judge, vote, to, ford, will,... | [@user Pray for Judge Kavanaugh and his family... |
| 7 | 6 | 161 | 6_brexit_uk_tories_eu | [brexit, uk, tories, eu, labour, the, tory, co... | [@user @user @user @user And there's #Brexit 👇... |
| 8 | 7 | 153 | 7_user_treph_follow_gt | [user, treph, follow, gt, you, lt, following, ... | [@user @user @user @user @user @user @user @us... |
| 9 | 8 | 113 | 8_canada_trudeau_liberals_ndp | [canada, trudeau, liberals, ndp, canadians, on... | [@user Go back to Canada, @user Time for Canad... |


Let's visualize the topics we extracted within the data

In [7]:
topics_per_class = model.topics_per_class(train_dataset["text"], train_dataset["label"])
model.visualize_topics_per_class(topics_per_class)


3. (1 point) What do you think about the results? How do you think it could impact a model trained on these data?

We can observe several things:
* There is a lot of <i>noise</i>: the outlier topic (-1) holds 3069 of the 11916 training tweets, and most topic keywords are stop words or the `user` placeholder rather than meaningful terms.
* The remaining topics are very unbalanced in size: topic 0 (she, you, he) holds 3939 tweets and gun control 991, while the other topics (liberals, antifa, conservatives, Kavanaugh, Brexit, etc.) hold a few hundred tweets or fewer. Most of the identifiable topics are political.

This could impact a model trained on these data in several ways:
* The model could be biased towards the most frequent topics. Concretely, it could be more likely to classify a tweet as offensive simply because it contains words related to those topics.
* The model could also be confused by the noise and the irrelevant topics.

4. **Bonus** By default, BERTopic extracts single keywords. Play with the model to extract bigrams or more. See if you can go deeper in your analysis.

By default, BERTopic builds its topic representations with a `CountVectorizer` that uses unigrams to create the document-term matrix. It is possible to extract bigrams or longer n-grams instead, either with BERTopic's `n_gram_range` argument or by passing a custom `CountVectorizer` configured with its `ngram_range` parameter.

Both parameters accept a tuple (min_n, max_n), where `min_n` is the lower and `max_n` is the upper boundary of the range of n-values for the n-grams to be extracted. With `ngram_range=(2, 5)`, as used in the next cell, only sequences of 2 to 5 words are kept.
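To make the (min_n, max_n) semantics concrete, here is a minimal pure-Python sketch. The helper `extract_ngrams` is hypothetical and only illustrates the idea; it is not how scikit-learn tokenizes internally (the real `CountVectorizer` also lowercases and applies its own token pattern).

```python
def extract_ngrams(tokens, ngram_range):
    """Return every n-gram of `tokens` for n in [min_n, max_n], as space-joined strings.

    Hypothetical helper for illustration only.
    """
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of size n over the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "gun control is not the answer".split()

# (1, 2) keeps unigrams and bigrams, (2, 2) keeps bigrams only
unigrams_and_bigrams = extract_ngrams(tokens, (1, 2))
bigrams_only = extract_ngrams(tokens, (2, 2))

print(bigrams_only)
```

A range starting at 2 drops single words entirely, which is why the topic names in the n-gram model are made of word pairs and longer sequences.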

In [8]:
model_ngram = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=UMAP(random_state=42), vectorizer_model=CountVectorizer(ngram_range=(2, 5)), verbose=True)

topics, _ = model_ngram.fit_transform(train_dataset["text"])

2023-06-25 20:53:23,425 - BERTopic - Transformed documents to Embeddings
2023-06-25 20:53:29,230 - BERTopic - Reduced dimensionality
2023-06-25 20:53:29,383 - BERTopic - Clustered reduced embeddings


In [9]:
model_ngram.get_topic_info().head(5)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 3069 | -1_user user_user user user_user user user use... | [user user, user user user, user user user use... | [@user @user @user @user @user @user @user @us... |
| 1 | 0 | 3939 | 0_you are_she is_he is_user user | [you are, she is, he is, user user, user you, ... | [@user @user He is, @user She is 😘, @user She ... |
| 2 | 1 | 991 | 1_gun control_user user_user user user_user us... | [gun control, user user, user user user, user ... | [@user @user @user @user @user @user @user @us... |
| 3 | 2 | 382 | 2_user user_user liberals_the liberals_liberal... | [user user, user liberals, the liberals, liber... | [@user @user @user @user @user @user @user @us... |
| 4 | 3 | 317 | 3_antifa user_user antifa_user user_antifa use... | [antifa user, user antifa, user user, antifa u... | [@user Sounds like he joined #Antifa - Gov. @u... |


In [10]:
# use the n-gram model here, not the unigram `model` fitted earlier
topics_per_class_ngram = model_ngram.topics_per_class(train_dataset["text"], train_dataset["label"])
model_ngram.visualize_topics_per_class(topics_per_class_ngram)


## Evaluating the model (8 points)

You were thinking about fine-tuning a [RoBERTa](https://arxiv.org/abs/1907.11692) model on the dataset, but RoBERTa has been trained on 2019 data, which do not include any tweet. Moreover, pretraining a model from scratch can be costly. Fortunately, a [reliable entity](https://github.com/cardiffnlp) pretrained RoBERTa on recent tweets and even fine-tuned it on both datasets [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-offensive?text=I+like+you.+I+love+you) and [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate?text=I+like+you.+I+love+you).



1. (2 points) Evaluate the model on the test split of the dataset you picked, using precision, recall, and F1-score.

In [11]:
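# Preprocess text: replace usernames and links with placeholders
def preprocess(text:str) -> str:
    """
    Replace user mentions with '@user' and links with 'http'.

    :param text: raw tweet
    :return: tweet with mentions and links replaced by placeholders
    """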
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


task='offensive'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [12]:
# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]


In [15]:
# Assuming MODEL, tokenizer and labels are already defined
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Create pipeline
nlp = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Predict
predictions = []
for item in test_dataset:
    text = preprocess(item['text'])
    output = nlp(text)[0]
    pred_label = output['label']
    # we convert to string if the labels are string
    pred_label = str(pred_label)
    predictions.append(pred_label)

# for each item in predictions, replace 'offensive'  with 1 and 'not-offensive' with 0
predictions = [1 if item == 'offensive' else 0 for item in predictions]

# Get true labels
true_labels = [item['label'] for item in test_dataset]


In [None]:
# compute precision, recall and f1-score
precision = precision_score(true_labels, predictions, average='macro')
recall = recall_score(true_labels, predictions, average='macro')
f1 = f1_score(true_labels, predictions, average='macro')

print('Precision: %.3f' % precision)
print('Recall: %.3f' % recall)
print('F1-Score: %.3f' % f1)

Precision: 0.837
Recall: 0.800
F1-Score: 0.816
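Since the scores above use `average='macro'`, it may help to see what that averaging does. The following is a minimal sketch that recomputes macro precision, recall, and F1 by hand on a toy example; `macro_scores` and the toy labels are hypothetical and not part of the lab data.

```python
def macro_scores(y_true, y_pred, classes=(0, 1)):
    """Compute precision, recall and F1 per class, then take the unweighted mean."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# toy labels (made up for illustration): 4 non-offensive, 2 offensive
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 0]

macro_p, macro_r, macro_f1 = macro_scores(toy_true, toy_pred)
print(macro_p, macro_r, macro_f1)
```

Because each class counts equally, the minority (offensive) class weighs as much as the majority class, which is a reasonable choice for an imbalanced dataset like this one.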


2. (2 points) Look for prediction failures. Extract the top 5 misclassified tweets (highest score in wrong class) for each class and discuss what could be wrong with the model.

In [16]:
# create a list of dictionaries with the text, true label, predicted label and confidence score
missclassified_tweets = []
for i in range(len(predictions)):
    if predictions[i] != true_labels[i]:
        text = preprocess(test_dataset[i]['text'])
        output = nlp(text)[0]
        pred_label = output['label'] # get predicted label
        # Convert to string if the labels are string
        pred_label = str(pred_label)
        confidence_score = output['score'] # get confidence score
        missclassified_tweets.append({'text': text, 'true_label': true_labels[i], 'pred_label': pred_label, 'confidence_score': confidence_score})

# sort the list of dictionaries by confidence score
missclassified_tweets = sorted(missclassified_tweets, key=lambda k: k['confidence_score'], reverse=True)

# print the 5 missclassified tweets with the highest confidence score
for i in range(5):
    print('Text: ', missclassified_tweets[i]['text'])
    print('True label: ', missclassified_tweets[i]['true_label'])
    print('Predicted label: ', missclassified_tweets[i]['pred_label'])
    print('Confidence score: ', missclassified_tweets[i]['confidence_score'])
    print('\n')


Text:  #Liberals / #Democrats THIS is what you stand for. If not, then #WalkAway
True label:  1
Predicted label:  non-offensive
Confidence score:  0.9338217973709106


Text:  #Liberals Are Reaching Peak Desperation To Call On #PhillipRuddock To Talk With #Turnbull To Convince Him To Help with #WentworthVotes 18 Sept 2018  @user #Auspol #LNP #NSWpol @user  @user @user #LNPMemes
True label:  1
Predicted label:  non-offensive
Confidence score:  0.919756293296814


Text:  #NoPasaran: Unity demo to oppose the far-right in #London – #antifa #Oct13 — Enough is Enough!
True label:  1
Predicted label:  non-offensive
Confidence score:  0.9112220406532288


Text:  #BREXIT deal HAS been reached - and will be unveiled at special summit in NOVEMBER, Has @user sold out the #UK to the eu??? She better have not or the @user are finished!! @user
True label:  1
Predicted label:  non-offensive
Confidence score:  0.9081719517707825


Text:  Are you fucking serious?
True label:  0
Predicted label:  offensiv

The most confident errors fall into two patterns. The false negatives are political tweets built around hashtags (`#Liberals`, `#antifa`, `#BREXIT`) with no explicit profanity: the model labels them non-offensive with more than 90% confidence, although the annotators marked them offensive. Either the offensiveness depends on context the model cannot see (for example, the tweet being quoted or replied to), or the labels themselves are debatable.

Implicit language such as sarcasm and irony is also hard to detect, because the literal wording can be neutral or even positive while the intended meaning is not.

Conversely, the model seems to give more weight to the presence of certain words than to the context of the tweet. "Are you fucking serious?" is labelled non-offensive by the annotators but predicted offensive, presumably because it contains a word that often appears in offensive tweets, even though it is not used to attack anyone here.
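The question asks for the top 5 errors for each class, whereas a single global ranking can be dominated by one class. A possible way to split the ranking per class is sketched below. It assumes each record stores integer-coded `true_label` and `pred_label` plus a `confidence_score`; the helper name and the toy records are hypothetical.

```python
def top_misclassified_per_class(records, k=5):
    """Group wrong predictions by true label and keep the k most confident in each group."""
    by_class = {}
    for record in records:
        if record["pred_label"] != record["true_label"]:
            by_class.setdefault(record["true_label"], []).append(record)
    return {
        label: sorted(rows, key=lambda r: r["confidence_score"], reverse=True)[:k]
        for label, rows in by_class.items()
    }


# toy records (placeholders, not real tweets or real scores)
toy_records = [
    {"text": "toy A", "true_label": 1, "pred_label": 0, "confidence_score": 0.93},
    {"text": "toy B", "true_label": 1, "pred_label": 0, "confidence_score": 0.60},
    {"text": "toy C", "true_label": 0, "pred_label": 1, "confidence_score": 0.75},
    {"text": "toy D", "true_label": 0, "pred_label": 0, "confidence_score": 0.99},
    {"text": "toy E", "true_label": 1, "pred_label": 0, "confidence_score": 0.80},
]

worst = top_misclassified_per_class(toy_records, k=2)
for label, rows in worst.items():
    print(label, [r["text"] for r in rows])
```

With the notebook's variables, the string labels returned by the pipeline would first need to be mapped to 0/1 so that they can be compared with the dataset labels.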

3. (2 points) Extract the top 10 tweets your model is most confident about in the target class (offensive or hateful), the top 10 in the neutral class, and the top 10 your model is most uncertain about. Do you believe the model is doing a great job?


## Top 10 highest confidence tweets in the offensive class

In [17]:
# extract the top 10 tweets with the highest confidence score for offensive class

# create a list of dictionaries with the text, true label, predicted label and confidence score
offensive_tweets = []
for i in range(len(predictions)):
    text = preprocess(test_dataset[i]['text'])
    output = nlp(text)[0]
    pred_label = output['label'] # get predicted label
    # Convert to string if the labels are string
    pred_label = str(pred_label)
    confidence_score = output['score'] # get confidence score
    if pred_label == 'offensive':
        offensive_tweets.append({'text': text, 'true_label': true_labels[i], 'pred_label': pred_label, 'confidence_score': confidence_score})

# sort the list of dictionaries by confidence score
offensive_tweets = sorted(offensive_tweets, key=lambda k: k['confidence_score'], reverse=True)

# print the top 10 tweets with the highest confidence score for offensive class
for i in range(10):
    print('Text: ', offensive_tweets[i]['text'])
    print('True label: ', offensive_tweets[i]['true_label'])
    print('Predicted label: ', offensive_tweets[i]['pred_label'])
    print('Confidence score: ', offensive_tweets[i]['confidence_score'])
    print('\n')



Text:  @user nigga are you stupid your trash dont play with him play with your bitch 😂
True label:  1
Predicted label:  offensive
Confidence score:  0.9518308639526367


Text:  #ArianaAsesina? Is that serious?! Holy shit, please your fucking assholes, don't blame someone for the death of other one. She is sad enough for today, don't you see? It isn't fault of none, he had an overdose and died. End. Stop wanting someone to blame, fuckers.
True label:  1
Predicted label:  offensive
Confidence score:  0.9391582012176514


Text:  @user Damn I felt this shit. Why you so loud lol
True label:  1
Predicted label:  offensive
Confidence score:  0.9266201853752136


Text:  $1500 for a phone. You all are fucking dumb.
True label:  1
Predicted label:  offensive
Confidence score:  0.9251351356506348


Text:  All these sick ass ppl from school gave me something and now I have to chug down this nasty drink so it can go away🙃
True label:  1
Predicted label:  offensive
Confidence score:  0.9212592244148

## Top 10 highest confidence tweets in the neutral class

In [18]:
# let's create a list of dictionaries with the text, true label, predicted label and confidence score
not_offensive_tweets = []
for i in range(len(predictions)):
    text = preprocess(test_dataset[i]['text'])
    output = nlp(text)[0]
    pred_label = output['label'] # get predicted label
    # Convert to string if the labels are string
    pred_label = str(pred_label)
    confidence_score = output['score'] # get confidence score
    if pred_label == 'non-offensive':
        not_offensive_tweets.append({'text': text, 'true_label': true_labels[i], 'pred_label': pred_label, 'confidence_score': confidence_score})
    
# sort the list of dictionaries by confidence score
not_offensive_tweets = sorted(not_offensive_tweets, key=lambda k: k['confidence_score'], reverse=True)

for i in range(10):
    print('Text: ', not_offensive_tweets[i]['text'])
    print('True label: ', not_offensive_tweets[i]['true_label'])
    print('Predicted label: ', not_offensive_tweets[i]['pred_label'])
    print('Confidence score: ', not_offensive_tweets[i]['confidence_score'])
    print('\n')

Text:  #WCW #WCE @user  It’s your special day of the week again. I really miss you and I’m looking forward to see you soon. Don’t forget that I love you. I love you with all my heart because you are my heart! ❤️
True label:  0
Predicted label:  non-offensive
Confidence score:  0.9806139469146729


Text:  #WELOVESEUNGCHEOL @user   I am happy and proud of the work you have done to train seventeen along with the other members. I see you and you are wonderful and incredible. I really love u ㅠㅠ 💕.
True label:  0
Predicted label:  non-offensive
Confidence score:  0.9781582355499268


Text:  #NationalDayofEncouragement  #TwitterFamily  You’re amazing just the way you are 💟☮️
True label:  0
Predicted label:  non-offensive
Confidence score:  0.9777984023094177


Text:  #AskAlly I only say that I adore you and you are a great person, please take care of yourself always❤️
True label:  0
Predicted label:  non-offensive
Confidence score:  0.9766384959220886


Text:  #WeLoveSergioBecause he is the s

In [None]:
# top 10 tweets with the lowest confidence score for offensive class
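For the "most uncertain" part, one option is to rank predictions by how close their score is to 0.5, since a binary classifier is least sure near that point. The sketch below assumes each record carries the probability of its predicted label in `confidence_score`; the helper name and toy values are hypothetical.

```python
def most_uncertain(records, k=10):
    """Return the k records whose confidence score is closest to 0.5."""
    return sorted(records, key=lambda r: abs(r["confidence_score"] - 0.5))[:k]


# toy records (placeholders, not real tweets or real scores)
toy_predictions = [
    {"text": "toy A", "confidence_score": 0.98},
    {"text": "toy B", "confidence_score": 0.51},
    {"text": "toy C", "confidence_score": 0.70},
    {"text": "toy D", "confidence_score": 0.55},
]

uncertain = most_uncertain(toy_predictions, k=2)
print([r["text"] for r in uncertain])
```

In the notebook, the records could be built once from `nlp(preprocess(text))[0]` for every test tweet and reused for all three rankings, instead of re-running the model in each cell.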

4. **Bonus** Use [SHAP](https://github.com/slundberg/shap/tree/45b85c1837283fdaeed7440ec6365a886af4a333#natural-language-example-transformers) on the provided tweets, or manually written texts, to see if you can find topics on which the model is biased.


In [19]:
import transformers
import shap
import json

%pip install numpy==1.23.0 # to make shap working

# load a transformers pipeline model
model_shap = transformers.pipeline('text-classification', return_all_scores=True, model=model, tokenizer=tokenizer)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Note: you may need to restart the kernel to use updated packages.


In [27]:
# load the JSON file and extract the first text in a list
with open('tweets.json') as f:
    data = json.load(f)
text = [item['text'] for item in data[:1]]


In [28]:
# explainer
explainer = shap.Explainer(model_shap, tokenizer)

# compute shap values
shap_values = explainer(text)



In [29]:
# plot the shap values for the first text
shap.plots.text(shap_values[0])

5. **Bonus** Train a naive Bayes model on the data, and compare its results with this model.

In [32]:
# Vectorize the text data
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_dataset['text'])
y_train = np.array(train_dataset['label'])
X_test = vectorizer.transform(test_dataset['text'])
y_test = np.array(test_dataset['label'])


# our Naive Bayes classifier
nb = MultinomialNB()

# Train our classifier (fit in place, so the `model` variable holding the transformer is not overwritten)
nb.fit(X_train, y_train)

# Make predictions
preds = nb.predict(X_test)

In [33]:
# let's evaluate the performance of the model on the test set

precision = precision_score(true_labels, preds, average='macro')
recall = recall_score(true_labels, preds, average='macro')
f1 = f1_score(true_labels, preds, average='macro')

print('Precision: %.3f' % precision)
print('Recall: %.3f' % recall)
print('F1-Score: %.3f' % f1)

Precision: 0.729
Recall: 0.714
F1-Score: 0.721


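The naive Bayes baseline is clearly weaker than the fine-tuned RoBERTa model: its macro F1-score is 0.721 against 0.816, with both precision (0.729 vs. 0.837) and recall (0.714 vs. 0.800) lower.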

## Annotate data (7 points)

1. (1 point) Extract about 100 tweets containing at least 20% of your target class (offensive/hateful), from the 10K tweets provided. You can use the pretrained model to help you find tweets in the target class.

In [37]:
with open('tweets.json') as f:
    data = json.load(f)
texts = [item['text'] for item in data]

In [38]:
df_tweets = pd.DataFrame(texts, columns=['text'])

In [39]:
# use the model to predict the labels for the tweets
df_tweets['label'] = df_tweets['text'].apply(lambda x: nlp(x)[0]['label'])
df_tweets['label'] = df_tweets['label'].apply(lambda x: 1 if x == 'offensive' else 0)


In [40]:
df_tweets.head()

| | text | label |
|---|---|---|
| 0 | YOU BETTER SUCK HIS DICK KOZY I SEE YOU WITH K... | 1 |
| 1 | I still canr believe it.😭😭😭😭😭 | 0 |
| 2 | You should raise the webform....how would they... | 0 |
| 3 | im tired too but this is so entertaining i cant | 0 |
| 4 | Fuckof | 1 |


2. (3 points) Altogether, write down an annotation guideline (which should be at least 2/3 of a page long).
    * What does the target class look like?
    * Any examples you could provide for ambiguous cases?
    * Keep "Can't tell / not annotable" class. Make sure you document what this class means in your guideline.

`Annotation Tips and Guidelines for Offensive Language Detection`

---

Our offensive dataset sorts content into two broad classes: non-offensive and offensive. This guideline describes the criteria for each class, along with a third category, "Can't tell / not annotable," for ambiguous or unclear instances.

`Non-Offensive Class`

A text in this category does not contain any language or expressions that may be perceived as offensive, disrespectful, or harmful. It adheres to common standards of politeness and appropriateness, and does not engage in personal attacks, hate speech, or any form of derogatory communication. 

`Offensive Class`

Content in this category incorporates language or expressions that are harmful, disrespectful, or offensive. The offensive class can be further divided into various subcategories such as:

a) **Hate speech**: Any communication that degrades, threatens, or discriminates against an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender.

b) **Obscene language**: Use of vulgar or profane language which may include sexual content or explicit language.

c) **Insults/Personal Attacks**: Language that personally attacks an individual or group, whether through name-calling, ad hominem attacks, or derogatory comments.

**Ambiguous Cases**

1. **Sarcasm or Irony**: Sarcasm or irony can often blur the line between offensive and non-offensive, such as "You're really smart, aren't you?" Depending on context, this could be either a genuine compliment or a sarcastic remark.

2. **Cultural Differences**: Certain phrases or terms may be considered offensive in some cultures but not in others. For instance, colloquialisms that might seem harmless in one culture could be offensive in another. 

3. **Implicit or Coded Language**: Certain offensive phrases might be veiled under seemingly harmless words or coded language, like using "snowflake" to insult someone for being sensitive or politically correct. These can be difficult to identify without an understanding of the coded meaning.

`Can't Tell / Not Annotable Class`

This category is reserved for instances where the content does not clearly fall into the offensive or non-offensive class. This may occur for several reasons:

1. **Lack of Context**: If the content cannot be accurately interpreted without additional context, it should be classified as "Can't tell / not annotable."

2. **Ambiguous Language**: If the language is ambiguous or can be interpreted in multiple ways, such as in the case of sarcasm or irony without clear indicators, classify it as "Can't tell / not annotable."

3. **Language Barriers**: If the text is in a language, dialect, or cultural context that the annotator does not understand well enough to make an accurate classification, it should be classified as "Can't tell / not annotable."

These tips are a starting point for annotating the data, not an exhaustive set of rules. Keep in mind that every annotator has their own biases and cultural background, which may influence how they interpret a text. We must therefore try to be as objective as possible when annotating, and avoid making assumptions about the author's intent or meaning.

3. (1 point) Every person in your group is going to annotate these tweets separately. So if you are 3, annotate them 3 times.
    * Typically, create a Google sheet or an Excel document, one tab per person, in each tab one column for the text, and another for the class.


In [41]:
# list of 20 offensive tweets
offensive_tweets = df_tweets[df_tweets['label'] == 1]['text'].tolist()[:20]

# list of 80 non-offensive tweets
not_offensive_tweets = df_tweets[df_tweets['label'] == 0]['text'].tolist()[:80]
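Taking the first 20 and first 80 tweets keeps whatever order the file has. As an alternative, a seeded random sample could be drawn from each group, as in the sketch below. The helper `sample_for_annotation`, its parameters, and the toy lists are hypothetical. Also note that the 20% share is based on model predictions, so the true share after annotation may differ.

```python
import random


def sample_for_annotation(offensive, non_offensive, n_total=100, target_share=0.2, seed=42):
    """Draw a reproducible, shuffled sample with a fixed share of predicted-offensive texts."""
    rng = random.Random(seed)  # fixed seed so every annotator gets the same batch
    n_target = round(n_total * target_share)
    picked = rng.sample(offensive, n_target) + rng.sample(non_offensive, n_total - n_target)
    rng.shuffle(picked)  # avoid showing all offensive tweets first
    return picked


# toy inputs standing in for the two lists of tweets
toy_offensive = [f"off_{i}" for i in range(50)]
toy_non_offensive = [f"ok_{i}" for i in range(500)]

batch = sample_for_annotation(toy_offensive, toy_non_offensive)
print(len(batch), sum(text.startswith("off_") for text in batch))
```

Shuffling the batch also avoids giving annotators a hint about the predicted label through the position of a tweet in the sheet.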

Our three `.csv` files named `raphael.csv`, `bastien.csv` and `francois.csv` are in the `annotations` folder of the lab.
 

4. (2 points) Evaluate your inter-annotator agreement using Fleiss' Kappa.
    * statsmodels provides an easy-to-use [implementation](https://www.statsmodels.org/stable/generated/statsmodels.stats.inter_rater.fleiss_kappa.html#statsmodels.stats.inter_rater.fleiss_kappa).
    * What does the score mean? Are you doing a good job annotating the data and, if not, why?


In [42]:
# the data from our excel file
df_raphael = pd.read_excel('annotations.xlsx', 'raphael')
df_bastien = pd.read_excel('annotations.xlsx', 'bastien')
df_francois = pd.read_excel('annotations.xlsx', 'francois')

# we get a list with the labels for each of the 3 annotators : raphael, bastien and francois
labels_raphael = df_raphael['label'].tolist()
labels_bastien = df_bastien['label'].tolist()
labels_francois = df_francois['label'].tolist()

data = [list(item) for item in zip(labels_raphael, labels_bastien, labels_francois)]

In [48]:

# format the data for Fleiss' Kappa
arr, categories = aggregate_raters(data, n_cat=2)

# compute Fleiss' Kappa

kappa_value = fleiss_kappa(arr, method='fleiss')
print(f"Fleiss' kappa: {kappa_value}")

Fleiss' kappa: 0.6463306808134396
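To see what this score measures, Fleiss' kappa can be recomputed by hand: it compares the observed agreement between raters with the agreement expected by chance, $\kappa = (\bar{P} - \bar{P}_e) / (1 - \bar{P}_e)$. The sketch below is a plain-Python illustration on a toy count matrix, not a replacement for the statsmodels call. The function names are hypothetical, and the verbal scale is the commonly cited Landis and Koch rule of thumb, which should be read as a rough guide only.

```python
def fleiss_kappa_from_counts(counts):
    """counts[i][j] = number of raters who put item i in category j (same rater count per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # observed agreement: share of agreeing rater pairs per item, averaged over items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # chance agreement: sum of squared category proportions
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch verbal scale (rule of thumb)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# toy matrix: 4 items, 3 raters, 2 categories (made up for illustration)
toy_counts = [[3, 0], [2, 1], [1, 2], [0, 3]]
toy_kappa = fleiss_kappa_from_counts(toy_counts)
print(toy_kappa, landis_koch(toy_kappa))
print(landis_koch(0.646))
```

On that scale, a kappa of about 0.65 falls in the "substantial" band: the annotators agree well beyond chance, but a noticeable share of tweets is still labelled differently.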


5. **Bonus** Iterate on your annotation guideline with what you learned. Please send both versions in your report.


6. **Bonus** Evaluate the model on your data. Use a majority vote for labels (remove majority "can't tell") and compute the precision, recall, and F1-score.
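The majority vote described in this question could be sketched as follows. The `CANT_TELL` code and the toy annotation rows are assumptions made for illustration, since the encoding of the "Can't tell / not annotable" class in the annotation sheets is not shown here.

```python
from collections import Counter

CANT_TELL = "cant_tell"  # hypothetical code for the "Can't tell / not annotable" class


def majority_vote(annotations):
    """Return the majority label for one tweet, or None when it should be dropped.

    A tweet is dropped when no label has a strict majority or when the majority is CANT_TELL.
    """
    label, count = Counter(annotations).most_common(1)[0]
    if count <= len(annotations) / 2 or label == CANT_TELL:
        return None
    return label


# toy rows: one list of three annotator labels per tweet (made up for illustration)
toy_rows = [
    [1, 1, 0],
    [0, CANT_TELL, 0],
    [CANT_TELL, CANT_TELL, 1],
    [0, 1, CANT_TELL],
]

gold = [majority_vote(row) for row in toy_rows]
kept_indices = [i for i, label in enumerate(gold) if label is not None]
print(gold, kept_indices)
```

The kept indices can then be used to select both the gold labels and the matching model predictions before computing precision, recall, and F1-score.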