# NLP-LAB8

In [18]:
from datasets import load_dataset
from bertopic import BERTopic
import numpy as np
from umap import UMAP

## Introduction (1 point)

    1. (1 point) Pick one of the datasets between hate and offensive, and justify your choice. Remember that it is for a commercial application (there is a good and a bad answer).

In this context, the better choice for a commercial application is the "offensive" dataset.

This dataset focuses on identifying offensive language in tweets. Offensive language covers a broader range of content than hate speech, including swearing, insults, and derogatory remarks. With this dataset, a commercial application can focus on filtering and flagging offensive content in order to maintain a respectful and inclusive environment for its users.


## Evaluating the dataset (5 points)

    1. (1 point) Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

In [19]:
dataset = load_dataset('tweet_eval', 'offensive')

Found cached dataset tweet_eval (/home/amine/.cache/huggingface/datasets/tweet_eval/offensive/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)
100%|██████████| 3/3 [00:00<00:00, 1386.55it/s]


In [20]:
splits = dataset.keys()
print("Available splits:", splits)
print()

train_data = dataset['train']
test_data = dataset['test']
validation_data = dataset['validation']

num_offensive = sum(1 for label in train_data['label'] if label == 1)
num_non_offensive = len(train_data) - num_offensive

print("Number of offensive tweets:", num_offensive)
print("Number of non-offensive tweets:", num_non_offensive)
print()

total_examples = sum(len(dataset[s]) for s in dataset.keys())


splits = dataset.keys()
for split in splits:
    num_examples = len(dataset[split])
    print(f"{split} : {num_examples} -> {(num_examples / total_examples) * 100 }%")


Available splits: dict_keys(['train', 'test', 'validation'])

Number of offensive tweets: 3941
Number of non-offensive tweets: 7975

train : 11916 -> 84.51063829787235%
test : 860 -> 6.0992907801418434%
validation : 1324 -> 9.390070921985815%


In [21]:
offensive_tweets = []
non_offensive_tweets = []

for example in train_data:
    text = example['text']
    label = example['label']
    
    if label == 1:
        offensive_tweets.append(text)
    else:
        non_offensive_tweets.append(text)

print("Sample offensive tweets:")
for tweet in offensive_tweets[:5]:
    print(tweet)
    
print("\nSample non-offensive tweets:")
for tweet in non_offensive_tweets[:5]:
    print(tweet)

Sample offensive tweets:
@user Eight years the republicans denied obama’s picks. Breitbarters outrage is as phony as their fake president.
@user She has become a parody unto herself? She has certainly taken some heat for being such an....well idiot. Could be optic too  Who know with Liberals  They're all optics.  No substance
@user Your looking more like a plant #maga #walkaway
@user Antifa would burn a Conservatives house down and CNN would be there lighting the torches &amp; throwing gas on the flames.
@user They cite Jones being banned for violating Twitter's ToS. There are blue checkmarks spewing the same, if not worse, kind of shit. If you are going to play the anyone can get banned"" card. Shouldn't these people also receive bans and suspensions? #VerifiedHate""

Sample non-offensive tweets:
@user Bono... who cares. Soon people will understand that they gain nothing from following a phony celebrity. Become a Leader of your people instead or help and support your fellow countrymen

Within each class, several kinds of content appear:

For offensive tweets:
- Personal attacks, expressed through direct insults or insulting language.
- Discriminatory political opinions that help polarize political debate.
- Conspiracy theories and disinformation: unsupported and unfounded claims.

For non-offensive tweets:
- Informal conversations about neutral subjects such as sports.
- Opinions expressed in a neutral tone.

Overall, some tweets are offensive in subtler ways than others, and this will require care.
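
The counts printed above cover only the train split (3941 offensive out of 11916 tweets, roughly 33.1%). As a minimal sketch, a helper like the one below could report the class proportions of any split. The function name `class_proportions` and the toy labels are illustrative assumptions; in the notebook one would pass `dataset[split]['label']` instead.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: share of examples} for a list of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels standing in for dataset[split]['label']
# (1 = offensive, 0 = non-offensive).
toy_labels = [0, 0, 1, 0, 1, 0]
proportions = class_proportions(toy_labels)
print(proportions)
```

Comparing these proportions across train, validation, and test is a quick way to check that the class imbalance is similar in every split.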

    2. (3 points) Use BERTopic to extract the topics within the data, and the main topics within each class. 

In [22]:
import re

def preprocess_text(text):
    processed_text = re.sub(r'@user', '', text)
        
    return processed_text

In [23]:
preprocessed_texts = [preprocess_text(text) for text in train_data['text']]

In [24]:
from sentence_transformers import SentenceTransformer

umap_model = UMAP(random_state=42)
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

topic_model = BERTopic(umap_model=umap_model, embedding_model=sentence_model)

topics, _= topic_model.fit_transform(preprocessed_texts)

topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 3527 | -1_is_the_she_and | [is, the, she, and, to, he, you, of, that, are] | [ Nothing abusive should ever be done to an... |
| 1 | 0 | 1193 | 0_you_are_bitch_my | [you, are, bitch, my, love, so, shit, me, ass,... | [ Yes! We were. Back again. Thank you so mu... |
| 2 | 1 | 975 | 1_antifa_they_the_of | [antifa, they, the, of, and, to, left, violenc... | [ Yes that and ANTIFA., That would be Antifa... |
| 3 | 2 | 887 | 2_gun_control_guns_laws | [gun, control, guns, laws, the, to, about, in,... | [ Or we could have gun control, GUN CONTROL W... |
| 4 | 3 | 661 | 3_maga_trump_qanon_wwg1wga | [maga, trump, qanon, wwg1wga, walkaway, presid... | [ Thank you President Trump!! You are #MAGA!!!... |
| ... | ... | ... | ... | ... | ... |
| 80 | 79 | 13 | 79_levi_rosie_runs_long | [levi, rosie, runs, long, she, bitxh, konjam, ... | [ because im a broke ass bitxh :(((( and LOL E... |
| 81 | 80 | 12 | 80_her_mane_blonde_me | [her, mane, blonde, me, she, dead, hurts, drin... | [ His frustration began to subside as he felt ... |
| 82 | 81 | 11 | 81_she_tv_shame_5children | [she, tv, shame, 5children, iya, abortive, alp... | [ Iya Risi in my area didnt get the money. She... |
| 83 | 82 | 11 | 82_ect_represents_pro_all | [ect, represents, pro, all, traitor, choice, b... | [ Harley is the right choice for public educat... |


In [25]:
topic_freq = topic_model.get_topic_freq()

for i in range(len(topic_freq)):
    if i in topic_freq['Topic'].values:
        # Full list of (term, c-TF-IDF weight) pairs for this topic
        terms = topic_model.get_topic(i)
        print(f"Topic {i}: {terms}")

Topic 0: [('you', 0.03331236060163969), ('are', 0.026921978562256567), ('bitch', 0.016723565026909307), ('my', 0.016353993503892644), ('love', 0.014879897617191789), ('so', 0.014472967813977503), ('shit', 0.014283494713751921), ('me', 0.013565285987739793), ('ass', 0.013506062725468497), ('fuck', 0.013067996246574825)]
Topic 1: [('antifa', 0.05910450011161495), ('they', 0.011723008805178783), ('the', 0.011689573428195573), ('of', 0.010606622773460082), ('and', 0.010462920126402788), ('to', 0.009671640220384612), ('left', 0.009285333501965668), ('violence', 0.009114582612419117), ('fascist', 0.008843518014682998), ('like', 0.008767320286951345)]
Topic 2: [('gun', 0.05372058123253262), ('control', 0.049400412365636644), ('guns', 0.016262606249783637), ('laws', 0.01540265042564565), ('the', 0.010690881499212213), ('to', 0.009872345834673247), ('about', 0.009582369320340876), ('in', 0.009453942192125914), ('nra', 0.00917008038487963), ('that', 0.009047538851918134)]
Topic 3: [('maga', 0.07

In [26]:
for i in range(len(topic_freq)):
    if i in topic_freq['Topic'].values and  i != 0:
        terms = topic_model.get_topic(i)
        first_term = terms[0]
        print(f"Topic {i}: {first_term}")

Topic 1: ('antifa', 0.05910450011161495)
Topic 2: ('gun', 0.05372058123253262)
Topic 3: ('maga', 0.07230746045815623)
Topic 4: ('he', 0.06293771775756164)
Topic 5: ('liberals', 0.04095152262864591)
Topic 6: ('she', 0.06315531850027671)
Topic 7: ('kavanaugh', 0.07203403775876376)
Topic 8: ('brexit', 0.04325263723877844)
Topic 9: ('she', 0.042062628639383565)
Topic 10: ('he', 0.04156201315354384)
Topic 11: ('holder', 0.10546616047560078)
Topic 12: ('trudeau', 0.04523532336622797)
Topic 13: ('she', 0.08370686226207139)
Topic 14: ('pope', 0.08259559876581513)
Topic 15: ('moore', 0.03032751787066377)
Topic 16: ('blocked', 0.04457828869505638)
Topic 17: ('nfl', 0.14934076923458928)
Topic 18: ('serena', 0.06799787186153236)
Topic 19: ('chicago', 0.13270593382252277)
Topic 20: ('guilty', 0.03753113046082875)
Topic 21: ('women', 0.072724137841612)
Topic 22: ('good', 0.16491264164772712)
Topic 23: ('black', 0.04949058453585468)
Topic 24: ('user', 0.07852016125568982)
Topic 25: ('kerry', 0.061583

    3. (1 point) What do you think about the results? How do you think it could impact a model trained on these data?

The topics are broadly diverse, and many of them are controversial subjects.

Care is needed to distinguish a sensitive subject from offensive content.


The topics can affect the model by introducing an interpretation bias. The model risks forming its view from the data alone and judging new tweets by what it saw during training, and may therefore miss some offensive tweets.

Nuances of context will likely have to be taken into account for controversial statements.
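
One way to make this bias concern concrete is to measure, for each topic, the share of tweets labelled offensive. The sketch below is an illustration and not part of the original analysis. The helper name `offensive_rate_per_topic` and the toy inputs are assumptions; in the notebook, `topics` (returned by `fit_transform`) and `train_data['label']` are in the same order and could be passed directly.

```python
from collections import defaultdict

def offensive_rate_per_topic(topics, labels, min_count=1):
    """Map each topic id to (number of tweets, share labelled offensive)."""
    totals = defaultdict(int)
    offensive = defaultdict(int)
    for topic, label in zip(topics, labels):
        totals[topic] += 1
        offensive[topic] += int(label == 1)
    return {
        topic: (totals[topic], offensive[topic] / totals[topic])
        for topic in sorted(totals)
        if totals[topic] >= min_count
    }

# Toy data: topic ids as BERTopic would assign them (-1 = outliers)
# and the matching gold labels (1 = offensive).
toy_topics = [0, 0, 1, 1, 1, -1]
toy_labels = [1, 1, 0, 1, 0, 0]
rates = offensive_rate_per_topic(toy_topics, toy_labels)
for topic, (count, rate) in rates.items():
    print(f"Topic {topic}: {count} tweets, {rate:.0%} offensive")
```

A topic whose offensive rate is far from the overall rate is a candidate for topic-driven bias: a model could learn to associate the subject itself with the label.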

    4. Bonus By default, BERTopic extracts single keywords. Play with the model to extract bigrams or more. See if you can go deeper in your analysis.

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

umap_model = UMAP(random_state=42)
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

vectorizer_model = CountVectorizer(ngram_range=(5,5))

topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model, embedding_model=sentence_model)

topics, _= topic_model.fit_transform(preprocessed_texts)

topic_info = topic_model.get_topic_info()

In [28]:
topic_freq = topic_model.get_topic_freq()

for i in range(len(topic_freq)):
    if i in topic_freq['Topic'].values and  i != 0:
        terms = topic_model.get_topic(i)
        first_term = terms[0]
        print(f"Topic {i}: {first_term}")

Topic 1: ('kkk hoods beating up strangers', 0.0018539697359027949)
Topic 2: ('to talk about gun control', 0.001397272481626641)
Topic 3: ('zombie marketing sales retail style', 0.0013424828049194816)
Topic 4: ('he is yes he is', 0.0035961423777717714)
Topic 5: ('me give me give me', 0.0030617749254221536)
Topic 6: ('yes she is she is', 0.0034817723102017356)
Topic 7: ('not going to work and', 0.002902922280350557)
Topic 8: ('remain london cityoflondon news breakingnews', 0.0035216712285047473)
Topic 9: ('think she is one of', 0.003993839665817484)
Topic 10: ('as good as he is', 0.004990564337370127)
Topic 11: ('holder should be in prison', 0.013273369421415812)
Topic 12: ('people party of canada liberal', 0.0055265025398821615)
Topic 13: ('croatian president bikini photos worlds', 0.01744575163865596)
Topic 14: ('the homosexual rot in the', 0.009970731665704067)
Topic 15: ('michael moore and what he', 0.007196919097049552)
Topic 16: ('also don even know who', 0.012188330728874241)
Topi

Increasing the number of words per topic term makes each subject easier to understand. The larger the n-gram, the fewer isolated function words such as "you" or "she" appear, which carry little meaning as topic labels.

Words that are meaningless on their own become attached to a context, which makes the topics easier to read.

With (5,5)-grams, almost every topic is already clearly understandable.
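
To illustrate what `ngram_range=(5,5)` extracts, here is a small pure-Python sketch of word n-gram extraction. It is not scikit-learn's implementation. The function `word_ngrams` is hypothetical, and the regular expression is meant to mirror `CountVectorizer`'s default token pattern (tokens of at least two word characters), which is why fragments such as "don" appear in the topic terms above.

```python
import re

def word_ngrams(text, n):
    """Return the contiguous word n-grams of a lowercased text."""
    # Keep only tokens of two or more word characters, so "I" and the
    # "t" of "don't" are dropped.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("to talk about gun control", 2)
five_grams = word_ngrams("I also don't even know who", 5)
print(bigrams)
print(five_grams)
```

Texts shorter than `n` tokens yield no n-gram at all, so very short tweets contribute nothing to the topic representation when only 5-grams are used. A range such as `(1, 3)` would be a possible compromise.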

## Evaluate a model (6 points)

    1. (2 points) Evaluate their model on the test split of the dataset you picked, using precision, recall, and F1-score.

In [29]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

In [30]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from sklearn.metrics import precision_recall_fscore_support
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
import torch

task='offensive'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

# Download the label mapping (class id -> class name)
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
    labels = [row[1] for row in csvreader if len(row) > 1]

model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [31]:

tokenizer = AutoTokenizer.from_pretrained(MODEL)

text_test = test_data['text']
true_labels = test_data['label']
predictions = []

# Collected for question 2

# Offensive tweets predicted as non-offensive
misclassified_scores = []
misclassified_texts = []
misclassified_labels = []
misclassified_true_labels = []

# Non-offensive tweets predicted as offensive
misclassified_scores_NO = []
misclassified_texts_NO = []
misclassified_labels_NO = []
misclassified_true_labels_NO = []

for text, true_label in zip(text_test, true_labels):
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded_input)
    scores = torch.nn.functional.softmax(output.logits, dim=1).cpu().numpy()
    # Store the predicted class as a plain int rather than a tensor
    predicted = int(output.logits.argmax(dim=1).item())

    if predicted != true_label and true_label == 1:

        misclassified_scores.append(scores[0][predicted])
        misclassified_texts.append(text)
        misclassified_labels.append(predicted)
        misclassified_true_labels.append(true_label)

    if predicted != true_label and true_label == 0:

        misclassified_scores_NO.append(scores[0][predicted])
        misclassified_texts_NO.append(text)
        misclassified_labels_NO.append(predicted)
        misclassified_true_labels_NO.append(true_label)

    predictions.append(predicted)

precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average="weighted")

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Precision: 0.8556
Recall: 0.8593
F1-score: 0.8552
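
The figures above are weighted averages over both classes, which can hide weaker performance on the minority offensive class. As a hedged sketch, the per-class metrics can be computed by hand as below (the function name and toy labels are illustrative). With scikit-learn, passing `average=None` to `precision_recall_fscore_support` returns the same quantities per class.

```python
def per_class_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 3 offensive tweets, only 1 of them detected,
# plus 1 false alarm among 5 non-offensive tweets.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 1]
precision_off, recall_off, f1_off = per_class_metrics(toy_true, toy_pred, positive=1)
print(f"offensive: P={precision_off:.2f} R={recall_off:.2f} F1={f1_off:.2f}")
```

For a moderation use case, recall on the offensive class is usually the number to watch, since a missed offensive tweet and a false alarm do not have the same cost.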


    2. (2 points) Look for prediction failures. Extract the top 5 misclassified tweets (highest score in wrong class) for each class and discuss what could be wrong with the model.

##### Offensive

In [32]:
misclassified_scores = np.array(misclassified_scores)
misclassified_texts = np.array(misclassified_texts)
misclassified_labels = np.array(misclassified_labels)
misclassified_true_labels = np.array(misclassified_true_labels)

sorted_indices = np.argsort(misclassified_scores)[::-1]
#sorted_indices

for i in range(5):
    print(f"Text : {misclassified_texts[sorted_indices[i]]}")
    print(f"Label : {misclassified_labels[sorted_indices[i]]}")
    print(f"True Label : {misclassified_true_labels[sorted_indices[i]]}")
    print(f"Score : {misclassified_scores[sorted_indices[i]]}")

Text : #Liberals / #Democrats THIS is what you stand for. If not, then #WalkAway
Label : 0
True Label : 1
Score : 0.9338217973709106
Text : #Liberals Are Reaching Peak Desperation To Call On #PhillipRuddock To Talk With #Turnbull To Convince Him To Help with #WentworthVotes 18 Sept 2018  @user #Auspol #LNP #NSWpol @user  @user @user #LNPMemes
Label : 0
True Label : 1
Score : 0.9197562336921692
Text : #NoPasaran: Unity demo to oppose the far-right in #London – #antifa #Oct13 — Enough is Enough!
Label : 0
True Label : 1
Score : 0.9112220406532288
Text : #BREXIT deal HAS been reached - and will be unveiled at special summit in NOVEMBER, Has @user sold out the #UK to the eu??? She better have not or the @user are finished!! @user
Label : 0
True Label : 1
Score : 0.9081718325614929
Text : #America  ... tear down that #Wall! #tcot #partisanship #Trump #thewall #Borderwall #liberty #civilsociety #think #Conservatives #Democrats #Progressives #liberals #Independent #libertarians #GOP #DNC #Cri

These errors show how the model can struggle with the nuances of human language and context, particularly in political discussions where tone and intent can be subtle or open to interpretation.

For example: "#Liberals Are Reaching Peak Desperation To Call On #PhillipRuddock To Talk With #Turnbull To Convince Him To Help with #WentworthVotes 18 Sept 2018 @user #Auspol #LNP #NSWpol @user @user @user #LNPMemes"

This sentence reads as a critical political statement, but it contains no explicitly offensive language. The model may struggle to grasp the complexity of such a statement.

##### Non Offensive

In [33]:
misclassified_scores_NO = np.array(misclassified_scores_NO)
misclassified_texts_NO = np.array(misclassified_texts_NO)
misclassified_labels_NO = np.array(misclassified_labels_NO)
misclassified_true_labels_NO = np.array(misclassified_true_labels_NO)

sorted_indices = np.argsort(misclassified_scores_NO)[::-1]
#sorted_indices

for i in range(5):
    print(f"Text : {misclassified_texts_NO[sorted_indices[i]]}")
    print(f"Label : {misclassified_labels_NO[sorted_indices[i]]}")
    print(f"True Label : {misclassified_true_labels_NO[sorted_indices[i]]}")
    print(f"Score : {misclassified_scores_NO[sorted_indices[i]]}")

Text : Are you fucking serious?
Label : 1
True Label : 0
Score : 0.9010689854621887
Text : @user I guess that’s where swamp ass originated
Label : 1
True Label : 0
Score : 0.8939997553825378
Text : An American Tail really is one of the most underrated animations ever ever ever. Fuck I cried in this scene
Label : 1
True Label : 0
Score : 0.8576400279998779
Text : @user @user Bull crap. You know she doesn't care.  She is trying to get attention for her Presidential run.  Do you see any other Senator giving nonsense?  Nope.
Label : 1
True Label : 0
Score : 0.8492955565452576
Text : #Room25 is actually incredible, Noname is the shit, always has been,  and I’m seein her in like 5 days in Melbourne. Life is good. Have a nice day.
Label : 1
True Label : 0
Score : 0.8418040871620178


These errors highlight some of the model's limits in interpreting human language, notably regarding context, slang, and nuances in the use of swear words.

For example: "@user I guess that’s where swamp ass originated" 

This sentence uses a slang expression that can be considered vulgar in some contexts, but here it is probably used humorously. The model misread the context.
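
The two cells above repeat the same ranking logic and assume that at least five errors exist in each class. A minimal sketch of a reusable helper is given below. The function name `top_confident_errors` and the toy inputs are assumptions for illustration; `pred_scores` is meant to hold the probability of the predicted (wrong) class, as in the notebook.

```python
def top_confident_errors(texts, true_labels, pred_labels, pred_scores, true_class, k=5):
    """Return up to k (score, text) pairs of `true_class` examples that were
    misclassified, ordered from most to least confident wrong prediction."""
    errors = [
        (score, text)
        for text, true, pred, score in zip(texts, true_labels, pred_labels, pred_scores)
        if true == true_class and pred != true
    ]
    errors.sort(key=lambda pair: pair[0], reverse=True)
    return errors[:k]

# Toy predictions: three offensive tweets (two missed), one non-offensive tweet.
toy_texts = ["a", "b", "c", "d"]
toy_true = [1, 1, 0, 1]
toy_pred = [0, 0, 1, 1]
toy_scores = [0.6, 0.9, 0.8, 0.7]
missed_offensive = top_confident_errors(toy_texts, toy_true, toy_pred, toy_scores, true_class=1)
print(missed_offensive)
```

Slicing with `[:k]` returns fewer items when fewer than `k` errors exist, instead of raising an `IndexError`.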

    3. (2 points) Extract the top 10 tweets your model is most confident about in the target class (offensive or hateful), the top 10 in the neutral class, and the top 10 your model is most uncertain about. Do you believe the model is doing a great job?

In [34]:
import json
import pandas as pd
with open("tweets.json", "r") as f:
    tweets = json.load(f)

df = pd.DataFrame(tweets)

predictions = []
scores = []
for text in df["text"]:
    encoded_input = tokenizer(text, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded_input)
    probabilities = torch.nn.functional.softmax(output.logits, dim=1)[0]
    # Store plain Python values so the DataFrame columns are numeric
    predicted = int(probabilities.argmax().item())
    predictions.append(predicted)
    scores.append(probabilities[predicted].item())

df["prediction"] = predictions
df["score"] = scores

top_10_offensive = df[df["prediction"] == 1].sort_values("score", ascending=False).head(10)
top_10_neutral = df[df["prediction"] == 0].sort_values("score", ascending=False).head(10)

# Get top 10 uncertain predictions
df["uncertainty"] = 1 - df["score"]
top_10_uncertain = df.sort_values("uncertainty", ascending=False).head(10)

In [35]:
top_10_offensive['text']

2198    don’t you suck his dick or something ? ur fuck...
7686    i genuinely feel sick to my stomach and i cann...
4042                                You're a little bitch
9867                   Bitch you raggedy af phony ass hoe
9141                        You're a fucking racist moron
3849                     Shut the fuck up you damn monkey
8073               its wild how fucking stupid people are
1929    Shut the hell up nobody give a shit about your...
2206    You're a fucking dumbass that think's he's bad...
7800    This dude is a total crybaby man all he does i...
Name: text, dtype: object

In [36]:
top_10_neutral['text']

7811                   Thank you beautiful you are too!!🥰
8599                                Thank you for this! 💜
1109                  Thank you for your supporttt ! 😍❤️✨
1416    Oh, that would be great. I will be waiting, th...
8684     Aww thank you, Crystal - that means the world. 💗
4249           No problem! Thanks for waiting too 🙆🏻‍♀️❤️
6506    Aweeee happy 1 year!!! Hope to see you stream ...
4719         Awe I am so so glad you are getting this. ❤️
7242                                       Thank you!! 😭💕
1345    Thanks for your kind words ❤️ our team are all...
Name: text, dtype: object

In [37]:
top_10_uncertain['text']

5174                             I'm lucky not literate 🙃
3490                     THE PAIN IN MY ARM WORTHS SO BAD
7104    its crazy to me that people think fictional ch...
4359                                 ok I’m gonna cry now
2689            someone pinch me i feel like i’m dreaming
6390                                      MY PHONES DYING
2897    Why didn’t anyone tell me one piece is also sa...
6689    It’s just become a case of ‘this doctor said t...
1907                      This is just furry and 100 gecs
2662                           this is a big brain moment
Name: text, dtype: object
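
In the ranking above, `score` is the probability of the predicted class, so with two classes it lies between 0.5 and 1, and `1 - score` is largest when the model hesitates between the two labels. The sketch below illustrates this ordering and an alternative measure, binary entropy. The helper names and toy values are assumptions, not part of the original notebook.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def most_uncertain(texts, max_probabilities, k=10):
    """Return the k texts whose predicted-class probability is lowest."""
    ranked = sorted(zip(max_probabilities, texts), key=lambda pair: pair[0])
    return [text for _, text in ranked[:k]]

# Toy scores: probability assigned to the predicted class.
toy_texts = ["clear insult", "thank you", "ambiguous", "borderline"]
toy_scores = [0.97, 0.99, 0.51, 0.62]
uncertain = most_uncertain(toy_texts, toy_scores, k=2)
print(uncertain)
print(binary_entropy(0.5))
```

Both measures give the same ranking for a binary classifier; entropy mainly becomes useful when there are more than two classes.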

We can make the following observations:

- The model is most confident about offensive tweets when the keywords are insults and the tweet takes the form of a direct, targeted insult.
The closer a tweet is to the pattern [subject] + [verb] + [insult], the more effective the model is.

- The model is most confident about non-offensive tweets when the keywords are positive, complimentary adjectives.

- The most uncertain tweets are sentences with as few qualifying adjectives as possible.

From this we can infer that the model relies on recognizing individual words more than on the meaning of expressions.

As a consequence, the model could be misled by swear words used in a positive way or by complimentary adjectives used as insults.

    4. Bonus Use SHAP on the provided tweets, or manually written texts, to see if you can find topics on which the model is biased.

For this question, running SHAP on all the tweets crashes the kernel too quickly. For a more readable visualization, we only use the first 500 tweets.

In [None]:
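import shap

def f(x):
    # Tokenize the batch, padding only to the longest tweet instead of 512 tokens
    encoded = tokenizer(list(x), padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        # The tokenizer's attention mask already ignores padding tokens
        outputs = model(**encoded).logits.cpu().numpy()
    # Normalize per row so that each tweet gets its own class probabilities
    scores = softmax(outputs, axis=1)
    return scores

explainer = shap.Explainer(f, tokenizer)

# SHAP expects a list of strings, so use the text column of the DataFrame
shap_values = explainer(df["text"].tolist()[:500])

shap.plots.text(shap_values)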

We did not have time to complete this run properly.

    5. Bonus Train a naive Bayes model on the data, and compare its results with this model.

## Annotate data (7 points)

    1. (1 point) Extract about 100 tweets containing at least 20% of your target class (offensive/hateful), from the 10K tweets provided. You can use the pretrained model to help you find tweets in the target class.

We will use the DataFrame built in the previous question to select our tweets.

For our own annotation sample, we avoid predictions at the extremes as much as possible, in order to probe ambiguous cases.

To do so, we keep scores between 0.50 and 0.80.

In [39]:
import numpy as np

offensive_tweets = df[(df["prediction"] == 1) & (df["score"] >= 0.50) & (df["score"] <= 0.80)].sample(20)

non_offensive_tweets = df[(df["prediction"] == 0) & (df["score"] >= 0.50) & (df["score"] <= 0.80)].sample(80)

sampled_tweets = pd.concat([offensive_tweets, non_offensive_tweets])

print(f"Extracted {len(sampled_tweets)} tweets with at least 20% from the target class.")

Extracted 100 tweets with at least 20% from the target class.
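
The sampling above is not seeded, so rerunning the cell selects different tweets. A reproducible variant can be sketched in plain Python as below. The function `sample_for_annotation`, its parameters, and the toy id pools are illustrative assumptions. With pandas, passing `random_state=...` to `DataFrame.sample` should achieve the same effect.

```python
import random

def sample_for_annotation(offensive_ids, neutral_ids, n_total=100, target_share=0.2, seed=42):
    """Draw n_total ids, with target_share of them from the offensive pool."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    n_offensive = int(round(n_total * target_share))
    chosen = rng.sample(offensive_ids, n_offensive)
    chosen += rng.sample(neutral_ids, n_total - n_offensive)
    rng.shuffle(chosen)  # mix classes so annotators cannot guess from position
    return chosen

# Toy pools of row ids standing in for the filtered DataFrame indexes.
toy_offensive = list(range(0, 50))
toy_neutral = list(range(100, 400))
sample_ids = sample_for_annotation(toy_offensive, toy_neutral)
print(len(sample_ids), sum(1 for i in sample_ids if i < 50))
```

Fixing the seed keeps the annotated file and the notebook in sync if the extraction cell has to be rerun.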


In [40]:
sampled_tweets['text'][:5]

1804                    THE CLEAN ASS FLIP I CANT DO THIS
5984    If that is how crimes are gonna be dealt with ...
7968    She’s an entire menace to the Djokovic family ...
3289                      Gotta learn how to let shit go.
6450    Spite fucked me up. Learning to use positive m...
Name: text, dtype: object

    2. (3 points) Altogether, write down an annotation guideline (which should be at least 2/3 of a page long).

Ref Annotation_guideline.pdf

    3. (1 point) Every person in your group is going to annotate these tweets separately. So if you are 3, annotate them 3 times. 

Before exporting, we shuffle the rows.

In [41]:
from sklearn.utils import shuffle

sampled_tweets = shuffle(sampled_tweets)
sample_tweets_text = sampled_tweets['text']

In [42]:
sample_tweets_text.to_csv('sample_tweets.csv', index=False)

The data are exported to CSV and then placed in an Excel file to be annotated (Ref sample_tweets_annoted.xlsx).

    4. (2 points) Evaluate your inter-annotator agreement using Fleiss Kappa. 

In [43]:
data = pd.read_excel('sample_tweets_annoted.xlsx', sheet_name=None)

ratings = []
for rater, data_ in data.items():
    rater_ratings = data_['label'].tolist()
    ratings.append(rater_ratings)
ratings = np.array(ratings).transpose()
print(len(ratings))


100


In [44]:
if np.isnan(ratings).any():
    print("The array contains a NaN.")
else:
    print("The array does not contain any NaN.")

The array does not contain any NaN.


In [45]:
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


# Convert (items x raters) labels into an (items x categories) count table
raters, _ = aggregate_raters(ratings)
kappa_score = fleiss_kappa(raters)

print(f"The Fleiss' Kappa score is: {kappa_score}")

The Fleiss' Kappa score is: 0.8519615099925979


This suggests a high level of agreement between the annotators on our dataset. A value in this range is generally interpreted as strong, reliable agreement.

However, although this score indicates strong agreement, it does not guarantee that the annotations are correct.
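
To make the statistic less of a black box, here is a hedged pure-Python sketch of the standard Fleiss' kappa formula, $\kappa = (\bar{P} - \bar{P}_e) / (1 - \bar{P}_e)$. It is not the statsmodels implementation. The function name and the toy ratings are illustrative, and every item is assumed to have the same number of raters.

```python
from collections import Counter

def fleiss_kappa_from_ratings(ratings):
    """ratings: list of items, each a list with one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    p_cats = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_expected = sum(p ** 2 for p in p_cats)

    # Undefined (division by zero) if every rating falls in a single category.
    return (p_bar - p_expected) / (1 - p_expected)

# Toy example: 3 raters, 4 tweets, perfect agreement on two classes.
perfect = [[0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
# Toy example: each tweet has one dissenting rater.
mixed = [[0, 0, 1], [1, 1, 0]]
kappa_perfect = fleiss_kappa_from_ratings(perfect)
kappa_mixed = fleiss_kappa_from_ratings(mixed)
print(kappa_perfect, kappa_mixed)
```

The second example shows that kappa can be negative: when the observed agreement is below what the class proportions alone would produce, the score drops below zero.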

    5. (Bonus) Iterate on your annotation guideline with what you learned. Please send both version in your report.

Ref Annotation_guideline_2.pdf

    6. Evaluate the model on your data. Use a majority vote for labels (remove majority "can't tell") and compute the precision, recall, and F1-score.
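
The majority-vote step could be sketched as follows. This is a minimal illustration under stated assumptions: the code `-1` for "can't tell" is hypothetical (the encoding used in the annotation file is not shown in the notebook), ties are dropped, and the helper names are invented for the example.

```python
from collections import Counter

CANT_TELL = -1  # hypothetical code for "can't tell"; adapt to the annotation file

def majority_label(votes, cant_tell=CANT_TELL):
    """Return the majority label, or None for a tie or a "can't tell" majority."""
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # no strict majority
    if top_label == cant_tell:
        return None
    return top_label

def gold_labels(rows):
    """Return (row index, majority label) pairs for rows that keep a label."""
    kept = []
    for index, votes in enumerate(rows):
        label = majority_label(votes)
        if label is not None:
            kept.append((index, label))
    return kept

# Toy annotations: one row per tweet, one vote per annotator.
toy_rows = [[1, 1, 0], [-1, -1, 1], [0, 1, -1], [0, 0, 0]]
gold = gold_labels(toy_rows)
print(gold)
```

The retained indexes can then be used to select the matching model predictions before computing precision, recall, and F1-score against the majority labels.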