# OLID-BR - Iteration 3

In this notebook, we will read the annotated data from an S3 bucket, build the OLID-BR dataset, and save it back to an S3 bucket in JSON and CSV formats.

The annotated data is stored in the Label Studio JSON format. See [Label Studio Documentation — Export Annotations](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks) for more details.

## Imports

In [1]:
import sys
from pathlib import Path

# Make the project root (two levels up) importable so that `src` resolves.
project_root = str(Path(".").absolute().parent.parent)
if project_root not in sys.path:
    sys.path.append(project_root)

In [2]:
from dotenv import load_dotenv

# Initialize the env vars
load_dotenv("../../.env")

True

In [3]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

from irrCAC.raw import CAC
from src.data_classes import Annotator, LabelStrategy, Metadata
from src.dataset import Dataset
from src.labeling.assignment import majority_vote, at_least_one, all_labeled_spans
from src.labeling.metrics import (
    percent_agreement,
    disagreement_by_raters,
    disagreement_score
)

from src.s3 import Bucket
from src.settings import AppSettings
from src.utils import (
    read_yaml,
    check_words,
    prepare_data_to_px,
    dict_serialize_date,
    get_lead_time,
    get_annotations_by_rater
)

import nltk
from nltk.metrics import agreement
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import masi_distance, jaccard_distance

# Plotly
import plotly.express as px
import plotly.io as pio
from plotly.graph_objs import Layout

pio.templates.default = "plotly_dark"

layout = Layout(
    xaxis={
        "type": "category",
        "showgrid": False,
        "zeroline": False,
    },
    yaxis={
        "showgrid": False,
        "zeroline": False
    },
    paper_bgcolor="rgba(0,0,0,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    font={"color": "rgb(180,180,180)"},
)

args = AppSettings()

## Load annotators

In the next cells, we will read the annotators' data and create a list of `Annotator` objects.

This list will be used to attach annotator information as metadata to each text.

In [4]:
annotators = read_yaml("../../properties/annotators.yaml")
annotators = [Annotator(**a) for a in annotators]
annotators

[Annotator(id=None, annotator_id=1, gender='Male', year_of_birth=1993, education_level="Bachelor's degree", annotator_type='Researcher'),
 Annotator(id=None, annotator_id=32, gender='Female', year_of_birth=1991, education_level="Bachelor's degree", annotator_type='Volunteer'),
 Annotator(id=None, annotator_id=127, gender='Female', year_of_birth=1975, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=128, gender='Female', year_of_birth=1992, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=126, gender='Male', year_of_birth=1997, education_level='High school', annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=260, gender='Female', year_of_birth=2001, education_level='High school', annotator_type='Contract worker')]

## Load data

In the next cells, we will read the labeled data from the S3 bucket and merge all annotations into a single collection.

In [5]:
bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)

In [6]:
files = [
    "raw/labeled/phase3/olid-br-3-126.json",
    "raw/labeled/phase3/olid-br-3-127.json",
    "raw/labeled/phase3/olid-br-3-260.json"
]

Because each annotator's data is stored in a separate file, we need to merge the annotations so that every text appears once with the annotations from all of its raters.

In [7]:
data = {}

for file in files:
    print(f"Reading {file}")
    temp = bucket.download_json(key=file)

    lead_time = get_lead_time(temp)
    print(f"{file} >> Mean: {np.mean(lead_time):.0f}s | Std: {np.std(lead_time):.0f}s")

    for row in temp:
        if row["data"]["text"] not in data.keys():
            data[row["data"]["text"]] = row
        else:
            data[row["data"]["text"]]["annotations"].extend(row["annotations"])
    
    print()

data = [v for _, v in data.items()]

print(f"Count: {len(data)}")

Reading raw/labeled/phase3/olid-br-3-126.json
raw/labeled/phase3/olid-br-3-126.json >> Mean: 77s | Std: 486s

Reading raw/labeled/phase3/olid-br-3-127.json
raw/labeled/phase3/olid-br-3-127.json >> Mean: 154s | Std: 1197s

Reading raw/labeled/phase3/olid-br-3-260.json
raw/labeled/phase3/olid-br-3-260.json >> Mean: 871s | Std: 10081s

Count: 1851
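The merge step above can be sketched on toy data. This is a minimal, self-contained illustration and not the notebook's own helper: `merge_by_text` and the toy exports are hypothetical, and only the `data.text` and `annotations` keys are taken from the cell above.

```python
from typing import Any, Dict, List


def merge_by_text(exports: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Merge per-annotator exports into one task per unique text."""
    merged: Dict[str, Dict[str, Any]] = {}
    for tasks in exports:
        for task in tasks:
            text = task["data"]["text"]
            if text not in merged:
                # Copy the annotations list so the input export is not mutated.
                merged[text] = {**task, "annotations": list(task["annotations"])}
            else:
                merged[text]["annotations"].extend(task["annotations"])
    return list(merged.values())


# Toy exports (hypothetical data, not taken from OLID-BR).
export_a = [
    {"data": {"text": "t1"}, "annotations": [{"rater": 126}]},
    {"data": {"text": "t2"}, "annotations": [{"rater": 126}]},
]
export_b = [{"data": {"text": "t1"}, "annotations": [{"rater": 127}]}]

merged = merge_by_text([export_a, export_b])
summary = [(t["data"]["text"], len(t["annotations"])) for t in merged]
print(summary)
```

Unlike the cell above, this sketch copies each `annotations` list before extending it, so the downloaded exports stay unchanged if they are reused later.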


Check if all texts have the same number of annotations.

In [8]:
annotations_count = {}

for item in data:
    count = len(item["annotations"])
    if count not in annotations_count.keys():
        annotations_count[count] = 1
    else:
        annotations_count[count] += 1

annotations_count

{3: 978, 1: 701, 2: 172}
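The three buckets add up to the 1851 merged texts (978 + 701 + 172), so no text was lost in the merge. As a side note, the same tally can be written more compactly with `collections.Counter`; the sketch below uses hypothetical toy tasks rather than the real `data` list.

```python
from collections import Counter

# Toy tasks (hypothetical); only the length of "annotations" matters here.
tasks = [
    {"annotations": ["a", "b", "c"]},
    {"annotations": ["a"]},
    {"annotations": ["a", "b", "c"]},
    {"annotations": ["a", "b"]},
]

# Map "number of annotations" -> "number of texts with that many annotations".
annotations_count = Counter(len(task["annotations"]) for task in tasks)
print(dict(annotations_count))
```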

## Build dataset

In [9]:
dataset = Dataset(
    annotators=annotators,
    toxicity_threshold=args.PERSPECTIVE_THRESHOLD
)

raw_texts = dataset.get_raw_texts(data)

We will keep only the texts that were annotated by all three annotators.

In [10]:
raw_texts = [text for text in raw_texts if len(text.annotations) == 3]

print(f"{len(raw_texts)} raw texts with 3 annotations.")

978 raw texts with 3 annotations.


## Inter-Rater Reliability (IRR) analysis

Also known as inter-rater agreement (IRA) or concordance.

In the next cells, we will perform an agreement analysis to check if the annotations are consistent.

See [Inter-Rater Reliability - OLID-BR](https://dougtrajano.github.io/olid-br/annotation/inter-rater-reliability.html) for more details.

### `is_offensive`

In [11]:
is_offensive = pd.DataFrame(dataset.get_annotations(raw_texts, "is_offensive"))
is_offensive.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,...,NOT,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF
127,OFF,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,...,OFF,OFF,OFF,OFF,NOT,OFF,NOT,OFF,OFF,OFF
260,OFF,OFF,OFF,OFF,OFF,OFF,NOT,OFF,NOT,OFF,...,OFF,OFF,NOT,OFF,OFF,OFF,OFF,OFF,OFF,OFF


In [12]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_offensive),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_offensive distribution")

fig.update_layout(layout)

fig.show()

In [13]:
cac = CAC(is_offensive)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: ['NOT', 'OFF'], Weights: "identity">
Percent agreement: 0.7137
Krippendorff's alpha: 0.2064
Gwet's AC1: 0.7487
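The percent agreement printed above is consistent with counting a text as "agreed" only when all three raters chose the same label: the per-class counts printed in the next two cells give 681 unanimous `OFF` texts and 17 unanimous `NOT` texts, and (681 + 17) / 978 ≈ 0.7137. The sketch below illustrates that definition. Whether `src.labeling.metrics.percent_agreement` is implemented this way is an assumption, and `unanimous_agreement` is a hypothetical name.

```python
from typing import Hashable, Sequence


def unanimous_agreement(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fraction of items on which every rater chose the same label."""
    return sum(len(set(row)) == 1 for row in ratings) / len(ratings)


# Toy ratings (hypothetical): one row per text, one column per rater.
toy = [
    ("OFF", "OFF", "OFF"),
    ("OFF", "NOT", "OFF"),
    ("NOT", "NOT", "NOT"),
    ("OFF", "OFF", "NOT"),
]
toy_agreement = unanimous_agreement(toy)
print(toy_agreement)

# Arithmetic check against the counts printed in this notebook.
reported = round((681 + 17) / 978, 4)
print(reported)
```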


In [14]:
for k, v in disagreement_by_raters(cac.ratings, "OFF").items():
    print(f"{v} texts was annotated by {k} rater(s) as offensive.")

print(f"Disagreement score (class OFF): {disagreement_score(cac.ratings, 'OFF'):.4f}")

79 texts was annotated by 1 rater(s) as offensive.
201 texts was annotated by 2 rater(s) as offensive.
681 texts was annotated by 3 rater(s) as offensive.
Disagreement score (class OFF): 0.2914


In [15]:
for k, v in disagreement_by_raters(cac.ratings, "NOT").items():
    print(f"{v} texts was annotated by {k} rater(s) as non-offensive.")

print(f"Disagreement score (class NOT): {disagreement_score(cac.ratings, 'NOT'):.4f}")

201 texts was annotated by 1 rater(s) as non-offensive.
79 texts was annotated by 2 rater(s) as non-offensive.
17 texts was annotated by 3 rater(s) as non-offensive.
Disagreement score (class NOT): 0.9428
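The two scores above match a simple reading of the counts: among the texts that received a given label from at least one rater, the share on which the raters were not unanimous. For `OFF` this is (79 + 201) / (79 + 201 + 681) ≈ 0.2914, and for `NOT` it is (201 + 79) / (201 + 79 + 17) ≈ 0.9428. The sketch below encodes that reading. It is an assumption about what `disagreement_score` computes, not its actual source, and `disagreement_from_counts` is a hypothetical helper.

```python
from typing import Dict


def disagreement_from_counts(counts: Dict[int, int], n_raters: int = 3) -> float:
    """Share of labelled texts on which the raters were not unanimous.

    `counts` maps "number of raters who chose the label" to "number of texts".
    """
    labelled = sum(counts.get(k, 0) for k in range(1, n_raters + 1))
    if labelled == 0:
        # No rater ever used the label, so there is nothing to disagree on.
        return 0.0
    unanimous = counts.get(n_raters, 0)
    return (labelled - unanimous) / labelled


off_score = disagreement_from_counts({1: 79, 2: 201, 3: 681})
not_score = disagreement_from_counts({1: 201, 2: 79, 3: 17})
print(f"{off_score:.4f} {not_score:.4f}")
```

Under this reading, a score of 1.0 means the label was never assigned unanimously, which is a stricter signal than a low percent agreement.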


### `is_targeted`

In [16]:
is_targeted = pd.DataFrame(dataset.get_annotations(raw_texts, "is_targeted"))
is_targeted.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,TIN,UNT,TIN,TIN,UNT,UNT,UNT,TIN,TIN,UNT,...,UNT,TIN,TIN,TIN,TIN,UNT,UNT,TIN,TIN,TIN
127,TIN,TIN,TIN,UNT,TIN,TIN,UNT,TIN,TIN,TIN,...,TIN,TIN,TIN,TIN,UNT,TIN,UNT,TIN,TIN,TIN
260,TIN,TIN,TIN,TIN,TIN,TIN,UNT,TIN,UNT,TIN,...,TIN,TIN,UNT,TIN,TIN,TIN,UNT,TIN,TIN,TIN


In [17]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_targeted),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_targeted distribution")

fig.update_layout(layout)

fig.show()

In [18]:
cac = CAC(is_targeted)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: ['TIN', 'UNT'], Weights: "identity">
Percent agreement: 0.4397
Krippendorff's alpha: 0.1415
Gwet's AC1: 0.3389


In [19]:
for k, v in disagreement_by_raters(cac.ratings, "TIN").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted.")

print(f"Disagreement score (class TIN): {disagreement_score(cac.ratings, 'TIN'):.4f}")

168 texts was annotated by 1 rater(s) as targeted.
380 texts was annotated by 2 rater(s) as targeted.
356 texts was annotated by 3 rater(s) as targeted.
Disagreement score (class TIN): 0.6062


In [20]:
for k, v in disagreement_by_raters(cac.ratings, "UNT").items():
    print(f"{v} texts was annotated by {k} rater(s) as untargeted.")

print(f"Disagreement score (class UNT): {disagreement_score(cac.ratings, 'UNT'):.4f}")

380 texts was annotated by 1 rater(s) as untargeted.
168 texts was annotated by 2 rater(s) as untargeted.
74 texts was annotated by 3 rater(s) as untargeted.
Disagreement score (class UNT): 0.8810


### `targeted_type`

In [21]:
targeted_type = pd.DataFrame(dataset.get_annotations(raw_texts, "targeted_type"))
targeted_type.fillna(np.nan, inplace=True)
targeted_type.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,GRP,,GRP,IND,,,,OTH,OTH,,...,,OTH,IND,IND,IND,,,IND,IND,GRP
127,IND,IND,GRP,,OTH,IND,,OTH,GRP,IND,...,GRP,IND,GRP,IND,,OTH,,IND,IND,OTH
260,IND,IND,GRP,IND,OTH,IND,,OTH,,IND,...,GRP,OTH,,IND,IND,OTH,,IND,IND,GRP


In [22]:
fig = px.bar(
    data_frame=prepare_data_to_px(targeted_type),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="targeted_type distribution")

fig.update_layout(layout)

fig.show()

In [23]:
cac = CAC(targeted_type)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 904, Raters: 3, Categories: ['GRP', 'IND', 'OTH'], Weights: "identity">
Percent agreement: 0.2611
Krippendorff's alpha: 0.5031
Gwet's AC1: 0.6303




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [24]:
for k, v in disagreement_by_raters(cac.ratings, "IND").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to an individual.")

print(f"Disagreement score (class IND): {disagreement_score(cac.ratings, 'IND'):.4f}")

230 texts was annotated by 1 rater(s) as targeted to an individual.
228 texts was annotated by 2 rater(s) as targeted to an individual.
190 texts was annotated by 3 rater(s) as targeted to an individual.
Disagreement score (class IND): 0.7068


In [25]:
for k, v in disagreement_by_raters(cac.ratings, "GRP").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to a group.")

print(f"Disagreement score (class GRP): {disagreement_score(cac.ratings, 'GRP'):.4f}")

147 texts was annotated by 1 rater(s) as targeted to a group.
78 texts was annotated by 2 rater(s) as targeted to a group.
30 texts was annotated by 3 rater(s) as targeted to a group.
Disagreement score (class GRP): 0.8824


In [26]:
for k, v in disagreement_by_raters(cac.ratings, "OTH").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to other.")

print(f"Disagreement score (class OTH): {disagreement_score(cac.ratings, 'OTH'):.4f}")

167 texts was annotated by 1 rater(s) as targeted to other.
66 texts was annotated by 2 rater(s) as targeted to other.
16 texts was annotated by 3 rater(s) as targeted to other.
Disagreement score (class OTH): 0.9357


### `toxic_spans`

In [27]:
toxic_spans = pd.DataFrame(dataset.get_annotations(raw_texts, "toxic_spans"))
toxic_spans.head()

| | 126 | 127 | 260 |
|---|---|---|---|
| 0 | [42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 5... | [43, 44, 45, 46, 47, 48] | [52, 53, 54, 55, 56, 57] |
| 1 | [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 4... | [4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37,... | [5, 6, 7, 8, 9, 10, 11, 34, 35, 36, 37, 38, 39... |
| 2 | [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 195, 196, 19... | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 211, ... | [211, 212, 213, 214, 215, 216, 217, 218] |
| 3 | [15, 16, 17, 20, 21, 22, 23, 24, 25] | [] | [15, 16, 17, 20, 21, 22, 23, 24] |
| 4 | [0, 1, 2, 3, 4, 5, 6, 7] | [0, 1, 2, 3, 4, 5, 6] | [0, 1, 2, 3, 4, 5, 6] |


In [28]:
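from typing import List, Optional

def len_toxic_spans(toxic_spans: List[int]) -> Optional[int]:
    # Return None for an empty list so that describe() ignores texts without toxic spans.
    return None if len(toxic_spans) == 0 else len(toxic_spans)

# Number of annotated positions per text, summarized per annotator.
pd.DataFrame([toxic_spans[col].apply(len_toxic_spans) for col in toxic_spans.columns]).transpose().describe()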

| | 126 | 127 | 260 |
|---|---|---|---|
| count | 746.0 | 682.0 | 659.0 |
| mean | 11.529491 | 12.464809 | 9.863429 |
| std | 9.673866 | 8.597002 | 9.562124 |
| min | 2.0 | 2.0 | 1.0 |
| 25% | 6.0 | 7.0 | 5.0 |
| 50% | 8.0 | 10.0 | 7.0 |
| 75% | 15.0 | 15.75 | 12.0 |
| max | 148.0 | 58.0 | 154.0 |


In [29]:
fig = px.bar(
    data_frame=prepare_data_to_px(pd.DataFrame([toxic_spans[col].apply(lambda x: len(x) > 0) for col in toxic_spans.columns]).transpose()),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="toxic_spans distribution")

fig.update_layout(layout)

fig.show()

### `health`

In [30]:
health = pd.DataFrame(dataset.get_annotations(raw_texts, "health"))
health.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [31]:
fig = px.bar(
    data_frame=prepare_data_to_px(health),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Health distribution")

fig.update_layout(layout)

fig.show()

In [32]:
cac = CAC(health)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9703
Krippendorff's alpha: 0.1847
Gwet's AC1: 0.9797


In [33]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as health.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

22 texts was annotated by 1 rater(s) as health.
7 texts was annotated by 2 rater(s) as health.
0 texts was annotated by 3 rater(s) as health.
Disagreement score (class True): 1.0000
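The `health` label shows the largest gap between coefficients: 97% percent agreement and a Gwet's AC1 of 0.9797, but a Krippendorff's alpha of only 0.1847. The gap comes from the chance-agreement term, which the two coefficients estimate differently when one class is rare. The sketch below recomputes both from the counts printed above (22, 7, and 0 texts marked by one, two, and three raters, hence 949 marked by none). It uses the textbook multi-rater formulas for complete data with identity weights; that `irrCAC` computes exactly this is an assumption, although for these counts the sketch lands on the same four-decimal values. `agreement_from_counts` is a hypothetical helper.

```python
from typing import Dict


def agreement_from_counts(counts: Dict[int, int], n_raters: int = 3) -> Dict[str, float]:
    """Binary-label agreement coefficients from vote-count tallies.

    `counts` maps k (raters who chose the positive label) to the number of
    texts, for k = 0..n_raters. Every text is assumed to have n_raters ratings.
    """
    n = sum(counts.values())
    pairs = n_raters * (n_raters - 1)
    # Observed agreement: share of agreeing rater pairs, averaged over texts.
    pa = sum(
        m * (k * (k - 1) + (n_raters - k) * (n_raters - k - 1)) / pairs
        for k, m in counts.items()
    ) / n
    # Prevalence of each class across all ratings.
    p_pos = sum(k * m for k, m in counts.items()) / (n * n_raters)
    p_neg = 1 - p_pos
    # Gwet's AC1: pe = sum_k p_k * (1 - p_k) / (q - 1), with q = 2 classes.
    pe_ac1 = p_pos * (1 - p_pos) + p_neg * (1 - p_neg)
    # Krippendorff's alpha: small-sample correction on pa, pe = sum_k p_k ** 2.
    eps = 1 / (n * n_raters)
    pa_alpha = (1 - eps) * pa + eps
    pe_alpha = p_pos ** 2 + p_neg ** 2
    return {
        "pe_ac1": pe_ac1,
        "pe_alpha": pe_alpha,
        "gwet_ac1": (pa - pe_ac1) / (1 - pe_ac1),
        "krippendorff_alpha": (pa_alpha - pe_alpha) / (1 - pe_alpha),
    }


health_counts = {0: 949, 1: 22, 2: 7, 3: 0}
result = agreement_from_counts(health_counts)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With about 1% of ratings positive, alpha's chance agreement is close to 0.98, so almost all of the observed agreement is attributed to chance, while AC1's chance agreement stays near 0.02. This is why the rare-label sections of this notebook pair a high AC1 with a low alpha.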


### `ideology`

In [34]:
ideology = pd.DataFrame(dataset.get_annotations(raw_texts, "ideology"))
ideology.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,True,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,True,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


In [35]:
fig = px.bar(
    data_frame=prepare_data_to_px(ideology),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Ideology distribution")

fig.update_layout(layout)

fig.show()

In [36]:
cac = CAC(ideology)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.8620
Krippendorff's alpha: 0.4265
Gwet's AC1: 0.8904


In [37]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as ideology.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

84 texts was annotated by 1 rater(s) as ideology.
51 texts was annotated by 2 rater(s) as ideology.
24 texts was annotated by 3 rater(s) as ideology.
Disagreement score (class True): 0.8491


### `insult`

In [38]:
insult = pd.DataFrame(dataset.get_annotations(raw_texts, "insult"))
insult.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,True,False,True,True,False,True,True,True,True,False,...,False,True,True,True,False,True,False,True,True,True
127,True,True,True,False,False,True,False,True,True,False,...,False,True,False,True,False,False,False,False,False,False
260,True,True,True,True,False,True,False,True,False,True,...,False,True,False,True,True,False,False,True,False,True


In [39]:
fig = px.bar(
    data_frame=prepare_data_to_px(insult),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Insult distribution")

fig.update_layout(layout)

fig.show()

In [40]:
cac = CAC(insult)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.5685
Krippendorff's alpha: 0.3355
Gwet's AC1: 0.4929


In [41]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as insult.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

168 texts was annotated by 1 rater(s) as insult.
254 texts was annotated by 2 rater(s) as insult.
443 texts was annotated by 3 rater(s) as insult.
Disagreement score (class True): 0.4879


### `lgbtqphobia`

In [42]:
lgbtqphobia = pd.DataFrame(dataset.get_annotations(raw_texts, "lgbtqphobia"))
lgbtqphobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [43]:
fig = px.bar(
    data_frame=prepare_data_to_px(lgbtqphobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="LGBTQphobia distribution")

fig.update_layout(layout)

fig.show()

In [44]:
cac = CAC(lgbtqphobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9683
Krippendorff's alpha: 0.7329
Gwet's AC1: 0.9770


In [45]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as lgbtqphobia.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

19 texts was annotated by 1 rater(s) as lgbtqphobia.
12 texts was annotated by 2 rater(s) as lgbtqphobia.
26 texts was annotated by 3 rater(s) as lgbtqphobia.
Disagreement score (class True): 0.5439


### `other_lifestyle`

In [46]:
other_lifestyle = pd.DataFrame(dataset.get_annotations(raw_texts, "other_lifestyle"))
other_lifestyle.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [47]:
fig = px.bar(
    data_frame=prepare_data_to_px(other_lifestyle),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Other-Lifestyle distribution")

fig.update_layout(layout)

fig.show()

In [48]:
cac = CAC(other_lifestyle)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9724
Krippendorff's alpha: 0.2612
Gwet's AC1: 0.9811


In [49]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as other_lifestyle.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

17 texts was annotated by 1 rater(s) as other_lifestyle.
10 texts was annotated by 2 rater(s) as other_lifestyle.
0 texts was annotated by 3 rater(s) as other_lifestyle.
Disagreement score (class True): 1.0000


### `physical_aspects`

In [50]:
physical_aspects = pd.DataFrame(dataset.get_annotations(raw_texts, "physical_aspects"))
physical_aspects.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [51]:
fig = px.bar(
    data_frame=prepare_data_to_px(physical_aspects),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Physical Aspects distribution")

fig.update_layout(layout)

fig.show()

In [52]:
cac = CAC(physical_aspects)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9387
Krippendorff's alpha: 0.3605
Gwet's AC1: 0.9563


In [53]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as physical_aspects.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

53 texts was annotated by 1 rater(s) as physical_aspects.
7 texts was annotated by 2 rater(s) as physical_aspects.
10 texts was annotated by 3 rater(s) as physical_aspects.
Disagreement score (class True): 0.8571


### `profanity_obscene`

In [54]:
profanity_obscene = pd.DataFrame(dataset.get_annotations(raw_texts, "profanity_obscene"))
profanity_obscene.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,True,True,False,False,True,False,False,False,False,True,...,False,False,False,False,False,True,True,True,False,False
127,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,True,False,True,False,False
260,False,True,False,True,True,False,False,False,False,True,...,True,False,False,False,False,True,False,False,False,False


In [55]:
fig = px.bar(
    data_frame=prepare_data_to_px(profanity_obscene),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Profanity/Obscene distribution")

fig.update_layout(layout)

fig.show()

In [56]:
cac = CAC(profanity_obscene)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.7290
Krippendorff's alpha: 0.5087
Gwet's AC1: 0.7144


In [57]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as profanity_obscene.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

124 texts was annotated by 1 rater(s) as profanity_obscene.
141 texts was annotated by 2 rater(s) as profanity_obscene.
102 texts was annotated by 3 rater(s) as profanity_obscene.
Disagreement score (class True): 0.7221


### `racism`

In [58]:
racism = pd.DataFrame(dataset.get_annotations(raw_texts, "racism"))
racism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [59]:
fig = px.bar(
    data_frame=prepare_data_to_px(racism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Racism distribution")

fig.update_layout(layout)

fig.show()

In [60]:
cac = CAC(racism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9908
Krippendorff's alpha: 0.3049
Gwet's AC1: 0.9938


In [61]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as racism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

8 texts was annotated by 1 rater(s) as racism.
1 texts was annotated by 2 rater(s) as racism.
1 texts was annotated by 3 rater(s) as racism.
Disagreement score (class True): 0.9000


### `religious_intolerance`

In [62]:
religious_intolerance = pd.DataFrame(dataset.get_annotations(raw_texts, "religious_intolerance"))
religious_intolerance.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [63]:
fig = px.bar(
    data_frame=prepare_data_to_px(religious_intolerance),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Religious intolerance distribution")

fig.update_layout(layout)

fig.show()

In [64]:
cac = CAC(religious_intolerance)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
try:
    print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
except Exception:
    # Alpha cannot be computed when only one category is present in the ratings.
    print("Krippendorff's alpha: NaN")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False], Weights: "identity">
Percent agreement: 1.0000
Krippendorff's alpha: NaN
Gwet's AC1: 1.0000



divide by zero encountered in double_scalars


invalid value encountered in multiply



In [65]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as religious_intolerance.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

0 texts was annotated by 1 rater(s) as religious_intolerance.
0 texts was annotated by 2 rater(s) as religious_intolerance.
0 texts was annotated by 3 rater(s) as religious_intolerance.
Disagreement score (class True): 0.0000
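A likely explanation for the NaN reported for Krippendorff's alpha on `religious_intolerance`: no rater used the label, so only one category (`False`) appears in the ratings. Chance-corrected coefficients have the form (pa - pe) / (1 - pe), and with a single category the chance agreement under alpha's definition is pe = 1, which leaves the ratio undefined. The sketch below illustrates this. It describes the general formula, not the internals of `irrCAC`, and `chance_corrected` is a hypothetical helper.

```python
import math
from typing import Sequence


def chance_corrected(pa: float, pe: float) -> float:
    """Return (pa - pe) / (1 - pe), or NaN when chance agreement is already 1."""
    if math.isclose(pe, 1.0):
        return math.nan
    return (pa - pe) / (1 - pe)


def alpha_chance_agreement(proportions: Sequence[float]) -> float:
    """Chance agreement for alpha with identity weights: sum of squared proportions."""
    return sum(p ** 2 for p in proportions)


# Only one category in use, so its proportion is 1 and pe is 1 as well.
single_category = chance_corrected(1.0, alpha_chance_agreement([1.0]))
# Two balanced categories for comparison: pe = 0.5.
balanced = chance_corrected(0.8, alpha_chance_agreement([0.5, 0.5]))
print(single_category, round(balanced, 4))
```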


### `sexism`

In [66]:
sexism = pd.DataFrame(dataset.get_annotations(raw_texts, "sexism"))
sexism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [67]:
fig = px.bar(
    data_frame=prepare_data_to_px(sexism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Sexism distribution")

fig.update_layout(layout)

fig.show()

In [68]:
cac = CAC(sexism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9591
Krippendorff's alpha: 0.1531
Gwet's AC1: 0.9718


In [69]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as sexism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

35 texts was annotated by 1 rater(s) as sexism.
5 texts was annotated by 2 rater(s) as sexism.
1 texts was annotated by 3 rater(s) as sexism.
Disagreement score (class True): 0.9756


### `xenophobia`

In [70]:
xenophobia = pd.DataFrame(dataset.get_annotations(raw_texts, "xenophobia"))
xenophobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,968,969,970,971,972,973,974,975,976,977
126,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [71]:
fig = px.bar(
    data_frame=prepare_data_to_px(xenophobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Xenophobia distribution")

fig.update_layout(layout)

fig.show()

In [72]:
cac = CAC(xenophobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 978, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9734
Krippendorff's alpha: 0.3571
Gwet's AC1: 0.9818


In [73]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} text(s) annotated as xenophobia by {k} rater(s).")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

14 text(s) annotated as xenophobia by 1 rater(s).
12 text(s) annotated as xenophobia by 2 rater(s).
1 text(s) annotated as xenophobia by 3 rater(s).
Disagreement score (class True): 0.9630
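
The disagreement scores reported for `sexism` and `xenophobia` are numerically consistent with one simple reading: among the texts that at least one rater marked as `True`, the share on which the raters were not unanimous. The sketch below checks that arithmetic against the counts printed above. It is an inference from the outputs, not the confirmed definition of `disagreement_score`, and `non_unanimous_share` is a hypothetical helper.

```python
def non_unanimous_share(counts_by_raters: dict, n_raters: int = 3) -> float:
    """Share of positively labelled texts that not every rater marked as True.

    Hypothetical helper: it reproduces the printed scores, but it is only an
    assumption about what `disagreement_score` computes.
    """
    total = sum(counts_by_raters.values())
    unanimous = counts_by_raters.get(n_raters, 0)
    return (total - unanimous) / total


# Counts copied from the cell outputs: {number of raters: number of texts}
sexism_counts = {1: 35, 2: 5, 3: 1}
xenophobia_counts = {1: 14, 2: 12, 3: 1}

print(f"sexism: {non_unanimous_share(sexism_counts):.4f}")          # 40 / 41
print(f"xenophobia: {non_unanimous_share(xenophobia_counts):.4f}")  # 26 / 27
```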


### Krippendorff's alpha Multi-Label

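In the next cells, we will calculate Krippendorff's alpha treating the annotation as a single multi-label problem instead of several independent binary problems. Each rater's annotation of a text becomes the set of labels they assigned, and two sets are compared with a set distance (Jaccard or MASI).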

In [74]:
ratings = {
    "health": health,
    "ideology": ideology,
    "insult": insult,
    "lgbtqphobia": lgbtqphobia,
    "other_lifestyle": other_lifestyle,
    "physical_aspects": physical_aspects,
    "profanity_obscene": profanity_obscene,
    "racism": racism,
    "religious_intolerance": religious_intolerance,
    "sexism": sexism,
    "xenophobia": xenophobia
}

task_data = []
for coder in [a.annotator_id for a in annotators]:
    for item in range(len(health)):
        temp = get_annotations_by_rater(ratings, coder, item)
        if temp != []:
            task_data.append((
                coder,
                item,
                frozenset(temp)
            ))

jaccard_task = AnnotationTask(distance=jaccard_distance)
masi_task = AnnotationTask(distance=masi_distance)

for task in [jaccard_task, masi_task]:
    task.load_array(task_data)
    print(f"Krippendorff's alpha using {task.distance}")
    print(f"Krippendorff's alpha: {task.alpha():.4f}", "\n")

Krippendorff's alpha using <function jaccard_distance at 0x00000241804A1940>
Krippendorff's alpha: 0.4807 

Krippendorff's alpha using <function masi_distance at 0x00000241804A19D0>
Krippendorff's alpha: 0.4387 
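
To make the two distance functions concrete, the sketch below re-implements Jaccard and MASI distance in plain Python and compares a few label sets. The MASI monotonicity weights (1, 0.67, 0.33, 0) follow the commonly cited definition and should be read as an assumption here. The helper names and the example label sets are illustrative and are not taken from the annotated data. Both helpers assume non-empty sets, which matches the loop above, where empty annotation lists are skipped.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    return 1 - len(a & b) / len(a | b)


def masi(a: frozenset, b: frozenset) -> float:
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    inter = a & b
    if a == b:
        weight = 1.0   # identical sets
    elif inter == a or inter == b:
        weight = 0.67  # one set contains the other
    elif inter:
        weight = 0.33  # partial overlap, neither set contains the other
    else:
        weight = 0.0   # disjoint sets
    return 1 - len(inter) / len(a | b) * weight


# Illustrative label sets for one text, as three raters might produce them
base = frozenset({"insult", "profanity_obscene"})
subset = frozenset({"insult"})
overlap = frozenset({"insult", "racism"})

for other in (base, subset, overlap):
    print(sorted(other), f"jaccard={jaccard(base, other):.3f}", f"masi={masi(base, other):.3f}")
```

In this sketch, MASI penalizes partial matches more heavily than Jaccard: a subset or a partial overlap gets a larger distance, while identical and disjoint sets get the same distance under both.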



## Label Assignment

In this section, we will define the label assignment strategy and assign labels to the texts.

Possible label assignment strategies are:

- **Majority Vote**: assign the label with the highest frequency.
- **At least one**: assign the label if at least one annotator marked it as true.
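
The two strategies can be sketched in a few lines. These are hypothetical stand-ins that operate on a plain list of annotator votes; the project's `majority_vote` and `at_least_one` functions may differ, for instance in how ties are handled.

```python
from collections import Counter


def majority_vote_sketch(votes: list):
    """Return the most frequent vote.

    Assumption: ties go to the value seen first, which is how
    Counter.most_common orders equal counts. The real function may differ.
    """
    return Counter(votes).most_common(1)[0][0]


def at_least_one_sketch(votes: list) -> bool:
    """Return True if any annotator marked the label as True."""
    return any(votes)


# Three annotators voting on one text
print(majority_vote_sketch(["OFF", "OFF", "NOT"]))
print(at_least_one_sketch([False, True, False]))
print(at_least_one_sketch([False, False, False]))
```

At least one is the more permissive of the two: a single positive vote is enough to assign the label, whereas majority vote needs most annotators to agree.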

### Strategy per feature

Each feature has its own label assignment strategy.

The `LabelStrategy` object maps each feature to the function that implements the label assignment strategy selected for it.

In [75]:
label_strategy = LabelStrategy(
    is_offensive=majority_vote,
    is_targeted=majority_vote,
    targeted_type=majority_vote,
    toxic_spans=all_labeled_spans,
    health=at_least_one,
    ideology=at_least_one,
    insult=at_least_one, # majority_vote
    lgbtqphobia=at_least_one,
    other_lifestyle=at_least_one,
    physical_aspects=at_least_one,
    profanity_obscene=at_least_one,
    racism=at_least_one,
    religious_intolerance=at_least_one,
    sexism=at_least_one,
    xenophobia=at_least_one
)

processed_texts, metadata = dataset.build(
    raw=data,
    label_strategy=label_strategy
)

processed_texts = [i.dict() for i in processed_texts]
metadata = [i.dict() for i in metadata]


The annotation list is even. The returned vote will be random.



## Create DataFrames

In the next cells, we will create Pandas DataFrames for the dataset and the metadata.

In [76]:
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (1851, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,1d8ea463907a4ffa9823d0216b415379,"""vc é minha vida bela"" vc não tem vida seu pei...",OFF,TIN,IND,"[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 5...",False,False,True,False,False,False,True,False,False,False,False
1,6565048057f94316bc26b3b1dcd453ca,USER Caralho mano eu ia mandar um vai tomar no...,OFF,TIN,IND,"[4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37,...",False,False,True,False,False,False,True,False,False,False,False
2,9c3301b0ac794929b1c37d9aa00d245f,Os ignorantes nos comentários querendo dar des...,OFF,TIN,GRP,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 195, ...",False,True,True,False,False,False,False,False,False,False,False
3,9c567d3a7c4e4e5eb66e0f1585eaef26,"rafael é muito fdp, praga de garoto...",OFF,TIN,IND,"[15, 16, 17, 20, 21, 22, 23, 24, 25]",False,False,True,False,False,False,True,False,False,False,False
4,a65c470bb67a422e85026c74733c995d,PUTARIA 🔞 URL,OFF,TIN,OTH,"[0, 1, 2, 3, 4, 5, 6, 7]",False,False,False,False,False,False,True,False,False,False,False


In [77]:
df_metadata = pd.DataFrame(metadata)

print(f"Shape: {df_metadata.shape}")
df_metadata.head()

Shape: (5830, 11)


Unnamed: 0,id,source,created_at,collected_at,toxicity_score,category,annotator_id,gender,year_of_birth,education_level,annotator_type
0,1d8ea463907a4ffa9823d0216b415379,YouTube,2013-05-14 17:01:53,2022-04-08 08:03:44.134767,0.8605,,,,,,
1,1d8ea463907a4ffa9823d0216b415379,,,NaT,,,126.0,Male,1997.0,High school,Contract worker
2,1d8ea463907a4ffa9823d0216b415379,,,NaT,,,127.0,Female,1975.0,Master's degree,Contract worker
3,1d8ea463907a4ffa9823d0216b415379,,,NaT,,,260.0,Female,2001.0,High school,Contract worker
4,6565048057f94316bc26b3b1dcd453ca,Twitter,2022-03-27 01:43:01+00:00,2022-03-27 00:52:43.398462,0.9389,,,,,,


## Validate data

In this section, we will apply some simple validation to guarantee that the data is correct.

Remove duplicated texts.

In [78]:
df.drop_duplicates(subset=["text"], inplace=True)

print(f"Shape: {df.shape}")

Shape: (1851, 17)


Remove texts that are not understandable.

In [136]:
invalid_texts = [
    "Yo que me iba a dormir y ese marica de chris dizque mine a dar un rose ni modo 🫠🫡"
]

processed_texts = []

for text in df.to_dict(orient="records"):
    if text["text"] not in invalid_texts and not check_words(text["text"], ["USER", "HASHTAG", "URL"]):
        processed_texts.append(text)

print(f"Count: {len(processed_texts)}")

Count: 1842


Rebuild dataframe from the cleaned data.

In [137]:
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (1842, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,1d8ea463907a4ffa9823d0216b415379,"""vc é minha vida bela"" vc não tem vida seu pei...",OFF,TIN,IND,"[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 5...",False,False,True,False,False,False,True,False,False,False,False
1,6565048057f94316bc26b3b1dcd453ca,USER Caralho mano eu ia mandar um vai tomar no...,OFF,TIN,IND,"[4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37,...",False,False,True,False,False,False,True,False,False,False,False
2,9c3301b0ac794929b1c37d9aa00d245f,Os ignorantes nos comentários querendo dar des...,OFF,TIN,GRP,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 195, ...",False,True,True,False,False,False,False,False,False,False,False
3,9c567d3a7c4e4e5eb66e0f1585eaef26,"rafael é muito fdp, praga de garoto...",OFF,TIN,IND,"[15, 16, 17, 20, 21, 22, 23, 24, 25]",False,False,True,False,False,False,True,False,False,False,False
4,a65c470bb67a422e85026c74733c995d,PUTARIA 🔞 URL,OFF,TIN,OTH,"[0, 1, 2, 3, 4, 5, 6, 7]",False,False,False,False,False,False,True,False,False,False,False


In [138]:
metadata = dict_serialize_date(
    data=[i.dict() if isinstance(i, Metadata) else i for i in metadata],
    keys=["created_at", "collected_at"])

# Keep only the metadata of texts that survived the cleaning steps
valid_ids = set(df["id"])
metadata = [i for i in metadata if i["id"] in valid_ids]

print(f"Count: {len(metadata)}")

Count: 5804


## Profiling Report

We will generate a profiling report that provides some statistics about the data.

In [139]:
profile = ProfileReport(
    df, title="OLID-BR Pilot 3",
    explorative=True)

profile.to_file("../../docs/reports/olidbr_pilot_3.html")


## Upload data to S3

In this section, we will save the dataset and its metadata to the S3 bucket in CSV and JSON formats.

Saving in CSV format.

In [140]:
bucket.upload_csv(
    data=df,
    key="processed/olid-br/3/olidbr.csv")

bucket.upload_csv(
    data=df_metadata,
    key="processed/olid-br/3/metadata.csv")

print("CSV Files uploaded.")

CSV Files uploaded.


Saving in JSON format.

In [141]:
bucket.upload_json(
    data=processed_texts,
    key="processed/olid-br/3/olidbr.json")

bucket.upload_json(
    data=metadata,
    key="processed/olid-br/3/metadata.json")

print("JSON Files uploaded.")

JSON Files uploaded.
