# OLID-BR (Build Dataset)

In this notebook, we read the annotated data from an S3 bucket, build the OLID-BR dataset, and save it back to an S3 bucket in JSON and CSV formats.

The annotated data is stored in the Label Studio JSON format. See [Label Studio Documentation — Export Annotations](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks) for more details.

## Imports

In [1]:
import sys
from pathlib import Path

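# Add the repository root (two levels up) to sys.path so that `src` is importable.
# The check and the appended path must refer to the same directory.
root = str(Path(".").absolute().parent.parent)
if root not in sys.path:
    sys.path.append(root)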

In [2]:
from dotenv import load_dotenv

# Initialize the env vars
load_dotenv("../../.env")

True

In [3]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

from src.data_classes import Annotator, LabelStrategy
from src.dataset import Dataset
from src.labeling.assignment import majority_vote, at_least_one, all_labeled_spans
from src.labeling.metrics import InterRaterReliability
from src.s3 import Bucket
from src.settings import AppSettings
from src.utils import (
    read_yaml,
    check_words,
    prepare_data_to_px,
    dict_serialize_date
)

# Plotly
import plotly.express as px
import plotly.io as pio
from plotly.graph_objs import Layout

pio.templates.default = "plotly_dark"

layout = Layout(
    xaxis={
        "type": "category",
        "showgrid": False,
        "zeroline": False,
    },
    yaxis={
        "showgrid": False,
        "zeroline": False
    },
    paper_bgcolor="rgba(0,0,0,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    font={"color": "rgb(180,180,180)"},
)

args = AppSettings()

## Load annotators

In the next cells, we read the annotator data and create a list of `Annotator` objects.

This list is used to attach annotator information as metadata to each text.

In [4]:
annotators = read_yaml("../../properties/annotators.yaml")
annotators = [Annotator(**a) for a in annotators]
annotators

[Annotator(id=None, annotator_id=1, gender='Male', age=28, education_level="Bachelor's degree", annotator_type='Researcher'),
 Annotator(id=None, annotator_id=32, gender='Female', age=30, education_level="Bachelor's degree", annotator_type='Volunteer'),
 Annotator(id=None, annotator_id=127, gender='Female', age=0, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=128, gender='Female', age=0, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=126, gender='Male', age=0, education_level='High school', annotator_type='Contract worker')]

## Load data

In the next cells, we read the labeled data from the S3 bucket and merge all annotations into a single collection.

In [5]:
bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)

In [6]:
files = [
    "raw/labeled/phase2/olid-br-2-126.json",
    "raw/labeled/phase2/olid-br-2-127.json",
    "raw/labeled/phase2/olid-br-2-128.json"
]

Each annotator's data is stored in a separate file, so we merge the annotations of all annotators into a single collection, keyed by text.

In [7]:
data = {}

for file in files:
    print(f"Reading {file}")
    temp = bucket.download_json(key=file)

    for row in temp:
        text = row["data"]["text"]
        if text not in data:
            # First time we see this text: keep the whole task
            data[text] = row
        else:
            # Text already seen: append this annotator's annotations
            data[text]["annotations"].extend(row["annotations"])

data = list(data.values())

print(f"Count: {len(data)}")

Reading raw/labeled/phase2/olid-br-2-126.json
Reading raw/labeled/phase2/olid-br-2-127.json
Reading raw/labeled/phase2/olid-br-2-128.json
Count: 3000


Check if all texts have the same number of annotations.

In [8]:
annotations_count = {}

for item in data:
    count = len(item["annotations"])
    if count not in annotations_count.keys():
        annotations_count[count] = 1
    else:
        annotations_count[count] += 1

annotations_count

{3: 3000}

## Build dataset

In [9]:
dataset = Dataset(
    annotators=annotators,
    toxicity_threshold=args.PERSPECTIVE_THRESHOLD
)

raw_texts = dataset.get_raw_texts(data)

We keep only the texts that were annotated by all three annotators.

In [10]:
raw_texts = [text for text in raw_texts if len(text.annotations) == 3]

print(f"{len(raw_texts)} raw texts with 3 annotations.")

3000 raw texts with 3 annotations.


## Agreement Analysis

Also known as inter-rater reliability (IRR), inter-rater agreement (IRA), or concordance.

In the next cells, we will perform an agreement analysis to check if the annotations are consistent.

### Interpretation of Kappa or Krippendorff's alpha

| **Kappa** | **Level of Agreement** |
| :-: | :-: |
| > 0.8 | Almost perfect    |
| > 0.6 | Substantial       |
| > 0.4 | Moderate          |
| > 0.2 | Fair              |
| > 0   | Slight            |
| 0     | No agreement      |
| < 0   | Inverse agreement |

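Kappa statistics are based on the observed agreement corrected for the agreement expected by chance. Krippendorff's alpha, in contrast, is based on the observed disagreement corrected for the disagreement expected by chance.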

This leads to a range of −1 to 1 for both measures, where 1 indicates perfect agreement, 0 indicates no agreement beyond chance and negative values indicate inverse agreement.
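As a compact reference, the two families of coefficients are usually written as follows, where $P_o$ and $P_e$ are the observed and chance-expected agreement, and $D_o$ and $D_e$ are the observed and chance-expected disagreement. These symbols are standard notation, not names used in this notebook's code.

$$
\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad \alpha = 1 - \frac{D_o}{D_e}
$$

Both coefficients equal 1 under perfect agreement and 0 when agreement is exactly what chance would produce.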

Krippendorff (2004, p. 241) suggests: “[I]t is customary to require α ≥ .800. Where tentative conclusions are still acceptable, α ≥ .667 is the lowest conceivable limit.”
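The thresholds from the interpretation table above can be applied programmatically when reading the scores printed in the following cells. This is a minimal sketch: `interpret_agreement` is a hypothetical helper that is not part of `src`, and mapping `nan` to "Undefined" is an assumption added here.

```python
import math


def interpret_agreement(score):
    """Map a kappa/alpha score to the agreement level from the table above."""
    if math.isnan(score):
        # Assumption: a score that could not be computed has no level.
        return "Undefined"
    if score > 0.8:
        return "Almost perfect"
    if score > 0.6:
        return "Substantial"
    if score > 0.4:
        return "Moderate"
    if score > 0.2:
        return "Fair"
    if score > 0:
        return "Slight"
    if score == 0:
        return "No agreement"
    return "Inverse agreement"


for value in (0.5583, 0.0595, -0.1348):
    print(f"{value:+.4f}: {interpret_agreement(value)}")
```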

### `is_offensive`

In [11]:
is_offensive = pd.DataFrame(dataset.get_annotations(raw_texts, "is_offensive"))
is_offensive.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,OFF,...,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF
127,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,NOT,OFF,...,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF
128,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,OFF,...,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF


In [12]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_offensive),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_offensive distribution")

fig.update_layout(layout)

fig.show()

In [13]:
irr = InterRaterReliability(
    is_offensive.values.transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.7277
krippendorff_alpha: 0.0595
fleiss_kappa: 0.0594
randolph_kappa: 0.6369


### `is_targeted`

In [14]:
is_targeted = pd.DataFrame(dataset.get_annotations(raw_texts, "is_targeted"))
is_targeted.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,...,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN
127,UNT,UNT,UNT,UNT,TIN,UNT,UNT,UNT,UNT,UNT,...,TIN,TIN,TIN,TIN,TIN,UNT,TIN,TIN,TIN,TIN
128,UNT,TIN,UNT,UNT,UNT,UNT,UNT,TIN,UNT,UNT,...,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN


In [15]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_targeted),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_targeted distribution")

fig.update_layout(layout)

fig.show()

In [16]:
irr = InterRaterReliability(
    is_targeted.values.transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.1610
krippendorff_alpha: -0.1348
fleiss_kappa: -0.1349
randolph_kappa: -0.1187


### `targeted_type`

In [17]:
targeted_type = pd.DataFrame(dataset.get_annotations(raw_texts, "targeted_type"))
targeted_type.fillna(np.nan, inplace=True)
targeted_type.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,IND,GRP,GRP,GRP,GRP,OTH,GRP,GRP,IND,IND,...,GRP,GRP,IND,IND,GRP,IND,GRP,IND,GRP,GRP
127,,,,,GRP,,,,,,...,GRP,GRP,IND,IND,GRP,,GRP,IND,GRP,GRP
128,,IND,,,,,,IND,,,...,GRP,GRP,GRP,IND,IND,IND,GRP,IND,GRP,GRP


In [18]:
fig = px.bar(
    data_frame=prepare_data_to_px(targeted_type),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="targeted_type distribution")

fig.update_layout(layout)

fig.show()

In [19]:
irr = InterRaterReliability(
    targeted_type.values.transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.0893
krippendorff_alpha: 0.2461
fleiss_kappa: -0.0181
randolph_kappa: 0.1154


### `toxic_spans`

In [20]:
toxic_spans = pd.DataFrame(dataset.get_annotations(raw_texts, "toxic_spans"))
toxic_spans.head()

Unnamed: 0,126,127,128
0,[],"[52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...",[]
1,"[20, 21, 22, 23, 24, 25, 93, 94, 95, 96, 97]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",[]
2,"[14, 15, 16, 17, 18]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",[]
3,"[10, 11, 12, 13, 14, 17, 18, 19, 20, 21, 22, 2...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",[]
4,"[10, 11, 12, 13, 14, 165, 166, 167, 176, 177, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[10, 11, 12, 13, 14, 176, 177, 178, 179, 180, ..."


### `health`

In [21]:
health = pd.DataFrame(dataset.get_annotations(raw_texts, "health"))
health.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [22]:
fig = px.bar(
    data_frame=prepare_data_to_px(health),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Health distribution")

fig.update_layout(layout)

fig.show()

In [23]:
irr = InterRaterReliability(
    health.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.9760
krippendorff_alpha: 0.0447
fleiss_kappa: 0.0446
randolph_kappa: 0.9680


### `ideology`

In [24]:
ideology = pd.DataFrame(dataset.get_annotations(raw_texts, "ideology"))
ideology.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,True,False,False,False,False,False,False,False,...,False,True,False,True,True,True,True,False,True,True
127,False,False,False,False,False,False,False,False,False,False,...,False,True,False,True,True,False,True,False,True,True
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,True,False,False,True


In [25]:
fig = px.bar(
    data_frame=prepare_data_to_px(ideology),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Ideology distribution")

fig.update_layout(layout)

fig.show()

In [26]:
irr = InterRaterReliability(
    ideology.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.7647
krippendorff_alpha: 0.3019
fleiss_kappa: 0.3018
randolph_kappa: 0.6862


### `insult`

In [27]:
insult = pd.DataFrame(dataset.get_annotations(raw_texts, "insult"))
insult.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,True,True,False,True,True,True,True,False,False,True,...,True,True,True,True,True,False,False,True,False,False
127,True,True,True,True,False,False,True,True,False,True,...,True,True,True,True,True,True,True,True,False,False
128,True,True,True,True,True,False,True,False,True,True,...,True,True,False,False,False,True,False,True,True,False


In [28]:
fig = px.bar(
    data_frame=prepare_data_to_px(insult),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Insult distribution")

fig.update_layout(layout)

fig.show()

In [29]:
irr = InterRaterReliability(
    insult.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.4713
krippendorff_alpha: 0.0895
fleiss_kappa: 0.0894
randolph_kappa: 0.2951


### `lgbtqphobia`

In [30]:
lgbtqphobia = pd.DataFrame(dataset.get_annotations(raw_texts, "lgbtqphobia"))
lgbtqphobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [31]:
fig = px.bar(
    data_frame=prepare_data_to_px(lgbtqphobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="LGBTQphobia distribution")

fig.update_layout(layout)

fig.show()

In [32]:
irr = InterRaterReliability(
    lgbtqphobia.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.9453
krippendorff_alpha: 0.5583
fleiss_kappa: 0.5583
randolph_kappa: 0.9271


### `other_lifestyle`

In [33]:
other_lifestyle = pd.DataFrame(dataset.get_annotations(raw_texts, "other_lifestyle"))
other_lifestyle.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [34]:
fig = px.bar(
    data_frame=prepare_data_to_px(other_lifestyle),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Other-Lifestyle distribution")

fig.update_layout(layout)

fig.show()

In [35]:
irr = InterRaterReliability(
    other_lifestyle.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.9860
krippendorff_alpha: 0.0824
fleiss_kappa: 0.0823
randolph_kappa: 0.9813


### `physical_aspects`

In [36]:
physical_aspects = pd.DataFrame(dataset.get_annotations(raw_texts, "physical_aspects"))
physical_aspects.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [37]:
fig = px.bar(
    data_frame=prepare_data_to_px(physical_aspects),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Physical Aspects distribution")

fig.update_layout(layout)

fig.show()

In [38]:
irr = InterRaterReliability(
    physical_aspects.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.9463
krippendorff_alpha: 0.3272
fleiss_kappa: 0.3271
randolph_kappa: 0.9284


### `profanity_obscene`

In [39]:
profanity_obscene = pd.DataFrame(dataset.get_annotations(raw_texts, "profanity_obscene"))
profanity_obscene.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,True,True,False,False,...,True,True,False,False,True,False,False,False,False,False


In [40]:
fig = px.bar(
    data_frame=prepare_data_to_px(profanity_obscene),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Profanity/Obscene distribution")

fig.update_layout(layout)

fig.show()

In [41]:
irr = InterRaterReliability(
    profanity_obscene.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.6837
krippendorff_alpha: 0.0850
fleiss_kappa: 0.0849
randolph_kappa: 0.5782


### `racism`

In [42]:
racism = pd.DataFrame(dataset.get_annotations(raw_texts, "racism"))
racism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [43]:
fig = px.bar(
    data_frame=prepare_data_to_px(racism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Racism distribution")

fig.update_layout(layout)

fig.show()

In [44]:
irr = InterRaterReliability(
    racism.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.9750
krippendorff_alpha: 0.2564
fleiss_kappa: 0.2563
randolph_kappa: 0.9667


### `religious_intolerance`

In [45]:
religious_intolerance = pd.DataFrame(dataset.get_annotations(raw_texts, "religious_intolerance"))
religious_intolerance.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [46]:
fig = px.bar(
    data_frame=prepare_data_to_px(religious_intolerance),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Religious intolerance distribution")

fig.update_layout(layout)

fig.show()

In [47]:
irr = InterRaterReliability(
    religious_intolerance.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

Error calculating Krippendorff's alpha: There has to be more than one value in the domain.


percent_agreement: 1.0000
krippendorff_alpha: nan
fleiss_kappa: nan
randolph_kappa: nan



invalid value encountered in double_scalars



### `sexism`

In [48]:
sexism = pd.DataFrame(dataset.get_annotations(raw_texts, "sexism"))
sexism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,True,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,True,False


In [49]:
fig = px.bar(
    data_frame=prepare_data_to_px(sexism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Sexism distribution")

fig.update_layout(layout)

fig.show()

In [50]:
irr = InterRaterReliability(
    sexism.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.8753
krippendorff_alpha: 0.1721
fleiss_kappa: 0.1721
randolph_kappa: 0.8338


### `xenophobia`

In [51]:
xenophobia = pd.DataFrame(dataset.get_annotations(raw_texts, "xenophobia"))
xenophobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
127,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [52]:
fig = px.bar(
    data_frame=prepare_data_to_px(xenophobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Xenophobia distribution")

fig.update_layout(layout)

fig.show()

In [53]:
irr = InterRaterReliability(
    xenophobia.values.astype(int).transpose().tolist()
)

for metric, score in irr.get_all().items():
    print(f"{metric}: {score:.4f}")

percent_agreement: 0.9673
krippendorff_alpha: 0.0732
fleiss_kappa: 0.0731
randolph_kappa: 0.9564


### Summary

| Feature                | Percent Agreement | Krippendorff's alpha | Fleiss' Kappa | Randolph's Kappa | Comments |
| ---------------------- | :---------------: | :------------------: | :-----------: | :--------------: | -------- |
| is\_offensive          | 0.7277            | 0.0595               | 0.0594        | 0.6369           |          |
| is\_targeted           | 0.1610            | -0.1348              | -0.1349       | -0.1187          | [1]      |
| targeted\_type         | 0.0893            | 0.2461               | -0.0181       | 0.1154           | [1]      |
| health                 | 0.9760            | 0.0447               | 0.0446        | 0.9680           |          |
| ideology               | 0.7647            | 0.3019               | 0.3018        | 0.6862           | [3]      |
| insult                 | 0.4713            | 0.0895               | 0.0894        | 0.2951           | [3]      |
| lgbtqphobia            | 0.9453            | 0.5583               | 0.5583        | 0.9271           |          |
| other\_lifestyle       | 0.9860            | 0.0824               | 0.0823        | 0.9813           |          |
| physical\_aspects      | 0.9463            | 0.3272               | 0.3271        | 0.9284           |          |
| profanity\_obscene     | 0.6837            | 0.0850               | 0.0849        | 0.5782           | [3]      |
| racism                 | 0.9750            | 0.2564               | 0.2563        | 0.9667           |          |
| religious\_intolerance | 1.0000            | nan                  | nan           | nan              | [2]      |
| sexism                 | 0.8753            | 0.1721               | 0.1721        | 0.8338           |          |
| xenophobia             | 0.9673            | 0.0732               | 0.0731        | 0.9564           |          |

#### Comments

- [1] The question behind the features `is_targeted` and `targeted_type` is optional: it should be answered only if the text is targeted. It looks like annotator 126 misunderstood this and marked every text as targeted.
- [2] None of our annotators tagged any text with `religious_intolerance`, so only one value occurs and the chance-corrected coefficients are undefined (`nan`).
- [3] Disregarding [1] and [2], the labels `ideology`, `insult`, and `profanity_obscene` have the most inconsistent annotations.
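Several rows in the summary combine a high percent agreement with a low kappa. The sketch below reproduces that pattern on toy data with a rarely used label. It is not the `InterRaterReliability` implementation: `agreement_scores` is a hypothetical helper, the toy ratings are made up, and two choices are assumptions made here, namely treating percent agreement as the share of items on which all raters agree and using the number of observed categories for Randolph's kappa.

```python
from collections import Counter


def agreement_scores(ratings):
    """Compute agreement scores for `ratings`: one list of labels per item,
    with one label per rater (same number of raters for every item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Share of items on which every rater chose the same label.
    unanimous = sum(1 for c in counts if len(c) == 1) / n_items

    # Mean pairwise agreement per item (the P-bar term of Fleiss' kappa).
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(v * v for v in c.values()) - n_raters) / pairs for c in counts
    ) / n_items

    # Chance agreement: Fleiss uses the observed label proportions,
    # Randolph assumes every observed category is equally likely.
    totals = Counter()
    for c in counts:
        totals.update(c)
    total_ratings = n_items * n_raters
    p_e_fleiss = sum((v / total_ratings) ** 2 for v in totals.values())
    p_e_randolph = 1 / len(totals)

    def corrected(p_e):
        # Undefined when only one category was ever used (denominator is 0).
        return float("nan") if p_e == 1 else (p_bar - p_e) / (1 - p_e)

    return {
        "percent_agreement": unanimous,
        "fleiss_kappa": corrected(p_e_fleiss),
        "randolph_kappa": corrected(p_e_randolph),
    }


# Toy data: 10 texts, 3 raters, and a label that is almost never used.
toy = [[False, False, False]] * 9 + [[True, False, False]]
scores = agreement_scores(toy)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

On this toy data, nine of the ten items are unanimous, yet Fleiss' kappa is slightly negative (-1/29), because almost all of the agreement is already expected by chance when one label dominates. Randolph's kappa stays high because it assumes equally likely categories.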

#### Conclusions

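The agreement analysis shows that the annotations are not consistent: percent agreement is high for most features, but the chance-corrected coefficients (Krippendorff's alpha and Fleiss' kappa) are mostly below 0.4.

- Medians (excluding `is_targeted`, `targeted_type`, and `religious_intolerance`, as explained in the comments):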
  - Percent Agreement: 0.9453
  - Krippendorff's alpha: 0.0895
  - Fleiss' Kappa: 0.0894
  - Randolph's Kappa: 0.9271
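The medians above can be re-derived from the scores printed earlier in this section. The sketch below copies the values of the eleven remaining features by hand; the `scores` dictionary and the metric names are only a convenient layout for this check.

```python
from statistics import median

# (percent agreement, Krippendorff's alpha, Fleiss' kappa, Randolph's kappa)
# Values copied from the cell outputs above; the three excluded features
# (is_targeted, targeted_type, religious_intolerance) are left out.
scores = {
    "is_offensive": (0.7277, 0.0595, 0.0594, 0.6369),
    "health": (0.9760, 0.0447, 0.0446, 0.9680),
    "ideology": (0.7647, 0.3019, 0.3018, 0.6862),
    "insult": (0.4713, 0.0895, 0.0894, 0.2951),
    "lgbtqphobia": (0.9453, 0.5583, 0.5583, 0.9271),
    "other_lifestyle": (0.9860, 0.0824, 0.0823, 0.9813),
    "physical_aspects": (0.9463, 0.3272, 0.3271, 0.9284),
    "profanity_obscene": (0.6837, 0.0850, 0.0849, 0.5782),
    "racism": (0.9750, 0.2564, 0.2563, 0.9667),
    "sexism": (0.8753, 0.1721, 0.1721, 0.8338),
    "xenophobia": (0.9673, 0.0732, 0.0731, 0.9564),
}

metrics = [
    "percent_agreement",
    "krippendorff_alpha",
    "fleiss_kappa",
    "randolph_kappa",
]

# Median of each metric across the features.
medians = {
    name: median(row[i] for row in scores.values())
    for i, name in enumerate(metrics)
}

for name, value in medians.items():
    print(f"{name}: {value:.4f}")
```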

## Label Assignment

In this section, we define the label assignment strategy and assign labels to the texts.

Possible label assignment strategies are:

- **Majority Vote**: assign the label with the highest frequency.
- **At least one**: assign the label if at least one annotator marked it as true.
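The two strategies listed above can be sketched in a few lines. These are illustrative stand-ins, not the `majority_vote` and `at_least_one` functions from `src.labeling.assignment`: the names `majority_vote_sketch` and `at_least_one_sketch` are hypothetical, and breaking ties in favor of the label seen first is an assumption.

```python
from collections import Counter


def majority_vote_sketch(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def at_least_one_sketch(labels):
    """Return True if any annotator marked the label as true."""
    return any(labels)


# Three annotators labeling one text.
print(majority_vote_sketch(["OFF", "NOT", "OFF"]))
print(at_least_one_sketch([False, True, False]))
```

A permissive rule such as "at least one" keeps a rare label whenever a single annotator assigns it, while majority vote requires most annotators to agree.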

### Strategy per feature

We define a label assignment strategy for each feature.

The `LabelStrategy` object maps each feature to the function that implements the selected label assignment strategy.

In [54]:
label_strategy = LabelStrategy(
    is_offensive=majority_vote,
    is_targeted=majority_vote,
    targeted_type=majority_vote,
    toxic_spans=all_labeled_spans,
    health=at_least_one,
    ideology=at_least_one,
    insult=at_least_one,
    lgbtqphobia=at_least_one,
    other_lifestyle=at_least_one,
    physical_aspects=at_least_one,
    profanity_obscene=at_least_one,
    racism=at_least_one,
    religious_intolerance=at_least_one,
    sexism=at_least_one,
    xenophobia=at_least_one
)

processed_texts, metadata = dataset.build(
    raw=data,
    label_strategy=label_strategy
)

In [55]:
df = pd.DataFrame([i.dict() for i in processed_texts])

print(f"Shape: {df.shape}")
df.head()

Shape: (3000, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,617431e5d7734c43a7e87cd30418fd87,USER Adorei o comercial também Jesus. Só achei...,OFF,UNT,,"[52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...",False,False,True,False,False,False,True,False,False,False,False
1,87e2f689490b490bb0d983fdcf257782,Cara isso foi muito babaca geral USER conhece ...,OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
2,10328996ecc54f2a8a51ae4b50ee45e4,Quem liga pra judeu kkkk,OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,True,True,False,False,False,False,False,False,False,True
3,7de8950e893d4ee6996da54e88eeff8a,"Se vc for porco, folgado e relaxado, você não ...",OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
4,d226d81c4ae04d179ec6e4c4687ab44d,"Rapaziada chata, né?! O cara trabalha c funk, ...",OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False


In [56]:
df_metadata = pd.DataFrame([i.dict() for i in metadata])

print(f"Shape: {df_metadata.shape}")
df_metadata.head()

Shape: (12000, 11)


Unnamed: 0,id,source,created_at,collected_at,toxicity_score,category,annotator_id,gender,age,education_level,annotator_type
0,617431e5d7734c43a7e87cd30418fd87,YouTube,2015-05-29 09:27:57,2022-04-08 08:03:44.134767,0.7189,,,,,,
1,617431e5d7734c43a7e87cd30418fd87,,,NaT,,,126.0,Male,0.0,High school,Contract worker
2,617431e5d7734c43a7e87cd30418fd87,,,NaT,,,127.0,Female,0.0,Master's degree,Contract worker
3,617431e5d7734c43a7e87cd30418fd87,,,NaT,,,128.0,Female,0.0,Master's degree,Contract worker
4,87e2f689490b490bb0d983fdcf257782,YouTube,2022-02-09 12:40:52,2022-04-08 08:03:44.134767,0.9852,,,,,,


## Validate data

In this section, we apply a few simple validation checks to catch obvious problems in the data.

Remove duplicate texts.

In [57]:
df.drop_duplicates(subset=["text"], inplace=True)

print(f"Shape: {df.shape}")

Shape: (3000, 17)


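Remove texts that are not understandable, that is, texts flagged by `check_words` against the placeholder tokens `USER`, `HASHTAG`, and `URL`.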

In [58]:
texts = df.to_dict(orient="records")

texts = [i for i in texts if not check_words(i["text"], ["USER", "HASHTAG", "URL"])]

print(f"Count: {len(texts)}")

Count: 2996


Rebuild dataframe from the cleaned data.

In [59]:
df = pd.DataFrame(texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (2996, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,617431e5d7734c43a7e87cd30418fd87,USER Adorei o comercial também Jesus. Só achei...,OFF,UNT,,"[52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...",False,False,True,False,False,False,True,False,False,False,False
1,87e2f689490b490bb0d983fdcf257782,Cara isso foi muito babaca geral USER conhece ...,OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
2,10328996ecc54f2a8a51ae4b50ee45e4,Quem liga pra judeu kkkk,OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,True,True,False,False,False,False,False,False,False,True
3,7de8950e893d4ee6996da54e88eeff8a,"Se vc for porco, folgado e relaxado, você não ...",OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
4,d226d81c4ae04d179ec6e4c4687ab44d,"Rapaziada chata, né?! O cara trabalha c funk, ...",OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False


## Profiling Report

We will generate a profiling report that provides some statistics about the data.

In [60]:
profile = ProfileReport(
    df, title="OLID-BR Pilot 2",
    explorative=True)

profile.to_file("../../reports/olidbr_pilot_2.html")


## Upload data to S3

In this section, we save the dataset to the S3 bucket in CSV and JSON formats.

Saving in CSV format.

In [61]:
bucket.upload_csv(
    data=df,
    key="processed/olid-br/2/olidbr.csv")

bucket.upload_csv(
    data=df_metadata,
    key="processed/olid-br/2/metadata.csv")

print("CSV Files uploaded.")

CSV Files uploaded.


Saving in JSON format.

In [62]:
metadata = dict_serialize_date(
    data=[i.dict() for i in metadata],
    keys=["created_at", "collected_at"])
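The date fields need special handling because Python's `json` module cannot serialize `datetime` objects directly. The sketch below shows one way such a conversion could work. It is an assumption about what a helper like `dict_serialize_date` does, not its actual implementation: `serialize_dates` and the sample record are hypothetical, and ISO 8601 is only one possible string format.

```python
import json
from datetime import datetime


def serialize_dates(records, keys):
    """Return copies of `records` with datetime values under `keys`
    converted to ISO 8601 strings; other values are left untouched."""
    serialized = []
    for record in records:
        record = dict(record)  # shallow copy, the input is not modified
        for key in keys:
            value = record.get(key)
            if isinstance(value, datetime):
                record[key] = value.isoformat()
        serialized.append(record)
    return serialized


# Hypothetical metadata record with one populated and one missing date.
sample = [
    {
        "id": "example",
        "created_at": datetime(2015, 5, 29, 9, 27, 57),
        "collected_at": None,
    }
]

converted = serialize_dates(sample, keys=["created_at", "collected_at"])
print(json.dumps(converted))
```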

In [64]:
bucket.upload_json(
    data=texts,
    key="processed/olid-br/2/olidbr.json")

bucket.upload_json(
    data=metadata,
    key="processed/olid-br/2/metadata.json")

print("JSON Files uploaded.")

JSON Files uploaded.
