# OLID-BR - Iteration 4

In this notebook, we read the annotated data from an S3 bucket, build the OLID-BR dataset, and save it back to an S3 bucket in JSON and CSV formats.

The annotated data is stored in the Label Studio JSON format. See [Label Studio Documentation — Export Annotations](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks) for more details.
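As a rough orientation, the cells below only rely on a handful of fields from each exported task: `data.text`, and for every entry in `annotations` the `completed_by` id and the `result` list. The sketch below shows a minimal task of that shape. The `from_name` value, the ids, and the text are illustrative assumptions, not values taken from the real export.

```python
# A minimal, hypothetical task in the shape the notebook reads.
# Only the keys used later in the notebook are shown.
task = {
    "data": {"text": "example comment"},
    "annotations": [
        {
            "completed_by": 504,  # annotator id (assumed to be a plain int here)
            "lead_time": 12.5,    # seconds spent on the task
            "result": [
                {
                    "from_name": "is_offensive",  # hypothetical control name
                    "type": "choices",
                    "value": {"choices": ["OFF"]},
                }
            ],
        },
        # An annotation submitted without any label has an empty result list.
        {"completed_by": 260, "lead_time": 3.0, "result": []},
    ],
}


def annotator_ids(task: dict) -> list:
    """Returns the ids of everyone who annotated the task, in order."""
    return [a["completed_by"] for a in task["annotations"]]


def has_result(annotation: dict) -> bool:
    """True when the annotation carries at least one labeled result."""
    return len(annotation["result"]) > 0


print(annotator_ids(task))
print([has_result(a) for a in task["annotations"]])
```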

## Imports

In [2]:
import sys
from pathlib import Path

# Make the directory two levels up importable so that `src` resolves
repo_root = str(Path(".").absolute().parent.parent)
if repo_root not in sys.path:
    sys.path.append(repo_root)

In [3]:
from dotenv import load_dotenv

# Initialize the env vars
load_dotenv("../../.env")

True

In [4]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from typing import List

from irrCAC.raw import CAC
from src.data_classes import Annotator, LabelStrategy, Metadata
from src.dataset import Dataset
from src.labeling.assignment import majority_vote, at_least_one, all_labeled_spans
from src.labeling.metrics import (
    percent_agreement,
    disagreement_by_raters,
    disagreement_score
)

from src.s3 import Bucket
from src.settings import AppSettings
from src.utils import (
    read_yaml,
    check_words,
    prepare_data_to_px,
    dict_serialize_date,
    get_lead_time,
    get_annotations_by_rater
)

import nltk
from nltk.metrics import agreement
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import masi_distance, jaccard_distance

# Plotly
import plotly.express as px
import plotly.io as pio
from plotly.graph_objs import Layout

pio.templates.default = "plotly_dark"

layout = Layout(
    xaxis={
        "type": "category",
        "showgrid": False,
        "zeroline": False,
    },
    yaxis={
        "showgrid": False,
        "zeroline": False
    },
    paper_bgcolor="rgba(0,0,0,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    font={"color": "rgb(180,180,180)"},
)

args = AppSettings()

## Load data

In the next cells, we read the labeled data from the S3 bucket and merge all annotations into a single collection.

In [5]:
bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)

In [6]:
files = [
    "raw/labeled/phase4/olid-br-4-2.json",
    "raw/labeled/phase4/olid-br-4-2-1.json",
    "raw/labeled/phase4/olid-br-4-33.json",
    "raw/labeled/phase4/olid-br-4-33-1.json",
    "raw/labeled/phase4/olid-br-4-41.json",
    "raw/labeled/phase4/olid-br-4-41-1.json"
]

Each annotator's data is stored in a separate file, so we merge the annotations from all files, using the text of each item as the key.

In [10]:
data = {}

for file in files:
    print(f"Reading {file}")
    temp = bucket.download_json(key=file)

    lead_time = get_lead_time(temp)
    print(f"{file} >> Mean: {np.mean(lead_time):.0f}s | Std: {np.std(lead_time):.0f}s")

    for row in temp:
        # Due to a bug in the database, the id for annotator 504 was changed to 2
        if row["annotations"][0]["completed_by"] == 2:
            for annotation in row["annotations"]:
                annotation["completed_by"] = 504

        if row["data"]["text"] not in data.keys():
            data[row["data"]["text"]] = row
        else:
            data[row["data"]["text"]]["annotations"].extend(row["annotations"])
    
    print()

data = [v for _, v in data.items()]

print(f"Count: {len(data)}")

Reading raw/labeled/phase4/olid-br-4-2.json
raw/labeled/phase4/olid-br-4-2.json >> Mean: 99s | Std: 654s

Reading raw/labeled/phase4/olid-br-4-2-1.json
raw/labeled/phase4/olid-br-4-2-1.json >> Mean: 70s | Std: 208s

Reading raw/labeled/phase4/olid-br-4-33.json
raw/labeled/phase4/olid-br-4-33.json >> Mean: 128s | Std: 2281s

Reading raw/labeled/phase4/olid-br-4-33-1.json
raw/labeled/phase4/olid-br-4-33-1.json >> Mean: 56s | Std: 387s

Reading raw/labeled/phase4/olid-br-4-41.json
raw/labeled/phase4/olid-br-4-41.json >> Mean: 323s | Std: 2561s

Reading raw/labeled/phase4/olid-br-4-41-1.json
raw/labeled/phase4/olid-br-4-41-1.json >> Mean: 1025s | Std: 4457s

Count: 2416
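
The merge step can be illustrated in isolation. This is a minimal sketch with made-up tasks; `merge_by_text` is a hypothetical helper name, and unlike the cell above it copies the annotation lists so the inputs are left untouched.

```python
from typing import Any, Dict, List


def merge_by_text(exports: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Merges several exports into one list, grouping annotations by text."""
    merged: Dict[str, Dict[str, Any]] = {}
    for export in exports:
        for row in export:
            text = row["data"]["text"]
            if text not in merged:
                # First time this text is seen: keep the row, copy its annotations
                merged[text] = {**row, "annotations": list(row["annotations"])}
            else:
                # Text already seen in another file: append the new annotations
                merged[text]["annotations"].extend(row["annotations"])
    return list(merged.values())


# Two toy exports, one per annotator, sharing the text "a"
export_1 = [
    {"data": {"text": "a"}, "annotations": [{"completed_by": 504, "result": [1]}]},
    {"data": {"text": "b"}, "annotations": [{"completed_by": 504, "result": [1]}]},
]
export_2 = [
    {"data": {"text": "a"}, "annotations": [{"completed_by": 260, "result": [1]}]},
]

merged = merge_by_text([export_1, export_2])
print(len(merged))
print([len(item["annotations"]) for item in merged])
```

Keying on the text means that two tasks with identical text are treated as the same item, even if they came from different task ids.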


## Fixing errors in the data

The data from this iteration contains some errors that we need to fix.

In the next cell, we remap the annotator ids, then count how many annotations each item has and which annotators took part.

In [12]:
from typing import Any, Dict

def get_annotation_count(data: List[Any]) -> Dict[str, Any]:
    """Summarizes who annotated the data and how many annotations each text has.

    Args:
    - data: A list of dictionaries with the data of the dataset.

    Returns:
    - A dictionary with the annotator ids ("Annotators") and the number of
      texts per annotation count ("Count").
    """
    annotations_count = {}
    iteration_annotators = []

    for item in data:
        for annotation in item["annotations"]:
            if annotation["completed_by"] not in iteration_annotators:
                iteration_annotators.append(annotation["completed_by"])

        count = len(item["annotations"])
        if count not in annotations_count.keys():
            annotations_count[count] = 1
        else:
            annotations_count[count] += 1
    return {
        "Annotators": iteration_annotators,
        "Count": annotations_count
    }

def remap_annotators(data: List[Any], annotators_map: Dict[int, int]) -> List[Any]:
    """Remaps the annotators in the dataset.

    Args:
    - data: A list of dictionaries with the data of the dataset.
    - annotators_map: A dictionary with the old annotator id as key and the new annotator id as value.
    """
    for item in data:
        for annotation in item["annotations"]:
            if annotation["completed_by"] in annotators_map.keys():
                annotation["completed_by"] = annotators_map[annotation["completed_by"]]
    return data

annotators_map = {
    2: 504,
    33: 260,
    41: 127
}

data = remap_annotators(data, annotators_map)

for k, v in get_annotation_count(data).items():
    print(f"{k}: {v}")

Annotators: [504, 260, 127]
Count: {2: 304, 3: 470, 4: 2, 1: 1640}


In the next cell, we remove annotations that have an empty result, as well as repeated annotations by the same annotator on the same item.

In [13]:
def remove_null_annotations(data: List[Any]) -> List[Any]:
    """Remove null and duplicate annotations from each item.

    An annotation is dropped when its result is empty or when the same
    annotator already has an annotation for the item.

    Args:
    - data: A list of dictionaries with the data of the dataset.

    Returns:
    - A list of dictionaries with the data of the dataset.
    """
    counter = 0
    for item in data:
        annotators = set()
        kept = []
        # Build a new list: removing from a list while iterating over it skips elements
        for annotation in item["annotations"]:
            if len(annotation["result"]) == 0 or annotation["completed_by"] in annotators:
                counter += 1
                continue
            annotators.add(annotation["completed_by"])
            kept.append(annotation)
        item["annotations"] = kept

    print(f"Removed {counter} null annotations.")
    return data

data = remove_null_annotations(data)

print(f"Count: {len(data)}")
for k, v in get_annotation_count(data).items():
    print(f"{k}: {v}")

Removed 11 null annotations.
Count: 2416
Annotators: [504, 260, 127]
Count: {2: 297, 3: 471, 1: 1648}


In the next cell, we will remove items that do not have three annotations.

In [14]:
data = [item for item in data if len(item["annotations"]) == 3]
print(f"Count: {len(data)}")

Count: 471


## Load annotators

In the next cells, we read the annotator data and create a list of `Annotator` objects.

This list is used to attach annotator information as metadata to each text.

In [15]:
annotators = read_yaml("../../properties/annotators.yaml")
annotators = [Annotator(**a) for a in annotators]
annotators

# Filter out the annotators that are not present in the data
annotators = [a for a in annotators if a.annotator_id in get_annotation_count(data)["Annotators"]]
annotators

[Annotator(id=None, annotator_id=127, gender='Female', year_of_birth=1975, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=260, gender='Female', year_of_birth=2001, education_level='High school', annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=504, gender='Female', year_of_birth=1999, education_level='High school', annotator_type='Contract worker')]

## Build dataset

In [16]:
dataset = Dataset(
    annotators=annotators,
    toxicity_threshold=args.PERSPECTIVE_THRESHOLD
)

raw_texts = dataset.get_raw_texts(data)

We keep only the texts that were annotated by all three annotators.

In [17]:
raw_texts = [text for text in raw_texts if len(text.annotations) == 3]

print(f"{len(raw_texts)} raw texts with 3 annotations.")

471 raw texts with 3 annotations.


## Inter-Rater Reliability (IRR) analysis

Also known as inter-rater agreement (IRA) or concordance.

In the next cells, we will perform an agreement analysis to check if the annotations are consistent.

See [Inter-Rater Reliability - OLID-BR](https://dougtrajano.github.io/olid-br/annotation/inter-rater-reliability.html) for more details.

### `is_offensive`

In [55]:
raw_texts = [text for text in raw_texts if len(text.annotations) == 3]

In [56]:
is_offensive = pd.DataFrame(dataset.get_annotations(raw_texts, "is_offensive"))
is_offensive.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,...,NOT,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF
2,OFF,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,...,NOT,NOT,NOT,OFF,OFF,OFF,OFF,NOT,OFF,OFF
3,OFF,OFF,OFF,OFF,OFF,OFF,NOT,OFF,NOT,OFF,...,NOT,NOT,OFF,OFF,OFF,OFF,OFF,NOT,OFF,NOT


In [57]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_offensive),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_offensive distribution")

fig.update_layout(layout)

fig.show()

In [58]:
cac = CAC(is_offensive)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: ['NOT', 'OFF'], Weights: "identity">
Percent agreement: 0.6509
Krippendorff's alpha: 0.1777
Gwet's AC1: 0.6754
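
The two chance-corrected coefficients above differ mainly in how they estimate chance agreement. The sketch below is a plain-Python illustration of the standard multi-rater formulas for Gwet's AC1 and Krippendorff's alpha (nominal scale). It is not the `irrCAC` implementation, and it assumes complete data: every text rated by the same number of raters (at least two) and at least two categories in use. The toy ratings are rebuilt from the per-class counts printed for `is_offensive` further down in this section.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple

Ratings = List[Sequence[Hashable]]


def _agreement_and_prevalence(ratings: Ratings) -> Tuple[float, Dict[Hashable, float]]:
    """Returns the pairwise observed agreement and the share of each category."""
    n = len(ratings)
    pa = 0.0
    pi: Dict[Hashable, float] = {}
    for row in ratings:
        r = len(row)
        counts = Counter(row)
        # Fraction of rater pairs that agree on this text
        pa += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        for category, c in counts.items():
            pi[category] = pi.get(category, 0.0) + c / r
    return pa / n, {k: v / n for k, v in pi.items()}


def gwet_ac1(ratings: Ratings) -> float:
    pa, pi = _agreement_and_prevalence(ratings)
    q = len(pi)  # number of categories actually used (assumed >= 2)
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (pa - pe) / (1 - pe)


def krippendorff_alpha_nominal(ratings: Ratings) -> float:
    pa, pi = _agreement_and_prevalence(ratings)
    total = sum(len(row) for row in ratings)  # total number of ratings
    pa_adj = pa * (1 - 1 / total) + 1 / total  # small-sample correction
    pe = sum(p * p for p in pi.values())
    return (pa_adj - pe) / (1 - pe)


# Toy data rebuilt from the reported counts: texts marked OFF by 3, 2, 1, 0 raters
ratings = (
    [("OFF", "OFF", "OFF")] * 1887
    + [("OFF", "OFF", "NOT")] * 754
    + [("OFF", "NOT", "NOT")] * 293
    + [("NOT", "NOT", "NOT")] * 65
)

print(f"Gwet's AC1: {gwet_ac1(ratings):.4f}")
print(f"Krippendorff's alpha: {krippendorff_alpha_nominal(ratings):.4f}")
```

With a dominant class (here `OFF`), alpha's chance term is the sum of squared prevalences, which is large, so alpha is pulled toward zero. AC1's chance term is built from `p * (1 - p)`, which stays small for skewed classes, so AC1 remains close to the raw agreement.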


In [59]:
for k, v in disagreement_by_raters(cac.ratings, "OFF").items():
    print(f"{v} texts was annotated by {k} rater(s) as offensive.")

print(f"Disagreement score (class OFF): {disagreement_score(cac.ratings, 'OFF'):.4f}")

293 texts was annotated by 1 rater(s) as offensive.
754 texts was annotated by 2 rater(s) as offensive.
1887 texts was annotated by 3 rater(s) as offensive.
Disagreement score (class OFF): 0.3569


In [60]:
for k, v in disagreement_by_raters(cac.ratings, "NOT").items():
    print(f"{v} texts was annotated by {k} rater(s) as non-offensive.")

print(f"Disagreement score (class NOT): {disagreement_score(cac.ratings, 'NOT'):.4f}")

754 texts was annotated by 1 rater(s) as non-offensive.
293 texts was annotated by 2 rater(s) as non-offensive.
65 texts was annotated by 3 rater(s) as non-offensive.
Disagreement score (class NOT): 0.9415
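
The project's `percent_agreement` and `disagreement_score` live in `src.labeling.metrics` and are not shown in this notebook. The reported numbers are consistent with the following definitions, which should be read as an assumption: percent agreement is the share of texts on which all raters chose the same label, and the disagreement score for a class is one minus the share of unanimous texts among the texts where at least one rater chose that class. The function names below are hypothetical.

```python
from typing import Hashable, List, Sequence

Ratings = List[Sequence[Hashable]]


def percent_agreement_sketch(ratings: Ratings) -> float:
    """Share of texts on which every rater chose the same label."""
    unanimous = sum(1 for row in ratings if len(set(row)) == 1)
    return unanimous / len(ratings)


def disagreement_score_sketch(ratings: Ratings, label: Hashable) -> float:
    """1 - (texts where all raters chose `label`) / (texts where any rater did)."""
    votes = [sum(1 for r in row if r == label) for row in ratings]
    any_rater = sum(1 for v in votes if v > 0)
    if any_rater == 0:
        return 0.0  # label never used, nothing to disagree about
    all_raters = sum(1 for v, row in zip(votes, ratings) if v == len(row))
    return 1 - all_raters / any_rater


# Toy data rebuilt from the counts printed above for is_offensive
ratings = (
    [("OFF", "OFF", "OFF")] * 1887
    + [("OFF", "OFF", "NOT")] * 754
    + [("OFF", "NOT", "NOT")] * 293
    + [("NOT", "NOT", "NOT")] * 65
)

print(f"Percent agreement: {percent_agreement_sketch(ratings):.4f}")
print(f"Disagreement (OFF): {disagreement_score_sketch(ratings, 'OFF'):.4f}")
print(f"Disagreement (NOT): {disagreement_score_sketch(ratings, 'NOT'):.4f}")
```

Under these definitions a rare class almost always gets a high disagreement score, because only a few texts receive that label from all three raters.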


### `is_targeted`

In [61]:
is_targeted = pd.DataFrame(dataset.get_annotations(raw_texts, "is_targeted"))
is_targeted.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,TIN,UNT,TIN,TIN,UNT,UNT,UNT,TIN,TIN,UNT,...,UNT,UNT,TIN,TIN,UNT,TIN,UNT,UNT,UNT,UNT
2,TIN,TIN,TIN,UNT,TIN,TIN,UNT,TIN,TIN,TIN,...,UNT,UNT,UNT,TIN,TIN,UNT,TIN,UNT,TIN,TIN
3,TIN,TIN,TIN,TIN,TIN,TIN,UNT,TIN,UNT,TIN,...,UNT,UNT,TIN,TIN,TIN,TIN,TIN,UNT,TIN,UNT


In [62]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_targeted),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_targeted distribution")

fig.update_layout(layout)

fig.show()

In [63]:
cac = CAC(is_targeted)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: ['TIN', 'UNT'], Weights: "identity">
Percent agreement: 0.3551
Krippendorff's alpha: 0.1072
Gwet's AC1: 0.1709


In [64]:
for k, v in disagreement_by_raters(cac.ratings, "TIN").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted.")

print(f"Disagreement score (class TIN): {disagreement_score(cac.ratings, 'TIN'):.4f}")

724 texts was annotated by 1 rater(s) as targeted.
1210 texts was annotated by 2 rater(s) as targeted.
740 texts was annotated by 3 rater(s) as targeted.
Disagreement score (class TIN): 0.7233


In [65]:
for k, v in disagreement_by_raters(cac.ratings, "UNT").items():
    print(f"{v} texts was annotated by {k} rater(s) as untargeted.")

print(f"Disagreement score (class UNT): {disagreement_score(cac.ratings, 'UNT'):.4f}")

1210 texts was annotated by 1 rater(s) as untargeted.
724 texts was annotated by 2 rater(s) as untargeted.
325 texts was annotated by 3 rater(s) as untargeted.
Disagreement score (class UNT): 0.8561


### `targeted_type`

In [66]:
targeted_type = pd.DataFrame(dataset.get_annotations(raw_texts, "targeted_type"))
targeted_type.fillna(np.nan, inplace=True)
targeted_type.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,GRP,,GRP,IND,,,,OTH,OTH,,...,,,OTH,GRP,,IND,,,,
2,IND,IND,GRP,,OTH,IND,,OTH,GRP,IND,...,,,,GRP,OTH,,IND,,IND,OTH
3,IND,IND,GRP,IND,OTH,IND,,OTH,,IND,...,,,IND,IND,OTH,IND,IND,,IND,


In [67]:
fig = px.bar(
    data_frame=prepare_data_to_px(targeted_type),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="targeted_type distribution")

fig.update_layout(layout)

fig.show()

In [68]:
cac = CAC(targeted_type)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2674, Raters: 3, Categories: ['GRP', 'IND', 'OTH'], Weights: "identity">
Percent agreement: 0.1975
Krippendorff's alpha: 0.4887
Gwet's AC1: 0.6300




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [69]:
for k, v in disagreement_by_raters(cac.ratings, "IND").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to an individual.")

print(f"Disagreement score (class IND): {disagreement_score(cac.ratings, 'IND'):.4f}")

834 texts was annotated by 1 rater(s) as targeted to an individual.
661 texts was annotated by 2 rater(s) as targeted to an individual.
446 texts was annotated by 3 rater(s) as targeted to an individual.
Disagreement score (class IND): 0.7702


In [70]:
for k, v in disagreement_by_raters(cac.ratings, "GRP").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to a group.")

print(f"Disagreement score (class GRP): {disagreement_score(cac.ratings, 'GRP'):.4f}")

431 texts was annotated by 1 rater(s) as targeted to a group.
198 texts was annotated by 2 rater(s) as targeted to a group.
48 texts was annotated by 3 rater(s) as targeted to a group.
Disagreement score (class GRP): 0.9291


In [71]:
for k, v in disagreement_by_raters(cac.ratings, "OTH").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to other.")

print(f"Disagreement score (class OTH): {disagreement_score(cac.ratings, 'OTH'):.4f}")

481 texts was annotated by 1 rater(s) as targeted to other.
158 texts was annotated by 2 rater(s) as targeted to other.
34 texts was annotated by 3 rater(s) as targeted to other.
Disagreement score (class OTH): 0.9495


### `toxic_spans`

In [72]:
toxic_spans = pd.DataFrame(dataset.get_annotations(raw_texts, "toxic_spans"))
toxic_spans.head()

Unnamed: 0,1,2,3
0,"[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 5...","[43, 44, 45, 46, 47, 48]","[52, 53, 54, 55, 56, 57]"
1,"[34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 4...","[4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37,...","[5, 6, 7, 8, 9, 10, 11, 34, 35, 36, 37, 38, 39..."
2,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 195, 196, 19...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 211, ...","[211, 212, 213, 214, 215, 216, 217, 218]"
3,"[15, 16, 17, 20, 21, 22, 23, 24, 25]",[],"[15, 16, 17, 20, 21, 22, 23, 24]"
4,"[0, 1, 2, 3, 4, 5, 6, 7]","[0, 1, 2, 3, 4, 5, 6]","[0, 1, 2, 3, 4, 5, 6]"


In [73]:
task_data = []
for annotator in toxic_spans.columns:
    for item in range(len(toxic_spans)):
        temp = toxic_spans.iloc[item][annotator]
        if temp != []:
            task_data.append((
                annotator,
                item,
                frozenset(temp)
            ))

jaccard_task = AnnotationTask(distance=jaccard_distance)
masi_task = AnnotationTask(distance=masi_distance)

for task in [jaccard_task, masi_task]:
    task.load_array(task_data)
    print(f"Krippendorff's alpha using {task.distance}")
    print(f"Krippendorff's alpha: {task.alpha():.4f}", "\n")

print(f"Percent agreement: {percent_agreement(toxic_spans):.4f}")

Krippendorff's alpha using <function jaccard_distance at 0x0000020F3AE1DA60>
Krippendorff's alpha: 0.6114 

Krippendorff's alpha using <function masi_distance at 0x0000020F3AE1DAF0>
Krippendorff's alpha: 0.4427 

Percent agreement: 0.1757
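
Both distances compare the sets of character offsets that two annotators marked. The sketch below reimplements them in plain Python, following what is, to my understanding, the definition used by `nltk.metrics`; it is an illustration rather than the NLTK code. The example spans mirror row 4 of the table above, plus one disjoint pair.

```python
from typing import FrozenSet


def jaccard_distance_sketch(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def masi_distance_sketch(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity scaled by a monotonicity weight, turned into a distance."""
    union = a | b
    if not union:
        return 0.0
    inter = a & b
    if a == b:
        m = 1.0    # identical sets
    elif a <= b or b <= a:
        m = 0.67   # one set contains the other
    elif inter:
        m = 0.33   # partial overlap
    else:
        m = 0.0    # disjoint sets
    return 1 - len(inter) / len(union) * m


rater_1 = frozenset(range(0, 8))   # offsets 0..7
rater_2 = frozenset(range(0, 7))   # offsets 0..6
disjoint_a = frozenset(range(43, 49))
disjoint_b = frozenset(range(52, 58))

print(jaccard_distance_sketch(rater_1, rater_2))        # 0.125
print(round(masi_distance_sketch(rater_1, rater_2), 5))  # 0.41375
print(jaccard_distance_sketch(disjoint_a, disjoint_b))   # 1.0
```

MASI penalizes anything short of an exact match more heavily than Jaccard, which is one reason the alpha computed with MASI is lower than the one computed with Jaccard.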


In [74]:
def len_toxic_spans(toxic_spans: List[int]):
    return None if len(toxic_spans) == 0 else len(toxic_spans)

pd.DataFrame([toxic_spans[col].apply(lambda x: len_toxic_spans(x)) for col in toxic_spans.columns]).transpose().describe()

Unnamed: 0,1,2,3
count,2295.0,1957.0,1909.0
mean,15.235294,11.671947,9.720272
std,15.385697,7.847843,8.471268
min,2.0,2.0,1.0
25%,6.0,7.0,5.0
50%,10.0,9.0,7.0
75%,18.0,14.0,11.0
max,223.0,72.0,154.0


In [75]:
fig = px.bar(
    data_frame=prepare_data_to_px(pd.DataFrame([toxic_spans[col].apply(lambda x: len(x) > 0) for col in toxic_spans.columns]).transpose()),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="toxic_spans distribution")

fig.update_layout(layout)

fig.show()

### `health`

In [76]:
health = pd.DataFrame(dataset.get_annotations(raw_texts, "health"))
health.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [77]:
fig = px.bar(
    data_frame=prepare_data_to_px(health),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Health distribution")

fig.update_layout(layout)

fig.show()

In [78]:
cac = CAC(health)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9700
Krippendorff's alpha: 0.2641
Gwet's AC1: 0.9794


In [79]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as health.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

65 texts was annotated by 1 rater(s) as health.
25 texts was annotated by 2 rater(s) as health.
3 texts was annotated by 3 rater(s) as health.
Disagreement score (class True): 0.9677


### `ideology`

In [80]:
ideology = pd.DataFrame(dataset.get_annotations(raw_texts, "ideology"))
ideology.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,True,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [81]:
fig = px.bar(
    data_frame=prepare_data_to_px(ideology),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Ideology distribution")

fig.update_layout(layout)

fig.show()

In [82]:
cac = CAC(ideology)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.8670
Krippendorff's alpha: 0.4728
Gwet's AC1: 0.8934


In [83]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as ideology.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

240 texts was annotated by 1 rater(s) as ideology.
159 texts was annotated by 2 rater(s) as ideology.
92 texts was annotated by 3 rater(s) as ideology.
Disagreement score (class True): 0.8126


### `insult`

In [84]:
insult = pd.DataFrame(dataset.get_annotations(raw_texts, "insult"))
insult.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,True,False,True,True,False,True,True,True,True,False,...,False,False,True,True,True,True,True,True,True,True
2,True,True,True,False,False,True,False,True,True,False,...,False,False,False,True,True,True,True,False,True,True
3,True,True,True,True,False,True,False,True,False,True,...,False,False,True,True,True,True,True,False,True,False


In [85]:
fig = px.bar(
    data_frame=prepare_data_to_px(insult),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Insult distribution")

fig.update_layout(layout)

fig.show()

In [86]:
cac = CAC(insult)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.5488
Krippendorff's alpha: 0.3317
Gwet's AC1: 0.4531


In [87]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as insult.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

538 texts was annotated by 1 rater(s) as insult.
815 texts was annotated by 2 rater(s) as insult.
1251 texts was annotated by 3 rater(s) as insult.
Disagreement score (class True): 0.5196


### `lgbtqphobia`

In [88]:
lgbtqphobia = pd.DataFrame(dataset.get_annotations(raw_texts, "lgbtqphobia"))
lgbtqphobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [89]:
fig = px.bar(
    data_frame=prepare_data_to_px(lgbtqphobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="LGBTQphobia distribution")

fig.update_layout(layout)

fig.show()

In [90]:
cac = CAC(lgbtqphobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9613
Krippendorff's alpha: 0.6393
Gwet's AC1: 0.9722


In [91]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as lgbtqphobia.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

57 texts was annotated by 1 rater(s) as lgbtqphobia.
59 texts was annotated by 2 rater(s) as lgbtqphobia.
53 texts was annotated by 3 rater(s) as lgbtqphobia.
Disagreement score (class True): 0.6864


### `other_lifestyle`

In [92]:
other_lifestyle = pd.DataFrame(dataset.get_annotations(raw_texts, "other_lifestyle"))
other_lifestyle.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [93]:
fig = px.bar(
    data_frame=prepare_data_to_px(other_lifestyle),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Other-Lifestyle distribution")

fig.update_layout(layout)

fig.show()

In [94]:
cac = CAC(other_lifestyle)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9787
Krippendorff's alpha: 0.4683
Gwet's AC1: 0.9854


In [95]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as other_lifestyle.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

45 texts was annotated by 1 rater(s) as other_lifestyle.
19 texts was annotated by 2 rater(s) as other_lifestyle.
13 texts was annotated by 3 rater(s) as other_lifestyle.
Disagreement score (class True): 0.8312


### `physical_aspects`

In [96]:
physical_aspects = pd.DataFrame(dataset.get_annotations(raw_texts, "physical_aspects"))
physical_aspects.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [97]:
fig = px.bar(
    data_frame=prepare_data_to_px(physical_aspects),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Physical Aspects distribution")

fig.update_layout(layout)

fig.show()

In [98]:
cac = CAC(physical_aspects)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9560
Krippendorff's alpha: 0.4160
Gwet's AC1: 0.9691


In [99]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as physical_aspects.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

98 texts was annotated by 1 rater(s) as physical_aspects.
34 texts was annotated by 2 rater(s) as physical_aspects.
22 texts was annotated by 3 rater(s) as physical_aspects.
Disagreement score (class True): 0.8571


### `profanity_obscene`

In [100]:
profanity_obscene = pd.DataFrame(dataset.get_annotations(raw_texts, "profanity_obscene"))
profanity_obscene.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,True,True,False,False,True,False,False,False,False,True,...,False,False,False,False,True,False,True,True,False,False
2,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,True,False,True,True,False,False,False,False,True,...,False,False,False,False,False,False,False,False,True,False


In [101]:
fig = px.bar(
    data_frame=prepare_data_to_px(profanity_obscene),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Profanity/Obscene distribution")

fig.update_layout(layout)

fig.show()

In [102]:
cac = CAC(profanity_obscene)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.7089
Krippendorff's alpha: 0.4894
Gwet's AC1: 0.6870


In [103]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as profanity_obscene.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

402 texts was annotated by 1 rater(s) as profanity_obscene.
471 texts was annotated by 2 rater(s) as profanity_obscene.
317 texts was annotated by 3 rater(s) as profanity_obscene.
Disagreement score (class True): 0.7336


### `racism`

In [104]:
racism = pd.DataFrame(dataset.get_annotations(raw_texts, "racism"))
racism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [105]:
fig = px.bar(
    data_frame=prepare_data_to_px(racism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Racism distribution")

fig.update_layout(layout)

fig.show()

In [106]:
cac = CAC(racism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9913
Krippendorff's alpha: 0.3781
Gwet's AC1: 0.9942


In [107]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as racism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

22 texts was annotated by 1 rater(s) as racism.
4 texts was annotated by 2 rater(s) as racism.
4 texts was annotated by 3 rater(s) as racism.
Disagreement score (class True): 0.8667


### `religious_intolerance`

In [108]:
religious_intolerance = pd.DataFrame(dataset.get_annotations(raw_texts, "religious_intolerance"))
religious_intolerance.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [109]:
fig = px.bar(
    data_frame=prepare_data_to_px(religious_intolerance),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Religious intolerance distribution")

fig.update_layout(layout)

fig.show()

In [110]:
cac = CAC(religious_intolerance)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
try:
    print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
except Exception:
    print("Krippendorff's alpha: NaN")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False], Weights: "identity">
Percent agreement: 1.0000
Krippendorff's alpha: 1.0000
Gwet's AC1: 1.0000



divide by zero encountered in double_scalars


invalid value encountered in multiply



In [111]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as religious_intolerance.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

0 texts was annotated by 1 rater(s) as religious_intolerance.
0 texts was annotated by 2 rater(s) as religious_intolerance.
0 texts was annotated by 3 rater(s) as religious_intolerance.
Disagreement score (class True): 0.0000


### `sexism`

In [112]:
sexism = pd.DataFrame(dataset.get_annotations(raw_texts, "sexism"))
sexism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False


In [113]:
fig = px.bar(
    data_frame=prepare_data_to_px(sexism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Sexism distribution")

fig.update_layout(layout)

fig.show()

In [114]:
cac = CAC(sexism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9550
Krippendorff's alpha: 0.1566
Gwet's AC1: 0.9689


In [115]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as sexism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

116 texts was annotated by 1 rater(s) as sexism.
19 texts was annotated by 2 rater(s) as sexism.
3 texts was annotated by 3 rater(s) as sexism.
Disagreement score (class True): 0.9783


### `xenophobia`

In [116]:
xenophobia = pd.DataFrame(dataset.get_annotations(raw_texts, "xenophobia"))
xenophobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2989,2990,2991,2992,2993,2994,2995,2996,2997,2998
1,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [117]:
fig = px.bar(
    data_frame=prepare_data_to_px(xenophobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Xenophobia distribution")

fig.update_layout(layout)

fig.show()

In [118]:
cac = CAC(xenophobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2999, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9847
Krippendorff's alpha: 0.2980
Gwet's AC1: 0.9896


In [119]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as xenophobia.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

29 texts was annotated by 1 rater(s) as xenophobia.
17 texts was annotated by 2 rater(s) as xenophobia.
1 texts was annotated by 3 rater(s) as xenophobia.
Disagreement score (class True): 0.9787
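
For the rare labels, percent agreement and Gwet's AC1 stay close to 1 while Krippendorff's alpha is low. The sketch below recomputes the three figures from the vote counts printed above, to show where that gap comes from. It assumes every text has exactly three ratings and follows the formulation in Gwet's handbook, which is my understanding of what `irrCAC` implements. The function name `agreement_from_counts` is hypothetical, and `unanimous` is only my reading of what the project's `percent_agreement` reports.

```python
import math


def agreement_from_counts(counts_by_votes, n_texts, n_raters=3):
    """Recompute agreement figures for one binary label from vote counts.

    counts_by_votes maps "raters who marked the label" to "number of texts".
    Texts not covered by the mapping are treated as marked by nobody.
    """
    counts = {votes: n for votes, n in counts_by_votes.items() if votes > 0}
    counts[0] = n_texts - sum(counts.values())

    # Observed agreement: fraction of agreeing rater pairs, averaged over texts.
    pairs = n_raters * (n_raters - 1)
    p_obs = sum(
        n * (v * (v - 1) + (n_raters - v) * (n_raters - v - 1)) / pairs
        for v, n in counts.items()
    ) / n_texts

    # Share of all ratings that are positive.
    prevalence = sum(v * n for v, n in counts.items()) / (n_texts * n_raters)
    # For two categories, both coefficients are built from this one quantity.
    mix = 2 * prevalence * (1 - prevalence)

    # AC1 treats `mix` as chance agreement: tiny when one class dominates.
    ac1 = (p_obs - mix) / (1 - mix)

    # Alpha treats `1 - mix` as chance agreement: close to 1 when one class dominates.
    if mix == 0:
        alpha = math.nan  # only one category observed, the ratio is 0 / 0
    else:
        eps = 1 / (n_texts * n_raters)  # small-sample correction
        p_obs_adj = (1 - eps) * p_obs + eps
        alpha = (p_obs_adj - (1 - mix)) / mix

    unanimous = (counts[0] + counts.get(n_raters, 0)) / n_texts
    return {"unanimous": unanimous, "ac1": ac1, "alpha": alpha}


# Vote counts copied from the outputs above.
printed_counts = {
    "racism": {1: 22, 2: 4, 3: 4},
    "sexism": {1: 116, 2: 19, 3: 3},
    "xenophobia": {1: 29, 2: 17, 3: 1},
}
results = {
    label: agreement_from_counts(counts, n_texts=2999)
    for label, counts in printed_counts.items()
}
for label, figures in results.items():
    print(label, {name: round(value, 4) for name, value in figures.items()})
```

With a positive class this rare, alpha divides a very small margin above chance by a very small denominator, so a few dozen split votes are enough to pull it down. The `mix == 0` branch corresponds to the `religious_intolerance` case, where only one category was observed.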


### Krippendorff's alpha Multi-Label

In the next cells, we will calculate Krippendorff's alpha by treating the annotation as a single multi-label problem instead of several binary problems.

In [164]:
ratings = {
    "health": health,
    "ideology": ideology,
    "insult": insult,
    "lgbtqphobia": lgbtqphobia,
    "other_lifestyle": other_lifestyle,
    "physical_aspects": physical_aspects,
    "profanity_obscene": profanity_obscene,
    "racism": racism,
    "religious_intolerance": religious_intolerance,
    "sexism": sexism,
    "xenophobia": xenophobia
}

task_data = []
for annotator in health.columns.tolist():
    for item in range(len(health)):
        temp = get_annotations_by_rater(ratings, annotator, item)
        if temp != []:
            task_data.append((
                annotator,
                item,
                frozenset(temp)
            ))

jaccard_task = AnnotationTask(distance=jaccard_distance)
masi_task = AnnotationTask(distance=masi_distance)

for task in [jaccard_task, masi_task]:
    task.load_array(task_data)
    print(f"Krippendorff's alpha using {task.distance}")
    print(f"Krippendorff's alpha: {task.alpha():.4f}", "\n")

pa_mlabels = {}
for item in range(len(health)):
    for annotator in health.columns.tolist():
        temp = get_annotations_by_rater(ratings, annotator, item)
        
        # Collect each rater's label list, one entry per text
        pa_mlabels.setdefault(annotator, []).append(temp)

print(f"Percent agreement: {percent_agreement(pd.DataFrame(pa_mlabels)):.4f}")

Krippendorff's alpha using <function jaccard_distance at 0x0000020F3AE1DA60>
Krippendorff's alpha: 0.5070 

Krippendorff's alpha using <function masi_distance at 0x0000020F3AE1DAF0>
Krippendorff's alpha: 0.4653 

Percent agreement: 0.2758
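
The two alpha values differ because the distance functions score partially overlapping label sets differently. The sketch below re-implements both distances in plain Python to make that visible. The functions `jaccard` and `masi` are hypothetical stand-ins, not the NLTK functions themselves; the MASI weights 0.67 and 0.33 are the constants I recall NLTK using, and returning 0.0 for two empty sets is my own choice.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """1 - |A and B| / |A or B|: any overlap reduces the distance."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1 - len(a & b) / len(union)


def masi(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity scaled by a weight for the kind of overlap."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    intersection = a & b
    if a == b:
        weight = 1.0   # identical sets
    elif a <= b or b <= a:
        weight = 0.67  # one set contains the other
    elif intersection:
        weight = 0.33  # overlap, but neither contains the other
    else:
        weight = 0.0   # disjoint sets
    return 1 - len(intersection) / len(union) * weight


rater_1 = frozenset({"insult", "profanity_obscene"})
rater_2 = frozenset({"insult"})
rater_3 = frozenset({"insult", "sexism"})

for left, right in [(rater_1, rater_2), (rater_1, rater_3)]:
    print(sorted(left), sorted(right),
          f"jaccard={jaccard(left, right):.3f}", f"masi={masi(left, right):.3f}")
```

MASI never gives a smaller distance than Jaccard for the same pair, so partial agreement earns less credit under it. This is in line with the lower alpha reported for `masi_distance`.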


## Label Assignment

In this section, we will define the label assignment strategy and assign labels to the texts.

The possible label assignment strategies are:

- **Majority Vote**: assign the label with the highest frequency.
- **At least one**: assign the label if at least one annotator marked it as true.
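
The two strategies can be sketched as small functions over the votes of one text. The names `majority_vote_sketch` and `at_least_one_sketch` are hypothetical; the project's own `majority_vote` and `at_least_one` in `src.labeling.assignment` may differ, in particular in how ties are handled.

```python
from collections import Counter


def majority_vote_sketch(votes):
    """Return the most frequent vote.

    Assumption: when counts are tied, the value seen first wins.
    """
    return Counter(votes).most_common(1)[0][0]


def at_least_one_sketch(votes):
    """Return True if any annotator marked the label as true."""
    return any(votes)


# Three annotators rated the same text.
is_offensive_votes = ["OFF", "NOT", "OFF"]
racism_votes = [False, True, False]

print(majority_vote_sketch(is_offensive_votes))
print(at_least_one_sketch(racism_votes))
print(majority_vote_sketch(racism_votes))
```

The choice matters most for rare labels: in the `racism` counts reported earlier, 22 of the 30 flagged texts were flagged by a single rater, so a majority vote would keep only 8 of them.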

### Strategy per feature

We will have a label assignment strategy for each feature.

The `LabelStrategy` object maps each feature to the function that implements its selected label assignment strategy.

In [121]:
label_strategy = LabelStrategy(
    is_offensive=majority_vote,
    is_targeted=majority_vote,
    targeted_type=majority_vote,
    toxic_spans=all_labeled_spans,
    health=at_least_one,
    ideology=at_least_one,
    insult=at_least_one, # majority_vote
    lgbtqphobia=at_least_one,
    other_lifestyle=at_least_one,
    physical_aspects=at_least_one,
    profanity_obscene=at_least_one,
    racism=at_least_one,
    religious_intolerance=at_least_one,
    sexism=at_least_one,
    xenophobia=at_least_one
)

processed_texts, metadata = dataset.get_processed_texts(
    raw=[i for i in dataset.get_raw_texts(data) if len(i.annotations) == 3],
    label_strategy=label_strategy
)

processed_texts = [i.dict() for i in processed_texts]
metadata = [i.dict() for i in metadata]

## Create DataFrames

In the next cells, we will create Pandas DataFrames for the dataset and the metadata.

In [122]:
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (2999, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,cff9ea01e0344678aed34311592c9ee2,"""vc é minha vida bela"" vc não tem vida seu pei...",OFF,TIN,IND,"[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 5...",False,False,True,False,False,False,True,False,False,False,False
1,c6ca745f0ec54da38bbcae923ba635fd,USER Caralho mano eu ia mandar um vai tomar no...,OFF,TIN,IND,"[4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37,...",False,False,True,False,False,False,True,False,False,False,False
2,22ece9d156444dd2ab6e016d41d56216,Os ignorantes nos comentários querendo dar des...,OFF,TIN,GRP,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 195, ...",False,True,True,False,False,False,False,False,False,False,False
3,d9f846d5bd98409b84ce81c0ac6157a3,"rafael é muito fdp, praga de garoto...",OFF,TIN,IND,"[15, 16, 17, 20, 21, 22, 23, 24, 25]",False,False,True,False,False,False,True,False,False,False,False
4,6d96143ab90e4c78b272c6689fd6b14f,PUTARIA 🔞 URL,OFF,TIN,OTH,"[0, 1, 2, 3, 4, 5, 6, 7]",False,False,False,False,False,False,True,False,False,False,False


In [123]:
df_metadata = pd.DataFrame(metadata)

print(f"Shape: {df_metadata.shape}")
df_metadata.head()

Shape: (11996, 11)


Unnamed: 0,id,source,created_at,collected_at,toxicity_score,category,annotator_id,gender,year_of_birth,education_level,annotator_type
0,cff9ea01e0344678aed34311592c9ee2,YouTube,2013-05-14 17:01:53,2022-04-08 08:03:44.134767,0.8605,,,,,,
1,cff9ea01e0344678aed34311592c9ee2,,,NaT,,,126.0,Male,1997.0,High school,Contract worker
2,cff9ea01e0344678aed34311592c9ee2,,,NaT,,,127.0,Female,1975.0,Master's degree,Contract worker
3,cff9ea01e0344678aed34311592c9ee2,,,NaT,,,260.0,Female,2001.0,High school,Contract worker
4,c6ca745f0ec54da38bbcae923ba635fd,Twitter,2022-03-27 01:43:01+00:00,2022-03-27 00:52:43.398462,0.9389,,,,,,


## Validate data

In this section, we will apply some simple validation to guarantee that the data is correct.

Remove duplicate texts.

In [124]:
df.drop_duplicates(subset=["text"], inplace=True)

print(f"Shape: {df.shape}")

Shape: (2999, 17)


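Remove unintelligible texts: the manually listed entries in `invalid_texts` and any text flagged by `check_words` for the `USER`, `HASHTAG`, and `URL` placeholders.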

In [157]:
invalid_texts = [
    "Yo que me iba a dormir y ese marica de chris dizque mine a dar un rose ni modo 🫠🫡",
    "USER Traigo todo el kit pero ni así ando al 100 \U0001fae0 maldita vejez jajajaja"
]

processed_texts = []

for text in df.to_dict(orient="records"):
    if text["text"] not in invalid_texts and not check_words(text["text"], ["USER", "HASHTAG", "URL"]):
        processed_texts.append(text)

print(f"Count: {len(processed_texts)}")

Count: 2987


Rebuild dataframe from the cleaned data.

In [158]:
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (2987, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,cff9ea01e0344678aed34311592c9ee2,"""vc é minha vida bela"" vc não tem vida seu pei...",OFF,TIN,IND,"[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 5...",False,False,True,False,False,False,True,False,False,False,False
1,c6ca745f0ec54da38bbcae923ba635fd,USER Caralho mano eu ia mandar um vai tomar no...,OFF,TIN,IND,"[4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37,...",False,False,True,False,False,False,True,False,False,False,False
2,22ece9d156444dd2ab6e016d41d56216,Os ignorantes nos comentários querendo dar des...,OFF,TIN,GRP,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 195, ...",False,True,True,False,False,False,False,False,False,False,False
3,d9f846d5bd98409b84ce81c0ac6157a3,"rafael é muito fdp, praga de garoto...",OFF,TIN,IND,"[15, 16, 17, 20, 21, 22, 23, 24, 25]",False,False,True,False,False,False,True,False,False,False,False
4,6d96143ab90e4c78b272c6689fd6b14f,PUTARIA 🔞 URL,OFF,TIN,OTH,"[0, 1, 2, 3, 4, 5, 6, 7]",False,False,False,False,False,False,True,False,False,False,False


In [159]:
metadata = dict_serialize_date(
    data=[i.dict() if isinstance(i, Metadata) else i for i in metadata],
    keys=["created_at", "collected_at"])

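# Remove metadata for texts dropped during validation.
# Build the id set once instead of rebuilding a list for every record.
valid_ids = set(df["id"])
metadata = [i for i in metadata if i["id"] in valid_ids]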

print(f"Count: {len(metadata)}")

Count: 11948


## Profiling Report

We will generate a profiling report that provides some statistics about the data.

In [160]:
profile = ProfileReport(
    df, title="OLID-BR Pilot 3",
    explorative=True)

profile.to_file("../../docs/reports/olidbr_pilot_3.html")


## Upload data to S3

In this section, we will save the dataset to the S3 bucket in CSV and JSON formats.

Saving in CSV format.

In [161]:
bucket.upload_csv(
    data=df,
    key="processed/olid-br/iterations/3/olidbr.csv")

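# Upload only the metadata rows whose texts survived validation,
# so the CSV matches the filtered JSON metadata
bucket.upload_csv(
    data=df_metadata[df_metadata["id"].isin(df["id"])],
    key="processed/olid-br/iterations/3/metadata.csv")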

print("CSV Files uploaded.")

CSV Files uploaded.


Saving in JSON format.

In [162]:
bucket.upload_json(
    data=processed_texts,
    key="processed/olid-br/iterations/3/olidbr.json")

bucket.upload_json(
    data=metadata,
    key="processed/olid-br/iterations/3/metadata.json")

print("JSON Files uploaded.")

JSON Files uploaded.
