# OLID-BR - Iteration 4

In this notebook, we will read the annotated data from an S3 bucket, build OLID-BR dataset and save it to an S3 bucket in JSON and CSV formats.

The annotated data is stored in the Label Studio JSON format. See [Label Studio Documentation — Export Annotations](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks) for more details.

## Imports

In [1]:
import sys
from pathlib import Path

if str(Path(".").absolute().parent) not in sys.path:
    sys.path.append(str(Path(".").absolute().parent.parent))

In [2]:
from dotenv import load_dotenv

# Initialize the env vars
load_dotenv("../../.env")

True

In [3]:
import datetime
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from typing import List

from irrCAC.raw import CAC
from src.data_classes import Annotator, LabelStrategy, Metadata
from src.dataset import Dataset
from src.labeling.assignment import majority_vote, at_least_one, all_labeled_spans
from src.labeling.metrics import (
    percent_agreement,
    disagreement_by_raters,
    disagreement_score
)

from src.s3 import Bucket
from src.settings import AppSettings
from src.utils import (
    read_yaml,
    check_words,
    prepare_data_to_px,
    dict_serialize_date,
    get_lead_time,
    get_annotations_by_rater
)

import nltk
from nltk.metrics import agreement
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import masi_distance, jaccard_distance

# Plotly
import plotly.express as px
import plotly.io as pio
from plotly.graph_objs import Layout

pio.templates.default = "plotly_dark"

layout = Layout(
    xaxis={
        "type": "category",
        "showgrid": False,
        "zeroline": False,
    },
    yaxis={
        "showgrid": False,
        "zeroline": False
    },
    paper_bgcolor="rgba(0,0,0,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    font={"color": "rgb(180,180,180)"},
)

args = AppSettings()

## Load data

In the next cells, we will read the labeled data from the S3 bucket and concatenate all annotations into a single base.

In [4]:
bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)

In [5]:
files = [
    "raw/labeled/phase4/olid-br-4-2.json",
    "raw/labeled/phase4/olid-br-4-2-1.json",
    "raw/labeled/phase4/olid-br-4-33.json",
    "raw/labeled/phase4/olid-br-4-33-1.json",
    "raw/labeled/phase4/olid-br-4-41.json",
    "raw/labeled/phase4/olid-br-4-41-1.json"
]

As we have each annotator data in a separate file, we will need to concatenate all annotations into a single base.

In [6]:
data = {}

for file in files:
    print(f"Reading {file}")
    temp = bucket.download_json(key=file)

    lead_time = get_lead_time(temp)
    print(f"{file} >> Mean: {np.mean(lead_time):.0f}s | Std: {np.std(lead_time):.0f}s")

    for row in temp:
        # Due a bug in the database, the id for annotator 504 was changed to 2
        if row["annotations"][0]["completed_by"] == 2:
            for annotation in row["annotations"]:
                annotation["completed_by"] = 504

        if row["data"]["text"] not in data.keys():
            data[row["data"]["text"]] = row
        else:
            data[row["data"]["text"]]["annotations"].extend(row["annotations"])
    
    print()

data = [v for _, v in data.items()]

print(f"Count: {len(data)}")

Reading raw/labeled/phase4/olid-br-4-2.json
raw/labeled/phase4/olid-br-4-2.json >> Mean: 114s | Std: 658s

Reading raw/labeled/phase4/olid-br-4-2-1.json
raw/labeled/phase4/olid-br-4-2-1.json >> Mean: 70s | Std: 208s

Reading raw/labeled/phase4/olid-br-4-33.json
raw/labeled/phase4/olid-br-4-33.json >> Mean: 81s | Std: 1776s

Reading raw/labeled/phase4/olid-br-4-33-1.json
raw/labeled/phase4/olid-br-4-33-1.json >> Mean: 56s | Std: 387s

Reading raw/labeled/phase4/olid-br-4-41.json
raw/labeled/phase4/olid-br-4-41.json >> Mean: 140s | Std: 1395s

Reading raw/labeled/phase4/olid-br-4-41-1.json
raw/labeled/phase4/olid-br-4-41-1.json >> Mean: 1025s | Std: 4457s

Count: 6010


## Fixing errors in the data

In this iteration, we have some errors in the data that we need to fix.

In the next cell, we will count how many annotations we have for each item and who has annotated each item.

In [7]:
from typing import Any, Dict

def get_annotation_count(data: List[Any]) -> Dict[str, Any]:
    """Returns a dictionary with the number of annotations per text.

    Args:
    - data: A list of dictionaries with the data of the dataset.

    Returns:
    - A dictionary with the number of annotations per text.
    """
    annotations_count = {}
    iteration_annotators = []

    for item in data:
        for annotation in item["annotations"]:
            if annotation["completed_by"] not in iteration_annotators:
                iteration_annotators.append(annotation["completed_by"])

        count = len(item["annotations"])
        if count not in annotations_count.keys():
            annotations_count[count] = 1
        else:
            annotations_count[count] += 1
    return {
        "Annotators": iteration_annotators,
        "Count": annotations_count
    }

def remap_annotators(data: List[Any], annotators_map: Dict[int, int]) -> List[Any]:
    """Remaps the annotators in the dataset.

    Args:
    - data: A list of dictionaries with the data of the dataset.
    - annotators_map: A dictionary with the old annotator id as key and the new annotator id as value.
    """
    for item in data:
        for annotation in item["annotations"]:
            if annotation["completed_by"] in annotators_map.keys():
                annotation["completed_by"] = annotators_map[annotation["completed_by"]]
    return data

annotators_map = {
    2: 504,
    33: 260,
    41: 127
}

data = remap_annotators(data, annotators_map)

for k, v in get_annotation_count(data).items():
    print(f"{k}: {v}")

Annotators: [504, 260, 127]
Count: {3: 1747, 4: 11, 1: 4171, 2: 81}


In the next cell, we will remove annotations that do not have a valid result.

In [8]:
def remove_null_annotations(data: List[Any]) -> List[Any]:
    """Remove null annotations from a list of annotations.

    Args:
    - data: A list of dictionaries with the data of the dataset.

    Returns:
    - A list of dictionaries with the data of the dataset.
    """
    counter = 0
    for item in data:
        annotators = []
        for annotation in item["annotations"]:
            if len(annotation["result"]) == 0:
                item["annotations"].remove(annotation)
                counter += 1

            if annotation["completed_by"] not in annotators:
                annotators.append(annotation["completed_by"])
            else:
                item["annotations"].remove(annotation)
                counter += 1

    print(f"Removed {counter} null annotations.")
    return data

data = remove_null_annotations(data)

print(f"Count: {len(data)}")
for k, v in get_annotation_count(data).items():
    print(f"{k}: {v}")

Removed 32 null annotations.
Count: 6010
Annotators: [504, 260, 127]
Count: {3: 1758, 1: 4192, 2: 60}


## Load annotators

In the next cells, we will read the annotators data and create a list with all annotators objects.

It will be used to add the annotations as a metadata for each text.

In [9]:
annotators = read_yaml("../../properties/annotators.yaml")
annotators = [Annotator(**a) for a in annotators]
annotators

# Filter out the annotators that are not present in the data
annotators = [a for a in annotators if a.annotator_id in get_annotation_count(data)["Annotators"]]
annotators

[Annotator(id=None, annotator_id=127, gender='Female', year_of_birth=1975, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=260, gender='Female', year_of_birth=2001, education_level='High school', annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=504, gender='Female', year_of_birth=1999, education_level='High school', annotator_type='Contract worker')]

## Build dataset

In [10]:
dataset = Dataset(
    annotators=annotators,
    toxicity_threshold=args.PERSPECTIVE_THRESHOLD
)

raw_texts = dataset.get_raw_texts(data)

We will filter only texts with all three annotators.

In [11]:
raw_texts = [text for text in raw_texts if len(text.annotations) == 3]

print(f"{len(raw_texts)} raw texts with 3 annotations.")

1758 raw texts with 3 annotations.


## Inter-Rater Reliability (IRR) analysis

a.k.a inter-rater agreement (IRA) or concordance.

In the next cells, we will perform an agreement analysis to check if the annotations are consistent.

See [Inter-Rater Reliability - OLID-BR](https://dougtrajano.github.io/olid-br/annotation/inter-rater-reliability.html) for more details.

### `is_offensive`

In [12]:
raw_texts = [text for text in raw_texts if len(text.annotations) == 3]

In [13]:
is_offensive = pd.DataFrame(dataset.get_annotations(raw_texts, "is_offensive"))
is_offensive.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,OFF,OFF,...,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,OFF
260,OFF,OFF,NOT,OFF,NOT,NOT,NOT,NOT,OFF,NOT,...,OFF,NOT,NOT,OFF,OFF,NOT,OFF,OFF,NOT,NOT
127,OFF,OFF,OFF,OFF,NOT,NOT,OFF,OFF,OFF,OFF,...,OFF,OFF,NOT,NOT,OFF,NOT,OFF,OFF,NOT,OFF


In [14]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_offensive),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_offensive distribution")

fig.update_layout(layout)

fig.show()

In [15]:
cac = CAC(is_offensive)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: ['NOT', 'OFF'], Weights: "identity">
Percent agreement: 0.5830
Krippendorff's alpha: 0.2167
Gwet's AC1: 0.5692


In [16]:
for k, v in disagreement_by_raters(cac.ratings, "OFF").items():
    print(f"{v} texts was annotated by {k} rater(s) as offensive.")

print(f"Disagreement score (class OFF): {disagreement_score(cac.ratings, 'OFF'):.4f}")

171 texts was annotated by 1 rater(s) as offensive.
562 texts was annotated by 2 rater(s) as offensive.
921 texts was annotated by 3 rater(s) as offensive.
Disagreement score (class OFF): 0.4432


In [17]:
for k, v in disagreement_by_raters(cac.ratings, "NOT").items():
    print(f"{v} texts was annotated by {k} rater(s) as non-offensive.")

print(f"Disagreement score (class NOT): {disagreement_score(cac.ratings, 'NOT'):.4f}")

562 texts was annotated by 1 rater(s) as non-offensive.
171 texts was annotated by 2 rater(s) as non-offensive.
104 texts was annotated by 3 rater(s) as non-offensive.
Disagreement score (class NOT): 0.8757


### `is_targeted`

In [18]:
is_targeted = pd.DataFrame(dataset.get_annotations(raw_texts, "is_targeted"))
is_targeted.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,TIN,UNT,TIN,UNT,UNT,UNT,TIN,UNT,TIN,UNT,...,TIN,UNT,TIN,UNT,TIN,UNT,UNT,UNT,UNT,UNT
260,TIN,TIN,UNT,TIN,UNT,UNT,UNT,UNT,TIN,UNT,...,TIN,UNT,UNT,UNT,TIN,UNT,TIN,TIN,UNT,UNT
127,TIN,TIN,TIN,TIN,UNT,UNT,TIN,TIN,TIN,TIN,...,TIN,TIN,UNT,UNT,TIN,UNT,TIN,TIN,UNT,TIN


In [19]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_targeted),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_targeted distribution")

fig.update_layout(layout)

fig.show()

In [20]:
cac = CAC(is_targeted)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: ['TIN', 'UNT'], Weights: "identity">
Percent agreement: 0.4391
Krippendorff's alpha: 0.1977
Gwet's AC1: 0.2998


In [21]:
for k, v in disagreement_by_raters(cac.ratings, "TIN").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted.")

print(f"Disagreement score (class TIN): {disagreement_score(cac.ratings, 'TIN'):.4f}")

375 texts was annotated by 1 rater(s) as targeted.
611 texts was annotated by 2 rater(s) as targeted.
576 texts was annotated by 3 rater(s) as targeted.
Disagreement score (class TIN): 0.6312


In [22]:
for k, v in disagreement_by_raters(cac.ratings, "UNT").items():
    print(f"{v} texts was annotated by {k} rater(s) as untargeted.")

print(f"Disagreement score (class UNT): {disagreement_score(cac.ratings, 'UNT'):.4f}")

611 texts was annotated by 1 rater(s) as untargeted.
375 texts was annotated by 2 rater(s) as untargeted.
196 texts was annotated by 3 rater(s) as untargeted.
Disagreement score (class UNT): 0.8342


### `targeted_type`

In [23]:
targeted_type = pd.DataFrame(dataset.get_annotations(raw_texts, "targeted_type"))
targeted_type.fillna(np.nan, inplace=True)
targeted_type.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,GRP,,OTH,,,,IND,,GRP,,...,OTH,,IND,,IND,,,,,
260,GRP,IND,,GRP,,,,,GRP,,...,OTH,,,,IND,,IND,GRP,,
127,GRP,IND,GRP,GRP,,,IND,GRP,GRP,IND,...,OTH,GRP,,,IND,,GRP,GRP,,GRP


In [24]:
fig = px.bar(
    data_frame=prepare_data_to_px(targeted_type),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="targeted_type distribution")

fig.update_layout(layout)

fig.show()

In [25]:
cac = CAC(targeted_type)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1562, Raters: 3, Categories: ['GRP', 'IND', 'OTH'], Weights: "identity">
Percent agreement: 0.2318
Krippendorff's alpha: 0.4934
Gwet's AC1: 0.5796




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [26]:
for k, v in disagreement_by_raters(cac.ratings, "IND").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to an individual.")

print(f"Disagreement score (class IND): {disagreement_score(cac.ratings, 'IND'):.4f}")

362 texts was annotated by 1 rater(s) as targeted to an individual.
337 texts was annotated by 2 rater(s) as targeted to an individual.
291 texts was annotated by 3 rater(s) as targeted to an individual.
Disagreement score (class IND): 0.7061


In [27]:
for k, v in disagreement_by_raters(cac.ratings, "GRP").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to a group.")

print(f"Disagreement score (class GRP): {disagreement_score(cac.ratings, 'GRP'):.4f}")

308 texts was annotated by 1 rater(s) as targeted to a group.
147 texts was annotated by 2 rater(s) as targeted to a group.
55 texts was annotated by 3 rater(s) as targeted to a group.
Disagreement score (class GRP): 0.8922


In [28]:
for k, v in disagreement_by_raters(cac.ratings, "OTH").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to other.")

print(f"Disagreement score (class OTH): {disagreement_score(cac.ratings, 'OTH'):.4f}")

353 texts was annotated by 1 rater(s) as targeted to other.
124 texts was annotated by 2 rater(s) as targeted to other.
16 texts was annotated by 3 rater(s) as targeted to other.
Disagreement score (class OTH): 0.9675


### `toxic_spans`

In [29]:
toxic_spans = pd.DataFrame(dataset.get_annotations(raw_texts, "toxic_spans"))
toxic_spans.head()

Unnamed: 0,504,260,127
0,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[170, 171, 172, 173, 174]"
1,"[0, 1, 2, 3, 4, 5]","[0, 1, 2, 3, 4, 5, 6]","[0, 1, 2, 3, 4, 5, 6]"
2,"[28, 29, 30, 31]",[],"[27, 28, 29, 30, 31, 32, 33, 82, 83, 84, 85, 8..."
3,"[51, 52, 53, 54, 55, 56, 57, 71, 72, 73]","[51, 52, 53, 54, 55, 56, 57, 58]","[54, 55, 56, 57, 58, 59, 60]"
4,[],[],[]


In [30]:
task_data = []
for annotator in toxic_spans.columns:
    for item in range(len(toxic_spans)):
        temp = toxic_spans.iloc[item][annotator]
        if temp != []:
            task_data.append((
                annotator,
                item,
                frozenset(temp)
            ))

jaccard_task = AnnotationTask(distance=jaccard_distance)
masi_task = AnnotationTask(distance=masi_distance)

for task in [jaccard_task, masi_task]:
    task.load_array(task_data)
    print(f"Krippendorff's alpha using {task.distance}")
    print(f"Krippendorff's alpha: {task.alpha():.4f}", "\n")

print(f"Percent agreement: {percent_agreement(toxic_spans):.4f}")

Krippendorff's alpha using <function jaccard_distance at 0x00000276374CC4C0>
Krippendorff's alpha: 0.6419 

Krippendorff's alpha using <function masi_distance at 0x00000276374CC550>
Krippendorff's alpha: 0.4773 

Percent agreement: 0.2270


In [31]:
def len_toxic_spans(toxic_spans: List[int]):
    return None if len(toxic_spans) == 0 else len(toxic_spans)

pd.DataFrame([toxic_spans[col].apply(lambda x: len_toxic_spans(x)) for col in toxic_spans.columns]).transpose().describe()

Unnamed: 0,504,260,127
count,1299.0,1005.0,865.0
mean,13.422633,9.01791,10.764162
std,11.953395,6.066175,6.892015
min,1.0,2.0,2.0
25%,6.0,5.0,6.0
50%,10.0,7.0,9.0
75%,17.0,11.0,13.0
max,144.0,61.0,71.0


In [32]:
fig = px.bar(
    data_frame=prepare_data_to_px(pd.DataFrame([toxic_spans[col].apply(lambda x: len(x) > 0) for col in toxic_spans.columns]).transpose()),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="toxic_spans distribution")

fig.update_layout(layout)

fig.show()

### `health`

In [33]:
health = pd.DataFrame(dataset.get_annotations(raw_texts, "health"))
health.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [34]:
fig = px.bar(
    data_frame=prepare_data_to_px(health),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Health distribution")

fig.update_layout(layout)

fig.show()

In [35]:
cac = CAC(health)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9795
Krippendorff's alpha: 0.1152
Gwet's AC1: 0.9861


In [36]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as health.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

31 texts was annotated by 1 rater(s) as health.
5 texts was annotated by 2 rater(s) as health.
0 texts was annotated by 3 rater(s) as health.
Disagreement score (class True): 1.0000


### `ideology`

In [37]:
ideology = pd.DataFrame(dataset.get_annotations(raw_texts, "ideology"))
ideology.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False


In [38]:
fig = px.bar(
    data_frame=prepare_data_to_px(ideology),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Ideology distribution")

fig.update_layout(layout)

fig.show()

In [39]:
cac = CAC(ideology)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.8527
Krippendorff's alpha: 0.2930
Gwet's AC1: 0.8859


In [40]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as ideology.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

158 texts was annotated by 1 rater(s) as ideology.
101 texts was annotated by 2 rater(s) as ideology.
12 texts was annotated by 3 rater(s) as ideology.
Disagreement score (class True): 0.9557


### `insult`

In [41]:
insult = pd.DataFrame(dataset.get_annotations(raw_texts, "insult"))
insult.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,True,True,False,False,True,True,False,False,...,True,False,True,False,True,False,True,True,False,True
260,True,True,False,True,False,False,False,False,True,False,...,True,False,False,False,True,False,True,True,False,False
127,True,True,True,True,False,False,True,True,True,True,...,True,True,False,False,True,False,True,True,False,True


In [42]:
fig = px.bar(
    data_frame=prepare_data_to_px(insult),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Insult distribution")

fig.update_layout(layout)

fig.show()

In [43]:
cac = CAC(insult)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.5028
Krippendorff's alpha: 0.3057
Gwet's AC1: 0.3659


In [44]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as insult.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

349 texts was annotated by 1 rater(s) as insult.
525 texts was annotated by 2 rater(s) as insult.
600 texts was annotated by 3 rater(s) as insult.
Disagreement score (class True): 0.5929


### `lgbtqphobia`

In [45]:
lgbtqphobia = pd.DataFrame(dataset.get_annotations(raw_texts, "lgbtqphobia"))
lgbtqphobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,False,True,False,...,False,True,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False


In [46]:
fig = px.bar(
    data_frame=prepare_data_to_px(lgbtqphobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="LGBTQphobia distribution")

fig.update_layout(layout)

fig.show()

In [47]:
cac = CAC(lgbtqphobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9539
Krippendorff's alpha: 0.5021
Gwet's AC1: 0.9673


In [48]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as lgbtqphobia.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

36 texts was annotated by 1 rater(s) as lgbtqphobia.
45 texts was annotated by 2 rater(s) as lgbtqphobia.
14 texts was annotated by 3 rater(s) as lgbtqphobia.
Disagreement score (class True): 0.8526


### `other_lifestyle`

In [49]:
other_lifestyle = pd.DataFrame(dataset.get_annotations(raw_texts, "other_lifestyle"))
other_lifestyle.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [50]:
fig = px.bar(
    data_frame=prepare_data_to_px(other_lifestyle),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Other-Lifestyle distribution")

fig.update_layout(layout)

fig.show()

In [51]:
cac = CAC(other_lifestyle)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9704
Krippendorff's alpha: 0.2255
Gwet's AC1: 0.9798


In [52]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as other_lifestyle.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

39 texts was annotated by 1 rater(s) as other_lifestyle.
13 texts was annotated by 2 rater(s) as other_lifestyle.
1 texts was annotated by 3 rater(s) as other_lifestyle.
Disagreement score (class True): 0.9811


### `physical_aspects`

In [53]:
physical_aspects = pd.DataFrame(dataset.get_annotations(raw_texts, "physical_aspects"))
physical_aspects.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False


In [54]:
fig = px.bar(
    data_frame=prepare_data_to_px(physical_aspects),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Physical Aspects distribution")

fig.update_layout(layout)

fig.show()

In [55]:
cac = CAC(physical_aspects)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9545
Krippendorff's alpha: 0.3289
Gwet's AC1: 0.9682


In [56]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as physical_aspects.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

59 texts was annotated by 1 rater(s) as physical_aspects.
21 texts was annotated by 2 rater(s) as physical_aspects.
7 texts was annotated by 3 rater(s) as physical_aspects.
Disagreement score (class True): 0.9195


### `profanity_obscene`

In [57]:
profanity_obscene = pd.DataFrame(dataset.get_annotations(raw_texts, "profanity_obscene"))
profanity_obscene.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,True,True,False,True,False,False,False,False,False,False,...,False,False,False,False,True,False,False,True,True,False
260,True,True,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
127,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [58]:
fig = px.bar(
    data_frame=prepare_data_to_px(profanity_obscene),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Profanity/Obscene distribution")

fig.update_layout(layout)

fig.show()

In [59]:
cac = CAC(profanity_obscene)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.7452
Krippendorff's alpha: 0.5534
Gwet's AC1: 0.7258


In [60]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as profanity_obscene.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

206 texts was annotated by 1 rater(s) as profanity_obscene.
242 texts was annotated by 2 rater(s) as profanity_obscene.
219 texts was annotated by 3 rater(s) as profanity_obscene.
Disagreement score (class True): 0.6717


### `racism`

In [61]:
racism = pd.DataFrame(dataset.get_annotations(raw_texts, "racism"))
racism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [62]:
fig = px.bar(
    data_frame=prepare_data_to_px(racism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Racism distribution")

fig.update_layout(layout)

fig.show()

In [63]:
cac = CAC(racism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9937
Krippendorff's alpha: 0.2647
Gwet's AC1: 0.9958


In [64]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as racism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

10 texts was annotated by 1 rater(s) as racism.
1 texts was annotated by 2 rater(s) as racism.
1 texts was annotated by 3 rater(s) as racism.
Disagreement score (class True): 0.9167


### `religious_intolerance`

In [65]:
religious_intolerance = pd.DataFrame(dataset.get_annotations(raw_texts, "religious_intolerance"))
religious_intolerance.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [66]:
fig = px.bar(
    data_frame=prepare_data_to_px(religious_intolerance),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Religious intolerance distribution")

fig.update_layout(layout)

fig.show()

In [67]:
cac = CAC(religious_intolerance)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
try:
    print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
except:
    print("Krippendorff's alpha: NaN")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False], Weights: "identity">
Percent agreement: 1.0000
Krippendorff's alpha: 1.0000
Gwet's AC1: 1.0000



divide by zero encountered in double_scalars


invalid value encountered in multiply


invalid value encountered in multiply


divide by zero encountered in double_scalars


divide by zero encountered in double_scalars


invalid value encountered in multiply



In [68]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as religious_intolerance.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

0 texts was annotated by 1 rater(s) as religious_intolerance.
0 texts was annotated by 2 rater(s) as religious_intolerance.
0 texts was annotated by 3 rater(s) as religious_intolerance.
Disagreement score (class True): 0.0000


### `sexism`

In [69]:
sexism = pd.DataFrame(dataset.get_annotations(raw_texts, "sexism"))
sexism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [70]:
fig = px.bar(
    data_frame=prepare_data_to_px(sexism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Sexism distribution")

fig.update_layout(layout)

fig.show()

In [71]:
cac = CAC(sexism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9630
Krippendorff's alpha: 0.1950
Gwet's AC1: 0.9746


In [72]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as sexism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

48 texts was annotated by 1 rater(s) as sexism.
17 texts was annotated by 2 rater(s) as sexism.
0 texts was annotated by 3 rater(s) as sexism.
Disagreement score (class True): 1.0000


### `xenophobia`

In [73]:
xenophobia = pd.DataFrame(dataset.get_annotations(raw_texts, "xenophobia"))
xenophobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1748,1749,1750,1751,1752,1753,1754,1755,1756,1757
504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
260,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [74]:
fig = px.bar(
    data_frame=prepare_data_to_px(xenophobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Xenophobia distribution")

fig.update_layout(layout)

fig.show()

In [75]:
cac = CAC(xenophobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 1758, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9898
Krippendorff's alpha: 0.3967
Gwet's AC1: 0.9931


In [76]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as xenophobia.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

12 texts was annotated by 1 rater(s) as xenophobia.
6 texts was annotated by 2 rater(s) as xenophobia.
2 texts was annotated by 3 rater(s) as xenophobia.
Disagreement score (class True): 0.9000


### Krispendorff's alpha Multi-Label

In the next cells, we will calculate the Krippendorff's alpha considering as a multi-label problem instead of several binary problems.

In [77]:
ratings = {
    "health": health,
    "ideology": ideology,
    "insult": insult,
    "lgbtqphobia": lgbtqphobia,
    "other_lifestyle": other_lifestyle,
    "physical_aspects": physical_aspects,
    "profanity_obscene": profanity_obscene,
    "racism": racism,
    "religious_intolerance": religious_intolerance,
    "sexism": sexism,
    "xenophobia": xenophobia
}

task_data = []
for annotator in health.columns.tolist():
    for item in range(len(health)):
        temp = get_annotations_by_rater(ratings, annotator, item)
        if temp != []:
            task_data.append((
                annotator,
                item,
                frozenset(temp)
            ))

jaccard_task = AnnotationTask(distance=jaccard_distance)
masi_task = AnnotationTask(distance=masi_distance)

for task in [jaccard_task, masi_task]:
    task.load_array(task_data)
    print(f"Krippendorff's alpha using {task.distance}")
    print(f"Krippendorff's alpha: {task.alpha():.4f}", "\n")

pa_mlabels = {}
for item in range(len(health)):
    for annotator in health.columns.tolist():
        temp = get_annotations_by_rater(ratings, annotator, item)
        
        if annotator not in pa_mlabels.keys():
            pa_mlabels[annotator] = []
        
        pa_mlabels[annotator].append(temp)

print(f"Percent agreement: {percent_agreement(pd.DataFrame(pa_mlabels)):.4f}")

Krippendorff's alpha using <function jaccard_distance at 0x00000276374CC4C0>
Krippendorff's alpha: 0.4897 

Krippendorff's alpha using <function masi_distance at 0x00000276374CC550>
Krippendorff's alpha: 0.4486 

Percent agreement: 0.2782


## Label Assignment

In this section, we will define the label assigment strategy and assign labels to the texts.

Possible label assigment strategies are:

- **Majority Vote**: assign the label with the highest frequency.
- **At least one**: assign the label if at least one annotator marked it as true.

### Strategy per features

We will have a label assignment strategy for each feature.

The LabelStrategy object will be used to assign a function to each feature that corresponds to the label assigment strategy selected.

In [78]:
label_strategy = LabelStrategy(
    is_offensive=majority_vote,
    is_targeted=majority_vote,
    targeted_type=majority_vote,
    toxic_spans=all_labeled_spans,
    health=at_least_one,
    ideology=at_least_one,
    insult=at_least_one, # majority_vote
    lgbtqphobia=at_least_one,
    other_lifestyle=at_least_one,
    physical_aspects=at_least_one,
    profanity_obscene=at_least_one,
    racism=at_least_one,
    religious_intolerance=at_least_one,
    sexism=at_least_one,
    xenophobia=at_least_one
)

processed_texts, metadata, texts = dataset.build(
    raw=[item for item in data if len(item["annotations"]) == 3],
    label_strategy=label_strategy
).values()

## Create DataFrames

In the next cells, we will create Pandas DataFrames for the dataset and the metadata.

In [79]:
processed_texts = [i.dict() for i in processed_texts]
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (1758, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,96c257db2ddc4d5c8ab01176eccceed3,USER USER USER Na parte que o USER não tem din...,OFF,TIN,GRP,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...",False,True,True,False,False,False,True,False,False,False,False
1,42afd00129a64e09b7146453d661fc23,fodase tbm,OFF,TIN,IND,"[0, 1, 2, 3, 4, 5, 6]",False,False,True,False,False,False,True,False,False,False,False
2,cdeab32e09ff429b8c965c23552ef176,"O USER DO BURGUER USER É UM LIXO , É USER E US...",OFF,TIN,OTH,"[27, 28, 29, 30, 31, 32, 33, 82, 83, 84, 85, 8...",False,False,True,False,False,False,False,False,False,False,False
3,2fae40cbe4e646d58acca6d4eb955f29,agora que eu tô sozinho nesse mal caminho chei...,OFF,TIN,GRP,"[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 71, 7...",False,False,True,False,False,False,True,False,False,True,False
4,5623aa86eeee4a148e21dc8a26ed6eae,USER USER a sua opinião desmerece sim várias p...,NOT,UNT,,,False,False,False,False,False,False,False,False,False,False,False


In [80]:
metadata = [i.dict() for i in metadata]
df_metadata = pd.DataFrame(metadata)

print(f"Shape: {df_metadata.shape}")
df_metadata.head()

Shape: (7032, 11)


Unnamed: 0,id,source,created_at,collected_at,toxicity_score,category,annotator_id,gender,year_of_birth,education_level,annotator_type
0,96c257db2ddc4d5c8ab01176eccceed3,YouTube,2020-05-04 06:20:55,2022-04-08 08:03:44.134767,0.9083,,,,,,
1,96c257db2ddc4d5c8ab01176eccceed3,,,NaT,,,504.0,Female,1999.0,High school,Contract worker
2,96c257db2ddc4d5c8ab01176eccceed3,,,NaT,,,260.0,Female,2001.0,High school,Contract worker
3,96c257db2ddc4d5c8ab01176eccceed3,,,NaT,,,127.0,Female,1975.0,Master's degree,Contract worker
4,42afd00129a64e09b7146453d661fc23,Twitter,2022-03-27 03:25:56+00:00,2022-03-27 00:52:43.398462,0.8605,,,,,,


## Validate data

In this section, we will apply some simple validation to guarantee that the data is correct.

Remove duplicated and understandable texts.

In [81]:
df.drop_duplicates(subset=["text"], inplace=True)

print(f"Shape: {df.shape}")

invalid_texts = [
    "RT USER: USER mas exatamente tá essa questão, o corno veio quando vc namorava, desculpa te contar assim \U0001fae0"
]

processed_texts = []

for text in df.to_dict(orient="records"):
    if text["text"] not in invalid_texts and not check_words(text["text"], ["USER", "HASHTAG", "URL"]):
        processed_texts.append(text)

print(f"Count: {len(processed_texts)}")

Shape: (1758, 17)
Count: 1752


In [82]:
ids = [i["id"] for i in processed_texts]
texts = [i for i in texts if i.id in ids]

print(f"Count: {len(texts)}")

Count: 1752


Rebuild dataframe from the cleaned data.

In [83]:
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (1752, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,96c257db2ddc4d5c8ab01176eccceed3,USER USER USER Na parte que o USER não tem din...,OFF,TIN,GRP,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...",False,True,True,False,False,False,True,False,False,False,False
1,42afd00129a64e09b7146453d661fc23,fodase tbm,OFF,TIN,IND,"[0, 1, 2, 3, 4, 5, 6]",False,False,True,False,False,False,True,False,False,False,False
2,cdeab32e09ff429b8c965c23552ef176,"O USER DO BURGUER USER É UM LIXO , É USER E US...",OFF,TIN,OTH,"[27, 28, 29, 30, 31, 32, 33, 82, 83, 84, 85, 8...",False,False,True,False,False,False,False,False,False,False,False
3,2fae40cbe4e646d58acca6d4eb955f29,agora que eu tô sozinho nesse mal caminho chei...,OFF,TIN,GRP,"[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 71, 7...",False,False,True,False,False,False,True,False,False,True,False
4,5623aa86eeee4a148e21dc8a26ed6eae,USER USER a sua opinião desmerece sim várias p...,NOT,UNT,,,False,False,False,False,False,False,False,False,False,False,False


In [84]:
metadata = dict_serialize_date(
    data=[i.dict() if isinstance(i, Metadata) else i for i in metadata],
    keys=["created_at", "collected_at"])

# Remove deleted texts metadata
metadata = [i for i in metadata if i["id"] in df["id"].tolist()]

print(f"Count: {len(metadata)}")

Count: 7008


## Profiling Report

We will generate a profiling report that provides some statistics about the data.

In [85]:
profile = ProfileReport(
    df, title="OLID-BR Pilot 4",
    explorative=True)

profile.to_file("../../docs/reports/olidbr_pilot_4.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

## Get full texts

In the next cells, we will prepare a list of texts with all the annotations and metadata.

In [86]:
def serialize_texts(texts):
    for text in [text.dict() for text in texts]:
        for k, v in text["metadata"].items():
            if isinstance(v, datetime.datetime):
                text["metadata"][k] = v.isoformat()
        yield text

texts = list(serialize_texts(texts))

print(f"Count: {len(texts)}")

Count: 1752


## Upload data to S3

In this section, we will save the dataset in CSV and JSON format in the S3 bucket.

Saving in CSV format.

In [87]:
bucket.upload_csv(
    data=df,
    key="processed/olid-br/iterations/4/olidbr.csv")

bucket.upload_csv(
    data=df_metadata,
    key="processed/olid-br/iterations/4/metadata.csv")

print("CSV files uploaded.")

CSV files uploaded.


Saving in JSON format.

In [88]:
bucket.upload_json(
    data=processed_texts,
    key="processed/olid-br/iterations/4/olidbr.json")

bucket.upload_json(
    data=metadata,
    key="processed/olid-br/iterations/4/metadata.json")

print("JSON files uploaded.")

JSON files uploaded.


Saving full texts in JSON format.

In [89]:
bucket.upload_json(
    data=texts,
    key="processed/olid-br/iterations/4/full_olidbr.json")

print("JSON file uploaded.")

JSON file uploaded.
