# OLID-BR - Iteration 2

In this notebook, we read the annotated data from an S3 bucket, build the OLID-BR dataset, and save it back to an S3 bucket in JSON and CSV formats.

The annotated data is stored in the Label Studio JSON format. See [Label Studio Documentation — Export Annotations](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks) for more details.

## Imports

In [1]:
import sys
from pathlib import Path

# Add the repository root (two levels up) to sys.path so that `src` can be imported.
repo_root = str(Path(".").absolute().parent.parent)
if repo_root not in sys.path:
    sys.path.append(repo_root)

In [2]:
from dotenv import load_dotenv

# Initialize the env vars
load_dotenv("../../.env")

True

In [3]:
import datetime
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from typing import List

from irrCAC.raw import CAC
from src.data_classes import Annotator, LabelStrategy, Metadata
from src.dataset import Dataset
from src.labeling.assignment import majority_vote, at_least_one, all_labeled_spans
from src.labeling.metrics import (
    percent_agreement,
    disagreement_by_raters,
    disagreement_score
)

from src.s3 import Bucket
from src.settings import AppSettings
from src.utils import (
    read_yaml,
    check_words,
    prepare_data_to_px,
    dict_serialize_date,
    get_lead_time,
    get_annotations_by_rater
)

import nltk
from nltk.metrics import agreement
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import masi_distance, jaccard_distance

# Plotly
import plotly.express as px
import plotly.io as pio
from plotly.graph_objs import Layout

pio.templates.default = "plotly_dark"

layout = Layout(
    xaxis={
        "type": "category",
        "showgrid": False,
        "zeroline": False,
    },
    yaxis={
        "showgrid": False,
        "zeroline": False
    },
    paper_bgcolor="rgba(0,0,0,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    font={"color": "rgb(180,180,180)"},
)

args = AppSettings()

## Load data

In the next cells, we read the labeled data from the S3 bucket and merge all annotations into a single collection.

In [4]:
bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)

In [5]:
files = [
    "raw/labeled/phase2/olid-br-2-126.json",
    "raw/labeled/phase2/olid-br-2-127.json",
    "raw/labeled/phase2/olid-br-2-128.json"
]

Because each annotator's data is stored in a separate file, we need to merge the annotations that refer to the same text into a single record.

In [6]:
data = {}

for file in files:
    print(f"Reading {file}")
    temp = bucket.download_json(key=file)

    lead_time = get_lead_time(temp)
    print(f"{file} >> Mean: {np.mean(lead_time):.0f}s | Std: {np.std(lead_time):.0f}s")

    for row in temp:
        if row["data"]["text"] not in data.keys():
            data[row["data"]["text"]] = row
        else:
            data[row["data"]["text"]]["annotations"].extend(row["annotations"])
    
    print()

data = [v for _, v in data.items()]

print(f"Count: {len(data)}")

Reading raw/labeled/phase2/olid-br-2-126.json
raw/labeled/phase2/olid-br-2-126.json >> Mean: 78s | Std: 908s

Reading raw/labeled/phase2/olid-br-2-127.json
raw/labeled/phase2/olid-br-2-127.json >> Mean: 114s | Std: 827s

Reading raw/labeled/phase2/olid-br-2-128.json
raw/labeled/phase2/olid-br-2-128.json >> Mean: 16s | Std: 189s

Count: 3000


Check if all texts have the same number of annotations.

In [7]:
annotations_count = {}
iteration_annotators = []

for item in data:
    for annotation in item["annotations"]:
        if annotation["completed_by"] not in iteration_annotators:
            iteration_annotators.append(annotation["completed_by"])

    count = len(item["annotations"])
    if count not in annotations_count.keys():
        annotations_count[count] = 1
    else:
        annotations_count[count] += 1

print(f"Annotators: {iteration_annotators}")
print(f"Annotations count: {annotations_count}")

Annotators: [126, 127, 128]
Annotations count: {3: 3000}


## Load annotators

In the next cells, we read the annotator data and create a list of `Annotator` objects.

This list is used to attach the annotations as metadata to each text.

In [8]:
annotators = read_yaml("../../properties/annotators.yaml")
annotators = [Annotator(**a) for a in annotators]
annotators

# Filter out the annotators that are not present in the data
annotators = [a for a in annotators if a.annotator_id in iteration_annotators]
annotators

[Annotator(id=None, annotator_id=127, gender='Female', year_of_birth=1975, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=128, gender='Female', year_of_birth=1992, education_level="Master's degree", annotator_type='Contract worker'),
 Annotator(id=None, annotator_id=126, gender='Male', year_of_birth=1997, education_level='High school', annotator_type='Contract worker')]

## Build dataset

In [9]:
dataset = Dataset(
    annotators=annotators,
    toxicity_threshold=args.PERSPECTIVE_THRESHOLD
)

raw_texts = dataset.get_raw_texts(data)

We keep only the texts that were annotated by all three annotators.

In [10]:
raw_texts = [text for text in raw_texts if len(text.annotations) == 3]

print(f"{len(raw_texts)} raw texts with 3 annotations.")

3000 raw texts with 3 annotations.


## Inter-Rater Reliability (IRR) analysis

Also known as inter-rater agreement (IRA) or concordance.

In the next cells, we will perform an agreement analysis to check if the annotations are consistent.

See [Inter-Rater Reliability - OLID-BR](https://dougtrajano.github.io/olid-br/annotation/inter-rater-reliability.html) for more details.
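As a rough illustration of the simplest measure used in this section, the sketch below computes percent agreement as the share of texts on which every rater chose the same label. This definition is an assumption inferred from the outputs reported in this notebook (for `is_offensive`, 2163 + 20 unanimous texts out of 3000 gives 0.7277); the actual `percent_agreement` in `src.labeling.metrics` may be implemented differently. The name `unanimous_agreement` and the toy ratings are hypothetical.

```python
from typing import Hashable, Sequence


def unanimous_agreement(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Share of texts on which every rater chose the same label.

    `ratings` holds one row per text and one label per rater.
    Assumed definition, not taken from the project's source code.
    """
    unanimous = sum(1 for row in ratings if len(set(row)) == 1)
    return unanimous / len(ratings)


# Toy example: four texts rated by three raters.
toy = [
    ("OFF", "OFF", "OFF"),  # unanimous
    ("OFF", "NOT", "OFF"),  # split
    ("NOT", "NOT", "NOT"),  # unanimous
    ("OFF", "OFF", "NOT"),  # split
]
print(f"{unanimous_agreement(toy):.4f}")  # 2 of 4 texts are unanimous -> 0.5000
```

Percent agreement does not correct for agreement expected by chance, which is why the chance-corrected coefficients are reported next to it.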

### `is_offensive`

In [11]:
is_offensive = pd.DataFrame(dataset.get_annotations(raw_texts, "is_offensive"))
is_offensive.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,OFF,...,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF
127,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,NOT,OFF,...,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF
128,OFF,OFF,OFF,OFF,OFF,NOT,OFF,OFF,OFF,OFF,...,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF,OFF


In [12]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_offensive),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_offensive distribution")

fig.update_layout(layout)

fig.show()

In [13]:
cac = CAC(is_offensive)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: ['NOT', 'OFF'], Weights: "identity">
Percent agreement: 0.7277
Krippendorff's alpha: 0.0595
Gwet's AC1: 0.7750
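The large gap between Krippendorff's alpha and Gwet's AC1 comes from how each coefficient estimates chance agreement when one class dominates. The sketch below recomputes both for a binary label from the vote counts reported for `is_offensive` in this section (2163, 720, 97, and 20 texts with 3, 2, 1, and 0 `OFF` votes). It uses the multi-rater formulas for complete data (no missing ratings) and is not the `irrCAC` implementation; the function name `ac1_and_alpha` is hypothetical.

```python
from typing import Dict, Tuple


def ac1_and_alpha(pos_votes: Dict[int, int], n_raters: int = 3) -> Tuple[float, float]:
    """Gwet's AC1 and Krippendorff's alpha for a binary label, complete data.

    `pos_votes` maps "number of raters who chose the positive class"
    to the number of texts with that vote count.
    """
    n_items = sum(pos_votes.values())
    pairs = n_raters * (n_raters - 1)

    # Observed agreement: share of agreeing rater pairs, averaged over texts.
    p_a = sum(
        count * (k * (k - 1) + (n_raters - k) * (n_raters - k - 1)) / pairs
        for k, count in pos_votes.items()
    ) / n_items

    # Prevalence of each class over all individual ratings.
    pi_pos = sum(k * count for k, count in pos_votes.items()) / (n_items * n_raters)
    pi_neg = 1.0 - pi_pos

    # Gwet: chance agreement shrinks as the prevalence becomes skewed.
    p_e_gwet = 2 * pi_pos * pi_neg  # 1/(q-1) * sum_k pi_k (1 - pi_k) with q = 2
    ac1 = (p_a - p_e_gwet) / (1 - p_e_gwet)

    # Krippendorff: chance agreement grows as the prevalence becomes skewed.
    p_e_kripp = pi_pos ** 2 + pi_neg ** 2
    eps = 1 / (n_items * n_raters)  # small-sample correction
    p_a_kripp = (1 - eps) * p_a + eps
    alpha = (p_a_kripp - p_e_kripp) / (1 - p_e_kripp)
    return ac1, alpha


is_offensive_votes = {3: 2163, 2: 720, 1: 97, 0: 20}
ac1, alpha = ac1_and_alpha(is_offensive_votes)
print(f"Gwet's AC1: {ac1:.4f}")            # 0.7750
print(f"Krippendorff's alpha: {alpha:.4f}")  # 0.0595
```

With about 89% of all ratings being `OFF`, Krippendorff's chance agreement is roughly 0.81, almost as high as the observed pairwise agreement of 0.82, so alpha is close to zero. Gwet's chance agreement is only about 0.19, so AC1 stays high.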


In [14]:
for k, v in disagreement_by_raters(cac.ratings, "OFF").items():
    print(f"{v} texts was annotated by {k} rater(s) as offensive.")

print(f"Disagreement score (class OFF): {disagreement_score(cac.ratings, 'OFF'):.4f}")

97 texts was annotated by 1 rater(s) as offensive.
720 texts was annotated by 2 rater(s) as offensive.
2163 texts was annotated by 3 rater(s) as offensive.
Disagreement score (class OFF): 0.2742
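The disagreement score printed above appears to be the share of texts with a partial vote for the class among all texts that received at least one vote for it. This is an assumption inferred from the reported numbers, not from the source of `disagreement_score`; the helper `disagreement_from_counts` below is hypothetical and works directly on the counts printed by `disagreement_by_raters`.

```python
from typing import Dict


def disagreement_from_counts(counts: Dict[int, int], n_raters: int) -> float:
    """Share of texts with a partial vote among texts with at least one vote.

    `counts` maps "number of raters who chose the class" (1..n_raters)
    to the number of texts. Assumed definition, inferred from the outputs.
    """
    total = sum(counts.values())
    if total == 0:
        # No text received the class at all, so there is nothing to disagree on.
        return 0.0
    partial = sum(n for votes, n in counts.items() if votes < n_raters)
    return partial / total


off_counts = {1: 97, 2: 720, 3: 2163}
print(f"{disagreement_from_counts(off_counts, 3):.4f}")  # (97 + 720) / 2980 -> 0.2742
```

Under this reading, a score near 1 means that raters almost never chose the class unanimously, even if the class itself is rare.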


In [15]:
for k, v in disagreement_by_raters(cac.ratings, "NOT").items():
    print(f"{v} texts was annotated by {k} rater(s) as non-offensive.")

print(f"Disagreement score (class NOT): {disagreement_score(cac.ratings, 'NOT'):.4f}")

720 texts was annotated by 1 rater(s) as non-offensive.
97 texts was annotated by 2 rater(s) as non-offensive.
20 texts was annotated by 3 rater(s) as non-offensive.
Disagreement score (class NOT): 0.9761


### `is_targeted`

In [16]:
is_targeted = pd.DataFrame(dataset.get_annotations(raw_texts, "is_targeted"))
is_targeted.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,...,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN
127,UNT,UNT,UNT,UNT,TIN,UNT,UNT,UNT,UNT,UNT,...,TIN,TIN,TIN,TIN,TIN,UNT,TIN,TIN,TIN,TIN
128,UNT,TIN,UNT,UNT,UNT,UNT,UNT,TIN,UNT,UNT,...,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN,TIN


In [17]:
fig = px.bar(
    data_frame=prepare_data_to_px(is_targeted),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="is_targeted distribution")

fig.update_layout(layout)

fig.show()

In [18]:
cac = CAC(is_targeted)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: ['TIN', 'UNT'], Weights: "identity">
Percent agreement: 0.1610
Krippendorff's alpha: -0.1348
Gwet's AC1: -0.1029


In [19]:
for k, v in disagreement_by_raters(cac.ratings, "TIN").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted.")

print(f"Disagreement score (class TIN): {disagreement_score(cac.ratings, 'TIN'):.4f}")

1202 texts was annotated by 1 rater(s) as targeted.
1315 texts was annotated by 2 rater(s) as targeted.
402 texts was annotated by 3 rater(s) as targeted.
Disagreement score (class TIN): 0.8623


In [20]:
for k, v in disagreement_by_raters(cac.ratings, "UNT").items():
    print(f"{v} texts was annotated by {k} rater(s) as untargeted.")

print(f"Disagreement score (class UNT): {disagreement_score(cac.ratings, 'UNT'):.4f}")

1315 texts was annotated by 1 rater(s) as untargeted.
1202 texts was annotated by 2 rater(s) as untargeted.
81 texts was annotated by 3 rater(s) as untargeted.
Disagreement score (class UNT): 0.9688


### `targeted_type`

In [21]:
targeted_type = pd.DataFrame(dataset.get_annotations(raw_texts, "targeted_type"))
targeted_type.fillna(np.nan, inplace=True)
targeted_type.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,IND,GRP,GRP,GRP,GRP,OTH,GRP,GRP,IND,IND,...,GRP,GRP,IND,IND,GRP,IND,GRP,IND,GRP,GRP
127,,,,,GRP,,,,,,...,GRP,GRP,IND,IND,GRP,,GRP,IND,GRP,GRP
128,,IND,,,,,,IND,,,...,GRP,GRP,GRP,IND,IND,IND,GRP,IND,GRP,GRP


In [22]:
fig = px.bar(
    data_frame=prepare_data_to_px(targeted_type),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="targeted_type distribution")

fig.update_layout(layout)

fig.show()

In [23]:
cac = CAC(targeted_type)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 2919, Raters: 3, Categories: ['GRP', 'IND', 'OTH'], Weights: "identity">
Percent agreement: 0.0641
Krippendorff's alpha: 0.2461
Gwet's AC1: 0.4978




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [24]:
for k, v in disagreement_by_raters(cac.ratings, "IND").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to an individual.")

print(f"Disagreement score (class IND): {disagreement_score(cac.ratings, 'IND'):.4f}")

1274 texts was annotated by 1 rater(s) as targeted to an individual.
768 texts was annotated by 2 rater(s) as targeted to an individual.
153 texts was annotated by 3 rater(s) as targeted to an individual.
Disagreement score (class IND): 0.9303


In [25]:
for k, v in disagreement_by_raters(cac.ratings, "GRP").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to a group.")

print(f"Disagreement score (class GRP): {disagreement_score(cac.ratings, 'GRP'):.4f}")

722 texts was annotated by 1 rater(s) as targeted to a group.
214 texts was annotated by 2 rater(s) as targeted to a group.
33 texts was annotated by 3 rater(s) as targeted to a group.
Disagreement score (class GRP): 0.9659


In [26]:
for k, v in disagreement_by_raters(cac.ratings, "OTH").items():
    print(f"{v} texts was annotated by {k} rater(s) as targeted to other.")

print(f"Disagreement score (class OTH): {disagreement_score(cac.ratings, 'OTH'):.4f}")

435 texts was annotated by 1 rater(s) as targeted to other.
41 texts was annotated by 2 rater(s) as targeted to other.
1 texts was annotated by 3 rater(s) as targeted to other.
Disagreement score (class OTH): 0.9979


### `toxic_spans`

In [27]:
toxic_spans = pd.DataFrame(dataset.get_annotations(raw_texts, "toxic_spans"))
toxic_spans.head()

Unnamed: 0,126,127,128
0,[],"[52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...",[]
1,"[20, 21, 22, 23, 24, 25, 93, 94, 95, 96, 97]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",[]
2,"[14, 15, 16, 17, 18]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",[]
3,"[10, 11, 12, 13, 14, 17, 18, 19, 20, 21, 22, 2...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",[]
4,"[10, 11, 12, 13, 14, 165, 166, 167, 176, 177, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[10, 11, 12, 13, 14, 176, 177, 178, 179, 180, ..."


In [28]:
task_data = []
for annotator in [a.annotator_id for a in annotators]:
    for item in range(len(toxic_spans)):
        temp = toxic_spans.iloc[item][annotator]
        if temp != []:
            task_data.append((
                annotator,
                item,
                frozenset(temp)
            ))

jaccard_task = AnnotationTask(distance=jaccard_distance)
masi_task = AnnotationTask(distance=masi_distance)

for task in [jaccard_task, masi_task]:
    task.load_array(task_data)
    print(f"Krippendorff's alpha using {task.distance}")
    print(f"Krippendorff's alpha: {task.alpha():.4f}", "\n")

print(f"Percent agreement: {percent_agreement(toxic_spans):.4f}")

Krippendorff's alpha using <function jaccard_distance at 0x00000244937B8430>
Krippendorff's alpha: 0.3805 

Krippendorff's alpha using <function masi_distance at 0x00000244937B84C0>
Krippendorff's alpha: 0.2709 

Percent agreement: 0.1220
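To see why alpha is lower under MASI than under Jaccard, the sketch below reimplements both set distances in plain Python and applies them to toy span annotations (sets of character offsets). It is intended to mirror the NLTK functions used above, but the weights 0.67 and 0.33 are an assumption that should be checked against the installed NLTK version; the names `jaccard` and `masi` are hypothetical.

```python
from typing import FrozenSet


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard distance: 1 - |intersection| / |union| (undefined for two empty sets)."""
    return 1 - len(a & b) / len(a | b)


def masi(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    inter = len(a & b)
    union = len(a | b)
    if a == b:
        weight = 1.0   # identical sets
    elif inter == min(len(a), len(b)):
        weight = 0.67  # one set contains the other
    elif inter > 0:
        weight = 0.33  # partial overlap
    else:
        weight = 0.0   # disjoint sets
    return 1 - inter / union * weight


short_span = frozenset(range(10, 15))  # offsets 10..14
long_span = frozenset(range(10, 20))   # offsets 10..19, contains short_span
shifted = frozenset(range(13, 21))     # offsets 13..20, overlaps short_span on 13 and 14

print(f"{jaccard(short_span, long_span):.3f}")  # 0.500
print(f"{masi(short_span, long_span):.3f}")     # 0.665
print(f"{jaccard(short_span, shifted):.3f}")    # 0.818
print(f"{masi(short_span, shifted):.3f}")       # 0.940
```

Because MASI always assigns a distance at least as large as Jaccard for non-identical sets, annotators who mark spans of very different lengths around the same words are penalized more heavily under MASI.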


In [29]:
def len_toxic_spans(toxic_spans: List[int]):
    return None if len(toxic_spans) == 0 else len(toxic_spans)

pd.DataFrame([toxic_spans[col].apply(lambda x: len_toxic_spans(x)) for col in toxic_spans.columns]).transpose().describe()

Unnamed: 0,126,127,128
count,2137.0,2008.0,1440.0
mean,11.334113,39.480578,11.360417
std,8.881962,30.773059,8.559401
min,2.0,4.0,1.0
25%,6.0,19.0,6.0
50%,9.0,31.0,8.5
75%,14.0,50.0,14.0
max,127.0,277.0,81.0


In [30]:
fig = px.bar(
    data_frame=prepare_data_to_px(pd.DataFrame([toxic_spans[col].apply(lambda x: len(x) > 0) for col in toxic_spans.columns]).transpose()),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="toxic_spans distribution")

fig.update_layout(layout)

fig.show()

### `health`

In [31]:
health = pd.DataFrame(dataset.get_annotations(raw_texts, "health"))
health.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [32]:
fig = px.bar(
    data_frame=prepare_data_to_px(health),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Health distribution")

fig.update_layout(layout)

fig.show()

In [33]:
cac = CAC(health)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9760
Krippendorff's alpha: 0.0447
Gwet's AC1: 0.9837


In [34]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as health.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

68 texts was annotated by 1 rater(s) as health.
4 texts was annotated by 2 rater(s) as health.
0 texts was annotated by 3 rater(s) as health.
Disagreement score (class True): 1.0000


### `ideology`

In [35]:
ideology = pd.DataFrame(dataset.get_annotations(raw_texts, "ideology"))
ideology.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,True,False,False,False,False,False,False,False,...,False,True,False,True,True,True,True,False,True,True
127,False,False,False,False,False,False,False,False,False,False,...,False,True,False,True,True,False,True,False,True,True
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,True,False,False,True


In [36]:
fig = px.bar(
    data_frame=prepare_data_to_px(ideology),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Ideology distribution")

fig.update_layout(layout)

fig.show()

In [37]:
cac = CAC(ideology)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.7647
Krippendorff's alpha: 0.3019
Gwet's AC1: 0.7976


In [38]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as ideology.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

500 texts was annotated by 1 rater(s) as ideology.
206 texts was annotated by 2 rater(s) as ideology.
83 texts was annotated by 3 rater(s) as ideology.
Disagreement score (class True): 0.8948


### `insult`

In [39]:
insult = pd.DataFrame(dataset.get_annotations(raw_texts, "insult"))
insult.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,True,True,False,True,True,True,True,False,False,True,...,True,True,True,True,True,False,False,True,False,False
127,True,True,True,True,False,False,True,True,False,True,...,True,True,True,True,True,True,True,True,False,False
128,True,True,True,True,True,False,True,False,True,True,...,True,True,False,False,False,True,False,True,True,False


In [40]:
fig = px.bar(
    data_frame=prepare_data_to_px(insult),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Insult distribution")

fig.update_layout(layout)

fig.show()

In [41]:
cac = CAC(insult)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.4713
Krippendorff's alpha: 0.0895
Gwet's AC1: 0.4250


In [42]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as insult.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

472 texts was annotated by 1 rater(s) as insult.
1114 texts was annotated by 2 rater(s) as insult.
1313 texts was annotated by 3 rater(s) as insult.
Disagreement score (class True): 0.5471


### `lgbtqphobia`

In [43]:
lgbtqphobia = pd.DataFrame(dataset.get_annotations(raw_texts, "lgbtqphobia"))
lgbtqphobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [44]:
fig = px.bar(
    data_frame=prepare_data_to_px(lgbtqphobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="LGBTQphobia distribution")

fig.update_layout(layout)

fig.show()

In [45]:
cac = CAC(lgbtqphobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9453
Krippendorff's alpha: 0.5583
Gwet's AC1: 0.9603


In [46]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as lgbtqphobia.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

96 texts was annotated by 1 rater(s) as lgbtqphobia.
68 texts was annotated by 2 rater(s) as lgbtqphobia.
52 texts was annotated by 3 rater(s) as lgbtqphobia.
Disagreement score (class True): 0.7593


### `other_lifestyle`

In [47]:
other_lifestyle = pd.DataFrame(dataset.get_annotations(raw_texts, "other_lifestyle"))
other_lifestyle.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [48]:
fig = px.bar(
    data_frame=prepare_data_to_px(other_lifestyle),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Other-Lifestyle distribution")

fig.update_layout(layout)

fig.show()

In [49]:
cac = CAC(other_lifestyle)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9860
Krippendorff's alpha: 0.0824
Gwet's AC1: 0.9906


In [50]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as other_lifestyle.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

38 texts was annotated by 1 rater(s) as other_lifestyle.
4 texts was annotated by 2 rater(s) as other_lifestyle.
0 texts was annotated by 3 rater(s) as other_lifestyle.
Disagreement score (class True): 1.0000


### `physical_aspects`

In [51]:
physical_aspects = pd.DataFrame(dataset.get_annotations(raw_texts, "physical_aspects"))
physical_aspects.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [52]:
fig = px.bar(
    data_frame=prepare_data_to_px(physical_aspects),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Physical Aspects distribution")

fig.update_layout(layout)

fig.show()

In [53]:
cac = CAC(physical_aspects)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9463
Krippendorff's alpha: 0.3272
Gwet's AC1: 0.9622


In [54]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as physical_aspects.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

130 texts was annotated by 1 rater(s) as physical_aspects.
31 texts was annotated by 2 rater(s) as physical_aspects.
18 texts was annotated by 3 rater(s) as physical_aspects.
Disagreement score (class True): 0.8994


### `profanity_obscene`

In [55]:
profanity_obscene = pd.DataFrame(dataset.get_annotations(raw_texts, "profanity_obscene"))
profanity_obscene.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,True,True,False,False,...,True,True,False,False,True,False,False,False,False,False


In [56]:
fig = px.bar(
    data_frame=prepare_data_to_px(profanity_obscene),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Profanity/Obscene distribution")

fig.update_layout(layout)

fig.show()

In [57]:
cac = CAC(profanity_obscene)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.6837
Krippendorff's alpha: 0.0850
Gwet's AC1: 0.7260


In [58]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as profanity_obscene.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

753 texts was annotated by 1 rater(s) as profanity_obscene.
196 texts was annotated by 2 rater(s) as profanity_obscene.
17 texts was annotated by 3 rater(s) as profanity_obscene.
Disagreement score (class True): 0.9824


### `racism`

In [59]:
racism = pd.DataFrame(dataset.get_annotations(raw_texts, "racism"))
racism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [60]:
fig = px.bar(
    data_frame=prepare_data_to_px(racism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Racism distribution")

fig.update_layout(layout)

fig.show()

In [61]:
cac = CAC(racism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9750
Krippendorff's alpha: 0.2564
Gwet's AC1: 0.9829


In [62]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as racism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

57 texts was annotated by 1 rater(s) as racism.
18 texts was annotated by 2 rater(s) as racism.
3 texts was annotated by 3 rater(s) as racism.
Disagreement score (class True): 0.9615


### `religious_intolerance`

In [63]:
religious_intolerance = pd.DataFrame(dataset.get_annotations(raw_texts, "religious_intolerance"))
religious_intolerance.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [64]:
fig = px.bar(
    data_frame=prepare_data_to_px(religious_intolerance),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Religious intolerance distribution")

fig.update_layout(layout)

fig.show()

In [65]:
cac = CAC(religious_intolerance)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False], Weights: "identity">
Percent agreement: 1.0000
Krippendorff's alpha: 1.0000
Gwet's AC1: 1.0000



divide by zero encountered in double_scalars


invalid value encountered in multiply





In [66]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as religious_intolerance.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

0 texts was annotated by 1 rater(s) as religious_intolerance.
0 texts was annotated by 2 rater(s) as religious_intolerance.
0 texts was annotated by 3 rater(s) as religious_intolerance.
Disagreement score (class True): 0.0000


### `sexism`

In [67]:
sexism = pd.DataFrame(dataset.get_annotations(raw_texts, "sexism"))
sexism.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
127,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,True,False
128,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,True,False


In [68]:
fig = px.bar(
    data_frame=prepare_data_to_px(sexism),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Sexism distribution")

fig.update_layout(layout)

fig.show()

In [69]:
cac = CAC(sexism)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.8753
Krippendorff's alpha: 0.1721
Gwet's AC1: 0.9076


In [70]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as sexism.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

307 texts was annotated by 1 rater(s) as sexism.
67 texts was annotated by 2 rater(s) as sexism.
12 texts was annotated by 3 rater(s) as sexism.
Disagreement score (class True): 0.9689
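
As a sanity check, the figures printed for `sexism` can be reproduced from the per-rater counts alone. The sketch below rests on an assumption inferred from the printed outputs, not on the actual code of `percent_agreement` or `disagreement_score` in `src.labeling.metrics`. It treats percent agreement as the share of texts on which all raters agree, and the disagreement score as one minus the share of positively labeled texts that are unanimous. Both function names are hypothetical.

```python
def percent_agreement_from_counts(counts, n_items, n_raters):
    """Share of items on which every rater gave the same label (assumed definition).

    counts maps "number of raters who chose True" -> number of texts.
    A text is a disagreement when some, but not all, raters chose True.
    """
    partial = sum(v for k, v in counts.items() if 0 < k < n_raters)
    return 1 - partial / n_items


def disagreement_from_counts(counts, n_raters):
    """1 - (unanimous positives / texts with at least one positive), assumed definition."""
    positives = sum(v for k, v in counts.items() if k > 0)
    if positives == 0:
        return 0.0  # no positive labels at all, as for religious_intolerance
    return 1 - counts.get(n_raters, 0) / positives


# Counts printed above for the sexism label (3000 texts, 3 raters).
sexism_counts = {1: 307, 2: 67, 3: 12}

print(f"{percent_agreement_from_counts(sexism_counts, 3000, 3):.4f}")  # 0.8753
print(f"{disagreement_from_counts(sexism_counts, 3):.4f}")             # 0.9689
```

Both values match the notebook output, which suggests that the low Krippendorff's alpha comes from how rarely the positive class is agreed upon rather than from overall disagreement.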


### `xenophobia`

In [71]:
xenophobia = pd.DataFrame(dataset.get_annotations(raw_texts, "xenophobia"))
xenophobia.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
126,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
127,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
128,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [72]:
fig = px.bar(
    data_frame=prepare_data_to_px(xenophobia),
    x="Annotator", y="Count", color="Label", opacity=0.8,
    height=600, text="Count", title="Xenophobia distribution")

fig.update_layout(layout)

fig.show()

In [73]:
cac = CAC(xenophobia)

print("CAC:", cac)
print(f"Percent agreement: {percent_agreement(cac.ratings):.4f}")
print(f"Krippendorff's alpha: {cac.krippendorff()['est']['coefficient_value']:.4f}")
print(f"Gwet's AC1: {cac.gwet()['est']['coefficient_value']:.4f}")

CAC: <irrCAC.raw.CAC Subjects: 3000, Raters: 3, Categories: [False, True], Weights: "identity">
Percent agreement: 0.9673
Krippendorff's alpha: 0.0732
Gwet's AC1: 0.9777


In [74]:
for k, v in disagreement_by_raters(cac.ratings, True).items():
    print(f"{v} texts was annotated by {k} rater(s) as xenophobia.")

print(f"Disagreement score (class True): {disagreement_score(cac.ratings, True):.4f}")

92 texts was annotated by 1 rater(s) as xenophobia.
6 texts was annotated by 2 rater(s) as xenophobia.
1 texts was annotated by 3 rater(s) as xenophobia.
Disagreement score (class True): 0.9899


### Krippendorff's alpha Multi-Label

In the next cells, we will calculate Krippendorff's alpha treating the annotation as a single multi-label problem instead of several binary problems.

In [75]:
ratings = {
    "health": health,
    "ideology": ideology,
    "insult": insult,
    "lgbtqphobia": lgbtqphobia,
    "other_lifestyle": other_lifestyle,
    "physical_aspects": physical_aspects,
    "profanity_obscene": profanity_obscene,
    "racism": racism,
    "religious_intolerance": religious_intolerance,
    "sexism": sexism,
    "xenophobia": xenophobia
}

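# Build (coder, item, labels) triples, the input format used by AnnotationTask.
# Items for which a rater assigned no label are skipped.
task_data = []
for annotator in [a.annotator_id for a in annotators]:
    for item in range(len(health)):
        temp = get_annotations_by_rater(ratings, annotator, item)
        if temp: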
            task_data.append((
                annotator,
                item,
                frozenset(temp)
            ))

jaccard_task = AnnotationTask(distance=jaccard_distance)
masi_task = AnnotationTask(distance=masi_distance)

for task in [jaccard_task, masi_task]:
    task.load_array(task_data)
    print(f"Krippendorff's alpha using {task.distance}")
    print(f"Krippendorff's alpha: {task.alpha():.4f}", "\n")

pa_mlabels = {}

for item in range(len(health)):
    for annotator in [a.annotator_id for a in annotators]:
        temp = get_annotations_by_rater(ratings, annotator, item)
        
        if annotator not in pa_mlabels.keys():
            pa_mlabels[annotator] = []
        
        pa_mlabels[annotator].append(temp)

print(f"Percent agreement: {percent_agreement(pd.DataFrame(pa_mlabels)):.4f}")

Krippendorff's alpha using <function jaccard_distance at 0x00000244937B8430>
Krippendorff's alpha: 0.2146 

Krippendorff's alpha using <function masi_distance at 0x00000244937B84C0>
Krippendorff's alpha: 0.1962 

Percent agreement: 0.1877
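
To make the role of the two distance functions concrete, here is a minimal pure-Python sketch of Jaccard and MASI distances between label sets. It follows the definitions as they are commonly described; the names `jaccard_sketch` and `masi_sketch` are hypothetical, and NLTK's own `jaccard_distance` and `masi_distance` may differ in detail (for example, in how empty sets are handled).

```python
def jaccard_sketch(a: frozenset, b: frozenset) -> float:
    """Jaccard distance: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty label sets count as identical
    return 1 - len(a & b) / len(union)


def masi_sketch(a: frozenset, b: frozenset) -> float:
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty label sets count as identical
    inter = a & b
    if a == b:
        m = 1.0    # identical sets
    elif inter == a or inter == b:
        m = 0.67   # one set is a subset of the other
    elif inter:
        m = 0.33   # overlap, but each set has labels the other lacks
    else:
        m = 0.0    # disjoint sets
    return 1 - (len(inter) / len(union)) * m


pairs = [
    (frozenset({"insult"}), frozenset({"insult"})),                      # identical
    (frozenset({"insult", "sexism"}), frozenset({"insult"})),            # subset
    (frozenset({"insult", "sexism"}), frozenset({"insult", "racism"})),  # partial overlap
    (frozenset({"insult"}), frozenset({"racism"})),                      # disjoint
]

for a, b in pairs:
    print(sorted(a), sorted(b), f"{jaccard_sketch(a, b):.3f}", f"{masi_sketch(a, b):.3f}")
```

MASI penalizes partial overlaps more heavily than Jaccard, which is consistent with the slightly lower alpha reported above for `masi_distance`.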


## Label Assignment

In this section, we will define the label assignment strategy and assign labels to the texts.

Possible label assignment strategies are:

- **Majority Vote**: assign the label with the highest frequency.
- **At least one**: assign the label if at least one annotator marked it as true.
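
The two strategies can be sketched as small functions over the list of votes given by the annotators for one text. This is an illustration only: the real `majority_vote` and `at_least_one` live in `src.labeling.assignment` and are not shown here, so the names and signatures below are hypothetical, and tie handling is left unspecified.

```python
from collections import Counter
from typing import Any, List


def majority_vote_sketch(votes: List[Any]) -> Any:
    """Return the label chosen by the largest number of annotators."""
    return Counter(votes).most_common(1)[0][0]


def at_least_one_sketch(votes: List[bool]) -> bool:
    """Return True if any annotator marked the label as true."""
    return any(votes)


print(majority_vote_sketch(["OFF", "OFF", "NOT"]))  # OFF
print(at_least_one_sketch([False, True, False]))    # True
print(at_least_one_sketch([False, False, False]))   # False
```

With three annotators, "at least one" keeps every positive vote, so it favors recall for rare labels, while majority vote needs two matching votes.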

### Strategy per feature

Each feature has its own label assignment strategy.

The `LabelStrategy` object maps each feature to the function that implements the label assignment strategy selected for it.
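
A rough sketch of the idea, using a plain dictionary in place of `LabelStrategy` (whose definition in `src.data_classes` is not shown here). The helper `assign_labels` and the sample votes are hypothetical and only illustrate dispatching one function per feature.

```python
from typing import Any, Callable, Dict, List


def assign_labels(
    votes_by_feature: Dict[str, List[Any]],
    strategies: Dict[str, Callable[[List[Any]], Any]],
) -> Dict[str, Any]:
    """Apply the strategy registered for each feature to that feature's votes."""
    return {
        feature: strategies[feature](votes)
        for feature, votes in votes_by_feature.items()
    }


# One function per feature: most frequent label for is_offensive,
# "any annotator said True" for the toxicity labels.
strategies = {
    "is_offensive": lambda votes: max(set(votes), key=votes.count),
    "insult": any,
    "sexism": any,
}

# Hypothetical votes from three annotators for a single text.
votes_by_feature = {
    "is_offensive": ["OFF", "OFF", "NOT"],
    "insult": [False, True, False],
    "sexism": [False, False, False],
}

result = assign_labels(votes_by_feature, strategies)
print(result)  # {'is_offensive': 'OFF', 'insult': True, 'sexism': False}
```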

In [76]:
label_strategy = LabelStrategy(
    is_offensive=majority_vote,
    is_targeted=majority_vote,
    targeted_type=majority_vote,
    toxic_spans=all_labeled_spans,
    health=at_least_one,
    ideology=at_least_one,
    insult=at_least_one, # majority_vote
    lgbtqphobia=at_least_one,
    other_lifestyle=at_least_one,
    physical_aspects=at_least_one,
    profanity_obscene=at_least_one,
    racism=at_least_one,
    religious_intolerance=at_least_one,
    sexism=at_least_one,
    xenophobia=at_least_one
)

processed_texts, metadata = dataset.build(
    raw=data,
    label_strategy=label_strategy
)

processed_texts = [i.dict() for i in processed_texts]
metadata = [i.dict() for i in metadata]

## Create DataFrames

In the next cells, we will create Pandas DataFrames for the dataset and the metadata.

In [77]:
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (3000, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,618c0bd4fcf946e39d31bf04ce257f52,USER Adorei o comercial também Jesus. Só achei...,OFF,UNT,,"[52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...",False,False,True,False,False,False,True,False,False,False,False
1,1b9446e3b87c4e6092bbca1dc94ff7e7,Cara isso foi muito babaca geral USER conhece ...,OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
2,c19887c524b24395bec9c71aeafa24ed,Quem liga pra judeu kkkk,OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,True,True,False,False,False,False,False,False,False,True
3,0999844ebec6445789828e567864feeb,"Se vc for porco, folgado e relaxado, você não ...",OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
4,7f001bd9a4d34c2394055c366c31125a,"Rapaziada chata, né?! O cara trabalha c funk, ...",OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False


In [78]:
df_metadata = pd.DataFrame(metadata)

print(f"Shape: {df_metadata.shape}")
df_metadata.head()

Shape: (12000, 11)


Unnamed: 0,id,source,created_at,collected_at,toxicity_score,category,annotator_id,gender,year_of_birth,education_level,annotator_type
0,618c0bd4fcf946e39d31bf04ce257f52,YouTube,2015-05-29 09:27:57,2022-04-08 08:03:44.134767,0.7189,,,,,,
1,618c0bd4fcf946e39d31bf04ce257f52,,,NaT,,,126.0,Male,1997.0,High school,Contract worker
2,618c0bd4fcf946e39d31bf04ce257f52,,,NaT,,,127.0,Female,1975.0,Master's degree,Contract worker
3,618c0bd4fcf946e39d31bf04ce257f52,,,NaT,,,128.0,Female,1992.0,Master's degree,Contract worker
4,1b9446e3b87c4e6092bbca1dc94ff7e7,YouTube,2022-02-09 12:40:52,2022-04-08 08:03:44.134767,0.9852,,,,,,


## Validate data

In this section, we will apply a few simple validation checks to clean the data.

Remove duplicated texts.

In [79]:
df.drop_duplicates(subset=["text"], inplace=True)

print(f"Shape: {df.shape}")

Shape: (3000, 17)


Remove texts that are not understandable, as flagged by `check_words` with the `USER`, `HASHTAG`, and `URL` placeholder tokens.

In [80]:
processed_texts = df.to_dict(orient="records")

processed_texts = [i for i in processed_texts if not check_words(i["text"], ["USER", "HASHTAG", "URL"])]

print(f"Count: {len(processed_texts)}")

Count: 2996
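
The implementation of `check_words` in `src.utils` is not shown in this notebook. One plausible reading, given the placeholder tokens passed to it, is that it flags texts made up only of those tokens. The sketch below illustrates that assumption with a hypothetical helper named `only_placeholders`; it is not the project's actual function.

```python
from typing import Iterable


def only_placeholders(text: str, placeholders: Iterable[str]) -> bool:
    """Return True when every whitespace-separated token is a placeholder.

    An empty text also returns True, since it carries no readable content.
    """
    allowed = set(placeholders)
    return all(token in allowed for token in text.split())


tokens = ["USER", "HASHTAG", "URL"]

print(only_placeholders("USER URL", tokens))      # True
print(only_placeholders("USER bom dia", tokens))  # False
```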


Rebuild dataframe from the cleaned data.

In [81]:
df = pd.DataFrame(processed_texts)

print(f"Shape: {df.shape}")
df.head()

Shape: (2996, 17)


Unnamed: 0,id,text,is_offensive,is_targeted,targeted_type,toxic_spans,health,ideology,insult,lgbtqphobia,other_lifestyle,physical_aspects,profanity_obscene,racism,religious_intolerance,sexism,xenophobia
0,618c0bd4fcf946e39d31bf04ce257f52,USER Adorei o comercial também Jesus. Só achei...,OFF,UNT,,"[52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...",False,False,True,False,False,False,True,False,False,False,False
1,1b9446e3b87c4e6092bbca1dc94ff7e7,Cara isso foi muito babaca geral USER conhece ...,OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
2,c19887c524b24395bec9c71aeafa24ed,Quem liga pra judeu kkkk,OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,True,True,False,False,False,False,False,False,False,True
3,0999844ebec6445789828e567864feeb,"Se vc for porco, folgado e relaxado, você não ...",OFF,UNT,,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False
4,7f001bd9a4d34c2394055c366c31125a,"Rapaziada chata, né?! O cara trabalha c funk, ...",OFF,TIN,GRP,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",False,False,True,False,False,False,False,False,False,False,False


In [82]:
metadata = dict_serialize_date(
    data=[i.dict() if isinstance(i, Metadata) else i for i in metadata],
    keys=["created_at", "collected_at"])

# Remove the metadata of deleted texts (a set makes the membership test fast)
valid_ids = set(df["id"])
metadata = [i for i in metadata if i["id"] in valid_ids]

print(f"Count: {len(metadata)}")

Count: 11984


## Profiling Report

We will generate a profiling report that provides some statistics about the data.

In [83]:
profile = ProfileReport(
    df, title="OLID-BR Pilot 2",
    explorative=True)

profile.to_file("../../docs/reports/olidbr_pilot_2.html")


## Get full texts

In the next cells, we will prepare a list of texts with all the annotations and metadata.

In [105]:
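# Keep only texts with exactly 3 annotations that survived the validation step
valid_texts = set(df["text"])

texts = dataset.get_texts(
    raw=[
        i for i in dataset.get_raw_texts(data)
        if len(i.annotations) == 3 and i.text in valid_texts
    ]
)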

def serialize_texts(texts):
    for text in [text.dict() for text in texts]:
        for k, v in text["metadata"].items():
            if isinstance(v, datetime.datetime):
                text["metadata"][k] = v.isoformat()
        yield text

texts = list(serialize_texts(texts))

print(f"Count: {len(texts)}")

Count: 2996
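
The date handling above is needed because `json` cannot serialize `datetime` objects directly. A minimal standalone sketch of the same idea, using a hypothetical flat record and a hypothetical helper `serialize_dates`:

```python
import datetime
import json


def serialize_dates(record: dict) -> dict:
    """Return a copy of record with datetime values converted to ISO 8601 strings."""
    return {
        key: value.isoformat() if isinstance(value, datetime.datetime) else value
        for key, value in record.items()
    }


# Hypothetical record, shaped like one metadata row shown earlier.
record = {
    "source": "YouTube",
    "created_at": datetime.datetime(2015, 5, 29, 9, 27, 57),
}

serialized = serialize_dates(record)
print(json.dumps(serialized))
# {"source": "YouTube", "created_at": "2015-05-29T09:27:57"}
```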


## Upload data to S3

In this section, we will save the dataset to the S3 bucket in CSV and JSON formats.

Saving in CSV format.

In [99]:
bucket.upload_csv(
    data=df,
    key="processed/olid-br/iterations/2/olidbr.csv")

bucket.upload_csv(
    data=df_metadata,
    key="processed/olid-br/iterations/2/metadata.csv")

print("CSV Files uploaded.")

CSV Files uploaded.


Saving in JSON format.

In [100]:
bucket.upload_json(
    data=processed_texts,
    key="processed/olid-br/iterations/2/olidbr.json")

bucket.upload_json(
    data=metadata,
    key="processed/olid-br/iterations/2/metadata.json")

print("JSON Files uploaded.")

JSON Files uploaded.


Saving full texts in JSON format.

In [107]:
bucket.upload_json(
    data=texts,
    key="processed/olid-br/iterations/2/full_olidbr.json")

print("JSON file uploaded.")

JSON file uploaded.
