# OLID-BR (Build Dataset)

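This notebook builds the OLID-BR dataset from the processed data: it loads every annotation iteration from S3, removes duplicated texts, generates a profiling report, and uploads the result in CSV and JSON formats.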

## Imports

In [1]:
import sys
from pathlib import Path

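# Add the project root (two levels up) to sys.path so that `src` can be imported.
# The check and the append must use the same path.
project_root = str(Path(".").absolute().parent.parent)

if project_root not in sys.path:
    sys.path.append(project_root)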

In [2]:
from dotenv import load_dotenv

# Initialize the env vars
load_dotenv("../../.env")

True

In [3]:
import pandas as pd
from pandas_profiling import ProfileReport
from src.s3 import Bucket
from src.settings import AppSettings

In [4]:
args = AppSettings()

version = "v0.4-alpha"

bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)

## Load data

The next cells load the processed data and metadata of every iteration from S3. Iterations 2 to 4 also provide a full-data file; iteration 1 does not.

In [5]:
iterations = [
    {
        "data": "processed/olid-br/iterations/1/olidbr.json",
        "metadata": "processed/olid-br/iterations/1/metadata.json"
    },
    {
        "data": "processed/olid-br/iterations/2/olidbr.json",
        "metadata": "processed/olid-br/iterations/2/metadata.json",
        "full_data": "processed/olid-br/iterations/2/full_olidbr.json"
    },
    {
        "data": "processed/olid-br/iterations/3/olidbr.json",
        "metadata": "processed/olid-br/iterations/3/metadata.json",
        "full_data": "processed/olid-br/iterations/3/full_olidbr.json"
    },
    {
        "data": "processed/olid-br/iterations/4/olidbr.json",
        "metadata": "processed/olid-br/iterations/4/metadata.json",
        "full_data": "processed/olid-br/iterations/4/full_olidbr.json"
    }
]
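
The four entries follow one key pattern, so the list could also be generated instead of typed out. The following is only a minimal sketch, assuming the key layout shown above keeps holding for later iterations; `build_iterations` and its parameters are hypothetical names and not part of the project's `src` package.

```python
def build_iterations(count, first_with_full_data=2,
                     prefix="processed/olid-br/iterations"):
    """Return one dict of S3 keys per iteration, numbered from 1."""
    result = []
    for number in range(1, count + 1):
        base = f"{prefix}/{number}"
        entry = {
            "data": f"{base}/olidbr.json",
            "metadata": f"{base}/metadata.json",
        }
        # Assumption: only iterations from `first_with_full_data` onward
        # have a full_olidbr.json file (iteration 1 has none above).
        if number >= first_with_full_data:
            entry["full_data"] = f"{base}/full_olidbr.json"
        result.append(entry)
    return result


iterations = build_iterations(4)
```

With `count=4` this should reproduce the hand-written list, including the missing `full_data` key for iteration 1.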

In [6]:
data = []
metadata = []
full_data = []

for iteration in iterations:
    print(f"Loading {iteration['data']}")

    iteration_data = bucket.download_json(key=iteration["data"])
    iteration_metadata = bucket.download_json(key=iteration["metadata"])

    data.extend(iteration_data)
    metadata.extend(iteration_metadata)
    
    print(f"Data iteration size: {len(iteration_data)}")
    print(f"Metadata iteration size: {len(iteration_metadata)}")

    if iteration.get("full_data"):
        iteration_full_data = bucket.download_json(key=iteration["full_data"])
        full_data.extend(iteration_full_data)
        print(f"Full data iteration size: {len(iteration_full_data)}")

print(f"Data: {len(data)}")
print(f"Metadata: {len(metadata)}")
print(f"Full data: {len(full_data)}")

Loading processed/olid-br/iterations/1/olidbr.json
Data iteration size: 706
Metadata iteration size: 1520
Loading processed/olid-br/iterations/2/olidbr.json
Data iteration size: 2996
Metadata iteration size: 11984
Full data iteration size: 2996
Loading processed/olid-br/iterations/3/olidbr.json
Data iteration size: 2987
Metadata iteration size: 11948
Full data iteration size: 2987
Loading processed/olid-br/iterations/4/olidbr.json
Data iteration size: 1851
Metadata iteration size: 7404
Full data iteration size: 1851
Data: 8540
Metadata: 32856
Full data: 7834
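
The totals can be cross-checked against the per-iteration sizes. The sketch below only re-adds the numbers printed above; the `sizes` dictionary is an illustrative structure, not something the notebook itself builds.

```python
# Per-iteration sizes copied from the output above.
sizes = {
    1: {"data": 706, "metadata": 1520},  # no full-data file
    2: {"data": 2996, "metadata": 11984, "full_data": 2996},
    3: {"data": 2987, "metadata": 11948, "full_data": 2987},
    4: {"data": 1851, "metadata": 7404, "full_data": 1851},
}

totals = {
    name: sum(iteration.get(name, 0) for iteration in sizes.values())
    for name in ("data", "metadata", "full_data")
}

# Rows that have no full-data counterpart.
missing_full_data = totals["data"] - totals["full_data"]

print(totals)
print(missing_full_data)
```

The gap between data and full data (706 rows) matches the size of iteration 1, which is the only iteration without a full-data file.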


## Remove duplicated entries

In [7]:
df = pd.DataFrame(data)
print(f"Duplicated text: {df['text'].duplicated().sum()}")

df.drop_duplicates(subset="text", inplace=True)
print(df.shape)

data = df.to_dict("records")

Duplicated text: 37
(8503, 17)


In [8]:
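# Keep only the full-data records whose text is still present in the deduplicated data.
# A set makes each membership test constant-time.
kept_texts = set(df["text"])
full_data = [i for i in full_data if i["text"] in kept_texts]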

print(f"Full data: {len(full_data)}")

Full data: 7834
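
The count stays at 7834 because the filter tests membership only: `drop_duplicates` keeps the first occurrence of each text by default, so every distinct text is still present in `df`, and repeated full-data records with the same text all pass. If the intent is to keep one full-data record per text, something along the following lines could be used. This is a sketch, not the notebook's method; `dedupe_by_key` is a hypothetical helper and the sample records are made up for illustration.

```python
def dedupe_by_key(records, key):
    """Keep the first record for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Made-up sample: records "a" and "c" share the same text.
sample = [
    {"id": "a", "text": "x"},
    {"id": "b", "text": "y"},
    {"id": "c", "text": "x"},
]
deduped = dedupe_by_key(sample, "text")
```

Applied to the notebook's data, the call would be `dedupe_by_key(full_data, "text")`, which mirrors the keep-first behaviour of `drop_duplicates(subset="text")`.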


In [9]:
# Remove the metadata rows whose id was dropped together with the duplicated texts
print(f"Count metadata (before): {len(metadata)}")
kept_ids = set(df["id"])
metadata = [i for i in metadata if i["id"] in kept_ids]
print(f"Count metadata (after): {len(metadata)}")

Count metadata (before): 32856
Count metadata (after): 32657
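
After filtering, it may be worth confirming that no metadata row points at a missing data row. A minimal sketch, assuming both collections are lists of dicts with an `id` key as in the cells above; `orphaned_ids` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def orphaned_ids(metadata_rows, data_rows):
    """Return the metadata ids that have no matching data row."""
    data_ids = {row["id"] for row in data_rows}
    metadata_ids = {row["id"] for row in metadata_rows}
    return sorted(metadata_ids - data_ids)


# Made-up sample: id "c" has metadata but no data row.
sample_data = [{"id": "a"}, {"id": "b"}]
sample_metadata = [{"id": "a"}, {"id": "a"}, {"id": "c"}]

orphans = orphaned_ids(sample_metadata, sample_data)

# Number of metadata rows attached to each id.
rows_per_id = Counter(row["id"] for row in sample_metadata)
```

On the real data, `orphaned_ids(metadata, data)` should return an empty list once the filter above has run.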


## Profiling Report

In [10]:
profile = ProfileReport(
    pd.DataFrame(data),
    title=f"OLID-BR {version}",
    explorative=True)

profile.to_file(f"../../docs/reports/olidbr_{version}.html")


## Upload data to S3

Saving in CSV format.

In [11]:
bucket.upload_csv(
    data=pd.DataFrame(data),
    key=f"processed/olid-br/{version}/olidbr.csv")

bucket.upload_csv(
    data=pd.DataFrame(metadata),
    key=f"processed/olid-br/{version}/metadata.csv")

print("CSV Files uploaded.")

CSV Files uploaded.


Saving in JSON format.

In [12]:
bucket.upload_json(
    data=data,
    key=f"processed/olid-br/{version}/olidbr.json")

bucket.upload_json(
    data=metadata,
    key=f"processed/olid-br/{version}/metadata.json")

bucket.upload_json(
    data=full_data,
    key=f"processed/olid-br/{version}/full_olidbr.json")

print("JSON Files uploaded.")

JSON Files uploaded.
