# OLID-BR (Build Dataset)

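This notebook builds a versioned OLID-BR dataset: it combines the processed data from every iteration, generates a profiling report, and uploads the result to S3 in CSV and JSON formats.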

## Imports

In [1]:
import sys
from pathlib import Path

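# Add the project root (two levels up) to sys.path so that `src` can be imported.
# The membership check and the append must use the same path.
project_root = str(Path(".").absolute().parent.parent)

if project_root not in sys.path:
    sys.path.append(project_root)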

In [2]:
from dotenv import load_dotenv

# Load environment variables from the project's .env file
load_dotenv("../../.env")

True

In [3]:
import pandas as pd
from pandas_profiling import ProfileReport
from src.s3 import Bucket
from src.settings import AppSettings

In [4]:
args = AppSettings()

version = "v0.3-alpha"

bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)
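
Before creating the session, it can help to fail fast when a credential is missing. The following is a minimal sketch, assuming the three settings are read from environment variables of the same names; how `AppSettings` actually loads them is not shown in this notebook, and `missing_env_vars` is a hypothetical helper.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Setting names taken from the cell above; treating them as env vars is an assumption.
REQUIRED = ["AWS_S3_BUCKET", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```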

## Load data

The next cells list the processed files for each iteration, then download them from S3 and combine them into two lists, `data` and `metadata`.

In [5]:
iterations = [
    {
        "data": "processed/olid-br/iterations/1/olidbr.json",
        "metadata": "processed/olid-br/iterations/1/metadata.json"
    },
    {
        "data": "processed/olid-br/iterations/2/olidbr.json",
        "metadata": "processed/olid-br/iterations/2/metadata.json"
    },
    {
        "data": "processed/olid-br/iterations/3/olidbr.json",
        "metadata": "processed/olid-br/iterations/3/metadata.json"
    }
]
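
The three entries differ only in the iteration number, so the same list could also be generated. This is a sketch only; `build_iterations` and its `prefix` parameter are hypothetical names, not part of the project.

```python
def build_iterations(numbers, prefix="processed/olid-br/iterations"):
    """Build the data/metadata S3 key pairs for the given iteration numbers."""
    return [
        {
            "data": f"{prefix}/{n}/olidbr.json",
            "metadata": f"{prefix}/{n}/metadata.json",
        }
        for n in numbers
    ]


# Same three entries as the hand-written list
iterations = build_iterations(range(1, 4))
```

Adding a fourth iteration then only requires changing the range.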

In [6]:
data = []
metadata = []

for iteration in iterations:
    print(f"Loading {iteration['data']}")

    iteration_data = bucket.download_json(key=iteration["data"])
    iteration_metadata = bucket.download_json(key=iteration["metadata"])

    print(f"Data iteration size: {len(iteration_data)}")
    print(f"Metadata iteration size: {len(iteration_metadata)}")

    data.extend(iteration_data)
    metadata.extend(iteration_metadata)

print(f"Data: {len(data)}")
print(f"Metadata: {len(metadata)}")

Loading processed/olid-br/iterations/1/olidbr.json
Data iteration size: 706
Metadata iteration size: 1520
Loading processed/olid-br/iterations/2/olidbr.json
Data iteration size: 2996
Metadata iteration size: 12000
Loading processed/olid-br/iterations/3/olidbr.json
Data iteration size: 974
Metadata iteration size: 3896
Data: 4676
Metadata: 17416
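
The printed totals can be cross-checked against the per-iteration sizes. The figures below are copied from the output above; the ratio is only a descriptive statistic added here, not something the notebook computes.

```python
# Sizes copied from the cell output above
data_sizes = [706, 2996, 974]
metadata_sizes = [1520, 12000, 3896]

# The totals must equal the sum of the per-iteration sizes
assert sum(data_sizes) == 4676
assert sum(metadata_sizes) == 17416

# Metadata records per data record, by iteration
ratios = [round(m / d, 2) for d, m in zip(data_sizes, metadata_sizes)]
print(ratios)
```

Iterations 2 and 3 carry roughly four metadata records per data record, while iteration 1 carries about two.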


### Profiling Report

In [7]:
profile = ProfileReport(
    pd.DataFrame(data),
    title=f"OLID-BR {version}",
    explorative=True)

profile.to_file(f"../../docs/reports/olidbr_{version}.html")

### Upload data to S3

First, save the combined data and metadata as CSV files under the versioned prefix.

In [8]:
bucket.upload_csv(
    data=pd.DataFrame(data),
    key=f"processed/olid-br/{version}/olidbr.csv")

bucket.upload_csv(
    data=pd.DataFrame(metadata),
    key=f"processed/olid-br/{version}/metadata.csv")

print("CSV Files uploaded.")

CSV Files uploaded.


Then save the same data and metadata as JSON files under the versioned prefix.

In [9]:
bucket.upload_json(
    data=data,
    key=f"processed/olid-br/{version}/olidbr.json")

bucket.upload_json(
    data=metadata,
    key=f"processed/olid-br/{version}/metadata.json")

print("JSON Files uploaded.")

JSON Files uploaded.
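
How `Bucket.upload_json` serializes its input is not shown in this notebook. If it relies on the standard `json` module, one detail matters for Portuguese text: `json.dumps` escapes non-ASCII characters by default. Here is a minimal sketch with a made-up record (the `text` field is hypothetical):

```python
import json

record = {"text": "não é verdade"}  # made-up example, not taken from the dataset

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps accented characters

print(escaped)
print(readable)

# Both forms decode to the same object
assert json.loads(escaped) == json.loads(readable) == record
```

Either form round-trips correctly; `ensure_ascii=False` only makes the stored file smaller and easier to read by eye.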
