# OLID-BR (Build Dataset)

This notebook builds the OLID-BR dataset by merging the processed data from each iteration, profiling the result, and uploading it to S3 in CSV and JSON formats.

## Imports

In [96]:
import sys
from pathlib import Path

# Make the project root (two levels up) importable so that `src` can be found
project_root = str(Path(".").absolute().parent.parent)
if project_root not in sys.path:
    sys.path.append(project_root)

In [97]:
from dotenv import load_dotenv

# Initialize the env vars
load_dotenv("../../.env")

True

In [98]:
import pandas as pd
from pandas_profiling import ProfileReport
from src.s3 import Bucket
from src.settings import AppSettings

In [99]:
args = AppSettings()

bucket = Bucket(args.AWS_S3_BUCKET)

bucket.get_session_from_aksk(
    args.AWS_ACCESS_KEY_ID,
    args.AWS_SECRET_ACCESS_KEY)

## Load data

The next cells download the processed data and metadata for each iteration from S3 and concatenate them into two lists.

In [100]:
iterations = [
    {
        "data": "processed/olid-br/1/olidbr.json",
        "metadata": "processed/olid-br/1/metadata.json"
    },
    {
        "data": "processed/olid-br/2/olidbr.json",
        "metadata": "processed/olid-br/2/metadata.json"
    }
]

In [101]:
data = []
metadata = []

for iteration in iterations:
    data.extend(bucket.download_json(key=iteration["data"]))
    metadata.extend(bucket.download_json(key=iteration["metadata"]))

print(f"Data: {len(data)}")
print(f"Metadata: {len(metadata)}")

Data: 3702
Metadata: 13520
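The merge loop above can be tried offline with a stand-in for `Bucket`. The following is a minimal sketch, assuming `download_json` returns a list of records for each key; `FakeBucket`, `merge_iterations`, the keys, and the sample records are all hypothetical and are not part of `src.s3`.

```python
class FakeBucket:
    """Hypothetical in-memory stand-in for src.s3.Bucket (an assumption)."""

    def __init__(self, objects):
        # Maps an object key to the already-parsed JSON content.
        self._objects = objects

    def download_json(self, key):
        # Assumed behavior: return the list of records stored under `key`.
        return self._objects[key]


def merge_iterations(bucket, iterations):
    """Concatenate the data and metadata records of every iteration."""
    data, metadata = [], []
    for iteration in iterations:
        data.extend(bucket.download_json(key=iteration["data"]))
        metadata.extend(bucket.download_json(key=iteration["metadata"]))
    return data, metadata


# Made-up sample content, used only to exercise the loop.
fake_bucket = FakeBucket({
    "iter1/data.json": [{"id": 1}, {"id": 2}],
    "iter1/meta.json": [{"id": 1, "annotator": "A"}],
    "iter2/data.json": [{"id": 3}],
    "iter2/meta.json": [{"id": 3, "annotator": "B"}, {"id": 3, "annotator": "C"}],
})
sample_iterations = [
    {"data": "iter1/data.json", "metadata": "iter1/meta.json"},
    {"data": "iter2/data.json", "metadata": "iter2/meta.json"},
]

merged_data, merged_metadata = merge_iterations(fake_bucket, sample_iterations)
print(f"Data: {len(merged_data)}")
print(f"Metadata: {len(merged_metadata)}")
```

Because `extend` appends in list order, records from earlier iterations come first, and nothing is deduplicated across iterations.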


### Profiling Report

In [107]:
profile = ProfileReport(
    pd.DataFrame(data),
    title="OLID-BR v2",
    explorative=True)

profile.to_file("../../docs/reports/olidbr_v2.html")

### Upload data to S3

Save the merged data and metadata in CSV format.

In [96]:
bucket.upload_csv(
    data=pd.DataFrame(data),
    key="processed/olid-br/2/olidbr_full.csv")

bucket.upload_csv(
    data=pd.DataFrame(metadata),
    key="processed/olid-br/2/metadata_full.csv")

print("CSV Files uploaded.")

CSV Files uploaded.


Save the merged data and metadata in JSON format.

In [97]:
bucket.upload_json(
    data=data,
    key="processed/olid-br/2/olidbr_full.json")

bucket.upload_json(
    data=metadata,
    key="processed/olid-br/2/metadata_full.json")

print("JSON Files uploaded.")

JSON Files uploaded.
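Before uploading, it can help to confirm that both serializations carry the same records. The sketch below is one possible check, not the behavior of `upload_csv` or `upload_json`: it uses the standard-library `csv` and `json` modules instead of pandas, and `records_to_csv`, `records_to_json`, and the sample fields are hypothetical.

```python
import csv
import io
import json


def records_to_csv(records):
    """Serialize a list of dicts to CSV text (hypothetical helper)."""
    # Use the union of all keys so that no field is silently dropped.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def records_to_json(records):
    """Serialize a list of dicts to JSON text (hypothetical helper)."""
    # ensure_ascii=False keeps accented Portuguese characters readable.
    return json.dumps(records, ensure_ascii=False)


# Made-up sample records, used only for illustration.
sample_records = [
    {"id": "a1", "text": "olá"},
    {"id": "b2", "text": "bom dia"},
]

csv_text = records_to_csv(sample_records)
json_text = records_to_json(sample_records)

# Read both serializations back and compare the record counts.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)
print(len(csv_rows) == len(json_rows) == len(sample_records))
```

Note that CSV stores every value as text, so a round trip through CSV loses type information that JSON preserves.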
