# Datasets Pre-processing

This notebook preprocesses the datasets used in the comparative tests between multilingual LLMs and LLMs adapted to Brazilian Portuguese. It also selects the 6 demonstrations per dataset that are used with the In-Context Learning method.

The preprocessing consists of:

1. Reading the files in their original formats as DataFrames;

2. Removing unnecessary columns, keeping only the texts and their respective labels in the DataFrame;

3. Removing instances that are not of the `Positive` or `Negative` class;

4. Standardizing the names of the remaining columns;

5. Standardizing the class labels: `1` for the `Positive` sentiment and `-1` for the `Negative` sentiment;

6. Splitting the dataset into training and test subsets;

7. Saving the training and test subsets in CSV format;

8. Selecting the 6 demonstrations to be used in In-Context Learning from the training set: 3 from the positive class and 3 from the negative class;

9. Saving the dataset with the demonstrations.
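
As a rough illustration of steps 2 to 5, the sketch below applies them to a small made-up DataFrame. The column names `Sentence` and `Polarity`, the raw label strings, and the sample rows are assumptions for illustration only; each real dataset has its own column names and label encoding.

```python
import pandas as pd

# Toy data (hypothetical): raw labels are "positive", "negative" and "neutral".
raw = pd.DataFrame({
    "File": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "Sentence": ["Great product", "Terrible battery", "It arrived today", "Works well"],
    "Polarity": ["positive", "negative", "neutral", "positive"],
})

label_mapping = {"positive": 1, "negative": -1}

clean = (
    raw[["Sentence", "Polarity"]]                                   # step 2: keep text and label only
    .loc[lambda d: d["Polarity"].isin(list(label_mapping))]         # step 3: drop other classes
    .rename(columns={"Sentence": "text", "Polarity": "label"})      # step 4: standard column names
    .assign(label=lambda d: d["label"].map(label_mapping))          # step 5: standard labels 1 / -1
    .reset_index(drop=True)
)
print(clean)
```

The neutral row is filtered out before the labels are mapped, so the resulting `label` column contains only `1` and `-1`.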

# Installations

Installing packages that are not available by default in Google Colab

In [None]:
! pip install transformers datasets


# 00 - Google Drive Mount

Connect to Google Drive for reading and writing files.








In [None]:
from google.colab import drive
drive.mount('./gdrive', force_remount=True)

Mounted at ./gdrive


# 01 - Imports

Import the necessary libraries.

In [None]:
import os
import glob
import re
import numpy as np
import pandas as pd
import json
import requests


from google.colab import userdata
from sklearn.model_selection import train_test_split
from datasets import load_dataset
from pathlib import Path

# 02 - Constants
Two constants are defined in this stage, `RAW_DATA_ROOT_PATH` and `PREPROCESSED_DATA_ROOT_PATH`. Both are read from the Colab secrets through `userdata.get`.

The constant `RAW_DATA_ROOT_PATH` stores the path to the folder that holds the raw data.

The constant `PREPROCESSED_DATA_ROOT_PATH` stores the path to the folder where the preprocessed data will be saved.


In [None]:
RAW_DATA_ROOT_PATH = userdata.get("IA_DATA_RAW")
PREPROCESSED_DATA_ROOT_PATH = userdata.get("IA_DATA_PREPROCESSED")
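
`userdata.get` is only available inside Google Colab. When running the notebook elsewhere, one possible alternative is to read the same two names from environment variables. The sketch below assumes this setup; the helper `resolve_data_root` and the default folders are hypothetical and not part of the original notebook.

```python
import os

def resolve_data_root(name: str, default: str) -> str:
    """Hypothetical helper: read a data root from an environment variable, else use a default."""
    return os.environ.get(name, default)

# Returned as plain strings so that later concatenations such as
# RAW_DATA_ROOT_PATH + "/CSP_Eletronicos" keep working unchanged.
RAW_DATA_ROOT_PATH = resolve_data_root("IA_DATA_RAW", "./data/raw")
PREPROCESSED_DATA_ROOT_PATH = resolve_data_root("IA_DATA_PREPROCESSED", "./data/preprocessed")
```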


# 03 - Functions

Definition of the functions that will be used during data preprocessing.

### 03.01 - General purpose functions

In [None]:
def check_str_or_path(path:str|Path)->bool:
    """
    Check if the given path is a string or a Path object.

    Args:
        path (str | Path): The path to check.

    Returns:
        bool: True if the path is either a string or a Path object, False otherwise.
    """
    return isinstance(path, (str, Path))

def check_file_exists(path:str|Path)->bool:
    """
    Check if a file exists at the given path.

    Args:
        path (str | Path): The path to the file.

    Returns:
        bool: True if the file exists, False otherwise.
    """
    if check_str_or_path(path):
        return os.path.isfile(path)
    # Anything that is not a str or Path cannot point to a file.
    return False

def check_directory_exists(path:str|Path)->bool:
    """
    Check if a directory exists at the given path.

    Args:
        path (str | Path): The path to the directory.

    Returns:
        bool: True if the directory exists, False otherwise.
    """
    if check_str_or_path(path):
        return os.path.isdir(path)
    # Anything that is not a str or Path cannot point to a directory.
    return False

def create_directory(path:str|Path)->Path:
    """
    Create a directory at the given path if it does not already exist.

    Args:
        path (str | Path): The path where the directory will be created.

    Returns:
        Path: The Path object of the directory, whether it was just created or already existed.

    Raises:
        TypeError: If the path is neither a string nor a Path object.
    """
    if not check_str_or_path(path):
        raise TypeError(f"Expected str or Path, got {type(path).__name__}")
    directory = Path(path)
    # parents=True also creates missing intermediate folders; exist_ok=True makes the call idempotent.
    directory.mkdir(parents=True, exist_ok=True)
    return directory
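
The directory helpers can also be expressed with `pathlib` alone. The sketch below is a minimal example under that assumption; `ensure_directory` is a hypothetical name, and the folders are created inside a temporary directory so nothing is written to Google Drive.

```python
import tempfile
from pathlib import Path
from typing import Union

def ensure_directory(path: Union[str, Path]) -> Path:
    """Hypothetical helper: create the directory (and missing parents) if needed and return it."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory

with tempfile.TemporaryDirectory() as tmp:
    nested = ensure_directory(Path(tmp) / "preprocessed" / "CSP_Eletronicos")
    created = nested.is_dir()
    again = ensure_directory(nested)  # calling it a second time changes nothing
print(created)
```

Because `exist_ok=True` is used, the helper can be called unconditionally, without checking first whether the directory exists.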

### 03.02 - Data Pipeline Functions

In [None]:
def adjust_columns_names(df:pd.DataFrame, mapping:dict)->pd.DataFrame:
    """
    Adjust the column names of a DataFrame based on a provided mapping.

    Args:
        df (pd.DataFrame): The DataFrame whose columns need to be renamed.
        mapping (dict): A dictionary where keys are old column names and values are new column names.

    Returns:
        pd.DataFrame: The DataFrame with renamed columns.
    """
    return df.rename(columns=mapping)

def adjust_labels(df:pd.DataFrame,  lable_column_name:str, mapping:dict)->pd.DataFrame:
    """
    Adjust the labels in a specified column of a DataFrame based on a provided mapping.

    Args:
        df (pd.DataFrame): The DataFrame containing the labels to be adjusted.
        lable_column_name (str): The name of the column containing the labels to be adjusted.
        mapping (dict): A dictionary where keys are old labels and values are new labels.

    Returns:
        pd.DataFrame: The DataFrame with adjusted labels.
    """
    df[lable_column_name] = df[lable_column_name].map(mapping)
    return df

def drop_unused_labels(df:pd.DataFrame, lable_column_name:str, condition:list, condition_type:str)->pd.DataFrame:
    """
    Drop unused labels from a specified column of a DataFrame based on a condition.

    Args:
        df (pd.DataFrame): The DataFrame containing the labels to be dropped.
        lable_column_name (str): The name of the column containing the labels to be dropped.
        condition (list): A list of labels to keep or drop.
        condition_type (str): The type of condition, either "keep" or "drop".

    Returns:
        pd.DataFrame: The DataFrame with unused labels dropped.
    """
    if condition_type == "keep":
        df = df[df[lable_column_name].isin(condition)].reset_index(drop=True)

    else:
        df = df[~df[lable_column_name].isin(condition)].reset_index(drop=True)

    return df

def drop_unused_columns(df:pd.DataFrame, column_name:str|list, condition_type:str)->pd.DataFrame:
    """
    Drop unused columns from a DataFrame based on a condition.

    Args:
        df (pd.DataFrame): The DataFrame containing the columns to be dropped.
        column_name (str | list): The name or list of names of the columns to be kept or dropped.
        condition_type (str): The type of condition, either "keep" or "drop".

    Returns:
        pd.DataFrame: The DataFrame with unused columns dropped.
    """
    # Wrap a single name in a list; list("abc") would split it into characters.
    if not isinstance(column_name, list):
        column_name = [column_name]

    if condition_type == "keep":
        df = df[column_name]

    else:
        df = df.drop(columns=column_name)

    return df

def create_train_test_split(df:pd.DataFrame, features:list|str, target:str)->dict:
    """
    Create stratified train and test splits (80/20) from a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to be split.
        features (list | str): The list of feature columns or a single feature column.
        target (str): The target column, also used for stratification.
    Returns:
        dict: A dictionary with the keys "train" and "test", each holding a DataFrame.
    """
    # Normalize both arguments to lists so that column selection always yields DataFrames.
    features_list = features if isinstance(features, list) else [features]
    target_list = target if isinstance(target, list) else [target]

    X = df[features_list]
    y = df[target_list]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    train = pd.concat([X_train, y_train], axis=1)
    test = pd.concat([X_test, y_test], axis=1)
    return {"train":train, "test":test}

def persist_splits_as_csv(split:dict, directory_path:str|Path, file_name:str)->None:
    """
    Persist train and test splits as CSV files in a specified directory.

    Args:
        split (dict): A dictionary containing the train and test splits.
        directory_path (str | Path): The path to the directory where the CSV files will be saved.
        file_name (str): The base name for the CSV files.

    Returns:
        None
    """
    if not isinstance(directory_path, Path):
        directory_path = Path(directory_path)

    if not check_directory_exists(directory_path):
        create_directory(directory_path)

    for key in split.keys():
        file_full_path = directory_path.joinpath(f"{file_name}_{key}.csv")
        split[key].to_csv(f"{file_full_path}", index=False, encoding="utf-8")
    return
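
To make the effect of the stratified split more concrete, the sketch below runs `train_test_split` with the same settings (`test_size=0.2`, `random_state=42`, stratification on the label) on a made-up, balanced DataFrame. The toy rows are an assumption for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 positive and 10 negative texts.
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(20)],
    "label": [1] * 10 + [-1] * 10,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["text"]], toy[["label"]],
    test_size=0.2, random_state=42, stratify=toy[["label"]],
)
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

# Stratification keeps the class proportions of the full data in both subsets.
print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())
```

With 20 balanced rows, the training subset holds 16 rows and the test subset 4, each with as many positive as negative examples.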



## 04 - Datasets

Datasets preprocessing.

### 04.01 - CSP_Eletronicos

**Reference**
<br>
[Belisário, L., Luiz G., F., and Thiago A. S., P. (2019). Classificação de subjetividade para o português: Métodos baseados em aprendizado de máquina e em léxico. In 27º Simpósio Internacional de Iniciação Científica e Tecnológica da USP (SIICUSP), pages 1–1.](https://drive.google.com/file/d/1NObaSVn4ryYMMmAjZLrXcRiEmPuppRkK/view)

<br>

**Dataset Link**
<br>
[CSP_Eletronicos](https://github.com/Luizgferreira/subjectivity-classifier/blob/master/src/data/raw/sentencas.xlsx)


#### Defining Paths

In [None]:
csp_eletronicos_raw_path = RAW_DATA_ROOT_PATH + "/CSP_Eletronicos"
csp_eletronicos_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/CSP_Eletronicos"
file_name="csp-eletronicos"

if not check_directory_exists(path=csp_eletronicos_raw_path):
    create_directory(path=csp_eletronicos_raw_path)

if not check_directory_exists(path=csp_eletronicos_preprocessed_path):
    create_directory(path=csp_eletronicos_preprocessed_path)

#### Reading data

In [None]:
df = pd.read_excel(f"{csp_eletronicos_raw_path}/sentencas.xlsx")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Arquivo     233 non-null    object
 1   Sentença    233 non-null    object
 2   Polaridade  233 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 5.6+ KB


#### Pre-processing pipeline

In [None]:
column_name_mapping = {
    'Sentença':'text',
    'Polaridade':'label'
}

splits = (
    df.pipe(drop_unused_columns, column_name=['Sentença', 'Polaridade'], condition_type='keep' )
    .pipe(drop_unused_labels, lable_column_name="Polaridade", condition = [0], condition_type = "drop")
    .pipe(adjust_columns_names, mapping = column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=csp_eletronicos_preprocessed_path,
    file_name=file_name
    )

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (152, 2)


Test dataset size: (38, 2)


Label distribution Train dataset:
 label
 1    0.690789
-1    0.309211
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
142,UM PRODUTO COM PREÇO EXCELENTE E D BOA QUALIDADE,1
41,"A televisão em si é boa mas, falta uma aceitaç...",1
124,MUITO BOM E DE FÁCIL MANUSEIO,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
67,Positiva - Portabilidade\nNegativa - Peso. Dev...,-1
182,Engodo completo. Para uma câmera com 8.1 megap...,-1
187,se ela não esticesse travando e apagando as fo...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{csp_eletronicos_preprocessed_path}/{file_name}_demo.csv', index=False)
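
The two per-class draws and the concatenation can also be written as a single grouped sample. The sketch below is one possible alternative, not the method used in this notebook; the toy training split is made up, and a grouped draw will not necessarily pick the same rows as the per-class calls above.

```python
import pandas as pd

# Toy training split (hypothetical rows): 6 positive and 6 negative texts.
train = pd.DataFrame({
    "text": [f"review {i}" for i in range(12)],
    "label": [1, -1] * 6,
})

# Draw 3 rows per class in one step; random_state keeps the draw reproducible.
demo_df = train.groupby("label").sample(n=3, random_state=42)

print(demo_df["label"].value_counts().to_dict())
```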

### 04.02 - CSP_Livros

**Reference**
<br>
[Belisário, L., Luiz G., F., and Thiago A. S., P. (2019). Classificação de subjetividade para o português: Métodos baseados em aprendizado de máquina e em léxico. In 27º Simpósio Internacional de Iniciação Científica e Tecnológica da USP (SIICUSP), pages 1–1.](https://drive.google.com/file/d/1NObaSVn4ryYMMmAjZLrXcRiEmPuppRkK/view)

<br>

**Dataset Link**
<br>
[CSP_Livros](https://github.com/Lubelisa/Natural-Linguage-Processing/tree/master/Corpus%20of%20Book%20Reviews)

#### Defining Paths

In [None]:
csp_livros_raw_path = RAW_DATA_ROOT_PATH + "/CSP_Livros"
csp_livros_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/CSP_Livros"
file_name="csp-livros"

if not check_directory_exists(path=csp_livros_raw_path):
    create_directory(path=csp_livros_raw_path)

if not check_directory_exists(path=csp_livros_preprocessed_path):
    create_directory(path=csp_livros_preprocessed_path)

#### Reading data

In [None]:
df = pd.read_csv(f"{csp_livros_raw_path}/corpus_book_reviews_portuguese.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FRASE       350 non-null    object 
 1   Unnamed: 1  0 non-null      float64
 2   OBJ/SUBJ    350 non-null    object 
 3   POLARIDADE  350 non-null    object 
dtypes: float64(1), object(3)
memory usage: 11.1+ KB


#### Pre-processing pipeline

In [None]:
label_mapping = {
    "positiva":1,
    "negativa":-1
    }

column_name_mapping = {
    "FRASE":"text",
    "POLARIDADE":"label"
}


splits = (
    df.pipe(drop_unused_columns, column_name=["FRASE", "POLARIDADE"], condition_type="keep")
    .pipe(drop_unused_labels, lable_column_name="POLARIDADE", condition=["positiva", "negativa"], condition_type="keep")
    .pipe(adjust_labels, lable_column_name="POLARIDADE", mapping=label_mapping)
    .pipe(adjust_columns_names, mapping = column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=csp_livros_preprocessed_path,
    file_name=file_name)

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (140, 2)


Test dataset size: (35, 2)


Label distribution Train dataset:
 label
-1    0.5
 1    0.5
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
137,O livro prometeu uma história de memórias e fo...,1
104,O livro tem uma ótima linguagem e aborda temas...,1
148,E estou dentro dessa fatia de fãs filmes de te...,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
10,"O começo do livro foi bom, não vou mentir, tal...",-1
54,Eu estava com uma super expectativa para lê-lo...,-1
29,Em vez da autora aprofundar ainda mais no roma...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{csp_livros_preprocessed_path}/{file_name}_demo.csv', index=False)

### 04.03 - Computer-BR

**Reference**
<br>
[Moraes, S., Santos, A., Redecker, M., et al. (2016). Comparing approaches to subjectivity classification: A study on Portuguese tweets. In Silva, J., Ribeiro, R., Quaresma, P., Adami, A., and Branco, A., editors, Lecture Notes in Computer Science, volume 9727, pages 86–94. Springer International Publishing.](https://doi.org/10.1007/978-3-319-41552-9_8)

<br>

**Dataset Link**
<br>
[Computer-BR](https://github.com/Luizgferreira/subjectivity-classifier/blob/master/src/data/raw/Computer-BR.xlsx)

#### Defining Paths

In [None]:
computer_br_raw_path = RAW_DATA_ROOT_PATH + "/Computer-BR"
computer_br_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/Computer-BR"
file_name="computer-br"

if not check_directory_exists(path=computer_br_raw_path):
    create_directory(path=computer_br_raw_path)

if not check_directory_exists(path=computer_br_preprocessed_path):
    create_directory(path=computer_br_preprocessed_path)

#### Reading data

In [None]:
df = pd.read_excel(f"{computer_br_raw_path}/Computer-BR.xlsx", sheet_name="Pesquisa")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2317 entries, 0 to 2316
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Data                2317 non-null   object 
 1   FINAL               2317 non-null   int64  
 2   Mensagem            2317 non-null   object 
 3   Unnamed: 3          0 non-null      float64
 4   Unnamed: 4          0 non-null      float64
 5   Unnamed: 5          0 non-null      float64
 6   Unnamed: 6          0 non-null      float64
 7   Unnamed: 7          0 non-null      float64
 8   Parametro de Busca  2317 non-null   object 
dtypes: float64(5), int64(1), object(3)
memory usage: 163.0+ KB


#### Pre-processing pipeline

In [None]:
label_mapping = {
    1:1,
    -1:-1,
    -2:-1
}

column_name_mapping ={
    "FINAL": "label",
    "Mensagem":"text"
}

splits = (
    df.pipe(drop_unused_columns, column_name=['FINAL', 'Mensagem'], condition_type="keep")
    .pipe(drop_unused_labels, lable_column_name="FINAL", condition=[0, 2], condition_type="drop")
    .pipe(adjust_labels, lable_column_name='FINAL', mapping=label_mapping)
    .pipe(adjust_columns_names, mapping=column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=computer_br_preprocessed_path,
    file_name=file_name)

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (512, 2)


Test dataset size: (128, 2)


Label distribution Train dataset:
 label
-1    0.691406
 1    0.308594
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
222,"Só compro notebook da Dell agora, sem comparações",1
18,@DellnoBrasil compraria fácil um celular da De...,1
166,Finalmente meu notebook novo chegou! Agora eu ...,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
430,Vlw dell por disponibilizar uma atualização qu...,-1
325,P/ quem tem notebook @Dell se prepare: qdo su...,-1
483,"Aff que bosta, meu note ta travando muito, de ...",-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{computer_br_preprocessed_path}/{file_name}_demo.csv', index=False)

### 04.04 - Corpus-4p

**Reference**
<br>
[Silva, R. R. and Pardo, T. A. S. (2019). Corpus 4p: um córpus anotado de opiniões em português sobre produtos eletrônicos para fins de sumarização contrastiva de opinião. In Proceedings of the 6a Jornada de Descrição do Português (JDP), pages 1–9. Sociedade Brasileira de Computação.](http://drive.google.com/file/d/1Nqu66l-z7eQenXEsvcnAEClt1LQzioJw/view)

<br>

**Dataset Link**
<br>
[Corpus-4p](https://github.com/raphsilva/corpus-4p)

#### Defining Paths

In [None]:
corpus4p_raw_path = RAW_DATA_ROOT_PATH + "/Corpus-4p"
corpus4p_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/Corpus-4p"
file_name="corpus-4p"

if not check_directory_exists(path=corpus4p_raw_path):
    create_directory(path=corpus4p_raw_path)

if not check_directory_exists(path=corpus4p_preprocessed_path):
    create_directory(path=corpus4p_preprocessed_path)

#### Reading data

In [None]:
files = [10, 11, 30, 31]
base_url = "https://raw.githubusercontent.com/raphsilva/corpus-4p/master/dataset/whole/json/"


data = []
for file in files:
    req = requests.get(f"{base_url}{str(file)}.json")
    response = req.json()
    for item in response.get("data"):
        item_id = item.get("id")
        try:
            if item.get("excerpts"):
                for index, value in enumerate(item.get("excerpts")):
                    text = value
                    label = item["opinions"][index][1]
                    data.append([file, item_id, text, label])
            else:
                text = item["sentence"]
                label = item["opinions"][0][1]
                data.append([file, item_id, text, label])
        except (KeyError, IndexError, TypeError):
            # Skip items whose structure does not match the expected layout.
            pass

df = pd.DataFrame(data, columns=["file", "id", "text", "label"])
df = df.drop_duplicates(subset=["text","label"])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1594 entries, 0 to 1901
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   file    1594 non-null   int64 
 1   id      1594 non-null   int64 
 2   text    1594 non-null   object
 3   label   1594 non-null   object
dtypes: int64(2), object(2)
memory usage: 62.3+ KB
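
The flattening logic of the loop above can be tried offline on a few hand-written records. The sketch below assumes that each item carries an `id`, a `sentence`, an optional `excerpts` list, and an `opinions` list of pairs whose second element is the polarity. This shape is inferred from the loop and the records themselves are made up, so the actual JSON files may contain additional fields.

```python
# Hypothetical records shaped like the items the loop expects.
records = [
    {"id": 1, "sentence": "Bateria ótima.", "opinions": [["BATERIA", "+"]]},
    {
        "id": 2,
        "sentence": "Tela boa, mas câmera ruim.",
        "excerpts": ["Tela boa", "câmera ruim"],
        "opinions": [["TELA", "+"], ["CAMERA", "-"]],
    },
    {"id": 3, "sentence": "Chegou ontem.", "opinions": []},  # no opinion, so it is skipped
]

rows = []
for item in records:
    # One row per excerpt when excerpts exist, otherwise one row for the whole sentence.
    excerpts = item.get("excerpts") or [item["sentence"]]
    for index, text in enumerate(excerpts):
        try:
            label = item["opinions"][index][1]
        except IndexError:
            continue
        rows.append([item["id"], text, label])

print(rows)
```

Each excerpt is paired with the opinion at the same position, which is why an item with two excerpts produces two rows.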


#### Pre-processing pipeline

In [None]:
label_mapping = {
    "+":1,
    "++":1,
    "+.":1,
    "-":-1,
    "--":-1,
    ".-":-1
}


splits = (
    df.pipe(drop_unused_columns, column_name=['text', 'label'], condition_type="keep")
    .pipe(drop_unused_labels, lable_column_name="label", condition=list(label_mapping.keys()), condition_type="keep")
    .pipe(adjust_labels, lable_column_name='label', mapping=label_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=corpus4p_preprocessed_path,
    file_name=file_name)
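
The order of the pipeline steps matters: `Series.map` returns `NaN` for any value that is missing from the mapping, so the rows with other labels are filtered out before the labels are mapped. The sketch below shows the difference on a made-up Series; the `"."` label is a hypothetical example of a value outside the mapping.

```python
import pandas as pd

label_mapping = {"+": 1, "++": 1, "-": -1, "--": -1}
labels = pd.Series(["+", "--", ".", "++"])  # "." is a hypothetical label not in the mapping

# Mapping directly turns the unmapped label into NaN (and the column into floats).
mapped_directly = labels.map(label_mapping)

# Filtering first keeps only known labels, so the mapped result stays integer.
filtered_then_mapped = labels[labels.isin(list(label_mapping))].map(label_mapping)

print(mapped_directly.isna().sum())
print(filtered_then_mapped.tolist())
```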

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (1111, 2)


Test dataset size: (278, 2)


Label distribution Train dataset:
 label
 1    0.820882
-1    0.179118
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
988,"O celular possui design bastante sofisticado, ...",1
970,Maravilhosa.,1
142,É um produto com uma ótima qualidade.,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
60,Só acho que devem melhorar a qualidade da câme...,-1
799,O NFC não é compatível com os cartões Mirafire...,-1
1162,Por isso acabei devolvendo o aparelho para ten...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{corpus4p_preprocessed_path}/{file_name}_demo.csv', index=False)

### 04.05 - IMDB_PT

**Reference**
<br>
[Maas, A. L., Daly, R. E., Pham, P. T., et al. (2011). Learning word vectors for
sentiment analysis.](https://aclanthology.org/P11-1015)

[Pires, R., Abonizio, H., Almeida, T. S., et al. (2023). Sabiá: Portuguese large language models. In Naldi, M. C. and Bianchi, R. A. C., editors, Lecture Notes in Computer Science, pages 226–240. Springer Nature Switzerland.](https://arxiv.org/abs/2304.07880)


<br>

**Dataset Link**
<br>
[IMDB_PT](https://huggingface.co/datasets/maritaca-ai/imdb_pt)

#### Defining Paths

In [None]:
imdb_raw_path = RAW_DATA_ROOT_PATH + "/IMDB_PT"
imdb_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/IMDB_PT"
file_name="imdb-pt"

if not check_directory_exists(path=imdb_raw_path):
    create_directory(path=imdb_raw_path)

if not check_directory_exists(path=imdb_preprocessed_path):
    create_directory(path=imdb_preprocessed_path)

#### Reading data

In [None]:
dataset = load_dataset("maritaca-ai/imdb_pt")
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

#### Pre-processing pipeline

In [None]:
label_mapping = {
    0: -1,
    1:1
}

train = train.pipe(adjust_labels, lable_column_name='label', mapping=label_mapping)
test = test.pipe(adjust_labels, lable_column_name='label', mapping=label_mapping)

splits = {
    "train": train,
    "test":test
}

persist_splits_as_csv(
    split=splits,
    directory_path=imdb_preprocessed_path,
    file_name=file_name)

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (25000, 2)


Test dataset size: (5000, 2)


Label distribution Train dataset:
 label
-1    0.5
 1    0.5
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
14266,Este filme é para o Halloween o que a hilária ...,1
24419,Bom Western filmado no Rocky Arizona Wilds. Mu...,1
21409,"O melhor filme de John Singleton, antes de Blo...",1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
1766,"Uau, que total decepcionado!O fato de as pesso...",-1
11919,"Se Bob Ludlum visse essa mini série, ele teria...",-1
8909,Plantar um filme sobre um fantasma aleijado se...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{imdb_preprocessed_path}/{file_name}_demo.csv', index=False)

### 04.06 - MTMSLA

**Reference**
<br>
[Araujo, M., Reis, J., Pereira, A., et al. (2016). An evaluation of machine translation for multilingual sentence-level sentiment analysis. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 1140–1145. Association for Computing Machinery. https://doi.org/10.1145/2851613.2851817](https://dl.acm.org/doi/10.1145/2851613.2851817)

<br>

**Dataset Link**
<br>
[MTMSLA](https://homepages.dcc.ufmg.br/%7efabricio/sentiment-languages-dataset/index.htm)

#### Defining Paths

In [None]:
mtmsla_raw_path = RAW_DATA_ROOT_PATH + "/MTMSLA"
mtmsla_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/MTMSLA"
file_name="mtmsla"

if not check_directory_exists(path=mtmsla_raw_path):
    create_directory(path=mtmsla_raw_path)

if not check_directory_exists(path=mtmsla_preprocessed_path):
    create_directory(path=mtmsla_preprocessed_path)

#### Reading data

In [None]:
df = pd.read_excel(f"{mtmsla_raw_path}/mtmsla.xlsx", sheet_name="portuguese", header=None)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 774 entries, 0 to 773
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       774 non-null    object
 1   1       774 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 12.2+ KB


#### Pre-processing pipeline

In [None]:
column_name_mapping = {
        0 : "text",
        1: "label"
    }

splits = (
    df.pipe(drop_unused_labels, lable_column_name=1, condition=[-1, 1], condition_type="keep")
    .pipe(adjust_columns_names, mapping=column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=mtmsla_preprocessed_path,
    file_name=file_name)


#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (408, 2)


Test dataset size: (102, 2)


Label distribution Train dataset:
 label
 1    0.583333
-1    0.416667
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
375,salve o tricolor paulista; amado clube brasile...,1
302,Uma das melhores e mais inspiradoras propagand...,1
488,Boa noiite Fiel! ;) #CorinthiansIsBiggerThanCN...,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
83,Ou eu assisto essas danadas; ou assisto as aul...,-1
221,"""Neymar foi engolido pela monstro do deslumbra...",-1
180,Q palhaçada eh essa?! Colocaram um papel escri...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{mtmsla_preprocessed_path}/{file_name}_demo.csv', index=False)

### 04.07 - OPCovidBR

**Reference**
<br>
[Vargas, F. A., Sanches, R., and Rocha, P. R. (2020). Identifying fine-grained opinion and classifying polarity on coronavirus pandemic. In Proceedings of the 9th Brazilian Conference on Intelligent Systems (BRACIS 2020), pages 511–520. Springer-Verlag. https://doi.org/10.1007/978-3-030-61377-8_35](https://doi.org/10.1007/978-3-030-61377-8_35)

<br>

**Dataset Link**
<br>
[OPCovidBR](https://github.com/franciellevargas/OPCovidBR/blob/master/data/opcovid-br/opcovidbr.csv)

#### Defining Paths

In [None]:
opcovidbr_raw_path = RAW_DATA_ROOT_PATH + "/OPCovidBR"
opcovidbr_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/OPCovidBR"
file_name="opcovidbr"

if not check_directory_exists(path=opcovidbr_raw_path):
    create_directory(path=opcovidbr_raw_path)

if not check_directory_exists(path=opcovidbr_preprocessed_path):
    create_directory(path=opcovidbr_preprocessed_path)

#### Reading data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/franciellevargas/OPCovidBR/master/data/opcovid-br/opcovidbr.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1211 entries, 0 to 1210
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Id        1211 non-null   int64  
 1   twitter   1211 non-null   object 
 2   polarity  613 non-null    float64
 3   aspect1   613 non-null    object 
 4   aspect2   613 non-null    object 
 5   aspect3   613 non-null    object 
 6   aspect4   613 non-null    object 
dtypes: float64(1), int64(1), object(5)
memory usage: 66.4+ KB


#### Pre-processing pipeline

In [None]:
column_name_mapping = {
    'twitter':'text',
    'polarity':'label'
}

splits = (
    df.pipe(drop_unused_columns, column_name=["twitter", "polarity"], condition_type="keep")
    .pipe(drop_unused_labels, lable_column_name='polarity', condition=[-1, 1], condition_type="keep")
    .pipe(adjust_columns_names, mapping=column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=opcovidbr_preprocessed_path,
    file_name=file_name)

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (490, 2)


Test dataset size: (123, 2)


Label distribution Train dataset:
 label
 1.0    0.502041
-1.0    0.497959
Name: proportion, dtype: float64
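
The labels above are printed as `1.0` and `-1.0`, most likely because the raw `polarity` column contains missing values, which makes pandas store it as `float64`. The following is a minimal sketch, on made-up toy data, of how the labels could be cast to integers after filtering so that they match the `1`/`-1` convention of the other datasets. The `raw` and `labelled` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the raw frame: unannotated tweets have a missing polarity.
raw = pd.DataFrame({
    "twitter": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "polarity": [1.0, None, -1.0, None],
})

# Keep only the annotated rows, then cast the float labels to integers.
labelled = raw[raw["polarity"].isin([-1, 1])].copy()
labelled["polarity"] = labelled["polarity"].astype(int)

print(labelled["polarity"].tolist())  # [1, -1]
```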


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
554,Que todo álcool que tô passando nas mãos se tr...,1.0
411,os cientistas dos eua disseram a casa branca q...,1.0
349,"nas contas do ministério da saúde, país tem ma...",1.0


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
449,qualquer um com um mínimo de experiência geren...,-1.0
330,mp vai apurar por que funai não usou recursos ...,-1.0
598,Tinha 33 anos e mais de 1 milhão de seguidores...,-1.0


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{opcovidbr_preprocessed_path}/{file_name}_demo.csv', index=False)
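
The positive and negative sampling cells, followed by the concatenation, are repeated for every dataset. As a hedged sketch, the same steps could be wrapped in one helper. `sample_demonstrations` is a hypothetical name, not a function defined in this notebook, and the toy frame below is made up.

```python
import pandas as pd

def sample_demonstrations(train_df, label_column="label", n_per_class=3, seed=42):
    """Draw n_per_class rows for label 1 and then for label -1."""
    parts = []
    for label in (1, -1):
        subset = train_df[train_df[label_column] == label]
        # Fail early if a class is too small to sample from.
        if len(subset) < n_per_class:
            raise ValueError(f"Only {len(subset)} rows with label {label}")
        parts.append(subset.sample(n=n_per_class, random_state=seed))
    return pd.concat(parts)

# Toy training set with five rows per class.
toy_train = pd.DataFrame({
    "text": [f"example {i}" for i in range(10)],
    "label": [1, -1] * 5,
})

demo_df = sample_demonstrations(toy_train)
print(demo_df["label"].tolist())  # [1, 1, 1, -1, -1, -1]
```

Note that writing the result with `index=False`, as the cell above does, discards the original row indices, so the demonstrations can no longer be matched back to the training CSV by index.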

### 03.08 - ReLI

**Reference**
<br>
Freitas, C., Motta, E., Milidiú, R., et al. (2014). Sparkling vampire... lol! annotating opinions in a book review corpus. New language technologies and linguistic research: a two-way road, pages 128–146.

<br>

**Dataset Link**
<br>
[ReLI](https://www.linguateca.pt/Repositorio/ReLi/)

#### Defining Paths

In [None]:
reli_raw_path = RAW_DATA_ROOT_PATH + "/ReLI"
reli_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/ReLI"
file_name="reli"

if not check_directory_exists(path=reli_raw_path):
    create_directory(path=reli_raw_path)

if not check_directory_exists(path=reli_preprocessed_path):
    create_directory(path=reli_preprocessed_path)

#### Reading data

In [None]:
!wget  https://raw.githubusercontent.com/pedrobalage/ReLi_Experiments/master/ReLi.py

--2025-01-13 15:42:59--  https://raw.githubusercontent.com/pedrobalage/ReLi_Experiments/master/ReLi.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36364 (36K) [text/plain]
Saving to: ‘ReLi.py’


2025-01-13 15:43:00 (4.62 MB/s) - ‘ReLi.py’ saved [36364/36364]



In [None]:
from ReLi import ReLiCorpusReader

reli_raw_path = RAW_DATA_ROOT_PATH + '/ReLI'
reli_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + '/ReLI'
file_name="reli"

reli_txt_files_paths =  glob.glob(f"{reli_raw_path}/**.txt")


corpus = ReLiCorpusReader(path=reli_raw_path)

lista = []
for book in corpus.keys():
    for review in corpus[book]:
        score = corpus[book][review].get('score', None)
        if corpus[book][review].get('sentences'):
            for sentence in corpus[book][review]['sentences']:
                phrase = " ".join(corpus.words_sentence(sentence))
                sentiment =  1 if sentence[0][4] == '+' else -1 if sentence[0][4] == "-" else 0
                lista.append([book, review, score, phrase, sentiment])


df = pd.DataFrame(lista, columns=['book', 'review', 'score','phrase', 'sentiment'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11465 entries, 0 to 11464
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   book       11465 non-null  object 
 1   review     11465 non-null  int64  
 2   score      11465 non-null  float64
 3   phrase     11465 non-null  object 
 4   sentiment  11465 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 448.0+ KB
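
The nested conditional expression in the loop above turns the polarity mark read from `sentence[0][4]` into a label. The same mapping could be written as a dictionary lookup, which may be easier to scan. This is only a sketch: `POLARITY_TO_LABEL` and `polarity_to_label` are hypothetical names, and the `"O"` mark in the example is a made-up input standing for "anything else".

```python
# Hypothetical lookup table mirroring the conditional used in the loop.
POLARITY_TO_LABEL = {"+": 1, "-": -1}

def polarity_to_label(mark):
    """Return 1 for '+', -1 for '-', and 0 for any other mark."""
    return POLARITY_TO_LABEL.get(mark, 0)

print([polarity_to_label(m) for m in ["+", "-", "O"]])  # [1, -1, 0]
```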


#### Pre-processing pipeline

In [None]:
column_name_mapping ={
    "phrase":"text",
    "sentiment": "label"
}


splits = (
    df.pipe(drop_unused_columns, column_name=['phrase', 'sentiment'], condition_type="keep")
    .pipe(drop_unused_labels, lable_column_name="sentiment", condition=[0], condition_type="drop")
    .pipe(adjust_columns_names, mapping=column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=reli_preprocessed_path,
    file_name=file_name)

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (2508, 2)


Test dataset size: (627, 2)


Label distribution Train dataset:
 label
 1    0.825359
-1    0.174641
Name: proportion, dtype: float64
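
The sizes reported above correspond to roughly an 80/20 split, and the classes are heavily skewed towards the positive label. With a ratio like this, a stratified split keeps the same class proportions in both subsets. Whether `create_train_test_split` stratifies is not shown in this section; the sketch below is one way to do it with pandas alone. `stratified_split` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd

def stratified_split(df, target="label", test_size=0.2, seed=42):
    """Sample test_size of each class for the test set; the rest is the train set."""
    test = df.groupby(target).sample(frac=test_size, random_state=seed)
    train = df.drop(test.index)  # assumes the index values are unique
    return {"train": train, "test": test}

# Toy frame with an 80/20 class imbalance.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(100)],
    "label": [1] * 80 + [-1] * 20,
})

toy_splits = stratified_split(toy)
print(toy_splits["train"].shape, toy_splits["test"].shape)
print(toy_splits["test"]["label"].value_counts(normalize=True))
```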


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
6,Não tenho palavras pra esse livro .,1
3074,"Este livro , sem sombra de dúvidas é um clássi...",1
2747,A sensibilidade como diversos temas são tratad...,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
1712,Esse livro não é nada mais de o que a historia...,-1
2236,Você tem uma protagonista burra que dói que fa...,-1
619,Não achei a história empolgante como muitos me...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{reli_preprocessed_path}/{file_name}_demo.csv', index=False)

### 03.09 - RePRO

**Reference**
<br>
[dos Santos Silva, L. N., Real, L., Zandavalle, A. C. B., et al. (2024). Repro: a benchmark for opinion mining for brazilian portuguese. In Gamallo, P., Claro, D., Teixeira, A., Real, L., Garcia, M., Oliveira, H. G., and Amaro, R., editors, ACLWeb, pages 432–440. Association for Computational Linguistics. https://aclanthology.org/2024.propor-1.44](https://aclanthology.org/2024.propor-1.44/)
<br>

Real, L., Oshiro, M., and Mafra, A. (2019). B2w-reviews01: An open product reviews corpus. In Proceedings of the XII Symposium in Information and Human Language Technology, pages 200–208. Sociedade Brasileira de Computação (SBC).

<br>

**Dataset Link**
<br>
[RePRO](https://github.com/lucasnil/repro)

#### Defining Paths

In [None]:
repro_raw_path = RAW_DATA_ROOT_PATH + "/RePRO"
repro_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/RePRO"
file_name="repro"

if not check_directory_exists(path=repro_raw_path):
    create_directory(path=repro_raw_path)

if not check_directory_exists(path=repro_preprocessed_path):
    create_directory(path=repro_preprocessed_path)

#### Reading data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/lucasnil/repro/main/RePro.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10003 entries, 0 to 10002
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   submission_date         10003 non-null  object 
 1   reviewer_id             10003 non-null  object 
 2   product_id              10003 non-null  object 
 3   product_name            9997 non-null   object 
 4   product_brand           2967 non-null   object 
 5   site_category_lv1       10003 non-null  object 
 6   site_category_lv2       9708 non-null   object 
 7   review_title            10003 non-null  object 
 8   review_text             10003 non-null  object 
 9   overall_rating          10003 non-null  int64  
 10  recommend_to_a_friend   10001 non-null  object 
 11  reviewer_birth_year     9565 non-null   float64
 12  reviewer_gender         9703 non-null   object 
 13  reviewer_state          9715 non-null   object 
 14  topics                  10003 non-null

#### Pre-processing pipeline

In [None]:
label_mapping = {
    "['POSITIVO']":1,
    "['NEGATIVO']":-1
}

column_name_mapping = {
    'review_text':'text',
    'polarity':'label'
}

splits = (
    df.pipe(drop_unused_columns, column_name=['review_text', 'polarity'], condition_type="keep")
    .pipe(drop_unused_labels, lable_column_name="polarity", condition=["['POSITIVO']", "['NEGATIVO']"], condition_type="keep")
    .pipe(adjust_labels, lable_column_name='polarity', mapping=label_mapping)
    .pipe(adjust_columns_names, mapping=column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=repro_preprocessed_path,
    file_name=file_name)
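
The pipeline above matches `polarity` against the strings `"['POSITIVO']"` and `"['NEGATIVO']"`, which suggests that each value is stored as the text of a Python list. As a hedged alternative to matching the exact strings, the value could be parsed with `ast.literal_eval` and kept only when it holds a single known polarity. `parse_polarity` and `LABELS` are hypothetical names, and the multi-label and `'NEUTRO'` inputs below are invented for illustration.

```python
import ast

# Hypothetical mapping from the polarity names to the numeric labels.
LABELS = {"POSITIVO": 1, "NEGATIVO": -1}

def parse_polarity(raw):
    """Return 1 or -1 for a single-polarity entry, otherwise None."""
    values = ast.literal_eval(raw)  # safely parses the stringified list
    if len(values) == 1 and values[0] in LABELS:
        return LABELS[values[0]]
    return None

samples = ["['POSITIVO']", "['NEGATIVO']", "['POSITIVO', 'NEGATIVO']", "['NEUTRO']"]
print([parse_polarity(s) for s in samples])  # [1, -1, None, None]
```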


#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (6060, 2)


Test dataset size: (1516, 2)


Label distribution Train dataset:
 label
 1    0.544719
-1    0.455281
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
5453,Tinha uma igual e procurei exatamente o mesmo ...,1
2818,"eu tinha comprado a, com recortes, preto e bra...",1
3164,Excelente produto recomendo de olhos fechados ...,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
338,Ainda não recebi o produto. Demora demais para...,-1
3859,"Trava muito no Netflix , o aparelho não cumpre...",-1
5411,"O MEU SHAMPOO, VEIO PELA METADE. PRATICAMENTE ...",-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{repro_preprocessed_path}/{file_name}_demo.csv', index=False)

### 03.10 - SST2_PT

**Reference**
<br>
[Socher, R., Perelygin, A., Wu, J., et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S., editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics. https://aclanthology.org/D13-1170](https://aclanthology.org/D13-1170)

[Pires, R., Abonizio, H., Almeida, T. S., et al. (2023). Sabiá: Portuguese large language models. In Naldi, M. C. and Bianchi, R. A. C., editors, Lecture Notes in Computer Science, pages 226–240. Springer Nature Switzerland.](https://arxiv.org/abs/2304.07880)


<br>

**Dataset Link**
<br>
[SST2_PT](https://huggingface.co/datasets/maritaca-ai/sst2_pt)

#### Defining Paths

In [None]:
sst2_raw_path = RAW_DATA_ROOT_PATH + "/SST2_PT"
sst2_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/SST2_PT"
file_name="sst2-pt"

if not check_directory_exists(path=sst2_raw_path):
    create_directory(path=sst2_raw_path)

if not check_directory_exists(path=sst2_preprocessed_path):
    create_directory(path=sst2_preprocessed_path)

#### Reading data

In [None]:
dataset = load_dataset("maritaca-ai/sst2_pt")
train = dataset["train"].to_pandas()
# The validation split is used as the test set here.
test = dataset["validation"].to_pandas()

Train split: 67349 examples

Validation split: 872 examples

#### Pre-processing pipeline

In [None]:
label_mapping = {
    0: -1,
    1:1
}

train = train.pipe(adjust_labels, lable_column_name='label', mapping=label_mapping)
test = test.pipe(adjust_labels, lable_column_name='label', mapping=label_mapping)

splits = {
    "train": train,
    "test":test
}

persist_splits_as_csv(
    split=splits,
    directory_path=sst2_preprocessed_path,
    file_name=file_name)
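
The mapping above re-encodes the original `0`/`1` labels as `-1`/`1`. A small sketch of the same re-encoding with `Series.map`, including a check that no value was left unmapped. This assumes `adjust_labels` behaves like a dictionary lookup; its actual implementation is defined elsewhere in the notebook. The `labels` series is toy data.

```python
import pandas as pd

label_mapping = {0: -1, 1: 1}

# Toy labels in the original 0/1 encoding.
labels = pd.Series([0, 1, 1, 0])
mapped = labels.map(label_mapping)

# map() produces NaN for values missing from the mapping, so check for them.
assert mapped.notna().all(), "some labels were not covered by label_mapping"

print(mapped.tolist())  # [-1, 1, 1, -1]
```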

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (67349, 2)


Test dataset size: (872, 2)


Label distribution Train dataset:
 label
 1    0.557826
-1    0.442174
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
41259,Atuou a meditação nos eventos profundamente de...,1
38498,"Este filme estranho e poético da estrada, crav...",1
528,Dirigido com propósito e requinte por Roger Mi...,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
30440,"um horror monótono, mudo e derivado",-1
64285,"Se George Romero tivesse dirigido este filme, ...",-1
25294,"A atuação é amadora, a cinematografia é atroz",-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{sst2_preprocessed_path}/{file_name}_demo.csv', index=False)

### 03.11 - TweetSentBr

**Reference**
<br>
[Brum, H. and das Graças Volpe Nunes, M. (2018). Building a Sentiment Corpus of Tweets in Brazilian Portuguese. In Calzolari, N. (Conference chair), Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., and Tokunaga, T., editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA)](https://aclanthology.org/L18-1658/)
<br>


**Dataset Link**
<br>
[TweetSentBr](https://huggingface.co/datasets/eduagarcia/tweetsentbr_fewshot)

#### Defining Paths

In [None]:
tweetsentbr_raw_path = RAW_DATA_ROOT_PATH + "/TweetSentBR"
tweetsentbr_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/TweetSentBR"
file_name="tweet_sent_br"

if not check_directory_exists(path=tweetsentbr_raw_path):
    create_directory(path=tweetsentbr_raw_path)

if not check_directory_exists(path=tweetsentbr_preprocessed_path):
    create_directory(path=tweetsentbr_preprocessed_path)

#### Reading data

In [None]:
dataset = load_dataset("eduagarcia/tweetsentbr_fewshot")
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

Train split: 75 examples

Test split: 2010 examples

#### Pre-processing pipeline

In [None]:
label_mapping = {
    'Positive': 1,
    'Negative': -1,
}

column_name_mapping = {
    'sentence':'text'
    }

train = (
    train.pipe(drop_unused_columns, column_name=['id'], condition_type="drop")
    .pipe(drop_unused_labels, lable_column_name='label', condition=['Neutral'], condition_type="drop")
    .pipe(adjust_labels, lable_column_name='label', mapping=label_mapping)
    .pipe(adjust_columns_names, mapping=column_name_mapping)
    )



test = (
    test.pipe(drop_unused_columns, column_name=['id'], condition_type="drop")
    .pipe(drop_unused_labels, lable_column_name='label', condition=['Neutral'], condition_type="drop")
    .pipe(adjust_labels, lable_column_name='label', mapping=label_mapping)
    .pipe(adjust_columns_names, mapping=column_name_mapping)
    )

splits = {
    "train": train,
    "test":test
}

persist_splits_as_csv(
    split=splits,
    directory_path=tweetsentbr_preprocessed_path,
    file_name=file_name)
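
The train and test pipelines above apply the same four steps. As a sketch, the steps could be written once and applied to both splits. The version below uses plain pandas operations rather than the notebook's helper functions, so it is an approximation of their behaviour. `preprocess_split` is a hypothetical name and the toy rows are made up.

```python
import pandas as pd

label_mapping = {"Positive": 1, "Negative": -1}

def preprocess_split(frame):
    """Drop the id column and Neutral rows, encode labels, rename sentence to text."""
    out = frame.drop(columns=["id"])
    out = out[out["label"] != "Neutral"].copy()
    out["label"] = out["label"].map(label_mapping)
    return out.rename(columns={"sentence": "text"})

# Toy frame with the same column names as the dataset loaded above.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "sentence": ["bom demais", "tanto faz", "que ruim"],
    "label": ["Positive", "Neutral", "Negative"],
})

toy_splits = {name: preprocess_split(frame)
              for name, frame in {"train": toy, "test": toy}.items()}
print(toy_splits["train"])
```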

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (50, 2)


Test dataset size: (1494, 2)


Label distribution Train dataset:
 label
 1    0.5
-1    0.5
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
16,se o lindo USERNAME sair eu nem sei viu,1
30,já já 📺 #NasNovelasDaNoiteSBT #CarinhaDeAnjo12...,1
0,joca tá com a corda toda 😂 😂 😂 😂,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
17,acho q não seria justo victor B ser eliminado ...,-1
34,O já foi extremamente melhor,-1
1,O SBT gosta de me iludir eu q pensava que o DL...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{tweetsentbr_preprocessed_path}/{file_name}_demo.csv', index=False)

### 03.12 - TA-Restaurantes

**Reference**
<br>
[Oliveira, M. V. and de Melo, T. (2020). Investigating sets of linguistic features for two sentiment analysis tasks in brazilian portuguese web reviews. Anais Estendidos do Simpósio Brasileiro de Sistemas Multimídia e Web (WebMedia), pages 45–48. https://sol.sbc.org.br/index.php/webmedia_estendido/article/view/13060](https://sol.sbc.org.br/index.php/webmedia_estendido/article/view/13060)
<br>


**Dataset Link**
<br>
[TA-Restaurantes](https://data.mendeley.com/datasets/hsn6g3dbsk/2)

#### Defining Paths

In [None]:
ta_restaurantes_raw_path = RAW_DATA_ROOT_PATH + "/TA-Restaurantes"
ta_restaurantes_preprocessed_path = PREPROCESSED_DATA_ROOT_PATH + "/TA-Restaurantes"
file_name="ta-restaurantes"

if not check_directory_exists(path=ta_restaurantes_raw_path):
    create_directory(path=ta_restaurantes_raw_path)

if not check_directory_exists(path=ta_restaurantes_preprocessed_path):
    create_directory(path=ta_restaurantes_preprocessed_path)

#### Reading data

In [None]:
df = pd.read_csv(f"{ta_restaurantes_raw_path}/POL_restaurants.tsv", header=0, sep='\t')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   polarity  561 non-null    int64 
 1   sentence  561 non-null    object
dtypes: int64(1), object(1)
memory usage: 8.9+ KB


#### Pre-processing pipeline

In [None]:
column_name_mapping = {
    'sentence':'text',
    'polarity':'label'
}

splits = (
    df.pipe(adjust_columns_names, mapping=column_name_mapping)
    .pipe(create_train_test_split, features="text", target="label")
    )


persist_splits_as_csv(
    split=splits,
    directory_path=ta_restaurantes_preprocessed_path,
    file_name=file_name)
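
Unlike the other pipelines, this one has no label-filtering step, so it relies on the raw file containing only the two polarity values. A minimal sketch of a guard that could be added to a `.pipe` chain to make that assumption explicit. `assert_binary_labels` is a hypothetical helper, not part of the notebook, and the toy frame is made up.

```python
import pandas as pd

def assert_binary_labels(frame, label_column="label", allowed=(-1, 1)):
    """Raise if the label column holds anything outside the allowed values."""
    unexpected = set(frame[label_column].unique()) - set(allowed)
    if unexpected:
        raise ValueError(f"Unexpected labels: {unexpected}")
    return frame  # returned unchanged so the helper can be used with .pipe

toy = pd.DataFrame({"text": ["muito bom", "muito caro"], "label": [1, -1]})
checked = toy.pipe(assert_binary_labels)
print(checked.shape)  # (2, 2)
```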

#### Datasets metadata

In [None]:
print(f'Train dataset size: {splits["train"].shape}\n\n')

print(f'Test dataset size: {splits["test"].shape}\n\n')

print(f'Label distribution Train dataset:\n {splits["train"]["label"].value_counts(dropna=False, normalize=True)}')

Train dataset size: (448, 2)


Test dataset size: (113, 2)


Label distribution Train dataset:
 label
 1    0.899554
-1    0.100446
Name: proportion, dtype: float64


#### Demonstrations

##### Positive

In [None]:
positive_examples = splits["train"][splits["train"]['label']==1].sample(n=3, random_state=42)
positive_examples

Unnamed: 0,text,label
355,"Restaurante bonito, que pela estilo parace ser...",1
185,A cozinheira veio até à nossa mesa e se coloco...,1
408,O Lugar fica numa galeria com amplo espaço na ...,1


##### Negative

In [None]:
negative_examples = splits["train"][splits["train"]['label']==-1].sample(n=3, random_state=42)
negative_examples

Unnamed: 0,text,label
412,"O atendimento é problemático, não porque o ate...",-1
135,Depois de muita dúvida pois o cardápio é muito...,-1
372,O preço cobrado 30 reais (Jan 20) é altíssimo...,-1


##### Demonstrations dataframe

In [None]:
demo_df = pd.concat([positive_examples, negative_examples])
demo_df.to_csv(f'{ta_restaurantes_preprocessed_path}/{file_name}_demo.csv', index=False)