In [1]:
%pip install datasets






In [2]:
import os

import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split




In [3]:
dataset_directory = '../_2_data_generation-gemini-pro/Dataset/'
files = os.listdir(dataset_directory)

# read every file in the directory and stack the rows into one DataFrame
data_df = pd.concat(
    [pd.read_csv(os.path.join(dataset_directory, file)) for file in files],
    axis=0,
)

bool_columns = ['F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO']
data_df[bool_columns] = data_df[bool_columns].astype(bool)
data_df = data_df.dropna(subset=['review_text'])

train_ratio = 0.62
val_ratio = 0.08
test_ratio = 0.30

train_dataset, test_dataset = train_test_split(data_df, test_size=test_ratio, random_state=42)

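# NOTE: this second test_size is a fraction of the rows left after the test split.
# Dividing by (1 - train_ratio) sends about 21% of them (roughly 15% of all rows) to
# validation; val_ratio / (1 - test_ratio) would give the 8% that val_ratio suggests.
train_dataset, val_dataset = train_test_split(train_dataset, test_size=val_ratio/(1-train_ratio), random_state=42)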


split_datasets = DatasetDict({
    'train': Dataset.from_pandas(train_dataset),
    'validation': Dataset.from_pandas(val_dataset),
    'test': Dataset.from_pandas(test_dataset)
})

split_datasets.save_to_disk("./hugging-face-dataset/dataset")
split_datasets

Saving the dataset (1/1 shards): 100%|██████████| 30476/30476 [00:00<00:00, 733803.35 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 8127/8127 [00:00<00:00, 677191.45 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 16545/16545 [00:00<00:00, 870808.88 examples/s]


DatasetDict({
    train: Dataset({
        features: ['review_text', 'F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO', '__index_level_0__'],
        num_rows: 30476
    })
    validation: Dataset({
        features: ['review_text', 'F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO', '__index_level_0__'],
        num_rows: 8127
    })
    test: Dataset({
        features: ['review_text', 'F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO', '__index_level_0__'],
        num_rows: 16545
    })
})
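The two-stage split can be sanity-checked with a little arithmetic. The sketch below is not part of the original notebook: it assumes only the ratios and row counts shown above, and the helper name `overall_shares` is hypothetical. It illustrates that the second `test_size` is a fraction of the rows left after the test split, so dividing by `1 - train_ratio` produces roughly a 55/15/30 split, while dividing by `1 - test_ratio` would produce 62/8/30.

```python
train_ratio, val_ratio, test_ratio = 0.62, 0.08, 0.30
total_rows = 30476 + 8127 + 16545  # row counts reported in the output above


def overall_shares(second_stage_test_size):
    """Return overall (train, val, test) shares for the two-stage split."""
    remaining = 1 - test_ratio                 # share left after the test split
    val = remaining * second_stage_test_size   # validation share of all rows
    return remaining - val, val, test_ratio


as_written = overall_shares(val_ratio / (1 - train_ratio))
intended = overall_shares(val_ratio / (1 - test_ratio))

print([round(share, 3) for share in as_written])  # shares produced by the cell above
print([round(share, 3) for share in intended])    # shares matching the three ratios
print(round(8127 / total_rows, 3))                # observed validation share
```

The observed validation share (8127 of 55148 rows) matches the first line of output rather than the second.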

## Load dataset

Note that you can also easily load your local data (i.e. csv files, txt files, Parquet files, JSON, ...) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files).



In [7]:
dataset = DatasetDict.load_from_disk("./hugging-face-dataset/dataset/")
features = dataset["train"].features
for feature_name, feature_info in features.items():
    data_type = feature_info.dtype
    print(f"Feature: {feature_name}, Data Type: {data_type}")

Feature: review_text, Data Type: string
Feature: F, Data Type: bool
Feature: BR, Data Type: bool
Feature: AU, Data Type: bool
Feature: FI, Data Type: bool
Feature: IR, Data Type: bool
Feature: A, Data Type: bool
Feature: L, Data Type: bool
Feature: LF, Data Type: bool
Feature: MN, Data Type: bool
Feature: O, Data Type: bool
Feature: PE, Data Type: bool
Feature: SC, Data Type: bool
Feature: SE, Data Type: bool
Feature: US, Data Type: bool
Feature: PO, Data Type: bool
Feature: __index_level_0__, Data Type: int64


In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['review_text', 'F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO', '__index_level_0__'],
        num_rows: 30476
    })
    validation: Dataset({
        features: ['review_text', 'F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO', '__index_level_0__'],
        num_rows: 8127
    })
    test: Dataset({
        features: ['review_text', 'F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO', '__index_level_0__'],
        num_rows: 16545
    })
})

As we can see, the dataset contains 3 splits: one for training, one for validation and one for testing.

Let's check the first example of the training split:

In [10]:
example = dataset['train'][0]
example

{'review_text': "I've had twitter for a very long time and I love it, but since the latest update, I can't open my app anymore. I've closed the app and tried reopening it, I've deleted and reinstalled the app as well, but it won't open. Please fix this",
 'F': False,
 'BR': True,
 'AU': False,
 'FI': False,
 'IR': False,
 'A': False,
 'L': False,
 'LF': False,
 'MN': False,
 'O': False,
 'PE': False,
 'SC': False,
 'SE': False,
 'US': False,
 'PO': False,
 '__index_level_0__': 44897}

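The dataset consists of app reviews, each labeled with one or more of 15 category codes (the boolean columns `F`, `BR`, ..., `PO`). In the example above, only `BR` is set.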

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [11]:
labels = [label for label in dataset['train'].features.keys() if label not in ['__index_level_0__', 'review_text']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['F',
 'BR',
 'AU',
 'FI',
 'IR',
 'A',
 'L',
 'LF',
 'MN',
 'O',
 'PE',
 'SC',
 'SE',
 'US',
 'PO']

## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
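To make the float requirement concrete, here is a minimal pure-Python sketch of element-wise binary cross-entropy on logits, averaged over all (example, label) pairs. It follows the standard numerically stable formulation and only illustrates the math; it is not PyTorch's implementation, and the function name `bce_with_logits` and the toy values are made up for this example. Each target enters the formula as a multiplier, which is why the labels are represented as floats.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over a (batch_size, num_labels) matrix.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for row_x, row_y in zip(logits, targets)
        for x, y in zip(row_x, row_y)
    ]
    return sum(losses) / len(losses)


# toy batch: 2 examples, 3 labels (values invented for illustration)
logits = [[2.0, -1.0, 0.0], [-3.0, 0.5, 1.5]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(bce_with_logits(logits, targets))
```

A logit of 0 corresponds to a predicted probability of 0.5, so a single such prediction contributes a loss of log 2 regardless of the target.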

In [12]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples):
  # with batched=True, examples["review_text"] is a list of texts, so a
  # None check on it never fires; null reviews were dropped before the split
  text = examples["review_text"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  return encoding

In [13]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset["train"].column_names)

Map: 100%|██████████| 30476/30476 [00:03<00:00, 10020.83 examples/s]
Map: 100%|██████████| 8127/8127 [00:00<00:00, 9151.05 examples/s]
Map: 100%|██████████| 16545/16545 [00:01<00:00, 9306.51 examples/s]


In [14]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [15]:
tokenizer.decode(example['input_ids'])

"[CLS] i've had twitter for a very long time and i love it, but since the latest update, i can't open my app anymore. i've closed the app and tried reopening it, i've deleted and reinstalled the app as well, but it won't open. please fix this [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"

In [16]:
example['labels']

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [17]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['BR']
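The label round trip is easy to check on a toy batch. The sketch below is independent of the notebook's objects: `encode_batch` and `decode_row` are hypothetical helper names, the 0.5 threshold is an assumption, and the two-row batch is invented for illustration. It mirrors what `preprocess_data` does with the label columns (column-oriented booleans become rows of floats) and what the list comprehension above does in reverse.

```python
labels = ['F', 'BR', 'AU', 'FI', 'IR', 'A', 'L', 'LF', 'MN', 'O', 'PE', 'SC', 'SE', 'US', 'PO']


def encode_batch(batch, labels):
    """Turn a column-oriented batch of booleans into rows of floats."""
    num_rows = len(batch[labels[0]])
    return [[float(batch[name][i]) for name in labels] for i in range(num_rows)]


def decode_row(row, labels, threshold=0.5):
    """Return the names of the labels whose value reaches the threshold."""
    return [name for name, value in zip(labels, row) if value >= threshold]


# toy batch of two reviews: the first tagged BR, the second tagged PE and US
batch = {name: [False, False] for name in labels}
batch['BR'] = [True, False]
batch['PE'] = [False, True]
batch['US'] = [False, True]

matrix = encode_batch(batch, labels)
print(matrix[0])
print([decode_row(row, labels) for row in matrix])
```

Decoded names come back in the order of `labels`, so the column order fixed when `labels` was built has to be the same at training and at inference time.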

Finally, we set the format of our data to PyTorch tensors. After this, indexing the training, validation and test sets returns `torch.Tensor` values, so they can be used like standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [18]:
encoded_dataset.set_format("torch")

In [19]:
encoded_dataset['train']['labels']

tensor([[0., 1., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

In [21]:
# Save the encoded DatasetDict in serialized form to preserve the properties of the data structures (Tensor)

import os
import pickle

# File path where the serialized dataset will be written
file_path = 'output/serialized_encoded_review_dataset.pkl'

# Make sure the output directory exists, then write the dataset with pickle
os.makedirs(os.path.dirname(file_path), exist_ok=True)
with open(file_path, 'wb') as f:
    pickle.dump(encoded_dataset, f)

print("Dataset saved as a serialized file:", file_path)

Dataset saved as a serialized file: output/serialized_encoded_review_dataset.pkl
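Reading the file back is the mirror image of the cell above. The following is a minimal sketch of the pickle round trip that uses a stand-in dictionary and a temporary directory instead of the real `encoded_dataset` and `output/` path, so those names and values are assumptions for illustration only. Note that `pickle.load` can execute arbitrary code, so it should only be used on files you created or trust.

```python
import os
import pickle
import tempfile

# Stand-in for encoded_dataset; any picklable object is handled the same way.
stand_in = {"train": {"labels": [[0.0, 1.0, 0.0]]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "serialized_encoded_review_dataset.pkl")

    # write the object in binary mode
    with open(file_path, "wb") as f:
        pickle.dump(stand_in, f)

    # read it back in binary mode
    with open(file_path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in)
```

The `save_to_disk` / `load_from_disk` pair used earlier in this notebook is an alternative that avoids pickle for the dataset itself.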
