# Sentiment Analysis

Sentiment analysis (text classification) using DistilBERT.

We'll cover:

1. Datasets --> Load and process datasets
2. Tokenizers --> Tokenize input text
3. Transformers --> Load models, train and infer
4. Datasets --> Load metrics and evaluate models

## The Dataset

To build our emotion detector, we'll use a dataset from an article that explored how emotions are represented in English Twitter messages. This dataset contains six basic emotions: anger, fear, joy, love, sadness, and surprise.

Given a tweet, we need to train a model that can classify it into one of these emotions.

### First look at Hugging Face Datasets

`list_datasets()` from `datasets` lists all the datasets available on the Hub.

In [4]:
from datasets import list_datasets

all_datasets = list_datasets()

In [5]:
type(all_datasets)

list

In [6]:
print(f"There are {len(all_datasets)} datasets in hub")
print(f"The first 10 are: {all_datasets[:10]}")

There are 39519 datasets in hub
The first 10 are: ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews']


`list_datasets()` returns the list of dataset names available on the Hub.
`load_dataset()` loads a dataset given its name.
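Since the result is a plain Python list of strings, ordinary list operations are enough to search it. Here's a minimal sketch that filters names by substring. `sample_names` is a hand-copied subset of the output above, and `find_datasets` is a hypothetical helper written for this example, not part of `datasets`.

```python
# Hand-copied subset of the names printed above; in the notebook you would
# pass `all_datasets` instead.
sample_names = [
    "acronym_identification",
    "ade_corpus_v2",
    "ag_news",
    "ajgt_twitter_ar",
    "allegro_reviews",
]


def find_datasets(names, keyword):
    """Return the dataset names that contain `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [name for name in names if keyword in name.lower()]


matches = find_datasets(sample_names, "twitter")
print(matches)
```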

Let's load the `emotion` dataset.

In [8]:
from datasets import load_dataset
emotions = load_dataset("emotion")

No config specified, defaulting to: emotion/split
Found cached dataset emotion (/home/codespace/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)
100%|██████████| 3/3 [00:00<00:00, 670.98it/s]


In [16]:
type(emotions)

datasets.dataset_dict.DatasetDict

In [12]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

We have three splits: train, validation and test. For each split, `features` lists the columns of the dataset and `num_rows` gives the total number of samples.
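The row counts above work out to an 80/10/10 split. Here's a quick arithmetic check, with the numbers copied by hand from the output (in the notebook, `emotions[split].num_rows` would supply them):

```python
# Row counts copied from the DatasetDict printed above.
split_sizes = {"train": 16000, "validation": 2000, "test": 2000}

total = sum(split_sizes.values())

# Share of the full dataset that each split takes up.
split_fractions = {name: rows / total for name, rows in split_sizes.items()}

for name, fraction in split_fractions.items():
    print(f"{name:<10} {split_sizes[name]:>6} rows  {fraction:.0%}")
```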

We can access the individual splits the same way we access a key in a dict.

In [14]:
train_ds = emotions["train"]
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

In [17]:
type(train_ds)

datasets.arrow_dataset.Dataset

Each value in a `DatasetDict` is a `Dataset`, which behaves much like an ordinary Python array or list.

In [18]:
len(train_ds)

16000

In [19]:
# Let's look at a single sample
train_ds[0]

{'text': 'i didnt feel humiliated', 'label': 0}

In [20]:
# Column names
train_ds.column_names

['text', 'label']

The keys correspond to the column names. This reflects the fact that Datasets is based on `Apache Arrow`, which defines a typed columnar format that is more memory-efficient than native Python.

The data type of each column can be accessed through the `features` attribute of a `Dataset` object.

In [21]:
print(train_ds.features)

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}


In [26]:
train_ds.features['label']

ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)

The data type of `text` is `string`, while the `label` column is a special `ClassLabel` object that contains information about the class names and their mapping to integers.
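To make that mapping concrete, here's a plain-Python sketch of the idea. The `ClassLabel` object itself offers `int2str()` and `str2int()` methods for this. The two functions below only mimic them with a list and a dict, and are not the library's implementation.

```python
# Class names copied from the ClassLabel output above; the position of each
# name in the list is its integer id.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Reverse lookup: class name -> integer id.
label_ids = {name: idx for idx, name in enumerate(label_names)}


def int2str(label_id):
    """Map an integer label to its class name."""
    return label_names[label_id]


def str2int(name):
    """Map a class name to its integer label."""
    return label_ids[name]


# train_ds[0] has label 0, so the first tweet is tagged as:
print(int2str(0))
print(str2int("anger"))
```

In the notebook, the equivalent call on the real object would be `train_ds.features["label"].int2str(0)`.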

In [27]:
# Slicing dataset
print(train_ds[:5])

{'text': ['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'i am feeling grouchy'], 'label': [0, 0, 3, 2, 3]}
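Note that a slice comes back as a dictionary of columns, not a list of rows. When row-wise records are more convenient, the columns can be zipped back together. Here's a minimal sketch in plain Python, where `batch` is a hand-copied subset of the output above (three of the five samples):

```python
# Hand-copied subset of the batch printed above (three of the five samples).
batch = {
    "text": [
        "i didnt feel humiliated",
        "im grabbing a minute to post i feel greedy wrong",
        "i am feeling grouchy",
    ],
    "label": [0, 3, 3],
}

# zip(*batch.values()) walks the columns in lockstep; pairing each tuple
# with the column names rebuilds one dict per sample.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

for row in rows:
    print(row)
```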


### What if dataset is not on the hub?

In many cases, we'll be working with data stored on a laptop or on a remote server in an organization. Datasets provides several loading scripts to handle local and remote datasets.

* To load csv --> ```load_dataset("csv", data_files="my_file.csv")```
* To load text --> ```load_dataset("text", data_files="my_file.txt")```
* To load json --> ```load_dataset("json", data_files="my_file.json")```

Just pass the format and the file. We can also pass the URL of a file to the `data_files` parameter.

In [28]:
# Let's load the emotion data from its source.
dataset_url = "https://huggingface.co/datasets/transformersbook/emotion-train-split/raw/main/train.txt"

In [33]:
# Let's get the file
!wget {dataset_url}

--2023-06-08 05:08:06--  https://huggingface.co/datasets/transformersbook/emotion-train-split/raw/main/train.txt
Resolving huggingface.co (huggingface.co)... 65.8.11.15, 65.8.11.70, 65.8.11.53, ...
Connecting to huggingface.co (huggingface.co)|65.8.11.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1658616 (1.6M) [text/plain]
Saving to: ‘train.txt’


2023-06-08 05:08:07 (1.47 MB/s) - ‘train.txt’ saved [1658616/1658616]



In [35]:
# Let's look at the first few lines of this file
!head -n 5 train.txt

i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger


The data looks like a CSV file with no header row: each line holds the text and its emotion, separated by a semicolon.
Let's load this.

In [36]:
emotions_local = load_dataset("csv", data_files="train.txt", sep=";", names=["text", "label"])

Downloading and preparing dataset csv/default to /home/codespace/.cache/huggingface/datasets/csv/default-9f045124772ab15b/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 368.70it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 574.40it/s]
                                                        

Dataset csv downloaded and prepared to /home/codespace/.cache/huggingface/datasets/csv/default-9f045124772ab15b/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 447.77it/s]


In [39]:
emotions_local["train"][:5]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy'],
 'label': ['sadness', 'sadness', 'anger', 'love', 'anger']}
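Notice that the labels load as plain strings here, not as the integer ids we saw in the Hub version. To see what `sep=";"` and `names=["text", "label"]` are doing, here's a rough sketch with Python's standard `csv` module. It only illustrates the parsing idea and is not how `load_dataset` is implemented; `raw` holds three lines copied from the head of `train.txt` shown earlier.

```python
import csv
import io

# Three lines copied from the head of train.txt shown earlier.
raw = (
    "i didnt feel humiliated;sadness\n"
    "im grabbing a minute to post i feel greedy wrong;anger\n"
    "i am feeling grouchy;anger\n"
)

column_names = ["text", "label"]  # plays the role of `names=`

# delimiter=";" plays the role of `sep=";"`.
reader = csv.reader(io.StringIO(raw), delimiter=";")

# Collect the values column by column, mirroring the dict-of-lists output above.
columns = {name: [] for name in column_names}
for record in reader:
    for name, value in zip(column_names, record):
        columns[name].append(value)

print(columns["label"])
```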

For more, read the [Datasets documentation](https://huggingface.co/docs/datasets/index).