# Sentiment Analysis

Sentiment analysis (text classification) using DistilBERT.

We'll cover:

1. Datasets --> Load and process datasets
2. Tokenizers --> Tokenize input text
3. Transformers --> Load models, train and infer
4. Datasets --> Load metrics and evaluate models

## The Dataset

To build our emotion detector, we'll use a dataset from an article that explored how emotions are represented in English Twitter messages. The dataset contains six basic emotions: anger, fear, joy, love, sadness, and surprise.

Given a tweet, we need to train a model that can classify it into one of these emotions.

### First look at Hugging Face Datasets

`list_datasets()` from `datasets` lists all the datasets available on the Hub.

In [1]:
from datasets import list_datasets

all_datasets = list_datasets()

In [2]:
type(all_datasets)

list

In [3]:
print(f"There are {len(all_datasets)} datasets on the Hub")
print(f"The first 10 are: {all_datasets[:10]}")

There are 39590 datasets on the Hub
The first 10 are: ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews']


`list_datasets()` returns the list of dataset names available on the Hub.
`load_dataset()` loads a dataset given its name.
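
Because `list_datasets()` returns plain strings, ordinary list operations are enough to search it. Here is a minimal sketch that uses the ten names printed earlier as a stand-in for the full list. The `find_datasets` helper is hypothetical and not part of `datasets`.

```python
# Stand-in for all_datasets: the ten names printed in the output above.
sample_names = [
    "acronym_identification", "ade_corpus_v2", "adversarial_qa", "aeslc",
    "afrikaans_ner_corpus", "ag_news", "ai2_arc", "air_dialogue",
    "ajgt_twitter_ar", "allegro_reviews",
]


def find_datasets(names, keyword):
    """Return the dataset names that contain `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [name for name in names if keyword in name.lower()]


print(find_datasets(sample_names, "twitter"))  # ['ajgt_twitter_ar']
```

The same call on the full `all_datasets` list would be one way to look for names containing `emotion`.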

Let's load the `emotion` dataset.

In [4]:
from datasets import load_dataset

In [5]:
emotions = load_dataset("SetFit/emotion")


In [6]:
type(emotions)

datasets.dataset_dict.DatasetDict

In [7]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 16000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
})

We have three splits: train, validation, and test. For each split, `features` lists the dataset's columns and `num_rows` gives the number of samples.

We can access each split the same way we access a key in a dict.

In [8]:
train_ds = emotions["train"]
train_ds

Dataset({
    features: ['text', 'label', 'label_text'],
    num_rows: 16000
})

In [9]:
type(train_ds)

datasets.arrow_dataset.Dataset

Each item in a `DatasetDict` is a `Dataset`, which behaves much like an ordinary Python array or list.

In [10]:
len(train_ds)

16000

In [11]:
# Let's look at a single sample
train_ds[0]

{'text': 'i didnt feel humiliated', 'label': 0, 'label_text': 'sadness'}

In [12]:
# Column names
train_ds.column_names

['text', 'label', 'label_text']

The keys correspond to the column names. This reflects that Datasets is based on Apache Arrow, which defines a typed columnar format that is more memory-efficient than native Python objects.
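
To make "columnar" concrete, here is a small sketch of the difference between a row-oriented and a column-oriented layout. It uses plain Python lists and dicts only, so it illustrates the shape of the data, not Arrow itself. The helper names `to_columns` and `to_rows` are made up for this example.

```python
# Row-oriented: one dict per example (the shape of train_ds[0]).
rows = [
    {"text": "i didnt feel humiliated", "label": 0, "label_text": "sadness"},
    {"text": "im grabbing a minute to post i feel greedy wrong", "label": 3, "label_text": "anger"},
    {"text": "i am feeling grouchy", "label": 3, "label_text": "anger"},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def to_rows(columns):
    """Pivot a dict of column lists back into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = to_columns(rows)
print(columns["label"])  # [0, 3, 3]
```

In a columnar layout, all values of one column sit together, which is why slicing a `Dataset` returns a dict of lists keyed by column name.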

The data type of each column is available under the `features` attribute of a `Dataset` object.

In [13]:
print(train_ds.features)

{'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), 'label_text': Value(dtype='string', id=None)}


In [14]:
train_ds.features['label']

Value(dtype='int64', id=None)

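Here `text` has dtype `string`, and `label` is a plain `int64` value rather than a `ClassLabel` object (the feature type that stores class names and their mapping to integers). In this copy of the dataset, the class names live in the separate `label_text` column.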
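
Since the `features` output shows `label` as a plain integer, the mapping between integers and class names has to be recovered from the `label` and `label_text` columns. Here is a minimal sketch that uses a few rows copied from this dataset. The `build_id2label` helper is hypothetical, and the small batch only covers the labels that happen to appear in it.

```python
# A small batch in the dict-of-lists shape that slicing a Dataset returns.
batch = {
    "label": [0, 0, 3, 2, 3],
    "label_text": ["sadness", "sadness", "anger", "love", "anger"],
}


def build_id2label(labels, label_texts):
    """Map each integer label to its name, checking that pairs never conflict."""
    id2label = {}
    for label_id, name in zip(labels, label_texts):
        previous = id2label.setdefault(label_id, name)
        if previous != name:
            raise ValueError(f"label {label_id} maps to both {previous!r} and {name!r}")
    return dict(sorted(id2label.items()))


id2label = build_id2label(batch["label"], batch["label_text"])
label2id = {name: label_id for label_id, name in id2label.items()}
print(id2label)  # {0: 'sadness', 2: 'love', 3: 'anger'}
```

Run over the full `label` and `label_text` columns of the training split, the same helper should recover every class that occurs in the data.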

In [15]:
# Slicing dataset
print(train_ds[:5])

{'text': ['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'i am feeling grouchy'], 'label': [0, 0, 3, 2, 3], 'label_text': ['sadness', 'sadness', 'anger', 'love', 'anger']}


### What if dataset is not on the hub?

In many cases, we'll be working with data stored on a laptop or on a remote server in an organization. Datasets provides several loading scripts to handle local and remote datasets.

* To load csv --> ```load_dataset("csv", data_files="my_file.csv")```
* To load text --> ```load_dataset("text", data_files="my_file.txt")```
* To load json --> ```load_dataset("json", data_files="my_file.json")```

Just pass the format and the file path. The `data_files` parameter also accepts the URL of a file.
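
For instance, the remote file used in this section could be loaded straight from its URL. This is a sketch that assumes the `datasets` library is installed and the URL is reachable. The wrapper function `load_emotion_train` is hypothetical and exists only so that nothing is downloaded until it is called.

```python
DATASET_URL = (
    "https://huggingface.co/datasets/transformersbook/"
    "emotion-train-split/raw/main/train.txt"
)


def load_emotion_train(url=DATASET_URL):
    """Load the semicolon-separated training file directly from a URL."""
    # Imported lazily so that defining the function needs neither the
    # library nor network access.
    from datasets import load_dataset

    # The file has no header row and separates text from label with ";",
    # so the separator and column names are passed explicitly.
    return load_dataset("csv", data_files=url, sep=";", names=["text", "label"])


# Usage (this downloads the file):
# emotions_remote = load_emotion_train()
```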

In [16]:
# Let's load the emotion data from its source.
dataset_url = "https://huggingface.co/datasets/transformersbook/emotion-train-split/raw/main/train.txt"

In [17]:
# Let's get the file
!wget {dataset_url}

--2023-06-08 16:46:24--  https://huggingface.co/datasets/transformersbook/emotion-train-split/raw/main/train.txt
Resolving huggingface.co (huggingface.co)... 18.161.246.31, 18.161.246.100, 18.161.246.27, ...
Connecting to huggingface.co (huggingface.co)|18.161.246.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1658616 (1.6M) [text/plain]
Saving to: ‘train.txt.3’


2023-06-08 16:46:26 (1.57 MB/s) - ‘train.txt.3’ saved [1658616/1658616]



In [18]:
# Let's look at the first few lines of this file
!head -n 5 train.txt

i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger


The file resembles a CSV with no header row: each line holds the text and its emotion, separated by a semicolon.
Let's load it.

In [23]:
emotions_local = load_dataset("csv", data_files="train.txt", sep=";", names=["text", "label"])


DatasetGenerationError: An error occurred while generating the dataset

In [None]:
emotions_local["train"][:5]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy'],
 'label': ['sadness', 'sadness', 'anger', 'love', 'anger']}
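
To make explicit what `sep=";"` and `names=["text", "label"]` do, here is a sketch that parses the same headerless, semicolon-separated format with the standard library's `csv` module instead of `load_dataset()`. The sample lines are copied from the file preview. The `read_semicolon_file` helper is hypothetical, and treating quote characters as ordinary text is an assumption about the file.

```python
import csv
import io

# Sample lines in the file's format: text;emotion, with no header row.
raw = """i didnt feel humiliated;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am feeling grouchy;anger
"""


def read_semicolon_file(handle, names=("text", "label")):
    """Parse headerless, semicolon-separated lines into a dict of column lists."""
    columns = {name: [] for name in names}
    # QUOTE_NONE: keep any quote characters in the tweets as literal text.
    reader = csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(names):
            raise ValueError(f"line {line_number}: expected {len(names)} fields, got {len(row)}")
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


columns = read_semicolon_file(io.StringIO(raw))
print(columns["label"])  # ['sadness', 'anger', 'anger']
```

To run it on the downloaded file, open it with `open("train.txt", newline="", encoding="utf-8")` and pass the file handle in place of the `io.StringIO` object.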

For more, read the [Datasets documentation](https://huggingface.co/docs/datasets/index).

### From Datasets to DataFrames

Datasets only provides low-level functionality to slice and dice our data. Let's convert the dataset into a pandas DataFrame so we can use its high-level APIs and visualize the data.

We can do this with `set_format()`, which changes the output format of the dataset without altering the underlying data.

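In [None]:
import pandas as pd
from datasets import load_dataset

# Reload the dataset in case the kernel was restarted since it was first loaded
emotions = load_dataset("SetFit/emotion")
emotions.set_format(type="pandas")
train_df = emotions["train"][:]
train_df.head()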