
- **transformers:** It provides easy access to a wide variety of pre-trained models for natural language processing (NLP) tasks.
- **datasets:** It provides access to a large collection of datasets for training and evaluating models.
- **sentencepiece:** It helps break down text into smaller pieces (tokens) that models can process more efficiently. It supports various languages and is particularly good for handling rare and unknown words.
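
To give a feel for what "breaking text into smaller pieces" means, here is a toy sketch in plain Python. It uses a greedy longest-match split over a tiny hand-written vocabulary, which is an assumption made for illustration and is not the algorithm sentencepiece itself uses. The function name `toy_subword_tokenize` and the vocabulary are hypothetical; the `▁` marker for spaces follows the convention sentencepiece uses in its output.

```python
def toy_subword_tokenize(text, vocab, unk="<unk>"):
    """Greedy longest-match subword split (illustration only)."""
    # Mark word boundaries with a visible symbol instead of a space.
    text = "▁" + text.replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: emit an unknown marker
            # for this single character and move on.
            pieces.append(unk)
            i += 1
    return pieces


# Hypothetical vocabulary, chosen by hand for this example.
toy_vocab = {"▁i", "▁feel", "ing", "▁", "hope", "less", "ly"}

pieces = toy_subword_tokenize("i feeling hopelessly", toy_vocab)
print(pieces)
# ['▁i', '▁feel', 'ing', '▁', 'hope', 'less', 'ly']
```

A word that is not in the vocabulary as a whole, such as "hopelessly", is still covered by smaller known pieces, which is the idea behind handling rare words.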

# **Datasets**

In [16]:
!pip install datasets



## **List of Datasets**

In [17]:
from datasets import list_datasets
all_datasets = list_datasets()

print("There are a total of", len(all_datasets), "datasets on the Hub")

There are a total of 168492 datasets on the Hub


In [18]:
print("Some example datasets are:")
for dataset in all_datasets[:10]:
  print(dataset)

Some example datasets are:
amirveyseh/acronym_identification
ade-benchmark-corpus/ade_corpus_v2
UCLNLP/adversarial_qa
Yale-LILY/aeslc
nwu-ctext/afrikaans_ner_corpus
fancyzhx/ag_news
allenai/ai2_arc
google/air_dialogue
komari6/ajgt_twitter_ar
legacy-datasets/allegro_reviews


## **Load the dataset from Hugging Face**

In [19]:
from datasets import load_dataset
data = load_dataset('dair-ai/emotion')
data

The repository for dair-ai/emotion contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/dair-ai/emotion.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [20]:
train_dataset = data['train']
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

In [21]:
validation_dataset = data['validation']
validation_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 2000
})

In [22]:
testing_dataset = data['test']
testing_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 2000
})
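
The three splits are easier to compare as shares of the whole. A small sketch, with the row counts copied by hand from the outputs above rather than read from the loaded object; `split_sizes` is a name introduced here for illustration.

```python
# Row counts copied from the DatasetDict output above (assumption:
# typed in by hand; with the dataset loaded, data.num_rows should
# give the same mapping).
split_sizes = {"train": 16000, "validation": 2000, "test": 2000}

total_rows = sum(split_sizes.values())

for split_name, num_rows in split_sizes.items():
    share = num_rows / total_rows
    print(f"{split_name}: {num_rows} rows ({share:.0%})")
# train: 16000 rows (80%)
# validation: 2000 rows (10%)
# test: 2000 rows (10%)
```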

## **Convert the dataset to a pandas DataFrame**

In [23]:
import pandas as pd
data.set_format(type='pandas')
df = data['train'][:]
df.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |


In [24]:
# Show the names of the label categories
train_dataset.features['label'].names

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

**The int2str() method** converts integer class indices back to their string labels. It is particularly useful in classification tasks, where labels are stored as integers but are easier to read as their original names.
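
To make the mapping concrete, here is a plain-Python stand-in for the lookup, not the library's implementation. The label names are copied from the output above; `int_to_label` and `label_to_int` are hypothetical helper names (the library exposes the reverse direction as `str2int()`).

```python
# Label names copied from the features['label'].names output above.
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


def int_to_label(index):
    """Stand-in for int2str(): integer index -> label name."""
    return label_names[index]


def label_to_int(name):
    """Reverse lookup: label name -> integer index."""
    return label_names.index(name)


print(int_to_label(0))       # sadness
print(int_to_label(3))       # anger
print(label_to_int('love'))  # 2
```

The position of each name in the list is its integer label, which is why label `3` in the DataFrame corresponds to `anger`.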

In [26]:
# Add a column with the human-readable name of each label
def label_to_str(row):
  return data['train'].features['label'].int2str(row)

df['labels_name'] = df['label'].apply(label_to_str)
df.head()

| | text | label | labels_name |
|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |
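
With readable label names available, a natural next check is how often each label occurs. A minimal sketch using only the five preview rows shown above, so the counts say nothing about the full training split; `preview_labels` is a hypothetical name introduced for illustration.

```python
from collections import Counter

# Label names copied from the features['label'].names output.
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Integer labels of the five preview rows (assumption: typed in by
# hand from the table, not read from the DataFrame).
preview_labels = [0, 0, 3, 2, 3]

# Translate each integer to its name, then count occurrences.
label_counts = Counter(label_names[i] for i in preview_labels)

for name, count in label_counts.most_common():
    print(f"{name}: {count}")
```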


In [27]:
# Reset the output format back to the default Hugging Face dataset format
data.reset_format()

In [28]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})