# Working with the Datasets Library

In [1]:
# load_dataset loads a dataset from the Hugging Face Hub
from datasets import load_dataset

Datasets might have several configurations. For instance, GLUE, an aggregated benchmark, has 10 subsets (as of writing this notebook): CoLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI and the diagnostic subset AX.

To access a GLUE dataset, we pass two arguments: the first is **'glue'** and the second is the **sub-part** to be loaded. Likewise, the wikipedia dataset has several configurations, provided for several languages.

Let's load the 'cola' subset of GLUE as follows:

In [2]:
cola = load_dataset('glue', 'cola')
cola['train'][18:22]

{'sentence': ['They drank the pub.',
  'The professor talked us into a stupor.',
  'The professor talked us.',
  'We yelled ourselves hoarse.'],
 'label': [0, 1, 0, 1],
 'idx': [18, 19, 20, 21]}

Some datasets come as a DatasetDict object, while others are a single Dataset, depending on how they are split. The CoLA dataset comes as a DatasetDict with 3 splits: train, validation, and test. The train and validation splits include the labels (1: acceptable, 0: unacceptable), but the label values of the test split are -1, which means 'no label'.

In [3]:
cola

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [120]:
cola['train'][12]

{'idx': 12, 'label': 1, 'sentence': 'Bill rolled out of the room.'}

In [121]:
cola['validation'][68]

{'idx': 68,
 'label': 0,
 'sentence': 'Which report that John was incompetent did he submit?'}

In [122]:
cola['test'][20]

{'idx': 20, 'label': -1, 'sentence': 'Has John seen Mary?'}

## Metadata of Datasets
* split
* description
* citation
* homepage
* license
* info


In [123]:
print(cola["train"].split)
print(cola["train"].description)
print(cola["train"].citation)
print(cola["train"].homepage)
print(cola["train"].license)

train
GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.


@article{warstadt2018neural,
  title={Neural Network Acceptability Judgments},
  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1805.12471},
  year={2018}
}
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}

https://nyu-mll.github.io/CoLA/



### Loading other datasets

In [124]:
sst2 = load_dataset('glue', 'sst2')

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [125]:
mrpc = load_dataset('glue', 'mrpc')

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


To load every GLUE subset, run the following piece of code:


```
glue = ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
for g in glue:
    _ = load_dataset('glue', g)
```




## Listing all datasets and metrics in the hub

In [126]:
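from pprint import pprint
from datasets import list_datasets, list_metrics
# Named all_datasets so that the built-in all() is not shadowed.
# list_datasets and list_metrics are available in the datasets version used here;
# later releases deprecated them, so this cell may need adapting.
all_datasets = list_datasets()
metrics = list_metrics()

print(f"{len(all_datasets)} datasets and {len(metrics)} metrics exist in the hub\n")
pprint(all_datasets[:20], compact=True)
pprint(metrics, compact=True)

995 datasets and 27 metrics exist in the hub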

['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc',
 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue',
 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity',
 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'amttl', 'anli',
 'app_reviews', 'aqua_rat']
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad',
 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor',
 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval',
 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'xnli']


## Selecting, sorting, filtering

### Split
`split` specifies which split of the data to load. If it is None (the default), `load_dataset` returns a `dict` with all splits (train, test, validation, or any other). If a split is specified, it returns a single Dataset rather than a dictionary.

In [130]:
cola = load_dataset('glue', 'cola', split ='train[:300]+validation[-30%:]')
# That is, the first 300 examples of train plus the last 30% of validation.

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


#### Other Split Examples
The first 100 examples from train and validation

`split='train[:100]+validation[:100]'` 

The first 50% of train and the first 30% of validation

`split='train[:50%]+validation[:30%]'`


The first 20% of train and examples in the slice 30:50 from validation

`split='train[:20%]+validation[30:50]'`

### Sorting

In [131]:
cola.sort('label')['label'][:15]

Loading cached sorted indices for dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-54fbf680867c6dca.arrow


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [132]:
cola.sort('label')['label'][-15:]

Loading cached sorted indices for dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-54fbf680867c6dca.arrow


[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

###  Indexing
You can also access several rows using slice notation or with a list of indices

In [133]:
cola[[6, 19, 44]]

{'idx': [6, 19, 44],
 'label': [1, 1, 1],
 'sentence': ['Fred watered the plants flat.',
  'The professor talked us into a stupor.',
  'The trolley rumbled through the tunnel.']}

In [134]:
cola[42:46]

{'idx': [42, 43, 44, 45],
 'label': [0, 1, 1, 1],
 'sentence': ['They made him to exhaustion.',
  'They made him into a monster.',
  'The trolley rumbled through the tunnel.',
  'The wagon rumbled down the road.']}

### Shuffling 

In [135]:
cola.shuffle(seed=42)[:3]

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-97a24a7d09391f14.arrow


{'idx': [904, 1017, 885],
 'label': [1, 0, 1],
 'sentence': ['Lou forgot the umbrella in the closet.',
  'It is the problem that he is here.',
  'I met the person who left.']}

## Caching and reusability
Cache files let us load large datasets through memory mapping (as long as the dataset fits on the drive), which provides a fast backend. The library also caches smartly: it saves the results of operations on the drive and reuses them.

In [None]:
pprint(list(dir(cola)))

In [137]:
cola.cache_files

[{'filename': '/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/glue-train.arrow'},
 {'filename': '/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/glue-validation.arrow'}]

In [138]:
cola.info

DatasetInfo(description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n', citation='@article{warstadt2018neural,\n  title={Neural Network Acceptability Judgments},\n  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},\n  journal={arXiv preprint arXiv:1805.12471},\n  year={2018}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n', homepage='https://nyu-mll.github.io/CoLA/', license='', features={'idx': Value(dtype='int32', id=None), 'label': ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], names_file=None, id=None), 'sentence': Value(dtype='strin

## Dataset Filter and Map Function



### Filter function

In [19]:
# Get 3 sentences that include the term "kick" using filter
cola = load_dataset('glue', 'cola',split='train')


In [20]:
print(cola.filter(lambda s: "kick" in s['sentence'])["sentence"][:3])

['Jill kicked the ball from home plate to third base.', 'Fred kicked the ball under the porch.', 'Fred kicked the ball behind the tree.']


In [22]:
# To get 3 acceptable sentences
print(cola.filter(lambda s: s['label']== 1 )["sentence"][:3])

["Our friends won't buy this analysis, let alone the next one we propose.", "One more pseudo generalization and I'm giving up.", "One more pseudo generalization or I'm giving up."]


In some cases, we might not know the integer code of a class label. Suppose a dataset has 
10 classes and the code of one of them, say a culture class, is hard to remember. Instead 
of passing the integer code 1, which stands for acceptable in the preceding example, 
we can pass the label name 'acceptable' to the str2int() function, as follows:

In [23]:
# To get 3 acceptable sentences - alternative version
cola.filter(lambda s: s['label']== cola.features['label'].str2int('acceptable'))["sentence"][:3]

["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]

### Processing data with  map function
The datasets.Dataset.map() function iterates over the dataset, applies a processing function to each example, and returns a new dataset with the modified content; the original dataset is left unchanged.

In [24]:
# E.g. adding new features
cola_new=cola.map(lambda e: {'len': len(e['sentence'])})

In [25]:
cola_new

Dataset({
    features: ['sentence', 'label', 'idx', 'len'],
    num_rows: 8551
})

In [27]:
print(cola_new[0:3])

{'sentence': ["Our friends won't buy this analysis, let alone the next one we propose.", "One more pseudo generalization and I'm giving up.", "One more pseudo generalization or I'm giving up."], 'label': [1, 1, 1], 'idx': [0, 1, 2], 'len': [71, 49, 48]}


As another example, the following piece of code cuts each sentence after 20 characters. We 
do not create a new feature, but instead update the content of the sentence feature, as 
follows: 

In [28]:
cola_cut=cola_new.map(lambda e: {'sentence': e['sentence'][:20]+'_'})

In [29]:
import pandas as pd
pd.DataFrame(cola_cut[:3])

| | sentence | label | idx | len |
|---|---|---|---|---|
| 0 | Our friends won't bu_ | 1 | 0 | 71 |
| 1 | One more pseudo gene_ | 1 | 1 | 49 |
| 2 | One more pseudo gene_ | 1 | 2 | 48 |


## Working with Local Files

In [147]:
import os
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [148]:
os.getcwd()

'/content/drive/My Drive/akademi/Packt NLP with Transformers/CH02'

In [149]:
os.listdir("/content/drive/My Drive/akademi/Packt NLP with Transformers/CH02")

['CH02b_Working_with_Datasets_Libary.ipynb',
 'data',
 'CH02a_Working_with_Language_Models_and_Tokenizers.mp4',
 'CH02a_Working_with_Language_Models_and_Tokenizers .ipynb',
 'CH02c_Speed_and_Memory_Benchmarking.ipynb']

In [150]:
# Use one absolute path for both the check and the chdir so they cannot disagree.
ch02_dir = '/content/drive/My Drive/akademi/Packt NLP with Transformers/CH02'
if os.getcwd() != ch02_dir:
    os.chdir(ch02_dir)

In [151]:
os.getcwd()

'/content/drive/My Drive/akademi/Packt NLP with Transformers/CH02'

In [152]:
os.listdir()

['CH02b_Working_with_Datasets_Libary.ipynb',
 'data',
 'CH02a_Working_with_Language_Models_and_Tokenizers.mp4',
 'CH02a_Working_with_Language_Models_and_Tokenizers .ipynb',
 'CH02c_Speed_and_Memory_Benchmarking.ipynb']

In [153]:
# Generic loading scripts are provided for loading a dataset from local CSV, TXT, or JSON files

In [154]:
# The data folder contains the files a.csv, b.csv, and c.csv, which are random parts of the SST-2 dataset
from datasets import load_dataset
data1 = load_dataset('csv', data_files='./data/a.csv', delimiter="\t")
data2 = load_dataset('csv', data_files=['./data/a.csv','./data/b.csv', './data/c.csv'], delimiter="\t")
data3 = load_dataset('csv', data_files={'train':['./data/a.csv','./data/b.csv'], 'test':['./data/c.csv']}, delimiter="\t") 

Using custom data configuration default-811df5c9519fddd3
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-811df5c9519fddd3/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
Using custom data configuration default-37a89142f75f1c5a
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-37a89142f75f1c5a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
Using custom data configuration default-6468b1b0b5900944
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-6468b1b0b5900944/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)


In [155]:
import pandas as pd
pd.DataFrame(data1["train"][:3])

| | sentence | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |


In [156]:
pd.DataFrame(data3["test"][:3])

| | label | sentence |
|---|---|---|
| 0 | 0 | inane and awful |
| 1 | 0 | told in scattered fashion |
| 2 | 1 | takes chances that are bold by studio standards |


In [157]:
# get the files in other format
# data_json = load_dataset('json', data_files='a.json')
# data_text = load_dataset('text', data_files='a.txt')


In [158]:
# You can also access several rows using slice notation or a list of indices

In [159]:
# shuffling

In [160]:
data3_shuf=data3['train'].shuffle(seed=42)
data3_shuf['label'][:15]


Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/csv/default-6468b1b0b5900944/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-3a29ecf37f77eb59.arrow


[0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0]

## Preparing the data for model training
Let us take an example with a tokenizer. 
To do so, the transformers library needs to be installed.

In [30]:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

If batched is True, map passes a batch of examples to the function instead of a single example.
batch_size (default is 1000) is the number of examples per batch. If batch_size is None or not positive, the whole dataset is passed to the function as a single batch.

In [31]:
cola

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 8551
})

In [32]:
cola['sentence'][0]

"Our friends won't buy this analysis, let alone the next one we propose."

In [33]:
encoded_data1 = cola.map( lambda e: tokenizer(e['sentence']), batched=True, batch_size=1000)

In [35]:
encoded_data1

Dataset({
    features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
    num_rows: 8551
})

In [36]:
encoded_data1['sentence'][0]

"Our friends won't buy this analysis, let alone the next one we propose."

In [167]:
encoded_data3 = data3.map(lambda e: tokenizer( e['sentence'], padding=True, truncation=True, max_length=12), batched=True, batch_size=1000) 

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-6468b1b0b5900944/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-d1d250b2127835bc.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-6468b1b0b5900944/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-99b6e3afa67cadc0.arrow


In [168]:
data3

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 199
    })
    test: Dataset({
        features: ['label', 'sentence'],
        num_rows: 100
    })
})

In [169]:
encoded_data3

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence'],
        num_rows: 199
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence'],
        num_rows: 100
    })
})

In [170]:
pprint(encoded_data3['test'][12])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
 'input_ids': [101, 2019, 5186, 16010, 2143, 1012, 102, 0, 0, 0, 0, 0],
 'label': 0,
 'sentence': 'an extremely unpleasant film . '}
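
The zeros at the end of input_ids and attention_mask above come from padding, and the fixed length comes from truncation. The sketch below imitates that behaviour with a toy whitespace tokenizer in plain Python. It is not DistilBERT's tokenizer: the vocabulary and ids are invented, there are no special tokens such as [CLS] or [SEP], and `toy_encode` is a hypothetical helper.

```python
def toy_encode(sentences, vocab, max_length, pad_id=0, unk_id=1):
    """Truncate each sentence to max_length tokens, then pad to the longest one."""
    token_ids = [
        [vocab.get(word, unk_id) for word in sentence.lower().split()][:max_length]
        for sentence in sentences
    ]
    width = max(len(ids) for ids in token_ids)
    return {
        # Real tokens keep their ids; padding positions get pad_id.
        "input_ids": [ids + [pad_id] * (width - len(ids)) for ids in token_ids],
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [
            [1] * len(ids) + [0] * (width - len(ids)) for ids in token_ids
        ],
    }


# Invented vocabulary for illustration only.
vocab = {"an": 2, "extremely": 3, "unpleasant": 4, "film": 5, ".": 6,
         "inane": 7, "and": 8, "awful": 9}

encoded = toy_encode(
    ["an extremely unpleasant film .", "inane and awful"],
    vocab,
    max_length=4,
)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```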
