# Datasets Library

## Installing and loading a dataset

In [None]:
!pip install datasets



In [None]:
# load_dataset loads a dataset from the Hugging Face Hub
from datasets import load_dataset

Datasets might have several configurations. For instance, the GLUE dataset, as an aggregated benchmark, has 10 subsets (as of writing this notebook): COLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI, and the diagnostic subset AX.

To access a GLUE subset, we pass two arguments: the first is **'glue'** and the second is the name of the **sub-part** to load. Likewise, the wikipedia dataset has several configurations, one for each of several languages.

Let's load the 'cola' subset of GLUE as follows:

In [None]:
cola = load_dataset('glue', 'cola')
cola['train'][18:22]

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


{'idx': [18, 19, 20, 21],
 'label': [0, 1, 0, 1],
 'sentence': ['They drank the pub.',
  'The professor talked us into a stupor.',
  'The professor talked us.',
  'We yelled ourselves hoarse.']}

Some datasets are returned as a `DatasetDict` object, while others are returned as a plain `Dataset`, depending on how they are split. The CoLA dataset comes as a `DatasetDict` with 3 splits: train, validation, and test. The train and validation splits include labels (1: acceptable, 0: unacceptable), but every label in the test split is -1, which means 'no label'.

In [None]:
cola

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [None]:
cola['train'][12]

{'idx': 12, 'label': 1, 'sentence': 'Bill rolled out of the room.'}

In [None]:
cola['validation'][68]

{'idx': 68,
 'label': 0,
 'sentence': 'Which report that John was incompetent did he submit?'}

In [None]:
cola['test'][20]

{'idx': 20, 'label': -1, 'sentence': 'Has John seen Mary?'}
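
The label convention described above can be made explicit with a small plain-Python sketch. It does not call the `datasets` library; the three examples are copied from the outputs above, and the helper name `is_labeled` is only illustrative.

```python
# Label ids used by CoLA, as described in this notebook.
LABEL_NAMES = {0: "unacceptable", 1: "acceptable", -1: "no-label"}

# One example from each split, copied from the outputs above.
examples = [
    {"idx": 12, "label": 1, "sentence": "Bill rolled out of the room."},
    {"idx": 68, "label": 0,
     "sentence": "Which report that John was incompetent did he submit?"},
    {"idx": 20, "label": -1, "sentence": "Has John seen Mary?"},
]


def is_labeled(example):
    """Return True when the example carries a real label (not -1)."""
    return example["label"] != -1


# Keep only the examples that can be used for supervised training or scoring.
labeled = [e for e in examples if is_labeled(e)]

for e in examples:
    print(f"{e['sentence']!r}: {LABEL_NAMES[e['label']]}")
```

A check like this is useful before computing metrics, since test-split rows with label -1 cannot be scored locally.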

## Metadata of Datasets
* split
* description
* citation
* homepage
* license
* info


In [None]:
print(cola["train"].split)
print(cola["train"].description)
print(cola["train"].citation)
print(cola["train"].homepage)
print(cola["train"].license)

train
GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.


@article{warstadt2018neural,
  title={Neural Network Acceptability Judgments},
  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1805.12471},
  year={2018}
}
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}

https://nyu-mll.github.io/CoLA/



### Loading other datasets

In [None]:
sst2 = load_dataset('glue', 'sst2')

Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...



Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


In [None]:
mrpc = load_dataset('glue', 'mrpc')

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


To download and check all of the subsets, run the following piece of code:


```
glue = ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
for g in glue:
    _ = load_dataset('glue', g)
```




## Listing all datasets and metrics in the hub

In [None]:
from pprint import pprint
from datasets import list_datasets, list_metrics
all_datasets = list_datasets()  # named all_datasets to avoid shadowing the built-in all()
metrics = list_metrics()

print(f"{len(all_datasets)} datasets and {len(metrics)} metrics exist in the hub\n")
pprint(all_datasets[:20], compact=True)
pprint(metrics, compact=True)

837 datasets and 23 metrics exist in the hub

['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc',
 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue',
 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity',
 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'amttl', 'anli',
 'app_reviews', 'aqua_rat']
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'f1',
 'gleu', 'glue', 'indic_glue', 'meteor', 'precision', 'recall', 'rouge',
 'sacrebleu', 'sari', 'seqeval', 'squad', 'squad_v2', 'super_glue', 'wer',
 'xnli']


## XTREME: Working with a Cross-lingual Dataset

MLQA is a subset of the XTREME benchmark, designed for assessing the performance of cross-lingual question answering models. It includes about 5K extractive question-answer instances in SQuAD format in seven languages: English, German, Arabic, Hindi, Vietnamese, Spanish, and Simplified Chinese.

For example, MLQA.en.de is an English-German QA dataset and can be loaded as follows:

In [None]:
en_de = load_dataset('xtreme', 'MLQA.en.de')


Downloading and preparing dataset xtreme/MLQA.en.de (download: 72.21 MiB, generated: 5.39 MiB, post-processed: Unknown size, total: 77.60 MiB) to /root/.cache/huggingface/datasets/xtreme/MLQA.en.de/1.0.0/26c67ab93fdff98a4e58918149a88827cfa1320e8f9c2f70f667b5ba4b6e570c...



Dataset xtreme downloaded and prepared to /root/.cache/huggingface/datasets/xtreme/MLQA.en.de/1.0.0/26c67ab93fdff98a4e58918149a88827cfa1320e8f9c2f70f667b5ba4b6e570c. Subsequent calls will reuse this data.


In [None]:
# Here is the dataset structure

In [None]:
en_de

DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4517
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 512
    })
})

### Viewing the dataset as a pandas data frame

In [None]:
# View dataset as a pandas data frame
import pandas as pd
pd.DataFrame(en_de['test'][0:4])

| | answers | context | id | question | title |
|---|---|---|---|---|---|
| 0 | {'answer_start': [31], 'text': ['cell']} | An established or immortalized cell line has a... | 037e8929e7e4d2f949ffbabd10f0f860499ff7c9 | Woraus besteht die Linie? | Cell culture |
| 1 | {'answer_start': [232], 'text': ['1885']} | The 19th-century English physiologist Sydney R... | 4b36724f3cbde7c287bde512ff09194cbba7f932 | Wann hat Roux etwas von seiner Medullarplatte ... | Cell culture |
| 2 | {'answer_start': [131], 'text': ['TRIPS']} | After the Uruguay round, the GATT became the b... | 13e58403df16d88b0e2c665953e89575704942d4 | Was muss ratifiziert werden, wenn ein Land ger... | TRIPS Agreement |
| 3 | {'answer_start': [67], 'text': ['developing co... | Since TRIPS came into force, it has been subje... | d23b5372af1de9425a4ae313c01eb80764c910d8 | Welche Teile der Welt kritisierten das TRIPS a... | TRIPS Agreement |


## Selecting, sorting, filtering

### Split
The `split` argument selects which split of the data to load. If it is left as `None` (the default), `load_dataset` returns a dictionary-like `DatasetDict` with all splits (train, test, validation, or any other). If `split` is specified, it returns a single `Dataset` rather than a dictionary.

In [None]:
cola = load_dataset('glue', 'cola', split='train[:300]+validation[-30%:]')
# This selects the first 300 examples of train plus the last 30% of validation.

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


#### Other Split Examples
The first 100 examples of train plus the first 100 examples of validation

`split='train[:100]+validation[:100]'` 

The first 50% of train and the first 30% of validation

`split='train[:50%]+validation[:30%]'`


The first 20% of train and examples in the slice 30:50 from validation

`split='train[:20%]+validation[30:50]'`
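
To see how such split expressions translate into row ranges, here is a minimal plain-Python sketch. It is not the library's implementation: `resolve_split` and `count_rows` are hypothetical helpers, the split sizes are the ones reported for CoLA earlier in this notebook, and percentages are assumed to be rounded to the closest row index.

```python
import re

# Split sizes reported for GLUE/CoLA earlier in this notebook.
SPLIT_SIZES = {"train": 8551, "validation": 1043, "test": 1063}

# Matches "name", "name[a:b]", "name[:b]", "name[a:]", with optional "%" signs.
_PART = re.compile(r"^(\w+)(?:\[(-?\d+%?)?:(-?\d+%?)?\])?$")


def _to_abs(bound, n, default):
    """Convert one slice boundary ('300', '-30%', or None) to a row index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        # Assumption: percentages are rounded to the closest row index.
        value = int(round(int(bound[:-1]) * n / 100.0))
    else:
        value = int(bound)
    if value < 0:
        value += n  # negative boundaries count from the end
    return max(0, min(n, value))


def resolve_split(spec, sizes=SPLIT_SIZES):
    """Return a list of (split_name, start, stop) for an expression like 'a[:10]+b'."""
    ranges = []
    for part in spec.split("+"):
        match = _PART.match(part.strip())
        if match is None:
            raise ValueError(f"cannot parse split part: {part!r}")
        name, start, stop = match.groups()
        n = sizes[name]
        ranges.append((name, _to_abs(start, n, 0), _to_abs(stop, n, n)))
    return ranges


def count_rows(spec, sizes=SPLIT_SIZES):
    """Total number of rows selected by a split expression."""
    return sum(max(0, stop - start) for _, start, stop in resolve_split(spec, sizes))


print(resolve_split("train[:300]+validation[-30%:]"))
print(count_rows("train[:300]+validation[-30%:]"))
print(count_rows("train[:100%]+validation[-30%:]"))
```

Under these assumptions the last expression selects 8864 rows, which agrees with the `num_rows: 8864` reported for the same expression in the map section of this notebook.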

### Sorting

In [None]:
cola.sort('label')['label'][:15]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [None]:
cola.sort('label')['label'][-15:]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

### Indexing
You can also access several rows at once, either with a list of indices or with slice notation.

In [None]:
cola[[6, 19, 44]]

{'idx': [6, 19, 44],
 'label': [1, 1, 1],
 'sentence': ['Fred watered the plants flat.',
  'The professor talked us into a stupor.',
  'The trolley rumbled through the tunnel.']}

In [None]:
cola[42:46]

{'idx': [42, 43, 44, 45],
 'label': [0, 1, 1, 1],
 'sentence': ['They made him to exhaustion.',
  'They made him into a monster.',
  'The trolley rumbled through the tunnel.',
  'The wagon rumbled down the road.']}
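
Selecting several rows returns one list per column rather than one dict per row. When row-wise records are more convenient, the batch can be transposed in plain Python. The sketch below uses the output of `cola[42:46]` copied from above; `batch_to_rows` is an illustrative helper, not part of the library.

```python
# Output of cola[42:46], copied from above: one list per column.
batch = {
    "idx": [42, 43, 44, 45],
    "label": [0, 1, 1, 1],
    "sentence": [
        "They made him to exhaustion.",
        "They made him into a monster.",
        "The trolley rumbled through the tunnel.",
        "The wagon rumbled down the road.",
    ],
}


def batch_to_rows(batch):
    """Turn a dict of equal-length lists into a list of per-row dicts."""
    columns = list(batch)
    return [
        dict(zip(columns, values))
        for values in zip(*(batch[name] for name in columns))
    ]


rows = batch_to_rows(batch)
for row in rows:
    print(row)
```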

### Shuffling 

In [None]:
cola.shuffle(seed=42)[:3]

{'idx': [904, 1017, 885],
 'label': [1, 0, 1],
 'sentence': ['Lou forgot the umbrella in the closet.',
  'It is the problem that he is here.',
  'I met the person who left.']}

## Caching and reusability
Cache files let us load large datasets through memory mapping, as long as the dataset fits on the drive. This provides a fast backend and enables smart caching: the results of operations are saved on the drive and reused.
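
The idea of memory mapping can be illustrated with the standard library alone. The sketch below is not how `datasets` stores its data (the library relies on Apache Arrow files); the fixed-size record layout and the helper names are assumptions made for the illustration. It writes records to a file and then reads a single record through a memory map, without loading the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical on-disk layout: one little-endian 64-bit integer per record.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Write the values to disk as fixed-size binary records."""
    with open(path, "wb") as f:
        for value in values:
            f.write(RECORD.pack(value))


def read_record(path, index):
    """Read one record through a memory map instead of reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = index * RECORD.size
            return RECORD.unpack(mapped[start:start + RECORD.size])[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    write_records(path, range(100_000))
    value = read_record(path, 54_321)

print(value)
```

Only the pages that hold the requested record need to be brought into memory, which is why data larger than RAM can still be accessed this way as long as it fits on the drive.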

In [None]:
pprint(list(dir(cola)))

['__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_index_is_initialized',
 '_data',
 '_fingerprint',
 '_format_columns',
 '_format_kwargs',
 '_format_type',
 '_get_cache_file_path',
 '_getitem',
 '_indexes',
 '_indices',
 '_info',
 '_map_single',
 '_new_dataset_with_indices',
 '_output_all_columns',
 '_split',
 'add_elasticsearch_index',
 'add_faiss_index',
 'add_faiss_index_from_external_arrays',
 'builder_name',
 'cache_files',
 'cast',
 'cast_',
 'citation',
 'class_encode_column',
 'cleanup_cache_files',
 'column_names',
 'config_name',
 'data',
 'dataset_size',
 'description',
 'd

In [None]:
cola.cache_files

[]

In [None]:
cola.info

DatasetInfo(description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n', citation='@article{warstadt2018neural,\n  title={Neural Network Acceptability Judgments},\n  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},\n  journal={arXiv preprint arXiv:1805.12471},\n  year={2018}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n', homepage='https://nyu-mll.github.io/CoLA/', license='', features={'sentence': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], names_file=None, id=None), 'idx': Value(dtype='int3

## Dataset Filter and Map Function



### Filter function

In [None]:
# Use filter to get 3 sentences that include the term "kick"
cola = load_dataset('glue', 'cola', split='train[:100%]+validation[-30%:]')
pprint(cola.filter(lambda s: "kick" in s['sentence'])["sentence"][:3])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)




['Jill kicked the ball from home plate to third base.',
 'Fred kicked the ball under the porch.',
 'Fred kicked the ball behind the tree.']


In [None]:
# To get 3 acceptable sentences
pprint(cola.filter(lambda s: s['label'] == 1)["sentence"][:3])



["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]


In [None]:
# To get 3 acceptable sentences (alternative version that looks up the label id by name)
cola.filter(lambda s: s['label'] == cola.features['label'].str2int('acceptable'))["sentence"][:3]





["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]

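In plain Python terms, filtering keeps the rows for which a predicate returns `True`, and the label lookup maps a class name to its position in the list of names. The following sketch mimics both steps without the library; the rows are copied from the `cola['train'][18:22]` output near the top of this notebook, and `filter_rows` and this `str2int` are illustrative stand-ins rather than the library's code.

```python
# Class names in the order shown by cola.info above.
LABEL_NAMES = ["unacceptable", "acceptable"]


def str2int(name):
    """Map a class name to its integer id (its position in LABEL_NAMES)."""
    return LABEL_NAMES.index(name)


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


# Rows 18 to 21 of the CoLA train split, copied from the output above.
rows = [
    {"idx": 18, "label": 0, "sentence": "They drank the pub."},
    {"idx": 19, "label": 1, "sentence": "The professor talked us into a stupor."},
    {"idx": 20, "label": 0, "sentence": "The professor talked us."},
    {"idx": 21, "label": 1, "sentence": "We yelled ourselves hoarse."},
]

with_professor = filter_rows(rows, lambda s: "professor" in s["sentence"])
acceptable = filter_rows(rows, lambda s: s["label"] == str2int("acceptable"))

print([r["sentence"] for r in with_professor])
print([r["sentence"] for r in acceptable])
```

Looking the id up by name avoids hard-coding `1` and keeps the code correct if the order of the class names differs in another dataset.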
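### Processing data with the map function
The `datasets.Dataset.map()` function iterates over the dataset and applies a processing function to each example. The values returned by the function are added to the example as new fields or overwrite existing ones, and the result is returned as a new dataset.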

In [None]:
# E.g., adding a new feature
cola_new = cola.map(lambda e: {'len': len(e['sentence'])})





In [None]:
cola_new

Dataset({
    features: ['idx', 'label', 'len', 'sentence'],
    num_rows: 8864
})

In [None]:
pprint(cola_new[0:3])

{'idx': [0, 1, 2],
 'label': [1, 1, 1],
 'len': [71, 49, 48],
 'sentence': ["Our friends won't buy this analysis, let alone the next one we "
              'propose.',
              "One more pseudo generalization and I'm giving up.",
              "One more pseudo generalization or I'm giving up."]}


In [None]:
cola_cut = cola_new.map(lambda e: {'sentence': e['sentence'][:20]})





In [None]:
pprint(cola_cut[:3])

{'idx': [0, 1, 2],
 'label': [1, 1, 1],
 'len': [71, 49, 48],
 'sentence': ["Our friends won't bu",
              'One more pseudo gene',
              'One more pseudo gene']}
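
The two `map` calls above follow one rule: the dict returned by the function is merged into each example, so new keys add columns and existing keys are overwritten. A minimal plain-Python sketch of that rule, using two sentences copied from the output above (`map_rows` is an illustrative helper, not the library's implementation):

```python
def map_rows(rows, function):
    """Apply function to every row and merge its result into a copy of the row."""
    return [{**row, **function(row)} for row in rows]


# Two examples copied from the outputs above.
rows = [
    {"idx": 1, "label": 1,
     "sentence": "One more pseudo generalization and I'm giving up."},
    {"idx": 2, "label": 1,
     "sentence": "One more pseudo generalization or I'm giving up."},
]

# A new key ("len") adds a column.
with_len = map_rows(rows, lambda e: {"len": len(e["sentence"])})

# An existing key ("sentence") is overwritten; "len" keeps the original length.
cut = map_rows(with_len, lambda e: {"sentence": e["sentence"][:20]})

print(with_len[0])
print(cut[0])
```

Note that `len` still holds the length of the full sentence after truncation, exactly as in the `cola_cut` output above, because the second function only returns the `sentence` key.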


## Working with Local Files

In [None]:
import os
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
os.chdir("drive/MyDrive/akademi/Packt NLP with Transformers/CH 02")

In [None]:
os.listdir()

['CH02.02 Working with Datasets Libary.ipynb',
 'data',
 'Untitled document.gdoc',
 'CH02.03 Speed and Memory Benchmarking.ipynb']

In [None]:
# To load a dataset from local files (CSV, TXT, JSON), generic loading scripts are provided

In [None]:
# The data folder contains the files a.csv, b.csv, and c.csv, which are random parts of the SST-2 dataset
from datasets import load_dataset
data1 = load_dataset('csv', data_files='./data/a.csv', delimiter="\t")
data2 = load_dataset('csv', data_files=['./data/a.csv','./data/b.csv', './data/c.csv'], delimiter="\t")
data3 = load_dataset('csv', data_files={'train':['./data/a.csv','./data/b.csv'], 'test':['./data/c.csv']}, delimiter="\t") 

Using custom data configuration default-68770e9ce5670cbc


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-68770e9ce5670cbc/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...



Using custom data configuration default-ca1f72fb220e506f


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-68770e9ce5670cbc/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.
Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-ca1f72fb220e506f/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...



Using custom data configuration default-413b4932740140a4


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-ca1f72fb220e506f/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.
Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-413b4932740140a4/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-413b4932740140a4/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.


In [None]:
import pandas as pd
pd.DataFrame(data1["train"][:3])

| | sentence | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |


In [None]:
pd.DataFrame(data3["test"][:3])

| | label | sentence |
|---|---|---|
| 0 | 0 | inane and awful |
| 1 | 0 | told in scattered fashion |
| 2 | 1 | takes chances that are bold by studio standards |
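
The files `a.csv`, `b.csv`, and `c.csv` themselves are not shown in this notebook. Assuming they are tab-separated with a header row `sentence<TAB>label` (which the `delimiter="\t"` argument and the column names above suggest), a stand-in file can be produced and checked with the standard library as sketched below. The three rows are copied from the preview of `data3["test"]`; `write_tsv` and `read_tsv` are illustrative helpers.

```python
import csv
import os
import tempfile

# Rows copied from the preview of data3["test"] above.
ROWS = [
    ("inane and awful", 0),
    ("told in scattered fashion", 0),
    ("takes chances that are bold by studio standards", 1),
]


def write_tsv(path, rows):
    """Write (sentence, label) pairs as a tab-separated file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["sentence", "label"])  # assumed header row
        writer.writerows(rows)


def read_tsv(path):
    """Read the file back into a list of dicts with integer labels."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {"sentence": row["sentence"], "label": int(row["label"])}
            for row in csv.DictReader(f, delimiter="\t")
        ]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "c.csv")
    write_tsv(path, ROWS)
    loaded = read_tsv(path)

for row in loaded:
    print(row)
```

A file written this way has the shape that the `load_dataset('csv', data_files=..., delimiter="\t")` calls above expect under the stated assumption.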


In [None]:
# Load files in other formats
# data_json = load_dataset('json', data_files='a.json')
# data_text = load_dataset('text', data_files='a.txt')


In [None]:
#you can also access several rows using slice notation or with a list of indices

In [None]:
# shuffling

In [None]:
data3_shuf=data3['train'].shuffle(seed=42)
data3_shuf['label'][:15]


[0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0]

## Preparing the data for model training
Let us take an example with a tokenizer. To do so, we first need to install the transformers library.

In [None]:
!pip install transformers

Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [None]:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')





If `batched` is `True`, `map` passes a batch of examples to the function instead of a single example.
`batch_size` (default: 1000) is the number of examples per batch passed to the function. If `batch_size` is `None` or not positive, the whole dataset is passed to the function as a single batch.
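
The batching behaviour can be sketched in plain Python. In batched mode the function receives a dict of lists (one list per column) and returns lists of the same length. `map_batched` below is an illustrative stand-in for the library's logic, and the 99 rows are hypothetical placeholder sentences chosen to match the size of `data1`.

```python
def map_batched(columns, function, batch_size=1000):
    """Call function on slices of a dict of lists and merge what it returns."""
    num_rows = len(next(iter(columns.values())))
    if batch_size is None or batch_size <= 0:
        batch_size = max(num_rows, 1)  # the whole dataset as a single batch
    merged = {}
    calls = 0
    for start in range(0, num_rows, batch_size):
        batch = {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }
        new_columns = function(batch)  # receives lists, returns lists
        calls += 1
        for name, values in {**batch, **new_columns}.items():
            merged.setdefault(name, []).extend(values)
    return merged, calls


# Hypothetical stand-in data: 99 rows, like data1 in this notebook.
columns = {"sentence": [f"sentence number {i}" for i in range(99)]}


def count_words(batch):
    """A toy batched function: one word count per sentence in the batch."""
    return {"n_words": [len(s.split()) for s in batch["sentence"]]}


_, calls_default = map_batched(columns, count_words)
result, calls_small = map_batched(columns, count_words, batch_size=40)
_, calls_full = map_batched(columns, count_words, batch_size=None)

print(calls_default, calls_small, calls_full)
```

With 99 rows, the default batch size of 1000 leads to a single call, while `batch_size=40` leads to three calls (40, 40, and 19 rows). Fewer, larger calls are what makes batched tokenization fast.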

In [None]:
encoded_data1 = data1.map(lambda e: tokenizer(e['sentence']), batched=True, batch_size=1000)





In [None]:
data1

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 99
    })
})

In [None]:
encoded_data1

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence'],
        num_rows: 99
    })
})

In [None]:
pprint(encoded_data1['train'][0])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
 'label': 0,
 'sentence': 'hide new secretions from the parental units '}


In [None]:
encoded_data3 = data3.map(lambda e: tokenizer(e['sentence'], padding=True, truncation=True, max_length=12), batched=True, batch_size=1000)





In [None]:
data3

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 199
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 100
    })
})

In [None]:
encoded_data3

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence'],
        num_rows: 199
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence'],
        num_rows: 100
    })
})

In [None]:
pprint(encoded_data3['test'][12])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
 'input_ids': [101, 2019, 5186, 16010, 2143, 1012, 102, 0, 0, 0, 0, 0],
 'label': 0,
 'sentence': 'an extremely unpleasant film . '}
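
The zeros in the output above come from padding: within a batch, shorter sequences are filled with the padding id up to the longest sequence, which itself is cut to at most `max_length` tokens. The plain-Python sketch below reproduces that effect on token ids. It is a simplification, not the tokenizer's code: it assumes the padding id is 0 and that truncation keeps the final `[SEP]` id (102), and the long sequence is a made-up placeholder.

```python
PAD_ID = 0  # id used for padding in the output above


def pad_and_truncate(batch_ids, max_length):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [
        # Simplification: keep the first max_length - 1 ids plus the last id.
        ids if len(ids) <= max_length else ids[:max_length - 1] + ids[-1:]
        for ids in batch_ids
    ]
    longest = max(len(ids) for ids in truncated)
    return {
        "input_ids": [
            ids + [PAD_ID] * (longest - len(ids)) for ids in truncated
        ],
        "attention_mask": [
            [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
        ],
    }


# Ids of 'an extremely unpleasant film .', copied from the output above.
short_ids = [101, 2019, 5186, 16010, 2143, 1012, 102]
# Hypothetical 15-id sequence standing in for a longer sentence in the batch.
long_ids = [101] + list(range(2000, 2013)) + [102]

encoded = pad_and_truncate([short_ids, long_ids], max_length=12)
print(encoded["input_ids"][0])
print(encoded["attention_mask"][0])
```

The attention mask marks real tokens with 1 and padding with 0, so a model can ignore the padded positions.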
