## Datasets in Hugging Face
This section covers how to load a dataset, extract its details, apply tokenization, split it, and upload it back to the Hugging Face Hub.
- We will load the data, create a mini dataset, and upload it back to the Hub.


### Load a Dataset
#### `load_dataset_builder`: get the details of a dataset

In [3]:
from datasets import load_dataset_builder

dataset_name = "cfilt/iitb-english-hindi"

ds_builder = load_dataset_builder(dataset_name)


In [30]:
ds_builder.info.description

''

In [31]:
ds_builder.info.features

{'translation': {'en': Value(dtype='string', id=None),
  'hi': Value(dtype='string', id=None)}}

If you’re happy with the dataset, then load it with `load_dataset()`:

In [51]:
from datasets import load_dataset

dataset = load_dataset(dataset_name, split="train")

In [52]:
print(f"Length of data: {len(dataset)}")
print(f"One object of dataset: {dataset[0]}")

Length of data: 1659083
One object of dataset: {'translation': {'en': 'Give your application an accessibility workout', 'hi': 'अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें'}}


In [34]:
# View one random example; random.choices() returns a list of length 1 by default
import random

print(f"object of dataset: {random.choices(dataset)}")

object of dataset: [{'translation': {'en': 'exotic', 'hi': 'अनभो'}}]


#### Slicing a dataset

In [35]:
dataset[5:10]

{'translation': [{'en': 'Highlight duration', 'hi': 'अवधि को हाइलाइट रकें'},
  {'en': 'The duration of the highlight box when selecting accessible nodes',
   'hi': 'पहुंचनीय आसंधि (नोड) को चुनते समय हाइलाइट बक्से की अवधि'},
  {'en': 'Highlight border color',
   'hi': 'सीमांत (बोर्डर) के रंग को हाइलाइट करें'},
  {'en': 'The color and opacity of the highlight border.',
   'hi': 'हाइलाइट किए गए सीमांत का रंग और अपारदर्शिता। '},
  {'en': 'Highlight fill color', 'hi': 'भराई के रंग को हाइलाइट करें'}]}
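Note that a slice comes back column-oriented: a dict that maps each column name to a list of values, rather than a list of row dicts. If per-row dicts are more convenient, a small helper can regroup them. The sketch below is plain Python and operates on a dict shaped like the slice output; `batch_to_rows` is a hypothetical helper name, not part of 🤗 Datasets.

```python
def batch_to_rows(batch):
    """Turn a column-oriented batch {column: [values]} into a list of row dicts."""
    columns = list(batch)
    # zip(*...) walks all columns in lockstep, producing one tuple of values per row
    return [dict(zip(columns, values)) for values in zip(*(batch[c] for c in columns))]


# A two-row batch shaped like the output of dataset[5:7]
batch = {
    "translation": [
        {"en": "Highlight duration", "hi": "अवधि को हाइलाइट रकें"},
        {"en": "The duration of the highlight box when selecting accessible nodes",
         "hi": "पहुंचनीय आसंधि (नोड) को चुनते समय हाइलाइट बक्से की अवधि"},
    ]
}

rows = batch_to_rows(batch)
print(len(rows))                     # 2
print(rows[0]["translation"]["en"])  # Highlight duration
```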

#### Iterable dataset

In [36]:
iterable_dataset = load_dataset(dataset_name, split="train", streaming=True)
for example in iterable_dataset:
    print(example)
    break

{'translation': {'en': 'Give your application an accessibility workout', 'hi': 'अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें'}}


You can also create an `IterableDataset` from an existing `Dataset`. This is faster than streaming mode because the data is read from local files instead of being fetched over the network:

In [38]:
iterable_dataset = dataset.to_iterable_dataset()
for example in iterable_dataset:
    print(example)
    break

{'translation': {'en': 'Give your application an accessibility workout', 'hi': 'अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें'}}


An IterableDataset progressively iterates over a dataset one example at a time, so you don’t have to wait for the whole dataset to download before you can use it. As you can imagine, this is quite useful for large datasets you want to use immediately!

However, this means an `IterableDataset` behaves differently from a regular `Dataset`. You don’t get random access to examples in an `IterableDataset`. Instead, iterate over its elements, for example by calling `next(iter(iterable_dataset))` or by using a `for` loop to return the next item:

In [42]:
next(iter(iterable_dataset))

{'translation': {'en': 'Give your application an accessibility workout',
  'hi': 'अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें'}}

You can return a subset of the dataset with a specific number of examples using `IterableDataset.take()`:

In [43]:
list(iterable_dataset.take(3))

[{'translation': {'en': 'Give your application an accessibility workout',
   'hi': 'अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें'}},
 {'translation': {'en': 'Accerciser Accessibility Explorer',
   'hi': 'एक्सेर्साइसर पहुंचनीयता अन्वेषक'}},
 {'translation': {'en': 'The default plugin layout for the bottom panel',
   'hi': 'निचले पटल के लिए डिफोल्ट प्लग-इन खाका'}}]

### Pre-Process

There are many possible ways to preprocess a dataset, and it all depends on your specific dataset. Sometimes you may need to rename a column, and other times you might need to unflatten nested fields. 🤗 Datasets provides a way to do most of these things. But in nearly all preprocessing cases, depending on your dataset modality, you’ll need to:

- Tokenize a text dataset.
- Resample an audio dataset.
- Apply transforms to an image dataset.

The last preprocessing step is usually setting your dataset format to be compatible with your machine learning framework’s expected input format.

In this tutorial, you’ll also need to install the 🤗 Transformers library:

`pip install transformers`

Grab a dataset of your choice and follow along!

#### Tokenize text
Models cannot process raw text, so you’ll need to convert the text into numbers. Tokenization does this by dividing text into smaller units called tokens, which may be whole words or pieces of words. Each token is then mapped to a number using the tokenizer’s vocabulary.
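To make the text-to-numbers step concrete, here is a toy sketch of greedy longest-match subword splitting in the style of WordPiece. It is an illustration only: the vocabulary and ids are made up, `wordpiece_split` is a hypothetical helper, and a real tokenizer also handles punctuation, normalization, and special tokens.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Made-up toy vocabulary: token -> id (not the real bert-base-uncased vocabulary)
vocab = {"[UNK]": 0, "give": 1, "your": 2, "work": 3, "##out": 4, "access": 5, "##ibility": 6}

text = "Give your accessibility workout"
tokens = [piece for word in text.lower().split() for piece in wordpiece_split(word, vocab)]
ids = [vocab[token] for token in tokens]

print(tokens)  # ['give', 'your', 'access', '##ibility', 'work', '##out']
print(ids)     # [1, 2, 5, 6, 3, 4]
```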

1. Load a pretrained tokenizer and the dataset:

In [1]:
from transformers import AutoTokenizer
from datasets import load_dataset 

In [2]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


In [4]:
dataset = load_dataset(dataset_name, split="train")

2. Call your tokenizer on the first row of text in the dataset:

In [13]:
print(f"Sentence:{dataset[0]['translation']['en']}")
print(f"Tokens:{tokenizer(dataset[0]['translation']['en'])}")

Sentence:Give your application an accessibility workout
Tokens:{'input_ids': [101, 2507, 2115, 4646, 2019, 23661, 27090, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}


In [14]:
print(f"Sentence:{dataset[0]['translation']['hi']}")
print(f"Tokens:{tokenizer(dataset[0]['translation']['hi'])}")

Sentence:अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें
Tokens:{'input_ids': [101, 1311, 29864, 29863, 1311, 29863, 29864, 29869, 29868, 29879, 29853, 1315, 29879, 1328, 29875, 29854, 29863, 29878, 29868, 29859, 29876, 1335, 29868, 29876, 29868, 29876, 29867, 1315, 29876, 1334, 29876, 29866, 1325, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


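Note that `bert-base-uncased` is an English model, so the Hindi sentence is broken into many small pieces (34 ids, against 8 for the English sentence). A multilingual tokenizer would likely suit the Hindi side better.

The tokenizer returns a dictionary with three items: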

- **input_ids:** the numbers representing the tokens in the text.
- **token_type_ids:** indicates which sequence a token belongs to if there is more than one sequence.
- **attention_mask:** indicates which tokens the model should attend to (1) and which are padding to be ignored (0).

These values are the actual model inputs.
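In the outputs above the attention mask is all ones, because a single unpadded sentence has nothing to ignore. Zeros appear once shorter sequences in a batch are padded to a common length. The sketch below does that padding by hand to show the relationship. `pad_batch` is a hypothetical helper and the short sequence is made up from the ids printed earlier; in practice the tokenizer's `padding=True` argument handles this.

```python
PAD_ID = 0  # id of the [PAD] token in bert-base-uncased


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token ids to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# First sequence: the English ids printed earlier (101 is [CLS], 102 is [SEP])
long_seq = [101, 2507, 2115, 4646, 2019, 23661, 27090, 102]
# Second sequence: a shorter, made-up example built from the same ids
short_seq = [101, 2507, 2115, 102]

batch = pad_batch([long_seq, short_seq])
print(batch["input_ids"][1])       # [101, 2507, 2115, 102, 0, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 0, 0, 0, 0]
```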

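3. The fastest way to tokenize your entire dataset is to use the `map()` function. Setting the `batched` parameter to `True` speeds up tokenization by applying the tokenizer to batches of examples instead of individual examples. The cells below use `batched=False` instead, because the tokenization functions index the nested `translation` dict of a single example; with `batched=True` that field arrives as a list of dicts.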
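With `batched=True`, the mapped function receives a dict of columns, so for this dataset `batch["translation"]` is a list of `{'en': ..., 'hi': ...}` dicts and the strings have to be gathered before they reach the tokenizer. The helpers below sketch that step in plain Python; `english_texts` and `hindi_texts` are hypothetical names, and the batch is a hand-written stand-in with the same shape.

```python
def english_texts(batch):
    """Collect the English strings from a batch shaped like {'translation': [{...}, ...]}."""
    return [pair["en"] for pair in batch["translation"]]


def hindi_texts(batch):
    """Collect the Hindi strings from the same kind of batch."""
    return [pair["hi"] for pair in batch["translation"]]


# Stand-in batch with the shape that map(..., batched=True) passes to the function
batch = {
    "translation": [
        {"en": "Highlight duration", "hi": "अवधि को हाइलाइट रकें"},
        {"en": "Highlight fill color", "hi": "भराई के रंग को हाइलाइट करें"},
    ]
}

print(english_texts(batch))  # ['Highlight duration', 'Highlight fill color']
```

With a tokenizer loaded, a batched call could then look like `dataset.map(lambda batch: tokenizer(english_texts(batch), max_length=512, truncation=True), batched=True)`. That line is an untested suggestion rather than part of the original notebook.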

Before tokenizing, create a mini dataset of randomly sampled rows so the remaining steps run quickly:

In [53]:
import random
num_samples = 20000  # Number of random samples
random_indices = random.sample(range(len(dataset)), num_samples)
mini_dataset = dataset.select(random_indices)
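The sampling above draws different rows on every run. If the mini dataset needs to be reproducible, one option is to draw the indices from a seeded generator, as in the sketch below. The seed value 42 and the helper name `sample_indices` are assumptions for illustration; the dataset size is the train split length reported earlier.

```python
import random


def sample_indices(dataset_size, num_samples, seed):
    """Draw num_samples distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return rng.sample(range(dataset_size), num_samples)


dataset_size = 1659083  # length of the train split reported earlier
indices_a = sample_indices(dataset_size, 20000, seed=42)
indices_b = sample_indices(dataset_size, 20000, seed=42)

print(indices_a == indices_b)        # True: same seed, same rows
print(len(set(indices_a)) == 20000)  # True: sample() never repeats an index
```

The resulting list can be passed to `dataset.select(...)` exactly as before. Another route within the library is `dataset.shuffle(seed=42).select(range(20000))`.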

In [26]:
mini_dataset[0]

{'translation': {'en': 'Location at (date) \\t', 'hi': 'स्थान (तिथि) पर\\t'}}

In [32]:
def tokenization_english(example):
    return tokenizer(example['translation']['en'], max_length=512, truncation=True)

In [33]:
def tokenization_hindi(example):
    return tokenizer(example['translation']['hi'], max_length=512, truncation=True)

In [34]:
mini_dataset_tokens_en = mini_dataset.map(tokenization_english, batched=False)


In [35]:
mini_dataset_tokens_hi = mini_dataset.map(tokenization_hindi, batched=False)


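4. Save the tokenized dataset to disk, then set its format to be compatible with your machine learning framework. Here `set_format()` makes the dataset return only `input_ids`, as PyTorch tensors: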

In [36]:
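# save_to_disk() writes a directory of Arrow files, so this path is a folder despite its .json suffix
mini_dataset_tokens_en.save_to_disk('tokenizer_en.json')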


In [37]:
mini_dataset_tokens_en.set_format(type="torch", columns=["input_ids"])

In [38]:
mini_dataset_tokens_en[0]

{'input_ids': tensor([ 101, 3295, 2012, 1006, 3058, 1007, 1032, 1056,  102])}

In [54]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
mini_dataset.to_parquet(path_or_buf=r"B:\CODE\Pytorch_Transformers\Huggingface\train.parquet")


5382716