NLP Training 1: Datasets
--- 

In [2]:
import os
os.chdir('..')
print(f'Setting working dir to: {os.getcwd()}')

Setting working dir to: /Users/ingomarquart/Documents/GitHub/itern-nlp-training-cases


## Datasets

### Exercise 1 - Datasets

Get a list of all datasets available on the Huggingface Hub.   
How many datasets are available?   
Take a closer look at the first few of them.   

There is a dataset called ["emotion"](https://huggingface.co/datasets/emotion).   

1. Load it and take a look at the column names and features.   
2. Take a look at the first row of the training dataset.

In [3]:
# Add your solution here:
# ...

#### Solution

In [4]:
import pandas as pd
from datasets import load_dataset, list_datasets

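# Explore all datasets
# Note: newer releases of `datasets` may no longer export list_datasets;
# huggingface_hub.list_datasets() is the replacement there.
all_datasets = list_datasets()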
print(len(all_datasets))
print(all_datasets[:6])

# Load the emotion dataset
dataset = load_dataset("emotion") 
train_dataset = dataset['train']


print(len(train_dataset))
print(train_dataset.column_names)
print(train_dataset.features)

train_dataset[0]

8831
['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news']


Using custom data configuration default


Downloading and preparing dataset emotion/default (download: 1.97 MiB, generated: 2.07 MiB, post-processed: Unknown size, total: 4.05 MiB) to /Users/ingomarquart/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705...


Dataset emotion downloaded and prepared to /Users/ingomarquart/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705. Subsequent calls will reuse this data.


16000
['text', 'label']
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}


{'text': 'i didnt feel humiliated', 'label': 0}

### Exercise 1 a

Convert the training split to a pandas DataFrame and print its head.

In [5]:
# Add your solution here:
# ...

#### Solution

In [6]:
dataset.set_format(type='pandas')
df = dataset['train'][:]
df.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
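
The integer labels in the DataFrame are easier to read once mapped back to their names. The following is a minimal sketch, assuming the label order reported by `train_dataset.features` above; `df_head` and `label_names` are hypothetical names, and the rows are a hand-copied stand-in for `df.head()` so the snippet runs without downloading anything.

```python
import pandas as pd

# Label names in the order shown in the ClassLabel feature output
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Stand-in for df.head(): the five rows shown in the output above
df_head = pd.DataFrame({
    'text': [
        'i didnt feel humiliated',
        'i can go from feeling so hopeless to so damned...',
        'im grabbing a minute to post i feel greedy wrong',
        'i am ever feeling nostalgic about the fireplac...',
        'i am feeling grouchy',
    ],
    'label': [0, 0, 3, 2, 3],
})

# Map each integer label to its name
df_head['label_name'] = df_head['label'].map(lambda i: label_names[i])
print(df_head[['label', 'label_name']])
```

On the real dataset, the `ClassLabel` feature's `int2str` method should give the same mapping without hard-coding the names.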


## Creating a simple NLP pipeline in a few steps

Create an inference pipeline that produces a summary of the following text. The summary should be at most 50 tokens long (`max_length` counts tokens, not words).

In [7]:
text = """
In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation. 
In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; 
the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. 
Importantly, a deep learning process can learn which features to optimally place in which level on its own. 
This does not eliminate the need for hand-tuning; for example, varying numbers of layers and layer sizes can provide different degrees of abstraction.
The word "deep" in "deep learning" refers to the number of layers through which the data is transformed. 
More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output. 
CAPs describe potentially causal connections between input and output. 
For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). 
For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.
No universally agreed-upon threshold of depth divides shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth higher than 2. 
CAP of depth 2 has been shown to be a universal approximator in the sense that it can emulate any function. 
Beyond that, more layers do not add to the function approximator ability of the network. 
Deep models (CAP > 2) are able to extract better features than shallow models and hence, extra layers help in learning the features effectively.
"""

# Add your solution here:
# ...

#### Solution

In [8]:
from transformers import pipeline

summarizer = pipeline('summarization')
summarizer(text, min_length=5, max_length=50)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Now, do the same but this time using [Google's Pegasus model](https://huggingface.co/google/pegasus-xsum).   
How does the output differ?

In [None]:
model = 'google/pegasus-xsum'
summarizer = pipeline('summarization', model=model, tokenizer=model)
summarizer(text, min_length=5, max_length=50)

## Advanced Dataset Schemes

### PyTorch Map Dataset: Offloading complexity to workers

In [None]:
import torch
from torch.utils.data import Dataset

def preprocess_data(data):
    # Do something here
    return data

class CustomDataset(Dataset):
    def __init__(self, data, transform, augmentation):
        # One-off preprocessing runs once, at construction time
        self.data = preprocess_data(data)
        self.transform = transform
        self.augmentation = augmentation

    def __len__(self):
        # Number of observations (rows) in the underlying DataFrame
        return len(self.data)

    def __getitem__(self, index):
        # Per-sample work lives here, so DataLoader workers can run it in parallel
        observation = self.data.iloc[index, :].to_numpy()
        observation = torch.tensor(observation, dtype=torch.float)
        observation = self.augmentation(observation, self.transform)
        # Assumes the last column holds the label
        features, label = observation[:-1], observation[-1]
        return features, label
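
To see how a map-style dataset is consumed, it helps to wrap one in a `DataLoader`. The sketch below is only an illustration under simplifying assumptions: `ToyDataset` and its hand-written rows are hypothetical, the last column is treated as the label, and `num_workers=0` keeps everything in the main process so the snippet runs anywhere.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical map-style dataset: the last value of each row is the label."""

    def __init__(self, rows):
        self.rows = torch.as_tensor(rows, dtype=torch.float)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        row = self.rows[index]
        return row[:-1], row[-1]

rows = [
    [0.1, 0.2, 0.0],
    [0.3, 0.4, 1.0],
    [0.5, 0.6, 0.0],
    [0.7, 0.8, 1.0],
]

# num_workers=0 loads in the main process; a value > 0 would run
# __getitem__ in that many worker processes instead
loader = DataLoader(ToyDataset(rows), batch_size=2, shuffle=False, num_workers=0)

batches = list(loader)
features, labels = batches[0]
print(features.shape, labels.shape)
```

Raising `num_workers` is what moves the per-sample work in `__getitem__` off the training process. On platforms that spawn worker processes, the loading code typically needs to sit under an `if __name__ == "__main__":` guard.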

### PyTorch Iterable Dataset: For streaming data and data too big to be preloaded

In [None]:
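import torch
from torch.utils.data import IterableDataset
from petastorm import make_batch_reader

class CustomIterableDataset(IterableDataset):
    def __init__(self, data_files, epoch_length, cur_shard, num_shards):
        self.parquet_files = data_files
        # num_epochs=None makes the reader cycle over the data indefinitely;
        # petastorm calls the total number of shards `shard_count`
        self.reader = make_batch_reader(data_files, num_epochs=None,
                                        cur_shard=cur_shard, shard_count=num_shards)
        self.epoch_length = epoch_length

    def __iter__(self):
        # Each item from the reader is a batch of rows, one array per column
        for i, observation in enumerate(self.reader):
            # The reader never ends on its own, so stop after epoch_length batches
            if i == self.epoch_length:
                print("Epoch completed")
                break
            # Assumes all columns are numeric and the last one is the label
            observation = torch.tensor(observation, dtype=torch.float)
            features, label = observation[:-1], observation[-1]
            yield features, label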
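
The sharding idea behind `cur_shard` and `num_shards` can also be shown without Petastorm. The following is a rough sketch, not the approach used above: `RangeStream` is a hypothetical dataset that generates synthetic samples, and it uses PyTorch's `get_worker_info()` so that each DataLoader worker yields a disjoint slice of the stream.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeStream(IterableDataset):
    """Hypothetical streaming dataset that yields synthetic (features, label) pairs."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this iterator covers the whole stream
            shard, num_shards = 0, 1
        else:
            # Multi-process loading: each worker takes every num_shards-th item
            shard, num_shards = info.id, info.num_workers
        for i in range(self.start + shard, self.end, num_shards):
            features = torch.tensor([float(i), float(i) * 2.0])
            label = torch.tensor(float(i % 2))
            yield features, label

# num_workers=0 keeps the sketch in one process; with more workers the
# shard logic above prevents samples from being duplicated
loader = DataLoader(RangeStream(0, 6), batch_size=3, num_workers=0)

batches = list(loader)
features, labels = batches[0]
print(features.shape, labels.tolist())
```

Without the shard logic, every worker would iterate over the full stream and each sample would appear once per worker.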