# Effortless NLP using HuggingFace's Transformers Ecosystem

![Image](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/keyboard_1.jpg)

> Image by [Markus Winkler](https://unsplash.com/@markuswinkler)
### How Is Text Converted into Numbers in a Meaningful Way?

#### ------------------------------------------------ 
#### *Articles So Far In This Series*
#### -> [[NLP Tutorial] Finish Tasks in Two Lines of Code](https://www.kaggle.com/rajkumarl/nlp-tutorial-finish-tasks-in-two-lines-of-code)
#### -> [[NLP Tutorial] Unwrapping Transformers Pipeline](https://www.kaggle.com/rajkumarl/nlp-unwrapping-transformers-pipeline)
#### -> [[NLP Tutorial] Exploring Tokenizers](https://www.kaggle.com/rajkumarl/nlp-tutorial-exploring-tokenizers)
#### -> [[NLP Tutorial] Fine-Tuning in TensorFlow](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-tensorflow) 
#### -> [[NLP Tutorial] Fine-Tuning in PyTorch](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-pytorch) 
#### -> [[NLP Tutorial] Fine-Tuning with Trainer API](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-with-trainer-api) 
#### ------------------------------------------------ 

The **Tokenizers** API in the **Transformers** library handles essential preprocessing steps such as tokenization, padding, truncation, and batching.

Let's discuss the Tokenizers API and its different functionalities with use cases.

# Prepare Environment and Data

In [1]:
# Make necessary imports

# for array operations 
import numpy as np 
# for data handling
import pandas as pd
# TensorFlow framework
import tensorflow as tf
# PyTorch framework
import torch
# for pretty printing
from pprint import pprint



In [2]:
# load the text data
data = pd.read_csv('../input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv')
data.head()

Unnamed: 0,post_id,comment_id,txt,url,offensiveness_score
0,42g75o,cza1q49,> The difference in average earnings between m...,https://www.reddit.com/r/changemyview/comments...,-0.083
1,42g75o,cza1wdh,"The myth is that the ""gap"" is entirely based o...",https://www.reddit.com/r/changemyview/comments...,-0.022
2,42g75o,cza23qx,[deleted],https://www.reddit.com/r/changemyview/comments...,0.167
3,42g75o,cza2bw8,The assertion is that women get paid less for ...,https://www.reddit.com/r/changemyview/comments...,-0.146
4,42g75o,cza2iji,You said in the OP that's not what they're mea...,https://www.reddit.com/r/changemyview/comments...,-0.083


# Encoding Texts

A tokenizer encodes texts into numbers that a model can understand.

In [3]:
# AutoTokenizer class from HF's transformers library
from transformers import AutoTokenizer

# download and cache a suitable pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')


In [4]:
# extract a text for our work
text = data.loc[3,'txt']
# Is it a string?
print(type(text))
# What text does it have?
text

<class 'str'>


'The assertion is that women get paid less for the *same* jobs, and that they get paid less *because* they are women. '

In [5]:
# obtain the tokens
tokens = tokenizer.tokenize(text)
tokens

['The',
 'assertion',
 'is',
 'that',
 'women',
 'get',
 'paid',
 'less',
 'for',
 'the',
 '*',
 'same',
 '*',
 'jobs',
 ',',
 'and',
 'that',
 'they',
 'get',
 'paid',
 'less',
 '*',
 'because',
 '*',
 'they',
 'are',
 'women',
 '.']

In [6]:
# convert the tokens into numeric representation
inputs = tokenizer.convert_tokens_to_ids(tokens)
inputs

[1109,
 26878,
 1110,
 1115,
 1535,
 1243,
 3004,
 1750,
 1111,
 1103,
 115,
 1269,
 115,
 5448,
 117,
 1105,
 1115,
 1152,
 1243,
 3004,
 1750,
 115,
 1272,
 115,
 1152,
 1132,
 1535,
 119]

Aha! We cannot understand these numerical representations, but a **BERT model** can! 

# Decode back into Texts

In [7]:
decoded = tokenizer.decode(inputs)
decoded

'The assertion is that women get paid less for the * same * jobs, and that they get paid less * because * they are women.'

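The `decode` method not only maps the ids back to tokens but also merges the tokens into a string that closely matches the original text. The match is not exact: here the asterisks come back separated by spaces (`* same *` instead of `*same*`).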
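
To make the encode/decode round trip concrete, here is a minimal sketch with a toy vocabulary. The ids are copied from the output above, but the names `toy_vocab`, `encode`, and `decode` are illustrative assumptions and not part of the Transformers API. The real tokenizer works with a WordPiece vocabulary of tens of thousands of entries.

```python
# Toy illustration only: a tiny vocabulary copied from the ids printed above
toy_vocab = {"The": 1109, "is": 1110, "that": 1115, "women": 1535, ".": 119}
# Reverse lookup used for decoding
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def encode(tokens):
    """Map each token to its integer id (hypothetical helper)."""
    return [toy_vocab[tok] for tok in tokens]

def decode(ids):
    """Map ids back to tokens and join them with single spaces."""
    return " ".join(id_to_token[i] for i in ids)

toy_ids = encode(["The", "women", "."])
print(toy_ids)
print(decode(toy_ids))
```

This naive join leaves a space before the period. The library's `decode` goes one step further and cleans up such spacing, which is why its output reads like normal text.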

# Issues with Modeling

We could now pass the preprocessed text to a classification model, for example for sentiment analysis. However, a few common issues arise from the input format the model was trained to expect.

# Issue #1 Batching

Transformers models expect batched inputs with TWO dimensions. The first is the batch size and the second is the sequence length.

In [8]:
# In TensorFlow
from transformers import TFAutoModelForSequenceClassification
# fetch the model checkpoint that matches the tokenizer
tf_model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-cased')
ids = tf.constant(inputs)
try:
    tf_model(ids)
except Exception as e:
    print(type(e), e)


<class 'IndexError'> list index out of range


In [9]:
# nest inputs inside a list to form a batch,
# i.e., add a leading batch dimension
ids = tf.constant([inputs])
tf_model(ids).logits 

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-0.51531273,  0.11103971]], dtype=float32)>

In [10]:
# In PyTorch
from transformers import AutoModelForSequenceClassification
# fetch the model checkpoint that matches the tokenizer
pt_model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased')
# convert inputs into a tensor
ids = torch.tensor(inputs)
# make prediction
try:
    pt_model(ids)
except ValueError as e:
    print('ValueError:', e)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

ValueError: not enough values to unpack (expected 2, got 1)


In [11]:
# nest inputs inside a list to form a batch
ids = torch.tensor([inputs])
# make prediction on batched inputs
pt_model(ids).logits

tensor([[-0.5233, -0.4122]], grad_fn=<AddmmBackward>)
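
The batching fix in the cells above boils down to adding a leading axis. The sketch below shows the shape change with NumPy, using the first few ids of the example sentence. It is only an illustration of the shapes involved.

```python
import numpy as np

# First few ids of the example sentence, as a flat (unbatched) sequence
single = np.array([1109, 26878, 1110, 1115])
print(single.shape)   # one dimension: (sequence_length,)

# Add a leading batch axis so the shape becomes (batch_size, sequence_length)
batched = single[np.newaxis, :]
print(batched.shape)
```

In TensorFlow, `tf.expand_dims(ids, 0)` adds the same axis, and in PyTorch `ids.unsqueeze(0)` does; both are equivalent to wrapping the Python list in another list as done above.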

# Issue #2 Padding

Models expect a batch of token ids as input, but texts usually differ in length. A tensor cannot hold rows of different sizes; it must be rectangular. The remedy is padding: we extend the shorter sequences with a *padding id* until they match the longest sequence in the batch.

In [12]:
# find short texts (24 to 31 characters long)
# no seed is set, so the sample differs on every run
data['txt_len'] = data['txt'].str.len()
data[(data['txt_len'] < 32) & (data['txt_len'] > 23)].sample(5)

Unnamed: 0,post_id,comment_id,txt,url,offensiveness_score,txt_len
3838,cfh0mi,euaefq9,I will never forget that night,https://www.reddit.com/r/AFL/comments/cfh0mi/m...,-0.638,30
4432,agy6wg,eebms8h,IM PICKIN UP GOOD VIBRATIONS 🎶,https://i.redd.it/i28vnrmfsza21.jpg/eebms8h/,-0.604,31
3027,74gz19,dnz8pjw,>killing it\n\n...at golf.,http://thehill.com/blogs/blog-briefing-room/ne...,-0.375,24
4457,aljcyq,efgilg0,Why the hell wouldn't it,https://i.redd.it/cdf8v3lbgnd21.jpg/efgilg0/,-0.104,24
4412,aaahgo,ecr5nws,"Welp, my heart just broke.",https://i.redd.it/wvy18k6ex0721.png/ecr5nws/,-0.479,26


In [13]:
# take some random texts of different lengths
texts = data.loc[[182, 332, 3804], 'txt'].values
print(*texts, sep='\n')

Sure, if it's private, why not?
I never said I hated anyone.
Take them to our leader.


In [14]:
# Extract tokens
tokens = [tokenizer.tokenize(txt) for txt in texts]
# obtain ids
ids = [tokenizer.convert_tokens_to_ids(token) for token in tokens]
ids

[[6542, 117, 1191, 1122, 112, 188, 2029, 117, 1725, 1136, 136],
 [146, 1309, 1163, 146, 5687, 2256, 119],
 [5055, 1172, 1106, 1412, 2301, 119]]

The id lists have different lengths, so they cannot be converted into a single tensor. Let's check it!

In [15]:
# In TensorFlow
try:
    tf.constant(ids)
except ValueError as e:
    print('ValueError:', e)

ValueError: Can't convert non-rectangular Python sequence to Tensor.


In [16]:
# In PyTorch
try:
    torch.tensor(ids)
except ValueError as e:
    print('ValueError:', e)

ValueError: expected sequence of length 11 at dim 1 (got 7)


Let's pad manually so that all id lists have equal length.

In [17]:
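# length of the longest sequence in the batch
max_len = max(len(i) for i in ids)
print(max_len)
# right-pad shorter sequences with the tokenizer's padding id (0 for BERT)
pad_id = tokenizer.pad_token_id
rectangular_ids = [i + [pad_id]*(max_len-len(i)) for i in ids]
np.array(rectangular_ids)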

11


array([[6542,  117, 1191, 1122,  112,  188, 2029,  117, 1725, 1136,  136],
       [ 146, 1309, 1163,  146, 5687, 2256,  119,    0,    0,    0,    0],
       [5055, 1172, 1106, 1412, 2301,  119,    0,    0,    0,    0,    0]])

We have successfully padded the short sentences. Now they are ready to be converted into tensors and used for predictions.

In [18]:
# In TensorFlow
# We can have three different tensors for three sentences
tf_individual_tensors = [tf.constant([i]) for i in ids]
# We can have one tensor containing three sentences
tf_batched_tensors = tf.constant(rectangular_ids)

In [19]:
# In PyTorch
# We can have three different tensors for three sentences
pt_individual_tensors = [torch.tensor([i]) for i in ids]
# We can have one tensor containing three sentences
pt_batched_tensors = torch.tensor(rectangular_ids)

# Issue #3 Attention Masking

We have altered some inputs to match the length of the longest input. Will this affect the predictions?

In [20]:
# In TensorFlow
print("predict individual tensors")
for tensor in tf_individual_tensors:
    print(tf_model(tensor).logits.numpy())
    
print("predict batched tensors")
print(tf_model(tf_batched_tensors).logits.numpy())

predict individual tensors
[[-0.533927    0.02726544]]
[[-0.3758095  0.1972964]]
[[-0.34865278  0.14514713]]
predict batched tensors
[[-0.5339269   0.02726543]
 [-0.46882713  0.02681333]
 [-0.4703202   0.04923887]]


In [21]:
# In PyTorch
print("predict individual tensors")
for tensor in pt_individual_tensors:
    print(pt_model(tensor).logits.detach().numpy())
    
print("predict batched tensors")
print(pt_model(pt_batched_tensors).logits.detach().numpy())

predict individual tensors
[[-0.36821926 -0.41824853]]
[[-0.55537057  0.17383558]]
[[-0.5711603   0.28696358]]
predict batched tensors
[[-0.36821926 -0.41824862]
 [-0.3734094  -0.30911833]
 [-0.49477753  0.0513556 ]]


That's not good! The first sentence gets the same prediction whether it is passed individually or as part of the batch, but the next two sentences get different logits. The difference is a direct result of padding: the model treats the padding ids as real tokens.

We know that we padded the shorter sentences with zeros, but how can the model know this and ignore the padded positions?

Attention masking is the solution. An attention mask marks real tokens with 1 and padded positions with 0, so that the model attends only to the real tokens.

In [22]:
mask = np.zeros_like(rectangular_ids)
for i in range(len(ids)):
    # add 1 where there are elements in ids, leaving other positions with 0
    mask[i][:len(ids[i])] = 1
mask

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
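
The same mask can also be built without an explicit loop. The following is a minimal sketch that assumes only the token counts of the three example sentences (11, 7, and 6) and relies on NumPy broadcasting; the variable names are illustrative.

```python
import numpy as np

# Token counts of the three example sentences before padding
lengths = np.array([11, 7, 6])
max_len = lengths.max()

# Position indices 0..max_len-1, compared against each sentence's length
positions = np.arange(max_len)
mask_vectorized = (positions[None, :] < lengths[:, None]).astype(int)
print(mask_vectorized)
```

Each row holds 1 at every position smaller than that sentence's length and 0 elsewhere, which reproduces the loop-based mask.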

We have our attention mask now. Let's try making predictions again.

In [23]:
# In TensorFlow
print("predict individual tensors")
for tensor in tf_individual_tensors:
    print(tf_model(tensor).logits.numpy())
    
print("predict batched tensors with masks")
print(tf_model(tf_batched_tensors, 
               attention_mask=tf.constant(mask)).logits.numpy())

predict individual tensors
[[-0.533927    0.02726544]]
[[-0.3758095  0.1972964]]
[[-0.34865278  0.14514713]]
predict batched tensors with masks
[[-0.5339269   0.02726543]
 [-0.37580946  0.19729649]
 [-0.34865278  0.14514753]]


In [24]:
# In PyTorch
print("predict individual tensors")
for tensor in pt_individual_tensors:
    print(pt_model(tensor).logits.detach().numpy())
    
print("predict batched tensors with mask")
print(pt_model(pt_batched_tensors, 
               attention_mask=torch.tensor(mask)).logits.detach().numpy())

predict individual tensors
[[-0.36821926 -0.41824853]]
[[-0.55537057  0.17383558]]
[[-0.5711603   0.28696358]]
predict batched tensors with mask
[[-0.36821926 -0.41824862]
 [-0.5553706   0.17383553]
 [-0.5711604   0.28696287]]


That's fantastic. With the attention mask, the predictions match (up to floating-point rounding) whether we pass individual ids or batched ids!
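
Why does the mask help? Conceptually, attention layers push the scores of masked positions to a very large negative value before the softmax, so those positions receive practically zero weight. The sketch below illustrates the idea with made-up scores for a single query. It is a simplified illustration of the mechanism under that assumption, not the library's actual implementation, and `masked_softmax` is a hypothetical helper.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis that ignores positions where mask == 0."""
    # Replace scores at padded positions with a large negative number
    masked_scores = np.where(mask == 1, scores, -1e9)
    # Subtract the row maximum for numerical stability
    shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

# Made-up attention scores for one query over 5 key positions
scores = np.array([[2.0, 1.0, 0.5, 3.0, 3.0]])
# The last two positions are padding
mask = np.array([[1, 1, 1, 0, 0]])

weights = masked_softmax(scores, mask)
print(weights.round(3))
```

Even though the padded positions carry the highest raw scores here, they end up with zero weight, and the remaining weights equal the softmax over the three real positions alone. This is why the padded batch reproduces the unpadded predictions.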

# Issue #4 Truncating

Is there a maximum permissible input length for a model? 
Yes. If the input exceeds that limit, the model cannot process it. 

In [25]:
# join several texts to form one long text
text = data.loc[1083:1088,'txt'].to_list()
text = ' '.join(text)
text

"I think that’s on them though. I stopped drinking for like 2 years or so just from health issues and didn’t want to risk anything happening. My friends were pretty supportive and I still had a lot of fun going out with them. I drink every now and then in moderation but i don’t get judged if I’m not drinking. I do agree that it can help and it’s really fun to get a few beers with friends after a long week, but it’s not always necessary to drink if you go out. I ain’t trying to change your mind because I’ll say that thank god someone actually thinks like this. I am currently in high school and I live in a city where many people above the age of 13 or 14 drink every week and some people even every day. They also take drugs like ecstasy and shit, along with the occasional marijuana cigarette. Can’t tell you how many times people are trying to get me to go to parties where they do this shit. They go with the “ it’s not that bad!” “ it’s just fun!”. Can’t wait till I never see those people 

In [26]:
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

Token indices sequence length is longer than the specified maximum sequence length for this model (598 > 512). Running this sequence through the model will result in indexing errors


Great. We get the warning from the tokenizer beforehand!

In [27]:
# In TensorFlow
try:
    print(tf_model(tf.constant([ids])).logits.numpy())
except Exception as e:
    print(type(e))
    print('Error:', e) 

<class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>
Error: indices[0,570] = 570 is not in [0, 512) [Op:ResourceGather]


In [28]:
# In PyTorch
try:
    print(pt_model(torch.tensor([ids])).logits.detach().numpy())
except Exception as e:
    print(type(e))
    print('Error:', e) 

<class 'RuntimeError'>
Error: The size of tensor a (598) must match the size of tensor b (512) at non-singleton dimension 1


Therefore, we must truncate the input ids that exceed the limit.

In [29]:
print('Length of input ids before truncation:', len(ids))
# This BERT model accepts sequence ids of length up to 512
ids = ids[:512]
print('Length of input ids after truncation:', len(ids))

Length of input ids before truncation: 598
Length of input ids after truncation: 512
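
The slicing step can be wrapped in a small helper that also reports how much was cut off. This is a minimal sketch: `truncate_ids` is a hypothetical name, the default of 512 is taken from the tokenizer warning above, and the list of dummy ids merely stands in for the 598 real ids.

```python
def truncate_ids(ids, max_len=512):
    """Keep at most max_len ids and report how many were dropped."""
    kept = ids[:max_len]
    dropped = len(ids) - len(kept)
    return kept, dropped

# Stand-in for the 598 ids produced above
dummy_ids = list(range(598))
kept, dropped = truncate_ids(dummy_ids)
print(len(kept), dropped)
```

Knowing the number of dropped ids is useful because truncation silently discards the end of the text, which may or may not matter for the task at hand.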


So far, we have dealt with the issues that arise after tokenization and the remedy for each one. 
HuggingFace's tokenizers can handle all of these issues in a single call with a couple of arguments.

In [30]:
# extract the first 64 texts
texts = data.txt.to_list()[:64]
print('Number of Examples: ', len(texts))
# tensorflow tensors
tf_inputs = tokenizer(texts, padding='longest', truncation=True, return_tensors='tf')
# pytorch tensors
pt_inputs = tokenizer(texts, padding='longest', truncation=True, return_tensors='pt')

print(tf_inputs.keys())
print(pt_inputs.keys())

Number of Examples:  64
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


Calling the tokenizer directly performs all of these steps, including building the attention masks, without any hassle. Let's run the models on this data.

In [31]:
# In TensorFlow
results = tf_model(**tf_inputs).logits.numpy()
results.shape

(64, 2)

In [32]:
# In PyTorch
results = pt_model(**pt_inputs).logits.detach().numpy()
results.shape

(64, 2)
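
To connect the one-line tokenizer call back to the manual steps, here is a rough sketch that chains truncation, padding, and masking in plain NumPy. The helper `manual_batch_encode` and its parameters are assumptions made for illustration; it is not how the library is implemented. The real call also adds special tokens and returns `token_type_ids`, both of which this sketch omits.

```python
import numpy as np

def manual_batch_encode(batch_ids, max_len=512, pad_id=0):
    """Truncate, pad, and mask a list of id lists (hypothetical helper)."""
    # 1. Truncate every sequence to the model limit
    truncated = [seq[:max_len] for seq in batch_ids]
    # 2. Allocate a rectangular array filled with the padding id
    longest = max(len(seq) for seq in truncated)
    input_ids = np.full((len(truncated), longest), pad_id, dtype=np.int64)
    # 3. Copy the real ids in and mark their positions with 1 in the mask
    attention_mask = np.zeros_like(input_ids)
    for row, seq in enumerate(truncated):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The three short sentences encoded earlier
example_ids = [
    [6542, 117, 1191, 1122, 112, 188, 2029, 117, 1725, 1136, 136],
    [146, 1309, 1163, 146, 5687, 2256, 119],
    [5055, 1172, 1106, 1412, 2301, 119],
]
encoded = manual_batch_encode(example_ids)
print(encoded["input_ids"].shape)
print(encoded["attention_mask"].sum(axis=1))
```

The returned dictionary uses the same key names as the tokenizer output, so the row sums of `attention_mask` give back the original sequence lengths.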

### That's the end. We now have a good understanding of HF's Tokenizers

Key reference: [HuggingFace's NLP Course](https://huggingface.co/course)

### Thank you for your valuable time!