## 💻 UnpackAI DL201 Bootcamp - Week 1 - Skills: NLP

### 📕 Learning Objectives

* Solidify the basic notion of NLP and how it can be applied to a variety of tasks.
* Practice using Pandas for loading and processing text data.
* Illustrate the process of converting a text document into a dataframe and from there into a tensor.

### A basic NLP Overview

From Wikipedia:
- "Natural language processing (NLP) is a subfield of **linguistics, computer science, and artificial intelligence** concerned with the interactions between computers and **human language**, in particular how to program computers to **process and analyze** large amounts of natural language data. The goal is a computer capable of **"understanding"** the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

- Approaches to NLP tasks:
    - Rule-based
    - Traditional machine learning
    - Deep learning

In NLP, we often need to perform text preprocessing, such as removing stop words, stemming, lemmatization, and tokenization.
A nice overview is presented in: 
- https://stanfordnlp.github.io/CoreNLP/ 
- https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP
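
To make the preprocessing terms above concrete, here is a minimal sketch in plain Python. The stop-word list and the suffix-stripping rule are hypothetical simplifications chosen for illustration; they are not what CoreNLP or any other library actually implements. Lemmatization is left out because it needs a dictionary of word forms.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# real projects usually rely on a list provided by an NLP library.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "he"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The genie was carrying plates and bowls to Aladdin."
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]

print(tokens)
print(filtered)
print(stems)
```

Each step shrinks or normalizes the token list, which is the general idea behind preprocessing: reduce surface variation before the text reaches a model.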

Common NLP tasks:
- Classification
- Mask filling
- Text prediction
- Sentiment analysis
    - Positive
    - Negative
    - Subjectivity
- Entity recognition
    - Person
    - Location
    - Organization
- Entity extraction
- Keyword extraction
- Topic extraction

### Illustrative example

The code example below illustrates how to use Pandas for text manipulation, followed by a few exploratory steps that turn the text data into tensors.

In [1]:
# Install packages
! pip install transformers openpyxl

# Import libraries
import os
import numpy as np
import pandas as pd
import torch
from transformers import BertTokenizer

It is important to set your data folder path correctly as a variable, depending on where you run this notebook.

In [2]:
# Set the data directory path as a variable

# Kaggle: clone the repository and point DATA_DIR at its data folder
# (comment these lines out when running locally)
!git clone https://github.com/unpackAI/DL201.git
from pathlib import Path
DATA_DIR = Path('/kaggle/working/DL201/data/nlp')

# Uncomment this for local
# os.chdir('../data/nlp')
# DATA_DIR = os.getcwd()

print(f'data directory is {DATA_DIR}')

data directory is d:\GitHub\DL201\data\nlp


Let's load a sample text file and prepare it as input for a BERT model. The `data/nlp` folder of the repository contains a txt file with sentences from a book, taken from http://www.textfiles.com/stories/. Feel free to download a different book and use it when exploring this notebook.

In [3]:
# Load a sample text, from the data folder
os.chdir(DATA_DIR)
with open('alad10.txt') as f:
    sample_text = f.read()

# Split the text into lines; each line is treated as a "sentence" here
sentences = sample_text.split('\n')

# Load the sentences into a dataframe
df = pd.DataFrame(sentences, columns=['sentence'])

As has been reiterated before, loading the data into Pandas gives us tremendous flexibility to perform data cleaning and preprocessing with ease.

In [4]:
# Inspect some of the sentences
df.sample(15)

Unnamed: 0,sentence
25,merchandise. Next day he bought Aladdin a fin...
56,"out the oil it contains, and bring it me."" He..."
346,still wore. The genie he had seen in the cave...
78,Immediately an enormous and frightful genie ro...
20,"be surprised at not having seen him before, as..."
73,lamp and kill him afterwards.
343,"what had become of his palace, but they only l..."
222,"and lastly, ten thousand pieces of gold in ten..."
302,offering to exchange fine new lamps for old on...
464,"Aladdin went back to the Princess, saying his ..."


In [5]:
# Keep only the sentences that have more than 3 words
df = df[df['sentence'].str.split().str.len() > 3]

In [6]:
# Remove punctuation from all sentences (regex=True so the pattern is treated as a regular expression)
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)

# Note: instead of regex a list of punctuation can be used, give it a try!
punctuation = [
    '.', ',', '!', '?', ':', ';', '"', "'", '-', '_', '(', ')', '[', ']', '{', '}', '#', '@', '$', '%', '^', '&', '*',
     '+', '=', '<', '>', '/', '\\', '|', '~', '`', '“', '”', '‘', '’'
]

df.sample(10)


Unnamed: 0,sentence
178,sighed deeply and at last told her mother how ...
324,ordered the executioner to cut off his head T...
162,son and the Princess Take this newmarried man...
259,There is only one thing that surprises me Was...
256,Next day Aladdin invited the Sultan to see the...
32,between them Then they journeyed onwards till...
323,that he came to no harm He was carried before...
207,I would do a great deal more than that for the...
83,he went home but fainted on the threshold Whe...
470,but a wicked magician and told her of how she had
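
The note in the cell above suggests trying a list of punctuation instead of a regex. One possible way to do that, sketched here in plain Python with a shortened version of the list, is a translation table built with `str.maketrans`. The helper name `strip_punctuation` is made up for this example.

```python
# Shortened version of the punctuation list from the cell above
punctuation = ['.', ',', '!', '?', ':', ';', '"', "'", '-', '(', ')']

# With three arguments, str.maketrans maps every character in the
# third argument to None, which deletes it during translation
table = str.maketrans('', '', ''.join(punctuation))

def strip_punctuation(text):
    """Return the text with all listed punctuation characters removed."""
    return text.translate(table)

cleaned = strip_punctuation('"Take this new-married man," said Aladdin.')
print(cleaned)
```

On the dataframe, the same helper could be applied with `df['sentence'].apply(strip_punctuation)`. Unlike the regex, this approach only removes the characters that are explicitly listed.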


In [7]:
# Convert all sentences to lowercase
df['sentence'] = df['sentence'].str.lower()
df.sample(10)

Unnamed: 0,sentence
411,daughter happened too look up and rubbed his e...
379,not but he will use violence aladdin comforte...
243,the sultan sent musicians with trumpets and cy...
449,hanging from the dome if that is all replied ...
102,genie had brought aladdin sold one of the silv...
315,next morning the sultan looked out of the wind...
94,bowl twelve silver plates containing rich meat...
227,and led him into a hall where a feast was spre...
439,him what he thought of it it is truly beautif...
124,notice of her she went every day for a week a...
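
The three cleaning steps so far (filtering short lines, removing punctuation, lowercasing) can be checked on a small hand-made dataframe. This is a sketch on toy data, not on the book itself; the name `toy` and its sentences are invented for the example.

```python
import pandas as pd

# Toy data standing in for the lines of the book
toy = pd.DataFrame({'sentence': [
    'Aladdin went back to the Princess, saying his head ached.',
    'Yes!',
    '"Bring it me," he said.',
]})

# Keep rows with more than 3 whitespace-separated words
toy = toy[toy['sentence'].str.split().str.len() > 3].copy()

# regex=True is passed explicitly: recent pandas versions treat the
# pattern as a literal string by default
toy['sentence'] = toy['sentence'].str.replace(r'[^\w\s]', '', regex=True)

# Lowercase everything, matching the uncased BERT vocabulary used later
toy['sentence'] = toy['sentence'].str.lower()

print(toy['sentence'].tolist())
```

Running the steps on a few known sentences makes it easy to confirm that each transformation does what its comment claims before applying it to the full text.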


**Sentences are a key unit of information in NLP** (as are tokens). To represent our data as a uniform "block" of text, we need to find the length of our longest sentence; the shorter ones will later be padded with padding tokens. Note that the cell below measures length in characters, which in practice is a generous upper bound on the number of tokens per sentence.

In [8]:
# Get the length (in characters) of the longest sentence
max_len = df['sentence'].str.len().max()
print(f'max sentence length is {max_len}')

max sentence length is 75
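
The value above counts characters. As a rough illustration of how much smaller a word-level count is, the sketch below compares both measures on two sentences from the book. Whitespace-separated words are only a proxy for BERT tokens (an assumption made for this example), since the tokenizer may split a word into several pieces.

```python
sentences = [
    "after this aladdin and his wife lived in peace",
    "lamp and kill him afterwards",
]

# Length in characters, as computed with .str.len() in the notebook
char_lengths = [len(s) for s in sentences]

# Length in whitespace-separated words, a rough proxy for the token count
word_lengths = [len(s.split()) for s in sentences]

print(f"max length in characters: {max(char_lengths)}")
print(f"max length in words: {max(word_lengths)}")
```

Padding to the character-based maximum works, but it produces longer rows than strictly needed; measuring the tokenized sentences instead would give a tighter bound.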


The transformers library provides a convenient way to load a variety of BERT models. Let's first load and explore a tokenizer.

In [9]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [10]:
# Get the tokenizer vocabulary words
vocab = bert_tokenizer.vocab
vocab_size = len(vocab)
print(f'vocab size is: {vocab_size}')

vocab size is: 30522


In [18]:
# Get the vocabulary words as a list, load them into a dataframe
vocab_list = list(vocab.keys())
vocab_df = pd.DataFrame(vocab_list, columns=['tokens'])
vocab_df.sample(15)

Unnamed: 0,tokens
28797,subtly
8080,monitor
16606,dukes
2064,can
3346,stadium
16281,##iling
15484,dispersed
22312,blitz
1964,金
24292,##rith


In [19]:
# Get the count of tokens that contain 'unused'
unused_tokens = vocab_df[vocab_df['tokens'].str.find('unused')>=0]
print(f'There are {len(unused_tokens)} tokens that contain "unused"')
unused_tokens.sample(10)

There are 995 tokens that contain "unused"


Unnamed: 0,tokens
388,[unused383]
90,[unused89]
269,[unused264]
337,[unused332]
458,[unused453]
873,[unused868]
255,[unused250]
164,[unused159]
120,[unused115]
517,[unused512]


In [20]:
# Get the tokens that have a size of 1 character
one_char_tokens = vocab_df[vocab_df['tokens'].str.len()==1]
print(f'There are {len(one_char_tokens)} tokens that have a size of 1 character')
one_char_tokens.sample(10)

There are 997 tokens that have a size of 1 character


Unnamed: 0,tokens
1048,l
1832,岡
1225,ի
1408,ා
1425,།
1019,5
1218,ӏ
1925,神
1032,\
1761,仮


In [21]:
# Get the tokens that are longer than 2 characters and do not contain the word 'unused'
word_like_tokens = vocab_df[(vocab_df['tokens'].str.len()>2) & (vocab_df['tokens'].str.find('unused')<0)]
print(f'There are {len(word_like_tokens)} tokens that likely represent English words')
word_like_tokens.sample(10)

There are 28042 tokens that likely represent English words


Unnamed: 0,tokens
8790,dynamic
18415,truce
4269,begins
12197,watches
25778,##col
5984,puerto
22495,distributors
8872,cop
25743,helix
8025,curious
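
The filters above can also be combined into a single labelling function. The sketch below groups a handful of tokens taken from the samples shown earlier; the category names and the function `token_kind` are made up for this illustration.

```python
from collections import Counter

def token_kind(token):
    """Assign a token to a coarse category based on its spelling."""
    if token.startswith("[unused"):
        return "unused"
    if len(token) > 2 and token.startswith("[") and token.endswith("]"):
        return "special"      # e.g. [CLS], [SEP], [PAD]
    if token.startswith("##"):
        return "subword"      # continuation piece of a longer word
    if len(token) == 1:
        return "single_char"
    return "word"

toy_tokens = ["[PAD]", "[unused89]", "[CLS]", "l", "5",
              "##iling", "monitor", "dukes", "##col"]
counts = Counter(token_kind(t) for t in toy_tokens)
print(counts)
```

On the full vocabulary, the same function could be used as `vocab_df['tokens'].apply(token_kind).value_counts()` to get all category sizes at once.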


Each sentence is currently represented as a string of characters. We need to transform it into a list of tokens, which are then converted into numbers using the tokenizer's vocabulary as an index. Here is an example with a phrase:

In [22]:
# Example of tokenizing a sentence
sample_sentence = "This is a sample sentence, which we will tokenize using the BERT tokenizer."
print(f'The sample sentence is:\n{sample_sentence}')

tokenized_sentence = bert_tokenizer.tokenize(sample_sentence)
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

The sample sentence is:
This is a sample sentence, which we will tokenize using the BERT tokenizer.

The tokenized sentence is:
['this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.']

The numericalized sentence is:
[2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012]
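
The `##` pieces in the output above come from splitting words that are not in the vocabulary as a whole. The following is a simplified sketch of a greedy longest-match-first split, written in plain Python with a hypothetical five-entry vocabulary. It is meant to convey the idea only and is not the code used inside the `transformers` library.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shorten it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping pieces to IDs
toy_vocab = {"[UNK]": 0, "sample": 1, "token": 2, "##ize": 3, "##izer": 4}

for word in ["sample", "tokenize", "tokenizer", "genie"]:
    pieces = wordpiece(word, toy_vocab)
    ids = [toy_vocab[p] for p in pieces]
    print(word, pieces, ids)
```

With this toy vocabulary, `tokenize` and `tokenizer` share the piece `token` and differ only in their continuation piece, which mirrors the real output above.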


We should now do the same for the sentences in the dataframe. Before proceeding, it is a good idea to create a copy of what we have so far, so that we can revert to the original dataframe if needed.

In [23]:
# Create a copy of the sentences dataframe
tokens_df = df.copy()

In [24]:
# Tokenize each sentence in the dataframe
tokens_df['tokenized_sentence'] = tokens_df['sentence'].apply(bert_tokenizer.tokenize)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence
128,that i may find out what she wants next day a...,"[that, i, may, find, out, what, she, wants, ne..."
435,prosperity when he had done the princess made...,"[prosperity, when, he, had, done, the, princes..."
113,with her at first sight he went home so chang...,"[with, her, at, first, sight, he, went, home, ..."
412,stood the palace as before he hastened thithe...,"[stood, the, palace, as, before, he, haste, ##..."
347,asked his will save my life genie said aladdi...,"[asked, his, will, save, my, life, genie, said..."
137,hand of the princess now i pray you to forgiv...,"[hand, of, the, princess, now, i, pray, you, t..."
127,every day carrying something in a napkin call...,"[every, day, carrying, something, in, a, napki..."
322,the people however who loved him followed arme...,"[the, people, however, who, loved, him, follow..."
194,before and the sultan who had forgotten aladdi...,"[before, and, the, sultan, who, had, forgotten..."
104,another set of plates and thus they lived many...,"[another, set, of, plates, and, thus, they, li..."


In [25]:
# Add the numericalized sentences to the dataframe
tokens_df['numericalized_sentence'] = tokens_df['tokenized_sentence'].apply(bert_tokenizer.convert_tokens_to_ids)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence,numericalized_sentence
306,its value laughingly bade the slave take it an...,"[its, value, laughing, ##ly, bad, ##e, the, sl...","[2049, 3643, 5870, 2135, 2919, 2063, 1996, 665..."
200,must remember his promises and i will remember...,"[must, remember, his, promises, and, i, will, ...","[2442, 3342, 2010, 10659, 1998, 1045, 2097, 33..."
473,after this aladdin and his wife lived in peace,"[after, this, ala, ##ddin, and, his, wife, liv...","[2044, 2023, 21862, 18277, 1998, 2010, 2564, 2..."
258,rubies diamonds and emeralds he cried it is a ...,"[rub, ##ies, diamonds, and, emerald, ##s, he, ...","[14548, 3111, 11719, 1998, 14110, 2015, 2002, ..."
49,told saying the names of his father and grandf...,"[told, saying, the, names, of, his, father, an...","[2409, 3038, 1996, 3415, 1997, 2010, 2269, 199..."
457,deserve to be burnt to ashes but that this req...,"[deserve, to, be, burnt, to, ashes, but, that,...","[10107, 2000, 2022, 11060, 2000, 11289, 2021, ..."
403,while the magician drained his to the dregs an...,"[while, the, magician, drained, his, to, the, ...","[2096, 1996, 16669, 11055, 2010, 2000, 1996, 2..."
114,was frightened he told her he loved the princ...,"[was, frightened, he, told, her, he, loved, th...","[2001, 10363, 2002, 2409, 2014, 2002, 3866, 19..."
15,and told his mother of his newly found uncle ...,"[and, told, his, mother, of, his, newly, found...","[1998, 2409, 2010, 2388, 1997, 2010, 4397, 217..."
33,the mountains aladdin was so tired that he be...,"[the, mountains, ala, ##ddin, was, so, tired, ...","[1996, 4020, 21862, 18277, 2001, 2061, 5458, 2..."


Sequences fed to a BERT model must include the special tokens `[CLS]` and `[SEP]`, which mark the start and the end of the input sequence. Let's add these tokens to the sample phrase. Another special token is `[PAD]`, which is used to pad shorter sequences. Note that the square brackets are part of the token strings: without them, the tokenizer cannot find the token in its vocabulary and maps it to the unknown-token ID (100).

In [26]:
tokenized_sentence = ['[CLS]'] + tokenized_sentence + ['[SEP]']
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

# Print the IDs for the special tokens for the BERT model
print(f'- The token ID for the special token [CLS] is: {bert_tokenizer.cls_token_id}')
print(f'- The token ID for the special token [SEP] is: {bert_tokenizer.sep_token_id}')
print(f'- The token ID for the special token [PAD] is: {bert_tokenizer.pad_token_id}')


The tokenized sentence is:
['[CLS]', 'this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.', '[SEP]']

The numericalized sentence is:
[101, 2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012, 102]
- The token ID for the special token [CLS] is: 101
- The token ID for the special token [SEP] is: 102
- The token ID for the special token [PAD] is: 0


As the example indicates, we need to add the `[CLS]` and `[SEP]` tokens to each numericalized sentence in the text dataframe.

In [27]:
# Add the [CLS] and [SEP] token IDs to the numericalized sentences in the dataframe
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: [bert_tokenizer.cls_token_id] + x + [bert_tokenizer.sep_token_id])
tokens_df['numericalized_sentence'].sample(10)

327    [101, 3140, 2037, 2126, 2046, 1996, 10119, 199...
229    [101, 3038, 1045, 2442, 3857, 1037, 4186, 4906...
38     [101, 8587, 2039, 12668, 2096, 1045, 2785, 257...
301    [101, 21658, 3880, 1996, 6658, 2040, 2064, 239...
127    [101, 2296, 2154, 4755, 2242, 1999, 1037, 2061...
386    [101, 2010, 2406, 2002, 2097, 2175, 2005, 2070...
195    [101, 4622, 2032, 1998, 2741, 2005, 2014, 2006...
438    [101, 3571, 1997, 5456, 1996, 4615, 3662, 2032...
76     [101, 2012, 2197, 2002, 16763, 2010, 2398, 199...
103    [101, 2127, 3904, 2020, 2187, 2002, 2059, 2018...
Name: numericalized_sentence, dtype: object

In [28]:
# Pad the numericalized sentences in the dataframe with the [PAD] token ID (0) up to max_len
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: x + [bert_tokenizer.pad_token_id] * (max_len - len(x)))

In [29]:
# Add a new column that indicates the length of the numericalized sentences
tokens_df['numericalized_sentence_length'] = tokens_df['numericalized_sentence'].apply(len)
tokens_df['numericalized_sentence_length'].sample(10)

85     75
80     75
311    75
181    75
290    75
228    75
242    75
34     75
279    75
263    75
Name: numericalized_sentence_length, dtype: int64
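
The two lambda steps above (adding special tokens, then padding) can be combined into one small helper, sketched here in plain Python. The function name `encode_for_bert`, the truncation of over-long inputs, and the attention mask are additions made for this illustration and do not appear in the notebook. The three IDs are the ones printed by the tokenizer earlier.

```python
# Special token IDs as reported by the tokenizer above
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_for_bert(token_ids, max_len):
    """Wrap token IDs in [CLS]/[SEP], then pad to max_len.

    Returns the padded IDs and a mask with 1 for real tokens, 0 for padding.
    Inputs that are too long are truncated to leave room for the special tokens.
    """
    ids = [CLS_ID] + token_ids[: max_len - 2] + [SEP_ID]
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [PAD_ID] * n_pad, attention_mask

ids, mask = encode_for_bert([2044, 2023, 21862, 18277], max_len=8)
print(ids)
print(mask)
```

The mask records which positions hold real tokens, so that padding can later be told apart from content without relying on the value 0 alone.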

In [30]:
# Extract the numericalized sentences from the dataframe
numericalized_sentences = tokens_df['numericalized_sentence'].values
numericalized_sentences.shape

(447,)

In [31]:
# Convert each row of the numericalized sentences to a list
numericalized_sentences = [list(x) for x in numericalized_sentences]

In [32]:
# Convert the list into a 2D NumPy array
numericalized_sentences = np.array(numericalized_sentences)
print(f'The shape of the numericalized sentences is: {numericalized_sentences.shape}')
print(numericalized_sentences)

The shape of the numericalized sentences is: (447, 75)
[[  101 21862 18277 ...     0     0     0]
 [  101  2045  2320 ...     0     0     0]
 [  101  1037 23358 ...     0     0     0]
 ...
 [  101  2044  2023 ...     0     0     0]
 [  101  2002  4594 ...     0     0     0]
 [  101  2005  2116 ...     0     0     0]]


In [33]:
# Convert the NumPy array into a Tensor
numericalized_sentences = torch.from_numpy(numericalized_sentences)
print(numericalized_sentences)
print(f'the shape of the numericalized tensor is: {numericalized_sentences.shape}')

tensor([[  101, 21862, 18277,  ...,     0,     0,     0],
        [  101,  2045,  2320,  ...,     0,     0,     0],
        [  101,  1037, 23358,  ...,     0,     0,     0],
        ...,
        [  101,  2044,  2023,  ...,     0,     0,     0],
        [  101,  2002,  4594,  ...,     0,     0,     0],
        [  101,  2005,  2116,  ...,     0,     0,     0]], dtype=torch.int32)
the shape of the numericalized tensor is: torch.Size([447, 75])
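
The conversion to a 2D array only works because every row was padded to the same length. The sketch below shows this on two short, already padded rows; the example IDs are taken from the outputs above, and the explicit `dtype` is a choice made for this illustration so that the result does not depend on the platform's default integer size.

```python
import numpy as np

# Two rows of equal length: [CLS] ... [SEP] followed by padding where needed
padded = [
    [101, 2044, 102, 0],
    [101, 2002, 4594, 102],
]

# Equal-length rows stack into a rectangular (rows, columns) array
batch = np.array(padded, dtype=np.int64)
print(batch.shape)

# Count the non-padding positions in each row
non_pad_counts = (batch != 0).sum(axis=1)
print(non_pad_counts)
```

An array built this way can be passed to `torch.from_numpy` exactly as in the cell above, and the resulting tensor keeps the array's integer type.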


### Discuss the following:
* What was the pipeline of this exercise?
* Summarize the data cleaning and preprocessing steps.
* What is the difference between a token and a sentence?
* Why did we convert the tokens to numbers?
* Why did we add the special tokens?
* What advantages did Pandas offer for text manipulation?
* Would this approach be suitable for complex datasets?

### Exercise:

* Repeat this pipeline with 3 different books that appear very different in nature (don't add the special tokens).
* When you obtain the numericalized sentences, convert them into a long 1D NumPy array.
* Plot the distribution of the numericalized tokens for each book using histograms.
* Share your experience during the next lesson.