## 💻 UnpackAI DL201 Bootcamp - Week 1 - Skills: NLP

### 📕 Learning Objectives

* Solidify the basic notion of NLP and how it can be applied to a variety of tasks.
* Practice using Pandas for loading and processing text data.
* Illustrate the process of converting a text document into a dataframe and from there into a tensor.

### A basic NLP Overview

From Wikipedia:
- "Natural language processing (NLP) is a subfield of **linguistics, computer science, and artificial intelligence** concerned with the interactions between computers and **human language**, in particular how to program computers to **process and analyze** large amounts of natural language data. The goal is a computer capable of **"understanding"** the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

- Approaches to NLP tasks:
    - Rule-based
    - Traditional machine learning
    - Deep learning

In NLP, we often need to perform text preprocessing, such as removing stop words, stemming, lemmatization, and tokenization.
A nice overview is presented in: 
- https://stanfordnlp.github.io/CoreNLP/ 
- https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP
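
To make these preprocessing terms concrete, here is a minimal sketch in plain Python. The stop-word list and the suffix-stripping rule are hypothetical and deliberately tiny; they only illustrate the idea and are not a substitute for a real stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real projects use a curated one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "The genie appeared and asked what Aladdin wanted."
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]
print(tokens)
print(filtered)
print(stems)
```

Lemmatization differs from stemming in that it maps a word to its dictionary form and usually needs a vocabulary or a trained model, so it is left out of this sketch.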

Common NLP tasks:
- Classification
- Mask filling
- Text prediction
- Sentiment analysis
    - Positive
    - Negative
    - Subjectivity
- Entity recognition
    - Person
    - Location
    - Organization
- Entity extraction
- Keyword extraction
- Topic extraction
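
As a small illustration of the rule-based approach applied to one of these tasks (sentiment analysis), the sketch below counts hits against two word lists. The lexicons are hypothetical and chosen only for this example; a practical system would need far larger lists or a trained model.

```python
# Hypothetical mini-lexicons, chosen only for illustration.
POSITIVE = {"good", "great", "happy", "overjoyed", "beautiful"}
NEGATIVE = {"bad", "wicked", "sad", "fearful", "cunning"}

def rule_based_sentiment(text):
    """Label text by counting lexicon hits; ties are reported as 'neutral'."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(rule_based_sentiment("his mother was overjoyed"))
print(rule_based_sentiment("a wicked and cunning magician"))
print(rule_based_sentiment("he rubbed the lamp"))
```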

### Illustrative example

The code example below illustrates how to use Pandas for text manipulation, followed by a few exploratory steps that create tensors representing the text data.

In [1]:
# Install packages
! pip install transformers openpyxl

# Import libraries
import numpy as np
import pandas as pd
import torch
import requests
from transformers import BertTokenizer

Let's load a sample book, conveniently available in txt format from the collection at http://www.textfiles.com/stories/. The book in this case is Aladdin.


In [2]:
# Load a sample text, from the provided url
response = requests.get('http://www.textfiles.com/stories/alad10.txt')
sample_text = response.text

# Split the text into lines (each line is treated as a "sentence" in this notebook)
sentences = sample_text.split('\n')

# Load the sentences into a dataframe
df = pd.DataFrame(sentences, columns=['sentence'])
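
Splitting on `'\n'` yields one row per line of the file, which is not necessarily a grammatical sentence. If real sentences are needed, one possible approach is to join the wrapped lines and split on sentence-ending punctuation. The helper `split_into_sentences` and the sample text below are hypothetical, and the rule is naive (abbreviations such as "Mr." would be split incorrectly).

```python
import re

def split_into_sentences(raw_text):
    """Collapse line breaks, then split after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(raw_text.split())          # join wrapped lines into one string
    parts = re.split(r"(?<=[.!?])\s+", flat)   # keep the punctuation with each sentence
    return [p for p in parts if p]

# Hypothetical sample with hard line wraps, similar in shape to a plain-text book
raw = "The lamp was old. It was\r\nhidden in a cave! Who\r\nfound it?"
real_sentences = split_into_sentences(raw)
print(real_sentences)
```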

As has been reiterated before, loading the data into Pandas gives us tremendous flexibility to perform data cleaning and preprocessing with ease.

In [3]:
# Inspect some of the sentences
df.sample(15)

Unnamed: 0,sentence
94,"bowl, twelve silver plates containing rich mea..."
136,desperate deed if I refused to go and ask your...
165,"the Princess. ""Fear nothing,"" Aladdin said to..."
33,the mountains. Aladdin was so tired that he b...
180,passed there. Her mother did not believe her i...
8,"Aladdin did not mend his ways. One day, when ..."
432,"people by her touch of their ailments, whereup..."
330,"way and ordered Aladdin to be unbound, and par..."
73,lamp and kill him afterwards.\r
420,more wicked and more cunning than himself. He...


In [4]:
# Keep only the sentences that have more than 3 words
df = df[df['sentence'].str.split().str.len() > 3]

In [5]:
# Remove punctuation from all sentences
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)

# Note: instead of regex a list of punctuation can be used, give it a try!
punctuation = [
    '.', ',', '!', '?', ':', ';', '"', "'", '-', '_', '(', ')', '[', ']', '{', '}', '#', '@', '$', '%', '^', '&', '*',
     '+', '=', '<', '>', '/', '\\', '|', '~', '`', '“', '”', '‘', '’'
]

df.sample(10)


Unnamed: 0,sentence
117,Aladdin at last prevailed upon her to go befor...
118,carry his request She fetched a napkin and la...
221,Besides this six slaves beautifully dressed to...
48,treasure Aladdin forgot his fears and grasped ...
334,amazed he could not say a word Where is your ...
209,and filled up the small house and garden Alad...
265,spokesman we cannot find jewels enough The Su...
457,deserve to be burnt to ashes but that this req...
27,at nightfall to his mother who was overjoyed t...
192,When the three months were over Aladdin sent h...
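
The note in the cell above suggests using a list of punctuation instead of a regex. One possible way to do that, sketched here with a shortened version of the list, is Python's built-in `str.translate`. The helper name `strip_punctuation` is made up for this example.

```python
# Shortened version of the punctuation list from the cell above
punctuation = ['.', ',', '!', '?', ':', ';', '"', "'", '-', '(', ')']

# With three arguments, str.maketrans maps every character in the third one to None (deletion)
removal_table = str.maketrans('', '', ''.join(punctuation))

def strip_punctuation(text):
    """Delete every character that appears in the punctuation list."""
    return text.translate(removal_table)

result = strip_punctuation('"Fear nothing," Aladdin said.')
print(result)
```

On a dataframe this could be applied with `df['sentence'].map(strip_punctuation)`. Unlike the regex `[^\w\s]`, which removes every character that is not a word character or whitespace, the list only removes the characters it names.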


In [6]:
# Convert all sentences to lowercase
df['sentence'] = df['sentence'].str.lower()
df.sample(10)

Unnamed: 0,sentence
179,the bed had been carried into some strange hou...
116,her father his mother on hearing this burst o...
198,the princess that no man living would come up ...
261,returned aladdin i wished your majesty to hav...
303,hearing this said there is an old one on the c...
189,another such fearful night and wished to be se...
460,whom he murdered he it was who put that wish ...
229,saying i must build a palace fit for her and t...
396,cellar and the princess put the powder aladdin...
134,him of her sons violent love for the princess ...
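
The cleaning steps so far (dropping short rows, removing punctuation, lowercasing) could also be collected into a single reusable function. This is only a sketch on a hypothetical toy dataframe; the function name `clean_sentences`, its parameters, and the extra `.str.strip()` call are assumptions that are not part of the cells above.

```python
import pandas as pd

def clean_sentences(frame, column="sentence", min_words=4):
    """Return a cleaned copy: short rows dropped, punctuation removed, text lowercased."""
    out = frame.copy()
    word_counts = out[column].str.split().str.len()
    out = out[word_counts >= min_words].copy()
    out[column] = (
        out[column]
        .str.replace(r"[^\w\s]", "", regex=True)  # pass regex=True explicitly; pandas 2.0 changed the default to False
        .str.lower()
        .str.strip()                               # also drops trailing '\r' left over from the file
    )
    return out

# Hypothetical toy data
toy = pd.DataFrame({"sentence": ["Hello, World! How are you?", "Too short.", "", "The lamp was VERY old.\r"]})
cleaned = clean_sentences(toy)
print(cleaned)
```

Working on a copy keeps the original dataframe untouched, which makes it easy to rerun the cleaning with different settings.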


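**Sentences, like tokens, are a key unit of information in NLP.** To represent our data as a uniform "block" of text, every sentence must have the same length. We therefore need to find the length of our longest sentence; the shorter ones will later be padded with padding tokens. Note that the cell below measures length in characters, which in this notebook serves as an upper bound on the number of tokens per sentence.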

In [7]:
# Get the length (in characters) of the longest sentence
max_len = df['sentence'].str.len().max()
print(f'max sentence length is {max_len}')

max sentence length is 76
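
Padding itself is a simple operation. The sketch below pads a few hypothetical ID lists to the length of the longest one, measured in tokens rather than characters, which gives the tightest possible block. The helper `pad_sequences` is made up for this example, and `0` is used as the padding ID because it is the `[PAD]` ID of `bert-base-uncased`.

```python
PAD_ID = 0  # [PAD] ID of bert-base-uncased

def pad_sequences(sequences, pad_id=PAD_ID):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    target = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]

# Hypothetical token ID lists of different lengths
toy_ids = [[5, 6, 7], [8], [9, 10]]
padded = pad_sequences(toy_ids)
print(padded)
```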


The transformers library provides a convenient way to load a variety of BERT models. Let's first load and explore a tokenizer.

In [8]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [9]:
# Get the tokenizer vocabulary words
vocab = bert_tokenizer.vocab
vocab_size = len(vocab)
print(f'vocab size is: {vocab_size}')

vocab size is: 30522


In [10]:
# Get the vocabulary words as a list, load them into a dataframe
vocab_list = list(vocab.keys())
vocab_df = pd.DataFrame(vocab_list, columns=['tokens'])
vocab_df.sample(15)

Unnamed: 0,tokens
12204,encyclopedia
2853,sold
3550,##ized
23141,##firmed
13789,cheshire
2730,killed
12930,midst
15755,captains
21536,jing
1182,в


In [11]:
# Get the count of tokens that contain 'unused'
unused_tokens = vocab_df[vocab_df['tokens'].str.find('unused')>=0]
print(f'There are {len(unused_tokens)} tokens that contain "unused"')
unused_tokens.sample(10)

There are 995 tokens that contain "unused"


Unnamed: 0,tokens
667,[unused662]
320,[unused315]
781,[unused776]
459,[unused454]
759,[unused754]
910,[unused905]
554,[unused549]
261,[unused256]
56,[unused55]
722,[unused717]
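
Note that `str.find('unused') >= 0` matches the substring anywhere in a token, so an ordinary word such as `unused`, if present in the vocabulary, is counted together with the `[unusedN]` placeholders. The sketch below contrasts a substring test with a prefix test on a hypothetical toy vocabulary.

```python
import pandas as pd

# Hypothetical toy vocabulary
toy_vocab = pd.DataFrame({"tokens": ["[PAD]", "[unused0]", "[unused1]", "unused", "used", "##used"]})

contains_mask = toy_vocab["tokens"].str.contains("unused", regex=False)  # substring anywhere
starts_mask = toy_vocab["tokens"].str.startswith("[unused")              # placeholder prefix only

containing = toy_vocab[contains_mask]["tokens"].tolist()
placeholders = toy_vocab[starts_mask]["tokens"].tolist()
print(containing)
print(placeholders)
```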


In [12]:
# Get the tokens that have a size of 1 character
one_char_tokens = vocab_df[vocab_df['tokens'].str.len()==1]
print(f'There are {len(one_char_tokens)} tokens that have a size of 1 character')
one_char_tokens.sample(10)

There are 997 tokens that have a size of 1 character


Unnamed: 0,tokens
1492,ᴰ
1967,長
1203,ш
1754,井
1645,〜
1383,ச
1828,將
1053,q
1631,ⱼ
1901,法


In [13]:
# Get the tokens that are longer than 2 characters and do not contain the word 'unused'
word_like_tokens = vocab_df[(vocab_df['tokens'].str.len()>2) & (vocab_df['tokens'].str.find('unused')<0)]
print(f'There are {len(word_like_tokens)} tokens that likely represent English words')
word_like_tokens.sample(10)

There are 28042 tokens that likely represent English words


Unnamed: 0,tokens
20492,##mt
11114,gazed
16005,brewing
4433,draft
4099,enemy
9280,potentially
11127,gravel
26613,klan
18187,jagged
17587,peterborough


Each sentence is currently represented as a string of characters. We need to transform it into a list of tokens; the tokens are then converted into numbers, using their positions in the tokenizer's vocabulary as indexes. Here is an example with a phrase:

In [14]:
# Example of tokenizing a sentence
sample_sentence = "This is a sample sentence, which we will tokenize using the BERT tokenizer."
print(f'The sample sentence is:\n{sample_sentence}')

tokenized_sentence = bert_tokenizer.tokenize(sample_sentence)
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

The sample sentence is:
This is a sample sentence, which we will tokenize using the BERT tokenizer.

The tokenized sentence is:
['this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.']

The numericalized sentence is:
[2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012]
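
The output above shows `tokenize` being split into `token` and `##ize`. This kind of split comes from WordPiece-style tokenization, which can be sketched as a greedy longest-match-first search over the vocabulary. The code below is a simplified illustration with a hypothetical toy vocabulary, not the library's actual implementation.

```python
# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary has 30522 entries.
TOY_VOCAB = {"token", "##ize", "##izer", "##s", "lamp", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("tokenize"))
print(wordpiece("tokenizer"))
print(wordpiece("lamps"))
print(wordpiece("genie"))
```

The `##` prefix marks a piece that continues the previous one, which is why the pieces can be glued back together into the original word.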


We should now do the same for the sentences in the dataframe. Before proceeding, it is a good idea to create a copy of what we have so far, so that we can revert to the original dataframe if needed.

In [15]:
# Create a copy of the sentences dataframe
tokens_df = df.copy()

In [16]:
# Tokenize each sentence in the dataframe
tokens_df['tokenized_sentence'] = tokens_df['sentence'].apply(bert_tokenizer.tokenize)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence
155,him of the lamp he rubbed it and the genie ap...,"[him, of, the, lamp, he, rubbed, it, and, the,..."
86,reality precious stones he then asked for som...,"[reality, precious, stones, he, then, asked, f..."
97,replied aladdin so they sat at breakfast till...,"[replied, ala, ##ddin, so, they, sat, at, brea..."
352,himself in africa under the window of the prin...,"[himself, in, africa, under, the, window, of, ..."
48,treasure aladdin forgot his fears and grasped ...,"[treasure, ala, ##ddin, forgot, his, fears, an..."
198,the princess that no man living would come up ...,"[the, princess, that, no, man, living, would, ..."
369,mine tell me what has become of an old lamp i ...,"[mine, tell, me, what, has, become, of, an, ol..."
214,stood in a halfcircle round the throne with th...,"[stood, in, a, half, ##ci, ##rcle, round, the,..."
100,hath made us aware of its virtues we will use ...,"[hat, ##h, made, us, aware, of, its, virtues, ..."
432,people by her touch of their ailments whereupo...,"[people, by, her, touch, of, their, ai, ##lm, ..."


In [17]:
# Add the numericalized sentences to the dataframe
tokens_df['numericalized_sentence'] = tokens_df['tokenized_sentence'].apply(bert_tokenizer.convert_tokens_to_ids)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence,numericalized_sentence
146,mother that though he consented to the marriag...,"[mother, that, though, he, consent, ##ed, to, ...","[2388, 2008, 2295, 2002, 9619, 2098, 2000, 199..."
46,this stone lies a treasure which is to be your...,"[this, stone, lies, a, treasure, which, is, to...","[2023, 2962, 3658, 1037, 8813, 2029, 2003, 200..."
47,may touch it so you must do exactly as i tell ...,"[may, touch, it, so, you, must, do, exactly, a...","[2089, 3543, 2009, 2061, 2017, 2442, 2079, 359..."
86,reality precious stones he then asked for som...,"[reality, precious, stones, he, then, asked, f...","[4507, 9062, 6386, 2002, 2059, 2356, 2005, 207..."
15,and told his mother of his newly found uncle ...,"[and, told, his, mother, of, his, newly, found...","[1998, 2409, 2010, 2388, 1997, 2010, 4397, 217..."
365,aladdin looked up she called to him to come t...,"[ala, ##ddin, looked, up, she, called, to, him...","[21862, 18277, 2246, 2039, 2016, 2170, 2000, 2..."
9,streets as usual a stranger asked him his age ...,"[streets, as, usual, a, stranger, asked, him, ...","[4534, 2004, 5156, 1037, 7985, 2356, 2032, 201..."
396,cellar and the princess put the powder aladdin...,"[cellar, and, the, princess, put, the, powder,...","[15423, 1998, 1996, 4615, 2404, 1996, 9898, 21..."
379,not but he will use violence aladdin comforte...,"[not, but, he, will, use, violence, ala, ##ddi...","[2025, 2021, 2002, 2097, 2224, 4808, 21862, 18..."
16,said your father had a brother but i always th...,"[said, your, father, had, a, brother, but, i, ...","[2056, 2115, 2269, 2018, 1037, 2567, 2021, 104..."


Phrases fed to a BERT model must include the special tokens `[CLS]` and `[SEP]`, which mark the start and end of the input sequence. Let's add these tokens to the sample phrase. Another special token is `[PAD]`, which is used to pad shorter sequences.

In [18]:
# The special tokens include the square brackets; 'CLS' without them would map to the unknown token (ID 100)
tokenized_sentence = ['[CLS]'] + tokenized_sentence + ['[SEP]']
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

# Print the IDs for the special tokens for the BERT model
print(f'- The token ID for the special token [CLS] is: {bert_tokenizer.cls_token_id}')
print(f'- The token ID for the special token [SEP] is: {bert_tokenizer.sep_token_id}')
print(f'- The token ID for the special token [PAD] is: {bert_tokenizer.pad_token_id}')


The tokenized sentence is:
['[CLS]', 'this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.', '[SEP]']

The numericalized sentence is:
[101, 2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012, 102]
- The token ID for the special token [CLS] is: 101
- The token ID for the special token [SEP] is: 102
- The token ID for the special token [PAD] is: 0
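
Once sequences are padded, models typically also receive an attention mask that tells them which positions hold real tokens and which hold padding. The sketch below builds both from a list of token IDs, using the three special-token IDs printed above. The helper `build_model_input` is hypothetical and only illustrates the idea.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs printed above for bert-base-uncased

def build_model_input(token_ids, max_len):
    """Wrap token IDs in [CLS]/[SEP], pad to max_len, and build the matching mask."""
    ids = [CLS_ID] + token_ids + [SEP_ID]
    if len(ids) > max_len:
        raise ValueError("sequence is longer than max_len; truncate it first")
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [PAD_ID] * n_pad, attention_mask

input_ids, attention_mask = build_model_input([2023, 2003, 1037, 7099], max_len=8)
print(input_ids)
print(attention_mask)
```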


As the example indicates, we need to add the `[CLS]` and `[SEP]` tokens to each tokenized sentence of the text dataframe.

In [19]:
# Add the [CLS] (101) and [SEP] (102) special token IDs to the numericalized sentences in the dataframe
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: [bert_tokenizer.cls_token_id] + x + [bert_tokenizer.sep_token_id])
tokens_df['numericalized_sentence'].sample(10)

20     [101, 2022, 4527, 2012, 2025, 2383, 2464, 2032...
449    [101, 5689, 2013, 1996, 8514, 2065, 2008, 2003...
475    [101, 2005, 2116, 2086, 2975, 2369, 2032, 1037...
420    [101, 2062, 10433, 1998, 2062, 23626, 2084, 23...
149    [101, 21862, 18277, 4741, 19080, 2005, 3053, 2...
112    [101, 2004, 2016, 2253, 1999, 1998, 2246, 2061...
388    [101, 2187, 2014, 9140, 2098, 2841, 18576, 210...
61     [101, 2677, 1997, 1996, 5430, 1996, 16669, 663...
10     [101, 1996, 2365, 1997, 2442, 9331, 3270, 1996...
81     [101, 8116, 2033, 2013, 2023, 2173, 26090, 199...
Name: numericalized_sentence, dtype: object

In [20]:
# Pad the numericalized sentences in the dataframe with the [PAD] ID (0) up to max_len
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: x + [bert_tokenizer.pad_token_id] * (max_len - len(x)))

In [21]:
# Add a new column that indicates the length of the numericalized sentences
tokens_df['numericalized_sentence_length'] = tokens_df['numericalized_sentence'].apply(len)
tokens_df['numericalized_sentence_length'].sample(10)

348    76
343    76
120    76
144    76
201    76
381    76
18     76
439    76
77     76
339    76
Name: numericalized_sentence_length, dtype: int64

In [22]:
# Extract the numericalized sentences from the dataframe
numericalized_sentences = tokens_df['numericalized_sentence'].values
numericalized_sentences.shape

(447,)

In [23]:
# Convert each row of the numericalized sentences to a list
numericalized_sentences = [list(x) for x in numericalized_sentences]

In [24]:
# Convert the list into a 2D NumPy array
numericalized_sentences = np.array(numericalized_sentences)
print(f'The shape of the numericalized sentences is: {numericalized_sentences.shape}')
print(numericalized_sentences)

The shape of the numericalized sentences is: (447, 76)
[[  101 21862 18277 ...     0     0     0]
 [  101  2045  2320 ...     0     0     0]
 [  101  1037 23358 ...     0     0     0]
 ...
 [  101  2044  2023 ...     0     0     0]
 [  101  2002  4594 ...     0     0     0]
 [  101  2005  2116 ...     0     0     0]]


In [25]:
# Convert the NumPy array into a Tensor
numericalized_sentences = torch.from_numpy(numericalized_sentences)
print(numericalized_sentences)
print(f'the shape of the numericalized tensor is: {numericalized_sentences.shape}')

tensor([[  101, 21862, 18277,  ...,     0,     0,     0],
        [  101,  2045,  2320,  ...,     0,     0,     0],
        [  101,  1037, 23358,  ...,     0,     0,     0],
        ...,
        [  101,  2044,  2023,  ...,     0,     0,     0],
        [  101,  2002,  4594,  ...,     0,     0,     0],
        [  101,  2005,  2116,  ...,     0,     0,     0]], dtype=torch.int32)
the shape of the numericalized tensor is: torch.Size([447, 76])
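
The integer type of the array (and therefore of the tensor) depends on the platform unless it is set explicitly. The sketch below, on hypothetical padded IDs, fixes the dtype when building the array and also shows how the number of real (non-padding) tokens per row can be recovered.

```python
import numpy as np

# Hypothetical padded ID lists
padded = [[101, 7, 8, 102, 0, 0],
          [101, 9, 102, 0, 0, 0]]

batch = np.array(padded, dtype=np.int64)   # explicit dtype avoids platform-dependent int32/int64
real_lengths = (batch != 0).sum(axis=1)    # count non-padding positions per row
print(batch.shape)
print(real_lengths)
```

`torch.from_numpy` keeps the dtype of the array, so an `int64` array becomes a `torch.int64` tensor, which is the index type many PyTorch layers expect.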


### Discuss the following:
* What was the pipeline of this exercise?
* Summarize the data cleaning and preprocessing steps.
* What is the difference between a token and a sentence?
* Why did we convert the tokens to numbers?
* Why did we add the special tokens?
* What advantages did Pandas offer for text manipulation?
* Would this approach be suitable for complex datasets?

### Exercise:

* Repeat this pipeline with 3 different books that appear very different in nature (don't add the special tokens).
* When you obtain the numericalized sentences, convert them into a long 1D Numpy array.
* Plot the distribution of the numericalized tokens for each book using histograms.
* Share your experience during the next lesson.
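
As a possible starting point for the flattening and histogram steps, here is a sketch on hypothetical token IDs. It uses `np.histogram` to compute the bin counts; the bin count of 4 is arbitrary, and the upper range limit is the vocabulary size reported earlier.

```python
import numpy as np

# Hypothetical numericalized sentences of different lengths (no special tokens, no padding)
book_ids = [[2023, 2003, 1037], [2023, 7099], [1996, 2003, 2023, 1037]]

flat = np.concatenate([np.asarray(row) for row in book_ids])  # one long 1D array
counts, bin_edges = np.histogram(flat, bins=4, range=(0, 30522))
print(flat.shape)
print(counts)
```

For an actual plot, the same `flat` array could be passed to `matplotlib.pyplot.hist`.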