# **N-Gram Based Sentiment Classification Using BERT - Phase 1**
(For ease of testing and implementation, we will be using a test sentence in this notebook)

The sentence must first be processed into a format that is suitable as input for fine-tuning BERT. The steps involved are:

1. Tokenize using WordPiece
2. Add the special tokens `[CLS]` and `[SEP]`
3. Convert tokens to token IDs
4. Convert the token ID list to N-Grams: Unigrams, Bigrams, and Trigrams
5. Set segment IDs

# Setting Up
Here, we'll import and initialize the necessary tools for this project.

In [19]:
#Let us take an example sentence to show the implementation
sentence = "Hello CEG! Hope this works well. Here are some GRE words : Inchoate, Persnickety, etc."
sentence

'Hello CEG! Hope this works well. Here are some GRE words : Inchoate, Persnickety, etc.'

In [40]:
#Now, let's load the necessary libraries
from transformers import BertTokenizer, BertModel
from nltk.util import ngrams
import re, collections, torch, string

In [6]:
#And let's initialise some elements
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states = True)

# 1) Tokenize Using WordPiece
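WordPiece is the subword tokenization scheme that BERT was trained with, and `BertTokenizer` applies it by default, so we will be using it here. Words that are not in the vocabulary are split into smaller pieces, and every piece that continues a word is prefixed with `##`.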

In [51]:
#Let's begin the processing
#1, Tokenize using WordPiece. BERT Tokenizer uses WordPiece by default
tokens = tokenizer.tokenize(sentence)
tokens

['hello',
 'ce',
 '##g',
 '!',
 'hope',
 'this',
 'works',
 'well',
 '.',
 'here',
 'are',
 'some',
 'gr',
 '##e',
 'words',
 ':',
 'inch',
 '##oat',
 '##e',
 ',',
 'per',
 '##s',
 '##nick',
 '##ety',
 ',',
 'etc',
 '.']
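
To see how a word such as `CEG` ends up as `ce` and `##g`, here is a minimal sketch of greedy longest-match-first splitting, which is the idea behind WordPiece. The vocabulary `TOY_VOCAB` and the helper `wordpiece` are assumptions made for illustration; they are not BERT's real vocabulary file or the tokenizer's actual implementation.

```python
# Toy vocabulary: an assumption for illustration, not BERT's real vocab.
TOY_VOCAB = {"hello", "ce", "##g", "gr", "##e", "inch", "##oat", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("ceg"))       # ['ce', '##g']
print(wordpiece("inchoate"))  # ['inch', '##oat', '##e']
print(wordpiece("xyz"))       # ['[UNK]']
```

With the real vocabulary, the same procedure explains why rare words like "Persnickety" break into several pieces while common words stay whole.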

# 2) Adding Special Tokens
Now we need to add some special tokens.<br><br>
The `[CLS]` token is required at the beginning of every input. Its final hidden state serves as the aggregate representation of the whole sequence, which is what a classification head reads. <br><br>
The `[SEP]` token is added at the end of every sentence, to let BERT know that the sentence is over. We must add it even if our input is only one sentence.

In [53]:
#2, Adding special tokens
tokens.insert(0, "[CLS]")
tokens.append("[SEP]")

print(str(tokens))

['[CLS]', 'hello', 'ce', '##g', '!', 'hope', 'this', 'works', 'well', '.', 'here', 'are', 'some', 'gr', '##e', 'words', ':', 'inch', '##oat', '##e', ',', 'per', '##s', '##nick', '##ety', ',', 'etc', '.', '[SEP]']


# 3) Convert Tokens to Token IDs
Let's convert these tokens to their respective token IDs. BERT does not read strings: each token must exist in its vocabulary, and the model identifies it only by its integer token ID.

In [54]:
#3, Convert Tokens to Token IDs
tokenIDs = tokenizer.convert_tokens_to_ids(tokens)
tokenIDs

[101,
 7592,
 8292,
 2290,
 999,
 3246,
 2023,
 2573,
 2092,
 1012,
 2182,
 2024,
 2070,
 24665,
 2063,
 2616,
 1024,
 4960,
 16503,
 2063,
 1010,
 2566,
 2015,
 13542,
 27405,
 1010,
 4385,
 1012,
 102]
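
Conceptually, this conversion is a dictionary lookup with a fallback for unknown tokens. The sketch below shows the idea on a tiny stand-in vocabulary. The IDs for `[CLS]`, `[SEP]`, `hello` and `!` are the ones printed above; the dict `TOY_VOCAB`, the helper `tokens_to_ids`, and the `[UNK]` fallback ID are assumptions for illustration.

```python
# A tiny slice of a vocabulary, standing in for the tokenizer's full vocab.
# The [UNK] entry and its ID are assumed here for the fallback case.
TOY_VOCAB = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "hello": 7592, "!": 999}


def tokens_to_ids(tokens, vocab=TOY_VOCAB, unk="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


print(tokens_to_ids(["[CLS]", "hello", "!", "[SEP]"]))  # [101, 7592, 999, 102]
print(tokens_to_ids(["not-in-vocab"]))                  # [100]
```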

# 4) Create N-Grams
From this list of Token IDs, we will be creating Unigrams, Bigrams, and Trigrams for future analysis

In [55]:
#4,a, Creating Unigrams
unigrams = list(ngrams(tokenIDs, 1))
unigrams

[(101,),
 (7592,),
 (8292,),
 (2290,),
 (999,),
 (3246,),
 (2023,),
 (2573,),
 (2092,),
 (1012,),
 (2182,),
 (2024,),
 (2070,),
 (24665,),
 (2063,),
 (2616,),
 (1024,),
 (4960,),
 (16503,),
 (2063,),
 (1010,),
 (2566,),
 (2015,),
 (13542,),
 (27405,),
 (1010,),
 (4385,),
 (1012,),
 (102,)]

In [56]:
#4,b, Creating Bigrams
bigrams = list(ngrams(tokenIDs, 2))
bigrams

[(101, 7592),
 (7592, 8292),
 (8292, 2290),
 (2290, 999),
 (999, 3246),
 (3246, 2023),
 (2023, 2573),
 (2573, 2092),
 (2092, 1012),
 (1012, 2182),
 (2182, 2024),
 (2024, 2070),
 (2070, 24665),
 (24665, 2063),
 (2063, 2616),
 (2616, 1024),
 (1024, 4960),
 (4960, 16503),
 (16503, 2063),
 (2063, 1010),
 (1010, 2566),
 (2566, 2015),
 (2015, 13542),
 (13542, 27405),
 (27405, 1010),
 (1010, 4385),
 (4385, 1012),
 (1012, 102)]

In [57]:
#4,c, Creating Trigrams
trigrams = list(ngrams(tokenIDs, 3))
trigrams

[(101, 7592, 8292),
 (7592, 8292, 2290),
 (8292, 2290, 999),
 (2290, 999, 3246),
 (999, 3246, 2023),
 (3246, 2023, 2573),
 (2023, 2573, 2092),
 (2573, 2092, 1012),
 (2092, 1012, 2182),
 (1012, 2182, 2024),
 (2182, 2024, 2070),
 (2024, 2070, 24665),
 (2070, 24665, 2063),
 (24665, 2063, 2616),
 (2063, 2616, 1024),
 (2616, 1024, 4960),
 (1024, 4960, 16503),
 (4960, 16503, 2063),
 (16503, 2063, 1010),
 (2063, 1010, 2566),
 (1010, 2566, 2015),
 (2566, 2015, 13542),
 (2015, 13542, 27405),
 (13542, 27405, 1010),
 (27405, 1010, 4385),
 (1010, 4385, 1012),
 (4385, 1012, 102)]

As we can see, each N-Gram is a tuple of `n` consecutive token IDs, which we treat as a single unit.<br>
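
For reference, the sliding window that produces these tuples can be written in plain Python. This is a minimal sketch that should give the same tuples as `ngrams` for a plain list; `sliding_ngrams` is a hypothetical helper, and `ids` is a shortened ID list used only for illustration.

```python
def sliding_ngrams(ids, n):
    """Return every window of n consecutive items in ids, as tuples."""
    # zip stops at the shortest slice, which yields len(ids) - n + 1 windows.
    return list(zip(*(ids[i:] for i in range(n))))


ids = [101, 7592, 8292, 2290, 999, 102]  # shortened example sequence
print(sliding_ngrams(ids, 2))
# [(101, 7592), (7592, 8292), (8292, 2290), (2290, 999), (999, 102)]
print([len(sliding_ngrams(ids, n)) for n in (1, 2, 3)])  # [6, 5, 4]
```

The count rule `len(ids) - n + 1` is why the 29 token IDs above give 29 unigrams, 28 bigrams, and 27 trigrams.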

# 5) Setting Segment IDs
Segment IDs let BERT know which sentence a particular token belongs to. This matters for sentence-pair tasks such as question answering.

In our case, since we will be using BERT for sentiment classification, our inputs will only be single sentences, so every token gets the same segment ID. This notebook uses 1 throughout; note that the Hugging Face tokenizers assign 0 to every token of a single-sentence input.

In [66]:
unigramSegmentIDs = [1] * len(unigrams)
bigramSegmentIDs = [1] * len(bigrams)
trigramSegmentIDs = [1] * len(trigrams)
segmentIDs = [1] * len(tokenIDs)

print(str(unigramSegmentIDs))
print(str(bigramSegmentIDs))
print(str(trigramSegmentIDs))

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
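
For contrast, here is a minimal sketch of how segment IDs look for a sentence pair, following the usual BERT convention: 0 for `[CLS]`, the first sentence and its `[SEP]`, and 1 for the second sentence and its `[SEP]`. The helper `pair_segment_ids` and the example tokens are hypothetical.

```python
def pair_segment_ids(tokens_a, tokens_b):
    """Build the token list and segment IDs for a two-sentence input."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and its [SEP] get 0; sentence B and its [SEP] get 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pair_segment_ids(["who", "won", "?"], ["ce", "##g", "won"])
print(tokens)       # ['[CLS]', 'who', 'won', '?', '[SEP]', 'ce', '##g', 'won', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```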


# **Extract Hidden States**
The input sentence has been converted to token ID N-Grams, but our work isn't complete yet. We need to convert these to vectors, which will be the final input for the BERT model.

However, to create them, we need the features from the hidden layers of the BERT model that we initialised earlier, which give us values that depend on our input.

Before we do that, we need to convert the lists of IDs and segment IDs to tensors.

In [None]:
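#Convert to tensors
unigramsTensor = torch.tensor([unigrams])   #shape (1, 29, 1)
bigramsTensor = torch.tensor([bigrams])     #shape (1, 28, 2)
trigramsTensor = torch.tensor([trigrams])   #shape (1, 27, 3)

#squeeze only removes dimensions of size 1, so it works for the unigrams alone
unigramsTensor = torch.squeeze(unigramsTensor, dim=2)
#Note: resize_ keeps the leading values in memory order, i.e. the first len(bigrams)
#IDs of the flattened tuples; it does not merge each N-Gram into a single ID
#bigramsTensor = torch.squeeze(bigramsTensor, dim=2)
bigramsTensor.resize_((1,len(bigrams)))
#trigramsTensor = torch.squeeze(trigramsTensor, dim=2)
trigramsTensor.resize_((1,len(trigrams)))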
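
It is worth checking what the in-place `resize_` leaves in the bigram and trigram tensors. As far as PyTorch's documented behaviour goes, shrinking a contiguous tensor keeps its leading elements in memory order. The plain-Python sketch below mimics that under this assumption; `leading_values` is a hypothetical helper and `bigrams` is a shortened example list.

```python
def leading_values(ngram_list):
    """Flatten a list of n-gram tuples and keep the first len(ngram_list) values."""
    flat = [token_id for gram in ngram_list for token_id in gram]
    return flat[:len(ngram_list)]


bigrams = [(101, 7592), (7592, 8292), (8292, 2290), (2290, 102)]
print(leading_values(bigrams))  # [101, 7592, 7592, 8292]
```

Under that assumption, the resized tensor holds the first half of the bigrams laid end to end (with repeated IDs), so the later tokens of the sentence are not represented. This is something to keep in mind when interpreting the bigram and trigram vectors.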

In [79]:
unigramSegmentTensor = torch.tensor([unigramSegmentIDs])
bigramSegmentTensor = torch.tensor([bigramSegmentIDs])
trigramSegmentTensor = torch.tensor([trigramSegmentIDs])

Now we can proceed to get the hidden features.

In [114]:
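model.eval() #Evaluation mode: disables dropout
with torch.no_grad(): #No gradients are needed for feature extraction
  #Note: in transformers' BertModel.forward the second positional argument is attention_mask;
  #to supply segment IDs explicitly, pass them by keyword as token_type_ids=...
  unigram_output = model(unigramsTensor, unigramSegmentTensor) #This is an output, which contains the hidden states
  bigram_output = model(bigramsTensor, bigramSegmentTensor)
  trigram_output = model(trigramsTensor, trigramSegmentTensor)
#Index 2 holds the hidden states, available because output_hidden_states = True was set when loading the model
unigram_hidden_states = unigram_output[2]
bigram_hidden_states = bigram_output[2]
trigram_hidden_states = trigram_output[2]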

The size of the hidden states tensor:

In [129]:
print(str(len(unigram_hidden_states)) + "*" +  str(len(unigram_hidden_states[0])) + "*" +  str(len(unigram_hidden_states[0][0])) + "*" + str(len(unigram_hidden_states[0][0][0])))

13*1*29*768


And an example of one entry of the hidden states (index 0, the output of the embedding layer):

In [130]:
print(unigram_hidden_states[0])

tensor([[[ 0.1686, -0.2858, -0.3261,  ..., -0.0276,  0.0383,  0.1640],
         [ 0.3739, -0.0156, -0.2456,  ..., -0.0317,  0.5514, -0.5241],
         [ 0.1955,  0.6688,  0.1070,  ..., -1.0646,  0.3803,  0.2733],
         ...,
         [-0.1432, -0.0014,  0.1065,  ..., -0.2403,  1.3350, -0.2301],
         [-0.4147,  0.3854, -0.0560,  ...,  0.0517,  0.5977,  0.4596],
         [-0.4342,  0.1415,  0.2393,  ..., -0.4481, -0.0569, -0.2665]]])


# Creating Input Vectors from Hidden Features
What we extracted from the BERT model is the hidden states: a tuple of 13 tensors, each of shape `1 * x * 768`, giving overall dimensions `13 * 1 * x * 768`, where `x` is the number of tokens.

The first dimension, 13, covers the input embeddings plus the 12 hidden layers.

The second dimension is the batch size. In this case it is 1, because we pass only 1 sentence.

The third dimension, `x`, is the number of tokens in the input (29 for the unigrams).

The fourth dimension is the number of hidden features, which, for BERT Base, is 768.

Now, we need to combine this tuple into a single tensor:

In [115]:
unigramTokenEmbeddings = torch.stack(unigram_hidden_states, dim=0)
bigramTokenEmbeddings = torch.stack(bigram_hidden_states, dim=0)
trigramTokenEmbeddings = torch.stack(trigram_hidden_states, dim=0)

In [116]:
print(unigramTokenEmbeddings.size())

torch.Size([13, 1, 29, 768])


We don't need the batch dimension, since its size is only 1, so we will remove it:

In [117]:
unigramTokenEmbeddings = torch.squeeze(unigramTokenEmbeddings, dim=1)
bigramTokenEmbeddings = torch.squeeze(bigramTokenEmbeddings, dim=1)
trigramTokenEmbeddings = torch.squeeze(trigramTokenEmbeddings, dim=1)

In [118]:
print(unigramTokenEmbeddings.size())

torch.Size([13, 29, 768])


Let's swap the layer and token dimensions, so that each token comes first and carries its own stack of 13 layer outputs:

In [119]:
unigramTokenEmbeddings = unigramTokenEmbeddings.permute(1, 0, 2)
bigramTokenEmbeddings = bigramTokenEmbeddings.permute(1, 0, 2)
trigramTokenEmbeddings = trigramTokenEmbeddings.permute(1, 0, 2)

In [120]:
print(unigramTokenEmbeddings.size())

torch.Size([29, 13, 768])
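
The same reshaping can be followed in plain Python, with nested lists standing in for tensors. This is only a sketch of the indexing logic, not a replacement for the `torch` calls; the helper `shape` and the small sizes (3 layers, 4 tokens, 2 features instead of 13, 29, 768) are assumptions chosen to keep the example readable.

```python
def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


LAYERS, BATCH, TOKENS, FEATURES = 3, 1, 4, 2  # small stand-ins for 13, 1, 29, 768

# hidden[layer][batch][token][feature]; each value records its layer index.
hidden = [[[[float(layer)] * FEATURES for _ in range(TOKENS)]
           for _ in range(BATCH)]
          for layer in range(LAYERS)]
print(shape(hidden))     # [3, 1, 4, 2]

# "squeeze": drop the batch dimension by taking its only entry.
squeezed = [layer[0] for layer in hidden]
print(shape(squeezed))   # [3, 4, 2]

# "permute(1, 0, 2)": swap the layer and token dimensions.
per_token = [list(layers) for layers in zip(*squeezed)]
print(shape(per_token))  # [4, 3, 2]
print(per_token[0][2])   # [2.0, 2.0] -> token 0, layer 2
```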


Now that we have the hidden features, we can create the vectors.

We do this by concatenating the last 4 layers of each token's hidden states, so each token's vector will have 4 * 768 = 3072 elements.

In [121]:
unigramVector = []
bigramVector = []
trigramVector = []

for token in unigramTokenEmbeddings:
  vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
  unigramVector.append(vec)

for token in bigramTokenEmbeddings:
  vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
  bigramVector.append(vec)

for token in trigramTokenEmbeddings:
  vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
  trigramVector.append(vec)

Here is a sample token vector, and the size of each vector:

In [126]:
print(bigramVector[0])
print(len(bigramVector[0]))

tensor([ 0.4960,  0.2156,  0.2596,  ..., -0.2427, -0.3104,  0.2826])
3072
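
The length of 3072 follows directly from the concatenation order used in the loops. Here is a minimal sketch with plain lists standing in for one token's 13 layer outputs; `concat_last_four` is a hypothetical helper and the layer values are dummy numbers that only mark which layer they came from.

```python
def concat_last_four(layers):
    """Concatenate the last four layer outputs of ONE token, last layer first."""
    return layers[-1] + layers[-2] + layers[-3] + layers[-4]


HIDDEN = 768
# 13 layer outputs for a single token; layer i is filled with the value i.
token_layers = [[float(i)] * HIDDEN for i in range(13)]

vec = concat_last_four(token_layers)
print(len(vec))         # 3072
print(vec[0], vec[-1])  # 12.0 9.0
```

The first 768 values come from the final layer (index 12) and the last 768 from layer index 9, matching the `token[-1], token[-2], token[-3], token[-4]` order above.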
