<a href="https://colab.research.google.com/github/Spartanlasergun/A-Step-by-Step-Guide-to-BERT/blob/main/A_Step_by_Step_Guide_To_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Step by Step Guide to BERT
##### by Spartanlasergun
-----

The layout of this guide is very simple. It is meant to be read from top to bottom - i.e. step by step. It's also better if you work directly in Google Colaboratory (Colab) rather than reading the notebook on GitHub.

The theoretical understanding that is necessary for using BERT is given in markdown cells that typically appear before or after the relevant blocks of code. To simplify things even further, this guide uses external modules sparingly, and each module is imported only in the cell where it is used, so that it is always clear which module handles a specific task. I have also tried to avoid dataframes in favour of basic Python lists, as most people will find it more comfortable to stick with the basics.

This guide assumes that you have some basic knowledge of programming and statistics, though it is my hope that even without this, most of the necessary understanding will still be imparted.

-----

# A Brief Introduction

What is machine learning really? What are we doing here? What are we trying to accomplish when we say that we wish to **classify** a bit of text? Well, it's no different than classifying something in real life. We live in a world of many things - apples, oranges, trees, flowers, birds, fishes etc. None of these are the same. When we talk about classifying text, we want to understand something specific about the nature of the text. Does it say something positive or negative? Is it grammatically correct? Is it poetry, a narrative essay, a speech, or perhaps something else?

It's easy enough to read and understand a piece of text as a human being, but if we were to give this text as input to a computer, how then would it understand - how then would it be able to **classify**? The answer is simple - the machine must learn.

In the case of BERT, this means taking large archives of text data from places like Wikipedia and learning representations of words, phrases and sentences that can be used to produce meaningful mathematical representations of any given set of words, phrases or sentences. This mathematical **model** can then be used to determine if a given piece of text is more similar to text that is happy, sad, angry, a poem, a tweet, musical lyrics etc. Theoretically, there are no limits to the categories we can assign.

So how do we take a word and represent it mathematically? Or, how do we take a sentence and do the same?

Consider the word **'The'**. Let's assign a number to this word.

Let the word **'The'** be represented by the integer **1234**.

Now that we've done this, there's nothing to stop us from assigning unique numbers to all of the words in the English language. In fact, let's take things a step further. Consider the sentence:

**The sky is blue.**

Assuming that we have some dictionary containing unique integer values for each word/symbol in the English language, we can assign unique numbers to each word/symbol in this sentence such that:

**The --> 1234**

**sky --> 435**

**is --> 2389**

**blue --> 678**

**. --> 390**

Now, we have technically represented this sentence mathematically with the list of numbers:

$$[1234, 435, 2389, 678, 390]$$
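
As a minimal sketch of this lookup in Python, assuming a hypothetical toy dictionary (`toy_vocab`) that holds only the made-up numbers from the example above (these are not BERT's real ids):

```python
# Hypothetical toy dictionary: the numbers are the made-up ones from the text,
# not the ids that BERT actually uses
toy_vocab = {"The": 1234, "sky": 435, "is": 2389, "blue": 678, ".": 390}

def encode(words, vocab):
    """Replace each word/symbol with its integer from the dictionary."""
    return [vocab[word] for word in words]

sentence = ["The", "sky", "is", "blue", "."]
encoded = encode(sentence, toy_vocab)
print(encoded)
```

A word that is missing from the dictionary would raise a `KeyError` here, which is one reason real tokenizers need a strategy for unknown words.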

This is good, but it's not quite enough. Truthfully, we have barely captured anything at all - in fact, all we've done is convert the sentence into a list of numbers. To actually capture the meaning in a sentence means considering the actual content of each word, the positions of each word in relation to all the others, the context of the sentence itself and all of the words that constitute it, along with many other factors.

There is no natural law that guides the process of capturing this meaning, but rather it is the culmination of years of rigorous testing that has resulted in the production of models like **BERT** that can do the job for us. As we shall soon see, our simple list of numbers will become transformed into a very complex vector that is typically hundreds of units long - but very accurate as a mathematical representation of a sentence.

The vectors that **BERT** produces are typically of length 768 or 1024, but there are smaller models whose vectors range from 128 to 512 units. Since we want to conserve processing power, we will be using one of the smaller models. You can picture the vectors that BERT produces as a list of long decimal (floating-point) numbers that have no discernible meaning at a glance, i.e.

$$[3.3432532, -4.56462222, 7.4535535345, -4.54654646, ...]$$

For now, there is little that we need to understand about how these vectors are produced mathematically. In this introductory guide, I simply want to show you how to load some text data, prepare it for processing with **BERT**, generate the vectors that represent our text, and use these vectors to do some simple text classification.

-----
# Loading Some Text Data

We begin with the simple task of loading some text data. Since delimited spreadsheet-style files (.csv, .tsv, etc.) are very commonly used in the field for sharing machine learning data, I have decided that it is best to load data from this type of file. To do this we use the pandas module, which has built-in functions for automatically reading spreadsheet data from local and online repositories.

(Note: There is no reason why we couldn't simply read data from a regular text file, or even create a python list manually with some data that we want to test. But it's still important to know how to work with spreadsheet files.)
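
As a rough illustration of that note, the sketch below reads the same kind of data using only Python's built-in `csv` module. The two rows of text are invented for the example and stand in for a real file, and the column names `label` and `tweet` are assumptions chosen to resemble the dataset used in this guide.

```python
import csv
import io

# Invented sample data standing in for a real .csv file (not taken from the dataset)
raw_text = """id,label,tweet
1,0,I love my new phone
2,1,"My screen cracked, again"
"""

sample_tweets = []
sample_labels = []
# DictReader uses the first line as column names and yields one dict per row
for row in csv.DictReader(io.StringIO(raw_text)):
    sample_tweets.append(row["tweet"])
    sample_labels.append(int(row["label"]))

print(sample_tweets)
print(sample_labels)
```

For a file on disk, `io.StringIO(raw_text)` would be replaced with a file object such as `open(path, newline="")`, where `path` is wherever your file happens to live.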

In the code below, I import the pandas module, define the url (download-link) for my spreadsheet, and use the 'pd.read_csv' function to read the data from my spreadsheet into a pandas dataframe.

In [28]:
import pandas as pd

url = 'https://github.com/Spartanlasergun/A-Step-by-Step-Guide-to-BERT/raw/main/data.csv'

dataframe = pd.read_csv(url)
dataframe.shape


(7920, 3)

Using the '.shape' attribute we can see the shape of the dataset that we have loaded. In total it contains 7920 datapoints, and 3 different columns of data. We can print a small random sample of this data using the '.sample' function as is shown below.

In [29]:
dataframe.sample(3)

| | id | label | tweet |
|------|------|-------|-------|
| 1719 | 1720 | 0 | Birthday present! :)) #apple. #iphone. #5. #wh... |
| 5853 | 5854 | 0 | Yes!!! Totally unexpected ! Thank you ninong V... |
| 5086 | 5087 | 0 | I dont care.. #instadaily #moment #stranger #a... |


As we can see, the above dataset contains 3 columns - id, label, and tweet. The id represents the row number, which we are not interested in. The tweet column contains tweets that have either positive sentiment, or negative sentiment. The labels correspond to these tweets, with a value of $0$ representing a positive tweet, while a value of $1$ represents a negative tweet. We are going to use these tweets and their labels to train a model that will be able to classify any given tweet as either positive or negative.

First we must extract the data that we want. The tweets are extracted in the cell below.

In [30]:
# extracting the tweets from our data frame
extract_tweets = dataframe.tweet.values # here we extract the tweet column from the dataframe as an array
tweets = extract_tweets.tolist() # here we convert that array to a regular python list
print(tweets[0])
print(tweets[1])
print(tweets[2])

#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone
Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/
We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu


As you can see, I have also taken the liberty of converting the tweet column of our dataframe into a regular python list. I have also printed the first few tweets.

Below we extract the labels, and convert them to a regular python list in the same fashion.

In [31]:
# extracting the labels from our data frame
extract_labels = dataframe.label.values # here we extract the label column from the dataframe as an array
labels = extract_labels.tolist() # here we convert that array to a regular python list
print(labels[0])
print(labels[1])
print(labels[2])

0
0
0


So, the labels have been extracted and we have a regular python list that we can work with. I have also printed the first few labels. These would correspond to the first few tweets that we printed above. Since the numbers are all $0$ this means that they all have positive sentiment.

-----
# Preparing our text data for BERT

Before we can give our text data to BERT, there is a bit of additional preparation that must be done. In order for BERT to effectively recognize the start and end of our tweets (or any text data), we must include some special data.

Lets consider the sentence:

**The sky is blue.**

For BERT to recognize the beginning and end of this sentence we must add the special tokens **[CLS]** and **[SEP]** where we wish to designate the start and end of the sentence. Thus, our original sentence must become:

**[CLS] The sky is blue. [SEP]**

In the case of our tweets, we must add the **[CLS]** and **[SEP]** tokens in the same manner to the start and end of each. This is done in the code below.

In [32]:
# Adding CLS and SEP tokens at the beginning and end of each sentence for BERT
bertified_tweets = []
for text in tweets:
    bertify = "[CLS] " + text + " [SEP]"
    bertified_tweets.append(bertify)

print(bertified_tweets[0])
print(bertified_tweets[1])
print(bertified_tweets[2])

[CLS] #fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone [SEP]
[CLS] Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/ [SEP]
[CLS] We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu [SEP]


As you can see from the output above, we have added the necessary tokens to the start and end of each tweet.

-----
# Installing transformers

In this section we install the transformers module. I have not gone into any kind of examination of transformers themselves as this is not necessary for our simple guide. Many textbooks insist on giving at least an overview of transformers, but I have found that it only overcomplicates the process without giving much insight into machine learning itself. What you need to understand is that BERT is itself a transformer, and the many layers of processing that it performs to generate our vector representation of text are not readily accessible to the average user. I leave it to the user to explore the larger body of literature on the subject and to come to his/her own conclusions about what needs to be understood on that point.

The transformers module that we will install is maintained as an open source effort by Hugging Face, and the BERT model is just one of many transformers and machine learning models that you can access freely through it.

In [33]:
!pip install transformers



-----
# Tokenizing our input

What is tokenization? Tokenization is the process of breaking down our text into individual sentences or words. For instance, if I have the sentences:

["The sky is blue. The trees are green."]

The tokenized **sentence** would be:

[["The sky is blue."], ["The trees are green."]]

A step further and the tokenized **words** would be:

[["The", "sky", "is", "blue", "."], ["The", "trees", "are", "green", "."]]

Essentially we want to take our list of tweets and break it down into an easily accessible list of words that we can use. This is easy enough to do manually, but if we are working with a list of thousands of sentences - in this case, tweets - then we need some way of automatically processing our data to obtain our list of words. The mechanism that we use to do this is called a 'tokenizer'. There are many tokenizers available, and some may break down sentences differently than others. Since we are using **BERT** we will use the **BERT** tokenizer.
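
To give a feel for what a tokenizer does, here is a deliberately naive sketch that reproduces the sentence and word splits from the example above using regular expressions. The function names are my own, and this is not how the BERT tokenizer works internally; it is only meant to show the idea.

```python
import re

def split_sentences(text):
    """Split text wherever whitespace follows a '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Runs of letters/digits become words; any other symbol is its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "The sky is blue. The trees are green."
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]

print(sentences)
print(words)
```

A rule this simple breaks down quickly on real tweets (abbreviations, emoticons, URLs), which is part of why we rely on a ready-made tokenizer.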

In the cell below, we import the BertTokenizer and use it to break down each individual tweet in our dataset into a list of words.

In [34]:
#Activating the BERT Tokenizer
from transformers import BertTokenizer

# create an instance of the BERT tokenization class
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# run our list of 'BERT-approved' sentences through the tokenizer
tokenized_texts = []
for tweet in bertified_tweets:
    tokenized = tokenizer.tokenize(tweet)
    tokenized_texts.append(tokenized)

print(tokenized_texts[0])
print(tokenized_texts[1])
print(tokenized_texts[2])

['[CLS]', '#', 'finger', '##print', '#', 'pregnancy', 'test', 'https', ':', '/', '/', 'goo', '.', 'g', '##l', '/', 'h', '##1', '##m', '##f', '##q', '##v', '#', 'android', '#', 'apps', '#', 'beautiful', '#', 'cute', '#', 'health', '#', 'i', '##gers', '#', 'iphone', '##on', '##ly', '#', 'iphone', '##sia', '#', 'iphone', '[SEP]']
['[CLS]', 'finally', 'a', 'trans', '##para', '##nt', 'silicon', 'case', '^', '^', 'thanks', 'to', 'my', 'uncle', ':', ')', '#', 'ya', '##y', '#', 'sony', '#', 'xp', '##eria', '#', 's', '#', 'sony', '##ex', '##per', '##ias', '…', 'http', ':', '/', '/', 'ins', '##tagram', '.', 'com', '/', 'p', '/', 'y', '##get', '##5', '##j', '##c', '##6', '##jm', '/', '[SEP]']
['[CLS]', 'we', 'love', 'this', '!', 'would', 'you', 'go', '?', '#', 'talk', '#', 'make', '##me', '##mori', '##es', '#', 'un', '##pl', '##ug', '#', 'relax', '#', 'iphone', '#', 'smartphone', '#', 'wi', '##fi', '#', 'connect', '.', '.', '.', 'http', ':', '/', '/', 'f', '##b', '.', 'me', '/', '6', '##n', '##3'

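As you can see from the output, each of our tweets has now been broken down into individual tokens. Common words are kept whole, while rarer words are split into smaller pieces, with '##' marking a piece that continues the word before it (e.g. 'finger', '##print').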
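
The '##' pieces in the output above come from the scheme that the BERT tokenizer uses, known as WordPiece: a word that is not in the vocabulary as a whole is split into the longest known pieces, working from left to right. The sketch below illustrates that greedy matching with a tiny hypothetical vocabulary (`toy_vocab`); the real tokenizer has a vocabulary of roughly thirty thousand entries and does additional clean-up first, so treat this as an approximation of the idea rather than the library's code.

```python
def wordpiece(word, vocab, unknown="[UNK]"):
    """Greedy longest-match-first split of one word into known pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown]  # no known piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen to mirror pieces seen in the output above
toy_vocab = {"finger", "##print", "iphone", "##on", "##ly", "test"}

print(wordpiece("fingerprint", toy_vocab))
print(wordpiece("iphoneonly", toy_vocab))
print(wordpiece("test", toy_vocab))
```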

This is only part of our tokenization process. As discussed in the introduction, what we are trying to do is to mathematically represent the meaning in each of these tweets. This requires that we convert our list of words/symbols into a list of integer values. Each word/symbol will have a unique integer equivalent that will be assigned to it.

To do this, we will feed our lists of tokens (formerly our list of tweets) into our tokenizer and use its 'convert_tokens_to_ids' function.

This is done below.

In [35]:
token_ids = []

for x in tokenized_texts:
    create_id = tokenizer.convert_tokens_to_ids(x)
    token_ids.append(create_id)

print(token_ids[0])
print(token_ids[1])
print(token_ids[2])

[101, 1001, 4344, 16550, 1001, 10032, 3231, 16770, 1024, 1013, 1013, 27571, 1012, 1043, 2140, 1013, 1044, 2487, 2213, 2546, 4160, 2615, 1001, 11924, 1001, 18726, 1001, 3376, 1001, 10140, 1001, 2740, 1001, 1045, 15776, 1001, 18059, 2239, 2135, 1001, 18059, 8464, 1001, 18059, 102]
[101, 2633, 1037, 9099, 28689, 3372, 13773, 2553, 1034, 1034, 4283, 2000, 2026, 4470, 1024, 1007, 1001, 8038, 2100, 1001, 8412, 1001, 26726, 11610, 1001, 1055, 1001, 8412, 10288, 4842, 7951, 1529, 8299, 1024, 1013, 1013, 16021, 23091, 1012, 4012, 1013, 1052, 1013, 1061, 18150, 2629, 3501, 2278, 2575, 24703, 1013, 102]
[101, 2057, 2293, 2023, 999, 2052, 2017, 2175, 1029, 1001, 2831, 1001, 2191, 4168, 24610, 2229, 1001, 4895, 24759, 15916, 1001, 9483, 1001, 18059, 1001, 26381, 1001, 15536, 8873, 1001, 7532, 1012, 1012, 1012, 8299, 1024, 1013, 1013, 1042, 2497, 1012, 2033, 1013, 1020, 2078, 2509, 4877, 6279, 10841, 102]


Take note of the output. Each of our tokenized texts is now a list of token ids - i.e. a list of numbers.

-----
# Padding the token ids

In order for the BERT algorithm to function properly, each of our lists of token ids must be the same length. We refer to this length as the sequence length. To ensure that each list of token ids we give to the BERT model has the same sequence length, we must add padding to the shorter sentences in our dataset. This padding is simply the number $0$, which is the id of BERT's special **[PAD]** token.

We can determine which is the longest list of token ids in our dataset and hence, determine an appropriate sequence length based on this value.

In [36]:
sequence_length = 0
for item in token_ids:
  if sequence_length <= len(item):
    sequence_length = len(item)

print(sequence_length)

142


From the above output, we can see that the longest tweet in our dataset is made up of 142 separate tokens. Hence, we need to pad all of our shorter lists with the number $0$ until they are all of the same length - 142.

(The sequence length may be extended beyond the length of the longest sentence in the dataset, but should not be less, otherwise the longer sentences would have to be cut short. BERT itself accepts a maximum sequence length of 512 tokens.)

In the code below, we add a bunch of zeroes to the lists in our token_ids variable that are less than 142 units in length, until they meet the required value.

In [37]:
input_ids = []

for item in token_ids:
  temp = list(item)   # copy, so the original list in token_ids is left unchanged
  while len(temp) < sequence_length:
    temp.append(0)    # pad with 0 until the required length is reached
  input_ids.append(temp)

print(input_ids[0])
print(len(input_ids[0]))
print(input_ids[1])
print(len(input_ids[1]))

[101, 1001, 4344, 16550, 1001, 10032, 3231, 16770, 1024, 1013, 1013, 27571, 1012, 1043, 2140, 1013, 1044, 2487, 2213, 2546, 4160, 2615, 1001, 11924, 1001, 18726, 1001, 3376, 1001, 10140, 1001, 2740, 1001, 1045, 15776, 1001, 18059, 2239, 2135, 1001, 18059, 8464, 1001, 18059, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
142
[101, 2633, 1037, 9099, 28689, 3372, 13773, 2553, 1034, 1034, 4283, 2000, 2026, 4470, 1024, 1007, 1001, 8038, 2100, 1001, 8412, 1001, 26726, 11610, 1001, 1055, 1001, 8412, 10288, 4842, 7951, 1529, 8299, 1024, 1013, 1013, 16021, 23091, 1012, 4012, 1013, 1052, 1013, 1061, 18150, 2629, 3501, 2278, 2575, 24703, 1013, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Take note of our output. All of our token_ids lists have been padded with 0's to meet the required sequence length of 142.
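
For reference, the same idea can be written more compactly. The sketch below uses a hypothetical helper, `pad_or_truncate`, on two invented lists of ids; it also cuts a list short if it is longer than the chosen length, which is an addition of mine and not something the cell above does.

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Return a new list of exactly `length` ids, leaving the input untouched."""
    clipped = ids[:length]  # note: cutting a list short would also drop its final [SEP] id
    return clipped + [pad_id] * (length - len(clipped))

# Invented stand-ins for two short tokenized tweets
toy_ids = [[101, 2057, 2293, 102], [101, 2633, 102]]

longest = max(len(ids) for ids in toy_ids)   # same role as sequence_length
padded = [pad_or_truncate(ids, longest) for ids in toy_ids]

print(longest)
print(padded)
```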

-----
# Attention Masks - our final bit of pre-processing

Things get a bit complex here in ways that may feel overwhelming, but I assure you it is very simple.

In addition to our input_ids (now of the fixed length 142), BERT also requires something called an attention mask.

This mask is a list of 1's and 0's corresponding to our input_ids, where the number 1 represents a word that BERT will operate on, while the number 0 represents a word that is 'masked' or hidden from the BERT machine.

We do not want BERT to operate on the padded values that we have added to our input ids, so we need to create a list analogous to our input that contains 1's for each real word in the sentence and 0's for each of the padded values. Since the padding itself is given by the number 0, the attention masks are easy to create. This is done below:


In [38]:
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for item in input_ids:
    temp = []
    for number in item:
      if number != 0:
        temp.append(1)
      else:
        temp.append(0)
    attention_masks.append(temp)

print(attention_masks[0])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Take note of the output. As described, our attention masks are simply lists of 1's and 0's where 1 represents a real token/word in our input, and 0 represents a padded value that we wish to ignore.
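
A quick way to convince yourself that a mask is correct is to add it up: the total should equal the number of real tokens in the original, unpadded list. A minimal sketch on an invented padded list, using a hypothetical helper name:

```python
def make_attention_mask(ids, pad_id=0):
    """1 for every real token id, 0 for every padding value."""
    return [0 if token_id == pad_id else 1 for token_id in ids]

# Invented example: 4 real token ids followed by 3 padding zeros
padded_ids = [101, 2633, 1037, 102, 0, 0, 0]
mask = make_attention_mask(padded_ids)

print(mask)
print(sum(mask))   # number of real tokens
```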

And with this last step we're finished with the pre-processing. We have our input_ids and now we have our attention_masks. In the next section we look at initializing the BERT model.

-----
# Feeding the data into BERT

In creating this guide, I have explored numerous methods and techniques that each constitute their own path to understanding text classification with BERT. In wanting to maintain simplicity for the sake of the user, I have selected what I think are the simplest methods and explanations that could be given, and yet I have not found a way around using a **dataloader** for feeding information into BERT. Thus, I am forced to take you through the following complexities, but there is a silver lining. In exploring the use of dataloaders, I think we can come to a real hands-on understanding and appreciation of the massive amounts of resources that machine learning typically requires.

We begin with a very simple task. BERT uses a particular data type - the 'torch tensor'. So we must convert our 'input_ids' and 'attention_masks' into valid torch tensors that we can give to the BERT model. This is done in the cell below.

In [39]:
import torch

bert_input = torch.tensor(input_ids)
bert_mask = torch.tensor(attention_masks)

Next, we must import and initialize the BERT model. In this case, we wish to conserve as much processing as possible, so we will be using the smallest BERT model available - the BERT-tiny model. This model produces vectors of the size 128, while the others are 256, 512, 768 and 1024. Using the tiny model means that our text classification will be less accurate, but this is fine for our simple guide. However, if you wish to experiment with some of the larger models you can simply replace the model name with:

'google/bert_uncased_L-2_H-256_A-4' ---> for the 256 BERT model

'google/bert_uncased_L-4_H-512_A-8' ---> for the 512 BERT model

'google/bert_uncased_L-12_H-768_A-12' ---> for the 768 BERT model
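
The model names follow a pattern: L is the number of transformer layers, H is the size of the vectors the model produces, and A is the number of attention heads. As a small sketch, a hypothetical helper of my own (`parse_bert_name`) can pull these numbers out of a name so you can see at a glance what you are loading:

```python
import re

def parse_bert_name(name):
    """Read the layer count, vector size and attention heads from a model name."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", name)
    if match is None:
        raise ValueError("name does not follow the L-x_H-y_A-z pattern: " + name)
    layers, hidden_size, heads = (int(group) for group in match.groups())
    return {"layers": layers, "hidden_size": hidden_size, "attention_heads": heads}

info = parse_bert_name("google/bert_uncased_L-2_H-128_A-2")
print(info)
```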

In [40]:
from transformers import BertModel  # import the BERT model from transformers

model_name = 'google/bert_uncased_L-2_H-128_A-2' # define the name of the Tiny BERT model
model = BertModel.from_pretrained(model_name)    # instantiate the BERT model

Ahhh! I have my BERT model ready to go. In my ideal workflow, I would simply like to give the BERT model the torch tensors (input_ids and attention_masks) and let it produce the vectors that I require, but this is not possible. BERT is a bit of a greedy old dog. He won't take our data without trying to allocate memory for the entire operation and this will overload our meagre bit of RAM. So we're forced to feed him information bit by bit. In computing terms, we are going to process the data, batch by batch.

But before we do this, the user can experiment a bit. The code below contains the ideal workflow as I have described to you.

```
outputs = model(input_ids=bert_input, attention_mask=bert_mask)
```

If you run the code, it will most likely crash the Colab runtime. You do not necessarily have to run it, but if you do, remember that you will need to re-run all of the cells from the beginning.

(I have found that if you reduce the size of the dataset to around 100 tweets instead of the ~8000 that we are using, it would just barely be able to run. But for machine learning purposes, datasets in the hundreds of thousands are the norm and this makes the dataloader very important)

The first step in using our dataloader is to create a combined dataset of our input_ids and attention_masks. Recall that we just converted these into torch tensors. In the cell below, these tensors are combined into one.

In [41]:
from torch.utils.data import TensorDataset

dataset = TensorDataset(bert_input, bert_mask)

Now that we have a combined dataset we can go ahead and create the dataloader. As we talked about, this dataloader is going to allow us to feed data into BERT in smaller, more manageable batches.

In the cell below, the DataLoader class is imported, and we define the dataloader based on our combined dataset and a specific batch size. The batch size we are using is 8, which means that BERT will only process 8 of our inputs at any given point in time. This will prevent it from trying to allocate memory for our entire dataset and crashing the operation.

In [42]:
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=8)

Now that we have our dataloader, we can begin to generate the vectors for our dataset.

(My griping over having to forcefully explain the dataloader is because of the cell below. This cannot be split into smaller parts. It is simply the operation that we must do.)

I begin by defining an empty list that we can append our vectors to as BERT processes each batch, i.e. 'tweet_vectors'.

The dataloader that we created can be iterated through like a regular python list. Since it contains the amalgamation of our input_ids and attention_masks, the zeroth index in each batch will have 8 of our input_ids, and the first index will have 8 of the corresponding attention_masks.
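
If the idea of batching feels abstract, the sketch below imitates it with plain python lists. The function `iterate_batches` is a stand-in of my own and not part of PyTorch, and the data is invented; the point is only to show that each batch is a pair (some input ids, the matching masks) and that the last batch may be smaller than the rest.

```python
def iterate_batches(inputs, masks, batch_size):
    """Yield (inputs, masks) pairs holding at most batch_size items each."""
    for start in range(0, len(inputs), batch_size):
        end = start + batch_size
        yield inputs[start:end], masks[start:end]

# Invented stand-in data: 20 tiny 'tweets' of 4 ids each, with matching masks
toy_inputs = [[101, i, 102, 0] for i in range(1, 21)]
toy_masks = [[1, 1, 1, 0] for _ in toy_inputs]

batches = list(iterate_batches(toy_inputs, toy_masks, batch_size=8))

print(len(batches))                          # how many batches were produced
print([len(batch[0]) for batch in batches])  # how many inputs are in each batch
```

With 20 items and a batch size of 8, this gives three batches of 8, 8 and 4 items. By the same arithmetic, our 7920 tweets divide evenly into 990 batches of 8.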

(We wrap the call in torch.no_grad() because I have experienced random crashes without it. This tells PyTorch not to keep track of gradients, which are only needed when training a model, and so reduces the memory and processing required.)

Once the outputs are generated, we retrieve the vectors that represent each individual tweet and push them into our tweet_vectors array.

On this point the reader should take note that BERT produces a vector for each individual word (token) in a tweet, and also a single pooled vector, derived from the vector of the **[CLS]** token, that represents the entire tweet/sentence. The vectors are accessible at both the word level and the sentence level, but since we are simply trying to classify tweets as positive or negative, we need only take the vectors that represent our entire tweet/sentence. In addition to this, BERT takes our input through several layers of processing: the full-sized base model has an embedding layer followed by 12 transformer layers (13 sets of vectors in total), while the tiny model used here has only 2 transformer layers. The vectors generated at each layer are also accessible to the user, but this is far more advanced than most people may ever need to know. The user can explore these vectors at his/her own discretion.

In [43]:
# Create a list to store the model outputs
tweet_vectors = []

# Iterate through the data loader and store model outputs
for batch in dataloader:
  batch_ids = batch[0]     # get 8 input ids (the name 'input' would shadow a python built-in)
  batch_masks = batch[1]   # get 8 corresponding attention masks
  with torch.no_grad():    # do not track gradients while generating the vectors
    outputs = model(input_ids=batch_ids, attention_mask=batch_masks)   # generate the vectors for our batch of 8

  # retrieve the vectors and append to our list
  tweet_aggregate = outputs.pooler_output     # retrieve the sentence aggregate vectors

  for tweet_vec in tweet_aggregate:           # append all 8 vectors to our list
    tweet_vectors.append(tweet_vec.tolist())

Note that 'outputs.pooler_output' is an attribute (not a function) that holds a single tensor containing the vectors for every tweet in the batch - one row of 128 values per tweet, so 8 rows for a full batch. Since I want each tweet's vector as its own regular python list, I iterate through the rows and append each one individually instead of appending the whole tensor.

To obtain the word level vectors you would use:

```
word_vec = outputs.last_hidden_state
```
But as stated before, this is not necessary for our classifier. Many other guides will walk you through the syntax associated with retrieving these vectors as well as the vectors produced by each layer of BERT.

For now, we have successfully obtained our vectors in the 'tweet_vectors' variable - which is all that we are concerned with.

Take note that the length of tweet_vectors matches the size of our original dataset, which means that we do indeed have a vector that represents each tweet.

In [44]:
print(len(tweet_vectors))

7920


You can also note that each vector is of size 128 - as is produced by the BERT tiny model.

In [45]:
print(len(tweet_vectors[0]))

128


-----
# Classification using Logistic Regression

For a simple tutorial, it makes no sense to attempt to explain logistic regression, as it is an advanced mathematical concept. This means that the user will have to take a lot of what happens in this section for granted.

To perform classification, we first have to train a model based on the vectors that we have obtained for our dataset. This means many things. But first, I want to address the phrase **'train a model'**. What does this mean? It means that we are going to take the text that we have represented mathematically - our tweet_vectors - and use it, together with the labels, to define a mathematical function that can represent all text data of this type. Since our text data represents tweets that are positive or negative, our mathematical function will specifically represent this subset of meaning in the English language. In this way, we can predict whether a given tweet is **positive** or **negative** using the mathematical function that we have defined.

In order to ensure that our mathematical function is accurately classifying tweets as positive or negative, we need to be able to test it with a set of already classified data so that we can understand its degree of accuracy. You may recall that the data we are working with is already classified/labelled with 0's and 1's to indicate positive and negative sentiment respectively. We can use the majority of this data to train the model, and leave a few hundred datapoints for testing the model afterward. In total we have 7920 datapoints, and in the cell below I extract the first 7000 tweet vectors and their corresponding labels to be used as training data. The remaining 920 will be used for testing the accuracy of the model at the end.

In [46]:
training_vectors = tweet_vectors[0:7000]
train_labels = labels[0:7000]

Now that I have my training data, I can go ahead and begin to train the model. In this case, we are using a logistic regression model. The user will have to take this for granted, as this is a very advanced bit of mathematics that cannot be easily explained in the context of this tutorial. Fortunately, it is not necessary to understand it as we have python modules that can easily do the job for us. (If you wish to explore the math, I would recommend the book: Logistic Regression: A Self-Learning Text, by David Kleinbaum and Mitchel Klein).
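
For readers who would still like a rough feel for what the trained model does with a vector, here is a minimal sketch. It assumes the usual form of logistic regression: each number in the vector is multiplied by a weight, the results are added up with a bias, and the total is squashed into a probability between 0 and 1. The weights, bias and three-value vector below are made up for illustration; in practice the training step finds the weights for us, and that part is not shown here.

```python
import math

def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(vector, weights, bias):
    """Return (label, probability of label 1) for one input vector."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    probability = sigmoid(score)
    label = 1 if probability >= 0.5 else 0
    return label, probability

# Made-up numbers: a real tweet vector here would have 128 values, not 3
weights = [1.5, -2.0, 0.5]
bias = -0.25
label, probability = predict_label([0.2, -0.4, 1.0], weights, bias)

print(label, round(probability, 3))
```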

In the cell below, I import the LogisticRegression model from the Sklearn library and initialize the logistic regression classifier. I then give the training data to the model - this includes the tweet vectors and labels that we have previously allocated for training.

In [47]:
from sklearn.linear_model import LogisticRegression

# Initialize a Logistic Regression classifier
classifier = LogisticRegression(max_iter=1000)

# Train the classifier on the training data
classifier.fit(training_vectors, train_labels)

And there you have it! The model has been trained. Now we need to test the model to ensure that it is working properly.

In the cell below, I extract the remaining 920 tweet vectors and labels that we have reserved for testing our model.

In [48]:
testing_vectors = tweet_vectors[7000:7920]
test_labels = labels[7000:7920]

Great! We have our testing data. Before we can use it with our classifier, we have one extra step.

The Sklearn classifier's 'predict' function expects a two dimensional array in which each row is one sample. Since I want to feed the test vectors into the model one at a time, each individual vector (a one dimensional list of 128 numbers) must first be wrapped in its own list, giving a 2d array with a single row.

This is done in the cell below.

In [49]:
t_vec = []
for item in testing_vectors:
    temp = [item]
    t_vec.append(temp)
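
To see what the wrapping step does to the shape of the data, the sketch below uses a small helper of my own (`nested_shape`) and a made-up three-value vector in place of a real 128-value tweet vector.

```python
def nested_shape(data):
    """Return the dimensions of a regularly nested list, e.g. (rows, columns)."""
    dims = []
    while isinstance(data, list) and data:
        dims.append(len(data))
        data = data[0]
    return tuple(dims)

vector = [0.12, -0.53, 0.98]   # made-up stand-in for one tweet vector
wrapped = [vector]             # the same vector as a 2d array with one row

print(nested_shape(vector))            # one dimension
print(nested_shape(wrapped))           # two dimensions: 1 row, 3 columns
print(nested_shape([vector, vector]))  # two dimensions: 2 rows, 3 columns
```

As the last line suggests, a list of several vectors is already two dimensional, so handing many vectors to the classifier in a single call is a possible alternative to predicting one at a time.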

Now we can begin to make our predictions. Recall that we have two labels - 0 representing a positive tweet, and 1 representing a negative tweet. Hence, our classifier is going to predict that the given tweet is either positive - 0 - or negative - 1.

I want to store the predictions in a list so that I can compare the predicted values with the predetermined labels. This will allow me to determine what percentage of the tweets were accurately classified by the Logistic Model.

In the cell below, I define an empty list to store the predictions. I iterate through the list of tweet vectors that have been reserved for testing and feed them into the model one by one to generate predictions for each.

In [50]:
predictions = []
for test_vec in t_vec:
    # Make a prediction using the trained Logistic Regression classifier
    predicted_class = classifier.predict(test_vec)

    # predict() returns an array holding one label, so store its first element
    predictions.append(int(predicted_class[0]))

print(predictions)
print(test_labels)

[0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 

Take note of the output. I have printed the list of predictions, and directly below it is a list of the predetermined labels that represent the correct answer/prediction. For each value that corresponds exactly, our model has made an accurate prediction. At a glance, you will notice that not all the values are accurate. We need to check to see what percentage it has actually been able to classify accurately. This is done in the cell below.

In [51]:
index = 0
correct = 0
for prophecy in predictions:
    if prophecy == test_labels[index]:
        correct = correct + 1
    index = index + 1

accuracy = (correct / len(predictions)) * 100
print("The classifier is " + str(accuracy) + "% accurate")

The classifier is 88.26086956521739% accurate
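
The same calculation can be written without a manual index counter by pairing the two lists with `zip`. This is only an alternative sketch of the cell above. The helper name `accuracy_percent` and the example lists are made up for illustration.

```python
def accuracy_percent(predicted, expected):
    """Percentage of positions where the two lists agree (hypothetical helper)."""
    if len(predicted) != len(expected):
        raise ValueError("both lists must have the same length")
    if not predicted:
        raise ValueError("cannot compute the accuracy of an empty list")
    # zip pairs each prediction with the label at the same position
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(predicted) * 100

# Made-up example: 4 of the 5 predictions match their labels
example_predictions = [0, 1, 0, 0, 1]
example_labels = [0, 1, 1, 0, 1]
print(accuracy_percent(example_predictions, example_labels))
```

In the notebook, this could be called as `accuracy_percent(predictions, test_labels)`.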


The classifier is roughly 88% accurate - that's actually very impressive considering that we are using the smallest BERT model. The larger models will probably give more than 90% on average.

-----
# Having some fun with our sentiment classifier

Now that we have trained a model that can accurately classify tweets as positive or negative, we can play around with it by making up some tweets of our own and seeing how well it works.

In the cell below, I define some tweets that you can alter however you like. In this case, I've made sure that the first one is negative, and the second one is positive.

In [52]:
tweety_bird = ["Elon Musk is a complete asshole #thenewtwittersucks", "Elon Musk is a genius! X is much better than twitter <3 #X"]

We can take the tweets defined above and process them for classification using the same code that we used to create the model. Since this is just a rehash of the code that we have used for the tutorial, I won't bother discussing it, and we can just go ahead and run the cell. (I have literally just copied and pasted most of the blocks of code that we have already discussed above.)

In [53]:
# Adding CLS and SEP tokens at the beginning and end of each sentence for BERT
bertified_tweets = []
for text in tweety_bird:
    bertify = "[CLS] " + text + " [SEP]"
    bertified_tweets.append(bertify)

# run our list of 'BERT-approved' sentences through the tokenizer
tokenized_texts = []
for tweet in bertified_tweets:
    tokenized = tokenizer.tokenize(tweet)
    tokenized_texts.append(tokenized)

token_ids = []

for x in tokenized_texts:
    create_id = tokenizer.convert_tokens_to_ids(x)
    token_ids.append(create_id)

input_ids = []

for item in token_ids:
  count = len(item)
  temp = item
  # pad with 0s up to sequence_length ('<' avoids an endless loop if a tweet is already longer)
  while count < sequence_length:
    temp.append(0)
    count = count + 1
  input_ids.append(temp)

attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for item in input_ids:
    temp = []
    for number in item:
      if number != 0:
        temp.append(1)
      else:
        temp.append(0)
    attention_masks.append(temp)

bert_input = torch.tensor(input_ids)
bert_mask = torch.tensor(attention_masks)

with torch.no_grad():   # no gradients are needed, since we are not training BERT here
    outputs = model(input_ids=bert_input, attention_mask=bert_mask)   # generate the vectors for our batch of tweets

vectors = outputs.pooler_output
reg_list = []
for item in vectors:
    reg_list.append((item.tolist()))

vec = []
for item in reg_list:
    temp = [item]
    vec.append(temp)

predictions = []
for tess in vec:
    # Make a prediction using the trained Logistic Regression classifier
    predicted_class = classifier.predict(tess)

    # predict() returns an array holding one label, so store its first element
    predictions.append(int(predicted_class[0]))

print(predictions)

[1, 0]


And there you have it. The classifier has accurately identified that the first tweet is negative - $1$ - and the second tweet is positive - $0$. You can add as many tweets as you like to the initial list, and you can play around and tinker with the code to suit yourself. ENJOY!
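
If you do decide to tinker, one possible tidy-up is to fold the padding loop and the attention mask loop into a single helper. The sketch below is an assumption about how that could look, not part of the original pipeline. The name `pad_and_mask` and its parameters `max_len` and `pad_id` are made up, and the truncation of overly long inputs is an addition that the cells above do not perform.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad (or truncate) one list of token ids and build its attention mask."""
    # Truncation is an added assumption; note that it would also cut off a trailing [SEP] id
    trimmed = token_ids[:max_len]
    padding = [pad_id] * (max_len - len(trimmed))
    padded = trimmed + padding
    # 1 marks a real token, 0 marks padding
    mask = [1] * len(trimmed) + [0] * len(padding)
    return padded, mask

# Made-up token ids purely for illustration
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids)
print(mask)
```

With the notebook's variables, this could be called as `pad_and_mask(item, sequence_length)` for each `item` in `token_ids`.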

P.S. - If you find any grammatical errors in this document, please let me know under the open issue for grammatical errors.