This notebook was originally created in Google Colab. I use it to experiment with and implement all the steps needed for sentiment analysis with BERT. At each stage, I first write my own version of the functions/classes to practice my understanding, and then implement the same step again using the TorchText version.

Note that the functions and classes are written for a sentiment analysis task: they assume the input is text and the output is a numerical label.

The dataset that I'll use for this notebook comes from : https://www.kaggle.com/crowdflower/twitter-airline-sentiment

It's a small, simple dataset, well suited to checking whether the code works.

In [1]:
import pandas as pd
import spacy 
import torch
from collections import Counter
from torchtext.vocab import Vocab
from torch.utils.data import DataLoader,Dataset
from torch.nn.utils.rnn import pad_sequence

import re
import numpy as np
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [131]:
'''
from google.colab import files
files.upload()
'''

'\nfrom google.colab import files\nfiles.upload()\n'

## 1. Data Preparation

For a short but great tutorial on building custom functions for dataset building, I recommend watching this educational video : https://www.youtube.com/watch?v=9sHcLvVXsns&t=589s

### 1a. Preliminary setup

Set up the preliminaries here: file paths, the label mapping, the column names, and the contractions dictionary used for text preprocessing.

In [6]:
root_path = "/content"
file_name = "/Tweets.csv"

In [50]:
df = pd.read_csv(root_path + file_name)

In [52]:
label_mapping = {
    "neutral":0,
    "positive":1,
    "negative":-1
}
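
One thing to keep in mind about this mapping: the loss function is not chosen in this section, but if the labels are later fed to a loss that expects class indices in the range 0..C-1 (`torch.nn.CrossEntropyLoss` does, for example), the `-1` for "negative" would be out of range. A minimal sketch of re-indexing the mapping, where `to_class_indices` is a hypothetical helper and not part of the notebook:

```python
label_mapping = {"neutral": 0, "positive": 1, "negative": -1}

def to_class_indices(mapping):
    # Hypothetical helper: order label names by their original numeric value,
    # then assign consecutive indices starting from 0.
    ordered_names = sorted(mapping, key=mapping.get)
    return {name: idx for idx, name in enumerate(ordered_names)}

class_index_mapping = to_class_indices(label_mapping)
print(class_index_mapping)
```

The relative order of the sentiments (negative < neutral < positive) is kept, only the values are shifted.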


In [27]:
# Manually set which column contains label and which column contains text.
label_column = 'airline_sentiment'
text_column = 'text'

In [8]:
# contractions dictionary
contraction_dictionary = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he shall",
"he'll've": "he shall have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i shall have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it had",
"it'd've": "it would have",
"it'll": "it shall",
"it'll've": "it shall have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she shall have",
"she's": "she has",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that has",
"there'd": "there had",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they shall have",
"they're": "they are",
"they've": "they have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
'n': 'and'
}


### 1b. Text Preprocessor setup

Before text can be used, we need to preprocess it. For that we'll define a custom preprocessor class.
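
Before the full class, here is a rough sketch of the order of steps on a single string: lower-casing, contraction expansion, punctuation removal, stop-word removal, then short-word removal. The tiny `contractions` and `stop_words` collections are stand-ins I made up for illustration; the real class uses the full contraction dictionary and NLTK's stop-word list.

```python
import re

# Stand-in resources for illustration only (assumptions, not the real lists).
contractions = {"don't": "do not", "i'm": "i am"}
stop_words = {"i", "am", "do", "not", "the", "a", "to"}

def preprocess(text):
    text = text.lower()
    # Expand contractions word by word
    text = " ".join(contractions.get(w, w) for w in text.split(" "))
    # Replace punctuation with spaces
    text = re.sub(r"[^\w\s]", " ", text)
    # Remove stop words, then words of length 2 or less
    tokens = [w for w in text.split() if w not in stop_words]
    tokens = [w for w in tokens if len(w) > 2]
    return " ".join(tokens)

result = preprocess("I'm flying @united, don't lose the bags!")
print(result)
```

Note that with this ordering the "not" produced by expanding "don't" is removed again as a stop word ("not" is also in NLTK's English list), which may matter for sentiment.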

In [46]:
class TextPreprocessor():
  '''
  This class is used for text preprocessing. The fit_transform method also helps map labels into unique values based on label mapping provided.
  '''
  def __init__(self, 
               lower = True, 
               contraction = True, 
               contraction_dictionary = None, 
               punctuation = True,
               stop_words = True, 
               lemmatize = False,
               short_words = True, 
               emojis = True,
               alphabet_only = False
               ):
    self.lower_case = lower
    self.contraction = contraction
    self.contraction_dictionary = contraction_dictionary
    self.punctuation = punctuation 
    self.stop_words = stop_words
    self.lemmatize = lemmatize
    self.short_words = short_words
    self.emojis = emojis
    self.alphabet_only = alphabet_only

  def fit_transform(self,data,labels,label_mapping):
    '''
    For now, the preprocessing takes place step by step and each step returns a sentence. Might change this later
    so that after the first step or so we get a list of tokens rather than having to join the sentence over and over.

    Parameters:
    data - pandas series, column within dataframe containing text
    labels - pandas series, column within dataframe containing target
    label_mapping - dict mapping each raw label value to its numeric value

    Returns:
    tuple of (drop,data,labels): the indexes of the dropped rows, plus the preprocessed text and the mapped labels.
    Rows whose text is empty after preprocessing are removed.
    At this stage, data and labels are numpy arrays.
    '''
    # Map raw labels to numeric values using label_mapping
    labels = labels.apply(lambda s : label_mapping[s])
    # Preprocess data
    data = data.astype(str)

    if self.lower_case == True:
      data = data.apply(lambda x : x.lower())
    
    if self.emojis == True:
      data = data.apply(self.remove_emojis)

    if self.contraction == True:
      assert self.contraction_dictionary is not None, "self.contraction is True, but no contraction_dictionary was provided."
      data = data.apply(self.expand_contractions)

    if self.punctuation == True:
      data = data.apply(self.remove_punctuation)
    
    if self.alphabet_only == True:
      data=data.apply(lambda s: re.sub(r"[^a-zA-Z]"," ",s)) #keep only alphabetical words

    if self.lemmatize == True:
      data = data.apply(self.lemma)

    if self.stop_words == True:
      data = data.apply(self.remove_stop_words)

    if self.short_words == True:
      data = data.apply(self.remove_short_words)
    
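    # Drop rows whose text is empty (or whitespace only) after preprocessing.
    # Selecting index labels through a boolean mask keeps them aligned with the series index.
    empty_mask = data.apply(lambda x : len(x.strip()) == 0)
    drop = data.index[empty_mask].to_numpy()
    data = data.drop(drop)
    labels = labels.drop(drop)
    
    return (drop, data.values,labels.values)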

  def transform_text(self,text):
    '''
    Similar to fit_transform, except this will be used to preprocess and transform any string.

    Parameters:
    text - string to be transformed
    
    Returns:
    out - transformed string
    '''
    out = text
    if self.lower_case == True:
      out = out.lower()
    
    if self.emojis == True:
      out = self.remove_emojis(out)

    if self.contraction == True:
      assert self.contraction_dictionary is not None, "self.contraction is True, but no contraction_dictionary was provided."
      out = self.expand_contractions(out)

    if self.punctuation == True:
      out = self.remove_punctuation(out)
    
    if self.alphabet_only == True:
      out= re.sub(r"[^a-zA-Z]"," ",out) #keep only alphabetical words

    if self.lemmatize == True:
      out = self.lemma(out)

    if self.stop_words == True:
      out = self.remove_stop_words(out)

    if self.short_words == True:
      out = self.remove_short_words(out)

    return out


  @staticmethod
  def remove_emojis(text):
    '''
    Returns text with emojis removed.
    '''
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                            "]+", flags = re.UNICODE)
    text=regrex_pattern.sub(r'',text)
    text=text.replace('\n',' ')
    out=re.sub(' +', ' ', text)
    
    return out

  def expand_contractions(self,text):
    '''
    Returns text with contractions expanded.
    '''
    out=re.sub(r"’","'",text)
    out=out.split(" ")

    for idx,word in enumerate(out):
        if word in self.contraction_dictionary:
            
            out[idx] = self.contraction_dictionary[word]
    return " ".join(out)

  @staticmethod
  def remove_punctuation(text):
    '''
    Returns text with punctuations removed.
    '''
    out = re.sub(r"[^\w\s]"," ",text)
    return out

  @staticmethod
  def remove_stop_words(text):
    '''
    Returns text with stop_words removed.
    '''
    stop_words = set(stopwords.words('english')) # a set makes the membership checks below fast
    out=[i for i in text.split(" ") if i not in stop_words]
    return " ".join(out)

  @staticmethod
  def remove_short_words(text):
    '''
    Returns text with short words removed.
    '''
    out=[i for i in text.split(" ") if len(i)>2]
    return " ".join(out)

  @staticmethod
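  def lemma(text):
    # Placeholder: raise instead of printing, so lemmatize=True cannot silently turn the text into None
    raise NotImplementedError("Lemmatization not implemented yet")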
    

**Testing to see whether TextPreprocessor works**

In [10]:
# Testing whether the class works
preprocessor = TextPreprocessor(contraction_dictionary = contraction_dictionary)
dropped_indexes,test_data, test_label = preprocessor.fit_transform(df['text'],df['airline_sentiment'],label_mapping)

In [11]:
test_data[:15]

array(['virginamerica dhepburn said',
       'virginamerica plus added commercials experience tacky',
       'virginamerica today must mean need take another trip',
       'virginamerica really aggressive blast obnoxious entertainment guests faces amp little recourse',
       'virginamerica really big bad thing',
       'virginamerica seriously would pay flight seats playing really bad thing flying',
       'virginamerica yes nearly every time fly ear worm away',
       'virginamerica really missed prime opportunity men without hats parody https mwpg7grezp',
       'virginamerica well',
       'virginamerica amazing arrived hour early good',
       'virginamerica know suicide second leading death among teens',
       'virginamerica pretty graphics much better minimal iconography',
       'virginamerica great deal already thinking 2nd trip australia amp even gone 1st trip yet',
       'virginamerica virginmedia flying fabulous seductive skies take stress away travel http ahlxhhkiyn',
  

### 1c. Vocabulary setup

Next we need to define a vocabulary object, which builds the mappings between text and numeric indexes. Text can't be fed directly into a model; it has to be numericalized first, and the vocab object does that for us.

In this block we implement a simple CustomVocabulary class that maps text to numeric indexes. For a vocab object that supports vectorized mappings, TorchText's version can be used instead.
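
The core idea can be sketched in a few lines: count tokens, keep those at or above a frequency threshold, and assign indexes after the special tokens. The example sentences and the threshold of 2 are made up for illustration.

```python
from collections import Counter

# Made-up, already preprocessed sentences (illustration only)
sentences = ["late flight again", "flight delayed", "great flight crew", "late again"]
threshold = 2

counter = Counter()
for sentence in sentences:
    counter.update(sentence.split(" "))

# Most frequent first; ties are broken alphabetically
ordered = sorted(counter.items(), key=lambda t: (-t[1], t[0]))

stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
for word, count in ordered:
    if count >= threshold:
        stoi[word] = len(stoi)
itos = {index: word for word, index in stoi.items()}

# Words below the threshold (or never seen) fall back to <UNK>
encoded = [stoi.get(w, stoi["<UNK>"]) for w in "late flight cancelled".split(" ")]
decoded = " ".join(itos[i] for i in encoded)
print(encoded, decoded)
```

Decoding is lossy by design: anything mapped to `<UNK>` cannot be recovered.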

In [12]:
'''
df['review']=df['review'].apply(nltk_lemmatize)
df['review']=df['review'].apply(spacy_lemmatize)
df['review']=df['review'].apply(lambda s: s.replace('-PRON-',""))
'''

'\ndf[\'review\']=df[\'review\'].apply(nltk_lemmatize)\ndf[\'review\']=df[\'review\'].apply(spacy_lemmatize)\ndf[\'review\']=df[\'review\'].apply(lambda s: s.replace(\'-PRON-\',""))\n'

In [13]:
class CustomVocabulary():

  def __init__(self,preprocessor,freq_threshold = 5):
    self.itos = {0:"<PAD>",1:"<SOS>",2:"<EOS>",3:"<UNK>"}
    self.stoi = {"<PAD>":0,"<SOS>":1,"<EOS>":2,"<UNK>":3}
    self.threshold = freq_threshold
    self.preprocessor = preprocessor


  def __len__(self):
    return len(self.itos)
  
  @staticmethod
  def tokenize(text):
    return text.split(" ")
  
  def build_vocab(self,sentence_list):
    '''
    Build the vocab from a list of sentences. Apart from the special tokens, the itos entries run from the
    most frequent words to the least frequent words.
    '''
    counter = Counter()
    for sentence in sentence_list:
      counter.update(self.tokenize(sentence))
    
    # Sort based on alphabet
    sorted_counter = sorted(counter.items(),key = lambda t : t[0])
    # Then sort again based on freq
    sorted_counter.sort(key = lambda t : t[1],reverse = True)
    # This gives list of tuples sorted based on frequencies, then sorted based on alphabetical order

    # start updating from index 4 onwards
    index = 4
    for item in sorted_counter:
      word,count = item
      if count >= self.threshold:
        self.itos[index] = word
        self.stoi[word] = index
        index += 1
  
  def numericalize(self,text):
    '''
    Converts any text into a list of numbers based on stoi. Does not have capabilities for vectorized text.
    The text will be preprocessed using self.preprocessor before checked against stoi.

    Parameters:
    text - string to be numericalized

    Returns:
    tokens - list of indexes from text based on stoi
    '''
    preprocessed_text = self.preprocessor.transform_text(text)
    tokenized_text = self.tokenize(preprocessed_text)


    return [self.stoi[token] if token in self.stoi else self.stoi["<UNK>"] for token in tokenized_text]

  def reverse_numericalize(self,text):
    '''
    Converts list of numericalized text back into text string
    '''
    return " ".join([self.itos[t] for t in text])


    

**Test whether CustomVocabulary works**

In [31]:
vocab = CustomVocabulary(preprocessor)
vocab.build_vocab(test_data.tolist())
sample = df.sample(1)['text'].values[0]
print(f"original sentence : {sample}")
numericalized_sample = vocab.numericalize(sample)
print(f"numericalized sentence : {numericalized_sample}")
print(f"reverse numericalised sentence : {vocab.reverse_numericalize(numericalized_sample)}")

original sentence : @AmericanAir for my delay and you know what I get a we don't credit anybody back a supervisor who cut me off when speaking
numericalized sentence : [7, 64, 47, 10, 197, 3, 31, 546, 916, 1075]
reverse numericalised sentence : americanair delay know get credit <UNK> back supervisor cut speaking


### 1d. Dataset setup

Using the preprocessor and vocab objects set up above, we now define the dataset class. A Dataset is a standard part of PyTorch projects, since it is where the model's training data is read from. Here the dataset class takes the raw csv and yields tensors of input data (in this case numericalized sentences) and target values (labels).

To clarify further, input data refers to "x_values" and target values refer to "y_values" that we feed to the ML/DL model.
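
As a minimal sketch of that contract, a Dataset only needs `__len__` and `__getitem__`, with `__getitem__` returning one (input, target) pair. `ToyDataset` and its hard-coded values are hypothetical and only here to show the shape of the interface.

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    # Hypothetical stand-in: holds already numericalized sentences and numeric labels
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        # One sample: (x_values, y_values) as tensors
        x = torch.tensor(self.sequences[index], dtype=torch.long)
        y = torch.tensor(self.labels[index], dtype=torch.long)
        return x, y

toy = ToyDataset([[1, 4, 5, 2], [1, 6, 2]], [1, -1])
x, y = toy[1]
print(len(toy), x, y)
```

The sentences have different lengths, which is fine at this stage; padding is handled later when samples are grouped into batches.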

In [103]:
class TweetsDataset(Dataset):
  # The preprocessor object is passed to the dataset, along with the vocab object.
  # Also assumes the dataset has been lightly cleaned beforehand, so hopefully no duplicates and no nulls.
  def __init__(self,root,file_name,label_column,data_column,preprocessor,vocabulary,label_mapping):
    self.raw_df = pd.read_csv(root + file_name)
    self.vocab = vocabulary
    self.preprocessor = preprocessor
    self.label_mapping = label_mapping
    # preprocessor has a simple fit_transform function that will return preprocessed data and labels
    # self.raw_df will remain as raw df
    if self.preprocessor is not None:
      # data and label will be in the form of numpy array after fit_transform.
      # Read from self.raw_df rather than a global df, so the class works on its own.
      self.dropped_indexes, data,label = self.preprocessor.fit_transform(self.raw_df[data_column],self.raw_df[label_column],self.label_mapping)
    else:
      data = self.raw_df[data_column].values
      # Labels still need to be numeric so they can be turned into tensors later
      label = self.raw_df[label_column].map(self.label_mapping).values
      self.dropped_indexes = None
    
    # pass in list of strings to build vocabulary
    self.vocab.build_vocab(data.tolist())
    self.data = data
    self.label = label
  
  def __len__(self):
    '''
    Note that we take length of preprocessed data, so it is possible 
    the length that gets returned is less than the raw df length.
    '''
    return len(self.data)
  
  def __getitem__(self,index):
    '''
    Data will be converted into the corresponding indexes based on self.vocab.

    returns tensors of numericalized_sentence,labels
    '''
    sentence = self.data[index]
    label = self.label[index]

    numericalized_sentence = [self.vocab.stoi["<SOS>"]]
    numericalized_sentence.extend(self.vocab.numericalize(sentence))
    numericalized_sentence.append(self.vocab.stoi["<EOS>"])

    return torch.tensor(numericalized_sentence),torch.tensor(label)
  



**Testing dataset class**

In [104]:
dataset = TweetsDataset(root_path,file_name,"airline_sentiment","text",preprocessor,vocab,label_mapping)

### 1e. DataLoader Setup

After setting up the dataset, the last step is to set up a DataLoader, which lets us specify how data is retrieved from the dataset (batch size, shuffling, and so on). We also define a custom collate function so that each batch is padded to the length of its longest sentence.

This code block basically contains the "main()" function for data preparation.
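
To make the batch padding concrete, here is a minimal sketch of what `pad_sequence` does to two sentences of different lengths. The token indexes are made up, and 0 is assumed to be the `<PAD>` index, as in the vocabulary above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two numericalized sentences of different lengths (made-up indexes)
batch = [torch.tensor([1, 4, 5, 2]), torch.tensor([1, 6, 2])]

# batch_first=True gives shape (batch_size, longest_sequence_in_batch)
padded = pad_sequence(batch, batch_first=True, padding_value=0)
print(padded.size())
print(padded)
```

Because padding is applied per batch, the padded width depends on the longest sentence in that batch, so it can differ from one batch to the next.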


In [123]:
class Collate():
  def __init__(self,pad_idx):
    '''
    pad_idx - index used to represent <PAD>
    '''
    self.pad_idx = pad_idx

  def __call__(self,batch):
    '''
    Called by the DataLoader on every batch, so each batch goes through this collate function as it is prepared.
    Note that batch is a list of (numericalized_sentence,label), where numericalized_sentence and label are both tensors.

    Returns (text,labels), where text.size() = (batch_size,padded_size) and labels.size() = (batch_size,)
    '''
    numericalized_sentence_list = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    # batch_first makes it so the dimension is (batch,longest seq dim, *) where * is other dims
    text = pad_sequence(numericalized_sentence_list,batch_first=True,padding_value = self.pad_idx)
    labels = torch.stack(labels) # stack the 0-dim label tensors into shape (batch_size,)
    return text,labels
  

In [124]:
def prepare_loader(
    root_path,
    file_name,
    data_col,
    target_col,
    batch_size,
    preprocessor,
    vocab,
    label_mapping,
    num_workers = 0,
    shuffle = True,
    pin_memory = True
    ):
  '''
  Prepares the data loader. It will
  1. Prepare dataset (using preprocessor)
  2. Load dataset into DataLoader, which will be used to batch up the data during runtime
  '''

  dataset = TweetsDataset(root_path,file_name,target_col,data_col,preprocessor,vocab,label_mapping)
  pad_idx = dataset.vocab.stoi['<PAD>']
  loader = DataLoader(dataset = dataset, 
                      batch_size = batch_size,
                      shuffle =shuffle,
                      num_workers = num_workers,
                      pin_memory = pin_memory,
                      collate_fn = Collate(pad_idx)
                      )
  return loader

**Testing whether dataloader works**

In [125]:
preprocessor = TextPreprocessor(contraction_dictionary = contraction_dictionary)
vocab = CustomVocabulary(preprocessor)
dataloader = prepare_loader(root_path = root_path,
                            file_name = file_name,
                            data_col = text_column,
                            target_col = label_column,
                            batch_size = 32,
                            preprocessor = preprocessor,
                            vocab = vocab,
                            label_mapping = label_mapping,
                            num_workers = 1,
                            shuffle = True,
                            pin_memory = True
                            )


**Inspecting a few batches from the dataloader**

In [130]:
for i, batch in enumerate(dataloader):
  print(batch[0].size())
  print(batch[1].size())
  # observe the first 3 batches, then stop.
  if i == 2:
    break

torch.Size([32, 18])
torch.Size([32])
torch.Size([32, 17])
torch.Size([32])
torch.Size([32, 16])
torch.Size([32])
