## Predicting TikTok Sentiment Analysis Based on Google Play Store Reviews
### Group: Phishers
### Jacob He, Justin Nakatsu, Rebekah Wong

### Dataset
In order to run our code on the TikTok dataset, please download the following input files first:
https://vault.sfu.ca/index.php/s/8F2V1j01WcERKyQ

For running the code, after downloading the data and the source code, in the source directory set up the virtual environment with 

python3 -m venv venv

source venv/bin/activate

pip3 install -r requirements.txt

Then with the dataset in an input folder you can run any of stats.py, bertbase.py, bertTraining.py, lexicalBase.py, lexicalBootstrap.py

In [None]:
!pip3 install -r ../requirements.txt

Our BERT tests were adapted following Ataie's (2022)'s paper and the bootstrapping algorithm was coded following the algorithm from Volkova et al.'s paper from 2013.

## Stats.py
In the stats.py file we first do some analysis on our data.

In [3]:
import numpy as np
import pandas as pd
from pathlib import Path 
filepath = Path('../input/reviews_evened.csv')  
df = pd.read_csv('../input/tiktok_google_play_reviews.csv')

We read our csv data file from an input folder and will output another data file to the same folder later.

In [4]:
def to_sentiment(rating):
    rating = int(rating)
    
    # Convert 5 star scale rating to sentiment
    if rating <= 2:
        return 0
    elif rating == 3:
        return 1
    else:
        return 2

Since our dataset has the ratings on a 5 star scale, but we only want to determine if a review is positive or negative we convert it to 0 for negative, 1 for neutral, and 2 for positive

In [5]:
df['sentiment'] = df.score.apply(to_sentiment)

In [6]:
def to_posneg(sentiment):
    sentiment = int(sentiment)
    
    # Convert the sentiment to readable class names
    if sentiment == 0:
        return "negative"
    elif sentiment == 1:
        return "neutral"
    else:
        return "positive"

We add a sentiment column for faster comparisin and a class column to make it more readable.

In [15]:
df['class'] = df.sentiment.apply(to_posneg)
print(df[["content","class","sentiment"]].sample(n=20))

                                                  content     class  sentiment
114431                                                 Hh  positive          2
265589                                        Very niceee  positive          2
35846   I love everything about TicTok except for the ...  positive          2
79216                                     Just phenomenal  positive          2
166049                                               Nice  positive          2
47727                                I really like Tiktok  positive          2
181759  I am happy too this apps use because my tiktok...  positive          2
33278                                                Good  positive          2
123126                                               Good  negative          0
267109  Tiktok is a nice place where you do anything t...  negative          0
154592                                       Tiktok Thank  positive          2
50060                                  Good app I lo

In [10]:
scorecount = [0,0,0,0,0]
def count_rating(rating):
    rating = int(rating)
    scorecount[rating-1] = scorecount[rating-1] + 1

We get some data about how many reviews there are for each star rating.

In [14]:
df.score.apply(count_rating)
total = sum(scorecount)

Then print out the counts of the ratings.

In [13]:
cap = max(scorecount)                                               #visualization of rating counts from the entire dataset
print("rating \t|     count\t|     bar")
for i in range(5):
    print("\t|\t\t|")
    num = 100 * scorecount[i]/cap
    num = round(num) 
    out = "  -"
    for j in range(num):
        out = out + "-"
    print("  "+str(i+1)+"\t|     "+str(scorecount[i]) + "\t|"+ out)

rating 	|     count	|     bar
	|		|
  1	|     74160	|  -----------------
	|		|
  2	|     15666	|  ----
	|		|
  3	|     23304	|  ------
	|		|
  4	|     35970	|  ---------
	|		|
  5	|     465014	|  -----------------------------------------------------------------------------------------------------


Then we can see the counts of each of the classes.

In [21]:
negative = scorecount[0] + scorecount[1]
neutral = scorecount[2]
positive = scorecount[3] + scorecount [4]
print(f'counts of positive = {positive}, neutral = {neutral}, negative = {negative}')
print(f'distribution is {100*positive/total}% positive {100*neutral/total}% neutral {100*negative/total}% negative')
print(f'negative/positive = {negative/positive}')

counts of positive = 751476, neutral = 34956, negative = 134739
distribution is 81.57833887519256% positive 3.7947351794617936% neutral 14.626925945345652% negative
negative/positive = 0.17929913929386967


The distribution is heavily skewed to positive reviews which helped the BERT model more than our LBA model we can adjust it to be more even.

In [22]:
df = df.drop(df[df['sentiment'] == 2].sample(frac=positive/total).index)      #remove some random positive reviews to have an even distribution for positive and negative 

Now counting again.

In [23]:
scorecount = [0,0,0,0,0]
df.score.apply(count_rating)
total = sum(scorecount)
negative = scorecount[0] + scorecount[1]
neutral = scorecount[2]
positive = scorecount[3] + scorecount [4]
print(f'counts of positive = {positive}, neutral = {neutral}, negative = {negative}')
print(f'distribution is {100*positive/total}% positive {100*neutral/total}% neutral {100*negative/total}% negative')
print(f'negative/positive = {negative/positive}')

counts of positive = 46145, neutral = 11652, negative = 44913
distribution is 44.9274656800701% positive 11.34456236004284% neutral 43.727971959887064% negative
negative/positive = 0.9733015494636472


The distribution is much more even. Then we can save the evened out data set to "../input/reviews_evened.csv" if we want.
 

In [24]:
# filepath.parent.mkdir(parents=True, exist_ok=True)  #save evened distribution to a file
# df.to_csv(filepath) 

## lexicalBase.py
In lexicalBase.py we do a basic lexical based approach to sentiment analysis.

We do some additional imports and get the VADER lexicon and put it in the "input" folder

In [37]:
from sklearn.metrics import confusion_matrix, classification_report
import nltk
nltk.download('vader_lexicon',download_dir="../input")
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to ../input...
[nltk_data]   Package vader_lexicon is already up-to-date!


We read the evened dataset.

In [38]:
df = pd.read_csv('../input/reviews_evened.csv')
# df = pd.read_csv('../input/tiktok_google_play_reviews.csv')
# df = df.sample(n=1245) #cut dataset down for testing

df = df.dropna(subset='content',axis=0) 

In [39]:
def sentiment_analyzer_score(sentence):
    score = analyser.polarity_scores(sentence)

We then tokenize our reviews with the nltk tokenizer.

In [40]:
tokenizer = RegexpTokenizer(r'\w+')
words_descriptions = df['content'].apply(tokenizer.tokenize)

Compute the polarity with VADER and convert it into a sentiment value.

In [41]:
df['scores'] = df['content'].apply(lambda review: analyser.polarity_scores(review))
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
def Sentimnt(x):
    if x>= 0.01:
        return "positive"
    elif x<= -0.01:
        return "negative"
    else:
        return "neutral"
df['Sentiment'] = df['compound'].apply(Sentimnt)

Then see how well it did.

In [51]:
print(df[["content","class","Sentiment"]].sample(n=20))

                                                 content     class Sentiment
99561  Good but tik tok banned me for good for no reason  negative  negative
71594                         How to earning on tik tok?  negative   neutral
72879                                Follow back problem  negative  negative
17023  Its is very entertaining.it consumes alot of d...   neutral  positive
15133                                               op,,  positive   neutral
10101                                                Sei  positive   neutral
46729                                     mohmamad faydu  negative   neutral
7902   Sharing Ticktoks doesn't work. When I get them...  negative  positive
64416  Not good because my account is freeze or not v...  negative  negative
28278                                              Great  positive  positive
35516  So I post videos and and I let my inbox get 99...  negative  negative
41381                                               Good  positive  positive

In [47]:
class_names = ['Negative', 'Neutral', 'Positive']
print(classification_report(df["class"], df["Sentiment"], target_names=class_names))

              precision    recall  f1-score   support

    Negative       0.78      0.30      0.43     44912
     Neutral       0.11      0.30      0.16     11652
    Positive       0.60      0.72      0.66     44912

    accuracy                           0.48    101476
   macro avg       0.50      0.44      0.42    101476
weighted avg       0.63      0.48      0.50    101476



Looking at the output, the LBA model seems to struggle with spelling errors and slang words like "lit" which would be out of the lexicon's vocabulary.

## bertbase.py
In bertbase.py we just take a pretrained model and try to run it on our dataset.

In [52]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification 
import torch 

  from .autonotebook import tqdm as notebook_tqdm


We get our tokenizer and pretrained model and our dataset.

In [54]:
MAX_LEN = 150
RANDOM_SEED = 2

tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment') 
 
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
df = pd.read_csv('../input/tiktok_google_play_reviews.csv')
df = df.dropna(subset='content',axis=0) 

We convert the 5 star rating into positive negative and neutral sentiments.

In [56]:
def to_sentiment(rating):
    rating = int(rating)
    
    if rating <= 2:
        return -1
    elif rating == 3:
        return 0
    else:
        return 1

# Apply to the dataset 
df['sentiment'] = df.score.apply(to_sentiment)

In [59]:
def compute_sentiment(review):
    tokens = tokenizer.encode(review, return_tensors='pt') 
    result = model(tokens) 
    temp = 0 
    temp = int(torch.argmax(result.logits))+1 
    if temp == 1 or temp == 2: 
        return -1 
    elif temp == 4 or temp == 5: 
        return 1 
    else: 
        return 0 
    
class_names = ['Negative', 'Neutral', 'Positive']
def to_class(sentiment):
    if sentiment == -1:
        return 'Negative'
    elif sentiment == 0:
        return 'Neutral'
    else:
        return 'Positive'

We can cut down the dataset for testing

In [60]:
df = df[:12495]

Then we compute the sentiment using BERT pretrained.

In [61]:
df['computed_sentiment'] = df.content.apply(lambda x: compute_sentiment(x[:MAX_LEN]))
df['computed_sentiment'] = df.computed_sentiment.apply(to_class)
df['sentiment'] = df.sentiment.apply(to_class)
print(df[["content","sentiment","computed_sentiment"]].head)

<bound method NDFrame.head of                                                  content sentiment  \
0                                                   Good  Positive   
1      Awesome app! Too many people on it where it's ...  Positive   
2                                                Not bad  Positive   
3                                             It is good  Negative   
4                                   Very interesting app  Positive   
...                                                  ...       ...   
12490  Very good app I recommend everyone to use this up  Positive   
12491                               L think that is well  Positive   
12492                             Utter trash for babies  Negative   
12493                                        Good please   Neutral   
12494                                     Pakistani 💪💪💪💪  Positive   

      computed_sentiment  
0               Positive  
1               Positive  
2                Neutral  
3               Posit

Then we look how the BERT pretrained model preformed.

In [62]:
print(classification_report(df["sentiment"], df["computed_sentiment"], target_names=class_names))

              precision    recall  f1-score   support

    Negative       0.46      0.51      0.49      1931
     Neutral       0.11      0.31      0.16       486
    Positive       0.90      0.80      0.85     10078

    accuracy                           0.73     12495
   macro avg       0.49      0.54      0.50     12495
weighted avg       0.80      0.73      0.76     12495



Overall, it did very well but for the negative and neutral reviews the F1 score is much worse. This is probably due to the dataset being heavily skewed towards positive reviews so that is fixed in stats.py, and used in reviews_evened.csv.

## lexicalBootstrap.py
In lexicalBootstrap.py we use the same lexical approach in lexicalBase.py but we implement bootstrapping to alter the lexicon beforehand to include some out of vocabulary words which may help with sentiment analysis.

In [65]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from collections import defaultdict
import nltk
nltk.download('vader_lexicon',download_dir="../input")
from nltk.sentiment.vader import SentimentIntensityAnalyzer
RANDOM_SEED=42
analyser = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to ../input...
[nltk_data]   Package vader_lexicon is already up-to-date!


We get the lexicon dictionary out of the sentiment analyser.

In [66]:
lex_dict = analyser.make_lex_dict()

We then tokenize the reviews and look at how many different tokens we have.

In [68]:
df = pd.read_csv('../input/reviews_evened.csv')

df = df.dropna(subset='content',axis=0) 
class_names = ['Negative', 'Neutral', 'Positive']

df_train, df_test = train_test_split(df, test_size=0.3, random_state=RANDOM_SEED)

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
words_descriptions = df_train['content'].apply(tokenizer.tokenize)

all_words = [word for tokens in words_descriptions for word in tokens]
df_train['description_lengths']= [len(tokens) for tokens in words_descriptions]

VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))

from collections import Counter
count_all_words = Counter(all_words)

698880 words total, with a vocabulary size of 43805


We set up converting the polarity of a review to the sentiment of the review.

In [70]:
def sentiment_analyzer_score(sentence):
    score = analyser.polarity_scores(sentence)
    # print("{:-<150} {}".format(sentence, str(score)))
sentimentThresh = 0.05
def Sentimnt(x):
    if x >= sentimentThresh:
        return "positive"
    elif x <= -sentimentThresh:
        return "negative"
    else:
        return "neutral"

Then we hand coded the bootstrapping function following the algorithm listed in the Volkova et al paper which updates the lexicon with extra words and their polarity.

In [71]:
def defaultval():
    return [0,0,0,0,0]
def bootstrap():
    itercap = 5
    falloff = 0.007 #tentative parameters
    subjectiveThresh = 0.2
    minCount = 50
    numAdditionsPerIter = 150

    iter = 0
    stop = False

    while not stop:
        additions_to_lex = {}
        unknownwords = set(all_words) - set(lex_dict)
        unkWordDict = defaultdict(defaultval)# [key] = [subjectivity prob, positivity prob, subjectivity count, positivity count, count of appearances]
        unkWordDict.clear()
        sdict = {}
        sdict.clear()
        print("\niteration",iter+1)
        # print("num of unknown words:",len(unknownwords))
        for unk in unknownwords:
            if count_all_words[unk] > minCount:
                for sentence in words_descriptions:
                    if unk in sentence:
                        for word in sentence:
                            subjective = False
                            # positive = False
                            if word in lex_dict:
                                # if lex_dict[word] > 0:
                                #     positive = True
                                subjective = True
                        if subjective == True:
                            if (analyser.polarity_scores(''.join(sentence)))['compound']>sentimentThresh: #if the analyser says the sentence is positive
                                unkWordDict[unk][3] = unkWordDict[unk][3] + 1   #mark the positivity count
                            unkWordDict[unk][2] = unkWordDict[unk][2] + 1
                            
                        unkWordDict[unk][4] = unkWordDict[unk][4] + 1   #mark the count of appearances
                unkWordDict[unk][0] = unkWordDict[unk][2] / unkWordDict[unk][4] - iter*falloff #subjectivity prob
                if(unkWordDict[unk][2] == 0):
                    unkWordDict[unk][1]=0
                else:
                    unkWordDict[unk][1] = unkWordDict[unk][3] / unkWordDict[unk][2] #positivity prob
        sdict = dict(sorted(unkWordDict.items(), key=lambda item: item[1][0]))   #sort by subjectivity
        k = 0
        
        for item in reversed(sdict.items()):    #iterate through decreasing subjectivity
            # print(item)
            if item[1][0] > subjectiveThresh:   #check against the minimum subjectivity allowed
                if item[1][1] > 0.5:
                    additions_to_lex[item[0]] = 1   #set the polarity
                else:
                    additions_to_lex[item[0]] = -1
            k = k+1
            if k >= numAdditionsPerIter:
                # print("last subjectivity:",item[1][0])
                break
                                    
        iter = iter +1
        print("num additions to lexicon:", len(additions_to_lex))
        if len(additions_to_lex) == 0 or iter>=itercap:
            stop=True
        else:
            lex_dict.update(additions_to_lex)
            analyser.lexicon.update(additions_to_lex)

bootstrap()


iteration 1
num additions to lexicon: 113

iteration 2
num additions to lexicon: 84

iteration 3
num additions to lexicon: 110

iteration 4
num additions to lexicon: 150

iteration 5
num additions to lexicon: 150


Then we run sentiment analysis on the data with our new bootstrapped lexicon.

In [72]:
df_test['scores'] = df_test['content'].apply(lambda review: analyser.polarity_scores(review))
df_test['compound']  = df_test['scores'].apply(lambda score_dict: score_dict['compound'])
df_test['Sentiment'] = df_test['compound'].apply(Sentimnt)

In [88]:
print(df_test[["content","class","Sentiment"]].sample(n=20))

                                                 content     class Sentiment
52712                                      Vast nice app  negative  positive
16529  Is your app not compatible with Android anymor...  negative  negative
57186                                     Bts funny pics  negative  positive
57991                                          Goodd app  negative   neutral
48440                                                  👍   neutral   neutral
97507                               I love tikto so much  positive  positive
83749  If they don't fix the problem I will delete th...  negative  negative
67494                                  Gander change app  negative   neutral
87470                                    Tiktok is great  positive  positive
85879  I love it but i put the wrong birth date and n...  negative  negative
12750                                            K death  negative  negative
60244                                    For you problem  negative  negative

Even with bootstrapping the LBA model hase some trouble with all the misspelled words and out of vocabulary words.

In [89]:
print(classification_report(df_test["class"], df_test["Sentiment"], target_names=class_names))

              precision    recall  f1-score   support

    Negative       0.63      0.52      0.57     13421
     Neutral       0.10      0.21      0.14      3434
    Positive       0.67      0.61      0.64     13588

    accuracy                           0.53     30443
   macro avg       0.47      0.45      0.45     30443
weighted avg       0.59      0.53      0.55     30443



LBA with bootstrapping did better than it's base LBA model however it is still behind the BERT model's F1 scores and accuracy.

## bertTraining.py
In this file we start with a pretrained BERT model but then do further finetuning on our dataset. The base BERT model already outpreformed our LBA with bootstrapping but the information provided from this model will still be useful and interesting.

In [4]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict

# Torch ML libraries
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader

# Misc.
import warnings
warnings.filterwarnings('ignore')

# Set intial variables and constants

# Random seed for reproducibilty
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Set GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu") # or force torch to run on cpu

df = pd.read_csv('../input/reviews_evened.csv')
# df = df[:1249] #cut dataset down for testing
df = df[:12495] #cut dataset down for testing

df = df.dropna(subset='content',axis=0) 

We do the same kind of setup as the base BERT, and check the count for each class.

In [5]:
sentimentcount = [0,0,0]
class_names = ['negative', 'neutral', 'positive']
def to_sentiment(rating):
    rating = int(rating)
    # Convert to class
    if rating <= 2:
        sentimentcount[0] = sentimentcount[0]+1
        return 0
    elif rating == 3:
        sentimentcount[1] = sentimentcount[1]+1
        return 1
    else:
        sentimentcount[2] = sentimentcount[2]+1
        return 2

# Apply to the dataset 
df['sentiment'] = df.score.apply(to_sentiment)
print(sentimentcount)

[586, 137, 526]


We then retrieve the pretrained model and tokenize out reviews.

In [6]:
# Set the model name
MODEL_NAME = 'bert-base-cased'

# Build a BERT based tokenizer
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

# Store length of each review 
token_lens = []

# Iterate through the content slide
for txt in df.content:
    tokens = tokenizer.encode(txt, max_length=512)
    token_lens.append(len(tokens))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Then the data is set up and split into test, development, and training sets.

In [7]:
MAX_LEN = 160

class GPReviewDataset(Dataset):
    # Constructor Function 
    def __init__(self, reviews, targets, tokenizer, max_len):
        self.reviews = reviews
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    # Length magic method
    def __len__(self):
        return len(self.reviews)
    
    # get item magic method
    def __getitem__(self, item):
        review = str(self.reviews[item])
        target = self.targets[item]
        
        # Encoded format to be returned 
        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        
        return {
            'review_text': review,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }
df_train, df_test = train_test_split(df, test_size=0.3, random_state=RANDOM_SEED)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=RANDOM_SEED)

def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = GPReviewDataset(
        reviews=df.content.to_numpy(),
        targets=df.sentiment.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=0
    )
    
BATCH_SIZE = 16
train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)

data = next(iter(train_data_loader))

Then the model is defined.

In [8]:
bert_model = BertModel.from_pretrained(MODEL_NAME)

class SentimentClassifier(nn.Module):
    
    # Constructor class 
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
    
    # Forward propagaion class
    def forward(self, input_ids, attention_mask):
        _, pooled_output = self.bert(
          input_ids=input_ids,
          attention_mask=attention_mask,
          return_dict=False
        )
        #  Add a dropout layer 
        output = self.drop(pooled_output)
        return self.out(output)
    
model = SentimentClassifier(len(class_names))
model = model.to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predic

The hyperparameters are then defined, we did not change these from the default as the BERT baseline already outpreformed our lexical based approach with bootstrapping.

In [9]:
# Number of iterations 
EPOCHS = 10      

# Optimizer Adam 
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)

total_steps = len(train_data_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

# Set the loss function 
loss_fn = nn.CrossEntropyLoss().to(device)

The training model is defined.

In [10]:
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    model = model.train()
    losses = []
    correct_predictions = 0
    
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        
        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        
        # Backward prop
        loss.backward()
        
        # Gradient Descent
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    
    return correct_predictions.double() / n_examples, np.mean(losses)

def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()
    
    losses = []
    correct_predictions = 0
    
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            
            # Get model ouptuts
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
            )
            
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
            
    return correct_predictions.double() / n_examples, np.mean(losses)

Then the BERT model with training is run.

In [17]:
history = defaultdict(list)
best_accuracy = 0

for epoch in range(EPOCHS):
    
    # Show details 
    print(f"Epoch {epoch + 1}/{EPOCHS}")
    print("-" * 10)
    
    train_acc, train_loss = train_epoch(
        model,
        train_data_loader,
        loss_fn,
        optimizer,
        device,
        scheduler,
        len(df_train)
    )
    
    print(f"Train loss {train_loss} accuracy {train_acc}")
    
    # Get model performance (accuracy and loss)
    val_acc, val_loss = eval_model(
        model,
        val_data_loader,
        loss_fn,
        device,
        len(df_val)
    )
    
    print(f"Val   loss {val_loss} accuracy {val_acc}")
    print()
    
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)
    
    # If we beat prev performance
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS

With the full dataset, running on a Nvidia RTX 3070ti the training takes more than 4 hours to complete. After waiting we then can get the evaluation of the model.

In [44]:
test_acc, _ = eval_model(
  model,
  test_data_loader,
  loss_fn,
  device,
  len(df_test)
)

test_acc.item()

def get_predictions(model, data_loader):
    model = model.eval()

    review_texts = []
    predictions = []
    prediction_probs = []
    real_values = []

    with torch.no_grad():
        for d in data_loader:
            texts = d["review_text"]
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)

            # Get outouts
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)

            review_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(outputs)
            real_values.extend(targets)

    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu()
    real_values = torch.stack(real_values).cpu()

    return review_texts, predictions, prediction_probs, real_values

y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(
    model,
    test_data_loader
)

We can also print out some of the results looking at the real class and comparing with the computed sentiment.

In [45]:
px["content"] = pd.DataFrame(y_review_texts)
px["class"] = pd.DataFrame(y_test)
px["sentiment"] = pd.DataFrame(y_pred)
def to_class(sentiment):
    if sentiment == 0:
        return 'Negative'
    elif sentiment == 1:
        return 'Neutral'
    else:
        return 'Positive'
px['class'] = px["class"].apply(to_class)
px['sentiment'] = px["sentiment"].apply(to_class)
print(px[["content","class","sentiment"]].sample(n=20))

                                               content     class sentiment
165                                                 ✌️  Positive  Positive
90                                                 😒😒😒  Negative  Positive
64   I love tiktok @rabin___officail please support...  Positive  Positive
171                 Wow so cute dxt plz bro support me  Negative  Positive
122                                           Good app  Negative  Positive
51                                                 ❤❤❤   Neutral  Positive
187                                     i love tik tok  Positive  Positive
52                                               Great  Positive  Positive
145  I love posting my own content that I enjoy, it...  Positive   Neutral
76   It's a good app... but everytime I change my p...  Positive   Neutral
80                                            Best app  Positive  Positive
66                                            বাংলাদেশ  Positive  Positive
7                        

The results appear much better than the LBA model, some of the emojis have been classified instead of just being neutral for all emojis. We can also get the classification report.

In [46]:
print(classification_report(y_test, y_pred, target_names=class_names))

              precision    recall  f1-score   support

    negative       0.64      0.66      0.65        92
     neutral       0.29      0.25      0.27        20
    positive       0.64      0.63      0.64        76

    accuracy                           0.61       188
   macro avg       0.52      0.51      0.52       188
weighted avg       0.60      0.61      0.60       188



The BERT model with finetuning preforms better than just the pretrained model by itself when they are both using the same dataset, but both preform much better on the original skewed dataset. This may be because the BERT model is able to assign bias weight values to each of the classes which allows it to have a much better accuracy when the data is skewed.

# Analysis
Both of the BERT models did much better at the classification task than either of the LBA models. The very challenging dataset full of typos and slang seemed to affect both models significantly, but especially affected the LBA models. In the LBA, out of vocabulary words are just given neutral values and that is all. With bootstrapping, the lexicon is able to pick up a few out of vocabulary words and add them to the lexicon but this is not enough to match either of the BERT tests. The BERT model is able to look at the context surrounding the out of vocabulary words to gain information about it, but the BERT model does not even have an explicit vocabulary in the first place which only helps BERT for this dataset. Finetuning on the dataset does help the accuracy of the BERT model, but the final accuracy is still far behind the normal results when tested on a "clean" dataset. This is expected that both approaches run worse on the very flawed dataset and we found that BERT is much better at adapting to this dataset than LBA.