<a href="https://colab.research.google.com/github/Redorhcs/CPSC-310-HW2/blob/main/RTS49_Copy_of_cpsc_310_hw_2_bert_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### CPSC 310 Homework 2 - BERT Sentiment Analysis

The purpose of this portion of the assignment is to apply the BERT text classification model to the bulletin board you have been working on.

The following Colab notebook walks you through building a sentiment analysis model trained on a dataset of posts. We will use the Bidirectional Encoder Representations from Transformers (BERT) model and fine-tune it on our data. Before starting the assignment, read about the model [here](https://arxiv.org/pdf/1810.04805.pdf). We will adapt the model to classify posts as positive or negative.

As you work through the notebook, you must fill in certain portions before moving on to the next step. Instructions are provided for each task, and your submitted notebook should contain completed code for every section marked TODO. Submit your modified version of the notebook as part of your assignment.

#### Data Pre-Processing

In [None]:
#import required libraries

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns



from sklearn import preprocessing
from sklearn.model_selection import train_test_split

#transformers
!pip install transformers
from transformers import BertTokenizerFast
from transformers import TFBertModel

import nltk
nltk.download('popular')

#keras (tensorflow is already imported as tf above)
from tensorflow import keras

#metrics
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

#set seed for reproducibility
seed=42
np.random.seed(seed)
tf.random.set_seed(seed)

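#set style for plots
sns.set_style("whitegrid")
sns.despine()
# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in newer Matplotlib releases; use whichever name is available
style_name = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(style_name)
plt.rc("figure", autolayout=True)
plt.rc("axes", labelweight="bold", labelsize="large", titleweight="bold", titlepad=10)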

### TODO

Before writing this step, upload data.csv and test_data.csv to the Colab environment. To do this, click the folder icon on the left-hand side of the notebook and drag/drop or upload both files.

In [None]:
#TODO: load the training and testing datasets into dataframes called train_df and test_df, respectively. Use the same headers as in the csv (name, post, sentiment)

##### Lemmatization and Pre-Processing

In the following blocks, we provide the code to preprocess the posts. We do this to standardize the post formats to make it easier for our classifier to recognize components such as usernames, URLs, and emojis. We also clean up other aspects of posts that make it difficult for them to be processed.

Part of this `preprocess` function uses a lemmatizer, which you can learn more about [here](https://en.wikipedia.org/wiki/Lemmatisation).
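
To make the regex-based cleanup concrete, here is a minimal sketch of the normalization steps alone, using only the standard library. It reuses the same patterns as the `preprocess` function in this notebook but leaves out the emoticon replacement, the lemmatizer, and the short-word filter, so its output will differ from the real function. The helper name `normalize` and the sample post are made up for illustration.

```python
import re

# Same patterns as in the notebook's preprocess function.
URL_PATTERN = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
USER_PATTERN = r"@[^\s]+"
NON_ALNUM_PATTERN = r"[^a-zA-Z0-9]"
REPEAT_PATTERN = r"(.)\1\1+"


def normalize(post):
    """Hypothetical helper: apply only the regex steps to one post."""
    post = post.lower()
    post = re.sub(URL_PATTERN, " URL", post)        # links become the token URL
    post = re.sub(USER_PATTERN, " USER", post)      # @mentions become the token USER
    post = re.sub(NON_ALNUM_PATTERN, " ", post)     # punctuation becomes spaces
    post = re.sub(REPEAT_PATTERN, r"\1\1", post)    # 3+ repeated characters collapse to 2
    return " ".join(post.split())                   # squeeze leftover whitespace


sample = "@alice I looooove this!!! https://example.com/page"
print(normalize(sample))  # USER i loove this URL
```

Note that the placeholder tokens survive in upper case only because lowercasing happens before the substitutions.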

In [None]:
# Defining dictionary containing all emojis with their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

## Defining list containing all stopwords in English.
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

In [None]:
from nltk.stem import WordNetLemmatizer
import re
def preprocess(textdata):
    processedText = []
    
    # Create the lemmatizer.
    wordLemm = WordNetLemmatizer()
    
    # Defining regex patterns.
    urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
    userPattern       = r'@[^\s]+'
    alphaPattern      = "[^a-zA-Z0-9]"
    sequencePattern   = r"(.)\1\1+"
    seqReplacePattern = r"\1\1"
    
    for post in textdata:
        post = post.lower()
        
        # Replace all URLs with 'URL'
        post = re.sub(urlPattern,' URL',post)
        # Replace all emoticons. The post is already lowercased, so match on the lowercased key.
        for emoji in emojis.keys():
            post = post.replace(emoji.lower(), "EMOJI" + emojis[emoji])
        # Replace @USERNAME with 'USER'.
        post = re.sub(userPattern,' USER', post)
        # Replace all non-alphanumeric characters with a space.
        post = re.sub(alphaPattern, " ", post)
        # Replace 3 or more consecutive identical characters with 2.
        post = re.sub(sequencePattern, seqReplacePattern, post)

        postwords = ''
        for word in post.split():
            # Checking if the word is a stopword.
            #if word not in stopwordlist:
            if len(word)>1:
                # Lemmatizing the word.
                word = wordLemm.lemmatize(word)
                postwords += (word+' ')
            
        processedText.append(postwords)
        
    return processedText

In [None]:
#splitting up posts and sentiments for processing
posts, sentiments = list(train_df['post']), list(train_df['sentiment'])
test_posts, test_sentiments = list(test_df['post']), list(test_df['sentiment'])


In [None]:
#TODO: Run the preprocessing step on posts and test_posts, putting the processed text into two variables named processedtext and test_processedtext, respectively.

##### Data Organization

In [None]:
#Split our training dataset into train and validation.

X_train, X_valid, y_train, y_valid = train_test_split(train_df['post'].values, train_df['sentiment'].values, test_size=0.1, stratify=train_df['sentiment'].values, random_state=seed)
X_test, y_test = test_df['post'].values, test_df['sentiment'].values

print('Data Split done.')

Data Split done.


##### One Hot Encoding 
Learn about this step [here](https://en.wikipedia.org/wiki/One-hot#Natural_language_processing).
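
As a rough illustration of what one-hot encoding does to the sentiment labels, the sketch below builds the encoded rows by hand in plain Python. It is not how scikit-learn implements `OneHotEncoder`; the helper name `one_hot` and the sample labels are assumptions made for this example.

```python
def one_hot(labels, classes):
    """Hypothetical helper: map each label to a row with a single 1.0."""
    # Fix the column order once, from a known list of classes.
    column = {cls: i for i, cls in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[column[label]] = 1.0
        rows.append(row)
    return rows


classes = [0, 1]  # 0 = negative, 1 = positive
print(one_hot([0, 1, 1], classes))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

Fixing the class list once and reusing it for every split keeps the column order identical across training, validation, and testing data, even if one split happens to contain only a single class.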

In [None]:
y_train_le = y_train.copy()
y_valid_le = y_valid.copy()
y_test_le = y_test.copy()


In [None]:
ohe = preprocessing.OneHotEncoder()
# Fit the encoder on the training labels only, then reuse it so every split shares the same column order.
y_train = ohe.fit_transform(np.array(y_train).reshape(-1, 1)).toarray()
y_valid = ohe.transform(np.array(y_valid).reshape(-1, 1)).toarray()
y_test = ohe.transform(np.array(y_test).reshape(-1, 1)).toarray()

In [None]:
print(f"TRAINING DATA: {X_train.shape[0]}\nVALIDATION DATA: {X_valid.shape[0]}\nTESTING DATA: {X_test.shape[0]}" )

TRAINING DATA: 900
VALIDATION DATA: 100
TESTING DATA: 10


##### Tokenization

Learn about this step [here](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html).

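You may notice the setup of an `attention_masks` variable below. Posts vary in length (some are only a few characters while others are much longer), so the tokenizer pads every post to the same length, `MAX_LEN`. The attention mask records which positions hold real tokens (1) and which hold padding (0), so that the model can ignore the padding. This ensures that we can pass in all of the data in one fixed-size format.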
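
The relationship between padding and the attention mask can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation: the helper name `pad_and_mask`, the token ids, and the padding id of 0 are assumptions chosen for the example.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Hypothetical helper: pad or truncate to max_len and build the mask."""
    ids = list(token_ids[:max_len])          # truncate anything past max_len
    mask = [1] * len(ids)                    # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


short_post = [101, 2023, 102]                # illustrative token ids only
ids, mask = pad_and_mask(short_post, max_len=6)
print(ids)   # [101, 2023, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Every post ends up with exactly `max_len` entries, and the mask carries the information about where the real content stops.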

In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [None]:
MAX_LEN=128
def tokenize(data,max_len=MAX_LEN) :
    input_ids = []
    attention_masks = []
    for i in range(len(data)):
        encoded = tokenizer.encode_plus(
            data[i],
            add_special_tokens=True,
            max_length=max_len,      # use the function argument, not the global
            truncation=True,         # cut longer posts so every row has max_len entries
            padding='max_length',
            return_attention_mask=True
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
    return np.array(input_ids),np.array(attention_masks)

In [None]:
train_input_ids, train_attention_masks = tokenize(X_train, MAX_LEN)
val_input_ids, val_attention_masks = tokenize(X_valid, MAX_LEN)
test_input_ids, test_attention_masks = tokenize(X_test, MAX_LEN)

#### Model Setup

In [None]:
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

##### End Layer Addition

In the following step, we take the pretrained model and add our own layer to fine-tune BERT for our use case. The key layer to understand here is the output layer, which takes BERT's pooled output and maps it to a probability for each of 2 classes: 0 (negative) or 1 (positive).
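
To see how an output layer turns two raw scores into class probabilities, here is a small standard-library sketch of the softmax function. The scores are invented for illustration, and the helper names `softmax` and `predicted_class` are not part of Keras.

```python
import math


def softmax(scores):
    """Hypothetical helper: turn raw scores into probabilities that sum to 1."""
    largest = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(scores):
    """Index of the most probable class: 0 = negative, 1 = positive."""
    probs = softmax(scores)
    return probs.index(max(probs))


scores = [-1.2, 2.3]                                    # made-up scores for one post
print(softmax(scores))
print(predicted_class(scores))                          # 1
```

The larger score always receives the larger probability, so taking the index of the maximum probability gives the predicted label.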

In [None]:

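def create_model(bert_model, max_len=MAX_LEN):

    ##params###
    # Note: some newer Keras versions no longer accept `decay`; if this line raises an error, remove that argument.
    opt = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-7)
    loss = tf.keras.losses.CategoricalCrossentropy()
    accuracy = tf.keras.metrics.CategoricalAccuracy()

    # Two inputs per post: the token ids and the matching attention mask.
    input_ids = tf.keras.Input(shape=(max_len,),dtype='int32')
    attention_masks = tf.keras.Input(shape=(max_len,),dtype='int32')

    # Index 1 of the BERT output is the pooled representation of the whole sequence.
    embeddings = bert_model([input_ids,attention_masks])[1]

    # Softmax output layer: one probability per class (negative, positive).
    output = tf.keras.layers.Dense(2, activation="softmax")(embeddings)

    model = tf.keras.models.Model(inputs = [input_ids,attention_masks], outputs = output)

    model.compile(opt, loss=loss, metrics=[accuracy])

    return model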

##### Model Creation

In [None]:
model = create_model(bert_model, MAX_LEN)
model.summary()

##### Model Training

Running the code in this section will train the model on our data. Before you run this step, make sure that your runtime in Colab has a GPU assigned to it. To check, run `!nvidia-smi` in a code cell and see whether a GPU is listed. If there is no GPU, follow the tutorial [here](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm) to set it up. You should not have to pay to add a GPU.
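
The GPU check can also be done from Python. The sketch below is one possible approach using only the standard library: it looks for the `nvidia-smi` executable and, if found, asks it to list the GPUs with `nvidia-smi -L`. The helper name `gpu_summary` is hypothetical, and the approach assumes an NVIDIA GPU runtime.

```python
import shutil
import subprocess


def gpu_summary():
    """Hypothetical helper: return nvidia-smi's GPU list, or None if unavailable."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return None                                     # tool not installed, so no GPU runtime
    try:
        result = subprocess.run(
            [executable, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return None                                     # the tool exists but could not be run
    if result.returncode != 0:
        return None
    return result.stdout.strip()


summary = gpu_summary()
print(summary if summary else "No GPU detected; change the runtime type before training.")
```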

This step will take a while (up to 20 minutes), so don't be worried if it is running slower than expected. 

In [None]:
history_bert = model.fit([train_input_ids,train_attention_masks], y_train, validation_data=([val_input_ids,val_attention_masks], y_valid), epochs=10, batch_size=32)

#### Model Use and Evaluation

In [None]:
#Run the prediction on our testing dataset
result_bert = model.predict([test_input_ids,test_attention_masks])


['"Decrease in crime and increase in public safety under Trentino\'s strong law enforcement policies."', '"Trentino\'s leadership successful in improving standard of living for citizens through job creation and economic growth."', '"Trentino\'s government improves access to education and healthcare for all citizens, regardless of their income."', '"Trentino\'s government improves the lives of veterans and elderly by providing support and benefits."', '"Sylvanian cuisine has a wide range of delicious and traditional dishes."']


In [None]:
#TODO: Using the predictions stored in result_bert, write a Python script to export a csv containing only posts in the testing set our model deems to be positive. Save this csv and upload it as part of your submission.

In [None]:
#Visualize our accuracy on the testing set. 

y_pred_bert =  np.zeros_like(result_bert)
y_pred_bert[np.arange(len(y_pred_bert)), result_bert.argmax(1)] = 1

def conf_matrix(y, y_pred, title):
    fig, ax =plt.subplots(figsize=(5,5))
    labels=['Negative', 'Positive']
    ax=sns.heatmap(confusion_matrix(y, y_pred), annot=True, cmap="Blues", fmt='g', cbar=False, annot_kws={"size":25})
    plt.title(title, fontsize=20)
    ax.xaxis.set_ticklabels(labels, fontsize=17) 
    ax.yaxis.set_ticklabels(labels, fontsize=17)
    ax.set_ylabel('Test', fontsize=20)
    ax.set_xlabel('Predicted', fontsize=20)
    plt.show()
conf_matrix(y_test.argmax(1), y_pred_bert.argmax(1),'BERT Sentiment Analysis\nConfusion Matrix')
print('\tClassification Report for BERT:\n\n',classification_report(y_test,y_pred_bert, target_names=['Negative','Positive']))


#TODO: Screenshot the output of this code block and upload it as part of your submission