# **Lab 7: Convolutional Network Architectures - Brain Tumor MRI Images**

- Reece Iriye: 48255107
- Eileen Garcia: 48241821
- Trevor Dohm: 48376059

## **0: Imports**

In [39]:
# Standard Library
import re

# Third-Party Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Keras Components
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Dense, Dropout, Embedding,
                                     GlobalAveragePooling1D, Input, Layer,
                                     LayerNormalization, MultiHeadAttention)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

## **1: Data Preparation and Preprocessing**

### **1.1: Preparation**

First we need to load the training dataset ('training.1600000.processed.noemoticon.csv') using Pandas and display the first few rows. Displaying the first few rows helps us understand the data structure and content that we will work with. 

The output shows that each row in the collection of tweets includes a sentiment identifier, an id, a date, a flag, a user, and the tweet text. 

In [40]:
# Load Datasets
dataset = 'Dataset/training.1600000.processed.noemoticon.csv'
data = pd.read_csv(dataset, header=None, names=["sentiment", "id", "date", "flag", "user", "text"], encoding = 'ISO-8859-1')

# Display First Few Rows Of The Dataset
data.head()

| | sentiment | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


The target "sentiment" identifier represents the polarity of the tweet, where 0 is negative, and 4 is positive. 

The "id" column is the unique identifier of the tweet. 

The "date" is the date that the tweet was published. 

The "flag" represents the query, and if there is no query, the value is "NO_QUERY".

The "user" column represents the username that published the tweet.

The "text" is the text that was posted as a tweet. 

Since a tweet may include mentions of other users, URLs, special characters, and numbers, we need to remove these markers to standardize the data. 

In the code below, we remove the mentions (formatted as usernames starting with @), URLs, special characters, and numbers, to leave only alphabetic characters and whitespace. 

The text is also converted to lower case, and leading and trailing whitespace is removed. This process further standardizes the text data and results in a new column "clean_text" that contains the cleaned versions of the tweets.

In [37]:
# Clean Text Regex
def clean_text(text):

    # Remove Mentions, URL
    text = re.sub(r'(@[A-Za-z0-9_]+)|(\w+:\/\/\S+)', ' ', text)

    # Remove Special Characters, Numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Convert To Lower Case
    text = text.lower().strip()

    # Return Cleaned Text
    return text

# Apply Cleaning Function To Text Column
data['clean_text'] = data['text'].apply(clean_text)

# Explore Target Column
target_counts = data['sentiment'].value_counts()

# Display First Few Rows Of Cleaned Text, Target Distribution
clean_text_head = data[['clean_text', 'sentiment']].head()
target_counts, clean_text_head

(sentiment
 0    800000
 4    800000
 Name: count, dtype: int64,
                                           clean_text  sentiment
 0  awww thats a bummer  you shoulda got david car...          0
 1  is upset that he cant update his facebook by t...          0
 2  i dived many times for the ball managed to sav...          0
 3     my whole body feels itchy and like its on fire          0
 4  no its not behaving at all im mad why am i her...          0)

After cleaning, we explored the target column: sentiment. The values in the column are counted to help understand the distribution of sentiments in the dataset. The output shows that there are exactly 800,000 tweets categorized as negative (0), and the same amount as positive (4). The sentiment classes in this dataset are perfectly balanced. 

Finally we display the first few rows of the text along with their sentiments. We see a few tweets categorized as "negative". 

The data is almost ready for tokenization, vectorization, and feeding it into a neural network for sentiment classification. 
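
As a quick sanity check on the cleaning step, the sketch below repeats the same two regular expressions in a standalone function and applies it to two of the sample tweets shown earlier. The sample strings are shortened versions of the displayed rows and are used only for illustration. Note that because punctuation is deleted rather than replaced with a space, contractions such as "can't" collapse into "cant".

```python
import re

def clean_text(text):
    # Replace mentions (@user) and URLs (scheme://...) with a space
    text = re.sub(r'(@[A-Za-z0-9_]+)|(\w+:\/\/\S+)', ' ', text)
    # Delete everything that is not a letter or whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Lower-case and trim leading/trailing whitespace
    return text.lower().strip()

# Shortened sample tweets, for illustration only
sample_a = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
sample_b = "is upset that he can't update his Facebook"

cleaned_a = clean_text(sample_a)
cleaned_b = clean_text(sample_b)
print(cleaned_a)
print(cleaned_b)
```

Interior runs of whitespace are not collapsed by this function, which is why some cleaned tweets in the output above contain double spaces. The tokenizer used later splits on whitespace, so this should not affect the resulting sequences.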

### **1.2: Choosing an Evaluation Metric**

Our goal is to gauge sentiment toward a specific product using a model trained on tweets about a wide variety of topics, so selecting appropriate evaluation metrics is crucial to ensure the model's reliability and practical usability. Because the goal of sentiment analysis is to gauge public sentiment accurately, a combination of Accuracy, F1-Score, and a Confusion Matrix offers a comprehensive evaluation approach.

Accuracy is a straightforward measure of how often the model predicts correctly. In a dataset with a balanced class distribution, as in this case where there are 800,000 positive tweets and 800,000 negative tweets, accuracy is a relevant metric because it cannot be inflated by always predicting the majority class. A high accuracy rate on a balanced dataset means the model performs well across both positive and negative sentiments, which is essential for businesses to accurately assess public opinion.

F1-Score is particularly important in sentiment analysis because it is the harmonic mean of precision and recall. Precision is the share of tweets predicted as positive that are actually positive, and recall is the share of actually positive tweets that the model identifies. Both matter in a business context, where falsely reporting positive sentiment and missing genuine positive sentiment are both costly. A high F1-Score indicates that the model captures most of the relevant sentiment while maintaining a low rate of false positives, which is vital for creating a reliable sentiment analysis tool.

A Confusion Matrix provides detailed insight into the model's performance by showing the counts of true positives, false positives, true negatives, and false negatives. This level of detail is valuable because it breaks the summary metrics down into the specific kinds of error behind them. It shows how the positive and negative predictions compare with the actual labels, so we can see exactly where the model goes wrong instead of relying on a single number. For instance, a high number of false positives (negative tweets predicted as positive) would indicate that the model is underestimating negative sentiment, which could be critical if this model were applied to a context like customer service. 

For the model to be broadly applicable, it needs to perform well on all three of these measures, not just one.
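
To make the relationship between these metrics concrete, here is a minimal sketch that derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. The label lists are made-up toy values, and the helper names `confusion_counts` and `summarize` are hypothetical. In practice, `sklearn.metrics` provides `accuracy_score`, `f1_score`, and `confusion_matrix` for the same purpose.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def summarize(y_true, y_pred):
    """Derive accuracy, precision, recall, and F1 from the counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels for illustration only (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(y_true, y_pred)
print(counts)
print(metrics)
```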

### **1.3: Choosing a Method for Splitting Our Data**


The classes are evenly distributed: exactly 50% of the tweets are positive and 50% are negative. We will therefore use a simple shuffled 80-20 train/test split. With this much data spread evenly across two classes, a random split is extremely unlikely to produce a meaningful class imbalance in either subset. 

With 1.6 million tweets evenly split between positive and negative sentiments, our dataset is substantial enough to allow for an 80-20 split without risking the loss of representativeness in either the training or testing sets. This large volume of data means that both subsets (training and testing) will very likely maintain nearly the same distribution of sentiments as the original dataset.

Keeping both subsets balanced is crucial in training the model to perform equally well on both classes of sentiment, which is essential for a business application where understanding both positive and negative consumer sentiments is vital. We rely on the size of the dataset rather than explicit stratification to preserve this balance.

By allocating 80% of the data to training, we ensure that the model has enough examples to learn from, which is crucial for developing a strong and flexible sentiment analysis model. The remaining 20% for testing is also substantial enough to reliably evaluate the model's performance across a wide range of examples, ensuring that the model's accuracy and generalizability are well-tested.

In the code below, we perform the train/test split as described above. 

First, we re-encode the labels. Originally, a 0 represents negative sentiment and 4 represents positive sentiment. We convert these sentiment labels into a binary format where 1 represents positive sentiments and 0 remains as the negative sentiment. The new encodings are stored in a new column called "target_encoded".

We then set the "clean_text" column containing the processed text of tweets as the features. We set the new "target_encoded" column representing the binary sentiment labels as the labels. 

As described above, we perform an 80/20 train/test split with a fixed `random_state` to ensure reproducibility of the split. 

Finally, we check the shapes of the training and testing sets.

In [38]:
# Encode Labels: Convert 4 -> 1 For Positive Sentiment
data['target_encoded'] = data['sentiment'].apply(lambda x: 1 if x == 4 else 0)

# Split Data Into Features (Text) And Labels
features = data['clean_text']
labels = data['target_encoded']

# Perform 80 / 20 Train Test Split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2, random_state = 42)

# Check Shapes
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

((1280000,), (320000,), (1280000,), (320000,))

The shape output tells us the following:

- There are 1,280,000 samples in the training set for features (clean_text).
- There are 320,000 samples in the testing set for features.
- There are 1,280,000 labels corresponding to the training set.
- There are 320,000 labels corresponding to the testing set.

The data has been successfully split into training and testing sets with the intended proportion, and each feature has a corresponding label in both sets. 
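
The claim that a plain random split preserves the class balance can be checked with a small simulation. The sketch below uses only the standard library and a synthetic list of 100,000 balanced labels, not the actual tweet data, so it illustrates the argument rather than verifying the real split. If an exact balance were required, `train_test_split` also accepts a `stratify` argument.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Synthetic, perfectly balanced labels (assumption: stands in for the real data)
labels = [0] * 50_000 + [1] * 50_000
random.shuffle(labels)

# 80/20 split on the shuffled labels
cut = int(0.8 * len(labels))
train_labels, test_labels = labels[:cut], labels[cut:]

# Fraction of positive labels in each subset
train_pos = sum(train_labels) / len(train_labels)
test_pos = sum(test_labels) / len(test_labels)
print(f"train positives: {train_pos:.4f}, test positives: {test_pos:.4f}")
```

Both fractions should land close to 0.5, and the deviation shrinks further as the dataset grows toward the 1.6 million tweets used here.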

### **1.4: Tokenizing and Padding the Dataset**

Next, we need to tokenize and pad the dataset. Tokenization converts the tweet texts to sequences of integers, and then we pad the sequences to a fixed length. 

We set a few constants first:
- NUM_TOP_WORDS: Set to None, meaning that the tokenizer will consider all unique words in the dataset.
- MAX_ART_LEN: Set to 40 to specify the maximum length of the sequences. Any text sequence longer than 40 will be truncated and shorter sequences will be padded.
- NUM_CLASSES: Set to 2, to represent the two classes in the target variable (positive and negative sentiments).

Then we initialize and fit the tokenizer. A tokenizer is a tool that converts text into a sequence of integers, so that each integer represents a specific word. This tokenizer will consider all words, since we set `NUM_TOP_WORDS` to `None`. The tokenizer is fit only on `X_train`, so it learns the mapping of words to integers from the training data alone and no vocabulary information leaks from the test set. 

We also store a `word_index` variable, which is a dictionary where keys are words and values are their corresponding integers in the learned vocabulary. 

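Lastly, we convert the training and testing text data to sequences of integers based on the learned word index. Words in the test set that the tokenizer never saw during fitting are dropped. Then we ensure that all sequences are the same length by padding shorter sequences with zeros and truncating longer sequences to this standardized length. With the default settings of `pad_sequences`, both padding and truncation happen at the start of the sequence. 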

In [None]:
# Set Vocabulary Size, Maximum Sequence Length, Number Of Classes
NUM_TOP_WORDS = None
MAX_ART_LEN = 40 
NUM_CLASSES = 2

# Initialize, Fit Tokenizer
tokenizer = Tokenizer(num_words = NUM_TOP_WORDS)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index

# Convert Text To Sequence, Padding
train_sequences = tokenizer.texts_to_sequences(X_train)
test_sequences = tokenizer.texts_to_sequences(X_test)
X_train_padded = pad_sequences(train_sequences, maxlen = MAX_ART_LEN)
X_test_padded = pad_sequences(test_sequences, maxlen = MAX_ART_LEN)
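
To illustrate what the tokenizer and padding steps produce, here is a pure-Python sketch of the same idea: words are indexed from 1 in order of decreasing frequency, unknown words are dropped, and sequences are padded and truncated at the front. The helpers `fit_word_index` and `to_padded` are hypothetical stand-ins written for this illustration and are not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded(text, word_index, maxlen):
    """Convert text to a fixed-length integer sequence (0 = padding)."""
    seq = [word_index[w] for w in text.split() if w in word_index]
    seq = seq[-maxlen:]                      # truncate from the front
    return [0] * (maxlen - len(seq)) + seq   # pad at the front

# Toy corpus for illustration only
train_texts = ["my whole body feels itchy", "my body is on fire"]
word_index = fit_word_index(train_texts)

padded = to_padded("my body is on fire", word_index, maxlen=8)
truncated = to_padded("my whole body feels itchy", word_index, maxlen=3)
unseen = to_padded("my cat", word_index, maxlen=3)  # "cat" is not in the index
print(word_index)
print(padded, truncated, unseen)
```

Index 0 is never assigned to a word, which is why the embedding matrix built later has `len(word_index) + 1` rows.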

In [None]:
# Preparing GloVe Embedding
glove_file = 'Dataset/glove.6B.100d.txt'
embeddings_index = {}
with open(glove_file, 'r', encoding = 'utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype = 'float32')
        embeddings_index[word] = coefs

print('Found %s Word Vectors.' % len(embeddings_index))

# Create Embedding Matrix
found_words = 0
EMBED_SIZE = 100
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, EMBED_SIZE))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        found_words = found_words + 1

# Print Embedding Information
# Coverage Is Measured Against The Vocabulary Size (Row 0 Is Reserved For Padding)
print("Embedding Shape:", embedding_matrix.shape,
      "\nTotal Words Found:", found_words,
      "\nPercentage:", 100 * found_words / len(tokenizer.word_index))

# Check Shapes Of Padded Train Test Data
X_train_padded.shape, X_test_padded.shape, embedding_matrix.shape
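
The GloVe loading step can be illustrated on a tiny scale. The sketch below parses lines in the GloVe text format (a word followed by its vector components) and fills an embedding matrix whose row index matches the tokenizer index. The three-dimensional vectors and the small `word_index` are made-up toy values, and plain lists are used in place of NumPy arrays to keep the example dependency-free.

```python
import io

EMBED_SIZE = 3  # toy dimension; the notebook uses 100

# Made-up vectors in GloVe's "word v1 v2 ..." text format
glove_text = "the 0.1 0.2 0.3\nfire 0.4 0.5 0.6\n"

# Parse each line into word -> vector
embeddings_index = {}
for line in io.StringIO(glove_text):
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Hypothetical tokenizer vocabulary; "shoulda" has no pretrained vector
word_index = {"the": 1, "fire": 2, "shoulda": 3}

# Row 0 is reserved for padding; rows for unknown words stay all-zero
embedding_matrix = [[0.0] * EMBED_SIZE for _ in range(len(word_index) + 1)]
found_words = 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        found_words += 1

coverage = 100 * found_words / len(word_index)
print(embedding_matrix)
print(f"coverage: {coverage:.1f}%")
```

Informal spellings that survive cleaning, such as "shoulda" or "cant", are the kind of tokens likely to be missing from a pretrained vocabulary, and they keep a zero vector when the layer is frozen.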

In [None]:
# Build Frozen Embedding Layer From Pretrained GloVe Weights
embedding_layer = Embedding(len(word_index) + 1,
                            EMBED_SIZE,
                            weights = [embedding_matrix],
                            input_length = MAX_ART_LEN,
                            trainable = False)

In [None]:
# The transformer architecture 
class TransformerBlock(Layer): # inherit from Keras Layer
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.2):
        super().__init__()
        # setup the model heads and feedforward network
        self.att = MultiHeadAttention(num_heads=num_heads, 
                                      key_dim=embed_dim)
        
        # make a two layer network that processes the attention
        self.ffn = Sequential()
        self.ffn.add( Dense(ff_dim, activation='relu') )
        self.ffn.add( Dense(embed_dim) )
        
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=None):
        # apply the layers as needed (similar to PyTorch)
        
        # get the attention output from multi heads
        # Using the same input for query and value is self-attention
        # call inputs are (query, value, key) 
        # if only two inputs given, value and key are assumed the same
        attn_output = self.att(inputs, inputs)
        
        # create residual output, with attention
        out1 = self.layernorm1(inputs + attn_output)
        
        # apply dropout if training
        out1 = self.dropout1(out1, training=training)
        
        # place through feed forward after layer norm
        ffn_output = self.ffn(out1)
        out2 = self.layernorm2(out1 + ffn_output)
        
        # apply dropout if training
        out2 = self.dropout2(out2, training=training)
        #return the residual from Dense layer
        return out2
    
class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        # create two embeddings 
        # one for processing the tokens (words)
        self.token_emb = Embedding(input_dim=vocab_size, 
                                   output_dim=embed_dim)
        # another embedding for processing the position
        self.pos_emb = Embedding(input_dim=maxlen, 
                                 output_dim=embed_dim)

    def call(self, x):
        # create a static position measure (input)
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        # positions now goes from 0 to maxlen - 1 (0 to 39 for MAX_ART_LEN = 40)
        positions = self.pos_emb(positions) # embed these positions
        x = self.token_emb(x) # embed the tokens
        return x + positions # add embeddings to get final embedding
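
The core operation inside `MultiHeadAttention` is scaled dot-product attention. The following is a deliberately simplified sketch of single-head self-attention in pure Python, with no learned query, key, or value projections, so it is not equivalent to the Keras layer. It only shows how each token's output becomes a weighted average of all token vectors, with weights given by a softmax over scaled dot products. The function names and the toy token vectors are assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention without projections.

    x is a list of token vectors (seq_len x dim).
    Returns (attention weights, attended output).
    """
    dim = len(x[0])
    scale = math.sqrt(dim)
    weights = []
    for q in x:
        # Scaled dot product of this token with every token in the sequence
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all token vectors
    output = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(dim)]
        for row in weights
    ]
    return weights, output

# Toy 2-dimensional token vectors, for illustration only
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

In `TransformerBlock`, this attention output is added back to the input (the residual connection) and normalized before passing through the feed-forward network.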