# **ML Final Project**:

### **Group Members:**
- Christopher Johnson (christopher.johnson13@ontariotechu.net)
- Name (Student Email)
- Name (Student Email)
- Name (Student Email)

## **Project Goals & Outline:**
The goal of this project is to apply sentiment analysis to tweets and classify each tweet's sentiment as positive, negative, or neutral.

### Outline:
1. Data Importing and Preprocessing
2. Model Construction
   1. RNN
   2. LSTM
   3. ???
3. Model Training
   1. RNN
   2. LSTM
   3. ???
4. Model Analysis and Comparison
5. Conclusions

## **Importing Packages & Libraries:**

In [15]:
# general packages/libraries
import numpy as np
import pandas as pd
from datetime import datetime # used to convert Date_time strings to useable format

# torch
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import random_split

# torchmetrics
%pip install torchmetrics
from torchmetrics import Accuracy

# torchtext
import torchtext.data
from torchtext.vocab import build_vocab_from_iterator

# lightning
%pip install lightning
from lightning.pytorch import LightningModule
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import CSVLogger

# tqdm
from tqdm.notebook import tqdm

# nltk
import nltk # used for tokenization
nltk.download('punkt')
from nltk import word_tokenize

# regular expressions
import re

Note: you may need to restart the kernel to use updated packages.



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Data Importing and Preprocessing:**
This section is where the tweet data is imported and processed into tokens. This tokenization process is required so the neural network architectures can interpret the text data.

**Tokenization definition:** the process of breaking down a sequence of information into smaller chunks known as tokens.
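
As a rough illustration of the definition above, the sketch below splits a tweet into word and punctuation tokens. `simple_tokenize` is a hypothetical regex-based stand-in written for this example, not NLTK's `word_tokenize` (which this notebook uses), so its output may differ on contractions and other edge cases.

```python
import re

def simple_tokenize(text):
    # hypothetical stand-in for a real tokenizer:
    # a token is either a run of letters/digits (optionally followed by 'xx)
    # or a single non-space punctuation character
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

tokens = simple_tokenize("im getting on borderlands 2, and i will win!")
print(tokens)
# ['im', 'getting', 'on', 'borderlands', '2', ',', 'and', 'i', 'will', 'win', '!']
```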

### Importing the Dataset:

In [16]:
# import the tweet data from the CSV using pandas
tweet_data = pd.read_csv('./Datasets/twitter_training.csv', names=['Tweet ID', 'entity', 'sentiment', 'Tweet content'])
tweet_data.head(20)

| | Tweet ID | entity | sentiment | Tweet content |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |
| 5 | 2401 | Borderlands | Positive | im getting into borderlands and i can murder y... |
| 6 | 2402 | Borderlands | Positive | So I spent a few hours making something for fu... |
| 7 | 2402 | Borderlands | Positive | So I spent a couple of hours doing something f... |
| 8 | 2402 | Borderlands | Positive | So I spent a few hours doing something for fun... |
| 9 | 2402 | Borderlands | Positive | So I spent a few hours making something for fu... |


In [17]:
# dropping "Irrelevant" sentiment values
tweet_data.drop(
    tweet_data[tweet_data['sentiment'] == 'Irrelevant'].index,
    inplace=True
)

# showing remaining sentiment distribution
tweet_data['sentiment'].value_counts()

sentiment
Negative    22542
Positive    20832
Neutral     18318
Name: count, dtype: int64
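
One way to read the distribution above is as a baseline: a classifier that always predicts the most frequent class would be right about 36.5% of the time on these rows. A minimal sketch using the counts printed above (these are taken before the later filtering step, so the final proportions may shift slightly):

```python
# class counts copied from the value_counts() output above
class_counts = {'Negative': 22542, 'Positive': 20832, 'Neutral': 18318}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# accuracy of always predicting the most frequent class
majority_baseline = class_counts[majority_class] / total

print(total)
print(majority_class, round(majority_baseline, 3))
```

Any trained model should comfortably beat this number on the validation set to be considered useful.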

In [18]:
# extracting the sentiment data for ease of use
sentiment = tweet_data['sentiment']

# the numerical representations of the sentiment values
sentiment_numerical = {
    'Positive': 0,
    'Negative': 1,
    'Neutral': 2,
}

# converting the sentiment data into a numerical form
# (map returns a new Series, which avoids modifying a column view in place)
sentiment = sentiment.map(sentiment_numerical)

# showing the converted sentiment data
print(sentiment.head(20))

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    2
13    2
14    2
15    2
16    2
17    2
18    0
19    0
Name: sentiment, dtype: int64
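
When inspecting predictions later, it can help to map the integer classes back to their names. The sketch below inverts the same mapping; `index_to_sentiment` and the sample labels are illustrative and not part of the notebook.

```python
# same mapping as in the cell above
sentiment_numerical = {
    'Positive': 0,
    'Negative': 1,
    'Neutral': 2,
}

# hypothetical inverse mapping, useful for turning predicted class indices back into labels
index_to_sentiment = {index: name for name, index in sentiment_numerical.items()}

sample_labels = ['Positive', 'Neutral', 'Negative']
encoded = [sentiment_numerical[label] for label in sample_labels]
decoded = [index_to_sentiment[index] for index in encoded]

print(encoded)  # [0, 2, 1]
print(decoded)  # ['Positive', 'Neutral', 'Negative']
```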


In [19]:
# extracting the tweet content for ease of use
content = tweet_data['Tweet content']

# showing the extracted content
print(content.head(60))

# convert content to an iterator
# content = iter(content)

0     im getting on borderlands and i will murder yo...
1     I am coming to the borders and I will kill you...
2     im getting on borderlands and i will kill you ...
3     im coming on borderlands and i will murder you...
4     im getting on borderlands 2 and i will murder ...
5     im getting into borderlands and i can murder y...
6     So I spent a few hours making something for fu...
7     So I spent a couple of hours doing something f...
8     So I spent a few hours doing something for fun...
9     So I spent a few hours making something for fu...
10    2010 So I spent a few hours making something f...
11                                                  was
12    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
13    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
14    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
15    Rock-Hard La Vita, RARE BUT POWERFUL, HANDSOME...
16    Live Rock - Hard music La la Varlope, RARE & t...
17    I-Hard like me, RARE LONDON DE, HANDSOME 2

In [20]:
'''
method that processes content and removes the following content
(together with the corresponding sentiment values):
- 1 word tweets
- tweets that start with a run of ".", "[", "]", spaces, newlines, or digits
  covering >= 75% of the tweet content
'''
def process_content(values:pd.Series, sentiments:pd.Series):
    # matches the leading run of filler characters (always matches, possibly the empty string)
    filler_pattern = re.compile(r'^(\.|\[|\]| |\n|[0-9])*')

    # URL filtering is currently disabled; the pattern is kept for reference
    # url_pattern = re.compile(r'(https?:\ */\ */\ *)?(?:www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(?:\ */\ *[^\s]*)?')

    # tweets (and their sentiments) that pass both checks
    kept_values = []
    kept_sentiments = []

    # building new lists avoids deleting from a list while indexing into it,
    # which previously re-checked the wrong element after a removal
    for value, label in tqdm(zip(values.to_list(), sentiments.to_list()), total=len(values)):
        text = str(value)

        # skips 1 word tweets (checked using a tokenizer)
        if len(word_tokenize(text)) <= 1:
            continue

        # skips tweets whose leading filler run is >= 75% of the tweet content
        if len(filler_pattern.match(text).group(0)) >= int(len(text) * 0.75):
            continue

        kept_values.append(value)
        kept_sentiments.append(label)

    return (pd.Series(kept_values), pd.Series(kept_sentiments))
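
To make the second filter concrete, the sketch below applies the same regular expression and 75% threshold to a few made-up strings. `is_mostly_filler` is a hypothetical helper written only for this illustration. Because the pattern is anchored at the start of the string, only a leading run of filler characters counts, so a tweet that merely ends in a long run of dots is kept.

```python
import re

# same pattern as in process_content: a leading run of ".", "[", "]", space, newline, or digits
FILLER_PATTERN = re.compile(r'^(\.|\[|\]| |\n|[0-9])*')

def is_mostly_filler(text):
    # the pattern always matches (possibly the empty string), so .group(0) is safe
    leading_run = FILLER_PATTERN.match(text).group(0)
    return len(leading_run) >= int(len(text) * 0.75)

samples = [
    '... ... ...',                  # filler only
    '[ ]',                          # filler only
    '. . . 12 ok',                  # 9 of 11 characters are leading filler
    '2010 So I spent a few hours',  # short numeric prefix, real content follows
    'hello .....',                  # filler at the end is not counted
]

for sample in samples:
    print(repr(sample), is_mostly_filler(sample))
```

The first three samples would be removed and the last two kept.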

In [21]:
content, sentiment = process_content(content, sentiment)
content.head(20)

  0%|          | 0/61692 [00:00<?, ?it/s]

0     im getting on borderlands and i will murder yo...
1     I am coming to the borders and I will kill you...
2     im getting on borderlands and i will kill you ...
3     im coming on borderlands and i will murder you...
4     im getting on borderlands 2 and i will murder ...
5     im getting into borderlands and i can murder y...
6     So I spent a few hours making something for fu...
7     So I spent a couple of hours doing something f...
8     So I spent a few hours doing something for fun...
9     So I spent a few hours making something for fu...
10    2010 So I spent a few hours making something f...
11    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
12    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
13    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
14    Rock-Hard La Vita, RARE BUT POWERFUL, HANDSOME...
15    Live Rock - Hard music La la Varlope, RARE & t...
16    I-Hard like me, RARE LONDON DE, HANDSOME 2011,...
17    that was the first borderlands session in 

In [22]:
# converting the sentiment data into a torch tensor
sentiment = torch.tensor(sentiment, dtype=torch.int64)

### Tokenization:

In [23]:
def iterate_tokens(df):
    for val in tqdm(df):
        yield word_tokenize(str(val))

vocab = build_vocab_from_iterator(
    iterate_tokens(content),
    min_freq = 5,
    specials = ['<unk>']
) 

vocab.set_default_index(0)

len(vocab)

  0%|          | 0/59441 [00:00<?, ?it/s]

17481
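
The effect of `min_freq` and the default index can be seen on a toy corpus. The sketch below is a plain-Python approximation of the idea, not torchtext's implementation; `build_simple_vocab`, `lookup_indices`, and the tie-breaking order are assumptions made for this example.

```python
from collections import Counter

def build_simple_vocab(token_lists, min_freq=5, specials=('<unk>',)):
    # count every token across all tokenized tweets
    counts = Counter(token for tokens in token_lists for token in tokens)
    # keep tokens seen at least min_freq times, most frequent first (ties alphabetical)
    kept = sorted(
        (token for token, count in counts.items() if count >= min_freq),
        key=lambda token: (-counts[token], token),
    )
    # special tokens take the lowest indices
    return {token: index for index, token in enumerate(list(specials) + kept)}

def lookup_indices(vocabulary, tokens, default_index=0):
    # tokens outside the vocabulary fall back to the default index ('<unk>')
    return [vocabulary.get(token, default_index) for token in tokens]

toy_corpus = [
    ['i', 'love', 'this', 'game'],
    ['i', 'hate', 'this', 'game'],
    ['this', 'game', 'rocks'],
]

toy_vocab = build_simple_vocab(toy_corpus, min_freq=2)
print(toy_vocab)  # {'<unk>': 0, 'game': 1, 'this': 2, 'i': 3}
print(lookup_indices(toy_vocab, ['i', 'adore', 'this', 'game']))  # [3, 0, 2, 1]
```

Raising `min_freq` shrinks the vocabulary (and the embedding table) at the cost of mapping more rare words to `<unk>`.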

## **Creating Data Loaders:**
These data loaders are used by the models to access the tweet data. We will create two data loaders:
* Training dataloader: used to train the models.
* Validation dataloader: used to evaluate the performance of the models.
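
The holdout method behind these two loaders can be sketched in a few lines of plain Python. `holdout_split`, its percentage argument, and the fixed seed are assumptions for illustration; the notebook itself relies on `random_split` from PyTorch.

```python
import random

def holdout_split(items, train_percent=70, seed=0):
    # shuffle the indices once so both subsets are random samples of the data
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    # integer arithmetic avoids floating point rounding at the cut point
    cut = len(items) * train_percent // 100

    train_items = [items[i] for i in indices[:cut]]
    val_items = [items[i] for i in indices[cut:]]
    return train_items, val_items

train_items, val_items = holdout_split(list(range(10)))
print(len(train_items), len(val_items))  # 7 3
```

Every item lands in exactly one of the two subsets, so the validation data is never seen during training.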

In [28]:
'''
method that generates the training and validation dataloaders using the holdout method
Takes five arguments:
- data to create training and validation dataloaders with
- the sentiment data (as a tensor)
- the vocabulary values
- maximum tweet length (default is 250)
- size of batches (default is 32)
'''
def create_dataloaders(data:pd.Series, sentiments:torch.Tensor, vocab, max_length = 250, batch_size = 32):
    # create the sequences using the vocab
    sequences = [
        torch.tensor(
            vocab.lookup_indices(word_tokenize(str(tweet))),
            dtype = torch.int64
        ) for tweet in tqdm(data)  # iterate over the `data` argument, not the global `content`
    ]

    # create the padded sequences
    padded_sequences = pad_sequence(sequences, batch_first=True)[:, :max_length]

    # create the training and validation datasets used to create the dataloaders
    (train_dataset, val_dataset) = random_split(TensorDataset(padded_sequences, sentiments), (0.7, 0.3))  # 70% train, 30% validation

    # create the training and validation dataloaders
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
    val_dataloader = DataLoader(val_dataset, shuffle=False, batch_size=batch_size)

    # return the training and validation dataloaders in a tuple
    return (train_dataloader, val_dataloader)
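
The padding and truncation step can be illustrated without PyTorch. The sketch below mimics what `pad_sequence(..., batch_first=True)[:, :max_length]` does to a batch of index lists; `pad_and_truncate` is a hypothetical helper, and the padding value of 0 mirrors the `pad_sequence` default.

```python
def pad_and_truncate(sequences, max_length=250, padding_value=0):
    # pad every sequence on the right up to the length of the longest one
    longest = max(len(sequence) for sequence in sequences)
    padded = [
        sequence + [padding_value] * (longest - len(sequence))
        for sequence in sequences
    ]
    # then cut every row down to at most max_length entries
    return [row[:max_length] for row in padded]

batch = pad_and_truncate([[5, 2, 9], [7], [4, 4, 1, 8, 3]], max_length=4)
print(batch)  # [[5, 2, 9, 0], [7, 0, 0, 0], [4, 4, 1, 8]]
```

Note that in this notebook index 0 is also the `<unk>` token, so padding and unknown words share one embedding.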


In [29]:
train_dataloader, val_dataloader = create_dataloaders(content, sentiment, vocab)

  0%|          | 0/59441 [00:00<?, ?it/s]

## **Model Creation:**
This is where the models being used are defined.

### **LSTM Model:**

In [32]:
# an LSTM classifier using a LightningModule
class LSTM_classifier(LightningModule):
    def __init__(self, vocab_size, embedding_dimension, state_dimension):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dimension)
        self.lstm = nn.LSTM(
            input_size = embedding_dimension,
            hidden_size = state_dimension,
            num_layers = 1, # hyperparameter
            batch_first = True
        )

        # possible outputs: positive, negative, neutral
        self.output = nn.Linear(state_dimension, 3)

        # monitors accuracy
        self.accuracy = Accuracy(task = 'multiclass', num_classes = 3)

    def forward(self, sequence_batch):
        embedded = self.embedding(sequence_batch)
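        # nn.LSTM returns the output features (h_t) and a (hidden state, cell state) tuple
        h_t, (h_n, c_n) = self.lstm(embedded)
        output = self.output(h_n[-1])  # final hidden state of the last layer, shape (N, state_dimension)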

        return output

    def loss(self, output, targets):
        loss = nn.CrossEntropyLoss()

        return loss(output, targets)

    def training_step(self, batch):
        inputs, targets = batch
        outputs = self.forward(inputs)
        loss = self.loss(outputs, targets)

        # get accuracy value
        self.accuracy(outputs, targets)

        # log the training accuracy
        self.log('training accuracy', self.accuracy, prog_bar = True)

        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        outputs = self.forward(inputs)

        # get accuracy value
        self.accuracy(outputs, targets)

        # log the validation accuracy
        self.log('validation accuracy', self.accuracy, prog_bar = True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
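
For reference, the two quantities this class computes on its three output scores can be worked out by hand. The sketch below is a plain-Python version for a single example (cross-entropy) and a small batch (accuracy); the function names and sample scores are made up, and `nn.CrossEntropyLoss` additionally averages the per-example loss over the batch by default.

```python
import math

def cross_entropy(logits, target):
    # -log(softmax(logits)[target]), using the log-sum-exp trick for numerical stability
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]

def accuracy(batch_logits, targets):
    # the predicted class is the index of the largest score in each row
    predictions = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

batch_logits = [
    [2.0, 0.1, 0.3],  # predicts class 0 (Positive)
    [0.2, 1.5, 0.1],  # predicts class 1 (Negative)
    [0.3, 0.2, 0.1],  # predicts class 0, but the target is 2 (Neutral)
]
targets = [0, 1, 2]

print(accuracy(batch_logits, targets))
print(cross_entropy([0.0, 0.0, 0.0], 0))  # equal scores give log(3)
```

An untrained three-class model that scores every class equally has a loss of about log(3), which is a handy sanity check for the first logged values.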

### **RNN Model:**

In [31]:
# a basic RNN classifier using a LightningModule
class RNN_classifier(LightningModule):
    def __init__(self, vocab_size, embedding_dimension, state_dimension):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dimension)
        self.rnn = nn.RNN(
            input_size = embedding_dimension,
            hidden_size = state_dimension,
            num_layers = 1, # hyperparameter
            batch_first = True
        )

        # possible outputs: positive, negative, neutral
        self.output = nn.Linear(state_dimension, 3)

        # monitors accuracy
        self.accuracy = Accuracy(task = 'multiclass', num_classes = 3)

    def forward(self, sequence_batch):
        embedded = self.embedding(sequence_batch)
        h_t, h_n = self.rnn(embedded)  # output features (h_t) and state (h_n)
        output = self.output(h_n[-1])

        return output

    def loss(self, output, targets):
        loss = nn.CrossEntropyLoss()

        return loss(output, targets)

    def training_step(self, batch):
        inputs, targets = batch
        outputs = self.forward(inputs)
        loss = self.loss(outputs, targets)

        # get accuracy value
        self.accuracy(outputs, targets)

        # log the training accuracy
        self.log('training accuracy', self.accuracy, prog_bar = True)

        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        outputs = self.forward(inputs)

        # get accuracy value
        self.accuracy(outputs, targets)

        # log the validation accuracy
        self.log('validation accuracy', self.accuracy, prog_bar = True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
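
For intuition about what `h_n[-1]` holds, the sketch below runs a scalar version of the recurrence h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). This is the form of update `nn.RNN` applies with its default tanh nonlinearity, there with weight matrices and two bias vectors instead of scalars. The function name and the weights are made up for illustration.

```python
import math

def rnn_final_state(inputs, w_x, w_h, bias, h0=0.0):
    # carry a single hidden value through the sequence, one input at a time
    h = h0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + bias)
    # the state after the last step is what the classifier's linear layer receives
    return h

final_state = rnn_final_state([1.0, -0.5, 0.25], w_x=0.8, w_h=0.5, bias=0.0)
print(final_state)
```

Because every step passes through tanh, the state always stays between -1 and 1, and earlier inputs influence the result only through the repeated `w_h * h` term.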

## **Model Training:**
This is where the models defined in the previous section are trained.

### **LSTM Model:**

In [33]:
LSTM_model = LSTM_classifier(
    vocab_size = len(vocab),
    embedding_dimension = 32, # hyperparameter
    state_dimension = 64 # hyperparameter
)

LSTM_logger = CSVLogger('./lightning_logs/', 'LSTM')
trainer = Trainer(max_epochs = 10, logger = LSTM_logger) # hyperparameter (# epochs)

trainer.fit(
    LSTM_model,
    train_dataloaders = train_dataloader,
    val_dataloaders = val_dataloader
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: ./lightning_logs/LSTM

  | Name      | Type               | Params
-------------------------------------------------
0 | embedding | Embedding          | 559 K 
1 | lstm      | LSTM               | 25.1 K
2 | output    | Linear             | 195   
3 | accuracy  | MulticlassAccuracy | 0     
-------------------------------------------------
584 K     Trainable params
0         Non-trainable params
584 K     Total params
2.339     Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

c:\Users\chris\anaconda3\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


ValueError: Either `preds` and `target` both should have the (same) shape (N, ...), or `target` should be (N, ...) and `preds` should be (N, C, ...).

### **RNN Model:**

In [None]:
RNN_model = RNN_classifier(
    vocab_size=len(vocab),
    embedding_dimension = 32, # hyperparameter
    state_dimension = 64 # hyperparameter
)

RNN_logger = CSVLogger('./lightning_logs/', 'RNN')
trainer = Trainer(max_epochs = 10, logger = RNN_logger) # hyperparameter (# epochs)

trainer.fit(
    RNN_model,
    train_dataloaders = train_dataloader,
    val_dataloaders = val_dataloader
)