# **ML Final Project**:

### **Group Members:**
- Christopher Johnson (christopher.johnson13@ontariotechu.net)
- Name (Student Email)
- Name (Student Email)
- Name (Student Email)

## **Project Goals & Outline:**
The goal of this project is to use sentiment analysis to classify tweets as positive, negative, or neutral based on their content.

### Outline:
1. Data Importing and Preprocessing
2. Model Construction
   1. RNN
   2. LSTM
   3. ???
3. Model Training
   1. RNN
   2. LSTM
   3. ???
4. Model Analysis and Comparison
5. Conclusions

## **Importing Packages & Libraries:**

In [141]:
# general packages/libraries
import numpy as np
import pandas as pd
from datetime import datetime # used to convert Date_time strings to useable format

# torch
import torch

# torchmetrics
from torchmetrics import Accuracy

# torchtext
import torchtext.data
from torchtext.vocab import build_vocab_from_iterator

# lightning
from lightning.pytorch import LightningModule
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import CSVLogger

# tqdm
from tqdm.notebook import tqdm

# nltk
import nltk # used for tokenization
nltk.download('punkt')
from nltk import word_tokenize

# regular expressions
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Data Importing and Preprocessing:**
This section imports the tweet data and processes it into tokens. Tokenization is required so that the neural network architectures can interpret the text data.

**Tokenization definition:** the process of breaking down a sequence of information into smaller chunks known as tokens.
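
As a rough illustration of the definition above, the sketch below splits a sentence into word and punctuation tokens using only Python's `re` module. The `simple_tokenize` helper and the sample sentence are hypothetical stand-ins; the notebook itself uses NLTK's `word_tokenize`, whose rules are more elaborate.

```python
import re

def simple_tokenize(text):
    # hypothetical helper: a token is either a run of letters, digits,
    # or apostrophes, or a single punctuation character
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

sample = "I love this game, it's great!"  # made-up example sentence
tokens = simple_tokenize(sample)
print(tokens)
```

Each element of `tokens` is one chunk of the original sentence, which is the form the later vocabulary step consumes.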

### Importing the Dataset:

In [142]:
# import the tweet data from the CSV using pandas
tweet_data = pd.read_csv('./Datasets/twitter_training.csv', names=['Tweet ID', 'entity', 'sentiment', 'Tweet content'])
tweet_data.head(20)

| | Tweet ID | entity | sentiment | Tweet content |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |
| 5 | 2401 | Borderlands | Positive | im getting into borderlands and i can murder y... |
| 6 | 2402 | Borderlands | Positive | So I spent a few hours making something for fu... |
| 7 | 2402 | Borderlands | Positive | So I spent a couple of hours doing something f... |
| 8 | 2402 | Borderlands | Positive | So I spent a few hours doing something for fun... |
| 9 | 2402 | Borderlands | Positive | So I spent a few hours making something for fu... |


In [143]:
# dropping "Irrelevant" sentiment values
tweet_data.drop(
    tweet_data[tweet_data['sentiment'] == 'Irrelevant'].index,
    inplace=True
)

# showing remaining sentiment distribution
tweet_data['sentiment'].value_counts()

sentiment
Negative    22542
Positive    20832
Neutral     18318
Name: count, dtype: int64
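
To put the class distribution above in perspective, the sketch below converts the counts into proportions. The counts are copied from the `value_counts()` output; the "majority baseline" (always predicting the most common class) is an assumption added here as a reference point, not something the original analysis computes.

```python
# class counts copied from the value_counts() output above
counts = {'Negative': 22542, 'Positive': 20832, 'Neutral': 18318}

total = sum(counts.values())

# share of each class in the remaining data
proportions = {label: n / total for label, n in counts.items()}

# accuracy of a hypothetical baseline that always predicts the largest class
majority_label = max(counts, key=counts.get)
baseline_accuracy = proportions[majority_label]

for label, share in proportions.items():
    print(f'{label}: {share:.1%}')
print(f'majority baseline ({majority_label}): {baseline_accuracy:.1%}')
```

The three classes are fairly balanced, so a trained model should be expected to clear the baseline by a clear margin.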

In [144]:
# extracting the tweet content for ease of use
content = tweet_data['Tweet content']

# showing the extracted content
print(content.head(60))

# convert content to an iterator
# content = iter(content)

0     im getting on borderlands and i will murder yo...
1     I am coming to the borders and I will kill you...
2     im getting on borderlands and i will kill you ...
3     im coming on borderlands and i will murder you...
4     im getting on borderlands 2 and i will murder ...
5     im getting into borderlands and i can murder y...
6     So I spent a few hours making something for fu...
7     So I spent a couple of hours doing something f...
8     So I spent a few hours doing something for fun...
9     So I spent a few hours making something for fu...
10    2010 So I spent a few hours making something f...
11                                                  was
12    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
13    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
14    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
15    Rock-Hard La Vita, RARE BUT POWERFUL, HANDSOME...
16    Live Rock - Hard music La la Varlope, RARE & t...
17    I-Hard like me, RARE LONDON DE, HANDSOME 2

In [145]:
# general URL-like pattern, defined once instead of on every loop iteration
URL_PATTERN = r'(https?:\ */\ */\ *)*(?:www\.)?[a-zA-Z0-9.-]+(\.[a-zA-Z])*(?:\ */\ *[^\s]*)?'
URL_CHECK = re.compile(rf'({URL_PATTERN}\s*/\s*)+{URL_PATTERN}')

# leading run of dots, brackets, spaces, newlines, and digits
JUNK_PREFIX = re.compile(r'^(\.|\[|\]| |\n|[0-9])*')

# method that returns True if a single tweet should be removed
def should_remove(tweet) -> bool:
    text = str(tweet)   # NaN values become the string 'nan'

    # (1) tweets with at most one token, checked using a tokenizer
    if len(word_tokenize(text)) <= 1:
        return True

    # (2) tweets starting with "...", " ", "[", "]", or digits
    # (only removed if the match is >=75% of the tweet content)
    if len(JUNK_PREFIX.match(text).group(0)) >= int(len(text) * 0.75):
        return True

    # (3) tweets that start with a sequence matching the URL pattern
    return URL_CHECK.match(text) is not None

'''
method that processes content and removes the following content:
- 1 word tweets
- tweets that consist mostly of numbers
- tweets that consist mostly of special characters (e.g. ., [, ], etc.)
- tweets that start with a URL-like sequence
'''
def process_content(values: pd.Series) -> pd.Series:
    # build a new list instead of deleting from the list while looping over it,
    # which avoids the index bookkeeping (and negative-index wraparound)
    kept = [val for val in tqdm(values.to_list()) if not should_remove(val)]

    return pd.Series(kept)

# method that removes links from tweets
def remove_links(content:pd.Series):
    return 0 # ! TEMP
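
The `remove_links` stub above is still a placeholder. One possible direction is sketched below, under the assumption that a link is anything starting with `http://`, `https://`, or `www.` and running up to the next whitespace. The `LINK_RE` pattern and the `strip_links` helper are hypothetical, and this simple pattern does not handle links that have been split by spaces, which `url_pattern` above appears to target.

```python
import re

# assumed pattern: anything starting with http://, https://, or www.
# and running up to the next whitespace character
LINK_RE = re.compile(r'(?:https?://|www\.)\S+')

def strip_links(text):
    # hypothetical helper: drop the links, then collapse leftover whitespace
    without_links = LINK_RE.sub(' ', str(text))
    return ' '.join(without_links.split())

print(strip_links('check this out https://example.com/page it is great'))
```

Applied to a pandas Series, the helper could be used as `content.map(strip_links)`.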

In [146]:
print(len(content))
content = process_content(content)
print(len(content))

61692



(1)  11 :  was
(1)  53 :  all
(2)  60 :   . . [  
(1)  61 :  nan
(2)  62 :  .. [
(2)  63 :  .. 45
(2)  64 :  .. [
(1)  173 :  why
(1)  185 :  I
(1)  407 :  one
(1)  419 :  can
(1)  437 :  of
(1)  461 :  It
(1)  467 :  on
(1)  493 :  nan
(2)  510 :   . .  .  .  .  .  
(1)  511 :  nan
(1)  512 :  ........
(1)  514 :  ......
(1)  515 :  ......
(1)  539 :  of
(1)  575 :  we
(1)  611 :  the
(1)  635 :  to
(1)  660 :  Completed  
(1)  661 :  nan
(1)  662 :  Rejected
(1)  663 :  Completed
(1)  665 :  we
(1)  755 :  that
(1)  785 :  you
(1)  797 :  as
(1)  833 :  that
(1)  863 :  The
(1)  869 :  was
(1)  888 :  Sweet
(1)  889 :  Sweet
(1)  891 :  Sweet
(1)  893 :  of
(1)  984 :   .  
(1)  985 :  nan
(1)  986 :  nan
(1)  987 :  .
(1)  988 :  .
(1)  989 :  .
(1)  1001 :  not
(1)  1061 :  had
(1)  1121 :  on
(1)  1187 :  you
(1)  1193 :  had
(1)  1211 :  not
(1)  1247 :  how
(1)  1271 :  the
(1)  1277 :  in
(1)  1289 :  my
(1)  1343 :  to
(1)  1445 :  to
(1)  1451 :  and
(1)  1523 :  at
(1)  1529


59230
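
After filtering, `content` holds 59230 entries while `tweet_data` still holds 61692 rows, so any labels taken directly from `tweet_data` no longer line up with the tweets position by position. One way to keep them aligned, sketched here with plain lists and a hypothetical `keep` rule, is to compute a single keep/drop mask and apply it to both the tweets and the labels.

```python
# made-up tweets and labels, standing in for the real columns
tweets = ['so much fun today', 'was', 'this update is terrible', '......']
labels = ['Positive', 'Positive', 'Negative', 'Neutral']

def keep(text):
    # hypothetical rule: keep tweets with more than one whitespace-separated word
    return len(str(text).split()) > 1

# one mask, applied to both lists, keeps tweets and labels in step
mask = [keep(t) for t in tweets]
kept_tweets = [t for t, m in zip(tweets, mask) if m]
kept_labels = [l for l, m in zip(labels, mask) if m]

print(list(zip(kept_tweets, kept_labels)))
```

With pandas, the same boolean mask can index the DataFrame itself (`tweet_data[mask]`), so the content and sentiment columns are filtered together.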


In [147]:
content[0:50]

0     im getting on borderlands and i will murder yo...
1     I am coming to the borders and I will kill you...
2     im getting on borderlands and i will kill you ...
3     im coming on borderlands and i will murder you...
4     im getting on borderlands 2 and i will murder ...
5     im getting into borderlands and i can murder y...
6     So I spent a few hours making something for fu...
7     So I spent a couple of hours doing something f...
8     So I spent a few hours doing something for fun...
9     So I spent a few hours making something for fu...
10    2010 So I spent a few hours making something f...
11    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
12    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
13    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
14    Rock-Hard La Vita, RARE BUT POWERFUL, HANDSOME...
15    Live Rock - Hard music La la Varlope, RARE & t...
16    I-Hard like me, RARE LONDON DE, HANDSOME 2011,...
17    that was the first borderlands session in 

In [148]:
# extracting the sentiment data for ease of use
sentiment = tweet_data['sentiment']

# the numerical representations of the sentiment values
sentiment_numerical = {
    'Positive': 0,
    'Negative': 1,
    'Neutral': 2,
}

# converting the sentiment data into a numerical form
# (map returns a new Series, avoiding an in-place edit of a DataFrame column view)
sentiment = sentiment.map(sentiment_numerical)

# showing the converted sentiment data
print(sentiment.head(20))

# converting the sentiment data into a torch tensor
# (to_numpy() ignores the index, which has gaps after the "Irrelevant" rows were dropped)
sentiment = torch.tensor(sentiment.to_numpy(), dtype=torch.int64)

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    2
13    2
14    2
15    2
16    2
17    2
18    0
19    0
Name: sentiment, dtype: int64


### Tokenization:

In [149]:
tokens = word_tokenize(content[0])
print(type(tokens))

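# generator that yields the token list of each tweet, one tweet at a time
def iterate_tokens(values):
    for val in tqdm(values):
        yield word_tokenize(str(val))

# tokens seen fewer than 5 times are left out of the vocabulary
vocab = build_vocab_from_iterator(
    iterate_tokens(content),
    min_freq = 5,
    specials = ['<unk>']
)

# map out-of-vocabulary tokens to '<unk>' instead of raising an error on lookup
vocab.set_default_index(vocab['<unk>'])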

<class 'list'>


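
To make the effect of `min_freq` and `specials` concrete, here is a plain-Python approximation of the vocabulary step: count the tokens, keep those seen at least `min_freq` times, place the special tokens first, and send everything else to `<unk>`. The `build_vocab` and `encode` helpers and the toy corpus are hypothetical, and the exact index order that torchtext assigns may differ from this sketch.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=('<unk>',)):
    # count every token across all tokenized tweets
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # keep frequent tokens; most frequent first, ties broken alphabetically
    kept = sorted(
        (tok for tok, n in counts.items() if n >= min_freq and tok not in specials),
        key=lambda tok: (-counts[tok], tok),
    )
    # special tokens take the lowest indices
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}

def encode(tokens, vocab):
    # tokens missing from the vocabulary fall back to the '<unk>' index
    unk = vocab['<unk>']
    return [vocab.get(tok, unk) for tok in tokens]

# made-up, already tokenized corpus
corpus = [
    ['i', 'love', 'this', 'game'],
    ['i', 'hate', 'this', 'game'],
    ['this', 'game', 'rocks'],
]
toy_vocab = build_vocab(corpus, min_freq=2)
print(toy_vocab)
print(encode(['i', 'love', 'this', 'game'], toy_vocab))
```

In the toy corpus, `love`, `hate`, and `rocks` each appear only once, so with `min_freq=2` they all share the `<unk>` index.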

## **Model Creation:**
This section defines the models used in the project.

### **RNN Model:**