# **ML Final Project**:

### **Group Members:**
- Christopher Johnson (christopher.johnson13@ontariotechu.net)
- Name (Student Email)
- Name (Student Email)
- Name (Student Email)

## **Project Goals & Outline:**
The goal of this project is to use sentiment analysis to analyze the content of tweets, and make decisions on whether their sentiment is positive, negative, or neutral.

### Outline:
1. Data Importing and Preprocessing
2. Model Construction
   1. RNN
   2. LSTM
   3. ???
3. Model Training
   1. RNN
   2. LSTM
   3. ???
4. Model Analysis and Comparison
5. Conclusions

## **Importing Packages & Libraries:**

In [53]:
# general packages/libraries
import numpy as np
import pandas as pd
from datetime import datetime # used to convert Date_time strings to useable format

# torch
import torch

# torchmetrics
from torchmetrics import Accuracy

# torchtext
import torchtext.data
from torchtext.vocab import build_vocab_from_iterator

# lightning
from lightning.pytorch import LightningModule
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import CSVLogger

# tqdm
from tqdm.notebook import tqdm

# nltk
import nltk # used for tokenziation
nltk.download('punkt')
from nltk import word_tokenize

# regular expressions
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Data Importing and Preprocessing:**
This section is where the tweet data is imported and processed into tokens. This tokenization process is required so the neural network architectures can interpret the text data.

**Tokenization definition:** the process of breaking down a sequence of information into smaller chunks known as tokens.

### Importing the Dataset:

In [3]:
# import the tweet data from the CSV using pandas
tweet_data = pd.read_csv('./Datasets/twitter_training.csv', names=['Tweet ID', 'entity', 'sentiment', 'Tweet content'])
tweet_data.head(20)

Unnamed: 0,Tweet ID,entity,sentiment,Tweet content
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
5,2401,Borderlands,Positive,im getting into borderlands and i can murder y...
6,2402,Borderlands,Positive,So I spent a few hours making something for fu...
7,2402,Borderlands,Positive,So I spent a couple of hours doing something f...
8,2402,Borderlands,Positive,So I spent a few hours doing something for fun...
9,2402,Borderlands,Positive,So I spent a few hours making something for fu...


In [4]:
# droppping "irrelevant" sentiment values
tweet_data.drop(
    tweet_data[tweet_data['sentiment'] == 'Irrelevant'].index,
    inplace=True
)

# showing remaining sentiment distribution
tweet_data['sentiment'].value_counts()

sentiment
Negative    22542
Positive    20832
Neutral     18318
Name: count, dtype: int64

In [5]:
# extracting the the tweet content for ease of use
content = tweet_data['Tweet content']

# showing the extracted content
print(content.head(60))

# convert content to an iterator
# content = iter(content)

0     im getting on borderlands and i will murder yo...
1     I am coming to the borders and I will kill you...
2     im getting on borderlands and i will kill you ...
3     im coming on borderlands and i will murder you...
4     im getting on borderlands 2 and i will murder ...
5     im getting into borderlands and i can murder y...
6     So I spent a few hours making something for fu...
7     So I spent a couple of hours doing something f...
8     So I spent a few hours doing something for fun...
9     So I spent a few hours making something for fu...
10    2010 So I spent a few hours making something f...
11                                                  was
12    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
13    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
14    Rock-Hard La Varlope, RARE & POWERFUL, HANDSOM...
15    Rock-Hard La Vita, RARE BUT POWERFUL, HANDSOME...
16    Live Rock - Hard music La la Varlope, RARE & t...
17    I-Hard like me, RARE LONDON DE, HANDSOME 2

In [56]:
'''
method that processes content and removes the following content:
- 1 word tweets
- tweets that contain only numbers
- tweets that contain only special characters (e.g. /, ., <, etc.)
'''
def process_content(values:pd.Series):
    # convert series to list
    values = values.to_list()

    # records the number of tweets removed
    # (used to adjust the access index)
    num_removed = 0

    for i in tqdm(range(0, len(values))):
        # checks for 1 word tweet using a tokenizer
        if (len(word_tokenize(str(values[i - num_removed]))) <= 1):
            del values[i - num_removed]     # remove the tweet from the list
            num_removed += 1     # count the number of removed tweets

        # checks for tweets with only "...", " ", "[" or, "]"
        if (len(re.search(r'^(\Q.\E| |\Q[\E|\Q]\E|[0-9])*', values[i - num_removed])) > 0):
            print(i)
    return pd.Series(values)

# method that removes links from tweets
def remove_links(content:pd.Series):
    return 0 # ! TEMP

In [57]:
content = process_content(content)

  0%|          | 0/59472 [00:00<?, ?it/s]

error: bad escape \Q at position 2

In [52]:
content[40:80]

40    <unk> Gearbox really time to fix this 10 drops...
41                     Check out this epic streamer!.  
42                       Check out this epic streamer!.
43                         Watch this epic striptease!.
44                        Check out our epic streamer!.
45                   Check out this big epic streamer!.
46                      Check<unk> this epic streamer!.
47    Blaming Sight for Tardiness! A little bit of b...
48    A bit of borderland. I was called to work tomo...
49    Guilty of sobriety! A bit of a borderline. I w...
50    Blaming Sight for Tardiness! A little bit of b...
51    for Blaming Sight for Tardiness! A little bit ...
52    why does like every man in borderlands have sl...
53    Why, like every man in border countries, have ...
54    Why, like everyone else in the border countrie...
55    why does like<unk> man in borderlands have sli...
56    why Beth does like every man in borderlands ha...
57    why does practically every man in France h

In [None]:
# extracting the the sentiment data for ease of use
sentiment = tweet_data['sentiment']

# the numerical representations of the sentiment values
sentiment_numerical = {
    'Positive': 0,
    'Negative': 1,
    'Neutral': 2,
}

# converting the sentiment data into a numerical form
sentiment.replace(to_replace=sentiment_numerical, inplace=True)

# showing the converted sentiment data
print(sentiment.head(20))

# converting the sentiment data into a torch tensor
sentiment = torch.tensor(sentiment, dtype=torch.int64)

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    2
13    2
14    2
15    2
16    2
17    2
18    0
19    0
Name: sentiment, dtype: int64


### Tokenization:

In [49]:
tokens = word_tokenize(content[0])
print(type(tokens))

def iterate_tokens(df):
    for val in tqdm(df):
        yield word_tokenize(str(val))

vocab = build_vocab_from_iterator(
    iterate_tokens(content),
    min_freq = 5,
    specials = ['<unk>']
)

<class 'list'>


  0%|          | 0/59472 [00:00<?, ?it/s]

## **Model Creation:**
This is where the models being used are defined.

### **RNN Model:**