<a href="https://colab.research.google.com/github/GiovanniPioDelvecchio/GCNs_on_text/blob/main/GloVe_LSTM_Baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP project work
Summary: Casting text classification to Graph Classification for Sentiment Analysis of Tweets
Members:

- Dell'Olio Domenico
- Delvecchio Giovanni Pio
- Disabato Raffaele

The project was developed to evaluate the effectiveness of Graph Neural Networks on the sentiment analysis task proposed in the following challenge:
https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification?resource=download

We implemented and tested several architectures, including commonly used transformer-based ones, in order to compare their performance.
Some of these architectures come from the existing literature; others are the result of our own experiments.

## This notebook contains the following:
- Implementation and training of a GloVe + LSTM + linear model

In [None]:
# Imports for model implementation
import re
import pandas as pd
import numpy as np
import torch.optim as optim
import torch

from torchtext.vocab import GloVe
from torchtext.data import get_tokenizer
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler



In [None]:
# Check if cuda is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla V100-DGXS-32GB


Loading and preprocessing the dataset

In [None]:
# Load the train dataframe
df = pd.read_csv("./Corona_NLP_train.csv", encoding='latin1')
print(df)

       UserName  ScreenName                      Location     TweetAt  \
0          3799       48751                        London  16-03-2020   
1          3800       48752                            UK  16-03-2020   
2          3801       48753                     Vagabonds  16-03-2020   
3          3802       48754                           NaN  16-03-2020   
4          3803       48755                           NaN  16-03-2020   
...         ...         ...                           ...         ...   
41152     44951       89903  Wellington City, New Zealand  14-04-2020   
41153     44952       89904                           NaN  14-04-2020   
41154     44953       89905                           NaN  14-04-2020   
41155     44954       89906                           NaN  14-04-2020   
41156     44955       89907  i love you so much || he/him  14-04-2020   

                                           OriginalTweet           Sentiment  
0      @MeNyrb

In [None]:
# Drop the columns that are not needed for sentiment classification
df.drop(columns=['UserName','ScreenName','Location','TweetAt'], inplace=True)
df

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative
...,...,...
41152,Airline pilots offering to stock supermarket s...,Neutral
41153,Response to complaint not provided citing COVI...,Extremely Negative
41154,You know itÂs getting tough when @KameronWild...,Positive
41155,Is it wrong that the smell of hand sanitizer i...,Neutral


In [None]:
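# Keep only tweets with at least 10 whitespace-separated words
def get_long_tweets(df, tweet_lengths):
    # .copy() returns an independent DataFrame, so later column assignments
    # do not trigger pandas' SettingWithCopyWarning
    to_return = df.loc[tweet_lengths >= 10].copy()
    return to_return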

tweet_lengths = df['OriginalTweet'].apply(lambda x: len(x.split()))
df_lengthy = get_long_tweets(df, tweet_lengths)

In [None]:
df_lengthy

Unnamed: 0,OriginalTweet,Sentiment
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative
5,As news of the regionÂs first confirmed COVID...,Positive
...,...,...
41152,Airline pilots offering to stock supermarket s...,Neutral
41153,Response to complaint not provided citing COVI...,Extremely Negative
41154,You know itÂs getting tough when @KameronWild...,Positive
41155,Is it wrong that the smell of hand sanitizer i...,Neutral


In [None]:
# Define preprocessing function
def preprocessing(x):
    def remove_hashtags(text): return re.sub(r'#', '' , text)
    def remove_mentions(text): return re.sub(r'@', '' , text)
    def remove_urls(text): return re.sub(r'https?://\S+', ' ', text)
    def change_apostrophe(text): return re.sub(r"Â’", "\'", text)
    def remove_special_chars(text): return re.sub(r"[^\w. ',-]", ' ', text)
    def remove_numbers(text): return re.sub(r'[\d]', ' ', text)
    def remove_formatting_symbols(text): return re.sub(r"[\r\n]+",'',text)
    def remove_escape_characters(text): return re.sub(r"\\",'',text)
    def remove_extra_spaces(text): return re.sub(r"\s{2,}",' ',text)
    def remove_space_before_period(text): return re.sub(r"\s\.", ".", text)
    def remove_strange_a(text): return "".join(c if ord(c)!=226 else "a" for c in text )
    x=x.apply(remove_hashtags)
    x=x.apply(remove_mentions)
    x=x.apply(remove_urls)
    x=x.apply(change_apostrophe)
    x=x.apply(remove_special_chars)
    x=x.apply(remove_numbers)
    x=x.apply(remove_formatting_symbols)
    x=x.apply(remove_escape_characters)
    x=x.apply(remove_extra_spaces)
    x=x.apply(remove_space_before_period)
    x=x.str.lower()
    x=x.apply(remove_strange_a)
    return x
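
To make the effect of these steps concrete, here is a minimal sketch that applies a subset of them, in the same order, to a single string instead of a pandas Series. The helper name `clean_tweet` and the sample tweet are illustrative and not part of the notebook.

```python
import re

def clean_tweet(text):
    """Apply a subset of the cleaning steps above to one string (illustrative only)."""
    text = re.sub(r"[#@]", "", text)            # drop hashtag and mention markers, keep the word
    text = re.sub(r"https?://\S+", " ", text)   # replace URLs with a space
    text = re.sub(r"[^\w. ',-]", " ", text)     # replace other special characters with a space
    text = re.sub(r"\d", " ", text)             # replace digits with a space
    text = re.sub(r"\s{2,}", " ", text)         # collapse runs of whitespace
    text = re.sub(r"\s\.", ".", text)           # remove the space before a period
    return text.lower()

# Hypothetical tweet, made up for this example
sample = "Panic buying at @Tesco!! 3 shelves empty #COVID19 https://t.co/abc ."
print(clean_tweet(sample))  # panic buying at tesco shelves empty covid.
```

Note that hashtags and mentions lose only their marker, so the word itself stays available for the embedding lookup.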

In [None]:
# Apply preprocessing function to tweets
df_lengthy['OriginalTweet'] = preprocessing(df_lengthy['OriginalTweet'])
df_lengthy



Unnamed: 0,OriginalTweet,Sentiment
1,advice talk to your neighbours family to excha...,Positive
2,coronavirus australia woolworths to give elder...,Positive
3,my food stock is not the only one which is emp...,Positive
4,"me, ready to go at supermarket during the covi...",Extremely Negative
5,as news of the regiona s first confirmed covid...,Positive
...,...,...
41152,airline pilots offering to stock supermarket s...,Neutral
41153,response to complaint not provided citing covi...,Extremely Negative
41154,you know ita s getting tough when kameronwilds...,Positive
41155,is it wrong that the smell of hand sanitizer i...,Neutral


In [None]:
# Define and apply the preprocessing for the labels
def label_preprocessing(labels):
    lab_dict={
        'Extremely Negative': 0,
        'Negative': 1,
        'Neutral': 2,
        'Positive': 3,
        'Extremely Positive': 4
    }
    labels=labels.map(lab_dict)

    return labels

df_lengthy['Sentiment']=label_preprocessing(df_lengthy['Sentiment'])
df_lengthy



Unnamed: 0,OriginalTweet,Sentiment
1,advice talk to your neighbours family to excha...,3
2,coronavirus australia woolworths to give elder...,3
3,my food stock is not the only one which is emp...,3
4,"me, ready to go at supermarket during the covi...",0
5,as news of the regiona s first confirmed covid...,3
...,...,...
41152,airline pilots offering to stock supermarket s...,2
41153,response to complaint not provided citing covi...,0
41154,you know ita s getting tough when kameronwilds...,3
41155,is it wrong that the smell of hand sanitizer i...,2
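
The integer codes follow the order of the sentiment scale, from 0 (Extremely Negative) to 4 (Extremely Positive). To read model predictions back as sentiment names, the mapping can be inverted. A minimal sketch; `LABEL_TO_ID`, `ID_TO_LABEL` and `decode_labels` are illustrative names that do not appear in the notebook.

```python
# Same mapping as lab_dict in label_preprocessing
LABEL_TO_ID = {
    'Extremely Negative': 0,
    'Negative': 1,
    'Neutral': 2,
    'Positive': 3,
    'Extremely Positive': 4,
}

# Inverse mapping: class index -> sentiment name
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}

def decode_labels(ids):
    """Translate a sequence of class indices into sentiment names."""
    return [ID_TO_LABEL[int(i)] for i in ids]

print(decode_labels([3, 0, 2]))  # ['Positive', 'Extremely Negative', 'Neutral']
```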


In [None]:
df_to_use = df_lengthy
df_to_use

Unnamed: 0,OriginalTweet,Sentiment
1,advice talk to your neighbours family to excha...,3
2,coronavirus australia woolworths to give elder...,3
3,my food stock is not the only one which is emp...,3
4,"me, ready to go at supermarket during the covi...",0
5,as news of the regiona s first confirmed covid...,3
...,...,...
41152,airline pilots offering to stock supermarket s...,2
41153,response to complaint not provided citing covi...,0
41154,you know ita s getting tough when kameronwilds...,3
41155,is it wrong that the smell of hand sanitizer i...,2


In [None]:
# Remove empty tweets from dataframe
df_to_use = df_to_use[df_to_use["OriginalTweet"] != " "]
df_to_use

Unnamed: 0,OriginalTweet,Sentiment
1,advice talk to your neighbours family to excha...,3
2,coronavirus australia woolworths to give elder...,3
3,my food stock is not the only one which is emp...,3
4,"me, ready to go at supermarket during the covi...",0
5,as news of the regiona s first confirmed covid...,3
...,...,...
41152,airline pilots offering to stock supermarket s...,2
41153,response to complaint not provided citing covi...,0
41154,you know ita s getting tough when kameronwilds...,3
41155,is it wrong that the smell of hand sanitizer i...,2


Split the dataframe into train, validation and test sets. \
We follow the 80/20 rule: 80% of the dataset forms the training set and 20% the test set. \
The training set is then split again, with 80% kept for training and 20% used as the validation set.

In [None]:
# Splitting of the dataframe in train, val, test sets
train_split, test_split = train_test_split(df_to_use, test_size = 0.2, random_state = 10,
                                           stratify =  df_to_use["Sentiment"])
train_split, val_split = train_test_split(train_split, test_size = 0.2, random_state = 10,
                                           stratify =  train_split["Sentiment"])
print(f"train shape: {train_split.shape}, val shape:{val_split.shape}, test shape:{test_split.shape}")

train shape: (25534, 2), val shape:(6384, 2), test shape:(7980, 2)
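
As a sanity check on the sizes printed above, two nested 80/20 splits amount to roughly 64% / 16% / 20% of the filtered data. The sketch below reproduces the arithmetic, assuming the test share is rounded up (which, to our understanding, is what scikit-learn does) and taking the total number of rows as the sum of the printed shapes. `split_sizes` is an illustrative helper.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test share up (assumed convention)."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

total = 25534 + 6384 + 7980                 # rows after filtering, from the printed shapes
train_val, test = split_sizes(total, 0.2)   # first split: 80% train+val, 20% test
train, val = split_sizes(train_val, 0.2)    # second split: 80% train, 20% val

print(train, val, test)                                     # 25534 6384 7980
print([round(n / total, 2) for n in (train, val, test)])    # [0.64, 0.16, 0.2]
```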


In [None]:
# Load GloVe embedding for twitter
global_vectors = GloVe(name='twitter.27B', dim = 100)

In [None]:
# Load basic english tokenizer
tokenizer = get_tokenizer("basic_english")

In [None]:
train_tweets = list(train_split.OriginalTweet.values)
train_labels = list(train_split.Sentiment.values)
val_tweets = list(val_split.OriginalTweet.values)
val_labels = list(val_split.Sentiment.values)
test_tweets = list(test_split.OriginalTweet.values)
test_labels = list(test_split.Sentiment.values)

In [None]:
# Define the function that encodes the tweets using GloVe embeddings
def encode_split(tweet_list, max_words, embed_len=100):
    # Tokenize each tweet
    X = [tokenizer(t) for t in tweet_list]
    # Pad short tweets with "" and truncate long ones so every tweet has max_words tokens
    X = [tokens + [""] * (max_words - len(tokens)) if len(tokens) < max_words else tokens[:max_words]
         for tokens in X]
    # Look up the GloVe vector of every token: (num_tweets, max_words, embed_len)
    X_tensor = torch.zeros(len(tweet_list), max_words, embed_len)
    for i, tokens in enumerate(X):
        X_tensor[i] = global_vectors.get_vecs_by_tokens(tokens)
    return X_tensor
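
The padding and truncation step inside `encode_split` can be looked at in isolation. The sketch below uses a hypothetical helper, `pad_or_truncate`, that mirrors that list comprehension. The empty string used as padding is presumably absent from the GloVe vocabulary, so, assuming torchtext's default of returning a zero vector for unknown tokens, padded positions end up as all-zero embeddings.

```python
def pad_or_truncate(tokens, max_words, pad_token=""):
    """Return exactly max_words tokens: right-pad short lists, cut long ones."""
    if len(tokens) < max_words:
        return tokens + [pad_token] * (max_words - len(tokens))
    return tokens[:max_words]

print(pad_or_truncate(["stay", "home"], 4))           # ['stay', 'home', '', '']
print(pad_or_truncate(["a", "b", "c", "d", "e"], 4))  # ['a', 'b', 'c', 'd']
```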

In [None]:
train_split

Unnamed: 0,OriginalTweet,Sentiment
2352,final thought - consider donating to a food pa...,1
37851,a the advice is to stock up on food and other ...,2
15120,walking through the supermarket i stumbled upo...,4
6853,catfordmassive i've read an article saying the...,1
30269,as a measure against the spread of covid japan...,1
...,...,...
29747,there shouldn't be a tp shortage. mask shortag...,1
18059,"sorry, millennials a coronavirus-induced reces...",3
36032,the april imon connections e-newsletter is now...,4
7010,all supermarket trollies and baskets need to b...,2


In [None]:
val_split

Unnamed: 0,OriginalTweet,Sentiment
19658,palladium gold regaining dma platinum silver u...,1
1974,"if corona virus ever comes to uganda, some of ...",0
11291,it is so bizarre to go to the grocery store wi...,1
21894,don q rum to make hand sanitizer for puerto ri...,3
10295,you can feel the tension and stress in that su...,1
...,...,...
9509,how quickly the world changes. meanwhile actor...,4
3431,stay up to date on the latest in consumer fina...,2
34976,shopping during coronavirus higher prices aren...,1
29445,do you know two easy steps for homemade saniti...,3


We encode the train, validation and test sets using the function defined above, with <i>max_words</i> set to 60, which is the 90<sup>th</sup> percentile of the tweet lengths in the training set.
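
The notebook does not show how this percentile was obtained. One possible way to compute it is sketched below on a toy corpus, assuming whitespace-separated words and the nearest-rank definition of a percentile; both are assumptions, and `percentile_nearest_rank` is an illustrative helper.

```python
import math

def percentile_nearest_rank(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical toy corpus; in the notebook this would be the training tweets
toy_tweets = [
    "stay home",
    "wash your hands often please",
    "prices are going up",
    "no pasta left at the store today",
    "ok",
]
lengths = [len(t.split()) for t in toy_tweets]
print(percentile_nearest_rank(lengths, 90))  # 7
```

Choosing a high percentile rather than the maximum keeps the input tensor small while truncating only the longest tweets.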

In [None]:
tok_train = encode_split(train_tweets, 60)
tok_val = encode_split(val_tweets, 60)
tok_test = encode_split(test_tweets, 60)

In [None]:
tok_train = tok_train.clone().detach()
tok_test = tok_test.clone().detach()
tok_val = tok_val.clone().detach()

In [None]:
# Convert the label lists to torch.Tensor
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)

In [None]:
batch_size = 64

In [None]:
# Create the DataLoader for our training set
train_data = TensorDataset(tok_train, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

In [None]:
# Create the DataLoader for our validation set
val_data = TensorDataset(tok_val, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

The model takes pre-trained GloVe word embeddings as input, passes them through a stacked LSTM, and uses a fully connected layer for classification. \
It is designed to establish a baseline performance for our project work.


In [None]:
# Define the class of the model using a basic LSTM and a fully connected layer for classification
class GloVeLSTM(torch.nn.Module):
    """
    Args:
      embedding_dim (int): Dimensionality of the GloVe word embeddings.
      hidden_size (int): Number of hidden units in each LSTM layer.
      num_layers (int): Number of stacked LSTM layers (must be at least 2, see forward).
      output_dim (int): Number of output classes (the five sentiment levels).
    """
    def __init__(self, embedding_dim, hidden_size, num_layers, output_dim):
        super(GloVeLSTM, self).__init__()

        self.lstm = torch.nn.LSTM(embedding_dim, hidden_size, batch_first=True, num_layers=num_layers, bidirectional=False)
        # The classifier receives the final hidden states of the last two LSTM layers, hence hidden_size * 2
        self.fc = torch.nn.Linear(hidden_size * 2, output_dim)

    def forward(self, x):
        # x: (batch, max_words, embedding_dim); h: (num_layers, batch, hidden_size)
        out, (h, c) = self.lstm(x)
        # The LSTM is unidirectional, so h[-2] and h[-1] are the final hidden states
        # of the second-to-last and last layers, not of two directions
        h = torch.cat((h[-2, :, :], h[-1, :, :]), dim=1)
        # Return raw logits of shape (batch, output_dim); CrossEntropyLoss applies softmax itself
        out = self.fc(h)

        return out

In [None]:
# Define hyperparameters for the GloVeLSTM model

hidden_size = 256     # Number of hidden units in the LSTM layer
output_dim = 5        # Number of output classes (Extremely negative, negative, ...)
num_layers = 2        # Number of stacked LSTM layers
embedding_dim = 100   # Dimensionality of the GloVe word embeddings
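
To get a feel for the model size these hyperparameters imply, the number of trainable parameters can be worked out by hand. The sketch below assumes PyTorch's LSTM layout (four gates, one input-to-hidden and one hidden-to-hidden weight matrix per layer, and two bias vectors per layer) and the `hidden_size * 2` classifier input used above. The function names are illustrative.

```python
def lstm_layer_params(input_size, hidden_size):
    """One unidirectional LSTM layer: 4 gates, input and recurrent weights, 2 bias vectors."""
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

def glove_lstm_params(embedding_dim, hidden_size, num_layers, output_dim):
    """Total parameter count of the stacked LSTM plus the linear classifier."""
    total = 0
    for layer in range(num_layers):
        # The first layer reads embeddings, deeper layers read the previous layer's hidden states
        in_size = embedding_dim if layer == 0 else hidden_size
        total += lstm_layer_params(in_size, hidden_size)
    # Linear classifier on the concatenation of two hidden states (weights + bias)
    total += (2 * hidden_size) * output_dim + output_dim
    return total

print(glove_lstm_params(100, 256, 2, 5))  # 895493
```

The GloVe vectors themselves are looked up ahead of time and are not trainable parameters of this model.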

In [None]:
model = GloVeLSTM(embedding_dim, hidden_size, num_layers, output_dim).to(device)

In [None]:
def train(model, optimizer, scheduler, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    """Train the GloVeLSTM model.

    The scheduler argument is accepted but unused: scheduler.step() is disabled below.
    """
    # Start training loop
    print("Start training...\n")
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9}")
        print("-"*70)

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1

            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters (the learning-rate scheduler is disabled in this baseline)
            optimizer.step()
            # scheduler.step()

            # Print the running loss every 100 batches and at the end of the epoch
            if (step % 100 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9}")
                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                print("-"*70)

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        if evaluation:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f}")
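
The training loop above calls `torch.nn.utils.clip_grad_norm_` with a maximum norm of 1.0 before each optimizer step. Conceptually, this rescales all gradients by a single factor whenever their joint L2 norm exceeds the limit. The following pure-Python sketch illustrates the idea on plain lists; it is not PyTorch's implementation, and `clip_by_global_norm` is an illustrative name.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    # The small epsilon avoids division by zero when all gradients are zero
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm

# Hypothetical gradients of two parameter tensors
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # roughly [[0.6, 0.0], [0.0, 0.8]]
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its magnitude is limited.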

In [None]:
def evaluate(model, dataloader, return_preds=False):
    """Measure the model's loss and accuracy on a dataloader (validation or test).

    Returns (mean loss, mean per-batch accuracy), or (mean loss, predictions)
    when return_preds is True.
    """
    # Put the model into the evaluation mode, which disables training-only
    # behaviour such as dropout.
    model.eval()
    loss_fn = torch.nn.CrossEntropyLoss()

    # Tracking variables
    accuracies = []
    losses = []
    all_preds = []

    # For each batch in the dataloader...
    for batch in dataloader:
        # Load batch to the selected device
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits without tracking gradients
        with torch.no_grad():
            logits = model(b_input_ids)

            # Compute loss
            loss = loss_fn(logits, b_labels)
            losses.append(loss.item())

            # Get the predictions
            preds = torch.argmax(logits, dim=1).flatten()

            if return_preds:
                all_preds.append(preds.cpu())
            else:
                # Calculate the accuracy rate
                acc = accuracy(preds, b_labels)
                accuracies.append(acc)

    # Compute the average loss over the batches
    loss = np.mean(losses)
    if return_preds:
        # Return the predicted class of every example, in dataloader order
        return loss, torch.cat(all_preds)
    # Average the per-batch accuracies
    acc = np.mean(accuracies)
    return loss, acc

In [None]:
def accuracy(pred_y, y):
    """Calculate accuracy."""
    return ((pred_y == y).sum() / len(y)).item()
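
One detail worth keeping in mind: `evaluate` averages the per-batch accuracies, so a smaller final batch weighs as much as a full one. With a batch size of 64 and thousands of examples the difference from the exact accuracy should be small, but the two quantities are not identical. A sketch with made-up batch counts:

```python
def mean_of_batch_accuracies(batches):
    """Unweighted mean of per-batch accuracies, as evaluate() computes it."""
    return sum(correct / size for correct, size in batches) / len(batches)

def overall_accuracy(batches):
    """Exact accuracy: total correct predictions over total examples."""
    return sum(c for c, _ in batches) / sum(s for _, s in batches)

# Hypothetical (correct, batch_size) pairs; the last batch is much smaller
batches = [(48, 64), (48, 64), (4, 4)]
print(round(mean_of_batch_accuracies(batches), 4))  # 0.8333
print(round(overall_accuracy(batches), 4))          # 0.7576
```

The toy numbers exaggerate the gap on purpose; weighting each batch by its size would make the two values agree.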

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.001)



This is a basic recipe used for our baseline training run. \
It consists of 10 epochs, the Adam optimizer with a learning rate of 0.001, and no learning rate scheduler.

In [None]:
train(model, optimizer, None, train_dataloader, val_dataloader, 10, True)

Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc 
----------------------------------------------------------------------
   1    |   100   |   1.563095   |     -      |     -    
----------------------------------------------------------------------
   1    |   200   |   1.539906   |     -      |     -    
----------------------------------------------------------------------
   1    |   300   |   1.501741   |     -      |     -    
----------------------------------------------------------------------
   1    |   398   |   1.468822   |     -      |     -    
----------------------------------------------------------------------
   1    |    -    |   1.518751   |  1.531048  |   0.28   
 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc 
----------------------------------------------------------------------
   2    |   100   |   1.430154   |     -      |     -    
----------------------------------------------------------------------
   2    |   200   | 

In [None]:
# Create the DataLoader for our test set and evaluate the trained model on it
test_labels = torch.tensor(test_labels)
test_data = TensorDataset(tok_test, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
test_loss, test_acc = evaluate(model, test_dataloader)
print(f"Test accuracy {test_acc}")

Test accuracy 0.6636022729873657


Our basic LSTM model achieved a validation accuracy of 67% and a test accuracy of 66%. These results serve as a valuable baseline for comparison with more complex models.