In this experiment, HuggingFace provides both the datasets and the pre-trained models.
Start by installing the HuggingFace packages:

In [66]:
pip install -U datasets transformers[torch] evaluate



In [67]:
from datasets import load_dataset
dataset = load_dataset('MrbBakh/Sentiment140')

## 1.2.1 Text Pre-processing

In [68]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [69]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
def tokenize(row):
    tokens = word_tokenize(row['text'])
    # to lowercase and remove punctuation
    tokens = [token.lower() for token in tokens if token.isalpha()]
    return {
        'tokens': tokens
    }
dataset = dataset.map(tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [70]:
from nltk.corpus import stopwords

nltk.download('stopwords')
# Build the stop-word set once instead of once per row
stop_words = set(stopwords.words('english'))

def remove_stopwords(row):
    tokens = [token for token in row['tokens'] if token not in stop_words]
    return {
        'tokens': tokens
    }
dataset = dataset.map(remove_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
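
The two pre-processing steps above can be illustrated on a single sentence without NLTK. This is only a rough sketch: the regular expression is a stand-in for `word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list, whereas the notebook uses NLTK's full English stop-word list.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses NLTK's much larger English list.
STOP_WORDS = {"i", "the", "a", "is", "it", "to", "my", "so"}

def preprocess(text):
    # Split into runs of letters and single non-space symbols,
    # a rough stand-in for word_tokenize.
    tokens = re.findall(r"[A-Za-z]+|\S", text)
    # Keep alphabetic tokens only and lower-case them (drops punctuation and digits).
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # Drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("I LOVE the new phone, it is so fast!!!"))
```

Only the content-bearing words survive, which is the form in which the tokens are later passed to the stemmer and to the embedding model.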


### Task 1: Use PorterStemmer from NLTK to stem the tokens.

In [71]:


from nltk.stem import PorterStemmer

# Create the stemmer once and reuse it for every row
porter_stemmer = PorterStemmer()
def stem_tokens(row):
    tokens = [porter_stemmer.stem(token) for token in row['tokens']]
    return {
      'tokens': tokens
    }
dataset = dataset.map(stem_tokens)



## 1.2.2 Word Embedding

Now that you have pre-processed tokens, use them to train the embedding model. The Word2Vec
model from gensim is used in this experiment:

In [72]:
from gensim.models import Word2Vec
word_embedding = Word2Vec(dataset['train']['tokens'], vector_size=100,
min_count=1,window=5,sg=1, hs=0, negative=10)

The model is trained on the training set’s tokens, and the size of the embedding
vector is 100. The other parameters are related to the method used in Word2Vec,
which are not covered in this introductory experiment.
After training the model, you can save it and load it again if you wish to:

In [73]:
word_embedding.save('w2v.model')
word_embedding = Word2Vec.load('w2v.model')
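
As a brief aside on the training arguments used above: `sg=1` selects the skip-gram variant, `window=5` bounds how far apart two tokens may be to count as context for each other, `negative=10` sets the number of noise words drawn per positive example, and `min_count=1` keeps every token. The pairing idea behind `window` can be sketched in plain Python. `skipgram_pairs` is a hypothetical helper written for illustration; gensim's actual training loop involves further details that are not reproduced here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# A window of 1 pairs each token with its direct neighbours only.
pairs = skipgram_pairs(["love", "new", "phone", "fast"], window=1)
print(pairs)
```

Skip-gram then learns vectors under which each center word is predictive of its context words.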

## 1.2.3 Average Vector

The first model would be an Average Vector model. A Naive Bayes classifier is going
to accept the average vector as input in order to classify samples into positive or
negative sentiments.

In [74]:
def filter_tokens(example):
    return {
        'tokens': [token for token in example['tokens'] if token in
         word_embedding.wv]
    }
def mean_vector(example):
    return {
        'mean': word_embedding.wv[example['tokens']].mean(axis=0)
    }
dataset = dataset.map(filter_tokens) \
          .filter(lambda e: len(e['tokens']) > 0) \
          .map(mean_vector)

Progress output (condensed): the three splits hold 40000, 5000 and 5000 samples before filtering, and 39924, 4963 and 4966 samples after it.

The `filter_tokens` function returns only the tokens that have embeddings
in the trained Word2Vec model. After that, samples that do not contain any
valid token are filtered out. Then the `mean_vector` function returns the average
of the tokens' vectors for each sample.
Finally, train the Naive Bayes classifier:

In [75]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
X = np.array(dataset['train']['mean'])
y = np.array(dataset['train']['sentiment'])
clf = GaussianNB()
clf.fit(X, y)
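
The feature fed to the classifier is the element-wise average of a sample's token vectors. A dependency-free sketch of that step, using hypothetical 3-dimensional embeddings instead of the 100-dimensional Word2Vec vectors:

```python
# Hypothetical 3-dimensional embeddings; the notebook's vectors have 100 dimensions.
embeddings = {
    "love": [1.0, 0.0, 2.0],
    "phone": [3.0, 2.0, 0.0],
}

def mean_vector(tokens, embeddings):
    # Keep only tokens that have an embedding (the role of filter_tokens).
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # samples like this are filtered out in the notebook
    dim = len(known[0])
    # Average each dimension over the remaining tokens.
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

print(mean_vector(["love", "zzz", "phone"], embeddings))  # [2.0, 1.0, 1.0]
```

Every sample ends up with a vector of the same size regardless of its length, which is what lets a fixed-input classifier such as Gaussian Naive Bayes consume it. The price is that word order is discarded.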

### Task 2: Compute the accuracy and the confusion matrix of the trained classifier on the test dataset.

In [76]:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
X_test = np.array(dataset['test']['mean'])
y_test = np.array(dataset['test']['sentiment'])
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
ConfusionMatrix = confusion_matrix(y_test, y_pred)
print("accuracy:", accuracy)
print("confusion matrix:\n", ConfusionMatrix)

accuracy: 0.6657269432138542
confusion matrix:
 [[2066  504]
 [1156 1240]]
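
The reported accuracy can be recomputed from the confusion matrix itself. scikit-learn places true labels on the rows and predicted labels on the columns, so the diagonal holds the correct predictions. The sketch below also derives recall and precision for class 1; which sentiment class 1 denotes depends on the dataset's label encoding, which is not shown here.

```python
# Confusion matrix printed above (rows: true label, columns: predicted label).
cm = [[2066, 504],
      [1156, 1240]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total          # correct predictions / all samples
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])       # share of true class 1 that was found
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])    # share of predicted class 1 that is right

print(total, round(accuracy, 4), round(recall_1, 4), round(precision_1, 4))
```

The gap between the two off-diagonal cells (1156 versus 504) shows that the classifier misses class 1 far more often than it misses class 0, something the single accuracy figure hides.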


## 1.2.4 LSTM

In this section, an LSTM is used instead of feeding the average vector to a Naive
Bayes classifier.
LSTM in PyTorch is a module that accepts a tensor of shape $L \times V$, where $L$ is
the sequence length and $V$ is the token vector length. The LSTM module consumes
each sequence element $x_t$, along with the previous hidden state $(h_{t-1}, c_{t-1})$ and output
$o_{t-1}$, to produce the hidden state $(h_t, c_t)$ and output $o_t$. As a result, the output of
the entire sequence is of shape $L \times H$, where $H$ is the hidden size.
Let's first convert the tokens to the corresponding vectors using the trained
Word2Vec model.

In [77]:
def vectorize(example):
    return {
        'vectors': word_embedding.wv[example['tokens']]
    }
dataset = dataset.map(vectorize)


In [78]:
import torch
import torch.nn as nn
lstm = nn.LSTM(100, 200)
sequence = torch.tensor(dataset['train'][0]['vectors'])
out, _ = lstm(sequence)
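
To make the recurrence and the resulting shapes concrete, here is a minimal single-layer LSTM in plain Python. It is a sketch under simplifying assumptions: the weights are random, biases are omitted, and the per-step output is taken to be the hidden state $h_t$. `lstm_forward` is a hypothetical helper, not PyTorch's implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_forward(sequence, hidden_size, seed=0):
    """Run a single-layer LSTM with random weights over a list of L vectors."""
    rng = random.Random(seed)
    input_size = len(sequence[0])

    def init():
        # One weight row per hidden unit, acting on the concatenation [x_t, h_{t-1}].
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]

    W_i, W_f, W_g, W_o = init(), init(), init(), init()

    def affine(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]

    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x in sequence:
        v = list(x) + h
        i = [sigmoid(z) for z in affine(W_i, v)]    # input gate
        f = [sigmoid(z) for z in affine(W_f, v)]    # forget gate
        g = [math.tanh(z) for z in affine(W_g, v)]  # candidate cell values
        o = [sigmoid(z) for z in affine(W_o, v)]    # output gate
        c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c, i, g)]
        h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
        outputs.append(h)
    return outputs, (h, c)

# A sequence with L = 4 elements of size V = 3, and hidden size H = 2.
sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5], [0.2, 0.0, -0.3]]
out, (h_n, c_n) = lstm_forward(sequence, hidden_size=2)
print(len(out), len(out[0]))  # 4 2
```

The output holds one vector of size $H$ per sequence element, and its last entry coincides with the final hidden state.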

Note: for batched input with batch size $N$, the LSTM expects input of shape
$L \times N \times V$ and outputs shape $L \times N \times H$. To put the batch dimension first, use `batch_first=True`;
the input is then $N \times L \times V$ and the output $N \times L \times H$.
To define and use a 2-layer LSTM that accepts a batched input:

In [79]:
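lstm = nn.LSTM(100, 200, 2, batch_first=True)
batch = [torch.tensor(vectors) for vectors in
        dataset['train'][0:4]['vectors']]
# pad_sequence defaults to L x N x V; batch_first=True gives the N x L x V layout the LSTM expects
padded_batch = nn.utils.rnn.pad_sequence(batch, batch_first=True)
out, _ = lstm(padded_batch)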

### Task 3: Slice the output of the LSTM to get the last token's output for every sample in the batch.

In [80]:

out = out[:, -1, :]




Another module from PyTorch that can be used is Embedding. This module
contains a learnable weight matrix that maps every word to its embedding vector.
However, the input is not the word itself but its index in the weight matrix.
The trained Word2Vec model also holds a weight matrix (`word_embedding.wv.vectors`).
To fill the Embedding module with the Word2Vec weights, map the words to
their corresponding indices.

In [81]:
def word_to_index(example):
    indices = [word_embedding.wv.key_to_index[token] for token in
    example['tokens']]
    return {
        'indices': indices
    }
dataset = dataset.map(word_to_index)


To accelerate the training, batches can be padded in order to be grouped in a
single tensor and moved to GPU at once. Define a padding vector, for example, the
zero vector, and give it the last index to avoid collision with other indices.

In [82]:
pad_vector = np.zeros(word_embedding.vector_size)
weights = np.vstack([word_embedding.wv.vectors, pad_vector])

vocab_size, embedding_size = weights.shape
pad_idx = vocab_size - 1

Now, pad the sequences. The arguments `batched=True` and `batch_size=None` are
used to map all dataset samples at once, and `with_format('torch')` ensures
that the returned dataset yields PyTorch tensors.

In [83]:
import torch
import torch.nn as nn

def pad_sequences(batch, pad_idx):
    # Convert indices to tensor and pad them
    indices = [torch.tensor(sample, dtype=torch.long) for sample in batch['indices']]
    indices = nn.utils.rnn.pad_sequence(indices, batch_first=True, padding_value=pad_idx)

    return {'indices': indices}

dataset = dataset.map(pad_sequences, batched=True, batch_size=None, fn_kwargs={'pad_idx': pad_idx}).with_format('torch')
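
What the padding step does to the index sequences can be sketched without PyTorch. `pad_to_longest` is a hypothetical helper and the vocabulary size is made up; the choice of the last index as padding index follows the notebook.

```python
def pad_to_longest(sequences, pad_idx):
    """Right-pad lists of token indices to a common length (N x L layout)."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

vocab_size = 6            # hypothetical: 5 real words plus the padding row
pad_idx = vocab_size - 1  # last index, as in the notebook

batch = [[0, 3, 1], [2], [4, 4]]
padded = pad_to_longest(batch, pad_idx)
print(padded)  # [[0, 3, 1], [2, 5, 5], [4, 4, 5]]
```

Because the whole split is mapped in a single batch, every sample is padded to the length of the longest sample in that split.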



In [84]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentimentClassifierLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size, num_layers):
        super(SentimentClassifierLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        embeddings = self.embedding(x)
        out, _ = self.lstm(embeddings)
        out = out[:, -1, :]
        out = self.fc(out)
        out = torch.sigmoid(out)
        return out.squeeze(1)
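
Because every sample is right-padded to a common length, `out[:, -1, :]` reads the LSTM output at the final padded position, which for shorter samples corresponds to a padding token. One possible alternative, which is not part of the original experiment, is to read the output at each sample's last real token. The index arithmetic can be sketched in plain Python; `last_valid_positions` is a hypothetical helper and the padding index is made up.

```python
def last_valid_positions(padded, pad_idx):
    """For each right-padded row, return the index of its last non-padding entry."""
    positions = []
    for row in padded:
        length = sum(1 for idx in row if idx != pad_idx)
        positions.append(max(length - 1, 0))
    return positions

pad_idx = 5  # hypothetical padding index
padded = [[0, 3, 1], [2, 5, 5], [4, 4, 5]]
print(last_valid_positions(padded, pad_idx))  # [2, 0, 1]
```

In PyTorch, positions like these could be used to index the `out` tensor per sample, or the padded batch could be packed with `nn.utils.rnn.pack_padded_sequence` before it is passed to the LSTM.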


Define the model and fill the embedding layer. It is important to set `requires_grad`
to False for the embedding layer, as it is pre-trained and should not be updated
during backpropagation.

In [85]:
hidden_size = 128
num_layers = 2
model = SentimentClassifierLSTM(vocab_size=vocab_size,
        embedding_size=embedding_size, hidden_size=hidden_size,
        num_layers=num_layers)
model.embedding.weight = nn.Parameter(torch.FloatTensor(weights))
model.embedding.weight.requires_grad = False

As usual, define the optimizer and the loss function:

In [86]:
learning_rate = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
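
For reference, binary cross-entropy compares the sigmoid output $p$ of the model with the label $y \in \{0, 1\}$ and, with the default mean reduction, averages $-(y \log p + (1 - y) \log(1 - p))$ over the batch. A small plain-Python sketch of that formula, with made-up probabilities:

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over a batch."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# Confident and correct predictions give a small loss ...
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
# ... while undecided predictions of 0.5 give log(2), about 0.693.
unsure = binary_cross_entropy([0.5, 0.5], [1, 0])
print(confident, unsure)
```

A training loss that stays close to 0.693 therefore indicates a model that is doing no better than guessing.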

Then move everything to GPU:

In [87]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
criterion = criterion.to(device)

Define the dataloaders:

In [88]:
from torch.utils.data import DataLoader, TensorDataset

batch_size = 2048

def to_dataloader(dataset, split, shuffle):
    # Create a TensorDataset from the specified split
    data = TensorDataset(dataset[split]['indices'], dataset[split]['sentiment'])
    # Return DataLoader with specified batch_size and shuffle
    return DataLoader(data, batch_size=batch_size, shuffle=shuffle)

train_dataloader = to_dataloader(dataset, 'train', True)
test_dataloader = to_dataloader(dataset, 'test', False)
validation_dataloader = to_dataloader(dataset, 'validation', False)


Finally, define the training function:

In [89]:
def train_one_epoch(dataloader):
    # Switch back to training mode; the evaluation loops call model.eval()
    model.train()
    for inputs, labels in dataloader:
        inputs = inputs.to(device)
        labels = labels.to(device).float()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

### Task 4: Use `train_one_epoch` to train the model for 20 epochs. Bonus: Evaluate the model on the validation set after each epoch and print the validation accuracy.


In [None]:
accuracies = []

for epoch in range(20):
    train_one_epoch(train_dataloader)
    with torch.no_grad():
        model.eval()
        correct = 0
        total = 0
        for inputs, labels in validation_dataloader:
            inputs = inputs.to(device)
            labels = labels.to(device).float()
            outputs = model(inputs)
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        accuracy = correct / total
        accuracies.append(accuracy)
        print(f'Epoch {epoch+1}, Validation Accuracy: {accuracy:.4f}')

### Task 5: Evaluate the model on the test set using the accuracy and confusion matrix.


In [91]:
import numpy as np
from sklearn.metrics import confusion_matrix

model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for inputs, labels in test_dataloader:
        inputs = inputs.to(device)
        labels = labels.to(device).float()
        outputs = model(inputs)
        predicted = (outputs > 0.5).float()
        # Collect predictions from every batch, not only the last one
        all_preds.append(predicted.cpu().numpy())
        all_labels.append(labels.cpu().numpy())

y_pred = np.concatenate(all_preds)
y_true = np.concatenate(all_labels)
accuracy = (y_pred == y_true).mean()
print(f'Test Accuracy: {accuracy:.4f}')
confusion_mat = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(confusion_mat)

Output of the recorded run, in which the confusion matrix was computed from the final batch only (870 of the 4966 test samples):

Test Accuracy: 0.5520
Confusion Matrix:
[[279 186]
 [206 199]]


### Task 6: Compare the performance of the model with the performance of the Average Vector model.

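In principle, the LSTM model reads the text token by token, so it can capture word order and context, at the cost of longer training. The Average Vector model only averages the word vectors to get a general idea of the text; it is simple and fast but ignores word order. In this run, however, the LSTM reached a test accuracy of 0.5520, below the 0.6657 of the Average Vector model. A likely contributor is that the classifier reads the LSTM output at the last position of each padded sequence, which for all but the longest samples is a padding token rather than the final word. Choosing a model therefore depends on balancing accuracy against speed, and on checking that the more complex model is actually trained effectively.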

## 1.2.5 Transformers

The Transformer is a different architecture for sequence models and the state of the art
in NLP. Use the pre-trained BERT-mini model, which is available on HuggingFace
under `lyeonii/bert-mini`.
Start by defining the tokenizer. The tokenizer is responsible for returning the
indices of the tokens. Use HuggingFace's AutoTokenizer with the repository of the model, and it will return the appropriate tokenizer for that model:

In [93]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('lyeonii/bert-mini')

Then tokenize the dataset; the tokenizer handles padding as well. Make sure
to set `return_tensors` to `'pt'`, which stands for PyTorch:

In [94]:
tokenized_dataset = dataset.map(lambda x: tokenizer(
    x['text'],
    padding=True,
    return_tensors='pt'
), batched=True, batch_size=None).with_format('torch')


HuggingFace models expect to receive the labels under the key `labels`. Thus,
rename the column `sentiment` in the dataset to `labels`.

In [95]:
tokenized_dataset = tokenized_dataset.rename_column('sentiment', 'labels')

Then load the model itself using AutoModelForSequenceClassification, similar
to AutoTokenizer:

In [96]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('lyeonii/bert-mini',
num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at lyeonii/bert-mini and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Finally, use HuggingFace’s Trainer to train the model:

In [97]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir='sentiment-analysis',
    num_train_epochs=3,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=512,
    weight_decay=0.01,
    # Newer transformers releases name this argument eval_strategy
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_strategy='epoch',
    report_to="none"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation']
)
trainer.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.5805 | 0.496302 |
| 2 | 0.4777 | 0.466691 |
| 3 | 0.4558 | 0.461784 |


TrainOutput(global_step=234, training_loss=0.5046882792415782, metrics={'train_runtime': 158.9632, 'train_samples_per_second': 753.458, 'train_steps_per_second': 1.472, 'total_flos': 370912765800960.0, 'train_loss': 0.5046882792415782, 'epoch': 3.0})
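
The `global_step=234` in this output follows from the training configuration. Assuming a single device and no gradient accumulation, each epoch takes one optimizer step per batch, and the last batch may be partially filled. The training split size used below is the 39924 samples that remained after filtering.

```python
import math

train_samples = 39924   # training split size after filtering, from the earlier output
batch_size = 512        # per_device_train_batch_size
epochs = 3              # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 78 234
```

With such a large batch size the model receives only 78 updates per epoch, which is worth keeping in mind when comparing the number of epochs across models.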

### Task 7: Use `compute_metrics` in the Trainer constructor, with the evaluate package, to compute the validation accuracy.

In [98]:
from evaluate import load

accuracy_metric = load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The model outputs one logit per class; predict the highest-scoring class
    predictions = logits.argmax(axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)



trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    compute_metrics=compute_metrics,

)


trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4342 | 0.457522 | 0.78642 |
| 2 | 0.4075 | 0.444383 | 0.792867 |
| 3 | 0.4029 | 0.44534 | 0.794882 |


TrainOutput(global_step=234, training_loss=0.414886719141251, metrics={'train_runtime': 161.7345, 'train_samples_per_second': 740.547, 'train_steps_per_second': 1.447, 'total_flos': 370912765800960.0, 'train_loss': 0.414886719141251, 'epoch': 3.0})
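
What a metric function of this kind computes can be shown on a handful of made-up logits. The sketch below is plain Python, and `accuracy_from_logits` is a hypothetical helper that only mirrors the idea: pick the class with the highest logit for each sample, then compare with the labels.

```python
def argmax(values):
    return max(range(len(values)), key=lambda k: values[k])

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)

# Hypothetical logits for four samples and two classes.
logits = [[1.2, -0.3], [-0.5, 0.8], [0.1, 0.4], [2.0, 1.9]]
labels = [0, 1, 0, 0]
preds, acc = accuracy_from_logits(logits, labels)
print(preds, acc)  # [0, 1, 1, 0] 0.75
```

No softmax is needed for the prediction, since softmax preserves the ordering of the logits.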

### Task 8: Evaluate the model on the test set using the accuracy and confusion matrix.

In [99]:
import numpy as np
from sklearn.metrics import confusion_matrix


eval_results = trainer.predict(tokenized_dataset['test'])

predictions = np.argmax(eval_results.predictions, axis=1)
labels = eval_results.label_ids


accuracy = np.mean(predictions == labels)
print("Test Accuracy:", accuracy)


conf_matrix = confusion_matrix(labels, predictions)
print("Confusion Matrix:")
print(conf_matrix)


Test Accuracy: 0.8060813532017721
Confusion Matrix:
[[2122  448]
 [ 515 1881]]
