<a href="https://colab.research.google.com/github/ChanglinWu/DL/blob/main/LSTM_Sentiment_torchtext_0_16_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM Sentiment Analysis

In this example notebook, we will demonstrate how to build and train a Long Short-Term Memory (LSTM) network using the IMDB dataset. LSTM is a type of recurrent neural network (RNN) that is particularly effective in capturing sequential information and long-term dependencies in text data. We use it here to classify the sentiment expressed in movie reviews as positive or negative.

<img src='https://raw.githubusercontent.com/rishikksh20/IMDB-Movie-Review-sentiment-Analysis/558b1a1eef30b5806f27891019c9dcad6d53cd88/Images/SentimentAnalysis16.png'>

The IMDB (Internet Movie Database) dataset is a commonly used benchmark dataset for sentiment analysis tasks in natural language processing. It consists of a large collection of movie reviews, with each review labeled as either positive or negative based on the sentiment expressed in the text. The dataset is split into a training set and a test set, with 25,000 reviews in each set.

References:

- GitHub repository: https://github.com/rasbt/stat453-deep-learning-ss21


In [None]:
# Downgrade the torchtext version to 0.16.0
!pip install torchtext==0.16.0 -qqq
!python -m spacy download en_core_web_sm -qqq


In [None]:
import torch
import torch.nn.functional as F
import torchtext
import time
import random
import pandas as pd

# torch.backends.cudnn.deterministic = True

In [None]:
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
VOCABULARY_SIZE = 20000
LEARNING_RATE = 0.005
BATCH_SIZE = 128
NUM_EPOCHS = 15
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_CLASSES = 2

In [None]:
# If we have a GPU available, we'll set our device to GPU. We'll use this device variable later in our code.
is_cuda = torch.cuda.is_available()
if is_cuda:
    DEVICE = torch.device("cuda")
    print("GPU is available")
else:
    DEVICE = torch.device("cpu")
    print("GPU not available, CPU used")

GPU is available


# Download Dataset
The following cells download the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification as a CSV-formatted file:

In [None]:
!wget https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz

!gunzip -f movie_data.csv.gz

--2023-10-27 12:33:40--  https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz [following]
--2023-10-27 12:33:40--  https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26521894 (25M) [application/octet-stream]
Saving to: ‘movie_data.csv.gz’


2023-10-27 12:33:41 (256 MB/s) - ‘movie_data.csv.gz’ saved [26521894/26521894

In [None]:
df = pd.read_csv('movie_data.csv')
print(len(df))
df.head()

50000


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [None]:
del df

# Prepare Dataset with Torchtext

In [None]:
# Define the feature processing
import torchdata.datapipes as dp
# List the files in the current directory and keep only the CSV files
datapipe = dp.iter.FileLister(['.']).filter(filter_fn=lambda filename: filename.endswith('.csv'))



In [None]:
print(list(datapipe))
datapipe = dp.iter.FileOpener(datapipe, mode='rt')
datapipe = datapipe.parse_csv(delimiter=',', skip_lines=1) # skip_lines=1 skips the header row of the CSV file

['./movie_data.csv']


# Split Dataset into Train/Validation/Test
Split the dataset into training, validation, and test partitions:

In [None]:
# Split the rows into train (68%), validation (12%), and test (20%) partitions
N_ROWS = 50000
train, valid, test = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.8 * 0.85, "valid": 0.8 * 0.15, "test": 0.2}, seed= RANDOM_SEED)
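
The split weights can be checked by hand. The short sketch below only redoes the arithmetic with the values from the cell above; it does not call `random_split` itself.

```python
# Values copied from the split cell above.
N_ROWS = 50000
weights = {"train": 0.8 * 0.85, "valid": 0.8 * 0.15, "test": 0.2}

# Expected partition sizes, rounded to the nearest whole row
# (rounding absorbs tiny floating-point errors in the products).
sizes = {name: round(N_ROWS * w) for name, w in weights.items()}
print(sizes)
```

These sizes should agree with the `len(train)`, `len(valid)`, and `len(test)` values printed in the next cell.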

In [None]:
print(f'Num Train: {len(train)}, Num Validation: {len(valid)}, Num Test: {len(test)}')
temp_list = list(train)
print(temp_list[0], len(temp_list[0])) # first row and its number of fields

Num Train: 34000, Num Validation: 6000, Num Test: 10000
['In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenage

# Build Vocabulary
Build the vocabulary based on the top "VOCABULARY_SIZE" words:

In [None]:
import spacy

eng = spacy.load("en_core_web_sm")

def engTokenize(text):
  """
  Tokenize an English text and return a list of tokens
  """
  return [token.text for token in eng.tokenizer(text)]

In [None]:
def getToken(data_iter):
  # Yield the list of tokens for each (text, label) row in data_iter
  for text, label in data_iter:
    yield engTokenize(text)

def getLabel(data_iter):
  # Yield the label string for each (text, label) row in data_iter
  for text, label in data_iter:
    yield label

In [None]:
from collections import Counter, OrderedDict

def count_freq(iterator):
  counter = Counter()
  for tokens in iterator:
    counter.update(tokens)
  return counter

In [None]:
# Try to build the vocabulary
train_tokens = getToken(train)
train_labels = getLabel(train)

In [None]:
from torchtext.vocab import build_vocab_from_iterator

train_vocab = build_vocab_from_iterator(train_tokens, specials= ['<pad>', '<unk>'], special_first=True, max_tokens= VOCABULARY_SIZE + 2)

train_vocab.set_default_index(train_vocab['<unk>'])

In [None]:
print(f'Vocabulary size: {len(train_vocab)}')

print(train_vocab.get_stoi()['the'])

Vocabulary size: 20002
2
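
To make the idea behind the vocabulary concrete, here is a minimal pure-Python sketch of a frequency-based vocabulary with special tokens placed first. It is not torchtext's implementation: `build_vocab` and `toy_reviews` are hypothetical names used only for illustration, and tie-breaking between equally frequent tokens may differ from `build_vocab_from_iterator`.

```python
from collections import Counter

def build_vocab(token_lists, max_tokens, specials=("<pad>", "<unk>")):
    """Map the most frequent tokens to integer ids, special tokens first."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # The specials count toward max_tokens, as in the notebook's VOCABULARY_SIZE + 2.
    n_regular = max_tokens - len(specials)
    itos = list(specials) + [tok for tok, _ in counter.most_common(n_regular)]
    return {tok: idx for idx, tok in enumerate(itos)}

toy_reviews = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["the", "end"],
]
stoi = build_vocab(toy_reviews, max_tokens=4)

# Tokens outside the vocabulary fall back to the <unk> id.
unk = stoi["<unk>"]
ids = [stoi.get(tok, unk) for tok in ["the", "movie", "was", "great"]]
print(stoi)
print(ids)
```

The fallback to `<unk>` plays the same role as `set_default_index` in the cell above.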


## Look at most common words:

In [None]:
print(train_vocab.get_itos()[:10])

['<pad>', '<unk>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']


## Converting a string to an integer:



In [None]:
print(train_vocab.get_stoi()['the'])

2


## Class labels:

In [None]:
print(count_freq(train_labels))

Counter({'0': 17103, '1': 16897})


# Numericalize sentences using vocabulary

In [None]:
import torchtext.transforms as T

def getTransform(vocab):
    """
    Create transforms based on given vocabulary. The returned transform is applied to sequence
    of tokens.
    """
    text_transform = T.Sequential(
        ## converts the sentences to indices based on given vocabulary
        T.VocabTransform(vocab=vocab),
    )
    return text_transform

def applyTransform(sequence_pair):
    """
    Apply transforms to sequence of tokens in a sequence pair
    """

    return (
        getTransform(train_vocab)(engTokenize(sequence_pair[0])), [int(sequence_pair[1])]
    )


In [None]:
train = train.map(applyTransform)
valid = valid.map(applyTransform)
test = test.map(applyTransform)

# Check the state
temp_list = list(train)
print(temp_list[0])

([158, 6528, 3, 2, 2236, 4408, 1, 28, 4723, 3209, 30, 1149, 8, 2, 355, 17, 759, 1603, 7, 9241, 1, 3, 18236, 3, 8683, 4, 744, 2, 1, 1266, 3, 11865, 7, 2382, 3, 80, 19, 1978, 10, 2, 11001, 7, 54, 440, 6, 54, 675, 4932, 1, 4, 11594, 17, 127, 179, 353, 3, 2, 636, 1643, 1, 28, 1527, 1, 30, 3, 44, 9, 5, 1183, 3818, 1624, 13, 52, 3305, 10, 6878, 23, 1, 10, 1, 5956, 3477, 6, 1681, 8, 1, 3, 1128, 8, 4111, 2, 468, 20, 35, 1942, 1912, 1, 28, 3786, 4046, 30, 20, 2, 1305, 7, 524, 5, 299, 4, 25, 6453, 14111, 6, 57, 33, 3152, 109, 3, 26, 20, 2, 1503, 7, 2, 5695, 1624, 1260, 9486, 28, 666, 12929, 30, 13, 19, 10, 2883, 7, 2, 3917, 10, 2, 1393, 15, 3, 47, 1885, 2, 1922, 6, 5, 6901, 7, 756, 6, 328, 8, 1046, 2, 15928, 18, 1, 10, 18236, 14, 9, 5, 60, 261, 22, 3, 20, 2, 336, 77, 7, 5, 675, 7, 5, 3495, 179, 181, 283, 13, 19, 2530, 41, 5, 2984, 2236, 660, 471, 19, 5, 3221, 4, 25, 989, 6, 1100, 267, 350, 76, 2494, 8, 1046, 2, 675, 23, 61, 88, 1902, 179, 4, 380, 3, 5, 1, 1624, 6, 7095, 1, 10, 6878, 19, 496, 8, 

# Define Data Loaders (with bucket batch)

In [None]:
def sortBucket(bucket):
    """
    Function to sort a given bucket. Here, we sort the (tokens, label) pairs
    by the length of the token sequence, so that reviews of similar length
    end up in the same batch and need less padding.
    """
    return sorted(bucket, key=lambda x: (len(x[0])))

In [None]:
train = train.bucketbatch(
    batch_size = BATCH_SIZE, batch_num=5,  bucket_num=1,
    use_in_batch_shuffle=False, sort_key=sortBucket
)
valid = valid.bucketbatch(
    batch_size = BATCH_SIZE, batch_num=5,  bucket_num=1,
    use_in_batch_shuffle=False, sort_key=sortBucket
)
test = test.bucketbatch(
    batch_size = BATCH_SIZE, batch_num=5,  bucket_num=1,
    use_in_batch_shuffle=False, sort_key=sortBucket
)
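
The following is a rough pure-Python sketch of the bucketing idea, assuming a single pool of `batch_size * batch_num` samples that is sorted by length and then cut into batches. `bucket_batches` is a hypothetical helper, not the torchdata implementation, and it leaves out any shuffling of samples or batches.

```python
def bucket_batches(samples, batch_size, batch_num, key=len):
    """Yield batches after sorting each pool of batch_size * batch_num samples."""
    pool_size = batch_size * batch_num
    for start in range(0, len(samples), pool_size):
        # Sort one pool by length so neighbouring samples have similar lengths.
        pool = sorted(samples[start:start + pool_size], key=key)
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]

# Toy "reviews" represented only by their lengths.
toy = [[0] * n for n in (9, 2, 7, 3, 8, 1, 5, 4)]
batches = list(bucket_batches(toy, batch_size=2, batch_num=2))
lengths = [[len(seq) for seq in batch] for batch in batches]
print(lengths)
```

Within each pool, short sequences are grouped with short ones and long with long, which reduces the amount of padding needed per batch.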

In [None]:
# print(list(train)[0])

In [None]:
def separateSourceTarget(sequence_pairs):
    """
    input of form: `[(X_1,y_1), (X_2,y_2), (X_3,y_3), (X_4,y_4)]`
    output of form: `((X_1,X_2,X_3,X_4), (y_1,y_2,y_3,y_4))`
    """
    sources,targets = zip(*sequence_pairs)
    return sources,targets

## Apply the function to each element in the iterator
train = train.map(separateSourceTarget)
valid = valid.map(separateSourceTarget)
test = test.map(separateSourceTarget)

# print(list(train)[0])

In [None]:
temp_variable = list(train)[0]
print(len(temp_variable))
print(len(temp_variable[0]))

2
128


In [None]:
print(list(temp_variable[1]))

[[0], [0], [0], [1], [1], [1], [0], [0], [1], [1], [1], [1], [0], [0], [1], [0], [0], [0], [1], [1], [1], [0], [0], [0], [1], [1], [0], [1], [1], [1], [1], [0], [0], [1], [0], [0], [0], [0], [1], [0], [0], [1], [0], [1], [0], [1], [0], [0], [0], [1], [0], [0], [1], [0], [1], [0], [0], [1], [0], [1], [0], [1], [0], [0], [0], [0], [1], [1], [0], [1], [1], [0], [0], [1], [1], [0], [0], [0], [0], [0], [0], [0], [1], [1], [1], [1], [0], [1], [0], [0], [0], [0], [1], [0], [1], [1], [1], [0], [1], [0], [0], [1], [1], [1], [1], [1], [1], [0], [0], [0], [0], [0], [0], [0], [1], [1], [0], [0], [0], [0], [0], [1], [1], [1], [1], [0], [1], [0]]


# Padding

In [None]:
def applyPadding(pair_of_sequences):
    """
    Convert sequences to tensors and apply padding
    """
    return (T.ToTensor(0)(list(pair_of_sequences[0])), torch.tensor(list(pair_of_sequences[1]), dtype= torch.long))

train_pad = train.map(applyPadding)
valid_pad = valid.map(applyPadding)
test_pad = test.map(applyPadding)
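
As a small illustration of what padding does to a batch, the sketch below right-pads lists of token ids with 0, the index of `<pad>` in the vocabulary. `pad_batch` is a hypothetical pure-Python helper written for this example; the notebook itself relies on `T.ToTensor(0)`.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[158, 6528, 3], [2, 2236], [4408]]
padded = pad_batch(batch)
print(padded)
```

Every row ends up with the same length, so the batch can be stored as one rectangular tensor of shape `[batch size, longest sequence]`.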

Testing the iterators (note that the number of columns depends on the longest document in the respective batch, while the number of rows equals the batch size):

In [None]:
print('Train')
# train_pad_value = next(iter(train_pad))
for batch in train_pad:
    print(f'Text matrix size: {batch[0].size()}')
    print(f'Target vector size: {batch[1].size()}')
    print(batch[0].dtype)
    print(batch[1].dtype)
    break

print('\nValid:')
for batch in valid_pad:
    print(f'Text matrix size: {batch[0].size()}')
    print(f'Target vector size: {batch[1].size()}')
    print(batch[0].dtype)
    print(batch[1].dtype)
    break

print('\nTest:')
for batch in test_pad:
    print(f'Text matrix size: {batch[0].size()}')
    print(f'Target vector size: {batch[1].size()}')
    print(batch[0].dtype)
    print(batch[1].dtype)
    break

Train
Text matrix size: torch.Size([128, 250])
Target vector size: torch.Size([128, 1])
torch.int64
torch.int64

Valid:
Text matrix size: torch.Size([128, 134])
Target vector size: torch.Size([128, 1])
torch.int64
torch.int64

Test:
Text matrix size: torch.Size([128, 381])
Target vector size: torch.Size([128, 1])
torch.int64
torch.int64


# Model

In [None]:
class RNN(torch.nn.Module):

    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = torch.nn.Embedding(input_dim, embedding_dim)
        #self.rnn = torch.nn.RNN(embedding_dim,
        #                        hidden_dim,
        #                        nonlinearity='relu')
        self.rnn = torch.nn.LSTM(embedding_dim,
                                 hidden_dim, batch_first=True)

        self.fc = torch.nn.Linear(hidden_dim, output_dim)


    def forward(self, text):
        # text dim: [batch size, sentence length]
        embedded = self.embedding(text)
        # embedded dim: [batch size, sentence length, embedding dim]
        output, (hidden, cell) = self.rnn(embedded)
        # output dim: [batch size, sentence length, hidden dim]
        # hidden dim: [1, batch size, hidden dim]
        hidden.squeeze_(0)
        # hidden dim: [batch size, hidden dim]
        output = self.fc(hidden)
        return output
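
To get a feel for the size of this architecture, the sketch below counts its trainable parameters by hand. It assumes the hyperparameters defined earlier in the notebook, the vocabulary size of 20002 printed above, and the usual single-layer `torch.nn.LSTM` layout of four gates, each with input weights, recurrent weights, and two bias vectors.

```python
VOCAB = 20002        # len(train_vocab) printed earlier
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_CLASSES = 2

# Embedding table: one EMBEDDING_DIM vector per vocabulary entry.
embedding = VOCAB * EMBEDDING_DIM
# LSTM: 4 gates x (input weights + recurrent weights + 2 bias vectors).
lstm = 4 * HIDDEN_DIM * (EMBEDDING_DIM + HIDDEN_DIM + 2)
# Final linear layer: weights plus biases.
fc = HIDDEN_DIM * NUM_CLASSES + NUM_CLASSES

total = embedding + lstm + fc
print(embedding, lstm, fc, total)
```

Most of the parameters sit in the embedding table, so the vocabulary size is the main driver of model size here.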

In [None]:
torch.manual_seed(RANDOM_SEED)
model = RNN(input_dim=len(train_vocab),
            embedding_dim=EMBEDDING_DIM,
            hidden_dim=HIDDEN_DIM,
            output_dim=NUM_CLASSES # could use 1 for binary classification
)

model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Training

In [None]:
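def compute_accuracy(model, data_loader, device):
    """Return the classification accuracy (in percent) over data_loader."""
    model.eval()  # evaluation mode; the training loop calls model.train() each epoch
    with torch.no_grad():

        correct_pred, num_examples = 0, 0

        for i, (features, targets) in enumerate(data_loader):

            features = features.to(device)
            # targets arrive as [batch size, 1]; flatten to [batch size]
            targets = targets.to(device).squeeze(1)

            logits = model(features)
            # the predicted class is the index of the largest logit
            _, predicted_labels = torch.max(logits, 1)

            num_examples += targets.size(0)
            correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float()/num_examples * 100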
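
The accuracy computation boils down to taking the argmax of each row of logits and comparing it with the target. The sketch below shows that step on plain Python lists; `accuracy_percent` and the toy values are made up for illustration.

```python
def accuracy_percent(logits, targets):
    """Percentage of rows whose largest logit sits at the target index."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == t) for p, t in zip(predicted, targets))
    return 100.0 * correct / len(targets)

toy_logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.2], [-0.4, 0.1]]
toy_targets = [0, 1, 1, 1]
print(accuracy_percent(toy_logits, toy_targets))
```

Only the third example is misclassified, since its larger logit is at index 0 while its target is 1.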

In [None]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, batch_data in enumerate(train_pad):

        text = batch_data[0].to(DEVICE)
        labels = batch_data[1].to(DEVICE).squeeze_(1)

        ### FORWARD AND BACK PROP
        logits = model(text)
        # import pdb; pdb.set_trace()
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()

        loss.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()

        ### LOGGING
        if not batch_idx % 50:
            print (f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                   f'Batch {batch_idx:03d}/{34000//BATCH_SIZE :03d} | '
                   f'Loss: {loss:.4f}')

    with torch.set_grad_enabled(False):
        print(f'training accuracy: '
              f'{compute_accuracy(model, train_pad, DEVICE):.2f}%'
              f'\nvalid accuracy: '
              f'{compute_accuracy(model, valid_pad, DEVICE):.2f}%')

    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')

print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_pad, DEVICE):.2f}%')

Epoch: 001/015 | Batch 000/265 | Loss: 0.6925
Epoch: 001/015 | Batch 050/265 | Loss: 0.6926
Epoch: 001/015 | Batch 100/265 | Loss: 0.6927
Epoch: 001/015 | Batch 150/265 | Loss: 0.6917
Epoch: 001/015 | Batch 200/265 | Loss: 0.6910
Epoch: 001/015 | Batch 250/265 | Loss: 0.6985
training accuracy: 50.58%
valid accuracy: 49.40%
Time elapsed: 1.32 min
Epoch: 002/015 | Batch 000/265 | Loss: 0.6926
Epoch: 002/015 | Batch 050/265 | Loss: 0.6926
Epoch: 002/015 | Batch 100/265 | Loss: 0.6951
Epoch: 002/015 | Batch 150/265 | Loss: 0.6643
Epoch: 002/015 | Batch 200/265 | Loss: 0.6591
Epoch: 002/015 | Batch 250/265 | Loss: 0.6579
training accuracy: 70.03%
valid accuracy: 66.23%
Time elapsed: 2.63 min
Epoch: 003/015 | Batch 000/265 | Loss: 0.5975
Epoch: 003/015 | Batch 050/265 | Loss: 0.6257
Epoch: 003/015 | Batch 100/265 | Loss: 0.4978
Epoch: 003/015 | Batch 150/265 | Loss: 0.3462
Epoch: 003/015 | Batch 200/265 | Loss: 0.4037
Epoch: 003/015 | Batch 250/265 | Loss: 0.3248
training accuracy: 90.10%
va
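
The first logged losses (about 0.69) are close to ln 2 ≈ 0.6931, which is the cross-entropy of a two-class model that assigns equal probability to both classes. The sketch below reproduces that number for a single example in plain Python; `cross_entropy` here is a hypothetical helper, whereas `F.cross_entropy` additionally averages over the batch.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits: the model has no preference between the two classes.
chance = cross_entropy([0.0, 0.0], target=1)
# A confident, correct prediction gives a much smaller loss.
confident = cross_entropy([-2.0, 2.0], target=1)
print(f"chance: {chance:.4f}, confident: {confident:.4f}")
```

A loss that stays near 0.69 therefore suggests that the model has not yet learned anything beyond guessing, which matches the roughly 50% accuracy after the first epoch.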

In [None]:
import spacy

nlp = spacy.blank("en")

def predict_sentiment(model, sentence):
    """Return the predicted probability that `sentence` is positive (class 1)."""
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    # calling the vocab maps out-of-vocabulary tokens to the '<unk>' default index
    indexed = train_vocab(tokenized)
    tensor = torch.LongTensor(indexed).to(DEVICE)
    # the model is batch-first: add a batch dimension -> [1, sentence length]
    tensor = tensor.unsqueeze(0)
    with torch.no_grad():
        prediction = torch.nn.functional.softmax(model(tensor), dim=1)
    # sentiment == 1 marks positive reviews in movie_data.csv, so index 1 is "positive"
    return prediction[0][1].item()

print('Probability positive:')
predict_sentiment(model, "This is such an awesome movie, I really love it!")

Probability positive:


0.8687519431114197

In [None]:
print('Probability negative:')
1-predict_sentiment(model, "I really hate this movie. It is really bad and sucks!")

Probability negative:


0.6380993723869324