# Sentiment Analysis (SA)

### Introduction

Sentiment analysis (SA) is a natural language processing technique from the field of text mining that uses computational methods to identify opinionated text and determine whether it is positive or negative. SA is usually performed on unstructured text from the Web, which often carries users' attitudes, sentiments and subjective views about an entity such as a product or a movie.

The choice of dataset is an important issue in this field. Product reviews are the main source of data. These reviews matter to business owners, who can base business decisions on the analysis of users' opinions about their products.

### Project Targets

Document-level SA will be adopted in this project to classify movie reviews by building machine learning models (i.e. detecting whether a review is positive or negative) using PyTorch and TorchText. The data source is the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/), which provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing.

The following questions will be addressed:

* What is the data?
* How to deal with the raw data?
* How to build models to identify the results?
* Which methods can we try to improve the accuracy?



In the first section, a **recurrent neural network** (RNN) will be used.

An RNN processes a sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ and the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$, and so on.

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y} = f(h_T)$.
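To make the recurrence concrete, here is a minimal pure-Python sketch of one possible RNN cell followed by the linear layer $f$. It assumes the Elman form $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$, which is the default used by `nn.RNN`. The function and weight names (`rnn_step`, `run_rnn`, `w_xh`, `w_hh`, `b_h`, `w_out`, `b_out`) are illustrative and not part of the project code.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(w_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(w_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(sequence, w_xh, w_hh, b_h, w_out, b_out):
    """Feed the words in one at a time, then apply the linear layer to h_T."""
    h = [0.0] * len(b_h)                      # initial hidden state h_0
    for x_t in sequence:                      # x_1, ..., x_T
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
    y_hat = b_out + sum(w * h_i for w, h_i in zip(w_out, h))
    return h, y_hat

# Toy example: 3 "words" with 2 features each, hidden size 2 (made-up weights).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
h_T, y_hat = run_rnn(sequence, w_xh, w_hh, b_h, w_out=[1.0, -1.0], b_out=0.0)
print(h_T, y_hat)
```

Only the final hidden state $h_T$ reaches the linear layer, so everything the classifier knows about the review has to be carried forward through the recurrence.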

# Insights on the raw data

In [3]:
import torch
from torchtext.legacy import data
import spacy


SEED = 1234

In [4]:
torch.manual_seed(SEED)

<torch._C.Generator at 0x7f5818047810>

In [72]:
TEXT = data.Field(tokenize = 'spacy')



In [7]:
LABEL = data.LabelField(dtype = torch.float)

In [8]:
import torch
import torchtext
from torchtext.legacy.data import Field

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = torchtext.legacy.data.Field(tokenize = 'spacy')
LABEL = torchtext.legacy.data.LabelField(dtype = torch.float)



The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.legacy.datasets` objects. 

In [9]:
from torchtext.legacy import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:10<00:00, 7.70MB/s]


We can see how many examples are in each split by checking their length.

In [10]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


An example from the training data is shown below:

* 'text': the tokenized text
* 'label': positive or negative attitude

In [11]:
print(vars(train_data.examples[0]))

{'text': ['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy', '.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', '"', 'Teachers', '"', '.', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', 'High', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '"', 'Teachers', '"', '.', 'The', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students', '.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High', '.', 'A', 'classic', 'l

The IMDb dataset consists of 50,000 labelled movie reviews, split evenly between the training and test sets.
There is no validation set, so we need to create one from the training data for tuning the models.

In [12]:
import random

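# random.seed() returns None, so seed first and pass the resulting state explicitly.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())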

After splitting the training set (by default, 70% is kept for training and 30% goes to validation), the sizes are:

In [13]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


# Prepare data

Building a _vocabulary_ is the first step before modelling, since machine learning models operate on numbers, not strings. Each unique word in the dataset is assigned an _index_, which is normally used to construct a _one-hot vector_.

However, the number of unique words in this dataset is over 100,000. Operating directly on vectors of such high dimension would be time-consuming and computationally expensive. Instead, I keep only the 25,000 most common words.

A special _unknown_ token, `<unk>`, replaces the words that have been cut. In addition, a `<pad>` token is used to pad shorter texts so that all sequences in a batch have the same length.
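The idea can be sketched in plain Python. This is a simplified illustration, not torchtext's implementation; the helper names (`build_vocab`, `numericalize`, `pad_batch`) are hypothetical, and the token order (`<unk>` at index 0, `<pad>` at index 1) simply mirrors what this notebook prints from `TEXT.vocab.itos`.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(tokenized_texts, max_size):
    """Keep the max_size most frequent tokens, plus the two special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

def pad_batch(sequences, pad_idx):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (longest - len(seq)) for seq in sequences]

texts = [["a", "great", "film"], ["a", "dull", "film", "overall"], ["a", "film"]]
itos, stoi = build_vocab(texts, max_size=2)   # keeps only "a" and "film"
batch = pad_batch([numericalize(t, stoi) for t in texts], stoi[PAD])
print(itos)
print(batch)
```

With `max_size=2` the vocabulary has 4 entries, for the same reason the vocabulary in this notebook has 25,002 entries rather than 25,000.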

In [14]:
MAX_VOCAB_SIZE = 25000  # keep only the most common words

# Build the vocabulary from the training set only, so that the validation set
# reflects the test set as closely as possible.
TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [15]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


As mentioned, the 2 extra entries in the vocabulary are the `<unk>` token and the `<pad>` token.

We can also view the **most common words and corresponding frequencies**.

In [16]:
print(TEXT.vocab.freqs.most_common(10))

[('the', 202820), (',', 192917), ('.', 166207), ('and', 109906), ('a', 109381), ('of', 100931), ('to', 94053), ('is', 76698), ('in', 61407), ('I', 54461)]


We can also map between words and indices directly using either the `stoi` (**s**tring **to** **i**nt) or `itos` (**i**nt **to** **s**tring) attribute of the vocabulary.

In [17]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']


We can also check the labels, making sure that 0 stands for negative and 1 for positive.

In [18]:
print(LABEL.vocab.stoi)

defaultdict(None, {'neg': 0, 'pos': 1})


The last step of preparing the data is to create iterators for batch processing.

In [19]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
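`BucketIterator` batches examples of similar length together so that less padding is needed. The sketch below illustrates only that idea and is not torchtext's algorithm; `make_batches` and `padding_needed` are hypothetical helpers, and the review lengths are made up.

```python
def make_batches(lengths, batch_size, bucket):
    """Return batches of example indices; sort by length first when bucket=True."""
    order = list(range(len(lengths)))
    if bucket:
        order.sort(key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_needed(lengths, batches):
    """Count the <pad> tokens needed to fill each batch to its longest example."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total

lengths = [3, 50, 4, 48, 5, 52]   # token counts of six made-up reviews
plain = padding_needed(lengths, make_batches(lengths, 3, bucket=False))
bucketed = padding_needed(lengths, make_batches(lengths, 3, bucket=True))
print(plain, bucketed)            # 144 pad tokens without bucketing, 9 with
```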

# Building a model 

As stated at the beginning, a **recurrent neural network** (RNN) will be used. Detailed information about RNNs can be found [here](https://en.wikipedia.org/wiki/Recurrent_neural_network).

In [20]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        
        # input_dim: the dimension of the one-hot vectors, equal to the vocabulary size
        # embedding_dim: the dimension of the dense word vectors, usually around 50-250
        # hidden_dim: the size of the hidden state, usually around 100-500
        # output_dim: the size of the model output; here 1, a single score for the binary label (0 or 1)
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):

        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        # squeeze(0) removes the leading dimension of size 1 from `hidden`
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))

In [21]:
Input_dim = len(TEXT.vocab)
Embedding_dim = 100
Hidden_dim = 256
Output_dim = 1

model = RNN(Input_dim, Embedding_dim, Hidden_dim, Output_dim)

# Train the model

In [22]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

In [23]:
criterion = nn.BCEWithLogitsLoss()

In [24]:
model = model.to(device)
criterion = criterion.to(device)

The criterion calculates the loss; however, we also need to write a function to calculate the accuracy.

This function first passes the predictions through a sigmoid layer, squashing the values between 0 and 1, and then rounds them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).

We then calculate how many rounded predictions equal the actual labels and average it across the batch.
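As a worked example of this rule, the following pure-Python sketch applies the same steps to four made-up logits. `binary_accuracy_py` is an illustrative stand-in for the tensor version used in this notebook, with the rounding written as an explicit threshold at 0.5.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose thresholded probability matches the label."""
    rounded = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]
    correct = [1.0 if r == y else 0.0 for r, y in zip(rounded, labels)]
    return sum(correct) / len(correct)

logits = [2.0, -1.0, 0.3, -0.2]   # raw model outputs (made up)
labels = [1.0, 0.0, 0.0, 0.0]     # 1 = positive, 0 = negative
acc = binary_accuracy_py(logits, labels)
print(acc)                        # 0.75: only the third example is wrong
```

The third logit, 0.3, maps to a probability of about 0.57 and is therefore predicted positive although its label is negative.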

In [25]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [26]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Evaluate the model

**Evaluate** is similar to **train**, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not change the model's parameters when evaluating.

In [27]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)



In [28]:
# How long does one epoch of training take?
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [29]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 13s
	Train Loss: 0.694 | Train Acc: 50.00%
	 Val. Loss: 0.697 |  Val. Acc: 49.92%
Epoch: 02 | Epoch Time: 0m 13s
	Train Loss: 0.693 | Train Acc: 50.00%
	 Val. Loss: 0.697 |  Val. Acc: 49.75%
Epoch: 03 | Epoch Time: 0m 13s
	Train Loss: 0.693 | Train Acc: 49.74%
	 Val. Loss: 0.697 |  Val. Acc: 50.30%
Epoch: 04 | Epoch Time: 0m 13s
	Train Loss: 0.693 | Train Acc: 49.69%
	 Val. Loss: 0.697 |  Val. Acc: 49.76%
Epoch: 05 | Epoch Time: 0m 13s
	Train Loss: 0.693 | Train Acc: 50.04%
	 Val. Loss: 0.698 |  Val. Acc: 50.68%


Finally, we compute the metrics we actually care about, the test loss and accuracy, using the parameters that gave us the best validation loss.

In [30]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.710 | Test Acc: 46.99%


The loss is not really decreasing and the accuracy is poor: roughly 50%, which is chance level for two balanced classes. This is due to several issues with the model itself. Next, we will modify the current model to improve the accuracy.
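One way to read the training log: under binary cross-entropy, a model that always predicts a probability of 0.5 (a logit of 0) has a loss of $\ln 2 \approx 0.693$, which is the value the training loss stays at in every epoch. The sketch below checks this with a standard numerically stable formulation; `bce_with_logits` is an illustrative helper, not the PyTorch source.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

chance_loss = bce_with_logits(0.0, 1.0)      # sigmoid(0) = 0.5
print(round(chance_loss, 3))                 # 0.693, i.e. ln 2
print(round(bce_with_logits(2.0, 1.0), 3))   # confident and correct: small loss
print(round(bce_with_logits(2.0, 0.0), 3))   # confident and wrong: large loss
```

A loss pinned at this value suggests the network has not yet learned to separate the two classes at all.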

# How to improve the accuracy?

In this section, our target is to achieve a higher accuracy than the 46.99% test accuracy of the baseline RNN.

What I will use:
    
- packed padded sequences
- pre-trained word embeddings
- different RNN architecture
- bidirectional RNN
- multi-layer RNN
- regularization
- a different optimizer 

# Prepare data

The data preparation is similar. This time *packed padded sequences* will be used, which make the RNN process only the non-padded elements of each sequence; the output for any padded element will be a zero tensor.
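The packing step can be pictured with a small pure-Python sketch. It is a conceptual illustration only: `pack_padded` is a hypothetical helper, and the batch is stored batch-major here for readability, whereas the tensors in this notebook are `[sent len, batch size]`. As with PyTorch's `pack_padded_sequence` by default, the sequences are assumed to be sorted by decreasing length.

```python
def pack_padded(padded, lengths):
    """Flatten a padded batch time step by time step, skipping the padding.

    Returns the packed data and, for each time step, how many sequences
    are still active (their length exceeds that step).
    """
    assert lengths == sorted(lengths, reverse=True), "sort by decreasing length"
    data, batch_sizes = [], []
    for t in range(lengths[0]):
        active = sum(1 for n in lengths if n > t)
        batch_sizes.append(active)
        data.extend(padded[b][t] for b in range(active))
    return data, batch_sizes

PAD_IDX = 1
padded = [[5, 6, 7],
          [8, 9, PAD_IDX],
          [4, PAD_IDX, PAD_IDX]]
data, batch_sizes = pack_padded(padded, lengths=[3, 2, 1])
print(data)         # [5, 8, 4, 6, 9, 7]
print(batch_sizes)  # [3, 2, 1]
```

The RNN then runs over 6 real tokens instead of 9 padded positions, and no padding index ever reaches it.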

In [31]:
SEED = 1234

torch.manual_seed(SEED)
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)

In [32]:
from torchtext.legacy import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [33]:
import random

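# random.seed() returns None, so seed first and pass the resulting state explicitly.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())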

Here, we'll use the `"glove.6B.100d"` vectors instead of having our word embeddings initialized randomly.

In [34]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:43, 5.27MB/s]                               
100%|█████████▉| 399999/400000 [00:09<00:00, 43841.17it/s]


# Build an LSTM model

A different RNN architecture, Long Short-Term Memory (LSTM), will be used here to improve the accuracy. We choose it because standard RNNs suffer from the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). LSTMs mitigate this by having an extra recurrent state, the _cell_ state; their advantages over the standard RNN are described [here](https://medium.com/@kangeugine/long-short-term-memory-lstm-concept-cb3283934359#:~:text=LSTM%20is%20well%2Dsuited%20to,and%20other%20sequence%20learning%20methods.).
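For reference, the LSTM cell is commonly written as follows. This is the usual textbook formulation, with weight names ($W$, $U$, $b$) chosen here for illustration; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Compared with $h_t = \text{RNN}(x_t, h_{t-1})$, the cell now carries two states, $(h_t, c_t)$. The cell state $c_t$ is updated additively, which is what helps gradients survive over long sequences.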

In [35]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

In [36]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        # note that the LSTM returns the `output` and a tuple of the final `hidden` state
        # and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state.
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        # As the final hidden state of our LSTM has both a forward and a backward component, which will 
        # be concatenated together, the size of the input to the `nn.Linear` layer is twice that of the 
        # hidden dimension size.
        
        #To combat overfitting, we use regularization. A method of regularization called dropout was used here.
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        
        #text = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(text))
        
        #embedded = [sent len, batch size, emb dim]
        
        #pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        #output = [sent len, batch size, hid dim * num directions]
        #output over padding tokens are zero tensors
        
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        #and apply dropout
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                
        #hidden = [batch size, hid dim * num directions]
            
        return self.fc(hidden)
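The indices `hidden[-2]` and `hidden[-1]` used in `forward` rely on how the first dimension of `hidden`, of size `num layers * num directions`, is laid out: it is ordered layer by layer, with the forward direction before the backward one inside each layer. The pure-Python sketch below only enumerates that ordering; `hidden_layout` is a hypothetical helper, not a PyTorch function.

```python
def hidden_layout(n_layers, bidirectional):
    """List (layer, direction) for each index along the first dimension of `hidden`."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, d) for layer in range(n_layers) for d in directions]

layout = hidden_layout(n_layers=2, bidirectional=True)
for index, (layer, direction) in enumerate(layout):
    print(index, layer, direction)

# The last two entries are the top layer's forward and backward states,
# which is what the model concatenates before the linear layer.
print(layout[-2], layout[-1])
```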

To ensure the pre-trained vectors can be loaded into the model, `Embedding_dim` must be equal to the dimension of the pre-trained GloVe vectors loaded earlier (100 for `glove.6B.100d`).

In [37]:
Input_dim = len(TEXT.vocab)
Embedding_dim = 100
Hidden_dim = 256
Output_dim = 1
N_layers = 2
Bidirectional = True
Dropout = 0.5
Pad_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(Input_dim, 
            Embedding_dim, 
            Hidden_dim, 
            Output_dim, 
            N_layers, 
            Bidirectional, 
            Dropout, 
            Pad_IDX)

We retrieve the embeddings from the field's vocab and check that they are the correct size, _**[vocab size, embedding dim]**_.

In [38]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


We then replace the initial weights of the `embedding` layer with the pre-trained embeddings.

In [39]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.1213,  0.1915,  0.0622,  ..., -0.2294, -0.1433,  0.5837],
        [ 0.6406,  0.3674,  0.8373,  ..., -0.1512, -0.0284, -0.1952],
        [ 0.2644, -0.0054, -1.0183,  ..., -0.1151, -0.1124,  0.8695]])

As our `<unk>` and `<pad>` tokens aren't in the pre-trained vocabulary, they were initialized using `unk_init` (an $\mathcal{N}(0,1)$ distribution) when building our vocab. It is preferable to initialize them both to all zeros to explicitly tell our model that, initially, they are irrelevant for determining sentiment.

In [40]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(Embedding_dim)
model.embedding.weight.data[Pad_IDX] = torch.zeros(Embedding_dim)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.1213,  0.1915,  0.0622,  ..., -0.2294, -0.1433,  0.5837],
        [ 0.6406,  0.3674,  0.8373,  ..., -0.1512, -0.0284, -0.1952],
        [ 0.2644, -0.0054, -1.0183,  ..., -0.1151, -0.1124,  0.8695]])


# Train the model

The only change to the training setup is switching the optimizer from `SGD` to `Adam`.

In [41]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

Because we set `include_lengths = True`, `batch.text` is now a tuple, with the first element being the numericalized tensor and the second element being the actual lengths of each sequence. We separate these into their own variables, `text` and `text_lengths`, before passing them to the model.

In [55]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        
        predictions = model(text, text_lengths.cpu()).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [60]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            
            predictions = model(text, text_lengths.cpu()).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [61]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 26s
	Train Loss: 0.493 | Train Acc: 76.91%
	 Val. Loss: 0.464 |  Val. Acc: 79.14%
Epoch: 02 | Epoch Time: 0m 26s
	Train Loss: 0.380 | Train Acc: 83.38%
	 Val. Loss: 0.384 |  Val. Acc: 83.05%
Epoch: 03 | Epoch Time: 0m 26s
	Train Loss: 0.339 | Train Acc: 85.93%
	 Val. Loss: 0.307 |  Val. Acc: 87.69%
Epoch: 04 | Epoch Time: 0m 26s
	Train Loss: 0.274 | Train Acc: 89.11%
	 Val. Loss: 0.273 |  Val. Acc: 89.08%
Epoch: 05 | Epoch Time: 0m 26s
	Train Loss: 0.255 | Train Acc: 89.95%
	 Val. Loss: 0.276 |  Val. Acc: 89.26%


In [62]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.300 | Test Acc: 87.77%


The test accuracy has improved substantially, from 46.99% to 87.77%. This gain reflects all of the changes listed above, not the LSTM alone. Compared with the standard RNN, the LSTM has several advantages:
* it is better at storing and accessing previous information
* it is far less affected by vanishing gradients
* it is well suited to classifying, processing and predicting sequences in which important events are separated by long time lags of unknown size

# User input

In [69]:
import spacy
nlp = spacy.load("en_core_web_sm")

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()

In [70]:
predict_sentiment(model,'This film is so great')

0.99054354429245

In [71]:
predict_sentiment(model,'This is such an awful movie I have ever seen')

0.0025321722496300936