## Preparing Data

As in the previous notebooks, we'll prepare the data. 

Unlike the previous notebook with the FastText model, we no longer explicitly need to create the bi-grams and append them to the end of the sentence.

As convolutional layers expect the batch dimension to be first we can tell TorchText to return the data already permuted using the `batch_first = True` argument on the field.

In [0]:
import torch
from torchtext import data
from torchtext import datasets
import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', batch_first = True)
LABEL = data.LabelField(dtype = torch.float)

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

train_data, valid_data = train_data.split(random_state = random.seed(SEED))

Build the vocab and load the pre-trained word embeddings.

In [0]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

As before, we create the iterators.

In [0]:
# print(', '.join(i for i in dir(TEXT.unk_token) if not i.startswith('__')))
# print(TEXT.unk_token.title)
BATCH_SIZE = 16

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

In [0]:
import torch.nn as nn
import torch.nn.functional as F


Currently the `CNN` model can only use 3 different sized filters, but we can actually improve the code of our model to make it more generic and take any number of filters.

We do this by placing all of our convolutional layers in a  `nn.ModuleList`, a function used to hold a list of PyTorch `nn.Module`s. If we simply used a standard Python list, the modules within the list cannot be "seen" by any modules outside the list which will cause us some errors.

We can now pass an arbitrary sized list of filter sizes and the list comprehension will create a convolutional layer for each of them. Then, in the `forward` method we iterate through the list applying each convolutional layer to get a list of convolutional outputs, which we also feed through the max pooling in a list comprehension before concatenating together and passing through the dropout and linear layers.

We can also implement the above model using 1-dimensional convolutional layers, where the embedding dimension is the "depth" of the filter and the number of tokens in the sentence is the width.

We'll run our tests in this notebook using the 2-dimensional convolutional model, but leave the implementation for the 1-dimensional model below for anyone interested. 

In [0]:
class CNN1d(nn.Module):
    def __init__(self, vocab_size, embedding_dim, 
                 dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, 
                                              out_channels = 5, 
                                              kernel_size = (1,fs)) #(height,width)
                                    for fs in [4,5]
                                    ])
        self.convs2 = nn.Conv2d(in_channels = 5, out_channels = 10, kernel_size = (1,3))
                                    
        self.fc1 = nn.Linear(1496000, 100) #fix this
        self.fc2=nn.Linear(100,1)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [batch size, sent len]
        
        embedded = self.embedding(text)
        
        #embedded = [batch size, sent len, emb dim]
        
        embedded = embedded.permute(0, 2, 1)
        print("embedded",embedded.size())
        x=embedded.size(2)
        y=3000-x
        print(y)
        batch_size=embedded.size(0)
        z=np.zeros((batch_size,100,y))
        z1=torch.from_numpy(z).float()
        lz=[embedded,z1]
        #print(type(lz))
        zcat = torch.cat(lz, dim = 2)
        print("zcat",zcat.size())
        zcat2=zcat.reshape([16, 1, 100, 3000])
        print("zcat2",zcat2.size())
        #embedded = [batch size, emb dim, sent len]
        # print("Yo")
        conved = [F.relu(conv(zcat2)) for conv in self.convs]
        print("conved")
        for c in conved:
            print(c.size())
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        
        pooled = [F.max_pool2d(conv, (1,2)) for conv in conved]
        print("Pool:")
        for pl in pooled:
          print(pl.size())
        cat = self.dropout(torch.cat(pooled, dim = 2))
        #pooled_n = [batch size, n_filters]
        print("cat",cat.size())
        conved2 = F.relu(self.convs2(cat)) 
        print("conved2",conved2.size())
        pooled2 = F.max_pool2d(conved2, (1,2))
        print("pooled2",pooled2.size())
        pooled2=pooled2.reshape(batch_size,1496000)
        print("pooled2 reshaped",pooled2.size())
        
        #cat = [batch size, n_filters * len(filter_sizes)]
        # print("Gonna get error now")
        # print(pooled2.size())
        full1= self.fc1(pooled2)
        full2= self.fc2(full1)
        #print("done!")
        return full2

We create an instance of our `CNN` class. 

We can change `CNN` to `CNN1d` if we want to run the 1-dimensional convolutional model, noting that both models give almost identical results.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = CNN1d(INPUT_DIM, EMBEDDING_DIM, DROPOUT, PAD_IDX)

Checking the number of parameters in our model we can see it has about the same as the FastText model. 

Both the `CNN` and the `CNN1d` models have the exact same number of parameters.

In [7]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 152,100,616 trainable parameters


Next, we'll load the pre-trained embeddings

In [8]:
pretrained_embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0306, -0.0086,  0.1552,  ..., -0.9847,  0.4392,  0.3018],
        [ 0.3614,  0.1344,  0.0411,  ..., -0.1543, -1.0218, -0.5138],
        [ 0.4870, -0.3286,  0.6392,  ..., -0.4184, -0.0256,  0.1911]])

Then zero the initial weights of the unknown and padding tokens.

In [0]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

## Train the Model

Training is the same as before. We initialize the optimizer, loss function (criterion) and place the model and criterion on the GPU (if available)

In [0]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

We implement the function to calculate accuracy...

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

We define a function for training our model...

**Note**: as we are using dropout again, we must remember to use `model.train()` to ensure the dropout is "turned on" while training.

In [0]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        #print("Error here",model(batch.text),model(batch.text).size())
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We define a function for testing our model...

**Note**: again, as we are now using dropout, we must remember to use `model.eval()` to ensure the dropout is "turned off" while evaluating.

In [0]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Let's define our function to tell us how long epochs take.

In [0]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we train our model...

In [0]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut4-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
pooled2 reshaped torch.Size([16, 1496000])
embedded torch.Size([16, 100, 393])
2607
zcat torch.Size([16, 100, 3000])
zcat2 torch.Size([16, 1, 100, 3000])
conved
torch.Size([16, 5, 100, 2997])
torch.Size([16, 5, 100, 2996])
Pool:
torch.Size([16, 5, 100, 1498])
torch.Size([16, 5, 100, 1498])
cat torch.Size([16, 5, 200, 1498])
conved2 torch.Size([16, 10, 200, 1496])
pooled2 torch.Size([16, 10, 200, 748])
pooled2 reshaped torch.Size([16, 1496000])
embedded torch.Size([16, 100, 820])
2180
zcat torch.Size([16, 100, 3000])
zcat2 torch.Size([16, 1, 100, 3000])
conved
torch.Size([16, 5, 100, 2997])
torch.Size([16, 5, 100, 2996])
Pool:
torch.Size([16, 5, 100, 1498])
torch.Size([16, 5, 100, 1498])
cat torch.Size([16, 5, 200, 1498])
conved2 torch.Size([16, 10, 200, 1496])
pooled2 torch.Size([16, 10, 200, 748])
pooled2 reshaped torch.Size([16, 1496000])
embedded torch.Size([16, 100, 568])
2432
zcat torch.Size([16, 100, 3000])
zcat2 to

We get test results comparable to the previous 2 models!

In [0]:
model.load_state_dict(torch.load('tut4-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

## User Input

And again, as a sanity check we can check some input sentences

**Note**: As mentioned in the implementation details, the input sentence has to be at least as long as the largest filter height used. We modify our `predict_sentiment` function to also accept a minimum length argument. If the tokenized input sentence is less than `min_len` tokens, we append padding tokens (`<pad>`) to make it `min_len` tokens.

In [0]:
import spacy
nlp = spacy.load('en')

def predict_sentiment(model, sentence, min_len = 5):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

An example negative review...

In [0]:
predict_sentiment(model, "Truly a masterpiece, The Best Hollywood film of 2019, one of the Best films of the decade... And truly the Best film to bring a comic book so chillingly and realistically to real ife. Remarkable Direction, Cinematography, Music and the Acting. Some people are surprised to find it DISTURBING and VIOLENT, but it's a necessity and message. It's about society and reflects those underappreciated/unrecognized/bullied people, proving they can do something too. The way it shows class difference, corruption and how rich and talented rule others around them is not exaggerated and that's what makes it different. It's BELIEVABLE. There could be multiple JOKERs living in our society that could shake those around them in much bitter way than the film shows making people uncomforting people. Consider this a wake up call, a message, but first a film. A PERFECT film.")

2833
Gonna get error now
torch.Size([1, 149700])


0.9945042133331299

An example positive review...

In [0]:
predict_sentiment(model, '''Yes Joaquin's performance is good and the cinematography is pretty but otherwise I do not get all the hype about this movie.

Not for a long time have I felt like I would be better off leaving the cinema than staying and getting my money's worth.

For something that was supposed to be gritty and real it had a lot of convenient scenes to help push the plot along.

Very disappointed.''')

2920
Gonna get error now
torch.Size([1, 149700])


0.37380316853523254

In [0]:
import pickle

In [0]:
pickle.dump(model,open('model_train95.4_val89.1.pkl', 'wb'))

In [16]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

