### Example in NLP

In this notebook, we will do sentiment analysis on the SST-2 dataset.

In [1]:
import torch
import torchtext
from torchtext import data, datasets, vocab
import torch.nn as nn
import torch.optim as optim

print (torch.__version__)
print (torchtext.__version__)

1.5.1
0.6.0


I have already downloaded the dataset and stored it in a folder called 'SST-2'. It contains three files: train/dev/test.tsv.

Note that the labels for the test set are not given, so we will use the dev set for evaluation.

Each line in the data file consists of a sentence and a label. Label 1 means the sentence is positive; label 0 means it is negative.
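
To make the file layout concrete, here is a minimal sketch that parses a tiny in-memory sample with Python's `csv` module. The sample text is an assumption: the first data row reuses a sentence shown later in this notebook, and the second row is made up for illustration. The real files are loaded with torchtext below.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as train.tsv / dev.tsv:
# a header row, then "sentence<TAB>label" on each line.
sample_tsv = (
    "sentence\tlabel\n"
    "equals the original and in some ways even betters it\t1\n"
    "a dull and lifeless film\t0\n"  # made-up negative example
)

reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
next(reader)  # skip the header row
rows = [(sentence, int(label)) for sentence, label in reader]

for sentence, label in rows:
    print("positive" if label == 1 else "negative", "|", sentence)
```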

Let's first load the dataset and see what it looks like.

In [2]:
TEXT = data.Field(sequential=True, lower=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.float)

train_dataset, dev_dataset = data.TabularDataset.splits(
    path='SST-2/', format='tsv', skip_header=True,
    train='train.tsv', validation='dev.tsv',
    fields=[('sentence', TEXT), ('label', LABEL)]
)


In [3]:
# number of examples
print ('#train examples: ', len(train_dataset))
print ('#dev examples: ', len(dev_dataset))

# check one example
print (vars(train_dataset.examples[20]))
print (vars(dev_dataset.examples[20]))

#train examples:  67349
#dev examples:  872
{'sentence': ['equals', 'the', 'original', 'and', 'in', 'some', 'ways', 'even', 'betters', 'it'], 'label': '1'}
{'sentence': ['pumpkin', 'takes', 'an', 'admirable', 'look', 'at', 'the', 'hypocrisy', 'of', 'political', 'correctness', ',', 'but', 'it', 'does', 'so', 'with', 'such', 'an', 'uneven', 'tone', 'that', 'you', 'never', 'know', 'when', 'humor', 'ends', 'and', 'tragedy', 'begins', '.'], 'label': '0'}


In [4]:
# average sentence length

print (sum([len(vars(eg)['sentence']) for eg in train_dataset.examples])/len(train_dataset))
print (sum([len(vars(eg)['sentence']) for eg in dev_dataset.examples])/len(dev_dataset))



9.409553222765
19.548165137614678


In standard ML practice, you should set aside a test set from the dev set, train and tune models on the train and dev sets only, and then evaluate the trained model on the test set only.

But for simplicity, I'll skip that for now. We'll cover more in the lectures.
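
If you do want to carve a test set out of the dev set, one possible approach is a seeded shuffle followed by a split. The sketch below is plain Python and is not part of this notebook's pipeline; `split_dev_test`, the 50/50 ratio, and the seed are all assumptions chosen for illustration.

```python
import random

def split_dev_test(examples, test_ratio=0.5, seed=0):
    """Shuffle a list of examples and split it into (dev, test)."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 872 dev examples; any list works.
examples = list(range(872))
dev_part, test_part = split_dev_test(examples)
print(len(dev_part), len(test_part))
```

The same call could be applied to `dev_dataset.examples`, with each half wrapped back into a dataset before building iterators.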

Unlike images, text must first be tokenized into words, and we then need to build a vocabulary that maps each word to an integer index.
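
As a rough picture of what building a vocabulary involves, here is a simplified stand-in written in plain Python. It is not torchtext's implementation: `build_vocab` and `encode` are hypothetical helpers, and placing `<unk>` and `<pad>` at indices 0 and 1 is an assumption made to mirror the usual convention.

```python
from collections import Counter

def build_vocab(tokenized_sentences, specials=("<unk>", "<pad>")):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi):
    """Convert tokens to ids; unknown words fall back to the <unk> id."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

sentences = [s.lower().split() for s in ["The film is great", "The movie is dull"]]
stoi, itos = build_vocab(sentences)
print(encode("the film is boring".split(), stoi))
```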

In [5]:
# build our vocab based on data
TEXT.build_vocab(train_dataset,)
LABEL.build_vocab(train_dataset)

print ('Text Vocab: ', len(TEXT.vocab))
print ('Label Vocab: ', len(LABEL.vocab))

Text Vocab:  14818
Label Vocab:  2


In [6]:
# view some frequent words in vocab
print (TEXT.vocab.freqs.most_common(20))

[('the', 27205), (',', 25980), ('a', 21609), ('and', 19920), ('of', 17907), ('.', 12687), ('to', 12538), ("'s", 8764), ('is', 8685), ('that', 7759), ('in', 7495), ('it', 7078), ('as', 5088), ('with', 4745), ('an', 4133), ('film', 4038), ('its', 3924), ('for', 3913), ('movie', 3563), ('this', 3365)]


In [7]:
print (LABEL.vocab.stoi)

defaultdict(None, {'1': 0, '0': 1})
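
Note that the mapping printed above assigns index 0 to label '1' (positive) and index 1 to label '0' (negative). A model trained on these targets therefore produces a high output for negative sentences. The following sketch shows how a raw output could be decoded under that mapping; `decode` is a hypothetical helper, not part of the notebook.

```python
import math

# Mapping copied from the output above: string label -> training target index.
label_stoi = {'1': 0, '0': 1}
index_to_name = {label_stoi['1']: "positive", label_stoi['0']: "negative"}

def decode(logit):
    """Turn a raw model output (logit) into a sentiment name."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return index_to_name[int(prob > 0.5)]

print(decode(2.0))   # high logit -> index 1
print(decode(-2.0))  # low logit  -> index 0
```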


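Batched RNN training requires padding the sentences in a batch to a common length and then packing them so that the LSTM ignores the padding, which is tricky to get right.
Please refer to this post for more details: https://zhuanlan.zhihu.com/p/34418001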

It's ok if you don't get it. We'll cover it in lectures.
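
To give a feel for the padding step, here is a minimal plain-Python sketch of what happens to a batch before it reaches the LSTM: sequences are sorted longest-first, padded to a common length, and their true lengths are kept. `pad_batch` is a hypothetical helper and `pad_id=1` is an assumption. In this notebook, torchtext does this work (`include_lengths=True`, `sort_within_batch=True`), and `pack_padded_sequence` then uses the lengths so that the padded positions are skipped.

```python
def pad_batch(batch, pad_id=1):
    """Sort sequences longest-first and pad them to a common length.

    Returns (padded, lengths); padded[i] has lengths[i] real tokens
    followed by pad_id entries.
    """
    batch = sorted(batch, key=len, reverse=True)
    lengths = [len(seq) for seq in batch]
    max_len = lengths[0]
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    return padded, lengths

# Three hypothetical sentences already converted to token ids.
batch = [[5, 8], [4, 9, 7, 3], [6, 2, 11]]
padded, lengths = pad_batch(batch)
for row, n in zip(padded, lengths):
    print(row, "real tokens:", n)
```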

In [8]:
# Build the Model
class RNN(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(input_size=emb_dim,
                           hidden_size=hid_dim,
                           num_layers=2,
                           bidirectional=True)
        self.fc = nn.Linear(hid_dim*2, 1)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, text, text_lengths):
        # text: (sent_len, bsz) -> embedded: (sent_len, bsz, emb_dim)
        embedded = self.dropout(self.embedding(text))
        
        # pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        
        # output: (sen_len, bsz, hid_dim*num_direction)
        # hidden: (num_layer*num_direction, bsz, hid_dim)
        # concat the final forward and backward hidden states of the top layer
        hidden = torch.cat((hidden[-2, :, :], hidden[-1,:,:]), dim=1)
        # hidden: (bsz, hid_dim*2)
        hidden = self.dropout(hidden)
        return self.fc(hidden)
        

In [9]:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, dev_iterator = data.BucketIterator.splits(
    (train_dataset, dev_dataset),
    batch_size = BATCH_SIZE,
    sort_key=lambda x: len(x.sentence),
    sort_within_batch=True,
    device=device
)

vocab_size = len(TEXT.vocab)
emb_dim = 50
hid_dim = 64
model = RNN(vocab_size, emb_dim, hid_dim)
model = model.to(device)

In [10]:
# accuracy function
def binary_accuracy(preds, y):
    preds = torch.round(torch.sigmoid(preds))
    correct = (preds==y).float()
    acc = correct.sum() / len(correct) 
    return acc

In [11]:
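LR = 1e-3
# Create the optimizer and loss once, outside train(), so that Adam's
# running moment estimates persist across epochs instead of being reset.
optimizer = optim.Adam(model.parameters(), lr=LR)
loss_fn = nn.BCEWithLogitsLoss()

def train(model, iterator, epoch):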
    epoch_loss = 0
    
    model.train()
    for batch_idx, batch in enumerate(iterator):
        optimizer.zero_grad()
        text, text_lengths = batch.sentence
        preds = model(text, text_lengths).squeeze(1)
        loss = loss_fn(preds, batch.label)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        
        if batch_idx % 200 == 0:
            print ("Epoch: {}, Batch Idx: {}, Loss: {}".format(
                    epoch, batch_idx, loss.item()))
    
    return epoch_loss / len(iterator)
        

In [12]:
def evaluate(model, iterator):
    epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.sentence
            preds = model(text, text_lengths).squeeze(1)
            acc = binary_accuracy(preds, batch.label)
            
            epoch_acc += acc.item()
    return epoch_acc / len(iterator) * 100
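
One subtlety: `evaluate` averages the per-batch accuracies, so a smaller final batch counts as much as a full one. The sketch below compares that with an example-weighted accuracy. The `(num_correct, batch_size)` pairs are hypothetical numbers chosen only to illustrate the difference, and both function names are assumptions.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies (what evaluate does, as a fraction)."""
    return sum(correct / size for correct, size in batches) / len(batches)

def example_weighted_accuracy(batches):
    """Total correct predictions divided by total examples."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size

# Hypothetical (num_correct, batch_size) pairs: 872 dev examples with
# batch size 64 give 13 full batches plus one batch of 40.
batches = [(50, 64)] * 13 + [(20, 40)]
print(mean_of_batch_accuracies(batches))
print(example_weighted_accuracy(batches))
```

The two numbers differ slightly; the example-weighted version is the one to report if exact dev accuracy matters.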

In [13]:
# Get a timer
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    mins = int(elapsed_time / 60)
    secs = int(elapsed_time - (mins)*60)
    print ("Time: {} mins {} secs".format(mins, secs))

In [14]:
EPOCH = 3

for epoch in range(1, EPOCH+1):
    start_time = time.time() 
    
    train_loss = train(model, train_iterator, epoch)
    dev_acc = evaluate(model, dev_iterator)
    
    end_time = time.time()
    epoch_time(start_time, end_time)
    
    print ("Epoch: {}, Acc: {}".format(epoch, dev_acc))

Epoch: 1, Batch Idx: 0, Loss: 0.6879485249519348
Epoch: 1, Batch Idx: 200, Loss: 0.6326194405555725
Epoch: 1, Batch Idx: 400, Loss: 0.5964372158050537
Epoch: 1, Batch Idx: 600, Loss: 0.6924533247947693
Epoch: 1, Batch Idx: 800, Loss: 0.3474215269088745
Epoch: 1, Batch Idx: 1000, Loss: 0.48663467168807983
Time: 1 mins 39 secs
Epoch: 1, Acc: 76.5625
Epoch: 2, Batch Idx: 0, Loss: 0.48060038685798645
Epoch: 2, Batch Idx: 200, Loss: 0.37727341055870056
Epoch: 2, Batch Idx: 400, Loss: 0.41725897789001465
Epoch: 2, Batch Idx: 600, Loss: 0.5222943425178528
Epoch: 2, Batch Idx: 800, Loss: 0.2690412402153015
Epoch: 2, Batch Idx: 1000, Loss: 0.31534940004348755
Time: 1 mins 33 secs
Epoch: 2, Acc: 79.53124982970101
Epoch: 3, Batch Idx: 0, Loss: 0.3253956437110901
Epoch: 3, Batch Idx: 200, Loss: 0.3124137818813324
Epoch: 3, Batch Idx: 400, Loss: 0.44053900241851807
Epoch: 3, Batch Idx: 600, Loss: 0.19532600045204163
Epoch: 3, Batch Idx: 800, Loss: 0.2438146024942398
Epoch: 3, Batch Idx: 1000, Loss:

I used a simple LSTM model; you should see the dev accuracy go well above chance (50%). You are encouraged to train for more epochs, change the hyperparameters or model structure, and try other tricks to further improve performance.

In [13]:
# save the model so you can load it later
SAVE_PATH = "sst_rnn.ckpt"
torch.save(model.state_dict(), SAVE_PATH)

# load the model
model = RNN(vocab_size, emb_dim, hid_dim)
model.load_state_dict(torch.load(SAVE_PATH))

<All keys matched successfully>

In [14]:
# write a predict function to do prediction
def predict(model, sent):
    model.eval()
    with torch.no_grad():
        # simple whitespace tokenization; unseen words map to 0, the <unk> index
        tokens = sent.lower().split()
        ids = [[TEXT.vocab.stoi[w]] if w in TEXT.vocab.stoi else [0] for w in tokens]
        text = torch.LongTensor(ids)                # (sent_len, 1)
        text_length = torch.LongTensor([len(ids)])  # (1,)
        logit = model(text, text_length).squeeze(1)
        prob = torch.sigmoid(logit).item()
        # LABEL.vocab.stoi maps '1' (positive) to 0 and '0' (negative) to 1,
        # so an output above 0.5 means negative
        if prob > 0.5:
            print ("Predicted: Negative")
        else:
            print ("Predicted: Positive")

In [16]:
sent = "I feel kind of sad though"
predict(model, sent)

Predicted: Negative


In [17]:
sent = "Thanks buddy! I had a lot of fun learning this course !!"
predict(model, sent)

Predicted: Positive


Try changing the sentence above and see if the model predicts it right!

Reference: https://gitee.com/quarky/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb