# Faster Sentiment Analysis

In the previous notebook, we achieved a decent test accuracy of ~85% using the common techniques for sentiment analysis. In this notebook, we'll implement a model that achieves comparable results a lot faster. More specifically, we'll be implementing the "FastText" model from the paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).

## Preparing Data

One of the key concepts in the FastText paper is that they calculate the n-grams of an input sentence and append them to the end of the sentence. Here, we'll use bi-grams. Briefly, a bi-gram is a pair of words/tokens that appear consecutively within a sentence.

For example, in the sentence "how are you ?", the bi-grams are: "how are", "are you" and "you ?".

The `generate_bigrams` function takes a sentence that has already been tokenized, calculates the bi-grams and appends them to the end of the tokenized list.

In [34]:
# import all the libraries
import torch
from torchtext import data
from torchtext import datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import spacy
import dill
import pandas as pd
import numpy as np

nlp = spacy.load('en')
torch.manual_seed(1234)

<torch._C.Generator at 0x109952e50>

In [35]:
def generate_bigrams(x):
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

As an example:

In [36]:
generate_bigrams(['This', 'film', 'is', 'terrible'])

['This', 'film', 'is', 'terrible', 'This film', 'film is', 'is terrible']
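
Because `generate_bigrams` collects the bi-grams in a `set`, each distinct bi-gram is appended only once and the order in which they are appended is not guaranteed. It also modifies the list it is given. The variant below is a minimal sketch (the name `generate_bigrams_ordered` is hypothetical and not part of the original notebook) that keeps first-seen order and returns a new list:

```python
def generate_bigrams_ordered(tokens):
    """Return the tokens followed by each distinct bi-gram, in order of first appearance."""
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    # dict.fromkeys drops duplicates while preserving insertion order (Python 3.7+)
    unique_bigrams = list(dict.fromkeys(bigrams))
    # Build a new list so the caller's list is left untouched
    return list(tokens) + unique_bigrams


print(generate_bigrams_ordered(['This', 'film', 'is', 'terrible']))
# ['This', 'film', 'is', 'terrible', 'This film', 'film is', 'is terrible']
```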

TorchText `Field`s have a `preprocessing` argument. A function passed here will be applied to a sentence after it has been tokenized (transformed from a string into a list of tokens), but before it has been indexed (transformed from a token to an integer). Here, we pass our `generate_bigrams` function.

In [37]:
TEXT = data.Field(tokenize='spacy', preprocessing=generate_bigrams)
LABEL = data.LabelField(tensor_type=torch.FloatTensor)
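
To make the order of operations concrete, here is a rough, library-free sketch of the three stages: tokenize, preprocess, then index. The helper names, the whitespace tokenizer and the toy vocabulary are all assumptions made for illustration; TorchText performs the equivalent steps internally with the spaCy tokenizer and the real vocabulary.

```python
def tokenize(sentence):
    # Stand-in for the spaCy tokenizer: split on whitespace
    return sentence.split()


def add_bigrams(tokens):
    # Preprocessing step: runs on the token list, before any indexing
    return tokens + [' '.join(pair) for pair in zip(tokens, tokens[1:])]


def numericalize(tokens, stoi, unk_index=0):
    # Indexing step: map every token (uni-gram or bi-gram) to an integer
    return [stoi.get(token, unk_index) for token in tokens]


# Toy vocabulary; bi-grams are ordinary entries, just like single words
toy_stoi = {'<unk>': 0, 'this': 1, 'film': 2, 'is': 3, 'good': 4,
            'this film': 5, 'is good': 6}

tokens = add_bigrams(tokenize('this film is good'))
print(tokens)  # ['this', 'film', 'is', 'good', 'this film', 'film is', 'is good']
print(numericalize(tokens, toy_stoi))  # [1, 2, 3, 4, 5, 0, 6]
```

The bi-gram "film is" is missing from the toy vocabulary, so it maps to the unknown index, which is what happens to any bi-gram that does not make it into the vocabulary.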

As before, we load the IMDb dataset and create the splits.

In [38]:
train, test = datasets.IMDB.splits(TEXT, LABEL)

train, valid = train.split()

Build the vocab and load the pre-trained word embeddings.

In [39]:
TEXT.build_vocab(train, max_size=40000, vectors="glove.6B.100d")
LABEL.build_vocab(train)

print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x118187840>, {'pos': 0, 'neg': 1})


And create the iterators.

In [40]:
BATCH_SIZE = 64

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)

## Build the Model

This model has far fewer parameters than the previous model, as only 2 layers have any parameters: the embedding layer and the linear layer. There is no RNN component in sight!

Instead, it first calculates the word embedding for each word using the `Embedding` layer, then calculates the average of all of the word embeddings and feeds this through the `Linear` layer, and that's it!
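
As a rough check on the size of the model, the parameter count can be worked out by hand. The sketch below assumes a vocabulary of 40,002 entries (the 40,000 most common tokens plus unknown and padding tokens); that figure is an assumption and is not printed anywhere in this notebook.

```python
def fasttext_param_count(vocab_size, embedding_dim, output_dim):
    # Embedding layer: one vector of size embedding_dim per vocabulary entry
    embedding_params = vocab_size * embedding_dim
    # Linear layer: a weight matrix plus one bias per output
    linear_params = embedding_dim * output_dim + output_dim
    return embedding_params + linear_params


# Assumed vocabulary size of 40,002; embedding_dim=100 and output_dim=1 as in this notebook
print(fasttext_param_count(40_002, 100, 1))  # 4000301
```

Almost all of the parameters sit in the embedding layer; the linear layer contributes only 101.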

![](https://i.imgur.com/e0sWZoZ.png)

We implement the averaging with the `avg_pool2d` (average pool 2-dimensions) function. Initially, using 2-dimensional pooling may seem strange: surely our sentences are 1-dimensional, not 2-dimensional? However, you can think of the word embeddings as a 2-dimensional grid, where the words are along one axis and the dimensions of the word embeddings are along the other. The image below shows an example sentence after being converted into 5-dimensional word embeddings, with the words along the vertical axis and the embedding dimensions along the horizontal axis.

![](https://i.imgur.com/SSH25NT.png)

The `avg_pool2d` passes a filter of size `embedded.shape[1]` (i.e. the length of the sentence) by 1. This is shown in pink in the image below.

![](https://i.imgur.com/U7eRnIe.png)

The average is calculated separately for each embedding dimension, and the results are concatenated into a single tensor for each sentence (5-dimensional in our pictorial examples, 100-dimensional in the code). This tensor is then passed through the linear layer to produce our prediction.
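
The pooling step amounts to a column-wise mean over the words of a sentence. The plain-Python sketch below uses a made-up 3-word sentence with 5-dimensional embeddings; the function name and the numbers are illustrative only, and the model itself uses `F.avg_pool2d` on a batched tensor.

```python
def average_embeddings(embedded):
    """embedded: one list of floats per word, all of the same length."""
    sent_len = len(embedded)
    # zip(*embedded) yields one column per embedding dimension
    return [sum(column) / sent_len for column in zip(*embedded)]


sentence = [
    [1.0, 2.0, 3.0, 4.0, 5.0],   # word 1
    [3.0, 2.0, 1.0, 0.0, -1.0],  # word 2
    [2.0, 2.0, 2.0, 2.0, 2.0],   # word 3
]

print(average_embeddings(sentence))  # [2.0, 2.0, 2.0, 2.0, 2.0]
```

However long the sentence is, the result always has one value per embedding dimension, which is why the linear layer can take a fixed-size input.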

In [41]:
class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, x):
        
        #x = [sent len, batch size]
        
        embedded = self.embedding(x)
                
        #embedded = [sent len, batch size, emb dim]
        
        embedded = embedded.permute(1, 0, 2)
        
        #embedded = [batch size, sent len, emb dim]
        
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
        
        #pooled = [batch size, embedding_dim]
                
        return self.fc(pooled)

As previously, we'll create an instance of our `FastText` class.

In [42]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM = 1

model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM)

And copy the pre-trained vectors to our embedding layer.

In [43]:
pretrained_embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9409, -0.1506,  0.5157,  ...,  0.2661, -0.6054, -0.1816],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

## Train the Model

Training the model is exactly the same as last time.

We initialize our optimizer...

In [44]:
optimizer = optim.Adam(model.parameters())

We define the criterion and place the model and criterion on the GPU (if available)...

In [45]:
criterion = nn.BCEWithLogitsLoss()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)
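
`BCEWithLogitsLoss` applies the sigmoid and the binary cross-entropy in one step, which is why the model returns raw logits. The sketch below shows the per-example calculation in plain Python, using the standard numerically stable rewrite of the formula; the function names are hypothetical, and PyTorch additionally averages over the batch by default.

```python
import math


def bce_with_logits(logit, target):
    # Stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))] where s is the sigmoid:
    # max(x, 0) - x*y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def bce_naive(logit, target):
    # Direct translation of the definition; breaks down for large |logit|
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]:
    print(logit, target, bce_with_logits(logit, target), bce_naive(logit, target))
```

Both functions agree on moderate logits, but only the stable form can handle a value such as 1000.0, where the naive version ends up taking the logarithm of zero.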

We implement the function to calculate accuracy...

In [46]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(F.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
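
The same calculation on plain Python lists may make the steps easier to follow: squash each logit with a sigmoid, round to 0 or 1, and compare with the label. This is only an illustrative sketch with made-up numbers and hypothetical function names; the function above is what the notebook actually uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_accuracy_lists(logits, labels):
    # Round each probability to the nearest class (0 or 1)
    preds = [round(sigmoid(z)) for z in logits]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)


# Predictions round to [1, 0, 1, 0]; three of the four match the labels
print(binary_accuracy_lists([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0]))  # 0.75
```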

We define a function for training our model...

**Note**: we are no longer using dropout so we do not need to use `model.train()`, but as mentioned in the 1st notebook, it is good practice to use it.

In [47]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We define a function for testing our model...

**Note**: again, we keep the call to `model.eval()` even though we do not use dropout.

In [48]:
def evaluate(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Finally, we train our model...

In [49]:
N_EPOCHS = 10

for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iter, optimizer, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:3f}, Val. Acc: {valid_acc*100:.2f}%')



Epoch: 01, Train Loss: 0.687, Train Acc: 57.01%, Val. Loss: 0.634963, Val. Acc: 70.51%
Epoch: 02, Train Loss: 0.646, Train Acc: 71.90%, Val. Loss: 0.503981, Val. Acc: 76.01%
Epoch: 03, Train Loss: 0.565, Train Acc: 79.38%, Val. Loss: 0.433833, Val. Acc: 80.30%
Epoch: 04, Train Loss: 0.484, Train Acc: 83.89%, Val. Loss: 0.399229, Val. Acc: 83.68%
Epoch: 05, Train Loss: 0.416, Train Acc: 87.39%, Val. Loss: 0.392316, Val. Acc: 85.91%
Epoch: 06, Train Loss: 0.365, Train Acc: 88.92%, Val. Loss: 0.391880, Val. Acc: 86.98%
Epoch: 07, Train Loss: 0.324, Train Acc: 90.42%, Val. Loss: 0.403797, Val. Acc: 87.77%
Epoch: 08, Train Loss: 0.289, Train Acc: 91.44%, Val. Loss: 0.419661, Val. Acc: 88.22%
Epoch: 09, Train Loss: 0.262, Train Acc: 92.29%, Val. Loss: 0.438539, Val. Acc: 88.67%
Epoch: 10, Train Loss: 0.241, Train Acc: 92.91%, Val. Loss: 0.457921, Val. Acc: 88.87%
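
In the log above, validation accuracy keeps rising through epoch 10, while validation loss reaches its minimum earlier and then climbs. A small sketch, using only the validation losses printed above, picks out the epoch with the lowest validation loss. Selecting a checkpoint this way is a common practice but is not something this notebook does.

```python
# Validation losses copied from the training log, epochs 1 to 10
valid_losses = [0.634963, 0.503981, 0.433833, 0.399229, 0.392316,
                0.391880, 0.403797, 0.419661, 0.438539, 0.457921]

# Index of the smallest loss, converted to a 1-based epoch number
best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__) + 1

print(best_epoch, valid_losses[best_epoch - 1])  # 6 0.39188
```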


...and get the test accuracy!

The results are comparable to the results in the last notebook, but training takes considerably less time.

In [50]:
test_loss, test_acc = evaluate(model, test_iter, optimizer, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')



Test Loss: 0.446, Test Acc: 88.74%


## Save/Reload the model

In [21]:
torch.save(model.state_dict(), 'saved_model_state.pt')

In [3]:
class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, x):
        
        #x = [sent len, batch size]
        
        embedded = self.embedding(x)
                
        #embedded = [sent len, batch size, emb dim]
        
        embedded = embedded.permute(1, 0, 2)
        
        #embedded = [batch size, sent len, emb dim]
        
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
        
        #pooled = [batch size, embedding_dim]
                
        return self.fc(pooled)

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM = 1
    
model_reload = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM)
model_reload.load_state_dict(torch.load('saved_model_state.pt'))

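NameError: name 'TEXT' is not defined

This error occurs because the cell above was run in a fresh session in which `TEXT` had not been built. To reload the weights, `INPUT_DIM` must equal the vocabulary size the model was trained with, so either rebuild `TEXT` first or store the vocabulary size alongside the saved weights.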
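
One way to make reloading independent of `TEXT` is to save the model's constructor arguments next to the weights. The sketch below is an assumption-laden illustration: the helper names and the file name `saved_model_config.json` are hypothetical, and the vocabulary size of 40,002 is assumed rather than taken from this notebook.

```python
import json
import os
import tempfile


def save_model_config(path, vocab_size, embedding_dim, output_dim):
    # Store everything needed to call the model constructor again
    config = {'vocab_size': vocab_size,
              'embedding_dim': embedding_dim,
              'output_dim': output_dim}
    with open(path, 'w') as f:
        json.dump(config, f)


def load_model_config(path):
    with open(path) as f:
        return json.load(f)


# A temporary directory keeps the sketch from writing into the project folder
config_path = os.path.join(tempfile.mkdtemp(), 'saved_model_config.json')
save_model_config(config_path, vocab_size=40_002, embedding_dim=100, output_dim=1)

config = load_model_config(config_path)
print(config)
# In a fresh session the model could then be rebuilt with
# FastText(config['vocab_size'], config['embedding_dim'], config['output_dim'])
# before calling load_state_dict.
```

Making predictions on new text would still require the token-to-index mapping (`TEXT.vocab.stoi`), so that would need to be saved as well.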

## User Input

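And as before, we can test on any input the user provides. With the label vocabulary printed earlier (`'pos': 0`, `'neg': 1`), the sigmoid output estimates the probability of the `neg` class, so values close to 0 indicate positive sentiment. Note that `predict_sentiment` does not apply `generate_bigrams`, so at prediction time the model only sees single tokens.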

In [71]:
def predict_sentiment(sentence):
    # Accept either a raw string or a list of tokens (e.g. the output of clean_text below)
    if not isinstance(sentence, str):
        sentence = ' '.join(sentence)
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)  # [sent len, batch size = 1]
    prediction = F.sigmoid(model(tensor))
    return prediction.item()

An example negative review...

In [72]:
predict_sentiment("House bill is detrimental to the public and needs to get rid of")

0.0001665666641201824

An example positive review...

In [74]:
predict_sentiment("This film is good")

1.2817367963281601e-20

## Sentiment Analysis For all news data

In [61]:
with open('../data/article_set', 'rb') as f:
    all_news = dill.load(f)
all_news.head()

Unnamed: 0,id,title,publication,author,date,year,month,url,content,topic1,topic2,topic3,topic
0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,Politics*0.45669132,Health Care*0.22221036,Court/Legal System*0.08490019,Politics
1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",Crime*0.30302584,Unlabeled*0.28304937,Crime*0.09784402,Crime
2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...",Unlabeled*0.19544908,Unlabeled*0.14889707,Religion*0.097971134,Religion
3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...",Unlabeled*0.2989482,Religion*0.1702237,Gender Issues*0.11686472,Religion
4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",Foreign Policy/Korea*0.61946124,Politics*0.19456775,Foreign Policy/Trade*0.09577953,Foreign Policy/Korea


In [65]:
def clean_text(article, tokenizer, stop, punct):
    
    article = article.lower()  # Convert to lowercase.
    
    article = tokenizer.tokenize(article)  # Split into words.
    
    # Remove numbers, but not words that contain numbers.
    article = [token for token in article if not token.isdigit()]
    
    # Remove tokens that are three characters or shorter.
    article = [token for token in article if len(token) > 3]
    
    # Remove stop-words
    article = [token for token in article if token not in stop]
    
    # Remove punctuation
    article = [token for token in article if token not in punct]
    
    return article
    
def get_sentiment(article,tokenizer, stop, punct):
    
    article=clean_text(article, tokenizer, stop, punct)
    
    #print (article)
    
    if len(article)>5:
        return predict_sentiment(article)
    else:
        return np.nan
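
To see what the filters in `clean_text` keep, here is a self-contained sketch that mirrors them on a made-up sentence. It uses `re.findall(r'\w+', ...)` as a stand-in for the NLTK `RegexpTokenizer`, plus a tiny stop-word set; both are assumptions made so that the example runs without NLTK.

```python
import re


def clean_text_sketch(article, stop, punct):
    tokens = re.findall(r'\w+', article.lower())        # lowercase and split into words
    tokens = [t for t in tokens if not t.isdigit()]     # drop pure numbers
    tokens = [t for t in tokens if len(t) > 3]          # drop tokens of 3 characters or fewer
    tokens = [t for t in tokens if t not in stop]       # drop stop words
    tokens = [t for t in tokens if t not in punct]      # drop punctuation tokens
    return tokens


sample = "Mr. Smith said the 2 new bills were bad for voters in 2017."
print(clean_text_sketch(sample, stop={'said', 'were'}, punct=set('.,')))
# ['smith', 'bills', 'voters']
```

The length filter also removes short words that carry sentiment, such as "bad" in this sample, which is worth keeping in mind when reading the scores.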


In [66]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import string


tokenizer = RegexpTokenizer(r'\w+')
    
# create English stop words list
en_stop = stopwords.words('english')
# custom stop words
cus_stop=['mr','mrs','ms','said','dr']
# finalize stop words
stop=en_stop+cus_stop
    
# punctuation characters
punct=set(string.punctuation) 

senti_score=[]
articles=all_news['content']

for article in articles:
    senti_score.append(get_sentiment(article,tokenizer, stop, punct))

all_news['sentiment']=np.array(senti_score)

In [67]:
all_news.head(10)

Unnamed: 0,id,title,publication,author,date,year,month,url,content,topic1,topic2,topic3,topic,sentiment
0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,Politics*0.45669132,Health Care*0.22221036,Court/Legal System*0.08490019,Politics,0.9950062
1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",Crime*0.30302584,Unlabeled*0.28304937,Crime*0.09784402,Crime,0.4581105
2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...",Unlabeled*0.19544908,Unlabeled*0.14889707,Religion*0.097971134,Religion,0.001965388
3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...",Unlabeled*0.2989482,Religion*0.1702237,Gender Issues*0.11686472,Religion,0.003980621
4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",Foreign Policy/Korea*0.61946124,Politics*0.19456775,Foreign Policy/Trade*0.09577953,Foreign Policy/Korea,0.00784058
5,17288,"Sick With a Cold, Queen Elizabeth Misses New Y...",New York Times,Sewell Chan,2017-01-02,2017.0,1.0,,"LONDON — Queen Elizabeth II, who has been b...",Unlabeled*0.29382992,Entertainment*0.19128913,Sports*0.17322288,Entertainment,1.589564e-10
6,17289,Taiwan’s President Accuses China of Renewed In...,New York Times,Javier C. Hernández,2017-01-02,2017.0,1.0,,BEIJING — President Tsai of Taiwan sharpl...,Foreign Policy/China*0.5561201,Foreign Policy/Trade*0.1705865,Politics*0.06863873,Foreign Policy/China,0.0483693
7,17290,"After ‘The Biggest Loser,’ Their Bodies Fought...",New York Times,Gina Kolata,2017-02-08,2017.0,2.0,,"Danny Cahill stood, slightly dazed, in a blizz...",Scientific Research*0.23732938,Unlabeled*0.14938736,Domestic Affairs*0.111396484,Scientific Research,0.9918941
8,17291,"First, a Mixtape. Then a Romance. - The New Yo...",New York Times,Katherine Rosman,2016-12-31,2016.0,12.0,,"Just how is Hillary Kerr, the founder of ...",Unlabeled*0.3029118,Gender Issues*0.22602034,Food/Lifestyle*0.082700245,Gender Issues,0.0001434017
9,17292,Calling on Angels While Enduring the Trials of...,New York Times,Andy Newman,2016-12-31,2016.0,12.0,,Angels are everywhere in the Muñiz family’s ap...,Unlabeled*0.69932485,Domestic Affairs*0.080745846,Education*0.05090855,Domestic Affairs,0.03759886


In [68]:
with open('../data/article_set', 'wb') as f:
    dill.dump(all_news, f)