<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/1_Simple_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Simple Sentiment Analysis

[tutorial link](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb)


In this series we'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the IMDb dataset.

In this first notebook, we'll start very simple to understand the general concepts whilst not really caring about good results. Further notebooks will build on this knowledge and we'll actually get good results.

### Introduction

We'll be using a recurrent neural network (RNN), as RNNs are commonly used to analyse sequences. An RNN takes in a sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and produces a hidden state, $h$, for each word. We use the RNN recurrently by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$.

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y} = f(h_T)$.

The figure below shows an example sentence, with the RNN predicting zero, which indicates a negative sentiment. The RNN is shown in orange and the linear layer in silver. Note that we use the same RNN for every word, i.e. it has the same parameters. The initial hidden state, $h_0$, is a tensor initialized to all zeros.

![](https://nbviewer.jupyter.org/github/bentrevett/pytorch-sentiment-analysis/blob/master/assets/sentiment1.png)

Note: some layers and steps have been omitted from the diagram, but these will be explained later.
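
To make the recurrence concrete, here is a minimal pure-Python sketch of the idea. It is not the model built later in this series: the tanh update rule, the toy weights, and the helper names `rnn_step` and `predict` are all assumptions made for illustration, and each word is represented by a made-up 2-dimensional vector.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step (assumed form): h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(w * x for w, x in zip(W_xh[i], x_t))
        total += sum(w * h for w, h in zip(W_hh[i], h_prev))
        h_t.append(math.tanh(total))
    return h_t

def predict(sequence, W_xh, W_hh, b_h, w_out, b_out):
    """Apply the same RNN step to every input, then a linear layer to the final state."""
    h = [0.0] * len(b_h)              # h_0 is initialized to all zeros
    for x_t in sequence:              # feed one word vector at a time
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    # Linear layer on the final hidden state: y_hat = f(h_T)
    return sum(w * v for w, v in zip(w_out, h)) + b_out

# Toy parameters (hypothetical values, shared across all time steps)
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
w_out = [1.0, -1.0]
b_out = 0.0

# Three "words", each a toy 2-dimensional vector
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_hat = predict(sequence, W_xh, W_hh, b_h, w_out, b_out)
print(y_hat)
```

Note that the weights are created once and reused at every step, which is what "the same RNN for every word" means in practice.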

### Preparing Data

One of the main concepts of TorchText is the Field. A Field defines how your data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either "pos" or "neg".

The parameters of a Field specify how the data should be processed.

We use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment.

Our TEXT field has tokenize='spacy' as an argument. This defines that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the spaCy tokenizer. If no tokenize argument is passed, the default is simply splitting the string on spaces.
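
The difference between splitting on spaces and a punctuation-aware tokenizer can be seen in a small sketch. The regular expression below is only a rough stand-in chosen for illustration; it is not spaCy and will not reproduce spaCy's output exactly (for example, spaCy keeps `'ve` together as one token).

```python
import re

review = "But once you've seen this movie, you would probably want to see it again."

# Splitting on whitespace only: punctuation stays glued to the neighbouring word
whitespace_tokens = review.split()

# Rough stand-in for a smarter tokenizer (NOT spaCy):
# runs of word characters, or any single non-space punctuation character
regex_tokens = re.findall(r"\w+|[^\w\s]", review)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `movie,` and `movie` would count as different tokens, which is one reason a dedicated tokenizer is usually preferred.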

LABEL is defined by a LabelField, a special subclass of the Field class specifically used for handling labels. We will explain the dtype argument later.

For more on Fields, see the TorchText documentation.

We also set the random seeds for reproducibility.

In [0]:
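import torch
# Field/LabelField belong to the legacy torchtext API (moved to torchtext.legacy in 0.9, removed in 0.12)
from torchtext import data

SEED = 1234

# Seed PyTorch and request deterministic cuDNN kernels for reproducibility
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# TEXT tokenizes reviews with spaCy; LABEL holds the sentiment as a float
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype = torch.float)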

Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP).

The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as torchtext.datasets objects. It processes the data using the Fields we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.

In [2]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [3]:

print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


We can also check an example.



In [4]:
print(vars(train_data.examples[0]))


{'text': ['Now', 'let', 'me', 'tell', 'you', 'about', 'this', 'movie', ',', 'this', 'movie', 'is', 'MY', 'FAVORITE', 'MOVIE', '!', '!', '!', 'This', 'movie', 'has', 'excellent', 'combat', 'fighting', '.', 'This', 'movie', 'does', 'sound', 'like', 'a', 'silly', 'story', 'line', 'about', 'how', 'Jet', 'Li', 'plays', 'a', 'super', 'hero', ',', 'like', 'Spider', '-', 'Man', ',', 'or', 'etc', '.', 'But', 'once', 'you', "'ve", 'seen', 'this', 'movie', ',', 'you', 'would', 'probably', 'want', 'to', 'see', 'it', 'again', 'and', 'again', '.', 'I', 'rate', 'this', 'movie', '10/10', '.'], 'label': 'pos'}


The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the .split() method.

By default this splits 70/30. Passing a split_ratio argument changes the ratio of the split, e.g. a split_ratio of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set.

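We also seed Python's random module and pass its state to the random_state argument, ensuring that we get the same train/validation split each time.

In [0]:
import random

# random.seed() returns None, so seed first and then pass the generator state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())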

In [7]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
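
The 17,500/7,500 counts above follow from the default 70/30 ratio applied to the 25,000 training examples. As a rough illustration of what a seeded ratio split does, here is a pure-Python sketch. It is not TorchText's implementation; the function `split_examples`, its parameters, and the use of integers as stand-in examples are assumptions made for this example.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` with a seeded generator, then cut it at split_ratio."""
    rng = random.Random(seed)            # private generator; global random state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(round(split_ratio * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

examples = list(range(25000))            # stand-ins for the 25,000 training reviews

train, valid = split_examples(examples)  # default 70/30 split
print(len(train), len(valid))

train_80, valid_80 = split_examples(examples, split_ratio=0.8)
print(len(train_80), len(valid_80))
```

Because the generator is seeded, calling `split_examples` again with the same arguments returns the same two lists, which is the property the random seed is meant to guarantee.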


In [9]:
print(vars(valid_data.examples[0]))
print(vars(test_data.examples[0]))

{'text': ['This', 'is', 'a', 'tongue', 'in', 'cheek', 'movie', 'from', 'the', 'very', 'outset', 'with', 'a', 'voice', '-', 'over', 'that', 'pokes', 'fun', 'at', 'everything', 'French', 'and', 'then', 'produces', 'a', 'rather', 'naif', 'but', 'very', 'brave', 'hero', 'in', 'Fanfan', 'La', 'Tulipe', '.', 'Portrayed', 'by', 'the', 'splendid', 'Gerard', 'Philippe', ',', 'the', 'dashing', 'young', 'man', 'believes', 'utterly', 'in', 'the', 'fate', 'curvaceous', 'Lollobrigida', 'foretells', '-', 'notably', 'that', 'he', 'will', 'marry', 'King', 'Louis', 'XV', "'s", 'daughter', '!', 'Problem', 'is', ',', 'La', 'Lollo', 'soon', 'find', 'outs', 'she', 'too', 'is', 'in', 'love', 'with', 'Fanfan', '...', '<br', '/><br', '/>Propelled', 'by', 'good', 'sword', 'fights', ',', 'cavalcades', ',', 'and', 'other', 'spirited', 'action', 'sequences', 'the', 'film', 'moves', 'at', 'a', 'brisk', 'pace', 'and', 'with', 'many', 'comic', 'moments', '.', 'The', 'direction', 'is', 'perhaps', 'the', 'weakest', 'as