
# **Sentiment Analysis of IMDB Movie Reviews**
---



## <b>Introduction </b>

In this project, we are going to train a Recurrent Neural Network to help us predict the sentiment of a movie review. As this is a basic example, we are going to classify the sentiment into two classes: Positive and Negative.<br>
<br>
We will also perform some text preprocessing tasks so that only the required data is sent to our model.<br>


> Let's start with the imports

In [None]:
import os
import glob
import pickle

In [None]:
import numpy as np
import pandas as pd

In [None]:
from sklearn.utils import shuffle
import re
from bs4 import BeautifulSoup
import requests

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [None]:
import torch
import torch.utils.data
import torch.nn as nn
import torch.optim as optim

## Fetching the Data

> We will be using the following data for our sentiment analysis: <a href="http://ai.stanford.edu/~amaas/data/sentiment/"> Large IMDB Dataset </a>

In [None]:
def get_data():
  '''
  This function will fetch the data and unzip it for use.
  '''
  commands = [
      # Download the data using the wget Linux command
      "wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
      # Extract the tar file into a data folder
      "mkdir -p data",
      "tar -zxf aclImdb_v1.tar.gz -C data",
      "rm -f aclImdb_v1.tar.gz",
  ]
  for command in commands:
    # os.system does not raise on failure; it returns a non-zero exit status
    if os.system(command) != 0:
      print("There was some error while downloading and unzipping the tar file.")
      break


In [None]:
get_data()
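
The shell commands above rely on `wget` and `tar` being installed. Where they are not, the same two steps can be sketched with the Python standard library alone. The helper names `download_archive` and `extract_archive` are hypothetical and not part of the original notebook; this is a minimal sketch under those assumptions, not a drop-in replacement.

```python
import os
import tarfile
import urllib.request


def download_archive(url, archive_path):
    # Hypothetical helper: fetch the file at `url` and store it at `archive_path`
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_archive(archive_path, dest_dir="data"):
    # Hypothetical helper: unpack a gzip-compressed tar file into `dest_dir`
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    return dest_dir
```

Calling `extract_archive(download_archive(url, "aclImdb_v1.tar.gz"))` with the dataset URL used above should leave the reviews under `data/aclImdb`, the same layout the rest of the notebook expects.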

## Processing the Data

> We will start by creating directories required for the processed files.

In [None]:
data_dir = "/content/data/"

# Location of the cache files
cache_dir = "/content/data/sentiment_analysis"
# Create the directory if it does not already exist
os.makedirs(cache_dir, exist_ok=True)

# Folder to store the dictionary
word_dir = '/content/data/word'
os.makedirs(word_dir, exist_ok=True)

### Loading the data

> As our data is stored in text files, we need to read it and load it into the respective dictionaries.

In [None]:
def read_imdb_data(data_dir='/content/data/aclImdb'):
  '''
  This function will load the data from text files into nested dictionaries.
  The total reviews are 50k, split as 25k each for train and test. Furthermore,
  the total reviews are split into 25k positive and 25k negative. The function
  will create the respective dictionary entries and load the reviews accordingly.
  '''

  #Define the dictionaries
  data = {}
  labels = {}
    
  # First looping through train, test directory  
  for data_type in ['train', 'test']:
    # Nesting dictionaries
    data[data_type] = {}
    labels[data_type] = {}
    
    #Second looping through positive and negative directory
    for sentiment in ['pos', 'neg']:
        # Nesting dictionaries
        data[data_type][sentiment] = []
        labels[data_type][sentiment] = []
        
        #Generate the path for files
        path = os.path.join(data_dir, data_type, sentiment, '*.txt')
        # Fetch the files
        files = glob.glob(path)
        
        # Third looping through each review file
        for f in files:
            with open(f, encoding='utf-8') as review:
                # Add the review text into the dictionary as a list
                data[data_type][sentiment].append(review.read())
                # Here we represent a positive review by '1' and a negative review by '0'
                labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)

        # Checking for any data mismatch        
        assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                "{}/{} data size does not match labels size".format(data_type, sentiment)
            
  return data, labels

In [None]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


### Splitting the data into Train and Test

> We will now combine the positive and negative reviews into train and test datasets and shuffle each of them.

In [None]:
def prepare_imdb_data(data, labels):
    '''
    Prepare training and test sets from IMDb movie reviews.
    '''
    
    #Combine positive and negative reviews and labels
    
    #Train Dataset
    data_train = data['train']['pos'] + data['train']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    
    #Test Dataset
    data_test = data['test']['pos'] + data['test']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return the unified training data, test data, training labels and test labels
    return data_train, data_test, labels_train, labels_test

In [None]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


In [None]:
# Basic check for a train dataset review
print(train_X[100])
print(train_y[100])


As has been noted, this formula has been filmed several times, most recently as "You've Got Mail", with Tom Hanks and Meg"Trout Pout" Ryan. Of the several versions, this is my least favorite. The problem i think is that the studio coasted on the Stars charisma, which doesn't quite cut it here.<br /><br />The chemistry betwixt the two leads never comes to a boil in this movie. There are no real sparks. Van Johnson and Judy Garland remind me of day old donuts, pleasant but bland. And when the leads are boring the rest of the movie can only follow. Judy in particular is disappointing. She looks like she has no neck! I don't know if she was having trouble with pain or something but she looks like a turtle trying to pull it's head into it's shell, all hunched up and everything. I couldn't figure out what Van Johnson was getting so hot about. I would have made a bee line for that cute violin player. And Van wasn't great either. I've always thought of him as a rather generic Hollywood leading

### Text Preprocessing

> We will now perform the text preprocessing steps:
<ul>
  <li>Removal of HTML tags</li>
  <li>Removal of unnecessary characters</li>
  <li>Conversion to lowercase</li>
  <li>Removal of stopwords</li>
  <li>Stemming (reducing words to their root form, e.g. drinking -> drink)</li>
  <li>Generating the wordlist (storing the relevant words in a list)</li>
</ul>

In [None]:
def review_to_words(review):
  '''
  This function will take each review and perform the text pre-processing steps
  '''

  # Fetching the stopwords list for the English language from the nltk library
  nltk.download("stopwords", quiet=True)
  # Using a set so that each membership check is fast
  stop_words = set(stopwords.words("english"))
  
  # Initializing the stemmer we want to use for our stemming process
  stemmer = PorterStemmer()
  
  # Removing the HTML tags using the BeautifulSoup library
  text = BeautifulSoup(review, "html.parser").get_text()

  # Converting the text to lowercase and replacing any unnecessary characters with spaces
  text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

  # Creating a list of words by splitting the string
  words = text.split()

  # Removing the stopwords from the list
  words = [w for w in words if w not in stop_words]

  # Performing stemming on the words remaining in the list
  words = [stemmer.stem(w) for w in words]
  
  return words
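
To make the intermediate steps concrete, here is a small sketch that applies only the lowercasing, character removal and splitting steps to a short string. It uses the standard library only. The function name `basic_tokenize` and the sample sentence are made up for illustration, and stopword removal and stemming are deliberately left out.

```python
import re


def basic_tokenize(text):
    # Lowercase the text and replace every non-alphanumeric character with a space
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Split on whitespace to obtain the list of words
    return cleaned.split()


print(basic_tokenize("It's GREAT, isn't it?"))
# ['it', 's', 'great', 'isn', 't', 'it']
```

Note how the apostrophes split contractions into fragments such as `s` and `t`. In the full function these fragments are expected to be dropped by the stopword filter, which would be consistent with the sample output in this section, where no such fragments are visible.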

In [None]:
print(train_X[100])

As has been noted, this formula has been filmed several times, most recently as "You've Got Mail", with Tom Hanks and Meg"Trout Pout" Ryan. Of the several versions, this is my least favorite. The problem i think is that the studio coasted on the Stars charisma, which doesn't quite cut it here.<br /><br />The chemistry betwixt the two leads never comes to a boil in this movie. There are no real sparks. Van Johnson and Judy Garland remind me of day old donuts, pleasant but bland. And when the leads are boring the rest of the movie can only follow. Judy in particular is disappointing. She looks like she has no neck! I don't know if she was having trouble with pain or something but she looks like a turtle trying to pull it's head into it's shell, all hunched up and everything. I couldn't figure out what Van Johnson was getting so hot about. I would have made a bee line for that cute violin player. And Van wasn't great either. I've always thought of him as a rather generic Hollywood leading

In [None]:
review_to_words(train_X[100])

['note',
 'formula',
 'film',
 'sever',
 'time',
 'recent',
 'got',
 'mail',
 'tom',
 'hank',
 'meg',
 'trout',
 'pout',
 'ryan',
 'sever',
 'version',
 'least',
 'favorit',
 'problem',
 'think',
 'studio',
 'coast',
 'star',
 'charisma',
 'quit',
 'cut',
 'chemistri',
 'betwixt',
 'two',
 'lead',
 'never',
 'come',
 'boil',
 'movi',
 'real',
 'spark',
 'van',
 'johnson',
 'judi',
 'garland',
 'remind',
 'day',
 'old',
 'donut',
 'pleasant',
 'bland',
 'lead',
 'bore',
 'rest',
 'movi',
 'follow',
 'judi',
 'particular',
 'disappoint',
 'look',
 'like',
 'neck',
 'know',
 'troubl',
 'pain',
 'someth',
 'look',
 'like',
 'turtl',
 'tri',
 'pull',
 'head',
 'shell',
 'hunch',
 'everyth',
 'figur',
 'van',
 'johnson',
 'get',
 'hot',
 'would',
 'made',
 'bee',
 'line',
 'cute',
 'violin',
 'player',
 'van',
 'great',
 'either',
 'alway',
 'thought',
 'rather',
 'gener',
 'hollywood',
 'lead',
 'man',
 'anyth',
 'dispel',
 'imag',
 'fan',
 'star',
 'earli',
 '1900',
 'might',
 'like',
 'mo

### Checkpoint

> Preprocessing is expensive, so it wouldn't be wise to repeat all the steps every time we want to perform sentiment analysis. Instead, we will create a cache file that holds the datasets with their respective pre-processed content.

In [None]:
def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):

    '''
    Convert each review to words; read from cache if available.
    '''

    cache_data = None

    # If a cache_file name is given, we will try to read it and load the data
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache or cache file not found
    
    
    # If no cached data could be loaded, then we preprocess the data and write it to the file
    if cache_data is None:
      # Preprocess training and test data to obtain words for each review
      words_train = [review_to_words(review) for review in data_train]
      words_test = [review_to_words(review) for review in data_test]
      
      # Write to cache file for future runs
      if cache_file is not None:
          # Create a dictionary that contains all the processed data
          cache_data = dict(words_train=words_train, words_test=words_test,
                            labels_train=labels_train, labels_test=labels_test)
          
          # Write the dictionary to a pickle file
          with open(os.path.join(cache_dir, cache_file), "wb") as f:
              pickle.dump(cache_data, f)
          print("Wrote preprocessed data to cache file:", cache_file)
    else:
      # Unpack data loaded from cache file
      words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
              cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [None]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Wrote preprocessed data to cache file: preprocessed_data.pkl
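
The caching idea used above is independent of the review data. As a minimal sketch, assuming a hypothetical helper named `load_or_compute` (not part of the notebook), the same pattern can be written for any expensive computation:

```python
import os
import pickle


def load_or_compute(cache_path, compute_fn):
    # Return the cached object if the cache file already exists
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Otherwise compute the result and store it for future runs
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

The first call runs `compute_fn` and writes the pickle file; later calls with the same path read the file instead. Unlike `preprocess_data`, this sketch does not fall back to recomputing when the cache file is corrupt.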


## Building the Vocabulary

> We can use the reviews to generate a vocabulary of words. There can be a lot of words, so we need to limit the size of the vocab and keep only the most frequently occurring words.

In [None]:
def build_dict(data, vocab_size = 5000):
  '''
  This function will generate a vocab dictionary. The vocab size can be
  provided as an argument, default is 5000 words. The words are sorted in
  descending order of frequency, so the most frequently appearing words get
  the smallest integers. The integers 0 and 1 are left unused so that they
  can represent the 'NOWORD' and 'INFREQ' labels.
  '''
  # A dict storing the words that appear in the reviews along with how often they occur
  word_count = {}
    
  for review in data:
      for word in review:
          if word in word_count:
              word_count[word] += 1
          else:
              word_count[word] = 1
  
  # Sorting the words by how often they occur, most frequent first
  sorted_words = [item[0] for item in sorted(word_count.items(), key=lambda x: x[1], reverse=True)]

  # This is what we are building, a dictionary that translates words into integers
  word_dict = {}
  # The -2 and +2 save room for the 'no word' (0) and 'infrequent' (1) labels
  for idx, word in enumerate(sorted_words[:vocab_size - 2]):
      word_dict[word] = idx + 2
      
  return word_dict

In [None]:
word_dict = build_dict(train_X)
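
To see what the resulting dictionary looks like, the sketch below builds a tiny vocabulary from two made-up, already stemmed reviews. It uses `collections.Counter` instead of the manual counting loop, so it is an equivalent illustration rather than the notebook's own function. The name `build_small_dict` and the toy data are assumptions.

```python
from collections import Counter


def build_small_dict(tokenized_reviews, vocab_size):
    # Count how often each word occurs across all reviews
    counts = Counter(word for review in tokenized_reviews for word in review)
    # Keep the (vocab_size - 2) most frequent words; 0 and 1 stay reserved
    most_common = counts.most_common(vocab_size - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}


toy_reviews = [["good", "movi", "good"], ["bad", "movi", "good"]]
print(build_small_dict(toy_reviews, vocab_size=4))
# {'good': 2, 'movi': 3}
```

With `vocab_size=4` only two real words fit, so the least frequent word (`bad`) is left out and will later be encoded as an infrequent word.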

### Saving the dictionary

> We can save the dictionary to a pickle file, so we can use it in the future.

In [None]:
def save_dict(word_dict):
  '''
  This function will save the vocab dictionary to a pickle file
  '''
  
  with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)  

In [None]:
save_dict(word_dict)

## Transform reviews

> We will now perform encoding of the reviews. Using the vocab, we will encode each word to its corresponding integer value. The result will be a matrix of encoded reviews.

In [None]:
def convert_and_pad(word_dict, sentence, pad=500):
  '''
  This function will perform encoding of each review. We will be using
  padding, to make sure that all sentences are of same length. 
  '''
  
  # We will use 0 to represent the 'no word' category
  NOWORD = 0 
  # We use 1 to represent the infrequent words, i.e., words not appearing in word_dict
  INFREQ = 1 

  # Creating a list of length pad for the sentence, pre-filled with NOWORD (0)
  working_sentence = [NOWORD] * pad

  # Enumerate each word in the sentence (truncated to pad words), encode it
  # and store it in the working_sentence variable

  for word_index, word in enumerate(sentence[:pad]):
      if word in word_dict:
          working_sentence[word_index] = word_dict[word]
      else:
          working_sentence[word_index] = INFREQ

  # Return the encoded review and the length of the sentence        
  return working_sentence, min(len(sentence), pad)

In [None]:
def convert_and_pad_data(word_dict, data, pad=500):
  '''
  This function will pass each review through the encoder. The result and
  the lengths will be returned 
  '''
  result = []
  lengths = []
  
  for sentence in data:
      converted, leng = convert_and_pad(word_dict, sentence, pad)
      result.append(converted)
      lengths.append(leng)
      
  return np.array(result), np.array(lengths)

In [None]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

In [None]:
# Checking the encoded review and its length
print(train_X[100])
print(train_X_len[100])

[ 539 1376    3  335    6  500  111 3542  712 1907 3970    1    1 2029
  335  197  141  395  204   30  809 3881   76 2883   96  374 1084    1
   42  177   51   45 3094    2   71 2817 1050 2229 3095    1  620   91
   72    1 1954 1491  177  186  302    2  213 3095  765  287   19    5
 2529   37  733  483   66   19    5 4972   54  614  253 4219    1  207
  461 1050 2229   10  801   15   34    1  117  908    1  931 1050   26
  298  131   99  173  258  280  177   55  153    1  654  123   76  346
    1  156    5    2   70  164  646  105  517   21 1159 3875    1    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

In [None]:
train_X.shape

(25000, 500)
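
The effect of padding and truncation is easier to see on a toy example. The sketch below is a compact re-implementation of the same encoding idea with a made-up two-word dictionary and a pad length of 5. The function name `encode` is hypothetical.

```python
def encode(words, word_dict, pad):
    # 0 marks padding ('no word'), 1 marks words missing from the vocabulary
    encoded = [word_dict.get(word, 1) for word in words[:pad]]
    length = len(encoded)
    # Right-pad with zeros up to the fixed length
    return encoded + [0] * (pad - length), length


toy_dict = {"good": 2, "movi": 3}
print(encode(["good", "strang", "movi"], toy_dict, pad=5))
# ([2, 1, 3, 0, 0], 3)
print(encode(["good"] * 7, toy_dict, pad=5))
# ([2, 2, 2, 2, 2], 5)
```

Short reviews are filled with zeros on the right, while reviews longer than the pad length are cut off and report the pad length as their length.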

### Saving the transformed reviews

> We will concatenate train_y, train_X_len and train_X into a Pandas DataFrame and save it as a CSV file.

In [None]:
#Creating and saving training dataframe
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
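
Each row of the saved file therefore has 502 columns: the label, the review length, and the 500 encoded words. As a sketch of that layout, the snippet below writes one shortened row (padded to 5 instead of 500) to an in-memory buffer and splits it back into its parts. It uses the standard `csv` module rather than pandas, and the values are made up.

```python
import csv
import io

label, length, encoded = 1, 3, [2, 1, 3, 0, 0]

# Write the row in the same column order: label, length, encoded words
buffer = io.StringIO()
csv.writer(buffer).writerow([label, length] + encoded)

# Read it back and split it into its three parts
buffer.seek(0)
row = [int(value) for value in next(csv.reader(buffer))]
row_label, row_length, row_words = row[0], row[1], row[2:]
print(row_label, row_length, row_words)
# 1 3 [2, 1, 3, 0, 0]
```

This layout is why the model's forward pass treats the first input column as the length: `create_loader` drops only column 0 (the label), so the length stays attached to the encoded review.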

## Creating the model

> We will now proceed to create our Recurrent Neural Network. For our project, we will be using an Embedding layer followed by an LSTM layer.

In [None]:
class SentimentAnalysis(nn.Module):
  '''
  A simple RNN to perform Sentiment Analysis
  '''

  def __init__(self, embedding_dim, hidden_dim, vocab_size):
    '''
    Initialize the model by setting up the layers.
    '''

    super(SentimentAnalysis, self).__init__()

    self.embedding = nn.Embedding(vocab_size,embedding_dim,padding_idx=0)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim)
    self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
    self.sig = nn.Sigmoid()

  def forward(self, x):
    '''
    Perform a forward pass of our model on some input.
    '''
    x = x.t()
    lengths = x[0,:]
    reviews = x[1:,:]
    embeds = self.embedding(reviews)
    lstm_out, _ = self.lstm(embeds)
    out = self.dense(lstm_out)
    
    # Selecting, for each review, the output at its last valid (non-padded) time step
    out = out[lengths - 1, range(len(lengths))]
   
    return self.sig(out.squeeze())
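
The indexing in `forward` picks, for every review in the batch, the LSTM output at that review's last real word rather than at a padded position. The sketch below mimics this with plain nested lists instead of tensors, so `out[t][b]` stands for the output at time step `t` for review `b`. The numbers are made up.

```python
# Outputs for 4 time steps (rows) and 2 reviews (columns)
out = [
    [0.1, 0.5],
    [0.2, 0.6],
    [0.3, 0.7],
    [0.4, 0.8],
]
# Review 0 has 2 real words, review 1 has 4
lengths = [2, 4]

# For review b, take the output at time step lengths[b] - 1
last_outputs = [out[lengths[b] - 1][b] for b in range(len(lengths))]
print(last_outputs)
# [0.2, 0.8]
```

In the model, the tensor expression `out[lengths - 1, range(len(lengths))]` performs the same selection for the whole batch at once.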
    

### Creating Datasets and Dataloaders

> The dataset is fairly large, so training can take a while. The loader below reads the whole file; with limited computing resources, a smaller subset can be used instead.

In [None]:
data_file = os.path.join(data_dir, 'train.csv')

In [None]:
def create_loader(data_file,batch_size=50):
  '''
  This function will create a loader that can be used to feed the data to the
  network.
  '''

  # Read in the whole CSV file (it has no header row)
  train_data = pd.read_csv(data_file, header=None, names=None)

  # Turn the input pandas dataframe into tensors
  train_y = torch.from_numpy(train_data[[0]].values).float().squeeze()
  train_X = torch.from_numpy(train_data.drop([0], axis=1).values).long()
  
  # Build the dataset
  train_ds = torch.utils.data.TensorDataset(train_X, train_y)
  # Build the dataloader
  train_loader = torch.utils.data.DataLoader(train_ds, batch_size=batch_size)

  return train_loader

In [None]:
# Creating a train loader
train_loader = create_loader(data_file)

## Train the model

> We will now create a function that can train the model

In [None]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
  '''
  This function will take the training parameters and train the model
  '''
  for epoch in range(1, epochs + 1):
      model.train()
      total_loss = 0
      for batch in train_loader:         
          batch_X, batch_y = batch
          
          # Shifting to GPU if available
          batch_X = batch_X.to(device)
          batch_y = batch_y.to(device)
          
          # Optimizing
          optimizer.zero_grad()
          out = model(batch_X)
          loss = loss_fn(out, batch_y)
          loss.backward()
          optimizer.step()
          
          total_loss += loss.item()
      print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

### Creating the model and training:

> We will create the model, define the loss function and the optimizer, and pass them to the training function. We will also transfer the model to the GPU if available.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SentimentAnalysis(100, 256, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()
train_loader = create_loader(data_file)

train(model,train_loader, 15, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.5452197691798211
Epoch: 2, BCELoss: 0.4080163461863995
Epoch: 3, BCELoss: 0.3429047822058201
Epoch: 4, BCELoss: 0.2748203212842345
Epoch: 5, BCELoss: 0.22618635434657336
Epoch: 6, BCELoss: 0.17249603949859738
Epoch: 7, BCELoss: 0.13410881471447647
Epoch: 8, BCELoss: 0.1164365616682917
Epoch: 9, BCELoss: 0.08813222290948033
Epoch: 10, BCELoss: 0.060109753909055146
Epoch: 11, BCELoss: 0.03992631448199972
Epoch: 12, BCELoss: 0.011715514285839163
Epoch: 13, BCELoss: 0.004094383300835034
Epoch: 14, BCELoss: 0.2840329334288835
Epoch: 15, BCELoss: 0.1534613660648465
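
`BCELoss` is the binary cross-entropy between the predicted probability and the 0/1 label, averaged over the batch by default. As a rough illustration using only the standard library (the function name `binary_cross_entropy` and the sample values are made up), the value for one batch can be computed by hand:

```python
import math


def binary_cross_entropy(probs, labels):
    # Mean of -(y * log(p) + (1 - y) * log(1 - p)) over the batch
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


# Confident and correct predictions give a small loss
print(round(binary_cross_entropy([0.9, 0.1], [1, 0]), 4))
# 0.1054
# Confident but wrong predictions give a large loss
print(round(binary_cross_entropy([0.1, 0.9], [1, 0]), 4))
# 2.3026
```

The per-epoch numbers printed during training are the mean of this quantity over all batches.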


## Testing the Model

>We will prepare the test data in the same way as we did for training.

In [None]:
# Creating a test data file
pd.concat([pd.DataFrame(test_y), pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

In [None]:
test_file = os.path.join(data_dir, 'test.csv')

In [None]:
#Creating the test loader
test_loader = create_loader(test_file)

In [None]:
def test_model(model, test_loader, criterion):
  
  test_losses = [] # track loss
  num_correct = 0
  
  model.eval()
  # iterate over test data without tracking gradients
  with torch.no_grad():
    for batch in test_loader:

      batch_X, batch_y = batch
      
      batch_X = batch_X.to(device)
      batch_y = batch_y.to(device)
      
      output = model(batch_X)
     
      # calculate loss
      test_loss = criterion(output, batch_y.float())
      test_losses.append(test_loss.item())
      
      # convert output probabilities to predicted class (0 or 1)
      pred = torch.round(output)  # rounds to the nearest integer
      
      # compare predictions to true label
      correct_tensor = pred.eq(batch_y.float().view_as(pred))
      correct = np.squeeze(correct_tensor.cpu().numpy())
      num_correct += np.sum(correct)


  # -- stats! -- ##
  # avg test loss
  print("Test loss: {:.3f}".format(np.mean(test_losses)))

  # accuracy over all test data
  test_acc = num_correct/len(test_loader.dataset)
  print("Test accuracy: {:.3f}".format(test_acc))
    

In [None]:
test_model(model,test_loader,loss_fn)

Test loss: 0.465
Test accuracy: 0.846
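
The accuracy figure is the fraction of reviews whose rounded probability matches the label. A minimal sketch of that calculation with plain Python lists follows. The function name `accuracy` and the sample values are made up, and the threshold `p > 0.5` is chosen to agree with `torch.round`, which rounds exactly 0.5 down to 0.

```python
def accuracy(probs, labels):
    # A probability above 0.5 is treated as the positive class
    preds = [1 if p > 0.5 else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)


print(accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))
# 0.75
```

Here the third review is predicted positive (0.6) although its label is 0, so three out of four predictions are correct.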


## Prediction on real data

In [None]:
def predict(review,word_dict,model):
  '''
  This function will take a list holding the raw review text, the word
  dictionary and the model and return the predicted sentiment of the review
  '''
  
  # Apply the same text preprocessing that was used for the training data
  review_words = [review_to_words(r) for r in review]

  # Encode the review and get its length
  review_encoded,review_len = convert_and_pad_data(word_dict, review_words)
  
  # Prepend the length to the encoded review so it can be provided to the model
  review_p = np.append(review_len,review_encoded)

  # Convert the array to torch tensor and shift it to gpu if available
  review_t = torch.from_numpy(review_p).to(device)

  # Pass the tensor to the model and get its output
  output = model(review_t.unsqueeze(dim=0))
  
  print(output)
  if torch.round(output) == 1:
    return 'positive'
  else:
    return 'negative'




### Providing the url for a movie review

In [None]:
# Get the reviews page
page = requests.get('https://www.imdb.com/title/tt0087233/reviews?sort=submissionDate&dir=desc&ratingFilter=0')

In [None]:
# Parse the page
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
# Navigate and get all the reviews
reviews_container = soup.find_all(class_='text show-more__control')

In [None]:
# Cleaning the text
reviews = []
for review in reviews_container:
  review_t = review.text.strip()
  reviews.append(review_t)

In [None]:
len(reviews)

25

In [None]:
# Perform the predictions for the reviews

for r in reviews:
  p = predict([r],word_dict,model)
  print(r)
  print("\n")
  print(p.upper())
  print("----------------------------------------\n\n")

tensor(0.4181, device='cuda:0', grad_fn=<SigmoidBackward>)
Though not optimally made, this movie is captivating.It 'only' shows the love affair of two people, alas married, who have no intention to stray. It simply is their inherent desire of just being together that brings them into the state of affairs.Great play and / or maquillage on the side of especially Meryl Streep that convincingly and immediately shows her internal state(s): feeling good, feeling bad, feeling happy, etc.The audience can follow as observer, how these two come together, ever closer.A nice trick to show the true love of the two is by not getting them together for sex during the plot. Alongside the wife of Frank, who immediately states "It is worse!" when Frank tells her, that he didn't even sleep with Molly. She seems to feel what is going on.Harmless. The audience is not presented with big drama, no sex, no cruelty. Which renders this movie better. Harmless, because the audience knows almost exactly what goes o

## Conclusion

> We can observe that some of the reviews may be misclassified because they contain words that are absent from the vocabulary. As we train more and tune the hyperparameters, we may be able to achieve better accuracy.