# Assignment 3 on Natural Language Processing

## Date : 30th Sept, 2020

### Instructor : Prof. Sudeshna Sarkar

### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

The central idea of this assignment is to use Naive Bayes classifier and LSTM based classifier and compare the models by accuracy on IMDB dataset.



Please submit with outputs. 

In [1]:
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report

In [2]:
#Load the IMDB dataset. You can load it using pandas as dataframe
df = pd.read_csv("IMDB Dataset.csv")

# Preprocessing
PrePrecessing that needs to be done on lower cased corpus

1. Remove html tags
2. Remove URLS
3. Remove non alphanumeric character
4. Remove Stopwords
5. Perform stemming and lemmatization

You can use regex from re. 

In [3]:
# Imports
import nltk
from tqdm import tqdm

nltk.download('punkt') # For tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
from nltk.corpus import stopwords # For stop-words
stopwords = stopwords.words('english')

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer() 

from nltk.stem import PorterStemmer
ps = PorterStemmer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Functions
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def full_preprocess(sent):
    # Remove HTML tags.
    sent = cleanhtml(sent)
    # Remove URLs
    sent = re.sub(r'^https?:\/\/.*[\r\n]*', '', sent, flags=re.MULTILINE)
    # Remove Non-alphanumeric characters
    sent = re.sub(r'[^A-Za-z0-9 ]+', '', sent)
    # Convert to Lower-Case
    sent = sent.lower()
    # Tokenize and Remove Stopwords
    tokens = word_tokenize(sent)
    tokens_wsw = [w for w in tokens if w not in stopwords and w != '']
    # Stemming
    ps_tokens = [ps.stem(w) for w in tokens_wsw]
    # Lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(w) for w in tokens_wsw]
    
    # We used the lemmatized tokens
    return lemmatized_tokens

df['review'] = [full_preprocess(x) for x in tqdm(df['review'])]

100%|██████████| 50000/50000 [02:59<00:00, 278.55it/s]


In [5]:
# Print Statistics of Data like avg length of sentence , proposition of data w.r.t class labels

# Data distribution w.r.t labels
print(df.groupby('sentiment').count())

# Average Length of sentence
review_len = df['review'].apply(lambda x: len(x))
print("\nAverage length of review : {}".format(np.mean(review_len)))
print("Max length of review : {}".format(np.max(review_len)))
print("Min length of review : {}".format(np.min(review_len)))
print("Standard dev of length of review : {}".format(np.std(review_len)))

           review
sentiment        
negative    25000
positive    25000

Average length of review : 119.76948
Max length of review : 1429
Min length of review : 0
Standard dev of length of review : 89.97188950182776


# Naive Bayes classifier

In [6]:
# get reviews column from df
reviews = df['review']

# get labels column from df
labels = df['sentiment']

In [7]:
# Use label encoder to encode labels. Convert to 0/1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print(encoder.classes_)

['negative' 'positive']


In [8]:
# Split the data into train and test (80% - 20%). 
# Use stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.

train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels, test_size=0.2, stratify=encoded_labels)

Here there are two approaches possible for building vocabulary for the naive Bayes.
1. Take the whole data (train + test) to build the vocab. In this way while testing there is no word which will be out of vocabulary.
2. Take the train data to build vocab. In this case, some words from the test set may not be in vocab and hence one needs to perform smoothing so that one the probability term is not zero.
 
You are supposed to go by the 2nd approach.
 
Also building vocab by taking all words in the train set is memory intensive, hence you are required to build vocab by choosing the top 2000 - 3000 frequent words in the training corpus.

> $ P(x_i | w_j) = \frac{ N_{x_i,w_j}\, +\, \alpha }{ N_{w_j}\, +\, \alpha*d} $


$N_{x_i,w_j}$ : Number of times feature $x_i$ appears in samples of class $w_j$

$N_{w_j}$ : Total count of features in class $w_j$

$\alpha$ : Parameter for additive smoothing. Here consider $\alpha$ = 1

$d$ : Dimentionality of the feature vector  $x = [x_1,x_2,...,x_d]$. In our case its the vocab size.






In [9]:
from sklearn.feature_extraction.text import CountVectorizer
# Use Count vectorizer to get frequency of the words
'''
max_features parameter : If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
vec = CountVectorizer(max_features = 3000)
X = vec.fit_transform(Sentence_list)
'''
def dummy(doc):
    return doc

MAX_FEATS = 3000
train_vec = CountVectorizer(tokenizer=dummy, preprocessor=dummy, max_features = MAX_FEATS)
X_train = train_vec.fit_transform(train_sentences)
n_tokens = X_train.toarray().sum(axis=0)

# Dict to store frequency of each feature in train
vocab = train_vec.get_feature_names()
vocab_f = {x:n for x, n in zip(vocab, n_tokens)}

In [10]:
# Use laplace smoothing for words in test set not present in vocab of train set
test_vec = CountVectorizer(tokenizer=dummy, preprocessor=dummy, max_features = None)
X_test = test_vec.fit_transform(test_sentences)

# Words in test set not present in train set
UNK_test = [w for w in test_vec.get_feature_names() if w not in vocab]
print("Percentage of test tokens NOT in Vocab = {}%".format(100*len(UNK_test)/len(test_vec.get_feature_names())))

Percentage of test tokens NOT in Vocab = 96.00989545926103%


In [11]:
# Build the model. Don't use the model from sklearn

# Collecting same class object in training set
train_0 = {key: 0 for key in vocab_f}
train_1 = {key: 0 for key in vocab_f}

for doc, label in tqdm(zip(train_sentences, train_labels), total = len(train_labels)):
    if label:
        for x in doc:
            if x in vocab:
                train_1[x] += 1
    else:
        for x in doc:
            if x in vocab:
                train_0[x] += 1

total_count_1 = sum(train_1.values())
total_count_0 = sum(train_0.values())
vocab_size = len(vocab_f)

100%|██████████| 40000/40000 [01:59<00:00, 333.56it/s]


In [12]:
# Prediction Function
def get_preds(test_sentence):
    alpha = 1
    p_1 = 1
    p_0 = 1
    for x in test_sentence:
        # For Label 1
        if x in train_1.keys():
            N = train_1[x]
        else:
            N = 0
        p_x1 = (N+alpha)/(total_count_1+alpha*vocab_size)
        p_1 *= p_x1
  
        # For Label 0
        if x in train_0.keys():
            N = train_0[x]
        else:
            N = 0
        p_x0 = (N+alpha)/(total_count_0+alpha*vocab_size)
        p_0 *= p_x0

        # To prevent very small
        if(p_0 < 1e-100 or p_1 < 1e-100):
            p_0 *= 1e100
            p_1 *= 1e100

    return p_1/(p_1+p_0)

In [13]:
# Test the model on test set and report Accuracy
test_preds = []
for sample in tqdm(test_sentences):
    proba = get_preds(sample)
    if proba >= 0.5:
        test_preds.append(1)
    else:
        test_preds.append(0) 

print("Performance of Naive Bayes Classifier - \n")
print(classification_report(test_labels, test_preds))

100%|██████████| 10000/10000 [00:00<00:00, 12303.33it/s]

Performance of Naive Bayes Classifier - 

              precision    recall  f1-score   support

           0       0.85      0.84      0.84      5000
           1       0.84      0.85      0.84      5000

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000






# *LSTM* based Classifier

Use the above train and test splits.

In [14]:
# Hyperparameters of the model
vocab_size = vocab_size # choose based on statistics
oov_tok = '<OOK>'
embedding_dim = 100
max_length = 1430 # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'

In [15]:
# tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)

# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

In [16]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1430, 100)         300000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 24)                3096      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 387,601
Trainable params: 387,601
Non-trainable params: 0
_________________________________________________________________


In [17]:
num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [18]:
# Calculate accuracy on Test data

# Get probabilities
prediction = model.predict(test_padded)

# Get labels based on probability 1 if p>= 0.5 else 0
test_preds_lstm = np.int32(prediction >= 0.5)

# Accuracy : one can use classification_report from sklearn
print("Performance of LSTM Classifier - \n")
print(classification_report(test_labels, test_preds_lstm))

Performance of LSTM Classifier - 

              precision    recall  f1-score   support

           0       0.87      0.87      0.87      5000
           1       0.87      0.87      0.87      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



## Get predictions for random examples

In [19]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]

# LSTM Classifier - 
# convert to a sequence
sequences = tokenizer.texts_to_sequences([full_preprocess(sent) for sent in sentence])

# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)

# Get probabilities
probas = model.predict(padded)
print("LSTM Classifier predicted probabilities - ")
print([p[0] for p in probas])

# Get labels based on probability 1 if p>= 0.5 else 0
preds = np.int32(probas >= 0.5)

label_map = {1: "positive", 0: "negative"}
print("\nLSTM classifier predictions - ")
labels = [label_map[p[0]] for p in preds]
print(labels)

LSTM Classifier predicted probabilities - 
[0.92212844, 0.09588553, 0.045102824]

LSTM classifier predictions - 
['positive', 'negative', 'negative']


In [20]:
# Naive Bayes Classifier - 
# convert to a sequence
sequences = [full_preprocess(sent) for sent in sentence]

# Get probabilities
probas = np.float32([get_preds(seq) for seq in sequences])
print("Naive-Bayes Classifier predicted probabilities - ")
print(probas)

# Get labels based on probability 1 if p>= 0.5 else 0
preds = np.int32(probas >= 0.5)

label_map = {1: "positive", 0: "negative"}
print("\nNaive-Bayes classifier predictions - ")
labels = [label_map[p] for p in preds]
print(labels)

Naive-Bayes Classifier predicted probabilities - 
[0.89637595 0.08451005 0.04191791]

Naive-Bayes classifier predictions - 
['positive', 'negative', 'negative']
