# Natural Language Processing

## Ruthu S Sanketh

### NB and LSTM based classifiers

The central idea of this tutorial is to use Naive Bayes classifier and LSTM based classifier and compare the models by accuracy on IMDB dataset.



In [1]:
import pandas as pd
import numpy as np
import nltk, keras, string, re, html, math

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter, defaultdict
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import accuracy_score, classification_report

In [2]:
#Load the IMDB dataset. We load it using pandas as dataframe
data = pd.read_csv('/Users/ruthu/Desktop/IMDB Dataset.csv') 
print("Data shape - ", data.shape, "\n")                                  #prints the number of rows and columns

for col in data.columns:
    print("The number of null values - ", col, data[col].isnull().sum())   #prints the number of null values in each column

data["review"]= data["review"].str.lower() 
data["sentiment"]= data["sentiment"].str.lower()             #converts every value in the column to lowercase
data.head()

Data shape -  (50000, 2) 

The number of null values -  review 0
The number of null values -  sentiment 0


Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


# Preprocessing
Pre-precessing that needs to be done on lower cased corpus - 

1. Removal of html tags
2. Removal of  URLS
3. Removal of non alphanumeric character
4. Removal of Stopwords
5. Performing stemming and lemmatization

We use regex from re. 

In [3]:
def cleaning(data):
    clean = re.sub('<.*?>', ' ', str(data))            #removes HTML tags
    clean = re.sub('\'.*?\s',' ', clean)               #removes all hanging letters afer apostrophes (s in it's)
    clean = re.sub(r'http\S+',' ', clean)              #removes URLs
    clean = re.sub('\W+',' ', clean)                   #replacing the non alphanumeric characters
    return html.unescape(clean)
data['cleaned'] = data['review'].apply(cleaning)


def tokenizing(data):
    review = data['cleaned']                            #tokenizing is done
    tokens = nltk.word_tokenize(review)
    return tokens
data['tokens'] = data.apply(tokenizing, axis=1)


stop_words = set(stopwords.words('english'))
def remove_stops(data):
    my_list = data['tokens']
    meaningful_words = [w for w in my_list if not w in stop_words]           #stopwords are removed from the tokenized data
    return (meaningful_words)
data['tokens'] = data.apply(remove_stops, axis=1)


lemmatizer = WordNetLemmatizer()
def lemmatizing(data):
    my_list = data['tokens']
    lemmatized_list = [lemmatizer.lemmatize(word) for word in my_list]    #lemmatizing is performed. It's more efficient and better than stemming.
    return (lemmatized_list)
data['tokens'] = data.apply(lemmatizing, axis=1)

def rejoin_words(data):
    my_list = data['tokens']
    joined_words = ( " ".join(my_list))                     #rejoins all stemmed words
    return joined_words
data['cleaned'] = data.apply(rejoin_words, axis=1)

data.head()

Unnamed: 0,review,sentiment,cleaned,tokens
0,one of the other reviewers has mentioned that ...,positive,one reviewer mentioned watching 1 oz episode h...,"[one, reviewer, mentioned, watching, 1, oz, ep..."
1,a wonderful little production. <br /><br />the...,positive,wonderful little production filming technique ...,"[wonderful, little, production, filming, techn..."
2,i thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...,"[thought, wonderful, way, spend, time, hot, su..."
3,basically there's a family where a little boy ...,negative,basically family little boy jake think zombie ...,"[basically, family, little, boy, jake, think, ..."
4,"petter mattei's ""love in the time of money"" is...",positive,petter mattei love time money visually stunnin...,"[petter, mattei, love, time, money, visually, ..."


In [4]:
# Prints statistics of Data like avg length of sentence , proportion of data w.r.t class labels
def sents(data):
    clean = re.sub('<.*?>', ' ', str(data))            #removes HTML tags
    clean = re.sub('\'.*?\s',' ', clean)               #removes all hanging letters afer apostrophes (s in it's)
    clean = re.sub(r'http\S+',' ', clean)              #removes URLs
    clean = re.sub('[^a-zA-Z0-9\.]+',' ', clean)       #removes all non-alphanumeric characters except periods.
    tokens = nltk.sent_tokenize(clean)                 #sentence tokenizing is done
    return tokens
sents = data['review'].apply(sents)

length_s = 0
for i in range(data.shape[0]):
    length_s+= len(sents[i])
print("The number of sentences is - ", length_s)          #prints the number of sentences

length_t = 0
for i in range(data.shape[0]):
    length_t+= len(data['tokens'][i])
print("\nThe number of tokens is - ", length_t)           #prints the number of tokens

average_tokens = round(length_t/length_s)
print("\nThe average number of tokens per sentence is - ", average_tokens) #prints the average number of tokens per sentence

positive = negative = 0
for i in range(data.shape[0]):
    if (data['sentiment'][i]=='positive'):
        positive += 1                           #finds the proprtion of positive and negative sentiments
    else:
        negative += 1

print("\nThe number of positive examples are - ", positive)
print("\nThe number of negative examples are - ", negative)
print("\nThe proportion of positive sentiments to negative ones are - ", positive/negative)
      

The number of sentences is -  544935

The number of tokens is -  5961690

The average number of tokens per sentence is -  11

The number of positive examples are -  25000

The number of negative examples are -  25000

The proportion of positive sentiments to negative ones are -  1.0


# Naive Bayes classifier

In [5]:
# gets reviews column from df
reviews = data['cleaned'].values

# gets labels column from df
labels = data['sentiment'].values

In [6]:
# Uses label encoder to encode labels. Convert to 0/1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
data['encoded']= encoded_labels
print(data['encoded'].head())

# prints(enc.classes_)
encoder_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print("\nThe encoded classes are - ", encoder_mapping)

labels = data['encoded']

0    1
1    1
2    1
3    0
4    1
Name: encoded, dtype: int32

The encoded classes are -  {'negative': 0, 'positive': 1}


In [7]:
# Splits the data into train and test (80% - 20%). 
# Uses stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, labels, test_size=0.2, random_state=42, stratify=labels)

# train_sentences, test_sentences, train_labels, test_labels
print("The training sentences are -",train_sentences, sep='\n\n')
print("\nThe test sentences are -",test_sentences, sep='\n\n')
print("\nThe training labels are -",train_labels, sep='\n\n')
print("\nThe test labels are -",test_labels, sep='\n\n')

The training sentences are -

 'believe let movie accomplish favor friend ask early april 14 2007 movie certainly pain as theater sickly boring even felt gory impact daunting scene deem complete failure attract audience worst even trampled cause friend failed come time theater busy assisting boyfriend looking appropriate lodge stay one night really disappointed matter movie matter indeed poor plot useless storyline naively created know say anymore title suggest anyway creep horror failed overture u viewer maybe beating animal could get creep show theater real situational play good luck anyone attempt watch anyway'
 'spoiler alert get nerve people remake use term loosely good movie american version dutch thriller someone decided original ending pasteurized enough american audience create new one stupid improbable pretend kind ending favor get original one'
 ...
 'waste time danger watch tempted tear dvd wall heave thru window amateur production terrible repetitive vacuous dialog paper t

There are two approaches possible for building vocabulary for the Naive Bayes classifier.
1. We take the whole data (train + test) to build the vocab. In this way while testing there is no word which will be out of vocabulary.
2. We take the train data to build vocab. In this case, some words from the test set may not be in vocab and hence one needs to perform smoothing so that one of the probability terms are not zero.
 
We use the 2nd approach.
 
Also, building vocab by taking all words in the train set is memory intensive, hence we build the vocab by choosing the top 2000 - 3000 frequent words in the training corpus.

> $ P(x_i | w_j) = \frac{ N_{x_i,w_j}\, +\, \alpha }{ N_{w_j}\, +\, \alpha*d} $


$N_{x_i,w_j}$ : Number of times feature $x_i$ appears in samples of class $w_j$

$N_{w_j}$ : Total count of features in class $w_j$

$\alpha$ : Parameter for additive smoothing. Here consider $\alpha$ = 1

$d$ : Dimentionality of the feature vector  $x = [x_1,x_2,...,x_d]$. In our case its the vocab size.

In [9]:
# Uses Count vectorizer to get frequency of the words
vectorizer = CountVectorizer(max_features = 3000)

sents_encoded = vectorizer.fit_transform(train_sentences)         #encodes all training sentences
counts = sents_encoded.sum(axis=0).A1
vocab = list(vectorizer.get_feature_names())

In [11]:
# Builds the model.
# Uses laplace smoothing for words in test set not present in vocab of train set
class MultinomialNaiveBayes:
  
    def __init__(self, classes, tokenizer):
      #self.tokenizer = tokenizer
      self.classes = classes
      
    def group_by_class(self, X, y):
      data = dict()
      for c in self.classes:                            #grouping by positive and negative sentiments
        data[c] = X[np.where(y == c)]
      return data
           
    def fit(self, X, y):
        self.n_class_items = {}
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = vocab                            #using the pre-made vocabulary of 3000 most frequent training words

        n = len(X)
        
        grouped_data = self.group_by_class(X, y)
        
        for c, data in grouped_data.items():
          self.n_class_items[c] = len(data)
          self.log_class_priors[c] = math.log(self.n_class_items[c] / n)   #taking log for easier calculation
          self.word_counts[c] = defaultdict(lambda: 0)
          
          for text in data:
            counts = Counter(nltk.word_tokenize(text))
            for word, count in counts.items():
                self.word_counts[c][word] += count
                
        return self
      
    def laplace_smoothing(self, word, text_class):          #smoothing
      num = self.word_counts[text_class][word] + 1
      denom = self.n_class_items[text_class] + len(self.vocab)
      return math.log(num / denom)
      
    def predict(self, X):
        result = []
        for text in X:
          
          class_scores = {c: self.log_class_priors[c] for c in self.classes}

          words = set(nltk.word_tokenize(text))
          for word in words:
              if word not in self.vocab: continue

              for c in self.classes:
                
                log_w_given_c = self.laplace_smoothing(word, c)
                class_scores[c] += log_w_given_c
                
          result.append(max(class_scores, key=class_scores.get))

        return result

In [14]:
MNB = MultinomialNaiveBayes(
    classes=np.unique(labels), 
    tokenizer=Tokenizer()
).fit(train_sentences, train_labels)

# Tests the model on test set and reports the Accuracy
predicted_labels = MNB.predict(test_sentences)
print("The accuracy of the MNB classifier is ", accuracy_score(test_labels, predicted_labels))
print("\nThe classification report with metrics - \n", classification_report(test_labels, predicted_labels))

The accuracy of the MNB classifier is  0.8533

The classification report with metrics - 
               precision    recall  f1-score   support

           0       0.86      0.85      0.85      5000
           1       0.85      0.86      0.85      5000

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



# LSTM based Classifier

We use the above train and test splits.

In [15]:
# Hyperparameters of the model
oov_tok = '<OOK>'
embedding_dim = 100
max_length = 150
padding_type='post'
trunc_type='post'

# tokenizes sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_sentences)

# vocabulary size
word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1

# converts train dataset to sequence and pads sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)

# converts Test dataset to sequence and pads sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

In [16]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compiles model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 150, 100)          8295000   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 24)                3096      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 8,382,601
Trainable params: 8,382,601
Non-trainable params: 0
_________________________________________________________________


In [17]:
#training the model
num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [20]:
# Gets probabilities
prediction = model.predict(test_padded)
print("The probabilities are - ", prediction, sep='\n')

# Gets labels based on probability 1 if p>= 0.5 else 0
for each in prediction:
    if each[0] >=0.5:
        each[0] = 1
    else:
        each[0] = 0
prediction = prediction.astype('int32') 
print("\nThe labels are - ", prediction, sep='\n')

# Calculates accuracy on Test data
print("\nThe accuracy of the model is ", accuracy_score(test_labels, prediction))
print("\nThe accuracy and other metrics are \n", classification_report(test_labels, prediction, labels=[0, 1]),sep='\n')


The probabilities are - 
[[9.8351729e-01]
 [3.4398127e-01]
 [9.9961859e-01]
 ...
 [2.4896860e-04]
 [9.9665564e-01]
 [3.7850103e-01]]

The labels are - 
[[1]
 [0]
 [1]
 ...
 [0]
 [1]
 [0]]

The accuracy of the model is  0.8711

The accuracy and other metrics are 

              precision    recall  f1-score   support

           0       0.85      0.90      0.87      5000
           1       0.89      0.84      0.87      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



## To get predictions for random examples

In [21]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]

# converts to a sequence
test_sequences = tokenizer.texts_to_sequences(sentence)

# pads the sequence
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

# Gets probabilities
prediction = model.predict(test_padded)
print("The probabilities are - ", prediction, sep='\n')

# Gets labels based on probability 1 if p>= 0.5 else 0
for each in prediction:
    if each[0] >=0.5:
        each[0] = 1
    else:
        each[0] = 0
prediction = prediction.astype('int32') 
print("\nThe labels are - ", prediction, sep='\n')

The probabilities are - 
[[0.96641695]
 [0.03413102]
 [0.0519332 ]]

The labels are - 
[[1]
 [0]
 [0]]


### We see that the MNB classifier has an accuracy of around 85%, while the LSTM classifier has an accuracy of around 87%, and is hence the better classifier.