# Assignment 3 on Natural Language Processing

## Date : 30th Sept, 2020

### Instructor : Prof. Sudeshna Sarkar

### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

The central idea of this assignment is to build a Naive Bayes classifier and an LSTM-based classifier, and to compare the two models by accuracy on the IMDB dataset.



Please submit with outputs. 

In [167]:
import re
import pandas as pd
import numpy as np
import keras
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from sklearn.metrics import classification_report

In [168]:
#Load the IMDB dataset. You can load it using pandas as dataframe
df=pd.read_csv("IMDB Dataset.csv")
#df=ddf[0:50]
print(df[0:5])

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


# Preprocessing
Preprocessing that needs to be done on the lower-cased corpus:

1. Remove HTML tags
2. Remove URLs
3. Remove non-alphanumeric characters
4. Remove stopwords
5. Perform stemming and lemmatization

You can use regex from re. 

In [169]:
#imports
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
stopwords=set(stopwords.words("english"))  # a set gives fast membership tests in preprocess()
from nltk.stem import SnowballStemmer
ss=SnowballStemmer("english")
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # The lemmatizer relies on WordNet's built-in morphy function.
lemmatizer=WordNetLemmatizer()

[nltk_data] Downloading package wordnet to C:\Users\Aparna
[nltk_data]     Sakshi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [170]:
#removing html tags
def remove_html_tags(text):     
    clean = re.compile('<.*?>')    
    return re.sub(clean, '', text)
#removing urls
def remove_URLS(text):
    return re.sub(r'http\S+', '', text)

df['clean_web']=[remove_URLS(remove_html_tags(text)) for text in df['review']]
#print(df[0:5])

#remove non-alphanumeric characters and stopwords, then lemmatize and stem (Snowball stemmer)
def preprocess(text):
    #keep only lower-case alphanumeric tokens
    tokenizer = nltk.RegexpTokenizer(r"[a-z0-9]+")
    preprocessed_text=[]
    for sentence in sent_tokenize(text):
        for word in tokenizer.tokenize(sentence.lower()):
            if word not in stopwords:
                #lemmatized word -> lw
                lw=lemmatizer.lemmatize(word)
                #stem lw -> slw: stem the lemmatized word
                slw=ss.stem(lw)
                preprocessed_text.append(slw)
    #return the tokens joined into a single space-separated string
    return " ".join(preprocessed_text)
            
df['preprocessed_review']=[preprocess(text) for text in df['clean_web']]
df['review_len']=[len(processed_list.split()) for processed_list in df['preprocessed_review']]

df.drop(['review','clean_web'], axis=1, inplace= True)
print(df[0:5])
    
df.columns 

  sentiment                                preprocessed_review  review_len
0  positive  one review mention watch 1 oz episod hook righ...         163
1  positive  wonder littl product film techniqu unassum old...          86
2  positive  thought wonder way spend time hot summer weeke...          85
3  negative  basic famili littl boy jake think zombi closet...          66
4  positive  petter mattei love time money visual stun film...         125


Index(['sentiment', 'preprocessed_review', 'review_len'], dtype='object')

In [171]:
# Print statistics of the data, such as the average review length and the proportion of data per class label
print("avg len of review:")
print(df['review_len'].agg(['average']))
print("__________________________________________")
print("count of each label:")
df['sentiment'].value_counts()

avg len of review:
average    119.58238
Name: review_len, dtype: float64
__________________________________________
count of each label:


negative    25000
positive    25000
Name: sentiment, dtype: int64

# Naive Bayes classifier

In [172]:
# get reviews column from df
reviews = df['preprocessed_review']

# get labels column from df
labels = df['sentiment']

In [173]:
# Use label encoder to encode labels. Convert to 0/1
# LabelEncoder sorts the class names, so negative -> 0 and positive -> 1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

#print(encoder.classes_)
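
The integer code each sentiment receives matters when reading the later cells. As a minimal sketch, assuming only that `LabelEncoder` assigns codes in sorted order of the class names, the mapping can be reproduced in plain Python. The names `toy_labels`, `classes`, and `class_to_id` are illustrative and do not appear in the notebook.

```python
# Pure-Python illustration of sorted-order label encoding.
# This is a stand-in for LabelEncoder, not its implementation.
toy_labels = ["positive", "positive", "negative", "positive", "negative"]

# Classes are the sorted unique label values.
classes = sorted(set(toy_labels))

# Each class name maps to its position in the sorted list.
class_to_id = {name: idx for idx, name in enumerate(classes)}
encoded = [class_to_id[name] for name in toy_labels]

print(classes)   # ['negative', 'positive']
print(encoded)   # [1, 1, 0, 1, 0]
```

Under this ordering, label 0 corresponds to negative reviews and label 1 to positive reviews.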

In [174]:
# Split the data into train and test (80% - 20%). 
# Use stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels,stratify=encoded_labels, test_size=0.2)



There are two possible approaches to building the vocabulary for Naive Bayes:
1. Use the whole data (train + test) to build the vocab. This way, no word encountered during testing is out of vocabulary.
2. Use only the train data to build the vocab. In this case, some words from the test set may not be in the vocab, so smoothing is needed to keep their probability terms from being zero.
 
You are supposed to follow the 2nd approach.
 
Building the vocab from all words in the train set is also memory intensive, so you are required to build the vocab from the 2000 - 3000 most frequent words in the training corpus.
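
A minimal sketch of the frequency-capped vocabulary, assuming whitespace-separated preprocessed reviews. The helper `build_vocab` and the toy corpus are hypothetical and only illustrate the idea. `CountVectorizer(max_features=...)` applies its own tokenization and tie-breaking, so its result can differ in detail.

```python
from collections import Counter

def build_vocab(train_docs, max_features):
    """Return the max_features most frequent tokens in train_docs."""
    counts = Counter()
    for doc in train_docs:
        counts.update(doc.split())
    return [word for word, _ in counts.most_common(max_features)]

# Toy training corpus (already preprocessed); the test data is never consulted.
toy_train = ["good movi good act", "bad movi bad plot", "good plot"]
toy_vocab = build_vocab(toy_train, max_features=3)
print(toy_vocab)
```

Any test-set word missing from this list is out of vocabulary and is handled by smoothing.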

> $ P(x_i \mid w_j) = \frac{ N_{x_i,w_j}\, +\, \alpha }{ N_{w_j}\, +\, \alpha \cdot d} $


$N_{x_i,w_j}$ : Number of times feature $x_i$ appears in samples of class $w_j$

$N_{w_j}$ : Total count of features in class $w_j$

$\alpha$ : Parameter for additive smoothing. Here consider $\alpha$ = 1

$d$ : Dimensionality of the feature vector  $x = [x_1,x_2,...,x_d]$. In our case it is the vocab size.
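
As a worked example of the smoothed estimate, here is a small sketch with made-up counts. The function `smoothed_prob`, the dictionary `toy_counts`, and the toy vocab size are assumptions for illustration only.

```python
def smoothed_prob(word, class_counts, vocab_size, alpha=1):
    """P(word | class) with additive smoothing."""
    total = sum(class_counts.values())   # N_{w_j}: total feature count in the class
    count = class_counts.get(word, 0)    # N_{x_i,w_j}: 0 if the word was never seen
    return (count + alpha) / (total + alpha * vocab_size)

# Counts of vocab words in one class; the other two vocab words have count 0.
toy_counts = {"good": 3, "plot": 1}
d = 4  # assumed vocab size for this toy example

print(smoothed_prob("good", toy_counts, d))    # (3 + 1) / (4 + 4) = 0.5
print(smoothed_prob("unseen", toy_counts, d))  # (0 + 1) / (4 + 4) = 0.125
```

With $\alpha = 1$, a word that never occurs in a class still receives a small non-zero probability, so a single unseen word does not zero out the class score.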






In [179]:
from sklearn.feature_extraction.text import CountVectorizer
# Use Count vectorizer to get frequency of the words

'''
max_features parameter : If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
'''
vec = CountVectorizer(max_features = 2000)
X=vec.fit_transform(train_sentences)
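# On scikit-learn >= 1.0, use list(vec.get_feature_names_out()); get_feature_names() was removed in 1.2
vocab = vec.get_feature_names()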
print(X[0].shape[1])


2000


In [182]:
#d=V=vocab size
V=len(vocab)  
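#N(xi,wj): per-class word counts
#Note: LabelEncoder maps negative->0 and positive->1, so "Npos" holds the counts
#for label 0 and "Nneg" the counts for label 1. P() and predict() pair the names
#with the labels in the same way, so the predictions are unaffected.
Npos={vocab[x_i]:0 for x_i in range(V)}
Nneg={vocab[x_i]:0 for x_i in range(V)}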

#ndarray = X.toarray()
#listOflist = ndarray.tolist()
#print(type(X))
ndarray = X.toarray()
listX = ndarray.tolist()
#print(listX)

for d_i, doc in enumerate(listX):  
    for x_i,count_i in enumerate(doc):        
        if train_labels[d_i]==0:        
            Npos[vocab[x_i]]+=count_i
        else:
            Nneg[vocab[x_i]]+=count_i
        
#N(wj)
Nw_pos=sum(Npos.values())
Nw_neg=sum(Nneg.values())

#alpha=1
      
        

In [183]:
# Use Laplace smoothing for words in the test set that are not present in the train vocab
#P(xi|wj): returns the smoothed value for any word, whether or not it is in the vocab
#smoothed probability of an out-of-vocab word for label 0 and label 1
print((0+1)/(Nw_pos+V))
print((0+1)/(Nw_neg+V))
def P(word,label):
    #words outside the vocab get a count of 0, so only the smoothing term remains
    if label == 0:
        return (Npos.get(word,0)+1)/(Nw_pos+V)
    else:
        return (Nneg.get(word,0)+1)/(Nw_neg+V)


5.39626675473374e-07
5.399751719415942e-07


In [186]:
# Build the model. Don't use the model from sklearn
import math
#predict: returns True if the predicted label matches the given label
def predict(text, label):
    #pre process text
    #(the test sentences have already been through preprocess(), so this second
    # pass is largely redundant for them, but it lets predict() accept raw reviews)
    clean_tags=remove_URLS(remove_html_tags(text))
    clean_text=preprocess(clean_tags)   
    #sum log-probabilities instead of multiplying probabilities to avoid underflow
    #the prior of each class is 0.5, so the priors cancel out and are omitted
    log_p0=0
    log_p1=0
    for word in clean_text.split():
        log_p0+=math.log10(P(word,0))
        log_p1+=math.log10(P(word,1))       
    #ties go to label 1
    predicted_label=0 if log_p0>log_p1 else 1
    return predicted_label==label



In [187]:
# Test the model on test set and report Accuracy
val=0
accuracy=0
for text,label in  zip(test_sentences, test_labels):
    val+=1
    if predict(text, label):
        accuracy+=1
        
print("Accuracy:",accuracy/val)   




Accuracy: 0.8361
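
The `predict` function adds log-probabilities rather than multiplying the raw probabilities. The sketch below, using hypothetical per-word likelihoods, shows why: the product of many small numbers underflows to zero in floating point, while the sum of their logarithms stays finite and can still be compared across classes.

```python
import math

# 1000 hypothetical per-word likelihoods, each equal to 1e-4
word_probs = [1e-4] * 1000

product = math.prod(word_probs)                    # underflows to 0.0
log_sum = sum(math.log10(p) for p in word_probs)   # roughly -4000

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing the log-sums of two classes selects the same class as comparing the products would, if the products could be represented.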


# *LSTM* based Classifier

Use the above train and test splits.

In [None]:
# Hyperparameters of the model
vocab_size = # choose based on statistics
oov_tok = '<OOV>'
embedding_dim = 100
max_length = # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'
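
One possible way to choose `max_length` from the statistics is to take a high percentile of the preprocessed review lengths, so that most reviews fit without truncation while padding stays moderate. The sketch below is only an illustration: the 90th percentile, the helper `percentile`, and the toy lengths are assumptions, not values prescribed by the assignment. In the notebook the lengths would come from `df['review_len']`.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical token counts per review
toy_lengths = [40, 66, 85, 86, 90, 110, 125, 163, 200, 480]

suggested_max_length = percentile(toy_lengths, 90)
print(suggested_max_length)  # 200
```

A similar frequency-based check (how much of the corpus the most frequent words cover) could guide `vocab_size`.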

In [None]:
# tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding=padding_type, truncating=trunc_type, maxlen=max_length)

# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding=padding_type, truncating=trunc_type, maxlen=max_length)

In [None]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

In [None]:
num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

In [None]:
# Calculate accuracy on Test data
'''
prediction = model.predict(test_padded)

'''
# Get probabilities


# Get labels based on probability 1 if p>= 0.5 else 0


# Accuracy : one can use classification_report from sklearn
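
The thresholding and accuracy steps outlined in the comments above can be sketched in plain Python. The probabilities here are made up and shaped like the `(n_samples, 1)` output of a sigmoid layer. The names `toy_probs`, `toy_true`, and `toy_pred` are hypothetical.

```python
# Hypothetical sigmoid outputs, one single-element row per review
toy_probs = [[0.91], [0.12], [0.50], [0.49]]
toy_true = [1, 0, 0, 0]

# Label is 1 if p >= 0.5, else 0
toy_pred = [1 if row[0] >= 0.5 else 0 for row in toy_probs]

# Accuracy is the fraction of predictions that match the true labels
correct = sum(p == t for p, t in zip(toy_pred, toy_true))
toy_accuracy = correct / len(toy_true)

print(toy_pred)      # [1, 0, 1, 0]
print(toy_accuracy)  # 0.75
```

In the notebook, the predicted and true label arrays could then be passed to `classification_report` for per-class precision and recall.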

## Get predictions for random examples

In [None]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]

# convert to a sequence
sequences = 

# pad the sequence
padded = 

# Get probabilities
print(model.predict(padded))

# Get labels based on probability 1 if p>= 0.5 else 0
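
The tokenizer was fitted on preprocessed training reviews, so new sentences would need to go through the same `preprocess` step before being converted. The sketch below is a simplified pure-Python stand-in for the convert-and-pad steps, not the Keras implementation. The helpers `to_sequence` and `pad_post` and the index `toy_index` are hypothetical.

```python
def to_sequence(text, word_index, oov_id=1):
    """Map each lower-cased token to its index, or to oov_id if it is unknown."""
    return [word_index.get(tok, oov_id) for tok in text.lower().split()]

def pad_post(seq, max_length, pad_value=0):
    """Truncate at the end, then pad with pad_value at the end."""
    seq = seq[:max_length]
    return seq + [pad_value] * (max_length - len(seq))

# Hypothetical index: 0 is reserved for padding, 1 for out-of-vocabulary tokens
toy_index = {"movi": 2, "terribl": 3, "good": 4, "act": 5}

toy_seq = to_sequence("movi plot terribl good act", toy_index)
toy_padded = pad_post(toy_seq, max_length=8)

print(toy_seq)     # [2, 1, 3, 4, 5]
print(toy_padded)  # [2, 1, 3, 4, 5, 0, 0, 0]
```

Every padded row has the same length, which is what the embedding layer expects as input.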


