# **Natural Language Processing: Sentiment Prediction using Naive Bayes Classifier**

#### Instructor : Prof. Sudeshna Sarkar


Name: ANGANA MONDAL

Roll Number: 19IE10039

We will use Naive Bayes classifier and LSTM based classifier and compare the models by accuracy on IMDB dataset.  This dataset consists of 50k movie reviews (25k positive, 25k negative). Dataset has been taken from: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews



In [2]:
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report

In [3]:
#Loading the IMDB dataset.
df=pd.read_csv('IMDB Dataset.csv')

# Preprocessing
PrePrecessing that needs to be done on lower cased corpus

1. Removing html tags
2. Removing URLS
3. Removing non alphanumeric character
4. Removing Stopwords
5. Performing stemming and lemmatization

We use regex from re. 

In [5]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\angan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\angan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
#Preprocessing 

#lower case
df['review']=df['review'].str.lower()

#removing HTML tags
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
    return TAG_RE.sub(" ", text)
df['review']= df['review'].apply(remove_tags)

#removing URLs
def remove_urls(text):
    return re.sub(r'http\S+', '', text)
df['review']= df['review'].apply(remove_urls)

#removing non-alphanumeric characters
def remove_non_alphanum(text):
    pattern = r'[^A-Za-z0-9]+'
    text = re.sub(pattern, " ", text)
    return text
df['review']= df['review'].apply(remove_non_alphanum)

#removing stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

#Stemming the text
from nltk.stem import PorterStemmer
ps=PorterStemmer()
df['review'] = df.apply(lambda row: word_tokenize(row['review']),axis=1)
df['review'] = df['review'].apply(lambda text: [ps.stem(word) for word in text])

#Lemmatizing the text
from nltk.stem import WordNetLemmatizer
lem=WordNetLemmatizer()
df['review'] = df['review'].apply(lambda text: [lem.lemmatize(word, pos='v') for word in text])

df2=df
df2['review'] = df2['review'].apply(lambda text:' '.join([str(w) for w in text]))

#Now, df2 is the preprocessed dataframe.

In [9]:
# We print Statistics of Data like avg length of sentence , proposition of data w.r.t class labels

#Average length of sentence
num_reviews=len(df['review'])
total_word_count=0
for i in range(0, len(df2)):
    total_word_count = total_word_count + len(nltk.word_tokenize(df2.iloc[i]['review']))
print("Average Length of Sentence: ", total_word_count/num_reviews)

#proposition of data w.r.t class labels
num_pos = len(df2[df2['sentiment'] == 'positive'])
num_neg = len(df2[df2['sentiment'] == 'negative'])
print("Total number of reviews: ",num_reviews)
print("Number of positive reviews: ",num_pos)
print("Number of negative reviews: ",num_neg)
print("Thus, Ratio of positive to negative reviews are: ", num_pos/num_neg)

Average Length of Sentence:  119.6626
Total number of reviews:  50000
Number of positive reviews:  25000
Number of negative reviews:  25000
Thus, Ratio of positive to negative reviews are:  1.0


# Naive Bayes classifier

In [10]:
# we get reviews column from df
reviews = df2['review']

# we get labels column from df
labels = df2['sentiment']

In [11]:
# We use label encoder to encode labels. Convert to 0/1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print('The encoder classes are: ',encoder.classes_)
# print(enc.classes_)

The encoder classes are:  ['negative' 'positive']


In [12]:
# We split the data into train and test (80% - 20%). 
# We use stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.

x_train, x_test, y_train, y_test = train_test_split(reviews, encoded_labels, test_size=0.2, random_state=0, stratify=encoded_labels)
# train_sentences, test_sentences, train_labels, test_labels

Approach for building vocabulary for the naive Bayes.

We take the train data to build vocab. In this case, some words from the test set may not be in vocab and hence one needs to perform smoothing so that one the probability term is not zero.
 
Also building vocab by taking all words in the train set is memory intensive, hence we build vocab by choosing the top 2000 - 3000 frequent words in the training corpus.

> $ P(x_i | w_j) = \frac{ N_{x_i,w_j}\, +\, \alpha }{ N_{w_j}\, +\, \alpha*d} $


$N_{x_i,w_j}$ : Number of times feature $x_i$ appears in samples of class $w_j$

$N_{w_j}$ : Total count of features in class $w_j$

$\alpha$ : Parameter for additive smoothing. Here consider $\alpha$ = 1

$d$ : Dimentionality of the feature vector  $x = [x_1,x_2,...,x_d]$. In our case its the vocab size.






In [13]:
from sklearn.feature_extraction.text import CountVectorizer
# We use Count vectorizer to get frequency of the words

#max_features parameter : If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
vec = CountVectorizer(max_features = 3000)
X = vec.fit_transform (x_train) 


In [15]:
# We use laplace smoothing for words in test set not present in vocab of train set

d= len(vec.get_feature_names()) #total number of features, or word count in vocabulary
training_vocab=vec.get_feature_names() #list of features (words)

#finding prob(class is positive) and prob(class is negative)
Probability_pos = len(y_train[y_train==1])/len(y_train)
Probability_neg = len(y_train[y_train==0])/len(y_train)

#finding total number of words in positive class, and in negative class
list_words_pos = X.toarray()[np.where(y_train==1)].sum(axis=0)
total_words_pos = (list_words_pos).sum() #total number of words in 'positive' class label
list_words_neg = X.toarray()[np.where(y_train==0)].sum(axis=0)
total_words_neg = (list_words_neg).sum() #total number of words in 'negative' class label


#finding probability(word|positive class) and probability(word|negative class)
#for each word in training_vocab (the list of unique and most frequent words)
#with laplace smoothing.
#We make the prob_pos and prob_neg dictionary where key=word, value=probability
alpha=1

Prob_of_word_given_pos= {} #initialising dictionary
Prob_of_word_given_neg= {} #initialising dictionary

#for every word, we calculate probability(word|positive) and probability(word|negative), and enter them in the dicts.
for i in range(len(training_vocab)):
    word = training_vocab[i]
    word_count_in_pos = list_words_pos[i]
    word_count_in_neg = list_words_neg[i]
    #we use log probability to prevent underflow (because probability products may become very small)
    Prob_of_word_given_pos[word]=np.log((word_count_in_pos + alpha)/(total_words_pos + alpha * d))
    Prob_of_word_given_neg[word]=np.log((word_count_in_neg + alpha)/(total_words_neg + alpha * d))

In [16]:
# We build the model from scratch

#the following function reads a text input and returns 1, if the review is positive, else 0.
def NP_model(text_input):
    P=Probability_pos
    N=Probability_neg
    words = word_tokenize(text_input)
    
    for w in words:
        if w in Prob_of_word_given_pos.keys():
            P=P+Prob_of_word_given_pos[w]#we are adding because we use log probability
            N=N+Prob_of_word_given_neg[w]
        else:
            P=P+np.log((alpha)/(total_words_pos+alpha*d))#in this case, the word is not there in training_vocab
            N=N+np.log((alpha)/(total_words_neg+alpha*d))
    
    if(P>=N):
        return 1 #positive review predicted
    else:
        return 0 #negative review predicted
 

In [17]:
# We test the model on test set and report Accuracy

test_sentences_list = x_test.tolist()
correct = 0 #count of correct predictions
for i in range(len(test_sentences_list)):
    sentence=test_sentences_list[i]
    predicted = NP_model(sentence)
    #prediction is correct if it matches with the test label
    if(predicted==y_test[i]):
        correct=correct+1

print('Accuracy of Naive Bayes Model: ',(correct/len(test_sentences_list))*100, '%')

Accuracy of Naive Bayes Model:  84.41 %


# *LSTM* based Classifier

We use the above train test splits on an LSTM Classifier

In [18]:
# Hyperparameters of the model
vocab_size = 3000 #we assume our vocab size remains same as before
oov_tok = '<OOK>'
embedding_dim = 100
max_length = 200 # we choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'

In [20]:
train_sentences = x_train
test_sentences = x_test
train_labels = y_train
test_labels = y_test

# we tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# we convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)

# we convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

In [21]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 200, 100)          300000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 24)                3096      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 387,601
Trainable params: 387,601
Non-trainable params: 0
_________________________________________________________________


In [22]:
num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [23]:
# We calculate accuracy on Test data

# Getting probabilities
prediction = model.predict(test_padded)

# Getting labels based on probability 1 if p>= 0.5 else 0
prediction_probability = (prediction>=0.5).astype(int)

# Accuracy : using classification_report from sklearn
accuracy_lstm = classification_report(test_labels, prediction_probability, output_dict=True)['accuracy']
print('\nAccuracy of LSTM Model: ',accuracy_lstm,)
print('\n\n*************Classification Report*************')
print(classification_report(test_labels, prediction_probability))


Accuracy of LSTM Model:  0.8508


*************Classification Report*************
              precision    recall  f1-score   support

           0       0.91      0.78      0.84      5000
           1       0.81      0.92      0.86      5000

    accuracy                           0.85     10000
   macro avg       0.86      0.85      0.85     10000
weighted avg       0.86      0.85      0.85     10000



## Getting predictions for random examples

In [27]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]


#preprocessing the sentences

from nltk.stem import PorterStemmer
ps = PorterStemmer()
def stemming(text):
    words = word_tokenize(text)
    stemmed_list=[]
    for w in words:
        stemmed_list.append(ps.stem(w))
    return " ".join(stemmed_list)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatizing(text):
    words = word_tokenize(text)
    lemmatize_list=[]
    for w in words:
        lemmatize_list.append(lemmatizer.lemmatize(w))
    return " ".join(lemmatize_list)

from nltk.corpus import stopwords
stop = stopwords.words('english')

for i in range(len(sentence)):
    sentence[i]=sentence[i].lower();
    TAG_RE = re.compile(r'<[^>]+>')
    sentence[i]=TAG_RE.sub(" ", sentence[i])
    sentence[i]=re.sub(r'http\S+', '', sentence[i])
    pattern = r'[^A-Za-z0-9]+'
    sentence[i]= re.sub(pattern, " ", sentence[i])
    sentence[i] = ' '.join([word for word in sentence[i].split() if word not in (stop)])
    sentence[i] = stemming(sentence[i])
    sentence[i] = lemmatizing(sentence[i])

    
# converting to a sequence
sequences = tokenizer.texts_to_sequences(sentence)


# padding the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)


# Getting probabilities
print('Probabilities generated: \n', model.predict(padded))


# Getting labels based on probability 1 if p>= 0.5 else 0
predictions_generated = (model.predict(padded)>=0.5).astype(int)

# Printing the predictions made by the LSTM Model
print('\nThe sentiment predictions on the sentences are respectively:')
for val in predictions_generated:
    if val[0]==1:
        print('Positive Review')
    else:
        print('Negative Review')

Probabilities generated: 
 [[0.9584262 ]
 [0.11119056]
 [0.08764476]]

The sentiment predictions on the sentences are respectively:
Positive Review
Negative Review
Negative Review
