<h1><b>News Sentiment Analyzer using Bidirectional LSTM (BiLSTM) and Keras Embedding

<h4><b>Importing necessary packages

In [None]:
from nltk.corpus import stopwords
import string
import os
from collections import Counter
import numpy as np
import pickle
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout, Bidirectional, SpatialDropout1D

<b> We have a dataset named "Train ready Dataset" that contains "test" and "train" folders, each with "neg" and "pos" subfolders. In the "train" set, "neg" and "pos" each hold 778 .txt files; in the "test" set, each holds 40 .txt files. The text files are news excerpts from the BBC.

<h2><b>Loading and Cleaning the .txt files</b></h2>

<h4><b>load_file() function below opens the file in read-only mode and returns its text.

In [47]:
#Load doc into the memory
def load_file(filename):
    #open the file in read-only mode; the with block closes it automatically
    with open(filename, 'r') as file:
        #read all text
        text = file.read()
    return text

<h4><b>clean_file() function below converts the text into clean tokens by removing punctuation, non-alphabetic tokens, stop words, and short words.

In [48]:
#Convert the text into clean tokens
def clean_file(text):
    #split the text into tokens by whitespace
    tokens = text.split()
    #remove punctuation from each token
    table = str.maketrans('','', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    #remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    #remove the stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    #remove the short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

<h4><b>Example of loading and cleaning a document

In [49]:
#Load and clean the document
filename = "Train ready Dataset/train/neg/93.txt"
text = load_file(filename)
tokens = clean_file(text)
print(tokens)

['Water', 'firm', 'Suez', 'Argentina', 'row', 'conflict', 'Argentine', 'State', 'water', 'firm', 'Aguas', 'Argentinas', 'controlled', 'Frances', 'Suez', 'casting', 'doubt', 'firms', 'future', 'The', 'firm', 'serves', 'province', 'Buenos', 'Aires', 'wants', 'tariff', 'rise', 'fund', 'watersupply', 'improvements', 'The', 'government', 'rejected', 'rise', 'wants', 'Aguas', 'Argentinas', 'make', 'annual', 'investment', 'pesos', 'improvements', 'Planning', 'Minister', 'Julio', 'De', 'Vido', 'offered', 'State', 'help', 'free', 'Mr', 'De', 'Vido', 'said', 'Argentine', 'state', 'would', 'make', 'contribution', 'form', 'subsidy', 'He', 'said', 'contribution', 'could', 'made', 'return', 'seat', 'companys', 'board', 'He', 'added', 'government', 'discussions', 'Aguas', 'Argentinas', 'role', 'might', 'take', 'event', 'State', 'contribution', 'agreed', 'However', 'Aguas', 'Argentinas', 'told', 'Argentine', 'newspaper', 'Clarin', 'would', 'accept', 'change', 'legal', 'structure', 'practice', 'rules',
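
The output above still contains words such as 'The' and 'He'. NLTK's English stop-word list is lowercase, and clean_file() compares tokens case-sensitively, so capitalized stop words pass through. A minimal sketch of a case-insensitive check follows; the small STOP_WORDS set is a hand-written stand-in for stopwords.words('english'), and remove_stop_words() is a hypothetical helper, not part of the notebook.

```python
# Tiny stand-in for stopwords.words('english'); an assumption for illustration only.
STOP_WORDS = {"the", "he", "it", "in", "but", "we"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

sample = ["The", "firm", "serves", "the", "province", "He", "said"]
filtered = remove_stop_words(sample)
print(filtered)
```

Applying this to the real pipeline would change the vocabulary counts shown later, so it is offered only as an option.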

<h2><b>Define a Vocabulary</b></h2>

<h4><b>add_to_vocab() function loads a document, cleans it, and adds the resulting tokens to the vocabulary.

In [50]:
# load doc and add to vocab
def add_to_vocab(filename, vocab):
    #load  document
    doc = load_file(filename)
    #clean document
    tokens = clean_file(doc)
    #add tokens to the vocabulary
    vocab.update(tokens)

<h4><b>process_documents() function loads all the files in a directory and passes each file to the add_to_vocab() function.

In [51]:
#load all docs in a directory into the vocabulary
def process_documents(directory, vocab):
    #walk through all files in the folder
    for filename in os.listdir(directory):
        #create a full path of the file to open
        file_path = directory + '/' + filename
        #add doc to vocab
        add_to_vocab(file_path, vocab)   
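
os.listdir() returns every entry in the folder in arbitrary order, including any stray non-text files. A possible variant, sketched here under the assumption that only .txt files should be read, filters by extension and sorts the names; list_text_files() is a hypothetical helper and the demo runs on a temporary folder rather than the real dataset.

```python
import os
import tempfile

def list_text_files(directory):
    """Return the sorted full paths of the .txt files directly inside `directory`."""
    paths = []
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".txt"):
            paths.append(os.path.join(directory, filename))
    return paths

# Demonstration on a temporary folder instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2.txt", "1.txt", "notes.md"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("sample text")
    found = [os.path.basename(p) for p in list_text_files(tmp)]

print(found)
```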

<h4><b>Now we pass all the files in the training set to the process_documents() function and create a vocabulary.

In [52]:
#define the vocab as counter
vocab = Counter()
#add all docs to Vocab
process_documents('Train ready Dataset/train/neg', vocab)
process_documents('Train ready Dataset/train/pos', vocab)
#print the size of the vocab
print(len(vocab))
#print the top 50 most common words in the vocab
print(vocab.most_common(50))

29401
[('The', 5824), ('said', 5439), ('Mr', 2593), ('would', 1924), ('also', 1462), ('US', 1211), ('But', 1197), ('He', 1169), ('people', 1166), ('It', 1067), ('year', 1066), ('new', 1010), ('could', 977), ('one', 964), ('government', 909), ('years', 883), ('last', 816), ('In', 749), ('two', 749), ('first', 734), ('UK', 722), ('time', 721), ('told', 701), ('best', 697), ('We', 670), ('film', 656), ('Labour', 614), ('made', 578), ('election', 577), ('make', 566), ('BBC', 533), ('Blair', 523), ('get', 523), ('added', 507), ('number', 482), ('music', 481), ('next', 478), ('says', 476), ('three', 474), ('like', 466), ('take', 465), ('back', 457), ('say', 456), ('many', 451), ('public', 449), ('British', 432), ('set', 429), ('company', 428), ('way', 424), ('plans', 418)]


<h4><b>We can remove tokens that occur only rarely from the vocab, as below.

In [53]:
#keep tokens with a minimum occurrence
min_occurrence = 2
tokens = [tkn for tkn, count in vocab.items() if count >= min_occurrence]
print(len(tokens))

17850
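
The same filter can be checked on a toy Counter. The words and counts below are made up for illustration; only tokens seen at least min_count times survive.

```python
from collections import Counter

# Made-up tokens: 'film' appears twice, 'music' once, 'election' three times.
toy_vocab = Counter(["film", "film", "music", "election", "election", "election"])

min_count = 2
kept = [tok for tok, count in toy_vocab.items() if count >= min_count]
print(kept)
```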


<h4><b> We can save the vocabulary into a .txt file which can later be loaded.

In [54]:
#save the tokens to a file
def save_vocab(tokens, filename):
    #join the tokens into a single string, one token per line
    data = '\n'.join(tokens)
    #open the file in write mode; the with block closes it automatically
    with open(filename, 'w') as file:
        #write the text to the file
        file.write(data)
#save the tokens to the vocabulary file
save_vocab(tokens, 'vocab.txt')

<h2><b>Train Embedding Layer</b></h2>

<b>Word embedding refers to a set of language modeling and feature learning techniques in natural language processing in which words or phrases from the vocabulary are mapped to vectors of real numbers. Here we learn a word embedding with Keras as part of training a neural network on the classification problem.
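
Conceptually, an embedding layer is a lookup table: each integer word index selects one row of a matrix of real numbers. The sketch below illustrates that idea in plain Python with toy sizes and random values; it is not the Keras implementation, and embed() is a hypothetical helper.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab_size = 6      # toy value; the notebook derives this from the tokenizer
embedding_dim = 3   # toy value; the model later in this notebook uses 100

# One row of small random numbers per word index; training would adjust these values.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence, matrix):
    """Replace each integer index with its row of the embedding matrix."""
    return [matrix[index] for index in sequence]

vectors = embed([5, 3, 1, 0], embedding_matrix)
print(len(vectors), len(vectors[0]))
```

A document of four word indices becomes four vectors of length embedding_dim, which is the shape the recurrent layers consume.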

<h4><b>Loading vocabulary

<h4><b>We will load our vocabulary file into memory using the load_file() function written earlier.

In [55]:
#load the vocabulary into memory
vocab_file = 'vocab.txt'
vocab = load_file(vocab_file)
vocab = vocab.split()
vocab = set(vocab)
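
As a quick sanity check, the save and load steps should round-trip: writing one token per line and splitting on whitespace gives back the same set of tokens. A minimal sketch with a made-up token list and a temporary file:

```python
import os
import tempfile

tokens_to_save = ["Water", "firm", "Suez"]  # made-up sample tokens

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    # Same layout as save_vocab(): one token per line.
    with open(path, "w") as handle:
        handle.write("\n".join(tokens_to_save))
    # Same loading steps as above: read, split on whitespace, convert to a set.
    with open(path, "r") as handle:
        restored = set(handle.read().split())

print(restored == set(tokens_to_save))
```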

<h4><b>Now we need to load all training data into memory. Before that, we need to clean it.

<h4><b>clean_text_file() function below cleans the text by removing punctuation and dropping tokens that are not in the vocabulary, then joins the remaining tokens back into a single string.

In [56]:
#Convert the text into clean tokens
def clean_text_file(doc, vocab):
    #split the text into tokens by whitespace
    tokens = doc.split()
    #remove the punctuation from each token
    table = str.maketrans('','',string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    #remove the tokens which are not in the vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens
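
To see the effect of the vocabulary filter, here is a compact sketch of the same steps applied to one made-up sentence and a toy vocabulary. keep_vocab_tokens() is a hypothetical name that mirrors clean_text_file().

```python
import string

def keep_vocab_tokens(doc, vocab):
    """Strip punctuation from each whitespace token and keep only tokens in `vocab`."""
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    return " ".join(w for w in tokens if w in vocab)

toy_vocab = {"firm", "wants", "tariff", "rise"}  # made-up vocabulary
cleaned = keep_vocab_tokens("The firm wants a tariff rise, it said.", toy_vocab)
print(cleaned)
```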

<h4><b>process_text_documents() function below loads every document in a directory, passes each one to clean_text_file() to clean it, and appends the cleaned text to a list.

In [57]:
#load all the documents in a directory
def process_text_documents(directory, vocab):
    documents = list()
    #walk through all files in the directory
    for filename in os.listdir(directory):
        #create the full path of the file which is to be opened
        file_path = directory + '/' + filename
        #load the document
        doc = load_file(file_path)
        #clean the document
        tokens = clean_text_file(doc, vocab)
        #add to list
        documents.append(tokens)
    return documents                 

<h4><b>Loading all the positive and negative documents in the training dataset 

In [58]:
#load all the training dataset
positive_documents = process_text_documents('Train ready Dataset/train/pos', vocab)
negative_documents = process_text_documents('Train ready Dataset/train/neg', vocab)
train_docs = negative_documents + positive_documents

<b>Now we will use the Keras Tokenizer API. fit_on_texts() builds a vocabulary of all tokens in the training set and creates a consistent mapping from each word in the vocabulary to a unique integer.

In [59]:
#create the Tokenizer
tokenizer = Tokenizer()
#fit the tokenizer on the training documents
tokenizer.fit_on_texts(train_docs)

<b>Next, texts_to_sequences() encodes each document in the training set as a sequence of those integers.

In [60]:
#sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
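
The two tokenizer steps can be approximated in plain Python to show what the mapping looks like. This is a simplified sketch, not the Keras implementation: it lowercases and splits on whitespace, gives index 1 to the most frequent word (0 stays unused so it can serve as padding), and drops words that were not seen during fitting. fit_word_index() and to_sequences() are hypothetical helpers.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, starting at 1 for the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count, descending; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Encode each text as a list of integers, skipping unknown words."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]

docs = ["firm wants tariff rise", "government rejected rise"]
word_index = fit_word_index(docs)
encoded = to_sequences(["government wants unknown rise"], word_index)
print(word_index)
print(encoded)
```

Because the mapping is fixed after fitting, the same word always receives the same integer in both the training and the test documents.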

<b>For this project, the maximum document length is set to 400. Every document in the training set is therefore padded or truncated to that length.

In [61]:
#pad sequences
max_length = 400
Xtrain = pad_sequences(encoded_docs, maxlen = max_length, padding = 'post',truncating = 'post')
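
With padding = 'post' and truncating = 'post', short sequences get zeros appended and long sequences lose their tail. A plain-Python sketch of that behaviour for a single sequence, using a toy maximum length of 5; pad_post() is a hypothetical helper, not a Keras function.

```python
def pad_post(seq, max_length, pad_value=0):
    """Keep the first max_length items, then append pad_value up to max_length."""
    truncated = seq[:max_length]
    return truncated + [pad_value] * (max_length - len(truncated))

short_result = pad_post([5, 3, 1], 5)
long_result = pad_post([7, 2, 9, 4, 6, 8], 5)
print(short_result)
print(long_result)
```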

<b>Training labels are defined according to the sentiment folders of the training dataset: documents in the "neg" folder (negative sentiment) get label 0, and documents in the "pos" folder (positive sentiment) get label 1.

In [62]:
#define training labels: 0 for the 778 negative documents, 1 for the 778 positive documents
ytrain = np.array([0] * 778 + [1] * 778)
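
The labels must follow the same order as the documents (negative first, then positive). One way to avoid hard-coding the counts is to derive them from the document lists themselves. A minimal sketch, where make_labels() is a hypothetical helper and the toy counts stand in for len(negative_documents) and len(positive_documents):

```python
def make_labels(num_negative, num_positive):
    """Return 0 for every negative document followed by 1 for every positive one."""
    return [0] * num_negative + [1] * num_positive

labels = make_labels(3, 2)  # toy counts for illustration
print(labels)
```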

<b>Similarly, all documents in the test dataset are loaded and encoded into sequences of integers. They are then padded to the maximum length of 400, and the test labels are defined in the same way as for the training dataset (i.e. label 0 for negative-sentiment documents and 1 for positive-sentiment documents).

In [63]:
#load all test documents
positive_docs = process_text_documents('Train ready Dataset/test/pos', vocab)
negative_docs = process_text_documents('Train ready Dataset/test/neg', vocab)
test_docs = negative_docs + positive_docs
#sequence encode (Note: we do not do tokenizer.fit_on_texts on test data otherwise it will change index of words.)
encoded_docs = tokenizer.texts_to_sequences(test_docs)
#pad sequences
Xtest = pad_sequences(encoded_docs, maxlen = max_length, padding = 'post',truncating = 'post')
#define test labels
ytest = np.array( [0 for _ in range(40)] + [1 for _ in range(40)])

<b>Now we define the vocabulary size.

In [64]:
#define vocabulary size (largest integer index + 1, since index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1

<b>Since we are going to use a softmax classifier later, all the train and test labels must be converted into one-hot vectors.

In [65]:
#one hot encoding the y labels
ytrain = tf.keras.utils.to_categorical(ytrain, 2)
ytest = tf.keras.utils.to_categorical(ytest, 2)
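
One-hot encoding turns each integer label into a vector with a single 1 at the position of the class. A plain-Python sketch of the idea for two classes; one_hot() is a hypothetical helper, not the Keras function.

```python
def one_hot(labels, num_classes):
    """Return one row per label, with 1.0 at the label's index and 0.0 elsewhere."""
    return [
        [1.0 if i == label else 0.0 for i in range(num_classes)]
        for label in labels
    ]

encoded_labels = one_hot([0, 1, 1], 2)
print(encoded_labels)
```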

<b>Now the model is built. We use a stacked Bidirectional LSTM (BiLSTM) with 20% dropout. Because the labels are one-hot encoded over two classes (negative and positive), the output layer uses softmax. The model summary is displayed below.

In [66]:
#define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length = max_length))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(64, dropout =0.2,return_sequences= True)))
model.add(Bidirectional(LSTM(64, dropout =0.2)))
model.add(Dense(2, activation = 'softmax'))
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 100)          1598900   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 400, 100)          0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 400, 128)          84480     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258       
Total params: 1,782,454
Trainable params: 1,782,454
Non-trainable params: 0
_________________________________________________________________
None
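
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias) and doubles it for the two directions. The vocabulary size of 15989 is not printed in the notebook; it is inferred here from 1,598,900 embedding parameters divided by 100 dimensions.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per word index."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    """A bidirectional wrapper holds one forward and one backward LSTM."""
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

vocab_size = 15989  # inferred from the summary, see the note above
counts = [
    embedding_params(vocab_size, 100),  # embedding layer
    bilstm_params(100, 64),             # first BiLSTM reads 100-d embeddings
    bilstm_params(128, 64),             # second BiLSTM reads 2 * 64 = 128 features
    dense_params(128, 2),               # softmax output layer
]
total_params = sum(counts)
print(counts, total_params)
```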


<b>Now we will compile the model. Since we use a softmax classifier, we use categorical_crossentropy as the loss function, with the Adam optimizer.

In [67]:
#compile the network
model.compile(loss= 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
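
For a single document, categorical cross-entropy is the negative log of the probability the model assigns to the true class. The sketch below computes it in plain Python for two made-up predictions, to show that a confident wrong answer is penalized far more than a confident right one; it illustrates the formula only and is not the Keras implementation.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability of the true class for one one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# True class is "positive" (index 1); the probabilities are made up.
loss_right = categorical_crossentropy([0.0, 1.0], [0.1, 0.9])
loss_wrong = categorical_crossentropy([0.0, 1.0], [0.9, 0.1])
print(round(loss_right, 4), round(loss_wrong, 4))
```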

<b>Fit the model on the training dataset.

In [27]:
#fit the network
model.fit(Xtrain, ytrain, epochs = 10, verbose = 2)

Epoch 1/10
49/49 - 39s - loss: 0.6193 - accuracy: 0.6350
Epoch 2/10
49/49 - 39s - loss: 0.2321 - accuracy: 0.9094
Epoch 3/10
49/49 - 40s - loss: 0.0522 - accuracy: 0.9833
Epoch 4/10
49/49 - 38s - loss: 0.0182 - accuracy: 0.9949
Epoch 5/10
49/49 - 40s - loss: 0.0148 - accuracy: 0.9968
Epoch 6/10
49/49 - 39s - loss: 0.0152 - accuracy: 0.9968
Epoch 7/10
49/49 - 41s - loss: 0.0155 - accuracy: 0.9981
Epoch 8/10
49/49 - 42s - loss: 0.0213 - accuracy: 0.9955
Epoch 9/10
49/49 - 36s - loss: 0.0186 - accuracy: 0.9968
Epoch 10/10
49/49 - 36s - loss: 0.0134 - accuracy: 0.9981


<tensorflow.python.keras.callbacks.History at 0x7fa23c0c2dd0>
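
The "49/49" in the log is the number of batches per epoch. Assuming the Keras default batch size of 32 (fit() above is called without batch_size), it follows from the 1556 training documents:

```python
import math

num_train = 778 + 778   # negative plus positive training documents
batch_size = 32         # Keras default when batch_size is not passed to fit()

steps_per_epoch = math.ceil(num_train / batch_size)
print(steps_per_epoch)
```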

<b>Now we will evaluate the performance of our model on the test dataset.

In [69]:
#evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose = 0)
print('Test Accuracy: %f' % (acc*100))

Test Accuracy: 78.750002
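
With one-hot labels, the accuracy metric counts a prediction as correct when the class with the highest predicted probability matches the label. A plain-Python sketch of that calculation on made-up probabilities; categorical_accuracy() is a hypothetical helper.

```python
def categorical_accuracy(y_true, y_prob):
    """Fraction of rows where the highest-probability class matches the one-hot label."""
    correct = 0
    for true_row, prob_row in zip(y_true, y_prob):
        predicted = prob_row.index(max(prob_row))
        actual = true_row.index(max(true_row))
        correct += int(predicted == actual)
    return correct / len(y_true)

y_true = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_prob = [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7], [0.1, 0.9]]  # made-up probabilities
toy_accuracy = categorical_accuracy(y_true, y_prob)
print(toy_accuracy)

# With 80 test documents, 78.75% corresponds to 63 correct predictions.
print(63 / 80)
```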


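<b>The model reaches a test accuracy of 78.75% (63 of the 80 test documents), which is decent for a small dataset but well below the training accuracy of about 99.8%, a gap that suggests overfitting.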

<h4><b>Save Model

In [29]:
#save model
model.save("model.h5")

<h4><b>Save tokenizer

In [None]:
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
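
To reuse the tokenizer later, the file is read back with pickle.load(). The sketch below shows the round trip with a plain dictionary standing in for the fitted Tokenizer and a temporary file standing in for tokenizer.pickle. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted Tokenizer; any picklable object behaves the same way.
stand_in_tokenizer = {"word_index": {"rise": 1, "firm": 2}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer.pickle")
    # Save, as in the cell above.
    with open(path, "wb") as handle:
        pickle.dump(stand_in_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Load it back in binary read mode.
    with open(path, "rb") as handle:
        restored_tokenizer = pickle.load(handle)

print(restored_tokenizer == stand_in_tokenizer)
```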