<a href="https://colab.research.google.com/github/Akashsindhu/movie-sentiment/blob/master/chapter_15_Embedding_%2B_CNN_Model_for_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Movie Review Dataset
2. Data Preparation
3. Train CNN With Embedding Layer
4. Evaluate Model



In [0]:
#@title Default title text
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
/gdrive


In [0]:
%cd /gdrive/My Drive/Colab Notebooks/NLTK/review_polarity

/gdrive/My Drive/Colab Notebooks/NLTK/review_polarity


### Loading and Cleaning Reviews
The text data is already pretty clean; not much preparation is required. Without getting bogged down too much in the details, we will prepare the data using the following way:

1.   Split tokens on white space.
2.   Remove all punctuation from words.
3.   Remove all words that are not purely comprised of alphabetical characters.
4. Remove all words that are known stop words.  Remove all words that have a length ≤ 1 character





In [0]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
# load the doc into memory 
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# clean the text from the file and convert to tokens
def clean_doc(doc):
    # split into tokens by white space 
    tokens = doc.split()
    # prepare regex for char filtering 
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    #remove punctuations from each word 
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining words that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stopwords 
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens 
    tokens = [w for w in tokens if len(w) > 1]

    return tokens

# load the document 
# filename = 'txt_sentoken/pos/cv000_29590.txt'
# text = load_doc(filename)
# tokens = clean_doc(text)
# print(tokens)

### Define a vocabulary
It is important to deﬁne a vocabulary of known words when using a text model. The more words, the larger the representation of documents, therefore it is important to constrain the words to only those believed to be predictive. This is diﬃcult to know beforehand and often it is important to test diﬀerent hypotheses about how to construct a useful vocabulary. We have already seen how we can remove punctuation and numbers from the vocabulary in the previous section. We can repeat this for all documents and build a set of all known words. 

We can develop a vocabulary as a Counter, which is a dictionary mapping of words and their count that allows us to easily update and query. Each document can be added to the counter (a new function called add doc to vocab()) and we can step over all of the reviews in the negative directory and then the positive directory (a new function called process docs()). The complete example is listed below.


In [0]:
import string
import re 
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

In [0]:
#load doc from memory is already defined above
# clean the doc already defined above 

# load doc and add to vocab 
def add_doc_to_vocab(filename, vocab):
    text = load_doc(filename)
    tokens = clean_doc(text)
    vocab.update(tokens)

# step over all the docs in negative and positive 
def step_alldoc(directory, vocab):
    for filename in listdir(directory):
        if filename.startswith('cv9'):
            next
        path = directory + '/' + filename
        add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
step_alldoc('txt_sentoken/pos/', vocab)
step_alldoc('txt_sentoken/neg/', vocab)

print(len(vocab))
print(vocab.most_common(50))



46557
[('film', 8860), ('one', 5521), ('movie', 5440), ('like', 3553), ('even', 2555), ('good', 2320), ('time', 2283), ('story', 2118), ('films', 2102), ('would', 2042), ('much', 2024), ('also', 1965), ('characters', 1947), ('get', 1921), ('character', 1906), ('two', 1825), ('first', 1768), ('see', 1730), ('well', 1694), ('way', 1668), ('make', 1590), ('really', 1563), ('little', 1491), ('life', 1472), ('plot', 1451), ('people', 1420), ('movies', 1416), ('could', 1395), ('bad', 1374), ('scene', 1373), ('never', 1364), ('best', 1301), ('new', 1277), ('many', 1268), ('doesnt', 1267), ('man', 1266), ('scenes', 1265), ('dont', 1210), ('know', 1207), ('hes', 1150), ('great', 1141), ('another', 1111), ('love', 1089), ('action', 1078), ('go', 1075), ('us', 1065), ('director', 1056), ('something', 1048), ('end', 1047), ('still', 1038)]


In [0]:
occurance = 2
tokens = [k for k, c in vocab.items() if c >= occurance]
print(len(tokens))

27139


In [0]:
# save tokens to a file 
def save_file(vocab):
    filename = 'vocab.txt'
    # join the tokens to a line 
    data = '\n'.join(vocab)
    file = open(filename, 'w')
    file.write(data)
    file.close()

save_file(tokens)

Train CNN with Embedding layer 


In [0]:
from numpy import array 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding
from keras.layers.convolutional import Conv1D, MaxPooling1D

In [0]:
def one_doc_per_string(doc, vocab):
    tokens = clean_doc(doc)
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)


def updated_process_doc(directory, vocab, is_train):
    documents = list()
    for filename in listdir(directory):
        if is_train and filename.startswith('cv9'):
            next
        if not is_train and not filename.startswith('cv9'):
            next
        path = directory + '/' + filename
        text = load_doc(path)
        tokens = one_doc_per_string(text, vocab)
        documents.append(tokens)
    return documents

def load_clean_docs(vocab, is_train):
    neg = updated_process_doc('txt_sentoken/neg/', vocab, is_train)
    pos = updated_process_doc('txt_sentoken/pos/', vocab, is_train)
    doc = neg + pos 
    # prepare labels 
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return doc, labels 

# fit the tokenizer 
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def encode_docs(tokenizer, max_length, docs):
    #integer encode 
    encoded = tokenizer.texts_to_sequences(docs)
    #pad sequence
    padded = pad_sequences(encoded, maxlen = max_length, padding = 'post')
    return padded

def model(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer= 'adam', metrics = ['accuracy'])

    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model 

In [0]:
# load the vocabulary 
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())

# load training data
train_docs, y_train = load_clean_docs(vocab, True)
#create tokenizer 
tokenizer = create_tokenizer(train_docs)

#define vocab size 
vocab_size = len(tokenizer.word_index) + 1

print('vocab size: %d' % vocab_size)
#calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print("max length: %d" % max_length)

#encode data 
X_train = encode_docs(tokenizer, max_length, train_docs)
#define model
model = model(vocab_size, max_length)

#fit the model 
model.fit(X_train, y_train, epochs= 10, verbose = 1)
# save the model 
model.save('model.h5')


vocab size: 27140
max length: 1319
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 1319, 100)         2714000   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 1312, 32)          25632     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 656, 32)           0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 20992)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                209930    
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11        
Total params: 2,949,573
Trainable params: 2,949,573
Non-trainable params: 0
_________

evaluate the model

In [0]:
def predict_sentiment(review, vocab, tokenizer, max_length, model):
    line = one_doc_per_string(review, vocab)
    padded = encode_docs(tokenizer, max_length, [line])
    # predict sentiment 
    predicted = model.predict(padded, verbose = 0)
    # retrive predicted percentage and label
    percent_pos = predicted[0,0]
    if round(percent_pos) == 0:
        return (1-percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'



In [0]:
from keras.models import load_model
#load all reviews 
train_docs, y_train  = load_clean_docs(vocab, True)
test_docs, y_test = load_clean_docs(vocab, False)
#createthe tokenizer 
tokenizer = create_tokenizer(train_docs)
# define the vocab size 
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)

# calculate the maximum seuence length 
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)

# encoded data 
X_train = encode_docs(tokenizer, max_length, train_docs)
X_test = encode_docs(tokenizer, max_length, test_docs)

# load the model 
model = load_model('model.h5')
# evaluate model on training dataset 
_, acc = model.evaluate(X_train, y_test, verbose = 1)
print( 'Train Accuracy: %f' % (acc*100))

_, acc = model.evaluate(X_test, y_test, verbose = 1)
print ('Test Accuracy: %f' % (acc*100))

#test positive text
text = 'Everyone will enjoy this film. I love it, recommended!'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model)
print("Review: [%s]\nSentiment: %s (%.3f%%)" % (text, sentiment, percent*100))

# test negative text 
text = 'This is a bad movie. Do not watch it. It sucks'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model)
print("Review: [%s]\nSentiment: %s (%.3f%%)" % (text, sentiment, percent*100))


Vocabulary size: 27140
Maximum length: 1319
Train Accuracy: 100.000000
Test Accuracy: 100.000000
Review: [Everyone will enjoy this film. I love it, recommended!]
Sentiment: NEGATIVE (99.984%)
Review: [This is a bad movie. Do not watch it. It sucks]
Sentiment: NEGATIVE (99.985%)
