# Neural Models for Document Classification


# Word Embeddings + CNN = Text Classification


The modus operandi for text classification involves the use of a word embedding for representing
words and a Convolutional Neural Network (CNN) for learning how to discriminate documents
on classification problems. Yoav Goldberg, in his primer on deep learning for natural language
processing, comments that neural networks in general offer better performance than classical
linear classifiers, especially when used with pre-trained word embeddings.


The architecture is therefore comprised of three key pieces:
    
- Word Embedding: A distributed representation of words where different words that
have a similar meaning (based on their usage) also have a similar representation.

- Convolutional Model: A feature extraction model that learns to extract salient features
from documents represented using a word embedding.

- Fully Connected Model: The interpretation of extracted features in terms of a predictive
output.
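
To make the three pieces concrete, here is a minimal pure-Python sketch of a forward pass through each of them. The tiny embedding table, the filter weights, and the helper names (`embed`, `conv1d`, `dense_sigmoid`) are all made up for illustration; a real model would learn these values and would normally be built with a deep learning library.

```python
import math

# Toy embedding table: each word index maps to a 2-dimensional vector.
# All values are made up for illustration; a real model learns them.
EMBEDDING = {
    0: [0.0, 0.0],   # padding
    1: [0.5, -0.2],  # "good"
    2: [-0.4, 0.3],  # "bad"
    3: [0.1, 0.1],   # "film"
}

def embed(word_ids):
    """Word embedding: look up a dense vector for each word index."""
    return [EMBEDDING[i] for i in word_ids]

def conv1d(vectors, kernel, bias=0.0):
    """Convolutional model: slide one filter over consecutive words (ReLU)."""
    width = len(kernel)
    features = []
    for start in range(len(vectors) - width + 1):
        total = bias
        for offset in range(width):
            for value, weight in zip(vectors[start + offset], kernel[offset]):
                total += value * weight
        features.append(max(0.0, total))
    return features

def dense_sigmoid(features, weights, bias=0.0):
    """Fully connected model: map extracted features to a probability."""
    z = bias + sum(f * w for f, w in zip(features, weights))
    return 1.0 / (1.0 + math.exp(-z))

doc = [1, 3, 2, 3]                        # "good film bad film"
kernel = [[1.0, 0.0], [0.0, 1.0]]         # one filter spanning 2 words
feature_map = conv1d(embed(doc), kernel)  # 3 positions for 4 words
pooled = [max(feature_map)]               # keep the strongest response
probability = dense_sigmoid(pooled, weights=[2.0])
print(feature_map, probability)
```

The sketch uses a single filter and a single output unit so that each stage stays readable; the Keras model developed later in this section follows the same order of stages with many more parameters.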

# Use a Single Layer CNN Architecture


You can get good results for document classification with a single layer CNN, perhaps with
differently sized kernels across the filters to allow grouping of word representations at different
scales. 

Yoon Kim in his study of the use of pre-trained word vectors for classification tasks with
Convolutional Neural Networks found that using pre-trained static word vectors does very well.


He suggests that pre-trained word embeddings that were trained on very large text corpora,
such as the freely available Word2Vec vectors trained on 100 billion tokens from Google News,
may offer good universal features for use in natural language processing.

He also discovered that further task-specific tuning of the word vectors offers a small additional
improvement in performance. Kim describes the general approach of using CNNs for natural
language processing. Sentences are mapped to embedding vectors and are available as a matrix
input to the model. Convolutions are performed across the input word-wise using differently
sized kernels, such as 2 or 3 words at a time. The resulting feature maps are then processed
using a max pooling layer to condense or summarize the extracted features.

![title](picture1.png)

The architecture is based on the approach used by Ronan Collobert, et al. in their paper
Natural Language Processing (almost) from Scratch, 2011. In it, they develop a single end-to-end
neural network model with convolutional and pooling layers for use across a range of fundamental
natural language processing problems. Kim provides a diagram that helps to see the sampling
of the filters using differently sized kernels as different colors (red and yellow).
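
The idea of sliding differently sized kernels over the same sentence can be sketched in a few lines of plain Python. The sentence vectors and the two filters below are hypothetical values chosen so the arithmetic is easy to follow. Each feature map is summarized by its maximum, in the spirit of the max-over-time pooling used in Kim's model.

```python
def feature_map(sentence, kernel):
    """Apply one filter word-wise; kernel has one weight vector per word."""
    width = len(kernel)
    out = []
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        out.append(sum(x * w
                       for vec, wvec in zip(window, kernel)
                       for x, w in zip(vec, wvec)))
    return out

# A sentence of 5 words, each a 2-dimensional embedding (made-up numbers).
sentence = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]

# Hypothetical filters: one spanning 2 words, one spanning 3 words.
kernels = {
    2: [[1, 0], [0, 1]],
    3: [[1, 1], [0, 0], [1, 1]],
}

pooled = []
for width, kernel in kernels.items():
    fmap = feature_map(sentence, kernel)
    # A kernel of width k over n words yields n - k + 1 positions.
    assert len(fmap) == len(sentence) - width + 1
    pooled.append(max(fmap))   # max pooling: one value per filter

print(pooled)
```

Because pooling reduces every feature map to one number, filters of different widths can be concatenated into a single fixed-length feature vector regardless of sentence length.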

***Usefully, he reports his chosen model configuration, discovered via grid search and used
across a suite of 7 text classification tasks, summarized as follows***:
    
- Transfer function: rectified linear.

- Kernel sizes: 3, 4, 5.

- Number of filters: 100.

- Dropout rate: 0.5.

- Weight regularization (L2 norm constraint): 3.

- Batch Size: 50.

- Update Rule: Adadelta.
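
The L2 setting deserves a note. In Kim's paper it is described as a constraint on the L2 norm of the weight vectors rather than a penalty added to the loss: whenever the norm exceeds the limit after an update, the vector is rescaled back to that limit. The sketch below illustrates that rescaling step in plain Python; the function name `clip_l2_norm` is our own and the example vectors are arbitrary.

```python
import math

def clip_l2_norm(weights, max_norm=3.0):
    """Rescale a weight vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)          # already within the constraint
    scale = max_norm / norm
    return [w * scale for w in weights]

small = clip_l2_norm([1.0, 2.0])      # norm is about 2.24, left unchanged
large = clip_l2_norm([3.0, 4.0])      # norm is 5.0, rescaled to norm 3.0
print(small, large)
```

Deep learning libraries typically expose this kind of rule as a weight constraint, so in practice you would not need to write it by hand.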

# Dial in CNN Hyperparameters


Some hyperparameters matter more than others when tuning a convolutional neural network on your document classification problem. 

Ye Zhang and Byron Wallace performed a sensitivity analysis into the hyperparameters needed to configure a single layer convolutional neural network for document classification. 


The study is motivated by their claim that the models are sensitive to their configuration.

![title](picture2.png)

# Consider Character-Level CNNs


Text documents can be modeled at the character level using convolutional neural networks
that are capable of learning the relevant hierarchical structure of words, sentences, paragraphs,
and more. Xiang Zhang, et al. use a character-based representation of text as input for a
convolutional neural network. The promise of the approach is that all of the labor-intensive
effort required to clean and prepare text could be overcome if a CNN can learn to abstract the
salient details.

The model reads in one-hot encoded characters from a fixed-size alphabet. Encoded characters
are read in blocks or sequences of 1,014 characters. A stack of 6 convolutional layers with
pooling follows, with 3 fully connected layers at the output end of the network in order to make
a prediction.
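
As a rough illustration of the input representation, the sketch below one-hot encodes a string over a small fixed alphabet and forces it to a fixed length by truncating or zero-padding. The alphabet and the length of 16 are assumptions chosen to keep the example small; they are not the values used in the paper.

```python
import string

# Hypothetical alphabet for illustration; the paper's exact alphabet differs.
ALPHABET = string.ascii_lowercase + string.digits + " .,!?"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_encode(text, length):
    """Encode text as `length` one-hot rows, truncating or zero-padding."""
    rows = []
    for ch in text.lower()[:length]:
        row = [0] * len(ALPHABET)
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:          # unknown characters stay all-zero
            row[index] = 1
        rows.append(row)
    # pad short inputs with all-zero rows up to the fixed length
    while len(rows) < length:
        rows.append([0] * len(ALPHABET))
    return rows

encoded = one_hot_encode("Good film!", length=16)
print(len(encoded), len(encoded[0]))
```

Note that no tokenization, stop word removal, or vocabulary is needed here, which is the appeal of the character-level approach.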





![title](picture3.png)

# Consider Deeper CNNs for Classification


Better performance can be achieved with very deep convolutional neural networks, although
standard and reusable architectures have not been adopted for classification tasks, yet. Alexis
Conneau, et al. comment on the relatively shallow networks used for natural language processing
and the success of much deeper networks used for computer vision applications. For example,
Kim (above) restricted the model to a single convolutional layer.
Other architectures used for natural language reviewed in the paper are limited to 5 and 6
layers. These are contrasted with successful architectures used in computer vision with 19 or
even up to 152 layers. They suggest and demonstrate that there are benefits for hierarchical
feature learning with a very deep convolutional neural network model, called VDCNN.

Results on a suite of 8 large text classification tasks show better performance than more
shallow networks, specifically state-of-the-art results on all but two of the datasets tested
at the time of writing. Generally, they make some key findings from exploring the deeper
architectural approach:

- The very deep architecture worked well on small and large datasets.

- Deeper networks decrease classification error.

- Max-pooling achieves better results than other, more sophisticated types of pooling.

- Going deeper still eventually degrades accuracy; the shortcut connections used in the
architecture are important because they help to reduce this degradation.

This is the first time that the "benefit of depths" was shown for convolutional
neural networks in NLP.
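
A shortcut connection adds the input of a block back onto its output, so that information can bypass the layers in between. The toy sketch below shows only that idea; the `layer` function is a stand-in for a real convolutional layer and is not taken from the VDCNN architecture.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def layer(values, weight, bias):
    """Stand-in for a convolutional layer: an element-wise affine map + ReLU."""
    return relu([weight * v + bias for v in values])

def block_with_shortcut(values, weight, bias):
    """Add the block's input back onto its output (shortcut connection)."""
    transformed = layer(layer(values, weight, bias), weight, bias)
    return [t + v for t, v in zip(transformed, values)]

x = [1.0, -2.0, 0.5]
# With weights of zero the transformed path contributes nothing, yet the
# input still passes through unchanged via the shortcut.
print(block_with_shortcut(x, weight=0.0, bias=0.0))
```

This is why shortcuts are thought to make very deep stacks easier to train: a block that has learned nothing useful can still pass its input along.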

# Define a Vocabulary

It is important to define a vocabulary of known words when using a text model. The more
words, the larger the representation of documents, therefore it is important to constrain the
words to only those believed to be predictive. This is difficult to know beforehand and often it
is important to test different hypotheses about how to construct a useful vocabulary. We have
already seen how we can remove punctuation and numbers from the vocabulary in the previous
section. We can repeat this for all documents and build a set of all known words.

We can develop a vocabulary as a Counter, which is a dictionary mapping of words and
their count that allows us to easily update and query. Each document can be added to the
counter (a new function called add_doc_to_vocab()) and we can step over all of the reviews in
the positive directory and then the negative directory (a new function called process_docs()).
The complete example is listed below.

In [55]:
import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts

    vocab.update(tokens)
    
    
# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

    
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/pos', vocab)
process_docs('txt_sentoken/neg', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename, 'w')
    # write text
    file.write(data)
    # close file
    file.close()

# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('could', 1248), ('bad', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]
25767


### Movie Review Polarity Dataset (review_polarity.tar.gz, 3MB).


https://raw.githubusercontent.com/jbrownlee/Datasets/master/review_polarity.tar.gz
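
If you prefer to fetch and unpack the archive from Python, the following sketch uses only the standard library. The helper names `download` and `extract` are our own, and the example calls are left commented out because they need network access. The code in this section expects the reviews under a `txt_sentoken` directory in the working directory.

```python
import tarfile
import urllib.request

URL = ('https://raw.githubusercontent.com/jbrownlee/Datasets/'
       'master/review_polarity.tar.gz')

def download(url, filename):
    """Fetch the archive over HTTP and save it locally."""
    urllib.request.urlretrieve(url, filename)

def extract(filename, directory='.'):
    """Unpack a .tar.gz archive into the given directory."""
    with tarfile.open(filename, 'r:gz') as archive:
        archive.extractall(directory)

# Example usage (requires network access):
# download(URL, 'review_polarity.tar.gz')
# extract('review_polarity.tar.gz')
```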

# Develop an Embedding + CNN Model for Sentiment Analysis

## Data Preparation

Note: The preparation of the movie review dataset was first described in Chapter 9. In this
section, we will look at 3 things:
    
1. Separation of data into training and test sets.

2. Loading and cleaning the data to remove punctuation and numbers.

3. Defining a vocabulary of preferred words.
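
The train/test separation in the code below relies purely on file names: any review whose name starts with `cv9` is held back for testing. The sketch here shows the effect of that rule on a list of hypothetical file names, assuming each class directory holds 1,000 reviews numbered cv000 to cv999.

```python
def is_test_file(filename):
    """Reviews whose names start with 'cv9' are held back for testing."""
    return filename.startswith('cv9')

# Hypothetical file names following a cvNNN_ pattern (an assumption here).
filenames = ['cv%03d_review.txt' % i for i in range(1000)]

train = [f for f in filenames if not is_test_file(f)]
test = [f for f in filenames if is_test_file(f)]
print(len(train), len(test))
```

Under that assumption the rule gives a 90%/10% split per class without having to move any files.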

In [66]:
import string
import re
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens


# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)

        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# load and clean a dataset
def load_clean_dataset(vocab, is_train):
    # load documents
    neg = process_docs('txt_sentoken/neg', vocab, is_train)
    pos = process_docs('txt_sentoken/pos', vocab, is_train)
    docs = neg + pos
    # prepare labels
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# integer encode and pad documents
def encode_docs(tokenizer, max_length, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    # pad sequences
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded
# define the model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(32, 8, activation='relu'))
    model.add(MaxPooling1D())
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model


# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())
# load training data
train_docs, ytrain = load_clean_dataset(vocab, True)
print(train_docs[:10])
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1


print('Vocabulary size: %d' % vocab_size)
# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)
# encode data
Xtrain = encode_docs(tokenizer, max_length, train_docs)
# define model
model = define_model(vocab_size, max_length)
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# save the model
model.save('model.h5')



Vocabulary size: 25768
Maximum length: 1317
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1317, 100)         2576800   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1310, 32)          25632     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 655, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 20960)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                209610    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 2,812,053
Trainable params: 2,812,053
Non-trainable params: 0
______________________

![title](picture4.png)
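
The shapes and parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the model definition, assuming the default pool size of 2 for `MaxPooling1D` and no padding in the convolution.

```python
vocab_size, embed_dim, max_length = 25768, 100, 1317
filters, kernel_size, dense_units = 32, 8, 10

embedding_params = vocab_size * embed_dim
conv_params = (kernel_size * embed_dim + 1) * filters   # +1 for each bias
conv_length = max_length - kernel_size + 1              # no padding
pool_length = conv_length // 2                          # pool size of 2
flat_size = pool_length * filters
dense_params = (flat_size + 1) * dense_units
output_params = dense_units + 1

total = embedding_params + conv_params + dense_params + output_params
print(conv_length, pool_length, flat_size, total)
```

Almost all of the parameters sit in the embedding layer, which is one reason why reducing the vocabulary has such a large effect on model size.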

# Score the Model

In [67]:
from keras.models import load_model
import string
import re
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# turn a doc into clean tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens
# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents


# load and clean a dataset
def load_clean_dataset(vocab, is_train):
    # load documents
    neg = process_docs('txt_sentoken/neg', vocab, is_train)
    pos = process_docs('txt_sentoken/pos', vocab, is_train)
    docs = neg + pos
    # prepare labels
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels


# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer


# integer encode and pad documents
def encode_docs(tokenizer, max_length, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    # pad sequences
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded


# classify a review as negative or positive
def predict_sentiment(review, vocab, tokenizer, max_length, model):
    # clean review
    line = clean_doc(review, vocab)
    # encode and pad review
    padded = encode_docs(tokenizer, max_length, [line])
    # predict sentiment
    yhat = model.predict(padded, verbose=0)
    # retrieve predicted percentage and label
    percent_pos = yhat[0,0]
    if round(percent_pos) == 0:
        return (1-percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'


# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())

# load all reviews
train_docs, ytrain = load_clean_dataset(vocab, True)
test_docs, ytest = load_clean_dataset(vocab, False)
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)
# encode data
# Xtrain = encode_docs(tokenizer, max_length, train_docs)
Xtest = encode_docs(tokenizer, max_length, test_docs)
# load the model
model = load_model('model.h5')
# evaluate model on training dataset
# _, acc = model.evaluate(Xtrain, ytrain, verbose=0)
# print('Train Accuracy: %.2f' % (acc*100))
# evaluate model on test dataset
_, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %.2f' % (acc*100))
# test positive text
text = 'Everyone will enjoy this film. I love it, recommended!'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))

# test negative text
text = 'This is a bad movie. Do not watch it. It sucks.'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))

Vocabulary size: 25768
Maximum length: 1317
Test Accuracy: 86.50
Review: [Everyone will enjoy this film. I love it, recommended!]
Sentiment: NEGATIVE (51.282%)
Review: [This is a bad movie. Do not watch it. It sucks.]
Sentiment: NEGATIVE (58.544%)
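
When reading these results, keep in mind that the printed percentage is the confidence in the predicted label, not the probability of a positive review. The sketch below repeats the decision rule from `predict_sentiment()` on a few hypothetical model outputs; a result of NEGATIVE (51.282%) corresponds to a sigmoid output of about 0.487, which is very close to the decision threshold.

```python
def label_from_probability(percent_pos):
    """Mirror the decision rule in predict_sentiment() for a single score."""
    if round(percent_pos) == 0:
        return (1 - percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'

# Hypothetical model outputs, chosen to show both sides of the threshold.
for p in (0.48718, 0.90, 0.10):
    confidence, label = label_from_probability(p)
    print('p(positive)=%.3f -> %s (%.1f%%)' % (p, label, confidence * 100))
```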


## Extensions


This section lists some ideas for extending the tutorial that you may wish to explore.

- Data Cleaning. Explore better data cleaning, perhaps leaving some punctuation intact
or normalizing contractions.

- Truncated Sequences. Padding all sequences to the length of the longest sequence
might be extreme if the longest sequence is very different to all other reviews. Study the
distribution of review lengths and truncate reviews to a mean length.

- Truncated Vocabulary. We removed infrequently occurring words, but still had a large
vocabulary of more than 25,000 words. Explore further reducing the size of the vocabulary
and the effect on model skill.

- Filters and Kernel Size. The number of filters and kernel size are important to model
skill and were not tuned. Explore tuning these two CNN parameters.

- Epochs and Batch Size. The model appears to fit the training dataset quickly. Explore
alternate configurations of the number of training epochs and batch size and use the test
dataset as a validation set to pick a better stopping point for training the model.

- Deeper Network. Explore whether a deeper network results in better skill, either in
terms of CNN layers, MLP layers, or both.

- Pre-Train an Embedding. Explore pre-training a Word2Vec word embedding in the
model and the impact on model skill with and without further fine tuning during training.

- Use GloVe Embedding. Explore loading the pre-trained GloVe embedding and the
impact on model skill with and without further fine tuning during training.

- Longer Test Reviews. Explore whether the skill of model predictions is dependent on
the length of movie reviews as suspected in the final section on evaluating the model.

- Train Final Model. Train a final model on all available data and use it to make predictions
on real ad hoc movie reviews from the internet.
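
As a starting point for the truncated sequences idea, the sketch below summarizes the distribution of sequence lengths and then forces every sequence to the mean length. The encoded reviews are hypothetical, and `truncate_or_pad` is our own helper rather than a Keras function.

```python
from statistics import mean, median

def truncate_or_pad(sequence, length, pad_value=0):
    """Cut a sequence to `length` or pad it at the end with pad_value."""
    clipped = sequence[:length]
    return clipped + [pad_value] * (length - len(clipped))

# Hypothetical integer-encoded reviews of very different lengths.
encoded = [[5, 2, 9], [7, 1], [3, 3, 8, 4, 6, 2, 1, 9, 5, 7]]

lengths = [len(seq) for seq in encoded]
print('min=%d median=%d mean=%.1f max=%d'
      % (min(lengths), median(lengths), mean(lengths), max(lengths)))

target = round(mean(lengths))          # truncate to the mean length
fixed = [truncate_or_pad(seq, target) for seq in encoded]
print(fixed)
```

This helper keeps the start of each review and drops the end. With Keras, `pad_sequences()` truncates from the start by default, so you would pass `truncating='post'` to get the same behavior.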