### Best Practices for Document Classification with Deep Learning

Text classification describes a general class of problems such as predicting the sentiment of tweets and movie reviews, as well as classifying email as spam or not.

Deep learning methods are proving very good at text classification, achieving state-of-the-art results on a suite of standard academic benchmark problems.

## 1. Word Embeddings + CNN = Text Classification

The standard approach to text classification uses a word embedding to represent words and a Convolutional Neural Network (CNN) to learn how to discriminate between documents on classification problems.


The architecture therefore comprises three key pieces:

- Word Embedding: A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.
- Convolutional Model: A feature extraction model that learns to extract salient features from documents represented using a word embedding.
- Fully Connected Model: The interpretation of extracted features in terms of a predictive output.

## 2. Use a Single Layer CNN Architecture

You can get good results for document classification with a single-layer CNN, perhaps with differently sized kernels across the filters to allow grouping of word representations at different scales.

The general approach to using a CNN for natural language processing is as follows. Sentences are mapped to embedding vectors and presented to the model as a matrix. Convolutions are performed across the input word-wise using differently sized kernels, such as 2 or 3 words at a time. The resulting feature maps are then processed by a max pooling layer to condense or summarize the extracted features.
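
To make the mechanics concrete, here is a minimal pure-Python sketch of word-wise convolution with two kernel sizes followed by max pooling. It is an illustration only: the random "embedding" values, the toy dimensions, and the helper names `conv1d_over_words` and `global_max_pool` are assumptions made for this example, not part of any library.

```python
import random

def conv1d_over_words(embedded, kernel):
    """Slide one kernel over a sentence, word by word, and return its feature map.

    embedded: list of word vectors (one list of floats per word)
    kernel:   list of weight vectors spanning len(kernel) consecutive words
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        # Dot product between the kernel weights and the window of word vectors.
        activation = sum(
            w * x
            for k_row, x_row in zip(kernel, window)
            for w, x in zip(k_row, x_row)
        )
        feature_map.append(max(0.0, activation))  # ReLU non-linearity
    return feature_map

def global_max_pool(feature_map):
    """Condense a feature map to its single strongest response."""
    return max(feature_map)

random.seed(0)
embedding_dim = 4    # toy size, chosen only for illustration
sentence_length = 7  # toy size, chosen only for illustration
# Random stand-in for an embedded sentence: 7 words, 4 dimensions each.
sentence = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
            for _ in range(sentence_length)]

features = []
for kernel_size in (2, 3):  # look at 2 or 3 words at a time
    kernel = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
              for _ in range(kernel_size)]
    fmap = conv1d_over_words(sentence, kernel)
    features.append(global_max_pool(fmap))

# One pooled feature per kernel, ready to feed a fully connected layer.
print(len(features))
```

In Keras, the same idea would typically be expressed with parallel `Conv1D` layers of different `kernel_size`, each followed by `GlobalMaxPooling1D`, with the outputs concatenated before the dense layers.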

## 3. Dial in CNN Hyperparameters

Some hyperparameters matter more than others when tuning a convolutional neural network on your document classification problem.

The study makes a number of useful findings that could be used as a starting point for configuring shallow CNN models for text classification.

The general findings were as follows:

The best choice between pre-trained word2vec and GloVe embeddings differs from problem to problem, and both performed better than one-hot encoded word vectors.

The size of the kernel is important and should be tuned for each problem.

The number of feature maps is also important and should be tuned.

1-max pooling generally outperformed other types of pooling.

Dropout had little effect on model performance.

They go on to provide more specific heuristics, as follows:

- Use word2vec or GloVe word embeddings as a starting point and tune them while fitting the model.
- Grid search across different kernel sizes, in the range 1-10, to find the optimal configuration for your problem.
- Search the number of filters from 100-600 and explore a dropout of 0.0-0.5 as part of the same search.
- Explore using tanh, relu, and linear activation functions.

The key caveat is that the findings are based on empirical results on binary text classification problems using single sentences as input.
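
The search ranges in these heuristics can be enumerated with a small grid search. The sketch below is one possible way to set it up: the step sizes within each range are assumptions, and `evaluate` is a hypothetical user-supplied function that would train a model for a given configuration and return a validation score.

```python
from itertools import product

# Ranges follow the heuristics above; the step sizes are assumptions.
search_space = {
    "kernel_size": list(range(1, 11)),          # 1-10
    "filters": [100, 200, 300, 400, 500, 600],  # 100-600
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],  # 0.0-0.5
    "activation": ["tanh", "relu", "linear"],
}

def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(space, evaluate):
    """Return the best (score, config) pair, where higher scores are better.

    evaluate is a hypothetical callable: config dict -> validation score.
    """
    best_score, best_config = None, None
    for config in iter_configs(space):
        score = evaluate(config)
        if best_score is None or score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

configs = list(iter_configs(search_space))
print(len(configs))  # 10 * 6 * 6 * 3 = 1080 combinations
```

A full grid of this size means training over a thousand models, so in practice it may be worth searching kernel size first and then refining the number of filters and dropout around the best value.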

## 4. Consider Character-Level CNNs

Text documents can be modeled at the character level using convolutional neural networks that are capable of learning the relevant hierarchical structure of words, sentences, paragraphs, and more.
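
At the character level, the input to the network is a sequence of character indices rather than word indices. The following is a minimal sketch of that encoding step, assuming an illustrative alphabet and a hypothetical helper `encode_chars`; the alphabet and sequence length used in any particular paper may differ.

```python
import string

# Assumed alphabet for illustration: lowercase letters, digits, punctuation, space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

def encode_chars(text, max_chars):
    """Map text to a fixed-length list of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in text.lower()[:max_chars]]
    # Pad short texts with zeros so every document has the same length.
    return indices + [0] * (max_chars - len(indices))

encoded = encode_chars("Great movie!", max_chars=16)
print(encoded)
```

The resulting fixed-length index sequences can be passed to an embedding or one-hot layer in the same way as word indices, with the convolutional layers left to learn word-like and phrase-like patterns from the characters.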

## 5. Consider Deeper CNNs for Classification

Better performance can be achieved with very deep convolutional neural networks, although standard and reusable architectures have not yet been adopted for classification tasks.

Alexis Conneau, et al. comment on the relatively shallow networks used for natural language processing compared with the much deeper networks that have succeeded in computer vision applications. For example, Kim restricted the model to a single convolutional layer.

Other architectures used for natural language reviewed in the paper are limited to 5 and 6 layers. These are contrasted with successful architectures used in computer vision with 19 or even up to 152 layers.

They suggest and demonstrate that there are benefits for hierarchical feature learning with a very deep convolutional neural network model, which they call VDCNN.

## A Standard Model

This type of model can be defined in the Keras Python deep learning library. The snippet below shows an example of a deep learning model for classifying text documents as one of two classes.

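```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# define problem
vocab_size = 100  # number of distinct tokens in the vocabulary
max_length = 200  # each document is padded or truncated to this many tokens
# define model
model = Sequential()
# declare the input shape explicitly so the model is built before summary()
model.add(Input(shape=(max_length,)))
# word embedding: map each token id to a 100-dimensional vector
model.add(Embedding(vocab_size, 100))
# convolutional model: extract features from windows of 8 tokens
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
# fully connected model: interpret the features and predict one of two classes
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```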
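
The model above expects each document as a sequence of `max_length` integer token ids, each smaller than `vocab_size`. The sketch below shows one simple way such input could be prepared in plain Python. The helpers `build_vocab` and `encode`, the whitespace tokenization, and the choice of id 0 for padding and id 1 for unknown words are all assumptions made for illustration.

```python
def build_vocab(documents, vocab_size):
    """Assign integer ids to the most frequent words.

    Id 0 is reserved for padding and id 1 for unknown words (an assumption
    of this sketch), so at most vocab_size - 2 words receive their own id.
    """
    counts = {}
    for doc in documents:
        for word in doc.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; ties broken alphabetically for reproducibility.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 2 for i, word in enumerate(ranked[:vocab_size - 2])}

def encode(doc, vocab, max_length):
    """Convert a document to a fixed-length list of token ids."""
    ids = [vocab.get(word, 1) for word in doc.lower().split()][:max_length]
    # Pad short documents with zeros up to max_length.
    return ids + [0] * (max_length - len(ids))

docs = ["a great movie", "a terrible movie"]
vocab = build_vocab(docs, vocab_size=100)
X = [encode(d, vocab, max_length=200) for d in docs]
print(X[0][:5])
```

Converted to an array, `X` would have shape `(number of documents, max_length)` and could be passed to `model.fit` together with an array of 0/1 class labels.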