<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/ann/cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A simple CNN for text classification

Based on an [example from the Keras team](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/text_classification_from_scratch.ipynb).



This notebook uses a popular neural network API, [Keras](https://keras.io/), to build a simple CNN classifier, and runs it over movie reviews from [IMDb - the Internet Movie Database](https://www.imdb.com/). These reviews are available as a pre-prepared dataset that Keras can download for you. The dataset is also available from the [Large Movie Review Dataset page](http://ai.stanford.edu/~amaas/data/sentiment/).

The dataset is constructed from very polarised reviews, and has been used in text classification evaluations for several years.

Here's an example positive review:

> I went to an advance screening of this movie thinking I was about to embark on 120 minutes of cheezy lines, mindless plot, and the kind of nauseous acting that made "The Postman" one of the most malignant displays of cinematic blundering of our time. But I was shocked. Shocked to find a film starring Costner that appealed to the soul of the audience. Shocked that Ashton Kutcher could act in such a serious role. Shocked that a film starring both actually engaged and captured my own emotions. Not since 'Robin Hood' have I seen this Costner: full of depth and complex emotion. Kutcher seems to have tweaked the serious acting he played with in "Butterfly Effect". These two actors came into this film with a serious, focused attitude that shone through in what I thought was one of the best films I've seen this year. No, its not an Oscar worthy movie. It's not an epic, or a profound social commentary film. Rather, its a story about a simple topic, illuminated in a way that brings that audience to a higher level of empathy than thought possible. That's what I think good film-making is and I for one am throughly impressed by this work. Bravo!

And here's a negative review example:

> It hurt to watch this movie, it really did... I wanted to like it, even going in. Shot obviously for very little cash, I looked past and told myself to appreciate the inspiration. Unfortunately, although I did appreciate the film on that level, the acting and editing was terrible, and the last 25-30 minutes were severe thumb-twiddling territory. A 95 minute film should not drag. The ratings for this one are good so far, but I fear that the friends and family might have had a say in that one. What was with those transitions? Dear Mr. Editor, did you just purchase your first copy of Adobe Premiere and make it your main goal to use all the goofy transitions that come with that silly program? Anyway... some better actors, a little more passion, and some more appealing editing and this makes a decent movie.


### A note on performance
From the original code comments: This example demonstrates the use of Convolution1D for text classification. It gets to 0.89 test accuracy after 2 epochs. Speed:
* 90s/epoch on Intel i5 2.4 GHz CPU.
* 10s/epoch on Tesla K40 GPU.


### Packages

First, the imports. You will need Keras and, even though it is not imported directly, a neural network backend for Keras to use, i.e. TensorFlow. Make sure you have this available, and make sure it is compatible with the version of Keras you are using. If you use the latest Keras and the latest TensorFlow, you should be OK.

Note if running locally: in order for the visualisation to work, you will need to have pydot and graphviz installed, e.g. 

```
sudo apt-get install graphviz
pip3 install pydot
```

In [None]:
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb

# For displaying
from keras.utils import plot_model
from IPython.display import SVG
from keras.utils import model_to_dot
import matplotlib.pyplot as plt

# For processing example texts into one-hot vectors
import nltk
import numpy as np
from nltk.corpus import stopwords
from keras.preprocessing import text

### Parameters

Now let's set up some parameters, such as the number of features, embedding dimensions, batch size and number of epochs.

In [None]:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2

### Data

Let's load the data; we will pad it out afterwards so that all examples are the same length. In the data, each review is labelled with an integer value of either 0 (a negative review) or 1 (a positive review).

In [None]:
# We load our training examples into x_train, and their labels into y_train
# We also have some test data (which we will use in development), in x_test
# and y_test
print('Loading data...\n')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print('\nData loaded.')


How much data do we have? What does it look like? Write some code below to take a look.

In [None]:
# Write code here to see how many examples we have in each dataset,
# and to see what the examples and labels look like

Why does the data look like this? What is going on? If you are not sure, take a look at the [Keras imdb dataset documentation](https://keras.io/api/datasets/imdb/).
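
Once you have formed your own answer, the sketch below shows one way to turn an encoded review back into words. It assumes the `imdb.load_data` defaults described in the Keras documentation (0 for padding, 1 for start of sequence, 2 for out-of-vocabulary, and real word ranks shifted up by 3). To stay self-contained it uses a tiny made-up dictionary, `toy_word_index`, in place of the real one returned by `imdb.get_word_index()`; `decode_review` is a hypothetical helper written for this example.

```python
# Toy stand-in for the dictionary returned by imdb.get_word_index(),
# which maps each word to its frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "and": 2, "film": 3, "was": 4, "great": 5}


def decode_review(encoded, word_index, index_from=3):
    """Map a list of integer ids back to a space-separated string.

    Assumes the imdb.load_data defaults: ids 0, 1 and 2 are reserved,
    and every real word id is its rank plus index_from.
    """
    specials = {0: "<pad>", 1: "<start>", 2: "<oov>"}
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    words = []
    for i in encoded:
        if i in specials:
            words.append(specials[i])
        else:
            # ids with no entry in the dictionary are shown as out-of-vocabulary
            words.append(id_to_word.get(i, "<oov>"))
    return " ".join(words)


encoded_review = [1, 4, 6, 7, 8, 2]
print(decode_review(encoded_review, toy_word_index))
# prints: <start> the film was great <oov>
```

With the real dataset you would pass `imdb.get_word_index()` as the second argument and one row of `x_train` as the first.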

Next, we will pad our data so that each example is the same length for our CNN.

In [None]:
print('Pad sequences (samples x time)')
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
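
To see what padding does on a small scale, here is a pure-Python imitation of `pad_sequences` with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). It is a sketch for illustration only, not the Keras source; `pad_like_keras_default` and `toy_sequences` are made-up names.

```python
def pad_like_keras_default(sequences, maxlen, value=0):
    """Imitate pad_sequences defaults: pad and truncate at the front.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                 # keep only the last maxlen items
        fill = [value] * (maxlen - len(kept))      # zeros to add at the front
        padded.append(fill + kept)
    return padded


toy_sequences = [[11, 12], [21, 22, 23, 24, 25, 26]]
for row in pad_like_keras_default(toy_sequences, maxlen=4):
    print(row)
# prints:
# [0, 0, 11, 12]
# [23, 24, 25, 26]
```

Note that a review longer than `maxlen` loses its beginning, not its end, under these defaults.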


Write some code to take a look at the padded data.

In [None]:
# Write code here to look at the padded data

## Building the model

Next, we build the model.

In [None]:
print('Build model...\n')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))

# we add a Conv1D layer, which will learn `filters` word-group
# filters, each spanning kernel_size adjacent words:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Finished building model.\n')
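
To build some intuition for the two central layers, here is a minimal NumPy sketch of what a 1D convolution with `'valid'` padding and a ReLU activation computes, followed by global max pooling. It makes simplifying assumptions (a single example, no batching, random toy weights) and is not how Keras implements these layers; `conv1d_valid_relu` and the `toy_*` names are made up for this example.

```python
import numpy as np


def conv1d_valid_relu(x, kernels, bias):
    """Slide each filter over the sequence and apply ReLU.

    x:       (timesteps, embedding_dims) word vectors for one review
    kernels: (kernel_size, embedding_dims, filters)
    bias:    (filters,)
    """
    kernel_size, _, n_filters = kernels.shape
    n_windows = x.shape[0] - kernel_size + 1       # 'valid' padding, stride 1
    out = np.empty((n_windows, n_filters))
    for t in range(n_windows):
        window = x[t:t + kernel_size]              # kernel_size adjacent word vectors
        # sum over kernel positions and embedding dims, one value per filter
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                    # ReLU


rng = np.random.default_rng(0)
toy_x = rng.normal(size=(6, 4))             # 6 words, 4-dimensional embeddings
toy_kernels = rng.normal(size=(3, 4, 2))    # kernel_size 3, 2 filters
toy_bias = np.zeros(2)

feature_map = conv1d_valid_relu(toy_x, toy_kernels, toy_bias)
pooled = feature_map.max(axis=0)            # global max pooling over time

print(feature_map.shape)  # (4, 2)
print(pooled.shape)       # (2,)
```

Each filter ends up contributing a single number per review: its strongest response to any group of `kernel_size` adjacent words, wherever that group occurs.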

## Take a look at the model

Keras can print out textual and graphical representations of a model that tell us:

* The layers in the model, in the order in which they appear in the model
* The output shape - i.e. the size of the matrices passed between layers. In some layers, the final dimension will be the number of units; in CNN layers, it will be the number of filters.
* Parameters - this is the number of weights in each layer
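
As a cross-check on the parameter column, the weight counts can be worked out by hand from the parameter values set earlier in this notebook. The arithmetic below is a sketch that assumes each `Conv1D` and `Dense` layer has one bias per filter or unit, and that the dropout, pooling and activation layers have no weights.

```python
# Values copied from the Parameters section of this notebook
max_features, maxlen, embedding_dims = 5000, 400, 50
filters, kernel_size, hidden_dims = 250, 3, 250

# One embedding vector per vocabulary index
embedding_params = max_features * embedding_dims

# Each filter spans kernel_size words x embedding_dims values, plus one bias
conv_params = kernel_size * embedding_dims * filters + filters

# 'valid' padding with stride 1 shortens the sequence by kernel_size - 1
conv_timesteps = maxlen - kernel_size + 1

# Dense layers: inputs x units weights, plus one bias per unit
hidden_params = filters * hidden_dims + hidden_dims
output_params = hidden_dims * 1 + 1

total_params = embedding_params + conv_params + hidden_params + output_params

print(embedding_params)  # 250000
print(conv_params)       # 37750
print(conv_timesteps)    # 398
print(hidden_params)     # 62750
print(output_params)     # 251
print(total_params)      # 350751
```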

Let's take a look at our model...


In [None]:
# summary() prints the table itself and returns None, so no print() is needed
model.summary()

We can also visualise the model as a graph.

In [None]:
SVG(model_to_dot(model).create(prog='dot', format='svg'))

## Train the model

Now let's train it. Keras will validate against our test data, showing us loss and accuracy as it goes. We will save our metrics so we can display them afterwards.

In [None]:
history = model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_test, y_test))

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
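
To make the two reported numbers concrete, here is a small pure-Python sketch of how binary cross-entropy and accuracy can be computed from sigmoid outputs. The labels and probabilities are made up, and the clipping constant `eps` and the 0.5 decision threshold are assumptions chosen to resemble common defaults, not values taken from this notebook.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)          # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)


toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]   # hypothetical sigmoid outputs

print('Toy loss:', binary_crossentropy(toy_labels, toy_probs))
print('Toy accuracy:', accuracy(toy_labels, toy_probs))  # 0.5
```

The loss penalises confident mistakes more heavily than hesitant ones, which is why loss and accuracy do not always move together during training.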


## Visualise the training process

OK, but how did that change over time?
(Thanks to [Jason Brownlee](https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/) for this next bit of code)

In [None]:
import matplotlib.pyplot as plt

# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

What do you think of the results? How good was it after 1 epoch? Is it going to improve much more if you run more epochs?