# Sequence classification model for IMDB Sentiment Analysis
(c) Deniz Yuret, 2018

* Objectives: Learn the structure of the IMDB dataset and train a simple RNN model.
* Prerequisites: RNN models (06.rnn.ipynb), param, GRU, nll, minibatch, accuracy, Adam, train!
* Knet: dir (used by imdb.jl)

In [1]:
using Pkg
for p in ("Knet","ProgressMeter")
    Base.find_package(p) === nothing && Pkg.add(p)   # Pkg.installed() is deprecated in recent Julia versions
end

In [2]:
EPOCHS=3          # Number of training epochs
BATCHSIZE=64      # Number of instances in a minibatch
EMBEDSIZE=125     # Word embedding size
NUMHIDDEN=100     # Hidden layer size
MAXLEN=150        # Maximum sequence length: shorter sequences are padded, longer ones truncated
VOCABSIZE=30000   # Maximum vocabulary size: keep the most frequent 30K, map the rest to the UNK token
NUMCLASS=2        # Number of output classes
DROPOUT=0.0       # Dropout rate
LR=0.001          # Learning rate
BETA_1=0.9        # Adam optimization parameter
BETA_2=0.999      # Adam optimization parameter
EPS=1e-08         # Adam optimization parameter

1.0e-8
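
The `LR`, `BETA_1`, `BETA_2` and `EPS` values above are the usual Adam hyperparameters. As a rough illustration of what they control, here is a minimal sketch of one textbook Adam step for a single scalar weight. The function name `adamstep` is hypothetical, and this is not Knet's implementation, which may differ in detail.

```julia
# One textbook Adam update for a scalar weight (a sketch, not Knet's code).
# w: weight, g: gradient, m/v: running first/second moment estimates, t: step number.
function adamstep(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g^2      # running mean of squared gradients
    mhat = m / (1 - beta1^t)               # bias correction for the zero initialization
    vhat = v / (1 - beta2^t)
    w -= lr * mhat / (sqrt(vhat) + eps)    # eps guards against division by zero
    return w, m, v
end

w, m, v = adamstep(1.0, 0.5, 0.0, 0.0, 1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly `lr` regardless of the gradient's scale.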

## Load and view data

In [3]:
using Knet: Knet
ENV["COLUMNS"]=92                     # column width for array printing
include(Knet.dir("data","imdb.jl"))   # defines imdb loader

imdb

In [4]:
@doc imdb

```
imdb()
```

Load the IMDB Movie reviews sentiment classification dataset from https://keras.io/datasets and return (xtrn,ytrn,xtst,ytst,dict) tuple.

# Keyword Arguments:

  * url=https://s3.amazonaws.com/text-datasets: where to download the data (imdb.npz) from.
  * dir=Pkg.dir("Knet/data"): where to cache the data.
  * maxval=nothing: max number of token values to include. Words are ranked by how often they occur (in the training set) and only the most frequent words are kept. nothing means keep all, equivalent to maxval = vocabSize + pad + stoken.
  * maxlen=nothing: truncate sequences after this length. nothing means do not truncate.
  * seed=0: random seed for sample shuffling. Use system seed if 0.
  * pad=true: whether to pad short sequences (padding is done at the beginning of sequences). pad_token = maxval.
  * stoken=true: whether to add a start token to the beginning of each sequence. start_token = maxval - pad.
  * oov=true: whether to replace words >= oov_token with oov_token (the alternative is to skip them). oov_token = maxval - pad - stoken.
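
The three formulas in the keyword list place the special tokens at the top of the id range. The following sketch just evaluates those formulas; `special_tokens` is a hypothetical helper written for illustration, not a function from imdb.jl.

```julia
# Evaluate the token-id formulas from the docstring (a sketch, not the imdb.jl source).
# Bool flags behave as 0/1 in integer arithmetic.
function special_tokens(maxval::Int; pad::Bool=true, stoken::Bool=true)
    pad_token   = maxval                   # pad_token = maxval
    start_token = maxval - pad             # start_token = maxval - pad
    oov_token   = maxval - pad - stoken    # oov_token = maxval - pad - stoken
    return (pad=pad_token, start=start_token, oov=oov_token)
end

toks = special_tokens(30000)
```

With `maxval=30000` and the default flags, the ids 29998, 29999 and 30000 are reserved for the out-of-vocabulary, start and pad tokens respectively.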


In [5]:
@time (xtrn,ytrn,xtst,ytst,imdbdict)=imdb(maxlen=MAXLEN,maxval=VOCABSIZE);

┌ Info: Loading IMDB...
└ @ Main /home/gridsan/dyuret/.julia/dev/Knet/data/imdb.jl:57


 11.300031 seconds (29.03 M allocations: 1.478 GiB, 6.11% gc time)


In [6]:
summary.((xtrn,ytrn,xtst,ytst,imdbdict))

("25000-element Array{Array{Int32,1},1}", "25000-element Array{Int8,1}", "25000-element Array{Array{Int32,1},1}", "25000-element Array{Int8,1}", "Dict{String,Int32} with 88584 entries")

In [7]:
# Words are encoded with integers
rand(xtrn)'

1×150 LinearAlgebra.Adjoint{Int32,Array{Int32,1}}:
 30000  30000  30000  30000  30000  30000  …  5813  12  97  718  177  1  3405  1916

In [8]:
# Each word sequence is padded or truncated to length 150
length.(xtrn)'

1×25000 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 150  150  150  150  150  150  150  150  150  …  150  150  150  150  150  150  150  150
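
To make the fixed-length behaviour concrete, here is a minimal sketch of left-padding and truncation. The helper name `fixlength` is hypothetical, and keeping the last `maxlen` tokens when truncating is an assumption; the docstring only states that padding is done at the beginning.

```julia
# Hypothetical helper: force a sequence of token ids to exactly `maxlen` entries.
function fixlength(x::Vector{<:Integer}, maxlen::Int, pad_token::Integer)
    n = length(x)
    if n >= maxlen
        return x[n-maxlen+1:end]    # assumption: keep the last `maxlen` tokens
    else
        # pad at the beginning, as described in the imdb docstring
        return vcat(fill(eltype(x)(pad_token), maxlen - n), x)
    end
end

short = fixlength(Int32[5, 6, 7], 5, 30000)
long  = fixlength(Int32[1, 2, 3, 4, 5, 6, 7], 5, 30000)
```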

In [9]:
# Define a function that can print the actual words:
imdbvocab = Array{String}(undef,length(imdbdict))
for (k,v) in imdbdict; imdbvocab[v]=k; end
imdbvocab[VOCABSIZE-2:VOCABSIZE] = ["<unk>","<s>","<pad>"]
function reviewstring(x,y=0)
    x = x[x.!=VOCABSIZE] # remove pads
    """$(("Sample","Negative","Positive")[y+1]) review:\n$(join(imdbvocab[x]," "))"""
end

reviewstring (generic function with 2 methods)

In [10]:
# Hit Ctrl-Enter to see random reviews:
r = rand(1:length(xtrn))
println(reviewstring(xtrn[r],ytrn[r]))

Positive review:
<s> i liked this movie i remember there was one very well done scene in this movie where riff randell played by p j soles is lying in her bed smoking pot and then she begins to visualize that the ramones are in the room with her sing the song i want you around very very cool stuff br br it was fun energetic quirky and cool yes i'll admit that the ending is way way over the top and far fetched but it doesn't matter because it is fun this is a very fun movie it's sex pot and rock n <unk> forever br br i read that cheap trick was the band who was originally to star in this but i do not know if this is true or not


In [11]:
# Here are the labels: 1=negative, 2=positive
ytrn'

1×25000 LinearAlgebra.Adjoint{Int8,Array{Int8,1}}:
 2  1  1  2  2  2  1  1  1  2  2  1  1  2  1  …  1  1  1  2  1  2  1  1  1  2  2  1  2  1

## Define the model

In [12]:
using Knet: param, dropout, RNN

In [13]:
struct SequenceClassifier; input; rnn; output; end

In [14]:
SequenceClassifier(input::Int, embed::Int, hidden::Int, output::Int) =
    SequenceClassifier(param(embed,input), RNN(embed,hidden,rnnType=:gru), param(output,hidden))

SequenceClassifier

In [15]:
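function (sc::SequenceClassifier)(input; pdrop=0)
    embed = sc.input[:, permutedims(hcat(input...))]  # look up embeddings: (embed, batch, time)
    embed = dropout(embed,pdrop)
    hidden = sc.rnn(embed)                            # GRU hidden states: (hidden, batch, time)
    hidden = dropout(hidden,pdrop)
    return sc.output * hidden[:,:,end]                # class scores from the last time step
end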
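
The embedding lookup in the first line of the forward function can be tried in plain Julia without Knet. The sketch below uses toy sizes and made-up token ids to show how a batch of equal-length sequences becomes a three-dimensional array.

```julia
# Plain-Julia sketch of the embedding lookup; all sizes and ids are toy values.
E = reshape(Float32.(1:12), 3, 4)     # embedding matrix: 3 dimensions × 4 vocabulary entries
batch = [[1, 2, 4], [3, 3, 1]]        # B=2 sequences with T=3 token ids each
idx = permutedims(hcat(batch...))     # B×T index matrix, one row per sequence
X = E[:, idx]                         # 3×B×T array: one embedding column per token
lastword = X[:, :, end]               # embeddings at the final time step, 3×B
```

Indexing a matrix with a colon and an index matrix yields one column per index, arranged in the shape of the index matrix. The same `[:,:,end]` slice is what the model applies to the RNN output to pick the last hidden state of each sequence.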

## Experiment

In [16]:
using Knet: minibatch
dtrn = minibatch(xtrn,ytrn,BATCHSIZE;shuffle=true)
dtst = minibatch(xtst,ytst,BATCHSIZE)
length.((dtrn,dtst))

(390, 390)
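
The batch count follows from integer division, assuming instances that do not fill a complete batch are left out (which is consistent with the 390 reported above):

```julia
ninstances, batchsize = 25000, 64
nbatches = div(ninstances, batchsize)    # number of full minibatches
leftover = rem(ninstances, batchsize)    # instances that do not fill a batch
```

Here `nbatches` is 390 and `leftover` is 40, so 24960 of the 25000 reviews are used per pass.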

In [17]:
# For running experiments
using Knet: train!, Adam
import ProgressMeter

function trainresults(file,model)
    if (print("Train from scratch? "); startswith(readline(), "y"))  # empty input counts as "no"
        updates = 0; prog = ProgressMeter.Progress(EPOCHS * length(dtrn))
        function callback(J)
            ProgressMeter.update!(prog, updates)
            return (updates += 1) <= prog.n
        end
        opt = Adam(lr=LR, beta1=BETA_1, beta2=BETA_2, eps=EPS)
        train!(model, dtrn; callback=callback, optimizer=opt, pdrop=DROPOUT)
        Knet.gc()
        Knet.save(file,"model",model)
    else
        isfile(file) || download("http://people.csail.mit.edu/deniz/models/tutorial/$file",file)
        model = Knet.load(file,"model")
    end
    return model
end

trainresults (generic function with 1 method)
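
The callback inside `trainresults` is a closure that counts updates and returns `false` once the limit is passed. The counting logic can be sketched on its own; `makecounter` is a hypothetical name, and the reading that training continues while the callback returns `true` is an assumption based on the code above.

```julia
# Hypothetical stand-alone version of the update counter used by the callback.
function makecounter(limit::Int)
    updates = 0                              # captured and mutated by the closure
    return () -> (updates += 1) <= limit     # true for the first `limit` calls
end

keepgoing = makecounter(3)
results = [keepgoing() for _ in 1:5]
```

With `limit = EPOCHS * length(dtrn)`, this kind of counter would allow three passes over the 390 training batches before signalling a stop.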

In [18]:
using Knet: nll, accuracy
model = SequenceClassifier(VOCABSIZE,EMBEDSIZE,NUMHIDDEN,NUMCLASS)
nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)

(0.6931865f0, 0.6931778f0, 0.48072916666666665, 0.48509615384615384)
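
The initial loss of about 0.693 is what a two-class model with no information should give, since log(2) ≈ 0.6931. The sketch below checks this with a small softmax-based loss; `avgnll` is a hypothetical helper written for illustration, not Knet's `nll`, and it omits the usual numerical-stability tricks.

```julia
# Hypothetical helper: average negative log likelihood from raw class scores.
# scores is (classes × instances), labels holds the correct class index per instance.
function avgnll(scores::AbstractMatrix, labels::AbstractVector{<:Integer})
    logp = scores .- log.(sum(exp.(scores), dims=1))    # log-softmax over classes
    return -sum(logp[labels[i], i] for i in eachindex(labels)) / length(labels)
end

scores = zeros(Float32, 2, 4)          # equal scores for both classes
loss = avgnll(scores, [1, 2, 2, 1])    # close to log(2)
```

The matching accuracy for such a model is near 0.5, which is also what the untrained classifier shows.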

In [19]:
model = trainresults("imdbmodel.jld2",model);

Train from scratch? stdin> y


Progress: 100%|█████████████████████████████████████████████████████| Time: 0:00:22


In [20]:
# Results from an earlier run (33s): (0.059155148f0, 0.3877507f0, 0.9846153846153847, 0.8583733974358975)
nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)

(0.07053138f0, 0.4179096f0, 0.9782451923076924, 0.8471153846153846)

## Playground

In [21]:
predictstring(x)="\nPrediction: " * ("Negative","Positive")[argmax(Array(vec(model([x]))))]
UNK = VOCABSIZE-2
str2ids(s::String)=[(i=get(imdbdict,w,UNK); i>=UNK ? UNK : i) for w in split(lowercase(s))]

str2ids (generic function with 1 method)
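
The out-of-vocabulary handling in `str2ids` can be seen with a toy dictionary. Everything in this sketch (`toydict`, `toyunk`, `toyids`) is hypothetical and only mirrors the logic of the one-liner above.

```julia
# Toy version of str2ids: unknown words and ids at or above the <unk> id both map to <unk>.
toydict = Dict("the" => 1, "movie" => 2, "great" => 3, "obscure" => 9)   # word => id
toyunk = 5                                                               # toy <unk> id
toyids(s) = [(i = get(toydict, w, toyunk); i >= toyunk ? toyunk : i) for w in split(lowercase(s))]

ids = toyids("The movie was GREAT obscure")
```

Here "was" is missing from the dictionary and "obscure" has an id above the cutoff, so both become 5. Because `split` only separates on whitespace, a word with attached punctuation such as "great!" would also be treated as unknown.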

In [22]:
# Here we can see predictions for random reviews from the test set; hit Ctrl-Enter to sample:
r = rand(1:length(xtst))
println(reviewstring(xtst[r],ytst[r]))
println(predictstring(xtst[r]))

Positive review:
and david <unk> who both give good turns pleasance in particular who shows just how great an actor he can be and highlights what a shame it is that he went on to waste himself in halloween films the unknown <unk> also gives a great performance in her role as patricia the movie is very mysterious for the first hour and really keeps the audience hooked when inspector discovers <unk> diary the film turns into more of a drama in which the girl's last actions are shown and while this section of the film is not as good as what went before it it's still interesting and leads into a great twist at the end overall blood relatives is a great film that really deserves to be better seen le is a better known effort from chabrol but for my money this is at least as good highly recommended viewing

Prediction: Positive


In [23]:
# Here the user can enter their own reviews and classify them:
println(predictstring(str2ids(readline(stdin))))

stdin> no

Prediction: Negative
