# Sequence classification model for IMDB Sentiment Analysis
(c) Deniz Yuret, 2019
* Objectives: Learn the structure of the IMDB dataset and train a simple RNN model.
* Prerequisites: [RNN models](60.rnn.ipynb)

In [1]:
# Set display width, load packages, import symbols
ENV["COLUMNS"] = 72
using Statistics: mean
using IterTools: ncycle
using FileIO: load, save
using JSON
using Knet: Knet, AutoGrad, RNN, param, dropout, minibatch, nll, accuracy, progress!, adam, gc

In [2]:
# Set constants for the model and training
EPOCHS=3          # Number of training epochs
BATCHSIZE=64      # Number of instances in a minibatch
EMBEDSIZE=125     # Word embedding size
NUMHIDDEN=100     # Hidden layer size
MAXLEN=150        # Maximum sequence length: pad shorter sequences, truncate longer ones
VOCABSIZE=30000   # Maximum vocabulary size: keep the most frequent 30K, map the rest to the UNK token
NUMCLASS=2        # Number of output classes
DROPOUT=0.5       # Dropout rate
LR=0.001          # Learning rate
BETA_1=0.9        # Adam optimization parameter
BETA_2=0.999      # Adam optimization parameter
EPS=1e-08         # Adam optimization parameter

1.0e-8
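
The `LR`, `BETA_1`, `BETA_2` and `EPS` constants above are the usual Adam hyperparameters. As a rough illustration of what they control, here is a minimal sketch of one textbook Adam update in plain Julia. The function `adam_step!` is a hypothetical helper written for this note, not Knet's implementation, which may differ in details.

```julia
# Hypothetical helper: one textbook Adam update for a single parameter array.
# w: parameters, g: gradient, m/v: running first/second moment estimates, t: step count.
function adam_step!(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    @. m = beta1 * m + (1 - beta1) * g        # update first moment (mean of gradients)
    @. v = beta2 * v + (1 - beta2) * g^2      # update second moment (mean of squared gradients)
    mhat = m ./ (1 - beta1^t)                 # bias correction for the zero initialization
    vhat = v ./ (1 - beta2^t)
    @. w -= lr * mhat / (sqrt(vhat) + eps)    # scaled step; eps guards against division by zero
    return w
end

w = [1.0, -2.0]          # toy parameters
g = [0.5, -0.5]          # toy gradient
m = zero(w); v = zero(w) # moment estimates start at zero
adam_step!(w, g, m, v, 1)
```

On the very first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by roughly `lr` in the direction opposite to its gradient sign.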

## Load and view data

In [3]:
include(Knet.dir("data","imdb.jl"))   # defines imdb loader

imdb

In [4]:
@doc imdb

```
imdb()
```

Load the IMDB Movie reviews sentiment classification dataset from https://keras.io/datasets and return (xtrn,ytrn,xtst,ytst,dict) tuple.

# Keyword Arguments:

  * url=https://s3.amazonaws.com/text-datasets: where to download the data (imdb.npz) from.
  * dir=Pkg.dir("Knet/data"): where to cache the data.
  * maxval=nothing: max number of token values to include. Words are ranked by how often they occur (in the training set) and only the most frequent words are kept. nothing means keep all, equivalent to maxval = vocabSize + pad + stoken.
  * maxlen=nothing: truncate sequences after this length. nothing means do not truncate.
  * seed=0: random seed for sample shuffling. Use system seed if 0.
  * pad=true: whether to pad short sequences (padding is done at the beginning of sequences). pad_token = maxval.
  * stoken=true: whether to add a start token to the beginning of each sequence. start_token = maxval - pad.
  * oov=true: whether to replace words >= oov_token with oov_token (the alternative is to skip them). oov_token = maxval - pad - stoken.
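
The three token formulas in the keyword list above can be checked with a few lines of arithmetic. This sketch only evaluates those formulas for `maxval=30000` with the default `pad=true` and `stoken=true`; the variable names are chosen for illustration.

```julia
maxval = 30000                 # same value as VOCABSIZE in this notebook
pad, stoken = true, true       # defaults from the docstring

pad_token   = maxval                   # 30000
start_token = maxval - pad             # 29999 (Bool counts as 0 or 1 in arithmetic)
oov_token   = maxval - pad - stoken    # 29998

special = (oov_token, start_token, pad_token)
```

So the three largest ids are reserved for the out-of-vocabulary, start and pad tokens, in that order.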


In [5]:
@time (xtrn,ytrn,xtst,ytst,imdbdict)=imdb(maxlen=MAXLEN,maxval=VOCABSIZE);

┌ Info: Loading IMDB...
└ @ Main /home/deniz/.julia/dev/Knet/data/imdb.jl:57


  6.632559 seconds (26.86 M allocations: 1.400 GiB, 8.81% gc time)


In [6]:
println.(summary.((xtrn,ytrn,xtst,ytst,imdbdict)));

25000-element Array{Array{Int32,1},1}
25000-element Array{Int8,1}
25000-element Array{Array{Int32,1},1}
25000-element Array{Int8,1}
Dict{String,Int32} with 88584 entries


In [7]:
# Words are encoded with integers
rand(xtrn)'

1×150 LinearAlgebra.Adjoint{Int32,Array{Int32,1}}:
 30000  30000  30000  30000  29999  …  437  33  67  4389  65  19  228

In [8]:
# Each word sequence is padded or truncated to length 150
length.(xtrn)'

1×25000 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 150  150  150  150  150  150  150  …  150  150  150  150  150  150
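
To make the fixed-length encoding concrete, here is a minimal sketch of padding and truncation. `padtrunc` is a hypothetical helper, not the loader's code. Padding at the beginning follows the docstring; keeping the last `maxlen` tokens when truncating is an assumption, suggested by the long sample reviews in this notebook starting mid-sentence.

```julia
# Hypothetical helper: force a token sequence to exactly maxlen entries.
function padtrunc(x::Vector{<:Integer}, maxlen::Int, padtoken::Integer)
    n = length(x)
    if n >= maxlen
        return x[n-maxlen+1:end]    # assumption: keep the last maxlen tokens
    else
        # pad at the beginning so the real tokens sit at the end of the sequence
        return vcat(fill(eltype(x)(padtoken), maxlen - n), x)
    end
end

short = padtrunc([1, 2, 3], 5, 9)                # padded on the left
long  = padtrunc([1, 2, 3, 4, 5, 6, 7], 5, 9)    # first two tokens dropped
```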

In [9]:
# Define a function that can print the actual words:
imdbvocab = Array{String}(undef,length(imdbdict))
for (k,v) in imdbdict; imdbvocab[v]=k; end
imdbvocab[VOCABSIZE-2:VOCABSIZE] = ["<unk>","<s>","<pad>"]
function reviewstring(x,y=0)
    x = x[x.!=VOCABSIZE] # remove pads
    """$(("Sample","Negative","Positive")[y+1]) review:\n$(join(imdbvocab[x]," "))"""
end

reviewstring (generic function with 2 methods)
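
The same decoding idea can be tried on a toy vocabulary. This is a minimal sketch with made-up names (`toydict`, `toyvocab`, `TOYVOCAB`) that inverts a word-to-id dictionary, reserves the last three ids for the special tokens, and drops pads before joining the words.

```julia
toydict = Dict("good" => 1, "bad" => 2, "film" => 3)   # word => id, like imdbdict
TOYVOCAB = 6                                           # ids 4, 5, 6 are reserved

# Invert the dictionary into an id => word lookup array
toyvocab = Array{String}(undef, TOYVOCAB)
for (k, v) in toydict; toyvocab[v] = k; end
toyvocab[TOYVOCAB-2:TOYVOCAB] = ["<unk>", "<s>", "<pad>"]

ids = [6, 6, 5, 1, 3, 4]                               # pad pad <s> good film <unk>
decoded = join(toyvocab[ids[ids .!= TOYVOCAB]], " ")   # remove pads, then look up words
```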

In [10]:
# Hit Ctrl-Enter to see random reviews:
r = rand(1:length(xtrn))
println(reviewstring(xtrn[r],ytrn[r]))

Negative review:
money to see i won't comment on the story itself it's a wonderful classic but here it feels like a soap opera to start with the acting except for eric bana is soap opera quality i've always been a fan of brad pitt but here every actor on the bold and the beautiful puts him to shame the camera action doesn't help either how it lingers on him when he's thinking it just takes me back to brooke <unk> days in the lab peter o'toole has either had a really bad plastic surgery or he is desperately in need of one either way he looks more like linda evans than linda evans and to end my comments diane kruger is a cute girl but she sure is no helen of troy peterson should rather have chosen saffron burrows for the role since elizabeth taylor would be rather miscast by now


In [11]:
# Here are the labels: 1=negative, 2=positive
ytrn'

1×25000 LinearAlgebra.Adjoint{Int8,Array{Int8,1}}:
 2  1  1  2  2  2  1  1  2  1  2  …  2  2  1  1  2  2  2  2  1  1  1

## Define the model

In [12]:
struct SequenceClassifier; input; rnn; output; pdrop; end

In [13]:
SequenceClassifier(input::Int, embed::Int, hidden::Int, output::Int; pdrop=0) =
    SequenceClassifier(param(embed,input), RNN(embed,hidden,rnnType=:gru), param(output,hidden), pdrop)

SequenceClassifier

In [14]:
function (sc::SequenceClassifier)(input)
    embed = sc.input[:, permutedims(hcat(input...))]
    embed = dropout(embed,sc.pdrop)
    hidden = sc.rnn(embed)
    hidden = dropout(hidden,sc.pdrop)
    return sc.output * hidden[:,:,end]
end

(sc::SequenceClassifier)(input,output) = nll(sc(input),output)
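
The indexing in the first line of the forward pass is easier to follow with shapes written out. The sketch below uses a plain array in place of the embedding parameter and toy sizes; `W`, `batch` and `idx` are illustrative names only.

```julia
E, V = 4, 10                          # toy embedding size and vocabulary size
W = reshape(collect(1.0:E*V), E, V)   # stand-in for the E×V embedding matrix
batch = [[1, 2, 3], [4, 5, 6]]        # B=2 sequences, each of length T=3

idx = permutedims(hcat(batch...))     # hcat gives T×B, permutedims gives B×T
embed = W[:, idx]                     # indexing with a B×T matrix gives E×B×T
last_step = embed[:, :, end]          # E×B slice at the final time step
```

Here `embed[:, b, t]` is the embedding column of token `t` in sequence `b`. The model applies the same `[:, :, end]` slice to the RNN output to pick the final hidden state. Since padding is placed at the beginning of each sequence, that final position holds a real token rather than a pad.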

## Experiment

In [15]:
dtrn = minibatch(xtrn,ytrn,BATCHSIZE;shuffle=true)
dtst = minibatch(xtst,ytst,BATCHSIZE)
length.((dtrn,dtst))

(390, 390)
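
The batch count follows from integer division. The sketch below assumes that `minibatch` drops the final partial batch by default, which is consistent with the 390 reported above.

```julia
ninstances, batchsize, epochs = 25000, 64, 3

nbatches = div(ninstances, batchsize)          # full batches per epoch
leftover = ninstances - nbatches * batchsize   # instances that do not fill a batch
updates  = nbatches * epochs                   # total training iterations over all epochs
```

With these values there are 390 batches per epoch, 40 instances left over, and 1170 updates in total, which matches the iteration count shown by the progress bar during training.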

In [16]:
# For running experiments
function trainresults(file,maker; o...)
    if (print("Train from scratch? "); startswith(readline(), "y"))  # empty input counts as "no"
        model = maker()
        progress!(adam(model,ncycle(dtrn,EPOCHS);lr=LR,beta1=BETA_1,beta2=BETA_2,eps=EPS))
        save(file,"model",model)
        GC.gc(true) # To save gpu memory
    else
        isfile(file) || download("https://github.com/denizyuret/Knet.jl/releases/download/v1.4.9/$file",file)
        model = load(file,"model")
    end
    return model
end

trainresults (generic function with 1 method)

In [17]:
maker() = SequenceClassifier(VOCABSIZE,EMBEDSIZE,NUMHIDDEN,NUMCLASS,pdrop=DROPOUT)
# model = maker()
# nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)
# (0.69312066f0, 0.69312423f0, 0.5135817307692307, 0.5096153846153846)

maker (generic function with 1 method)
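
The untrained loss quoted in the comments above is close to 0.693, which is log(2): a two-class model that assigns equal scores to both classes pays log(2) per instance. The sketch below checks this with a hand-written negative log-likelihood. `nll_sketch` is a hypothetical helper for illustration, not Knet's `nll`.

```julia
using Statistics: mean

# Hypothetical helper: mean negative log-likelihood of the correct class.
# scores is C×B (classes × instances), labels holds class ids in 1:C.
function nll_sketch(scores::AbstractMatrix, labels::AbstractVector{<:Integer})
    m = maximum(scores, dims=1)                               # subtract max for stability
    logp = scores .- m .- log.(sum(exp.(scores .- m), dims=1)) # log-softmax over classes
    return -mean(logp[labels[j], j] for j in eachindex(labels))
end

scores = zeros(2, 4)        # an uninformative model: equal scores for both classes
labels = [1, 2, 2, 1]       # 1=negative, 2=positive, as in ytrn
loss = nll_sketch(scores, labels)
```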

In [18]:
model = trainresults("imdbmodel149.jld2",maker);
# ┣████████████████████┫ [100.00%, 1170/1170, 00:15/00:15, 76.09i/s]
# nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)
# (0.05217469f0, 0.3827392f0, 0.9865785256410257, 0.8576121794871795)

Train from scratch? stdin> y


┣████████████████████┫ [100.00%, 1170/1170, 00:21/00:21, 54.84i/s] 
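
The accuracy figures quoted in the comments of the cell above are the fraction of instances whose highest-scoring class equals the label. A minimal sketch of that computation on toy scores follows; `accuracy_sketch` is a hypothetical helper, not Knet's `accuracy`.

```julia
using Statistics: mean

# Hypothetical helper: fraction of columns whose argmax matches the label.
function accuracy_sketch(scores::AbstractMatrix, labels::AbstractVector{<:Integer})
    preds = [argmax(view(scores, :, j)) for j in axes(scores, 2)]  # predicted class per instance
    return mean(preds .== labels)
end

scores = [2.0 0.1 0.3;      # row 1: score for class 1 (negative)
          1.0 0.9 0.2]      # row 2: score for class 2 (positive)
labels = [1, 2, 2]
acc = accuracy_sketch(scores, labels)   # predictions are 1, 2, 1, so two of three are right
```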


## Playground

In [19]:
predictstring(x)="\nPrediction: " * ("Negative","Positive")[argmax(Array(vec(model([x]))))]
UNK = VOCABSIZE-2
str2ids(s::String)=[(i=get(imdbdict,w,UNK); i>=UNK ? UNK : i) for w in split(lowercase(s))]

str2ids (generic function with 1 method)
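
The lookup logic of `str2ids` can be tried on a toy dictionary. In this sketch `toydict`, `TOYUNK` and `toyids` are made-up names. Words that are missing from the dictionary, or whose id falls in the reserved range, both map to the unknown token.

```julia
toydict = Dict("the" => 1, "movie" => 2, "great" => 3, "rare" => 9)
TOYUNK = 5     # ids >= 5 are treated as out of vocabulary in this toy setup

# Same pattern as str2ids: lowercase, split on whitespace, look up with a default
toyids(s::String) = [(i = get(toydict, w, TOYUNK); i >= TOYUNK ? TOYUNK : i)
                     for w in split(lowercase(s))]

ids = toyids("The movie was GREAT rare")
```

Because the text is split on whitespace only, a word with punctuation attached (for example `movie.`) is looked up as a different string and would also map to the unknown token.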

In [20]:
# Here we can see predictions for random reviews from the test set; hit Ctrl-Enter to sample:
r = rand(1:length(xtst))
println(reviewstring(xtst[r],ytst[r]))
println(predictstring(xtst[r]))

Negative review:
fest and draw my applause br br i love philosophical films this isn't one of them anyone who is amazed at the depths of intellect <unk> in this film hasn't read a good book lately or ever the thought provoking dialogue is trite at best perhaps it lost something in the translation br br i love a good horror comedy this isn't one of them laugh i thought i'd never start squirm only when trying to think of a polite way to phrase my feedback of the film to the friend who recommended it br br rupert is <unk> good in the setting of this film but even he cannot resurrect it i only wish he had shot the director instead if the zombies br br for shame that the land that gave rise to the inferno should also give rise to this dante would be spinning in his grave

Prediction: Negative


In [21]:
# Here the user can enter their own reviews and classify them:
println(predictstring(str2ids(readline(stdin))))

stdin> i cannot recommend this movie

Prediction: Negative
