# LSTM-Lab
Tasks 7-9 are homework. Please hand in separate notebooks after:
1. Task 6 (i.e. you finished the lab)
1. Task 7
1. Task 8
1. Task 9

Homework is due on **October 30th**, before **8AM**.

In [None]:
using Knet, AutoGrad
using Knet: sigm_dot, tanh_dot

### Options

In [None]:
datafiles    = ["input.txt"]  # If provided, use first file for training, second for dev, others for test.
togenerate   = 500            # If non-zero generate given number of characters.
epochs       = 10             # Number of epochs for training.
hidden       = [128]          # Sizes of one or more LSTM layers.
embed        = 168            # Size of the embedding vector.
batchsize    = 128            # Number of sequences to train on in parallel
seqlength    = 20             # Maximum number of steps to unroll the network for bptt. Initial epochs will use the epoch number as bptt length for faster convergence.
seed         = -1             # Random number seed. -1 or 0 is no fixed seed
lr           = 1e-1           # Initial learning rate
gclip        = 3.0            # Value to clip the gradient norm at.
dpout        = 0.0            # Dropout probability.

In [None]:
seed > 0 && srand(seed)

# read text and report lengths
text = map(readstring, datafiles)
!isempty(text) && info("Chars read: $(map((f,c)->(basename(f),length(c)),datafiles,text))")

## Task-1: Create a dictionary by completing the createVocabulary function

The function `createVocabulary` takes `text`, an array of strings holding the contents of the files listed in `datafiles`, and returns a `vocabulary::Dict{Char,Int}` that maps each unique character to an integer index. In this lab the `text` array has length 1, for example `["content of input"]`. Note that for the sake of simplicity we do *NOT* use a validation or test dataset in this lab. You can try that after the lab by splitting your data into three different sets.
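As a toy illustration of the kind of mapping the function should return, the sketch below builds a vocabulary for a short string. The helper name `toy_vocabulary` and the sample text are illustrative only and are not part of the lab code.

```julia
# Hypothetical helper, shown only to illustrate the expected mapping.
function toy_vocabulary(s)
    vocab = Dict{Char,Int}()
    # unique keeps the first occurrence of each character, in order
    for (i, c) in enumerate(unique(s))
        vocab[c] = i
    end
    return vocab
end

toy = toy_vocabulary("hello")
println(toy)   # four entries, one each for 'h', 'e', 'l' and 'o'
```

The repeated `'l'` gets a single index, so the dictionary size equals the number of distinct characters.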

In [None]:
function createVocabulary(text)
    vocab = Dict{Char,Int}()
    # Your code starts here
    for (char_i,unique_character) in enumerate(unique(text[1]))
        vocab[Char(unique_character)] = char_i
    end
    # Your code ends here
    return vocab
end

In [None]:
vocab = createVocabulary(text)
info("$(length(vocab)) unique chars.") # The output should be 75 unique chars for input.txt

## LSTM Network function

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides/lec7.pdf

The function `lstm` is provided below.
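The gate layout used by `lstm` can be tried out without Knet. The sketch below is a plain-Julia approximation under stated assumptions: `sigm` is a local stand-in for the element-wise sigmoid, `toy_lstm_step` is a hypothetical name, and the sizes `B`, `X`, `H` are arbitrary toy values. One matrix multiplication produces all four gates, which are then sliced into blocks of width `H` in the order forget, input, output, change.

```julia
# Plain-Julia stand-in for the element-wise sigmoid (an assumption for this
# sketch; the lab code uses Knet's sigm_dot instead).
sigm(x) = 1 / (1 + exp(-x))

# One LSTM step on toy data: B sequences, X input features, H hidden units.
function toy_lstm_step(weight, bias, hidden, cell, input)
    gates   = hcat(input, hidden) * weight .+ bias   # B x 4H
    H       = size(hidden, 2)
    forget  = sigm.(gates[:, 1:H])
    ingate  = sigm.(gates[:, H+1:2H])
    outgate = sigm.(gates[:, 2H+1:3H])
    change  = tanh.(gates[:, 3H+1:end])
    cell    = cell .* forget .+ ingate .* change
    hidden  = outgate .* tanh.(cell)
    return hidden, cell
end

B, X, H = 2, 3, 4
weight = 0.1 .* randn(X + H, 4H)
bias   = zeros(1, 4H)
h, c   = toy_lstm_step(weight, bias, zeros(B, H), zeros(B, H), randn(B, X))
println(size(h), " ", size(c))   # both have size (B, H)
```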

In [None]:
function lstm(weight,bias,hidden,cell,input)
    gates   = hcat(input,hidden) * weight .+ bias
    hsize   = size(hidden,2)
    forget  = sigm_dot(gates[:,1:hsize])
    ingate  = sigm_dot(gates[:,1+hsize:2hsize])
    outgate = sigm_dot(gates[:,1+2hsize:3hsize])
    change  = tanh_dot(gates[:,1+3hsize:end])
    cell    = cell .* forget + ingate .* change
    hidden  = outgate .* tanh_dot(cell)
    return (hidden,cell)
end

## Task-2: Create Initial weights

`initweights` creates the weights and biases for the model, which is an LSTM network. We provide the `init` function (for weights) and the `bias` function (for biases).
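To make the expected shapes concrete, the sketch below lists them in the order the model vector stores them: a weight and a bias per LSTM layer, followed by the embedding matrix, the output weights, and the output bias. The helper `toy_weight_shapes` is hypothetical and only returns sizes; the example call uses the default options of this notebook together with the 75-character vocabulary expected for `input.txt`.

```julia
# Hypothetical helper that lists array shapes in the order used by the model vector.
function toy_weight_shapes(hidden, vocab, embed)
    shapes = Tuple{Int,Int}[]
    X = embed
    for H in hidden
        push!(shapes, (X + H, 4H))   # LSTM weights: input and hidden stacked, four gates
        push!(shapes, (1, 4H))       # LSTM bias, broadcast over the batch
        X = H                        # the next layer's input is this layer's hidden state
    end
    push!(shapes, (vocab, embed))        # embedding matrix, one row per character
    push!(shapes, (hidden[end], vocab))  # output weights
    push!(shapes, (1, vocab))            # output bias
    return shapes
end

for s in toy_weight_shapes([128], 75, 168)
    println(s)
end
```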

In [None]:
function initweights(hidden, vocab, embed)
    init(d...) = xavier(d...)
    bias(d...) = zeros(d...)
    model = Vector{Any}(2*length(hidden)+3)
    X = embed
    for k = 1:length(hidden)
        # Your code starts here
        H = hidden[k]
        model[2k-1] = init(X+H, 4H)
        model[2k] = bias(1, 4H)
        #model[2k][1:H] = 1 # forget gate bias = 1
        X = H
        # Your code ends here
    end
    model[end-2] = init(vocab,embed)
    model[end-1] = init(hidden[end],vocab)
    model[end] = bias(1,vocab)
    return model
end

## Task-3: Create Initial state

At each time step, we take the hidden state from the previous time step as input. To be able to do that, we first need to initialize the hidden state. The array created here also stores the updated hidden states. We initialize the state with zero matrices.
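As a rough picture of the layout, the state vector holds one hidden matrix and one cell matrix per layer, in the order `[h1, c1, h2, c2, ...]`, each of size batch by hidden. The sketch below uses a hypothetical helper `toy_initstate` that takes the layer sizes directly; the lab's `initstate` instead derives them from the bias shapes in `model` and reuses a cached zero array.

```julia
# Hypothetical helper: one zero hidden matrix and one zero cell matrix per layer.
function toy_initstate(hidden, batch)
    state = Any[]
    for H in hidden
        push!(state, zeros(batch, H))   # hidden state of this layer
        push!(state, zeros(batch, H))   # cell state of this layer
    end
    return state
end

state_toy = toy_initstate([128, 64], 2)
println(map(size, state_toy))
```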

In [None]:
let blank = nothing; global initstate
    function initstate(model, batch)
        nlayers = div(length(model)-3,2)
        state = Vector{Any}(2*nlayers)
        for k = 1:nlayers
            bias = model[2k]
            hidden = div(length(bias),4)
            if typeof(blank)!=typeof(bias) || size(blank)!=(batch,hidden)
                blank = fill!(similar(bias, batch, hidden),0)
            end
            state[2k-1] = state[2k] = blank
        end
        return state
    end
end

## Task-4: Create Predict function

`predict` takes `model` (the weights created in `initweights`), `state` (created in `initstate`), and an `input` of size batchsize × embed (the embedded characters). You need to implement `predict` for the LSTM, and you must use the `lstm` function provided above.

In [None]:
function predict(model, state, input; pdrop=0)
    nlayers = div(length(model)-3,2)
    newstate = similar(state)
    for k = 1:nlayers
        # Your code starts here
        input = dropout(input, pdrop)
        (newstate[2k-1],newstate[2k]) = lstm(model[2k-1],model[2k],state[2k-1],state[2k],input)
        input = newstate[2k-1]
        # Your code ends here
    end
    return input,newstate
end

## Generate and Sample function

`generate` creates text that is similar to our training data. We provide the `sample` function: once you have calculated the probabilities given the input, you can use it to pick the next character. `int2tok` is the same dictionary you created with `createVocabulary`, but in the reverse direction: it gives you the character for a given index.
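The sampling step can be checked on its own with plain Julia. In the sketch below, `toy_softmax` is a hypothetical stand-in for turning scores into probabilities, and `toy_sample` follows the same idea as `sample`: draw a uniform number and walk through the probabilities until the remainder drops below zero. The scores are arbitrary toy values.

```julia
# Plain-Julia softmax over a vector of scores (hypothetical helper for this sketch).
function toy_softmax(scores)
    e = exp.(scores .- maximum(scores))   # subtract the maximum for numerical stability
    return e ./ sum(e)
end

# Subtract probabilities from a uniform draw until it drops below zero.
function toy_sample(p)
    r = rand()
    for c in 1:length(p)
        r -= p[c]
        r < 0 && return c
    end
    return length(p)   # guard for rounding error when the walk never drops below zero
end

p = toy_softmax([2.0, 1.0, 0.1])
counts = zeros(Int, 3)
for _ in 1:10_000
    counts[toy_sample(p)] += 1
end
println(p)
println(counts ./ 10_000)   # empirical frequencies, which should be close to p
```

Characters with a higher probability are drawn more often, but every character with non-zero probability can appear, which is why generated text differs from run to run.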

In [None]:
function generate(model, tok2int, nchar)
    int2tok = Vector{Char}(length(tok2int))
    for (k,v) in tok2int; int2tok[v] = k; end
    input = tok2int[' ']
    state = initstate(model, 1)
    for t in 1:nchar
        embed = model[end-2][[input],:]
        ypred,state = predict(model,state,embed)
        ypred = ypred * model[end-1] .+ model[end]
        input = sample(exp.(logp(ypred)))
        print(int2tok[input])
    end
    println()
end

function sample(p)
    p = convert(Array,p)
    r = rand()
    for c = 1:length(p)
        r -= p[c]
        r < 0 && return c
    end
    return length(p)  # fallback when rounding leaves r >= 0 after the last entry
end

### Now, let's generate a random sample

In [None]:
model = initweights(hidden, length(vocab), embed)
state = initstate(model,1)

println("########## RANDOM MODEL OUTPUT ############")
generate(model, vocab, togenerate) ## change togenerate if you want longer sample text

We provide the `minibatch` function for you; you do not have to write it for this lab. We still suggest that you understand the idea, since you will need to do it in your own project and in future labs.
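To see the layout that `minibatch` produces, the sketch below applies the same indexing to a short string but keeps the characters instead of converting them to integers. The helper name `toy_minibatch` and the input string are illustrative only.

```julia
# Same indexing as `minibatch`, but keeping characters so the layout is visible.
function toy_minibatch(chars, batch_size)
    chars  = collect(chars)
    nbatch = div(length(chars), batch_size)
    return [[chars[(b - 1) * nbatch + n] for b in 1:batch_size] for n in 1:nbatch]
end

for step in toy_minibatch("abcdefgh", 2)
    println(step)
end
# Entry b of each step continues where entry b of the previous step stopped:
# stream 1 reads a, b, c, d and stream 2 reads e, f, g, h.
```

Each position in the batch is therefore a contiguous stream of the text, which lets the hidden state be carried over from one time step to the next. Characters that do not fill a complete step are dropped.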

In [None]:
function minibatch(chars, tok2int, batch_size)
    chars = collect(chars)
    nbatch = div(length(chars), batch_size)
    data = [zeros(Int,batch_size) for i=1:nbatch ]
    for n = 1:nbatch
        for b = 1:batch_size
            char = chars[(b-1)*nbatch + n]
            data[n][b] = tok2int[char]
        end
    end
    return data
end

## Task-5: Create loss function

This might be the hardest part of this lab. Implement an appropriate loss function.
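One way to read the loss is as the average negative log-probability that the model assigns to the correct next character. The sketch below shows, on a hand-written toy table, how a linear index of the form `i + (gold-1)*nrows` picks entry `(i, gold)` out of a column-major matrix. The names `logp_toy`, `golds`, `picked` and `toy_loss` and all numbers are assumptions made for illustration.

```julia
# Toy log-probability table: 3 predictions (rows) over a 4-character vocabulary (columns).
logp_toy = log.([0.7 0.1 0.1 0.1;
                 0.2 0.5 0.2 0.1;
                 0.1 0.1 0.1 0.7])
golds = [1, 2, 4]            # index of the correct next character for each row
nrows = size(logp_toy, 1)

# Julia arrays are column-major, so entry (i, j) sits at linear index i + (j-1)*nrows.
index  = [i + (golds[i] - 1) * nrows for i in 1:length(golds)]
picked = logp_toy[index]     # log-probabilities assigned to the gold characters

toy_loss = -sum(picked) / length(golds)
println(toy_loss)            # average negative log-likelihood per character
```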

In [None]:
function loss(model, state, sequence, range=1:length(sequence)-1; newstate=nothing, pdrop=0)
    preds = []
    for t in range
        input = model[end-2][sequence[t],:]
        pred,state = predict(model,state,input; pdrop=pdrop)
        push!(preds,pred)
    end
    if newstate != nothing
        copy!(newstate, map(AutoGrad.getval,state))
    end
    pred0 = vcat(preds...)
    pred1 = dropout(pred0,pdrop)
    pred2 = pred1 * model[end-1]
    pred3 = pred2 .+ model[end]
    logp1 = logp(pred3,2)
    nrows,ncols = size(pred3)
    golds = vcat(sequence[range[1]+1:range[end]+1]...)
    index = similar(golds)
    @inbounds for i=1:length(golds)
        index[i] = i + (golds[i]-1)*nrows
    end
    logp2 = logp1[index]
    logp3 = sum(logp2)
    return -logp3 / length(golds)
end

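# Knet magic: grad (from AutoGrad) returns a function that computes the
# gradient of loss with respect to its first argument, the model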
lossgradient = grad(loss)

function avgloss(model, sequence, S)
    T = length(sequence)
    B = length(sequence[1])
    state = initstate(model, B)
    total = count = 0
    for i in 1:S:T-1
        j = min(i+S-1,T-1)
        n = j-i+1
        total += n * loss(model, state, sequence, i:j; newstate=state)
        count += n
    end
    return total / count
end

## Task-6: Create Train function

Implement the BPTT (backpropagation through time) function for training. You only need to fill in about three lines (or even fewer). You need to use the `lossgradient` and `update!` functions.
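Training processes the sequence in chunks of at most `S` steps: gradients are computed within a chunk, and only the state values are carried over to the next chunk. The sketch below lists the chunk boundaries for a toy sequence; `toy_chunks` is a hypothetical helper, and `T = 10`, `S = 4` are arbitrary values.

```julia
# Chunk boundaries for a sequence of T steps and a chunk length of S.
function toy_chunks(T, S)
    chunks = UnitRange{Int}[]
    for i in 1:S:T-1
        j = min(i + S - 1, T - 1)   # the last input is step T-1, because step T is only a target
        push!(chunks, i:j)
    end
    return chunks
end

println(toy_chunks(10, 4))   # three chunks: 1:4, 5:8 and 9:9
```

The final chunk is shorter whenever `T-1` is not a multiple of `S`, which is why the upper bound is clamped with `min`.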

In [None]:
function train(model, sequence, optim, S; pdrop=0)
    T = length(sequence)
    B = length(sequence[1])
    state = initstate(model, B)
    for i in 1:S:T-1
        # Your code starts here
        j = min(i+S-1,T-1)
        grads = lossgradient(model, state, sequence, i:j; newstate=state, pdrop=pdrop)
        update!(model, grads, optim)
        # Your code ends here
    end
end

### Now we are ready. First let's see the initial loss

In [None]:
data =  map(t->minibatch(t, vocab, batchsize), text)
# Print the loss of randomly initialized model.
losses = map(d->avgloss(model,d,100), data)
println((:epoch,0,:loss,losses...))

### Below is the training loop for the RNN (with Adam)

In [None]:
optim = map(x->Adam(lr=lr, gclip=gclip), model)
# MAIN LOOP
for epoch=1:epochs
    @time train(model, data[1], optim, min(epoch,seqlength); pdrop=dpout)
    # Calculate and print the losses after each epoch
    losses = map(d->avgloss(model,d,100),data)
    println((:epoch,epoch,:loss,losses...))
end

### Once you have checked that the loss is decreasing, let's create some text with our model

In [None]:
println("########## FINAL  MODEL OUTPUT ############")
state = initstate(model,1)
generate(model, vocab, togenerate)

## Task-7 (Homework)

For simplicity we have removed the GPU support from this notebook.
What would need to change so that you can run your training on a GPU? (hint: `KnetArray`)

Change the notebook to work on the GPU or the CPU depending on a global parameter that switches between `Array` and `KnetArray`.

## Task-8 (Homework)
Analyse the performance of this code. You might have to convert the notebook to a single source file before doing so (see nbconvert).

1. Read the performance tips https://docs.julialang.org/en/latest/manual/performance-tips/
2. Use `@code_warntype` to check that you don't have type-instabilities in your code.
3. [`ProfileView.jl`](https://github.com/timholy/ProfileView.jl) and the [profiler](https://docs.julialang.org/en/latest/manual/profile/) are your friends.
4. Where are memory allocations happening? (Use the memory allocation tracker.)
    

## Task-9 (Homework)

In this notebook we used standard LSTM units, but there are many variants (see http://colah.github.io/posts/2015-08-Understanding-LSTMs/):

- [GRU](https://arxiv.org/pdf/1406.1078v3.pdf)
- [peephole-LSTM](ftp://ftp.idsia.ch/pub/juergen/TimeCount-IJCNN2000.pdf)
- [Depth Gated RNN](https://arxiv.org/pdf/1508.03790v2.pdf)
- [Clockwork RNN](https://arxiv.org/pdf/1402.3511v1.pdf)
- ...

Take a look at [*Greff, et al. (2015)*](https://arxiv.org/pdf/1503.04069.pdf) and [*Jozefowicz, et al. (2015)*](http://proceedings.mlr.press/v37/jozefowicz15.pdf) and **choose** a different LSTM model and implement it.

Particularly ambitious students can choose a model from https://distill.pub/2016/augmented-rnns/

For the GRU, take a look at http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/