# Emojify with Embeddings

Here, in this tutorial, we will be applying LSTM model to predict the emoji for various sentences. We will do this using pre-trained embeddings. This is quite similar to what one do in Text Sentiment Analysis where we will be predicting Sentiments in the form of Emojis.<br> <br>
Let's first import all the required packages that we will be using.

In [1]:
using Flux
using Flux:onehot,crossentropy,onecold
using CSV,MLBase
using Base.Iterators: repeated
using DataFrames,StatsBase
using PyCall,BSON
using Embeddings
using WordTokenizers
using MLDataPattern
using Random
using MLDataUtils

┌ Info: CUDAdrv.jl failed to initialize, GPU functionality unavailable (set JULIA_CUDA_SILENT or JULIA_CUDA_VERBOSE to silence or expand this message)
└ @ CUDAdrv /home/blackforest/.julia/packages/CUDAdrv/mCr0O/src/CUDAdrv.jl:69


Now, once we have imported the required packages, it's time to load our data from the CSV file.
Here our dataset file 'train.csv' consist of certain sentences and then the marking to the respective emoji class index which can be applied to that sentence

In [2]:
cd("./Downloads/emojify")
train = CSV.read("train_emoji.csv",header =["Text","Classifier","Col3","Col4"])[:,[1,2]];
first(train,6)



Unnamed: 0_level_0,Text,Classifier
Unnamed: 0_level_1,String,Int64
1,never talk to me again,3
2,I am proud of your achievements,2
3,It is the worst day in my life,3
4,Miss you so much,0
5,food is life,4
6,I love you mum,0


In [3]:
X = train[:,1]
Y = train[:,2];

In [4]:
countmap(train[!,2])

Dict{Int64,Int64} with 5 entries:
  0 => 22
  4 => 17
  2 => 38
  3 => 36
  1 => 19

Well it's quite clear from above, that out dataset classes are skewed with two classes dominating over other. So here, we will be applying oversampling. What Oversample does is, it create copies of the less represented classes in the dataset and make all the classes equally represented

In [5]:
X_bal,Y_bal = oversample((X,Y),shuffle = true);

What does these indexes represent???<br>
Let's define a dictionary of specify what these indexes actually signify. <br>

<i>Note: This is taken from the dataset source</i>

In [6]:
emoji_dictionary = Dict{Int64,String}(0=>"💙",    # :heart: prints a black instead of red heart depending on the font
                    1=> "🎾",
                    2=> "😄",
                    3=> "😞",
                    4=> "🍴")

Dict{Int64,String} with 5 entries:
  0 => "💙"
  4 => "🍴"
  2 => "😄"
  3 => "😞"
  1 => "🎾"

Now, we need to change our target dataset as one hot label. Let's do that real quick

In [7]:
N = 5
Y_ = zeros(N,length(Y_bal))
for i in 1:length(Y_bal)
    Y_[Y_bal[i]+1,i] = 1
end

As specified earlier, we will use here pre-trained text embeddings based on GloVe dataset. If, you are running this for the first time, this may take a bit longer as GloVe need to get downloaded first. But once that's done, it won't be needed to download again. There will be only some loading time........

In [8]:
const embtable = load_embeddings(GloVe) # or load_embeddings(FastText_Text) or ...

const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))

function get_embedding(word)
    ind = get_word_index[word]
    emb = embtable.embeddings[:,ind]
    return emb
end

set_tokenizer(poormans_tokenize)

tokenize (generic function with 1 method)

First we will need to tokenize that sentences, which we will afterwards convert using text embeddings loaded above

In [9]:
X_ = [tokenize(lowercase(i)) for i in X_bal];

In [10]:
function tokenise(s)
    token_arr = []
    for c in s
        if (c in keys(get_word_index))==1
        push!(token_arr,get_embedding(c))
    else
        #print(c)
        push!(token_arr,get_embedding("unk"))
    
        end
    end
    return token_arr
end

tokenise (generic function with 1 method)

In [11]:
Xs = [tokenise(a) for a in X_];

In [12]:
max_length = 0
for i in 1:length(Xs)
    if max_length<length(Xs[i])
        max_length = length(Xs[i])
    end
end

Now, before moving further, we first need to convert the Data, an array of data elements.

In [116]:
#Converting array{array{array,1],1},1}-Xs to array{embedded matrix}  
X1 = []
for i in 1:length(Xs)
    m = fill(0.0,(length(Xs[i][1]),max_length))
    for j =1:length(Xs[i])
        for k in 1:length(Xs[i][j])
            m[k,j] = Xs[i][j][k]
        end
    end
    push!(X1,m)
end

In [117]:
Y1 = []
for i in 1:size(Y_)[2]
    push!(Y1,Y_[:,i])
end

## Normalisation
Normalisation is a very important aspect in order to improve the performance of any model. It allows model to converge more easily and 

In [118]:
for i in 1:length(X)
    X1[i] = Flux.normalise(X1[i],dims = 2)
end

In [119]:
X1,Y1 = shuffleobs((X1,Y1));

In [120]:
X_train = X1[1:176]
Y_train = Y1[1:176]
X_val = X1[161:end]
Y_val = Y1[161:end];

## Batching the data
As here, we will be using the Stochastic gradient process for training, so we will be updating our model after each element.
Hence, we will have just one element in each batch

In [121]:
batch = []  #batching of each element for Stochastic Gradient Descent as [(X1[1],Y1[1]),(X1[2],Y1[2].......)]
for i in 1:length(X_train)
    push!(batch,(X_train[i],[Y_train[i]]))
end

## Model
Once, we are done, It's time making our model. Here we will use Chain method of model declaration.

In [122]:
#Model
Scanner = Chain(LSTM(length(Xs[1][1]),128),LayerNorm(128),x->relu.(x),
        Dropout(0.5),        
        LSTM(128,128),
        LayerNorm(128),
        Dropout(0.5),
        x->relu.(x),
        Dense(128,N)
        ,softmax)

Chain(Recur(LSTMCell(50, 128)), LayerNorm(128), #35, Dropout(0.5), Recur(LSTMCell(128, 128)), LayerNorm(128), Dropout(0.5), #36, Dense(128, 5), softmax)

## Loss Function
In the loss function, we will be evaluating the crossentropy loss. Another important point to note here is that we will be reseting the hidden state, after evaluation of loss function

In [123]:
function loss(x,y)
    y_hat = Scanner.(x)
    l= crossentropy(y_hat[1][:,end],y[1])
    Flux.reset!(Scanner)
    return l
end

loss (generic function with 1 method)

In [124]:
loss(batch[1]...)

2.520803928375244

## Functions for performance analysis
This is the loss function that we will be using after each iteration to evaluate the performance of the model
Here, an important aspect to note is that after each model production, we need to reset the hidden state, as otherwise they may get transferred as initial state for next prediction, which is definitely not wha we want here.

In [125]:
function val_loss_(x,y)
    y_ = []
    for i in 1:length(x)
        push!(y_,Scanner.(x[i]))
        Flux.reset!(Scanner)
    end
    y_hat = y_
    l=0.0
    for i in 1:length(y_hat)
        l+= crossentropy(y_hat[i][length(x[i])][:,end],y[i])
    end
    return l/length(y_hat)
end

val_loss_ (generic function with 1 method)

In [126]:
function accuracy(x,y)
    y_hat = []
    for i in 1:length(x)
        push!(y_hat,Scanner.(x[i]))
        Flux.reset!(Scanner)
    end
    sum=0.0
    for i in 1:length(y)
        if (argmax(y_hat[i][length(x[i])][:,end])==argmax(y[i]))
        sum+=1.0
        end
    end
    return sum/length(y)
end

accuracy (generic function with 1 method)

## Training
Once we are done with all the preprocessing, let's begin the model training..........

In [None]:
opt = ADAM(0.0001,(0.9,0.999))
epochs = 100

best_acc = 0
@info("Beginning training loop...")

for i in 1:epochs
    global best_acc
    index = Random.randperm(length(batch))
    batch = batch[index]
    Flux.train!(loss, params(Scanner),batch, opt)
    loss_val = val_loss_(X_val,Y_val)
    accuracy_ = accuracy(X_val,Y_val)
    if i%10==0
        print("Epoch[$i]- Loss: $loss_val Accuracy: $accuracy_\n")
    end
    # If this is the best accuracy we've seen so far, save the model out
    if accuracy_ >best_acc
        @info("Epoch[$i]-> New best accuracy: $accuracy_ ! Saving model out to emojifier_norm.bson")
        BSON.@save joinpath(dirname(@__FILE__), "emojifier_norm.bson") Scanner max_length
        best_acc = accuracy_
    end
            
end

Once training is done, let's have a look at the performance on training data.<br><br>
Let's start with Model accuracy........

In [42]:
#Model evaluation on Training Data
BSON.@load "./emojifier_norm.bson" Scanner
accuracy(X1,Y1)

0.7526315789473684

In [43]:
y_label = []
for i in 1:length(X1)
    push!(y_label,Scanner(X1[i]))
    Flux.reset!(Scanner)
end

Now let's have a look at the confusion matrix of predictions

In [44]:
y_pred = []
for i in 1:length(y_label)
    push!(y_pred,Int64(argmax(y_label[i][:,10])-1))
    end
y_pred = Int64.(y_pred);


confusmat(5,Y_bal.+1,y_pred.+1)

5×5 Array{Int64,2}:
 4  4  7   4  19
 3  5  7   5  18
 7  4  6   5  16
 9  3  4  10  12
 7  5  6   3  17

As you can see that the model is highly biased towards the last class. This model need to be optimized further.