In this tutorial we will classify sentences (news headlines) as sarcastic or not sarcastic, using word embeddings together with a convolutional neural network (CNN). I won't go deep into word embeddings, but I will cover the basics.


## Why not use one-hot encoding

In one-hot encoding, each word is represented by a vector whose length equals the total number of words in the vocabulary. With 1000 words, each word is a vector of length 1000, so the full feature matrix is 1000 x 1000. This encoding also ignores any semantic relation between words: every pair of one-hot vectors is equally far apart, so 'cat' is no closer to 'tiger' than it is to 'pencil'.
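
To make both points concrete, here is a minimal sketch with a made-up three-word vocabulary (the words and their order are assumptions chosen only for illustration). It shows that the one-hot matrix is square in the vocabulary size and that every pair of distinct words has the same similarity of zero.

```python
import numpy as np

# Toy vocabulary; the words and their order are made up for illustration.
vocab = ["cat", "tiger", "pencil"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


cat, tiger, pencil = one_hot
sim_cat_tiger = cosine(cat, tiger)
sim_cat_pencil = cosine(cat, pencil)

print(one_hot.shape)    # (3, 3): the matrix is vocab_size x vocab_size
print(sim_cat_tiger)    # 0.0
print(sim_cat_pencil)   # 0.0, the same as for cat and tiger
```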

## Word embeddings

In word embeddings, each word is represented by a dense vector, and words with similar meanings end up with vectors that are close to each other. For example, the similarity score (such as cosine similarity) between 'dog' and 'puppy' will be close to 1 because the two words mean almost the same thing. Similarly, 'cat' and 'tiger' should score high because both are animals.
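
One common way to measure how close two word vectors are is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors, which are purely hypothetical; real embeddings are learned during training and usually have many more dimensions.

```python
import numpy as np

# Hand-picked toy vectors (hypothetical values, not learned embeddings).
embeddings = {
    "dog":    np.array([0.90, 0.80, 0.10]),
    "puppy":  np.array([0.85, 0.75, 0.20]),
    "pencil": np.array([0.10, 0.00, 0.90]),
}


def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_dog_puppy = cosine(embeddings["dog"], embeddings["puppy"])
sim_dog_pencil = cosine(embeddings["dog"], embeddings["pencil"])

# 'dog' vs 'puppy' scores close to 1, 'dog' vs 'pencil' scores much lower.
print(round(sim_dog_puppy, 3), round(sim_dog_pencil, 3))
```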

### How to choose embedding size

A word vector has the same length as the embedding size, which we choose as a hyperparameter. For a part-of-speech (POS) problem, where we only care about how a word relates to parts of speech such as verb or noun, a smaller size is enough: there are only about 35 POS tags, so a word can be represented in fewer dimensions. Choosing the embedding size is a tradeoff between accuracy and computational cost:
* Representing a word in a higher dimension (larger embedding size) generally increases accuracy.
* A higher-dimensional representation needs more computational power.

Typically the embedding size is chosen between 50 and 200.

In [43]:
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt



The JSON data was downloaded from the internet. It contains URLs, headlines, and classes (sarcastic or not sarcastic).
There are 2 classes, so this is a binary classification problem. We only fetch the headline and the class label from each record and append them to lists.

In [4]:
with open("sarcasm.json", 'r') as f:
    datastore = json.load(f)

In [6]:
sentences = [] 
labels = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    

In [23]:
len(sentences)  # Total number of sentences in our data

26709

In [8]:
training_size=2000   # Here we are using first 2000 sentences for training. You can change this number
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

The next step is to convert the sentences into word tokens, which we can do with the Keras API. Here we use one hyperparameter, 'vocab_size': the maximum number of distinct words the tokenizer keeps. Suppose we train our model on 2 sentences:
* Cat is a pet animal
* Dog is a faithful animal

Counting them gives a total of 7 unique words, so we could use a 'vocab_size' of 7. If 'vocab_size' is 100 and the training data contains 110 unique words, only the most frequent ones (roughly 100) are kept.

Another parameter used here is 'oov_tok'. It defines how to handle words that are outside the vocabulary: such words are replaced by a placeholder string. Here we use 'unknown_text' for out-of-vocabulary words.
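
The idea can be sketched in plain Python. This is a simplified stand-in written for illustration, not the Keras implementation: the helper names `build_word_index` and `to_sequence` are hypothetical, and punctuation handling is left out. It ranks words by frequency, reserves ID 1 for the out-of-vocabulary token, and maps any word whose ID is not below the vocabulary limit to that token.

```python
from collections import Counter


def build_word_index(sentences, oov_token="unknown_text"):
    """Rank words by frequency; ID 1 is reserved for the OOV token (0 is left for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    ranked = [word for word, _ in counts.most_common()]  # most frequent first
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


def to_sequence(sentence, word_index, num_words, oov_token="unknown_text"):
    """Replace each word by its ID, or by the OOV ID if unknown or outside the limit."""
    oov_id = word_index[oov_token]
    ids = []
    for word in sentence.lower().split():
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids


train = ["Cat is a pet animal", "Dog is a faithful animal"]
word_index = build_word_index(train)
print(len(word_index))  # 8: the 7 unique words plus the OOV token

# 'horse' was never seen, so it becomes the OOV ID 1.
full_vocab = to_sequence("Horse is a faithful animal", word_index, num_words=100)
# With a small limit, the rarer word 'faithful' is also mapped to the OOV ID.
small_vocab = to_sequence("Horse is a faithful animal", word_index, num_words=5)
print(full_vocab)
print(small_vocab)
```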

In [28]:
vocab_size = 1000
oov_tok = "unknown_text"
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index  # Maps every word seen in training to a unique ID (not capped at vocab_size)
print(len(word_index))
print(word_index)


6854


The IDs from the code above now have to be arranged into sequences. Suppose our sentence is 'Horse is beautiful', where 'horse' has ID 56, 'is' has ID 45, and 'beautiful' has ID 89; the sequence will be [56, 45, 89]. Note that the tokenizer automatically lowercases all words, so 'Horse' and 'horse' get the same ID.
We will use the following parameters:

**maxlen**: The model needs inputs of a fixed length, so this parameter sets the maximum length of a sequence. For example, if we train on two sentences, one with 10 words and the other with 24 words, we can choose 24 as 'maxlen'.

**padding_type**: If a sentence has fewer words than 'maxlen', it is padded with zeros. With 'post', zeros are added at the end of the sentence; with 'pre', they are added at the start.

**truncating**: If a sentence is longer than 'maxlen', it is truncated. With 'post', words are removed from the end; with 'pre', they are removed from the start.
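
The effect of these options on a single sequence can be sketched in plain Python. The helper `pad_or_truncate` below is a hypothetical, simplified stand-in for what `pad_sequences` does to each sequence, assuming `maxlen` is at least 1; it is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return a list of exactly maxlen IDs, padding with `value` or cutting as needed."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops IDs from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the zeros in front, 'post' puts them after the sentence.
    return pad + seq if padding == "pre" else seq + pad


short = [56, 45, 89]
long = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short, 5, padding="post"))     # [56, 45, 89, 0, 0]
print(pad_or_truncate(short, 5, padding="pre"))      # [0, 0, 56, 45, 89]
print(pad_or_truncate(long, 5, truncating="post"))   # [1, 2, 3, 4, 5]
print(pad_or_truncate(long, 5, truncating="pre"))    # [3, 4, 5, 6, 7]
```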


In [31]:
trunc_type='post'
padding_type='post'
max_length = 120


training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)


print(training_padded[0])
print(training_padded.shape)





[952   1 711   1   1  30 576   1   1   6   1   1   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0]
(2000, 120)


In [30]:
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

The Embedding layer is the first hidden layer of the network. It takes three arguments:

* input_dim: The size of the vocabulary in the text data. For example, if your data is integer-encoded to values between 0 and 10, the vocabulary size is 11 words.
* output_dim: The size of the vector space in which words are embedded, that is, the length of the output vector this layer produces for each word. It could be 32, 100, or even larger; test different values for your problem.
* input_length: The length of the input sequences, as you would define for any input layer of a Keras model. For example, if every input document consists of 1000 words, this would be 1000.
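
Conceptually, an embedding layer is a lookup table: a matrix with one row per word ID, indexed by the padded sequences. The NumPy sketch below illustrates only the shapes involved, using the values from this tutorial (1000, 16, 120); the random matrix and random batch are stand-ins, since Keras initialises and learns the real weights.

```python
import numpy as np

vocab_size, embedding_dim, max_length = 1000, 16, 120

rng = np.random.default_rng(0)
# Stand-in for the trainable weight matrix: one row of length embedding_dim per word ID.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 padded sequences of word IDs (random here, for illustration only).
batch = rng.integers(0, vocab_size, size=(2, max_length))

# The lookup replaces every word ID with its row of the matrix.
embedded = embedding_matrix[batch]

print(embedding_matrix.size)  # 16000 values to learn (input_dim * output_dim)
print(embedded.shape)         # (2, 120, 16): (batch, input_length, output_dim)
```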


In [35]:
embedding_dim = 16

In [36]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])



In [37]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           16000     
_________________________________________________________________
conv1d (Conv1D)              (None, 116, 128)          10368     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 24)                3096      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 29,489
Trainable params: 29,489
Non-trainable params: 0
_________________________________________________________________
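
The numbers in the summary can be checked by hand. The arithmetic below is a small sketch that assumes the Keras defaults for Conv1D (stride 1, 'valid' padding) and one bias per filter or unit; the variable names are chosen here for illustration.

```python
vocab_size, embedding_dim, max_length = 1000, 16, 120
filters, kernel_size = 128, 5
dense_units = 24

# Embedding: one vector of length embedding_dim per word ID.
embedding_params = vocab_size * embedding_dim                   # 16000

# Conv1D: each filter spans kernel_size steps x embedding_dim channels, plus a bias.
conv_params = kernel_size * embedding_dim * filters + filters   # 10368
# With 'valid' padding and stride 1 the sequence shrinks by kernel_size - 1.
conv_output_length = max_length - kernel_size + 1               # 116

# GlobalMaxPooling1D has no weights and leaves one value per filter.
dense_params = filters * dense_units + dense_units              # 3096
output_params = dense_units * 1 + 1                             # 25

total_params = embedding_params + conv_params + dense_params + output_params
print(conv_output_length)  # 116
print(total_params)        # 29489
```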


In [41]:


training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

In [42]:
history = model.fit(training_padded, training_labels, epochs=50, validation_data=(testing_padded, testing_labels), verbose=1)

Train on 2000 samples, validate on 24709 samples
Epoch 1/50
...
Epoch 50/50


In [44]:
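def plot_graphs(history, string):
  # Older TensorFlow 1.x releases record accuracy under 'acc'/'val_acc', so asking
  # for 'accuracy' raises KeyError: 'accuracy' there. Fall back to the short name.
  if string == 'accuracy' and string not in history.history:
    string = 'acc'
  plt.plot(history.history[string])           # training curve
  plt.plot(history.history['val_'+string])    # validation curve
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')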

In [None]:
model.save("sarcasm_cnn.h5")