### Sarcasm Prediction using Deep Learning

This project classifies whether a given passage (here, a headline) is sarcastic or not. The data is stored in JSON format and loaded afterwards for processing.

#### Why Deep Learning?

The data to be processed is large, so processing it demands considerable memory and time. Deep learning tends to outperform other techniques when the data size is large. As per Andrew Ng, the chief scientist of China’s major search engine Baidu and one of the leaders of the Google Brain Project, “The analogy to deep learning is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.”

##### Note:

This project is built with Keras, a high-level neural networks API written in Python that runs on top of TensorFlow.


In [1]:
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import json

##### About the dataset

The dataset is saved as a JSON file, which is loaded using a context manager. The independent and dependent features, `headline` and `is_sarcastic` respectively, are stored in separate lists.

In [2]:
with open("sarcasm.json") as file:
    data = json.load(file)

In [3]:
headline = []
sarcastic = []
for content in data:
    headline.append(content["headline"])
    sarcastic.append(content["is_sarcastic"])

In [4]:
len(headline)

26709

#### Splitting the dataset

The dataset is split into a training set and a testing set. The model is fitted on the training set and validated on the testing set, which the model does not see during training. In this way, we can measure the performance of the model on unseen data.

Python's list slicing is used for this purpose.

In [5]:
training_size = 15000

In [6]:
training_data = headline[:training_size]
testing_data = headline[training_size:]
training_label = sarcastic[:training_size]
testing_label = sarcastic[training_size:]
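
To make the slicing step concrete, here is a minimal sketch on a toy list. The names `texts`, `labels` and `split_at` are made up for illustration and merely stand in for `headline`, `sarcastic` and `training_size`. Note that a plain slice does not shuffle, so this approach assumes the records are not ordered by label.

```python
# Toy stand-ins for the `headline` and `sarcastic` lists (made-up values).
texts = ["a", "b", "c", "d", "e"]
labels = [0, 1, 0, 1, 1]
split_at = 3  # plays the role of training_size

# Everything before the split index is used for training, the rest for testing.
train_texts, test_texts = texts[:split_at], texts[split_at:]
train_labels, test_labels = labels[:split_at], labels[split_at:]

print(train_texts, test_texts)    # ['a', 'b', 'c'] ['d', 'e']
print(train_labels, test_labels)  # [0, 1, 0] [1, 1]
```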

##### Keras Tokenizer

Machine learning models take vectors (arrays of numbers) as input. When working with text, we need a strategy to convert strings to numbers (that is, to "vectorize" the text) before feeding it to the model. The Tokenizer class vectorizes a text corpus (a collection of written texts) by turning each text into either a sequence of integers or a vector whose coefficient for each token can be binary, based on word count, or based on tf-idf (term frequency-inverse document frequency).

The Keras Tokenizer takes, among other arguments:
* num_words: Maximum number of words to keep, based on word frequency. Only the most common `num_words - 1` words are kept.
* oov_token: If given, it is added to the word index and used to replace out-of-vocabulary words during texts_to_sequences calls.

Refer to the [tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) documentation for the other input options.

In [7]:
vocabulary = 10000
oov_token = "<OOV>"

In [8]:
tokenizer = Tokenizer(num_words=vocabulary, oov_token=oov_token)

#### fit_on_texts

Updates the internal vocabulary based on a list of texts.

#### texts_to_sequences

Transforms each text in texts to a sequence of integers. Only the `num_words - 1` most frequent words are taken into account; since an `oov_token` is set here, every other word is replaced by the OOV index.

In [9]:
tokenizer.fit_on_texts(training_data)

In [10]:
training_data = tokenizer.texts_to_sequences(training_data) 

In [11]:
testing_data = tokenizer.texts_to_sequences(testing_data) 
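
The following is a simplified pure-Python sketch of what fitting and converting do conceptually. It is not the Keras implementation: the helper names `build_word_index` and `to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation. It does mirror two behaviours used above: index 1 is reserved for the OOV token, and any word whose index is `num_words` or higher is mapped to the OOV index.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Map words to indexes; unknown or too-rare words get the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        sequence = []
        for word in text.lower().split():
            index = word_index.get(word)
            if index is None or index >= num_words:
                index = oov_index
            sequence.append(index)
        sequences.append(sequence)
    return sequences


# Made-up corpus: "the" appears 3 times, "cat" twice, the rest once.
toy_corpus = ["the cat sat", "the cat ran", "the dog"]
toy_index = build_word_index(toy_corpus)

# With num_words=4 only indexes 1..3 survive ("<OOV>", "the", "cat").
print(to_sequences(["the cat flew"], toy_index, num_words=4))  # [[2, 3, 1]]
```

Because the word index is built from the training headlines only, words that appear solely in the testing headlines end up as the OOV index, which is the situation the model will face with genuinely new text.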

#### Keras Sequence Padding

This function transforms a list of integer sequences (a list of lists) into a 2D NumPy array of shape (number of sequences, maxlen).

pad_sequences takes, among other arguments:
* sequences: List of lists, where each element is a sequence.
* maxlen: Int, maximum length of all sequences.
* padding: String, 'pre' or 'post':
    pad either before or after each sequence.
* truncating: String, 'pre' or 'post':
    remove values from sequences larger than
    `maxlen`, either at the beginning or at the end of the sequences.

Refer to the [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) documentation for the other input options.

In [12]:
max_length = 100
trunc = 'post'
padding = 'post'

In [13]:
padded_training_data = pad_sequences(training_data, padding=padding, truncating=trunc, maxlen=max_length)

In [14]:
padded_testing_data = pad_sequences(testing_data, padding=padding, truncating=trunc, maxlen=max_length)
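
As a rough illustration of what `padding='post'` and `truncating='post'` mean, here is a pure-Python sketch. The helper `pad_post` is hypothetical and returns plain lists rather than a NumPy array; it pads with 0, which is also the default padding value of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then pad at the end."""
    padded = []
    for sequence in sequences:
        kept = list(sequence)[:maxlen]                        # truncating='post'
        padded.append(kept + [value] * (maxlen - len(kept)))  # padding='post'
    return padded


# One sequence shorter than maxlen and one longer (made-up indexes).
print(pad_post([[5, 3], [7, 2, 9, 4, 1]], maxlen=4))
# [[5, 3, 0, 0], [7, 2, 9, 4]]
```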

In [15]:
training_data = np.array(padded_training_data)
testing_data = np.array(padded_testing_data)
training_label = np.array(training_label)
testing_label = np.array(testing_label)

#### Keras Sequential Model

The `Sequential` model is a linear stack of layers. You can create a `Sequential` model by passing a list of layer instances to the constructor, or by adding layers one at a time with `add`.

##### Embedding Layer

Turns positive integers into dense vectors of fixed size. This layer can only be used as the first layer in a model.

The Embedding layer takes, among other arguments:
* input_dim: Int, size of the vocabulary, i.e. maximum integer index + 1.
* output_dim: Int, dimension of the dense embedding.
* input_length: Length of the input sequences, when it is constant.

Check the [embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) documentation for other details.

##### GlobalAveragePooling1D

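The 1D global average pooling layer takes, for each sample, a 2-dimensional tensor of size (input size) x (input channels) and computes the average of all the (input size) values for each of the (input channels). In this model it reduces each padded headline of shape (100, 16) to a single vector of 16 values.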
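
The embedding lookup and the pooling step can be sketched in plain Python as follows. The table values and the helper names `embed` and `global_average_pool` are made up for illustration; in the real layer the table is learned during training. Since this model does not mask the padding index, padded positions would also take part in the average.

```python
# Hypothetical embedding table: 4 vocabulary entries, 2 dimensions each.
embedding_table = [
    [0.0, 0.0],  # index 0
    [1.0, 2.0],  # index 1
    [3.0, 4.0],  # index 2
    [5.0, 6.0],  # index 3
]


def embed(sequence, table):
    """Look up one vector per integer index: (length,) -> (length, dim)."""
    return [table[index] for index in sequence]


def global_average_pool(vectors):
    """Average over the sequence axis: (length, dim) -> (dim,)."""
    length = len(vectors)
    return [sum(column) / length for column in zip(*vectors)]


embedded = embed([1, 2, 3], embedding_table)
print(embedded)                       # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(global_average_pool(embedded))  # [3.0, 4.0]
```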

#### Activation Layer

An activation function decides whether a neuron should be activated or not, based on the weighted sum of the neuron's inputs plus a bias. The purpose of the activation function is to introduce non-linearity into the output of a neuron. Without a non-linear function, it does not matter how many hidden layers we attach to the neural net: the whole stack behaves like a single linear layer. A network built only from linear functions therefore cannot learn non-linear relationships; it needs non-linear activation functions so that the weight updates driven by the error can capture them.
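
The claim about stacked linear layers can be checked with a small scalar sketch. The weights and biases below are arbitrary made-up numbers and `linear` is a hypothetical helper; the point is only that two linear steps collapse into one.

```python
import math


def linear(weight, bias):
    """Return a function computing weight * x + bias."""
    def apply(x):
        return weight * x + bias
    return apply


layer1 = linear(2.0, 1.0)    # first "layer":  2x + 1
layer2 = linear(-3.0, 0.5)   # second "layer": -3y + 0.5


def stacked(x):
    """Two linear layers applied one after the other, no activation."""
    return layer2(layer1(x))


# Equivalent single layer: weight = w2 * w1, bias = w2 * b1 + b2.
single = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

for x in (-1.0, 0.0, 2.5):
    assert math.isclose(stacked(x), single(x))
```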

##### Relu Activation

The rectified linear activation function is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero.

##### Sigmoid

The sigmoid activation function, also called the logistic function, is traditionally a very popular activation function for neural networks. The input to the function is transformed into a value between 0.0 and 1.0.
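
Both activation functions are short enough to write out by hand. The sketch below uses only the standard library and operates on single numbers, whereas Keras applies them element-wise to tensors.

```python
import math


def relu(x):
    """Return x if it is positive, otherwise zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number into the range between 0.0 and 1.0."""
    return 1.0 / (1.0 + math.exp(-x))


print(relu(-2.0), relu(3.0))  # 0.0 3.0
print(sigmoid(0.0))           # 0.5
```

The sigmoid output of the final layer can be read as the model's estimated probability that the input is sarcastic.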

In [16]:
embedding_dim = 16

In [17]:
model = keras.Sequential()

In [18]:
model.add(keras.layers.Embedding(vocabulary, embedding_dim, input_length=max_length))

In [19]:
model.add(keras.layers.GlobalAveragePooling1D())

In [20]:
model.add(keras.layers.Dense(24, activation='relu'))

In [21]:
model.add(keras.layers.Dense(1, activation='sigmoid'))

In [22]:
model.compile(optimizer="adam", loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

In [23]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 24)                408       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 160,433
Trainable params: 160,433
Non-trainable params: 0
_________________________________________________________________
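
The parameter counts in the summary can be reproduced by hand: an embedding layer has one weight per (word, dimension) pair, and a dense layer has one weight per (input, unit) pair plus one bias per unit. The variable names below are made up for this arithmetic check.

```python
vocabulary = 10000
embedding_dim = 16
hidden_units = 24

embedding_params = vocabulary * embedding_dim                # 160000
hidden_params = embedding_dim * hidden_units + hidden_units  # 16*24 + 24 = 408
output_params = hidden_units * 1 + 1                         # 24 + 1 = 25

total_params = embedding_params + hidden_params + output_params
print(total_params)  # 160433
```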


In [None]:
history = model.fit(training_data, training_label, epochs=25, validation_data=(testing_data, testing_label), verbose=2)

Train on 15000 samples, validate on 11709 samples
Epoch 1/25
15000/15000 - 3s - loss: 0.6863 - accuracy: 0.5531 - val_loss: 0.6786 - val_accuracy: 0.5677
Epoch 2/25
15000/15000 - 2s - loss: 0.6222 - accuracy: 0.6592 - val_loss: 0.5011 - val_accuracy: 0.8147
Epoch 3/25
15000/15000 - 2s - loss: 0.3910 - accuracy: 0.8478 - val_loss: 0.3798 - val_accuracy: 0.8376
Epoch 4/25
15000/15000 - 2s - loss: 0.3031 - accuracy: 0.8821 - val_loss: 0.3584 - val_accuracy: 0.8434
Epoch 5/25
15000/15000 - 2s - loss: 0.2571 - accuracy: 0.9017 - val_loss: 0.3536 - val_accuracy: 0.8436
Epoch 6/25
15000/15000 - 2s - loss: 0.2236 - accuracy: 0.9158 - val_loss: 0.3712 - val_accuracy: 0.8370
Epoch 7/25
15000/15000 - 2s - loss: 0.1965 - accuracy: 0.9249 - val_loss: 0.3489 - val_accuracy: 0.8523
Epoch 8/25
15000/15000 - 2s - loss: 0.1729 - accuracy: 0.9355 - val_loss: 0.3751 - val_accuracy: 0.8452
Epoch 9/25
15000/15000 - 2s - loss: 0.1547 - accuracy: 0.9443 - val_loss: 0.3652 - val_accuracy: 0.8511
Epoch 10/25
15

In [None]:
import matplotlib.pyplot as plt

In [None]:
def display_plot(string, history):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel('epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

In [None]:
display_plot('accuracy', history)

In [None]:
display_plot('loss', history)

In [None]:
sentence = [
    "I am actually not funny. I am just mean and people think I am joking",
    "I’d be fine if there were not so much blood in my alcohol system",
    "If you need so much space, there is always NASA",
    "game of thrones season finale showing this sunday night"
]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding, truncating=trunc)
print(model.predict(padded))
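
`model.predict` returns one sigmoid output per sentence, with one row per input. A possible way to turn such outputs into 0/1 labels is sketched below. The probabilities are invented placeholders, not results from this model, and the 0.5 threshold and the helper name `to_labels` are assumptions.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each single-value row to 1 (sarcastic) or 0 (not sarcastic)."""
    return [int(row[0] >= threshold) for row in probabilities]


# Hypothetical outputs shaped like model.predict results (made-up numbers).
hypothetical_predictions = [[0.91], [0.12]]
print(to_labels(hypothetical_predictions))  # [1, 0]
```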