# Sarcasm Detection
**Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

# Please note that this notebook was built in Google Colab and assumes the local drive of the Colab environment

## Install `Tensorflow2.0` 

## Import 

In [2]:
import warnings
import tensorflow as tf
import pickle
from tensorflow.keras import layers
from tensorflow.keras import preprocessing
import numpy as np
import pandas as pd
import json
from sklearn.model_selection import train_test_split
import pprint
from tensorflow.keras.layers import Bidirectional,LSTM,Dense,Dropout,BatchNormalization,Flatten,Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import concatenate
from numpy import array
import nltk
import re
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
tf.__version__

'2.2.0'

In [None]:
#Set your project path 
project_path = "/content/drive/My Drive/"


# Reading and Exploring Data

## Read the data from "Sarcasm_Headlines_Dataset.json". Explore the data and get some insights about it. ( 4 marks)
Hint - As the file is in JSON format with one object per line, you need to use the pandas.read_json function with the parameter lines=True.

In [14]:
sarcasm_df=pd.read_json("https://raw.githubusercontent.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection/master/Sarcasm_Headlines_Dataset.json",lines=True )
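
`lines=True` tells pandas to treat every line of the file as a separate JSON object (the JSON Lines layout). A minimal standard-library sketch of the same idea follows; the two records are made up for illustration and are not rows from the dataset.

```python
import io
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"is_sarcastic": 1, "headline": "made-up headline one"}\n'
    '{"is_sarcastic": 0, "headline": "made-up headline two"}\n'
)

# Parse each non-empty line on its own, which is what lines=True asks pandas to do.
records = [json.loads(line) for line in sample if line.strip()]
labels = [record["is_sarcastic"] for record in records]
print(len(records), labels)
```

Without `lines=True`, a parser would expect the whole file to be a single JSON document and would typically fail on the second line.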

In [16]:
sarcasm_df.head()

Unnamed: 0,is_sarcastic,headline,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...


## Drop `article_link` from dataset. ( 2 marks)
We only need the headline text and the is_sarcastic column for this project, so we can drop the article_link column here.

In [17]:
sarcasm_df.drop(labels='article_link', axis=1,inplace=True)


In [18]:
sarcasm_df.head(2)

Unnamed: 0,is_sarcastic,headline
0,1,thirtysomething scientists unveil doomsday clo...
1,0,dem rep. totally nails why congress is falling...


In [19]:
sarcasm_df.shape

(28619, 2)

In [20]:
sarcasm_df.dtypes

is_sarcastic     int64
headline        object
dtype: object

In [21]:
sarcasm_df['is_sarcastic'].value_counts()

0    14985
1    13634
Name: is_sarcastic, dtype: int64
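
One quick insight from these counts is how balanced the two classes are. The sketch below just redoes the arithmetic on the numbers printed above; it does not touch the DataFrame.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 14985, 1: 13634}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
```

The split is roughly 52% non-sarcastic to 48% sarcastic, so plain accuracy is a reasonable first metric here.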

## Get the Length of each line and find the maximum length. ( 4 marks)
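Different headlines have different lengths, so we need to pad our sequences to the maximum length. Note that `len(txt)` below counts characters rather than words, so the result is a generous upper bound on the number of tokens per headline.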

In [22]:
max_headline=max(len(txt) for txt in sarcasm_df['headline'])
max_headline

926
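
A small sketch of the difference between measuring a headline in characters and in words. The headlines are hypothetical stand-ins, and the whitespace split is only an approximation of what the Keras tokenizer does (it also strips punctuation).

```python
# Hypothetical headlines standing in for sarcasm_df['headline'].
headlines = ["a short made-up headline", "another one"]

max_chars = max(len(text) for text in headlines)          # characters, as in the cell above
max_words = max(len(text.split()) for text in headlines)  # whitespace-separated words
print(max_chars, max_words)
```

Padding to the word-level maximum instead of the character-level one would give much shorter input sequences, at the cost of changing the shapes reported later in this notebook.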

## Build Vocab

In [28]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
# Build tokenizer
tokenizer.fit_on_texts(sarcasm_df['headline']) 
vocab_size = len(tokenizer.word_index)
print(vocab_size)

30884
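
To make `word_index` less of a black box, here is a simplified sketch of how such an index can be built: count words, rank them by frequency, and number them from 1 so that 0 stays free for padding. The corpus is hypothetical, and the real Keras tokenizer additionally filters punctuation.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the headline column.
texts = ["the cat sat", "the dog sat", "the end"]

# Count lowercase, whitespace-separated words across all texts.
word_counts = Counter(word for text in texts for word in text.lower().split())

# Rank words by frequency; indices start at 1 so that 0 stays free for padding.
word_index = {
    word: rank
    for rank, (word, _) in enumerate(word_counts.most_common(), start=1)
}
print(word_index)
```

Because index 0 is never assigned to a word, the embedding layer later needs `vocab_size + 1` rows.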


# Modelling

## Import the modules required for modelling.

In [26]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential

# Set Different Parameters for the model. ( 2 marks)

In [47]:
max_features = 15000
embedding_size = 200
padding = 'pre'
rnn_units=256
mask_zero=False
return_sequence=False

## Apply the Keras Tokenizer to the headline column of your data.  ( 4 marks)
Hint - First create a tokenizer instance using Tokenizer(num_words=max_features).
Then fit this tokenizer instance on your data column df['headline'] using .fit_on_texts().
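
The tokenizer fitted earlier in this notebook does not pass `num_words`, so every word is kept. As a rough illustration of what the cap in the hint would do, the sketch below drops any word whose frequency-ranked index is `num_words` or higher when converting text to sequences. The function name and the tiny index are hypothetical, and this is a simplification of the Keras behaviour rather than its implementation.

```python
def texts_to_sequences_capped(texts, word_index, num_words):
    """Convert texts to index lists, dropping words whose index is >= num_words."""
    sequences = []
    for text in texts:
        indices = (word_index.get(word) for word in text.lower().split())
        sequences.append([i for i in indices if i is not None and i < num_words])
    return sequences


# Hypothetical frequency-ranked index (1 = most frequent word).
word_index = {"the": 1, "sat": 2, "cat": 3, "dog": 4}
sequences = texts_to_sequences_capped(
    ["the cat sat", "the dog sat"], word_index, num_words=3
)
print(sequences)
```

With a cap in place, rare words simply disappear from the sequences, which shrinks the embedding matrix that the model needs.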

# Define X and y for your model.

In [32]:
sarcasm_df_seq = tokenizer.texts_to_sequences(sarcasm_df['headline'])
X = tf.keras.preprocessing.sequence.pad_sequences(sarcasm_df_seq,
                                                  maxlen=max_headline,
                                                  padding='pre')
y = np.asarray(sarcasm_df['is_sarcastic'])
print(X.shape)
print(y.shape)

(28619, 926)
(28619,)
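
The effect of `padding='pre'` can be sketched in plain Python: zeros are added at the front until the sequence reaches `maxlen`, and longer sequences keep only their last `maxlen` items. The helper name `pad_pre` is hypothetical, and the sketch assumes `maxlen` is positive.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value` up to maxlen; keep only the last maxlen items if longer."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([5, 8, 2], maxlen=5))
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))
```

Pre-padding places the real tokens at the end of the sequence, next to the final LSTM time step.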


In [37]:
from sklearn.model_selection import train_test_split

train,test = train_test_split(sarcasm_df,test_size = 0.2)
train,val = train_test_split(train,test_size=0.25)

print(train.shape)
print(test.shape)
print(val.shape)

(17171, 2)
(5724, 2)
(5724, 2)
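
The two calls above produce a 60/20/20 train/validation/test split. The arithmetic can be checked with the sketch below, which assumes the fractional hold-out size is rounded up; that assumption reproduces the shapes printed above.

```python
import math

n_total = 28619  # rows in sarcasm_df

# First split: 20% held out for testing (fractional size rounded up).
n_test = math.ceil(n_total * 0.2)
n_rest = n_total - n_test

# Second split: 25% of the remainder becomes the validation set.
n_val = math.ceil(n_rest * 0.25)
n_train = n_rest - n_val
print(n_train, n_test, n_val)
```

Taking 25% of the remaining 80% is what makes the validation set the same size as the test set.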


In [64]:
# X
train_sequence = tokenizer.texts_to_sequences(train["headline"].values)
test_sequence = tokenizer.texts_to_sequences(test["headline"].values)
val_sequence = tokenizer.texts_to_sequences(val["headline"].values)

train_sequence = tf.keras.preprocessing.sequence.pad_sequences(train_sequence,
                                                               maxlen=max_headline,
                                                               padding='pre')
test_sequence = tf.keras.preprocessing.sequence.pad_sequences(test_sequence,
                                                               maxlen=max_headline,
                                                               padding='pre')
val_sequence = tf.keras.preprocessing.sequence.pad_sequences(val_sequence,
                                                               maxlen=max_headline,
                                                               padding='pre')

#y
y_train = np.asarray(train['is_sarcastic'])
y_test = np.asarray(test['is_sarcastic'])
y_val = np.asarray(val['is_sarcastic'])

len(train_sequence)
len(y_train)

17171

In [42]:
y_train.shape

(17171,)

## Get the Vocabulary size ( 2 marks)
Hint : You can use tokenizer.word_index.

In [65]:
vocab_size = len(tokenizer.word_index)
print(vocab_size)

30884


# Word Embedding

## Get GloVe Word Embeddings

In [44]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-06-18 05:36:19--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-06-18 05:36:19--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-06-18 05:36:20--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [45]:
glove_file = "glove.6B.zip"

In [46]:
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
  z.extractall()

# Get the Word Embeddings using Embedding file as given below.

In [48]:
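EMBEDDING_FILE = './glove.6B.200d.txt'

# Map each word to its 200-dimensional GloVe vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its space-separated vector components
        values = line.rstrip().split(" ")
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')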



# Create a weight matrix for words in training docs

In [49]:
embedding_matrix = np.zeros((vocab_size+1, 200))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings.values())

400000
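
The weight matrix has one row per tokenizer index, and any word without a pretrained vector keeps an all-zero row. The sketch below shows this on toy data and also counts how many vocabulary words were found; the 3-dimensional vectors and the word list are hypothetical.

```python
# Hypothetical 3-dimensional vectors standing in for the GloVe file.
embeddings = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
# Hypothetical tokenizer index; "zzyzx" has no pretrained vector.
word_index = {"cat": 1, "dog": 2, "zzyzx": 3}
dim = 3

# Row 0 is reserved for padding, hence len(word_index) + 1 rows.
embedding_matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
found = 0
for word, i in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[i] = list(vector)
        found += 1

coverage = found / len(word_index)
print(f"{found} of {len(word_index)} words have a pretrained vector")
```

Running the same count against the real `tokenizer.word_index` would show how much of the headline vocabulary GloVe actually covers.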

## Create and Compile your Model  ( 7 marks)
Hint - Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.


In [66]:
# As per https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM, it will use CuDNN Lstm
# if below params match
# activation == tanh
# recurrent_activation == sigmoid
# recurrent_dropout == 0
# unroll is False
# use_bias is True
# Inputs are not masked or strictly right padded.

def createCUDNNLstm(units,return_state,return_sequences,dropout,name=''):
  return layers.LSTM(units=units,
                     return_state=return_state,
                     return_sequences=return_sequences, 
                     name = name,
                     activation='tanh',
                     recurrent_activation='sigmoid',
                     recurrent_dropout=0,
                     dropout=dropout,
                     unroll=False,
                     use_bias=True)
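
The argument-level criteria listed in the comment above can be written as a small check. `meets_cudnn_criteria` is a hypothetical helper for illustration, not a TensorFlow function, and it deliberately leaves out the masking/padding condition, which depends on the input data rather than on the layer arguments.

```python
def meets_cudnn_criteria(activation="tanh", recurrent_activation="sigmoid",
                         recurrent_dropout=0.0, unroll=False, use_bias=True):
    """Hypothetical helper: True if the LSTM arguments match the criteria listed above."""
    return bool(
        activation == "tanh"
        and recurrent_activation == "sigmoid"
        and recurrent_dropout == 0
        and not unroll
        and use_bias
    )


print(meets_cudnn_criteria())
print(meets_cudnn_criteria(recurrent_dropout=0.2))
```

Note that the ordinary `dropout` argument is not part of these criteria, which is why `createCUDNNLstm` can still expose it.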

In [67]:
strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


In [None]:
### Embedding layer for hint 
## model.add(Embedding(num_words, embedding_size, weights = [embedding_matrix]))
### Bidirectional LSTM layer for hint 
## model.add(Bidirectional(LSTM(128, return_sequences = True)))

In [73]:
with strategy.scope():
  model = tf.keras.Sequential()
  ## Please note that mask_zero is False so that the cuDNN LSTM variant can be used on GPU
  model.add(tf.keras.layers.Embedding(input_dim=vocab_size+1,
                    output_dim=embedding_size,
                    weights=[embedding_matrix],
                    trainable=False, mask_zero= False))
  model.add(tf.keras.layers.Bidirectional(createCUDNNLstm(units=128,
                                                          return_sequences=False, 
                                                          dropout=0.2, 
                                                          return_state=False),
                                          merge_mode='concat'))
  model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

  model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [74]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 200)         6177000   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 256)               336896    
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257       
Total params: 6,514,153
Trainable params: 337,153
Non-trainable params: 6,177,000
_________________________________________________________________
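
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias) and doubles it for the two directions; the function names are hypothetical helpers.

```python
def embedding_params(vocab_rows, embedding_dim):
    # One embedding vector per vocabulary row.
    return vocab_rows * embedding_dim


def bilstm_params(input_dim, units):
    # Each direction has 4 gates, each with input weights, recurrent weights and a bias.
    per_direction = 4 * (units * (input_dim + units) + units)
    return 2 * per_direction


def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output.
    return input_dim * outputs + outputs


emb = embedding_params(30884 + 1, 200)
lstm = bilstm_params(200, 128)
dense = dense_params(2 * 128, 1)
print(emb, lstm, dense, emb + lstm + dense)
```

Only the LSTM and dense weights are trainable here, because the embedding layer is created with `trainable=False`.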


# Fit your model with a batch size of 100 and validation_split = 0.2, and state the validation accuracy ( 5 marks)


In [75]:
batch_size = 100
epochs = 5

## Add your code here ##

In [76]:
with strategy.scope():
  model.fit(train_sequence,y_train,
            epochs=5,
            batch_size=100, validation_data=(val_sequence, y_val))

Epoch 1/5
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
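
For reference, the number of batches per epoch follows directly from the training set size and the batch size. This sketch assumes the final partial batch is kept rather than dropped.

```python
import math

n_train = 17171   # rows in train_sequence
batch_size = 100

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch)
```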


Validation accuracy with an embedding size of 200 and 128 RNN units is 86.36%.

# Model 2 - GloVe Embedding of 300, with 256 Memory Size

In [77]:
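EMBEDDING_FILE = './glove.6B.300d.txt'

# Map each word to its 300-dimensional GloVe vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its space-separated vector components
        values = line.rstrip().split(" ")
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')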

In [79]:
embedding_matrix = np.zeros((vocab_size+1, 300))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings.values())

400000

In [85]:
strategy2 = tf.distribute.MirroredStrategy()
with strategy2.scope():
  model2 = tf.keras.Sequential()
  ## Please note that mask_zero is False so that the cuDNN LSTM variant can be used on GPU
  model2.add(tf.keras.layers.Embedding(input_dim=vocab_size+1,
                    output_dim=300,
                    weights=[embedding_matrix],
                    trainable=False, mask_zero= False))
  model2.add(tf.keras.layers.Dropout(0.2))
  model2.add(tf.keras.layers.Bidirectional(createCUDNNLstm(units=256,return_sequences=False, 
                                                          dropout=0.2, 
                                                          return_state=False),
                                           merge_mode='concat'))
  model2.add(tf.keras.layers.Dense(1,activation='sigmoid'))
  model2.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


In [86]:
model2.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, None, 300)         9265500   
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 300)         0         
_________________________________________________________________
bidirectional_5 (Bidirection (None, 512)               1140736   
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 513       
Total params: 10,406,749
Trainable params: 1,141,249
Non-trainable params: 9,265,500
_________________________________________________________________


In [87]:
with strategy2.scope():
  model2.fit(train_sequence,y_train,
            epochs=5,
            batch_size=100, validation_data=(val_sequence, y_val))  

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Validation accuracy with an embedding size of 300 and 256 RNN units is 84.31%.

# Model 3 - GloVe Embedding of 300, with 256 Memory Size and 10 Epochs

In [91]:
strategy3 = tf.distribute.MirroredStrategy()
with strategy3.scope():
  model3 = tf.keras.Sequential()
  model3.add(tf.keras.layers.Embedding(input_dim=vocab_size+1,
                    output_dim=300,
                    weights=[embedding_matrix],
                    trainable=False, mask_zero= False))
  model3.add(tf.keras.layers.Bidirectional(createCUDNNLstm(units=256,return_sequences=False, 
                                                          dropout=0.2, 
                                                          return_state=False),
                                           merge_mode='concat'))
  model3.add(tf.keras.layers.Dense(1,activation='sigmoid'))
  model3.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


In [92]:
model3.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, None, 300)         9265500   
_________________________________________________________________
bidirectional_7 (Bidirection (None, 512)               1140736   
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 513       
Total params: 10,406,749
Trainable params: 1,141,249
Non-trainable params: 9,265,500
_________________________________________________________________


In [93]:
with strategy3.scope():
  model3.fit(train_sequence,y_train,
            epochs=10,
            batch_size=100, validation_data=(val_sequence, y_val))  

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Validation accuracy with an embedding size of 300 and 256 RNN units after 10 epochs is 86.36%.