## Sarcasm Detection

## Description:

Past studies in sarcasm detection mostly use Twitter datasets collected with hashtag-based
supervision, but such datasets are noisy in both labels and language. Furthermore, many tweets are
replies to other tweets, so detecting sarcasm in them requires the surrounding contextual tweets.
This hands-on project uses the News Headlines dataset referenced below instead. The goal is to build
a model that detects whether a headline is sarcastic, using Bidirectional LSTMs.


## Reference:
https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection


## Import Packages

In [1]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Setting the current working directory
import os
os.chdir('/content/drive/MyDrive/GL/NLP')

In [40]:
import pandas as pd
import numpy as np

In [41]:
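import json
def parseJson(fname):
    # Parse each line as JSON instead of calling eval(), and close the file when done
    with open(fname, 'r') as f:
        for line in f:
            yield json.loads(line)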
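
Each line of the dataset file is a standalone JSON object, so a parser only needs to decode one line at a time. A minimal sketch, assuming an in-memory sample in place of the real file (the helper name `parse_json_lines` and the abridged sample records are illustrative, not part of the notebook):

```python
import io
import json

def parse_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical two-record sample standing in for the dataset file
sample = io.StringIO(
    '{"headline": "advancing the world\'s women", "is_sarcastic": 0}\n'
    '{"headline": "mom starting to fear son\'s web series", "is_sarcastic": 1}\n'
)
records = list(parse_json_lines(sample))
print(records[0]["headline"], records[1]["is_sarcastic"])
```

Decoding with `json.loads` avoids executing file contents as Python code, which is what `eval` would do.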

In [34]:
data = pd.read_json('/content/drive/MyDrive/GL/NLP/Data/Sarcasm_Headlines_Dataset.json', lines = True)
print(f'Data has {data.shape[0]} rows and {data.shape[1]} columns. Here are the first five rows of the data...')
display(data.head())

Data has 26709 rows and 3 columns. Here are the first five rows of the data...


Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


In [44]:
data = list(parseJson('/content/drive/MyDrive/GL/NLP/Data/Sarcasm_Headlines_Dataset.json'))

In [45]:
df = pd.DataFrame(data)

In [46]:
df.head(10)

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0
5,https://www.huffingtonpost.com/entry/advancing...,advancing the world's women,0
6,https://www.huffingtonpost.com/entry/how-meat-...,the fascinating case for eating lab-grown meat,0
7,https://www.huffingtonpost.com/entry/boxed-col...,"this ceo will send your kids to school, if you...",0
8,https://politics.theonion.com/top-snake-handle...,top snake handler leaves sinking huckabee camp...,1
9,https://www.huffingtonpost.com/entry/fridays-m...,friday's morning email: inside trump's presser...,0


## Drop one column

In [47]:
df_1 = df.drop('article_link', axis=1)

In [48]:
df_1.head()

Unnamed: 0,headline,is_sarcastic
0,former versace store clerk sues over secret 'b...,0
1,the 'roseanne' revival catches up to our thorn...,0
2,mom starting to fear son's web series closest ...,1
3,"boehner just wants wife to listen, not come up...",1
4,j.k. rowling wishes snape happy birthday in th...,0


## Get length of each sentence

In [49]:
# Length of each headline in characters (not words)
df_1['col_length'] = df_1['headline'].str.len()

In [50]:
df_1['headline'][0]

"former versace store clerk sues over secret 'black code' for minority shoppers"

In [51]:
df_1.head()

Unnamed: 0,headline,is_sarcastic,col_length
0,former versace store clerk sues over secret 'b...,0,78
1,the 'roseanne' revival catches up to our thorn...,0,84
2,mom starting to fear son's web series closest ...,1,79
3,"boehner just wants wife to listen, not come up...",1,84
4,j.k. rowling wishes snape happy birthday in th...,0,64
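
`col_length` counts characters, while `maxlen` later limits the number of tokens, so the two figures are not directly comparable. A small sketch on the first headline shown above (the variable names are illustrative):

```python
headline = "former versace store clerk sues over secret 'black code' for minority shoppers"

char_length = len(headline)          # what col_length stores
word_length = len(headline.split())  # whitespace-separated words

print(char_length, word_length)  # 78 12
```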


## Apply tensorflow.keras Tokenizer and get indices for words.

In [None]:
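max_features = 10000   # vocabulary size kept by the tokenizer
maxlen = 25            # every headline is padded or truncated to this many tokens
embedding_size = 200   # not used below; the model uses embedding_dim = 300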

In [53]:
headline_data = df_1['headline']

# Splitting the dataset into Train and Test
training_size = round(len(headline_data) * .75)

hl_data_train_sent = headline_data[0:training_size]
hl_data_test_sent = headline_data[training_size:]

labels = df_1['is_sarcastic']
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
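
The split above takes the first 75% of rows for training without shuffling, which is reasonable only if the file order is already random. A hedged sketch of a seeded shuffle-then-split on toy data (the function `shuffled_split` and the seed are assumptions for illustration, not part of the notebook):

```python
import random

def shuffled_split(items, labels, train_fraction=0.75, seed=42):
    """Shuffle paired items/labels with a fixed seed, then split."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = ([items[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([items[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# Toy data: eight headlines with alternating labels
toy_items = [f"headline {i}" for i in range(8)]
toy_labels = [i % 2 for i in range(8)]
(train_x, train_y), (test_x, test_y) = shuffled_split(toy_items, toy_labels)
print(len(train_x), len(test_x))  # 6 2
```

Shuffling indices rather than the data itself keeps each headline paired with its label.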

In [54]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_features)

# Fit on the training headlines only, so the vocabulary is not built from test data
tokenizer.fit_on_texts(hl_data_train_sent)

In [55]:
train_sequences = tokenizer.texts_to_sequences(hl_data_train_sent)
test_sequences = tokenizer.texts_to_sequences(hl_data_test_sent)
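
Conceptually, the tokenizer ranks words by frequency (the most frequent word gets index 1), and `texts_to_sequences` keeps only words whose index is below `num_words`. The following pure-Python sketch mimics that behaviour on toy data. It is a simplification (lowercasing and whitespace splitting only, with no punctuation filtering) and not the Keras implementation:

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index.get(word) for word in text.lower().split()]
        sequences.append([i for i in ids if i is not None and i < num_words])
    return sequences

toy_train = ["man bites dog", "dog bites man again", "man wins"]
toy_word_index = fit_word_index(toy_train)
toy_sequences = texts_to_sequences(["man bites cat", "dog wins again"], toy_word_index, num_words=4)
print(toy_word_index)
print(toy_sequences)  # [[1, 2], [3]]
```

Words outside the retained vocabulary simply disappear from the sequence, which is why a tokenized headline can be shorter than its word count.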

In [56]:
print(train_sequences)
print(test_sequences)

[[327, 800, 3411, 2408, 46, 388, 2217, 5, 2621, 8873], [3, 6844, 3100, 3101, 22, 1, 161, 389, 2842, 5, 250, 8, 889], [152, 890, 1, 891, 1447, 2218, 595, 5659, 217, 132, 36, 44, 1, 8874], ...
(output truncated)

## Pad sequences 

In [57]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad or truncate every sequence to maxlen; by default zeros are added at the front
hl_train_data_pad = pad_sequences(train_sequences, maxlen=maxlen)
hl_test_data_pad = pad_sequences(test_sequences, maxlen=maxlen)
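
With its default arguments, `pad_sequences` pads on the left with zeros and, for sequences longer than `maxlen`, drops tokens from the front. A pure-Python sketch of that default behaviour (the helper `pad_pre` is illustrative and returns lists rather than a NumPy array):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                                   # 'pre' truncation
        padded.append([value] * (maxlen - len(kept)) + kept)   # 'pre' padding
    return padded

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_pre(toy, maxlen=4)
print(toy_padded)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```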

## Vocab mapping

In [58]:
tokenizer.word_index

{'to': 1,
 'of': 2,
 'the': 3,
 'in': 4,
 'for': 5,
 'a': 6,
 'on': 7,
 'and': 8,
 'with': 9,
 'is': 10,
 'new': 11,
 'trump': 12,
 'man': 13,
 'from': 14,
 'at': 15,
 'about': 16,
 'you': 17,
 'by': 18,
 'this': 19,
 'after': 20,
 'be': 21,
 'up': 22,
 'out': 23,
 'that': 24,
 'how': 25,
 'as': 26,
 'it': 27,
 'not': 28,
 'are': 29,
 'your': 30,
 'what': 31,
 'his': 32,
 'all': 33,
 'he': 34,
 'who': 35,
 'will': 36,
 'just': 37,
 'has': 38,
 'more': 39,
 'one': 40,
 'year': 41,
 'into': 42,
 'report': 43,
 'have': 44,
 'why': 45,
 'over': 46,
 'area': 47,
 'u': 48,
 'donald': 49,
 'says': 50,
 'day': 51,
 'can': 52,
 's': 53,
 'first': 54,
 'woman': 55,
 'time': 56,
 'like': 57,
 'old': 58,
 'get': 59,
 'her': 60,
 'no': 61,
 "trump's": 62,
 'off': 63,
 'now': 64,
 'an': 65,
 'life': 66,
 'people': 67,
 'obama': 68,
 'women': 69,
 'house': 70,
 "'": 71,
 'white': 72,
 'was': 73,
 'still': 74,
 'back': 75,
 'make': 76,
 'than': 77,
 'down': 78,
 'clinton': 79,
 'when': 80,
 'my': 81,


## Set number of words

In [59]:
num_words = len(tokenizer.word_index) + 1
print(num_words)

25652


## Load Glove Word Embeddings

In [60]:
path = '/content/drive/MyDrive/GL/NLP/Data/'

glove_file = path + 'glove.6B.50d.txt'

## Create embedding matrix

In [61]:
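# Build a word -> vector lookup from the 50-dimensional GloVe file.
# Note: this index is not used below; the model relies on the 300-dimensional vectors.
embeddings_index = {}
with open(glove_file, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs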
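
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. A hedged sketch of parsing such lines and filling one weight-matrix row per vocabulary word, using made-up 3-dimensional vectors and a toy word index (all values here are illustrative, not real GloVe data):

```python
import io

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; missing words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 is padding
    hits = 0
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
            hits += 1
    return matrix, hits

# Made-up vectors standing in for a GloVe file
fake_glove = io.StringIO("the 0.1 0.2 0.3\nman 0.4 0.5 0.6\n")
toy_index = {"the": 1, "man": 2, "snape": 3}
toy_matrix, toy_hits = build_matrix(toy_index, load_vectors(fake_glove), dim=3)
print(toy_hits, "of", len(toy_index), "words found")
```

Reporting the hit count is a cheap way to see how much of the vocabulary is left as all-zero rows.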

## Build the 300-dimensional weight matrix

In [62]:
EMBEDDING_FILE = path + 'glove.6B.300d.txt'

# Map each GloVe word to its 300-dimensional vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for o in f:
        values = o.rstrip().split(" ")
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((num_words, 300))

# Words missing from GloVe keep an all-zero row
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

## Define model

In [63]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding, LSTM, TimeDistributed, Bidirectional

embedding_dim = 300  # must match the width of embedding_matrix
model = Sequential()
# Embedding layer initialised with the GloVe weights (left trainable)
model.add(Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=maxlen))
model.add(Bidirectional(LSTM(units=60, activation='tanh', return_sequences=True)))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
# Output layer
model.add(Dense(units=1, activation='sigmoid'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 300)           7695600   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 25, 120)           173280    
_________________________________________________________________
time_distributed (TimeDistri (None, 25, 100)           12100     
_________________________________________________________________
flatten_1 (Flatten)          (None, 2500)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 2501      
Total params: 7,883,481
Trainable params: 7,883,481
Non-trainable params: 0
_________________________________________________________________
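
The parameter counts in the summary can be reproduced by hand from the layer sizes, which is a quick check that the architecture is what was intended. The sketch below applies the standard formulas for Embedding, LSTM and Dense layers to the sizes used in this notebook:

```python
vocab_size, embedding_dim = 25652, 300
lstm_units, td_units, maxlen = 60, 100, 25

# Embedding: one vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias;
# Bidirectional doubles it (forward and backward copies)
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params

# TimeDistributed(Dense(100)) applied to the 120-dimensional BiLSTM output
td_params = (2 * lstm_units) * td_units + td_units

# Final Dense(1) on the flattened 25 x 100 tensor
output_params = (maxlen * td_units) * 1 + 1

total = embedding_params + bilstm_params + td_params + output_params
print(embedding_params, bilstm_params, td_params, output_params, total)
# 7695600 173280 12100 2501 7883481
```

The embedding layer accounts for nearly all of the parameters, and because it is trainable, all of them are updated during fitting.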


In [64]:
print('training_sentences : ',hl_train_data_pad.shape)
print('testing_sentences : ',hl_test_data_pad.shape)
print('training_labels : ',training_labels.shape)
print('testing_labels : ',testing_labels.shape)

training_sentences :  (20032, 25)
testing_sentences :  (6677, 25)
training_labels :  (20032,)
testing_labels :  (6677,)


## Compile the model

In [65]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])


## Fit the model

In [66]:
# Ensure inputs and labels are NumPy arrays for TensorFlow 2.x (the labels are pandas Series)
training_padded = np.array(hl_train_data_pad)
training_labels = np.array(training_labels)
testing_padded = np.array(hl_test_data_pad)
testing_labels = np.array(testing_labels)
# Training the model
num_epochs = 30
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

Epoch 1/30
626/626 - 73s - loss: 0.3766 - accuracy: 0.8267 - val_loss: 0.3102 - val_accuracy: 0.8652
Epoch 2/30
626/626 - 71s - loss: 0.1874 - accuracy: 0.9250 - val_loss: 0.3305 - val_accuracy: 0.8624
Epoch 3/30
626/626 - 70s - loss: 0.0877 - accuracy: 0.9677 - val_loss: 0.4619 - val_accuracy: 0.8585
Epoch 4/30
626/626 - 69s - loss: 0.0352 - accuracy: 0.9876 - val_loss: 0.5716 - val_accuracy: 0.8532
Epoch 5/30
626/626 - 70s - loss: 0.0151 - accuracy: 0.9952 - val_loss: 0.8231 - val_accuracy: 0.8559
Epoch 6/30
626/626 - 69s - loss: 0.0107 - accuracy: 0.9967 - val_loss: 0.8173 - val_accuracy: 0.8523
Epoch 7/30
626/626 - 68s - loss: 0.0099 - accuracy: 0.9965 - val_loss: 1.0125 - val_accuracy: 0.8507
Epoch 8/30
626/626 - 70s - loss: 0.0070 - accuracy: 0.9977 - val_loss: 1.2003 - val_accuracy: 0.8538
Epoch 9/30
626/626 - 69s - loss: 0.0099 - accuracy: 0.9965 - val_loss: 1.0109 - val_accuracy: 0.8450
Epoch 10/30
626/626 - 69s - loss: 0.0075 - accuracy: 0.9973 - val_loss: 1.1264 - val_accura
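
In the log above, validation loss is lowest in the first epoch and rises afterwards while training loss keeps falling, which suggests overfitting. A sketch of the usual early-stopping rule applied to the logged `val_loss` values (the `patience` value is an assumption; in Keras the same idea is available as the `EarlyStopping` callback):

```python
# val_loss values copied from the first ten epochs of the log above
val_losses = [0.3102, 0.3305, 0.4619, 0.5716, 0.8231,
              0.8173, 1.0125, 1.2003, 1.0109, 1.1264]

def early_stop_epoch(losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve
    on the best loss seen so far.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stop_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 1 4
```

Under this rule, training on these values would stop after epoch 4 and the epoch-1 weights would be the ones to keep.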

In [67]:
sentence = ["Coworkers At Bathroom Sink Locked In Tense Standoff Over Who Going To Wash Hands Longer", 
            "Spiking U.S. coronavirus cases could force rationing decisions similar to those made in Italy, China."]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=maxlen)
print(model.predict(padded))

[[1.        ]
 [0.07241046]]
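
The model outputs sigmoid probabilities, so a threshold is needed to turn them into class labels. A minimal sketch using the two predictions above and the conventional 0.5 cut-off (the threshold is an assumption, not something the notebook sets):

```python
probabilities = [1.0, 0.07241046]  # model.predict output from the cell above

def to_labels(probs, threshold=0.5):
    """Label 1 (sarcastic) when the probability reaches the threshold."""
    return [int(p >= threshold) for p in probs]

predicted = to_labels(probabilities)
print(predicted)  # [1, 0]
```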


In [68]:
# Note: this averages val_accuracy over all epochs; it is not the final test accuracy
print("Test-Accuracy:", np.mean(history.history["val_accuracy"]))

Test-Accuracy: 0.8511806646982829


## Check the validation accuracy

In [69]:
scores = model.evaluate(testing_padded, testing_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 84.11%
