<img src="http://drive.google.com/uc?export=view&id=1tpOCamr9aWz817atPnyXus8w5gJ3mIts" width=500px>

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

### Package Version:
- tensorflow==2.2.0
- pandas==1.0.5
- numpy==1.18.5
- google==2.0.3

# Sarcasm Detection

### Dataset

#### Acknowledgement
Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load Data (5 Marks)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os

In [4]:
os.chdir("/content/drive/MyDrive/Data")
os.listdir()

['Sarcasm_Headlines_Dataset.json',
 'glove.6B.300d.txt',
 'glove.6B.zip',
 'glove.6B.100d.txt',
 'glove.6B.50d.txt',
 'glove.6B.200d.txt']

In [8]:
input_data=pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)
input_data.shape

(26709, 3)

In [9]:
input_data

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0
...,...,...,...
26704,https://www.huffingtonpost.com/entry/american-...,american politics in moral free-fall,0
26705,https://www.huffingtonpost.com/entry/americas-...,america's best 20 hikes,0
26706,https://www.huffingtonpost.com/entry/reparatio...,reparations and obama,0
26707,https://www.huffingtonpost.com/entry/israeli-b...,israeli ban targeting boycott supporters raise...,0


### Drop `article_link` from dataset (5 Marks)

In [11]:
## Dropping the article_link column
input_data.drop('article_link', axis=1, inplace=True)

In [13]:
input_data.head()

Unnamed: 0,headline,is_sarcastic
0,former versace store clerk sues over secret 'b...,0
1,the 'roseanne' revival catches up to our thorn...,0
2,mom starting to fear son's web series closest ...,1
3,"boehner just wants wife to listen, not come up...",1
4,j.k. rowling wishes snape happy birthday in th...,0


### Get length of each headline and add a column for that (5 Marks)

In [18]:
# Character count of each headline (not the word count)
input_data['headline_length'] = input_data['headline'].str.len()


In [20]:
input_data.shape

(26709, 3)

In [21]:
input_data.head()

Unnamed: 0,headline,is_sarcastic,headline_length
0,former versace store clerk sues over secret 'b...,0,78
1,the 'roseanne' revival catches up to our thorn...,0,84
2,mom starting to fear son's web series closest ...,1,79
3,"boehner just wants wife to listen, not come up...",1,84
4,j.k. rowling wishes snape happy birthday in th...,0,64


### Initialize parameter values
- Set values for `max_features`, `maxlen`, and `embedding_size`
- `max_features`: number of most frequent words the tokenizer keeps
- `maxlen`: maximum number of tokens per headline; each sequence is padded or truncated to 25
- `embedding_size`: dimension of each word embedding vector (200, to match `glove.6B.200d.txt`)

In [22]:
max_features = 10000
maxlen = 25
embedding_size = 200

In [36]:
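# Hold out every fifth row for testing (about 20%); the remaining rows are the training data
test_data = input_data[np.arange(input_data.shape[0]) % 5 == 0]
train_data = input_data[np.arange(input_data.shape[0]) % 5 != 0]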


In [37]:
test_data.head()


Unnamed: 0,headline,is_sarcastic,headline_length
0,former versace store clerk sues over secret 'b...,0,78
5,advancing the world's women,0,27
10,airline passengers tackle man who rushes cockp...,0,63
15,nuclear bomb detonates during rehearsal for 's...,1,64
20,courtroom sketch artist has clear manga influe...,1,50


In [39]:

train_data.head()

(21367, 3)


Unnamed: 0,headline,is_sarcastic,headline_length
1,the 'roseanne' revival catches up to our thorn...,0,84
2,mom starting to fear son's web series closest ...,1,79
3,"boehner just wants wife to listen, not come up...",1,84
4,j.k. rowling wishes snape happy birthday in th...,0,64
6,the fascinating case for eating lab-grown meat,0,46


In [42]:
print('training data shape->',train_data.shape)
print('testing data shape->',test_data.shape)

training data shape-> (21367, 3)
testing data shape-> (5342, 3)
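The split above sends every fifth row to the test set, so it is deterministic and holds out about 20% of the data. A minimal sketch of the same idea, assuming only NumPy; the array `rows` is a hypothetical stand-in for the row positions of `input_data`:

```python
import numpy as np

# Hypothetical stand-in for the 26,709 row positions of input_data.
n_rows = 26709
rows = np.arange(n_rows)

# Every fifth row (positions 0, 5, 10, ...) goes to the test set.
test_mask = rows % 5 == 0
test_rows = rows[test_mask]
train_rows = rows[~test_mask]

print(len(train_rows), len(test_rows))   # 21367 5342
print(test_rows[:5])                      # [ 0  5 10 15 20]
```

Because the split is purely positional, it neither shuffles nor stratifies by label; if the file were ordered by source or label, a shuffled split would likely be the safer choice.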


### Apply `tensorflow.keras` Tokenizer and get indices for words (5 Marks)
- Initialize a `Tokenizer` object with `num_words` set to 10000
- Fit the tokenizer on the `headline` column of the training data
- Convert the text to sequences of word indices


In [44]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [46]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train_data['headline'])
encodings_train = tokenizer.texts_to_sequences(train_data['headline'])
encodings_test = tokenizer.texts_to_sequences(test_data['headline'])
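To make the indexing concrete, here is a simplified pure-Python sketch of frequency-ranked tokenization. It is an illustration, not the Keras implementation: the helper names `build_word_index` and `texts_to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is never assigned."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index.get(word, 0) for word in text.lower().split()]
        sequences.append([i for i in ids if 0 < i < num_words])
    return sequences

train_texts = ["man bites dog", "dog bites man again", "dog sleeps"]
word_index = build_word_index(train_texts)
print(word_index)
# {'dog': 1, 'man': 2, 'bites': 3, 'again': 4, 'sleeps': 5}

# With num_words=4 only indices 1..3 survive; "cat" was never seen and is dropped.
sequences = texts_to_sequences(["cat bites dog again"], word_index, num_words=4)
print(sequences)
# [[3, 1]]
```

Fitting on the training headlines only, as the cell above does, keeps test-set vocabulary from leaking into the index.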

### Pad sequences (5 Marks)
- Pad (or truncate) each sequence to `maxlen` tokens
- Convert target column into numpy array

In [47]:
encodings_train = pad_sequences(encodings_train, maxlen=maxlen, padding='pre')
encodings_test = pad_sequences(encodings_test, maxlen=maxlen, padding='pre')

In [49]:
encodings_train.shape

(21367, 25)

In [50]:
encodings_train

array([[   0,    0,    0, ...,  285,    8,  950],
       [   0,    0,    0, ...,   45,    1, 9322],
       [   0,    0,    0, ..., 1283, 5953, 1088],
       ...,
       [   0,    0,    0, ...,    0,    8,   64],
       [   0,    0,    0, ..., 1760, 3160, 3463],
       [   0,    0,    0, ...,    5,    3,  836]], dtype=int32)
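The leading zeros in the array above come from `padding='pre'`. A small NumPy sketch of the same behavior, assuming the `pad_sequences` defaults of zero padding and truncation from the front; `pad_pre` is a hypothetical helper written for illustration:

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of long sequences."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        tail = seq[-maxlen:]                 # truncate from the front
        if tail:
            out[row, -len(tail):] = tail     # right-align, zeros on the left
    return out

padded = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6], []], maxlen=4)
print(padded)
# [[0 0 7 8]
#  [3 4 5 6]
#  [0 0 0 0]]
```

Right-aligning the tokens puts the real words next to the final LSTM step, which is one common reason to prefer pre-padding for recurrent models.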

### Vocab mapping
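- Index 0 is reserved (it is used for padding), so no word is mapped to it

In [58]:
# Check whether any word has been assigned index 0
0 in tokenizer.word_index.values()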

False

In [56]:
tokenizer.word_index

{'to': 1,
 'of': 2,
 'the': 3,
 'in': 4,
 'for': 5,
 'a': 6,
 'on': 7,
 'and': 8,
 'with': 9,
 'is': 10,
 'new': 11,
 'trump': 12,
 'man': 13,
 'from': 14,
 'at': 15,
 'about': 16,
 'you': 17,
 'this': 18,
 'by': 19,
 'after': 20,
 'be': 21,
 'out': 22,
 'up': 23,
 'how': 24,
 'as': 25,
 'it': 26,
 'that': 27,
 'not': 28,
 'are': 29,
 'your': 30,
 'what': 31,
 'his': 32,
 'all': 33,
 'who': 34,
 'more': 35,
 'he': 36,
 'just': 37,
 'will': 38,
 'has': 39,
 'year': 40,
 'why': 41,
 'one': 42,
 'into': 43,
 'report': 44,
 'have': 45,
 'area': 46,
 'over': 47,
 'donald': 48,
 'u': 49,
 'day': 50,
 'says': 51,
 's': 52,
 'can': 53,
 'first': 54,
 'time': 55,
 'woman': 56,
 'like': 57,
 'get': 58,
 'her': 59,
 "trump's": 60,
 'old': 61,
 'no': 62,
 'now': 63,
 'obama': 64,
 'an': 65,
 'off': 66,
 'life': 67,
 'people': 68,
 'than': 69,
 'was': 70,
 'still': 71,
 "'": 72,
 'make': 73,
 'house': 74,
 'women': 75,
 'back': 76,
 'my': 77,
 'i': 78,
 'clinton': 79,
 'down': 80,
 'white': 81,
 'i

### Set number of words
- Index 0 has no word, so add 1 to the vocabulary size to get the number of rows the embedding matrix needs

In [59]:
num_words = len(tokenizer.word_index) + 1
print(num_words)

26571


### Load GloVe Word Embeddings (5 Marks)

In [60]:
os.listdir()

['Sarcasm_Headlines_Dataset.json',
 'glove.6B.300d.txt',
 'glove.6B.zip',
 'glove.6B.100d.txt',
 'glove.6B.50d.txt',
 'glove.6B.200d.txt']

### Create embedding matrix

In [61]:
EMBEDDING_FILE = './glove.6B.200d.txt'

# Parse each line into a word and its 200-dimensional vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        word, *values = line.rstrip().split(' ')
        embeddings[word] = np.asarray(values, dtype='float32')

# Create a weight matrix for the words in the training headlines;
# row 0 and words missing from GloVe stay all-zero
embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
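The same lookup can be tried on toy data to see how out-of-vocabulary words end up as zero rows. This is a sketch under stated assumptions: the two 3-dimensional vectors and the `toy_word_index` dictionary are made up, and an in-memory string stands in for the GloVe file.

```python
import io
import numpy as np

# Two made-up 3-dimensional vectors in the GloVe text layout: word, then floats.
toy_glove = io.StringIO("dog 0.1 0.2 0.3\nman 0.4 0.5 0.6\n")
toy_word_index = {"dog": 1, "man": 2, "sleeps": 3}   # hypothetical tokenizer output
dim = 3

vectors = {}
for line in toy_glove:
    word, *values = line.rstrip().split(" ")
    vectors[word] = np.asarray(values, dtype="float32")

# Row 0 stays zero for padding; words missing from the file also stay zero.
matrix = np.zeros((len(toy_word_index) + 1, dim))
hits = 0
for word, i in toy_word_index.items():
    vec = vectors.get(word)
    if vec is not None:
        matrix[i] = vec
        hits += 1

print(f"coverage: {hits}/{len(toy_word_index)}")   # coverage: 2/3
print(matrix[3])                                    # [0. 0. 0.]
```

Counting the hits in the same way on the real vocabulary would show how many headline words actually receive a pretrained vector.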

### Define model (10 Marks)
- Hint: Use a `Sequential` model and add an `Embedding` layer, then a `Bidirectional(LSTM)` layer (flatten its output if it returns sequences), then dense and dropout layers as required. End with a final dense layer with sigmoid activation for binary classification.

In [62]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Embedding, Dropout
from tensorflow.keras.optimizers import Adam

### Compile the model (5 Marks)

In [63]:
model = Sequential()
# Frozen GloVe embeddings: the weights are loaded from embedding_matrix and not updated
model.add(Embedding(num_words, embedding_size, weights=[embedding_matrix], input_length=maxlen, trainable=False))
model.add(Bidirectional(LSTM(units=20)))
model.add(Dense(40, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(20))
model.add(Dense(1, activation='sigmoid'))

In [64]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 25, 200)           5314200   
_________________________________________________________________
bidirectional (Bidirectional (None, 40)                35360     
_________________________________________________________________
dense (Dense)                (None, 40)                1640      
_________________________________________________________________
dropout (Dropout)            (None, 40)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 20)                820       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 21        
=================================================================
Total params: 5,352,041
Trainable params: 37,841
Non-trainable params: 5,314,200
_________________________________________________________________
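The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes the standard LSTM layout of four gates per direction, each with input weights, recurrent weights, and a bias:

```python
vocab, emb, lstm_units = 26571, 200, 20

embedding = vocab * emb                                  # frozen lookup table
# Each LSTM direction has 4 gates, each with input, recurrent, and bias weights.
bilstm = 2 * 4 * (emb + lstm_units + 1) * lstm_units
dense_40 = (2 * lstm_units) * 40 + 40                    # input is 2 * 20 concatenated states
dense_20 = 40 * 20 + 20
dense_1 = 20 * 1 + 1

trainable = bilstm + dense_40 + dense_20 + dense_1
print(embedding, bilstm, dense_40, dense_20, dense_1)    # 5314200 35360 1640 820 21
print(trainable, embedding + trainable)                  # 37841 5352041
```

Almost all parameters sit in the frozen embedding table, so only about 38 thousand weights are actually trained.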

In [65]:
opt = Adam(learning_rate=0.001)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
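For intuition about the `binary_crossentropy` loss, here is a small NumPy sketch of the mean of -(y log p + (1 - y) log(1 - p)). The clipping constant `eps` and the sample probabilities are assumptions chosen for illustration, not values taken from Keras or from this model:

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # made-up, mostly correct and confident
hesitant = np.array([0.6, 0.4, 0.6, 0.4])    # made-up, correct but unsure

print(round(binary_crossentropy(y_true, confident), 3))   # 0.164
print(round(binary_crossentropy(y_true, hesitant), 3))    # 0.511
```

Both sets of predictions give 100% accuracy at a 0.5 threshold, yet the loss rewards the more confident one, which is why loss and accuracy can move differently during training.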

### Fit the model (5 Marks)

In [68]:
model.fit(encodings_train, train_data['is_sarcastic'],  epochs=15, batch_size=32, 
          validation_data=(encodings_test, test_data['is_sarcastic']))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7ff4c8dc45c0>

In [70]:
# predict_classes is deprecated; threshold the sigmoid outputs at 0.5 instead
preds = (model.predict(encodings_test) > 0.5).astype("int32")
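Turning sigmoid outputs into class labels is a simple threshold. A minimal sketch on made-up probabilities, shaped `(n, 1)` like the output of a single-unit sigmoid layer:

```python
import numpy as np

probs = np.array([[0.08], [0.51], [0.50], [0.97]])   # made-up sigmoid outputs, shape (n, 1)
labels = (probs > 0.5).astype("int32")
print(labels.ravel())   # [0 1 0 1]
```

The comparison is strict, so a probability of exactly 0.5 maps to class 0.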


In [85]:
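## predicting
# Caution: preds follows the row order of test_data, so preds[25] is the prediction
# for test_data.iloc[25] (original row 125), not for row 25 of input_data printed here.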
print('Headline -> ',input_data['headline'][25])
print('Predicted -> ', preds[25][0])
print('Actual -> ', input_data['is_sarcastic'][25])

Headline ->  why writers must plan to be surprised
Predicted ->  0
Actual ->  0


In [87]:
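# Caution: row 27 of input_data is in train_data, and preds[27] is the prediction for
# test_data.iloc[27] (original row 135), so this pair is not a matched example.
print('Headline -> ',input_data['headline'][27])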
print('Predicted -> ', preds[27][0])
print('Actual -> ', input_data['is_sarcastic'][27])

Headline ->  ex-con back behind bar
Predicted ->  1
Actual ->  1
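When spot-checking predictions, note that `preds` has one row per test example, in test-set order, while `input_data[...][k]` looks up the original row label `k`. The NumPy sketch below shows how positions and labels relate under the every-fifth-row split; the arrays are stand-ins for the DataFrame index:

```python
import numpy as np

n_rows = 26709
labels = np.arange(n_rows)               # original row labels of input_data
test_labels = labels[labels % 5 == 0]    # labels kept in test_data, in order

# Position 25 of the predictions belongs to original row 125, not row 25.
print(test_labels[25])                   # 125

# Original row 25 sits at test position 5; row 27 is not in the test set at all.
position_of_25 = int(np.where(test_labels == 25)[0][0])
print(position_of_25)                    # 5
print(27 in test_labels)                 # False
```

With the DataFrame itself, pairing `preds[k][0]` with `test_data['headline'].iloc[k]` and `test_data['is_sarcastic'].iloc[k]` keeps the prediction and the example aligned.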


In [90]:
from sklearn.metrics import classification_report
print(classification_report(test_data['is_sarcastic'],preds))

              precision    recall  f1-score   support

           0       0.86      0.89      0.87      2995
           1       0.85      0.82      0.83      2347

    accuracy                           0.85      5342
   macro avg       0.85      0.85      0.85      5342
weighted avg       0.85      0.85      0.85      5342



In [None]:
# Summary of the classification report
# Recall (share of actual class members that the model found):
#   class 0 -> 89%, class 1 -> 82%
#----------------------------------------------------
# Precision (share of predicted class members that are correct):
#   class 0 -> 86%, class 1 -> 85%
#----------------------------------------------------
# Overall accuracy -> 85%
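Precision and recall can be computed directly from label arrays, which helps in reading the report. The sketch below uses ten made-up labels and predictions, not results from this model:

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])   # made-up ground truth
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])   # made-up predictions

metrics = {}
for cls in (0, 1):
    tp = int(np.sum((y_pred == cls) & (y_true == cls)))
    precision = tp / int(np.sum(y_pred == cls))   # of the items predicted as cls, how many are right
    recall = tp / int(np.sum(y_true == cls))      # of the true cls items, how many were found
    metrics[cls] = (round(precision, 2), round(recall, 2))
    print(cls, *metrics[cls])
# 0 0.67 0.8
# 1 0.75 0.6
```

In this toy case class 1 has higher precision than recall, the same pattern the report shows for the sarcastic class: positive predictions are mostly right, but some sarcastic headlines are missed.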