# Sarcasm Detection

### Dataset

#### Acknowledgement
Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

In [1]:
import pandas as pd
import numpy as np
import tensorflow.keras as keras

### Load Data

In [3]:
data = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)
data.head()

|   | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |
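
The `lines=True` argument tells pandas that the file holds one JSON object per line rather than a single JSON array. As a rough illustration of that layout, the sketch below parses two made-up records with the standard library only. The sample strings and the `parse_json_lines` helper are hypothetical, not taken from the dataset.

```python
import json

# Two hypothetical records in JSON Lines form (one JSON object per line).
sample = (
    '{"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1}\n'
)

def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_json_lines(sample)
print(len(records), records[1]["headline"])
```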


### Drop `article_link` from dataset

In [4]:
data.drop('article_link', axis=1, inplace=True)
data.head()

|   | headline | is_sarcastic |
|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | mom starting to fear son's web series closest ... | 1 |
| 3 | boehner just wants wife to listen, not come up... | 1 |
| 4 | j.k. rowling wishes snape happy birthday in th... | 0 |


### Add a column with the number of words in each headline

In [5]:
# Get number of words in each headline and add it as a column
data['num_words'] = data['headline'].apply(lambda x: len(x.split()))
data.head()

|   | headline | is_sarcastic | num_words |
|---|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 | 12 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 | 14 |
| 2 | mom starting to fear son's web series closest ... | 1 | 14 |
| 3 | boehner just wants wife to listen, not come up... | 1 | 13 |
| 4 | j.k. rowling wishes snape happy birthday in th... | 0 | 11 |


### Split data into train and test sets

In [6]:
from sklearn.model_selection import train_test_split

X = data['headline']
y = data['is_sarcastic']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Convert to numpy arrays
X_train = np.array(X_train)
X_test = np.array(X_test)
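
Passing `stratify=y` keeps the share of sarcastic and non-sarcastic headlines roughly the same in both splits. The idea can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper run on toy labels.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 60 + [1] * 40                       # toy labels: 60% class 0, 40% class 1
train_idx, test_idx = stratified_split(labels)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)
```

Because each class is split separately, the proportion of class 1 in the test indices matches the proportion in the full label list.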

### Initialize parameter values
- Set values for `max_features`, `maxlen`, and `embedding_size`
- `max_features`: number of most frequent words the tokenizer keeps
- `maxlen`: maximum length of each headline in tokens (set to 25)
- `embedding_size`: dimensionality of each embedding vector

In [7]:
max_features = 10000
maxlen = 25
embedding_size = 200
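
A quick way to sanity-check a choice such as `maxlen = 25` is to measure how many headlines already fit within that many words. The sketch below does this on a few made-up headlines. `share_within` is a hypothetical helper, and on the real data the same calculation could be run over the `num_words` column.

```python
def share_within(word_counts, maxlen):
    """Fraction of texts whose word count is at most maxlen."""
    if not word_counts:
        return 0.0
    return sum(1 for n in word_counts if n <= maxlen) / len(word_counts)

# Made-up headlines: two short ones and one deliberately long one.
toy_headlines = [
    "short headline",
    "a somewhat longer headline with seven words",
    "an unusually long headline " + "word " * 30,
]
toy_counts = [len(h.split()) for h in toy_headlines]
print(toy_counts, share_within(toy_counts, maxlen=25))
```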

### Apply `tensorflow.keras` Tokenizer and get indices for words
- Initialize a `Tokenizer` object with `num_words` set to 10000
- Fit the tokenizer on the training headlines only
- Convert the training and test headlines to sequences of word indices


In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_features)

# Fit on train texts and convert them to sequences 
tokenizer.fit_on_texts(X_train) # X_train is np.array having training headlines
X_train_sequences = tokenizer.texts_to_sequences(X_train)

# Convert the test texts to sequences
X_test_sequences = tokenizer.texts_to_sequences(X_test) # X_test is np.array having test headlines
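
Conceptually, the tokenizer ranks words by frequency, assigns index 1 to the most frequent word, and, when converting text to sequences, keeps only words whose index is below `num_words`. With no `oov_token` set, every other word is dropped. The following is a simplified pure-Python approximation of that behaviour: it lowercases and splits on whitespace but does not reproduce Keras's punctuation filtering, and the function names are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their indices, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue                      # word never seen during fitting
            if num_words is not None and idx >= num_words:
                continue                      # outside the most frequent words
            seq.append(idx)
        sequences.append(seq)
    return sequences

train = ["the man bites the dog", "the dog sleeps"]
word_index = build_word_index(train)
print(word_index)
print(to_sequences(["the cat bites the dog"], word_index, num_words=3))
```

In the toy call, "cat" is dropped because it was never seen during fitting, and "bites" is dropped because its rank falls outside the `num_words` limit.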

### Pad sequences
- Pad or truncate each sequence to `maxlen` tokens
- Convert the target column to a NumPy array

In [9]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad at the end with 0; index 0 is reserved for padding and maps to no word
X_train_sequences_pad = pad_sequences(X_train_sequences, maxlen=maxlen, padding='post', value=0)
X_test_sequences_pad = pad_sequences(X_test_sequences, maxlen=maxlen, padding='post', value=0)

# Convert targets to numpy array
y_train = np.array(y_train)
y_test = np.array(y_test)
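
With `padding='post'`, zeros are appended after the tokens. Note that `pad_sequences` truncates from the front by default (`truncating='pre'`), so a headline longer than `maxlen` loses its first words rather than its last ones. A minimal pure-Python sketch of that behaviour, with `pad_post` as a hypothetical helper:

```python
def pad_post(seq, maxlen, value=0):
    """Keep the last maxlen tokens, then right-pad with value (assumes maxlen >= 1)."""
    kept = seq[-maxlen:]                      # mimics truncating='pre'
    return kept + [value] * (maxlen - len(kept))

print(pad_post([5, 8, 2], maxlen=5))
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))
```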

### Vocab mapping
- Index 0 is not assigned to any word; it is reserved for padding

In [10]:
tokenizer.word_index

{'to': 1,
 'of': 2,
 'the': 3,
 'in': 4,
 'for': 5,
 'a': 6,
 'on': 7,
 'and': 8,
 'with': 9,
 'is': 10,
 'new': 11,
 'trump': 12,
 'man': 13,
 'at': 14,
 'from': 15,
 'about': 16,
 'you': 17,
 'by': 18,
 'this': 19,
 'after': 20,
 'up': 21,
 'be': 22,
 'out': 23,
 'how': 24,
 'it': 25,
 'that': 26,
 'as': 27,
 'not': 28,
 'are': 29,
 'your': 30,
 'his': 31,
 'what': 32,
 'he': 33,
 'all': 34,
 'just': 35,
 'has': 36,
 'who': 37,
 'into': 38,
 'one': 39,
 'more': 40,
 'report': 41,
 'will': 42,
 'why': 43,
 'year': 44,
 'over': 45,
 'area': 46,
 'have': 47,
 'day': 48,
 'says': 49,
 'u': 50,
 'can': 51,
 'donald': 52,
 's': 53,
 'woman': 54,
 'first': 55,
 'time': 56,
 'no': 57,
 'get': 58,
 'like': 59,
 'old': 60,
 'off': 61,
 "trump's": 62,
 'her': 63,
 'obama': 64,
 'an': 65,
 'now': 66,
 'life': 67,
 'people': 68,
 "'": 69,
 'was': 70,
 'make': 71,
 'women': 72,
 'house': 73,
 'than': 74,
 'still': 75,
 'white': 76,
 'back': 77,
 'my': 78,
 'if': 79,
 'clinton': 80,
 'when': 81,
 ...}

### Set number of words
- Index 0 is not assigned to any word, so add 1 to the vocabulary size to cover indices 0 through `len(tokenizer.word_index)`

In [11]:
num_words = len(tokenizer.word_index) + 1
print(num_words)

26597


### Load GloVe Word Embeddings

The GloVe embedding file (`glove.6B.200d.txt`) was downloaded from https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk and placed in the current working directory.

### Create embedding matrix

In [12]:
EMBEDDING_FILE = 'glove.6B.200d.txt'

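# Map each word to its pretrained GloVe vector
embeddings = {}
with open(EMBEDDING_FILE, encoding="utf8") as f:
    for line in f:
        # Each line holds a word followed by its space-separated vector values
        values = line.rstrip().split(" ")
        word = values[0]
        embeddings[word] = np.asarray(values[1:], dtype='float32')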

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
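
Words without a GloVe vector keep an all-zero row in `embedding_matrix`, so it can be useful to measure how much of the vocabulary is actually covered. The sketch below does so on toy dictionaries; `embedding_coverage` is a hypothetical helper and the two-dimensional vectors are made up.

```python
def embedding_coverage(word_index, embeddings):
    """Return (share of vocabulary with a pretrained vector, list of missing words)."""
    missing = [word for word in word_index if word not in embeddings]
    covered = len(word_index) - len(missing)
    share = covered / len(word_index) if word_index else 0.0
    return share, missing

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary.
toy_word_index = {"the": 1, "trump": 2, "covfefe": 3, "area": 4}
toy_embeddings = {"the": [0.1, 0.2], "trump": [0.3, 0.4], "area": [0.5, 0.6]}

share, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(share, missing)
```

On the real data the same function could be called with `tokenizer.word_index` and `embeddings`.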

### Define model

In [28]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Embedding, Dense, Bidirectional, Input, Flatten, Dropout

inputs = Input(shape=(maxlen,)) # Input layer
model = Embedding(num_words, embedding_size, 
                  embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                  trainable=False)(inputs)
model = Bidirectional(LSTM(units=500, dropout=0.2, recurrent_dropout=0.2))(model) # Bi-directional LSTM layer
model = Flatten()(model) # Flatten (no-op here: the BiLSTM output is already 2D)
model = Dense(200, activation='relu')(model) # Dense layer 1
model = Dropout(0.2)(model) # Dropout 1
model = Dense(100, activation='relu')(model) # Dense layer 2
model = Dropout(0.2)(model) # Dropout 2
model = Dense(50, activation='relu')(model) # Dense layer 3
model = Dropout(0.2)(model) # Dropout 3
out = Dense(1, activation='sigmoid')(model) # Sigmoid output layer

model = Model(inputs, out) # Complete model



### Compile the model

In [29]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]) # Compile the model
model.summary()

Model: "functional_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(None, 25)]              0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 25, 200)           5319400   
_________________________________________________________________
bidirectional_5 (Bidirection (None, 1000)              2804000   
_________________________________________________________________
flatten_5 (Flatten)          (None, 1000)              0         
_________________________________________________________________
dense_18 (Dense)             (None, 200)               200200    
_________________________________________________________________
dropout_12 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_19 (Dense)             (None, 100)             
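
The parameter counts visible in the summary can be reproduced by hand, which is a handy check that the layers are wired as intended. The sketch below uses the standard formulas: an embedding layer has `vocab_size * dim` weights, an LSTM has four gates that each carry input weights, recurrent weights, and a bias, and a dense layer has `inputs * units + units` parameters. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary index."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * ((input_dim + units + 1) * units)

def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units

print(embedding_params(26597, 200))       # embedding layer
print(2 * lstm_params(200, 500))          # bidirectional wrapper = two LSTMs
print(dense_params(1000, 200))            # first dense layer
```

These evaluate to 5319400, 2804000, and 200200, matching the `embedding_5`, `bidirectional_5`, and `dense_18` rows of the summary.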

### Fit the model

In [30]:
model.fit(X_train_sequences_pad, y_train, batch_size=32, epochs=5, validation_data=(X_test_sequences_pad, y_test), verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x2b043bcf188>

**Validation accuracy: 85.90%**
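
The sigmoid output is a probability between 0 and 1, and the accuracy figure comes from thresholding it at 0.5 and comparing against the true labels. A small pure-Python sketch of that calculation on made-up values follows; `binary_accuracy` here is a hypothetical stand-in, not the Keras function.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

toy_probs = [0.91, 0.12, 0.55, 0.40]      # made-up model outputs
toy_labels = [1, 0, 0, 1]                 # made-up ground truth
print(binary_accuracy(toy_probs, toy_labels))
```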

## Conclusion

We have successfully built a sarcasm detection model for news headlines that reaches about **86% validation accuracy**.