# Sarcasm Detection
**Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

## Install TensorFlow 2.0

In [None]:
# !!pip uninstall tensorflow
# !pip install tensorflow==2.0.0

## Get Required Files from Drive

In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')

#using local machine

In [None]:
#Set your project path 
# project_path =  ## Add your path here ##

# **Reading and Exploring Data**

## Read Data "Sarcasm_Headlines_Dataset.json". Explore the data and get some insights about the data. ( 4 marks)
Hint - As it's in JSON format, you need to use the pandas.read_json function with the parameter lines=True.

In [1]:
import pandas as pd

In [2]:
sarcasm_df = pd.read_json('./Sarcasm_Headlines_Dataset.json', lines=True)
sarcasm_df.sample(5)

Unnamed: 0,article_link,headline,is_sarcastic
10297,https://www.huffingtonpost.com/entry/one-size-...,one size does not fit all: three questions to ...,0
26300,https://www.theonion.com/bubba-gump-shrimp-own...,bubba gump shrimp owner comforts depressed guy...,1
20641,https://www.huffingtonpost.com/entry/tpp-signe...,u.s. allies sign landmark trade pact as trump ...,0
5565,https://entertainment.theonion.com/new-documen...,new documentary focuses on life of eva braun's...,1
1571,https://local.theonion.com/newlyweds-regret-sa...,newlyweds regret saving sex for marriage,1
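
The `lines=True` hint means the file is in JSON Lines form, with one JSON object per line rather than a single JSON array. As a rough illustration of that layout and of a first "insight" worth checking (the class balance), here is a dependency-free sketch. The three records are hypothetical toy data, not rows from the dataset.

```python
import io
import json
from collections import Counter

# Hypothetical toy records in JSON Lines form: one JSON object per line.
raw = io.StringIO(
    '{"headline": "toy headline one", "is_sarcastic": 0}\n'
    '{"headline": "toy headline two", "is_sarcastic": 1}\n'
    '{"headline": "toy headline three", "is_sarcastic": 1}\n'
)

# Parse each non-empty line on its own, which is what lines=True implies.
records = [json.loads(line) for line in raw if line.strip()]

# Count how many examples fall into each class.
label_counts = Counter(rec["is_sarcastic"] for rec in records)
print(label_counts)
```

On the real DataFrame, `sarcasm_df['is_sarcastic'].value_counts()` should give the same kind of summary.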


## Drop `article_link` from dataset. ( 2 marks)
We only need the headline text and the is_sarcastic column for this project, so we can drop the article link column here.

In [3]:
sarcasm_df.drop('article_link', axis=1, inplace=True)
sarcasm_df.head()

Unnamed: 0,headline,is_sarcastic
0,former versace store clerk sues over secret 'b...,0
1,the 'roseanne' revival catches up to our thorn...,0
2,mom starting to fear son's web series closest ...,1
3,"boehner just wants wife to listen, not come up...",1
4,j.k. rowling wishes snape happy birthday in th...,0


## Get the Length of each line and find the maximum length. ( 4 marks)
Different lines have different lengths, so we need to pad our sequences using the maximum length.

In [4]:
sarcasm_df['len'] = sarcasm_df['headline'].apply(lambda x: len(x.split(" ")))

In [5]:
sarcasm_df.head()

Unnamed: 0,headline,is_sarcastic,len
0,former versace store clerk sues over secret 'b...,0,12
1,the 'roseanne' revival catches up to our thorn...,0,14
2,mom starting to fear son's web series closest ...,1,14
3,"boehner just wants wife to listen, not come up...",1,13
4,j.k. rowling wishes snape happy birthday in th...,0,11


In [6]:
sarcasm_df.max()

headline        â€‹report: all standing between trump and presid...
is_sarcastic                                                    1
len                                                            39
dtype: object
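
One detail of the length computation that may be worth knowing: `split(" ")` counts an empty token wherever two spaces appear in a row, while `split()` with no argument collapses runs of whitespace. The sketch below uses hypothetical toy headlines to show the difference; whether the real dataset contains doubled spaces is not something this notebook checks.

```python
# Hypothetical toy headlines; the second has a doubled space on purpose.
headlines = ["a short headline", "two  spaces here", "one"]

# split(" ") treats every single space as a separator, so a doubled space
# produces an empty token that is still counted.
lengths_single = [len(h.split(" ")) for h in headlines]

# split() with no argument collapses runs of whitespace.
lengths_any = [len(h.split()) for h in headlines]

print(lengths_single)  # [3, 4, 1]
print(lengths_any)     # [3, 3, 1]
print(max(lengths_any))
```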

# **Modelling**

## Import the modules required for modelling.

In [7]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential

# Set Different Parameters for the model. ( 2 marks)

In [8]:
max_features = 10000
maxlen = 30
embedding_size = 200

## Apply the Keras Tokenizer to the headline column of your data.  ( 4 marks)
Hint - First create a tokenizer instance using Tokenizer(num_words=max_features),
and then fit this tokenizer instance on your data column df['headline'] using .fit_on_texts().

In [9]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(sarcasm_df['headline'])
X = tokenizer.texts_to_sequences(sarcasm_df['headline'])

print("Number of Samples:", len(X))
print(X[0])

X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(sarcasm_df['is_sarcastic'])

print("\nNumber of Labels: ", len(y))
print(y[0])

Number of Samples: 26709
[307, 678, 3336, 2297, 47, 381, 2575, 5, 2576, 8433]

Number of Labels:  26709
0
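
To make the tokenizer step less of a black box, here is a simplified pure-Python sketch of the idea: words are ranked by frequency, the most frequent word gets index 1, and only words whose index is below `num_words` survive in the output sequences. The helper names `build_word_index` and `texts_to_sequences` below are illustrative stand-ins, and the sketch skips the punctuation filtering that the real `Tokenizer` also performs.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words; drop the rest."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

# Hypothetical toy corpus.
texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=3)
print(word_index)
print(sequences)
```

Dropping rare words this way is one reason a tokenized headline can end up shorter than its raw word count.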


# Define X and y for your model.

In [12]:
X = tokenizer.texts_to_sequences(sarcasm_df['headline'])
X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(sarcasm_df['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 26709
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0  307  678 3336 2297   47  381 2575    5
 2576 8433]
Number of Labels:  26709
0
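
The zeros at the front of `X[0]` come from the default behaviour of `pad_sequences`, which pads on the left and, for sequences longer than `maxlen`, also truncates from the left. A minimal pure-Python sketch of that default follows; `pad_pre` is a hypothetical helper, not part of Keras, and it assumes a positive `maxlen`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a sequence to exactly maxlen items."""
    trimmed = sequence[-maxlen:]  # keep the last maxlen tokens
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([307, 678, 3336], maxlen=5))       # [0, 0, 307, 678, 3336]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Since `maxlen = 30` here while the longest headline has 39 words by the earlier count, headlines that still have more than 30 tokens after tokenization lose their first tokens.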


## Get the Vocabulary size ( 2 marks)
Hint : You can use tokenizer.word_index.

In [13]:
num_words = len(tokenizer.word_index) + 1
num_words

29657

# **Word Embedding**

## Get Glove Word Embeddings

In [21]:
glove_file = "./glove.6B.200d.txt"

In [22]:
#Extract Glove embedding zip file
# from zipfile import ZipFile
# with ZipFile(glove_file, 'r') as z:
#   z.extractall()

#Downloaded glove.6B.200d only

# Get the Word Embeddings using Embedding file as given below.

In [23]:
EMBEDDING_FILE = glove_file

# Map each word to its pretrained GloVe vector.
embeddings = {}
with open(EMBEDDING_FILE, encoding="utf8") as f:
    for line in f:
        # Each line is a word followed by its space-separated vector components.
        values = line.rstrip().split(" ")
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')
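
The loop above relies on the GloVe text layout: each line holds a token followed by its vector components, separated by single spaces. As a small sketch of that parsing on in-memory data, the example below uses two hypothetical three-dimensional entries and plain Python floats in place of NumPy arrays.

```python
import io

# Hypothetical three-dimensional vectors in the GloVe text layout:
# a token followed by space-separated floats, one entry per line.
fake_glove = io.StringIO(
    "work 0.1 0.2 0.3\n"
    "play -0.5 0.0 0.25\n"
)

toy_embeddings = {}
for line in fake_glove:
    parts = line.rstrip().split(" ")
    toy_embeddings[parts[0]] = [float(x) for x in parts[1:]]

print(toy_embeddings["play"])
```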



# Create a weight matrix for words in training docs

In [25]:
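# Row i holds the GloVe vector for the word with index i. Words without a
# pretrained vector, and the padding index 0, stay all zeros.
embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings)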

400000
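
Words that GloVe does not know keep an all-zero row in the weight matrix, so it can be useful to measure how much of the vocabulary is actually covered. The sketch below shows one way to do that; `toy_word_index` and `toy_embeddings` are hypothetical stand-ins for `tokenizer.word_index` and `embeddings`.

```python
# Hypothetical stand-ins for tokenizer.word_index and the GloVe dictionary.
toy_word_index = {"the": 1, "cat": 2, "zxqv": 3}
toy_embeddings = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}

# Words without a pretrained vector keep an all-zero row in the matrix.
missing = [w for w in toy_word_index if w not in toy_embeddings]
coverage = 1 - len(missing) / len(toy_word_index)

print(f"coverage: {coverage:.2%}, missing: {missing}")
```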

In [31]:
# let's check embedding for word ==> work
embeddings['work']

array([ 5.6792e-03,  2.2325e-01, -9.7926e-02, -1.6128e-01,  4.7453e-01,
       -3.3332e-01, -3.7491e-01, -4.1808e-02, -5.9711e-02,  2.3397e-01,
        5.7158e-01,  2.8719e-01, -1.1798e-01,  3.5308e-01,  2.7206e-01,
        7.1822e-03, -3.8106e-01,  3.5700e-01,  1.6333e-01,  3.2810e-01,
       -1.9585e-02,  2.8545e+00,  3.0997e-01, -1.7071e-01,  6.5618e-01,
        6.3599e-01,  2.2558e-01, -3.7270e-02,  3.6916e-01,  2.1133e-01,
       -2.0398e-01, -2.2599e-01, -3.5113e-04, -2.6588e-01, -1.8939e-01,
       -4.1834e-01, -4.7140e-01, -4.1733e-01,  2.7964e-01, -2.0345e-01,
       -1.1666e-01, -1.9084e-02, -2.4930e-02,  1.5921e-01, -5.8741e-02,
       -1.5579e-01,  3.0610e-01, -3.6300e-01,  8.4587e-02, -2.3893e-02,
       -1.2391e-01,  3.3058e-01, -3.2631e-01,  5.6357e-01,  1.5621e-01,
       -3.0395e-01,  3.0185e-01,  1.8804e-01,  2.7050e-01, -8.5709e-03,
        7.3487e-02,  1.4172e-01, -5.5930e-01, -1.2523e-01, -5.2305e-01,
        2.2663e-01,  2.3772e-02,  3.1973e-01, -2.3190e-01,  2.17

## Create and Compile your Model  ( 7 marks)
Hint - Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.


In [26]:
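model = Sequential()
# Initialize the embedding layer with the pretrained GloVe weights.
model.add(Embedding(num_words, embedding_size, weights = [embedding_matrix]))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
# Collapse the per-timestep outputs into one vector per headline so the
# final layer emits a single probability instead of one per timestep.
model.add(GlobalMaxPool1D())
model.add(Dense(40, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(20, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])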
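
The import cell brings in `GlobalMaxPool1D`, which is one common way to collapse the per-timestep outputs of an LSTM with `return_sequences = True` into a single vector per headline. The following pure-Python sketch shows the operation on a hypothetical 3-timestep, 2-feature sequence; `global_max_pool_1d` is an illustrative helper, not the Keras layer itself.

```python
def global_max_pool_1d(sequence):
    """Reduce a (timesteps, features) list to one vector of per-feature maxima."""
    return [max(column) for column in zip(*sequence)]

# Hypothetical LSTM outputs: 3 timesteps, 2 features each.
lstm_outputs = [
    [0.1, 0.9],
    [0.7, 0.2],
    [0.4, 0.5],
]
pooled = global_max_pool_1d(lstm_outputs)
print(pooled)  # [0.7, 0.9]
```

After pooling, each headline is represented by one fixed-size vector, which is the shape a final sigmoid layer needs to produce one prediction per headline.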

# Fit your model with a batch size of 100 and validation_split = 0.2, and state the validation accuracy ( 5 marks)


In [27]:
batch_size = 100
epochs = 5

history = model.fit(X, y, batch_size=batch_size, epochs=epochs, validation_split=0.2)

Train on 21367 samples, validate on 5342 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
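
With `validation_split`, Keras holds out the last fraction of the samples, taken before any shuffling, rather than a random subset. The sketch below reproduces the sample counts printed above. The rounding rule in `split_index` is an assumption that happens to match those counts, and the helper itself is hypothetical.

```python
import math

def split_index(num_samples, validation_split):
    """Index at which the training portion ends and validation begins."""
    return int(math.floor(num_samples * (1.0 - validation_split)))

n_samples = 26709
cut = split_index(n_samples, 0.2)
print(cut, n_samples - cut)  # 21367 5342
```

Because the held-out block is simply the tail of the data, it may be worth shuffling the rows beforehand if the file has any ordering.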


In [36]:
# Export the trained embedding weights for the TensorFlow Embedding Projector.
reverse_word_index = {index: word for word, index in tokenizer.word_index.items()}
weights = model.layers[0].get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

# vecs.tsv gets one tab-separated vector per line; meta.tsv gets the matching word.
with open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     open('meta.tsv', 'w', encoding='utf-8') as out_m:
    # Index 0 is reserved for padding, so start at 1.
    for word_n in range(1, num_words):
        out_m.write(reverse_word_index[word_n] + "\n")
        out_v.write('\t'.join(str(x) for x in weights[word_n]) + "\n")

(29657, 200)
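
The two files follow a simple paired layout: line *k* of the vector file holds the tab-separated components for the word on line *k* of the metadata file. A small in-memory sketch of that layout follows, using hypothetical two-dimensional vectors and `io.StringIO` instead of real files.

```python
import io

# Hypothetical vocabulary and 2-d vectors; row 0 is the padding index.
toy_index_to_word = {1: "the", 2: "cat"}
toy_weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]

vecs, meta = io.StringIO(), io.StringIO()
for idx in sorted(toy_index_to_word):
    meta.write(toy_index_to_word[idx] + "\n")
    vecs.write("\t".join(str(x) for x in toy_weights[idx]) + "\n")

print(vecs.getvalue())
print(meta.getvalue())
```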


In [37]:
# We can upload these TSV files to https://projector.tensorflow.org/ to visualize the embeddings.