# Sarcasm Detection
**Acknowledgement**

Misra, Rishabh, and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

**The required files are available at the link below.**

https://drive.google.com/drive/folders/1xUnF35naPGU63xwRDVGc-DkZ3M8V5mMk

## Install TensorFlow 2.0

In [None]:
!pip uninstall -y tensorflow
!pip install tensorflow==2.0.0

## Get Required Files from Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
#Set your project path 
project_path = '/content/drive/My Drive/Colab Notebooks/NLP/Data/'

In [4]:
cd /content/drive/My Drive/Colab Notebooks/NLP

/content/drive/My Drive/Colab Notebooks/NLP


In [6]:
#Loading The libraries
import pandas as pd
import numpy as np

# Reading and Exploring Data

## Read the data from "Sarcasm_Headlines_Dataset.json" and explore it to get some insights. (4 marks)
Hint - As the file is in JSON format, you can use the pandas.read_json function with the parameter lines=True.
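
The hint above can be sketched as follows. This is a minimal example, assuming a JSON Lines input with the same three fields as the dataset; the two records are made-up placeholders (not rows from the dataset) and are fed through `io.StringIO` so the snippet runs without the real file.

```python
import io
import pandas as pd

# Two placeholder records in JSON Lines format (one JSON object per line).
sample = io.StringIO(
    '{"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1}\n'
)

# lines=True tells pandas to parse each line as a separate record.
df_sample = pd.read_json(sample, lines=True)

print(df_sample.shape)
print(df_sample["is_sarcastic"].value_counts())
```

With the real file, the equivalent call would be `pd.read_json('Data/Sarcasm_Headlines_Dataset.json', lines=True)`.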

In [7]:
import json
def parseJson(fname):
    # Each line is one JSON object; json.loads is safer than eval
    with open(fname, 'r') as f:
        for line in f:
            yield json.loads(line)

In [8]:
data = list(parseJson('Data/Sarcasm_Headlines_Dataset.json'))

In [9]:
df_data=pd.DataFrame(data)

In [10]:
df_data.head()

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


## Drop `article_link` from the dataset. (2 marks)
We only need the headline text and the is_sarcastic column for this project, so we can drop the article_link column here.

In [11]:
df_data=df_data.drop('article_link',axis=1)

In [12]:
df_data.head()

Unnamed: 0,headline,is_sarcastic
0,former versace store clerk sues over secret 'b...,0
1,the 'roseanne' revival catches up to our thorn...,0
2,mom starting to fear son's web series closest ...,1
3,"boehner just wants wife to listen, not come up...",1
4,j.k. rowling wishes snape happy birthday in th...,0


In [21]:
df_data.shape

(26709, 2)

## Get the length of each line and find the maximum length. (4 marks)
As different lines have different lengths, we need to pad our sequences to the maximum length.

In [13]:
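# Caution: each iteration assigns len(i) to the entire 'len' column,
# so every row ends up holding the length of the last headline.
for i in df_data['headline']:
  df_data['len']=len(i)
df_data.head()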

Unnamed: 0,headline,is_sarcastic,len
0,former versace store clerk sues over secret 'b...,0,33
1,the 'roseanne' revival catches up to our thorn...,0,33
2,mom starting to fear son's web series closest ...,1,33
3,"boehner just wants wife to listen, not come up...",1,33
4,j.k. rowling wishes snape happy birthday in th...,0,33


In [14]:
df_data['len'].max()

33
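
In the loop above, each iteration assigns a single number to the whole `len` column, which is why every row shows 33. A per-row length could be computed as sketched below, assuming a toy DataFrame with placeholder headlines (not taken from the dataset). Whether to count characters or words is a choice; word counts are closer to what the padded token sequences contain later on.

```python
import pandas as pd

# Placeholder headlines, not taken from the dataset.
toy = pd.DataFrame({"headline": ["short one", "a slightly longer headline here"]})

# Character length of each headline, computed row by row.
toy["char_len"] = toy["headline"].str.len()

# Word count of each headline (split on whitespace).
toy["word_len"] = toy["headline"].str.split().str.len()

print(toy)
print("max words:", toy["word_len"].max())
```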

# Modelling

## Import the modules required for modelling.

In [15]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential

## Set different parameters for the model. (2 marks)

In [16]:
max_features = 10000
maxlen = 33
embedding_size = 200

## Apply the Keras Tokenizer to the headline column of your data. (4 marks)
Hint - First create a tokenizer instance using Tokenizer(num_words=max_features),
then fit this tokenizer instance on your data column df['headline'] using .fit_on_texts()

In [17]:
# Tokenizer that keeps only the most frequent words (limit set by max_features)
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(df_data['headline'])  # Build the word index from the headlines
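
To make the fitting step less opaque, here is a simplified pure-Python stand-in for what the tokenizer does: rank words by frequency, map each to an integer starting at 1, and drop words whose index is not below `num_words` when encoding. It is only a sketch; the real Keras `Tokenizer` also strips punctuation, and the helper names `build_word_index` and `texts_to_ids` are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Encode texts as lists of ids, keeping only ids below num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["the cat sat", "the dog sat down", "the end"]
toy_index = build_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, toy_index, num_words=4)

print(toy_index)
print(toy_ids)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value.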

## Define X and y for your model.

In [18]:
X = tokenizer.texts_to_sequences(df_data['headline'])
X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(df_data['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 26709
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0  307  678 3336 2297   47
  381 2575    5 2576 8433]
Number of Labels:  26709
0
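
The leading zeros in `X[0]` come from `pad_sequences`, which by default pads (and truncates) at the front of each sequence. A minimal pure-Python sketch of that default behaviour, using a hypothetical helper `pad_pre`:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen items."""
    seq = list(seq)[-maxlen:]  # if too long, keep only the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([307, 678, 3336], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 307, 678, 3336]
print(long_)  # [3, 4, 5, 6, 7]
```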


## Get the Vocabulary size ( 2 marks)
Hint : You can use tokenizer.word_index.

In [19]:
# Maximum sequence length (after padding, every row has length maxlen)
max_encoder_seq_length = max([len(txt) for txt in X])
print('Maximum sequence length: ', max_encoder_seq_length)

# Vocabulary size: number of distinct words seen by the tokenizer
encoder_vocab_size = len(tokenizer.word_index)
print('Vocabulary size: ', encoder_vocab_size)

Maximum sequence length:  33
Vocabulary size:  29656


# Word Embedding

## Get GloVe Word Embeddings

In [20]:
glove_file = project_path + "glove.6B.zip"

In [21]:
# Extract the GloVe embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
  z.extractall()

## Get the word embeddings using the embedding file as given below.

In [23]:
EMBEDDING_FILE = './glove.6B.200d.txt'

# Each line holds a word followed by the components of its vector
embeddings = {}
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(" ")
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')



## Create a weight matrix for words in training docs

In [25]:
# Index 0 is reserved for padding, hence the +1
num_words=encoder_vocab_size+1
embedding_matrix = np.zeros((num_words, embedding_size))

# Words without a GloVe vector keep an all-zero row
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings.values())

400000
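
Words missing from GloVe keep an all-zero row in the weight matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below assumes a toy word index and a toy 3-dimensional embedding dictionary in place of the real ones; the names `toy_word_index`, `toy_embeddings`, and `coverage` are illustrative only.

```python
import numpy as np

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary.
toy_word_index = {"the": 1, "cat": 2, "zzzqx": 3}
toy_embeddings = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
dim = 3

# Same construction as above: row 0 is left for padding.
matrix = np.zeros((len(toy_word_index) + 1, dim))
for word, i in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is not None:
        matrix[i] = vec

# A row counts as covered if it contains at least one non-zero value.
covered = int(np.count_nonzero(matrix.any(axis=1)))
coverage = covered / len(toy_word_index)
print(f"{covered} of {len(toy_word_index)} words have a pretrained vector ({coverage:.0%})")
```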

## Create and Compile your Model  ( 7 marks)
Hint - Use Sequential model instance and then add Embedding layer, Bidirectional(LSTM) layer, then dense and dropout layers as required. 
In the end add a final dense layer with sigmoid activation for binary classification.


In [26]:
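model = Sequential()
# Embedding layer initialised with the GloVe weight matrix
model.add(Embedding(num_words, embedding_size, weights = [embedding_matrix]))
# Bidirectional LSTM layer; return_sequences=True keeps one output vector per timestep
model.add(Bidirectional(LSTM(128, return_sequences = True)))
# Output layer with sigmoid activation; applied at every timestep, so the output shape is (batch, timesteps, 1)
model.add(Dense(1,activation='sigmoid'))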
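
Because the LSTM above uses `return_sequences=True`, the final `Dense(1)` is applied to every timestep, so the model emits one probability per token position rather than one per headline. The NumPy sketch below only illustrates the shapes involved, with random numbers standing in for the BiLSTM output and the dense weights. Taking the maximum over the time axis is what a `GlobalMaxPool1D` layer (already imported above) would do if it were placed before the dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, features = 4, 33, 256  # 256 = 2 x 128 LSTM units (both directions)

seq_out = rng.random((batch, timesteps, features))  # stand-in for the BiLSTM output
w = rng.random((features, 1))                       # stand-in for the Dense(1) weights

per_step = seq_out @ w         # dense applied at every timestep
pooled = seq_out.max(axis=1)   # maximum over the time axis
per_headline = pooled @ w      # dense applied once per headline

print(per_step.shape)      # (4, 33, 1)
print(per_headline.shape)  # (4, 1)
```

Only the second shape lines up with a label array holding one value per headline.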

In [27]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

## Fit your model with a batch size of 100 and validation_split = 0.2, and state the validation accuracy. (5 marks)
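
Keras takes the validation samples from the end of the arrays passed to `fit`, before any shuffling, so `validation_split=0.2` holds out the last 20% of the headlines. Below is a small sketch of the resulting sizes for the 26,709 samples in this notebook. The helper name `split_sizes` is hypothetical, and truncating the training count to an integer is an assumption about the rounding that is worth checking against your Keras version.

```python
def split_sizes(n_samples, validation_split):
    """Return (n_train, n_val) when the last fraction of samples is held out."""
    n_train = int(n_samples * (1.0 - validation_split))
    return n_train, n_samples - n_train

n_train, n_val = split_sizes(26709, 0.2)
print(n_train, n_val)
```

Since the hold-out is positional, shuffling the data before calling `fit` matters if the file happens to be ordered in some way.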


In [29]:
batch_size = 100
epochs = 5

## Add your code here ##
model.fit(X,y,
          epochs=epochs,
          batch_size=batch_size,          
          verbose=1,
          validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f99a5846470>

The validation accuracy is 85.98%.
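
The training log above does not show per-epoch metrics, so the figure has to be read from the `History` object that `fit` returns. Here is a sketch, assuming the call is stored as `history = model.fit(...)` and that the metric key is `val_accuracy` (the name TensorFlow 2.x uses when compiling with `metrics=['accuracy']`). The dictionary holds made-up placeholder numbers, not results from this notebook, and `best_epoch` is a hypothetical helper.

```python
# Placeholder standing in for history.history; the values are invented for illustration.
placeholder_history = {
    "accuracy": [0.70, 0.80, 0.85],
    "val_accuracy": [0.72, 0.79, 0.78],
}

def best_epoch(history_dict, key="val_accuracy"):
    """Return (1-based epoch number, value) for the highest value of the given metric."""
    values = history_dict[key]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]

epoch, value = best_epoch(placeholder_history)
print(f"best val_accuracy {value:.2%} at epoch {epoch}")
```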