<a href="https://colab.research.google.com/github/ShashwatVv/DetectingSaracasm-In-NewsHeadlines/blob/main/Detecting_Sarcasm_in_News_headlines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Let's first load the dataset.

In [8]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

--2022-03-28 10:12:08--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.218.128, 108.177.11.128, 74.125.31.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.218.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘/tmp/sarcasm.json’


2022-03-28 10:12:08 (218 MB/s) - ‘/tmp/sarcasm.json’ saved [5643545/5643545]



In [47]:
## Libraries to import
import numpy as np
import json
import io
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer as TKN
from tensorflow.keras.preprocessing.sequence import pad_sequences as padseq
print("Imported!!")

Imported!!


In [48]:
## Let's load the JSON data

## the with-block closes the file even if json.load raises
with open('/tmp/sarcasm.json', 'r') as file_sarc:
  data = json.load(file_sarc)

In [49]:
info = data[0].keys()
print("The 3 features associated with the news headlines are:", *info)

The 3 features associated with the news headlines are: article_link headline is_sarcastic


In [50]:
## Let's create separate lists of labels, links/URLs and sentences
list_of_labels = list() ## 1 = sarcastic, 0 = not sarcastic
list_of_links = list() ## URL of the article
list_of_sentences = list() ## headline text

for instance in data:
  list_of_labels.append(instance['is_sarcastic'])
  list_of_links.append(instance['article_link'])
  list_of_sentences.append(instance['headline'])

print("The lengths of these lists are:",  len(list_of_labels), len(list_of_links), len(list_of_sentences))

The lengths of these lists are: 26709 26709 26709


In [51]:
## Let's do  some standard preprocessing!!
##create  a  tokenizer instance
tokenizer_instance = TKN(oov_token = '<OOV>')
##let's fit this object on the sentence lists
tokenizer_instance.fit_on_texts(list_of_sentences)
word_index = tokenizer_instance.word_index
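
To make the tokenizer step concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index with an out-of-vocabulary token can be built. It is a simplification and not the Keras implementation: it only lowercases and splits on whitespace, whereas `Tokenizer` also strips most punctuation by default. The helpers `build_word_index` and `to_sequence` are hypothetical names, and the three sentences are made up for illustration.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token (0 is left for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    index = {oov_token: 1}
    # most_common() orders by descending count; ties keep first-seen order
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(sentence, index, oov_token="<OOV>"):
    """Map each word to its id, falling back to the OOV id for unseen words."""
    return [index.get(word, index[oov_token]) for word in sentence.lower().split()]

# Toy data, not taken from the sarcasm dataset
toy_sentences = ["the cat sat", "the dog sat", "the end"]
toy_index = build_word_index(toy_sentences)
print(toy_index)
print(to_sequence("the bird sat", toy_index))  # "bird" was never seen, so it maps to the OOV id
```

The most frequent word gets the smallest id after the OOV slot, which is why capping the vocabulary by id later keeps the most common words.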

In [52]:
## only the last expression in a cell is displayed, so a bare `word_index` line here would show nothing
list_of_sentences[:5]

["former versace store clerk sues over secret 'black code' for minority shoppers",
 "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
 "mom starting to fear son's web series closest thing she will have to grandchild",
 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
 'j.k. rowling wishes snape happy birthday in the most magical way']

In [53]:
sequences = tokenizer_instance.texts_to_sequences(list_of_sentences)
## each headline is now a list of integer token ids, one per word
## padding='post' appends zeros after the tokens; with no maxlen given,
## every sequence is padded to the length of the longest one
padded = padseq(sequences, padding='post')
print(padded[0],'\n', padded.shape) 

[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 
 (26709, 40)
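
The shape `(26709, 40)` says that the longest headline has 40 tokens and every other row was zero-filled up to that length. A minimal pure-Python sketch of 'post' padding and 'post' truncation follows; `pad_post` is a hypothetical helper written for illustration, not the Keras `pad_sequences` function, and the toy sequences are made up.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Right-pad (and right-truncate) integer sequences to a common length."""
    if maxlen is None:
        # no explicit length: use the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])  # 'post' truncation: drop tokens from the end
        padded.append(kept + [value] * (maxlen - len(kept)))  # 'post' padding: zeros at the end
    return padded

toy = [[5, 3], [7, 8, 9, 4], [2]]
print(pad_post(toy))            # rows are filled up to the longest length, 4
print(pad_post(toy, maxlen=3))  # longer rows are cut, shorter rows are zero-filled
```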


In [54]:
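## Let's set the parameters
size_vocab = 10000         ## keep only the most frequent words
embedding_dimension = 16   ## length of each word vector
len_max = 32               ## every headline is padded or truncated to this many tokens
truncate = 'post'          ## cut overly long headlines at the end
padding = 'post'           ## add zeros at the end
ukn_token = '<OOV>'        ## stands in for out-of-vocabulary words
size_train = 20000         ## number of headlines used for training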

In [55]:
## Let's split into training and test sets: the first size_train headlines train, the rest test
xtrain, ytrain = list_of_sentences[0:size_train], list_of_labels[0:size_train]
xtest, ytest = list_of_sentences[size_train:], list_of_labels[size_train:]
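
The slice above takes the headlines in their stored order. If the file happens to be ordered in some way (an assumption; the notebook does not check this), a shuffled split is a safer default. Below is a minimal sketch using only the standard library. `shuffled_split` is a hypothetical helper and the data is a toy stand-in for the real lists.

```python
import random

def shuffled_split(sentences, labels, n_train, seed=42):
    """Shuffle sentence/label pairs together, then split off the first n_train pairs."""
    pairs = list(zip(sentences, labels))
    random.Random(seed).shuffle(pairs)  # local RNG: reproducible, global state untouched
    train, test = pairs[:n_train], pairs[n_train:]
    x_train, y_train = [s for s, _ in train], [y for _, y in train]
    x_test, y_test = [s for s, _ in test], [y for _, y in test]
    return x_train, y_train, x_test, y_test

# Toy data: ten fake headlines with alternating labels
toy_x = [f"headline {i}" for i in range(10)]
toy_y = [i % 2 for i in range(10)]
xtr, ytr, xte, yte = shuffled_split(toy_x, toy_y, n_train=8)
print(len(xtr), len(xte))
```

Shuffling the pairs rather than the two lists separately keeps every label attached to its headline.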

In [58]:
## Let's tokenize and pad our training and test data.
## The tokenizer is fitted on the training sentences only, so words that
## appear only in the test set map to the OOV token.

tokenizer = TKN(num_words=size_vocab, oov_token=ukn_token)
tokenizer.fit_on_texts(xtrain)
word_index = tokenizer.word_index ## holds every word seen; num_words only caps texts_to_sequences

train_seq = tokenizer.texts_to_sequences(xtrain)
train_pad = padseq(train_seq, maxlen=len_max, padding=padding, truncating=truncate)

test_seq = tokenizer.texts_to_sequences(xtest)
test_pad = padseq(test_seq, maxlen=len_max, padding=padding, truncating=truncate)

In [62]:
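## pad_sequences already returns NumPy arrays, so these two conversions are redundant but harmless
train_pad = np.array(train_pad)
test_pad = np.array(test_pad)
## the labels are plain Python lists and do need converting
ytrain, ytest = np.array(ytrain), np.array(ytest)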

In [63]:
### Let's  define our model

model = tf.keras.Sequential([

  tf.keras.layers.Embedding(size_vocab, embedding_dimension, input_length=len_max),
  tf.keras.layers.GlobalAveragePooling1D(),
  tf.keras.layers.Dense(24,activation='relu'),
  tf.keras.layers.Dense(1,activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [65]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 32, 16)            160000    
                                                                 
 global_average_pooling1d_1   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_2 (Dense)             (None, 24)                408       
                                                                 
 dense_3 (Dense)             (None, 1)                 25        
                                                                 
Total params: 160,433
Trainable params: 160,433
Non-trainable params: 0
_________________________________________________________________
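
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic from the hyperparameters set earlier; `dense_units` is a name introduced here for the 24 units of the hidden layer.

```python
size_vocab = 10000
embedding_dimension = 16
dense_units = 24  # hidden Dense layer width, as in the model definition

embedding_params = size_vocab * embedding_dimension              # one 16-d vector per vocabulary slot
pooling_params = 0                                               # averaging has no weights
dense_params = embedding_dimension * dense_units + dense_units   # weights + biases
output_params = dense_units * 1 + 1                              # weights + bias of the sigmoid unit

total = embedding_params + pooling_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
```

Almost all of the 160,433 parameters sit in the embedding table, so the model's capacity is dominated by the vocabulary size and embedding dimension.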


In [64]:
n_epoch = 40
history = model.fit(train_pad, ytrain, epochs = n_epoch, validation_data=(test_pad, ytest), verbose=2)

Epoch 1/40
625/625 - 3s - loss: 0.5614 - accuracy: 0.7073 - val_loss: 0.3961 - val_accuracy: 0.8369 - 3s/epoch - 5ms/step
Epoch 2/40
625/625 - 2s - loss: 0.3131 - accuracy: 0.8760 - val_loss: 0.3427 - val_accuracy: 0.8556 - 2s/epoch - 4ms/step
Epoch 3/40
625/625 - 2s - loss: 0.2367 - accuracy: 0.9066 - val_loss: 0.3491 - val_accuracy: 0.8533 - 2s/epoch - 4ms/step
Epoch 4/40
625/625 - 2s - loss: 0.1902 - accuracy: 0.9280 - val_loss: 0.3634 - val_accuracy: 0.8483 - 2s/epoch - 4ms/step
Epoch 5/40
625/625 - 2s - loss: 0.1587 - accuracy: 0.9412 - val_loss: 0.3860 - val_accuracy: 0.8474 - 2s/epoch - 4ms/step
Epoch 6/40
625/625 - 2s - loss: 0.1336 - accuracy: 0.9533 - val_loss: 0.4340 - val_accuracy: 0.8353 - 2s/epoch - 3ms/step
Epoch 7/40
625/625 - 2s - loss: 0.1155 - accuracy: 0.9611 - val_loss: 0.4593 - val_accuracy: 0.8399 - 2s/epoch - 4ms/step
Epoch 8/40
625/625 - 2s - loss: 0.0990 - accuracy: 0.9661 - val_loss: 0.4921 - val_accuracy: 0.8381 - 2s/epoch - 4ms/step
Epoch 9/40
625/625 - 2s 

In [None]:
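## Clearly overfitting: training accuracy keeps climbing (0.7073 -> 0.9661 by epoch 8),
## while val_loss bottoms out at 0.3427 in epoch 2 and rises to 0.4921 by epoch 8.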
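
One common response to this pattern is to stop training once the validation loss stops improving. The sketch below applies a simple patience rule to the `val_loss` values copied from the first eight epochs of the log above. `early_stop_epoch` is a hypothetical helper, and the patience of 2 is an arbitrary choice for illustration.

```python
# val_loss values copied from the first eight epochs of the training log
val_losses = [0.3961, 0.3427, 0.3491, 0.3634, 0.3860, 0.4340, 0.4593, 0.4921]

def early_stop_epoch(losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a simple patience rule."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # new best: remember it and reset the counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch

stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

Keras offers a similar mechanism through `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)`, passed to `model.fit` via its `callbacks` argument.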