# **Sarcasm Detection in News Headlines**

## Model Building

The purpose of this project is to build a tool to be able to detect sarcasm in sentences. 

**Data**: The data we are working with is headlines from various news articles marked as either sarcastic or not sarcastic. The columns in the dataset are:
1. The headline
2. The article's link (we'll disregard this column)
3. Label of whether the headline is sarcastic or not

**Package Imports**

In [0]:
#@title Package imports
#Import
import pandas as pd
from tensorflow import keras
from keras.models import Sequential
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from keras.layers import Dense, Embedding, LSTM, GlobalAveragePooling1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.optimizers import RMSprop
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
import random

random.seed(9176932)


nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
#@title Google drive mount
from google.colab import drive
drive.mount('/content/gdrive')
drive.mount("/content/gdrive", force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive
Mounted at /content/gdrive


In [0]:
#@title Data read-in
#reading in the file
sarcasm_master = pd.read_json("/content/gdrive/My Drive/Data Mining II/sarcasm_master.json",lines=True)

## **Data Preprocessing**

Before we can feed the data to our model, we need to perform a few data preprocessing operations. 

1. **Removing punctuations**:
   
   Most text contains punctuations. In detecting sarcasm, the presence of punctuations doesn't necesarily contribute to the model performing better. So we aim to strip the data of all punctuations.

2. **Remove digits**:
    
    We are going to vectorize our data and convert the strings to numbers. Presence of numbers in the data would not help in identifying the tone any better. Moreover the pre-existing digits might interfere with the vectorization process. Hence all numbers are removed as well.

3. **Converting to lower case**: 
  
    Converting the text to lower case helps make the data uniform.

4. **Removing stop words**:

    Most headlines or any natural language data would contain stop words that are usually removed as stop words generally appear in abundance and do not provide any valuable information during classification.

5. **Lemmatization**:

    Lemmatization is the process by which any inflected version of a word is converted to its base word so that all forms of a word are treated the same.

6. **Vectorization and padding**:

    Vectorization is the process by which words are mapped to the numeric vectors. For LSTM model, the input should be of same size. Hence we pad the vectors with zeros to ensure unformity.

In [0]:
#@title Prelim text processing
#Removing punctuation and digits and converting to lower case
#Punctuation
sarcasm_master['punct_headline'] = sarcasm_master['headline'].apply(lambda x : re.sub(r'[^a-zA-Z\s]','', x ) )

#Removing digits
sarcasm_master['digits_headline'] = sarcasm_master['punct_headline'].apply(lambda x :re.sub("\d+", "", x)) 

#Converting to lower case
sarcasm_master['lc_headline'] = sarcasm_master['digits_headline'].apply(lambda x : x.lower())

In [0]:
#@title Remove stop words
def rem_stp(input_series):
    words = word_tokenize(input_series)
    a = [w for w in words] 
    return(a)

sarcasm_master['rem_stp_headline'] = sarcasm_master['lc_headline'].apply(rem_stp) 


In [0]:
#@title Lemmatization
#Lemmatization
lmt = WordNetLemmatizer()
def lem_fn(input_series):
    a  =  [lmt.lemmatize(word) for word in input_series ]
    return(a)

sarcasm_master['lem_headline'] = sarcasm_master['rem_stp_headline'].apply(lem_fn) 

In [0]:
#@title Vectorization
#Vectorization 

tkn = Tokenizer(num_words=10000)
tkn.fit_on_texts(sarcasm_master['lem_headline'])
sarcasm_master['tkn_headline'] = tkn.texts_to_sequences(sarcasm_master['lem_headline'])

total_words = len(tkn.word_index)

#Add padding in front for the tokenized list

#Find max length of the headline array length to add as maximum pad length
max_pad_length = sarcasm_master.tkn_headline.map(lambda x: len(x)).max()



# **Building Keras Model**

We are going to be building a Keras Model.

1. **Embedding layer**

   The Embedding layer is used to create word vectors for incoming words. It sits between the input and the LSTM layer, i.e. the output of the Embedding layer is the input to the LSTM layer.

2. **LSTM Layer**

 The LSTM transforms the vector sequence into a single vector containing information about the entire sequence.

3. **Intermediate Layer**
   
   There is a Dense intermediate layer with 64 neurons and with activation function **relu**.

4. **Output Layer**
    
    The final ouput we want from this model is whether the headline is sarcastic or not. So we want to perform classification. The output layer's activation function is thus **sigmoid**


Reference:

Keras : https://keras.io/getting-started/sequential-model-guide/


In [0]:
#@title build_model
def build_model(X_train, y_train, X_test, y_test):
  
  embed_size = 128
  model = Sequential()
  
  #Embedding Layer
  model.add(Embedding( total_words,embed_size))

  #LSTM input layer
  model.add(LSTM(embed_size, activation='relu'))
  
  #Intermediate layer
  model.add(Dense(64, activation ='relu'))
  
  #OutputLayer
  model.add(Dense(1))
  model.add(Activation('sigmoid'))

  model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
  print(model.summary())

  model.fit(X_train,y_train,epochs=2)

  accuracy = model.evaluate(X_test, y_test)[1]
  return accuracy

## **Model 1 - Base model**

We first build a base model with the following preprocessing.

1. Remove punctuations, digits and convert the text to lowercase.
2. Remove all the stopwords
3. Perform Lemmatization
4. Perform tokenization and pad the resulting sequence - Prepadding


In [0]:
#@title Base Model
X = pad_sequences(sarcasm_master['tkn_headline'], maxlen= max_pad_length, padding='pre')
sarcasm_master['padded_headline'] = X.tolist()

Y = sarcasm_master['is_sarcastic'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

base_accuracy = build_model(X_train, y_train, X_test ,y_test )
print(base_accuracy)

Model: "sequential_26"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_26 (Embedding)     (None, None, 128)         3094656   
_________________________________________________________________
lstm_26 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_51 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_52 (Dense)             (None, 1)                 65        
_________________________________________________________________
activation_26 (Activation)   (None, 1)                 0         
Total params: 3,234,561
Trainable params: 3,234,561
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/2
Epoch 2/2
0.8496817946434021


**Results:**

We see that we get an accuracy of about 84.9%


## **Model 2 - Post padding**

We first build a model very similar to the base model with the following change:
After performing the tokenization, do post-padding

**Example:**
Before padding : [234,5,67,12]

The max_length is 7

Pre-padding: [0,0,0,0,234,5,67,12]

Post-padding: [234,5,67,12,0,0,0,0]


In [0]:
#@title Post padding model
X = pad_sequences(sarcasm_master['tkn_headline'], maxlen= max_pad_length, padding='post')
sarcasm_master['post_padded_headline'] = X.tolist()

Y = sarcasm_master['is_sarcastic'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

post_padding_accuracy=build_model(X_train, y_train, X_test, y_test)
print(post_padding_accuracy)

Model: "sequential_27"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_27 (Embedding)     (None, None, 128)         3094656   
_________________________________________________________________
lstm_27 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_53 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_54 (Dense)             (None, 1)                 65        
_________________________________________________________________
activation_27 (Activation)   (None, 1)                 0         
Total params: 3,234,561
Trainable params: 3,234,561
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/2
Epoch 2/2
0.5503556728363037


**Results**

Surprisingly we see a huge drop in the accuracy from **85% to 55%** now.

The reason why this is because we are building a Long Short term memory model.
When the padding is in the beginning, the useful content is at the back and is therefore the latest information the model takes in. This stays in memory and results in a better model.

**We are going to proceed further with pre-padded sequence for future models.**

Reference: https://arxiv.org/pdf/1903.07288.pdf

## **Model 3 - Sans Lemmatization**

We now build a model that is a modification of our base model. We want to see the effect of lemmatization. Lemmatization is the process by which any inflected version of a word is converted to its base word so that all forms of a word are treated the same. 

We want to see if in detecting sarcasm, the **effect of inflect** plays a role in improcing the efficiency of our model.

In [0]:
#@title Sans Lemmatization
tkn = Tokenizer(num_words=10000)
tkn.fit_on_texts(sarcasm_master['rem_stp_headline'])
sarcasm_master['tkn_headline'] = tkn.texts_to_sequences(sarcasm_master['rem_stp_headline'])

total_words = len(tkn.word_index)

#Add padding in front for the tokenized list

#Find max length of the headline array length to add as maximum pad length
max_pad_length = sarcasm_master.tkn_headline.map(lambda x: len(x)).max()
X = pad_sequences(sarcasm_master['tkn_headline'], maxlen= max_pad_length, padding='pre')
sarcasm_master['sanslem_headline'] = X.tolist()

Y = sarcasm_master['is_sarcastic'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

sans_lemmatization_accuracy=build_model(X_train, y_train, X_test, y_test)
print(sans_lemmatization_accuracy)

Model: "sequential_28"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_28 (Embedding)     (None, None, 128)         3534848   
_________________________________________________________________
lstm_28 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_55 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_56 (Dense)             (None, 1)                 65        
_________________________________________________________________
activation_28 (Activation)   (None, 1)                 0         
Total params: 3,674,753
Trainable params: 3,674,753
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/2
Epoch 2/2
0.8567951917648315


**Results**

We see that not performing lemmatization inproves the accuracy but just slightly from 85% to 85.6%


## **Model 4 - Including stop words**

We test the effect that stop words have on the model. In general NLP models, we generally remove stop words. But our theory is that the stop words might actually have an effect in identifying the sarcasm in a sentence. 

We want to see if in detecting sarcasm, the **effect of stop words** plays a role in improving the efficiency of our model.


Reference : https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52

In [0]:
#@title Including stop words

#Lemmatization
lmt = WordNetLemmatizer()
def lem_fn(input_series):
    a  =  [lmt.lemmatize(word) for word in input_series ]
    return(a)

sarcasm_master['lem_headline'] = sarcasm_master['rem_stp_headline'].apply(lem_fn) 

tkn = Tokenizer(num_words=10000)
tkn.fit_on_texts(sarcasm_master['lem_headline'])
sarcasm_master['tkn_headline'] = tkn.texts_to_sequences(sarcasm_master['lem_headline'])

total_words = len(tkn.word_index)

#Add padding in front for the tokenized list

#Find max length of the headline array length to add as maximum pad length
max_pad_length = sarcasm_master.tkn_headline.map(lambda x: len(x)).max()
X = pad_sequences(sarcasm_master['tkn_headline'], maxlen= max_pad_length, padding='pre')
sarcasm_master['sanslem_headline'] = X.tolist()

Y = sarcasm_master['is_sarcastic'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

include_sw_accuracy=build_model(X_train, y_train, X_test, y_test)
print(include_sw_accuracy)

Model: "sequential_29"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_29 (Embedding)     (None, None, 128)         3094656   
_________________________________________________________________
lstm_29 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_57 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_58 (Dense)             (None, 1)                 65        
_________________________________________________________________
activation_29 (Activation)   (None, 1)                 0         
Total params: 3,234,561
Trainable params: 3,234,561
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/2
Epoch 2/2
0.8547360301017761


**Results**:

We see that the accuracy hasn't increased too much from the base model.


# Hyperparameter Tuning

We now want to focus on hyperparameter tunings. 
The various parameters in consideration are 
1. Layer activation
2. Number of epochs
3. Optimizer, etc

We are going to use **Grid search** for selecting the best parameters.


In [0]:
#Vectorization 

tkn = Tokenizer(num_words=10000)
tkn.fit_on_texts(sarcasm_master['rem_stp_headline'])
sarcasm_master['tkn_headline'] = tkn.texts_to_sequences(sarcasm_master['rem_stp_headline'])

total_words = len(tkn.word_index)

#Add padding in front for the tokenized list

#Find max length of the headline array length to add as maximum pad length
max_pad_length = sarcasm_master.tkn_headline.map(lambda x: len(x)).max()

X = pad_sequences(sarcasm_master['tkn_headline'], maxlen= max_pad_length, padding='pre')
sarcasm_master['final_headline'] = X.tolist()

Y = sarcasm_master['is_sarcastic'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [0]:
def create_model(optimizer='adam', activation='relu'):
  embed_size = 128
  model = Sequential()
  
  #Embedding Layer
  model.add(Embedding( total_words,embed_size))

  #LSTM input layer
  model.add(LSTM(embed_size, activation='relu'))
  
  #Intermediate layer
  model.add(Dense(64, activation ='relu'))
  
  #OutputLayer
  model.add(Dense(1))
  model.add(Activation('sigmoid'))

  model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
  return model


In [0]:

from keras.wrappers.scikit_learn import KerasClassifier

model1 = KerasClassifier(build_fn=create_model, epochs=2, batch_size=16)

from sklearn.model_selection import GridSearchCV
params = dict(optimizer=['sgd', 'adam'], 
              epochs=[2],
              batch_size=[15], 
              activation=['relu','tanh'])

# Create a random search cv object and fit it to the data
grid_search = GridSearchCV(model1, params, cv=3, scoring='accuracy')
random_search_results = grid_search.fit(X, Y)
# Print results
print(random_search_results.best_score_,random_search_results.best_params_)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
0.8546557340222397 {'activation': 'relu', 'batch_size': 15, 'epochs': 2, 'optimizer': 'adam'}


In [0]:
#Epoch 5 accuracy 82.8%
embed_size = 128
model = Sequential()
  
#Embedding Layer
model.add(Embedding( total_words,embed_size))

#LSTM input layer
model.add(LSTM(embed_size, activation='relu'))
  
#Intermediate layer
model.add(Dense(64, activation ='relu'))
  
#OutputLayer
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.fit(X_train,y_train,epochs=5)
accuracy = model.evaluate(X_test, y_test)[1]
print(accuracy)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
0.8287158608436584


In [0]:
#Epoch 10 accuracy 84.5%
embed_size = 128
model = Sequential()
  
#Embedding Layer
model.add(Embedding( total_words,embed_size))

#LSTM input layer
model.add(LSTM(embed_size, activation='relu'))
  
#Intermediate layer
model.add(Dense(64, activation ='relu'))
  
#OutputLayer
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.fit(X_train,y_train,epochs=10)
accuracy = model.evaluate(X_test, y_test)[1]
print(accuracy)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.8333957195281982


**Results**

From the grid search we get the following results:

1. The best optimizer for our model is **adam**

2. The best activation to use is **relu**

We also try various epoch values : 2,5,10

2  : Training accuracy- 91.9   Testing accuracy - 85.6

5  : Training accuracy- 98.7   Testing accuracy - 82.8

10 : Training accuracy- 99.3   Testing accuracy - 83.3

We see that for epochs higher than 2, even if the training accuracy increases the testing accuracy goes down. This could be because of overfitting.


## **Final Model**

The final model we build is a Keras model with LSTM 

Prepocessing : Removing stop words, punctuations, digits and converting to lower case

Epochs : 2

Activation : relu

Output Activation : sigmoid

Optimizer : adam

Training accuracy : 91.9%

Testing accuracy : 85.6%
