Part A:
Set in the digital content and entertainment industry, this project builds a text classification model that analyses customer
sentiment from reviews in the IMDB dataset. The model is a deep learning network: an embedding layer
followed by a classifier that predicts the sentiment of each review.


Part B:
This part deals with social media analytics. Past studies in sarcasm detection mostly use Twitter datasets collected with hashtag-based
supervision, but such datasets are noisy in both labels and language. Furthermore, many tweets are replies to
other tweets, so detecting sarcasm in them requires the contextual tweets as well. In this hands-on project,
the goal is to build a model that detects whether a sentence is sarcastic, using bidirectional LSTMs.
The data for this part was collected from theonion.com and huffingtonpost.com.

In [None]:
# Importing necessary libraries
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import numpy as np
from keras.utils import to_categorical
from keras import models
from keras import layers
import warnings
warnings.filterwarnings("ignore")


**PART A**

In [None]:
from keras.datasets import imdb

In [None]:
from keras_preprocessing.sequence import pad_sequences
vocab_size = 10000  # number of most frequent words kept in the vocabulary
maxlen = 300  # number of words kept from each review (20 words, as suggested in the question, gave poor accuracy)

Load the dataset (predefined train/test split)

In [None]:
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

In [None]:
data = np.concatenate((x_train, x_test), axis=0)
targets = np.concatenate((y_train, y_test), axis=0)

In [None]:
#make all sequences of the same length
data = pad_sequences(data, maxlen=maxlen)

In [None]:
#Shape of the whole set
data.shape

(50000, 300)
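
The padding step above can be illustrated without Keras. The sketch below mimics what `pad_sequences` does with its default settings (zeros added on the left, long sequences truncated from the left); `pad_pre` is a hypothetical helper written for illustration, not part of any library.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                      # an empty sequence stays all padding
        kept = list(seq)[-maxlen:]        # keep only the last `maxlen` tokens
        out[row, -len(kept):] = kept      # right-align; padding stays on the left
    return out

# Toy input: one short and one long sequence
padded = pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
```

The short sequence gains leading zeros and the long one loses its first token, which is why the padded reviews printed in this notebook can start mid-sentence.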

In [None]:
# One random data point
print("Label: ", targets[23])
print(data[23])

Label:  0
[  72    4   91 2227 1406    7    4   22   63    2  630   56    2    4
 7526  268   58 6463 4698    5    2    2    4 1406    9 2227 1424   88
    7   98   21   13  215 1109   98   18   68    2  507 4152   78   32
  143    4   22    2  100   73   30    4  375 1652    7    2   13   92
  818    4  595   15   41  217   15    7   35    2    2 2594   41    8
  511  120    4  350 1843  144  376   41    4 1474  200  112 9081   88
    4  109 3898   12    5 2612    2   48  874  110    2    2   37  186
    8   30    4 2867  496   14  217   11   41 3019    5    2 1410  490
  124   51   13  384   13 2303  235   15   48    2    2   69  623    2
    2   11   14  217  247   74    6 1963  323   40    2    4   65   62
   28  952  128 6463    2 4980 1191    9   73 7863    2 6374    9  329
    2   74    2   10   10    8   30 1257    8    4  167   29  127 1921
    8  763   49   52 3667 2442    8    4   22   13  572  423    4  361
    7 3180   17    4    2 1399   11    4 6301 4029    2    2   65  

In [None]:
# Dimension of the data set
print(data.shape)
print(targets.shape)

(50000, 300)
(50000,)


In [None]:
import numpy as np
unique_elements, counts_elements = np.unique(targets, return_counts=True)
print(np.asarray((unique_elements, counts_elements)))     

[[    0     1]
 [25000 25000]]


Positive and negative reviews are equally distributed.

In [None]:
word_index = imdb.get_word_index()

In [None]:
reverse_word_map = dict(map(reversed, word_index.items()))

In [None]:
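# Function takes a tokenized review and returns its words.
# NOTE: imdb.load_data() shifts every word index by 3 (0 = padding, 1 = start, 2 = unknown),
# but this lookup applies no shift, so the words printed below are not the original review text.
def sequence_to_text(list_of_indices):
    # Look up each index in the reversed word index
    words = [reverse_word_map.get(idx) for idx in list_of_indices]
    return words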

In [None]:
#Random data point
review = sequence_to_text(data[17])
print(review)

['realise', 'and', 'i', 'i', 'old', 'best', 'world', 'and', 'and', 'and', 'this', 'of', 'and', 'masturbation', 'and', 'asian', 'with', 'members', "that's", 'in', 'actress', 'but', 'is', 'rate', 'br', 'rose', 'hill', 'this', 'beowulf', 'exposed', 'to', 'and', 'chore', 'of', 'and', 'film', 'is', 'island', 'house', 'br', 'for', 'work', 'and', 'and', 'and', 'from', 'them', 'maggie', 'so', 'problem', 'quit', 'br', 'work', 'and', 'hopkins', 'when', 'fan', 'and', 'and', 'lady', 'i', 'i', 'of', 'ever', 'particularly', 'succeed', 'come', 'job', 'actors', 'seem', 'no', 'nina', 'effective', 'br', 'biased', 'girl', 'been', 'and', 'of', 'almost', 'br', 'and', 'of', 'forward', 'out', 'episode', 'made', 'at', 'objects', 'flicks', 'no', 'richard', 'and', 'half', 'of', 'being', 'br', 'and', 'strong', 'be', 'put', 'movie', 'and', 'lou', 'funny', 'closer', 'br', 'must', 'i', 'i', 'of', 'performance', 'mistress', 'this', 'pole', 'least', 'of', 'constantly', 'cannot', "could've", 'who', 'favor', 'but', 'an
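
The word list above reads as nonsense because the indices returned by `imdb.load_data()` are offset by 3 relative to `imdb.get_word_index()`, with 0, 1 and 2 reserved for padding, start-of-review and unknown words. A minimal sketch of an offset-aware decoder follows; `decode_review` and the toy vocabulary are hypothetical and only illustrate the lookup.

```python
def decode_review(indices, word_index, index_from=3):
    """Map encoded token ids back to words, honouring the reserved ids."""
    reverse = {rank: word for word, rank in word_index.items()}
    special = {0: "<pad>", 1: "<start>", 2: "<unk>"}
    words = []
    for idx in indices:
        idx = int(idx)
        if idx in special:
            words.append(special[idx])
        else:
            # Undo the shift applied when the dataset was encoded
            words.append(reverse.get(idx - index_from, "<unk>"))
    return " ".join(words)

# Hypothetical three-word vocabulary in the same word -> rank format
toy_word_index = {"the": 1, "film": 2, "great": 3}
encoded = [0, 0, 1, 4, 5, 2, 6]
print(decode_review(encoded, toy_word_index))
```

With the real data, the call would be `decode_review(data[17], word_index)`.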

In [None]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(data, targets, test_size=0.30, random_state=1)

In [None]:
from keras.layers import Dense, Input
from keras.layers import Embedding
from keras.preprocessing import sequence
from keras.layers import LSTM
from keras.models import Sequential
### create the model
model = Sequential()
model.add(Embedding(vocab_size, 128, trainable=True, input_length=maxlen))
model.add(LSTM(units=64, dropout=0.2))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 300, 128)          1280000   
                                                                 
 lstm_3 (LSTM)               (None, 64)                49408     
                                                                 
 dense_6 (Dense)             (None, 32)                2080      
                                                                 
 dense_7 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1,331,521
Trainable params: 1,331,521
Non-trainable params: 0
_________________________________________________________________
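
The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are made up for this illustration.

```python
def embedding_params(vocab, dim):
    return vocab * dim                                  # one vector per vocabulary entry

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(inputs, units):
    return inputs * units + units                       # weights plus biases

part_a = [
    embedding_params(10000, 128),   # embedding_3
    lstm_params(128, 64),           # lstm_3
    dense_params(64, 32),           # dense_6
    dense_params(32, 1),            # dense_7
]
print(part_a, sum(part_a))
```

The four values and their sum match the `Param #` column and the total of 1,331,521 reported above.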


In [None]:
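## Fit the model
# NOTE: a bare `%time` line times an empty statement, so the CPU and wall times printed below
# do not measure training; put `%%time` on the first line of the cell to time the whole fit.
%time
model.fit(train_x,train_y, validation_data=(test_x,test_y), epochs=10, batch_size=500, verbose=1)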

CPU times: user 163 µs, sys: 4 µs, total: 167 µs
Wall time: 13.1 µs
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f576e3c9a50>

In [None]:
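# Final evaluation of the model
# NOTE: a bare `%time` line times an empty statement, so the times printed below do not measure the evaluation.
%time
scores = model.evaluate(test_x,test_y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))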

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 13.1 µs
Accuracy: 87.01%


In [None]:
pred_y = model.predict(test_x)



Evaluating the model by taking different random datapoints

In [None]:
print(pred_y)

[[1.1075592e-01]
 [9.9972594e-01]
 [9.8268169e-01]
 ...
 [6.6726329e-04]
 [7.3852914e-04]
 [9.6545768e-01]]


In [None]:
print(test_y)

[0 1 1 ... 0 1 1]


In [None]:
print(pred_y[78])

[0.01467701]


In [None]:
print(test_y[78])

0


In [None]:
pred_y = np.round(pred_y, 0)

In [None]:
pred_y = pred_y.ravel()
pred_y.shape

(15000,)

In [None]:
pred_y = pred_y.astype('int64')

In [None]:
print(pred_y[78])

0


In [None]:
print(test_y[78])

0
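
The three cells above turn sigmoid probabilities into class labels by rounding. A sketch of an equivalent single-step conversion with an explicit threshold is shown below. The probabilities are illustrative (three are taken from the outputs above, 0.5 is added to show the edge case), and note that `np.round` sends exactly 0.5 to 0 because NumPy rounds halves to the nearest even value.

```python
import numpy as np

# Column vector of probabilities, shaped like the output of model.predict
probs = np.array([[0.11075592], [0.99972594], [0.5], [0.01467701]])

# Explicit threshold: anything at or above 0.5 counts as positive
labels = (probs.ravel() >= 0.5).astype("int64")

# Rounding route used in the notebook, for comparison
rounded = np.round(probs, 0).ravel().astype("int64")

print(labels)
print(rounded)
```

The two routes differ only for a probability of exactly 0.5, which is rare in practice but makes the threshold version easier to reason about.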


In [None]:
# ravel() returns a new array, so the result has to be assigned back
test_y = test_y.ravel()
test_y

array([0, 1, 1, ..., 0, 1, 1])

In [None]:
from sklearn.metrics import classification_report
# classification_report orders classes by label value: 0 = negative, 1 = positive
target_names = ['Sentiment_Negative', 'Sentiment_Positive']
print(classification_report(test_y, pred_y, target_names=target_names))

                    precision    recall  f1-score   support

Sentiment_Negative       0.88      0.86      0.87      7579
Sentiment_Positive       0.86      0.88      0.87      7421

          accuracy                           0.87     15000
         macro avg       0.87      0.87      0.87     15000
      weighted avg       0.87      0.87      0.87     15000



Take a random data point from the test set, check its original and predicted labels, and decode the whole review.

In [None]:
index = imdb.get_word_index()
reverse_index = dict([(value, key) for (key, value) in index.items()]) 
decoded = " ".join( [reverse_index.get(i - 3, "#") for i in test_x[23]])
print("Label: ",test_y[23])
print("Predicted label: ",pred_y[23])
print(decoded) 

Label:  1
Predicted label:  1
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # i agree with who said that there is a lot going on between the lines in this film while i do think the pacing of this film could be improved i do think that the complexity of the relationships between the characters is fascinating br br examples br br pierre is going to marry his cousin even though his love for her seems very cousin y br br pierre and his stepmother have a rather curious relationship br br pierre # and seem to have a # relationship and the actual points to the triangle are not quite certain br br brother is a bit of a # or is he br br and isabelle who is she really br br overall i think it was worth my time an interesting film 

"#" marks padding, the start-of-review token, and words outside the 10,000-word vocabulary.

# Part B

Read the data from "Sarcasm_Headlines_Dataset.json", explore it, and gather some insights.

In [2]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
#Set project path 
project_path =  '/content/drive/My Drive/PyData/NLP_2'

Read the data

In [55]:
import pandas as pd
import os

data = pd.read_json(os.path.join(project_path,'Sarcasm_Headlines_Dataset.json'),lines=True)

In [56]:
data.sample(10)

Unnamed: 0,article_link,headline,is_sarcastic
14592,https://www.huffingtonpost.com/entry/waking-dr...,"waking, dreaming, being",0
10723,https://www.huffingtonpost.com/entry/tuesdays-...,tuesday's morning email: inside the gop health...,0
8765,https://www.theonion.com/christ-to-wed-longtim...,christ to wed longtime backup singer,1
23582,https://entertainment.theonion.com/farm-aid-ai...,'farm aid aid' concert to benefit struggling f...,1
22120,https://www.huffingtonpost.com/entry/melissa-c...,mizzou chancellor condemns 'verbal assault' by...,0
18541,https://www.huffingtonpost.com/entry/religion-...,8 do's and don'ts of religion-themed halloween...,0
15033,https://local.theonion.com/hypochondriac-convi...,hypochondriac convinced patient has cancer,1
15588,https://www.huffingtonpost.com/entry/the-war-o...,'the war on christmas' -- a film by ken burns,0
19681,https://www.huffingtonpost.com/entry/inside-lo...,inside los angeles' first ever marijuana farme...,0
19458,https://www.theonion.com/binge-drinking-promis...,"binge-drinking, promiscuous sex good for you, ...",1


Size of the data

In [50]:
print (data.shape)
data.describe()

(26709, 3)


Unnamed: 0,is_sarcastic
count,26709.0
mean,0.438953
std,0.496269
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


Random headlines (we'll look at them again after cleaning)

In [51]:
data['headline'][11274]

'laid-off zoologist goes on tranquilizing rampage'

In [57]:
data['headline'][23582]

"'farm aid aid' concert to benefit struggling farm aid concerts"

In [58]:
# The headline column needs cleaning because it contains special characters and numbers

import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# Loaded under a separate name so the imported module is not shadowed;
# note that cleanData() below does not remove stop words
stop_words = set(stopwords.words('english'))

def cleanData(text):
  text = re.sub(r'\d+', '', text)                                        # digits
  text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)   # URLs
  text = re.sub(r'\<a href', ' ', text)                                  # HTML link openers
  text = re.sub(r'&amp;', '', text)                                      # escaped ampersands
  text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)             # special characters -> space
  text = re.sub(r'<br />', ' ', text)                                    # HTML line breaks
  text = "".join([char for char in text if char not in string.punctuation])  # remaining punctuation
  return text

data['headline']=data['headline'].apply(cleanData)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Data has been cleaned

In [53]:
data['headline'][11274]

'laid off zoologist goes on tranquilizing rampage'

In [59]:
data['headline'][23582]

'farm aid aid concert to benefit struggling farm aid concerts'

Drop 'article_link' from the dataset as it is not necessary

In [60]:
data.drop('article_link',inplace=True,axis=1)

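Get the length of each headline and find the maximum. Because headlines differ in length, the sequences need to be padded to a common length. Note that `len(text)` counts characters rather than words, so the value below is an upper bound on the number of tokens per headline.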

In [61]:
maxlen = max([len(text) for text in data['headline']])
print(maxlen)

252
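
The value 252 is a character count. A sketch of how a word-based maximum could be measured instead is given below, using the two cleaned headlines shown earlier as a stand-in for `data['headline']`; the variable names are illustrative.

```python
# Two cleaned headlines from the outputs above, standing in for data['headline']
headlines = [
    "laid off zoologist goes on tranquilizing rampage",
    "farm aid aid concert to benefit struggling farm aid concerts",
]

max_chars = max(len(text) for text in headlines)          # what the cell above measures
max_words = max(len(text.split()) for text in headlines)  # closer to the token count

print(max_chars, max_words)
```

Padding to the word-based maximum would give much shorter input sequences, at the cost of changing the shapes and results recorded in the rest of this notebook.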


Modelling

In [62]:
# Required libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential

Set Different Parameters for the model

In [63]:
max_features = 10000
maxlen = max([len(text) for text in data['headline']])
embedding_size = 200

Apply Keras' Tokenizer on headline column of the data.

In [64]:
tokenizer = Tokenizer(num_words=max_features,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',lower=True,split=' ', char_level=False)
tokenizer.fit_on_texts(data['headline'])
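
To make the tokenizer step concrete, here is a rough pure-Python sketch of the idea: words are ranked by frequency starting at index 1, and only indices below `num_words` are emitted when texts are converted. `build_word_index` and `texts_to_ids` are hypothetical helpers that approximate this behaviour; they are not the Keras implementation and skip its filtering of punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by descending frequency; index 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words=None):
    """Convert texts to id lists, dropping ids at or above num_words."""
    out = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        out.append(ids)
    return out

texts = ["farm aid aid concert", "farm aid concerts"]
toy_index = build_word_index(texts)
print(toy_index)
print(texts_to_ids(texts, toy_index, num_words=3))
```

In this toy run the full index still contains every word, while the converted sequences only use the most frequent ones, which mirrors why `len(tokenizer.word_index)` can exceed `max_features`.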

Define X and y

In [65]:
X = tokenizer.texts_to_sequences(data['headline'])
X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(data['is_sarcastic'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])

Number of Samples: 26709
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0  

Get the Vocabulary size

In [66]:
print(tokenizer.word_counts)
print(tokenizer.document_count)
print(tokenizer.word_index)
print(tokenizer.word_docs)

26709


In [67]:
num_words=len(tokenizer.word_index)
print (num_words)

25765


Word embeddings using GloVe

In [70]:
glove_file = project_path + "/archive.zip"

In [71]:
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
  z.extractall()

In [72]:
#Get the Word Embeddings using Embedding file
EMBEDDING_FILE = './glove.6B.200d.txt'

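embeddings = {}
# The context manager closes the file once all vectors have been read
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(" ")
        word = values[0]                                        # first field is the word
        embeddings[word] = np.asarray(values[1:], dtype='float32')  # remaining fields are the vector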

Create a weight matrix for words in training docs

In [74]:
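# NOTE: Tokenizer indices start at 1 (0 is reserved for padding), so storing the vector for
# index i in row i-1 shifts every embedding by one relative to the ids fed to the model.
# Aligned rows would need shape (num_words + 1, 200) and embedding_matrix[i] = embedding_vector;
# the outputs recorded in this notebook come from the code as written.
embedding_matrix = np.zeros((num_words, 200))

for word, i in tokenizer.word_index.items():
  embedding_vector = embeddings.get(word)
  if embedding_vector is not None:
    embedding_matrix[i-1] = embedding_vector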

len(embeddings.values())

400000
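
Because the tokenizer reserves index 0 for padding, one common way to build the weight matrix is to allocate one extra row and store each vector at the word's own index. The sketch below shows that layout on toy data; `build_embedding_matrix`, the toy index and the toy vectors are all hypothetical, and this is an alternative to the cell above rather than the code that produced the recorded results.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; row 0 stays zero for padding."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:          # words without a pretrained vector stay zero
            matrix[idx] = vec
    return matrix

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary
toy_index = {"farm": 1, "aid": 2, "zoologist": 3}
toy_vectors = {
    "farm": np.array([1.0, 1.0], dtype="float32"),
    "aid": np.array([2.0, 2.0], dtype="float32"),
}

matrix = build_embedding_matrix(toy_index, toy_vectors, dim=2)
print(matrix)
```

With this layout the `Embedding` layer would need `input_dim = len(word_index) + 1`, so its parameter count would differ from the summary shown later.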

In [79]:
import tensorflow as tf

input_layer = Input(shape=(maxlen,),dtype=tf.int64)
embed = Embedding(embedding_matrix.shape[0],output_dim=200,weights=[embedding_matrix],input_length=maxlen, trainable=True)(input_layer)
lstm=Bidirectional(LSTM(128))(embed)
drop=Dropout(0.3)(lstm)
dense =Dense(100,activation='relu')(drop)
out=Dense(2,activation='softmax')(dense)

In [80]:
batch_size = 100
epochs = 5

model = Model(input_layer,out)
model.compile(loss='sparse_categorical_crossentropy',optimizer="adam",metrics=['accuracy'])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 252)]             0         
                                                                 
 embedding_3 (Embedding)     (None, 252, 200)          5153000   
                                                                 
 bidirectional_2 (Bidirectio  (None, 256)              336896    
 nal)                                                            
                                                                 
 dropout_2 (Dropout)         (None, 256)               0         
                                                                 
 dense_4 (Dense)             (None, 100)               25700     
                                                                 
 dense_5 (Dense)             (None, 2)                 202       
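
As a sanity check on this summary, the counts can be reproduced from the layer sizes. The sketch below assumes the standard LSTM parameter formula and that the `Bidirectional` wrapper holds two independent LSTMs; the variable names are illustrative.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

embedding = 25765 * 200                  # one 200-d vector per row of the weight matrix
bilstm = 2 * lstm_params(200, 128)       # forward and backward LSTM
dense_hidden = 256 * 100 + 100           # 2 x 128 concatenated outputs -> 100 units
dense_out = 100 * 2 + 2                  # 100 units -> 2 softmax classes

print(embedding, bilstm, dense_hidden, dense_out)
```

The doubling explains both the 256-wide output of the bidirectional layer and its 336,896 parameters.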
                                                             

In [81]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [82]:
import time
s = time.time()
model.fit(X_train,y_train,batch_size=batch_size, epochs=epochs, verbose=1)
e = time.time()
print(e-s)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
52.5929491519928


In [88]:
test_pred = model.predict(np.array(X_test), verbose=1)



In [90]:
# Pick the class with the higher softmax probability (ties resolve to class 0)
test_pred = np.argmax(test_pred, axis=1)

In [91]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, test_pred)

array([[2597,  434],
       [ 359, 1952]])

In [87]:
from sklearn.metrics import classification_report
print(classification_report(y_test, test_pred))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87      3031
           1       0.82      0.84      0.83      2311

    accuracy                           0.85      5342
   macro avg       0.85      0.85      0.85      5342
weighted avg       0.85      0.85      0.85      5342
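
The figures in this report follow directly from the confusion matrix printed earlier, where rows are true labels and columns are predicted labels. A short NumPy sketch of that arithmetic:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[2597, 434],
               [359, 1952]])

precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions / all predictions per class
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions / all true members per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(float(accuracy), 2))
```

Rounded to two decimals, these values agree with the per-class rows and the overall accuracy of 0.85 in the report.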

