<a href="https://colab.research.google.com/github/Nourhan-Adell/DeepLearning/blob/main/NLP_in_TF/Sarcasm_Dataset_Model_KaggleData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Training a binary classifier with the Sarcasm Dataset**

Goal: apply the text preprocessing steps (tokenizing and padding) to the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and train a binary classifier on the result.

### **Download and inspect the dataset**

In [1]:
# Download the dataset
!wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

--2022-10-12 12:18:18--  https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.23.128, 74.125.203.128, 74.125.204.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.23.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘sarcasm.json’


2022-10-12 12:18:19 (131 MB/s) - ‘sarcasm.json’ saved [5643545/5643545]



In [2]:
import json     # Parses JSON text into native Python data structures (lists, dicts, strings, numbers)

with open('sarcasm.json','r') as f:
  datastore= json.load(f)

datastore[:5]

[{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
  'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
  'is_sarcastic': 0},
 {'article_link': 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365',
  'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
  'is_sarcastic': 0},
 {'article_link': 'https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697',
  'headline': "mom starting to fear son's web series closest thing she will have to grandchild",
  'is_sarcastic': 1},
 {'article_link': 'https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302',
  'headline': 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
  'is_sarcastic': 1},
 {'article_link': 'https://www.huffingtonpost.com/entry/jk-rowling-w

**Notice that:**

Each element is a dictionary with the article URL (`article_link`), the headline text (`headline`), and a binary label (`is_sarcastic`, where `1` marks a sarcastic headline).

In [3]:
print(datastore[0])

{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5', 'headline': "former versace store clerk sues over secret 'black code' for minority shoppers", 'is_sarcastic': 0}
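Before splitting the data, it can help to check how balanced the two classes are. The sketch below counts labels with `collections.Counter` on a small hand-made list; `sample_records` and `label_counts` are illustrative names, and the records carry only a subset of the keys found in `datastore`. On the real data you would pass `datastore` instead.

```python
from collections import Counter

# Illustrative records shaped like the entries in `datastore` (links omitted).
sample_records = [
    {'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
     'is_sarcastic': 0},
    {'headline': "mom starting to fear son's web series closest thing she will have to grandchild",
     'is_sarcastic': 1},
    {'headline': 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
     'is_sarcastic': 1},
]

def label_counts(records):
    """Count how many records carry each `is_sarcastic` label."""
    return Counter(record['is_sarcastic'] for record in records)

counts = label_counts(sample_records)
print(counts)
```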


### **Importing libraries**

In [38]:
import numpy as np     # Needed later to convert the label lists into arrays
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [39]:
# Initialize lists
sentences= []
labels= []
links= []

# Append each record's fields to the corresponding list
for item in datastore:
  sentences.append(item['headline'])
  labels.append(item['is_sarcastic'])
  links.append(item['article_link'])

In [40]:
len(sentences)

26709

### **Specify the Hyper-Parameters**

In [41]:
# Hyper-Parameters:
# Number of examples to use for training
training_size= 20000

# Vocabulary size of the tokenizer
vocab_size= 10000

# Output dimensions of the Embedding layer
embedding_dim= 16

# Maximum length of the padded sequences
max_length= 32

### **Split the data**

In [49]:
# Split the sentences
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]

# Split the labels
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
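The slices above keep the records in file order. If that order were not random (an assumption, not something checked here), the test set could differ systematically from the training set. The following is a minimal sketch of a shuffled split using only the standard library; `shuffled_split` and the demo lists are hypothetical names, not part of the notebook.

```python
import random

def shuffled_split(sentences, labels, training_size, seed=42):
    """Shuffle sentence/label pairs together, then split at `training_size`."""
    pairs = list(zip(sentences, labels))
    random.Random(seed).shuffle(pairs)      # fixed seed keeps the split reproducible
    train, test = pairs[:training_size], pairs[training_size:]
    return ([s for s, _ in train], [l for _, l in train],
            [s for s, _ in test], [l for _, l in test])

# Tiny demo data standing in for `sentences` and `labels`
demo_sentences = ['a', 'b', 'c', 'd', 'e']
demo_labels = [0, 1, 0, 1, 0]
tr_s, tr_l, te_s, te_l = shuffled_split(demo_sentences, demo_labels, training_size=3)
print(len(tr_s), len(te_s))
```

Shuffling the pairs rather than the two lists separately keeps every sentence aligned with its label.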

### **Generate Padded Sequences**

In [50]:
trunc_type= 'post'
padding_type= 'post'
oov_tok= '<OOV>'

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words= vocab_size, oov_token= oov_tok)

# Generate the word index dictionary
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Convert the training sentences to integer sequences, then pad/truncate them to max_length
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padd = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Do the same for the testing sentences, reusing the word index fitted on the training set
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padd = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Convert the labels lists into numpy arrays
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)
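To make the effect of `padding='post'` and `truncating='post'` concrete, here is a plain-Python sketch of the same idea. `pad_post` is a hypothetical helper written for illustration, not the Keras implementation: short sequences get zeros appended, and long ones lose tokens from the end.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad or truncate each sequence at the end so all have length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # 'post' truncation: drop trailing tokens
        padded.append(seq + [value] * (maxlen - len(seq)))  # 'post' padding: append the fill value
    return padded

demo = pad_post([[5, 3, 8], [7, 2, 9, 4, 6, 1]], maxlen=4)
print(demo)  # [[5, 3, 8, 0], [7, 2, 9, 4]]
```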

### **Build and compile the Model**

In [51]:
model = tf.keras.Sequential([
    # Maps each token index to a trainable vector of size embedding_dim
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # Averages the embeddings over the sequence axis, giving 16 values per headline
    # instead of the 32 * 16 = 512 that a Flatten layer would produce
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    # A single sigmoid unit outputs the probability that the headline is sarcastic
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',optimizer= 'adam', metrics=['accuracy'])
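The pooling layer reduces each headline from a `(max_length, embedding_dim)` matrix to a single `embedding_dim` vector by averaging over the token positions. The plain-Python sketch below illustrates that operation on a toy input; `global_average_pool_1d` is a hypothetical helper, and it assumes no masking, so padded positions would count toward the average too.

```python
def global_average_pool_1d(sequence_embeddings):
    """Average a (timesteps, features) nested list over the timestep axis."""
    timesteps = len(sequence_embeddings)
    features = len(sequence_embeddings[0])
    return [sum(step[f] for step in sequence_embeddings) / timesteps
            for f in range(features)]

# Three token positions, two embedding dimensions
demo_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool_1d(demo_embeddings)
print(pooled)  # [3.0, 4.0]
```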

In [52]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 32, 16)            160000    
                                                                 
 global_average_pooling1d_5   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_8 (Dense)             (None, 24)                408       
                                                                 
 dense_9 (Dense)             (None, 1)                 25        
                                                                 
Total params: 160,433
Trainable params: 160,433
Non-trainable params: 0
_________________________________________________________________
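The parameter counts in the summary can be checked by hand: the embedding table holds one `embedding_dim` vector per vocabulary entry, and each dense layer has one weight per input-output pair plus one bias per output. A short arithmetic sketch (variable names are illustrative):

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 24

embedding_params = vocab_size * embedding_dim                 # one vector per token index
hidden_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                          # weights + bias of the sigmoid unit
total_params = embedding_params + hidden_params + output_params

print(embedding_params, hidden_params, output_params, total_params)
# 160000 408 25 160433
```

The pooling layer contributes no parameters, which matches the `0` in its row.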


### **Train the model**

In [53]:
num_epochs = 30

# Train the model
history = model.fit(training_padd, training_labels, epochs=num_epochs,
                    validation_data=(testing_padd, testing_labels), verbose=2)

Epoch 1/30
625/625 - 3s - loss: 0.5729 - accuracy: 0.6985 - val_loss: 0.4132 - val_accuracy: 0.8299 - 3s/epoch - 5ms/step
Epoch 2/30
625/625 - 2s - loss: 0.3236 - accuracy: 0.8712 - val_loss: 0.3456 - val_accuracy: 0.8568 - 2s/epoch - 4ms/step
Epoch 3/30
625/625 - 2s - loss: 0.2442 - accuracy: 0.9043 - val_loss: 0.3423 - val_accuracy: 0.8535 - 2s/epoch - 4ms/step
Epoch 4/30
625/625 - 2s - loss: 0.1976 - accuracy: 0.9257 - val_loss: 0.3527 - val_accuracy: 0.8530 - 2s/epoch - 4ms/step
Epoch 5/30
625/625 - 3s - loss: 0.1648 - accuracy: 0.9401 - val_loss: 0.3722 - val_accuracy: 0.8547 - 3s/epoch - 5ms/step
Epoch 6/30
625/625 - 3s - loss: 0.1398 - accuracy: 0.9509 - val_loss: 0.4009 - val_accuracy: 0.8509 - 3s/epoch - 4ms/step
Epoch 7/30
625/625 - 2s - loss: 0.1204 - accuracy: 0.9586 - val_loss: 0.4474 - val_accuracy: 0.8402 - 2s/epoch - 4ms/step
Epoch 8/30
625/625 - 2s - loss: 0.1045 - accuracy: 0.9653 - val_loss: 0.5008 - val_accuracy: 0.8335 - 2s/epoch - 4ms/step
Epoch 9/30
625/625 - 4s 
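In the log above, the training loss keeps falling while the validation loss bottoms out early and then climbs, which suggests overfitting. The sketch below finds the epoch with the lowest validation loss using the eight values printed in the log; with a live run you would read them from `history.history['val_loss']` instead. `best_epoch` is a hypothetical helper name.

```python
# Validation losses copied from epochs 1-8 of the training log
val_losses = [0.4132, 0.3456, 0.3423, 0.3527, 0.3722, 0.4009, 0.4474, 0.5008]

def best_epoch(losses):
    """Return the 1-based epoch number with the lowest loss."""
    return min(range(len(losses)), key=lambda i: losses[i]) + 1

epoch = best_epoch(val_losses)
print(epoch, val_losses[epoch - 1])  # 3 0.3423
```

One option for acting on this, not used in the notebook, is to stop training around that point, for example with the `tf.keras.callbacks.EarlyStopping` callback monitoring `val_loss`.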

### **Visualize the Results**

In [53]:
import matplotlib.pyplot as plt

# Plot utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
# Plot the accuracy and loss
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

### **Visualize Word Embeddings**

In [54]:
# Get the index-word dictionary
reverse_word_index = tokenizer.index_word

# Get the embedding layer from the model (i.e. first layer)
embedding_layer = model.layers[0]

# Get the weights of the embedding layer
embedding_weights = embedding_layer.get_weights()[0]

# Print the shape. Expected is (vocab_size, embedding_dim)
print(embedding_weights.shape) 

(10000, 16)


In [55]:
import io

# Open writeable files
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')

# Initialize the loop. Start counting at `1` because `0` is just for the padding
for word_num in range(1, vocab_size):

  # Get the word associated with the current index
  word_name = reverse_word_index[word_num]

  # Get the embedding weights associated with the current index
  word_embedding = embedding_weights[word_num]

  # Write the word name
  out_m.write(word_name + "\n")

  # Write the word embedding
  out_v.write('\t'.join([str(x) for x in word_embedding]) + "\n")

# Close the files
out_v.close()
out_m.close()
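The export loop above can also be written as a function that takes already-open file objects, which makes it easy to test with in-memory buffers and to use with `with` blocks so the files are closed even if an error occurs. This is a sketch under assumptions: `write_embedding_tsv` and the demo data are hypothetical, and skipping missing indices is an added safeguard for vocabularies smaller than `vocab_size`.

```python
import io

def write_embedding_tsv(index_word, weights, vocab_size, vecs_file, meta_file):
    """Write one word per line to meta_file and its tab-separated vector to vecs_file."""
    for word_num in range(1, vocab_size):     # index 0 is reserved for padding
        word = index_word.get(word_num)
        if word is None:                      # vocabulary smaller than vocab_size
            continue
        meta_file.write(word + "\n")
        vecs_file.write('\t'.join(str(x) for x in weights[word_num]) + "\n")

# Toy stand-ins for tokenizer.index_word and the embedding weights
demo_index_word = {1: '<OOV>', 2: 'to', 3: 'of'}
demo_weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

vecs_buffer, meta_buffer = io.StringIO(), io.StringIO()
write_embedding_tsv(demo_index_word, demo_weights, 4, vecs_buffer, meta_buffer)
print(meta_buffer.getvalue())
```

With real files, the two buffers would be replaced by `open('vecs.tsv', 'w', encoding='utf-8')` and `open('meta.tsv', 'w', encoding='utf-8')` inside a single `with` statement.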

In [56]:
# Import the Colab file utilities; skip the download when not running in Colab
try:
  from google.colab import files
except ImportError:
  pass
else:
  # Download the files
  files.download('vecs.tsv')
  files.download('meta.tsv')
