# Ungraded Lab: Tokenizing the Sarcasm Dataset

In this lab, you will apply what you've learned in the past two exercises to preprocess the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). It contains news headlines, each labeled as sarcastic or not. You will revisit this dataset in later labs, so it is worth getting acquainted with it now.

**IMPORTANT NOTE:** This notebook is designed to run as a Colab. Running it on your local machine might result in some of the code blocks throwing errors.

## Download and inspect the dataset

First, you will fetch the dataset and preview some of its elements.

In [None]:
import os
try:
    import wget
except ModuleNotFoundError:
    print("Installing wget module...")
    !pip install wget
    import wget

def download_dataset(url, folder, filename):
    # Check if the folder exists, otherwise, create it
    if not os.path.exists(folder):
        os.makedirs(folder)

    file_path = os.path.join(folder, filename)

    # Check if the file has already been downloaded
    if not os.path.exists(file_path):
        print(f"Downloading file from {url}...")
        wget.download(url, out=folder)
        print("\nDownload completed.")
    else:
        print("The file has already been downloaded.")

# Specify the URL of the file, the destination folder, and the filename
url = "https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json"
folder = "Datasets"
filename = "sarcasm.json"

# Call the function to download the file.
download_dataset(url, folder, filename)

Installing wget module...
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9656 sha256=781ccd042ecda2efed63b378625919c906ef2a2270b6755d74944954c00b7cde
  Stored in directory: /root/.cache/pip/wheels/40/b3/0f/a40dbd1c6861731779f62cc4babcb234387e11d697df70ee97
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Downloading file from https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json...

Download completed.


The dataset is saved as a [JSON](https://www.json.org/json-en.html) file and you can use Python's [`json`](https://docs.python.org/3/library/json.html) module to load it into your workspace. The cell below unpacks the JSON file into a list.

In [None]:
import json

def load_json_file(file_path):
    with open(file_path, 'r') as f:
        json_data = json.load(f)
    return json_data

# Example usage
datastore = load_json_file("./Datasets/sarcasm.json")

You can inspect a few of the elements in the list. Each element is a dictionary with a URL link (`article_link`), the actual headline (`headline`), and a label named `is_sarcastic`. The cells below preview the data, then print the fields of a single element.

In [None]:
# Print the first few records in a readable format
# (printing the entire list at once would flood the output)
print(json.dumps(datastore[:3], indent=4))

In [None]:
# 'datastore' is your array of JSON Objects
article = 20000

# Print the article link, the headline, and whether it is sarcastic or not
print("Article Link:", datastore[article]["article_link"])
print("Headline:", datastore[article]["headline"])
print("Is Sarcastic:", "Yes" if datastore[article]["is_sarcastic"] == 1 else "No")

Article Link: https://www.theonion.com/pediatricians-announce-2011-newborns-are-ugliest-babies-1819572977
Headline: pediatricians announce 2011 newborns are ugliest babies in 30 years
Is Sarcastic: Yes


With that, you can collect the headlines because those are the string inputs that you will preprocess into numeric features.

In [None]:
# Collect the headlines into a list
sentences = [item['headline'] for item in datastore]
print(f'There are {len(sentences)} headlines in the datastore')
print(f'First five headlines in the datastore: {sentences[:5]}')

There are 26709 headlines in the datastore
First five headlines in the datastore: ["former versace store clerk sues over secret 'black code' for minority shoppers", "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "mom starting to fear son's web series closest thing she will have to grandchild", 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas', 'j.k. rowling wishes snape happy birthday in the most magical way']
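
The list comprehension above works because `json.load` turns a JSON array into a Python list of dictionaries. Here is a minimal sketch of that on two made-up records. Only the three field names come from the dataset; the links and headlines are invented for illustration, and the sketch uses `json.loads` on a string instead of reading the downloaded file.

```python
import json

# Two hypothetical records that mimic the dataset's layout (values are made up)
raw = '''
[
    {"article_link": "https://example.com/a", "headline": "local man wins award", "is_sarcastic": 0},
    {"article_link": "https://example.com/b", "headline": "area cat declares itself mayor", "is_sarcastic": 1}
]
'''

# A JSON array of objects becomes a Python list of dictionaries
records = json.loads(raw)

# The same comprehension pattern works for any field
headlines = [item["headline"] for item in records]
labels = [item["is_sarcastic"] for item in records]

print(type(records).__name__, type(records[0]).__name__)
print(headlines)
print(labels)
```

The same pattern can collect the `is_sarcastic` labels when you need them for training.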


## Preprocessing the headlines

You can convert the `sentences` list above into padded sequences with the same methods you used in the previous labs. The cells below build the vocabulary, then use it to generate the post-padded sequences for each of the 26,709 headlines.

In [None]:
import tensorflow as tf

# Instantiate the layer
vectorize_layer = tf.keras.layers.TextVectorization()

# Build the vocabulary
vectorize_layer.adapt(sentences)

# Apply the layer to get post-padded sequences
post_padded_sequences = vectorize_layer(sentences)

You can view the results for a particular headline by changing the value of `index` below.

In [None]:
# Print dimensions of padded sequences
print(f'Shape of padded sequences: {post_padded_sequences.shape}')

# Print a sample headline and sequence
index = 20000
print(f'sample headline: {sentences[index]}')
print(f'padded sequence: {post_padded_sequences[index]}')

Shape of padded sequences: (26709, 39)
sample headline: pediatricians announce 2011 newborns are ugliest babies in 30 years
padded sequence: [11985  1123  6846  5432    30  8441  2365     5   690    84     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]
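
To make the integer IDs less opaque, the sketch below approximates in plain Python what the layer does with its default settings: lowercase the text, strip punctuation, split on whitespace, and number the tokens by descending frequency, with index 0 reserved for padding and index 1 for out-of-vocabulary tokens. This is an approximation, not the layer's actual implementation. `standardize`, `build_vocab`, and `encode` are hypothetical helper names, `string.punctuation` is assumed to be close to the set of characters the layer strips, and the order of equally frequent tokens may differ from TensorFlow's.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and drop punctuation (approximates the default standardization)
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts):
    # Count whitespace-separated tokens across all texts
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Index 0 is reserved for padding, index 1 for out-of-vocabulary tokens
    vocab = ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(vocab)}

def encode(text, vocab):
    # Tokens that are not in the vocabulary map to index 1
    return [vocab.get(tok, 1) for tok in standardize(text).split()]

# Made-up sample texts, not taken from the dataset
texts = ["The cat sat.", "The cat ran!", "the dog"]
vocab = build_vocab(texts)

print(vocab)
print(encode("The bird sat", vocab))
```

In this sketch, frequent words such as "the" receive small indices. That matches the output above, where the common word "in" maps to 5 and the rare word "pediatricians" maps to 11985.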


For pre-padding, you have to set up the `TextVectorization` layer differently. Instead of the automatic post-padding shown above, you want sequences of variable length, which you then pass to the `pad_sequences()` utility function you used in the previous lab. The cells below show one way to do it:

* First, you will initialize the `TextVectorization` layer and set its `ragged` flag to `True`. This will result in a [ragged tensor](https://www.tensorflow.org/guide/ragged_tensor), which simply means a tensor with variable-length elements. Without the zero padding, the sequences have different lengths, so you need a ragged tensor to hold them.

* Like before, you will use the layer's `adapt()` method to generate a vocabulary.

* Then, you will apply the layer to the string sentences to generate the integer sequences. As mentioned, these will not be post-padded.

* Lastly, you will pass this ragged tensor to the [pad_sequences()](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences) function to generate pre-padded sequences.

In [None]:
# Instantiate the layer and set the `ragged` flag to `True`
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)

# Build the vocabulary
vectorize_layer.adapt(sentences)

# Apply the layer to generate a ragged tensor
ragged_sequences = vectorize_layer(sentences)

In [None]:
# Print dimensions of the ragged sequences
print(f'Shape of ragged sequences: {ragged_sequences.shape}')

# Print a sample headline and sequence
index = 20000
print(f'sample headline: {sentences[index]}')
print(f'ragged sequence: {ragged_sequences[index]}')

Shape of ragged sequences: (26709, None)
sample headline: pediatricians announce 2011 newborns are ugliest babies in 30 years
ragged sequence: [11985  1123  6846  5432    30  8441  2365     5   690    84]


In [None]:
from tensorflow.keras.utils import pad_sequences

# Apply pre-padding to the ragged tensor
pre_padded_sequences = pad_sequences(ragged_sequences.numpy())
print(f'Shape of pre-padded sequences: {pre_padded_sequences.shape}')

# Preview the result for a single sequence
index = 2
print(f'sample headline: {sentences[index]}')
print(f'padded sequence: {pre_padded_sequences[index]}')

Shape of pre-padded sequences: (26709, 39)
sample headline: mom starting to fear son's web series closest thing she will have to grandchild
padded sequence: [    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0   140   825     2   813  1100  2048   571  5057   199   139    39
    46     2 13050]
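
The difference between the two padding modes comes down to which side the zeros go on. The plain-Python sketch below illustrates this on made-up sequences. `pad` is a hypothetical helper written for this illustration; it is not how `pad_sequences()` is implemented. Like the default behavior used above, it pads every sequence to the length of the longest one.

```python
def pad(sequences, padding="pre", value=0):
    """Pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        # "pre" puts the fill before the tokens, "post" puts it after
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded

# Made-up variable-length sequences, standing in for a ragged tensor
ragged = [[5, 8, 2], [7], [4, 9]]

pre = pad(ragged, padding="pre")
post = pad(ragged, padding="post")

print(pre)   # [[5, 8, 2], [0, 0, 7], [0, 4, 9]]
print(post)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```

Either way, the result is rectangular, which is why both padded versions of the headlines share the same width.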


You can compare the post-padded and pre-padded sequences for any headline by changing the value of `index` below.

In [None]:
# Print a sample headline and sequence
index = 2
print(f'sample headline: {sentences[index]}')
print()
print(f'post-padded sequence: {post_padded_sequences[index]}')
print()
print(f'pre-padded sequence: {pre_padded_sequences[index]}')
print()

# Print dimensions of padded sequences
print(f'shape of post-padded sequences: {post_padded_sequences.shape}')
print(f'shape of pre-padded sequences: {pre_padded_sequences.shape}')
print()

print(f'The dimensions of sequences with pre-padded and post-padded are equal: {post_padded_sequences.shape==pre_padded_sequences.shape}')

sample headline: mom starting to fear son's web series closest thing she will have to grandchild

post-padded sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

pre-padded sequence: [    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0   140   825     2   813  1100  2048   571  5057   199   139    39
    46     2 13050]

shape of post-padded sequences: (26709, 39)
shape of pre-padded sequences: (26709, 39)

The dimensions of sequences with pre-padded and post-padded are equal: True
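
As a quick sanity check on the output above for `index = 2`, removing the zeros from either padded version should leave the same 14 tokens, one per word in the headline. The sketch below copies the token IDs by hand from the printed output instead of reading them from the tensors. It assumes that 0 is used only for padding, and `strip_padding` is a hypothetical helper name.

```python
# Token IDs copied from the printed output for index = 2
tokens = [140, 825, 2, 813, 1100, 2048, 571, 5057, 199, 139, 39, 46, 2, 13050]

# The padded width of 39 is taken from the printed shapes
num_padding = 39 - len(tokens)

post_padded = tokens + [0] * num_padding
pre_padded = [0] * num_padding + tokens

def strip_padding(sequence, value=0):
    # Keep every entry that is not the padding value
    return [tok for tok in sequence if tok != value]

headline = "mom starting to fear son's web series closest thing she will have to grandchild"

print(strip_padding(post_padded) == strip_padding(pre_padded))
print(len(strip_padding(pre_padded)) == len(headline.split()))
```

The token `2` appears twice in the sequence because the word "to" appears twice in the headline.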


This concludes the short demo of text preprocessing on a relatively large dataset. Next week, you will start building models that can be trained on these output sequences. See you there!