<a href="https://colab.research.google.com/github/EmilSeyfullayev/EmilSeyfullayev-Tensorflow-Developer-Professional-Certificate/blob/main/C3/W1/ungraded_labs/C3_W1_Lab_3_sarcasm_Exercised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W1/ungraded_labs/C3_W1_Lab_3_sarcasm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Tokenizing the Sarcasm Dataset

In this lab, you will apply what you've learned in the past two exercises to preprocess the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/home). The dataset contains news headlines, each labeled as sarcastic or not. You will revisit it in later labs, so it is good to get acquainted with it now.

## Download and inspect the dataset

First, you will fetch the dataset and preview some of its elements.

In [1]:
# Download the dataset
!wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

--2022-06-13 08:38:42--  https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.189.128, 108.177.125.128, 142.250.157.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.189.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘sarcasm.json’


2022-06-13 08:38:42 (195 MB/s) - ‘sarcasm.json’ saved [5643545/5643545]



The dataset is saved as a [JSON](https://www.json.org/json-en.html) file and you can use Python's [`json`](https://docs.python.org/3/library/json.html) module to load it into your workspace. The cell below unpacks the JSON file into a list.

In [2]:
import json

# Load the JSON file
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)

You can inspect a few of the elements in the list. You will notice that each element consists of a dictionary with a URL link, the actual headline, and a label named `is_sarcastic`. Printed below are two elements with contrasting labels.

In [3]:
# Non-sarcastic headline
print(datastore[0])

# Sarcastic headline
print(datastore[20000])

{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5', 'headline': "former versace store clerk sues over secret 'black code' for minority shoppers", 'is_sarcastic': 0}
{'article_link': 'https://www.theonion.com/pediatricians-announce-2011-newborns-are-ugliest-babies-1819572977', 'headline': 'pediatricians announce 2011 newborns are ugliest babies in 30 years', 'is_sarcastic': 1}


With that, you can collect all URLs, headlines, and labels into separate lists for easier processing with the tokenizer. For this lab, you will only need the headlines, but the code below collects the URLs and labels as well.

In [4]:
# Initialize lists
sentences = [] 
labels = []
urls = []

# Append elements in the dictionaries into each list
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
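
The same collection step can also be written with list comprehensions, and a quick label count is a cheap sanity check before tokenizing. The sketch below is only an illustration: `sample_datastore`, `sample_sentences`, `sample_labels`, and `label_counts` are hypothetical names, and the two records are the ones printed earlier, used as a tiny stand-in for the full `datastore`.

```python
from collections import Counter

# Tiny stand-in for `datastore`: the two records printed earlier in this lab.
sample_datastore = [
    {
        "article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5",
        "headline": "former versace store clerk sues over secret 'black code' for minority shoppers",
        "is_sarcastic": 0,
    },
    {
        "article_link": "https://www.theonion.com/pediatricians-announce-2011-newborns-are-ugliest-babies-1819572977",
        "headline": "pediatricians announce 2011 newborns are ugliest babies in 30 years",
        "is_sarcastic": 1,
    },
]

# One list comprehension per field, equivalent to the append loop.
sample_sentences = [item["headline"] for item in sample_datastore]
sample_labels = [item["is_sarcastic"] for item in sample_datastore]

# Count how many headlines carry each label.
label_counts = Counter(sample_labels)
print(label_counts)
```

Running the same `Counter` call on the full `labels` list would show how balanced the two classes are.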

## Preprocessing the headlines

You can convert the `sentences` list above into padded sequences by using the same methods you've been using in the past exercises. The cell below generates the `word_index` dictionary and generates the list of padded sequences for each of the 26,709 headlines.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize the Tokenizer class
tokenizer = Tokenizer(oov_token="<OOV>")

# Generate the word index dictionary
tokenizer.fit_on_texts(sentences)

# Print the length of the word index
word_index = tokenizer.word_index
print(f'number of words in word_index: {len(word_index)}')

# Print the word index
print(f'word_index: {word_index}')
print()

# Generate and pad the sequences
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

# Print a sample headline
index = 2
print(f'sample headline: {sentences[index]}')
print(f'padded sequence: {padded[index]}')
print()

# Print dimensions of padded sequences
print(f'shape of padded sequences: {padded.shape}')
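
To build intuition for what `fit_on_texts` and `texts_to_sequences` produce, here is a simplified pure-Python sketch. It is not the Keras implementation: the helper names `build_word_index` and `texts_to_ids` and the toy corpus are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the real `Tokenizer` also strips punctuation by default. The two ideas it shows are that index 1 is reserved for the OOV token and that more frequent words get smaller indices.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    # most_common() sorts by count; ties keep first-seen order.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_ids(texts, word_index, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [
        [word_index.get(word, oov_id) for word in text.lower().split()]
        for text in texts
    ]


toy_corpus = ["i love my dog", "i love my cat"]
toy_index = build_word_index(toy_corpus)

# "hamster" was never seen while building the index, so it maps to 1.
toy_ids = texts_to_ids(["i love my hamster"], toy_index)
print(toy_index)
print(toy_ids)
```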

This concludes the short demo on using text data preprocessing APIs on a relatively large dataset. Next week, you will start building models that can be trained on these output sequences. See you there!

In [6]:
tokenizer = Tokenizer(num_words=30000, oov_token='<OOV>')

In [7]:
import numpy as np

In [8]:
np.array(sentences).shape

(26709,)

In [9]:
tokenizer.fit_on_texts(sentences)

In [10]:
type(tokenizer.word_index)

dict

In [11]:
len(tokenizer.word_index)

29657
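
Note that `word_index` always holds every word seen during fitting, regardless of `num_words`; the cap is applied later, when texts are converted to sequences. Since the vocabulary here (29,657 entries) is smaller than `num_words=30000`, the cap has no effect in this lab. The sketch below illustrates the rule under the assumption that an OOV token is set (index 1): any index greater than or equal to `num_words` is replaced by the OOV index. `apply_num_words` is a hypothetical helper, not a Keras function, and the input is the first sequence shown in this lab.

```python
def apply_num_words(sequence, num_words, oov_id=1):
    """Replace indices at or above num_words with the OOV index."""
    return [idx if idx < num_words else oov_id for idx in sequence]


# First tokenized headline from this lab.
first_sequence = [308, 15115, 679, 3337, 2298, 48, 382, 2576, 15116, 6, 2577, 8434]

# With the cap used in this lab, nothing changes.
unchanged = apply_num_words(first_sequence, num_words=30000)

# With a smaller, hypothetical cap, the rarer words collapse to OOV.
capped = apply_num_words(first_sequence, num_words=10000)
print(capped)
```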

In [12]:
tokenizer.word_index['with']

10

In [13]:
# word_index = tokenizer.word_index  # optional; not necessary to run

In [14]:
sequences = tokenizer.texts_to_sequences(sentences)

In [15]:
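# The sequences have different lengths, so request an object array explicitly;
# recent NumPy versions raise an error for ragged input otherwise.
np.array(sequences, dtype=object).shape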

(26709,)

In [16]:
sequences[:10]

[[308, 15115, 679, 3337, 2298, 48, 382, 2576, 15116, 6, 2577, 8434],
 [4, 8435, 3338, 2746, 22, 2, 166, 8436, 416, 3112, 6, 258, 9, 1002],
 [145, 838, 2, 907, 1749, 2093, 582, 4719, 221, 143, 39, 46, 2, 10736],
 [1485, 36, 224, 400, 2, 1832, 29, 319, 22, 10, 2924, 1393, 6969, 968],
 [767, 719, 4720, 908, 10737, 623, 594, 5, 4, 95, 1309, 92],
 [10738, 4, 365, 73],
 [4, 6970, 351, 6, 461, 4274, 2195, 1486],
 [19, 479, 39, 1168, 31, 155, 2, 99, 83, 18, 158, 6, 32, 352],
 [249, 3623, 6971, 555, 5274, 1995, 141],
 [2094, 326, 347, 401, 60, 15117, 6, 4, 3896]]

In [17]:
# dtype=object keeps each variable-length sequence as a Python list
np.array(sequences, dtype=object)[:10]

array([list([308, 15115, 679, 3337, 2298, 48, 382, 2576, 15116, 6, 2577, 8434]),
       list([4, 8435, 3338, 2746, 22, 2, 166, 8436, 416, 3112, 6, 258, 9, 1002]),
       list([145, 838, 2, 907, 1749, 2093, 582, 4719, 221, 143, 39, 46, 2, 10736]),
       list([1485, 36, 224, 400, 2, 1832, 29, 319, 22, 10, 2924, 1393, 6969, 968]),
       list([767, 719, 4720, 908, 10737, 623, 594, 5, 4, 95, 1309, 92]),
       list([10738, 4, 365, 73]),
       list([4, 6970, 351, 6, 461, 4274, 2195, 1486]),
       list([19, 479, 39, 1168, 31, 155, 2, 99, 83, 18, 158, 6, 32, 352]),
       list([249, 3623, 6971, 555, 5274, 1995, 141]),
       list([2094, 326, 347, 401, 60, 15117, 6, 4, 3896])], dtype=object)

In [18]:
list_one = [[1, 2, 3], [1, 2, 3]]
print(np.array(list_one))
np.array(list_one).shape

[[1 2 3]
 [1 2 3]]


(2, 3)

In [19]:
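# Rows of unequal length cannot form a 2-D array, so build an object array
list_two = [[1, 2, 3], [1, 2]]
print(np.array(list_two, dtype=object))
np.array(list_two, dtype=object).shape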

[list([1, 2, 3]) list([1, 2])]


  

(2,)

In [20]:
padded = pad_sequences(sequences, padding='post')

In [21]:
padded.shape

(26709, 40)
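
The second dimension is 40 because, when no `maxlen` is given, `pad_sequences` pads every row to the length of the longest sequence, and `padding='post'` appends the zeros at the end. The sketch below reproduces that behavior in plain Python on the ragged `[[1, 2, 3], [1, 2]]` example from above. `pad_post` is a hypothetical helper for illustration only, not part of Keras.

```python
def pad_post(seqs, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    width = max(len(seq) for seq in seqs)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in seqs]


ragged = [[1, 2, 3], [1, 2]]
rectangular = pad_post(ragged)

# Every row now has the same length, so the data forms a proper 2-D grid.
print(rectangular)
print((len(rectangular), len(rectangular[0])))
```

After padding, every row has the same length, which is why `padded` can be a regular 2-D array while `np.array(sequences)` could not.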

In [22]:
padded[:10]

array([[  308, 15115,   679,  3337,  2298,    48,   382,  2576, 15116,
            6,  2577,  8434,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0],
       [    4,  8435,  3338,  2746,    22,     2,   166,  8436,   416,
         3112,     6,   258,     9,  1002,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0],
       [  145,   838,     2,   907,  1749,  2093,   582,  4719,   221,
          143,    39,    46,     2, 10736,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0],
       [ 1485,    36,   224,   400,  