Machine learning algorithms work with numbers, so before we can work with sentences and words we need to convert each word to a number, as below:

I love My dog -> {1:'i', 2:'love', 3:'my', 4:'dog'} -> [[1, 2, 3, 4]]

Now we can pass this data through our neural network.
To create this list we can use a built-in TensorFlow API called Tokenizer.
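
To make the word-to-number idea concrete, here is a minimal pure-Python sketch of the same mapping. It is only an illustration of the concept, not how Tokenizer is implemented: the helper names `build_word_index` and `to_sequences` and the toy sentences are made up for this example, and punctuation handling is ignored.

```python
from collections import Counter

def build_word_index(texts):
    """Map each lowercase word to an integer, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Sort by descending count; ties keep first-seen order because
    # Counter preserves insertion order and sorted() is stable.
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    # Index 0 is left unused so it can later serve as the padding value.
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

def to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

# Hypothetical toy data, not part of the notebook's dataset.
toy_sentences = ['I love my dog', 'I love my cat']
toy_index = build_word_index(toy_sentences)
toy_sequences = to_sequences(toy_sentences, toy_index)
print(toy_index)      # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(toy_sequences)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
```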

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
tokenizer = Tokenizer(num_words=100)

In [None]:
sentences = [
             'I love milad soleymani',
             'I love myself',
             'I hate to be laghar',
             'I want to be Topol'
]

tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'to': 3, 'be': 4, 'milad': 5, 'soleymani': 6, 'myself': 7, 'hate': 8, 'laghar': 9, 'want': 10, 'topol': 11}


After the steps above, we need to convert each sentence into a sequence of word indices with the code below:

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 2, 5, 6], [1, 2, 7], [1, 8, 3, 4, 9], [1, 10, 3, 4, 11]]


In [None]:
# 'ashegh' was never seen by the tokenizer, so it is dropped from the result
temp = ['milad ashegh laghar']
print(tokenizer.texts_to_sequences(temp))

[[5, 9]]
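
The result above has only two numbers for a three-word sentence, because the unseen word is silently dropped. Tokenizer accepts an `oov_token` argument that keeps a placeholder for such words instead. The sketch below shows the idea in plain Python; the function `to_sequences_with_oov` and the small vocabulary are hypothetical and only mimic the behaviour, assuming index 1 is reserved for the placeholder.

```python
def to_sequences_with_oov(texts, word_index, oov_token='<OOV>'):
    """Look up every word; unknown words map to the OOV index."""
    oov_index = word_index[oov_token]
    return [[word_index.get(w, oov_index) for w in t.lower().split()]
            for t in texts]

# Hypothetical vocabulary: index 1 is reserved for the OOV placeholder.
oov_word_index = {'<OOV>': 1, 'i': 2, 'love': 3, 'milad': 4, 'laghar': 5}
oov_sequences = to_sequences_with_oov(['milad ashegh laghar'], oov_word_index)
print(oov_sequences)  # [[4, 1, 5]]
```

Keeping the placeholder preserves the sentence length and word positions, which usually matters more to a sequence model than the identity of one rare word.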


As you can see, the 4 training sentences were encoded as sequences of different lengths. That could be a problem in later processing, so we need to use padding in these cases.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# maxlen -> length of every output sequence (shorter ones are padded, longer ones are cut)
# padding -> add the zeros before ('pre') or after ('post') the sentence
# truncating -> when a sentence is longer than maxlen, remove words from its start ('pre') or its end ('post')

padded = pad_sequences(
    sequences,
    maxlen=10,
    padding='post',
    truncating='post'
)

padded

array([[ 1,  2,  5,  6,  0,  0,  0,  0,  0,  0],
       [ 1,  2,  7,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  8,  3,  4,  9,  0,  0,  0,  0,  0],
       [ 1, 10,  3,  4, 11,  0,  0,  0,  0,  0]], dtype=int32)
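
To see how `maxlen`, `padding` and `truncating` interact, here is a small pure-Python sketch of the same logic. It is a simplified stand-in written for illustration (the function name `pad` and the `demo` data are assumptions), not the library code. It uses 'pre' as the default for both options, which matches the defaults of `pad_sequences`.

```python
def pad(seqs, maxlen, padding='pre', truncating='pre', value=0):
    """Pad or cut each sequence to exactly maxlen items (maxlen > 0)."""
    out = []
    for seq in seqs:
        # Cut first: 'pre' drops items from the front, 'post' from the end.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then fill: 'pre' puts the padding before the tokens, 'post' after.
        out.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return out

demo = [[1, 2, 5, 6], [1, 8, 3, 4, 9]]
print(pad(demo, maxlen=6, padding='post'))     # [[1, 2, 5, 6, 0, 0], [1, 8, 3, 4, 9, 0]]
print(pad(demo, maxlen=6))                     # [[0, 0, 1, 2, 5, 6], [0, 1, 8, 3, 4, 9]]
print(pad(demo, maxlen=3, truncating='post'))  # [[1, 2, 5], [1, 8, 3]]
print(pad(demo, maxlen=3, truncating='pre'))   # [[2, 5, 6], [3, 4, 9]]
```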

Now we want to work with a real dataset of news headlines to see whether a headline is sarcastic or not.

First we should download the dataset.

In [None]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

--2021-05-25 07:48:26--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.13.80, 172.217.13.240, 172.217.15.80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.13.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘/tmp/sarcasm.json’


2021-05-25 07:48:26 (261 MB/s) - ‘/tmp/sarcasm.json’ saved [5643545/5643545]



In [None]:
import json

In [None]:
with open('/tmp/sarcasm.json', 'r') as fin:
    dataset = json.load(fin)

sentences = []
labels = []
urls = []


for item in dataset:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
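
The loop above relies on every record having the keys `headline`, `is_sarcastic` and `article_link`. The sketch below runs the same extraction on two made-up records of that shape (the URLs and headlines are placeholders, not entries from the real file) and adds two quick sanity checks that can be useful before tokenizing.

```python
import json
from collections import Counter

# Two made-up records in the same shape as the entries of sarcasm.json.
sample_json = '''[
  {"article_link": "https://example.com/a", "headline": "first example headline", "is_sarcastic": 0},
  {"article_link": "https://example.com/b", "headline": "second example headline", "is_sarcastic": 1}
]'''
sample = json.loads(sample_json)

sample_sentences = [item['headline'] for item in sample]
sample_labels = [item['is_sarcastic'] for item in sample]

# Sanity checks: one label per headline, and how many of each class.
assert len(sample_sentences) == len(sample_labels)
label_counts = Counter(sample_labels)
print(label_counts)
```

On the real data, the same `Counter(labels)` call would show how balanced the two classes are.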

The next step is to tokenize the loaded data.

In [None]:
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print('Vocabulary size:', len(word_index))

Vocabulary size: 29657


Now we need to create padded sequences to prepare the data for our network.

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
padded_sequences = pad_sequences(sequences, padding='post')

print(sentences[2])
print(padded_sequences[2])

print(padded_sequences.shape)

mom starting to fear son's web series closest thing she will have to grandchild
[  145   838     2   907  1749  2093   582  4719   221   143    39    46
     2 10736     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
(26709, 40)
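
A quick way to check the encoding is to map a padded row back to words. The sketch below does this with a reversed word index. The `decode` helper and the tiny `demo_index` are hypothetical and use made-up indices, but the same function should work with the `word_index` built above.

```python
def decode(sequence, word_index, pad_value=0):
    """Turn a padded sequence of indices back into a string."""
    index_word = {i: w for w, i in word_index.items()}
    # Skip padding; fall back to '?' for any index missing from the map.
    return ' '.join(index_word.get(i, '?') for i in sequence if i != pad_value)

# Hypothetical small vocabulary with made-up indices.
demo_index = {'<OOV>': 1, 'to': 2, 'mom': 3, 'starting': 4, 'fear': 5}
decoded = decode([3, 4, 2, 5, 0, 0, 0], demo_index)
print(decoded)  # mom starting to fear
```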
