<a href="https://colab.research.google.com/github/ReemOmer/25daysofMLDS/blob/main/Day3in25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP TensorFlow Course

Words have to be represented in a form the computer can process, for example as numbers, much as ASCII assigns a number to each character. This process is called tokenization.


Sentences can be encoded by assigning a number to each word; the same word in a different sentence gets the same number.
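
As a minimal, library-free sketch of this idea, the snippet below gives each new word the next free integer and reuses that integer whenever the word appears again. The helper name `encode` and the two example sentences are illustrative assumptions, not part of the course code.

```python
# Minimal sketch: every distinct word gets one integer ID, shared across sentences.
# The helper name `encode` and the example sentences are illustrative assumptions.
vocab = {}

def encode(sentence):
    """Map each lowercase word to an integer, reusing IDs for words seen before."""
    ids = []
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # IDs start at 1
        ids.append(vocab[word])
    return ids

first = encode("I love my dog")
second = encode("I love my cat")
print(first)   # [1, 2, 3, 4]
print(second)  # [1, 2, 3, 5]
```

Only the last number differs between the two lists, because only the last word differs.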

### Tokenization: Words to Numbers

In [1]:
import tensorflow as tf
from tensorflow import keras

# Import the Tokenizer API
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
             'Test if for the first time!',
             'I shall pass the test!!'
]

# Create the tokenizer. num_words caps the vocabulary used when converting text
# to sequences: only the most frequent words (indices below num_words) are kept.
# word_index itself still lists every word seen during fitting.
tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(sentences)

# Dictionary mapping each word to its index (most frequent word first)
word_index = tokenizer.word_index

# Punctuation is filtered out and words are lowercased
print(word_index)

{'test': 1, 'the': 2, 'if': 3, 'for': 4, 'first': 5, 'time': 6, 'i': 7, 'shall': 8, 'pass': 9}
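
The ordering of the printed dictionary follows word frequency: `test` and `the` appear twice, so they get the lowest indices. The sketch below reproduces that ranking in plain Python. It is an approximation of what the tokenizer does, not its actual implementation; the function name `build_word_index` and the `PUNCTUATION` set are assumptions made for illustration.

```python
from collections import Counter

# Simplified stand-in for the tokenizer's punctuation filter (an assumption).
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    # Replace each punctuation character with a space, then lowercase and split.
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    counts = Counter()
    for text in texts:
        counts.update(text.translate(table).lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ['Test if for the first time!', 'I shall pass the test!!']
word_index = build_word_index(sentences)
print(word_index)
# {'test': 1, 'the': 2, 'if': 3, 'for': 4, 'first': 5, 'time': 6, 'i': 7, 'shall': 8, 'pass': 9}
```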


### Sequencing: Sentences to Data

In [2]:
# Convert each sentence into a list of numbers
sequences = tokenizer.texts_to_sequences(sentences)

print(sequences)

[[1, 3, 4, 2, 5, 6], [7, 8, 9, 2, 1]]
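
One way to check the sequences is to map them back to words by inverting the word index. The sketch below does this by hand with the values printed above; the helper name `decode` is an assumption for illustration.

```python
# Word index and sequences copied from the outputs above.
word_index = {'test': 1, 'the': 2, 'if': 3, 'for': 4, 'first': 5,
              'time': 6, 'i': 7, 'shall': 8, 'pass': 9}
sequences = [[1, 3, 4, 2, 5, 6], [7, 8, 9, 2, 1]]

# Invert the mapping: index -> word.
index_word = {index: word for word, index in word_index.items()}

def decode(sequence):
    """Turn a list of indices back into a space-separated string."""
    return " ".join(index_word[i] for i in sequence)

decoded = [decode(seq) for seq in sequences]
print(decoded)  # ['test if for the first time', 'i shall pass the test']
```

The decoded text is lowercase and has no punctuation, since both were removed during tokenization.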


In [3]:
# Words that were not in the fitted corpus are dropped: only known words are sequenced.
test_data = ['The weather is really hot!']

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[2]]
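
Only `the` survives because it is the only word of the test sentence present in the word index. A minimal sketch of that lookup, with a hypothetical helper `lookup_known_words` and simplified punctuation handling (only `!` is stripped):

```python
# Word index copied from the first tokenizer's output above.
word_index = {'test': 1, 'the': 2, 'if': 3, 'for': 4, 'first': 5,
              'time': 6, 'i': 7, 'shall': 8, 'pass': 9}

def lookup_known_words(text, word_index):
    """Look up each word; words missing from the index are silently skipped."""
    words = text.lower().replace("!", " ").split()  # simplified cleaning (assumption)
    return [word_index[w] for w in words if w in word_index]

test_seq = lookup_known_words('The weather is really hot!', word_index)
print(test_seq)  # [2]
```

Four of the five words vanish without a trace, so the sequence no longer reflects the length of the sentence.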


In [4]:
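# To avoid dropping unseen words, set oov_token (OOV = out of vocabulary)
# The OOV token takes index 1, which pushes 'pass' to index 10. Only indices
# below num_words are kept, so 'pass' is also encoded as <OOV> in the output.
tokenizer = Tokenizer(num_words=10, oov_token='<OOV>')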
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

{'<OOV>': 1, 'test': 2, 'the': 3, 'if': 4, 'for': 5, 'first': 6, 'time': 7, 'i': 8, 'shall': 9, 'pass': 10}
[[2, 4, 5, 3, 6, 7], [8, 9, 1, 3, 2]]
[[3, 1, 1, 1, 1]]
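
Note that `pass` appears as `1` in the second training sentence even though it is in the word index: its index (10) is not below `num_words`, so it is treated as out of vocabulary. The sketch below mimics that rule in plain Python to reproduce the outputs above; `encode_with_oov` is a hypothetical helper and the punctuation handling is simplified.

```python
OOV_INDEX = 1  # index of '<OOV>' in the word index printed above

word_index = {'<OOV>': 1, 'test': 2, 'the': 3, 'if': 4, 'for': 5, 'first': 6,
              'time': 7, 'i': 8, 'shall': 9, 'pass': 10}

def encode_with_oov(text, word_index, num_words=10):
    """Replace unknown words, and words ranked at or beyond num_words, with OOV."""
    words = text.lower().replace("!", " ").split()  # simplified cleaning (assumption)
    sequence = []
    for word in words:
        index = word_index.get(word)
        if index is None or index >= num_words:
            sequence.append(OOV_INDEX)
        else:
            sequence.append(index)
    return sequence

train_seq = encode_with_oov('I shall pass the test!!', word_index)
unseen_seq = encode_with_oov('The weather is really hot!', word_index)
print(train_seq)   # [8, 9, 1, 3, 2]
print(unseen_seq)  # [3, 1, 1, 1, 1]
```

With the OOV token, each encoded sentence keeps its original length.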


### Padding: Unify Sentence Lengths

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# By default, zeros are added at the beginning of each sequence to match the longest one
# padding='post' puts the zeros at the end instead
padded = pad_sequences(sequences)
# maxlen=3 fixes the length of every sequence; longer sequences are truncated,
# from the start with truncating='pre' (the default) or from the end with truncating='post'
post_padded = pad_sequences(sequences, padding='post', maxlen=3, truncating='pre')
print(padded)
print(post_padded)

[[2 4 5 3 6 7]
 [0 8 9 1 3 2]]
[[3 6 7]
 [1 3 2]]
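
The padding and truncation rules can be sketched in a few lines of plain Python. This is an illustration of the logic under the parameter names used above, not the library's implementation; the function name `pad` is an assumption, and it returns nested lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)  # default: longest sequence
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result

sequences = [[2, 4, 5, 3, 6, 7], [8, 9, 1, 3, 2]]
padded = pad(sequences)
post_padded = pad(sequences, padding='post', maxlen=3, truncating='pre')
print(padded)       # [[2, 4, 5, 3, 6, 7], [0, 8, 9, 1, 3, 2]]
print(post_padded)  # [[3, 6, 7], [1, 3, 2]]
```

In the second call both sequences are longer than `maxlen`, so only truncation applies and `padding='post'` has no visible effect.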


### Classifier to Recognize Sentiment in Text

In [8]:
!wget --no-check-certificate https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json -O ./sarcasm.json

--2022-01-10 08:49:24--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.45.112, 172.217.13.80, 172.217.13.240, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.45.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘./sarcasm.json’


2022-01-10 08:49:24 (76.0 MB/s) - ‘./sarcasm.json’ saved [5643545/5643545]



In [9]:
# The data is stored as JSON; the json library converts it into Python objects

import json

# Load the file
with open('./sarcasm.json', 'r') as f:
  datastore = json.load(f)

sentences = []
labels = []
urls = []

# Collect the fields of each record
for item in datastore:
  sentences.append(item['headline'])
  labels.append(item['is_sarcastic'])
  urls.append(item['article_link'])
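
The loading loop above relies on the file being a JSON array of objects with `headline`, `is_sarcastic`, and `article_link` keys. The sketch below shows the same parsing on a small inline string. The two records are invented placeholders, not entries from the dataset, and the variable names differ from the cell above so that running it does not overwrite the real data.

```python
import json

# Two made-up records with the same keys as the dataset (placeholder values).
raw = '''[
  {"article_link": "https://example.com/a", "headline": "placeholder headline one", "is_sarcastic": 0},
  {"article_link": "https://example.com/b", "headline": "placeholder headline two", "is_sarcastic": 1}
]'''

records = json.loads(raw)  # a list of dicts, like `datastore` above

sample_headlines = [record["headline"] for record in records]
sample_labels = [record["is_sarcastic"] for record in records]
print(sample_headlines)  # ['placeholder headline one', 'placeholder headline two']
print(sample_labels)     # [0, 1]
```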

In [10]:
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
(26709, 40)
