<a href="https://colab.research.google.com/github/TyronSamaroo/GoogleColabNotebooks/blob/master/TensorflowIntro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Author: Tyron Samaroo

Referenced from

https://www.youtube.com/watch?v=fNxaJsNG3-s&list=PLQY2H8rRoyvwLbzbnKJ59NkZvQAW9wLbx




# Natural Language Processing - Tokenization

In [0]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [0]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

In [0]:
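# Create the tokenizer; num_words limits how many of the most frequent words are kept when encoding text
tokenizer = Tokenizer(num_words=100)
# Fit the tokenizer on the given text
tokenizer.fit_on_texts(sentences)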

In [0]:
word_index = tokenizer.word_index

In [0]:
# Note that 'dog!' does not appear in the output: the tokenizer strips punctuation, so it is counted as 'dog' rather than as a new word
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
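
To make the ranking above less of a black box, here is a minimal pure-Python sketch of how such a word index could be built: lowercase the text, turn punctuation into spaces, count the words, and number them from most to least frequent. The helper `build_word_index` and the variable `demo_index` are hypothetical names for illustration only. This is not the Keras implementation, and using `string.punctuation` as the filter is a simplifying assumption.

```python
from collections import Counter
import string

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    # Replace punctuation with spaces so 'dog!' and 'dog' count as one word.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

demo_index = build_word_index(['I love my dog', 'I love my cat', 'You love my dog!'])
print(demo_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

The lower the index, the more often the word appears in the fitted text.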


# Sequencing - Turning sentences into data

In [0]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]


In [0]:
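# Create the tokenizer; num_words limits how many of the most frequent words are kept when encoding text
tokenizer = Tokenizer(num_words=100)
# Fit the tokenizer on the given text
tokenizer.fit_on_texts(sentences)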

In [0]:
word_index = tokenizer.word_index
print(word_index)


{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


In [0]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


Be careful with words that are not in the word index. The corpus the tokenizer was fitted on will not contain every word that appears in the test data.

In [0]:
test_data = ['i really love my dog', 'my dog loves my ']
test_seq = tokenizer.texts_to_sequences(test_data)


Notice the issue: words that are missing from the word index ('really' and 'loves') are silently dropped, so the test sentences do not map properly.

In [0]:
print(test_seq)
print(word_index)

[[4, 2, 1, 3], [1, 3, 1]]
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
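
One way to spot this problem before encoding is to list the words that have no entry in the word index. The sketch below is plain Python and does not call Keras. `unknown_words`, `demo_index`, and `missing` are hypothetical names, and the index is copied by hand from the output above.

```python
import string

def unknown_words(texts, word_index):
    """Return, for each text, the words that have no entry in word_index."""
    # Assumption: punctuation is treated as whitespace, as in the examples above.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return [
        [word for word in text.lower().translate(table).split()
         if word not in word_index]
        for text in texts
    ]

# Word index copied from the printed output above.
demo_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

missing = unknown_words(['i really love my dog', 'my dog loves my '], demo_index)
print(missing)
# [['really'], ['loves']]
```

These are exactly the words that disappear from the encoded test sequences.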


In [0]:
# To fix this, pass oov_token so that unseen words map to a placeholder token
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

In [0]:
test_seq = tokenizer.texts_to_sequences(test_data)

In [0]:
# Notice that every word in test_data that was not in sentences is replaced with the <OOV> token
print(word_index)
print(sequences)
print(test_seq)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[5, 1, 3, 2, 4], [2, 4, 1, 2]]
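
The effect of the out-of-vocabulary token can be sketched as a dictionary lookup with a fallback. The code below is a simplified pure-Python illustration, not the Keras implementation. `encode`, `demo_index`, `dropped`, and `with_oov` are hypothetical names, and the sketch assumes the input contains no punctuation.

```python
def encode(texts, word_index, oov_token=None):
    """Turn each text into a list of word indices.

    Unknown words are dropped when oov_token is None; otherwise they are
    replaced by the index of oov_token.
    """
    oov_index = word_index.get(oov_token)  # None when no OOV token is set
    encoded = []
    for text in texts:
        sequence = []
        for word in text.lower().split():
            index = word_index.get(word, oov_index)
            if index is not None:
                sequence.append(index)
        encoded.append(sequence)
    return encoded

# Word index copied from the printed output above.
demo_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
demo_test = ['i really love my dog', 'my dog loves my ']

dropped = encode(demo_test, demo_index)
with_oov = encode(demo_test, demo_index, oov_token='<OOV>')
print(dropped)   # [[5, 3, 2, 4], [2, 4, 2]]
print(with_oov)  # [[5, 1, 3, 2, 4], [2, 4, 1, 2]]
```

With the fallback in place, each encoded sequence has the same number of entries as the sentence has words.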


### Notice the change in the test sequence after adding the OOV token


### Before

> [[4, 2, 1, 3], [1, 3, 1]]

>`{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}`


### After 
> [[5, 1, 3, 2, 4], [2, 4, 1, 2]]

> `{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}`




Sentences produce sequences of different lengths. Padding brings them all to the same length.

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [0]:
padded = pad_sequences(sequences)

In [0]:
print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [0]:
print(sequences)

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


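Padding makes every sequence the same length so that they can be used as model inputs. By default, zeros are added at the front, up to the length of the longest sequence.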

In [0]:
print(padded)

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


### Other Parameters for Padding 

In [0]:
pad_sequences(sequences, maxlen=10)

array([[ 0,  0,  0,  0,  0,  0,  5,  3,  2,  4],
       [ 0,  0,  0,  0,  0,  0,  5,  3,  2,  7],
       [ 0,  0,  0,  0,  0,  0,  6,  3,  2,  4],
       [ 0,  0,  0,  8,  6,  9,  2,  4, 10, 11]], dtype=int32)

In [0]:
pad_sequences(sequences, maxlen=3)

array([[ 3,  2,  4],
       [ 3,  2,  7],
       [ 3,  2,  4],
       [ 4, 10, 11]], dtype=int32)

In [0]:
pad_sequences(sequences, padding='post')

array([[ 5,  3,  2,  4,  0,  0,  0],
       [ 5,  3,  2,  7,  0,  0,  0],
       [ 6,  3,  2,  4,  0,  0,  0],
       [ 8,  6,  9,  2,  4, 10, 11]], dtype=int32)

In [0]:
pad_sequences(sequences, truncating='pre', maxlen=5)

array([[ 0,  5,  3,  2,  4],
       [ 0,  5,  3,  2,  7],
       [ 0,  6,  3,  2,  4],
       [ 9,  2,  4, 10, 11]], dtype=int32)
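
To see how `maxlen`, `padding`, and `truncating` interact, here is a rough pure-Python sketch of the same behaviour. It returns plain lists rather than a NumPy array and is not the Keras implementation. The helper `pad` and the variable `demo_sequences` are hypothetical names, and the sequences are copied from the output above.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' removes items from the front, 'post' from the end.
            seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the filler before the data, 'post' after it.
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result

demo_sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(pad(demo_sequences, maxlen=3))
# [[3, 2, 4], [3, 2, 7], [3, 2, 4], [4, 10, 11]]
print(pad(demo_sequences, maxlen=5, truncating='post'))
# [[0, 5, 3, 2, 4], [0, 5, 3, 2, 7], [0, 6, 3, 2, 4], [8, 6, 9, 2, 4]]
```

Truncating from the front keeps the end of a long sentence, while `truncating='post'` keeps its beginning.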

# Training a model to recognize sentiment in text