# Google ML - Natural Language Processing
## Tokenization and Sequence generation

## Tokenization

Tokenization generally involves breaking a sentence down into its constituent words.

TensorFlow's Tokenizer takes this a step further: it maps each word to an integer, so that text can be fed to a neural network being trained for NLP.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer


sentences = [
    'i know basic programming',
    'I know Matlab programming.',
    'You know python programming!',
    'Do you love the python programming simplicity?',
    'I expertly know python programming',
    'Programming in python beats programming in Malbolge.'
]

In [None]:
train_data = sentences[:3]

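# num_words=100: only the 99 most frequent words are kept when texts are converted to sequences
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(train_data)
# word_index ranks words by frequency: the most frequent word gets index 1
word_index = tokenizer.word_index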
print(word_index)

{'know': 1, 'programming': 2, 'i': 3, 'basic': 4, 'matlab': 5, 'you': 6, 'python': 7}
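
The frequency ranking in the index printed above can be reproduced approximately in plain Python. This is a minimal sketch for illustration, not the Keras implementation: the helper name `build_word_index` and the regular expression used to split words are assumptions, and the real Tokenizer uses a configurable filter string instead.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits, or apostrophes.
        # (Assumed simplification of the Tokenizer's punctuation filter.)
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


train_data = [
    'i know basic programming',
    'I know Matlab programming.',
    'You know python programming!',
]
word_index = build_word_index(train_data)
print(word_index)
# {'know': 1, 'programming': 2, 'i': 3, 'basic': 4, 'matlab': 5, 'you': 6, 'python': 7}
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.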


## Sequence Generation

Sequence generation converts each sentence into the list of integer tokens that stand for its words.

In [None]:
train_data = sentences[:4]

tokenizer = Tokenizer(num_words=100)
# tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(train_data)
word_index = tokenizer.word_index
print(word_index)

{'programming': 1, 'know': 2, 'i': 3, 'you': 4, 'python': 5, 'basic': 6, 'matlab': 7, 'do': 8, 'love': 9, 'the': 10, 'simplicity': 11}


In [None]:
# After generating the tokens for the individual words in the sentences,
# we represent each sentence as a sequence of tokens.

sequences = tokenizer.texts_to_sequences(train_data)
print(train_data)
print(sequences)

['i know basic programming', 'I know Matlab programming.', 'You know python programming!', 'Do you love the python programming simplicity?']
[[3, 2, 6, 1], [3, 2, 7, 1], [4, 2, 5, 1], [8, 4, 9, 10, 5, 1, 11]]
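
The conversion from sentences to token sequences amounts to a dictionary lookup per word. The sketch below shows the idea in plain Python, reusing the word index printed earlier; the function name `to_sequences` and the word-splitting regular expression are assumptions made for illustration and are not part of the Keras API.

```python
import re


def to_sequences(texts, word_index):
    """Replace every known word with its index, keeping word order."""
    sequences = []
    for text in texts:
        # Same assumed normalization as fitting: lowercase, drop punctuation.
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Words missing from word_index are left out of the sequence.
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


word_index = {'programming': 1, 'know': 2, 'i': 3, 'you': 4, 'python': 5,
              'basic': 6, 'matlab': 7, 'do': 8, 'love': 9, 'the': 10,
              'simplicity': 11}

seqs = to_sequences(['I know Matlab programming.',
                     'Do you love the python programming simplicity?'],
                    word_index)
print(seqs)
# [[3, 2, 7, 1], [8, 4, 9, 10, 5, 1, 11]]
```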


In [None]:
# Inputs to an ANN must all have the same size.
# For variable-length inputs, either:
#   pad every sequence to the length of the longest input, or
#   pad/truncate every sequence to a fixed length N

from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# General form:
#   pad_sequences(sequences, maxlen=N, padding='pre' or 'post',
#                 truncating='pre' or 'post')
# Defaults: zeros are added at the front, up to the length of the longest sequence.
padded = pad_sequences(sequences)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)


Sequences =  [[3, 2, 6, 1], [3, 2, 7, 1], [4, 2, 5, 1], [8, 4, 9, 10, 5, 1, 11]]

Padded Sequences:
[[ 0  0  0  3  2  6  1]
 [ 0  0  0  3  2  7  1]
 [ 0  0  0  4  2  5  1]
 [ 8  4  9 10  5  1 11]]
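
To make the effect of `maxlen`, `padding`, and `truncating` concrete, here is a plain-Python sketch of the padding logic. The function name `pad_sketch` is hypothetical, and it returns nested lists rather than the NumPy array that `pad_sequences` produces; it is meant only to show what each option does.

```python
def pad_sketch(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Default target: the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the front, 'post' from the end.
            seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the filler before the tokens, 'post' after them.
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result


sequences = [[3, 2, 6, 1], [8, 4, 9, 10, 5, 1, 11]]

default_pad = pad_sketch(sequences)
print(default_pad)
# [[0, 0, 0, 3, 2, 6, 1], [8, 4, 9, 10, 5, 1, 11]]

post_pad = pad_sketch(sequences, maxlen=5, padding='post', truncating='post')
print(post_pad)
# [[3, 2, 6, 1, 0], [8, 4, 9, 10, 5]]

pre_trunc = pad_sketch(sequences, maxlen=5)
print(pre_trunc)
# [[0, 3, 2, 6, 1], [9, 10, 5, 1, 11]]
```

Note that truncation discards tokens, so a `maxlen` shorter than typical sentences loses information from whichever end `truncating` selects.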


In [None]:
# Test the tokenizer with new data
test_data = sentences[4:]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nWord Index = ",word_index)
print("Test Sentences = ", test_data)
print("Test Sequences = ", test_seq)

padded = pad_sequences(test_seq)
print("\nPadded Test Sequence: ")
print(padded)

## The resulting sequences are distorted because the tokenizer has no tokens
## for words it was not fitted on; such words are skipped without any trace.
## To keep a placeholder for them, set the out-of-vocabulary (OOV) token and fit the tokenizer again.


Word Index =  {'programming': 1, 'know': 2, 'i': 3, 'you': 4, 'python': 5, 'basic': 6, 'matlab': 7, 'do': 8, 'love': 9, 'the': 10, 'simplicity': 11}
Test Sentences =  ['I expertly know python programming', 'Programming in python beats programming in Malbolge.']
Test Sequences =  [[3, 2, 5, 1], [1, 5, 1]]

Padded Test Sequence: 
[[3 2 5 1]
 [0 1 5 1]]
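
The effect of an out-of-vocabulary token can be sketched in plain Python. The code below is an illustration under stated assumptions, not the Keras implementation: the helper names and the word-splitting regular expression are hypothetical. It does follow the Keras convention of giving the OOV token index 1, which shifts every other word's index up by one.

```python
import re
from collections import Counter

OOV = "<OOV>"


def split_words(text):
    # Assumed normalization: lowercase, keep letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(texts, oov_token=OOV):
    counts = Counter(w for t in texts for w in split_words(t))
    # The OOV token comes first, then words by descending frequency.
    vocab = [oov_token] + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(vocab, start=1)}


def to_sequences(texts, word_index, oov_token=OOV):
    oov_id = word_index[oov_token]
    # Unknown words map to the OOV index instead of being dropped.
    return [[word_index.get(w, oov_id) for w in split_words(t)] for t in texts]


train_data = [
    'i know basic programming',
    'I know Matlab programming.',
    'You know python programming!',
    'Do you love the python programming simplicity?',
]
test_data = [
    'I expertly know python programming',
    'Programming in python beats programming in Malbolge.',
]

word_index = build_word_index(train_data)
test_seq = to_sequences(test_data, word_index)
print(test_seq)
# [[4, 1, 3, 6, 2], [2, 1, 6, 1, 2, 1, 1]]
```

With the placeholder in place, each test sequence keeps the same length as its sentence, so the position of every known word is preserved.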
