## Text Preparation
- Text must be cleaned and vectorized before it is used to train a neural network.
- Instead of creating a table of word counts, we create a table of sequences, each containing tokens that represent individual words.
- Tokens are indices into a dictionary (vocabulary) built from the corpus of words in the dataset.
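
To make the idea of tokens as vocabulary indices concrete, here is a minimal pure-Python sketch. It assumes a simple lowercase-and-split-on-whitespace tokenization and gives index 1 to the most frequent word, leaving 0 free for padding. This is an illustration only, not the Keras implementation, and `build_vocab` and `encode` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(lines):
    """Map each word to an integer index, most frequent word first (index 1)."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    # Indices start at 1 so that 0 stays available as a padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(lines, vocab):
    """Replace every word in every line with its vocabulary index."""
    return [[vocab[word] for word in line.lower().split()] for line in lines]

corpus = [
    'I love machine learning',
    'Deep learning is a branch of machine learning',
]
vocab = build_vocab(corpus)
encoded = encode(corpus, vocab)
print(vocab)
print(encoded)
```

Because indices are assigned by frequency, the same word can receive a different index when the corpus changes, so the vocabulary should be built once on the training text and reused for any new text.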

In [1]:
# example of sequence creation from 4 lines using Keras tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer

lines = [
    'I love machine learning',
    'Deep learning is a branch of machine learning',
    'Natural language processing is an exciting field',
    'I enjoy learning new things about AI'
]
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [3]:
sequences

[[2, 5, 3, 1],
 [6, 1, 4, 7, 8, 9, 3, 1],
 [10, 11, 12, 4, 13, 14, 15],
 [2, 16, 1, 17, 18, 19, 20]]

In [4]:
words = tokenizer.sequences_to_texts(sequences)
words

['i love machine learning',
 'deep learning is a branch of machine learning',
 'natural language processing is an exciting field',
 'i enjoy learning new things about ai']

In [5]:
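# one-time setup: download the NLTK resources used in the next cell
# import nltk

# nltk.download('stopwords')
# nltk.download('punkt')  # needed by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead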

In [6]:
# remove stop words, numbers, and punctuation from the text
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

lines = [
    'I love machine learning',
    'Deep learning is a branch of machine learning $$$',
    'Natural language processing is an exciting field',
    'I enjoy learning new things about AI'
]

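# build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    # lowercase and tokenize, then keep only alphabetic tokens that are not stop words
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return ' '.join(tokens)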

lines = list(map(remove_stop_words, lines))

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
tokenizer.texts_to_sequences(lines)

[[3, 2, 1], [4, 1, 5, 2, 1], [6, 7, 8, 9, 10], [11, 1, 12, 13, 14]]

In [7]:
# neural networks expect all sequences to be the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, maxlen=8, padding='post')
padded_sequences

array([[ 2,  5,  3,  1,  0,  0,  0,  0],
       [ 6,  1,  4,  7,  8,  9,  3,  1],
       [10, 11, 12,  4, 13, 14, 15,  0],
       [ 2, 16,  1, 17, 18, 19, 20,  0]], dtype=int32)
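
The padding step can also be sketched without Keras to show what happens to short and long sequences. This is a simplified stand-in under stated assumptions, not the library code: `pad_post` is a hypothetical helper that pads on the right with 0 and, for sequences longer than `maxlen`, drops tokens from the front, which matches the default `truncating='pre'` behavior of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence to maxlen; longer sequences keep their last maxlen tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front if the sequence is too long
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the tail with the pad value
    return padded

rows = pad_post([[2, 5, 3, 1], [6, 1, 4, 7, 8, 9, 3, 1]], maxlen=6)
print(rows)
```

With `maxlen=6` the first sequence gains two trailing zeros, while the second loses its first two tokens. Choosing `maxlen` equal to the longest sequence, as in the cell above, avoids any truncation.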