 Keras Tokenizer

In [16]:
data = [
    "Tomorrow I will visit the hospital.",
    "Yesterday I took a flight to Athens.",
    "Sally visited Harry and his dog."
]

In [17]:
# Tokenize each sentence into words
import spacy
nlp = spacy.load("en_core_web_md")

sentences = [[token.text for token in nlp(sentence)] for sentence in data]
for sentence in sentences:
    print(sentence)
    

['Tomorrow', 'I', 'will', 'visit', 'the', 'hospital', '.']
['Yesterday', 'I', 'took', 'a', 'flight', 'to', 'Athens', '.']
['Sally', 'visited', 'Harry', 'and', 'his', 'dog', '.']


In [18]:
# Keras text preprocessing: the Tokenizer class turns word sequences into word-ID sequences
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(data)
tokenizer

<keras_preprocessing.text.Tokenizer at 0x250a8b7f4f0>

In [19]:
tokenizer.word_index

{'i': 1,
 'tomorrow': 2,
 'will': 3,
 'visit': 4,
 'the': 5,
 'hospital': 6,
 'yesterday': 7,
 'took': 8,
 'a': 9,
 'flight': 10,
 'to': 11,
 'athens': 12,
 'sally': 13,
 'visited': 14,
 'harry': 15,
 'and': 16,
 'his': 17,
 'dog': 18}
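
The IDs above follow word frequency: 'i' appears twice, so it gets ID 1, and the remaining words keep their order of first appearance. A minimal pure-Python sketch of that ranking is given here, assuming lowercasing and punctuation stripping roughly like the Tokenizer defaults. `build_word_index` and `toy_data` are hypothetical names for illustration, not part of Keras.

```python
from collections import Counter
import string

def build_word_index(texts):
    """Rank words by frequency; ties keep first-seen order. IDs start at 1."""
    # Simplification: replace every ASCII punctuation character with a space.
    strip_punct = str.maketrans({ch: " " for ch in string.punctuation})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # sorted() is stable, so equally frequent words stay in first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

toy_data = [
    "Tomorrow I will visit the hospital.",
    "Yesterday I took a flight to Athens.",
    "Sally visited Harry and his dog.",
]
word_index = build_word_index(toy_data)
print(word_index["i"], word_index["tomorrow"], word_index["dog"])  # 1 2 18
```

ID 0 is never assigned to a word, which is why it is free to serve as the padding value later on.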

In [21]:
tokenizer.texts_to_sequences(["hospital"])

[[6]]

In [22]:
tokenizer.texts_to_sequences(["hospital", "took"])

[[6], [8]]
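
Both words above are in the vocabulary. As far as we know, words missing from word_index are silently dropped by texts_to_sequences unless the Tokenizer was created with an oov_token. The sketch below imitates that behavior in plain Python; `texts_to_ids`, `toy_index`, and the choice of `oov_id=1` are assumptions made for illustration only.

```python
def texts_to_ids(text, word_index, oov_id=None):
    """Map a whitespace-separated text to word IDs.

    Unknown words are skipped unless oov_id is given, in which case
    they are replaced by that ID.
    """
    ids = []
    for word in text.lower().split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_id is not None:
            ids.append(oov_id)
    return ids

toy_index = {"the": 5, "hospital": 6, "took": 8}
print(texts_to_ids("the hospital zoo", toy_index))            # [5, 6]
print(texts_to_ids("the hospital zoo", toy_index, oov_id=1))  # [5, 6, 1]
```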

In [26]:
tokenizer.sequences_to_texts([[3,2,1]])

['will tomorrow i']

In [27]:
tokenizer.sequences_to_texts([[3,2,1], [5,6,10]])

['will tomorrow i', 'the hospital flight']

In [28]:
# Pad sequences with 0 using Keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[7], [8,1], [9,11,12,14]]
MAX_LEN=4
pad_sequences(sequences, MAX_LEN, padding="post")

array([[ 7,  0,  0,  0],
       [ 8,  1,  0,  0],
       [ 9, 11, 12, 14]])

In [29]:
pad_sequences(sequences, MAX_LEN, padding="pre")

array([[ 0,  0,  0,  7],
       [ 0,  0,  8,  1],
       [ 9, 11, 12, 14]])
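
To make the "pre" and "post" options concrete, here is a pure-Python sketch of the padding logic. `pad` is a hypothetical helper, not the Keras implementation. It also truncates sequences longer than maxlen, and we assume the Keras default of cutting from the front (truncating="pre").

```python
def pad(seqs, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to exactly maxlen items."""
    padded = []
    for seq in seqs:
        if len(seq) > maxlen:
            # "pre" drops items from the front, "post" from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded

toy_sequences = [[7], [8, 1], [9, 11, 12, 14]]
print(pad(toy_sequences, 4, padding="post"))  # [[7, 0, 0, 0], [8, 1, 0, 0], [9, 11, 12, 14]]
print(pad(toy_sequences, 4, padding="pre"))   # [[0, 0, 0, 7], [0, 0, 8, 1], [9, 11, 12, 14]]
print(pad([[1, 2, 3, 4, 5]], 3))              # [[3, 4, 5]]
```

Truncation matters once MAX_LEN is shorter than the longest text: with the default setting, the start of a long sequence is discarded and only its last MAX_LEN items are kept.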

In [30]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(data)

seqs = tokenizer.texts_to_sequences(data)
MAX_LEN = 7

padded_seqs = pad_sequences(seqs, MAX_LEN, padding="post")
padded_seqs

array([[ 2,  1,  3,  4,  5,  6,  0],
       [ 7,  1,  8,  9, 10, 11, 12],
       [13, 14, 15, 16, 17, 18,  0]])

    Embedding words
    1. We broke each sentence into words and built a vocabulary with Keras' Tokenizer.
    2. The Tokenizer object holds a word index, which is a word -> word-ID mapping.
    3. With a word ID, we can look up the matching row of the embedding table to get that word's vector.
    4. Finally, we feed these word vectors to the neural network.
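
The lookup in step 3 is just row indexing into a matrix. A small NumPy sketch follows, assuming a randomly initialized table with 19 rows (18 words plus row 0 for padding) and 4-dimensional vectors. The sizes and the names `embedding_table` and `sentence_ids` are made up for illustration; in the real model the table is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 19     # 18 words + row 0, which is reserved for padding
embedding_dim = 4   # toy size; real models use far more dimensions
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Word IDs of "tomorrow i will visit the hospital", padded with one 0.
sentence_ids = np.array([2, 1, 3, 4, 5, 6, 0])

# Indexing with an array of IDs picks one row (word vector) per ID.
sentence_vectors = embedding_table[sentence_ids]
print(sentence_vectors.shape)  # (7, 4)
```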

    Neural network architecture for text classification
    1. First we preprocess, tokenize, and pad the review sentences, which gives us a list of word-ID sequences.
    2. We feed this list to the neural network through the input layer.
    3. We vectorize each word by looking up its word ID in the embedding layer. At this point a sentence is a sequence of word vectors, one vector per word.
    4. Next, we feed the sequence of word vectors to the LSTM.
    5. Finally, we squash the LSTM output with a sigmoid layer to obtain the class probability.
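
The last step can be sketched in NumPy: a dense layer reduces each 256-dimensional LSTM output to one number, and the sigmoid maps that number to a probability. The random `lstm_output` and `weights` below are stand-ins, not values from a trained model, and the 0.5 decision threshold is a common convention that we assume here.

```python
import numpy as np

def sigmoid(z):
    """S-shaped function that maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

lstm_output = rng.normal(size=(3, 256))         # stand-in: one 256-dim vector per review
weights = rng.normal(scale=0.1, size=(256, 1))  # stand-in for the dense layer's weights
bias = 0.0

probabilities = sigmoid(lstm_output @ weights + bias)  # shape (3, 1)
predicted_labels = (probabilities >= 0.5).astype(int)  # assumed threshold of 0.5

print(probabilities.shape)
print(predicted_labels.ravel())
```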

In [31]:
# Import real-world data (Amazon customers' food reviews)
import pandas as pd

reviews_df = pd.read_csv("data/reviews.csv")
reviews_df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,50057,B000ER5DFQ,A1ESDLEDR9Y0JX,A. Spencer,1,2,1,1310256000,the garbanzo beans in it give horrible gas,To be fair only one of my twins got gas from t...
1,366917,B001AIQP8M,A324KM3YY1DWQG,danitrice,0,0,5,1251072000,Yummy Lil' Treasures!!,Just recieved our first order of these (they d...
2,214380,B001E5E1XW,A3QCWO53N69HW3,"M. A. Vaughan ""-_-GOBNOGO-_-""",2,2,5,1276473600,Great Chai,This is a fantastic Chai Masala. I am very pic...
3,178476,B000TIZP5I,AYZ5NG9705AG1,Consumer,0,0,5,1341360000,Celtic Salt worth extra price,Flavorful and has added nutrition! You use le...
4,542504,B000E18CVE,A2LMWCJUF5HZ4Z,"Miki Lam ""mikilam""",8,11,3,1222732800,mixed feelings,I thought this soup tasted good. I liked the t...


In [35]:
reviews_df = reviews_df[['Text', 'Score']].dropna()

In [36]:
reviews_df.head()

Unnamed: 0,Text,Score
0,To be fair only one of my twins got gas from t...,1
1,Just recieved our first order of these (they d...,5
2,This is a fantastic Chai Masala. I am very pic...,5
3,Flavorful and has added nutrition! You use le...,5
4,I thought this soup tasted good. I liked the t...,3


In [37]:
# Binarize the ratings: 1-3 stars -> 0 (negative), 4-5 stars -> 1 (positive).
# Use .loc instead of chained indexing (reviews_df.Score[mask] = ...),
# which raises pandas' SettingWithCopyWarning.
reviews_df.loc[reviews_df.Score <= 3, 'Score'] = 0
reviews_df.loc[reviews_df.Score >= 4, 'Score'] = 1


In [38]:
reviews_df.head()

Unnamed: 0,Text,Score
0,To be fair only one of my twins got gas from t...,0
1,Just recieved our first order of these (they d...,1
2,This is a fantastic Chai Masala. I am very pic...,1
3,Flavorful and has added nutrition! You use le...,1
4,I thought this soup tasted good. I liked the t...,0


In [39]:
train_examples = []
labels = []

for index, row in reviews_df.iterrows():
    text = row['Text']
    rating = row['Score']
    labels.append(rating)
    tokens = [token.text for token in nlp(text)]
    train_examples.append(tokens)

In [44]:
# Data and vocabulary preparation
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(train_examples)

sequences = tokenizer.texts_to_sequences(train_examples)

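MAX_LEN = 50
# Reviews longer than MAX_LEN are truncated; shorter ones are padded with 0 at the end
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")

# pad_sequences already returns a NumPy array; only the labels need converting
y = np.array(labels)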

In [54]:
# The data is ready; now we build the network, starting with the input layer
# Import what we need
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras import optimizers

# The input layer
sentence_input = Input(shape=(None,))

# The embedding layer
# input_dim = number of words in the vocabulary, +1 because word IDs start at 1 (0 is reserved for padding)
# output_dim = 100-dimensional word vectors; popular sizes are 50, 100, and 200
embedding = Embedding(input_dim = len(tokenizer.word_index)+1,
                     output_dim = 100)(sentence_input)

#The LSTM layer
# units is the dimension of the hidden state (the LSTM's hidden state and output have the same shape)
LSTM_layer = LSTM(units=256)(embedding)

#The output layer
# Squash the 256-dim vector from the LSTM down to a single value
# The sigmoid is an S-shaped function that maps its input to a value between 0 and 1
output_dense = Dense(1, activation='sigmoid')(LSTM_layer)

In [55]:
# Build the model from its input and output tensors
model = Model(inputs=[sentence_input],outputs=[output_dense])


In [56]:
# Adam (adaptive moment estimation) is a popular optimizer in deep learning
# Binary cross-entropy is the loss used in binary classification
# Metrics are used to evaluate the performance of the model

model.compile(optimizer="adam", loss="binary_crossentropy",
             metrics=["accuracy"])

In [57]:
# Fit the model; validation_split holds out 20% of the data for evaluation
model.fit(x=X, 
          y=y, 
          batch_size=64, 
          epochs=5, 
          validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x250b7ad65b0>
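
A note on validation_split: to our knowledge, Keras takes the validation samples from the end of the arrays before any shuffling, so the row order of the data matters. The NumPy sketch below mimics that split; `split_last_fraction` and the toy arrays are hypothetical, and the rounding may differ slightly from what Keras does internally.

```python
import numpy as np

def split_last_fraction(X, y, validation_split=0.2):
    """Hold out the last fraction of the samples as validation data."""
    n_val = int(len(X) * validation_split)
    split_at = len(X) - n_val
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])

X_toy = np.arange(20).reshape(10, 2)  # 10 samples with 2 features each
y_toy = np.arange(10) % 2             # alternating 0/1 labels

(X_train, y_train), (X_val, y_val) = split_last_fraction(X_toy, y_toy)
print(len(X_train), len(X_val))  # 8 2
```

If the rows of the CSV are sorted in some way (by product or by date, for example), shuffling X and y together before calling fit gives a more representative validation set.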