
## Splitting words with <i>text_to_word_sequence</i>

Individual words are called tokens, and the process of splitting text into tokens is called tokenization.

In [1]:
from keras.preprocessing.text import text_to_word_sequence

In [7]:
text = "The quick brown fox jumped over the lazy dog"

# tokenize the document
results = text_to_word_sequence(text)
print(results)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']


## Note:
By default, <b>text_to_word_sequence</b> does three things:

1. Splits the text into words on spaces
2. Filters out punctuation
3. Converts the text to lowercase (when lower=True, which is the default)
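
The three steps above can be approximated in plain Python. This is only a sketch, not the Keras source: `simple_word_sequence` is a hypothetical helper, and the `FILTERS` string is assumed to match the Keras default filter set, so check it against your installed version.

```python
# Assumed default filter characters (punctuation, tab, newline).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Hypothetical stand-in that approximates text_to_word_sequence."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Split on the separator and drop empty strings left by repeated separators.
    return [tok for tok in text.split(split) if tok]

tokens = simple_word_sequence("The quick, brown fox!")
print(tokens)
```

Punctuation is turned into separators rather than deleted, so a word followed by a comma still comes out as a clean token.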



## Encoding with one hot
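Keras provides the one_hot() function, which you can use to <b>tokenize</b> and <b>integer encode a text document</b> in one step. Despite its name, it hashes each word into a fixed range of integers rather than producing one-hot vectors, so different words can receive the same integer.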

In [6]:
text = "The quick brown fox jumped over the lazy dog"

# tokenize the doc and keep only the unique words
words = set(text_to_word_sequence(text))
vocab_size = len(words)

print(vocab_size)

8


In [8]:
# import one_hot from keras
from keras.preprocessing.text import one_hot

In [17]:
# print the vocabulary size
print("Printing the word size from text: "+ str(vocab_size))

# integer encode the document, using a hash space 1.3x the vocabulary size
result = one_hot(text, round(vocab_size*1.3))
print(result)

Printing the word size from text: 8
[9, 6, 9, 2, 9, 6, 9, 8, 4]
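
The output above contains repeated integers for different words, which is what a hash collision looks like. A small sketch that groups the tokens by their encoded value makes this visible. The `words` and `encoded` lists are copied from the outputs shown earlier; the values from one_hot() may differ on another run or machine.

```python
from collections import defaultdict

# Tokens and encoded integers copied from the notebook outputs above.
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
encoded = [9, 6, 9, 2, 9, 6, 9, 8, 4]

# Group the distinct words that landed on each integer.
buckets = defaultdict(set)
for word, index in zip(words, encoded):
    buckets[index].add(word)

# Keep only the integers shared by more than one distinct word.
collisions = {index: sorted(ws) for index, ws in buckets.items() if len(ws) > 1}
print(collisions)
```

In this run, three words share the integer 9 and two share the integer 6, so the encoding cannot be mapped back to the original words.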


## Hash Encoding with hashing trick

It is similar to one_hot(), but it gives more flexibility: it also allows you to specify your own hash function.

In [10]:
# import hashing_trick from keras
from keras.preprocessing.text import hashing_trick

In [18]:
# print the vocabulary size
print("Printing the word size from text: "+ str(vocab_size))

# integer encode the document using the md5 hash function
result = hashing_trick(text, round(vocab_size*1.3), hash_function="md5")
print(result)

Printing the word size from text: 8
[6, 4, 1, 2, 7, 5, 6, 2, 6]
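
To see roughly what the hashing trick does with md5, the idea can be sketched with the standard library. `md5_bucket` is a hypothetical helper, and the mapping `hash % (n - 1) + 1` is an assumption about how the digest is folded into the range 1 to n-1 (index 0 is left unused); treat it as an illustration rather than the library's exact code.

```python
import hashlib

def md5_bucket(word, n):
    """Hypothetical helper: map a word to an integer in [1, n - 1] via md5."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    # Assumed mapping: fold the large hash value into 1..n-1, keeping 0 free.
    return int(digest, 16) % (n - 1) + 1

words = "the quick brown fox jumped over the lazy dog".split()
n = 10  # hash space, i.e. round(8 * 1.3)
encoded = [md5_bucket(w, n) for w in words]
print(encoded)
```

Because md5 is deterministic, the same word always maps to the same integer across runs, but two different words can still collide when the hash space is this small.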


## Tokenizer API
The Tokenizer can be fit once and then reused to prepare multiple text documents.

The tokenizer must be constructed and then fit on either raw text documents or integer-encoded text documents.

In [19]:
from keras.preprocessing.text import Tokenizer

In [22]:
# Defining doc
doc = ['The quick',
 'quick brown',
 'brown fox',
 'fox jumps',
 'jumps the',
 'the lazy',
 'lazy dog']

# create the tokenizer
t = Tokenizer()

# fit the tokenizer on the documents
t.fit_on_texts(doc)

Once fit, the tokenizer provides four attributes that you can use to query what it has learned about the documents:

1. <b>word_counts:</b> A dictionary of words and their counts
2. <b>word_docs:</b> A dictionary of words and how many documents each appeared in
3. <b>word_index:</b> A dictionary of words and their uniquely assigned integers
4. <b>document_count:</b> The total number of documents that were used to fit the tokenizer



In [26]:
# inspect what the tokenizer learned

print(t.word_counts)
print(t.word_docs)
print(t.word_index)
print(t.document_count)

OrderedDict([('the', 3), ('quick', 2), ('brown', 2), ('fox', 2), ('jumps', 2), ('lazy', 2), ('dog', 1)])
defaultdict(<class 'int'>, {'quick': 2, 'the': 3, 'brown': 2, 'fox': 2, 'jumps': 2, 'lazy': 2, 'dog': 1})
{'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'lazy': 6, 'dog': 7}
7
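
The four values printed above can be reproduced with plain Python, which may help clarify what each one means. This is a sketch under simplifying assumptions: tokens are obtained with `lower()` and `split()` only, and word indices are assigned by descending frequency with ties kept in first-seen order, which matches the output shown here.

```python
from collections import Counter

docs = ['The quick', 'quick brown', 'brown fox', 'fox jumps',
        'jumps the', 'the lazy', 'lazy dog']

word_counts = Counter()  # word -> total number of occurrences
word_docs = Counter()    # word -> number of documents containing it
document_count = 0       # number of documents seen

for d in docs:
    tokens = d.lower().split()
    word_counts.update(tokens)
    word_docs.update(set(tokens))  # count each word once per document
    document_count += 1

# The most frequent word gets index 1; the stable sort keeps ties in first-seen order.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

print(dict(word_counts))
print(dict(word_docs))
print(word_index)
print(document_count)
```

Index 0 is never assigned to a word, which is why the encoded matrices have one more column than there are words.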


In [27]:
# encode each document as a vector of word counts

encode_doc = t.texts_to_matrix(doc, mode="count")
print(encode_doc)

[[0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1.]]
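
Each row of the matrix above is one document and each column is one word index, with column 0 unused because indices start at 1. A minimal sketch of how such a count matrix could be built by hand, assuming the `word_index` mapping printed earlier and simple whitespace tokenization:

```python
# Word index copied from the tokenizer output shown earlier.
word_index = {'the': 1, 'quick': 2, 'brown': 3, 'fox': 4,
              'jumps': 5, 'lazy': 6, 'dog': 7}

docs = ['The quick', 'quick brown', 'brown fox', 'fox jumps',
        'jumps the', 'the lazy', 'lazy dog']

num_cols = len(word_index) + 1  # one extra column for the unused index 0

matrix = []
for d in docs:
    row = [0.0] * num_cols
    for tok in d.lower().split():
        row[word_index[tok]] += 1.0  # count occurrences of each word
    matrix.append(row)

for row in matrix:
    print(row)
```

The first row has ones in columns 1 and 2 because the first document contains "the" and "quick".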
