# Text Pre-processing in TensorFlow

TensorFlow supports tokenization and vectorization of text data, similar to scikit-learn, which can come in handy. The best-known tool is the **`Tokenizer`** class, which accepts a list of strings, splits each string into tokens, and maps every token to an integer.

Also worth noting is the **`pad_sequences`** function, which pads sequences of integers so that they all have the same length and can also truncate sequences that are too long.



The dataset you will be using, **Text8**, is a useful benchmark dataset for NLP tasks listed on *NLP-progress*. It was prepared by taking the 'XML dump' (archived Wikipedia articles) from March 2006 and retaining only the text that you would have read on a Wikipedia page. The dataset is usually treated as one long sequence and consists only of lowercase English characters and spaces.

To download the dataset, you can use the Gensim module **`gensim.downloader`**, an API for downloading datasets for language modeling exercises.

If you’re interested in the exact processing of the 'XML dump', the details are available online.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

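In [13]:
# Optional: import the Gensim downloader (not needed for the examples below)

#import gensim.downloader as api

In [14]:
# Optional: download and load the text8 dataset (not needed for the examples below)

#dataset = api.load("text8")

In [15]:
# Optional: inspect the type of the loaded dataset

#type(dataset)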

**Let's look at a simple example of using the `Tokenizer`:**

In [6]:
sentences = ["I like eggs and ham.", 
             "I love chocolate and bunnies.", 
             "I hate onions."]

In [7]:
# Cap the vocabulary at the most frequent words; 20,000 is a commonly used limit
max_vocab = 20000

tokenizer = Tokenizer(num_words=max_vocab)

# Fit the tokenizer on sentences
tokenizer.fit_on_texts(sentences)

In [8]:
# Transform sentences

sequences = tokenizer.texts_to_sequences(sentences)

In [9]:
# Integers start from 1, not 0 (0 is reserved for padding)

print(sequences)

[[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]


In [10]:
# Word-to-index mapping learned by the tokenizer

tokenizer.word_index

{'i': 1,
 'and': 2,
 'like': 3,
 'eggs': 4,
 'ham': 5,
 'love': 6,
 'chocolate': 7,
 'bunnies': 8,
 'hate': 9,
 'onions': 10}
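
The mapping above is consistent with a frequency-ranked vocabulary: `i` appears three times, `and` twice, and the remaining words keep the order in which they were first seen. The following is a minimal pure-Python sketch of that idea, not the `Tokenizer` implementation itself. The helpers `simple_tokenize`, `build_word_index` and `to_sequences` are hypothetical names introduced here, and the text cleaning (lowercasing, replacing everything in `string.punctuation` with spaces) is a simplifying assumption. The sketch also illustrates how a `num_words` cap drops any word whose index is not below the cap.

```python
import string
from collections import Counter

sentences = ["I like eggs and ham.",
             "I love chocolate and bunnies.",
             "I hate onions."]


def simple_tokenize(text):
    # Assumption: lowercase, replace punctuation with spaces, split on whitespace
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


def build_word_index(texts):
    # Count every token across all texts (Counter keeps first-seen order)
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # Most frequent first; the sort is stable, so ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Indices start at 1 so that 0 stays free for padding
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words=None):
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in simple_tokenize(text) if w in word_index]
        if num_words is not None:
            # Keep only indices below the cap, i.e. the most frequent words
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


word_index = build_word_index(sentences)
print(word_index)
print(to_sequences(sentences, word_index))
print(to_sequences(sentences, word_index, num_words=3))
```

With no cap, the sketch reproduces the sequences printed earlier. With `num_words=3`, only `i` and `and` survive, so each sentence shrinks to its most frequent words.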

**Let's look at a simple example using the `pad_sequences` function, which 'pads' out the sequences so that each vector has the same length:**

In [11]:
data = pad_sequences(sequences)

print(data)

[[ 1  3  4  2  5]
 [ 1  6  7  2  8]
 [ 0  0  1  9 10]]


In [12]:
# You can make every sequence longer than the longest input when padding (although there is little point)

data = pad_sequences(sequences, maxlen=6)

print(data)

[[ 0  1  3  4  2  5]
 [ 0  1  6  7  2  8]
 [ 0  0  0  1  9 10]]
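
Both outputs above show zeros added at the front of short sequences, which is the default `padding='pre'` behaviour. `pad_sequences` also accepts `padding='post'` and a `truncating` argument (`'pre'` or `'post'`) that controls which end is cut when a sequence is longer than `maxlen`. The following is a minimal pure-Python sketch of that logic using plain lists instead of NumPy arrays. The function name `pad_sketch` is hypothetical, and the code illustrates the behaviour rather than the library implementation.

```python
def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    # Default target length is the length of the longest sequence
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the end
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the filler before the sequence, 'post' puts it after
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]

default_padded = pad_sketch(sequences)
pre_truncated = pad_sketch(sequences, maxlen=4)
post_padded = pad_sketch(sequences, maxlen=4, padding="post", truncating="post")

print(default_padded)
print(pre_truncated)
print(post_padded)
```

The equivalent library call for the last case would be `pad_sequences(sequences, maxlen=4, padding='post', truncating='post')`. Whether to pad or truncate at the front or the back depends on which end of the sequence matters most to the model.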
