# TensorFlow Tutorials
# Load and Preprocess Data 05 - Text

This tutorial uses `tf.data.TextLineDataset` to load examples from text files. It suits any dataset stored as text files in which each line is one example (e.g. poetry or error logs).

To demonstrate this API, we will use three different English translations of Homer's Iliad and train a model to identify the translator given a single line of text.

## Workspace Setup

In [2]:
from __future__ import print_function, absolute_import, division, unicode_literals

# Select TensorFlow 2.x (the %tensorflow_version magic is Colab-specific)
try:
  %tensorflow_version 2.x
except Exception:
  pass

# Check
import tensorflow as tf
tf.__version__

'2.0.0-rc2'

In [0]:
# tensorflow_datasets provides the text tokenizer and encoder used below
import tensorflow_datasets as tfds

# os is needed to build file paths
import os

In [0]:
# Downloading preprocessed files for this tutorial
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

In [5]:
# Get each file and store locally
for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

# Remember the filepath for the directory where these files are stored
parent_dir = os.path.dirname(text_dir)

# Print this path
parent_dir

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt


'/root/.keras/datasets'

## Loading into `tf.data.Dataset`
Each example needs to be labeled individually, so use `tf.data.Dataset.map` to apply a labeler function to each one. `map` applies the function to every example in the dataset and returns a dataset of tuples of the form `(example, label)`.
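
As a rough illustration of what the labeling step produces, here is a plain-Python stand-in that needs no TensorFlow. The file contents are made-up placeholder lines, and the list-based loop only mimics what `Dataset.map` does lazily over the real files.

```python
# Plain-Python stand-in for `Dataset.map(labeler)`.
# The lines below are placeholders, not text from the real translations.
fake_files = {
    "cowper.txt": ["first line by cowper", "second line by cowper"],
    "derby.txt": ["first line by derby"],
    "butler.txt": ["first line by butler"],
}

def labeler(example, index):
    # Pair one line of text with the integer index of its source file.
    return example, int(index)

labeled = []
for i, (file_name, lines) in enumerate(fake_files.items()):
    # Every line from file number i receives the label i.
    labeled.extend(labeler(line, i) for line in lines)

print(labeled)
```

Each tuple carries the line and the index of the translation it came from, which is the shape the model later trains on.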

In [0]:
def labeler(example, index):
  return example, tf.cast(index, tf.int64) # cast the index as a 64 bit integer

In [0]:
# Empty list to store the labeled datasets
labeled_data_sets = []

# List of file names is transformed into a list of integer-indexed tuples
for i, file_name in enumerate(FILE_NAMES):
  # Read each translation as a TextLineDataset by using `join` to get its filepath
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))

  # Every line in the dataset is passed to the labeler method which returns
  # a tuple with the line and the translation's index
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))

  # Append this dataset of labeled lines to the labeled_data_sets list
  labeled_data_sets.append(labeled_dataset)

In [0]:
# Constants for shuffling, batching, and the size of the test split
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [11]:
labeled_data_sets

[<MapDataset shapes: ((), ()), types: (tf.string, tf.int64)>,
 <MapDataset shapes: ((), ()), types: (tf.string, tf.int64)>,
 <MapDataset shapes: ((), ()), types: (tf.string, tf.int64)>]

In [0]:
# `concatenate` joins one dataset onto another, so start from the first labeled dataset
all_labeled_data = labeled_data_sets[0]

# Then append each of the remaining datasets in turn
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

# Shuffle once and keep that order on every pass, so the train/test split made later stays fixed
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
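
To see why `BUFFER_SIZE` matters, the following is a minimal plain-Python sketch of buffer-based shuffling: fill a fixed-size buffer, then repeatedly emit a random element and replace it with the next input item. The function name `buffered_shuffle` is hypothetical, and this is only an approximation of the documented `shuffle` behavior, not TensorFlow's implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in randomized order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer; nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever remains, in random order.
    rng.shuffle(buffer)
    yield from buffer

print(list(buffered_shuffle(range(10), buffer_size=3, seed=0)))
print(list(buffered_shuffle(range(10), buffer_size=1)))
```

With a buffer of size 1 the order is unchanged, and with a small buffer elements can only move a short distance forward. A buffer at least as large as the dataset is needed for a full shuffle, which matters here because the three translations were concatenated one after another.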

In [14]:
# Check the first 5 elements of the combined dataset to confirm the processing worked
for ex in all_labeled_data.take(5):
  print(ex)

(<tf.Tensor: id=74, shape=(), dtype=string, numpy=b'There on the topmost height of Gargarus,'>, <tf.Tensor: id=75, shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: id=76, shape=(), dtype=string, numpy=b'Thy joy is ever such, from me apart'>, <tf.Tensor: id=77, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=78, shape=(), dtype=string, numpy=b'gods or men."'>, <tf.Tensor: id=79, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=80, shape=(), dtype=string, numpy=b'Responsive sighs, deploring each, in show,'>, <tf.Tensor: id=81, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=82, shape=(), dtype=string, numpy=b'Thou hast thy secret chamber, built for thee'>, <tf.Tensor: id=83, shape=(), dtype=int64, numpy=1>)


## Encoding Text as Numbers
ML models do not work on words. They work on numbers. Words in each line of text need to be converted into lists of numbers. To do this, we can map each word in each line to a number in a word_index dictionary of unique words.

### Build Vocabulary
Tokenize the text into a collection of individual unique words. 
1. Iterate over each example's numpy value.
2. Use `tfds.features.text.Tokenizer` to split it into tokens.
3. Collect tokens into a Python **set** to remove duplicates.
4. Get the size of the vocabulary for later use.
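
The steps above can be sketched in plain Python, without TensorFlow. The `tokenize` helper is a hypothetical stand-in that keeps runs of word characters; the real tokenizer may split text differently. The three sample lines are taken from the dataset preview shown earlier.

```python
import re

def tokenize(line):
    # Hypothetical stand-in for the tokenizer: keep runs of word characters.
    return re.findall(r"\w+", line)

lines = [
    "There on the topmost height of Gargarus,",
    "Thy joy is ever such, from me apart",
    'gods or men."',
]

# A set silently drops tokens it has already seen.
vocabulary_set = set()
for line in lines:
    vocabulary_set.update(tokenize(line))

vocab_size = len(vocabulary_set)
print(vocab_size)
```

Note that this sketch is case-sensitive, so "There" and "the" count as different words.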

In [17]:
tokenizer = tfds.features.text.Tokenizer()

# Ensures no duplicates
vocabulary_set = set()

# Will only be using the text tensor for each line, don't need the label
for text_tensor, _ in all_labeled_data:
  # Tokenize using the built in method
  some_tokens = tokenizer.tokenize(text_tensor.numpy())

  # Add the tokens to the set; duplicates are ignored
  vocabulary_set.update(some_tokens)

# How many unique words across all three translations?
vocab_size = len(vocabulary_set)
vocab_size

17178

### Encode Examples 
Now that the vocabulary is built, we can create an **encoder** that maps each unique word to an integer and transforms each line of text into a list of integers.
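
Conceptually, an encoder of this kind is a dictionary lookup. Below is a minimal plain-Python sketch; the helpers `build_word_index` and `encode`, the sorted ordering, and the single out-of-vocabulary id are all assumptions made for illustration, and the integers it produces will not match those from the real encoder.

```python
import re

def build_word_index(vocabulary):
    # Sort for a reproducible mapping; ids start at 1 so 0 stays free for padding.
    return {word: i for i, word in enumerate(sorted(vocabulary), start=1)}

def encode(line, word_index):
    # Words missing from the vocabulary share one extra id past its end.
    oov_id = len(word_index) + 1
    return [word_index.get(tok, oov_id) for tok in re.findall(r"\w+", line)]

vocabulary = {"There", "on", "the", "topmost", "height", "of", "Gargarus"}
word_index = build_word_index(vocabulary)

encoded = encode("There on the topmost height of Gargarus,", word_index)
print(encoded)
```

Keeping 0 out of the mapping is a design choice that lets 0 be used later as a padding value without colliding with a real word.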

In [0]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

In [20]:
# What does this look like?
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

b'There on the topmost height of Gargarus,'


In [21]:
encoded_example = encoder.encode(example_text)
print(encoded_example)

[26, 14281, 2173, 2559, 1941, 14910, 10095]


In [0]:
# To run the encoder on the entire dataset, wrap it in a tf.py_function (so that .numpy() is available) and pass that to `map`
def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

def encode_map_fn(text, label):
  return tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))

all_encoded_data = all_labeled_data.map(encode_map_fn)

## Train Test Split

In [0]:
# Hold out the first TAKE_SIZE examples for testing and train on the rest.
# Batches are padded because not all lines have the same number of words (or integers).
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([-1], []))

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([-1],[]))

Now `train_data` and `test_data` are no longer collections of tuples (`examples`, `labels`), but rather collections of **batches** of many such examples and labels.
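
To make the padding concrete, here is a small plain-Python sketch of what happens to one batch: every sequence is extended with zeros to the length of the longest sequence in that batch. The helper `pad_batch` and the sample integers are hypothetical.

```python
def pad_batch(sequences, pad_value=0):
    # Pad every sequence on the right up to the longest one in this batch.
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]

# Three encoded lines of different lengths (made-up values).
batch = [[26, 14281, 2173], [7], [5, 9]]
padded = pad_batch(batch)
print(padded)
```

Because the target length is decided per batch, different batches can end up with different widths.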

In [24]:
sample_text, sample_labels = next(iter(test_data))
sample_text[0], sample_labels[0]

(<tf.Tensor: id=99547, shape=(15,), dtype=int64, numpy=
 array([   26, 14281,  2173,  2559,  1941, 14910, 10095,     0,     0,
            0,     0,     0,     0,     0,     0])>,
 <tf.Tensor: id=99551, shape=(), dtype=int64, numpy=1>)

In [0]:
# 0 is now being used for padding so increment vocab size by 1
vocab_size += 1

## Build Model

In [0]:
model = tf.keras.Sequential()

# Embedding layer will transform the integers into dense vectors called word embeddings. 
model.add(tf.keras.layers.Embedding(vocab_size, 64))

# LSTM layer - helps model understand context in which words were used
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

# One or more densely connected layers - in this case, 2 layers with 64 units
for units in [64, 64]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. First argument is the number of labels
model.add(tf.keras.layers.Dense(3, activation='softmax'))

In [0]:
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
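
For intuition about the softmax output and the `sparse_categorical_crossentropy` loss, here is a minimal plain-Python sketch for a single example. It is a simplified illustration written from the standard definitions, not Keras code: softmax turns three raw scores into probabilities, and the loss is the negative log of the probability assigned to the true integer label.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    # `label` is an integer class index, not a one-hot vector.
    return -math.log(probs[label])

# Equal scores for all three translators means the model is guessing uniformly.
probs = softmax([0.0, 0.0, 0.0])
loss = sparse_categorical_crossentropy(1, probs)
print(probs, loss)
```

A model that guesses uniformly among three classes has a loss of ln 3, roughly 1.099, which gives a baseline for judging the loss values reported during training.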

## Train the Model

In [29]:
# Remember, train_data is a dataset of batches of labeled lines
model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7ffb7ecef9e8>

In [31]:
eval_loss, eval_acc = model.evaluate(test_data)
print('\nEval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))


Eval loss: 0.419, Eval accuracy: 0.828
