<a href="https://colab.research.google.com/github/JpChii/ML-Projects/blob/main/Handling_text_data_from_various_sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we'll be walkthrough and learn to dealing with variety of inputs on text data.

Resources:
* https://www.tensorflow.org/tutorials/load_data/text
* https://realpython.com/read-write-files-python/

## Using Keras API

In [150]:
!pip install "tensorflow-text==2.8.*"



In [151]:
# Importing the libraries
import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers, losses, utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

### Example 1: Predict the tag for a stack overflow question

For this example, we'll use a dataset of programming questions from stack overflow to predict the tag for a question. This is an multi-class classification problem

#### Download and explore the dataset

In [152]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

In [153]:
dataset_dir = utils.get_file(origin=data_url,
                             untar=True,
                             cache_dir='stack_overflow',
                             cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz


In [154]:
list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz')]

In [155]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/python'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/javascript')]

In [156]:
!ls /tmp/.keras/train/java | head

0.txt
1000.txt
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
1006.txt
1007.txt
1008.txt


In [157]:
sample_file = train_dir/'java/0.txt'

with open(sample_file) as f:
  print(f.read())

"how to download .msi file in blank i want to download .msi file using blank.  i have tried to download file using following code..printwriter out = null;.fileinputstream filetodownload = null;.bufferedreader bufferedreader = null;.try {.        out = response.getwriter();.        filetodownload = new fileinputstream(download_directory + file_name);.        bufferedreader = new bufferedreader(new inputstreamreader(filetodownload));..        //response.setcontenttype(""application/text"");.        //response.setcontenttype(""application/x-msi"");.        //response.setcontenttype(""application/msi"");.        //response.setcontenttype(""octet-stream"");.        response.setcontenttype(""application/octet-stream"");.        //response.setcontenttype(""application/x-7z-compressed"");.        //response.setcontenttype(""application/zip"");.        response.setheader(""content-disposition"",""attachment; filename="" +file_name );.        response.setcontentlength(filetodownload.available())

#### Load the dataset

Loading the data off disk and prepare it into a suitable format for traiing. We'll use `tf.keras.utils.text_dataset_from_directory` utility to create a `tf.data.Dataset`.

The train directory is in the format `text_dataset_from_directory` API expects.

In [158]:
# Test set is already present, splitting the train dataset into train and validation set
BATCH_SIZE = 32
SEED = 42

raw_train_ds = utils.text_dataset_from_directory(train_dir,
                                                  batch_size=BATCH_SIZE,
                                                  seed=SEED,
                                                  validation_split=0.2,
                                                  subset='training')

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [159]:
# Iteratinf over the dataset to get a idea of the data
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print(f"Question: {text_batch.numpy()[i]}")
    print(f"Label: {label_batch.numpy()[i]}")

Question: b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default cons

In [160]:
# The label names are 0,1,2,3. Checking the class_names property to find corresponding class names.
for i, label in enumerate(raw_train_ds.class_names):
  print(f"Label: {i}, corresponds to: {label}")

Label: 0, corresponds to: csharp
Label: 1, corresponds to: java
Label: 2, corresponds to: javascript
Label: 3, corresponds to: python


Creating a validation set using the remaining 1600 reviws from the training set for validation.

> With `validation_split` and `subset` arguments of `text_dataset_from_directory` make sure to sepcify random seed or pass shuffle-False, so the splits have no overlap.

In [161]:
# Creating a validatio set
raw_val_ds = utils.text_dataset_from_directory(
    directory=train_dir,
    batch_size=BATCH_SIZE,
    seed=SEED,
    subset='validation',
    validation_split=0.2
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.


In [162]:
test_dir = dataset_dir/'test'

# Createing test dataset
raw_test_ds = utils.text_dataset_from_directory(
    directory=test_dir,
    batch_size=BATCH_SIZE
)

Found 8000 files belonging to 4 classes.


Now we've the datasets ready, we'll **`prepare the dataset for training`**

Next steps:

1. *`Standardization`* - preprocessing the text to remove punctuation and html elements to simplify the dataset
2. *`Tokenization`* - Splitting the strings into tokens(based on whitespace or any other delimiter)
3. *`Vecotrization`* - Converting tokens into numbers so they can be fed into a neural network

Let's accompolish these tasks using `tf.keras.layers.TextVectorization` API

To lean about the above three techniques we'll try the below two with TextVectorization:

* First use `binry` vectorization mode to build a bag-of-words model.
* Use `int` mode with a 1D ConvNet.

In [163]:
VOCAB_SIZE = 10000
binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary'
)


setting `output_sequence_length` parameter will cause the layer to pad or truncate sequences to the value of the parameter.

In [164]:
MAX_SEQUENCE_LENGTH = 250
int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH
)

All set, calling `TextVectorization.adapt` to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers.

In [165]:
# Make a text-only dataset (without labels), then call the adapt methods
train_text = raw_train_ds.map(lambda text, labels: text)

In [166]:
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

In [167]:
# Printing the result of using these layers to preprocess data

def binary_vectorize_text(text, label):
  # To accomodatae batching
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

In [168]:
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [169]:
raw_train_ds

<BatchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [170]:
# Retrieve a batch from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print(f"Question: {first_question}")
print(f"Label: {label}")

Question: b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n'
Label: python


In [171]:
print(f"""
binary vectorized question:
{binary_vectorize_text(first_question, first_label)}
""")


binary vectorized question:
(<tf.Tensor: shape=(1, 10000), dtype=float32, numpy=array([[1., 1., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=int32, numpy=2>)



In [172]:
print(f"""
int vectorized question:
{int_vectorize_text(first_question, first_label)}
""")


int vectorized question:
(<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[ 55,   6,   2, 410, 211, 229, 121, 895,   4, 124,  32, 245,  43,
          5,   1,   1,   5,   1,   1,   6,   2, 410, 211, 191, 318,  14,
          2,  98,  71, 188,   8,   2, 199,  71, 178,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,  

As seen above,

* `binary` mode creates a sparse matrix of VOCAB_SIZE and `1` where words are present
* `int` mode replaces words with integers and adds or truncates max length of output to MAX_SEQUENCE_LENGTH

We can lookup the word for token using good vocabulary

In [173]:
print(f"1221---> {int_vectorize_layer.get_vocabulary()[1221]}")
print(f"2---> {int_vectorize_layer.get_vocabulary()[2]}")

1221---> parsing
2---> the


In [174]:
len(int_vectorize_layer.get_vocabulary())

10000

In [175]:
print(f"""
Top words in vocab: {int_vectorize_layer.get_vocabulary()[:10]},
Bottom words in vocab: {int_vectorize_layer.get_vocabulary()[-10:]}
""")


Top words in vocab: ['', '[UNK]', 'the', 'i', 'to', 'a', 'is', 'in', 'and', 'of'],
Bottom words in vocab: ['excluded', 'exceptionthe', 'evnets', 'everyvarmathfloormathrandomeveryvarlength', 'eventtargetinnerhtml', 'evalinputplease', 'euros', 'ettercap', 'etos', 'essential']



We're all set, let's apply the vectorization layer to the entire dataset.

In [176]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

In [177]:
first_batch_question, first_batch_label = next(iter(binary_train_ds))

In [178]:
first_batch_question.shape, first_batch_label.shape

(TensorShape([32, 10000]), TensorShape([32]))

#### Configure the dataset for performance

* `Dataset.cache` keeps data in memeory after it;s loaded off disk. This ensures the dataset does not become a bottleneck while training the model. If dataset is too large to fit into memeory, can also use ths method to create a performant on-disk cache, which is more effecient to read than many small files.

* `Dataset.prefetch` overalaps data preprocessing and model execution while training.

In [179]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [180]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

In [181]:
binary_train_ds

<PrefetchDataset element_spec=(TensorSpec(shape=(None, 10000), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [182]:
binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)

In [183]:
history = binary_model.fit(
    binary_train_ds,
    validation_data=binary_val_ds,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Creating Conv1D model on `int` dataset

In [249]:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(input_dim=vocab_size, 
                       output_dim=64, 
                       mask_zero=True),
      layers.Conv1D(filters=64,
                    kernel_size=5,
                    strides=2,
                    padding="valid",
                    activation="relu"),
      layers.GlobalMaxPool1D(),
      layers.Dense(num_labels)
  ])

  return model

In [186]:
int_model = create_model(vocab_size=VOCAB_SIZE + 1, # 1 for 0 used in padding
                         num_labels=4)

In [187]:
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)

In [188]:
history = int_model.fit(
    int_train_ds,
    validation_data=int_val_ds,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [189]:
binary_model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 4)                 40004     
                                                                 
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________


In [190]:
int_model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, None, 64)          640000    
                                                                 
 conv1d_3 (Conv1D)           (None, None, 64)          20544     
                                                                 
 global_max_pooling1d_3 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_4 (Dense)             (None, 4)                 260       
                                                                 
Total params: 660,804
Trainable params: 660,804
Non-trainable params: 0
_________________________________________________________________


In [191]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

Binary model accuracy: 81.52%
Int model accuracy: 80.90%


Let's include preprocessing layer as part of the model to make it easier for predictions and use in production if needed.

In [192]:
export_model = tf.keras.Sequential(
    [
     binary_vectorize_layer,
     binary_model,
     layers.Activation('sigmoid')
    ]
)

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics='accuracy'
)

In [193]:
export_model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, 10000)            0         
 ectorization)                                                   
                                                                 
 sequential_3 (Sequential)   (None, 4)                 40004     
                                                                 
 activation (Activation)     (None, 4)                 0         
                                                                 
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________


In [194]:
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(binary_accuracy))

  return dispatch_target(*args, **kwargs)


Accuracy: 81.52%


In [195]:
def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels

In [196]:
inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'


### **Summary:**

1. `utils.get_file` to download dataset from a url
2. Loading text dataset from a directory using `utils.text_dataset_from_directory`
3. Standardization, tokenization, vectorization using `TextVectorizationLayer`
4. Mapping vectorization layer using function over the entire dataset
5. I/O bottlenect prevention using `tf.data` API

## Using TensorFlow low level API

### Example 2: Predict the author of Iliad translations

Using `tf.data.TextLineDataset` to load examples from text files and `TensorFlow Text` to preprocess the data. We'll use three different English translations of the same work, Homer's Iliad and train a model to identify the translator given a single line of text.

#### Download and explore the dataset

The texts of three translations are by:

* [William Cowper:](https://en.wikipedia.org/wiki/William_Cowper)[ text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt)

* [Edaward, Earl of Derby:](https://en.wikipedia.org/wiki/Edward_Smith-Stanley,_14th_Earl_of_Derby)[ text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt)

* [Samuel Butler: ](https://en.wikipedia.org/wiki/Samuel_Butler_%28novelist%29)[text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt)

The text files used in this tutorial have undergone some typical preprocessing tasks like removing documnet header and footer, line numbers and chapter titles.

In [197]:
# Download these lightly munged files locally:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

[PosixPath('/root/.keras/datasets/cowper.txt'),
 PosixPath('/root/.keras/datasets/butler.txt'),
 PosixPath('/root/.keras/datasets/derby.txt')]

In [198]:
text_dir

'/root/.keras/datasets/butler.txt'

#### Load the dataset

Previously with `tf.keras.utils.text_dataset_from_directory` all contents(lines) of a file were treated as a single example. Here, we'll use `tf.data.TextLineDataset` which is designed to create a `tf.data.Dataset` from a text file where example is a line of text from the original file.

`TextLineDataset` is useful for text data that is primarily line-based (for example, poetry or error logs)

Iterate through these files, loading each one into its own dataset. Each example(line) needs to be individually labelled, so use `Dataset.map` to apply a lebeler function to each one. This will iterate over evary example in the dataset, returning(`example, label`) pairs.

In [199]:
def labeler(example, index):
  return example, tf.cast(index, tf.int64)

In [200]:
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(filenames=(str(parent_dir/file_name)))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

In [201]:
labeled_data_sets

[<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 <MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 <MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>]

In [202]:
example, label = next(iter(labeled_data_sets[0]))

In [203]:
example.numpy(), label.numpy()

(b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;", 0)

Next combining these labeled datasets into a single dataset using `Dataset.concatenate` and shuffle it with `Dataset.shuffle`

In [204]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

In [205]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

In [206]:
# Shuffling the data
all_labelled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False
)

In [207]:
# Printing out few examples
for text, label in all_labeled_data.take(10):
  print(f"Sentence: {text.numpy()}")
  print(f"Label: {label.numpy()}")

Sentence: b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
Label: 0
Sentence: b'His wrath pernicious, who ten thousand woes'
Label: 0
Sentence: b"Caused to Achaia's host, sent many a soul"
Label: 0
Sentence: b'Illustrious into Ades premature,'
Label: 0
Sentence: b'And Heroes gave (so stood the will of Jove)'
Label: 0
Sentence: b'To dogs and to all ravening fowls a prey,'
Label: 0
Sentence: b'When fierce dispute had separated once'
Label: 0
Sentence: b'The noble Chief Achilles from the son'
Label: 0
Sentence: b'Of Atreus, Agamemnon, King of men.'
Label: 0
Sentence: b"Who them to strife impell'd? What power divine?"
Label: 0


#### Prepare the dataset for training

Instead of using `tf.keras.layers.TextVectorization` to preprocess the text dataset, will now use the TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary and use `tf.lookup.StaticVocabularyTable` to map tokens to integers to feed to the model.

Define a function to convert the text to lower-case and tokenize it.

* TensorFlow Text provides various Tokenizers. In this example, we'll use the text `text.UnicodeScriotTokenizer` to tokenize the dataset.
* Will use `Dataset.map` to apply the tokenization to the dataset.

In [208]:
tokenizer = tf_text.UnicodeScriptTokenizer()

In [209]:
def tokenize(text, unused_label):
  # Lower case
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

In [210]:
tokenized_ds = all_labeled_data.map(tokenize)

In [211]:
for text_batch in tokenized_ds.take(5):
  print(f"Tokens: {text_batch.numpy()}")

Tokens: [b'achilles' b'sing' b',' b'o' b'goddess' b'!' b'peleus' b"'" b'son' b';']
Tokens: [b'his' b'wrath' b'pernicious' b',' b'who' b'ten' b'thousand' b'woes']
Tokens: [b'caused' b'to' b'achaia' b"'" b's' b'host' b',' b'sent' b'many' b'a'
 b'soul']
Tokens: [b'illustrious' b'into' b'ades' b'premature' b',']
Tokens: [b'and' b'heroes' b'gave' b'(' b'so' b'stood' b'the' b'will' b'of' b'jove'
 b')']


Next building a vocabulary by sorting tokens by frequency and keeping the top `VOCAB_SIZE` tokens

In [212]:
tokenized_ds = configure_dataset(tokenized_ds)

In [213]:
vocab_dict = collections.defaultdict(lambda: 0)

In [214]:
for tokens in tokenized_ds.as_numpy_iterator():
  for token in tokens:
    vocab_dict[token] += 1

In [215]:
vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)

In [216]:
vocab[:5]

[(b',', 45478),
 (b'the', 28299),
 (b'and', 17012),
 (b"'", 15695),
 (b'of', 13489)]

In [217]:
vocab = [token for token, count in vocab]
vocab[:5]

[b',', b'the', b'and', b"'", b'of']

In [218]:
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)

In [219]:
print(f"Vocab size: {vocab[:5]}")

Vocab size: [b',', b'the', b'and', b"'", b'of']


To convert tokens into integers, use the vocab to create a `tf.lookup.StaticVocabularyTable`. Will map tokens to integers in the range `[2, vocab_size + 2]`. As with `TextVectorization` layer, `0` is reserved to denote padding and `1` is reserved to denote an OOV token

In [220]:
keys = vocab
values = range(2, len(vocab) + 2) # Reserve 0 for padding and 1 for OOV tokens

In [221]:
init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64
)

In [222]:
num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

Finally define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table

In [223]:
vocab_table.size()

<tf.Tensor: shape=(), dtype=int64, numpy=10001>

In [224]:
def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

In [225]:
# Try this on a single example
example_text, example_label = next(iter(all_labeled_data))
print(f"Sentence: {example_text.numpy()}")

Sentence: b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"


In [226]:
vectorized_txt, example_label = preprocess_text(example_text, example_label)
print(f"vectorized sentence: {vectorized_txt}")

vectorized sentence: [  57 4110    2   95  284   59  182    5   28   10]


In [227]:
# Preprocess entire data using map
all_encoded_data = all_labeled_data.map(preprocess_text)

In [228]:
all_labeled_data

<ConcatenateDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

#### Split the dataset into training and test sets

The keras `TextVectorization` layer also batches and pads vectorized data. Padding is required because the examples inside of a batch need to be the same size and shape, but the examples in these datasets are not all the same size - each line of text has different number of words.

`tf.data.Dataset` supports splitting and padded-batchinf datasets

In [229]:
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

In [230]:
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

In [231]:
sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Text batch shape:  (64, 14)
Label batch shape:  (64,)
First text example:  tf.Tensor([  57 4110    2   95  284   59  182    5   28   10    0    0    0    0], shape=(14,), dtype=int64)
First label example:  tf.Tensor(0, shape=(), dtype=int64)


In [232]:
vocab_size

10000

In [233]:
vocab_size += 2

In [234]:
vocab_table.size()

<tf.Tensor: shape=(), dtype=int64, numpy=10001>

In [235]:
train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

In [236]:
vocab_size

10002

In [239]:
model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [240]:
loss, accuracy = model.evaluate(validation_data)

print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

Loss: 1.0633639097213745
Accuracy: 0.6462000012397766


In [250]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, None, 64)          640128    
                                                                 
 conv1d_5 (Conv1D)           (None, None, 64)          20544     
                                                                 
 global_max_pooling1d_5 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_6 (Dense)             (None, 3)                 195       
                                                                 
Total params: 660,867
Trainable params: 660,867
Non-trainable params: 0
_________________________________________________________________


#### Export the model

To make the model capabale of taking raw strings as input, will use keras `TextVectorization` layer that performs the same steps as our custom preprocessing function.

In [241]:
preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH
)

In [242]:
preprocess_layer.set_vocabulary(vocab)

In [261]:
export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

In [262]:
# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.6702016592025757
Accuracy: 74.78%


In [248]:
export_model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_3 (TextV  (None, 250)              0         
 ectorization)                                                   
                                                                 
 sequential_7 (Sequential)   (None, 3)                 660867    
                                                                 
 activation_2 (Activation)   (None, 3)                 0         
                                                                 
Total params: 660,867
Trainable params: 660,867
Non-trainable params: 0
_________________________________________________________________


In [263]:
%%time
inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2
Question:  And with loud clangor of his arms he fell.
Predicted label:  0
CPU times: user 2.51 s, sys: 57.3 ms, total: 2.57 s
Wall time: 2.55 s


In [258]:
predicted_scores[0]

array([6.0442750e-07, 9.9999893e-01, 5.0214715e-07], dtype=float32)