
## Example 1: Predict the tag for a Stack Overflow question

Download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java).

The task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.

**Download and explore the dataset**

Next, you will download the dataset, and explore the directory structure.

In [1]:
import re
import string
import pathlib
import tensorflow as tf

In [2]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset = tf.keras.utils.get_file(fname = 'stack_overflow',
                                  origin = data_url,
                                  cache_dir = 'stack_overflow',
                                  cache_subdir = '',
                                  untar = True)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz


In [3]:
print('Path to the downloaded file',dataset)

Path to the downloaded file /tmp/.keras/stack_overflow


In [4]:
dataset_dir = pathlib.Path(dataset).parent
print('Parent',dataset_dir)

Parent /tmp/.keras


In [5]:
for directory in list(dataset_dir.iterdir()):
  print("path :",directory)

path : /tmp/.keras/train
path : /tmp/.keras/stack_overflow.tar.gz
path : /tmp/.keras/test
path : /tmp/.keras/README.md


In [6]:
train_dir = dataset_dir/'train'
test_dir = dataset_dir/'test'

print('path to training data:',train_dir)
print('path to testing data:',test_dir)

path to training data: /tmp/.keras/train
path to testing data: /tmp/.keras/test


In [7]:
for f in train_dir.iterdir():
  print('training data:',f)

print('-'*50)

for f in test_dir.iterdir():
  print('testing data:',f)

training data: /tmp/.keras/train/csharp
training data: /tmp/.keras/train/python
training data: /tmp/.keras/train/javascript
training data: /tmp/.keras/train/java
--------------------------------------------------
testing data: /tmp/.keras/test/csharp
testing data: /tmp/.keras/test/python
testing data: /tmp/.keras/test/javascript
testing data: /tmp/.keras/test/java


The `train/csharp`, `train/java`, `train/python` and `train/javascript` directories contain many text files, each of which is a Stack Overflow question. Print a file and inspect the data.

In [8]:

for tag in ['csharp', 'python', 'javascript', 'java']:
  path = train_dir/tag
  for que in path.iterdir():
    with open(que) as f:
      print('Tag:',tag)
      print(f.read())
      break



Tag: csharp
"how can i keep invoking a method inside a button in blank i am trying to call a function (testdraw) repeatedly inside a button (btn_auto_update) in a blank windows application:..private void btn_auto_update_click(object sender, eventargs e).{      ..}..private void testdraw().{.    textoutput.text += ""drawingrn"";.}...or i need a way to do what this codeline in python do:..def auto_get(self):.    self.testdraw().    self.after(1000, self.auto_get)"

Tag: python
"how to correct this loop using an array if any keyword from an array is mentioned in the title, don't click the title. if the title doesn't mention any of the keywords click the title...right now it clicks all the time, and i know why but i don't know how to fix it. it always clicks because it goes through the whole array and eventually there is a keyword that is not in the title. ideas?..arr = [""bunny"", ""watch"", ""book""]..title = (""the book of coding. (e-book) by seb tota"").lower().length = len(arr).for i 

### Load the dataset
Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the `text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`.

**`tf.data`** is a powerful collection of tools for building input pipelines.

`preprocessing.text_dataset_from_directory` expects a directory structure as follows.

<br>
train/<br>
...csharp/<br>
......1.txt<br>
......2.txt<br>
...java/<br>
......1.txt<br>
......2.txt<br>
...javascript/<br>
......1.txt<br>
......2.txt<br>
...python/<br>
......1.txt<br>
......2.txt<br>
<br>

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test. The Stack Overflow dataset has already been divided into train and test, but it lacks a validation set. Create a validation set using an 80:20 split of the training data by using the validation_split argument below.

**Note: When using the validation_split and subset arguments, make sure to either specify a random seed, or to pass shuffle=False, so that the validation and training splits have no overlap.**

**The labels are 0, 1, 2 or 3. To see which of these correspond to which string label, you can check the class_names property on the dataset.**
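
The note on `validation_split` above can be illustrated without TensorFlow. The following is a minimal plain-Python sketch, not the Keras implementation: `split_indices` is a hypothetical helper that shuffles file indices with a seeded generator and cuts off the tail for validation. It is called twice, mirroring the two `text_dataset_from_directory` calls, to show why both calls need the same seed.

```python
import random

def split_indices(num_files, validation_split, seed):
    """Shuffle file indices with a fixed seed, then cut off the tail for validation."""
    indices = list(range(num_files))
    random.Random(seed).shuffle(indices)
    cut = num_files - int(num_files * validation_split)
    return indices[:cut], indices[cut:]

# Two separate calls with the same seed produce the same shuffle,
# so the training part of one and the validation part of the other are disjoint.
train_idx, _ = split_indices(8000, 0.2, seed=42)
_, val_idx = split_indices(8000, 0.2, seed=42)
print(len(train_idx), len(val_idx))        # 6400 1600
print(len(set(train_idx) & set(val_idx)))  # 0

# With a different seed the second call shuffles differently,
# so some training files also end up in the validation subset.
_, val_other = split_indices(8000, 0.2, seed=7)
print(len(set(train_idx) & set(val_other)) > 0)
```

Passing `shuffle=False` achieves the same guarantee in a different way: with no shuffling at all, both calls see the files in the same order.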

In [9]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory( directory = train_dir,
                                                                  batch_size = batch_size,
                                                                  seed = seed,
                                                                  validation_split = 0.2,
                                                                  subset = 'training')

raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory( directory = train_dir,
                                                                  batch_size = batch_size,
                                                                  seed = seed,
                                                                  validation_split = 0.2,
                                                                  subset = 'validation')

raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory( directory = test_dir,
                                                                  batch_size = batch_size,
                                                                  seed = seed)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
Found 8000 files belonging to 4 classes.


Note: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words Python, CSharp, JavaScript, or Java in the programming question with the word blank.

In [10]:
batch_text , batch_label = next(iter(raw_train_ds))

for i in range(5):
  print("Question : ",batch_text[i])
  print("Tag : ",batch_label[i])

Question :  tf.Tensor(b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the 

In [11]:
for i, name in enumerate(raw_train_ds.class_names):
  print('label:',i,'classname:',name)

label: 0 classname: csharp
label: 1 classname: java
label: 2 classname: javascript
label: 3 classname: python


### Prepare the dataset for training
**Standardize**, **Tokenize**, and **Vectorize** the data using the `preprocessing.TextVectorization` layer.

**Standardization** refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

**Tokenization** refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

**Vectorization** refers to converting tokens into numbers so they can be fed into a neural network.



The default standardization converts text to lowercase and removes punctuation.

The default tokenizer splits on whitespace.

The default vectorization mode is `int`. This outputs integer indices (one per token) and can be used to build models that take word order into account. You can also use other modes, like `binary`, to build bag-of-words models.

You will build two models to learn more about these. 
* First, you will use the `binary` mode to build a bag-of-words model. 
* Next, you will use the `int` mode with a 1D ConvNet.
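
To make the three steps and the two output modes concrete, here is a small plain-Python sketch. It is not how `TextVectorization` is implemented: the vocabulary is a hypothetical toy list, and the index layout (0 for padding, 1 for out-of-vocabulary) is a simplification used for both modes.

```python
import string

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    """Split on whitespace."""
    return text.split()

# Hypothetical toy vocabulary: index 0 is padding, index 1 is the OOV token.
vocab = ["", "[UNK]", "the", "list", "sort", "a"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def vectorize_int(tokens):
    """One integer per token, in order; unknown tokens map to 1."""
    return [token_to_id.get(tok, 1) for tok in tokens]

def vectorize_binary(tokens):
    """One slot per vocabulary entry; 1.0 if that entry occurs at least once."""
    present = set(vectorize_int(tokens))
    return [1.0 if i in present else 0.0 for i in range(len(vocab))]

tokens = tokenize(standardize("Sort the list, then SORT the list!"))
print(tokens)                    # ['sort', 'the', 'list', 'then', 'sort', 'the', 'list']
print(vectorize_int(tokens))     # [4, 2, 3, 1, 4, 2, 3]
print(vectorize_binary(tokens))  # [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

The `int` output keeps order and repetition, which a sequence model can use. The binary output has a fixed length and only records which vocabulary entries are present.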

### Vocab configuration

In [12]:
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 250

# pad_to_max_tokens	Only valid in "binary", "count", and "tf-idf" modes.
# If True, the output will have its feature axis padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens, 
# resulting in a tensor of shape [batch_size, max_tokens] regardless of vocabulary size. 
# Defaults to False.

# 'output_sequence_length'	Only valid in INT mode.
# If set, the output will have its time dimension padded or truncated to exactly output_sequence_length values, 
# resulting in a tensor of shape [batch_size, output_sequence_length] regardless of how many tokens resulted from the splitting step.
# Defaults to None.

In [13]:
binary_vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens = VOCAB_SIZE,
                                                                                      output_mode = 'binary')

For `int` mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length, which will cause the layer to pad or truncate sequences to exactly `output_sequence_length` values.
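
As a rough illustration of what padding or truncating to a fixed length means, the sketch below uses a hypothetical helper, `pad_or_truncate`, on plain Python lists. It assumes padding with `0` at the end of the sequence, which matches the trailing zeros a padded `int` output would show.

```python
def pad_or_truncate(ids, output_sequence_length, pad_id=0):
    """Return exactly output_sequence_length ids: cut long inputs, pad short ones."""
    ids = list(ids)[:output_sequence_length]
    return ids + [pad_id] * (output_sequence_length - len(ids))

print(pad_or_truncate([23, 1978, 6], 5))             # [23, 1978, 6, 0, 0]
print(pad_or_truncate([23, 1978, 6, 414, 4, 2], 5))  # [23, 1978, 6, 414, 4]
```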

In [14]:
int_vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens = VOCAB_SIZE,
     output_mode = 'int',
      output_sequence_length = MAX_SEQUENCE_LENGTH)

Next, you will call `adapt` to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index of strings to integers.

Note: it's important to only use your training data when calling adapt (using the test set would leak information).

**Make a text-only dataset (without labels), then call adapt**

In [15]:
#  text-only dataset
text_ds = raw_train_ds.map(lambda text,label:text)

#  calling adapt on text-only dataset
binary_vectorize_layer.adapt(text_ds)
int_vectorize_layer.adapt(text_ds)

See the result of using these layers to preprocess data:

In [16]:
def binary_vectorize_text(text,label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

def int_vectorize_text(text,label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [17]:
sample_text = batch_text[0]
sample_label = batch_label[0]

In [18]:
print('Sample text before Vectorization \n',sample_text)
print('Sample label:',sample_label)

Sample text before Vectorization 
 tf.Tensor(b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class

In [19]:
print('Binary vectorization \n',binary_vectorize_text(sample_text,sample_label))

Binary vectorization 
 (<tf.Tensor: shape=(1, 10000), dtype=float32, numpy=array([[1., 1., 1., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=int32, numpy=1>)


In [20]:
print('Integer vectorization \n',int_vectorize_text(sample_text,sample_label))

Integer vectorization 
 (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[  23, 1978,    6,  414,    4,    2,  151,  314,    3,   34,   15,
           4,  598,   50,   10,    3,  675,    5,  159,   14,   35,   33,
        2146, 1180,  160, 5800,   74,   23,   86,   95,    5, 1978,   29,
          21,    5,  153,   44,    3,  448,   14,    4,   23,    1,   29,
          11, 1845,   11,    4,    2,  151,  314,    3,   17,  121, 2205,
          25,  203,    1,    1, 7557,  145, 7555,  473,  197,  369,    1,
          23,  199,   21,    1,   22,    1, 6398,  120, 4485, 7557, 6398,
        7555, 4485,  197,    1,   23, 1978,   29,    3,   17,  229,  121,
           1, 2242,   15,    1, 4485,    1,    8, 4485,  541, 1082,    8,
        1369, 2070,    7,    2,  773,    1,   55,    3,   46,    4, 1078,
           6,    2, 1978,   29, 1845, 6398,    8, 4485,    4,    2,  199,
         314,    8,   70,   11,    7,   14,   29,   26,   11,   93,  722,
           2,  369,  314,   66,    1,  

As you can see above, binary mode returns an array denoting which tokens exist at least once in the input, while int mode replaces each token by an integer, thus preserving their order. You can look up the token (string) that each integer corresponds to by calling `.get_vocabulary()` on the layer.

In [21]:
for i in int_vectorize_text(sample_text, sample_label)[0].numpy()[0]:
  print(int_vectorize_layer.get_vocabulary()[i], end=' ')

my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed please forgive me my program has a tester class with a main when i send that to my [UNK] class it sends it to the wrong constructor i have two constructors 1 without [UNK] [UNK] mynumsides 5 mysidelength 30 end default [UNK] my second with [UNK] public [UNK] numsides double sidelength mynumsides numsides mysidelength sidelength end [UNK] my tester class i have these two [UNK] shape new [UNK] sidelength [UNK] and sidelength were declared and initialized earlier in the testing [UNK] what i want to happen is the tester class sends numsides and sidelength to the second constructor and use it in that class but it only uses the default constructor which [UNK] [UNK] the whole rest of the program can somebody help [UNK] those of you who want to see more of my code here you [UNK] double vertexangle systemoutprintlnthe vertex angle method mynumsides prints out 5 systemoutprintlnthe

**Apply the TextVectorization layers you created earlier to the train, validation, and test dataset**.

In [22]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

**Configure the dataset for performance** 

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

**.cache()** keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

**.prefetch()** overlaps data preprocessing and model execution while training.

In [23]:
def config_for_performance(ds):
  ds = ds.cache()
  ds = ds.prefetch(tf.data.AUTOTUNE)
  return ds

In [24]:
binary_train_ds = config_for_performance(binary_train_ds)
binary_val_ds = config_for_performance(binary_val_ds)
binary_test_ds = config_for_performance(binary_test_ds)

int_train_ds = config_for_performance(int_train_ds)
int_val_ds = config_for_performance(int_val_ds)
int_test_ds = config_for_performance(int_test_ds)

### Create a Model
It's time to create our neural network. For the binary vectorized data, **train a simple bag-of-words linear model**:

#### Binary Model

In [25]:
binary_model = tf.keras.Sequential([
                                    tf.keras.layers.Dense(units=4)
])

In [26]:
binary_model.compile(optimizer='adam',
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     metrics=['accuracy'])

In [27]:
binary_model.fit(x=binary_train_ds, epochs=10, validation_data=binary_val_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f093bf41a50>

Next, you will use the int vectorized layer to build a 1D ConvNet.
### ConvNet model template

In [28]:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
                                 tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=64, mask_zero=True),
                                 tf.keras.layers.Conv1D(64, 5, strides=2, padding='valid', activation='relu'),
                                 tf.keras.layers.GlobalAveragePooling1D(),
                                 tf.keras.layers.Dense(num_labels)])
  return model

#### Integer model

In [29]:
# vocab size = 1 + VOCAB_SIZE since 0 is used additionally for padding.
int_model = create_model(VOCAB_SIZE+1, 4)
int_model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
int_model.fit(x = int_train_ds, epochs=5, validation_data=int_val_ds)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f093bc252d0>

### Compare the models

In [30]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
print('Binary model accuracy',binary_accuracy,'loss',binary_loss)

Binary model accuracy 0.8162500262260437 loss 0.517124593257904


In [31]:
int_loss, int_accuracy = int_model.evaluate(int_test_ds)
print('Integer model accuracy',int_accuracy,'loss',int_loss)

Integer model accuracy 0.7858750224113464 loss 0.5957909226417542


### Export the model
In the code above, you applied the TextVectorization layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model. To do so, you can create a new model using the weights you just trained.

In [35]:
export_model = tf.keras.Sequential([
                                    binary_vectorize_layer,
                                    binary_model,
                                    tf.keras.layers.Softmax()
])

In [36]:
export_model.compile(optimizer='adam',
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                     metrics=['accuracy'])

In [37]:
# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print('Export model accuracy',accuracy)
print('Export model loss',loss)

Export model accuracy 0.8162500262260437
Export model loss 0.5171244740486145
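
The exported model appends a `Softmax` layer and is compiled with `from_logits=False`, yet it reports essentially the same loss as `binary_model`. The plain-Python sketch below, using made-up scores rather than values from the trained model, shows why: cross-entropy computed directly from logits and cross-entropy computed from softmax probabilities are the same quantity, and softmax does not change which class scores highest.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_logits(logits, label):
    """Sparse categorical cross-entropy from raw scores: logsumexp - logit[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def loss_from_probs(probs, label):
    """Sparse categorical cross-entropy from probabilities: -log p[label]."""
    return -math.log(probs[label])

logits = [1.2, -0.3, 0.5, 2.0]  # made-up scores for the four tags
label = 3

loss_a = loss_from_logits(logits, label)
loss_b = loss_from_probs(softmax(logits), label)
print(abs(loss_a - loss_b) < 1e-9)  # True

probs = softmax(logits)
print(logits.index(max(logits)) == probs.index(max(probs)))  # True
```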


### Run inference on new data

In [38]:
inputs = [
    "how do I extract keys from a dict into a list?",  # python
    "debug public static void main(string[] args) {...}",  # java
]

In [39]:
def get_string_labels(predicted_score):
  int_labels = tf.argmax(predicted_score, axis=1)
  return tf.gather(raw_train_ds.class_names, int_labels)

In [40]:
predictions = export_model.predict(inputs)

In [41]:
output_labels = get_string_labels(predictions)

In [42]:
for i, o in zip(inputs, output_labels.numpy()):
  print('Question',i)
  print('Tag',o)

Question how do I extract keys from a dict into a list?
Tag b'python'
Question debug public static void main(string[] args) {...}
Tag b'java'


There is a performance difference to keep in mind when choosing where to apply your TextVectorization layer. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you're ready to prepare for deployment.

## Example 2: Predict the author of Iliad translations
The following provides an example of using `tf.data.TextLineDataset` to load examples from text files, and `tf.text` to preprocess the data. In this example, you will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

**Download and explore the dataset**

The texts of the three translations are by:

* William Cowper

* Edward, Earl of Derby

* Samuel Butler

The text files used in this tutorial have undergone some typical preprocessing tasks like removing document header and footer, line numbers and chapter titles. Download these lightly munged files locally.

In [43]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

text_dir = ''
for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(fname=name,origin=DIRECTORY_URL+name)
  

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt


In [44]:
import pathlib
text_dir = pathlib.Path(text_dir)
parent_dir = text_dir.parent
print('Parent directory:',parent_dir)

Parent directory: /root/.keras/datasets


In [45]:
text_files = []
for item in parent_dir.iterdir():
  text_files.append(item)
  print('Text File:',item)

Text File: /root/.keras/datasets/cowper.txt
Text File: /root/.keras/datasets/butler.txt
Text File: /root/.keras/datasets/derby.txt


### Load the dataset
You will use `TextLineDataset`, which is designed to create a `tf.data.Dataset` from a text file in which each example is a line of text from the original file, whereas `text_dataset_from_directory` treats all contents of a file as a single example. `TextLineDataset` is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use tf.data.Dataset.map to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.
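
The difference between the two loading styles, and the per-line labeling, can be sketched in plain Python. This is only an illustration: the two files are hypothetical stand-ins written to a temporary directory, and `whole_file_examples` and `line_examples` are helper names made up for this sketch, not TensorFlow functions.

```python
import os
import tempfile

def whole_file_examples(paths):
    """One example per file: the entire file content is a single string."""
    examples = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            examples.append(f.read())
    return examples

def line_examples(paths):
    """One example per line, labeled with the index of the file it came from."""
    examples = []
    for label, path in enumerate(paths):
        with open(path, encoding="utf-8") as f:
            for line in f:
                examples.append((line.rstrip("\n"), label))
    return examples

# Two hypothetical files standing in for the translations.
tmp_dir = tempfile.mkdtemp()
contents = {"a.txt": "first line\nsecond line\n", "b.txt": "only line\n"}
paths = []
for name, text in contents.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

print(len(whole_file_examples(paths)))  # 2
print(line_examples(paths))
# [('first line', 0), ('second line', 0), ('only line', 1)]
```

Note that the label is simply the position of the file in the list, so the mapping from label to translator depends on the order in which the files are listed.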

In [48]:
labeled_text_dataset = []

def labeler(example, index):
  return example, tf.cast(index, dtype=tf.int64)

for index, file in enumerate(text_files):
  text_line_ds = tf.data.TextLineDataset(file)
  labeled_ds = text_line_ds.map(lambda text:labeler(text, index))
  labeled_text_dataset.append(labeled_ds)


  

Next, you'll combine these labeled datasets into a single dataset, and shuffle it.


In [49]:
all_labeled_data = labeled_text_dataset[0]

for ds in labeled_text_dataset[1:]:
  all_labeled_data = all_labeled_data.concatenate(ds)

### Dataset configuration

In [50]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

Shuffle the dataset

In [51]:
all_labeled_data = all_labeled_data.shuffle(buffer_size=BUFFER_SIZE, reshuffle_each_iteration=False)
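
The buffer size matters because the concatenated dataset is ordered by file, so all lines with one label come before all lines with the next. The sketch below is a plain-Python approximation of buffer-based shuffling (fill a buffer, emit a random element from it, replace it with the next input), not TensorFlow's code. `buffered_shuffle` is a hypothetical name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # and replace it with the new one
    rng.shuffle(buffer)              # drain whatever is left
    yield from buffer

# A buffer of size 1 cannot reorder anything.
print(list(buffered_shuffle(range(10), buffer_size=1)))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# A small buffer only mixes nearby elements: the first output
# can only come from the first three inputs.
small = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(small[0] < 3)  # True

# A buffer at least as large as the data gives a full shuffle.
full = list(buffered_shuffle(range(10), buffer_size=10, seed=0))
print(sorted(full) == list(range(10)))  # True
```

For data ordered by label, a buffer much smaller than the dataset would leave long runs of the same label, which is why a large `BUFFER_SIZE` is used here.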

Print out a few examples as before. The dataset hasn't been batched yet, so each entry in `all_labeled_data` corresponds to one data point:

In [52]:
for data, label in all_labeled_data.take(10):
  print('Sentence',data)
  print('Label',label)

Sentence tf.Tensor(b'Found him, and such as none could waft aside,', shape=(), dtype=string)
Label tf.Tensor(0, shape=(), dtype=int64)
Sentence tf.Tensor(b"Of fierce \xc3\x86acides. And now they reach'd", shape=(), dtype=string)
Label tf.Tensor(0, shape=(), dtype=int64)
Sentence tf.Tensor(b"By art averted Peleus' son; the form", shape=(), dtype=string)
Label tf.Tensor(0, shape=(), dtype=int64)
Sentence tf.Tensor(b'Nor horse have I, nor car on which to mount;', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int64)
Sentence tf.Tensor(b'So swiftly past the eager horses flew."', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int64)
Sentence tf.Tensor(b"Then Hector Leitus, Aloctryon's son,", shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int64)
Sentence tf.Tensor(b'There, settling by degrees, it rolls no more;', shape=(), dtype=string)
Label tf.Tensor(0, shape=(), dtype=int64)
Sentence tf.Tensor(b"Supine, his eyes with pitchy darkness veil'd,", shape=(), 

### Prepare the dataset for training
Instead of using the Keras `TextVectorization` layer to preprocess our text dataset, you will now use the `tf.text` API to standardize and tokenize the data, build a vocabulary and use `StaticVocabularyTable` to map tokens to integers to feed to the model.

While `tf.text` provides various tokenizers, you will use the `UnicodeScriptTokenizer` to tokenize our dataset.

Define a function to convert the text to lower-case and tokenize it. You will use `tf.data.Dataset.map` to apply the tokenization to the dataset.

A Tokenizer is a text.Splitter that splits strings into tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids (where integer ids could be created by hashing strings or by looking them up in a fixed vocabulary table that maps strings to ids).

**`UnicodeScriptTokenizer`**

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

The string is split wherever two consecutive characters belong to different Unicode scripts, or wherever the text switches between whitespace and non-whitespace.

By default, this tokenizer leaves out scripts matching the whitespace unicode property (use the keep_whitespace argument to keep it), so in this case the results are similar to the WhitespaceTokenizer. Any punctuation will get its own token (since it is in a different script), and any script change in the input string will be the location of a split.
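
For intuition, the behavior described above can be approximated for ASCII-only text with a regular expression. This is a rough sketch under that assumption, not the `tf.text` implementation: it treats ASCII letters as one script and every other non-whitespace ASCII character (digits and punctuation) as another, and it uses `str.lower()` in place of Unicode case folding. `approx_script_tokenize` is a hypothetical name.

```python
import re

# A run of ASCII letters, or a run of non-letter, non-whitespace characters.
_TOKEN = re.compile(r"[A-Za-z]+|[^\sA-Za-z]+")

def approx_script_tokenize(text):
    """Lowercase, then split on whitespace and on letter/non-letter boundaries."""
    return _TOKEN.findall(text.lower())

print(approx_script_tokenize("Found him, and such"))
# ['found', 'him', ',', 'and', 'such']
print(approx_script_tokenize("they reach'd"))
# ['they', 'reach', "'", 'd']
print(approx_script_tokenize('horses flew."'))
# ['horses', 'flew', '."']
```

Adjacent punctuation characters stay together in one token because there is no script change between them.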


In [54]:
pip install tensorflow-text-nightly

Collecting tensorflow-text-nightly
  Downloading https://files.pythonhosted.org/packages/0b/62/e9f54b2c360920e35be40976d660f5735a72870d739124b83e72110065a5/tensorflow_text_nightly-2.7.0.dev20210626-cp37-cp37m-manylinux1_x86_64.whl (4.3MB)
Installing collected packages: tensorflow-text-nightly
Successfully installed tensorflow-text-nightly-2.7.0.dev20210626


In [55]:
pip install -q tensorflow-text


In [56]:
import tensorflow_text 

In [57]:
tokenizer = tensorflow_text.UnicodeScriptTokenizer()

Define a function to convert the text to lower-case and tokenize it. 

In [59]:
def tokenize(text, unused_label):
  lowercase = tensorflow_text.case_fold_utf8(text)
  return tokenizer.tokenize(lowercase)

In [61]:
# eg: applying tokenize
example_text, example_label = next(iter(all_labeled_data))
print('Sample before tokenization \n',example_text)
print('Sample after tokenization \n',tokenize(example_text, example_label))

Sample before tokenization 
 tf.Tensor(b'Found him, and such as none could waft aside,', shape=(), dtype=string)
Sample after tokenization 
 tf.Tensor(
[b'found' b'him' b',' b'and' b'such' b'as' b'none' b'could' b'waft'
 b'aside' b','], shape=(11,), dtype=string)


You will use tf.data.Dataset.map to apply the tokenization to the dataset.

In [63]:
tokenized_ds = all_labeled_data.map(tokenize)

You can iterate over the dataset and print out a few tokenized examples.

In [64]:
for toks in tokenized_ds.take(10):
  print(toks)
 

tf.Tensor(
[b'found' b'him' b',' b'and' b'such' b'as' b'none' b'could' b'waft'
 b'aside' b','], shape=(11,), dtype=string)
tf.Tensor(
[b'of' b'fierce' b'\xc3\xa6acides' b'.' b'and' b'now' b'they' b'reach'
 b"'" b'd'], shape=(10,), dtype=string)
tf.Tensor([b'by' b'art' b'averted' b'peleus' b"'" b'son' b';' b'the' b'form'], shape=(9,), dtype=string)
tf.Tensor(
[b'nor' b'horse' b'have' b'i' b',' b'nor' b'car' b'on' b'which' b'to'
 b'mount' b';'], shape=(12,), dtype=string)
tf.Tensor([b'so' b'swiftly' b'past' b'the' b'eager' b'horses' b'flew' b'."'], shape=(8,), dtype=string)
tf.Tensor([b'then' b'hector' b'leitus' b',' b'aloctryon' b"'" b's' b'son' b','], shape=(9,), dtype=string)
tf.Tensor(
[b'there' b',' b'settling' b'by' b'degrees' b',' b'it' b'rolls' b'no'
 b'more' b';'], shape=(11,), dtype=string)
tf.Tensor(
[b'supine' b',' b'his' b'eyes' b'with' b'pitchy' b'darkness' b'veil' b"'"
 b'd' b','], shape=(11,), dtype=string)
tf.Tensor([b'then' b'thou' b',' b'achilles' b',' b'reverence' b't

Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens.

A plain Python `dict` raises a `KeyError` when you read a key that is not present. That is inconvenient when counting tokens, because every new token would need an explicit check. The `collections` module provides `defaultdict` for this case.

`defaultdict` is a subclass of `dict`. It behaves like a regular dictionary, except that looking up a missing key with `d[key]` does not raise a `KeyError`: the key is inserted with a default value, and that value is returned.

Syntax: `defaultdict`(`default_factory`)

Parameters:

`default_factory`: A callable that takes no arguments and returns the default value for a missing key. If this argument is omitted, missing keys raise a `KeyError` as in a regular `dict`.
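
A short standard-library example of the behavior described above, using a made-up token list. It also shows `collections.Counter`, which is an alternative way to count and rank tokens. The notebook itself uses `defaultdict`.

```python
import collections

tokens = ["the", ",", "and", "the", ",", "the"]

# A plain dict raises KeyError when a missing key is read.
plain = {}
try:
    plain["the"] += 1
except KeyError:
    print("plain dict raises KeyError for a missing key")

# defaultdict(int) supplies int() == 0 for missing keys.
counts = collections.defaultdict(int)
for tok in tokens:
    counts[tok] += 1
print(dict(counts))  # {'the': 3, ',': 2, 'and': 1}

# Counter counts and sorts by frequency in one step.
top_two = [tok for tok, _ in collections.Counter(tokens).most_common(2)]
print(top_two)  # ['the', ',']
```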

In [65]:
import collections

In [66]:
def config_for_performance(ds):
  ds = ds.cache()
  ds = ds.prefetch(tf.data.AUTOTUNE)
  return ds

In [67]:
tokenized_ds = config_for_performance(tokenized_ds)

In [68]:
vocab_dict = collections.defaultdict(lambda : 0)


for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

In [69]:
# sort descending
vocab = sorted(vocab_dict.items(), key=lambda x:x[1], reverse=True)

In [70]:
# limit the vocab to VOCAB_SIZE = 10000
vocab = [token for token, count in vocab]

vocab = vocab[:VOCAB_SIZE]

In [71]:
print('Length of vocab ',len(vocab))
print("First five vocab entries:", vocab[:5])

Length of vocab  10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']


To convert the tokens into integers, use the vocab set to create a `StaticVocabularyTable`. 

You will map tokens to integers in the range [`2`, `vocab_size + 2`), that is, from `2` up to `vocab_size + 1`. As with the TextVectorization layer, `0` is reserved to denote padding and `1` is reserved to denote an out-of-vocabulary (OOV) token.

`tf.lookup.StaticVocabularyTable` 
*  Raises **ValueError** when num_oov_buckets is not positive.
*  Raises **TypeError**	when lookup_key_dtype or initializer.key_dtype are not integer or string. Also when initializer.value_dtype != int64.

In [72]:
keys = vocab
values = range(2,len(vocab)+2)

In [73]:
init = tf.lookup.KeyValueTensorInitializer(keys=keys,
                                           values = values,
                                           key_dtype = tf.string,
                                           value_dtype = tf.int64)

In [75]:
# vocab look-up table

vocab_table = tf.lookup.StaticVocabularyTable(initializer = init,
                                              num_oov_buckets = 1)

Finally, define a function to **standardize**, **tokenize** and **vectorize** the dataset using the tokenizer and lookup table:

In [77]:
def preprocess_text(text, label):
  # Case-fold the text, split it into tokens, then map tokens to integer IDs
  standardized = tensorflow_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label


You can try this on a single example to see the output:

In [78]:
example_text, example_label = next(iter(all_labeled_data))
print('Sample before preprocessing \n', example_text)
print('Sample after preprocessing \n', preprocess_text(example_text, example_label))

Sample before preprocessing 
 tf.Tensor(b'Found him, and such as none could waft aside,', shape=(), dtype=string)
Sample after preprocessing 
 (<tf.Tensor: shape=(11,), dtype=int64, numpy=array([ 290,   16,    2,    4,  103,   25,  251,  201, 5028,  775,    2])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)


Now run the preprocess function on the dataset using `tf.data.Dataset.map`.

In [317]:
all_encoded_data = all_labeled_data.map(preprocess_text)

### Split the dataset into train and test
The Keras `TextVectorization` layer also batches and pads the vectorized data. The custom preprocessing above does not use that layer, so you need to batch and pad the data yourself.

**Padding** is required because the examples inside a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words. `tf.data.Dataset` supports splitting and padded-batching datasets.

The `tf.data.Dataset.padded_batch()` method allows you to specify `padded_shapes` for each component (feature) of the resulting batch. For example, if your input dataset is called `ds`:

    padded_ds = ds.padded_batch(
        BATCH_SIZE,
        padded_shapes={
            'label': [],                          # Scalar elements, no padding.
            'sequence_feature': [None],           # Vector elements, padded to longest.
            'seq_of_seqs_feature': [None, None],  # Matrix elements, padded to longest in each dimension.
        })
    
Notice that the `padded_shapes` argument has the same structure as your input dataset's elements, so in this case it takes a dictionary with keys that match your feature names.
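
As a plain-Python illustration of what "padded to longest" means for a batch of variable-length sequences, consider the sketch below. The helper `pad_batch` and the token IDs are made up for this example; `tf.data` does the equivalent work on tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]

toy_batch = [[290, 16, 2], [4], [103, 25]]  # made-up token IDs of different lengths
padded = pad_batch(toy_batch)
print(padded)  # [[290, 16, 2], [4, 0, 0], [103, 25, 0]]
```

After padding, every row has the same length, so the batch can be stored as a single rectangular array.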

In [318]:
# VALIDATION_SIZE = 5000

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)
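
The `take`/`skip` pattern used for this split can be pictured with ordinary Python iterables. This is only an analogy using `itertools.islice`; `records` and `validation_size` are hypothetical stand-ins for the encoded dataset and `VALIDATION_SIZE`.

```python
from itertools import islice

records = list(range(10))  # stand-in for the encoded dataset
validation_size = 3        # hypothetical, plays the role of VALIDATION_SIZE

# take(n): the first n elements.
validation = list(islice(records, validation_size))
# skip(n): everything after the first n elements.
train = list(islice(records, validation_size, None))

print(validation)  # [0, 1, 2]
print(train)       # [3, 4, 5, 6, 7, 8, 9]
```

Because one split takes the first `n` elements and the other skips them, the two splits do not overlap as long as the underlying order is the same for both.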

In [319]:
# Batch the data, padding each batch to the length of its longest example
train_data = train_data.padded_batch(batch_size=BATCH_SIZE)
validation_data = validation_data.padded_batch(batch_size=BATCH_SIZE)

Now, `validation_data` and `train_data` are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays. To illustrate:

In [320]:
example_text_batch , example_label_batch = next(iter(validation_data))

In [321]:
print('Shape of text batch',example_text_batch.shape)
print('Shape of label batch',example_label_batch.shape)
print('first text example',example_text_batch[0])
print('first label example',example_label_batch[0])

Shape of text batch (64, 17)
Shape of label batch (64,)
first text example tf.Tensor(
[ 290   16    2    4  103   25  251  201 5028  775    2    0    0    0
    0    0    0], shape=(17,), dtype=int64)
first label example tf.Tensor(0, shape=(), dtype=int64)


Since we use `0` for padding and `1` for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two.

Configure the datasets for better performance as before.

In [322]:
train_data = config_for_performance(train_data)
validation_data = config_for_performance(validation_data)

### Train the model
You can train a model on this dataset as before.

In [343]:
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(input_dim=VOCAB_SIZE+2, output_dim=64, mask_zero=True),
                             tf.keras.layers.Conv1D(64, 5, strides=2, padding='valid', activation='relu'),
                             tf.keras.layers.GlobalAveragePooling1D(),
                             tf.keras.layers.Dense(3)                       
])

In [344]:
model.compile(
    optimizer = 'adam',
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics = ['accuracy']
             )

In [345]:
model.fit(x=train_data, epochs=3, validation_data=validation_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f09352399d0>

In [346]:
loss, accuracy = model.evaluate(validation_data)
print('Loss',loss)
print('Model Accuracy',accuracy)

Loss 0.38751453161239624
Model Accuracy 0.8453999757766724


### Export the model
To make the model capable of taking raw strings as input, you will create a `TextVectorization` layer that performs the same steps as the custom preprocessing function. Since you have already built a vocabulary, you can use `set_vocabulary` instead of `adapt`, which would train a new vocabulary.

In [361]:
preprocess_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=VOCAB_SIZE+2,
                                                                                standardize=tensorflow_text.case_fold_utf8,
                                                                                split=tokenizer.tokenize,
                                                                                output_mode='int',
                                                                                output_sequence_length=20)

In [362]:
preprocess_layer.set_vocabulary(vocab)

In [363]:
export_model = tf.keras.Sequential([
                                    preprocess_layer,
                                    model,
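                                    # Note: 'softmax' is the usual activation for mutually exclusive classes.
                                    # The argmax prediction is the same with either activation.
                                    tf.keras.layers.Activation('sigmoid'),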
])

In [364]:
export_model.compile(optimizer='adam',
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                     metrics=['accuracy'])

In [365]:
# Create a test dataset of raw strings
test_data = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_data = config_for_performance(test_data)

In [366]:
loss,accuracy = export_model.evaluate(test_data)
print('Loss',loss)
print('Accuracy',accuracy)

Loss 0.3879798948764801
Accuracy 0.8361999988555908


### Run inference on new data

In [367]:
inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

In [368]:
predictions = export_model.predict(inputs)
outputs = tf.argmax(predictions, axis=1)

for i, o in zip(inputs, outputs):
  print('Line :',i)
  print('Label :',o)

Line : Join'd to th' Ionians with their flowing robes,
Label : tf.Tensor(2, shape=(), dtype=int64)
Line : the allies, and his armour flashed about him so that he seemed to all
Label : tf.Tensor(1, shape=(), dtype=int64)
Line : And with loud clangor of his arms he fell.
Label : tf.Tensor(0, shape=(), dtype=int64)


## Downloading more datasets using TensorFlow Datasets (TFDS)
You can download many more datasets from TensorFlow Datasets. As an example, you will download the IMDB Large Movie Review dataset, and use it to train a model for sentiment classification.

In [264]:
import tensorflow_datasets as tfds

In [270]:
train_ds = tfds.load(name='imdb_reviews',
                     split='train[:80%]',
                     batch_size=BATCH_SIZE,
                     shuffle_files=True,
                     as_supervised=True
                     )

val_ds = tfds.load(name='imdb_reviews',
                     split='train[80%:]',
                     batch_size=BATCH_SIZE,
                     shuffle_files=True,
                     as_supervised=True
                     )

In [271]:
for batch_text, batch_label in val_ds.take(1):
  for i in range(5):
    print('Review:',batch_text[i])
    print('Label:',batch_label[i])

Review: tf.Tensor(b"Instead, go to the zoo, buy some peanuts and feed 'em to the monkeys. Monkeys are funny. People with amnesia who don't say much, just sit there with vacant eyes are not all that funny.<br /><br />Black comedy? There isn't a black person in it, and there isn't one funny thing in it either.<br /><br />Walmart buys these things up somehow and puts them on their dollar rack. It's labeled Unrated. I think they took out the topless scene. They may have taken out other stuff too, who knows? All we know is that whatever they took out, isn't there any more.<br /><br />The acting seemed OK to me. There's a lot of unfathomables tho. It's supposed to be a city? It's supposed to be a big lake? If it's so hot in the church people are fanning themselves, why are they all wearing coats?", shape=(), dtype=string)
Label: tf.Tensor(0, shape=(), dtype=int64)
Review: tf.Tensor(b'Well, was Morgan Freeman any more unusual as God than George Burns? This film sure was better than that bore,

You can now preprocess the data and train a model as before.

Note: You will use `tf.keras.losses.BinaryCrossentropy` instead of `tf.keras.losses.SparseCategoricalCrossentropy` for your model since this is a binary classification problem.
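
To see what binary cross-entropy computes for a single example, here is a small plain-Python sketch. It follows the textbook formula applied to one raw logit and is not the Keras implementation; the function names and logit values are made up for illustration.

```python
import math

def sigmoid(logit):
    """Squash a real-valued logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def binary_crossentropy_from_logit(label, logit):
    """Cross-entropy for one example with label 0 or 1 and a single raw logit."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Made-up logits: a confident correct prediction costs little,
# a confident wrong prediction costs a lot.
loss_correct = binary_crossentropy_from_logit(label=1, logit=3.0)
loss_wrong = binary_crossentropy_from_logit(label=1, logit=-3.0)
print(loss_correct, loss_wrong)
```

A single output unit is enough here because the probability of the negative class is simply one minus the probability of the positive class.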

In [273]:
vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens = VOCAB_SIZE,
                                                                               output_mode = 'int',
                                                                               output_sequence_length = MAX_SEQUENCE_LENGTH)

In [277]:
# Make a text-only dataset (without labels), then call adapt
text_ds = train_ds.map(lambda x,y:x)

vectorize_layer.adapt(text_ds)

In [278]:
def vectorize_text(text, label):
  text = tf.expand_dims(text,-1)
  return vectorize_layer(text), label

In [285]:
train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

In [286]:
# Configure datasets for performance as before
train_ds = config_for_performance(train_ds)
val_ds = config_for_performance(val_ds)

### Train the model

In [287]:
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(input_dim=VOCAB_SIZE+1, output_dim=64, mask_zero=True),
                             tf.keras.layers.Conv1D(64, 5, strides=2, padding='valid', activation='relu'),
                             tf.keras.layers.GlobalAveragePooling1D(),
                             tf.keras.layers.Dense(1)
])

In [288]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [289]:
model.summary()

Model: "sequential_23"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, None, 64)          640064    
_________________________________________________________________
conv1d_7 (Conv1D)            (None, None, 64)          20544     
_________________________________________________________________
global_average_pooling1d_7 ( (None, 64)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 65        
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________
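
The parameter counts in the summary can be reproduced by hand. The sketch below assumes `VOCAB_SIZE = 10000`, as noted in the earlier vocabulary cell, and uses the layer settings from the model definition; the variable names are made up for this example.

```python
vocab_size = 10000   # assumed value of VOCAB_SIZE
embedding_dim = 64   # output_dim of the Embedding layer
kernel_size = 5      # Conv1D kernel size
conv_filters = 64    # Conv1D filters

# Embedding: one 64-dimensional row for each of the VOCAB_SIZE + 1 IDs.
embedding_params = (vocab_size + 1) * embedding_dim
# Conv1D: kernel_size * input_channels * filters weights, plus one bias per filter.
conv_params = kernel_size * embedding_dim * conv_filters + conv_filters
# Dense(1): one weight per pooled feature, plus one bias.
dense_params = conv_filters * 1 + 1

total = embedding_params + conv_params + dense_params
print(embedding_params, conv_params, dense_params, total)  # 640064 20544 65 660673
```

The pooling layer has no weights, which is why it contributes 0 parameters.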


In [290]:
model.fit(x=train_ds, validation_data=val_ds, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f0920f6ffd0>

In [291]:
loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.33923861384391785
Accuracy: 86.62%


### Export the model

In [292]:
export_model = tf.keras.Sequential([
                                    vectorize_layer,
                                    model,
                                    tf.keras.layers.Activation('sigmoid')
])

In [293]:
export_model.compile(optimizer='adam',
                     loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                     metrics=['accuracy'])

In [303]:
# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predictions = export_model.predict(inputs)
outputs = [round(pred[0]) for pred in predictions]
for i, o in zip(inputs, outputs):
  print('Review:',i)
  print('Label:',o)

Review: This is a fantastic movie.
Label: 1
Review: This is a bad movie.
Label: 0
Review: This movie was so bad that it was good.
Label: 0
Review: I will never say yes to watching this movie.
Label: 0
