<a href="https://colab.research.google.com/github/JpChii/ML-Projects/blob/main/Handling_text_data_from_various_sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we'll walk through how to load and preprocess text data from a variety of sources.

Resources:
* https://www.tensorflow.org/tutorials/load_data/text
* https://realpython.com/read-write-files-python/

## Using Keras API

In [3]:
!pip install "tensorflow-text==2.8.*"

Collecting tensorflow-text==2.8.*
  Downloading tensorflow_text-2.8.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.9 MB)
Collecting tf-estimator-nightly==2.8.0.dev2021122109
  Downloading tf_estimator_nightly-2.8.0.dev2021122109-py2.py3-none-any.whl (462 kB)
Installing collected packages: tf-estimator-nightly, tensorflow-text
Successfully installed tensorflow-text-2.8.1 tf-estimator-nightly-2.8.0.dev2021122109


In [4]:
# Importing the libraries
import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers, losses, utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

### Example 1: Predict the tag for a Stack Overflow question

For this example, we'll use a dataset of programming questions from Stack Overflow to predict the tag for a question. This is a multi-class classification problem.

#### Download and explore the dataset

In [5]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

In [8]:
dataset_dir = utils.get_file(origin=data_url,
                             untar=True,
                             cache_dir='stack_overflow',
                             cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz


In [9]:
list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz'),
 PosixPath('/tmp/.keras/test')]

In [10]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/python')]

In [13]:
!ls /tmp/.keras/train/java | head

0.txt
1000.txt
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
1006.txt
1007.txt
1008.txt


In [14]:
sample_file = train_dir/'java/0.txt'

with open(sample_file) as f:
  print(f.read())

"how to download .msi file in blank i want to download .msi file using blank.  i have tried to download file using following code..printwriter out = null;.fileinputstream filetodownload = null;.bufferedreader bufferedreader = null;.try {.        out = response.getwriter();.        filetodownload = new fileinputstream(download_directory + file_name);.        bufferedreader = new bufferedreader(new inputstreamreader(filetodownload));..        //response.setcontenttype(""application/text"");.        //response.setcontenttype(""application/x-msi"");.        //response.setcontenttype(""application/msi"");.        //response.setcontenttype(""octet-stream"");.        response.setcontenttype(""application/octet-stream"");.        //response.setcontenttype(""application/x-7z-compressed"");.        //response.setcontenttype(""application/zip"");.        response.setheader(""content-disposition"",""attachment; filename="" +file_name );.        response.setcontentlength(filetodownload.available())

#### Load the dataset

Next, we load the data off disk and prepare it in a format suitable for training. We'll use the `tf.keras.utils.text_dataset_from_directory` utility to create a `tf.data.Dataset`.

The train directory is already in the format the `text_dataset_from_directory` API expects: one subdirectory per class, each containing the text files for that class.
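
As a rough illustration of that layout, the sketch below builds a miniature version of the directory with the standard library only. The temporary directory and file contents are made up for the example; the point is that integer labels follow the subdirectory names in sorted order.

```python
import pathlib
import tempfile

# Hypothetical miniature of the train/ directory: one subfolder per class.
root = pathlib.Path(tempfile.mkdtemp()) / "train"
for class_name in ["python", "java", "csharp", "javascript"]:
    class_dir = root / class_name
    class_dir.mkdir(parents=True)
    (class_dir / "0.txt").write_text(f"example {class_name} question")

# Integer labels follow the sorted subdirectory names.
class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
label_of = {name: index for index, name in enumerate(class_names)}
print(label_of)
```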

In [17]:
# A test set already exists, so split the training data into training and validation sets
BATCH_SIZE = 32
SEED = 42

raw_train_ds = utils.text_dataset_from_directory(train_dir,
                                                  batch_size=BATCH_SIZE,
                                                  seed=SEED,
                                                  validation_split=0.2,
                                                  subset='training')

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [18]:
# Iterate over the dataset to get an idea of the data
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print(f"Question: {text_batch.numpy()[i]}")
    print(f"Label: {label_batch.numpy()[i]}")

Question: b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default cons

In [19]:
# The label names are 0,1,2,3. Checking the class_names property to find corresponding class names.
for i, label in enumerate(raw_train_ds.class_names):
  print(f"Label: {i}, corresponds to: {label}")

Label: 0, corresponds to: csharp
Label: 1, corresponds to: java
Label: 2, corresponds to: javascript
Label: 3, corresponds to: python


Create a validation set from the remaining 1,600 questions in the training set.

> When using the `validation_split` and `subset` arguments of `text_dataset_from_directory`, make sure to either specify a random seed or pass `shuffle=False`, so that the training and validation splits have no overlap.
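
To see why the seed matters, here is a minimal sketch in plain Python. It assumes the split is made by shuffling the file list once and slicing it, which is a simplification and not necessarily how Keras implements it; `split_files` and the file names are hypothetical.

```python
import random

def split_files(file_names, validation_split, seed):
    """Shuffle once with a fixed seed, then slice into two disjoint parts."""
    shuffled = list(file_names)
    random.Random(seed).shuffle(shuffled)
    num_val = int(len(shuffled) * validation_split)
    return shuffled[:-num_val], shuffled[-num_val:]

files = [f"{i}.txt" for i in range(8000)]  # hypothetical file names

# Two independent calls, as with subset='training' and subset='validation'.
train_a, _ = split_files(files, 0.2, seed=42)
_, val_b = split_files(files, 0.2, seed=42)
print(len(train_a), len(val_b), len(set(train_a) & set(val_b)))

# With a different seed in the second call, the two subsets can overlap.
_, val_other = split_files(files, 0.2, seed=7)
print(len(set(train_a) & set(val_other)))
```

Because both calls shuffle identically when the seed is shared, the training slice and the validation slice never contain the same file.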

In [21]:
# Create a validation set
raw_val_ds = utils.text_dataset_from_directory(
    directory=train_dir,
    batch_size=BATCH_SIZE,
    seed=SEED,
    subset='validation',
    validation_split=0.2
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.


In [23]:
test_dir = dataset_dir/'test'

# Create the test dataset
raw_test_ds = utils.text_dataset_from_directory(
    directory=test_dir,
    batch_size=BATCH_SIZE
)

Found 8000 files belonging to 4 classes.


Now that the datasets are ready, we'll **prepare the dataset for training**.

Next steps:

1. *`Standardization`* - preprocessing the text, typically to remove punctuation or HTML elements, to simplify the dataset
2. *`Tokenization`* - splitting strings into tokens (for example, on whitespace or another delimiter)
3. *`Vectorization`* - converting tokens into numbers so they can be fed into a neural network
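
The first two steps can be sketched in plain Python. This is only an approximation of lowercasing, stripping punctuation, and splitting on whitespace; the helper names `standardize` and `tokenize` are made up for the example and are not part of any library.

```python
import re
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def tokenize(text):
    """Split on runs of whitespace."""
    return text.split()

tokens = tokenize(standardize("How do I sort a List<String> in Java?"))
print(tokens)
```

Note how `List<String>` collapses into a single token once punctuation is removed. Code-heavy questions produce many such fused tokens.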

Let's accomplish these tasks using the `tf.keras.layers.TextVectorization` API.

To learn about the three techniques above, we'll try two configurations of `TextVectorization`:

* First, use the `binary` vectorization mode to build a bag-of-words model.
* Then, use the `int` mode with a 1D ConvNet.

In [24]:
VOCAB_SIZE = 10000
binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary'
)


Setting the `output_sequence_length` parameter causes the layer to pad or truncate sequences to exactly that length.

In [25]:
MAX_SEQUENCE_LENGTH = 250
int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH
)

Next, call `TextVectorization.adapt` to fit the state of the preprocessing layer to the dataset. This causes the layer to build an index that maps strings to integers.
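
Conceptually, adapting amounts to counting tokens and keeping the most frequent ones. The sketch below is a simplified stand-in, not the layer's implementation: `build_vocabulary` is a hypothetical helper, and it mirrors the `int` mode convention of reserving index 0 for padding and index 1 for unknown tokens.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Rank tokens by frequency, reserving index 0 (padding) and 1 (unknown)."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

corpus = ["the cat sat", "the dog sat", "the end"]  # toy corpus
vocabulary = build_vocabulary(corpus, max_tokens=5)
token_to_id = {token: index for index, token in enumerate(vocabulary)}

print(vocabulary)
# Tokens outside the vocabulary fall back to the [UNK] id, 1.
ids = [token_to_id.get(token, 1) for token in "the bird sat".split()]
print(ids)
```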

In [26]:
# Make a text-only dataset (without labels), then call the adapt methods
train_text = raw_train_ds.map(lambda text, labels: text)

In [28]:
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

In [40]:
# Printing the result of using these layers to preprocess data

def binary_vectorize_text(text, label):
  # Add a trailing dimension so both single strings and batches are handled
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

In [35]:
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [36]:
raw_train_ds

<BatchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [37]:
# Retrieve a batch from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print(f"Question: {first_question}")
# Map the integer label of this example back to its class name
print(f"Label: {raw_train_ds.class_names[first_label.numpy()]}")

Question: b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n'
Label: javascript


In [42]:
print(f"""
binary vectorized question:
{binary_vectorize_text(first_question, first_label)}
""")


binary vectorized question:
(<tf.Tensor: shape=(1, 10000), dtype=float32, numpy=array([[1., 1., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(), dtype=int32, numpy=2>)



In [44]:
print(f"""
int vectorized question:
{int_vectorize_text(first_question, first_label)}
""")


int vectorized question:
(<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[ 55,   6,   2, 410, 211, 229, 121, 895,   4, 124,  32, 245,  43,
          5,   1,   1,   5,   1,   1,   6,   2, 410, 211, 191, 318,  14,
          2,  98,  71, 188,   8,   2, 199,  71, 178,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,  

As seen above,

* `binary` mode produces a vector of length `VOCAB_SIZE` with a `1` at the index of every token that appears in the input and `0` elsewhere
* `int` mode replaces each token with its integer index and pads or truncates the output to `MAX_SEQUENCE_LENGTH`
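
The difference between the two output modes can be shown on a handful of made-up token ids. This is a plain-Python sketch of the idea, with hypothetical helper names, not the layer's own code.

```python
def multi_hot(token_ids, vocab_size):
    """Vector of length vocab_size with 1.0 at every id that occurs."""
    vector = [0.0] * vocab_size
    for token_id in token_ids:
        vector[token_id] = 1.0
    return vector

def pad_or_truncate(token_ids, length):
    """Cut to `length`, then right-pad with 0 (the padding id)."""
    clipped = token_ids[:length]
    return clipped + [0] * (length - len(clipped))

ids = [3, 1, 3, 2]  # toy ids for a four-token question

print(multi_hot(ids, vocab_size=6))     # order and repeats are lost
print(pad_or_truncate(ids, length=6))   # order kept, zeros appended
print(pad_or_truncate(ids, length=3))   # order kept, tail dropped
```

The bag-of-words vector discards word order and repetition, while the integer sequence preserves both, which is what lets a convolutional model look at local word patterns.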

We can look up the word for a given token index using the layer's `get_vocabulary` method.

In [45]:
print(f"1221---> {int_vectorize_layer.get_vocabulary()[1221]}")
print(f"2---> {int_vectorize_layer.get_vocabulary()[2]}")

1221---> parsing
2---> the


In [46]:
len(int_vectorize_layer.get_vocabulary())

10000

In [48]:
print(f"""
Top words in vocab: {int_vectorize_layer.get_vocabulary()[:10]},
Bottom words in vocab: {int_vectorize_layer.get_vocabulary()[-10:]}
""")


Top words in vocab: ['', '[UNK]', 'the', 'i', 'to', 'a', 'is', 'in', 'and', 'of'],
Bottom words in vocab: ['excluded', 'exceptionthe', 'evnets', 'everyvarmathfloormathrandomeveryvarlength', 'eventtargetinnerhtml', 'evalinputplease', 'euros', 'ettercap', 'etos', 'essential']



We're all set, let's apply the vectorization layer to the entire dataset.

In [51]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)



In [55]:
first_batch_question, first_batch_label = next(iter(binary_train_ds))

In [57]:
first_batch_question.shape, first_batch_label.shape

(TensorShape([32, 10000]), TensorShape([32]))

#### Configure the dataset for performance

* `Dataset.cache` keeps data in memory after it's loaded off disk. This ensures the dataset does not become a bottleneck while training the model. If the dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

* `Dataset.prefetch` overlaps data preprocessing and model execution while training.
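
The caching idea can be illustrated without TensorFlow. The sketch below is only an analogy: `CachedDataset` and `slow_load` are hypothetical names, and prefetching (overlapping loading with training) is not modeled here.

```python
class CachedDataset:
    """Replays elements from memory after the first full pass."""

    def __init__(self, load_fn, num_items):
        self._load_fn = load_fn
        self._num_items = num_items
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # First epoch: do the expensive loads and remember the results.
            self._cache = [self._load_fn(i) for i in range(self._num_items)]
        return iter(self._cache)

load_calls = []

def slow_load(index):
    load_calls.append(index)  # stands in for reading a file from disk
    return f"example-{index}"

dataset = CachedDataset(slow_load, num_items=3)
for _ in range(2):  # two "epochs"
    epoch_items = list(dataset)

print(len(load_calls))
```

Each item is loaded once even though the data is iterated twice, which is the saving that caching provides across training epochs.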

In [60]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [61]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

In [62]:
binary_train_ds

<PrefetchDataset element_spec=(TensorSpec(shape=(None, 10000), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [64]:
binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)

In [65]:
history = binary_model.fit(
    binary_train_ds,
    validation_data=binary_val_ds,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Next, create a Conv1D model for the `int` dataset.

In [67]:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      # Use the vocab_size argument rather than the global constant
      layers.Embedding(input_dim=vocab_size,
                       output_dim=64,
                       mask_zero=True),
      layers.Conv1D(filters=64,
                    kernel_size=5,
                    strides=2,
                    padding="valid",
                    activation="relu"),
      layers.GlobalMaxPool1D(),
      layers.Dense(num_labels)
  ])

  return model

In [68]:
# The vocabulary already includes the padding ('') and '[UNK]' tokens,
# so token ids range from 0 to VOCAB_SIZE - 1
int_model = create_model(vocab_size=VOCAB_SIZE,
                         num_labels=4)

In [70]:
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)

In [71]:
history = int_model.fit(
    int_train_ds,
    validation_data=int_val_ds,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [72]:
binary_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 4)                 40004     
                                                                 
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________


In [73]:
int_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          640000    
                                                                 
 conv1d (Conv1D)             (None, None, 64)          20544     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
Total params: 660,804
Trainable params: 660,804
Non-trainable params: 0
_________________________________________________________________
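
The parameter counts in the two summaries can be reproduced by hand. The short calculation below uses only the layer sizes shown in this notebook.

```python
vocab_size, embedding_dim = 10000, 64
kernel_size, filters, num_labels = 5, 64, 4

# Bag-of-words model: one Dense layer on a 10000-wide input.
binary_dense = vocab_size * num_labels + num_labels        # weights + biases

# Conv1D model, layer by layer.
embedding = vocab_size * embedding_dim                     # one vector per token id
conv1d = kernel_size * embedding_dim * filters + filters   # kernels + biases
int_dense = filters * num_labels + num_labels              # weights + biases
int_total = embedding + conv1d + int_dense

print(binary_dense)
print(embedding, conv1d, int_dense, int_total)
```

Almost all of the Conv1D model's parameters sit in the embedding table, while global max pooling adds none.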


In [74]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

Binary model accuracy: 81.30%
Int model accuracy: 80.43%


Let's include the preprocessing layer as part of the model, so that it can take raw strings as input. This makes prediction easier and simplifies deployment to production if needed.

In [76]:
export_model = tf.keras.Sequential(
    [
     binary_vectorize_layer,
     binary_model,
     layers.Activation('sigmoid')
    ]
)

export_model.compile(
    # The final sigmoid layer outputs values in (0, 1), not raw logits
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy']
)

In [78]:
export_model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 10000)            0         
 torization)                                                     
                                                                 
 sequential (Sequential)     (None, 4)                 40004     
                                                                 
 activation_1 (Activation)   (None, 4)                 0         
                                                                 
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________


In [77]:
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))



Accuracy: 81.30%


In [81]:
def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels
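
The same lookup can be traced in plain Python on made-up scores. The logits below are invented for illustration, and the `sigmoid` and `argmax` helpers are written out by hand to keep the sketch free of dependencies.

```python
import math

class_names = ["csharp", "java", "javascript", "python"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [-1.2, 0.3, 2.5, 0.1]  # made-up scores for one question
scores = [sigmoid(x) for x in logits]

# Sigmoid is monotonic, so the winning index is the same before and after it.
same_winner = argmax(logits) == argmax(scores)
predicted = class_names[argmax(scores)]
print(same_winner, predicted)
```

This is why the activation appended to the exported model changes the scale of the scores but not the predicted label.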

In [82]:
inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
# Use `question` rather than `input` to avoid shadowing the built-in
for question, label in zip(inputs, predicted_labels):
  print("Question: ", question)
  print("Predicted label: ", label.numpy())

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'


### **Summary:**

1. Download a dataset from a URL using `utils.get_file`
2. Load a text dataset from a directory using `utils.text_dataset_from_directory`
3. Standardize, tokenize, and vectorize text using the `TextVectorization` layer
4. Apply the vectorization layer to the entire dataset by mapping a function over it
5. Prevent I/O bottlenecks using the `tf.data` API (`cache` and `prefetch`)