# Text Data Pipelines


In this notebook, we'll cover the basics of building data pipelines for text data. A well-built pipeline lets us clean, vectorize, and batch text efficiently for tasks such as sentiment analysis or machine translation.

We'll start by reading text data from directories, then preprocess the data to clean and prepare it for modeling, and finally build a data pipeline that feeds it into a machine learning model.

By the end of this notebook, you will have a good understanding of how to build and use text data pipelines in practice.

## Table of Contents <a name="toc"></a>
- [Text Dataset from Directory](#text-dataset-from-directory)
- [Text Vectorization Layer](#text-vectorization-layer)
- [Model Training](#model-training)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import re
import string

In [2]:
# Download the dataset
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# we will be using the tf.keras.utils.get_file method to download the dataset and extract it automatically
dataset = tf.keras.utils.get_file(
    "aclImdb", url, untar=True, cache_dir=".", cache_subdir=""
)

# remove extra class that we will not be using
!rm -rf aclImdb/train/unsup

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


Note that the data is structured as follows:

```
aclImdb
├── test
│   ├── neg
|   |   ├── 0_2.txt
|   |   ├── 10000_4.txt
|   |   ├── ...
│   └── pos
|       ├── 0_10.txt
|       ├── 10000_7.txt
|       ├── ...
└── train
    ├── neg
    |   ├── 0_3.txt
    |   ├── 10000_4.txt
    |   ├── ...
    └── pos
        ├── 0_8.txt
        ├── 10000_7.txt
        ├── ...
```

This is similar to the structure of the cats vs. dogs dataset we used in the previous week, but instead of images, we have text files, each containing one review.

In [3]:
# first, we will set some parameters
vocab_size = 8000  # number of words in the vocabulary, we will use the top 8000 most common words
max_length = 120  # maximum length of a review, we will truncate reviews longer than 120 words and pad reviews shorter than 120 words
embedding_dim = 50  # dimension of the embedding vector, we will use 50-dimensional embedding vectors
batch_size = 32  # number of reviews in each batch
seed = 42  # random seed

## Text Dataset from Directory <a name="text-dataset-from-directory"></a>
[Back to top](#toc)

`tf.keras.utils.text_dataset_from_directory` is a function that reads text files from a directory tree and returns a `tf.data.Dataset` of batched `(text, label)` pairs. It's ideal for large text datasets organized with one subdirectory per class, because files are read batch by batch rather than loaded into memory all at once.

This function has the following key parameters:

- `directory`: The directory containing the text data. When labels are inferred, it should contain one subdirectory per class, each holding the text files for that class.
- `labels`: Either `"inferred"` (the default), which derives the labels from the directory structure, `None` for no labels, or a list of integer labels, one for each text file.
- `label_mode`: How the labels are encoded. The default value is `"int"`, which returns an integer label for each text file. The other options are `"binary"`, which returns a `float32` label of 0 or 1 (only valid with two classes), `"categorical"`, which returns a one-hot vector, and `None`, which returns no labels.

For full documentation, see the [tf.keras.utils.text_dataset_from_directory doc](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory).
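To make the idea of inferred labels concrete, here is a minimal pure-Python sketch: class names are the subdirectory names in sorted order, and every file gets the index of its parent directory. The helper `infer_labels`, the temporary directory, and the review texts are illustrative assumptions, not part of TensorFlow, and the real function does considerably more (batching, shuffling, lazy reading).

```python
import tempfile
from pathlib import Path


def infer_labels(directory):
    """Hypothetical helper: pair each .txt file with the index of its class folder."""
    root = Path(directory)
    # class names are the subdirectory names in sorted order
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            samples.append((path.read_text(encoding="utf-8"), index))
    return class_names, samples


# build a tiny stand-in for aclImdb/train (the review texts are made up)
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("neg", "dull and slow"), ("pos", "a real delight")]:
        class_dir = Path(tmp) / name
        class_dir.mkdir()
        (class_dir / "0_1.txt").write_text(text, encoding="utf-8")

    class_names, samples = infer_labels(tmp)

print(class_names)  # ['neg', 'pos']
print(samples)      # [('dull and slow', 0), ('a real delight', 1)]
```

With this ordering, `neg` maps to label 0 and `pos` to label 1.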

In [4]:
# read the train and test datasets
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size, seed=seed
)

raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size, seed=seed
)

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [5]:
# preview samples from the training dataset
for text_batch, label_batch in raw_train_ds.take(1):
    print("Review:", text_batch[0])
    print("Label:", label_batch[0])

Review: tf.Tensor(b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)', shape=(), dtype=string)
Label: tf.Tensor(0, shape=(), dtype=int32)


2023-02-05 02:43:40.935256: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


## Text Vectorization Layer <a name="text-vectorization-layer"></a>
[Back to top](#toc)

Text vectorization is a preprocessing step in NLP that converts raw text into numerical representations, such as sequences of integer word indices. This is a crucial step because machine learning models work with numerical data, not with raw text.

TensorFlow and Keras offer several tools for text vectorization. In the previous lesson, we used `tf.keras.preprocessing.text.Tokenizer`, which only tokenizes the text data. In this lesson, we will use the `tf.keras.layers.TextVectorization` layer (found under `tf.keras.layers.experimental.preprocessing` in older TensorFlow releases), which takes care of the whole preprocessing pipeline: standardization, tokenization, vocabulary indexing, and padding.

The `TextVectorization` layer has the following key parameters:
- `max_tokens`: The maximum size of the vocabulary. Only the most frequent words are kept, and the limit includes the reserved padding and out-of-vocabulary (OOV) entries.
- `output_mode`: The output mode of the layer. The default value is `int`, which returns a sequence of integer word indices. The other options are `multi_hot` (named `binary` in older releases), which returns one vector per example marking which vocabulary words are present, `count`, which returns how many times each word occurs, and `tf_idf`, which weights those counts by TF-IDF.
- `output_sequence_length`: The length of the output sequences, used only in `int` mode. If the input sequence is shorter than this value, the output sequence will be padded with zeros. If the input sequence is longer than this value, the output sequence will be truncated.
- `standardize`: The standardization to apply to the text. The default value is `lower_and_strip_punctuation`, which converts the text to lowercase and strips punctuation. This parameter can also be set to a custom function to apply a custom standardization.
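The interaction between the vocabulary limit and the output length is easier to see on a toy corpus. The following is a pure-Python sketch of the idea behind `int` mode, not the layer's actual implementation: index 0 is reserved for padding, index 1 for out-of-vocabulary words, and the remaining slots go to the most frequent words. The helpers `build_vocabulary` and `vectorize` and the corpus are made up for illustration.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words; index 0 is padding, index 1 is out-of-vocabulary."""
    counts = Counter(word for text in texts for word in text.split())
    # two slots are reserved, so only max_tokens - 2 real words fit
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(most_common, start=2)}


def vectorize(text, vocabulary, output_sequence_length):
    """Map words to integer ids, then truncate or zero-pad to a fixed length."""
    ids = [vocabulary.get(word, 1) for word in text.split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))


corpus = ["the movie was great", "the movie was bad", "the movie", "the"]
vocabulary = build_vocabulary(corpus, max_tokens=5)
long_ids = vectorize("the movie was great fun", vocabulary, output_sequence_length=4)
short_ids = vectorize("the plot", vocabulary, output_sequence_length=4)

print(vocabulary)  # {'the': 2, 'movie': 3, 'was': 4}
print(long_ids)    # [2, 3, 4, 1]  (truncated, "great" is out of vocabulary)
print(short_ids)   # [2, 1, 0, 0]  (padded with zeros)
```

The same pattern shows up in the real pipeline output: 1 marks words outside the vocabulary, and trailing zeros are padding.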


In [6]:
# let's create a custom standardization function similar to the one we used in the previous notebook but using TensorFlow operations
def custom_standardization(text):
    # change all text to lowercase
    text = tf.strings.lower(text)

    # replace HTML tags with a space so the words around them do not merge
    text = tf.strings.regex_replace(text, r"<.*?>", " ")

    # remove numbers and words containing digits in one pass
    # (stripping bare digits first would leave fragments such as "b" from "b2")
    text = tf.strings.regex_replace(text, r"\w*\d\w*", "")

    # remove URLs
    text = tf.strings.regex_replace(text, r"https?://\S+", "")

    # remove emails
    text = tf.strings.regex_replace(text, r"\S+@\S+", "")

    # remove mentions (@username)
    text = tf.strings.regex_replace(text, r"@\S+", "")

    # remove hashtags (#)
    text = tf.strings.regex_replace(text, r"#", "")

    # replace punctuation with a space
    text = tf.strings.regex_replace(text, f"[{re.escape(string.punctuation)}]", " ")

    # collapse repeated whitespace into a single space
    text = tf.strings.regex_replace(text, r"\s+", " ")

    return text
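To check what some of these patterns actually match, they can be tried with Python's `re` module on a sample string. This is only a sketch: the sample sentence is made up, each pattern is applied independently to the raw string (the function above applies them one after another to lowercased text), and `tf.strings.regex_replace` uses the RE2 engine, which is assumed to behave the same way for these simple patterns.

```python
import re

# made-up sample containing a URL, an email, an HTML tag, and digits
sample = "See https://example.com or mail me@example.com now.<br />Rated 7/10, a B2 movie!"

patterns = {
    "html tag": r"<.*?>",
    "word with digits": r"\w*\d\w*",
    "url": r"https?://\S+",
    "email": r"\S+@\S+",
}

# collect every substring each pattern would remove
matches = {name: re.findall(pattern, sample) for name, pattern in patterns.items()}

for name, found in matches.items():
    print(f"{name}: {found}")
```

Note that `\w*\d\w*` also matches standalone numbers such as `7` and `10`, so it covers both plain numbers and words that contain digits.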

In [7]:
# create a TextVectorization layer with our custom standardization function and other parameters
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=max_length,
)

# make a text-only dataset (without labels), then call adapt to build the vocabulary
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

2023-02-05 02:43:45.637438: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


In [8]:
# create a pipeline mapping function to vectorize the text and pass the label through
def vectorize_text(text, label):
    # text is already a batch of strings with shape (batch_size,);
    # add a trailing axis so the layer receives shape (batch_size, 1)
    text = tf.expand_dims(text, -1)
    # apply the vectorization layer to the text
    text = vectorize_layer(text)
    return text, label


# create a function that applies vectorization and prefetching to a dataset
def dataset_creator(dataset):
    # vectorize each batch of text, processing batches in parallel
    dataset = dataset.map(vectorize_text, num_parallel_calls=tf.data.AUTOTUNE)

    # prefetch batches so data preparation overlaps with model execution
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

    return dataset


# create the training and test datasets
train_ds = dataset_creator(raw_train_ds)
test_ds = dataset_creator(raw_test_ds)
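The idea behind prefetching can be illustrated with a small pure-Python analogy: a background thread prepares upcoming batches and places them in a bounded buffer while the consumer works on the current one. This is a conceptual sketch only; `tf.data` implements prefetching inside its own runtime, and the helpers `prefetch` and `count_words` are hypothetical names.

```python
import queue
import threading


def prefetch(iterable, buffer_size):
    """Yield items while a background thread stays up to buffer_size items ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def count_words(batch):
    # stand-in for the vectorization step applied to each batch
    return [len(text.split()) for text in batch]


raw_batches = [["a fine film", "dull"], ["not bad at all", "awful pacing"]]
result = list(prefetch(map(count_words, raw_batches), buffer_size=1))
print(result)  # [[3, 1], [4, 2]]
```

The output is identical to what a plain loop would produce; only the timing changes, because preparing the next batch overlaps with consuming the current one.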

In [9]:
# preview samples from the training dataset
for x_batch, y_batch in train_ds.take(1):
    print("X batch shape:", x_batch.shape, "Y batch shape:", y_batch.shape)
    print("X:", x_batch[0])
    print("Y:", y_batch[0])

X batch shape: (32, 120) Y batch shape: (32,)
X: tf.Tensor(
[  84   18  256    2  223    1  566   32  232   11 2436    1   54   22
   28  413  254   12  315  278    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0], shape=(120,), dtype=int64)
Y: tf.Tensor(0, shape=(), dtype=int32)


## Model Training <a name="model-training"></a>
[Back to top](#toc)

In this section, we will build a GRU-based binary classifier and train it on the output of the data pipeline we built in the previous section.

In [10]:
gru_model = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
        tf.keras.layers.GRU(64, activation="tanh"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)

gru_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 120, 50)           400000    
                                                                 
 gru (GRU)                   (None, 64)                22272     
                                                                 
 dense (Dense)               (None, 64)                4160      
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 428,545
Trainable params: 428,545
Non-trainable params: 0
_________________________________________________________________
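The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below assumes the Keras default `reset_after=True` for the GRU, which gives each of its three weight sets (update gate, reset gate, candidate state) two bias vectors.

```python
vocab_size = 8000
embedding_dim = 50
gru_units = 64

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets, each with input weights, recurrent weights,
# and two bias vectors (assumes the default reset_after=True)
gru_params = 3 * (gru_units * (embedding_dim + gru_units) + 2 * gru_units)


def dense_params(inputs, units):
    # Dense: inputs * units weights plus one bias per unit
    return inputs * units + units


# (inputs, units) for the three Dense layers in the model
dense_sizes = [(gru_units, 64), (64, 32), (32, 1)]
dense_counts = [dense_params(i, u) for i, u in dense_sizes]

total = embedding_params + gru_params + sum(dense_counts)
print(embedding_params, gru_params, dense_counts, total)
# 400000 22272 [4160, 2080, 33] 428545
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.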


In [None]:
# compile the model
gru_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train the model
gru_model.fit(train_ds, epochs=10, validation_data=test_ds)
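
For intuition about the `binary_crossentropy` loss and the `accuracy` metric reported during training, here is a pure-Python sketch of both on a made-up batch. The clipping constant `epsilon=1e-7` and the 0.5 decision threshold are assumptions chosen to mirror common Keras defaults, and the probabilities are invented rather than taken from the model.

```python
import math


def binary_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = []
    for y, p in zip(y_true, y_pred):
        p = min(max(p, epsilon), 1 - epsilon)  # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


labels = [0, 1, 1, 0]
probabilities = [0.1, 0.8, 0.6, 0.7]  # made-up sigmoid outputs

loss = binary_crossentropy(labels, probabilities)
# a prediction counts as positive when its probability exceeds 0.5
accuracy = sum((p > 0.5) == bool(y) for y, p in zip(labels, probabilities)) / len(labels)

print(round(loss, 4), accuracy)  # 0.5108 0.75
```

The last example is confidently wrong (0.7 for a negative review), so it contributes the largest share of the loss even though three of the four predictions are correct.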