# Reusable Embeddings

**Learning Objectives**
1. Learn how to use pre-trained Kaggle Hub text modules to generate sentence vectors
1. Learn how to incorporate a pre-trained Kaggle Hub module into a Keras model


## Introduction


In this notebook, we will implement text models to recognize the probable source (GitHub, TechCrunch, or The New York Times) of the titles we have in the title dataset.

First, we will load and pre-process the texts and labels so that they can be fed to sequential Keras models whose first layer is a pre-trained Kaggle Hub module. Thanks to this first layer, we won't need to tokenize and integerize the text before passing it to our models: the pre-trained layer takes care of that for us and consumes raw text directly. However, we will still have to one-hot-encode each of the 3 classes into a 3-dimensional basis vector.

Then we will build, train, and compare simple DNN models starting with one or more pre-trained Kaggle Hub layers.

In [None]:
import os
import warnings

import pandas as pd
from google.cloud import bigquery

warnings.filterwarnings("ignore")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# Set `PATH` to include the directory containing saved_model_cli
%load_ext tensorboard
PATH = %env PATH
%env PATH=/home/jupyter/.local/bin:{PATH}

Replace the variable values in the cell below:

In [None]:
PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION

In [None]:
%%bash
gcloud config set project $PROJECT
gcloud config set ai/region $REGION

## Create a Dataset from BigQuery 

Hacker News headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the site's inception in October 2006 until October 2015.

Here is a sample of the dataset:

In [None]:
%%bigquery --project $PROJECT

SELECT
    url, title, score
FROM
    `bigquery-public-data.hacker_news.full`
WHERE
    LENGTH(title) > 10
    AND score > 10
    AND LENGTH(url) > 0
LIMIT 10

Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with <i>nytimes</i>.

In [None]:
%%bigquery --project $PROJECT

SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[SAFE_OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.full`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY num_articles DESC
  LIMIT 100
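
The URL parsing in the query above can also be sketched in plain Python to see what each step does. This is only an illustration, assuming Python's `re` module treats this particular pattern the same way BigQuery does; the helper name `extract_source` is hypothetical and not part of the notebook.

```python
import re

# Same pattern as in the query: capture everything between "://" and the next "/".
HOST_PATTERN = re.compile(r".*://(.[^/]+)/")


def extract_source(url):
    """Return the host label just before the top-level domain, or None."""
    match = HOST_PATTERN.match(url)
    if match is None:
        # Mirrors REGEXP_EXTRACT returning NULL when nothing matches.
        return None
    labels = match.group(1).split(".")
    # ARRAY_REVERSE(...)[SAFE_OFFSET(1)] picks the second label from the end;
    # SAFE_OFFSET gives NULL when the host has only one label.
    return labels[-2] if len(labels) >= 2 else None


print(extract_source("http://mobile.nytimes.com/2015/some-article"))  # nytimes
print(extract_source("https://github.com/tensorflow/tensorflow"))  # github
```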

Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.

In [None]:
regex = ".*://(.[^/]+)/"


sub_query = """
SELECT
    title,
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[SAFE_OFFSET(1)] AS source
    
FROM
    `bigquery-public-data.hacker_news.full`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
    AND LENGTH(title) > 10
""".format(
    regex
)


query = """
SELECT 
    LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
    source
FROM
  ({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(
    sub_query=sub_query
)

print(query)
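
The title clean-up done by `LOWER(REGEXP_REPLACE(...))` in the query can be mirrored in Python as a rough sketch. The function name `normalize_title` is an assumption made for illustration; the character class is copied from the query.

```python
import re


def normalize_title(title):
    """Replace every character that is not a letter, digit, space,
    "$", "." or "-" with a space, then lowercase the result."""
    return re.sub(r"[^a-zA-Z0-9 $.-]", " ", title).lower()


# Punctuation becomes spaces, so runs of spaces can appear in the output.
print(repr(normalize_title("Hello, World!")))  # 'hello  world '
```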

For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML, however, figures out on its own how to create these splits, so we won't need to do that when exporting the data here; the Keras models later in this notebook use an explicit train/validation split.



In [None]:
bq = bigquery.Client(project=PROJECT)
title_dataset = bq.query(query).to_dataframe()
title_dataset.head()

AutoML for text classification requires that
* the dataset be in CSV form, with
* the first column being the texts to classify or a GCS path to the text
* the last column being the text labels

The dataset we pulled from BigQuery satisfies these requirements.

In [None]:
print(f"The full dataset contains {len(title_dataset)} titles")

Let's make sure we have roughly the same number of examples for each of our three labels:

In [None]:
title_dataset.source.value_counts()

Finally we will save our data, which is currently in-memory, to disk.

We will create a csv file containing the full dataset and another containing only 1000 articles for development.

**Note:** It may take a long time to train AutoML on the full dataset, so we recommend using the sample dataset for the purpose of learning the tool.


In [None]:
DATADIR = "./data/"

if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

In [None]:
FULL_DATASET_NAME = "titles_full.csv"
FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)

# Let's shuffle the data before writing it to disk.
title_dataset = title_dataset.sample(n=len(title_dataset))

title_dataset.to_csv(
    FULL_DATASET_PATH, header=False, index=False, encoding="utf-8"
)

## Re-usable Embedding Models

In [None]:
import datetime
import os
import shutil

import keras
import pandas as pd
import tensorflow as tf
from keras.callbacks import EarlyStopping, TensorBoard
from keras.layers import Dense, Input, Lambda, TextVectorization
from keras.models import Model
from keras.utils import to_categorical
from tensorflow_hub import KerasLayer

print(tf.__version__)

In [None]:
%matplotlib inline

Let's start by specifying where the information about the trained models will be saved as well as where our dataset is located:

In [None]:
MODEL_DIR = f"gs://{BUCKET}/text_models"

## Loading the dataset

As in the previous labs, our dataset consists of article titles along with a label indicating the source each article was taken from (GitHub, TechCrunch, or The New York Times):

In [None]:
ls $DATADIR

In [None]:
DATASET_NAME = "titles_full.csv"
TITLE_SAMPLE_PATH = os.path.join(DATADIR, DATASET_NAME)
COLUMNS = ["title", "source"]

titles_df = pd.read_csv(TITLE_SAMPLE_PATH, header=None, names=COLUMNS)
titles_df.head()

Let's look again at the number of examples per label to make sure we have a well-balanced dataset:

In [None]:
titles_df.source.value_counts()

## Preparing the labels

In this lab, we will use pre-trained [Kaggle Hub embedding modules for English](https://tfhub.dev/s?q=tf2%20embeddings%20text%20english) for the first layer of our models. One immediate advantage of doing so is that the Kaggle Hub embedding module will take care of processing the raw text for us.
This also means that our model will be able to consume text directly instead of sequences of integers representing the words.

However, as before, we still need to preprocess the labels into one-hot-encoded vectors:

In [None]:
CLASSES = {"github": 0, "nytimes": 1, "techcrunch": 2}
N_CLASSES = len(CLASSES)

In [None]:
def encode_labels(sources):
    classes = [CLASSES[source] for source in sources]
    one_hots = to_categorical(classes, num_classes=N_CLASSES)
    return one_hots

In [None]:
encode_labels(titles_df.source[:4])
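
To make the encoding concrete, here is a plain-Python sketch of what one-hot encoding produces for these three classes. The helper `one_hot` is hypothetical and returns nested lists, whereas `to_categorical` returns a NumPy array; the values should otherwise correspond.

```python
CLASSES = {"github": 0, "nytimes": 1, "techcrunch": 2}


def one_hot(sources, classes=CLASSES):
    """Map each source name to a basis vector with a 1.0 at its class index."""
    vectors = []
    for source in sources:
        row = [0.0] * len(classes)
        row[classes[source]] = 1.0
        vectors.append(row)
    return vectors


print(one_hot(["nytimes", "github"]))
# [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
```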

## Preparing the train/test splits

Let's split our data into train and test splits:

In [None]:
N_TRAIN = int(len(titles_df) * 0.95)

titles_train, sources_train = (
    titles_df.title[:N_TRAIN],
    titles_df.source[:N_TRAIN],
)

titles_valid, sources_valid = (
    titles_df.title[N_TRAIN:],
    titles_df.source[N_TRAIN:],
)

To be on the safe side, we verify that the train and test splits
have roughly the same number of examples per class.

Since this is the case, accuracy will be a good metric to use to measure
the performance of our models.

In [None]:
sources_train.value_counts()

In [None]:
sources_valid.value_counts()

Now let's create the features and labels we will feed our models with:

In [None]:
X_train, Y_train = titles_train.values, encode_labels(sources_train)
X_valid, Y_valid = titles_valid.values, encode_labels(sources_valid)

In [None]:
X_train[:3]

In [None]:
Y_train[:3]

## NNLM Model

We will first try a word embedding pre-trained using a [Neural Probabilistic Language Model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf). Kaggle-Hub has a 50-dimensional one called 
[nnlm-en-dim50-with-normalization](https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1), which also
normalizes the vectors produced. 

### Lab Task 1a: Import NNLM TF Hub module into `KerasLayer`

Once loaded from its URL, the Kaggle Hub module can be used as a layer in a Keras model. Since we have enough data to fine-tune the parameters of the pre-trained embedding itself, we will set `trainable=True` in the `KerasLayer` that loads the pre-trained embedding.

In [None]:
NNLM = "https://tfhub.dev/google/nnlm-en-dim50/2"

nnlm_module = KerasLayer(
    # TODO
)

Note that this Kaggle-Hub embedding produces a single 50-dimensional vector when passed a sentence:

### Lab Task 1b: Use module to encode a sentence string

In [None]:
nnlm_module(
    tf.constant(
        [
            # TODO
        ]
    )
)

## Swivel Model

Then we will try a word embedding obtained using [Swivel](https://arxiv.org/abs/1602.02215), an algorithm that essentially factorizes word co-occurrence matrices to create the word embeddings.
Kaggle-Hub hosts the pretrained [gnews-swivel-20dim-with-oov](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1) 20-dimensional Swivel module.

### Lab Task 1c: Import Swivel TF Hub module into `KerasLayer`

In [None]:
SWIVEL = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"

swivel_module = KerasLayer(
    # TODO
)

Like the previous pre-trained embedding, it outputs a single vector when passed a sentence:

### Lab Task 1d: Use module to encode a sentence string

In [None]:
swivel_module(
    tf.constant(
        [
            # TODO
        ]
    )
)

## Building the models

Write a function, `build_model`, that uses the Keras Functional API to construct a Keras model designed for text classification by leveraging a pre-trained Kaggle Hub embedding layer. It creates a neural network architecture that can be trained and evaluated.


Here's a breakdown of the model we build and how it works:

- Input Layer: The model begins by defining an Input layer that is configured to accept raw text strings. This allows the model to process text directly without needing to first convert it to a numerical format.

- Embedding Layer: The hub_module, which is a pre-trained TensorFlow Hub model, is then wrapped in a Lambda layer. This is a necessary step to make the TensorFlow Hub module compatible with the Keras functional API, allowing it to be used like any other Keras layer. The module_output_dim parameter specifies the dimensionality of the embedding that the hub_module produces.

- Classification Head: Following the embedding layer, two (or more, if you decide to have more) fully connected (Dense) layers form the classification part of the network.

- Model Assembly and Compilation: The Model is created by specifying its input and output layers. It is then compiled with the adam optimizer, categorical_crossentropy as the loss function (which is standard for multi-class classification), and accuracy as the evaluation metric.

### Lab Task 2: Incorporate a pre-trained TF Hub module as first layer of a Keras model

In [None]:
def build_model(hub_module, module_output_dim, name):
    # Define the input layer for raw text strings
    # TODO

    # Wrap the hub_module call in a Lambda layer so it can output a KerasTensor
    # TODO

    # Add a Dense layer
    # TODO

    # Add the final Dense layer for classification
    # TODO

    # Create the model
    model = Model(inputs=inputs, outputs=outputs, name=name)

    # Compile the model
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    return model
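
The loss used when compiling the model can be illustrated with a small, dependency-free sketch. It assumes the final Dense layer uses a softmax activation, which the notebook leaves as part of the exercise; the function names below are illustrative and are not the Keras implementations.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_label, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(
        y * math.log(p) for y, p in zip(one_hot_label, probabilities) if y > 0
    )


# With three equal scores, every class gets probability 1/3,
# so the loss equals log(3) whichever class is the true one.
probs = softmax([0.0, 0.0, 0.0])
loss = categorical_crossentropy([0.0, 1.0, 0.0], probs)
print(probs, loss)
```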

Let's also wrap the training code into a `train_and_evaluate` function that
* takes as input the training and validation data, as well as the compiled model itself, and the `batch_size`
* trains the compiled model for at most 100 epochs, stopping early when the validation loss is no longer decreasing
* returns a `history` object, which will help us plot the learning curves

In [None]:
def train_and_evaluate(train_data, val_data, model, batch_size=5000):
    X_train, Y_train = train_data

    model_dir = os.path.join(MODEL_DIR, model.name)
    if tf.io.gfile.exists(model_dir):
        tf.io.gfile.rmtree(model_dir)

    history = model.fit(
        X_train,
        Y_train,
        epochs=100,
        batch_size=batch_size,
        validation_data=val_data,
        callbacks=[EarlyStopping(patience=3), TensorBoard(model_dir)],
    )
    return history
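
The effect of `EarlyStopping(patience=3)` can be approximated with the simplified sketch below. It assumes the callback monitors the validation loss with no minimum improvement threshold, and it ignores other options of the real callback. The function name `stopping_epoch` and the loss values are made up for illustration; they are not results from this notebook.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None
    if the loss keeps improving often enough to use every epoch."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Illustrative values only: the loss improves for three epochs, then stalls.
toy_val_losses = [0.9, 0.7, 0.6, 0.61, 0.63, 0.62, 0.5]
print(stopping_epoch(toy_val_losses))  # 5
```

In this toy run, training stops at epoch 5, after three epochs without improvement, so the final value in the list is never reached.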

## Training NNLM

In [None]:
data = (X_train, Y_train)
val_data = (X_valid, Y_valid)

In [None]:
nnlm_model = build_model(
    hub_module=nnlm_module, module_output_dim=50, name="nnlm"
)
nnlm_history = train_and_evaluate(data, val_data, nnlm_model)

In [None]:
history = nnlm_history
pd.DataFrame(history.history)[["loss", "val_loss"]].plot()
pd.DataFrame(history.history)[["accuracy", "val_accuracy"]].plot()

## Training Swivel

In [None]:
swivel_model = build_model(
    hub_module=swivel_module, module_output_dim=20, name="swivel"
)

In [None]:
swivel_history = train_and_evaluate(data, val_data, swivel_model)

In [None]:
history = swivel_history
pd.DataFrame(history.history)[["loss", "val_loss"]].plot()
pd.DataFrame(history.history)[["accuracy", "val_accuracy"]].plot()

## Comparing the models

While the static plots are useful, an interactive visualization allows for a more direct comparison of the two models' performance. TensorBoard enables you to overlay the metrics for both the Swivel and NNLM models on a single, interactive graph. This will clearly show that Swivel trains faster, but NNLM converges to a superior validation accuracy with fewer training epochs. To explore these results, launch the TensorBoard dashboard with the following command.

In [None]:
%tensorboard --logdir $MODEL_DIR --port 8008

Copyright 2025 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License