# Custom TF-Hub Word Embedding with text2hub

**Learning Objectives:**
  1. Learn how to deploy an AI Hub Kubeflow pipeline
  1. Learn how to configure the run parameters for text2hub
  1. Learn how to inspect text2hub generated artifacts and word embeddings in TensorBoard
  1. Learn how to run TF 1.x generated hub module in TF 2.0


## Introduction


Pre-trained reusable text embeddings such as TF-Hub modules are a great tool for text models, since they capture relationships between words. They are generally trained on vast but generic text corpora like Wikipedia or Google News, which usually makes them very good at representing generic text, but less so when the text comes from a very specialized domain with words unique to that domain, such as the medical field.
One problem in particular is that applying a TF-Hub text module pre-trained on a generic corpus to a specialized text will send all the unique domain words to the same out-of-vocabulary (OOV) vector, often the zero vector. By doing so we lose a very valuable part of the text information, because for specialized texts the most informative words are often those that are specific to that domain.
Another issue is that of commonly misspelled words in text gathered from, say, customer feedback. Applying a generic pre-trained embedding will send each misspelled word to the OOV vector, losing precious information. However, a TF-Hub module tailored to the texts coming from that customer feedback feed will place the common misspellings of a given word close to the original word in the embedding space.
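
To make the OOV problem concrete, here is a minimal sketch of a fixed-vocabulary lookup. The vocabulary, the vectors, and the `embed_word` helper are all made up for illustration, and the sketch assumes unknown words map to the zero vector; real TF-Hub modules may handle OOV words differently (for example with hash buckets).

```python
# Toy illustration of out-of-vocabulary (OOV) collapse with a fixed vocabulary.
# The vocabulary and the vectors below are invented for this example.
GENERIC_EMBEDDINGS = {
    "the": [0.1, 0.3],
    "patient": [0.7, 0.2],
    "bone": [0.5, 0.9],
}
OOV_VECTOR = [0.0, 0.0]  # assumption: unknown words map to the zero vector


def embed_word(word, table=GENERIC_EMBEDDINGS):
    """Return the vector for `word`, or the shared OOV vector if it is unknown."""
    return table.get(word.lower(), OOV_VECTOR)


# Two unrelated domain terms and a misspelling all collapse to the same vector,
# while the correctly spelled generic word keeps its own vector.
for word in ["ilium", "aneurism", "pateint", "patient"]:
    print(word, embed_word(word))
```

In this toy setting, "ilium", "aneurism", and the misspelled "pateint" become indistinguishable to any downstream model, which is the information loss described above.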

In this notebook, we will learn how to generate a text TF-Hub module specific to a particular domain using the text2hub Kubeflow pipeline available on Google AI Hub. This pipeline takes as input a corpus of text stored in a GCS bucket and outputs a TF-Hub module to a GCS bucket. The generated TF-Hub module can then be reused in both TF 1.x and TF 2.0 code by referencing the output GCS bucket path when loading the module.

Our first order of business will be to learn how to deploy a Kubeflow pipeline, namely text2hub, stored in AI Hub to a Kubeflow cluster. Then we will dig into the pipeline run parameter configuration and review the artifacts produced by the pipeline during its run. These artifacts are meant to help you assess the quality of the domain-specific TF-Hub module you generated. In particular, we will explore the embedding space visually using the TensorBoard projector, which provides a tool to list the nearest neighbors of a given word in the embedding space.
Finally, we will explain how to run the generated module in both TF 1.x and TF 2.0. Running the module in TF 2.0 requires a small trick that is useful to know in itself, because it allows you to use any TF 1.x module from TF Hub in TF 2.0 as a Keras layer.



In [11]:
import os

import tensorflow as tf
import tensorflow_hub as hub

Replace with your GCP project and bucket:

In [24]:
PROJECT = "<YOUR PROJECT>"
BUCKET = "<YOUR BUCKET>"

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET

## Setting up the Kubeflow cluster

We assume that you have a running Kubeflow cluster. 

If not, to deploy a [Kubeflow](https://www.kubeflow.org/) cluster in your GCP project, use the [Kubeflow cluster deployer](https://deploy.kubeflow.cloud/#/deploy).

There is a [setup video](https://www.kubeflow.org/docs/started/cloud/getting-started-gke/~) that will
take you through all the steps in detail and explains how to access the Kubeflow Dashboard UI once it is
running.

You'll need to create an OAuth client for authentication purposes; follow the
instructions [here](https://www.kubeflow.org/docs/gke/deploy/oauth-setup/).

## Loading the dataset in GCS

The corpus we chose is one of the [Project Gutenberg medical texts](http://www.gutenberg.org/ebooks/bookshelf/48): [A Manual of the Operations of Surgery](http://www.gutenberg.org/ebooks/24564) by Joseph Bell, which contains very specialized language.

The first thing to do is to upload the text into a GCS bucket:

In [14]:
%%bash

URL=http://www.gutenberg.org/cache/epub/24564/pg24564.txt
OUTDIR=gs://$BUCKET/custom_embedding
CORPUS=surgery_manual.txt

curl $URL > $CORPUS
gsutil cp $CORPUS $OUTDIR/$CORPUS

Copying file://surgery_manual.txt [Content-Type=text/plain]...
Operation completed over 1 objects/608.3 KiB.


The text contains very specialized language, such as:

```
On the surface of the abdomen the position of this vessel would be 
indicated by a line drawn from about an inch on either side of the 
umbilicus to the middle of the space between the symphysis pubis 
and the crest of the ilium.
```

Now let's go over the steps involved in creating your own embedding from that corpus.

## Step 1: Download the `text2hub` pipeline from AI Hub

Go to [AI Hub](https://aihub.cloud.google.com/u/0/) and search for the `text2hub` pipeline, or just follow [this link](https://aihub.cloud.google.com/u/0/p/products%2F4a91d2d0-1fb8-4e79-adf7-a35707071195).
You'll land on a page describing `text2hub`. Click on the "Download" button on that page to download the Kubeflow pipeline.

![img](./assets/text2hub_download.png)

The `text2hub` pipeline is a Kubeflow pipeline that comprises three components:


* The **text2cooc** component that computes a word co-occurrence matrix
from a corpus of text

* The **cooc2emb** component that factorizes the
co-occurrence matrix using [Swivel](https://arxiv.org/pdf/1602.02215.pdf) into
the word embeddings exported as a tsv file

* The **emb2hub** component that takes the word
embedding file and generates a TF Hub module from it
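
As a rough illustration of what the first component computes, the following sketch counts word co-occurrences within a sliding window. It is a simplified stand-in, not the actual `text2cooc` implementation: the whitespace tokenization, the unweighted symmetric counts, and the `cooccurrence_counts` helper are assumptions made for this example. The `window_size` and `min_count` arguments loosely mirror the pipeline's window-size and minimum-occurrence run parameters.

```python
from collections import Counter


def cooccurrence_counts(lines, window_size=10, min_count=1):
    """Count symmetric word co-occurrences within a sliding window, line by line."""
    tokenized = [line.lower().split() for line in lines]
    # Keep only the tokens that occur at least `min_count` times in the corpus.
    frequency = Counter(token for tokens in tokenized for token in tokens)
    vocab = {token for token, count in frequency.items() if count >= min_count}

    counts = Counter()
    for tokens in tokenized:
        kept = [token for token in tokens if token in vocab]
        for i, center in enumerate(kept):
            # Pair the center word with the words that follow it in the window,
            # and record the pair in both directions.
            for context in kept[i + 1 : i + 1 + window_size]:
                counts[(center, context)] += 1
                counts[(context, center)] += 1
    return counts


corpus = [
    "the crest of the ilium",
    "the symphysis pubis and the crest",
]
counts = cooccurrence_counts(corpus, window_size=2)
print(counts[("crest", "the")])
```

The resulting counts form a sparse co-occurrence matrix. The second component then factorizes such a matrix into word vectors with Swivel, a step this sketch does not attempt to reproduce.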


Each component is implemented as a Docker container image that's stored in Google Cloud's Docker registry, [gcr.io](https://cloud.google.com/container-registry/). The `pipeline.tar.gz` file that you downloaded contains a YAML description of how these containers need to be composed as well as where to find the corresponding images.

**Remark:** Each component can be run individually as a single component pipeline in exactly the same manner as the `text2hub` pipeline. On AI Hub, each component has a pipeline page describing it and from where you can download the associated single component pipeline:

 * [text2cooc](https://aihub.cloud.google.com/u/0/p/products%2F6d998d56-741e-4154-8400-0b3103f2a9bc)
 * [cooc2emb](https://aihub.cloud.google.com/u/0/p/products%2Fda367ed9-3d70-4ca6-ad14-fd6bf4a913d9)
 * [emb2hub](https://aihub.cloud.google.com/u/0/p/products%2F1ef7e52c-5da5-437b-a061-31111ab55312)

## Step 2: Upload the pipeline to the Kubeflow cluster

Go to your Kubeflow cluster dashboard and click on the pipeline tab to create a new pipeline. You'll be prompted to upload the pipeline file you have just downloaded. Rename the generated pipeline to `text2hub` to keep things nice and clean.

![img](./assets/text2hub_upload.png)

## Step 3: Create a pipeline run

After uploading the pipeline, you should see `text2hub` appear on the pipeline list. Click on it. This will bring you to a page describing the pipeline (explore!) and allowing you to create a run. You can inspect the input and output parameters of each of the pipeline components by clicking on the component node in the graph representing the pipeline.

![img](./assets/text2hub_run_creation.png)

## Step 4: Enter the run parameters

`text2hub` has the following run parameters you can configure:

Argument                                         | Description                                                                           | Optional | Data Type | Accepted values | Default
------------------------------------------------ | ------------------------------------------------------------------------------------- | -------- | --------- | --------------- | -------
gcs-path-to-the-text-corpus                      | A Cloud Storage location pattern (i.e., glob) where the text corpus will be read from | False    | String    | gs://...        | -
gcs-directory-path-for-pipeline-output           | A Cloud Storage directory path where the pipeline output will be exported             | False    | String    | gs://...        | -
number-of-epochs                                 | Number of epochs to train the embedding algorithm (Swivel) on                         | True     | Integer   | -               | 40
embedding-dimension                              | Number of components of the generated embedding vectors                               | True     | Integer   | -               | 128
co-occurrence-word-window-size                   | Size of the sliding word window where co-occurrences are extracted from               | True     | Integer   | -               | 10
number-of-out-of-vocabulary-buckets              | Number of out-of-vocabulary buckets                                                   | True     | Integer   | -               | 1
minimum-occurrences-for-a-token-to-be-considered | Minimum number of occurrences for a token to be included in the vocabulary            | True     | Integer   | -               | 5
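
If you want to keep track of the values you plan to enter in the UI, the sketch below collects the defaults from the table into a dictionary and checks the two required paths. The `build_run_parameters` helper and the `gs://my-bucket/...` paths are hypothetical and only for illustration; this is not part of the pipeline or of any Kubeflow API.

```python
# Defaults copied from the table above; the two GCS paths have no default.
DEFAULT_RUN_PARAMETERS = {
    "number-of-epochs": 40,
    "embedding-dimension": 128,
    "co-occurrence-word-window-size": 10,
    "number-of-out-of-vocabulary-buckets": 1,
    "minimum-occurrences-for-a-token-to-be-considered": 5,
}
REQUIRED_PARAMETERS = (
    "gcs-path-to-the-text-corpus",
    "gcs-directory-path-for-pipeline-output",
)


def build_run_parameters(**overrides):
    """Merge overrides into the defaults and check the required GCS paths.

    Hypothetical helper: keyword names use underscores in place of dashes.
    """
    params = dict(DEFAULT_RUN_PARAMETERS)
    params.update({key.replace("_", "-"): value for key, value in overrides.items()})
    for name in REQUIRED_PARAMETERS:
        value = params.get(name, "")
        if not str(value).startswith("gs://"):
            raise ValueError(f"{name} must be a gs:// path, got {value!r}")
    return params


params = build_run_parameters(
    gcs_path_to_the_text_corpus="gs://my-bucket/custom_embedding/surgery_manual.txt",
    gcs_directory_path_for_pipeline_output="gs://my-bucket/custom_embedding",
    number_of_epochs=60,  # example override of a default
)
print(params)
```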

You can leave most parameters at their default values, except for
`gcs-path-to-the-text-corpus`, whose value should be set to

In [48]:
!echo gs://$BUCKET/custom_embedding/surgery_manual.txt

gs://dherin-sandbox/custom_embedding/surgery_manual.txt


and for `gcs-directory-path-for-pipeline-output` which we will set to

In [49]:
!echo gs://$BUCKET/custom_embedding

gs://dherin-sandbox/custom_embedding


**Remark**: `gcs-path-to-the-text-corpus` will accept a GCS pattern like `gs://BUCKET/data/*.txt` or simply a path like `gs://BUCKET/data/` to a GCS directory. All the files that match the pattern or that are in that directory will be parsed to create the word embedding TF-Hub module.
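
The difference between a glob pattern and a directory path can be sketched with Python's `fnmatch` module on a made-up list of object names. This is only an approximation: the object names and the `select_corpus_files` helper are hypothetical, `fnmatch` lets `*` match `/` whereas GCS wildcards generally do not, and whether the pipeline reads subdirectories is not stated here, so the sketch assumes it does not.

```python
from fnmatch import fnmatchcase

# Hypothetical object names in a bucket, for illustration only.
OBJECTS = [
    "gs://BUCKET/data/surgery_manual.txt",
    "gs://BUCKET/data/anatomy.txt",
    "gs://BUCKET/data/notes.md",
    "gs://BUCKET/other/readme.txt",
]


def select_corpus_files(pattern, objects=OBJECTS):
    """Return the object names selected by a glob pattern or a directory path."""
    if pattern.endswith("/"):
        # Directory path: take every object directly under that prefix
        # (assumption: subdirectories are not traversed).
        return [
            name
            for name in objects
            if name.startswith(pattern) and "/" not in name[len(pattern):]
        ]
    # Glob pattern: fnmatch lets `*` match `/`, unlike GCS wildcards.
    return [name for name in objects if fnmatchcase(name, pattern)]


print(select_corpus_files("gs://BUCKET/data/*.txt"))
print(select_corpus_files("gs://BUCKET/data/"))
```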

![img](./assets/text2hub_run_parameters.png)

Once these values have been set, you can start the run by clicking on "Start".

## Step 5: Inspect the run artifacts

Once the run has started you can see its state by going to the experiment tab and clicking on the name of the run (here "text2hub-1"). 

![img](assets/text2hub_experiment_list.png)

It will show you the pipeline graph. The components in green have successfully completed. You can then click on them and look at the artifacts that these components have produced.

The `text2cooc` component has a "co-occurrence extraction summary" showing you the GCS path where the co-occurrence data has been saved. There is a corresponding link that you can paste into your browser to inspect the co-occurrence data from the GCS browser. Some statistics about the vocabulary are given, such as the most and least frequent tokens. You can also download the vocabulary file containing the tokens to be embedded.

![img](assets/text2cooc_markdown_artifacts.png)

The `cooc2emb` component has three artifacts:
* An "Embedding Extraction Summary" indicating where the model checkpoints and the embedding tables are exported to on GCP
* A similarity matrix from a random sample of words, indicating whether the model associates close-by vectors with similar words
* A button to start TensorBoard from the UI to inspect the model and visualize the word embeddings

![img](assets/cooc2emb_artifacts.png)

We can have a look at the word embedding visualization provided by TensorBoard. Start TensorBoard by clicking the "Start" and then "Open" buttons, and then select "Projector".

**Remark:** The projector tab may take some time to appear. If it takes too long, your Kubeflow cluster may be running an incompatible version of TensorBoard (your TensorBoard version should be between 1.13 and 1.15). If that's the case, run TensorBoard from Cloud Shell or locally by issuing the following command:

In [23]:
!echo tensorboard --port 8080 --logdir gs://$BUCKET/custom_embedding/embeddings

tensorboard --port 8080 --logdir gs://dherin-sandbox/custom_embedding/embeddings


The projector view presents a representation of the word vectors in a 3-dimensional space (the dimensionality is reduced through PCA) that you can interact with. Enter a few words like "ilium" in the search tool, and the matching points in the 3D space will light up.

![img](assets/cooc2emb_tb_search.png)

If you click on a word vector, the n nearest neighbors of that word in the embedding space will appear. The nearest neighbors are both visualized in the center panel and presented as a flat list on the right.

Explore the nearest neighbors of a given word and check whether they make sense. This will give you a rough understanding of the embedding quality. If the nearest neighbors do not make sense after trying a few key words, you may need to rerun `text2hub`, this time with either more epochs or more data. Reducing the embedding dimension may help as well, as may modifying the co-occurrence window size (choose a size that makes sense given how your corpus is split into lines).
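
The same kind of nearest-neighbor check can be sketched outside TensorBoard. The example below ranks words by cosine similarity, assuming that metric is the one of interest; the 3-dimensional vectors and the `nearest_neighbors` helper are made up for illustration and are not produced by `text2hub`.

```python
import math

# Made-up 3-dimensional vectors; text2hub embeddings have 128 components by default.
EMBEDDINGS = {
    "ilium": [0.9, 0.1, 0.0],
    "pubis": [0.8, 0.2, 0.1],
    "femur": [0.7, 0.3, 0.0],
    "suture": [0.0, 0.9, 0.4],
    "ligature": [0.1, 0.8, 0.5],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings=EMBEDDINGS, k=2):
    """Return the `k` words most similar to `word`, most similar first."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


for neighbor, score in nearest_neighbors("ilium"):
    print(neighbor, round(score, 3))
```

With these toy vectors the anatomical terms end up closest to "ilium", which is the sort of pattern you would hope to see in a well-trained domain embedding.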


![img](assets/cooc2emb_nn.png)

The `emb2hub` artifacts give you a snippet of TensorFlow 1.x code showing you how to re-use the generated TF-Hub module in your code. We will demonstrate how to use the TF-Hub module in TF 2.0 in the next section.

![img](assets/emb2hub_artifacts.png)

## Step 6: Using the generated TF-Hub module

Let's see now how to load the TF-Hub module generated by `text2hub` in TF 2.0.

We first store the GCS bucket path where the TF-Hub module has been exported into a variable:

In [17]:
MODULE = "gs://{bucket}/custom_embedding/hub-module".format(bucket=BUCKET)
MODULE

'gs://dherin-sandbox/custom_embedding/hub-module'

Since the TF-Hub module has been saved in TF 1.x format, it's not callable by default when you load it with `hub.load`. We need a light wrapper class around the result of `hub.load` that turns it into a callable by defining the `__call__` method:

In [20]:
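class Wrapper(tf.train.Checkpoint):
    """Makes a TF 1.x hub module loaded with `hub.load` callable."""

    def __init__(self, spec):
        super(Wrapper, self).__init__()
        self.module = hub.load(spec)
        self.variables = self.module.variables
        # Expose no trainable variables, so the embedding weights stay frozen.
        self.trainable_variables = []

    def __call__(self, x):
        # Run the module's "default" signature and return its "default" output.
        return self.module.signatures["default"](x)["default"]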

Now we are ready to create a `KerasLayer` out of our custom text embedding.

In [21]:
med_embed = hub.KerasLayer(Wrapper(MODULE))



When called with a list of sentences, that layer creates a sentence vector for each sentence by averaging the word vectors of the sentence.
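
To make the averaging concrete, here is a toy sketch in plain Python. The 2-dimensional word vectors and the `sentence_vector` helper are invented for illustration, and unknown words are assumed to map to the zero vector; the generated module may tokenize, combine, and handle OOV words differently.

```python
# Made-up 2-dimensional word vectors, for illustration only.
WORD_VECTORS = {
    "a": [1.0, 0.0],
    "fracture": [0.0, 1.0],
    "ilium": [0.5, 0.5],
}


def sentence_vector(sentence, table=WORD_VECTORS, dim=2):
    """Average the word vectors of a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    # Assumption: words missing from the table contribute the zero vector.
    vectors = [table.get(token, [0.0] * dim) for token in tokens]
    if not vectors:
        return [0.0] * dim
    # Average component by component across all word vectors.
    return [sum(components) / len(vectors) for components in zip(*vectors)]


print(sentence_vector("a fracture"))
print(sentence_vector("ilium"))
```

A single-word input simply returns that word's vector, which matches calling the layer on inputs like 'ilium' in the cell that follows.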

In [22]:
outputs = med_embed(tf.constant(['ilium', 'I have a fracture', 'aneurism']))
outputs

<tf.Tensor: id=139, shape=(3, 128), dtype=float32, numpy=
array([[ 0.14453109,  0.07852267,  0.10940594,  0.46825212,  0.2711016 ,
        -0.30434078,  0.35756677, -0.25814775,  0.58705705, -0.33127564,
        -0.3224882 , -0.14958593,  0.27588782, -0.16383132, -0.15279762,
        -0.37463644, -0.37352383, -0.375445  , -0.41296828, -0.1574907 ,
         0.6536843 ,  0.12237123,  0.12104838,  0.1749343 , -0.41965425,
        -0.08719758,  0.19248757,  0.01838654, -0.06238181, -0.23054636,
         0.03802659, -0.06728458, -0.23047519,  0.4418791 , -0.21132842,
        -0.09843157, -0.08029427, -0.11595032,  0.12324606, -0.09030671,
         0.3085222 , -0.08089428, -0.24685636, -0.60403794,  0.4301253 ,
        -0.46597183,  0.475734  , -0.43883252, -0.11868288, -0.21692485,
        -0.01024164, -0.3957712 ,  0.371831  ,  0.30342016,  0.1703456 ,
        -0.40179157,  0.48199773,  0.4366476 , -0.1528644 , -0.15868089,
         0.3833921 ,  0.22966436,  0.53352064,  0.19017717, -0.203

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License