# Custom TF-Hub Word Embedding with text2hub

**Learning Objectives:**
  1. Learn how to deploy an AI Hub Kubeflow pipeline
  1. Learn how to configure the run parameters for text2hub
  1. Learn how to inspect text2hub generated artifacts and word embeddings in TensorBoard
  1. Learn how to run a hub module generated with TF 1.x in TF 2.0


## Introduction

In this notebook, we will first learn how to run text2hub, a Kubeflow pipeline from AI Hub that takes as input a corpus of text in a GCS bucket and outputs a reusable TF-Hub text embedding module trained on that corpus. We will create a medical text embedding using a surgery manual from Project Gutenberg.
We will then learn how to inspect the pipeline run artifacts and the generated embeddings via visual exploration of the embedding space in TensorBoard. Finally, we will learn how to use the generated text embedding modules in both TF 1.x and TF 2.0.

In [29]:
import os

import tensorflow as tf
import tensorflow_hub as hub

Replace with your GCP project and bucket:

In [30]:
PROJECT = "<YOUR PROJECT>"
BUCKET = "<YOUR BUCKET>"

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
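
Because the two placeholders above must be edited by hand, a quick check can catch a forgotten substitution before any GCS command runs. The following is only a minimal sketch: `check_placeholder` is a hypothetical helper introduced here for illustration, not part of the notebook or of any library, and it assumes that an unedited value still has the `<...>` shape used above.

```python
def check_placeholder(name, value):
    """Return `value`, or raise if it is empty or still looks like "<YOUR ...>".

    Hypothetical helper: the "<...>" convention is an assumption taken
    from the placeholder strings used in this notebook.
    """
    if not value or (value.startswith("<") and value.endswith(">")):
        raise ValueError(f"{name} is not set: replace the placeholder {value!r}")
    return value


# Example with a made-up bucket name; in the notebook you would pass BUCKET.
print(check_placeholder("BUCKET", "my-example-bucket"))
```

Calling it as `check_placeholder("PROJECT", PROJECT)` right after the assignments would stop the notebook early with a readable error instead of a failing `gsutil` call later on.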

## Setting up the Kubeflow cluster

We assume that you have a running Kubeflow cluster. 

If not, to deploy a [Kubeflow](https://www.kubeflow.org/) cluster in your GCP project, use the [Kubeflow cluster deployer](https://deploy.kubeflow.cloud/#/deploy).

There is a [setup video](https://www.kubeflow.org/docs/started/cloud/getting-started-gke/~) that
walks you through all the steps in detail and explains how to access the Kubeflow Dashboard UI once it is
running.

You'll need to create an OAuth client for authentication purposes: Follow the 
instructions [here](https://www.kubeflow.org/docs/gke/deploy/oauth-setup/).

## Loading the dataset on GCS

In [40]:
%%bash

# Surgery Manual
URL=http://www.gutenberg.org/cache/epub/24564/pg24564.txt
OUTDIR=gs://$BUCKET/custom_embedding
CORPUS=surgery_manual.txt

curl $URL > $CORPUS
# Print the copy command for reference, then run it.
echo gsutil cp $CORPUS $OUTDIR/$CORPUS
gsutil cp $CORPUS $OUTDIR/$CORPUS

gsutil cp surgery_manual.txt gs://dherin-sandbox/custom_embedding/surgery_manual.txt




## Step 1: Download the `text2hub` pipeline from AI Hub

You can go to [AI Hub](https://aihub.cloud.google.com/u/0/) and search for the `text2hub` pipeline, or just follow [this link](https://aihub.cloud.google.com/u/0/p/products%2F4a91d2d0-1fb8-4e79-adf7-a35707071195).
You'll land on a page describing `text2hub`. Click on the "Download" button on that page to download the Kubeflow pipeline.

![img](./assets/text2hub_download.png)

## Step 2: Upload the pipeline to the Kubeflow cluster

![img](./assets/text2hub_upload.png)

## Step 3: Create a pipeline run

After uploading the pipeline, you should see `text2hub` appear on the pipeline list. Click on it. This will bring you to a page describing the pipeline (explore!) and allowing you to create a run:

![img](./assets/text2hub_run_creation.png)

## Step 4: Enter the run parameters

![img](./assets/text2hub_run_parameters.png)

You can leave most parameters at their default values, except for
`gcs-path-to-the-test-corpus` whose value should be set to

In [48]:
!echo gs://$BUCKET/custom_embedding/surgery_manual.txt

gs://dherin-sandbox/custom_embedding/surgery_manual.txt


and for `gcs-directory-path-for-pipeline-output` which we will set to

In [49]:
!echo gs://$BUCKET/custom_embedding

gs://dherin-sandbox/custom_embedding


Once these values have been set, you can start the run by clicking on "Start".
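
The two values above follow directly from the bucket name, so they can also be assembled in Python before being pasted into the run form. This is a rough sketch only: `text2hub_run_parameters` is a hypothetical helper, the dictionary keys simply mirror the parameter names quoted above, and nothing here is sent to Kubeflow.

```python
def text2hub_run_parameters(bucket, subdir="custom_embedding",
                            corpus="surgery_manual.txt"):
    """Build the two non-default run parameter values as plain strings.

    Hypothetical helper for illustration; the keys copy the parameter
    names shown in the run form described in this step.
    """
    output_dir = f"gs://{bucket}/{subdir}"
    return {
        "gcs-path-to-the-test-corpus": f"{output_dir}/{corpus}",
        "gcs-directory-path-for-pipeline-output": output_dir,
    }


# Example with a made-up bucket name; in the notebook you would pass BUCKET.
params = text2hub_run_parameters("my-example-bucket")
for name, value in params.items():
    print(f"{name} = {value}")
```

Keeping the output directory as the single source for both strings avoids a mismatch between the corpus location and the pipeline output location.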

## Step 5: Inspect the run artifacts

![img](assets/text2hub_experiment_list.png)

![img](assets/text2cooc_markdown_artifacts.png)

![img](assets/cooc2emb_artifacts.png)

tensorboard==1.15.0

![img](assets/cooc2emb_tb_search.png)

The [ilium bone](https://en.wikipedia.org/wiki/Ilium_(bone))

![img](assets/cooc2emb_nn.png)
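
The nearest-neighbor panel of the TensorBoard embedding projector ranks words by their distance to the selected word. As a rough sketch of that idea, the code below ranks words by cosine similarity, which is one of the distance options offered by the projector. The vocabulary and the 3-dimensional vectors are made up for illustration and are not taken from the trained embedding (whose vectors are 128-dimensional); `cosine_similarity` and `nearest_neighbors` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=2):
    """Return the k words most similar to `query`, most similar first."""
    scores = [
        (word, cosine_similarity(embeddings[query], vector))
        for word, vector in embeddings.items()
        if word != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up toy vectors, NOT values produced by the text2hub pipeline.
toy_embeddings = {
    "ilium": [0.9, 0.1, 0.0],
    "pelvis": [0.8, 0.2, 0.1],
    "fracture": [0.1, 0.9, 0.2],
    "aneurism": [0.0, 0.2, 0.9],
}

neighbors = nearest_neighbors("ilium", toy_embeddings, k=2)
for word, score in neighbors:
    print(f"{word}: {score:.3f}")
```

With these toy vectors, "pelvis" ranks first because its vector points in almost the same direction as the one for "ilium".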

![img](assets/emb2hub_artifacts.png)

## Step 7: Using the generated TF-Hub module

In [62]:
MODULE = "gs://{bucket}/custom_embedding/hub-module".format(bucket=BUCKET)

In [70]:
texts = tf.constant(['ilium', 'fracture', 'aneurism'])
texts

<tf.Tensor: id=1378, shape=(3,), dtype=string, numpy=array([b'ilium', b'fracture', b'aneurism'], dtype=object)>

In [71]:
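class Wrapper(tf.train.Checkpoint):
    """Adapts the TF 1.x hub module so that it can be wrapped in hub.KerasLayer."""

    def __init__(self, spec):
        super().__init__()
        self.module = hub.load(spec)
        # Attributes that hub.KerasLayer reads from the wrapped object.
        self.variables = self.module.variables
        # Empty list: no weights of the module are marked as trainable.
        self.trainable_variables = []

    def __call__(self, x):
        # Call the module's "default" signature and return its "default" output.
        return self.module.signatures["default"](x)["default"]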

In [72]:
med_embed = hub.KerasLayer(Wrapper(MODULE))


In [73]:
outputs = med_embed(texts)
outputs

<tf.Tensor: id=1517, shape=(3, 128), dtype=float32, numpy=
array([[ 0.14453109,  0.07852267,  0.10940594,  0.46825212,  0.2711016 ,
        -0.30434078,  0.35756677, -0.25814775,  0.58705705, -0.33127564,
        -0.3224882 , -0.14958593,  0.27588782, -0.16383132, -0.15279762,
        -0.37463644, -0.37352383, -0.375445  , -0.41296828, -0.1574907 ,
         0.6536843 ,  0.12237123,  0.12104838,  0.1749343 , -0.41965425,
        -0.08719758,  0.19248757,  0.01838654, -0.06238181, -0.23054636,
         0.03802659, -0.06728458, -0.23047519,  0.4418791 , -0.21132842,
        -0.09843157, -0.08029427, -0.11595032,  0.12324606, -0.09030671,
         0.3085222 , -0.08089428, -0.24685636, -0.60403794,  0.4301253 ,
        -0.46597183,  0.475734  , -0.43883252, -0.11868288, -0.21692485,
        -0.01024164, -0.3957712 ,  0.371831  ,  0.30342016,  0.1703456 ,
        -0.40179157,  0.48199773,  0.4366476 , -0.1528644 , -0.15868089,
         0.3833921 ,  0.22966436,  0.53352064,  0.19017717, -0.20