## Creating an index of Movie Embeddings at Scale with Apache Beam and Dataflow
In this lab you will deploy an Apache Beam pipeline to Dataflow that sends movie overviews through a pre-trained NLP model to generate embeddings of those overviews. This lab is a special case of using Apache Beam and Dataflow as batch prediction infrastructure.

#### Setup

In [1]:
from datetime import datetime

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub

PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]
REGION = "us-central1"
BUCKET = PROJECT
INPUT_FILE = f"gs://{BUCKET}/movies/movies.csv"

%env PROJECT={PROJECT}
%env REGION={REGION}
%env BUCKET={BUCKET}
%env INPUT_FILE={INPUT_FILE}

env: PROJECT=kylesteckler-demo
env: REGION=us-central1
env: BUCKET=kylesteckler-demo
env: INPUT_FILE=gs://kylesteckler-demo/movies/movies.csv


Make sure your GCS bucket exists; if it does not, create it.

In [2]:
%%bash
exists=$(gsutil ls -d | grep -w gs://${BUCKET}/)

if [ -n "$exists" ]; then
  echo -e "Bucket gs://${BUCKET} already exists."
    
else
   echo "Creating a new GCS bucket."
   gsutil mb -l ${REGION} gs://${BUCKET}
   echo -e "\nHere are your current buckets:"
   gsutil ls
fi

Bucket gs://kylesteckler-demo already exists.


Copy CSV to your bucket

In [3]:
!gsutil cp gs://asl-public/data/movie-descriptions/movies.csv {INPUT_FILE}

Copying gs://asl-public/data/movie-descriptions/movies.csv [Content-Type=text/csv]...
/ [1 files][ 10.5 MiB/ 10.5 MiB]                                                
Operation completed over 1 objects/10.5 MiB.                                     


#### Explore the Data

In [4]:
df = pd.read_csv(INPUT_FILE)
df

Unnamed: 0,original_title,overview
0,Toy Story,Led by Woody Andy's toys live happily in his r...
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Waiting to Exhale,Cheated on mistreated and stepped on the women...
3,Father of the Bride Part II,Just when George Banks has recovered from his ...
4,Heat,Obsessive master thief Neil McCauley leads a t...
...,...,...
34607,Caged Heat 3000,It's the year 3000 AD. The world's most danger...
34608,Robin Hood,Yet another version of the classic epic with e...
34609,Siglo ng Pagluluwal,An artist struggles to finish his work while a...
34610,Betrayal,When one of her hits goes wrong a professional...


As you can see, this CSV file contains about 35,000 rows of data. Each row has a movie title (`original_title`) and a short paragraph description (`overview`).

In [5]:
# Take a look at one example
example = df.iloc[0]
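# Select by column name; integer keys on a label-indexed Series are deprecated in recent pandas
title = example["original_title"]
overview = example["overview"]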

print(f"Title: {title}\nOverview: {overview}")

Title: Toy Story
Overview: Led by Woody Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner the duo eventually learns to put aside their differences.


#### Generating Embeddings
It is common to create embeddings, or numeric representations of unstructured data. Embeddings are frequently used in recommender systems and for computing similarities. In this lab we will use a pre-trained model to create embeddings for each movie overview. 

The model we will use is Google's [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4). The Universal Sentence Encoder encodes text into 512-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.
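Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below is a minimal illustration using NumPy. The function name `cosine_similarity` and the toy 3-dimensional vectors are assumptions for illustration; they stand in for the 512-dimensional vectors the encoder produces.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (close to 1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors (hypothetical values, not real model output).
movie_a = [1.0, 0.0, 1.0]
movie_b = [2.0, 0.0, 2.0]  # same direction as movie_a, different length
movie_c = [0.0, 1.0, 0.0]  # orthogonal to movie_a

print(cosine_similarity(movie_a, movie_b))  # close to 1: very similar
print(cosine_similarity(movie_a, movie_c))  # zero: unrelated
```

Because cosine similarity ignores vector length, it compares only the direction of two embeddings, which is usually what matters for text similarity.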

In [6]:
# Load the model into memory
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

2022-06-27 21:45:15.024795: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2022-06-27 21:45:15.024951: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-06-27 21:45:15.024988: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (home-depot-asl): /proc/driver/nvidia/version does not exist
2022-06-27 21:45:15.025381: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-27 2

In [7]:
# Send one example through the model to get embedding
embeddings = embed([overview])
print(
    f"Overview: {overview}\nEmbedding Shape: {embeddings[0].shape}\nEmbedding Value: {embeddings[0].numpy()}"
)

Overview: Led by Woody Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner the duo eventually learns to put aside their differences.
Embedding Shape: (512,)
Embedding Value: [ 0.06237842 -0.00021261 -0.0321418   0.04211934 -0.03190538 -0.04727348
 -0.03579684 -0.06527699 -0.01243671 -0.01291869 -0.03203537  0.05493606
 -0.04103908  0.01471261 -0.03063923 -0.06317661 -0.0466479   0.04065031
 -0.05278567  0.05723299 -0.01752132 -0.0664015   0.0681255  -0.02964727
  0.0051426   0.06729304 -0.03282264 -0.02283403  0.04896601  0.06429423
 -0.02937895 -0.06723911 -0.05709471 -0.04732318 -0.04401306 -0.03806528
  0.02364934  0.03226624  0.01851922 -0.05438095 -0.01259939 -0.03234723
 -0.00504965 -0.05542825 -0.05961061  0.06569342  0.04169372  0.0305221
 -0.03209813 -0.05397084  0.06841838 -0.00631814  0.00088382 -0.0

#### Beam and Dataflow to serve batch predictions at scale
While we can load this model locally and serve a single prediction (here, generating one embedding), this approach becomes impractical as the data grows. To serve batch predictions at scale, we can use Apache Beam and Dataflow.

#### Apache Beam Pipeline
This Apache Beam pipeline will do the following:
* Read in rows from a CSV file formatted as `movie title, overview`
* Send the movie overviews through a pre-trained TFHub model with a custom `beam.DoFn` to generate embeddings 
* Write the movie titles and embedding representations of their overviews to a JSON file in GCS

In [8]:
%%writefile movie_embedding_pipeline.py

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, 
    PipelineOptions, 
    StandardOptions, 
    SetupOptions
)

from apache_beam.runners import DataflowRunner
import argparse

import typing 
import tensorflow as tf
import tensorflow_hub as hub
import json 
import os 

# Schema of CSV file
class Movie(typing.NamedTuple):
    title: str
    overview: str

# DoFn to transform CSV rows to PCollection with schema
class ParseFileDoFn(beam.DoFn):
    def process(self, element):
        title, overview = element.split(',')
        yield Movie(
            title = title,
            overview = overview
        )
    
class PredictDoFn(beam.DoFn):
    def __init__(self, hub_module):
        self.hub_module = hub_module
    
    # called whenever DoFn instance is deserialized on the worker
    def setup(self):
        self.model = hub.load(self.hub_module)
        
    def process(self, element):
        embedding = self.model([element.overview]) # tf.Tensor(shape (1, num_emb))
        embedding = tf.squeeze(embedding) # tf.Tensor(shape (num_emb))
        
        yield {
            "id": str(element.title),
            "embedding" : embedding.numpy().tolist() #  len(list)=num_emb 
        }
        
def run():
    parser = argparse.ArgumentParser(description='Movie Embedding Pipeline')

    # Google Cloud options
    parser.add_argument('--project',required=True, help='Specify Google Cloud project')
    parser.add_argument('--region', required=True, help='Specify Google Cloud region')
    parser.add_argument('--staging_location', required=True, help='Specify Cloud Storage bucket for staging')
    parser.add_argument('--runner', required=True, help='Specify Apache Beam Runner')
    parser.add_argument('--job_name', required=True, help='Job name for Dataflow Runner')

    # Pipeline-specific options
    parser.add_argument('--input_file', required=True, help='GCS path to input CSV')
    parser.add_argument('--output_base', required=True, help='Output base for sharded JSON files')
    parser.add_argument('--hub_module', required=True, help='URI of TF Hub model for embeddings')

    opts, pipeline_opts = parser.parse_known_args()

    # Setting up the Beam pipeline options.
    options = PipelineOptions(pipeline_opts)
    
    # Set standard pipeline options.
    options.view_as(StandardOptions).streaming = False
    options.view_as(StandardOptions).runner = opts.runner
    options.view_as(SetupOptions).save_main_session = True

    # Set Google Cloud specific options.
    google_cloud_options = options.view_as(GoogleCloudOptions)
    google_cloud_options.project = opts.project
    google_cloud_options.job_name = opts.job_name
    google_cloud_options.staging_location = opts.staging_location
    google_cloud_options.region = opts.region

    # Instantiate the pipeline; the runner is taken from --runner via StandardOptions
    p = beam.Pipeline(options=options)
    
    # PCollection with movie titles and overviews
    rows = (p 
        | "Read File" >> beam.io.ReadFromText(opts.input_file, skip_header_lines=1)
        | "Parse Rows" >> beam.ParDo(ParseFileDoFn()))
    
    # PCollection with movie titles and embeddings
    index = rows | "Generate Embeddings" >> beam.ParDo(PredictDoFn(hub_module=opts.hub_module))
    
    # Write out to JSON file 
    write_json = (index 
                  | "Format JSON" >> beam.Map(json.dumps) 
                  | "Write JSON" >> beam.io.WriteToText(
                      opts.output_base,
                      file_name_suffix=".json"
                  ))
    p.run()
    
if __name__ == '__main__':
    run()


Writing movie_embedding_pipeline.py


### A deeper look at this pipeline

#### Apache Beam Core Concepts
`PCollection`: An immutable collection of values representing data elements.

`PTransform`: Represents a data processing operation, or a step, in your pipeline.

`ParDo`: A transform for generic parallel processing. A `ParDo` transform considers each element in the input `PCollection`, performs some processing function on that element, and emits elements to an output `PCollection`. The processing function passed to ParDo is a `DoFn` object.

`DoFn`: Defines the exact data processing task that a `ParDo` applies. To create a custom processing task, create a `DoFn` subclass (e.g. `class MyCustomProcessingTask(beam.DoFn)`) and write a method `def process(self, element):` that contains the actual processing logic. You don't need to manually extract the elements from the input collection; the Beam SDKs handle that for you. Your `process` method should accept an argument `element`, which is the input element, and return an iterable of output values. You can accomplish this by emitting individual elements with `yield` statements.

#### Movie Embedding with TFHub Model Pipeline
1) `beam.io.ReadFromText`: A `PTransform` for reading text files into `str` elements. It returns one element for each line in the file. With the input CSV file for our movie dataset, it will return a string element `"{title},{overview}"` for each row in the CSV.

2) `beam.ParDo(ParseFileDoFn())`: A `PTransform` that applies `ParseFileDoFn` (a custom `DoFn`) to each string element in the output `PCollection` from `beam.io.ReadFromText`. The `process` method of `ParseFileDoFn` simply uses the `str.split()` method to split each row and yields a `NamedTuple` (`Movie`) holding the movie title and overview for each example in the dataset.

3) `beam.ParDo(PredictDoFn(hub_module=opts.hub_module))`: A `PTransform` that applies `PredictDoFn` (a custom `DoFn`) to each element in the output `PCollection` from `beam.ParDo(ParseFileDoFn())`. The `process` method of `PredictDoFn` sends the movie overview through the Universal Sentence Encoder, which is loaded from TFHub in the `DoFn`'s `setup` method. The output is a `PCollection` where each element is a dictionary with:
    * `id`: The movie title
    * `embedding`: The embedding of the movie overview 


4) `beam.Map(json.dumps)`: A `PTransform` that applies `json.dumps` to each element in the `PCollection` output from `PredictDoFn`. `json.dumps` converts the Python dictionary into a JSON string.

5) `beam.io.WriteToText`: A `PTransform` that takes the output `PCollection` of JSON strings from `beam.Map(json.dumps)` and writes them to text files. We pass `beam.io.WriteToText` the following arguments:
    * `file_path_prefix`: The path prefix for the output files (here, `opts.output_base`)
    * `file_name_suffix`: The suffix for the files written (here, `.json`)
    * While not provided in this pipeline, `beam.io.WriteToText` also accepts `num_shards`, which specifies the number of files (shards) used for output. If it is not set, the service will decide the optimal number of shards.
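Step 2 relies on `str.split(',')`, which appears to work here because the sample overviews shown earlier contain no commas. As a hedged sketch, if a file did contain double-quoted fields with embedded commas, each line could be parsed with Python's standard `csv` module instead. The helper name `parse_csv_line` is hypothetical and not part of the lab's pipeline.

```python
import csv
import typing


class Movie(typing.NamedTuple):
    title: str
    overview: str


def parse_csv_line(line: str) -> Movie:
    """Parse one CSV line into a Movie, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    fields = next(csv.reader([line]))
    if len(fields) != 2:
        raise ValueError(f"Expected 2 fields, got {len(fields)}: {line!r}")
    title, overview = fields
    return Movie(title=title, overview=overview)


# Plain row: same result as line.split(',').
print(parse_csv_line("Heat,Obsessive master thief Neil McCauley leads a crew"))

# Hypothetical row with a quoted comma: split(',') would yield three pieces here.
print(parse_csv_line('Jumanji,"A game, a jungle and two siblings"'))
```

This only handles commas inside quoted fields. A quoted field that spans several lines would still be broken apart, because `beam.io.ReadFromText` emits one element per line.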

Create a `requirements.txt` file for the Dataflow job

In [9]:
%%writefile requirements.txt
tensorflow_hub==0.12.0

Writing requirements.txt


In [10]:
STAGING_LOCATION = f"gs://{BUCKET}/staging"
JOB_NAME = (
    f'movie-embedding-pipeline-{datetime.now().strftime("%Y%m%d-%H%M%S")}'
)
OUTPUT_BASE = f"gs://{BUCKET}/movies/data/embeddings"
HUB_MODULE = "https://tfhub.dev/google/universal-sentence-encoder/4"


%env STAGING_LOCATION={STAGING_LOCATION}
%env JOB_NAME={JOB_NAME}
%env OUTPUT_BASE={OUTPUT_BASE}
%env HUB_MODULE={HUB_MODULE}

env: STAGING_LOCATION=gs://kylesteckler-demo/staging
env: JOB_NAME=movie-embedding-pipeline-20220627-214540
env: OUTPUT_BASE=gs://kylesteckler-demo/movies/data/embeddings
env: HUB_MODULE=https://tfhub.dev/google/universal-sentence-encoder/4


Launch Dataflow Job

In [11]:
%%bash
python3 movie_embedding_pipeline.py \
    --project=${PROJECT} \
    --region=${REGION} \
    --staging_location=${STAGING_LOCATION} \
    --runner='DataflowRunner' \
    --job_name=${JOB_NAME} \
    --input_file=${INPUT_FILE} \
    --output_base=${OUTPUT_BASE} \
    --hub_module=${HUB_MODULE} \
    --requirements_file='./requirements.txt'



**NOTE:** The pipeline takes about 10 minutes to complete. Once it finishes, you can list the sharded output files it produced.

In [None]:
!gsutil ls {OUTPUT_BASE}*

gs://kylesteckler-demo/movies/data/embeddings-00000-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00001-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00002-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00003-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00004-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00005-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00006-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00007-of-00009.json
gs://kylesteckler-demo/movies/data/embeddings-00008-of-00009.json
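The file names above follow the pattern `<prefix>-SSSSS-of-NNNNN<suffix>`, which matches the default shard name template of `beam.io.WriteToText`. As a small sketch, the expected paths can be generated for a known shard count. The helper `shard_paths` and the bucket name in the example are hypothetical.

```python
def shard_paths(prefix: str, num_shards: int, suffix: str = "") -> list:
    """List the file names expected for `num_shards` shards (hypothetical helper)."""
    return [
        f"{prefix}-{i:05d}-of-{num_shards:05d}{suffix}"
        for i in range(num_shards)
    ]


# Hypothetical bucket name; substitute your own OUTPUT_BASE.
for path in shard_paths("gs://my-bucket/movies/data/embeddings", 3, ".json"):
    print(path)
```

The shard count is chosen by the service unless `num_shards` is set, so treat the count as something to read from the listing rather than assume.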


Look at one example

In [None]:
!gsutil cat PASTE_ONE_FILE_PATH_HERE | head -1

{"id": "La balsa de piedra", "embedding": [-0.0005029064486734569, 0.058336496353149414, 0.004098975099623203, -0.03346201032400131, 0.0641259178519249, 0.06506840139627457, 0.05715423449873924, -0.05169349163770676, 0.05143652856349945, 0.05856018140912056, 0.05496252328157425, -0.024854077026247978, 0.019266225397586823, -0.02226070500910282, 0.013348620384931564, -0.07217089086771011, 0.016987279057502747, 0.02662324160337448, 0.045979633927345276, 0.0259920135140419, 0.047732412815093994, -0.008502245880663395, 0.06375230103731155, 0.06713975220918655, 0.0509035550057888, 0.0009528131922706962, -0.043219879269599915, 0.05272862687706947, 0.005474014673382044, -0.061405427753925323, -0.03596453368663788, -0.01561708189547062, -0.06089054048061371, -0.05055272579193115, -0.06819809973239899, -0.02577584981918335, -0.05599018558859825, 0.06125086173415184, 0.07008080184459686, -0.029677150771021843, -0.06407012790441513, 0.05200875923037529, -0.0025642698165029287, -0.0681324750185012
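Each line in a shard is a standalone JSON object, so a shard can be read line by line. The sketch below assumes you have a local copy of one shard open as a text file. The helper name `load_embeddings` and the tiny in-memory sample (with 3-dimensional vectors instead of 512) are illustrative assumptions.

```python
import io
import json


def load_embeddings(lines):
    """Map each movie id to its embedding from an iterable of JSON lines."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        index[record["id"]] = record["embedding"]
    return index


# Stand-in for an open shard file (hypothetical values, not real model output).
sample = io.StringIO(
    '{"id": "Movie A", "embedding": [0.1, 0.2, 0.3]}\n'
    '{"id": "Movie B", "embedding": [0.4, 0.5, 0.6]}\n'
)
index = load_embeddings(sample)
print(len(index), len(index["Movie A"]))
```

Since the pipeline uses the movie title as the `id`, two movies that share a title would collide in this dictionary, and the later record would overwrite the earlier one.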