In [1]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License

# Generate text embeddings by using the Vertex AI API

## Text Embeddings

Text embeddings are a way of representing text as numerical vectors. This allows computers to understand and process text data, which is essential for many natural language processing (NLP) tasks.

### Uses of text embeddings
Because embeddings place semantically similar text close together in vector space, they enable a wide range of NLP tasks, including:

* Semantic search: Finding documents or passages that are relevant to a query, even if the query doesn't use the exact same words as the documents.
* Text classification: Categorizing text data into different classes, such as spam or not spam, or positive sentiment or negative sentiment.
* Machine translation: Translating text from one language to another while preserving the meaning.
* Text summarization: Creating shorter summaries of longer pieces of text.
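
To make the semantic search use case concrete, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The three-dimensional vectors and the names `cosine_similarity` and `rank_documents` are made up for illustration. Real embeddings come from a model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors):
    """Return document names sorted from most to least similar to the query."""
    return sorted(
        document_vectors,
        key=lambda name: cosine_similarity(query_vector, document_vectors[name]),
        reverse=True,
    )


# Toy vectors chosen by hand for illustration, NOT real model output.
toy_documents = {
    'dogs in the park': [0.9, 0.1, 0.0],
    'stock market news': [0.0, 0.2, 0.9],
    'puppy training tips': [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_documents(toy_query, toy_documents)
print(ranking)
```

The ranking depends only on the direction of the vectors, not their length, which is why cosine similarity is a common choice for comparing embeddings.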

In this notebook, we use Apache Beam's `MLTransform` to generate embeddings for text data.

Vertex AI provides an API that you can use to generate text embeddings that use Google’s large generative AI models. For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). To generate text embeddings by using the Vertex AI text-embeddings API, use `MLTransform` with the `VertexAITextEmbeddings` class to specify the model configuration.

For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.

## Requirements

To use the Vertex AI text-embeddings API, complete the following prerequisites:

* Install the `google-cloud-aiplatform` Python package.
* Do one of the following tasks:
  * Configure credentials for your Google Cloud project. For more information, see [Google Auth Library for Python](https://googleapis.dev/python/google-auth/latest/reference/google.auth.html#module-google.auth).
  * Store the path to a service account JSON file by using the [GOOGLE_APPLICATION_CREDENTIALS](https://cloud.google.com/docs/authentication/application-default-credentials#GAC) environment variable.
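
If you choose the environment variable option, a quick check like the following sketch can catch a missing or mistyped path before the pipeline runs. The helper name `describe_credentials` and its return values are hypothetical. A result of `'unset'` doesn't necessarily mean that you are unauthenticated, because credentials can also be configured in other ways.

```python
import os


def describe_credentials(environ=None):
    """Report whether GOOGLE_APPLICATION_CREDENTIALS points at an existing file."""
    environ = os.environ if environ is None else environ
    path = environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if not path:
        return 'unset'
    if not os.path.isfile(path):
        return 'missing file'
    return 'ok'


# Inspect the current process environment.
print(describe_credentials())
```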

To use your Google Cloud account, authenticate this notebook.

In [2]:
from google.colab import auth
auth.authenticate_user()

# Replace the placeholder with a valid Google Cloud project ID.
project = '<YOUR_PROJECT_ID>'

## Install dependencies
Install Apache Beam and the dependencies required for the Vertex AI text-embeddings API.

In [None]:
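! git clone https://github.com/apache/beam.git
# Each `!` command runs in its own subshell, so a separate `cd` would not persist.
# Install by using the path relative to the notebook's working directory.
! pip install "beam/sdks/python[gcp]"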


## Import the required modules



In [7]:
import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.vertex_ai import VertexAITextEmbeddings

## Use MLTransform in write mode

In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. These transforms are reused in `read` mode.

In [19]:
artifact_location = tempfile.mkdtemp(prefix='vertex_ai')

# Use the latest text embedding model from the Vertex AI API documentation
# https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings
text_embedding_model_name = 'textembedding-gecko@latest'

# Generate text embeddings for the following sentences.
content = [
    {'x': 'I would like embeddings for this text'},
    {'x': 'Hello world'},
    {'x': 'The Dog is running in the park.'},
]

# Helper function that keeps only the first 10 values of each embedding.
# Note: it modifies the input dictionary in place.
def truncate_embeddings(d):
  for key in d.keys():
    d[key] = d[key][:10]
  return d
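
The helper above overwrites the values of the dictionary that it receives. If later steps need the full vectors, a variant that returns a new dictionary might be preferable. The following sketch uses the hypothetical name `truncated_copy`, which is not part of the original notebook.

```python
def truncated_copy(element, limit=10):
    """Return a new dict whose list values are cut to the first `limit` items."""
    return {key: value[:limit] for key, value in element.items()}


# Example input with a 25-element list standing in for an embedding.
original = {'x': [0.1 * i for i in range(25)]}
shortened = truncated_copy(original)

# The copy is truncated, and the original keeps its full length.
print(len(shortened['x']), len(original['x']))
```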

`MLTransform` processes dictionaries that map column names to their text data. For each sentence in the specified columns, it generates an embedding, which is a list of floating-point numbers. This pipeline generates text embeddings for the input sentences by calling the Vertex AI text-embeddings API for online prediction.

In [20]:
embedding_transform = VertexAITextEmbeddings(
    model_name=text_embedding_model_name, columns=['x'], project=project)

with beam.Pipeline() as pipeline:
  data_pcoll = (
          pipeline
          | "CreateData" >> beam.Create(content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location).with_transform(embedding_transform))

  # Show just the first 10 elements of the embeddings to prevent clutter in the output.
  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)

  transformed_pcoll | "PrintEmbeddingShape" >> beam.Map(lambda x: print(f"Embedding shape: {len(x['x'])}"))

{'x': [0.041293490678071976, -0.010302993468940258, -0.048611514270305634, -0.01360565796494484, 0.06441926211118698, 0.022573700174689293, 0.016446372494101524, -0.033894773572683334, 0.004581860266625881, 0.060710687190294266]}
Embedding shape: 10
{'x': [0.05889148637652397, -0.0046180677600204945, -0.06738516688346863, -0.012708292342722416, 0.06461101770401001, 0.025648491457104683, 0.023468563333153725, -0.039828114211559296, -0.009968819096684456, 0.050098177045583725]}
Embedding shape: 10
{'x': [0.04683901369571686, -0.013076924718916416, -0.082594133913517, -0.01227626483887434, 0.00417641457170248, -0.024504298344254494, 0.04282262548804283, -0.0009824123699218035, -0.02860993705689907, 0.01609829254448414]}
Embedding shape: 10


## Use MLTransform in read mode

In `read` mode, `MLTransform` uses the artifacts saved during `write` mode. In this example, the transform and its attributes are loaded from the saved artifacts, so you don't need to specify the transform again in `read` mode.

In this way, `MLTransform` provides consistent preprocessing steps for training and inference workloads.
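
The idea of saving a configuration once and reusing it later can be sketched with plain Python. This is an analogy only: the functions `write_mode` and `read_mode`, the file name `config.json`, and the model name are hypothetical, and the sketch doesn't reflect the file format or internals that `MLTransform` uses.

```python
import json
import os
import tempfile


def write_mode(artifact_dir, config):
    """Save the preprocessing configuration so that later runs can reuse it."""
    with open(os.path.join(artifact_dir, 'config.json'), 'w') as f:
        json.dump(config, f)


def read_mode(artifact_dir):
    """Load the configuration that was saved by write_mode."""
    with open(os.path.join(artifact_dir, 'config.json')) as f:
        return json.load(f)


# "Training" run: record which settings were used.
toy_artifact_dir = tempfile.mkdtemp(prefix='toy_artifacts')
write_mode(toy_artifact_dir, {'model_name': 'example-model', 'columns': ['x']})

# "Inference" run: recover the same settings without restating them.
restored = read_mode(toy_artifact_dir)
print(restored)
```

Because the second run reads its settings from the saved location, training and inference can't drift apart through a mistyped model name or column list.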

In [21]:
test_content = [
    {
        'x': 'This is a test sentence'
    },
    {
        'x': 'The park is full of dogs'
    },
]

with beam.Pipeline() as pipeline:
  data_pcoll = (
          pipeline
          | "CreateData" >> beam.Create(test_content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(read_artifact_location=artifact_location))

  transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)


{'x': [0.04782044142484665, -0.010078949853777885, -0.05793016776442528, -0.026060665026307106, 0.05756739526987076, 0.02292264811694622, 0.014818413183093071, -0.03718176111578941, -0.005486017093062401, 0.04709304869174957]}
{'x': [0.042911216616630554, -0.007554919924587011, -0.08996245265007019, -0.02607591263949871, 0.0008614308317191899, -0.023671219125390053, 0.03999944031238556, -0.02983051724731922, -0.015057179145514965, 0.022963201627135277]}
