In [1]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License

# Generate text embeddings by using the Vertex AI API

## Text Embeddings

Text embeddings are a way of representing text as numerical vectors. This allows computers to understand and process text data, which is essential for many natural language processing (NLP) tasks.

### Uses of text embeddings
Text embeddings enable a wide range of NLP tasks, including:

* Semantic search: Finding documents or passages that are relevant to a query, even if the query doesn't use the exact same words as the documents.
* Text classification: Categorizing text data into different classes, such as spam or not spam, or positive sentiment or negative sentiment.
* Machine translation: Translating text from one language to another while preserving the meaning.
* Text summarization: Creating shorter summaries of longer pieces of text.
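
To make the semantic search use case concrete, the following minimal sketch ranks documents by cosine similarity to a query. The three-dimensional vectors and the helper names `cosine_similarity` and `rank_documents` are invented for illustration; real embeddings come from a model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query, docs):
    """Return document keys sorted from most to least similar to the query."""
    return sorted(
        docs,
        key=lambda name: cosine_similarity(query, docs[name]),
        reverse=True)


# Hypothetical embeddings, made up for this example only.
documents = {
    'dogs in the park': [0.9, 0.1, 0.0],
    'stock market news': [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

ranked = rank_documents(query_embedding, documents)
print(ranked[0])
```

Because the comparison happens in vector space, a query can match a document without sharing any words with it, as long as their embeddings point in a similar direction.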

In this notebook, we will use Apache Beam's `MLTransform` to generate embeddings for text data.

Vertex AI provides an API that you can use to generate text embeddings that use Google’s large generative AI models. For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). To generate text embeddings by using the Vertex AI text-embeddings API, use `MLTransform` with the `VertexAITextEmbeddings` class to specify the model configuration.

For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.

## Requirements

To use the Vertex AI text-embeddings API, complete the following prerequisites:

* Install the `google-cloud-aiplatform` Python package.
* Do one of the following tasks:
  * Configure credentials for your Google Cloud project. For more information, see [Google Auth Library for Python](https://googleapis.dev/python/google-auth/latest/reference/google.auth.html#module-google.auth).
  * Store the path to a service account JSON file by using the [GOOGLE_APPLICATION_CREDENTIALS](https://cloud.google.com/docs/authentication/application-default-credentials#GAC) environment variable.
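
As a quick sanity check before running the pipeline, a sketch like the following reports whether the `GOOGLE_APPLICATION_CREDENTIALS` environment variable is set. The helper name `describe_credentials` is hypothetical, and the check only inspects the environment variable; it does not validate the file or detect credentials configured in other ways.

```python
import os


def describe_credentials(environ=None):
    """Report whether a service account file path is set in the environment."""
    # Fall back to the real process environment when no mapping is passed in.
    environ = os.environ if environ is None else environ
    path = environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if path:
        return f'service account file: {path}'
    return ('GOOGLE_APPLICATION_CREDENTIALS not set; '
            'relying on other configured credentials')


print(describe_credentials())
```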

To use your Google Cloud account, authenticate this notebook.

In [2]:
from google.colab import auth
auth.authenticate_user()

# Replace with a valid Google Cloud project ID.
project = '<your-project-id>'

## Install dependencies
Install Apache Beam and the dependencies required for the Vertex AI text-embeddings API.

In [None]:
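! git clone https://github.com/apache/beam.git
# Each `!` command runs in its own subshell, so a separate `cd` has no effect.
# Install from the path relative to the notebook's working directory instead.
! pip install "beam/sdks/python[gcp]"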


## Import the required modules



In [7]:
import tempfile
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.vertex_ai import VertexAITextEmbeddings

## Use MLTransform in write mode

In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. These transforms are reused in `read` mode.

In [8]:
artifact_location = tempfile.mkdtemp(prefix='vertex_ai')

# Use the latest text embedding model from the Vertex AI API documentation
# https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings
text_embedding_model_name = 'textembedding-gecko@latest'

# Generate text embeddings for the following sentences.
content = [
    {'x': 'I would like embeddings for this text'},
    {'x': 'Hello world'},
    {'x': 'The Dog is running in the park.'},
]

`MLTransform` processes dictionaries that map column names to their corresponding text data. For each sentence in the specified columns, it generates an embedding, which is a list of floating-point numbers. This pipeline generates text embeddings for the input sentences by calling the Vertex AI text-embeddings API for online prediction.
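
The column-wise behavior can be pictured with the following plain-Python sketch. It is only an illustration of the input and output shapes, not how `MLTransform` is implemented: `toy_embedding` is a hypothetical stand-in that derives a short vector from a hash instead of calling a model, and `embed_columns` is a hypothetical helper.

```python
import hashlib


def toy_embedding(text, dimensions=4):
    """Hypothetical stand-in for a model call: hash the text into a vector."""
    digest = hashlib.sha256(text.encode('utf-8')).digest()
    return [byte / 255.0 for byte in digest[:dimensions]]


def embed_columns(row, columns, embed_fn=toy_embedding):
    """Return a copy of row with each listed column replaced by its embedding."""
    transformed = dict(row)
    for column in columns:
        transformed[column] = embed_fn(row[column])
    return transformed


rows = [{'x': 'Hello world'}, {'x': 'The Dog is running in the park.'}]
embedded_rows = [embed_columns(row, columns=['x']) for row in rows]

for row in embedded_rows:
    print(f"Embedding shape: {len(row['x'])}")
```

Each output dictionary keeps the same keys as its input, with the text in the listed columns replaced by a list of floats.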

In [9]:
embedding_transform = VertexAITextEmbeddings(
        model_name=text_embedding_model_name, columns=['x'], project=project)

with beam.Pipeline() as pipeline:
  data_pcoll = (
          pipeline
          | "CreateData" >> beam.Create(content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(write_artifact_location=artifact_location).with_transform(embedding_transform))

  transformed_pcoll | 'LogOutput' >> beam.Map(print)

  transformed_pcoll | "PrintEmbeddingShape" >> beam.Map(lambda x: print(f"Embedding shape: {len(x['x'])}"))

Embedding shape: 768
{'x': [0.041293490678071976, -0.010302993468940258, -0.048611514270305634, -0.01360565796494484, 0.06441926211118698, ...]}

## Use MLTransform in read mode

In `read` mode, `MLTransform` uses the artifacts saved during `write` mode. In this example, the transform and its attributes are loaded from the saved artifacts, so you don't need to specify the transform again during `read` mode.

In this way, `MLTransform` provides consistent preprocessing steps for training and inference workloads.
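
The idea behind the two modes can be sketched in plain Python as saving a configuration once and loading it later. This is a conceptual illustration only: the file name `transform_config.json`, the JSON layout, and the helpers `write_config` and `read_config` are assumptions made for this example and do not reflect the artifact format that `MLTransform` actually uses.

```python
import json
import os
import tempfile

# Hypothetical file name, chosen for this illustration only.
CONFIG_FILE = 'transform_config.json'


def write_config(location, config):
    """'Write' mode: persist the transform configuration to a directory."""
    with open(os.path.join(location, CONFIG_FILE), 'w') as f:
        json.dump(config, f)


def read_config(location):
    """'Read' mode: load the configuration that was saved earlier."""
    with open(os.path.join(location, CONFIG_FILE)) as f:
        return json.load(f)


location = tempfile.mkdtemp(prefix='artifact_sketch')
write_config(
    location,
    {'model_name': 'textembedding-gecko@latest', 'columns': ['x']})

# The reader only needs the location, not the original configuration.
restored = read_config(location)
print(restored)
```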

In [10]:
test_content = [
    {
        'x': 'This is a test sentence'
    },
    {
        'x': 'The park is full of dogs'
    },
]

with beam.Pipeline() as pipeline:
  data_pcoll = (
          pipeline
          | "CreateData" >> beam.Create(test_content))
  transformed_pcoll = (
      data_pcoll
      | "MLTransform" >> MLTransform(read_artifact_location=artifact_location))

  transformed_pcoll | 'LogOutput' >> beam.Map(print)


{'x': [0.04782044142484665, -0.010078949853777885, -0.05793016776442528, -0.026060665026307106, 0.05756739526987076, ...]}