In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Generate Text Embeddings with a Hugging Face model using Apache Spark on Vertex AI Workbench

## Overview
This example builds a similarity search over Stack Overflow questions to identify similar topics, questions, and technologies being discussed. It uses BigQuery and Dataproc Serverless for distributed prediction with deep learning models.

Data Engineers and Data Scientists with existing working knowledge of BigQuery and Dataproc/Spark can use this notebook to launch batch inference jobs at scale.

### Objective

In this tutorial, you learn how to use Apache Spark for batch inference and BigQuery for vector search. You also learn how to use Dataproc interactive sessions from Jupyter notebooks on a Vertex AI Workbench instance.

The example uses open source Stack Overflow data and the open source Hugging Face model [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) to generate embeddings of text data. The model maps text into a 384-dimensional dense vector space. The embeddings are stored in BigQuery, where a vector index supports the similarity search.
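To make "similarity search in a dense vector space" concrete, here is a minimal brute-force sketch in plain Python. The three-dimensional vectors and titles are hypothetical stand-ins for the 384-dimensional embeddings, and the exhaustive scan is only an illustration: the notebook itself relies on a BigQuery vector index rather than this loop.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_similar(query_vec, corpus, k=2):
    """Brute-force search: rank every stored vector by distance to the query."""
    scored = [(title, euclidean_distance(query_vec, vec)) for title, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical 3-dimensional "embeddings"; the real model emits 384 values per text.
toy_corpus = {
    "spark question": [0.9, 0.1, 0.0],
    "hadoop question": [0.8, 0.2, 0.1],
    "css question": [0.0, 0.1, 0.9],
}
nearest = top_k_similar([1.0, 0.0, 0.0], toy_corpus, k=2)
print(nearest)
```

A smaller distance means the two texts are closer in meaning, which is how the results at the end of this notebook are ranked.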

This tutorial uses the following Google Cloud ML services and resources:

- BQML - Vector Search

### Dataset

BigQuery public dataset - "bigquery-public-data.stackoverflow"

### Costs 

This tutorial uses billable components of Google Cloud:

* Dataproc Serverless
* BigQuery
* Vertex Workbench Instance / BQ Studio

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery/pricing),
and [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing), 
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com) and [enable the Dataproc API](https://console.cloud.google.com/flows/enableapi?apiid=dataproc.googleapis.com).


## Setup & Installation

### Create Dataproc Interactive Session Template with Autoscaling

Create a [Dataproc Interactive Session Template](https://cloud.google.com/dataproc-serverless/docs/guides/create-serverless-sessions-templates) using the network configuration specified in the link.

Enable [autoscaling](https://cloud.google.com/dataproc-serverless/docs/concepts/autoscaling).

Required parameters to enable autoscaling:

    spark.dynamicAllocation.enabled = true
    spark.dynamicAllocation.maxExecutors = 100
    spark.dynamicAllocation.minExecutors = 5
    

#### Select Dataproc Serverless Interactive Session as the Kernel for this notebook

Once the template is created, select it as the kernel for this notebook. This creates a Dataproc interactive session, which you can [check in the console](https://console.cloud.google.com/dataproc/interactive?).

This may take a while, so please don't close the notebook.

Spark runtime 2.2 has many ML libraries pre-installed. Check the [Spark runtime 2.2 release notes](https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-2.2) before reinstalling any of them.

Because of dependencies among the Hugging Face libraries, verify that the NumPy version is 1.26.

In [10]:
import numpy as np
np.__version__

'1.26.4'
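If you prefer an explicit check over reading the version by eye, a small helper along these lines could be used. This is only a sketch: `version_matches` is a hypothetical name, and the literal `'1.26.4'` stands in for `np.__version__` so the snippet runs without NumPy.

```python
def version_matches(version, required_prefix):
    """Return True when the major.minor part of `version` equals `required_prefix`."""
    major_minor = ".".join(version.split(".")[:2])
    return major_minor == required_prefix

# '1.26.4' is the version printed above; in the notebook you would pass np.__version__.
print(version_matches("1.26.4", "1.26"))
```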

In [1]:
project_id = '[your-project-id]'  # @param {type:"string"}
region = "us-central1"  # @param {type: "string"}

### Import libraries

In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkConf
import numpy as np
from sentence_transformers import SentenceTransformer
from google.cloud import bigquery

from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.functions import struct, col, array, udf, lit, split
from pyspark.sql.types import ArrayType, FloatType
from pyspark.sql.functions import regexp_replace

Create a [BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets#console) to hold the output tables.

In [19]:
bq_dataset = '[your-bigquery-dataset]'
table_name = f'{bq_dataset}.stackoverflow_questions'
index = f'{bq_dataset}.stackoverflow_index'

### Create Spark Session & load the data

In [4]:
spark = SparkSession.builder.appName("Embeddings")\
.config("spark.jars.packages", "org.apache.spark:spark-avro_2.13:3.5.1") \
.getOrCreate()


25/01/02 09:21:18 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [15]:
df_data = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data.stackoverflow.posts_questions') \
  .load()

df_data = df_data.select('title', 'tags').filter(col("title").isNotNull())

# To remove special characters from title for text embedding model
df_data = df_data.withColumn('title_mod', regexp_replace("title", '[^A-Za-z0-9.,]',' '))
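To see what the cleaning pattern does to a single title, the same character class can be tried with Python's `re` module. The sample title is hypothetical, and `clean_title` is an illustrative helper rather than part of the notebook; Spark's `regexp_replace` uses Java regular expressions, which treat this simple pattern the same way.

```python
import re

def clean_title(title):
    """Replace every character outside A-Z, a-z, 0-9, '.' and ',' with a space."""
    return re.sub(r"[^A-Za-z0-9.,]", " ", title)

sample = "How to use C++ & CUDA?"  # hypothetical title, not taken from the dataset
cleaned = clean_title(sample)
print(repr(cleaned))
```

Each removed character becomes one space, so the cleaned title keeps the original length but can contain runs of spaces.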

In [16]:
df_data.show(5)



+--------------------+--------------------+--------------------+
|               title|                tags|           title_mod|
+--------------------+--------------------+--------------------+
|Which internal fo...|   3d-texture|webgl2|Which internal fo...|
|How to scroll to ...|jquery|mousewheel...|How to scroll to ...|
|How To read Colum...|          apache-poi|How To read Colum...|
|cudaMemcpy2D erro...|            c++|cuda|cudaMemcpy2D erro...|
|Query SQL on Orie...|            orientdb|Query SQL on Orie...|
+--------------------+--------------------+--------------------+
only showing top 5 rows



                                                                                

##### Understand the data

In [17]:
print(df_data.columns)
print(df_data.count())

['title', 'tags', 'title_mod']
10000


### Create batch prediction function

The model is loaded inside the batch prediction function, so that loading happens on the executors, which then run distributed inference on the Spark DataFrame.
Learn more in the [`predict_batch_udf` documentation](https://spark.apache.org/docs/3.4.3/api/python/reference/api/pyspark.ml.functions.predict_batch_udf.html).
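The pattern is easier to follow in isolation: an outer function performs the expensive setup and returns an inner function that is applied to one batch at a time. The sketch below imitates that contract in plain Python. `run_in_batches`, the load log, and the fake two-value "embedding" are all stand-ins invented for illustration; they are not how Spark schedules the work.

```python
def make_predict_fn(load_log):
    """Factory: performs the expensive setup once, then returns the batch function."""
    load_log.append("model loaded")  # stand-in for SentenceTransformer(...)

    def predict(batch):
        # Stand-in "embedding": [character count, word count] for each text.
        return [[float(len(text)), float(len(text.split()))] for text in batch]

    return predict

def run_in_batches(factory, rows, batch_size, load_log):
    """Mimic the batching contract: build predict once, then feed it fixed-size batches."""
    predict = factory(load_log)
    outputs = []
    for start in range(0, len(rows), batch_size):
        outputs.extend(predict(rows[start:start + batch_size]))
    return outputs

loads = []
titles = ["spark join", "hadoop", "read csv in pandas"]
vectors = run_in_batches(make_predict_fn, titles, batch_size=2, load_log=loads)
print(vectors, loads)
```

The load log ends up with a single entry even though two batches were processed, which is the reason for keeping the model load outside the inner function.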

In [18]:
def predict_batch_fn():
    # Imports and model loading live inside this function so they run on the executors.
    import numpy as np
    import torch
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("Using {} device".format(device))

    model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
    model.to(device)

    def predict(inputs: np.ndarray) -> np.ndarray:
        # inputs holds batch_size title strings
        embeddings = model.encode(inputs)
        return embeddings  # shape (batch_size, 384)

    return predict

In [19]:
generate_embeddings_udf = predict_batch_udf(predict_batch_fn,
                          return_type=ArrayType(FloatType()),
                          batch_size=100)

In [20]:
%%time
from pyspark.sql.functions import size

prediction = df_data.withColumn("embeddings", generate_embeddings_udf('title_mod'))
prediction = prediction.withColumn("array_size", size(prediction["embeddings"]))


CPU times: user 8.83 ms, sys: 5.6 ms, total: 14.4 ms
Wall time: 79.6 ms


In [21]:
prediction.printSchema()

root
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- title_mod: string (nullable = true)
 |-- embeddings: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- array_size: integer (nullable = false)



#### Validate the length of the generated embeddings (384) as a data validation step

In [12]:
from pyspark.sql.functions import countDistinct, count

prediction_filtered = prediction.filter(size(prediction["embeddings"]) == 384)
df_grouped = prediction_filtered.groupBy("array_size").agg(count("array_size").alias("row_count"))

In [None]:
df_grouped.show()
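The same check can be reasoned about on a small in-memory sample. The sketch below counts rows per vector length without filtering first, so unexpected sizes become visible instead of being dropped. The sample rows are hypothetical and `embedding_length_counts` is an illustrative helper, not a Spark function.

```python
from collections import Counter

def embedding_length_counts(embeddings):
    """Count how many embeddings exist for each vector length."""
    return Counter(len(vec) for vec in embeddings)

# Hypothetical rows: two well-formed 384-value vectors and one truncated vector.
rows = [[0.0] * 384, [0.1] * 384, [0.2] * 10]
counts = embedding_length_counts(rows)
print(counts)
```

In a healthy run, every row falls under the single key 384.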

In [None]:
# repartition returns a new DataFrame, so assign the result back
prediction = prediction.repartition(100)

In [None]:
prediction.show(5)

#### Save the dataframe as a table in BigQuery. We will create a vector index on this table

In [23]:
prediction.write\
.mode('overwrite')\
.format("bigquery")\
.option("writeMethod","direct")\
.option("useAvroLogicalTypes", "true")\
.option("table", f"{table_name}_embeddings")\
.save()


                                                                                

## Create Vector index in BigQuery 
https://cloud.google.com/bigquery/docs/vector-index 

In [None]:
client = bigquery.Client()
client.dataset(bq_dataset)

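Replace the `<bq_dataset>`, `<index>`, and `<table_name>` placeholders in the BigQuery magic command below with your own dataset, index, and table names before running it.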
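As an alternative to editing the statement by hand, it can be assembled in Python from the notebook's variables. The following is a sketch under assumptions: `build_create_index_sql` is a hypothetical helper, the names passed to it are placeholders, and the statement simply mirrors the shape and options of the one in the next cell.

```python
def build_create_index_sql(dataset, index_name, table, column="embeddings"):
    """Return the CREATE VECTOR INDEX statement with the placeholders filled in."""
    return (
        f"CREATE VECTOR INDEX {dataset}.{index_name} "
        f"ON {dataset}.{table}({column})\n"
        "OPTIONS (index_type = 'TREE_AH', distance_type = 'EUCLIDEAN',\n"
        "tree_ah_options = '{\"normalization_type\": \"L2\"}');"
    )

# Hypothetical names for illustration; substitute your own dataset and table.
ddl = build_create_index_sql("my_dataset", "stackoverflow_index",
                             "stackoverflow_questions_embeddings")
print(ddl)
# In the notebook, the statement could then be run with: client.query(ddl).result()
```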

In [25]:
%%bigquery
CREATE VECTOR INDEX <bq_dataset>.<index> ON <bq_dataset>.<table_name>_embeddings(embeddings)
OPTIONS (index_type = 'TREE_AH', distance_type = 'EUCLIDEAN',
tree_ah_options = '{"normalization_type": "L2"}');


#### Generate an embedding for the search query, then search the vector index for similar items

In [6]:
query_sentence = "Hadoop and Apache Spark"
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
embeddings = model.encode(query_sentence).tolist()
# embeddings

In [28]:
# Table names cannot be passed as query parameters, so the name is formatted into the SQL text.
query = f"""
    SELECT * FROM
      VECTOR_SEARCH ( TABLE `{table_name}_embeddings`,
      'embeddings',
      (SELECT @embeddings AS embeddings),
        top_k => 5, options => '{{"fraction_lists_to_search": 0.01}}');
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("embeddings", "FLOAT64", embeddings),
    ]
)
query_job = client.query(query, job_config=job_config)  # Make an API request.

In [27]:
for row in query_job:
    print(row[1]['title'])
    print(f'Distance: {row[2]}\n')

Executing HDFS commands in hadoop
Distance: 0.7820840860574579

Direct HDFS access in hadooponazure
Distance: 0.8784149879587282

using chmod intalling hadoop
Distance: 0.9152823934170262

Using Hadoop & related projects to analyze usage patterns that constantly change
Distance: 0.9157446105036288

Performance-wise, is it better to use Hadoop over MPI for a typical embarrassingly parallel scenario?
Distance: 0.9221166140003834
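The distances above are easier to interpret as cosine similarities. For two unit-length vectors, the squared Euclidean distance equals 2 - 2·cos, so cos = 1 - d²/2. The sketch below applies that identity to two of the printed distances. It assumes both the query and the stored vectors are compared at unit length, which is suggested by the index's L2 normalization but not verified here.

```python
def cosine_from_unit_distance(distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos, so cos = 1 - d^2 / 2."""
    return 1.0 - (distance ** 2) / 2.0

# Distances copied from the output above (closest and fifth-closest match).
for d in (0.7820840860574579, 0.9221166140003834):
    print(round(cosine_from_unit_distance(d), 3))
```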



In [30]:
# Clean up the Spark session
spark.stop()

## Cleaning up

To clean up the Google Cloud resources used in this project, delete the BigQuery dataset you created for the tutorial.
