# PySpark Hugging Face Inferencing
### Text Classification using Pipelines with TensorFlow

Based on: https://huggingface.co/docs/transformers/quicktour#pipeline-usage

### Using TensorFlow
Note that cuFFT/cuDNN/cuBLAS registration errors are expected with `tf=2.17.0` and do not affect behavior, as noted in [this issue](https://github.com/tensorflow/tensorflow/issues/62075).  
This notebook does not demonstrate inference with TensorRT, as [TF-TRT](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html#tensorrt-10) does not yet support `tf=2.17.0`. See the `pytorch` notebooks for TensorRT demos.

In [2]:
import tensorflow as tf
from transformers import pipeline

2024-10-03 16:47:48.209366: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-03 16:47:48.215921: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-03 16:47:48.223519: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-03 16:47:48.225906: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-03 16:47:48.231640: I tensorflow/core/platform/cpu_feature_guar

In [3]:
# use GPU 0 if TensorFlow detects a GPU, otherwise fall back to CPU (-1)
device = 0 if tf.config.list_physical_devices('GPU') else -1

In [4]:
print(tf.__version__)

# Enable GPU memory growth
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

2.17.0


In [5]:
classifier = pipeline("sentiment-analysis", device=device)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
2024-10-03 16:47:49.863791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 46447 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:01:00.0, compute capability: 8.6
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [6]:
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997794032096863}]

In [7]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5282
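
The second score above (0.5282) is only slightly over one half, so its NEGATIVE label is close to a coin flip for a two-class model. A minimal sketch of flagging such predictions follows. The helper name `flag_uncertain` and the threshold of 0.6 are assumptions made for illustration; the sample dicts reuse the labels and rounded scores printed above.

```python
def flag_uncertain(results, threshold=0.6):
    """Return the results whose score falls below `threshold`.

    `threshold` is an illustrative assumption, not a value taken from
    the Transformers documentation.
    """
    return [r for r in results if r["score"] < threshold]


# Labels and rounded scores copied from the output above.
sample = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5282},
]
uncertain = flag_uncertain(sample)
print(uncertain)
```

A suitable threshold depends on the application, so treat the default here as a placeholder.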


#### Use another model and tokenizer in the pipeline

In [8]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

In [9]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [10]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")

[{'label': '5 stars', 'score': 0.7272655963897705}]
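
This model reports its prediction as a star-rating string such as `'5 stars'` rather than POSITIVE or NEGATIVE. The sketch below converts that label to an integer, assuming every label starts with the count followed by a space, as in the output above. `stars_to_int` is a hypothetical helper, not part of Transformers.

```python
def stars_to_int(label):
    """Convert a label such as '5 stars' into the integer 5."""
    count, _, _unit = label.partition(" ")
    return int(count)


# Prediction copied from the output above.
prediction = {"label": "5 stars", "score": 0.7272655963897705}
print(stars_to_int(prediction["label"]))
```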

## Inference using Spark DL API

In [11]:
import os
import pandas as pd
from pyspark.sql.functions import col, struct, pandas_udf
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import FloatType, StringType, StructField, StructType
from pyspark.sql import SparkSession
from pyspark import SparkConf

In [None]:
conda_env = os.environ.get("CONDA_PREFIX")

conf = SparkConf()
if 'spark' not in globals():
    # If Spark is not already started with Jupyter, attach to Spark Standalone
    import socket
    hostname = socket.gethostname()
    conf.setMaster(f"spark://{hostname}:7077") # assuming Master is on default port 7077
conf.set("spark.task.maxFailures", "1")
conf.set("spark.driver.memory", "8g")
conf.set("spark.executor.memory", "8g")
conf.set("spark.pyspark.python", f"{conda_env}/bin/python")
conf.set("spark.pyspark.driver.python", f"{conda_env}/bin/python")
conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", "false")
conf.set("spark.sql.pyspark.jvmStacktrace.enabled", "true")
conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
conf.set("spark.python.worker.reuse", "true")
# Create Spark Session
spark = SparkSession.builder.appName("spark-dl-examples").config(conf=conf).getOrCreate()
sc = spark.sparkContext
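
`os.environ.get("CONDA_PREFIX")` returns `None` when no conda environment is active, in which case the f-strings above would produce the literal path `None/bin/python`. The sketch below shows one possible fallback, assuming the currently running interpreter (`sys.executable`) is acceptable for both driver and executors. `resolve_python` is a hypothetical helper name.

```python
import sys


def resolve_python(environ):
    """Pick the interpreter path to hand to Spark.

    Uses <CONDA_PREFIX>/bin/python when the variable is set and
    falls back to the running interpreter otherwise.
    """
    prefix = environ.get("CONDA_PREFIX")
    if prefix:
        return f"{prefix}/bin/python"
    return sys.executable


# Example with an explicit mapping instead of os.environ; the prefix
# below is a made-up path used only for illustration.
print(resolve_python({"CONDA_PREFIX": "/opt/conda/envs/demo"}))
print(resolve_python({}) == sys.executable)
```

The returned path could then be passed to `conf.set("spark.pyspark.python", ...)` and `conf.set("spark.pyspark.driver.python", ...)`.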

In [13]:
from datasets import load_dataset

# Load the IMDB dataset
data = load_dataset("imdb", split="test")

lines = []
for example in data:
    # keep the full review text; the first sentence is extracted later
    lines.append([example["text"]])

len(lines)

df = spark.createDataFrame(lines, ['lines']).repartition(8).cache()

In [14]:
df.write.mode("overwrite").parquet("imdb_test")

24/10/03 16:47:58 WARN TaskSetManager: Stage 0 contains a task of very large size (3860 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

In [15]:
# only use first sentence of IMDB reviews
@pandas_udf("string")
def first_sentence(text: pd.Series) -> pd.Series:
    return pd.Series([s.split(".")[0] for s in text])

df = spark.read.parquet("imdb_test").withColumn("sentence", first_sentence(col("lines"))).select("sentence").limit(100).cache()
df.show(truncate=80)

                                                                                

+--------------------------------------------------------------------------------+
|                                                                        sentence|
+--------------------------------------------------------------------------------+
|                                                                                |
|Hard up, No proper jobs going down at the pit, why not rent your kids! DIY pi...|
|I watched this movie to see the direction one of the most promising young tal...|
|                        This movie makes you wish imdb would let you vote a zero|
|I never want to see this movie again!<br /><br />Not only is it dreadfully ba...|
|(As a note, I'd like to say that I saw this movie at my annual church camp, w...|
|                 Don't get me wrong, I love the TV series of League Of Gentlemen|
|Did you ever think, like after watching a horror movie with a group of friend...|
|                                                             Awful, awful, awful|
|Thi
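
One plausible cause of the blank first row is the way `split(".")[0]` behaves: a review that begins with a period (for example a leading ellipsis) yields an empty string. The plain-Python sketch below reproduces this without Spark and shows a hypothetical variant that skips empty fragments. The sample string is invented for illustration and does not come from the IMDB dataset.

```python
def first_sentence_py(text):
    """Same rule as the UDF above: everything before the first period."""
    return text.split(".")[0]


def first_nonempty_sentence(text):
    """Hypothetical variant: first period-delimited fragment that is not blank."""
    for fragment in text.split("."):
        if fragment.strip():
            return fragment.strip()
    return ""


# Invented sample that starts with an ellipsis.
sample = "...And then the credits rolled. Nothing else happened."
print(repr(first_sentence_py(sample)))
print(repr(first_nonempty_sentence(sample)))
```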

In [16]:
def predict_batch_fn():
    import tensorflow as tf
    from transformers import pipeline

    # Enable GPU memory growth
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as e:
            print(e)
    
    device = 0 if tf.config.list_physical_devices('GPU') else -1
    pipe = pipeline("sentiment-analysis", device=device)
    def predict(inputs):
        return pipe(inputs.tolist())
    return predict

In [17]:
classify = predict_batch_udf(predict_batch_fn,
                             return_type=StructType([
                                 StructField("label", StringType(), True),
                                 StructField("score", FloatType(), True)
                             ]),
                             batch_size=10)
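
Conceptually, `predict_batch_udf` hands the function returned by `predict_batch_fn` chunks of at most `batch_size` rows and expects one result per row. With a `StructType` return type, a row-oriented list of dicts keyed by field name, which is what the pipeline returns, is one of the accepted shapes. The sketch below imitates that chunking in plain Python with a stand-in predictor. It is a simulation, not Spark's implementation, and `fake_predict` and `run_in_batches` are hypothetical names.

```python
def fake_predict(batch):
    """Stand-in for the pipeline: one {'label', 'score'} dict per input row."""
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]


def run_in_batches(rows, predict, batch_size):
    """Call `predict` on consecutive chunks of at most `batch_size` rows."""
    outputs = []
    calls = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        result = predict(chunk)
        # One output row is required for every input row.
        assert len(result) == len(chunk)
        outputs.extend(result)
        calls += 1
    return outputs, calls


rows = [f"sentence {i}" for i in range(25)]
outputs, calls = run_in_batches(rows, fake_predict, batch_size=10)
print(len(outputs), calls)
```

With 25 rows and `batch_size=10`, the predictor is called three times, and the last call receives the remaining five rows.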

In [18]:
%%time
# note: expanding the "struct" return_type to top-level columns
preds = df.withColumn("preds", classify(struct("sentence"))).select("sentence", "preds.*")
results = preds.collect()

CPU times: user 9.15 ms, sys: 6.76 ms, total: 15.9 ms
Wall time: 5 s


                                                                                

In [19]:
%%time
# note: expanding the "struct" return_type to top-level columns
preds = df.withColumn("preds", classify("sentence")).select("sentence", "preds.*")
results = preds.collect()

CPU times: user 4.86 ms, sys: 2.19 ms, total: 7.05 ms
Wall time: 2.81 s


                                                                                

In [20]:
%%time
# note: expanding the "struct" return_type to top-level columns
preds = df.withColumn("preds", classify(col("sentence"))).select("sentence", "preds.*")
results = preds.collect()

CPU times: user 3.91 ms, sys: 1.96 ms, total: 5.87 ms
Wall time: 2.76 s


                                                                                

In [21]:
preds.show(truncate=80)

+--------------------------------------------------------------------------------+--------+----------+
|                                                                        sentence|   label|     score|
+--------------------------------------------------------------------------------+--------+----------+
|                                                                                |POSITIVE|0.74807304|
|Hard up, No proper jobs going down at the pit, why not rent your kids! DIY pi...|NEGATIVE| 0.9996724|
|I watched this movie to see the direction one of the most promising young tal...|POSITIVE| 0.9994948|
|                        This movie makes you wish imdb would let you vote a zero|NEGATIVE| 0.9981299|
|I never want to see this movie again!<br /><br />Not only is it dreadfully ba...|NEGATIVE|0.99883264|
|(As a note, I'd like to say that I saw this movie at my annual church camp, w...|POSITIVE| 0.9901753|
|                 Don't get me wrong, I love the TV series of League Of G

### Using Triton Inference Server

Note: you can restart the kernel and run from this point to simulate running on a different node or in a different environment.  

This notebook uses the [Python backend with a custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments) with versions of Python and NumPy that are compatible with Triton 24.08, using a conda-pack environment created as follows:
```
conda create -n huggingface-tf -c conda-forge python=3.10.0
conda activate huggingface-tf

export PYTHONNOUSERSITE=True
pip install numpy==1.26.4 tensorflow[and-cuda] tf-keras transformers conda-pack

conda-pack  # huggingface-tf.tar.gz
```

In [22]:
import numpy as np
import pandas as pd
import os
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.functions import col, struct, pandas_udf
from pyspark.sql.types import FloatType, StringType, StructField, StructType

In [23]:
%%bash
# copy custom model to expected layout for Triton
rm -rf models
mkdir -p models
cp -r models_config/hf_pipeline_tf models

# add custom execution environment
cp huggingface-tf.tar.gz models
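
Triton expects each model in its own directory under the repository root, with a `config.pbtxt` and a numbered version subdirectory. For the Python backend, the version directory holds `model.py`. The contents of `models_config/hf_pipeline_tf` are not shown here, so the paths checked below are assumptions based on that convention. `missing_paths` is a hypothetical helper.

```python
import os
import tempfile


def missing_paths(repo_dir, model_name, env_tarball, version="1"):
    """Return the expected repository files that do not exist."""
    expected = [
        os.path.join(repo_dir, model_name, "config.pbtxt"),
        os.path.join(repo_dir, model_name, version, "model.py"),
        os.path.join(repo_dir, env_tarball),
    ]
    return [p for p in expected if not os.path.exists(p)]


# Demonstration against a throwaway directory that is deliberately incomplete:
# only config.pbtxt is created, so model.py and the tarball are reported.
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "hf_pipeline_tf", "1"))
    open(os.path.join(repo, "hf_pipeline_tf", "config.pbtxt"), "w").close()
    missing = missing_paths(repo, "hf_pipeline_tf", "huggingface-tf.tar.gz")
    missing_names = [os.path.relpath(p, repo) for p in missing]
print(missing_names)
```

Calling `missing_paths("models", "hf_pipeline_tf", "huggingface-tf.tar.gz")` before starting the server would surface a missing file earlier than a failed model load.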

#### Start Triton Server on each executor

In [24]:
num_executors = 1
triton_models_dir = "{}/models".format(os.getcwd())
huggingface_cache_dir = "{}/.cache/huggingface".format(os.path.expanduser('~'))
nodeRDD = sc.parallelize(list(range(num_executors)), num_executors)

def start_triton(it):
    import docker
    import time
    import tritonclient.grpc as grpcclient
    
    client=docker.from_env()
    containers=client.containers.list(filters={"name": "spark-triton"})
    if containers:
        print(">>>> containers: {}".format([c.short_id for c in containers]))
    else:
        container=client.containers.run(
            "nvcr.io/nvidia/tritonserver:24.08-py3", "tritonserver --model-repository=/models",
            detach=True,
            device_requests=[docker.types.DeviceRequest(device_ids=["0"], capabilities=[['gpu']])],
            environment=[
                "TRANSFORMERS_CACHE=/cache"
            ],
            name="spark-triton",
            network_mode="host",
            remove=True,
            shm_size="256M",
            volumes={
                triton_models_dir: {"bind": "/models", "mode": "ro"},
                huggingface_cache_dir: {"bind": "/cache", "mode": "rw"}
            }
        )
        print(">>>> starting triton: {}".format(container.short_id))
        # wait for triton to be running
        time.sleep(15)
        
        client = grpcclient.InferenceServerClient("localhost:8001")
        
        elapsed = 0
        timeout = 120
        ready = False
        while not ready and elapsed < timeout:
            try:
                time.sleep(5)
                elapsed += 5
                ready = client.is_server_ready()
            except Exception:
                # server is not accepting connections yet; keep polling
                pass

    return [True]

nodeRDD.barrier().mapPartitions(start_triton).collect()

                                                                                

[True]
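
The polling loop in `start_triton` gives up after `timeout` seconds, but the function returns `[True]` whether or not the server became ready. The sketch below expresses the same loop as a reusable helper that reports the outcome. `wait_until_ready` and its injectable `sleep` argument are hypothetical, and `check` stands in for `client.is_server_ready`.

```python
import time


def wait_until_ready(check, timeout=120, interval=5, sleep=time.sleep):
    """Poll `check()` every `interval` seconds until it returns True.

    Returns True if the check succeeded within `timeout` seconds and
    False otherwise. Exceptions from `check` count as "not ready yet".
    """
    elapsed = 0
    while elapsed < timeout:
        sleep(interval)
        elapsed += interval
        try:
            if check():
                return True
        except Exception:
            # The server may still be starting; keep polling.
            continue
    return False


# Demonstration with a fake check that succeeds on the third poll and a
# no-op sleep so the example finishes instantly.
attempts = {"count": 0}


def fake_check():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("not up yet")
    return True


print(wait_until_ready(fake_check, sleep=lambda seconds: None))
print(wait_until_ready(lambda: False, timeout=10, sleep=lambda seconds: None))
```

Inside `start_triton`, the boolean result could be returned instead of the constant `[True]` so that the driver sees a failed startup.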

#### Run inference

In [25]:
# only use first sentence of IMDB reviews
@pandas_udf("string")
def first_sentence(text: pd.Series) -> pd.Series:
    return pd.Series([s.split(".")[0] for s in text])

df = spark.read.parquet("imdb_test").withColumn("sentence", first_sentence(col("lines"))).select("sentence").limit(1000)

In [26]:
def triton_fn(triton_uri, model_name):
    import numpy as np
    import tritonclient.grpc as grpcclient
    
    # map Triton datatype names to NumPy dtypes
    np_types = {
      "BOOL": np.dtype(np.bool_),
      "INT8": np.dtype(np.int8),
      "INT16": np.dtype(np.int16),
      "INT32": np.dtype(np.int32),
      "INT64": np.dtype(np.int64),
      "FP16": np.dtype(np.float16),
      "FP32": np.dtype(np.float32),
      "FP64": np.dtype(np.float64),
      "BYTES": np.dtype(object)
    }

    client = grpcclient.InferenceServerClient(triton_uri)
    model_meta = client.get_model_metadata(model_name)
    
    def predict(inputs):
        if isinstance(inputs, np.ndarray):
            # single ndarray input
            request = [grpcclient.InferInput(model_meta.inputs[0].name, inputs.shape, model_meta.inputs[0].datatype)]
            request[0].set_data_from_numpy(inputs.astype(np_types[model_meta.inputs[0].datatype]))
        else:
            # dict of multiple ndarray inputs
            request = [grpcclient.InferInput(i.name, inputs[i.name].shape, i.datatype) for i in model_meta.inputs]
            for i in request:
                i.set_data_from_numpy(inputs[i.name()].astype(np_types[i.datatype()]))
        
        response = client.infer(model_name, inputs=request)
        
        if len(model_meta.outputs) > 1:
            # return dictionary of numpy arrays
            return {o.name: response.as_numpy(o.name) for o in model_meta.outputs}
        else:
            # return single numpy array
            return response.as_numpy(model_meta.outputs[0].name)
        
    return predict

In [27]:
from functools import partial

classify = predict_batch_udf(partial(triton_fn, triton_uri="localhost:8001", model_name="hf_pipeline_tf"),
                             return_type=StructType([
                                 StructField("label", StringType(), True),
                                 StructField("score", FloatType(), True)
                             ]),
                             input_tensor_shapes=[[1]],
                             batch_size=100)

In [28]:
%%time
# first pass caches model/fn
# note: expanding the "struct" return_type to top-level columns
preds = df.withColumn("preds", classify(struct("sentence"))).select("sentence", "preds.*")
results = preds.collect()

CPU times: user 22.5 ms, sys: 5.9 ms, total: 28.4 ms
Wall time: 24.6 s


                                                                                

In [29]:
%%time
# note: expanding the "struct" return_type to top-level columns
preds = df.withColumn("preds", classify("sentence")).select("sentence", "preds.*")
results = preds.collect()

CPU times: user 12.2 ms, sys: 10.1 ms, total: 22.3 ms
Wall time: 23.8 s


                                                                                

In [30]:
%%time
# note: expanding the "struct" return_type to top-level columns
preds = df.withColumn("preds", classify(col("sentence"))).select("sentence", "preds.*")
results = preds.collect()

CPU times: user 8.74 ms, sys: 8.23 ms, total: 17 ms
Wall time: 23.8 s


                                                                                

In [31]:
preds.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+----------+
|sentence                                                                                                                                                                                                                                                                   |label   |score     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+----------+
|                                                                                                                                 

                                                                                

#### Stop Triton Server on each executor

In [32]:
def stop_triton(it):
    import docker
    
    client=docker.from_env()
    containers=client.containers.list(filters={"name": "spark-triton"})
    print(">>>> stopping containers: {}".format([c.short_id for c in containers]))
    if containers:
        container=containers[0]
        container.stop(timeout=120)

    return [True]

nodeRDD.barrier().mapPartitions(stop_triton).collect()

                                                                                

[True]
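
`stop_triton` stops only `containers[0]`. If more than one container matches the name filter, the others keep running. The sketch below stops every match. It is written against any object that exposes `short_id` and `stop(timeout=...)`, as Docker SDK containers do, and is demonstrated with a stand-in class so it runs without Docker. `stop_all` and `FakeContainer` are hypothetical names.

```python
def stop_all(containers, timeout=120):
    """Stop every container in `containers` and return their short ids."""
    stopped = []
    for container in containers:
        container.stop(timeout=timeout)
        stopped.append(container.short_id)
    return stopped


class FakeContainer:
    """Minimal stand-in exposing the two members the helper touches."""

    def __init__(self, short_id):
        self.short_id = short_id
        self.stopped_with = None

    def stop(self, timeout):
        # Record the timeout instead of stopping anything real.
        self.stopped_with = timeout


fakes = [FakeContainer("aaa111"), FakeContainer("bbb222")]
print(stop_all(fakes, timeout=30))
```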

In [33]:
spark.stop()