<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/tensorrt_torchtrt_efficientnet/nvidia_logo.png" width="90px">

# PySpark LLM Inference: Qwen-2.5-14b Data Structuring

In this notebook, we demonstrate distributed batch inference with [Qwen-2.5](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), using open weights hosted on Hugging Face.

Qwen-2.5-14b-instruct is an instruction-fine-tuned version of the Qwen-2.5-14b base model. We'll show how to use the model to convert unstructured text into a structured schema for downstream tasks.

**Note:** This example demonstrates **tensor parallelism**, which requires multiple GPUs per node. Standalone users should use a Spark worker with 2 GPUs. If you follow the Databricks or Dataproc instructions, pass the `tp` argument to the cluster startup scripts.

Check the cluster environment to handle any platform-specific configurations.

In [1]:
import os

on_databricks = os.environ.get("DATABRICKS_RUNTIME_VERSION", False)
on_dataproc = os.environ.get("DATAPROC_IMAGE_VERSION", False)
on_standalone = not (on_databricks or on_dataproc)

In [2]:
# For cloud environments, load the model to the distributed file system.
if on_databricks:
    models_dir = "/dbfs/FileStore/spark-dl-models"
    dbutils.fs.mkdirs("/FileStore/spark-dl-models")
    model_path = f"{models_dir}/qwen2.5-14b"
elif on_dataproc:
    models_dir = "/mnt/gcs/spark-dl-models"
    os.makedirs(models_dir, exist_ok=True)
    model_path = f"{models_dir}/qwen2.5-14b"
else:
    model_path = os.path.abspath("qwen2.5-14b")

Download the model from the Hugging Face Hub.

In [3]:
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="Qwen/Qwen2.5-14B-Instruct",
    local_dir=model_path
)


## PySpark

In [4]:
import pandas as pd
from pyspark.sql.types import *
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col, struct, length, lit, concat
from pyspark.ml.functions import predict_batch_udf

In [5]:
import os
import datasets
from datasets import load_dataset
datasets.disable_progress_bars()

#### Create Spark Session

For local standalone clusters, we connect to the cluster and create the Spark session.  
For cloud service provider (CSP) environments, Spark is either preconfigured (Databricks) or we need to create the Spark session ourselves (Dataproc).

In [6]:
conf = SparkConf()

if 'spark' not in globals():
    if on_standalone:
        import socket
        conda_env = os.environ.get("CONDA_PREFIX")
        hostname = socket.gethostname()
        conf.setMaster(f"spark://{hostname}:7077")
        conf.set("spark.pyspark.python", f"{conda_env}/bin/python")
        conf.set("spark.pyspark.driver.python", f"{conda_env}/bin/python")
        
    # For standalone users: set executor.cores to the cores available per worker, and adjust
    # task.resource.gpu.amount to match (here 2 GPUs / 24 cores = 0.083333).
    conf.set("spark.executor.cores", "24")  
    conf.set("spark.task.maxFailures", "1")
    conf.set("spark.task.resource.gpu.amount", "0.083333")
    conf.set("spark.executor.resource.gpu.amount", "2")  # 2 GPUs per executor
    conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    conf.set("spark.python.worker.reuse", "true")

spark = SparkSession.builder.appName("spark-dl-examples").config(conf=conf).getOrCreate()
sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/20 00:55:18 INFO SparkEnv: Registering MapOutputTracker
25/03/20 00:55:18 INFO SparkEnv: Registering BlockManagerMaster
25/03/20 00:55:18 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/03/20 00:55:18 INFO SparkEnv: Registering OutputCommitCoordinator
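
The fractional `spark.task.resource.gpu.amount` in the configuration above appears to follow from the other two settings: with 2 GPUs and 24 cores per executor, each task is assigned 2/24 of a GPU, so that all 24 cores can run tasks concurrently. The sketch below shows one way to derive the value for a different worker size. `gpu_amount_per_task` is a hypothetical helper, not part of Spark or this notebook, and truncating to six decimals (rather than rounding) is an assumption made so the per-task shares never add up to more than the available GPUs.

```python
import math


def gpu_amount_per_task(executor_cores: int, executor_gpus: int, decimals: int = 6) -> str:
    """Hypothetical helper: GPU share per task so every core can run a task at once."""
    if executor_cores <= 0 or executor_gpus <= 0:
        raise ValueError("executor_cores and executor_gpus must be positive")
    share = executor_gpus / executor_cores
    # Truncate instead of rounding so the shares cannot exceed the GPUs available.
    scale = 10 ** decimals
    truncated = math.floor(share * scale) / scale
    return f"{truncated:.{decimals}f}"


# Matches the values used in the configuration above: 24 cores, 2 GPUs.
print(gpu_amount_per_task(24, 2))  # 0.083333
```

The returned string could be passed to `conf.set("spark.task.resource.gpu.amount", ...)` in place of the literal value.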


#### Load and Preprocess DataFrame

Load the first 500 samples of the [Amazon Video Game Product Reviews dataset](https://huggingface.co/datasets/logankells/amazon_product_reviews_video_games) from Hugging Face and store them in a Spark DataFrame.

In [7]:
product_reviews_ds = load_dataset("LoganKells/amazon_product_reviews_video_games", split="train", streaming=True)
product_reviews_pds = pd.Series([sample["reviewText"] for sample in product_reviews_ds.take(500)])


Repo card metadata block was not found. Setting CardData to empty.


In [8]:
df = spark.createDataFrame(product_reviews_pds, schema=StringType())

In [9]:
df.show(5, truncate=100)

                                                                                

+----------------------------------------------------------------------------------------------------+
|                                                                                               value|
+----------------------------------------------------------------------------------------------------+
|Installing the game was a struggle (because of games for windows live bugs).Some championship rac...|
|If you like rally cars get this game you will have fun.It is more oriented to &#34;European marke...|
|1st shipment received a book instead of the game.2nd shipment got a FAKE one. Game arrived with a...|
|I had Dirt 2 on Xbox 360 and it was an okay game. I started playing games on my laptop and bought...|
|Overall this is a well done racing game, with very good graphics for its time period. My family h...|
+----------------------------------------------------------------------------------------------------+
only showing top 5 rows



Format each sample into the Qwen chat template, including a system prompt to guide generation.

In [10]:
system_prompt = """You are a specialized review analysis AI that categorizes product reviews into precise sentiment categories.
IMPORTANT: Your response must contain ONLY valid JSON and nothing else - no explanations, no additional text.
For each review, analyze and output EXACTLY this JSON structure:
{
  "primary_sentiment": [EXACTLY ONE OF: "positive", "negative", "neutral", "mixed"],
  "sentiment_score": [integer between 1-10, where 1 is extremely negative and 10 is extremely positive],
  "purchase_intention": [EXACTLY ONE OF: "will repurchase", "might repurchase", "will not repurchase", "recommends alternatives", "uncertain"]
}

Do not include any text before or after the JSON. The response should start with '{' and end with '}' with no trailing characters, comments, or explanations.
"""

df = df.select(
    concat(
        lit("<|im_start|>system\n"),
        lit(system_prompt),
        lit("<|im_end|>\n<|im_start|>user\n"),
        lit("Analyze this review: "),
        col("value"),
        lit("<|im_end|>\n<|im_start|>assistant\n")
    ).alias("prompt")
)

In [11]:
print(df.take(1)[0].prompt)

<|im_start|>system
You are a specialized review analysis AI that categorizes product reviews into precise sentiment categories.
IMPORTANT: Your response must contain ONLY valid JSON and nothing else - no explanations, no additional text.
For each review, analyze and output EXACTLY this JSON structure:
{
  "primary_sentiment": [EXACTLY ONE OF: "positive", "negative", "neutral", "mixed"],
  "sentiment_score": [integer between 1-10, where 1 is extremely negative and 10 is extremely positive],
  "purchase_intention": [EXACTLY ONE OF: "will repurchase", "might repurchase", "will not repurchase", "recommends alternatives", "uncertain"]
}

Do not include any text before or after the JSON. The response should start with '{' and end with '}' with no trailing characters, comments, or explanations.
<|im_end|>
<|im_start|>user
Analyze this review: Installing the game was a struggle (because of games for windows live bugs).Some championship races and cars can only be "unlocked" by buying them as an
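
For reference, the same chat template can be assembled for a single string in plain Python, which is handy for checking the prompt layout without a Spark session. This is a minimal sketch: `build_chatml_prompt` is a hypothetical helper name, and the short system prompt and review text below are made-up placeholders rather than values from the dataset.

```python
def build_chatml_prompt(system_prompt: str, review: str) -> str:
    """Hypothetical helper mirroring the concat() expression used on the DataFrame."""
    return (
        "<|im_start|>system\n"
        f"{system_prompt}"
        "<|im_end|>\n<|im_start|>user\n"
        f"Analyze this review: {review}"
        "<|im_end|>\n<|im_start|>assistant\n"
    )


# Placeholder inputs, for illustration only.
example = build_chatml_prompt("Reply with JSON only.\n", "Great game, would buy again.")
print(example)
```

The prompt ends with an open assistant turn, so the model's completion is the assistant's reply.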

In [12]:
data_path = "spark-dl-datasets/amazon_video_game_reviews"
if on_databricks:
    dbutils.fs.mkdirs("/FileStore/spark-dl-datasets")
    data_path = "dbfs:/FileStore/" + data_path

df.write.mode("overwrite").parquet(data_path)

                                                                                

## Using vLLM Server
In this section, we demonstrate integration with [vLLM Serving](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html), an open-source server with an OpenAI-compatible completions endpoint for LLMs.  

The process looks like this:
- Distribute a server startup task across the Spark cluster, instructing each node to launch a vLLM server process.
- Define a vLLM inference function, which sends inference requests to the local server on a given node.
- Wrap the vLLM inference function in a `predict_batch_udf` to launch parallel inference requests using Spark.
- Finally, distribute a shutdown signal to terminate the vLLM server processes on each node.

<img src="../images/spark-server-mg.png" alt="drawing" width="700"/>

In [13]:
from functools import partial

Import the helper class from server_utils.py:

In [None]:
sc.addPyFile("server_utils.py")

from server_utils import VLLMServerManager

There are currently some hanging issues with vLLM's `torch.compile` on Databricks, which we are working to resolve. For now we will enforce eager mode on Databricks, which disables compilation at some performance cost.

In [None]:
enforce_eager = bool(on_databricks)

#### Start vLLM servers

The `VLLMServerManager` will handle the lifecycle of vLLM server instances across the Spark cluster:
- Find available ports for HTTP
- Deploy a server on each node via stage-level scheduling
- Gracefully shut down servers across nodes

In [None]:
model_name = "qwen-2.5-14b"
server_manager = VLLMServerManager(model_name=model_name, model_path=model_path)

You can pass any of the supported [vLLM serve CLI arguments](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve) as keyword arguments when starting the servers. Note that startup can take some time, as it includes loading the model from disk, Torch compilation, and capturing CUDA graphs.

Here, we set `tensor_parallel_size` to the number of GPUs per node:

In [22]:
tensor_parallel_size = int(spark.conf.get("spark.executor.resource.gpu.amount"))
server_manager.start_servers(tensor_parallel_size=tensor_parallel_size,
                             gpu_memory_utilization=0.95,
                             max_model_len=6600,
                             task="generate",
                             enforce_eager=enforce_eager,
                             wait_retries=100)

2025-03-20 01:04:42,978 - INFO - Requesting stage-level resources: (cores=13, gpu=2.0)
2025-03-20 01:04:42,979 - INFO - Starting 2 VLLM servers.
                                                                                

{'spark-dl-inference-vllm-tp-w-0': (35438, [7000]),
 'spark-dl-inference-vllm-tp-w-1': (35288, [7000])}

#### Define client function

Get the hostname -> url mapping from the server manager:

In [23]:
host_to_http_url = server_manager.host_to_http_url

In [51]:
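def vllm_fn(model_name, host_to_url):
    import socket
    import json
    import requests

    # URL registered by the server manager for the vLLM server on this executor's host.
    url = host_to_url[socket.gethostname()]

    def predict(inputs):
        # Send the whole batch of prompts in a single completions request.
        # Note: the endpoint is hardcoded to the local port reported at server startup (7000);
        # `url` above is looked up but not used by this request.
        response = requests.post(
            "http://localhost:7000/v1/completions",
            json={
                "model": model_name,
                "prompt": inputs.tolist(),
                "max_tokens": 50,
                "temperature": 0.7,
                "top_p": 0.8,
                "repetition_penalty": 1.05,
            }
        )
        # Each completion is expected to be a JSON object matching the schema in the system prompt.
        result_dicts = [json.loads(o["text"]) for o in response.json()["choices"]]
        return result_dicts

    return predict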
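
The `predict` function above reads one completion per prompt from `response.json()["choices"]` and relies on the choices arriving in prompt order. In the OpenAI-style completions format, each choice also carries an `index` field identifying the prompt it belongs to, so sorting on it is a cheap way to make the ordering explicit. The sketch below is offline and illustrative: `texts_in_prompt_order` is a hypothetical helper, and `mock_response` is a hand-written stand-in for a server reply, not real output.

```python
def texts_in_prompt_order(response_json: dict) -> list:
    """Hypothetical helper: return completion texts ordered by their prompt index."""
    choices = sorted(response_json["choices"], key=lambda choice: choice["index"])
    return [choice["text"] for choice in choices]


# Hand-written stand-in for a completions response (deliberately out of order).
mock_response = {
    "choices": [
        {"index": 1, "text": "second", "finish_reason": "stop"},
        {"index": 0, "text": "first", "finish_reason": "stop"},
    ]
}

print(texts_in_prompt_order(mock_response))  # ['first', 'second']
```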

In [55]:
generate = predict_batch_udf(partial(vllm_fn, model_name=model_name, host_to_url=host_to_http_url),
                             return_type=StructType([
                                 StructField("primary_sentiment", StringType()),
                                 StructField("sentiment_score", IntegerType()),
                                 StructField("purchase_intention", StringType())
                             ]),
                             batch_size=32)
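
The `json.loads` call in `predict` assumes every completion is valid JSON. A completion that is cut off at `max_tokens` or wrapped in extra text would raise an exception, and with `spark.task.maxFailures` set to 1 that would fail the job. One possible mitigation, sketched below, is to fall back to empty fields when parsing fails. `parse_completion` is a hypothetical helper, and how `None` values interact with the UDF's return schema is an assumption that should be verified before using it in the pipeline.

```python
import json

# Field names taken from the return schema of the UDF above.
EXPECTED_KEYS = ("primary_sentiment", "sentiment_score", "purchase_intention")


def parse_completion(text):
    """Hypothetical helper: parse one completion, returning None fields on failure."""
    empty = dict.fromkeys(EXPECTED_KEYS)
    try:
        parsed = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return empty
    if not isinstance(parsed, dict):
        return empty
    # Keep only the expected fields; missing ones become None.
    return {key: parsed.get(key) for key in EXPECTED_KEYS}


good = parse_completion(
    '{"primary_sentiment": "mixed", "sentiment_score": 6, "purchase_intention": "might repurchase"}'
)
bad = parse_completion('{"primary_sentiment": "pos')  # truncated output
print(good)
print(bad)
```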

#### Load DataFrame

In [56]:
df = spark.read.parquet(data_path).repartition(16)

                                                                                

#### Run Inference

In [57]:
%%time
# first pass caches model/fn
preds = df.withColumn("outputs", generate(col("prompt"))).select("prompt", "outputs.*")
results = preds.collect()



CPU times: user 29.6 ms, sys: 6.89 ms, total: 36.5 ms
Wall time: 33 s


                                                                                

In [58]:
%%time
preds = df.withColumn("outputs", generate(col("prompt"))).select("prompt", "outputs.*")
results = preds.collect()



CPU times: user 25.6 ms, sys: 6.73 ms, total: 32.3 ms
Wall time: 32 s


                                                                                

In [59]:
preds.show(5, truncate=50)


+--------------------------------------------------+-----------------+---------------+-------------------+
|                                            prompt|primary_sentiment|sentiment_score| purchase_intention|
+--------------------------------------------------+-----------------+---------------+-------------------+
|<|im_start|>system\nYou are a specialized revie...|         positive|              9|    will repurchase|
|<|im_start|>system\nYou are a specialized revie...|         positive|              9|    will repurchase|
|<|im_start|>system\nYou are a specialized revie...|         positive|              8|    will repurchase|
|<|im_start|>system\nYou are a specialized revie...|         negative|              4|will not repurchase|
|<|im_start|>system\nYou are a specialized revie...|            mixed|              6|   might repurchase|
+--------------------------------------------------+-----------------+---------------+-------------------+
only showing top 5 rows
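
Once the reviews are in a structured schema, downstream summaries become straightforward. As a small illustration, the sketch below tallies sentiment labels and averages the scores for the five rows displayed above, which are copied by hand into plain dictionaries. The variable names are illustrative, and applying the same expressions to the full `results` list assumes every row was parsed successfully.

```python
from collections import Counter
from statistics import mean

# The five rows shown in the table above, copied by hand.
rows = [
    {"primary_sentiment": "positive", "sentiment_score": 9, "purchase_intention": "will repurchase"},
    {"primary_sentiment": "positive", "sentiment_score": 9, "purchase_intention": "will repurchase"},
    {"primary_sentiment": "positive", "sentiment_score": 8, "purchase_intention": "will repurchase"},
    {"primary_sentiment": "negative", "sentiment_score": 4, "purchase_intention": "will not repurchase"},
    {"primary_sentiment": "mixed", "sentiment_score": 6, "purchase_intention": "might repurchase"},
]

# Count how often each sentiment label occurs and average the scores.
sentiment_counts = Counter(row["primary_sentiment"] for row in rows)
average_score = mean(row["sentiment_score"] for row in rows)

print(dict(sentiment_counts))  # {'positive': 3, 'negative': 1, 'mixed': 1}
print(average_score)  # 7.2
```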



                                                                                

In [60]:
sample = results[0]
print("Review:", sample["prompt"])
print(f"Sentiment: {sample['primary_sentiment']}, Score: {sample['sentiment_score']}, Purchase intention: {sample['purchase_intention']}")

Review: <|im_start|>system
You are a specialized review analysis AI that categorizes product reviews into precise sentiment categories.
IMPORTANT: Your response must contain ONLY valid JSON and nothing else - no explanations, no additional text.
For each review, analyze and output EXACTLY this JSON structure:
{
  "primary_sentiment": [EXACTLY ONE OF: "positive", "negative", "neutral", "mixed"],
  "sentiment_score": [integer between 1-10, where 1 is extremely negative and 10 is extremely positive],
  "purchase_intention": [EXACTLY ONE OF: "will repurchase", "might repurchase", "will not repurchase", "recommends alternatives", "uncertain"]
}

Do not include any text before or after the JSON. The response should start with '{' and end with '}' with no trailing characters, comments, or explanations.
<|im_end|>
<|im_start|>user
Analyze this review: I have never played anything like this since. Everything from Sly  Racoon, to Ratchet and Clank, owe it to this.Wicked witch Gruntilda takes Ban

#### Shut down server on each executor

In [61]:
server_manager.stop_servers()

2025-03-20 01:19:32,218 - INFO - Requesting stage-level resources: (cores=13, gpu=2.0)
2025-03-20 01:19:33,872 - INFO - Successfully stopped 2 VLLM servers.           


[True, True]

In [62]:
if not on_databricks: # on databricks, spark.stop() puts the cluster in a bad state
    spark.stop()