## Principal Component Analysis (PCA)

This notebook demonstrates the end-to-end workflow for GPU-accelerated PCA with Spark RAPIDS ML and compares it against Spark ML PCA on CPU.

In [2]:
import os
import requests
import numpy as np
import pandas as pd
import time

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, FloatType
from pyspark.sql import functions as F
from pyspark import SparkConf
from pyspark.sql.functions import pandas_udf

In [4]:
### Download Spark Rapids jar ###

SPARK_RAPIDS_VERSION = "24.08.1"
rapids_jar = f"rapids-4-spark_2.12-{SPARK_RAPIDS_VERSION}.jar"

if not os.path.exists(rapids_jar):
    print("Downloading spark rapids jar")
    url = f"https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/{SPARK_RAPIDS_VERSION}/{rapids_jar}"
    response = requests.get(url)
    if response.status_code == 200:
        with open(rapids_jar, "wb") as f:
            f.write(response.content)
        print(f"File '{rapids_jar}' downloaded and saved successfully.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print("File already exists. Skipping download.")

File already exists. Skipping download.


In [None]:
### Configure Spark Session ###
conda_env = os.environ.get("CONDA_PREFIX")

conf = SparkConf()
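# Assumption: the master hostname is read from SPARK_MASTER_HOST; set it to your Spark master's hostname
hostname = os.environ.get("SPARK_MASTER_HOST", "localhost")
conf.setMaster(f"spark://{hostname}:7077")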
conf.set("spark.task.maxFailures", "1")
conf.set("spark.driver.memory", "10g")
conf.set("spark.executor.memory", "8g")
conf.set("spark.rpc.message.maxSize", "1024")
conf.set("spark.sql.pyspark.jvmStacktrace.enabled", "true")
conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", "false")
conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
conf.set("spark.python.worker.reuse", "true")
conf.set("spark.rapids.ml.uvm.enabled", "true")
conf.set("spark.jars", rapids_jar)
conf.set("spark.executorEnv.PYTHONPATH", rapids_jar)
conf.set("spark.rapids.memory.gpu.minAllocFraction", "0.0001")
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.locality.wait", "0s")
conf.set("spark.sql.cache.serializer", "com.nvidia.spark.ParquetCachedBatchSerializer")
conf.set("spark.rapids.memory.gpu.pooling.enabled", "false")
conf.set("spark.sql.execution.sortBeforeRepartition", "false")
conf.set("spark.rapids.sql.format.parquet.reader.type", "MULTITHREADED")
conf.set("spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel", "20")
conf.set("spark.rapids.sql.multiThreadedRead.numThreads", "20")
conf.set("spark.rapids.sql.python.gpu.enabled", "true")
conf.set("spark.rapids.memory.pinnedPool.size", "2G")
conf.set("spark.python.daemon.module", "rapids.daemon")
conf.set("spark.rapids.sql.batchSizeBytes", "512m")
conf.set("spark.sql.adaptive.enabled", "false")
conf.set("spark.sql.files.maxPartitionBytes", "512m")
conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "20000")
conf.set("spark.rapids.sql.explain", "NONE")
# Create Spark Session
spark = SparkSession.builder.appName("spark-rapids-ml-pca").config(conf=conf).getOrCreate()
sc = spark.sparkContext

### Generate synthetic dataset

Here we generate a 100,000 x 2048 random dataset.
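
For a sense of scale, the raw size of this array can be worked out directly. The sketch below assumes the `float32` dtype used in the next cell and counts only the array data, not any Spark or Parquet overhead.

```python
import numpy as np

rows, dim = 100_000, 2048
# Bytes needed for the dense array: rows x dim x 4 bytes per float32
n_bytes = rows * dim * np.dtype("float32").itemsize
print(f"{n_bytes / 1e6:.1f} MB")  # → 819.2 MB
```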

In [6]:
rows = 100000
dim = 2048
dtype = 'float32'
np.random.seed(42)

data = np.random.rand(rows, dim).astype(dtype)
pd_data = pd.DataFrame({"features": list(data)})
prepare_df = spark.createDataFrame(pd_data)
prepare_df.write.mode("overwrite").parquet("PCA_data.parquet")

24/10/03 18:15:12 WARN TaskSetManager: Stage 0 contains a task of very large size (160085 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

In [7]:
df = spark.read.parquet("PCA_data.parquet")
df.printSchema()

root
 |-- features: array (nullable = true)
 |    |-- element: float (containsNull = true)



### ETL: Mean-centering

PCA expects mean-centered input so that the first principal component is not influenced by the distribution mean. The cell below computes the mean of each feature and subtracts it from every row.
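
The centering step itself can be sketched in plain NumPy, independent of Spark. This is a minimal illustration on a small stand-in array, not the notebook's implementation: `center_batch` is a hypothetical helper, and the 1000 x 8 shape is an assumption chosen to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(42)
# Small stand-in for the notebook's dataset (assumed shape, for illustration only)
X = rng.random((1000, 8)).astype(np.float32)

# Per-feature means, playing the role of avg_values in the Spark cell
means = X.mean(axis=0)

def center_batch(batch, means):
    """Stack a batch of 1-D feature arrays and subtract the per-feature means."""
    return np.stack(batch) - means

centered = center_batch(list(X), means)
# After centering, every feature mean should be close to zero
print(np.abs(centered.mean(axis=0)).max())
```

Subtracting the means from a stacked batch in one array operation avoids a Python-level loop over the elements of each row.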

In [8]:
avg_values = df.select([
    F.avg(F.col("features")[i]).alias(f"avg_{i}") for i in range(dim)
]).first()

@pandas_udf(ArrayType(FloatType()))
def mean_center_udf(features: pd.Series) -> pd.Series:
    return features.apply(lambda row: [row[i] - avg_values[i] for i in range(dim)])

mean_centered_df = df.withColumn("mean_centered_features", mean_center_udf(F.col("features")))

24/10/03 18:15:21 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/10/03 18:15:21 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
24/10/03 18:15:28 WARN DAGScheduler: Broadcasting large task binary with size 4.1 MiB
                                                                                

#### Spark-RAPIDS-ML accepts ArrayType input

Note that the original Spark ML PCA requires the input column to hold `Vector` values:

```python
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
df.show()
```

...whereas the Spark-RAPIDS-ML version does not require extra Vectorization, and can accept an ArrayType column as the input column:

In [9]:
data_df = mean_centered_df.withColumn("features", F.col("mean_centered_features")).drop("mean_centered_features")
data_df.printSchema()

root
 |-- features: array (nullable = true)
 |    |-- element: float (containsNull = true)



### Using Spark-RAPIDS-ML PCA (GPU)

The Spark ML PCA training API looks like this:

```python
from pyspark.ml.feature import PCA
pca = PCA(k=3, inputCol="features")
pca.setOutputCol("pca_features")
```

Spark RAPIDS ML provides a drop-in class with the same API, so GPU acceleration requires **no code change** beyond the import:

```python
from spark_rapids_ml.feature import PCA
pca = PCA(k=3, inputCol="features")
pca.setOutputCol("pca_features")
```

In [10]:
from spark_rapids_ml.feature import PCA

gpu_pca = PCA(k=2, inputCol="features")
gpu_pca.setOutputCol("pca_features")

PCA_167b33961a13

The PCA estimator object can be persisted and reloaded.

In [11]:
estimator_path = "/tmp/pca_estimator"
gpu_pca.write().overwrite().save(estimator_path)
gpu_pca_loaded = PCA.load(estimator_path)

#### Fit

In [12]:
start_time = time.time()
gpu_pca_model = gpu_pca_loaded.fit(data_df)
gpu_fit_time = time.time() - start_time
print(f"GPU PCA fit took: {gpu_fit_time} sec")

2024-10-03 18:15:31,372 - spark_rapids_ml.feature.PCA - INFO - CUDA managed memory enabled.
2024-10-03 18:15:31,423 - spark_rapids_ml.feature.PCA - INFO - Stage-level scheduling in spark-rapids-ml requires spark.executor.cores, spark.executor.resource.gpu.amount to be set.
2024-10-03 18:15:31,425 - spark_rapids_ml.feature.PCA - INFO - Training spark-rapids-ml with 1 worker(s) ...
2024-10-03 18:16:03,110 - spark_rapids_ml.feature.PCA - INFO - Finished training


GPU PCA fit took: 32.50137519836426 sec


#### Transform

In [19]:
start_time = time.time()
# show() returns None, so keep the DataFrame and display it separately
embeddings = gpu_pca_model.transform(data_df).select("pca_features")
embeddings.show(truncate=False)
gpu_transform_time = time.time() - start_time
print(f"GPU PCA transform took: {gpu_transform_time} sec")


+--------------------------+
|pca_features              |
+--------------------------+
|[-0.029416187, 0.14954807]|
|[-0.114759326, 0.30470988]|
|[0.24565856, -0.3830186]  |
|[0.40122557, 0.0786071]   |
|[0.33858502, -0.3383386]  |
|[-0.4234191, 0.054718923] |
|[0.31339574, -0.18767774] |
|[0.48100916, -0.13139157] |
|[0.24663548, 0.62084264]  |
|[-0.7007258, 0.41795364]  |
|[-0.3402629, 0.118103035] |
|[0.050888825, -0.13529032]|
|[0.22439958, -0.2205292]  |
|[0.25716788, -0.03613429] |
|[0.6055516, -0.44179356]  |
|[-0.2515555, -0.1829353]  |
|[-0.2190136, -0.48459405] |
|[-0.28191802, 0.005161534]|
|[-0.32060724, -0.52684677]|
|[0.10207409, -0.07858773] |
+--------------------------+
only showing top 20 rows

GPU PCA transform took: 1.7161929607391357 sec


                                                                                

### Using Spark-ML PCA (CPU)

In [14]:
from pyspark.ml.feature import PCA

cpu_pca = PCA(k=2, inputCol="features")
cpu_pca.setOutputCol("pca_features")

PCA_cde1243ffb2d

In [15]:
from pyspark.ml.functions import array_to_vector

vector_df = data_df.select(array_to_vector("features").alias("features"))
vector_df.printSchema()

root
 |-- features: vector (nullable = true)



#### Fit

In [16]:
start_time = time.time()
cpu_pca_model = cpu_pca.fit(vector_df)
pca_fit_time = time.time() - start_time
print(f"CPU PCA fit took: {pca_fit_time} sec")

24/10/03 18:18:39 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


CPU PCA fit took: 168.66824460029602 sec


#### Transform

In [17]:
start_time = time.time()
# show() returns None, so keep the DataFrame and display it separately
embeddings = cpu_pca_model.transform(vector_df).select("pca_features")
embeddings.show(truncate=False)
pca_transform_time = time.time() - start_time
print(f"CPU PCA transform took: {pca_transform_time} sec")

+--------------------------------------------+
|pca_features                                |
+--------------------------------------------+
|[-0.028913990912549873,-0.14877800281975417]|
|[-0.11377219195114611,-0.30286035088028784] |
|[0.24481776731139782,0.38339521202540466]   |
|[0.4012645935474968,-0.0774854638371508]    |
|[0.3382728256029673,0.3389068302551277]     |
|[-0.4228363394060016,-0.05490137931031454]  |
|[0.3133387034642869,0.18913735472308166]    |
|[0.4810119397199172,0.1325004050655491]     |
|[0.24748029515381828,-0.6211263064522211]   |
|[-0.6999917660681412,-0.420784987276893]    |
|[-0.34046347604937044,-0.11978179300566327] |
|[0.05074039845250796,0.13683028045753456]   |
|[0.22282065118768452,0.22023244883076878]   |
|[0.2562068262436395,0.03528786064789906]    |
|[0.6045398358884778,0.44301892614623417]    |
|[-0.25204946003221423,0.18266864414577164]  |
|[-0.22000096004134898,0.4838697920777026]   |
|[-0.28225973295047585,-0.006133424943195989]|
|[-0.32151621
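
Note that the second column of the CPU output has the opposite sign from the GPU output. Principal components are defined only up to sign, so the two results can still agree. The minimal NumPy sketch below checks this using the first three rows copied from each output above; `align_signs` is a hypothetical helper written for this illustration, not part of either library.

```python
import numpy as np

# First three rows copied from the GPU and CPU outputs above
gpu = np.array([[-0.029416187, 0.14954807],
                [-0.114759326, 0.30470988],
                [0.24565856, -0.3830186]])
cpu = np.array([[-0.028913990912549873, -0.14877800281975417],
                [-0.11377219195114611, -0.30286035088028784],
                [0.24481776731139782, 0.38339521202540466]])

def align_signs(reference, other):
    """Flip each column of `other` whose dot product with `reference` is negative."""
    signs = np.sign(np.sum(reference * other, axis=0))
    signs[signs == 0] = 1.0  # leave orthogonal columns unchanged
    return other * signs

aligned = align_signs(gpu, cpu)
max_abs_diff = np.max(np.abs(gpu - aligned))
print(max_abs_diff)
```

After sign alignment, the remaining differences in these rows are small, which is consistent with float32 arithmetic on the GPU versus float64 on the CPU.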

### Summary

With our 100,000 x 2048 dataset, the GPU pipeline achieved an end-to-end (fit + transform) speedup of about 5.1x over the CPU pipeline:

CPU: 173.7 s + 0.50 s  
GPU: 32.5 s + 1.71 s  

`CPU / GPU = 5.1x`
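
The arithmetic behind this ratio is sketched below, using the fit and transform times quoted in this summary. The variable names are chosen for this illustration.

```python
# Timings in seconds, taken from the summary above
cpu_fit, cpu_transform = 173.7, 0.50
gpu_fit, gpu_transform = 32.5, 1.71

cpu_total = cpu_fit + cpu_transform
gpu_total = gpu_fit + gpu_transform
speedup = cpu_total / gpu_total
print(f"CPU {cpu_total:.1f} s / GPU {gpu_total:.1f} s = {speedup:.1f}x")
# → CPU 174.2 s / GPU 34.2 s = 5.1x
```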