## Principal Component Analysis (PCA)

This notebook demonstrates an end-to-end workflow for GPU-accelerated PCA with Spark-RAPIDS-ML and compares it against the Spark-ML PCA running on the CPU.

In [5]:
import numpy as np
import pandas as pd
import time

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [7]:
import os
import requests

SPARK_RAPIDS_VERSION = "24.08.1"
rapids_jar = f"rapids-4-spark_2.12-{SPARK_RAPIDS_VERSION}.jar"

if not os.path.exists(rapids_jar):
    print("Downloading spark rapids jar")
    url = f"https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/{SPARK_RAPIDS_VERSION}/{rapids_jar}"
    response = requests.get(url)
    if response.status_code == 200:
        with open(rapids_jar, "wb") as f:
            f.write(response.content)
        print(f"File '{rapids_jar}' downloaded and saved successfully.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print("File already exists. Skipping download.")

num_threads = 6
driver_memory = "8g"
num_gpus = 1

_config = {
    "spark.master": f"local[{num_threads}]",
    "spark.driver.host": "127.0.0.1",
    "spark.task.maxFailures": "1",
    "spark.driver.memory": driver_memory,
    "spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled": "false",
    "spark.sql.pyspark.jvmStacktrace.enabled": "true",
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.rapids.ml.uvm.enabled": "true",
    # accelerated file/parquet reading
    "spark.jars": rapids_jar,
    "spark.executorEnv.PYTHONPATH": rapids_jar,
    "spark.sql.files.minPartitionNum": num_gpus,
    "spark.rapids.memory.gpu.minAllocFraction": "0.0001",
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.locality.wait": "0s",
    "spark.sql.cache.serializer": "com.nvidia.spark.ParquetCachedBatchSerializer",
    "spark.rapids.memory.gpu.pooling.enabled": "false",
    "spark.rapids.sql.explain": "ALL",
    "spark.sql.execution.sortBeforeRepartition": "false",
    "spark.rapids.sql.format.parquet.reader.type": "MULTITHREADED",
    "spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel": "20",
    "spark.rapids.sql.multiThreadedRead.numThreads": "20",
    "spark.rapids.sql.python.gpu.enabled": "true",
    "spark.rapids.memory.pinnedPool.size": "2G",
    "spark.python.daemon.module": "rapids.daemon",
    "spark.rapids.sql.batchSizeBytes": "512m",
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.files.maxPartitionBytes": "2000000000000",
    "spark.rapids.sql.concurrentGpuTasks": "2",
    "spark.sql.execution.arrow.maxRecordsPerBatch": "20000",
}
spark = SparkSession.builder.appName("spark-rapids-ml")
for key, value in _config.items():
    spark = spark.config(key, value)
spark = spark.getOrCreate()

File already exists. Skipping download.


24/09/27 22:30:18 WARN Utils: Your hostname, cb4ae00-lcedt resolves to a loopback address: 127.0.1.1; using 10.110.47.100 instead (on interface eno1)
24/09/27 22:30:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/09/27 22:30:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/27 22:30:19 WARN RapidsPluginUtils: RAPIDS Accelerator 24.08.1 using cudf 24.08.0, private revision 9fac64da220ddd6bf5626bd7bd1dd74c08603eac
24/09/27 22:30:19 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
24/09/27 22:30:19 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `ALL`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
24/09/27 22:30:22 WARN GpuDevi
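
The download step in the cell above prints a message on failure and then carries on, so a failed download only surfaces later when Spark cannot find the jar. As a hedged alternative, here is a minimal standard-library sketch that raises instead. The helper names (`rapids_jar_name`, `rapids_jar_url`, `ensure_rapids_jar`) are hypothetical, and the URL layout is assumed to match the Maven Central URL used in the cell.

```python
import os
import shutil
import urllib.request

# Assumption: same Maven Central layout as the URL in the notebook cell.
MAVEN_BASE = "https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12"


def rapids_jar_name(version):
    """File name of the RAPIDS Accelerator jar for a given version."""
    return f"rapids-4-spark_2.12-{version}.jar"


def rapids_jar_url(version):
    """Download URL of the jar for a given version."""
    return f"{MAVEN_BASE}/{version}/{rapids_jar_name(version)}"


def ensure_rapids_jar(version, dest_dir="."):
    """Return the local jar path, downloading the jar first if it is missing.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a failed
    download stops here rather than at Spark startup.
    """
    path = os.path.join(dest_dir, rapids_jar_name(version))
    if os.path.exists(path):
        return path
    tmp_path = path + ".part"
    # Stream to a temporary file so a partial download is never mistaken
    # for a complete jar on the next run.
    with urllib.request.urlopen(rapids_jar_url(version)) as response:
        with open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
    os.replace(tmp_path, path)
    return path
```

Calling `ensure_rapids_jar("24.08.1")` would then return the path to pass as `spark.jars`.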

### Generate synthetic dataset

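Here we generate a 100,000 x 2048 dataset of float32 values drawn uniformly from [0, 1), load it into a Spark DataFrame, and write a copy to Parquet.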

In [8]:
rows = 100000
dim = 2048
dtype = 'float32'
np.random.seed(42)

data = np.random.rand(rows, dim).astype(dtype)
cols = [f"c{i}" for i in range(dim)]
pd_data = pd.DataFrame(data, columns=cols)
df = spark.createDataFrame(pd_data).repartition(10)
df.show()

df.write.mode("overwrite").parquet("data.parquet")

24/09/27 22:30:36 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(c0#2048 as string) AS c0#8192 will run on GPU
      *Expression <Cast> cast(c0#2048 as string) will run on GPU
    *Expression <Alias> cast(c1#2049 as string) AS c1#8193 will run on GPU
      *Expression <Cast> cast(c1#2049 as string) will run on GPU
    *Expression <Alias> cast(c2#2050 as string) AS c2#8194 will run on GPU
      *Expression <Cast> cast(c2#2050 as string) will run on GPU
    *Expression <Alias> cast(c3#2051 as string) AS c3#8195 will run on GPU
      *Exp


24/09/27 22:30:44 WARN GpuOverrides: 
*Exec <DataWritingCommandExec> will run on GPU
  *Output <InsertIntoHadoopFsRelationCommand> will run on GPU
  *Exec <WriteFilesExec> will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <RoundRobinPartitioning> will run on GPU
      ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
        @Expression <AttributeReference> c0#2048 could run on GPU
        @Expression <AttributeReference> c1#2049 could run on GPU
        @Expression <AttributeReference> c2#2050 could run on GPU
        @Expression <AttributeReference> c3#2051 could run on GPU
        @Expression <AttributeReference> c4#2052 could run on GPU
        @Expression <AttributeReference> c5#2053 could run on GPU
        @Expression <AttributeReference> c6#2054 could run on GPU
        @Expression <AttributeReference> c7#2055 could run on GPU
        @Expression <AttributeRefe

### ETL: Mean-centering

PCA expects mean-centered input, so we subtract each column's mean from that column.
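
To make the operation concrete, here is a tiny pure-Python sketch of mean-centering on a 3 x 2 toy matrix. The helper names `column_means` and `mean_center` are hypothetical and exist only for illustration. The Spark cell below applies the same idea per column on the full DataFrame.

```python
def column_means(rows):
    """Mean of each column of a row-major list of lists."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_center(rows):
    """Subtract each column's mean from every value in that column."""
    means = column_means(rows)
    return [[value - m for value, m in zip(row, means)] for row in rows]


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
centered = mean_center(rows)

print(column_means(rows))
print(centered)
print(column_means(centered))
```

After centering, every column mean is zero, which is the property the PCA step relies on.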

In [9]:
avg_values = df.select([
    F.avg(F.col(c)).alias(c) for c in cols
]).first()

mean_centered_df = df.select([
    (F.col(c) - avg_values[c]).alias(c) for c in cols
])

mean_centered_df.show(5)

24/09/27 22:30:57 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(c0#2048) will run on GPU
    *Expression <Average> avg(c0#2048) will run on GPU
  *Expression <AggregateExpression> avg(c1#2049) will run on GPU
    *Expression <Average> avg(c1#2049) will run on GPU
  *Expression <AggregateExpression> avg(c2#2050) will run on GPU
    *Expression <Average> avg(c2#2050) will run on GPU
  *Expression <AggregateExpression> avg(c3#2051) will run on GPU
    *Expression <Average> avg(c3#2051) will run on GPU
  *Expression <AggregateExpression> avg(c4#2052) will run on GPU
    *Expression <Average> avg(c4#2052) will run on GPU
  *Expression <AggregateExpression> avg(c5#2053) will run on GPU
    *Expression <Average> avg(c5#2053) will run on GPU
  *Expression <AggregateExpression> avg(c6#2054) will run on GPU
    *Expression <Average> avg(c6#2054) will run on GPU
  *Expression <AggregateExpression> avg(c7#2055) will run on GPU
    *Expression


#### Spark-RAPIDS-ML accepts ArrayType input

Note that the Spark-ML PCA requires the input column to be a `Vector` type, so the data must be vectorized first:

```python
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
df.show()
```

...whereas the Spark-RAPIDS-ML version needs no extra vectorization step and accepts an ArrayType column directly as the input column:

In [10]:
data_df = mean_centered_df.withColumn(
    "features", F.array(mean_centered_df.columns)
).drop(*mean_centered_df.columns).cache()

data_df.printSchema()
data_df.show(5, False)

24/09/27 22:31:19 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> array((cast(c0#2048 as double) - 0.502850817943191), (cast(c1#2049 as double) - 0.4993568329317731), (cast(c2#2050 as double) - 0.5022041627250311), (cast(c3#2051 as double) - 0.5007607737954406), (cast(c4#2052 as double) - 0.49888224182054824), (cast(c5#2053 as double) - 0.49961661828046305), (cast(c6#2054 as double) - 0.49952424866733774), (cast(c7#2055 as double) - 0.5010911387606175), (cast(c8#2056 as double) - 0.5006733374329541), (cast(c9#2057 as double) - 0.5005974506563594), (cast(c10#2058 as double) - 0.4995066206061082), (cast(c11#2059 as double) - 0.49974233042317545), (cast(c12#2060 as double) - 0.5008259517211953), (cast(c13#2061 as double) - 0.5004211648272235), (cast(c14#2062 as double) - 0.5015444330962864), (cast(c15#2063 as double) - 0.49893703256832034), (cast(c16#2064 as double) - 0.49859672808367905), (cast(c17#2065 as double) - 0.49929881067924775), (cast(c18#2066 as do

root
 |-- features: array (nullable = false)
 |    |-- element: double (containsNull = true)




### Using Spark-RAPIDS-ML PCA (GPU)

For comparison, this is the Spark-ML PCA training API:

```python
from pyspark.ml.feature import PCA
pca = PCA(k=3, inputCol="features")
pca.setOutputCol("pca_features")
```

Spark-RAPIDS-ML provides a drop-in class with the same API, so the **only code change** needed for GPU acceleration is the import:

```python
from spark_rapids_ml.feature import PCA
pca = PCA(k=3, inputCol="features")
pca.setOutputCol("pca_features")
```

In [11]:
from spark_rapids_ml.feature import PCA

gpu_pca = PCA(k=2, inputCol="features")
gpu_pca.setOutputCol("pca_features")

PCA_36d5f96125ea

The PCA estimator object can be persisted and reloaded.

In [12]:
estimator_path = "/tmp/pca_estimator"
gpu_pca.write().overwrite().save(estimator_path)
gpu_pca_loaded = PCA.load(estimator_path)

#### Fit

In [13]:
start_time = time.time()
gpu_pca_model = gpu_pca_loaded.fit(data_df)
print(f"GPU PCA fit took: {time.time() - start_time} sec")

24/09/27 22:31:32 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <InMemoryTableScanExec> will run on GPU

24/09/27 22:31:33 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(features#63490 as array<float>) AS cuml_values_c3BhcmstcmFwaWRzLW1sCg==#65581 will run on GPU
    *Expression <Cast> cast(features#63490 as array<float>) will run on GPU
  *Exec <InMemoryTableScanExec> will run on GPU

2024-09-27 22:31:33,234 - spark_rapids_ml.feature.PCA - INFO - CUDA managed memory enabled.
24/09/27 22:31:33 WARN GpuOverrides: 
*Exec <MapInPandasExec> will partially run o

GPU PCA fit took: 14.639127254486084 sec


#### Transform

In [14]:
start_time = time.time()
gpu_pca_model.transform(data_df).select("pca_features").show(10, False)
print(f"GPU PCA transform took: {time.time() - start_time} sec")

+---------------------------+
|pca_features               |
+---------------------------+
|[0.6227822, -0.28341442]   |
|[0.1728966, -0.41411814]   |
|[-0.118897766, 0.12770754] |
|[0.20606507, -0.21859361]  |
|[0.17110847, -0.24185863]  |
|[0.6516079, 0.2862177]     |
|[-0.056618597, -0.46034873]|
|[-0.025016112, 0.18431072] |
|[0.75429356, 0.56623113]   |
|[-0.0777183, -0.53442883]  |
+---------------------------+
only showing top 10 rows

GPU PCA transform took: 0.39992523193359375 sec


24/09/27 22:31:58 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(pythonUDF0#65655 as string) AS pca_features#65643 will run on GPU
      *Expression <Cast> cast(pythonUDF0#65655 as string) will run on GPU
    *Exec <ArrowEvalPythonExec> will partially run on GPU
      *Expression <PythonUDF> predict_udf(struct(cuml_values_c3BhcmstcmFwaWRzLW1sCg==, cast(features#63490 as array<float>)))#65634 will not block GPU acceleration
        *Expression <CreateNamedStruct> struct(cuml_values_c3BhcmstcmFwaWRzLW1sCg==, cast(features#63490 as array<
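
For intuition about what `fit` and `transform` compute, here is a small NumPy sketch of PCA via the singular value decomposition. It is an illustration under stated assumptions, not how Spark-RAPIDS-ML or Spark-ML implement PCA internally: the function names `pca_fit` and `pca_transform` are hypothetical, the data is synthetic, and the sketch centers the data itself so that it is self-contained.

```python
import numpy as np


def pca_fit(x, k):
    """Return (column means, top-k principal axes as rows, explained variance)."""
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are orthonormal principal axes, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = (s[:k] ** 2) / (x.shape[0] - 1)
    return mean, vt[:k], explained_variance


def pca_transform(x, mean, components):
    """Project centered rows onto the principal axes."""
    return (x - mean) @ components.T


rng = np.random.default_rng(0)
x = rng.random((200, 5))
# Make two columns strongly correlated so the first component is meaningful.
x[:, 1] = 2.0 * x[:, 0] + 0.1 * x[:, 1]

mean, components, explained_variance = pca_fit(x, k=2)
projected = pca_transform(x, mean, components)

print(projected.shape)
print(explained_variance)
```

The sample variance of each projected column equals the corresponding explained variance, which is one way to sanity-check a PCA result.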

### Using Spark-ML PCA (CPU)

In [15]:
# Note: this rebinds the name `PCA`, shadowing the spark_rapids_ml import above.
from pyspark.ml.feature import PCA

cpu_pca = PCA(k=2, inputCol="features")
cpu_pca.setOutputCol("pca_features")

PCA_6d9a54f03008

In [16]:
from pyspark.ml.functions import array_to_vector

vector_df = data_df.select(array_to_vector("features").alias("features"))

vector_df.printSchema()
vector_df.show(5, False)

root
 |-- features: vector (nullable = true)


24/09/27 22:32:09 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Alias> cast(UDF(features#63490) as string) AS features#65673 could run on GPU
      !Expression <Cast> cast(UDF(features#63490) as string) cannot run on GPU because Cast from org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to StringType is not supported
        !Expression <ScalaUDF> UDF(features#63490) cannot run on GPU because neither UDF implemented by class org.apache.spark.ml.functions$$$Lambda$4959/1422973986 provides a GPU implementa

#### Fit

In [17]:
start_time = time.time()
cpu_pca_model = cpu_pca.fit(vector_df)
print(f"CPU PCA fit took: {time.time() - start_time} sec")

24/09/27 22:32:12 WARN GpuOverrides: 
! <DeserializeToObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.DeserializeToObjectExec
  ! <CreateExternalRow> createexternalrow(newInstance(class org.apache.spark.ml.linalg.VectorUDT).deserialize, StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
    ! <Invoke> newInstance(class org.apache.spark.ml.linalg.VectorUDT).deserialize cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
      ! <NewInstance> newInstance(class org.apache.spark.ml.linalg.VectorUDT) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.NewInstance
     

CPU PCA fit took: 59.46716833114624 sec


#### Transform

In [18]:
start_time = time.time()
cpu_pca_model.transform(vector_df).select("pca_features").show(10, False)
print(f"CPU PCA transform took: {time.time() - start_time} sec")

+-------------------------------------------+
|pca_features                               |
+-------------------------------------------+
|[0.6231368721203074,0.2811834635470497]    |
|[0.17314993885673247,0.41393407853362335]  |
|[-0.11888122338423039,-0.1264593855869665] |
|[0.20664423707517993,0.21816959353388718]  |
|[0.1717628008505725,0.24164277970428447]   |
|[0.6509145280024092,-0.2881945449536234]   |
|[-0.055763417837812905,0.45830871903188247]|
|[-0.025045147340843735,-0.1845252467279124]|
|[0.7530942348727582,-0.5668156630855318]   |
|[-0.07733055028819996,0.5355410187977719]  |
+-------------------------------------------+
only showing top 10 rows

CPU PCA transform took: 0.20693421363830566 sec


24/09/27 22:33:16 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Alias> cast(UDF(UDF(features#63490)) as string) AS pca_features#65720 could run on GPU
      !Expression <Cast> cast(UDF(UDF(features#63490)) as string) cannot run on GPU because Cast from org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to StringType is not supported
        !Expression <ScalaUDF> UDF(UDF(features#63490)) cannot run on GPU because neither UDF implemented by class org.apache.spark.ml.feature.PCAModel$$Lambda$5278/787864918 p
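
The CPU and GPU `pca_features` outputs agree closely in the first component but have opposite signs in the second. This is expected: a principal component is only defined up to sign, so two correct implementations can legitimately return flipped axes. The sketch below compares the first three rows printed above after aligning signs. The helper names are hypothetical, and the small remaining difference plausibly reflects the GPU path working in float32 (the log shows a cast to `array<float>`), though that is an assumption rather than something verified here.

```python
# First three rows of pca_features, copied from the GPU and CPU outputs above.
gpu_rows = [
    [0.6227822, -0.28341442],
    [0.1728966, -0.41411814],
    [-0.118897766, 0.12770754],
]
cpu_rows = [
    [0.6231368721203074, 0.2811834635470497],
    [0.17314993885673247, 0.41393407853362335],
    [-0.11888122338423039, -0.1264593855869665],
]


def component_signs(reference, other):
    """Per component, the sign (+1.0 or -1.0) that best aligns other with reference."""
    signs = []
    for ref_col, other_col in zip(zip(*reference), zip(*other)):
        dot = sum(r * o for r, o in zip(ref_col, other_col))
        signs.append(1.0 if dot >= 0 else -1.0)
    return signs


def max_abs_diff_after_alignment(reference, other):
    """Largest element-wise difference once component signs are aligned."""
    signs = component_signs(reference, other)
    return max(
        abs(r - s * o)
        for ref_row, other_row in zip(reference, other)
        for r, o, s in zip(ref_row, other_row, signs)
    )


signs = component_signs(gpu_rows, cpu_rows)
max_diff = max_abs_diff_after_alignment(gpu_rows, cpu_rows)
print(signs, max_diff)
```

A comparison like this is a safer correctness check than element-wise equality when validating one PCA implementation against another.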

### Summary

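With our 100,000 x 2048 dataset, the GPU pipeline achieved an end-to-end (fit plus transform) speedup of about 3.97x over the CPU pipeline. The transform timings cover only `show(10)`, so the totals are dominated by fit time.

CPU: 59.467 + 0.207 = 59.674 sec  
GPU: 14.639 + 0.400 = 15.039 sec  
Speedup (CPU / GPU) = 59.674 / 15.039 ≈ 3.97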
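
The arithmetic can be reproduced directly from the timings printed by the fit and transform cells. The snippet below is a small sketch that also reports the fit-only speedup, which is not stated in the notebook but follows from the same numbers.

```python
# Timings in seconds, copied from the cell outputs in this notebook.
cpu_fit, cpu_transform = 59.46716833114624, 0.20693421363830566
gpu_fit, gpu_transform = 14.639127254486084, 0.39992523193359375

cpu_total = cpu_fit + cpu_transform
gpu_total = gpu_fit + gpu_transform

end_to_end_speedup = cpu_total / gpu_total
fit_speedup = cpu_fit / gpu_fit

print(f"CPU total: {cpu_total:.3f} sec")
print(f"GPU total: {gpu_total:.3f} sec")
print(f"End-to-end speedup: {end_to_end_speedup:.2f}x")
print(f"Fit-only speedup: {fit_speedup:.2f}x")
```

These are single-run wall-clock timings on one machine, so they are best read as indicative rather than as a benchmark.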