## Principal Component Analysis (PCA)

This notebook demonstrates an end-to-end workflow for GPU-accelerated PCA with Spark RAPIDS ML and compares it with Spark ML PCA on the CPU.

In [1]:
import numpy as np
import pandas as pd
import time

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [3]:
import os
import requests

SPARK_RAPIDS_VERSION = "24.06.1"
cuda_version = "11"
rapids_jar = f"rapids-4-spark_2.12-{SPARK_RAPIDS_VERSION}.jar"

if not os.path.exists(rapids_jar):
    print("Downloading spark rapids jar")
    url = f"https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/{SPARK_RAPIDS_VERSION}/rapids-4-spark_2.12-{SPARK_RAPIDS_VERSION}-cuda{cuda_version}.jar"
    response = requests.get(url)
    if response.status_code == 200:
        with open(rapids_jar, "wb") as f:
            f.write(response.content)
        print(f"File '{rapids_jar}' downloaded and saved successfully.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print("File already exists. Skipping download.")

num_threads = 6
driver_memory = "8g"
num_gpus = 1

_config = {
    "spark.master": f"local[{num_threads}]",
    "spark.driver.host": "127.0.0.1",
    "spark.task.maxFailures": "1",
    "spark.driver.memory": driver_memory,
    "spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled": "false",
    "spark.sql.pyspark.jvmStacktrace.enabled": "true",
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.rapids.ml.uvm.enabled": "true",
    # accelerated file/parquet reading
    "spark.jars": rapids_jar,
    "spark.executorEnv.PYTHONPATH": rapids_jar,
    "spark.sql.files.minPartitionNum": num_gpus,
    "spark.rapids.memory.gpu.minAllocFraction": "0.0001",
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.locality.wait": "0s",
    "spark.sql.cache.serializer": "com.nvidia.spark.ParquetCachedBatchSerializer",
    "spark.rapids.memory.gpu.pooling.enabled": "false",
    "spark.rapids.sql.explain": "ALL",
    "spark.sql.execution.sortBeforeRepartition": "false",
    "spark.rapids.sql.format.parquet.reader.type": "MULTITHREADED",
    "spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel": "20",
    "spark.rapids.sql.multiThreadedRead.numThreads": "20",
    "spark.rapids.sql.python.gpu.enabled": "true",
    "spark.rapids.memory.pinnedPool.size": "2G",
    "spark.python.daemon.module": "rapids.daemon",
    "spark.rapids.sql.batchSizeBytes": "512m",
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.files.maxPartitionBytes": "2000000000000",
    "spark.rapids.sql.concurrentGpuTasks": "2",
    "spark.sql.execution.arrow.maxRecordsPerBatch": "20000",
}
spark = SparkSession.builder.appName("spark-rapids-ml pca")
for key, value in _config.items():
    spark = spark.config(key, value)
spark = spark.getOrCreate()

24/09/27 21:22:39 WARN Utils: Your hostname, cb4ae00-lcedt resolves to a loopback address: 127.0.1.1; using 10.110.47.100 instead (on interface eno1)
24/09/27 21:22:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/27 21:22:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Generate synthetic dataset

Here we generate a random 100,000 x 2048 dataset of `float32` values drawn uniformly from [0, 1).
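
As a rough sanity check on the scale involved, the dense size of this dataset can be worked out by hand. The sketch below is plain arithmetic and assumes 4 bytes per `float32` value and the 10 partitions requested by `repartition(10)`; it ignores any serialization or row overhead that Spark adds.

```python
# Back-of-the-envelope size of the dense dataset (no Spark or NumPy needed).
rows = 100_000
dim = 2048
bytes_per_value = 4   # float32
num_partitions = 10   # matches .repartition(10) in the cell below

total_bytes = rows * dim * bytes_per_value
total_mib = total_bytes / 1024**2
per_partition_mib = total_mib / num_partitions

print(f"{total_bytes:,} bytes ~ {total_mib:.2f} MiB")          # 819,200,000 bytes ~ 781.25 MiB
print(f"~ {per_partition_mib:.3f} MiB per partition on average")  # ~ 78.125 MiB per partition on average
```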

In [4]:
rows = 100000
dim = 2048
dtype = 'float32'
np.random.seed(42)

data = np.random.rand(rows, dim).astype(dtype)
cols = [f"c{i}" for i in range(dim)]
pd_data = pd.DataFrame(data, columns=cols)
df = spark.createDataFrame(pd_data).repartition(10)
df.show()

df.write.mode("overwrite").parquet("data.parquet")

24/09/27 21:22:51 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/09/27 21:22:51 WARN TaskSetManager: Stage 0 contains a task of very large size (80103 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

24/09/27 21:22:58 WARN TaskSetManager: Stage 3 contains a task of very large size (80103 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

### ETL: Mean-centering

PCA expects mean-centered input, so we subtract each column's mean from that column before fitting.

In [None]:
avg_values = df.select([
    F.avg(F.col(c)).alias(c) for c in cols
]).first()

mean_centered_df = df.select([
    (F.col(c) - avg_values[c]).alias(c) for c in cols
])

mean_centered_df.show(5)
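
The same step can be checked on a small local array. This is a minimal NumPy sketch, not part of the Spark pipeline: the 5 x 3 array `x` is a made-up stand-in for the real data, and it only illustrates that subtracting per-column means leaves each column with a mean of approximately zero.

```python
import numpy as np

# Small stand-in for the notebook's data: 5 rows, 3 feature columns.
rng = np.random.default_rng(0)
x = rng.random((5, 3)).astype("float32")

# Subtract each column's mean, mirroring the F.avg(...) and subtraction above.
col_means = x.mean(axis=0)
x_centered = x - col_means

# After centering, every column mean should be numerically close to zero.
print(np.abs(x_centered.mean(axis=0)).max())
```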

#### Spark-RAPIDS-ML accepts ArrayType input

Note that the original Spark-ML PCA requires the input column to hold `pyspark.ml.linalg` vectors, so the data must first be converted:

```python
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
df.show()
```

...whereas the Spark-RAPIDS-ML version needs no extra vectorization step and accepts an ArrayType column directly as the input column:

In [None]:
data_df = mean_centered_df.withColumn(
    "features", F.array(mean_centered_df.columns)
).drop(*mean_centered_df.columns)

data_df.printSchema()
data_df.show(5, False)

### Using Spark-RAPIDS-ML PCA (GPU)

For comparison, this is the Spark-ML PCA training API:

```python
from pyspark.ml.feature import PCA
pca = PCA(k=3, inputCol="features")
pca.setOutputCol("pca_features")
```

Spark-RAPIDS-ML provides a class with the same interface, so apart from the import line **no code change** is needed to get GPU acceleration:

```python
from spark_rapids_ml.feature import PCA
pca = PCA(k=3, inputCol="features")
pca.setOutputCol("pca_features")
```

In [None]:
from spark_rapids_ml.feature import PCA

gpu_pca = PCA(k=2, inputCol="features")
gpu_pca.setOutputCol("pca_features")

The PCA estimator object can be persisted and reloaded.

In [9]:
estimator_path = "/tmp/pca_estimator"
gpu_pca.write().overwrite().save(estimator_path)
gpu_pca_loaded = PCA.load(estimator_path)

#### Fit

In [None]:
start_time = time.time()
gpu_pca_model = gpu_pca_loaded.fit(data_df)
print(f"GPU PCA fit took: {time.time() - start_time} sec")

#### Transform

In [None]:
start_time = time.time()
gpu_pca_model.transform(data_df).select("pca_features").show(10, False)
print(f"GPU PCA transform took: {time.time() - start_time} sec")
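
For intuition about what fit and transform compute, here is a conceptual NumPy sketch on a tiny array. It is not how Spark-RAPIDS-ML or Spark-ML implement PCA internally; `pca_fit` and `pca_transform` are hypothetical helper names, and the approach shown (eigendecomposition of the sample covariance matrix) is just one standard way to obtain principal components.

```python
import numpy as np

def pca_fit(x, k):
    """Return the top-k principal components (as columns) and their variances."""
    x_centered = x - x.mean(axis=0)
    cov = np.cov(x_centered, rowvar=False)    # sample covariance, shape (dim, dim)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    return eigvecs[:, order], eigvals[order]

def pca_transform(x, components):
    """Project rows of x onto the principal components."""
    return x @ components

# Made-up data: 200 rows, 5 features, mean-centered as in the ETL step.
rng = np.random.default_rng(0)
x = rng.random((200, 5))
x = x - x.mean(axis=0)

components, variances = pca_fit(x, k=2)
projected = pca_transform(x, components)
print(projected.shape)  # (200, 2)
```

Because the input is already mean-centered, the sample variance of each projected column equals the corresponding eigenvalue, which is a quick way to check the result.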

### Using Spark-ML PCA (CPU)

In [None]:
from pyspark.ml.feature import PCA

cpu_pca = PCA(k=2, inputCol="features")
cpu_pca.setOutputCol("pca_features")

In [None]:
from pyspark.ml.functions import array_to_vector

vector_df = data_df.select(array_to_vector("features").alias("features"))

vector_df.printSchema()
vector_df.show(5, False)

#### Fit

In [None]:
start_time = time.time()
cpu_pca_model = cpu_pca.fit(vector_df)
print(f"CPU PCA fit took: {time.time() - start_time} sec")

#### Transform

In [None]:
start_time = time.time()
cpu_pca_model.transform(vector_df).select("pca_features").show(10, False)
print(f"CPU PCA transform took: {time.time() - start_time} sec")

### Summary

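With our 100,000 x 2048 dataset, the end-to-end speedup is the total CPU time divided by the total GPU time, where fc and tc are the CPU fit and transform times and fg and tg are the GPU fit and transform times printed above:

(fc + tc) / (fg + tg) = 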
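
Once the four timings have been recorded, the ratio can be computed with a small helper. This sketch assumes `fc` and `tc` denote the CPU fit and transform times and `fg` and `tg` the GPU fit and transform times, all in seconds. `speedup` is a hypothetical helper, and the numbers passed to it are made up for illustration; they are not measurements from this notebook.

```python
def speedup(fc, tc, fg, tg):
    """End-to-end speedup: total CPU time divided by total GPU time."""
    gpu_total = fg + tg
    if gpu_total <= 0:
        raise ValueError("GPU fit + transform time must be positive")
    return (fc + tc) / gpu_total

# Hypothetical timings in seconds, for illustration only (not measured results).
example = speedup(fc=8.0, tc=2.0, fg=1.5, tg=0.5)
print(f"{example:.1f}x")  # 5.0x
```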