## Introduction

This notebook shows an integrated workflow: Spark RAPIDS accelerated ETL followed by PCA training and transform.

In [1]:
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._

### Generate dummy data for PCA benchmark

Generate sample data with 2048 columns and 50,000 rows.

In [2]:
val rows = 50000
val dim = 2048
val r = new scala.util.Random(0)
// One row per sample; each row holds a single array column of `dim` random doubles.
val prepareDf = spark.createDataFrame(
      (0 until rows).map(_ => Tuple1(Array.fill(dim)(r.nextDouble()))))
        .withColumnRenamed("_1", "array_feature")
        .select((0 until dim).map(i => col("array_feature").getItem(i)): _*)
prepareDf.write.mode("overwrite").parquet("PCA_raw_parquet")


rows = 50000
dim = 2048
r = scala.util.Random@18a448de
prepareDf = [array_feature[0]: double, array_feature[1]: double ... 2046 more fields]


[array_feature[0]: double, array_feature[1]: double ... 2046 more fields]
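The generator above is seeded with `0`, so the dummy data should be reproducible from run to run. The following plain-Scala sketch illustrates that property on a much smaller row; `demoDim` and `sampleRow` are hypothetical names introduced only for this illustration and are not part of the notebook.

```scala
import scala.util.Random

// Hypothetical small dimension, used only to keep the illustration short.
val demoDim = 4

// Build one row of uniform random doubles from a freshly seeded generator.
def sampleRow(seed: Long): Array[Double] = {
  val rng = new Random(seed)
  Array.fill(demoDim)(rng.nextDouble())
}

val first = sampleRow(0L)
val second = sampleRow(0L)

// The same seed yields the same sequence of values.
println(first.sameElements(second))
// nextDouble() draws from [0, 1), which is why the column means computed later sit near 0.5.
println(first.forall(x => x >= 0.0 && x < 1.0))
```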

### Read raw parquet data

The parquet file contains the raw data for PCA train and transform.

The table has 2048 columns, named `array_feature[0]`, `array_feature[1]`, ..., `array_feature[2047]`.

In [3]:
val df = spark.read.parquet("PCA_raw_parquet")

df = [array_feature[0]: double, array_feature[1]: double ... 2046 more fields]


[array_feature[0]: double, array_feature[1]: double ... 2046 more fields]

### ETL: Calculate mean value for each column

The PCA algorithm expects mean-centered data as input, so a simple ETL step computes the mean of each column and subtracts it from that column.

In [4]:
val dim = 2048
val avgValue = df.select(
    (0 until dim).map("array_feature[" + _ + "]").map(col).map(avg): _*).first()
val inputCols = (0 until dim).map(i =>
    (col("array_feature[" + i + "]") - avgValue.getDouble(i)).alias("feature_"+i)
 )
val meanCenterDf = df.select(inputCols:_*)

dim = 2048
avgValue = [0.5014784341440235,0.5007938298214618,0.4988382739107633,0.5004857021518329,0.4976086737881863,0.501459317390976,0.4998871629299758,0.5003749032337383,0.5004268051953419,0.4992212831312325,0.5002230208274252,0.49916485476370304,0.49928552249125024,0.5001192271170941,0.4974153011145406,0.500340861041902,0.500511698285404,0.5029175790341269,0.5000848064753295,0.49946358217105435,0.4991402970341374,0.4999057035861329,0.4993188619485362,0.49782509547668896,0.5001573241354326,0.4991954590903186,0.4988846878237177,0.5008673384728016,0.4982505290656533,0.5000069827383224,0.49830672380384944,0.49849188876978057,0.502253148518209,0.4995624384114367,0.5006052199700368,0.49922409882583835,0.4996825327694508,0.4983465266402566,0.5001149704952238...
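To make the mean-centering step concrete, here is a minimal plain-Scala sketch of the same computation on a tiny in-memory matrix, without Spark. The toy data and the names `toyData`, `toyMeans` and `toyCentered` are assumptions made for illustration only.

```scala
// Hypothetical toy data: 3 rows, 2 columns.
val toyData: Seq[Array[Double]] = Seq(
  Array(1.0, 10.0),
  Array(2.0, 20.0),
  Array(3.0, 30.0)
)
val toyCols = toyData.head.length

// Mean of each column (the counterpart of avg(...) per column in the cell above).
val toyMeans: Array[Double] =
  Array.tabulate(toyCols)(j => toyData.map(_(j)).sum / toyData.length)

// Subtract each column's mean from every value in that column.
val toyCentered: Seq[Array[Double]] =
  toyData.map(row => row.zip(toyMeans).map { case (x, m) => x - m })

// After centering, every column mean should be (numerically) zero.
val centeredMeans: Array[Double] =
  Array.tabulate(toyCols)(j => toyCentered.map(_(j)).sum / toyCentered.length)

println(toyMeans.mkString(", "))
println(centeredMeans.forall(m => math.abs(m) < 1e-12))
```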



### Spark RAPIDS accelerated PCA accepts an ArrayType input column

Compared with standard Spark PCA, there is no need to convert the input column to a `Vector` first.

For example, the following code is required when using standard Spark PCA:

```scala
val convertToVector = udf((array: Seq[Double]) => {
  Vectors.dense(array.map(_.toDouble).toArray)
})
val vectorDf = dataDf.withColumn("feature_vec", convertToVector(col("feature")))
```

In [5]:
val dataDf = meanCenterDf.withColumn("feature",array(meanCenterDf.columns.map(col):_*))

dataDf = [feature_0: double, feature_1: double ... 2047 more fields]


[feature_0: double, feature_1: double ... 2047 more fields]

### Use Spark RAPIDS accelerated PCA

Compare this with the original PCA training API:

```scala
val pca = new org.apache.spark.ml.feature.PCA()
  .setInputCol("feature_vec")
  .setOutputCol("pca_features")
  .setK(3)
  .fit(vectorDf)
```

We use a customized class, so apart from the package name of the `PCA` class, no code change is needed to get GPU acceleration:

```scala
val pca = new com.nvidia.spark.ml.feature.PCA()
...
```

In [6]:
val pcaGpu = new com.nvidia.spark.ml.feature.PCA().setInputCol("feature").setOutputCol("pca_features").setK(3)

pcaGpu = pca_6b8d054604e4


pca_6b8d054604e4

In [7]:
val pcaModelGpu = spark.time(pcaGpu.fit(dataDf))

pcaModelGpu = PCAModel: uid=pca_6b8d054604e4, k=3


Time taken: 8280 ms


PCAModel: uid=pca_6b8d054604e4, k=3

In [8]:
spark.time(pcaModelGpu.transform(dataDf).select("pca_features").show(10, false))

+-----------------------------------------+
|pca_features                             |
+-----------------------------------------+
|[0.568805417, 0.041445481, 0.621107902]  |
|[0.378405859, -0.244389411, -0.358809445]|
|[0.421817533, -0.309621711, -0.159095405]|
|[0.424088954, 0.09907811, 0.252832213]   |
|[0.481344556, 0.303004001, 0.06884068]   |
|[0.837941281, 0.113648256, -0.319001501] |
|[0.093790516, -0.364140016, -0.33318393] |
|[0.103996026, -0.174265839, 0.226559042] |
|[-0.283206201, -0.487276589, 0.174362571]|
|[0.101710379, 0.569866637, 0.118964435]  |
+-----------------------------------------+
only showing top 10 rows

Time taken: 3284 ms
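Conceptually, the transform step projects each input row onto the `k` fitted principal components, which is why each `pca_features` entry above has three values. The plain-Scala sketch below illustrates that projection with made-up numbers; it is not the RAPIDS or Spark implementation, and `toyRow`, `toyComponents` and `project` are hypothetical names.

```scala
// Dot product of two equally sized arrays.
def dotProduct(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Hypothetical mean-centered input row with 3 features.
val toyRow = Array(1.0, 2.0, 3.0)

// Two hypothetical principal components (k = 2), each of unit length.
val toyComponents: Seq[Array[Double]] = Seq(
  Array(1.0, 0.0, 0.0),
  Array(0.0, 0.6, 0.8)
)

// The projected row has one coordinate per component.
def project(row: Array[Double], components: Seq[Array[Double]]): Array[Double] =
  components.map(c => dotProduct(row, c)).toArray

val toyProjected = project(toyRow, toyComponents)
println(toyProjected.mkString("[", ", ", "]"))
```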


### Use original Spark PCA

In [9]:
val convertToVector = udf((array: Seq[Double]) => {
  Vectors.dense(array.map(_.toDouble).toArray)
})
val vectorDf = dataDf.withColumn("feature_vec", convertToVector(col("feature")))

convertToVector = SparkUserDefinedFunction($Lambda$4927/1016137095@128d5536,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,List(Some(class[value[0]: array<double>])),None,true,true)
vectorDf = [feature_0: double, feature_1: double ... 2048 more fields]


[feature_0: double, feature_1: double ... 2048 more fields]
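As an aside on the UDF above: assuming Spark 3.1 or later, the built-in `org.apache.spark.ml.functions.array_to_vector` may be able to do the same array-to-vector conversion without a hand-written UDF. The sketch below tries it on a two-row toy DataFrame; `session`, `toyDf` and `toyVectorDf` are hypothetical names, and this variant was not part of the timings reported in this notebook.

```scala
import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the notebook's session if one exists; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[1]")
  .appName("array-to-vector-sketch")
  .getOrCreate()
import session.implicits._

// Hypothetical toy data: one array<double> column named "feature".
val toyDf = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0)).toDF("feature")

// Convert the array column to an ML Vector column (assumes Spark >= 3.1).
val toyVectorDf = toyDf.withColumn("feature_vec", array_to_vector(col("feature")))

val firstVec = toyVectorDf.select("feature_vec").head().getAs[Vector](0)
println(firstVec)
```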

In [10]:
val pcaCpu = new org.apache.spark.ml.feature.PCA().setInputCol("feature_vec").setOutputCol("pca_features").setK(3)

pcaCpu = pca_3f8feb827742


pca_3f8feb827742

In [11]:
val pcaModelCpu = spark.time(pcaCpu.fit(vectorDf))

pcaModelCpu = PCAModel: uid=pca_3f8feb827742, k=3


Time taken: 140539 ms


PCAModel: uid=pca_3f8feb827742, k=3

In [12]:
spark.time(pcaModelCpu.transform(vectorDf).select("pca_features").show(10, false))

+--------------------------------------------------------------+
|pca_features                                                  |
+--------------------------------------------------------------+
|[0.5688054172628585,-0.04144548077183109,-0.6211079018457807] |
|[0.37840585922945796,0.24438941118757604,0.3588094451238177]  |
|[0.4218175332258925,0.3096217108376109,0.15909540520858537]   |
|[0.4240889539599815,-0.09907811042793396,-0.2528322129752815] |
|[0.4813445560531313,-0.3030040008580291,-0.06884068037876276] |
|[0.8379412808563966,-0.11364825624115062,0.3190015014324452]  |
|[0.09379051625949268,0.3641400160124998,0.33318393004824964]  |
|[0.10399602625088979,0.17426583892592548,-0.2265590421381768] |
|[-0.2832062006131796,0.4872765894121887,-0.1743625713365004]  |
|[0.10171037937872408,-0.5698666372294762,-0.11896443456600647]|
+--------------------------------------------------------------+
only showing top 10 rows

Time taken: 12738 ms


### Summary

With 50,000 rows of data, we achieved:

the speedup for training: 140539 / 8280  ≈ `16.97`

the speedup for transform: 12738 / 3284   ≈ `3.88`
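The ratios above can be recomputed from the `Time taken` values printed by `spark.time`. A small sketch follows; the helper name `speedup` is hypothetical, and the inputs are the single-run timings reported in this notebook.

```scala
// Speedup is CPU wall-clock time divided by GPU wall-clock time.
def speedup(cpuMs: Long, gpuMs: Long): Double = cpuMs.toDouble / gpuMs

// Timings in milliseconds, copied from the cell outputs above.
val trainSpeedup = speedup(cpuMs = 140539L, gpuMs = 8280L)
val transformSpeedup = speedup(cpuMs = 12738L, gpuMs = 3284L)

println(f"training speedup:  $trainSpeedup%.2f")
println(f"transform speedup: $transformSpeedup%.2f")
```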

### Note

Some columns in the GPU output have the opposite sign from the corresponding columns in the CPU output. This follows from the nature of the SVD algorithm: singular vectors are determined only up to sign, so the flip does not affect the validity of the results. More details can be found in the [wiki](https://en.wikipedia.org/wiki/Singular_value_decomposition#Relation_to_eigenvalue_decomposition).
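The sign difference can be checked directly on the printed results. The sketch below copies the first two `pca_features` rows from the GPU and CPU outputs in this notebook, estimates one sign per column from the first row, and then checks that the two outputs agree once that sign is applied. The tolerance of `1e-8` is an assumption chosen to allow for the 9-digit rounding of the GPU display.

```scala
// First two rows of pca_features, copied from the GPU and CPU outputs above.
val gpuRows: Seq[Array[Double]] = Seq(
  Array(0.568805417, 0.041445481, 0.621107902),
  Array(0.378405859, -0.244389411, -0.358809445)
)
val cpuRows: Seq[Array[Double]] = Seq(
  Array(0.5688054172628585, -0.04144548077183109, -0.6211079018457807),
  Array(0.37840585922945796, 0.24438941118757604, 0.3588094451238177)
)

// Per-column sign relating GPU to CPU, estimated from the first row only.
val columnSigns: Array[Double] =
  gpuRows.head.zip(cpuRows.head).map { case (g, c) => math.signum(g * c) }

// Largest remaining difference after applying the per-column sign to the CPU values.
val maxDiff: Double = gpuRows.zip(cpuRows).flatMap { case (g, c) =>
  g.indices.map(j => math.abs(g(j) - columnSigns(j) * c(j)))
}.max

println(columnSigns.mkString(", "))
println(maxDiff < 1e-8)
```

If the same per-column signs reconcile every row, the two models describe the same principal directions and differ only in orientation.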