# Principal Component Analysis - Going from High to Low Dimensionality

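PCA builds new synthetic features, called principal components, as linear combinations of the original ones.

The components are ordered by how much of the data's variance they capture, so keeping only the first few gives a lower-dimensional representation that still preserves most of the variation.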
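
To make the idea concrete, here is a minimal pure-Python sketch restricted to two features. The helper names `center` and `first_principal_component` and the sample `points` are made up for illustration. This is not how Spark computes PCA internally; it only shows the same idea in closed form for a 2x2 covariance matrix, and it assumes the points are not all identical.

```python
import math

def center(points):
    """Subtract the per-feature mean from every 2-D point."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return [(x - mean_x, y - mean_y) for x, y in points]

def first_principal_component(centered):
    """Return (unit direction, share of variance) of the first component."""
    n = len(centered)
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    largest, smallest = half_trace + radius, half_trace - radius
    # Orientation of the eigenvector that belongs to the largest eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    return direction, largest / (largest + smallest)

# Hypothetical data lying close to the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
centered = center(points)
direction, ratio = first_principal_component(centered)

# Projecting onto the direction turns each 2-D point into a single number.
scores = [x * direction[0] + y * direction[1] for x, y in centered]
print(direction, ratio)
print(scores)
```

Because the sample points are almost collinear, a single component carries nearly all of the variance, which is the situation where reducing the dimension loses very little.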

## Importing

In [21]:
import findspark

# Locate the Spark installation before importing anything from pyspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pca").getOrCreate()

In [22]:
from pyspark.ml.feature import PCA, VectorAssembler

## Loading Data

In [23]:
cars = spark.read.csv("../../data/Carros.csv", header=True, inferSchema=True, sep=';')
cars.show(2)

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
only showing top 2 rows



## Using PCA

In PySpark, PCA reads its input from a single vector column of the dataframe, not from separate numeric columns.

So we first need to add that feature column to the dataset with `VectorAssembler`.

**VectorAssembler**
- instantiate the object
- pass the list of input columns
- pass the output column name
- use the output column as input to PCA

**PCA**
- instantiate the object
- fit on the dataset
- transform the dataset and get the results

### Creating the object

We have to pass the list of feature names to the object as a parameter.
To get it easily, we'll read the column names from the `cars` dataset schema.
However, we'll exclude the `HP` column, since it is the target variable.

In [24]:
features_list = [feat.name for feat in cars.schema.fields if feat.name != 'HP']
features_list

['Consumo',
 'Cilindros',
 'Cilindradas',
 'RelEixoTraseiro',
 'Peso',
 'Tempo',
 'TipoMotor',
 'Transmissao',
 'Marchas',
 'Carburadors']

In [25]:
vectas = VectorAssembler(
    inputCols=features_list,
    outputCol="features"
)

cars = vectas.transform(cars)
cars.select("features").show(2, truncate=False)

+---------------------------------------------------+
|features                                           |
+---------------------------------------------------+
|[21.0,6.0,160.0,39.0,262.0,1646.0,0.0,1.0,4.0,4.0] |
|[21.0,6.0,160.0,39.0,2875.0,1702.0,0.0,1.0,4.0,4.0]|
+---------------------------------------------------+
only showing top 2 rows



In [30]:
pca = PCA(k=3, inputCol="features", outputCol="featuresPCA")

pca_model = pca.fit(cars)

In [None]:
cars_pca = pca_model.transform(cars)
cars_pca.select("featuresPCA").show(2, truncate=False)