

# Spark ML: Feature Transformation

We load a dataset with information about how safe a car is. We will use it to study several very important Spark ML functions.

In [1]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3.6'

In [2]:
# Answer

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()



### Load the data and check the schema

The _read.csv_ method has an _inferSchema_ parameter. It infers the type of each column, which requires one extra pass over the data; it defaults to _False_.

In [3]:
# Answer

cars = spark.read.csv('Data/automobile.csv', sep=';', header=True, inferSchema=True)

cars.printSchema()

root
 |-- normalized_losses: integer (nullable = true)
 |-- make: string (nullable = true)
 |-- fuel_type: string (nullable = true)
 |-- aspiration: string (nullable = true)
 |-- num_of_doors: string (nullable = true)
 |-- body_style: string (nullable = true)
 |-- drive_wheels: string (nullable = true)
 |-- engine_location: string (nullable = true)
 |-- wheel_base: double (nullable = true)
 |-- length: double (nullable = true)
 |-- width: double (nullable = true)
 |-- height: double (nullable = true)
 |-- curb_weight: integer (nullable = true)
 |-- engine_type: string (nullable = true)
 |-- num_of_cylinders: string (nullable = true)
 |-- engine_size: integer (nullable = true)
 |-- fuel_system: string (nullable = true)
 |-- bore: double (nullable = true)
 |-- stroke: double (nullable = true)
 |-- compression_ratio: double (nullable = true)
 |-- horsepower: integer (nullable = true)
 |-- peak_rpm: integer (nullable = true)
 |-- city_mpg: integer (nullable = true)
 |-- highway_mpg: intege



### VectorAssembler



A _VectorAssembler_ is a transformer that combines multiple features into a single column of type vector. We will build it with every variable except the target column 'symboling'.
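To make the idea concrete, here is a minimal plain-Python sketch of what assembling one row amounts to. It is not Spark's implementation: the helper `assemble`, the sample row, and the columns `is_turbo` and `dims` are hypothetical and exist only for illustration. The sketch also rejects strings and nulls, the two situations this notebook runs into next.

```python
def assemble(row, input_cols):
    """Concatenate the chosen columns of one row into a flat list of floats."""
    vector = []
    for col in input_cols:
        value = row[col]
        if value is None:
            raise ValueError('null value in column {!r}'.format(col))
        if isinstance(value, str):
            raise TypeError('string column {!r} is not supported'.format(col))
        if isinstance(value, (list, tuple)):
            vector.extend(float(item) for item in value)  # vector-like input
        else:
            vector.append(float(value))  # numeric or boolean input
    return vector

# Made-up row for illustration; 'is_turbo' and 'dims' are hypothetical columns
row = {'wheel_base': 93.1, 'curb_weight': 1945, 'is_turbo': True,
       'dims': [166.8, 64.2], 'make': 'mazda'}

features = assemble(row, ['wheel_base', 'curb_weight', 'is_turbo', 'dims'])
print(features)
```

Booleans become 0.0 or 1.0 and vector-like inputs are flattened in place, so the output has one slot per scalar input plus one slot per vector element.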

In [4]:
# Answer

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=[element for element in cars.columns if element != 'symboling'], outputCol='assembled_features')

cars_assembled = assembler.transform(cars)

cars_assembled.show()

IllegalArgumentException: 'Data type string of column make is not supported.\nData type string of column fuel_type is not supported.\nData type string of column aspiration is not supported.\nData type string of column num_of_doors is not supported.\nData type string of column body_style is not supported.\nData type string of column drive_wheels is not supported.\nData type string of column engine_location is not supported.\nData type string of column engine_type is not supported.\nData type string of column num_of_cylinders is not supported.\nData type string of column fuel_system is not supported.'



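Reading the error, every message has the same form: columns of type string are not supported. Some Spark versions print it in this shorter form:

**IllegalArgumentException: 'Data type StringType is not supported.'**

Recall that VectorAssembler only accepts the following data types:

- numeric
- boolean
- vector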
    



We inspect the type of each variable and apply VectorAssembler only to the variables whose types are allowed. In other words, the _VectorAssembler_ must not include columns of type _string_.
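As a hedged alternative to excluding only strings, the selection can be written as a whitelist of the accepted types, so that other unsupported types (arrays, for example) are excluded too. The sketch below works on plain `(name, dtype)` pairs shaped like the result of `cars.dtypes`. The helper names and the set `ASSEMBLABLE_TYPES` are assumptions of this example, not part of Spark.

```python
# Simple type names, as they appear in DataFrame.dtypes, for the types that
# VectorAssembler accepts (numeric, boolean, vector). This set is an
# assumption of the sketch; adjust it to the types present in your data.
ASSEMBLABLE_TYPES = {
    'tinyint', 'smallint', 'int', 'bigint',
    'float', 'double', 'boolean', 'vector',
}

def is_assemblable(dtype):
    """Return True if a dtype string looks numeric, boolean or vector."""
    # Decimal columns carry precision and scale, e.g. 'decimal(10,2)'
    return dtype in ASSEMBLABLE_TYPES or dtype.startswith('decimal')

def assemblable_columns(dtypes, target):
    """Pick the columns to feed into VectorAssembler, excluding the target."""
    return [name for name, dtype in dtypes
            if name != target and is_assemblable(dtype)]

# A few (name, dtype) pairs copied from the schema printed earlier
sample_dtypes = [
    ('normalized_losses', 'int'),
    ('make', 'string'),
    ('wheel_base', 'double'),
    ('num_of_doors', 'string'),
    ('symboling', 'int'),
]

selected = assemblable_columns(sample_dtypes, target='symboling')
print(selected)
```

With a real DataFrame the call would be `assemblable_columns(cars.dtypes, target='symboling')`.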

In [5]:
# Answer

cars.dtypes

[('normalized_losses', 'int'),
 ('make', 'string'),
 ('fuel_type', 'string'),
 ('aspiration', 'string'),
 ('num_of_doors', 'string'),
 ('body_style', 'string'),
 ('drive_wheels', 'string'),
 ('engine_location', 'string'),
 ('wheel_base', 'double'),
 ('length', 'double'),
 ('width', 'double'),
 ('height', 'double'),
 ('curb_weight', 'int'),
 ('engine_type', 'string'),
 ('num_of_cylinders', 'string'),
 ('engine_size', 'int'),
 ('fuel_system', 'string'),
 ('bore', 'double'),
 ('stroke', 'double'),
 ('compression_ratio', 'double'),
 ('horsepower', 'int'),
 ('peak_rpm', 'int'),
 ('city_mpg', 'int'),
 ('highway_mpg', 'int'),
 ('price', 'int'),
 ('symboling', 'int')]

In [6]:
# Answer

columns_assemble = [element[0] for element in cars.dtypes if element[1] != 'string' and element[0] != 'symboling']

assembler = VectorAssembler(inputCols=columns_assemble, outputCol='assembled_features')

cars_assembled = assembler.transform(cars)

cars_assembled.show()

Py4JJavaError: An error occurred while calling o77.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<normalized_losses_double_VectorAssembler_3a24381a8a30:double,wheel_base:double,length:double,width:double,height:double,curb_weight_double_VectorAssembler_3a24381a8a30:double,engine_size_double_VectorAssembler_3a24381a8a30:double,bore:double,stroke:double,compression_ratio:double,horsepower_double_VectorAssembler_3a24381a8a30:double,peak_rpm_double_VectorAssembler_3a24381a8a30:double,city_mpg_double_VectorAssembler_3a24381a8a30:double,highway_mpg_double_VectorAssembler_3a24381a8a30:double,price_double_VectorAssembler_3a24381a8a30:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Encountered null while assembling a row with handleInvalid = "keep". Consider
removing nulls from dataset or using handleInvalid = "keep" or "skip".
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:287)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:255)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:255)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$4.apply(VectorAssembler.scala:144)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$4.apply(VectorAssembler.scala:143)
	... 21 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct<normalized_losses_double_VectorAssembler_3a24381a8a30:double,wheel_base:double,length:double,width:double,height:double,curb_weight_double_VectorAssembler_3a24381a8a30:double,engine_size_double_VectorAssembler_3a24381a8a30:double,bore:double,stroke:double,compression_ratio:double,horsepower_double_VectorAssembler_3a24381a8a30:double,peak_rpm_double_VectorAssembler_3a24381a8a30:double,city_mpg_double_VectorAssembler_3a24381a8a30:double,highway_mpg_double_VectorAssembler_3a24381a8a30:double,price_double_VectorAssembler_3a24381a8a30:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
Caused by: org.apache.spark.SparkException: Encountered null while assembling a row with handleInvalid = "keep". Consider
removing nulls from dataset or using handleInvalid = "keep" or "skip".
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:287)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:255)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:255)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$4.apply(VectorAssembler.scala:144)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$4.apply(VectorAssembler.scala:143)
	... 21 more




It failed again. What is going on?

In Spark 2.1 the message gives few clues about the cause of the error. In Spark 2.2, however, the error is described as follows:

**Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.**

The trace above points to the same cause (`Encountered null while assembling a row`), so null values must be filtered out before applying a VectorAssembler.




We will remove every row that contains nulls:

In [7]:
# Answer

cars_no_nulls = cars.cache()

for element in cars.columns:
    # Count the nulls once per column (every count() triggers a Spark job)
    null_count = cars.where(cars[element].isNull()).count()
    if null_count != 0:
        print('\tThe column "{}" has null values'.format(element))
        # Keep only the rows where this column is not null
        cars_no_nulls = cars_no_nulls.where(cars[element].isNotNull())
    else:
        print('The column "{}" does not have null values'.format(element))

The column "normalized_losses" does not have null values
	The column "make" has null values
The column "fuel_type" does not have null values
The column "aspiration" does not have null values
The column "num_of_doors" does not have null values
	The column "body_style" has null values
The column "drive_wheels" does not have null values
The column "engine_location" does not have null values
The column "wheel_base" does not have null values
The column "length" does not have null values
The column "width" does not have null values
The column "height" does not have null values
The column "curb_weight" does not have null values
The column "engine_type" does not have null values
The column "num_of_cylinders" does not have null values
The column "engine_size" does not have null values
The column "fuel_system" does not have null values
	The column "bore" has null values
The column "stroke" does not have null values
The column "compression_ratio" does not have null values
	The column "horsepower"
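The loop above does two things: it reports which columns contain nulls and it keeps only the rows with no null in any column. PySpark's built-in `DataFrame.dropna()` applies the same "drop the row if any value is null" rule by default. As a plain-Python sketch of that logic, with made-up rows and hypothetical helper names:

```python
def null_counts(rows, columns):
    """Count missing (None) values per column in a single pass over the rows."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if row[column] is None:
                counts[column] += 1
    return counts

def drop_rows_with_nulls(rows, columns):
    """Keep only the rows where every listed column has a value."""
    return [row for row in rows
            if all(row[column] is not None for column in columns)]

# Made-up rows for illustration; they are not taken from automobile.csv
rows = [
    {'make': 'mazda', 'bore': 3.03, 'horsepower': 68},
    {'make': None, 'bore': 3.19, 'horsepower': 110},
    {'make': 'audi', 'bore': None, 'horsepower': None},
]
columns = ['make', 'bore', 'horsepower']

counts = null_counts(rows, columns)
clean_rows = drop_rows_with_nulls(rows, columns)
print(counts)
print(len(clean_rows))
```

Note that a single null anywhere discards the whole row, so a dataset with nulls scattered across many columns can shrink considerably.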

In [8]:
# Answer

assembler = VectorAssembler(inputCols=columns_assemble, outputCol='assembled_features')

cars_assembled = assembler.transform(cars_no_nulls) # please bear in mind, we are using cars_no_nulls

cars_assembled.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+--------------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|  assembled_features|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+--------------------+
|              113|     mazda|      gas|       std|      [four]|     seda



**The VectorAssembler now runs successfully!**



### StringIndexer



* Let's apply StringIndexer to the 'make' variable, which holds the brand of the car.

In [10]:
cars.groupBy('make').count().show()

+-------------+-----+
|         make|count|
+-------------+-----+
|       peugot|   11|
|       jaguar|    3|
|   mitsubishi|   13|
|         null|    1|
|       toyota|   31|
|         saab|    6|
|     plymouth|    7|
|         audi|    7|
|          bmw|    8|
|  alfa-romero|    3|
|        dodge|    9|
|        mazda|   17|
|mercedes-benz|    8|
|        isuzu|    4|
|      porsche|    5|
|    chevrolet|    3|
|        honda|   13|
|   volkswagen|   12|
|      mercury|    1|
|      renault|    2|
+-------------+-----+
only showing top 20 rows



In [11]:
# Answer

from pyspark.ml.feature import StringIndexer

feature_indexer = StringIndexer(inputCol='make', outputCol='make_indexed')

feature_indexer_model = feature_indexer.fit(cars)

cars_indexed = feature_indexer_model.transform(cars)

cars_indexed.collect()

Py4JJavaError: An error occurred while calling o350.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 120.0 failed 1 times, most recent failure: Lost task 0.0 in stage 120.0 (TID 311, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: StringIndexer encountered NULL value. To handle or skip NULLS, try setting StringIndexer.handleInvalid.
	at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:251)
	at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:246)
	... 18 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3258)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3255)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3255)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
Caused by: org.apache.spark.SparkException: StringIndexer encountered NULL value. To handle or skip NULLS, try setting StringIndexer.handleInvalid.
	at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:251)
	at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:246)
	... 18 more




Once again there is an error. In Spark 2.1 the message gives few clues about it.
In Spark 2.2 the error reads: **Caused by: org.apache.spark.SparkException: StringIndexer encountered NULL value. To handle or skip NULLS, try setting StringIndexer.handleInvalid.**

It is important to have handled the nulls correctly beforehand.

What would be the drawback of using handleInvalid as the message suggests?
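One way to reason about that question is to look at what each policy does to a row holding an invalid value. `StringIndexer` documents three values for `handleInvalid`: `'error'` (the default), `'skip'` and `'keep'`. The plain-Python sketch below mimics them. The function `index_labels` is hypothetical and is not Spark's implementation.

```python
def index_labels(values, labels, handle_invalid='error'):
    """Map each value to its position in labels under a given policy.

    Values that are None or missing from labels count as invalid.
    """
    position = {label: float(i) for i, label in enumerate(labels)}
    indexed = []
    for value in values:
        if value in position:
            indexed.append((value, position[value]))
        elif handle_invalid == 'skip':
            continue  # the whole row silently disappears
        elif handle_invalid == 'keep':
            # every invalid value shares one extra bucket at index len(labels)
            indexed.append((value, float(len(labels))))
        else:
            raise ValueError('invalid value: {!r}'.format(value))
    return indexed

labels = ['toyota', 'mazda']
values = ['mazda', None, 'toyota']

skipped = index_labels(values, labels, handle_invalid='skip')
kept = index_labels(values, labels, handle_invalid='keep')
print(skipped)
print(kept)
```

With `'skip'` rows vanish without any warning, and with `'keep'` nulls are mixed with every other unseen label in a single category. Either way the problem in the data is hidden rather than handled explicitly.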

In [12]:
# Answer

from pyspark.ml.feature import StringIndexer

feature_indexer = StringIndexer(inputCol='make', outputCol='make_indexed')

feature_indexer_model = feature_indexer.fit(cars_no_nulls) # Please bear in mind, now we are using cars_no_nulls

cars_indexed = feature_indexer_model.transform(cars_no_nulls)

cars_indexed.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|make_indexed|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+
|              113|     mazda|      gas|       std|      [four]|     sedan|         fwd|         



Accessing `feature_indexer_model.labels` returns the list of labels built by `StringIndexer`. The list is sorted by descending frequency, so the most frequent value gets index 0.
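The frequency ordering can be reproduced in a few lines of plain Python. This is only a sketch of the idea: the helper `frequency_labels` is hypothetical, the sample list is made up, and the alphabetical tie-breaking is an assumption chosen to make the result deterministic.

```python
from collections import Counter

def frequency_labels(values):
    """Return the distinct values, most frequent first.

    Ties are broken alphabetically so the result is deterministic;
    that rule is an assumption of this sketch.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

# Made-up sample, not the real 'make' column
makes = ['toyota', 'mazda', 'toyota', 'audi', 'mazda', 'toyota']

labels = frequency_labels(makes)
indexed = [float(labels.index(make)) for make in makes]
print(labels)
print(indexed)
```

Each value is replaced by its position in `labels`, which is why the most frequent brand ends up as `0.0`.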

In [13]:
# Answer

feature_indexer_model.labels

['toyota',
 'mazda',
 'nissan',
 'honda',
 'mitsubishi',
 'subaru',
 'volkswagen',
 'volvo',
 'peugot',
 'dodge',
 'mercedes-benz',
 'bmw',
 'audi',
 'plymouth',
 'saab',
 'porsche',
 'isuzu',
 'alfa-romero',
 'jaguar',
 'chevrolet',
 'renault',
 'mercury']

In [33]:
cars_indexed.groupBy('make_indexed').count().show()

+------------+-----+
|make_indexed|count|
+------------+-----+
|         8.0|   11|
|         0.0|   31|
|         7.0|   11|
|        18.0|    3|
|         1.0|   17|
|         4.0|   13|
|        11.0|    8|
|        21.0|    1|
|        14.0|    6|
|         3.0|   13|
|        19.0|    2|
|         2.0|   17|
|        17.0|    3|
|        10.0|    8|
|        13.0|    7|
|         6.0|   11|
|        20.0|    2|
|         5.0|   12|
|        15.0|    5|
|         9.0|    9|
+------------+-----+
only showing top 20 rows





### CountVectorizer



* Let's apply CountVectorizer to the 'num_of_doors' variable.

| num_of_doors |
| -----------: |
| [four]       |
| [two,four]   |


In [34]:
# Answer

from pyspark.ml.feature import CountVectorizer

feature_cv = CountVectorizer(inputCol='num_of_doors', outputCol='doors_counter')

model_cv = feature_cv.fit(cars)

# transform() belongs to the fitted model, not to the estimator
cars_cv = model_cv.transform(cars)

cars_cv.show()

IllegalArgumentException: 'requirement failed: Column num_of_doors must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type string.'



Looking at the schema, 'num_of_doors' does not have the right format (it is of type _string_). Let's convert it to _ArrayType(StringType())_.
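The conversion itself is simple string handling: strip the brackets and split on commas. A minimal plain-Python sketch follows. The helper `parse_door_list` is hypothetical, and unlike a bare `replace(...).split(...)` chain it returns `None` for missing values instead of raising an exception.

```python
def parse_door_list(value):
    """Turn a string such as '[two,four]' into the list ['two', 'four']."""
    if value is None:
        return None  # propagate missing values instead of failing
    inner = value.strip().strip('[]')  # drop surrounding brackets
    # Split on commas and discard empty tokens (e.g. for the input '[]')
    return [token.strip() for token in inner.split(',') if token.strip()]

print(parse_door_list('[four]'))
print(parse_door_list('[two,four]'))
```

A function like this could be wrapped in a UDF with return type `ArrayType(StringType())`, which is what the next cell does with an inline lambda.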

In [35]:
# Answer

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType, DoubleType

# UDF that turns a string such as '[two,four]' into the list ['two', 'four']
parse_doors = F.udf(lambda value: value.replace('[', '').replace(']', '').split(','),
                    ArrayType(StringType()))

cars_no_nulls = cars_no_nulls.withColumn('num_of_doors', parse_doors(F.col('num_of_doors')))

cars_no_nulls.printSchema()

root
 |-- normalized_losses: integer (nullable = true)
 |-- make: string (nullable = true)
 |-- fuel_type: string (nullable = true)
 |-- aspiration: string (nullable = true)
 |-- num_of_doors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- body_style: string (nullable = true)
 |-- drive_wheels: string (nullable = true)
 |-- engine_location: string (nullable = true)
 |-- wheel_base: double (nullable = true)
 |-- length: double (nullable = true)
 |-- width: double (nullable = true)
 |-- height: double (nullable = true)
 |-- curb_weight: integer (nullable = true)
 |-- engine_type: string (nullable = true)
 |-- num_of_cylinders: string (nullable = true)
 |-- engine_size: integer (nullable = true)
 |-- fuel_system: string (nullable = true)
 |-- bore: double (nullable = true)
 |-- stroke: double (nullable = true)
 |-- compression_ratio: double (nullable = true)
 |-- horsepower: integer (nullable = true)
 |-- peak_rpm: integer (nullable = true)
 |-- city_mpg: int

In [36]:
# Answer

cars_no_nulls.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+
|              113|     mazda|      gas|       std|      [four]|     sedan|         fwd|          front|      93.1| 166.8| 64.2|  54.1| 



Let's try again:

In [37]:
# Answer

feature_cv = CountVectorizer(inputCol='num_of_doors', outputCol='doors_counter')

model_cv = feature_cv.fit(cars_no_nulls)

cars_cv = model_cv.transform(cars_no_nulls)

cars_cv.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+-------------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|      doors_counter|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+-------------------+
|              113|     mazda|      gas|       std|      [four]|     sedan| 



The following table shows the conversion performed by _CountVectorizer_:

| num_of_doors   | doors_counter   |
| -------------: | -------------: |
| [four]| (2,[0],[1.0]) |
| [two,four]     | (2,[0,1],[1.0,1.0])|

The *doors_counter* column holds the output of the _CountVectorizerModel_: a sparse vector displayed as three fields. The first field is the vector size, that is, the number of distinct values found in the *num_of_doors* column, which is 2 here. The second field lists the indices of the values that appear in that record; *model_cv.vocabulary* shows that 'four' corresponds to index 0 and 'two' to index 1. The third field gives how many times each of those values appears in *num_of_doors* for that record.
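
To make the three fields concrete, here is a minimal pure-Python sketch (no Spark required) that expands a sparse triple into a dense list and pairs it with the vocabulary. The helper name `sparse_to_dense` is hypothetical; the triples mirror the two rows of the table above.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Vocabulary order as described in the text: 'four' -> 0, 'two' -> 1
vocabulary = ['four', 'two']

row_four = sparse_to_dense(2, [0], [1.0])          # (2,[0],[1.0])
row_both = sparse_to_dense(2, [0, 1], [1.0, 1.0])  # (2,[0,1],[1.0,1.0])

# Pair every dense position with its vocabulary term
counts_four = dict(zip(vocabulary, row_four))
counts_both = dict(zip(vocabulary, row_both))
print(counts_four)
print(counts_both)
```

Each dense position lines up with one vocabulary term, which is the correspondence used later to split the vector into separate columns.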




### OneHotEncoder



* Let's apply OneHotEncoder to the 'make' variable (recall that it contains the makes of the different cars)

In [38]:
# Answer

from pyspark.ml.feature import OneHotEncoder

feature_ohe = OneHotEncoder(inputCol='make', outputCol='make_onehotencoder')

cars_ohe = feature_ohe.transform(cars_no_nulls)

cars_ohe.show()

IllegalArgumentException: 'requirement failed: Input column must be of type numeric but got string'



The following error is raised: **IllegalArgumentException: 'requirement failed: Input column must be of type numeric but got string'**

To apply a OneHotEncoder, the equivalent of dummy variables, the column must first go through _StringIndexer_. We have already done this; recall the *make_indexed* column.
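
As a rough illustration of what the indexing step does, here is a pure-Python sketch that assumes _StringIndexer_'s default ordering, in which the most frequent label receives index 0. The function `index_labels` and the sample list are hypothetical, and ties are broken alphabetically here only to keep the sketch deterministic.

```python
from collections import Counter

def index_labels(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically for determinism
    labels = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(labels)}
    return labels, [mapping[v] for v in values]

# Hypothetical sample, not taken from the dataset
sample_makes = ['toyota', 'mazda', 'toyota', 'bmw', 'mazda', 'toyota']
labels, indexed = index_labels(sample_makes)
print(labels)
print(indexed)
```

The resulting indices are numeric, which is the input type OneHotEncoder asks for.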

We reuse the `cars_indexed` DataFrame from the earlier example:

In [39]:
# Answer

cars_indexed.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|make_indexed|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+
|              113|     mazda|      gas|       std|      [four]|     sedan|         fwd|         

In [40]:
cars_indexed.dtypes

[('normalized_losses', 'int'),
 ('make', 'string'),
 ('fuel_type', 'string'),
 ('aspiration', 'string'),
 ('num_of_doors', 'string'),
 ('body_style', 'string'),
 ('drive_wheels', 'string'),
 ('engine_location', 'string'),
 ('wheel_base', 'double'),
 ('length', 'double'),
 ('width', 'double'),
 ('height', 'double'),
 ('curb_weight', 'int'),
 ('engine_type', 'string'),
 ('num_of_cylinders', 'string'),
 ('engine_size', 'int'),
 ('fuel_system', 'string'),
 ('bore', 'double'),
 ('stroke', 'double'),
 ('compression_ratio', 'double'),
 ('horsepower', 'int'),
 ('peak_rpm', 'int'),
 ('city_mpg', 'int'),
 ('highway_mpg', 'int'),
 ('price', 'int'),
 ('symboling', 'int'),
 ('make_indexed', 'double')]



The `cars_indexed` DataFrame already includes the `make_indexed` variable, and it is numeric. We start working from here:

In [41]:
# Answer

encoder = OneHotEncoder(inputCol="make_indexed", outputCol="make_onehotencoder")
cars_encoded = encoder.transform(cars_indexed)
cars_encoded.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+------------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|make_indexed|make_onehotencoder|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+------------------+
|              113|     mazda|      gas|



### Splitting the results into separate columns

With both CountVectorizer and OneHotEncoder, the results end up as a vector in a single column. It would be very useful to split them into separate columns.

Let's see how to do it.



**For the CountVectorizer case**

One possible example is to generate a *doors_four* column and a *doors_two* column.

| num_of_doors   | doors_counter   |doors_four|doors_two|
| -------------: | -------------: | -------------:| -------------:|
| [four]| (2,[0],[1.0]) |1.0|0.0|
| [two,four]     | (2,[0,1],[1.0,1.0])| 1.0|1.0|

To do this, first create an array-type column (`ArrayType(DoubleType())`) called *activated_index*.

In [42]:
# Answer

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Convert the sparse vector into a plain list of doubles
cars_cv = (cars_cv.withColumn('activated_index', F.udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))(F.col('doors_counter'))))

cars_cv.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+-------------------+---------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|      doors_counter|activated_index|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+-------------------+---------------+
|              113|     mazd



Now we need to split the resulting array, *activated_index*, so that each element lands in its own column. We also need the distinct values that were counted, which are available through *model_cv.vocabulary*.

In [71]:
# Answer

vocab = model_cv.vocabulary

In [72]:
vocab

['four', 'two']



We split our 'activated_index' column and rename the resulting columns after the corresponding vocabulary term:

In [74]:
# Answer

cars_cv = cars_cv.select(cars_cv.columns + [(F.col("activated_index")[i]).alias('doors_' + v) for i, v in enumerate(vocab)])

cars_cv.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+-------------------+---------------+----------+---------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|      doors_counter|activated_index|doors_four|doors_two|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+-------------------+---



Done!
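
The select-with-alias step above boils down to pairing each array position with its vocabulary term. A minimal pure-Python sketch of that pairing follows; the sample rows and the helper `split_array` are hypothetical stand-ins for the DataFrame and the `select` call.

```python
vocab = ['four', 'two']

# Hypothetical rows standing in for the DataFrame
rows = [
    {'num_of_doors': ['four'], 'activated_index': [1.0, 0.0]},
    {'num_of_doors': ['two', 'four'], 'activated_index': [1.0, 1.0]},
]

def split_array(row, array_key, names, prefix):
    """Copy the row and add one prefixed field per array position."""
    expanded = dict(row)
    for i, name in enumerate(names):
        expanded[prefix + name] = row[array_key][i]
    return expanded

expanded_rows = [split_array(r, 'activated_index', vocab, 'doors_') for r in rows]
for r in expanded_rows:
    print(r['num_of_doors'], r['doors_four'], r['doors_two'])
```

The order of `vocab` matters: position `i` of the array is only meaningful when paired with the `i`-th vocabulary entry.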




**For the OneHotEncoder case**

The process is equivalent; the only difference is where the categories come from.



First, create an _ArrayType()_ column:

In [75]:
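# Answer

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Convert the one-hot vector into a plain list of doubles
cars_encoded = (cars_encoded.withColumn('make_activated_index', F.udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))(F.col('make_onehotencoder'))))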

cars_encoded.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+------------------+--------------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|make_indexed|make_onehotencoder|make_activated_index|
+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+------------------



Next, split the resulting array, *make_activated_index*, so that each element lands in its own column.



We need to know the distinct categories that were encoded. The difference here is that a StringIndexer was applied before the OneHotEncoder, so we must go back to the StringIndexer model to recover the categories.


In [76]:
# Answer

vocab = feature_indexer_model.labels
print(vocab)

['toyota', 'mazda', 'nissan', 'honda', 'mitsubishi', 'subaru', 'volkswagen', 'volvo', 'peugot', 'dodge', 'mercedes-benz', 'bmw', 'audi', 'plymouth', 'saab', 'porsche', 'isuzu', 'alfa-romero', 'jaguar', 'chevrolet', 'renault', 'mercury']




Inspecting the categories, we see characters that are awkward in column names. This is because there are car makes such as "mercedes-benz". The hyphen "-" is best avoided in column names (in Spark SQL expressions such a name has to be quoted with backticks). With this in mind, we split our 'make_activated_index' column and rename the resulting columns with the corresponding make:

In [77]:
# Answer

cars_encoded = cars_encoded.select(cars_encoded.columns + [(F.col("make_activated_index")[i]).alias('make_' + v.replace('-','_')) for i, v in enumerate(vocab)])

cars_encoded.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+------------------+--------------------+-----------+----------+-----------+----------+---------------+-----------+---------------+----------+-----------+----------+------------------+--------+---------+-------------+---------+------------+----------+----------------+-----------+--------------+------------+------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|make_indexed|make_onehotencoder|make_activated_index|make_toyota|make_mazda|make_nissan|make
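
The `replace('-','_')` call above handles hyphens only. If other labels contained spaces or dots, a more general clean-up might be preferable. The sketch below shows one possible approach; `safe_column_name` is a hypothetical helper and not part of PySpark.

```python
import re

def safe_column_name(prefix, label):
    """Replace every run of characters outside [A-Za-z0-9_] with one underscore."""
    return prefix + re.sub(r'[^0-9A-Za-z_]+', '_', label)

# A few labels taken from the vocabulary printed above
makes = ['toyota', 'mercedes-benz', 'alfa-romero']
column_names = [safe_column_name('make_', m) for m in makes]
print(column_names)
```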



* Let's examine the behaviour of OneHotEncoder

In [78]:
# Answer

vocab

['toyota',
 'mazda',
 'nissan',
 'honda',
 'mitsubishi',
 'subaru',
 'volkswagen',
 'volvo',
 'peugot',
 'dodge',
 'mercedes-benz',
 'bmw',
 'audi',
 'plymouth',
 'saab',
 'porsche',
 'isuzu',
 'alfa-romero',
 'jaguar',
 'chevrolet',
 'renault',
 'mercury']



The last category is 'mercury'. Let's see what happens:

In [79]:
# Answer

cars_encoded.where(F.col('make')=='mercury').show(1)

+-----------------+-------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+------------------+--------------------+-----------+----------+-----------+----------+---------------+-----------+---------------+----------+-----------+----------+------------------+--------+---------+-------------+---------+------------+----------+----------------+-----------+--------------+------------+------------+
|normalized_losses|   make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|make_indexed|make_onehotencoder|make_activated_index|make_toyota|make_mazda|make_nissan|make_honda



Note that 'make_mercury' is null. In fact, the last column is null for every row:

In [80]:
# Answer

cars_encoded.select('make_mercury').distinct().show()

+------------+
|make_mercury|
+------------+
|        null|
+------------+





**Why?**

Because OneHotEncoder drops the last category by default (its `dropLast` parameter). The categories are mutually exclusive, so the indicator columns always sum to one and any one of them is a linear combination of the rest. The encoded vector therefore has one position fewer than there are labels, and the index for the last label falls outside the array, which yields null.

In some variable-selection settings every category must be present. Let's see how to force this category to appear as well.

In [81]:
# Answer

cars_encoded = (cars_encoded.withColumn(cars_encoded.columns[-1], 
                F.udf(lambda value: 1.0 if value == vocab[-1] else 0.0, DoubleType())(F.col('make'))))


cars_encoded.show()

+-----------------+----------+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+---------+------------+------------------+--------------------+-----------+----------+-----------+----------+---------------+-----------+---------------+----------+-----------+----------+------------------+--------+---------+-------------+---------+------------+----------+----------------+-----------+--------------+------------+------------+
|normalized_losses|      make|fuel_type|aspiration|num_of_doors|body_style|drive_wheels|engine_location|wheel_base|length|width|height|curb_weight|engine_type|num_of_cylinders|engine_size|fuel_system|bore|stroke|compression_ratio|horsepower|peak_rpm|city_mpg|highway_mpg|price|symboling|make_indexed|make_onehotencoder|make_activated_index|make_toyota|make_mazda|make_nissan|make

In [82]:
# Answer

cars_encoded.select('make_mercury').distinct().show()

+------------+
|make_mercury|
+------------+
|         0.0|
|         1.0|
+------------+
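
The behaviour seen above can be reproduced without Spark. The sketch below is a simplified model, not the library's implementation: `one_hot` is a hypothetical helper whose `drop_last` flag mimics the encoder's `dropLast` parameter (true by default in PySpark), and `get_item` imitates how an out-of-range array index comes back as null.

```python
def one_hot(index, num_labels, drop_last=True):
    """Encode index as a 0/1 list; with drop_last the last label has no slot."""
    size = num_labels - 1 if drop_last else num_labels
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

def get_item(array, i):
    """Return None for an out-of-range index, like a null in the DataFrame."""
    return array[i] if i < len(array) else None

labels = ['toyota', 'mazda', 'mercury']  # shortened, hypothetical label list
last = len(labels) - 1

dropped = one_hot(last, len(labels))                # last category -> all zeros
kept = one_hot(last, len(labels), drop_last=False)  # last category keeps its slot

print(dropped, get_item(dropped, last))
print(kept, get_item(kept, last))
```

In PySpark itself, constructing the encoder with `dropLast=False` should make the vector keep a position for every label, which would likely make the UDF workaround above unnecessary.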

