# Imputer: Dealing with Missing Values

With the `Imputer` estimator, we can fill missing values using one of several imputation strategies: mean, median, or mode.

But that's not all: we can also treat any other value (such as zero) as missing and replace it, by setting the corresponding parameter.

Like many other Spark ML feature transformers, `Imputer` only works on numeric columns.

## Importing

In [1]:
import findspark

# Run findspark.init() before importing pyspark so the Spark installation is on sys.path
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ohe").getOrCreate()

In [2]:
from pyspark.ml.feature import Imputer

## Loading Data

In [3]:
cars = spark.read.load(
    "../../data/CarrosNAN.csv",
    format="csv",
    sep=";",
    header=True,
    inferSchema=True)

cars.show(2)

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|       null|             39|2875| null|        0|          1|      4|          4|110|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
only showing top 2 rows



## Imputing Missing Values

By default, `Imputer` replaces missing values with the column mean. To use a different statistic, we have to set the strategy explicitly.

### Filling with the mean

In [4]:
imputer = Imputer(
    inputCols=["Cilindradas", "Peso"],
    outputCols=["cil_filled_mean", "peso_filled_mean"]
)
cars = imputer.fit(cars).transform(cars)



In [5]:
cars.select("Cilindradas", "cil_filled_mean", "Peso", "peso_filled_mean").show(5)

+-----------+---------------+----+----------------+
|Cilindradas|cil_filled_mean|Peso|peso_filled_mean|
+-----------+---------------+----+----------------+
|        160|            160| 262|             262|
|       null|            848|2875|            2875|
|        108|            108| 232|             232|
|       null|            848|3215|            3215|
|        360|            360|null|            1318|
+-----------+---------------+----+----------------+
only showing top 5 rows
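
Conceptually, mean imputation computes the average of the observed values in a column and writes it into every null slot. The pure-Python sketch below illustrates the idea on a toy list. `impute_mean` is a hypothetical helper, not part of the Spark API, and Spark's distributed implementation works differently. The fill value is computed from the five toy values only, so it does not match the column-wide mean shown above.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries.

    Hypothetical helper for illustration only; not a Spark function.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Toy column with two missing entries
column = [160, None, 108, None, 360]
filled = impute_mean(column)
```

Observed values pass through unchanged; only the `None` entries receive the computed mean.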



### Filling with the median

To use a different strategy, we call the `setStrategy` method on the imputer object before fitting it:

In [6]:
imputer = Imputer(
    inputCols=["Cilindradas", "Peso"],
    outputCols=["cil_filled_median", "peso_filled_median"]
)
cars = imputer.setStrategy('median').fit(cars).transform(cars)

In [7]:
cars.select("Cilindradas", "cil_filled_mean",  "cil_filled_median", "Peso", "peso_filled_mean", "peso_filled_median").show(5)

+-----------+---------------+-----------------+----+----------------+------------------+
|Cilindradas|cil_filled_mean|cil_filled_median|Peso|peso_filled_mean|peso_filled_median|
+-----------+---------------+-----------------+----+----------------+------------------+
|        160|            160|              160| 262|             262|               262|
|       null|            848|              440|2875|            2875|              2875|
|        108|            108|              108| 232|             232|               232|
|       null|            848|              440|3215|            3215|              3215|
|        360|            360|              360|null|            1318|               373|
+-----------+---------------+-----------------+----+----------------+------------------+
only showing top 5 rows
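
The two strategies can produce very different fill values when a column contains a few large values, because the median is less sensitive to extremes than the mean. Here is a minimal sketch using Python's `statistics` module on made-up numbers. Note that Spark computes an approximate median, so its result may differ slightly from the exact median used here.

```python
from statistics import mean, median

# Made-up column with one large value and one missing entry (None)
weights = [2, 3, 4, 100, None]
observed = [w for w in weights if w is not None]

mean_fill = mean(observed)      # pulled upward by the single large value
median_fill = median(observed)  # stays close to the typical values

filled_mean = [mean_fill if w is None else w for w in weights]
filled_median = [median_fill if w is None else w for w in weights]
```

With these values the mean fill is 27.25 while the median fill is 3.5, which is why the choice of strategy matters for skewed columns.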



### Imputing a numeric value

Instead of filling nulls, we may need to replace a specific numeric value, such as a zero that stands in for a missing measurement.

To do so, we just have to declare which value should be treated as missing by using the `setMissingValue` method:

In [8]:
imputer = Imputer(
    inputCols=["Cilindros"],
    outputCols=["cilindros_zero_median"]
)
cars = imputer.setStrategy('median').setMissingValue(0).fit(cars).transform(cars)

In [10]:
cars.select("Cilindros", "cilindros_zero_median").show(10)

+---------+---------------------+
|Cilindros|cilindros_zero_median|
+---------+---------------------+
|        6|                    6|
|        6|                    6|
|        0|                    6|
|        0|                    6|
|        0|                    6|
|        6|                    6|
|        8|                    8|
|        4|                    4|
|        4|                    4|
|        0|                    6|
+---------+---------------------+
only showing top 10 rows
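
The idea behind a custom missing value can be sketched in plain Python: entries equal to the sentinel are excluded when the statistic is computed and are then overwritten with it. `impute_sentinel` is a hypothetical helper, not a Spark function. It also treats `None` as missing, on the assumption (based on the Spark documentation) that `Imputer` imputes nulls in addition to the configured value. The toy list reuses the ten values shown above, so the median is computed over those ten rows only.

```python
from statistics import median


def impute_sentinel(values, missing_value=0):
    """Replace None and sentinel entries with the median of the remaining entries.

    Hypothetical helper for illustration only; not a Spark function.
    """
    def is_missing(v):
        return v is None or v == missing_value

    observed = [v for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]


# Toy column where 0 stands in for an unknown cylinder count
cylinders = [6, 6, 0, 0, 0, 6, 8, 4, 4, 0]
filled = impute_sentinel(cylinders)
```

Because the zeros are left out of the median calculation, they do not drag the fill value toward zero.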

