### PySpark: Handling Missing Values
- Dropping columns
- Dropping rows
- Parameters for dropping
- Imputing null values with the mean, median, and mode

#### CREATE THE SESSION

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()

In [2]:
spark

#### LOAD THE DATAFRAME

In [29]:
df_pyspark = spark.read.csv('data/test2.csv', header=True, inferSchema=True)
df_pyspark.show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+



The na.drop() function takes several parameters, among them "how", which accepts "any" or "all".
- how = 'any' drops a row if at least one of its values is null (this is the default)
- how = 'all' drops a row only if all of its values are null

Another parameter is "thresh". For example, thresh = 2 keeps only the rows that have at least 2 NON-NULL values. When thresh is given, it takes precedence over how.
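To make these rules concrete, here is a minimal plain-Python sketch (not Spark code) that counts the non-null values in each row of the sample data and applies the same keep/drop decision. The `ROWS` list and the `keep_row` helper are hypothetical names introduced for illustration. The sketch assumes, in line with the PySpark documentation for `dropna`, that `thresh` overrides `how` when both are passed.

```python
# Sample rows from test2.csv as (Name, age, Experience, Salary); None marks a null.
ROWS = [
    ("Krish", 31, 10, 30000),
    ("Sudhanshu", 30, 8, 25000),
    ("Sunny", 29, 4, 20000),
    ("Paul", 24, 3, 20000),
    ("Harsha", 21, 1, 15000),
    ("Shubham", 23, 2, 18000),
    ("Mahesh", None, None, 40000),
    (None, 34, 10, 38000),
    (None, 36, None, None),
]


def keep_row(row, how="any", thresh=None):
    """Return True if the row survives the drop (hypothetical helper)."""
    non_null = sum(value is not None for value in row)
    if thresh is not None:
        # thresh takes precedence over how: keep rows with >= thresh non-nulls.
        return non_null >= thresh
    if how == "any":
        # Drop when at least one value is null, so keep only complete rows.
        return non_null == len(row)
    if how == "all":
        # Drop only when every value is null.
        return non_null > 0
    raise ValueError(f"unknown how: {how!r}")


non_null_counts = [sum(v is not None for v in row) for row in ROWS]
kept_any = [row for row in ROWS if keep_row(row, how="any")]
kept_all = [row for row in ROWS if keep_row(row, how="all")]
kept_thresh3 = [row for row in ROWS if keep_row(row, how="any", thresh=3)]

print(non_null_counts)
print(len(kept_any), len(kept_all), len(kept_thresh3))
```

With thresh = 3, the Mahesh row (2 non-null values) and the last row (1 non-null value) are dropped, while the row that is only missing its Name (3 non-null values) is kept.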

In [22]:
df_pyspark.na.drop(how = 'any', thresh = 3).show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
|     null| 34|        10| 38000|
+---------+---+----------+------+



In [25]:
df_pyspark.na.drop(thresh = 1).show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+



The last parameter is "subset". It restricts the null check to the listed columns, so only the rows that have null values in those columns are dropped.

In [47]:
df_pyspark.toPandas().isna()

    Name    age  Experience  Salary
0  False  False       False   False
1  False  False       False   False
2  False  False       False   False
3  False  False       False   False
4  False  False       False   False
5  False  False       False   False
6  False   True        True   False
7   True  False       False   False
8   True  False        True    True
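
A null mask like the one above is also a handy way to reason about "subset". The following plain-Python sketch (not Spark code) copies that mask and lists which row indices would survive when the null check is restricted to particular columns. The `NULL_MASK` dictionary and the `surviving_rows` helper are hypothetical names for illustration; in PySpark the corresponding call would be along the lines of `df_pyspark.na.drop(how='any', subset=['Experience'])`.

```python
# Null mask copied from the isna() output: True means the cell is null.
NULL_MASK = {
    "Name":       [False, False, False, False, False, False, False, True, True],
    "age":        [False, False, False, False, False, False, True, False, False],
    "Experience": [False, False, False, False, False, False, True, False, True],
    "Salary":     [False, False, False, False, False, False, False, False, True],
}


def surviving_rows(mask, subset):
    """Indices of rows with no null in any of the `subset` columns.

    Models how='any' restricted to those columns (hypothetical helper).
    """
    n_rows = len(next(iter(mask.values())))
    return [i for i in range(n_rows) if not any(mask[col][i] for col in subset)]


by_experience = surviving_rows(NULL_MASK, ["Experience"])
by_name = surviving_rows(NULL_MASK, ["Name"])
by_age_and_salary = surviving_rows(NULL_MASK, ["age", "Salary"])

print(by_experience)      # rows 6 and 8 are removed
print(by_name)            # rows 7 and 8 are removed
print(by_age_and_salary)  # rows 6 and 8 are removed
```

Nulls outside the chosen columns are ignored: restricting the check to Experience keeps row 7 even though its Name is missing.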


#### FILL MISSING VALUES

In [35]:
df_pyspark.na.fill("-999").na.fill(0.1).show() 

# df_pyspark.fillna() could be used instead.
# A fill value is applied only to columns of a matching type. A string value is written only into
# string columns, and a numeric value only into numeric columns.

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
|   Mahesh|  0|         0| 40000|
|     -999| 34|        10| 38000|
|     -999| 36|         0|     0|
+---------+---+----------+------+
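
The type-matching rule described in the comments above can be modelled in a few lines of plain Python (not Spark code). In this sketch `SCHEMA`, `ROWS`, and `fill_nulls` are hypothetical names. It assumes that a string fill value only touches string columns and a numeric fill value only touches numeric columns, with the value cast to the column type, which would explain why `0.1` shows up as `0` in the integer columns above.

```python
# Column types as inferred from the CSV: one string column, three integer columns.
SCHEMA = {"Name": str, "age": int, "Experience": int, "Salary": int}

# The three rows of the sample data that contain nulls (None).
ROWS = [
    {"Name": "Mahesh", "age": None, "Experience": None, "Salary": 40000},
    {"Name": None, "age": 34, "Experience": 10, "Salary": 38000},
    {"Name": None, "age": 36, "Experience": None, "Salary": None},
]


def fill_nulls(rows, value):
    """Fill nulls only in columns whose type family matches `value`."""
    value_is_text = isinstance(value, str)
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, col_type in SCHEMA.items():
            column_is_text = col_type is str
            if new_row[col] is None and column_is_text == value_is_text:
                # Cast to the column type, e.g. int(0.1) == 0.
                new_row[col] = col_type(value)
        filled.append(new_row)
    return filled


# Chain a string fill and a numeric fill, mirroring na.fill("-999").na.fill(0.1).
result = fill_nulls(fill_nulls(ROWS, "-999"), 0.1)
for row in result:
    print(row)
```

Chaining two fills, one per type family, is what lets a single expression cover both the string column and the numeric columns.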



In [36]:
# Different fill values (of different types) for the selected columns.

df_pyspark.na.fill({'Name':'Missing Value', 'Experience':10}).show()

+-------------+----+----------+------+
|         Name| age|Experience|Salary|
+-------------+----+----------+------+
|        Krish|  31|        10| 30000|
|    Sudhanshu|  30|         8| 25000|
|        Sunny|  29|         4| 20000|
|         Paul|  24|         3| 20000|
|       Harsha|  21|         1| 15000|
|      Shubham|  23|         2| 18000|
|       Mahesh|null|        10| 40000|
|Missing Value|  34|        10| 38000|
|Missing Value|  36|        10|  null|
+-------------+----+----------+------+



#### IMPUTE VALUES

We can replace missing values with the column mean, median, or mode as follows:

In [43]:
from pyspark.ml.feature import Imputer

Init signature: Imputer(*args, **kwds)
Source:
class Imputer(
    JavaEstimator["ImputerModel"], _ImputerParams, JavaMLReadable["Imputer"], JavaMLWritable
):
    """
    Imputation estimator for completing missing values, using the mean, median or mode
    of the columns in which the missing values are located. The input columns should be of
    numeric type. Currently Imputer does not support categorical features and
    possibly creates incorrect values for a categorical feature.

    Note that the mean/median/mode value is computed after filtering out missing values.
    All Null values in the input columns are treated as missing, 

In [38]:
imputer_mean = Imputer(inputCols=['age', 'Experience', 'Salary'],
                    outputCols = [f'{c}_imputed' for c in ['age', 'Experience', 'Salary']]
                    ).setStrategy('mean')

In [40]:
imputer_fit = imputer_mean.fit(df_pyspark)

In [41]:
imputer_fit

ImputerModel: uid=Imputer_f2c9c5a7b697, strategy=mean, missingValue=NaN, numInputCols=3, numOutputCols=3
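
As a sanity check on what the fitted model holds, the column means can be reproduced by hand. The plain-Python sketch below (not Spark code) averages the non-null values of each input column; the `COLUMNS` dictionary and the `column_mean` helper are hypothetical names for illustration. The final `int()` cast is an assumption that models the imputed value being cast back to the integer column type, which is consistent with the whole numbers that appear in the transformed output of this notebook.

```python
# Values per input column, copied from the sample DataFrame; None marks a null.
COLUMNS = {
    "age": [31, 30, 29, 24, 21, 23, None, 34, 36],
    "Experience": [10, 8, 4, 3, 1, 2, None, 10, None],
    "Salary": [30000, 25000, 20000, 20000, 15000, 18000, 40000, 38000, None],
}


def column_mean(values):
    """Mean over the non-null entries only (nulls are filtered out first)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


means = {name: column_mean(values) for name, values in COLUMNS.items()}
# Assumption: the fill value is truncated to fit the integer columns.
imputed = {name: int(mean) for name, mean in means.items()}

print(means)
print(imputed)
```

Here the age mean is 28.5 and the Experience mean is about 5.43, so the imputed cells read 28 and 5; the Salary mean is exactly 25750.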

In [42]:
imputer_fit.transform(df_pyspark).show()

+---------+----+----------+------+-----------+------------------+--------------+
|     Name| age|Experience|Salary|age_imputed|Experience_imputed|Salary_imputed|
+---------+----+----------+------+-----------+------------------+--------------+
|    Krish|  31|        10| 30000|         31|                10|         30000|
|Sudhanshu|  30|         8| 25000|         30|                 8|         25000|
|    Sunny|  29|         4| 20000|         29|                 4|         20000|
|     Paul|  24|         3| 20000|         24|                 3|         20000|
|   Harsha|  21|         1| 15000|         21|                 1|         15000|
|  Shubham|  23|         2| 18000|         23|                 2|         18000|
|   Mahesh|null|      null| 40000|         28|                 5|         40000|
|     null|  34|        10| 38000|         34|                10|         38000|
|     null|  36|      null|  null|         36|                 5|         25750|
+---------+----+----------+-