First, we connect Google Colab to Google Drive so that we can work with the data files.

In [79]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Once Google Colab is up and the drive is mounted, we create variables holding the paths to our working directory (input data and output).

In [80]:
input_path = '/content/drive/MyDrive/APBD/trabajo/data/{}'
output_path= '/content/drive/MyDrive/APBD/trabajo/data/output/{}'
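
Both paths end in a `{}` placeholder, so a file name can be filled in later with `str.format`. A minimal sketch of how the template expands, reusing the notebook's own `input_path` and file name:

```python
# Path template mirroring the notebook's input_path; "{}" is replaced by the file name.
input_path = '/content/drive/MyDrive/APBD/trabajo/data/{}'

train_path = input_path.format('train.csv')
print(train_path)
```

The same template can be reused for every file in the directory, which keeps the directory name in a single place.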

Finally, we initialize the connection to the Spark execution engine.

In [81]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("trabajo") \
    .getOrCreate()
spark

In [82]:
spark.sparkContext.defaultParallelism

2

In [83]:
spark

We use the next cell to import everything needed throughout the project:

In [84]:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor, DecisionTreeRegressor, FMRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

Now we load the training and test data for the problem.

In [85]:
train_df = spark.read.csv(path=input_path.format('train.csv'), header=True, inferSchema=True)
test_df = spark.read.csv(path=input_path.format('test.csv'), header=True,inferSchema=True)

Let us look at the structure of these datasets.

In [86]:
train_df.show(3)
test_df.show(3)

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition

Note that, because the datasets are loaded with schema inference, Spark itself decides the type of each variable.

In [87]:
train_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: integer (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: string (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: integer (nullable = true)
 |-- OverallCond: integer (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |--

In [88]:
test_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: integer (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: string (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: integer (nullable = true)
 |-- OverallCond: integer (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |--

However, Spark wrongly reports some variables as String. This is the case for MasVnrArea, GarageYrBlt and LotFrontage. Let us fix that:

In [89]:
train_df = train_df \
    .withColumn('MasVnrArea', train_df.MasVnrArea.cast(IntegerType())) \
    .withColumn('GarageYrBlt', train_df.GarageYrBlt.cast(IntegerType())) \
    .withColumn('LotFrontage', train_df.LotFrontage.cast(IntegerType()))

test_df = test_df \
    .withColumn('MasVnrArea', test_df.MasVnrArea.cast(IntegerType())) \
    .withColumn('GarageYrBlt', test_df.GarageYrBlt.cast(IntegerType())) \
    .withColumn('LotFrontage', test_df.LotFrontage.cast(IntegerType()))
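
The cast above turns every value that cannot be parsed as an integer into a null, which is presumably what happens to the NA markers in these three columns. The sketch below imitates that behaviour in plain Python, not Spark. The helper `cast_to_int` and the sample values are hypothetical, and it assumes Spark's lenient (non-ANSI) cast semantics; with ANSI mode enabled, an invalid cast raises an error instead.

```python
from typing import Optional

def cast_to_int(raw: Optional[str]) -> Optional[int]:
    """Hypothetical helper mimicking a lenient string-to-integer cast."""
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        # Unparseable strings such as "NA" become a missing value.
        return None

# Illustrative values only, not taken from the dataset.
lot_frontage_raw = ["65", "80", "NA", "68"]
lot_frontage = [cast_to_int(v) for v in lot_frontage_raw]
print(lot_frontage)
```

A single non-numeric token is enough for schema inference to fall back to a string column, so the cast is what exposes these entries as nulls.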

In [90]:
train_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: integer (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: integer (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: integer (nullable = true)
 |-- OverallCond: integer (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |-

In [91]:
test_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: integer (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: integer (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: integer (nullable = true)
 |-- OverallCond: integer (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |-

Now the types appear to be correct.

## MISSING VALUES

We begin processing the dataset by studying its missing values.

In [92]:
def count_missings_aux(c):
    return F.sum(F.col(c).isNull().cast(IntegerType())).alias(c)

def count_missings(df):
    exprs = [count_missings_aux(c) for c in df.columns]
    df.agg(*exprs).show()

count_missings(train_df)
count_missings(test_df)

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition

As we can see, missing values show up in the variables LotFrontage and GarageYrBlt. However, these were only detected because those variables are of Integer type. Missing values in the dataset are encoded as the string NA, and String variables accept that string as an ordinary value even though it actually denotes a missing value.

Even so, we must be careful, because some variables in the dataset do use NA as a legitimate, non-missing value. For those variables we therefore assume that NA is not a missing value, and in all other variables we replace NA with NULL to avoid confusion.

In [93]:
# cols_NA lists the variables for which "NA" is a legitimate, non-missing value.
cols_NA = ["Alley","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PoolQC","Fence","MiscFeature"]

# A single replace per DataFrame is enough: non-string columns in `subset` are ignored.
train_df = train_df.replace("NA", None, subset=[c for c in train_df.columns if c not in cols_NA])
test_df = test_df.replace("NA", None, subset=[c for c in test_df.columns if c not in cols_NA])

In [94]:
train_df.show(10)

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition

In [95]:
test_df.show(10)

+----+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+
|  Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition2|BldgTy

In [96]:
count_missings(train_df)
count_missings(test_df)

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition

After this step, missing values can be identified in other variables, such as MasVnrType and MasVnrArea, among others.

### Data imputation

We decided not to merge the datasets, so that the test set is never used to impute values. This avoids biasing the results later on.

The variables with the most missing values are LotFrontage and GarageYrBlt. Initially we decided to impute both in order to lose as few observations as possible. During the process, however, we found that imputing LotFrontage with the mean or the mode is not representative of the data, which hinders the learning of the predictive model. We therefore decided to impute only GarageYrBlt, since that imputation requires no assumptions.

The test set has more variables with missing values, but imputing them would be similar to imputing LotFrontage and could yield unrepresentative values. For this reason, and because these variables have few missing values, we decided not to impute them.

In [97]:
no_miss_Lot = train_df.filter(train_df.LotFrontage.isNotNull())
mean_Lot_df = no_miss_Lot.select(F.avg(no_miss_Lot.LotFrontage).alias('mean_LotFrontage'))
mean_Lot_df.show()
mean_Lot = mean_Lot_df.first()['mean_LotFrontage']

+-----------------+
| mean_LotFrontage|
+-----------------+
|70.04995836802665|
+-----------------+



The variable GarageYrBlt holds the year in which the garage was built. For houses without a garage the value is missing, because no meaningful year exists. We therefore assign 0 to every house without a garage: together with GarageType, which indicates whether there is a garage and of which type, the model can interpret that value and learn from it.

In [98]:
imputed_train_df = train_df.fillna({'GarageYrBlt': 0})
count_missings(imputed_train_df)
imputed_test_df = test_df.fillna({'GarageYrBlt': 0})
count_missings(imputed_test_df)

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition

In [99]:
train_df = imputed_train_df
test_df = imputed_test_df

## CONVERSION OF CATEGORICAL VARIABLES

To train the model, we must convert our categorical variables into numerical ones.

We treat every String variable as categorical. In addition, the dataset contains some ordinal variables that use integer values to express a quality, so these variables must go through the same conversion.

We call the list of integer-valued ordinal variables lista_var_ord.

In [101]:
# We can drop the column that identifies the observations:
train_df = train_df.drop("Id")
test_df = test_df.drop("Id")

lista_var_ord = ["MSSubClass","OverallQual","OverallCond","MoSold"]
lista_var_string = []
for c in train_df.columns:
  if isinstance(train_df.schema[c].dataType, StringType) or c in lista_var_ord:
    lista_var_string.append(c)

list_var_nostring = []
for c in train_df.columns[:-1]:
  if c not in lista_var_string:
    list_var_nostring.append(c)

list_var_nostring

['LotFrontage',
 'LotArea',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'YrSold']

We have now identified the categorical and numerical variables of the dataset. Next, we prepare the StringIndexer transformer to convert the categorical variables.

Let us define a list with the names of the transformer's output columns.

In [102]:
lista_sal_string = []
for e in lista_var_string:
  lista_sal_string.append(e +"_n")

We use a loop to apply the transformer to all categorical variables more conveniently. We also collect the resulting transformers in a list so that the procedure can be automated later with a pipeline. Both the indexers and the assembler use handleInvalid="skip", so rows containing null or previously unseen values are dropped instead of raising an error.

In [103]:
lista_indexer = []
lista_cols = []
for c in lista_var_string:
  indexer = StringIndexer(inputCol= c, outputCol= c + "_n", handleInvalid="skip")
  lista_indexer.append(indexer)
  lista_cols.append(c + "_n")


selected_features = lista_cols + list_var_nostring
assembler = VectorAssembler(inputCols=selected_features, outputCol="features",handleInvalid="skip")
lista_indexer.append(assembler)
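
To make the conversion more concrete, the sketch below imitates in plain Python what a frequency-based string indexer does: the most frequent label gets index 0, the next one index 1, and so on. This is an illustration under assumptions, not Spark's implementation. The function names and sample labels are hypothetical, and ties are broken alphabetically here for determinism.

```python
from collections import Counter

def fit_frequency_index(values):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(v for v in values if v is not None)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_skip(values, mapping):
    """Encode labels, dropping nulls and labels not seen during fitting."""
    return [mapping[v] for v in values if v in mapping]

# Illustrative labels only.
train_zoning = ["RL", "RM", "RL", "FV", "RL", "RM"]
mapping = fit_frequency_index(train_zoning)
print(mapping)

# None and the unseen label "C (all)" are skipped.
encoded = transform_skip(["RM", "RL", None, "C (all)"], mapping)
print(encoded)
```

In Spark, skipping removes the whole row from the output, so a dataset with many nulls can shrink noticeably after the transformation.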


Let us use a Pipeline.

In [104]:
pipeline = Pipeline(stages=lista_indexer)
pipeline

Pipeline_b18cdb994074

Once the Pipeline is created, we can fit it on the dataset we want to transform, in this case the training dataset.

In [105]:
preprocessing_pl = pipeline.fit(train_df)
preprocessing_pl

PipelineModel_f592dae4849c

Once fitted, we can apply the transformation to our dataset.

In [106]:
preprocessing_pl.transform(train_df).show(10)

+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+------------+----------+--------+-------+----------+-------------+-----------+-----------+-----------+--------------+------------+------------+----

Now we keep only the columns features and SalePrice, which are the only ones needed to build the regression models.

In [107]:
final_train_df = preprocessing_pl.transform(train_df)
final_train_df = final_train_df.select(final_train_df.features, final_train_df.SalePrice)
final_train_df.show(10)

+--------------------+---------+
|            features|SalePrice|
+--------------------+---------+
|(79,[0,9,13,14,20...|   208500|
|(79,[7,9,10,14,15...|   181500|
|(79,[0,4,9,13,14,...|   223500|
|(79,[0,4,7,9,13,1...|   140000|
|(79,[0,4,7,9,13,1...|   250000|
|(79,[0,4,9,13,23,...|   143000|
|(79,[9,14,20,21,2...|   307000|
|(79,[0,1,9,10,13,...|   129900|
|(79,[0,7,9,10,11,...|   118000|
|(79,[9,16,18,19,2...|   129500|
+--------------------+---------+
only showing top 10 rows
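
The features column is printed in sparse notation, `(size,[indices],[values])`, where only the non-zero entries are stored. A small plain-Python sketch of how such a triple expands into a dense vector; the helper `to_dense` and the numbers are hypothetical and much shorter than the 79 features shown above:

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Illustrative 6-dimensional vector with three non-zero entries.
dense = to_dense(6, [0, 3, 5], [8.0, 2.0, 110.0])
print(dense)
```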



To define the prediction model, we work only with the training set. We therefore split it into 80% train and 20% test, using seed 0.

In [108]:
train, test = final_train_df.randomSplit([0.8, 0.2], seed=0)
print(train.first())
print(test.first())

Row(features=SparseVector(79, {0: 8.0, 1: 1.0, 2: 1.0, 4: 2.0, 5: 1.0, 7: 1.0, 8: 1.0, 9: 15.0, 10: 7.0, 12: 2.0, 18: 3.0, 19: 3.0, 21: 3.0, 23: 1.0, 24: 1.0, 26: 2.0, 27: 5.0, 28: 6.0, 30: 1.0, 31: 1.0, 36: 5.0, 44: 2.0, 47: 110.0, 48: 8472.0, 49: 1963.0, 50: 1963.0, 52: 104.0, 53: 712.0, 55: 816.0, 56: 816.0, 59: 816.0, 60: 1.0, 62: 1.0, 64: 2.0, 65: 1.0, 66: 5.0, 68: 1963.0, 69: 2.0, 70: 516.0, 71: 106.0, 78: 2010.0}), SalePrice=110000)
Row(features=SparseVector(79, {0: 5.0, 1: 2.0, 3: 2.0, 7: 1.0, 9: 4.0, 12: 1.0, 13: 1.0, 14: 1.0, 20: 2.0, 21: 1.0, 24: 1.0, 26: 1.0, 33: 1.0, 36: 1.0, 37: 1.0, 45: 1.0, 46: 1.0, 47: 40.0, 48: 3951.0, 49: 2009.0, 50: 2009.0, 51: 76.0, 54: 612.0, 55: 612.0, 56: 612.0, 57: 612.0, 59: 1224.0, 62: 2.0, 63: 1.0, 64: 2.0, 65: 1.0, 66: 4.0, 68: 2009.0, 69: 2.0, 70: 528.0, 72: 234.0, 78: 2009.0}), SalePrice=164500)
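
A weighted random split of this kind assigns each row independently, so the 80/20 proportion is approximate rather than exact, while the seed makes the result reproducible. The sketch below shows the idea in plain Python. It is a simplified illustration, not Spark's actual sampling code, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split at random, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total          # cumulative, normalized weights
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for k, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if u < bound or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(1000)), [0.8, 0.2], seed=0)
print(len(train_rows), len(test_rows))
```

Calling the function again with the same seed returns exactly the same partition, which is what makes the experiments below repeatable.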


From now on, training set and test set refer to these two new sets, both derived from the original training set.

To define the model, we run a cross-validation to tune the hyperparameters. Cross-validation gives us the best hyperparameter values, with which we build the final model; that model is then evaluated on the test set (which, recall, comes from the training dataset).
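
The procedure can be sketched in plain Python as follows. For each candidate hyperparameter value, the data is split into folds, a model is fitted on all folds but one and scored on the remaining fold, and the scores are averaged; the value with the lowest average RMSE wins. The "model" here is a deliberately trivial, hypothetical one (a shrunken mean) so that the example stays self-contained. The prices are the ten SalePrice values displayed earlier.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_shrunken_mean(y_train, shrinkage):
    """Toy model: predicts the training mean multiplied by (1 - shrinkage)."""
    return (1.0 - shrinkage) * sum(y_train) / len(y_train)

def cross_validate(y, grid, num_folds, seed):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::num_folds] for k in range(num_folds)]
    avg_scores = {}
    for shrinkage in grid:
        scores = []
        for k in range(num_folds):
            held_out = set(folds[k])
            y_train = [y[i] for i in idx if i not in held_out]
            y_valid = [y[i] for i in folds[k]]
            pred = fit_shrunken_mean(y_train, shrinkage)
            scores.append(rmse(y_valid, [pred] * len(y_valid)))
        avg_scores[shrinkage] = sum(scores) / num_folds
    best = min(avg_scores, key=avg_scores.get)   # lower RMSE is better
    return best, avg_scores

prices = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 129900, 118000, 129500]
best, scores = cross_validate(prices, grid=[0.0, 0.2, 0.5], num_folds=5, seed=0)
print(best, scores)
```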

## CROSS-VALIDATION

Let us build several models to compare their performance and identify the best one for our problem.

### RandomForest

In [109]:
rf = RandomForestRegressor(labelCol="SalePrice", featuresCol="features", seed=0)
print(rf.explainParams())

bootstrap: Whether bootstrap samples are used when building trees. (default: True)
cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the featur

After reviewing the parameters we can modify, we chose to search over the following values:

In [110]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [4, 5, 7]) \
    .addGrid(rf.numTrees, [10, 25, 50]) \
    .build()

crossval_rf = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          #evaluator = RmsleEvaluator(targetCol="SalePrice", predictionCol="prediction"),
                          evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction",metricName = 'rmse'),
                          numFolds=5,
                          seed=0)
crossval_rf

CrossValidator_babb0ea2bffd
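
The grid is the Cartesian product of the values given for each hyperparameter, so its cost grows quickly. A short plain-Python sketch that counts the combinations for the Random Forest grid above; the dictionary `grid` simply restates the values passed to `addGrid`:

```python
from itertools import product

grid = {"maxDepth": [4, 5, 7], "numTrees": [10, 25, 50]}
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]
num_folds = 5

print(len(param_maps))              # number of hyperparameter combinations
print(len(param_maps) * num_folds)  # model fits performed across all folds
print(param_maps[0], param_maps[-1])
```

With 9 combinations and 5 folds this amounts to 45 fits, plus, as far as we know, one final refit of the best combination on the whole training set.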

Now we run the cross-validation to obtain the best hyperparameters for the model.

In [111]:
rf_cv_model = crossval_rf.fit(train)

Let us see which hyperparameter values achieved the best results:

In [112]:
par_map = rf_cv_model.getEstimatorParamMaps()
lpars = [{par.name: value for par, value in par_comb.items()} for par_comb in par_map]
pars_df = pd.DataFrame(lpars)
pars_df['score'] = rf_cv_model.avgMetrics
pars_df.sort_values(by='score')

   maxDepth  numTrees         score
8         7        50  30642.413367
7         7        25  31255.619277
5         5        50  32338.731468
4         5        25   32392.89226
6         7        10  32431.323698
3         5        10  34473.380608
2         4        50  34668.715287
1         4        25  34794.172255
0         4        10  36629.404007


Next, we take the best model so that we can evaluate it on the test set.

In [113]:
rf_model = rf_cv_model.bestModel
rf_model

RandomForestRegressionModel: uid=RandomForestRegressor_bff1c02778b8, numTrees=50, numFeatures=79

Now let us evaluate the resulting model on the test set.

In [114]:
prediction_rf_cv = rf_model.transform(test)
#evaluator = RmsleEvaluator(targetCol="SalePrice", predictionCol="prediction")
evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction",metricName = 'rmse')
rmse_fr_cv = evaluator.evaluate(prediction_rf_cv)
rmse_fr_cv


26360.445569368054
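
The metric used here is the root mean squared error (RMSE), expressed in the same units as SalePrice. The commented-out RmsleEvaluator suggests that a logarithmic variant was also considered. Its implementation is not shown in this section, so the `rmsle` function below is only an assumption of what such a metric usually computes. Both are sketched in plain Python with made-up prices.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumed definition, using log(1 + x))."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Illustrative values: both predictions are off by 10000.
actual = [100000.0, 500000.0]
predicted = [110000.0, 510000.0]

print(rmse(actual, predicted))
print(rmsle(actual, predicted))
```

RMSE weighs both errors equally, whereas the logarithmic version penalizes the same absolute error more on the cheaper house, because it measures relative rather than absolute deviation.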

We store the error so that we can compare the models later.

In [115]:
err = {}
err["rf"] = rmse_fr_cv
err

{'rf': 26360.445569368054}

### GBT Regressor

Again, we define the regressor and choose the hyperparameter values to explore during cross-validation:

In [116]:
gbt = GBTRegressor(labelCol="SalePrice", featuresCol="features",seed=0)
print(gbt.explainParams())
paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [3, 5]) \
    .addGrid(gbt.stepSize, [0.05, 0.01]) \
    .addGrid(gbt.maxIter, [50, 100]) \
    .build()

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 

We define the cross-validation and fit it on the training set.

In [117]:
crossval_gbt = CrossValidator(estimator=gbt,
                          estimatorParamMaps=paramGrid,
                          #evaluator=RmsleEvaluator(targetCol="label", predictionCol="prediction"),
                          evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction",metricName = 'rmse'),
                          numFolds=5,
                          seed=0)
crossval_gbt

CrossValidator_6494731c5ffe

In [118]:
gbt_cv_model = crossval_gbt.fit(train)

In [119]:
par_map = gbt_cv_model.getEstimatorParamMaps()
lpars = [{par.name: value for par, value in par_comb.items()} for par_comb in par_map]
pars_df = pd.DataFrame(lpars)
pars_df['score'] = gbt_cv_model.avgMetrics
pars_df.sort_values(by='score')

   maxDepth  stepSize  maxIter         score
1         3      0.05      100   40522.31357
0         3      0.05       50  41761.662597
3         3      0.01      100  44960.322952
2         3      0.01       50  47382.991564
4         5      0.05       50  47605.817003
5         5      0.05      100  47712.663287
7         5      0.01      100   47930.19783
6         5      0.01       50  48714.228182


Now we can take the best model and evaluate it on the test set.

In [120]:
gbt_model = gbt_cv_model.bestModel
gbt_model

GBTRegressionModel: uid=GBTRegressor_27cb98d534c3, numTrees=100, numFeatures=79

In [121]:
prediction_gbt_cv = gbt_model.transform(test)
evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction",metricName = 'rmse')
rmse_gbt_cv = evaluator.evaluate(prediction_gbt_cv)
rmse_gbt_cv

43981.54299205587

In [122]:
err["gbt"] = rmse_gbt_cv
err

{'rf': 26360.445569368054, 'gbt': 43981.54299205587}

### DecisionTreeRegressor

In [123]:
dtr = DecisionTreeRegressor(labelCol="SalePrice", featuresCol="features",seed=0)
print(dtr.explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featuresCol: features column name. (default: features, current: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: variance (default: variance)
labelCol: label column name. (default: label, current: SalePrice)
leafCol: Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default: )
maxBins: Max number of bins for discr

This time, the hyperparameter values considered in the cross-validation are the following:

In [124]:
paramGrid = ParamGridBuilder() \
    .addGrid(dtr.maxDepth, [4, 5, 7]) \
    .addGrid(dtr.maxBins, [32, 64, 128]) \
    .build()

In [125]:
crossval_dtr = CrossValidator(estimator=dtr,
                          estimatorParamMaps=paramGrid,
                          #evaluator=RmsleEvaluator(targetCol="SalePrice", predictionCol="prediction"),
                          evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction",metricName = 'rmse'),
                          numFolds=5,
                          seed=0)
crossval_dtr

CrossValidator_57a0bba0ad79

In [126]:
dtr_cv_model = crossval_dtr.fit(train)

In [127]:
par_map = dtr_cv_model.getEstimatorParamMaps()
lpars = [{par.name: value for par, value in par_comb.items()} for par_comb in par_map]
pars_df = pd.DataFrame(lpars)
pars_df['score'] = dtr_cv_model.avgMetrics
pars_df.sort_values(by='score')

   maxDepth  maxBins         score
7         7       64  50608.471954
6         7       32  50713.969033
8         7      128  50774.449823
3         5       32  50804.843804
0         4       32  50868.798639
1         4       64  51160.486397
4         5       64  51387.501512
5         5      128  51844.497471
2         4      128  51978.905047
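Since the metric is RMSE, lower is better, and the winning combination is simply the one with the smallest average score. The sketch below reproduces that selection in plain Python with the scores copied from the table above. The dictionary `cv_scores` and the variable names are illustrative, not part of the notebook's pipeline.

```python
# (maxDepth, maxBins) -> mean cross-validated RMSE, copied from the table above.
cv_scores = {
    (7, 64): 50608.471954,
    (7, 32): 50713.969033,
    (7, 128): 50774.449823,
    (5, 32): 50804.843804,
    (4, 32): 50868.798639,
    (4, 64): 51160.486397,
    (5, 64): 51387.501512,
    (5, 128): 51844.497471,
    (4, 128): 51978.905047,
}

# RMSE is an error metric, so the best combination minimises it.
best_combo = min(cv_scores, key=cv_scores.get)

# Gap between the worst and best combination, absolute and relative.
spread = max(cv_scores.values()) - min(cv_scores.values())
relative_spread = spread / min(cv_scores.values())

print(best_combo, round(spread, 2), f"{relative_spread:.1%}")
```

The nine scores differ by less than 3%, which suggests that, within this grid, the decision tree is not very sensitive to `maxDepth` and `maxBins`.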


In [128]:
dtr_model = dtr_cv_model.bestModel
dtr_model

DecisionTreeRegressionModel: uid=DecisionTreeRegressor_291a0ae1eb02, depth=7, numNodes=211, numFeatures=79

The error obtained by this model on the test set is the following:

In [129]:
prediction_dtr_cv = dtr_model.transform(test)
#evaluator = RmsleEvaluator(targetCol="SalePrice", predictionCol="prediction")
evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction",metricName = 'rmse')
rmse_dtr_cv = evaluator.evaluate(prediction_dtr_cv)
rmse_dtr_cv

54961.56315340533
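For reference, the `rmse` metric reported here is the square root of the mean squared difference between labels and predictions. A minimal pure-Python sketch of that computation follows. The function name `rmse` and the toy values are hypothetical and are not taken from the dataset.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy prices, not taken from the dataset.
toy_labels = [200000.0, 150000.0]
toy_predictions = [210000.0, 150000.0]
toy_rmse = rmse(toy_labels, toy_predictions)
print(toy_rmse)
```

Because the errors are squared before averaging, the result is expressed in the same units as `SalePrice` and is dominated by the largest misses.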

In [130]:
err["dtr"] = rmse_dtr_cv
err

{'rf': 26360.445569368054, 'gbt': 43981.54299205587, 'dtr': 54961.56315340533}

### FMRegressor

In [131]:
fm = FMRegressor(labelCol="SalePrice", featuresCol="features",seed=0)
print(fm.explainParams())

factorSize: Dimensionality of the factor vectors, which are used to get pairwise interactions between variables (default: 8)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
fitLinear: whether to fit linear term (aka 1-way term) (default: True)
initStd: standard deviation of initial coefficients (default: 0.01)
labelCol: label column name. (default: label, current: SalePrice)
maxIter: max number of iterations (>= 0). (default: 100)
miniBatchFraction: fraction of the input data set that should be used for one iteration of gradient descent (default: 1.0)
predictionCol: prediction column name. (default: prediction)
regParam: regularization parameter (>= 0). (default: 0.0)
seed: random seed. (default: -412908578589224933, current: 0)
solver: The solver algorithm for optimization. Supported options: gd, adamW. (Default adamW) (default: adamW)
stepSize: Step size to be used for each iteration of optimiza

For the FMRegressor, we decided to consider the following hyperparameter values:

In [132]:
paramGrid = ParamGridBuilder() \
    .addGrid(fm.maxIter, [10,20,50]) \
    .addGrid(fm.regParam, [0.001, 0.01, 0.1]) \
    .addGrid(fm.stepSize,[0.001, 0.01]) \
    .build()

In [133]:
crossval_fm = CrossValidator(
    estimator=fm,
    estimatorParamMaps=paramGrid,
    # evaluator=RmsleEvaluator(targetCol="SalePrice", predictionCol="prediction"),
    evaluator=RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction", metricName="rmse"),
    numFolds=5,
    seed=0,
)
crossval_fm

CrossValidator_47e932bd5570

In [134]:
fm_cv_model = crossval_fm.fit(train)

In [135]:
par_map = fm_cv_model.getEstimatorParamMaps()
lpars = [{par.name: value for par, value in par_comb.items()} for par_comb in par_map]
pars_df = pd.DataFrame(lpars)
pars_df['score'] = fm_cv_model.avgMetrics
pars_df.sort_values(by='score')

    maxIter  regParam  stepSize         score
15       50      0.01      0.01   56280.24265
17       50       0.1      0.01  56327.884482
13       50     0.001      0.01  56497.497955
12       50     0.001     0.001  58856.281663
14       50      0.01     0.001   61881.79667
11       20       0.1      0.01  70059.324254
 9       20      0.01      0.01  78107.842408
 7       20     0.001      0.01  82908.240544
 6       20     0.001     0.001  94280.290543
 8       20      0.01     0.001  94467.828944


In [136]:
fm_model = fm_cv_model.bestModel
fm_model

FMRegressionModel: uid=FMRegressor_fc6991c82ffc, numFeatures=79, factorSize=8, fitLinear=true, fitIntercept=true
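As background for the model summary above, and assuming Spark's `FMRegressor` follows the standard second-order factorization machine formulation (an assumption we have not checked against Spark's source), the prediction for a feature vector $x \in \mathbb{R}^{n}$ with factor size $k$ can be written as:

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j,
\qquad v_i \in \mathbb{R}^{k}
```

Here $w_0$ corresponds to `fitIntercept`, the $w_i$ to `fitLinear`, and $k$ to `factorSize`. Under that formulation, with $n = 79$ features and $k = 8$, the model would have $1 + 79 + 79 \cdot 8 = 712$ trainable parameters.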

Let us check the result obtained by this model on the test set.

In [137]:
prediction_fm_cv = fm_model.transform(test)
#evaluator = RmsleEvaluator(targetCol="label", predictionCol="prediction")
evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction",metricName = 'rmse')
rmse_fm_cv = evaluator.evaluate(prediction_fm_cv)
rmse_fm_cv

41046.22339673065

In [138]:
err["fm"] = rmse_fm_cv
err

{'rf': 26360.445569368054,
 'gbt': 43981.54299205587,
 'dtr': 54961.56315340533,
 'fm': 41046.22339673065}

Let us sort the dictionary from lowest to highest error to identify the best-performing models:

In [139]:
err = dict(sorted(err.items(), key=lambda item: item[1]))
err

{'rf': 26360.445569368054,
 'fm': 41046.22339673065,
 'gbt': 43981.54299205587,
 'dtr': 54961.56315340533}

The model with the lowest test RMSE is RandomForest (about 26,360). FMRegressor (about 41,046) and GBTRegressor (about 43,982) perform similarly to each other, but clearly worse than RandomForest. The worst result comes from DecisionTree (about 54,962), which does not provide a good prediction for our problem.
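To put the gaps in perspective, each error can be expressed relative to the best model. The following is a small sketch using the values from the `err` dictionary above. The variable names `best_name`, `best_rmse` and `relative` are illustrative.

```python
# Test RMSE per model, copied from the err dictionary above.
err = {
    "rf": 26360.445569368054,
    "fm": 41046.22339673065,
    "gbt": 43981.54299205587,
    "dtr": 54961.56315340533,
}

# Model with the lowest RMSE serves as the baseline.
best_name = min(err, key=err.get)
best_rmse = err[best_name]

# Relative increase in RMSE over the baseline, sorted from best to worst.
relative = {
    name: value / best_rmse - 1.0
    for name, value in sorted(err.items(), key=lambda item: item[1])
}

for name, increase in relative.items():
    print(f"{name}: {err[name]:.0f} ({increase:+.1%} vs {best_name})")
```

Reading the errors this way makes it easier to see that the decision tree's RMSE is roughly double that of the random forest, while FMRegressor and GBTRegressor sit in between.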

In [140]:
spark.stop()