# Random Forest Regressor

General Pipeline:

- Importing Data
- RFormula transformation into a feature vector and label, then normalization
- Split into train and test
- Building the model
- Prediction on the test set
- Evaluation

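Hyperparameters:

- bootstrap: boolean; whether each tree is trained on a bootstrap sample (rows drawn with replacement) of the training data (default True)
- maxBins: maximum number of bins used to discretize continuous features (default 32)
- maxDepth: maximum depth of each tree (default 5)
- numTrees: number of trees in the forest (default 20)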
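To make two of these hyperparameters concrete, here is a minimal pure-Python sketch of what bootstrap sampling and bin discretization mean. It is an illustration only, not Spark's implementation: the helper names `bootstrap_sample` and `candidate_thresholds` are hypothetical, and equal-frequency quantiles are used as a simplifying assumption for how bin boundaries are chosen. The input values are the ten target values displayed later in this notebook.

```python
import random
import statistics


def bootstrap_sample(rows, seed):
    """Draw len(rows) items with replacement, as a bootstrap sample does."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


def candidate_thresholds(values, max_bins):
    """Return max_bins - 1 cut points, splitting the values into max_bins bins.

    Assumption: equal-frequency (quantile) bins, chosen here for simplicity.
    """
    return statistics.quantiles(values, n=max_bins)


# Target values shown in the "features / target" table of this notebook.
hp = [110.0, 110.0, 93.0, 110.0, 175.0, 105.0, 245.0, 62.0, 95.0, 123.0]

sample = bootstrap_sample(hp, seed=11)       # some values repeat, some are left out
cuts = candidate_thresholds(hp, max_bins=4)  # 3 thresholds -> 4 bins

print(sample)
print(cuts)
```

A larger maxBins gives the trees finer candidate split points at the cost of more computation, while bootstrap sampling is what makes the individual trees differ from one another.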

## Importing

In [1]:
import findspark

# Locate the Spark installation and add it to sys.path before importing pyspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rfregressor").getOrCreate()

In [15]:
from pyspark.ml.feature import RFormula, Normalizer
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

## Loading Data

In [3]:
cars = spark.read.load(
    "../../data/Carros.csv",
    format="csv",
    sep=";",
    header=True,
    inferSchema=True,
)

cars.show(2)

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
only showing top 2 rows



## Data Preparation

In [4]:
# HP is the label; the three predictors are assembled into a single feature vector
rformula = RFormula(
    formula='HP ~ Consumo + Cilindros + Cilindradas',
    featuresCol="features",
    labelCol="target"
)
cars = rformula.fit(cars).transform(cars)

In [5]:
cars.select("features", "target").show(10)

+------------------+------+
|          features|target|
+------------------+------+
|  [21.0,6.0,160.0]| 110.0|
|  [21.0,6.0,160.0]| 110.0|
| [228.0,4.0,108.0]|  93.0|
| [214.0,6.0,258.0]| 110.0|
| [187.0,8.0,360.0]| 175.0|
| [181.0,6.0,225.0]| 105.0|
| [143.0,8.0,360.0]| 245.0|
|[244.0,4.0,1467.0]|  62.0|
|[228.0,4.0,1408.0]|  95.0|
|[192.0,6.0,1676.0]| 123.0|
+------------------+------+
only showing top 10 rows
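As a rough picture of what the formula does for purely numeric columns, the sketch below splits the formula string by hand and builds the same `(features, target)` pair for the first row of `cars.show(2)`. The function `apply_formula` is a hypothetical helper written for this illustration; the real RFormula supports more than this (for example string columns and other formula operators), which the sketch ignores.

```python
def apply_formula(formula, row):
    """Split 'label ~ a + b + c' and build (features, label) from a row dict.

    Simplification: only numeric columns joined by '+' are handled.
    """
    label_name, terms = formula.split("~")
    feature_names = [term.strip() for term in terms.split("+")]
    features = [float(row[name]) for name in feature_names]
    label = float(row[label_name.strip()])
    return features, label


# First row of the cars.show(2) output (only the columns the formula uses).
row = {"Consumo": 21, "Cilindros": 6, "Cilindradas": 160, "HP": 110}

features, target = apply_formula("HP ~ Consumo + Cilindros + Cilindradas", row)
print(features, target)  # [21.0, 6.0, 160.0] 110.0
```

The result matches the first row of the table above: the left-hand side of `~` becomes the label and the right-hand side becomes the feature vector, in the order written.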



In [8]:
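# p=1.0 selects the L1 norm: each row is rescaled so that the absolute values
# of its features sum to 1. This is a per-row operation, not per-column scaling.
normalizer = Normalizer(
    inputCol="features",
    outputCol="normFeatures",
    p=1.0
)

cars = normalizer.transform(cars)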

In [10]:
cars.select("features", "normFeatures").show(10, truncate=False)

+------------------+--------------------------------------------------------------+
|features          |normFeatures                                                  |
+------------------+--------------------------------------------------------------+
|[21.0,6.0,160.0]  |[0.11229946524064172,0.03208556149732621,0.8556149732620321]  |
|[21.0,6.0,160.0]  |[0.11229946524064172,0.03208556149732621,0.8556149732620321]  |
|[228.0,4.0,108.0] |[0.6705882352941176,0.011764705882352941,0.3176470588235294]  |
|[214.0,6.0,258.0] |[0.4476987447698745,0.012552301255230125,0.5397489539748954]  |
|[187.0,8.0,360.0] |[0.33693693693693694,0.014414414414414415,0.6486486486486487] |
|[181.0,6.0,225.0] |[0.4393203883495146,0.014563106796116505,0.5461165048543689]  |
|[143.0,8.0,360.0] |[0.27984344422700586,0.015655577299412915,0.7045009784735812] |
|[244.0,4.0,1467.0]|[0.1422740524781341,0.0023323615160349854,0.8553935860058309] |
|[228.0,4.0,1408.0]|[0.13902439024390245,0.0024390243902439024,0.85853658536
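The normalized values can be checked by hand. With `p=1.0` each row is divided by the sum of the absolute values of its own entries, so the first row `[21.0, 6.0, 160.0]` is divided by 187. The sketch below is plain Python with a hypothetical helper name, `l1_normalize`, and is meant only to reproduce the arithmetic.

```python
def l1_normalize(vector):
    """Divide each entry by the L1 norm (sum of absolute values) of the vector."""
    total = sum(abs(x) for x in vector)
    if total == 0:
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        return list(vector)
    return [x / total for x in vector]


norm = l1_normalize([21.0, 6.0, 160.0])
print(norm)       # first entry is 21 / 187, about 0.1123, as in the table
print(sum(norm))  # the entries of a normalized row sum to 1 (up to rounding)
```

Because the scaling is applied row by row, two cars with the same proportions between features end up with the same normalized vector, even if their raw values differ in magnitude.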

## Split into Train and Test

In [13]:
carsTrain, carsTest = cars.randomSplit([0.7, 0.3], seed=11)

In [14]:
carsTrain.count(), carsTest.count()

(21, 11)
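The counts are 21 and 11 rather than an exact 70/30 split of the 32 rows because the weights act as per-row probabilities, not as fixed quotas. The sketch below illustrates that idea in plain Python; `random_split` is a hypothetical helper, it is not Spark's sampler, and it will not reproduce the exact 21/11 counts.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by one uniform draw against cumulative weights."""
    rng = random.Random(seed)
    total = sum(weights)

    # Upper bounds of each split on [0, 1), e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    cumulative = 0.0
    for weight in weights:
        cumulative += weight / total
        bounds.append(cumulative)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(list(range(32)), [0.7, 0.3], seed=11)
print(len(train), len(test))
```

With only 32 rows, the realized proportions can drift noticeably from the requested weights; fixing the seed keeps the split reproducible between runs.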

## Model Development and Training

In [16]:
rflr = RandomForestRegressor(
    featuresCol="normFeatures",
    labelCol="target",
    maxDepth=10,
    numTrees=500,
    seed=11
)

model = rflr.fit(carsTrain)

## Predicting on Test Set

In [17]:
predictions = model.transform(carsTest)
predictions.select("target", "prediction").show()

+------+------------------+
|target|        prediction|
+------+------------------+
| 205.0|            202.05|
| 245.0|195.35600000000002|
| 150.0|156.53366666666662|
| 264.0|178.63133333333334|
| 180.0|           162.426|
| 180.0|           162.426|
| 105.0|169.97166666666666|
| 175.0|169.80966666666666|
| 109.0|111.51900000000002|
|  66.0|111.51900000000002|
| 113.0|           117.288|
+------+------------------+



## Model Evaluation

In [10]:
evaluation = RegressionEvaluator(
    predictionCol="prediction",
    labelCol="target",
    metricName="rmse"
)

rmse = evaluation.evaluate(predictions)

print(rmse)

61.0917281580934
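RMSE is the square root of the mean squared difference between target and prediction, so it can be recomputed by hand from the eleven rows shown in the prediction table. The sketch below does that in plain Python. On those rows it gives roughly 39.0, which differs from the value printed above; the evaluation cell carries the counter `In [10]`, lower than the fit in `In [16]`, so the printed number may come from an earlier run of the notebook rather than from the predictions displayed here.

```python
import math

# Values copied from the "target / prediction" table in this notebook.
targets = [205.0, 245.0, 150.0, 264.0, 180.0, 180.0,
           105.0, 175.0, 109.0, 66.0, 113.0]
predictions = [202.05, 195.35600000000002, 156.53366666666662,
               178.63133333333334, 162.426, 162.426,
               169.97166666666666, 169.80966666666666,
               111.51900000000002, 111.51900000000002, 117.288]


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse(targets, predictions))
```

Re-running the evaluation cell after the prediction cell should make the printed metric and the displayed predictions agree.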
