# Kaggle Competition

In this lab session, we are going to compete in a Kaggle competition.

First, we are going to upload the `train` and `test` datasets to Databricks using the following route:

*Data -> Add Data -> Upload File*

**Note:** You have the option to select the location to store the files within DBFS.

Once the files are uploaded, we can use them in our environment.

You will need to replace `/FileStore/tables/train.csv` with the file names and the path(s) where you chose to store them.

**Note 1:** When the upload is complete, you will get a confirmation along with the path and name assigned. Filenames might be slightly modified by Databricks.

**Note 2:** If you missed the path and filename message, you can navigate the DBFS via *Data -> Add Data -> Upload File -> DBFS*, or check the content of the path with `display(dbutils.fs.ls("dbfs:/FileStore/some_path"))`.

#### 1/ Reading input files

In [4]:
display(dbutils.fs.ls("dbfs:/FileStore/tables"))

In [5]:
train_data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/FileStore/tables/train_set-51e11.csv')

test_data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/FileStore/tables/test_set-b5f57.csv')

In [6]:
print('Train data size: {} rows, {} columns'.format(train_data.count(), len(train_data.columns)))
print('Test data size: {} rows, {} columns'.format(test_data.count(), len(test_data.columns)))

#### 2/ Visualizing the data

In [8]:
display(train_data)

Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
1,2611,326,20,120,27,1597,168,214,184,2913,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
2,2772,324,17,42,7,1814,175,220,183,2879,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
3,2764,4,14,480,-21,700,201,212,148,700,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
4,3032,342,9,60,8,4050,202,227,164,2376,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
5,2488,23,11,117,21,1117,209,218,151,1136,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
6,2968,83,8,390,19,4253,232,226,127,4570,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
7,3027,11,6,534,47,1248,214,228,151,2388,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2
8,3216,277,9,67,23,5430,212,236,169,2373,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
9,3242,262,5,849,169,1672,207,242,173,691,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
10,3315,61,15,120,-6,3042,231,208,106,1832,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,7


Most of the labels (80%) are Type 1 or Type 2.
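The class shares behind such a statement can be checked with a simple count. The sketch below is only an illustration in plain Python on the ten `Cover_Type` values displayed above, not on the full training set; `label_shares` is a hypothetical helper name. On the Spark DataFrame itself, a `train_data.groupBy("Cover_Type").count()` would give the per-class counts.

```python
from collections import Counter

# Cover_Type values of the ten rows displayed above (not the full training set).
sample_labels = [6, 2, 2, 2, 2, 2, 2, 1, 1, 7]

def label_shares(labels):
    """Return {label: share of rows} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

shares = label_shares(sample_labels)
# Combined share of the two dominant classes.
share_type_1_or_2 = shares.get(1, 0.0) + shares.get(2, 0.0)
print(shares)
print("Share of Type 1 or Type 2: {:.0%}".format(share_type_1_or_2))
```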

In [10]:
display(train_data)


Plotting the geography-related data, we see that:
- elevation seems to separate the cover types fairly well
- there are no strong correlations between these features

In [12]:
display(train_data)


We see a correlation (though not a linear one) among the light-related features. Moreover, these features do not seem to separate the cover types.
Feature engineering might be useful to reduce these features.
However, I will not do it here.

Let's display the main statistics of the continuous features.

In [15]:
train_data.select("Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology").describe().show()
train_data.select("Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points").describe().show()

We see that the features differ significantly in their means and standard deviations, so standardizing them in a feature engineering step could be useful.
However, my attempts to do so in Databricks (using pandas, then pyspark.sql.functions) were not successful, so I could not try this option.
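For reference, standardization simply rescales each feature to zero mean and unit standard deviation. The sketch below shows the computation in plain Python on the ten `Elevation` values displayed earlier; it is an illustration only, and `standardize` is a hypothetical helper, not something used in this notebook. It uses the population standard deviation as a simplifying assumption. Spark ML also provides `pyspark.ml.feature.StandardScaler`, which applies the same idea to an assembled vector column and might be worth trying as a pipeline stage.

```python
from statistics import mean, pstdev

# Elevation values of the ten rows displayed earlier (illustrative sample only).
elevation = [2611, 2772, 2764, 3032, 2488, 2968, 3027, 3216, 3242, 3315]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]

elevation_std = standardize(elevation)
print([round(z, 2) for z in elevation_std])
```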

#### 3/ Feature engineering

I added new features (non-linear transformations) that could be more informative for the prediction:
- Rather than the horizontal and vertical distances to the water point, let's compute the overall distance, i.e. sqrt(horiz_distance^2 + vert_distance^2)
- Rather than the horizontal distance to the fire point, let's compute the straight-line distance along the slope, i.e. horizontal distance / cos(slope)
- Same for the distance to the roadway

We will assess these new features in a later step.
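As a quick sanity check of these formulas, the sketch below applies them in plain Python to the first row of the training data (Slope = 20, hydrology distances 120 and 27, roadways 1597, fire points 2913). The function names are hypothetical and only serve this illustration; the slope is assumed to be expressed in degrees, as the `radians()` call in the notebook code implies.

```python
from math import cos, radians, sqrt

def distance_to_water(horizontal, vertical):
    """Euclidean distance combining horizontal and vertical offsets."""
    return sqrt(horizontal ** 2 + vertical ** 2)

def slope_distance(horizontal, slope_degrees):
    """Distance along the slope for a given horizontal distance."""
    return horizontal / cos(radians(slope_degrees))

# Values taken from the first row of the training data (Id = 1).
d_water = distance_to_water(120, 27)
d_fire = slope_distance(2913, 20)
d_road = slope_distance(1597, 20)
print(d_water, d_fire, d_road)
```

These values can be compared with the `Distance_to_water`, `Distance_to_firepoint` and `Distance_to_roadway` columns that `add_features` produces for the same row.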

Note: I also wanted to implement a feature representing the oxygen concentration. Assuming that the proportion of O2 in air is constant at any altitude, the pressure at a given altitude is a good indicator.
https://fr.wikipedia.org/wiki/Formule_du_nivellement_barométrique
However, I did not implement it for lack of time.
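A minimal sketch of what such a feature could look like, using the international barometric formula for the standard atmosphere. This was not part of the experiments reported here; `pressure_hpa` is a hypothetical helper, and the sketch assumes that `Elevation` is expressed in metres.

```python
def pressure_hpa(elevation_m):
    """Standard-atmosphere pressure in hPa at a given elevation in metres.

    International barometric formula: sea-level pressure 1013.25 hPa,
    sea-level temperature 288.15 K, lapse rate 0.0065 K/m.
    """
    return 1013.25 * (1.0 - 0.0065 * elevation_m / 288.15) ** 5.255

p_sea_level = pressure_hpa(0)
# Elevation of the first training row, assumed to be in metres.
p_row_1 = pressure_hpa(2611)
print(p_sea_level, round(p_row_1, 1))
```

Since pressure is a monotonic function of elevation, a tree-based model would probably gain little from it, but it could matter for models that are sensitive to feature scale.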

In [19]:
from pyspark.sql.functions import sqrt
from pyspark.sql.functions import pow
from pyspark.sql.functions import cos, radians

def add_features(dataset):
  dataset = dataset\
  .withColumn("Distance_to_water",sqrt(pow(dataset.Horizontal_Distance_To_Hydrology,2) + pow(dataset.Vertical_Distance_To_Hydrology,2)))\
  .withColumn("Distance_to_firepoint", dataset.Horizontal_Distance_To_Fire_Points/cos(radians(dataset.Slope)))\
  .withColumn("Distance_to_roadway", dataset.Horizontal_Distance_To_Roadways/cos(radians(dataset.Slope)))
  return dataset

#### 4/ Splitting the data
Let's split the train data into a sub-training set and a validation set, so that we can measure the performance of our model without submitting to Kaggle each time.
Note: this lets us find the best algorithm and hyperparameters, but the final predictions will be made with a model trained on the whole train data set.

In [21]:
# Split the data into training and validation sets (25% held out for validation)
(sub_training_data, validation_data) = train_data.randomSplit([0.75, 0.25])
sub_training_data.cache()
validation_data.cache()
# Display the dimensions of the two newly created datasets
print('Sub training data size: {} rows, {} columns'.format(sub_training_data.count(), len(sub_training_data.columns)))
print('Validation data size: {} rows, {} columns'.format(validation_data.count(), len(validation_data.columns)))
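Because the split above is random, each run of the cell produces different subsets, which makes accuracies harder to compare between runs. The sketch below illustrates the idea of a seeded hold-out split in plain Python; `holdout_split` and the seed value are hypothetical. Unlike this sketch, Spark's `randomSplit` assigns rows probabilistically, so its subset sizes are only approximately 75/25; it also accepts a `seed` argument that can be used for the same purpose.

```python
import random

def holdout_split(rows, validation_fraction=0.25, seed=42):
    """Shuffle rows with a fixed seed and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    # First part is held out for validation, the rest is used for training.
    return shuffled[n_validation:], shuffled[:n_validation]

ids = list(range(1, 101))  # stand-in for row identifiers
sub_train_ids, validation_ids = holdout_split(ids)
print(len(sub_train_ids), len(validation_ids))
```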



We will use the `VectorAssembler()` to merge our feature columns into a single vector column, as required by the Spark ML methods.
We will create one assembler with the original features and one that also includes the newly created ones.

In [23]:
from pyspark.ml.feature import VectorAssembler

# Vector assembler with original features
vector_assembler = VectorAssembler(inputCols=["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points", "Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3", "Wilderness_Area4", "Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4", "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8", "Soil_Type9", "Soil_Type10", "Soil_Type11", "Soil_Type12", "Soil_Type13", "Soil_Type14", "Soil_Type15", "Soil_Type16", "Soil_Type17", "Soil_Type18", "Soil_Type19", "Soil_Type20", "Soil_Type21", "Soil_Type22", "Soil_Type23", "Soil_Type24", "Soil_Type25", "Soil_Type26", "Soil_Type27", "Soil_Type28", "Soil_Type29", "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34", "Soil_Type35", "Soil_Type36", "Soil_Type37", "Soil_Type38", "Soil_Type39", "Soil_Type40"], outputCol="features")

# New vector assembler with additional features
vector_assembler_new = VectorAssembler(inputCols=["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points", "Distance_to_water", "Distance_to_firepoint", "Distance_to_roadway", "Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3", "Wilderness_Area4", "Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4", "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8", "Soil_Type9", "Soil_Type10", "Soil_Type11", "Soil_Type12", "Soil_Type13", "Soil_Type14", "Soil_Type15", "Soil_Type16", "Soil_Type17", "Soil_Type18", "Soil_Type19", "Soil_Type20", "Soil_Type21", "Soil_Type22", "Soil_Type23", "Soil_Type24", "Soil_Type25", "Soil_Type26", "Soil_Type27", "Soil_Type28", "Soil_Type29", "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34", "Soil_Type35", "Soil_Type36", "Soil_Type37", "Soil_Type38", "Soil_Type39", "Soil_Type40"], outputCol="features")
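The two long column lists above follow a regular pattern, so they could also be generated. The sketch below is one possible way to do it in plain Python; the helper names are hypothetical, and the column order mirrors the lists used in the cell above.

```python
CONTINUOUS = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
ENGINEERED = ["Distance_to_water", "Distance_to_firepoint", "Distance_to_roadway"]

def indicator_columns(prefix, count):
    """Names of one-hot indicator columns, e.g. Soil_Type1 .. Soil_Type40."""
    return ["{}{}".format(prefix, i) for i in range(1, count + 1)]

def feature_columns(include_engineered=False):
    """Feature column names, optionally including the engineered features."""
    cols = list(CONTINUOUS)
    if include_engineered:
        cols += ENGINEERED
    cols += indicator_columns("Wilderness_Area", 4)
    cols += indicator_columns("Soil_Type", 40)
    return cols

original_cols = feature_columns()
new_cols = feature_columns(include_engineered=True)
print(len(original_cols), len(new_cols))
```

The resulting lists could then be passed as `inputCols` to `VectorAssembler`, for example `VectorAssembler(inputCols=original_cols, outputCol="features")`.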



#### 5/ Example with `Logistic Regression`.

In [25]:
from pyspark.ml.classification import LogisticRegression

# Setup classifier
classifier = LogisticRegression(labelCol="Cover_Type", featuresCol="features")


Now, we are going to create a pipeline that will chain the vector assembler and the classifier stages.

In [27]:
from pyspark.ml import Pipeline

# Chain vecAssembler and classification model 
pipeline = Pipeline(stages=[vector_assembler, classifier])

# Run stages in pipeline with the train data
model = pipeline.fit(train_data)

Starting from the example provided, I modified the evaluator to use accuracy as the performance metric for the predictions.

In [29]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="Cover_Type", predictionCol="prediction",
                                              metricName="accuracy") # changed "weightedPrecision" to "accuracy"
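For clarity, accuracy is simply the fraction of rows whose predicted class equals the true class. The sketch below computes it in plain Python; `accuracy_score` is a hypothetical helper and the predictions are made up for the illustration, they do not come from any model in this notebook.

```python
def accuracy_score(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# True Cover_Type values of the ten rows displayed earlier.
labels = [6, 2, 2, 2, 2, 2, 2, 1, 1, 7]
# Hypothetical predictions, for illustration only.
predictions = [2, 2, 2, 2, 2, 2, 2, 1, 2, 7]
acc = accuracy_score(labels, predictions)
print("Accuracy: {}".format(acc))
```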

In [30]:
# Run stages in pipeline with the sub training data to have an assessment of the model accuracy directly in Databricks
model_lr = pipeline.fit(sub_training_data)
predictions_lr = model_lr.transform(validation_data)
accuracy = evaluator.evaluate(predictions_lr)
print("Model Accuracy: {}".format(accuracy))

The logistic regression (with default parameters) gives a fairly low accuracy of 0.710893442437.  
Let's try some other models.

#### 6/ Random Forest

##### Random Forest with default parameters

In [34]:
from pyspark.ml.classification import RandomForestClassifier

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="Cover_Type", predictionCol="prediction")

# Chain vecAssembler and classification model 
pipeline = Pipeline(stages=[vector_assembler, rf])

# Run stages in pipeline with the train data
model_rf = pipeline.fit(sub_training_data)

# Apply model to validation Data
predictions_rf = model_rf.transform(validation_data)

# Evaluate model accuracy 
accuracy_rf = evaluator.evaluate(predictions_rf)
print("Model RF accuracy : {}".format(accuracy_rf))

Random Forest (with default parameters) also gives a fairly low accuracy of 0.676035334994.  
Let's see whether cross validation and a sensitivity analysis on the hyperparameters improve the results.

##### Random Forest with cross validation
We will run a sensitivity analysis on the maxDepth and numTrees parameters.

In [37]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [38]:
# Split sub training data for cross validation
numFolds = 3

# Define a grid search to assess different values of parameters
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [5, 10, 30, 50, 70]) # best numTrees = 50
             .addGrid(rf.maxDepth, [5, 10, 15]) # best maxDepth = 15
             .build())

# Define cross validation model
crossval_rf_cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=numFolds)
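To get a feel for the cost of this grid search, the sketch below counts the number of model fits it implies, in plain Python. The dictionary-based grid is only an illustration and not the `ParamGridBuilder` output; the extra fit assumes that the cross validator refits the best combination on the full input once the search is finished.

```python
from itertools import product

num_trees_values = [5, 10, 30, 50, 70]
max_depth_values = [5, 10, 15]
num_folds = 3

# One entry per (numTrees, maxDepth) combination of the grid.
grid = [
    {"numTrees": n, "maxDepth": d}
    for n, d in product(num_trees_values, max_depth_values)
]

# Each combination is trained once per fold during cross validation.
fits_during_cv = len(grid) * num_folds
# Assumption: one extra fit of the best combination on the whole input.
total_fits = fits_during_cv + 1
print(len(grid), fits_during_cv, total_fits)
```

With this many random forests to train, it is not surprising that the largest settings strain a small cluster.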



In [39]:
# Run cross validation model on sub training dataset
model_rf_cv = crossval_rf_cv.fit(sub_training_data)

In [40]:
# Evaluate model on validation dataset
validation_rf_cv = model_rf_cv.transform(validation_data)
accuracy_rf_cv = evaluator.evaluate(validation_rf_cv)

print("Model RF accuracy with cross-validation and grid search: {}".format(accuracy_rf_cv))

The sensitivity analysis covered:
- numTrees = [5, 10, 30, 50, 70]
- maxDepth = [5, 10, 15, 20]
- numFolds = [3, 4]

Increasing numTrees up to 50 increases the accuracy of the model; beyond that, no significant improvement was observed.  
Increasing maxDepth also improves the accuracy, and does so faster than numTrees, but resource limits are reached beyond 15.  
Varying numFolds did not lead to significant variations in accuracy.  

The best hyperparameters found so far are:
- maxDepth: 15, numTrees: 50, with an accuracy of 0.787024084086  

Increasing maxDepth further crashes the Databricks server.

The next steps are therefore:  
1/ assess whether cross validation really helps  
2/ increase maxDepth, as it seems to improve accuracy significantly

##### Random Forest without cross validation

After some trials, it seems that cross validation with random forest does not significantly improve the accuracy.  
--> The next steps will be done without cross validation, but I keep the validation set.  
Let's now try to increase maxDepth.

In [44]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

I created a function to simplify the process

In [46]:
def train_rf(train_set, validation_set, numTrees_val, maxDepth_val):
  # Train a RandomForest model.
  rf = RandomForestClassifier(labelCol="Cover_Type", predictionCol="prediction", numTrees=numTrees_val, maxDepth=maxDepth_val)

  # Chain vecAssembler and classification model 
  pipeline = Pipeline(stages=[vector_assembler, rf])
  
  # Run stages in pipeline with the train data
  model = pipeline.fit(train_set)

  # Apply model to validation Data
  validations = model.transform(validation_set)
  
  # Select (prediction, Cover Type) and compute accuracy
  evaluator = MulticlassClassificationEvaluator(
      labelCol="Cover_Type", predictionCol="prediction", metricName="accuracy")

  # Evaluate model accuracy 
  accuracy = evaluator.evaluate(validations)
  print("Model RF accuracy : {}".format(accuracy))
  
  return model, accuracy

In [47]:
numTrees = 40
maxDepth = 20
model_rf, accuracy_rf = train_rf(sub_training_data, validation_data, numTrees, maxDepth)

The sensitivity analysis is done here one run at a time rather than via a parameter grid, because I wanted to collect the hyperparameter values and did not succeed in retrieving them.  
maxDepth clearly improves accuracy, but beyond 15 we reach the Databricks limits if numTrees is kept at 50.  
This is why I reduced numTrees, to allow the computation to run with a higher maxDepth.  
With numTrees = 40 and maxDepth = 20, we obtain an RF accuracy of 0.832731183773.
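One way to keep track of hyperparameter values when running trials one at a time is to record each setting with its accuracy. The sketch below shows the idea in plain Python; `sweep` and `fake_train` are hypothetical, and `fake_train` is only a stand-in scoring function, not a real model. As far as I know, a fitted `CrossValidatorModel` also exposes `avgMetrics`, with one value per entry of the parameter grid in the same order, which may be another way to retrieve this information.

```python
def sweep(train_fn, param_grid):
    """Call train_fn for each parameter set and return results, best first."""
    results = []
    for params in param_grid:
        accuracy = train_fn(**params)
        results.append((params, accuracy))
    results.sort(key=lambda item: item[1], reverse=True)
    return results

def fake_train(numTrees_val, maxDepth_val):
    # Stand-in score for the illustration; a real run would train a model.
    return 0.5 + 0.01 * maxDepth_val + 0.001 * numTrees_val

grid = [
    {"numTrees_val": n, "maxDepth_val": d}
    for n in (30, 40)
    for d in (15, 20)
]
ranked = sweep(fake_train, grid)
best_params, best_accuracy = ranked[0]
print(best_params, best_accuracy)
```

In this notebook, `train_fn` could be a small wrapper that calls `train_rf(sub_training_data, validation_data, numTrees_val, maxDepth_val)` and returns only the accuracy.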

In [49]:
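# Add the engineered features to the train and test sets
# (note: the next line redefines vector_assembler_new with Elevation as its only input column)
vector_assembler_new = VectorAssembler(inputCols=["Elevation"], outputCol="features")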
train_data = add_features(train_data)
test_data = add_features(test_data)
display(train_data)

Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type,Distance_to_water,Distance_to_firepoint,Distance_to_roadway
1,2611,326,20,120,27,1597,168,214,184,2913,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,123.0,3099.949851222332,1699.4919026440316
2,2772,324,17,42,7,1814,175,220,183,2879,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,42.579337712087536,3010.546566926499,1896.8848462676865
3,2764,4,14,480,-21,700,201,212,148,700,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,480.4591553920063,721.4295405449286,721.4295405449286
4,3032,342,9,60,8,4050,202,227,164,2376,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,60.53098380168623,2405.617138872295,4100.483759441412
5,2488,23,11,117,21,1117,209,218,151,1136,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,118.86967653695368,1157.2621654691234,1137.9065482649742
6,2968,83,8,390,19,4253,232,226,127,4570,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,390.46254621922446,4614.912006410084,4294.796665921683
7,3027,11,6,534,47,1248,214,228,151,2388,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,536.0643618074233,2401.1537715976774,1254.8743328952685
8,3216,277,9,67,23,5430,212,236,169,2373,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,70.83784299369935,2402.579743494931,5497.685633028856
9,3242,262,5,849,169,1672,207,242,173,691,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,865.6569759437049,693.639507742453,1678.3867683724768
10,3315,61,15,120,-6,3042,231,208,106,1832,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,7,120.14990636700472,1896.625962511272,3149.3101408074726


#### 7/ Decision tree
Let's try this algorithm

In [51]:
from pyspark.ml.classification import DecisionTreeClassifier

def train_dt(assembler, train_set, validation_set, max_depth=5, max_bins=32):
  # Train the model.
  classifier = DecisionTreeClassifier(labelCol="Cover_Type", featuresCol="features", maxDepth=max_depth, maxBins=max_bins, impurity="entropy", minInfoGain=0.0)
  
  # Chain the vector assembler passed as argument and the classification model
  pipeline = Pipeline(stages=[assembler, classifier])
  
  # Run stages in pipeline with the train data
  model = pipeline.fit(train_set)

  # Apply model to validation Data
  validations = model.transform(validation_set)

  # Select (prediction, Cover Type) and compute accuracy
  evaluator = MulticlassClassificationEvaluator(
      labelCol="Cover_Type", predictionCol="prediction", metricName="accuracy")
  
  # Evaluate model accuracy 
  accuracy = evaluator.evaluate(validations)
  print("Model DT accuracy : {}".format(accuracy))
  
  return model, accuracy

In [52]:
model_dt, accuracy_dt = train_dt(vector_assembler, sub_training_data, validation_data, max_depth=30, max_bins=50)

The decision tree has been trained with the following parameters:
- maxDepth = [5, 30]
- maxBins = [20, 32, 50, 60, 75, 100, 400]
- impurity = ["gini", "entropy"]
- minInfoGain = [0.0, 0.01, 0.1]

While maxDepth clearly improves the result, maxBins does not really improve it. Setting the impurity to entropy clearly helps too.  
Finally, the best hyperparameters found are:  
- maxDepth = 30 / maxBins = 50 / impurity = "entropy"  

The accuracy with these settings is 0.9000280076.  
Note: maxDepth is bounded at 30.
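The two impurity options differ in how they score the class mix inside a tree node. The sketch below computes both measures in plain Python for the ten `Cover_Type` values displayed earlier; the function names are hypothetical, and the base-2 logarithm for entropy is a choice made for this illustration.

```python
from collections import Counter
from math import log2

def class_probabilities(labels):
    """Share of each class among the labels of a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    return -sum(p * log2(p) for p in class_probabilities(labels) if p > 0)

# Cover_Type values of the ten rows displayed earlier (illustrative sample only).
sample_labels = [6, 2, 2, 2, 2, 2, 2, 1, 1, 7]
node_gini = gini(sample_labels)
node_entropy = entropy(sample_labels)
print(round(node_gini, 4), round(node_entropy, 4))
```

Both measures are zero for a pure node and grow as the classes become more mixed; a split is chosen to reduce the impurity of the child nodes as much as possible.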

Let's see whether the new features (defined above) improve the model.

In [55]:
model_dt_new, accuracy_dt_new = train_dt(vector_assembler_new, sub_training_data, validation_data, max_depth=30, max_bins=50)

The new features do not improve the model so we will keep the original ones...

#### 8/ Final training and predictions

Let's now train the model on the whole train data set, to use as much data as possible for training (with the original features and the best hyperparameters found).

In [59]:

# Train the model.
classifier = DecisionTreeClassifier(labelCol="Cover_Type", featuresCol="features", maxDepth=30, maxBins=50, impurity="entropy", minInfoGain=0.0)

# Chain vecAssembler and classification model
pipeline = Pipeline(stages=[vector_assembler, classifier])

# Run stages in pipeline with the train data
model = pipeline.fit(train_data)



Training the model on the whole train set improved the accuracy to 0.90724.
This result comes from the Kaggle evaluation.

Once we have trained the classifier, we can use it to make predictions on the test data.

In [62]:
# Make predictions on testData
def make_predictions(model, test_set=test_data):
  predictions = model.transform(test_set)
  predictions = predictions.withColumn("Cover_Type", predictions["prediction"].cast("int"))  # Cast predictions to 'int' to match the data type expected by Kaggle
  # Show the content of 'predictions'
  predictions.printSchema()
  return predictions

In [63]:
predictions = make_predictions(model,test_data)

Finally, we can create a file with the predictions.

In [65]:
# Select columns Id and prediction
(predictions
 .repartition(1)
 .select('Id', 'Cover_Type')
 .write
 .format('com.databricks.spark.csv')
 .options(header='true')
 .mode('overwrite')
 .save('/FileStore/kaggle-submission'))

To be able to download the predictions file, we need its name (`part-*.csv`):

In [67]:
display(dbutils.fs.ls("dbfs:/FileStore/kaggle-submission"))

path,name,size
dbfs:/FileStore/kaggle-submission/_SUCCESS,_SUCCESS,0
dbfs:/FileStore/kaggle-submission/_committed_4279499521838331575,_committed_4279499521838331575,199
dbfs:/FileStore/kaggle-submission/_committed_4431659471375385229,_committed_4431659471375385229,199
dbfs:/FileStore/kaggle-submission/_committed_5499012469187108500,_committed_5499012469187108500,210
dbfs:/FileStore/kaggle-submission/_committed_7176294531204529675,_committed_7176294531204529675,198
dbfs:/FileStore/kaggle-submission/_committed_7849635464235053575,_committed_7849635464235053575,199
dbfs:/FileStore/kaggle-submission/_started_5499012469187108500,_started_5499012469187108500,0
dbfs:/FileStore/kaggle-submission/part-00000-tid-5499012469187108500-69abafc7-d34a-4d03-9d76-1d375136115c-8069-c000.csv,part-00000-tid-5499012469187108500-69abafc7-d34a-4d03-9d76-1d375136115c-8069-c000.csv,2039369


My ID
https://community.cloud.databricks.com/files/kaggle-submission/part-*.csv?o=525206096865672#notebook/875111623939519/command/875111623939536

Files stored in /FileStore are accessible in your web browser via `https://<databricks-instance-name>.cloud.databricks.com/files/`.
  
For this example:

https://community.cloud.databricks.com/files/kaggle-submission/part-*.csv?o=######

where `part-*.csv` should be replaced by the name displayed in your system, and the number after `o=` is the same as in your Community Edition URL.
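The URL can also be assembled programmatically from the folder listing. The sketch below is a plain-Python illustration; `find_part_file` and `download_url` are hypothetical helpers, the listing is a shortened copy of the one displayed above, and `######` remains a placeholder for your own `o=` value.

```python
def find_part_file(names):
    """Return the first file name matching the part-*.csv pattern."""
    for name in names:
        if name.startswith("part-") and name.endswith(".csv"):
            return name
    raise ValueError("no part-*.csv file found")

def download_url(folder, file_name, org_id, host="community.cloud.databricks.com"):
    """Build the browser URL of a file stored under /FileStore/<folder>."""
    return "https://{}/files/{}/{}?o={}".format(host, folder, file_name, org_id)

# File names as displayed in the folder listing above (shortened).
listing = [
    "_SUCCESS",
    "_started_5499012469187108500",
    "part-00000-tid-5499012469187108500-69abafc7-d34a-4d03-9d76-1d375136115c-8069-c000.csv",
]
part_name = find_part_file(listing)
# "######" is a placeholder: use the number after o= in your own workspace URL.
print(download_url("kaggle-submission", part_name, "######"))
```

In a notebook, the list of names could be taken from the `name` field of the entries returned by `dbutils.fs.ls`.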


Finally, we can upload the predictions to Kaggle and check the performance.