# Spatial predictive model of traffic crashes on the streets of Santiago
**The 2013-2017 dataset is split into train/test sets (80%/20%), and the resulting model is validated with the 2018 dataset**
- UDD - MDS18 - BDA
- Final Delivery 
- 40_Final_Geo_Project_PySpark_MLib_RF_GBT_Models_Gr100_dataset-2013-17_val_18
- 09 August 2019

**The dataset created from 2013-2017 data is loaded as `df`**

Developed based on:
- Big Data Analytics, Oscar Peredo 
- [PySpark ML and XGBoost full integration tested on the Kaggle Titanic dataset](https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb)
- [Machine Learning with PySpark and MLlib — Solving a Binary Classification Problem](https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa)
- [Build an end-to-end Machine Learning Model with MLlib in pySpark.](https://towardsdatascience.com/build-an-end-to-end-machine-learning-model-with-mllib-in-pyspark-4917bdf289c5)

## Preliminary Installation

- findspark: PySpark isn't on `sys.path` by default, but it can be used as a regular library. This can be addressed either by symlinking pyspark into your site-packages or by adding pyspark to `sys.path` at runtime, which is what [findspark](https://github.com/minrk/findspark) does.
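As a rough illustration of the "add pyspark to `sys.path` at runtime" option, the sketch below builds the entries that would need to be on the path. It is not findspark's actual code: the helper names are hypothetical, and the layout (`python/` and `python/lib/py4j-*.zip` under the Spark home) is an assumption about a typical Spark distribution.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Return the path entries assumed to make `import pyspark` work."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: the bundled py4j archive lives under python/lib.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    return [python_dir] + py4j_zips


def add_to_sys_path(paths):
    """Prepend each entry to sys.path unless it is already there."""
    for p in paths:
        if p not in sys.path:
            sys.path.insert(0, p)


# "/opt/spark" is a placeholder location, not a path taken from this notebook.
paths = pyspark_paths("/opt/spark")
add_to_sys_path(paths)
```

In practice `findspark.init()` handles this, including locating the Spark installation, so the sketch is only meant to show what "adding to `sys.path`" amounts to.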


## General functions

In [117]:
def count_missings(spark_df):
    """Return (column name, null count) pairs for columns with at least one null."""
    null_counts = []
    for cname, _ctype in spark_df.dtypes:
        nulls = spark_df.where(spark_df[cname].isNull()).count()  # nulls in this column
        null_counts.append((cname, nulls))  # (column name, null count)
    # keep only the columns that actually have missing values
    return [(x, y) for (x, y) in null_counts if y != 0]

In [118]:
def find_num_cat_features(df):
    """Split column names into numerical and categorical lists by Spark dtype."""
    cat_cols = [item[0] for item in df.dtypes if item[1].startswith('string')]
    print(str(len(cat_cols)) + '  categorical features')

    # [1:] drops the first numeric column (`_c0` in this dataset)
    num_cols = [
        item[0] for item in df.dtypes
        if item[1].startswith('int') or item[1].startswith('double')
    ][1:]
    print(str(len(num_cols)) + '  numerical features')
    return num_cols, cat_cols

In [119]:
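def weight_balance(labels, ratio):
    # `ratio` was an undefined global in the original; it is assumed to be the
    # share of the majority class (label 0), so that positives weigh more.
    # `when` comes from pyspark.sql.functions, imported in a later cell.
    return when(labels == 1, ratio).otherwise(1 - ratio)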

## Libraries and pySpark initialization

In [120]:
import numpy as np
import pandas as pd

In [121]:
import findspark
findspark.init()

In [122]:
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import GBTClassificationModel
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.feature import OneHotEncoderEstimator  # renamed to OneHotEncoder in Spark 3.0
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml import Pipeline,PipelineModel
# Note: this import shadows the Python built-ins `sum` and `round` with Spark column functions
from pyspark.sql.functions import rank, sum, col, mean, round, when

In [123]:
spark = SparkSession\
        .builder\
        .appName("PySpark GBT Grid 100m Data Siniestros 2013-2017")\
        .getOrCreate()

## Dataset - Generated from Notebook: 
- 30_Final_Geo_Project_Final Datasets_Creation

In [76]:
!ls ../data/

CONASET                              geo_stgo_100_crash_test_dataset.csv
OSM_Chile                            geo_stgo_100_crash_train_dataset.csv
final_test_dataset_grid_100.csv      geo_stgo_100_estatic_dataset.csv
final_train_dataset_grid_100.csv


In [77]:
df = spark.read.csv('../data/final_train_dataset_grid_100.csv',
                    header=True,
                    inferSchema=True)
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: integer (nullable = true)
 |-- motorway_junction: integer (nullable = true)
 |-- parking: integer (nullable = true)
 |-- parking_bicycle: integer (nullable = tr

## Verifying dataset balance

In [78]:
num_cols, cat_cols = find_num_cat_features(df)

0  categorical features
54  numerical features


In [79]:
df.groupby('SINIESTRO').count().toPandas()

   SINIESTRO  count
0          1  17579
1          0  45450


`The classes are imbalanced: about 72% of the grid cells have no crash (0) and 28% have one (1). Some compensation technique may be worth trying; a weight column did not prove feasible here.`

**The `weight_balance` function defined above is meant for adding a new weights column based on these ratios, should that be wanted.**
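A minimal plain-Python sketch of the arithmetic behind such a weights column, using the class counts shown above. It assumes the ratio is the share of the majority class (label 0); the helper name `class_weights` is hypothetical and not part of the notebook.

```python
def class_weights(n_positive, n_negative):
    """Return (weight for label 1, weight for label 0).

    Assumption: the ratio is the majority-class share, so each positive row
    gets the larger weight and both classes end up with equal total weight.
    """
    ratio = n_negative / (n_positive + n_negative)
    return ratio, 1 - ratio


# Counts taken from the groupby output above.
w_pos, w_neg = class_weights(17579, 45450)

total_pos = 17579 * w_pos  # total weight carried by the crash cells
total_neg = 45450 * w_neg  # total weight carried by the no-crash cells
print(round(w_pos, 4), round(w_neg, 4))
```

With these weights the two classes contribute the same total weight, which is the usual goal of this kind of rebalancing.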

## Preparing Data for Machine Learning

Verify the numerical features:

In [80]:
numericCols = num_cols[1:-1]  # drop `id` (first) and the target `SINIESTRO` (last)
numericCols

['X',
 'Y',
 'bank',
 'bench',
 'beverages',
 'bus_stop',
 'bus_stop_100',
 'cafe',
 'convenience',
 'convenience_100',
 'convenience_200',
 'crossing',
 'crossing_100',
 'fast_food',
 'fast_food_100',
 'fast_food_200',
 'fuel',
 'intercect',
 'kindergarten',
 'motorway_junction',
 'parking',
 'parking_bicycle',
 'pharmacy',
 'railway_station',
 'railway_station_100',
 'restaurant',
 'restaurant_100',
 'school',
 'school_100',
 'school_200',
 'stop',
 'stop_100',
 'taxi',
 'traffic_signals',
 'traffic_signals_100',
 'turning_circle',
 'ATROPELLO_100',
 'ATROPELLO_200',
 'CAIDA_100',
 'CAIDA_200',
 'CHOQUE_100',
 'CHOQUE_200',
 'COLISION_100',
 'COLISION_200',
 'INCENDIO_100',
 'INCENDIO_200',
 'OTRO TIPO_100',
 'OTRO TIPO_200',
 'SEV_Index_100',
 'SEV_Index_200',
 'VOLCADURA_100',
 'VOLCADURA_200']

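Category indexing, one-hot encoding and `VectorAssembler` (a feature transformer that merges multiple columns into a single vector column). Since this dataset has no categorical columns, only the label indexer and the assembler actually end up in the pipeline.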
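To make the three steps concrete, here is a plain-Python sketch of the idea (not Spark's implementation). The `zone` column is hypothetical, since the real dataset has no categorical columns. The sketch mirrors two Spark defaults: indices are assigned by descending frequency, and the one-hot vector drops the last category.

```python
from collections import Counter


def string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical here)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 vector; the last category becomes all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate vectors and scalars into one flat feature vector."""
    out = []
    for part in parts:
        out.extend(part if isinstance(part, list) else [float(part)])
    return out


zone = ["urban", "urban", "rural", "urban", "suburban", "rural"]  # hypothetical column
mapping = string_index(zone)
row = assemble(one_hot(mapping["rural"], len(mapping)), 3, 0.5)
print(mapping, row)
```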

In [81]:
categoricalColumns = cat_cols
cols = df.columns
stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
    
label_stringIdx = StringIndexer(inputCol='SINIESTRO', outputCol='label')
stages += [label_stringIdx]

# Assemble the columns into a feature vector
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [82]:
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
df.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: integer (nullable = true)
 |-- motorway_junction: integer (nullable = true)
 |-- p

In [83]:
train, test = df.randomSplit([0.8, 0.2], seed=24)

## RF Model

In [84]:
rf = RandomForestClassifier(featuresCol='features', labelCol='label')
rfModel = rf.fit(train)
predictions = rfModel.transform(test)

In [85]:
predictions.select('id', 'label', 'rawPrediction', 'prediction',
                   'probability').show(10)

+---+-----+--------------------+----------+--------------------+
| id|label|       rawPrediction|prediction|         probability|
+---+-----+--------------------+----------+--------------------+
|  0|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
|  3|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 14|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 18|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 23|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 33|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 34|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 44|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 56|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 54|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
+---+-----+--------------------+----------+--------------------+
only showing top 10 rows
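The numbers in the table above are consistent with the random forest `probability` being the `rawPrediction` vector normalized to sum to 1. The sketch below checks this for the first row. It assumes the default of 20 trees, and the second raw component is inferred rather than read from the truncated output.

```python
def normalize(raw):
    """Scale a vector of non-negative votes so that it sums to 1."""
    total = sum(raw)
    return [value / total for value in raw]


num_trees = 20  # assumption: RandomForestClassifier default, not set in the notebook
raw_class0 = 19.3985287329805  # first rawPrediction component from the table
raw = [raw_class0, num_trees - raw_class0]  # second component is inferred

probability = normalize(raw)
print(probability[0])
```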



In [86]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8294667464568877


In [88]:
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.7782577959311916


In [89]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.221742


## GBT Model

In [90]:
gbt = GBTClassifier(maxIter=10)

In [91]:
gbtModel = gbt.fit(train)

In [92]:
predictions = gbtModel.transform(test)

In [93]:
predictions.select('id', 'label', 'rawPrediction', 'prediction',
                   'probability').show(5)

+---+-----+--------------------+----------+--------------------+
| id|label|       rawPrediction|prediction|         probability|
+---+-----+--------------------+----------+--------------------+
|  0|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
|  3|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
| 14|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
| 18|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
| 23|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
+---+-----+--------------------+----------+--------------------+
only showing top 5 rows



In [94]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predictions, 
                       {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8383690618564904


In [95]:
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.7879978006440971


In [96]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.212002


In [97]:
gbtModel.featureImportances

SparseVector(52, {0: 0.0114, 1: 0.0286, 2: 0.0003, 3: 0.0007, 4: 0.0047, 5: 0.056, 6: 0.0233, 7: 0.0005, 8: 0.0088, 9: 0.0011, 10: 0.0009, 11: 0.09, 12: 0.0158, 13: 0.0023, 15: 0.0008, 16: 0.0016, 17: 0.0717, 18: 0.0006, 22: 0.0132, 23: 0.0005, 25: 0.0156, 26: 0.0168, 29: 0.0008, 33: 0.047, 34: 0.0211, 36: 0.0353, 37: 0.0057, 38: 0.0089, 39: 0.0009, 40: 0.0026, 41: 0.0091, 42: 0.4391, 43: 0.0127, 44: 0.0015, 45: 0.0005, 46: 0.0147, 47: 0.0041, 48: 0.0168, 49: 0.0, 50: 0.0138})

Function to read the feature importances by name, thanks to:
- https://www.timlrx.com/2018/06/19/feature-selection-using-feature-importance-score-creating-a-pyspark-estimator/

In [98]:
def ExtractFeatureImp(featureImp, dataset, featuresCol):
    """Map feature-importance scores to feature names via the vector column's ML metadata."""
    list_extract = []
    attrs = dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
    for i in attrs:  # one entry per attribute group (e.g. 'numeric')
        list_extract = list_extract + attrs[i]
    varlist = pd.DataFrame(list_extract)
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[x])
    return varlist.sort_values('score', ascending=False)

In [99]:
ExtractFeatureImp(
    gbtModel.featureImportances, 
    df, "features").head(20)

    idx                 name     score
42   42         COLISION_100  0.439067
11   11             crossing  0.090049
17   17            intercect  0.071686
5     5             bus_stop  0.055971
33   33      traffic_signals  0.047034
36   36        ATROPELLO_100  0.035349
1     1                    Y  0.028571
6     6         bus_stop_100  0.023284
34   34  traffic_signals_100  0.021061
48   48        SEV_Index_100  0.016812
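The ranking can be cross-checked without Spark, because the indices of the sparse importance vector follow the order of `numericCols`. The sketch below uses the 52 feature names listed earlier and only the importance entries of at least 0.02 copied from the `featureImportances` output; `top_features` is a hypothetical helper.

```python
feature_names = [
    'X', 'Y', 'bank', 'bench', 'beverages', 'bus_stop', 'bus_stop_100', 'cafe',
    'convenience', 'convenience_100', 'convenience_200', 'crossing',
    'crossing_100', 'fast_food', 'fast_food_100', 'fast_food_200', 'fuel',
    'intercect', 'kindergarten', 'motorway_junction', 'parking',
    'parking_bicycle', 'pharmacy', 'railway_station', 'railway_station_100',
    'restaurant', 'restaurant_100', 'school', 'school_100', 'school_200',
    'stop', 'stop_100', 'taxi', 'traffic_signals', 'traffic_signals_100',
    'turning_circle', 'ATROPELLO_100', 'ATROPELLO_200', 'CAIDA_100',
    'CAIDA_200', 'CHOQUE_100', 'CHOQUE_200', 'COLISION_100', 'COLISION_200',
    'INCENDIO_100', 'INCENDIO_200', 'OTRO TIPO_100', 'OTRO TIPO_200',
    'SEV_Index_100', 'SEV_Index_200', 'VOLCADURA_100', 'VOLCADURA_200',
]

# Subset of the SparseVector output: only entries >= 0.02 are copied here.
importances = {1: 0.0286, 5: 0.056, 6: 0.0233, 11: 0.09, 17: 0.0717,
               33: 0.047, 34: 0.0211, 36: 0.0353, 42: 0.4391}


def top_features(names, scores, k):
    """Return the k (name, score) pairs with the highest scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(names[idx], score) for idx, score in ranked[:k]]


top5 = top_features(feature_names, importances, 5)
for name, score in top5:
    print(f"{name:20s} {score:.4f}")
```

The crash history within 100 m (`COLISION_100`) dominates, with road-layout features such as `crossing` and `intercect` following at a distance.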


## Tuning The GBT Model

From a run of H2O AutoML, the best Gradient Boosting Machine had these hyperparameters:
- number_of_trees: 58
- number_of_internal_trees: 58
- model_size_in_bytes: 45603
- min_depth: 6
- max_depth: 6
- mean_depth: 6.0
- min_leaves: 42
- max_leaves: 64
- mean_leaves: 57.9310

In [100]:
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [4, 6])
             .addGrid(gbt.maxBins, [40, 60, 70])
             .addGrid(gbt.maxIter, [10, 20])
             .build())

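# Note: if the cells ran in the order shown, `evaluator` is still the
# MulticlassClassificationEvaluator from the previous cells (default metric: f1),
# so the grid search selects models by F1 rather than by area under ROC.
cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)

# Run cross-validation. This took about 7.3 minutes.
cvModel = cv.fit(train)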
predictions = cvModel.transform(test)
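The run time follows from the size of the search. A small plain-Python sketch, using the grid values from the cell above, counts how many models get trained; the extra fit assumes the best parameter set is refit once on the full training data after the search.

```python
from itertools import product

# Same values as the ParamGridBuilder calls above.
grid = {
    "maxDepth": [4, 6],
    "maxBins": [40, 60, 70],
    "maxIter": [10, 20],
}
num_folds = 5

# Every combination of the three hyperparameters.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

fits_during_search = len(combinations) * num_folds
total_fits = fits_during_search + 1  # plus one refit of the best combination

print(len(combinations), fits_during_search, total_fits)
```

Adding another value to any one list multiplies the cost, so the grid is best kept small when each fit is expensive.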

In [101]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predictions, 
                       {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.841362042678081


In [102]:
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.7924750608750295


In [103]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.207525


In [104]:
selected = predictions.select('id', 'label', 'rawPrediction', 'probability', 'prediction')
for row in selected.collect():
    print(row)

Row(id=0, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=3, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=18, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=23, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=33, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=34, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=44, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), pr

Row(id=8171, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=4192, label=0.0, rawPrediction=DenseVector([0.8025, -0.8025]), probability=DenseVector([0.8327, 0.1673]), prediction=0.0)
Row(id=23417, label=0.0, rawPrediction=DenseVector([0.8395, -0.8395]), probability=DenseVector([0.8428, 0.1572]), prediction=0.0)
Row(id=791, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=9640, label=0.0, rawPrediction=DenseVector([0.6806, -0.6806]), probability=DenseVector([0.796, 0.204]), prediction=0.0)
Row(id=23618, label=0.0, rawPrediction=DenseVector([0.7284, -0.7284]), probability=DenseVector([0.811, 0.189]), prediction=0.0)
Row(id=25077, label=0.0, rawPrediction=DenseVector([0.4086, -0.4086]), probability=DenseVector([0.6936, 0.3064]), prediction=0.0)
Row(id=21182, label=0.0, rawPrediction=DenseVector([0.2949, -0.2949]), probability=DenseVector([0.6

Row(id=18737, label=0.0, rawPrediction=DenseVector([1.074, -1.074]), probability=DenseVector([0.8955, 0.1045]), prediction=0.0)
Row(id=19706, label=0.0, rawPrediction=DenseVector([1.5394, -1.5394]), probability=DenseVector([0.956, 0.044]), prediction=0.0)
Row(id=4847, label=0.0, rawPrediction=DenseVector([1.0322, -1.0322]), probability=DenseVector([0.8874, 0.1126]), prediction=0.0)
Row(id=12812, label=0.0, rawPrediction=DenseVector([0.9459, -0.9459]), probability=DenseVector([0.869, 0.131]), prediction=0.0)
Row(id=11205, label=0.0, rawPrediction=DenseVector([0.6447, -0.6447]), probability=DenseVector([0.784, 0.216]), prediction=0.0)
Row(id=2944, label=0.0, rawPrediction=DenseVector([1.0219, -1.0219]), probability=DenseVector([0.8853, 0.1147]), prediction=0.0)
Row(id=13131, label=0.0, rawPrediction=DenseVector([1.0322, -1.0322]), probability=DenseVector([0.8874, 0.1126]), prediction=0.0)
Row(id=20259, label=0.0, rawPrediction=DenseVector([0.9459, -0.9459]), probability=DenseVector([0.86

Row(id=30105, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=29938, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=30997, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=30995, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=31230, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=31947, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=31946, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=32424, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVe

Row(id=40998, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=29498, label=0.0, rawPrediction=DenseVector([0.8243, -0.8243]), probability=DenseVector([0.8387, 0.1613]), prediction=0.0)
Row(id=40968, label=0.0, rawPrediction=DenseVector([0.7948, -0.7948]), probability=DenseVector([0.8306, 0.1694]), prediction=0.0)
Row(id=46459, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38546, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=41258, label=0.0, rawPrediction=DenseVector([0.3128, -0.3128]), probability=DenseVector([0.6515, 0.3485]), prediction=0.0)
Row(id=34036, label=0.0, rawPrediction=DenseVector([0.6643, -0.6643]), probability=DenseVector([0.7906, 0.2094]), prediction=0.0)
Row(id=34260, label=0.0, rawPrediction=DenseVector([0.4323, -0.4323]), probability=DenseVe

Row(id=39228, label=0.0, rawPrediction=DenseVector([0.3068, -0.3068]), probability=DenseVector([0.6488, 0.3512]), prediction=0.0)
Row(id=53163, label=0.0, rawPrediction=DenseVector([0.7938, -0.7938]), probability=DenseVector([0.8303, 0.1697]), prediction=0.0)
Row(id=42207, label=0.0, rawPrediction=DenseVector([0.8195, -0.8195]), probability=DenseVector([0.8374, 0.1626]), prediction=0.0)
Row(id=40810, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51815, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=32938, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=33297, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=42895, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVe

Row(id=41557, label=1.0, rawPrediction=DenseVector([-0.1367, 0.1367]), probability=DenseVector([0.4321, 0.5679]), prediction=1.0)
Row(id=43182, label=1.0, rawPrediction=DenseVector([0.5876, -0.5876]), probability=DenseVector([0.7641, 0.2359]), prediction=0.0)
Row(id=51925, label=1.0, rawPrediction=DenseVector([0.8619, -0.8619]), probability=DenseVector([0.8486, 0.1514]), prediction=0.0)
Row(id=38714, label=1.0, rawPrediction=DenseVector([0.3629, -0.3629]), probability=DenseVector([0.6739, 0.3261]), prediction=0.0)
Row(id=50731, label=1.0, rawPrediction=DenseVector([0.4589, -0.4589]), probability=DenseVector([0.7146, 0.2854]), prediction=0.0)
Row(id=40009, label=1.0, rawPrediction=DenseVector([-0.0554, 0.0554]), probability=DenseVector([0.4723, 0.5277]), prediction=1.0)
Row(id=36255, label=1.0, rawPrediction=DenseVector([0.3657, -0.3657]), probability=DenseVector([0.6751, 0.3249]), prediction=0.0)
Row(id=52652, label=1.0, rawPrediction=DenseVector([0.2527, -0.2527]), probability=DenseVe

Row(id=56050, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=56647, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=57037, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=57024, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=56899, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=56882, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=57210, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=57368, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVe

In [105]:
predictions.select('id', 'label', 'rawPrediction', 'probability', 'prediction').show(10)

+---+-----+--------------------+--------------------+----------+
| id|label|       rawPrediction|         probability|prediction|
+---+-----+--------------------+--------------------+----------+
|  0|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  3|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 14|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 18|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 23|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 33|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 34|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 44|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 56|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 54|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
+---+-----+--------------------+--------------------+----------+
only showing top 10 rows
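The GBT outputs printed above are consistent with the class-0 probability being a logistic function of twice the first `rawPrediction` component. The sketch below checks that relationship against three rows copied from the printout. It is an observation about these numbers, not a statement of Spark's internals.

```python
import math


def gbt_probability(raw_margin):
    """Class-0 probability implied by the first rawPrediction component."""
    return 1.0 / (1.0 + math.exp(-2.0 * raw_margin))


# (rawPrediction[0], probability[0]) pairs copied from the rows above.
examples = [(1.5434, 0.9563), (0.8025, 0.8327), (0.4086, 0.6936)]

for raw_margin, reported in examples:
    print(raw_margin, round(gbt_probability(raw_margin), 4), reported)
```

A margin of 0 corresponds to a probability of 0.5, which is why rows with a negative first component are the ones predicted as crashes.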



In [106]:
bestModel = cvModel.bestModel

In [107]:
bestModel

GBTClassificationModel (uid=GBTClassifier_95ac8c178565) with 20 trees

In [108]:
bestModel.write().overwrite().save('../model/GeoProjectBestModel_1.model')

## Validating Best Model with 2018 dataset
The 2018 dataset is used for validation:
- The actual crash events (the labels) are from 2018.
- The dynamic features (`type_100` and `type_200`) are from the previous year (2017).
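The lag between features and labels can be sketched in plain Python. Everything here is hypothetical: the real datasets are built in the separate dataset-creation notebook, and the function name, field names and counts below are illustrative only.

```python
def build_validation_rows(crashes_by_year, label_year):
    """Pair labels from `label_year` with crash counts from the year before."""
    feature_year = label_year - 1
    rows = []
    for cell_id in sorted(crashes_by_year[label_year]):
        rows.append({
            "id": cell_id,
            # dynamic feature: what was known before the label year started
            "crashes_prev_year": crashes_by_year[feature_year].get(cell_id, 0),
            # label: whether the cell had any crash in the label year
            "SINIESTRO": int(crashes_by_year[label_year][cell_id] > 0),
        })
    return rows


# Hypothetical crash counts per grid cell id.
crashes_by_year = {
    2017: {0: 0, 1: 3, 2: 1},
    2018: {0: 0, 1: 2, 2: 0},
}
rows = build_validation_rows(crashes_by_year, 2018)
for row in rows:
    print(row)
```

Keeping the features strictly earlier than the labels avoids leaking 2018 information into the inputs used to predict 2018.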

In [134]:
valModel = GBTClassificationModel.load("../model/GeoProjectBestModel_1.model")

Loading the dataset with 2018 data:

In [135]:
df_2018 = spark.read.csv('../data/final_test_dataset_grid_100.csv',
                    header=True,
                    inferSchema=True)
df_2018.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: integer (nullable = true)
 |-- motorway_junction: integer (nullable = true)
 |-- parking: integer (nullable = true)
 |-- parking_bicycle: integer (nullable = tr

In [136]:
num_cols, cat_cols = find_num_cat_features(df_2018)

0  categorical features
54  numerical features


In [137]:
df_2018.groupby('SINIESTRO').count().toPandas()

   SINIESTRO  count
0          1   6687
1          0  56342


`Note that in this dataset, about 10.6% of the grid cells have crash events and 89.4% have none.`
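These shares also set a reference point for any accuracy figure on the 2018 data: a model that always predicts "no crash" would already be right for every no-event cell. A short sketch with the counts above (the helper name is hypothetical):

```python
def class_shares(counts):
    """Return each label's fraction of the total row count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


counts_2018 = {1: 6687, 0: 56342}  # from the groupby output above
shares = class_shares(counts_2018)

# Accuracy of always predicting the majority class (no crash).
baseline_accuracy = max(shares.values())

print(round(shares[1] * 100, 1), round(shares[0] * 100, 1))
print(round(baseline_accuracy, 3))
```

For this reason, metrics that look at the crash class specifically (recall, precision, area under ROC) say more about the model than accuracy alone.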

In [138]:
numericCols = num_cols[1:-1]  # drop `id` (first) and the target `SINIESTRO` (last)
numericCols

['X',
 'Y',
 'bank',
 'bench',
 'beverages',
 'bus_stop',
 'bus_stop_100',
 'cafe',
 'convenience',
 'convenience_100',
 'convenience_200',
 'crossing',
 'crossing_100',
 'fast_food',
 'fast_food_100',
 'fast_food_200',
 'fuel',
 'intercect',
 'kindergarten',
 'motorway_junction',
 'parking',
 'parking_bicycle',
 'pharmacy',
 'railway_station',
 'railway_station_100',
 'restaurant',
 'restaurant_100',
 'school',
 'school_100',
 'school_200',
 'stop',
 'stop_100',
 'taxi',
 'traffic_signals',
 'traffic_signals_100',
 'turning_circle',
 'ATROPELLO_100',
 'ATROPELLO_200',
 'CAIDA_100',
 'CAIDA_200',
 'CHOQUE_100',
 'CHOQUE_200',
 'COLISION_100',
 'COLISION_200',
 'INCENDIO_100',
 'INCENDIO_200',
 'OTRO TIPO_100',
 'OTRO TIPO_200',
 'SEV_Index_100',
 'SEV_Index_200',
 'VOLCADURA_100',
 'VOLCADURA_200']

In [139]:
categoricalColumns = cat_cols
cols = df_2018.columns  # use the 2018 frame's own columns; df.columns already holds 'label' and 'features'
stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
    
label_stringIdx = StringIndexer(inputCol='SINIESTRO', outputCol='label')
stages += [label_stringIdx]

assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df_2018)
df_2018 = pipelineModel.transform(df_2018)
selectedCols = ['label', 'features'] + cols
df_2018 = df_2018.select(selectedCols)
df_2018.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: inte

In [147]:
predict_2018 = valModel.transform(df_2018)

In [148]:
selected = predict_2018.select('id', 'label', 'rawPrediction', 'probability', 'prediction')
for row in selected.collect():
    print(row)

Row(id=0, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=1, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=2, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=3, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=4, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=5, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=6, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=7, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), predicti

Row(id=1921, label=0.0, rawPrediction=DenseVector([1.5394, -1.5394]), probability=DenseVector([0.956, 0.044]), prediction=0.0)
Row(id=1922, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=1923, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=1924, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=1925, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=1926, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=1927, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=1928, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.95

Row(id=3800, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=3801, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=3802, label=0.0, rawPrediction=DenseVector([0.9185, -0.9185]), probability=DenseVector([0.8626, 0.1374]), prediction=0.0)
Row(id=3803, label=0.0, rawPrediction=DenseVector([0.1964, -0.1964]), probability=DenseVector([0.597, 0.403]), prediction=0.0)
Row(id=3804, label=0.0, rawPrediction=DenseVector([1.1301, -1.1301]), probability=DenseVector([0.9055, 0.0945]), prediction=0.0)
Row(id=3805, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=3806, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=3807, label=0.0, rawPrediction=DenseVector([1.0636, -1.0636]), probability=DenseVector([0.89

Row(id=5646, label=0.0, rawPrediction=DenseVector([0.8573, -0.8573]), probability=DenseVector([0.8474, 0.1526]), prediction=0.0)
Row(id=5647, label=1.0, rawPrediction=DenseVector([0.3309, -0.3309]), probability=DenseVector([0.6596, 0.3404]), prediction=0.0)
Row(id=5648, label=0.0, rawPrediction=DenseVector([1.051, -1.051]), probability=DenseVector([0.8911, 0.1089]), prediction=0.0)
Row(id=5649, label=0.0, rawPrediction=DenseVector([0.5886, -0.5886]), probability=DenseVector([0.7645, 0.2355]), prediction=0.0)
Row(id=5650, label=0.0, rawPrediction=DenseVector([1.0233, -1.0233]), probability=DenseVector([0.8856, 0.1144]), prediction=0.0)
Row(id=5651, label=0.0, rawPrediction=DenseVector([0.2721, -0.2721]), probability=DenseVector([0.6328, 0.3672]), prediction=0.0)
Row(id=5652, label=0.0, rawPrediction=DenseVector([0.9459, -0.9459]), probability=DenseVector([0.869, 0.131]), prediction=0.0)
Row(id=5653, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563

Row(id=7541, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=7542, label=0.0, rawPrediction=DenseVector([1.0636, -1.0636]), probability=DenseVector([0.8935, 0.1065]), prediction=0.0)
Row(id=7543, label=0.0, rawPrediction=DenseVector([1.0636, -1.0636]), probability=DenseVector([0.8935, 0.1065]), prediction=0.0)
Row(id=7544, label=0.0, rawPrediction=DenseVector([1.0636, -1.0636]), probability=DenseVector([0.8935, 0.1065]), prediction=0.0)
Row(id=7545, label=0.0, rawPrediction=DenseVector([1.0636, -1.0636]), probability=DenseVector([0.8935, 0.1065]), prediction=0.0)
Row(id=7546, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=7547, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=7548, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.

Row(id=9326, label=0.0, rawPrediction=DenseVector([1.0117, -1.0117]), probability=DenseVector([0.8832, 0.1168]), prediction=0.0)
Row(id=9327, label=1.0, rawPrediction=DenseVector([0.396, -0.396]), probability=DenseVector([0.6882, 0.3118]), prediction=0.0)
Row(id=9328, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=9329, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=9330, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=9331, label=0.0, rawPrediction=DenseVector([0.396, -0.396]), probability=DenseVector([0.6882, 0.3118]), prediction=0.0)
Row(id=9332, label=0.0, rawPrediction=DenseVector([0.4674, -0.4674]), probability=DenseVector([0.718, 0.282]), prediction=0.0)
Row(id=9333, label=0.0, rawPrediction=DenseVector([0.4883, -0.4883]), probability=DenseVector([0.7264, 

Row(id=11156, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=11157, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=11158, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=11159, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=11160, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=11161, label=0.0, rawPrediction=DenseVector([1.5254, -1.5254]), probability=DenseVector([0.9548, 0.0452]), prediction=0.0)
Row(id=11162, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=11163, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVe

Row(id=12948, label=0.0, rawPrediction=DenseVector([0.8586, -0.8586]), probability=DenseVector([0.8478, 0.1522]), prediction=0.0)
Row(id=12949, label=0.0, rawPrediction=DenseVector([1.5394, -1.5394]), probability=DenseVector([0.956, 0.044]), prediction=0.0)
Row(id=12950, label=0.0, rawPrediction=DenseVector([0.9052, -0.9052]), probability=DenseVector([0.8594, 0.1406]), prediction=0.0)
Row(id=12951, label=0.0, rawPrediction=DenseVector([0.9052, -0.9052]), probability=DenseVector([0.8594, 0.1406]), prediction=0.0)
Row(id=12952, label=0.0, rawPrediction=DenseVector([1.023, -1.023]), probability=DenseVector([0.8855, 0.1145]), prediction=0.0)
Row(id=12953, label=0.0, rawPrediction=DenseVector([0.9916, -0.9916]), probability=DenseVector([0.879, 0.121]), prediction=0.0)
Row(id=12954, label=0.0, rawPrediction=DenseVector([0.9916, -0.9916]), probability=DenseVector([0.879, 0.121]), prediction=0.0)
Row(id=12955, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.

Row(id=14799, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14800, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14801, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14802, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14803, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14804, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14805, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=14806, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVe

Row(id=16645, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=16646, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=16647, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=16648, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=16649, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=16650, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=16651, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=16652, label=0.0, rawPrediction=DenseVector([1.0636, -1.0636]), probability=DenseVe

Row(id=18450, label=0.0, rawPrediction=DenseVector([0.9246, -0.9246]), probability=DenseVector([0.864, 0.136]), prediction=0.0)
Row(id=18451, label=0.0, rawPrediction=DenseVector([0.9802, -0.9802]), probability=DenseVector([0.8766, 0.1234]), prediction=0.0)
Row(id=18452, label=0.0, rawPrediction=DenseVector([0.9364, -0.9364]), probability=DenseVector([0.8668, 0.1332]), prediction=0.0)
Row(id=18453, label=0.0, rawPrediction=DenseVector([1.0792, -1.0792]), probability=DenseVector([0.8964, 0.1036]), prediction=0.0)
Row(id=18454, label=0.0, rawPrediction=DenseVector([0.4761, -0.4761]), probability=DenseVector([0.7216, 0.2784]), prediction=0.0)
Row(id=18455, label=0.0, rawPrediction=DenseVector([1.2019, -1.2019]), probability=DenseVector([0.9171, 0.0829]), prediction=0.0)
Row(id=18456, label=1.0, rawPrediction=DenseVector([-0.6963, 0.6963]), probability=DenseVector([0.199, 0.801]), prediction=1.0)
Row(id=18457, label=1.0, rawPrediction=DenseVector([-1.5191, 1.5191]), probability=DenseVector

Row(id=20244, label=0.0, rawPrediction=DenseVector([1.0661, -1.0661]), probability=DenseVector([0.894, 0.106]), prediction=0.0)
Row(id=20245, label=0.0, rawPrediction=DenseVector([0.739, -0.739]), probability=DenseVector([0.8143, 0.1857]), prediction=0.0)
Row(id=20246, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=20247, label=0.0, rawPrediction=DenseVector([1.1001, -1.1001]), probability=DenseVector([0.9003, 0.0997]), prediction=0.0)
Row(id=20248, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=20249, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=20250, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=20251, label=0.0, rawPrediction=DenseVector([1.0863, -1.0863]), probability=DenseVector

Row(id=31091, label=0.0, rawPrediction=DenseVector([0.5054, -0.5054]), probability=DenseVector([0.7332, 0.2668]), prediction=0.0)
Row(id=31092, label=0.0, rawPrediction=DenseVector([0.4959, -0.4959]), probability=DenseVector([0.7294, 0.2706]), prediction=0.0)
Row(id=31093, label=0.0, rawPrediction=DenseVector([0.819, -0.819]), probability=DenseVector([0.8373, 0.1627]), prediction=0.0)
Row(id=31094, label=0.0, rawPrediction=DenseVector([0.8038, -0.8038]), probability=DenseVector([0.8331, 0.1669]), prediction=0.0)
Row(id=31095, label=1.0, rawPrediction=DenseVector([0.9216, -0.9216]), probability=DenseVector([0.8633, 0.1367]), prediction=0.0)
Row(id=31096, label=0.0, rawPrediction=DenseVector([0.9682, -0.9682]), probability=DenseVector([0.874, 0.126]), prediction=0.0)
Row(id=31097, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=31098, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVecto

Row(id=32911, label=0.0, rawPrediction=DenseVector([-0.255, 0.255]), probability=DenseVector([0.3752, 0.6248]), prediction=1.0)
Row(id=32912, label=0.0, rawPrediction=DenseVector([0.4036, -0.4036]), probability=DenseVector([0.6915, 0.3085]), prediction=0.0)
Row(id=32913, label=0.0, rawPrediction=DenseVector([1.0018, -1.0018]), probability=DenseVector([0.8812, 0.1188]), prediction=0.0)
Row(id=32914, label=0.0, rawPrediction=DenseVector([0.9495, -0.9495]), probability=DenseVector([0.8698, 0.1302]), prediction=0.0)
Row(id=32915, label=0.0, rawPrediction=DenseVector([0.7758, -0.7758]), probability=DenseVector([0.8251, 0.1749]), prediction=0.0)
Row(id=32916, label=0.0, rawPrediction=DenseVector([0.7837, -0.7837]), probability=DenseVector([0.8274, 0.1726]), prediction=0.0)
Row(id=32917, label=0.0, rawPrediction=DenseVector([0.7309, -0.7309]), probability=DenseVector([0.8118, 0.1882]), prediction=0.0)
Row(id=32918, label=1.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVect

Row(id=34827, label=0.0, rawPrediction=DenseVector([-0.3923, 0.3923]), probability=DenseVector([0.3134, 0.6866]), prediction=1.0)
Row(id=34828, label=0.0, rawPrediction=DenseVector([0.7674, -0.7674]), probability=DenseVector([0.8227, 0.1773]), prediction=0.0)
Row(id=34829, label=0.0, rawPrediction=DenseVector([1.5173, -1.5173]), probability=DenseVector([0.9541, 0.0459]), prediction=0.0)
Row(id=34830, label=0.0, rawPrediction=DenseVector([1.5173, -1.5173]), probability=DenseVector([0.9541, 0.0459]), prediction=0.0)
Row(id=34831, label=0.0, rawPrediction=DenseVector([1.5173, -1.5173]), probability=DenseVector([0.9541, 0.0459]), prediction=0.0)
Row(id=34832, label=1.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=34833, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=34834, label=0.0, rawPrediction=DenseVector([0.4127, -0.4127]), probability=DenseVe

Row(id=36518, label=1.0, rawPrediction=DenseVector([0.4734, -0.4734]), probability=DenseVector([0.7205, 0.2795]), prediction=0.0)
Row(id=36519, label=0.0, rawPrediction=DenseVector([1.1224, -1.1224]), probability=DenseVector([0.9042, 0.0958]), prediction=0.0)
Row(id=36520, label=0.0, rawPrediction=DenseVector([0.3886, -0.3886]), probability=DenseVector([0.6851, 0.3149]), prediction=0.0)
Row(id=36521, label=1.0, rawPrediction=DenseVector([0.3639, -0.3639]), probability=DenseVector([0.6743, 0.3257]), prediction=0.0)
Row(id=36522, label=1.0, rawPrediction=DenseVector([-0.2145, 0.2145]), probability=DenseVector([0.3944, 0.6056]), prediction=1.0)
Row(id=36523, label=0.0, rawPrediction=DenseVector([1.0764, -1.0764]), probability=DenseVector([0.8959, 0.1041]), prediction=0.0)
Row(id=36524, label=1.0, rawPrediction=DenseVector([0.1744, -0.1744]), probability=DenseVector([0.5863, 0.4137]), prediction=0.0)
Row(id=36525, label=1.0, rawPrediction=DenseVector([0.3704, -0.3704]), probability=DenseVe

Row(id=38269, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38270, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38271, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38272, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38273, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38274, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38275, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=38276, label=0.0, rawPrediction=DenseVector([0.4344, -0.4344]), probability=DenseVe

Row(id=40083, label=0.0, rawPrediction=DenseVector([0.5028, -0.5028]), probability=DenseVector([0.7321, 0.2679]), prediction=0.0)
Row(id=40084, label=0.0, rawPrediction=DenseVector([1.074, -1.074]), probability=DenseVector([0.8955, 0.1045]), prediction=0.0)
Row(id=40085, label=0.0, rawPrediction=DenseVector([0.7562, -0.7562]), probability=DenseVector([0.8194, 0.1806]), prediction=0.0)
Row(id=40086, label=0.0, rawPrediction=DenseVector([0.7562, -0.7562]), probability=DenseVector([0.8194, 0.1806]), prediction=0.0)
Row(id=40087, label=0.0, rawPrediction=DenseVector([1.5173, -1.5173]), probability=DenseVector([0.9541, 0.0459]), prediction=0.0)
Row(id=40088, label=0.0, rawPrediction=DenseVector([1.5173, -1.5173]), probability=DenseVector([0.9541, 0.0459]), prediction=0.0)
Row(id=40089, label=0.0, rawPrediction=DenseVector([1.1105, -1.1105]), probability=DenseVector([0.9021, 0.0979]), prediction=0.0)
Row(id=40090, label=0.0, rawPrediction=DenseVector([0.4777, -0.4777]), probability=DenseVect

Row(id=41877, label=0.0, rawPrediction=DenseVector([0.8146, -0.8146]), probability=DenseVector([0.8361, 0.1639]), prediction=0.0)
Row(id=41878, label=0.0, rawPrediction=DenseVector([0.8126, -0.8126]), probability=DenseVector([0.8355, 0.1645]), prediction=0.0)
Row(id=41879, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=41880, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=41881, label=0.0, rawPrediction=DenseVector([0.8735, -0.8735]), probability=DenseVector([0.8516, 0.1484]), prediction=0.0)
Row(id=41882, label=1.0, rawPrediction=DenseVector([0.8735, -0.8735]), probability=DenseVector([0.8516, 0.1484]), prediction=0.0)
Row(id=41883, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=41884, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVe

Row(id=43680, label=0.0, rawPrediction=DenseVector([0.7777, -0.7777]), probability=DenseVector([0.8257, 0.1743]), prediction=0.0)
Row(id=43681, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=43682, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=43683, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=43684, label=0.0, rawPrediction=DenseVector([1.083, -1.083]), probability=DenseVector([0.8971, 0.1029]), prediction=0.0)
Row(id=43685, label=1.0, rawPrediction=DenseVector([-0.1307, 0.1307]), probability=DenseVector([0.435, 0.565]), prediction=1.0)
Row(id=43686, label=0.0, rawPrediction=DenseVector([-0.0543, 0.0543]), probability=DenseVector([0.4729, 0.5271]), prediction=1.0)
Row(id=43687, label=0.0, rawPrediction=DenseVector([1.5394, -1.5394]), probability=DenseVector

Row(id=45532, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=45533, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=45534, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=45535, label=0.0, rawPrediction=DenseVector([1.1483, -1.1483]), probability=DenseVector([0.9086, 0.0914]), prediction=0.0)
Row(id=45536, label=0.0, rawPrediction=DenseVector([0.3451, -0.3451]), probability=DenseVector([0.666, 0.334]), prediction=0.0)
Row(id=45537, label=0.0, rawPrediction=DenseVector([0.868, -0.868]), probability=DenseVector([0.8502, 0.1498]), prediction=0.0)
Row(id=45538, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=45539, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector

Row(id=47396, label=1.0, rawPrediction=DenseVector([-0.7504, 0.7504]), probability=DenseVector([0.1823, 0.8177]), prediction=1.0)
Row(id=47397, label=0.0, rawPrediction=DenseVector([0.8382, -0.8382]), probability=DenseVector([0.8424, 0.1576]), prediction=0.0)
Row(id=47398, label=0.0, rawPrediction=DenseVector([-0.7289, 0.7289]), probability=DenseVector([0.1888, 0.8112]), prediction=1.0)
Row(id=47399, label=0.0, rawPrediction=DenseVector([0.1735, -0.1735]), probability=DenseVector([0.5859, 0.4141]), prediction=0.0)
Row(id=47400, label=0.0, rawPrediction=DenseVector([0.5508, -0.5508]), probability=DenseVector([0.7506, 0.2494]), prediction=0.0)
Row(id=47401, label=0.0, rawPrediction=DenseVector([-0.8358, 0.8358]), probability=DenseVector([0.1582, 0.8418]), prediction=1.0)
Row(id=47402, label=0.0, rawPrediction=DenseVector([1.0762, -1.0762]), probability=DenseVector([0.8959, 0.1041]), prediction=0.0)
Row(id=47403, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVe

Row(id=49274, label=1.0, rawPrediction=DenseVector([-0.1725, 0.1725]), probability=DenseVector([0.4146, 0.5854]), prediction=1.0)
Row(id=49275, label=0.0, rawPrediction=DenseVector([0.4803, -0.4803]), probability=DenseVector([0.7233, 0.2767]), prediction=0.0)
Row(id=49276, label=0.0, rawPrediction=DenseVector([-0.8278, 0.8278]), probability=DenseVector([0.1604, 0.8396]), prediction=1.0)
Row(id=49277, label=1.0, rawPrediction=DenseVector([0.7722, -0.7722]), probability=DenseVector([0.8241, 0.1759]), prediction=0.0)
Row(id=49278, label=1.0, rawPrediction=DenseVector([-0.3069, 0.3069]), probability=DenseVector([0.3512, 0.6488]), prediction=1.0)
Row(id=49279, label=1.0, rawPrediction=DenseVector([-0.6937, 0.6937]), probability=DenseVector([0.1998, 0.8002]), prediction=1.0)
Row(id=49280, label=1.0, rawPrediction=DenseVector([0.0968, -0.0968]), probability=DenseVector([0.5482, 0.4518]), prediction=0.0)
Row(id=49281, label=1.0, rawPrediction=DenseVector([0.0563, -0.0563]), probability=DenseVe

Row(id=51120, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51121, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51122, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51123, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51124, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51125, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51126, label=0.0, rawPrediction=DenseVector([1.5434, -1.5434]), probability=DenseVector([0.9563, 0.0437]), prediction=0.0)
Row(id=51127, label=0.0, rawPrediction=DenseVector([0.998, -0.998]), probability=DenseVect

In [149]:
predict_2018.select('id', 'label', 'rawPrediction', 'probability', 'prediction').show(10)

+---+-----+--------------------+--------------------+----------+
| id|label|       rawPrediction|         probability|prediction|
+---+-----+--------------------+--------------------+----------+
|  0|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  1|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  2|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  3|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  4|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  5|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  6|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  7|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  8|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  9|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
+---+-----+--------------------+--------------------+----------+
only showing top 10 rows
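
The `rawPrediction` and `probability` columns shown above are linked by a fixed transform. The following is a minimal sketch, assuming the fitted model is Spark's `GBTClassifier` with its default logistic loss (the notebook title mentions GBT, but the fitting cell is not part of this section). Under that assumption each probability entry is `1 / (1 + exp(-2 * raw))` of the matching `rawPrediction` entry. The helper name `margin_to_probability` is hypothetical.

```python
import math


def margin_to_probability(raw_k):
    """Map one rawPrediction entry to the matching probability entry.

    Assumption: logistic-loss link p = 1 / (1 + exp(-2 * raw)).
    """
    return 1.0 / (1.0 + math.exp(-2.0 * raw_k))


# rawPrediction of the rows shown above, rounded to 4 decimals
raw = [1.5434, -1.5434]
prob = [margin_to_probability(r) for r in raw]

# The two entries are complementary, so they sum to 1
print([round(p, 4) for p in prob])
```

With these inputs the sketch reproduces the printed pair `[0.9563, 0.0437]`, which supports (but does not prove) the assumption about the model type.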



In [150]:
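# predict_2018 holds the model's predictions for the 2018 hold-out year,
# so this AUC measures performance on data not used for training
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predict_2018,
                       {evaluator.metricName: "areaUnderROC"})))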

Test Area Under ROC: 0.7742939707280319
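
For intuition about what `areaUnderROC` measures: it equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The block below is a minimal pure-Python sketch of that pairwise definition, not the way Spark computes the metric internally. The function name `pairwise_auc` is hypothetical, and the seven (label, P(class 1)) pairs are copied from the rows with ids 49274 to 49280 printed earlier. Seven rows are far too few to be representative of the value reported above.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# (label, probability of class 1) for ids 49274..49280
labels = [1, 0, 0, 1, 1, 1, 1]
scores = [0.5854, 0.2767, 0.8396, 0.1759, 0.6488, 0.8002, 0.4518]

auc = pairwise_auc(labels, scores)
print(auc)
```

The nested loop is quadratic in the number of rows, so it only suits small samples like this one.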


In [151]:
# Accuracy = share of 2018 rows whose prediction equals the label
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predict_2018, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.8942708911770773


In [146]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.105729


In [152]:
predict_2018.groupBy("prediction").count().show(truncate=False)

+----------+-----+
|prediction|count|
+----------+-----+
|0.0       |58578|
|1.0       |4451 |
+----------+-----+



In [153]:
predict_2018.groupBy("label").count().show(truncate=False)

+-----+-----+
|label|count|
+-----+-----+
|0.0  |56342|
|1.0  |6687 |
+-----+-----+
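
The accuracy and the two count tables above are enough to recover the full 2018 confusion matrix. The block below is a minimal sketch of that arithmetic, assuming accuracy is the fraction of correctly classified rows, so that `accuracy * total` is an integer up to rounding. All variable names are illustrative.

```python
# Figures copied from the cell outputs above
accuracy = 0.8942708911770773
pred_neg, pred_pos = 58578, 4451      # counts by prediction
label_neg, label_pos = 56342, 6687    # counts by label
total = label_neg + label_pos

# Correctly classified rows: TP + TN
correct = round(accuracy * total)

# TN = label_neg - FP and FP = pred_pos - TP,
# hence TP + TN = 2 * TP + label_neg - pred_pos
tp = (correct - label_neg + pred_pos) // 2
fp = pred_pos - tp
fn = label_pos - tp
tn = label_neg - fp

precision = tp / pred_pos        # share of predicted crashes that were real
recall = tp / label_pos          # share of real crashes that were predicted
baseline = label_neg / total     # accuracy of always predicting class 0

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} baseline={baseline:.4f}")
```

Under these assumptions the sketch gives TP = 2237, FP = 2214, FN = 4450 and TN = 54128, that is, a precision of about 0.50 and a recall of about 0.33 on the crash class. Always predicting "no crash" would already reach about 0.894 accuracy, so accuracy alone says little about this model.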



In [154]:
Crash_Prevision_2018 = predict_2018.select('id', 'X', 'Y', 'label', 'rawPrediction', 'probability', 'prediction')

In [155]:
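# toPandas() collects every row to the driver; the vector columns
# (rawPrediction, probability) are written to the CSV in their string form
Crash_Prevision_2018.toPandas().to_csv("../model/Crash_Prevision_2018.csv")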

In [157]:
spark.stop()

---