# Spatial predictive model of traffic crashes on the streets of Santiago
**The 2013-2017 dataset is split into train/test sets (80%/20%), and the resulting model is validated with the 2018 dataset.**
- UDD - MDS18 - BDA
- Final Delivery 
- 40_Final_Geo_Project_PySpark_MLib_RF_GBT_Models_Gr100_dataset-2013-17_val_18
- 09 August 2019

**The dataset built from 2013-2017 data is loaded as `df`.**

Developed based on:
- Big Data Analytics, Oscar Peredo 
- [PySpark ML and XGBoost full integration tested on the Kaggle Titanic dataset](https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb)
- [Machine Learning with PySpark and MLlib — Solving a Binary Classification Problem](https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa)
- [Build an end-to-end Machine Learning Model with MLlib in pySpark.](https://towardsdatascience.com/build-an-end-to-end-machine-learning-model-with-mllib-in-pyspark-4917bdf289c5)

## Preliminary Installation

- findspark: PySpark isn't on `sys.path` by default, but it can be used as a regular library. This can be addressed either by symlinking pyspark into your site-packages or by adding pyspark to `sys.path` at runtime, which is what [findspark](https://github.com/minrk/findspark) does.


## General functions

In [117]:
def count_missings(spark_df):
    """Return (column name, null count) pairs for columns with at least one null."""
    null_counts = []
    for cname, _ctype in spark_df.dtypes:
        # One Spark job per column: count the rows where this column is null
        nulls = spark_df.where(spark_df[cname].isNull()).count()
        null_counts.append((cname, nulls))
    # Keep only the columns that actually have missing values
    return [(name, n) for (name, n) in null_counts if n != 0]

In [118]:
def find_num_cat_features(df):
    """Split column names into numerical and categorical lists by Spark dtype."""
    cat_cols = [item[0] for item in df.dtypes if item[1].startswith('string')]
    print(str(len(cat_cols)) + '  categorical features')

    num_cols = [
        item[0] for item in df.dtypes
        if item[1].startswith('int') or item[1].startswith('double')
    ][1:]  # skip the first numeric column (the CSV row index `_c0`)
    print(str(len(num_cols)) + '  numerical features')
    return num_cols, cat_cols
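
The prefix test in `find_num_cat_features` works on the `(name, dtype)` pairs that `df.dtypes` returns. The sketch below replays the same logic on a hand-written list, so it runs without Spark. The column names `street_type` and `big_counter` are hypothetical, and the sketch assumes Spark's short type names (`int`, `bigint`, `double`, `string`); under that assumption a `bigint` column would not be counted as numerical.

```python
def split_by_dtype(dtypes):
    """Mimic find_num_cat_features on a plain list of (name, dtype) pairs."""
    cat_cols = [name for name, dtype in dtypes if dtype.startswith('string')]
    num_cols = [
        name for name, dtype in dtypes
        if dtype.startswith('int') or dtype.startswith('double')
    ][1:]  # drop the first numeric column (the CSV index `_c0` in this notebook)
    return num_cols, cat_cols


sample_dtypes = [
    ('_c0', 'int'), ('id', 'int'), ('X', 'double'),
    ('street_type', 'string'),   # hypothetical categorical column
    ('big_counter', 'bigint'),   # hypothetical; not matched by the prefix test
    ('SINIESTRO', 'int'),
]
num_cols, cat_cols = split_by_dtype(sample_dtypes)
print(num_cols)
print(cat_cols)
```

If the real data ever contained `bigint` or `float` columns, the prefix list would need to be extended accordingly.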

In [119]:
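def weight_balance(labels, ratio):
    """Weight column: `ratio` for positive rows, `1 - ratio` for the rest.

    Requires `when` from pyspark.sql.functions (imported below).
    """
    return when(labels == 1, ratio).otherwise(1 - ratio)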

## Libraries and pySpark initialization

In [120]:
import numpy as np
import pandas as pd

In [121]:
import findspark
findspark.init()

In [122]:
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import GBTClassificationModel
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.x renamed it to OneHotEncoder
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.functions import rank, sum, col, mean, round, when

In [123]:
spark = SparkSession\
        .builder\
        .appName("PySpark GBT Grid 100m Data Siniestros 2013-2017")\
        .getOrCreate()

## Dataset - Generated from Notebook: 
- 30_Final_Geo_Project_Final Datasets_Creation

In [76]:
!ls ../data/

CONASET                              geo_stgo_100_crash_test_dataset.csv
OSM_Chile                            geo_stgo_100_crash_train_dataset.csv
final_test_dataset_grid_100.csv      geo_stgo_100_estatic_dataset.csv
final_train_dataset_grid_100.csv


In [77]:
df = spark.read.csv('../data/final_train_dataset_grid_100.csv',
                    header=True,
                    inferSchema=True)
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: integer (nullable = true)
 |-- motorway_junction: integer (nullable = true)
 |-- parking: integer (nullable = true)
 |-- parking_bicycle: integer (nullable = tr

## Verifying dataset balance

In [78]:
num_cols, cat_cols = find_num_cat_features(df)

0  categorical features
54  numerical features


In [79]:
df.groupby('SINIESTRO').count().toPandas()

| | SINIESTRO | count |
|---|---|---|
| 0 | 1 | 17579 |
| 1 | 0 | 45450 |


`The classes are imbalanced: about 72% of the grid cells have no crash (SINIESTRO = 0) and 28% have at least one (SINIESTRO = 1). Some compensation may be worth trying; a weight column did not prove feasible.`

**The `weight_balance` function defined under General functions is meant for adding a weight column based on these ratios, if needed.**
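
As a rough illustration of how such a weight column would rebalance the classes, the sketch below applies the same rule as `weight_balance` to the class counts shown above, in plain Python. It assumes `ratio` is the share of the majority (no-crash) class, which the notebook does not state explicitly; `class_weight` is a hypothetical helper.

```python
# Class counts from the groupby output above.
n_pos = 17579   # SINIESTRO == 1
n_neg = 45450   # SINIESTRO == 0
total = n_pos + n_neg

# Assumption: ratio is the fraction of negative (majority) rows.
ratio = n_neg / total


def class_weight(label, ratio):
    """Same rule as weight_balance, for a single plain-Python label."""
    return ratio if label == 1 else 1.0 - ratio


w_pos = class_weight(1, ratio)
w_neg = class_weight(0, ratio)

# With these weights both classes carry the same total weight.
print(round(w_pos, 4), round(w_neg, 4))
print(n_pos * w_pos, n_neg * w_neg)
```

Under this assumption the rarer class gets the larger weight, and the weighted class totals come out equal.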

## Preparing Data for Machine Learning

 Verify numerical features:

In [80]:
numericCols = num_cols[1:-1] # Taking out id and Target variable
numericCols

['X',
 'Y',
 'bank',
 'bench',
 'beverages',
 'bus_stop',
 'bus_stop_100',
 'cafe',
 'convenience',
 'convenience_100',
 'convenience_200',
 'crossing',
 'crossing_100',
 'fast_food',
 'fast_food_100',
 'fast_food_200',
 'fuel',
 'intercect',
 'kindergarten',
 'motorway_junction',
 'parking',
 'parking_bicycle',
 'pharmacy',
 'railway_station',
 'railway_station_100',
 'restaurant',
 'restaurant_100',
 'school',
 'school_100',
 'school_200',
 'stop',
 'stop_100',
 'taxi',
 'traffic_signals',
 'traffic_signals_100',
 'turning_circle',
 'ATROPELLO_100',
 'ATROPELLO_200',
 'CAIDA_100',
 'CAIDA_200',
 'CHOQUE_100',
 'CHOQUE_200',
 'COLISION_100',
 'COLISION_200',
 'INCENDIO_100',
 'INCENDIO_200',
 'OTRO TIPO_100',
 'OTRO TIPO_200',
 'SEV_Index_100',
 'SEV_Index_200',
 'VOLCADURA_100',
 'VOLCADURA_200']

Category indexing, one-hot encoding, and `VectorAssembler`, a feature transformer that merges multiple columns into a single vector column:

In [81]:
categoricalColumns = cat_cols
cols = df.columns
stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
    
label_stringIdx = StringIndexer(inputCol='SINIESTRO', outputCol='label')
stages += [label_stringIdx]

# Assemble the columns into a feature vector
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
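
To make these three stages concrete, here is a plain-Python sketch of what they do to a single row. It mimics the default behaviour of the Spark transformers as far as it is assumed here: `StringIndexer` assigns index 0 to the most frequent value, and the one-hot encoder drops the last category. The `street` column and the numeric values are hypothetical, since this dataset has no categorical features.

```python
from collections import Counter


def string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {value: float(i) for i, value in enumerate(ordered)}


def one_hot(index, n_categories):
    """One-hot vector that drops the last category (it becomes all zeros)."""
    vec = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vec[int(index)] = 1.0
    return vec


# Label column: 0 is the majority class, so it is indexed as 0.0.
label_map = string_index([0] * 45450 + [1] * 17579)

# Hypothetical categorical column (the project data has none).
street = ['calle', 'avenida', 'calle', 'autopista', 'calle', 'avenida']
street_map = string_index(street)

# Assemble: one-hot block followed by the numeric columns of the row.
row_numeric = [3.0, 1.0]  # illustrative numeric values
features = one_hot(street_map['avenida'], len(street_map)) + row_numeric
print(label_map, street_map, features)
```

Because the label index depends on class frequency, `SINIESTRO = 1` maps to `label = 1.0` only while crashes remain the minority class.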

In [82]:
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
df.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: integer (nullable = true)
 |-- motorway_junction: integer (nullable = true)
 |-- p

In [83]:
train, test = df.randomSplit([0.8, 0.2], seed=24)

## RF Model

In [84]:
rf = RandomForestClassifier(featuresCol='features', labelCol='label')
rfModel = rf.fit(train)
predictions = rfModel.transform(test)

In [85]:
predictions.select('id', 'label', 'rawPrediction', 'prediction',
                   'probability').show(10)

+---+-----+--------------------+----------+--------------------+
| id|label|       rawPrediction|prediction|         probability|
+---+-----+--------------------+----------+--------------------+
|  0|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
|  3|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 14|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 18|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 23|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 33|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 34|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 44|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 56|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
| 54|  0.0|[19.3985287329805...|       0.0|[0.96992643664902...|
+---+-----+--------------------+----------+--------------------+
only showing top 10 rows



In [86]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8294667464568877
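
The area under the ROC curve can be read as the probability that a randomly chosen crash cell receives a higher score than a randomly chosen no-crash cell. The sketch below computes it that way on toy data (not the project's predictions); `auc_by_pairs` is a hypothetical helper, and Spark's evaluator derives the value from the ROC curve rather than by enumerating pairs.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auc_by_pairs(toy_scores, toy_labels)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Unlike accuracy, this measure does not depend on the 0.5 decision threshold, which makes it more informative on imbalanced data.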


In [88]:
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.7782577959311916


In [89]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.221742


## GBT Model

In [90]:
gbt = GBTClassifier(maxIter=10)

In [91]:
gbtModel = gbt.fit(train)

In [92]:
predictions = gbtModel.transform(test)

In [93]:
predictions.select('id', 'label', 'rawPrediction', 'prediction',
                   'probability').show(5)

+---+-----+--------------------+----------+--------------------+
| id|label|       rawPrediction|prediction|         probability|
+---+-----+--------------------+----------+--------------------+
|  0|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
|  3|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
| 14|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
| 18|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
| 23|  0.0|[1.32412711336971...|       0.0|[0.93390330915520...|
+---+-----+--------------------+----------+--------------------+
only showing top 5 rows



In [94]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predictions, 
                       {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8383690618564904


In [95]:
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.7879978006440971


In [96]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.212002


In [97]:
gbtModel.featureImportances

SparseVector(52, {0: 0.0114, 1: 0.0286, 2: 0.0003, 3: 0.0007, 4: 0.0047, 5: 0.056, 6: 0.0233, 7: 0.0005, 8: 0.0088, 9: 0.0011, 10: 0.0009, 11: 0.09, 12: 0.0158, 13: 0.0023, 15: 0.0008, 16: 0.0016, 17: 0.0717, 18: 0.0006, 22: 0.0132, 23: 0.0005, 25: 0.0156, 26: 0.0168, 29: 0.0008, 33: 0.047, 34: 0.0211, 36: 0.0353, 37: 0.0057, 38: 0.0089, 39: 0.0009, 40: 0.0026, 41: 0.0091, 42: 0.4391, 43: 0.0127, 44: 0.0015, 45: 0.0005, 46: 0.0147, 47: 0.0041, 48: 0.0168, 49: 0.0, 50: 0.0138})

Helper function to map feature importances back to column names, adapted from:
- https://www.timlrx.com/2018/06/19/feature-selection-using-feature-importance-score-creating-a-pyspark-estimator/

In [98]:
def ExtractFeatureImp(featureImp, dataset, featuresCol):
    list_extract = []
    for i in dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]:
        list_extract = list_extract + dataset.schema[featuresCol].metadata["ml_attr"]["attrs"][i]
    varlist = pd.DataFrame(list_extract)
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[x])
    return(varlist.sort_values('score', ascending = False))

In [99]:
ExtractFeatureImp(
    gbtModel.featureImportances, 
    df, "features").head(20)

| | idx | name | score |
|---|---|---|---|
| 42 | 42 | COLISION_100 | 0.439067 |
| 11 | 11 | crossing | 0.090049 |
| 17 | 17 | intercect | 0.071686 |
| 5 | 5 | bus_stop | 0.055971 |
| 33 | 33 | traffic_signals | 0.047034 |
| 36 | 36 | ATROPELLO_100 | 0.035349 |
| 1 | 1 | Y | 0.028571 |
| 6 | 6 | bus_stop_100 | 0.023284 |
| 34 | 34 | traffic_signals_100 | 0.021061 |
| 48 | 48 | SEV_Index_100 | 0.016812 |
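
Since the dataset has no categorical columns, the feature vector appears to follow the order of `numericCols`, so index `i` in the importance vector should correspond to `numericCols[i]`. The sketch below checks that reading in plain Python, using the column list printed earlier and a few entries copied from the `SparseVector` output; it is an illustration, not a replacement for `ExtractFeatureImp`.

```python
# Column order as printed for numericCols (52 names).
feature_names = [
    'X', 'Y', 'bank', 'bench', 'beverages', 'bus_stop', 'bus_stop_100', 'cafe',
    'convenience', 'convenience_100', 'convenience_200', 'crossing',
    'crossing_100', 'fast_food', 'fast_food_100', 'fast_food_200', 'fuel',
    'intercect', 'kindergarten', 'motorway_junction', 'parking',
    'parking_bicycle', 'pharmacy', 'railway_station', 'railway_station_100',
    'restaurant', 'restaurant_100', 'school', 'school_100', 'school_200',
    'stop', 'stop_100', 'taxi', 'traffic_signals', 'traffic_signals_100',
    'turning_circle', 'ATROPELLO_100', 'ATROPELLO_200', 'CAIDA_100',
    'CAIDA_200', 'CHOQUE_100', 'CHOQUE_200', 'COLISION_100', 'COLISION_200',
    'INCENDIO_100', 'INCENDIO_200', 'OTRO TIPO_100', 'OTRO TIPO_200',
    'SEV_Index_100', 'SEV_Index_200', 'VOLCADURA_100', 'VOLCADURA_200',
]

# A few (index: score) entries copied from the printed SparseVector.
sparse_importances = {5: 0.056, 11: 0.09, 17: 0.0717,
                      33: 0.047, 36: 0.0353, 42: 0.4391}

# Attach names by position and sort by descending importance.
ranked = sorted(
    ((feature_names[i], score) for i, score in sparse_importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:20s} {score:.4f}")
```

The resulting order matches the table above, with `COLISION_100` (collisions recorded within 100 m in the previous period) carrying by far the largest share.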


## Tuning The GBT Model

From a run of H2O AutoML, the best Gradient Boosting Machine hyperparameters were:
- number_of_trees: 58
- number_of_internal_trees: 58
- model_size_in_bytes: 45603
- min_depth: 6
- max_depth: 6
- mean_depth: 6.0
- min_leaves: 42
- max_leaves: 64
- mean_leaves: 57.9310

In [100]:
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [4, 6])
             .addGrid(gbt.maxBins, [40, 60, 70])
             .addGrid(gbt.maxIter, [10, 20])
             .build())

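# `evaluator` here is the MulticlassClassificationEvaluator created in the
# previous evaluation cell (default metric: f1), so model selection uses f1.
cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)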

# Run cross validations.  This can take about 7.3 minutes!
cvModel = cv.fit(train)
predictions = cvModel.transform(test)
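
The running time follows from the size of the search. A small sketch, using the same grid values, counts the parameter combinations and the model fits they imply. It assumes that Spark's `CrossValidator` refits the best combination once on the full training set after the folds are done.

```python
from itertools import product

# Same values as the ParamGridBuilder call above.
grid = {
    'maxDepth': [4, 6],
    'maxBins': [40, 60, 70],
    'maxIter': [10, 20],
}
num_folds = 5

# Cartesian product of all parameter values.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
cv_fits = len(combos) * num_folds
total_fits = cv_fits + 1  # assumed: one final refit of the best combination

print(len(combos), cv_fits, total_fits)
```

Each extra value added to any one parameter multiplies the number of fits, so the grid is best widened one parameter at a time.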

In [101]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predictions, 
                       {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.841362042678081


In [102]:
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.7924750608750295


In [103]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.207525


In [None]:
selected = predictions.select('id', 'label', 'rawPrediction', 'probability', 'prediction')
for row in selected.collect():
    print(row)

In [105]:
predictions.select('id', 'label', 'rawPrediction', 'probability', 'prediction').show(10)

+---+-----+--------------------+--------------------+----------+
| id|label|       rawPrediction|         probability|prediction|
+---+-----+--------------------+--------------------+----------+
|  0|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  3|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 14|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 18|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 23|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 33|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 34|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 44|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 56|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
| 54|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
+---+-----+--------------------+--------------------+----------+
only showing top 10 rows



In [106]:
bestModel = cvModel.bestModel

In [107]:
bestModel

GBTClassificationModel (uid=GBTClassifier_95ac8c178565) with 20 trees

In [108]:
bestModel.write().overwrite().save('../model/GeoProjectBestModel_1.model')

## Validating Best Model with 2018 dataset
This dataset is used for validation:
- The actual crash events (the target) are from 2018.
- The dynamic features (`type_100` and `type_200`, e.g. `CHOQUE_100`) come from the previous year (2017).

In [134]:
valModel = GBTClassificationModel.load("../model/GeoProjectBestModel_1.model")

Loading the dataset with 2018 data:

In [135]:
df_2018 = spark.read.csv('../data/final_test_dataset_grid_100.csv',
                    header=True,
                    inferSchema=True)
df_2018.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: integer (nullable = true)
 |-- motorway_junction: integer (nullable = true)
 |-- parking: integer (nullable = true)
 |-- parking_bicycle: integer (nullable = tr

In [136]:
num_cols, cat_cols = find_num_cat_features(df_2018)

0  categorical features
54  numerical features


In [137]:
df_2018.groupby('SINIESTRO').count().toPandas()

| | SINIESTRO | count |
|---|---|---|
| 0 | 1 | 6687 |
| 1 | 0 | 56342 |


`Note that in this dataset about 10.6% of the grid cells have crash events and 89.4% have none.`
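
This class balance sets the reference point for any accuracy figure on the 2018 data. The short calculation below, using the counts from the table above, gives the accuracy of a trivial classifier that always predicts "no crash".

```python
# Counts from the 2018 groupby output above.
n_crash = 6687
n_no_crash = 56342
total = n_crash + n_no_crash

crash_share = n_crash / total
baseline_accuracy = n_no_crash / total  # always predict "no crash"

print(f"crash share: {crash_share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

An accuracy close to this baseline says little about how well crash cells are detected, so threshold-independent metrics such as area under ROC are the more useful comparison here.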

In [138]:
numericCols = num_cols[1:-1] # Taking out id and Target variable
numericCols

['X',
 'Y',
 'bank',
 'bench',
 'beverages',
 'bus_stop',
 'bus_stop_100',
 'cafe',
 'convenience',
 'convenience_100',
 'convenience_200',
 'crossing',
 'crossing_100',
 'fast_food',
 'fast_food_100',
 'fast_food_200',
 'fuel',
 'intercect',
 'kindergarten',
 'motorway_junction',
 'parking',
 'parking_bicycle',
 'pharmacy',
 'railway_station',
 'railway_station_100',
 'restaurant',
 'restaurant_100',
 'school',
 'school_100',
 'school_200',
 'stop',
 'stop_100',
 'taxi',
 'traffic_signals',
 'traffic_signals_100',
 'turning_circle',
 'ATROPELLO_100',
 'ATROPELLO_200',
 'CAIDA_100',
 'CAIDA_200',
 'CHOQUE_100',
 'CHOQUE_200',
 'COLISION_100',
 'COLISION_200',
 'INCENDIO_100',
 'INCENDIO_200',
 'OTRO TIPO_100',
 'OTRO TIPO_200',
 'SEV_Index_100',
 'SEV_Index_200',
 'VOLCADURA_100',
 'VOLCADURA_200']

In [139]:
categoricalColumns = cat_cols
cols = df_2018.columns  # columns of the 2018 frame (`df` already carries label/features)
stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='SINIESTRO', outputCol='label')
stages += [label_stringIdx]

assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df_2018)
df_2018 = pipelineModel.transform(df_2018)
selectedCols = ['label', 'features'] + cols
df_2018 = df_2018.select(selectedCols)
df_2018.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- bank: integer (nullable = true)
 |-- bench: integer (nullable = true)
 |-- beverages: integer (nullable = true)
 |-- bus_stop: integer (nullable = true)
 |-- bus_stop_100: integer (nullable = true)
 |-- cafe: integer (nullable = true)
 |-- convenience: integer (nullable = true)
 |-- convenience_100: integer (nullable = true)
 |-- convenience_200: integer (nullable = true)
 |-- crossing: integer (nullable = true)
 |-- crossing_100: integer (nullable = true)
 |-- fast_food: integer (nullable = true)
 |-- fast_food_100: integer (nullable = true)
 |-- fast_food_200: integer (nullable = true)
 |-- fuel: integer (nullable = true)
 |-- intercect: integer (nullable = true)
 |-- kindergarten: inte

In [147]:
predict_2018 = valModel.transform(df_2018)

In [None]:
selected = predict_2018.select('id', 'label', 'rawPrediction', 'probability', 'prediction')
for row in selected.collect():
    print(row)

In [149]:
predict_2018.select('id', 'label', 'rawPrediction', 'probability', 'prediction').show(10)

+---+-----+--------------------+--------------------+----------+
| id|label|       rawPrediction|         probability|prediction|
+---+-----+--------------------+--------------------+----------+
|  0|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  1|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  2|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  3|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  4|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  5|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  6|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  7|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  8|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
|  9|  0.0|[1.54341621871724...|[0.95634631650520...|       0.0|
+---+-----+--------------------+--------------------+----------+
only showing top 10 rows



In [150]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(
    evaluator.evaluate(predict_2018, 
                       {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.7742939707280319


In [151]:
evaluator = MulticlassClassificationEvaluator()
accuracy = evaluator.evaluate(predict_2018, {evaluator.metricName: "accuracy"})
print("Accuracy: " + str(accuracy))

Accuracy: 0.8942708911770773


In [146]:
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.105729


In [152]:
predict_2018.select("prediction").groupBy("prediction").count().show(truncate=False)

+----------+-----+
|prediction|count|
+----------+-----+
|0.0       |58578|
|1.0       |4451 |
+----------+-----+



In [153]:
predict_2018.select("label").groupBy("label").count().show(truncate=False)

+-----+-----+
|label|count|
+-----+-----+
|0.0  |56342|
|1.0  |6687 |
+-----+-----+
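
The two count tables and the reported accuracy are enough to reconstruct the confusion matrix for 2018. The sketch below does that arithmetic in plain Python, assuming that the accuracy value and both tables come from the same prediction run; no figures beyond those printed above are used.

```python
total = 56342 + 6687          # all grid cells in the 2018 dataset
actual_pos = 6687             # cells with label == 1.0
pred_pos = 4451               # cells with prediction == 1.0
accuracy = 0.8942708911770773

correct = round(accuracy * total)   # TP + TN
errors = total - correct            # FP + FN

# pred_pos + actual_pos = 2*TP + FP + FN, so TP follows directly.
tp = (pred_pos + actual_pos - errors) // 2
fp = pred_pos - tp
fn = actual_pos - tp
tn = total - tp - fp - fn

precision = tp / pred_pos
recall = tp / actual_pos

print(tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3))
```

Under that assumption, roughly half of the predicted crash cells are correct and about one third of the actual crash cells are found, which the headline accuracy alone does not reveal.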



In [154]:
Crash_Prevision_2018 = predict_2018.select('id', 'X', 'Y', 'label', 'rawPrediction', 'probability', 'prediction')

In [155]:
Crash_Prevision_2018.toPandas().to_csv("../model/Crash_Prevision_2018.csv")

In [157]:
spark.stop()

---