# DS5559 Final Project Models

Team Left Twix Members

* Alice Wright - aew7j
* Edward Thompson - ejt8b
* Michael Davies -  mld9s
* Sam Parsons - sp8hp

Data Source:
The best data source we found is the City of Chicago's Transportation Network Providers trips dataset: it is large (169M records and 21 columns), relatively clean, anonymized, and accessible via API.

City of Chicago:
https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p

Code Rubric

* Model construction (min 3 models) | 3 pts

* Model evaluation | 2 pts

## Set up Spark Session and Load Libraries Required for Notebook

In [1]:
# import context manager: SparkSession
from pyspark.sql import SparkSession

# import data types
# from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
# import pyspark.sql.types as typ
# import pyspark.sql.functions as F
# import os

from pyspark.sql.types import *

spark = SparkSession.builder \
        .master("local[*]") \
        .appName("mllib_classifier") \
        .config("spark.executor.memory", '21g') \
        .config('spark.executor.cores', '6') \
        .config('spark.executor.instances', '7') \
        .config("spark.driver.memory",'1g') \
        .getOrCreate()
sc = spark.sparkContext

# import data manipulation methods
# from pyspark.ml.feature import Binarizer
from pyspark.ml import Pipeline  
# from pyspark.ml.feature import *  
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier

from pyspark.ml.linalg import DenseVector
from pyspark.ml.feature import VectorAssembler 
from pyspark.mllib.linalg import Vectors

from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import OneHotEncoder
#from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.mllib.evaluation import BinaryClassificationMetrics

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

import time

import numpy as np
import pandas as pd

import pickle

### Read in our dataset from preprocessed parquet file generated in notebook 1

In [2]:
p_df = spark.read.parquet("/../../project/ds5559/Alice_Ed_Michael_Sam_project/final_dataset.parquet")

In [3]:
p_df.printSchema()

root
 |-- Trip_ID: string (nullable = true)
 |-- Trip_Seconds: integer (nullable = true)
 |-- Trip_Miles: double (nullable = true)
 |-- Pickup_Census_Tract: string (nullable = true)
 |-- Dropoff_Census_Tract: string (nullable = true)
 |-- Pickup_Community_Area: integer (nullable = true)
 |-- Dropoff_Community_Area: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tip: double (nullable = true)
 |-- Additional_Charges_str: double (nullable = true)
 |-- Trip_Total: double (nullable = true)
 |-- Shared_Trip_Authorized: boolean (nullable = true)
 |-- Trips_Pooled: integer (nullable = true)
 |-- Pickup_Centroid_Latitude: string (nullable = true)
 |-- Pickup_Centroid_Longitude: string (nullable = true)
 |-- Pickup_Centroid_Location: string (nullable = true)
 |-- Dropoff_Centroid_Latitude: string (nullable = true)
 |-- Dropoff_Centroid_Longitude: string (nullable = true)
 |-- Dropoff_Centroid_Location: string (nullable = true)
 |-- Trip_Start_Timestamp: timestamp (nullable = 

Remove unnecessary fields and duplicate location data.

In [4]:
p_df = p_df.drop('Pickup_Census_Tract',
             'Dropoff_Census_Tract',
             'Pickup_Centroid_Latitude',
             'Pickup_Centroid_Longitude', 
             'Pickup_Centroid_Location', 
             'Dropoff_Centroid_Latitude', 
             'Dropoff_Centroid_Longitude', 
             'Dropoff_Centroid_Location',
             'Day_Month_str')

In [5]:
p_df.printSchema()

root
 |-- Trip_ID: string (nullable = true)
 |-- Trip_Seconds: integer (nullable = true)
 |-- Trip_Miles: double (nullable = true)
 |-- Pickup_Community_Area: integer (nullable = true)
 |-- Dropoff_Community_Area: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tip: double (nullable = true)
 |-- Additional_Charges_str: double (nullable = true)
 |-- Trip_Total: double (nullable = true)
 |-- Shared_Trip_Authorized: boolean (nullable = true)
 |-- Trips_Pooled: integer (nullable = true)
 |-- Trip_Start_Timestamp: timestamp (nullable = true)
 |-- Trip_End_Timestamp: timestamp (nullable = true)
 |-- PostShutdownFlag: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- Trip_Year: integer (nullable = true)
 |-- Trip_Month: integer (nullable = true)
 |-- Trip_WeekNumber: integer (nullable = true)
 |-- Trip_DayofWeek: integer (nullable = true)
 |-- Trip_Start_Hour: integer (nullable = true)
 |-- Trip_Start_Minute: integer (nullable = true)
 |-- Trip_End_Hour: 

Use Chicago Community Areas for pickup and dropoff locations. A null area means the location is outside the City of Chicago.

In [6]:
p_df.groupby("Pickup_Community_Area").count().show()

+---------------------+------+
|Pickup_Community_Area| count|
+---------------------+------+
|                   31| 53670|
|                   65| 10567|
|                   53| 10712|
|                   34| 15822|
|                   28|402866|
|                   76|228152|
|                   26| 12039|
|                   27| 19081|
|                   44| 28113|
|                   12|  6564|
|                   22|170353|
|                   47|  1750|
|                 null|354582|
|                    1| 58825|
|                   52|  2478|
|                   13| 10627|
|                    6|293132|
|                   16| 47734|
|                    3|107329|
|                   40| 12284|
+---------------------+------+
only showing top 20 rows



Replace nulls with a dummy value. We use 78, the next integer after Chicago's 77 community areas, so as not to expand the one-hot encoding excessively.

In [7]:
#fill our NA community areas

p_df = p_df.na.fill(value=78,subset=['Pickup_Community_Area', 'Dropoff_Community_Area'])

In [8]:
p_df.groupby("Pickup_Community_Area").count().show()

+---------------------+------+
|Pickup_Community_Area| count|
+---------------------+------+
|                   31| 53670|
|                   65| 10567|
|                   53| 10712|
|                   78|354582|
|                   34| 15822|
|                   28|402866|
|                   76|228152|
|                   26| 12039|
|                   27| 19081|
|                   44| 28113|
|                   12|  6564|
|                   22|170353|
|                   47|  1750|
|                    1| 58825|
|                   52|  2478|
|                   13| 10627|
|                    6|293132|
|                   16| 47734|
|                    3|107329|
|                   40| 12284|
+---------------------+------+
only showing top 20 rows
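
To see why the fill value matters, here is a minimal sketch in plain Python (not Spark) of how the width of a one-hot vector grows with the largest category index. It assumes the encoder treats each integer as a category index and drops the last category, which is the default behavior of Spark's `OneHotEncoder`. The helper `one_hot_width` is hypothetical and exists only for illustration.

```python
def one_hot_width(category_indices, drop_last=True):
    """Width of a one-hot vector when integer values are used as category indices."""
    num_categories = max(category_indices) + 1  # indices run from 0 to the maximum
    return num_categories - 1 if drop_last else num_categories


community_areas = list(range(1, 78))  # Chicago community areas are numbered 1..77

width_with_78 = one_hot_width(community_areas + [78])
width_with_999 = one_hot_width(community_areas + [999])

print("Width with fill value 78: ", width_with_78)
print("Width with fill value 999:", width_with_999)
```

A fill value just past the largest real area adds a single slot, whereas an arbitrary large sentinel would add hundreds of always-empty ones.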



Create Test/Train Split

In [9]:
# split the data

train, test = p_df.randomSplit([0.8, 0.2], seed=2021)

org_a_count = train.filter(train['label'] == 0).count()
org_b_count = train.filter(train['label'] == 1).count()

print("Original No Tip Count: ", org_a_count)
print("Original Tip Count   : ", org_b_count)

Original No Tip Count:  1900289
Original Tip Count   :  1900435
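
Because the two classes are almost exactly the same size, a model that always predicts the larger class would be right about half the time. The sketch below computes that majority-class baseline from the counts printed above. It is offered as a reference point for the accuracies reported later and is not part of the original notebook run.

```python
no_tip_count = 1900289  # label == 0 in the training split
tip_count = 1900435     # label == 1 in the training split

total = no_tip_count + tip_count
majority_baseline = max(no_tip_count, tip_count) / total

print("Training rows:", total)
print("Majority-class baseline accuracy: %.4f" % majority_baseline)
```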


Preprocessing in notebook 1 already balanced the dataset, so no further balancing is required. Cache the datasets for speed.

In [None]:
test.cache()
train.cache()

# drop the reference to the original DataFrame; the parquet file on disk is untouched
del p_df

### Evaluation Functions (UDF)

#### Confusion Matrix and Accuracy

In [None]:
def cmacc(pred):
    t0 = time.time()
    # make a confusion matrix and print the accuracy
    # select predictions and labels from the prediction transform as RDDs, since the mllib metrics classes take an RDD
    pred_rdd= pred.select('prediction').rdd.flatMap(lambda x: x)
    label_rdd = pred.select('label').rdd.flatMap(lambda x: x).map(lambda x: float(x))
    
    #zip them together
    predictionAndLabels =  pred_rdd.zip(label_rdd)
    
    #metrics transform
    metrics = MulticlassMetrics(predictionAndLabels)
    
    metrics2 = BinaryClassificationMetrics(predictionAndLabels)
    
    #make our confusion matrix
    cm = metrics.confusionMatrix().toArray()

    #calc accuracy from confusion matrix
    
    acc = (cm[0][0] + cm[1][1])/(cm[0][0] + cm[1][1] + cm[0][1] + cm[1][0])
    
    # McM accuracy
    
    acc2 = metrics.accuracy
    
    #calc areas under the curves (computed from hard 0/1 predictions, not probability scores)
    auc = metrics2.areaUnderROC
    
    prc = metrics2.areaUnderPR
    
    print("Confusion Matrix")
    print(cm)
    print()
    print("Accuracy from Confusion Matrix: ", acc)
    print()
    print("Accuracy from MulticlassMetrics: ", acc2)
    print()     
    print("Area Under the ROC", auc)
    print()
    print("Area Under the PR Curve", prc)
    print('-'*50)
    print("Metrics2 time:", time.time() - t0)
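
One caveat when reading the areas under the curve: `cmacc` passes the hard 0/1 `prediction` column to `BinaryClassificationMetrics` rather than a probability score. With only one operating point, the trapezoidal ROC area reduces to the mean of the true positive rate and the true negative rate. The sketch below checks this in plain Python against the Baseline LR confusion matrix reported later in this notebook. The function name is ours, not a Spark API.

```python
def roc_auc_from_hard_predictions(tn, fp, fn, tp):
    """Trapezoidal area under an ROC curve through (0,0), (FPR,TPR), (1,1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # triangle from (0,0) to (FPR,TPR) plus trapezoid from (FPR,TPR) to (1,1)
    return fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2


# Baseline LR confusion matrix (rows = actual, columns = predicted)
tn, fp = 229461, 245936
fn, tp = 149632, 325116

auc = roc_auc_from_hard_predictions(tn, fp, fn, tp)
balanced_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2

print("AUC from a single operating point:", auc)
print("Mean of TPR and TNR:              ", balanced_accuracy)
```

Both values agree with the "Area Under the ROC" printed for the baseline logistic regression. A threshold-free AUC would need the model's probability or raw prediction column instead, for example through `BinaryClassificationEvaluator`.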

#### Confusion Matrix and Accuracy, Extended

In [12]:
def cmacc2(pred, name, trtime):
    # make a confusion matrix, print the metrics, pickle them and return them as a list
    # select predictions and labels from the prediction transform as RDDs, since the mllib metrics classes take an RDD
     
    pred_rdd= pred.select('prediction').rdd.flatMap(lambda x: x)
    label_rdd = pred.select('label').rdd.flatMap(lambda x: x).map(lambda x: float(x))

    #zip them together
    predictionAndLabels =  pred_rdd.zip(label_rdd)
    print("Zipped P and L")
    
    #metrics transform
    metrics = MulticlassMetrics(predictionAndLabels)
    metrics2 = BinaryClassificationMetrics(predictionAndLabels)
    print('metrics created')
    
    #make our confusion matrix
    cm = metrics.confusionMatrix().toArray()

    #calc accuracy from confusion matrix
    acc = (cm[0][0] + cm[1][1])/(cm[0][0] + cm[1][1] + cm[0][1] + cm[1][0])
    
    #McM accuracy
    acc2 = metrics.accuracy
    
    #calc area under curve
    auc = metrics2.areaUnderROC
    prc = metrics2.areaUnderPR
    print('areas under the curve')
    
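    # NOTE: confusionMatrix() has actual labels in rows and predicted labels in columns.
    # Called without a label (Spark 2.x), precision(), recall() and fMeasure() return
    # overall values, which equal accuracy here.

    # Precision = TP/(TP+FP). With class 0 as positive, the row-wise ratio below is
    # TP/(TP+FN), i.e. the recall of class 0.
    precision = metrics.precision()
    cmprecision = (cm[0][0])/(cm[0][0] + cm[0][1])
    
    # Recall = TP/(TP+FN). With class 0 as positive, the column-wise ratio below is
    # TP/(TP+FP), i.e. the precision of class 0.
    recall = metrics.recall()
    cmrecall = (cm[0][0])/(cm[0][0] + cm[1][0])
    
    # F1 = 2*TP/(2*TP + FP + FN) = 2*(precision*recall)/(precision + recall)
    f1Score = metrics.fMeasure()
    cmf1 = (2*cm[0][0])/(2*cm[0][0] + cm[0][1] + cm[1][0])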
        
    print('-'*50)    
    print("Confusion Matrix")
    print(cm)
    print('-'*50)
    print()
    print("Accuracy from Confusion Matrix: ", acc)
    print("Accuracy from MulticlassMetrics: ", acc2)
    print()
    print("Area Under the ROC", auc)
    print("Area Under the PR Curve", prc)
    print('-'*50)
    print()
    print("Summary Stats")
    print()
    print("Precision from MulticlassMetrics = %s" % precision)
    print("Precision from Confusion Matrix :", cmprecision)
    print()
    print("Recall from MulticlassMetrics = %s" % recall)
    print("Recall from Confusion Matrix :", cmrecall)
    print()
    print("F1 Score = %s" % f1Score)
    print("F1 from Confusion Matrix : ", cmf1)
    print()
    
#     # Weighted stats
#     print("Weighted recall = %s" % metrics.weightedRecall)
#     print("Weighted precision = %s" % metrics.weightedPrecision)
#     print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
#     print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
#     print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)
    print('-'*50)
#     print("Metrics2 time:", time.time() - t0)

    # set up storage   
    out_list = [name, cm[0][0], cm[1][1], cm[0][1], cm[1][0], acc, auc, prc, cmprecision, cmrecall, cmf1, trtime]
    
    print(out_list)
    pickle_name = name + ".pkl"
    
    # thanks to Prof. Tashman for demonstrating this during the DS5100 notebook testing.
    with open(pickle_name, 'wb') as f:
        pickle.dump(out_list, f)
        
    return out_list
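
As a cross-check on the hand-computed ratios, the following plain-Python sketch derives per-class precision, recall and F1 from a confusion matrix laid out the way `MulticlassMetrics.confusionMatrix()` returns it (actual labels in rows, predicted labels in columns). It uses the Baseline LR matrix reported later in this notebook. The helper name `per_class_metrics` is an assumption for illustration.

```python
def per_class_metrics(cm, positive):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix.

    cm[i][j] counts rows whose actual label is i and predicted label is j.
    """
    negative = 1 - positive
    tp = cm[positive][positive]
    fn = cm[positive][negative]  # actual positive, predicted negative
    fp = cm[negative][positive]  # actual negative, predicted positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


baseline_lr_cm = [[229461, 245936],
                  [149632, 325116]]

for label, name in [(0, "no tip"), (1, "tip")]:
    precision, recall, f1 = per_class_metrics(baseline_lr_cm, label)
    print("class %d (%s): precision=%.4f recall=%.4f f1=%.4f"
          % (label, name, precision, recall, f1))
```

For class 0 this gives a precision of about 0.605 and a recall of about 0.483. These match the notebook's "Recall from Confusion Matrix" and "Precision from Confusion Matrix" values respectively, so those two printed labels appear to be swapped relative to the usual definitions.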
    

# Build Basic Models

#### One Hot Encoding Pipelines

In [13]:
# One Hot Encoding for models (Michael suggested that we didn't need it in each model step)

#onehotencoder to pickup
ohe_pu = OneHotEncoder(inputCol="Pickup_Community_Area", outputCol="Pickup_Community_Area_vec")

#onehotencoder to dropoff
ohe_do = OneHotEncoder(inputCol="Dropoff_Community_Area", outputCol="Dropoff_Community_Area_vec")

#onehotencoder to weekNumber
ohe_twn = OneHotEncoder(inputCol="Trip_WeekNumber", outputCol="Trip_WeekNumber_vec")

#onehotencoder to dayOfWeek
ohe_dw = OneHotEncoder(inputCol="Trip_DayofWeek", outputCol="Trip_DayofWeek_vec")

#onehotencoder to startHour
ohe_sh = OneHotEncoder(inputCol="Trip_Start_Hour", outputCol="Trip_Start_Hour_vec")

#onehotencoder to startMinute
ohe_sm = OneHotEncoder(inputCol="Trip_Start_Minute", outputCol="Trip_Start_Minute_vec")

### Baseline Pipeline Logistic Regression Model

In [40]:
model_name = 'Baseline_LR'

# our columns for lr
predictor_col_for_lr = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] 

#assemble the vector for lr
lr_va = VectorAssembler(inputCols=predictor_col_for_lr, outputCol="features") 

#scale our lr
lr_scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

#TODO: for a parameter search, make the parameters variables and loop over them, or add a CV step
#NOTE: lr below reads featuresCol="features", so the "scaledFeatures" column from lr_scaler is not used
#      (LogisticRegression standardizes features internally by default)

lr = LogisticRegression(maxIter=10,
                        regParam=0.1, #org 0.1
                        elasticNetParam=0.3, #org 0.3
                        featuresCol="features",
                        labelCol="label")

# Build the pipeline
lr_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, lr_va, lr_scaler, lr])

#time check
t0 = time.time()

# Fit the pipeline
print("training")
lr_model = lr_pipeline.fit(train)

# Make a prediction
print("testing")
lr_prediction = lr_model.transform(test)

t1 = time.time()

tt = round(t1-t0, 2)
print("Baseline LR Train/Test Time:", tt)

train_time = tt.real

baseLR = cmacc2(lr_prediction, model_name, train_time)

training
testing
Baseline LR Train/Test Time: 35.92
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[229461. 245936.]
 [149632. 325116.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.5836761757415974
Accuracy from MulticlassMetrics:  0.5836761757415974

Area Under the ROC 0.583745213910684
Area Under the PR Curve 0.5583488691324588
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.5836761757415974
Precision from Confusion Matrix : 0.48267237698176474

Recall from MulticlassMetrics = 0.5836761757415974
Recall from Confusion Matrix : 0.6052894672283582

F1 Score = 0.5836761757415974
F1 from Confusion Matrix :  0.5370712354737914

--------------------------------------------------
['Baseline_LR', 229461.0, 325116.0, 245936.0, 149632.0, 0.5836761757415974, 0.583745213910684, 0.5583488691324588, 0.48267237698176474, 0.60528

In [15]:
baseLR

['Baseline_LR',
 229461.0,
 325116.0,
 245936.0,
 149632.0,
 0.5836761757415974,
 0.583745213910684,
 0.5583488691324588,
 0.48267237698176474,
 0.6052894672283582,
 0.5370712354737914,
 45.69]

In [47]:
lr_model.stages[-1].coefficientMatrix

SparseMatrix(1, 256, [0, 6], [2, 3, 4, 5, 163, 255], [0.001, 0.0784, -0.1861, -0.0158, -0.1056, -0.0625], 1)
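
The nonzero coefficient indices can be mapped back to feature names by walking the `VectorAssembler` input columns in order. The sketch below does this in plain Python. The pickup and dropoff widths (78 each) follow from the fill value used earlier, but the widths of the week, day, hour and minute blocks are assumptions chosen only so that the total matches the 256 features shown above.

```python
# (column name, width) in VectorAssembler order; scalar columns have width 1.
# The widths of the last four one-hot blocks are ASSUMED for illustration.
layout = [
    ("Trip_Seconds", 1), ("Trip_Miles", 1), ("Fare", 1),
    ("Additional_Charges_str", 1), ("Shared_Trip_Authorized", 1), ("Trips_Pooled", 1),
    ("Pickup_Community_Area_vec", 78), ("Dropoff_Community_Area_vec", 78),
    ("Trip_Year", 1), ("Trip_Month", 1),
    ("Trip_WeekNumber_vec", 16), ("Trip_DayofWeek_vec", 7),
    ("Trip_Start_Hour_vec", 23), ("Trip_Start_Minute_vec", 45),
    ("PostShutdownFlag", 1),
]


def name_for_index(index, layout):
    """Return (column name, offset within that column) for a flat feature index."""
    start = 0
    for name, width in layout:
        if index < start + width:
            return name, index - start
        start += width
    raise IndexError("index %d is outside the assembled vector" % index)


# Nonzero coefficients copied from the SparseMatrix output above
nonzero = {2: 0.001, 3: 0.0784, 4: -0.1861, 5: -0.0158, 163: -0.1056, 255: -0.0625}
for index, coefficient in nonzero.items():
    name, offset = name_for_index(index, layout)
    print("%3d  %-26s offset=%-3d coef=% .4f" % (index, name, offset, coefficient))
```

Under these assumptions the regularized model keeps only scalar columns: fare, additional charges, the shared-trip flag, trips pooled, trip month and the post-shutdown flag.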

### Baseline Pipeline for Random Forest

In [22]:
model_name = 'Baseline_RF'

# our columns for rf
predictor_col_for_rf = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] # 'Date' not supported datatype

# assemble feature vector
rf_va = VectorAssembler(inputCols=predictor_col_for_rf, outputCol="features") 

# set classifier
rf = RandomForestClassifier(labelCol="label", 
                            featuresCol="features", 
                            numTrees=10)

# Build the pipeline
rf_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, rf_va, rf])

#time check
t0 = time.time()

# Fit the pipeline
print("training")
rf_model = rf_pipeline.fit(train)

# Make a prediction
print("testing")
rf_prediction = rf_model.transform(test)
t1 = time.time()

# Select (prediction, true label) and compute the weighted F1 score
rf_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
#metric options f1|accuracy|weightedPrecision|weightedRecall

# NOTE: with metricName="f1" this value is the weighted F1 score, although it is named and printed as accuracy
rf_accuracy = rf_evaluator.evaluate(rf_prediction)

print("Baseline RF Test Error = %g" % (1.0 - rf_accuracy))
print("Accuracy: " , rf_accuracy)

rfModel2 = rf_model.stages[7]
print(rfModel2)  # summary only

tt = round(t1-t0, 2)
train_time = tt.real

print("Baseline RF Train/Test Time:", train_time)

baseRF = cmacc2(rf_prediction, model_name, train_time)

training
testing
Baseline RF Test Error = 0.409512
Accuracy:  0.590488251152715
RandomForestClassificationModel (uid=RandomForestClassifier_4bfa20a92609) with 10 trees
Baseline RF Train/Test Time: 35.21
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[329109. 146288.]
 [239071. 235677.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.5944208515542364
Accuracy from MulticlassMetrics:  0.5944208515542364

Area Under the ROC 0.5943539611433748
Area Under the PR Curve 0.5874638989337708
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.5944208515542364
Precision from Confusion Matrix : 0.6922824502468463

Recall from MulticlassMetrics = 0.5944208515542364
Recall from Confusion Matrix : 0.5792336935478194

F1 Score = 0.5944208515542364
F1 from Confusion Matrix :  0.6307325669308542

-------------------------------------------

In [15]:
baseRF

['Baseline_RF',
 325504.0,
 239354.0,
 149893.0,
 235394.0,
 0.594496629461819,
 0.594434974075626,
 0.5863415643818957,
 0.6846993144676975,
 0.5803265477858719,
 0.6282072189868715,
 38.34]

In [None]:
#https://www.timlrx.com/blog/feature-selection-using-feature-importance-score-creating-a-pyspark-estimator

In [32]:
rf_model.stages[-1].featureImportances

SparseVector(256, {0: 0.0215, 1: 0.0687, 2: 0.0526, 3: 0.2716, 4: 0.0286, 5: 0.0693, 10: 0.0, 13: 0.0, 14: 0.0085, 20: 0.0, 31: 0.004, 34: 0.0, 38: 0.0248, 39: 0.0, 41: 0.0, 45: 0.0, 46: 0.0049, 48: 0.0015, 49: 0.0001, 50: 0.013, 52: 0.0002, 62: 0.0192, 65: 0.0, 67: 0.0, 72: 0.0001, 75: 0.0273, 77: 0.0024, 82: 0.1729, 83: 0.0001, 85: 0.0, 90: 0.0003, 94: 0.0, 97: 0.0, 103: 0.0, 108: 0.0002, 109: 0.0452, 112: 0.0001, 113: 0.0002, 116: 0.0053, 117: 0.0, 119: 0.0008, 121: 0.0, 122: 0.0056, 127: 0.0031, 128: 0.012, 137: 0.0003, 140: 0.0002, 144: 0.0001, 152: 0.0095, 153: 0.016, 156: 0.0, 160: 0.0554, 161: 0.0002, 163: 0.0089, 172: 0.0, 173: 0.0, 176: 0.0048, 177: 0.0078, 178: 0.0037, 179: 0.0053, 183: 0.0, 186: 0.0, 188: 0.0, 191: 0.0, 192: 0.0, 194: 0.0, 197: 0.0, 198: 0.0, 199: 0.0, 200: 0.0, 204: 0.0, 207: 0.0, 210: 0.0, 255: 0.0235})
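
The sparse vector is hard to scan by eye, so a short ranking helps. The sketch below sorts a subset of the importances printed above (the twelve largest entries, copied by hand) and prints the top five. It is plain Python and does not depend on Spark.

```python
# A subset of the featureImportances entries printed above (index -> importance)
importances = {0: 0.0215, 1: 0.0687, 2: 0.0526, 3: 0.2716, 4: 0.0286, 5: 0.0693,
               38: 0.0248, 75: 0.0273, 82: 0.1729, 109: 0.0452, 160: 0.0554, 255: 0.0235}

top_five = sorted(importances.items(), key=lambda item: item[1], reverse=True)[:5]
for index, importance in top_five:
    print("feature %3d  importance %.4f" % (index, importance))
```

Index 3 corresponds to `Additional_Charges_str`, the fourth assembler input, and accounts for more than a quarter of the total importance.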

### Baseline Pipeline for Gradient Boosted Trees

In [16]:
model_name = 'Baseline_GBT'

# our columns for gbt
predictor_col_for_gbt = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] # 'Date' not supported datatype

gbt_va = VectorAssembler(inputCols=predictor_col_for_gbt, outputCol="features") 

# set classifier
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=5)

# Chain encoders, assembler and GBT in a Pipeline
gbt_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, gbt_va, gbt])

#time check
t0 = time.time()

# Train model.
print('training')
gbt_model = gbt_pipeline.fit(train)

# Make predictions.
print('testing')
gbt_prediction = gbt_model.transform(test)

t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt.real

print("GBT Baseline Train/Test Time:", train_time)

# Select (prediction, true label) and compute the weighted F1 score
# NOTE: with metricName="f1" this value is the weighted F1 score, although it is named and printed as accuracy
gbt_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
gbt_accuracy = gbt_evaluator.evaluate(gbt_prediction)
print("GBT Test Error = %g" % (1.0 - gbt_accuracy))
print('GBT accuracy = ', gbt_accuracy)
gbtModel = gbt_model.stages[7]
print(gbtModel)  # summary only

baseGBT = cmacc2(gbt_prediction, model_name, train_time)

training
testing
GBT Baseline Train/Test Time: 116.96
GBT Test Error = 0.404306
GBT accuracy =  0.5956939318235045
GBTClassificationModel (uid=GBTClassifier_5cc09f72907f) with 5 trees
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[330585. 144812.]
 [235790. 238958.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.5994274558093765
Accuracy from MulticlassMetrics:  0.5994274558093765

Area Under the ROC 0.5993618653388731
Area Under the PR Curve 0.5921143690572241
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.5994274558093765
Precision from Confusion Matrix : 0.6953872237309028

Recall from MulticlassMetrics = 0.5994274558093765
Recall from Confusion Matrix : 0.5836857205914809

F1 Score = 0.5994274558093765
F1 from Confusion Matrix :  0.634659023279566

--------------------------------------------------
['Baseline_G

In [17]:
baseGBT

['Baseline_GBT',
 330585.0,
 238958.0,
 144812.0,
 235790.0,
 0.5994274558093765,
 0.5993618653388731,
 0.5921143690572241,
 0.6953872237309028,
 0.5836857205914809,
 0.634659023279566,
 116.96]
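
Since `cmacc2` pickles each result list to `<model name>.pkl`, the saved files can be gathered back into labeled rows for comparison. The following is a minimal sketch using only the standard library. The column names and the temporary directory are our own choices for illustration, and the single example row is copied from `baseGBT` above.

```python
import glob
import os
import pickle
import tempfile

# Column order matches out_list in cmacc2; the short names are our own labels.
columns = ["name", "cm00", "cm11", "cm01", "cm10", "accuracy", "auc_roc",
           "auc_pr", "cm_precision", "cm_recall", "cm_f1", "train_time"]

results_dir = tempfile.mkdtemp()  # stand-in for the notebook's working directory

# Write one result list the same way cmacc2 does (values copied from baseGBT above)
base_gbt = ['Baseline_GBT', 330585.0, 238958.0, 144812.0, 235790.0,
            0.5994274558093765, 0.5993618653388731, 0.5921143690572241,
            0.6953872237309028, 0.5836857205914809, 0.634659023279566, 116.96]
with open(os.path.join(results_dir, base_gbt[0] + ".pkl"), "wb") as f:
    pickle.dump(base_gbt, f)

# Read every .pkl file back and label the fields
rows = []
for path in sorted(glob.glob(os.path.join(results_dir, "*.pkl"))):
    with open(path, "rb") as f:
        rows.append(dict(zip(columns, pickle.load(f))))

for row in rows:
    print(row["name"], row["accuracy"], row["train_time"])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` if a table is preferred.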

### Pipeline LR with Tuning


In [18]:
model_name = 'Tuned_LR'

# our columns for lr
predictor_col_for_lr = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] # 'Date' not supported datatype

#assemble the vector for LR

lr_va = VectorAssembler(inputCols=predictor_col_for_lr, outputCol="features") 

#scale our LR

lr_scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
#classifier
#NOTE: featuresCol="features" means the "scaledFeatures" column from lr_scaler is not used by the model
lr = LogisticRegression(featuresCol="features",
                        labelCol="label") # regParam=0.1, elasticNetParam=0.3, maxIter=10,

# Build the pipeline
lr_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, lr_va, lr_scaler, lr])

# Set up the parameter grid
lr_paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.03, 0.05]) \
    .addGrid(lr.elasticNetParam, [0.1, 0.2, 0.3]) \
    .addGrid(lr.maxIter, [5, 10]) \
    .build()

print('len(lr_paramGrid): {}'.format(len(lr_paramGrid)))


# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
print('train/cv')
lr_crossval = CrossValidator(estimator=lr_pipeline,
                          estimatorParamMaps=lr_paramGrid,
                          #evaluator=BinaryClassificationEvaluator(metricName='areaUnderROC'), #we can pass in our own function if necessary
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5)

# you can do a custom evaluator, but it seems to be a lot of work.  https://stackoverflow.com/questions/51404344/custom-evaluator-in-pyspark
# we can use either areaUnderROC or areaUnderPR as defaults for binary.
# f1|accuracy|weightedPrecision|weightedRecall for multiclass

# Run cross-validation, and choose the best set of parameters. Print the training time.

t0 = time.time()
lr_cvModel = lr_crossval.setParallelism(6).fit(train) # train 6 models in parallel



# Make predictions on test samples. cvModel uses the best model found (lrModel).
print('test')
lr_prediction = lr_cvModel.transform(test)
t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt.real
print("LR with Tuning Train Time:", tt)

TunedLR = cmacc2(lr_prediction, model_name, train_time)

len(lr_paramGrid): 18
train/cv
test
LR with Tuning Train Time: 2386.31
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[249532. 225865.]
 [133521. 341227.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.6217566792436944
Accuracy from MulticlassMetrics:  0.6217566792436944

Area Under the ROC 0.6218228883577326
Area Under the PR Curve 0.587362346525713
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.6217566792436944
Precision from Confusion Matrix : 0.5248918272517495

Recall from MulticlassMetrics = 0.6217566792436944
Recall from Confusion Matrix : 0.6514294366575905

F1 Score = 0.6217566792436944
F1 from Confusion Matrix :  0.5813547673131807

--------------------------------------------------
['Tuned_LR', 249532.0, 341227.0, 225865.0, 133521.0, 0.6217566792436944, 0.6218228883577326, 0.587362346525713, 0.52489182725
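
The long training time follows from how many models the cross-validator fits. As a rough sketch in plain Python: every parameter combination is fitted once per fold, and the best combination is then refitted on the full training set. The variable names below are ours, and the values are the grids and fold count from the cell above.

```python
import itertools

reg_params = [0.01, 0.03, 0.05]
elastic_net_params = [0.1, 0.2, 0.3]
max_iters = [5, 10]
num_folds = 5

# Every combination of the three parameter lists
grid = list(itertools.product(reg_params, elastic_net_params, max_iters))

cv_fits = len(grid) * num_folds  # one fit per combination per fold
total_fits = cv_fits + 1         # plus the final refit of the best combination

print("Parameter combinations:", len(grid))
print("Fits during cross-validation:", cv_fits)
print("Total fits including the final refit:", total_fits)
```

Each of those fits runs the whole pipeline, including the six one-hot encoders, which is part of why the tuned run takes far longer than the baseline.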

In [26]:
#how to find all our items we can call
#dir(crossval.evaluator)

In [19]:
#average cross-validation F1 for each parameter combination (reference: https://projector-video-pdf-converter.datacamp.com/14989/chapter4.pdf)
lr_cvModel.avgMetrics

[0.6187369732054344,
 0.61916512106001,
 0.6178265947983667,
 0.6189043791594819,
 0.6163600388620095,
 0.6185034346180186,
 0.6163660091873301,
 0.6181994478191033,
 0.6105985263982171,
 0.6137518450672478,
 0.6049166678779192,
 0.6091241268373309,
 0.6128870658279999,
 0.6153804440480171,
 0.6028563476245997,
 0.6088742135190593,
 0.593679373510709,
 0.597261311965908]

In [20]:
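#determine parameters of the selected model
#https://dsharpc.github.io/SparkMLFlights/
#NOTE: F1 is higher-is-better, so np.argmin returns the lowest-scoring combination; np.argmax would return the best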

lr_cvModel.getEstimatorParamMaps()[ np.argmin(lr_cvModel.avgMetrics) ]

{Param(parent='LogisticRegression_8a026aebfd5f', name='regParam', doc='regularization parameter (>= 0).'): 0.05,
 Param(parent='LogisticRegression_8a026aebfd5f', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.3,
 Param(parent='LogisticRegression_8a026aebfd5f', name='maxIter', doc='max number of iterations (>= 0).'): 5}
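
For F1, higher is better, so the best combination is the one with the largest average metric. The sketch below compares the highest and lowest entries of the `avgMetrics` list printed above. It assumes the parameter maps are ordered like `itertools.product` over the grids in the order they were added, which is consistent with the output above: the lowest-scoring entry maps to regParam 0.05, elasticNetParam 0.3, maxIter 5.

```python
import itertools

# Average cross-validation F1 per parameter combination, copied from avgMetrics above
avg_metrics = [0.6187369732054344, 0.61916512106001, 0.6178265947983667,
               0.6189043791594819, 0.6163600388620095, 0.6185034346180186,
               0.6163660091873301, 0.6181994478191033, 0.6105985263982171,
               0.6137518450672478, 0.6049166678779192, 0.6091241268373309,
               0.6128870658279999, 0.6153804440480171, 0.6028563476245997,
               0.6088742135190593, 0.593679373510709, 0.597261311965908]

# Assumed grid order: regParam varies slowest, maxIter fastest
grid = list(itertools.product([0.01, 0.03, 0.05], [0.1, 0.2, 0.3], [5, 10]))

best_index = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
worst_index = min(range(len(avg_metrics)), key=avg_metrics.__getitem__)

print("Highest F1:", avg_metrics[best_index],
      "(regParam, elasticNetParam, maxIter) =", grid[best_index])
print("Lowest F1: ", avg_metrics[worst_index],
      "(regParam, elasticNetParam, maxIter) =", grid[worst_index])
```

Under that ordering, the highest average F1 belongs to the smallest regularization values in the grid, which suggests searching below regParam 0.01 rather than above 0.05.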

In [33]:
model_name = 'Tuned_LR2'

# our columns for lr
predictor_col_for_lr = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] # 'Date' not supported datatype

#assemble the vector for LR

lr_va = VectorAssembler(inputCols=predictor_col_for_lr, outputCol="features") 

#scale our LR

lr_scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
#classifier
#NOTE: featuresCol="features" means the "scaledFeatures" column from lr_scaler is not used by the model
lr = LogisticRegression(featuresCol="features",
                        labelCol="label") # regParam=0.1, elasticNetParam=0.3, maxIter=10,

# Build the pipeline
lr_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, lr_va, lr_scaler, lr])

# Set up the parameter grid
lr_paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.05, 0.075, 0.1]) \
    .addGrid(lr.elasticNetParam, [0.3, 0.4, 0.5]) \
    .addGrid(lr.maxIter, [5, 10]) \
    .build()

print('len(lr_paramGrid): {}'.format(len(lr_paramGrid)))


# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
print('train/cv')
lr_crossval = CrossValidator(estimator=lr_pipeline,
                          estimatorParamMaps=lr_paramGrid,
                          #evaluator=BinaryClassificationEvaluator(metricName='areaUnderROC'), #we can pass in our own function if necessary
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5)

# you can do a custom evaluator, but it seems to be a lot of work.  https://stackoverflow.com/questions/51404344/custom-evaluator-in-pyspark
# we can use either areaUnderROC or areaUnderPR as defaults for binary.
# f1|accuracy|weightedPrecision|weightedRecall for multiclass

# Run cross-validation, and choose the best set of parameters. Print the training time.

t0 = time.time()
lr_cvModel = lr_crossval.setParallelism(6).fit(train) # train 6 models in parallel



# Make predictions on test samples. cvModel uses the best model found (lrModel).
print('test')
lr_prediction = lr_cvModel.transform(test)
t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt.real
print("LR with Tuning Train Time:", tt)

TunedLR = cmacc2(lr_prediction, model_name, train_time)

len(lr_paramGrid): 18
train/cv
test
LR with Tuning Train Time: 2518.9
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[243606. 231791.]
 [150568. 324180.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.597578264370175
Accuracy from MulticlassMetrics:  0.597578264370175

Area Under the ROC 0.5976364673784024
Area Under the PR Curve 0.5698579879684897
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.597578264370175
Precision from Confusion Matrix : 0.5124264562039726

Recall from MulticlassMetrics = 0.597578264370175
Recall from Confusion Matrix : 0.6180164089970419

F1 Score = 0.597578264370175
F1 from Confusion Matrix :  0.5602900740710074

--------------------------------------------------
['Tuned_LR2', 243606.0, 324180.0, 231791.0, 150568.0, 0.597578264370175, 0.5976364673784024, 0.5698579879684897, 0.512426456203972
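
The confusion-matrix figures printed above can be reproduced by hand. The sketch below is a minimal illustration, not the notebook's `cmacc2` helper (which is defined earlier in the notebook). The mapping of matrix cells to TP, FP, FN, and TN is an assumption inferred from the order of the values in the printed result list.

```python
# Confusion matrix as printed above for Tuned_LR2.
cm = [[243606.0, 231791.0],
      [150568.0, 324180.0]]

# Assumed cell mapping, inferred from the printed result list
# ['Tuned_LR2', TP, TN, FP, FN, ...]; it is not taken from cmacc2 itself.
tp, fp = cm[0][0], cm[0][1]
fn, tn = cm[1][0], cm[1][1]

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

In the recorded output, the precision, recall, and F1 values labelled "from MulticlassMetrics" are all identical to the accuracy, which suggests they are overall (micro-averaged) figures rather than per-class ones. The "from Confusion Matrix" values are the per-class figures computed as above.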

In [34]:
# list the attributes and methods available on the evaluator
# dir(lr_crossval.evaluator)

In [35]:
# Mean cross-validated F1 for each parameter map, in grid order (not RMSE). Reference: https://projector-video-pdf-converter.datacamp.com/14989/chapter4.pdf
lr_cvModel.avgMetrics

[0.593679373510709,
 0.597261311965908,
 0.5897296019074938,
 0.5862079132409334,
 0.5886593061696461,
 0.5819984712150646,
 0.5871984081505421,
 0.5787496349058965,
 0.5868907072411093,
 0.57668576758561,
 0.587471400775593,
 0.5799174851622375,
 0.5841778323500006,
 0.5784160183360071,
 0.5783317524819762,
 0.5684637536276904,
 0.5863596900940766,
 0.5801111239452142]

In [36]:
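# Look up a parameter map by its position in avgMetrics.
# Reference: https://dsharpc.github.io/SparkMLFlights/
# NOTE: avgMetrics holds mean F1 (higher is better), so np.argmin selects the
# lowest-scoring map; np.argmax selects the best one.

lr_cvModel.getEstimatorParamMaps()[ np.argmin(lr_cvModel.avgMetrics) ]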

{Param(parent='LogisticRegression_6f31af815f7b', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
 Param(parent='LogisticRegression_6f31af815f7b', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.4,
 Param(parent='LogisticRegression_6f31af815f7b', name='maxIter', doc='max number of iterations (>= 0).'): 10}
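
Because the cross-validator was configured with `metricName='f1'`, each entry of `avgMetrics` is a mean F1 score, and the best parameter map is the one with the highest value. The sketch below uses only the values recorded above and plain Python to locate both the best and the worst position; the variable names are illustrative.

```python
# Mean cross-validated F1 per parameter map, copied from the avgMetrics output above.
avg_metrics = [
    0.593679373510709, 0.597261311965908, 0.5897296019074938,
    0.5862079132409334, 0.5886593061696461, 0.5819984712150646,
    0.5871984081505421, 0.5787496349058965, 0.5868907072411093,
    0.57668576758561, 0.587471400775593, 0.5799174851622375,
    0.5841778323500006, 0.5784160183360071, 0.5783317524819762,
    0.5684637536276904, 0.5863596900940766, 0.5801111239452142,
]

# Higher F1 is better, so the best map is at the arg-max position.
best_idx = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
# The arg-min position is the lowest-scoring map.
worst_idx = min(range(len(avg_metrics)), key=avg_metrics.__getitem__)

print("best index:", best_idx, "mean F1:", avg_metrics[best_idx])
print("worst index:", worst_idx, "mean F1:", avg_metrics[worst_idx])
```

In the notebook, indexing `lr_cvModel.getEstimatorParamMaps()` with the best position would return the corresponding parameter map.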

# Pipeline RF with Tuning


In [21]:
model_name = 'Tuned_RF'

# predictor columns for RF
predictor_col_for_rf = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] # 'Date' not supported datatype

# assemble feature vector
rf_va = VectorAssembler(inputCols=predictor_col_for_rf, outputCol="features") 
               
# set classifier
rf = RandomForestClassifier(labelCol="label", 
                            featuresCol="features")
    
# Build the pipeline
rf_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, rf_va, rf])

# Set up the parameter grid
rf_paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10, 15]) \
    .addGrid(rf.maxDepth, [5, 10, 15]) \
    .build()
#"entropy"
#.addGrid(rf.featureSubsetStrategy, ['auto', 'sqrt'])\
#.addGrid(rf.impurity, ["gini"])\
   
print('len(rf_paramGrid): {}'.format(len(rf_paramGrid)))

#https://medium.com/rahasak/random-forest-classifier-with-apache-spark-c63b4a23a7cc
#maxDepth, maxBins, impurity, auto and seed 
#.addGrid(randomForestClassifier.impurity, Array("entropy", "gini"))
#name='featureSubsetStrategy', auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]

# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
print('train/cv')
rf_crossval = CrossValidator(estimator=rf_pipeline,
                          estimatorParamMaps=rf_paramGrid,
                          #evaluator=BinaryClassificationEvaluator(metricName='areaUnderROC'), #we can pass in our own function if necessary
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5)

# you can do a custom evaluator, but it seems to be a lot of work.  https://stackoverflow.com/questions/51404344/custom-evaluator-in-pyspark
# we can use either areaUnderROC or areaUnderPR as defaults for binary.
# f1|accuracy|weightedPrecision|weightedRecall for multiclass

# Run cross-validation, and choose the best set of parameters. Print the training time.
t0 = time.time()
cvModel_rf = rf_crossval.setParallelism(6).fit(train) # train 6 models in parallel

# Make predictions on the test set. cvModel_rf uses the best model found.
print('test')
prediction_rf = cvModel_rf.transform(test)

t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt

print("RF with Tuning Train Time:", tt)

TunedRF = cmacc2(prediction_rf, model_name, train_time)

len(rf_paramGrid): 9
train/cv
test
RF with Tuning Train Time: 2089.0
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[297800. 177597.]
 [193923. 280825.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.6089859968741613
Accuracy from MulticlassMetrics:  0.6089859968741613

Area Under the ROC 0.6089740777710633
Area Under the PR Curve 0.5895255738497178
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.6089859968741613
Precision from Confusion Matrix : 0.626423809994594

Recall from MulticlassMetrics = 0.6089859968741613
Recall from Confusion Matrix : 0.6056255249398543

F1 Score = 0.6089859968741613
F1 from Confusion Matrix :  0.6158491190338324

--------------------------------------------------
['Tuned_RF', 297800.0, 280825.0, 177597.0, 193923.0, 0.6089859968741613, 0.6089740777710633, 0.5895255738497178, 0.626423809994

In [22]:
TunedRF

['Tuned_RF',
 297800.0,
 280825.0,
 177597.0,
 193923.0,
 0.6089859968741613,
 0.6089740777710633,
 0.5895255738497178,
 0.626423809994594,
 0.6056255249398543,
 0.6158491190338324,
 2089.0]

In [None]:
# list the attributes and methods available on the evaluator
# dir(rf_crossval.evaluator)

In [23]:
# mean cross-validated F1 for each parameter map, in grid order (not RMSE)
cvModel_rf.avgMetrics

[0.5404866096767633,
 0.5868874064610188,
 0.6023500055035178,
 0.5712415911085544,
 0.5986262181007318,
 0.6058893606475215,
 0.5619321633863338,
 0.6003225344428791,
 0.6085699826147258]

In [24]:
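# Look up a parameter map by its position in avgMetrics.
# NOTE: avgMetrics holds mean F1 (higher is better), so np.argmin selects the
# lowest-scoring map; np.argmax selects the best one.

cvModel_rf.getEstimatorParamMaps()[ np.argmin(cvModel_rf.avgMetrics) ]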

{Param(parent='RandomForestClassifier_6fbfd7950324', name='numTrees', doc='Number of trees to train (>= 1).'): 5,
 Param(parent='RandomForestClassifier_6fbfd7950324', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5}
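
The nine scores above correspond to the 3 x 3 grid of `numTrees` and `maxDepth`. The sketch below pairs the recorded scores with the grid under the assumption that the grid is a plain Cartesian product; which parameter varies fastest is not shown in the notebook, so both orders are enumerated. The first and last grid entries are the same in either order, which is enough to identify the lowest- and highest-scoring combinations.

```python
from itertools import product

num_trees = [5, 10, 15]
max_depth = [5, 10, 15]

# Mean cross-validated F1 per parameter map, copied from the output above.
avg_metrics = [
    0.5404866096767633, 0.5868874064610188, 0.6023500055035178,
    0.5712415911085544, 0.5986262181007318, 0.6058893606475215,
    0.5619321633863338, 0.6003225344428791, 0.6085699826147258,
]

# Two candidate enumeration orders, each yielding (numTrees, maxDepth) tuples.
# Assumption: the real grid follows one of these two orders.
trees_major = list(product(num_trees, max_depth))
depth_major = [(t, d) for d, t in product(max_depth, num_trees)]

best_idx = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
worst_idx = min(range(len(avg_metrics)), key=avg_metrics.__getitem__)

for name, grid in [("numTrees-major", trees_major), ("maxDepth-major", depth_major)]:
    print(name, "best:", grid[best_idx], "worst:", grid[worst_idx])
```

Under either order, the lowest-scoring entry is the (5, 5) combination shown in the parameter map above, and the highest-scoring entry is the last grid point.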

In [29]:
model_name = 'Tuned_RF2'

# predictor columns for RF
predictor_col_for_rf = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] # 'Date' not supported datatype

# assemble feature vector
rf_va = VectorAssembler(inputCols=predictor_col_for_rf, outputCol="features") 
               
# set classifier
rf = RandomForestClassifier(labelCol="label", 
                            featuresCol="features")
    
# Build the pipeline
rf_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, rf_va, rf])

#parameter grid
rf_paramGrid = ParamGridBuilder()\
    .addGrid(rf.numTrees, [3, 5])\
    .addGrid(rf.maxDepth, [3, 5])\
    .addGrid(rf.featureSubsetStrategy, ['auto', 'sqrt'])\
    .build()
   
print('len(rf_paramGrid): {}'.format(len(rf_paramGrid)))

#https://medium.com/rahasak/random-forest-classifier-with-apache-spark-c63b4a23a7cc
#maxDepth, maxBins, impurity, auto and seed 
#.addGrid(randomForestClassifier.impurity, Array("entropy", "gini"))
#name='featureSubsetStrategy', auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]

# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
print('train/cv')
rf_crossval = CrossValidator(estimator=rf_pipeline,
                          estimatorParamMaps=rf_paramGrid,
                          #evaluator=BinaryClassificationEvaluator(metricName='areaUnderROC'), #we can pass in our own function if necessary
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5)

# you can do a custom evaluator, but it seems to be a lot of work.  https://stackoverflow.com/questions/51404344/custom-evaluator-in-pyspark
# we can use either areaUnderROC or areaUnderPR as defaults for binary.
# f1|accuracy|weightedPrecision|weightedRecall for multiclass

# Run cross-validation, and choose the best set of parameters. Print the training time.
t0 = time.time()
cvModel_rf = rf_crossval.setParallelism(6).fit(train) # train 6 models in parallel

# Make predictions on the test set. cvModel_rf uses the best model found.
print('test')
prediction_rf = cvModel_rf.transform(test)

t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt

print("RF with Tuning Train Time:", tt)

TunedRF2 = cmacc2(prediction_rf, model_name, train_time)


len(rf_paramGrid): 8
train/cv
test
RF with Tuning Train Time: 843.65
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[311983. 163414.]
 [228521. 246227.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.587499802661699
Accuracy from MulticlassMetrics:  0.587499802661699

Area Under the ROC 0.587452805144262
Area Under the PR Curve 0.5766702390917446
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.587499802661699
Precision from Confusion Matrix : 0.6562578224094809

Recall from MulticlassMetrics = 0.587499802661699
Recall from Confusion Matrix : 0.57720756923168

F1 Score = 0.587499802661699
F1 from Confusion Matrix :  0.614199611970064

--------------------------------------------------
['Tuned_RF2', 311983.0, 246227.0, 163414.0, 228521.0, 0.587499802661699, 0.587452805144262, 0.5766702390917446, 0.6562578224094809, 0.5

In [30]:
# mean cross-validated F1 for each parameter map, in grid order (not RMSE)
cvModel_rf.avgMetrics

[0.5778278140228544,
 0.5778278140228544,
 0.5838845019172043,
 0.5838845019172043,
 0.526715010695165,
 0.526715010695165,
 0.5404866096767633,
 0.5404866096767633]

In [31]:
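# Look up a parameter map by its position in avgMetrics.
# NOTE: avgMetrics holds mean F1 (higher is better), so np.argmin selects the
# lowest-scoring map; np.argmax selects the best one.

cvModel_rf.getEstimatorParamMaps()[ np.argmin(cvModel_rf.avgMetrics) ]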

{Param(parent='RandomForestClassifier_d1e089515056', name='numTrees', doc='Number of trees to train (>= 1).'): 5,
 Param(parent='RandomForestClassifier_d1e089515056', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 3,
 Param(parent='RandomForestClassifier_d1e089515056', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'"): 'auto'}
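
The eight scores for this grid come in identical consecutive pairs. The parameter documentation printed above explains why: for a classification forest with more than one tree, `'auto'` resolves to `'sqrt'`, so the two strategies name the same configuration. The sketch below checks the pairing on the recorded values and counts the distinct configurations; the `resolve` helper is hypothetical and exists only for this illustration.

```python
from itertools import product

# Mean cross-validated F1 per parameter map, copied from the output above.
avg_metrics = [
    0.5778278140228544, 0.5778278140228544,
    0.5838845019172043, 0.5838845019172043,
    0.526715010695165, 0.526715010695165,
    0.5404866096767633, 0.5404866096767633,
]

# Compare each even-position score with the score that follows it.
pairs = list(zip(avg_metrics[0::2], avg_metrics[1::2]))
all_pairs_equal = all(a == b for a, b in pairs)

def resolve(strategy):
    """Hypothetical helper: map 'auto' to 'sqrt', as documented for
    classification forests with more than one tree."""
    return "sqrt" if strategy == "auto" else strategy

# The grid as declared in the cell: numTrees x maxDepth x featureSubsetStrategy.
grid = list(product([3, 5], [3, 5], ["auto", "sqrt"]))
distinct = {(trees, depth, resolve(strategy)) for trees, depth, strategy in grid}

print("pairs identical:", all_pairs_equal)
print("declared grid size:", len(grid), "distinct configurations:", len(distinct))
```

Dropping one of the two strategies from the grid would therefore halve the number of models trained without changing the search.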

# Pipeline GBT with Tuning

GBT was run once on the full dataset. Because of its long training time, it was eliminated as a candidate: its accuracy did not warrant the compute cost.
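
The trade-off can be made concrete with the accuracy and timing figures recorded in this notebook for the tuned RF and tuned GBT runs. This is a rough comparison only: the two runs used different grid sizes, so the timings are not strictly like for like.

```python
# (accuracy, train + test seconds) copied from the recorded runs in this notebook.
runs = {
    "Tuned_RF": (0.6089859968741613, 2089.0),
    "Tuned_GBT": (0.6092333275447431, 4872.3),
}

rf_acc, rf_secs = runs["Tuned_RF"]
gbt_acc, gbt_secs = runs["Tuned_GBT"]

# How much longer GBT took, and how much accuracy it bought.
time_ratio = gbt_secs / rf_secs
accuracy_gain = gbt_acc - rf_acc

print(f"GBT took {time_ratio:.2f}x as long as RF "
      f"for an accuracy gain of {accuracy_gain:.6f}")
```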

In [25]:
model_name = "Tuned_GBT"

# predictor columns for GBT
predictor_col_for_gbt = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec'
                        ] # 'Date' not supported datatype

gbt_va = VectorAssembler(inputCols=predictor_col_for_gbt, outputCol="features") 

# Train a GBT model.
gbt = GBTClassifier(labelCol="label", featuresCol="features") #, maxIter=5

# Chain the encoders, the vector assembler, and GBT in a Pipeline
gbt_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, gbt_va, gbt]) #labelIndexer, featureIndexer

# Set up the parameter grid
gbt_paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxIter, [5, 10])\
    .addGrid(gbt.maxDepth, [5, 10])\
    .build()

print('len(gbt_paramGrid): {}'.format(len(gbt_paramGrid)))


# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
gbt_crossval = CrossValidator(estimator=gbt_pipeline,
                          estimatorParamMaps=gbt_paramGrid,
                          #evaluator=BinaryClassificationEvaluator(metricName='areaUnderROC'), #we can pass in our own function if necessary
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5)

# you can do a custom evaluator, but it seems to be a lot of work.  https://stackoverflow.com/questions/51404344/custom-evaluator-in-pyspark
# we can use either areaUnderROC or areaUnderPR as defaults for binary.
# f1|accuracy|weightedPrecision|weightedRecall for multiclass

# Run cross-validation, and choose the best set of parameters. Print the training time.
print("train")
t0 = time.time()
cvModel_gbt = gbt_crossval.setParallelism(6).fit(train) # train 6 models in parallel


# Make predictions on the test set. cvModel_gbt uses the best model found.
print("test")
prediction_gbt = cvModel_gbt.transform(test)

t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt

print("GBT with Tuning Train Time:", tt)

TunedGBT = cmacc2(prediction_gbt, model_name, train_time)

len(gbt_paramGrid): 4
train
test
GBT with Tuning Train Time: 4872.3
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[313556. 161841.]
 [209444. 265304.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.6092333275447431
Accuracy from MulticlassMetrics:  0.6092333275447431

Area Under the ROC 0.6091989236956347
Area Under the PR Curve 0.5943196320513571
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.6092333275447431
Precision from Confusion Matrix : 0.6595666358853758

Recall from MulticlassMetrics = 0.6092333275447431
Recall from Confusion Matrix : 0.5995334608030592

F1 Score = 0.6092333275447431
F1 from Confusion Matrix :  0.6281188745559131

--------------------------------------------------
['Tuned_GBT', 313556.0, 265304.0, 161841.0, 209444.0, 0.6092333275447431, 0.6091989236956347, 0.5943196320513571, 0.65956663588

In [26]:
# cmacc2(prediction_gbt)

In [27]:
# mean cross-validated F1 for each parameter map, in grid order
cvModel_gbt.avgMetrics

[0.5925946750513758,
 0.6021299266975099,
 0.5999075809775377,
 0.6107245076006178]

In [28]:
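# Look up a parameter map by its position in avgMetrics.
# NOTE: avgMetrics holds mean F1 (higher is better), so np.argmin selects the
# lowest-scoring map; np.argmax selects the best one.
cvModel_gbt.getEstimatorParamMaps()[np.argmin(cvModel_gbt.avgMetrics)]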

{Param(parent='GBTClassifier_388f134878e9', name='maxIter', doc='max number of iterations (>= 0).'): 5,
 Param(parent='GBTClassifier_388f134878e9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5}

# Pipeline LR with CV and no Tuning


Verify Gridsearch results.

In [41]:
model_name = 'CV_LR'

# predictor columns for LR
predictor_col_for_lr = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ] # 'Date' not supported datatype

# assemble the feature vector for LR
lr_va = VectorAssembler(inputCols=predictor_col_for_lr, outputCol="features") 

# scale the assembled features
lr_scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

# Hyperparameters are fixed through the single-entry parameter grid below.
# Note: lr reads "features", so the "scaledFeatures" column produced by lr_scaler
# is not used by the model.
lr = LogisticRegression(featuresCol="features",
                        labelCol="label") # regParam=0.1, elasticNetParam=0.3, maxIter=10,

# Build the pipeline
lr_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, lr_va, lr_scaler, lr])

# Set up the parameter grid
lr_paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.05]) \
    .addGrid(lr.elasticNetParam, [0.3]) \
    .addGrid(lr.maxIter, [5]) \
    .build()

print('len(lr_paramGrid): {}'.format(len(lr_paramGrid)))

# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
lr_crossval = CrossValidator(estimator=lr_pipeline,
                          estimatorParamMaps=lr_paramGrid,
                          #evaluator=BinaryClassificationEvaluator(metricName='areaUnderROC'), #we can pass in our own function if necessary
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5)

# you can do a custom evaluator, but it seems to be a lot of work.  https://stackoverflow.com/questions/51404344/custom-evaluator-in-pyspark
# we can use either areaUnderROC or areaUnderPR as defaults for binary.
# f1|accuracy|weightedPrecision|weightedRecall for multiclass

# Run cross-validation, and choose the best set of parameters. Print the training time.
print("train")
t0 = time.time()
lr_cvModel = lr_crossval.setParallelism(5).fit(train) # train 5 models in parallel
print("train time:", time.time() - t0)

# Make predictions on the test set. lr_cvModel uses the best model found.
print("test")
lr_prediction = lr_cvModel.transform(test)

t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt

print("LR with CV, no tuning, train time:", tt)

CVLR = cmacc2(lr_prediction, model_name, train_time)

len(lr_paramGrid): 1
train
train time: 151.6270649433136
test
LR with CV, no tuning, train time: 152.83
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[310536. 164861.]
 [219267. 255481.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.59571644327971
Accuracy from MulticlassMetrics:  0.59571644327971

Area Under the ROC 0.5956771424852536
Area Under the PR Curve 0.5828216432401201
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.59571644327971
Precision from Confusion Matrix : 0.653214050572469

Recall from MulticlassMetrics = 0.59571644327971
Recall from Confusion Matrix : 0.5861348463485484

F1 Score = 0.59571644327971
F1 from Confusion Matrix :  0.617859132510943

--------------------------------------------------
['CV_LR', 310536.0, 255481.0, 164861.0, 219267.0, 0.59571644327971, 0.5956771424852536, 0.5828216432401

In [42]:
# Mean cross-validated F1 for the single parameter map (not RMSE). Reference: https://projector-video-pdf-converter.datacamp.com/14989/chapter4.pdf
lr_cvModel.avgMetrics

[0.593679373510709]

In [43]:
# parameters of the single cross-validated configuration
#https://dsharpc.github.io/SparkMLFlights/

lr_cvModel.getEstimatorParamMaps()[ np.argmin(lr_cvModel.avgMetrics) ]

{Param(parent='LogisticRegression_51fe7a567505', name='regParam', doc='regularization parameter (>= 0).'): 0.05,
 Param(parent='LogisticRegression_51fe7a567505', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.3,
 Param(parent='LogisticRegression_51fe7a567505', name='maxIter', doc='max number of iterations (>= 0).'): 5}

# Pipeline RF with CV and no Tuning

Verify Gridsearch results

In [14]:
model_name = 'CV_RF'

# predictor columns for RF
predictor_col_for_rf = ['Trip_Seconds',
                        'Trip_Miles',
                        'Fare',
                        'Additional_Charges_str',
                        'Shared_Trip_Authorized',
                        'Trips_Pooled',
                        'Pickup_Community_Area_vec',
                        'Dropoff_Community_Area_vec',
                        'Trip_Year', 
                        'Trip_Month',
                        'Trip_WeekNumber_vec', 
                        'Trip_DayofWeek_vec', 
                        'Trip_Start_Hour_vec',
                        'Trip_Start_Minute_vec',
                        'PostShutdownFlag'
                        ]

# assemble feature vector
rf_va = VectorAssembler(inputCols=predictor_col_for_rf, outputCol="features") 

# set classifier
rf = RandomForestClassifier(labelCol="label", 
                            featuresCol="features")

# # Build the pipeline
# rf_pipeline = Pipeline(stages=[ohe_pu, ohe_do, rf_va, rf])

# Build the pipeline
rf_pipeline = Pipeline(stages=[ohe_pu, ohe_do, ohe_twn, ohe_dw, ohe_sh, ohe_sm, rf_va, rf])

# Set up the parameter grid
rf_paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5]) \
    .addGrid(rf.maxDepth, [5]) \
    .addGrid(rf.impurity, ["gini"])\
    .addGrid(rf.featureSubsetStrategy, ['auto'])\
    .build()
   
    

print('len(rf_paramGrid): {}'.format(len(rf_paramGrid)))

#https://medium.com/rahasak/random-forest-classifier-with-apache-spark-c63b4a23a7cc
#maxDepth, maxBins, impurity, auto and seed 
#.addGrid(randomForestClassifier.impurity, Array("entropy", "gini"))
#name='featureSubsetStrategy', auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]

# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
rf_crossval = CrossValidator(estimator=rf_pipeline,
                          estimatorParamMaps=rf_paramGrid,
                          #evaluator=BinaryClassificationEvaluator(metricName='areaUnderROC'), #we can pass in our own function if necessary
                          evaluator= MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5)

# you can do a custom evaluator, but it seems to be a lot of work.  https://stackoverflow.com/questions/51404344/custom-evaluator-in-pyspark
# we can use either areaUnderROC or areaUnderPR as defaults for binary.
# f1|accuracy|weightedPrecision|weightedRecall for multiclass

# Run cross-validation, and choose the best set of parameters. Print the training time.
print("train")
t0 = time.time()
cvModel_rf = rf_crossval.setParallelism(5).fit(train) # train 5 models in parallel

# Make predictions on test documents. 
print("test")
prediction_rf = cvModel_rf.transform(test)

t1 = time.time()

tt = round(t1-t0, 2)
train_time = tt

print("RF with CV, no tuning, train time:", tt)

CVRF = cmacc2(prediction_rf, model_name, train_time)

len(rf_paramGrid): 1
train
test
RF with CV, no tuning, train time: 223.43
Zipped P and L
metrics created
areas under the curve
--------------------------------------------------
Confusion Matrix
[[392801.  82596.]
 [317951. 156797.]]
--------------------------------------------------

Accuracy from Confusion Matrix:  0.5784359229380779
Accuracy from MulticlassMetrics:  0.5784359229380779

Area Under the ROC 0.5782665308511222
Area Under the PR Curve 0.602966838029918
--------------------------------------------------

Summary Stats

Precision from MulticlassMetrics = 0.5784359229380779
Precision from Confusion Matrix : 0.8262588951970669

Recall from MulticlassMetrics = 0.5784359229380779
Recall from Confusion Matrix : 0.5526554972761244

F1 Score = 0.5784359229380779
F1 from Confusion Matrix :  0.6623130820832795

--------------------------------------------------
['CV_RF', 392801.0, 156797.0, 82596.0, 317951.0, 0.5784359229380779, 0.5782665308511222, 0.602966838029918, 0.826258895197

In [39]:
# mean cross-validated F1 for the single parameter map (not RMSE)
cvModel_rf.avgMetrics

[0.5404866096767633]

In [40]:
# parameters of the single cross-validated configuration

cvModel_rf.getEstimatorParamMaps()[ np.argmin(cvModel_rf.avgMetrics) ]

{Param(parent='RandomForestClassifier_d099437a911e', name='numTrees', doc='Number of trees to train (>= 1).'): 5,
 Param(parent='RandomForestClassifier_d099437a911e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5,
 Param(parent='RandomForestClassifier_d099437a911e', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini'): 'gini',
 Param(parent='RandomForestClassifier_d099437a911e', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the rang
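
The purpose of the two single-configuration runs is to verify the grid-search results, and the recorded scores allow a direct check. The sketch below compares each single-entry `avgMetrics` value with the first entry of the corresponding tuned grid. For RF, the first tuned entry is the (5, 5) combination, matching this run. For LR, the tuned grid is defined before this section, so treating its first entry as the same configuration is an inference from the matching score.

```python
import math

# First entry of each tuned grid's avgMetrics, copied from the outputs above.
tuned_lr_first = 0.593679373510709
tuned_rf_first = 0.5404866096767633

# Single avgMetrics entry from each "CV, no tuning" run.
cv_lr = 0.593679373510709
cv_rf = 0.5404866096767633

# Equal scores indicate the single-configuration run reproduces the grid entry.
lr_matches = math.isclose(tuned_lr_first, cv_lr, rel_tol=1e-12)
rf_matches = math.isclose(tuned_rf_first, cv_rf, rel_tol=1e-12)

print("LR single run reproduces grid entry:", lr_matches)
print("RF single run reproduces grid entry:", rf_matches)
```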

# Create Matrix of Results

In [44]:
# list of all the models we ran
model_list = ['Baseline_LR.pkl', 'Baseline_RF.pkl', 'Baseline_GBT.pkl', 'Tuned_LR.pkl',  'Tuned_LR2.pkl', 'Tuned_RF.pkl', 'Tuned_RF2.pkl', 'Tuned_GBT.pkl', 'CV_LR.pkl', 'CV_RF.pkl']

In [45]:
# set up pandas dataframe to hold all the model results
data_out = pd.DataFrame(columns=['model_name', 'TP', 'TN', 'FP', 'FN', 'Accuracy', 'AUROC', 'AUPR', 'Precision', 'Recall', 'F1', 'Train_Test_Time'])

In [46]:
# iterate through model_list and append each pickled result to the pandas data frame

for model in model_list:
    print(model)
    with open(model, 'rb') as f:
        data = pickle.load(f)
    print(data)
    # each pickle holds a 12-item list in the same order as the columns below
    new_data = pd.DataFrame([data[:12]],
                            columns=['model_name', 'TP', 'TN', 'FP', 'FN', 'Accuracy', 'AUROC', 'AUPR', 'Precision', 'Recall', 'F1', 'Train_Test_Time'])
    #print(new_data)
    data_out = pd.concat([data_out, new_data])
    print(data_out)


Baseline_LR.pkl
['Baseline_LR', 229461.0, 325116.0, 245936.0, 149632.0, 0.5836761757415974, 0.583745213910684, 0.5583488691324588, 0.48267237698176474, 0.6052894672283582, 0.5370712354737914, 45.69]
    model_name        TP        TN        FP        FN  Accuracy     AUROC  \
0  Baseline_LR  229461.0  325116.0  245936.0  149632.0  0.583676  0.583745   

       AUPR  Precision    Recall        F1  Train_Test_Time  
0  0.558349   0.482672  0.605289  0.537071            45.69  
Baseline_RF.pkl
['Baseline_RF', 325504.0, 239354.0, 149893.0, 235394.0, 0.594496629461819, 0.594434974075626, 0.5863415643818957, 0.6846993144676975, 0.5803265477858719, 0.6282072189868715, 38.34]
    model_name        TP        TN        FP        FN  Accuracy     AUROC  \
0  Baseline_LR  229461.0  325116.0  245936.0  149632.0  0.583676  0.583745   
0  Baseline_RF  325504.0  239354.0  149893.0  235394.0  0.594497  0.594435   

       AUPR  Precision    Recall        F1  Train_Test_Time  
0  0.558349   0.482672  0.

In [47]:
# verify that pandas dataframe was written correctly
data_out

| | model_name | TP | TN | FP | FN | Accuracy | AUROC | AUPR | Precision | Recall | F1 | Train_Test_Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_LR | 229461.0 | 325116.0 | 245936.0 | 149632.0 | 0.583676 | 0.583745 | 0.558349 | 0.482672 | 0.605289 | 0.537071 | 45.69 |
| 0 | Baseline_RF | 325504.0 | 239354.0 | 149893.0 | 235394.0 | 0.594497 | 0.594435 | 0.586342 | 0.684699 | 0.580327 | 0.628207 | 38.34 |
| 0 | Baseline_GBT | 330585.0 | 238958.0 | 144812.0 | 235790.0 | 0.599427 | 0.599362 | 0.592114 | 0.695387 | 0.583686 | 0.634659 | 116.96 |
| 0 | Tuned_LR | 249532.0 | 341227.0 | 225865.0 | 133521.0 | 0.621757 | 0.621823 | 0.587362 | 0.524892 | 0.651429 | 0.581355 | 2386.31 |
| 0 | Tuned_LR2 | 243606.0 | 324180.0 | 231791.0 | 150568.0 | 0.597578 | 0.597636 | 0.569858 | 0.512426 | 0.618016 | 0.56029 | 2518.9 |
| 0 | Tuned_RF | 297800.0 | 280825.0 | 177597.0 | 193923.0 | 0.608986 | 0.608974 | 0.589526 | 0.626424 | 0.605626 | 0.615849 | 2089.0 |
| 0 | Tuned_RF2 | 311983.0 | 246227.0 | 163414.0 | 228521.0 | 0.5875 | 0.587453 | 0.57667 | 0.656258 | 0.577208 | 0.6142 | 843.65 |
| 0 | Tuned_GBT | 313556.0 | 265304.0 | 161841.0 | 209444.0 | 0.609233 | 0.609199 | 0.59432 | 0.659567 | 0.599533 | 0.628119 | 4872.3 |
| 0 | CV_LR | 310536.0 | 255481.0 | 164861.0 | 219267.0 | 0.595716 | 0.595677 | 0.582822 | 0.653214 | 0.586135 | 0.617859 | 152.83 |
| 0 | CV_RF | 390424.0 | 149276.0 | 84973.0 | 325472.0 | 0.568019 | 0.567845 | 0.590088 | 0.821259 | 0.545364 | 0.655463 | 173.34 |


In [49]:
data_out.columns

Index(['model_name', 'TP', 'TN', 'FP', 'FN', 'Accuracy', 'AUROC', 'AUPR',
       'Precision', 'Recall', 'F1', 'Train_Test_Time'],
      dtype='object')

In [50]:
# set our index to the model name

data_out = data_out.set_index('model_name')

In [51]:
data_out

| model_name | TP | TN | FP | FN | Accuracy | AUROC | AUPR | Precision | Recall | F1 | Train_Test_Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline_LR | 229461.0 | 325116.0 | 245936.0 | 149632.0 | 0.583676 | 0.583745 | 0.558349 | 0.482672 | 0.605289 | 0.537071 | 45.69 |
| Baseline_RF | 325504.0 | 239354.0 | 149893.0 | 235394.0 | 0.594497 | 0.594435 | 0.586342 | 0.684699 | 0.580327 | 0.628207 | 38.34 |
| Baseline_GBT | 330585.0 | 238958.0 | 144812.0 | 235790.0 | 0.599427 | 0.599362 | 0.592114 | 0.695387 | 0.583686 | 0.634659 | 116.96 |
| Tuned_LR | 249532.0 | 341227.0 | 225865.0 | 133521.0 | 0.621757 | 0.621823 | 0.587362 | 0.524892 | 0.651429 | 0.581355 | 2386.31 |
| Tuned_LR2 | 243606.0 | 324180.0 | 231791.0 | 150568.0 | 0.597578 | 0.597636 | 0.569858 | 0.512426 | 0.618016 | 0.56029 | 2518.9 |
| Tuned_RF | 297800.0 | 280825.0 | 177597.0 | 193923.0 | 0.608986 | 0.608974 | 0.589526 | 0.626424 | 0.605626 | 0.615849 | 2089.0 |
| Tuned_RF2 | 311983.0 | 246227.0 | 163414.0 | 228521.0 | 0.5875 | 0.587453 | 0.57667 | 0.656258 | 0.577208 | 0.6142 | 843.65 |
| Tuned_GBT | 313556.0 | 265304.0 | 161841.0 | 209444.0 | 0.609233 | 0.609199 | 0.59432 | 0.659567 | 0.599533 | 0.628119 | 4872.3 |
| CV_LR | 310536.0 | 255481.0 | 164861.0 | 219267.0 | 0.595716 | 0.595677 | 0.582822 | 0.653214 | 0.586135 | 0.617859 | 152.83 |
| CV_RF | 390424.0 | 149276.0 | 84973.0 | 325472.0 | 0.568019 | 0.567845 | 0.590088 | 0.821259 | 0.545364 | 0.655463 | 173.34 |


In [52]:
# sort the models by F1 score, highest first
data_out = data_out.sort_values(by=['F1'], ascending = False)

In [53]:
data_out

Unnamed: 0_level_0,TP,TN,FP,FN,Accuracy,AUROC,AUPR,Precision,Recall,F1,Train_Test_Time
model_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
CV_RF,390424.0,149276.0,84973.0,325472.0,0.568019,0.567845,0.590088,0.821259,0.545364,0.655463,173.34
Baseline_GBT,330585.0,238958.0,144812.0,235790.0,0.599427,0.599362,0.592114,0.695387,0.583686,0.634659,116.96
Baseline_RF,325504.0,239354.0,149893.0,235394.0,0.594497,0.594435,0.586342,0.684699,0.580327,0.628207,38.34
Tuned_GBT,313556.0,265304.0,161841.0,209444.0,0.609233,0.609199,0.59432,0.659567,0.599533,0.628119,4872.3
CV_LR,310536.0,255481.0,164861.0,219267.0,0.595716,0.595677,0.582822,0.653214,0.586135,0.617859,152.83
Tuned_RF,297800.0,280825.0,177597.0,193923.0,0.608986,0.608974,0.589526,0.626424,0.605626,0.615849,2089.0
Tuned_RF2,311983.0,246227.0,163414.0,228521.0,0.5875,0.587453,0.57667,0.656258,0.577208,0.6142,843.65
Tuned_LR,249532.0,341227.0,225865.0,133521.0,0.621757,0.621823,0.587362,0.524892,0.651429,0.581355,2386.31
Tuned_LR2,243606.0,324180.0,231791.0,150568.0,0.597578,0.597636,0.569858,0.512426,0.618016,0.56029,2518.9
Baseline_LR,229461.0,325116.0,245936.0,149632.0,0.583676,0.583745,0.558349,0.482672,0.605289,0.537071,45.69
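
The Precision, Recall and F1 columns can be rederived from the four count columns, which gives a quick consistency check on the table. The sketch below is an illustration only: it uses three rows copied from the table above, and the helper name `prf` is hypothetical. One thing it surfaces is that TP + FP is identical for every model while TP + FN varies. On a single fixed test set the number of actual positives (TP + FN) would normally be the constant, so it may be worth checking whether the FP and FN labels (and with them Precision and Recall) are transposed in the code that builds the table. F1 is symmetric in the two, so the ranking by F1 would be unaffected either way.

```python
# Rows copied from the results table: model -> (TP, TN, FP, FN)
counts = {
    "CV_RF": (390424, 149276, 84973, 325472),
    "Baseline_GBT": (330585, 238958, 144812, 235790),
    "Tuned_LR": (249532, 341227, 225865, 133521),
}

def prf(tp, tn, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {name: prf(*c) for name, c in counts.items()}
for name, (p, r, f1) in metrics.items():
    print(f"{name:13s} precision={p:.6f} recall={r:.6f} F1={f1:.6f}")

# Sums that can be compared across models evaluated on the same test set
tp_plus_fp = {tp + fp for tp, tn, fp, fn in counts.values()}
tp_plus_fn = {tp + fn for tp, tn, fp, fn in counts.values()}
print("distinct TP+FP:", tp_plus_fp)
print("distinct TP+FN:", tp_plus_fn)
```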


In [54]:
# save the pandas DataFrame for use in the final report

data_out.to_csv('Model_Results.csv')

##### Original Code, Archived

This code was developed to read in the original CSV. It has since been replaced by loading a parquet file.

In [None]:
# create a custom schema.  

customSchema = StructType([
    StructField('Trip_ID', StringType(), True),        
    StructField('Trip_Start_Timestamp', StringType(), True),
    StructField('Trip_End_Timestamp', StringType(), True),
    StructField('Trip_Seconds', DoubleType(), True),
    StructField('Trip_Miles', DoubleType(), True),
    StructField('Pickup_Census_Tract', StringType(), True),
    StructField('Dropoff_Census_Tract', StringType(), True),
    StructField('Pickup_Community_Area', DoubleType(), True),
    StructField('Dropoff_Community_Area', DoubleType(), True),
    StructField("Fare", DoubleType(), True),
    StructField("Tip", DoubleType(), True),
    StructField("Additional_Charges", DoubleType(), True),
    StructField("Trip_Total", StringType(), True),
    StructField("Shared_Trip_Authorized", BooleanType(), True),
    StructField("Trips_Pooled", DoubleType(), True),
    StructField('Pickup_Centroid_Latitude', StringType(), True),
    StructField('Pickup_Centroid_Longitude', StringType(), True),
    StructField('Pickup_Centroid_Location', StringType(), True),
    StructField('Dropoff_Centroid_Latitude', StringType(), True),
    StructField('Dropoff_Centroid_Longitude', StringType(), True),
    StructField('Dropoff_Centroid_Location', StringType(), True)])

# old read-in: inferSchema is slow on a large dataset
#df = spark.read.csv('/../../project/ds5559/Alice_Ed_Michael_Sam_project/BigTrips.csv', header = True, inferSchema=True)

# read the data into a DataFrame using the custom schema
df = spark.read.csv('/../../project/ds5559/Alice_Ed_Michael_Sam_project/BigTrips.csv', header = True, schema=customSchema)
df.show(5)

In [None]:
# DataFrames are immutable: drop() returns a new DataFrame, so the result must be reassigned

df = df.drop('Pickup_Census_Tract',
             'Dropoff_Census_Tract',
             'Pickup_Centroid_Latitude',
             'Pickup_Centroid_Longitude', 
             'Pickup_Centroid_Location', 
             'Dropoff_Centroid_Latitude', 
             'Dropoff_Centroid_Longitude', 
             'Dropoff_Centroid_Location')

#'Trip_End_Timestamp' keep for now

In [None]:
df.printSchema()

In [None]:
df2 = df.sample(False, .05, seed = 2021)  # take a 5% sample without replacement to reduce the data size

In [None]:
df2.count()

In [None]:
# drop the Python reference to the full DataFrame; from here on we work with the 5% sample
del df

In [None]:
# fill missing community areas with the placeholder value 78

df2 = df2.na.fill(value=78,subset=['Pickup_Community_Area', 'Dropoff_Community_Area'])

In [None]:
# make a binary tip/no tip indicator
# https://spark.apache.org/docs/2.2.0/ml-features.html#binarizer
from pyspark.ml.feature import Binarizer

# The binarized tip column seemed to be causing problems under its old name.
# Name it "label", as that is what the ML packages expect.

binarizer = Binarizer(threshold=0, inputCol="Tip", outputCol="label")
df2 = binarizer.transform(df2)
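
Spark's `Binarizer` maps values strictly greater than the threshold to 1.0 and all other values to 0.0, so with `threshold=0` a trip is labelled 1.0 only when a positive tip was recorded. A plain-Python sketch of that rule, using a hypothetical helper `binarize` and made-up tip amounts:

```python
def binarize(value, threshold=0.0):
    """Mimic the Binarizer rule: 1.0 if value > threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

tips = [0.0, 2.5, 0.0, 1.0]  # made-up tip amounts
labels = [binarize(t) for t in tips]
print(labels)
```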

In [None]:
df2.printSchema()

In [None]:
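import pyspark.sql.functions as F

# parse the 12-hour AM/PM timestamp string into a proper timestamp column
df2 = df2.withColumn("Trip_Start_TS", F.to_timestamp(F.col("Trip_Start_Timestamp"), "MM/dd/yyyy hh:mm:ss a"))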

In [None]:
df2.printSchema()

In [None]:
# Trip_Start_TS is already a timestamp, so it can be used directly
df2 = df2.withColumn('Trip_Year', F.year(F.col('Trip_Start_TS'))) \
         .withColumn('Trip_Month', F.month(F.col('Trip_Start_TS'))) \
         .withColumn('Trip_WeekNumber', F.weekofyear(F.col('Trip_Start_TS'))) \
         .withColumn('Trip_DayofWeek', F.dayofweek(F.col('Trip_Start_TS'))) \
         .withColumn('Trip_Start_Hour', F.hour(F.col('Trip_Start_TS'))) \
         .withColumn('Trip_Start_Minute', F.minute(F.col('Trip_Start_TS'))) \
         .withColumn('Date', F.to_date(F.col('Trip_Start_TS')))

df2.show(5, False)
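
The pattern `MM/dd/yyyy hh:mm:ss a` describes a 12-hour clock with an AM/PM marker. For spot-checking the derived columns outside Spark, the same fields can be computed with Python's `datetime`. The timestamp string below is made up and the variable names are illustrative. Two conventions matter when comparing: Spark's `dayofweek` counts Sunday as 1 through Saturday as 7, and `weekofyear` follows ISO 8601 week numbering.

```python
from datetime import datetime

raw = "01/15/2020 03:45:00 PM"  # made-up value in the CSV's timestamp format

# strptime equivalent of the Spark pattern "MM/dd/yyyy hh:mm:ss a"
ts = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")

trip_year = ts.year
trip_month = ts.month
trip_week_number = ts.isocalendar()[1]      # ISO 8601 week, like weekofyear
trip_day_of_week = ts.isoweekday() % 7 + 1  # 1 = Sunday ... 7 = Saturday, like dayofweek
trip_start_hour = ts.hour
trip_start_minute = ts.minute

print(trip_year, trip_month, trip_week_number,
      trip_day_of_week, trip_start_hour, trip_start_minute)
```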

In [None]:
df2.printSchema()

In [None]:
# split the data

# Our model didn't work on the standard train/test split. Prof. Tashman recommended
# upsampling to help with the imbalanced dataset.
# from https://spark.apache.org/docs/2.1.0/ml-tuning.html#train-validation-split

train_initial, test = df2.randomSplit([0.8, 0.2], seed=2021)

# cache our test values for later speed
test.cache()

# oversampling code sample
# https://stackoverflow.com/questions/53273133/how-to-perform-up-sampling-using-sample-functionpy-spark

df_a = train_initial.filter(train_initial['label'] == 0)  # no tip
df_b = train_initial.filter(train_initial['label'] == 1)  # tip

org_a_count = df_a.count()
org_b_count = df_b.count()

# reuse the counts above rather than triggering two more count() jobs
ratio = org_a_count / org_b_count
# print(ratio)

# sample() with a fraction is approximate, so the final class counts are close but not identical
df_b_oversampled = df_b.sample(withReplacement=True, fraction=ratio, seed=2021)

# cache our train values for later speed
train = df_a.union(df_b_oversampled).cache()

# filter on train's own label column
df_af = train.filter(train['label'] == 0)
df_bf = train.filter(train['label'] == 1)
fin_a_count = df_af.count()
fin_b_count = df_bf.count()

print("Original No Tip Count: ", org_a_count)
print("Original Tip Count   : ", org_b_count)
print("")
print("Final No Tip Count   : ", fin_a_count)
print("Final Tip Count      : ", fin_b_count)
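
The oversampling step draws from the tip class with replacement at a rate of no-tip count divided by tip count, so the expected size of the resampled tip class equals the size of the no-tip class. A pure-Python sketch of the same idea on made-up counts follows. The names and numbers are illustrative, and unlike Spark's `sample`, `random.choices` returns exactly the requested number of rows.

```python
import random

random.seed(2021)

# made-up, imbalanced training labels: 0 = no tip, 1 = tip
no_tip = [0] * 800
tip = [1] * 200

ratio = len(no_tip) / len(tip)         # each tip row is drawn `ratio` times on average
target_size = round(len(tip) * ratio)  # equals the size of the no-tip class

tip_oversampled = random.choices(tip, k=target_size)  # sampling with replacement
train = no_tip + tip_oversampled

print("Final No Tip Count:", train.count(0))
print("Final Tip Count   :", train.count(1))
```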


In [None]:
del (df2)

In [None]:
# LR initial grid search results

'''
25k set
best from initial tuning (accuracy):
addGrid(lr.regParam, [0.03, 0.05, 0.07]) \
.addGrid(lr.elasticNetParam, [0.15, 0.2, 0.25]) \
.addGrid(lr.maxIter, [8, 9, 10, 11, 12]) 

regParam= 0.03
elasticNetParam = 0.15
maxIter = 12
 
Confusion Matrix
[[1842. 2203.]
 [ 193.  595.]]

Accuracy from Confusion Matrix:  0.5042416718394372
Accuracy from MulticlassMetrics:  0.5042416718394372

Area Under the ROC 0.6052265753923187

Area Under the PR Curve 0.2065770273222743
Summary Stats
Precision = 0.5042416718394372
Recall = 0.5042416718394372
F1 Score = 0.5042416718394372
Weighted recall = 0.5042416718394371
Weighted precision = 0.7922492654683645
Weighted F(1) Score = 0.5612342974368173
Weighted F(0.5) Score = 0.6730989071456968
Weighted false positive rate = 0.2937885210547999

2nd round:

lr_paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.02, 0.03]) \
    .addGrid(lr.elasticNetParam, [0.05, 0.1, 0.15]) \
    .addGrid(lr.maxIter, [11, 12, 13, 15]) \

regParam= 0.03
elasticNetParam = 0.15
maxIter = 11

Confusion Matrix
[[1815. 2230.]
 [ 183.  605.]]

Accuracy from Confusion Matrix:  0.5007241878750258
Accuracy from MulticlassMetrics:  0.5007241878750258

Area Under the ROC 0.6082342994108162

Area Under the PR Curve 0.2075564549699384
Summary Stats
Precision = 0.5007241878750258
Recall = 0.5007241878750258
F1 Score = 0.5007241878750258
Weighted recall = 0.5007241878750258
Weighted precision = 0.7950908896146499
Weighted F(1) Score = 0.5572078454446675
Weighted F(0.5) Score = 0.6716684076809212
Weighted false positive rate = 0.28425558905339354

'''
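
In each result block above, the unlabelled Precision, Recall and F1 values are identical to the accuracy. That is what micro-averaging over all classes produces: every misclassified row counts as one false positive for the predicted class and one false negative for the true class, so all three collapse to correct / total. The sketch below recomputes the first block's figures from its confusion matrix, assuming rows are actual labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix from the first LR block: rows = actual, columns = predicted
cm = [[1842, 2203],
      [193, 595]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]
accuracy = correct / total  # also the micro-averaged precision, recall and F1

# Per-class precision (column-wise) and recall (row-wise)
actual = [sum(row) for row in cm]
predicted = [cm[0][j] + cm[1][j] for j in range(2)]
precision = [cm[k][k] / predicted[k] for k in range(2)]
recall = [cm[k][k] / actual[k] for k in range(2)]

# Weighted averages use each class's share of the actual labels as weights
weighted_precision = sum(actual[k] / total * precision[k] for k in range(2))
weighted_recall = sum(actual[k] / total * recall[k] for k in range(2))

print("accuracy           ", accuracy)
print("weighted precision ", weighted_precision)
print("weighted recall    ", weighted_recall)
print("tip-class precision", precision[1])
print("tip-class recall   ", recall[1])
```

Under that layout, the tip-class precision and recall are the figures to watch on an imbalanced test set, since the micro-averaged values only restate the accuracy.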

In [None]:
# RF initial grid search results

'''
best from initial tuning (accuracy):
rf_paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10]) \
    .addGrid(rf.maxDepth, [5, 10]) \
    .addGrid(rf.impurity, ["entropy", "gini"])\
    .build()

numTrees = 10
maxDepth = 10
impurity = gini
 
Confusion Matrix
[[2901. 1144.]
 [ 456.  332.]]

Accuracy from Confusion Matrix:  0.6689426857024623
Accuracy from MulticlassMetrics:  0.6689426857024623

Area Under the ROC 0.5692507513819781

Area Under the PR Curve 0.2070259967551608
Summary Stats
Precision = 0.6689426857024623
Recall = 0.6689426857024623
F1 Score = 0.6689426857024623
Weighted recall = 0.6689426857024622
Weighted precision = 0.7599403563099746
Weighted F(1) Score = 0.7038591473392332
Weighted F(0.5) Score = 0.735232181263078
Weighted false positive rate = 0.530441182938506

2nd attempt:

rf_paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10, 15]) \
    .addGrid(rf.maxDepth, [5, 10, 15]) \
    .addGrid(rf.impurity, ["entropy", "gini"])\
    .addGrid(rf.featureSubsetStrategy, ['auto', 'sqrt'])\
    .build()


numTrees = 10
maxDepth = 15
impurity = gini
featureSubsetStrategy = auto

Confusion Matrix
[[2663. 1382.]
 [ 388.  400.]]

Accuracy from Confusion Matrix:  0.6337678460583489
Accuracy from MulticlassMetrics:  0.6337678460583489

Area Under the ROC 0.5829789236570811

Area Under the PR Curve 0.209345437091233
Summary Stats
Precision = 0.6337678460583489
Recall = 0.6337678460583489
F1 Score = 0.6337678460583489
Weighted recall = 0.6337678460583488
Weighted precision = 0.7671159775546591
Weighted F(1) Score = 0.6789410276493224
Weighted F(0.5) Score = 0.7270236282319229
Weighted false positive rate = 0.46780999874418644

'''


In [None]:
# GBT initial grid search results


'''
best from initial tuning (accuracy):
gbt_paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxIter, [5, 10, 20]) \
    .build()

maxIter = 20

Confusion Matrix
[[2798. 1247.]
 [ 423.  365.]]

Accuracy from Confusion Matrix:  0.654458928201945
Accuracy from MulticlassMetrics:  0.654458928201945

Area Under the ROC 0.5774580700620556

Area Under the PR Curve 0.2094152550126291
Summary Stats
Precision = 0.654458928201945
Recall = 0.654458928201945
F1 Score = 0.654458928201945
Weighted recall = 0.654458928201945
Weighted precision = 0.7639586098089827
Weighted F(1) Score = 0.6941837869282138
Weighted F(0.5) Score = 0.7327747544723259
Weighted false positive rate = 0.4995427880778336


maxIter = 40

Confusion Matrix
[[2571. 1474.]
 [ 382.  406.]]

Accuracy from Confusion Matrix:  0.6159735154148562
Accuracy from MulticlassMetrics:  0.6159735154148562

Area Under the ROC 0.5754139659791808

Area Under the PR Curve 0.20313239804234073
Summary Stats
Precision = 0.6159735154148562
Recall = 0.6159735154148562
F1 Score = 0.6159735154148562
Weighted recall = 0.6159735154148562
Weighted precision = 0.7638968296438198
Weighted F(1) Score = 0.6646010165217533
Weighted F(0.5) Score = 0.7183436330713325
Weighted false positive rate = 0.4651455834564943


gbt_paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxIter, [5, 10])\
    .addGrid(gbt.maxDepth, [5, 10])\
    .build()
    
maxIter = 5
maxDepth = 5

Confusion Matrix
[[2863. 1182.]
 [ 469.  319.]]

Accuracy from Confusion Matrix:  0.6583902338092282
Accuracy from MulticlassMetrics:  0.6583902338092282

Area Under the ROC 0.5563048634335804

Area Under the PR Curve 0.19780050930331394
Summary Stats
Precision = 0.6583902338092282
Recall = 0.6583902338092282
F1 Score = 0.6583902338092282
Weighted recall = 0.6583902338092282
Weighted precision = 0.7537989743798752
Weighted F(1) Score = 0.6950856095348379
Weighted F(0.5) Score = 0.7279222226014918
Weighted false positive rate = 0.5457805069420676
'''
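
When reading the area under the PR curve in these blocks, a useful reference point is the share of positive rows in the test set, since a classifier that ranks rows at random has an expected precision equal to that share at every recall level. A short sketch, assuming the rows of the confusion matrix are actual labels; the AUPR values are copied from the GBT blocks above and the names are illustrative.

```python
# First GBT confusion matrix: rows = actual, columns = predicted
cm = [[2798, 1247],
      [423, 365]]

positives = sum(cm[1])
total = sum(cm[0]) + sum(cm[1])
prevalence = positives / total  # expected precision of a random ranking

# AUPR values copied from the three GBT result blocks
reported_aupr = {
    "maxIter=20": 0.2094152550126291,
    "maxIter=40": 0.20313239804234073,
    "maxIter=5, maxDepth=5": 0.19780050930331394,
}

for name, aupr in reported_aupr.items():
    print(f"{name:22s} AUPR={aupr:.4f} ratio to prevalence={aupr / prevalence:.2f}")
```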
