# Start of DS5559 Final Project

Team Left Twix Members

* Alice Wright - aew7j
* Edward Thompson - ejt8b
* Michael Davies - mld9s
* Sam Parsons - sp8hp

In STAT 6021, members of our cohort looked at Transportation Network Provider datasets (rideshares such as Uber, Lyft, etc.) to see if there was a potential relationship between tipping and other trip indicators.  At that point in our Data Science journey we did not have the skills or equipment to investigate this question in depth.  

By using machine learning skills from SYS 6018 and applying Spark to this dataset, we hope to come up with a more robust set of answers and potentially a better predictor of tipping. With other classification algorithms such as random forest, and the heavy-weight data processing of Spark, will we be able to create a more robust predictive model?


Potential Questions from the TNC Data:

* Can we predict which fares are most likely to tip the driver?
* Is there a relationship between the time of the fare and tipping? (workday start, bar close, weekday, weekend, etc.)
* Is there a relationship between start or end location of the ride and tipping? (downtown pickup, north shore, airport, etc)
* Is there a relationship between length or cost of ride and tipping? (do longer rides result in tips)
* Using this data would we be able to make recommendations to drivers to maximize likelihood of receiving a tip?
* Is the likelihood of tipping changing over time?  Are more rides being tipped?
* Are there re-identification abilities in this dataset? For instance, can we find records for a person who reliably takes a rideshare to/from work every day thereby linking a home address to a work address?




Joining in additional datasets may also yield answers to questions about external factors such as:
* How did news reporting/social media on rideshare companies correlate with tipping?
* What relationship(s) does trip demand have with the stocks of these companies?

Data Source:
The best data source for this appears to be from the City of Chicago, as it is large (169M records and 21 columns), relatively clean, anonymized, and accessible via API.

City of Chicago:
https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p

So far we have only pulled the data down via a CSV.

Code Rubric

* Data Import and PreProcessing | 2 pts

* Data splitting/sampling | 1 pt

* EDA (min two graphs) | 2 pts

* Model construction (min 3 models) | 3 pts

* Model evaluation | 2 pts

In [4]:
# import the Spark entry point: SparkSession
from pyspark.sql import SparkSession

# import data types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
import pyspark.sql.types as typ
import pyspark.sql.functions as F
import os

from pyspark.sql.types import *

spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()
sc = spark.sparkContext

In [3]:
%whos

Variable        Type            Data/Info
-----------------------------------------
ArrayType       type            <class 'pyspark.sql.types.ArrayType'>
BinaryType      type            <class 'pyspark.sql.types.BinaryType'>
BooleanType     type            <class 'pyspark.sql.types.BooleanType'>
ByteType        type            <class 'pyspark.sql.types.ByteType'>
DataType        type            <class 'pyspark.sql.types.DataType'>
DateType        type            <class 'pyspark.sql.types.DateType'>
DecimalType     type            <class 'pyspark.sql.types.DecimalType'>
DoubleType      type            <class 'pyspark.sql.types.DoubleType'>
F               module          <module 'pyspark.sql.func<...>yspark/sql/functions.py'>
FloatType       type            <class 'pyspark.sql.types.FloatType'>
IntegerType     type            <class 'pyspark.sql.types.IntegerType'>
LongType        type            <class 'pyspark.sql.types.LongType'>
MapType         type            <class 'pyspark.sql.ty

In [None]:
#clear old df
#del (df)

# Read in our Dataset

## Create a Custom Schema.  
This schema was primarily determined by reading a much smaller dataset and letting Spark infer the schema.  We encountered an issue where Spark read the ENTIRE dataset as NULL when there was a type mismatch.  Only the columns we are likely to use later have been assigned a specific type; the rest are left as string type.

In [10]:
# create a custom schema.  

customSchema = StructType([
    StructField('Trip_ID', StringType(), True),        
    StructField('Trip_Start_Timestamp', StringType(), True),
    StructField('Trip_End_Timestamp', StringType(), True),
    StructField('Trip_Seconds', DoubleType(), True),
    StructField('Trip_Miles', DoubleType(), True),
    StructField('Pickup_Census_Tract', StringType(), True),
    StructField('Dropoff_Census_Tract', StringType(), True),
    StructField('Pickup_Community_Area', DoubleType(), True),
    StructField('Dropoff_Community_Area', DoubleType(), True),
    StructField("Fare", DoubleType(), True),
    StructField("Tip", DoubleType(), True),
    StructField("Additional_Charges", DoubleType(), True),
    StructField("Trip_Total", StringType(), True),
    StructField("Shared_Trip_Authorized", BooleanType(), True),
    StructField("Trips_Pooled", DoubleType(), True),
    StructField('Pickup_Centroid_Latitude', StringType(), True),
    StructField('Pickup_Centroid_Longitude', StringType(), True),
    StructField('Pickup_Centroid_Location', StringType(), True),
    StructField('Dropoff_Centroid_Latitude', StringType(), True),
    StructField('Dropoff_Centroid_Longitude', StringType(), True),
    StructField('Dropoff_Centroid_Location', StringType(), True)
])

#old readin.  Infer is slow for large dataset
#df = spark.read.csv('/../../project/ds5559/Alice_Ed_Michael_Sam_project/BigTrips.csv', header = True, inferSchema=True)

#read in the data to a dataframe
df = spark.read.csv('/../../project/ds5559/Alice_Ed_Michael_Sam_project/BigTrips.csv', header = True, schema=customSchema)
df.show(5)

+--------------------+--------------------+--------------------+------------+----------+-------------------+--------------------+---------------------+----------------------+----+---+------------------+----------+----------------------+------------+------------------------+-------------------------+------------------------+-------------------------+--------------------------+-------------------------+
|             Trip_ID|Trip_Start_Timestamp|  Trip_End_Timestamp|Trip_Seconds|Trip_Miles|Pickup_Census_Tract|Dropoff_Census_Tract|Pickup_Community_Area|Dropoff_Community_Area|Fare|Tip|Additional_Charges|Trip_Total|Shared_Trip_Authorized|Trips_Pooled|Pickup_Centroid_Latitude|Pickup_Centroid_Longitude|Pickup_Centroid_Location|Dropoff_Centroid_Latitude|Dropoff_Centroid_Longitude|Dropoff_Centroid_Location|
+--------------------+--------------------+--------------------+------------+----------+-------------------+--------------------+---------------------+----------------------+----+---+-------

In [3]:
df.printSchema()

root
 |-- Trip_ID: string (nullable = true)
 |-- Trip_Start_Timestamp: string (nullable = true)
 |-- Trip_End_Timestamp: string (nullable = true)
 |-- Trip_Seconds: double (nullable = true)
 |-- Trip_Miles: double (nullable = true)
 |-- Pickup_Census_Tract: string (nullable = true)
 |-- Dropoff_Census_Tract: string (nullable = true)
 |-- Pickup_Community_Area: double (nullable = true)
 |-- Dropoff_Community_Area: double (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tip: double (nullable = true)
 |-- Additional_Charges: double (nullable = true)
 |-- Trip_Total: string (nullable = true)
 |-- Shared_Trip_Authorized: boolean (nullable = true)
 |-- Trips_Pooled: double (nullable = true)
 |-- Pickup_Centroid_Latitude: string (nullable = true)
 |-- Pickup_Centroid_Longitude: string (nullable = true)
 |-- Pickup_Centroid_Location: string (nullable = true)
 |-- Dropoff_Centroid_Latitude: string (nullable = true)
 |-- Dropoff_Centroid_Longitude: string (nullable = true)
 |-- Dropoff

In [4]:
df.columns

['Trip_ID',
 'Trip_Start_Timestamp',
 'Trip_End_Timestamp',
 'Trip_Seconds',
 'Trip_Miles',
 'Pickup_Census_Tract',
 'Dropoff_Census_Tract',
 'Pickup_Community_Area',
 'Dropoff_Community_Area',
 'Fare',
 'Tip',
 'Additional_Charges',
 'Trip_Total',
 'Shared_Trip_Authorized',
 'Trips_Pooled',
 'Pickup_Centroid_Latitude',
 'Pickup_Centroid_Longitude',
 'Pickup_Centroid_Location',
 'Dropoff_Centroid_Latitude',
 'Dropoff_Centroid_Longitude',
 'Dropoff_Centroid_Location']

In [11]:
# drop() returns a new DataFrame, so the result has to be reassigned to df

df = df.drop('Trip_End_Timestamp', 
             'Pickup_Census_Tract',
             'Dropoff_Census_Tract',
             'Pickup_Centroid_Latitude',
             'Pickup_Centroid_Longitude', 
             'Pickup_Centroid_Location', 
             'Dropoff_Centroid_Latitude', 
             'Dropoff_Centroid_Longitude', 
             'Dropoff_Centroid_Location')

In [7]:
df.printSchema()

root
 |-- Trip_ID: string (nullable = true)
 |-- Trip_Start_Timestamp: string (nullable = true)
 |-- Trip_Seconds: double (nullable = true)
 |-- Trip_Miles: double (nullable = true)
 |-- Pickup_Community_Area: double (nullable = true)
 |-- Dropoff_Community_Area: double (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tip: double (nullable = true)
 |-- Additional_Charges: double (nullable = true)
 |-- Trip_Total: string (nullable = true)
 |-- Shared_Trip_Authorized: boolean (nullable = true)
 |-- Trips_Pooled: double (nullable = true)



In [18]:
df.count()

49108003

Make a sampled dataframe for faster work while developing all the steps.

In [12]:
df2 = df.sample(withReplacement=False, fraction=0.0005, seed=1221)

In [9]:
df2.count()

24414
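
The sampled count is close to, but not exactly, 49,108,003 × 0.0005 ≈ 24,554. That is expected: sampling without replacement keeps each row independently with the given probability rather than drawing a fixed number of rows. Below is a minimal plain-Python sketch of that behaviour. The helper name `bernoulli_sample` and the toy row count are illustrative only and are not part of Spark.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = range(1_000_000)          # toy stand-in for the full trip table
sample = bernoulli_sample(rows, fraction=0.0005, seed=1221)

expected = len(rows) * 0.0005    # 500 rows on average
print(f"expected about {expected:.0f} rows, got {len(sample)}")
```

The same seed always gives the same sample, but the size only matches the requested fraction on average.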

In [13]:
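# delete the reference to the big df for now and work with the sample (df2)
del df

# note: del only removes the Python name; df2 is still computed lazily from the CSV,
# so caching the sample (df2.cache()) is what would speed up repeated actions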

In [None]:
df2.select('Trip_ID').count()

In [None]:
df2.describe().show()

From reading the data dictionary it appears that there are multiple ways that pickup and drop off locations are being reported.

Can we do histograms of the pickup and drop off areas, to see if there are destinations that are popular (airports, downtown, ball parks, museum row, etc.)?

In [24]:
df2.groupby('Pickup_Community_Area').count().show(80)

+---------------------+-----+
|Pickup_Community_Area|count|
+---------------------+-----+
|                  8.0| 3350|
|                 70.0|   87|
|                 67.0|  128|
|                 69.0|  215|
|                  7.0| 1199|
|                 49.0|  153|
|                 29.0|  221|
|                 64.0|   29|
|                 75.0|   76|
|                 47.0|   14|
|                 42.0|  174|
|                 44.0|  211|
|                 null| 1773|
|                 35.0|  168|
|                 62.0|   39|
|                 18.0|   25|
|                  1.0|  308|
|                 39.0|  121|
|                 34.0|  104|
|                 37.0|   28|
|                 25.0|  427|
|                 36.0|   34|
|                 41.0|  323|
|                  4.0|  250|
|                 23.0|  270|
|                 77.0|  375|
|                 56.0|  250|
|                 50.0|   41|
|                 45.0|   32|
|                 71.0|  192|
|         

Only the top 10 are shown below, as the counts fall off quickly.  The next step is to replace the nulls with actual values.  We will assume a null means outside the city and use 99 for it.

In [40]:
# now let's sort the list; ascending=False orders it from largest to smallest

df2.groupby('Pickup_Community_Area').count().orderBy('count', ascending=False).show(10)

+---------------------+-----+
|Pickup_Community_Area|count|
+---------------------+-----+
|                  8.0| 3350|
|                 28.0| 1852|
|                 null| 1773|
|                 32.0| 1732|
|                  6.0| 1498|
|                 24.0| 1397|
|                  7.0| 1199|
|                 22.0|  798|
|                 76.0|  735|
|                  3.0|  515|
+---------------------+-----+
only showing top 10 rows



Just for reference

Top Pickup Areas

* 8 is the area of the Magnificent Mile, Riverwalk, Gold Coast neighborhood, and Navy Pier (tourist trap)
* 28 is the Near West Side, I think lots of condos and new restaurants
* null is outside the city
* 32 is "The Loop", the downtown business district and EL train hub
* 6 is Lakeview, a near-north neighborhood, predominantly white, and home of the Cubs stadium
* 24 is West Town, don't know it as well.  We first guessed the Bulls play here, but the United Center is in the Near West Side (28)
* 7 is Lincoln Park.  Like Lakeview but more expensive
* 22 is Logan Square.  Edge of a transitioning neighborhood.  More affordable for new (white) owners
* 76 is O'Hare Airport
* 3 is Uptown, north of Lakeview; like Logan Square, less expensive

In [41]:
# do the same for the dropoffs
df2.groupby('Dropoff_Community_Area').count().orderBy('count', ascending=False).show(10)

+----------------------+-----+
|Dropoff_Community_Area|count|
+----------------------+-----+
|                   8.0| 3241|
|                  32.0| 1972|
|                  28.0| 1942|
|                  null| 1911|
|                   6.0| 1449|
|                  24.0| 1296|
|                   7.0| 1213|
|                  76.0|  829|
|                  22.0|  814|
|                   3.0|  514|
+----------------------+-----+
only showing top 10 rows



Just for reference

Dropoff Areas

* 8 is the area of the Magnificent Mile, Riverwalk, Gold Coast neighborhood, and Navy Pier (tourist trap)
* 32 is "The Loop", the downtown business district and EL train hub
* 28 is the Near West Side, I think lots of condos and new restaurants
* null is outside the city
* 6 is Lakeview, a near-north neighborhood, predominantly white, and home of the Cubs stadium
* 24 is West Town, don't know it as well.  We first guessed the Bulls play here, but the United Center is in the Near West Side (28)
* 7 is Lincoln Park.  Like Lakeview but more expensive
* 76 is O'Hare Airport
* 22 is Logan Square.  Edge of a transitioning neighborhood.  More affordable for new (white) owners
* 3 is Uptown, north of Lakeview; like Logan Square, less expensive

The two lists are almost identical, with just a few changes in order.

How much of the traffic comes from these heavy-use neighborhoods?  
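
One rough answer can be read off the sampled pickup counts shown earlier. The sketch below hard-codes the top-10 pickup counts and the sample size of 24,414 from the outputs above. The variable names are illustrative, and the result describes only this 0.05% sample.

```python
# Top-10 pickup counts copied from the sampled output above
# (key None stands for trips with no community area, i.e. outside the city).
top_pickups = {
    8: 3350, 28: 1852, None: 1773, 32: 1732, 6: 1498,
    24: 1397, 7: 1199, 22: 798, 76: 735, 3: 515,
}
sample_size = 24414  # df2.count() for the sampled dataframe

top_total = sum(top_pickups.values())
share = top_total / sample_size
print(f"top 10 areas: {top_total} of {sample_size} trips ({share:.1%})")
```

In this sample the ten busiest pickup areas account for about 61% of all trips.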

In [None]:
# replace our null values in the pickup and dropoff locations
# https://stackoverflow.com/questions/42312042/how-to-replace-all-null-values-of-a-dataframe-in-pyspark

In [14]:
df3 = df2.na.fill(value=99,subset=['Pickup_Community_Area', 'Dropoff_Community_Area'])

In [13]:
df3.groupby('Dropoff_Community_Area').count().orderBy('count', ascending=False).show(10)

+----------------------+-----+
|Dropoff_Community_Area|count|
+----------------------+-----+
|                   8.0| 3241|
|                  32.0| 1972|
|                  28.0| 1942|
|                  99.0| 1911|
|                   6.0| 1449|
|                  24.0| 1296|
|                   7.0| 1213|
|                  76.0|  829|
|                  22.0|  814|
|                   3.0|  514|
+----------------------+-----+
only showing top 10 rows



In [14]:
df3.groupby('Pickup_Community_Area').count().orderBy('count', ascending=False).show(10)

+---------------------+-----+
|Pickup_Community_Area|count|
+---------------------+-----+
|                  8.0| 3350|
|                 28.0| 1852|
|                 99.0| 1773|
|                 32.0| 1732|
|                  6.0| 1498|
|                 24.0| 1397|
|                  7.0| 1199|
|                 22.0|  798|
|                 76.0|  735|
|                  3.0|  515|
+---------------------+-----+
only showing top 10 rows



It looks like changing the nulls to 99 for the community area works.

Next, let's add a column recording tip or no tip as a binary.

In [31]:
df3.show(20)

+--------------------+--------------------+------------+----------+---------------------+----------------------+----+---+------------------+----------+----------------------+------------+
|             Trip_ID|Trip_Start_Timestamp|Trip_Seconds|Trip_Miles|Pickup_Community_Area|Dropoff_Community_Area|Fare|Tip|Additional_Charges|Trip_Total|Shared_Trip_Authorized|Trips_Pooled|
+--------------------+--------------------+------------+----------+---------------------+----------------------+----+---+------------------+----------+----------------------+------------+
|923e2cd22e434fe90...|12/01/2019 12:15:...|       966.0|       4.6|                 28.0|                   6.0|10.0|0.0|              2.55|     12.55|                 false|         1.0|
|54fb0ae8d3d76bc94...|12/01/2019 12:45:...|       825.0|       3.3|                  7.0|                   5.0|10.0|0.0|              2.55|     12.55|                 false|         1.0|
|fc02ddb50acc2353e...|12/01/2019 01:00:...|       310.0|    

In [25]:
df4 = df3.withColumn("Tip_bool",df3.Tip.cast(BooleanType())).printSchema()

root
 |-- Trip_ID: string (nullable = true)
 |-- Trip_Start_Timestamp: string (nullable = true)
 |-- Trip_Seconds: double (nullable = true)
 |-- Trip_Miles: double (nullable = true)
 |-- Pickup_Community_Area: double (nullable = false)
 |-- Dropoff_Community_Area: double (nullable = false)
 |-- Fare: double (nullable = true)
 |-- Tip: double (nullable = true)
 |-- Additional_Charges: double (nullable = true)
 |-- Trip_Total: string (nullable = true)
 |-- Shared_Trip_Authorized: boolean (nullable = true)
 |-- Trips_Pooled: double (nullable = true)
 |-- Tip_bool: boolean (nullable = true)



In [32]:
df4.show(20)

AttributeError: 'NoneType' object has no attribute 'show'

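This broke because `printSchema()` returns `None`, so `df4` was assigned `None` instead of the new DataFrame.  The cast itself worked, as the `Tip_bool: boolean` line in the schema shows.  Below we instead make a column that is 1 or 0 depending on the tip, and then cast that to boolean.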

In [40]:
# lit() turned out not to be needed for this step; when()/otherwise() builds the 0/1 column
# https://hackersandslackers.com/transforming-pyspark-dataframes/

from pyspark.sql.functions import lit, when, col

#df4 = df3.withColumn('testColumn', F.lit('this is a test'))
# that worked

# 1 if the ride was tipped, 0 otherwise (a null Tip also becomes 0)
df4 = df3.withColumn('Tip_Bool', when(col("Tip") > 0, 1).otherwise(0))

# general pattern:
# df = df.withColumn([COLUMN_NAME], F.when([CONDITIONAL], [COLUMN_VALUE]).otherwise([COLUMN_VALUE]))


display(df4)

DataFrame[Trip_ID: string, Trip_Start_Timestamp: string, Trip_Seconds: double, Trip_Miles: double, Pickup_Community_Area: double, Dropoff_Community_Area: double, Fare: double, Tip: double, Additional_Charges: double, Trip_Total: string, Shared_Trip_Authorized: boolean, Trips_Pooled: double, Tip_Bool: int]

In [41]:
df4.show(50)

+--------------------+--------------------+------------+----------+---------------------+----------------------+----+---+------------------+----------+----------------------+------------+--------+
|             Trip_ID|Trip_Start_Timestamp|Trip_Seconds|Trip_Miles|Pickup_Community_Area|Dropoff_Community_Area|Fare|Tip|Additional_Charges|Trip_Total|Shared_Trip_Authorized|Trips_Pooled|Tip_Bool|
+--------------------+--------------------+------------+----------+---------------------+----------------------+----+---+------------------+----------+----------------------+------------+--------+
|923e2cd22e434fe90...|12/01/2019 12:15:...|       966.0|       4.6|                 28.0|                   6.0|10.0|0.0|              2.55|     12.55|                 false|         1.0|       0|
|54fb0ae8d3d76bc94...|12/01/2019 12:45:...|       825.0|       3.3|                  7.0|                   5.0|10.0|0.0|              2.55|     12.55|                 false|         1.0|       0|
|fc02ddb50acc23

In [42]:
# now we can cast it 
# https://sparkbyexamples.com/pyspark/pyspark-cast-column-type/

df5 = df4.withColumn("Tip_Bool",col("Tip_Bool").cast(BooleanType()))

In [43]:
df5.printSchema()

root
 |-- Trip_ID: string (nullable = true)
 |-- Trip_Start_Timestamp: string (nullable = true)
 |-- Trip_Seconds: double (nullable = true)
 |-- Trip_Miles: double (nullable = true)
 |-- Pickup_Community_Area: double (nullable = false)
 |-- Dropoff_Community_Area: double (nullable = false)
 |-- Fare: double (nullable = true)
 |-- Tip: double (nullable = true)
 |-- Additional_Charges: double (nullable = true)
 |-- Trip_Total: string (nullable = true)
 |-- Shared_Trip_Authorized: boolean (nullable = true)
 |-- Trips_Pooled: double (nullable = true)
 |-- Tip_Bool: boolean (nullable = false)



In [44]:
df5.show(3)

+--------------------+--------------------+------------+----------+---------------------+----------------------+----+---+------------------+----------+----------------------+------------+--------+
|             Trip_ID|Trip_Start_Timestamp|Trip_Seconds|Trip_Miles|Pickup_Community_Area|Dropoff_Community_Area|Fare|Tip|Additional_Charges|Trip_Total|Shared_Trip_Authorized|Trips_Pooled|Tip_Bool|
+--------------------+--------------------+------------+----------+---------------------+----------------------+----+---+------------------+----------+----------------------+------------+--------+
|923e2cd22e434fe90...|12/01/2019 12:15:...|       966.0|       4.6|                 28.0|                   6.0|10.0|0.0|              2.55|     12.55|                 false|         1.0|   false|
|54fb0ae8d3d76bc94...|12/01/2019 12:45:...|       825.0|       3.3|                  7.0|                   5.0|10.0|0.0|              2.55|     12.55|                 false|         1.0|   false|
|fc02ddb50acc23

Now that we have that, can we group by tips and our pickup/dropoff areas?

In [46]:
df5.groupby('Dropoff_Community_Area', 'Tip_bool').count().orderBy('count', ascending=False).show(50)

+----------------------+--------+-----+
|Dropoff_Community_Area|Tip_bool|count|
+----------------------+--------+-----+
|                   8.0|   false| 2634|
|                  28.0|   false| 1615|
|                  99.0|   false| 1565|
|                  32.0|   false| 1560|
|                   6.0|   false| 1168|
|                  24.0|   false| 1077|
|                   7.0|   false| 1019|
|                  22.0|   false|  668|
|                   8.0|    true|  607|
|                  76.0|   false|  571|
|                   3.0|   false|  419|
|                  32.0|    true|  412|
|                  25.0|   false|  370|
|                  99.0|    true|  346|
|                  28.0|    true|  327|
|                  33.0|   false|  316|
|                  41.0|   false|  293|
|                  43.0|   false|  287|
|                   6.0|    true|  281|
|                  77.0|   false|  277|
|                  76.0|    true|  258|
|                  31.0|   false|  245|


I guess now we should start doing more statistics?
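
As a first statistic, the grouped counts above can be turned into a tip rate per dropoff area. The sketch below hard-codes the (tipped, not tipped) pairs that are visible in the sampled output. The dictionary layout and variable names are illustrative assumptions, and areas whose `true` row is cut off in the output are left out.

```python
# (tipped, not tipped) counts copied from the sampled groupby output above
counts = {
    8: (607, 2634),
    32: (412, 1560),
    99: (346, 1565),   # 99 = outside the city (filled nulls)
    28: (327, 1615),
    6: (281, 1168),
    76: (258, 571),    # O'Hare Airport
}

# Fraction of trips to each area that received a tip
tip_rate = {area: tipped / (tipped + untipped)
            for area, (tipped, untipped) in counts.items()}

for area, rate in sorted(tip_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"area {area:>2}: {rate:.1%} of trips tipped")
```

In this 24,414-row sample, trips dropped off at O'Hare (76) were tipped about 31% of the time, versus roughly 17% to 21% for the other areas listed. That pattern would need to be confirmed on the full dataset.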

Stuff copied from other Tahsman notebooks that might be useful.  Not developed yet.

In [None]:
# for each field, compute the fraction of missing (null) values
# from the preprocessing example notebook
# uses df5, since the full df was deleted above

df5.agg(*[
    (1 - F.count(c) / F.count('*')).alias(c + '_miss')
    for c in df5.columns
]).show()

In [None]:
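# we might need to use this.  From the logistic regression example code
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data: the first value is the label, the rest are features
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])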

In [None]:


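# `data` is not defined yet at this point; it needs to be an RDD of space-separated numeric strings
parsedData = data.map(parsePoint)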

# Print a record to understand the data structure
print(parsedData.take(1))

In [None]:
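# Build the model
# one line model build.  We may need to do multiple types
# (LogisticRegressionWithSGD is deprecated since Spark 2.0 in favor of LogisticRegressionWithLBFGS)
from pyspark.mllib.classification import LogisticRegressionWithSGD

model = LogisticRegressionWithSGD.train(parsedData)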

In [None]:
# this example code builds (label, prediction) pairs, which is backwards for the
# metrics classes in the documentation; those expect (prediction, label) pairs
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
print(labelsAndPreds.take(3))

In [None]:
#bayes example

In [None]:
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils

# Load the data file. Note this data is in sparse format.
data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4])

In [None]:
# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)

In [None]:
# Make prediction and test accuracy.
labelsAndPreds = test.map(lambda p: (p.label, model.predict(p.features)))
accuracy = 1.0 * labelsAndPreds.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy {}'.format(accuracy))

# Source: https://spark.apache.org/docs/latest/mllib-naive-bayes.html

In [None]:
#decision tree examples

In [None]:
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

# Load and parse the data file
data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

In [None]:
# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

In [None]:
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

In [52]:
# my personal favorite... trees

**Tree-Based Ensemble Methods**

*Ensembles* combine multiple models together to produce a new model.  
They may consist of models of the same type (e.g., all decision trees) or mixed type (e.g., decision tree + neural net + svm)  

One of the fundamental results in machine learning is that multiple weak classifiers can be combined to produce a strong classifier.  

Ensembles are useful in reducing overfitting, since predictions are based on several different trees  

The two most popular tree-based ensemble methods are *Random Forests* and *Boosted Trees* (e.g. *Gradient-Boosted Trees*)  

They are popular because they are often very competitive  

The nice properties of decision trees carry over to ensembles of trees  

This combining step can proceed using different methods, including:  

- voting (for classification)
- averaging (for regression) 
- running model predictions through another model (classification and regression)
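
The first two combining methods can be sketched in a few lines of plain Python. The function names and the example predictions are made up for illustration. Stacking (feeding predictions into another model) is left out because it needs a trained second-stage model.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Combine class predictions: return the most common label."""
    return Counter(labels).most_common(1)[0][0]

def average(values):
    """Combine regression predictions: return their mean."""
    return mean(values)

# Hypothetical predictions from three models for a single ride
class_preds = [1, 0, 1]          # tip / no tip
reg_preds = [1.50, 2.00, 2.50]   # predicted tip in dollars

print(majority_vote(class_preds))  # 1
print(average(reg_preds))          # 2.0
```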

There are downsides to ensembles:  

- Multiple models need to be trained, loaded, and maintained  
- Model explanation is harder: there are no p-values as in regression, and several trees feed into the overall decision.  
  There are methods to provide information on how features affect predictions, such as feature importance scores and partial dependence plots.

**Random Forest**  
Ensembles of decision trees  

RFs inject two sources of randomness into modeling:  

1. At each split, randomly select $p$ features out of $n$ total features as candidates for that split (random subspace method)
2. Sample the original training set with replacement, up to the size of the original training set (bootstrapping of the training set)

The number of features to randomly select $p$ is a parameter  
The number of bootstrapped trees to grow $N$ is a parameter  

Since the trees are grown independently, the training and prediction tasks are embarrassingly parallel and can be assigned to multiple workers.

Classification prediction is done by majority vote across trees.
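
The two sources of randomness can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: the helper names and toy sizes are invented here, and the feature subset is drawn once per tree for brevity, whereas a random forest redraws it at every split.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (bootstrapping the training set)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, p, rng):
    """Pick p distinct feature indices out of n_features (random subspace)."""
    return sorted(rng.sample(range(n_features), p))

rng = random.Random(42)           # fixed seed so the sketch is repeatable
rows = list(range(10))            # toy training set of 10 row ids
n_features, p, n_trees = 6, 2, 3  # illustrative sizes

for tree in range(n_trees):
    sample = bootstrap_sample(rows, rng)
    features = random_feature_subset(n_features, p, rng)
    print(f"tree {tree}: rows={sample} candidate features={features}")
```

Because each tree only needs its own sample and feature subsets, the loop body has no dependence between iterations, which is what makes the training easy to parallelize.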

**Random Forest Implementation**

`from pyspark.mllib.tree import RandomForest`  

Two most important parameters (which should be tuned using $k$-fold cross validation):  

- `numTrees`: Number of trees in the forest.  
  More trees reduce the variance of the predictions and generally improve accuracy up to a point, but also increase training time.  

- `maxDepth`: Maximum depth of each tree in the forest.  
  Increasing depth can increase the power of the model, but takes longer to train and can overfit.  

Other important parameters:

- `subsamplingRate`: fraction of size of original training set (default=1.0 recommended)

- `featureSubsetStrategy`: specified as fraction or function of total number of features
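
To make the $k$-fold tuning idea concrete, here is a minimal plain-Python sketch of how rows could be split into folds for each parameter candidate. The helper `k_fold_indices` and the parameter grid are hypothetical. A real run would shuffle the rows first, then train a forest on each training split and score it on the matching validation split.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k roughly equal folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# Hypothetical grid of (numTrees, maxDepth) candidates to compare
param_grid = [(50, 4), (100, 4), (100, 8)]

for num_trees, max_depth in param_grid:
    n_folds = 0
    for train_idx, valid_idx in k_fold_indices(n_rows=10, k=5):
        # A real run would train on train_idx and score on valid_idx here.
        n_folds += 1
    print(f"numTrees={num_trees}, maxDepth={max_depth}: {n_folds} folds")
```

The candidate with the best average validation score across the folds would then be refit on all of the training data.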

**Random Forest Example: load data/train model/predict**  
NOTE: Very similar to Decision Tree code above


In [None]:
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

In [None]:
# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=1000, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=5, maxBins=32)

In [None]:
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

**Gradient-Boosted Trees**  

GBTs work by building a sequence of trees and combining their predictions at each iteration.  The trees constructed are generally *stumps* which use a single decision split.  A stump is an example of a weak learner.

This is different from random forests, where each tree independently gives predictions on each training instance.



A loss is specified and an optimization problem is solved whereby the objective is to minimize the loss of the model by adding weak learners using a gradient-descent-like procedure.

The procedure follows a stage-wise additive model, meaning that one new weak learner is added at a time and existing weak learners are left unchanged.  
For the original work, see:

*Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of Statistics (2001): 1189–1232.*
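
The stage-wise idea can be sketched in plain Python. This is a toy illustration, not the MLlib implementation: it assumes a single numeric feature, uses squared-error loss (so the negative gradient is simply the residual) instead of log loss, and all function names and data are made up.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, num_iterations, learning_rate):
    """Add one stump per iteration; earlier stumps are never refit."""
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(num_iterations):
        # For squared-error loss the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)

# Toy data: y is roughly a step function of x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

def mse(model):
    """Mean squared error of a model on the toy training data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for n in (1, 5, 20):
    print(n, "iterations, training MSE:", round(mse(boost(xs, ys, n, 0.5)), 4))
```

The training error drops as more stumps are added, and each stump is fit only to what the earlier ones left unexplained.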


**Gradient-Boosted Trees Implementation**  

Since each tree depends on the trees built before it, the trees cannot be trained in parallel with one another.  
However, shallow trees (e.g., stumps) can be used effectively; this saves time per tree versus random forests, which use deeper trees.

The loss function in classification problems is the log loss, equal to twice the binomial negative log likelihood.
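
As a numerical check of that statement, the sketch below assumes labels coded as $y \in \{-1, +1\}$ and the per-example log loss $2\log(1 + e^{-2yF})$ as given in the Spark MLlib documentation, where $F$ is the raw model score. The function names are illustrative.

```python
import math

def spark_log_loss(y, F):
    """Per-example log loss for y in {-1, +1} and raw score F."""
    return 2.0 * math.log(1.0 + math.exp(-2.0 * y * F))

def twice_binomial_nll(y, F):
    """Twice the binomial negative log likelihood with p = sigmoid(2F)."""
    p = 1.0 / (1.0 + math.exp(-2.0 * F))   # modelled P(y = +1)
    t = (y + 1) / 2                         # map {-1, +1} to {0, 1}
    return -2.0 * (t * math.log(p) + (1 - t) * math.log(1.0 - p))

for y, F in [(1, 0.3), (-1, 0.3), (1, -1.2), (-1, 2.0)]:
    print(y, F, spark_log_loss(y, F), twice_binomial_nll(y, F))
```

The two columns agree up to floating-point rounding, which is the equivalence stated above.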

Important parameters:
- `numIterations`:  equal to the number of trees in the ensemble.  More trees means longer runtime but also better performance up to a point.
- `learningRate`:  how quickly the model adapts on each iteration. A smaller value may help the algorithm perform better, but at the cost of additional runtime. The documentation recommends NOT tuning this param.

The method `runWithValidation` can help mitigate overfitting.  It takes a training RDD and a validation RDD.

The training is stopped when the improvement in the validation error is not more than a certain tolerance (supplied by the `validationTol` argument in `BoostingStrategy`).
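
The stopping idea can be illustrated with a simplified rule in plain Python. This is an assumption-laden sketch: it uses an absolute improvement threshold on made-up validation errors, which is not necessarily the exact criterion Spark applies, and `stop_iteration` is a hypothetical helper.

```python
def stop_iteration(validation_errors, validation_tol):
    """Return the index of the first iteration whose improvement is <= tol.

    Returns None if every iteration improved by more than the tolerance.
    """
    for i in range(1, len(validation_errors)):
        improvement = validation_errors[i - 1] - validation_errors[i]
        if improvement <= validation_tol:
            return i
    return None

# Hypothetical validation errors after each boosting iteration
errors = [0.40, 0.31, 0.26, 0.24, 0.239, 0.241]

print(stop_iteration(errors, validation_tol=0.01))  # 4
```

Training would stop at iteration 4 here, before the validation error starts rising again at iteration 5.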

**GBT Example: load data/train model/predict**

In [None]:
from pyspark.mllib.tree import GradientBoostedTrees
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

In [None]:
# Train a GradientBoostedTrees model.
model = GradientBoostedTrees.trainClassifier(trainingData, categoricalFeaturesInfo={}, numIterations=10)

In [None]:
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

In [None]:

df = spark.read.json('https://data.cityofchicago.org/api/odata/v4/m6dm-c72p')

In [None]:
from sodapy import Socrata
import pandas as pd  # download_dataset below returns a pandas DataFrame

In [None]:
def download_dataset(domain, dataset_id):
    # for this exercise, we're not using an app token,
    # but you *should* sign-up and register for an app_token if you want to use the Socrata API
    client = Socrata(domain, app_token=None)
    offset = None
    data = []
    batch_size = 1000

    while True:
        records = client.get(dataset_id, offset=offset, limit=batch_size)
        data.extend(records)
        if len(records) < batch_size:
            break
        offset = offset + batch_size if (offset) else batch_size

    return pd.DataFrame.from_dict(data)

def download_permits_dataset():
    return seattle_permits_df if "seattle_permits_df" in globals() else download_dataset("data.seattle.gov", "k44w-2dcq")

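# load Seattle permits data (example copied as-is; for the Chicago trips the
# domain would be "data.cityofchicago.org" and the dataset id "m6dm-c72p")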
seattle_permits_df = download_permits_dataset()