# Start of DS5559 Final Project

Team Left Twix Members

* Alice Wright - aew7j
* Edward Thompson - ejt8b
* Michael Davies - mld9s
* Sam Parsons - sp8hp

In STAT 6021, members of our cohort examined Transportation Network Company (TNC) data sets, i.e. rideshares such as Uber and Lyft, to see whether tipping was related to other indicators. At that point in our Data Science journey we did not have the skills or computing resources to investigate this question in depth.  

By applying machine learning skills from SYS 6018 and Spark to this dataset, we hope to come up with a more robust set of answers and potentially a better predictor of tipping. With other classification algorithms such as random forest, and the heavy-weight data processing of Spark, will we be able to create a more robust predictive model?


Potential Questions from the TNC Data:

* Can we predict which rides are most likely to result in a tip for the driver?
* Is there a relationship between time of the fare and tipping? (workday start, bar close, weekday, weekend, etc.)
* Is there a relationship between start or end location of the ride and tipping? (downtown pickup, north shore, airport, etc.)
* Is there a relationship between length or cost of ride and tipping? (do longer rides result in tips?)
* Using this data would we be able to make recommendations to drivers to maximize likelihood of receiving a tip?
* Is the likelihood of tipping changing over time?  Are more rides being tipped?
* Are there re-identification abilities in this dataset? For instance, can we find records for a person who reliably takes a rideshare to/from work every day thereby linking a home address to a work address?




Additionally, joining other datasets may yield answers to questions about external factors such as:
* How did news reporting/social media on rideshare companies correlate with tipping?
* What relationship(s) does trip demand have with the stocks of these companies?

Data Source:
The best data source for this appears to be from the City of Chicago, as it is large (169M records and 21 columns), relatively clean, anonymized, and accessible via API.

City of Chicago:
https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p

So far we have only downloaded the data as a CSV file.

Code Rubric

* Data Import and PreProcessing | 2 pts

* Data splitting/sampling | 1 pt

* EDA (min two graphs) | 2 pts

* Model construction (min 3 models) | 3 pts

* Model evaluation | 2 pts

In [1]:
# import the Spark entry point: SparkSession
from pyspark.sql import SparkSession

# import data types and column functions
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               TimestampType, DoubleType, BooleanType)
import pyspark.sql.functions as F

spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()
sc = spark.sparkContext

In [9]:
#this reads as an RDD
#Need to update with our data file
#data = sc.textFile('Trips50sample.csv')


In [11]:
#data.take(5)

In [5]:
#type(data)

pyspark.rdd.RDD

In [1]:
#reads our data in as a DF
df = spark.read.csv('/../../project/ds5559/Alice_Ed_Michael_Sam_project/BigTrips.csv')

#better way?
#https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/

#df = spark.read.format('csv').options(header='true').load('Trips50sample.csv')

#https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html

#df = spark.read.load('Trips50sample.csv', format="csv", sep=",", inferSchema="true", header="true")

#df = spark.read.load('Trips50sample.tar.gz', format="csv", sep=",", inferSchema="true", header="true")

#https://stackoverflow.com/questions/40377820/loading-compressed-gzipped-csv-file-in-spark-2-0
#df = spark.read.option("header", "true").csv("Trips50sample.tar.gz")

#df = spark.read.load('Trips50sample.csv', format="csv", sep=",", inferSchema="false", header="true")



NameError: name 'spark' is not defined

In [14]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)



In [15]:
df.show(5)

+---------+--------------------+--------------------+------------+----------+-------------------+--------------------+--------------------+--------------------+----+----+------------------+----------+--------------------+------------+
|      _c0|                 _c1|                 _c2|         _c3|       _c4|                _c5|                 _c6|                 _c7|                 _c8| _c9|_c10|              _c11|      _c12|                _c13|        _c14|
+---------+--------------------+--------------------+------------+----------+-------------------+--------------------+--------------------+--------------------+----+----+------------------+----------+--------------------+------------+
|     null|Trip Start Timestamp|  Trip End Timestamp|Trip Seconds|Trip Miles|Pickup Census Tract|Dropoff Census Tract|Pickup Community ...|Dropoff Community...|Fare| Tip|Additional Charges|Trip Total|Shared Trip Autho...|Trips Pooled|
| 11911213|02/10/2019 06:15:...|02/10/2019 06:30:...|       

Fields from test50:

trip ID, Trip Start Timestamp, Trip End Timestamp, Trip Seconds, Trip Miles, Pickup Census Tract, Dropoff Census Tract, Pickup Community Area, Dropoff Community Area, Fare, Tip, Additional Charges, Trip Total, Shared Trip Authorized, Trips Pooled


In [18]:
#load the data with a specific schema
#https://stackoverflow.com/questions/39926411/provide-schema-while-reading-csv-file-as-a-dataframe
#https://towardsdatascience.com/pyspark-import-any-data-f2856cda45fd

#this schema only works with the fields in the test data, need to expand for the full dataset

customSchema = StructType([
    StructField("trip_id", StringType(), True),        
    StructField("trip_start_time", TimestampType(), True),
    StructField("trip_end_time", TimestampType(), True),
    StructField("trip_seconds", DoubleType(), True),
    StructField("trip_miles", DoubleType(), True),
    StructField("trip_pickup_census", DoubleType(), True),
    StructField("trip_dropoff_census", DoubleType(), True),
    StructField("trip_pickup_ca", DoubleType(), True),
    StructField("trip_dropoff_ca", DoubleType(), True),
    StructField("trip_fare", DoubleType(), True),
    StructField("trip_tip", DoubleType(), True),
    StructField("trip_additional_charges", DoubleType(), True),
    StructField("trip_total", DoubleType(), True),
    StructField("trip_shared_trip_auth", BooleanType(), True),
    StructField("trip_trips_pooled", IntegerType(), True)    
])


csv_2_df = spark.read.csv("Trips50sample.csv.gz", header = 'false', schema=customSchema)
# # #df = spark.read.load('Trips50sample.csv', format="csv", header="false", sep=',', schema=customSchema)

# df = spark.read.load('Trips50sample.csv', format="csv", sep=",", inferSchema="true", header="true")


#testProduct.csv
#ID|SEARCHNAME|PRICE
#6607|EFKTON75LIN|890.88
#6612|EFKTON100HEN|55.66

In [19]:
csv_2_df.show(5)

+-------+---------------+-------------+------------+----------+------------------+-------------------+--------------+---------------+---------+--------+-----------------------+----------+---------------------+-----------------+
|trip_id|trip_start_time|trip_end_time|trip_seconds|trip_miles|trip_pickup_census|trip_dropoff_census|trip_pickup_ca|trip_dropoff_ca|trip_fare|trip_tip|trip_additional_charges|trip_total|trip_shared_trip_auth|trip_trips_pooled|
+-------+---------------+-------------+------------+----------+------------------+-------------------+--------------+---------------+---------+--------+-----------------------+----------+---------------------+-----------------+
|   null|           null|         null|        null|      null|              null|               null|          null|           null|     null|    null|                   null|      null|                 null|             null|
|   null|           null|         null|        null|      null|              null|      

In [48]:
# schema = StructType() \
#     .add("trip_ID", StringType(), False)\
#     .add("Trip Start Timestamp", TimestampType(), False)\
#     .add("trip_end_time", TimestampType(), False)\
#     .add("trip_seconds", DoubleType(), False)\
#     .add("trip_miles", DoubleType(), False)\
#     .add("trip_pickup_census", DoubleType(), True)\
#     .add("trip_dropoff_census", DoubleType(), True)\
#     .add("trip_pickup_ca", DoubleType(), True)\
#     .add("trip_dropoff_ca", DoubleType(), True)\
#     .add("trip_fare", DoubleType(), False)\
#     .add("trip_tip", DoubleType(), False)\
#     .add("trip_additional_charges", DoubleType(), False)\
#     .add("trip_total", DoubleType(), False)\
#     .add("trip_shared_trip_auth", BooleanType(), True)\
#     .add("trip_trips_pooled", IntegerType(), True)



In [49]:
# df_with_schema = spark.read.format("csv") \
#       .option("header", False) \
#       .schema(schema) \
#       .load('Trips50sample.csv')

In [50]:
#df_with_schema.printSchema()

root
 |-- trip_ID: string (nullable = true)
 |-- Trip Start Timestamp: timestamp (nullable = true)
 |-- trip_end_time: timestamp (nullable = true)
 |-- trip_seconds: double (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- trip_pickup_census: double (nullable = true)
 |-- trip_dropoff_census: double (nullable = true)
 |-- trip_pickup_ca: double (nullable = true)
 |-- trip_dropoff_ca: double (nullable = true)
 |-- trip_fare: double (nullable = true)
 |-- trip_tip: double (nullable = true)
 |-- trip_additional_charges: double (nullable = true)
 |-- trip_total: double (nullable = true)
 |-- trip_shared_trip_auth: boolean (nullable = true)
 |-- trip_trips_pooled: integer (nullable = true)



In [51]:
#df_with_schema.show(5)

+-------+--------------------+-------------+------------+----------+------------------+-------------------+--------------+---------------+---------+--------+-----------------------+----------+---------------------+-----------------+
|trip_ID|Trip Start Timestamp|trip_end_time|trip_seconds|trip_miles|trip_pickup_census|trip_dropoff_census|trip_pickup_ca|trip_dropoff_ca|trip_fare|trip_tip|trip_additional_charges|trip_total|trip_shared_trip_auth|trip_trips_pooled|
+-------+--------------------+-------------+------------+----------+------------------+-------------------+--------------+---------------+---------+--------+-----------------------+----------+---------------------+-----------------+
|   null|                null|         null|        null|      null|              null|               null|          null|           null|     null|    null|                   null|      null|                 null|             null|
|   null|                null|         null|        null|      null|

In [53]:
type(df)

pyspark.sql.dataframe.DataFrame

In [37]:
df.show(5)

+---------+--------------------+--------------------+------------+----------+-------------------+--------------------+---------------------+----------------------+----+----+------------------+----------+----------------------+------------+
|      _c0|Trip Start Timestamp|  Trip End Timestamp|Trip Seconds|Trip Miles|Pickup Census Tract|Dropoff Census Tract|Pickup Community Area|Dropoff Community Area|Fare| Tip|Additional Charges|Trip Total|Shared Trip Authorized|Trips Pooled|
+---------+--------------------+--------------------+------------+----------+-------------------+--------------------+---------------------+----------------------+----+----+------------------+----------+----------------------+------------+
| 11911213|02/10/2019 06:15:...|02/10/2019 06:30:...|       801.0|       1.9|    1.7031081201E10|       1.70313201E10|                  8.0|                  32.0| 5.0| 0.0|               0.0|       5.0|                  true|           3|
|110540837|12/27/2019 03:30:...|12/27/20

In [60]:
df = df.withColumnRenamed("_c0","trip_id")

The above is better: we have our headers, except for the ID column.

With the custom schema, all of the data is now null.
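
One plausible cause of the all-null rows is a parsing mismatch: the schema declares timestamp columns, but the raw strings are in a US-style format, and with the header option off the header line is also read as a data row. The sketch below uses only the Python standard library to illustrate the idea. It assumes the raw values look like `02/10/2019 06:15:00 PM`; the `df.show` output above truncates them, so the seconds and AM/PM parts are an assumption, and `parse_trip_timestamp` is a hypothetical helper.

```python
from datetime import datetime

# Assumed raw format; the truncated output above only confirms the MM/DD/YYYY HH:MM prefix.
ASSUMED_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_trip_timestamp(raw, fmt=ASSUMED_FORMAT):
    """Return a datetime, or None when the string does not match fmt."""
    try:
        return datetime.strptime(raw, fmt)
    except (ValueError, TypeError):
        return None


sample = "02/10/2019 06:15:00 PM"      # hypothetical value in the assumed format
header_cell = "Trip Start Timestamp"   # what a parser sees when the header row is read as data

print(parse_trip_timestamp(sample))                        # parses with the matching format
print(parse_trip_timestamp(sample, "%Y-%m-%d %H:%M:%S"))   # None: ISO-style pattern does not match
print(parse_trip_timestamp(header_cell))                   # None: header text is not a timestamp
```

If the same mismatch applies in Spark, passing `header=True` and a matching `timestampFormat` option (for example `MM/dd/yyyy hh:mm:ss a`) to `spark.read.csv` is one thing to try. This is untested against the project data.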

In [61]:
df.printSchema()

root
 |-- trip_id: integer (nullable = true)
 |-- Trip Start Timestamp: string (nullable = true)
 |-- Trip End Timestamp: string (nullable = true)
 |-- Trip Seconds: double (nullable = true)
 |-- Trip Miles: double (nullable = true)
 |-- Pickup Census Tract: double (nullable = true)
 |-- Dropoff Census Tract: double (nullable = true)
 |-- Pickup Community Area: double (nullable = true)
 |-- Dropoff Community Area: double (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tip: double (nullable = true)
 |-- Additional Charges: double (nullable = true)
 |-- Trip Total: double (nullable = true)
 |-- Shared Trip Authorized: boolean (nullable = true)
 |-- Trips Pooled: integer (nullable = true)



In [62]:
#df2 = df.withColumn('trip_id', F.col('trip_id').cast(StringType()))

In [None]:
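# We might need to use this. From the logistic regression example code.
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data: each line is space-separated, label first, then features
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])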

In [None]:


parsedData = data.map(parsePoint)

# Print a record to understand the data structure
print(parsedData.take(1))

In [None]:
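# Build the model
# One-line model build. We may need to try multiple model types.
# LogisticRegressionWithSGD is deprecated since Spark 2.0 (LogisticRegressionWithLBFGS is the alternative).
from pyspark.mllib.classification import LogisticRegressionWithSGD
model = LogisticRegressionWithSGD.train(parsedData)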

In [None]:
# This example pairs (label, prediction); MLlib's evaluation metrics expect (prediction, label),
# so we would likely need to swap the order as in the documentation.
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
print(labelsAndPreds.take(3))

In [None]:
#bayes example

In [None]:
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils

# Load the data file. Note this data is in sparse format.
data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4])

In [None]:
# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)

In [None]:
# Make prediction and test accuracy.
labelsAndPreds = test.map(lambda p: (p.label, model.predict(p.features)))
accuracy = 1.0 * labelsAndPreds.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy {}'.format(accuracy))

# Source: https://spark.apache.org/docs/latest/mllib-naive-bayes.html

In [None]:
#decision tree examples

In [None]:
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

# Load and parse the data file
data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

In [None]:
# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

In [None]:
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

In [None]:
# my personal favorite... trees

**Tree-Based Ensemble Methods**

*Ensembles* combine multiple models together to produce a new model.  
They may consist of models of the same type (e.g., all decision trees) or mixed type (e.g., decision tree + neural net + svm)  

One of the fundamental results in machine learning is that multiple weak classifiers can be combined to produce a strong classifier.  

Ensembles are useful in reducing overfitting, since predictions are based on several different models rather than a single one.  

The two most popular tree-based ensemble methods are *Random Forests* and *Boosted Trees* (e.g. *Gradient-Boosted Trees*).  

They are popular because they are often among the top performers on classification and regression tasks.  

The nice properties of decision trees, such as handling categorical features and not requiring feature scaling, carry over to ensembles of trees.  

This combining step can proceed using different methods, including:  

- voting (for classification)
- averaging (for regression) 
- running model predictions through another model (classification and regression)
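
The first two combining methods can be sketched in a few lines of plain Python. This is a minimal illustration, not MLlib's implementation; the helper names and the toy predictions are hypothetical. The third method (feeding predictions into another model) is not shown.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Return the arithmetic mean of the individual model outputs."""
    return mean(values)


# Hypothetical outputs from five classifiers and three regressors for one input row
tree_votes = [1, 0, 1, 1, 0]
tree_estimates = [2.0, 3.0, 4.0]

print(majority_vote(tree_votes))   # 1
print(average(tree_estimates))     # 3.0
```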

There are downsides to ensembles:  

- Multiple models need to be trained, loaded, and maintained  
- Model explanation is harder: there are no p-values as in regression, and several trees feed the overall decision.  
Feature importance scores and partial dependence plots can help explain the model.

**Random Forest**  
Ensembles of decision trees  

RFs inject two sources of randomness into modeling:  

1. At each split, randomly select $p$ features out of $n$ total features as candidates for that split (random subspace method)
2. Sample the original training set with replacement, up to the size of the original training set (bootstrapping of the training set)

The number of features to randomly select, $p$, is a parameter.  
The number of bootstrapped trees to grow, $N$, is a parameter.  

Since the trees are grown independently, the training and prediction tasks are embarrassingly parallel and can be assigned to multiple workers.

Classification predictions are made by majority vote across trees.
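
The two sources of randomness can be sketched with the standard library alone. This is an illustration of the idea rather than how MLlib does it internally; `bootstrap_sample`, `random_feature_subset`, and the toy sizes are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some rows repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, p, rng):
    """Pick p distinct feature indices out of n_features."""
    return sorted(rng.sample(range(n_features), p))


rng = random.Random(42)          # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=8, p=3, rng=rng)

print(sample)     # 10 row ids, typically with duplicates
print(features)   # 3 distinct feature indices
```

In a forest, the bootstrap sample would be drawn once per tree, while a fresh feature subset would be drawn at every split.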

**Random Forest Implementation**

`from pyspark.mllib.tree import RandomForest`  

Two most important parameters (which should be tuned using $k$-fold cross validation):  

- `numTrees`: Number of trees in the forest.
More trees reduce the variance of predictions (with diminishing returns) but increase training time.  

- `maxDepth`: Maximum depth of each tree in the forest.
Increasing depth makes the model more expressive, but deeper trees take longer to train and are more prone to overfitting.  

Other important parameters:

- `subsamplingRate`: fraction of size of original training set (default=1.0 recommended)

- `featureSubsetStrategy`: specified as fraction or function of total number of features
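
As a reminder of what $k$-fold cross validation does when tuning parameters such as `numTrees` and `maxDepth`, the sketch below generates the train/validation index splits in plain Python. The helper `k_fold_indices` is hypothetical, and the model training and scoring for each fold is left out.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, validation_idx
        start = stop


folds = list(k_fold_indices(n_rows=10, k=3))
for train_idx, validation_idx in folds:
    # Each candidate (numTrees, maxDepth) pair would be trained on train_idx
    # and scored on validation_idx; the scores are then averaged over the folds.
    print(len(train_idx), len(validation_idx))
```

In practice the rows would be shuffled before splitting so that each fold is representative.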

**Random Forest Example: load data/train model/predict**  
NOTE: Very similar to Decision Tree code above


In [None]:
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

In [None]:
# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=1000, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=5, maxBins=32)

In [None]:
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

**Gradient-Boosted Trees**  

GBTs work by building a sequence of trees and combining their predictions at each iteration.  The trees are generally shallow; the extreme case is a *stump*, which uses a single decision split.  A stump is an example of a weak learner.

This is different from random forests, where each tree independently gives predictions on each training instance.



A loss is specified and an optimization problem is solved whereby the objective is to minimize the loss of the model by adding weak learners using a gradient-descent-like procedure.

The procedure follows a stage-wise additive model, meaning that one new weak learner is
added at a time and existing weak learners are left unchanged.
For the original work, see:

*Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of Statistics (2001): 1189–1232.*
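
The stage-wise additive procedure can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: it uses squared-error loss on a single numeric feature (so the negative gradient is just the residual), whereas the classification setting uses log loss. All names and data are hypothetical, and this is not MLlib's implementation.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_iterations=20, learning_rate=0.5):
    """Add one stump per iteration, each fit to the current residuals."""
    base = sum(ys) / len(ys)     # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(num_iterations):
        # Earlier stumps stay fixed; only the new one is fit.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(baseline_mse, mse)   # the boosted fit has lower training error than the mean
```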


**Gradient-Boosted Trees Implementation**  

Since the trees are built sequentially, they cannot be trained in parallel with one another (only one tree is trained at a time).  
However, shallow trees (e.g., stumps) can be used effectively; this saves time versus random forests, which use deeper trees.

The loss function in classification problems is the log loss, equal to twice the binomial negative log likelihood.
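
As we read the MLlib ensembles guide, the log loss is written as $2 \sum_{i=1}^{N} \log(1 + \exp(-2 y_i F(x_i)))$, with labels $y_i$ coded as $-1$ or $+1$ and $F(x_i)$ the raw model score. Treat the exact form as an assumption to check against the documentation for the Spark version in use. A direct translation into Python (the function name is hypothetical):

```python
import math


def log_loss(labels, scores):
    """Sum of 2*log(1 + exp(-2*y*F)) over examples; labels must be -1 or +1."""
    return sum(2.0 * math.log1p(math.exp(-2.0 * y * f))
               for y, f in zip(labels, scores))


# A score of 0 means "no opinion": each example contributes 2*log(2).
print(log_loss([1, -1], [0.0, 0.0]))

# A confident correct score is penalized far less than a confident wrong one.
print(log_loss([1], [2.0]), log_loss([1], [-2.0]))
```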

Important parameters:
- `numIterations`:  equal to the number of trees in the ensemble.  More trees means longer runtime but also better performance up to a point.
- `learningRate`:  how quickly the model adapts on each iteration. A smaller value may help the algorithm perform better, but at the cost of additional runtime. The documentation notes that this parameter should not need to be tuned.

The method `runWithValidation` can help mitigate overfitting.  It takes a training RDD and a validation RDD.

The training is stopped when the improvement in the validation error is not more than a certain tolerance (supplied by the `validationTol` argument in `BoostingStrategy`).
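
The stopping idea can be sketched as follows. This is a simplified rule for illustration, not MLlib's exact criterion: training stops at the first iteration whose validation error improves on the best error so far by no more than the tolerance. The function name and the error values are hypothetical.

```python
def stop_iteration(validation_errors, tol):
    """Return the 1-based iteration at which training would stop."""
    best = validation_errors[0]
    for i, err in enumerate(validation_errors[1:], start=2):
        if best - err <= tol:        # improvement too small (or error got worse)
            return i
        best = min(best, err)
    return len(validation_errors)    # never triggered: use all iterations


# Hypothetical validation error after each boosting iteration
errors = [0.40, 0.30, 0.25, 0.249, 0.26]

print(stop_iteration(errors, tol=0.01))   # 4: the gain from 0.25 to 0.249 is below tol
print(stop_iteration(errors, tol=0.0))    # 5: stops only once the error rises
```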

**GBT Example: load data/train model/predict**

In [None]:
from pyspark.mllib.tree import GradientBoostedTrees
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, 'sample_libsvm_data.txt')
data.take(2)

In [None]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

In [None]:
# Train a GradientBoostedTrees model.
model = GradientBoostedTrees.trainClassifier(trainingData, categoricalFeaturesInfo={}, numIterations=10)

In [None]:
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

In [None]:

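# Untested: spark.read normally expects a file-system path (local, HDFS, S3, ...),
# so the API response may need to be downloaded to a file first.
df = spark.read.json('https://data.cityofchicago.org/api/odata/v4/m6dm-c72p')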

In [2]:
from sodapy import Socrata

In [None]:
import pandas as pd

def download_dataset(domain, dataset_id):
    # for this exercise, we're not using an app token,
    # but you *should* sign up and register for an app_token if you want to use the Socrata API
    client = Socrata(domain, app_token=None)
    offset = 0
    data = []
    batch_size = 1000

    # page through the dataset until a short (or empty) batch signals the end
    while True:
        records = client.get(dataset_id, offset=offset, limit=batch_size)
        data.extend(records)
        if len(records) < batch_size:
            break
        offset += batch_size

    return pd.DataFrame.from_records(data)

def download_permits_dataset():
    return seattle_permits_df if "seattle_permits_df" in globals() else download_dataset("data.seattle.gov", "k44w-2dcq")

# load Seattle permits data
seattle_permits_df = download_permits_dataset()
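
The offset/limit paging loop above can be checked without network access. The sketch below runs the same loop against an in-memory stand-in for `client.get`; `fetch_all`, `fake_get`, and the fake rows are hypothetical and exist only to show when the loop terminates.

```python
def fetch_all(get_page, batch_size=1000):
    """Collect every record from a paged source that accepts offset and limit."""
    records, offset = [], 0
    while True:
        page = get_page(offset=offset, limit=batch_size)
        records.extend(page)
        if len(page) < batch_size:   # a short or empty page means we reached the end
            break
        offset += batch_size
    return records


# Hypothetical in-memory stand-in for client.get(dataset_id, offset=..., limit=...)
fake_rows = [{"trip_id": i} for i in range(2500)]


def fake_get(offset, limit):
    return fake_rows[offset:offset + limit]


rows = fetch_all(fake_get)
print(len(rows))   # 2500, fetched as pages of 1000, 1000, and 500
```

At 1,000 rows per request, paging through the full 169M-record trips dataset would take a very large number of calls, so a larger batch size or a server-side filter is probably worth considering.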