# NYC Taxi Fare Prediction (Using total distance in Kms)

We have a training dataset of taxi rides with pickup and drop-off locations, and we will use it to predict the fare amount for each ride.

In [1]:
%%init_spark
launcher.master="yarn"
launcher.num_executors=6
launcher.executor_cores=2
launcher.executor_memory='2500m'

# Data Exploration

In [2]:
val training_data=spark.read.option("header","true").option("inferschema", "true").csv("/project/train.csv")


Intitializing Scala interpreter ...

Spark Web UI available at http://HM11:8088/proxy/application_1544141973356_0015
SparkContext available as 'sc' (version = 2.3.2, master = yarn, app id = application_1544141973356_0015)
SparkSession available as 'spark'


2018-12-07 00:31:08 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-07 00:31:11 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


training_data: org.apache.spark.sql.DataFrame = [key: timestamp, fare_amount: double ... 6 more fields]


# Feature Engineering

We first inspect the attributes in the dataset and their data types, then derive new columns from the existing ones. Finally, we downsample the data because it is too large to process in full: 5.7 GB (about 55 million records).

In [4]:
training_data.show(5)

+-------------------+-----------+--------------------+----------------+---------------+-----------------+----------------+---------------+
|                key|fare_amount|     pickup_datetime|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|passenger_count|
+-------------------+-----------+--------------------+----------------+---------------+-----------------+----------------+---------------+
|2009-06-15 17:26:21|        4.5|2009-06-15 17:26:...|      -73.844311|      40.721319|        -73.84161|       40.712278|              1|
|2010-01-05 16:52:16|       16.9|2010-01-05 16:52:...|      -74.016048|      40.711303|       -73.979268|       40.782004|              1|
|2011-08-18 00:35:00|        5.7|2011-08-18 00:35:...|      -73.982738|       40.76127|       -73.991242|       40.750562|              2|
|2012-04-21 04:30:42|        7.7|2012-04-21 04:30:...|       -73.98713|      40.733143|       -73.991567|       40.758092|              1|
|2010-03-09 07:51:00|      

In [5]:
//to see the data types of the attributes
training_data.printSchema()

root
 |-- key: timestamp (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- pickup_datetime: string (nullable = true)
 |-- pickup_longitude: double (nullable = true)
 |-- pickup_latitude: double (nullable = true)
 |-- dropoff_longitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)
 |-- passenger_count: integer (nullable = true)



We compute the great-circle distance in kilometres between the pickup and drop-off points using the haversine formula, taking the Earth's radius as 6371 km.
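
For reference, the haversine computation can be written out as follows. The symbols are not from the notebook and are introduced here only for illustration: $\varphi_1, \lambda_1$ are the pickup latitude and longitude in radians, $\varphi_2, \lambda_2$ are the drop-off latitude and longitude in radians, and $R$ is the Earth's radius.

```latex
\begin{align}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
   + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \, \operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right), \qquad R = 6371\ \text{km}
\end{align}
```

The intermediate value $a$ corresponds to the temporary column `a` in the code, and $d$ to the `distance` column.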

In [42]:
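// Haversine term "a", built from the latitude and longitude differences in radians;
// distance = 2 * 6371 * atan2(sqrt(a), sqrt(1 - a)), in km
val temp = training_data
  .withColumn("a",
    pow(sin(toRadians($"dropoff_latitude" - $"pickup_latitude") / 2), 2) +
    cos(toRadians($"pickup_latitude")) * cos(toRadians($"dropoff_latitude")) *
    pow(sin(toRadians($"dropoff_longitude" - $"pickup_longitude") / 2), 2))
  .withColumn("distance", atan2(sqrt($"a"), sqrt(-$"a" + 1)) * 2 * 6371)
// Drop the intermediate column once the distance has been computed
val temp1 = temp.drop(col("a"))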

temp: org.apache.spark.sql.DataFrame = [key: timestamp, fare_amount: double ... 8 more fields]
temp1: org.apache.spark.sql.DataFrame = [key: timestamp, fare_amount: double ... 7 more fields]
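
As a sanity check on the distance column, the same computation can be sketched in plain Scala without Spark. The function name `haversineKm` and the constant `EarthRadiusKm` are assumptions made for this example; the coordinates are those of the first row printed by `training_data.show(5)`.

```scala
// Earth's radius in km, matching the 6371 constant used in the Spark expression.
val EarthRadiusKm = 6371.0

// Haversine distance in km between two points given in decimal degrees.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * EarthRadiusKm * math.atan2(math.sqrt(a), math.sqrt(1 - a))
}

// Pickup and drop-off of the first ride in the sample output.
val firstRideKm = haversineKm(40.721319, -73.844311, 40.712278, -73.84161)
println(firstRideKm)
```

The printed value should be close to 1.03 km, which is the order of magnitude expected for a ride whose endpoints differ by about 0.009 degrees of latitude.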


In [45]:
temp1.show(5)

+-------------------+-----------+--------------------+----------------+---------------+-----------------+----------------+---------------+------------------+
|                key|fare_amount|     pickup_datetime|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|passenger_count|          distance|
+-------------------+-----------+--------------------+----------------+---------------+-----------------+----------------+---------------+------------------+
|2009-06-15 17:26:21|        4.5|2009-06-15 17:26:...|      -73.844311|      40.721319|        -73.84161|       40.712278|              1|1.0307639350492535|
|2010-01-05 16:52:16|       16.9|2010-01-05 16:52:...|      -74.016048|      40.711303|       -73.979268|       40.782004|              1| 8.450133595806088|
|2011-08-18 00:35:00|        5.7|2011-08-18 00:35:...|      -73.982738|       40.76127|       -73.991242|       40.750562|              2|1.3895252257699269|
|2012-04-21 04:30:42|        7.7|2012-04-21 04:30:..

Dropping the null values from the dataset

In [46]:
val train3=temp1.na.drop()
train3.count()

train3: org.apache.spark.sql.DataFrame = [key: timestamp, fare_amount: double ... 7 more fields]
res22: Long = 55423480


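Next we keep only the rows in which the distance is less than 35 km and the fare amount is greater than 0; all other rows are discarded. We then drop the key and pickup_datetime columns, which are not used as features.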

In [48]:
val tr4=train3.filter($"distance" < 35).filter($"fare_amount" > 0).toDF()
val tr5=tr4.drop(col("pickup_datetime")).drop(col("key")).toDF()

tr4: org.apache.spark.sql.DataFrame = [key: timestamp, fare_amount: double ... 7 more fields]
tr5: org.apache.spark.sql.DataFrame = [fare_amount: double, pickup_longitude: double ... 5 more fields]
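
The effect of the two chained filters can be illustrated on a handful of made-up rides in plain Scala. The `Ride` case class and all values below are hypothetical and exist only for this sketch; chaining two `filter` calls is equivalent to combining the conditions with a logical AND.

```scala
// Hypothetical, simplified ride record; field names are illustrative only.
case class Ride(fareAmount: Double, distanceKm: Double)

// Made-up rides covering each case the filter has to handle.
val rides = Seq(
  Ride(4.5, 1.03),   // ordinary ride: kept
  Ride(-2.0, 3.0),   // negative fare: dropped
  Ride(0.0, 2.0),    // zero fare: dropped
  Ride(12.0, 80.0),  // 80 km trip: dropped
  Ride(16.9, 8.45)   // ordinary ride: kept
)

// Same predicate as the notebook cell: distance < 35 AND fare_amount > 0.
val kept = rides.filter(r => r.distanceKm < 35 && r.fareAmount > 0)
kept.foreach(println)
```

Only the two ordinary rides survive, which mirrors what the Spark filters do to the full dataset.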


We now downsample the dataset to roughly 25% of its rows. Note that sample(true, factor) samples with replacement, so the result can contain duplicate rows.

In [50]:
val factor=0.25
val downSampledData=tr5.sample(true,factor)

factor: Double = 0.25
downSampledData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]


In [51]:
downSampledData.count()

res25: Long = 13819780
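
If duplicate rows are not wanted, sampling without replacement is an alternative; in Spark that would mean passing `false` as the first argument to `sample`. The plain-Scala sketch below only illustrates the idea on hypothetical row ids, with an arbitrary seed chosen for repeatability: each row is kept independently with probability `fraction`, so no row can appear twice and the sample size is only approximately 25%.

```scala
import scala.util.Random

// Hypothetical stand-in for the rows of a dataset.
val rowIds: Vector[Int] = (0 until 100000).toVector
val fraction = 0.25
val rng = new Random(42L) // arbitrary fixed seed so the draw is repeatable

// Keep each row independently with probability `fraction` (no replacement).
val sampled: Vector[Int] = rowIds.filter(_ => rng.nextDouble() < fraction)

println(s"kept ${sampled.size} of ${rowIds.size} rows")
println(s"duplicates: ${sampled.size - sampled.distinct.size}")
```

The second line always reports zero duplicates, whereas sampling with replacement offers no such guarantee.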


In [52]:
downSampledData.show(5)

+-----------+----------------+---------------+-----------------+----------------+---------------+------------------+
|fare_amount|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|passenger_count|          distance|
+-----------+----------------+---------------+-----------------+----------------+---------------+------------------+
|        4.5|      -73.844311|      40.721319|        -73.84161|       40.712278|              1|1.0307639350492535|
|        5.3|      -73.968095|      40.768008|       -73.956655|       40.783762|              1|1.9991567879961665|
|       16.5|        -73.9513|      40.774138|       -73.990095|       40.751048|              1| 4.155444291845964|
|        9.0|      -74.006462|      40.726713|       -73.993078|       40.731628|              1| 1.253231512725611|
|       10.5|      -73.985382|      40.747858|       -73.978377|        40.76207|              1| 1.686861330169933|
+-----------+----------------+---------------+-----------------+

Now we separate the target variable (fare_amount) from the features and assemble the features into a single vector column for the models.

In [53]:
import org.apache.spark.ml.feature._

//get all the numeric features except the target variable
val numeric_features=downSampledData.columns.filter(c =>  !c.equals("fare_amount") )


//Use VectorAssembler to assemble the numeric features into a vector
val vectorizer_numeric=new VectorAssembler().setInputCols(numeric_features).setOutputCol("features")

//Create an estimator to standardize the numeric features (currently disabled)
//val standardizer=new StandardScaler().setWithMean(true).setInputCol("numeric_features").setOutputCol("features")


import org.apache.spark.ml.feature._
numeric_features: Array[String] = Array(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, distance)
vectorizer_numeric: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_bf28be0dcfae
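
Conceptually, the assembler concatenates the values of the input columns, in the order given, into one vector per row. The plain-Scala sketch below mimics that for a single row represented as a `Map`; this representation is an assumption made for illustration and is not how Spark stores rows. The values are taken from the first row of the downsampled data shown above.

```scala
// One row as a column-name -> value map (illustrative representation only).
val row: Map[String, Double] = Map(
  "fare_amount"       -> 4.5,
  "pickup_longitude"  -> -73.844311,
  "pickup_latitude"   -> 40.721319,
  "dropoff_longitude" -> -73.84161,
  "dropoff_latitude"  -> 40.712278,
  "passenger_count"   -> 1.0,
  "distance"          -> 1.0307639350492535
)

// Every column except the target, mirroring the filter on `columns`.
val allColumns = Array("fare_amount", "pickup_longitude", "pickup_latitude",
  "dropoff_longitude", "dropoff_latitude", "passenger_count", "distance")
val featureNames = allColumns.filter(c => !c.equals("fare_amount"))

// Look up each feature in order to build the feature vector for this row.
val features: Array[Double] = featureNames.map(row)
println(features.mkString("[", ", ", "]"))
```

The target value 4.5 is left out of the vector, so the model never sees the fare as an input.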


# Using Linear Regression for fare prediction

In [54]:
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.regression._
//Create the LinearRegression estimator; it is fitted to the training data later as part of the pipeline
val LR= new LinearRegression().setLabelCol("fare_amount").setFeaturesCol("features").
setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.7)

import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.regression._
LR: org.apache.spark.ml.regression.LinearRegression = linReg_4a02d3bb9414
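
For orientation, the objective that Spark ML's linear regression minimizes with elastic-net regularization can be sketched as below, following the form given in the Spark documentation. Here $\lambda$ stands for `regParam` (0.3) and $\alpha$ for `elasticNetParam` (0.7); the intercept is omitted for brevity, and the symbols $w$, $x_i$, $y_i$ and $n$ are notation introduced for this sketch.

```latex
\min_{w} \;\; \frac{1}{2n} \sum_{i=1}^{n} \left( w^{\top} x_i - y_i \right)^2
  \;+\; \lambda \left( \alpha \, \lVert w \rVert_1 + \frac{1 - \alpha}{2} \, \lVert w \rVert_2^2 \right)
```

With the values used here, the L1 term is weighted by $\lambda\alpha = 0.21$ and the squared L2 term by $\lambda(1-\alpha)/2 = 0.045$, so the penalty leans towards lasso-style shrinkage.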


In [55]:
//Create a Pipeline that chains the feature assembler and the regression model
val pipeline = new Pipeline().setStages(Array(vectorizer_numeric, LR))

pipeline: org.apache.spark.ml.Pipeline = pipeline_de8d6aeecd30


In [56]:
//Randomly split the data into 80% training and 20% testing. The training data is used to build the model and the testing data is used to evaluate it
val Array(training,testing)=downSampledData.randomSplit(Array(0.8,0.2),111)

training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]
testing: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]
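
Fixing the seed makes the split repeatable, which matters when several models are to be compared on the same test rows. The plain-Scala sketch below shows the idea on hypothetical data; the `split` helper is an assumption made for this example and is not Spark's implementation of `randomSplit`. As with Spark, the resulting sizes are only approximately 80% and 20%.

```scala
import scala.util.Random

// Seeded random split: rows whose random draw falls below `trainFraction`
// go to the training set, the rest to the test set.
def split[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  val tagged = rows.map(r => (r, rng.nextDouble()))
  val (train, test) = tagged.partition(_._2 < trainFraction)
  (train.map(_._1), test.map(_._1))
}

val rows = (1 to 1000).toVector // hypothetical row ids

// Two calls with the same seed produce identical splits.
val (trainA, testA) = split(rows, 0.8, 111L)
val (trainB, testB) = split(rows, 0.8, 111L)

println(s"train=${trainA.size}, test=${testA.size}")
println(s"same split both times: ${trainA == trainB && testA == testB}")
```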


In [57]:
import org.apache.spark.ml.evaluation._

//Fit the pipeline to the training data
val pipeline_model= pipeline.fit(training)


import org.apache.spark.ml.evaluation._
pipeline_model: org.apache.spark.ml.PipelineModel = pipeline_de8d6aeecd30


In [58]:
import org.apache.spark.ml.evaluation._

//Apply the model to the test data to make predictions
val predictions = pipeline_model.transform(testing)

// Select example rows to display.
predictions.select("prediction","fare_amount", "features").show(5)



+------------------+-----------+--------------------+
|        prediction|fare_amount|            features|
+------------------+-----------+--------------------+
| 3.986536047272377|        2.5|[-74.254295,40.67...|
| 4.000030862058741|        2.5|[-74.029732,40.75...|
|4.0132666761122975|        2.5|[-74.024305,40.60...|
|4.0132666761122975|        2.5|[-74.024305,40.60...|
| 4.107659212920624|        2.5|[-74.015233,40.71...|
+------------------+-----------+--------------------+
only showing top 5 rows



import org.apache.spark.ml.evaluation._
predictions: org.apache.spark.sql.DataFrame = [fare_amount: double, pickup_longitude: double ... 7 more fields]


In [59]:
// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("fare_amount")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

Root Mean Squared Error (RMSE) on test data = 5.164927129414834


evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_956e76753b35
rmse: Double = 5.164927129414834
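
RMSE is the square root of the mean squared difference between predictions and true labels, so it is expressed in the same unit as the fare. A minimal plain-Scala sketch of the metric follows; the function name `rmseOf` and the sample values are made up for illustration and are unrelated to the model output above.

```scala
// Root mean squared error over paired predictions and labels.
def rmseOf(predictions: Seq[Double], labels: Seq[Double]): Double = {
  require(predictions.size == labels.size && predictions.nonEmpty,
    "inputs must be non-empty and of equal length")
  val squaredErrors = predictions.zip(labels).map { case (p, y) => (p - y) * (p - y) }
  math.sqrt(squaredErrors.sum / squaredErrors.size)
}

// Made-up values: the errors are 1, 0 and 2, so the mean squared error is 5/3.
val examplePredictions = Seq(3.0, 4.0, 8.0)
val exampleLabels      = Seq(2.0, 4.0, 6.0)
val exampleRmse = rmseOf(examplePredictions, exampleLabels)
println(exampleRmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.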


# GBT Regression

In [60]:
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

// Create a GBT model.
val gbt = new GBTRegressor()
  .setLabelCol("fare_amount")
  .setFeaturesCol("features")


import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_e29f8f3fdecc


In [61]:
val pipeline_gbt = new Pipeline().setStages(Array(vectorizer_numeric, gbt))

pipeline_gbt: org.apache.spark.ml.Pipeline = pipeline_39f3229a4a8d


In [62]:
//Randomly split the data into 80% training and 20% testing. The training data is used to build the model and the testing data is used to evaluate it
val Array(training_gbt,testing_gbt)=downSampledData.randomSplit(Array(0.8,0.2),111)

training_gbt: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]
testing_gbt: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]


In [63]:
val pipeline_model_gbt= pipeline_gbt.fit(training_gbt)

pipeline_model_gbt: org.apache.spark.ml.PipelineModel = pipeline_39f3229a4a8d


In [64]:
//Apply the model to the test data to make predictions
val predictions_gbt = pipeline_model_gbt.transform(testing_gbt)

// Select example rows to display.
predictions_gbt.select("prediction", "fare_amount","features").show(5)

+------------------+-----------+--------------------+
|        prediction|fare_amount|            features|
+------------------+-----------+--------------------+
|38.155541048769095|        2.5|[-74.254295,40.67...|
|32.911417927072534|        2.5|[-74.029732,40.75...|
| 34.66767129887482|        2.5|[-74.024305,40.60...|
| 34.66767129887482|        2.5|[-74.024305,40.60...|
| 7.742086164449066|        2.5|[-74.015233,40.71...|
+------------------+-----------+--------------------+
only showing top 5 rows



predictions_gbt: org.apache.spark.sql.DataFrame = [fare_amount: double, pickup_longitude: double ... 7 more fields]


In [65]:
// Select (prediction, true label) and compute test error.
val evaluator_gbt = new RegressionEvaluator()
  .setLabelCol("fare_amount")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse_gbt = evaluator_gbt.evaluate(predictions_gbt)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse_gbt")

Root Mean Squared Error (RMSE) on test data = 4.697264054494451


evaluator_gbt: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_33dce0337877
rmse_gbt: Double = 4.697264054494451


# Using Random Forest Regressor

In [66]:
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

// Create a RF Regression model.
val RF = new RandomForestRegressor()
  .setLabelCol("fare_amount")
  .setFeaturesCol("features")

import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
RF: org.apache.spark.ml.regression.RandomForestRegressor = rfr_c4061cb49359


In [67]:
val pipeline_rf = new Pipeline().setStages(Array(vectorizer_numeric, RF))

pipeline_rf: org.apache.spark.ml.Pipeline = pipeline_2e22137c602b


In [68]:
//Randomly split the data into 80% training and 20% testing. The training data is used to build the model and the testing data is used to evaluate it
val Array(training_rf,testing_rf)=downSampledData.randomSplit(Array(0.8,0.2),111)

training_rf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]
testing_rf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]


In [69]:
val pipeline_model_rf= pipeline_rf.fit(training_rf)

pipeline_model_rf: org.apache.spark.ml.PipelineModel = pipeline_2e22137c602b


In [70]:
//Apply the model to the test data to make predictions
val predictions_rf = pipeline_model_rf.transform(testing_rf)

// Select example rows to display.
predictions_rf.select("prediction", "fare_amount","features").show(5)

+------------------+-----------+--------------------+
|        prediction|fare_amount|            features|
+------------------+-----------+--------------------+
|16.438061360259272|        2.5|[-74.254295,40.67...|
|13.962456553318157|        2.5|[-74.029732,40.75...|
|17.422949328601113|        2.5|[-74.024305,40.60...|
|17.422949328601113|        2.5|[-74.024305,40.60...|
| 9.085559578182544|        2.5|[-74.015233,40.71...|
+------------------+-----------+--------------------+
only showing top 5 rows



predictions_rf: org.apache.spark.sql.DataFrame = [fare_amount: double, pickup_longitude: double ... 7 more fields]


In [71]:
// Select (prediction, true label) and compute test error.
val evaluator_rf = new RegressionEvaluator()
  .setLabelCol("fare_amount")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse_rf = evaluator_rf.evaluate(predictions_rf)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse_rf")

Root Mean Squared Error (RMSE) on test data = 5.086012994281541

evaluator_rf: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_e4064fb8021a
rmse_rf: Double = 5.086012994281541





# Using Decision Tree Regression

In [72]:
import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, DecisionTreeRegressor}

// Create a Decision Tree Regression model.
val DT = new DecisionTreeRegressor()
  .setLabelCol("fare_amount")
  .setFeaturesCol("features")

import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, DecisionTreeRegressor}
DT: org.apache.spark.ml.regression.DecisionTreeRegressor = dtr_a6d474e37f38


In [73]:
val pipeline_dt = new Pipeline().setStages(Array(vectorizer_numeric, DT))

pipeline_dt: org.apache.spark.ml.Pipeline = pipeline_c4f4b278378c


In [74]:
//Randomly split the data into 80% training and 20% testing. The training data is used to build the model and the testing data is used to evaluate it
val Array(training_dt,testing_dt)=downSampledData.randomSplit(Array(0.8,0.2),111)

training_dt: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]
testing_dt: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [fare_amount: double, pickup_longitude: double ... 5 more fields]


In [75]:
val pipeline_model_dt= pipeline_dt.fit(training_dt)

pipeline_model_dt: org.apache.spark.ml.PipelineModel = pipeline_c4f4b278378c


In [76]:
//Apply the model to the test data to make predictions
val predictions_dt = pipeline_model_dt.transform(testing_dt)

// Select example rows to display.
predictions_dt.select("prediction","fare_amount", "features").show(5)

+------------------+-----------+--------------------+
|        prediction|fare_amount|            features|
+------------------+-----------+--------------------+
|29.040809586264455|        2.5|[-74.254295,40.67...|
|29.040809586264455|        2.5|[-74.029732,40.75...|
| 5.736257392009186|        2.5|[-74.024305,40.60...|
| 5.736257392009186|        2.5|[-74.024305,40.60...|
| 5.736257392009186|        2.5|[-74.015233,40.71...|
+------------------+-----------+--------------------+
only showing top 5 rows



predictions_dt: org.apache.spark.sql.DataFrame = [fare_amount: double, pickup_longitude: double ... 7 more fields]


In [77]:
// Select (prediction, true label) and compute test error.
val evaluator_dt = new RegressionEvaluator()
  .setLabelCol("fare_amount")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse_dt = evaluator_dt.evaluate(predictions_dt)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse_dt")

Root Mean Squared Error (RMSE) on test data = 4.955962326461401


evaluator_dt: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_3f441d548833
rmse_dt: Double = 4.955962326461401


# Conclusion

Comparing the test RMSE values (linear regression 5.16, GBT 4.70, random forest 5.09, decision tree 4.96), the GBT model performs best because it has the lowest RMSE, about 4.70.
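
The comparison can also be made programmatically. The sketch below hard-codes the four RMSE values reported in the cells above and ranks the models from lowest to highest error; the names `rmseByModel`, `ranked`, `bestModel` and `bestRmse` are introduced only for this example.

```scala
// Test RMSE values copied from the evaluation cells of this notebook.
val rmseByModel: Map[String, Double] = Map(
  "Linear Regression" -> 5.164927129414834,
  "GBT"               -> 4.697264054494451,
  "Random Forest"     -> 5.086012994281541,
  "Decision Tree"     -> 4.955962326461401
)

// Rank models by RMSE, lowest (best) first.
val ranked: Seq[(String, Double)] = rmseByModel.toSeq.sortBy(_._2)
ranked.foreach { case (name, rmse) => println(s"$name: $rmse") }

val (bestModel, bestRmse) = ranked.head
println(s"Best model: $bestModel")
```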
