# Spark Machine Learning using linear regression


#### Topics covered in this example
* `VectorAssembler`, `LinearRegression` and `RegressionEvaluator` from `pyspark.ml`.

***

## Prerequisites
<div class="alert alert-block alert-info">
<b>NOTE:</b> To run this notebook as is, make sure the following prerequisites are met.</div>

* The EMR cluster attached to this notebook should have the `Spark` application installed.
* This example uses a public dataset, hence the EMR cluster attached to this notebook must have internet connectivity.
* This notebook uses the `PySpark` kernel.
***

## Introduction
In this example we use PySpark to predict the total cost of a trip using the <a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" target="_blank">New York City Taxi and Limousine Commission (TLC) Trip Record Data</a> from the <a href="https://registry.opendata.aws/" target="_blank">Registry of Open Data on AWS</a>.

***

## Example
Load the data set for trips into a Spark DataFrame.

In [2]:
# Parquet files carry their own schema, so CSV-style options
# such as inferSchema and header are not needed
df = spark.read.parquet("s3://itam-analytics-MINOMBRE/taxi/yellow_tripdata_2022-*.parquet")

Mark the DataFrame for caching in memory, display the schema with the `printSchema` method to check the data types, and get the dimensions of the data.

In [3]:
# Mark the DataFrame for caching in memory
df.cache()

# Print the schema
df.printSchema()

# Get the dimensions of the data (rows, columns)
df.count(), len(df.columns)

root
 |-- VendorID: long (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)

(39656098, 19)

In [4]:
# Get the summary of the columns
df.select("total_amount", "tip_amount")\
.describe()\
.show()

# Value counts of VendorID column
df.groupBy("VendorID")\
.count()\
.show()

+-------+------------------+------------------+
|summary|      total_amount|        tip_amount|
+-------+------------------+------------------+
|  count|          39656098|          39656098|
|   mean|21.671268443305777| 7.234908207598432|
| stddev| 96.37360220544356|22328.083995040284|
|    min|           -2567.8|            -410.0|
|    max|         401095.62|    1.3339136353E8|
+-------+------------------+------------------+

+--------+--------+
|VendorID|   count|
+--------+--------+
|       5|     143|
|       6|   59601|
|       1|11271061|
|       2|28325293|
+--------+--------+

### Use <a href="https://spark.apache.org/docs/2.4.7/ml-features#vectorassembler" target="_blank">VectorAssembler</a> to transform input columns into vectors
<a href="https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html" target="_blank">pyspark.ml</a> provides DataFrame-based machine learning APIs that let users quickly assemble and configure practical machine learning pipelines.  
A `VectorAssembler` combines a given list of columns into a single vector column. In the cell below, we combine the input columns into a single vector column named `features`.

In [5]:
from pyspark.ml.feature import VectorAssembler

# Specify the input and output columns of the vector assembler
vectorAssembler = VectorAssembler(
    inputCols = [
        "trip_distance",
        "PULocationID",
        "DOLocationID",
        "fare_amount",
        "mta_tax",
        "tip_amount", 
        "tolls_amount",
        "improvement_surcharge", 
        "congestion_surcharge"
    ], 
    outputCol = "features")

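# Transform the data, dropping rows that have null or NaN values in the input columns
v_df = vectorAssembler.setHandleInvalid("skip").transform(df)

# Keep only the feature vector and the label, then view the first rows
v_df = v_df.select(["features", "total_amount"])
v_df.show(3)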

+--------------------+------------+
|            features|total_amount|
+--------------------+------------+
|[2.4,90.0,209.0,1...|        13.8|
|[2.2,148.0,234.0,...|        14.3|
|[19.78,132.0,249....|       67.61|
+--------------------+------------+
only showing top 3 rows
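
To make the effect of `setHandleInvalid("skip")` concrete, here is a minimal plain-Python sketch of the same idea on a handful of made-up rows. It is not how Spark implements `VectorAssembler`. The helper `assemble_rows` and the sample values are hypothetical. They only illustrate that each kept row becomes one feature vector and that rows with a missing input value are dropped.

```python
import math

def assemble_rows(rows, input_cols, label_col):
    """Hypothetical helper: build (features, label) pairs, skipping invalid rows."""
    assembled = []
    for row in rows:
        values = [row.get(c) for c in input_cols]
        # Mimic handleInvalid="skip": drop the row if any input is None or NaN
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
            continue
        assembled.append(([float(v) for v in values], row[label_col]))
    return assembled

# Made-up sample rows, not taken from the dataset
rows = [
    {"trip_distance": 2.4, "fare_amount": 10.0, "tip_amount": 2.0, "total_amount": 13.8},
    {"trip_distance": 1.1, "fare_amount": None, "tip_amount": 0.0, "total_amount": 7.3},
    {"trip_distance": 5.0, "fare_amount": 18.5, "tip_amount": float("nan"), "total_amount": 22.0},
]

result = assemble_rows(rows, ["trip_distance", "fare_amount", "tip_amount"], "total_amount")
print(result)
```

Only the first sample row survives, because the other two contain a `None` or `NaN` input.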

Divide the input dataset into a training set (70%) and a test set (30%).

In [6]:
splits = v_df.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]
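
`randomSplit` assigns rows at random, so the two sets can change from run to run and their sizes only approximate the 70/30 weights. The plain-Python sketch below illustrates the idea with a hypothetical `random_split` helper: each row gets a uniform random draw and lands in the split whose cumulative weight range contains that draw. It is an illustration under those assumptions, not Spark's implementation. PySpark's `randomSplit` also accepts an optional `seed` argument, which helps make a split repeatable.

```python
import random

def random_split(rows, weights, seed=None):
    """Hypothetical helper: assign each row to a split by a uniform random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. about [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one split, and reusing the same seed reproduces the same assignment.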

### Train the model using <a href="https://spark.apache.org/docs/2.4.7/ml-classification-regression.html#linear-regression" target="_blank">LinearRegression</a> on the training set

In [7]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol = "features",
                      labelCol = "total_amount",
                      maxIter = 100,
                      regParam = 0.3,
                      elasticNetParam = 0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [0.0,0.0,0.0,0.9975537807494054,1.1988509011648747,0.9870092379839718,0.9841238812269312,0.6679739296547181,0.5506685726630876]
Intercept: 1.5774462355534828
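
The three coefficients reported as exactly `0.0` are consistent with the L1 part of the elastic-net penalty, which can shrink weights all the way to zero. As described in the Spark ML documentation, `regParam` plays the role of $\lambda$ and `elasticNetParam` the role of $\alpha$ in the regularized least-squares objective. The notation below is introduced here for illustration: $w$ is the coefficient vector, $b$ the intercept, $x_i$ and $y_i$ the features and label of row $i$, and $n$ the number of training rows.

$$
\min_{w,\,b} \; \frac{1}{2n} \sum_{i=1}^{n} \left( w^{\top} x_i + b - y_i \right)^2 + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
$$

With `regParam = 0.3` and `elasticNetParam = 0.8`, the L1 term is weighted by $\lambda\alpha = 0.24$ and the squared L2 term by $\lambda(1-\alpha)/2 = 0.03$. Spark standardizes features by default before fitting, so treat this as a sketch of the objective rather than an exact description of the numerical procedure.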

Report the trained model's performance on the training set.

In [8]:
training_summary = lr_model.summary
print("RMSE: %f" % training_summary.rootMeanSquaredError)
print("R squared (R2): %f" % training_summary.r2)

RMSE: 0.838512
R squared (R2): 0.999943
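
For readers who want to see what these two metrics measure, the following plain-Python sketch computes them from their standard definitions on a few made-up values. The helper names `rmse` and `r_squared` are hypothetical and are not part of `pyspark.ml`. The sketch only illustrates the formulas and does not reproduce Spark's internals.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration, not taken from the dataset
labels = [10.0, 20.0, 30.0, 40.0]
predictions = [11.0, 19.0, 31.0, 39.0]

print("RMSE: %f" % rmse(labels, predictions))
print("R squared (R2): %f" % r_squared(labels, predictions))
```

RMSE is expressed in the units of the label (here, the trip total), while R2 is unitless and approaches 1 as the residuals shrink relative to the spread of the labels.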

Generate predictions on the test set and compare them with the actual totals.

In [9]:
from pyspark.sql.functions import col

predictions = lr_model.transform(test_df)

# Show the absolute and percentage difference between the prediction
# and the actual total for trips that cost more than 10.0
predictions.filter(predictions.total_amount > 10.0)\
.select("prediction", "total_amount")\
.withColumn("diff", col("prediction") - col("total_amount"))\
.withColumn("diff%", (col("diff") / col("total_amount")) * 100)\
.show()

+------------------+------------+------------------+------------------+
|        prediction|total_amount|              diff|             diff%|
+------------------+------------+------------------+------------------+
| 66.41844198426483|        65.0| 1.418441984264831|2.1822184373305094|
| 66.41844198426483|        65.0| 1.418441984264831|2.1822184373305094|
| 76.39397979175888|        75.0|1.3939797917588805| 1.858639722345174|
| 69.41110332651304|        68.0| 1.411103326513043| 2.075151950754475|
|20.331457313642307|        18.8|1.5314573136423064| 8.146049540650566|
|14.545645385295753|        13.0| 1.545645385295753|11.889579886890408|
| 26.51629075428862|        25.0|1.5162907542886188| 6.065163017154475|
| 70.00963559496269|        68.6| 1.409635594962694| 2.054862383327542|
|  81.3817486955059|        80.0|1.3817486955059053|1.7271858693823816|
| 83.37685625700472|        82.0| 1.376856257004718| 1.679092996347217|
| 71.40621088801186|        70.0|1.4062108880118558|2.0088726971

### Report performance on the test set using <a href="https://spark.apache.org/docs/2.4.7/api/java/org/apache/spark/ml/evaluation/RegressionEvaluator.html" target="_blank">RegressionEvaluator</a>

In [10]:
from pyspark.ml.evaluation import RegressionEvaluator

lr_evaluator = RegressionEvaluator(predictionCol = "prediction",
                                   labelCol = "total_amount",
                                   metricName = "r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(predictions))

R Squared (R2) on test data = 0.999811