## Assignment 4 - Spark ML

## Learning Outcomes
In this assignment you will: 

-  Use ML pipelines
-  Improve a Random Forest model
-  Perform Hyperparameter tuning

**Submission Format**

- If you used Databricks, please submit the **published** notebook link in a Word or PDF document. Do not submit HTML, Jupyter notebook, or archive (DBC) formats.
- If you used a local instance of Spark, please submit a Jupyter notebook.

#### Question 1 (5 marks)

In this module, we identified a fairly significant link by leveraging the ML pipeline, a more sophisticated model, and better hyperparameter tuning. However, these results are still somewhat disappointing. That said, we are working with very few features, and we have likely made some assumptions that are not quite valid (such as shortening the zip codes). Also, just because a wealthy zip code exists does not mean that a farmers market would be held in that same zip code. In fact, we might want to look at neighboring zip codes, or use some distance measure to predict whether a farmers market exists within a certain radius (in miles) of a wealthy zip code.

We have many other potential features and plenty of other random forest parameters to tune, so experiment with the above pipeline and see if you can improve it further! Note: adding a feature for the distance measure is just an example, not a mandatory change for improving the model's performance. We are also not concerned with whether the model's performance actually improves. We simply want to see that changes have been made to the code in pursuit of possible improvements.
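As one way to picture the distance-measure idea, here is a minimal sketch in plain Python that counts markets within a given radius of a point using the haversine formula. The coordinates, the `markets_within_radius` helper, and the 10-mile radius are all hypothetical. It also assumes that latitude and longitude are available for both the markets and the zip codes, which would need to be checked against the data.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def markets_within_radius(center, markets, radius_miles):
    """Count markets whose coordinates lie within radius_miles of center (hypothetical helper)."""
    return sum(
        1 for lat, lon in markets
        if haversine_miles(center[0], center[1], lat, lon) <= radius_miles
    )


# Hypothetical coordinates, for illustration only (not taken from the dataset).
zip_centroid = (40.0, -75.0)
market_coords = [(40.05, -75.02), (40.5, -75.9), (39.98, -74.95)]
nearby = markets_within_radius(zip_centroid, market_coords, radius_miles=10)
print(nearby)
```

A count like `nearby` could then be joined back onto the tax data as an extra feature or as an alternative label. Whether that helps the model is something to test rather than assume.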

Learn more about the Farmers Markets dataset here: https://catalog.data.gov/dataset/farmers-markets-directory-and-geographic-data
    
You may use the same classifier we built in the notebook in this module.

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/2446126855165611/6085673883631125/latest.html

#### Question 2 (7 marks)


Using the Apache Spark ML pipeline, build a model to predict the price of a diamond based on the available features.

Read the following notebook for details about the dataset.

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/4396972618536508/6085673883631125/latest.html

Note:  
- If you obtain a negative R-squared value, that is okay. A negative value means the model fits the evaluation data worse than always predicting the mean, which may occur due to the low sample size of the data.
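To see how R-squared can go below zero, the sketch below applies the usual definition, R^2 = 1 - SS_res / SS_tot, to a few made-up numbers. The data and the `r_squared` helper are illustrative only and are not taken from the diamonds dataset.

```python
def r_squared(labels, predictions):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of the labels."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Made-up numbers, for illustration only.
labels = [1.0, 2.0, 3.0]
mean_baseline = [2.0, 2.0, 2.0]  # always predict the mean of the labels
poor_model = [3.0, 2.0, 1.0]     # predictions that trend the wrong way

print(r_squared(labels, mean_baseline))  # 0.0
print(r_squared(labels, poor_model))     # -3.0
```

Predicting the mean scores exactly zero, so any model whose squared error exceeds that baseline ends up with a negative value.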

In [None]:
from pyspark.sql import SparkSession
from IPython.display import display, HTML

spark = SparkSession.builder \
    .appName("VSCodeTest") \
    .master("local[*]") \
    .getOrCreate()

print("✅ Spark session running in VS Code!")
print("Spark version:", spark.version)

✅ Spark session running in VS Code!
Spark version: 4.0.0


In [3]:
# Load the datasets
taxes2013 = spark.read \
    .option("header", "true") \
    .csv("C:/Users/jverc/Downloads/2013_soi_zipcode_agi.csv")

markets = spark.read \
    .option("header", "true") \
    .csv("C:/Users/jverc/Downloads/market_data.csv")

In [None]:
taxes2013.createOrReplaceTempView("taxes2013")
markets.createOrReplaceTempView("markets")

In [None]:
df_display = spark.sql("""
SELECT 
  state, 
  INT(zipcode / 10) AS zipcode, 
  CAST(CAST(mars1 AS DOUBLE) AS INT) AS single_returns, 
  CAST(CAST(mars2 AS DOUBLE) AS INT) AS joint_returns, 
  CAST(CAST(numdep AS DOUBLE) AS INT) AS numdep, 
  CAST(A02650 AS DOUBLE) AS total_income_amount,
  CAST(A00300 AS DOUBLE) AS taxable_interest_amount,
  CAST(A01000 AS DOUBLE) AS net_capital_gains,
  CAST(A00900 AS DOUBLE) AS biz_net_income
FROM taxes2013
""")
df_display.show()

df_display.createOrReplaceTempView("cleaned_taxes")
spark.catalog.cacheTable("cleaned_taxes")

+-----+-------+--------------+-------------+------+-------------------+-----------------------+-----------------+--------------+
|state|zipcode|single_returns|joint_returns|numdep|total_income_amount|taxable_interest_amount|net_capital_gains|biz_net_income|
+-----+-------+--------------+-------------+------+-------------------+-----------------------+-----------------+--------------+
|   AL|      0|        488030|       122290|571240|        1.1444868E7|                77952.0|          23583.0|      824487.0|
|   AL|      0|        195840|       155230|383240|        1.7810952E7|                81216.0|          54639.0|      252768.0|
|   AL|      0|         72710|       146880|189340|        1.6070153E7|                80627.0|          84137.0|      259836.0|
|   AL|      0|         24860|       126480|134370|        1.4288572E7|                71086.0|         105947.0|      214668.0|
|   AL|      0|         16930|       168170|177800|         2.605392E7|               149150.0|  

In [None]:
# Convert back to a dataset from a table
cleanedTaxes = spark.sql("SELECT * FROM cleaned_taxes")

summedTaxes = cleanedTaxes.groupBy("zipcode").sum()  # the source data has one row per AGI (income) bracket, so sum them within each zipcode group

cleanedMarkets = (markets
  .selectExpr("*", "CAST(TRY_CAST(zip AS INT) / 10 AS INT) AS zipcode")  # TRY_CAST returns NULL instead of raising an error when zip is not a valid integer
  .groupBy("zipcode")
  .count()
  .selectExpr("double(count) as count", "zipcode as zip"))
#  selectExpr is short for "select expression"; it accepts the same
#  expressions we would write in a SQL SELECT clause

joined = (cleanedMarkets.join(summedTaxes, cleanedMarkets.zip == summedTaxes.zipcode, "outer"))
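The zip code shortening used in both queries can be pictured with a small plain-Python sketch. It only roughly mirrors `CAST(TRY_CAST(zip AS INT) / 10 AS INT)`; the `zip_prefix` helper and the sample values are hypothetical, and Python's `int()` is not an exact stand-in for Spark's parsing rules.

```python
from collections import Counter


def zip_prefix(raw_zip):
    """Drop the last digit of a zip code; return None when the value is not an integer."""
    try:
        return int(int(raw_zip) / 10)
    except (TypeError, ValueError):
        return None


# Hypothetical raw zip values, for illustration only.
raw_zips = ["13420", "13427", "08330", "", "N/A", None]
counts = Counter(zip_prefix(z) for z in raw_zips)
print(counts)
```

Two things stand out in this sketch: neighboring zip codes collapse into one prefix, and every unparseable value lands in a single `None` group. Leading zeros are also lost in the integer cast, which may be why some prefixes have fewer than four digits.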

In [32]:
cleanedMarkets.show()

+-----+----+
|count| zip|
+-----+----+
|  5.0|4900|
|  2.0|7240|
|  8.0|4818|
|  1.0|9852|
|  2.0|5300|
|  5.0|2122|
|  2.0|9900|
|  1.0|8592|
|  1.0|1580|
|  1.0|3175|
|  1.0|7754|
|  2.0|6336|
|  1.0|7833|
|  1.0|7340|
|  1.0|6466|
|  4.0|6620|
|  1.0|1342|
|  2.0| 496|
|  1.0|5156|
|  1.0|4101|
+-----+----+
only showing top 20 rows


In [34]:
joined.show()

+------+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+
| count| zip|zipcode|sum(zipcode)|sum(single_returns)|sum(joint_returns)|sum(numdep)|sum(total_income_amount)|sum(taxable_interest_amount)|sum(net_capital_gains)|sum(biz_net_income)|
+------+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+
|1009.0|NULL|   NULL|        NULL|               NULL|              NULL|       NULL|                    NULL|                        NULL|                  NULL|               NULL|
|   1.0|   0|      0|           0|           66430180|          52885400|   96500590|           9.274122025E9|                  8.271064E7|          3.99567789E8|       3.10024683E8|
|   1.0|   3|   NULL|        NULL|               NULL|              NULL|       NULL|

In [None]:
# Matches source data
joined.filter("zipcode = 1342").show()
joined.filter("zipcode = 833").show()

+-----+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+
|count| zip|zipcode|sum(zipcode)|sum(single_returns)|sum(joint_returns)|sum(numdep)|sum(total_income_amount)|sum(taxable_interest_amount)|sum(net_capital_gains)|sum(biz_net_income)|
+-----+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+
|  1.0|1342|   1342|       40260|               4800|              3680|       5450|                475283.0|                      2919.0|                6476.0|            13496.0|
+-----+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+

+-----+----+-------+------------+-------------------+------------------+-----------+-----

In [41]:
prepped = joined.na.fill(0)
prepped.filter("zipcode = 1342").show()
prepped.filter("zipcode = 833").show()
prepped.show()

+-----+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+
|count| zip|zipcode|sum(zipcode)|sum(single_returns)|sum(joint_returns)|sum(numdep)|sum(total_income_amount)|sum(taxable_interest_amount)|sum(net_capital_gains)|sum(biz_net_income)|
+-----+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+
|  1.0|1342|   1342|       40260|               4800|              3680|       5450|                475283.0|                      2919.0|                6476.0|            13496.0|
+-----+----+-------+------------+-------------------+------------------+-----------+------------------------+----------------------------+----------------------+-------------------+

+-----+---+-------+------------+-------------------+------------------+-----------+------

In [42]:
nonFeatureCols = ["zip", "zipcode", "count"]
featureCols = [item for item in prepped.columns if item not in nonFeatureCols]
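The following plain-Python sketch replays the same filter on the column names from the joined output, to make explicit which columns end up as features. The `refined` list at the end is an assumption about one possible change and is not part of the original notebook.

```python
# Column names copied from the joined output shown above.
columns = [
    "count", "zip", "zipcode", "sum(zipcode)", "sum(single_returns)",
    "sum(joint_returns)", "sum(numdep)", "sum(total_income_amount)",
    "sum(taxable_interest_amount)", "sum(net_capital_gains)", "sum(biz_net_income)",
]
non_feature_cols = ["zip", "zipcode", "count"]
feature_cols = [c for c in columns if c not in non_feature_cols]

print(len(feature_cols))               # 8
print("sum(zipcode)" in feature_cols)  # True

# One possible refinement (an assumption, not part of the original notebook):
# also exclude the summed zip prefix, which carries no income information.
refined = [c for c in feature_cols if c != "sum(zipcode)"]
print(len(refined))                    # 7
```

Whether removing `sum(zipcode)` changes the results is worth testing; it is listed here only as a candidate experiment for Question 1.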


In [43]:
# VectorAssembler assembles all of these columns into a single vector column. Set the input columns and the output column, then use the assembler to transform the prepped data into the final dataset.
from pyspark.ml.feature import VectorAssembler

assembler = (VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features"))

finalPrep = assembler.transform(prepped)

In [46]:
training, test = finalPrep.randomSplit([0.7, 0.3])

#  Going to cache the data to make sure things stay snappy!
training.cache()
test.cache()

print(training.count())  # cache() is lazy; count() is an action, so it forces the cached data to materialize
print(test.count())

4107
1695


In [47]:
from pyspark.ml.regression import LinearRegression

lrModel = (LinearRegression()
  .setLabelCol("count")
  .setFeaturesCol("features")
  .setElasticNetParam(0.5))

print("Printing out the model Parameters:")
print("-"*20)
print(lrModel.explainParams())
print("-"*20)

Printing out the model Parameters:
--------------------
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0, current: 0.5)
epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber (default: 1.35)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: count)
loss: The loss function to be optimized. Supported options: squaredError, huber. (default: squaredError)
maxBlockSizeInMB: maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specifi

In [48]:
lrFitted = lrModel.fit(training)

In [50]:
holdout = (lrFitted
  .transform(test)
  .selectExpr("prediction as raw_prediction", 
    "double(round(prediction)) as prediction", 
    "count", 
    """CASE double(round(prediction)) = count 
  WHEN true then 1
  ELSE 0
END as equal"""))

holdout.show()

+------------------+----------+-----+-----+
|    raw_prediction|prediction|count|equal|
+------------------+----------+-----+-----+
|1.6266480360272162|       2.0|  0.0|    0|
|1.3960107124675702|       1.0|  0.0|    0|
|1.1748730028702283|       1.0|  0.0|    0|
|1.6076730958060064|       2.0|  0.0|    0|
|1.6195503737863977|       2.0|  1.0|    0|
| 1.662668739464466|       2.0|  1.0|    0|
| 1.425376468040827|       1.0|  1.0|    1|
|1.4983417646037873|       1.0|  1.0|    1|
|1.2356768699980547|       1.0|  1.0|    1|
|0.5773431559152427|       1.0|  2.0|    0|
|0.7806139194397206|       1.0|  2.0|    0|
|1.9457817650198195|       2.0|  5.0|    0|
|1.6599200673154881|       2.0|  0.0|    0|
|1.5975276374177712|       2.0|  0.0|    0|
|1.6380010046941103|       2.0|  1.0|    0|
|1.2801867869754961|       1.0|  1.0|    1|
|2.6075669587082286|       3.0|  8.0|    0|
|1.4584202027423474|       1.0|  0.0|    0|
|1.3832767489712159|       1.0|  0.0|    0|
|1.5817514555799617|       2.0| 

In [51]:
holdout.selectExpr("sum(equal)/sum(1)").show()

+---------------------+
|(sum(equal) / sum(1))|
+---------------------+
|   0.2176991150442478|
+---------------------+
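The `sum(equal)/sum(1)` expression is an exact-match rate: the share of rows where the rounded prediction equals the true count. The sketch below recomputes it in plain Python for only the first ten rows of the `holdout` output, with raw predictions rounded to four decimals, so the result differs from the full-table figure. The `round_half_up` helper is an assumption standing in for Spark SQL's `round()`, because Python's built-in `round()` sends ties to the nearest even number.

```python
import math


def round_half_up(x):
    """Round to the nearest integer with ties going up (valid for non-negative inputs)."""
    return float(math.floor(x + 0.5))


# (raw_prediction, count) pairs from the first ten rows of the holdout output.
rows = [
    (1.6266, 0.0), (1.3960, 0.0), (1.1749, 0.0), (1.6077, 0.0), (1.6196, 1.0),
    (1.6627, 1.0), (1.4254, 1.0), (1.4983, 1.0), (1.2357, 1.0), (0.5773, 2.0),
]

equal = [1 if round_half_up(raw) == count else 0 for raw, count in rows]
match_rate = sum(equal) / len(equal)
print(match_rate)  # 0.3
```

This is a strict metric for a count prediction, since a prediction that is off by one scores the same as one that is off by ten.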



In [64]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate Mean Squared Error
evaluator = RegressionEvaluator(
    labelCol="count", 
    predictionCol="prediction", 
    metricName="mse"
)
mse = evaluator.evaluate(holdout)

mae = evaluator.setMetricName("mae").evaluate(holdout)
rmse = evaluator.setMetricName("rmse").evaluate(holdout)
r2 = evaluator.setMetricName("r2").evaluate(holdout)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R^2: {r2}")


MSE: 2.437758112094395
MAE: 1.2
RMSE: 1.5613321594377012
R^2: 0.014553741176523749


In [53]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml import Pipeline

rfModel = (RandomForestRegressor()
  .setLabelCol("count")
  .setFeaturesCol("features"))

paramGrid = (ParamGridBuilder()
  .addGrid(rfModel.maxDepth, [5, 10])
  .addGrid(rfModel.numTrees, [20, 60])
  .build())
# Note, that this parameter grid will take a long time
# to run in the community edition due to limited number
# of workers available! Be patient for it to run!
# If you want it to run faster, remove some of
# the above parameters and it'll speed right up!

stages = [rfModel]

pipeline = Pipeline().setStages(stages)

cv = (CrossValidator() # you can feel free to change the number of folds used in cross validation as well
  .setEstimator(pipeline) # the estimator can also just be an individual model rather than a pipeline
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(RegressionEvaluator().setLabelCol("count")))

pipelineFitted = cv.fit(training)
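To get a feel for why the grid search is slow, this plain-Python sketch enumerates the parameter combinations and counts the model fits. It assumes the `CrossValidator` default of three folds, since the cell above does not set `numFolds`.

```python
from itertools import product

# Values copied from the ParamGridBuilder call above.
grid = {"maxDepth": [5, 10], "numTrees": [20, 60]}
num_folds = 3  # CrossValidator's default numFolds (assumed, since it is not set above)

# Every combination of the listed values becomes one candidate parameter map.
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]
fits_during_cv = len(param_maps) * num_folds
total_fits = fits_during_cv + 1  # plus one final refit of the best map on all training data

for param_map in param_maps:
    print(param_map)
print(fits_during_cv, total_fits)  # 12 13
```

Each extra value added to one grid dimension multiplies the number of fits, so widening the grid for Question 1 is best done one parameter at a time.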

In [54]:
print("The Best Parameters:\n--------------------")
print(pipelineFitted.bestModel.stages[0])
pipelineFitted.bestModel.stages[0].extractParamMap()

The Best Parameters:
--------------------
RandomForestRegressionModel: uid=RandomForestRegressor_83b41bde753f, numTrees=20, numFeatures=8


{Param(parent='RandomForestRegressor_83b41bde753f', name='bootstrap', doc='Whether bootstrap samples are used when building trees.'): True,
 Param(parent='RandomForestRegressor_83b41bde753f', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'): False,
 Param(parent='RandomForestRegressor_83b41bde753f', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.'): 10,
 Param(parent='RandomForestRegressor_83b41bde753f', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supporte

In [55]:
pipelineFitted.bestModel

PipelineModel_b51eb85a4abd

In [56]:
holdout2 = (pipelineFitted.bestModel
  .transform(test)
  .selectExpr("prediction as raw_prediction", 
    "double(round(prediction)) as prediction", 
    "count", 
    """CASE double(round(prediction)) = count 
  WHEN true then 1
  ELSE 0
END as equal"""))
  
holdout2.show()

+-------------------+----------+-----+-----+
|     raw_prediction|prediction|count|equal|
+-------------------+----------+-----+-----+
|0.15740984747919035|       0.0|  0.0|    1|
| 0.2206896610523093|       0.0|  0.0|    1|
| 0.3156566671570846|       0.0|  0.0|    1|
|  0.946622624784443|       1.0|  0.0|    0|
| 0.9449481125769367|       1.0|  1.0|    1|
| 1.0068125603311764|       1.0|  1.0|    1|
| 0.9726002700913259|       1.0|  1.0|    1|
| 1.1679757038903142|       1.0|  1.0|    1|
| 1.4782475185788102|       1.0|  1.0|    1|
| 2.3030802616263655|       2.0|  2.0|    1|
| 1.9496039240622676|       2.0|  2.0|    1|
| 2.9311498276893317|       3.0|  5.0|    0|
| 0.9558531423897045|       1.0|  0.0|    0|
|0.15740984747919035|       0.0|  0.0|    1|
| 1.6158593301353708|       2.0|  1.0|    0|
| 0.8279278118275654|       1.0|  1.0|    1|
|  1.798522792753974|       2.0|  8.0|    0|
| 0.7232243710662721|       1.0|  0.0|    0|
| 0.3882577465394846|       0.0|  0.0|    1|
| 0.428730

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate Mean Squared Error
evaluator = RegressionEvaluator(
    labelCol="count", 
    predictionCol="prediction", 
    metricName="mse"
)
mse = evaluator.evaluate(holdout2)

mae = evaluator.setMetricName("mae").evaluate(holdout2)
rmse = evaluator.setMetricName("rmse").evaluate(holdout2)
r2 = evaluator.setMetricName("r2").evaluate(holdout2)

print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R^2: {r2}")


MSE: 7.900294985250737
MAE: 1.136873156342183
RMSE: 2.810746339542353
R^2: -2.1936376698705637
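As a sanity check on the two sets of metrics, the sketch below uses only the numbers printed in this notebook. It confirms that RMSE is the square root of MSE, then rearranges R^2 = 1 - MSE / Var(label) to recover the label variance implied by each model. Note that both evaluators score the rounded `prediction` column, not `raw_prediction`.

```python
import math

# Metrics copied from the two evaluation outputs in this notebook.
lr = {"mse": 2.437758112094395, "rmse": 1.5613321594377012, "r2": 0.014553741176523749}
rf = {"mse": 7.900294985250737, "rmse": 2.810746339542353, "r2": -2.1936376698705637}

implied = {}
for name, m in (("linear regression", lr), ("random forest", rf)):
    # RMSE should equal sqrt(MSE).
    assert math.isclose(math.sqrt(m["mse"]), m["rmse"], rel_tol=1e-9)
    # Rearranging R^2 = 1 - MSE / Var(label) gives the implied label variance.
    implied[name] = m["mse"] / (1 - m["r2"])
    print(name, round(implied[name], 4))
```

Both models imply the same label variance, which is what we would expect when they are scored on the same test split. The random forest's MSE is roughly three times that variance, which is exactly what its negative R-squared expresses.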


In [57]:
holdout2.selectExpr("sum(equal)/sum(1)").show()

+---------------------+
|(sum(equal) / sum(1))|
+---------------------+
|   0.3817109144542773|
+---------------------+



In [None]:
spark.stop()