*** SPARK SQL ***

## The Data

![img](http://training.databricks.com/databricks_guide/USDA_logo.png)

The first of the two datasets that we will be working with is the **Farmers Markets Directory and Geographic Data**. This dataset contains information on the longitude and latitude, state, address, name, and zip code of farmers markets in the United States. The raw data is published by the Department of Agriculture. The version of the data that is found in Databricks (and is used in this tutorial) was updated by the Department of Agriculture on Dec 01, 2015.

![img](http://training.databricks.com/databricks_guide/irs-logo.jpg)

The second dataset we will be working with is the **SOI Tax Stats - Individual Income Tax Statistics - ZIP Code Data (SOI)**. This study provides detailed tabulations of individual income tax return data at the state and ZIP code level and is provided by the IRS. This repository has only a sample of the data: the 2013 file, which includes "AGI" (adjusted gross income). The ZIP Code data shows selected income and tax items classified by State, ZIP Code, and size of adjusted gross income. Data is based on individual income tax returns filed with the IRS and is available for Tax Years 1998, 2001, and 2004 through 2013.


In [0]:
# Read The data
taxes2013 = (spark.read
  .option("header", "true")
  .csv("dbfs:/databricks-datasets/data.gov/irs_zip_code_data/data-001/2013_soi_zipcode_agi.csv"))

markets = (spark.read
  .option("header", "true")
  .csv("dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/market_data.csv"))

In [0]:
import math
def haversine(lat1, lon1, lat2, lon2):
    # convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    # haversine formula
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    r = 3956  # Radius of earth in miles. Use 6371 for kilometers.
    return c * r

from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import DoubleType

# Set your target wealthy zip code's latitude and longitude
target_lat = 40.7128    # example latitude
target_lon = -74.0060   # example longitude

# Define a function for Euclidean distance calculation (result is in degrees, not miles)
def euclidean_distance(lat1, lon1, lat2, lon2):
    # Return null when any coordinate is missing
    if None in (lat1, lon1, lat2, lon2):
        return None
    return ((lat1 - lat2)**2 + (lon1 - lon2)**2)**0.5

# Register the function as a UDF
distance_udf = udf(euclidean_distance, DoubleType())

# Add the new column using the UDF. The CSV was read without a schema, so the
# coordinate columns are strings and are cast to double first.
# NOTE: Market_Lat and Market_Long are assumed column names; adjust them to the
# actual schema of the markets data.
markets = markets.withColumn("distance_to_wealthy", distance_udf(
    col("Market_Lat").cast("double"), col("Market_Long").cast("double"),
    lit(target_lat), lit(target_lon)
))
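
The cell above defines both a haversine and a Euclidean distance. As a minimal sketch of why the choice matters, the plain-Python example below compares the two for two hypothetical points that are each one degree away from the example target. The function names and the two test points are illustrative assumptions, not part of the notebook.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3956 * math.asin(math.sqrt(a))  # 3956 = Earth radius in miles

def euclidean_degrees(lat1, lon1, lat2, lon2):
    """Straight-line distance in raw degrees; not a physical distance."""
    return math.hypot(lat1 - lat2, lon1 - lon2)

target = (40.7128, -74.0060)  # example target used in the notebook
north = (41.7128, -74.0060)   # hypothetical point one degree north
east = (40.7128, -73.0060)    # hypothetical point one degree east

# Both points are about 1.0 "degree" away in Euclidean terms...
print(euclidean_degrees(*target, *north), euclidean_degrees(*target, *east))
# ...but their great-circle distances differ, because a degree of longitude
# covers less ground the farther you are from the equator.
print(haversine_miles(*target, *north), haversine_miles(*target, *east))
```

If the distance is meant to be a physical one, the haversine version is probably the better fit for the UDF; the Euclidean version is only a rough proxy over small areas.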




In [0]:
# Register spark SQL tables

taxes2013.createOrReplaceTempView("taxes2013")
markets.createOrReplaceTempView("markets")

In [0]:
%sql
DROP TABLE IF EXISTS cleaned_taxes;

CREATE OR REPLACE TABLE cleaned_taxes AS
SELECT state, int(zipcode / 10) as zipcode, 
  int(mars1) as single_returns, 
  int(mars2) as joint_returns, 
  int(numdep) as numdep, 
  double(A02650) as total_income_amount,
  double(A00300) as taxable_interest_amount,
  double(a01000) as net_capital_gains,
  double(a00900) as biz_net_income
FROM taxes2013;


num_affected_rows,num_inserted_rows


In [0]:
spark.catalog.cacheTable("cleaned_taxes")

# Convert back to a dataset from a table
cleanedTaxes = spark.sql("SELECT * FROM cleaned_taxes")

# Sum over the AGI brackets: each zip code has one row per income group
summedTaxes = cleanedTaxes.groupBy("zipcode").sum()

cleanedMarkets = (markets
  .selectExpr("*", "int(zip / 10) as zipcode")
  .groupBy("zipcode")
  .count()
  .selectExpr("double(count) as count", "zipcode as zip"))
#  selectExpr is short for "select expression": each argument is parsed
#  the same way as an expression in a SQL SELECT clause

joined = (cleanedMarkets.join(summedTaxes, cleanedMarkets.zip == summedTaxes.zipcode, "outer"))
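
Both tables are keyed on `int(zip / 10)`, which drops the last digit of the ZIP code so that neighbouring ZIP codes fall into the same group. A minimal plain-Python sketch of that grouping follows; the helper name and the sample ZIP codes are hypothetical.

```python
from collections import Counter

def zip_prefix(zipcode):
    """Drop the last digit of a ZIP code, mirroring int(zip / 10)."""
    return int(int(zipcode) / 10)

# Hypothetical market ZIP codes, stored as strings like the CSV columns
market_zips = ["49001", "49008", "49004", "72401", "48180"]

# Count markets per truncated ZIP, like groupBy("zipcode").count()
markets_per_prefix = Counter(zip_prefix(z) for z in market_zips)
print(markets_per_prefix)
```

Three of the five sample ZIP codes share the prefix 4900, so they end up in a single group.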

In [0]:
display(cleanedMarkets)

count,zip
5.0,4900.0
2.0,7240.0
8.0,4818.0
1.0,9852.0
2.0,5300.0
5.0,2122.0
2.0,9900.0
1.0,8592.0
1.0,1580.0
1.0,3175.0


In [0]:
display(joined)

count,zip,zipcode,sum(zipcode),sum(single_returns),sum(joint_returns),sum(numdep),sum(total_income_amount),sum(taxable_interest_amount),sum(net_capital_gains),sum(biz_net_income)
1009.0,,,,,,,,,,
1.0,0.0,0.0,0.0,66430180.0,52885400.0,96500590.0,9274122025.0,82710640.0,399567789.0,310024683.0
1.0,3.0,,,,,,,,,
4.0,60.0,,,,,,,,,
1.0,61.0,,,,,,,,,
2.0,62.0,,,,,,,,,
1.0,63.0,,,,,,,,,
1.0,65.0,,,,,,,,,
4.0,66.0,,,,,,,,,
4.0,67.0,,,,,,,,,


Next, deal with the null values left by the outer join by filling them with 0.

In [0]:
prepped = joined.na.fill(0)
display(prepped)

count,zip,zipcode,sum(zipcode),sum(single_returns),sum(joint_returns),sum(numdep),sum(total_income_amount),sum(taxable_interest_amount),sum(net_capital_gains),sum(biz_net_income)
1009.0,0,0,0,0,0,0,0.0,0.0,0.0,0.0
1.0,0,0,0,66430180,52885400,96500590,9274122025.0,82710640.0,399567789.0,310024683.0
1.0,3,0,0,0,0,0,0.0,0.0,0.0,0.0
4.0,60,0,0,0,0,0,0.0,0.0,0.0,0.0
1.0,61,0,0,0,0,0,0.0,0.0,0.0,0.0
2.0,62,0,0,0,0,0,0.0,0.0,0.0,0.0
1.0,63,0,0,0,0,0,0.0,0.0,0.0,0.0
1.0,65,0,0,0,0,0,0.0,0.0,0.0,0.0
4.0,66,0,0,0,0,0,0.0,0.0,0.0,0.0
4.0,67,0,0,0,0,0,0.0,0.0,0.0,0.0
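
To see where the nulls come from and what filling them does, here is a small plain-Python sketch of a full outer join followed by a fill. The helper functions and input values are hypothetical stand-ins for the DataFrame operations.

```python
def full_outer_join(left, right):
    """Join two {key: value} dicts on key, keeping keys found on either side."""
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k), right.get(k)) for k in keys]

def fill_na(rows, default=0):
    """Replace every None in each row with a default, similar to na.fill."""
    return [tuple(default if v is None else v for v in row) for row in rows]

# Hypothetical inputs keyed by truncated ZIP code
market_counts = {60: 4.0, 61: 1.0}   # markets per ZIP group
tax_returns = {61: 1200, 70: 800}    # summed returns per ZIP group

joined_rows = full_outer_join(market_counts, tax_returns)
print(joined_rows)           # keys missing on one side carry None
print(fill_na(joined_rows))  # None replaced with 0
```

Filling a missing market count with 0 reads naturally as "no markets here". Filling missing tax figures with 0 is a stronger assumption, since it treats "no tax data" as "no income".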



Now that all of our data is prepped, we're going to have to put all of it into one column of a vector type for Spark MLLib. This makes it easy to embed a prediction right in a DataFrame and also makes it very clear what is getting passed into the model and what isn't, without having to convert it to a numpy array or specify an R formula. It also makes it easy to incrementally add new features, simply by adding to the vector. In the case below, rather than listing the feature columns one by one, we list the columns to exclude and treat every remaining column as a feature.

In [0]:
nonFeatureCols = ["zip", "zipcode", "count",'distance_to_wealthy']
featureCols = [item for item in prepped.columns if item not in nonFeatureCols]

In [0]:
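from pyspark.ml.feature import StringIndexer, OneHotEncoder, StandardScaler

# NOTE: the transformers below are only defined here. They are never fitted or
# applied in the rest of this notebook, and prepped has no "State" or "AGI" column.

# Convert categorical variables (State -> Numerical)
state_indexer = StringIndexer(inputCol="State", outputCol="State_Index")

# One-hot encode State
state_encoder = OneHotEncoder(inputCol="State_Index", outputCol="State_OHE")

# Normalize income-based features (StandardScaler expects a vector column as input)
scaler = StandardScaler(inputCol="AGI", outputCol="AGI_Scaled")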


In [0]:
# VectorAssembler assembles all of these columns into one single vector.
# To do this, set the input columns and the output column. The assembler
# is then used to transform the prepped data into the final dataset.
from pyspark.ml.feature import VectorAssembler

assembler = (VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features"))

finalPrep = assembler.transform(prepped)
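
Conceptually, the assembler packs the selected columns of each row into one numeric vector. The plain-Python sketch below mimics that for a single row; the `assemble` helper and the row values are hypothetical, and the real `VectorAssembler` produces Spark vector objects rather than Python lists.

```python
def assemble(row, input_cols):
    """Pack the chosen columns of a row (a dict) into one list of floats."""
    return [float(row[c]) for c in input_cols]

# Hypothetical row shaped like the prepped data (values are made up)
row = {"zip": 61, "zipcode": 61, "count": 1.0,
       "sum(single_returns)": 1200, "sum(joint_returns)": 900}

# Exclude the key and label columns; everything else becomes a feature
non_feature_cols = ["zip", "zipcode", "count"]
feature_cols = [c for c in row if c not in non_feature_cols]

features = assemble(row, feature_cols)
print(feature_cols, features)
```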

Now split the dataset 70-30 for training and testing purposes. A validation set can be created as well, but we are omitting it here. It's worth noting that MLLib also supports hyperparameter tuning with cross validation and pipelines. All this can be found in [the Databricks Guide](https://docs.databricks.com).

In [0]:
training, test = finalPrep.randomSplit([0.7, 0.3])

#  Going to cache the data to make sure things stay snappy!
training.cache()
test.cache()

print(training.count())  # cache() is lazy; count() is an action that forces the data to be computed and cached
print(test.count())

4066
1736
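
The counts above are close to, but not exactly, 70% and 30% of the rows. A plausible reading is that each row is assigned to a split independently at random, so the sizes only approximate the weights. The sketch below illustrates that idea in plain Python; it is not Spark's implementation, and the function name and seed are assumptions.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability
    proportional to that split's weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if draw < cumulative:
                break
        splits[i].append(row)  # i is the matching split, or the last one
    return splits

# 5802 rows, the same total as the training and test counts printed above
train_rows, test_rows = random_split(range(5802), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

In the notebook itself, passing a seed, as in `finalPrep.randomSplit([0.7, 0.3], seed=42)`, makes the split reproducible across runs.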


# Apache Spark MLLib

At a high level, we're going to create an instance of a `regressor` or `classifier`, which in turn will be trained and return a `Model` type. Whenever you use Spark MLLib, be sure to import and train the class named after the algorithm you want, as opposed to the `Model` type. For example:

You should import:

`org.apache.spark.ml.regression.LinearRegression`

as opposed to:

`org.apache.spark.ml.regression.LinearRegressionModel`

In the below example, we're going to use linear regression.

The linear regression that is available in Spark MLLib supports an elastic net parameter that lets you set how much you would like to mix L1 and L2 regularization. For more information on elastic net regularization, [see Wikipedia](https://en.wikipedia.org/wiki/Elastic_net_regularization).
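
As a rough illustration of what the mixing parameter does, the sketch below computes an elastic net penalty of the form `regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2)` in plain Python, where `alpha` plays the role of `elasticNetParam`. The function name and the coefficient values are hypothetical.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

weights = [3.0, -4.0]  # hypothetical model coefficients

# alpha = 0 is a pure L2 penalty, alpha = 1 a pure L1 penalty
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=alpha))
```

Note that the penalty is scaled by the regularization strength. If `regParam` is left at its default of 0.0, the mixing parameter has no effect on the fit.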

As we saw above, we had to perform some preparation of the data before inputting it into the model. We've got to do the same with the model itself. We'll set our hyperparameters, print them out, and then finally we can train it! The `explainParams` method is a great way to ensure that you're taking advantage of all the different hyperparameters that you have available.

In [0]:
from pyspark.ml.regression import LinearRegression

lrModel = (LinearRegression()
  .setLabelCol("count")
  .setFeaturesCol("features")
  .setElasticNetParam(0.5))

print("Printing out the model Parameters:")
print("-"*20)
print(lrModel.explainParams())
print("-"*20)

Printing out the model Parameters:
--------------------
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0, current: 0.5)
epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber (default: 1.35)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: count)
loss: The loss function to be optimized. Supported options: squaredError, huber. (default: squaredError)
maxBlockSizeInMB: maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specifi

Now finally we can go about fitting our model! You'll see that we're going to do this in a series of steps. First we'll fit it, then we'll use it to make predictions via the `transform` method. This is the same way you would make predictions with your model in the future; in this case, however, we're using it to evaluate how our model is doing. We'll use regression metrics to get some idea of how our model is performing, and then print out those values.

In [0]:
from pyspark.mllib.evaluation import RegressionMetrics
lrFitted = lrModel.fit(training)


Since we're working with whole numbers (you can't have half a farmers market, for example), we're going to check equality by first rounding the prediction to the nearest integer.

In [0]:
holdout = (lrFitted
  .transform(test)
  .selectExpr("prediction as raw_prediction", 
    "double(round(prediction)) as prediction", 
    "count", 
    """CASE double(round(prediction)) = count 
  WHEN true then 1
  ELSE 0
END as equal"""))

display(holdout)

raw_prediction,prediction,count,equal
1.650102381516466,2.0,0.0,0
1.3295046135694208,1.0,0.0,0
1.6100909697991914,2.0,0.0,0
1.4991630468288155,1.0,0.0,0
1.4171628261507605,1.0,0.0,0
1.2244704000958662,1.0,0.0,0
1.6135293951958718,2.0,1.0,0
1.3994024918528112,1.0,1.0,1
1.232267098659754,1.0,1.0,1
1.4676134435940933,1.0,1.0,1


Now let's see what proportion was exactly correct.

In [0]:
display(holdout.selectExpr("sum(equal)/sum(1)"))

(sum(equal) / sum(1))
0.2540322580645161
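
The same "exactly right" check can be reproduced by hand on the ten rows displayed in the holdout output above. This is only a sketch on that small sample, so its proportion differs from the full test-set figure. Python's `round` resolves exact ties differently from Spark SQL's `round`, but none of these values is a tie.

```python
# The ten (raw_prediction, count) pairs shown in the holdout output
rows = [
    (1.650102381516466, 0.0),
    (1.3295046135694208, 0.0),
    (1.6100909697991914, 0.0),
    (1.4991630468288155, 0.0),
    (1.4171628261507605, 0.0),
    (1.2244704000958662, 0.0),
    (1.6135293951958718, 1.0),
    (1.3994024918528112, 1.0),
    (1.232267098659754, 1.0),
    (1.4676134435940933, 1.0),
]

# 1 when the rounded prediction matches the true count, else 0
equal = [1 if round(pred) == count else 0 for pred, count in rows]

# Equivalent of sum(equal) / sum(1)
proportion = sum(equal) / len(equal)
print(equal, proportion)
```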


In [0]:
# have to do a type conversion for RegressionMetrics
rm = RegressionMetrics(holdout.select("prediction", "count").rdd.map(lambda x:  (x[0], x[1])))

print("MSE: ", rm.meanSquaredError)
print("MAE: ", rm.meanAbsoluteError)
print("RMSE: ", rm.rootMeanSquaredError)
print("R Squared: ", rm.r2)
print("Explained Variance: ", rm.explainedVariance, "\n")



MSE:  2.673387096774194
MAE:  1.169930875576037
RMSE:  1.6350495701275218
R Squared:  0.012849365716335548
Explained Variance:  0.5866736392788127 
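
For reference, here is a minimal plain-Python sketch of how these metrics are commonly defined. The `regression_metrics` helper and its inputs are hypothetical. The last line checks that the third value printed above is the square root of the reported MSE.

```python
import math

def regression_metrics(predictions, labels):
    """Compute MSE, MAE, RMSE and R squared for paired predictions and labels."""
    n = len(labels)
    errors = [p - y for p, y in zip(predictions, labels)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1 - (mse * n) / ss_tot  # 1 - SS_residual / SS_total
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse), "r2": r2}

# Hypothetical rounded predictions and true counts
metrics = regression_metrics([2.0, 1.0, 1.0, 3.0], [1.0, 1.0, 2.0, 3.0])
print(metrics)

# Square root of the MSE reported above
print(math.sqrt(2.673387096774194))
```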



These results appear to be sub-optimal, so let's try exploring another way to train the model. Rather than training a single model with hard-coded parameters, let's train using a [pipeline](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).

A pipeline gives us some nice benefits: it lets us chain the transformations we need to turn our raw data into the prepared data for the model, and it also provides a simple, straightforward way to try out a lot of different combinations of parameters. This process is called [hyperparameter tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization), and the approach we use here is grid search. To review, grid search is where you set up the exact parameter values that you would like to test and MLLib automatically creates all the necessary combinations of these to test.

For example, below we'll set `numTrees` to 20 and 50 and `maxDepth` to 5 and 10. The parameter grid builder will automatically construct all the combinations of these two variables (along with any others that we might specify). Additionally, we're going to use [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) to tune our hyperparameters, which helps us control [overfitting](https://en.wikipedia.org/wiki/Overfitting) of our model.
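
To make the size of the search concrete, the sketch below enumerates a grid in plain Python, using the values passed to `ParamGridBuilder` in the training cell. It only illustrates the combinatorics and is not how MLLib builds its grid internally.

```python
from itertools import product

# Values taken from the ParamGridBuilder call in the training cell
grid = {"maxDepth": [5, 10], "numTrees": [20, 50]}

# Every combination of one value per parameter
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]
for params in param_maps:
    print(params)

# With 3-fold cross validation, each combination is fitted once per fold,
# and the best combination is then refitted on the full training set.
num_folds = 3
total_fits = len(param_maps) * num_folds + 1
print(total_fits)
```

The number of fits grows multiplicatively with each parameter added to the grid, which is why the training step can take a while.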

Lastly we'll need to set up a [Regression Evaluator](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator) that will evaluate the models that we choose based on some metric (the default is RMSE). The key takeaway is that the pipeline will automatically optimize for our chosen metric by exploring the parameter grid that we set up, rather than us having to do it manually as we did above.

Now we can go about training our random forest! 

*note: this might take a little while because of the number of combinations that we're trying and limitations in workers available.*

In [0]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

# Define the Random Forest Regressor
rfModel = RandomForestRegressor(labelCol="count", featuresCol="features")

# Define the parameter grid for hyperparameter tuning
paramGrid = (ParamGridBuilder()
  .addGrid(rfModel.maxDepth, [5, 10])  # Testing different tree depths
  .addGrid(rfModel.numTrees, [20, 50])  # Testing different numbers of trees
  .build())

# Define the pipeline with the model
pipeline = Pipeline(stages=[rfModel])

# Set up cross-validation
cv = CrossValidator(
    estimator=pipeline,  # The model pipeline
    estimatorParamMaps=paramGrid,  # The parameter grid
    evaluator=RegressionEvaluator(labelCol="count", metricName="rmse"),  # Evaluating using RMSE
    numFolds=3  # 3-fold cross-validation
)

# Train the model using cross-validation
pipelineFitted = cv.fit(training)

# Make predictions on the test set
predictions = pipelineFitted.transform(test)

# Evaluate the model on test data
evaluator = RegressionEvaluator(labelCol="count", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

# Print the final RMSE
print(f"Test RMSE: {rmse}")
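
The idea behind `numFolds=3` can be sketched in plain Python: the training rows are divided into three folds, and each fold takes one turn as the validation set while the other two are used for fitting. The round-robin assignment below is a simplification chosen for readability (Spark assigns rows to folds randomly), and the function name is hypothetical.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold,
    using a simple round-robin assignment of rows to folds."""
    for fold in range(num_folds):
        validation = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, validation

# Nine hypothetical rows split into three folds
for train_idx, val_idx in k_fold_indices(n_rows=9, num_folds=3):
    print(train_idx, val_idx)
```

Every row is used for validation exactly once, so the averaged metric reflects performance on data each candidate model did not see during fitting.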


Now we've trained our model! Let's take a look at which version performed best!

In [0]:
print("The Best Parameters:\n--------------------")
print(pipelineFitted.bestModel.stages[0])
pipelineFitted.bestModel.stages[0].extractParamMap()

com.databricks.backend.common.rpc.CommandSkippedException
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$3(SequenceExecutionState.scala:138)
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$3$adapted(SequenceExecutionState.scala:133)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at com.databricks.spark.chauffeur.SequenceExecutionState.cancel(SequenceExecutionState.scala:133)
	at com.databricks.spark.chauffeur.ExecContextState.cancelRunningSequence(ExecContextState.scala:728)
	at com.databricks.spark.chauffeur.ExecContextState.$anonfun$cancel$1(ExecContextState.scala:446)
	at scala.Option.getOrElse(Option.scala:189)
	at com.databricks.spark.chauffeur.ExecContextState.cancel(ExecContextState.scala:446)
	at com.databricks.spark.chauffeur.ExecutionContextManagerV1.cancelExecution(ExecutionContextManagerV1.scala:464)
	at com.databricks.spark.chauffeur.ChauffeurState.$anonfun$process$1(ChauffeurState.scala:571)
	at com.data

As well as our regression metrics on the test set.


In [0]:
pipelineFitted.bestModel


In [0]:
holdout2 = (pipelineFitted.bestModel
  .transform(test)
  .selectExpr("prediction as raw_prediction", 
    "double(round(prediction)) as prediction", 
    "count", 
    """CASE double(round(prediction)) = count 
  WHEN true then 1
  ELSE 0
END as equal"""))
  
display(holdout2)


In [0]:
rm2 = RegressionMetrics(holdout2.select("prediction", "count").rdd.map(lambda x:  (x[0], x[1])))

print("MSE: ", rm2.meanSquaredError)
print("MAE: ", rm2.meanAbsoluteError)
print("RMSE: ", rm2.rootMeanSquaredError)
print("R Squared: ", rm2.r2)
print("Explained Variance: ", rm2.explainedVariance, "\n")


Finally, let's check whether our "exactly right" proportion improved as well.

In [0]:
display(holdout2.selectExpr("sum(equal)/sum(1)"))
