## Basic Track Analysis - PySpark
This guide performs analyses similar to those in the "Basic Track Analysis" notebook, but does so with PySpark, the Python API for Spark. By using the `run_python_script` function in the ArcGIS API for Python, we can leverage Spark on our GeoAnalytics Server to get analysis results faster.

In this notebook, we'll use PySpark to examine large volumes of track data. We'll show the power of PySpark for querying, aggregating, identifying clusters, calculating new statistics, and summarizing attributes.

This guide assumes that:
1. You are an administrator for your organization OR you are a Track Viewer who has privileges to create content.
2. You are using Enterprise 10.7 or later.
3. You have GeoAnalytics Server as part of your Enterprise deployment.
4. You are running this with the latest version of the ArcGIS API for Python.

In [172]:
import arcgis
from arcgis.gis import GIS
from arcgis.apps.tracker import TrackView

arcgis.env.process_spatial_reference = "102100"

gis = GIS("https://dev0007109.esri.com/portal", "admin", verify_cert=False)
if not arcgis.geoanalytics.is_supported(gis):
    raise Exception("GeoAnalytics is not supported for this organization")

Enter password: ········


### First, let's get familiar with our data

We'll be performing analyses on Tracker data in the Inland Empire area. We're not filtering by any particular attribute here when displaying these tracks, so the amount of track data is pretty significant - almost 500k track points.

In [173]:
tracks_layer = arcgis.apps.tracker.TrackView(gis.content.get('0a9292baf0c148b39745d6c014ba30bf')).tracks_layer
print("Number of Tracks in LTS: " + str(tracks_layer.query("1=1", return_count_only=True)))

Number of Tracks in LTS: 470434


In [180]:
map1 = gis.map("Chino Hills, CA", zoomlevel=9)
map1.basemap = 'streets-night-vector'
map1.add_layer(tracks_layer)
map1

### That's a lot of tracks!
As you can see, we have a ton of tracks in the layer. That's going to make performing any of our analysis operations in GeoAnalytics take quite a bit of time. Let's return all the tracks that are walking (activity attribute == 2) and see how long it takes to do so. 

In [175]:
import datetime
start_time = datetime.datetime.now()
tracks = tracks_layer.query("activity=2", return_all_records=True)
print("Time Elapsed: " + str(datetime.datetime.now() - start_time))
print("Number of Tracks that match criteria: " + str(len(tracks)))

Time Elapsed: 0:00:06.264041
Number of Tracks that match criteria: 48877
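The start/stop timing pattern in the cell above recurs in every comparison in this notebook. As a minimal sketch, it could be wrapped in a small context manager. The name `timed` and the `results` dictionary are hypothetical helpers, not part of the ArcGIS API for Python, and the summation below is only a stand-in for a real query call.

```python
import time
from contextlib import contextmanager
from datetime import timedelta


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = timedelta(seconds=time.perf_counter() - start)
        if results is not None:
            results[label] = elapsed
        print(f"{label} Time Elapsed: {elapsed}")


# Usage: time any block of work (a stand-in computation here, not a real query).
timings = {}
with timed("Example", timings):
    total = sum(range(100_000))
```

Recording the elapsed times in a dictionary makes it easy to compare the different approaches afterwards instead of reading them off the printed output.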


### Now let's try that in PySpark
We'll use the `run_python_script` function available in the ArcGIS API for Python to run a code block that filters those tracks. The code block calls `filter()`, which is available on a Spark DataFrame.

In [176]:
import json
start_time = datetime.datetime.now()
code = '''
import datetime
print(layers[0].filter(layers[0].activity == 2).count())
'''
x = arcgis.geoanalytics.manage_data.run_python_script(code=code, layers=[tracks_layer])
print("Time Elapsed: " + str(datetime.datetime.now() - start_time))
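# NOTE: x is the list of job messages. The index -2 is positional, and in the
# run recorded below it held a task-progress message rather than the printed count.
print("Number of tracks that matched criteria: " + json.loads(x[-2]['description'])['message'])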

Time Elapsed: 0:00:29.091728
Number of tracks that matched criteria: 15/15 distributed tasks completed.
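The output above shows a task-progress message rather than the count, because the message was picked by position. The sketch below scans the messages for the first one whose text is a plain integer instead. It assumes only what the cell above relies on: each entry has a `description` holding a JSON string with a `message` key. The helper name `find_printed_count` and the sample messages are hypothetical.

```python
import json


def find_printed_count(messages):
    """Return the first job message whose text is a plain integer, else None."""
    for entry in messages:
        try:
            text = json.loads(entry["description"])["message"]
        except (KeyError, TypeError, ValueError):
            continue  # skip entries that do not follow the assumed shape
        if isinstance(text, str) and text.strip().isdigit():
            return int(text)
    return None


# Hypothetical messages, shaped like the ones indexed in the cell above.
sample_messages = [
    {"description": "not json"},
    {"description": json.dumps({"message": "15/15 distributed tasks completed."})},
    {"description": json.dumps({"message": "123"})},
]

print(find_printed_count(sample_messages))
```

Searching by content is more robust than a fixed index, since the number and order of progress messages can differ between runs.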


### No performance improvement yet...

But now, let's compare the performance of several GeoAnalytics tools using the standard Python API geoanalytics module and using `run_python_script`. In this example, we will create aggregated hexagon bins to analyze which locations were visited most frequently. We'll perform two actions in each example.

First, we will aggregate the walking track points using the `aggregate_points` functionality in GeoAnalytics. Then, to ensure high-quality data, we will filter the aggregated bins so that we return only aggregations where the mean horizontal accuracy is below 20. The first example will use the Python API directly, while the second example will use `run_python_script` and the PySpark filter functionality.
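To make the aggregate-then-filter logic concrete, here is a plain-Python sketch of the same two steps. It uses square bins as a simple stand-in for hexagon bins and is not how GeoAnalytics implements `aggregate_points`. The function name, the parameter names, and the sample points are all hypothetical.

```python
from collections import defaultdict


def aggregate_and_filter(points, bin_size=50.0, max_mean_accuracy=20.0):
    """Group (x, y, accuracy) points into square bins and keep accurate bins.

    Returns {(col, row): (count, mean_accuracy)} for bins whose mean
    horizontal accuracy is below max_mean_accuracy.
    """
    bins = defaultdict(list)
    for x, y, accuracy in points:
        # Floor division maps each coordinate to the index of its bin.
        key = (int(x // bin_size), int(y // bin_size))
        bins[key].append(accuracy)

    result = {}
    for key, accuracies in bins.items():
        mean_accuracy = sum(accuracies) / len(accuracies)
        if mean_accuracy < max_mean_accuracy:
            result[key] = (len(accuracies), mean_accuracy)
    return result


# Hypothetical points: (x in meters, y in meters, horizontal accuracy)
sample_points = [
    (10.0, 10.0, 5.0),
    (40.0, 20.0, 15.0),  # same bin as the first point
    (60.0, 10.0, 30.0),  # neighboring bin
    (70.0, 30.0, 40.0),  # same neighboring bin
]
kept = aggregate_and_filter(sample_points)
print(kept)
```

In this sample the first bin has a mean accuracy of 10 and is kept, while the second has a mean of 35 and is dropped by the filter.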

In [177]:
# without run_python_script
tracks_layer.filter = 'activity = 2'
start_time = datetime.datetime.now()
from arcgis.geoanalytics.summarize_data import aggregate_points
aggregated_tracks = aggregate_points(point_layer=tracks_layer,
                         bin_size=50,
                         bin_size_unit="Meters",
                         bin_type="Hexagon")
quality_aggregated_tracks = aggregated_tracks.layers[0].query('MEAN_horizontal_accuracy < 20')
print("Regular API Time Elapsed: " + str(datetime.datetime.now() - start_time))

# with run_python_script
start_time = datetime.datetime.now()
code_2 = '''
import time  # needed for the timestamp in the output layer name below
aggregated_tracks = geoanalytics.aggregate_points(layers[0],
                     bin_size=50,
                     bin_size_unit="Meters",
                     bin_type="Hexagon")
quality_aggregated_tracks = aggregated_tracks.filter(aggregated_tracks.MEAN_horizontal_accuracy < 20)
quality_aggregated_tracks.write.format('webgis').save('Aggregate_{0}'.format(time.time()))
'''
arcgis.geoanalytics.manage_data.run_python_script(code=code_2, layers=[tracks_layer])
print("Pyspark Time Elapsed: " + str(datetime.datetime.now() - start_time))



Regular API Time Elapsed: 0:00:43.693165
Pyspark Time Elapsed: 0:00:33.012966


In [181]:
map2.take_screenshot()

### Looks like run_python_script has some performance improvement!
The function gets its advantage from the fact that you don't have to write out intermediate results - the entire program is executed for you on the GeoAnalytics Server, so you only have to deal with input and output. The more functions you want to chain together, the better performance improvement you'll see!

Now, let's try a second example where we chain together even more actions. We're going to find the users in the track view with the largest range of acceleration values.

First, we'll calculate acceleration from the speed field using `calculate_fields`. Then we'll use `summarize_attributes` to compute the range of acceleration for each user, which identifies the users who accelerated the most while driving.
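As a rough illustration of the arithmetic behind these two steps, the plain-Python sketch below computes the change in speed over the change in time between consecutive points of each user's track, and then the range (maximum minus minimum) per user. It is not how GeoAnalytics evaluates the track-aware expression. The function name and sample data are hypothetical, and skipping the first point of each track is an assumption made to keep the sketch simple.

```python
from collections import defaultdict


def acceleration_range_by_user(points):
    """Return {user: max(acceleration) - min(acceleration)} per track.

    points: iterable of (user, timestamp_seconds, speed) tuples.
    """
    tracks = defaultdict(list)
    for user, timestamp, speed in points:
        tracks[user].append((timestamp, speed))

    ranges = {}
    for user, track in tracks.items():
        track.sort()  # order each track by time
        accelerations = []
        for (prev_t, prev_speed), (t, speed) in zip(track, track[1:]):
            if t == prev_t:
                continue  # avoid dividing by zero on duplicate timestamps
            # (previous speed - current speed) / (previous time - current time)
            accelerations.append((prev_speed - speed) / (prev_t - t))
        if accelerations:
            ranges[user] = max(accelerations) - min(accelerations)
    return ranges


# Hypothetical sample: (user, seconds since start, speed)
sample = [
    ("user_a", 0, 0.0),
    ("user_a", 10, 20.0),  # speeds up by 2.0 per second
    ("user_a", 20, 10.0),  # slows down by 1.0 per second
    ("user_b", 0, 5.0),
    ("user_b", 5, 5.0),    # constant speed
]
ranges = acceleration_range_by_user(sample)
print(ranges)
```

A large range means a user both sped up and slowed down sharply, which is why the range rather than the maximum is used to rank users.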

In [179]:
# without run_python_script
tracks_layer.filter = 'activity = 5'
start_time = datetime.datetime.now()
from arcgis.geoanalytics.summarize_data import summarize_attributes
from arcgis.geoanalytics.manage_data import calculate_fields
acceleration_expression = """
    ($track.field["speed"].history(-2)[0] - $feature.speed)/(DateDiff($track.field["location_timestamp"].history(-2)[0], $feature.location_timestamp, "seconds"))
    """
acceleration = calculate_fields(tracks_layer,
                               "acceleration",
                               "Double",
                               acceleration_expression,
                               track_aware=True,
                               track_fields=["created_user"],
                               )
summarized_result = summarize_attributes(
    input_layer=acceleration.layers[0],
    fields="created_user",
    summary_fields=[{"statisticType": "Range", "onStatisticField": "acceleration"}],
)
print("Regular API Time Elapsed: " + str(datetime.datetime.now() - start_time))

# with run_python_script
start_time = datetime.datetime.now()
def calculate_acceleration_range():
    import time  # used for the output name below; imported here so the function is self-contained
    acceleration_expression = """
        ($track.field["speed"].history(-2)[0] - $feature.speed)/(DateDiff($track.field["location_timestamp"].history(-2)[0], $feature.location_timestamp, "seconds"))
        """
    acceleration = geoanalytics.calculate_field(layers[0],
                                   "acceleration",
                                   "Double",
                                   acceleration_expression,
                                   track_aware=True,
                                   track_fields=["created_user"],
                                   )
    acceleration_range = geoanalytics.summarize_attributes(acceleration, fields=["created_user"], summary_fields=[{"statisticType" : "Range", "onStatisticField" : "acceleration"}])
    acceleration_range.write.format('webgis').save('Summary_{0}'.format(time.time()))

# use function as code block for run_python_script
arcgis.geoanalytics.manage_data.run_python_script(code=calculate_acceleration_range, layers=[tracks_layer])
print("Pyspark Time Elapsed: " + str(datetime.datetime.now() - start_time))

# display users with five highest acceleration range values
acceleration_range_df = summarized_result.tables[0].query('1=1', as_df=True)
acceleration_range_df.sort_values('RANGE_acceleration', axis=0, ascending=False, inplace=True)
acceleration_range_df.head()

Regular API Time Elapsed: 0:01:40.512862
Pyspark Time Elapsed: 0:00:41.246889


|    | created_user | COUNT | RANGE_acceleration | globalid | OBJECTID |
|----|--------------|-------|--------------------|----------|----------|
| 16 | anurasih | 18877.0 | 473.209991 | {BE96F27B-21B4-DBCA-6E52-00688FCD0578} | 17 |
| 15 | delias | 10455.0 | 419.880005 | {C285A210-41E0-EB1A-77E5-6F5B11530E7A} | 16 |
| 1 | apatel | 89268.0 | 193.632614 | {150D0608-9745-BEB5-D7F7-9341DFAEB38A} | 2 |
| 19 | nchowdhury | 25222.0 | 59.653349 | {B599E69C-E654-F52B-D712-EE02982564E6} | 23 |
| 3 | metric | 37360.0 | 48.540001 | {EB2D08B5-DE2A-E061-69F4-DB2FFD107B1E} | 4 |
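The timings printed by the two comparison cells can be turned into relative savings with a little arithmetic. The sketch below copies the four elapsed times from the outputs above; the helper name `fraction_saved` is hypothetical.

```python
from datetime import timedelta


def fraction_saved(baseline, improved):
    """Return the fraction of the baseline time that was saved."""
    return 1 - improved / baseline


# Timings copied from the outputs above.
aggregate_regular = timedelta(seconds=43.693165)
aggregate_pyspark = timedelta(seconds=33.012966)
summary_regular = timedelta(minutes=1, seconds=40.512862)
summary_pyspark = timedelta(seconds=41.246889)

aggregate_saving = fraction_saved(aggregate_regular, aggregate_pyspark)
summary_saving = fraction_saved(summary_regular, summary_pyspark)

print(f"Aggregate and filter: {aggregate_saving:.0%} saved")
print(f"Acceleration range:   {summary_saving:.0%} saved")
```

The aggregation example saves roughly a quarter of the time, while the longer acceleration chain saves more than half, which is consistent with the idea that chaining more steps gives a larger benefit. These are single runs, so the exact figures will vary.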


### Summary
We cut the run time by more than half in the previous example (from about 1:40 to about 0:41)! The more actions you chain together, the more performance improvement you'll see using `run_python_script`. Check out these resources for more on PySpark:

1. GeoAnalytics tools with PySpark: https://developers.arcgis.com/rest/services-reference/using-geoanalytics-tools-in-pyspark.htm
2. Working with PySpark DataFrames: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/