# Managing Big Data

The `arcgis.geoanalytics.manage_data` submodule contains functions that are used for the day-to-day management of geographic and tabular data.

<h2>Table of contents</h2>
    
- [Append Data](#Append-Data)
- [Calculate Fields](#Calculate-Fields)
- [Clip Layer](#Clip-Layer)
- [Copy To Datastore](#Copy-To-Datastore)
- [Dissolve Boundaries](#Dissolve-Boundaries)
- [Merge Layers](#Merge-Layers)
- [Overlay Data](#Overlay-Data)
- [Run Python Script](#Run-Python-Script)

This toolset uses distributed processing to complete analytics on your GeoAnalytics Server.

<table>
  <tr>
    <th><center>Tool</center></th>
    <th><center>Description</center></th>    
  </tr>
  <tr>
      <td><a href="https://enterprise.arcgis.com/en/portal/latest/use/geoanalytics-append-data.htm"><p align="left">append_data</p></a></td>
      <td><p align="left">The Append Data tool allows you to append features to an existing hosted layer in your ArcGIS Enterprise organization. Append Data allows you to update or modify existing datasets.</p></td>
  </tr>
  <tr>
      <td><a href="https://enterprise.arcgis.com/en/portal/latest/use/geoanalytics-calculate-field.htm"><p align="left">calculate_fields</p></a></td>
      <td><p align="left">The Calculate Field tool calculates field values on a new or existing field. The output will always be a new layer in your ArcGIS Enterprise portal contents.</p></td>
  </tr>
  <tr>
      <td><a href="https://enterprise.arcgis.com/en/portal/latest/use/geoanalytics-clip-layer.htm"><p align="left">clip_layer</p></a></td>
      <td><p align="left">The Clip Layer tool allows you to create subsets of your input features by clipping them to areas of interest. The output subset layer will be available in your ArcGIS Enterprise organization.</p></td>
  </tr>
  <tr>
      <td><a href="https://enterprise.arcgis.com/en/portal/latest/use/geoanalytics-copy-to-data-store.htm"><p align="left">copy_to_data_store</p></a></td>
      <td><p align="left">The Copy To Data Store tool is a convenient way to copy datasets to a layer in your portal. Copy to Data Store creates an item in your content containing your layer.</p></td>
  </tr>
  <tr>
      <td><a href="https://enterprise.arcgis.com/en/portal/latest/use/geoanalytics-dissolve-boundaries.htm"><p align="left">dissolve_boundaries</p></a></td>
      <td><p align="left">The Dissolve Boundaries tool merges area features that either intersect or have the same field values.</p></td>
  </tr>
  <tr>
      <td><a href="https://enterprise.arcgis.com/en/portal/latest/use/geoanalytics-merge-layers.htm"><p align="left">merge_layers</p></a></td>
      <td><p align="left">The Merge Layers tool combines two feature layers to create a single output layer. The tool requires that both layers have the same geometry type (tabular, point, line, or polygon).</p></td>
  </tr>
  <tr>
      <td><a href="https://enterprise.arcgis.com/en/portal/latest/use/geoanalytics-overlay-layers.htm"><p align="left">overlay_data</p></a></td>
      <td><p align="left">The Overlay Layers tool combines two layers into a single layer using one of five methods: Intersect, Erase, Union, Identity, or Symmetric Difference.</p></td>
  </tr>
  <tr>
      <td><a href="https://developers.arcgis.com/rest/services-reference/enterprise/run-python-script.htm"><p align="left">run_python_script</p></a></td>
      <td><p align="left">The Run Python Script tool executes a Python script directly on an ArcGIS GeoAnalytics Server site.</p></td>
  </tr>
</table>

**Note**: The purpose of the notebook is to show examples of the different tools that can be run on an example dataset.

<b>Necessary imports</b>

In [10]:
# connect to Enterprise GIS
from arcgis.gis import GIS
import arcgis.geoanalytics

portal_gis = GIS("your_enterprise_profile")

In [16]:
item = portal_gis.content.get('5c6ef8ef57934990b543708f815d606e')
usa_counties_lyr = item.layers[0]

In [17]:
usa_counties_lyr

<FeatureLayer url:"https://pythonapi.playground.esri.com/server/rest/services/Hosted/usaCounties/FeatureServer/0">

In [18]:
search_result = portal_gis.content.search("", item_type = "big data file share")
search_result

[<Item title:"bigDataFileShares_hurricanes_dask_shp" type:Big Data File Share owner:atma.mani>,
 <Item title:"bigDataFileShares_NYC_taxi_data15" type:Big Data File Share owner:api_data_owner>,
 <Item title:"bigDataFileShares_hurricanes_dask_csv" type:Big Data File Share owner:atma.mani>,
 <Item title:"bigDataFileShares_all_hurricanes" type:Big Data File Share owner:api_data_owner>,
 <Item title:"bigDataFileShares_ServiceCallsOrleans" type:Big Data File Share owner:portaladmin>,
 <Item title:"bigDataFileShares_ServiceCallsOrleansTest" type:Big Data File Share owner:arcgis_python>,
 <Item title:"bigDataFileShares_calls" type:Big Data File Share owner:api_data_owner>,
 <Item title:"bigDataFileShares_Samples_Data" type:Big Data File Share owner:api_data_owner>,
 <Item title:"bigDataFileShares_GA_Data" type:Big Data File Share owner:arcgis_python>,
 <Item title:"bigDataFileShares_GA_Data" type:Big Data File Share owner:api_data_owner>]

In [19]:
air_lyr = search_result[-2].layers[0]

In [20]:
air_lyr

<Layer url:"https://pythonapi.playground.esri.com/ga/rest/services/DataStoreCatalogs/bigDataFileShares_GA_Data/BigDataCatalogServer/air_quality">
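Selecting an item by position (`search_result[-2]`) depends on the order in which the portal returns results, and that order can change as items are added. A minimal sketch of selecting by title and owner instead is shown below. `find_item` is a hypothetical helper, and the stand-in objects only mimic the `title` and `owner` values visible in the search output above.

```python
from types import SimpleNamespace


def find_item(items, title, owner):
    """Return the single item matching title and owner, or raise if none or several match."""
    matches = [i for i in items if i.title == title and i.owner == owner]
    if len(matches) != 1:
        raise LookupError(f"expected 1 match for {title!r}/{owner!r}, found {len(matches)}")
    return matches[0]


# Stand-ins for the portal items listed above (hypothetical objects, not real Items).
fake_results = [
    SimpleNamespace(title="bigDataFileShares_Samples_Data", owner="api_data_owner"),
    SimpleNamespace(title="bigDataFileShares_GA_Data", owner="arcgis_python"),
    SimpleNamespace(title="bigDataFileShares_GA_Data", owner="api_data_owner"),
]

ga_item = find_item(fake_results, "bigDataFileShares_GA_Data", "arcgis_python")
print(ga_item.owner)
```

With a live connection, the same helper could be applied to `search_result`, assuming the returned items expose `title` and `owner` attributes.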

## Append Data

The Append Data tool allows you to append features to an existing hosted layer in your ArcGIS Enterprise organization and update or modify existing datasets.

<center><img src="../../static/img/guide_img/ga/append_data.png" height="300" width="300"></center>

The [`append_data`](https://developers.arcgis.com/rest/services-reference/enterprise/append-data.htm) operation appends tabular, point, line, or polygon data to an existing layer. The input layer must be a hosted feature layer. The tool will add the appended data as rows to the input layer. No new output layer will be created.

In [42]:
from arcgis.geoanalytics.manage_data import append_data

In [46]:
input_lyr = portal_gis.content.get('b39deed705144a2a90c5eaf5a44f5a14').layers[0]
append_lyr = portal_gis.content.get('78d4b1914bba4e09b9e8006fa6a3157c').layers[0]

In [47]:
append_data(input_layer=input_lyr, append_layer=append_lyr)

Attaching log redirect
Log level set to DEBUG
{"messageCode":"BD_101117","message":"No field mapping specified. Default field mapping will be applied."}
{"messageCode":"BD_101125","message":"The following fields have not been appended: INSTANT_DATETIME, globalid, OBJECTID.","params":{"fieldNames":"INSTANT_DATETIME, globalid, OBJECTID"}}
{"messageCode":"BD_101124","message":"The following fields are missing from the append layer and field mapping: pres_wmo_, Wind, size, latitude, pres_wmo1, iso_time, longitude, wind_wmo_, wind_wmo1.","params":{"missingFields":"pres_wmo_, Wind, size, latitude, pres_wmo1, iso_time, longitude, wind_wmo_, wind_wmo1"}}
{"messageCode":"BD_101132","message":"The following append layer fields have not been appended due to a field name mismatch: Field1, longitude_, pressure_m, ISO_time_s, latitude_m, Current_Ba, eye_dia_mi, wind_knots.","params":{"mismatchedFields":"Field1, longitude_, pressure_m, ISO_time_s, latitude_m, Current_Ba, eye_dia_mi, wind_knots"}}
Det

True
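The log messages above report fields that could not be matched by name between the two layers. As a rough way to preview such mismatches before appending, the sketch below compares two lists of field names. It is not the server's algorithm: `compare_fields` is a hypothetical helper, exact case-sensitive matching is an assumption, and the field lists are abbreviated from the log.

```python
def compare_fields(input_fields, append_fields):
    """Split field names into matched, missing from append, and unmatched append fields."""
    input_set, append_set = set(input_fields), set(append_fields)
    return {
        "matched": sorted(input_set & append_set),
        "missing_from_append": sorted(input_set - append_set),
        "unmatched_in_append": sorted(append_set - input_set),
    }


# Abbreviated field lists taken from the log messages above.
input_fields = ["Wind", "latitude", "longitude", "iso_time"]
append_fields = ["wind_knots", "latitude_m", "longitude_", "ISO_time_s"]

report = compare_fields(input_fields, append_fields)
for key, names in report.items():
    print(key, names)
```

An empty `matched` list, as here, suggests that a field mapping is needed; the `append_data` reference linked above describes how one is specified.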

## Calculate Fields

The [`calculate_fields`](https://developers.arcgis.com/rest/services-reference/enterprise/calculate-field.htm) operation works with a layer to create and populate a new field or edit an existing field. The output is a new feature service that is the same as the input features, but with the newly calculated values.

<center><img src="../../static/img/guide_img/ga/calculate_field.png" height="300" width="300"></center>

In [35]:
from arcgis.geoanalytics.manage_data import calculate_fields

In [40]:
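# hurr2 is the hurricane layer assigned in the Merge Layers section of this notebook.
# Note: the new field is named "avg", but the expression stores the larger of the two values.
calculate_fields(input_layer=hurr2,
                 field_name="avg",
                 data_type="Double",
                 expression='max($feature["wind_wmo1"],$feature["pres_wmo1"])')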

Attaching log redirect
Log level set to DEBUG
Detaching log redirect
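To make the expression above concrete, here is a plain-Python sketch of what it computes for each feature: the larger of the `wind_wmo1` and `pres_wmo1` values. The attribute values are hypothetical, `larger_of` is an illustrative helper, and the handling of null values on the server may differ.

```python
def larger_of(feature, field_a, field_b):
    """Return the larger of two numeric attributes of one feature."""
    return max(feature[field_a], feature[field_b])


# Hypothetical attribute values; the field names come from the expression above.
features = [
    {"wind_wmo1": 35.0, "pres_wmo1": 1005.0},
    {"wind_wmo1": 120.0, "pres_wmo1": 0.0},
]

for f in features:
    f["avg"] = larger_of(f, "wind_wmo1", "pres_wmo1")

print([f["avg"] for f in features])
```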


## Clip Layer

The [`clip_layer`](https://developers.arcgis.com/rest/services-reference/enterprise/clip-layer.htm) tool allows you to create subsets of your input features by clipping them to areas of interest. The output subset layer will be available in your ArcGIS Enterprise organization.

<center><img src="../../static/img/guide_img/ga/clip_layer.png" height="300" width="300"></center>

In [13]:
from arcgis.geoanalytics.manage_data import clip_layer
from datetime import datetime as dt

In [9]:
search_result = portal_gis.content.search("bigDataFileShares_ServiceCallsOrleans", item_type = "big data file share")[0]
search_result

In [10]:
search_result.layers

[<Layer url:"https://pythonapi.playground.esri.com/ga/rest/services/DataStoreCatalogs/bigDataFileShares_ServiceCallsOrleans/BigDataCatalogServer/yearly_calls">]

In [11]:
calls = search_result.layers[0]

In [4]:
block_grp_item = portal_gis.content.get('9975b4dd3ca24d4bbe6177b85f9da7bb')
blk_grp_lyr = block_grp_item.layers[0]

In [5]:
blk_grp_lyr.query(as_df=True)

Unnamed: 0,fid,statefp,countyfp,tractce,blkgrpce,geoid,namelsad,mtfcc,funcstat,aland,awater,intptlat,intptlon,globalid,OBJECTID,SHAPE
0,2022,22,071,004500,1,220710045001,Block Group 1,G5030,S,173150.0,0.0,+29.9715423,-090.0838705,{E8B3C635-D935-62F7-168F-64EDC2BED042},1,"{""rings"": [[[-90.08673061729473, 29.9701864174..."
1,2209,22,071,013500,3,220710135003,Block Group 3,G5030,S,214346.0,0.0,+29.9573925,-090.0664905,{C253C889-C96C-6493-7DD3-1F0CB06A74BB},2,"{""rings"": [[[-90.06992661254384, 29.9589464152..."
2,2300,22,071,001722,2,220710017222,Block Group 2,G5030,S,290122.0,0.0,+30.0167107,-089.9915777,{7D29E3F0-9C17-BC5D-86E9-2F604B7B6682},3,"{""rings"": [[[-89.99454259416906, 30.0191794304..."
3,2322,22,071,002200,2,220710022002,Block Group 2,G5030,S,119731.0,0.0,+29.9804678,-090.0482377,{5A369F6C-B416-ED2A-EEF9-AE2AC8A13BE9},4,"{""rings"": [[[-90.04979560782573, 29.9822134209..."
4,2792,22,071,009600,3,220710096003,Block Group 3,G5030,S,98006.0,0.0,+29.9195453,-090.0922712,{C3B67E6D-30B1-9C16-1E93-27A02A7869A5},5,"{""rings"": [[[-90.09409861704567, 29.9214634062..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
531,170,22,071,003000,2,220710030002,Block Group 2,G5030,S,190414.0,0.0,+29.9798908,-090.0617835,{F6A152A8-6629-7E38-E0EB-6C20D43EB4F5},799,"{""rings"": [[[-90.06492061151569, 29.9813114200..."
532,155,22,071,002501,4,220710025014,Block Group 4,G5030,S,216634.0,0.0,+30.0143991,-090.0571656,{E0880AD9-1B54-18C1-830B-3FBA14395285},820,"{""rings"": [[[-90.06078861197408, 30.0155984278..."
533,2660,22,071,012600,1,220710126001,Block Group 1,G5030,S,179117.0,0.0,+29.9445416,-090.1295354,{64CA7D7F-FBA7-2614-D997-6D55B9D65DF8},829,"{""rings"": [[[-90.13190562891597, 29.9451464105..."
534,2161,22,071,001800,1,220710018001,Block Group 1,G5030,S,211600.0,0.0,+29.9673921,-090.0537128,{16764B95-5EEE-DCD9-21D1-814A00C6A7FE},859,"{""rings"": [[[-90.05698460887318, 29.9687144176..."


In [14]:
clip_result = clip_layer(calls, blk_grp_lyr, output_name="service calls in new Orleans" + str(dt.now().microsecond))
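The call above appends `dt.now().microsecond` to keep the output name unique. That attribute holds only the microsecond component of the current time (0 to 999999), so two runs could in principle produce the same name. A small sketch of an alternative that uses the full timestamp follows; `unique_output_name` is a hypothetical helper, and the underscore-separated prefix is a stylistic choice rather than a requirement taken from the source.

```python
from datetime import datetime


def unique_output_name(prefix, now=None):
    """Append a sortable timestamp (down to the microsecond) to a prefix."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S_%f}"


# A fixed timestamp is passed here only to make the result reproducible.
name = unique_output_name("service_calls_new_orleans", datetime(2020, 1, 2, 3, 4, 5, 678))
print(name)
```

The resulting string could then be passed as `output_name` in place of the concatenation used above.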

## Copy To Datastore

The Copy To Data Store tool is a convenient way to copy datasets to a layer in your portal. Copy to Data Store creates an item in your content containing your copied dataset layer.

<center><img src="../../static/img/guide_img/ga/copy_to_data_Store.png" height="300" width="300"></center>

In [2]:
from arcgis.geoanalytics.manage_data import copy_to_data_store

In [41]:
copy_to_data_store(hurr2)

Attaching log redirect
Log level set to DEBUG
Detaching log redirect


In [43]:
copy_to_data_store(hurr1)

Attaching log redirect
Log level set to DEBUG
Detaching log redirect
{"messageCode":"BD_101051","message":"Possible issues were found while reading 'inputLayer'.","params":{"paramName":"inputLayer"}}
{"messageCode":"BD_101054","message":"Some records have either missing or invalid geometries."}


## Dissolve Boundaries

The Dissolve Boundaries tool merges area features that either intersect or have the same field values.

<center><img src="../../static/img/guide_img/ga/dissolve_boundaries.png" height="300" width="300"></center>

In [3]:
from arcgis.geoanalytics.manage_data import dissolve_boundaries

In [6]:
dissolve_boundaries(input_layer=blk_grp_lyr, 
                    dissolve_fields='countyfp', 
                    output_name='dissolved by countyfp')

Attaching log redirect
Log level set to DEBUG
Detaching log redirect
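The attribute side of dissolving by a field can be pictured as grouping records by that field's value. The sketch below groups a few block groups by `countyfp` and sums their `aland` values; it does not merge any geometry, which is the part the tool actually performs. `dissolve_by_field` is a hypothetical helper, and the last record is invented for contrast.

```python
from collections import defaultdict


def dissolve_by_field(records, field, sum_field):
    """Group records by a field value and sum a numeric attribute per group."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[field]] += rec[sum_field]
    return dict(totals)


# The first three records use countyfp and aland values from the table shown earlier.
block_groups = [
    {"countyfp": "071", "aland": 173150.0},
    {"countyfp": "071", "aland": 214346.0},
    {"countyfp": "071", "aland": 290122.0},
    {"countyfp": "999", "aland": 1000.0},  # hypothetical record in another county
]

dissolved = dissolve_by_field(block_groups, "countyfp", "aland")
print(dissolved)
```

Each key in the result corresponds to one dissolved output feature. The rows shown in the block group table all share `countyfp` 071, so they would fall into a single group.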


## Merge Layers

The Merge Layers tool combines two feature layers to create a single output layer. The tool requires that both layers have the same geometry type (tabular, point, line, or polygon). If time is enabled on one layer, the other layer must also be time enabled and have the same time type (instant or interval). The result will always contain all fields from the input layer. All fields from the merge layer will be included by default, or you can specify custom merge rules to define the resulting schema.

<center><img src="../../static/img/guide_img/ga/merge_layers.png" height="300" width="300"></center>

In [7]:
from arcgis.geoanalytics.manage_data import merge_layers

In [18]:
search_result = portal_gis.content.search("", item_type = "big data file share")
search_result

[<Item title:"bigDataFileShares_hurricanes_dask_shp" type:Big Data File Share owner:atma.mani>,
 <Item title:"bigDataFileShares_NYC_taxi_data15" type:Big Data File Share owner:api_data_owner>,
 <Item title:"bigDataFileShares_hurricanes_dask_csv" type:Big Data File Share owner:atma.mani>,
 <Item title:"bigDataFileShares_all_hurricanes" type:Big Data File Share owner:api_data_owner>,
 <Item title:"bigDataFileShares_ServiceCallsOrleans" type:Big Data File Share owner:portaladmin>,
 <Item title:"bigDataFileShares_ServiceCallsOrleansTest" type:Big Data File Share owner:arcgis_python>,
 <Item title:"bigDataFileShares_GA_Data" type:Big Data File Share owner:arcgis_python>,
 <Item title:"bigDataFileShares_GA_Data" type:Big Data File Share owner:arcgis_python>,
 <Item title:"bigDataFileShares_Chicago_Crimes" type:Big Data File Share owner:arcgis_python>,
 <Item title:"bigDataFileShares_csv_table" type:Big Data File Share owner:arcgis_python>]

In [22]:
hurr1 = search_result[0].layers[0]
hurr2 = search_result[3].layers[0]

In [23]:
merge_layers(hurr1, hurr2, output_name='merged layers')

Attaching log redirect
Log level set to DEBUG
Detaching log redirect
{"messageCode":"BD_101051","message":"Possible issues were found while reading 'inputLayer'.","params":{"paramName":"inputLayer"}}
{"messageCode":"BD_101054","message":"Some records have either missing or invalid geometries."}
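The default schema rule described above (all input-layer fields, followed by any additional merge-layer fields) can be sketched with plain dictionaries. This illustrates the rule only and is not how the tool is implemented; `merge_tables` is a hypothetical helper, the rows are invented, and filling absent fields with `None` is an assumption.

```python
def merge_tables(input_rows, merge_rows):
    """Stack two lists of dict rows; input fields come first, then new merge fields."""
    schema = []
    for row in input_rows + merge_rows:
        for name in row:
            if name not in schema:
                schema.append(name)
    # Fill fields a row does not have with None so every row shares the schema.
    merged = [{name: row.get(name) for name in schema} for row in input_rows + merge_rows]
    return schema, merged


# Hypothetical rows; field names borrowed from the append log earlier in this guide.
layer_a = [{"iso_time": "2005-08-29", "wind_wmo1": 110.0}]
layer_b = [{"iso_time": "2012-10-29", "pres_wmo1": 940.0}]

schema, rows = merge_tables(layer_a, layer_b)
print(schema)
print(rows)
```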


## Overlay Data

The Overlay Layers tool combines two layers into a single layer using one of five methods: Intersect, Erase, Union, Identity, or Symmetric Difference.

<center><img src="../../static/img/guide_img/ga/overlay_layers.png" height="300" width="300"></center>

In [11]:
from arcgis.geoanalytics.manage_data import overlay_data

In [None]:
overlay_data(calls, blk_grp_lyr, output_name='intersected features')
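The five overlay methods named above differ in which portions of the two layers survive. A toy sketch that models each layer as a set of grid cells shows the difference in footprint. The method labels used here are illustrative and are not necessarily the strings the tool accepts, and attribute handling is ignored.

```python
def overlay(a, b, method):
    """Combine two areas, each modelled as a set of grid cells, with one overlay method."""
    if method == "intersect":
        return a & b  # only where both overlap
    if method == "erase":
        return a - b  # input minus the overlay
    if method == "union":
        return a | b  # everything from both
    if method == "symmetric_difference":
        return a ^ b  # everything except the overlap
    if method == "identity":
        # Same footprint as the input; a real tool also splits it where the overlay crosses.
        return (a & b) | (a - b)
    raise ValueError(f"unknown method: {method}")


# Two overlapping 2x2 squares on a grid of unit cells (hypothetical data).
input_area = {(0, 0), (0, 1), (1, 0), (1, 1)}
overlay_area = {(1, 1), (1, 2), (2, 1), (2, 2)}

for m in ["intersect", "erase", "union", "identity", "symmetric_difference"]:
    print(m, len(overlay(input_area, overlay_area, m)))
```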

## Run Python Script

The [`run_python_script`](https://developers.arcgis.com/rest/services-reference/enterprise/using-geoanalytics-tools-in-pyspark.htm) method executes a Python script directly on an ArcGIS GeoAnalytics Server site. The script can create an analysis pipeline by chaining together multiple GeoAnalytics tools without writing intermediate results to a data store. The tool can also distribute Python functionality across the GeoAnalytics Server site.

GeoAnalytics Server installs a Python 3.6 environment that this tool uses. The environment includes Spark 2.2.0, the compute platform that distributes analysis across multiple cores of one or more machines in your GeoAnalytics Server site. The environment includes the pyspark module that provides a collection of distributed analysis tools for data management, clustering, regression, and more. The `run_python_script` task automatically imports the pyspark module so you can directly interact with it.

When using the geoanalytics and pyspark packages, most functions return analysis results as Spark DataFrame memory structures. You can write these data frames to a data store or process them in a script. This lets you chain multiple geoanalytics and pyspark tools while only writing out the final result, eliminating the need to create any bulky intermediate result layers.

In [15]:
from arcgis.geoanalytics.manage_data import run_python_script

The function below keeps only the rows that report the PM2.5 pollutant. To find the average PM2.5 value for each county, it uses the `join_features` tool. Finally, it writes the output to the data store.

In [21]:
def average():
    df = layers[0]
    df = df.filter(df['Parameter Name'] == 'PM2.5 - Local Conditions') #pyspark filter
    res = geoanalytics.join_features(target_layer=layers[1], 
                                     join_layer=df, 
                                     join_operation="JoinOneToOne",
                                     summary_fields=[{'statisticType' : 'mean', 'onStatisticField' : 'Sample Measurement'}],
                                     spatial_relationship='Contains')
    res.write.format("webgis").save("average_pm_by_county")

In [22]:
run_python_script(average, [air_lyr, usa_counties_lyr])

[{'type': 'esriJobMessageTypeInformative',
  'description': 'Executing (RunPythonScript): RunPythonScript "def average():\n    df = layers[0]\n    df = df.filter(df[\'Parameter Name\'] == \'PM2.5 - Local Conditions\') #pyspark filter\n    res = geoanalytics.join_features(target_layer=layers[1], \n                                     join_layer=df, \n                                     join_operation="JoinOneToOne",\n                                     summary_fields=[{\'statisticType\' : \'mean\', \'onStatisticField\' : \'Sample Measurement\'}],\n                                     spatial_relationship=\'Contains\')\n    res.write.format("webgis").save("average_pm_by_county")\n\naverage()" https://pythonapi.playground.esri.com/ga/rest/services/DataStoreCatalogs/bigDataFileShares_GA_Data/BigDataCatalogServer/air_quality;https://pythonapi.playground.esri.com/server/rest/services/Hosted/usaCounties/FeatureServer/0 "{"defaultAggregationStyles": false}"'},
 {'type': 'esriJobMessageTypeIn
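To see what the `average` function computes without a GeoAnalytics Server, the same logic can be sketched in plain Python on a handful of rows: filter to PM2.5 measurements, then take the mean per county. The rows are hypothetical, and the `county` key stands in for the result of the spatial join, which this sketch does not perform.

```python
from collections import defaultdict
from statistics import mean


def average_pm25_by_county(rows):
    """Keep PM2.5 rows, then average 'Sample Measurement' per county."""
    by_county = defaultdict(list)
    for row in rows:
        if row["Parameter Name"] == "PM2.5 - Local Conditions":
            by_county[row["county"]].append(row["Sample Measurement"])
    return {county: mean(values) for county, values in by_county.items()}


# Hypothetical measurements; "county" replaces the spatial 'Contains' relationship.
rows = [
    {"county": "A", "Parameter Name": "PM2.5 - Local Conditions", "Sample Measurement": 10.0},
    {"county": "A", "Parameter Name": "PM2.5 - Local Conditions", "Sample Measurement": 14.0},
    {"county": "A", "Parameter Name": "Ozone", "Sample Measurement": 0.03},
    {"county": "B", "Parameter Name": "PM2.5 - Local Conditions", "Sample Measurement": 8.0},
]

result = average_pm25_by_county(rows)
print(result)
```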

In this guide, we learned how we can manage our tabular and geographic big data. In the next guide, we will learn about the `run_python_script` tool.