# Esri Summer Internship

## By: Juan Carlos Reyes

### For: Dr. Linda Beale and the ArcGIS Insights Team

#### Responsibilities:

External Facing: 
Contribute to the ArcGIS Insights Lab

- Create a workflow that demonstrates data cleaning of open-source datasets
using various Python and R libraries (pandas, dplyr, PyTorch, TensorFlow, etc.)
through the scripting console in Insights.
- Creatively highlight uses of the scripting console that can be beneficial to
Insights users
- Show various data cleaning techniques in addition to the ones present in
Data Engineering within Insights
- Incorporate Machine Learning, Artificial Intelligence (AI) and Big Data as
needed

Internal Facing: 
Research and test different map data clustering approaches to determine which ones are most effective for Insights. Clustering will be implemented on the server side of Insights.

Background: 


- Map clustering will simplify symbology to ensure the map is legible
- The optimal clustering technique will help the Insights development team
determine how to show all the data without crashing the browser.
- Currently, data caching is not implemented since clusters are being
precalculated at each zoom level for faster access.


Types of Clustering Methods/Options:

- DBSCAN clustering algorithm
- Self-organizing Map (SOM)
- K-means
- Others?

Goals:

- Ultimately, clustering will be used to display a cluster symbol where
large numbers of data points are present, instead of the current implementation.

- Utilize Python and R libraries to demonstrate the pros/cons of different 
clustering methods. The work should allow the team to determine the 
accuracy of each approach, based on the input data and clusters created 
and the time it took to run.

Project Deliverables:

- Final presentation to the Insights Development Team (format can be a Story Map). 

Presentation Should:
- Give a detailed overview of the main aspects of project
- Highlight project findings and how goals were met 
- Suggest ways to move forward and how to implement results
- Submit copies of all codes, workflows and deliverables created during internship

-----------------------------------------------------------







# PART ONE

## Internal Facing

### Data Cleaning with Pandas/Geopandas and Interaction with ArcGIS Insights Scripting Console

The purpose of this segment is to show users how standard data cleaning procedures with pandas can be run through the ArcGIS Insights scripting console. This is especially useful because geopandas extends the datatypes used by pandas to allow spatial operations on geometric types. An analyst's workflow in Python can therefore move to and from Esri tools by reading and writing Esri shapefiles with the geopandas.read_file() function and the geopandas.GeoDataFrame.to_file() method.

https://geopandas.org/en/stable/

https://geopandas.org/en/stable/docs/user_guide/io.html

Let's take a look at reading a shapefile. The cells below time the read from an unzipped folder; geopandas can also read a shapefile inside a zipped folder.

First, let's begin by importing libraries standard for data science workflows.


In [96]:
import numpy as np
import pandas as pd
import geopandas 
import folium
import time
import matplotlib.pyplot as plt
import plotly.express as px
import json

In [22]:
elapsed_non_zip = []
roads_no_zip = r"C:\Users\jreye\Desktop\NS_DATA\TRNS_NSRN_Addressed_Roads_SHP_UT83v3_CGVD28\TRNS_NSRN_Addressed_Roads_UT83v3_CGVD28.shp"
for attempt in range(10):
    start_time = time.time()
    ns_roads_no_zip = geopandas.read_file(roads_no_zip)
    # Measure once so the printed and stored values are identical.
    elapsed = time.time() - start_time
    print(str(attempt) + "--- %s seconds ---" % elapsed)
    elapsed_non_zip.append(elapsed)

0--- 4.5944297313690186 seconds ---
1--- 4.4500205516815186 seconds ---
2--- 4.748734951019287 seconds ---
3--- 4.479432821273804 seconds ---
4--- 4.637072324752808 seconds ---
5--- 4.5001232624053955 seconds ---
6--- 4.697338104248047 seconds ---
7--- 4.580337047576904 seconds ---
8--- 4.68772029876709 seconds ---
9--- 4.504506587982178 seconds ---
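
The ten timings above can be condensed into a mean and a spread, which makes later comparisons between read strategies easier. The following is a minimal sketch using only the standard library; the `timings` list is copied (rounded to four decimals) from the printed output, and `summarize` is a hypothetical helper name that is not part of the original notebook.

```python
import statistics

# Timings in seconds, copied (rounded) from the output of the loop above.
timings = [4.5944, 4.4500, 4.7487, 4.4794, 4.6371,
           4.5001, 4.6973, 4.5803, 4.6877, 4.5045]


def summarize(samples):
    """Return (mean, sample standard deviation, min, max) of the timings."""
    return (statistics.mean(samples), statistics.stdev(samples),
            min(samples), max(samples))


mean_s, sd_s, fastest, slowest = summarize(timings)
print("mean %.3f s, sd %.3f s, range %.3f to %.3f s" % (mean_s, sd_s, fastest, slowest))
```

Reporting the spread alongside the mean helps judge whether a difference between two read strategies is larger than the run-to-run noise.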


In [64]:
start_time = time.time()
#ns_roads
roads_no_zip = r"C:\Users\jreye\Desktop\NS_DATA\TRNS_NSRN_Addressed_Roads_SHP_UT83v3_CGVD28\TRNS_NSRN_Addressed_Roads_UT83v3_CGVD28.shp"
ns_roads = geopandas.read_file(roads_no_zip)
#ns houses
nshouses_no_zip = r"C:\Users\jreye\Desktop\NS_DATA\BASE_NS_CivicAddress_File_SHP_UT83v3_CGVD28\BASE_NS_CivicAddress_File_UT83v3_CGVD28.shp"
ns_houses = geopandas.read_file(nshouses_no_zip)

ns_water_no_zip = r"C:\Users\jreye\Desktop\NS_DATA\BASE_Water_SHP_UT83v3_CGVD28\WA_POLY_10K.shp"
ns_water = geopandas.read_file(ns_water_no_zip)


print("--- %s seconds ---" % (time.time() - start_time))

--- 41.724098682403564 seconds ---


Now we can visualize the layers in Python using our standard matplotlib graphing library (the plotting cell is wrapped in a string here, so it does not run).

In [103]:
"""
fig, ax = plt.subplots(figsize=(20,20))

ns_roads.plot(ax = ax, color = 'red')
ns_houses.plot(ax = ax,marker="+", color='blue')
ns_water.plot(ax = ax, color = "yellow")
"""


In [104]:
m = folium.Map(location=[45.5236, -122.6750])



In [55]:
df = px.data.election()
df

,district,Coderre,Bergeron,Joly,total,winner,result,district_id
0,101-Bois-de-Liesse,2481,1829,3024,7334,Joly,plurality,101
1,102-Cap-Saint-Jacques,2525,1163,2675,6363,Joly,plurality,102
2,11-Sault-au-Récollet,3348,2770,2532,8650,Coderre,plurality,11
3,111-Mile-End,1734,4782,2514,9030,Bergeron,majority,111
4,112-DeLorimier,1770,5933,3044,10747,Bergeron,majority,112
5,113-Jeanne-Mance,1455,3599,2316,7370,Bergeron,plurality,113
6,12-Saint-Sulpice,3252,2521,2543,8316,Coderre,plurality,12
7,121-La Pointe-aux-Prairies,5456,1760,3330,10546,Coderre,majority,121
8,122-Pointe-aux-Trembles,4734,1879,2852,9465,Coderre,majority,122
9,123-Rivière-des-Prairies,5737,958,1656,8351,Coderre,majority,123


In [57]:
import geopandas as gpd

In [58]:
geo_df = gpd.GeoDataFrame.from_features(
    px.data.election_geojson()["features"]
).merge(df, on="district").set_index("district")

In [59]:
geo_df

district,geometry,Coderre,Bergeron,Joly,total,winner,result,district_id
11-Sault-au-Récollet,"MULTIPOLYGON (((-73.63632 45.57592, -73.63628 ...",3348,2770,2532,8650,Coderre,plurality,11
12-Saint-Sulpice,"POLYGON ((-73.62175 45.55448, -73.62350 45.553...",3252,2521,2543,8316,Coderre,plurality,12
13-Ahuntsic,"POLYGON ((-73.65132 45.55457, -73.65687 45.545...",2979,3430,2873,9282,Bergeron,plurality,13
14-Bordeaux-Cartierville,"POLYGON ((-73.70430 45.54419, -73.70421 45.543...",3612,1554,2081,7247,Coderre,plurality,14
21-Ouest,"POLYGON ((-73.55769 45.59322, -73.56942 45.597...",2184,691,1076,3951,Coderre,majority,21
22-Est,"POLYGON ((-73.54528 45.59596, -73.54910 45.597...",1589,708,1172,3469,Coderre,plurality,22
23-Centre,"POLYGON ((-73.55769 45.59322, -73.55646 45.595...",2526,851,1286,4663,Coderre,majority,23
31-Darlington,"POLYGON ((-73.62076 45.51390, -73.62366 45.510...",1873,1182,1232,4287,Coderre,plurality,31
32-Côte-des-Neiges,"POLYGON ((-73.59561 45.50406, -73.59414 45.503...",1644,1950,1578,5172,Bergeron,plurality,32
33-Snowdon,"POLYGON ((-73.64581 45.50163, -73.64804 45.499...",1548,1503,1636,4687,Joly,plurality,33


In [105]:
fig = px.choropleth_mapbox(geo_df,
                           geojson=geo_df.geometry,
                           locations=geo_df.index,
                           color="Joly",
                           center={"lat": 45.5517, "lon": -73.7073},
                           mapbox_style="open-street-map",
                           zoom=8.5)
#fig.show()

In [78]:
fig = px.choropleth_mapbox(ns_water, geojson = ns_water.geometry)


In [79]:
fig.show()

In [91]:
ns_water_1 = ns_water.iloc[0:10]


,FEAT_CODE,FEAT_DESC,ZVALUE,SHAPE_AREA,SHAPE_LEN,SHAPE_FID,geometry
0,WACO40,Coast Water Area polygon,9999.0,4430773000.0,361765.8,2,"POLYGON Z ((779445.045 4905321.985 0.000, 7794..."
1,WACO40,Coast Water Area polygon,9999.0,3934059000.0,1294017.0,3,"POLYGON Z ((730200.833 5153698.919 0.000, 7302..."
2,WACO40,Coast Water Area polygon,9999.0,3088790000.0,794661.4,4,"POLYGON Z ((276195.370 4842456.528 1.500, 2761..."
3,WACO40,Coast Water Area polygon,9999.0,2852269000.0,805534.5,6,"POLYGON Z ((666620.317 5029455.144 0.000, 6766..."
4,WACO40,Coast Water Area polygon,9999.0,2700416000.0,327698.3,7,"POLYGON Z ((315227.529 5033597.751 0.000, 3392..."
5,WACO40,Coast Water Area polygon,9999.0,2638273000.0,532115.0,9,"POLYGON Z ((279410.700 4954768.100 3.600, 2804..."
6,WACO40,Coast Water Area polygon,9999.0,2603431000.0,725087.1,10,"POLYGON Z ((246991.362 4869454.559 0.667, 2469..."
7,WACO40,Coast Water Area polygon,9999.0,2463719000.0,444384.4,14,"POLYGON Z ((727013.342 5237034.457 0.000, 7270..."
8,WACO40,Coast Water Area polygon,9999.0,2387378000.0,357249.9,17,"POLYGON Z ((693330.579 5212074.212 0.000, 6905..."
9,WACO40,Coast Water Area polygon,9999.0,2088896000.0,478483.3,19,"POLYGON Z ((363280.201 4879016.400 0.800, 3632..."


In [94]:
ns_water_1.to_file(r"C:\Users\jreye\Desktop\NS_DATA\BASE_Water_SHP_UT83v3_CGVD28\nswater.geojson",
                  driver = "GeoJSON")

In [99]:
ns_water_1



In [97]:
with open(r"C:\Users\jreye\Desktop\NS_DATA\BASE_Water_SHP_UT83v3_CGVD28\nswater.geojson") as geofile:
    j_file = json.load(geofile)

In [98]:
j_file

{'type': 'FeatureCollection',
 'crs': {'type': 'name',
  'properties': {'name': 'urn:ogc:def:crs,crs:EPSG::2961,crs:EPSG::5713'}},
 'features': [{'type': 'Feature',
   'properties': {'FEAT_CODE': 'WACO40',
    'FEAT_DESC': 'Coast Water Area polygon',
    'ZVALUE': 9999.0,
    'SHAPE_AREA': 4430773094.40507,
    'SHAPE_LEN': 361765.829586929,
    'SHAPE_FID': 2},
   'geometry': {'type': 'Polygon',
    'coordinates': [[[779445.0449999999, 4905321.984999999, 0.0],
      [779456.8959999997, 4905044.289000001, 0.0],
      [779468.7470000004, 4904766.5940000005, 0.0],
      [779480.5959999999, 4904488.898, 0.0],
      [779492.4460000005, 4904211.203, 0.0],
      [779504.2949999999, 4903933.507999999, 0.0],
      [779516.1430000002, 4903655.812999999, 0.0],
      [779527.9910000004, 4903378.117000001, 0.0],
      [779539.8380000005, 4903100.422, 0.0],
      [779551.6849999996, 4902822.727, 0.0],
      [779563.5310000004, 4902545.033, 0.0],
      [779575.3760000002, 4902267.3379999995, 0.0],
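
The `crs` entry above shows that the export kept the source coordinate reference system rather than longitude/latitude, which is also why the coordinates are in the hundreds of thousands. The following small sketch, using only the standard library, pulls the EPSG codes out of that string. `epsg_codes` is a hypothetical helper, and the check against EPSG:4326 reflects that tile-based web maps expect WGS84 longitude/latitude.

```python
import re

# CRS name string as it appears in the exported GeoJSON above.
crs_name = "urn:ogc:def:crs,crs:EPSG::2961,crs:EPSG::5713"


def epsg_codes(name):
    """Extract every EPSG code found in a (possibly compound) CRS URN."""
    return [int(code) for code in re.findall(r"EPSG::(\d+)", name)]


codes = epsg_codes(crs_name)
print(codes)

# WGS84 longitude/latitude is EPSG:4326; anything else needs reprojecting for web maps.
needs_reprojection = 4326 not in codes
print("reproject before web mapping:", needs_reprojection)
```

If the polygons do not appear on a web map, reprojecting the GeoDataFrame with `to_crs(epsg=4326)` before exporting is the usual first thing to try.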
 

In [100]:
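# Give each feature an id that matches the row label it was exported from,
# because plotly pairs `locations` with the feature id.
for i, feature in enumerate(j_file["features"]):
    feature["id"] = str(i)

In [106]:
# Note: Mapbox layers expect longitude/latitude (EPSG:4326). The GeoJSON above is still in the
# source projected CRS, so reproject with to_crs(epsg=4326) before exporting if nothing is drawn.
fig = px.choropleth_mapbox(ns_water_1, geojson=j_file, mapbox_style="open-street-map",
                           locations=ns_water_1.index.astype(str))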
#fig.show()
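
plotly pairs each value in `locations` with the GeoJSON feature whose `id` (the default `featureidkey`) is equal to it, so the two have to use the same labels. The following is a minimal sketch of that bookkeeping with plain dictionaries. The tiny `collection` and the `locations` list are hypothetical stand-ins for `j_file` and `ns_water_1.index`, and comparing the labels as strings is an assumption made here to avoid type mismatches.

```python
# Hypothetical two-feature collection standing in for j_file.
collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"SHAPE_FID": 2}, "geometry": None},
        {"type": "Feature", "properties": {"SHAPE_FID": 3}, "geometry": None},
    ],
}
# Stand-in for the DataFrame index passed as `locations`.
locations = [0, 1]

# Give each feature the id of the row it was exported from.
for row_label, feature in zip(locations, collection["features"]):
    feature["id"] = str(row_label)

# Any location without a matching feature id would be left blank on the map.
feature_ids = {feature["id"] for feature in collection["features"]}
unmatched = [loc for loc in locations if str(loc) not in feature_ids]
print("unmatched locations:", unmatched)
```

Running a check like this before plotting makes an empty map easier to diagnose.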

Visualization is neat; however, it is best if this data is visualized through Insights!

# PART TWO

## ArcGIS Insights interaction with the Operating System

ArcGIS Insights Desktop runs locally, so we can access the operating system through the Python 'os' library and organize our data however we see fit.

The following function creates numbered "test" folders inside a directory we pass to it, giving us a place to store our data.

In [34]:
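import os


def create_test_dir(operating_path):
    """Create the next 'test_<n>' folder in operating_path and return the sorted folder list."""

    def numeric_suffix(name):
        # Sort key: the digits in the name; names without digits sort first instead of raising.
        digits = ''.join(filter(str.isdigit, name))
        return int(digits) if digits else -1

    dirs = sorted(os.listdir(operating_path), key=numeric_suffix)
    try:
        os.mkdir(os.path.join(operating_path, "test_0"))
    except FileExistsError:
        # test_0 already exists, so continue numbering from the highest existing folder.
        count = int(dirs[-1].split('_')[1]) + 1
        new_dir = "test_%s" % count
        os.mkdir(os.path.join(operating_path, new_dir))
        print("Created: " + new_dir)

    return sorted(os.listdir(operating_path), key=numeric_suffix)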

    

Execute the function within our ArcGIS Insights 'data\\test_data' folder, creating 'n' test folders allowing us to organize our data.

In [36]:
n_tests = 10

operating_path = r"C:\Users\jreye\Desktop\NS_DATA\test_data\\"

# Use a separate loop variable so the test count is not overwritten.
for _ in range(n_tests):
    current_dir = create_test_dir(operating_path)


Created: test_1
Created: test_2
Created: test_3
Created: test_4
Created: test_5
Created: test_6
Created: test_7
Created: test_8
Created: test_9
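
The numbering logic can be tried out safely in a throwaway directory before pointing it at real data. The following is a minimal sketch using only the standard library; `next_test_dir` is a hypothetical, simplified variant of the function above and not the notebook's own code.

```python
import os
import tempfile


def next_test_dir(root):
    """Create and return the next 'test_<n>' folder name under root."""
    # Collect the numeric suffixes of folders that already follow the pattern.
    existing = [int(name.split("_")[1]) for name in os.listdir(root)
                if name.startswith("test_") and name.split("_")[1].isdigit()]
    new_name = "test_%d" % (max(existing) + 1 if existing else 0)
    os.mkdir(os.path.join(root, new_name))
    return new_name


# A throwaway directory keeps the experiment away from real data.
with tempfile.TemporaryDirectory() as scratch:
    created = [next_test_dir(scratch) for _ in range(3)]
    listing = sorted(os.listdir(scratch))

print(created)
```

Taking the maximum existing suffix, rather than the last item of a sorted listing, keeps the numbering correct even if unrelated files sit in the same folder.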


# PART THREE: Data Generation into Esri Shapefile 

## This part is a mix of my internal and external goals.

This section demonstrates that any data a user generates in Python can be converted into a geopandas GeoDataFrame and exported as an Esri Shapefile.

Note that ArcGIS Insights accepts shapefiles as .zip files, so we need to interact with the operating system again in order to zip the generated shapefiles.

Generating synthetic data has proven useful for me in the past because it gives a full description of the data being fed into the models under test. By generating data with statistical/mathematical properties that we know and control, we have a way of measuring how the algorithms we devise (e.g. Kriging, Machine Learning) are performing.
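
The idea of checking a method against data with known properties can be illustrated in a few lines. The following is a minimal sketch using only the standard library (the notebook itself uses NumPy); the chosen mean, standard deviation and sample size are arbitrary assumptions for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

true_mean, true_sd = 5.0, 2.0
# Synthetic sample drawn from a distribution whose parameters we chose ourselves.
sample = [random.gauss(true_mean, true_sd) for _ in range(10_000)]

# Because the true parameters are known, the estimation error can be measured directly.
estimated_mean = statistics.fmean(sample)
estimated_sd = statistics.stdev(sample)

print("mean error: %.3f" % abs(estimated_mean - true_mean))
print("sd error:   %.3f" % abs(estimated_sd - true_sd))
```

The same pattern applies to a spatial model: generate a surface from a known function, run the method, and compare its output with the function that produced the data.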





# Test Mathematical Functions

In [37]:
import pandas as pd
import geopandas
import numpy as np

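datum = "EPSG:4326"
# Reference location (longitude, latitude) that the synthetic grid is centred on.
cogs_long = -117.195646
cogs_lat = 34.056397
earth_circumference = 40074 #km
cogs_lat_radians = cogs_lat * (np.pi/180) #conversion of degrees to radians
# Degrees of longitude per km at this latitude, and degrees of latitude per km.
long_km = 360 / (earth_circumference * np.cos(cogs_lat_radians)) 
lat_km = 1/110.574 
n=60         # grid points per axis
shift = 10   # half-width of the grid in km
portion = 1  # fraction of grid points kept when exporting (1 = all)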


#Constant function of '1s' at every spatial location. Homogeneous.
def ones_2d(size=n):
    return np.ones((size,size))

# Random normally distributed values, can set the mean and standard deviation. Can be scaled with the parameter A.
def normal_dist_2d(mean=0,sd=1,size=n,A=1):
    return A*np.random.normal(mean,sd, size =(size,size))

# https://en.wikipedia.org/wiki/Gaussian_function#Two-dimensional_Gaussian_function  
def gaus2d(x=0, y=0, mx=cogs_long, my=cogs_lat, sx=0.009, sy=0.009,shift_x=0.01,shift_y=0.00):
    # Different amplitude parameters, wikipedia uses A=1, however the chosen amplitude is usually selected
      # for gaussian normalization (integral = 1)

    #A=1  
    A = 1. / (2 * np.pi*sx * sy)
    #A1 = 1/ (np.sqrt(2*np.pi * sx**2 *sy**2))

    return A * np.exp(-((x - mx - shift_x)**2. / (2. * sx**2.) + (y - my - shift_y)**2. / (2. * sy**2.)))

# Trig function that I thought was cool... highly nonlinear so don't expect any methods to work..
# Note: it reads the global grids xx and yy, which are defined in the Data Generation cell.
def cool_trig_function(A=1,f=0.481,w=2.05,inner_pow = 4, outer_pow=2,cogs_long=cogs_long, cogs_lat=cogs_lat):

    return A*(np.sin( f*(yy-w*cogs_long)**inner_pow + f*(xx-w*cogs_lat)**inner_pow ) )**outer_pow/ np.sqrt(xx**2 + yy**2)
    #z1 = 1*(np.sin( 0.481*(yy-2.05*cogs_long)**4 + 0.481*(xx-2.05*cogs_lat)**4 ) )**2/np.sqrt(xx**2 + yy**2)
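
The `long_km` and `lat_km` constants in the cell above are degrees per kilometre, which is what lets `shift` be expressed in kilometres. The following is a minimal sketch of the same arithmetic with the standard library only. The variable names are hypothetical, and the values are the ones used in the cell.

```python
import math

earth_circumference_km = 40074  # same value as in the cell above
lat_deg = 34.056397             # same reference latitude as cogs_lat

# One degree of longitude spans circumference * cos(latitude) / 360 km.
km_per_deg_long = earth_circumference_km * math.cos(math.radians(lat_deg)) / 360
deg_long_per_km = 1 / km_per_deg_long   # corresponds to long_km in the cell above
deg_lat_per_km = 1 / 110.574            # corresponds to lat_km in the cell above

shift_km = 10
print("km per degree of longitude: %.2f" % km_per_deg_long)
print("half-width in degrees (long, lat): %.4f, %.4f"
      % (shift_km * deg_long_per_km, shift_km * deg_lat_per_km))
```

A degree of longitude shrinks with the cosine of the latitude, so the grid needs a slightly larger span in degrees of longitude than in degrees of latitude to cover the same distance.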


# Data Generation

In [38]:

x = np.linspace(cogs_lat - shift*lat_km, cogs_lat + shift*lat_km,n, endpoint=True)
y = np.linspace(cogs_long - shift*long_km,cogs_long + shift*long_km,n, endpoint=True)
xx , yy = np.meshgrid(x,y,indexing='ij')

###
### Functions
###

#z = ones_2d(n)

z= normal_dist_2d(mean=0,sd=1,size=n,A=1)


#z = gaus2d(x=yy,y=xx,mx=cogs_long, my=cogs_lat,sx=sx,sy=sy,shift_x=shift_x,shift_y=shift_y)

"""
sx = 0.1
sy = 0.1
shift_x = -0.5
shift_y = -0.5

z1 = gaus2d(x=yy,y=xx,mx=cogs_long, my=cogs_lat,sx=sx,sy=sy,shift_x=shift_x,shift_y=shift_y)

sx = -0.1
sy = -0.1
shift_x = 0.5
shift_y = 0.5

z2 = gaus2d(x=yy,y=xx,mx=cogs_long, my=cogs_lat,sx=sx,sy=sy,shift_x=shift_x,shift_y=shift_y)


z = z1 + z2

"""



# Function to generate an Esri Shapefile

### Important: note that this function contains the call *%insights_return()* (commented out here), which is what returns our generated data back into ArcGIS Insights!

In [39]:

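def to_esri_shapefile(z):
    # Uses the globals x (latitudes), y (longitudes), portion and datum defined above.
    df = pd.DataFrame(z,index = x, columns=y)
    # unstack() yields one row per grid point: level_0 holds the column labels (longitude), level_1 the row labels (latitude).
    newdf = df.unstack().reset_index().rename(columns={'level_0':'longitude','level_1':'lat',0:'target_variable'})
    # Which portion of the data do we want to retain (1 is the entire set)
    newdf = newdf.sample(frac=portion)
    #Create a geopandas geodataframe to begin transition to .shp file. Assign Point Data.
    gdf = geopandas.GeoDataFrame(newdf, geometry=geopandas.points_from_xy(newdf.longitude, newdf.lat))
    #Define a Geodetic Parameter. EPSG 4326 refers to WGS84 geodetic datum.
    gdf.crs = datum
    #Inside the Insights scripting console, this magic returns the data to Insights:
    #%insights_return(gdf)
    #To export as a shapefile (field names longer than 10 characters, such as target_variable, are truncated by the format):
    gdf.to_file(driver = 'ESRI Shapefile', filename= "./test_data/test_1/normal_dist.shp") 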

Let's run this function,

In [40]:
to_esri_shapefile(z)



# Compression into a .zip file in order to open on ArcGIS Insights

In [42]:
import os
import shutil

path_to_data_folder = r"C:\Users\jreye\Desktop\NS_DATA\test_data"

# Only the test_<n> folders are archived; .zip files left by earlier runs are skipped.
folders = [f for f in os.listdir(path_to_data_folder)
           if os.path.isdir(os.path.join(path_to_data_folder, f))]
folders.sort(key=lambda f: int(''.join(filter(str.isdigit, f)) or 0))

for folder in folders:
    path_to_data_location = os.path.join(path_to_data_folder, folder)

    # Skip empty folders so that no empty archives are produced.
    if os.listdir(path_to_data_location):
        # make_archive appends ".zip" to the base name, producing e.g. test_1.zip next to test_1.
        shutil.make_archive(path_to_data_location, "zip", path_to_data_location)
        print("The files have been compressed.")





The files have been compressed.
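
A shapefile is really several sidecar files, so it is worth confirming that the archive contains all the mandatory parts before uploading it. The following is a minimal sketch using only the standard library. It runs in a temporary directory with empty placeholder files, and the file and folder names are assumptions chosen to mirror the ones used above.

```python
import os
import shutil
import tempfile
import zipfile

REQUIRED = {".shp", ".shx", ".dbf"}  # mandatory parts of a shapefile

with tempfile.TemporaryDirectory() as scratch:
    folder = os.path.join(scratch, "test_1")
    os.mkdir(folder)
    # Empty placeholder files standing in for a real exported shapefile.
    for ext in (".shp", ".shx", ".dbf", ".prj"):
        open(os.path.join(folder, "normal_dist" + ext), "w").close()

    # Archive the folder contents into test_1.zip next to the folder.
    archive = shutil.make_archive(folder, "zip", folder)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

# Compare the extensions found in the archive with the mandatory ones.
extensions = {os.path.splitext(name)[1] for name in names}
missing = sorted(REQUIRED - extensions)
print("archive members:", sorted(names))
print("missing parts:", missing)
```

An empty `missing` list means the `.shp`, `.shx` and `.dbf` components are all present in the archive.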


# What is Next: 
## Begin experimenting with clustering algorithms in Python and R!

# DBSCAN data clustering algorithm