# Step 3: Create and visualize an index of park utilization
In the previous step, we calculated time spent at parks for individual users on their trajectories. We will now create a simple index of usage based on the number of visits. A full analysis (outside this demo) might create multiple indices based on unique users, time spent, or diversity of users.

The notebook ends with a map of the parks, colored by this index.

## Set Up

### 0.1 Load packages

In [1]:
from google.colab import drive
import os
from pathlib import Path
import numpy as np
import pandas as pd
from collections import Counter
from statistics import mean, median, pstdev
import folium

In [2]:
import geopandas as gpd
from folium.features import GeoJson, GeoJsonTooltip
import branca.colormap as cm


### 0.2 Mount Your Google Drive
You will be asked for permission to access your Google Drive.

Note: The `h3_index_demo` folder must be downloaded and unzipped to your personal Google Drive in order to run this code in Colab.

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


### 0.3 Change project directory
If the project folder is not in the top level of your Drive (`/content/drive/MyDrive`), update `my_dir` in the cell below to point to the folder that contains it.

In [4]:
try:
  os.chdir('/content/drive/MyDrive/h3_index_demo')
  print('Successfully changed project directory')
except OSError:
  # os.chdir raises an OSError subclass when the folder is missing or inaccessible
  print('Project not in main Drive directory')
  try:
    # Define your containing folder here if not in main Drive directory
    my_dir = '/content/drive/MyDrive/!data_science'
    os.chdir(my_dir + '/h3_index_demo')
    print('Successfully changed project directory')
  except OSError:
    print('Could not change to project directory.\nDid you define your containing folder?')

Project not in main Drive directory
Successfully changed project directory
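
The same lookup can be written as a search over candidate folders. The following is a minimal sketch, not part of the original notebook: `find_project_dir` is a hypothetical helper, and a temporary directory stands in for Google Drive so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_project_dir(candidates, folder="h3_index_demo"):
    """Return the first existing `<candidate>/<folder>` directory, or None."""
    for base in candidates:
        path = Path(base) / folder
        if path.is_dir():
            return path
    return None


# A temporary directory stands in for /content/drive/MyDrive (assumption for the demo)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "!data_science" / "h3_index_demo").mkdir(parents=True)
    found = find_project_dir([tmp, Path(tmp) / "!data_science"])
    # Name of the containing folder in which the project was found
    found_name = found.parent.name if found else None

print(found_name)
```

In Colab, the candidates would be `/content/drive/MyDrive` and your own containing folder, and the returned path could be passed to `os.chdir`.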


### 0.4 Variable definitions

In [5]:
DATA_CLEAN = Path("data/clean")

## 1. Load user-trajectory time at parks

In [6]:
# Read in time by trajectory in park file
in_path = DATA_CLEAN / "time_by_traj_park.csv"
time_by_traj_park = pd.read_csv(in_path, low_memory=False)

In [7]:
time_by_traj_park.head()

Unnamed: 0,user_id,trajectory_id,osm_id,total_time_s,total_time_min,total_time_hr,stay_1min
0,0,000_20081023025304,638620290.0,65.0,1.083333,0.018056,1
1,0,000_20081028003826,24827108.0,7235.0,120.583333,2.009722,1
2,0,000_20081028003826,80716324.0,60.0,1.0,0.016667,1
3,0,000_20081103232153,24827108.0,12245.0,204.083333,3.401389,1
4,0,000_20081111001704,24827108.0,7560.0,126.0,2.1,1


In [8]:
# Number of unique users and parks
n_users = time_by_traj_park["user_id"].nunique()
n_parks = time_by_traj_park["osm_id"].nunique()

print(f"Users with ≥1 min stays: {n_users}")
print(f"Parks with ≥1 min stays: {n_parks}")

Users with ≥1 min stays: 128
Parks with ≥1 min stays: 392


## 2. Calculate total visits by park and analyze the distribution

### 2.1 Total stays by park

In [9]:
# Aggregate stats per park
# Only n_visits feeds the index; total and median minutes are kept for the map tooltip
park_totals = (
    time_by_traj_park.groupby("osm_id", as_index=False)
      .agg(
          n_visits=("total_time_min", "size"),
          total_min=("total_time_min", "sum"),
          median_visit_min=("total_time_min", "median"),
      )
)
park_totals.head()

Unnamed: 0,osm_id,n_visits,total_min,median_visit_min
0,8769598.0,1,2.066667,2.066667
1,9054321.0,4,27.65,2.166667
2,9237440.0,2,334.183333,167.091667
3,9348599.0,1,1.183333,1.183333
4,9509823.0,2,9.983333,4.991667


### 2.2 Distribution of number of visits

In [10]:
# Distribution of number of visits
pct = (1, 5, 10, 25, 50, 75, 90, 95, 99)
vals = park_totals["n_visits"].dropna().to_numpy(dtype=float)
p = np.percentile(vals, pct)

percentiles_df = pd.DataFrame({"percentile": pct, "n_visits": p})
percentiles_df

Unnamed: 0,percentile,n_visits
0,1,1.0
1,5,1.0
2,10,1.0
3,25,1.0
4,50,2.0
5,75,5.0
6,90,12.0
7,95,22.0
8,99,131.52
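
The 99th percentile is fractional (131.52) even though visit counts are integers. This is because `np.percentile` by default interpolates linearly between the two nearest sorted values. The sketch below reproduces that rule in plain Python; `linear_percentile` is a hypothetical helper and `counts` is made-up data, not the park counts.

```python
def linear_percentile(values, q):
    """q-th percentile using linear interpolation between sorted values."""
    xs = sorted(values)
    pos = (q / 100) * (len(xs) - 1)  # fractional position in the sorted data
    lo = int(pos)                    # index of the value at or below pos
    hi = min(lo + 1, len(xs) - 1)    # index of the next value up
    frac = pos - lo                  # how far pos sits between lo and hi
    return xs[lo] + frac * (xs[hi] - xs[lo])


counts = [1, 1, 1, 2, 2, 3, 5, 8, 20, 150]  # hypothetical visit counts

p50 = linear_percentile(counts, 50)  # halfway between 2 and 3
p90 = linear_percentile(counts, 90)  # one tenth of the way from 20 to 150
print(p50, p90)
```

With a long right tail, the top percentiles fall between a few widely spaced values, so the interpolated result can sit far from any observed count.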


## 3. Create index

The visit counts are heavily skewed: about 42% of parks have only one visit, and half have two or fewer. A good fit for data like this is a log1p + winsorized min–max index. The log1p step, log(1 + x), compresses very large counts but is still defined when the count is 0. We then cap values at the 1st and 99th percentiles so that a few parks do not control the results. Last, we rescale everything to 0–100. This keeps the order of parks (apart from ties at the caps), reduces the effect of outliers, and makes the scores easy to compare over time and across places.
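
As a hand check of the transform, the sketch below applies it to a few raw counts using the 1st and 99th percentile values from the table in 2.2 (1.0 and 131.52). The function name `visit_index` and the sample counts are illustrative; the notebook itself computes the index with NumPy on the full column.

```python
import math

# Caps taken from the percentile table in section 2.2
P1, P99 = 1.0, 131.52


def visit_index(n_visits, p1=P1, p99=P99):
    """log1p transform, cap at the percentile bounds, rescale to 0-100."""
    a_min, a_max = math.log1p(p1), math.log1p(p99)
    a = min(max(math.log1p(n_visits), a_min), a_max)  # winsorize
    return round(100 * (a - a_min) / (a_max - a_min))


scores = {n: visit_index(n) for n in (1, 2, 3, 4, 5, 6, 132, 500)}
print(scores)
# {1: 0, 2: 10, 3: 17, 4: 22, 5: 26, 6: 30, 132: 100, 500: 100}
```

Parks with one visit score 0, each additional visit adds less than the one before, and anything above the 99th percentile is capped at 100. The low scores match the first rows of the frequency table later in this section.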

In [11]:
# Compute caps on raw counts
p1  = np.nanpercentile(park_totals["n_visits"], 1)
p99 = np.nanpercentile(park_totals["n_visits"], 99)

# Transform
a = np.log1p(park_totals["n_visits"])
a_min = np.log1p(p1)
a_max = np.log1p(p99)

# Winsorize (cap extremes)
a_cap = a.clip(lower=a_min, upper=a_max)

# Scale to 0–100
den = (a_max - a_min)
if np.isfinite(den) and den > 0:
    park_totals["visit_index"] = 100 * (a_cap - a_min) / den
else:
    # Fallback if distribution is degenerate: use percentile rank
    park_totals["visit_index"] = 100 * park_totals["n_visits"].rank(pct=True, method="average")

# Round
park_totals["visit_index"] = park_totals["visit_index"].round(0).astype(int)

In [12]:
# Frequency table of exact visit_index values
tab_counts = park_totals["visit_index"].value_counts(dropna=False).sort_index()
tab_percent = (tab_counts / tab_counts.sum() * 100).round(2)

tab = (
    pd.DataFrame({"count": tab_counts, "percent": tab_percent})
    .reset_index()
    .rename(columns={"index": "visit_index"})
)

tab

Unnamed: 0,visit_index,count,percent
0,0,163,41.58
1,10,68,17.35
2,17,30,7.65
3,22,29,7.4
4,26,13,3.32
5,30,10,2.55
6,33,14,3.57
7,36,8,2.04
8,38,6,1.53
9,41,6,1.53


## 4. Visualize parks with index values

In [13]:
# Load the simplified parks layer saved in section 2.1 of the Step 1 notebook
# This can take a few minutes
gdf = gpd.read_file("data/etl/deduplicated_parks.gpkg", layer="parks").to_crs(4326).copy()
gdf["osm_id"] = gdf["osm_id"].astype(int)

In [14]:
# Create a copy of relevant data for coloring and tooltip
pt = park_totals[["osm_id","n_visits","total_min","median_visit_min","visit_index"]].copy()
pt["osm_id"] = pt["osm_id"].astype(int)
pt["visit_index"] = pd.to_numeric(pt["visit_index"], errors="coerce").clip(0, 100)

In [15]:
# Keep only parks present in park_totals
gdf = gdf.merge(pt, on="osm_id", how="inner")  # inner join filters to indexed parks
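
An inner join silently drops rows on both sides: parks with no visits, and any `osm_id` in `park_totals` that has no geometry in the parks layer. One way to see what would be dropped is a merge with `indicator=True`. The sketch below uses small made-up frames (`parks`, `totals`) in place of `gdf` and `pt`.

```python
import pandas as pd

# Hypothetical stand-ins for the parks layer and the per-park totals
parks = pd.DataFrame({"osm_id": [101, 102, 103, 104], "name": ["A", "B", "C", "D"]})
totals = pd.DataFrame({"osm_id": [102, 104, 999], "n_visits": [3, 1, 7]})

# An outer merge with indicator=True labels each row by where its key was found
check = parks.merge(totals, on="osm_id", how="outer", indicator=True)

n_matched = int((check["_merge"] == "both").sum())
parks_only = sorted(check.loc[check["_merge"] == "left_only", "osm_id"].tolist())
totals_only = sorted(check.loc[check["_merge"] == "right_only", "osm_id"].tolist())

print(n_matched, parks_only, totals_only)
```

Here `totals_only` lists the ids that have visits but would not appear on the map. In the notebook, a non-empty list would suggest an `osm_id` mismatch between the two files.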

In [16]:
# Center map
minx, miny, maxx, maxy = gdf.total_bounds
m = folium.Map(location=[(miny+maxy)/2, (minx+maxx)/2], zoom_start=11, tiles="cartodbpositron")

In [17]:
# Colormap: light to dark red, fixed to the 0–100 index range
cmap = cm.LinearColormap(
    colors=["#fff5f0", "#fcbba1", "#fc9272", "#fb6a4a", "#de2d26", "#a50f15"],
    vmin=0, vmax=100, caption="Park Visit Index (0–100)"
)

In [18]:
# Style function for Folium GeoJson:
#  Reads each feature's visit_index (0–100).
#  If missing, draw it light gray with partial fill.
def style_fn(feat):
    v = feat["properties"].get("visit_index")
    if v is None or pd.isna(v):
        return {"fill": True, "fillColor": "#f0f0f0", "fillOpacity": 0.6,
                "color": "#555555", "weight": 0.6}
    return {"fill": True, "fillColor": cmap(float(v)), "fillOpacity": 0.8,
            "color": "#333333", "weight": 0.6}


In [19]:
# Tooltip with visit information for park
tooltip = GeoJsonTooltip(
    fields=["osm_id","name","n_visits","visit_index","total_min","median_visit_min"],
    aliases=["OSM ID","Name","Visits","Visit Index","Total Minutes","Median Visit (min)"],
    localize=True, sticky=True,
)

In [20]:
# Add tooltip and color map
GeoJson(gdf.to_json(), name="Parks by Visit Index", style_function=style_fn, tooltip=tooltip).add_to(m)
cmap.add_to(m)
folium.LayerControl(collapsed=False).add_to(m)

<folium.map.LayerControl at 0x7d9073ede000>

In [21]:
# Show final map
m