<a href="https://colab.research.google.com/github/MODA-NYC/nyc-geography-crosswalks/blob/main/NYC_Geographies_Crosswalk_Selector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NYC Geographies: Interactive Crosswalk Selectors

This notebook provides interactive tools for generating two types of custom geographic crosswalk tables for New York City. It uses a comprehensive `all_boundaries.geojson` file containing multiple NYC geographic layers, generated locally.

**Important:** This notebook relies on the `all_boundaries.geojson` file being generated first by the `generate_all_bounds.py` script located within this repository (`nyc-geography-crosswalks`). Ensure you have run that script and have access to its output file.

**About the Data File:**
The `all_boundaries.geojson` file aggregates the same core NYC geographic boundaries previously used by the BetaNYC Boundaries Map project. However, the `generate_all_bounds.py` script aims to collect the **latest available versions** directly from their official sources at the time the script is run.

*   **Note on Versions:** Currently, the specific version links (e.g., URLs containing `_25a` for data from NYC Planning's 2025 Cycle A update) are **hardcoded** within the generator script (`generate_all_bounds.py`).
*   **Future Enhancement:** A potential future improvement could involve modifying the script to automatically check for and download the absolute latest versions available from the source portals, rather than relying on hardcoded versioned links.

The interactive selectors allow you to generate:
- **Wide-format Crosswalk**: Generates simplified tables for quick analysis of overlapping features. Each row represents one primary geography feature, and columns list overlapping features from selected target geographies (semicolon-separated).
- **Long-form Crosswalk**: Generates detailed intersection tables including precise calculations of overlap area and percentage for each intersecting pair between the primary and selected target geographies.
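
To make the two output shapes concrete, here is a minimal sketch using invented feature names and percentages (the rows are illustrative assumptions, not real crosswalk results). It builds a small long-form table with the column names used later in this notebook and collapses it into the wide shape. The notebook itself builds each table directly from the geometries rather than deriving one from the other.

```python
import pandas as pd

# Hypothetical long-form rows: one row per intersecting pair (values are invented)
long_df = pd.DataFrame(
    {
        "Primary Geography NameCol": ["101", "101", "101", "102"],
        "Other Geography ID": ["pp", "pp", "sd", "pp"],
        "Other Geography NameCol": ["1", "5", "2", "5"],
        "Percentage Overlap": [70.0, 30.0, 100.0, 100.0],
    }
)

# Collapse to the wide shape: one row per primary feature, one column per target layer,
# with overlapping names joined by semicolons
wide_df = (
    long_df.groupby(["Primary Geography NameCol", "Other Geography ID"])["Other Geography NameCol"]
    .agg(";".join)
    .unstack(fill_value="")
    .reset_index()
)
wide_df.columns.name = None
print(wide_df)
```

The wide table is quicker to scan, while the long table keeps the per-pair overlap figures that the wide table drops.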

### General Workflow:
1.  **Load Data:** The notebook first loads the pre-generated `all_boundaries.geojson` file (typically from Google Drive when using Colab - see Cell 3).
2.  **Interactive UI**: Users choose a primary geography and one or more target geographies using widgets for both Wide and Long formats.
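3.  **Spatial Analysis**: Shrinks each primary feature with a negative buffer (200 ft in the cells below) before intersecting, then drops intersections below a minimum area, so that features which merely share a boundary are not reported as overlaps.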
4.  **Custom CSV Output**: Generates a downloadable CSV file containing the results of the selected crosswalk.
5.  **Progress Indicators**: Real-time progress bars display processing status.
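
The negative-buffering step in the workflow above can be sketched with plain `shapely` geometries. The squares below are hypothetical shapes chosen for easy arithmetic; only the `-200` buffer value comes from this notebook.

```python
from shapely.geometry import box

BUFFER_FEET = -200  # same value the notebook uses; units follow the CRS (feet)

primary = box(0, 0, 1000, 1000)        # hypothetical 1000 ft square
neighbor = box(1000, 0, 2000, 1000)    # shares only the x = 1000 edge with primary
overlapping = box(500, 0, 1500, 1000)  # covers the right half of primary

# Shrinking pulls every edge inward by 200 ft
shrunk = primary.buffer(BUFFER_FEET)

touches_raw = primary.intersects(neighbor)     # shared border counts as intersecting
touches_shrunk = shrunk.intersects(neighbor)   # border contact no longer registers
overlap_area = shrunk.intersection(overlapping).area

# A feature narrower than 2 * |BUFFER_FEET| disappears entirely under the buffer
tiny = box(0, 0, 300, 300).buffer(BUFFER_FEET)

print(touches_raw, touches_shrunk, round(overlap_area), tiny.is_empty)
```

One consequence worth keeping in mind: a primary feature narrower than 400 ft buffers down to an empty geometry, so it cannot match any target under this approach.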

### Data Sources:
The **input** for *this notebook* is the `all_boundaries.geojson` file generated by the `generate_all_bounds.py` script. The **original sources** used by that script are:

*   **cd (Community Districts):** NYC Department of City Planning (DCP)
*   **pp (Police Precincts):** NYC Department of City Planning (DCP)
*   **dsny (Sanitation Districts):** NYC Open Data (Dataset ID: i6mn-amj2)
*   **fb (Fire Battalions):** NYC Department of City Planning (DCP)
*   **sd (School Districts):** NYC Department of City Planning (DCP)
*   **hc (Health Center Districts):** NYC Department of City Planning (DCP)
*   **cc (City Council Districts):** NYC Department of City Planning (DCP)
*   **nycongress (Congressional Districts):** NYC Department of City Planning (DCP)
*   **sa (State Assembly Districts):** NYC Department of City Planning (DCP)
*   **ss (State Senate Districts):** NYC Department of City Planning (DCP)
*   **bid (Business Improvement Districts):** NYC Open Data (Dataset ID: 7jdm-inj8 / derived from ejxk-d93y)
*   **nta (Neighborhood Tabulation Areas):** NYC Department of City Planning (DCP - NTA 2020)
*   **zipcode (Modified Zip Code Tabulation Areas):** NYC Open Data (Dataset ID: pri4-ifjk)
*   **hd (Historic Districts):** NYC Open Data (Dataset ID: skyk-mpzq / derived from xbvj-gfnw)
*   **ibz (Industrial Business Zones):** NYC Economic Development Corporation (EDC)

*Context for many planning datasets can be found at:*
*   [NYC Planning - Bytes of the Big Apple](https://www.nyc.gov/site/planning/data-maps/open-data/bytes-big-apple.page)
*   [NYC Open Data Portal](https://data.cityofnewyork.us/)

### Requirements:
- **Prerequisite:** Successful execution of `generate_all_bounds.py` and access to its output `all_boundaries.geojson`.
- **Python Libraries:** `geopandas`, `pandas`, `ipywidgets`, and `tqdm`; `requests` only if you load the data file from a URL. `google.colab` (Drive mounting and file downloads) comes with the Colab runtime, and `os` is part of the Python standard library.
- **Environment:** Google Colab is recommended for the interactive features and Google Drive integration. Standard Jupyter environments can also be used if data loading is adapted.
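
For readers adapting the notebook to a standard Jupyter environment, one possible way to branch on the runtime is sketched below. The helper names `in_colab` and `default_geojson_path` and both file paths are hypothetical placeholders, not part of this repository.

```python
import importlib.util
import os


def in_colab() -> bool:
    """Return True when the google.colab package can be located."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent `google` package itself is missing
        return False


def default_geojson_path() -> str:
    """Pick a data path for the current environment (both paths are placeholders)."""
    if in_colab():
        return "/content/drive/MyDrive/all_boundaries.geojson"  # hypothetical Drive location
    return os.path.join(os.getcwd(), "all_boundaries.geojson")  # hypothetical local copy


print(in_colab(), default_geojson_path())
```

Outside Colab, the `drive.mount` and `files.download` calls in the cells below would also need to be skipped or replaced.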

## 1: Install Dependencies

In [None]:
# Install required libraries if running in a new environment
print("Installing dependencies...")
!pip install geopandas pandas ipywidgets requests tqdm --quiet
print("Dependencies installed.")

## 2: Import Libraries

In [None]:
# Import necessary libraries for the entire notebook
print("Importing libraries...")
import geopandas as gpd
import pandas as pd
import requests # Needed only if using URL method for data loading
from io import BytesIO # Needed only if using URL method for data loading
from google.colab import drive # For loading from Google Drive
from google.colab import files # For downloading results
import ipywidgets as widgets
from IPython.display import display, clear_output
from tqdm.notebook import tqdm # Progress bar
import os
from typing import Union # For type hints
print("Libraries imported.")

## 3: Load and Prepare Data

In [None]:
# --- Load and Prepare the Master GeoDataFrame ---

# Choose ONE of the methods below to load all_boundaries.geojson:

# --- Method 1: Load from Google Drive (Currently Active for Testing) ---
print("Attempting to mount Google Drive...")
drive.mount('/content/drive', force_remount=True)

# !!!! IMPORTANT: Replace this path with the ACTUAL path to your file on Google Drive !!!!
geojson_path_on_drive = '/content/drive/MyDrive/Projects/ODA/Crosswalk Experiment/all_boundaries.geojson' # <--- CHANGE THIS

gdf = None # Initialize gdf
if not os.path.exists(geojson_path_on_drive):
  print(f"ERROR: File not found at specified Google Drive path: {geojson_path_on_drive}")
  print("Please double-check the path and ensure the file exists.")
else:
  print(f"Found file at: {geojson_path_on_drive}")
  try:
      print("Reading GeoJSON from Google Drive...")
      gdf_loaded = gpd.read_file(geojson_path_on_drive)
      print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

      # Reproject to EPSG:2263 for buffer calculations in this notebook (feet)
      print("Reprojecting to EPSG:2263 (Feet)...")
      gdf = gdf_loaded.to_crs(epsg=2263)
      print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")

      # Build the spatial index up front; GeoPandas creates it lazily on first access to `.sindex`
      print("Generating spatial index for gdf...")
      _ = gdf.sindex

      print("\nGeoDataFrame Info:")
      gdf.info()  # info() prints its summary directly and returns None

  except Exception as e:
      print(f"ERROR: Failed to read or reproject GeoJSON from Google Drive. Error: {e}")
      # raise e # Uncomment to stop execution on error

# --- End Method 1 ---


# --- Method 2: Load from URL (Currently Commented Out) ---
# # !!!! IMPORTANT: Replace this URL with the ACTUAL download URL for your GeoJSON file !!!!
# geojson_url = "YOUR_GEOJSON_DOWNLOAD_URL_HERE" # <--- CHANGE THIS

# gdf = None # Initialize gdf
# try:
#     print(f"Attempting to download GeoJSON from URL: {geojson_url}")
#     response = requests.get(geojson_url, timeout=60)
#     response.raise_for_status()
#     print("Download successful. Reading GeoJSON...")

#     gdf_loaded = gpd.read_file(BytesIO(response.content))
#     print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

#     print("Reprojecting to EPSG:2263 (Feet)...")
#     gdf = gdf_loaded.to_crs(epsg=2263)
#     print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")

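#     # Build the spatial index up front; GeoPandas creates it lazily on first access
#     print("Generating spatial index for gdf...")
#     _ = gdf.sindex

#     print("\nGeoDataFrame Info:")
#     gdf.info()  # info() prints its summary directly and returns None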

# except requests.exceptions.RequestException as e:
#     print(f"ERROR: Failed to download GeoJSON from URL. Error: {e}")
# except Exception as e:
#     print(f"ERROR: Failed to read or reproject GeoJSON from downloaded data. Error: {e}")
#     # raise e # Uncomment to stop execution on error

# --- End Method 2 ---


# --- Final Verification ---
if 'gdf' not in locals() or gdf is None or not isinstance(gdf, gpd.GeoDataFrame) or gdf.empty:
    raise ValueError("ERROR: GeoDataFrame 'gdf' was not loaded successfully. Please check the path/URL and logs above.")
else:
    print("\nGeoDataFrame 'gdf' is loaded, prepared, and ready for use in subsequent cells.")
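
The selector cells that follow assume every layer is stacked in one table, with the layer code in an `id` column and the feature label in a `nameCol` column. The toy table below is a hypothetical stand-in for `all_boundaries.geojson` (invented shapes and names) that sketches the two-stage lookup those cells rely on: a bounding-box query through the spatial index, then an exact geometric test.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical stand-in for all_boundaries.geojson: every layer stacked in one table
toy = gpd.GeoDataFrame(
    {"id": ["cd", "pp", "pp"], "nameCol": ["101", "1", "5"]},
    geometry=[box(0, 0, 1000, 1000), box(500, 0, 1500, 1000), box(5000, 5000, 6000, 6000)],
    crs="EPSG:2263",
)

primary_geom = toy[toy["id"] == "cd"].geometry.iloc[0]

# Stage 1: cheap bounding-box lookup through the spatial index
candidate_idx = sorted(toy.sindex.intersection(primary_geom.bounds))
candidates = toy.iloc[candidate_idx]

# Stage 2: exact geometric test, then keep only the target layer
hits = candidates[candidates.intersects(primary_geom)]
pp_names = hits.loc[hits["id"] == "pp", "nameCol"].tolist()
print(pp_names)
```

Note that the index covers all layers at once, so the primary feature itself shows up among the candidates; filtering on `id` is what isolates each target layer.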

## 4: Wide Crosswalk Selector

In [None]:
# --- Wide-Format Crosswalk ---

# Check if gdf exists from previous cell
if 'gdf' not in locals() or gdf is None:
    raise NameError("ERROR: GeoDataFrame 'gdf' is not available. Please run the 'Load and Prepare Data' cell first.")

# Geography choices: layer codes stored in the `id` column of gdf
geo_choices = ['pp', 'fb', 'sd', 'bid', 'ibz', 'cd', 'dsny', 'hc',
               'cc', 'nycongress', 'sa', 'ss', 'nta', 'zipcode', 'hd']

# Interactive widgets
primary_geo_widget = widgets.Dropdown(options=geo_choices, description='Primary:')
target_geo_widget = widgets.SelectMultiple(options=geo_choices, description='Targets:', rows=8)
run_button = widgets.Button(description="Generate Wide Crosswalk")
output_wide = widgets.Output() # Use distinct output name

# Function to generate wide crosswalk
def generate_crosswalk(b):
    with output_wide: # Use distinct output name
        clear_output()
        primary_geo = primary_geo_widget.value
        target_geos = list(target_geo_widget.value)

        if not target_geos:
            print("Please select at least one target geography.")
            return
        if primary_geo in target_geos:
            print("Primary geography should not be in the selected target geographies.")
            return

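        BUFFER_FEET = -200           # shrink each primary feature by 200 ft before testing overlaps
        MIN_INTERSECTION_AREA = 400  # sq ft; smaller intersections are discarded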

        # Use the globally loaded gdf
        primary_gdf = gdf[gdf['id'] == primary_geo].copy()
        if primary_gdf.empty:
            print(f"No data found for primary geography '{primary_geo}'.")
            return

        # Use the globally loaded gdf's spatial index
        all_sindex = gdf.sindex
        if all_sindex is None:
            print("ERROR: Spatial index not found on gdf. Please rerun data loading cell.")
            return

        crosswalk_records = []

        print("Generating wide crosswalk...")
        for _, primary_row in tqdm(primary_gdf.iterrows(), total=primary_gdf.shape[0], desc=f"Wide: {primary_geo}"):
            primary_name = primary_row['nameCol']
            primary_geom_buffered = primary_row.geometry.buffer(BUFFER_FEET)

            candidate_idx = list(all_sindex.intersection(primary_geom_buffered.bounds))
            # Use global gdf here
            candidate_features = gdf.iloc[candidate_idx]

            mask = candidate_features.intersects(primary_geom_buffered)
            candidates = candidate_features[mask].copy()

            if not candidates.empty:
                candidates["intersection_area"] = candidates.geometry.intersection(primary_geom_buffered).area
                final_candidates = candidates[candidates["intersection_area"] > MIN_INTERSECTION_AREA]
            else:
                final_candidates = candidates

            record = {f'{primary_geo}': primary_name}

            for geo in target_geos:
                # Check if 'id' column exists in final_candidates
                if 'id' not in final_candidates.columns:
                    print(f"Warning: 'id' column missing in candidates for {primary_geo}. Skipping target {geo}.")
                    record[geo] = ""
                    continue

                subset = final_candidates[final_candidates['id'] == geo]
                # Check if 'nameCol' exists before trying to access it
                if not subset.empty and 'nameCol' in subset.columns:
                    record[geo] = ";".join(subset['nameCol'].dropna().astype(str).unique())
                else:
                    record[geo] = ""

            crosswalk_records.append(record)

        if not crosswalk_records:
            print("No crosswalk records generated.")
            return

        crosswalk_df = pd.DataFrame(crosswalk_records)
        print("\nWide Crosswalk Preview:")
        display(crosswalk_df.head())

        filename = f'wide_crosswalk_{primary_geo}_to_{"_".join(target_geos)}.csv'
        crosswalk_df.to_csv(filename, index=False)
        print(f"\nWide crosswalk generation complete. Attempting download: {filename}")
        files.download(filename)

# Display widgets and attach handler
display(primary_geo_widget, target_geo_widget, run_button, output_wide)
run_button.on_click(generate_crosswalk)

## 5: Long-Form Crosswalk Selector

In [None]:
# --- Long-form Crosswalk Selector ---

# Check if gdf exists from previous cell
if 'gdf' not in locals() or gdf is None:
    raise NameError("ERROR: GeoDataFrame 'gdf' is not available. Please run the 'Load and Prepare Data' cell first.")

# Geography choices: layer codes stored in the `id` column of gdf
geo_choices_long = ['pp', 'fb', 'sd', 'bid', 'ibz', 'cd', 'dsny', 'hc',
                    'cc', 'nycongress', 'sa', 'ss', 'nta', 'zipcode', 'hd']

# Interactive widgets (using distinct names)
primary_geo_widget_long = widgets.Dropdown(options=geo_choices_long, description='Primary (Long):')
target_geo_widget_long = widgets.SelectMultiple(options=geo_choices_long, description='Targets (Long):', rows=8)
run_button_long = widgets.Button(description="Generate Long-form Crosswalk")
output_long = widgets.Output() # Use distinct output name

# Function to generate long-form crosswalk
def generate_longform_crosswalk(b):
    with output_long: # Use distinct output name
        clear_output()
        primary_geo = primary_geo_widget_long.value
        target_geos = list(target_geo_widget_long.value)

        if not target_geos:
            print("Please select at least one target geography.")
            return
        if primary_geo in target_geos:
            print("Primary geography should not be in the selected target geographies.")
            return

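        BUFFER_FEET = -200                 # shrink each primary feature by 200 ft before testing overlaps
        MIN_INTERSECTION_AREA_FILTER = 40  # sq ft; pre-filter applied to the buffered intersection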

        # Use the globally loaded gdf
        primary_gdf = gdf[gdf['id'] == primary_geo].copy()
        if primary_gdf.empty:
            print(f"No data found for primary geography '{primary_geo}'.")
            return

        # Use the globally loaded gdf's spatial index
        spatial_index = gdf.sindex
        if spatial_index is None:
            print("ERROR: Spatial index not found on gdf. Please rerun data loading cell.")
            return

        rows = []

        print(f"Generating Long-form Crosswalk for Primary: {primary_geo}, Targets: {', '.join(target_geos)}")
        print("Processing intersections...")
        for _, primary_row in tqdm(primary_gdf.iterrows(), total=primary_gdf.shape[0], desc=f"Long: {primary_geo}"):
            primary_name = primary_row['nameCol']
            primary_geom_original = primary_row.geometry
            primary_area = primary_geom_original.area

            # Skip degenerate features; a zero area would break the percentage calculation
            if primary_area == 0:
                continue

            primary_geom_buffered = primary_geom_original.buffer(BUFFER_FEET)

            candidate_idx = list(spatial_index.intersection(primary_geom_buffered.bounds))
            # Use global gdf here: bounding-box candidates first, then the exact intersects test
            nearby_features = gdf.iloc[candidate_idx]
            candidate_features = nearby_features[nearby_features.intersects(primary_geom_buffered)].copy()

            for other_id in target_geos:
                if other_id == primary_geo: continue

                # Use global gdf here
                subset = candidate_features[candidate_features['id'] == other_id].copy()

                if not subset.empty:
                    subset['intersect_area_buffered'] = subset.geometry.intersection(primary_geom_buffered).area
                    subset = subset[subset['intersect_area_buffered'] > MIN_INTERSECTION_AREA_FILTER]

                # Check if 'nameCol' exists before looping through its unique values
                if 'nameCol' not in subset.columns:
                    print(f"Warning: 'nameCol' missing in subset for target {other_id}. Skipping.")
                    continue

                for name_val in subset['nameCol'].dropna().astype(str).unique():
                    feats_same_name = subset[subset['nameCol'] == name_val]

                    if not feats_same_name.empty:
                        try:
                            union_geom = feats_same_name.geometry.union_all()
                        except AttributeError:
                            union_geom = feats_same_name.geometry.unary_union

                        inter_geom = primary_geom_original.intersection(union_geom)
                        inter_area_final = inter_geom.area if not inter_geom.is_empty else 0
                        perc_overlap = (inter_area_final / primary_area) * 100
                    else:
                        inter_area_final = 0
                        perc_overlap = 0

                    if inter_area_final > 1e-6:
                        row = {
                            "Primary Geography ID": primary_geo,
                            "Primary Geography NameCol": primary_name,
                            "Other Geography ID": other_id,
                            "Other Geography NameCol": name_val,
                            "Primary Area (sq ft)": primary_area,
                            "Intersection Area (sq ft)": inter_area_final,
                            "Percentage Overlap": perc_overlap
                        }
                        rows.append(row)

        if not rows:
            print("No significant overlaps found for the selected criteria.")
            return

        overlap_df = pd.DataFrame(rows)
        overlap_df = overlap_df.sort_values(by=["Primary Geography NameCol", "Other Geography ID", "Percentage Overlap"], ascending=[True, True, False])

        print("\nLong-form Crosswalk Preview:")
        display(overlap_df.head())

        filename = f'longform_crosswalk_{primary_geo}_to_{"_".join(target_geos)}.csv'
        overlap_df.to_csv(filename, index=False)
        print(f"\nLong-form crosswalk generation complete. Attempting download: {filename}")
        files.download(filename)

# Display widgets and attach handler
display(primary_geo_widget_long, target_geo_widget_long, run_button_long, output_long)
run_button_long.on_click(generate_longform_crosswalk)
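
The percentage calculation in the long-form cell can be illustrated in isolation. The sketch below uses hypothetical rectangles, and `shapely.ops.unary_union` stands in for the `union_all()` / `unary_union` call on the GeoSeries; it shows why target pieces sharing a name are merged before measuring their overlap with the unbuffered primary geometry.

```python
from shapely.geometry import box
from shapely.ops import unary_union

primary = box(0, 0, 1000, 1000)  # hypothetical primary feature, 1,000,000 sq ft

# Two hypothetical target polygons that share one name and overlap each other
same_name_parts = [box(0, 0, 500, 600), box(0, 400, 500, 1000)]

# Merging first avoids double-counting the strip where the two parts overlap
merged = unary_union(same_name_parts)
intersection_area = primary.intersection(merged).area
percentage_overlap = intersection_area / primary.area * 100

# Summing the parts separately would overstate the overlap
naive_percentage = sum(primary.intersection(p).area for p in same_name_parts) / primary.area * 100

print(round(percentage_overlap, 1), round(naive_percentage, 1))
```

Because the percentage is taken against the original primary area, the values for one primary feature sum to about 100 only when the target layer tiles it without gaps or overlaps.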