# Note: This notebook now delegates to the reusable CLI

This notebook has been wired to call the crosswalk builder module (`scripts/build_crosswalks.py`) for reproducibility.

Steps:
1. First run `generate_all_bounds.py` (or `python scripts/make_run.py`) to create a new `outputs/<run-id>/all_boundaries.geojson`.
2. Run the code cell below to write the long-form and wide crosswalk CSVs into the same run folder.

You can still explore results in subsequent cells if desired.


In [None]:
import glob
import os
import subprocess
import sys

# Find the latest run directory.
# Note: sorted() orders names lexicographically, so this assumes run IDs
# sort chronologically (for example, timestamp-based names).
runs = sorted(d for d in glob.glob('outputs/*') if os.path.isdir(d))
if not runs:
    raise RuntimeError("No outputs/<run-id>/ found. Run generate_all_bounds.py or scripts/make_run.py first.")
latest = runs[-1]

boundaries = os.path.join(latest, 'all_boundaries.geojson')
if not os.path.isfile(boundaries):
    raise RuntimeError(f"Missing {boundaries}. Run bounds step first.")

print(f"Using run folder: {latest}")
cmd = [
    # sys.executable runs the script with the same interpreter as this kernel
    sys.executable, 'scripts/build_crosswalks.py',
    '--boundaries', boundaries,
    '--run-dir', latest,
]
print('Running:', ' '.join(cmd))
subprocess.run(cmd, check=True)
print('Crosswalks built. See longform/ and wide/ under', latest)


<a href="https://colab.research.google.com/github/MODA-NYC/nyc-geography-crosswalks/blob/main/NYC_Geographies_Generate_All_Long_Crosswalks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NYC Geographies: Generate All Long-Form Crosswalks

This notebook automates the generation of comprehensive **long-form** geographic crosswalk tables for New York City. It processes all pairwise combinations of specified geographic boundaries using the `all_boundaries.geojson` file generated by the `generate_all_bounds.py` script.

**Important:** This notebook relies on the `all_boundaries.geojson` file being generated first by the `generate_all_bounds.py` script located within this repository (`nyc-geography-crosswalks`). Ensure you have run that script and have access to its output file before running this notebook.

**About the Input Data File:**
The `all_boundaries.geojson` file used as input aggregates the same core NYC geographic boundaries previously used by the BetaNYC Boundaries Map project. The `generate_all_bounds.py` script aims to collect the **latest available versions** directly from their official sources at the time the script is run.

*   **Note on Versions:** Currently, the specific version links (e.g., URLs containing `_25a` for data from NYC Planning's 2025 Cycle A update) are **hardcoded** within the generator script (`generate_all_bounds.py`).
*   **Future Enhancement:** A potential future improvement could involve modifying the generator script to automatically check for and download the absolute latest versions available from the source portals.
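
As a small step toward that enhancement, the version tag could be read out of each hardcoded link so the script can at least report which cycle it is using. The sketch below is only an illustration: the `parse_cycle` helper and the example URL are hypothetical, and it assumes tags follow the `_25a` pattern described above (two-digit year followed by a cycle letter).

```python
import re
from typing import Optional, Tuple

# Matches tags such as "_25a": underscore, two-digit year, one cycle letter.
# The negative lookahead stops it from matching inside longer tokens.
CYCLE_TAG = re.compile(r"_(\d{2})([a-z])(?![a-z0-9])", re.IGNORECASE)

def parse_cycle(url: str) -> Optional[Tuple[int, str]]:
    """Return (year, cycle letter) found in a link, or None if there is no tag."""
    match = CYCLE_TAG.search(url)
    if match is None:
        return None
    return 2000 + int(match.group(1)), match.group(2).upper()

# Hypothetical link, used only to exercise the helper
example = "https://example.com/downloads/nycd_25a.zip"
print(parse_cycle(example))
```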

**Output:**
The notebook produces multiple **long-form CSV files**, one for each primary geography type (e.g., `longform_cd_crosswalk.csv`, `longform_pp_crosswalk.csv`), saved into a dedicated output folder (`longform_crosswalks_consolidated` in Cell 4). Each file contains one row for every significant intersection between that primary geography's features and features from *all other* geography types included in the input file. Each row includes:
- Primary and Other Geography IDs and Names/Codes
- Primary Feature Area (in sq ft)
- Intersection Area (in sq ft, calculated using original, unbuffered geometries)
- Percentage Overlap relative to the Primary Feature Area
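
To make the relationship between these columns concrete, the sketch below parses two made-up rows and recomputes the percentage as intersection area divided by primary area, times 100. The column names match those written by Cell 4; the feature names and area values are invented for illustration and are not real crosswalk output.

```python
import csv
import io

# Made-up sample rows using the column names written by Cell 4.
# Names and areas are illustrative only, not real crosswalk output.
SAMPLE_CSV = """Primary Geography ID,Primary Geography NameCol,Other Geography ID,Other Geography NameCol,Primary Area (sq ft),Intersection Area (sq ft),Percentage Overlap
cd,101,pp,1,1000000.0,750000.0,75.0
cd,101,pp,5,1000000.0,250000.0,25.0
"""

def recompute_overlap(row):
    """Percentage Overlap = intersection area / primary area * 100."""
    primary = float(row["Primary Area (sq ft)"])
    inter = float(row["Intersection Area (sq ft)"])
    return inter / primary * 100 if primary > 0 else 0.0

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
recomputed = [recompute_overlap(r) for r in rows]
for r, pct in zip(rows, recomputed):
    print(r["Other Geography ID"], r["Other Geography NameCol"], f"{pct:.1f}%")
```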

Finally, all generated CSV files are packaged into a single downloadable **ZIP archive** (e.g., `all_geographies_longform_crosswalks.zip`).
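
If you want to inspect the archive without unpacking it, something like the following should work with only the standard library. This is a sketch, not part of the notebook: `read_crosswalk_zip` is a hypothetical helper, and the demo builds a tiny in-memory archive with one invented row so the example is self-contained.

```python
import csv
import io
import zipfile

def read_crosswalk_zip(zip_source):
    """Return {member name: list of row dicts} for every CSV in the archive."""
    tables = {}
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with zf.open(name) as fh:
                # Wrap the binary member so csv can read it as text
                text = io.TextIOWrapper(fh, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables

# Demo: an in-memory archive holding one made-up row
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(
        "longform_cd_crosswalk.csv",
        "Primary Geography ID,Other Geography ID,Percentage Overlap\ncd,pp,75.0\n",
    )
demo.seek(0)

tables = read_crosswalk_zip(demo)
print({name: len(rows) for name, rows in tables.items()})
```

For the real output, pass the path to `all_geographies_longform_crosswalks.zip` instead of the in-memory buffer.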

### Workflow:
1.  **Load Data:** Loads the pre-generated `all_boundaries.geojson` file (typically from Google Drive when using Colab - see Cell 3).
2.  **Iterate:** Loops through each defined geography type as the "primary" geography.
3.  **Calculate Intersections:** For each primary feature, calculates geometric intersections with all features from the *other* geography types. A negative buffer (`BUFFER_FEET = -200`, i.e. 200 ft inward) and a minimum buffered intersection area (`MIN_INTERSECTION_AREA_FILTER = 40` sq ft) screen out neighbors that merely touch or barely cross the primary feature's boundary.
4.  **Generate CSVs:** Saves the consolidated long-form results for each primary geography into a separate CSV file within a dedicated output folder (`longform_crosswalks_consolidated`).
5.  **Package Output:** Zips all generated CSV files from the current run into a single archive for download.
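
The negative-buffer filter in step 3 can be hard to picture, so here is a deliberately simplified sketch. It uses axis-aligned rectangles instead of real polygons (the notebook itself works on GeoPandas geometries), and the `Rect` type and helper names are hypothetical. The idea it illustrates is the same: shrink the primary shape inward, use the shrunken shape only to decide which neighbors count, then measure the overlap against the original shape.

```python
from typing import NamedTuple, Optional

class Rect(NamedTuple):
    """Axis-aligned rectangle in feet; a simplified stand-in for a polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)

def shrink(r: Rect, dist: float) -> Optional[Rect]:
    """Inward buffer by dist feet; None if the rectangle collapses."""
    s = Rect(r.xmin + dist, r.ymin + dist, r.xmax - dist, r.ymax - dist)
    return s if s.xmin < s.xmax and s.ymin < s.ymax else None

def overlap_area(a: Rect, b: Rect) -> float:
    w = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    h = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return w * h if w > 0 and h > 0 else 0.0

BUFFER_FEET = 200  # inward distance; the notebook passes -200 to buffer()
MIN_AREA = 40      # sq ft, mirrors MIN_INTERSECTION_AREA_FILTER

def significant_overlaps(primary, others):
    """Percent of primary covered by each neighbor that passes the filter."""
    shrunk = shrink(primary, BUFFER_FEET)
    if shrunk is None:
        return {}
    result = {}
    for name, other in others.items():
        # Filter on the shrunken shape, measure on the original shape
        if overlap_area(shrunk, other) > MIN_AREA:
            result[name] = overlap_area(primary, other) / primary.area * 100
    return result

primary = Rect(0, 0, 1000, 1000)
others = {
    "inside": Rect(0, 0, 500, 1000),         # covers the western half
    "sliver": Rect(990, 0, 2000, 1000),      # 10 ft boundary-mismatch sliver
    "neighbor": Rect(1000, 0, 2000, 1000),   # shares an edge only
}
overlaps = significant_overlaps(primary, others)
print(overlaps)
```

The sliver and the edge-sharing neighbor are dropped because neither reaches the shrunken rectangle, while the genuine overlap is reported against the full, unbuffered area.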

### Data Sources:
The **input** for *this notebook* is the `all_boundaries.geojson` file generated by the `generate_all_bounds.py` script. The **original sources** used by that script are:

*   **cd (Community Districts):** NYC Department of City Planning (DCP)
*   **pp (Police Precincts):** NYC Department of City Planning (DCP)
*   **dsny (Sanitation Districts):** NYC Open Data (Dataset ID: i6mn-amj2)
*   **fb (Fire Battalions):** NYC Department of City Planning (DCP)
*   **sd (School Districts):** NYC Department of City Planning (DCP)
*   **hc (Health Center Districts):** NYC Department of City Planning (DCP)
*   **cc (City Council Districts):** NYC Department of City Planning (DCP)
*   **nycongress (Congressional Districts):** NYC Department of City Planning (DCP)
*   **sa (State Assembly Districts):** NYC Department of City Planning (DCP)
*   **ss (State Senate Districts):** NYC Department of City Planning (DCP)
*   **bid (Business Improvement Districts):** NYC Open Data (Dataset ID: 7jdm-inj8 / derived from ejxk-d93y)
*   **nta (Neighborhood Tabulation Areas):** NYC Department of City Planning (DCP - NTA 2020)
*   **zipcode (Modified Zip Code Tabulation Areas):** NYC Open Data (Dataset ID: pri4-ifjk)
*   **hd (Historic Districts):** NYC Open Data (Dataset ID: skyk-mpzq / derived from xbvj-gfnw)
*   **ibz (Industrial Business Zones):** NYC Economic Development Corporation (EDC)

*Context for many planning datasets can be found at:*
*   [NYC Planning - Bytes of the Big Apple](https://www.nyc.gov/site/planning/data-maps/open-data/bytes-big-apple.page)
*   [NYC Open Data Portal](https://data.cityofnewyork.us/)

### Requirements:
- **Prerequisite:** Successful execution of `generate_all_bounds.py` and access to its output `all_boundaries.geojson`.
- **Python Libraries:** `geopandas`, `pandas`, `tqdm`, and `google.colab` (for Drive mounting and file downloads), plus the standard-library modules `os` and `zipfile`. `requests` is needed only if you load the data file from a URL.
- **Environment:** Google Colab is recommended for Google Drive integration. Standard Python environments can also be used if data loading is adapted.
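
Outside Colab, it can help to check which of these packages are present before running the cells. The snippet below is one possible way to do that with the standard library; the `is_available` helper and the required/optional split are assumptions made for this sketch, not part of the notebook.

```python
import importlib.util

# Third-party packages the notebook imports; google.colab only exists on Colab.
REQUIRED = ["geopandas", "pandas", "tqdm"]
OPTIONAL = ["google.colab", "requests"]

def is_available(module_name: str) -> bool:
    """True if the module can be located (the module itself is not imported)."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is missing
        return False

missing = [m for m in REQUIRED if not is_available(m)]
if missing:
    print("Missing required packages:", ", ".join(missing))
else:
    print("All required packages found.")
for m in OPTIONAL:
    print(f"{m}: {'available' if is_available(m) else 'not available'}")
```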

## 1: Install Dependencies

In [None]:
# Cell 1: Install Dependencies
# Install required libraries if running in a new environment
print("Installing dependencies...")
!pip install geopandas pandas ipywidgets requests tqdm --quiet
# Note: ipywidgets and requests are only needed for interactive widgets and for the
# URL loading method (Method 2 in Cell 3); installing them anyway is harmless.
print("Dependencies installed.")

## 2: Import Libraries

In [None]:
# Cell 2: Import Libraries
# Import necessary libraries for the entire notebook
print("Importing libraries...")
import geopandas as gpd
import pandas as pd
# import requests # Needed only if using URL method for data loading
# from io import BytesIO # Needed only if using URL method for data loading
from google.colab import drive # For loading from Google Drive
from google.colab import files # For downloading results
import zipfile # For packaging results
from tqdm.notebook import tqdm # Progress bar
import os
# Note: Cell 4 uses GeoSeries.union_all() and falls back to the older
# unary_union attribute when union_all() is unavailable.
print("Libraries imported.")

## 3: Load and Prepare Data

In [None]:
# Cell 3: Load and Prepare Data
# --- Load and Prepare the Master GeoDataFrame ---

# Choose ONE of the methods below to load all_boundaries.geojson:

# --- Method 1: Load from Google Drive (active by default) ---
print("Attempting to mount Google Drive...")
drive.mount('/content/drive', force_remount=True)

# !!!! IMPORTANT: Replace this path with the ACTUAL path to your file on Google Drive !!!!
geojson_path_on_drive = '/content/drive/MyDrive/Projects/ODA/Crosswalk Experiment/all_boundaries.geojson' # <--- CHANGE THIS

gdf = None # Initialize gdf
if not os.path.exists(geojson_path_on_drive):
  print(f"ERROR: File not found at specified Google Drive path: {geojson_path_on_drive}")
  print("Please double-check the path and ensure the file exists.")
else:
  print(f"Found file at: {geojson_path_on_drive}")
  try:
      print("Reading GeoJSON from Google Drive...")
      gdf_loaded = gpd.read_file(geojson_path_on_drive)
      print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

      # Reproject to EPSG:2263 for buffer calculations in this notebook (feet)
      print("Reprojecting to EPSG:2263 (Feet)...")
      gdf = gdf_loaded.to_crs(epsg=2263)
      print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")

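      # Build the spatial index now so later lookups are fast.
      # gdf.sindex is created lazily on first access and is never None,
      # so no "is None" check is needed.
      print("Generating spatial index for gdf...")
      gdf.sindex

      print("\nGeoDataFrame Info:")
      gdf.info() # info() prints directly and returns None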

  except Exception as e:
      print(f"ERROR: Failed to read or reproject GeoJSON from Google Drive. Error: {e}")
      # raise e # Uncomment to stop execution on error

# --- End Method 1 ---


# --- Method 2: Load from URL (Currently Commented Out) ---
# # !!!! IMPORTANT: Replace this URL with the ACTUAL download URL for your GeoJSON file !!!!
# import requests # Need this if using URL method
# from io import BytesIO # Need this if using URL method
# geojson_url = "YOUR_GEOJSON_DOWNLOAD_URL_HERE" # <--- CHANGE THIS

# gdf = None # Initialize gdf
# try:
#     print(f"Attempting to download GeoJSON from URL: {geojson_url}")
#     response = requests.get(geojson_url, timeout=60)
#     response.raise_for_status()
#     print("Download successful. Reading GeoJSON...")

#     gdf_loaded = gpd.read_file(BytesIO(response.content))
#     print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

#     print("Reprojecting to EPSG:2263 (Feet)...")
#     gdf = gdf_loaded.to_crs(epsg=2263)
#     print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")

#     print("Generating spatial index for gdf...")
#     gdf.sindex # Built lazily on first access; never None

#     print("\nGeoDataFrame Info:")
#     gdf.info() # info() prints directly and returns None

# except requests.exceptions.RequestException as e:
#     print(f"ERROR: Failed to download GeoJSON from URL. Error: {e}")
# except Exception as e:
#     print(f"ERROR: Failed to read or reproject GeoJSON from downloaded data. Error: {e}")
#     # raise e # Uncomment to stop execution on error

# --- End Method 2 ---


# --- Final Verification ---
if 'gdf' not in locals() or gdf is None or not isinstance(gdf, gpd.GeoDataFrame) or gdf.empty:
    raise ValueError("ERROR: GeoDataFrame 'gdf' was not loaded successfully in Cell 3. Cannot proceed.")
else:
    print("\nGeoDataFrame 'gdf' is loaded, prepared, and ready for use in subsequent cells.")

# --- Define Geography IDs list globally for use in next cell ---
# These must match the values in the 'id' column of all_boundaries.geojson
geography_ids = ['pp', 'fb', 'sd', 'bid', 'ibz', 'cd', 'dsny', 'hc',
                 'cc', 'nycongress', 'sa', 'ss', 'nta', 'zipcode', 'hd']
print(f"\nGeography IDs available for crosswalks: {geography_ids}")

## 4: Generate Long-Form Crosswalks (One File Per Primary Geography)



In [None]:
# Cell 4: Generate All Long-Form Crosswalks (One File Per Primary Geography)

# Check if gdf and geography_ids exist from previous cell
if 'gdf' not in locals() or gdf is None:
     raise NameError("ERROR: GeoDataFrame 'gdf' is not available. Please run the 'Load and Prepare Data' cell first.")
if 'geography_ids' not in locals():
     raise NameError("ERROR: 'geography_ids' list not found. Please run the 'Load and Prepare Data' cell first.")
# Note: gdf.sindex is built lazily on first access and is never None,
# so the 'spatial_index = gdf.sindex' line below is enough to create it.

print("Starting generation of consolidated long-form crosswalks (one file per primary geography)...")

# --- Configuration ---
BUFFER_FEET = -200
MIN_INTERSECTION_AREA_FILTER = 40 # Threshold for initial filtering based on buffered intersection

# --- Output Setup ---
output_folder = 'longform_crosswalks_consolidated' # Folder for the per-geography CSVs
os.makedirs(output_folder, exist_ok=True)
print(f"Output CSVs will be saved to: {output_folder}")
csv_files_generated = [] # Keep track of generated files

# Use the spatial index from the globally loaded gdf
spatial_index = gdf.sindex

# --- Main Loop: Iterate through each primary geography ---
for primary_geo in tqdm(geography_ids, desc="Primary Geographies"):
    primary_gdf = gdf[gdf['id'] == primary_geo].copy()
    if primary_gdf.empty:
        print(f"Skipping primary geography '{primary_geo}' - no features found.")
        continue

    print(f"\nProcessing Primary Geography: {primary_geo}")
    # Initialize list to hold ALL intersection rows for THIS primary_geo
    all_rows_for_this_primary = []

    # Iterate through each feature of the primary geography
    for _, primary_row in tqdm(primary_gdf.iterrows(), total=primary_gdf.shape[0], desc=f"Features in {primary_geo}", leave=False):
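        primary_name = primary_row['nameCol']
        primary_geom_original = primary_row.geometry
        primary_area = primary_geom_original.area
        if primary_area == 0: continue # Skip features with no area

        primary_geom_buffered = primary_geom_original.buffer(BUFFER_FEET)
        # A negative buffer can shrink a narrow feature to an empty geometry, which has
        # no usable bounds for the spatial-index query below; skip such features explicitly.
        if primary_geom_buffered.is_empty: continue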

        # Find potential candidates intersecting the buffered primary feature's bounds
        candidate_idx = list(spatial_index.intersection(primary_geom_buffered.bounds))
        candidates = gdf.iloc[candidate_idx]
        # Filter further: must intersect the actual buffered primary feature,
        # and must not belong to the same geography type as the primary
        candidate_features = candidates[
            (candidates['id'] != primary_geo) &
            candidates.intersects(primary_geom_buffered)
        ].copy()

        if candidate_features.empty: continue # No potential overlaps for this primary feature

        # Calculate intersection area with the BUFFERED primary geom for filtering small/touching overlaps
        candidate_features['intersect_area_buffered'] = candidate_features.geometry.intersection(primary_geom_buffered).area
        # Keep only candidates meeting the minimum buffered intersection area
        target_candidates_filtered = candidate_features[candidate_features['intersect_area_buffered'] > MIN_INTERSECTION_AREA_FILTER]

        if target_candidates_filtered.empty: continue # No significant overlaps after filtering

        # Now, process these filtered candidates, grouping by their actual ID and nameCol
        # Check if 'id' and 'nameCol' exist before grouping
        if 'id' not in target_candidates_filtered.columns or 'nameCol' not in target_candidates_filtered.columns:
             print(f"Warning: 'id' or 'nameCol' missing in filtered candidates for primary feature {primary_name}. Skipping.")
             continue

        # Group potential targets by their ID and nameCol to perform union before final intersection
        grouped_targets = target_candidates_filtered.groupby(['id', 'nameCol'])

        for (target_id, target_name_val), group in grouped_targets:
            if pd.isna(target_name_val): continue # Skip if target name is missing

            target_name_val_str = str(target_name_val) # Ensure string for consistency

            # Union all geometries for this specific target ID and name
            try:
                union_geom = group.geometry.union_all()
            except AttributeError:
                union_geom = group.geometry.unary_union

            # Calculate intersection with the ORIGINAL primary geometry
            inter_geom = primary_geom_original.intersection(union_geom)
            inter_area_final = inter_geom.area if not inter_geom.is_empty else 0
            perc_overlap = (inter_area_final / primary_area) * 100 if primary_area > 0 else 0

            # Add record only if there's some meaningful overlap area
            if inter_area_final > 1e-6: # Use a small threshold
                 row = {
                     "Primary Geography ID": primary_geo,
                     "Primary Geography NameCol": primary_name,
                     "Other Geography ID": target_id,             # ID of the overlapping feature
                     "Other Geography NameCol": target_name_val_str, # Name of the overlapping feature
                     "Primary Area (sq ft)": primary_area,
                     "Intersection Area (sq ft)": inter_area_final,
                     "Percentage Overlap": perc_overlap
                 }
                 all_rows_for_this_primary.append(row)
    # --- End loop for features within the primary geography ---

    # --- Save the consolidated CSV for the current primary_geo ---
    if all_rows_for_this_primary:
        overlap_df = pd.DataFrame(all_rows_for_this_primary)
        # Sort for clarity within the file
        overlap_df = overlap_df.sort_values(
            by=["Primary Geography NameCol", "Other Geography ID", "Percentage Overlap"],
            ascending=[True, True, False]
        )

        filename = f"{output_folder}/longform_{primary_geo}_crosswalk.csv"
        try:
            overlap_df.to_csv(filename, index=False)
            csv_files_generated.append(filename)
            print(f"Saved consolidated file: {filename} ({len(overlap_df)} rows)")
        except Exception as save_e:
            print(f"ERROR saving {filename}: {save_e}")
    else:
        print(f"No significant overlaps found for primary geography {primary_geo}. No CSV generated.")

# --- End outer loop for primary geographies ---

print(f"\nFinished generating all consolidated long-form crosswalk CSVs in folder: {output_folder}")
print(f"Generated {len(csv_files_generated)} files.")

## 5: Package and Download Results

In [None]:
# Cell 5: Package and Download Results

# Check if csv_files_generated exists and has files
if 'csv_files_generated' not in locals() or not csv_files_generated:
    print("No CSV files were generated in the previous step, skipping packaging.")
else:
    # Zip and download the files
    zip_filename = "all_geographies_longform_crosswalks.zip"
    print(f"\nZipping {len(csv_files_generated)} generated CSV files into {zip_filename}...")

    try:
        with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for file_path in tqdm(csv_files_generated, desc="Zipping files"):
                # Add file to zip using its basename to avoid including the folder path
                zipf.write(file_path, arcname=os.path.basename(file_path))

        print(f"Zip file created: {zip_filename}")
        print("Attempting to trigger download...")
        files.download(zip_filename)
        print("Download initiated.")

    except FileNotFoundError as fnf_e:
         print(f"ERROR: File not found during zipping: {fnf_e}. Check file paths.")
    except Exception as zip_e:
         print(f"ERROR creating or downloading zip file: {zip_e}")