# Note: This notebook now delegates to the reusable CLI

Use `scripts/build_crosswalks.py` to generate all wide/long crosswalk files for the latest run folder created by `generate_all_bounds.py` or `scripts/make_run.py`. See the code cell below to execute it from this notebook.


In [None]:
import glob
import os
import subprocess
import sys

# Run folders are sorted by name, so "latest" assumes run IDs sort chronologically.
runs = sorted([d for d in glob.glob('outputs/*') if os.path.isdir(d)])
if not runs:
    raise RuntimeError("No outputs/<run-id>/ found. Run generate_all_bounds.py or scripts/make_run.py first.")
latest = runs[-1]

boundaries = os.path.join(latest, 'all_boundaries.geojson')
if not os.path.isfile(boundaries):
    raise RuntimeError(f"Missing {boundaries}. Run the bounds step first.")

print(f"Using run folder: {latest}")
# sys.executable runs the CLI with the same interpreter as this kernel
cmd = [
    sys.executable, 'scripts/build_crosswalks.py',
    '--boundaries', boundaries,
    '--run-dir', latest,
]
print('Running:', ' '.join(cmd))
subprocess.run(cmd, check=True)
print('Crosswalks built. See longform/ and wide/ under', latest)


<a href="https://colab.research.google.com/github/MODA-NYC/nyc-geography-crosswalks/blob/main/NYC_Geographies_Generate_All_Wide_Crosswalks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NYC Geographies: Generate All Wide-Format Crosswalks

This notebook automates the generation of a complete set of **wide-format** geographic crosswalk tables for New York City. It processes all defined geographic boundaries using the `all_boundaries.geojson` file generated by the `generate_all_bounds.py` script.

**Important:** This notebook relies on the `all_boundaries.geojson` file being generated first by the `generate_all_bounds.py` script located within this repository (`nyc-geography-crosswalks`). Ensure you have run that script and have access to its output file before running this notebook.

**About the Input Data File:**
The `all_boundaries.geojson` file used as input aggregates the same core NYC geographic boundaries previously used by the BetaNYC Boundaries Map project. The `generate_all_bounds.py` script aims to collect the **latest available versions** directly from their official sources at the time the script is run.

*   **Note on Versions:** Currently, the specific version links (e.g., URLs containing `_25a` for data from NYC Planning's 2025 Cycle A update) are **hardcoded** within the generator script (`generate_all_bounds.py`).
*   **Future Enhancement:** A potential future improvement could involve modifying the generator script to automatically check for and download the absolute latest versions available from the source portals.
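
As a rough illustration of what such a version check might compare, the sketch below parses DCP-style release tags such as `25a` into sortable (year, cycle) pairs. The helper name `parse_release_tag` and the list of candidate tags are hypothetical and not part of `generate_all_bounds.py`; how the available tags would be discovered from the source portals is left open.

```python
import re

def parse_release_tag(tag: str) -> tuple:
    """Split a tag such as '25a' into (2025, 'a'). Hypothetical helper."""
    match = re.fullmatch(r"(\d{2})([a-z])", tag.lower())
    if match is None:
        raise ValueError(f"Unrecognized release tag: {tag!r}")
    return 2000 + int(match.group(1)), match.group(2)

# Illustrative tags only; real ones would have to come from the source portal.
candidate_tags = ["24d", "25a", "24b"]

# Tuples compare year first, then cycle letter, so max() picks the newest tag.
latest_tag = max(candidate_tags, key=parse_release_tag)
print(latest_tag)
```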

### What this notebook does:
1.  **Load Data:** Loads the pre-generated `all_boundaries.geojson` file (typically from Google Drive when using Colab - see Cell 3).
2.  **Iterate:** Loops through each defined geography type as the "primary" geography.
3.  **Spatial Intersections:** For each primary feature, computes overlaps with all features from *other* relevant geography types using GeoPandas.
4.  **Negative Buffering:** Shrinks each primary feature with a negative buffer (`BUFFER_FEET = -200`, i.e. 200 feet inward in EPSG:2263) before the intersection check, then keeps only overlaps larger than `MIN_INTERSECTION_AREA = 400` square feet, so that trivial or merely touching geometries are excluded.
5.  **Generate Wide CSVs:** Produces **one CSV file per primary geography type** (e.g., `wide_cd_crosswalk.csv`, `wide_pp_crosswalk.csv`). Each CSV is structured as a wide table where:
    *   Each **row** represents a specific feature of the primary geography type (e.g., one Community District).
    *   Each **column** represents another geography type, containing a semicolon-separated list of the `nameCol` identifiers of the overlapping features from that type.
6.  **Package Output:** Zips all generated wide-format CSV files from the current run into a single downloadable archive.
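
The effect of the negative buffer and the minimum-area filter can be sketched without any GIS library. In the simplified example below, axis-aligned rectangles stand in for real polygons, and the helpers `shrink`, `overlap_area`, and `significant_overlap` are hypothetical names used only for illustration; the notebook itself relies on GeoPandas geometry operations rather than this code.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y), in feet

BUFFER_FEET = -200           # same value as the notebook's configuration
MIN_INTERSECTION_AREA = 400  # square feet

def shrink(rect: Rect, buffer_feet: float) -> Optional[Rect]:
    """Apply a (negative) buffer to a rectangle; return None if it vanishes."""
    min_x, min_y, max_x, max_y = rect
    shrunk = (min_x - buffer_feet, min_y - buffer_feet,
              max_x + buffer_feet, max_y + buffer_feet)
    if shrunk[0] >= shrunk[2] or shrunk[1] >= shrunk[3]:
        return None
    return shrunk

def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def significant_overlap(primary: Rect, other: Rect) -> bool:
    shrunk = shrink(primary, BUFFER_FEET)
    if shrunk is None:  # primary is narrower than 400 ft, so nothing survives
        return False
    return overlap_area(shrunk, other) > MIN_INTERSECTION_AREA

primary = (0, 0, 1000, 1000)
touching = (1000, 0, 2000, 1000)     # shares only an edge with primary
sliver = (900, 0, 2000, 1000)        # overlaps primary by a 100 ft strip
real_overlap = (500, 0, 2000, 1000)  # covers half of primary

print(significant_overlap(primary, touching))
print(significant_overlap(primary, sliver))
print(significant_overlap(primary, real_overlap))
```

One consequence worth keeping in mind: a primary feature narrower than 400 feet disappears entirely under a 200-foot inward buffer, so it would be expected to end up with empty columns in its wide-format row.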

### Data Sources:
The **input** for *this notebook* is the `all_boundaries.geojson` file generated by the `generate_all_bounds.py` script. The **original sources** used by that script are:

*   **cd (Community Districts):** NYC Department of City Planning (DCP)
*   **pp (Police Precincts):** NYC Department of City Planning (DCP)
*   **dsny (Sanitation Districts):** NYC Open Data (Dataset ID: i6mn-amj2)
*   **fb (Fire Battalions):** NYC Department of City Planning (DCP)
*   **sd (School Districts):** NYC Department of City Planning (DCP)
*   **hc (Health Center Districts):** NYC Department of City Planning (DCP)
*   **cc (City Council Districts):** NYC Department of City Planning (DCP)
*   **nycongress (Congressional Districts):** NYC Department of City Planning (DCP)
*   **sa (State Assembly Districts):** NYC Department of City Planning (DCP)
*   **ss (State Senate Districts):** NYC Department of City Planning (DCP)
*   **bid (Business Improvement Districts):** NYC Open Data (Dataset ID: 7jdm-inj8 / derived from ejxk-d93y)
*   **nta (Neighborhood Tabulation Areas):** NYC Department of City Planning (DCP - NTA 2020)
*   **zipcode (Modified Zip Code Tabulation Areas):** NYC Open Data (Dataset ID: pri4-ifjk)
*   **hd (Historic Districts):** NYC Open Data (Dataset ID: skyk-mpzq / derived from xbvj-gfnw)
*   **ibz (Industrial Business Zones):** NYC Economic Development Corporation (EDC)

*Context for many planning datasets can be found at:*
*   [NYC Planning - Bytes of the Big Apple](https://www.nyc.gov/site/planning/data-maps/open-data/bytes-big-apple.page)
*   [NYC Open Data Portal](https://data.cityofnewyork.us/)

### Requirements:
- **Prerequisite:** Successful execution of `generate_all_bounds.py` and access to its output `all_boundaries.geojson`.
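- **Python Libraries:** `geopandas`, `pandas`, and `tqdm` (installed in Cell 1); `google.colab` (available on Colab, used for Drive mounting and file downloads); and the standard-library modules `os` and `zipfile`. `requests` is needed only if using the URL loading method for the data file.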
- **Environment:** Google Colab is recommended for Google Drive integration. Standard Python environments can also be used if data loading is adapted.

### Output:
- **ZIP file:** `all_geographies_wide_crosswalks_<output_folder>.zip`, where `<output_folder>` is the folder name set in Cell 4 (`wide_crosswalks_consolidated` by default). It contains the individual wide-format CSV files, one for each primary geography type processed.
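
For downstream use, the semicolon-separated cells usually need to be split back into individual pairs. The sketch below does this with the standard library only; the sample rows and identifiers (`CD-A`, `PP-1`, and so on) are made up for illustration and are not real crosswalk values.

```python
import csv
import io

# Made-up sample in the wide layout; the identifiers are placeholders.
wide_csv = io.StringIO(
    "cd,pp,sd\n"
    "CD-A,PP-1;PP-2,SD-9\n"
    "CD-B,PP-2,\n"
)

long_rows = []
for row in csv.DictReader(wide_csv):
    primary_id = row["cd"]
    for geo_type, joined in row.items():
        if geo_type == "cd" or not joined:
            continue  # skip the primary column and empty cells
        for other_id in joined.split(";"):
            long_rows.append((primary_id, geo_type, other_id))

for long_row in long_rows:
    print(long_row)
```

This simple split assumes that no identifier contains a semicolon itself; if one did, it would be broken into separate entries.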

## 1: Install Dependencies

In [None]:
# Cell 1: Install Dependencies
# Install required libraries if running in a new environment
print("Installing dependencies...")
!pip install geopandas pandas requests tqdm --quiet
# Note: requests is only needed for the URL loading method (Method 2 in Cell 3), but is safe to include.
# Note: tqdm provides the progress bars used in Cells 4 and 5. ipywidgets not needed here.
print("Dependencies installed.")

## 2: Import Libraries

In [None]:
# Cell 2: Import Libraries
# Import necessary libraries for the entire notebook
print("Importing libraries...")
import geopandas as gpd
import pandas as pd
import requests # Needed only if using URL method for data loading
from io import BytesIO # Needed only if using URL method for data loading
from google.colab import drive # For loading from Google Drive
from google.colab import files # For downloading results
import zipfile # For packaging results
from tqdm.notebook import tqdm # Progress bar (added import)
import os
# Import Union for type hints if needed
# from typing import Union
print("Libraries imported.")

## 3: Load and Prepare Data

In [None]:
# Cell 3: Load and Prepare Data
# --- Load and Prepare the Master GeoDataFrame ---

# Choose ONE of the methods below to load all_boundaries.geojson:

# --- Method 1: Load from Google Drive (Currently Active for Testing) ---
print("Attempting to mount Google Drive...")
drive.mount('/content/drive', force_remount=True)

# !!!! IMPORTANT: Replace this path with the ACTUAL path to your file on Google Drive !!!!
geojson_path_on_drive = '/content/drive/MyDrive/Projects/ODA/Crosswalk Experiment/all_boundaries.geojson' # <--- USE YOUR CORRECT PATH HERE

gdf = None # Initialize gdf
if not os.path.exists(geojson_path_on_drive):
  print(f"ERROR: File not found at specified Google Drive path: {geojson_path_on_drive}")
  print("Please double-check the path and ensure the file exists.")
else:
  print(f"Found file at: {geojson_path_on_drive}")
  try:
      print("Reading GeoJSON from Google Drive...")
      gdf_loaded = gpd.read_file(geojson_path_on_drive)
      print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

      # Reproject to EPSG:2263 for buffer calculations in this notebook (feet)
      print("Reprojecting to EPSG:2263 (Feet)...")
      gdf = gdf_loaded.to_crs(epsg=2263)
      print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")

      # Build the spatial index now for faster lookups
      # (accessing .sindex creates and caches it; it is never None)
      print("Generating spatial index for gdf...")
      _ = gdf.sindex

      print("\nGeoDataFrame Info:")
      gdf.info()  # info() prints its summary itself and returns None

  except Exception as e:
      print(f"ERROR: Failed to read or reproject GeoJSON from Google Drive. Error: {e}")
      # raise e # Uncomment to stop execution on error

# --- End Method 1 ---


# --- Method 2: Load from URL (Currently Commented Out) ---
# # !!!! IMPORTANT: Replace this URL with the ACTUAL download URL for your GeoJSON file !!!!
# geojson_url = "YOUR_GEOJSON_DOWNLOAD_URL_HERE" # <--- CHANGE THIS

# gdf = None # Initialize gdf
# try:
#     print(f"Attempting to download GeoJSON from URL: {geojson_url}")
#     response = requests.get(geojson_url, timeout=60)
#     response.raise_for_status()
#     print("Download successful. Reading GeoJSON...")

#     gdf_loaded = gpd.read_file(BytesIO(response.content))
#     print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

#     print("Reprojecting to EPSG:2263 (Feet)...")
#     gdf = gdf_loaded.to_crs(epsg=2263)
#     print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")

#     print("Generating spatial index for gdf...")
#     _ = gdf.sindex # Accessing .sindex builds the index

#     print("\nGeoDataFrame Info:")
#     gdf.info()

# except requests.exceptions.RequestException as e:
#     print(f"ERROR: Failed to download GeoJSON from URL. Error: {e}")
# except Exception as e:
#     print(f"ERROR: Failed to read or reproject GeoJSON from downloaded data. Error: {e}")
#     # raise e # Uncomment to stop execution on error

# --- End Method 2 ---


# --- Final Verification ---
if 'gdf' not in locals() or gdf is None or not isinstance(gdf, gpd.GeoDataFrame) or gdf.empty:
    raise ValueError("ERROR: GeoDataFrame 'gdf' was not loaded successfully in Cell 3. Cannot proceed.")
else:
    print("\nGeoDataFrame 'gdf' is loaded, prepared, and ready for use in subsequent cells.")

# --- Define Geography IDs list globally for use in next cell ---
# (Corrected list based on previous steps)
geography_ids = ['pp', 'fb', 'sd', 'bid', 'ibz', 'cd', 'dsny', 'hc',
                 'cc', 'nycongress', 'sa', 'ss', 'nta', 'zipcode', 'hd']
print(f"\nGeography IDs available for crosswalks: {geography_ids}")

# --- Configuration (moved here for clarity) ---
BUFFER_FEET = -200
MIN_INTERSECTION_AREA = 400
print(f"\nConfiguration: Buffer={BUFFER_FEET}ft, Min Intersection Area={MIN_INTERSECTION_AREA}sqft")

## 4: Generate All Wide-Format Crosswalks

In [None]:
# Cell 4: Generate All Wide-Format Crosswalks

# Check if prerequisites exist from previous cell
if 'gdf' not in locals() or gdf is None:
     raise NameError("ERROR: GeoDataFrame 'gdf' is not available. Please run the 'Load and Prepare Data' cell first.")
if 'geography_ids' not in locals():
     raise NameError("ERROR: 'geography_ids' list not found. Please run the 'Load and Prepare Data' cell first.")
_ = gdf.sindex  # Accessing .sindex builds the spatial index if it does not exist yet
if 'BUFFER_FEET' not in locals() or 'MIN_INTERSECTION_AREA' not in locals():
     raise NameError("ERROR: Configuration variables (BUFFER_FEET, MIN_INTERSECTION_AREA) not found.")


print("Starting generation of all wide-format crosswalks...")

# --- Output Setup ---
output_folder = 'wide_crosswalks_consolidated' # Use a distinct folder name
os.makedirs(output_folder, exist_ok=True)
print(f"Output CSVs will be saved to: {output_folder}")
csv_files_generated = [] # Keep track of generated files

# Use the spatial index from the globally loaded gdf
spatial_index = gdf.sindex

# --- Main Loop ---
for primary_geo in tqdm(geography_ids, desc="Primary Geographies"):
    # Use global gdf
    primary_gdf = gdf[gdf['id'] == primary_geo].copy()
    if primary_gdf.empty:
        print(f"Skipping primary geography '{primary_geo}' - no features found.")
        continue

    print(f"\nProcessing Primary Geography: {primary_geo}")
    records = [] # Reset records for each primary geography

    # Iterate through each primary feature
    for _, primary_row in tqdm(primary_gdf.iterrows(), total=primary_gdf.shape[0], desc=f"Features in {primary_geo}", leave=False):
        primary_name = primary_row['nameCol']
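        primary_geom_buffered = primary_row.geometry.buffer(BUFFER_FEET)

        # Find candidates intersecting the buffered primary feature.
        # A feature narrower than 2 * |BUFFER_FEET| buffers to an empty geometry,
        # which has no usable bounds, so it gets no candidates.
        if primary_geom_buffered.is_empty:
            candidate_idx = []
        else:
            candidate_idx = list(spatial_index.intersection(primary_geom_buffered.bounds))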
        # Use global gdf
        candidate_features = gdf.iloc[candidate_idx]

        # Filter for actual intersection with the buffer
        mask = candidate_features.intersects(primary_geom_buffered)
        candidates = candidate_features[mask].copy()

        # Calculate intersection area with the buffer and filter by minimum area
        if not candidates.empty:
            candidates['intersection_area'] = candidates.geometry.intersection(primary_geom_buffered).area
            final_candidates = candidates[candidates['intersection_area'] > MIN_INTERSECTION_AREA]
        else:
            final_candidates = candidates # Pass empty frame

        # Build the record for this primary feature
        record = {primary_geo: primary_name}

        # Add columns for all other geography types
        for secondary_geo in geography_ids:
            if secondary_geo == primary_geo:
                continue  # skip self-intersection column

            # Check if 'id' column exists in final_candidates
            if 'id' not in final_candidates.columns:
                # This shouldn't happen if gdf is structured correctly, but safe check
                print(f"Warning: 'id' column missing in candidates for primary {primary_geo}. Setting target {secondary_geo} to empty.")
                record[secondary_geo] = ""
                continue

            subset = final_candidates[final_candidates['id'] == secondary_geo]

            # Check if 'nameCol' exists before trying to access it
            if not subset.empty and 'nameCol' in subset.columns:
               # Get unique, non-null string representations
               unique_names = subset['nameCol'].dropna().astype(str).unique()
               record[secondary_geo] = ";".join(unique_names) if len(unique_names) > 0 else ""
            else:
               record[secondary_geo] = ""

        records.append(record)
    # --- End loop for primary features ---

    # --- Save the consolidated CSV for the current primary_geo ---
    if records:
        df = pd.DataFrame(records)
        # Column order: primary geography first, then the others alphabetically
        cols = [primary_geo] + sorted([g for g in geography_ids if g != primary_geo])
        df = df[cols] # Reorder columns

        filename = f"{output_folder}/wide_{primary_geo}_crosswalk.csv"
        try:
            df.to_csv(filename, index=False)
            csv_files_generated.append(filename)
            print(f"Saved consolidated file: {filename} ({len(df)} rows)")
        except Exception as save_e:
            print(f"ERROR saving {filename}: {save_e}")
    else:
        print(f"No records generated for primary geography {primary_geo}. No CSV generated.")

# --- End loop for primary geographies ---

print(f"\nFinished generating all wide-format crosswalk CSVs in folder: {output_folder}")
print(f"Generated {len(csv_files_generated)} files.")
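
The record-building step in the cell above can be illustrated in isolation. The following is a simplified sketch in plain Python, not the notebook's GeoPandas code: `build_record` is a hypothetical helper, the geography list is shortened, and the overlap pairs are made-up placeholders standing in for the features that passed the area filter.

```python
geography_ids = ["cd", "pp", "sd"]  # shortened list for the example

def build_record(primary_geo, primary_name, overlaps):
    """Build one wide-format row.

    overlaps: iterable of (geo_type, name) pairs that passed the area filter.
    """
    record = {primary_geo: primary_name}
    for secondary_geo in geography_ids:
        if secondary_geo == primary_geo:
            continue  # no column for the primary geography's own type
        names = [str(name) for geo, name in overlaps
                 if geo == secondary_geo and name is not None]
        unique_names = list(dict.fromkeys(names))  # de-duplicate, keep first-seen order
        record[secondary_geo] = ";".join(unique_names)
    return record

record = build_record(
    "cd", "CD-A",
    [("pp", "PP-1"), ("pp", "PP-2"), ("pp", "PP-1"), ("cd", "CD-A")],
)
print(record)
```

Geography types with no qualifying overlap end up as empty strings, which matches how the wide CSVs represent "no overlap".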

## 5: Package and Download Results

In [None]:
# Cell 5: Package and Download Results

# Check if csv_files_generated exists and has files
if 'csv_files_generated' not in locals() or not csv_files_generated:
    print("No CSV files were generated in the previous step, skipping packaging.")
else:
    # Determine the output folder name used in the previous cell
    # (Assumes it was assigned to output_folder variable)
    if 'output_folder' not in locals():
         print("Warning: 'output_folder' variable not found. Using fallback zip filename.")
         zip_filename = "all_geographies_wide_crosswalks.zip" # Fallback name
    else:
         zip_filename = f"all_geographies_wide_crosswalks_{os.path.basename(output_folder)}.zip"


    print(f"\nZipping {len(csv_files_generated)} generated CSV files into {zip_filename}...")

    try:
        with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for file_path in tqdm(csv_files_generated, desc="Zipping files"):
                # Add file to zip using its basename to avoid including the folder path
                if os.path.exists(file_path):
                     zipf.write(file_path, arcname=os.path.basename(file_path))
                else:
                     print(f"Warning: File not found, skipping zip: {file_path}")

        print(f"Zip file created: {zip_filename}")
        print("Attempting to trigger download...")
        files.download(zip_filename)
        print("Download initiated.")

    except FileNotFoundError as fnf_e:
         print(f"ERROR: File not found during zipping: {fnf_e}. Check file paths.")
    except Exception as zip_e:
         print(f"ERROR creating or downloading zip file: {zip_e}")
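
After packaging, it can be useful to confirm that the archive holds flat file names (because of `arcname=os.path.basename(...)`) and that its members are intact. The sketch below reproduces the zipping step on a throwaway file in a temporary directory and then inspects the result; the file contents and the archive name `crosswalks.zip` are placeholders chosen for the example.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the output folder and one generated CSV
    csv_dir = os.path.join(tmp_dir, "wide_crosswalks_consolidated")
    os.makedirs(csv_dir)
    csv_path = os.path.join(csv_dir, "wide_cd_crosswalk.csv")
    with open(csv_path, "w", newline="") as handle:
        handle.write("cd,pp\nCD-A,PP-1;PP-2\n")

    # Same pattern as Cell 5: store the file under its basename only
    zip_path = os.path.join(tmp_dir, "crosswalks.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(csv_path, arcname=os.path.basename(csv_path))

    # Re-open the archive and check its contents
    with zipfile.ZipFile(zip_path) as zipf:
        archived_names = zipf.namelist()
        first_bad_member = zipf.testzip()  # None when every member passes its CRC check

print(archived_names)
print(first_bad_member)
```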