<a href="https://colab.research.google.com/github/MODA-NYC/nyc-geography-crosswalks/blob/main/NYC_Geographies_Generate_All_Wide_Crosswalks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NYC Geographies: Generate All Wide Crosswalks

This notebook generates a complete set of wide crosswalk tables for various geographic boundaries in New York City using the BetaNYC `all_bounds.geojson` dataset.

### What this notebook does:
- **Spatial intersections:** Computes overlaps among NYC geographic boundaries using GeoPandas.
- **Negative buffering:** Shrinks each primary feature by 200 feet (a negative buffer) before intersecting, and keeps only overlaps larger than 400 square feet, so geometries that merely touch or overlap by a sliver are excluded.
- **Wide Crosswalk Tables:** Produces one CSV file per geographic boundary (e.g., Community Districts, ZIP codes, NTAs, BIDs), each structured as a wide table where:
  - Each **row** represents a specific geographic feature.
  - Each **column** lists the names of the overlapping features from one other geography type, separated by semicolons.
- **Automated Outputs:** All generated CSV files are zipped into a single downloadable archive.
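
The buffering and area filter described above can be sketched with Shapely alone. This is a minimal illustration, not the notebook's exact code: the helper name `meaningful_overlaps`, the dictionary of neighbours, and the toy squares are hypothetical, and coordinates are assumed to be in feet, as in EPSG:2263.

```python
from shapely.geometry import box

BUFFER_FEET = -200           # shrink the primary feature by 200 ft
MIN_INTERSECTION_AREA = 400  # square feet; smaller overlaps are ignored


def meaningful_overlaps(primary, others, buffer_feet=BUFFER_FEET,
                        min_area=MIN_INTERSECTION_AREA):
    """Return names of features in `others` that overlap the shrunken primary.

    `others` is a dict of name -> geometry (a hypothetical structure).
    """
    shrunk = primary.buffer(buffer_feet)
    if shrunk.is_empty:
        # The feature is too small to survive the negative buffer.
        return []
    return [
        name
        for name, geom in others.items()
        if shrunk.intersection(geom).area > min_area
    ]


# Toy data: a 1000 ft square and three neighbours.
primary = box(0, 0, 1000, 1000)
others = {
    "touching": box(1000, 0, 2000, 1000),    # shares only an edge
    "sliver": box(799.5, 0, 2000, 1000),     # reaches 0.5 ft into the shrunken square
    "overlapping": box(500, 0, 2000, 1000),  # covers a large part of it
}

result = meaningful_overlaps(primary, others)
print(result)
```

In this toy case only the `overlapping` square passes both checks. The edge-sharing square no longer intersects once the primary is shrunk, and the sliver's overlap falls below the area threshold.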

### Data Source:
- [BetaNYC nyc-boundaries GeoJSON](https://github.com/BetaNYC/nyc-boundaries)

### Requirements:
- Python libraries: `geopandas`, `requests`, `pandas`
- Environment: Google Colab. The code as written imports `google.colab` to mount Google Drive and to download the results.

### Output:
- **ZIP file**: `all_geographies_wide_crosswalks.zip`, containing one CSV per geography type, named `<id>_wide_crosswalk.csv`.
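
For downstream analysis it is often easier to work with one row per overlapping pair than with semicolon-separated cells. The sketch below shows one way to expand a wide column with pandas. The helper `wide_to_long` and the sample values are hypothetical; only the column names `cd` and `nta` come from the geography IDs used in this notebook.

```python
import pandas as pd


def wide_to_long(wide, primary_col, secondary_col):
    """Expand one semicolon-separated column into one row per pair."""
    pairs = wide[[primary_col, secondary_col]].copy()
    # Empty cells come back from read_csv as NaN; treat them as "no overlap".
    pairs[secondary_col] = pairs[secondary_col].fillna("").str.split(";")
    pairs = pairs.explode(secondary_col)
    # Drop the placeholder rows produced by empty cells.
    pairs = pairs[pairs[secondary_col] != ""]
    return pairs.reset_index(drop=True)


# Hypothetical rows shaped like cd_wide_crosswalk.csv.
wide = pd.DataFrame({
    "cd": ["CD-A", "CD-B", "CD-C"],
    "nta": ["NTA-1;NTA-2", "", "NTA-3"],
})

long = wide_to_long(wide, "cd", "nta")
print(long)
```

With a real output file, `wide` would instead come from `pd.read_csv("cd_wide_crosswalk.csv")` after unzipping the archive.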

---

In [3]:
# Install required libraries
!pip install geopandas requests --quiet

import geopandas as gpd
import pandas as pd
import requests
from io import BytesIO
from google.colab import files
import zipfile
import os

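# Configuration
BUFFER_FEET = -200            # negative buffer distance, in feet (EPSG:2263 units)
MIN_INTERSECTION_AREA = 400   # minimum overlap area to keep, in square feet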

# Geography IDs list
geography_ids = ['pp', 'fb', 'sd', 'bid', 'ibz', 'cd', 'dsny', 'hc',
                 'cc_upcoming', 'cc', 'nycongress', 'sa', 'ss', 'nta', 'zipcode', 'hd']

# --- Load GeoJSON Data ---

# Choose ONE of the methods below:

# --- Method 1: Load from Google Drive (Currently Active for Testing) ---
# Requires you to have uploaded 'all_boundaries.geojson' to your Drive
# and to authorize Colab access when prompted.

from google.colab import drive  # os and geopandas are already imported above

print("Attempting to mount Google Drive...")
drive.mount('/content/drive', force_remount=True) # force_remount can help if connection issues occur

# !!!! IMPORTANT: Replace this path with the ACTUAL path to your file on Google Drive !!!!
# Example: '/content/drive/MyDrive/data/processed/all_boundaries.geojson'
geojson_path_on_drive = '/content/drive/MyDrive/Projects/ODA/Crosswalk Experiment/all_boundaries.geojson' # <--- CHANGE THIS

gdf = None # Initialize gdf
if not os.path.exists(geojson_path_on_drive):
  print(f"ERROR: File not found at specified Google Drive path: {geojson_path_on_drive}")
  print("Please double-check the path and ensure the file exists.")
else:
  print(f"Found file at: {geojson_path_on_drive}")
  try:
      print("Reading GeoJSON from Google Drive...")
      # Read the file (source is EPSG:4326)
      gdf_loaded = gpd.read_file(geojson_path_on_drive)
      print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

      # Reproject to EPSG:2263 for buffer calculations in this notebook (feet)
      print("Reprojecting to EPSG:2263 (Feet)...")
      gdf = gdf_loaded.to_crs(epsg=2263)
      print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")
      print("\nGeoDataFrame Info:")
      print(gdf.info())
  except Exception as e:
      print(f"ERROR: Failed to read or reproject GeoJSON from Google Drive. Error: {e}")
      # Raise the error if loading is critical for subsequent cells
      # raise e

# --- End Method 1 ---


# --- Method 2: Load from URL (Currently Commented Out) ---
# Use this method if you host the file (e.g., on GitHub Releases)
# and want the notebook to download it directly.

# import requests
# from io import BytesIO
# import geopandas as gpd # Make sure geopandas is imported

# # !!!! IMPORTANT: Replace this URL with the ACTUAL download URL for your GeoJSON file !!!!
# # Example: "https://github.com/MODA-NYC/nyc-geography-crosswalks/releases/download/v0.1.0-data/all_boundaries.geojson"
# geojson_url = "YOUR_GEOJSON_DOWNLOAD_URL_HERE" # <--- CHANGE THIS

# gdf = None # Initialize gdf
# try:
#     print(f"Attempting to download GeoJSON from URL: {geojson_url}")
#     response = requests.get(geojson_url, timeout=60) # Add timeout
#     response.raise_for_status() # Check for HTTP errors
#     print("Download successful. Reading GeoJSON...")

#     # Read the file (source is likely EPSG:4326) from the downloaded content
#     gdf_loaded = gpd.read_file(BytesIO(response.content))
#     print(f"Successfully read file. Original CRS: {gdf_loaded.crs}")

#     # Reproject to EPSG:2263 for buffer calculations in this notebook (feet)
#     print("Reprojecting to EPSG:2263 (Feet)...")
#     gdf = gdf_loaded.to_crs(epsg=2263)
#     print(f"Successfully loaded and reprojected GeoDataFrame. New CRS: {gdf.crs}")
#     print("\nGeoDataFrame Info:")
#     print(gdf.info())

# except requests.exceptions.RequestException as e:
#     print(f"ERROR: Failed to download GeoJSON from URL. Error: {e}")
# except Exception as e:
#     print(f"ERROR: Failed to read or reproject GeoJSON from downloaded data. Error: {e}")
#     # Raise the error if loading is critical
#     # raise e

# --- End Method 2 ---


# --- Verification ---
# Ensure gdf was loaded before proceeding to the next steps
if gdf is None or gdf.empty:
    raise ValueError("ERROR: GeoDataFrame 'gdf' was not loaded successfully. Please check the path/URL and logs in the cell above.")
else:
    print("\nGeoDataFrame 'gdf' is loaded and ready for use.")

# --- Build the crosswalks ---
# (Spatial index, output folder setup, per-geography loops, zip and download)

# Spatial index for efficiency
spatial_index = gdf.sindex

# Temporary folder to store CSVs
output_folder = 'geography_crosswalks'
os.makedirs(output_folder, exist_ok=True)

csv_files = []

for primary_geo in geography_ids:
    primary_gdf = gdf[gdf['id'] == primary_geo].copy()
    if primary_gdf.empty:
        continue

    records = []

    for _, primary_row in primary_gdf.iterrows():
        primary_name = primary_row['nameCol']
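        primary_geom_buffered = primary_row.geometry.buffer(BUFFER_FEET)

        # A negative buffer can shrink small or narrow features to an empty geometry;
        # query the spatial index only when something is left to intersect.
        if primary_geom_buffered.is_empty:
            candidate_idx = []
        else:
            candidate_idx = list(spatial_index.intersection(primary_geom_buffered.bounds))
        candidate_features = gdf.iloc[candidate_idx]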

        # Initial intersection filter
        candidates = candidate_features[candidate_features.intersects(primary_geom_buffered)].copy()
        if candidates.empty:
            final_candidates = candidates
        else:
            candidates['intersection_area'] = candidates.geometry.intersection(primary_geom_buffered).area
            final_candidates = candidates[candidates['intersection_area'] > MIN_INTERSECTION_AREA]

        record = {primary_geo: primary_name}

        for secondary_geo in geography_ids:
            if secondary_geo == primary_geo:
                continue  # skip self-intersection
            subset = final_candidates[final_candidates['id'] == secondary_geo]
            record[secondary_geo] = ";".join(subset['nameCol'].unique()) if not subset.empty else ""

        records.append(record)

    df = pd.DataFrame(records)
    csv_filename = f"{output_folder}/{primary_geo}_wide_crosswalk.csv"
    df.to_csv(csv_filename, index=False)
    csv_files.append(csv_filename)

# Zip and download the files
zip_filename = "all_geographies_wide_crosswalks.zip"
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in csv_files:
        zipf.write(file, arcname=os.path.basename(file))

files.download(zip_filename)


Attempting to mount Google Drive...
Mounted at /content/drive
Found file at: /content/drive/MyDrive/Projects/ODA/Crosswalk Experiment/all_boundaries.geojson
Reading GeoJSON from Google Drive...
Successfully read file. Original CRS: EPSG:4326
Reprojecting to EPSG:2263 (Feet)...
Successfully loaded and reprojected GeoDataFrame. New CRS: EPSG:2263

GeoDataFrame Info:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1171 entries, 0 to 1170
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   id        1171 non-null   object  
 1   nameCol   1171 non-null   object  
 2   nameAlt   321 non-null    object  
 3   geometry  1171 non-null   geometry
dtypes: geometry(1), object(3)
memory usage: 36.7+ KB
None

GeoDataFrame 'gdf' is loaded and ready for use.

