# NYC 311 Service Request Prediction with STM-Graph

This notebook demonstrates how to use the STM-Graph library to analyze New York City 311 service request data and predict future requests. We'll go through the complete workflow:

1. Loading and preprocessing the raw data
2. Creating spatial mappings using Degree-based Voronoi partitioning
3. Extracting OpenStreetMap features / Urban Features Graph Creation
4. Building a graph representation of the data
5. Creating a temporal graph dataset
6. Visualizing spatial and temporal patterns
7. Analyzing weekly and daily patterns
8. Training a GNN model for service request prediction

Let's get started!

In [None]:
stm_graph_path = "/home/ubuntu/STM-Graph/src"

# Import required libraries
import sys
sys.path.append(stm_graph_path)
import stm_graph
import os
import pandas as pd
import numpy as np
from datetime import timedelta
import matplotlib.pyplot as plt
import contextily as ctx

# Define the data and output directories
DATA_DIR = "/mnt/data/nyc_crash_311"
DATASET = "311_Service_Requests_from_2010_to_Present_20241218.csv"
OUTPUT_DIR = "/mnt/data/nyc_crash_311/stm_graph/nyc/311"

# Define geographic boundaries for NYC
NYC_BOUNDS = {
    "min_lat": 40.4774,  # Southern boundary
    "max_lat": 40.9176,  # Northern boundary
    "min_lon": -74.2591,  # Western boundary
    "max_lon": -73.7004,  # Eastern boundary
}

os.makedirs(OUTPUT_DIR, exist_ok=True)

## 1. Data Loading and Preprocessing

First, we'll load the NYC 311 service request data and preprocess it using STM-Graph's built-in functions. This dataset contains citizen reports of non-emergency issues like noise complaints, illegal parking, etc.

The preprocessing steps include:
- Filtering data within NYC boundaries
- Converting coordinates to a proper spatial format
- Standardizing column names
- Filtering to a specific time range for analysis
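
To make these steps concrete, here is a minimal pandas sketch on a tiny made-up frame. It is not STM-Graph's implementation: the helper `preprocess_sketch` and the sample rows are hypothetical, and the conversion to a spatial geometry column is left out.

```python
import pandas as pd

NYC_BOUNDS = {
    "min_lat": 40.4774, "max_lat": 40.9176,
    "min_lon": -74.2591, "max_lon": -73.7004,
}

def preprocess_sketch(df, bounds, start, end):
    """Hypothetical helper: rename columns, parse timestamps, filter by box and date."""
    df = df.rename(columns={
        "Created Date": "created_time",
        "Latitude": "latitude",
        "Longitude": "longitude",
    })
    df["created_time"] = pd.to_datetime(df["created_time"])
    in_box = (
        df["latitude"].between(bounds["min_lat"], bounds["max_lat"])
        & df["longitude"].between(bounds["min_lon"], bounds["max_lon"])
    )
    in_range = df["created_time"].between(pd.Timestamp(start), pd.Timestamp(end))
    return df[in_box & in_range].reset_index(drop=True)

# Made-up rows: one valid, one outside the date range, one outside the bounding box
raw = pd.DataFrame({
    "Created Date": ["2017-07-04 12:00:00", "2016-01-01 08:00:00", "2017-09-15 23:30:00"],
    "Latitude": [40.7128, 40.7300, 41.5000],
    "Longitude": [-74.0060, -73.9900, -73.9000],
})
clean = preprocess_sketch(raw, NYC_BOUNDS, "2017-06-01 00:00:00", "2018-06-30 23:59:59")
print(clean)
```

Only the first row survives both filters, which is the behavior the bullet list describes.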

In [None]:
# Process with STM-Graph
print(f"Processing 311 data from {os.path.join(DATA_DIR, DATASET)}")
gdf_311 = stm_graph.preprocess_dataset(
    data_path=DATA_DIR,
    dataset=DATASET,
    time_col="created_time",
    lat_col="Latitude",
    lng_col="Longitude",
    column_mapping={
        "Unique Key": "unique_key",
        "Created Date": "created_time",
        "Closed Date": "closed_time",
        "Agency": "agency",
        "Complaint Type": "complaint_type",
        "Descriptor": "descriptor",
        "Latitude": "latitude",
        "Longitude": "longitude",
        "Location": "location",
    },
    filter_dates=("2017-06-01 00:00:00", "2018-06-30 23:59:59"),
    testing_mode=True,
    test_bounds=NYC_BOUNDS,
    visualize=True,
    fig_format="png",
    output_dir=OUTPUT_DIR,
    show_background_map=True,
    point_color="blue",
    point_alpha=0.5,
    point_size=1
)

print(f"Processed dataset shape: {gdf_311.shape}")
print(f"Time range: {gdf_311['created_time'].min()} to {gdf_311['created_time'].max()}")

In [None]:
# Examine the distribution of 311 complaint types
complaint_counts = gdf_311['complaint_type'].value_counts()
print("Top 10 complaint types:")
print(complaint_counts.head(10))

# Visualize the top complaint types
plt.figure(figsize=(12, 6))
complaint_counts.head(10).plot(kind='bar')
plt.title('Top 10 311 Complaint Types')
plt.ylabel('Number of Reports')
plt.xlabel('Complaint Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## 2. Spatial Mapping

Next, we'll apply a spatial mapping to divide NYC into regions. For this dataset, we'll use a Degree-based Voronoi partitioning approach, which creates regions of varying size based on important junctions in the road network.
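
As a rough illustration of the Voronoi idea, a point belongs to the cell of its nearest seed. The NumPy sketch below shows only that assignment step. The seeds and points are made up, the function `assign_to_nearest_seed` is hypothetical, and how STM-Graph selects junctions as seeds (the degree-based part and the cell-size parameters) is not shown here. Coordinates are assumed to be in a projected CRS measured in metres, such as the EPSG:32618 (UTM zone 18N) used in this notebook.

```python
import numpy as np

def assign_to_nearest_seed(points, seeds):
    """Return, for each point, the index of the closest seed (its Voronoi cell)."""
    # Pairwise squared distances, shape (n_points, n_seeds)
    diff = points[:, None, :] - seeds[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    return dist2.argmin(axis=1)

# Hypothetical seeds (standing in for junctions) and event locations, in metres
seeds = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 1000.0]])
points = np.array([[100.0, 50.0], [900.0, 100.0], [200.0, 800.0], [450.0, 10.0]])

cells = assign_to_nearest_seed(points, seeds)
print(cells)
```

The brute-force distance matrix is fine for a toy example; for millions of points a spatial index would be the usual choice.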

In [None]:
# Apply Degree-based Voronoi mapping to the data
mapper = stm_graph.VoronoiDegreeMapping(
    small_cell_size=5000,
    large_cell_size=20000,
    meter_crs="EPSG:32618",
)

# Apply the mapping to get district geometries and point-to-partition mapping
district_gdf, point_to_partition = mapper.create_mapping(gdf_311)

print(f"Created mapping with {len(district_gdf)} regions")
print(f"Points with valid mapping: {(point_to_partition >= 0).sum()} of {len(point_to_partition)}")

# Visualize the mapping
mapper.visualize(
    points_gdf=gdf_311,
    partition_gdf=district_gdf,
    point_to_partition=point_to_partition,
    out_dir=OUTPUT_DIR,
    remove_empty=True,
    testing_mode=False,
    file_format="png"
)

## 3. OSM Feature Extraction / Urban Features Graph Creation

Now we'll extract features from OpenStreetMap (OSM) to enrich our model with contextual information about each area. These features provide important context about the urban environment that may influence service request patterns.

For example:
- More restaurants might correlate with more noise complaints
- Areas with more roads might have more pothole reports
- Areas with more parks might have different patterns of maintenance requests
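
To give a feel for what per-region features of this kind look like, the sketch below counts hypothetical OSM objects per region and min-max scales each column. This is an illustration only: the table `osm_objects` is made up, and the normalization that `normalize=True` applies inside STM-Graph may differ.

```python
import pandas as pd

# Hypothetical table: one row per OSM object, already assigned to a region
osm_objects = pd.DataFrame({
    "region_id": [0, 0, 0, 1, 1, 2],
    "kind": ["poi", "poi", "junction", "poi", "junction", "junction"],
})

# Count objects of each kind per region (rows: regions, columns: kinds)
counts = pd.crosstab(osm_objects["region_id"], osm_objects["kind"])

# Min-max scale each column to [0, 1]; a constant column maps to 0
span = (counts.max() - counts.min()).replace(0, 1)
features = (counts - counts.min()) / span
print(features)
```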

In [None]:
# Define the feature types to extract
feature_types = ['poi', 'road', 'junction']

# Extract OSM features
osm_cache_dir = os.path.join(OUTPUT_DIR, "osm_cache")
osm_features = stm_graph.extract_osm_features(
    regions_gdf=district_gdf,
    bounds=NYC_BOUNDS,
    cache_dir=osm_cache_dir,
    feature_types=feature_types,
    normalize=True,
    meter_crs="EPSG:32618",
    lat_lon_crs="EPSG:4326"
)

# Print available features
print(f"Extracted {len(osm_features.columns)} OSM features")
print("\nFeature sample:")
osm_features.head()


## 4. Graph Construction

Now we'll build a graph representation of our data. In this graph:
- Nodes represent Voronoi regions
- Edges represent adjacency relationships between regions
- Node features include OSM features and service request statistics

This graph structure allows us to use Graph Neural Networks (GNNs) to model the spatial relationships between different areas of the city.
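
GNN libraries typically store edges as a `2 x E` array of (source, target) index pairs rather than a dense matrix, which is why the code below reads the edge count from `edge_index.shape[1]`. Here is a minimal sketch of that conversion, assuming a small made-up adjacency matrix; it does not reproduce the road-based adjacency that STM-Graph computes.

```python
import numpy as np

# Hypothetical symmetric adjacency for 4 regions (1 = the two regions are neighbors)
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
])

# COO edge list: row 0 holds source nodes, row 1 holds target nodes
src, dst = np.nonzero(adj)
edge_index = np.stack([src, dst])
edge_weight = adj[src, dst].astype(float)

print(edge_index)
print(f"{edge_index.shape[1]} directed edges")
```

Each undirected neighbor relation appears twice, once per direction, so three adjacent pairs yield six columns.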

In [None]:
# Filter points that have valid mappings
gdf_311_valid = gdf_311[point_to_partition >= 0].copy()
point_to_partition_valid = point_to_partition[point_to_partition >= 0].copy()

print(f"Using {len(gdf_311_valid)} valid points for graph construction")

# Build graph with static features
graph_data = stm_graph.build_graph_and_augment(
    grid_gdf=district_gdf,
    points_gdf=gdf_311_valid,
    point_to_cell=point_to_partition_valid,
    adj_matrix=mapper.get_adjacency_matrix(),
    road_edges_gdf=mapper.get_road_network(),
    adjacency_type="road_based",
    remove_empty_nodes=True,
    out_dir=OUTPUT_DIR,
    save_flag=True,
    static_features=osm_features,
    meter_crs="EPSG:32618",
)

# Extract graph components
edge_index = graph_data["edge_index"]
edge_weight = graph_data["edge_weight"]
node_features = graph_data["node_features"]
augmented_df = graph_data["augmented_df"]
node_ids = graph_data["node_ids"]

print(f"Built graph with {edge_index.shape[1]} edges and {graph_data['num_nodes']} nodes")
print(f"Node features shape: {node_features.shape}")

# Display augmented dataframe
augmented_df.head()

## 5. Temporal Dataset Creation

With our graph structure in place, we'll now create a temporal dataset for time-aware analysis and prediction. We'll:

1. Bin service requests into daily intervals
2. Create sliding windows of data for training
3. Optionally add time-based features (day of week, hour of day); this is switched off below with `use_time_features=False`
4. Normalize the features for better model training

This temporal dataset will allow us to capture patterns in when and where 311 requests occur throughout NYC.
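
The binning and sliding-window steps can be sketched with pandas and NumPy as follows. This is a simplified illustration on made-up events, not the library's code: the variable names are hypothetical, normalization is omitted, and the binary target (did any request occur?) is an assumption about what a classification target looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical events: a timestamp plus the region (cell) each one falls in
events = pd.DataFrame({
    "created_time": pd.to_datetime([
        "2017-06-01 09:00", "2017-06-01 17:30", "2017-06-02 08:15",
        "2017-06-04 12:00", "2017-06-05 10:00", "2017-06-05 22:45",
    ]),
    "cell_id": [0, 1, 0, 1, 0, 0],
})

# 1. Bin into daily counts per cell; days without events become rows of zeros
events["day"] = events["created_time"].dt.floor("D")
counts = events.groupby(["day", "cell_id"]).size().unstack(fill_value=0)
all_days = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
counts = counts.reindex(all_days, fill_value=0)

# 2. Slide a window: `history` past bins as input, the bin `horizon` steps ahead as target
history, horizon = 3, 1
values = counts.to_numpy()
X, y = [], []
for t in range(history, len(values) - horizon + 1):
    X.append(values[t - history:t])
    y.append((values[t + horizon - 1] > 0).astype(int))  # 1 if any request occurred
X, y = np.array(X), np.array(y)

print(X.shape, y.shape)  # (samples, history, cells) and (samples, cells)
```

Five days of data with a history of three bins and a horizon of one leave two training samples, which shows how quickly windowing shortens a series.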

In [None]:
# Create temporal dataset
temporal_dataset, dataset_path, metadata = stm_graph.create_temporal_dataset(
    edge_index=edge_index,
    augmented_df=augmented_df,
    edge_weights=edge_weight,
    node_ids=node_ids,
    static_features=osm_features,
    time_col="created_time",
    cell_col="cell_id",
    bin_type="daily",
    interval_hours=1,
    history_window=3,
    use_time_features=False,
    task="classification",
    horizon=1,
    downsample_factor=1,
    normalize=True,
    scaler_type="minmax",
    dataset_name="nyc_311_dataset",
    output_format="4d",
)

## 6. Visualization

Now let's create visualizations to better understand our 311 data patterns. We'll create:

1. Time series plots showing service request trends over time
2. Spatial network visualizations showing request density across NYC
3. Temporal heatmaps showing patterns across time

These visualizations help reveal when and where different types of service requests occur, which can inform city resource allocation.

In [None]:
temporal_dataset_3d = stm_graph.convert_4d_to_3d_dataset(
    temporal_dataset, static_features_count=osm_features.shape[1])

# Plot time series for the most active nodes
stm_graph.plot_node_time_series(
    temporal_dataset_3d,
    num_nodes=5,  # Show 5 nodes
    selection_method="highest_activity",  # Select most active nodes
    feature_idx=0,  # Event count feature
    plot_type="2d",  # 2D line plot
    start_time="2017-06-01",  # Start date for x-axis (matches the start of filter_dates)
    time_delta=timedelta(hours=1),  # Hourly data
    title="311 Service Requests Over Time (Most Active Nodes)",
    figsize=(15, 8),
    out_dir=OUTPUT_DIR,
    filename="311_time_series_top",
    file_format="png",
)

In [None]:
# Plot 3D visualization for most active nodes
stm_graph.plot_node_time_series(
    temporal_dataset_3d,
    num_nodes=3,  # Show 3 nodes
    selection_method="highest_activity",  # Select most active nodes
    feature_idx=0,  # Event count feature
    plot_type="3d",  # 3D surface plot
    n_steps=168,  # First week (7 days * 24 hours)
    title="311 Service Requests 3D Visualization Over Time",
    figsize=(15, 10),
    out_dir=OUTPUT_DIR,
    filename="311_time_series_3d",
    file_format="png",
)

In [None]:
# Extract service request counts for each region (node) at a specific time
time_step = 24  # Example: events after 24 hours
node_counts = np.array(
    [
        temporal_dataset_3d.features[time_step][node, 0].item()
        for node in range(graph_data["num_nodes"])
    ]
)

# Plot spatial network with region and edge colors
stm_graph.plot_spatial_network(
    regions_gdf=district_gdf,
    edge_index=edge_index,
    edge_weights=edge_weight,
    node_values=node_counts,
    node_ids=node_ids,
    time_step=time_step,
    title="Service Request Density After 24 Hours",
    node_cmap="YlOrRd",  # Yellow-orange-red colormap for heat
    edge_cmap="viridis",  # Purple-to-yellow colormap for edges
    map_style=ctx.providers.CartoDB.Positron,
    figsize=(15, 15),
    out_dir=OUTPUT_DIR,
    filename="311_spatial_network",
    file_format="png",
)

In [None]:
# Create a temporal heatmap to see patterns across time and nodes
stm_graph.plot_temporal_heatmap(
    temporal_dataset_3d,
    num_nodes=10,
    feature_idx=0,  # Event count feature
    selection_method="highest_activity",
    time_delta=timedelta(hours=1),
    n_steps=168,  # First week
    title="311 Requests Temporal Heatmap (First Week)",
    figsize=(15, 8),
    out_dir=OUTPUT_DIR,
    filename="temporal_heatmap",
    file_format="png",
)

## 7. Weekly and Daily Patterns Analysis

Let's analyze the weekly and daily patterns in the 311 service request data to identify when different types of requests are most frequent.

In [None]:
# Convert the created_time to datetime if not already
gdf_311_valid['created_time'] = pd.to_datetime(gdf_311_valid['created_time'])

# Extract day of week and hour
gdf_311_valid['day_of_week'] = gdf_311_valid['created_time'].dt.day_name()
gdf_311_valid['hour_of_day'] = gdf_311_valid['created_time'].dt.hour

# Plot requests by day of week
plt.figure(figsize=(12, 6))
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_counts = gdf_311_valid['day_of_week'].value_counts().reindex(day_order)
day_counts.plot(kind='bar')
plt.title('311 Service Requests by Day of Week')
plt.ylabel('Number of Requests')
plt.tight_layout()
plt.show()

# Plot requests by hour of day
plt.figure(figsize=(12, 6))
hour_counts = gdf_311_valid['hour_of_day'].value_counts().sort_index()
hour_counts.plot(kind='bar')
plt.title('311 Service Requests by Hour of Day')
plt.xlabel('Hour (24-hour format)')
plt.ylabel('Number of Requests')
plt.xticks(range(0, 24))
plt.tight_layout()
plt.show()

In [None]:
# Analyze top complaint types by day of week
top_complaints = gdf_311_valid['complaint_type'].value_counts().head(5).index
day_complaint_counts = pd.crosstab(gdf_311_valid['day_of_week'], gdf_311_valid['complaint_type'])
day_complaint_counts = day_complaint_counts[top_complaints].reindex(day_order)
# DataFrame.plot opens its own figure, so pass figsize here rather than calling plt.figure first
day_complaint_counts.plot(kind='bar', stacked=True, figsize=(14, 8))
plt.title('Top 5 Complaint Types by Day of Week')
plt.ylabel('Number of Requests')
plt.legend(title='Complaint Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

## 8. Model Training

Finally, we'll train a Graph Neural Network (GNN) model to predict service request events. We'll start with the ST-GCN model, which is designed specifically for spatio-temporal graph data. Other models work as well; sample code for GCN and T-GCN follows the ST-GCN cell.

The model predicts whether service requests will occur in each area at the next time step. You can use STM-Graph's custom models or any supported model from PyTorch Geometric Temporal, and you can add further custom models. Training logs are saved to a logs folder inside the output directory. [Weights & Biases](https://wandb.ai/) is also integrated: after logging in, you can use its online dashboard to monitor training and track metrics live.
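
The training calls in this section pass `scheduler_type="step"`, `lr_decay_epochs=50`, and `lr_decay_factor=1`. Assuming these map onto a conventional step schedule (the rule used by PyTorch's `StepLR`, where the rate is multiplied by a factor every fixed number of epochs), the sketch below shows what the learning rate would be at a few epochs. The function `step_lr` is hypothetical and written only for this illustration; how STM-Graph actually applies these arguments is not confirmed here.

```python
def step_lr(initial_lr, epoch, decay_epochs, decay_factor):
    """Learning rate after `epoch` epochs under a standard step schedule."""
    return initial_lr * decay_factor ** (epoch // decay_epochs)

# Compare a factor of 1 (as in this notebook) with an illustrative factor of 0.5
for factor in (1.0, 0.5):
    schedule = [
        step_lr(1e-4, epoch, decay_epochs=50, decay_factor=factor)
        for epoch in (0, 49, 50, 100)
    ]
    print(f"factor={factor}: {schedule}")
```

Under that assumption, a factor of 1 keeps the learning rate constant for the whole run, so a value below 1 is needed if you want the rate to decay.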

In [None]:
# STGCN
model = stm_graph.create_model(
    model_name="stgcn",
    source="custom",
    num_nodes=temporal_dataset.features[0].shape[0],
    in_channels=temporal_dataset.features[0].shape[2],
    out_channels=1,
    hidden_dim=64,
    k=3,
    embedding_dimensions=16,
    dropout=0.2,
    task="classification",
)

# Train the model
results = stm_graph.train_model(
    model=model,
    dataset=temporal_dataset,
    optimizer_name="adam",
    learning_rate=0.0001,
    task="classification",
    num_epochs=500,  
    batch_size=10,
    batch_to_device=True,
    test_size=0.15,
    val_size=0.15,
    use_nested_tqdm=True,
    early_stopping=True,
    patience=50,
    scheduler_type="step",
    lr_decay_epochs=50,
    lr_decay_factor=1,
    wandb_project="stm_graph_311",
    experiment_name="stgcn",
    use_wandb=True, 
    fixed_batch_size=True,
    log_dir=OUTPUT_DIR,
)

In [None]:

# GCN
model = stm_graph.create_model(
    model_name="gcn",
    source="custom",
    in_channels=temporal_dataset_3d.features[0].shape[1],
    out_channels=1,
    hidden_channels=64,
    dropout=0.2,
    task="classification",
)

# Train the model
results = stm_graph.train_model(
    model=model,
    dataset=temporal_dataset_3d,
    optimizer_name="adam",
    learning_rate=0.0001,
    task="classification",
    num_epochs=500,  
    batch_size=10,
    batch_to_device=True,
    test_size=0.15,
    val_size=0.15,
    use_nested_tqdm=True,
    early_stopping=True,
    patience=50,
    scheduler_type="step",
    lr_decay_epochs=50,
    lr_decay_factor=1,
    wandb_project="stm_graph_311",
    experiment_name="gcn",
    use_wandb=True, 
    fixed_batch_size=True,
    log_dir=OUTPUT_DIR,
)

In [None]:

# TGCN
model = stm_graph.create_model(
    model_name="tgcn",
    source="custom",
    in_channels=temporal_dataset_3d.features[0].shape[1],
    out_channels=1,
    batch_size=1,
    task="classification",
)

# Train the model
results = stm_graph.train_model(
    model=model,
    dataset=temporal_dataset_3d,
    optimizer_name="adam",
    learning_rate=0.0001,
    task="classification",
    num_epochs=500,  
    batch_size=1,
    batch_to_device=True,
    test_size=0.15,
    val_size=0.15,
    use_nested_tqdm=True,
    early_stopping=True,
    patience=50,
    scheduler_type="step",
    lr_decay_epochs=50,
    lr_decay_factor=1,
    wandb_project="stm_graph_311",
    experiment_name="tgcn",
    use_wandb=True, 
    fixed_batch_size=True,
    log_dir=OUTPUT_DIR,
)