# Data Preparation for Optimization Model

Clean, map, and filter the raw data into standard CSV files for the Pyomo optimization model.

**Input Files:**
- `data/raw/Optional_Driving_Distance_and_Time.xlsx`
- `data/processed/forecast_2025_truck_only.csv`

**Output Files:**
- `ready_network_data.csv`: [From_ID, To_ID, Distance_Miles, Time_Mins]
- `ready_forecast_data.csv`: [Depot_ID, Day_Index, Volume_oz]
- `depot_mapping.csv`: [Depot_Name, Deposit_ID]

## 1. Setup

In [20]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Define paths
NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR.parent
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

## 2. Extract Depot Name to ID Mapping

Extract the existing mapping from the forecast data (`Depo` name -> `Deposit ID`).

In [21]:
# Load forecast data and extract mapping
forecast_df = pd.read_csv(DATA_PROCESSED / "forecast_2025_truck_only.csv")
depot_name_to_id = forecast_df.groupby('Depo')['Deposit ID'].first().to_dict()

# Identify the hub (its ID is expected to be 1); compare names case-insensitively,
# falling back to 1 if no depot name contains one of the hub aliases
hub_names = ["Oklahoma Mothers' Milk Bank", "Oklahoma Mothers Milk Bank", "Oklahoma City", "Milk Bank"]
hub_id = next((deposit_id for name, deposit_id in depot_name_to_id.items()
               if any(h.lower() in name.lower() for h in hub_names)), 1)

print(f"✓ Extracted mapping for {len(depot_name_to_id)} depots (Hub ID: {hub_id})")

✓ Extracted mapping for 36 depots (Hub ID: 1)
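
Substring matching of this kind only finds the hub when both sides are compared in the same case. A minimal sketch of a case-insensitive lookup follows; `find_hub_id` and the sample mapping are hypothetical names introduced for illustration and are not part of the notebook or the real forecast file.

```python
def find_hub_id(name_to_id, hub_names, default=1):
    """Return the ID of the first depot whose name contains a hub alias.

    Comparison is case-insensitive; `default` is returned when nothing matches.
    """
    aliases = [alias.casefold() for alias in hub_names]
    for name, depot_id in name_to_id.items():
        if any(alias in str(name).casefold() for alias in aliases):
            return depot_id
    return default


# Hypothetical sample mapping, not taken from the real data
sample_mapping = {"Tulsa Clinic": 7, "Oklahoma Mothers' Milk Bank": 1}
hub = find_hub_id(sample_mapping, ["Oklahoma Mothers' Milk Bank", "Milk Bank"])
```

Broad aliases such as "Milk Bank" may match more than one depot, so it can be worth checking that the returned ID is the one expected.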


## 3. Process Distance and Time Matrices

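Load the depot-pair distance and time data from Excel, standardize the column names, and apply the ID mapping. Rows with missing values or with depot names that have no ID are dropped.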

In [22]:
# Read the first Excel sheet (one row per depot pair)
distance_file = DATA_RAW / "Optional_Driving_Distance_and_Time.xlsx"
df = pd.read_excel(distance_file, sheet_name=0)

# Extract columns
from_col, to_col = 'Depot 1 Designation', 'Depot 2 Designation'
distance_col = next(col for col in df.columns if 'distance' in col.lower())
time_col = next(col for col in df.columns if 'time' in col.lower() and 'minute' in col.lower())

# Create network data with ID mapping
network_data = df[[from_col, to_col, distance_col, time_col]].copy()
network_data.columns = ['From_Name', 'To_Name', 'Distance_Miles', 'Time_Mins']
network_data = network_data.dropna()

# Map names to IDs
network_data['From_ID'] = network_data['From_Name'].map(depot_name_to_id)
network_data['To_ID'] = network_data['To_Name'].map(depot_name_to_id)
network_data = network_data.dropna(subset=['From_ID', 'To_ID'])
network_data[['From_ID', 'To_ID']] = network_data[['From_ID', 'To_ID']].astype(int)

print(f"✓ Processed network data: {len(network_data)} rows")

✓ Processed network data: 1260 rows
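
Because rows whose names are missing from the mapping are dropped without a message, it can help to list those names before filtering. The sketch below is one possible check; `unmapped_names` is a hypothetical helper, and the sample values are invented for illustration.

```python
def unmapped_names(names, name_to_id):
    """Return the sorted distinct string names that have no entry in name_to_id.

    Non-string entries (for example NaN from empty cells) are ignored here,
    since those rows are removed by the earlier dropna() anyway.
    """
    return sorted({n for n in names if isinstance(n, str) and n not in name_to_id})


# Hypothetical sample: "Depot C" has no ID and would be dropped silently
sample_names = ["Depot A", "Depot B", "Depot C", "Depot B"]
sample_mapping = {"Depot A": 1, "Depot B": 2}
missing = unmapped_names(sample_names, sample_mapping)
```

In the notebook, the same helper could be called on `df[from_col]` and `df[to_col]` with `depot_name_to_id` before the ID-based `dropna`.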


## 4. Filter Truck Candidate Nodes

Keep only the arcs whose endpoints are both Truck Candidate nodes. The hub (ID 1) must always be included.

In [23]:
# Get Truck Candidate Depot IDs (Hub ID=1 must be included)
truck_depot_ids = {depot_name_to_id[name] for name in forecast_df['Depo'].unique() if name in depot_name_to_id}
truck_depot_ids.add(1)

# Filter network data
ready_network = network_data[
    network_data['From_ID'].isin(truck_depot_ids) & 
    network_data['To_ID'].isin(truck_depot_ids)
][['From_ID', 'To_ID', 'Distance_Miles', 'Time_Mins']].copy()

print(f"✓ Filtered network data: {len(ready_network)} rows, {len(set(ready_network['From_ID']) | set(ready_network['To_ID']))} unique nodes")

✓ Filtered network data: 1260 rows, 36 unique nodes
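
The reported counts (1260 rows, 36 unique nodes) are consistent with one row per ordered pair of distinct depots, since 36 × 35 = 1260, provided there are no duplicate or self-loop rows. A minimal sketch of a completeness check is shown below; `missing_arcs` is a hypothetical helper and the three-node example is invented.

```python
from itertools import permutations


def missing_arcs(arcs, nodes):
    """Return the ordered pairs (i, j), i != j, over `nodes` that are absent from `arcs`."""
    present = set(arcs)
    return sorted(pair for pair in permutations(sorted(nodes), 2) if pair not in present)


# A complete directed network on 36 nodes has 36 * 35 arcs
expected_arc_count = 36 * 35

# Hypothetical three-node example with only half of the arcs present
gaps = missing_arcs([(1, 2), (2, 1), (1, 3)], {1, 2, 3})
```

In the notebook, the arcs could be supplied as `zip(ready_network['From_ID'], ready_network['To_ID'])` and the nodes as `truck_depot_ids`; an empty result would indicate that every depot pair has a distance and time entry.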


## 5. Format Forecast Data

Add a `Day_Index` column (1 = January 1, 2025) and keep only the Truck Candidate depots, excluding the hub.

In [24]:
# Calculate Day_Index and map to IDs
forecast_df['Date_2025'] = pd.to_datetime(forecast_df['Date_2025'])
forecast_df['Day_Index'] = (forecast_df['Date_2025'] - pd.Timestamp('2025-01-01')).dt.days + 1
forecast_df['Depot_ID'] = forecast_df['Depo'].map(depot_name_to_id)

# Filter to Truck Candidate depots (exclude Hub) and prepare output
ready_forecast = forecast_df[
    forecast_df['Depot_ID'].notna() & 
    forecast_df['Depot_ID'].isin(truck_depot_ids - {1})
][['Depot_ID', 'Day_Index', 'Volume_2025']].copy()
ready_forecast.columns = ['Depot_ID', 'Day_Index', 'Volume_oz']
ready_forecast['Depot_ID'] = ready_forecast['Depot_ID'].astype(int)

print(f"✓ Formatted forecast data: {len(ready_forecast)} rows")

✓ Formatted forecast data: 3969 rows
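
The `Day_Index` arithmetic is the number of days since January 1, 2025, plus one, so the index is 1-based. The same calculation can be sketched with the standard library; `day_index` is a hypothetical helper used only to illustrate the formula.

```python
from datetime import date


def day_index(d, year_start=date(2025, 1, 1)):
    """Return the 1-based day offset of `d` from `year_start` (Jan 1 -> 1)."""
    return (d - year_start).days + 1


first_day = day_index(date(2025, 1, 1))
march_first = day_index(date(2025, 3, 1))
last_day = day_index(date(2025, 12, 31))
```

Since 2025 is not a leap year, the index runs from 1 to 365 for dates within the year.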


## 6. Save Output Files

In [25]:
# Save output files
ready_network.to_csv(DATA_PROCESSED / "ready_network_data.csv", index=False)
ready_forecast.to_csv(DATA_PROCESSED / "ready_forecast_data.csv", index=False)

depot_mapping_df = pd.DataFrame([
    {'Depot_Name': name, 'Deposit_ID': deposit_id}
    for name, deposit_id in sorted(depot_name_to_id.items(), key=lambda x: x[1])
])
depot_mapping_df.to_csv(DATA_PROCESSED / "depot_mapping.csv", index=False)

print("✓ Files saved:")
print(f"  - ready_network_data.csv: {len(ready_network):,} rows")
print(f"  - ready_forecast_data.csv: {len(ready_forecast):,} rows")
print(f"  - depot_mapping.csv: {len(depot_mapping_df)} depots")

✓ Files saved:
  - ready_network_data.csv: 1,260 rows
  - ready_forecast_data.csv: 3,969 rows
  - depot_mapping.csv: 36 depots
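
Before handing the files to the optimization model, a quick header check can confirm that each CSV has the columns listed under Output Files. The sketch below is one way to do that under stated assumptions: `header_matches` is a hypothetical helper, and the sample content is an in-memory stand-in rather than real output.

```python
import csv
import io

# Column layouts as listed in the Output Files section
EXPECTED_COLUMNS = {
    "ready_network_data.csv": ["From_ID", "To_ID", "Distance_Miles", "Time_Mins"],
    "ready_forecast_data.csv": ["Depot_ID", "Day_Index", "Volume_oz"],
    "depot_mapping.csv": ["Depot_Name", "Deposit_ID"],
}


def header_matches(handle, expected):
    """Return True if the first CSV row read from an open text handle equals `expected`."""
    return next(csv.reader(handle), None) == expected


# In-memory stand-in for a saved forecast file (invented values)
sample_file = io.StringIO("Depot_ID,Day_Index,Volume_oz\n2,1,120.5\n")
ok = header_matches(sample_file, EXPECTED_COLUMNS["ready_forecast_data.csv"])
```

For the real files, each path under `DATA_PROCESSED` could be opened with `open(path, newline="")` and passed to the same helper.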
