# Generating OD matrices for the demand

We're going to work with demand data provided for various types of traffic in Fremont.
Demand datasets **(that we currently have)** fall into three categories based on **origin** and **destination** of the cars driving through the areas:

1. Cars that **start** their trip within internal centroid zones and **end** within internal centroid zones **(internal-internal demand)**
2. Cars that **start** their trip within internal centroid zones and **end** within external centroid zones **(internal-external demand)**
3. Cars that **start** their trip within external centroid zones and **end** within internal centroid zones **(external-internal demand)**

As you can see, for now we have no data on cars that drive through the area, i.e. cars that **start** their trip within external centroid zones (coming from outside) and **end** their trip within external centroid zones (heading somewhere beyond the city).
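The three categories above, plus the missing through-traffic case, follow from two yes/no questions about each trip: does it start in an internal zone, and does it end in one? A minimal sketch, assuming a hypothetical helper `classify_leg` that is not part of the original notebook:

```python
def classify_leg(origin_is_internal, destination_is_internal):
    """Return the demand category of a trip (hypothetical helper, for illustration only)."""
    origin = 'internal' if origin_is_internal else 'external'
    destination = 'internal' if destination_is_internal else 'external'
    return origin + '-' + destination


# The fourth combination is the through traffic we currently have no data for
through_traffic_category = classify_leg(False, False)
print(through_traffic_category)
```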

In [1]:
# --- Global variables

# Setting up the Coordinate Reference Systems up front in the necessary format.
crs_degree = {'init': 'epsg:4326'} # GCS_WGS_1984 (what the GPS uses)

# --- Paths

# Root path of Fremont Dropbox
from fremontdropbox import get_dropbox_location
dbx = get_dropbox_location()

# Temporary! Location of the folder where the restructuring is currently happening
data_path = dbx + '/Private Structured data collection'

# Processing output path
output_path = data_path + '/Data processing/Demand/Temporary exports'

Internal and external centroid zones are exported from ArcGIS. They contain geometries (polygons drawn around each zone), so they can be loaded with `GeoPandas` as a `GeoDataFrame`:

In [2]:
# Read more about GeoPandas data structures here: http://geopandas.org/data_structures.html
from geopandas import GeoDataFrame

# Centroid zones
int_centroid_zones = GeoDataFrame.from_file(data_path + "/Data processing/Demand/TAZ/InternalCentroidZones.shp")
ext_centroid_zones = GeoDataFrame.from_file(data_path + "/Data processing/Demand/TAZ/ExternalCentroidZones.shp")

Let's load all the Fremont "legs", i.e. all categories of demand data we have:

In [3]:
import pandas as pd

# Cars that stay within internal centroid zones
internal_legs = pd.read_csv(data_path+'/Data processing/Demand/SFCTA demand data/internal_fremont_legs.csv')

# Cars that start within internal centroid zones and end outside
starting_legs = pd.read_csv(data_path+'/Data processing/Demand/SFCTA demand data/starting_fremont_legs.csv')

# Cars that start outside and end within internal centroid zones
ending_legs = pd.read_csv(data_path+'/Data processing/Demand/SFCTA demand data/ending_fremont_legs.csv')

We have to convert latitude and longitude into **`Point` geometries** and convert the Fremont legs `DataFrame`s into `GeoDataFrame`s. To "enhance" the datasets with geometries, we define the `add_point_geometry` function:

In [4]:
def add_point_geometry(df, lng_column='start_node_lng', lat_column='start_node_lat', geometry_column='geometry'):
    """
    Add a new Point geometry column
    Parameters
    ----------
    df : DataFrame
        DataFrame representing demand legs (internal|starting|ending)
    lat_column : string
        Name of the column representing latitude
    lng_column : string
        Name of the column representing longitude
    geometry_column : string
        Name of the column that will represent geometry column

    Returns
    -------
    df_with_geometry : GeoDataFrame
        GeoDataFrame representing demand legs (internal|starting|ending) with added point geometry.
    """
    # Process XY coordinates data as Point geometry
    from shapely.geometry import Point
    points = [Point(xy) for xy in zip(df[lng_column], df[lat_column])]
    
    gdf = GeoDataFrame(df, crs=crs_degree, geometry=points)
    gdf = gdf.rename(columns={'geometry': geometry_column}).set_geometry(geometry_column)
    return gdf

In [5]:
# Geometries column names
start_node_geometry_column = 'start_node_geometry'
end_node_geometry_column = 'end_node_geometry'

# Converting each leg into GeoDataFrame (with two geometries, for start and end nodes)
int_int_legs = add_point_geometry(internal_legs, lng_column='start_node_lng', lat_column='start_node_lat', geometry_column=start_node_geometry_column)
int_int_legs = add_point_geometry(int_int_legs, lng_column='end_node_lng', lat_column='end_node_lat', geometry_column=end_node_geometry_column)

int_ext_legs = add_point_geometry(starting_legs, lng_column='start_node_lng', lat_column='start_node_lat', geometry_column=start_node_geometry_column)
int_ext_legs = add_point_geometry(int_ext_legs, lng_column='end_node_lng', lat_column='end_node_lat', geometry_column=end_node_geometry_column)

ext_int_legs = add_point_geometry(ending_legs, lng_column='start_node_lng', lat_column='start_node_lat', geometry_column=start_node_geometry_column)
ext_int_legs = add_point_geometry(ext_int_legs, lng_column='end_node_lng', lat_column='end_node_lat', geometry_column=end_node_geometry_column)

---

## Spatial joins

In a Spatial Join, observations from GeoDataFrames are combined **based on their spatial relationship to one another**.

Docs: http://geopandas.org/mergingdata.html
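To make the idea concrete without any GIS dependency, here is a conceptual sketch of what a left join with the `within` predicate does: each point receives the ID of the zone polygon that contains it, or nothing if no zone does. This is not how GeoPandas implements `sjoin`. The ray-casting test and the names `point_in_polygon` and `left_join_within` are illustrative assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon, given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def left_join_within(points, zones):
    """For each point, return the ID of the first zone containing it, or None if there is none."""
    joined = []
    for x, y in points:
        zone_id = None
        for candidate_id, polygon in zones.items():
            if point_in_polygon(x, y, polygon):
                zone_id = candidate_id
                break
        joined.append(zone_id)
    return joined


# Two adjacent square zones and three points (toy coordinates)
zones = {1: [(0, 0), (2, 0), (2, 2), (0, 2)], 2: [(2, 0), (4, 0), (4, 2), (2, 2)]}
points = [(1, 1), (3, 1), (5, 5)]
zone_ids = left_join_within(points, zones)
print(zone_ids)
```

Unlike this sketch, which stops at the first match, a real `sjoin` returns one row per matching zone, so a point that falls inside two overlapping zones appears twice in the result.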

In [6]:
from geopandas import sjoin

def spatial_join_nodes_with_centroids(gdf, centroid_zones, how='left', op='within', type='origin'):
    # Validate the argument before using it
    if type not in ['origin', 'destination']:
        raise ValueError('{type} argument is incorrect, use "origin" or "destination"'.format(type=repr(type)))

    # Join on the start node for origins and on the end node for destinations
    if type == 'origin':
        gdf_to_join = gdf.set_geometry(start_node_geometry_column)
    else:
        gdf_to_join = gdf.set_geometry(end_node_geometry_column)

    centroid_id_column = "CentroidID_O" if type == 'origin' else "CentroidID_D"

    # Pass the caller's `how` and `op` through instead of hard-coding them
    gdf_to_join = sjoin(gdf_to_join, centroid_zones, how=how, op=op)
    gdf_to_join.rename(
        columns={
            "CentroidID": centroid_id_column
        },
        inplace=True
    )

    for column in ['index_left', 'index_right', 'OBJECTID']:
        try:
            gdf_to_join.drop(column, axis=1, inplace=True)
        except KeyError:
            # ignore if there are no index columns
            pass
        
    return gdf_to_join

In [7]:
# Internal to internal OD matrix
int_int_start_nodes = spatial_join_nodes_with_centroids(int_int_legs, int_centroid_zones, type='origin')
int_int_end_nodes = spatial_join_nodes_with_centroids(int_int_legs, int_centroid_zones, type='destination')

int_int_OD = int_int_start_nodes.combine_first(int_int_end_nodes)
int_int_OD['OBJECTID'] = int_int_OD.index + 1
# Parse CentroidIDs to be numeric
int_int_OD["CentroidID_O"] = pd.to_numeric(int_int_OD["CentroidID_O"], downcast='signed')
int_int_OD["CentroidID_D"] = pd.to_numeric(int_int_OD["CentroidID_D"], downcast='signed')

In [8]:
# Internal to external OD matrix
int_ext_start_nodes = spatial_join_nodes_with_centroids(int_ext_legs, int_centroid_zones, type='origin')
int_ext_end_nodes = spatial_join_nodes_with_centroids(int_ext_legs, ext_centroid_zones, type='destination')

int_ext_OD = int_ext_start_nodes.combine_first(int_ext_end_nodes)
int_ext_OD['OBJECTID'] = int_ext_OD.index + 1
# Parse CentroidIDs to be numeric
int_ext_OD["CentroidID_O"] = pd.to_numeric(int_ext_OD["CentroidID_O"], downcast='signed')
int_ext_OD["CentroidID_D"] = pd.to_numeric(int_ext_OD["CentroidID_D"], downcast='signed')

### Attention! Index 56707 of int_ext_OD is duplicated. I don't know if it's okay!

In [9]:
# Code to check it out:
# int_ext_end_nodes.loc[[56705, 56706, 56707, 56708, 56709, 56710]]
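One possible cause of a duplicated index label (an assumption, not verified against the data) is a node that falls inside more than one zone polygon, since a left spatial join emits one row per match. A minimal sketch of how such rows could be listed and, if appropriate, reduced to a single row per label, using a toy frame with made-up values:

```python
import pandas as pd

# Toy frame whose index label 2 appears twice (illustrative values only)
od = pd.DataFrame(
    {'leg_id': [10, 11, 12, 12], 'CentroidID_D': [5, 6, 7, 8]},
    index=[0, 1, 2, 2],
)

# keep=False marks every occurrence of a repeated label, not just the later ones
duplicated_rows = od[od.index.duplicated(keep=False)]
print(duplicated_rows)

# One possible resolution (an assumption, not the notebook's decision): keep the first match
deduplicated = od[~od.index.duplicated(keep='first')]
```

Whether keeping the first match is acceptable depends on why the zones overlap, so the listing step is the safer starting point.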

In [10]:
# External to internal OD matrix
ext_int_start_nodes = spatial_join_nodes_with_centroids(ext_int_legs, ext_centroid_zones, type='origin')
ext_int_end_nodes = spatial_join_nodes_with_centroids(ext_int_legs, int_centroid_zones, type='destination')

ext_int_OD = ext_int_start_nodes.combine_first(ext_int_end_nodes)
ext_int_OD['OBJECTID'] = ext_int_OD.index + 1
# Parse CentroidIDs to be numeric
ext_int_OD["CentroidID_O"] = pd.to_numeric(ext_int_OD["CentroidID_O"], downcast='signed')
ext_int_OD["CentroidID_D"] = pd.to_numeric(ext_int_OD["CentroidID_D"], downcast='signed')

---

## Export OD matrices as CSV files 

In [11]:
def export_od_matrix_to_csv(df, path):
    """
    Exports an origin-destination matrix into CSV

    Parameters
    ----------
    df : DataFrame
        DataFrame representing OD matrix
    path : string
        Output path
    """
    if path == '' or path is None:
        raise ValueError('"path" cannot be empty.')

    df.to_csv(
        path,
        encoding='utf8',
        columns=["OBJECTID", "leg_id","start_time","start_node_lat","start_node_lng","end_node_lat","end_node_lng","CentroidID_O","CentroidID_D"]
    )

In [12]:
# Export all OD matrices
export_od_matrix_to_csv(int_int_OD, output_path+'/int_int_OD.csv')
export_od_matrix_to_csv(int_ext_OD, output_path+'/int_ext_OD.csv')
export_od_matrix_to_csv(ext_int_OD, output_path+'/ext_int_OD.csv')

---

## Clustering the OD matrices

...there will be some description...

### Cluster demand data per 15min, set (origin, dest) as index, time as column

In [13]:
from pytz import timezone, utc

local_tz = timezone('US/Pacific')

def cluster_demand_15min(df):
    """
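    Counts trips per origin-destination pair and 15-minute time bin.

    The returned DataFrame has one row per group:

    -----------------------------------------------
    | CentroidID_D | CentroidID_O | dt_15 | count |
    -----------------------------------------------

    Parameters
    ----------
    df : DataFrame
        DataFrame representing OD matrix. Note that the `dt` and `dt_15`
        columns are added to it in place.

    Returns
    -------
    grouped_od_demand_15min : DataFrame
        Number of trips for each (CentroidID_D, CentroidID_O, dt_15) group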
    """
    demand_df = df
    demand_df['dt'] = pd.to_datetime(demand_df['start_time'])
    dt_15=[]
    for dt in demand_df['dt']:
        # Replace each dt value (start_time) with the time in current 15 minute chunk of the hour
        # (e.g. 22:39 -> 22:30 as it's past 22:30 but before 22:45)
        dt_15.append(dt.replace(minute=int(dt.minute/15)*15,second = 0).replace(tzinfo=utc))

    demand_df['dt_15'] = dt_15
    grouped_od_demand_15min = demand_df.groupby(['CentroidID_D', 'CentroidID_O', 'dt_15']).size().reset_index(name='count')
    return grouped_od_demand_15min
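The per-row loop above can also be expressed with pandas' vectorized `dt.floor`. The sketch below uses made-up timestamps and leaves out the timezone handling, which the function above does by attaching UTC; it only illustrates the binning and counting steps.

```python
import pandas as pd

# Three illustrative start times (made-up values)
start_times = pd.Series(pd.to_datetime([
    '2000-01-01 22:39:10',
    '2000-01-01 22:44:59',
    '2000-01-01 22:45:00',
]))

# Round each timestamp down to the start of its 15-minute bin
binned = start_times.dt.floor('15min')

# Count trips per bin, ordered by time
counts = binned.value_counts().sort_index()
print(counts)
```

Note that `dt.floor` also zeroes sub-second components, which `replace(minute=..., second=0)` does not.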

In [14]:
int_int_OD_demand_cluster_15min = cluster_demand_15min(int_int_OD)
int_int_OD_demand_cluster_15min.to_csv(output_path+'/int_int_OD_demand_cluster_15min.csv')

In [15]:
int_ext_OD_demand_cluster_15min = cluster_demand_15min(int_ext_OD)
int_ext_OD_demand_cluster_15min.to_csv(output_path+'/int_ext_OD_demand_cluster_15min.csv')

In [16]:
ext_int_OD_demand_cluster_15min = cluster_demand_15min(ext_int_OD)
ext_int_OD_demand_cluster_15min.to_csv(output_path+'/ext_int_OD_demand_cluster_15min.csv')

---

## External to external demand estimation

~~### Going from north to south~~

~~PeMS detectors positioned at the entering of the highway on the south side: **403251**.~~
~~PeMS detectors positioned at the exiting of the highway on the north side: **402798**.~~

### Going from south to north

PeMS detector positioned at the highway entrance on the south side: **403250**.
PeMS detector positioned at the highway exit on the north side: **402799**.

This will be an estimate of external-to-external demand from PeMS data along the highway (going north). We'll use **PeMS detector 403250** as the origin and **PeMS detector 402799** as the destination.

PeMS vehicle data is sampled at 5-minute intervals, so we first need to aggregate it into 15-minute intervals.
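As a minimal illustration of that aggregation, the sketch below sums six made-up 5-minute readings (not PeMS data) into two 15-minute bins with `resample`:

```python
import pandas as pd

# Six illustrative 5-minute readings (made-up values, not PeMS data)
timestamps = pd.date_range('2019-03-05 00:00', periods=6, freq='5min')
flow_5min = pd.Series([10, 12, 8, 20, 15, 5], index=timestamps)

# Each 15-minute bin is the sum of the three 5-minute counts it covers
flow_15min = flow_5min.resample('15min').sum()
print(flow_15min)
```

Summing is the right aggregation here because the readings are vehicle counts. A rate or speed column would need a mean instead.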

In [52]:
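def cluster_pems_by_15_min(path):
    """Aggregates 5-minute PeMS flow counts into 15-minute bins for a single day."""
    pems_dataset = pd.read_excel(path)
    cluster_pems_15_min = pd.DataFrame()
    # Keep only readings taken on the 5th day of the month; copy to avoid modifying a slice
    pems_1_day = pems_dataset[pems_dataset['5 Minutes'].dt.day.eq(5)].copy()
    # Use the timestamp as index so the flow column can be resampled
    pems_1_day.index = pems_1_day['5 Minutes']
    pems_1_day.index.name = '15min'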
    cluster_pems_15_min['Flow (Veh/15min)'] = pems_1_day['Flow (Veh/5 Minutes)'].resample('15min').sum()
    cluster_pems_15_min.reset_index(inplace=True)
    cluster_pems_15_min.index.name = None
    cluster_pems_15_min['df_15'] = cluster_pems_15_min['15min'].map(
        lambda x: pd.Timestamp(
            year=2000,
            month=1,
            day=1,
            hour=x.hour,
            minute=x.minute,
            second=x.second
        )
    )
    return cluster_pems_15_min

Here we're using PeMS data from 2019:

In [53]:
pems_o_cluster_15_min = cluster_pems_by_15_min(data_path+'/Data processing/Demand/External to external inference/PeMS/403250_2019.xlsx')
pems_d_cluster_15_min = cluster_pems_by_15_min(data_path+'/Data processing/Demand/External to external inference/PeMS/402799_2019.xlsx')

### Demand inference
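One way to read the idea behind this section is a flow balance: the vehicles counted by a detector in a 15-minute bin, minus the trips in that bin already explained by the external-internal (or internal-external) demand at the same centroid, leave the through traffic. The sketch below is an assumption about how that could be computed, not the notebook's final method. The name `through_traffic`, the clipping at zero, and all values are illustrative.

```python
import pandas as pd


def through_traffic(detector_flow, local_demand):
    """Detector flow minus already-known demand, per 15-minute bin, floored at 0."""
    # Bins without any known demand count as 0
    local = local_demand.reindex(detector_flow.index, fill_value=0)
    # Negative differences are clipped, since a flow cannot be negative
    return (detector_flow - local).clip(lower=0)


bins = pd.date_range('2000-01-01 13:00', periods=3, freq='15min')
flow_o = pd.Series([120, 150, 90], index=bins)         # made-up detector counts
ext_int_from_o = pd.Series([20, 200], index=bins[:2])  # made-up ext->int counts
estimate = through_traffic(flow_o, ext_int_from_o)
print(estimate)
```

Both series must be indexed by the same kind of 15-minute timestamps (same date convention and timezone) for the subtraction to align.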

In [54]:
# estimate ext_ext OD data
# estimate ext20_ext13, Centroid_O = 23 (pems 403250), Centroid_D = 31 (pems 402799), going north along highway
def compute_ext_ext(origin_centroid, dest_centroid, cluster_OD_15min_ext_int, cluster_OD_15min_int_ext):
    """
    Computes the external to external estimation
    """
    display(pems_o_cluster_15_min)
    display(pems_d_cluster_15_min)
    flow_o = pems_o_cluster_15_min['Flow (Veh/15min)']
    flow_d = pems_d_cluster_15_min['Flow (Veh/15min)']
    display(flow_o)
    
    ext_int = cluster_OD_15min_ext_int
    display(ext_int)
    
    # Cars that drive through the origin cluster (Centroid_O = 20) are external to internal.
    # We need the sum of all ext_int demand departing from that external node:
    #sum_ext_int_o = ext_int[ext_int['CentroidID_O']==origin_centroid].sum()['count']
    #display(sum_ext_int_o)
    
    # Work in progress: this early return makes the code below unreachable
    return
    int_ext = cluster_OD_15min_int_ext.reset_index()
    # sum of all int_ext demand arrive at an ext node
    sum_int_ext_d = int_ext[int_ext['CentroidID_D']==dest_centroid].sum()['count'].reset_index()[0]
    display(sum_int_ext_d)
    result1 = flow_o-sum_ext_int_o
    result2 = flow_d-sum_int_ext_d
    display(result1)
    display(result2)
    final_result = result1-result2
    return final_result

ext_ext_OD_col_15 = compute_ext_ext(23, 31, ext_int_OD_demand_cluster_15min, int_ext_OD_demand_cluster_15min)

| | 15min | Flow (Veh/15min) | df_15 |
|---|---|---|---|
| 0 | 2019-03-05 00:00:00 | 123 | 2000-01-01 00:00:00 |
| 1 | 2019-03-05 00:15:00 | 145 | 2000-01-01 00:15:00 |
| 2 | 2019-03-05 00:30:00 | 90 | 2000-01-01 00:30:00 |
| 3 | 2019-03-05 00:45:00 | 76 | 2000-01-01 00:45:00 |
| 4 | 2019-03-05 01:00:00 | 79 | 2000-01-01 01:00:00 |
| ... | ... | ... | ... |
| 91 | 2019-03-05 22:45:00 | 215 | 2000-01-01 22:45:00 |
| 92 | 2019-03-05 23:00:00 | 197 | 2000-01-01 23:00:00 |
| 93 | 2019-03-05 23:15:00 | 158 | 2000-01-01 23:15:00 |
| 94 | 2019-03-05 23:30:00 | 153 | 2000-01-01 23:30:00 |


| | 15min | Flow (Veh/15min) | df_15 |
|---|---|---|---|
| 0 | 2019-03-05 00:00:00 | 351 | 2000-01-01 00:00:00 |
| 1 | 2019-03-05 00:15:00 | 297 | 2000-01-01 00:15:00 |
| 2 | 2019-03-05 00:30:00 | 248 | 2000-01-01 00:30:00 |
| 3 | 2019-03-05 00:45:00 | 209 | 2000-01-01 00:45:00 |
| 4 | 2019-03-05 01:00:00 | 177 | 2000-01-01 01:00:00 |
| ... | ... | ... | ... |
| 91 | 2019-03-05 22:45:00 | 657 | 2000-01-01 22:45:00 |
| 92 | 2019-03-05 23:00:00 | 607 | 2000-01-01 23:00:00 |
| 93 | 2019-03-05 23:15:00 | 563 | 2000-01-01 23:15:00 |
| 94 | 2019-03-05 23:30:00 | 503 | 2000-01-01 23:30:00 |


0     123
1     145
2      90
3      76
4      79
     ... 
91    215
92    197
93    158
94    153
95    178
Name: Flow (Veh/15min), Length: 96, dtype: int64

| | CentroidID_D | CentroidID_O | dt_15 | count |
|---|---|---|---|---|
| 0 | 1 | 4 | 2000-01-01 12:30:00+00:00 | 1 |
| 1 | 1 | 4 | 2000-01-01 14:00:00+00:00 | 4 |
| 2 | 1 | 4 | 2000-01-01 14:15:00+00:00 | 1 |
| 3 | 1 | 4 | 2000-01-01 14:30:00+00:00 | 3 |
| 4 | 1 | 4 | 2000-01-01 14:45:00+00:00 | 3 |
| ... | ... | ... | ... | ... |
| 12742 | 30 | 22 | 2000-01-02 06:30:00+00:00 | 1 |
| 12743 | 30 | 22 | 2000-01-02 06:45:00+00:00 | 1 |
| 12744 | 30 | 22 | 2000-01-02 07:00:00+00:00 | 2 |
| 12745 | 30 | 22 | 2000-01-02 07:15:00+00:00 | 2 |


In [None]:
def compare_with_demand(origin_centroid, dest_centroid, cluster_OD_15min_ext_int, cluster_OD_15min_int_ext):
    """
    Compares inferred external to external demand with int_ext and ext_int data
    """
    # Cars that drive through the origin cluster (Centroid_O = 20) are external to internal.
    # We need the sum of all ext_int demand departing from that external node:
    #sum_ext_int_o = ext_int[ext_int['CentroidID_O']==origin_centroid].sum()['count']
    #display(sum_ext_int_o)
    
    # Work in progress: this early return makes the code below unreachable
    return
    int_ext = cluster_OD_15min_int_ext.reset_index()
    # sum of all int_ext demand arrive at an ext node
    sum_int_ext_d = int_ext[int_ext['CentroidID_D']==dest_centroid].sum()['count'].reset_index()[0]
    display(sum_int_ext_d)
    result1 = flow_o-sum_ext_int_o
    result2 = flow_d-sum_int_ext_d
    display(result1)
    display(result2)
    final_result = result1-result2
    return final_result

ext_ext_comparison = compare_with_demand(20, 13, ext_int_OD_demand_cluster_15min, int_ext_OD_demand_cluster_15min)

In [22]:
ext_ext_OD_col_15 = compute_ext_ext(20, 13, ext_int_OD_demand_cluster_15min, int_ext_OD_demand_cluster_15min)
#ext_ext_OD_col_15.head()

0      79
1      87
2     124
3     115
4     168
     ... 
79    215
80    197
81    158
82    153
83    178
Name: Flow (Veh/15min), Length: 84, dtype: int64

| | index | CentroidID_D | CentroidID_O | dt_15 | count |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 4 | 2000-01-01 12:30:00+00:00 | 1 |
| 1 | 1 | 1 | 4 | 2000-01-01 14:00:00+00:00 | 4 |
| 2 | 2 | 1 | 4 | 2000-01-01 14:15:00+00:00 | 1 |
| 3 | 3 | 1 | 4 | 2000-01-01 14:30:00+00:00 | 3 |
| 4 | 4 | 1 | 4 | 2000-01-01 14:45:00+00:00 | 3 |
| ... | ... | ... | ... | ... | ... |
| 12742 | 12742 | 30 | 22 | 2000-01-02 06:30:00+00:00 | 1 |
| 12743 | 12743 | 30 | 22 | 2000-01-02 06:45:00+00:00 | 1 |
| 12744 | 12744 | 30 | 22 | 2000-01-02 07:00:00+00:00 | 2 |
| 12745 | 12745 | 30 | 22 | 2000-01-02 07:15:00+00:00 | 2 |


12836

---

## Demand matrices from centroid to centroid

Each 15-minute time frame is exported as a separate file.

In [None]:
ext_int_df = ext_int_OD_demand_cluster_15min
ext_int_df['CentroidID_O_name'] = ['E' + str(i) for i in ext_int_df['CentroidID_O']]
ext_int_df['CentroidID_D_name'] = ['I' + str(i) for i in ext_int_df['CentroidID_D']]

int_ext_df = int_ext_OD_demand_cluster_15min
int_ext_df['CentroidID_O_name'] = ['I' + str(i) for i in int_ext_df['CentroidID_O']]
int_ext_df['CentroidID_D_name'] = ['E' + str(i) for i in int_ext_df['CentroidID_D']]

int_int_df = int_int_OD_demand_cluster_15min
int_int_df['CentroidID_O_name'] = ['I' + str(i) for i in int_int_df['CentroidID_O']]
int_int_df['CentroidID_D_name'] = ['I' + str(i) for i in int_int_df['CentroidID_D']]

# Concat matrices (pd.concat replaces the deprecated DataFrame.append)
demand_matrices = pd.concat([ext_int_df, int_ext_df, int_int_df], ignore_index=True)

# Normalize count: scale 15-minute counts by 4 to express them per hour
demand_matrices['count'] = demand_matrices['count']*4

# Filter time: keep the bins from 13:00 (inclusive) to 19:00 (exclusive)
demand_matrices['min'] = pd.to_datetime(demand_matrices['dt_15'])
demand_matrices = demand_matrices[(demand_matrices['min'] >= pd.Timestamp(2000, 1, 1, 13, 0).replace(tzinfo=utc)) & (demand_matrices['min'] < pd.Timestamp(2000, 1, 1, 19, 0).replace(tzinfo=utc))]

# Save by timestamp group
from pathlib import Path
import numpy as np

Path(output_path+'/OD grouped by timestamp').mkdir(parents=True, exist_ok=True)

for timestamp in demand_matrices['min'].unique():
    demand_matrices_export = demand_matrices.groupby('min').get_group(timestamp)
    demand_matrices_export = pd.pivot_table(demand_matrices_export, values='count', index='CentroidID_O_name', columns='CentroidID_D_name', aggfunc=np.sum)
    demand_matrices_export = demand_matrices_export.fillna(0)
    # Save
    demand_matrices_export.insert(0, '', value=demand_matrices_export.index)
    demand_matrices_export.to_csv(output_path+'/OD grouped by timestamp/'+str(timestamp.isoformat()).replace(':', '-')+'.csv', index=False)
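To show what one exported matrix looks like, here is a minimal sketch of the pivot step on three made-up records for a single time frame. The centroid names and counts are illustrative only.

```python
import pandas as pd

# Three illustrative OD records for one 15-minute time frame (made-up values)
records = pd.DataFrame({
    'CentroidID_O_name': ['E1', 'E1', 'I2'],
    'CentroidID_D_name': ['I2', 'I3', 'E1'],
    'count': [4, 8, 12],
})

# Rows are origins, columns are destinations; pairs without trips become 0
od_matrix = pd.pivot_table(
    records,
    values='count',
    index='CentroidID_O_name',
    columns='CentroidID_D_name',
    aggfunc='sum',
).fillna(0)
print(od_matrix)
```

Only centroids that appear in the records of a given time frame show up as rows or columns, so matrices for different time frames can have different shapes.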