## Background  
This Jupyter Notebook calculates the percentage of households that **do not regularly use highways during peak-hour periods** by income groups.

## Inputs  
- `trip.csv`
- `household.csv`
- `trip_motorway_bridge_boolean.csv`: this is the output of the Jupyter Notebook `Generate_trip_motorway_booleans.ipynb`

## Outputs 
- `percent_noPNBMtrip_by_income.csv`: 

## Caveats 
- The `highway=motorway` tag in OpenStreetMap (OSM) is not always reliable. For this analysis, we have used it as-is and have not made corrections to its classification. It doesn't look too bad, and I'll share a map soon!

-  This analysis includes a time-of-day component. Specifically, it focuses on trips that use the freeway during the peak period. However, because the GeoDataFrame matched_path_gdf does not contain time stamps (double-check this!), I rely on the depart_hour and arrive_hour fields from the trip file. This introduces some limitations, since the freeway segment may occur at any point within the trip duration.

For example:

  - If a trip starts at 9:45a and ends at 10:15a, and the first 15 minutes involved freeway travel, it is **correctly counted**.
  - If a trip starts at 9:45a and ends at 10:15a, but only the last 15 minutes involved freeway travel, it is **overcounted**.
  - If a trip starts at 5:45a and ends at 6:15a, but only the first 15 minutes involved freeway travel, it is **overcounted**.
  - If a trip starts at 5:45a and ends at 6:15a, and the last 15 minutes involved freeway travel, it is **correctly counted**.
  
## Still to do
- discuss defintion of "low income"
- visualize and review how good the highway=motorway tagging is, and how useful the bridge tagging is


In [1]:
import os
import pandas as pd

# read the trip file
BATS_data_location = r"E:\Box\Modeling and Surveys\Surveys\Travel Diary Survey\BATS_2023\Data\2023\Full Weighted 2023 Dataset\WeightedDataset_02212025"
trip_df = pd.read_csv(os.path.join(BATS_data_location, "trip.csv"))

row_count_trip = len(trip_df)
print(f"Done reading the trip file. Number of rows: {row_count_trip}")


# read the household file
hh_df = pd.read_csv(os.path.join(BATS_data_location, "hh.csv"))

row_count_hh = len(hh_df)
print(f"Done reading the household file. Number of rows: {row_count_hh}")


Done reading the trip file. Number of rows: 365830
Done reading the household file. Number of rows: 8258


In [2]:
# This analysis will include only households with complete tue, or complete wed or complete thu
# Verify number of weighted household on each of Tue, Wed and Thu

# what is it in census? but 2.1m looks about right.

NumHh_Tue = hh_df.loc[hh_df["num_complete_tue"] == 1, "hh_weight_rmove_only"].sum()
print(f"Number of weighted households on Tue: {NumHh_Tue}")

NumHh_Wed = hh_df.loc[hh_df["num_complete_wed"] == 1, "hh_weight_rmove_only"].sum()
print(f"Number of weighted households on Wed: {NumHh_Wed}")

NumHh_Thu = hh_df.loc[hh_df["num_complete_thu"] == 1, "hh_weight_rmove_only"].sum()
print(f"Number of weighted households on Thu: {NumHh_Thu}")

Number of weighted households on Tue: 2151578.9136188542
Number of weighted households on Wed: 2160605.282457783
Number of weighted households on Thu: 2203531.932945707


In [3]:
# Does any household have more than one complete Tue, Wed or Thu?
# (Verify that "num_complete_[DayOfWeek] is either 0 or 1)"
hh_df["num_complete_tue"].value_counts(dropna=False)

1    6304
0    1954
Name: num_complete_tue, dtype: int64

In [4]:
hh_df["num_complete_wed"].value_counts(dropna=False)

1    6317
0    1941
Name: num_complete_wed, dtype: int64

In [5]:
hh_df["num_complete_thu"].value_counts(dropna=False)

1    6260
0    1998
Name: num_complete_thu, dtype: int64

In [6]:
# read the file with the Has_motorway booleans
trip_motorway_bridge_df = pd.read_csv(os.path.join(BATS_data_location, "derived_variables", "trip_motorway_booleans.csv"))
#trip_motorway_bridge_df = pd.read_csv(os.path.join(BATS_data_location, "derived_variables", "trip_motorway_booleans_old.csv"))

# join the derived variables to the original trip file
trip_df = pd.merge(trip_df, trip_motorway_bridge_df, on="trip_id", how="left")

In [7]:
# the geodata processing script produced two booleans.
# I'll focus on the one named has_nonBridge_motorway

trip_df["has_nonBridge_motorway"].value_counts(dropna=False)
# 1 means it is a trip that involves a freeway (that is not a bridge)
# 0 maens it is in the geodatabase as it is an auto trip, but didn't involve a freeway (that is not a bridge)
# NaN means it is not an auto trip as it wasn't in the geodatabase

# check the auto trips

NaN    158340
0.0    124322
1.0     83168
Name: has_nonBridge_motorway, dtype: int64

In [8]:
# tag trips that involve a freeway (that is not a bridge) and take place during peak hour
# AM Peak is 6 am to 10 am
# PM Peak is 3 pm to 7 pm
# Perferably, the peak hour or not tag should be done in the geodatabase. Unfortunately, it is not readily available

# do this for Tue, Wed and Thu separately

trip_df["is_TuePeak_nonBridge_motorway"] = (
    (trip_df["has_nonBridge_motorway"] == 1) &
    (
        ((trip_df["depart_hour"] >  6) & (trip_df["depart_hour"] < 10)) |
        ((trip_df["arrive_hour"] >  6) & (trip_df["arrive_hour"] < 10)) | 
        ((trip_df["depart_hour"] > 15) & (trip_df["depart_hour"] < 19)) |         
        ((trip_df["arrive_hour"] > 15) & (trip_df["arrive_hour"] < 19))
    ) &
    (
        (trip_df["travel_dow"] == 2)
    )
)

trip_df["is_WedPeak_nonBridge_motorway"] = (
    (trip_df["has_nonBridge_motorway"] == 1) &
    (
        ((trip_df["depart_hour"] >  6) & (trip_df["depart_hour"] < 10)) |
        ((trip_df["arrive_hour"] >  6) & (trip_df["arrive_hour"] < 10)) | 
        ((trip_df["depart_hour"] > 15) & (trip_df["depart_hour"] < 19)) |         
        ((trip_df["arrive_hour"] > 15) & (trip_df["arrive_hour"] < 19))
    ) &
    (
        (trip_df["travel_dow"] == 3)
    )
)

trip_df["is_ThuPeak_nonBridge_motorway"] = (
    (trip_df["has_nonBridge_motorway"] == 1) &
    (
        ((trip_df["depart_hour"] >  6) & (trip_df["depart_hour"] < 10)) |
        ((trip_df["arrive_hour"] >  6) & (trip_df["arrive_hour"] < 10)) | 
        ((trip_df["depart_hour"] > 15) & (trip_df["depart_hour"] < 19)) |         
        ((trip_df["arrive_hour"] > 15) & (trip_df["arrive_hour"] < 19))
    ) &
    (
        (trip_df["travel_dow"] == 4)
    )
)

In [9]:
# for each household, sum the number of trips that involve a freeway (that is not a bridge) and take place during peak hour
hh_peakNonBridgeMotorway_trips_df = trip_df.groupby("hh_id")[[
    "is_TuePeak_nonBridge_motorway",
    "is_WedPeak_nonBridge_motorway",
    "is_ThuPeak_nonBridge_motorway"
]].sum().reset_index()

In [10]:
# Rename the columns so they have more intuitive names
hh_peakNonBridgeMotorway_trips_df = hh_peakNonBridgeMotorway_trips_df.rename(columns={
    "is_TuePeak_nonBridge_motorway": "numTrips_TuePeak_nonBridge_motorway",
    "is_WedPeak_nonBridge_motorway": "numTrips_WedPeak_nonBridge_motorway",
    "is_ThuPeak_nonBridge_motorway": "numTrips_ThuPeak_nonBridge_motorway"
})

In [11]:
hh_peakNonBridgeMotorway_trips_df.head()

Unnamed: 0,hh_id,numTrips_TuePeak_nonBridge_motorway,numTrips_WedPeak_nonBridge_motorway,numTrips_ThuPeak_nonBridge_motorway
0,23000075,0,0,0
1,23000098,0,0,0
2,23000339,0,0,0
3,23000432,0,0,0
4,23000474,0,0,0


In [12]:
# join the info about peak non-bridge motorway trips to the household file
# note that it is for each household on each of Tue, Wed, Thu
# so it adds 3 columns to the household file

new_hh_df = pd.merge(
    hh_df,
    hh_peakNonBridgeMotorway_trips_df,
    on="hh_id",
    how="left"  # or "inner", "right", "outer" depending on your needs
)

# check number of rows
row_count_newhh = len(new_hh_df)
print(f"Check number of rows in the new household file: {row_count_newhh}")

Check number of rows in the new household file: 8258


In [13]:
new_hh_df.head()

Unnamed: 0,hh_id,first_travel_date,last_travel_date,signup_platform,diary_platform,participation_group,num_days_complete,num_days_complete_a,num_days_complete_b,num_days_complete_weekday,...,pairwise_segment,hh_weight,income_broad_user_input,income_imputed,hh_weight_rmove_only,income_broad_user_input_rmove_only,income_imputed_rmove_only,numTrips_TuePeak_nonBridge_motorway,numTrips_WedPeak_nonBridge_motorway,numTrips_ThuPeak_nonBridge_motorway
0,23000023,2023-05-24,2023-05-24,browser,browser,1,1,1,1,1,...,995,49.357083,"$25,000-$49,999","$25,000-$49,999",,,,,,
1,23000075,2023-05-17,2023-05-17,browser,browser,1,1,1,1,1,...,995,71.542285,"Under $25,000","Under $25,000",,,,0.0,0.0,0.0
2,23000098,2023-05-24,2023-05-24,browser,browser,1,1,1,1,1,...,995,68.199435,"$200,000 or more","$200,000 or more",,,,0.0,0.0,0.0
3,23000339,2023-05-17,2023-05-23,rmove,rmove,9,7,7,7,5,...,995,2564.426824,"$200,000 or more","$200,000 or more",83.898986,"$200,000 or more","$200,000 or more",0.0,0.0,0.0
4,23000432,2023-05-24,2023-05-24,browser,browser,1,1,1,1,1,...,995,66.61766,"$100,000-$199,999","$100,000-$199,999",,,,0.0,0.0,0.0


In [14]:
#look at the distribution of values in num_trip_Tue_peak_nonBridge_motorway
new_hh_df["numTrips_TuePeak_nonBridge_motorway"].value_counts(dropna=False)

0.0     5451
1.0     1092
2.0      828
3.0      306
NaN      192
4.0      176
5.0       94
6.0       41
7.0       31
8.0       24
9.0        6
11.0       5
10.0       4
13.0       4
12.0       3
16.0       1
Name: numTrips_TuePeak_nonBridge_motorway, dtype: int64

In [15]:
# Notes on numTrips_[DayOfWeek]Peak_nonBridge_motorway fields:
# - NaN: The household did not complete the survey on that day
# - 0: The household completed the survey on that weekday, but did not make any peak-period non-bridge motorway trips
# - Other values: The number of peak-period non-bridge motorway trips made on that day

# look at a crosstab of num_complete_tue and numTrips_TuePeak_nonBridge_motorway
# below is the unweighted count
pd.crosstab(
    new_hh_df["numTrips_TuePeak_nonBridge_motorway"].fillna("NaN"),
    new_hh_df["num_complete_tue"],
    margins=True
)

num_complete_tue,0,1,All
numTrips_TuePeak_nonBridge_motorway,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,1355,4096,5451
1.0,158,934,1092
2.0,146,682,828
3.0,68,238,306
4.0,44,132,176
5.0,37,57,94
6.0,14,27,41
7.0,14,17,31
8.0,12,12,24
9.0,3,3,6


In [16]:
# Number of households that made no peak-period non-bridge motorway trip
numHh_MadeZero_TuePeak_nonBridge_motorway = new_hh_df.loc[
    new_hh_df["numTrips_TuePeak_nonBridge_motorway"] == 0,
    "hh_weight_rmove_only"
].sum()

print(f"On Tue, the weighted number of hh that made no peak-period non-bridge motorway trip is: {numHh_MadeZero_TuePeak_nonBridge_motorway}")

numHh_MadeZero_WedPeak_nonBridge_motorway = new_hh_df.loc[
    new_hh_df["numTrips_WedPeak_nonBridge_motorway"] == 0,
    "hh_weight_rmove_only"
].sum()

print(f"On Wed, the weighted number of hh that made no peak-period non-bridge motorway trip is: {numHh_MadeZero_WedPeak_nonBridge_motorway}")

numHh_MadeZero_ThuPeak_nonBridge_motorway = new_hh_df.loc[
    new_hh_df["numTrips_ThuPeak_nonBridge_motorway"] == 0,
    "hh_weight_rmove_only"
].sum()

print(f"On Thu, the weighted number of hh that made no peak-period non-bridge motorway trip is: {numHh_MadeZero_ThuPeak_nonBridge_motorway}")


On Tue, the weighted number of hh that made no peak-period non-bridge motorway trip is: 1486206.5516216499
On Wed, the weighted number of hh that made no peak-period non-bridge motorway trip is: 1489422.6049095811
On Thu, the weighted number of hh that made no peak-period non-bridge motorway trip is: 1469633.6361873047


In [17]:
# calculate % of households that do not use highways on a regular basis during peak hour periods
Percent_MadeZero_TuePeak_nonBridge_motorway = numHh_MadeZero_TuePeak_nonBridge_motorway / NumHh_Tue
print(f"On Tue, the % of weighted households that made no peak-period non-bridge motorway trip is: {Percent_MadeZero_TuePeak_nonBridge_motorway}")

Percent_MadeZero_WedPeak_nonBridge_motorway = numHh_MadeZero_WedPeak_nonBridge_motorway / NumHh_Wed
print(f"On Wed, the % of weighted households that made no peak-period non-bridge motorway trip is: {Percent_MadeZero_WedPeak_nonBridge_motorway}")

Percent_MadeZero_ThuPeak_nonBridge_motorway = numHh_MadeZero_ThuPeak_nonBridge_motorway / NumHh_Thu
print(f"On Thu, the % of weighted households that made no peak-period non-bridge motorway trip is: {Percent_MadeZero_ThuPeak_nonBridge_motorway}")

Percent_MadeZero_TueToThuPeak_nonBridge_motorway = (numHh_MadeZero_TuePeak_nonBridge_motorway + numHh_MadeZero_WedPeak_nonBridge_motorway + numHh_MadeZero_ThuPeak_nonBridge_motorway) / (NumHh_Tue + NumHh_Wed + NumHh_Thu) 
print(f"For the average Tue-Thu, the % of weighted households that made no peak-period non-bridge motorway trip is: {Percent_MadeZero_TueToThuPeak_nonBridge_motorway}")



On Tue, the % of weighted households that made no peak-period non-bridge motorway trip is: 0.6907515881543571
On Wed, the % of weighted households that made no peak-period non-bridge motorway trip is: 0.6893543290865687
On Thu, the % of weighted households that made no peak-period non-bridge motorway trip is: 0.6669445603280555
For the average Tue-Thu, the % of weighted households that made no peak-period non-bridge motorway trip is: 0.6822370257842293


In [18]:
# number of low-income households on each day
NumHh_Tue = hh_df.loc[hh_df["num_complete_tue"] == 1 , "hh_weight_rmove_only"].sum()
print(f"Number of weighted households on Tue: {NumHh_Tue}")

NumHh_Wed = hh_df.loc[hh_df["num_complete_wed"] == 1, "hh_weight_rmove_only"].sum()
print(f"Number of weighted households on Wed: {NumHh_Wed}")

NumHh_Thu = hh_df.loc[hh_df["num_complete_thu"] == 1, "hh_weight_rmove_only"].sum()
print(f"Number of weighted households on Thu: {NumHh_Thu}")

Number of weighted households on Tue: 2151578.9136188542
Number of weighted households on Wed: 2160605.282457783
Number of weighted households on Thu: 2203531.932945707


In [19]:
# Inspect the income variable
new_hh_df["income_broad"].value_counts(dropna=False)


5      2365
6      1952
4       878
3       790
999     778
1       756
2       739
Name: income_broad, dtype: int64

In [20]:
# Add labels for the income groups
income_label_map = {
    1: "Under $25,000",
    2: "$25,000–$49,999",
    3: "$50,000–$74,999",
    4: "$75,000–$99,999",
    5: "$100,000–$199,999",
    6: "$200,000 or more",
    995: "Missing Response",
    999: "Prefer not to answer"
}

# List of income groups (creates an array)
income_groups = sorted(new_hh_df["income_broad"].unique())

# Store results
results_list = []

# Loop through each income group
for income_num in income_groups:
    income_label = income_label_map.get(income_num, "Unknown")    
    hh_subset_df = new_hh_df[new_hh_df["income_broad"] == income_num]
    
    # Households that completed the survey on Tue and made 0 peak-period non-bridge motorway trip (PNBM trip)
    hh_tue_zero_PNBMtrip = hh_subset_df.loc[
        (hh_subset_df["num_complete_tue"] == 1) & 
        (hh_subset_df["numTrips_TuePeak_nonBridge_motorway"] == 0),
        "hh_weight_rmove_only"
    ].sum()

    hh_tue_total = hh_subset_df.loc[
        hh_subset_df["num_complete_tue"] == 1,
        "hh_weight_rmove_only"
    ].sum()

    percent_tue = hh_tue_zero_PNBMtrip / hh_tue_total 

    # Repeat for Wed
    hh_wed_zero_PNBMtrip = hh_subset_df.loc[
        (hh_subset_df["num_complete_wed"] == 1) & 
        (hh_subset_df["numTrips_WedPeak_nonBridge_motorway"] == 0),
        "hh_weight_rmove_only"
    ].sum()

    hh_wed_total = hh_subset_df.loc[
        hh_subset_df["num_complete_wed"] == 1,
        "hh_weight_rmove_only"
    ].sum()

    percent_wed = hh_wed_zero_PNBMtrip / hh_wed_total 

    # Repeat for Thu
    hh_thu_zero_PNBMtrip = hh_subset_df.loc[
        (hh_subset_df["num_complete_thu"] == 1) & 
        (hh_subset_df["numTrips_ThuPeak_nonBridge_motorway"] == 0),
        "hh_weight_rmove_only"
    ].sum()

    hh_thu_total = hh_subset_df.loc[
        hh_subset_df["num_complete_thu"] == 1,
        "hh_weight_rmove_only"
    ].sum()

    percent_thu = hh_thu_zero_PNBMtrip / hh_thu_total

    percent_tuewedthu = (hh_tue_zero_PNBMtrip + hh_wed_zero_PNBMtrip + hh_thu_zero_PNBMtrip) / (hh_tue_total + hh_wed_total + hh_thu_total)

    print(f"\nIncome Group {income_num}: {income_label}")
    print(f"              Tue: {percent_tue:.2%} of hh made no peak non-bridge motorway trips")
    print(f"              Wed: {percent_wed:.2%}")
    print(f"              Thu: {percent_thu:.2%}")
    print(f"  Tue-Thu Average: {percent_tuewedthu:.2%}")
    
    
# Save results to CSV
results_df = pd.DataFrame(results_list)
output_path = os.path.join(BATS_data_location, "processed", "percent_noPNBMtrip_by_income.csv")
results_df.to_csv(output_path, index=False)
print(f"\nSaved results to: {output_path}")


Income Group 1: Under $25,000
              Tue: 77.89% of hh made no peak non-bridge motorway trips
              Wed: 73.24%
              Thu: 81.29%
  Tue-Thu Average: 77.46%

Income Group 2: $25,000–$49,999
              Tue: 74.07% of hh made no peak non-bridge motorway trips
              Wed: 67.69%
              Thu: 64.97%
  Tue-Thu Average: 68.96%

Income Group 3: $50,000–$74,999
              Tue: 54.17% of hh made no peak non-bridge motorway trips
              Wed: 56.21%
              Thu: 57.89%
  Tue-Thu Average: 56.15%

Income Group 4: $75,000–$99,999
              Tue: 49.97% of hh made no peak non-bridge motorway trips
              Wed: 47.26%
              Thu: 45.41%
  Tue-Thu Average: 47.48%

Income Group 5: $100,000–$199,999
              Tue: 52.66% of hh made no peak non-bridge motorway trips
              Wed: 43.84%
              Thu: 48.61%
  Tue-Thu Average: 48.37%

Income Group 6: $200,000 or more
              Tue: 45.89% of hh made no peak non-bridge 