## 🗺️ Step 1: Loading the Data
The first part of the code loads and prepares the input data.

Load Raw Taxi Trips: The code starts by downloading three months of NYC Yellow Taxi trip data, January to March 2023. The files are in Parquet format, a columnar format that stores large tables efficiently. The three files are then combined into a single table (a DataFrame).

Clean the Trip Data: The script immediately cleans this data by:

Keeping only the two columns that matter for this analysis: the pickup location ID (PULocationID) and the dropoff location ID (DOLocationID).

Removing any trips where either location ID is invalid (the valid taxi zones are numbered 1 to 263).

Dropping any rows with missing values.

Load the "Crosswalk" File: Next, it loads a helper file: nyc_zone_tract_crosswalk_FINAL.csv. A crosswalk file acts as a translator between two geographic systems, here taxi zones and census tracts. It lists every taxi zone and tells us which census tracts it overlaps with, and by how much (the apportion_weight).
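To make the crosswalk idea concrete, here is a minimal sketch of a sanity check on the weights. It assumes that the apportion_weight values for each taxi zone are meant to sum to 1, which the notebook does not state explicitly. The toy values and the helper name zones_with_bad_weights are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy crosswalk (made-up values): zone 1 overlaps two tracts, zone 2 sits in one tract.
toy_crosswalk = pd.DataFrame({
    "LocationID": [1, 1, 2],
    "census_tract_id": [1000201, 1000202, 1000500],
    "apportion_weight": [0.7, 0.3, 1.0],
})

def zones_with_bad_weights(crosswalk, tol=1e-6):
    """Return the per-zone weight sums that differ from 1 by more than tol.

    Assumption: weights within a zone should sum to 1, so that every trip
    is fully distributed across the tracts the zone overlaps.
    """
    sums = crosswalk.groupby("LocationID")["apportion_weight"].sum()
    return sums[(sums - 1.0).abs() > tol]

bad = zones_with_bad_weights(toy_crosswalk)
print("All zones sum to 1:", bad.empty)
```

If the real crosswalk returns a non-empty result here, the apportioned totals later in the notebook would not add up to the original trip count.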

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import os

In [3]:
# --- 1. CONFIGURATION ---
# Use a sample of 3 months from 2023 for manageable processing
TLC_URLS = [
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet",
]
CROSSWALK_FILE = "nyc_zone_tract_crosswalk_FINAL.csv"
OUTPUT_DIR = "output"
OUTPUT_FILE = "OD_Matrix_TLC_by_Tract.csv"

# --- 2. LOAD & CLEAN TLC TRIP DATA ---
print("➡️ Step 1: Loading and cleaning TLC trip data...")

# Read and combine the Parquet files from the URLs
df_tlc = pd.concat([pd.read_parquet(url) for url in TLC_URLS])

# Keep only essential columns
df_tlc = df_tlc[["PULocationID", "DOLocationID"]]

# Remove records with invalid taxi zone IDs (TLC zones are 1-263)
df_tlc = df_tlc[df_tlc["PULocationID"].between(1, 263) & df_tlc["DOLocationID"].between(1, 263)]

# Drop any rows with missing data (reassign instead of mutating a filtered slice in place)
df_tlc = df_tlc.dropna()
df_tlc
print(f"✅ TLC data loaded and cleaned: {len(df_tlc)} trips.")

➡️ Step 1: Loading and cleaning TLC trip data...


         PULocationID  DOLocationID
0                 161           141
1                  43           237
2                  48           238
3                 138             7
4                 107            79
...               ...           ...
3403761           163            75
3403762           125           198
3403763            50           224
3403764           113           158
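
The cleaning rules from the cell above can be tried on a small table. This is only a sketch: the toy values, the extra fare_amount column, and the helper name clean_trips are illustrative and not part of the notebook. It also shows that between() already treats a missing ID as out of range, so the final dropna() acts as a safeguard.

```python
import numpy as np
import pandas as pd

# Toy trips (made-up values): one out-of-range pickup, one out-of-range dropoff,
# and one missing pickup ID.
toy_trips = pd.DataFrame({
    "PULocationID": [161, 264, 48, np.nan],
    "DOLocationID": [141, 237, 300, 79],
    "fare_amount": [12.5, 8.0, 20.0, 5.5],  # extra column that gets dropped
})

def clean_trips(df, min_zone=1, max_zone=263):
    """Keep the two zone columns and only rows whose IDs are both in range."""
    zones = df[["PULocationID", "DOLocationID"]]
    valid = (
        zones["PULocationID"].between(min_zone, max_zone)
        & zones["DOLocationID"].between(min_zone, max_zone)
    )
    # NaN compares as False in between(), so missing IDs are filtered here too;
    # dropna() stays as a safeguard and copy() gives an independent DataFrame.
    return zones[valid].dropna().copy()

cleaned = clean_trips(toy_trips)
print(cleaned)
```

Only the first toy row survives, because every other row has at least one ID that is missing or outside 1 to 263.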


In [4]:
# --- 3. LOAD THE ZONE-TO-TRACT CROSSWALK FILE ---
print(f"\n➡️ Step 2: Loading the crosswalk file '{CROSSWALK_FILE}'...")
try:
    df_crosswalk = pd.read_csv(CROSSWALK_FILE)
    print("✅ Crosswalk file loaded successfully.")
except FileNotFoundError:
    print(f"❌ ERROR: The file '{CROSSWALK_FILE}' was not found.")
    print("Please make sure it is in the same directory as your notebook.")
    # Stop execution if the file isn't found
    raise


➡️ Step 2: Loading the crosswalk file 'nyc_zone_tract_crosswalk_FINAL.csv'...
✅ Crosswalk file loaded successfully.


## 📊 Step 3: Building the Matrix and Visualizing Results
The final part of the notebook takes the calculated trip fractions (pickup_weight × dropoff_weight for each trip and tract pair) and turns them into a final report and visuals.

Aggregate the Results: It groups the data by each unique pair of pickup_tract_id and dropoff_tract_id and sums the trip_fraction values. Adding these fractional pieces together gives the total apportioned trips between every pair of census tracts, which is why the totals are not whole numbers. The result is the final O-D (origin-destination) matrix.

Save the Output: The code saves this final matrix to a new CSV file: output/OD_Matrix_TLC_by_Tract.csv.

Create Insightful Plots: To make the data easy to understand, the last cell loads the saved CSV and generates three bar charts:

Top 15 Busiest Origin Tracts: Shows which census tracts have the most taxi pickups.

Top 15 Busiest Destination Tracts: Shows which census tracts have the most taxi dropoffs.

Top 15 Busiest Routes: Shows the most popular specific routes from one census tract to another.

These plots are saved as .png image files in the output folder, providing a quick visual summary of taxi demand patterns across New York City at census-tract resolution.
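
The apportionment and aggregation described above can be followed by hand on a tiny example. The sketch below uses made-up trips and weights (chosen so the arithmetic is exact) and mirrors the notebook's column names. It is an illustration of the idea, not the notebook's actual data.

```python
import pandas as pd

# Toy inputs (made-up values): two trips from zone 1 to zone 2, one trip back.
toy_trips = pd.DataFrame({
    "PULocationID": [1, 1, 2],
    "DOLocationID": [2, 2, 1],
})
# Zone 1 is split 75% / 25% across two tracts; zone 2 lies in a single tract.
toy_crosswalk = pd.DataFrame({
    "LocationID": [1, 1, 2],
    "census_tract_id": [1000201, 1000202, 1000500],
    "apportion_weight": [0.75, 0.25, 1.0],
})

pickup = toy_crosswalk.rename(columns={
    "LocationID": "PULocationID",
    "census_tract_id": "pickup_tract_id",
    "apportion_weight": "pickup_weight",
})
dropoff = toy_crosswalk.rename(columns={
    "LocationID": "DOLocationID",
    "census_tract_id": "dropoff_tract_id",
    "apportion_weight": "dropoff_weight",
})

# Each trip fans out into one row per (pickup tract, dropoff tract) combination.
merged = toy_trips.merge(pickup, on="PULocationID").merge(dropoff, on="DOLocationID")
merged["trip_fraction"] = merged["pickup_weight"] * merged["dropoff_weight"]

# Sum the fractions per tract pair to get the O-D matrix.
od = (
    merged.groupby(["pickup_tract_id", "dropoff_tract_id"])["trip_fraction"]
    .sum()
    .reset_index(name="total_trips")
)
print(od)
print("Total apportioned trips:", od["total_trips"].sum())
```

The two trips from zone 1 contribute 1.5 trips to the pair 1000201 → 1000500 and 0.5 to 1000202 → 1000500, and the apportioned total equals the original three trips because the weights in each zone sum to 1.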

In [5]:
# --- 4. MAP TRIPS TO ORIGIN AND DESTINATION TRACTS ---
print("\n➡️ Step 3: Mapping trips to origin and destination tracts...")

# Prepare the crosswalk for joining with pickup locations
pickup_crosswalk = df_crosswalk.rename(columns={
    "LocationID": "PULocationID",
    "census_tract_id": "pickup_tract_id",
    "apportion_weight": "pickup_weight"
})
# Prepare the crosswalk for joining with dropoff locations
dropoff_crosswalk = df_crosswalk.rename(columns={
    "LocationID": "DOLocationID",
    "census_tract_id": "dropoff_tract_id",
    "apportion_weight": "dropoff_weight"
})

# Join 1: Map pickup zones to pickup tracts
df_merged = pd.merge(df_tlc, pickup_crosswalk, on="PULocationID", how="inner")

# Join 2: Map dropoff zones to dropoff tracts
df_merged = pd.merge(df_merged, dropoff_crosswalk, on="DOLocationID", how="inner")

# Calculate the apportioned trip count for each O-D tract pair
# Each original trip is split based on the spatial overlap weights.
df_merged["trip_fraction"] = df_merged["pickup_weight"] * df_merged["dropoff_weight"]
print("✅ Trip-to-tract mapping complete.")


➡️ Step 3: Mapping trips to origin and destination tracts...
✅ Trip-to-tract mapping complete.
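
Because both merges use how="inner", any trip whose pickup or dropoff zone has no row in the crosswalk is dropped without a message. A minimal sketch of a check for this follows. The helper name trips_lost_in_join and the toy data are hypothetical, and whether the real crosswalk covers all 263 zones is not known from the notebook.

```python
import pandas as pd

def trips_lost_in_join(trips, crosswalk):
    """Count trips whose pickup or dropoff zone is absent from the crosswalk."""
    known_zones = set(crosswalk["LocationID"])
    covered = (
        trips["PULocationID"].isin(known_zones)
        & trips["DOLocationID"].isin(known_zones)
    )
    return int((~covered).sum())

# Toy data (made-up values): zone 3 has no crosswalk row.
toy_trips = pd.DataFrame({
    "PULocationID": [1, 2, 3],
    "DOLocationID": [2, 1, 1],
})
toy_crosswalk = pd.DataFrame({
    "LocationID": [1, 2],
    "census_tract_id": [1000201, 1000500],
    "apportion_weight": [1.0, 1.0],
})

lost = trips_lost_in_join(toy_trips, toy_crosswalk)
print(f"{lost} of {len(toy_trips)} trips would be dropped by the inner joins")
```

Running the same function on df_tlc and df_crosswalk before the merges would show how many real trips, if any, never reach the O-D matrix.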


In [6]:
# --- 5. AGGREGATE TO CREATE THE O-D MATRIX ---
print("\n➡️ Step 4: Aggregating trips to create the O-D matrix...")

# Group by the origin and destination tracts and sum the fractional trips
od_matrix = (
    df_merged.groupby(["pickup_tract_id", "dropoff_tract_id"])
    ["trip_fraction"]
    .sum()
    .reset_index()
)

# Rename the column to reflect the total apportioned trips
od_matrix.rename(columns={"trip_fraction": "total_trips"}, inplace=True)
print("✅ O-D matrix created successfully.")


# --- 6. SAVE AND DISPLAY OUTPUT ---
print("\n➡️ Step 5: Saving output...")

# Create the output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)
output_path = os.path.join(OUTPUT_DIR, OUTPUT_FILE)
od_matrix.to_csv(output_path, index=False)

print(f"✅ Final Origin-Destination matrix saved to: '{output_path}'")
print("\n--- Sample of the Final O-D Matrix ---")
print(od_matrix.head())


➡️ Step 4: Aggregating trips to create the O-D matrix...
✅ O-D matrix created successfully.

➡️ Step 5: Saving output...
✅ Final Origin-Destination matrix saved to: 'output/OD_Matrix_TLC_by_Tract.csv'

--- Sample of the Final O-D Matrix ---
   pickup_tract_id  dropoff_tract_id  total_trips
0          1000201           1000201     2.161536
1          1000201           1000202     7.443110
2          1000201           1000500     0.096026
3          1000201           1000600     5.858761
4          1000201           1000700     7.844641


In [8]:
INPUT_FILE = "output/OD_Matrix_TLC_by_Tract.csv"
OUTPUT_DIR = "output"
TOP_N = 15  # You can change this to plot the Top 10, Top 20, etc.

# Load the O-D matrix data
try:
    od_matrix = pd.read_csv(INPUT_FILE)
    print(f"✅ Successfully loaded '{INPUT_FILE}'")
except FileNotFoundError:
    print(f"❌ ERROR: The file '{INPUT_FILE}' was not found.")
    print("Please make sure you have re-run the previous notebook cell to generate the O-D matrix before running this cell.")
    # Stop execution if the file isn't found
    raise

# --- 2. PLOT 1: TOP N BUSIEST ORIGIN TRACTS ---
print("➡️ Generating plot for Top Origin Tracts...")

# Aggregate total trips originating from each tract
top_origins = od_matrix.groupby('pickup_tract_id')['total_trips'].sum().nlargest(TOP_N).sort_values(ascending=True)

# Create the plot
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 8))
top_origins.plot(kind='barh', color='skyblue')
plt.title(f'Top {TOP_N} Busiest Origin Census Tracts', fontsize=16)
plt.xlabel('Total Apportioned Trips', fontsize=12)
plt.ylabel('Origin Census Tract ID', fontsize=12)
plt.tight_layout()

# Save the figure to the output directory
origin_plot_path = os.path.join(OUTPUT_DIR, 'top_origin_tracts.png')
plt.savefig(origin_plot_path)
print(f"✅ Saved plot to: '{origin_plot_path}'")
plt.close() # Close the plot to free up memory

# --- 3. PLOT 2: TOP N BUSIEST DESTINATION TRACTS ---
print("\n➡️ Generating plot for Top Destination Tracts...")

# Aggregate total trips ending in each tract
top_destinations = od_matrix.groupby('dropoff_tract_id')['total_trips'].sum().nlargest(TOP_N).sort_values(ascending=True)

# Create the plot
plt.figure(figsize=(10, 8))
top_destinations.plot(kind='barh', color='salmon')
plt.title(f'Top {TOP_N} Busiest Destination Census Tracts', fontsize=16)
plt.xlabel('Total Apportioned Trips', fontsize=12)
plt.ylabel('Destination Census Tract ID', fontsize=12)
plt.tight_layout()

# Save the figure
destination_plot_path = os.path.join(OUTPUT_DIR, 'top_destination_tracts.png')
plt.savefig(destination_plot_path)
print(f"✅ Saved plot to: '{destination_plot_path}'")
plt.close()

# --- 4. PLOT 3: TOP N BUSIEST O-D ROUTES ---
print("\n➡️ Generating plot for Top O-D Routes...")

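# Take the Top N routes; .copy() avoids a SettingWithCopyWarning when adding a column
od_matrix_sorted = od_matrix.sort_values(by='total_trips', ascending=False).head(TOP_N).copy()
# Create a 'route' column for easier labeling in the plot
od_matrix_sorted['route'] = od_matrix_sorted['pickup_tract_id'].astype(str) + ' → ' + od_matrix_sorted['dropoff_tract_id'].astype(str)
# Sort ascending so the busiest route is drawn at the top of the horizontal bar chart
od_matrix_sorted = od_matrix_sorted.sort_values(by='total_trips', ascending=True)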

# Create the plot
plt.figure(figsize=(10, 8))
plt.barh(od_matrix_sorted['route'], od_matrix_sorted['total_trips'], color='mediumseagreen')
plt.title(f'Top {TOP_N} Busiest O-D Routes by Census Tract', fontsize=16)
plt.xlabel('Total Apportioned Trips', fontsize=12)
plt.ylabel('Origin → Destination Route', fontsize=12)
plt.tight_layout()

# Save the figure
routes_plot_path = os.path.join(OUTPUT_DIR, 'top_od_routes.png')
plt.savefig(routes_plot_path)
print(f"✅ Saved plot to: '{routes_plot_path}'")
plt.close()

✅ Successfully loaded 'output/OD_Matrix_TLC_by_Tract.csv'
➡️ Generating plot for Top Origin Tracts...
✅ Saved plot to: 'output/top_origin_tracts.png'

➡️ Generating plot for Top Destination Tracts...
✅ Saved plot to: 'output/top_destination_tracts.png'

➡️ Generating plot for Top O-D Routes...
✅ Saved plot to: 'output/top_od_routes.png'
