# Geospatial Taxi Demand Analysis

This notebook analyzes Chicago taxi trip data from 2015 in a geospatial context. It visualizes taxi usage with heatmaps and with H3 hexagon maps at several resolutions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium import plugins
from folium.plugins import HeatMap
import h3
from shapely.geometry import Polygon  # shapely Polygon builds the hexagon geometries
import geopandas
import json  # used by hexagons_dataframe_to_geojson when writing to a file
from geojson import Feature, Point, FeatureCollection
import plotly.express as px
from datetime import datetime

In [None]:
#Import cleaned dataset

# For heatmaps
sample_df = pd.read_parquet('../../data/rides/Taxi_Trips_Sampled_Cleaned.parquet')

# For h3 maps
trips_df = pd.read_parquet('../../data/rides/Taxi_Trips_Cleaned.parquet')

columns_to_drop = ['trip_seconds', 'trip_miles', 'pickup_census_tract', 'dropoff_census_tract', 'fare', 'tips', 'tolls', 'Extras', 'trip_total', 'payment_type', 'Company', 'temp', 'precip', 'is_weekday', 'h3_05_pickup',
       'h3_05_dropoff', 'h3_06_pickup', 'h3_06_dropoff', 'week_end']
trips_df = trips_df.drop(columns_to_drop, axis=1)
trips_df.head(3)

We create a dictionary that maps each Chicago community area number to its name and add the names to our DataFrames, so that we can refer to locations more easily in the analysis. The census tract list was retrieved from [here](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2000/pt6c-hxpp).
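The same lookup can also be built by reading the columns by name instead of by position. The following is a minimal sketch, assuming the header uses the names `CA` and `COMMUNIT_1` mentioned in the code comments of the next cell; the `TRACT` column and the three data rows are hypothetical and only serve as an in-memory stand-in for the real file.

```python
import csv
import io

# Hypothetical three-row excerpt; the real file has one row per census tract.
# "TRACT" is an assumed name for the first column.
sample_csv = io.StringIO(
    "TRACT,CA,COMMUNIT_1\n"
    "320100,32,LOOP\n"
    "320400,32,LOOP\n"
    "080100,8,NEAR NORTH SIDE\n"
)


def load_area_names(handle):
    """Return {community area number (int): community area name}, sorted by number."""
    reader = csv.DictReader(handle)
    names = {int(row["CA"]): row["COMMUNIT_1"] for row in reader}
    return dict(sorted(names.items()))


area_names = load_area_names(sample_csv)
print(area_names)
```

Converting the key to `int` while reading makes a separate conversion step unnecessary before calling `Series.map`.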

In [None]:
import csv

csv_file = '../../data/census_tract/chicago_census_tract.csv'

# Create an empty dictionary
data_dict = {}

# Open the CSV file
with open(csv_file, 'r') as file:
    reader = csv.reader(file)

    # Skip the header row if present
    next(reader)

    # Iterate over each row in the CSV file
    for row in reader:
        ca = row[1]  # CA is in the second column
        community = row[2]  # COMMUNIT_1 is in the third column

        # Add the data to the dictionary
        data_dict[ca] = community

data_dict = dict(sorted(data_dict.items(), key=lambda item: int(item[0])))
print(data_dict)

In [None]:
# Create a new dictionary with integer keys
new_data_dict = {int(key): value for key, value in data_dict.items()}


In [None]:
key_types = [type(key) for key in new_data_dict.keys()]

In [None]:
# Map community area numbers to community area names

# For heatmaps: sample df
sample_df['pickup_name'] = sample_df['pickup_community_area'].map(new_data_dict)
sample_df['dropoff_name'] = sample_df['dropoff_community_area'].map(new_data_dict)

# For h3 maps: trips_df
trips_df['pickup_name'] = trips_df['pickup_community_area'].map(new_data_dict)
trips_df['dropoff_name'] = trips_df['dropoff_community_area'].map(new_data_dict)
trips_df.head(3)

In [None]:
#Test for NAN values
trips_df['pickup_name'].isna().sum()

## Heatmaps 

In this section, we use heatmaps to visualize where trips start and end. Because rendering the full dataset produced very large maps and kernel failures, we use only a sample of the data. Even the sampled heatmaps are too large to store in the notebook, so we removed the cell outputs that contained them. Please run the code yourself to see the visualizations, either with the sample data or, given enough resources, with the whole dataset. As an alternative, we have also [uploaded screenshots]().
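One way to shrink the heatmap input, sketched here as an assumption rather than as the approach used in this notebook, is to collapse repeated coordinates into weighted points. Trips are located at centroids, so the same coordinate pair occurs many times. The coordinates below are made up for illustration.

```python
from collections import Counter

# Hypothetical (lat, lon) pickup centroids; real trips repeat the same
# centroid many times.
points = [
    (41.8853, -87.6210),
    (41.8853, -87.6210),
    (41.9790, -87.9030),
    (41.8853, -87.6210),
]


def to_weighted_points(latlon_pairs):
    """Collapse duplicate coordinates into [lat, lon, weight] rows."""
    counts = Counter(latlon_pairs)
    return [[lat, lon, n] for (lat, lon), n in counts.items()]


heat_data = to_weighted_points(points)
print(heat_data)
```

A list of this shape can be passed to `folium.plugins.HeatMap`, which accepts an optional third value per point as its weight.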

### Starting Trips

In [None]:
from shapely import wkt

In [None]:
# Convert the pickup_centroid column to a GeoSeries from WKT format
sample_df["pickup_centroid"] = geopandas.GeoSeries.from_wkt(sample_df["pickup_centroid"])

In [None]:
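# sample_df is a plain pandas DataFrame, so wrap it in a GeoDataFrame and keep the result
sample_df = geopandas.GeoDataFrame(sample_df, geometry='pickup_centroid')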
sample_df.head(3)

In [None]:
#Split pickup_centroid into coordinates
#taken from https://geopandas.org/en/stable/gallery/plotting_with_folium.html 

geo_df_list = [[point.xy[1][0], point.xy[0][0]] for point in sample_df.pickup_centroid]

In [None]:
# Create DF for starting trip locations and their corresponding area names
geo_df_pickup = geopandas.GeoDataFrame(geo_df_list, sample_df["pickup_census_tract"])
geo_df_pickup = geo_df_pickup.reset_index()
# Assign by position: after reset_index the row labels may no longer match sample_df's index
geo_df_pickup["pickup_name"] = sample_df["pickup_name"].to_numpy()

geo_df_pickup.head(3)

In [None]:
geo_df_pickup = geo_df_pickup.reset_index()
geo_df_pickup.head(3)

In [None]:
# Plotting a heatmap that shows the census tracts and the frequency of trips starting there

trips_heatmap = folium.Map(
    location=(41.881832, -87.623177), # the orig mean values as location coordinates from https://www.latlong.net/place/chicago-il-usa-1855.html
    zoom_start=13,
    control_scale=True,
    max_zoom=20,
)

trips_heatmap.add_child(plugins.HeatMap(geo_df_list, radius=30))

popup_content = np.vectorize(
    lambda name, tract: f"Pickup Name: {name}<br> </br><br> Pickup Census Tract: {tract}"
)(geo_df_pickup['pickup_name'], geo_df_pickup['pickup_census_tract'])

for _,row in geo_df_pickup.iterrows():
    folium.CircleMarker(
        radius=5,
        location=[row[0], row[1]],
        popup=popup_content[_],
        color="crimson",
        fill_color="crimson",
    ).add_to(trips_heatmap)
    
    
trips_heatmap

The heatmap reveals that the primary hotspot is the Loop, Chicago's city center. Additional hotspots are O'Hare International Airport and Chicago Midway International Airport, both important transportation hubs. Besides the city center and the airports, taxi demand is elevated in the northern parts of Chicago, around Near North Side, Lake View, and Edgewater. In contrast, the southern regions show fewer prominent areas; the more prominent locations in the south include the University of Chicago (located in Hyde Park) and Douglas.

### Ending Trips

In [None]:
# Convert the "dropoff_centroid" column to a GeoSeries from WKT format
sample_df["dropoff_centroid"] = geopandas.GeoSeries.from_wkt(sample_df["dropoff_centroid"])

In [None]:
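# Make dropoff_centroid the active geometry column and keep the result
sample_df = geopandas.GeoDataFrame(sample_df, geometry='dropoff_centroid')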
sample_df.head(3)

In [None]:
geo_df_list_dropoff = [[point.xy[1][0], point.xy[0][0]] for point in sample_df.dropoff_centroid]

In [None]:
geo_df_dropoff = geopandas.GeoDataFrame(geo_df_list_dropoff, sample_df["dropoff_census_tract"])

In [None]:
geo_df_dropoff = geo_df_dropoff.reset_index()
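# Assign by position: after reset_index the row labels may no longer match sample_df's index
geo_df_dropoff["dropoff_name"] = sample_df["dropoff_name"].to_numpy()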

geo_df_dropoff.head(3)

In [None]:
# Plotting a heatmap that shows the census tracts and the frequency of trips ending there

trips_heatmap = folium.Map(
    location=(41.881832, -87.623177), # the orig mean values as location coordinates from https://www.latlong.net/place/chicago-il-usa-1855.html
    zoom_start=13,
    control_scale=True,
    max_zoom=20,
)

trips_heatmap.add_child(plugins.HeatMap(geo_df_list_dropoff, radius=30))

popup_content = np.vectorize(lambda name, tract: f"Dropoff Name: {name}<br> </br><br>Dropoff Census Tract: {tract}")(
    geo_df_dropoff['dropoff_name'], geo_df_dropoff['dropoff_census_tract']
)

for _,row in geo_df_dropoff.iterrows():
    folium.CircleMarker(
        radius=5,
        location=[row[0], row[1]],
        popup=popup_content[_],
        color="crimson",
        fill_color="crimson",
    ).add_to(trips_heatmap)
    
    
trips_heatmap

Similar to the heatmap of starting trips, the most popular hotspot is the Loop. The airports and the northern parts of Chicago are prominent as well, and the few popular locations in the south are again around Hyde Park and Douglas.

To further analyze patterns, like the similarity of starting and ending trips, we will use H3 maps and look at different analysis objectives.

## H3 maps 

We employ H3 maps for additional geospatial visualizations, focusing on resolution types 7, 8, and 9. 
Our approach begins by establishing DataFrames for each resolution type, accompanied by methods for generating hexagonal geometries, counting the number of trips within each hexagon, and designing the visual representation of the H3 maps.
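The counting step described above can be illustrated on toy data. This is a minimal sketch with made-up hexagon ids and hypothetical column names (`h3_pickup`, `h3_dropoff`); it uses `value_counts` with `reindex` so that hexagons without any pickups or dropoffs receive an explicit zero.

```python
import pandas as pd

# Hypothetical trips with made-up hexagon ids.
trips = pd.DataFrame({
    "h3_pickup": ["a", "a", "b", "c"],
    "h3_dropoff": ["b", "b", "b", "d"],
})

# Every hexagon that appears as a pickup or as a dropoff location.
hex_ids = pd.concat([trips["h3_pickup"], trips["h3_dropoff"]]).unique()

counts = pd.DataFrame(index=pd.Index(hex_ids, name="hex"))
counts["starting_trips"] = (
    trips["h3_pickup"].value_counts().reindex(counts.index, fill_value=0)
)
counts["ending_trips"] = (
    trips["h3_dropoff"].value_counts().reindex(counts.index, fill_value=0)
)
counts = counts.reset_index()
print(counts)
```

Because the counts are aligned on the hexagon id, a hexagon that only ever appears as a dropoff (here `d`) still gets a row with zero starting trips.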

Our Plotly H3 maps do not render properly when the notebook is viewed on GitHub, so we will upload screenshots of the visualizations to the folder **???** instead. The notebook with its outputs can also be found [here]() **Sciebo**.

In [None]:
# Creating a dataframe that contains all hexagons where at least one trip started or ended

hexagons7_df = pd.DataFrame()
hexagons8_df = pd.DataFrame()
hexagons9_df = pd.DataFrame()

hexagons7_df["hex"] = pd.concat([trips_df["h3_07_pickup"], trips_df["h3_07_dropoff"]]).unique()
hexagons8_df["hex"] = pd.concat([trips_df["h3_08_pickup"], trips_df["h3_08_dropoff"]]).unique()
hexagons9_df["hex"] = pd.concat([trips_df["h3_09_pickup"], trips_df["h3_09_dropoff"]]).unique()
hexagons7_df.head(3)
hexagons8_df.head(3)
hexagons9_df.head(3)

In [None]:
# Defining a function that generates the hexagon geometry for each hexagon
# taken from https://medium.com/analytics-vidhya/how-to-create-a-choropleth-map-using-uber-h3-plotly-python-458f51593548

def add_geometry(row):
  points = h3.h3_to_geo_boundary(row['hex'], True)
  return Polygon(points)

In [None]:
#Applying function to our hexagons dataframe

hexagons7_df['geometry'] = (hexagons7_df
                                .apply(add_geometry,axis=1)) 

hexagons8_df['geometry'] = (hexagons8_df
                                .apply(add_geometry,axis=1)) 

hexagons9_df['geometry'] = (hexagons9_df
                                .apply(add_geometry,axis=1)) 


hexagons7_df.head(3)


In [None]:
# Defining a function that counts trips for a given groupby column

def calculate_hexagon_trips(hexagons_df, label, group_by):
    hexagons_df[label] = trips_df.groupby(group_by).size()
    hexagons_df[label] = hexagons_df[label].fillna(value=0)

In [None]:
# Calculate starting and ending trips for each hexagon

hexagons7_df = hexagons7_df.set_index('hex')
hexagons8_df = hexagons8_df.set_index('hex')
hexagons9_df = hexagons9_df.set_index('hex')


calculate_hexagon_trips(hexagons7_df, label="starting_trips_07", group_by="h3_07_pickup")
calculate_hexagon_trips(hexagons7_df, label="ending_trips_07", group_by="h3_07_dropoff")
calculate_hexagon_trips(hexagons8_df, label="starting_trips_08", group_by="h3_08_pickup")
calculate_hexagon_trips(hexagons8_df, label="ending_trips_08", group_by="h3_08_dropoff")
calculate_hexagon_trips(hexagons9_df, label="starting_trips_09", group_by="h3_09_pickup")
calculate_hexagon_trips(hexagons9_df, label="ending_trips_09", group_by="h3_09_dropoff")

hexagons7_df = hexagons7_df.reset_index()
hexagons8_df = hexagons8_df.reset_index()
hexagons9_df = hexagons9_df.reset_index()

hexagons9_df.head(3)

In [None]:
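# Label each hexagon with a community area name taken from the trips that start in it
# (hexagons without any pickups get NaN)
hexagons7_df["pickup_name"] = hexagons7_df["hex"].map(trips_df.groupby("h3_07_pickup")["pickup_name"].first())
hexagons8_df["pickup_name"] = hexagons8_df["hex"].map(trips_df.groupby("h3_08_pickup")["pickup_name"].first())
hexagons9_df["pickup_name"] = hexagons9_df["hex"].map(trips_df.groupby("h3_09_pickup")["pickup_name"].first())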

hexagons7_df.head(3)

In [None]:
# Our approach uses the choropleth_mapbox function of Plotly Express to build a map.
# To do this, the function below creates a GeoJSON-formatted dictionary that can be passed to Plotly Express.

# taken from https://medium.com/analytics-vidhya/how-to-create-a-choropleth-map-using-uber-h3-plotly-python-458f51593548

def hexagons_dataframe_to_geojson(df_hex, value_field, file_output = None):

    list_features = []

    for i, row in df_hex.iterrows():
        feature = Feature(geometry = row['geometry'],
                          id = row['hex'],
                          properties = {"value": row[value_field]})
        list_features.append(feature)

    feat_collection = FeatureCollection(list_features)

    if file_output is not None:
        with open(file_output, "w") as f:
            json.dump(feat_collection, f)

    else:
        return feat_collection

In [None]:
# Function that visualizes the H3 map

# Adapted from https://medium.com/analytics-vidhya/how-to-create-a-choropleth-map-using-uber-h3-plotly-python-458f51593548

def plot_frequency(dataset, variable, labels, range_color, palette="RdBu"):
    geojson_obj = (hexagons_dataframe_to_geojson(dataset, value_field=variable))

    fig = (px.choropleth_mapbox(
                    dataset,
                    width=700,
                    height=500,
                    geojson=geojson_obj, 
                    locations='hex', 
                 #   hover_name = "pickup_name",
                    color=variable,
                    color_continuous_scale=palette,
                    range_color=range_color,
                    mapbox_style='carto-positron',
                    zoom=10.5,
                    center = {"lat": 41.881832 ,"lon": -87.623177,},
                    opacity=0.7,
                    labels=labels))
    fig.update_layout(
        margin={"r": 0, "t": 0, "l": 0, "b": 0},
    )
    return fig 

In [None]:
# Function to plot the frequency with choropleth colors based on the variable

def plot_frequency_test(dataset, hover_name, variable, labels, range_color, palette="RdBu"):
    fig = (px.choropleth_mapbox(
        dataset,
        width=700,
        height=500,
        geojson=hexagons_dataframe_to_geojson(dataset, value_field=variable),
        locations='hex',
        hover_name=hover_name,
        color=variable,  # Use the 'variable' directly as the color
        color_continuous_scale=palette,
        range_color=range_color,
        zoom=10.5,
        center={"lat": 41.881832, "lon": -87.623177},
        opacity=0.7,
        labels=labels,
        animation_frame="month",
       # animation_group="hex",
        mapbox_style="open-street-map",
    ))
    fig.update_layout(
        sliders=[
            dict(
                active=0,
                bgcolor='black'
            )
        ]
    )
    return fig


# Hex7

We start with hexagon resolution 7. For each resolution we look at the number of starting trips, the number of ending trips and the demand difference (starting trips minus ending trips). We examine the demand difference in more detail by looking at the overall difference as well as the difference during morning (6am-12pm) and evening (2pm-8pm) hours.

For the color range we use the 0.9 quantile of the plotted variable. Where this does not highlight the differences well, we use the 0.95 quantile instead.
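The effect of the quantile on the color range can be checked on a small example. The sketch below is an illustrative alternative, not the code used in the following cells: for diverging maps it takes the quantile of the absolute values, which is an assumption that keeps the range symmetric and positive even when most differences are negative. The function names and the sample values are hypothetical.

```python
import pandas as pd


def symmetric_range(values: pd.Series, q: float = 0.9):
    """Return (-limit, limit), where limit is the q-quantile of |values|."""
    limit = float(values.abs().quantile(q))
    return (-limit, limit)


def positive_range(values: pd.Series, q: float = 0.9):
    """Return (0, limit) for count-like, non-negative values."""
    return (0.0, float(values.quantile(q)))


# Hypothetical demand differences with one extreme hexagon.
diff = pd.Series([-40, -10, -5, 0, 0, 0, 2, 3, 5, 100])

print(symmetric_range(diff, 0.9))
print(positive_range(diff.abs(), 0.9))
```

Clipping at a quantile instead of the maximum keeps a single extreme hexagon from washing out the colors of all the others.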

## Starting Trips

In [None]:
variable = "starting_trips_07"

fig = plot_frequency(
    dataset=hexagons7_df,
    variable=variable,
    labels={variable: "Starting Trips in Res7"},
    range_color=(0, hexagons7_df[variable].quantile(0.9)),
    palette="reds",
)

fig.show()

Most taxi trips start in and around the city center, in areas such as the Loop, Lincoln Park, West Town and Near South Side. These areas are bustling commercial and residential hubs that draw both locals and visitors for various activities. As the **POI notebook** also shows, these hexagons contain many sustenance, arts and culture, and sports POIs.
Two other hotspots stand out: O'Hare Airport and Clearing, where Midway Airport is located. As major transportation hubs, they attract a substantial volume of travelers arriving in or departing from the city.

## Ending Trips

In [None]:
variable = "ending_trips_07"

fig = plot_frequency(
    dataset=hexagons7_df,
    variable=variable,
    labels={variable: "Ending Trips in Res7"},
    range_color=(0, hexagons7_df[variable].quantile(0.9)),
    palette="reds"
)

fig.show()

Just as with the heatmaps, we see a notable similarity between where taxi journeys begin and where they end. This suggests that many trips both start and end within the central areas, rather than connecting the center with other regions. In addition, transportation hubs such as O'Hare Airport and Clearing are naturally popular as both starting and ending points, since they serve incoming as well as outgoing travelers.

## Demand Difference

### Overall

In [None]:
# Calculate demand difference
hexagons7_df["demand_difference_07"] = hexagons7_df["starting_trips_07"] - hexagons7_df["ending_trips_07"]
hexagons7_df.head(3)

In [None]:
# Plotting a map with hexagons depicting the difference in demand considering starting and ending trips

variable = "demand_difference_07"

fig = plot_frequency(
    dataset=hexagons7_df,
    variable=variable,
    labels={variable: "Demand difference in Res7"},
    range_color=(
        -hexagons7_df[variable].quantile(0.95),
        hexagons7_df[variable].quantile(0.95),
    ),
)
fig.show()

To make the differences on the H3 map easier to see, we used a 0.95 quantile instead of 0.9. The overall demand difference is positive for only a few hexagons, meaning that more taxi trips start than end there: O'Hare Airport, Jefferson Park, the Loop, Brighton Park and Clearing. Elsewhere the demand difference is negative or zero. Hexagons in the south of Chicago and on the periphery show a smaller demand difference, so ending trips are less dominant there than in other places. This could be because the outskirts are less populated than areas closer to the city center. To explore possible reasons for the larger negative demand differences, we next look at different times of day.
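Restricting the demand difference to a time window can be sketched on toy data as follows. This is a minimal illustration with hypothetical column names and made-up hexagon ids. It filters on the trip start hour and, as an assumption of this sketch, treats the window as half-open (`start_hour <= hour < end_hour`).

```python
import pandas as pd

# Hypothetical trips with made-up hexagon ids.
trips = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime([
        "2015-03-02 07:15", "2015-03-02 09:40",
        "2015-03-02 15:05", "2015-03-02 19:30",
    ]),
    "h3_pickup": ["a", "b", "a", "b"],
    "h3_dropoff": ["b", "b", "b", "a"],
})


def demand_difference(trips, start_hour, end_hour):
    """Starting minus ending trips per hexagon for start_hour <= hour < end_hour."""
    hour = trips["trip_start_timestamp"].dt.hour
    window = trips[(hour >= start_hour) & (hour < end_hour)]
    starts = window["h3_pickup"].value_counts()
    ends = window["h3_dropoff"].value_counts()
    # fill_value=0 keeps hexagons that occur only as a pickup or only as a dropoff
    return starts.sub(ends, fill_value=0).astype(int)


morning = demand_difference(trips, 6, 12)
evening = demand_difference(trips, 14, 20)
print(morning)
print(evening)
```

A positive value means that more trips start than end in the hexagon during the window, and a negative value means the opposite.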

### During morning and evening rush hour

### TODO: move these helper functions to the top of the notebook?

In [None]:
# Function that counts the trips per 'group by' value whose start hour lies between hours[0] and hours[1] (both inclusive) and stores the counts in column `label`

def calculate_hexagon_trips_by_hours(hexagons_df, label, group_by, hours):
    hexagons_df[label] = (
        trips_df[
            (trips_df["trip_start_timestamp"].dt.hour >= hours[0])
            & (trips_df["trip_start_timestamp"].dt.hour <= hours[1])
        ]
        .groupby(group_by)
        .size()
    )
    hexagons_df[label] = hexagons_df[label].fillna(value=0)

In [None]:
# Function that adds morning and evening start/end counts and their differences to a hexagon DataFrame

def calculate_hexagon_trips_hour(hexagons_df, hexagon_num):
    hexagon_label = f"h3_0{hexagon_num}"
    # Hour bounds are inclusive in calculate_hexagon_trips_by_hours
    morning_hours = [6, 12]
    evening_hours = [14, 20]

    calculate_hexagon_trips_by_hours(
        hexagons_df, label="starting_trips_morning", group_by=f"{hexagon_label}_pickup", hours=morning_hours
    )
    calculate_hexagon_trips_by_hours(
        hexagons_df, label= "ending_trips_morning", group_by=f"{hexagon_label}_dropoff", hours=morning_hours
    )
    calculate_hexagon_trips_by_hours(
        hexagons_df, label= "starting_trips_evening", group_by=f"{hexagon_label}_pickup", hours=evening_hours
    )
    calculate_hexagon_trips_by_hours(
        hexagons_df, label= "ending_trips_evening", group_by=f"{hexagon_label}_dropoff", hours=evening_hours
    )

    hexagons_df["trips_difference_morning"] = (
        hexagons_df["starting_trips_morning"] - hexagons_df["ending_trips_morning"]
    )
    hexagons_df["trips_difference_evening"] = (
        hexagons_df["starting_trips_evening"] - hexagons_df["ending_trips_evening"]
    )

    return hexagons_df

In [None]:
# Calculate the difference between starting and ending trips for the morning (06:00-12:00) 
# and for the evening (14:00-20:00) in Resolution 7, 8 + 9

hexagon_numbers = [7, 8, 9]
for hexagon_num in hexagon_numbers:
    hexagon_df = globals()[f"hexagons{hexagon_num}_df"]
    hexagon_df = hexagon_df.set_index("hex")
    hexagon_df = calculate_hexagon_trips_hour(hexagon_df, hexagon_num)
    hexagon_df = hexagon_df.reset_index()

    # Write the updated DataFrame back to its original variable
    globals()[f"hexagons{hexagon_num}_df"] = hexagon_df

hexagons7_df.head(3)

#hexagons7_df['geometry'] = hexagons7_df['geometry'].apply(lambda geom: geom.wkt) 
#hexagons7_df.to_parquet('hexagons7_df.parquet', index=False)


In [None]:
# Plotting a map with hexagons depicting the demand difference in the morning

variable = "trips_difference_morning"

fig = plot_frequency(
    dataset=hexagons7_df,
    variable=variable,
    labels={variable: "Demand Difference Morning in Res7"},
    range_color=(
        -hexagons7_df[variable].quantile(0.9),
        hexagons7_df[variable].quantile(0.9),
    ),
)
fig.show()

In the morning (6am-12pm) the demand difference in most parts of Chicago is neither strongly positive nor strongly negative. This could reflect a balance between people starting their day and people who have already reached their destinations. The two airports have more ending trips, which could be because the early morning hours are popular for business travel.
Around the Loop, Near North Side, Douglas and Lincoln Park there are also more ending trips, presumably for work reasons as well, since more businesses may be located there.
More trips start in Indian Village, which could be because this residential area has many high-rise buildings and therefore a higher population density.
In the north of Chicago, along the riverside, there are more starting trips, likely because these areas are residential neighborhoods and because public transport is less present there (see **POI Notebook**).

In [None]:
# Plotting a map with hexagons depicting the demand difference in the evening

variable = "trips_difference_evening"

fig = plot_frequency(
    dataset=hexagons7_df,
    variable=variable,
    labels={variable: "Demand Difference Evening in Res7"},
    range_color=(
        -hexagons7_df[variable].quantile(0.9), 
        hexagons7_df[variable].quantile(0.9),
    ),
)
fig.show()

The evening map differs clearly from the morning map. In the evening (2pm-8pm) the demand difference is negative in most parts of Chicago. This could be because people return home from work or from evening activities such as shopping or dining out.
The two airports, in O'Hare and Clearing, have more starting trips, as people return from work trips or vacations later in the day.
In many peripheral parts of Chicago there is again little demand difference, likely because of lower urban density and fewer activity options.
Here Elk Grove Village also becomes prominent for the first time. It offers various outdoor activities and seems to be visited mostly in the evening. As these activities are aimed at families, people presumably go there after school or work.

# Hex8

Next, we will look at resolution 8 to capture finer geographic details.

In [None]:
variable = "starting_trips_08"

fig = plot_frequency(
    dataset=hexagons8_df,
    variable=variable,
    labels={variable: "Starting Trips in Res8"},
    range_color=(0, hexagons8_df[variable].quantile(0.9)),
    palette="reds"
)

fig.show()

Overall, the Resolution 8 H3 map is very similar to the Resolution 7 map. The only slight difference is that we now see a few more differences within the city center and the northern region. In addition, an area around Hyde Park has become more visually prominent. This area contains several museums, Jackson Park and the University of Chicago, and the lake is nearby as well.

In [None]:
variable = "ending_trips_08"

fig = plot_frequency(
    dataset=hexagons8_df,
    variable=variable,
    labels={variable: "Ending Trips in Res8"},
    range_color=(0, hexagons8_df[variable].quantile(0.9)),
    palette="reds"
)

fig.show()

Again, this is similar to Resolution 7. We see more variation within the city center and in the northern areas of Lincoln Park and Uptown, and Hyde Park is more visible here as well.

In [None]:
hexagons8_df["demand_difference_08"] = hexagons8_df["starting_trips_08"] - hexagons8_df["ending_trips_08"]
hexagons8_df.head(3)

In [None]:
# Plotting a map with hexagons depicting the difference in demand considering starting and ending trips

variable = "demand_difference_08"

fig = plot_frequency(
    dataset=hexagons8_df,
    variable=variable,
    labels={variable: "Demand difference Res8"},
    range_color=(
        -hexagons8_df[variable].quantile(0.95),
        hexagons8_df[variable].quantile(0.95),
    ),
)
fig.show()

Compared to Resolution 7, the peripheral areas no longer show a high demand difference. As we already observed in Resolutions 7 and 8, there are fewer starting and ending trips in these areas in general. Besides lower population density, one reason could be the greater availability of public transport in these areas, especially in the south, which is less costly than taking a taxi to the city center.

### During morning and evening rush hour

In [None]:
# Plotting a map with hexagons depicting the demand difference in the morning

variable = "trips_difference_morning"

fig = plot_frequency(
    dataset=hexagons8_df,
    variable=variable,
    labels={variable: "Demand Difference Morning Res8"},
    range_color=(
        -hexagons8_df[variable].quantile(0.9),
        hexagons8_df[variable].quantile(0.9),
    ),
)
fig.show()

Interestingly, this map shows in more detail where starting or ending trips dominate, especially in the city center and around Lincoln Park and Lake View.

In [None]:
# Plotting a map with hexagons depicting the demand difference in the evening

variable = "trips_difference_evening"

fig = plot_frequency(
    dataset=hexagons8_df,
    variable=variable,
    labels={variable: "Demand Difference Evening Res8"},
    range_color=(
        -hexagons8_df[variable].quantile(0.95),
        hexagons8_df[variable].quantile(0.95),
    ),
)
fig.show()

Here again the map is similar to Resolution 7, but we can see more clearly which spots have more starting trips within the predominantly red areas, where more taxi trips end than start.

Overall, we can see areas within Chicago that are not colored at all, which is a consequence of the more detailed nature of a higher hexagon resolution. This already shows us locations without any taxi trips, which Resolution 9 captures even better.

# Hex9

Finally, we look at Resolution 9 to analyze taxi demand in more detail. These maps are a more detailed version of the other two resolutions: the smaller hexagons show more exact locations, and hexagons without trips are no longer drawn. We therefore only comment on maps where a larger difference is visible.

In [None]:
variable = "starting_trips_09"

fig = plot_frequency(
    dataset=hexagons9_df,
    variable=variable,
    labels={variable: "Starting Trips in Res9"},
    range_color=(0, hexagons9_df[variable].quantile(0.9)),
    palette="reds"
)

fig.show()

In [None]:
variable = "ending_trips_09"

fig = plot_frequency(
    dataset=hexagons9_df,
    variable=variable,
    labels={variable: "Ending Trips in Res9"},
    range_color=(0, hexagons9_df[variable].quantile(0.9)),
    palette="reds"
)

fig.show()

In [None]:
hexagons9_df["demand_difference_09"] = hexagons9_df["starting_trips_09"] - hexagons9_df["ending_trips_09"]
hexagons9_df.head(3)

In [None]:
# Plotting a map with hexagons depicting the difference in demand considering starting and ending trips

variable = "demand_difference_09"

fig = plot_frequency(
    dataset=hexagons9_df,
    variable=variable,
    labels={variable: "Demand difference Res9"},
    range_color=(
        -hexagons9_df[variable].quantile(0.95),
        hexagons9_df[variable].quantile(0.95),
    ),
)
fig.show()

### During morning and evening rush hour

In [None]:
# Plotting a map with hexagons depicting the demand difference in the morning

variable = "trips_difference_morning"

fig = plot_frequency(
    dataset=hexagons9_df,
    variable=variable,
    labels={variable: "Demand Difference Morning Res9"},
    range_color=(
        -hexagons9_df[variable].quantile(0.9),
        hexagons9_df[variable].quantile(0.9),
    ),
)
fig.show()

In [None]:
# Plotting a map with hexagons depicting the demand difference in the evening

variable = "trips_difference_evening"

fig = plot_frequency(
    dataset=hexagons9_df,
    variable=variable,
    labels={variable: "Demand Difference Evening Res9"},
    range_color=(
        -hexagons9_df[variable].quantile(0.95),
        hexagons9_df[variable].quantile(0.95),
    ),
)
fig.show()

In [None]:
hexagons9_df['geometry'] = hexagons9_df['geometry'].apply(lambda geom: geom.wkt) 
hexagons9_df.to_parquet('hexagons9_df.parquet', index=False)

As none of the Resolution 9 maps called for additional comments, we conclude that this resolution mainly captures more exact locations, while the other resolutions, especially Resolution 8, already capture the essential patterns.

# Average idle time between trips

When considering Taxis in Chicago, studying idle time is crucial because it unveils moments when taxis are not in active service, waiting for passengers. Analyzing idle time patterns can offer valuable information about taxi supply and demand dynamics, helping to identify areas or times of low passenger activity. This insight can guide Taxi operators in making informed decisions to strategically position taxis, minimize empty driving, and optimize earnings for both drivers and the overall Taxi system.
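As a minimal sketch of the idea, assuming idle time is defined as the gap between the end of one trip and the start of the same taxi's next trip, the calculation can be illustrated on a few made-up trips. The column names follow the ones used in this notebook; the taxi ids and timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical trips of two taxis.
trips = pd.DataFrame({
    "taxi_id": ["t1", "t2", "t1", "t1"],
    "trip_start_timestamp": pd.to_datetime([
        "2015-01-01 08:00", "2015-01-01 08:10",
        "2015-01-01 09:00", "2015-01-01 10:30",
    ]),
    "trip_end_timestamp": pd.to_datetime([
        "2015-01-01 08:20", "2015-01-01 08:40",
        "2015-01-01 09:45", "2015-01-01 11:00",
    ]),
})

# Sort so that shift(-1) returns each taxi's chronologically next trip.
trips = trips.sort_values(["taxi_id", "trip_start_timestamp"])
next_start = trips.groupby("taxi_id")["trip_start_timestamp"].shift(-1)

# Idle time: start of the next trip minus end of the current trip.
trips["idle_time"] = next_start - trips["trip_end_timestamp"]

# The last trip of each taxi has no successor and therefore no idle time.
idle = trips.dropna(subset=["idle_time"])
print(idle[["taxi_id", "idle_time"]])
```

Grouping by `taxi_id` before shifting ensures that the last trip of one taxi is never paired with the first trip of another.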

In [None]:
# Helper for plotting the monthly net demand (inflow - outflow) on a hex map.
# NOTE: get_trips_net_monthly, h3_visualization and lat_lon_leipzig are not defined in
# the cells above, so calling this function as written raises a NameError.
def plot_trips_net_monthly(h3_res: int, time_interval_length: int):
    trips_monthly_net = get_trips_net_monthly(h3_res, time_interval_length)

    trips_monthly_net['month'] = trips_monthly_net['datetime'].dt.month
    trips_monthly_net = trips_monthly_net.sort_values(by=['month'])

    h3_visualization.plot_choropleth(
        trips_monthly_net,
        hex_col="hex_id",
        color_by_col="demand",
        center=lat_lon_leipzig,
        color_continuous_scale="RdBu",
        range_color=(
            -50,
            50
        ),
        animation_frame="month",
        opacity=0.7,
        zoom=10,
        labels={'demand': 'inflow - outflow'},
        mapbox_style="open-street-map",
    )

In [None]:
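# Pair each trip with the same taxi's next trip, so the idle time can be calculated later on
# Sort first so that shift(-1) returns the chronologically next trip of each taxi
trips_sorted = trips_df.sort_values(["taxi_id", "trip_start_timestamp"])
trips_shifted = trips_sorted.groupby("taxi_id").shift(-1).dropna(subset=["trip_start_timestamp"])
trips_with_next = trips_sorted.merge(
    trips_shifted, left_index=True, right_index=True, how="inner", suffixes=("", "_next")
)
trips_with_next['taxi_id'] = trips_df['taxi_id']

In [None]:
# Calculating idle time: start of the next trip minus end of the current trip
trips_with_next['idle_time'] = (
    trips_with_next.trip_start_timestamp_next - trips_with_next.trip_end_timestamp
)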

In [None]:
trips_with_next.idle_time.describe()

In [None]:
trips_with_next['timeinterval'] = (
    trips_with_next.trip_start_timestamp.dt.floor('1D')
)

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(
    trips_with_next.groupby("timeinterval").idle_time.median().dt.total_seconds()
    / 60
    / 60,
)
ax.set_xlabel("Time interval")
ax.set_ylabel("Median idle time (hours)")
ax.set_title("Median idle time in hours over a one year time span")

plt.show()

A line plot shows the median idle time in hours over the one-year period. Overall, the median idle time stays fairly stable. Idle time is higher around December, which could be attributed to holiday-related changes in travel demand; in our **Temporal Demand Pattern Notebook** we also observed lower taxi usage in December. Several short peaks appear between the end of June and November, possibly due to seasonal events or other factors influencing travel behavior, and one appears around January, likely related to weather conditions. For these months we also observed lower taxi usage.
From January to March idle times are shorter, which could be attributed to the cold weather in Chicago during this period. Idle time also drops around June, likely because warmer weather leads to more outdoor activities and higher demand for transportation, which reduces the time taxis wait between trips.

In [None]:
trips_with_next.head(3)

In [None]:
trips_with_next['month'] = trips_with_next.trip_start_timestamp.dt.month

In [None]:
# Hex7
idle_by_hex7_time_median = trips_with_next.groupby(["h3_07_pickup_next", "month"])[
    "idle_time"
].median().rename("idle_time_median")
idle_by_hex7_time_mean = trips_with_next.groupby(["h3_07_pickup_next", "month"])[
    "idle_time"
].mean().rename("idle_time_mean")

idle_by_hex7_time = pd.concat(
    [idle_by_hex7_time_median, idle_by_hex7_time_mean], axis=1
).reset_index()

# Hex9
idle_by_hex9_time_median = trips_with_next.groupby(["h3_09_pickup_next", "month"])[
    "idle_time"
].median().rename("idle_time_median")
idle_by_hex9_time_mean = trips_with_next.groupby(["h3_09_pickup_next", "month"])[
    "idle_time"
].mean().rename("idle_time_mean")

idle_by_hex9_time = pd.concat(
    [idle_by_hex9_time_median, idle_by_hex9_time_mean], axis=1
).reset_index()

In [None]:
# Hex7
idle_by_hex7_time["idle_time_median_days"] = (
    idle_by_hex7_time["idle_time_median"].dt.total_seconds() / 60 / 60 / 24
)
idle_by_hex7_time["idle_time_mean_days"] = (
    idle_by_hex7_time["idle_time_mean"].dt.total_seconds() / 60 / 60 / 24
)

# Hex9
idle_by_hex9_time["idle_time_median_days"] = (
    idle_by_hex9_time["idle_time_median"].dt.total_seconds() / 60 / 60 / 24
)
idle_by_hex9_time["idle_time_mean_days"] = (
    idle_by_hex9_time["idle_time_mean"].dt.total_seconds() / 60 / 60 / 24
)
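
The per-hexagon and per-month aggregation above can also be written more compactly. The sketch below uses pandas named aggregation on toy data. The cell ids `h1` and `h2` and all idle times are hypothetical, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Toy data: idle times per hexagon and month (hypothetical values).
toy = pd.DataFrame(
    {
        "hex": ["h1", "h1", "h1", "h2", "h2"],
        "month": [1, 1, 2, 1, 1],
        "idle_time": pd.to_timedelta(["1h", "3h", "2h", "12h", "36h"]),
    }
)

# Median and mean in a single pass via named aggregation.
per_hex_month = (
    toy.groupby(["hex", "month"])["idle_time"]
    .agg(idle_time_median="median", idle_time_mean="mean")
    .reset_index()
)

# Dividing a timedelta column by one day yields the duration in days as float.
one_day = pd.Timedelta(days=1)
per_hex_month["idle_time_median_days"] = per_hex_month["idle_time_median"] / one_day
per_hex_month["idle_time_mean_days"] = per_hex_month["idle_time_mean"] / one_day
print(per_hex_month)
```

Dividing by `pd.Timedelta(days=1)` is equivalent to converting to seconds and dividing by 60, 60 and 24.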

In [None]:
# Hex7
idle_by_hex7 = idle_by_hex7_time.groupby("h3_07_pickup_next").mean().reset_index()

# Hex9
idle_by_hex9 = idle_by_hex9_time.groupby("h3_09_pickup_next").mean().reset_index()

In [None]:
# Hex7
idle_by_hex7.rename(columns={"h3_07_pickup_next":"hex"}, inplace=True)
idle_by_hex7_time.rename(columns={"h3_07_pickup_next":"hex"}, inplace=True)

# Hex9
idle_by_hex9.rename(columns={"h3_09_pickup_next":"hex"}, inplace=True)
idle_by_hex9_time.rename(columns={"h3_09_pickup_next":"hex"}, inplace=True)

In [None]:
import math
# Hex7
idle_by_hex7['month'] = idle_by_hex7['month'].astype(int)

idle_by_hex7.head(3)

# Hex9
idle_by_hex9['month'] = idle_by_hex9['month'].astype(int)

idle_by_hex9.head(3)

In [None]:
# Hex7
# Hex7
idle_by_hex7['geometry'] = idle_by_hex7.apply(add_geometry, axis=1)
# Apply to the per-month frame itself so the geometries align with its rows
idle_by_hex7_time['geometry'] = idle_by_hex7_time.apply(add_geometry, axis=1)

# Hex9
idle_by_hex9['geometry'] = idle_by_hex9.apply(add_geometry, axis=1)
idle_by_hex9_time['geometry'] = idle_by_hex9_time.apply(add_geometry, axis=1)
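
When a column is assigned from a Series, pandas aligns the values on the index, so a geometry column should be computed from the same frame it is assigned to. The sketch below illustrates this on toy data. `hex_to_label` is a hypothetical stand-in for the notebook's `add_geometry` helper and returns a string instead of a polygon.

```python
import pandas as pd

# One row per hexagon vs. one row per hexagon and month (toy data).
per_hex = pd.DataFrame({"hex": ["h1", "h2"]})
per_hex_month = pd.DataFrame(
    {"hex": ["h1", "h1", "h2", "h2"], "month": [1, 2, 1, 2]}
)


def hex_to_label(row):
    # Hypothetical stand-in for add_geometry: derives a value from the "hex" column.
    return f"geom({row['hex']})"


# Computed from the smaller frame: values are matched by index label, so row 1
# receives the label of h2 although it belongs to h1, and rows 2 and 3 become NaN.
misaligned = per_hex_month.assign(geometry=per_hex.apply(hex_to_label, axis=1))

# Computed from the frame itself: every row gets the value for its own hexagon.
aligned = per_hex_month.assign(geometry=per_hex_month.apply(hex_to_label, axis=1))

print(misaligned)
print(aligned)
```

A large number of missing geometries after such an assignment is therefore a hint that the source and the target frame do not share the same index.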

In [None]:
# Hex7
idle_by_hex7_time['geometry'].isna().sum()
idle_by_hex7_time = idle_by_hex7_time.dropna(axis=0)
idle_by_hex7_time.head(3)

# Hex9
idle_by_hex9_time['geometry'].isna().sum()
idle_by_hex9_time = idle_by_hex9_time.dropna(axis=0)
idle_by_hex9_time.head(3)

As there is little difference between Resolution 7 and 8, we will only visualize Resolution 7 and 9.

In [None]:
idle_by_hex7 = idle_by_hex7.sort_values('month')

variable = "idle_time_median_days"

fig = plot_frequency(
    dataset=idle_by_hex7,
    variable=variable,
    labels={variable: "Idle Time in Res7"},
    range_color=(0, idle_by_hex7[variable].quantile(0.9)),
    palette="reds",

)

fig.show()

It can be observed that idle time is higher in the peripheral areas of Chicago. This is possibly due to the pattern seen earlier in this analysis: fewer trips start and end in these areas, so there is generally less demand for Taxis. The possible reasons we mentioned were alternative transport options, longer distances resulting in higher fares, and a lower residential population.
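
The map above caps its color range at the 90th percentile of the plotted variable. The sketch below shows on made-up values why this helps: a single extreme hexagon would otherwise compress all other hexagons into the lowest part of the color scale. We assume here that `plot_frequency` forwards `range_color` to Plotly, which maps values above the upper bound to the last color.

```python
import pandas as pd

# Made-up idle times in days for ten hexagons, including one extreme outlier.
values = pd.Series([0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.6, 9.0])

# Upper bound of the color range, as in range_color=(0, quantile(0.9)).
upper = values.quantile(0.9)

# Position of each value on a 0..1 color scale without and with the cap.
full_scale = values / values.max()
capped_scale = values.clip(upper=upper) / upper

print(full_scale.round(2).tolist())
print(capped_scale.round(2).tolist())
```

With the cap, the typical hexagons spread over a much larger part of the color scale, while the outlier is simply drawn in the darkest color.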

In [None]:
idle_by_hex9 = idle_by_hex9.sort_values('month')

variable = "idle_time_median_days"

fig = plot_frequency(
    dataset=idle_by_hex9,
    variable=variable,
    labels={variable: "Idle Time in Res9"},
    range_color=(0, idle_by_hex9[variable].quantile(0.9)),
    palette="reds",

)

fig.show()

Finally, at Resolution 9 we can see in more detail in which locations the idle time is higher or lower. The overall pattern is already well captured at Resolution 7, as both maps look similar.

To summarize this notebook, we observed higher Taxi Demand in the city center (the Loop), in the northern areas above the city center such as Lincoln Park, in the southern parts around Hyde Park, and at the two airports, O'Hare International and Midway International. Linking these observations with those from the **Temporal Demand Patterns and Price Analysis** and **POI Notebook** and with our research, we can explain the patterns for each area as follows:
- The city center has a higher number of activities such as sustenance, arts and culture, sports and education, leading to the highest Taxi Demand
- The northern areas, above the city center and nearer to the riverside, have more residential areas with less access to public transport, so the Taxi Demand is higher
- The southern areas around Hyde Park are home to several museums, the University of Chicago and a park near the lake, leading to higher Taxi Demand in these parts
- Chicago O'Hare International Airport and Chicago Midway International Airport have higher demand in general because they are important transportation hubs. In the morning there are more ending trips and in the evening more starting trips, likely due to business travel and its time-bound nature
- Peripheral areas have less Taxi Demand, likely due to a lower population density, better access to public transport and higher taxi fares caused by the greater distance to the city center

As an outlook, more POIs can be studied in relation to Taxi Demand. Chicago also has a high level of criminal activity, so its influence on Taxi Demand could be analyzed as well.