# Descriptive (Spatial) Analytics

Analyze taxi demand patterns for the relevant one-year period and 
city (please check carefully which year your team has been allocated). 

Specifically, show how these
patterns (start time, trip length, start and end location, price, average idle time between trips, and so 
on) vary for the given sample across different spatio-temporal resolutions (i.e., census tract vs. varying
hexagon diameter and/or temporal bin sizes). 

Give possible reasons for the observed patterns.

## 1.1 Average idle time

Long gaps between two trips usually mean that the driver is off duty rather than waiting for the next passenger. To keep such gaps out of the average, we introduce a threshold of xx and treat only gaps that do not exceed it as significant idle periods. We calculate the average idle time in the following steps:

1. **Sorting the Trips in ascending order for each taxi:** We sort the trips by taxi and, within each taxi, by start time in ascending order.

2. **Identifying the Idle Periods:** For each taxi, we compute the time gap between the end time of one trip and the start time of the next trip. 

3. **Introducing a threshold of xx hours:** If a time gap exceeds the defined threshold of xx minutes (i.e., xx hours), we ignore it, because the driver is most likely not working during that time. Excluding these long gaps gives a more accurate picture of the idle time during active working periods.

4. **Summing up the Idle Times:** We add up all significant idle times that remain after step 3, across all taxis, to get the total idle time.

5. **Counting the Idle Periods:** We count the significant idle periods that remain after step 3. This is the total number of significant idle periods across all taxis.

6. **Calculating the Average:** We divide the total idle time by the number of significant idle periods to find the average idle time per significant idle period.
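
Steps 3 to 6 can be summarized in one expression. The notation is an assumption introduced here for illustration only: $g_i$ is the $i$-th gap (in minutes) between two consecutive trips of the same taxi, $\tau$ is the threshold, and $S$ is the set of significant idle periods.

$$
\bar{g} = \frac{1}{|S|} \sum_{i \in S} g_i, \qquad S = \{\, i : g_i \le \tau \,\}
$$

The numerator corresponds to the total idle time from step 4 and $|S|$ to the count from step 5.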


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from datetime import datetime
import numpy as np


In [2]:
# import datasets
# forward slashes avoid invalid escape sequences such as "\d" and work on all platforms
dfChicago = pd.read_csv("data/datasets/df_chicago.csv.zip")
dfChicago_hourly = pd.read_csv("data/datasets/df_chicago_hourly.csv")

In [3]:
# Sort trips by taxi and, within each taxi, by start time
dfChicago_avg_idle_time = dfChicago.sort_values(by=['Taxi_ID', 'Original_Trip_Start_Timestamp'])

# Reset index
dfChicago_avg_idle_time.reset_index(drop=True, inplace=True)

# Threshold in minutes (480 minutes = 8 hours)
threshold_minutes = 480

# Gap in minutes between the start of each trip and the end of the trip in the previous row
time_diff = (
    pd.to_datetime(dfChicago_avg_idle_time["Original_Trip_Start_Timestamp"]) -
    pd.to_datetime(dfChicago_avg_idle_time["Original_Trip_End_Timestamp"].shift())
).dt.total_seconds() / 60

# True where the previous row belongs to the same taxi, so the gap is a real idle period
same_taxi = dfChicago_avg_idle_time["Taxi_ID"] == dfChicago_avg_idle_time["Taxi_ID"].shift()

# A gap is a significant idle period if it lies between two trips of the same taxi
# and does not exceed the threshold (missing gaps compare as False and are excluded)
is_idle_period = same_taxi & (time_diff <= threshold_minutes)

# Flag significant idle periods (1/0) and store their length (0 for all other rows)
dfChicago_avg_idle_time["Idle Period"] = is_idle_period.astype(int)
dfChicago_avg_idle_time["Idle Time Minutes"] = np.where(is_idle_period, time_diff, 0)


In [None]:
average_idle_time = dfChicago_avg_idle_time["Idle Time Minutes"].sum() / dfChicago_avg_idle_time["Idle Period"].sum()
average_idle_time
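
As a sanity check, the same calculation can be run on a handful of made-up trips where the result is easy to verify by hand. This is a minimal sketch, not the notebook's implementation: the helper `average_idle_minutes` and the toy timestamps are hypothetical, and it uses `groupby(...).shift()` instead of the global `shift()` above so that gaps never cross taxi boundaries.

```python
import pandas as pd

# Hypothetical toy data: two taxis, timestamps chosen for illustration only.
trips = pd.DataFrame({
    "Taxi_ID": ["A", "A", "A", "B", "B"],
    "Original_Trip_Start_Timestamp": pd.to_datetime([
        "2000-01-03 08:00", "2000-01-03 08:30", "2000-01-03 20:00",
        "2000-01-03 09:00", "2000-01-03 10:00",
    ]),
    "Original_Trip_End_Timestamp": pd.to_datetime([
        "2000-01-03 08:20", "2000-01-03 09:00", "2000-01-03 20:15",
        "2000-01-03 09:30", "2000-01-03 10:20",
    ]),
})


def average_idle_minutes(trips, threshold_minutes):
    """Mean gap (in minutes) between consecutive trips of the same taxi,
    ignoring gaps longer than `threshold_minutes`."""
    trips = trips.sort_values(["Taxi_ID", "Original_Trip_Start_Timestamp"])
    # End time of the previous trip of the same taxi (NaT for a taxi's first trip)
    prev_end = trips.groupby("Taxi_ID")["Original_Trip_End_Timestamp"].shift()
    gaps = (trips["Original_Trip_Start_Timestamp"] - prev_end).dt.total_seconds() / 60
    # Keep only significant idle periods; NaN gaps compare as False and drop out
    significant = gaps[gaps <= threshold_minutes]
    return significant.mean()


# Gaps are 10 and 660 minutes for taxi A and 30 minutes for taxi B.
# With an 8-hour threshold the 660-minute gap is ignored.
print(average_idle_minutes(trips, threshold_minutes=480))  # 20.0
```

The sketch does not handle negative gaps (overlapping trips in the raw data); whether to drop those is a separate data-cleaning decision.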

In [None]:
# Plot only rows flagged as significant idle periods; all other rows carry a
# placeholder 0 in "Idle Time Minutes" and would inflate the first bin
idle_time_minutes = dfChicago_avg_idle_time.loc[
    dfChicago_avg_idle_time["Idle Period"] == 1, "Idle Time Minutes"
]

# Set the figure size (width, height) in inches
plt.figure(figsize=(10, 6))

# Plot the histogram and keep the bin edges to report the bin width
hist, bin_edges, _ = plt.hist(idle_time_minutes, bins=20, edgecolor='black', align='mid')

plt.xlabel("Idle Time Minutes")
plt.ylabel("Frequency")
plt.title("Histogram of Idle Time Minutes")

# Box style for the note
border_props = dict(boxstyle='round,pad=0.5', fc='lightgray', ec='black', lw=2)

# Place the note in axes coordinates so it stays inside the plot for any data range
plt.text(0.97, 0.9, f"The bin size is {bin_edges[1] - bin_edges[0]:.1f} minutes.", fontsize=12, ha='right', va='center', color='black', bbox=border_props, transform=plt.gca().transAxes)

plt.show()


### Census tract vs. varying hexagon diameter

### Census tract vs. different temporal bin sizes
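
One possible way to compare temporal bin sizes is to group idle periods by the hour of day in which the following trip starts and to vary the bin width. The sketch below only illustrates that idea: the helper `mean_idle_by_bin`, the index name `bin_start_hour`, and the toy values are hypothetical, and only the column names are taken from the cells above.

```python
import pandas as pd

# Hypothetical toy data: one row per significant idle period.
idle = pd.DataFrame({
    "Original_Trip_Start_Timestamp": pd.to_datetime([
        "2000-01-03 00:30", "2000-01-03 01:15", "2000-01-03 02:45",
        "2000-01-03 05:10", "2000-01-03 13:00", "2000-01-03 23:59",
    ]),
    "Idle Time Minutes": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})


def mean_idle_by_bin(df, bin_hours):
    """Average idle time per time-of-day bin of width `bin_hours` hours.

    Each bin is labelled by its starting hour (0, bin_hours, 2 * bin_hours, ...).
    """
    start_hour = df["Original_Trip_Start_Timestamp"].dt.hour
    # Integer division maps each hour to the first hour of its bin
    bin_start = ((start_hour // bin_hours) * bin_hours).rename("bin_start_hour")
    return df.groupby(bin_start)["Idle Time Minutes"].mean()


# Compare a fine, a medium, and a whole-day resolution
for bin_hours in (2, 6, 24):
    print(f"--- {bin_hours}-hour bins ---")
    print(mean_idle_by_bin(idle, bin_hours))
```

Wider bins smooth out short-term variation, while narrow bins may contain too few idle periods for a stable average, so the choice of bin width affects which patterns are visible.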

## More features