## Spatial Hot Spots

In the following, we attempt to group the cab trips into clusters of start and end points using the processed dataset, so that spatial hotspots can be identified.

We use soft clustering with a Gaussian mixture model (GMM). Start and end points are analyzed separately, since their distributions can differ: empty trips between a dropoff and the next pickup are not included in the dataset.
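
To illustrate what "soft" means here, the following minimal sketch fits a GMM to synthetic points and compares the membership probabilities with the hard labels. The two centers, the spread and the sample size are made up for illustration and are not taken from the trip data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic (lon, lat)-like points around two made-up centers (assumption, not trip data)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[-87.63, 41.88], scale=0.01, size=(200, 2)),
    rng.normal(loc=[-87.90, 41.98], scale=0.01, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(points)

# Soft assignment: one membership probability per component, each row sums to 1
memberships = gmm.predict_proba(points)
# Hard assignment for comparison: index of the most probable component
labels = gmm.predict(points)
print(memberships[:3].round(3))
```

A point between two hotspots receives a split membership instead of being forced into one cluster, which is the main difference from K-means.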

Only the sampled dataset with about 1% of the data is used, because a run on the full dataset had not finished after more than 96 hours.

#### Imports and preparatory calculations
In the following, the necessary libraries are imported, and the dataset is loaded and briefly inspected.

Furthermore, preparatory calculations are performed to extract the longitude and latitude.

In [1]:
# Import necessary libraries
import re
import folium
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

In [2]:
# Import dataset
#trips_df = pd.read_parquet('../../data/rides/Taxi_Trips_Sampled_Cleaned.parquet') #Sampled
trips_df = pd.read_parquet('../../data/rides/Taxi_Trips_Cleaned.parquet') #NonSampled

In [3]:
# General presentation of the dataframe
trips_df.info()
trips_df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17065882 entries, 0 to 17065881
Data columns (total 44 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   taxi_id                 int64         
 1   trip_start_timestamp    datetime64[ns]
 2   trip_end_timestamp      datetime64[ns]
 3   trip_seconds            float64       
 4   trip_miles              float64       
 5   pickup_census_tract     int64         
 6   dropoff_census_tract    int64         
 7   pickup_community_area   int64         
 8   dropoff_community_area  int64         
 9   fare                    float64       
 10  tips                    float64       
 11  tolls                   float64       
 12  Extras                  float64       
 13  trip_total              float64       
 14  payment_type            object        
 15  Company                 object        
 16  hour_start              int32         
 17  4_hour_block_start      int32         
 18  

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,...,h3_07_dropoff,h3_08_pickup,h3_08_dropoff,h3_09_pickup,h3_09_dropoff,pickup_centroid,dropoff_centroid,datetime,temp,precip
0,1,2015-01-01 00:00:00,2015-01-01 00:00:00,420.0,1.0,17031081500,17031320100,8,32,6.05,...,872664c1effffff,882664c1e1fffff,882664c1e3fffff,892664c1e0fffff,892664c1e2fffff,POINT (-87.626214906 41.892507781),POINT (-87.620992913 41.884987192),2015-01-01,-7.0115,0
1,2,2015-01-01 00:30:00,2015-01-01 00:30:00,480.0,1.9,17031081700,17031832600,8,7,7.65,...,872664c13ffffff,882664c1e7fffff,882664c135fffff,892664c1e73ffff,892664c13cfffff,POINT (-87.63186395 41.892042136),POINT (-87.654007029 41.914747305),2015-01-01,-7.0115,0
2,3,2015-01-01 00:30:00,2015-01-01 00:45:00,300.0,1.0,17031081700,17031842200,8,8,5.25,...,872664c13ffffff,882664c1e7fffff,882664c137fffff,892664c1e73ffff,892664c1377ffff,POINT (-87.63186395 41.892042136),POINT (-87.649907226 41.904935302),2015-01-01,-7.0115,0
3,4,2015-01-01 00:30:00,2015-01-01 00:30:00,180.0,0.7,17031062800,17031062900,6,6,4.65,...,872664c16ffffff,882664c129fffff,882664c163fffff,892664c1293ffff,892664c162fffff,POINT (-87.661265218 41.936159071),POINT (-87.656411531 41.936237179),2015-01-01,-7.0115,0
4,5,2015-01-01 00:30:00,2015-01-01 00:45:00,600.0,2.2,17031051300,17031071200,5,7,8.25,...,872664c13ffffff,882664c12dfffff,882664c107fffff,892664c12dbffff,892664c106fffff,POINT (-87.675821928 41.935983574),POINT (-87.646210977 41.921854911),2015-01-01,-7.0115,0


#### Extract the longitude and latitude
First, we extract the longitude and latitude from the data to cluster based on them.  
This is done once for the pickup data and once for the dropoff data.

In [4]:
# Extract the longitude and latitude
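def extract_coordinates(point):
    # Parse a WKT string such as 'POINT (-87.62 41.89)' into [longitude, latitude];
    # the sign is also matched for integer-valued coordinates
    coords = re.findall(r"[-+]?\d*\.\d+|[-+]?\d+", point)
    return [float(coord) for coord in coords]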

trips_df['pickup_point'] = trips_df['pickup_centroid'].apply(extract_coordinates)
trips_df['dropoff_point'] = trips_df['dropoff_centroid'].apply(extract_coordinates)

pickup_coordinates = np.vstack(trips_df['pickup_point'].values)
dropoff_coordinates = np.vstack(trips_df['dropoff_point'].values)
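
As a possible alternative to the row-wise `apply`, the same extraction can be sketched with the vectorized `Series.str.extract`. The two sample strings are copied from the table above; the regular expression and the names `lon`, `lat` and `pickup_xy` are assumptions introduced for this example.

```python
import pandas as pd

# Two WKT strings copied from the table above; the column name mirrors the dataset
sample = pd.DataFrame({'pickup_centroid': [
    'POINT (-87.626214906 41.892507781)',
    'POINT (-87.63186395 41.892042136)',
]})

# Named groups become column names; WKT order is longitude first, then latitude
pattern = r'POINT \((?P<lon>[-+]?\d+(?:\.\d+)?) (?P<lat>[-+]?\d+(?:\.\d+)?)\)'
coords = sample['pickup_centroid'].str.extract(pattern).astype(float)

# Array of shape (n_rows, 2) with columns [lon, lat], ready for clustering
pickup_xy = coords.to_numpy()
print(pickup_xy)
```

This avoids building a Python list per row, which may matter for a dataset with millions of trips.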

### Determination of the hyperparameters

We now set the hyperparameters for our GMM. The first is the number of clusters (mixture components) we want to find. In addition, the optimal bandwidth for the kernel density estimation has to be determined.

#### Calculation of the optimal number of clusters with the elbow method
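First, we fit GMMs with 1 to 9 components and compute the negative average log-likelihood of each model. A GMM has no inertia in the K-means sense, so this score takes its place. We then use the visualization to select the number of clusters at the point where an elbow can be seen.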

In [None]:
n_components = np.arange(1, 10)

# Model for pickup
models_pickup = [GaussianMixture(n, covariance_type='full', random_state=0).fit(pickup_coordinates)
                 for n in n_components]
# Model for dropoff
models_dropoff = [GaussianMixture(n, covariance_type='full', random_state=0).fit(dropoff_coordinates)
                  for n in n_components]

In [None]:
plt.figure(figsize=(8, 6))

# Plot pickup
plt.plot(n_components, [-m.score(pickup_coordinates) for m in models_pickup], color='blue', label='Pickup', marker='o')

# Plot dropoff
plt.plot(n_components, [-m.score(dropoff_coordinates) for m in models_dropoff], color='red', label='Dropoff', marker='o')

# Labeling and legend
plt.title('Elbow method for determining the optimal number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Negative average log-likelihood')
plt.legend(loc='best')
plt.show()

It can be seen that the score does not decrease noticeably after 4 to 5 clusters. Therefore, 4 components are used for the GMM below in order to analyze them further.
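
A visual elbow can be cross-checked with an information criterion. The sketch below uses the `bic` method of `GaussianMixture` on synthetic data with three made-up centers; the data and the candidate range are assumptions for illustration, not the values used in this analysis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the coordinate array: three made-up centers
rng = np.random.default_rng(0)
centers = np.array([[-87.63, 41.88], [-87.90, 41.98], [-87.75, 41.79]])
X = np.vstack([rng.normal(c, 0.01, size=(300, 2)) for c in centers])

candidates = range(1, 7)
bic = [GaussianMixture(n, covariance_type='full', random_state=0).fit(X).bic(X)
       for n in candidates]

# Lower BIC is better; unlike the raw log-likelihood it penalizes extra parameters
best_n = candidates[int(np.argmin(bic))]
print(best_n)
```

Because the log-likelihood can only improve as components are added, a penalized criterion such as BIC gives a less subjective choice than reading an elbow off a plot.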

#### Calculation of the optimal bandwidth for KDE
We define a function, calculate_best_bandwidth, which determines the optimal bandwidth for kernel density estimation using GridSearchCV. We then calculate the best bandwidth for both the pickup and the dropoff coordinates of the trips.

In [None]:
def calculate_best_bandwidth(coordinates, bandwidths_range=(-1, 1, 10), cv=5, n_jobs=-1):
    # Calculate the optimal bandwidth
    bandwidths = 10 ** np.linspace(*bandwidths_range)

    # Generate a KernelDensity object
    kde = KernelDensity()

    # Use GridSearchCV to find the best bandwidth
    # Set n_jobs=-1 to use all available cores
    grid = GridSearchCV(kde, {'bandwidth': bandwidths}, cv=cv, n_jobs=n_jobs)  # Cross-validation
    grid.fit(coordinates)

    # Best bandwidth
    best_bandwidth = grid.best_params_['bandwidth']

    return best_bandwidth

# Calculate best bandwidth for pickup location
best_bandwidth_pickup = calculate_best_bandwidth(pickup_coordinates)
print("Best Bandwidth Pickup: ", best_bandwidth_pickup)

# Calculate best bandwidth for dropoff location
best_bandwidth_dropoff = calculate_best_bandwidth(dropoff_coordinates)
print("Best Bandwidth Dropoff: ", best_bandwidth_dropoff)
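
To make the search grid concrete, the following sketch expands the default `bandwidths_range` and runs the same kind of grid search on a random subsample. The synthetic data and the subsample size of 1000 are assumptions chosen so that the example runs quickly.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Synthetic stand-in for the coordinate array (assumption, not trip data)
rng = np.random.default_rng(0)
X = rng.normal([-87.63, 41.88], 0.02, size=(5000, 2))

# The default range (-1, 1, 10) expands to 10 log-spaced values from 0.1 to 10
bandwidths = 10 ** np.linspace(-1, 1, 10)

# Hypothetical subsample size; KDE scoring cost grows with the number of training points
subsample = X[rng.choice(len(X), size=1000, replace=False)]

grid = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': bandwidths}, cv=5)
grid.fit(subsample)
print(grid.best_params_['bandwidth'])
```

The bandwidth is expressed in the units of the input, here degrees, and 0.1 degrees of latitude is roughly 11 km. If the search returns the smallest value of the grid, it may be worth extending the range downwards.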

#### Create a spatial kernel density estimate
Now we create a spatial kernel density estimate for both the pickup and the dropoff coordinates using a Gaussian kernel. We then compute the density estimates at the data points themselves.

In [None]:
# Create a spatial kernel density estimation
kde_pickup = KernelDensity(kernel='gaussian', bandwidth=best_bandwidth_pickup).fit(pickup_coordinates)
kde_dropoff = KernelDensity(kernel='gaussian', bandwidth=best_bandwidth_dropoff).fit(dropoff_coordinates)

# Calculate the density estimation for the data points
pickup_density = np.exp(kde_pickup.score_samples(pickup_coordinates))
dropoff_density = np.exp(kde_dropoff.score_samples(dropoff_coordinates))
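
A fitted KDE can also be evaluated on a regular grid, for example to locate the density peak or to draw a heat map. The sketch below assumes synthetic data, a hand-picked bandwidth of 0.01 and a 50 x 50 grid; none of these values come from the analysis above.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic stand-in for the coordinate array, columns are [lon, lat]
rng = np.random.default_rng(0)
X = rng.normal([-87.63, 41.88], 0.02, size=(2000, 2))
kde = KernelDensity(kernel='gaussian', bandwidth=0.01).fit(X)  # bandwidth chosen by hand

# Regular 50 x 50 grid spanning the bounding box of the points
lon = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
lat = np.linspace(X[:, 1].min(), X[:, 1].max(), 50)
lon_grid, lat_grid = np.meshgrid(lon, lat)
grid_points = np.column_stack([lon_grid.ravel(), lat_grid.ravel()])

# score_samples returns log-densities, so exponentiate to get densities
density = np.exp(kde.score_samples(grid_points)).reshape(lon_grid.shape)

# Rows of the grid follow latitude, columns follow longitude
peak_row, peak_col = np.unravel_index(density.argmax(), density.shape)
peak = (lon[peak_col], lat[peak_row])
print(peak)
```

Evaluating on a fixed grid keeps the cost proportional to the number of grid cells rather than to the number of trips being scored.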

## Gaussian Mixture Model

Now we fit the GMMs, passing the KDE estimates along, and calculate the spatial hotspot centers and their sizes.

In [None]:
# Fit Gaussian mixture models to the coordinates
# Note: GaussianMixture.fit(X, y) ignores y, so the densities passed here do not affect the fit
gmm_pickup = GaussianMixture(n_components=4).fit(pickup_coordinates, pickup_density)
gmm_dropoff = GaussianMixture(n_components=4).fit(dropoff_coordinates, dropoff_density)
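
In scikit-learn, the second argument of `GaussianMixture.fit` exists only for API consistency and is not used. The following sketch checks this on synthetic data; `fake_density` is an arbitrary stand-in for the KDE values, and the fixed `random_state` is added here so that both fits start from the same initialization.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the coordinate array (assumption, not trip data)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([-87.63, 41.88], 0.01, size=(200, 2)),
    rng.normal([-87.90, 41.98], 0.01, size=(200, 2)),
])
# Arbitrary stand-in for the per-point densities
fake_density = rng.random(len(X))

with_y = GaussianMixture(n_components=2, random_state=0).fit(X, fake_density)
without_y = GaussianMixture(n_components=2, random_state=0).fit(X)

# Both fits see the same data and the same seed, so the parameters coincide
same_means = np.allclose(with_y.means_, without_y.means_)
same_covs = np.allclose(with_y.covariances_, without_y.covariances_)
print(same_means, same_covs)
```

If the densities are meant to influence the hotspots, they would have to enter in another way, since `GaussianMixture` offers no sample weights.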

In [None]:
# Cache the results in variables
pickup_covariances = gmm_pickup.covariances_
dropoff_covariances = gmm_dropoff.covariances_
pickup_centers = gmm_pickup.means_
dropoff_centers = gmm_dropoff.means_

# Display the location (means) and size (covariance matrices) of the identified hotspots
print("Hotspot Centers (Pickup):", pickup_centers)
print("Hotspot Sizes (Pickup):", pickup_covariances)
print("Hotspot Centers (Dropoff):", dropoff_centers)
print("Hotspot Sizes (Dropoff):", dropoff_covariances)
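
The covariance matrices are expressed in squared degrees, which is hard to read as a size. One possible way to interpret them, sketched below, is to rescale a matrix to kilometers and take the square roots of its eigenvalues. The matrix `cov` and the factor of 111 km per degree are illustrative assumptions, and the conversion is only a rough local approximation.

```python
import numpy as np

# Made-up covariance of one component in squared degrees, ordered as (lon, lat)
cov = np.array([[4.0e-4, 1.0e-4],
                [1.0e-4, 2.0e-4]])
center_lat = 41.88  # approximate latitude of Chicago

# Rough conversion: one degree of latitude is about 111 km,
# one degree of longitude shrinks by cos(latitude)
km_per_deg = np.array([111.0 * np.cos(np.radians(center_lat)), 111.0])

# Scaling the coordinates by d scales the covariance entry (i, j) by d[i] * d[j]
cov_km = cov * np.outer(km_per_deg, km_per_deg)

# Eigenvalues are the variances along the principal axes, in ascending order
std_km = np.sqrt(np.linalg.eigvalsh(cov_km))
print(std_km)
```

The two values are the standard deviations along the short and long axes of the hotspot ellipse, so they give a rough extent of each hotspot in kilometers.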

## Visualization
Now we visualize the identified hotspots for both pickup and dropoff locations on a map centered around Chicago. Blue markers indicate pickup centers, while red markers represent dropoff centers. 

In [None]:
# Creating a map centered on Chicago
m = folium.Map(location=[41.8781, -87.6298], zoom_start=11, control_scale=False)

# Fetching the hotspot centers
pickup_centers = gmm_pickup.means_
dropoff_centers = gmm_dropoff.means_

# Adding markers for pickup centers (means are [lon, lat]; folium expects [lat, lon])
for center in pickup_centers:
    folium.Marker(location=[center[1], center[0]],
                  icon=folium.Icon(color='blue'),
                  popup='Pickup Center').add_to(m)

# Adding markers for dropoff centers
for center in dropoff_centers:
    folium.Marker(location=[center[1], center[0]],
                  icon=folium.Icon(color='red'),
                  popup='Dropoff Center').add_to(m)

# Display the map
m