# Team 7 Project - Yellow Taxi Dataset



## Table of Contents

1. [Introduction](#1.-Introduction)
2. [Imports](#2.-Imports)
3. [Data-Preparation](#3.-Data-Preperation)
    - [3.1 Weather-Dataset](#3.1-Clean-up-of-the-Weather-Data-Set)
    - [3.2 Taxi-Dataset](#3.2-Clean-up-of-the-Yellow-Taxi-Data-Set)
    - [3.3 Taxi-Zones](#3.3-Clean-up-of-the-Taxi-Zones)
    - [3.4 Merged Zones + Trip](#3.4-Merge-of-zones-and-trips)
    - [3.5 Merge of Weather + Trip](#3.5-Merge-of-weather-and-trips)
    - [3.6 Model Specific Feature-Engineering](#3.6-Model-Specific-Feature-Engineering)
4. [Descriptive Analysis](#4.-Descriptive-Analysis)
    - [4.1 Temperature](#4.1-Number-of-rides-according-to-the-temperature)
    - [4.2 Common Pickup-Zones](#4.2-Most-Common-Pickup-Zones)
        - [4.2.1 Common Pickup Heatmap](#4.2.1-Heatmap)
    - [4.3 Common Dropoff-Zones](#4.3-Most-Common-Dropoff-Zones)
    - [4.4 Most Common Pairs](#4.4-Most-Common-Pairs)
    - [4.5 Temporal Trip Patterns](#4.5-Temporal-Trip-Patterns)
        - [4.5.1 Time-of-Day Usage](#4.5.1-Time-of-Day-Usage)
        - [4.5.2 Day-of-Week Usage](#4.5.2-Day-of-Week-Usage)
5. [Predictive Models](#5.-Predictive-Models)
    - [5.1 Random Forest Regression](#5.1-Random-Forrest-Regression)
        - [5.1.1 Data-Split](#5.1.1-Data-Split)
        - [5.1.2 Target Encoding](#5.1.2-Target-Encoding)
        - [5.1.3 Hyper-Parameter Optimization](#5.1.3-Hyper-Parameter-Optimization)
        - [5.1.4 Model-Training and Testing](#5.1.4-Model-Training-and-Testing)
        - [5.1.5 Visualisation](#5.1.5-Visualisation)
    - [5.2 Gradient Boosting](#5.2-Gradient-Boosting)



## 1. Introduction

For our team project, we chose to analyze the <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page">NYC Yellow Taxi Dataset (January 2021)</a>, which contains records of taxi trips in New York City. The dataset offers information such as pickup and drop-off times, trip distances, fares, passenger counts, and various surcharges. In the following sections, we describe which features we include and which target we predict.


## Context and Relevance

Understanding trip counts provides valuable insights for various stakeholders:
* Shared `mobility providers` can use demand patterns to optimize fleet allocation, predict peak times, and identify opportunities for new services
* City `governments` can leverage this data to improve traffic management, urban planning, and environmental policies by understanding when and where taxi demand is highest
* For `society` at large, improved knowledge of taxi usage contributes to more efficient transportation networks, reduced congestion and emissions, and better mobility services for residents and visitors

The primary goal of this analysis is to identify temporal and spatial patterns in taxi demand through trip count metrics. Key questions include:

* Are there weather trends in the trip data?
* Which neighborhoods have the most frequent taxi activity?
* Are there temporal trends in taxi demand?





## *2. Imports*

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import geopandas as gpd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor


## 3. Data-Preperation

### *3.1 Clean-up of the Weather Data-Set*

The utilized raw dataset contains historical weather data with the following columns:

* `time`: Timestamp of the recorded data (e.g., hourly or daily).

* `temperature_2m (°C)`: Air temperature measured at 2 meters above ground level, in degrees Celsius.

* `precipitation (mm)`: Total amount of precipitation, in millimeters.

* `rain (mm)`: Rainfall amount specifically, in millimeters.

* `cloudcover (%)`: Total cloud coverage, expressed as a percentage.

* `cloudcover_low (%)`: Low-altitude cloud coverage, in percent.

* `cloudcover_mid (%)`: Mid-altitude cloud coverage, in percent.

* `cloudcover_high (%)`: High-altitude cloud coverage, in percent.

* `windspeed_10m (km/h)`: Wind speed measured at 10 meters above ground, in kilometers per hour.

* `winddirection_10m (°)`: Wind direction at 10 meters above ground, in degrees (0–360°).

This dataset provides a comprehensive view of atmospheric conditions and is suitable for analyzing weather patterns, forecasting, or climate trend detection.

In [None]:
weather = pd.read_csv('NYC_Weather_2016_2022.csv')
weather

From the <a href="https://www.kaggle.com/datasets/aadimator/nyc-weather-2016-to-2022?resource=download">NYC Weather Dataset</a>, we first removed all data outside January 2021, since our primary dataset covers only that month.

We also dropped the following columns because of their limited relevance for our predictive tasks:

* `cloudcover (%)`
* `cloudcover_low (%)`
* `cloudcover_mid (%)`
* `cloudcover_high (%)`
* `winddirection_10m (°)`

The cloud cover levels have only a minimal impact on taxi demand or passenger behavior. Since they tend to correlate with more impactful weather features such as precipitation and temperature, they do not add significant value to our models. <br>
`winddirection_10m (°)` is also unlikely to influence taxi patterns in a meaningful way within an urban environment like NYC.



In [None]:
weather = weather.loc[weather['time'].str[:7] == "2021-01"].copy()
weather['time'] = pd.to_datetime(weather['time'])

columns_to_drop = [
    'cloudcover (%)',
    'cloudcover_low (%)',
    'cloudcover_mid (%)',
    'cloudcover_high (%)',
    'winddirection_10m (°)'
]
weather = weather.drop(columns=columns_to_drop)
weather
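
The month filter above works on the raw timestamp strings. As an alternative, a minimal sketch that parses the timestamps first and filters on a half-open date interval is shown below. The toy table and its values are made up for illustration and are not taken from the weather dataset.

```python
import pandas as pd

# Toy stand-in for the weather table; the values are invented for illustration.
toy_weather = pd.DataFrame({
    "time": ["2020-12-31T23:00", "2021-01-01T00:00", "2021-01-31T23:00", "2021-02-01T00:00"],
    "temperature_2m (°C)": [1.5, 1.2, -0.4, 0.3],
})

# Parse once, then keep the half-open interval [2021-01-01, 2021-02-01).
toy_weather["time"] = pd.to_datetime(toy_weather["time"])
in_january = (toy_weather["time"] >= "2021-01-01") & (toy_weather["time"] < "2021-02-01")
january_weather = toy_weather.loc[in_january].copy()

print(january_weather)
```

The same kind of mask could also be applied to the pickup timestamps of the trip data if records outside the month need to be excluded.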

### *3.2 Clean-up of the Yellow Taxi Data-Set* 

NYC Yellow Taxi Trip Records
This dataset contains detailed information on individual yellow taxi trips in New York City, including:

* `VendorID`: Taxi company/provider identifier.

* `tpep_pickup_datetime / tpep_dropoff_datetime`: Timestamps for pickup and drop-off.

* `passenger_count`: Number of passengers.

* `trip_distance`: Distance of the trip in miles.

* `RatecodeID`: Fare rate code (e.g., standard rate, JFK flat fare).

* `store_and_fwd_flag`: Indicates if the trip record was stored in the taxi’s memory before being sent (Y/N).

* `PULocationID / DOLocationID`: Pickup and drop-off location codes (mapped to taxi zones).

* `payment_type`: Type of payment (e.g., credit card, cash).

* `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`, `improvement_surcharge`, `total_amount`: Fare breakdown.

* `congestion_surcharge / airport_fee`: Additional surcharges related to traffic zones and airport trips.

In [None]:
TripData = pd.read_parquet('yellow_tripdata_2021-01.parquet', engine='pyarrow')
TripData

#### The following columns have been removed:
* `store_and_fwd_flag`
* `airport_fee`
* `RatecodeID`
* `payment_type`
* `fare_amount`
* `extra`
* `mta_tax`
* `tip_amount`
* `tolls_amount`
* `improvement_surcharge`
* `total_amount`
* `congestion_surcharge`
* `VendorID`

The `store_and_fwd_flag` column indicates whether the trip record was temporarily stored in the vehicle's memory before being sent to the vendor, also known as "store and forward," because the vehicle had no server connection at the time of the trip. Since this information does not contribute to our analysis, we decided to exclude this column.
<br>
`RatecodeID`, `payment_type`, `airport_fee`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`, `improvement_surcharge`, `total_amount` and `congestion_surcharge` all describe fare and payment details of each trip. While useful for fare analysis or revenue modeling, they are not required for understanding the frequency or distribution of trips. `VendorID` only identifies the taxi provider and is not needed for this purpose either. Consequently, we decided to drop these columns to simplify the dataset and focus solely on trip counts.
<br>

#### The following rows have been removed:

All rows containing 'NaN' <br>

For further information, please refer to the <a href="https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf">Yellow Taxi Data Dictionary</a>.

In [None]:
cols_to_drop = [
    'store_and_fwd_flag',
    'airport_fee',
    'RatecodeID',
    'payment_type',
    'fare_amount',
    'extra',
    'mta_tax',
    'tip_amount',
    'tolls_amount',
    'improvement_surcharge',
    'total_amount',
    'congestion_surcharge',
    'VendorID'
]


# Drop the unused columns and all rows with missing values;
# .copy() avoids a SettingWithCopyWarning in the assignments below
TripData = TripData.drop(columns=cols_to_drop).dropna().copy()

# Cast the location IDs to integers so they can be joined with the zone table later
TripData['PULocationID'] = TripData['PULocationID'].astype(int)
TripData['DOLocationID'] = TripData['DOLocationID'].astype(int)


TripData

Since one of our future prediction tasks requires the duration of the trip, we added a new column, `trip_time_minutes`, by subtracting `tpep_pickup_datetime` from `tpep_dropoff_datetime`. We also reordered the columns so that the trip duration appears after the pickup and drop-off columns.

In [None]:

# Calculate the trip time in minutes
TripData['trip_time_minutes'] = (TripData['tpep_dropoff_datetime'] - TripData['tpep_pickup_datetime']).dt.total_seconds() / 60

# Move the trip time column directly behind the pickup and drop-off timestamps
col = TripData.pop('trip_time_minutes')
TripData.insert(TripData.columns.get_loc('tpep_dropoff_datetime') + 1, 'trip_time_minutes', col)

In [None]:
TripData
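
Computed durations can be worth a quick sanity check, because a drop-off time before the pickup time yields a negative value and a missing meter stop can yield a very long one. The following is a minimal sketch on invented toy rows. The thresholds `MIN_MINUTES` and `MAX_MINUTES` are hypothetical and would need to be chosen for the real data.

```python
import pandas as pd

# Toy trips; the timestamps are invented for illustration.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2021-01-05 08:00", "2021-01-05 09:00", "2021-01-05 10:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2021-01-05 08:12", "2021-01-05 08:55", "2021-01-06 12:00"]
    ),
})
toy["trip_time_minutes"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Hypothetical plausibility bounds, not taken from the notebook.
MIN_MINUTES = 1
MAX_MINUTES = 180

# between() is inclusive on both ends by default.
plausible = toy[toy["trip_time_minutes"].between(MIN_MINUTES, MAX_MINUTES)]
print(plausible)
```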

### *3.3 Clean-up of the Taxi-Zones*

This dataset contains metadata for geographic taxi zones used in trip records:

* `location_id`: Numeric identifier for each zone (matches PULocationID and DOLocationID in trip data).

* `zone`: Name of the taxi zone (e.g., "Upper East Side North", "JFK Airport").

* `borough`: Borough the zone is in (e.g., "Manhattan", "Brooklyn").

* `shape_area` / `shape_leng`: Geometric area and perimeter length of the zone.

* `objectid`: Internal object identifier (not typically used in analysis).

* `geometry`: Spatial polygon data defining the zone's shape (used for mapping/visualization).



In [None]:
zones = gpd.read_file("NYC Taxi Zones.geojson")
zones

To analyse common pickup and drop-off locations later, we also added the `NYC Taxi Zones` dataset. We dropped `shape_area`, `objectid` and `shape_leng` since we do not need this information.

In [None]:
zones = zones[['location_id', 'zone', 'borough', 'geometry']].copy()
zones

### *3.4 Merge of zones and trips*

To analyse demand in certain areas later, we merge `zones` with `TripData`. <br>


Since we do not need most of the columns from `TripData`, we kept only a few of them. We also renamed certain columns for clarity.


In [None]:
zones = zones.rename(columns={'location_id': 'LocationID'})

zones['LocationID'] = zones['LocationID'].astype(int)           ## LocationID is read as object (string),
TripData['PULocationID'] = TripData['PULocationID'].astype(int) ## while the PU/DO location IDs are integers,
TripData['DOLocationID'] = TripData['DOLocationID'].astype(int) ## so cast all three to int before merging

TripData_zones = TripData.merge(                                ## merge pickup
    zones[['LocationID', 'zone', 'borough', 'geometry']],
    left_on='PULocationID',
    right_on='LocationID',
    how='left',
    suffixes=('', '_pickup')
)

TripData_zones = TripData_zones.merge(                          ## merge dropoff
    zones[['LocationID', 'zone', 'borough', 'geometry']],
    left_on='DOLocationID',
    right_on='LocationID',
    how='left',
    suffixes=('', '_dropoff')
)

In [None]:
TripZones = TripData_zones.rename(columns={               ## rename
    'zone': 'pickup_zone',
    'borough': 'pickup_borough',
    'geometry': 'pickup_geometry',
    'geometry_dropoff': 'dropoff_geometry'
})

TripZones = TripZones[[                              ## resort + drop certain columns
    'passenger_count',
    'tpep_pickup_datetime',
    'pickup_zone',
    'pickup_borough',
    'pickup_geometry',
    'tpep_dropoff_datetime',
    'zone_dropoff',
    'borough_dropoff',
    'dropoff_geometry'
]]

TripZones
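
A left merge silently keeps trips whose location ID has no matching zone and would multiply trips if a zone ID appeared more than once in the zone table. The sketch below shows how the `validate` and `indicator` parameters of `DataFrame.merge` can surface both cases. The toy IDs and zone names are placeholders for illustration only.

```python
import pandas as pd

# Toy tables; IDs and names are illustrative placeholders.
toy_trips = pd.DataFrame({"PULocationID": [236, 237, 264]})
toy_zones = pd.DataFrame({
    "LocationID": [236, 237],
    "zone": ["Zone A", "Zone B"],
})

checked = toy_trips.merge(
    toy_zones,
    left_on="PULocationID",
    right_on="LocationID",
    how="left",
    validate="many_to_one",  # raises if a LocationID occurs more than once in toy_zones
    indicator=True,          # adds a "_merge" column describing the match
)

# Location IDs that found no zone in the lookup table
unmatched = checked.loc[checked["_merge"] == "left_only", "PULocationID"].unique()
print(unmatched)
```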

### *3.5 Merge of weather and trips*

In [None]:
TripData['tpep_pickup_datetime'] = TripData['tpep_pickup_datetime'].astype('datetime64[ns]')
weather['time'] = weather['time'].astype('datetime64[ns]')

TripData_sorted = TripData.sort_values('tpep_pickup_datetime')
weather_sorted = weather.sort_values('time')

TripData_weather = pd.merge_asof(
    TripData_sorted,
    weather_sorted[['time', 'temperature_2m (°C)', 'rain (mm)']],
    left_on='tpep_pickup_datetime',
    right_on='time',
    direction='backward'
)

TripData_weather = TripData_weather.dropna()
TripData_weather
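
To illustrate what the backward as-of merge does, here is a minimal sketch on invented toy data. Each pickup is matched to the most recent weather reading at or before its timestamp. The `tolerance` argument is an assumption that is not part of the cell above: without it, a pickup later than the last weather reading would still be matched to that last reading, however old it is.

```python
import pandas as pd

# Toy data; timestamps and temperatures are invented for illustration.
toy_trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2021-01-01 00:10", "2021-01-01 00:59", "2021-01-01 01:00", "2021-01-01 05:30",
    ]),
})
toy_weather = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00"]),
    "temperature_2m (°C)": [2.0, 1.5],
})

matched = pd.merge_asof(
    toy_trips.sort_values("tpep_pickup_datetime"),
    toy_weather.sort_values("time"),
    left_on="tpep_pickup_datetime",
    right_on="time",
    direction="backward",
    tolerance=pd.Timedelta("1h"),  # assumption: ignore readings older than one hour
)
print(matched)
```

In this toy example the 05:30 pickup receives no temperature, because the closest earlier reading is more than one hour old.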

### 3.6 Model Specific Feature-Engineering
To ensure a clear separation between descriptive analysis and predictive modeling, we created a dedicated DataFrame that contains only the features relevant for modeling. It is based on a temporal and spatial aggregation of the trip data:

* All pickup timestamps (`tpep_pickup_datetime`) were rounded down to 30-minute intervals.
* The data was grouped by both the pickup time and the pickup location (`PULocationID`).
* Within each group, we computed the number of trips (`trip_count`), the average temperature (`temperature_2m (°C)`), and the average rainfall (`rain (mm)`).
* The additional features `hour` and `weekday` were extracted from the rounded timestamps to capture time-of-day and weekly patterns.

The final set of explanatory variables used in the model includes `temperature_2m (°C)`, `rain (mm)`, `hour`, `weekday` and `PULocationID`. The target variable is `trip_count`, the number of trips observed for a given pickup location and 30-minute time window.

In [None]:
TripData_weather['tpep_pickup_datetime'] = TripData_weather['tpep_pickup_datetime'].dt.floor('30min')  # Round pickup timestamps down to 30-minute intervals
stats = ( 
    TripData_weather
    .groupby(['tpep_pickup_datetime', 'PULocationID'])
    .agg({
        'temperature_2m (°C)': 'mean',
        'rain (mm)': 'mean',
        'tpep_pickup_datetime': 'count'
    })
    .rename(columns={'tpep_pickup_datetime': 'trip_count'})
    .reset_index()
)
stats['hour'] = stats['tpep_pickup_datetime'].dt.hour        # Hour of day (0-23)
stats['weekday'] = stats['tpep_pickup_datetime'].dt.weekday  # Day of week (0 = Monday)

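#### 3.6.1 Rolling Means
We implemented a rolling mean feature, `rolling_mean_2h`, which contains the average trip count of a zone over its previous four 30-minute intervals (about two hours). The value is shifted by one interval so that the current trip count is not part of its own feature. Because intervals without any trip produce no row in `stats`, the four previous rows can span more than two hours.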

In [None]:
stats = stats.sort_values(['PULocationID', 'tpep_pickup_datetime'])
stats['rolling_mean_2h'] = (
    stats.groupby('PULocationID')['trip_count']
    .transform(lambda x: x.rolling(window=4, min_periods=1).mean().shift(1))
)
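
The difference between a row-based and a time-based window can be seen on a small toy example. The sketch below compares the row-based window from the cell above with a time-based window that only looks at the two hours before each interval. The toy counts are invented, and the time-based variant is an alternative for illustration, not the feature used in our model.

```python
import pandas as pd

# Toy aggregate for a single zone with a gap between 00:30 and 06:00.
toy = pd.DataFrame({
    "PULocationID": [1, 1, 1, 1],
    "tpep_pickup_datetime": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 00:30", "2021-01-01 06:00", "2021-01-01 06:30",
    ]),
    "trip_count": [4, 2, 10, 6],
}).sort_values(["PULocationID", "tpep_pickup_datetime"])

# Row-based window: mean of up to four previous rows, regardless of gaps.
toy["rolling_rows"] = (
    toy.groupby("PULocationID")["trip_count"]
    .transform(lambda x: x.rolling(window=4, min_periods=1).mean().shift(1))
)

# Time-based window: only rows within the two hours before the current interval.
# closed="left" excludes the current row, so no shift is needed.
time_based = (
    toy.set_index("tpep_pickup_datetime")
    .groupby("PULocationID")["trip_count"]
    .rolling("2h", closed="left")
    .mean()
)
# Assigning by position is valid here because toy is sorted by zone and time.
toy["rolling_time"] = time_based.to_numpy()

print(toy)
```

At 06:00 the row-based feature still averages the counts from shortly after midnight, while the time-based feature is empty because no interval was observed in the preceding two hours.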

## 4. Descriptive Analysis

### 4.1 Number of rides according to the temperature

In [None]:
TripData_weather['temperature_rounded'] = TripData_weather['temperature_2m (°C)'].round()

temp_group = TripData_weather.groupby('temperature_rounded').size().reset_index(name='ride_count')

plt.figure(figsize=(10, 6))
plt.plot(temp_group['temperature_rounded'], temp_group['ride_count'], marker='o')
plt.xlabel('Temperature (°C)')
plt.ylabel('Number of Rides')
plt.title('Number of Taxi Rides vs Temperature')
plt.grid(True)
plt.tight_layout()
plt.show()

This line plot shows the number of taxi rides for each rounded temperature (°C). Ride counts peak sharply between -1°C and 4°C. Below -1°C ride numbers are lower, and from 4°C to 10°C a noticeable drop is observed.
One possible reading is that people rely on taxis when it is cold enough to be uncomfortable but not cold enough to deter travel altogether, while milder conditions may encourage walking, biking, or using public transport instead.
Note, however, that the counts are not normalised by how many hours each temperature occurred in January 2021, so the peak around freezing may partly reflect how common these temperatures were rather than weather sensitivity alone.
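
One way to separate demand from the frequency of a temperature is to divide the ride count by the number of weather hours observed at that temperature. The following minimal sketch uses invented toy numbers to show how the ranking can change after this normalisation. The column name `temperature_rounded` follows the cell above, everything else is hypothetical.

```python
import pandas as pd

# Toy data: one row per weather hour and one row per trip (values invented).
toy_weather = pd.DataFrame({"temperature_rounded": [0, 0, 0, 5]})
toy_rides = pd.DataFrame({"temperature_rounded": [0] * 300 + [5] * 150})

hours_at_temp = toy_weather.groupby("temperature_rounded").size().rename("hours")
rides_at_temp = toy_rides.groupby("temperature_rounded").size().rename("ride_count")

# Align both counts on the temperature and compute rides per observed hour.
per_hour = pd.concat([rides_at_temp, hours_at_temp], axis=1)
per_hour["rides_per_hour"] = per_hour["ride_count"] / per_hour["hours"]

print(per_hour)
```

In the toy data, 0°C has the most rides in total, but 5°C has more rides per hour.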


### 4.2 Most Common Pickup-Zones

In [None]:
pickup_counts = TripZones.groupby('pickup_zone').size().reset_index(name='pickup_count')
pickup_counts = pickup_counts.sort_values(by='pickup_count', ascending=False)

print(pickup_counts.head(10))

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(
    data=pickup_counts.head(10),
    y='pickup_count',
    x='pickup_zone',
    hue='pickup_zone',
    palette='Oranges_r',
    legend=False
)
plt.title('Top 10 Pickup-Zones')
plt.ylabel('Trip Count')
plt.xlabel('Pickup-Zone')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This bar chart displays the top 10 taxi pickup zones by trip volume. The ranking is top-heavy: the first two zones (Upper East Side North and Upper East Side South) dominate with roughly 72,000 trips or more each, which suggests a high concentration of demand in these zones. Zones ranked 3 to 10 show a gradual decline in trip count, from around 45,000 to around 40,000 trips. The slope is relatively flat, indicating that these zones have comparable levels of demand and form a plateau below the top two.
Many zones belong to Manhattan’s densely populated or commercial areas (e.g., Midtown, Lenox Hill, Penn Station), which likely reflects high pedestrian traffic, tourism, and business activity.
The Upper East Side accounts for a disproportionate share of ride activity, signalling potential hotspots for taxi availability, resource allocation, or pricing optimization.

#### 4.2.1 Heatmap

In [None]:
pickup_zone_stats = TripZones.groupby(
    ['pickup_zone', 'pickup_geometry']
).size().reset_index(name='count')

pickup_zone_stats = gpd.GeoDataFrame(pickup_zone_stats, geometry='pickup_geometry')

pickup_zone_stats.plot(
    column='count',
    cmap='OrRd',
    edgecolor='black',
    linewidth=0.5,
    legend=True,
    figsize=(12, 10)
)
plt.title('Heatmap: Trip count per Pickup Zone')
plt.axis('off')
plt.show()

The heatmap confirms a strong spatial concentration of pickups in central Manhattan, especially the Upper East Side and Midtown. These zones show the highest intensity (70,000+ rides), while outer boroughs display significantly lower demand. The distribution is highly clustered, reflecting dense population, tourism, and commercial activity.

### 4.3 Most Common Dropoff-Zones

In [None]:
dropoff_counts = TripZones.groupby('zone_dropoff').size().reset_index(name='dropoff_count')
dropoff_counts = dropoff_counts.sort_values(by='dropoff_count', ascending=False)

print(dropoff_counts.head(10))

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(
    data=dropoff_counts.head(10),
    y='dropoff_count',
    x='zone_dropoff',
    hue='zone_dropoff',
    palette='Blues_r',
    legend=False
)
plt.title('Top 10 Drop-Off-Zones')
plt.ylabel('Trip Count')
plt.xlabel('Drop-Off-Zone')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This bar chart ranks the top 10 drop-off zones by total trip count. The same two zones again lead the distribution, with Upper East Side North peaking at over 70,000 drop-offs. This mirrors the pickup pattern, indicating bidirectional demand concentration.
The remaining zones form a moderately flat tail, ranging between around 34,000 and around 43,000 trips. This suggests a more uniform distribution of drop-offs beyond the primary hotspots. Drop-off activity is heavily concentrated in a few zones—especially the Upper East Side—indicating consistent inbound and outbound flow.

### 4.4 Most Common Pairs 

In [None]:
pair_counts = TripZones.groupby(['pickup_zone', 'zone_dropoff']).size().reset_index(name='trip_count')
top_10 = pair_counts.sort_values(by='trip_count', ascending=False).head(10)

top_10['pair'] = top_10['pickup_zone'] + " → " + top_10['zone_dropoff']
plt.figure(figsize=(12, 6))
bars = plt.bar(top_10['pair'], top_10['trip_count'], color='skyblue')

for bar in bars:
    height = bar.get_height()
    plt.annotate(f'{int(height)}',
                 xy=(bar.get_x() + bar.get_width() / 2, height),
                 xytext=(0, 3),  # offset text upward
                 textcoords="offset points",
                 ha='center', va='bottom')

plt.title('Top 10 Most Common Pickup → Dropoff Zone Pairs')
plt.xlabel('Pickup → Dropoff Pair')
plt.ylabel('Trip Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

This bar chart ranks the most frequent pickup/drop-off pairs by trip volume. The top 4 pairs are all within the Upper East Side, indicating a high volume of short, local trips, likely driven by dense population and convenience. The leading pair (UES South → UES North) alone accounts for over 11,600 trips, and trip counts fall off steeply beyond the top few pairs. This suggests localized mobility clusters.
Among the most frequent pairs, taxi usage is largely intra-neighborhood, especially within the Upper East Side, pointing to localized, short-distance travel behavior in dense urban zones.

### 4.5 Temporal Trip Patterns

#### 4.5.1 Time-of-Day Usage

In [None]:
TripData['tpep_pickup_datetime'] = pd.to_datetime(TripData['tpep_pickup_datetime'])

TripData['pickup_hour'] = TripData['tpep_pickup_datetime'].dt.hour

trips_per_hour = TripData.groupby('pickup_hour').size().reset_index(name='trip_count')

trips_per_hour = trips_per_hour.sort_values('pickup_hour')

print(trips_per_hour)

In [None]:
plt.figure(figsize=(10,6))
plt.bar(trips_per_hour['pickup_hour'], trips_per_hour['trip_count'], color='skyblue')
plt.xlabel('Hour of Day')
plt.ylabel('Trip Count')
plt.title('Trip Count per Hour of the Day')
plt.xticks(range(0,24))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

This bar chart displays hourly taxi demand over a 24-hour period. Trip volume peaks sharply between 1 PM and 4 PM, with a maximum of 106,000 rides at 3 PM, suggesting a high concentration of afternoon activity, possibly driven by shopping, work breaks, or school pickups. Ride volume remains low overnight, with a gradual ramp-up starting around 6 AM.
The demand curve is right-skewed, with the bulk of trips occurring between late morning and early evening, indicating a strong diurnal pattern tied to urban daytime activity cycles.

#### 4.5.2 Day-of-Week Usage

In [None]:
TripData['tpep_pickup_datetime'] = pd.to_datetime(TripData['tpep_pickup_datetime'])

TripData['pickup_weekday'] = TripData['tpep_pickup_datetime'].dt.dayofweek

weekday_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
TripData['weekday_name'] = TripData['pickup_weekday'].apply(lambda x: weekday_names[x])

TripData['weekday_name'] = pd.Categorical(
    TripData['weekday_name'],
    categories=weekday_names,
    ordered=True
)

trips_per_weekday = TripData.groupby('weekday_name', observed=True).size().reset_index(name='trip_count')

In [None]:
plt.figure(figsize=(10,6))
plt.bar(trips_per_weekday['weekday_name'], trips_per_weekday['trip_count'], color='seagreen')
plt.xlabel('Day of the Week')
plt.ylabel('Trip Count')
plt.title('Number of Trips per Weekday')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

This bar chart shows the distribution of trip counts by day of the week. Friday records the highest trip volume, suggesting increased end-of-week activity such as commuting, social plans, or travel. Monday to Friday show consistently higher demand than Saturday and Sunday, pointing to a weekday-dominant usage trend, likely tied to work-related or routine travel. Trip count dips significantly after Friday, making Sunday the least active day in terms of taxi usage.
When reading these totals, note that January 2021 contains five Fridays, Saturdays and Sundays but only four of each other weekday. The weekend dip therefore holds despite the extra day, while part of Friday's lead over Monday to Thursday may come from the additional Friday.
Taxi demand follows a weekly cycle, peaking on Fridays and declining over the weekend, reflecting typical urban mobility rhythms.
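
Since a calendar month does not contain every weekday equally often, totals per weekday can be divided by the number of occurrences of that weekday. The sketch below counts the weekdays of January 2021 and applies them to two invented totals. The numbers in `toy_totals` are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Count how often each weekday occurs in January 2021.
days = pd.date_range("2021-01-01", "2021-01-31", freq="D")
occurrences = days.day_name().value_counts()
print(occurrences)

# Hypothetical monthly totals for two weekdays (not taken from the data).
toy_totals = pd.Series({"Thursday": 200, "Friday": 225})

# Average trips per single day of that weekday.
avg_per_day = toy_totals / occurrences.reindex(toy_totals.index)
print(avg_per_day)
```

In this toy case Friday has the higher total but the lower average per day, which is the effect the normalisation is meant to reveal.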

## 5. Predictive Models

### 5.1 Random Forrest Regression
In our first model, we used a Random Forest Regression model to predict the number of taxi rides within a 30-minute time frame in a specific zone. The target variable is `trip_count`, and the explanatory features are `temperature_2m (°C)`, `rain (mm)`, `hour`, `weekday`, `rolling_mean_2h` and `PULocationID`. We transformed `PULocationID` into `PULocationID_encoded` at a later point using target encoding. 

In [None]:
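# Drop rows without a rolling mean (first interval per zone) and sort by time so the unshuffled split is chronological
model_data = stats.dropna(subset=['rolling_mean_2h']).sort_values('tpep_pickup_datetime')
y = model_data['trip_count']
X = model_data[['temperature_2m (°C)', 'rain (mm)', 'hour', 'weekday', 'rolling_mean_2h', 'PULocationID']]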

#### 5.1.1 Data-Split
For this model, we used a data split of 60% training, 20% validation, and 20% test data.

In [None]:
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=False)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, shuffle=False)
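
`train_test_split` with `shuffle=False` cuts the rows in their current order, so it only yields a temporal split if the rows are sorted by time. As an illustration, the following sketch splits a toy table at explicit interval boundaries, which also keeps all zones of one interval in the same set. The toy values and the variable names `train_end` and `val_end` are assumptions for this example.

```python
import pandas as pd

# Toy aggregate: five 30-minute intervals with two zones each (values invented).
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2021-01-01 00:00"] * 2 + ["2021-01-01 00:30"] * 2 + ["2021-01-01 01:00"] * 2
        + ["2021-01-01 01:30"] * 2 + ["2021-01-01 02:00"] * 2
    ),
    "PULocationID": [1, 2] * 5,
    "trip_count": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})

# Unique, sorted interval starts; cut after 60% and 80% of the intervals.
intervals = toy["tpep_pickup_datetime"].drop_duplicates().sort_values().reset_index(drop=True)
train_end = intervals.iloc[len(intervals) * 6 // 10]
val_end = intervals.iloc[len(intervals) * 8 // 10]

train = toy[toy["tpep_pickup_datetime"] < train_end]
val = toy[(toy["tpep_pickup_datetime"] >= train_end) & (toy["tpep_pickup_datetime"] < val_end)]
test = toy[toy["tpep_pickup_datetime"] >= val_end]

print(len(train), len(val), len(test))
```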

#### 5.1.2 Target Encoding
To make the `PULocationID` feature usable for modeling, we applied **Target Encoding** by assigning each location ID the average number of trips associated with it.  
Importantly, this encoding was performed **after splitting the data** to avoid data leakage from the test set into the training process. 

In [None]:
# Target encoding based on the training data only
means = X_train.assign(trip_count=y_train).groupby('PULocationID')['trip_count'].mean()
X_train['PULocationID_encoded'] = X_train['PULocationID'].map(means)
X_val['PULocationID_encoded'] = X_val['PULocationID'].map(means)
X_test['PULocationID_encoded'] = X_test['PULocationID'].map(means)
# Zones that do not occur in the training data receive the global training mean
global_mean = y_train.mean()
X_train['PULocationID_encoded'] = X_train['PULocationID_encoded'].fillna(global_mean)
X_val['PULocationID_encoded'] = X_val['PULocationID_encoded'].fillna(global_mean)
X_test['PULocationID_encoded'] = X_test['PULocationID_encoded'].fillna(global_mean)
# Final feature set (without the original ID)
X_train = X_train.drop(columns=['PULocationID'])
X_val = X_val.drop(columns=['PULocationID'])
X_test = X_test.drop(columns=['PULocationID'])
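
The encoding steps can be followed on a tiny example. In this sketch, with invented numbers, the zone means and the global mean are computed from the training rows only, and a zone that never appears in training falls back to the global mean.

```python
import pandas as pd

# Toy training and test rows (values invented for illustration).
train = pd.DataFrame({"PULocationID": [1, 1, 2], "trip_count": [10, 20, 4]})
test = pd.DataFrame({"PULocationID": [1, 2, 3]})  # zone 3 never appears in training

# Statistics are learned from the training data only.
zone_means = train.groupby("PULocationID")["trip_count"].mean()
global_mean = train["trip_count"].mean()

# Map the learned means onto the test rows; unseen zones get the global mean.
test["PULocationID_encoded"] = test["PULocationID"].map(zone_means).fillna(global_mean)
print(test)
```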

#### 5.1.3 Hyper-Parameter Optimization
In order to optimize our hyperparameters, we used the Grid Search algorithm.

In [None]:
param_grid = { 
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
}

estimator = RandomForestRegressor(random_state=42) 

grid = GridSearchCV( 
    estimator=estimator,
    param_grid=param_grid,
    cv=3,
    scoring='r2',
    n_jobs=-1,    
)

grid.fit(X_train, y_train) 
print("Best R² on validation folds:", grid.best_score_)
print("Best parameters:", grid.best_params_)
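
With `cv=3`, the grid search uses ordinary K-fold splits, in which some folds are validated on data that lies before part of their training data. For time-ordered rows, scikit-learn's `TimeSeriesSplit` is one alternative that always validates on later rows than it trains on. The sketch below only prints the fold indices for a toy array and assumes the rows are already in chronological order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix; rows are assumed to be in chronological order.
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

for train_idx, val_idx in folds:
    # Every validation index is later than every training index of the same fold.
    print("train:", train_idx, "validate:", val_idx)
```

Such a splitter can be passed to `GridSearchCV` through its `cv` parameter, for example `cv=tscv`.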

#### 5.1.4 Model-Training and Testing
The model was trained and tested using the hyperparameters identified through grid search.

In [None]:
best_rf = grid.best_estimator_                  # Best hyperparameters found by the grid search
X_full_train = pd.concat([X_train, X_val])      # Combine training and validation features
y_full_train = pd.concat([y_train, y_val])      # Combine training and validation targets
best_rf.fit(X_full_train, y_full_train)         # Refit the model on the combined data
y_pred = best_rf.predict(X_test)                # Predict the held-out test data
print("Final R² (test set):", r2_score(y_test, y_pred))
print("Final MSE (test set):", mean_squared_error(y_test, y_pred))
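
The test metrics are easier to judge next to a naive baseline. A minimal sketch, using invented toy targets, is a baseline that always predicts the mean of the training targets. A model is only useful if it clearly beats this reference.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets (values invented for illustration).
y_train_toy = np.array([2.0, 4.0, 6.0])
y_test_toy = np.array([3.0, 5.0, 7.0])

# Baseline: predict the training mean for every test row.
baseline_pred = np.full(len(y_test_toy), y_train_toy.mean())

baseline_mse = mean_squared_error(y_test_toy, baseline_pred)
baseline_r2 = r2_score(y_test_toy, baseline_pred)
print("Baseline MSE:", baseline_mse)
print("Baseline R²:", baseline_r2)
```

The R² of such a baseline is at most zero on the test set, so it marks the lower bound a trained model should exceed.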

#### 5.1.5 Visualisation

In [None]:
n = 200
plt.figure(figsize=(14, 6))
plt.plot(range(n), y_test[:n], label='Actual Trip-Counts', linestyle='-', color='blue')
plt.plot(range(n), y_pred[:n], label='Predicted Trip-Counts', linestyle='--', color='orange')
plt.title(f"Model Predictions vs. Actual Trip-Count (First {n} Observations)")
plt.xlabel('Observation Index')
plt.ylabel('Trip-Count')
plt.legend()
plt.grid(True)
plt.show()

### 5.2 Gradient Boosting

In [None]:
df = TripData_weather.copy()

df['hour'] = df['tpep_pickup_datetime'].dt.hour
df['weekday'] = df['tpep_pickup_datetime'].dt.dayofweek

trip_counts = df.groupby(['PULocationID', 'hour', 'weekday']).size().reset_index(name='trip_count')

weather_avg = df.groupby(['hour', 'weekday']).agg({
    'temperature_2m (°C)': 'mean',
    'rain (mm)': 'mean'
}).reset_index()

stats = pd.merge(trip_counts, weather_avg, on=['hour', 'weekday'])

y = stats['trip_count']
X = stats.drop('trip_count', axis=1)

X = pd.get_dummies(X, columns=['PULocationID'], drop_first=True)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_trainval, X_test, y_trainval, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best R² on validation folds (CV):", grid_search.best_score_)

best_model = grid_search.best_estimator_
best_model.fit(X_trainval, y_trainval)

y_pred = best_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print("\n--- FINAL TEST RESULTS ---")
print(f"R² Score: {r2:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
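
In the cell above, the scaler is fitted on all rows before the data is split, so statistics of the test rows enter the preprocessing. One way to avoid this is to wrap the scaler and the model in a scikit-learn `Pipeline`, which refits the scaler on the training part of every fold. The sketch below uses synthetic data and a deliberately small, hypothetical parameter grid. Tree-based models are largely insensitive to feature scaling, so the step mainly matters if the same preprocessing is reused for other model types.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: the target depends linearly on the first feature.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(120, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The scaler is part of the pipeline, so it is fitted on training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("gbr", GradientBoostingRegressor(random_state=42)),
])

# Hypothetical small grid; step parameters are addressed as "<step>__<parameter>".
search = GridSearchCV(
    pipeline,
    {"gbr__n_estimators": [20, 40], "gbr__max_depth": [2, 3]},
    cv=3,
    scoring="r2",
)
search.fit(X_tr, y_tr)

test_r2 = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print("Test R²:", test_r2)
```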

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("True trip counts")
plt.ylabel("Predicted trip counts")
plt.title("True vs Predicted Trip Counts")
plt.show()

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 3],
    'subsample': [0.8, 1.0]
}

gbr = GradientBoostingRegressor(random_state=42)

grid_search = GridSearchCV(
    estimator=gbr,
    param_grid=param_grid,
    scoring='r2',
    cv=3,
    n_jobs=-1,
)

grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best R² score on validation folds:", grid_search.best_score_)

best_gbr = grid_search.best_estimator_

y_pred = best_gbr.predict(X_test)
print("Final R² (Testset):", r2_score(y_test, y_pred))
print("Final MSE (Testset):", mean_squared_error(y_test, y_pred))

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("True trip counts")
plt.ylabel("Predicted trip counts")
plt.title("True vs Predicted Trip Counts")
plt.show()