# Report: Dashboard for SAIL2025

### TIL6022 - Group 23

**Members:** 
Kristof Hollstein,
Yuxuan Li,
Junlin Li,
Muhammad Harisuddin,
Ida Bagus Rai Satria Dharma

## Research Objective/Introduction

SAIL Amsterdam 2025 was a major city event in which over 10,000 vessels from around the world were showcased on the waters north of Amsterdam Centraal station. The event drew over 2.3 million visitors to one area during the week. Because it brought an irregular surge in pedestrian traffic to the whole of Amsterdam, the event organizers and relevant stakeholders need a dashboard that both monitors current traffic flow and predicts future flow around the city, to ensure public safety and support important decisions by its end users. For this project, the following main objective was therefore identified:

**Main Objective:**
How can we monitor and predict short-term crowd flows during SAIL Amsterdam 2025?

To break the main objective down into the work needed for a final product, four sub-objectives were developed:

**Sub Objectives:**
1. How can multi-source mobility data (pedestrian sensors, TomTom Traffic, and vessel positions) be integrated to understand urban crowd management at SAIL Amsterdam?
2. How can predictive models improve forecasts of future pedestrian flow, and what further insight can the model offer?
3. What spatial patterns can be identified in pedestrian flows from multiple crowd sensors in the area?
4. How can an interactive dashboard support decision making for urban crowd management?


## Subquestion 1: Integrating multi-source mobility data (Data Used)


To develop a working final product with proper insight into current and future crowd flow, the group integrated several datasets related to SAIL 2025 pedestrian traffic.

### Primary Datasets (Sensor Data)
The main datasets this project depends on are those gathered by the pedestrian sensors. The necessary data is split into two sets. The first, a CSV file, contains the main pedestrian traffic flow information: a count of people at each sensor location around the city, reported every 3 minutes. This dataset provides the base variables for the model. The second, an Excel file, contains the sensor metadata, including each sensor's base ID, name, width and geographic coordinates. This data is used to understand the sensor specifications and to map the count data onto the dashboard, allowing the end user to read the pedestrian data through a map overlay.

### Project Inputs
- Part 2: Independent Variables which we will plug in for future projection and modelling (Needs to be finished)

While predictions can be produced from the base sensor data alone, it is hypothesized that more external variables influence pedestrian flow than just the time of day at which a count was made. Especially for a major event like SAIL, several factors are expected to occur during the event that can alter traffic. To capture these variables as well as possible and make the projections more accurate, the group also used additional datasets as independent variables, to observe their influence on the sensor data.

**TomTom Traffic (roads).**  
The first external dataset is the TomTom Traffic dataset. It is provided as a CSV with a `time` column and, for each timestamp, a nested micro-CSV that lists pairs of `(road_segment_id, traffic_level)`. With the NWB roads shapefile, we parse these records into per-timestamp road-segment values and then aggregate them to a 3-minute grid. For modeling, we derive indicators such as the mean congestion level, variance (dispersion across segments), and the 95th percentile (extreme congestion). These aggregated indicators are aligned with the pedestrian sensors in time and are also used in lagged form to capture the leading effect that road conditions can have on subsequent crowd flow.
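The parsing and aggregation step described above can be sketched as follows. This is a minimal illustration, not the group's actual pipeline: the column name `data` and the `id,level;id,level` layout of the nested cell are assumptions made for the example, since the exact micro-CSV format is not specified here, and the NWB shapefile join is left out.

```python
import pandas as pd


def parse_nested(cell: str) -> list[tuple[str, float]]:
    """Split one nested cell into (road_segment_id, traffic_level) pairs.

    Assumed (hypothetical) layout: 'id,level;id,level;...'.
    """
    pairs = []
    for item in cell.split(";"):
        if not item.strip():
            continue  # tolerate a trailing separator
        seg_id, level = item.split(",")
        pairs.append((seg_id.strip(), float(level)))
    return pairs


def aggregate_traffic(raw: pd.DataFrame, freq: str = "3min") -> pd.DataFrame:
    """Explode the nested records and summarise traffic_level per time bin."""
    records = [
        {"time": ts, "road_segment_id": seg, "traffic_level": lvl}
        for ts, cell in zip(pd.to_datetime(raw["time"]), raw["data"])
        for seg, lvl in parse_nested(cell)
    ]
    long = pd.DataFrame.from_records(records).set_index("time")
    grouped = long["traffic_level"].resample(freq)
    return pd.DataFrame({
        "traffic_level_mean": grouped.mean(),
        "traffic_level_var": grouped.var(),           # dispersion across segments
        "traffic_level_p95": grouped.quantile(0.95),  # 95th percentile
    })


# Toy input: three timestamps, two road segments each (made-up values).
raw = pd.DataFrame({
    "time": ["2025-08-20 10:00:00", "2025-08-20 10:01:00", "2025-08-20 10:03:00"],
    "data": ["a,0.8;b,0.6", "a,0.4;b,0.2", "a,1.0;b,0.5"],
})
traffic_3min = aggregate_traffic(raw)
print(traffic_3min)
```

Lagged versions of these indicators can then be derived with `Series.shift` once the frame is on the common 3-minute grid.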

**Vessel Positions (ships).**  
The Vessel Positions dataset provides timestamped ship positions (timestamps, e.g., `update-time` / `upload-timestamp`; coordinates: `lat`, `lon`), indicating which ship was located where at a given time, together with ship attributes (e.g., `speed-in-centimeters-per-second`, `short-term-avg-speed-in-cm-per-sec`, `length`, `beam`, `type`). For efficiency we use a pre-aggregated Parquet file built from the raw CSV: at a 3-minute cadence we compute the total vessel count and average speed (low speeds are a proxy for docking/boarding/exhibiting). This data was judged useful for the model because it is hypothesized that some ships attract more viewers than others, so crowd formation may differ from vessel to vessel. Knowing the ship locations and their potential popularity, and projecting them against the geographic sensor data, lets the model account for these differences in induced foot traffic, potentially improving the final predictions.
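A minimal sketch of the 3-minute vessel aggregation is given below. The identifier column `vessel_id` is a hypothetical name used only for this example (the dataset's actual identifier column is not named in this report); `update-time` and `speed-in-centimeters-per-second` are taken from the description above.

```python
import pandas as pd


def aggregate_vessels(positions: pd.DataFrame, freq: str = "3min") -> pd.DataFrame:
    """Count distinct vessels and average their speed in each time bin."""
    df = positions.copy()
    df["update-time"] = pd.to_datetime(df["update-time"])
    grouped = df.groupby(pd.Grouper(key="update-time", freq=freq))
    return pd.DataFrame({
        # `vessel_id` is an assumed column name, not confirmed by the source data.
        "vessel_count": grouped["vessel_id"].nunique(),
        "vessel_avg_speed": grouped["speed-in-centimeters-per-second"].mean(),
    })


# Toy input with made-up values: vessel A reports twice in the first bin.
positions = pd.DataFrame({
    "update-time": [
        "2025-08-20 10:00:10",
        "2025-08-20 10:01:00",
        "2025-08-20 10:02:30",
        "2025-08-20 10:04:00",
    ],
    "vessel_id": ["A", "B", "A", "A"],
    "speed-in-centimeters-per-second": [100, 300, 200, 50],
})
vessels_3min = aggregate_vessels(positions)
print(vessels_3min)
```

Counting distinct identifiers rather than rows avoids inflating `vessel_count` when one ship reports several positions within the same bin. The result could then be written out with `DataFrame.to_parquet`, which requires a Parquet engine such as `pyarrow`.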

### Summary
To answer the subquestion, several multi-source mobility datasets were integrated to find patterns between them. These patterns may allow the predictive model to produce more accurate projections by taking in as many variables connected to pedestrian traffic at the venue as possible.

## Subquestion 2: Implementing a Predictive Model

### Predictive Model Selection

To be able to utilize the dataset and predict future pedestrian flow, there is a need to implement a fitting predictive model. Initially, the group was given a choice between either **XGBoost** or **LightGBM**. For this model, the group has chosen to use **XGBoost**.  

---

#### Rationale for XGBoost
First, XGBoost works well with the amount of data available. There are 4 days of data in total (around 2400 rows), which is a relatively small sample. XGBoost's regularization parameters (`min_child_weight`, `gamma`, `reg_alpha`, `reg_lambda`) help prevent overfitting when sample sizes are limited, and in the group's assessment it is also faster than LightGBM for producing a robust baseline. XGBoost also supports count targets, so the group can start with standard regression and switch to a Tweedie or Poisson objective if needed. Finally, XGBoost is presentation friendly: SHAP visualizations explain the rationale behind each prediction, and backtesting makes the reporting easier.
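To illustrate how the regularization parameters and count objectives mentioned above fit together, here is a minimal sketch using the scikit-learn style `XGBRegressor`. All numeric values are placeholders chosen for illustration only, not the group's tuned settings, and `build_params` and `fit_baseline` are hypothetical helper names.

```python
def build_params(objective: str = "reg:squarederror") -> dict:
    """Return an illustrative XGBoost configuration.

    Swap `objective` for "count:poisson" or "reg:tweedie" to model counts.
    Every number below is a placeholder, not a validated setting.
    """
    return {
        "objective": objective,
        "n_estimators": 300,
        "learning_rate": 0.05,
        "max_depth": 4,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        # Regularization parameters named in the rationale above.
        "min_child_weight": 5,
        "gamma": 0.1,
        "reg_alpha": 0.1,
        "reg_lambda": 1.0,
    }


def fit_baseline(X, y, objective: str = "reg:squarederror"):
    """Fit a baseline regressor; requires the `xgboost` package."""
    from xgboost import XGBRegressor  # imported lazily so the config is usable without it

    model = XGBRegressor(**build_params(objective))
    model.fit(X, y)
    return model


poisson_params = build_params("count:poisson")
print(poisson_params["objective"])
```

Because the data is a time series, any evaluation of such a model should use a chronological split (train on earlier days, validate on later ones) rather than a random shuffle.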

#### When to Revisit LightGBM
In comparison, LightGBM may be more useful if the data spans a longer period (over 2-4 weeks) or if richer exogenous features (such as weather or vessel schedules) are added to the model. In that scenario, LightGBM's speed with larger feature sets becomes an advantage, and it is worth revisiting if the variable set expands further.

---

#### Model Settings (to be specified)
- Which hyperparameters we have chosen for the predictive model, and why
  - learning_rate
  - n_estimators
  - max_depth
  - subsample/colsample_bytree

#### Interpretation
- How does this help us understand SAIL traffic flow


## Subquestion 3: Spatial Pattern Analysis



### Correlation Analysis between External Factors and Crowd

#### Method
1. **Align to a 3-minute grid.** We resampled TomTom (roads) and Vessel (ships) to a common 3-minute cadence and aligned them with the pedestrian flow series.
2. **Screen with correlation + lead–lag.**  
   - We computed **Pearson correlations** with `human_flow` (see correlation heatmap).  
   - We then scanned **±36 minutes** of lead–lag (positive lag = driver leads) to see whether a driver **"anticipates"** later pedestrian changes (see the lead–lag plots).
3. **Keep features that are both statistically useful and operationally plausible**, i.e., features that can be known at prediction time and make sense for SAIL operations.
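The lead–lag scan in step 2 can be sketched as below, assuming both series already share the same 3-minute index. The function name `lead_lag_corr` is hypothetical; the sign convention follows the text (positive lag = driver leads), and ±12 steps corresponds to ±36 minutes.

```python
import numpy as np
import pandas as pd


def lead_lag_corr(driver: pd.Series, target: pd.Series,
                  max_steps: int = 12, step_minutes: int = 3) -> pd.Series:
    """Pearson r between target[t] and driver[t - k] for k in [-max_steps, max_steps].

    A positive lag means the driver value is taken from k steps earlier,
    i.e. the driver leads the target.
    """
    out = {
        k * step_minutes: target.corr(driver.shift(k))
        for k in range(-max_steps, max_steps + 1)
    }
    return pd.Series(out, name="pearson_r").rename_axis("lag_minutes")


# Synthetic check: the target copies the driver with a 4-step (12-minute) delay.
rng = np.random.default_rng(0)
index = pd.date_range("2025-08-20", periods=500, freq="3min")
driver = pd.Series(rng.normal(size=500), index=index)
human_flow = driver.shift(4) + 0.1 * rng.normal(size=500)

curve = lead_lag_corr(driver, human_flow)
print(curve.idxmax())  # lag (in minutes) with the highest correlation
```

On real data the curve is rarely this sharp; it is a screening tool, and the lags that enter the model should still be confirmed through validation.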



#### Visual Evidence from Correlation and Lead–Lag Analysis

The following figures and interpretations summarize the diagnostic relationships between pedestrian flow (`human_flow`) and the main external variables from TomTom traffic and Vessel data.

##### 1. Correlation Matrix
![Correlation matrix](img_for_report/corr_matrix.png)

All series are aligned on a **3-minute grid**. The matrix above reports **0-lag Pearson correlations**. Colors indicate strength and sign (purple = negative, green/yellow = positive).

**A. Relationship with pedestrian flow (`human_flow`)**

- **Road traffic level (`traffic_level_mean`) vs human_flow:** **clear negative** correlation.  
  In our TomTom data, `traffic_level` is treated as a **relative speed index** (≈ current speed / free-flow speed, range ~0–1). **Higher `traffic_level` indicates smoother roads (less congestion); lower `traffic_level` indicates heavier congestion.** This interpretation follows the TomTom [Traffic Flow Service](https://developer.tomtom.com/traffic-api/documentation/tomtom-orbis-maps/traffic-flow/traffic-flow-service) documentation.
  Thus, **heavier congestion (lower level)** aligns with **higher pedestrian flow**, while **free-flowing roads (higher level)** align with **lower flow**.

- **Average vessel speed (`vessel_avg_speed`) vs human_flow:** **moderate negative** correlation.  
  Lower average speed (loitering, docking, exhibitions) aligns with more people on the quays.

- **Vessel count (`vessel_count`) vs human_flow:** **small positive** correlation.  
  More active vessels in the area coincide with slightly higher pedestrian volumes.

- **Change in traffic level (`traffic_level_chg`) vs human_flow:** **small positive** correlation.  
  This is the **first difference** of `traffic_level_mean` (current minus previous step). Short-term improvements in road conditions tend to align with increases in pedestrian flow, but the effect is mild at 0-lag.

At the **same timestamp**, road congestion (level), vessel behavior (speed), and activity (count) are all informative, with the **strongest signal** coming from `traffic_level_mean`.

**B. Structure among driver variables (collinearity clues)**

- **`traffic_level_mean` vs `vessel_count`:** **strong negative**. Busier waterways co-occur with tighter roads (or vice versa).  
- **`traffic_level_mean` vs `vessel_avg_speed`:** **positive**. Smoother roads often come with higher vessel cruising speeds.  
- **`vessel_avg_speed` vs `vessel_count`:** **moderate negative**. When there are many vessels, the average speed tends to drop.  
- **`traffic_level_chg`** has **weak correlations** with the others (as expected for a "rate-of-change" feature).
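For reference, the 0-lag matrix and the `traffic_level_chg` feature discussed above can be reproduced along these lines. This is a sketch under the assumption that the aligned 3-minute frame already holds the four base columns; the toy numbers are made up and the helper name `zero_lag_corr` is hypothetical.

```python
import pandas as pd


def zero_lag_corr(aligned: pd.DataFrame) -> pd.DataFrame:
    """Add the first difference of traffic_level_mean and return the Pearson matrix."""
    df = aligned.copy()
    # First difference: current value minus the previous 3-minute step.
    df["traffic_level_chg"] = df["traffic_level_mean"].diff()
    return df.corr(method="pearson")


# Toy aligned frame (illustrative values only).
aligned = pd.DataFrame(
    {
        "human_flow": [100, 120, 150, 130, 90],
        "traffic_level_mean": [0.90, 0.80, 0.60, 0.70, 0.95],
        "vessel_count": [10, 12, 15, 14, 9],
        "vessel_avg_speed": [300.0, 250.0, 180.0, 200.0, 320.0],
    },
    index=pd.date_range("2025-08-20 10:00", periods=5, freq="3min"),
)
corr = zero_lag_corr(aligned)
print(corr.round(2))
```

`DataFrame.corr` drops missing pairs column by column, so the `NaN` created by `diff()` in the first row only affects the correlations that involve `traffic_level_chg`.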



##### 2. Lead–Lag Correlations
**Road Traffic**

- **Road congestion level (`traffic_level_mean`)**  
  ![Lead-lag: traffic_level_mean vs human_flow](img_for_report/leadlag_traffic_level_mean.png)
- The correlation remains **negative across all lags**, and its magnitude **grows more negative** as the driver (traffic level) leads.  
- Approximate range: **r ≈ −0.30 at −35 min**, **r ≈ −0.47 at 0 min**, and **r ≈ −0.54 at +35 min**.

**Interpretation.**
- In TomTom data, a **higher `traffic_level_mean`** represents **smoother roads** (less congestion), while a lower value means heavier congestion.  
- The **negative correlation** therefore implies: when roads are **more congested**, pedestrian flow tends to be **higher**; when roads are **free-flowing**, pedestrian flow is **lower**.  
- Because the correlation becomes stronger (more negative) at positive lags, this suggests that **higher road congestion tends to precede higher pedestrian flow**. The lead–lag curve is monotonic without a distinct peak, indicating a **persistent correlation** between traffic conditions and crowd movement rather than a single best lead time. A clear, though weaker, negative correlation is also observed at negative lags.
- In practical modeling, however, future road traffic data cannot be used to predict the present. Therefore, only historical and current features are used, and shorter lags are generally preferred for real-time prediction. The exact lag structure should ultimately be determined through model validation rather than correlation plots.


- **Change in road congestion (`traffic_level_chg`)**  
  ![Lead-lag: traffic_level_chg vs human_flow](img_for_report/leadlag_traffic_level_chg.png)

- The correlation remains **positive across all lags**, but its magnitude **gradually decreases** as the lead of the driver (road condition change) grows.
- Approximate range: **r ≈ +0.19 at −35 min**, **r ≈ +0.12 at 0 min**, and **r ≈ +0.05 at +35 min**.

**Interpretation.**
- In TomTom data, `traffic_level_chg` represents the **change in traffic level** between consecutive time steps — positive values indicate **improving road conditions** (less congestion), while negative values indicate **worsening congestion**.  
- The stronger correlations at **negative lags** indicate that **changes in road conditions tend to follow crowd changes**, not precede them. At positive lags the association weakens with horizon, suggesting that **recent** traffic changes carry more information than **older** ones.  
- Practically, treat `traffic_level_chg` as a **short-term auxiliary signal** (current value or 1–2 short lags). Avoid future values, which would leak data, and long-horizon lags, where the correlation is weaker.



**Vessel Traffic**
- **Vessel count (`vessel_count`)**  
  ![Lead-lag: vessel_count vs human_flow](img_for_report/leadlag_vessel_count.png)
- The correlation remains **positive across all lags**, and its magnitude **increases steadily** as the driver (vessel count) leads.  
- Approximate range: **r ≈ +0.01 at −35 min**, **r ≈ +0.08 at 0 min**, and **r ≈ +0.12 at +35 min**.

**Interpretation.**  
- In the vessel data, `vessel_count` represents the **number of active vessels** recorded within the monitored area during each time interval.  
- The **positive correlation** indicates that **higher vessel activity coincides with or slightly precedes higher pedestrian flow**. This pattern is consistent with the intuition that when more ships are present (especially during docking, exhibitions, or sailing events), more people gather along the waterfront to observe.  
- The **monotonic but modest (small r) increase** in correlation toward positive lags suggests that pedestrian crowds tend to follow the rise in vessel activity with a **gradual delay**, rather than responding instantaneously. There is no clear peak, which implies a **persistent but weak link** between maritime and pedestrian dynamics.  

- From a modeling perspective, since vessel activity data can be monitored in near-real time, features such as **current and short-lagged vessel counts** can be useful predictors for crowd flow near the quays.  
- However, using future vessel counts would be data leakage, so only **past and current observations** should be included when building the predictive model.




- **Vessel average speed (`vessel_avg_speed`)**  
  ![Lead-lag: vessel_avg_speed vs human_flow](img_for_report/leadlag_vessel_avg_speed.png)
- The correlation remains **negative across all lags**, and its magnitude **grows more negative** as the driver (vessel speed) leads.  
- Approximate range: **r ≈ −0.04 at −35 min**, **r ≈ −0.14 at 0 min**, and **r ≈ −0.17 at +35 min**.

**Interpretation.**
- In the vessel data, a higher `vessel_avg_speed` means vessels are **moving faster** (transiting), while a lower value suggests **slow/near-stationary behavior** (e.g., docking, queueing, boarding/exhibitions).
- The **negative correlation** therefore implies: when vessels are **slower**, pedestrian flow tends to be **higher**; when vessels are **moving faster**, pedestrian flow is **lower**.
- Because the correlation becomes stronger (more negative) at positive lags, this suggests that **reductions in vessel speed tend to precede increases in pedestrian flow**. The lead–lag curve is monotonic without a distinct peak, indicating a **persistent relationship** rather than a single best lead time.
- In practical modeling, future vessel speeds cannot be used to predict the present. We therefore use **current and short-lag** speed features only.
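The "current and short-lag features only" rule used throughout this section can be sketched as follows. The helper `add_lag_features` is a hypothetical name, and the sketch assumes the frame is already on a regular 3-minute grid with no missing rows, so that one shift equals 3 minutes.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, columns: list[str],
                     lags: tuple[int, ...] = (1, 2, 3)) -> pd.DataFrame:
    """Append past values of each driver as new columns (lag k = k steps back)."""
    out = df.copy()
    for col in columns:
        for k in lags:
            if k < 1:
                # Zero is already the current column; negative would use the future.
                raise ValueError("lags must be positive to avoid using future values")
            out[f"{col}_lag{k}"] = out[col].shift(k)
    return out


# Toy frame with made-up values.
frame = pd.DataFrame(
    {
        "traffic_level_mean": [0.9, 0.8, 0.6, 0.7],
        "vessel_avg_speed": [300.0, 250.0, 180.0, 200.0],
    },
    index=pd.date_range("2025-08-20 10:00", periods=4, freq="3min"),
)
features = add_lag_features(frame, ["traffic_level_mean", "vessel_avg_speed"], lags=(1, 2))
print(features)
```

The first rows of each lagged column are `NaN` because no earlier observation exists; those rows are typically dropped before training.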


##### 3. Timeseries (3-min)
![Timeseries (3-min)](img_for_report/timeseries_overlay.png)

**What stands out**
- **human_flow**: clear **diurnal cycles** (day/evening highs, late-night lows) with the **largest peak on Aug-24**.
- **traffic_level_mean** (higher = smoother roads): tends to **dip** around crowd peaks → **inverse** pattern with human flow.
- **vessel_count**: **step-like jumps** and **daytime pulses**; activity is higher during event windows and rises around periods when crowds build.
- **vessel_avg_speed**: often **lower** when crowds are high (consistent with more docking/slow cruising/observing).

**Consistency with correlation / lead–lag**
- Road conditions: **lower `traffic_level_mean` (more congestion)** aligns with **higher** human flow; lead–lag curves showed a **monotonic, more negative** relation when the driver leads.
- Vessels: **`vessel_count` positive**, **`vessel_avg_speed` negative** with human flow; effects are **gradual** (no sharp single best lag).

**Important note on straight segments (data gaps)**
- Some **straight lines** (flat or linear ramps) are artifacts of **missing data / resampling alignment**, **not** true physical trajectories.
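One way to keep such artifacts out of the analysis is to interpolate only short gaps and to flag which points were actually observed. The sketch below is one possible approach, not the method used for the figure above; the helper name `regularise` and the two-step gap limit are assumptions.

```python
import pandas as pd


def regularise(series: pd.Series, freq: str = "3min", max_gap_steps: int = 2) -> pd.DataFrame:
    """Put a series on a full grid, fill only short gaps, and flag observed points."""
    full_index = pd.date_range(series.index.min(), series.index.max(), freq=freq)
    on_grid = series.reindex(full_index)
    is_missing = on_grid.isna()

    # Label consecutive runs, then measure the length of each missing run.
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    gap_len = is_missing.groupby(run_id).transform("sum")
    short_gap = is_missing & (gap_len <= max_gap_steps)

    interpolated = on_grid.interpolate(method="time", limit_area="inside")
    value = on_grid.where(~short_gap, interpolated)  # long gaps stay NaN
    return pd.DataFrame({"value": value, "observed": ~is_missing})


# Toy series: one 1-step gap (10:06) and one 4-step gap (10:12 to 10:21).
observed = pd.Series(
    [10.0, 20.0, 40.0, 90.0],
    index=pd.to_datetime([
        "2025-08-20 10:00", "2025-08-20 10:03",
        "2025-08-20 10:09", "2025-08-20 10:24",
    ]),
)
clean = regularise(observed)
print(clean)
```

Leaving long gaps as `NaN` makes plotting libraries break the line instead of drawing a ramp, and the `observed` flag lets correlation and model code exclude filled values if desired.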



#### 4. Operational insights

##### 1) When surrounding roads start **clogging**
When TomTom road data shows that nearby traffic is slowing down, crowd levels along the waterfront usually **increase within 15–30 minutes**.  
This means that road congestion is an **early signal** of crowd build-up.  

**What to do:**
- Use this early warning to **prepare crowd control** before the peak arrives — deploy extra stewards, open more walking routes, and adjust soft barriers.
- Redirect visitors through **clear signage or announcements**, guiding them toward less crowded viewing points.
- In high-risk locations such as **bridges or narrow streets**, temporarily switch to **one-way walking** to avoid jams.

**Why it matters:**  
When car traffic slows down, it often means more people are arriving or moving on foot toward the waterfront, especially during large public events.

##### 2) When **vessel activity rises** but **average speed drops**
When the number of ships in the monitored area increases and their average speed decreases, it usually means ships are **docking, exhibiting, or boarding passengers**, which are moments that attract people to gather nearby.  
Crowds often build up **shortly after** these signals.

**What to do:**
- Prepare **observation areas** near the berths where popular ships are docking.
- Deploy stewards to **separate queues from walking paths**, keeping flows moving along the promenade.
- Communicate early: “Plenty of space at East Quay” or “View also available from Pier 3”.

**Why it matters:**  
Vessel movements act as a **live indicator** of where people will gather next — slower, denser marine activity usually predicts a rise in nearby pedestrian density.

##### 3) When both signals appear together
If roads are congested **and** vessel activity increases at the same time, it is a **strong warning** that crowd pressure will rise soon.  

**Recommended response:**
- Open additional pathways or crossings to spread the flow.
- Bring in **reserve teams** to manage the crowd.
- Begin **guiding or slowing the inflow of visitors** about 10–15 minutes **earlier** than usual to prevent sudden crowding.
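The combined-signal logic from the three situations above could be expressed as a simple rule, sketched here. Every threshold value is a hypothetical placeholder for illustration; real thresholds would have to be calibrated from the event data and on-site experience.

```python
def warning_level(
    traffic_level_mean: float,
    vessel_count: int,
    vessel_avg_speed: float,
    *,
    road_threshold: float = 0.6,     # placeholder: below this, roads count as congested
    count_threshold: int = 50,       # placeholder: above this, vessel activity is high
    speed_threshold: float = 150.0,  # placeholder, in cm/s: below this, vessels are slow
) -> str:
    """Combine the road and vessel signals into 'normal', 'watch' or 'strong'."""
    road_signal = traffic_level_mean < road_threshold
    vessel_signal = vessel_count > count_threshold and vessel_avg_speed < speed_threshold

    if road_signal and vessel_signal:
        return "strong"  # both signals together: crowd pressure likely to rise soon
    if road_signal or vessel_signal:
        return "watch"   # a single early signal: prepare crowd control
    return "normal"


print(warning_level(0.5, 80, 100.0))  # congested roads, many slow vessels
print(warning_level(0.9, 20, 400.0))  # free-flowing roads, few fast vessels
```

A rule like this is meant to prompt confirmation through on-site observation, not to trigger interventions on its own.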

##### 4) Summary for daily operations
- Road and vessel data can act as **early-warning tools** for on-site teams.  
- Use them not just for reporting, but for **timely coordination** — who to send, where to guide, and when to act.
- Combine these signals with on-site observation (CCTV, radio) to confirm and fine-tune responses.
- Keep short notes on what worked each day to **improve timing** and **thresholds** for the next event.




- analysis of results (Model Results)
- Observing the Accuracy of the model
- Model Result (XGBoost)

## Subquestion 4: Developing the Dashboard

To properly visualize the findings developed in the previous parts, the project required a dashboard design that both accurately presents the results and predictions and actively aids the end user's decision making.

- Key Visualizations
    - Long Raw Data projection
    - Interactive mapping
    - Sensor Detail View
    - Correlation lab
- How to show the visualizations to the end user and why

## Conclusion


- Conclusion

## Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Author 1**:

**Author 2**:

**Author 3**:

## Data Used

## Data Pipeline