# Report: Dashboard for SAIL2025

### TIL6022 - Group 23

**Members:** 
Kristof Hollstein,
Yuxuan Li,
Junlin Li,
Muhammad Harisuddin,
Ida Bagus Rai Satria Dharma

## Research Objective/Introduction

SAIL Amsterdam 2025 was a major city event in which over 10,000 vessels from around the world were showcased in the area north of Amsterdam Centraal station. The event drew a large crowd, with over 2.3 million visitors congregating in one location during the week. Because the event brought an irregular surge in pedestrian traffic across Amsterdam, the event organizers and relevant stakeholders need a dashboard that can both monitor current traffic flow and predict future flow around the city, in order to ensure public safety and support important decisions by its end users. For this project, the following main objective was therefore identified:

**Main Objective:**
How can we monitor and predict short term crowd flows during SAIL Amsterdam 2025?

To break down what is needed to deliver a final product that meets the main objective, four sub-objectives were developed:

**Sub Objectives:**
1. How can multi-source mobility data (pedestrian sensors, TomTom Traffic, and vessel positions) be integrated to understand urban crowd management at SAIL Amsterdam?
2. How can predictive models improve forecasts of future pedestrian flow, and what further insight can the model offer?
3. What spatial patterns can be identified in pedestrian flows from multiple crowd sensors in the area?
4. How can an interactive crowd management dashboard support decision making for urban crowd management?


## Subquestion 1: Integrating multi-source mobility data (Data Used)


To develop a working final product with proper insight into current and future crowd flow, the group integrated several datasets related to SAIL 2025 pedestrian traffic.

### Primary Datasets (Sensor Data)
The main datasets on which this project depends contain the data gathered by each sensor. The necessary data is split into two sets. The first, a CSV file, contains the main pedestrian traffic flow information: a count of people at each sensor location around the city, reported every 3 minutes. This dataset provides the base variables for the model. The second, an Excel file, contains the sensor metadata, which includes each sensor's base ID, name, width, and geographic coordinates. This data is used to understand the specifications of each sensor and to map the count data onto the dashboard, allowing the end user to easily read the pedestrian data through a map overlay.

### Project Inputs
While predictions can be produced with the base sensor data alone, we hypothesize that pedestrian flow is influenced by more external variables than just the time of day at which a count was made. For a major event like SAIL in particular, several different factors are expected to occur during the event and alter traffic. To capture these variables as well as possible and make our projections more accurate, the group also used additional datasets as independent variables, so that their influence on the sensor data can be observed.

**TomTom Traffic (roads).**  
The first external dataset is the TomTom Traffic dataset. It is provided as a CSV with a `time` column and, for each timestamp, a nested micro-CSV that lists pairs of `(road_segment_id, traffic_level)`. With the NWB roads shapefile, we parse these records into per-timestamp road-segment values and then aggregate them to a 3-minute grid. For modeling, we derive indicators such as the mean congestion level, variance (dispersion across segments), and the 95th percentile (extreme congestion). These aggregated indicators are aligned with the pedestrian sensors in time and are also used in lagged form to capture the leading effect that road conditions can have on subsequent crowd flow.
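The parsing and 3-minute aggregation described above can be sketched as follows. This is a minimal, standard-library illustration and not the project's actual pipeline: the nested cell encoding (`"segment_id:level|segment_id:level"`), the ISO timestamp format, and the names `parse_pairs`, `floor_to_bin`, `percentile`, and `aggregate_traffic` are all assumptions made for the example, and the join with the NWB shapefile is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pvariance


def parse_pairs(cell):
    """Parse a nested cell such as '101:2|102:5' (assumed encoding) into pairs."""
    pairs = []
    for item in cell.split("|"):
        if not item:
            continue
        segment_id, level = item.split(":")
        pairs.append((int(segment_id), float(level)))
    return pairs


def floor_to_bin(ts, minutes=3):
    """Round a timestamp down to the start of its bin (minutes must divide 60)."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def percentile(values, q):
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def aggregate_traffic(rows, minutes=3):
    """rows: iterable of (time string, nested cell). Returns stats per time bin."""
    bins = defaultdict(list)
    for time_str, cell in rows:
        bin_start = floor_to_bin(datetime.fromisoformat(time_str), minutes)
        # Pool the traffic levels of all road segments that fall into this bin.
        bins[bin_start].extend(level for _, level in parse_pairs(cell))
    return {
        bin_start: {
            "mean": mean(levels),
            "variance": pvariance(levels),
            "p95": percentile(levels, 95),
        }
        for bin_start, levels in sorted(bins.items())
    }
```

Because every bin is keyed by its start time, the resulting indicators can be matched to the 3-minute pedestrian counts on the same key, and a lagged feature is simply the indicator value taken from an earlier bin.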

**Vessel Positions (ships).**  
The Vessel Positions dataset records ship coordinates over time (timestamps, e.g. `update-time` / `upload-timestamp`; coordinates `lat`, `lon`), indicating where each ship was located at a given time, together with ship attributes (e.g., `speed-in-centimeters-per-second`, `short-term-avg-speed-in-cm-per-sec`, `length`, `beam`, `type`). For efficiency we use a pre-aggregated Parquet file built from the raw CSV: at a 3-minute cadence we compute the total vessel count and the average speed (low speeds serve as a proxy for docking, boarding, or exhibiting). This data was considered useful for the model because we hypothesize that some ships attract more viewers than others, which could lead to different crowd formations around each vessel. Combining the ship locations, and their potential popularity, with the geographic sensor data lets the model account for these differences in induced foot traffic, potentially improving the final predictions.
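As a rough illustration of the 3-minute pre-aggregation described above, the sketch below counts distinct vessels and averages the reported speed per bin. It uses only the standard library. The column names `upload-timestamp` and `speed-in-centimeters-per-second` come from the dataset description, while the `vessel-id` key, the ISO timestamp format, and the function names are assumptions for the example. The real pipeline writes a Parquet file, which is not shown here.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def floor_to_bin(ts, minutes=3):
    """Round a timestamp down to the start of its bin (minutes must divide 60)."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def aggregate_vessels(records, minutes=3):
    """records: iterable of dicts, one per position report.

    Returns, per time bin, the number of distinct vessels seen and the
    average speed over all position reports in that bin.
    """
    vessel_ids = defaultdict(set)
    speeds = defaultdict(list)
    for rec in records:
        ts = datetime.fromisoformat(rec["upload-timestamp"])
        bin_start = floor_to_bin(ts, minutes)
        vessel_ids[bin_start].add(rec["vessel-id"])  # hypothetical id column
        speeds[bin_start].append(rec["speed-in-centimeters-per-second"])
    return {
        bin_start: {
            "vessel_count": len(vessel_ids[bin_start]),
            "avg_speed_cm_s": sum(speeds[bin_start]) / len(speeds[bin_start]),
        }
        for bin_start in sorted(vessel_ids)
    }
```

Averaging over position reports rather than over vessels is a simplification: a vessel that reports more often weighs more heavily in the bin average.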

### Summary
To answer the subquestion, several multi-source mobility datasets were integrated in order to find finer patterns across them. This may allow the predictive model to produce more accurate projections that take into account as many variables connected to pedestrian traffic at the venue as possible.

## Subquestion 2: Implementing a Predictive Model

### Predictive Model Selection

To be able to utilize the dataset and predict future pedestrian flow, there is a need to implement a fitting predictive model. Initially, the group was given a choice between either **XGBoost** or **LightGBM**. For this model, the group has chosen to use **XGBoost**.  

---

#### Rationale for XGBoost
Firstly, the group found that XGBoost suits the amount of data available. There are 4 days of data in total (around 2400 rows), a relatively small sample. XGBoost's regularization knobs (`min_child_weight`, `gamma`, `reg_alpha`, `reg_lambda`) help prevent overfitting when sample sizes are limited, while also making it faster to produce a robust baseline than with LightGBM. XGBoost also supports count targets, allowing the group to start with a standard regression objective and switch to Tweedie or Poisson if needed. Finally, XGBoost is presentation friendly: SHAP visualizations explain the rationale behind each prediction, and backtesting makes the reporting easier.

#### When to Revisit LightGBM
By comparison, LightGBM may be more useful if the data spans a longer time period (more than 2-4 weeks) or if richer exogenous features (such as weather or vessel schedules) are added to the model. In that scenario, LightGBM's speed with larger feature sets becomes more advantageous, so it is worth revisiting this method if the set of variables expands further.

---

#### Model Settings (Potential Major Change)

To tune the prediction model so that it produces accurate and insightful projections for the end user, specific hyperparameters must be set within XGBoost. These hyperparameters help the predictive model run smoothly despite the problems that can emerge from combining different datasets into one model.

One problem may originate from the high dimensionality of the data. If left unregulated, high dimensionality can be a problem because the model does not know which features to prioritize or how to balance them against one another. This can increase the amount of noise in the processed data, raising the risk of overfitting and of irrelevant relationships being prioritized, while also straining the computer's resources. Moreover, combining several datasets with missing data points can cause problems for future projections if the gaps are not handled properly.

By tuning the XGBoost settings carefully, we can direct the model to balance and prioritize certain features, while also regularizing itself and generalizing over missing points by using past data as its rationale. The XGBoost hyperparameters must therefore be set so that the model can regulate the data input and produce a reliable prediction output.

The hyperparameters we have coded are as follows:


[Actual code of all the parameters here]

**learning_rate**

**n_estimators**

**max_depth**

**subsample & colsample_bytree**
- subsample/colsample_bytree
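To show how the hyperparameters named in this section fit together, the sketch below collects them in one parameter dictionary. All values are illustrative placeholders and are not the settings used by the group. The names `ILLUSTRATIVE_PARAMS` and `build_model` are hypothetical. The keys themselves are standard arguments of `xgboost.XGBRegressor`, which is imported lazily so the dictionary can be inspected without the library installed.

```python
# Illustrative values only: these are NOT the group's tuned settings.
ILLUSTRATIVE_PARAMS = {
    "objective": "count:poisson",  # count target; "reg:squarederror" or "reg:tweedie" are alternatives
    "learning_rate": 0.05,         # shrinkage applied to each new tree
    "n_estimators": 400,           # number of boosting rounds (trees)
    "max_depth": 4,                # shallow trees limit model complexity
    "subsample": 0.8,              # fraction of rows sampled per tree
    "colsample_bytree": 0.8,       # fraction of features sampled per tree
    "min_child_weight": 5,         # minimum sum of instance weight in a leaf
    "gamma": 0.1,                  # minimum loss reduction required to split
    "reg_alpha": 0.1,              # L1 regularization on leaf weights
    "reg_lambda": 1.0,             # L2 regularization on leaf weights
}


def build_model(params=None):
    """Create an XGBRegressor from a parameter dictionary (requires xgboost)."""
    from xgboost import XGBRegressor  # imported here so the dict works without xgboost

    return XGBRegressor(**(params or ILLUSTRATIVE_PARAMS))
```

A lower `learning_rate` is usually paired with a higher `n_estimators`, and `subsample` and `colsample_bytree` below 1.0 add randomness that can reduce overfitting on a small sample.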

#### Interpretation
- How does this help us understand SAIL traffic flow

## Subquestion 3: Spatial Pattern Analysis


### General Analysis of the results
According to the Flow Dashboard analysis, over 5 million people were recorded by the sensors throughout the five-day event. Oosterdoksbrug emerged as the busiest location, capturing nearly 10% of the total crowd, while Zeeburg registered the lowest count with only 113 people, peaking at 8 individuals recorded simultaneously on August 20, 2025, at 8:45 PM. The highest attendance occurred on August 23, 2025, when 775 people were detected at Oosterdoksbrug around 11 PM, likely because it was the start of the weekend. Notably, only the first and last days saw fewer than one million visitors. This is understandable: the first day was a weekday, and the final day preceded the workweek, prompting many attendees to leave earlier.




### Observing Accuracy of the models
- Problem with missing data (Junli maybe)


### Model Results
- Model Result (XGBoost) (generalization of all 37 models)


## Subquestion 4: Developing the Dashboard

To visualize the outputs developed in the previous parts, the project requires a dashboard that can accurately present the results and predictions in a way that supports the end user's crowd management decisions. To ensure quality, there needs to be a consensus on which refined data will be presented and how it will be shown on the dashboard for general use. The dashboard therefore uses two types of projection to present our findings: analytical and geographic.

#### Analytical Projections on Dashboard (Harris' code)

The analytical part of the dashboard first shows a visitor count and a vessel count using 4 separate pie charts, 2 for each. The first 2 pie charts give the count of visitors or vessels within the selected time frame on a specific date. The total is shown as a value inside the pie, while the pie itself represents the ratio of the visitors or vessels within the selected time frame to the grand total for that date. The date and time frame for these pie charts can be changed by the user with a date and time filter located at the top.

The other 2 pie charts represent the ratio of the running total of visitors or vessels, counted up to the date selected in the date and time filter, to the cumulative total for the entire event. As with the previous 2 pie charts, the cumulative total up to the given date is shown in the middle of the pie chart, while the ratio is represented by the colored fill of its outer ring.
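The two ratios behind these pie charts can be written down in a few lines. The sketch below is an assumption about how such values could be computed, not the dashboard's actual code. The function name `pie_ratios`, the hour-based time frame, and the `(timestamp, count)` input format are all hypothetical.

```python
from datetime import date, datetime


def pie_ratios(counts, day, start_hour, end_hour):
    """counts: list of (datetime, count) pairs for the whole event.

    Returns the total inside the selected time frame and its share of that
    day's total, plus the running total up to and including `day` and its
    share of the event total.
    """
    day_total = sum(n for ts, n in counts if ts.date() == day)
    window_total = sum(
        n for ts, n in counts
        if ts.date() == day and start_hour <= ts.hour < end_hour
    )
    event_total = sum(n for _, n in counts)
    running_total = sum(n for ts, n in counts if ts.date() <= day)
    return {
        "window_total": window_total,
        "window_share": window_total / day_total if day_total else 0.0,
        "running_total": running_total,
        "running_share": running_total / event_total if event_total else 0.0,
    }
```

The `*_total` values correspond to the number shown inside each pie, and the `*_share` values to the filled portion of the ring.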

(SCREENSHOT OF THE 4 PIE CHARTS HERE)

In addition, 2 line graphs show the hourly trends in the total number of visitors for a given area and in the number of vessels during the selected date. Each line graph shows the actual counted value on one line and, on a second line, a forecast based on the data yielded by our prediction model for each hour.

(SCREENSHOT OF THE LINE GRAPH)

Lastly, the analytical part of the dashboard also provides a bar chart showing the 10 busiest areas recorded during the selected time frame and date. While there are 36 different active sensors in this model, the bar chart shows only the top 10 in order to highlight the most crowded areas while keeping the dashboard readable.
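Selecting the top 10 is a small ranking step, sketched below with the standard library. The function name `top_sensors` and the `(sensor_name, count)` input format are assumptions for illustration, and the input is expected to be filtered to the selected date and time frame beforehand.

```python
from collections import Counter


def top_sensors(counts, n=10):
    """counts: iterable of (sensor_name, count) pairs within the selected window.

    Returns the n sensors with the highest summed count, busiest first.
    """
    totals = Counter()
    for sensor, count in counts:
        totals[sensor] += count
    return totals.most_common(n)
```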

(SCREENSHOT OF THE BAR GRAPH)

#### Geographic Projections (Mapping) on Dashboard (Satria's code)

The dashboard also uses geographic projections to show our refined visitor data on a readable map by making use of the sensor geodata. Two types of interactive map are provided: a heat map and a bubble map.

The bubble map shows the total number of visitors counted at every sensor as a number inside a circle that represents the sensor. Each circle is placed on the map according to the sensor's geodata. As with the analytical charts, the end user is free to choose the date and time frame shown on the map. The size of each circle varies with the number of visitors its sensor counted within the given time frame, with the circle growing larger as the number of visitors increases. Clicking on a circle opens a line graph of the number of people counted by that sensor over the entirety of the chosen date. This allows the end user to look deeper into the visitor count trends of each sensor.
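One common way to size such circles is to scale the radius with the square root of the count, so that the circle's area rather than its radius grows in proportion to the number of visitors. The sketch below illustrates that idea. It is an assumption about a possible scaling rule, not the dashboard's actual implementation, and the function name and pixel limits are hypothetical.

```python
import math


def bubble_radius(count, max_count, min_radius=4.0, max_radius=30.0):
    """Map a visitor count to a circle radius in pixels.

    Square-root scaling keeps the circle area roughly proportional to the
    count, so a very busy sensor does not visually swamp the map.
    """
    if max_count <= 0:
        return min_radius
    fraction = math.sqrt(max(count, 0) / max_count)
    return min_radius + (max_radius - min_radius) * fraction
```

Here `max_count` would be the largest count among all sensors in the selected time frame, so the busiest sensor always gets the largest circle.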

(SCREENSHOT OF THE BUBBLE MAP)

The heat map, on the other hand, visualizes the concentration of visitors during a given time frame on a map of Amsterdam. Rather than giving the number of visitors per sensor as the bubble map does, the heat map shows its findings through color gradations: areas colored more red represent a high concentration of visitors, while blue represents the opposite. Like the bubble map, the heat map is paired with its own line graph showing the changes in the number of visitors counted at each sensor. Unlike the bubble map, however, the heat map also has a time-lapse function on a separate page of the dashboard, allowing the end user to view how the visitor concentration changed geographically over the time period. The time lapse was not implemented for the bubble map, because the heat map's minimalistic use of color to show concentration visualizes visitor movement trends across the city better than a number-heavy bubble map.

(SCREENSHOT OF THE HEAT MAP)

#### How these projections help with decision making
- Pie chart
- Line/bar graph
- The interactive mapping (both Heat and Bubble)

## Conclusion


- Conclusion

## Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Author 1**:

**Author 2**:

**Author 3**:

Ida Bagus Rai Satria Dharma: Visualization for Flow chart (Map, Sensor Details, and Time-lapse)

**Author 5**:

## Data Used

## Data Pipeline