# Report: Dashboard for SAIL2025

### TIL6022 - Group 23

**Members:** 
Kristof Hollstein,
Yuxuan Li,
Junlin Li,
Muhammad Harisuddin,
Ida Bagus Rai Satria Dharma

## Research Objective/Introduction

SAIL Amsterdam 2025 was a major city event in which over 10,000 vessels from around the world were showcased in the area north of Amsterdam Centraal station. The event drew over 2.3 million visitors to one location during the week. Because it brought an irregular surge in pedestrian traffic to the whole of Amsterdam, the event organizers and relevant stakeholders need a dashboard that both monitors current traffic flow and predicts future flow around the city, in order to ensure public safety and support important decisions by end users. For this project, the following main objective was therefore identified:

**Main Objective:**
How can we monitor and predict short-term crowd flows during SAIL Amsterdam 2025?

To break down what is needed to deliver a final product that meets the main objective, four sub-objectives were developed:

**Sub Objectives:**
1. How can multi-source mobility data (pedestrian sensors, TomTom Traffic, and vessel positions) be integrated to understand urban crowd management at SAIL Amsterdam?
2. How can predictive models improve forecasts of future pedestrian flow, and what further insight can the model offer?
3. What spatial patterns can be identified in pedestrian flows from multiple crowd sensors in the area?
4. How can an interactive dashboard support decision making for urban crowd management?


## Subquestion 1: Integrating multi-source mobility data (Data Used)


To develop a working final product with proper insight into current and future crowd flow, the group integrated several datasets related to SAIL 2025 pedestrian traffic.

### Primary Datasets (Sensor Data)
The project depends mainly on the data gathered by the pedestrian sensors. This data is split into two files. The first, a CSV file, contains the main pedestrian traffic flow information: a count of people at each sensor location around the city, reported every 3 minutes. It provides the base variables for the model. The second, an Excel file, contains the sensor metadata, including each sensor's base ID, name, width, and geographic coordinates. This metadata is used to understand the sensor specifications and to map the count data onto the dashboard, so that the end user can read the pedestrian data through a map overlay.
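As a minimal sketch of how the two files could be combined for a map overlay, the snippet below joins count rows to sensor metadata on a shared ID. It assumes a long-format layout, and the field names (`sensor_id`, `count`, `width_m`, `lat`, `lon`) and all values are hypothetical stand-ins rather than the actual column names or data in the files.

```python
# Synthetic rows; the real CSV and Excel column names may differ.
counts = [
    {"timestamp": "2025-08-20 12:00", "sensor_id": "S1", "count": 42},
    {"timestamp": "2025-08-20 12:03", "sensor_id": "S1", "count": 57},
    {"timestamp": "2025-08-20 12:00", "sensor_id": "S2", "count": 18},
]
metadata = [
    {"sensor_id": "S1", "name": "Sensor one", "width_m": 4.0, "lat": 52.38, "lon": 4.90},
    {"sensor_id": "S2", "name": "Sensor two", "width_m": 2.5, "lat": 52.39, "lon": 4.91},
]


def attach_metadata(count_rows, metadata_rows):
    """Left-join each count row with its sensor's metadata on sensor_id."""
    by_id = {m["sensor_id"]: m for m in metadata_rows}
    joined = []
    for row in count_rows:
        # Rows without matching metadata are kept, just without coordinates.
        meta = by_id.get(row["sensor_id"], {})
        joined.append({**meta, **row})
    return joined


map_rows = attach_metadata(counts, metadata)
```

Each element of `map_rows` then carries both a count and the coordinates needed to place it on a map.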

### Project Inputs
- Part 2: Independent Variables which we will plug in for future projection and modelling (Needs to be finished)

While predictions can be produced from the base sensor data alone, we hypothesize that pedestrian flow is influenced by more external variables than just the time of day at which a count was made. During a major event like SAIL in particular, several factors can be expected to alter traffic. To capture these factors and make the projections as accurate as possible, the group also used additional datasets as independent variables and examined their influence on the sensor data.

**TomTom Traffic (roads).**  
The first external dataset is the TomTom Traffic dataset. It is provided as a CSV with a `time` column and, for each timestamp, a nested micro-CSV that lists pairs of `(road_segment_id, traffic_level)`. With the NWB roads shapefile, we parse these records into per-timestamp road-segment values and then aggregate them to a 3-minute grid. For modeling, we derive indicators such as the mean congestion level, variance (dispersion across segments), and the 95th percentile (extreme congestion). These aggregated indicators are aligned with the pedestrian sensors in time and are also used in lagged form to capture the leading effect that road conditions can have on subsequent crowd flow.
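The parsing and aggregation step described above can be sketched as follows. This is an illustration under stated assumptions, not the group's pipeline: it assumes each nested cell holds one `segment_id,level` pair per line and that `time` uses the format `YYYY-MM-DD HH:MM:SS`. The function names and sample values are hypothetical, and the join with the NWB shapefile is left out.

```python
import csv
import io
import statistics
from collections import defaultdict
from datetime import datetime


def parse_pairs(cell):
    """Parse one nested cell into a list of (road_segment_id, traffic_level)."""
    reader = csv.reader(io.StringIO(cell))
    return [(row[0].strip(), float(row[1])) for row in reader if row]


def floor_to_3min(ts):
    """Round a timestamp down to the start of its 3-minute bin."""
    return ts.replace(minute=ts.minute - ts.minute % 3, second=0, microsecond=0)


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def aggregate(rows):
    """rows: iterable of (time_string, nested_cell) -> {bin_start: indicators}."""
    bins = defaultdict(list)
    for time_str, cell in rows:
        ts = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
        bins[floor_to_3min(ts)].extend(level for _, level in parse_pairs(cell))
    return {
        start: {
            "mean": statistics.fmean(levels),
            "variance": statistics.pvariance(levels),
            "p95": percentile(levels, 95),
        }
        for start, levels in sorted(bins.items())
    }


# Synthetic example rows (not real TomTom data).
rows = [
    ("2025-08-20 12:00:00", "r1,1\nr2,3"),
    ("2025-08-20 12:01:30", "r1,2\nr2,4"),
    ("2025-08-20 12:03:00", "r1,5\nr2,5"),
]
indicators = aggregate(rows)
```

Note that this sketch pools all segment values falling into a bin, so the variance mixes dispersion across segments with variation over time inside the bin. Averaging per timestamp first would be an alternative design.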

**Vessel Positions (ships).**  
The Vessel Positions dataset records ship positions over time (timestamps, e.g. `update-time` / `upload-timestamp`; coordinates `lat`, `lon`), indicating which ship was where at a given time, together with ship attributes (e.g. `speed-in-centimeters-per-second`, `short-term-avg-speed-in-cm-per-sec`, `length`, `beam`, `type`). For efficiency we use a pre-aggregated Parquet file built from the raw CSV: at a 3-minute cadence we compute the total vessel count and the average speed (low speeds are a proxy for docking, boarding, or exhibiting). We consider this data useful for the model because we hypothesize that some ships attract more viewers than others, which could produce different crowd formations around each vessel. Combining ship locations and their potential popularity with the geographic sensor data lets the model account for these differences in induced foot traffic, potentially improving the final predictions.
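The 3-minute aggregation of vessel records might look like the sketch below. The `update-time` and `speed-in-centimeters-per-second` fields come from the dataset description. The `vessel-id` field, the ISO timestamp format, the function names, and the sample records are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def floor_to_3min(ts):
    """Round a timestamp down to the start of its 3-minute bin."""
    return ts.replace(minute=ts.minute - ts.minute % 3, second=0, microsecond=0)


def aggregate_vessels(records):
    """Return {bin_start: {"vessel_count", "avg_speed_cm_s"}} per 3-minute bin."""
    ids = defaultdict(set)
    speeds = defaultdict(list)
    for rec in records:
        start = floor_to_3min(datetime.fromisoformat(rec["update-time"]))
        ids[start].add(rec["vessel-id"])  # hypothetical identifier column
        speeds[start].append(rec["speed-in-centimeters-per-second"])
    return {
        start: {
            "vessel_count": len(ids[start]),          # distinct vessels seen in the bin
            "avg_speed_cm_s": fmean(speeds[start]),   # mean over position updates
        }
        for start in sorted(ids)
    }


# Synthetic example records (not real vessel data).
records = [
    {"update-time": "2025-08-20T12:00:10", "vessel-id": "A", "speed-in-centimeters-per-second": 0},
    {"update-time": "2025-08-20T12:01:40", "vessel-id": "A", "speed-in-centimeters-per-second": 20},
    {"update-time": "2025-08-20T12:02:05", "vessel-id": "B", "speed-in-centimeters-per-second": 250},
    {"update-time": "2025-08-20T12:04:00", "vessel-id": "B", "speed-in-centimeters-per-second": 300},
]
vessel_features = aggregate_vessels(records)
```

Here the average speed is taken over position updates, so a vessel that reports more often weighs more heavily. Averaging per vessel first is an alternative if that bias matters.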

### Summary
To answer this subquestion, several multi-source mobility datasets were integrated in order to find finer patterns across them. This may allow the predictive model to produce more accurate projections that take into account as many variables related to pedestrian traffic at the venue as possible.

## Subquestion 2: Implementing a Predictive Model

### Predictive Model Selection

To use the dataset to predict future pedestrian flow, a suitable predictive model is needed. Initially, the group was given a choice between **XGBoost** and **LightGBM**. For this project, the group chose **XGBoost**.

---

#### Rationale for XGBoost
First, the group found that XGBoost suits the amount of data available. There are 4 days of data in total (around 2400 rows), which is a relatively small sample. XGBoost's regularization knobs (`min_child_weight`, `gamma`, `reg_alpha`, `reg_lambda`) help prevent overfitting when sample sizes are limited, and the group found it quicker to produce a robust baseline with XGBoost than with LightGBM. XGBoost also supports count targets, so the group can start with ordinary regression and switch to a Tweedie or Poisson objective if needed. Finally, XGBoost is presentation friendly: SHAP visualizations explain the rationale behind each prediction, and backtesting makes the reporting easier.
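To make the regularization knobs and count objectives concrete, the sketch below assembles a parameter dictionary for each objective. The parameter and objective names are XGBoost's own, but every numeric value is an illustrative placeholder and not the group's tuned setting. `make_params` and the dictionary names are hypothetical helpers.

```python
# Placeholder values for illustration only; they are NOT tuned settings.
BASE_PARAMS = {
    "learning_rate": 0.05,
    "n_estimators": 300,
    "max_depth": 4,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    # Regularization knobs named in the text
    "min_child_weight": 5,
    "gamma": 1.0,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
}

# Objectives for a plain regression baseline and for count targets.
OBJECTIVES = {
    "regression": {"objective": "reg:squarederror"},
    "poisson": {"objective": "count:poisson"},
    "tweedie": {"objective": "reg:tweedie", "tweedie_variance_power": 1.5},
}


def make_params(target="regression"):
    """Merge the shared settings with the objective for the chosen target type."""
    if target not in OBJECTIVES:
        raise ValueError(f"unknown target type: {target!r}")
    return {**BASE_PARAMS, **OBJECTIVES[target]}


params = make_params("poisson")
# With xgboost installed, these can be passed as keyword arguments:
# model = xgboost.XGBRegressor(**params)
```

Keeping the objective separate from the shared settings makes it easy to start from the squared-error baseline and switch to Poisson or Tweedie without touching the regularization values.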

#### When to Revisit LightGBM
By comparison, LightGBM may be more useful if the data spans a longer period (more than 2-4 weeks) or if richer exogenous features (such as weather or vessel schedules) are added to the model. In that scenario, LightGBM's speed with larger feature sets becomes an advantage, so it is worth revisiting if the set of variables grows.

---

#### Model Settings (to be specified)
- What are the hyperparameters we have chosen for the predictive model and why
  - learning_rate
  - n_estimators
  - max_depth
  - subsample/colsample_bytree

#### Interpretation
- How does this help us understand SAIL traffic flow


## Subquestion 3: Spatial Pattern Analysis


- analysis of results (Model Results)
- Observing the Accuracy of the model
- Model Result (XGBoost)


## Subquestion 4: Developing the Dashboard

To properly visualize the findings from the previous parts, the project required a dashboard design that both presents the results and predictions accurately and actively aids the end user's decision making.


- Key Visualizations
    - Long Raw Data projection
    - Interactive mapping
    - Sensor Detail View
    - Correlation lab
- How to show the visualizations to the end user and why

## Conclusion


- Conclusion

## Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Author 1**:

**Author 2**:

**Author 3**:

## Data Used

## Data Pipeline