# Report: Dashboard for SAIL2025

### TIL6022 - Group 23

**Members:** 
Kristof Hollstein,
Yuxuan Li,
Junlin Li,
Muhammad Harisuddin,
Ida Bagus Rai Satria Dharma

## Research Objective/Introduction

SAIL Amsterdam 2025 was a major city event in which over 10,000 vessels from around the world were showcased in the waters north of Amsterdam Centraal station. The event drew over 2.3 million visitors to one area of the city during the week. Because it brought an irregular surge in pedestrian traffic across Amsterdam, the event organizers and relevant stakeholders need a dashboard that both monitors current traffic flow and predicts future flow around the city, in order to ensure public safety and support decision making by end users. For this project, the following main objective was therefore identified:

**Main Objective:**
How can we monitor and predict short term crowd flows during SAIL Amsterdam 2025?

To break the main objective down into requirements for the final product, 4 sub-objectives were developed:

**Sub Objectives:**
1. How can multi-source mobility data (pedestrian sensors, TomTom Traffic, and vessel positions) be integrated to understand urban crowd management at SAIL Amsterdam?
2. How can predictive models improve forecasts of future pedestrian flow, and what further insight can the models offer?
3. What spatial patterns can be identified in pedestrian flows from multiple crowd sensors in the area?
4. How can an interactive dashboard support decision making for urban crowd management?


## Subquestion 1: Integrating multi-source mobility data (Data Used)


To develop a working final product with proper insight into current and future crowd flow, the group integrated several datasets related to SAIL 2025 pedestrian traffic.

### Primary Datasets (Sensor Data)
The main datasets this project depends on are those gathered by the pedestrian sensors. The necessary data is split into two sets. The first, a CSV file, contains the main pedestrian traffic flow information: a count of people at each sensor location around the city, reported every 3 minutes. This dataset provides the base variables for the model. The second, an Excel file, contains the sensor metadata, including each sensor's base ID, name, width and geographic coordinates. This data is used to understand the specifications of each sensor and to map the count data onto the dashboard, allowing the end user to read the pedestrian data through a map overlay.

### Project Inputs
While predictions can be produced with the base sensor data alone, we hypothesize that pedestrian flow is influenced by more external variables than just the time of day at which a count was made. Especially for a major event like SAIL, several different factors are expected to occur during the event and alter traffic. To capture these variables and make our projections as accurate as possible, the group also used additional datasets as independent variables and observed their influence on the sensor data.

**TomTom Traffic (roads).**  
The first external dataset is the TomTom Traffic dataset. It is provided as a CSV with a `time` column and, for each timestamp, a nested micro-CSV that lists pairs of `(road_segment_id, traffic_level)`. With the NWB roads shapefile, we parse these records into per-timestamp road-segment values and then aggregate them to a 3-minute grid. For modeling, we derive indicators such as the mean congestion level, variance (dispersion across segments), and the 95th percentile (extreme congestion). These aggregated indicators are aligned with the pedestrian sensors in time and are also used in lagged form to capture the leading effect that road conditions can have on subsequent crowd flow.
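
The aggregation step described above can be sketched as follows. This is a minimal illustration, not the group's actual code: it assumes the nested records have already been parsed into a long table, and the frame `tomtom_long`, the helper `aggregate_traffic`, and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical long-format table: one row per (timestamp, road segment).
tomtom_long = pd.DataFrame({
    "time": pd.to_datetime([
        "2025-08-20 10:00:10", "2025-08-20 10:01:20", "2025-08-20 10:02:30",
        "2025-08-20 10:03:05", "2025-08-20 10:04:40",
    ]),
    "road_segment_id": [101, 102, 103, 101, 102],
    "traffic_level": [0.9, 0.6, 0.3, 0.8, 0.6],
})


def aggregate_traffic(df: pd.DataFrame, freq: str = "3min") -> pd.DataFrame:
    """Collapse per-segment traffic levels into one row per time bin."""
    binned = df.set_index("time")["traffic_level"].resample(freq)
    return pd.DataFrame({
        "traffic_level_mean": binned.mean(),
        "traffic_level_var": binned.var(),           # dispersion across segments
        "traffic_level_p95": binned.quantile(0.95),  # upper tail of the index
    })


traffic_3min = aggregate_traffic(tomtom_long)
# Lagged copy (one 3-minute step back) for use as a leading indicator.
traffic_3min["traffic_level_mean_lag1"] = traffic_3min["traffic_level_mean"].shift(1)
```

Using `shift(1)` keeps each row restricted to information that was already available one step earlier, which is what a lagged feature needs.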

**Vessel Positions (ships).**  
The Vessel Positions dataset provides timestamped ship coordinates (timestamps, e.g., `update-time` / `upload-timestamp`; coordinates: `lat`, `lon`), indicating which ship was located where at a given time, along with ship attributes (e.g., `speed-in-centimeters-per-second`, `short-term-avg-speed-in-cm-per-sec`, `length`, `beam`, `type`). For efficiency we use a pre-aggregated parquet file built from the raw CSV: at a 3-minute cadence we compute the total vessel count and the average speed (low speeds are a proxy for docking/boarding/exhibiting). We consider this data useful for the model because we hypothesize that certain ships attract more viewers than others, which could produce different crowd formations around each vessel. Projecting ship locations, and their potential popularity, against the geographic sensor data lets the model account for these differences in induced foot traffic, potentially improving the final predictions.
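
A minimal sketch of the 3-minute vessel aggregation, under stated assumptions: the timestamp and speed column names follow the fields quoted above, while the identifier column `vessel_id`, the sample rows, and the helper `aggregate_vessels` are hypothetical.

```python
import pandas as pd

# Hypothetical raw positions; `vessel_id` is an assumed identifier column.
positions = pd.DataFrame({
    "upload-timestamp": pd.to_datetime([
        "2025-08-20 10:00:05", "2025-08-20 10:01:00", "2025-08-20 10:02:50",
        "2025-08-20 10:03:10",
    ]),
    "vessel_id": ["A", "B", "A", "B"],
    "speed-in-centimeters-per-second": [0, 200, 100, 300],
})


def aggregate_vessels(df: pd.DataFrame, freq: str = "3min") -> pd.DataFrame:
    """Count distinct vessels and average their speed in each time bin."""
    binned = df.set_index("upload-timestamp").resample(freq)
    return pd.DataFrame({
        "vessel_count": binned["vessel_id"].nunique(),
        "vessel_avg_speed": binned["speed-in-centimeters-per-second"].mean(),
    })


vessels_3min = aggregate_vessels(positions)
```

Counting distinct identifiers rather than rows avoids inflating the count when one vessel reports several positions inside the same bin.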

### Summary
To answer the subquestion, several multi-source mobility datasets were integrated in order to find finer patterns between them. This may allow the predictive model to produce accurate projections that take into account as many variables connected to pedestrian traffic at the venue as possible.

## Subquestion 2: Implementing a Predictive Model

### Predictive Model Selection

To be able to utilize the dataset and predict future pedestrian flow, there is a need to implement a fitting predictive model. Initially, the group was given a choice between either **XGBoost** or **LightGBM**. For this model, the group has chosen to use **XGBoost**.  

---

#### Rationale for XGBoost
First, the group found that XGBoost suits the amount of data available. There are 4 days of data in total (around 2400 rows), a relatively small sample. XGBoost's regularization knobs (`min_child_weight`, `gamma`, `reg_alpha`, `reg_lambda`) help prevent overfitting when sample sizes are limited, while also letting the group produce a robust baseline faster than with LightGBM. XGBoost also supports count targets, allowing the group to start with a standard regression objective and switch to Tweedie or Poisson if needed. Finally, XGBoost is presentation friendly: SHAP visualizations explain the rationale behind individual predictions, and backtesting makes the reporting easier.
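
To make the regularization knobs and objective options concrete, here is a small sketch of a parameter dictionary. The helper `make_xgb_params` is hypothetical and every numeric value is a placeholder, not a value tuned by the group.

```python
def make_xgb_params(objective: str = "reg:squarederror") -> dict:
    """Return an illustrative XGBoost parameter dict (placeholder values)."""
    params = {
        "objective": objective,
        # Regularization knobs named in the rationale above.
        "min_child_weight": 5,   # minimum summed instance weight per leaf
        "gamma": 0.1,            # minimum loss reduction required to split
        "reg_alpha": 0.1,        # L1 penalty on leaf weights
        "reg_lambda": 1.0,       # L2 penalty on leaf weights
        # Capacity settings, also placeholders.
        "max_depth": 4,
        "learning_rate": 0.05,
        "n_estimators": 300,
    }
    if objective == "reg:tweedie":
        # Only meaningful for the Tweedie objective; must lie in (1, 2).
        params["tweedie_variance_power"] = 1.5
    return params


baseline_params = make_xgb_params()                 # standard regression
poisson_params = make_xgb_params("count:poisson")   # count target
tweedie_params = make_xgb_params("reg:tweedie")     # zero-heavy count target
```

A dictionary like this can be unpacked into the scikit-learn style estimator, for example `xgboost.XGBRegressor(**baseline_params)`.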

#### When to Revisit LightGBM
In comparison, LightGBM may be more useful if the data spans a longer time period (over 2-4 weeks) or if richer exogenous features (such as weather or vessel schedules) are added to the model. In that scenario, LightGBM's speed with larger feature sets becomes more advantageous, and it is worth revisiting if the set of variables expands further.

---

#### Data Processing

Because the original sensor dataset contains readings from multiple directions per sensor, and the objective is to predict total pedestrian flow, we aggregated the directional values into a total flow for each sensor. We also wanted to include road traffic (TomTom data) and vessel traffic (Vessel data) as external factors, since both are hypothesized to influence pedestrian movement. We therefore imported both datasets, resampled them to the same 3-minute interval as the sensor data, and merged them with the sensor dataset.

During the merge, we found that the Vessel and TomTom datasets had several missing values because they do not provide full coverage of the period. Since XGBoost handles missing values natively, we retained these gaps and additionally created a missing-value indicator column to help the model catch potential patterns associated with data availability. This process produced a unified dataset suitable for training the XGBoost prediction model.
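
The merge and the missing-value indicators can be illustrated with a small sketch. The sample values and the column names (`S1_dir_a`, `S1_dir_b`, `*_missing`) are assumptions made for the example, not the project's actual schema.

```python
import pandas as pd

idx = pd.date_range("2025-08-20 10:00", periods=4, freq="3min")

# Hypothetical directional counts for one sensor (two directions).
sensor = pd.DataFrame(
    {"S1_dir_a": [10, 12, 9, 15], "S1_dir_b": [4, 6, 5, 7]}, index=idx
)
# Hypothetical external series with incomplete coverage.
traffic = pd.DataFrame({"traffic_level_mean": [0.8, 0.7]}, index=idx[:2])
vessels = pd.DataFrame({"vessel_count": [3.0, 4.0, 5.0]}, index=idx[1:])

# Total flow per sensor = sum over its directions.
merged = pd.DataFrame({"human_flow": sensor.sum(axis=1)})

# Left joins keep every sensor timestamp; uncovered periods stay NaN.
merged = merged.join(traffic, how="left").join(vessels, how="left")

# One 0/1 indicator per external column flags where data was unavailable.
for col in ["traffic_level_mean", "vessel_count"]:
    merged[f"{col}_missing"] = merged[col].isna().astype(int)
```

The NaN values are left in place on purpose, since tree boosting libraries such as XGBoost learn a default split direction for missing entries.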

#### Build XGBoost model for single sensor

To develop our predictive model, we first built and tested an XGBoost model for a single sensor. This step served as the foundation for the later modeling framework, verifying the model's feasibility before scaling up to multiple sensors. We used the processed time series of one selected sensor, including its historical pedestrian flow and the external variables (TomTom and Vessel data) aligned by timestamp. We then performed feature engineering, generating lag features to capture temporal dependencies, and trained the XGBoost model on these features. Hyperparameters were tuned with Optuna to improve predictive performance.
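
A minimal sketch of the lag-feature step, assuming a frame indexed on the 3-minute grid. The helper `add_lag_features`, the chosen lags, and the sample values are illustrative; the model fit and the Optuna search are deliberately left out.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, columns, lags) -> pd.DataFrame:
    """Append `<col>_lag<k>` columns holding the value k rows (3-minute steps) earlier."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out


idx = pd.date_range("2025-08-20 10:00", periods=6, freq="3min")
frame = pd.DataFrame(
    {
        "human_flow": [14, 18, 14, 22, 25, 21],
        "traffic_level_mean": [0.8, 0.7, 0.7, 0.6, 0.5, 0.6],
    },
    index=idx,
)

features = add_lag_features(frame, ["human_flow", "traffic_level_mean"], lags=[1, 2])
features["hour"] = features.index.hour  # simple timestamp-based feature

# Assumption for this sketch: rows without a full target history are dropped.
train = features.dropna(subset=["human_flow_lag2"])
X = train.drop(columns=["human_flow"])
y = train["human_flow"]
```

Because every lag column is built with a positive `shift`, no predictor contains information from after the timestamp being predicted.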

Building a single-sensor model served several key purposes:
- **Model Validation**: Confirming that XGBoost can effectively learn temporal patterns in the crowd flow data.
- **Parameter Tuning**: Identifying suitable hyperparameters and training configurations before applying them to all sensors.
- **Pipeline Testing**: Verifying that the end-to-end workflow, from data preprocessing and feature generation to model training and evaluation, functions correctly.
- **Performance Benchmark**: Establishing a single-sensor baseline against which the models trained for multiple sensors can later be compared.

This stage laid the technical groundwork for our subsequent large-scale modeling, ensuring both the reliability and the efficiency of the prediction pipeline.

#### Building the Multisensor XGBoost model (Loop Based Architecture)

After validating the single-sensor framework, we extended the approach to a loop-based multi-sensor XGBoost prediction system. The objective of this stage was to automatically train an individual prediction model for every sensor, ensuring both consistency and scalability across sensors.

For the technical implementation, a Python loop was developed to iterate through all sensor columns in the dataset. For each sensor, the following steps were executed:

1. The total pedestrian flow time series of the sensor was extracted as the target variable.
2. The aligned external variables (TomTom and Vessel data), together with lag features and timestamp-based features, were used as predictors.
3. The dataset was split into multiple folds for cross-validation to enhance robustness.
4. Within the loop, Optuna optimized each sensor's hyperparameters, so that the model parameters suited the characteristics of that sensor's data.
5. The resulting best-performing models, predictions, and evaluation metrics (e.g., R^2 and RMSE) were automatically stored for later comparison and analysis.
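
The loop above can be sketched as a skeleton. To keep the example dependency-free, the XGBoost fit and the Optuna search are replaced by a stand-in `mean_baseline` that simply predicts the training mean; the fold scheme (`expanding_folds`) and all names are assumptions for illustration, not the group's implementation.

```python
import numpy as np
import pandas as pd


def expanding_folds(n_rows: int, n_folds: int):
    """Yield (train_idx, test_idx) pairs that respect time order."""
    fold_size = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_end = min(train_end + fold_size, n_rows)
        yield np.arange(train_end), np.arange(train_end, test_end)


def mean_baseline(X_train, y_train, X_test):
    """Stand-in model: predicts the training mean. A tuned XGBoost fit would go here."""
    return np.full(len(X_test), y_train.mean())


def train_all_sensors(flows, features, fit_predict, n_folds=3):
    """Run the same fold-wise evaluation for every sensor column."""
    X = features.to_numpy(dtype=float)
    results = {}
    for sensor in flows.columns:
        y = flows[sensor].to_numpy(dtype=float)
        fold_rmse = []
        for train_idx, test_idx in expanding_folds(len(y), n_folds):
            pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
            fold_rmse.append(float(np.sqrt(np.mean((y[test_idx] - pred) ** 2))))
        results[sensor] = {"rmse_mean": float(np.mean(fold_rmse)), "n_folds": len(fold_rmse)}
    return pd.DataFrame(results).T


# Hypothetical data: one flat sensor and one steadily rising sensor.
flows = pd.DataFrame({"flat": [5.0] * 8, "ramp": [float(i) for i in range(8)]})
features = pd.DataFrame({"step": range(8)})
summary = train_all_sensors(flows, features, mean_baseline)
```

Passing the model as a `fit_predict` callable keeps preprocessing and evaluation identical across sensors while still allowing per-sensor tuning inside the callable.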

This loop-based design carries several advantages:
- **Automation and Scalability**: For over 30 sensors, the group was able to run batch training and evaluation without manual intervention.
- **Consistency**: All models share the same preprocessing and feature engineering pipeline, which makes their results comparable.
- **Individual Optimization**: Hyperparameter tuning was done independently for each sensor, improving performance across heterogeneous data sources.
- **Performance Benchmarking**: Comparing results among sensors helped identify locations with better predictive behavior or potential data quality issues.
- **System-level validation**: The results showed whether the model can generalize effectively when external variables are brought in at a larger scale.

Although some sensors showed varied performance stemming from the difference in data completeness and external feature availability, this stage successfully produced a unified and automated prediction pipeline, confirming feasibility and robustness for the XGBoost based modeling framework on a large scale, multi source environment. 

#### XGBoost Performance Results

(INSERT SCREENSHOT OF OVERALL R^2 score MODEL PERFORMANCE)

The R^2 score measures the share of variance in the dependent variable that is explained by the predictive model. Ideally, the closer a sensor's R^2 is to 1, the better the model fits that sensor's data. The results show several sensors with a relatively high R^2, with over 13 sensors above 0.6 and one sensor with a perfect score of 1. However, the graph also shows weak R^2 scores below 0.5 for around half of the sensors. The model also posted a negative R^2 for GVCV-09, meaning it performed worse on this sensor than simply predicting the sensor's mean flow, so it could not produce accurate results from the input datasets. Overall, the gap between the best and worst performing sensors is relatively wide, indicating that prediction quality varies by sensor.

(INSERT SCREENSHOT OF MAE SCORES FOR ALL SENSORS)

MAE was used to evaluate the XGBoost model by calculating the average absolute difference between the predicted and actual values for each sensor. A lower MAE is better, as it indicates predictions close to the actual values. Here we see a diverse set of MAE scores, with around a third of sensors posting a score below 20 and another third exceeding 40. As with the R^2 scores, these results show that the model's accuracy depends on the sensor location.

(INSERT SCREENSHOT OF RMSE FOR ALL SENSORS)

RMSE is another performance indicator, which penalizes large prediction errors more harshly than MAE. It is computed by squaring each error, averaging the squared errors, and taking the square root of that average. It is used to assess the size of errors and the model's stability, and a lower number is again better. The results are consistent with those for MAE: sensors with a low MAE also post a low RMSE and vice versa.
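
For reference, the three metrics can be computed as in the sketch below. The helper `regression_metrics` and the sample values are illustrative only; the handling of a constant observed series (returning NaN) is a choice made for this sketch.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return MAE, RMSE and R^2 for one sensor's predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))  # square, average, then root
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    # R^2 is undefined when the observed series is constant (ss_tot == 0).
    r2 = float("nan") if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


good = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
bad = regression_metrics([10, 20, 30, 40], [40, 30, 20, 10])
flat = regression_metrics([0, 0, 0, 0], [0, 0, 0, 0])
```

In this example `good` has an R^2 close to 1, `bad` has a negative R^2 because its predictions are worse than the mean, and `flat` has no defined R^2 because the observed series never varies. Some libraries report a finite value instead of NaN in that last case, which can make a flat series look like a perfect fit.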

#### Possible reasoning behind certain sensor results

As discussed above, there is a wide gap on all performance metrics between sensors with ideal results and sensors where the model is statistically far off. To assess this divide, this part analyzes the results for the sensor with a perfect R^2 score of 1 and for the sensor with a negative R^2 score.

(SCREENSHOT OF TEST SET & VALIDATION SET RESULTS FOR GASA06)

Sensor GASA-06 was the only sensor with a perfect R^2 of 1, with a gap of around 10 points to sensor CMSA-GAHK-01, which has the second highest R^2. The test results for this sensor show a near-perfect match between the predicted and actual lines, apart from some spikes between the 2000 and 2100 marks. While this alone could suggest a perfect projection, the picture changes in the validation set, where both the XGBoost best line and the actual values are flat near 0 throughout the entire time frame. This suggests that there is effectively no sensor data for the model to learn from in that window, which artificially boosts the R^2 to 1 because there is no real variation to compare the predictions against. Hence, this "perfect correlation" may carry no substance.

(SCREENSHOT OF TEST SET & VALIDATION SET FOR GVCV09)

In comparison, the worst performing sensor, GVCV-09, which posted a negative R^2, shows a completely different picture. Here we observe large gaps between the actual values and the XGBoost predictions throughout the entire time span: the actual values are significantly higher at the beginning, and the predictions overshoot the actual values in the later stages (around the 2200 mark). The validation set also shows constant irregular spikes in the actual values early on, while the XGB best line stays nearly flat throughout. These diagnostics reveal serious predictive limitations for this sensor, as the model produces inaccurate predictions that do not track the observed values at all.


## Subquestion 3: Spatial Pattern Analysis



### Correlation Analysis between External Factors and Crowd

#### Method
1. **Align to a 3-minute grid.** We resampled TomTom (roads) and Vessel (ships) to a common 3-minute cadence and aligned them with the pedestrian flow series.
2. **Screen with correlation + lead–lag.**  
   - We computed **Pearson correlations** with `human_flow` (see correlation heatmap).  
   - We then scanned **±36 minutes** of lead–lag (positive lag = driver leads) to see whether a driver **"anticipates"** later pedestrian changes (see the lead–lag plots).
3. **Keep features that are both statistically useful and operationally plausible**, i.e., features that can be known at prediction time and make sense for SAIL operations.
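
The lead–lag scan in step 2 can be sketched as below, assuming both series sit on the same 3-minute index. The helper `lead_lag_correlation` and the synthetic series are hypothetical; ±12 steps corresponds to the ±36 minutes mentioned above.

```python
import numpy as np
import pandas as pd


def lead_lag_correlation(driver: pd.Series, flow: pd.Series,
                         max_steps: int = 12, step_minutes: int = 3) -> pd.Series:
    """Pearson r between driver(t) and flow(t + lag); positive lag = driver leads."""
    rows = []
    for k in range(-max_steps, max_steps + 1):
        # flow.shift(-k) places flow(t + k) at time t, next to driver(t).
        r = driver.corr(flow.shift(-k))
        rows.append({"lag_min": k * step_minutes, "r": r})
    return pd.DataFrame(rows).set_index("lag_min")["r"]


# Synthetic check: the flow is the driver delayed by 3 steps (9 minutes).
rng = np.random.default_rng(0)
idx = pd.date_range("2025-08-20", periods=200, freq="3min")
driver = pd.Series(rng.normal(size=200), index=idx)
flow = driver.shift(3)

curve = lead_lag_correlation(driver, flow)
best_lag = curve.idxmax()
```

On the synthetic data the curve peaks at `lag_min = 9`, which confirms the sign convention: a positive lag means the driver moves first.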



#### Visual Evidence from Correlation and Lead–Lag Analysis

The following figures and interpretation summarize the diagnostic relationships between pedestrian flow (`human_flow`) and the main external variables from TomTom traffic and Vessel data.

##### 1. Correlation Matrix
![Correlation matrix](img_for_report/corr_matrix.png)

All series are aligned on a **3-minute grid**. The matrix above reports **0-lag Pearson correlations**. Colors indicate strength and sign (purple = negative, green/yellow = positive).

**A. Relationship with pedestrian flow (`human_flow`)**

- **Road traffic level (`traffic_level_mean`) vs human_flow:** **clear negative** correlation.  
  In our TomTom data, `traffic_level` is treated as a **relative speed index** (≈ current speed / free-flow speed, range ~0–1). **Higher `traffic_level` indicates smoother roads (less congestion); lower `traffic_level` indicates heavier congestion.** This follows the definition used by TomTom's [Traffic Flow Service](https://developer.tomtom.com/traffic-api/documentation/tomtom-orbis-maps/traffic-flow/traffic-flow-service).
  Thus, **heavier congestion (lower level)** aligns with **higher pedestrian flow**, while **free-flowing roads (higher level)** align with **lower flow**.

- **Average vessel speed (`vessel_avg_speed`) vs human_flow:** **moderate negative** correlation.  
  Lower average speed (loitering, docking, exhibitions) aligns with more people on the quays.

- **Vessel count (`vessel_count`) vs human_flow:** **small positive** correlation.  
  More active vessels in the area coincide with slightly higher pedestrian volumes.

- **Change in traffic level (`traffic_level_chg`) vs human_flow:** **small positive** correlation.  
  This is the **first difference** of `traffic_level_mean` (current minus previous step). Short-term improvements in road conditions tend to align with increases in pedestrian flow, but the effect is mild at 0-lag.

At the **same timestamp**, road congestion (level), vessel behavior (speed), and activity (count) are all informative, with the **strongest signal** coming from `traffic_level_mean`.

**B. Structure among driver variables (collinearity clues)**

- **`traffic_level_mean` vs `vessel_count`:** **strong negative**. Busier waterways co-occur with tighter roads (or vice versa).  
- **`traffic_level_mean` vs `vessel_avg_speed`:** **positive**. Smoother roads often come with higher vessel cruising speeds.  
- **`vessel_avg_speed` vs `vessel_count`:** **moderate negative**. When there are many vessels, the average speed tends to drop.  
- **`traffic_level_chg`** has **weak correlations** with the other variables (as expected for a “rate-of-change” feature).



##### 2. Lead–Lag Correlations
**Road Traffic**

- **Road congestion level (`traffic_level_mean`)**  
  ![Lead-lag: traffic_level_mean vs human_flow](img_for_report/leadlag_traffic_level_mean.png)
- The correlation remains **negative across all lags**, and its magnitude **grows more negative** as the driver (traffic level) leads.  
- Approximate range: **r ≈ −0.30 at −35 min**, **r ≈ −0.47 at 0 min**, and **r ≈ −0.54 at +35 min**.

**Interpretation.**
- In TomTom data, a **higher `traffic_level_mean`** represents **smoother roads** (less congestion), while a lower value means heavier congestion.  
- The **negative correlation** therefore implies: when roads are **more congested**, pedestrian flow tends to be **higher**; when roads are **free-flowing**, pedestrian flow is **lower**.  
- Because the correlation becomes stronger (more negative) at positive lags, this suggests that **higher road congestion tends to precede higher pedestrian flow**. The lead–lag curve is monotonic without a distinct peak, indicating a **persistent correlation** between traffic conditions and crowd movement rather than a single best lead time. A clear, though weaker, correlation is also observed at negative lags.
- In practical modeling, however, future road traffic data cannot be used to predict the present. Therefore, only historical and current features are used, and shorter lags are generally preferred for real-time prediction. The exact lag structure should ultimately be determined through model validation rather than correlation plots.


- **Change in road congestion (`traffic_level_chg`)**  
  ![Lead-lag: traffic_level_chg vs human_flow](img_for_report/leadlag_traffic_level_chg.png)

- The correlation remains **positive across all lags**, but its magnitude **gradually decreases** as the driver (road condition change) leads further into the future.  
- Approximate range: **r ≈ +0.19 at −35 min**, **r ≈ +0.12 at 0 min**, and **r ≈ +0.05 at +35 min**.

**Interpretation.**
- In TomTom data, `traffic_level_chg` represents the **change in traffic level** between consecutive time steps — positive values indicate **improving road conditions** (less congestion), while negative values indicate **worsening congestion**.  
- The stronger correlations at **negative lags** indicate that **changes in road conditions tend to follow crowd changes**, not precede them. At positive lags the association weakens with horizon, suggesting that **recent** traffic changes carry more information than **older** ones.  
- Practically, treat `traffic_level_chg` as a **short-term auxiliary signal** (current or 1–2 short lags); avoid far-future or long-horizon values, both to prevent data leakage and because their correlation with pedestrian flow is weaker.



**Vessel Traffic**
- **Vessel count (`vessel_count`)**  
  ![Lead-lag: vessel_count vs human_flow](img_for_report/leadlag_vessel_count.png)
- The correlation remains **positive across all lags**, and its magnitude **increases steadily** as the driver (vessel count) leads.  
- Approximate range: **r ≈ +0.01 at −35 min**, **r ≈ +0.08 at 0 min**, and **r ≈ +0.12 at +35 min**.

**Interpretation.**  
- In the vessel data, `vessel_count` represents the **number of active vessels** recorded within the monitored area during each time interval.  
- The **positive correlation** indicates that **higher vessel activity coincides with or slightly precedes higher pedestrian flow**. This pattern is consistent with the intuition that when more ships are present (especially during docking, exhibitions, or sailing events), more people gather along the waterfront to observe.  
- The **monotonic and modest (small r) increase** in correlation toward positive lags suggests that pedestrian crowds tend to follow the rise in vessel activity with a **gradual delay**, rather than responding instantaneously. There is no clear peak, which implies a **persistent but weak link** between maritime and pedestrian dynamics.  

- From a modeling perspective, since vessel activity data can be monitored in near-real time, features such as **current and short-lagged vessel counts** can be useful predictors for crowd flow near the quays.  
- However, using far-future vessel counts would imply data leakage, so only **past and current observations**  should be included when building the predictive model.  




- **Vessel average speed (`vessel_avg_speed`)**  
  ![Lead-lag: vessel_avg_speed vs human_flow](img_for_report/leadlag_vessel_avg_speed.png)
- The correlation remains **negative across all lags**, and its magnitude **grows more negative** as the driver (vessel speed) leads.  
- Approximate range: **r ≈ −0.04 at −35 min**, **r ≈ −0.14 at 0 min**, and **r ≈ −0.17 at +35 min**.

**Interpretation.**
- In the vessel data, a higher `vessel_avg_speed` means vessels are **moving faster** (transiting), while a lower value suggests **slow/near-stationary behavior** (e.g., docking, queueing, boarding/exhibitions).
- The **negative correlation** therefore implies: when vessels are **slower**, pedestrian flow tends to be **higher**; when vessels are **moving faster**, pedestrian flow is **lower**.
- Because the correlation becomes stronger (more negative) at positive lags, this suggests that **reductions in vessel speed tend to precede increases in pedestrian flow**. The lead–lag curve is monotonic without a distinct peak, indicating a **persistent relationship** rather than a single best lead time.
- In practical modeling, future vessel speeds cannot be used to predict the present. We therefore use **current and short-lag** speed features only.


##### 3. Timeseries (3-min)
![Timeseries (3-min)](img_for_report/timeseries_overlay.png)

**What stands out**
- **human_flow**: clear **diurnal cycles** (day/evening highs, late-night lows) with the **largest peak on Aug-24**.
- **traffic_level_mean** (higher = smoother roads): tends to **dip** around crowd peaks → **inverse** pattern with human flow.
- **vessel_count**: **step-like jumps** and **daytime pulses**; activity is higher during event windows and rises around periods when crowds build.
- **vessel_avg_speed**: often **lower** when crowds are high (consistent with more docking/slow cruising/observing).

**Consistency with correlation / lead–lag**
- Road conditions: **lower `traffic_level_mean` (more congestion)** aligns with **higher** human flow; lead–lag curves showed a **monotonic, more negative** relation when the driver leads.
- Vessels: **`vessel_count` positive**, **`vessel_avg_speed` negative** with human flow; effects are **gradual** (no sharp single best lag).

**Important note on straight segments (data gaps)**
- Some **straight lines** (flat or linear ramps) are artifacts of **missing data / resampling alignment**, **not** true physical trajectories.



#### Operational insights

##### 1) When surrounding roads start **clogging**
When TomTom road data shows that nearby traffic is slowing down, crowd levels along the waterfront usually **increase within 15–30 minutes**.  
This means that road congestion is an **early signal** of crowd build-up.  

**What to do:**
- Use this early warning to **prepare crowd control** before the peak arrives — deploy extra stewards, open more walking routes, and adjust soft barriers.
- Redirect visitors through **clear signage or announcements**, guiding them toward less crowded viewing points.
- In high-risk locations such as **bridges or narrow streets**, temporarily switch to **one-way walking** to avoid jams.

**Why it matters:**  
When car traffic slows down, it often means more people are arriving or moving on foot toward the waterfront, especially during large public events.

##### 2) When **vessel activity rises** but **average speed drops**
When the number of ships in the monitored area increases and their average speed decreases, it usually means ships are **docking, exhibiting, or boarding passengers**, which are moments that attract people to gather nearby.  
Crowds often build up **shortly after** these signals.

**What to do:**
- Prepare **observation areas** near the berths where popular ships are docking.
- Deploy stewards to **separate queues from walking paths**, keeping flows moving along the promenade.
- Communicate early: “Plenty of space at East Quay” or “View also available from Pier 3”.

**Why it matters:**  
Vessel movements act as a **live indicator** of where people will gather next — slower, denser marine activity usually predicts a rise in nearby pedestrian density.

##### 3) When both signals appear together
If roads are congested **and** vessel activity increases at the same time, it is a **strong warning** that crowd pressure will rise soon.  

**Recommended response:**
- Open additional pathways or crossings to spread the flow.
- Bring in **reserve teams** to manage the crowd.
- Begin **guiding or slowing the inflow of visitors** about 10–15 minutes **earlier** than usual to prevent sudden crowding.

##### 4) Summary for daily operations
- Road and vessel data can act as **early-warning tools** for on-site teams.  
- Use them not just for reporting, but for **timely coordination** — who to send, where to guide, and when to act.
- Combine these signals with on-site observation (CCTV, radio) to confirm and fine-tune responses.
- Keep short notes on what worked each day to **improve timing** and **thresholds** for the next event.



### Analysis of the results
- According to the Flow Dashboard analysis, over 5 million people were recorded by the sensors throughout the five-day event. Oosterdoksbrug emerged as the busiest location, capturing nearly 10% of the total crowd, while Zeeburg registered the lowest count with just 113 people, peaking at 8 individuals recorded simultaneously on August 20, 2025, at 8:45 PM. The highest attendance occurred on August 23, 2025, when 775 people were detected at Oosterdoksbrug around 11 PM, likely because it was the start of the weekend. Notably, only the first and last days saw fewer than one million recorded visitors, which is understandable since the first day was a weekday and the final day preceded the workweek, prompting many attendees to leave earlier.


## Subquestion 4: Developing the Dashboard

To properly visualize the outputs developed in the previous parts, the project requires a dashboard that accurately presents the results and predictions in a way that aids the end user's crowd management decisions. To ensure quality, there needs to be consensus on which refined data is presented and how it is shown on the dashboard. The dashboard therefore uses two types of projection to present our findings: analytical and geographic.

#### Analytical Projections on Dashboard

The analytical view first shows visitor and vessel counts in 4 separate pie charts, 2 each for visitors and vessels. The first 2 pie charts give the count of visitors or vessels within the selected time frame on a specific date. The total is shown as a value inside the pie, while the pie itself represents the ratio of visitors or vessels within the selected time frame to the grand total for the chosen date. The user can change the date and time frame with a date and time filter located at the top.

The other 2 pie charts show the ratio of the running total of visitors or vessels, counted up to the date selected in the filter, to the cumulative total across the entire event. As with the first 2 pie charts, the running total up to the selected date is shown in the middle of the chart, while the ratio is represented by the colored fill of its outer ring.
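
The two ratios behind the pie charts can be sketched as follows. The helpers `window_share` and `cumulative_share` and the hourly sample series are hypothetical, and the dashboard's actual implementation may differ.

```python
import pandas as pd


def window_share(series: pd.Series, day: str, start: str, end: str):
    """Count inside [start, end] on `day`, and its share of that day's total."""
    day_data = series.loc[day]
    in_window = day_data.between_time(start, end).sum()
    return in_window, in_window / day_data.sum()


def cumulative_share(series: pd.Series, day: str):
    """Running total up to and including `day`, and its share of the event total."""
    running = series.loc[:day].sum()
    return running, running / series.sum()


# Hypothetical counts: one visitor per hour over two days.
idx = pd.date_range("2025-08-20", periods=48, freq="60min")
visitors = pd.Series(1, index=idx)

count, share = window_share(visitors, "2025-08-20", "10:00", "12:00")
running, running_share = cumulative_share(visitors, "2025-08-20")
```

The count is what would be printed inside each pie, and the share is the filled fraction of the ring.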

(SCREENSHOT OF THE 4 PIE CHARTS HERE)

Moreover, 2 line graphs show the hourly trends in the total number of visitors for a given area and in the number of vessels during the selected date. Each line graph shows the actual counted value on one line and a forecast line based on the hourly output of our prediction model.

(SCREENSHOT OF THE LINE GRAPH)

Lastly, the analytical projections in our dashboard also provide a bar chart showing the 10 busiest areas recorded during the selected date and time frame. While there are 36 active sensors in this model, the bar chart shows only the top 10 to highlight the most crowded areas while keeping the dashboard readable.
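
The top 10 selection can be sketched as follows, assuming the readings for the selected time frame are available as `(sensor_id, count)` pairs. The sensor IDs and counts are invented for the example and `top_busiest` is a hypothetical helper, not a function from the dashboard.

```python
from collections import Counter


def top_busiest(readings, n=10):
    """Return the n sensors with the highest summed count, busiest first."""
    totals = Counter()
    for sensor_id, count in readings:
        totals[sensor_id] += count
    return totals.most_common(n)


# Made-up readings for 36 hypothetical sensors, S01 to S36.
readings = [(f"S{i:02d}", i * 10) for i in range(1, 37)]

top10 = top_busiest(readings)
print(len(top10))  # 10
print(top10[0])    # ('S36', 360)
print(top10[-1])   # ('S27', 270)
```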

(SCREENSHOT OF THE BAR GRAPH)

#### Geographic Projections (Mapping) on Dashboard

The dashboard also implements geographic projections, which use each sensor's geodata to show our refined visitor data on a readable map. Two types of interactive map are used: a heat map and a bubble map.

The bubble map shows the total number of visitors counted at every sensor as a number inside a circle representing that sensor. Each circle is placed on the map according to the sensor's geodata. As with the analytical charts, the end user is free to choose the date and time frame projected onto the map. The size of each circle varies with the number of visitors its sensor has counted within the given time frame, with the circle becoming larger as the number of visitors increases. Clicking on a circle brings up a line graph of the number of people counted by that sensor over the entire chosen date. This allows the end user to look deeper into the visitor count trends of each individual sensor.
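
One possible way to size the circles is sketched below. The exact scaling used in the dashboard is not documented here, so the square-root rule, the function name and the pixel limits are all assumptions. Square-root scaling is a common choice because a circle's area, rather than its radius, then grows roughly in line with the count.

```python
import math


def bubble_radius(count, max_count, min_radius=4.0, max_radius=30.0):
    """Map a visitor count to a circle radius in pixels.

    The radius grows with the square root of count / max_count, so a sensor
    with four times the visitors gets about twice the extra radius. Counts
    outside the range 0..max_count are clamped.
    """
    if max_count <= 0:
        return min_radius
    fraction = min(max(count / max_count, 0.0), 1.0)
    return min_radius + (max_radius - min_radius) * math.sqrt(fraction)


print(bubble_radius(0, 100))    # 4.0
print(bubble_radius(25, 100))   # 17.0
print(bubble_radius(100, 100))  # 30.0
```

The minimum radius keeps sensors with very low counts visible and clickable on the map.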

(SCREENSHOT OF THE BUBBLE MAP)

The heat map, on the other hand, visualizes the concentration of visitors during a given time frame on a map of Amsterdam. Rather than providing the number of visitors per sensor like the bubble map, the heat map uses color gradations, where red areas represent a high concentration of visitors and blue areas a low concentration. Like the bubble map, the heat map is paired with its own line graph showing the changes in the number of visitors counted at each sensor. Unlike the bubble map, however, the heat map also has a time-lapse function, located on a separate page of the dashboard, which lets the end user view how the visitor concentration changed geographically over the time period. This function was not implemented for the bubble map, because the heat map's minimalistic use of color to show concentration visualizes visitor movement trends across the city better than a number-heavy bubble map.
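
The color gradation and the time-lapse can be illustrated with a small sketch. It assumes a simple linear blue-to-red ramp and readings given as `(timestamp, sensor_id, count)` tuples. The real dashboard may rely on a mapping library's built-in gradient instead, and both helper functions are hypothetical.

```python
from datetime import datetime


def heat_color(count, max_count):
    """Map a count to a hex color on a linear ramp from blue (low) to red (high)."""
    if max_count <= 0:
        fraction = 0.0
    else:
        fraction = min(max(count / max_count, 0.0), 1.0)
    red = round(255 * fraction)
    blue = round(255 * (1 - fraction))
    return f"#{red:02x}00{blue:02x}"


def timelapse_frames(readings):
    """Group (timestamp, sensor_id, count) readings into one frame per timestamp."""
    frames = {}
    for ts, sensor_id, count in readings:
        frames.setdefault(ts, {})[sensor_id] = count
    return dict(sorted(frames.items()))


# Made-up readings for two hypothetical sensors at two moments.
readings = [
    (datetime(2025, 8, 20, 12, 0), "S01", 0),
    (datetime(2025, 8, 20, 12, 0), "S02", 100),
    (datetime(2025, 8, 20, 12, 3), "S01", 100),
    (datetime(2025, 8, 20, 12, 3), "S02", 0),
]

frames = timelapse_frames(readings)
for ts, counts in frames.items():
    colors = {sensor: heat_color(count, 100) for sensor, count in counts.items()}
    print(ts.strftime("%H:%M"), colors)
# 12:00 {'S01': '#0000ff', 'S02': '#ff0000'}
# 12:03 {'S01': '#ff0000', 'S02': '#0000ff'}
```

Playing the frames in order gives the time-lapse effect: the same locations change color as the crowd concentration shifts.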

(SCREENSHOT OF THE HEAT MAP)

#### How these projections help with decision making

The design of this dashboard was chosen and developed to enhance the end user's decision making process, and each of these projection methods contributes to this goal.

The analytical charts (the pie charts, line graphs and bar chart) allow the end user to identify the overall numerical trends throughout the entire event. The 4 pie charts are a quick way to read the visitor and vessel counts within a user-selected time frame and to compare them against the daily and event-wide totals, showing what share of those totals the selected period accounts for. The line graphs place time stamps on these counts, allowing the user to identify the numerical trend at any point during the event.

The interactive geographic projections allow these trends to be observed on a map overlay, showing how they are distributed geographically. The bubble map shows the end user which sensors are the most crowded through the size of the circles, with the number in the middle of each circle giving the actual value. The heat map shows these trends through color, which, paired with the time-lapse function, allows the end user to identify the geographic trends of the crowd flow at a glance. The line graph, brought up by clicking a sensor on the map, instantly shows the trends at the level of a single sensor. This is especially useful when the user wants to zoom in on a specific sensor because of a pattern recognized on the map overlay, as it provides the analytical tools to support that decision. As with the analytical charts, the end user has full autonomy to choose the time span of the data shown on the geographic projections.

Overall, the combination of analytical and geographic projections, presenting data both for individual sensors and for the entire event, provides the insight and tools needed to better understand the crowd flow patterns during SAIL 2025. The ability to alter the time frame of each projection lets the user zoom in on potentially important moments when trends emerge, allowing further understanding of crowd patterns.

## Conclusion


To answer the main objective "How can we monitor and predict short term crowd flows during SAIL Amsterdam 2025?", we conclude that a successful model stems from a continuous process of finding and integrating relevant data, developing the prediction model in accordance with our dataset and goals, generating the results, and designing a dashboard that presents the findings in a way that is digestible and aids the end user's decision process. For future development, we advise the developers of a crowd prediction model for the SAIL2030 event to look into installing new sensors to improve the prediction model. As stated before, there was a gap between the sensors whose counts XGBoost predicted well and those it did not. To address this, new sensors could be placed in the areas which yielded a low R^2, improving the raw data so that the XGBoost predictions for these points can potentially improve.

Moreover, future developers can also look into developing the crowd prediction model for other events, possibly in different parts of the world, with a different set of variables and integrated data. This would make it possible to observe how the predictions differ across variables and scenarios, allowing developers not only to broaden their horizons on crowd prediction in a different environment, but also to better understand specific constraints that SAIL2025 may have been under and that have not been identified in this report.

## Contribution Statement

Kristof Hollstein:
1. Created the early-stage Streamlit structure to enable group contributions through GitHub.
2. Wrote the report sections for the Introduction, subquestions 1, 2 and 4, and the Conclusion.

Muhammad Harisuddin: 
1. Created the chart visualizations in the dashboard that display the number of visitors, the number of vessels and the top 10 most crowded areas within a specific time frame.
2. Collaborated with Ida Bagus Rai Satria Dharma to combine the chart visualizations and the map visualizations into one unified dashboard as the final product.

Ida Bagus Rai Satria Dharma: 
1. Created the dashboard's flow visualizations (Map, Sensor Details, and Time-lapse).
2. Collaborated with Muhammad Harisuddin to combine the chart visualizations and the map visualizations into one unified dashboard as the final product.

## Data Used

## Data Pipeline