# Motivation

- *What is your dataset?*

Our dataset is the [MTA Subway Hourly Ridership](https://data.ny.gov/Transportation/MTA-Subway-Hourly-Ridership-2020-2024/wujg-7c2s/about_data) dataset provided by the Metropolitan Transportation Authority (MTA) of New York. It includes hourly counts of entries and exits for every subway station in NYC from January 2020 through 2024. Each row represents one station-hour, with metadata including date, station name, line, borough, and ridership numbers. The dataset captures both weekday and weekend traffic, making it useful for analysing temporal patterns in New York’s subway system.

- *Why did you choose this/these particular dataset(s)?*

We chose it because the subway reflects the mood, motion, and disruption of the city. Whether it is a blizzard in The Bronx, a signal failure at Penn Station, or 50,000 runners flooding Columbus Circle, the changes in ridership give us a direct, measurable signal of how New Yorkers react to their environment.

And because the dataset is so granular—hour-by-hour at each station—it allowed us to zoom in on the exact moment when the city shifts. We could see the crowd surge in Times Square at midnight, the quiet after a protest, or the hesitation after a high-profile crime.

- *What was your goal for the end user's experience?*

Our aim was to let readers feel the rhythm of the city. We did not just want to show charts—we wanted to tell stories. Stories of how the subway reacts during a crisis, how it pulses with energy during events, and how it quietly reflects collective human decisions like protest, celebration, or avoidance.

We designed our story like a digital magazine: immersive, browsable, with clean visuals that flow naturally from one insight to the next. We wanted users to walk away not just understanding the data—but seeing the subway as a living, breathing mirror of New York City itself.

# Basic Stats

- *Write about your choices in data cleaning and preprocessing*.

Subway data is messy, just like the city it tracks. Each row in the dataset records ridership at one station for one hour, timestamped in a column called `transit_timestamp`. But for our stories to work, like isolating midnight surges on New Year’s Eve or early-morning Marathon crowds, we needed a clearer sense of when things were happening.

So, we did one crucial thing:

```python
# Convert timestamp to datetime and extract date/hour
df['transit_timestamp'] = pd.to_datetime(df['transit_timestamp'])
df['date'] = df['transit_timestamp'].dt.date
df['hour'] = df['transit_timestamp'].dt.hour
```

Why?
Because human stories do not live in timestamps—they live in hours and days. We needed to know:

* What happens at 8 AM on a Monday?
* How do late-night entries change on New Year’s Eve?
* What’s the average 6 PM weekday crowd compared to a Saturday?

Extracting date and hour helped us group the data meaningfully, so we could build visualisations like hourly ridership patterns, event overlays, and weekend vs. weekday comparisons. Without this step, all of those stories would have stayed hidden in raw strings of numbers.
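As a minimal sketch of the kind of grouping this enables, the snippet below builds a tiny synthetic frame and compares mean hourly ridership on weekdays and weekends. The `ridership` column name and the sample values are assumptions for illustration; only `transit_timestamp` comes from the step above.

```python
import pandas as pd

# Tiny synthetic sample; the `ridership` column name and values are assumed.
sample = pd.DataFrame({
    "transit_timestamp": [
        "2024-01-01 08:00:00",  # Monday
        "2024-01-02 08:00:00",  # Tuesday
        "2024-01-06 08:00:00",  # Saturday
        "2024-01-01 18:00:00",  # Monday
        "2024-01-06 18:00:00",  # Saturday
    ],
    "ridership": [100, 140, 30, 200, 80],
})

sample["transit_timestamp"] = pd.to_datetime(sample["transit_timestamp"])
sample["hour"] = sample["transit_timestamp"].dt.hour
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
sample["is_weekend"] = sample["transit_timestamp"].dt.dayofweek >= 5

# Mean ridership per hour, split into weekday and weekend columns
hourly_profile = (
    sample.groupby(["hour", "is_weekend"])["ridership"]
    .mean()
    .unstack("is_weekend")
    .rename(columns={False: "weekday", True: "weekend"})
)
print(hourly_profile)
```

The resulting table has one row per hour and one column per day type, which is the shape a weekday vs. weekend line plot would need.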


```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter
import folium
from folium.plugins import HeatMap
import matplotlib.colors as mcolors
```

```python
# Load the dataset
file_path = r"C:\Users\poltr\OneDrive - udl.cat\Desktop\MTA_Subway_Hourly_Ridership__2020-2024_20250408.csv"

# Read data from the specified file
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
df.head()
```

```
MemoryError: Unable to allocate 845. MiB for an array with shape (110696365,) and data type int64
```
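A `MemoryError` like this is common when a CSV of this size is loaded in a single call. One possible workaround, sketched here as an illustration and not as the loading code used for the analysis, is to read the file in chunks with `pd.read_csv(..., chunksize=...)`, keep only the columns needed, and aggregate each chunk before combining. The `station_complex` and `ridership` column names, the helper name `hourly_totals_in_chunks`, and the in-memory sample are hypothetical.

```python
import io
import pandas as pd

def hourly_totals_in_chunks(source, chunksize=2):
    """Sum ridership per timestamp without holding the full file in memory."""
    partials = []
    reader = pd.read_csv(
        source,
        usecols=["transit_timestamp", "ridership"],  # skip unused columns
        dtype={"ridership": "int32"},                # smaller than default int64
        parse_dates=["transit_timestamp"],
        chunksize=chunksize,
    )
    for chunk in reader:
        partials.append(chunk.groupby("transit_timestamp")["ridership"].sum())
    # The same timestamp can appear in several chunks, so sum once more
    return pd.concat(partials).groupby(level=0).sum()

# Synthetic stand-in for the real file (column names are assumed)
csv_text = """transit_timestamp,station_complex,ridership
2024-01-01 08:00:00,Station A,100
2024-01-01 08:00:00,Station B,50
2024-01-01 09:00:00,Station A,70
2024-01-01 09:00:00,Station B,30
2024-01-01 09:00:00,Station C,10
"""
totals = hourly_totals_in_chunks(io.StringIO(csv_text))
print(totals)
```

With the real file, `source` would be `file_path` and `chunksize` would be much larger (for example, one million rows), chosen to fit the available memory.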

```python
# Display the dataset size and columns
print(f"Dataset size: {df.shape[0]} rows and {df.shape[1]} columns")
```


Once cleaned, we dove into the data like rush hour at Grand Central.

* **Size**: Roughly **110 million rows** (the length reported when loading the CSV) across **500+ unique stations**, covering **2020–2024**.
* **Granularity**: Hourly turnstile counts, letting us track **hour-by-hour shifts** in city behaviour.
* **Coverage**: Every borough, every station, every day—including **lockdowns**, **storms**, **holidays**, and **historic protests**.

From our exploratory data analysis, some fascinating patterns emerged:

* **Weekday Peaks**: Predictable morning (7–9 AM) and evening (5–7 PM) rushes.
* **Pandemic Drop**: A sharp fall in ridership during spring 2020, bottoming out in April.
* **Event Spikes**: Localised surges during marathons, parades, and holidays—especially at stations like **Times Sq–42 St**, **Columbus Circle**, and **Atlantic Av–Barclays Ctr**.

We also created:

* Ridership histograms by hour and weekday
* Heatmaps of station-level usage on special dates
* Bar charts showing drops tied to major disruptions (e.g., signal failures, snowstorms)

This stage let us frame our story: not as a static chart, but as a moving timeline—one tap at a time.
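One way to quantify an event spike of the kind listed above is to compare a station's daily total on the event date with a baseline taken from the same weekday in other weeks. The sketch below uses made-up daily totals for a single station; the helper name `event_ratio` and the numbers are assumptions, not the code or results behind our plots.

```python
import pandas as pd

def event_ratio(daily, event_date):
    """Ratio of ridership on event_date to the median of the same weekday on other dates."""
    event_date = pd.Timestamp(event_date)
    same_weekday = daily[daily.index.dayofweek == event_date.dayofweek]
    baseline = same_weekday.drop(event_date).median()
    return daily.loc[event_date] / baseline

# Synthetic daily totals for one station on four consecutive Sundays
daily = pd.Series(
    [1000, 1100, 900, 3000],
    index=pd.to_datetime(["2023-10-15", "2023-10-22", "2023-10-29", "2023-11-05"]),
)
ratio = event_ratio(daily, "2023-11-05")
print(f"Event-day ridership is {ratio:.1f}x the same-weekday median")
```

Using the median of matching weekdays keeps the baseline robust to one unusually busy or quiet week.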


#### **Data Cleaning**

**Which columns can we remove?**


**Which rows should we remove?**

- *Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.*


# Data Analysis

- *Describe your data analysis and explain what you've learned about the dataset.*
- *If relevant, talk about your machine-learning.*

##### **DATA ANALYSIS IS CARRIED OUT AND REFERENCED IN SEPARATE NOTEBOOKS. ONLY SELECTED PLOTS ARE SHOWN HERE**.

The data analysis plots live in separate notebooks because they are too large to fit in one. Only selected plots are shown here, to highlight the most important findings.

We know that you asked for the explainer notebook to contain all the code behind this project, but we believe that our current folder structure, with hyperlinks to the respective files, is more manageable and clearer for the people grading it.

### Temporal Data Analysis

### Bar Plots

### Line And Polar Plots

These plots illustrate that there are significant temporal changes across all 

# Genre


- *Which genre of data story did you use?*

We chose the Magazine Style genre for our data story—because the New York City subway is not just a system, it is a living narrative. Events like New Year’s Eve crowd surges, the NYC Marathon, and even signal failures at Penn Station do not just show up as stats—they tell stories.

Magazine Style allowed us to unfold these moments with depth and pacing. Readers can scroll at their own rhythm, pausing to explore interactive visuals or glide through snapshots of the city’s highs and lows. It is a genre designed for storytelling, reflection, and exploration—exactly how we wanted our audience to engage with the MTA’s ridership data.

- *Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?*

According to Segel & Heer’s Figure 7, we used tools from all three categories of Visual Narrative to build a balanced and compelling experience:

1. Annotations

* We used headlines, subheadings, and descriptive figure captions to guide readers through the data.
* For example, "New Year’s Eve – The Ultimate Stress Test" and "When Protests Shut Down Stations" are not just labels—they are entry points into the story. These annotations provided context and helped readers quickly grasp the core insight behind each chart.

2. Colour and Visual Encodings

* We kept a consistent colour palette across maps, line plots, and bar charts to maintain visual harmony.
* Important data points (like surges or dips) were highlighted to stand out, ensuring quick readability even at a glance.

3. Interactive Visualisations

* Our Bokeh plots enabled users to explore data on their own terms. For instance, readers can hover over stations or filter by date to see how a protest in June 2020 or a snowstorm in January 2022 changed the flow.
* This gave our story a sense of discoverability and allowed readers to become part of the analysis.



- *Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?*

We also strategically used elements from all three categories of Narrative Structure:

  1. Ordering

* Our story is segmented into clear acts (events, incidents, and the unseen patterns), each serving as a chapter.
* This modular layout mirrors how magazine stories unfold, giving structure while allowing readers to jump between sections.

  2. Interactivity

* While much of the narrative is guided, interactive Bokeh charts give room for exploration and hypothesis testing. Want to see how 42 St ridership dropped after a crime? Click and find out.
* This blend of narrative + interaction lets users engage more deeply than with static plots alone.

  3. Messaging

* Each section delivers a clear takeaway: "New Yorkers do not stop moving," "Disruptions ripple far beyond the station," or "The subway mirrors the city."
* These are not just data points—they are messages framed through evidence.


Together, these visual and structural elements brought the dataset to life—not just as rows of numbers, but as a portrait of a city in motion.



# Visualisation

- *Explain the visualizations you've chosen.*

- *Why are they right for the story you want to tell?*

# Discussion

- *What went well?*

The biggest win was our ability to translate raw ridership data into a human story. By anchoring our visuals around major New York City events—like New Year’s Eve in Times Square or signal failures at Penn Station—we made the subway system feel alive and reactive, not just mechanical.

Our decision to use the Magazine Style genre turned out to be effective. It allowed us to structure the story into individual, digestible acts, each with its own visuals, tone, and pacing, with fewer constraints on text length. The additional visual real estate let us develop high-quality visualisations, especially the interactive Bokeh plots, which let users explore the data for themselves by zooming in, filtering stations, and tracing personal narratives through the system. As a result, we were able to create a more engaging and interactive data story in our final project.

- *What is still missing? What could be improved? Why?*

We would have loved to include real-time video or news tweet overlays to connect ridership shifts to breaking moments as they happened, for example by overlaying tweets from BLM protests with station shutdowns. But pulling in those external datasets at scale (and legally) proved too time-consuming for this scope.

Another challenge was granularity. While our hourly data was rich, we occasionally lacked the contextual information (e.g., exact event locations, train outages by line) that would have sharpened our causality claims. Some of our stories—like "The Subway Avoidance Effect"—would benefit from integrating official MTA service alerts or NYPD incident logs in future work.

Lastly, accessibility could be improved: adding alt text for interactive visualisations and refining our colour scheme for better contrast would make the story more inclusive.
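As a possible starting point for the contrast check, the WCAG 2.x contrast ratio between two colours can be computed from their relative luminance. The sketch below is a plain-Python illustration: the helper names and the example colours are placeholders we chose, not values from our palette. WCAG AA asks for a ratio of at least 4.5:1 for normal-size text.

```python
def relative_luminance(hex_colour):
    """Relative luminance of an sRGB colour given as '#rrggbb' (WCAG 2.x definition)."""
    hex_colour = hex_colour.lstrip("#")
    channels = [int(hex_colour[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding channel by channel
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(foreground, background):
    """Contrast ratio between two colours, from 1.0 (identical) up to 21.0 (black on white)."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)

# Placeholder colours: black text on a white background
print(round(contrast_ratio("#000000", "#ffffff"), 1))
```

Running each text and background pair from the palette through `contrast_ratio` would flag the combinations that fall below the 4.5:1 threshold.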

# Contributions

*You should write (just briefly) which group member was the main responsible for which elements of the assignment.* *(I want you guys to understand every part of the assignment, but usually there is someone who took lead role on certain portions of the work.* *That's what you should explain).*
*It is not OK simply to write "All group members contributed equally".*

Each of the 3 students has made an equal contribution to this project and each of us has helped the other and understands each component of the final outcome.

However, in accordance with DTU requirements, the following is an outline of each student's main responsibilities:

* **Clara Mejlhede Lorenzen (s180350)** led the data wrangling and preprocessing, including converting timestamps and structuring the data by date and hour. Clara was also responsible for the GitHub Pages site, ensuring that all elements were properly linked and displayed, and was in charge of implementing the feedback from Assignment 2. She also contributed to the narrative structure and copy, especially in Act II (Incident reports) and the discussion section.

* **Pol Triquell Lombardo (s243271)** was responsible for the narrative structure and storyboarding. He sketched the key scenes for Act I (New Year’s Eve, Marathon Sunday, protests, and incidents) and wrote most of the copy (the explainer notebook and the story), including the headlines and figure captions. He also contributed to the data visualisation, creating the interactive plots for Act I and embedding visuals into the GitHub Pages site.

* **Marie Sophie Mudge Woods (s194384)** took charge of 

We met regularly to review, critique, and revise each section together, ensuring everyone understood the full scope of the story—from data wrangling to narrative arc.