# Using CSP to analyze MTA data

The NYC Metropolitan Transportation Authority provides an [API for developers](https://api.mta.info), and we'll explore the MTA's realtime [GTFS-rt](https://developers.google.com/transit/gtfs-realtime) transportation data feed.

To work with the GTFS-rt data, we'll use the [nyct-gtfs](https://pypi.org/project/nyct-gtfs/) library, which can be installed from PyPI with:

```
pip install nyct-gtfs
```

The MTA feed can be inspected as follows:

In [1]:
from nyct_gtfs import NYCTFeed

# Load the realtime feed from the MTA site for lines 1-7 and S
# (note that the api_key argument is required, but can be empty)
feed = NYCTFeed("https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/nyct%2Fgtfs", api_key="")

> **Note:** The official documentation for the MTA API is at https://new.mta.info/developers, including instructions on how to access the feeds for other lines. Depending on your geographical location, this website may not be accessible.

Let's explore the data first. `feed` is a `nyct_gtfs.feed.NYCTFeed` instance, and its most important methods and properties are the following:

In [2]:
feed.refresh?

Signature: feed.refresh()
Docstring: Reload this object's feed information from the MTA API
File:      ~/micromamba/envs/csp-dev/lib/python3.11/site-packages/nyct_gtfs/feed.py
Type:      method

In [3]:
feed.filter_trips?

Signature:
feed.filter_trips(
    line_id=None,
    travel_direction=None,
    train_assigned=None,
    underway=None,
    shape_id=None,
    headed_for_stop_id=None,
    updated_after=None,
    has_delay_alert=None,
)
Docstring:
Get the list of subway trips from the GTFS-realtime feed, optionally filtering based on one or more parameters.

If more than one filter is specifi

In [4]:
feed.trips?

Type:        property
String form: <property object at 0x7fc737eb8f40>
Docstring:   Get the list of subway trips from the GTFS-realtime feed. Returns a list of `Trip` objects

For simplicity, we will only collect information about the 1, 2 and 3 trains. We start with trains that are underway and headed for 34 St-Penn Station (identified by the stop IDs 128S and 128N):

In [5]:
# This cell can be run multiple times, and data will be refreshed every 30s
feed.refresh()
trains = feed.filter_trips(underway=True, headed_for_stop_id=['128S', '128N'])
trains

[{"110450_1..S03R", STOPPED_AT 127S @16:01:09},
 {"110900_1..S03R", INCOMING_AT 125S @16:00:45},
 {"111150_1..S03R", INCOMING_AT 122S @16:01:03},
 {"111550_1..S03R", IN_TRANSIT_TO 119S @16:01:06},
 {"112000_1..S03R", IN_TRANSIT_TO 116S @16:01:09},
 {"112600_1..N03R", STOPPED_AT 128N @16:00:58},
 {"112700_1..S03R", STOPPED_AT 113S @16:01:09},
 {"112950_1..S03R", STOPPED_AT 111S @16:00:57},
 {"113000_1..N03R", STOPPED_AT 132N @16:00:29},
 {"113400_1..S03R", IN_TRANSIT_TO 110S @16:01:05},
 {"113400_1..N03R", IN_TRANSIT_TO 134N @16:01:09},
 {"113700_1..S03R", STOPPED_AT 106S @16:00:29},
 {"113800_1..N03R", IN_TRANSIT_TO 137N @16:01:05},
 {"113950_1..S03R", STOPPED_AT 104S @16:01:07},
 {"109550_2..S01R", IN_TRANSIT_TO 120S @16:01:10},
 {"110150_2..S01R", IN_TRANSIT_TO 227S @16:01:08},
 {"110400_2..N01R", IN_TRANSIT_TO 128N @16:01:10},
 {"110950_2..S01R", STOPPED_AT 220S @16:00:35},
 {"111000_2..N01R", STOPPED_AT 228N @16:00:29},
 {"111450_2..S01R", STOPPED_AT 217S @16:01:08},
 {"111800_2..N
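
The stop IDs in this output combine a station ID with a direction suffix: `128S` and `128N` are the southbound and northbound platforms of the same station. The helper below is a minimal sketch, assuming every directional stop ID ends in `N` or `S`. The names `split_stop_id` and `platform_ids` are hypothetical and not part of nyct-gtfs.

```python
def split_stop_id(stop_id):
    """Split a stop ID such as '128S' into ('128', 'S').

    Assumes the direction is encoded as a trailing 'N' or 'S';
    IDs without such a suffix are returned with direction None.
    """
    if len(stop_id) > 1 and stop_id[-1] in ("N", "S"):
        return stop_id[:-1], stop_id[-1]
    return stop_id, None


def platform_ids(station_id):
    """Build both directional stop IDs for a station, e.g. '128' -> ['128S', '128N']."""
    return [f"{station_id}S", f"{station_id}N"]


print(split_stop_id("128S"))  # ('128', 'S')
print(platform_ids("128"))    # ['128S', '128N']
```

A list built this way could be passed as the `headed_for_stop_id` argument of `feed.filter_trips`, in the same form as the hard-coded list used in this cell.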

We can also show this data in a human-readable way.

In [6]:
for train in trains:
    print(train)

Southbound 1 to South Ferry, departed origin 18:24:30, Currently STOPPED_AT Times Sq-42 St, last update at 16:01:09
Southbound 1 to South Ferry, departed origin 18:29:00, Currently INCOMING_AT 59 St-Columbus Circle, last update at 16:00:45
Southbound 1 to South Ferry, departed origin 18:31:30, Currently INCOMING_AT 79 St, last update at 16:01:03
Southbound 1 to South Ferry, departed origin 18:35:30, Currently IN_TRANSIT_TO 103 St, last update at 16:01:06
Southbound 1 to South Ferry, departed origin 18:40:00, Currently IN_TRANSIT_TO 125 St, last update at 16:01:09
Northbound 1 to Van Cortlandt Park-242 St, departed origin 18:46:00, Currently STOPPED_AT 34 St-Penn Station, last update at 16:00:58
Southbound 1 to South Ferry, departed origin 18:47:00, Currently STOPPED_AT 157 St, last update at 16:01:09
Southbound 1 to South Ferry, departed origin 18:49:30, Currently STOPPED_AT 181 St, last update at 16:00:57
Northbound 1 to Van Cortlandt Park-242 St, departed origin 18:50:00, Currently S

We can now list the times at which these trains are expected to arrive at 34 St-Penn Station.

In [7]:
trains_at_penn = []
print("Station | Line | Direction | Arrival time")
for train in trains:
    for update in train.stop_time_updates:
        if update.stop_id in ['128S', '128N']:
            print(f"{update.stop_name} | {train.route_id} | {train.headsign_text} | {update.arrival}")
            trains_at_penn.append((train, update))

Station | Line | Direction | Arrival time
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:03:09
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:06:55
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:12:03
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:17:06
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:21:22
34 St-Penn Station | 1 | Van Cortlandt Park-242 St | 2024-06-06 16:01:28
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:25:09
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:28:27
34 St-Penn Station | 1 | Van Cortlandt Park-242 St | 2024-06-06 16:04:59
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:31:05
34 St-Penn Station | 1 | Van Cortlandt Park-242 St | 2024-06-06 16:09:43
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:34:59
34 St-Penn Station | 1 | Van Cortlandt Park-242 St | 2024-06-06 16:13:05
34 St-Penn Station | 1 | South Ferry | 2024-06-06 16:37:07
34 St-Penn Station | 2 | Flatbush Av-Brooklyn College | 2024-06-06 16:10:49
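
The rows above follow the order of the trips in the feed, not the order of arrival. As a rough sketch of how they could be turned into a "next trains" list, the function below drops arrivals that are already in the past and sorts the rest. The name `upcoming_arrivals` is hypothetical, the sample rows are copied from the output above, and the value of `now` is chosen arbitrarily for illustration.

```python
from datetime import datetime

# (headsign, arrival) pairs copied from the output above
SAMPLE_ROWS = [
    ("South Ferry", datetime(2024, 6, 6, 16, 3, 9)),
    ("South Ferry", datetime(2024, 6, 6, 16, 6, 55)),
    ("Van Cortlandt Park-242 St", datetime(2024, 6, 6, 16, 1, 28)),
    ("Van Cortlandt Park-242 St", datetime(2024, 6, 6, 16, 4, 59)),
]


def upcoming_arrivals(rows, now):
    """Return (headsign, arrival, seconds_away) tuples, soonest first.

    Arrivals earlier than `now` are dropped.
    """
    upcoming = [
        (headsign, arrival, (arrival - now).total_seconds())
        for headsign, arrival in rows
        if arrival >= now
    ]
    return sorted(upcoming, key=lambda row: row[1])


now = datetime(2024, 6, 6, 16, 2, 0)  # arbitrary reference time
for headsign, arrival, seconds_away in upcoming_arrivals(SAMPLE_ROWS, now):
    print(f"{headsign} | {arrival} | in {seconds_away:.0f}s")
```

With real data, `rows` could be built from the `(train, update)` pairs collected in `trains_at_penn`, using `train.headsign_text` and `update.arrival`.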


---

## Using CSP to ingest and analyze the data

When using CSP to ingest and analyze this data, we start with a graph representing the operations we want to perform. [CSP Graphs](https://github.com/Point72/csp/wiki/CSP-Graph) are composed of some number of "input" adapters, a set of connected calculation "nodes", and "output" adapters to which the results are finally sent. For simplicity, we'll build a graph that shows trains passing through 34 St-Penn Station.

There are two types of [Input Adapters](https://github.com/Point72/csp/wiki/5.-Adapters): Historical (aka Simulated) adapters and Realtime adapters. Historical adapters are used to feed historical timeseries data into the graph. Realtime adapters are used to feed in live event data, generally created from external sources on separate threads.

As you may have guessed, in our case we need to use a [Realtime adapter](https://github.com/Point72/csp/wiki/Write-Realtime-Input-Adapters), which will ingest the data and periodically refresh it.

In CSP terminology, a single adapter corresponds to a single timeseries edge in the graph. When writing realtime adapters, you will need to implement a "push" adapter, which gets data from a separate thread that drives external events and "pushes" them into the engine as they occur. For this, [we will need "graph building time" and "runtime" versions of our adapter](https://github.com/Point72/csp/wiki/Write-Realtime-Input-Adapters#pushinputadapter---python).
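
The push pattern itself can be sketched without CSP, using only the standard library. In the toy example below, a background thread produces values and pushes them into a queue that plays the role of the engine. `ToyPushSource` and `collect` are hypothetical names, and this is not how CSP implements its adapters; it only illustrates the start/stop/push life cycle.

```python
import queue
import threading


class ToyPushSource:
    """Hypothetical stand-in for a push adapter; not CSP's implementation."""

    def __init__(self, values, interval_s, sink):
        self._values = values
        self._interval_s = interval_s
        self._sink = sink  # plays the role of the engine receiving ticks
        self._stop_event = threading.Event()
        self._thread = None

    def start(self):
        # Drive the external events from a separate thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()

    def _run(self):
        for value in self._values:
            if self._stop_event.is_set():
                break
            self._sink.put(value)  # "push" the value as soon as it is available
            # wait() returns early if stop() is called during the pause
            self._stop_event.wait(self._interval_s)


def collect(values, interval_s=0.01):
    """Start the source, read every pushed value, then stop it."""
    sink = queue.Queue()
    source = ToyPushSource(values, interval_s, sink)
    source.start()
    received = [sink.get(timeout=1.0) for _ in values]
    source.stop()
    return received


print(collect(["a", "b", "c"]))  # ['a', 'b', 'c']
```

Waiting on a `threading.Event` instead of calling `time.sleep` is a design choice: it lets `stop()` interrupt the pause between refreshes, so shutdown does not have to wait for a full interval.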

> Once the graph is constructed, `csp.graph` code is no longer needed. Once the
> graph is run, only inputs, `csp.nodes` and outputs will be active as data flows
> through the graph, driven by input ticks.

In our case, "ticks" correspond to feed refreshes, and we'll observe this data being updated every 30s. We will read 3 minutes of data for the purposes of this demonstration.

In [8]:
import csp
from csp.impl.pushadapter import PushInputAdapter
from csp.impl.wiring import py_push_adapter_def

import nyct_gtfs

import time
import threading
from datetime import datetime, timedelta


class Event(csp.Struct):
    train: nyct_gtfs.trip.Trip
    update: nyct_gtfs.stop_time_update.StopTimeUpdate
    arrival: datetime
    direction: str

# Create a runtime implementation of the adapter
class FetchTrainDataAdapter(PushInputAdapter):
    def __init__(self, interval, stations):
        self._interval = interval
        self._thread = None
        self._running = False
        self._stations = stations

    def start(self, starttime, endtime):
        print("FetchTrainDataAdapter::start")
        self._running = True
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def stop(self):
        print("FetchTrainDataAdapter::stop")
        if self._running:
            self._running = False
            self._thread.join()

    def _run(self):
        # This is where we will read and process the real-time data feed
        feed = nyct_gtfs.NYCTFeed("https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/nyct%2Fgtfs", api_key="")

        while self._running:
            print("----------------------------------------------")
            print(f"{datetime.utcnow()}: refreshing MTA feed")
            print("----------------------------------------------")
            print("                                     Station     | Line | Direction   | Arrival time")
            feed.refresh()
            trains = feed.filter_trips(underway=True, headed_for_stop_id=self._stations)
            # tick whenever feed is refreshed
            for train in trains:
                for update in train.stop_time_updates:
                    if update.stop_id in self._stations:
                        self.push_tick(Event(train=train, update=update, direction=train.direction, arrival=update.arrival))
            time.sleep(self._interval.total_seconds())

# Create the graph-time representation of our adapter
FetchTrainData = py_push_adapter_def("FetchTrainData", FetchTrainDataAdapter, csp.ts[Event], interval=timedelta, stations=list)

@csp.node
def pretty_print(train_data: csp.ts[Event], count: csp.ts[float]) -> csp.ts[str]:
    message = f" {train_data.update.stop_name} |   {train_data.train.route_id}  | {train_data.train.headsign_text} | {train_data.update.arrival} | Southbound train count: {int(count)}"
    return message

@csp.graph
def mta_graph():
    print("Start of graph building")
    stations = ['128S', '128N']
    interval = timedelta(seconds=30)
    trains_at_penn = FetchTrainData(interval, stations=stations)
    # trains_at_penn is an edge that can be processed through a node.
    # Select all southbound trains going through Penn Station
    south_trains = csp.filter(trains_at_penn.direction == "S", trains_at_penn)
    # Convert timestamps to unique float values
    timestamp = csp.apply(south_trains.arrival, datetime.timestamp, float)
    # Count the number of unique entries in this timeseries block (reset every 30 seconds)
    count = csp.stats.count(csp.stats.unique(timestamp), interval=timedelta(seconds=30), min_window=timedelta(seconds=1))
    result = pretty_print(trains_at_penn, count)
    csp.print(":", result)
    print("End of graph building")

start = datetime.utcnow()
end = start + timedelta(minutes=3)
csp.run(mta_graph, starttime=start, realtime=True, endtime=end)
print("Done.")

Start of graph building
End of graph building
FetchTrainDataAdapter::start
----------------------------------------------
2024-06-06 23:01:21.452959: refreshing MTA feed
----------------------------------------------
                                     Station     | Line | Direction   | Arrival time
2024-06-06 23:01:21.990072 :: 34 St-Penn Station |   1  | South Ferry | 2024-06-06 16:03:09 | Southbound train count: 1
2024-06-06 23:01:21.990361 :: 34 St-Penn Station |   1  | South Ferry | 2024-06-06 16:06:55 | Southbound train count: 2
2024-06-06 23:01:21.990415 :: 34 St-Penn Station |   1  | South Ferry | 2024-06-06 16:12:03 | Southbound train count: 3
2024-06-06 23:01:21.990812 :: 34 St-Penn Station |   1  | South Ferry | 2024-06-06 16:17:06 | Southbound train count: 4
2024-06-06 23:01:21.991054 :: 34 St-Penn Station |   1  | South Ferry | 2024-06-06 16:21:22 | Southbound train count: 5
2024-06-06 23:01:21.991102 :: 34 St-Penn Station |   1  | Van Cortlandt Park-242 St | 2024-06-06 1
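
As a plain-Python cross-check of the "Southbound train count" column, the sketch below keeps a running count of distinct southbound arrival times and attaches the latest count to every event, including northbound ones. This is an assumption about the intent of the column within a single refresh, not a description of how `csp.stats` computes it, and `running_southbound_count` is a hypothetical name.

```python
from datetime import datetime


def running_southbound_count(events):
    """Pair each (direction, arrival) event with the number of distinct
    southbound arrival times seen so far."""
    seen = set()
    counted = []
    for direction, arrival in events:
        if direction == "S":
            seen.add(arrival)
        counted.append(((direction, arrival), len(seen)))
    return counted


# Arrival times copied from the output above
events = [
    ("S", datetime(2024, 6, 6, 16, 3, 9)),
    ("S", datetime(2024, 6, 6, 16, 6, 55)),
    ("N", datetime(2024, 6, 6, 16, 1, 28)),
    ("S", datetime(2024, 6, 6, 16, 3, 9)),  # repeated arrival time, not counted twice
]
for (direction, arrival), count in running_southbound_count(events):
    print(f"{direction} | {arrival} | Southbound train count: {count}")
```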

---

### References

* https://erikbern.com/2016/04/04/nyc-subway-math
* https://erikbern.com/2016/07/09/waiting-time-math.html
* https://pypi.org/project/nyct-gtfs/
* https://api.mta.info/#/landing
* https://developers.google.com/transit/gtfs-realtime
* https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_03_pushinput.py
* https://github.com/Point72/csp/wiki/5.-Adapters#realtime-adapters