# Using CSP to analyze MTA data

The NYC Metropolitan Transportation Authority (MTA) provides an [API for developers](https://api.mta.info). In this example, we'll explore the MTA's realtime [GTFS-rt](https://developers.google.com/transit/gtfs-realtime) transit data feed.

To work with the GTFS-rt data, we'll use the [nyct-gtfs](https://pypi.org/project/nyct-gtfs/) library, which can be installed from PyPI:

```
pip install nyct-gtfs
```

The MTA feed can be inspected as follows:

In [1]:
from nyct_gtfs import NYCTFeed

# Load the realtime feed from the MTA site for lines 1-7 and S
# (note that the api_key argument is required, but can be empty)
feed = NYCTFeed("https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/nyct%2Fgtfs", api_key="")

> **Note:** There is a website with official documentation for the MTA API at https://new.mta.info/developers, including instructions on how to access the feed for other lines. However, you may not have access to this website depending on your geographical location. 

Let's explore the data first. `feed` is a `nyct_gtfs.feed.NYCTFeed` instance, and its most useful members are the `refresh` and `filter_trips` methods and the `trips` property:

In [2]:
feed.refresh?

Signature: feed.refresh()
Docstring: Reload this object's feed information from the MTA API
File:      ~/micromamba/envs/csp/lib/python3.11/site-packages/nyct_gtfs/feed.py
Type:      method

In [3]:
feed.filter_trips?

Signature:
feed.filter_trips(
    line_id=None,
    travel_direction=None,
    train_assigned=None,
    underway=None,
    shape_id=None,
    headed_for_stop_id=None,
    updated_after=None,
    has_delay_alert=None,
)
Docstring:
Get the list of subway trips from the GTFS-realtime feed, optionally filtering based on one or more parameters.

If more than one filter is specifi

In [4]:
feed.trips?

Type:        property
String form: <property object at 0x7f1f48e11350>
Docstring:   Get the list of subway trips from the GTFS-realtime feed. Returns a list of `Trip` objects

In our case, we will (for simplicity) filter the trips to collect information only about the 1, 2 and 3 trains, and we will start with trains going through 34 St-Penn Station (identified with stop IDs 128S or 128N):

In [5]:
# This cell can be run multiple times, and data will be refreshed every 30s
feed.refresh()
trains = feed.filter_trips(underway=True, headed_for_stop_id=['128S', '128N'])
trains

[{"080550_1..S03R", INCOMING_AT 128S @15:07:07},
 {"081150_1..S03R", STOPPED_AT 125S @15:07:02},
 {"081750_1..S03R", STOPPED_AT 121S @15:06:50},
 {"082350_1..S03R", IN_TRANSIT_TO 117S @15:07:10},
 {"082950_1..S03R", IN_TRANSIT_TO 114S @15:07:07},
 {"083550_1..S03R", IN_TRANSIT_TO 112S @15:07:08},
 {"083750_1..N03R", STOPPED_AT 133N @15:06:34},
 {"084150_1..S03R", STOPPED_AT 107S @15:06:50},
 {"084350_1..N03R", STOPPED_AT 138N @15:07:05},
 {"080550_2..S01X013", STOPPED_AT 120S @15:06:17},
 {"081350_2..S10X008", STOPPED_AT 127S @15:06:32},
 {"081550_2..N01R", IN_TRANSIT_TO 132N @15:07:08},
 {"082150_2..S01X013", STOPPED_AT 220S @15:06:22},
 {"082350_2..N01R", IN_TRANSIT_TO 230N @15:07:07},
 {"082950_2..S10X008", IN_TRANSIT_TO 224S @15:07:00},
 {"083150_2..N01R", INCOMING_AT 236N @15:07:03},
 {"083750_2..S01X013", STOPPED_AT 213S @15:06:07},
 {"083950_2..N01R", IN_TRANSIT_TO 241N @15:07:08},
 {"084550_2..S10X008", STOPPED_AT 214S @15:06:48},
 {"080750_3..N01R", STOPPED_AT 137N @15:06:41},

We can also show this data in a human-readable way.

In [6]:
for train in trains:
    print(train)

Southbound 1 to South Ferry, departed origin 13:25:30, Currently INCOMING_AT 34 St-Penn Station, last update at 15:07:07
Southbound 1 to South Ferry, departed origin 13:31:30, Currently STOPPED_AT 59 St-Columbus Circle, last update at 15:07:02
Southbound 1 to South Ferry, departed origin 13:37:30, Currently STOPPED_AT 86 St, last update at 15:06:50
Southbound 1 to South Ferry, departed origin 13:43:30, Currently IN_TRANSIT_TO 116 St-Columbia University, last update at 15:07:10
Southbound 1 to South Ferry, departed origin 13:49:30, Currently IN_TRANSIT_TO 145 St, last update at 15:07:07
Southbound 1 to South Ferry, departed origin 13:55:30, Currently IN_TRANSIT_TO 168 St-Washington Hts, last update at 15:07:08
Northbound 1 to Van Cortlandt Park-242 St, departed origin 13:57:30, Currently STOPPED_AT Christopher St-Sheridan Sq, last update at 15:06:34
Southbound 1 to South Ferry, departed origin 14:01:30, Currently STOPPED_AT 215 St, last update at 15:06:50
Northbound 1 to Van Cortlandt P

We can now check the times at which trains will pass through 34 St-Penn Station.

In [7]:
trains_at_penn = []
print("Station | Line | Direction | Arrival time")
for train in trains:
    for update in train.stop_time_updates:
        if update.stop_id in ['128S', '128N']:
            print(f"{update.stop_name} | {train.route_id} | {train.headsign_text} | {update.arrival}")
            trains_at_penn.append((train, update))

Station | Line | Direction | Arrival time
34 St-Penn Station | 1 | South Ferry | 2024-05-24 15:08:16
34 St-Penn Station | 1 | South Ferry | 2024-05-24 15:12:32
34 St-Penn Station | 1 | South Ferry | 2024-05-24 15:17:50
34 St-Penn Station | 1 | South Ferry | 2024-05-24 15:25:41
34 St-Penn Station | 1 | South Ferry | 2024-05-24 15:31:40
34 St-Penn Station | 1 | South Ferry | 2024-05-24 15:36:28
34 St-Penn Station | 1 | Van Cortlandt Park-242 St | 2024-05-24 15:12:34
34 St-Penn Station | 1 | South Ferry | 2024-05-24 15:43:20
34 St-Penn Station | 1 | Van Cortlandt Park-242 St | 2024-05-24 15:19:35
34 St-Penn Station | 2 | Flatbush Av-Brooklyn College | 2024-05-24 15:14:47
34 St-Penn Station | 2 | Flatbush Av-Brooklyn College | 2024-05-24 15:08:02
34 St-Penn Station | 2 | Wakefield-241 St | 2024-05-24 15:11:37
34 St-Penn Station | 2 | Flatbush Av-Brooklyn College | 2024-05-24 15:34:22
34 St-Penn Station | 2 | Wakefield-241 St | 2024-05-24 15:22:07
34 St-Penn Station | 2 | Flatbush Av-Brookl
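
Since each arrival is printed as a full timestamp, the list can be turned into "minutes until the next train" with a little arithmetic. The following is a minimal sketch using only the standard library. The sample rows are copied from the output above, `minutes_until` is a hypothetical helper (not part of `nyct-gtfs`), and the code assumes that arrival values are `datetime` objects, as the printed format suggests.

```python
from datetime import datetime

# Sample (line, direction, arrival) rows copied from the output above.
arrivals = [
    ("1", "South Ferry", datetime(2024, 5, 24, 15, 8, 16)),
    ("1", "South Ferry", datetime(2024, 5, 24, 15, 12, 32)),
    ("1", "Van Cortlandt Park-242 St", datetime(2024, 5, 24, 15, 12, 34)),
    ("2", "Flatbush Av-Brooklyn College", datetime(2024, 5, 24, 15, 8, 2)),
    ("2", "Wakefield-241 St", datetime(2024, 5, 24, 15, 11, 37)),
]


def minutes_until(rows, now):
    """Return (minutes, line, direction) tuples sorted by wait, skipping past arrivals."""
    upcoming = [
        ((arrival - now).total_seconds() / 60, line, direction)
        for line, direction, arrival in rows
        if arrival >= now
    ]
    return sorted(upcoming)


# "now" is set to the last feed update time seen earlier (15:07:07).
now = datetime(2024, 5, 24, 15, 7, 7)
waits = minutes_until(arrivals, now)
for minutes, line, direction in waits:
    print(f"{line} to {direction}: {minutes:.1f} min")
```

With live data, the same helper could be fed tuples built from `train.route_id`, `train.headsign_text` and `update.arrival`, the attributes used in the cell above.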

---

## Using CSP to ingest and analyze the data

When using CSP to ingest and analyze this data, we start with a graph representing the operations we want to perform. [CSP Graphs](https://github.com/Point72/csp/wiki/CSP-Graph) are composed of some number of "input" adapters, a set of connected calculation "nodes", and "output" adapters that receive the results at the end. For simplicity, we'll build a graph that shows trains passing through 34 St-Penn Station.

There are two types of [Input Adapters](https://github.com/Point72/csp/wiki/5.-Adapters): Historical (aka Simulated) adapters and Realtime Adapters. Historical adapters are used to feed in historical timeseries data into the graph. Realtime Adapters are used to feed in live event data, generally created from external sources on separate threads.

As you may have guessed, in our case we need to use a [Realtime adapter](https://github.com/Point72/csp/wiki/Write-Realtime-Input-Adapters), which will ingest the data and periodically refresh it.

In CSP terminology, a single adapter corresponds to a single timeseries edge in the graph. A realtime adapter is implemented as a "push" adapter: it receives data from a separate thread that drives external events and "pushes" them into the engine as they occur. For this, [we will need both a "graph building time" and a "runtime" version of the adapter](https://github.com/Point72/csp/wiki/Write-Realtime-Input-Adapters#pushinputadapter---python).
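
The push pattern itself does not depend on CSP. As a rough illustration, the standard-library sketch below shows the shape of such a producer thread: it polls a source at a fixed interval and hands each item to a callback, which stands in for `push_tick`. The class name `Poller` and its parameters are hypothetical. The sketch waits on a `threading.Event` rather than calling `time.sleep`, so that `stop()` returns promptly; that is a design choice of this sketch, not a CSP requirement.

```python
import threading


class Poller:
    """Call fetch() every `interval` seconds on a background thread and push results."""

    def __init__(self, fetch, push, interval):
        self._fetch = fetch        # callable returning an iterable of items
        self._push = push          # stand-in for PushInputAdapter.push_tick
        self._interval = interval  # seconds between polls
        self._stop_event = threading.Event()
        self._thread = None

    def start(self):
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        # Setting the event wakes the thread immediately, even mid-wait.
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()

    def _run(self):
        while not self._stop_event.is_set():
            for item in self._fetch():
                self._push(item)
            # wait() returns early if stop() is called in the meantime.
            self._stop_event.wait(self._interval)


received = []
poller = Poller(fetch=lambda: ["tick"], push=received.append, interval=0.01)
poller.start()
threading.Event().wait(0.1)  # let a few polls happen
poller.stop()
print(f"received {len(received)} items")
```

In an adapter, `fetch` would wrap the feed refresh and filtering, and `push` would be the adapter's own `push_tick`.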

> Once the graph is constructed, `csp.graph` code is no longer needed. Once the
> graph is run, only inputs, `csp.nodes` and outputs will be active as data flows
> through the graph, driven by input ticks.

In our case, "ticks" correspond to feed refreshes, and we'll observe this data being updated every 30s. We will read 3 minutes of data for the purposes of this demonstration.

In [8]:
import csp
from csp.impl.pushadapter import PushInputAdapter
from csp.impl.wiring import py_push_adapter_def

import nyct_gtfs

import time
import threading
from datetime import datetime, timedelta


class Event(csp.Struct):
    train: nyct_gtfs.trip.Trip
    update: nyct_gtfs.stop_time_update.StopTimeUpdate

# Create a runtime implementation of the adapter
class FetchTrainDataAdapter(PushInputAdapter):
    def __init__(self, interval, stations):
        self._interval = interval
        self._thread = None
        self._running = False
        self._stations = stations

    def start(self, starttime, endtime):
        print("FetchTrainDataAdapter::start")
        self._running = True
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def stop(self):
        print("FetchTrainDataAdapter::stop")
        if self._running:
            self._running = False
            self._thread.join()

    def _run(self):
        # This is where we will read and process the real-time data feed
        feed = nyct_gtfs.NYCTFeed("https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/nyct%2Fgtfs", api_key="")

        while self._running:
            print("----------------------------------------------")
            print(f"{datetime.utcnow()}: refreshing MTA feed")
            print("----------------------------------------------")
            print("                                     Station     | Line | Direction   | Arrival time")
            feed.refresh()
            trains = feed.filter_trips(underway=True, headed_for_stop_id=self._stations)
            # tick whenever feed is refreshed
            for train in trains:
                for update in train.stop_time_updates:
                    if update.stop_id in self._stations:
                        self.push_tick(Event(train=train, update=update))
            time.sleep(self._interval.total_seconds())

# Create the graph-time representation of our adapter
FetchTrainData = py_push_adapter_def("FetchTrainData", FetchTrainDataAdapter, csp.ts[Event], interval=timedelta, stations=list)

@csp.node
def pretty_print(train_data: csp.ts[Event]) -> csp.ts[str]:
    message = f" {train_data.update.stop_name} |   {train_data.train.route_id}  | {train_data.train.headsign_text} | {train_data.update.arrival}"
    return message

@csp.graph
def mta_graph():
    print("Start of graph building")
    stations = ['128S', '128N']
    trains_at_penn = FetchTrainData(timedelta(seconds=30), stations=stations)
    # trains_at_penn is an edge that can be processed through a node
    result = pretty_print(trains_at_penn)
    csp.print(":", result)
    print("End of graph building")

start = datetime.utcnow()
end = start + timedelta(minutes=3)
csp.run(mta_graph, starttime=start, realtime=True, endtime=end)
print("Done.")

Start of graph building
End of graph building
FetchTrainDataAdapter::start
----------------------------------------------
2024-05-24 18:07:27.096941: refreshing MTA feed
----------------------------------------------
                                     Station     | Line | Direction   | Arrival time
2024-05-24 18:07:27.778076 :: 34 St-Penn Station |   1  | South Ferry | 2024-05-24 15:08:16
2024-05-24 18:07:27.778252 :: 34 St-Penn Station |   1  | South Ferry | 2024-05-24 15:12:32
2024-05-24 18:07:27.778281 :: 34 St-Penn Station |   1  | South Ferry | 2024-05-24 15:17:50
2024-05-24 18:07:27.778300 :: 34 St-Penn Station |   1  | South Ferry | 2024-05-24 15:25:41
2024-05-24 18:07:27.778318 :: 34 St-Penn Station |   1  | South Ferry | 2024-05-24 15:31:40
2024-05-24 18:07:27.778335 :: 34 St-Penn Station |   1  | South Ferry | 2024-05-24 15:36:28
2024-05-24 18:07:27.778351 :: 34 St-Penn Station |   1  | Van Cortlandt Park-242 St | 2024-05-24 15:12:34
2024-05-24 18:07:27.778369 :: 34 St-Penn
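
Note the two time columns in this output. The engine timestamp on the left is derived from `datetime.utcnow()` and is therefore in UTC, while the arrival times on the right are about three hours behind it. A plausible reading (an assumption, not something the source states) is that arrival times are rendered in the local time zone of the machine that ran the notebook. The sketch below shows one way to put such naive timestamps on a common footing; the UTC-3 offset and the helper name `utc_to_local` are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumed offset of the notebook machine, inferred from the output above.
LOCAL_TZ = timezone(timedelta(hours=-3))


def utc_to_local(naive_utc, tz=LOCAL_TZ):
    """Interpret a naive datetime as UTC and return it as a naive datetime in tz."""
    aware = naive_utc.replace(tzinfo=timezone.utc)
    return aware.astimezone(tz).replace(tzinfo=None)


engine_time = datetime(2024, 5, 24, 18, 7, 27)  # engine timestamp (UTC)
arrival = datetime(2024, 5, 24, 15, 8, 16)      # first arrival shown above

wait = arrival - utc_to_local(engine_time)
print(f"Next train in {wait.total_seconds():.0f} s")
```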

---

### References

* https://erikbern.com/2016/04/04/nyc-subway-math
* https://erikbern.com/2016/07/09/waiting-time-math.html
* https://pypi.org/project/nyct-gtfs/
* https://api.mta.info/#/landing
* https://developers.google.com/transit/gtfs-realtime
* https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_03_pushinput.py
* https://github.com/Point72/csp/wiki/5.-Adapters#realtime-adapters