# About

With much of the code written, it's time to think about downloading and storing the data.

First of all, let's get a decent estimate of how large an individual GTFS-R feed is.

In [2]:
import json

# Load the MTA API key from a local credentials file kept out of version control
with open("../auth/mta-credentials.json", "r") as f:
    key = json.load(f)['key']

In [5]:
import requests
import time

In [7]:
requests_1 = []
requests_16 = []
requests_21 = []
requests_2 = []
requests_11 = []

# Sample every feed five times, pausing 30 seconds (one feed update interval) between rounds
for i in range(5):
    requests_1.append(requests.get("http://datamine.mta.info/mta_esi.php?key={0}&feed_id=1".format(key)))
    requests_16.append(requests.get("http://datamine.mta.info/mta_esi.php?key={0}&feed_id=16".format(key)))
    requests_21.append(requests.get("http://datamine.mta.info/mta_esi.php?key={0}&feed_id=21".format(key)))
    requests_2.append(requests.get("http://datamine.mta.info/mta_esi.php?key={0}&feed_id=2".format(key)))
    requests_11.append(requests.get("http://datamine.mta.info/mta_esi.php?key={0}&feed_id=11".format(key)))
    print("Finished loop {0}, sleeping".format(i + 1))
    if i != 4:
        time.sleep(30)

print("Done!")

Finished loop 1, sleeping
Finished loop 2, sleeping
Finished loop 3, sleeping
Finished loop 4, sleeping
Finished loop 5, sleeping
Done!


In [17]:
len(requests_16[0].content)

57001

In [12]:
len(requests_1[1].content)

17

Unfortunately the first (and beefiest) feed is currently throwing a permission denied error. Puzzling.

In [20]:
requests_1[0].content

b'Permission denied'

In [24]:
for feeds in [requests_1, requests_16, requests_21, requests_2, requests_11]:
    # Mean response size in KB across the samples collected for this feed
    print(sum(len(response.content) for response in feeds) / len(feeds) / 1000)

0.017
57.897
20.534
19.977
3.401
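
The 0.017 KB average for feed 1 is just the 17-byte error body, so it says nothing about the feed's real size. A minimal sketch of how such responses could be excluded before averaging follows. The names `looks_like_feed` and `mean_size_kb` and the 100-byte threshold are assumptions made for illustration; the only error body actually observed here is `b'Permission denied'`.

```python
# Error bodies seen so far; only this one was observed in the notebook.
ERROR_BODIES = {b"Permission denied"}
# Assumed lower bound on a real feed's size; the smallest feed above is ~3.4 KB.
MIN_FEED_BYTES = 100


def looks_like_feed(content):
    """Heuristic check: reject known error bodies and implausibly small payloads."""
    return content not in ERROR_BODIES and len(content) >= MIN_FEED_BYTES


def mean_size_kb(contents):
    """Mean size in KB of the plausible feed payloads, or None if there are none."""
    valid = [c for c in contents if looks_like_feed(c)]
    if not valid:
        return None
    return sum(len(c) for c in valid) / len(valid) / 1000
```

With the responses collected above, this could be called as `mean_size_kb([r.content for r in requests_1])`, which would return `None` rather than a misleading 0.017.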


But we can still ballpark that feed at around 204 KB by measuring an archived copy instead.

In [27]:
import sys; sys.path.append("../src/")
from processing import fetch_archival_gtfs_realtime_data

In [30]:
len(fetch_archival_gtfs_realtime_data(timestamp='2014-09-17-09-31', raw=True)) / 1000

204.765

In [38]:
(204.765+57.897+20.534+19.977+3.401)*2880/1000/1000

0.8829331199999999

So let's say that we are looking at 200 KB, 60 KB, 20 KB, 20 KB, and 3.5 KB responses per feed.

Feeds update every 30 seconds, which means that for full resolution we will need to capture 2880 messages per feed per day. That means that our daily storage requirements are:

* Feed 1 (1, 2, 3, 4, 5, 6, S): ~200 KB per response, ~0.6 GB per day.
* Feed 16 (N, Q, R, W): ~60 KB per response, ~0.17 GB per day.
* Feed 21 (B, D): ~20 KB per response, ~0.06 GB per day.
* Feed 2 (L): ~20 KB per response, ~0.06 GB per day.
* Feed 11 (SIR): ~3.5 KB per response, ~0.01 GB per day.
* TOTAL (1, 2, 3, 4, 5, 6, S, N, Q, R, W, B, D, L, SIR): ~0.9 GB per day.
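
The per-feed figures in this list follow directly from the response sizes and the 30-second update interval. A small sketch of that arithmetic, using the rounded sizes from the text and decimal units (1 GB = 1,000,000 KB); the variable and function names are illustrative only:

```python
# Approximate response sizes in KB, rounded from the measurements above.
RESPONSE_KB = {"1": 200, "16": 60, "21": 20, "2": 20, "11": 3.5}

SECONDS_PER_DAY = 24 * 60 * 60
UPDATE_INTERVAL_S = 30  # feeds update every 30 seconds
RESPONSES_PER_DAY = SECONDS_PER_DAY // UPDATE_INTERVAL_S


def daily_gb(response_kb):
    """Daily storage in GB for one feed, using decimal units."""
    return response_kb * RESPONSES_PER_DAY / 1000000


daily_by_feed = {feed_id: daily_gb(kb) for feed_id, kb in RESPONSE_KB.items()}
total_daily_gb = sum(daily_by_feed.values())

for feed_id, gb in daily_by_feed.items():
    print("Feed {0}: ~{1:.2f} GB per day".format(feed_id, gb))
print("Total: ~{0:.2f} GB per day".format(total_daily_gb))
```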

Note that these are just estimates, based on a sample of 8 PM real-time feeds (the feed 1 figure comes from a single archived snapshot instead). As the number of trains on the track differs quite heavily by time of day, they're not wholly accurate. But they're good ballparks.

0.9 GB per day means 28 GB in one month, 82 GB in a season (three months), and 328 GB in a year. For context, my home PC contains 1 TB (1000 GB) of persistent (hard-drive) storage, and 16 GB of random-access memory.
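
The same projection as a short sketch, so the daily figure can be swapped out once better measurements exist. The 31-day month and 91-day season are assumptions about how the numbers above were rounded.

```python
DAILY_GB = 0.9  # rounded daily total from the estimate above
DISK_GB = 1000  # the 1 TB home PC drive mentioned in the text

# Assumed period lengths: a 31-day month and a 91-day season.
PERIODS_IN_DAYS = {"month": 31, "season": 91, "year": 365}

projections = {name: days * DAILY_GB for name, days in PERIODS_IN_DAYS.items()}

for name, gb in projections.items():
    print("{0}: {1:.1f} GB ({2:.1%} of a 1 TB disk)".format(name, gb, gb / DISK_GB))
```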

These data volumes are manageable, but they're beyond anything offered by a free-ish data storage service. For example, Amazon S3 provides 5 GB of storage, 20000 GET requests, and 2000 PUT requests per month in its (12-month limited) free tier. That, particularly the PUT request limit, is nowhere near enough...
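
To make "nowhere near enough" concrete, here is a rough sketch of how quickly those free-tier limits would be used up. It assumes one PUT per feed per 30-second poll and the ~0.9 GB per day estimate from above; the variable names are illustrative.

```python
FEEDS = 5
POLLS_PER_DAY = 2880
DAILY_GB = 0.9  # rounded daily storage estimate

# Free-tier limits quoted in the text.
FREE_STORAGE_GB = 5
FREE_PUTS_PER_MONTH = 2000

puts_per_day = FEEDS * POLLS_PER_DAY   # one PUT per feed per poll (assumption)
puts_per_month = puts_per_day * 31

# How long until each free-tier allowance is exhausted.
hours_until_puts_exhausted = FREE_PUTS_PER_MONTH / (puts_per_day / 24)
days_until_storage_exhausted = FREE_STORAGE_GB / DAILY_GB

print("PUTs needed per month: {0}".format(puts_per_month))
print("Free PUTs last about {0:.1f} hours".format(hours_until_puts_exhausted))
print("Free storage lasts about {0:.1f} days".format(days_until_storage_exhausted))
```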

Let's see how much each set of data would cost us on a few different services.

In [39]:
328 * 0.023

7.544

In [47]:
2880*5*365

5256000

In [49]:
0.005 * (1051200 / 1000)

5.256

In [51]:
0.004 * (5256000 / 10000)

2.1024000000000003

In [52]:
7.50 + 5.26 + 2.10

14.86

In [55]:
31*2880

89280

In [56]:
feeds = [
    "http://datamine.mta.info/mta_esi.php?key={0}&feed_id=1".format(key),
    "http://datamine.mta.info/mta_esi.php?key={0}&feed_id=21".format(key),
    "http://datamine.mta.info/mta_esi.php?key={0}&feed_id=2".format(key),
    "http://datamine.mta.info/mta_esi.php?key={0}&feed_id=11".format(key)
        ]

In [57]:
%timeit [requests.get(feed) for feed in feeds]

The slowest run took 11.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 486 ms per loop


In [59]:
200+60+20+20+3.5

303.5

In [61]:
2880*31*2

178560

## Amazon S3+Lambda

### S3

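S3 charges separately for storage and for requests.

First of all, data storage costs 0.023\$ per GB per month. Once a full year's worth of data (328 GB) has accumulated, storing it costs about 7.50\$ per month.

PUT requests (which actually write the data into S3) cost 0.005\$ per 1,000 requests. Writing each feed as its own object for a year would require `2880*5*365=5256000` PUT requests, which at that rate comes to `0.005 * (5256000 / 1000)`, or 26.28\$. The 5.26\$ computed above uses `1051200` requests (`2880*365`), which corresponds to one PUT per 30-second cycle rather than one per feed.

Finally, GET requests cost 0.004\$ per 10,000 requests. Reading those same 5256000 objects back once would cost 2.10\$.

Adding up the figures computed above (7.50 + 5.26 + 2.10) gives 14.86\$; with one PUT per feed the sum is closer to 36\$, and the storage charge recurs every month the data is kept. Either way, that is no problem at all.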
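
The S3 arithmetic can be collected into one function so that different request counts are easy to compare. This is a sketch using only the prices quoted in this section; `s3_cost` is an illustrative name, and the two calls cover both request counts that appear above (`1051200` and `5256000` PUTs). The totals differ by a few cents from the hand-summed figure because nothing is rounded along the way.

```python
# Prices as quoted in this section (USD).
STORAGE_USD_PER_GB = 0.023
PUT_USD_PER_1000 = 0.005
GET_USD_PER_10000 = 0.004


def s3_cost(storage_gb, put_requests, get_requests):
    """Break an S3 bill into storage, PUT, and GET components."""
    storage = storage_gb * STORAGE_USD_PER_GB
    puts = put_requests / 1000 * PUT_USD_PER_1000
    gets = get_requests / 10000 * GET_USD_PER_10000
    return {"storage": storage, "put": puts, "get": gets,
            "total": storage + puts + gets}


# One PUT per 30-second cycle, as in the calculation above.
per_cycle = s3_cost(328, 2880 * 365, 2880 * 5 * 365)
# One PUT per feed per cycle.
per_feed = s3_cost(328, 2880 * 5 * 365, 2880 * 5 * 365)

print("Total, one PUT per cycle: {0:.2f}".format(per_cycle["total"]))
print("Total, one PUT per feed:  {0:.2f}".format(per_feed["total"]))
```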

### Lambda

In order to get this data into S3, I would need to run a job every thirty seconds on AWS Lambda.
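
A rough sketch of what such a job might look like. Everything beyond the feed URL is an assumption made for illustration: the environment variable names `MTA_API_KEY` and `ARCHIVE_BUCKET`, the object key layout, and the helper names are hypothetical. How the function gets triggered every thirty seconds is left out.

```python
import datetime
import os

FEED_IDS = [1, 16, 21, 2, 11]
FEED_URL = "http://datamine.mta.info/mta_esi.php?key={0}&feed_id={1}"


def object_key(feed_id, fetched_at):
    """Hypothetical key layout: one object per feed per poll, grouped by feed."""
    stamp = fetched_at.strftime("%Y-%m-%d-%H-%M-%S")
    return "feed-{0}/{1}.gtfsr".format(feed_id, stamp)


def archive_feeds(feed_ids, fetch, store, fetched_at):
    """Fetch every feed and store it; `fetch` and `store` are injected callables."""
    keys = []
    for feed_id in feed_ids:
        body = fetch(feed_id)
        key = object_key(feed_id, fetched_at)
        store(key, body)
        keys.append(key)
    return keys


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without these packages.
    import boto3
    import requests

    api_key = os.environ["MTA_API_KEY"]    # assumed environment variable
    bucket = os.environ["ARCHIVE_BUCKET"]  # assumed environment variable
    s3 = boto3.client("s3")

    def fetch(feed_id):
        response = requests.get(FEED_URL.format(api_key, feed_id), timeout=10)
        response.raise_for_status()
        return response.content

    def store(key, body):
        s3.put_object(Bucket=bucket, Key=key, Body=body)

    now = datetime.datetime.now(datetime.timezone.utc)
    return {"stored": archive_feeds(FEED_IDS, fetch, store, now)}
```

Passing `fetch` and `store` in as callables keeps the download-and-stash loop testable locally without AWS credentials. This layout issues one PUT per feed per run, which is the `2880*5*365` request count rather than the `2880*365` one.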

AWS Lambda charges based on the number of requests made, the time it takes to process each request, and how much memory is allocated to that process during the time that it is running.

The data itself takes up about `200+60+20+20+3.5` KB, or 303.5 KB, which is less than 1 MB. Even including Python, `requests`, and `boto3`, it seems unlikely that we will fill up the 128 MB tier (the lowest one).

OK, so we will use a 128 MB allocation. From the timing above, it seems likely that the entire download-and-stash process will run inside of 1 second. Conservatively, let's say it takes 2 seconds to run. At 2880 runs per day over a 31-day month, that's 178560 seconds of processing time per month.

The AWS Lambda free tier, which persists beyond the 12-month trial, gives 3,200,000 free seconds per month at that 128 MB level. That's an order of magnitude more than we need!

Additionally, you get 1 million free requests per month, roughly ten times the `2880*31 = 89280` invocations we need.

So Lambda is effectively free.
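
The free-tier comparison can be checked with a few lines. The compute allowance is expressed here in GB-seconds (400,000 per month), which is where the 3,200,000 seconds at 128 MB comes from; the 2-second run time is the conservative guess from the text, and the variable names are illustrative.

```python
MEMORY_MB = 128
SECONDS_PER_RUN = 2   # conservative estimate from the text
RUNS_PER_DAY = 2880
DAYS_PER_MONTH = 31

FREE_GB_SECONDS = 400000   # monthly free compute allowance
FREE_REQUESTS = 1000000    # monthly free request allowance

runs_per_month = RUNS_PER_DAY * DAYS_PER_MONTH
seconds_per_month = runs_per_month * SECONDS_PER_RUN

# Convert between seconds at 128 MB and GB-seconds.
gb_seconds_used = seconds_per_month * MEMORY_MB / 1024
free_seconds_at_128mb = FREE_GB_SECONDS * 1024 / MEMORY_MB

compute_headroom = free_seconds_at_128mb / seconds_per_month
request_headroom = FREE_REQUESTS / runs_per_month

print("Compute used: {0:.0f} of {1} GB-seconds".format(gb_seconds_used, FREE_GB_SECONDS))
print("Compute headroom: {0:.1f}x".format(compute_headroom))
print("Request headroom: {0:.1f}x".format(request_headroom))
```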

That means the total cost of everything comes down to the S3 charges estimated above: 14.86\$ by this notebook's arithmetic.