## Steps

1. [Get the data](#1.-Get-the-data)
2. [Prepare the data for modeling](#2.-Prepare-the-data-for-modeling)
3. [Build a model](#3.-Build-a-model)
4. [Put the model to work](#4.-Put-the-model-to-work)

## 1. Get the data

### Introduction

We are going to take a problem and answer it with a real dataset. In the process, we will build a decision tree and get familiar with a few powerful data science tools.

### Define the problem

What time do we need to leave home to reach the Harvard Square Bookshop using the Boston subway? <br>
We know it takes 6 minutes to get from home to JFK/UMass station and another 4 minutes from Harvard Square to work. <br>

Questions we need to answer: <br>
a) When do trains arrive at JFK/UMass station? <br>
b) How long do they take to reach Harvard Square? <br>
The answers vary with weekdays, weekends and holidays, equipment failures, rush hour, etc.

So what we are essentially trying to find is probabilities, not certainties. Example: what time do I need to leave the house in order to get to work on time 9 times out of 10?
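
To make "9 times out of 10" concrete, here is a minimal sketch. The travel times in it are made-up illustrative values, not MBTA data, and the helper name `minutes_needed` is hypothetical. It picks the smallest observed duration that covers at least 90% of the trips, then adds the walking times from the problem statement. It ignores the wait for a train on the platform.

```python
import math

def minutes_needed(travel_minutes, on_time_fraction=0.9):
    """Return the smallest observed duration covering the requested fraction of trips."""
    ordered = sorted(travel_minutes)
    # Position of the first value that covers at least on_time_fraction of trips.
    k = math.ceil(on_time_fraction * len(ordered)) - 1
    return ordered[k]

# Hypothetical train ride durations in minutes (illustrative only, not real data).
sample = [18, 19, 19, 20, 20, 21, 22, 23, 25, 31]

train_budget = minutes_needed(sample)   # covers 9 of the 10 sample trips
walk_minutes = 6 + 4                    # home to JFK/UMass, Harvard Square to work
print("Allow", train_budget + walk_minutes, "minutes door to door")
```

With real data, the same idea applies to the observed durations for the time of day we care about.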

### Find the data

We are trying to build a transit-time model using past data. The first step is to find the appropriate data: historical records of when trains leave JFK/UMass station and when they arrive at Harvard Square. <br>
Thankfully, the developers and administrators at the Massachusetts Bay Transportation Authority (MBTA) do a great job of preserving data and making it available to others.<br>
Our first stop is the [MBTA Developers Webpage](https://mbta.com/developers). Go to the _MBTA-performance section_ and click on _Download Documentation_, which gives us the link to the API.<br>

After reading through the documentation, it becomes clear that the type of query we want to make is a _TRAVELTIMES_ query.
- This lets us gather departure and arrival times for historical trips.
- The documentation tells us how to specify the departure stop (```from_stop```), the arrival stop (```to_stop```), and the beginning and end of the time window to look at (```from_datetime```, ```to_datetime```).
- We also learn that the data we want is only available starting in July 2015 and that we can only query 7 days at a time.<br>
These important bits of information will play a part in how we construct our data collection code.
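
One way to respect the 7-day limit is to cut a long date range into windows of at most 7 days before querying. The sketch below is only an illustration: the helper name `weekly_windows` is hypothetical, and 1 July 2015 is an assumed starting point, since the documentation only says "July 2015".

```python
import datetime

def weekly_windows(start, end, max_days=7):
    """Yield (window_start, window_end) pairs, each spanning at most max_days."""
    step = datetime.timedelta(days=max_days)
    window_start = start
    while window_start < end:
        window_end = min(window_start + step, end)
        yield window_start, window_end
        window_start = window_end

# Assumed first day with data; the exact day in July 2015 is not stated.
first_available = datetime.datetime(2015, 7, 1)
windows = list(weekly_windows(first_available, datetime.datetime(2015, 7, 20)))
for w_start, w_end in windows:
    print(w_start.date(), "->", w_end.date())
```

Each pair could then supply the ```from_datetime``` and ```to_datetime``` values of one query.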

After making a query, we get a collection of information in the results (described under 2.3 Performance Queries -> 2.3.1 Travel Times -> Response Fields in the documentation), including ```dep_dt``` (the departure time from the departure stop) and ```arr_dt``` (the arrival time at the arrival stop). <br>

The documentation also gives an example of how to structure a 'travel time' query.<br>
Example - http://realtime.mbta.com/developer/api/v2.1/traveltimes?api_key=wX9NwuHnZU2ToO7GmGR9uw&format=json&from_stop=70172&to_stop=70182&from_datetime=1457454139&to_datetime=1457455262 <br>

**Let's break down the query:**

- We start with a base URL, ```http://realtime.mbta.com/developer/api/v2.1/traveltimes```.
- Then we supply an API key, ```api_key=wX9NwuHnZU2ToO7GmGR9uw```. This is like a password that grants us access and, in this case, lets the MBTA developers track who is gathering the data. There is a public API key that anyone can use to pull the data. It comes with some limitations on how many queries we can make and how often, but our use case fits comfortably within them. If we need to make more queries or query at a faster rate, we can apply for our own API key.
- The rest of the query, ```from_stop=70172&to_stop=70182&from_datetime=1457454139&to_datetime=1457455262```, specifies the departure stop, the arrival stop, and the beginning and end times of our time window. <br>
- You may have noticed that the dates and times ```from_datetime=1457454139&to_datetime=1457455262``` aren't easy to interpret; they just look like long strings of numbers. We will talk about this later.
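
The same breakdown can be done in code. The sketch below splits the documentation's example URL into its base and its parameters using only the standard library. As a preview of the date handling, it also decodes the two datetime values on the assumption that they are Unix epoch seconds, and displays them in UTC.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

example_url = (
    "http://realtime.mbta.com/developer/api/v2.1/traveltimes"
    "?api_key=wX9NwuHnZU2ToO7GmGR9uw&format=json"
    "&from_stop=70172&to_stop=70182"
    "&from_datetime=1457454139&to_datetime=1457455262"
)

parts = urlsplit(example_url)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
# parse_qs returns a list per key; every key appears exactly once here.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

# Assumption: the datetime values are Unix epoch seconds.
window_start = datetime.fromtimestamp(int(params["from_datetime"]), tz=timezone.utc)
window_end = datetime.fromtimestamp(int(params["to_datetime"]), tz=timezone.utc)

print(base_url)
print(params["from_stop"], "->", params["to_stop"])
print(window_start.isoformat(), "to", window_end.isoformat())
```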

With all of this in place, our query can be submitted and a result returned. In this case, the result is ```json```, a structured bit of text that looks like this:<br>

```
{
 "travel_times":[
 {
 "route_id":"Green-D",
 "direction":"1",
 "dep_dt":"1457453760",
 "arr_dt":"1457454560",
 "travel_time_sec":"800",
 "benchmark_travel_time_sec":"480",
 "threshold_flag_1":"threshold_id_04"
 },
 {
 "route_id":"Green-D",
 "direction":"1",
 "dep_dt":"1457454105",
 "arr_dt":"1457454658",
 "travel_time_sec":"553",
 "benchmark_travel_time_sec":"480"
 }
 ]
}

```

- This response shows two trips that match the departure and arrival stops and the time window provided. The pieces of information we are interested in, ```dep_dt``` and ```arr_dt```, are there for each train.
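
As a small sketch of how such a response could be read in Python, the example below parses the same JSON text with the standard `json` module and recomputes each trip's duration from ```dep_dt``` and ```arr_dt```. The variable names are illustrative.

```python
import json

response_text = """
{
 "travel_times":[
  {"route_id":"Green-D", "direction":"1",
   "dep_dt":"1457453760", "arr_dt":"1457454560",
   "travel_time_sec":"800", "benchmark_travel_time_sec":"480",
   "threshold_flag_1":"threshold_id_04"},
  {"route_id":"Green-D", "direction":"1",
   "dep_dt":"1457454105", "arr_dt":"1457454658",
   "travel_time_sec":"553", "benchmark_travel_time_sec":"480"}
 ]
}
"""

result = json.loads(response_text)
durations = []
for trip in result["travel_times"]:
    # Every field arrives as a string, so convert before doing arithmetic.
    dep = int(trip["dep_dt"])
    arr = int(trip["arr_dt"])
    durations.append(arr - dep)
    print(dep, arr, arr - dep, trip["travel_time_sec"])
```

The difference between the two timestamps matches the reported ```travel_time_sec``` for both trips.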

**Note**

- The format for this is compatible with [GTFS](https://en.wikipedia.org/wiki/General_Transit_Feed_Specification) (General Transit Feed Specification). This is a common format for information relating to public transportation schedules.
- It allows code developed for one city to be more easily portable to another.
- The MBTA has its own [comments](https://mbta.com/developers/gtfs) about how it uses GTFS and notes that an enhanced set of information was made available in early 2018.<br>

But since we are planning to look as far back as 2015, the enhanced information won't help us, and we will ignore it for now. 
We need to look at the MBTA's GTFS files that describe their routes and trains so that we can find the stop codes for JFK/UMass station and Harvard Square, our departure and arrival stops on the Red Line.<br> 

There is a set of standard, optional and experimental files that are part of GTFS. We will look at an [archive](https://cdn.mbta.com/archive/archived_feeds.txt) of the GTFS feed to pull out the information we need.
- ```archived_feeds.txt``` is a collection of zip files, each corresponding to a specific date range.
- Each time a change is made to the files or the data format, a new zip file is created.
- Looking at the comments associated with each file, you can see that the GTFS system is regularly corrected, adapted and enriched with additional information.
- We will take a look at the most recent archive (https://cdn.mbtace.com/archive/20190704.zip) under the assumption that it will have the most up-to-date information.
- Opening this archive shows a set of files as listed in the MBTA's GTFS documentation. We are interested in the file called ```stops.txt```. Opening and searching through it shows that the stop ID for the JFK/UMass stop, inbound (the direction that heads towards Harvard Square), is **70086**. Further searching reveals that the Harvard stop, outbound (the direction we will be headed), has the ID **70068**.
- These are the codes we will need in our travel time query to make sure we get the departure and arrival times for trains going from JFK to Harvard.
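
The search through ```stops.txt``` can also be done in code. The sketch below runs on a tiny in-memory stand-in for the file: only the two stop IDs come from the text above, while the remaining rows and the contents of the `stop_desc` column are illustrative placeholders, not copied from the real feed. The helper name `find_stops` is hypothetical.

```python
import csv
import io

# Stand-in for stops.txt. Only the stop_id values 70086 and 70068 come from
# the text; everything else here is an illustrative placeholder.
SAMPLE_STOPS = (
    "stop_id,stop_name,stop_desc\n"
    "70086,JFK/UMass,inbound (placeholder description)\n"
    "70068,Harvard,outbound (placeholder description)\n"
    "99999,Example Stop,placeholder row\n"
)

def find_stops(stops_text, name_fragment):
    """Return (stop_id, stop_name, stop_desc) rows whose name contains the fragment."""
    reader = csv.DictReader(io.StringIO(stops_text))
    wanted = name_fragment.lower()
    return [
        (row["stop_id"], row["stop_name"], row["stop_desc"])
        for row in reader
        if wanted in row["stop_name"].lower()
    ]

for fragment in ("JFK", "Harvard"):
    print(fragment, find_stops(SAMPLE_STOPS, fragment))
```

For the real file, the text could be read out of the downloaded archive with the standard `zipfile` module and passed to the same function. The real file is likely to list several rows per station, so the direction still has to be checked by eye.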

### Gather the data

Now we have everything we need to programmatically download historical departure and arrival times for our trips. So let's code!

**Note**
- For a short tour of some extra-useful datetime functions, check out the datetime tutorial [here](https://end-to-end-machine-learning.teachable.com/courses/516023/lectures/9460206)

```
import datetime
import requests


def download_data(verbose=True):
    """
    Pull the data down from the public servers.

    Parameters
    ----------
    verbose: boolean
        If True, print the number of trips found for each day.

    Returns
    -------
    trips: list of dict
        One dict per trip, with 'dep' and 'arr' datetimes.
    """
    # Harvard Square, Red Line stop, outbound
    harvard_stop_id = '70068'
    # JFK/UMass, Red Line stop, inbound
    jfk_stop_id = '70086'

    # Gather trip data from a time window from each day,
    # over many days.
    start_time = datetime.time(7, 0)
    end_time = datetime.time(10, 0)
    start_date = datetime.date(2015, 5, 1)
    end_date = datetime.date(2019, 5, 1)

    TTravelURL = "http://realtime.mbta.com/developer/api/v2.1/traveltimes"
    TKey = "?api_key=wX9NwuHnZU2ToO7GmGR9uw"
    TFormat = "&format=json"
    from_stop = "&from_stop=" + str(jfk_stop_id)
    to_stop = "&to_stop=" + str(harvard_stop_id)

    # Cycle through all the days
    i_day = 0
    trips = []
    while True:
        check_date = start_date + datetime.timedelta(days=i_day)
        if check_date > end_date:
            break

        # Formulate the query
        from_time = datetime.datetime.combine(check_date, start_time)
        to_time = datetime.datetime.combine(check_date, end_time)
        TFrom_time = "&from_datetime=" + str(int(from_time.timestamp()))
        TTo_time = "&to_datetime=" + str(int(to_time.timestamp()))

        SRequest = "".join([
            TTravelURL,
            TKey,
            TFormat,
            from_stop, to_stop,
            TFrom_time, TTo_time
        ])
        s = requests.get(SRequest)
        s_json = s.json()
        # Convert the epoch timestamps of each trip to datetimes
        for trip in s_json['travel_times']:
            trips.append({
                'dep': datetime.datetime.fromtimestamp(
                    float(trip['dep_dt'])),
                'arr': datetime.datetime.fromtimestamp(
                    float(trip['arr_dt']))})
        if verbose:
            print(check_date, ':', len(s_json['travel_times']))

        i_day += 1

    return trips


if __name__ == "__main__":
    trips = download_data()
```

```
2015-05-01 : 0
2015-05-02 : 7
2015-05-03 : 7
2015-05-04 : 1
2015-05-05 : 1
2015-05-06 : 1
2015-05-07 : 1
2015-05-08 : 1
2015-05-09 : 6
2015-05-10 : 7
2015-05-11 : 1
2015-05-12 : 1
2015-05-13 : 1
2015-05-14 : 1
2015-05-15 : 1
2015-05-16 : 5
2015-05-17 : 5
2015-05-18 : 1
2015-05-19 : 1
2015-05-20 : 1
2015-05-21 : 1
2015-05-22 : 1
2015-05-23 : 6
2015-05-24 : 6
2015-05-25 : 1
2015-05-26 : 1
2015-05-27 : 1
2015-05-28 : 1
2015-05-29 : 1
2015-05-30 : 7
2015-05-31 : 5
2015-06-01 : 1
2015-06-02 : 0
2015-06-03 : 0
2015-06-04 : 0
2015-06-05 : 0
2015-06-06 : 0
2015-06-07 : 0
2015-06-08 : 0
2015-06-09 : 0
2015-06-10 : 0
2015-06-11 : 0
2015-06-12 : 0
2015-06-13 : 0
2015-06-14 : 0
2015-06-15 : 0
2015-06-16 : 0
2015-06-17 : 0
2015-06-18 : 0
2015-06-19 : 0
2015-06-20 : 0
2015-06-21 : 0
2015-06-22 : 0
2015-06-23 : 0
2015-06-24 : 0
2015-06-25 : 0
2015-06-26 : 0
2015-06-27 : 0
2015-06-28 : 0
2015-06-29 : 0
2015-06-30 : 0
2015-07-01 : 0
2015-07-02 : 0
2015-07-03 : 0
2015-07-04 : 0
2015-07-05 : 0
2015-07-06

2016-10-29 : 0
2016-10-30 : 0
2016-10-31 : 0
2016-11-01 : 0
2016-11-02 : 0
2016-11-03 : 0
2016-11-04 : 0
2016-11-05 : 0
2016-11-06 : 0
2016-11-07 : 0
2016-11-08 : 1
2016-11-09 : 1
2016-11-10 : 1
2016-11-11 : 0
2016-11-12 : 0
2016-11-13 : 0
2016-11-14 : 0
2016-11-15 : 1
2016-11-16 : 0
2016-11-17 : 1
2016-11-18 : 0
2016-11-19 : 0
2016-11-20 : 0
2016-11-21 : 0
2016-11-22 : 0
2016-11-23 : 0
2016-11-24 : 1
2016-11-25 : 0
2016-11-26 : 0
2016-11-27 : 0
2016-11-28 : 0
2016-11-29 : 1
2016-11-30 : 1
2016-12-01 : 0
2016-12-02 : 1
2016-12-03 : 1
2016-12-04 : 0
2016-12-05 : 0
2016-12-06 : 1
2016-12-07 : 1
2016-12-08 : 1
2016-12-09 : 1
2016-12-10 : 1
2016-12-11 : 0
2016-12-12 : 0
2016-12-13 : 0
2016-12-14 : 0
2016-12-15 : 1
2016-12-16 : 0
2016-12-17 : 0
2016-12-18 : 0
2016-12-19 : 0
2016-12-20 : 1
2016-12-21 : 1
2016-12-22 : 1
2016-12-23 : 1
2016-12-24 : 0
2016-12-25 : 0
2016-12-26 : 0
2016-12-27 : 1
2016-12-28 : 1
2016-12-29 : 1
2016-12-30 : 1
2016-12-31 : 1
2017-01-01 : 6
2017-01-02 : 1
2017-01-03

2018-04-29 : 0
2018-04-30 : 0
2018-05-01 : 1
2018-05-02 : 1
2018-05-03 : 1
2018-05-04 : 1
2018-05-05 : 1
2018-05-06 : 1
2018-05-07 : 1
2018-05-08 : 1
2018-05-09 : 1
2018-05-10 : 1
2018-05-11 : 1
2018-05-12 : 1
2018-05-13 : 0
2018-05-14 : 0
2018-05-15 : 1
2018-05-16 : 1
2018-05-17 : 0
2018-05-18 : 0
2018-05-19 : 1
2018-05-20 : 0
2018-05-21 : 1
2018-05-22 : 1
2018-05-23 : 1
2018-05-24 : 1
2018-05-25 : 1
2018-05-26 : 1
2018-05-27 : 1
2018-05-28 : 1
2018-05-29 : 1
2018-05-30 : 1
2018-05-31 : 1
2018-06-01 : 1
2018-06-02 : 1
2018-06-03 : 1
2018-06-04 : 1
2018-06-05 : 0
2018-06-06 : 1
2018-06-07 : 1
2018-06-08 : 1
2018-06-09 : 1
2018-06-10 : 1
2018-06-11 : 1
2018-06-12 : 1
2018-06-13 : 1
2018-06-14 : 1
2018-06-15 : 1
2018-06-16 : 1
2018-06-17 : 1
2018-06-18 : 1
2018-06-19 : 0
2018-06-20 : 1
2018-06-21 : 1
2018-06-22 : 1
2018-06-23 : 1
2018-06-24 : 1
2018-06-25 : 2
2018-06-26 : 1
2018-06-27 : 1
2018-06-28 : 1
2018-06-29 : 1
2018-06-30 : 1
2018-07-01 : 0
2018-07-02 : 1
2018-07-03 : 1
2018-07-04
```
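
The query string in `download_data` is assembled by concatenating text fragments. As an alternative sketch, the same URL can be built from a dictionary with `urllib.parse.urlencode`, which keeps the parameter names in one place. The function name `build_query` is hypothetical, and nothing is sent over the network here.

```python
from urllib.parse import urlencode

BASE_URL = "http://realtime.mbta.com/developer/api/v2.1/traveltimes"
PUBLIC_KEY = "wX9NwuHnZU2ToO7GmGR9uw"  # public key shown in the example query

def build_query(from_stop, to_stop, from_epoch, to_epoch, api_key=PUBLIC_KEY):
    """Return the full travel-times URL for one time window."""
    params = {
        "api_key": api_key,
        "format": "json",
        "from_stop": from_stop,
        "to_stop": to_stop,
        "from_datetime": int(from_epoch),
        "to_datetime": int(to_epoch),
    }
    # urlencode keeps the dictionary's insertion order.
    return BASE_URL + "?" + urlencode(params)

# JFK/UMass inbound to Harvard outbound, using the example window's timestamps.
url = build_query("70086", "70068", 1457454139, 1457455262)
print(url)
```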

### Timezone bugfix

## 2. Prepare the data for modeling

## 3. Build a model

## 4. Put the model to work