## API lab

### Introduction:

While many data sets are provided through downloadable files, some are provided through an application programming interface (API), which allows code to access part of the data without downloading the entire dataset.

APIs provide sets of functions and procedures that allow accessing the features or data from an application or other service.


#### REST API
REST is essentially a set of useful conventions for structuring a web API. 'Web API' means an API that you interact with over HTTP, making requests to specific URLs, and often getting relevant data back in the response.

### Uses
- Retrieving real-time information, for example travel route info from mapping applications, social media feeds, and any other data that is updated in real time or changes quickly. Stock price data is another example: it doesn't make sense to regenerate a data set and download it every minute.
- Getting only the desired records through filtering instead of downloading the entire data set. Reddit comments are one example: it doesn't make much sense (and won't be permitted) to download the entire Reddit database if we want only some of the information.

### General Terminology

#### Types of requests
Request types characterize the action we want to take when calling the API.

- GET: retrieve information (like search results). This is the most common type of request. Using it, we can get the data we are interested in from those that the API is ready to share.
- POST: adds new data to the server. Using this type of request, you can, for example, add a new item to your inventory.
- PUT: changes existing information. For example, using this type of request, it would be possible to change the color or value of an existing product.
- DELETE: deletes existing information
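The four request types can be sketched with the standard library alone. This is a minimal illustration that only builds the requests and never sends them; the endpoint `https://api.example.com/items` and the helper `build_request` are hypothetical names chosen for this example.

```python
import json
import urllib.request

BASE_URL = 'https://api.example.com/items'  # hypothetical endpoint, for illustration only

def build_request(method, url, payload=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    body = json.dumps(payload).encode('utf-8') if payload is not None else None
    headers = {'Content-Type': 'application/json'} if body is not None else {}
    return urllib.request.Request(url, data=body, headers=headers, method=method)

examples = [
    build_request('GET', BASE_URL),                                     # retrieve items
    build_request('POST', BASE_URL, {'name': 'lamp', 'color': 'red'}),  # add a new item
    build_request('PUT', BASE_URL + '/42', {'color': 'blue'}),          # change an existing item
    build_request('DELETE', BASE_URL + '/42'),                          # delete an existing item
]

for req in examples:
    print(req.get_method(), req.full_url, req.data)
```

Only GET requests are used in the rest of this lab; the other three are shown to make the distinction concrete.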

#### Status codes
Status codes briefly describe the result of the call.

- 200 – OK. The request was successful. The answer itself depends on the method used (GET, POST, etc.) and the API specification.
- 204 – No Content. The server successfully processed the request and did not return any content.
- 301 – Moved Permanently. The server responds that the requested page (endpoint) has been moved to another address and redirects to this address.
- 400 – Bad Request. The server cannot process the request because of a client-side error (for example, an incorrect request format).
- 401 – Unauthorized. Occurs when authentication fails because the credentials are incorrect or missing.
- 403 – Forbidden. Access to the specified resource is denied.
- 404 – Not Found. The requested resource was not found on the server.
- 500 – Internal Server Error. Occurs when an unknown error has occurred on the server.
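The codes above fall into broad ranges (2xx success, 3xx redirection, 4xx client error, 5xx server error). As a small sketch, Python's built-in `http.HTTPStatus` can look up the reason phrase for a code; the helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (reason phrase, broad category) for a numeric HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = 'success'
    elif 300 <= code < 400:
        category = 'redirection'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'informational'
    return phrase, category

for code in (200, 204, 301, 400, 401, 403, 404, 500):
    print(code, *describe_status(code))
```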

#### Endpoints
Usually, an Endpoint is a specific address (for example, https://weather-in-london.com/forecast), by referring to which you get access to certain features/data (in this example above – the weather forecast for London). Commonly, the name (address) of the endpoint corresponds to the functionality it provides.

### HereMap API
Provides mapping, location data and routing information services worldwide. Routing info is available for transit, walking, driving, bike modes.


In [1]:
import pandas as pd
import glob #Unix style pathname pattern expansion
import numpy as np
from dateutil import parser #powerful extensions to datetime; offers a generic date/time string parser
import time
from tqdm import tqdm #Customisable progressbar for iterators
try: #module name depends on the Python version
    import urllib2 as urllib #URL handling module
except ImportError:
    import urllib.request as urllib
import json
import geopandas as gpd
import sys
from IPython.display import clear_output #Clears the output of the current cell receiving output
import requests #the module for making HTTP requests in Python; provides GET functionality

#### Request API key by creating free account 
https://developer.here.com/sign-up?create=Freemium-Basic&keepState=true&step=account

#### Example 1: getting driving time

#### API endpoint
https://developer.here.com/documentation/routing-api/8.3.1/dev_guide/topics/send-request.html

#### Sample request

An API request needs an endpoint link with an API key to fetch data. In some cases, you may also need an appID and appCode to make a request. When you make an account with a particular API service, you will be asked to create and name your app. When you finish this step, you will be provided with a unique API key together with an appID and appCode as unique identifiers for your app.

Please note that various APIs have certain restrictions on their usage. Please make sure to read the usage terms before making an account (especially if you are asked to enter your payment info beforehand!)

HereMap API has 250k free requests per month. (And it does not ask for credit card info when you create an account)

Also, please note the below example uses an API key for this lab's demonstration purposes only. Please create your HereMap account and get your unique API key when attempting the lab or homeworks yourself.

Retrieve a driving route between the two given points:

In [2]:
# enter your api key from HereMap
apiKey = 'W6GB9G-e0QHV5Nes77OLAX5FxecH7IHEIuObpc6WJnU'

lon_pickup, lat_pickup = -73.972136, 40.744301 ##origin: A point in Manhattan (around E 35th street)
lon_dropoff, lat_dropoff = -74.172945, 40.680108 # destination: Newark airport

Note the driving API endpoint is 'https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey={}&waypoint0=geo!{},{}&waypoint1=geo!{},{}&mode=fastest;car;traffic:disabled', which essentially gives the driving directions from origin to destination together with a summary of the whole trip.

For the above origin-destination lat-longs, the endpoint will look like

'https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey=W6GB9G-e0QHV5Nes77OLAX5FxecH7IHEIuObpc6WJnU&waypoint0=geo!40.744301,-73.972136&waypoint1=geo!40.680108,-74.172945&mode=fastest;car;traffic:disabled'

Basically, the origin lat-long go in after the 'waypoint0=geo!' field and destination lat-long go in after 'waypoint1=geo!' field. Instead of manually putting these lat-long in the code, the pythonic way to do this is by the 'format' function. Wherever we need to insert variables, we put '{}' in the url and then specify the variables in their respective order of appearing in the url in the 'format' command.



In [3]:
url = str('https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey={}&waypoint0=geo!{},{}&waypoint1=geo!{},{}&mode=fastest;car;traffic:disabled'.format(
    apiKey, lat_pickup, lon_pickup, lat_dropoff, lon_dropoff))

print(url)

https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey=W6GB9G-e0QHV5Nes77OLAX5FxecH7IHEIuObpc6WJnU&waypoint0=geo!40.744301,-73.972136&waypoint1=geo!40.680108,-74.172945&mode=fastest;car;traffic:disabled
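As an alternative to positional `{}` placeholders, the query string can be assembled from a dictionary of named parameters with `urllib.parse.urlencode`. This is a sketch, not the lab's method: the helper `routing_url` and the placeholder key `'YOUR_API_KEY'` are assumptions made for illustration.

```python
from urllib.parse import urlencode

def routing_url(api_key, origin, destination):
    """Build the routing URL; origin and destination are (lat, lon) tuples."""
    base = 'https://route.ls.hereapi.com/routing/7.2/calculateroute.json'
    params = {
        'apiKey': api_key,
        'waypoint0': 'geo!{},{}'.format(*origin),
        'waypoint1': 'geo!{},{}'.format(*destination),
        'mode': 'fastest;car;traffic:disabled',
    }
    # keep '!', ',', ';' and ':' unescaped so the result has the same shape as the URL printed above
    return base + '?' + urlencode(params, safe='!,;:')

print(routing_url('YOUR_API_KEY', (40.744301, -73.972136), (40.680108, -74.172945)))
```

Naming each parameter makes it harder to swap latitude and longitude by accident, which is easy to do when the order of the arguments to `format` is the only guide.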


Now we'll make a request with the below command

In [4]:
data = urllib.urlopen(url).read().decode('utf-8')
data = json.loads(data)
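A request can fail (bad key, no network, server error), in which case `urlopen` raises an exception. The following is a minimal sketch of a wrapper that returns `None` instead of stopping the notebook; the name `fetch_json` and its `opener` argument (which exists only so the function can be tried without a network connection) are assumptions of this example.

```python
import json
import urllib.error
import urllib.request

def fetch_json(url, opener=urllib.request.urlopen, timeout=10):
    """Return the decoded JSON body of `url`, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode('utf-8'))
    except (urllib.error.URLError, ValueError) as err:
        # HTTPError is a subclass of URLError; ValueError covers malformed JSON
        print('request failed:', err)
        return None

# usage (performs a real request, so it is left commented out):
# data = fetch_json(url)
```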

The fetched data gives driving directions together with a summary of the whole trip

In [5]:
data

{'response': {'metaInfo': {'timestamp': '2020-09-22T01:41:10Z',
   'mapVersion': '8.30.112.154',
   'moduleVersion': '7.2.202037-7841',
   'interfaceVersion': '2.6.76',
   'availableMapVersion': ['8.30.112.154']},
  'route': [{'waypoint': [{'linkId': '+973152345',
      'mappedPosition': {'latitude': 40.7441169, 'longitude': -73.9722743},
      'originalPosition': {'latitude': 40.744301, 'longitude': -73.9721361},
      'type': 'stopOver',
      'spot': 0.0864198,
      'sideOfStreet': 'right',
      'mappedRoadName': 'E 35th St',
      'label': 'E 35th St',
      'shapeIndex': 0,
      'source': 'user'},
     {'linkId': '-810910944',
      'mappedPosition': {'latitude': 40.6790655, 'longitude': -74.1703952},
      'originalPosition': {'latitude': 40.6801079, 'longitude': -74.172945},
      'type': 'stopOver',
      'spot': 0.267148,
      'sideOfStreet': 'neither',
      'mappedRoadName': 'New Jersey Tpke S',
      'label': 'New Jersey Tpke S - I-95',
      'shapeIndex': 239,
      's

The retrieved output contains all information in the 'response' key. Inside 'response', the 'route' key holds the intermediate directions (which street to turn onto, etc.) as well as a summary of the trip (time, distance, road types, etc.).

Let's print the time and distance of the whole trip

In [6]:
# get travel time in seconds
data['response']['route'][0]['summary']['travelTime']

1737

In [7]:
# travel distance in meters
data['response']['route'][0]['summary']['distance']

25921
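Seconds and meters are not the most readable units, so a small conversion helper can be useful. The sketch below works on a stand-in dictionary that holds only the two values printed above; the function name `trip_summary` is hypothetical.

```python
def trip_summary(data):
    """Return (minutes, kilometres, miles) for the first route in a routing response."""
    summary = data['response']['route'][0]['summary']
    minutes = summary['travelTime'] / 60       # travelTime is in seconds
    kilometres = summary['distance'] / 1000    # distance is in meters
    miles = summary['distance'] / 1609.344     # meters per mile
    return minutes, kilometres, miles

# stand-in for the real response, holding only the two values printed above
example = {'response': {'route': [{'summary': {'travelTime': 1737, 'distance': 25921}}]}}
minutes, kilometres, miles = trip_summary(example)
print(minutes, kilometres, miles)
```

For this trip the result is roughly 29 minutes and 26 km (about 16 miles).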

#### Example 2: getting transit time

#### API endpoint
https://developer.here.com/documentation/public-transit/dev_guide/index.html

In [8]:
lon_pickup, lat_pickup = -73.972136, 40.744301
lon_dropoff, lat_dropoff = -74.008704, 40.708592

url = str('https://transit.ls.hereapi.com/v3/route.json?apiKey={}&routing=all&dep={},{}&arr={},{}&time=2020-09-21T07%3A30%3A00'.format(apiKey, lat_pickup, lon_pickup, 
                                                                                                        lat_dropoff, lon_dropoff))

data = urllib.urlopen(url).read().decode('utf-8')
data = json.loads(data)
data

{'Res': {'serviceUrl': 'http://internal-ptkernel-prd-v040-9929454.us-east-1.elb.amazonaws.com/goroute/NEW_YORK_NEW_ENGLAND_W001731_20200921',
  'Connections': {'valid_until': '2020-02-28',
   'context': 'xoPQ4CZ9y2XAeEVIu3bLZf__AADwVmhfEv0C_9AHZACsXmhfEv0',
   'Connection': [{'id': 'R0004e9-C0',
     'duration': 'PT34M',
     'transfers': 1,
     'Dep': {'time': '2020-09-21T07:31:00',
      'Addr': {'y': 40.744301, 'x': -73.972136}},
     'Arr': {'time': '2020-09-21T08:05:00',
      'Addr': {'y': 40.708592, 'x': -74.008704}},
     'Sections': {'Sec': [{'id': 'R0004e9-C0-S0',
        'mode': 20,
        'Dep': {'time': '2020-09-21T07:31:00',
         'Addr': {'y': 40.744301, 'x': -73.972136},
         'Transport': {'mode': 20}},
        'Journey': {'distance': 611, 'duration': 'PT10M'},
        'Arr': {'time': '2020-09-21T07:41:00',
         'Stn': {'y': 40.747864,
          'x': -73.969576,
          'name': 'E 41 St/1 AV',
          'id': '717062318'}}},
       {'id': 'R0004e9-C0-S1',

Note that the keys of the retrieved output differ from those returned by the driving directions request.

Here all data is present in the 'Res' key with information regarding intermediate stations. If we are interested in just the summary of the trip, the information is present in 'Connection' key inside of the 'Connections' key. 

Important: if no transit route is available from the specified origin to the destination, the response will differ from the one above.
Invalid responses include 'Timestamp out of validity period' and 'No result found'.

Travel time

In [9]:
x = data['Res']['Connections']['Connection'][0]['duration']
x

'PT34M'

Converting to minutes

In [10]:
if len(x) > 8:
    if x[5] == 'M':
        x =  float(x[2])*60 + float(x[4])
    else:
        x = float(x[2])*60 + float(x[4:6])
elif len(x)==8:
    if x[3]=='H':
        x = float(x[2])*60 + float(x[4])
    else:
        x = float(x[2:4])*60 + float(x[5:7])
elif len(x) == 5:
    x = float(x[2:4])

In [11]:
x

34.0
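The slicing above depends on the exact length of the duration string, so some lengths (for example 'PT1H5M') fall through every branch. A regular expression is one more general alternative. This sketch assumes the duration only ever contains hours, minutes and seconds (no days); the name `duration_to_minutes` is made up for this example.

```python
import re

# optional hours, minutes and seconds components of a duration such as 'PT1H34M'
_DURATION = re.compile(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?')

def duration_to_minutes(text):
    """Convert a duration such as 'PT34M' or 'PT1H5M' to minutes."""
    match = _DURATION.fullmatch(text)
    if match is None:
        raise ValueError('unrecognised duration: {!r}'.format(text))
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes + seconds / 60

print(duration_to_minutes('PT34M'))   # 34.0
print(duration_to_minutes('PT1H5M'))  # 65.0
```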

## Now access information for a sample of selected origin-destination points

In [12]:
# sample points lat-longs

df = pd.read_csv('samplePoints.csv')
df.head()

Unnamed: 0,LocationID,lon_pickup,lat_pickup,LocationID2,lon_dropoff,lat_dropoff
0,1,-74.174,40.691831,145,-73.948891,40.745379
1,2,-73.831299,40.616745,203,-73.739473,40.657853
2,3,-73.847422,40.864474,77,-73.895364,40.666559
3,4,-73.976968,40.723752,158,-74.008984,40.735035
4,5,-74.188484,40.552659,123,-73.964334,40.599954


In [13]:
len(df) ##one random point per taxi zone (LocationID)

263

Function for getting driving time for a dataframe of above type

In [14]:
def driving(file, apiKey): 
    '''
    Input: dataframe with origin-destination lat/long columns; HereMap API key
    Output: list of driving times (in seconds) corresponding to the input O-D pairs
    '''
    duration_list= []  # list for storing times
    for index, row in tqdm(file.iterrows()): # iterating through all rows of sample points
        lon_pickup = row['lon_pickup']  
        lat_pickup = row['lat_pickup']
        lon_dropoff = row['lon_dropoff']
        lat_dropoff = row['lat_dropoff']
        
        # specifying origin-destination lat-longs in the endpoint and making request
        url = str('https://route.ls.hereapi.com/routing/7.2/calculateroute.json?apiKey={}&waypoint0=geo!{},{}&waypoint1=geo!{},{}&mode=fastest;car;traffic:disabled'.format(apiKey, lat_pickup, lon_pickup, lat_dropoff, lon_dropoff))
        data = urllib.urlopen(url).read().decode('utf-8')
        data = json.loads(data)
        
        # append travel time to 'duration' list, nan if no travel time retrieved
        try:    
            duration = data['response']['route'][0]['summary']['travelTime']
        except (KeyError, IndexError, TypeError):
            duration = np.nan
            
        duration_list.append(duration)

    return duration_list  

Passing 10 random O-D pairs

In [15]:
sample = df.sample(10, random_state=2020)
sample.reset_index(drop=True, inplace=True)

sample

Unnamed: 0,LocationID,lon_pickup,lat_pickup,LocationID2,lon_dropoff,lat_dropoff
0,110,-74.128342,40.54578,254,-73.858948,40.882157
1,51,-73.828264,40.873973,92,-73.828859,40.761102
2,166,-73.961764,40.809457,234,-73.990458,40.740337
3,207,-73.899353,40.763986,248,-73.872289,40.834165
4,195,-74.009178,40.675549,68,-73.999917,40.748428
5,251,-74.125348,40.61688,114,-73.99738,40.72834
6,71,-73.937966,40.644288,118,-74.132979,40.586555
7,139,-73.744234,40.677098,202,-73.949952,40.7619
8,76,-73.876821,40.660935,240,-73.881978,40.894599
9,165,-73.956825,40.620924,258,-73.855767,40.688721


Calling the function to get responses from API

In [16]:
sample['driving_time'] = driving(sample, apiKey)
sample

10it [00:03,  2.91it/s]


Unnamed: 0,LocationID,lon_pickup,lat_pickup,LocationID2,lon_dropoff,lat_dropoff,driving_time
0,110,-74.128342,40.54578,254,-73.858948,40.882157,4366
1,51,-73.828264,40.873973,92,-73.828859,40.761102,1324
2,166,-73.961764,40.809457,234,-73.990458,40.740337,1278
3,207,-73.899353,40.763986,248,-73.872289,40.834165,1224
4,195,-74.009178,40.675549,68,-73.999917,40.748428,1153
5,251,-74.125348,40.61688,114,-73.99738,40.72834,2034
6,71,-73.937966,40.644288,118,-74.132979,40.586555,2113
7,139,-73.744234,40.677098,202,-73.949952,40.7619,2363
8,76,-73.876821,40.660935,240,-73.881978,40.894599,2670
9,165,-73.956825,40.620924,258,-73.855767,40.688721,1942


### NYC Open data API: 
NYC Open Data offers an API for almost every available data set, including crime, housing, 311, taxi, etc.
It is also free for the most part, and no account is needed.

https://opendata.cityofnewyork.us/

#### Example: Endpoint for 2018 taxi data
https://data.cityofnewyork.us/resource/t29m-gskq.json

In [17]:
# some parameters to filter the data. We can specify as many parameters as columns in the table.
# reference: https://dev.socrata.com/foundry/data.cityofnewyork.us/t29m-gskq

# extracting info for a specific O-D pair corresponding to near E 35th st as pick up and Newark airport as drop off
# this is around the same pick-up and drop-off location we used to get driving time from HereMap API
parameter = {'pulocationid':162, 'dolocationid':1}



Making the API request

'requests' is another package for making web requests. It is highly useful for REST APIs, and it lets us pass all parameters together in one line of code. We'll use the 'get' method to retrieve data.

In [18]:
url =  "https://data.cityofnewyork.us/resource/t29m-gskq.json"
r = requests.get(url = url, params=parameter)
data = r.json()
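To see what the `params` argument does, the query string can be built by hand with the standard library; `requests` appends an equivalent string to the URL. The `$limit` parameter in the second example is not used elsewhere in this lab: it is a Socrata query option that caps the number of returned rows, included here as an assumption about this endpoint to check against the API documentation.

```python
from urllib.parse import urlencode

url = 'https://data.cityofnewyork.us/resource/t29m-gskq.json'
parameter = {'pulocationid': 162, 'dolocationid': 1}

# '$limit' is an assumption here: a Socrata option limiting the number of rows returned
limited = {**parameter, '$limit': 5}

print(url + '?' + urlencode(parameter))
print(url + '?' + urlencode(limited))  # '$' is percent-encoded as %24
```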

First three returned observations:

In [19]:
data[:3]

[{'vendorid': '2',
  'tpep_pickup_datetime': '2018-06-01T11:14:55.000',
  'tpep_dropoff_datetime': '2018-06-01T11:56:41.000',
  'passenger_count': '1',
  'trip_distance': '17.75',
  'ratecodeid': '3',
  'store_and_fwd_flag': 'N',
  'pulocationid': '162',
  'dolocationid': '1',
  'payment_type': '1',
  'fare_amount': '68.5',
  'extra': '0',
  'mta_tax': '0',
  'tip_amount': '16.35',
  'tolls_amount': '12.95',
  'improvement_surcharge': '0.3',
  'total_amount': '98.1'},
 {'vendorid': '2',
  'tpep_pickup_datetime': '2018-06-01T11:27:17.000',
  'tpep_dropoff_datetime': '2018-06-01T12:09:44.000',
  'passenger_count': '1',
  'trip_distance': '17.94',
  'ratecodeid': '3',
  'store_and_fwd_flag': 'N',
  'pulocationid': '162',
  'dolocationid': '1',
  'payment_type': '1',
  'fare_amount': '71.5',
  'extra': '0',
  'mta_tax': '0',
  'tip_amount': '14.3',
  'tolls_amount': '22.5',
  'improvement_surcharge': '0.3',
  'total_amount': '108.6'},
 {'vendorid': '2',
  'tpep_pickup_datetime': '2018-06-0

Note that the returned response contains information regarding pick-up and drop-off times, distance covered, fare amount, number of passengers, etc. Essentially, it returns all the info provided by TLC in any csv file.

Let's get travel time from first observation

In [20]:
# travel time in seconds
parser.parse(data[0]['tpep_dropoff_datetime']) - parser.parse(data[0]['tpep_pickup_datetime'])

datetime.timedelta(seconds=2506)
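The difference of two datetimes is a `timedelta`, which can be converted to a plain number with `total_seconds()`. The sketch below repeats the calculation for the first observation using only the standard library (`datetime.fromisoformat` in place of `dateutil`), with the two timestamps copied from the output above.

```python
from datetime import datetime

pickup = datetime.fromisoformat('2018-06-01T11:14:55.000')
dropoff = datetime.fromisoformat('2018-06-01T11:56:41.000')

seconds = (dropoff - pickup).total_seconds()
minutes = seconds / 60
print(seconds, round(minutes, 1))  # 2506.0 41.8
```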

Note that the travel time returned is around 42 minutes (2506 seconds), compared to around 29 minutes (1737 seconds) for the driving time retrieved from the HereMap API before, which was requested with traffic disabled. Also note that time can vary with the pick-up time; the 42-minute trip above was picked up at around 11:15 am.



#### get yellow taxi times for O-D pairs as before

#### Note: taxi data can be full of absurd values in terms of travel times/fares/distances. So we need to filter out all noise while we extract. 
#### We'll make the same checks for filtering as we did in the taxi data lab

1. Keep only observations with travel time > 1 min and < 100 min
2. Speed (distance/time) should be > 2mph and <80mph
3. Fare should be > 2.5 USD and < 200 USD
4. Distance traveled should be > 0.3 miles and < 100 miles
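The four checks can be expressed as one small predicate, sketched below. The function name `is_plausible_trip` is hypothetical, and the upper fare bound is left as a parameter (`max_fare`, defaulting here to 200 USD, which is an assumption) so it can be adjusted.

```python
def is_plausible_trip(seconds, fare, distance, max_fare=200):
    """Apply the four sanity checks to one trip (seconds, USD, miles)."""
    if seconds <= 0:
        return False  # avoids dividing by zero when computing speed
    speed = distance / (seconds / 3600)  # miles per hour
    return (60 < seconds < 6000
            and 2 < speed < 80
            and 2.5 < fare < max_fare
            and 0.3 < distance < 100)

print(is_plausible_trip(2506, 68.5, 17.75))  # the first observation shown earlier
print(is_plausible_trip(30, 68.5, 17.75))    # a 30-second trip is rejected
```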

In [21]:
def taxiTime(df):
    
    times = [] # list for storing travel times
    
    for index, row in df.iterrows(): # iterating through all rows of sample points
        
        # specify parameters for making request
        parameters = {'pulocationid':int(row['LocationID']), 'dolocationid':int(row['LocationID2'])}
        
        url =  "https://data.cityofnewyork.us/resource/t29m-gskq.json"
        r = requests.get(url = url, params=parameters)
        data = r.json()
        
        travelTime = []
        
        for obs in data: # iterating through each observation in the returned data
            
            # making sanity checks and appending times to 'travelTime' list
            try:
                duration = parser.parse(obs['tpep_dropoff_datetime']) - parser.parse(obs['tpep_pickup_datetime'])
                seconds = duration.total_seconds()
                fare = float(obs['fare_amount'])
                distance = float(obs['trip_distance'])
                speed = distance/(seconds/3600) # miles per hour

                if (60 < seconds < 6000 and 2.5 < fare < 200
                    and 2 < speed < 80 and 0.3 < distance < 100):
                    travelTime.append(seconds)
                    
            except (KeyError, TypeError, ValueError, OverflowError, ZeroDivisionError):
                # skip observations with missing or malformed fields
                pass
            
        # now appending the mean of travel times retrieved above to the 'times' list
    
        times.append(np.mean(travelTime))
        
        
    return times
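When no observation survives the filters, `travelTime` is empty and `np.mean` of an empty list returns `nan` while emitting a RuntimeWarning. A minimal sketch of an explicit guard, using only the standard library (the helper name `mean_or_nan` is made up for this example):

```python
import math
from statistics import mean

def mean_or_nan(values):
    """Mean of `values`, or NaN when the list is empty."""
    return mean(values) if values else math.nan

print(mean_or_nan([1200, 1800]))
print(mean_or_nan([]))
```

The result is the same as with `np.mean`, but the empty case is handled deliberately rather than through a warning.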

In [22]:
sample['TLC_avgtime'] = taxiTime(sample)
sample



Unnamed: 0,LocationID,lon_pickup,lat_pickup,LocationID2,lon_dropoff,lat_dropoff,driving_time,TLC_avgtime
0,110,-74.128342,40.54578,254,-73.858948,40.882157,4366,
1,51,-73.828264,40.873973,92,-73.828859,40.761102,1324,1398.0
2,166,-73.961764,40.809457,234,-73.990458,40.740337,1278,1802.502508
3,207,-73.899353,40.763986,248,-73.872289,40.834165,1224,
4,195,-74.009178,40.675549,68,-73.999917,40.748428,1153,2149.0
5,251,-74.125348,40.61688,114,-73.99738,40.72834,2034,
6,71,-73.937966,40.644288,118,-74.132979,40.586555,2113,3126.0
7,139,-73.744234,40.677098,202,-73.949952,40.7619,2363,
8,76,-73.876821,40.660935,240,-73.881978,40.894599,2670,3203.666667
9,165,-73.956825,40.620924,258,-73.855767,40.688721,1942,2622.0


### Homework

1. Extract 2018/2019 FHV and shared FHV data and compare with taxi times from TLC and driving times from HereMap API for 50 random O-D pairs.
2. Extra credit: Using geocoding API from HereMAPs

### Retrieving FHV and shared FHV data

#### FHV (for hire vehicles) refers to cars belonging to privately owned companies and entities like Uber, Lyft etc. Shared for hire vehicles (SFHV) trips are where two or more people share the same trip (UberPool etc.)


Use the API endpoint: https://data.cityofnewyork.us/resource/u6nh-b56h.json

For shared FHV, use the parameter 'sr_flag': 1; for non-shared FHV, use 'sr_flag': 0.

### Task 1: get sample request for a single O-D pair for both FHV and shared FHV

In [23]:
# write parameters here for FHV
parameter = {'pulocationid':162, 'dolocationid':1, 'sr_flag':0}


In [24]:
# code here, get travel time from one observation for above O-D pair

In [25]:
# write parameters here for shared FHV


In [26]:
# get shared FHV travel time from one observation for above O-D pair



### Task 2. Write the function for getting FHV and Shared FHV times for 50 sample O-D pairs. 

#### Hint: see above that the returned FHV/shared FHV data has column names 'pickup_datetime' and 'dropoff_datetime' instead of 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' used in the functions above. So make sure to change these names when writing the code below


Also apply the necessary filtering to remove outliers. Remember that only travel times are available in this data, so no outlier filtering on other variables is needed.

In [27]:
# sample points

df = pd.read_csv('samplePoints.csv')
sample = df.sample(50, random_state=2020)
sample.head()

Unnamed: 0,LocationID,lon_pickup,lat_pickup,LocationID2,lon_dropoff,lat_dropoff
109,110,-74.128342,40.54578,254,-73.858948,40.882157
50,51,-73.828264,40.873973,92,-73.828859,40.761102
165,166,-73.961764,40.809457,234,-73.990458,40.740337
206,207,-73.899353,40.763986,248,-73.872289,40.834165
194,195,-74.009178,40.675549,68,-73.999917,40.748428


In [40]:
# write function for getting average FHV and SFHV times
#### hint: the function will essentially be the same as taxiTime above, except for the change in the request parameters and returning two lists of times
def FHVtimes(df):
    
    # code here
    
    return (FHV_times,SFHV_times)

In [29]:
## requests from API (call the above function)


In [30]:
# create new columns in the above 'sample' dataframe named 'FHV time' and 'SFHV time' with above retrieved times

### Task 3: Comparison of times for FHV and shared FHV

In [35]:
## check the number of NaN values (if any) among the retrieved travel times



In [36]:
# remove the rows where either the retrieved FHV or shared FHV time is NaN


In [37]:
# make scatter plot comparing FHV times with shared FHV times for the 50 sample O-D pairs
# add a FHV=SFHV baseline to see if most of the points are above (FHV time>SFHV time) or below (FHV time<SFHV time)

# hint: use plt.plot(var1, var2, 'o') for scatter plots between two variables var1 and var2 of interest

## Extra credit task- Using one other API : geocoding
Geocoding refers to converting textual addresses into lat-long coordinates. It is useful for getting the precise location of a place.

We'll use the HERE API again for geocoding. The same API key as before will work. The endpoint for the geocoding API is https://geocode.search.hereapi.com/v1/geocode 
Many other companies also provide geocoding API services (e.g. Google).

To specify an address, we use the 'q' parameter in the endpoint. For example, if we want to get lat-long for '5 Rue Daunou, 75000 Paris, France', we'll use the following endpoint:

https://geocode.search.hereapi.com/v1/geocode?q=5+Rue+Daunou%2C+75000+Paris%2C+France&apiKey={}

Essentially, we write the address as a query with spaces replaced by '+' and commas replaced by '%2C'.
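This replacement does not have to be done by hand. As a short sketch, `urllib.parse.quote_plus` from the standard library performs the same encoding for the example address:

```python
from urllib.parse import quote_plus

address = '5 Rue Daunou, 75000 Paris, France'
query = quote_plus(address)
print(query)  # 5+Rue+Daunou%2C+75000+Paris%2C+France
```

The encoded string can then be placed after `q=` in the endpoint.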

In [38]:
## try out a geocode request for a random address. Remember to be precise in specifying the address to remove any ambiguity

# API key
key = " "

# API endpoint with query as your address
url = " ".format(key)


#data = urllib.urlopen(url).read().decode('utf-8')
#data = json.loads(data)
#data

In [39]:
## get latitude and longitude from returned json request

