# Simple Linear Regression on MBTA Orange Line

**Assignment Due Date:** End of Day Friday November 30th

**Submission:** Submit just this notebook to James by email by the above due date. You can also share the homework by email through an online storage service (e.g. Box, Google Drive).

Build a simple linear regression model for the MBTA subway Orange Line.  Your model should have the following constraints:
- 1) Use both September and October historical data for the Orange Line to build the model ('mbta_Orange_09_2018.json' and 'mbta_Orange_10_2018.json')
- 2) The regression model is for trains moving from Forest Hills to Oak Grove (the opposite direction should be removed)
- 3) The regression model should cover trips that occur on Saturday and only between 7am and 10pm (all other days and times outside the specified range should not be included in the regression model)
- 4) Unique trips for a specific day that have under 40 vehicle updates should be removed
- 5) Unique trips for a specific day in which the vehicle updates do not begin at Forest Hills should be removed
- 6) The dependent variable should be the elapsed time since the trip began and should be expressed in hours
- 7) The independent variable should be the distance traveled from the start and should be expressed in kilometers

Please tag the portion of your code that handles each of the above constraints with '#CONSTRAINT{bullet-number}.'  For example, if you are filtering out any trips that do not occur on Saturday, put the comment '#CONSTRAINT3.' before the logic that performs this.  If the logic for a specific constraint is spread throughout the notebook, please tag each piece.

Plot your simple linear regression model and include a scatter plot for the testing and training dataset.

You are encouraged to reuse the logic from the lectures to complete this assignment.  The historical datasets used to build the regression model can be found here:  https://umass.app.box.com/s/x3zgwv34uduqrxnwkako4rbkjayzabt1/folder/56231379908 . 

Full credit is given if the entire work is shown and follows the above constraints.  This homework is an individual assignment.  The code will be re-run locally and that runtime output is what will be graded, not the output displayed when submitted. **Please be sure your code runs as expected from the start of the notebook to the end.**

![MBTAOrangeLineHW3.png](attachment:MBTAOrangeLineHW3.png)

Example plot of Orange Line to Oak Grove distance traveled vs time elapsed:

![forestHillsOak.png](attachment:forestHillsOak.png)

Example plot of simple linear regression model with scatter plot of testing and training dataset:

![exampleOrangeRegression.png](attachment:exampleOrangeRegression.png)

# CONSTRAINT 1

In [2]:
import pandas as pd
import json
import requests
from geopy import distance
import plotly as py
import plotly.graph_objs as go
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

The September and October datasets are downloaded from the given URL, loaded separately, and combined at the end of Constraint 1 into a single DataFrame covering both months.

In [3]:
with open('D:/First Sem/IOT/Programming Assignments/IOT-Homework-P3/mbta_Orange_09_2018.json') as file1:
    data1 = json.load(file1)

In [4]:
# pd.json_normalize is the public entry point in pandas >= 1.0 (pd.io.json.json_normalize is deprecated)
df1 = pd.json_normalize(data1, record_path='Vehicles')
df1

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId
0,O-5457D94B,vehicle,2018-08-31T19:56:16-04:00,0,-71.10796,42.31005,1208,0,180,STOPPED_AT,210,70002,37641807,Orange
1,O-5457DE49,vehicle,2018-08-31T19:59:25-04:00,0,-71.06641,42.34806,1219,0,120,INCOMING_AT,230,70014,37641839,Orange
2,O-5457E099,vehicle,2018-08-31T19:59:27-04:00,0,-71.07683,42.41500,1220,1,180,INCOMING_AT,357,70035,37642043,Orange
3,O-5457D346,vehicle,2018-08-31T19:54:48-04:00,0,-71.07465,42.34758,1232,1,70,STOPPED_AT,85,70015,37642044,Orange
4,O-5457DE17,vehicle,2018-08-31T19:52:38-04:00,0,-71.11384,42.30122,1245,1,1,STOPPED_AT,210,70001,37641875,Orange
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429749,O-54587D66,vehicle,2018-09-30T19:58:22-04:00,0,-71.07715,42.39250,1295,1,160,STOPPED_AT,355,70279,ADDED-1538079926,Orange
429750,O-54587D4B,vehicle,2018-09-30T19:58:51-04:00,0,-71.06710,42.37216,1216,0,60,IN_TRANSIT_TO,125,70026,ADDED-1538079619,Orange
429751,O-54587D02,vehicle,2018-09-30T19:58:24-04:00,0,-71.06059,42.35533,1246,1,100,STOPPED_AT,25,70021,ADDED-1538079941,Orange
429752,O-54587D4B,vehicle,2018-09-30T19:58:59-04:00,0,-71.06600,42.37147,1216,0,60,IN_TRANSIT_TO,125,70026,ADDED-1538079619,Orange


In [5]:
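toDropList1 = "Speed Type Bearing Label RouteId".split()
# drop() returns a new DataFrame; the result is not assigned here, so df1 keeps all columns (see output below)
df1.drop(toDropList1, axis=1)
df1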

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId
0,O-5457D94B,vehicle,2018-08-31T19:56:16-04:00,0,-71.10796,42.31005,1208,0,180,STOPPED_AT,210,70002,37641807,Orange
1,O-5457DE49,vehicle,2018-08-31T19:59:25-04:00,0,-71.06641,42.34806,1219,0,120,INCOMING_AT,230,70014,37641839,Orange
2,O-5457E099,vehicle,2018-08-31T19:59:27-04:00,0,-71.07683,42.41500,1220,1,180,INCOMING_AT,357,70035,37642043,Orange
3,O-5457D346,vehicle,2018-08-31T19:54:48-04:00,0,-71.07465,42.34758,1232,1,70,STOPPED_AT,85,70015,37642044,Orange
4,O-5457DE17,vehicle,2018-08-31T19:52:38-04:00,0,-71.11384,42.30122,1245,1,1,STOPPED_AT,210,70001,37641875,Orange
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429749,O-54587D66,vehicle,2018-09-30T19:58:22-04:00,0,-71.07715,42.39250,1295,1,160,STOPPED_AT,355,70279,ADDED-1538079926,Orange
429750,O-54587D4B,vehicle,2018-09-30T19:58:51-04:00,0,-71.06710,42.37216,1216,0,60,IN_TRANSIT_TO,125,70026,ADDED-1538079619,Orange
429751,O-54587D02,vehicle,2018-09-30T19:58:24-04:00,0,-71.06059,42.35533,1246,1,100,STOPPED_AT,25,70021,ADDED-1538079941,Orange
429752,O-54587D4B,vehicle,2018-09-30T19:58:59-04:00,0,-71.06600,42.37147,1216,0,60,IN_TRANSIT_TO,125,70026,ADDED-1538079619,Orange


In [6]:
with open('D:/First Sem/IOT/Programming Assignments/IOT-Homework-P3/mbta_Orange_10_2018.json') as file2:
    data2 = json.load(file2)

In [7]:
# pd.json_normalize is the public entry point in pandas >= 1.0 (pd.io.json.json_normalize is deprecated)
df2 = pd.json_normalize(data2, record_path='Vehicles')
df2

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId
0,O-54587D4B,vehicle,2018-09-30T20:01:55-04:00,0,-71.05920,42.36424,1216,0,70,INCOMING_AT,150,70024,ADDED-1538079619,Orange
1,O-54587D4A,vehicle,2018-09-30T20:02:30-04:00,0,-71.08425,42.34093,1225,0,130,STOPPED_AT,220,70012,ADDED-1538079618,Orange
2,O-54587D02,vehicle,2018-09-30T20:02:16-04:00,0,-71.05958,42.36473,1246,1,130,INCOMING_AT,330,70027,ADDED-1538079941,Orange
3,O-54587D63,vehicle,2018-09-30T20:02:06-04:00,0,-71.07213,42.43223,1260,0,10,INCOMING_AT,200,70034,ADDED-1538079620,Orange
4,O-54587D66,vehicle,2018-09-30T20:02:27-04:00,0,-71.07698,42.40413,1295,1,180,IN_TRANSIT_TO,357,70035,ADDED-1538079926,Orange
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
461739,O-545923BF,vehicle,2018-10-31T19:59:12-04:00,0,-71.07707,42.38868,1252,0,40,INCOMING_AT,175,70030,37940120,Orange
461740,O-545923DE,vehicle,2018-10-31T19:58:52-04:00,0,-71.06467,42.34880,1297,0,110,STOPPED_AT,230,70016,37940088,Orange
461741,O-54591E13,vehicle,2018-10-31T19:59:05-04:00,0,-71.09157,42.33483,1304,0,150,INCOMING_AT,220,70008,37940224,Orange
461742,O-545923C0,vehicle,2018-10-31T19:59:26-04:00,0,-71.07723,42.39790,1241,1,170,INCOMING_AT,5,70033,37940339,Orange


In [8]:
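toDropList2 = "Speed Type Bearing Label RouteId".split()
# drop() returns a new DataFrame; the result is not assigned here, so df2 keeps all columns (see output below)
df2.drop(toDropList2, axis=1)
df2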

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId
0,O-54587D4B,vehicle,2018-09-30T20:01:55-04:00,0,-71.05920,42.36424,1216,0,70,INCOMING_AT,150,70024,ADDED-1538079619,Orange
1,O-54587D4A,vehicle,2018-09-30T20:02:30-04:00,0,-71.08425,42.34093,1225,0,130,STOPPED_AT,220,70012,ADDED-1538079618,Orange
2,O-54587D02,vehicle,2018-09-30T20:02:16-04:00,0,-71.05958,42.36473,1246,1,130,INCOMING_AT,330,70027,ADDED-1538079941,Orange
3,O-54587D63,vehicle,2018-09-30T20:02:06-04:00,0,-71.07213,42.43223,1260,0,10,INCOMING_AT,200,70034,ADDED-1538079620,Orange
4,O-54587D66,vehicle,2018-09-30T20:02:27-04:00,0,-71.07698,42.40413,1295,1,180,IN_TRANSIT_TO,357,70035,ADDED-1538079926,Orange
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
461739,O-545923BF,vehicle,2018-10-31T19:59:12-04:00,0,-71.07707,42.38868,1252,0,40,INCOMING_AT,175,70030,37940120,Orange
461740,O-545923DE,vehicle,2018-10-31T19:58:52-04:00,0,-71.06467,42.34880,1297,0,110,STOPPED_AT,230,70016,37940088,Orange
461741,O-54591E13,vehicle,2018-10-31T19:59:05-04:00,0,-71.09157,42.33483,1304,0,150,INCOMING_AT,220,70008,37940224,Orange
461742,O-545923C0,vehicle,2018-10-31T19:59:26-04:00,0,-71.07723,42.39790,1241,1,170,INCOMING_AT,5,70033,37940339,Orange


In [9]:
frames = [df1, df2]
df = pd.concat(frames, ignore_index=True)
df

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId
0,O-5457D94B,vehicle,2018-08-31T19:56:16-04:00,0,-71.10796,42.31005,1208,0,180,STOPPED_AT,210,70002,37641807,Orange
1,O-5457DE49,vehicle,2018-08-31T19:59:25-04:00,0,-71.06641,42.34806,1219,0,120,INCOMING_AT,230,70014,37641839,Orange
2,O-5457E099,vehicle,2018-08-31T19:59:27-04:00,0,-71.07683,42.41500,1220,1,180,INCOMING_AT,357,70035,37642043,Orange
3,O-5457D346,vehicle,2018-08-31T19:54:48-04:00,0,-71.07465,42.34758,1232,1,70,STOPPED_AT,85,70015,37642044,Orange
4,O-5457DE17,vehicle,2018-08-31T19:52:38-04:00,0,-71.11384,42.30122,1245,1,1,STOPPED_AT,210,70001,37641875,Orange
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891493,O-545923BF,vehicle,2018-10-31T19:59:12-04:00,0,-71.07707,42.38868,1252,0,40,INCOMING_AT,175,70030,37940120,Orange
891494,O-545923DE,vehicle,2018-10-31T19:58:52-04:00,0,-71.06467,42.34880,1297,0,110,STOPPED_AT,230,70016,37940088,Orange
891495,O-54591E13,vehicle,2018-10-31T19:59:05-04:00,0,-71.09157,42.33483,1304,0,150,INCOMING_AT,220,70008,37940224,Orange
891496,O-545923C0,vehicle,2018-10-31T19:59:26-04:00,0,-71.07723,42.39790,1241,1,170,INCOMING_AT,5,70033,37940339,Orange


In [10]:
df.sort_values(["Id",'UpdatedAt'], inplace=True)
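The load, normalize, and concatenate steps above could be folded into one helper. The following is only a sketch: the function name `load_vehicle_updates` and the tiny stand-in files are hypothetical, and the file layout (a list of snapshots, each holding a 'Vehicles' list) is an assumption inferred from the `record_path='Vehicles'` argument used above.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_vehicle_updates(paths):
    """Load each JSON file, flatten its 'Vehicles' records, and stack the results."""
    frames = []
    for path in paths:
        with open(path) as fh:
            data = json.load(fh)
        frames.append(pd.json_normalize(data, record_path="Vehicles"))
    # ignore_index=True gives the combined frame one continuous row index
    return pd.concat(frames, ignore_index=True)


# Stand-in files (hypothetical content) that mimic the assumed layout.
tmp_dir = Path(tempfile.mkdtemp())
sample = {
    "sep.json": [{"Vehicles": [{"Id": "A", "DirectionId": 1}, {"Id": "B", "DirectionId": 0}]}],
    "oct.json": [{"Vehicles": [{"Id": "C", "DirectionId": 1}]}],
}
for name, payload in sample.items():
    (tmp_dir / name).write_text(json.dumps(payload))

combined = load_vehicle_updates([tmp_dir / "sep.json", tmp_dir / "oct.json"])
print(combined)
```

With the real data, the two `mbta_Orange_*.json` paths would be passed in place of the stand-in files.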

# CONSTRAINT 2

Since the regression model covers only trains running from Forest Hills to Oak Grove, only rows with DirectionId equal to 1 are kept.

In [11]:
df = df[df['DirectionId']==1]
len(df['StopId'].unique())

23

# CONSTRAINT 3

The DataFrame is filtered to keep only the Saturdays of the two months, between 7am and 10pm. Trips whose TripId contains 'ADDED-' are also removed.

In [12]:
df['UpdatedAt'] = pd.to_datetime(df['UpdatedAt'])
df['day'] = df['UpdatedAt'].dt.dayofyear
df['hour'] = df['UpdatedAt'].dt.hour
# note: hour <= 22 keeps updates through 22:59; hour < 22 would end the window at 10pm
df = df[(df['UpdatedAt'].dt.hour>=7) & (df['UpdatedAt'].dt.hour<=22)]
df['IsSaturday'] = df['UpdatedAt'].dt.dayofweek == 5
df




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy





Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId,day,hour,IsSaturday
3,O-5457D346,vehicle,2018-08-31 19:54:48-04:00,0,-71.07465,42.34758,1232,1,70,STOPPED_AT,85,70015,37642044,Orange,243,19,False
38,O-5457D346,vehicle,2018-08-31 20:03:24-04:00,0,-71.06966,42.34757,1232,1,80,INCOMING_AT,85,70017,37642044,Orange,243,20,False
42,O-5457D346,vehicle,2018-08-31 20:03:37-04:00,0,-71.06723,42.34759,1232,1,80,INCOMING_AT,50,70017,37642044,Orange,243,20,False
47,O-5457D346,vehicle,2018-08-31 20:04:37-04:00,0,-71.06468,42.34878,1232,1,80,STOPPED_AT,50,70017,37642044,Orange,243,20,False
58,O-5457D346,vehicle,2018-08-31 20:05:34-04:00,0,-71.06375,42.35002,1232,1,90,INCOMING_AT,25,70019,37642044,Orange,243,20,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
889593,O-545926D3,vehicle,2018-10-31 18:09:30-04:00,0,-71.07428,42.42668,1232,1,180,STOPPED_AT,20,70035,37940335,Orange,304,18,False
889617,O-545926D3,vehicle,2018-10-31 18:10:52-04:00,0,-71.07350,42.42861,1232,1,190,IN_TRANSIT_TO,20,70036,37940335,Orange,304,18,False
889626,O-545926D3,vehicle,2018-10-31 18:11:08-04:00,0,-71.07290,42.43014,1232,1,190,IN_TRANSIT_TO,20,70036,37940335,Orange,304,18,False
889634,O-545926D3,vehicle,2018-10-31 18:11:21-04:00,0,-71.07161,42.43342,1232,1,190,IN_TRANSIT_TO,20,70036,37940335,Orange,304,18,False


In [13]:
df = df[df['IsSaturday']]
df

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId,day,hour,IsSaturday
3682,O-5457E3A7,vehicle,2018-09-01 07:02:43-04:00,0,-71.08695,42.33854,1263,1,60,INCOMING_AT,40,70013,ADDED-1535511531,Orange,244,7,True
3685,O-5457E3A7,vehicle,2018-09-01 07:02:58-04:00,0,-71.08571,42.33951,1263,1,60,INCOMING_AT,40,70013,ADDED-1535511531,Orange,244,7,True
3688,O-5457E3A7,vehicle,2018-09-01 07:03:15-04:00,0,-71.08410,42.34086,1263,1,60,STOPPED_AT,40,70013,ADDED-1535511531,Orange,244,7,True
3690,O-5457E3A7,vehicle,2018-09-01 07:04:10-04:00,0,-71.08192,42.34257,1263,1,70,IN_TRANSIT_TO,40,70015,ADDED-1535511531,Orange,244,7,True
3696,O-5457E3A7,vehicle,2018-09-01 07:04:49-04:00,0,-71.07814,42.34571,1263,1,70,INCOMING_AT,40,70015,ADDED-1535511531,Orange,244,7,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
833609,O-54591290,vehicle,2018-10-27 22:44:23-04:00,0,-71.07428,42.42668,1249,1,180,STOPPED_AT,20,70035,38018348,Orange,300,22,True
833621,O-54591290,vehicle,2018-10-27 22:45:36-04:00,0,-71.07350,42.42861,1249,1,190,IN_TRANSIT_TO,20,70036,38018348,Orange,300,22,True
833624,O-54591290,vehicle,2018-10-27 22:45:55-04:00,0,-71.07290,42.43014,1249,1,190,IN_TRANSIT_TO,20,70036,38018348,Orange,300,22,True
833627,O-54591290,vehicle,2018-10-27 22:46:11-04:00,0,-71.07161,42.43342,1249,1,190,INCOMING_AT,20,70036,38018348,Orange,300,22,True


In [14]:
df['day'].unique()

array([244, 251, 258, 265, 272, 279, 286, 293, 300], dtype=int64)

In [15]:
df = df[~df['TripId'].str.contains('ADDED-')]
df

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,Bearing,StopId,TripId,RouteId,day,hour,IsSaturday
823300,O-5458F8A7,vehicle,2018-10-27 07:05:37-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,210,70001,38018204,Orange,300,7,True
823349,O-5458F8A7,vehicle,2018-10-27 07:09:26-04:00,0,-71.11312,42.30213,1295,1,10,IN_TRANSIT_TO,30,70003,38018204,Orange,300,7,True
823354,O-5458F8A7,vehicle,2018-10-27 07:09:48-04:00,0,-71.11258,42.30290,1295,1,10,IN_TRANSIT_TO,30,70003,38018204,Orange,300,7,True
823357,O-5458F8A7,vehicle,2018-10-27 07:10:07-04:00,0,-71.11077,42.30540,1295,1,10,INCOMING_AT,30,70003,38018204,Orange,300,7,True
823368,O-5458F8A7,vehicle,2018-10-27 07:11:04-04:00,0,-71.10782,42.31001,1295,1,10,STOPPED_AT,30,70003,38018204,Orange,300,7,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
833609,O-54591290,vehicle,2018-10-27 22:44:23-04:00,0,-71.07428,42.42668,1249,1,180,STOPPED_AT,20,70035,38018348,Orange,300,22,True
833621,O-54591290,vehicle,2018-10-27 22:45:36-04:00,0,-71.07350,42.42861,1249,1,190,IN_TRANSIT_TO,20,70036,38018348,Orange,300,22,True
833624,O-54591290,vehicle,2018-10-27 22:45:55-04:00,0,-71.07290,42.43014,1249,1,190,IN_TRANSIT_TO,20,70036,38018348,Orange,300,22,True
833627,O-54591290,vehicle,2018-10-27 22:46:11-04:00,0,-71.07161,42.43342,1249,1,190,INCOMING_AT,20,70036,38018348,Orange,300,22,True


In [16]:
len(df['StopId'].unique())

20
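The Saturday and time-of-day filter can also be written as a single boolean mask. This is a minimal sketch on made-up rows, not the notebook's recorded run: it assumes all timestamps share the same UTC offset (as the September and October samples above do), uses a strict upper bound so nothing at or after 22:00 is kept, and takes an explicit `.copy()` so that adding columns afterwards does not raise the SettingWithCopyWarning shown above.

```python
import pandas as pd

# Hypothetical rows covering each case the filter has to handle.
updates = pd.DataFrame({
    "TripId": ["t1", "t1", "t2", "t3", "ADDED-1"],
    "UpdatedAt": [
        "2018-09-01T07:02:43-04:00",  # Saturday, inside the window
        "2018-09-01T22:44:23-04:00",  # Saturday, after 10pm
        "2018-09-02T09:00:00-04:00",  # Sunday
        "2018-09-08T06:59:59-04:00",  # Saturday, before 7am
        "2018-09-08T12:00:00-04:00",  # Saturday, but an 'ADDED-' trip
    ],
})

ts = pd.to_datetime(updates["UpdatedAt"])
mask = (
    (ts.dt.dayofweek == 5)      # Monday is 0, so Saturday is 5
    & (ts.dt.hour >= 7)
    & (ts.dt.hour < 22)         # strict bound: nothing at or after 22:00
    & ~updates["TripId"].str.startswith("ADDED-")
)

saturday_updates = updates.loc[mask].copy()
saturday_updates["day"] = ts[mask].dt.dayofyear
print(saturday_updates)
```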

# CONSTRAINT 4

Trips with fewer than 40 vehicle updates on a given day are filtered out.

In [17]:
grouping = ['TripId','day']
# keep trips with at least 40 updates; only trips with under 40 are removed
fDf = df.groupby(grouping).filter(lambda x: x['CurrentStatus'].count()>=40)
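The per-trip count filter can be checked on a toy frame. The sketch below uses `transform("size")` instead of `filter` with a lambda, which is a swapped-in variant rather than the notebook's method; the constant `MIN_UPDATES` and the sample rows are hypothetical.

```python
import pandas as pd

# Trip 'a' has 40 updates on day 244 and 5 on day 251; trip 'b' has 39 on day 244.
updates = pd.DataFrame({
    "TripId": ["a"] * 40 + ["b"] * 39 + ["a"] * 5,
    "day":    [244] * 40 + [244] * 39 + [251] * 5,
})

MIN_UPDATES = 40

# Attach to every row the number of updates in its (TripId, day) group.
sizes = updates.groupby(["TripId", "day"])["TripId"].transform("size")

# Keep groups with at least MIN_UPDATES rows; "under 40" is what gets dropped.
kept = updates[sizes >= MIN_UPDATES]
print(kept.groupby(["TripId", "day"]).size())
```

Grouping by both `TripId` and `day` matters here: trip 'a' survives on day 244 but is dropped on day 251.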

# CONSTRAINT 5
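
Constraint 5 asks that trips whose vehicle updates do not begin at Forest Hills be removed. One possible way to express that filter is sketched below on made-up rows. It assumes that StopId is stored as a string and that '70001' is the Forest Hills platform, which is how that id appears in the vehicle updates and stop tables of this notebook; the variable names are hypothetical.

```python
import pandas as pd

FOREST_HILLS_STOP_ID = "70001"  # assumption: Forest Hills platform id in the vehicle feed

# Trip 'a' starts at Forest Hills; trip 'b' is first seen mid-line.
updates = pd.DataFrame({
    "TripId":    ["a", "a", "b", "b"],
    "day":       [244, 244, 244, 244],
    "UpdatedAt": pd.to_datetime([
        "2018-09-01 07:05", "2018-09-01 07:09",
        "2018-09-01 07:02", "2018-09-01 07:04",
    ]),
    "StopId":    ["70001", "70003", "70013", "70015"],
})

# Sort by time so that "first" really is the earliest update of each trip.
ordered = updates.sort_values("UpdatedAt")
first_stop = ordered.groupby(["TripId", "day"])["StopId"].transform("first")

from_forest_hills = ordered[first_stop == FOREST_HILLS_STOP_ID]
print(from_forest_hills)
```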

The Orange Line stops are retrieved from the MBTA Web API. The unique stop IDs seen in the vehicle updates are then inner-merged with the full MBTA stop list (stops.txt), so that only the stops common to both remain.

In [18]:
url = "https://api-v3.mbta.com/stops?page%5Boffset%5D=0&page%5Blimit%5D=100&filter%5Bdirection_id%5D=1&filter%5Broute%5D=Orange" 

r = requests.get(url)
OrangeStops = json.loads(r.content)
OrangeStops

{'data': [{'attributes': {'address': 'Washington St and Hyde Park Ave, Jamaica Plain, MA 02130',
    'at_street': None,
    'description': None,
    'latitude': 42.300523,
    'location_type': 1,
    'longitude': -71.113686,
    'municipality': 'Boston',
    'name': 'Forest Hills',
    'on_street': None,
    'platform_code': None,
    'platform_name': None,
    'vehicle_type': None,
    'wheelchair_boarding': 1},
   'id': 'place-forhl',
   'links': {'self': '/stops/place-forhl'},
   'relationships': {'child_stops': {},
    'facilities': {'links': {'related': '/facilities/?filter[stop]=place-forhl'}},
    'parent_station': {'data': None},
    'recommended_transfers': {},
    'zone': {'data': {'id': 'CR-zone-1A', 'type': 'zone'}}},
   'type': 'stop'},
  {'attributes': {'address': '150 Green St, Jamaica Plain, MA',
    'at_street': None,
    'description': None,
    'latitude': 42.310525,
    'location_type': 1,
    'longitude': -71.107414,
    'municipality': 'Boston',
    'name': 'Green

In [19]:
for stop in OrangeStops['data']:
    stop['latitude'] = stop['attributes']['latitude']
    stop['longitude'] = stop['attributes']['longitude']
    stop['name'] = stop['attributes']['name']
    stop.pop('links')
    stop.pop('attributes')
    stop.pop('relationships')

OrangeStops['data']

[{'id': 'place-forhl',
  'type': 'stop',
  'latitude': 42.300523,
  'longitude': -71.113686,
  'name': 'Forest Hills'},
 {'id': 'place-grnst',
  'type': 'stop',
  'latitude': 42.310525,
  'longitude': -71.107414,
  'name': 'Green Street'},
 {'id': 'place-sbmnl',
  'type': 'stop',
  'latitude': 42.317062,
  'longitude': -71.104248,
  'name': 'Stony Brook'},
 {'id': 'place-jaksn',
  'type': 'stop',
  'latitude': 42.323132,
  'longitude': -71.099592,
  'name': 'Jackson Square'},
 {'id': 'place-rcmnl',
  'type': 'stop',
  'latitude': 42.331397,
  'longitude': -71.095451,
  'name': 'Roxbury Crossing'},
 {'id': 'place-rugg',
  'type': 'stop',
  'latitude': 42.336377,
  'longitude': -71.088961,
  'name': 'Ruggles'},
 {'id': 'place-masta',
  'type': 'stop',
  'latitude': 42.341512,
  'longitude': -71.083423,
  'name': 'Massachusetts Avenue'},
 {'id': 'place-bbsta',
  'type': 'stop',
  'latitude': 42.34735,
  'longitude': -71.075727,
  'name': 'Back Bay'},
 {'id': 'place-tumnl',
  'type': 'stop

In [20]:
# build the frame directly from the list of flattened stop dicts
stopDf = pd.DataFrame(OrangeStops['data'])
stopDf

Unnamed: 0,id,type,latitude,longitude,name
0,place-forhl,stop,42.300523,-71.113686,Forest Hills
1,place-grnst,stop,42.310525,-71.107414,Green Street
2,place-sbmnl,stop,42.317062,-71.104248,Stony Brook
3,place-jaksn,stop,42.323132,-71.099592,Jackson Square
4,place-rcmnl,stop,42.331397,-71.095451,Roxbury Crossing
5,place-rugg,stop,42.336377,-71.088961,Ruggles
6,place-masta,stop,42.341512,-71.083423,Massachusetts Avenue
7,place-bbsta,stop,42.34735,-71.075727,Back Bay
8,place-tumnl,stop,42.349662,-71.063917,Tufts Medical Center
9,place-chncl,stop,42.352547,-71.062752,Chinatown
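
The flattening above mutates the API response in place. A non-mutating alternative is sketched here, under the assumption that each record carries `id`, `type`, and an `attributes` dict as in the response shown above. The `payload` literal is trimmed to two stops copied from that output, so no network call is needed.

```python
import pandas as pd

# Two records trimmed from the response shown above; the real payload has more fields and stops.
payload = {
    "data": [
        {"id": "place-forhl", "type": "stop",
         "attributes": {"latitude": 42.300523, "longitude": -71.113686, "name": "Forest Hills"}},
        {"id": "place-grnst", "type": "stop",
         "attributes": {"latitude": 42.310525, "longitude": -71.107414, "name": "Green Street"}},
    ]
}

# Build one flat dict per stop, leaving the original payload untouched.
stops = pd.DataFrame([
    {
        "id": stop["id"],
        "type": stop["type"],
        "latitude": stop["attributes"]["latitude"],
        "longitude": stop["attributes"]["longitude"],
        "name": stop["attributes"]["name"],
    }
    for stop in payload["data"]
])
print(stops)
```

Because the payload is not modified, the cell can be re-run without re-fetching the response.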


In [23]:
sIdDf= pd.read_csv('D:/First Sem/IOT/Programming Assignments/IOT-Homework-P3/stops.txt')
sIdDf

Unnamed: 0,stop_id,stop_code,stop_name,stop_desc,platform_code,platform_name,stop_lat,stop_lon,zone_id,stop_url,level_id,location_type,parent_station,wheelchair_boarding,stop_address
0,Boat-Hull,,Hull,,,,42.303251,-70.920215,,,,0,,1,"180 Main St, Hull, MA 02045"
1,Boat-Logan,,Logan Airport Ferry Terminal,,,,42.359789,-71.027340,,,,0,,1,"Harborside Dr, East Boston, MA 02128"
2,Boat-Long,,Long Wharf (North),,,,42.360795,-71.049976,,,,0,,1,"Long Wharf near Christopher Columbus Park, Bos..."
3,Boat-Long-South,,Long Wharf (South),,,,42.359448,-71.050498,,,,0,,1,"Long Wharf near Old Atlantic Ave, Boston, MA 0..."
4,Boat-Charlestown,,Charlestown,,,,42.373334,-71.054160,,,,0,,1,"Pier 4, Boston, MA 02129"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8557,door-prmnl-huntington,,Prudential - Huntington Ave,Prudential - Huntington Ave,,,42.345791,-71.081609,,,level_in_street,2,place-prmnl,2,
8558,door-grnst-main,,"Green Street - Woolsey Sq, Green St, Amory St","Green Street - Woolsey Sq, Green St, Amory St",,,42.310556,-71.107519,,,level_in_street,2,place-grnst,1,
8559,door-sbmnl-main,,Stony Brook - Amory St,Stony Brook - Amory St,,,42.317100,-71.104216,,,level_in_street,2,place-sbmnl,1,
8560,door-dwnxg-summereast,,"Downtown Crossing - Summer St, Washington St","Downtown Crossing - Summer St, Washington St",,,42.355369,-71.060192,,,level_in_street,2,place-dwnxg,2,


In [24]:
# the filtered frame is not assigned, so sIdDf is unchanged (rows with a null stop_code remain)
sIdDf[sIdDf['stop_code'].notnull()]
sIdDf

Unnamed: 0,stop_id,stop_code,stop_name,stop_desc,platform_code,platform_name,stop_lat,stop_lon,zone_id,stop_url,level_id,location_type,parent_station,wheelchair_boarding,stop_address
0,Boat-Hull,,Hull,,,,42.303251,-70.920215,,,,0,,1,"180 Main St, Hull, MA 02045"
1,Boat-Logan,,Logan Airport Ferry Terminal,,,,42.359789,-71.027340,,,,0,,1,"Harborside Dr, East Boston, MA 02128"
2,Boat-Long,,Long Wharf (North),,,,42.360795,-71.049976,,,,0,,1,"Long Wharf near Christopher Columbus Park, Bos..."
3,Boat-Long-South,,Long Wharf (South),,,,42.359448,-71.050498,,,,0,,1,"Long Wharf near Old Atlantic Ave, Boston, MA 0..."
4,Boat-Charlestown,,Charlestown,,,,42.373334,-71.054160,,,,0,,1,"Pier 4, Boston, MA 02129"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8557,door-prmnl-huntington,,Prudential - Huntington Ave,Prudential - Huntington Ave,,,42.345791,-71.081609,,,level_in_street,2,place-prmnl,2,
8558,door-grnst-main,,"Green Street - Woolsey Sq, Green St, Amory St","Green Street - Woolsey Sq, Green St, Amory St",,,42.310556,-71.107519,,,level_in_street,2,place-grnst,1,
8559,door-sbmnl-main,,Stony Brook - Amory St,Stony Brook - Amory St,,,42.317100,-71.104216,,,level_in_street,2,place-sbmnl,1,
8560,door-dwnxg-summereast,,"Downtown Crossing - Summer St, Washington St","Downtown Crossing - Summer St, Washington St",,,42.355369,-71.060192,,,level_in_street,2,place-dwnxg,2,


In [25]:
OrangeStopsDf = pd.DataFrame(df['StopId'].unique())
OrangeStopsDf

Unnamed: 0,0
0,70001
1,70003
2,70005
3,70007
4,70009
5,70011
6,70013
7,70015
8,70017
9,70019


In [26]:
OrangeStopsDf = OrangeStopsDf.merge(sIdDf, left_on = 0, right_on = 'stop_id')
OrangeStopsDf

Unnamed: 0,0,stop_id,stop_code,stop_name,stop_desc,platform_code,platform_name,stop_lat,stop_lon,zone_id,stop_url,level_id,location_type,parent_station,wheelchair_boarding,stop_address
0,70001,70001,70001.0,Forest Hills,Forest Hills - Orange Line,,Orange Line,42.300523,-71.113686,,,level_-1_orange_platform,0,place-forhl,1,
1,70003,70003,70003.0,Green Street,Green Street - Orange Line - Oak Grove,,Oak Grove,42.310525,-71.107414,,,level_-1_platform,0,place-grnst,1,
2,70005,70005,70005.0,Stony Brook,Stony Brook - Orange Line - Oak Grove,,Oak Grove,42.317062,-71.104248,,,level_-1_platform,0,place-sbmnl,1,
3,70007,70007,70007.0,Jackson Square,Jackson Square - Orange Line - Oak Grove,,Oak Grove,42.323132,-71.099592,,,level_-1_platform,0,place-jaksn,1,
4,70009,70009,70009.0,Roxbury Crossing,Roxbury Crossing - Orange Line - Oak Grove,,Oak Grove,42.331397,-71.095451,,,level_-1_platform,0,place-rcmnl,1,
5,70011,70011,70011.0,Ruggles,Ruggles - Orange Line - Oak Grove,,Oak Grove,42.336377,-71.088961,,,level_-1_orange_platform,0,place-rugg,1,
6,70013,70013,70013.0,Massachusetts Avenue,Massachusetts Avenue - Orange Line - Oak Grove,,Oak Grove,42.341512,-71.083423,,,level_-1_platform,0,place-masta,1,
7,70015,70015,70015.0,Back Bay,Back Bay - Orange Line - Oak Grove,,Oak Grove,42.34735,-71.075727,,,level_-1_orange_platform,0,place-bbsta,1,
8,70017,70017,70017.0,Tufts Medical Center,Tufts Medical Center - Orange Line - Oak Grove,,Oak Grove,42.349662,-71.063917,,,level_-2_platform,0,place-tumnl,1,
9,70019,70019,70019.0,Chinatown,Chinatown - Orange Line - Oak Grove,,Oak Grove,42.352547,-71.062752,,,level_-1_platform,0,place-chncl,1,
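
An inner merge silently drops stop IDs that have no match in stops.txt. To see which IDs would be lost, a left merge with `indicator=True` can be used first. The frames below are hypothetical stand-ins, including the made-up id '99999'.

```python
import pandas as pd

# Stop ids seen in the vehicle updates (the last one is invented and has no match).
seen = pd.DataFrame({"StopId": ["70001", "70003", "99999"]})

# A small stand-in for stops.txt.
gtfs_stops = pd.DataFrame({
    "stop_id": ["70001", "70003", "Boat-Hull"],
    "stop_name": ["Forest Hills", "Green Street", "Hull"],
})

# how="left" keeps every observed id; the _merge column records whether a match was found.
matched = seen.merge(gtfs_stops, left_on="StopId", right_on="stop_id",
                     how="left", indicator=True)
unmatched = matched.loc[matched["_merge"] == "left_only", "StopId"].tolist()
print(unmatched)
```

Both key columns should have the same dtype (strings here); an integer column merged against a string column would match nothing.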


In [27]:
stopDf = stopDf.merge(OrangeStopsDf, left_on = 'name', right_on='stop_name')
stopDf

Unnamed: 0,id,type,latitude,longitude,name,0,stop_id,stop_code,stop_name,stop_desc,...,platform_name,stop_lat,stop_lon,zone_id,stop_url,level_id,location_type,parent_station,wheelchair_boarding,stop_address
0,place-forhl,stop,42.300523,-71.113686,Forest Hills,70001,70001,70001.0,Forest Hills,Forest Hills - Orange Line,...,Orange Line,42.300523,-71.113686,,,level_-1_orange_platform,0,place-forhl,1,
1,place-grnst,stop,42.310525,-71.107414,Green Street,70003,70003,70003.0,Green Street,Green Street - Orange Line - Oak Grove,...,Oak Grove,42.310525,-71.107414,,,level_-1_platform,0,place-grnst,1,
2,place-sbmnl,stop,42.317062,-71.104248,Stony Brook,70005,70005,70005.0,Stony Brook,Stony Brook - Orange Line - Oak Grove,...,Oak Grove,42.317062,-71.104248,,,level_-1_platform,0,place-sbmnl,1,
3,place-jaksn,stop,42.323132,-71.099592,Jackson Square,70007,70007,70007.0,Jackson Square,Jackson Square - Orange Line - Oak Grove,...,Oak Grove,42.323132,-71.099592,,,level_-1_platform,0,place-jaksn,1,
4,place-rcmnl,stop,42.331397,-71.095451,Roxbury Crossing,70009,70009,70009.0,Roxbury Crossing,Roxbury Crossing - Orange Line - Oak Grove,...,Oak Grove,42.331397,-71.095451,,,level_-1_platform,0,place-rcmnl,1,
5,place-rugg,stop,42.336377,-71.088961,Ruggles,70011,70011,70011.0,Ruggles,Ruggles - Orange Line - Oak Grove,...,Oak Grove,42.336377,-71.088961,,,level_-1_orange_platform,0,place-rugg,1,
6,place-masta,stop,42.341512,-71.083423,Massachusetts Avenue,70013,70013,70013.0,Massachusetts Avenue,Massachusetts Avenue - Orange Line - Oak Grove,...,Oak Grove,42.341512,-71.083423,,,level_-1_platform,0,place-masta,1,
7,place-bbsta,stop,42.34735,-71.075727,Back Bay,70015,70015,70015.0,Back Bay,Back Bay - Orange Line - Oak Grove,...,Oak Grove,42.34735,-71.075727,,,level_-1_orange_platform,0,place-bbsta,1,
8,place-tumnl,stop,42.349662,-71.063917,Tufts Medical Center,70017,70017,70017.0,Tufts Medical Center,Tufts Medical Center - Orange Line - Oak Grove,...,Oak Grove,42.349662,-71.063917,,,level_-2_platform,0,place-tumnl,1,
9,place-chncl,stop,42.352547,-71.062752,Chinatown,70019,70019,70019.0,Chinatown,Chinatown - Orange Line - Oak Grove,...,Oak Grove,42.352547,-71.062752,,,level_-1_platform,0,place-chncl,1,


In [28]:
stopDf.columns

Index([                 'id',                'type',            'latitude',
                 'longitude',                'name',                     0,
                   'stop_id',           'stop_code',           'stop_name',
                 'stop_desc',       'platform_code',       'platform_name',
                  'stop_lat',            'stop_lon',             'zone_id',
                  'stop_url',            'level_id',       'location_type',
            'parent_station', 'wheelchair_boarding',        'stop_address'],
      dtype='object')

In [29]:
toKeep = ['stop_id', 'latitude', 'longitude', 'name',
          'stop_name', 'platform_name',
          'stop_lat', 'stop_lon',
          'level_id', 'parent_station', 'wheelchair_boarding']

stopDf = stopDf[toKeep].copy()
stopDf


Unnamed: 0,stop_id,latitude,longitude,name,stop_name,platform_name,stop_lat,stop_lon,level_id,parent_station,wheelchair_boarding
0,70001,42.300523,-71.113686,Forest Hills,Forest Hills,Orange Line,42.300523,-71.113686,level_-1_orange_platform,place-forhl,1
1,70003,42.310525,-71.107414,Green Street,Green Street,Oak Grove,42.310525,-71.107414,level_-1_platform,place-grnst,1
2,70005,42.317062,-71.104248,Stony Brook,Stony Brook,Oak Grove,42.317062,-71.104248,level_-1_platform,place-sbmnl,1
3,70007,42.323132,-71.099592,Jackson Square,Jackson Square,Oak Grove,42.323132,-71.099592,level_-1_platform,place-jaksn,1
4,70009,42.331397,-71.095451,Roxbury Crossing,Roxbury Crossing,Oak Grove,42.331397,-71.095451,level_-1_platform,place-rcmnl,1
5,70011,42.336377,-71.088961,Ruggles,Ruggles,Oak Grove,42.336377,-71.088961,level_-1_orange_platform,place-rugg,1
6,70013,42.341512,-71.083423,Massachusetts Avenue,Massachusetts Avenue,Oak Grove,42.341512,-71.083423,level_-1_platform,place-masta,1
7,70015,42.34735,-71.075727,Back Bay,Back Bay,Oak Grove,42.34735,-71.075727,level_-1_orange_platform,place-bbsta,1
8,70017,42.349662,-71.063917,Tufts Medical Center,Tufts Medical Center,Oak Grove,42.349662,-71.063917,level_-2_platform,place-tumnl,1
9,70019,42.352547,-71.062752,Chinatown,Chinatown,Oak Grove,42.352547,-71.062752,level_-1_platform,place-chncl,1


Calculating the distance between each stop and the prior stop (in kilometers)

In [30]:
# pair each stop with the coordinates of the stop before it, as a "lat,lon" string
stopDf['priorStopCoord'] = stopDf['stop_lat'].shift(1).astype('str').str.cat(stopDf['stop_lon'].shift(1).astype('str'), sep=',')

def fixNanVal(row):
    # the first stop has no predecessor; fall back to its own coordinates so its distance is 0
    if row['priorStopCoord'] == 'nan,nan':
        return '{},{}'.format(row['stop_lat'], row['stop_lon'])
    return row['priorStopCoord']

stopDf['priorStopCoord'] = stopDf.apply(fixNanVal, axis=1)

In [31]:
stopDf['distanceFromPrior'] = stopDf.apply(lambda x: distance.distance(x['priorStopCoord'],'{},{}'.format(x['stop_lat'],x['stop_lon'])).km, axis=1)
stopDf

Unnamed: 0,stop_id,latitude,longitude,name,stop_name,platform_name,stop_lat,stop_lon,level_id,parent_station,wheelchair_boarding,priorStopCoord,distanceFromPrior
0,70001,42.300523,-71.113686,Forest Hills,Forest Hills,Orange Line,42.300523,-71.113686,level_-1_orange_platform,place-forhl,1,"42.300523,-71.113686",0.0
1,70003,42.310525,-71.107414,Green Street,Green Street,Oak Grove,42.310525,-71.107414,level_-1_platform,place-grnst,1,"42.300523,-71.113686",1.225477
2,70005,42.317062,-71.104248,Stony Brook,Stony Brook,Oak Grove,42.317062,-71.104248,level_-1_platform,place-sbmnl,1,"42.310525,-71.107414",0.771613
3,70007,42.323132,-71.099592,Jackson Square,Jackson Square,Oak Grove,42.323132,-71.099592,level_-1_platform,place-jaksn,1,"42.317062,-71.104248",0.775841
4,70009,42.331397,-71.095451,Roxbury Crossing,Roxbury Crossing,Oak Grove,42.331397,-71.095451,level_-1_platform,place-rcmnl,1,"42.323132,-71.099592",0.979469
5,70011,42.336377,-71.088961,Ruggles,Ruggles,Oak Grove,42.336377,-71.088961,level_-1_orange_platform,place-rugg,1,"42.331396999999996,-71.095451",0.769482
6,70013,42.341512,-71.083423,Massachusetts Avenue,Massachusetts Avenue,Oak Grove,42.341512,-71.083423,level_-1_platform,place-masta,1,"42.336377,-71.088961",0.730505
7,70015,42.34735,-71.075727,Back Bay,Back Bay,Oak Grove,42.34735,-71.075727,level_-1_orange_platform,place-bbsta,1,"42.341512,-71.083423",0.90703
8,70017,42.349662,-71.063917,Tufts Medical Center,Tufts Medical Center,Oak Grove,42.349662,-71.063917,level_-2_platform,place-tumnl,1,"42.34735,-71.075727",1.006429
9,70019,42.352547,-71.062752,Chinatown,Chinatown,Oak Grove,42.352547,-71.062752,level_-1_platform,place-chncl,1,"42.349662,-71.063917",0.334533


Accumulating the distances from the starting point (in kilometers)

In [32]:
stopDf['distanceFromOrigin'] = stopDf['distanceFromPrior'].cumsum()
stopDf.drop(['priorStopCoord', 'distanceFromPrior'], axis=1, inplace=True)
stopDf

Unnamed: 0,stop_id,latitude,longitude,name,stop_name,platform_name,stop_lat,stop_lon,level_id,parent_station,wheelchair_boarding,distanceFromOrigin
0,70001,42.300523,-71.113686,Forest Hills,Forest Hills,Orange Line,42.300523,-71.113686,level_-1_orange_platform,place-forhl,1,0.0
1,70003,42.310525,-71.107414,Green Street,Green Street,Oak Grove,42.310525,-71.107414,level_-1_platform,place-grnst,1,1.225477
2,70005,42.317062,-71.104248,Stony Brook,Stony Brook,Oak Grove,42.317062,-71.104248,level_-1_platform,place-sbmnl,1,1.99709
3,70007,42.323132,-71.099592,Jackson Square,Jackson Square,Oak Grove,42.323132,-71.099592,level_-1_platform,place-jaksn,1,2.772931
4,70009,42.331397,-71.095451,Roxbury Crossing,Roxbury Crossing,Oak Grove,42.331397,-71.095451,level_-1_platform,place-rcmnl,1,3.7524
5,70011,42.336377,-71.088961,Ruggles,Ruggles,Oak Grove,42.336377,-71.088961,level_-1_orange_platform,place-rugg,1,4.521882
6,70013,42.341512,-71.083423,Massachusetts Avenue,Massachusetts Avenue,Oak Grove,42.341512,-71.083423,level_-1_platform,place-masta,1,5.252387
7,70015,42.34735,-71.075727,Back Bay,Back Bay,Oak Grove,42.34735,-71.075727,level_-1_orange_platform,place-bbsta,1,6.159417
8,70017,42.349662,-71.063917,Tufts Medical Center,Tufts Medical Center,Oak Grove,42.349662,-71.063917,level_-2_platform,place-tumnl,1,7.165846
9,70019,42.352547,-71.062752,Chinatown,Chinatown,Oak Grove,42.352547,-71.062752,level_-1_platform,place-chncl,1,7.500379
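
As a small illustration of what the cumulative sum does, the sketch below reproduces the first four `distanceFromOrigin` values from the per-stop distances using only the standard library. The list name `distance_from_prior` is made up for this example; the numbers are copied from the table above.

```python
from itertools import accumulate

# Distance of each stop from the previous stop, in km
# (Forest Hills, Green Street, Stony Brook, Jackson Square).
distance_from_prior = [0.0, 1.225477, 0.771613, 0.775841]

# Running total: distance of each stop from Forest Hills.
distance_from_origin = [round(d, 6) for d in accumulate(distance_from_prior)]
print(distance_from_origin)
```

The running totals agree with the `distanceFromOrigin` column for the same four stops.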


Merging name and distance information with the vehicle updates

In [33]:
toMerge = stopDf[['stop_id', 'name','distanceFromOrigin','latitude','longitude']].copy()
toMerge['stopLatLong'] = toMerge['latitude'].astype('str').str.cat(toMerge['longitude'].astype('str'), sep=',')
toMerge

Unnamed: 0,stop_id,name,distanceFromOrigin,latitude,longitude,stopLatLong
0,70001,Forest Hills,0.0,42.300523,-71.113686,"42.300523,-71.113686"
1,70003,Green Street,1.225477,42.310525,-71.107414,"42.310525,-71.107414"
2,70005,Stony Brook,1.99709,42.317062,-71.104248,"42.317062,-71.104248"
3,70007,Jackson Square,2.772931,42.323132,-71.099592,"42.323132,-71.099592"
4,70009,Roxbury Crossing,3.7524,42.331397,-71.095451,"42.331397,-71.095451"
5,70011,Ruggles,4.521882,42.336377,-71.088961,"42.336377,-71.088961"
6,70013,Massachusetts Avenue,5.252387,42.341512,-71.083423,"42.341512,-71.083423"
7,70015,Back Bay,6.159417,42.34735,-71.075727,"42.34735,-71.075727"
8,70017,Tufts Medical Center,7.165846,42.349662,-71.063917,"42.349662,-71.063917"
9,70019,Chinatown,7.500379,42.352547,-71.062752,"42.352547,-71.062752"


In [34]:
toMerge = toMerge.drop(['latitude','longitude'],axis=1)
toMerge

Unnamed: 0,stop_id,name,distanceFromOrigin,stopLatLong
0,70001,Forest Hills,0.0,"42.300523,-71.113686"
1,70003,Green Street,1.225477,"42.310525,-71.107414"
2,70005,Stony Brook,1.99709,"42.317062,-71.104248"
3,70007,Jackson Square,2.772931,"42.323132,-71.099592"
4,70009,Roxbury Crossing,3.7524,"42.331397,-71.095451"
5,70011,Ruggles,4.521882,"42.336377,-71.088961"
6,70013,Massachusetts Avenue,5.252387,"42.341512,-71.083423"
7,70015,Back Bay,6.159417,"42.34735,-71.075727"
8,70017,Tufts Medical Center,7.165846,"42.349662,-71.063917"
9,70019,Chinatown,7.500379,"42.352547,-71.062752"


Merging with the original DataFrame on StopId

In [35]:
fDf= fDf.merge(toMerge, left_on = 'StopId', right_on='stop_id')
fDf

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,...,StopId,TripId,RouteId,day,hour,IsSaturday,stop_id,name,distanceFromOrigin,stopLatLong
0,O-5458F8A7,vehicle,2018-10-27 07:05:37-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,70001,38018204,Orange,300,7,True,70001,Forest Hills,0.0000,"42.300523,-71.113686"
1,O-5458F8A7,vehicle,2018-10-27 08:32:08-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,70001,38018199,Orange,300,8,True,70001,Forest Hills,0.0000,"42.300523,-71.113686"
2,O-5458F8A7,vehicle,2018-10-27 11:36:38-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,70001,38018192,Orange,300,11,True,70001,Forest Hills,0.0000,"42.300523,-71.113686"
3,O-5458F8A7,vehicle,2018-10-27 13:10:39-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,70001,38018305,Orange,300,13,True,70001,Forest Hills,0.0000,"42.300523,-71.113686"
4,O-5458F8A7,vehicle,2018-10-27 14:38:24-04:00,0,-71.11384,42.30122,1223,1,1,STOPPED_AT,...,70001,38018283,Orange,300,14,True,70001,Forest Hills,0.0000,"42.300523,-71.113686"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4768,O-54591290,vehicle,2018-10-27 21:13:59-04:00,0,-71.07123,42.43489,1249,1,190,INCOMING_AT,...,70036,38018349,Orange,300,21,True,70036,Oak Grove,17.4349,"42.43668,-71.071097"
4769,O-54591290,vehicle,2018-10-27 22:45:36-04:00,0,-71.07350,42.42861,1249,1,190,IN_TRANSIT_TO,...,70036,38018348,Orange,300,22,True,70036,Oak Grove,17.4349,"42.43668,-71.071097"
4770,O-54591290,vehicle,2018-10-27 22:45:55-04:00,0,-71.07290,42.43014,1249,1,190,IN_TRANSIT_TO,...,70036,38018348,Orange,300,22,True,70036,Oak Grove,17.4349,"42.43668,-71.071097"
4771,O-54591290,vehicle,2018-10-27 22:46:11-04:00,0,-71.07161,42.43342,1249,1,190,INCOMING_AT,...,70036,38018348,Orange,300,22,True,70036,Oak Grove,17.4349,"42.43668,-71.071097"


In [36]:
def distanceFromStop(df):
    if(df['CurrentStatus'] == 'STOPPED_AT'):
        return 0
    return distance.distance(df['stopLatLong'], '{},{}'.format(df['Latitude'],df['Longitude'])).km

fDf['distFromStop'] = fDf.apply(lambda x: distanceFromStop(x),axis=1)
fDf

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,...,TripId,RouteId,day,hour,IsSaturday,stop_id,name,distanceFromOrigin,stopLatLong,distFromStop
0,O-5458F8A7,vehicle,2018-10-27 07:05:37-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,38018204,Orange,300,7,True,70001,Forest Hills,0.0000,"42.300523,-71.113686",0.000000
1,O-5458F8A7,vehicle,2018-10-27 08:32:08-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,38018199,Orange,300,8,True,70001,Forest Hills,0.0000,"42.300523,-71.113686",0.000000
2,O-5458F8A7,vehicle,2018-10-27 11:36:38-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,38018192,Orange,300,11,True,70001,Forest Hills,0.0000,"42.300523,-71.113686",0.000000
3,O-5458F8A7,vehicle,2018-10-27 13:10:39-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,38018305,Orange,300,13,True,70001,Forest Hills,0.0000,"42.300523,-71.113686",0.000000
4,O-5458F8A7,vehicle,2018-10-27 14:38:24-04:00,0,-71.11384,42.30122,1223,1,1,STOPPED_AT,...,38018283,Orange,300,14,True,70001,Forest Hills,0.0000,"42.300523,-71.113686",0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4768,O-54591290,vehicle,2018-10-27 21:13:59-04:00,0,-71.07123,42.43489,1249,1,190,INCOMING_AT,...,38018349,Orange,300,21,True,70036,Oak Grove,17.4349,"42.43668,-71.071097",0.199137
4769,O-54591290,vehicle,2018-10-27 22:45:36-04:00,0,-71.07350,42.42861,1249,1,190,IN_TRANSIT_TO,...,38018348,Orange,300,22,True,70036,Oak Grove,17.4349,"42.43668,-71.071097",0.917979
4770,O-54591290,vehicle,2018-10-27 22:45:55-04:00,0,-71.07290,42.43014,1249,1,190,IN_TRANSIT_TO,...,38018348,Orange,300,22,True,70036,Oak Grove,17.4349,"42.43668,-71.071097",0.741469
4771,O-54591290,vehicle,2018-10-27 22:46:11-04:00,0,-71.07161,42.43342,1249,1,190,INCOMING_AT,...,38018348,Orange,300,22,True,70036,Oak Grove,17.4349,"42.43668,-71.071097",0.364578
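
The cell above relies on a `distance` module imported earlier in the notebook (presumably geopy's, judging by the `.km` attribute; that is an assumption). For readers without that dependency, here is a rough standard-library sketch of the same idea using the haversine formula. It treats the Earth as a sphere, so results differ slightly from a geodesic calculation; the function names are invented for this example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def distance_from_stop_km(status, stop, vehicle):
    """Mirror of distanceFromStop: zero when stopped, otherwise distance to the stop."""
    if status == 'STOPPED_AT':
        return 0.0
    return haversine_km(stop[0], stop[1], vehicle[0], vehicle[1])

# Stop coordinates copied from the stop table.
forest_hills = (42.300523, -71.113686)
green_street = (42.310525, -71.107414)

d = haversine_km(*forest_hills, *green_street)
print(round(d, 3))  # close to the 1.225477 km listed for Green Street
```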


In [37]:
fDf['stopDistance'] = fDf['distanceFromOrigin']
fDf['distanceFromOrigin'] = fDf['stopDistance'] - fDf['distFromStop']
fDf

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,...,RouteId,day,hour,IsSaturday,stop_id,name,distanceFromOrigin,stopLatLong,distFromStop,stopDistance
0,O-5458F8A7,vehicle,2018-10-27 07:05:37-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,Orange,300,7,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000
1,O-5458F8A7,vehicle,2018-10-27 08:32:08-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,Orange,300,8,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000
2,O-5458F8A7,vehicle,2018-10-27 11:36:38-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,Orange,300,11,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000
3,O-5458F8A7,vehicle,2018-10-27 13:10:39-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,Orange,300,13,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000
4,O-5458F8A7,vehicle,2018-10-27 14:38:24-04:00,0,-71.11384,42.30122,1223,1,1,STOPPED_AT,...,Orange,300,14,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4768,O-54591290,vehicle,2018-10-27 21:13:59-04:00,0,-71.07123,42.43489,1249,1,190,INCOMING_AT,...,Orange,300,21,True,70036,Oak Grove,17.235763,"42.43668,-71.071097",0.199137,17.4349
4769,O-54591290,vehicle,2018-10-27 22:45:36-04:00,0,-71.07350,42.42861,1249,1,190,IN_TRANSIT_TO,...,Orange,300,22,True,70036,Oak Grove,16.516922,"42.43668,-71.071097",0.917979,17.4349
4770,O-54591290,vehicle,2018-10-27 22:45:55-04:00,0,-71.07290,42.43014,1249,1,190,IN_TRANSIT_TO,...,Orange,300,22,True,70036,Oak Grove,16.693431,"42.43668,-71.071097",0.741469,17.4349
4771,O-54591290,vehicle,2018-10-27 22:46:11-04:00,0,-71.07161,42.43342,1249,1,190,INCOMING_AT,...,Orange,300,22,True,70036,Oak Grove,17.070322,"42.43668,-71.071097",0.364578,17.4349


Calculating the elapsed time from the beginning of the trip

In [38]:
grouping = ['TripId','day']
fDf['elapsed'] = fDf.groupby(grouping)['UpdatedAt'].transform(lambda x: x-x.min())
fDf

Unnamed: 0,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,...,day,hour,IsSaturday,stop_id,name,distanceFromOrigin,stopLatLong,distFromStop,stopDistance,elapsed
0,O-5458F8A7,vehicle,2018-10-27 07:05:37-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,300,7,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000,00:00:00
1,O-5458F8A7,vehicle,2018-10-27 08:32:08-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,300,8,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000,00:00:00
2,O-5458F8A7,vehicle,2018-10-27 11:36:38-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,300,11,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000,00:00:00
3,O-5458F8A7,vehicle,2018-10-27 13:10:39-04:00,0,-71.11376,42.30119,1223,1,1,STOPPED_AT,...,300,13,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000,00:00:00
4,O-5458F8A7,vehicle,2018-10-27 14:38:24-04:00,0,-71.11384,42.30122,1223,1,1,STOPPED_AT,...,300,14,True,70001,Forest Hills,0.000000,"42.300523,-71.113686",0.000000,0.0000,00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4768,O-54591290,vehicle,2018-10-27 21:13:59-04:00,0,-71.07123,42.43489,1249,1,190,INCOMING_AT,...,300,21,True,70036,Oak Grove,17.235763,"42.43668,-71.071097",0.199137,17.4349,00:41:57
4769,O-54591290,vehicle,2018-10-27 22:45:36-04:00,0,-71.07350,42.42861,1249,1,190,IN_TRANSIT_TO,...,300,22,True,70036,Oak Grove,16.516922,"42.43668,-71.071097",0.917979,17.4349,00:45:44
4770,O-54591290,vehicle,2018-10-27 22:45:55-04:00,0,-71.07290,42.43014,1249,1,190,IN_TRANSIT_TO,...,300,22,True,70036,Oak Grove,16.693431,"42.43668,-71.071097",0.741469,17.4349,00:46:03
4771,O-54591290,vehicle,2018-10-27 22:46:11-04:00,0,-71.07161,42.43342,1249,1,190,INCOMING_AT,...,300,22,True,70036,Oak Grove,17.070322,"42.43668,-71.071097",0.364578,17.4349,00:46:19


In [39]:
tripTime = fDf.groupby(grouping)['elapsed'].max()
tripTime.head()

TripId    day
38018123  300   00:43:52
38018127  300   00:45:18
38018128  300   00:42:04
38018131  300   00:43:53
38018177  300   00:42:15
Name: elapsed, dtype: timedelta64[ns]
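
For reference, the per-trip elapsed time can also be written without a Python lambda by subtracting the group minimum. The sketch below uses a tiny made-up DataFrame (two hypothetical trips) and should give the same result as the `transform(lambda x: x - x.min())` call above.

```python
import pandas as pd

# Toy data: two hypothetical trips on the same day.
updates = pd.DataFrame({
    'TripId': [1, 1, 1, 2, 2],
    'day': [300, 300, 300, 300, 300],
    'UpdatedAt': pd.to_datetime([
        '2018-10-27 07:00:18', '2018-10-27 07:01:12', '2018-10-27 07:02:54',
        '2018-10-27 08:30:00', '2018-10-27 08:31:30',
    ]),
})

# Earliest update of each (TripId, day) group, broadcast back to every row.
trip_start = updates.groupby(['TripId', 'day'])['UpdatedAt'].transform('min')
updates['elapsed'] = updates['UpdatedAt'] - trip_start

elapsed_seconds = updates['elapsed'].dt.total_seconds().tolist()
print(elapsed_seconds)
```

Each trip starts at zero, and the remaining rows count up from that trip's own first update.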

# Scatter Chart

In [40]:
py.offline.init_notebook_mode(connected=True)

In [41]:
def createTripTrace(distance, time, name, line=0.06):
    return go.Scattergl(
        x=time,
        y=distance,
        name=name,
        mode='lines',
        line = dict(
            color = ('rgb(5,40,205)'),
            width = line
        )
    )

Overlaying of Stops with stopnames and Distance from Origin

In [42]:
stops = stopDf[['name','distanceFromOrigin']]

In [43]:
def createStopTrace(distance, name):
    return go.Scattergl(
        y = [distance,distance],
        x = [0,0.8], 
        name = name,
        mode = 'lines',
        line = dict(
            color = ('rgb(255,40,0)'),
            width = 0.5
        
        )
    )

def createStopAnnotation(stoptrace):
    return dict(xref = 'paper',
               x=0.85,
               y = stoptrace['y'][0],
               xanchor = 'left',
                yanchor = 'bottom',
                text = stoptrace['name'],
                font = dict(family = 'Arial',
                           size = 12),
                showarrow = False)

def createStopTraces(traceData, stops, annotations):
    temp = stops.apply(lambda x: createStopTrace(x['distanceFromOrigin'],x['name']),axis=1)
    stopTraces = list(temp.values)
    traceData.extend(stopTraces)
    for stopTrace in stopTraces:
        annotationTrace = createStopAnnotation(stopTrace)
        annotations.append(annotationTrace)

In [44]:
def plotDistVerseTime(stopDf, updatesDf, maxDisplay = -1, lineWidth=0.1):  
    data = []
    annotations = []
    
    count = 1
    createStopTraces(data, stopDf, annotations)

    for name, group in updatesDf.groupby(grouping):
        hourList = group.apply(lambda x: x['elapsed'].total_seconds()/3600, axis=1)
        trace = createTripTrace(group['distanceFromOrigin'],hourList,'{}'.format(name),lineWidth)
        data.append(trace)
        if count == maxDisplay:
            break
        count += 1

    layout = dict(title='Trips On Orange Line',
                 xaxis =dict(
                     title='Elapsed time (hr)'
                 ),
                 yaxis = dict(
                     title= 'Distance travelled (km)'
                 ),
                  showlegend = False,
                  annotations = annotations
                 )

    figure = dict(data=data, layout = layout)
    py.offline.iplot(figure)

In [45]:
plotDistVerseTime(stops, fDf, 50, 0.5)

In [46]:
# Keep only the STOPPED_AT updates at Forest Hills, plus every update for the other stops
trimDf = fDf[((fDf['CurrentStatus'] == 'STOPPED_AT') & (fDf['name'] == 'Forest Hills'))
            | (fDf['name'] != 'Forest Hills')].sort_values(['Id','UpdatedAt'])

grouping = ['TripId', 'day']
trimDf['elapsed'] = trimDf.groupby(grouping)['UpdatedAt'].transform(lambda x: x-x.min())

plotDistVerseTime(stops, trimDf, 500, 0.5)

# Filter out each trip's vehicle updates from before the train departs Forest Hills

In [47]:
departDf = trimDf = trimDf[trimDf['name'] != 'Forest Hills'].copy()
departDf['elapsed'] = departDf.groupby(grouping)['UpdatedAt'].transform(lambda x: x-x.min())

plotDistVerseTime(stops, departDf, 500, 0.5)

# Filtering trips that do not start at Forest Hills

In [48]:
departDf = departDf.sort_values(['UpdatedAt'])
toRemove = departDf.groupby(grouping).first()['name']
toRemove

TripId    day
38018123  300        Green Street
38018127  300        Green Street
38018128  300        Green Street
38018131  300        Green Street
38018177  300        Green Street
                       ...       
38018361  300        Green Street
38018362  300    Roxbury Crossing
38018364  300        Green Street
38018365  300        Green Street
38018367  300        Green Street
Name: name, Length: 84, dtype: object

In [49]:
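# Despite its name, toRemove now holds the trips to keep: those whose first update
# after leaving Forest Hills is for Green Street (the next stop on the line)
toRemove = pd.DataFrame(toRemove[toRemove == 'Green Street'])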
startForestHillsDf = departDf.set_index(grouping)
startForestHillsDf = startForestHillsDf.merge(toRemove, on=['TripId','day'])

startForestHillsDf['elapsed'] = startForestHillsDf.groupby(grouping)['UpdatedAt'].transform(lambda x: x- x.min())

plotDistVerseTime(stops, startForestHillsDf, 500,0.5)
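
The same filter can be expressed without a merge, which avoids the `name_x`/`name_y` columns that the merge introduces. This is only an alternative sketch on a toy DataFrame, not the method used in the notebook; the trip ids and timestamps are made up.

```python
import pandas as pd

# Toy data: trip 1 is first seen heading to Green Street, trip 2 is first seen further up the line.
updates = pd.DataFrame({
    'TripId': [1, 1, 2, 2],
    'day': [300, 300, 300, 300],
    'UpdatedAt': pd.to_datetime([
        '2018-10-27 07:00', '2018-10-27 07:02',
        '2018-10-27 07:10', '2018-10-27 07:12',
    ]),
    'name': ['Green Street', 'Stony Brook', 'Roxbury Crossing', 'Ruggles'],
})

# Sort by time so that 'first' really is the earliest update of each trip.
updates = updates.sort_values('UpdatedAt')
first_stop = updates.groupby(['TripId', 'day'])['name'].transform('first')

# Keep only rows belonging to trips whose first recorded stop is Green Street.
kept = updates[first_stop == 'Green Street']
kept_trips = sorted(kept['TripId'].unique().tolist())
print(kept_trips)
```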

# CONSTRAINT 6 & CONSTRAINT 7

The 'elapsed' time is converted to hours below, and 'distanceFromOrigin' is already in kilometers.

In [50]:
startForestHillsDf

TripId,day,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,...,hour,IsSaturday,stop_id,name_x,distanceFromOrigin,stopLatLong,distFromStop,stopDistance,elapsed,name_y
38018208,300,O-54590A93,vehicle,2018-10-27 07:00:18-04:00,0,-71.11077,42.30540,1225,1,10,INCOMING_AT,...,7,True,70003,Green Street,0.592511,"42.310525,-71.107414",0.632966,1.225477,00:00:00,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:01:12-04:00,0,-71.10782,42.31001,1225,1,10,STOPPED_AT,...,7,True,70003,Green Street,1.225477,"42.310525,-71.107414",0.000000,1.225477,00:00:54,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:01:31-04:00,0,-71.10670,42.31184,1225,1,20,INCOMING_AT,...,7,True,70005,Stony Brook,1.382819,"42.317062,-71.104248",0.614271,1.997090,00:01:13,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:02:25-04:00,0,-71.10554,42.31423,1225,1,20,INCOMING_AT,...,7,True,70005,Stony Brook,1.664970,"42.317062,-71.104248",0.332120,1.997090,00:02:07,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:02:54-04:00,0,-71.10443,42.31682,1225,1,20,STOPPED_AT,...,7,True,70005,Stony Brook,1.997090,"42.317062,-71.104248",0.000000,1.997090,00:02:36,Green Street
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38018354,300,O-54591079,vehicle,2018-10-27 22:57:32-04:00,0,-71.07030,42.37384,1273,1,140,STOPPED_AT,...,22,True,70029,Community College,10.237856,"42.373622,-71.06953299999999",0.000000,10.237856,00:29:03,Green Street
38018354,300,O-54591079,vehicle,2018-10-27 22:58:40-04:00,0,-71.07276,42.37523,1273,1,150,IN_TRANSIT_TO,...,22,True,70031,Sullivan Square,10.509651,"42.383975,-71.076994",1.032089,11.541741,00:30:11,Green Street
38018354,300,O-54591079,vehicle,2018-10-27 22:59:01-04:00,0,-71.07419,42.37654,1273,1,150,INCOMING_AT,...,22,True,70031,Sullivan Square,10.684179,"42.383975,-71.076994",0.857562,11.541741,00:30:32,Green Street
38018354,300,O-54591079,vehicle,2018-10-27 22:59:17-04:00,0,-71.07633,42.37898,1273,1,150,INCOMING_AT,...,22,True,70031,Sullivan Square,10.984204,"42.383975,-71.076994",0.557536,11.541741,00:30:48,Green Street


In [51]:
startForestHillsDf['distanceFromOrigin']

TripId    day
38018208  300     0.592511
          300     1.225477
          300     1.382819
          300     1.664970
          300     1.997090
                   ...    
38018354  300    10.237856
          300    10.509651
          300    10.684179
          300    10.984204
          300    10.984204
Name: distanceFromOrigin, Length: 4561, dtype: float64

In [52]:
startForestHillsDf['elapsed'] =pd.to_timedelta(startForestHillsDf['elapsed']).dt.total_seconds()/3600
startForestHillsDf

TripId,day,Id,Type,UpdatedAt,Speed,Longitude,Latitude,Label,DirectionId,CurrentStopSequence,CurrentStatus,...,hour,IsSaturday,stop_id,name_x,distanceFromOrigin,stopLatLong,distFromStop,stopDistance,elapsed,name_y
38018208,300,O-54590A93,vehicle,2018-10-27 07:00:18-04:00,0,-71.11077,42.30540,1225,1,10,INCOMING_AT,...,7,True,70003,Green Street,0.592511,"42.310525,-71.107414",0.632966,1.225477,0.000000,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:01:12-04:00,0,-71.10782,42.31001,1225,1,10,STOPPED_AT,...,7,True,70003,Green Street,1.225477,"42.310525,-71.107414",0.000000,1.225477,0.015000,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:01:31-04:00,0,-71.10670,42.31184,1225,1,20,INCOMING_AT,...,7,True,70005,Stony Brook,1.382819,"42.317062,-71.104248",0.614271,1.997090,0.020278,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:02:25-04:00,0,-71.10554,42.31423,1225,1,20,INCOMING_AT,...,7,True,70005,Stony Brook,1.664970,"42.317062,-71.104248",0.332120,1.997090,0.035278,Green Street
38018208,300,O-54590A93,vehicle,2018-10-27 07:02:54-04:00,0,-71.10443,42.31682,1225,1,20,STOPPED_AT,...,7,True,70005,Stony Brook,1.997090,"42.317062,-71.104248",0.000000,1.997090,0.043333,Green Street
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38018354,300,O-54591079,vehicle,2018-10-27 22:57:32-04:00,0,-71.07030,42.37384,1273,1,140,STOPPED_AT,...,22,True,70029,Community College,10.237856,"42.373622,-71.06953299999999",0.000000,10.237856,0.484167,Green Street
38018354,300,O-54591079,vehicle,2018-10-27 22:58:40-04:00,0,-71.07276,42.37523,1273,1,150,IN_TRANSIT_TO,...,22,True,70031,Sullivan Square,10.509651,"42.383975,-71.076994",1.032089,11.541741,0.503056,Green Street
38018354,300,O-54591079,vehicle,2018-10-27 22:59:01-04:00,0,-71.07419,42.37654,1273,1,150,INCOMING_AT,...,22,True,70031,Sullivan Square,10.684179,"42.383975,-71.076994",0.857562,11.541741,0.508889,Green Street
38018354,300,O-54591079,vehicle,2018-10-27 22:59:17-04:00,0,-71.07633,42.37898,1273,1,150,INCOMING_AT,...,22,True,70031,Sullivan Square,10.984204,"42.383975,-71.076994",0.557536,11.541741,0.513333,Green Street


The 'elapsed' column is now a float expressed in hours, while 'UpdatedAt' remains a timezone-aware timestamp.
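
As a quick sanity check on the hour conversion, the sketch below applies the same arithmetic (seconds divided by 3600) to two elapsed values from the first trip in the table, 00:00:54 and 00:02:36. It uses the standard library's `timedelta` rather than pandas; `to_hours` is a helper name invented for this example.

```python
from datetime import timedelta

def to_hours(delta):
    """Convert a timedelta to a float number of hours."""
    return delta.total_seconds() / 3600

first = to_hours(timedelta(seconds=54))              # 00:00:54
second = to_hours(timedelta(minutes=2, seconds=36))  # 00:02:36
print(round(first, 6), round(second, 6))
```

The two values correspond to the 0.015000 and 0.043333 entries in the converted table.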

Next, separate the dependent and independent variables, split the data into training and testing sets, and fit the model.

In [53]:
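# Select columns by name rather than by position so the code survives column reordering
x = startForestHillsDf[['distanceFromOrigin']].values  # independent variable: distance from origin (km), kept 2-D for scikit-learn
y = startForestHillsDf['elapsed'].values  # dependent variable: elapsed time (hr)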


In [54]:
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = .20, random_state = 0)


In [55]:
regressor = LinearRegression()
regressor.fit(xTrain, yTrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
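
To make the fit less of a black box: for a single feature, ordinary least squares has a closed-form solution, with the slope equal to the covariance of x and y divided by the variance of x. The sketch below computes it in plain Python on made-up points (a hypothetical train moving at 30 km/h with a 0.01 h offset); it illustrates the technique and is not the scikit-learn implementation.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up data: distance in km, elapsed time in hours.
xs = [0.0, 1.5, 3.0, 6.0]
ys = [0.01 + x / 30 for x in xs]

slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)
```

Here the slope is in hours per kilometer, so its reciprocal can be read as an average speed in km/h.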

In [56]:
yPred = regressor.predict(xTest)

In [57]:
def plotSimpleLinearRegression(xTrain,yTrain,xTest,yTest, regressor):
    traceTrain = go.Scattergl(
        x = xTrain.ravel(),
        y = yTrain.ravel(),
        mode = 'markers',
        name = 'Training Points'
    )

    traceTest = go.Scattergl(
        x = xTest.ravel(),
        y = yTest.ravel(),
        mode = 'markers',
        name = 'Testing Points'
    )

    tracePred = go.Scattergl(
        x = xTrain.ravel(), 
        y = regressor.predict(xTrain).ravel(),
        mode = 'lines+markers',
        name = 'Prediction Line'
    )
    
    data = [traceTrain, traceTest, tracePred]
    
    layout = dict(
        title = 'Linear Regression Model',
        xaxis = dict(
                title = 'Distance (km)'
        ),
        yaxis = dict(
            title = 'Elapsed time (hr)'
        ),
    )
    
    fig = go.Figure(data=data, layout=layout)
    py.offline.iplot(fig)

In [58]:
plotSimpleLinearRegression(xTrain, yTrain, xTest, yTest, regressor)

# Evaluating the performance

In [59]:
def createPredictionDf(yPred,yTest):
    yDf = pd.DataFrame()
    yDf['yPred'] = yPred
    yDf['yActual'] = yTest
    yDf['yDiff'] = (yDf['yPred'] - yDf['yActual']).abs()
    return yDf

def predictionAccuracyMinutes(df, minutes):
    # yDiff is in hours (the unit of 'elapsed'), so convert the threshold from minutes to hours.
    # Comparing yDiff directly against minutes reports 100% at every threshold here,
    # because every error in the test set is under one hour.
    totalCount = len(df.index)
    countWithin = len(df[df['yDiff'] < minutes / 60])
    print('Accuracy of prediction within {} minutes: {}%'.format(minutes, countWithin/totalCount*100))

def printPredictionAccuracy(yPred,yTest):
    yDf = createPredictionDf(yPred,yTest)
    predictionAccuracyMinutes(yDf, 1)
    predictionAccuracyMinutes(yDf, 2)
    predictionAccuracyMinutes(yDf, 3)
    predictionAccuracyMinutes(yDf, 5)
    
printPredictionAccuracy(yPred,yTest)


RMSE of the predictions (in hours)

In [60]:
yPredBaseReg = yPred
yTestBaseReg = yTest
np.sqrt(metrics.mean_squared_error(yTestBaseReg, yPredBaseReg))

0.0319932052851892
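
Because the target is measured in hours, the RMSE above is also in hours. The sketch below spells out the RMSE formula in plain Python and converts the reported value to minutes; the `rmse` helper and the toy inputs are made up for illustration, and only `rmse_hours` is taken from the output above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy check of the formula: errors of 3 and 4 give sqrt((9 + 16) / 2).
toy = rmse([0.0, 0.0], [3.0, 4.0])

# Value reported by the notebook, in hours, converted to minutes.
rmse_hours = 0.0319932052851892
rmse_minutes = rmse_hours * 60
print(round(rmse_minutes, 2))
```

In minutes, the typical prediction error comes to a little under two minutes.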