# Generate planned arrival time TIMETABLE for each STOPPOINTID of a LINEID
**This notebook generates a timetable CSV file (saved as Jan_trips_arr_TIMETABLE.csv) with schema LINEID | DIRECTION | STOPPOINTID | PROGRNUMBER | PLANNEDTIME_ARR**
* **PLANNEDTIME_ARR** column has an entry for each ferry in a day at each STOPPOINTID, floored to minute resolution (values remain expressed in seconds)

In [1]:
import numpy as np
import pandas as pd
from warnings import filterwarnings
filterwarnings('ignore')

## Source of Truth gathered in prior EDA
* **TRIPID**
    * **TRIPIDs are shared across days for a particular LINEID on a particular ROUTEID**
    * For a single TRIPID, only one corresponding LINEID and ROUTEID are present
    * TRIPID for inbound and outbound directions are mutually exclusive
    * For certain TRIPIDs, PROGRNUMBER does not start at 1 as it should, which means some entries are missing.
* **LINEID**
    * **For a LINEID; ROUTEIDs and TRIPIDs used for DIFFERENT DIRECTION are mutually exclusive**
* **ROUTEID**
    * A LINEID has multiple ROUTEIDs
    * A LINEID may or may not have trips in both the INBOUND-going (1) and OUTBOUND-returning (2) directions
    * **EVEN WHEN 2 ROUTES FOR A LINEID HAVE THE SAME DIRECTION, THE STOPPOINTIDs VISITED ARE DIFFERENT**
    * **FOR A SINGLE ROUTE, THE STOPS VISITED DURING DIFFERENT TRIPS ARE DIFFERENT**
    * **ROUTEIDs ARE NOT KNOWN TO THE USER & ROUTEIDs ARE NOT TIMEBOUND, i.e. NO timetable exists for following a ROUTEID**
    * **The majority of data for a LINEID is tied to a particular ROUTEID**
* **PLANNED ARRIVAL AND DEPARTURE TIMES ARE SAME FOR ALL ENTRIES; HENCE "PLANNEDTIME_DEP" CAN BE DROPPED**    
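
A minimal sketch of how a few of the findings above could be re-checked. The frame below is a toy stand-in with invented values (the raw trips data and its ROUTEID strings are not shown in this notebook), so the column contents are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the raw trip data; every value here is invented.
toy = pd.DataFrame({
    "TRIPID":          [101, 101, 101, 202, 202, 303, 303],
    "LINEID":          ["41", "41", "41", "41", "41", "66", "66"],
    "ROUTEID":         ["41_3", "41_3", "41_3", "41_7", "41_7", "66_11", "66_11"],
    "DIRECTION":       [1, 1, 1, 2, 2, 1, 1],
    "PROGRNUMBER":     [1, 2, 3, 3, 4, 1, 2],
    "PLANNEDTIME_ARR": [23700, 23799, 23860, 30000, 30090, 40000, 40120],
    "PLANNEDTIME_DEP": [23700, 23799, 23860, 30000, 30090, 40000, 40120],
})

# Each TRIPID should map to exactly one LINEID and one ROUTEID.
ids_per_trip = toy.groupby("TRIPID")[["LINEID", "ROUTEID"]].nunique()
one_line_and_route = bool((ids_per_trip == 1).all().all())

# TRIPIDs used in direction 1 and direction 2 should not overlap.
dir1 = set(toy.loc[toy["DIRECTION"] == 1, "TRIPID"])
dir2 = set(toy.loc[toy["DIRECTION"] == 2, "TRIPID"])
directions_disjoint = dir1.isdisjoint(dir2)

# Trips whose first recorded PROGRNUMBER is above 1 are missing their opening stops.
first_prog = toy.groupby("TRIPID")["PROGRNUMBER"].min()
trips_missing_start = first_prog[first_prog > 1].index.tolist()

# PLANNEDTIME_DEP is redundant only if it equals PLANNEDTIME_ARR on every row.
dep_is_redundant = bool((toy["PLANNEDTIME_ARR"] == toy["PLANNEDTIME_DEP"]).all())

print(one_line_and_route, directions_disjoint, trips_missing_start, dep_is_redundant)
```

In this toy frame only TRIPID 202 starts at a PROGRNUMBER above 1, so it is the only trip reported as missing its opening stops.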

## Generate TIMETABLE for PLANNEDTIME_ARR
* **Each LINEID has a fixed number of ferries in a DAY**
* **Each ferry is represented by a specific TRIPID for that day**
* **HENCE IN THE IDEAL SITUATION: FOR A LINEID IN A DAY, COUNT of UNIQUE (PLANNEDTIME_ARR) == COUNT of UNIQUE (FERRIES IN A DAY)**
<br>

**BUT THE ACTUAL DATA ENTRIES SHOW THAT THIS IS NOT THE CASE. PLANNEDTIME_ARR HAS TWO TO THREE TIMES AS MANY UNIQUE ENTRIES AS THERE ARE FERRIES IN A DAY. WE ARE NEGLECTING THIS DISCREPANCY AND CONSIDERING ALL PLANNEDTIME_ARR ENTRIES WHEN MAKING THE TIMETABLE**
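
The comparison described above can be sketched as follows. This is an illustration on invented data: the `DAYOFSERVICE` column name and all values are assumptions, since the file read in this notebook carries neither TRIPID nor a date column.

```python
import pandas as pd

# Toy data for one stop of one line over two service days (invented values).
toy = pd.DataFrame({
    "DAYOFSERVICE":    ["d1", "d1", "d1", "d2", "d2", "d2"],
    "TRIPID":          [1, 2, 3, 1, 2, 3],
    "STOPPOINTID":     [1172] * 6,
    "PLANNEDTIME_ARR": [23799, 27399, 30699, 23799, 27420, 30699],
})

# Number of ferries (unique TRIPIDs) on each day.
trips_per_day = toy.groupby("DAYOFSERVICE")["TRIPID"].nunique()
max_trips = int(trips_per_day.max())

# Number of distinct planned arrival times seen across all days.
unique_planned = int(toy["PLANNEDTIME_ARR"].nunique())

# In the ideal situation these two counts would match.
print(max_trips, unique_planned)
```

Here the second trip is planned at 27399 on one day and 27420 on the other, so three ferries per day produce four distinct planned times, which is the kind of mismatch reported above.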

## Read df_Jan merged with df_trips on 'TRIPID'

In [2]:
df_Jan_trips_arr = pd.read_csv("../DB/ML/Jan_Trips_arr.csv")

In [3]:
df_Jan_trips_arr.head()

Unnamed: 0,year,month,dayofweek_num,quarter,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR,ACTUALTIME_ARR
0,2018,1,0,1,41,1,1172,3,23799,23863
1,2018,1,0,1,41,1,1173,4,23860,23934
2,2018,1,0,1,41,1,1174,5,23937,24114
3,2018,1,0,1,41,1,1175,6,24048,24180
4,2018,1,0,1,41,1,15,7,24130,24227


In [4]:
df_Jan_trips_arr['year'] = df_Jan_trips_arr['year'].astype('category')
df_Jan_trips_arr['month'] = df_Jan_trips_arr['month'].astype('category')
df_Jan_trips_arr['dayofweek_num'] = df_Jan_trips_arr['dayofweek_num'].astype('category')
df_Jan_trips_arr['quarter'] = df_Jan_trips_arr['quarter'].astype('category')
df_Jan_trips_arr['PROGRNUMBER'] = df_Jan_trips_arr['PROGRNUMBER'].astype('category')
df_Jan_trips_arr['STOPPOINTID'] = df_Jan_trips_arr['STOPPOINTID'].astype('category')
df_Jan_trips_arr['PLANNEDTIME_ARR'] = pd.to_numeric(df_Jan_trips_arr['PLANNEDTIME_ARR'], downcast='integer', errors='coerce')
df_Jan_trips_arr['ACTUALTIME_ARR'] = pd.to_numeric(df_Jan_trips_arr['ACTUALTIME_ARR'], downcast='integer', errors='coerce')
df_Jan_trips_arr['LINEID'] = df_Jan_trips_arr['LINEID'].astype('category')
df_Jan_trips_arr['DIRECTION'] = df_Jan_trips_arr['DIRECTION'].astype('category')

### Object types and valid entries

In [5]:
# show_counts replaces null_counts, which was removed in pandas 2.0
print(df_Jan_trips_arr.info(show_counts=True))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9600545 entries, 0 to 9600544
Data columns (total 10 columns):
 #   Column           Non-Null Count    Dtype   
---  ------           --------------    -----   
 0   year             9600545 non-null  category
 1   month            9600545 non-null  category
 2   dayofweek_num    9600545 non-null  category
 3   quarter          9600545 non-null  category
 4   LINEID           9600545 non-null  category
 5   DIRECTION        9600545 non-null  category
 6   STOPPOINTID      9600545 non-null  category
 7   PROGRNUMBER      9600545 non-null  category
 8   PLANNEDTIME_ARR  9600545 non-null  int32   
 9   ACTUALTIME_ARR   9600545 non-null  int32   
dtypes: category(8), int32(2)
memory usage: 155.9 MB
None


### Nature of data

In [6]:
df_Jan_trips_arr.head(100)

Unnamed: 0,year,month,dayofweek_num,quarter,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR,ACTUALTIME_ARR
0,2018,1,0,1,41,1,1172,3,23799,23863
1,2018,1,0,1,41,1,1173,4,23860,23934
2,2018,1,0,1,41,1,1174,5,23937,24114
3,2018,1,0,1,41,1,1175,6,24048,24180
4,2018,1,0,1,41,1,15,7,24130,24227
...,...,...,...,...,...,...,...,...,...,...
95,2018,1,0,1,66,1,3958,47,27943,28473
96,2018,1,0,1,66,1,3959,48,27990,28508
97,2018,1,0,1,66,1,3960,49,28024,28534
98,2018,1,0,1,66,1,3961,50,28058,28551


### Unique entries

In [7]:
df_Jan_trips_arr.nunique()

year                   1
month                  1
dayofweek_num          7
quarter                1
LINEID               126
DIRECTION              2
STOPPOINTID         4602
PROGRNUMBER          102
PLANNEDTIME_ARR    67749
ACTUALTIME_ARR     70065
dtype: int64

## Confirming variations in PLANNEDTIME_ARR at different STOPPOINTIDs
**PLANNEDTIME_ARR SHOULD FOLLOW A FIXED SCHEDULE FOR EACH FERRY, BUT THE EXAMPLE BELOW DEMONSTRATES THAT THIS IS NOT THE CASE**

### FILTER ENTRIES FOR LINE 41 DIRECTION 1

In [8]:
df_Jan_trips_arr_41_1 = df_Jan_trips_arr.loc[(df_Jan_trips_arr["LINEID"] == "41") & 
                                                 (df_Jan_trips_arr["DIRECTION"] == 1)].copy()

#### NATURE OF DATA

In [9]:
df_Jan_trips_arr_41_1

Unnamed: 0,year,month,dayofweek_num,quarter,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR,ACTUALTIME_ARR
0,2018,1,0,1,41,1,1172,3,23799,23863
1,2018,1,0,1,41,1,1173,4,23860,23934
2,2018,1,0,1,41,1,1174,5,23937,24114
3,2018,1,0,1,41,1,1175,6,24048,24180
4,2018,1,0,1,41,1,15,7,24130,24227
...,...,...,...,...,...,...,...,...,...,...
9592293,2018,1,2,1,41,1,4912,47,84198,83984
9592294,2018,1,2,1,41,1,4913,48,84233,84020
9592295,2018,1,2,1,41,1,4914,49,84257,84037
9592296,2018,1,2,1,41,1,4915,50,84287,84073


#### OBJECT TYPES AND VALID ENTRIES

In [10]:
df_Jan_trips_arr_41_1.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71010 entries, 0 to 9592297
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   year             71010 non-null  category
 1   month            71010 non-null  category
 2   dayofweek_num    71010 non-null  category
 3   quarter          71010 non-null  category
 4   LINEID           71010 non-null  category
 5   DIRECTION        71010 non-null  category
 6   STOPPOINTID      71010 non-null  category
 7   PROGRNUMBER      71010 non-null  category
 8   PLANNEDTIME_ARR  71010 non-null  int32   
 9   ACTUALTIME_ARR   71010 non-null  int32   
dtypes: category(8), int32(2)
memory usage: 1.9 MB


#### UNIQUE ENTRIES

In [11]:
df_Jan_trips_arr_41_1.nunique()

year                   1
month                  1
dayofweek_num          7
quarter                1
LINEID                 1
DIRECTION              1
STOPPOINTID           51
PROGRNUMBER           51
PLANNEDTIME_ARR    14529
ACTUALTIME_ARR     42448
dtype: int64

### CONVERT PLANNEDTIME_ARR TO MINUTE RESOLUTION

In [12]:
df_Jan_trips_arr_41_1['PLANNEDTIME_ARR_MIN'] = df_Jan_trips_arr_41_1['PLANNEDTIME_ARR'] - df_Jan_trips_arr_41_1['PLANNEDTIME_ARR'].mod(60)

In [13]:
df_Jan_trips_arr_41_1.nunique()

year                       1
month                      1
dayofweek_num              7
quarter                    1
LINEID                     1
DIRECTION                  1
STOPPOINTID               51
PROGRNUMBER               51
PLANNEDTIME_ARR        14529
ACTUALTIME_ARR         42448
PLANNEDTIME_ARR_MIN     1117
dtype: int64

In [14]:
df_Jan_trips_arr_41_1.head()

Unnamed: 0,year,month,dayofweek_num,quarter,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR,ACTUALTIME_ARR,PLANNEDTIME_ARR_MIN
0,2018,1,0,1,41,1,1172,3,23799,23863,23760
1,2018,1,0,1,41,1,1173,4,23860,23934,23820
2,2018,1,0,1,41,1,1174,5,23937,24114,23880
3,2018,1,0,1,41,1,1175,6,24048,24180,24000
4,2018,1,0,1,41,1,15,7,24130,24227,24120
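
The flooring step in cell 12 can be cross-checked on the five rows shown above. This sketch compares the notebook's `x - x.mod(60)` form with the equivalent integer-division form `(x // 60) * 60`; the variable names are illustrative only.

```python
import pandas as pd

# PLANNEDTIME_ARR values from the first five rows of the output above.
planned = pd.Series([23799, 23860, 23937, 24048, 24130], name="PLANNEDTIME_ARR")

# Form used in the notebook: subtract the leftover seconds.
floored_mod = planned - planned.mod(60)

# Equivalent form: integer-divide by 60, then scale back to seconds.
floored_div = (planned // 60) * 60

print(floored_mod.tolist())
print(floored_mod.equals(floored_div))
```

Both forms keep the value in seconds while discarding the sub-minute part, which is why the floored column still holds numbers such as 23760 instead of a minute count.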


#### Check case for STOPPOINTID 1172

In [15]:
df_Jan_trips_arr_41_1_test = df_Jan_trips_arr_41_1.loc[df_Jan_trips_arr_41_1['STOPPOINTID'] == 1172]
df_Jan_trips_arr_41_1_test

Unnamed: 0,year,month,dayofweek_num,quarter,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR,ACTUALTIME_ARR,PLANNEDTIME_ARR_MIN
0,2018,1,0,1,41,1,1172,3,23799,23863,23760
327,2018,1,0,1,41,1,1172,3,27399,27480,27360
2939,2018,1,0,1,41,1,1172,3,30699,30729,30660
10785,2018,1,0,1,41,1,1172,3,34299,34341,34260
20066,2018,1,0,1,41,1,1172,3,37892,37829,37860
...,...,...,...,...,...,...,...,...,...,...,...
9566028,2018,1,2,1,41,1,1172,3,74796,74794,74760
9572095,2018,1,2,1,41,1,1172,3,76577,76633,76560
9579417,2018,1,2,1,41,1,1172,3,78377,78474,78360
9585134,2018,1,2,1,41,1,1172,3,80177,80341,80160


In [16]:
df_Jan_trips_arr_41_1_test.nunique()

year                      1
month                     1
dayofweek_num             7
quarter                   1
LINEID                    1
DIRECTION                 1
STOPPOINTID               1
PROGRNUMBER               1
PLANNEDTIME_ARR         309
ACTUALTIME_ARR         1295
PLANNEDTIME_ARR_MIN     113
dtype: int64

In [17]:
uniques = df_Jan_trips_arr_41_1_test['PLANNEDTIME_ARR_MIN'].unique()

In [18]:
sorted(uniques)

[18060,
 19860,
 21660,
 22260,
 23760,
 24660,
 25260,
 26160,
 27060,
 27360,
 27660,
 27720,
 28380,
 28440,
 28860,
 29460,
 29520,
 29760,
 29820,
 30660,
 30720,
 31260,
 32160,
 32220,
 32460,
 33660,
 33720,
 34260,
 34320,
 36060,
 36120,
 37260,
 37860,
 38460,
 39660,
 39960,
 40260,
 40860,
 41160,
 42360,
 42660,
 43260,
 43560,
 43860,
 44760,
 45060,
 45660,
 45960,
 46260,
 46560,
 46860,
 47160,
 47460,
 48060,
 48660,
 49260,
 49860,
 50460,
 51060,
 51660,
 52260,
 52860,
 53460,
 53760,
 54060,
 54360,
 54720,
 55260,
 55860,
 55920,
 56460,
 57120,
 57660,
 58020,
 58860,
 58920,
 60060,
 60120,
 60360,
 61260,
 61320,
 61560,
 61860,
 62100,
 63060,
 63120,
 64260,
 64320,
 64860,
 65460,
 65520,
 66060,
 66420,
 67260,
 67920,
 68460,
 68820,
 69420,
 70260,
 71160,
 72060,
 72960,
 73860,
 74760,
 75660,
 76560,
 77460,
 78360,
 79260,
 80160,
 81060,
 81960,
 82860]

**ANALYSIS**
* **Even after converting to minute resolution, the number of unique PLANNEDTIME_ARR entries [unique entries: 113] exceeds the number of DIRECTION 1 ferries made on LINEID 41 in a day [number of ferries in a day: 61]**
* **Hence, NO FIXED TIMETABLE CAN BE DRAWN UP FOR A STOPPOINTID SERVING A LINEID**
<br>
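
One possible way to probe the extra entries is to look at the gaps between consecutive unique values. The sketch below uses a slice of the sorted list printed above (27060 through 30720) and counts the pairs that sit exactly one minute apart. Whether such pairs are the same scheduled run planned a few seconds differently on different days is an assumption that this check alone does not establish.

```python
import numpy as np

# Slice of the sorted unique PLANNEDTIME_ARR_MIN values for STOPPOINTID 1172.
minute_values = np.array([27060, 27360, 27660, 27720, 28380, 28440, 28860,
                          29460, 29520, 29760, 29820, 30660, 30720])

# Gap in seconds between each value and the next one.
gaps = np.diff(minute_values)

# Pairs exactly 60 seconds apart land in neighbouring minute buckets.
adjacent_minute_pairs = int((gaps == 60).sum())

print(gaps.tolist())
print(adjacent_minute_pairs)
```

In this slice, 5 of the 12 gaps are exactly 60 seconds, so neighbouring minute buckets account for a noticeable share of the entries in this part of the list.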

**FACT**
* **Converting PLANNEDTIME_ARR to minute resolution reduces the number of unique entries significantly (for STOPPOINTID 1172: from 309 to 113). It is also the way forward, as the front end will provide the ARRIVAL TIME input in minutes**
<br>
<br>

**WE ARE CHOOSING TO PROCEED WITH THIS ANOMALY WHILE TRAINING AND USING THE MODEL BECAUSE:**
* **NO WORKAROUND COULD BE FOUND TO GET AN ARRIVAL-TIME TIMETABLE FOR EACH STOPPOINTID OF A DUBLIN BUS LINEID**
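
Since the front end supplies the arrival time at minute resolution, a conversion between a clock string and the values stored in PLANNEDTIME_ARR is needed somewhere. The sketch below assumes PLANNEDTIME_ARR counts seconds from midnight of the service day (values above 86400 in the data suggest trips that run past midnight). Both helper names are hypothetical and not part of this notebook.

```python
def hhmm_to_seconds(hhmm: str) -> int:
    """Convert an 'HH:MM' string to seconds since midnight (hypothetical helper)."""
    hours, minutes = (int(part) for part in hhmm.split(":"))
    return hours * 3600 + minutes * 60


def seconds_to_hhmm(seconds: int) -> str:
    """Convert seconds since midnight to 'HH:MM', wrapping past-midnight values."""
    hours, minutes = divmod(seconds // 60, 60)
    return f"{hours % 24:02d}:{minutes:02d}"


print(hhmm_to_seconds("06:36"))
print(seconds_to_hhmm(23760))
print(seconds_to_hhmm(87300))
```

Under that assumption, 23760 corresponds to 06:36 and 87300 wraps around to 00:15 on the following calendar day.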

## Generate dataframe with MINUTES resolution for PLANNEDTIME_ARR and ACTUALTIME_ARR

### Generate dataframe copy

In [19]:
df_Jan_trips_arr_MINUTES = df_Jan_trips_arr.copy()
df_Jan_trips_arr_MINUTES

Unnamed: 0,year,month,dayofweek_num,quarter,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR,ACTUALTIME_ARR
0,2018,1,0,1,41,1,1172,3,23799,23863
1,2018,1,0,1,41,1,1173,4,23860,23934
2,2018,1,0,1,41,1,1174,5,23937,24114
3,2018,1,0,1,41,1,1175,6,24048,24180
4,2018,1,0,1,41,1,15,7,24130,24227
...,...,...,...,...,...,...,...,...,...,...
9600540,2018,1,2,1,33B,1,3726,16,86812,86671
9600541,2018,1,2,1,33B,1,7039,17,86848,86685
9600542,2018,1,2,1,33B,1,7204,18,86890,86722
9600543,2018,1,2,1,33B,1,3728,19,86988,86836


#### UNIQUE ENTRIES

In [20]:
df_Jan_trips_arr_MINUTES.nunique()

year                   1
month                  1
dayofweek_num          7
quarter                1
LINEID               126
DIRECTION              2
STOPPOINTID         4602
PROGRNUMBER          102
PLANNEDTIME_ARR    67749
ACTUALTIME_ARR     70065
dtype: int64

#### ROUND-OFF [FLOOR] PLANNEDTIME_ARR and ACTUALTIME_ARR TO MINUTE RESOLUTION
* **PLANNEDTIME_ARR and ACTUALTIME_ARR are in seconds resolution**
* **At this resolution, the earlier ML model gave an RMSE of approximately 300 seconds**
* **Flooring seconds to minutes adds a worst-case input error of just under 1 minute (59 seconds)**
* **BUT AS OBSERVED BELOW, IT REDUCES THE CARDINALITY OF BOTH FEATURES SIGNIFICANTLY**

FOR JANUARY 2018 DATA:

|FEATURE|NUNIQUE (seconds)|NUNIQUE (minutes)|
|------|------|------|
|PLANNEDTIME_ARR  |67749  |1215  |
|ACTUALTIME_ARR  |70065  |1222 |
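
The trade-off can be illustrated on synthetic data: a ten-minute window in which every second occurs once. The window start (23760) is borrowed from the data above; everything else is an assumption made for the illustration.

```python
import pandas as pd

# Every second in a ten-minute window: 600 distinct values.
seconds = pd.Series(range(23760, 24360))

# Floor to the start of the minute, as done for the real columns.
floored = seconds - seconds.mod(60)

# Largest amount of time discarded by flooring.
max_error = int((seconds - floored).max())

# Cardinality before and after flooring.
n_before = int(seconds.nunique())
n_after = int(floored.nunique())

print(max_error, n_before, n_after)
```

The discarded part never exceeds 59 seconds, while the number of distinct values drops by a factor of 60 in this dense synthetic case. The reduction in the real data is similar in magnitude (67749 to 1215 for PLANNEDTIME_ARR).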


In [21]:
df_Jan_trips_arr_MINUTES['PLANNEDTIME_ARR'] = df_Jan_trips_arr_MINUTES['PLANNEDTIME_ARR'] - df_Jan_trips_arr_MINUTES['PLANNEDTIME_ARR'].mod(60)
df_Jan_trips_arr_MINUTES['ACTUALTIME_ARR'] = df_Jan_trips_arr_MINUTES['ACTUALTIME_ARR'] - df_Jan_trips_arr_MINUTES['ACTUALTIME_ARR'].mod(60)

df_Jan_trips_arr_MINUTES

Unnamed: 0,year,month,dayofweek_num,quarter,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR,ACTUALTIME_ARR
0,2018,1,0,1,41,1,1172,3,23760,23820
1,2018,1,0,1,41,1,1173,4,23820,23880
2,2018,1,0,1,41,1,1174,5,23880,24060
3,2018,1,0,1,41,1,1175,6,24000,24180
4,2018,1,0,1,41,1,15,7,24120,24180
...,...,...,...,...,...,...,...,...,...,...
9600540,2018,1,2,1,33B,1,3726,16,86760,86640
9600541,2018,1,2,1,33B,1,7039,17,86820,86640
9600542,2018,1,2,1,33B,1,7204,18,86880,86700
9600543,2018,1,2,1,33B,1,3728,19,86940,86820


#### UNIQUE ENTRIES

In [22]:
df_Jan_trips_arr_MINUTES.nunique()

year                  1
month                 1
dayofweek_num         7
quarter               1
LINEID              126
DIRECTION             2
STOPPOINTID        4602
PROGRNUMBER         102
PLANNEDTIME_ARR    1215
ACTUALTIME_ARR     1222
dtype: int64

#### Save csv

In [23]:
df_Jan_trips_arr_MINUTES.to_csv("../DB/ML/Jan_trips_arr_MINUTES.csv",index=False)

## Generate planned arrival time TIMETABLE

### Generate TIMETABLE
**UNIQUE ENTRIES FOR [LINEID, DIRECTION, STOPPOINTID, PROGRNUMBER, PLANNEDTIME_ARR] FORMS TIMETABLE**

In [30]:
df_Jan_trips_arr_TIMETABLE = df_Jan_trips_arr_MINUTES[['LINEID', 'DIRECTION', 'STOPPOINTID',
                                                       'PROGRNUMBER', 'PLANNEDTIME_ARR']].drop_duplicates().copy()
df_Jan_trips_arr_TIMETABLE

Unnamed: 0,LINEID,DIRECTION,STOPPOINTID,PROGRNUMBER,PLANNEDTIME_ARR
0,41,1,1172,3,23760
1,41,1,1173,4,23820
2,41,1,1174,5,23880
3,41,1,1175,6,24000
4,41,1,15,7,24120
...,...,...,...,...,...
9599762,111,1,3055,44,86400
9599763,111,1,3056,45,86400
9599764,111,1,3057,46,86460
9600523,184,1,6138,43,87300


In [31]:
df_Jan_trips_arr_TIMETABLE.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1557076 entries, 0 to 9600524
Data columns (total 5 columns):
 #   Column           Non-Null Count    Dtype   
---  ------           --------------    -----   
 0   LINEID           1557076 non-null  category
 1   DIRECTION        1557076 non-null  category
 2   STOPPOINTID      1557076 non-null  category
 3   PROGRNUMBER      1557076 non-null  category
 4   PLANNEDTIME_ARR  1557076 non-null  int32   
dtypes: category(4), int32(1)
memory usage: 25.4 MB


#### UNIQUE ENTRIES

In [29]:
df_Jan_trips_arr_TIMETABLE.nunique()

LINEID              126
DIRECTION             2
STOPPOINTID        4602
PROGRNUMBER         102
PLANNEDTIME_ARR    1215
dtype: int64

#### Save csv

In [26]:
df_Jan_trips_arr_TIMETABLE.to_csv("../DB/ML/Jan_trips_arr_TIMETABLE.csv",index=False)
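
A minimal sketch of how the saved timetable might be queried downstream. The rows are the minute-floored planned times for STOPPOINTID 1172 on LINEID 41 taken from the outputs above. The function name and its lookup rule (earliest planned arrival at or after the requested time) are assumptions, not something this notebook defines.

```python
import pandas as pd

# Small timetable in the same schema as Jan_trips_arr_TIMETABLE.csv.
timetable = pd.DataFrame({
    "LINEID":          ["41"] * 5,
    "DIRECTION":       [1] * 5,
    "STOPPOINTID":     [1172] * 5,
    "PROGRNUMBER":     [3] * 5,
    "PLANNEDTIME_ARR": [23760, 27360, 30660, 34260, 37860],
})


def next_planned_arrival(tt, lineid, direction, stoppointid, requested):
    """Return the earliest PLANNEDTIME_ARR >= requested, or None (hypothetical helper)."""
    mask = (
        (tt["LINEID"] == lineid)
        & (tt["DIRECTION"] == direction)
        & (tt["STOPPOINTID"] == stoppointid)
        & (tt["PLANNEDTIME_ARR"] >= requested)
    )
    candidates = tt.loc[mask, "PLANNEDTIME_ARR"]
    return None if candidates.empty else int(candidates.min())


print(next_planned_arrival(timetable, "41", 1, 1172, 27000))
print(next_planned_arrival(timetable, "41", 1, 1172, 40000))
```

A request at 27000 seconds resolves to the 27360 entry, and a request later than the last entry returns `None`, which the caller would need to handle.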