# Analysis for LEAVETIMES database: Dublin Bus

* The leavetimes database is vast. To work with a representative sample, we start with the data for January 2018.

In [1]:
import pandas as pd

## Vehicles data

### Load file into dataframe

In [2]:
df_Jan = pd.read_csv("../DB/monthlyData/January_2018.csv",skip_blank_lines=True,index_col=False)

### View vehicle data

In [3]:
df_Jan.head()

Unnamed: 0,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID
0,2018-01-01,5958355,3,1172,23799,23799,23863,23863,1001237
1,2018-01-01,5958355,4,1173,23860,23860,23934,24072,1001237
2,2018-01-01,5958355,5,1174,23937,23937,24114,24140,1001237
3,2018-01-01,5958355,6,1175,24048,24048,24180,24192,1001237
4,2018-01-01,5958355,7,15,24130,24130,24227,24257,1001237


### Properties of features

In [4]:
# show_counts replaces null_counts, which was deprecated in pandas 1.2 and removed in 2.0
df_Jan.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10525004 entries, 0 to 10525003
Data columns (total 9 columns):
 #   Column           Non-Null Count     Dtype 
---  ------           --------------     ----- 
 0   DAYOFSERVICE     10525004 non-null  object
 1   TRIPID           10525004 non-null  int64 
 2   PROGRNUMBER      10525004 non-null  int64 
 3   STOPPOINTID      10525004 non-null  int64 
 4   PLANNEDTIME_ARR  10525004 non-null  int64 
 5   PLANNEDTIME_DEP  10525004 non-null  int64 
 6   ACTUALTIME_ARR   10525004 non-null  int64 
 7   ACTUALTIME_DEP   10525004 non-null  int64 
 8   VEHICLEID        10525004 non-null  int64 
dtypes: int64(8), object(1)
memory usage: 722.7+ MB


In [5]:
df_Jan.nunique()

DAYOFSERVICE          31
TRIPID             76453
PROGRNUMBER          103
STOPPOINTID         4714
PLANNEDTIME_ARR    68859
PLANNEDTIME_DEP    68859
ACTUALTIME_ARR     71982
ACTUALTIME_DEP     71965
VEHICLEID            996
dtype: int64

**REVIEW OF FEATURES**

| Features        | Observation   | Decision  |
| :--------------- |:-------------|:-----|
| **DAYOFSERVICE**      | Non-null column with 31 unique values; one for each day of the month| **No action** |
| **TRIPID**    | Non-null column with 76453 unique entries; identifies a one-way trip made by a bus on a defined route| **No action** |
| **PROGRNUMBER**       | Non-null column with 103 unique entries; indicates the position of the stop within the trip| **No action** |
| **STOPPOINTID**        | Non-null column with 4714 unique entries; nominal categorical variable indicating the stop number| **No action** |
| **PLANNEDTIME_ARR**    | Non-null column with 68859 unique entries; expresses the scheduled time in seconds [offset from 00:00:00 of the day] for the bus to arrive at a stop| **No action** |
| **PLANNEDTIME_DEP**      | Non-null column with 68859 unique entries; expresses the scheduled time in seconds [offset from 00:00:00 of the day] for the bus to depart from a stop| **No action** |
| **ACTUALTIME_ARR**       | Non-null column with 71982 unique entries; expresses the actual time in seconds [offset from 00:00:00 of the day] at which the bus arrived at a stop| **No action** |
| **ACTUALTIME_DEP**       | Non-null column with 71965 unique entries; expresses the actual time in seconds [offset from 00:00:00 of the day] at which the bus departed from a stop| **No action** |
| **VEHICLEID**            | Non-null column with 996 unique entries; identifies the particular vehicle used for the trip, NOT SIGNIFICANT FOR ANALYSIS| **Drop** |
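
Because the four time columns share the same unit (seconds from 00:00:00 of the service day), they can be subtracted directly. The sketch below is illustrative only: it uses a small hand-copied sample of the rows shown by `df_Jan.head()`, and the derived column names `ARR_DELAY` and `DWELL` are assumptions, not part of the dataset.

```python
import pandas as pd

# Hand-copied sample of the first rows printed by df_Jan.head()
sample = pd.DataFrame({
    "PLANNEDTIME_ARR": [23799, 23860, 23937],
    "PLANNEDTIME_DEP": [23799, 23860, 23937],
    "ACTUALTIME_ARR": [23863, 23934, 24114],
    "ACTUALTIME_DEP": [23863, 24072, 24140],
})

# ARR_DELAY (hypothetical name): positive means late, negative means early
sample["ARR_DELAY"] = sample["ACTUALTIME_ARR"] - sample["PLANNEDTIME_ARR"]

# DWELL (hypothetical name): seconds between actual arrival and actual departure
sample["DWELL"] = sample["ACTUALTIME_DEP"] - sample["ACTUALTIME_ARR"]

print(sample[["ARR_DELAY", "DWELL"]])
```

For these three rows the arrival delays are 64, 74 and 177 seconds.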

In [6]:
# drop VEHICLEID feature
df_Jan.drop('VEHICLEID',axis=1,inplace=True)

## View data for a particular trip ID

Since we know nothing yet about Dublin Bus operations, let us state our assumptions:
* All buses are operational on all 31 days of service [hypothesis of a fully loaded, functional service]
* Each one-way trip has a unique ID [established by the documentation]
* Thus, we may estimate:<br>
$Average\ daily\ trips\ per\ bus=\dfrac{trips\ per\ day}{number\ of\ buses}=\dfrac{76453/31}{996} \approx 2.48$
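
The estimate can be reproduced in a few lines, using the counts from the `nunique()` output. The helper `count_daily_trips` is a hypothetical addition: it counts distinct (`DAYOFSERVICE`, `TRIPID`) pairs, which may be a safer measure of trips actually run if a `TRIPID` is reused across service days. The toy frame is invented for illustration.

```python
import pandas as pd

# Figures taken from the df_Jan.nunique() output
unique_trip_ids = 76453
service_days = 31
vehicles = 996

avg_daily_trips_per_bus = (unique_trip_ids / service_days) / vehicles
print(round(avg_daily_trips_per_bus, 2))


def count_daily_trips(df: pd.DataFrame) -> int:
    """Count distinct (DAYOFSERVICE, TRIPID) pairs (hypothetical helper)."""
    return len(df[["DAYOFSERVICE", "TRIPID"]].drop_duplicates())


# Toy data (invented): the same TRIPID appearing on two service days
toy = pd.DataFrame({
    "DAYOFSERVICE": ["2018-01-02", "2018-01-02", "2018-01-03"],
    "TRIPID": [5970229, 5970229, 5970229],
})
unique_ids = toy["TRIPID"].nunique()
daily_trips = count_daily_trips(toy)
print(unique_ids, daily_trips)
```

If the two counts differ on the real data, the uniqueness assumption holds only within a single service day, and the per-day pair count would be the better numerator.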


### Convert DAYOFSERVICE to datetime 

In [7]:
df_trip_Jan = df_Jan.copy() 
df_trip_Jan['DAYOFSERVICE'] =  pd.to_datetime(df_trip_Jan['DAYOFSERVICE'],format='%Y-%m-%d')
df_trip_Jan.head()

Unnamed: 0,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP
0,2018-01-01,5958355,3,1172,23799,23799,23863,23863
1,2018-01-01,5958355,4,1173,23860,23860,23934,24072
2,2018-01-01,5958355,5,1174,23937,23937,24114,24140
3,2018-01-01,5958355,6,1175,24048,24048,24180,24192
4,2018-01-01,5958355,7,15,24130,24130,24227,24257


In [8]:
df_trip_Jan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10525004 entries, 0 to 10525003
Data columns (total 8 columns):
 #   Column           Dtype         
---  ------           -----         
 0   DAYOFSERVICE     datetime64[ns]
 1   TRIPID           int64         
 2   PROGRNUMBER      int64         
 3   STOPPOINTID      int64         
 4   PLANNEDTIME_ARR  int64         
 5   PLANNEDTIME_DEP  int64         
 6   ACTUALTIME_ARR   int64         
 7   ACTUALTIME_DEP   int64         
dtypes: datetime64[ns](1), int64(7)
memory usage: 642.4 MB


### Visualise data for a particular trip

In [9]:
TRIPID_unique = df_trip_Jan['TRIPID'].unique()
# Group rows (the default axis) by TRIPID; axis=1 would group columns instead
group_TRIPID = df_trip_Jan.groupby(by='TRIPID')

In [10]:
df_trip_Jan.loc[(df_trip_Jan['PLANNEDTIME_ARR'] == 30600) & (df_trip_Jan['STOPPOINTID'] == 324)]

Unnamed: 0,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP
1414,2018-01-01,5959280,1,324,30600,30600,30579,30579
247416,2018-01-02,5970229,1,324,30600,30600,30604,30604
247455,2018-01-02,5967072,1,324,30600,30600,30550,30550
618693,2018-01-03,5970229,1,324,30600,30600,30611,30611
618766,2018-01-03,5967072,1,324,30600,30600,30617,30617
998788,2018-01-04,5970229,1,324,30600,30600,30593,30593
998830,2018-01-04,5967072,1,324,30600,30600,30620,30620
1380013,2018-01-05,5970229,1,324,30600,30600,30598,30598
1380070,2018-01-05,5967072,1,324,30600,30600,31239,31239
1738227,2018-01-06,5970892,1,324,30600,30600,30620,30620


In [11]:
df_trip_Jan.loc[df_trip_Jan['TRIPID'] == TRIPID_unique[45]].sort_values(by=['PROGRNUMBER'])

Unnamed: 0,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP
1414,2018-01-01,5959280,1,324,30600,30600,30579,30579
1527,2018-01-01,5959280,2,327,30701,30701,30669,30669
1649,2018-01-01,5959280,3,7113,30818,30818,30745,30767
1779,2018-01-01,5959280,4,127,30938,30938,30812,30834
1883,2018-01-01,5959280,5,112,31042,31042,30877,30903
1950,2018-01-01,5959280,6,113,31114,31114,30958,30958
1964,2018-01-01,5959280,7,114,31131,31131,30964,30964
2036,2018-01-01,5959280,8,115,31196,31196,30990,30990
2100,2018-01-01,5959280,9,37,31244,31244,31007,31007
2170,2018-01-01,5959280,10,38,31302,31302,31027,31027


In [12]:
df_trip_Jan['PLANNEDTIME_ARR'].max()

91217
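
A maximum of 91217 exceeds the 86400 seconds in a day, which suggests that some planned times run past midnight while still being counted from the start of the service day. One way to handle this, sketched below on a two-row invented sample, is to add the offset to `DAYOFSERVICE` as a timedelta. The column name `PLANNED_ARR_TS` is an assumption.

```python
import pandas as pd

# Invented sample: one ordinary offset and the maximum observed above
sample = pd.DataFrame({
    "DAYOFSERVICE": pd.to_datetime(["2018-01-01", "2018-01-01"]),
    "PLANNEDTIME_ARR": [30600, 91217],
})

# Offsets above 86400 roll over into the next calendar day automatically
sample["PLANNED_ARR_TS"] = sample["DAYOFSERVICE"] + pd.to_timedelta(
    sample["PLANNEDTIME_ARR"], unit="s"
)

print(sample)
```

Here 30600 maps to 08:30:00 on 2018-01-01, while 91217 maps to 01:20:17 on 2018-01-02.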