<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Documentation" data-toc-modified-id="Documentation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Documentation</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Initial-Data-Quality-Assessment-&amp;-Configuration" data-toc-modified-id="Initial-Data-Quality-Assessment-&amp;-Configuration-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Initial Data Quality Assessment &amp; Configuration</a></span><ul class="toc-item"><li><span><a href="#Drop-Constant-Column-and-NaN-columns" data-toc-modified-id="Drop-Constant-Column-and-NaN-columns-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Drop Constant Column and NaN columns</a></span></li><li><span><a href="#Configure-Datatypes" data-toc-modified-id="Configure-Datatypes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Configure Datatypes</a></span></li></ul></li><li><span><a href="#Investigate-Data-Anomolies" data-toc-modified-id="Investigate-Data-Anomolies-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Investigate Data Anomolies</a></span></li><li><span><a href="#Data-Insights" data-toc-modified-id="Data-Insights-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Data Insights</a></span><ul class="toc-item"><li><span><a href="#Is-the-planned-duration-for-[...]-fixed?" data-toc-modified-id="Is-the-planned-duration-for-[...]-fixed?-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Is the planned duration for [...] fixed?</a></span></li><li><span><a href="#What-percentage-of-trips-are-later-than-expected?" 
data-toc-modified-id="What-percentage-of-trips-are-later-than-expected?-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>What percentage of trips are later than expected?</a></span></li><li><span><a href="#Most-/-Least-Common-Services-/-Route-/-Programmes" data-toc-modified-id="Most-/-Least-Common-Services-/-Route-/-Programmes-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Most / Least Common Services / Route / Programmes</a></span></li><li><span><a href="#Most-/-Least-Common-Visited-Stops" data-toc-modified-id="Most-/-Least-Common-Visited-Stops-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Most / Least Common Visited Stops</a></span></li></ul></li></ul></div>

# Documentation

- Each record in this table contains the actual data for one leave time of one vehicle at one stop point of the route.

<br>

- If the leave time is planned and achieved, the record will contain both the planned and actual data.

<br>

- If the leave time is planned and suppressed, the record will contain only the planned data and the flag “Suppressed” will be checked.

<br>

- Each record can be linked to a fault event through the “JustificationID” field.

<br>

- In line with the time representation defined for planned data, AVL data refer to one day of service, which can last longer than 24 hours.

<br>

- For example, a trip starting at 23:30 on 1 April and ending at 00:30 on 2 April could have its last AVL data recorded at “24:30 on 1 April”.

<br>

- To express this “extended time”, the interface requires times to be given as seconds elapsed since the beginning of the operating day.

<br>

- In our example the DayOfService is “01/04/2014” and the extended time is “88200” (= 24 * 60 * 60 + 30 * 60).
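
The extended-time arithmetic above can be checked with a short sketch. The helper name `extended_time_to_timestamp` is hypothetical (it is not part of the interface); it assumes the day of service is available as a date and the extended time as an integer number of seconds.

```python
import pandas as pd


def extended_time_to_timestamp(day_of_service, seconds):
    """Convert (day of service, seconds past its start) to a calendar timestamp.

    Hypothetical helper for illustration only. Values above 86400 seconds
    simply roll over into the following calendar day.
    """
    return pd.Timestamp(day_of_service) + pd.Timedelta(seconds=int(seconds))


# The example from the documentation: 24 * 60 * 60 + 30 * 60 = 88200 seconds
extended_seconds = 24 * 60 * 60 + 30 * 60
timestamp = extended_time_to_timestamp("2014-04-01", extended_seconds)
print(extended_seconds)  # 88200
print(timestamp)         # 2014-04-02 00:30:00
```

Keeping the day of service and the offset in seconds separate until the last moment avoids ever having to parse a string such as “24:30”, which standard datetime parsers reject.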

---

- Distance is expressed in the unit of measurement configured in the DMS settings: either “kilometers” or “miles”.

<br>

- Each table contains a field called “LastUpdate” used to detect new data availability.

| Column | Desc. |
| --- | --- |
| DataSource | Unique Bus Operator Code: used to identify the Bus Operator's system from which the data arrive. |
| DayOfService | Day of service; one day of service can last more than 24 hours |
| TripID | Unique Trip Code |
| ProgrNumber | Sequential position of the stop point in the trip |
| StopPointID | Unique stop point code |
| PlannedTime_Arr | Planned arrival time at the stop point, in seconds |
| PlannedTime_Dept | Planned departure time from the stop point, in seconds |
| ActualTime_Arr | Actual arrival time at the stop point, in seconds |
| ActualTime_Dept | Actual departure time from the stop point, in seconds |
| VehicleID | Unique code of the vehicle arriving at this stop point |
| Passengers | Number of passengers on board (previous link) |
| PassengersIn | Number of boarded passengers |
| PassengersOut | Number of descended passengers |
| Distance | Distance measured from the beginning of the trip |
| Suppressed | Whole trip suppressed: 0=achieved; 1=suppressed |
| JustificationID | Fault Code |
| LastUpdate | Time of the last record update |
| Note | Free Note |
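
As a rough illustration of how the LastUpdate field could be used to detect new data, the sketch below keeps only the rows updated after a previously seen timestamp. The function name `rows_updated_since`, the lowercase column name `lastupdate`, and the toy values are assumptions made for this example; they are not taken from the interface specification.

```python
import pandas as pd


def rows_updated_since(df, last_seen, column="lastupdate"):
    """Return the rows whose update timestamp is strictly later than last_seen.

    Hypothetical helper: assumes `column` already holds datetime values.
    """
    return df[df[column] > pd.Timestamp(last_seen)]


# Toy data, invented for illustration
toy = pd.DataFrame({
    "tripid": [1, 2, 3],
    "lastupdate": pd.to_datetime([
        "2017-04-09 09:00:06",
        "2017-04-10 09:00:06",
        "2017-04-11 09:00:06",
    ]),
})

fresh = rows_updated_since(toy, "2017-04-09 09:00:06")
print(fresh["tripid"].tolist())  # [2, 3]
```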


# Imports

In [1]:
# IPython.display is the public import location for these helpers
from IPython.display import display, HTML
display(HTML('<style>.container {width:80% !important;}</style>'))

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
pd.__version__

'0.24.2'

In [4]:
# Display plots in the notebook
%matplotlib inline

In [5]:
def assess_df(df):
  
    # collect the distinct dtype names present in the dataframe
    feature_types = {dtype.name for dtype in df.dtypes}

    # print datatypes
  
    print("Datatypes:\n")
    print(df.dtypes)
    print('\n\n')
    
    # print number rows and cols
    print('*' * 50)
    print('ROWS, COLUMNS, NULLS\n')
    print(df.shape[0], "rows in the dataframe.")
    print(df.shape[1], "columns in the dataframe.\n")
    print(df.isnull().sum().sum(), "null values in the dataframe.")
    
    # check duplicate rows & column labels; print duplicate rows if present

    duplicate_rows = df.duplicated(keep=False).sum()
    # counts repeated column labels (not columns with identical contents)
    duplicate_cols = df.columns.duplicated(keep=False).sum()

    print(duplicate_rows, "duplicate rows")
    print(duplicate_cols, "duplicate columns")

    if duplicate_rows > 0 or duplicate_cols > 0:
        print(df.loc[df.duplicated()])

    print('\n')
    
    # check for constant columns
    
    numeric_cols = list(df.select_dtypes(include=[np.number]).columns.values)
    categorical_cols = list(df.select_dtypes('category').columns.values)
    
    # standard deviation of each numeric column; 0 means the column is constant
    standard_diffs = df[numeric_cols].std().tolist()

    constant_categorical_column = False
    if 'category' in feature_types:
        for column in categorical_cols:
            # a categorical column is constant when it holds one distinct value
            if df[column].nunique() == 1:
                constant_categorical_column = True
            
            
    if 0 in standard_diffs:
        print("Constant numeric columns: TRUE")
    else:
        print("Constant numeric columns: FALSE")
  
    if 'category' in feature_types:    
        if constant_categorical_column == True:
            print("Constant categorical columns: TRUE")
        else:
            print("Constant categorical columns: FALSE")

            
    
    print('\n\n')

    
    # feature stats
    print('*' * 50)
    print('DESCRIPTION\n')
    print(df.describe().T)
    print('\n\n')
    print('*' * 50)

    
    # feature stats: categorical
    
    if 'category' in feature_types:
        print('CATEGORICAL DESCRIPTION\n')
        print('\n')
        print(df.select_dtypes(['category']).describe().T)
        print('\n\n')
        print('*' * 50)
    
    # print feature cardinalities
    
    print("FEATURE CARDINALITIES\n")
    column_names = list(df.columns.values)

    print('{0:45}  {1}'.format("Feature", "Distinct Values"))
    print('{0:45}  {1}'.format("-------", "--------------- \n"))

    for c in column_names:
        print('{0:45}  {1}'.format(c, str(len(df[c].unique()))))
    
    print('\n')
    print('*' * 50)
    print('MEMORY\n')
    
    
    # info() prints its report directly and returns None, so it is not wrapped in print()
    df.info(memory_usage='deep')
    print('\n')
    print(df.memory_usage(deep=True))
    print('\n')
    print('*' * 50)
    print('HEAD\n')
    print(df.head(10))
    print('\n')
    print('*' * 50)
    print('TAIL\n')
    print(df.tail(10))

# Initial Data Quality Assessment & Configuration

In [6]:
path = "/Users/davidodwyer/Documents/studyCS/Semester_3"\
"/data/original_raw_data/bus/sample_bus_data-20190607"\
"/rt_leavetimes_2017_I_DB_1000.txt"

df_leavetime_sample = pd.read_csv(path, sep=';')

In [7]:
assess_df(df_leavetime_sample)

Datatypes:

datasource          object
dayofservice        object
tripid               int64
progrnumber          int64
stoppointid          int64
plannedtime_arr      int64
plannedtime_dep      int64
actualtime_arr       int64
actualtime_dep       int64
vehicleid            int64
passengers         float64
passengersin       float64
passengersout      float64
distance           float64
suppressed         float64
justificationid    float64
lastupdate          object
note               float64
dtype: object



**************************************************
ROWS, COLUMNS, NULLS

999 rows in the dataframe.
18 columns in the dataframe.

6993 null values in the dataframe.
0 duplicate rows
0 duplicate columns


Constant numeric columns: FALSE



**************************************************
DESCRIPTION

                 count          mean            std        min        25%  \
tripid           999.0  4.366036e+06  210532.711369  4082096.0  4093337.0   
progrnumber      999.0  2.98

## Drop Constant Column and NaN columns

In [8]:
df_leavetime_sample.drop([
    'datasource',
    'passengers',
    'passengersin',
    'passengersout',
    'distance',
    'suppressed',
    'justificationid',
    'note'], axis=1, inplace=True)
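
The column list above is hard-coded. The reported 6993 nulls across 999 rows is consistent with seven fully empty columns (7 × 999 = 6993), so a list like this could plausibly be derived rather than typed out. The sketch below is one way to do that on a toy frame; `find_droppable_columns` and the toy values are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def find_droppable_columns(df):
    """Return (all_nan, constant): names of fully empty and single-valued columns."""
    all_nan = [c for c in df.columns if df[c].isna().all()]
    constant = [
        c for c in df.columns
        if c not in all_nan and df[c].nunique(dropna=False) == 1
    ]
    return all_nan, constant


# Toy data, invented for illustration
toy = pd.DataFrame({
    "datasource": ["DB", "DB", "DB"],   # constant column
    "tripid": [1, 2, 3],                # informative column
    "note": [np.nan, np.nan, np.nan],   # fully empty column
})

all_nan, constant = find_droppable_columns(toy)
cleaned = toy.drop(columns=all_nan + constant)
print(all_nan)                # ['note']
print(constant)               # ['datasource']
print(list(cleaned.columns))  # ['tripid']
```

Deriving the list this way also covers constant non-numeric columns, which a standard-deviation check on numeric columns alone does not flag.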

## Configure Datatypes

In [9]:
df_leavetime_sample.dtypes

dayofservice       object
tripid              int64
progrnumber         int64
stoppointid         int64
plannedtime_arr     int64
plannedtime_dep     int64
actualtime_arr      int64
actualtime_dep      int64
vehicleid           int64
lastupdate         object
dtype: object

In [10]:
# Parse both timestamp columns in one vectorised call each
# (faster than calling pd.to_datetime row by row through apply)
datetime_format = '%d-%b-%y %H:%M:%S'

for column in ['dayofservice', 'lastupdate']:
    df_leavetime_sample[column] = pd.to_datetime(
        df_leavetime_sample[column], format=datetime_format)

In [11]:
df_leavetime_sample.head()

| | dayofservice | tripid | progrnumber | stoppointid | plannedtime_arr | plannedtime_dep | actualtime_arr | actualtime_dep | vehicleid | lastupdate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-03-27 | 4529349 | 5 | 4384 | 51343 | 51343 | 51476 | 51476 | 1000211 | 2017-04-09 09:00:06 |
| 1 | 2017-03-27 | 4529349 | 10 | 523 | 51544 | 51544 | 51744 | 51763 | 1000211 | 2017-04-09 09:00:06 |
| 2 | 2017-03-27 | 4529349 | 15 | 4382 | 51764 | 51764 | 52048 | 52077 | 1000211 | 2017-04-09 09:00:06 |
| 3 | 2017-03-27 | 4529349 | 20 | 1223 | 52044 | 52044 | 52340 | 52340 | 1000211 | 2017-04-09 09:00:06 |
| 4 | 2017-03-27 | 4529349 | 25 | 1228 | 52343 | 52343 | 52629 | 52637 | 1000211 | 2017-04-09 09:00:06 |


In [12]:
assess_df(df_leavetime_sample)

Datatypes:

dayofservice       datetime64[ns]
tripid                      int64
progrnumber                 int64
stoppointid                 int64
plannedtime_arr             int64
plannedtime_dep             int64
actualtime_arr              int64
actualtime_dep              int64
vehicleid                   int64
lastupdate         datetime64[ns]
dtype: object



**************************************************
ROWS, COLUMNS, NULLS

999 rows in the dataframe.
10 columns in the dataframe.

0 null values in the dataframe.
0 duplicate rows
0 duplicate columns


Constant numeric columns: FALSE



**************************************************
DESCRIPTION

                 count          mean            std        min        25%  \
tripid           999.0  4.366036e+06  210532.711369  4082096.0  4093337.0   
progrnumber      999.0  2.987387e+01      16.202319        1.0       19.0   
stoppointid      999.0  2.677818e+03    2209.434783       11.0     1071.0   
plannedtime_arr  999.0 