<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Documentation" data-toc-modified-id="Documentation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Documentation</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Initial-Data-Quality-Assessment-&amp;-Configuration" data-toc-modified-id="Initial-Data-Quality-Assessment-&amp;-Configuration-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Initial Data Quality Assessment &amp; Configuration</a></span><ul class="toc-item"><li><span><a href="#Drop-Constant-Column-and-NaN-columns" data-toc-modified-id="Drop-Constant-Column-and-NaN-columns-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Drop Constant Column and NaN columns</a></span></li><li><span><a href="#Configure-Datatypes" data-toc-modified-id="Configure-Datatypes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Configure Datatypes</a></span></li></ul></li><li><span><a href="#Investigate-Data-Anomolies" data-toc-modified-id="Investigate-Data-Anomolies-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Investigate Data Anomolies</a></span><ul class="toc-item"><li><span><a href="#Determine-%-missing-values-in-affected-columns" data-toc-modified-id="Determine-%-missing-values-in-affected-columns-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Determine % missing values in affected columns</a></span></li></ul></li></ul></div>

# Documentation

- Each record of this table contains the actual data for one trip.

<br>

- If the trip is planned and achieved, the record will contain both the planned and actual data. If the trip is planned and suppressed, the record will contain only the planned data and the flag “Suppressed” will be checked.

<br>

- If the trip is partially suppressed, the flag “Suppressed” will not be checked. The set of suppressed leave times is detailed in table RT_LeaveTimes (9.2.3).

<br>

- Each record can be linked to a fault event through the field “JustificationID”.

<br>

- In accordance with the time representation defined for planned data, AVL data refer to one day of service, which can last longer than 24 hours.

<br>

- For example, a trip starting at 23:30 on the 1st of April and ending at 00:30 on the 2nd of April could have its last AVL data tracked at “24:30 of 1st of April”.

<br>

- To express this “extended time” the interface requires the use of seconds past the beginning of the operation day.

<br>

- In our example the DayOfService is “01/04/2014” and the extended time is “88200” (= 24 * 60 * 60 + 30 * 60).
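
The extended-time convention described above can be turned back into a calendar timestamp by adding the seconds to the start of the service day. This is a minimal sketch, assuming the operation day begins at midnight (which is what the 88200 example implies); the helper name `extended_time_to_datetime` is hypothetical and not part of the interface.

```python
from datetime import date, datetime, time, timedelta


def extended_time_to_datetime(day_of_service: date, seconds: int) -> datetime:
    """Convert seconds past the start of the service day to a timestamp.

    Hypothetical helper: assumes the service day starts at midnight.
    """
    start_of_day = datetime.combine(day_of_service, time.min)
    return start_of_day + timedelta(seconds=seconds)


# The documented example: "24:30 of 1st of April" 2014 is 88200 seconds.
extended_seconds = 24 * 60 * 60 + 30 * 60
print(extended_time_to_datetime(date(2014, 4, 1), extended_seconds))
# prints 2014-04-02 00:30:00
```

Values of 86400 or more simply roll over into the next calendar day, while DayOfService itself stays unchanged.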

---

- The distance is expressed in the unit of measurement configured in the DMS settings, either “kilometers” or “miles”.

<br>

- Each table contains a field called “LastUpdate” used to detect new data availability.

| Column | Desc. |
| --- | --- |
| DataSource | Unique Bus Operator Code: used to identify the Bus Operator's system from which the data arrive. |
| DayOfService | One day of service could last more than 24hrs |
| TripID | Unique Trip Code |
| LineID | Unique Line Code |
| RouteID | Unique Route Code |
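| Direction | IB = inbound / going / northbound / eastbound; OB = outbound / back / southbound / westbound |
| PlannedTime_Dept | Planned departure time, in seconds past the start of the day of service |
| PlannedTime_Arr | Planned arrival time, in seconds past the start of the day of service |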
| Basin | Basin Code |
| TenderLot | Tender lot |
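| ActualTime_Dept | Actual departure time, in seconds past the start of the day of service |
| ActualTime_Arr | Actual arrival time, in seconds past the start of the day of service |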
| Suppressed | Whole trip suppressed: 0 = achieved; 1 = suppressed |
| JustificationID | Fault Code |
| LastUpdate | Time of the last record update |
| Note | Free Note |

# Imports

In [1]:
from IPython.display import display, HTML
display(HTML('<style>.container {width:80% !important;}</style>'))

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
pd.__version__

'0.24.2'

In [4]:
# Display plots in the notebook
%matplotlib inline

In [5]:
def assess_df(df):
  
    feature_types = {dtype.name for dtype in df.dtypes}

    # print datatypes
  
    print("Datatypes:\n")
    print(df.dtypes)
    print('\n\n')
    
    # print number rows and cols
    print('*' * 50)
    print('ROWS, COLUMNS, NULLS\n')
    print(df.shape[0], "rows in the dataframe.")
    print(df.shape[1], "columns in the dataframe.\n")
    print(df.isnull().sum().sum(), "null values in the dataframe.")
    
    # check duplicate rows & col; print if present
    
    duplicate_rows = df.duplicated(keep=False).sum()
    # count duplicated column labels (df.index would inspect the row labels instead)
    duplicate_cols = df.columns.duplicated(keep=False).sum()
    
    print(duplicate_rows, "duplicate rows")
    print(duplicate_cols, "duplicate columns")
    
    if duplicate_rows > 0 or duplicate_cols > 0:
        print(df.loc[df.duplicated()])

    print('\n')
    
    # check for constant columns
    
    numeric_cols = list(df.select_dtypes(include=[np.number]).columns.values)
    categorical_cols = list(df.select_dtypes('category').columns.values)
    
    standard_diffs = [df.describe().loc['std', x] for x in numeric_cols]
    
    constant_categorical_column=False
    if 'category' in feature_types:
        for column in categorical_cols:
            # a categorical column is constant when it holds a single distinct value
            if df[column].nunique() == 1:
                constant_categorical_column = True
            
            
    if 0 in standard_diffs:
        print("Constant numeric columns: TRUE")
    else:
        print("Constant numeric columns: FALSE")
  
    if 'category' in feature_types:    
        if constant_categorical_column == True:
            print("Constant categorical columns: TRUE")
        else:
            print("Constant categorical columns: FALSE")

            
    
    print('\n\n')

    
    # feature stats
    print('*' * 50)
    print('DESCRIPTION\n')
    print(df.describe().T)
    print('\n\n')
    print('*' * 50)

    
    # feature stats: categorical
    
    if 'category' in feature_types:
        print('CATEGORICAL DESCRIPTION\n')
        print('\n')
        print(df.select_dtypes(['category']).describe().T)
        print('\n\n')
        print('*' * 50)
    
    # print feature cardinalities
    
    print("FEATURE CARDINALITIES\n")
    column_names = list(df.columns.values)

    print('{0:45}  {1}'.format("Feature", "Distinct Values"))
    print('{0:45}  {1}'.format("-------", "--------------- \n"))

    for c in column_names:
        print('{0:45}  {1}'.format(c, str(len(df[c].unique()))))
    
    print('\n')
    print('*' * 50)
    print('MEMORY\n')
    
    
    # df.info() prints its report itself and returns None, so it is not wrapped in print()
    df.info(memory_usage='deep')
    print('\n')
    print(df.memory_usage(deep=True))
    print('\n')
    print('*' * 50)
    print('HEAD\n')
    print(df.head(10))
    print('\n')
    print('*' * 50)
    print('TAIL\n')
    print(df.tail(10))

# Initial Data Quality Assessment & Configuration

In [6]:
path = "/Users/davidodwyer/Documents/studyCS/Semester_3"\
"/data/original_raw_data/bus/sample_bus_data-20190607"\
"/rt_trips_2017_I_DB_1000.txt"

df_trips_sample = pd.read_csv(path, sep=';')

In [7]:
df_trips_sample.columns

Index(['datasource', 'dayofservice', 'tripid', 'lineid', 'routeid',
       'direction', 'plannedtime_arr', 'plannedtime_dep', 'actualtime_arr',
       'actualtime_dep', 'basin', 'tenderlot', 'suppressed', 'justificationid',
       'lastupdate', 'note'],
      dtype='object')

In [8]:
assess_df(df_trips_sample)

Datatypes:

datasource          object
dayofservice        object
tripid               int64
lineid              object
routeid             object
direction            int64
plannedtime_arr      int64
plannedtime_dep      int64
actualtime_arr     float64
actualtime_dep     float64
basin               object
tenderlot          float64
suppressed         float64
justificationid    float64
lastupdate          object
note                object
dtype: object



**************************************************
ROWS, COLUMNS, NULLS

999 rows in the dataframe.
16 columns in the dataframe.

3167 null values in the dataframe.
0 duplicate rows
0 duplicate columns


Constant numeric columns: FALSE



**************************************************
DESCRIPTION

                 count          mean            std        min        25%  \
tripid           999.0  4.370038e+06  143955.545965  4081366.0  4334691.0   
direction        999.0  1.505506e+00       0.500220        1.0        1.0   
plann

## Drop Constant Column and NaN columns

In [9]:
df_trips_sample.drop([
    'basin',
    'datasource',
    'tenderlot',
    'suppressed',
    'justificationid'], axis=1, inplace=True)
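
The columns to drop can also be found programmatically rather than listed by hand. This is a minimal sketch on a made-up toy frame; the helper name `constant_or_empty_columns` and the sample values are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd


def constant_or_empty_columns(df: pd.DataFrame) -> list:
    """Return columns that are entirely NaN or hold one distinct non-null value."""
    # nunique() ignores NaN by default, so an all-NaN column counts 0
    # and a column with one repeated value counts 1.
    distinct_counts = df.nunique()
    return distinct_counts[distinct_counts <= 1].index.tolist()


# Toy frame standing in for the trips sample (values are made up).
toy = pd.DataFrame({
    "datasource": ["DB", "DB", "DB"],
    "tenderlot": [np.nan, np.nan, np.nan],
    "tripid": [1, 2, 3],
})
to_drop = constant_or_empty_columns(toy)
print(to_drop)
toy = toy.drop(columns=to_drop)
```

Because NaN is ignored, a column holding one repeated value plus some missing entries is also reported, so it is worth reviewing the returned list before dropping anything.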

## Configure Datatypes

In [12]:
df_trips_sample.dtypes

dayofservice        object
tripid               int64
lineid              object
routeid             object
direction            int64
plannedtime_arr      int64
plannedtime_dep      int64
actualtime_arr     float64
actualtime_dep     float64
lastupdate          object
note                object
dtype: object

In [13]:
# Parse the timestamp strings with one vectorized call per column
df_trips_sample['dayofservice'] = pd.to_datetime(
    df_trips_sample['dayofservice'], format='%d-%b-%y %H:%M:%S')

df_trips_sample['lastupdate'] = pd.to_datetime(
    df_trips_sample['lastupdate'], format='%d-%b-%y %H:%M:%S')

In [14]:
df_trips_sample = df_trips_sample.astype({
    'direction':'category',
    'lineid':'category',
    'routeid':'category'
})

In [15]:
assess_df(df_trips_sample)

Datatypes:

dayofservice       datetime64[ns]
tripid                      int64
lineid                   category
routeid                  category
direction                category
plannedtime_arr             int64
plannedtime_dep             int64
actualtime_arr            float64
actualtime_dep            float64
lastupdate         datetime64[ns]
note                       object
dtype: object



**************************************************
ROWS, COLUMNS, NULLS

999 rows in the dataframe.
11 columns in the dataframe.

170 null values in the dataframe.
0 duplicate rows
0 duplicate columns


Constant numeric columns: FALSE
Constant categorical columns: FALSE



**************************************************
DESCRIPTION

                 count          mean            std        min        25%  \
tripid           999.0  4.370038e+06  143955.545965  4081366.0  4334691.0   
plannedtime_arr  999.0  5.502891e+04   17726.494665    20260.0    39199.5   
plannedtime_dep  999.0  5.16

Note: some actualtime_arr and actualtime_dep values are missing. The exact percentages are computed under "Determine % missing values in affected columns".

In [16]:
df_trips_sample.head(20)

| | dayofservice | tripid | lineid | routeid | direction | plannedtime_arr | plannedtime_dep | actualtime_arr | actualtime_dep | lastupdate | note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-02-24 | 4391739 | 140 | 140_19 | 1 | 76357 | 73800 | 76935.0 | 73806.0 | 2017-03-05 02:56:33 | ,2412713, |
| 1 | 2017-02-24 | 4388762 | 140 | 140_19 | 1 | 54484 | 51000 | 54607.0 | NaN | 2017-03-05 02:56:33 | ,2413302, |
| 2 | 2017-02-24 | 4381744 | 4 | 4_12 | 2 | 57516 | 53100 | 58761.0 | 53041.0 | 2017-03-05 02:56:33 | ,2612752, |
| 3 | 2017-02-24 | 4388860 | 84A | 84A_1 | 1 | 67967 | 65400 | 67842.0 | 65362.0 | 2017-03-05 02:56:33 | ,2817382, |
| 4 | 2017-02-24 | 4388883 | 39A | 39A_40 | 1 | 34126 | 29400 | 34777.0 | 29619.0 | 2017-03-05 02:56:33 | ,2417274, |
| 5 | 2017-02-24 | 4388888 | 39A | 39A_43 | 2 | 65671 | 60600 | 66094.0 | 60593.0 | 2017-03-05 02:56:33 | ,2417282, |
| 6 | 2017-02-24 | 4389028 | 39 | 39_21 | 2 | 63799 | 58920 | 64776.0 | NaN | 2017-03-05 02:56:33 | ,2414822, |
| 7 | 2017-02-24 | 4379972 | 27 | 27_17 | 2 | 29021 | 24000 | 28859.0 | 24005.0 | 2017-03-05 02:56:33 | ,2764337, |
| 8 | 2017-02-24 | 4380017 | 41 | 41_7 | 2 | 33756 | 30000 | 33581.0 | 30066.0 | 2017-03-05 02:56:33 | ,1727884, |
| 9 | 2017-02-24 | 4380023 | 41C | 41C_79 | 2 | 61616 | 57900 | 61644.0 | 57913.0 | 2017-03-05 02:56:33 | ,1727893, |


# Investigate Data Anomolies

## Determine % missing values in affected columns

In [58]:
# number of rows for which actual time of arrival is missing

number_missing_values_actual_arrival = \
len(df_trips_sample.loc[df_trips_sample.actualtime_arr.isna()])

print("Column: Actual time arrival\n")
print("Number of affected rows:", number_missing_values_actual_arrival)
print(f"Missing for {round(number_missing_values_actual_arrival / len(df_trips_sample)*100,2)}\
% of rows")
               

Column: Actual time arrival

Number of affected rows: 73
Missing for 7.31% of rows


In [59]:
# number of rows for which actual time of departure is missing

number_missing_values_actual_dept = \
len(df_trips_sample.loc[df_trips_sample.actualtime_dep.isna()])

print("Column: Actual time departure\n")
print("Number of affected rows:", number_missing_values_actual_dept)
print(f"Missing for {round(number_missing_values_actual_dept / len(df_trips_sample)*100,2)}\
% of rows")


Column: Actual time departure

Number of affected rows: 97
Missing for 9.71% of rows


In [35]:
# number of rows for which both of the above fields' values are missing

len(df_trips_sample[(df_trips_sample.actualtime_dep.isna()) \
                    & (df_trips_sample.actualtime_arr.isna())])

7

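Note: There is little crossover in the missing values of these columns. Only 7 rows are missing both values, so 163 of the 999 rows (73 + 97 - 7, about 16.3%) lack at least one of the two actual times.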
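
The per-column percentages and the overlap can also be derived in one pass from boolean missing-value masks. This is a minimal sketch on a made-up toy frame that reuses the two column names; the values are invented for illustration and are not drawn from the sample file.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_trips_sample (values are made up).
toy = pd.DataFrame({
    "actualtime_arr": [100.0, np.nan, 300.0, np.nan],
    "actualtime_dep": [90.0, 190.0, np.nan, np.nan],
})

# True wherever a value is missing.
missing = toy[["actualtime_arr", "actualtime_dep"]].isna()

# The mean of a boolean mask is the share of True values, here as a percentage.
pct_missing = (missing.mean() * 100).round(2)
print(pct_missing)

# Cross-tabulate the two masks to see how the gaps overlap.
overlap = pd.crosstab(missing["actualtime_arr"], missing["actualtime_dep"])
print(overlap)

# Rows missing both values, and rows missing at least one of them.
both_missing = int((missing["actualtime_arr"] & missing["actualtime_dep"]).sum())
either_missing = int(missing.any(axis=1).sum())
print(both_missing, either_missing)
```

Swapping `toy` for `df_trips_sample` should reproduce the counts reported in the cells above, assuming the frame has been loaded as in this notebook.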