<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Documentation" data-toc-modified-id="Documentation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Documentation</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Initial-Data-Quality-Assessment-&amp;-Configuration" data-toc-modified-id="Initial-Data-Quality-Assessment-&amp;-Configuration-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Initial Data Quality Assessment &amp; Configuration</a></span><ul class="toc-item"><li><span><a href="#Configure-Datatypes" data-toc-modified-id="Configure-Datatypes-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Configure Datatypes</a></span></li><li><span><a href="#Drop-Constant-Columns" data-toc-modified-id="Drop-Constant-Columns-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Drop Constant Columns</a></span></li></ul></li><li><span><a href="#Investigate-Data-Anomolies" data-toc-modified-id="Investigate-Data-Anomolies-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Investigate Data Anomolies</a></span></li><li><span><a href="#Data-Insights" data-toc-modified-id="Data-Insights-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Data Insights</a></span><ul class="toc-item"><li><span><a href="#Most-and-Least-Recent-LASTUPDATE-entry" data-toc-modified-id="Most-and-Least-Recent-LASTUPDATE-entry-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Most and Least Recent LASTUPDATE entry</a></span></li></ul></li></ul></div>

# Documentation

- Each record in this table represents one vehicle's service for one day of service and gives the total distance travelled and time worked by that vehicle on that day.

- In accordance with the time representation defined for planned data, AVL data refer to one day of service, which can span more than 24 hours.

<br>

- For example, a trip starting at 23:30 on 1 April and ending at 00:30 on 2 April could have its last AVL data point tracked at “24:30 on 1 April”.

<br>

- To express this “extended time” the interface requires the use of seconds past the beginning of the operation day.

<br>

- In our example the DayOfService is “01/04/2014” and the extended time is “88200” (= 24 * 60 * 60 + 30 * 60).
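
The extended-time arithmetic above can be checked with a short sketch. The helper name `extended_time_to_timestamp` is hypothetical, and the code assumes that the operation day begins at 00:00 and that the day is written as DD/MM/YYYY, as in the example.

```python
import pandas as pd

def extended_time_to_timestamp(day_of_service, seconds_past_start):
    """Convert a service day plus extended-time seconds to a calendar timestamp."""
    # Assumption: the operation day starts at midnight of day_of_service (DD/MM/YYYY)
    day = pd.to_datetime(day_of_service, format='%d/%m/%Y')
    return day + pd.Timedelta(seconds=seconds_past_start)

# "24:30" on the service day, expressed as seconds past the start of that day
extended = 24 * 60 * 60 + 30 * 60
print(extended)                                            # 88200
print(extended_time_to_timestamp('01/04/2014', extended))  # 2014-04-02 00:30:00
```

Keeping the service day and the offset separate means a trip that runs past midnight still belongs to the day on which it started.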

---

- Distance is expressed in the unit of measurement configured in the DMS settings: either “kilometers” or “miles”.

<br>

- Each table contains a field called “LastUpdate” used to detect new data availability.

| Column | Description |
| --- | --- |
| DataSource | Unique Bus Operator Code: used to identify the Bus Operator's system from which the data arrive. |
| DayOfService | Day of service |
| VehicleID | Unique vehicle code arriving at this stop point |
| Distance | Distance travelled by the vehicle on this day |
| Minutes | Time worked by the vehicle on the corresponding day |
| LastUpdate | Time of the last record update |
| Note | Free Note |
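
As a rough illustration of how a “LastUpdate” field can be used to detect new data, the sketch below keeps only the rows written after a checkpoint. The toy records and the `last_seen` checkpoint are invented for illustration; how the real interface tracks its checkpoint is not described here.

```python
import pandas as pd

# Toy records; the values are invented for illustration only
records = pd.DataFrame({
    'VEHICLEID': [1, 2, 3],
    'LASTUPDATE': pd.to_datetime(['2018-01-08 17:21:10',
                                  '2018-06-01 09:00:00',
                                  '2019-01-16 18:27:21']),
})

# Hypothetical checkpoint: the newest LASTUPDATE seen in the previous load
last_seen = pd.Timestamp('2018-03-01')

# Rows written after the checkpoint are treated as new or changed
new_rows = records[records['LASTUPDATE'] > last_seen]
print(new_rows['VEHICLEID'].tolist())  # [2, 3]
```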

# Imports

In [1]:
# display and HTML are part of the public IPython.display module
from IPython.display import display, HTML
display(HTML('<style>.container {width:80% !important;}</style>'))

In [58]:
pd.__version__

'0.24.2'

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Display plots in the notebook
%matplotlib inline

In [57]:
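def assess_df(df):
    """Print a data quality report: dtypes, shape, nulls, duplicates, constant
    columns, summary statistics, cardinalities, memory usage, head and tail."""

    # names of the dtypes present in the dataframe, e.g. {'int64', 'category'}
    feature_types = {dtype.name for dtype in df.dtypes}

    # print datatypes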
  
    print("Datatypes:\n")
    print(df.dtypes)
    print('\n\n')
    
    # print number rows and cols
    print('*' * 50)
    print('ROWS, COLUMNS, NULLS\n')
    print(df.shape[0], "rows in the dataframe.")
    print(df.shape[1], "columns in the dataframe.\n")
    print(df.isnull().sum().sum(), "null values in the dataframe.")
    
    # check for duplicate rows and duplicate column labels; print duplicate rows if present

    duplicate_rows = df.duplicated(keep=False).sum()
    # df.index.T only inspected the row index; df.columns checks the column labels
    duplicate_cols = df.columns.duplicated(keep=False).sum()

    print(duplicate_rows, "duplicate rows")
    print(duplicate_cols, "duplicate columns")

    if duplicate_rows > 0 or duplicate_cols > 0:
        print(df.loc[df.duplicated()])

    print('\n')
    
    # check for constant columns

    numeric_cols = list(df.select_dtypes(include=[np.number]).columns.values)
    categorical_cols = list(df.select_dtypes('category').columns.values)

    # standard deviation of each numeric column (NaN for an all-null column)
    standard_diffs = [df[x].std() for x in numeric_cols]

    # a categorical column is constant when it holds exactly one distinct value
    constant_categorical_column = False
    if 'category' in feature_types:
        for column in categorical_cols:
            if df[column].nunique() == 1:
                constant_categorical_column = True

            
    if 0 in standard_diffs:
        print("Constant numeric columns: TRUE")
    else:
        print("Constant numeric columns: FALSE")
  
    if 'category' in feature_types:    
        if constant_categorical_column == True:
            print("Constant categorical columns: TRUE")
        else:
            print("Constant categorical columns: FALSE")

            
    
    print('\n\n')

    
    # feature stats
    print('*' * 50)
    print('DESCRIPTION\n')
    print(df.describe().T)
    print('\n\n')
    print('*' * 50)

    
    # feature stats: categorical
    
    if 'category' in feature_types:
        print('CATEGORICAL DESCRIPTION\n')
        print('\n')
        print(df.select_dtypes(['category']).describe().T)
        print('\n\n')
        print('*' * 50)
    
    # print feature cardinalities
    
    print("FEATURE CARDINALITIES\n")
    column_names = list(df.columns.values)

    print('{0:45}  {1}'.format("Feature", "Distinct Values"))
    print('{0:45}  {1}'.format("-------", "--------------- \n"))

    for c in column_names:
        print('{0:45}  {1}'.format(c, str(len(df[c].unique()))))
    
    print('\n')
    print('*' * 50)
    print('MEMORY\n')
    
    
    # info() prints its report directly and returns None, so it is not wrapped in print()
    df.info(memory_usage='deep')
    print('\n')
    print(df.memory_usage(deep=True))
    print('\n')
    print('*' * 50)
    print('HEAD\n')
    print(df.head(10))
    print('\n')
    print('*' * 50)
    print('TAIL\n')
    print(df.tail(10))

# Initial Data Quality Assessment & Configuration

In [7]:
path = '/Users/davidodwyer/Documents/studyCS/Semester_3/data/original_raw_data/full_bus_data/data/rt_vehicles_DB_2018.csv'
vehicles = pd.read_csv(path)

In [16]:
assess_df(vehicles)

Datatypes:

DATASOURCE       object
DAYOFSERVICE     object
VEHICLEID         int64
DISTANCE          int64
MINUTES           int64
LASTUPDATE       object
NOTE            float64
dtype: object



**************************************************
ROWS, COLUMNS, NULLS

272622 rows in the dataframe.
7 columns in the dataframe.

272622 null values in the dataframe.
0 duplicate rows
0 duplicate columns


Constant numeric columns: FALSE



**************************************************
DESCRIPTION

              count          mean            std        min        25%  \
VEHICLEID  272622.0  1.928436e+06  810440.716935  1000110.0  1000948.0   
DISTANCE   272622.0  1.774863e+05   71334.432383        0.0   126765.0   
MINUTES    272622.0  4.559764e+04   14881.389868   -70145.0    39380.0   
NOTE            0.0           NaN            NaN        NaN        NaN   

                 50%         75%        max  
VEHICLEID  2172252.0  2693217.00  3394131.0  
DISTANCE    177420.0   227126.25 

## Configure Datatypes

In [17]:
# Parse the whole column in one vectorised call; plain column assignment replaces the object column
vehicles['DAYOFSERVICE'] = pd.to_datetime(vehicles['DAYOFSERVICE'], format='%d-%b-%y %H:%M:%S')

In [18]:
# Parse the whole column in one vectorised call; plain column assignment replaces the object column
vehicles['LASTUPDATE'] = pd.to_datetime(vehicles['LASTUPDATE'], format='%d-%b-%y %H:%M:%S')

In [19]:
vehicles = vehicles.astype({'DATASOURCE':'category', 'VEHICLEID':'category'})
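
A minimal sketch of what the `'%d-%b-%y %H:%M:%S'` format string expects, using invented sample strings rather than rows from the real file. It assumes English month abbreviations, since `%b` depends on the locale.

```python
import pandas as pd

# Invented sample strings in the day-month-year layout the format string expects
raw = pd.Series(['01-Jan-18 00:00:00', '16-Jan-19 18:27:21'])

# Vectorised parse; a string that does not match the format raises an error by default
parsed = pd.to_datetime(raw, format='%d-%b-%y %H:%M:%S')

print(parsed)
print(pd.api.types.is_datetime64_any_dtype(parsed))
```

Passing an explicit format avoids ambiguity between day-first and month-first dates and is usually faster than letting pandas infer the layout.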

## Drop Constant Columns

In [20]:
vehicles.drop('NOTE', axis=1, inplace=True)

In [21]:
vehicles.drop('DATASOURCE', axis=1, inplace=True)
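
The two columns above were dropped by name. One possible way to find such columns programmatically, whatever their dtype, is to count distinct values per column. This is a sketch on an invented toy frame (the `'DB'` value is a placeholder, not taken from the data), not the check that `assess_df` performs.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw table; all values are invented
toy = pd.DataFrame({
    'DATASOURCE': ['DB', 'DB', 'DB'],          # one distinct value
    'VEHICLEID': [1000110, 1000948, 2172252],
    'NOTE': [np.nan, np.nan, np.nan],          # entirely null
})

# Count distinct values per column; dropna=False makes an all-null column count as one value
distinct = toy.nunique(dropna=False)
constant_cols = distinct[distinct <= 1].index.tolist()
print(constant_cols)  # ['DATASOURCE', 'NOTE']

# Drop every constant column in one call
trimmed = toy.drop(columns=constant_cols)
```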

In [56]:
assess_df(vehicles)

Datatypes:

DAYOFSERVICE    datetime64[ns]
VEHICLEID             category
DISTANCE                 int64
MINUTES                  int64
LASTUPDATE      datetime64[ns]
dtype: object



**************************************************
ROWS, COLUMNS, NULLS

272622 rows in the dataframe.
5 columns in the dataframe.

0 null values in the dataframe.
0 duplicate rows
0 duplicate columns


Constant numeric columns: FALSE
Constant categorical columns: FALSE



**************************************************
DESCRIPTION

             count           mean           std      min       25%       50%  \
DISTANCE  272622.0  177486.328466  71334.432383      0.0  126765.0  177420.0   
MINUTES   272622.0   45597.643220  14881.389868 -70145.0   39380.0   46994.5   

                75%       max  
DISTANCE  227126.25  473864.0  
MINUTES    57911.00   76651.0  



**************************************************
CATEGORICAL DESCRIPTION



            count  unique      top  freq
VEHICLEID  272622  

# Investigate Data Anomalies
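
The summary statistics printed earlier show a minimum MINUTES value of -70145 and a minimum DISTANCE of 0. A possible starting point for investigating these is sketched below on a toy frame. Only the -70145 and 0 values come from the output above; the remaining rows and the variable names are invented.

```python
import pandas as pd

# Toy rows; only the -70145 and 0 extremes come from the describe() output
toy = pd.DataFrame({
    'VEHICLEID': [1000110, 1000948, 2172252],
    'DISTANCE': [126765, 0, 177420],
    'MINUTES': [39380, -70145, 46994],
})

# A negative working time cannot be valid, so flag those rows for inspection
negative_minutes = toy[toy['MINUTES'] < 0]

# Zero distance may be legitimate (a vehicle that never left the depot) but is worth counting
zero_distance = toy[toy['DISTANCE'] == 0]

print(len(negative_minutes), "rows with negative MINUTES")
print(len(zero_distance), "rows with zero DISTANCE")
```

On the real table the same filters would be applied to `vehicles` instead of `toy`.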

# Data Insights

## Most and Least Recent LASTUPDATE entry

In [63]:
print("Earliest 'Last Update'", vehicles.LASTUPDATE.min())

Earliest 'Last Update' 2018-01-08 17:21:10


In [64]:
print("Latest 'Last Update'", vehicles.LASTUPDATE.max())

Latest 'Last Update' 2019-01-16 18:27:21
