<a id="top"></a>

    Using Dublin Bus GTFS data to compare the stops serviced in 2018 with those serviced in 2021
    Data taken from: https://transitfeeds.com/p/transport-for-ireland/782?p=1

***

# Import Packages

In [1]:
import pandas as pd
import dask.dataframe as dd
import datetime
import matplotlib.pyplot as plt

***

<a id="contents"></a>
# Contents

- [Import Data](#import_data)
- [Quick Look at the Datasets](#quick_look)
- [Stop Changes in 2018](#stop_changes_2018)
- [Stop Changes between 2018 and 2021](#stop_changes_2018_2021)

***

<a id="import_data"></a>
# Import Data

In [2]:
# import stops data and create dataframes
# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False: malformed lines are skipped with a warning
df_2018_start = pd.read_csv('/home/faye/Data-Analytics-CityRoute/Dublin_Bus_GTFS/05-01-2018/stops.txt', sep=',', on_bad_lines='warn')
df_2018_end = pd.read_csv('/home/faye/Data-Analytics-CityRoute/Dublin_Bus_GTFS/18-12-2018/stops.txt', sep=',', on_bad_lines='warn')
df_2021 = pd.read_csv('/home/faye/Data-Analytics-CityRoute/Dublin_Bus_GTFS/17-07-2021/stops.txt', sep=',', on_bad_lines='warn')

***

<a id="quick_look"></a>
# Quick Look at the Datasets
[back to contents](#contents)

In [8]:
# print first 5 rows
df_2018_start.head(5)

| | stop_id | stop_name | stop_lat | stop_lon |
|---|---|---|---|---|
| 0 | 8220B007612 | Davenport Hotel Merrion Street | 53.341347 | -6.250529 |
| 1 | 8220DB000002 | Rotunda, Parnell Square West | 53.352244 | -6.263723 |
| 2 | 8220DB000003 | Rotunda, Granby Place | 53.352309 | -6.263811 |
| 3 | 8220DB000004 | Rotunda, Rotunda Hospital | 53.352575 | -6.264175 |
| 4 | 8220DB000006 | Rotunda, Saint Martin's Chapel | 53.352749 | -6.264454 |


In [9]:
# print first 5 rows
df_2018_end.head(5)

| | stop_id | stop_name | stop_lat | stop_lon | location_type | parent_station |
|---|---|---|---|---|---|---|
| 0 | 8220B007612 | Davenport Hotel Merrion Street | 53.341347 | -6.250529 | NaN | NaN |
| 1 | 8220DB000002 | Rotunda, Parnell Square West | 53.352244 | -6.263723 | NaN | NaN |
| 2 | 8220DB000003 | Rotunda, Granby Place | 53.352309 | -6.263811 | NaN | NaN |
| 3 | 8220DB000004 | Rotunda, Rotunda Hospital | 53.352575 | -6.264175 | NaN | NaN |
| 4 | 8220DB000006 | Rotunda, Saint Martin's Chapel | 53.352749 | -6.264454 | NaN | NaN |


In [10]:
# print first 5 rows
df_2021.head(5)

| | stop_id | stop_name | stop_lat | stop_lon |
|---|---|---|---|---|
| 0 | 8220DB000002 | Parnell Square West, stop 2 | 53.352244 | -6.263723 |
| 1 | 8220DB000003 | Parnell Square West, stop 3 | 53.352309 | -6.263811 |
| 2 | 8220DB000004 | Parnell Square West, stop 4 | 53.352575 | -6.264175 |
| 3 | 8220DB000006 | Parnell Square West, stop 6 | 53.352749 | -6.264454 |
| 4 | 8220DB000007 | Parnell Square West, stop 7 | 53.352841 | -6.26457 |


    For each dataset we have a stop id, stop name, 
    and the coordinates (lat & lon) of the stop.
    The 2018_end dataset has 2 extra features: `location_type` and 
    `parent_station`

In [11]:
# unique values of the location_type feature
df_2018_end['location_type'].unique()

array([nan])

In [12]:
# unique values of the parent_station feature
df_2018_end['parent_station'].unique()

array([nan])

    Both `location_type` and `parent_station` contain only null values 
    (their only unique value is `nan`), so I will drop them.

In [14]:
# drop null columns
df_2018_end = df_2018_end.drop(columns=['location_type','parent_station'])
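
    A minimal sketch of finding all-null columns programmatically, rather than
    checking each one with `unique()`. The `toy` frame and its values are made up
    for illustration; with the real data the same calls would run on `df_2018_end`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_2018_end (all values here are made up)
toy = pd.DataFrame({
    "stop_id": ["A1", "A2"],
    "stop_name": ["First", "Second"],
    "location_type": [np.nan, np.nan],
    "parent_station": [np.nan, np.nan],
})

# Names of the columns in which every value is missing
all_null_cols = [col for col in toy.columns if toy[col].isna().all()]
print(all_null_cols)

# dropna(axis=1, how="all") removes exactly those columns in one call
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

    Using `how="all"` only drops a column when every value is missing, so a
    partially filled column would be kept.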

In [17]:
# print number of rows in each dataset
print(f"The 2018_start dataset has {df_2018_start.shape[0]} rows.")
print(f"The 2018_end   dataset has {df_2018_end.shape[0]} rows.")
print(f"The 2021       dataset has {df_2021.shape[0]} rows.")

The 2018_start dataset has 4690 rows.
The 2018_end   dataset has 4430 rows.
The 2021       dataset has 4220 rows.


    From this it appears that there are 470 fewer stops in 2021 than at the 
    start of 2018 (4690 vs 4220).
    This may mean that we have data for stops that no longer exist.
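
    The row counts printed above give the following net differences. This is
    plain arithmetic on those three numbers; the variable names are only for
    illustration.

```python
# Row counts as printed by the cell above
rows_2018_start = 4690
rows_2018_end = 4430
rows_2021 = 4220

# Net change in the number of rows between snapshots
drop_during_2018 = rows_2018_start - rows_2018_end
drop_2018_end_to_2021 = rows_2018_end - rows_2021
total_drop = rows_2018_start - rows_2021

print(f"Net drop during 2018:            {drop_during_2018}")
print(f"Net drop from end of 2018 to 2021: {drop_2018_end_to_2021}")
print(f"Net drop from start of 2018 to 2021: {total_drop}")
```

    These are net figures only: stops added and stops removed between two
    snapshots offset each other, so the number of individual changes can be larger.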

In [22]:
# check the data types of each dataset
print("\n2018_start Dataset")
print(df_2018_start.dtypes)
print("\n2018_end Dataset")
print(df_2018_end.dtypes)
print("\n2021 Dataset")
print(df_2021.dtypes)


2018_start Dataset
stop_id       object
stop_name     object
stop_lat     float64
stop_lon     float64
dtype: object

2018_end Dataset
stop_id       object
stop_name     object
stop_lat     float64
stop_lon     float64
dtype: object

2021 Dataset
stop_id       object
stop_name     object
stop_lat     float64
stop_lon     float64
dtype: object


In [25]:
# Find number of duplicate rows in each Dataset
datasets = {"2018_start Dataset": df_2018_start, "2018_end Dataset": df_2018_end, "2021": df_2021}

for i, (label, df) in enumerate(datasets.items()):
    if i > 0:
        print("~"*20)
    print(label)
    # rows that repeat an earlier row (the first occurrence is not counted)
    num_duplicate_rows = df.duplicated().sum()
    print(f"There are {num_duplicate_rows} duplicated rows in this dataset (excluding the first row).")
    # every row that has at least one identical copy (the first occurrence is counted too)
    num_duplicate_rows_inclusive = df.duplicated(keep=False).sum()
    print(f"There are {num_duplicate_rows_inclusive} duplicated rows in this dataset (including row that is duplicated).")

2018_start Dataset
There are 0 duplicated rows in this dataset (excluding the first row).
There are 0 duplicated rows in this dataset (including row that is duplicated).
~~~~~~~~~~~~~~~~~~~~
2018_end Dataset
There are 0 duplicated rows in this dataset (excluding the first row).
There are 0 duplicated rows in this dataset (including row that is duplicated).
~~~~~~~~~~~~~~~~~~~~
2021
There are 0 duplicated rows in this dataset (excluding the first row).
There are 0 duplicated rows in this dataset (including row that is duplicated).
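
    The check above compares whole rows. A possible follow-up, sketched here on
    made-up data, is to check for repeated `stop_id` values, since two rows can
    share an ID while differing in name. The `toy` frame is hypothetical; on the
    real data the `subset="stop_id"` calls would be applied to each dataframe.

```python
import pandas as pd

# Toy frame (made-up values): A2 appears twice with different names
toy = pd.DataFrame({
    "stop_id": ["A1", "A2", "A2"],
    "stop_name": ["First", "Second", "Second (renamed)"],
    "stop_lat": [53.1, 53.2, 53.2],
    "stop_lon": [-6.1, -6.2, -6.2],
})

# Whole-row comparison: no two rows are identical in every column
full_row_dupes = int(toy.duplicated().sum())

# Comparison on stop_id only: the second A2 row repeats an earlier ID
id_dupes = int(toy.duplicated(subset="stop_id").sum())

# keep=False marks every row whose stop_id occurs more than once
id_dupes_inclusive = int(toy.duplicated(subset="stop_id", keep=False).sum())

print(full_row_dupes, id_dupes, id_dupes_inclusive)
```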


***

<a id="stop_changes_2018"></a>
# Stop Changes in 2018
[back to contents](#contents)

In [None]:
# for each row in the dataset:
    # if the rows aren't equal:
        # print row


In [51]:
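# create new dataframe of the rows that differ between the start and end of 2018
# keep=False drops every row that appears in both datasets, leaving only rows unique to one of them
df_2018_changes = pd.concat([df_2018_start,df_2018_end]).drop_duplicates(keep=False)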

In [57]:
# print number of rows in the new dataset
print(f"There are {df_2018_changes.shape[0]} differences in the stops at the start and end of 2018.")

There are 1526 differences in the stops at the start and end of 2018.


In [55]:
# first 10 rows of df_2018_changes
df_2018_changes.head(10)

| | stop_id | stop_name | stop_lat | stop_lon |
|---|---|---|---|---|
| 14 | 8220DB000018 | Dromcondra, Lower Drumcondra Road | 53.365856 | -6.255957 |
| 15 | 8220DB000019 | Dromcondra, Lower Drumcondra Road | 53.367144 | -6.255514 |
| 40 | 8220DB000047 | Dromcondra, Near Train Station | 53.363916 | -6.257298 |
| 82 | 8220DB000104 | Poppintree, Balbutcher Lane (Carrig Road) | 53.399684 | -6.276816 |
| 83 | 8220DB000105 | Ballymun, Balbutcher Lane (Cranogue Road) | 53.402937 | -6.281768 |
| 85 | 8220DB000110 | Ballymun, Balbutcher Lane (Árd Na Meala) | 53.39794 | -6.268691 |
| 93 | 8220DB000119 | Dromcondra, Upper Drumcondra Road (Griffith Av... | 53.37519 | -6.250766 |
| 96 | 8220DB000128 | Wadelai, Glasnevin Avenue (Ballymun Road) | 53.389315 | -6.265823 |
| 100 | 8220DB000132 | Glasnevin North, Beneavin Park | 53.390484 | -6.28266 |
| 105 | 8220DB000138 | Finglas, Saint Canice's School | 53.389922 | -6.292351 |
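
    `drop_duplicates(keep=False)` on the concatenated frames keeps every row
    that appears in only one snapshot, so a stop whose name or coordinates
    changed shows up twice (old row and new row). The count above is therefore
    a number of rows, not of stops. The sketch below shows this on made-up data
    (the `start` and `end` frames and their IDs are hypothetical), together with
    one possible way to separate removed, added and renamed stops using
    `merge(..., indicator=True)`.

```python
import pandas as pd

# Toy snapshots (made-up values): A2 is renamed, A3 is removed, A4 is added
start = pd.DataFrame({
    "stop_id": ["A1", "A2", "A3"],
    "stop_name": ["First", "Second", "Third"],
})
end = pd.DataFrame({
    "stop_id": ["A1", "A2", "A4"],
    "stop_name": ["First", "Second (renamed)", "Fourth"],
})

# Same technique as the notebook cell: rows that occur in only one snapshot
changes = pd.concat([start, end]).drop_duplicates(keep=False)
print(len(changes))  # A2 contributes two rows (old and new), plus A3 and A4

# Outer merge on stop_id; the _merge column records where each ID was found
merged = start.merge(end, on="stop_id", how="outer",
                     suffixes=("_start", "_end"), indicator=True)

removed = merged.loc[merged["_merge"] == "left_only", "stop_id"].tolist()
added = merged.loc[merged["_merge"] == "right_only", "stop_id"].tolist()

# IDs present in both snapshots whose name differs
both = merged[merged["_merge"] == "both"]
renamed = both.loc[both["stop_name_start"] != both["stop_name_end"], "stop_id"].tolist()

print(removed, added, renamed)
```

    Keying on `stop_id` assumes the ID stays stable when a stop is renamed or
    moved; that assumption would need checking against the real feeds.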


***

<a id="stop_changes_2018_2021"></a>
# Stop Changes between 2018 and 2021
[back to contents](#contents)
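
    In the previews above, stop `8220DB000002` has the same coordinates in the
    2018 and 2021 data but a different `stop_name`, so a whole-row comparison
    would flag renamed stops as changes. One possible approach for this section,
    sketched on made-up data, is to compare the sets of `stop_id` values
    instead. The frames `stops_2018` and `stops_2021` and their IDs are
    hypothetical stand-ins for `df_2018_end` and `df_2021`.

```python
import pandas as pd

# Toy stand-ins for the 2018 and 2021 stop tables (made-up IDs)
stops_2018 = pd.DataFrame({"stop_id": ["A1", "A2", "A3"]})
stops_2021 = pd.DataFrame({"stop_id": ["A2", "A3", "A5"]})

ids_2018 = set(stops_2018["stop_id"])
ids_2021 = set(stops_2021["stop_id"])

# IDs only in 2018: stops no longer present in 2021
dropped_ids = sorted(ids_2018 - ids_2021)
# IDs only in 2021: stops introduced after 2018
new_ids = sorted(ids_2021 - ids_2018)
# IDs in both: stops present in both years, whatever their name
kept_ids = sorted(ids_2018 & ids_2021)

print(f"Dropped: {dropped_ids}")
print(f"New:     {new_ids}")
print(f"Kept:    {kept_ids}")
```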

***

[Back to top](#top)