# Hypothesis - public transport is designed to serve people who use it to commute (for work, school, etc)

Initialise needed libraries

In [2]:
library(readr)
library(plyr)
library(dplyr)
# Note from Dave to Dave - ALWAYS load plyr before dplyr. Otherwise plyr's summarise masks dplyr's and group_by doesn't work properly

# Next steps for Dave: 
# - see how many stops have more trips per hour in commuting times. If more than 50% - hypothesis is at least partially true. If more than say 80% - strong correlation. (or talk to Steff about using standard deviations or something?)
# - If less than 50% - hypothesis is less true than random chance.

"package 'plyr' was built under R version 3.4.4"
Attaching package: 'dplyr'

The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



For this hypothesis, I want to test whether the number of stops (to be replaced with number of services once 1A is available) shows a closer correlation to the number of people using it during working hours (which I have chosen as between midnight and 7pm on weekdays) than those using it outside of working hours (which are 7pm-midnight on weekdays and all day weekends).

Ideally, we would have better granularity over the starting time for the workday - but few/no services are operating a full 24 hours, so this is less of a problem.

Obviously, not everyone using the service before 7pm on a weekday is commuting (and similarly, some of the people using it after 7pm or on weekends ARE commuting), but common sense suggests that the *majority* of use during working hours will be for that purpose. Not suggesting for a second that shift workers and weekend staff aren't real workers!

In an ideal world, the data would include the purpose of travel, but that both a) doesn't exist, and b) wouldn't be cost-effective to capture.

For this hypothesis, we will be using the 1C trip_freq dataset to categorise trips into these categories above. The intention isn't to use absolute numbers, but relative numbers (to somewhat normalise the huge differences in window between the two categories).
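
As a rough sketch of that categorisation, each `time` label can be mapped to one of the two categories before aggregating. The label strings are the ones this notebook filters on; the helper name `categorise_time` and the toy rows are made up for illustration and are not part of the 1C dataset.

```r
# Labels as they appear in the trip_freq `time` column
working_labels <- c(
  "Weekday (12:00am-8:29:59am)",
  "Weekday (8:30am-2:59:59pm)",
  "Weekday (3:00pm-6:59:59pm)"
)
non_working_labels <- c("Weekday (7:00pm-11:59:59pm)", "Weekend")

# Hypothetical helper: returns "working", "non_working", or NA for unknown labels
categorise_time <- function(time) {
  ifelse(time %in% working_labels, "working",
         ifelse(time %in% non_working_labels, "non_working", NA_character_))
}

# Toy rows standing in for trip_freq (quantities are made up)
toy <- data.frame(
  stop_id  = c(1, 1, 1, 2),
  time     = c("Weekday (8:30am-2:59:59pm)", "Weekend",
               "Weekday (7:00pm-11:59:59pm)", "Weekday (3:00pm-6:59:59pm)"),
  quantity = c(100, 20, 10, 50),
  stringsAsFactors = FALSE
)
toy$category <- categorise_time(toy$time)

# Total trips per stop and category
aggregate(quantity ~ stop_id + category, data = toy, FUN = sum)
```

Returning NA for unrecognised labels makes it easy to spot a `time` value that falls into neither category.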

First, I'll read in the dataset, create a subset of all the values I am considering "during the working day" and aggregate the rows for each stop in this subset into a single number of trips (so combining all months and all three time periods).

In [3]:
trip_freq <- read_csv("Processed Data/1C/trip_freq.csv")
trip_freq_working <- trip_freq[trip_freq$time %in% c("Weekday (12:00am-8:29:59am)", "Weekday (8:30am-2:59:59pm)", "Weekday (3:00pm-6:59:59pm)"), ]
working_stops <- trip_freq_working %>% group_by(stop_id) %>% summarise(quantity = sum(quantity))
summary(working_stops)

Parsed with column specification:
cols(
  month = col_character(),
  stop_id = col_integer(),
  time = col_character(),
  quantity = col_integer()
)


    stop_id          quantity      
 Min.   :     1   Min.   :      1  
 1st Qu.:  5398   1st Qu.:    125  
 Median :301285   Median :    570  
 Mean   :185225   Mean   :   7992  
 3rd Qu.:313319   3rd Qu.:   2142  
 Max.   :600831   Max.   :7763894  

When looking at the summary statistics for working time, we can see that each stop:

- Has between 1 and 7,763,894 trips recorded against it.
- Has a mean of 7,992 trips and a median of 570 trips.

These will be useful in a second when we compare against non-working time, like so:

In [4]:
trip_freq_not_working <- trip_freq[trip_freq$time %in% c("Weekday (7:00pm-11:59:59pm)", "Weekend"), ]
non_working_stops <- trip_freq_not_working %>% group_by(stop_id) %>% summarise(quantity = sum(quantity))
summary(non_working_stops)

    stop_id          quantity      
 Min.   :     1   Min.   :      1  
 1st Qu.:  4825   1st Qu.:     28  
 Median :300432   Median :    108  
 Mean   :173354   Mean   :   2547  
 3rd Qu.:311365   3rd Qu.:    433  
 Max.   :600831   Max.   :1362041  

When looking at the summary statistics for non-working time, we can see that each stop:

- Has between 1 and 1,362,041 trips recorded against it.
- Has a mean of 2,547 trips and a median of 108 trips.

To more easily compare, I am going to merge both datasets into one data frame, keeping only the stops whose stop_id exists in both.

In [5]:
combined_stops <- merge(working_stops, non_working_stops, by="stop_id")
colnames(combined_stops)[colnames(combined_stops)=="quantity.x"] <- "working_time"
colnames(combined_stops)[colnames(combined_stops)=="quantity.y"] <- "non_working_time"
head(combined_stops)

stop_id,working_time,non_working_time
1,5761,1023
2,2041,433
3,34,3
5,2717,424
6,30501,3029
8,52437,8590
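
Because `merge()` defaults to an inner join, any stop with trips in only one of the two windows is dropped without warning. A minimal sketch of how the dropped stops could be listed, using made-up toy data rather than the real `working_stops` and `non_working_stops`:

```r
# Toy stand-ins for working_stops and non_working_stops (made-up values)
working_toy     <- data.frame(stop_id = c(1, 2, 3), quantity = c(100, 50, 7))
non_working_toy <- data.frame(stop_id = c(1, 2, 4), quantity = c(20, 10, 3))

# Default merge() keeps only stop_ids present in both data frames (inner join)
inner <- merge(working_toy, non_working_toy, by = "stop_id")

# all = TRUE keeps every stop_id and fills the gaps with NA (full outer join)
outer <- merge(working_toy, non_working_toy, by = "stop_id", all = TRUE)

# stop_ids that the inner join leaves out
dropped <- setdiff(outer$stop_id, inner$stop_id)
dropped
```

Whether those stops should be excluded or counted with a zero for the missing window is a judgement call; the analysis here excludes them.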


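To help us understand the relationship between working time and non-working time trips, I've added a new (calculated) column that gives the percentage of each stop's trips taken in non-working time (weekday evenings as well as weekends, although the column is called weekend_percentage for short).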

In [6]:
combined_stops$weekend_percentage <- with(combined_stops, (non_working_time/(working_time + non_working_time)) * 100)
summary(combined_stops)

    stop_id        working_time     non_working_time  weekend_percentage
 Min.   :     1   Min.   :      1   Min.   :      1   Min.   : 0.00225  
 1st Qu.:  4826   1st Qu.:    307   1st Qu.:     28   1st Qu.: 6.90772  
 Median :300431   Median :    950   Median :    108   Median :11.99678  
 Mean   :173344   Mean   :  10696   Mean   :   2549   Mean   :13.72968  
 3rd Qu.:311364   3rd Qu.:   3180   3rd Qu.:    433   3rd Qu.:18.56930  
 Max.   :600831   Max.   :7763894   Max.   :1362041   Max.   :98.44262  

This set of summary statistics is a good comparison of working-time trips vs. non-working-time trips (from above), and it also lets us drill down on the percentage of non-working-time trips:

- while the mean and median are pretty close, at roughly 12-14% of all trips, there are a number of stops (46, in fact) where more than 50% of trips are non-commuting trips, including one with a whopping 98% of trips being taken outside of working hours
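
For reference, a minimal sketch of how the stops above the 50% mark could be pulled out and ranked. The data frame here is a made-up toy with the same column names as `combined_stops`, not the real data.

```r
# Toy stand-in for combined_stops (made-up values)
toy_stops <- data.frame(
  stop_id          = c(101, 102, 103),
  working_time     = c(900, 40, 10),
  non_working_time = c(100, 60, 30)
)
toy_stops$weekend_percentage <- with(
  toy_stops, non_working_time / (working_time + non_working_time) * 100
)

# Keep stops where most trips fall outside working hours, highest share first
majority_non_working <- toy_stops[toy_stops$weekend_percentage > 50, ]
majority_non_working <- majority_non_working[
  order(-majority_non_working$weekend_percentage), ]

nrow(majority_non_working)  # number of stops above 50%
majority_non_working
```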

The problem with comparing the raw figures of commute vs. non-commute is that commute represents a 5 day window (minus 5 hours a day), while non-commute is 2 days plus the extra five hour blocks. It's not comparing apples to apples. To normalise, I want to divide the trips by the weekly spread of hours to get a normalised number. It would make more sense to divide by the yearly spread of hours (given the universe covers a 12 month period), but as both datasets cover 12 months of data, I don't gain anything by doing this additional calculation - these figures will only be compared with each other, not presented as true facts about the data.

- Commuter trips are 19 hours a day x 5 days = 95 hours a week.
- Non-working trips are 2 x 24 hours + (5 x 5  hours) = 73 hours a week.
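
As a quick check of this arithmetic, the two windows should add up to a full 168-hour week. The sketch below also applies the normalisation to stop 1 from the `head(combined_stops)` output (5,761 working-time trips, 1,023 non-working-time trips); the variable names are just for illustration.

```r
hours_per_week    <- 7 * 24          # 168
working_hours     <- 19 * 5          # midnight-7pm, Monday to Friday = 95
non_working_hours <- 2 * 24 + 5 * 5  # weekends + 7pm-midnight on weekdays = 73

# The two windows should cover the whole week with no overlap
stopifnot(working_hours + non_working_hours == hours_per_week)

# Stop 1: trips per hour in each window
working_rate     <- 5761 / working_hours
non_working_rate <- 1023 / non_working_hours

# Share of non-working-time trips before and after correcting for hours
raw_pct  <- 1023 / (5761 + 1023) * 100
norm_pct <- non_working_rate / (working_rate + non_working_rate) * 100
round(c(raw = raw_pct, normalised = norm_pct), 2)
# raw is about 15.08, normalised is about 18.77
```

The normalised share is higher than the raw one because the non-working window is the shorter of the two, so each of its trips carries more weight per hour.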

(An alternative way of doing this would be to normalise on number of trips (from 1A) rather than spread of hours, but 1A isn't available for use yet, and I'm not sure it would be preferred in either case.)

To implement this normalisation, I am creating two new (calculated) columns in our dataset to give the trips per hour for each window. I will also add a third new column that gives the updated percentage of non-working-time trips, corrected for hours.

In [7]:
combined_stops$normalised_working_time <- with(combined_stops, working_time / 95)
options(scipen=999) # to temporarily disable R's habit of automatically swapping to scientific notation
combined_stops$normalised_non_working_time <- with(combined_stops, non_working_time / 73)
combined_stops$norm_weekend_percentage <- with(combined_stops, (normalised_non_working_time/(normalised_working_time + normalised_non_working_time)) * 100)
summary(combined_stops)
options(scipen=0) # to restore R's default functionality

    stop_id        working_time     non_working_time  weekend_percentage
 Min.   :     1   Min.   :      1   Min.   :      1   Min.   : 0.00225  
 1st Qu.:  4826   1st Qu.:    307   1st Qu.:     28   1st Qu.: 6.90772  
 Median :300431   Median :    950   Median :    108   Median :11.99678  
 Mean   :173344   Mean   :  10696   Mean   :   2549   Mean   :13.72968  
 3rd Qu.:311364   3rd Qu.:   3180   3rd Qu.:    433   3rd Qu.:18.56930  
 Max.   :600831   Max.   :7763894   Max.   :1362041   Max.   :98.44262  
 normalised_working_time normalised_non_working_time norm_weekend_percentage
 Min.   :    0.01        Min.   :    0.014           Min.   : 0.00293       
 1st Qu.:    3.23        1st Qu.:    0.384           1st Qu.: 8.80618       
 Median :   10.00        Median :    1.479           Median :15.06749       
 Mean   :  112.59        Mean   :   34.920           Mean   :16.85188       
 3rd Qu.:   33.47        3rd Qu.:    5.932           3rd Qu.:22.88484       
 Max.   :81725.20        Ma

At this point, looking at both the mean (16.85%) and median (15.07%) of the normalised weekend percentage, usage of the public transport network lines up strongly with the needs of people using it to commute. Across the collection of stops, the vast majority of use is during working hours, even when correcting for the different lengths of the two windows.
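
The "next steps" note in the first cell suggests a simple test: the share of stops with more trips per hour in commuting time, read against 50% and 80% thresholds. A minimal sketch of that test follows. The helper names and the result labels are hypothetical, and the six stops are just the rows shown by `head(combined_stops)` (all in Brisbane City), so the result says nothing about the full network.

```r
# Hypothetical helper: percentage of stops whose trips-per-hour rate is higher
# in working time than in non-working time
commute_share <- function(normalised_working, normalised_non_working) {
  mean(normalised_working > normalised_non_working) * 100
}

# Thresholds from the "next steps" note; the labels are made up for this sketch
classify_share <- function(share) {
  if (share > 80) "strong" else if (share > 50) "partial" else "not supported"
}

# First six stops from head(combined_stops), normalised by 95 and 73 hours
working     <- c(5761, 2041, 34, 2717, 30501, 52437) / 95
non_working <- c(1023, 433, 3, 424, 3029, 8590) / 73

share <- commute_share(working, non_working)
share
classify_share(share)
```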

Note - would still like to re-run using number of services rather than hours, but need 1A.

To complete this piece of analysis, it is important to aggregate these results back to the SA2 level so we can look for any differences by socio-economic index. To do this, I'll bring in 1F.

The only columns I'm bringing in are stop_id, SA2Code, SA2Name and Team_Member for now, and I'll immediately join it to our combined_stops dataset based on stop_id (using an inner join to only keep stops for which we have data).

In [8]:
SA2_stops_by_mode <- read_csv("Processed Data/1F/SA2_stops_by_mode.csv", col_types = cols(AREASQKM16 = col_skip(), Pop = col_skip(), Score = col_skip(), is_bus_stop = col_skip(), is_ferry_stop = col_skip(), is_train_station = col_skip(), is_tram_stop = col_skip(), stop_lat = col_skip(), stop_lon = col_skip(), stop_name = col_skip(), stop_url = col_skip()))
SA2_combined <- join(SA2_stops_by_mode, combined_stops, by="stop_id", type="inner")
head(SA2_combined)

stop_id,SA2Code,SA2Name,Team_Member,working_time,non_working_time,weekend_percentage,normalised_working_time,normalised_non_working_time,norm_weekend_percentage
1,305011105,Brisbane City,Steff,5761,1023,15.079599,60.6421053,14.01369863,18.77108
2,305011105,Brisbane City,Steff,2041,433,17.502021,21.4842105,5.93150685,21.63542
3,305011105,Brisbane City,Steff,34,3,8.108108,0.3578947,0.04109589,10.29996
5,305011105,Brisbane City,Steff,2717,424,13.498886,28.6,5.80821918,16.88032
6,305011105,Brisbane City,Steff,30501,3029,9.033701,321.0631579,41.49315068,11.44461
8,305011105,Brisbane City,Steff,52437,8590,14.075737,551.9684211,117.67123288,17.57232


Finally, I'm going to use my existing code samples to aggregate these rows up to the SA2 level.

There is probably a more efficient way of doing this, but this works for now.

In [19]:
options(scipen=999) # to temporarily disable R's habit of automatically swapping to scientific notation
working_dataset1 <- SA2_combined %>% group_by(SA2Code, SA2Name, Team_Member) %>% summarise(normalised_working_time = sum(normalised_working_time))
working_dataset2 <- SA2_combined %>% group_by(SA2Code) %>% summarise(normalised_non_working_time = sum(normalised_non_working_time))
trips_commute_per_SA2 <- merge(working_dataset1, working_dataset2, by="SA2Code")
trips_commute_per_SA2$norm_weekend_percentage <- with(trips_commute_per_SA2, (normalised_non_working_time/(normalised_working_time + normalised_non_working_time)) * 100)
trips_commute_per_SA2 <- trips_commute_per_SA2[order(trips_commute_per_SA2$norm_weekend_percentage),]
#head(trips_commute_per_SA2)
trips_commute_per_SA2
options(scipen=0) # to restore R's default functionality

Unnamed: 0,SA2Code,SA2Name,Team_Member,normalised_working_time,normalised_non_working_time,norm_weekend_percentage
218,311031318,Munruben - Park Ridge South,Dave,2.8736842,0.01369863,0.4744307
252,314011387,Samford Valley,Steff,6.3789474,0.06849315,1.0623309
93,304031093,Corinda,Steff,834.4000000,14.38356164,1.6946089
11,301021011,Sheldon - Mount Cotton,Steff,221.3157895,12.58904110,5.3821210
238,313021367,Elimbah,Charlie,125.4947368,7.98630137,5.9830980
221,311051324,Cornubia - Carbrook,Kate,59.3473684,4.24657534,6.6776411
207,311011305,Beaudesert,Eric,0.1684211,0.01369863,7.5217736
43,302041043,Deagon,Charlie,2163.8842105,179.57534247,7.6628309
250,314011385,Eatons Hill,Steff,384.6105263,32.50684932,7.7932139
251,314011386,The Hills District,Kate,654.7157895,55.36986301,7.7976316
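
One possibly tidier alternative to the two summaries plus `merge()` above is a single grouped summary that carries both sums. This is only a sketch on made-up toy data (the SA2 codes and names are hypothetical), and it assumes each SA2Code maps to exactly one SA2Name and Team_Member, as the grouping in the cell above also does.

```r
library(dplyr)

# Toy stand-in for SA2_combined (made-up values, hypothetical SA2 codes)
SA2_toy <- data.frame(
  SA2Code = c(1, 1, 2),
  SA2Name = c("Area A", "Area A", "Area B"),
  Team_Member = c("Dave", "Dave", "Steff"),
  normalised_working_time = c(10, 30, 9),
  normalised_non_working_time = c(2, 8, 1),
  stringsAsFactors = FALSE
)

per_SA2 <- SA2_toy %>%
  group_by(SA2Code, SA2Name, Team_Member) %>%
  # dplyr:: prefix guards against plyr's summarise masking dplyr's
  dplyr::summarise(
    normalised_working_time     = sum(normalised_working_time),
    normalised_non_working_time = sum(normalised_non_working_time)
  ) %>%
  ungroup() %>%
  dplyr::mutate(
    norm_weekend_percentage = normalised_non_working_time /
      (normalised_working_time + normalised_non_working_time) * 100
  ) %>%
  dplyr::arrange(norm_weekend_percentage)

per_SA2
```

Doing both sums in one pass avoids the intermediate data frames and the second join on SA2Code.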


In [21]:
summary(trips_commute_per_SA2$norm_weekend_percentage)
# write.csv(trips_commute_per_SA2,"SA2__norm_weekend_percentage.csv", row.names = FALSE) # in case you want a copy to play
# around with - uploading to my Dave Outputs folder on Google Drive

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4744 12.9022 16.7584 18.5170 22.7611 45.4604 

At this point, it is clear that there are no SA2 areas that have more public transport use outside of working hours than during them (despite finding that this is true for ~50 individual stops). There are a few SA2 areas that are above 40% non-commute travel, which are:

- Surfers Paradise
- Gympie Region *
- Mermaid Beach - Broadbeach
- Coolangatta

With the exception of Gympie Region, all of these areas are popular tourist stops, so this result isn't an illogical one. (Gympie Region is an outlier, but has an extremely low number of trips to use as a comparison.)

The final step in this analysis is to break the SA2 groups into our assigned categories to consider if there are any differences based on SAD index.

In [28]:
# Note - to use this code block, only uncomment one of the indiv_data lines. Trying to use more than one will overwrite with 
# the last one and not give the desired result.

# indiv_data <- subset(trips_commute_per_SA2, Team_Member == "Eric") # only consider Eric's SA2 areas
# indiv_data <- subset(trips_commute_per_SA2, Team_Member == "Charlie") # only consider Charlie's SA2 areas
# indiv_data <- subset(trips_commute_per_SA2, Team_Member == "Dave") # only consider my SA2 areas
# indiv_data <- subset(trips_commute_per_SA2, Team_Member == "Kate") # only consider Kate's SA2 areas
indiv_data <- subset(trips_commute_per_SA2, Team_Member == "Steff") # only consider Steff's SA2 areas
summary(indiv_data)

    SA2Code            SA2Name          Team_Member       
 Min.   :301011002   Length:61          Length:61         
 1st Qu.:304011084   Class :character   Class :character  
 Median :304041104   Mode  :character   Mode  :character  
 Mean   :304418316                                        
 3rd Qu.:305031122                                        
 Max.   :314011387                                        
 normalised_working_time normalised_non_working_time norm_weekend_percentage
 Min.   :     6.38       Min.   :    0.07            Min.   : 1.062         
 1st Qu.:   927.66       1st Qu.:  132.12            1st Qu.:12.867         
 Median :  2811.19       Median :  617.97            Median :16.931         
 Mean   :  8813.81       Mean   : 2936.29            Mean   :17.632         
 3rd Qu.:  5562.94       3rd Qu.: 1220.75            3rd Qu.:21.740         
 Max.   :207703.65       Max.   :71465.56            Max.   :36.503         

Norm_weekend_percentage:
- ALL SA2 - has scores between 0.47% and 45.46%, median is 16.76%, mean is 18.52%
- Eric - has scores between 7.52% and 44.04%, median is 14.80%, mean is 17.02% (more commuter-focused than average)
- Charlie - has scores between 5.98% and 45.46%, median is 19.82%, mean is 20.14% (less commuter-focused than average)
- Dave - has scores between 0.47% and 43.99%, median is 17.55%, mean is 19.85% (less commuter-focused than average)
- Kate - has scores between 6.68% and 39.74%, median of 15.91%, mean of 17.92% (more commuter-focused than average)
- Steff - has scores between 1.06% and 36.50%, median is 16.93%, mean of 17.63% (pretty close to average, depending on whether you take mean or median)

In short - no discernible pattern, so it's probably independent of socio-economic status.
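
Rather than uncommenting one `indiv_data` line at a time, the per-team figures above could probably be produced in one pass by splitting on `Team_Member`. A minimal sketch in base R on made-up toy data; the helper name `summarise_pct` is hypothetical.

```r
# Toy stand-in for trips_commute_per_SA2 (made-up percentages)
toy_SA2 <- data.frame(
  Team_Member = c("Dave", "Dave", "Steff", "Steff", "Steff"),
  norm_weekend_percentage = c(10, 30, 12, 18, 24),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the four figures quoted for each team member
summarise_pct <- function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x))
}

# One row per team member, one column per statistic
team_summary <- do.call(
  rbind,
  lapply(split(toy_SA2$norm_weekend_percentage, toy_SA2$Team_Member),
         summarise_pct)
)
round(team_summary, 2)
```

Having all teams in one table also makes it easier to compare each row against the overall median and mean.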