# Hypothesis - public transport serves people who live in that area (i.e. public transport infrastructure and services are built where the population is highest)

Initialise needed libraries

In [1]:
library(readr)
library(plyr)
library(dplyr)
# Note from Dave to Dave - ALWAYS load plyr before dplyr; otherwise plyr's summarise masks dplyr's and group_by doesn't work properly
# Note - have updated to use combined_quantity from the 1C dataset to give more meaningful results

"package 'plyr' was built under R version 3.4.4"
Attaching package: 'dplyr'

The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
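As a side note on the load-order comment in the cell above: one way to make grouped summaries independent of the order of the `library()` calls is to call the dplyr functions with an explicit `dplyr::` prefix. The sketch below uses a made-up `toy_trips` data frame (hypothetical values, column names borrowed from the 1C file) and is only an illustration, not a change to how this notebook runs.

```r
# Toy data; column names mirror the 1C file, the values are made up.
toy_trips <- data.frame(
  stop_id = c(1, 1, 2),
  combined_quantity = c(10, 5, 7)
)

# The dplyr:: prefix picks dplyr's versions even if plyr was attached last.
per_stop <- dplyr::summarise(
  dplyr::group_by(toy_trips, stop_id),
  combined_quantity = sum(combined_quantity)
)

print(per_stop)
```

With the prefix in place, the result has one row per `stop_id` regardless of which package was loaded first.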



Get access to the data we need (saved in a Processed Data folder from the working directory)
In this case, we need to get access to all of the stops by SA2, some of the SA2 characteristics (all from 1F), the number of stops per SA2 (calculated from 1F) and the quantity of trips per SA2 (from file 1C). 

1F is read in to only retrieve the following columns:
- stop_id
- AREASQKM16
- SA2Name
- SA2Code
- Team_Member
- Pop
- Score
as other columns are not needed for this analysis

1C is read in as-is for now

** In a future revision, the number of stops per SA2 should be replaced by the quantity of services (from file 1A), but this isn't ready to use at the time of coding.

In [2]:
data_1F <- read_csv(
  "Processed data/1F/SA2_stops_by_mode.csv",
  col_types = cols(
    is_bus_stop = col_skip(),
    is_ferry_stop = col_skip(),
    is_train_station = col_skip(),
    is_tram_stop = col_skip(),
    stop_lat = col_skip(),
    stop_lon = col_skip(),
    stop_name = col_skip(),
    stop_url = col_skip()
  )
)
data_1C <- read_csv("Processed data/1C/1C_trip_freq.csv")

Parsed with column specification:
cols(
  month = col_character(),
  time = col_character(),
  stop_id = col_integer(),
  origin_quantity = col_integer(),
  destination_quantity = col_integer(),
  combined_quantity = col_integer()
)


First, we want to aggregate the number of trips taken from each stop, so we can then aggregate these into the number of trips taken from each SA2 area. To do that, we want to operate on the 1C dataset.

In [3]:
trips_per_stop <- data_1C %>%
  group_by(stop_id) %>%
  summarise(combined_quantity = sum(combined_quantity))
summary(trips_per_stop)

    stop_id       combined_quantity 
 Min.   :     1   Min.   :       1  
 1st Qu.:  5586   1st Qu.:     324  
 Median :301610   Median :    1328  
 Mean   :189427   Mean   :   17863  
 3rd Qu.:313994   3rd Qu.:    4740  
 Max.   :600831   Max.   :18326237  

From this operation, we can see some interesting summary statistics about the relationship between stops and the number of trips. 

We can see that there is at least one stop that only had a single trip recorded in the 12 months we have data for.

We can see that the median number of trips is 1,328 per stop, but the mean is 17,863 - which implies that there are some outliers in terms of stops with a huge quantity of trips.

Finally, we can see that one incredibly busy stop had over 18 million trips from it over this 12-month period. Interesting (but not particularly useful here).
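To illustrate why a mean far above the median points to a few very busy stops, here is a small sketch with made-up numbers (`toy_totals` is hypothetical and is not taken from the 1C data). A single hub-sized value is enough to drag the mean well away from the median.

```r
# Hypothetical per-stop totals: four ordinary stops and one very busy hub.
toy_totals <- c(300, 1200, 1400, 4500, 2000000)

toy_median <- median(toy_totals)  # middle value, barely affected by the hub
toy_mean <- mean(toy_totals)      # pulled up sharply by the hub

# Share of all trips carried by the busiest stop
hub_share <- max(toy_totals) / sum(toy_totals)

print(c(median = toy_median, mean = toy_mean, hub_share = hub_share))
```

The same `max(x) / sum(x)` check could be run on `trips_per_stop$combined_quantity` to see how much of the total the busiest stop accounts for.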

We then want to join this new dataset with our 1F dataset so we can operate from there.

In [4]:
data_1F_with_qty <- join(data_1F, trips_per_stop, by="stop_id", type="inner")
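Because this is an inner join, any stop in 1F without a matching row in `trips_per_stop` is silently dropped. The sketch below shows one way to check how many rows that affects. It swaps in dplyr's `inner_join` and `anti_join` rather than the plyr `join` used above, and the `toy_1F` and `toy_trips` data frames are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for data_1F and trips_per_stop; values are made up.
toy_1F <- data.frame(stop_id = c(1, 2, 3), SA2Code = c(101, 101, 102))
toy_trips <- data.frame(stop_id = c(1, 2), combined_quantity = c(15, 7))

# Inner join: only stops present in both tables survive.
joined <- inner_join(toy_1F, toy_trips, by = "stop_id")

# Stops in the 1F stand-in with no trip records; the inner join drops these.
unmatched <- anti_join(toy_1F, toy_trips, by = "stop_id")

print(nrow(joined))
print(unmatched)
```

If `unmatched` turned out to be large on the real data, it would be worth deciding whether those stops should count as stops with zero trips instead of being excluded.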

Next, we want to stop considering individual stops altogether (for this particular analysis). To do this, we want to both take a count of the number of stops per SA2 area (which we will use with the quantity of trips later) and also to group by SA2 area.

In [5]:
num_stops_per_SA2 <- data_1F_with_qty %>% count(SA2Code)
data_1F_with_qty <- merge(data_1F_with_qty, num_stops_per_SA2, by = "SA2Code")
SA2_working_data <- data_1F_with_qty %>% group_by (SA2Code, SA2Name, Pop, Score, Team_Member, n) %>% summarise(combined_quantity = sum(combined_quantity))
SA2_working_data <- SA2_working_data[complete.cases(SA2_working_data), ]
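For comparison, the count and the sum can also be produced in a single `summarise()` call using dplyr's `n()`, which avoids the separate `count` and `merge` steps. This is only a sketch on a made-up `toy` data frame (hypothetical values, a reduced set of grouping columns), not a replacement for the cell above.

```r
library(dplyr)

# Toy stand-in for data_1F_with_qty; values are made up.
toy <- data.frame(
  SA2Code = c(101, 101, 102),
  SA2Name = c("A", "A", "B"),
  combined_quantity = c(15, 7, 30),
  stringsAsFactors = FALSE
)

toy_sa2 <- toy %>%
  group_by(SA2Code, SA2Name) %>%
  summarise(
    n = n(),                                    # number of stops in the SA2
    combined_quantity = sum(combined_quantity)  # total trips in the SA2
  ) %>%
  ungroup()

print(toy_sa2)
```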

Finally, I want to add one last column that is derived by dividing the quantity by the number of stops (n). This allows us to analyse data based on either the number of stops in each area OR the average "busy-ness" of stops within each area.

In [6]:
SA2_working_data$avg_trips_per_stop <- with(SA2_working_data, combined_quantity / n)
head(SA2_working_data)

SA2Code,SA2Name,Pop,Score,Team_Member,n,combined_quantity,avg_trips_per_stop
112031254,Tweed Heads,19417,933,Eric,3,158677,52892.333
301011001,Alexandra Hills,16345,987,Charlie,113,231473,2048.434
301011002,Belmont - Gumdale,7375,1093,Steff,47,125180,2663.404
301011003,Birkdale,14923,1034,Dave,57,446086,7826.07
301011004,Capalaba,17588,991,Charlie,120,684247,5702.058
301011005,Thorneside,3761,983,Charlie,25,159068,6362.72


At this point, I'm going to write this resulting file out to a new CSV, import into Tableau and do some visual EDA. Depending on what I find there, it may (or may not) be worth building on this notebook more to do more structured EDA.

In [7]:
# have commented it out so it doesn't keep overwriting this file - uncomment if you need to pump out the file yourself
# write.csv(SA2_working_data,"SA2_with_qty_stops_and_avg.csv", row.names = FALSE)

The intention of the plotting in Tableau is to explore the relationship between the population of each SA2 area and the number of stops/avg_trips_per_stop to see if there is a linear relationship between # of people vs. where the stops have been built. The avg_trips_per_stop is not used in this hypothesis, but has been brought forward for other use later on.

From plotting in Tableau, it appears that the answer to the initial hypothesis is:
NOTE - image of plot is in Analysis - Dave - Dave Vis as "population by # stops _ all SA2" - if you have access, it's here - https://drive.google.com/open?id=1sS2rLgqcbU7ppCw9RLaMzLG9k97mHK9J

Hypothesis - public transport serves people who live in that area (i.e. public transport infrastructure and services are built where the population is highest)

Answer - largely no. From the plot you can see that a large number of data points are BELOW the confidence interval (and therefore under-served) and just as many are ABOVE the interval (and thus over-served). Relatively few are within the confidence interval, which is where you'd expect most points to sit if public transport was built according to where people live.
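The plot itself was built in Tableau, but a rough numeric version of the same check can be sketched in R with `cor()` and `lm()`. The example below uses only the six rows printed by `head()` earlier in this notebook, so it is far too small to say anything about the hypothesis; it just shows the shape of the calculation. Running it on the full `SA2_working_data` would be the real test.

```r
# The six rows printed by head() earlier in this notebook (Pop and n only).
sample_sa2 <- data.frame(
  SA2Name = c("Tweed Heads", "Alexandra Hills", "Belmont - Gumdale",
              "Birkdale", "Capalaba", "Thorneside"),
  Pop = c(19417, 16345, 7375, 14923, 17588, 3761),
  n = c(3, 113, 47, 57, 120, 25),
  stringsAsFactors = FALSE
)

# Pearson correlation between population and number of stops
pop_stop_cor <- cor(sample_sa2$Pop, sample_sa2$n)

# Simple linear fit of stop count on population
fit <- lm(n ~ Pop, data = sample_sa2)
fit_r_squared <- summary(fit)$r.squared

print(pop_stop_cor)
print(fit_r_squared)
```

An R-squared close to 1 would support a linear relationship between population and stop count; a value close to 0 would line up with the "largely no" reading of the plot.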

Note - raw # of stops is a worse indicator than frequency of services, so I'll re-run once we have a final 1A dataset.

Note - this version contains ALL SA2 areas. To use in Tableau, I can filter on the Team_Member variable to look only at mine. If you wanted to use this for yours, just filter on your name =)

Note - when filtering to each of our groups, same result. Steff's is maybe closer to showing a linear relationship, but the rest seem just as disjointed and misaligned.

Final CSV file is added to the Dave output folder in this folder.

Everything up to this point is applicable to all SA2 areas and uses all records. Everyone is welcome to use this notebook, or follow along and build their own, or to ignore this entirely and use another method - all of these are good options =)

If you do it yourself and learn something new, please share it back so I can also get better at doing this.

I'm adding some more code below so we can do some point comparisons between lower and higher socioeconomic areas. Simply comment out the lines that don't apply to the data you are looking at. Feedback on how to make this even easier to work with always welcome =)

In [40]:
# Note - to use this code block, only uncomment one of the indiv_data lines. Trying to use more than one will overwrite with 
# the last one and not give the desired result.

# indiv_data <- subset(SA2_working_data, Team_Member == "Eric") # only consider Eric's SA2 areas
# indiv_data <- subset(SA2_working_data, Team_Member == "Charlie") # only consider Charlie's SA2 areas
# indiv_data <- subset(SA2_working_data, Team_Member == "Dave") # only consider my SA2 areas
# indiv_data <- subset(SA2_working_data, Team_Member == "Kate") # only consider Kate's SA2 areas
# indiv_data <- subset(SA2_working_data, Team_Member == "Steff") # only consider Steff's SA2 areas

# Once you've selected a name, use the below to either summarise or export. (Or, y'know, whatever you want to do with it 
# from here)

# write.csv(indiv_data,"indiv_data.csv", row.names = FALSE)
# summary(indiv_data)

Some summary stats of interest:

Population:
- All SA2 - between 196 and 31,214 people per area, median of 9,484, mean of 10,597
- Eric - between 3,073 and 26,827 people per area, median of 10,884, mean of 12,208 (more populated than average)
- Charlie - between 3,675 and 25,744 people per area, median of 10,298, mean of 11,303 (more populated than average)
- Dave - between 3,764 and 31,214 people per area, median of 10,246, mean of 11,682 (more populated than average)
- Kate - between 822 and 30,105 people per area, median of 8,370, mean of 9,649 (less populated than average)
- Steff - between 196 and 16,477 people per area, median of 7,909, mean of 8,205 (less populated than average)

Average trips per stop:
- All SA2 - between 23.7 and 447,138.6 trips per stop, median of 9,368.4, mean of 19,564.1
- Eric - between 152 and 250,349 trips per stop, median of 6,297, mean of 14,637 (less utilisation per stop than average)
- Charlie - between 122.7 and 131,691.0 trips per stop, median of 8,175.2, mean of 14,263.6 (less utilisation per stop than average)
- Dave - between 23.67 and 179,614.17 trips per stop, median of 6,294.63, mean of 14,623.44 (less utilisation per stop than average)
- Kate - between 882.2 and 185,168.0 trips per stop, median of 10,464.0, mean of 19,864.2 (slightly higher utilisation than average, but pretty close to exact)
- Steff - between 346.2 and 447,138.6 trips per stop, median of 16,729.6, mean of 33,948.3 (massively higher utilisation than average)
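The per-person figures above came from subsetting one name at a time and calling `summary()`. As an alternative, a grouped summary can produce the same kind of table for everyone in one pass. The sketch below assumes dplyr is loaded and uses a small stand-in (`toy_sa2`) built from the six rows printed by `head()` earlier, so the numbers it prints will not match the full-data figures listed above.

```r
library(dplyr)

# Stand-in for SA2_working_data, built from the six rows shown by head().
toy_sa2 <- data.frame(
  Team_Member = c("Eric", "Charlie", "Steff", "Dave", "Charlie", "Charlie"),
  Pop = c(19417, 16345, 7375, 14923, 17588, 3761),
  avg_trips_per_stop = c(52892.333, 2048.434, 2663.404,
                         7826.07, 5702.058, 6362.72),
  stringsAsFactors = FALSE
)

# One row per team member with the range, median and mean of population,
# plus the median of average trips per stop.
by_member <- toy_sa2 %>%
  group_by(Team_Member) %>%
  summarise(
    areas = n(),
    min_pop = min(Pop),
    median_pop = median(Pop),
    mean_pop = mean(Pop),
    max_pop = max(Pop),
    median_avg_trips = median(avg_trips_per_stop)
  )

print(by_member)
```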

### Note - now going to use this existing work to try to address "Exploration - where are SE QLD's transport hubs (e.g. City, Upper Mount Gravatt, UQ, Indooroopilly?) i.e. Could this show where the government want people working? Vis option: heat map? (KM)"

Some general takeaways: areas with lower SA2 scores have more people than areas with higher SA2 scores, and people are more likely to travel to/from higher SA2 scoring areas than lower ones. To further explore this relationship, it's worth creating a new index score that is (number of trips/number of stops) / population. 

The lower the score, the more likely that public transport is being used primarily by people who live in that area. Conversely, areas with higher scores are more likely to be used primarily by people travelling to areas outside of where they live.
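As a worked example of the index formula, here is the calculation for a single area, using the Ipswich - North figures from this notebook's output (142 trips, 6 stops, population 4,644). The variable names are just for illustration.

```r
# Ipswich - North figures as reported in this notebook's output.
combined_quantity <- 142  # total trips recorded across the area's stops
n_stops <- 6              # number of stops in the area
pop <- 4644               # population of the area

# Step 1: average trips per stop
avg_trips_per_stop <- combined_quantity / n_stops

# Step 2: divide by population to get the index
index <- avg_trips_per_stop / pop

print(avg_trips_per_stop)
print(index)
```

This reproduces the `avg_trips_per_stop` of about 23.67 and the `index` of about 0.0051 reported for that area.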

In [8]:
SA2_working_data$index <- with(SA2_working_data, avg_trips_per_stop / Pop)
SA2_working_data[order(SA2_working_data$index),]

SA2Code,SA2Name,Pop,Score,Team_Member,n,combined_quantity,avg_trips_per_stop,index
310031288,Ipswich - North,4644,1033,Dave,6,142,23.66667,0.005096181
319031514,Gympie Region,18564,935,Eric,1,152,152.00000,0.008187891
317011451,Lockyer Valley - West,11362,979,Charlie,3,368,122.66667,0.010796221
309071255,Ormeau - Yatala,19674,1030,Dave,76,37024,487.15789,0.024761507
314011387,Samford Valley,11754,1130,Steff,16,5539,346.18750,0.029452739
310021278,Esk,5158,908,Eric,5,831,166.20000,0.032221791
311041320,Greenbank,12852,1028,Dave,4,2072,518.00000,0.040305011
310021281,Lowood,14052,918,Eric,12,8043,670.25000,0.047697837
313011362,Beachmere - Sandstone Point,14892,931,Eric,87,69804,802.34483,0.053877574
309071258,Upper Coomera - Willow Vale,31214,1012,Dave,52,105123,2021.59615,0.064765687


From looking at these results, we can see that the highest index scores (smallest percentage of locals using the service) are the ones you would expect:
- Brisbane Airport
- South Brisbane
- Brisbane City
- Redland Islands
- Fortitude Valley
- Woolloongabba

With the exception of Redland Islands, these areas are all major transport hubs. Logically, this result seems to make sense - while there are some people who live in these areas, the majority of the travel is from people travelling to these areas from somewhere else.

Conversely, the smallest index scores (most likely to be locals using the service) are:
- Ipswich - North
- Gympie Region
- Lockyer Valley - West
- Ormeau - Yatala
- Samford Valley
- Esk
- Greenbank
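The output printed above is sorted in ascending order, so it shows the smallest index scores first. To see the largest scores at the top instead, the same `order()` call can be used with `decreasing = TRUE`. The sketch below uses three rows from the printed output as a stand-in (`toy_index`); on the real data the same pattern would be applied to `SA2_working_data`.

```r
# Three rows from the printed output, used here as a small stand-in.
toy_index <- data.frame(
  SA2Name = c("Ipswich - North", "Gympie Region", "Lockyer Valley - West"),
  index = c(0.005096181, 0.008187891, 0.010796221),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest index and keep the top two rows.
top_hubs <- head(toy_index[order(toy_index$index, decreasing = TRUE), ], 2)

print(top_hubs)
```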

In [1]:
# write.csv(SA2_working_data,"travel_hub_index.csv", row.names = FALSE)
# I have uploaded this file under Project - Analysis - Dave - Dave Output - travel_hub_index.csv if anyone would like to 
# play around with it