# Hypothesis - more low-SE people use public transport than high-SE, therefore it needs to be more affordable!

Today's hypothesis is another great one submitted by Kate: more low-SE people use public transport than high-SE people, therefore it needs to be more affordable!

My understanding of what this means isn't complete, so here's my thought process:
- is there a relationship between the number of available stops and SAD index scores, where there are more stops available in lower scoring areas? (doesn't really address whether more people use it, only that more infrastructure is available)
- is there a relationship between the number of journeys taken on public transport and SAD index scores, where people in lower SAD index areas are taking more trips? (might be an indicator of people from lower-SE areas using public transport more BUT it doesn't account for the extremely-common scenario of people in low SE areas travelling to higher SE areas - this sort of journey would show up in both areas and would dilute this relationship)
- is there a relationship between utilisation of public transport stops in different SAD index-scoring areas, where pt stops in lower scoring areas show more utilisation than those in higher scoring areas? (might be a way of testing this hypothesis, utilisation as an index should help correct for the problem in the line above - but it's imperfect. Good as a backup, but I think I may have an alternative)

- _is there a relationship between the number of journeys taken on public transport (when only considering arrival stops and only considering time bin 1 (between midnight and 8:30am)) and SAD index scores, where people in lower scoring areas are taking more trips_

This is the best measure I can currently think of to demonstrate what I believe this hypothesis is asking for. The assumptions I am making in this one are:

- _most trips_ of people travelling in the early morning are likely to be from the SA2 area they reside in (obviously this isn't perfect, as it misses people travelling home after being out at night, people coming home after shift work, probably many other scenarios). Despite that, and noting this is the assumption I am making, I'm comfortable enough for now.

Ok, so I need to get to a working dataset, aggregated by SA2 area, that includes:
- number of trips (using arrival_time and in time bin 1)
- SAD index scores (and by extension, Team_Member)
- number of transport stops per SA2 area (might not be needed, but may be useful to flip the question to look at utilisation instead of raw trips in our question above. I define utilisation as the number of trips taken _divided by_ the number of available stops)
- _(Question - should I also normalise by population of each area? Going to bring Pop in as well so I can think about this more)_

Let's get the libraries:

In [27]:
library(readr)
library(plyr)
library(dplyr)
# Note from Dave to Dave - ALWAYS load plyr before dplyr or group_by doesn't work properly
# Note - have updated to use combined_quantity from the 1C dataset to give more meaningful results
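
A side note on the load-order reminder above: plyr and dplyr both export a function called `summarise`, and whichever package is loaded last masks the other. A minimal sketch of one way to sidestep that, assuming a toy data frame invented here purely for illustration, is to qualify the calls with their namespace so the load order stops mattering:

```r
library(dplyr)

# Toy data invented for illustration; not taken from the project datasets.
toy <- data.frame(stop_id = c(1, 1, 2), origin_quantity = c(42, 65, 10))

# dplyr::summarise is named explicitly, so it is the function used even if
# plyr is loaded afterwards and masks the unqualified name.
toy_totals <- toy %>%
  dplyr::group_by(stop_id) %>%
  dplyr::summarise(origin_quantity = sum(origin_quantity))
```

With the qualified calls, `toy_totals` has one row per `stop_id` rather than a single overall total.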

Let's just grab the base datasets we'll need - 1F and 1C_trip_freq

In [28]:
SA2_stops_by_mode <- read_csv("Processed Data/1F/SA2_stops_by_mode.csv", col_types = cols(AREASQKM16 = col_skip(), is_bus_stop = col_skip(), is_ferry_stop = col_skip(), is_train_station = col_skip(), is_tram_stop = col_skip(), stop_lat = col_skip(), stop_lon = col_skip(), stop_name = col_skip(), stop_url = col_skip()))
trip_freq <- read_csv("Processed Data/1C/1C_trip_freq.csv", col_types = cols(combined_quantity = col_skip(), destination_quantity = col_skip()))
head(SA2_stops_by_mode)
head(trip_freq)

stop_id,SA2Code,SA2Name,Pop,Score,Team_Member
1,305011105,Brisbane City,10192,1083,Steff
2,305011105,Brisbane City,10192,1083,Steff
3,305011105,Brisbane City,10192,1083,Steff
4,305011111,Spring Hill,6063,1028,Dave
5,305011105,Brisbane City,10192,1083,Steff
6,305011105,Brisbane City,10192,1083,Steff


month,time,stop_id,origin_quantity
2016-02,Weekday (12:00am-8:29:59am),1,42
2016-02,Weekday (3:00pm-6:59:59pm),1,150
2016-02,Weekday (7:00pm-11:59:59pm),1,21
2016-02,Weekday (8:30am-2:59:59pm),1,208
2016-02,Weekend,1,83
2016-03,Weekday (12:00am-8:29:59am),1,65


Next, we know we only want to focus on the first time bin in trip_freq (which, in this dataset, is the string "Weekday (12:00am-8:29:59am)"). After that, we don't need the time column anymore, so let's dump it too.

In [29]:
trip_freq <- subset(trip_freq, time == "Weekday (12:00am-8:29:59am)") 
trip_freq$time <- NULL
head(trip_freq)

month,stop_id,origin_quantity
2016-02,1,42
2016-03,1,65
2016-04,1,49
2016-05,1,73
2016-06,1,73
2016-07,1,79


Next, we don't need it broken out by individual months, so let's just aggregate the count to stop_id.

In [30]:
trip_freq <- trip_freq %>% group_by (stop_id) %>% summarise(origin_quantity = sum(origin_quantity))
summary(trip_freq)

    stop_id       origin_quantity 
 Min.   :     1   Min.   :     1  
 1st Qu.:  5462   1st Qu.:   111  
 Median :301338   Median :   362  
 Mean   :186437   Mean   :  3003  
 3rd Qu.:313221   3rd Qu.:  1140  
 Max.   :600831   Max.   :422287  
                  NA's   :3210    

As the summary stats tell us, there are 3,210 stops that don't have any trips that fit these criteria (and are showing NAs in the origin_quantity column) - we ultimately want to know the number of these stops per SA2 area so we can correct the number of stops we're looking at for utilisation. We can't do anything with it now, but the code I plan to use after we aggregate is:

In [7]:
#dataset_name$stops_no_trips <- sum(is.na(dataset_name$origin_quantity))

Alright, let's join our modified trip_freq with 1F, aggregate to SA2 level and get a count of NA values. Note that I'm not doing an inner join, as I want to preserve the NA values from trip_freq.

In [31]:
working_data <- join(SA2_stops_by_mode, trip_freq, by="stop_id")
head(working_data)

stop_id,SA2Code,SA2Name,Pop,Score,Team_Member,origin_quantity
1,305011105,Brisbane City,10192,1083,Steff,767.0
2,305011105,Brisbane City,10192,1083,Steff,259.0
3,305011105,Brisbane City,10192,1083,Steff,
4,305011111,Spring Hill,6063,1028,Dave,
5,305011105,Brisbane City,10192,1083,Steff,465.0
6,305011105,Brisbane City,10192,1083,Steff,2400.0
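
To make the join choice concrete, here is a small sketch of the difference between a left join and an inner join. It uses base R's `merge` rather than the `join` call in the cell above, and two toy tables invented for illustration (only the column names are borrowed from the real data):

```r
# Toy tables invented for illustration; column names follow the ones above.
stops <- data.frame(stop_id = c(1, 2, 3),
                    SA2Code = c(305011105, 305011105, 305011111))
trips <- data.frame(stop_id = c(1, 2),
                    origin_quantity = c(767, 259))

# Left join: every row of `stops` is kept; stop 3 has no match, so it gets NA.
left <- merge(stops, trips, by = "stop_id", all.x = TRUE)

# Inner join: stop 3 disappears, and with it the record that it had no trips.
inner <- merge(stops, trips, by = "stop_id")
```

The NA that the left join leaves behind for stop 3 is exactly the signal needed later to count stops with no morning trips.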


Ok, my experiments with grouping by SA2 and preserving the count of NAs aren't working - any SA2 area where at least one stop has an NA is aggregating to an NA. So I'm throwing out the approach above and coming up with a new one - let's subset this dataset to create a temporary dataset that only has the stop_ids of stops with an NA in origin_quantity, then replace all of the NA values in origin_quantity with a 0, then group up to SA2 level, then join with our temp datasets and go from there.

In [32]:
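# Flag each stop that has no morning-bin trips (NA in origin_quantity) with a 1
temp_NA <- working_data[is.na(working_data$origin_quantity),]
temp_NA$SA2Code <- NULL
temp_NA$SA2Name <- NULL
temp_NA$Pop <- NULL
temp_NA$Team_Member <- NULL
temp_NA$Score <- 1
temp_NA$origin_quantity <- NULL
colnames(temp_NA)[colnames(temp_NA)=="Score"] <- "number_of_NAs"
# Re-attach the SA2 code to each flagged stop, then count the flags per SA2 area
temp_NA <- join(SA2_stops_by_mode, temp_NA, by="stop_id", type="inner")
temp_NA <- temp_NA %>% group_by (SA2Code) %>% summarise(number_of_NAs = sum(number_of_NAs))

# Replace every remaining NA (including the missing trip counts) with 0 so the sums below work
working_data[is.na(working_data)] <- 0

# Count all stops per SA2 area; any rows whose SA2Code was missing became 0 above, so drop them
temp_stop_count <- working_data %>% count(SA2Code)
temp_stop_count <- temp_stop_count[!(temp_stop_count$SA2Code==0),]

# Sum the morning-bin trips up to SA2 level and drop areas with no team member assigned
working_data <- working_data %>% group_by (SA2Code, SA2Name, Pop, Score, Team_Member) %>% summarise(origin_quantity = sum(origin_quantity))
working_data <- working_data[!(working_data$Team_Member=="unassigned"),]

# Attach the number of stops in each SA2 area
working_data <- merge(working_data, temp_stop_count, by="SA2Code")
colnames(working_data)[colnames(working_data)=="n"] <- "number_of_stops"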

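# all.x=TRUE keeps SA2 areas where every stop had trips; they get 0 rather than being dropped
working_data <- merge(working_data, temp_NA, by="SA2Code", all.x=TRUE)
working_data$number_of_NAs[is.na(working_data$number_of_NAs)] <- 0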
head(working_data)

SA2Code,SA2Name,Pop,Score,Team_Member,origin_quantity,number_of_stops,number_of_NAs
112031254,Tweed Heads,19417,933,Eric,11140,3,1
301011001,Alexandra Hills,16345,987,Charlie,46707,113,25
301011002,Belmont - Gumdale,7375,1093,Steff,35990,47,19
301011003,Birkdale,14923,1034,Dave,128030,57,15
301011004,Capalaba,17588,991,Charlie,127597,120,37
301011005,Thorneside,3761,983,Charlie,53504,26,7
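
For comparison, the three per-SA2 figures (trip total, stop count, count of stops with no trips) can probably also be produced in a single grouped pass. This is only a sketch of an alternative, not the approach used in the cell above, and it assumes a toy data frame invented for illustration. The trick is `na.rm = TRUE` for the total and `sum(is.na(...))` for the NA count:

```r
library(dplyr)

# Toy data invented for illustration: SA2 area 1 has three stops (one with
# no trips), SA2 area 2 has two stops (both with no trips).
toy <- data.frame(
  SA2Code = c(1, 1, 1, 2, 2),
  origin_quantity = c(767, NA, 259, NA, NA)
)

per_sa2 <- toy %>%
  group_by(SA2Code) %>%
  summarise(
    number_of_stops = n(),
    number_of_NAs   = sum(is.na(origin_quantity)),
    # Computed last: once origin_quantity is overwritten with its total,
    # later expressions in the same summarise() would see the new value.
    origin_quantity = sum(origin_quantity, na.rm = TRUE)
  )
```

An area where every stop is NA ends up with a total of 0 and `number_of_NAs` equal to `number_of_stops`, so it would need handling before dividing by the number of useful stops.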


Alright, that was a lot of code in one block!! What are we left with?
- SA2Code and Name
- Population
- SAD Score and Team_Member
- the number of trips that originated within each SA2 area in the morning bin
- the number of stops in each SA2 area
- the number of stops in each SA2 area that had zero trips in the morning bin

What's next?
- new column (useful_stops) calculated by number_of_stops _minus_ number_of_NAs (this represents the relevant number of stops per SA2 area for this analysis)
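- new column (stop_utilisation) calculated by origin_quantity _divided by_ useful_stops (this represents an index of utilisation by public transport stops in each SA2 area)
- new column (stop_utilisation_per_person) calculated by stop_utilisation _divided by_ Pop (the same index, normalised by the population of each SA2 area)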
- export dataset, import to Tableau, plot Score vs. stop_utilisation
- analyse results

In [33]:
working_data$useful_stops <- with(working_data, number_of_stops - number_of_NAs)
working_data$stop_utilisation <- with(working_data, origin_quantity / useful_stops)
working_data$stop_utilisation_per_person <- with(working_data, stop_utilisation / Pop)
head(working_data)

# write.csv(working_data, "public_transport_utilisation.csv", row.names = FALSE)

SA2Code,SA2Name,Pop,Score,Team_Member,origin_quantity,number_of_stops,number_of_NAs,useful_stops,stop_utilisation,stop_utilisation_per_person
112031254,Tweed Heads,19417,933,Eric,11140,3,1,2,5570.0,0.28686203
301011001,Alexandra Hills,16345,987,Charlie,46707,113,25,88,530.7614,0.0324724
301011002,Belmont - Gumdale,7375,1093,Steff,35990,47,19,28,1285.3571,0.17428571
301011003,Birkdale,14923,1034,Dave,128030,57,15,42,3048.3333,0.20427081
301011004,Capalaba,17588,991,Charlie,127597,120,37,83,1537.3133,0.08740694
301011005,Thorneside,3761,983,Charlie,53504,26,7,19,2816.0,0.74873704
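
Before heading to a plot, a quick numeric check of the relationship can be sketched in R. This is only an illustration: a Spearman rank correlation is an assumption made for this sketch rather than part of the analysis above, and it uses just the six rows printed above, which is far too few to say anything about the hypothesis. On the full dataset the same call would be pointed at `working_data`.

```r
# The six SA2 rows printed above, typed in by hand.
shown <- data.frame(
  Score = c(933, 987, 1093, 1034, 991, 983),
  stop_utilisation = c(5570.0, 530.7614, 1285.3571, 3048.3333, 1537.3133, 2816.0)
)

# Spearman rank correlation: only assumes a monotonic relationship, and is
# less sensitive to a handful of extremely busy areas than Pearson.
rho <- cor(shown$Score, shown$stop_utilisation, method = "spearman")
```

A negative `rho` would point in the direction of the hypothesis (lower-scoring areas showing higher stop utilisation), a positive one against it.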


Results visualisation:

https://github.com/TheDataStarter/Lets-Go-card-/blob/master/viz/SAD%20score%20vs%20pt%20stop%20utilisation.png
