# Hypothesis - public transport is easily accessible to people without cars

Hey guys, back with a more challenging hypothesis to address (because it's time to bring in new data outside of the original spec) - public transport is easily accessible to people without cars. My goal with this piece is to introduce some other ABS data, using the metadata listing Eric provided here - https://drive.google.com/open?id=1CCXJ1blJ6xoUeOgfEZVzWPVBOhwnXB3-

Cheers Eric!

From looking at this metadata file, the data subsets I'm thinking will be useful are:
- G43 - Labour Force Status by Age by Sex (and, using the next sheet, probably "G9074 - P_Tot_Emp_Tot-Persons_Total_employed_Total", to get the number of persons per SA2 who are employed in some capacity)
- G59 - Method of Travel to Work by Sex (and, using the next sheet, "G15526 - Worked_home_P - Worked_at_home_Persons", to get the number of workers who don't travel at all for work)
- G59 - Method of Travel to Work by Sex (and, using the next sheet, a _whole bunch of variables_ that indicate the number of people who used at least one of train, bus, tram or ferry (but not cars) in their travel to work - will need some deft aggregation to get a sensible answer out of this)
- G30 - Number of Motor Vehicles by Dwellings (and, using the next sheet, "G7897 - Total_dwelings - Total_Dwellings" and "G7890 - Num_MVs_per_dweling_0_MVs - Number_of_motor_vehicles_per_dwelling_No_motor_vehicles_Dwellings", to calculate an indicative percentage of dwellings in the area that have no cars)

To make use of these datasets, I am planning on (attempting to):
- only focus on the SA2 areas we are interested in for our analysis
- for each SA2 area - have both the reported population (from 1F) and the number of people who are employed (from G43).
- For each SA2 area - subtract the number of people who work at home (from G59) from the number of employed people.
- For each SA2 area - aggregate the number of people who use at least one of train, bus, tram or ferry to travel to work (from G59) (but deliberately exclude all records that involve the use of a car) 
- For each SA2 area - calculate the percentage of dwellings in each area that do not have a car (from G30)

This should allow us to discover some things on our way to evaluating this hypothesis:
- is there a relationship between % employed and SAD index?
- is there a relationship between % of people who are employed AND have to travel to work and SAD index? (in other words, which areas are more likely to encourage working from home?)
- is there a relationship between SAD index and people who only use PT to get to work?
- is there a relationship between SAD index and people who only use PT to get to work, normalised by the percentage of houses that don't have a car? (an extension of the previous one, will be interesting to see if/how this changes the relationship)

Ultimately, I am trying to get the data to a point that shows whether there is a meaningful relationship in the data between the number of people who rely on public transport to get to work (normalised by houses without cars) and the number of services provided in that SA2 area (which I'll just lift from the "Hypothesis - public transport serves people of relative wealth" hypothesis, rather than reinventing the wheel).

Finally, as always, I then plan to further analyse this relationship (if one can be found) by filtering on each of our SA2 areas to see what results.

_Note_ - unlike the last hypothesis, this one has a lot of moving parts and data that I'm not as familiar with. As a result, I'm planning on tackling it in smaller, self-contained chunks that won't be as streamlined or optimised at first. Depending on timing and energy, I might do a second notebook to replicate the findings of this one in a more cohesive, streamlined way (with fewer interim steps). We'll see how we go ;)

_Also note_ - there are some assumptions being made here (due to the data we have available). Documenting them here as I find them:
- people who travel to work using public transport but NOT using cars will be normalised by the percentage of dwellings in the area that don't have a car - this isn't an exact science, as it completely fails to capture people who have cars but choose not to use them to get to work. It should be noted that this will give an indicative number only suitable for relative comparison with other areas who have had the same calculations applied. The hope is that the relative errors between areas will be roughly equivalent. Ideally, we would seek out either better reference data to be more exact in this (or at least, validate our approach with a relevant subject matter expert - but I see this as outside the scope of this assignment)
- any categorisation of "travelled to work via Other" or similar where it's not clear that a car was involved or not - assuming a car was involved. Excluded from PT numbers as a result, which may make these numbers appear lower than they should be. Nothing we can do to resolve, and hopefully treating every record with the same brush will result in similar scales of uncertainty, if nothing else.
- As the focus is on people accessing PT where they don't have a car to get to work, things like taxis, walking to work, riding a bike, and so on are excluded from this dataset. 

My current workflow involves having both this Jupyter notebook and RStudio open in parallel. I perform every step in RStudio first and make sure it does what I want it to do (and correct anything that needs correcting), then paste the code in here to ensure Jupyter gives an identical result.

I should also mention that I've copied all of the ABS datasets identified above into my local Processed Data folder, but they are identical to the ones from https://drive.google.com/open?id=1TSQcJvtTyzVh92KGNdPqcPpTtxMX9ixW

First, as always, let's get some libraries.


In [1]:
library(readr)
library(plyr)
library(dplyr)
# Note from Dave to Dave - ALWAYS load plyr before dplyr or group_by doesn't work properly
# Note - have updated to use combined_quantity from the 1C dataset to give more meaningful results

"package 'plyr' was built under R version 3.4.4"
Attaching package: 'dplyr'

The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



Next, I'm going to do an interim step that probably isn't a very efficient way of achieving this, but it'll help me keep everything straight in my head. Let's import 1F (as discussed earlier, my version of 1F) to _just_ get the SA2 codes, names and population counts we need for now as a reference table.

In [2]:
our_SA2 <- read_csv("Processed Data/1F/SA2_stops_by_mode.csv", col_types = cols(AREASQKM16 = col_skip(), Score = col_skip(), Team_Member = col_skip(), is_bus_stop = col_skip(), is_ferry_stop = col_skip(), is_train_station = col_skip(), is_tram_stop = col_skip(), stop_id = col_skip(), stop_lat = col_skip(), stop_lon = col_skip(), stop_name = col_skip(), stop_url = col_skip()))
head(our_SA2)

SA2Code,SA2Name,Pop
305011105,Brisbane City,10192
305011105,Brisbane City,10192
305011105,Brisbane City,10192
305011111,Spring Hill,6063
305011105,Brisbane City,10192
305011105,Brisbane City,10192


As you can see, this dataset is now filled with duplicate rows - we only need unique rows to move forward with. It's also worth ensuring we have no rows with missing values. I'll also generate some summary statistics for reference.

In [3]:
our_SA2 <- na.omit(our_SA2)
our_SA2 <- unique(our_SA2)
summary(our_SA2)

    SA2Code            SA2Name               Pop       
 Min.   :112031254   Length:296         Min.   :  196  
 1st Qu.:303051075   Class :character   1st Qu.: 6828  
 Median :309031236   Mode  :character   Median : 9484  
 Mean   :307264259                      Mean   :10597  
 3rd Qu.:311051323                      3rd Qu.:13308  
 Max.   :319031514                      Max.   :31214  

As you can see, we have 296 SA2 areas to work with in our universe of interest.

This is another matter of personal preference - no reason why I couldn't work with all SA2 areas in the ABS data and filter at the end, but by doing it as we go it's working with a smaller dataset, which I always prefer.

Next, let's bring in the first of our new ABS datasets - G43B (for some reason, G43 is split into A and B datasets - the data feature we want is in B). Also, as part of the import, I will be excluding _187_ columns to get the two I want... o_0

In [4]:
g43_data <- read_csv("Processed Data/ABS/2016Census_G43B_QLD_SA2.csv", col_types = cols(F_LFS_NS_15_19 = col_skip(), F_LFS_NS_20_24 = col_skip(), F_LFS_NS_25_34 = col_skip(), F_LFS_NS_35_44 = col_skip(), F_LFS_NS_45_54 = col_skip(), F_LFS_NS_55_64 = col_skip(), F_LFS_NS_65_74 = col_skip(), F_LFS_NS_75_84 = col_skip(), F_LFS_NS_85ov = col_skip(), F_LFS_NS_Tot = col_skip(), F_Not_in_LF_15_19 = col_skip(), F_Not_in_LF_20_24 = col_skip(), F_Not_in_LF_25_34 = col_skip(), F_Not_in_LF_35_44 = col_skip(), F_Not_in_LF_45_54 = col_skip(), F_Not_in_LF_55_64 = col_skip(), F_Not_in_LF_65_74 = col_skip(), F_Not_in_LF_75_84 = col_skip(), F_Not_in_LF_85ov = col_skip(), F_Not_in_LF_Tot = col_skip(), F_Tot_15_19 = col_skip(), F_Tot_20_24 = col_skip(), F_Tot_25_34 = col_skip(), F_Tot_35_44 = col_skip(), F_Tot_45_54 = col_skip(), F_Tot_55_64 = col_skip(), F_Tot_65_74 = col_skip(), F_Tot_75_84 = col_skip(), F_Tot_85ov = col_skip(), F_Tot_LF_15_19 = col_skip(), F_Tot_LF_20_24 = col_skip(), F_Tot_LF_25_34 = col_skip(), F_Tot_LF_35_44 = col_skip(), F_Tot_LF_45_54 = col_skip(), F_Tot_LF_55_64 = col_skip(), F_Tot_LF_65_74 = col_skip(), F_Tot_LF_75_84 = col_skip(), F_Tot_LF_85ov = col_skip(), F_Tot_LF_Tot = col_skip(), F_Tot_Tot = col_skip(), P_Emp_FullT_15_19 = col_skip(), P_Emp_FullT_20_24 = col_skip(), P_Emp_FullT_25_34 = col_skip(), P_Emp_FullT_35_44 = col_skip(), P_Emp_FullT_45_54 = col_skip(), P_Emp_FullT_55_64 = col_skip(), P_Emp_FullT_65_74 = col_skip(), P_Emp_FullT_75_84 = col_skip(), P_Emp_FullT_85ov = col_skip(), P_Emp_FullT_Tot = col_skip(), P_Emp_PartT_15_19 = col_skip(), P_Emp_PartT_20_24 = col_skip(), P_Emp_PartT_25_34 = col_skip(), P_Emp_PartT_35_44 = col_skip(), P_Emp_PartT_45_54 = col_skip(), P_Emp_PartT_55_64 = col_skip(), P_Emp_PartT_65_74 = col_skip(), P_Emp_PartT_75_84 = col_skip(), P_Emp_PartT_85ov = col_skip(), P_Emp_PartT_Tot = col_skip(), P_Emp_awy_f_wrk_15_19 = col_skip(), P_Emp_awy_f_wrk_20_24 = col_skip(), P_Emp_awy_f_wrk_25_34 = col_skip(), P_Emp_awy_f_wrk_35_44 
= col_skip(), P_Emp_awy_f_wrk_45_54 = col_skip(), P_Emp_awy_f_wrk_55_64 = col_skip(), P_Emp_awy_f_wrk_65_74 = col_skip(), P_Emp_awy_f_wrk_75_84 = col_skip(), P_Emp_awy_f_wrk_85ov = col_skip(), P_Emp_awy_f_wrk_Tot = col_skip(), P_Hours_wkd_NS_15_19 = col_skip(), P_Hours_wkd_NS_20_24 = col_skip(), P_Hours_wkd_NS_25_34 = col_skip(), P_Hours_wkd_NS_35_44 = col_skip(), P_Hours_wkd_NS_45_54 = col_skip(), P_Hours_wkd_NS_55_64 = col_skip(), P_Hours_wkd_NS_65_74 = col_skip(), P_Hours_wkd_NS_75_84 = col_skip(), P_Hours_wkd_NS_85ov = col_skip(), P_Hours_wkd_NS_Tot = col_skip(), P_LFS_NS_15_19 = col_skip(), P_LFS_NS_20_24 = col_skip(), P_LFS_NS_25_34 = col_skip(), P_LFS_NS_35_44 = col_skip(), P_LFS_NS_45_54 = col_skip(), P_LFS_NS_55_64 = col_skip(), P_LFS_NS_65_74 = col_skip(), P_LFS_NS_75_84 = col_skip(), P_LFS_NS_85ov = col_skip(), P_LFS_NS_Tot = col_skip(), P_Not_in_LF_15_19 = col_skip(), P_Not_in_LF_20_24 = col_skip(), P_Not_in_LF_25_34 = col_skip(), P_Not_in_LF_35_44 = col_skip(), P_Not_in_LF_45_54 = col_skip(), P_Not_in_LF_55_64 = col_skip(), P_Not_in_LF_65_74 = col_skip(), P_Not_in_LF_75_84 = col_skip(), P_Not_in_LF_85ov = col_skip(), P_Not_in_LF_Tot = col_skip(), P_Tot_15_19 = col_skip(), P_Tot_20_24 = col_skip(), P_Tot_25_34 = col_skip(), P_Tot_35_44 = col_skip(), P_Tot_45_54 = col_skip(), P_Tot_55_64 = col_skip(), P_Tot_65_74 = col_skip(), P_Tot_75_84 = col_skip(), P_Tot_85ov = col_skip(), P_Tot_Emp_15_19 = col_skip(), P_Tot_Emp_20_24 = col_skip(), P_Tot_Emp_25_34 = col_skip(), P_Tot_Emp_35_44 = col_skip(), P_Tot_Emp_45_54 = col_skip(), P_Tot_Emp_55_64 = col_skip(), P_Tot_Emp_65_74 = col_skip(), P_Tot_Emp_75_84 = col_skip(), P_Tot_Emp_85ov = col_skip(), P_Tot_LF_15_19 = col_skip(), P_Tot_LF_20_24 = col_skip(), P_Tot_LF_25_34 = col_skip(), P_Tot_LF_35_44 = col_skip(), P_Tot_LF_45_54 = col_skip(), P_Tot_LF_55_64 = col_skip(), P_Tot_LF_65_74 = col_skip(), P_Tot_LF_75_84 = col_skip(), P_Tot_LF_85ov = col_skip(), P_Tot_LF_Tot = col_skip(), P_Tot_Tot = col_skip(), 
P_Tot_Unemp_15_19 = col_skip(), P_Tot_Unemp_20_24 = col_skip(), P_Tot_Unemp_25_34 = col_skip(), P_Tot_Unemp_35_44 = col_skip(), P_Tot_Unemp_45_54 = col_skip(), P_Tot_Unemp_55_64 = col_skip(), P_Tot_Unemp_65_74 = col_skip(), P_Tot_Unemp_75_84 = col_skip(), P_Tot_Unemp_85ov = col_skip(), P_Tot_Unemp_Tot = col_skip(), P_Unem_look_FTW_15_19 = col_skip(), P_Unem_look_FTW_20_24 = col_skip(), P_Unem_look_FTW_25_34 = col_skip(), P_Unem_look_FTW_35_44 = col_skip(), P_Unem_look_FTW_45_54 = col_skip(), P_Unem_look_FTW_55_64 = col_skip(), P_Unem_look_FTW_65_74 = col_skip(), P_Unem_look_FTW_75_84 = col_skip(), P_Unem_look_FTW_85ov = col_skip(), P_Unem_look_FTW_Tot = col_skip(), P_Unem_look_PTW_15_19 = col_skip(), P_Unem_look_PTW_20_24 = col_skip(), P_Unem_look_PTW_25_34 = col_skip(), P_Unem_look_PTW_35_44 = col_skip(), P_Unem_look_PTW_45_54 = col_skip(), P_Unem_look_PTW_55_64 = col_skip(), P_Unem_look_PTW_65_74 = col_skip(), P_Unem_look_PTW_75_84 = col_skip(), P_Unem_look_PTW_85ov = col_skip(), P_Unem_look_PTW_Tot = col_skip()))
head(g43_data)

SA2_MAINCODE_2016,P_Tot_Emp_Tot
301011001,8311
301011002,3835
301011003,7349
301011004,8995
301011005,1836
301011006,5975
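As an aside, readr also has `cols_only()`, which names the columns to keep rather than the ones to skip. Here is a minimal sketch of that approach; the tiny CSV and its values are made up purely so the snippet runs on its own, and only the two column names of interest come from the real G43B file.

```r
library(readr)

# Hypothetical stand-in for the G43B file: the two columns we want plus one
# we don't. The numbers in the third column are invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("SA2_MAINCODE_2016,P_Tot_Emp_Tot,P_Tot_Unemp_Tot",
             "301011001,8311,600",
             "301011002,3835,200"), tmp)

# cols_only() keeps just the named columns and drops everything else
g43_small <- read_csv(tmp, col_types = cols_only(
  SA2_MAINCODE_2016 = col_integer(),
  P_Tot_Emp_Tot = col_integer()
))
names(g43_small)
```

This should give the same two-column result as the long `col_skip()` list, with a lot less typing.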


Let's clean up the column names to be a little more meaningful...

In [5]:
colnames(g43_data)[colnames(g43_data)=="SA2_MAINCODE_2016"] <- "SA2Code"
colnames(g43_data)[colnames(g43_data)=="P_Tot_Emp_Tot"] <- "number_of_employees"
summary(g43_data)

    SA2Code          number_of_employees
 Min.   :301011001   Min.   :    0      
 1st Qu.:305041133   1st Qu.: 2444      
 Median :309091402   Median : 3653      
 Mean   :310209722   Mean   : 4031      
 3rd Qu.:315011397   3rd Qu.: 5223      
 Max.   :399999499   Max.   :14884      

We can see the names are now better, but this list holds every SA2 code in the file (530 of them) - it's all of Queensland, not just the bits we care about. Let's join with our 1F reference set from before to finish off this piece of the puzzle.

In [6]:
g43_data <- join(g43_data, our_SA2, by="SA2Code", type="inner")
head(g43_data)

SA2Code,number_of_employees,SA2Name,Pop
301011001,8311,Alexandra Hills,16345
301011002,3835,Belmont - Gumdale,7375
301011003,7349,Birkdale,14923
301011004,8995,Capalaba,17588
301011005,1836,Thorneside,3761
301011006,5975,Wellington Point,11576
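For anyone who doesn't have plyr loaded, the same inner join can be sketched with base R's `merge()`. The toy tables below reuse two rows shown in this notebook plus one made-up code (999999999) that is deliberately missing from the reference set, so you can see it drop out.

```r
# Toy versions of the two tables; 999999999 is a hypothetical SA2 code
# that does not exist in our reference set
g43_toy <- data.frame(SA2Code = c(301011001, 301011002, 999999999),
                      number_of_employees = c(8311, 3835, 100))
ref_toy <- data.frame(SA2Code = c(301011001, 301011002),
                      SA2Name = c("Alexandra Hills", "Belmont - Gumdale"),
                      Pop = c(16345, 7375))

# all = FALSE (the default) keeps only codes present in both tables,
# i.e. an inner join
joined <- merge(g43_toy, ref_toy, by = "SA2Code", all = FALSE)
joined
```

One difference to be aware of: `merge()` sorts the result by the key column, whereas `plyr::join()` keeps the row order of the left-hand table.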


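Finally, to sidetrack a little, let's also calculate the percentage of the population of each SA2 area that is not employed. Strictly speaking this is not an unemployment rate, since (assuming Pop is the total population) it also counts children, retirees and anyone else outside the labour force, but I'll keep the column name `percent_unemployed` as shorthand.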

In [7]:
g43_data$percent_unemployed <- with(g43_data, (Pop - number_of_employees) / Pop * 100)
head(g43_data)
summary(g43_data$percent_unemployed)

SA2Code,number_of_employees,SA2Name,Pop,percent_unemployed
301011001,8311,Alexandra Hills,16345,49.15265
301011002,3835,Belmont - Gumdale,7375,48.0
301011003,7349,Birkdale,14923,50.75387
301011004,8995,Capalaba,17588,48.85718
301011005,1836,Thorneside,3761,51.1832
301011006,5975,Wellington Point,11576,48.38459


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  35.35   48.86   51.91   52.78   56.39   85.63 
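As a quick sanity check on that calculation, here it is worked by hand for Alexandra Hills, using the population and employee counts shown in the table above.

```r
pop <- 16345       # Alexandra Hills population (from 1F)
employed <- 8311   # Alexandra Hills employed persons (from G43B)

# share of the population that is not employed, as a percentage
pct_not_employed <- (pop - employed) / pop * 100
round(pct_not_employed, 5)
```

This reproduces the 49.15265 shown in the `percent_unemployed` column for that SA2 area.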

Not at all relevant to our hypothesis, but I wonder if there's a correlation between % unemployed and SAD index score? Doesn't help us with public transport, so moving on...  (_also_ - should check which parts of employment stats are actually included in calculating SAD index scores, it might be correlated by design).

Alright, so that's all the manipulation I was planning on doing to G43 for now. Next on my hit list was G59 to discover:
- "G15526 - Worked_home_P - Worked_at_home_Persons" (to get the number of workers who don't travel at all for work)
- a _whole bunch of variables_ that indicate the number of people who used at least one of train, bus, tram or ferry (but not cars) in their travel to work - will need some deft aggregation to get a sensible answer out of this

I'm going to import G59, but I'm going to do what I would normally do first - open in Excel, manually delete all the columns I don't have a use for, import the result from there. The G59 file I am working from is available from Project - Analysis - Dave - Dave Output on Google Drive (or from here - https://drive.google.com/open?id=1v6UKQoLlxoAoPcKwSN_c89iDyJwaSI30 )

The columns I am going to keep here are:
- SA2_MAINCODE_2016
- One_method_Train_P - One_method_Train_Persons
- One_method_Bus_P - One_method_Bus_Persons
- One_method_Ferry_P - One_method_Ferry_Persons
- One_met_Tram_incl_lt_rail_P - One_method_Tram_includes_light_rail_Persons
- Two_methods_Train_Bus_P - Two_methods_Train_and_Bus_Persons
- Two_methods_Train_Ferry_P - Two_methods_Train_and_Ferry_Persons
- Two_mt_trn_Trm_incl_lt_rl_P - Two_methods_Train_and_Tram_includes_light_rail_Persons
- Two_methods_Bus_Ferry_P - Two_methods_Bus_and_Ferry_Persons
- Two_mth_Bu_Trm_inc_lt_rl_P - Two_methods_Bus_and_Tram_includes_light_rail_Persons
- Worked_home_P - Worked_at_home_Persons


In [8]:
g59_data<- read_csv("Processed Data/ABS/g59.csv")
head(g59_data)

Parsed with column specification:
cols(
  SA2_MAINCODE_2016 = col_integer(),
  One_method_Train_P = col_integer(),
  One_method_Bus_P = col_integer(),
  One_method_Ferry_P = col_integer(),
  One_met_Tram_incl_lt_rail_P = col_integer(),
  Two_methods_Train_Bus_P = col_integer(),
  Two_methods_Train_Ferry_P = col_integer(),
  Two_mt_trn_Trm_incl_lt_rl_P = col_integer(),
  Two_methods_Bus_Ferry_P = col_integer(),
  Two_mth_Bu_Trm_inc_lt_rl_P = col_integer(),
  Worked_home_P = col_integer()
)


SA2_MAINCODE_2016,One_method_Train_P,One_method_Bus_P,One_method_Ferry_P,One_met_Tram_incl_lt_rail_P,Two_methods_Train_Bus_P,Two_methods_Train_Ferry_P,Two_mt_trn_Trm_incl_lt_rl_P,Two_methods_Bus_Ferry_P,Two_mth_Bu_Trm_inc_lt_rl_P,Worked_home_P
301011001,94,231,3,0,15,0,0,0,0,291
301011002,59,145,3,0,4,0,0,0,0,286
301011003,331,73,0,0,27,0,0,0,0,355
301011004,55,274,3,0,12,0,0,0,0,348
301011005,118,9,3,0,11,0,0,0,0,75
301011006,279,53,0,0,25,0,0,0,0,302


Now, just like before, let's rename (some of) our columns and join to our 1F ref set to only get the SA2 areas we care about.

In [9]:
colnames(g59_data)[colnames(g59_data)=="SA2_MAINCODE_2016"] <- "SA2Code"
colnames(g59_data)[colnames(g59_data)=="Worked_home_P"] <- "number_worked_from_home"
g59_data <- join(g59_data, our_SA2, by="SA2Code", type="inner")
summary(g59_data)

    SA2Code          One_method_Train_P One_method_Bus_P One_method_Ferry_P
 Min.   :301011001   Min.   :   0.0     Min.   :   0.0   Min.   :  0.000   
 1st Qu.:303051076   1st Qu.:  23.0     1st Qu.:  38.0   1st Qu.:  0.000   
 Median :309031237   Median :  62.0     Median :  83.0   Median :  0.000   
 Mean   :307926065   Mean   : 140.9     Mean   : 185.6   Mean   :  7.966   
 3rd Qu.:311051324   3rd Qu.: 194.5     3rd Qu.: 228.0   3rd Qu.:  3.000   
 Max.   :319031514   Max.   :1353.0     Max.   :1331.0   Max.   :524.000   
 One_met_Tram_incl_lt_rail_P Two_methods_Train_Bus_P Two_methods_Train_Ferry_P
 Min.   :  0.00              Min.   :  0.00          Min.   :0.0000           
 1st Qu.:  0.00              1st Qu.:  6.00          1st Qu.:0.0000           
 Median :  0.00              Median : 17.00          Median :0.0000           
 Mean   :  5.78              Mean   : 23.77          Mean   :0.1424           
 3rd Qu.:  0.00              3rd Qu.: 33.50          3rd Qu.:0.0000      

Ok, now let's look at aggregating all of these different columns that represent different ways of using public transport to get to work - we only really need it as a single value per SA2 area for what we're trying to do here. So let's sum them and then get rid of the individual columns.

In [10]:
g59_data$pt_to_work <- with(g59_data, One_method_Bus_P + One_method_Ferry_P + One_met_Tram_incl_lt_rail_P + One_method_Train_P + Two_methods_Train_Bus_P + Two_methods_Train_Ferry_P + Two_mt_trn_Trm_incl_lt_rl_P + Two_methods_Bus_Ferry_P + Two_mth_Bu_Trm_inc_lt_rl_P)
g59_data$One_method_Train_P <- NULL
g59_data$One_method_Bus_P <- NULL
g59_data$One_method_Ferry_P <- NULL
g59_data$One_met_Tram_incl_lt_rail_P <- NULL
g59_data$Two_methods_Train_Bus_P <- NULL
g59_data$Two_methods_Train_Ferry_P <- NULL
g59_data$Two_mt_trn_Trm_incl_lt_rl_P <- NULL
g59_data$Two_methods_Bus_Ferry_P <- NULL
g59_data$Two_mth_Bu_Trm_inc_lt_rl_P <- NULL
head(g59_data)

SA2Code,number_worked_from_home,SA2Name,Pop,pt_to_work
301011001,291,Alexandra Hills,16345,343
301011002,286,Belmont - Gumdale,7375,211
301011003,355,Birkdale,14923,431
301011004,348,Capalaba,17588,344
301011005,75,Thorneside,3761,141
301011006,302,Wellington Point,11576,357
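A more compact way to do the same aggregation is `rowSums()` over a vector of column names. This is only a sketch on a toy table built from the first two rows of the G59 data shown earlier, not a replacement for the cell above.

```r
# Toy rows copied from the first two SA2 areas in the G59 import
g59_toy <- data.frame(
  SA2Code = c(301011001, 301011002),
  One_method_Train_P = c(94, 59),
  One_method_Bus_P = c(231, 145),
  One_method_Ferry_P = c(3, 3),
  One_met_Tram_incl_lt_rail_P = c(0, 0),
  Two_methods_Train_Bus_P = c(15, 4),
  Two_methods_Train_Ferry_P = c(0, 0),
  Two_mt_trn_Trm_incl_lt_rl_P = c(0, 0),
  Two_methods_Bus_Ferry_P = c(0, 0),
  Two_mth_Bu_Trm_inc_lt_rl_P = c(0, 0)
)

# every column except the key is a public transport count
pt_cols <- setdiff(names(g59_toy), "SA2Code")

# sum across the PT columns, then keep only the key and the total
g59_toy$pt_to_work <- rowSums(g59_toy[, pt_cols])
g59_toy <- g59_toy[, c("SA2Code", "pt_to_work")]
g59_toy
```

The totals for these two areas (343 and 211) match the `pt_to_work` values in the table above.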


From the head(), we can see that the number of people who took public transport to work seems low - but remember that this is only people who took ONLY public transport to work. If you drove or got a lift to your bus stop or train station, you won't be included in these figures. Let's take a quick peek at the summary stats to see if this is true across the board.

In [11]:
summary(g59_data$pt_to_work)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0   121.5   268.0   368.0   536.0  1724.0 

Alright, one ABS dataset left to nab. To refresh my memory, this one is:

- G30 - Number of Motor Vehicles by Dwellings (and, using the next sheet, "G7897 - Total_dwelings - Total_Dwellings" and "G7890 - Num_MVs_per_dweling_0_MVs - Number_of_motor_vehicles_per_dwelling_No_motor_vehicles_Dwellings", to calculate an indicative percentage of dwellings in the area that have no cars)

Unlike the last two imports, I'm going to import this one whole and then start excluding columns, just so I've provided examples for all three methods for doing it. The dataset is the G30 ABS file, available from https://drive.google.com/open?id=14Vt5tonuoq6VMFAGPHPaJM3SIHOQGdBl



In [12]:
g30_data <- read_csv("Processed Data/ABS/2016Census_G30_QLD_SA2.csv")
g30_data$Num_MVs_per_dweling_1_MVs <- NULL
g30_data$Num_MVs_per_dweling_2_MVs <- NULL
g30_data$Num_MVs_per_dweling_3_MVs <- NULL
g30_data$Num_MVs_per_dweling_4mo_MVs <- NULL
g30_data$Num_MVs_per_dweling_Tot <- NULL
g30_data$Num_MVs_NS <- NULL
head(g30_data)

Parsed with column specification:
cols(
  SA2_MAINCODE_2016 = col_integer(),
  Num_MVs_per_dweling_0_MVs = col_integer(),
  Num_MVs_per_dweling_1_MVs = col_integer(),
  Num_MVs_per_dweling_2_MVs = col_integer(),
  Num_MVs_per_dweling_3_MVs = col_integer(),
  Num_MVs_per_dweling_4mo_MVs = col_integer(),
  Num_MVs_per_dweling_Tot = col_integer(),
  Num_MVs_NS = col_integer(),
  Total_dwelings = col_integer()
)


SA2_MAINCODE_2016,Num_MVs_per_dweling_0_MVs,Total_dwelings
301011001,202,5619
301011002,41,2299
301011003,169,5060
301011004,311,6399
301011005,61,1530
301011006,88,3944


As before, let's rename some columns and combine with our 1F reference set.

In [13]:
colnames(g30_data)[colnames(g30_data)=="SA2_MAINCODE_2016"] <- "SA2Code"
colnames(g30_data)[colnames(g30_data)=="Num_MVs_per_dweling_0_MVs"] <- "dwellings_with_no_cars"
g30_data <- join(g30_data, our_SA2, by="SA2Code", type="inner")
head(g30_data)

SA2Code,dwellings_with_no_cars,Total_dwelings,SA2Name,Pop
301011001,202,5619,Alexandra Hills,16345
301011002,41,2299,Belmont - Gumdale,7375
301011003,169,5060,Birkdale,14923
301011004,311,6399,Capalaba,17588
301011005,61,1530,Thorneside,3761
301011006,88,3944,Wellington Point,11576


Final step on this one - adding a derived column that expresses percentage of dwellings that don't have cars per SA2 area.

In [14]:
g30_data$percentage_houses_no_cars <- with(g30_data, dwellings_with_no_cars / Total_dwelings * 100)
head(g30_data)

SA2Code,dwellings_with_no_cars,Total_dwelings,SA2Name,Pop,percentage_houses_no_cars
301011001,202,5619,Alexandra Hills,16345,3.594946
301011002,41,2299,Belmont - Gumdale,7375,1.783384
301011003,169,5060,Birkdale,14923,3.339921
301011004,311,6399,Capalaba,17588,4.860134
301011005,61,1530,Thorneside,3761,3.986928
301011006,88,3944,Wellington Point,11576,2.231237


 As can probably be expected for SE Queensland - not many houses without cars from the looks of it. Let's check with summary stats.

In [15]:
summary(g30_data$percentage_houses_no_cars)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.775   5.008   6.137   8.435  32.180 

From looking at the actual figures, we can see there's one SA2 area in SEQ where every single dwelling has at least one car - Brisbane Airport. At the other end of the scale, the areas with more than 20% of dwellings not owning a car are all ones you would probably predict:

- Spring Hill (inner city suburb with a short walk/free bus service to the CBD)
- Brisbane City (lots of stuff in walking distance)
- Fortitude Valley (same)
- St Lucia (heavy University suburb)
- South Brisbane (lots of stuff and access to PT, also heavy University suburb)
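For reference, a list like the one above can be pulled out with a simple filter and sort. This is a sketch on a toy table that reuses five percentages shown elsewhere in this notebook; on the real data you would run it against `g30_data` instead.

```r
# Toy subset of g30_data, using percentages shown in this notebook
g30_toy <- data.frame(
  SA2Name = c("Brisbane City", "Spring Hill", "Fortitude Valley",
              "Enoggera", "Alexandra Hills"),
  percentage_houses_no_cars = c(31.65183, 32.18025, 29.92925,
                                11.28765, 3.594946)
)

# keep areas where more than 20% of dwellings have no car,
# sorted from highest to lowest
high_no_car <- subset(g30_toy, percentage_houses_no_cars > 20)
high_no_car <- high_no_car[order(-high_no_car$percentage_houses_no_cars), ]
high_no_car
```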

So now that we have all these datasets... what do we do with 'em?

Normally, I'd just look at the questions I'm trying to answer and use the appropriate dataset to work out an answer, joining if needed. But I've done all this work in a big bundle, so let's practice joinin' 'em up! To do that, I'm going to reimport a fresh 1F to get back some fields that we're missing that would be useful and start selectively joining the key columns from our other datasets.

In [16]:
SA2_stops_by_mode <- read_csv("Processed Data/1F/SA2_stops_by_mode.csv", col_types = cols(AREASQKM16 = col_skip(), Pop = col_skip(), is_bus_stop = col_skip(), is_ferry_stop = col_skip(), is_train_station = col_skip(), is_tram_stop = col_skip(), stop_id = col_skip(), stop_lat = col_skip(), stop_lon = col_skip(), stop_name = col_skip(), stop_url = col_skip()))
SA2_stops_by_mode <- na.omit(SA2_stops_by_mode)
SA2_stops_by_mode <- unique(SA2_stops_by_mode)
head(SA2_stops_by_mode)

SA2Code,SA2Name,Score,Team_Member
305011105,Brisbane City,1083,Steff
305011111,Spring Hill,1028,Dave
305011106,Fortitude Valley,1064,Kate
305031128,Newstead - Bowen Hills,1132,Steff
304041098,Enoggera,1053,Kate
305041135,Paddington - Milton,1137,Steff


Now, the G43 dataset.

In [17]:
g43_data$SA2Name <- NULL #duplicates a column already present
accessible_data <- join(SA2_stops_by_mode, g43_data, by="SA2Code", type="inner")
head(accessible_data)

SA2Code,SA2Name,Score,Team_Member,number_of_employees,Pop,percent_unemployed
305011105,Brisbane City,1083,Steff,5397,10192,47.0467
305011111,Spring Hill,1028,Dave,3411,6063,43.74072
305011106,Fortitude Valley,1064,Kate,4539,7146,36.48195
305031128,Newstead - Bowen Hills,1132,Steff,6877,10638,35.35439
304041098,Enoggera,1053,Kate,4684,8158,42.58397
305041135,Paddington - Milton,1137,Steff,6522,10788,39.54394


Now, g59.

In [18]:
head(g59_data)

SA2Code,number_worked_from_home,SA2Name,Pop,pt_to_work
301011001,291,Alexandra Hills,16345,343
301011002,286,Belmont - Gumdale,7375,211
301011003,355,Birkdale,14923,431
301011004,348,Capalaba,17588,344
301011005,75,Thorneside,3761,141
301011006,302,Wellington Point,11576,357


In [19]:
g59_data$SA2Name <- NULL # duplicates a column already present
g59_data$Pop <- NULL # duplicates a column already present
accessible_data <- join(accessible_data, g59_data, by="SA2Code", type="inner")
head(accessible_data)

SA2Code,SA2Name,Score,Team_Member,number_of_employees,Pop,percent_unemployed,number_worked_from_home,pt_to_work
305011105,Brisbane City,1083,Steff,5397,10192,47.0467,254,724
305011111,Spring Hill,1028,Dave,3411,6063,43.74072,127,478
305011106,Fortitude Valley,1064,Kate,4539,7146,36.48195,139,900
305031128,Newstead - Bowen Hills,1132,Steff,6877,10638,35.35439,333,1421
304041098,Enoggera,1053,Kate,4684,8158,42.58397,152,698
305041135,Paddington - Milton,1137,Steff,6522,10788,39.54394,369,1065


Finally, g30.

In [21]:
g30_data$SA2Name <- NULL
g30_data$Pop <- NULL
accessible_data <- join(accessible_data, g30_data, by="SA2Code", type="inner")
head(accessible_data)

# write.csv(accessible_data,"accessible_data.csv", row.names = FALSE)

SA2Code,SA2Name,Score,Team_Member,number_of_employees,Pop,percent_unemployed,number_worked_from_home,pt_to_work,dwellings_with_no_cars,Total_dwelings,percentage_houses_no_cars,dwellings_with_no_cars.1,Total_dwelings.1,percentage_houses_no_cars.1
305011105,Brisbane City,1083,Steff,5397,10192,47.0467,254,724,1280,4044,31.65183,1280,4044,31.65183
305011111,Spring Hill,1028,Dave,3411,6063,43.74072,127,478,707,2197,32.18025,707,2197,32.18025
305011106,Fortitude Valley,1064,Kate,4539,7146,36.48195,139,900,973,3251,29.92925,973,3251,29.92925
305031128,Newstead - Bowen Hills,1132,Steff,6877,10638,35.35439,333,1421,735,4802,15.30612,735,4802,15.30612
304041098,Enoggera,1053,Kate,4684,8158,42.58397,152,698,341,3021,11.28765,341,3021,11.28765
305041135,Paddington - Milton,1137,Steff,6522,10788,39.54394,369,1065,396,4159,9.52152,396,4159,9.52152
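The columns ending in `.1` in the table above look like the g30 join was applied twice. If that is what happened, one way to tidy up is to drop them by name. This is a sketch only, and it assumes the duplicated columns really are named with a `.1` suffix as displayed; if they turn out to be exact duplicate names, `!duplicated(names(...))` would be the filter to use instead.

```r
# Toy table with one duplicated column, mimicking the output above
toy <- data.frame(SA2Code = c(305011105, 305011111),
                  Total_dwelings = c(4044, 2197),
                  Total_dwelings.1 = c(4044, 2197))

# keep only the columns whose names do not end in ".1"
toy_clean <- toy[, !grepl("\\.1$", names(toy)), drop = FALSE]
names(toy_clean)
```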


Alright - a big messy table full of derived values for us to operate on. (The columns ending in `.1` look like duplicates from the g30 join being run twice, so they can be ignored.) Let's list out the things I wanted to learn at the start (and see if these can now be answered/if anything else has come up in the meantime).

- is there a relationship between % employed and SAD index? (could be answered, but decided it doesn't help us analyse PT data)
- is there a relationship between % of people who are employed AND have to travel to work and SAD index? (in other words, which areas are more likely to encourage working from home?)
- is there a relationship between SAD index and people who only use PT to get to work?
- is there a relationship between SAD index and people who only use PT to get to work, normalised by the percentage of houses that don't have a car? (an extension of the previous one, will be interesting to see if/how this changes the relationship)

And finally, the big one, the question that addresses the hypothesis:
- is there a meaningful relationship in the data between the number of people who rely on public transport to get to work (normalised by houses without cars) and the number of services provided in that SA2 area?

Let's start with - which areas are more likely to encourage working from home? Let's analyse using the Team_Member SA2 breakdowns (because I want to save the most detailed analysis for the hypothesis question).
To answer this, creating a new temporary variable - percentage of people working from home out of the number who are employed.

In [22]:
wfh_index <- subset(accessible_data, select=c("SA2Code", "SA2Name", "Team_Member", "number_of_employees", "number_worked_from_home"))
wfh_index$percentage_wfh <- with(wfh_index, number_worked_from_home / number_of_employees)
wfh_index <- wfh_index[order(wfh_index$percentage_wfh),]
wfh_index

# write.csv(wfh_index,"wfh_index.csv", row.names = FALSE)

Unnamed: 0,SA2Code,SA2Name,Team_Member,number_of_employees,number_worked_from_home,percentage_wfh
252,311031314,Crestmead,Eric,4585,64,0.01395856
264,310031291,Leichhardt - One Mile,Eric,2578,43,0.01667960
282,311061336,Woodridge,Eric,3837,69,0.01798280
273,310031295,Riverview,Eric,829,15,0.01809409
243,311031317,Marsden,Eric,5219,95,0.01820272
235,310031285,Churchill - Yamanto,Charlie,3199,60,0.01875586
241,310031284,Bundamba,Eric,3640,70,0.01923077
262,311061330,Kingston (Qld.),Eric,3312,65,0.01962560
255,310031293,Raceview,Eric,6620,136,0.02054381
62,310011274,Inala - Richlands,Eric,5333,115,0.02156385


Now we can do our usual dance - filter by each of our SA2 areas and compare summary stats.

In [24]:
 summary(wfh_index$percentage_wfh)

# Note - to use this code block, only uncomment one of the indiv_data lines. Trying to use more than one will overwrite with 
# the last one and not give the desired result.

# indiv_data <- subset(wfh_index, Team_Member == "Eric") # only consider Eric's SA2 areas
# indiv_data <- subset(wfh_index, Team_Member == "Charlie") # only consider Charlie's SA2 areas
# indiv_data <- subset(wfh_index, Team_Member == "Dave") # only consider my SA2 areas
# indiv_data <- subset(wfh_index, Team_Member == "Kate") # only consider Kate's SA2 areas
# indiv_data <- subset(wfh_index, Team_Member == "Steff") # only consider Steff's SA2 areas

# Once you've selected a name, use the below to either summarise or export. (Or, y'know, whatever you want to do with it 
# from here.)

# write.csv(indiv_data,"indiv_data.csv", row.names = FALSE)
# summary(indiv_data$percentage_wfh)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01396 0.03682 0.04731 0.05163 0.06157 0.15765 

Percentages of working from home per SA2 area (the summary output is a fraction, so multiplied by 100 here):
- All SA2s - between 1.4% and 15.8% wfh, median is 4.7%, mean is 5.2%
- Eric's SA2s - between 1.4% and 11.2% wfh, median is 3.0%, mean is 3.7% (less than average)
- Charlie's SA2s - between 1.9% and 10.6% wfh, median is 4.7%, mean is 4.9% (slightly less than average)
- my SA2s - between 2.8% and 12.0% wfh, median is 4.9%, mean is 5.4% (right on average)
- Kate's SA2s - between 2.6% and 15.8% wfh, median is 4.6%, mean is 6.7% (slightly above average)
- Steff's SA2s - between 4.0% and 11.3% wfh, median is 5.5%, mean is 6.8% (above average)
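
As an aside, the comment/uncomment dance above could probably be replaced with a single grouped summary. Below is a minimal sketch using base R's `aggregate()`. The names `wfh_demo` and `wfh_by_member` are made up for illustration, and the four rows are just a toy subset copied from the tables in this notebook (Crestmead, Leichhardt - One Mile, Brisbane City, Newstead - Bowen Hills), not the full dataset.

```r
# Toy stand-in for wfh_index (assumption: only four rows, copied from the
# tables in this notebook, so the numbers will not match the full summaries).
wfh_demo <- data.frame(
  Team_Member = c("Eric", "Eric", "Steff", "Steff"),
  number_of_employees = c(4585, 2578, 5397, 6877),
  number_worked_from_home = c(64, 43, 254, 333)
)

# Same calculation as the notebook: a fraction between 0 and 1.
wfh_demo$percentage_wfh <- with(wfh_demo, number_worked_from_home / number_of_employees)

# One row per team member; the summary stats are converted to percentages
# and rounded to one decimal place.
wfh_by_member <- aggregate(
  percentage_wfh ~ Team_Member,
  data = wfh_demo,
  FUN = function(x) {
    round(100 * c(min = min(x), median = median(x), mean = mean(x), max = max(x)), 1)
  }
)
print(wfh_by_member)
```

Swapping `wfh_demo` for the real `wfh_index` should give all five team members in one table, which avoids the risk of leaving two `indiv_data` lines uncommented at once.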

In summary, there is a faint but discernible upward trend as SAD index scores improve - I would be comfortable saying that there is a relationship between areas that are relatively better-off socio-economically and the availability of work from home arrangements.

Next - is there a relationship between SAD index and people who only use PT to get to work? (And to save time, let's also do the legwork to answer the next one - is there a relationship between SAD index and people who only use PT to get to work, normalised by the percentage of houses that don't have a car?)

Let's create a temporary variable to help us find out!

In [25]:
head(accessible_data)

SA2Code,SA2Name,Score,Team_Member,number_of_employees,Pop,percent_unemployed,number_worked_from_home,pt_to_work,dwellings_with_no_cars,Total_dwelings,percentage_houses_no_cars,dwellings_with_no_cars.1,Total_dwelings.1,percentage_houses_no_cars.1
305011105,Brisbane City,1083,Steff,5397,10192,47.0467,254,724,1280,4044,31.65183,1280,4044,31.65183
305011111,Spring Hill,1028,Dave,3411,6063,43.74072,127,478,707,2197,32.18025,707,2197,32.18025
305011106,Fortitude Valley,1064,Kate,4539,7146,36.48195,139,900,973,3251,29.92925,973,3251,29.92925
305031128,Newstead - Bowen Hills,1132,Steff,6877,10638,35.35439,333,1421,735,4802,15.30612,735,4802,15.30612
304041098,Enoggera,1053,Kate,4684,8158,42.58397,152,698,341,3021,11.28765,341,3021,11.28765
305041135,Paddington - Milton,1137,Steff,6522,10788,39.54394,369,1065,396,4159,9.52152,396,4159,9.52152


In [26]:
pt_only_commuters <- subset(accessible_data, select=c("SA2Code", "SA2Name", "Team_Member", "pt_to_work", "percentage_houses_no_cars"))
head(pt_only_commuters)

SA2Code,SA2Name,Team_Member,pt_to_work,percentage_houses_no_cars
305011105,Brisbane City,Steff,724,31.65183
305011111,Spring Hill,Dave,478,32.18025
305011106,Fortitude Valley,Kate,900,29.92925
305031128,Newstead - Bowen Hills,Steff,1421,15.30612
304041098,Enoggera,Kate,698,11.28765
305041135,Paddington - Milton,Steff,1065,9.52152


Ok, like before, let's explore the relationship (if any) between areas that rely on PT (and no cars) to commute to work and each of our SA2 groupings. Let's _also_ do this comparison with the number of people relying on only PT for their commute normalised by the percentage of houses without cars in that area (to try to correct for areas where people are choosing to take PT, rather than having no practical option but to take PT). Let's create the normalised index first -

In [27]:
pt_only_commuters$normalised_pt_users <- with(pt_only_commuters, pt_to_work * (100 / percentage_houses_no_cars))
pt_only_commuters <- pt_only_commuters[order(pt_only_commuters$normalised_pt_users),]
head(pt_only_commuters)

Unnamed: 0,SA2Code,SA2Name,Team_Member,pt_to_work,percentage_houses_no_cars,normalised_pt_users
254,310021278,Esk,Eric,6,3.905091,153.6456
257,317011448,Gatton,Eric,17,7.704465,220.6513
295,319031512,Gympie - North,Eric,43,9.264006,464.1621
182,316021419,Caloundra - Kings Beach,Eric,56,11.716898,477.9422
161,309021230,Coolangatta,Charlie,55,11.343738,484.849
160,309081259,Clear Island Waters,Kate,24,4.795396,500.48


Now, comparisons filtering on Team_Member

In [28]:
#summary(pt_only_commuters$pt_to_work)
# pt_only_commuters = pt_only_commuters[!pt_only_commuters$SA2Code == "302031036",] # Bris airport has 0 houses without cars - breaks this calculation
#summary(pt_only_commuters$normalised_pt_users)

# Note - to use this code block, only uncomment one of the indiv_data lines. Trying to use more than one will overwrite with 
# the last one and not give the desired result.

# indiv_data <- subset(pt_only_commuters, Team_Member == "Eric") # only consider Eric's SA2 areas
# indiv_data <- subset(pt_only_commuters, Team_Member == "Charlie") # only consider Charlie's SA2 areas
# indiv_data <- subset(pt_only_commuters, Team_Member == "Dave") # only consider my SA2 areas
# indiv_data <- subset(pt_only_commuters, Team_Member == "Kate") # only consider Kate's SA2 areas
 indiv_data <- subset(pt_only_commuters, Team_Member == "Steff") # only consider Steff's SA2 areas

# Once you've selected a name, use the below to either summarise or export. (Or, y'know, whatever you want to do with it 
# from here.)

# write.csv(indiv_data,"indiv_data.csv", row.names = FALSE)
 summary(indiv_data$pt_to_work)
 summary(indiv_data$normalised_pt_users)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0   293.0   539.0   599.1   818.0  1724.0 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2287    6940    9326     Inf   18154     Inf 

Users per area who catch only public transport to work:
- All SA2s - between 4 and 1,724 users, median is 268, mean is 368
- Eric's SA2s - between 6 and 691 users, median is 167.5, mean is 199 (fewer PT-only commuters than average)
- Charlie's SA2s - between 15 and 1,181 users, median is 141, mean is 243 (fewer PT-only commuters than average)
- Dave's SA2s - between 20 and 1,070 users, median is 222, mean is 322 (fewer PT-only commuters than average)
- Kate's SA2s - between 24 and 1,698 users, median is 409, mean is 467.6 (more PT-only commuters than average)
- Steff's SA2s - between 99 and 1,724 users, median is 545.5, mean is 609.1 (many more PT-only commuters than average)

Users per area who catch only public transport to work, normalised by the percentage of houses without cars in that area:
- All SA2s - between 153.6 and 61,696.1 on the index, median is 5,284.2, mean is 8,418.8
- Eric's SA2s - between 153.6 and 14,133.3 on the index, median is 2,490.8, mean is 3,787.5 (fewer PT-only commuters than average once normalised)
- Charlie's SA2s - between 484.8 and 25,976 on the index, median is 3,919.2, mean is 5,712.7 (fewer PT-only commuters than average once normalised)
- Dave's SA2s - between 784.7 and 53,966.4 on the index, median is 6,584.9, mean is 9,110.5 (more PT-only commuters than average once normalised)
- Kate's SA2s - between 500.5 and 61,696.1 on the index, median is 7,496.2, mean is 11,257.1 (many more PT-only commuters than average once normalised)
- Steff's SA2s - between 2,287 and 58,004 on the index, median is 9,305, mean is 13,040 (many more PT-only commuters than average once normalised)

(NOTE - had to exclude Brisbane airport from the normalisation calculation for both ALL SA2s and Steff's SA2s, as it was the only SA2 area with 0 houses without cars. Dividing by 0% gives Inf, which is why the summary output above shows a mean and max of Inf, so it doesn't make sense to apply this normalisation there. The code used for the exclusion is in the block above, commented out for reference. On reflection, a simpler solution would have been to add +1 to all houses without cars as part of the normalisation formula. Oh well!)
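
For what it's worth, here is a rough sketch of another way to handle the divide-by-zero case without dropping the row by hand: mark the index as `NA` wherever the percentage of houses without cars is 0, then let `summary()` and `mean(..., na.rm = TRUE)` skip it. The frame `pt_demo` is a made-up name, the first two rows are copied from the output above (Esk and Gatton), and the third row is a hypothetical area standing in for the zero case.

```r
# Toy stand-in for pt_only_commuters (assumption: three rows only; the third
# row is hypothetical and exists just to trigger the divide-by-zero case).
pt_demo <- data.frame(
  SA2Name = c("Esk", "Gatton", "Hypothetical area with no car-free houses"),
  pt_to_work = c(6, 17, 4),
  percentage_houses_no_cars = c(3.905091, 7.704465, 0)
)

# Same formula as the notebook, but undefined cases become NA instead of Inf.
pt_demo$normalised_pt_users <- with(
  pt_demo,
  ifelse(percentage_houses_no_cars > 0,
         pt_to_work * (100 / percentage_houses_no_cars),
         NA_real_)
)

# summary() reports the NA count separately rather than poisoning the mean.
print(summary(pt_demo$normalised_pt_users))
mean_index <- mean(pt_demo$normalised_pt_users, na.rm = TRUE)
print(mean_index)
```

This keeps the area in the dataset (so its raw `pt_to_work` count still shows up elsewhere) while keeping `Inf` out of the summary stats. Whether `NA` or the "+1" adjustment is the better choice is a judgment call.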

So what does this show us? In both raw and normalised figures, there is a clear relationship between commuters' reliance on PT and SAD index scores - but this by itself doesn't give us an answer to our hypothesis, as it's only half the equation! We have now quantified each area's reliance on public transport, both in absolute numbers and relative to the share of houses without cars. What we don't have yet is how the public transport system responds to this situation - how accessible are PT services to the people who rely on them?

To answer this question, we need to know the number of public transport services available in each SA2 area, which would normally mean it's time to combine with dataset 1A. In this particular case, we have an ace up our sleeves - understanding the number of transport services per SA2 area was a large part of the analysis we performed last night, in our "Hypothesis - public transport serves people of relative wealth" notebook (available here - https://drive.google.com/open?id=10TCZ_6lgOCP4pTtrrSl6kitZ6_2VUcnt ). As a quick refresher, that notebook got us to the point where we could visualise something like this - https://drive.google.com/open?id=1Q1B-gHk0EFjjLjP23GlxhM2GGZhA6Eag

So let us leverage our prior good work and import the workbook from last night to use.

In [29]:
pt_services <- read_csv("pt_services_by_SA2_v2.csv", col_types = cols(SA2Name = col_skip(), Team_Member = col_skip()))
head(pt_services)

SA2Code,Score,n
112031254,933,4467
301011001,987,3486
301011002,1093,3147
301011003,1034,2881
301011004,991,4788
301011005,983,1519


Sweet, sweet reuse. Let's join! Or why not, let's merge!

In [30]:
final_ultra_data <- merge(pt_services, pt_only_commuters, by = "SA2Code")
head(final_ultra_data)

SA2Code,Score,n,SA2Name,Team_Member,pt_to_work,percentage_houses_no_cars,normalised_pt_users
301011001,987,3486,Alexandra Hills,Charlie,343,3.594946,9541.173
301011002,1093,3147,Belmont - Gumdale,Steff,211,1.783384,11831.439
301011003,1034,2881,Birkdale,Dave,431,3.339921,12904.497
301011004,991,4788,Capalaba,Charlie,344,4.860134,7077.994
301011005,983,1519,Thorneside,Charlie,141,3.986928,3536.557
301011006,1062,2062,Wellington Point,Kate,357,2.231237,16000.091
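
One thing worth a quick check here: `merge()` with `by = "SA2Code"` does an inner join by default, so any SA2 that appears in only one of the two frames is silently dropped (for instance, 112031254 is the first row of `pt_services` above but doesn't appear at the top of the merged output). A small sketch of how that could be checked, using made-up frames `services_demo` and `commuters_demo` with a few codes copied from the outputs above:

```r
# Toy stand-ins for pt_services and pt_only_commuters (assumption: the choice
# of which codes appear in which frame is for illustration only).
services_demo <- data.frame(
  SA2Code = c(112031254, 301011001, 301011002),
  n = c(4467, 3486, 3147)
)
commuters_demo <- data.frame(
  SA2Code = c(301011001, 301011002, 301011003),
  pt_to_work = c(343, 211, 431)
)

# Default merge() keeps only the codes present in both frames.
merged_demo <- merge(services_demo, commuters_demo, by = "SA2Code")

# Codes that fell out of the join, from either side.
only_in_services <- setdiff(services_demo$SA2Code, commuters_demo$SA2Code)
only_in_commuters <- setdiff(commuters_demo$SA2Code, services_demo$SA2Code)

print(nrow(merged_demo))
print(only_in_services)
print(only_in_commuters)
```

If it matters, `merge(..., all = TRUE)` would keep the unmatched rows with `NA` in the missing columns instead of dropping them.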


And this is the part where there are a few things that I'd normally do.

- First is to export this dataset for import into Tableau, with a view to plotting the count of services per area vs the number of PT-only commuters AND the normalised index of the same. If it looks like there's a clear pattern, great - write it up and assess the original hypothesis.
- If the visual EDA doesn't show a clear pattern, but hints that there is a pattern to be found, I'd come back to R Studio/Jupyter notebooks, filter the data by Team_Member and start doing comparisons there. The extra level of aggregation might reveal a clearer pattern that is otherwise obscured by outliers at the SA2 level. If that turns up a pattern - great, write it up as a weaker relationship (but still a relationship) and assess the original hypothesis.
- If there's no pattern at that point - see what it does reveal to see if there's anything else I'd like to explore further - possibly there is something that hints at another hypothesis we can test.
- If not that either - write it up as hypothesis unproven. Not a bad outcome and not a failure - we now know something else about the world as revealed by this data (just not something that we can action in any meaningful way).

With that captured, let's get doin' in Tableau. Exporting the working dataset and will post the resulting plot in a sec...

(output file is here if you need - https://drive.google.com/open?id=1vcz9tog6nuCgdgThkYAxPROZXVRIAKTa ) As you can probably guess, it's Project - Analysis - Dave - Dave Output.

In [31]:
 # write.csv(final_ultra_data,"final_ultra_data.csv", row.names = FALSE)

And the results are in... and are the most interesting set of results yet!

Ok, resulting visualisation can be seen at https://drive.google.com/open?id=1A3nHNnGSvVI2594CGCkLPqBkD7SRNWv9

Please have it open before proceeding, coz it's a little nuts! On the left, we have a simple scatter plot of the number of passenger transport services vs. the number of people who rely on PT only to commute to work. To assess the hypothesis of whether public transport is easily accessible to people without cars (and following the leaps in logic and assumptions documented in the first markdown box about "without cars" vs "have a car but choose not to use it to get to work"), I would expect to see the number of services increasing as the number of people who need to rely on them increases. This graph's trend line suggests that this exact relationship exists, albeit with a gigantic amount of noise. If the generated trend line wasn't there, I don't think I would have been able to discern this relationship visually.

_Can someone with a better grasp on statistical mathematics suggest some ways of quantifying this result?_
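
One possible starting point for quantifying it (a sketch only, not a definitive answer): a rank correlation between the service count and each commuter measure via `cor.test()`, plus the R-squared of a simple linear fit as a rough "how noisy is the trend line" number. The frame `demo` below is a made-up name holding just the six rows shown in the `head(final_ultra_data)` output above, so the numbers it produces say nothing about the full dataset.

```r
# Toy stand-in for final_ultra_data (assumption: only the six rows visible in
# the head() output above; swap in the full frame for a real answer).
demo <- data.frame(
  n = c(3486, 3147, 2881, 4788, 1519, 2062),
  pt_to_work = c(343, 211, 431, 344, 141, 357),
  normalised_pt_users = c(9541.173, 11831.439, 12904.497, 7077.994, 3536.557, 16000.091)
)

# Spearman rank correlation: less sensitive to outliers than Pearson, and only
# assumes a monotonic (not strictly linear) relationship.
raw_test <- cor.test(demo$n, demo$pt_to_work, method = "spearman")
norm_test <- cor.test(demo$n, demo$normalised_pt_users, method = "spearman")

# R-squared of a straight-line fit: the share of variance in service counts
# that the trend line accounts for.
raw_fit <- lm(n ~ pt_to_work, data = demo)
raw_r2 <- summary(raw_fit)$r.squared

print(raw_test$estimate)   # rho for raw PT-only commuters
print(raw_test$p.value)
print(norm_test$estimate)  # rho for the normalised index
print(norm_test$p.value)
print(raw_r2)
```

A rho near 0 with a large p-value for the normalised index would be consistent with the "almost entirely horizontal" trend line described below, while a clearly positive rho for the raw counts would back up the left-hand plot.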

Of more interest to me personally is the graph on the right, which attempts to use normalisation to distinguish between _people who have cars but choose not to use them to commute_ and _people who rely on PT to commute because they don't have a car_. You can see the difference this normalisation is having on the trend line, which is almost entirely horizontal. 

My stab at interpreting what this means - at a whole-of-system level, there is a relationship between people who rely solely on PT and the availability of PT services. The second graph implies something about this relationship - it is _more likely_ that people are reliant on PT to commute in areas where there are more PT services available, but _not the other way around_. We can see that the number of services is almost entirely static regardless of how badly people NEED public transport to get to work (i.e. not having a car appears not to have been a factor when deciding which services should run, and how frequently).

My final takeaway is that we have possibly proven two hypotheses, including one that is very, very close to our original hypothesis:
- hypothesis - the accessibility of public transport is a factor in people deciding how they want to commute - proven
- hypothesis - public transport is _as easily_ accessible to people without cars as to people with cars - proven

Open to other ways of interpreting these results. I'm feeling a little out of my comfort zone, but would happily stand by the block of text above in the absence of better ideas =p

### Peace