## Liveability Model
How liveable is a suburb, really?

We used our liveability metrics of: 
- proximity to the city
- groceries access
- healthcare access
- education access
- affordability 

All these liveability metrics are numerical data on a scale of 1 - 10 for quanitifable anaylsis and interpretability

In [3]:
# Import necessary libraries 
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col, desc


# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("Liveability")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '4g')
    .config('spark.executor.memory', '2g')
    .getOrCreate()
)

your 131072x1 screen size is bogus. expect trouble
24/10/11 01:42:32 WARN Utils: Your hostname, DESKTOP-Q5SP5SI resolves to a loopback address: 127.0.1.1; using 172.20.36.110 instead (on interface eth0)
24/10/11 01:42:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/11 01:42:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


The dataframe containing all the liveability metrics is imported

In [4]:
sdf = spark.read.parquet('../data/landing/suburb_level_data.parquet', header=True, inferSchema=True)
sdf.show()

                                                                                

+--------+-----------------------+--------------------+----------+---------+------------------------+--------------------+-----------+---------------------+--------------------+------------------+------------------+------------------+
|postcode|total population - 2021|              suburb|  Latitude|Longitude|distance_to_melbourne_km|   school_per_capita| bed_column|healthcare_per_capita|groceries_per_capita|           all_RAI|       1-2_Bed_RAI|        3+_Bed_RAI|
+--------+-----------------------+--------------------+----------+---------+------------------------+--------------------+-----------+---------------------+--------------------+------------------+------------------+------------------+
|    3175|                  53545|     dandenong-north| -38.01917|145.21487|       31.78522659549726|6.723316836305911E-4|1-2_bedders| 3.548417219161453E-4|0.001045849285647586|208.30188679245282|  275.070707070707|189.98261219156743|
|    3127|                  18608|         mont-albert|  -37

### On a scale of 1 - 10... 
All the data was placed on a scale of 1-10. Unfortunately, we could not find any data on what number of schools, groceries, healthcare or education per capita was ideal. We could, however, compare how well equipped one suburb was compared to other suburbs in Victoria. Hence, for all liveability variables, we bucketised each of the per capita data by finding each 10th percentile and assigning a respective score out of 10 for it. For instance if the groceries per capita of a particular suburb is in the top 10th percentile, the suburb would receive a 10/10 score. 

In [5]:
######## Scale of 1 - 10 for Schools ########

# Calculate quantile cut points
quantiles = sdf.approxQuantile("school_per_capita", [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 0.0)

quantiles = sorted(set(quantiles))

# Add min and max to make splits
splits = [-float('inf')] + quantiles + [float('inf')]

# Create the Bucketizer
bucketizer = Bucketizer(
    splits=splits,
    inputCol="school_per_capita",
    outputCol="school_per_capita_score"
)


######## Scale of 1 - 10 for Groceries ########
quantiles2 = sdf.approxQuantile("groceries_per_capita", [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 0.0)
# Add min and max to make splits
splits2 = [-float('inf')] + quantiles2 + [float('inf')]

# Create the Bucketizer
bucketizer2 = Bucketizer(
    splits=splits,
    inputCol="groceries_per_capita",
    outputCol="groceries_per_capita_score"
)


######## Scale of 1 - 10 for Healthcare ########
quantiles3 = sdf.approxQuantile("healthcare_per_capita", [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 0.0)
# Add min and max to make splits
splits3 = [-float('inf')] + quantiles3 + [float('inf')]

# Create the Bucketizer
bucketizer3 = Bucketizer(
    splits=splits,
    inputCol="healthcare_per_capita",
    outputCol="healthcare_per_capita_score"
)

# Transform the DataFrame to include all 3 caregorical metric
sdf_buckets = bucketizer.transform(sdf)
sdf_buckets2 = bucketizer2.transform(sdf_buckets)
sdf_buckets3 = bucketizer3.transform(sdf_buckets2)

sdf_buckets3.show()

                                                                                

+--------+-----------------------+--------------------+----------+---------+------------------------+--------------------+-----------+---------------------+--------------------+------------------+------------------+------------------+-----------------------+--------------------------+---------------------------+
|postcode|total population - 2021|              suburb|  Latitude|Longitude|distance_to_melbourne_km|   school_per_capita| bed_column|healthcare_per_capita|groceries_per_capita|           all_RAI|       1-2_Bed_RAI|        3+_Bed_RAI|school_per_capita_score|groceries_per_capita_score|healthcare_per_capita_score|
+--------+-----------------------+--------------------+----------+---------+------------------------+--------------------+-----------+---------------------+--------------------+------------------+------------------+------------------+-----------------------+--------------------------+---------------------------+
|    3175|                  53545|     dandenong-north| -3

In [6]:
######## Scale of 1 - 10 for Proximity to the City ########
sdf_buckets4 = sdf_buckets3 .withColumn(
    "distance_score",
    F.when((F.col("distance_to_melbourne_km") <= 5), 10)
    .when((F.col("distance_to_melbourne_km") > 5) & (F.col("distance_to_melbourne_km") <= 10),9 )
    .when((F.col("distance_to_melbourne_km") > 10) & (F.col("distance_to_melbourne_km") <= 15), 8)
    .when((F.col("distance_to_melbourne_km") > 15) & (F.col("distance_to_melbourne_km") <= 20), 7)
    .when((F.col("distance_to_melbourne_km") > 20) & (F.col("distance_to_melbourne_km") <= 25), 6)
    .when((F.col("distance_to_melbourne_km") > 25) & (F.col("distance_to_melbourne_km") <= 30), 5)
    .when((F.col("distance_to_melbourne_km") > 30) & (F.col("distance_to_melbourne_km") <= 35), 4)
    .when((F.col("distance_to_melbourne_km") > 35) & (F.col("distance_to_melbourne_km") <= 40), 3)
    .when((F.col("distance_to_melbourne_km") > 40) & (F.col("distance_to_melbourne_km") <= 45), 2)
    .when((F.col("distance_to_melbourne_km") > 50) , 1),
    )
sdf_buckets4

postcode,total population - 2021,suburb,Latitude,Longitude,distance_to_melbourne_km,school_per_capita,bed_column,healthcare_per_capita,groceries_per_capita,all_RAI,1-2_Bed_RAI,3+_Bed_RAI,school_per_capita_score,groceries_per_capita_score,healthcare_per_capita_score,distance_score
3175,53545,dandenong-north,-38.01917,145.21487,31.78522659549726,6.723316836305911E-4,1-2_bedders,3.548417219161453E-4,0.001045849285647586,208.3018867924528,275.070707070707,189.98261219156743,4.0,7.0,0.0,4.0
3127,18608,mont-albert,-37.8259,145.09897,12.012507597259846,8.598452278589854E-4,3+_bedders,4.299226139294927E-4,0.001021066208082545,178.06451612903226,224.87734487734485,162.7308327435716,6.0,7.0,1.0,8.0
3215,21994,north-geelong,-38.108315,144.33578,64.01946054205368,5.001364008365918E-4,3+_bedders,9.09338910611985E-5,7.729380740201874E-4,212.30769230769232,278.87432464176646,195.6873735701938,2.0,5.0,0.0,1.0
3043,17912,gowanbrae,-37.704422,144.87862,14.2316617369889,7.257704332291202E-4,3+_bedders,2.233139794551138...,0.001116569897275...,240.0,300.0418118466899,228.11695906432743,5.0,7.0,0.0,8.0
3550,41839,long-gully,-36.766586,144.29208,130.68153865489094,6.453309113506537E-4,3+_bedders,2.151103037835512...,0.001003848084323239,245.3333333333333,299.3843762145356,225.3535353535353,4.0,7.0,0.0,1.0
3350,66022,lake-wendouree,-37.569107,143.85632,101.1083976660203,3.786616582351337...,1-2_bedders,1.363181969646481...,5.60419254187998E-4,175.23809523809524,233.2525252525253,162.53672942045034,1.0,3.0,0.0,1.0
3220,17270,newtown,-38.15568,144.35219,65.67705759041552,0.001331789229878...,3+_bedders,7.527504342790967E-4,0.001852924145917...,230.0,294.61904761904765,226.98474102729423,8.0,8.0,5.0,1.0
3350,66022,golden-point,-37.569107,143.85632,101.1083976660203,3.786616582351337...,1-2_bedders,1.363181969646481...,5.60419254187998E-4,290.5263157894737,332.95238095238096,254.7658109332528,1.0,3.0,0.0,1.0
3630,32151,shepparton,-36.461567,145.558,159.31845513814582,0.001119716338527573,3+_bedders,5.909614008895524E-4,0.001741780971042...,324.7058823529412,401.2952380952381,292.0352460777993,7.0,8.0,3.0,1.0
3156,38484,upper-ferntree-gully,-37.936802,145.30328,32.85121754048439,5.196964972456086E-4,3+_bedders,2.078785988982434...,7.535599210061325E-4,234.89361702127655,268.6501377410469,199.70090405366,2.0,5.0,0.0,4.0


For afforability, the RAI (rent affordability index) was used to classify the affordability of each particular type of property across all suburbs. This metric was based on another paper online: *https://sgsep.com.au/projects/rental-affordability-index*

In [7]:
# Across all types of properties
sdf_buckets5 = sdf_buckets4.withColumn(
    "all_RAI_score",
    F.when((F.col("all_RAI") <= 50), 0)
    .when((F.col("all_RAI") > 50) & (F.col("all_RAI") <= 75), 1) 
    .when((F.col("all_RAI") > 75) & (F.col("all_RAI") <= 100), 2)
    .when((F.col("all_RAI") > 100) & (F.col("all_RAI") <= 115), 3)
    .when((F.col("all_RAI") > 115) & (F.col("all_RAI") <= 130), 4)
    .when((F.col("all_RAI") > 130) & (F.col("all_RAI") <= 145), 5)
    .when((F.col("all_RAI") > 145) & (F.col("all_RAI") <= 160), 6)
    .when((F.col("all_RAI") > 160) & (F.col("all_RAI") <= 175), 7)
    .when((F.col("all_RAI") > 150) & (F.col("all_RAI") <= 175), 8)
    .when((F.col("all_RAI") > 175) & (F.col("all_RAI") <= 200), 9)
    .when((F.col("all_RAI") > 200), 10),
    )

# Across 1 - 2 bedroom properties
sdf_buckets6 = sdf_buckets5.withColumn(
    "1-2_Bed_RAI_score",
    F.when((F.col("1-2_Bed_RAI") <= 50), 0) 
    .when((F.col("1-2_Bed_RAI") > 50) & (F.col("1-2_Bed_RAI") <= 75), 1) 
    .when((F.col("1-2_Bed_RAI") > 75) & (F.col("1-2_Bed_RAI") <= 100), 2) 
    .when((F.col("1-2_Bed_RAI") > 100) & (F.col("1-2_Bed_RAI") <= 115), 3) 
    .when((F.col("1-2_Bed_RAI") > 115) & (F.col("1-2_Bed_RAI") <= 130), 4) 
    .when((F.col("1-2_Bed_RAI") > 130) & (F.col("1-2_Bed_RAI") <= 145), 5) 
    .when((F.col("1-2_Bed_RAI") > 145) & (F.col("1-2_Bed_RAI") <= 160), 6) 
    .when((F.col("1-2_Bed_RAI") > 160) & (F.col("1-2_Bed_RAI") <= 175), 7) 
    .when((F.col("1-2_Bed_RAI") > 150) & (F.col("1-2_Bed_RAI") <= 175), 8) 
    .when((F.col("1-2_Bed_RAI") > 175) & (F.col("1-2_Bed_RAI") <= 200), 9) 
    .when((F.col("1-2_Bed_RAI") > 200),10),
    )

# Across 3+ bedroom properties
sdf_buckets7 = sdf_buckets6.withColumn(
    "3+_Bed_RAI_score",
    F.when((F.col("3+_Bed_RAI") <= 50), 0) 
    .when((F.col("3+_Bed_RAI") > 50) & (F.col("3+_Bed_RAI") <= 75), 1) 
    .when((F.col("3+_Bed_RAI") > 75) & (F.col("3+_Bed_RAI") <= 100), 2) 
    .when((F.col("3+_Bed_RAI") > 100) & (F.col("3+_Bed_RAI") <= 115), 3) 
    .when((F.col("3+_Bed_RAI") > 115) & (F.col("3+_Bed_RAI") <= 130), 4) 
    .when((F.col("3+_Bed_RAI") > 130) & (F.col("3+_Bed_RAI") <= 145), 5) 
    .when((F.col("3+_Bed_RAI") > 145) & (F.col("3+_Bed_RAI") <= 160), 6) 
    .when((F.col("3+_Bed_RAI") > 160) & (F.col("3+_Bed_RAI") <= 175), 7) 
    .when((F.col("3+_Bed_RAI") > 150) & (F.col("3+_Bed_RAI") <= 175), 8) 
    .when((F.col("3+_Bed_RAI") > 175) & (F.col("3+_Bed_RAI") <= 200), 9) 
    .when((F.col("3+_Bed_RAI") > 200), 10)
    )
sdf_buckets7

postcode,total population - 2021,suburb,Latitude,Longitude,distance_to_melbourne_km,school_per_capita,bed_column,healthcare_per_capita,groceries_per_capita,all_RAI,1-2_Bed_RAI,3+_Bed_RAI,school_per_capita_score,groceries_per_capita_score,healthcare_per_capita_score,distance_score,all_RAI_score,1-2_Bed_RAI_score,3+_Bed_RAI_score
3175,53545,dandenong-north,-38.01917,145.21487,31.78522659549726,6.723316836305911E-4,1-2_bedders,3.548417219161453E-4,0.001045849285647586,208.3018867924528,275.070707070707,189.98261219156743,4.0,7.0,0.0,4.0,10,1,2
3127,18608,mont-albert,-37.8259,145.09897,12.012507597259846,8.598452278589854E-4,3+_bedders,4.299226139294927E-4,0.001021066208082545,178.06451612903226,224.87734487734485,162.7308327435716,6.0,7.0,1.0,8.0,9,1,2
3215,21994,north-geelong,-38.108315,144.33578,64.01946054205368,5.001364008365918E-4,3+_bedders,9.09338910611985E-5,7.729380740201874E-4,212.30769230769232,278.87432464176646,195.6873735701938,2.0,5.0,0.0,1.0,10,1,2
3043,17912,gowanbrae,-37.704422,144.87862,14.2316617369889,7.257704332291202E-4,3+_bedders,2.233139794551138...,0.001116569897275...,240.0,300.0418118466899,228.11695906432743,5.0,7.0,0.0,8.0,10,1,1
3550,41839,long-gully,-36.766586,144.29208,130.68153865489094,6.453309113506537E-4,3+_bedders,2.151103037835512...,0.001003848084323239,245.3333333333333,299.3843762145356,225.3535353535353,4.0,7.0,0.0,1.0,10,1,1
3350,66022,lake-wendouree,-37.569107,143.85632,101.1083976660203,3.786616582351337...,1-2_bedders,1.363181969646481...,5.60419254187998E-4,175.23809523809524,233.2525252525253,162.53672942045034,1.0,3.0,0.0,1.0,9,1,2
3220,17270,newtown,-38.15568,144.35219,65.67705759041552,0.001331789229878...,3+_bedders,7.527504342790967E-4,0.001852924145917...,230.0,294.61904761904765,226.98474102729423,8.0,8.0,5.0,1.0,10,1,1
3350,66022,golden-point,-37.569107,143.85632,101.1083976660203,3.786616582351337...,1-2_bedders,1.363181969646481...,5.60419254187998E-4,290.5263157894737,332.95238095238096,254.7658109332528,1.0,3.0,0.0,1.0,10,1,1
3630,32151,shepparton,-36.461567,145.558,159.31845513814582,0.001119716338527573,3+_bedders,5.909614008895524E-4,0.001741780971042...,324.7058823529412,401.2952380952381,292.0352460777993,7.0,8.0,3.0,1.0,10,1,1
3156,38484,upper-ferntree-gully,-37.936802,145.30328,32.85121754048439,5.196964972456086E-4,3+_bedders,2.078785988982434...,7.535599210061325E-4,234.89361702127655,268.6501377410469,199.70090405366,2.0,5.0,0.0,4.0,10,1,2


We then split the dataset according to the property types. Duplicates and unnecessary columns were also removed.

In [8]:

sdf1 = sdf_buckets7.filter(sdf['bed_column'] == '1-2_bedders')
sdf1 = sdf1.distinct()
sdf2 = sdf_buckets7.filter(sdf['bed_column'] == '3+_bedders')
sdf2 = sdf2.distinct()
sdf_1a = sdf1.drop(sdf1['3+_Bed_RAI'], sdf1['bed_column'], sdf1['groceries_per_capita'], sdf1['school_per_capita'], sdf1['healthcare_per_capita'],
        sdf1['total population - 2021'], sdf1['Latitude'], sdf1['Longitude'], sdf1['distance_to_melbourne_km']) 
sdf_2a = sdf2.drop(sdf2['1-2_Bed_RAI'], sdf2['bed_column'], sdf2['groceries_per_capita'], sdf2['school_per_capita'], sdf2['healthcare_per_capita'],
        sdf2['total population - 2021'], sdf2['Latitude'], sdf2['Longitude'], sdf2['distance_to_melbourne_km'])
sdf_1a

                                                                                

postcode,suburb,all_RAI,1-2_Bed_RAI,school_per_capita_score,groceries_per_capita_score,healthcare_per_capita_score,distance_score,all_RAI_score,1-2_Bed_RAI_score,3+_Bed_RAI_score
3020,sunshine-north,225.3061224489796,224.81333661821463,5.0,7.0,0.0,8,10,1,2
3108,doncaster,162.35294117647058,219.45609945609945,1.0,5.0,0.0,8,7,1,3
3550,white-hills,245.3333333333333,276.3982683982684,4.0,7.0,0.0,1,10,1,1
3350,alfredton,175.23809523809524,233.2525252525253,1.0,3.0,0.0,1,9,1,2
3020,sunshine,225.3061224489796,224.81333661821463,5.0,7.0,0.0,8,10,1,2
3155,boronia,184.0,213.7677337677338,5.0,8.0,1.0,5,9,1,3
3175,dandenong,210.28571428571428,265.77777777777777,4.0,7.0,0.0,4,10,1,2
3186,brighton,187.11864406779665,220.24242424242425,7.0,8.0,3.0,8,9,1,2
3043,tullamarine,240.0,300.0418118466899,5.0,7.0,0.0,8,10,1,1
3450,castlemaine,262.85714285714283,369.9913419913421,8.0,8.0,7.0,1,10,1,1


### Liveability without weightages 
Firstly we designed a liveability index that weighted all variables equally. As the total weightage adds up to 10 , we assigned  a weightage of 2 for each variable

In [None]:
df_all = sdf_1a.select(
    col("postcode"),
    col("suburb"),
    col("all_RAI_score"),
    col("school_per_capita_score"),
    col("groceries_per_capita_score"),
    col("healthcare_per_capita_score"),
    col("distance_score"),
    (2*col("all_RAI_score") + 2*col("school_per_capita_score")+  2*col("groceries_per_capita_score")
    +  2*col("healthcare_per_capita_score")+  2*col("distance_score")).alias("liveablity_score_all")
)
df_all = df_all.drop(df_all['all_RAI_score'],df_all['school_per_capita_score'], df_all['groceries_per_capita_score'],
        df_all['healthcare_per_capita_score'],df_all['distance_score'])



### Liveability: 1 - 2 Bedrooms 
We designed a liveability index that weighted variables for 1 - 2 Bedroom properties according to what we think a 20-something would prioritise.
- 40%: affordability
- 20%: groceries
- 20%: proximity to the city
- 10%: education
- 10%: healthcare

In [None]:
df_1_2B = sdf_1a.select(
    col("postcode"),
    col("suburb"),
    col("1-2_Bed_RAI_score"),
    col("school_per_capita_score"),
    col("groceries_per_capita_score"),
    col("healthcare_per_capita_score"),
    col("distance_score"),
    (4*col("1-2_Bed_RAI_score") +  1*col("school_per_capita_score")+  2*col("groceries_per_capita_score")
    +  1*col("healthcare_per_capita_score")+  2*col("distance_score")).alias("liveablity_score_1_2Bedder")
)

df_1_2B = df_1_2B.drop(df_1_2B['1-2_Bed_RAI_score'],df_1_2B['school_per_capita_score'], df_1_2B['groceries_per_capita_score'],
        df_1_2B['healthcare_per_capita_score'],df_1_2B['distance_score'])

df_1_2B

### Liveability: 3+ Bedrooms 
We designed a liveability index that weighted variables for 3+ Bedroom properties according to what we think families would prioritise.
- 40%: affordability
- 20%: groceries
- 20%: education
- 10%: proximity to the city
- 10%: healthcare

In [10]:

df_3B = sdf_2a.select(
    col("postcode"),
    col("suburb"),
    col("3+_Bed_RAI_score"),
    col("school_per_capita_score"),
    col("groceries_per_capita_score"),
    col("healthcare_per_capita_score"),
    col("distance_score"),
    (4*col("3+_Bed_RAI_score") +  2*col("school_per_capita_score")+ 2*col("groceries_per_capita_score")
    +  1*col("healthcare_per_capita_score")+ 1*col("distance_score")).alias("liveablity_score_3+Bedder")
)

df_3B = df_3B.drop(df_3B['3+_Bed_RAI_score'],df_3B['school_per_capita_score'], df_3B['groceries_per_capita_score'],
        df_3B['healthcare_per_capita_score'],df_3B['distance_score'])

Finally, we collated all of our findings in one dataframe

In [12]:
liveability_sdf = df_all.join(df_1_2B, on=['postcode', 'suburb'], how='outer')
liveability_sdf1 = liveability_sdf.join(df_3B, on=['postcode', 'suburb'], how='outer')
liveability_sdf1 = liveability_sdf1.distinct()  
liveability_sdf2 = liveability_sdf1.dropna(subset=['postcode', 'suburb'])
liveability_sdf2 = liveability_sdf1.dropDuplicates(subset=['postcode', 'suburb'])
liveability_sdf2

postcode,suburb,liveablity_score_all,liveablity_score_1_2Bedder,liveablity_score_3+Bedder
3127,surrey-hills,62.0,41.0,43.0
3205,south-melbourne,72.0,53.0,54.0
3215,rippleside,36.0,18.0,23.0
3047,dallas,,,48.0
3016,williamstown,68.0,45.0,43.0
3171,springvale,74.0,45.0,47.0
3199,frankston-south,46.0,26.0,28.0
3844,callignee,54.0,,
3204,ormond,50.0,32.0,30.0
3226,ocean-grove,56.0,31.0,38.0


### Most Liveable Suburbs Across All Property Types

In [13]:
sorted_df_all = df_all.orderBy(desc("liveablity_score_all"))
sorted_df_all.show()

+--------+---------------+--------------------+
|postcode|         suburb|liveablity_score_all|
+--------+---------------+--------------------+
|    3002| east-melbourne|                86.0|
|    3047|   broadmeadows|                76.0|
|    3168|        clayton|                76.0|
|    3168|   notting-hill|                76.0|
|    3039|   moonee-ponds|                76.0|
|    3011|      footscray|                74.0|
|    3144|        kooyong|                74.0|
|    3011|         seddon|                74.0|
|    3171|     springvale|                74.0|
|    3205|south-melbourne|                72.0|
|    3300|       hamilton|                70.0|
|    3181|        windsor|                70.0|
|    3181|   prahran-east|                70.0|
|    3181|        prahran|                70.0|
|    3186|       brighton|                70.0|
|    3450|    castlemaine|                68.0|
|    3016|   williamstown|                68.0|
|    3585|      swan-hill|              

### Most Liveable Suburbs for 1 - 2 Bedrooms

In [14]:
sorted_df_1_2B = df_1_2B.orderBy(desc('liveablity_score_1_2Bedder'))
sorted_df_1_2B.show()

+--------+---------------+--------------------------+
|postcode|         suburb|liveablity_score_1_2Bedder|
+--------+---------------+--------------------------+
|    3002| east-melbourne|                      60.0|
|    3205|south-melbourne|                      53.0|
|    3121|       richmond|                      52.0|
|    3121|        burnley|                      52.0|
|    3121|       cremorne|                      52.0|
|    3181|        windsor|                      51.0|
|    3181|   prahran-east|                      51.0|
|    3181|        prahran|                      51.0|
|    3031|     flemington|                      50.0|
|    3031|     kensington|                      50.0|
|    3011|      footscray|                      49.0|
|    3144|        kooyong|                      49.0|
|    3039|   moonee-ponds|                      49.0|
|    3011|         seddon|                      49.0|
|    3168|   notting-hill|                      48.0|
|    3168|        clayton|  

### Most Liveable Suburb for 3+ Bedrooms

In [15]:
sorted_df_3B = df_3B.orderBy(desc('liveablity_score_3+Bedder'))
sorted_df_3B.show()

+--------+---------------+-------------------------+
|postcode|         suburb|liveablity_score_3+Bedder|
+--------+---------------+-------------------------+
|    3168|        clayton|                     57.0|
|    3168|   notting-hill|                     57.0|
|    3011|      footscray|                     56.0|
|    3011|         seddon|                     56.0|
|    3144|        kooyong|                     55.0|
|    3205|south-melbourne|                     54.0|
|    3181|        windsor|                     52.0|
|    3181|        prahran|                     52.0|
|    3121|        burnley|                     52.0|
|    3121|       richmond|                     52.0|
|    3039|   moonee-ponds|                     51.0|
|    3185|      ripponlea|                     50.0|
|    3194|        mentone|                     50.0|
|    3185|    elsternwick|                     50.0|
|    3031|     kensington|                     50.0|
|    3031|     flemington|                    

## Affordability ~ $$$ ~ 
How affordable is a suburb, really? Although we had used the RAI to calculate the liveability of suburbs, we have decided to better visualise affordability by comparing the median rent of each suburb with the median household income of Victorians. Affordable rent is rent within 30% of the household income. The equation for this is (100 / RAI)

In [16]:
percentage_sdf = sdf_buckets7.select(
    col("postcode"),
    col("suburb"),
    col("all_RAI"),
    col("1-2_Bed_RAI"),
    col("3+_Bed_RAI"),
    ((100/col("all_RAI")).alias("Percentage Income : All")),
    ((100/col("1-2_Bed_RAI")).alias("Percentage Income : 1 - 2 Bedders")),
    ((100/col("3+_Bed_RAI")).alias("Percentage Income : 3 Bedders"))
)

percentage_sdf = percentage_sdf.dropDuplicates(['suburb'])
percentage_sdf


# df_3B = sdf_2a.select(
#     col("postcode"),
#     col("suburb"),
#     col("3+_Bed_RAI_score"),
#     col("school_per_capita_score"),
#     col("groceries_per_capita_score"),
#     col("healthcare_per_capita_score"),
#     col("distance_score"),
#     (4*col("3+_Bed_RAI_score") +  4*col("school_per_capita_score")+ 2*col("groceries_per_capita_score")
#     +  1*col("healthcare_per_capita_score")+ 1*col("distance_score")).alias("liveablity_score_3+Bedder")
# )

postcode,suburb,all_RAI,1-2_Bed_RAI,3+_Bed_RAI,Percentage Income : All,Percentage Income : 1 - 2 Bedders,Percentage Income : 3 Bedders
3020,albion,225.3061224489796,224.81333661821463,164.24241090531206,0.4438405797101449,0.4448134683834316,0.6088561380023296
3350,alfredton,290.5263157894737,332.95238095238096,254.7658109332528,0.3442028985507246,0.3003432494279176,0.3925173461607038
3018,altona,189.36535162950256,206.14104175394496,118.14307862679956,0.5280797101449276,0.4851047571563283,0.8464313031480121
3195,aspendale,169.84615384615384,227.09732620320852,167.0153846153846,0.588768115942029,0.4403398387461382,0.598747236551216
3350,bakery-hill,175.23809523809524,233.2525252525253,162.53672942045034,0.5706521739130435,0.4287199029967088,0.6152455531532186
3183,balaclava,169.84615384615384,,,0.588768115942029,,
3350,ballarat-central,290.5263157894737,332.95238095238096,254.7658109332528,0.3442028985507246,0.3003432494279176,0.3925173461607038
3350,ballarat-east,290.5263157894737,332.95238095238096,254.7658109332528,0.3442028985507246,0.3003432494279176,0.3925173461607038
3350,ballarat-north,175.23809523809524,233.2525252525253,162.53672942045034,0.5706521739130435,0.4287199029967088,0.6152455531532186
3103,balwyn,220.8,203.63875205254516,113.37373737373736,0.4528985507246376,0.4910656689459425,0.8820384889522452


### Affordability: All Property Types

In [17]:
all_affordability_sdf1 = percentage_sdf.select("postcode", "suburb", "Percentage Income : All").orderBy(desc("Percentage Income : All"))
all_affordability_sdf1 = all_affordability_sdf1.dropDuplicates()
all_affordability_sdf1.orderBy(desc("Percentage Income : All"))

postcode,suburb,Percentage Income : All
3185,elsternwick,0.7699275362318841
3185,gardenvale,0.7699275362318841
3185,ripponlea,0.7699275362318841
3192,cheltenham,0.7246376811594203
3163,glen-huntly,0.6340579710144927
3163,murrumbeena,0.6340579710144927
3121,cremorne,0.6295289855072463
3121,burnley,0.6295289855072463
3121,richmond,0.6295289855072463
3108,doncaster,0.6159420289855073


In [18]:
all_affordability_sdf1.orderBy(("Percentage Income : All"))

postcode,suburb,Percentage Income : All
3305,bolwarra,0.3079710144927536
3305,gorae-west,0.3079710144927536
3305,portland,0.3079710144927536
3630,shepparton,0.3079710144927536
3850,sale,0.3260869565217391
3850,wurruk,0.3260869565217391
3820,warragul,0.3351449275362319
3350,alfredton,0.3442028985507246
3350,ballarat-central,0.3442028985507246
3350,ballarat-east,0.3442028985507246


### Affordability : 1-2 Bedrooms

In [19]:
affordability_1B_sdf1 = percentage_sdf.select("postcode", "suburb", "Percentage Income : 1 - 2 Bedders").orderBy(desc("Percentage Income : 1 - 2 Bedders"))
affordability_1B_sdf1  = affordability_1B_sdf1.dropDuplicates()
affordability_1B_sdf1.orderBy(desc("Percentage Income : 1 - 2 Bedders"))

postcode,suburb,Percentage Income : 1 - 2 Bedders
3031,flemington,0.5583105717995791
3031,kensington,0.5583105717995791
3006,southbank,0.5510587128439777
3121,cremorne,0.5510400474108632
3121,burnley,0.5510400474108632
3121,richmond,0.5510400474108632
3002,east-melbourne,0.5439767837274062
3182,st-kilda,0.5358615004122012
3182,st-kilda-west,0.5358615004122012
3141,south-yarra,0.5357319914729066


In [20]:
affordability_1B_sdf2 = affordability_1B_sdf1.dropna()
affordability_1B_sdf2.orderBy(("Percentage Income : 1 - 2 Bedders"))

postcode,suburb,Percentage Income : 1 - 2 Bedders
3820,warragul,0.2435963681996014
3305,bolwarra,0.2441625505163897
3305,gorae-west,0.2441625505163897
3305,portland,0.2441625505163897
3630,shepparton,0.2491930890449971
3660,seymour,0.2633537377887719
3825,rawson,0.2689549180327868
3825,moe,0.2689549180327868
3825,newborough,0.2689549180327868
3825,yallourn-north,0.2689549180327868


### Affordability : 3+ Bedrooms

In [21]:
affordability_3B_sdf1 = percentage_sdf.select("postcode", "suburb", "Percentage Income : 3 Bedders").orderBy(desc("Percentage Income : 3 Bedders"))
affordability_3B_sdf2 = (affordability_3B_sdf1.dropDuplicates()).dropna()
affordability_3B_sdf2.orderBy(desc("Percentage Income : 3 Bedders")).show()


+--------+---------------+-----------------------------+
|postcode|         suburb|Percentage Income : 3 Bedders|
+--------+---------------+-----------------------------+
|    3130|      blackburn|            1.116016828922264|
|    3130|blackburn-north|            1.116016828922264|
|    3130|blackburn-south|            1.116016828922264|
|    3192|     cheltenham|           1.0589291083026346|
|    3182|       st-kilda|           1.0496186228610889|
|    3182|  st-kilda-west|           1.0496186228610889|
|    3121|       cremorne|           0.9722136078893505|
|    3121|        burnley|           0.9722136078893505|
|    3121|       richmond|           0.9722136078893505|
|    3031|     flemington|           0.9556722526836049|
|    3031|     kensington|           0.9556722526836049|
|    3205|south-melbourne|           0.9151166075888195|
|    3141|    south-yarra|           0.9122249157825143|
|    3132|        mitcham|           0.8889684845235324|
|    3103|       deepdene|     

In [22]:

affordability_3B_sdf3 = (percentage_sdf.select("postcode", "suburb", "Percentage Income : 3 Bedders").orderBy(("Percentage Income : 3 Bedders")))
affordability_3B_sdf4 = (affordability_3B_sdf3.dropDuplicates()).dropna()
affordability_3B_sdf4.orderBy(("Percentage Income : 3 Bedders")).show()

+--------+----------------+-----------------------------+
|postcode|          suburb|Percentage Income : 3 Bedders|
+--------+----------------+-----------------------------+
|    3630|      shepparton|          0.34242442082953106|
|    3030|      point-cook|           0.3592483419307295|
|    3030|        werribee|           0.3592483419307295|
|    3030|  werribee-south|           0.3592483419307295|
|    3450|     castlemaine|          0.37213740458015265|
|    3850|            sale|          0.38019107038409045|
|    3850|          wurruk|          0.38019107038409045|
|    3660|         seymour|          0.39093640767768434|
|    3825|             moe|          0.39126995046780405|
|    3825|      newborough|          0.39126995046780405|
|    3825|          rawson|          0.39126995046780405|
|    3825|  yallourn-north|          0.39126995046780405|
|    3350|       alfredton|          0.39251734616070383|
|    3350|ballarat-central|          0.39251734616070383|
|    3350|   b

## Combining our data
We decided to combine the liveability scores and affordabilit of each suburb into one single dataframe

In [23]:
output = liveability_sdf2.join(percentage_sdf, on = ['postcode','suburb'], how = 'inner')
output

                                                                                

postcode,suburb,liveablity_score_all,liveablity_score_1_2Bedder,liveablity_score_3+Bedder,all_RAI,1-2_Bed_RAI,3+_Bed_RAI,Percentage Income : All,Percentage Income : 1 - 2 Bedders,Percentage Income : 3 Bedders
3127,surrey-hills,62.0,41.0,43.0,178.06451612903226,224.87734487734485,162.7308327435716,0.5615942028985508,0.4446868583162218,0.6145116958725225
3205,south-melbourne,72.0,53.0,54.0,197.1428571428572,197.6733493718257,109.27569139356284,0.5072463768115941,0.5058850893040666,0.9151166075888196
3215,rippleside,36.0,18.0,23.0,212.30769230769232,278.87432464176646,195.6873735701938,0.4710144927536231,0.358584463193078,0.5110191739792022
3047,dallas,,,48.0,234.89361702127655,,240.95917312661496,0.4257246376811594,,0.4150080642393878
3016,williamstown,68.0,45.0,43.0,262.85714285714283,311.30156472261734,249.2573251310093,0.3804347826086957,0.321231922136672,0.401191820330416
3171,springvale,74.0,45.0,47.0,220.8,280.48035005267,201.1513876958621,0.4528985507246376,0.3565312150431269,0.4971380070775276
3199,frankston-south,46.0,26.0,28.0,208.3018867924528,270.5569985569985,214.8280701754386,0.4800724637681159,0.3696078849682127,0.4654885179498905
3844,callignee,54.0,,,283.0769230769231,,,0.3532608695652173,,
3204,ormond,50.0,32.0,30.0,216.4705882352941,252.25108225108224,181.8403113457772,0.4619565217391304,0.3964304101596019,0.5499330663256822
3226,ocean-grove,56.0,31.0,38.0,250.9090909090909,335.2226720647773,223.78510378510376,0.3985507246376811,0.2983091787439613,0.4468572675687473


In [24]:
# Unnecessary columns and duplicates were removed a
final_output = output.drop(output['all_RAI'],output['1-2_Bed_RAI'],output['3+_Bed_RAI'])
final_output = final_output.dropDuplicates(['suburb'])
final_output

postcode,suburb,liveablity_score_all,liveablity_score_1_2Bedder,liveablity_score_3+Bedder,Percentage Income : All,Percentage Income : 1 - 2 Bedders,Percentage Income : 3 Bedders
3020,albion,60.0,39.0,40.0,0.4438405797101449,0.4448134683834316,0.6088561380023296
3350,alfredton,28.0,13.0,17.0,0.3442028985507246,0.3003432494279176,0.3925173461607038
3018,altona,58.0,39.0,44.0,0.5280797101449276,0.4851047571563283,0.8464313031480121
3195,aspendale,42.0,30.0,30.0,0.588768115942029,0.4403398387461382,0.598747236551216
3350,bakery-hill,,,13.0,0.5706521739130435,0.4287199029967088,0.6152455531532186
3183,balaclava,60.0,,,0.588768115942029,,
3350,ballarat-central,30.0,13.0,13.0,0.3442028985507246,0.3003432494279176,0.3925173461607038
3350,ballarat-east,30.0,13.0,13.0,0.3442028985507246,0.3003432494279176,0.3925173461607038
3350,ballarat-north,30.0,13.0,17.0,0.5706521739130435,0.4287199029967088,0.6152455531532186
3103,balwyn,66.0,43.0,49.0,0.4528985507246376,0.4910656689459425,0.8820384889522452
