# FIT5202 2025 S2 Assignment 1 : Analysing Australian Property Market Data

## Table of Contents
* [Part 1 : Working with RDDs](#part-1)  
    - [1.1 Data Preparation and Loading](#1.1)  
    - [1.2 Data Partitioning in RDD](#1.2)  
    - [1.3 Query/Analysis](#1.3)  
* [Part 2 : Working with DataFrames](#2-dataframes)  
    - [2.1 Data Preparation and Loading](#2-dataframes)  
    - [2.2 Query/Analysis](#2.2)  
* [Part 3 :  RDDs vs DataFrame vs Spark SQL](#part-3)  

Note: Feel free to add Code/Markdown cells as you need.

# Part 1 : Working with RDDs (30%) <a class="anchor" name="part-1"></a>
## 1.1 Data Preparation and Loading <a class="anchor" name="1.1"></a>
In this section, you will create RDDs from the given datasets, partition these RDDs, and use various RDD operations to answer the queries.

1.1.1 Write the code to create a SparkContext object using SparkSession. To create a SparkSession, you first need to build a SparkConf object that contains information about your application. Use Melbourne time as the session timezone. Give your application an appropriate name and run Spark locally with 4 cores on your machine.

In [1]:
# Import SparkConf class into program
from pyspark import SparkConf

# local[k]: run Spark in local mode with k worker threads
# (local[*] would use as many worker threads as there are logical cores on your machine).
# The task asks for 4 cores, so we use local[4].
master = "local[4]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Assignment1"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

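# Method 1: Using SparkSession
# Use the region ID "Australia/Melbourne" so that the session follows Melbourne daylight saving;
# a fixed offset such as "GMT+10" would be one hour off during summer time.
spark = SparkSession.builder.config(conf=spark_conf).config("spark.sql.session.timeZone", "Australia/Melbourne").getOrCreate()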
sc = spark.sparkContext
sc.setLogLevel('ERROR')

1.1.2 Load the CSV and JSON files into multiple RDDs. 

In [46]:
import os
files = ["data/council.json", "data/nsw_property_price.csv", "data/property_purpose.json", "data/zoning.json"]
rdds = []  
for file in files:
    ext = os.path.splitext(file)[1].lower()  # get file extension    
    # Strip tabs and braces from each line, then drop lines that are empty or contain only a comma.
    rdd = sc.textFile(file).map(lambda x: x.replace("\t", "").replace("{", "").replace("}", "").replace("\n", "")).filter(lambda x: x != "" and x != ",")
    if ext == ".json":
        rdds.append((rdd, "json", file))
    elif ext == ".csv":
        rdds.append((rdd, "csv", file))

1.1.3 For each RDD, remove the header rows and display the total count and the first 8 records.


In [50]:
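# Step 3: filter out records with an invalid *_id
def valid_record(rec):
    for k, v in rec.items():
        # Keys and values still carry JSON quotes and trailing commas (e.g. '"council_id"': '1,'),
        # so strip them before testing the key suffix and parsing the value.
        if k.strip().strip('"').endswith("_id"):
            try:
                int(str(v).strip().strip(',"'))  # must be integer-parsable
            except (ValueError, TypeError):
                return False
    return True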


for rdd, ext, filename in rdds:
    if ext == "csv":
        continue
    print(filename)
    # remove the header row
    header = rdd.first()
    # filter() applies a function to each element; the function returns a boolean (True or False),
    # and only the elements for which it returns True are kept.
    rdd = rdd.filter(lambda x: x != header)
    print(rdd.count())
#     print(rdd.take(3))
    
    rdd = rdd.map(lambda x: x.split(" : "))
    
    indexed = rdd.zipWithIndex()
#     print(indexed.take(3))
    grouped = indexed.map(lambda x: (x[1] // 2, x[0])) \
                 .groupByKey() \
                 .mapValues(dict) \
                 .values()
    print(grouped.take(3))
    
    

    filtered = grouped.filter(valid_record)
    print(filtered.take(3))
    

data/council.json
441
[{'"council_id"': '1,', '"council_name"': '"003"'}, {'"council_id"': '3,', '"council_name"': '"013"'}, {'"council_id"': '5,', '"council_name"': '"020"'}]
[{'"council_id"': '1,', '"council_name"': '"003"'}, {'"council_id"': '3,', '"council_name"': '"013"'}, {'"council_id"': '5,', '"council_name"': '"020"'}]
data/property_purpose.json
1731
[{'"purpose_id"': '1,', '"primary_purpose"': '""'}, {'"purpose_id"': '29,', '"primary_purpose"': '"10 FLATS"'}, {'"purpose_id"': '115,', '"primary_purpose"': '"2"'}]
[{'"purpose_id"': '1,', '"primary_purpose"': '""'}, {'"purpose_id"': '29,', '"primary_purpose"': '"10 FLATS"'}, {'"purpose_id"': '115,', '"primary_purpose"': '"2"'}]
data/zoning.json
143
[{'"zoning_id"': '1,', '"zoning"': '""'}, {'"zoning_id"': '3,', '"zoning"': '"AGB"'}, {'"zoning_id"': '5,', '"zoning"': '"B1"'}]
[{'"zoning_id"': '1,', '"zoning"': '""'}, {'"zoning_id"': '3,', '"zoning"': '"AGB"'}, {'"zoning_id"': '5,', '"zoning"': '"B1"'}]



1.1.4 Drop records with invalid information: purpose_id or council_id is null, empty, or 0.

In [6]:
for rdd, ext, filename in rdds:
    if ext == "csv":
        continue
    # Inspect each JSON RDD and its first two lines before filtering.
    print(rdd)
    print(rdd.take(2))

PythonRDD[54] at RDD at PythonRDD.scala:53
['"council_id,council_name\\n": [', '"council_id" : 1,']
PythonRDD[60] at RDD at PythonRDD.scala:53
['"purpose_id, primary_purpose\\n": [', '"purpose_id" : 1,']
PythonRDD[66] at RDD at PythonRDD.scala:53
['"zoning_id, zoning\\n": [', '"zoning_id" : 1,']


### 1.2 Data Partitioning in RDD <a class="anchor" name="1.2"></a>
1.2.1 For each RDD, using Spark’s default partitioning, print out the total number of partitions and the number of records in each partition.

In [33]:
for rdd, ext, filename in rdds:
    print('Default partitions: ',rdd.getNumPartitions())

Default partitions:  2
Default partitions:  19
Default partitions:  2
Default partitions:  2


1.2.2 Answer the following questions:   
a) How many partitions do the above RDDs have?  
b) How is the data in these RDDs partitioned by default, when we do not explicitly specify a partitioning strategy? Explain why Spark chooses this number of partitions.  
c) Assuming we are querying the dataset based on <strong>Property Price</strong>, can you think of a better strategy for partitioning the data, given your available hardware resources?  
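
For question c), one candidate strategy is range partitioning on price, so that queries over a price range touch only a few partitions. The sketch below is plain Python and only illustrates the partition function; the boundaries in `PRICE_BOUNDARIES` are hypothetical and would normally be derived from the actual price distribution. In PySpark, a function like this could be passed as `partitionFunc` to `RDD.partitionBy(numPartitions, partitionFunc)` on a pair RDD keyed by price.

```python
from bisect import bisect_right

# Hypothetical price boundaries (in dollars) that split sales into 4 ranges,
# one per local core; real boundaries should come from the data's distribution.
PRICE_BOUNDARIES = [500_000, 1_000_000, 1_500_000]


def price_partition(price):
    """Return the index (0..3) of the price range that `price` falls into."""
    return bisect_right(PRICE_BOUNDARIES, price)


# Simulate how records keyed by price would be spread over the 4 partitions.
sample_prices = [320_000, 500_000, 780_000, 1_200_000, 2_750_000]
partitions = {i: [] for i in range(len(PRICE_BOUNDARIES) + 1)}
for p in sample_prices:
    partitions[price_partition(p)].append(p)

print(partitions)
```

Equal-width ranges can leave partitions unbalanced when prices are skewed, which is why the boundaries are better taken from quantiles of the data.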

Your answer for a

Your answer for b

Your answer for c

1.2.3 Create a user-defined function (UDF) to transform the date strings from ISO format (YYYY-MM-DD) (e.g. 2025-01-01) to Australian format (DD/Mon/YYYY) (e.g. 01/Jan/2025), then call the UDF to transform two date columns (iso_contract_date and iso_settlement_date) to contract_date and settlement_date.
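
A minimal sketch of the conversion logic, written as a plain Python function that could be applied to the two date fields with `map`. The function name `iso_to_au` and the choice to return `None` for missing or malformed dates are assumptions, not requirements of the task.

```python
from datetime import datetime

# English month abbreviations are spelled out so the result does not depend on the locale.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def iso_to_au(iso_date):
    """Convert 'YYYY-MM-DD' to 'DD/Mon/YYYY'; return None for missing or malformed input."""
    if not iso_date:
        return None
    try:
        d = datetime.strptime(iso_date.strip(), "%Y-%m-%d")
    except ValueError:
        return None
    return f"{d.day:02d}/{MONTH_ABBR[d.month - 1]}/{d.year:04d}"


print(iso_to_au("2025-01-01"))  # 01/Jan/2025
```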

### 1.3 Query/Analysis <a class="anchor" name="1.3"></a>
For this part, write relevant RDD operations to answer the following queries.

1.3.1 Extract the Month (Jan-Dec) information and print the total number of sales by contract date for each Month. (5%)
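
The counting logic can be sketched in plain Python as follows. It mirrors an RDD pipeline of `map` to `(month, 1)` pairs followed by `reduceByKey`; the sample dates are hypothetical and assume the DD/Mon/YYYY format from 1.2.3.

```python
from collections import Counter

# Hypothetical contract dates in DD/Mon/YYYY format.
contract_dates = ["01/Jan/2025", "15/Jan/2024", "03/Mar/2023", "28/Dec/2024"]

# Equivalent of map(date -> (month, 1)) followed by reduceByKey(add).
sales_per_month = Counter(date.split("/")[1] for date in contract_dates)

# Print in calendar order, showing 0 for months without sales.
MONTH_ORDER = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
for month in MONTH_ORDER:
    print(month, sales_per_month.get(month, 0))
```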

1.3.2 Which 5 councils have the largest number of houses? Show their names and the total number of houses. (Note: A house may appear multiple times if it has been sold more than once; count each house only once.) (5%)
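
The de-duplication step is the part that is easy to miss, so here is a small plain-Python sketch of it. The `(property_id, council_name)` pairs are hypothetical, and the sketch assumes the records have already been restricted to houses and joined with the council names; in RDD terms the `set` corresponds to `distinct()`.

```python
from collections import Counter

# Hypothetical (property_id, council_name) pairs, one per sale; property 101 was sold twice.
sales = [(101, "Council A"), (101, "Council A"), (102, "Council A"), (201, "Council B")]

# Remove repeat sales of the same house before counting.
unique_houses = set(sales)

# Count houses per council and keep the 5 largest.
houses_per_council = Counter(council for _, council in unique_houses)
top_councils = houses_per_council.most_common(5)
print(top_councils)
```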

# Part 2 : Working with DataFrames (45%) <a class="anchor" name="2-dataframes"></a>
In this section, you need to load the given datasets into PySpark DataFrames and use DataFrame functions to answer the queries.
### 2.1 Data Preparation and Loading

2.1.1. Load the CSV/JSON files into separate dataframes. When you create your dataframes, please refer to the metadata file and think about the appropriate data type for each column.

In [63]:
import os
files = ["data/council.json", "data/nsw_property_price.csv", "data/property_purpose.json", "data/zoning.json"]
dfs = []  
for file in files:
    ext = os.path.splitext(file)[1].lower()  # get file extension
    
    if ext == ".json":
        df = spark.read.json(file)
        dfs.append((df, "json", file))
    elif ext == ".csv":
        # rdd = spark.read.csv(file, header=True, inferSchema=True)
        df = spark.read.csv(file)
        dfs.append((df, "csv", file))

2.1.2 Display the schema of the dataframes.

In [65]:
for df, ext, filename in dfs:
    print(df)
    df.printSchema()

DataFrame[_corrupt_record: string]
root
 |-- _corrupt_record: string (nullable = true)

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string, _c12: string, _c13: string, _c14: string, _c15: string, _c16: string]
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)

DataFrame[_corrupt_record: string]
root
 |-- _corrupt_record: stri

When the dataset is large, do you need all columns? How can you optimize memory usage? Do you need a customized data partitioning strategy? (Note: Think about these questions; you do not need to answer them.)

### 2.2 Query/Analysis  <a class="anchor" name="2.2"></a>
Implement the following queries using dataframes. You need to be able to perform operations like transforming, filtering, sorting, joining and group by using the functions provided by the DataFrame API. For each task, display the first 5 results where no output is specified.

2.2.1. The area column has three unit types (H, A and M): 1 H is one hectare = 10000 sqm, 1 A is one acre = 4000 sqm, and 1 M is one sqm. Unify the unit to sqm and create a new column called area_sqm. 
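
A minimal sketch of the unit conversion as a plain Python function. The name `to_sqm` and the choice to return `None` for a missing or unknown unit are assumptions; the conversion factors are the ones stated in the task.

```python
# Conversion factors to square metres, as stated in the task (1 acre is taken as 4000 sqm there).
SQM_PER_UNIT = {"H": 10_000.0, "A": 4_000.0, "M": 1.0}


def to_sqm(area, unit):
    """Return `area` in square metres, or None if the value or unit is missing or unknown."""
    if area is None or unit is None:
        return None
    factor = SQM_PER_UNIT.get(str(unit).strip().upper())
    if factor is None:
        return None
    return float(area) * factor


print(to_sqm(2, "H"))    # 20000.0
print(to_sqm(1.5, "A"))  # 6000.0
print(to_sqm(650, "M"))  # 650.0
```

With the DataFrame API, the same mapping could be expressed with `when`/`otherwise` from `pyspark.sql.functions`, which avoids the overhead of a Python UDF.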

In [None]:
from pyspark.sql.functions import col

for df, filetype, filename in dfs:
    if filetype == "json":
        print(f"Checking IDs in {filename}")
        
        # find all columns ending with "_id"
        id_cols = [c for c in df.columns if c.endswith("_id")]
        
        for id_col in id_cols:
            print(f"  Validating column: {id_col}")
            
            # filter invalid rows based on regex
            invalid_rows = df.filter(~col(id_col).rlike("^[A-Za-z0-9]+$"))
            
            # count once and reuse the result; each count() triggers a separate Spark job
            n_invalid = invalid_rows.count()
            if n_invalid > 0:
                print(f"    Found {n_invalid} invalid {id_col} values")
                invalid_rows.show(truncate=False)

2.2.2. <pre>The top five property types are: Residence, Vacant Land, Commercial, Farm and Industrial.
However, for historical reasons, they may have different strings in the database. Please update the primary_purpose with the following rules:
a)	Any purpose that has “HOME”, “HOUSE”, “UNIT” is classified as “Residence”;
b)	“Warehouse”, “Factory”,  “INDUST” should be changed to “Industrial”;
c)	Anything that contains “FARM” (e.g. FARMING) should be changed to “Farm”;
d)	“Vacant”, “Land” should be “Vacant Land”;
e)	Anything that has “COMM”, “Retail”, “Shop” or “Office” is “Commercial”.
f)	All remaining properties, including null and empty purposes, are classified as “Others”.
Show the count of each type in a table.
(note: Some properties are multi-purpose, e.g. “House & Farm”, it’s fine to count them multiple times.)
</pre>
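
The rules above can be sketched as a keyword lookup in plain Python. This is only one possible reading: matching is assumed to be case-insensitive and by substring, the label “Farm” follows the list of top five types, and a multi-purpose string returns every category it matches so that it is counted once per category.

```python
# (category, keywords) pairs following rules a) to e); matching is case-insensitive.
RULES = [
    ("Residence", ("HOME", "HOUSE", "UNIT")),
    ("Industrial", ("WAREHOUSE", "FACTORY", "INDUST")),
    ("Farm", ("FARM",)),
    ("Vacant Land", ("VACANT", "LAND")),
    ("Commercial", ("COMM", "RETAIL", "SHOP", "OFFICE")),
]


def classify_purpose(purpose):
    """Return the list of categories matched by `purpose`, or ['Others'] if none match."""
    text = (purpose or "").upper()
    matched = [category for category, keywords in RULES
               if any(keyword in text for keyword in keywords)]
    return matched or ["Others"]


print(classify_purpose("House & Farm"))  # ['Residence', 'Farm']
print(classify_purpose(None))            # ['Others']
```

Plain substring matching is deliberately simple and can over-match (for example, “UNIT” also occurs inside “COMMUNITY”), so the resulting counts are worth a sanity check against the raw strings.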

2.2.3 Find the top 20 properties with the largest value gain, and show their address, suburb, and value increase. To calculate the value gain, the property must have been sold multiple times; the “value increase” is calculated as the last sold price – first sold price, regardless of the transactions in between. Print all 20 records.
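
The calculation can be sketched in plain Python on a few hypothetical sales. The `property_id` key and the ISO-formatted dates are assumptions about how a property and its sale order are identified; with the DataFrame API the same idea is usually expressed with a group-by or window functions.

```python
from collections import defaultdict

# Hypothetical sales: (property_id, contract date in ISO format, price).
sales = [
    ("P1", "2010-05-01", 400_000),
    ("P1", "2015-03-10", 350_000),
    ("P1", "2022-08-20", 900_000),
    ("P2", "2018-01-15", 600_000),  # sold only once, so it is excluded
    ("P3", "2012-07-07", 500_000),
    ("P3", "2020-02-02", 650_000),
]

# Collect the sale history of each property.
by_property = defaultdict(list)
for prop, date, price in sales:
    by_property[prop].append((date, price))

gains = []
for prop, history in by_property.items():
    if len(history) < 2:
        continue  # a gain needs at least two sales
    history.sort()  # ISO dates sort chronologically as strings
    gains.append((prop, history[-1][1] - history[0][1]))  # last price minus first price

top_gains = sorted(gains, key=lambda g: g[1], reverse=True)[:20]
print(top_gains)
```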

2.2.4 For each season, plot the median house price trend over the years. Seasons in Australia are defined as: (Spring: Sep-Nov, Summer: Dec-Feb, Autumn: Mar-May, Winter: Jun-Aug). 
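
A small sketch of the season mapping and the per-season median, in plain Python on hypothetical `(year, month, price)` tuples. It groups December with its own calendar year, which is an assumption; attaching December to the following year's summer is an equally reasonable choice as long as it is stated.

```python
from collections import defaultdict
from statistics import median

# Australian seasons by month number, as defined in the task.
SEASON_BY_MONTH = {
    9: "Spring", 10: "Spring", 11: "Spring",
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
}

# Hypothetical sales: (year, month, price).
sales = [(2023, 1, 800_000), (2023, 2, 900_000), (2023, 7, 700_000), (2023, 12, 1_000_000)]

# Group prices by (year, season), then take the median of each group.
prices = defaultdict(list)
for year, month, price in sales:
    prices[(year, SEASON_BY_MONTH[month])].append(price)

median_price = {key: median(values) for key, values in prices.items()}
print(median_price)
```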

2.2.5 (Open Question) Explore the dataset freely and plot one diagram of your choice. Which columns (at least 2) are highly correlated with the sales price? Discuss the steps of your exploration and the results. (No word limit, but please keep it concise.) 

Write your discussion here.

# Part 3 : RDDs vs DataFrame vs Spark SQL (25%) <a class="anchor" name="part-3"></a>
Implement the following complex query using two of the three approaches (RDD, DataFrame, Spark SQL) separately. Log the time taken for each query in each approach using the “%%time” built-in magic command in Jupyter Notebook, and discuss the performance difference between the two approaches you chose.
(Note: You can write a multi-step query or a single complex query; the choice is yours. You can reuse the DataFrames from Part 2.)
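
Outside a notebook cell, roughly the same wall-clock measurement that `%%time` reports can be taken with `time.perf_counter`. The helper `timed` below is a hypothetical name and the workload is a stand-in. Because Spark evaluates lazily, the timed callable needs to end in an action (such as `show()` or `collect()`), otherwise only the query planning is measured.

```python
import time


def timed(label, action):
    """Run `action()` once, print its wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


# Stand-in workload; with Spark, `action` would be a function that runs the query and an action.
result, seconds = timed("sum of squares", lambda: sum(i * i for i in range(100_000)))
```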

#### Complex Query:
<pre>
A property investor wants to understand whether the property price and the settlement date are correlated. Here are the conditions:
1)	The investor is only interested in the last 2 years of the dataset.
2)	The investor is looking at houses under $2 million.
3)	Perform a bucketing of the settlement period (settlement date – contract date) into ranges (15, 30, 45, 60, 90 days).
4)	Perform a bucketing of property prices in $500K steps (e.g. 0-$500K, $500K-$1M, $1M-$1.5M, $1.5M-$2M).
5)	Count the number of transactions in each combination and print the result in the following format.
(Note: It’s fine to count the same property multiple times in this task, as it’s based on sales transactions.)
(Note: You shall show the full table with 40 rows: 2 years * 4 price buckets * 5 settlement buckets; a count of 0 should be displayed as 0, not omitted.)
</pre>
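
The bucketing and the zero-filled 40-row table can be sketched in plain Python as below. The bucket boundaries are an interpretation, not part of the specification: each settlement bucket is labelled by its upper bound in days (up to 15, 16 to 30, and so on), periods over 90 days are dropped, and price buckets are half-open intervals of $500K. The sample transactions and years are hypothetical.

```python
from collections import Counter
from itertools import product

SETTLEMENT_EDGES = [15, 30, 45, 60, 90]  # upper bounds in days (assumed inclusive)
PRICE_LABELS = ["0-$500K", "$500K-$1M", "$1M-$1.5M", "$1.5M-$2M"]


def settlement_bucket(days):
    """Return the smallest upper bound covering `days`, or None if outside 0..90."""
    if days < 0:
        return None
    for edge in SETTLEMENT_EDGES:
        if days <= edge:
            return edge
    return None


def price_bucket(price):
    """Return the $500K-wide bucket label for prices in [0, 2M), else None."""
    if not 0 <= price < 2_000_000:
        return None
    return PRICE_LABELS[int(price // 500_000)]


# Hypothetical transactions: (year, settlement days, price).
transactions = [(2023, 14, 450_000), (2023, 42, 1_200_000),
                (2024, 42, 1_250_000), (2024, 120, 700_000)]
years = [2023, 2024]

counts = Counter()
for year, days, price in transactions:
    s, p = settlement_bucket(days), price_bucket(price)
    if year in years and s is not None and p is not None:
        counts[(year, p, s)] += 1

# Enumerate every combination so that empty ones are reported as 0 rather than omitted.
table = [(y, p, s, counts[(y, p, s)])
         for y, p, s in product(years, PRICE_LABELS, SETTLEMENT_EDGES)]
print(len(table))  # 40
```

In Spark, the same zero-filling is typically done by building the full grid of combinations (for example with a cross join) and left-joining the aggregated counts onto it.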

### a)	Implement the above query using two approaches of your choice separately and print the results. (Note: Outputs from both approaches are required, and the results should be the same.)

#### 3.1. Implementation 1

#### 3.2. Implementation 2

### b)	Which one is easier to implement, in your opinion? Log the time taken for each query and observe the query execution time. Between DataFrame and Spark SQL, which is faster and why? Please include proper references. (Maximum 500 words.) 

### Some ideas on the comparison

Armbrust, M., Huai, Y., Liang, C., Xin, R., & Zaharia, M. (2015). Deep Dive into Spark SQL’s Catalyst Optimizer. Retrieved September 30, 2017, from https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Damji, J. (2016). A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. Retrieved September 28, 2017, from https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Data Flair (2017a). Apache Spark RDD vs DataFrame vs DataSet. Retrieved September 28, 2017, from http://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset

Prakash, C. (2016). Apache Spark: RDD vs Dataframe vs Dataset. Retrieved September 28, 2017, from http://why-not-learn-something.blogspot.com.au/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html

Xin, R., & Rosen, J. (2015). Project Tungsten: Bringing Apache Spark Closer to Bare Metal. Retrieved September 30, 2017, from https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html