# FIT5202 2025 S2 Assignment 1 : Analysing Australian Property Market Data

## Table of Contents
* [Part 1 : Working with RDD](#part-1)  
    - [1.1 Data Preparation and Loading](#1.1)  
    - [1.2 Data Partitioning in RDD](#1.2)  
    - [1.3 Query/Analysis](#1.3)  
* [Part 2 : Working with DataFrames](#2-dataframes)  
    - [2.1 Data Preparation and Loading](#2-dataframes)  
    - [2.2 Query/Analysis](#2.2)  
* [Part 3 :  RDDs vs DataFrame vs Spark SQL](#part-3)  

Note: Feel free to add Code/Markdown cells as you need.

# Part 1 : Working with RDDs (30%) <a class="anchor" name="part-1"></a>
## 1.1 Working with RDD
In this section, you will need to create RDDs from the given datasets, perform partitioning in these RDDs and use various RDD operations to answer the queries. 

1.1.1 Data Preparation and Loading <a class="anchor" name="1.1"></a>
1.	Write the code to create a SparkContext object using SparkSession. To create a SparkSession, you first need to build a SparkConf object that contains information about your application. Use Melbourne time as the session timezone. Give your application an appropriate name and run Spark locally with 4 cores on your machine.

In [1]:
# Import SparkConf class into program
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many working processors as logical cores on your machine
# If we want Spark to run locally with 'k' worker threads, we can specify as "local[k]".
master = "local[4]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Assignment1"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

# Method 1: Using SparkSession
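# Use the region ID rather than a fixed offset so that daylight saving time is respected
spark = SparkSession.builder.config(conf=spark_conf).config("spark.sql.session.timeZone", "Australia/Melbourne").getOrCreate()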
sc = spark.sparkContext
sc.setLogLevel('ERROR')

1.1.2 Load the CSV and JSON files into multiple RDDs. 

In [2]:
import os
files = ["data/council.json", "data/nsw_property_price.csv", "data/property_purpose.json", "data/zoning.json"]
rdds = []  
headers = {}
for file in files:
    # get file extension
    ext = os.path.splitext(file)[1].lower()  
    # strip whitespace, trailing commas and braces from each line, then drop empty lines
    rdd = (
        sc.textFile(file)
          .map(lambda x: x.strip().rstrip(",").replace("{", "").replace("}", ""))
          .filter(lambda x: x != "")
    )

    if ext == ".json":
        rdds.append((rdd, "json", file))
    elif ext == ".csv":
        rdds.append((rdd, "csv", file))

1.1.3 For each RDD, remove the header rows and display the total count and the first 8 records.


In [3]:
# used GPT to assemble the messy logic together cleanly
def safe_dict(it):
    """Safely turn iterable of kv pairs into dict, skipping malformed entries."""
    d = {}
    for kv in it:
        if isinstance(kv, tuple) and len(kv) == 2:
            k, v = kv
            d[k] = v
    return d

def process_rdd(rdd, ext, filename):
    # remove header
    header = rdd.first()
    clean_header = header.split("\\n", 1)[0]
    if clean_header.startswith('"') and not clean_header.endswith('"'):
        clean_header += '"'  # restore the closing quote
    headers[filename] = clean_header
    lines = rdd.filter(lambda s: s != header)

    
    if ext == "json":        
        # robust key:value parsing (handles both "key : value" and "key": "value")
        kv = (
            lines
            .map(lambda s: s.strip())
            .filter(lambda s: ":" in s)              # accept any colon, with or without spaces
            .map(lambda s: s.split(":", 1))          # split once, keep right side intact
            .filter(lambda kv: len(kv) == 2)         # keep only well-formed pairs
            .map(lambda kv: (kv[0].strip(' "\',{}'), kv[1].strip(' "\',{}')))
        )

        # group into records (assumes each record spans 2 lines)
        grouped = (
            kv.zipWithIndex()
              .map(lambda x: (x[1] // 2, x[0]))
              .groupByKey()
              .mapValues(safe_dict)
              .values()
        )

        print(f"{filename}: grouped={grouped.count()}")
        return grouped
    
    if ext == "csv":
        
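        # NOTE: a plain split(",") misaligns fields whenever a quoted value contains a comma
        # (see 'post_code': 'KENTHURST' in the output below); csv.reader avoids this.
        fieldnames = [h.strip().strip('"') for h in headers[filename].split(",")]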
        lines = (
            lines.map(lambda row: dict(
                zip(
                    fieldnames,
                    [val.strip().strip('"') for val in row.split(",")]
                )
            ))
        )
        print(f"{filename}: grouped={lines.count()}")
        return lines

# Replace items in rdds
rdds = [
    (process_rdd(rdd, ext, filename), ext, filename)
    for (rdd, ext, filename) in rdds
]

for fname, header in headers.items():
    print("Header for", fname, "=", header)
    
for (rdd, ext, filename) in rdds:
    print(rdd.take(2))

data/council.json: grouped=220
data/nsw_property_price.csv: grouped=4854814
data/property_purpose.json: grouped=865
data/zoning.json: grouped=71
Header for data/council.json = "council_id,council_name"
Header for data/nsw_property_price.csv = "property_id","purchase_price","address","post_code","property_type","strata_lot_number","property_name","area","area_type","iso_contract_date","iso_settlement_date","nature_of_property","legal_description","id","council_id","purpose_id","zone_id"
Header for data/property_purpose.json = "purpose_id, primary_purpose"
Header for data/zoning.json = "zoning_id, zoning"
[{'council_id': '1', 'council_name': '003'}, {'council_id': '3', 'council_name': '013'}]
[{'property_id': '4270509', 'purchase_price': '1400000.00', 'address': '8 C NYARI RD', 'post_code': 'KENTHURST', 'property_type': '2156', 'strata_lot_number': 'house', 'property_name': '', 'area': '', 'area_type': '2.044', 'iso_contract_date': 'H', 'iso_settlement_date': '2023-12-14', 'nature_of_pro

1.1.4 Drop records with invalid information: purpose_id or council_id is null, empty, or 0.

In [4]:
for (rdd, ext, filename) in rdds:
    print(f"{filename}: type:{type(rdd)}")

data/council.json: type:<class 'pyspark.rdd.PipelinedRDD'>
data/nsw_property_price.csv: type:<class 'pyspark.rdd.PipelinedRDD'>
data/property_purpose.json: type:<class 'pyspark.rdd.PipelinedRDD'>
data/zoning.json: type:<class 'pyspark.rdd.PipelinedRDD'>


In [5]:
def valid_record(rec):
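    # NOTE: this checks every *_id field (e.g. property_id, zone_id), which is stricter
    # than the task's rule, which only names purpose_id and council_id.
    for k, v in rec.items():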
        if k.endswith("_id"):
            try:
                if int(v) < 1:  # must be integer-parsable and > 0
                    return False
            except (ValueError, TypeError):
                return False
    return True

def filter_rdd(rdd, ext, filename):
    filtered = rdd.filter(valid_record)

    print(f"{filename}: raw={rdd.count()}, filtered={filtered.count()}")
    return filtered

# Apply filtering
rdds = [
    (filter_rdd(rdd, ext, filename), ext, filename)
    for (rdd, ext, filename) in rdds
]


data/council.json: raw=220, filtered=220
data/nsw_property_price.csv: raw=4854814, filtered=4729218
data/property_purpose.json: raw=865, filtered=865
data/zoning.json: raw=71, filtered=71


### 1.2 Data Partitioning in RDD <a class="anchor" name="1.2"></a>
1.2.1 For each RDD, using Spark’s default partitioning, print out the total number of partitions and the number of records in each partition

In [6]:
for rdd, ext, filename in rdds:
    print('Default partitions: ',rdd.getNumPartitions())

Default partitions:  2
Default partitions:  19
Default partitions:  2
Default partitions:  2
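
Task 1.2.1 also asks for the number of records in each partition, which the cell above does not print. A minimal sketch, assuming the `rdds` list from the earlier cells: the helper below (its name `count_in_partition` is illustrative, not from the original notebook) is meant to be passed to `RDD.mapPartitionsWithIndex`, which calls it once per partition with the partition index and an iterator over that partition's records. The local check at the end uses hand-made lists in place of real partitions.

```python
def count_in_partition(index, iterator):
    """Yield one (partition_index, record_count) pair for a partition."""
    yield (index, sum(1 for _ in iterator))

# Intended Spark usage (assumes the `rdds` list built in the earlier cells):
#   for rdd, ext, filename in rdds:
#       print(filename, rdd.mapPartitionsWithIndex(count_in_partition).collect())

# Local check without Spark: two hand-made "partitions"
fake_partitions = [["a", "b", "c"], ["d"]]
partition_counts = [
    pair
    for idx, part in enumerate(fake_partitions)
    for pair in count_in_partition(idx, iter(part))
]
print(partition_counts)
```

Counting inside each partition avoids pulling the records themselves to the driver, which matters for the CSV RDD with several million rows.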


1.2.2 Answer the following questions:   
a) How many partitions do the above RDDs have?  
b) How is the data in these RDDs partitioned by default, when we do not explicitly specify any partitioning strategy? Can you explain why they end up with this number of partitions?  
c) Assuming we are querying the dataset based on <strong> Property Price</strong>, can you think of a better strategy for partitioning the data based on your available hardware resources?  

Answer for a)  
The CSV file has 19 partitions, while each of the JSON files has 2 partitions.  

Your answer for b

Your answer for c
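
One possible direction for c), offered as a sketch rather than a model answer: range-partition the records by price, with one partition per available core (4, matching `local[4]`). The boundaries below are hypothetical placeholders. In practice they would be derived from the observed price distribution so the partitions are roughly balanced. The function is written so it can be passed as the `partitionFunc` argument of `RDD.partitionBy` on a key-value RDD keyed by price.

```python
from bisect import bisect_right

# Hypothetical price boundaries in dollars (an assumption, not taken from the data);
# three boundaries give four ranges, one per core.
PRICE_BOUNDARIES = [500_000, 1_000_000, 2_000_000]

def price_partition(price):
    """Map a purchase price to a partition index in 0..len(PRICE_BOUNDARIES)."""
    try:
        return bisect_right(PRICE_BOUNDARIES, float(price))
    except (TypeError, ValueError):
        return 0  # unparsable prices fall into the first partition

# Intended Spark usage (assumes an RDD of dicts with a 'purchase_price' field):
#   by_price = rdd.keyBy(lambda rec: rec["purchase_price"]).partitionBy(4, price_partition)

print(price_partition("1400000.00"))
```

With range partitioning, a query on a price interval only needs to touch the partitions whose ranges overlap that interval.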

1.2.3 Create a user-defined function (UDF) to transform the date strings from ISO format (YYYY-MM-DD) (e.g. 2025-01-01) to Australian format (DD/Mon/YYYY) (e.g. 01/Jan/2025), then call the UDF to transform two date columns (iso_contract_date and iso_settlement_date) to contract_date and settlement_date.

In [7]:
import csv
from io import StringIO
from datetime import datetime

# Function to parse a CSV line safely, respecting quotes
def parse_csv_line(line: str):
    reader = csv.reader(StringIO(line), quotechar='"', delimiter=',')
    return next(reader)

# ISO -> AUS conversion
def iso_to_aus(iso_date: str) -> str:
    try:
        dt = datetime.strptime(iso_date, "%Y-%m-%d")
        return dt.strftime("%d/%b/%Y")
    except Exception:
        return None

for fname, header in headers.items():
    if fname != "data/nsw_property_price.csv":
        continue
    # Parse header using the same csv.reader
    columns = parse_csv_line(header)

    # Build column index lookup
    col_index = {col: idx for idx, col in enumerate(columns)}

    # Replace header names
    columns[col_index["iso_contract_date"]] = "contract_date"
    columns[col_index["iso_settlement_date"]] = "settlement_date"

    # Load CSV RDD
    rdd = sc.textFile(fname)

    # Remove header row
    data_rdd = rdd.filter(lambda line: line != header)

    # Parse rows safely with csv.reader
    parsed_rdd = data_rdd.map(parse_csv_line)

    # Replace ISO dates with AUS format
    transformed_rdd = parsed_rdd.map(
        lambda row: [
            iso_to_aus(row[col_index["iso_contract_date"]]) if i == col_index["iso_contract_date"]
            else iso_to_aus(row[col_index["iso_settlement_date"]]) if i == col_index["iso_settlement_date"]
            else val
            for i, val in enumerate(row)
        ]
    )

    # Zip headers with row -> dict
    dict_rdd = transformed_rdd.map(lambda row: dict(zip(columns, row)))

    # Example output
    print("Updated header for", fname, "=", columns)
    print("Example row as dict:", dict_rdd.first())


Updated header for data/nsw_property_price.csv = ['property_id', 'purchase_price', 'address', 'post_code', 'property_type', 'strata_lot_number', 'property_name', 'area', 'area_type', 'contract_date', 'settlement_date', 'nature_of_property', 'legal_description', 'id', 'council_id', 'purpose_id', 'zone_id']
Example row as dict: {'property_id': '4270509', 'purchase_price': '1400000.00', 'address': '8 C NYARI RD, KENTHURST', 'post_code': '2156', 'property_type': 'house', 'strata_lot_number': '', 'property_name': '', 'area': '2.044', 'area_type': 'H', 'contract_date': '14/Dec/2023', 'settlement_date': '14/Feb/2024', 'nature_of_property': 'V', 'legal_description': '2/1229857', 'id': '142', 'council_id': '200', 'purpose_id': '9922', 'zone_id': '53'}


### 1.3 Query/Analysis <a class="anchor" name="1.3"></a>
For this part, write relevant RDD operations to answer the following queries.

1.3.1 Extract the Month (Jan-Dec) information and print the total number of sales by contract date for each Month. (5%)

In [12]:
print(dict_rdd)

PythonRDD[73] at RDD at PythonRDD.scala:53


In [14]:
from datetime import datetime

def extract_month(date_str: str) -> str:
    try:
        dt = datetime.strptime(date_str, "%d/%b/%Y")  # AUS format
        return dt.strftime("%b-%Y")  # "Jan-2025"
    except Exception:
        return None

def parse_month(month_str: str) -> datetime:
    return datetime.strptime(month_str, "%b-%Y")

def safe_float(x: str) -> float:
    try:
        return float(x)
    except Exception:
        return 0.0

# Filter out header-like rows and bad values
clean_rdd = dict_rdd.filter(
    lambda row: row["contract_date"] not in (None, "", "contract_date") 
                and row["purchase_price"] not in (None, "", "purchase_price")
)

# Extract (month, purchase_price)
month_price_rdd = clean_rdd.map(
    lambda row: (
        extract_month(row["contract_date"]),
        safe_float(row["purchase_price"])
    )
).filter(lambda x: x[0] is not None)

# Reduce by key (sum per month)
monthly_totals = month_price_rdd.reduceByKey(lambda a, b: a + b)

# Collect and sort chronologically
monthly_totals_sorted = sorted(
    monthly_totals.collect(),
    key=lambda x: parse_month(x[0])
)

for month, total in monthly_totals_sorted:
    print(month, "=", total)


Nov-1000 = 0.0
Jan-1012 = 1900000.0
Jul-1016 = 333500.0
Oct-1016 = 225000.0
Apr-1017 = 526400.0
Feb-1018 = 405000.0
Oct-1018 = 1820000.0
Dec-1018 = 270000.0
Feb-1019 = 630000.0
Jun-1019 = 2250000.0
Jul-1019 = 989000.0
Aug-1020 = 1225000.0
Oct-1020 = 702000.0
Feb-1021 = 359000.0
May-1021 = 3340000.0
Oct-1021 = 682000.0
Nov-1021 = 1598000.0
Feb-1022 = 4000000.0
May-1022 = 6440000.0
Jun-1022 = 650000.0
Oct-1022 = 1240000.0
Nov-1022 = 280000.0
Dec-1022 = 520000.0
Apr-1023 = 1225000.0
May-1023 = 1135000.0
Aug-1023 = 2804000.0
Sep-1023 = 1030000.0
Nov-1023 = 1233000.0
Jan-1024 = 1316000.0
Mar-1028 = 230000.0
Oct-1029 = 450000.0
Feb-1877 = 23000.0
Jan-1900 = 350000.0
Jan-1901 = 350000.0
Oct-1903 = 202.0
Dec-1903 = 137000.0
Jan-1908 = 305000.0
Nov-1909 = 292000.0
Jan-1910 = 514000.0
Mar-1910 = 860000.0
May-1910 = 215000.0
Jul-1910 = 390000.0
Sep-1910 = 2226000.0
Jan-1911 = 6000.0
Feb-1911 = 710500.0
Apr-1911 = 1552500.0
Jun-1911 = 449000.0
Jul-1911 = 875000.0
Aug-1911 = 700000.0
Sep-1911 = 150

In [16]:
from datetime import datetime

def extract_month_name(date_str: str) -> str:
    try:
        dt = datetime.strptime(date_str, "%d/%b/%Y")  # AUS format
        return dt.strftime("%b")   # "Jan"
    except Exception:
        return None

def safe_float(x: str) -> float:
    try:
        return float(x)
    except Exception:
        return 0.0

# Filter out header remnants / blanks
clean_rdd = dict_rdd.filter(
    lambda row: row["contract_date"] not in (None, "", "contract_date")
)

# Map to (month, (count, total_purchase_price))
month_metrics_rdd = clean_rdd.map(
    lambda row: (
        extract_month_name(row["contract_date"]),
        (1, safe_float(row["purchase_price"]))
    )
).filter(lambda x: x[0] is not None)

# Reduce: sum counts and purchase prices
monthly_metrics = month_metrics_rdd.reduceByKey(
    lambda a, b: (a[0] + b[0], a[1] + b[1])
)

# Collect and sort by calendar order
month_order = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
monthly_metrics_sorted = sorted(
    monthly_metrics.collect(),
    key=lambda x: month_order.index(x[0])
)

# Print results
for month, (count, total) in monthly_metrics_sorted:
    print(f"{month}: Sales={count}, Total Purchase Price={total}")


Jan: Sales=232506, Total Purchase Price=150349905292.0
Feb: Sales=387321, Total Purchase Price=285319462750.0
Mar: Sales=463130, Total Purchase Price=365311248208.0
Apr: Sales=384295, Total Purchase Price=290504534216.0
May: Sales=451956, Total Purchase Price=381422491121.0
Jun: Sales=410513, Total Purchase Price=374419320769.0
Jul: Sales=406578, Total Purchase Price=359921097048.0
Aug: Sales=415692, Total Purchase Price=358084768214.0
Sep: Sales=425426, Total Purchase Price=349453709090.0
Oct: Sales=434540, Total Purchase Price=350918253916.0
Nov: Sales=449023, Total Purchase Price=418727748950.0
Dec: Sales=393246, Total Purchase Price=437044783603.0


1.3.2 Which 5 councils have the largest number of houses? Show their name and the total number of houses. (Note: Each house may appear multiple times if it has been sold more than once; you should only count it once.) (5%)
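
No cell answers this query yet. As a plain-Python reference for the logic (not the RDD solution itself), the sketch below deduplicates houses before counting. It assumes that `property_id` identifies a house and that houses carry `property_type == 'house'`, as in the sample row printed earlier. The function name and the sample records are made up for illustration. On RDDs, the same steps would roughly correspond to `filter`, `map`, `distinct`, `reduceByKey`, a `join` with the council names, and `takeOrdered`.

```python
from collections import Counter

def top_councils_by_houses(sales, council_names, n=5):
    """Return the n (council_name, distinct_house_count) pairs with the most houses."""
    # Deduplicate: a house sold several times contributes one (council, property) pair
    houses = {
        (rec["council_id"], rec["property_id"])
        for rec in sales
        if rec.get("property_type") == "house"
    }
    counts = Counter(council_id for council_id, _ in houses)
    return [
        (council_names.get(cid, cid), cnt)
        for cid, cnt in counts.most_common(n)
    ]

# Made-up sample records, only to exercise the function
sales = [
    {"property_id": "1", "council_id": "200", "property_type": "house"},
    {"property_id": "1", "council_id": "200", "property_type": "house"},  # resale
    {"property_id": "2", "council_id": "200", "property_type": "house"},
    {"property_id": "3", "council_id": "201", "property_type": "house"},
    {"property_id": "4", "council_id": "201", "property_type": "unit"},
]
names = {"200": "Council A", "201": "Council B"}
top = top_councils_by_houses(sales, names)
print(top)
```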

## Part 2. Working with DataFrames (45%) <a class="anchor" name="2-dataframes"></a>
In this section, you need to load the given datasets into PySpark DataFrames and use DataFrame functions to answer the queries.
### 2.1 Data Preparation and Loading

2.1.1. Load the CSV/JSON files into separate dataframes. When you create your dataframes, please refer to the metadata file and think about the appropriate data type for each column.

In [8]:
import os
files = ["data/council.json", "data/nsw_property_price.csv", "data/property_purpose.json", "data/zoning.json"]
dfs = []  
for file in files:
    ext = os.path.splitext(file)[1].lower()  # get file extension
    
    if ext == ".json":
        df = spark.read.json(file)
        dfs.append((df, "json", file))
    elif ext == ".csv":
        # rdd = spark.read.csv(file, header=True, inferSchema=True)
        df = spark.read.csv(file)
        dfs.append((df, "csv", file))

2.1.2 Display the schema of the dataframes.

In [9]:
for df, ext, filename in dfs:
    print(df)
    df.printSchema()

DataFrame[_corrupt_record: string]
root
 |-- _corrupt_record: string (nullable = true)

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string, _c12: string, _c13: string, _c14: string, _c15: string, _c16: string]
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)

DataFrame[_corrupt_record: string]
root
 |-- _corrupt_record: stri

When the dataset is large, do you need all columns? How can you optimize memory usage? Do you need a customized data partitioning strategy? (Note: Think about these questions, but you don’t need to answer them.)

### 2.2 Query/Analysis  <a class="anchor" name="2.2"></a>
Implement the following queries using dataframes. You need to be able to perform operations like transforming, filtering, sorting, joining and group by using the functions provided by the DataFrame API. For each task, display the first 5 results where no output is specified.

2.2.1. The area column has three unit types (H, A and M): 1 H is one hectare = 10000 sqm, 1 A is one acre = 4000 sqm, 1 M is one sqm. Unify the unit to sqm and create a new column called area_sqm. 
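
The conversion rule itself is small enough to check in plain Python before it is expressed on a DataFrame. A minimal sketch, using the factors stated in the task (the names `SQM_PER_UNIT` and `to_sqm` are illustrative, and returning `None` for missing or unknown units is an assumption):

```python
# Conversion factors as stated in the task (1 acre is taken as 4000 sqm there)
SQM_PER_UNIT = {"H": 10_000.0, "A": 4_000.0, "M": 1.0}

def to_sqm(area, area_type):
    """Convert an area value to square metres; return None if it cannot be converted."""
    try:
        return float(area) * SQM_PER_UNIT[str(area_type).strip().upper()]
    except (TypeError, ValueError, KeyError):
        return None

print(to_sqm("3", "A"))
```

On a DataFrame, the same mapping can be written with `when(...).otherwise(...)` from `pyspark.sql.functions` on the `area` and `area_type` columns, which avoids a Python UDF.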

In [10]:
from pyspark.sql.functions import col

for df, filetype, filename in dfs:
    if filetype == "json":
        print(f"Checking IDs in {filename}")
        
        # find all columns ending with "_id"
        id_cols = [c for c in df.columns if c.endswith("_id")]
        
        for id_col in id_cols:
            print(f"  Validating column: {id_col}")
            
            # filter invalid rows based on regex
            invalid_rows = df.filter(~col(id_col).rlike("^[A-Za-z0-9]+$"))
            
            if invalid_rows.count() > 0:
                print(f"    Found {invalid_rows.count()} invalid {id_col} values")
                invalid_rows.show(truncate=False)

Checking IDs in data/council.json
Checking IDs in data/property_purpose.json
Checking IDs in data/zoning.json


2.2.2. <pre>The top five property types are: Residence, Vacant Land, Commercial, Farm and Industrial.
However, for historical reason, they may have different strings in the database. Please update the primary_purpose with the following rules:
a)	Any purpose that has “HOME”, “HOUSE”, “UNIT” is classified as “Residence”;
b)	“Warehouse”, “Factory”,  “INDUST” should be changed to “Industrial”;
c)	Anything that contains “FARM”(i.e. FARMING), should be changed to “FARM”;
d)	“Vacant”, “Land” should be “Vacant Land”;
e)	Anything that has “COMM”, “Retail”, “Shop” or “Office” is “Commercial”.
f)	All remaining properties, including null and empty purposes, are classified as “Others”.
Show the count of each type in a table.
(note: Some properties are multi-purpose, e.g. “House & Farm”, it’s fine to count them multiple times.)
</pre>
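
The rules above can be prototyped as a plain-Python function before being moved to DataFrame expressions. This is a sketch under stated assumptions: matching is case-insensitive substring matching, a multi-purpose string returns several labels (so it is counted several times), and the label "Farm" follows the spelling in the list of top five types. The names `RULES` and `classify_purpose` are illustrative.

```python
# Keyword rules from the task; keywords are upper-cased because matching is
# done case-insensitively here (an assumption)
RULES = [
    ("Residence", ("HOME", "HOUSE", "UNIT")),
    ("Industrial", ("WAREHOUSE", "FACTORY", "INDUST")),
    ("Farm", ("FARM",)),
    ("Vacant Land", ("VACANT", "LAND")),
    ("Commercial", ("COMM", "RETAIL", "SHOP", "OFFICE")),
]

def classify_purpose(purpose):
    """Return every matching category for a purpose string, or ['Others'] if none match."""
    text = (purpose or "").upper()
    labels = [label for label, keywords in RULES if any(k in text for k in keywords)]
    return labels or ["Others"]

print(classify_purpose("House & Farm"))
```

One caveat of plain substring matching: "WAREHOUSE" contains "HOUSE", so a warehouse is labelled both Residence and Industrial by this sketch. If that is not intended, the Industrial rule could be checked first or word-boundary patterns could be used.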

2.2.3 Find the top 20 properties with the largest value gain; show their address, suburb, and value increase. To calculate the value gain, the property must have been sold multiple times; “value increase” is calculated as the last sold price – first sold price, regardless of the transactions in between. Print all 20 records.
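
The definition of value gain can be pinned down with a small plain-Python reference before writing the DataFrame version. The sketch assumes `property_id` identifies a property, that sales are ordered by ISO contract date, and that prices are already numeric. The function name and sample data are made up. It returns IDs only, so address and suburb would still need to be joined back.

```python
from collections import defaultdict

def top_value_gains(sales, n=20):
    """Return the n largest (property_id, gain) pairs among properties sold more than once."""
    by_property = defaultdict(list)
    for rec in sales:
        by_property[rec["property_id"]].append((rec["contract_date"], rec["purchase_price"]))

    gains = []
    for pid, txns in by_property.items():
        if len(txns) < 2:
            continue  # sold only once: no gain can be computed
        txns.sort()  # ISO dates (YYYY-MM-DD) sort chronologically as strings
        gains.append((pid, txns[-1][1] - txns[0][1]))
    return sorted(gains, key=lambda g: g[1], reverse=True)[:n]

# Made-up sample transactions
sample = [
    {"property_id": "1", "contract_date": "2010-05-01", "purchase_price": 500000.0},
    {"property_id": "1", "contract_date": "2001-03-12", "purchase_price": 300000.0},
    {"property_id": "1", "contract_date": "2020-08-20", "purchase_price": 900000.0},
    {"property_id": "2", "contract_date": "2015-01-01", "purchase_price": 700000.0},
    {"property_id": "3", "contract_date": "2005-06-30", "purchase_price": 400000.0},
    {"property_id": "3", "contract_date": "2015-06-30", "purchase_price": 350000.0},
]
gains = top_value_gains(sample)
print(gains)
```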

2.2.4 For each season, plot the median house price trend over the years. Seasons in Australia are defined as: (Spring: Sep-Nov, Summer: Dec-Feb, Autumn: Mar-May, Winter: Jun-Aug). 
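
The season mapping is the only non-obvious step here, so it is worth sketching separately. The plain-Python version below groups by the calendar year of the contract date. That is an assumption, since an Australian summer spans two calendar years and could also be keyed by the year it starts in. Names and sample values are illustrative.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

SEASON_BY_MONTH = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

def season_of(iso_date):
    """Return the Australian season for an ISO date string (YYYY-MM-DD)."""
    return SEASON_BY_MONTH[datetime.strptime(iso_date, "%Y-%m-%d").month]

def median_price_by_season(sales):
    """Map (calendar_year, season) to the median price of (iso_date, price) pairs."""
    groups = defaultdict(list)
    for iso_date, price in sales:
        groups[(int(iso_date[:4]), season_of(iso_date))].append(price)
    return {key: median(prices) for key, prices in groups.items()}

# Made-up sample sales
sample_sales = [
    ("2023-12-14", 1400000.0),
    ("2023-01-10", 1000000.0),
    ("2023-02-01", 1200000.0),
    ("2023-07-01", 800000.0),
]
medians = median_price_by_season(sample_sales)
print(medians)
```

On a DataFrame, the median per group is commonly approximated with `percentile_approx(col, 0.5)` from `pyspark.sql.functions` (available in recent Spark 3.x releases).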

2.2.5 (Open Question) Explore the dataset freely and plot one diagram of your choice. Which columns (at least 2) are highly correlated to the sales price? Discuss the steps of your exploration and the results. (No word limit, please keep concise.) 

Write your discussion here.

### Part 3 RDDs vs DataFrame vs Spark SQL (25%) <a class="anchor" name="part-3"></a>
Implement the following complex query using two of the three approaches (RDD, DataFrame, Spark SQL). Log the time taken for each query in each approach using the “%%time” built-in magic command in Jupyter Notebook and discuss the performance difference between the two approaches of your choice.
(notes: You can write a multi-step query or a single complex query, the choice is yours. You can reuse the data frame in Part 2.)

#### Complex Query:
<pre>
A property investor wants to understand whether the property price and the settlement date are correlated. Here are the conditions:
1)	The investor is only interested in the last 2 years of the dataset.
2)	The investor is looking at houses under $2 million.
3)	Bucket the settlement period (settlement date – contract date) into ranges of 15, 30, 45, 60 and 90 days.
4)	Bucket property prices in $500K steps (e.g. 0-$500K, $500K-$1M, $1M-$1.5M, $1.5M-$2M).
5)	Count the number of transactions in each combination and print the result in the following format
(Note: It’s fine to count the same property multiple times in this task, it’s based on sales transactions).
(Note: You shall show the full table with 40 rows, 2 years * 4 price buckets * 5 settlement buckets; 0 count should be displayed as 0, not omitted.)
</pre>
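
The bucketing and the zero-filled 40-row table can be sketched in plain Python to fix the edge cases before implementing them in Spark. Several details are assumptions: settlement buckets are labelled by their inclusive upper bound (0-15, 16-30, and so on), periods longer than 90 days are dropped, price buckets are half-open intervals `[low, high)`, and each transaction arrives as a `(year, price, days)` tuple. All names and sample values are illustrative.

```python
from bisect import bisect_left
from collections import Counter
from itertools import product

SETTLEMENT_EDGES = [15, 30, 45, 60, 90]  # upper bounds in days, inclusive (assumption)
PRICE_LABELS = ["0-$500K", "$500K-$1M", "$1M-$1.5M", "$1.5M-$2M"]

def settlement_bucket(days):
    """Return the upper bound of the bucket containing `days`, or None if outside 0..90."""
    if days < 0 or days > SETTLEMENT_EDGES[-1]:
        return None
    return SETTLEMENT_EDGES[bisect_left(SETTLEMENT_EDGES, days)]

def price_bucket(price):
    """Return the $500K-wide bucket label for prices in [0, 2M), else None."""
    if price < 0 or price >= 2_000_000:
        return None
    return PRICE_LABELS[int(price // 500_000)]

def count_combinations(transactions, years):
    """Count (year, price bucket, settlement bucket) combinations, keeping zero counts."""
    counts = Counter()
    for year, price, days in transactions:
        key = (year, price_bucket(price), settlement_bucket(days))
        if year in years and None not in key:
            counts[key] += 1
    # Enumerate every combination so that empty cells appear with a count of 0
    return [
        (y, p, s, counts[(y, p, s)])
        for y, p, s in product(years, PRICE_LABELS, SETTLEMENT_EDGES)
    ]

# Made-up sample transactions: (year, price, settlement days)
transactions = [(2023, 1_400_000.0, 62), (2023, 450_000.0, 15), (2024, 2_500_000.0, 30)]
table = count_combinations(transactions, years=[2023, 2024])
print(len(table))
```

In Spark, the zero-filling step is typically done by cross-joining small tables of years, price buckets and settlement buckets, then left-joining the grouped counts and replacing nulls with 0.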

### a)	Implement the above query using two approaches of your choice separately and print the results. (Note: Outputs from both approaches of your choice are required, and the results should be the same.)

#### 3.1. Implementation 1

#### 3.2. Implementation 2

### b)	Which one is easier to implement, in your opinion? Log the time taken for each query and compare the execution times: between DataFrame and Spark SQL, which is faster and why? Please include proper references. (Maximum 500 words.) 

### Some ideas on the comparison

Armbrust, M., Huai, Y., Liang, C., Xin, R., & Zaharia, M. (2015). Deep Dive into Spark SQL’s Catalyst Optimizer. Retrieved September 30, 2017, from https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Damji, J. (2016). A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. Retrieved September 28, 2017, from https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Data Flair (2017a). Apache Spark RDD vs DataFrame vs DataSet. Retrieved September 28, 2017, from http://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset

Prakash, C. (2016). Apache Spark: RDD vs Dataframe vs Dataset. Retrieved September 28, 2017, from http://why-not-learn-something.blogspot.com.au/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html

Xin, R., & Rosen, J. (2015). Project Tungsten: Bringing Apache Spark Closer to Bare Metal. Retrieved September 30, 2017, from https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html