# FIT5202 2025 S2 Assignment 1 : Analysing Australian Property Market Data

## Table of Contents
* [Part 1 : Working with RDD](#part-1)  
    - [1.1 Data Preparation and Loading](#1.1)  
    - [1.2 Data Partitioning in RDD](#1.2)  
    - [1.3 Query/Analysis](#1.3)  
* [Part 2 : Working with DataFrames](#2-dataframes)  
    - [2.1 Data Preparation and Loading](#2-dataframes)  
    - [2.2 Query/Analysis](#2.2)  
* [Part 3 :  RDDs vs DataFrame vs Spark SQL](#part-3)  

Note: Feel free to add Code/Markdown cells as you need.

# Part 1 : Working with RDDs (30%) <a class="anchor" name="part-1"></a>
## 1.1 Working with RDD
In this section, you will need to create RDDs from the given datasets, perform partitioning in these RDDs and use various RDD operations to answer the queries. 

1.1.1 Data Preparation and Loading <a class="anchor" name="1.1"></a>
1.	Write the code to create a SparkContext object using SparkSession. To create a SparkSession, you first need to build a SparkConf object that contains information about your application. Use Melbourne time as the session timezone. Give your application an appropriate name and run Spark locally with 4 cores on your machine.

In [1]:
# Import SparkConf class into program
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many working processors as logical cores on your machine
# If we want Spark to run locally with 'k' worker threads, we can specify as "local[k]".
master = "local[4]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Assignment1"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

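# Method 1: Using SparkSession
# "Australia/Melbourne" follows daylight saving time; a fixed offset such as "GMT+10" does not
spark = SparkSession.builder.config(conf=spark_conf).config("spark.sql.session.timeZone", "Australia/Melbourne").getOrCreate()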
sc = spark.sparkContext
sc.setLogLevel('ERROR')

1.1.2 Load the CSV and JSON files into multiple RDDs. 

In [2]:
import os
files = ["data/council.json", "data/nsw_property_price.csv", "data/property_purpose.json", "data/zoning.json"]
rdds = []  
headers = {}
for file in files:
    # get file extension
    ext = os.path.splitext(file)[1].lower()  
    # strip whitespace, trailing commas and braces from each line, then drop empty lines
    rdd = (
        sc.textFile(file)
          .map(lambda x: x.strip().rstrip(",").replace("{", "").replace("}", ""))
          .filter(lambda x: x != "")
    )

    if ext == ".json":
        rdds.append((rdd, "json", file))
    elif ext == ".csv":
        rdds.append((rdd, "csv", file))

1.1.3 For each RDD, remove the header rows and display the total count and the first 8 records.


In [3]:
# used GPT to assemble the messy logic together cleanly
import csv
from io import StringIO

def safe_dict(it):
    """Safely turn iterable of kv pairs into dict, skipping malformed entries."""
    d = {}
    for kv in it:
        if isinstance(kv, tuple) and len(kv) == 2:
            k, v = kv
            d[k] = v
    return d

def parse_csv_line(line: str):
    """Safely parse a CSV line, handling commas inside quoted fields."""
    reader = csv.reader(StringIO(line), quotechar='"', delimiter=',')
    return next(reader)

def process_rdd(rdd, ext, filename):
    # remove header
    header = rdd.first()
    clean_header = header.split("\\n", 1)[0]
    if clean_header.startswith('"') and not clean_header.endswith('"'):
        clean_header += '"'  # restore the closing quote
    headers[filename] = clean_header
    lines = rdd.filter(lambda s: s != header)

    
    if ext == "json":        
        # robust key:value parsing (handles both "key : value" and "key": "value")
        kv = (
            lines
            .map(lambda s: s.strip())
            .filter(lambda s: ":" in s)              # accept any colon, with or without spaces
            .map(lambda s: s.split(":", 1))          # split once, keep right side intact
            .filter(lambda kv: len(kv) == 2)         # keep only well-formed pairs
            .map(lambda kv: (kv[0].strip(' "\',{}'), kv[1].strip(' "\',{}')))
        )

        # group into records (assumes each record spans 2 lines)
        grouped = (
            kv.zipWithIndex()
              .map(lambda x: (x[1] // 2, x[0]))
              .groupByKey()
              .mapValues(safe_dict)
              .values()
        )

        print(f"{filename}: grouped={grouped.count()}")
        return grouped
    
    if ext == "csv":
        # use robust CSV parsing instead of naive split
        fieldnames = parse_csv_line(headers[filename])
        lines = (
            lines.map(lambda row: dict(zip(fieldnames, parse_csv_line(row))))
        )
        
#         fieldnames = [h.strip().strip('"') for h in headers[filename].split(",")]
#         lines = (
#             lines.map(lambda row: dict(
#                 zip(
#                     fieldnames,
#                     [val.strip().strip('"') for val in row.split(",")]
#                 )
#             ))
#         )
        print(f"{filename}: grouped={lines.count()}")
        return lines

# Replace items in rdds
rdds = [
    (process_rdd(rdd, ext, filename), ext, filename)
    for (rdd, ext, filename) in rdds
]

for fname, header in headers.items():
    print("Header for", fname, "=", header)
    
for (rdd, ext, filename) in rdds:
    print(rdd.take(2))

data/council.json: grouped=220
data/nsw_property_price.csv: grouped=4854814
data/property_purpose.json: grouped=865
data/zoning.json: grouped=71
Header for data/council.json = "council_id,council_name"
Header for data/nsw_property_price.csv = "property_id","purchase_price","address","post_code","property_type","strata_lot_number","property_name","area","area_type","iso_contract_date","iso_settlement_date","nature_of_property","legal_description","id","council_id","purpose_id","zone_id"
Header for data/property_purpose.json = "purpose_id, primary_purpose"
Header for data/zoning.json = "zoning_id, zoning"
[{'council_id': '1', 'council_name': '003'}, {'council_id': '3', 'council_name': '013'}]
[{'property_id': '4270509', 'purchase_price': '1400000.00', 'address': '8 C NYARI RD, KENTHURST', 'post_code': '2156', 'property_type': 'house', 'strata_lot_number': '', 'property_name': '', 'area': '2.044', 'area_type': 'H', 'iso_contract_date': '2023-12-14', 'iso_settlement_date': '2024-02-14', 'n

1.1.4 Drop records with invalid information: purpose_id or council_id is null, empty, or 0.

In [4]:
import re

def valid_record(rec):
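    # NOTE: this validates every *_id field (including property_id and zone_id),
    # which is stricter than the task wording (purpose_id or council_id only).
    for k, v in rec.items():
        if k.endswith("_id"):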
            if v is None:
                return False
            s = str(v).strip()

            # Must be digits only
            if not re.fullmatch(r"[0-9]+", s):
                return False

            try:
                if int(s) < 1:
                    return False
            except ValueError:
                return False
    return True


def filter_rdd(rdd, ext, filename):
    filtered = rdd.filter(valid_record)
    # invalid rows (inverse filter)
    invalid = rdd.filter(lambda rec: not valid_record(rec))
    
    print(f"{filename}: raw={rdd.count()}, filtered={filtered.count()}")
    # Show invalid data
    print(f"--- Example invalid rows from {filename} ---")
    for row in invalid.take(5):
        print(row)

    return filtered

# Apply filtering
rdds = [
    (filter_rdd(rdd, ext, filename), ext, filename)
    for (rdd, ext, filename) in rdds
]


data/council.json: raw=220, filtered=220
--- Example invalid rows from data/council.json ---
data/nsw_property_price.csv: raw=4854814, filtered=4828278
--- Example invalid rows from data/nsw_property_price.csv ---
{'property_id': '', 'purchase_price': '195000.00', 'address': '14 FAIRFAX RD, BELLEVUE HILL', 'post_code': '2023', 'property_type': 'house', 'strata_lot_number': '9', 'property_name': '', 'area': '', 'area_type': '', 'iso_contract_date': '2022-11-11', 'iso_settlement_date': '2022-11-17', 'nature_of_property': '3', 'legal_description': '9/SP104887', 'id': '640243', 'council_id': '219', 'purpose_id': '3746', 'zone_id': '1'}
{'property_id': '', 'purchase_price': '410000.00', 'address': ' FAIRFAX RD RD, BELLEVUE HILL', 'post_code': '2023', 'property_type': 'house', 'strata_lot_number': '10', 'property_name': 'LOT 10, 14', 'area': '', 'area_type': '', 'iso_contract_date': '2022-11-15', 'iso_settlement_date': '2022-11-28', 'nature_of_property': '3', 'legal_description': '10/SP10488

### 1.2 Data Partitioning in RDD <a class="anchor" name="1.2"></a>
1.2.1 For each RDD, using Spark’s default partitioning, print out the total number of partitions and the number of records in each partition

In [5]:
for rdd, ext, filename in rdds:
    print('Default partitions: ',rdd.getNumPartitions())

Default partitions:  2
Default partitions:  19
Default partitions:  2
Default partitions:  2


1.2.2 Answer the following questions:   
a) How many partitions do the above RDDs have?  
b) How is the data in these RDDs partitioned by default, when we do not explicitly specify any partitioning strategy? Can you explain why it is partitioned in this number?   
c) Assuming we are querying the dataset based on <strong> Property Price</strong>, can you think of a better strategy for partitioning the data based on your available hardware resources?  

Answer for a)  
The CSV file has 19 partitions, while each of the JSON files has 2 partitions.  

Answer for b)  
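By default, `sc.textFile` splits each file by size rather than by content. The CSV appears to be cut into chunks of up to about 32 MB, which is consistent with the default block size Hadoop uses for the local file system and gives 19 partitions. The JSON files are far smaller than one chunk, but `textFile` requests at least `min(defaultParallelism, 2)` = 2 partitions, so each of them is still split in two.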

Your answer for c
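
For question c), one option worth considering is range partitioning on the price, with one partition per local core. The block below is a minimal sketch of such a partition function in plain Python. The boundaries in `PRICE_BOUNDS` and the name `price_partition` are hypothetical; in practice the boundaries would be chosen from sampled quantiles so that each partition holds a similar number of records.

```python
from bisect import bisect_right

# Hypothetical boundaries: 3 bounds give 4 ranges, one per core of local[4].
PRICE_BOUNDS = [500_000, 1_000_000, 2_000_000]

def price_partition(price):
    """Return the index (0..3) of the price range that contains `price`."""
    return bisect_right(PRICE_BOUNDS, price)

# Prices in the same range map to the same partition index
for p in (499_999, 500_000, 1_400_000, 5_000_000):
    print(p, "->", price_partition(p))
```

A function of this shape could be passed as the `partitionFunc` argument of `RDD.partitionBy` on a pair RDD keyed by price, so that a query over a price range only touches the partitions covering that range.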

1.2.3 Create a user-defined function (UDF) to transform the date strings from ISO format (YYYY-MM-DD) (e.g. 2025-01-01) to Australian format (DD/Mon/YYYY) (e.g. 01/Jan/2025), then call the UDF to transform two date columns (iso_contract_date and iso_settlement_date) to contract_date and settlement_date.

In [6]:
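from datetime import datetime
from typing import Optional

def iso_to_aus(iso_date: str) -> Optional[str]:
    """Convert 'YYYY-MM-DD' to 'DD/Mon/YYYY'; return None if the input is missing or malformed."""
    try:
        dt = datetime.strptime(iso_date, "%Y-%m-%d")
        return dt.strftime("%d/%b/%Y")
    except (TypeError, ValueError):
        return None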

def transform_property_price_rdd(rdd, filename):
    """Transform date fields in NSW property price RDD of dicts."""
    if filename != "data/nsw_property_price.csv":
        return rdd  # skip other files

    def transform_row(row):
        new_row = dict(row)  # copy
        if "iso_contract_date" in new_row:
            new_row["contract_date"] = iso_to_aus(new_row.pop("iso_contract_date"))
        if "iso_settlement_date" in new_row:
            new_row["settlement_date"] = iso_to_aus(new_row.pop("iso_settlement_date"))
        return new_row

    transformed = rdd.map(transform_row)

    # Debug output
    print("Example transformed row:", transformed.first())
    return transformed

rdds = [
    (transform_property_price_rdd(rdd, filename), ext, filename)
    if filename == "data/nsw_property_price.csv"
    else (rdd, ext, filename)
    for (rdd, ext, filename) in rdds
]



Example transformed row: {'property_id': '4270509', 'purchase_price': '1400000.00', 'address': '8 C NYARI RD, KENTHURST', 'post_code': '2156', 'property_type': 'house', 'strata_lot_number': '', 'property_name': '', 'area': '2.044', 'area_type': 'H', 'nature_of_property': 'V', 'legal_description': '2/1229857', 'id': '142', 'council_id': '200', 'purpose_id': '9922', 'zone_id': '53', 'contract_date': '14/Dec/2023', 'settlement_date': '14/Feb/2024'}


### 1.3 Query/Analysis <a class="anchor" name="1.3"></a>
For this part, write relevant RDD operations to answer the following queries.

1.3.1 Extract the Month (Jan-Dec) information and print the total number of sales by contract date for each Month. (5%)

In [8]:
# from datetime import datetime

# def extract_month(date_str: str) -> str:
#     try:
#         dt = datetime.strptime(date_str, "%d/%b/%Y")  # AUS format
#         return dt.strftime("%b-%Y")  # "Jan-2025"
#     except Exception:
#         return None

# def parse_month(month_str: str) -> datetime:
#     return datetime.strptime(month_str, "%b-%Y")

# def safe_float(x: str) -> float:
#     try:
#         return float(x)
#     except Exception:
#         return 0.0

# # Filter out header-like rows and bad values
# clean_rdd = dict_rdd.filter(
#     lambda row: row["contract_date"] not in (None, "", "contract_date") 
#                 and row["purchase_price"] not in (None, "", "purchase_price")
# )

# # Extract (month, purchase_price)
# month_price_rdd = clean_rdd.map(
#     lambda row: (
#         extract_month(row["contract_date"]),
#         safe_float(row["purchase_price"])
#     )
# ).filter(lambda x: x[0] is not None)

# # Reduce by key (sum per month)
# monthly_totals = month_price_rdd.reduceByKey(lambda a, b: a + b)

# # Collect and sort chronologically
# monthly_totals_sorted = sorted(
#     monthly_totals.collect(),
#     key=lambda x: parse_month(x[0])
# )

# for month, total in monthly_totals_sorted:
#     print(month, "=", total)



In [9]:
from datetime import datetime

def extract_month_name(date_str: str) -> str:
    try:
        dt = datetime.strptime(date_str, "%d/%b/%Y")  # AUS format
        return dt.strftime("%b")   # "Jan"
    except Exception:
        return None

def safe_float(x: str) -> float:
    try:
        return float(x)
    except Exception:
        return 0.0
    
dict_rdd = next(rdd for (rdd, ext, fname) in rdds if fname == "data/nsw_property_price.csv")

# Filter out header remnants / blanks
clean_rdd = dict_rdd.filter(
    lambda row: row["contract_date"] not in (None, "", "contract_date")
)

# Map to (month, (count, total_purchase_price))
month_metrics_rdd = clean_rdd.map(
    lambda row: (
        extract_month_name(row["contract_date"]),
        (1, safe_float(row["purchase_price"]))
    )
).filter(lambda x: x[0] is not None)

# Reduce: sum counts and purchase prices
monthly_metrics = month_metrics_rdd.reduceByKey(
    lambda a, b: (a[0] + b[0], a[1] + b[1])
)

# Collect and sort by calendar order
month_order = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
monthly_metrics_sorted = sorted(
    monthly_metrics.collect(),
    key=lambda x: month_order.index(x[0])
)

# Print results
for month, (count, total) in monthly_metrics_sorted:
    print(f"{month}: Sales={count}, Total Purchase Price={total}")


Jan: Sales=231293, Total Purchase Price=149286257291.0
Feb: Sales=385415, Total Purchase Price=283257505935.0
Mar: Sales=460686, Total Purchase Price=362170672422.0
Apr: Sales=382178, Total Purchase Price=288007815243.0
May: Sales=449308, Total Purchase Price=374297279808.0
Jun: Sales=407721, Total Purchase Price=369932834504.0
Jul: Sales=404384, Total Purchase Price=356746327190.0
Aug: Sales=413422, Total Purchase Price=355777547916.0
Sep: Sales=423248, Total Purchase Price=346086151514.0
Oct: Sales=432387, Total Purchase Price=346873898401.0
Nov: Sales=446805, Total Purchase Price=414619066176.0
Dec: Sales=390848, Total Purchase Price=432696382032.0


1.3.2 Which 5 councils have the largest number of houses? Show their name and the total number of houses. (Note: Each house may appear multiple times if there are more than one sales, you should only count them once.) (5%)

In [10]:
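# Step 1: Extract (council_id, property_id) pairs
# NOTE: no property_type filter is applied here, so any non-house property types
# in the data are counted as well.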
council_property_rdd = dict_rdd.map(
    lambda row: (row["council_id"], row["property_id"])
)

# Step 2: Deduplicate by council_id + property_id
unique_council_property_rdd = council_property_rdd.distinct()

# Step 3: Count unique properties per council
council_house_counts = unique_council_property_rdd \
    .map(lambda x: (x[0], 1)) \
    .reduceByKey(lambda a, b: a + b)

# Step 4: Load council_id → council_name mapping from JSON RDD
# Assume json_rdd is already parsed into dicts like: {"council_id": "123", "council_name": "City Council"}
json_rdd = next(rdd for (rdd, ext, fname) in rdds if fname == "data/council.json")
council_name_map = json_rdd.map(
    lambda row: (row["council_id"], row["council_name"])
)

# Step 5: Join counts with names
council_with_names = council_house_counts.join(council_name_map)
# => (council_id, (house_count, council_name))

# Step 6: Get top 5 councils by number of houses
top5_councils = council_with_names.takeOrdered(
    5,
    key=lambda x: -x[1][0]   # sort by house_count descending
)

# Print results
for council_id, (count, name) in top5_councils:
    print(f"{name} (council_id={council_id}): {count} houses")


BLACKTOWN (council_id=100): 91213 houses
LAKE MACQUARIE (council_id=157): 59117 houses
THE HILLS SHIRE (council_id=200): 55032 houses
LIVERPOOL (council_id=162): 49053 houses
PENRITH (council_id=183): 46840 houses


# Part 2 : Working with DataFrames (45%) <a class="anchor" name="2-dataframes"></a>
In this section, you need to load the given datasets into PySpark DataFrames and use DataFrame functions to answer the queries.
### 2.1 Data Preparation and Loading

2.1.1. Load the CSV/JSON files into separate dataframes. When you create your dataframes, please refer to the metadata file and think about the appropriate data type for each column.

In [11]:
import os
from pyspark.sql.functions import explode
files = ["data/council.json", "data/nsw_property_price.csv", "data/property_purpose.json", "data/zoning.json"]
dfs = []  
for file in files:
    ext = os.path.splitext(file)[1].lower()  # get file extension
    
    if ext == ".json":
        df = spark.read.option("multiline", "true").json(file)
        # If schema shows a single array column, explode it
        df_flat = df.select(explode(df[df.columns[0]]).alias("data"))

        # Now pull fields out of the struct
        df_flat = df_flat.select("data.*")


        dfs.append((df_flat, "json", file))
    elif ext == ".csv":
        # rdd = spark.read.csv(file, header=True, inferSchema=True)
        df = spark.read.csv(file, header=True, inferSchema=True)
        dfs.append((df, "csv", file))

2.1.2 Display the schema of the dataframes.

In [12]:
for df, ext, filename in dfs:
    print(df)
    df.printSchema()

DataFrame[council_id: bigint, council_name: string]
root
 |-- council_id: long (nullable = true)
 |-- council_name: string (nullable = true)

DataFrame[property_id: int, purchase_price: double, address: string, post_code: string, property_type: string, strata_lot_number: string, property_name: string, area: string, area_type: string, iso_contract_date: string, iso_settlement_date: string, nature_of_property: string, legal_description: string, id: string, council_id: string, purpose_id: string, zone_id: int]
root
 |-- property_id: integer (nullable = true)
 |-- purchase_price: double (nullable = true)
 |-- address: string (nullable = true)
 |-- post_code: string (nullable = true)
 |-- property_type: string (nullable = true)
 |-- strata_lot_number: string (nullable = true)
 |-- property_name: string (nullable = true)
 |-- area: string (nullable = true)
 |-- area_type: string (nullable = true)
 |-- iso_contract_date: string (nullable = true)
 |-- iso_settlement_date: string (nullable = tr

When the dataset is large, do you need all columns? How to optimize memory usage? Do you need a customized data partitioning strategy? (Note: Think about those questions but you don’t need to answer these questions.)

### 2.2 Query/Analysis  <a class="anchor" name="2.2"></a>
Implement the following queries using dataframes. You need to be able to perform operations like transforming, filtering, sorting, joining and group by using the functions provided by the DataFrame API. For each task, display the first 5 results where no output is specified.

2.2.1. The area column has three unit types (H, A and M): 1 H is one hectare = 10000 sqm, 1 A is one acre = 4000 sqm, 1 M is one sqm. Unify the unit to sqm and create a new column called area_sqm. 

In [None]:
from pyspark.sql.functions import col, trim, length, regexp_replace

def filter_df(df, ext, filename):
    id_cols = [c for c in df.columns if c.endswith("_id")]

    condition = None
    for id_col in id_cols:
        # Force to string and trim
        id_str = trim(col(id_col).cast("string"))

        # Must be only digits (no "/" or other chars)
        # length > 0 to reject empty
        this_cond = (
            id_str.isNotNull() &
            (length(id_str) > 0) &
            id_str.rlike("^[0-9]+$") &
            (id_str.cast("bigint") >= 1)
        )

        # Combine conditions: ALL *_id columns must satisfy
        condition = this_cond if condition is None else (condition & this_cond)

    filtered_df = df.filter(condition) if condition is not None else df

    print(f"{filename}: raw={df.count()}, filtered={filtered_df.count()}")
    return filtered_df


# Apply filtering to all DataFrames
dfs = [
    (filter_df(df, ext, filename), ext, filename)
    for (df, ext, filename) in dfs
]

In [None]:
from pyspark.sql.functions import when, col
def normalize_area(df, ext, filename):
    """
    Convert area + area_type to a unified area_sqm column in sqm.
    """
    if ext != "csv":
        return df
    df.show(3)
    if "area" in df.columns and "area_type" in df.columns:
        # area is read as a string; cast once so every branch yields a number
        area = col("area").cast("double")
        df = df.withColumn(
            "area_sqm",
            when(col("area_type") == "H", area * 10000)
            .when(col("area_type") == "A", area * 4000)
            .when(col("area_type") == "M", area)
            .otherwise(None)
        )
        print(f"{filename}: added area_sqm column")
    else:
        print(f"{filename}: no area/area_type columns found")

    return df


# Apply to all dataframes
dfs = [
    (normalize_area(df, ext, filename), ext, filename)
    for (df, ext, filename) in dfs
]

2.2.2. <pre>The top five property types are: Residence, Vacant Land, Commercial, Farm and Industrial.
However, for historical reasons, they may have different strings in the database. Please update the primary_purpose with the following rules:
a)	Any purpose that has “HOME”, “HOUSE”, “UNIT” is classified as “Residence”;
b)	“Warehouse”, “Factory”,  “INDUST” should be changed to “Industrial”;
c)	Anything that contains “FARM”(i.e. FARMING), should be changed to “FARM”;
d)	“Vacant”, “Land” should be “Vacant Land”;
e)	Anything that has “COMM”, “Retail”, “Shop” or “Office” is “Commercial”.
f)	All remaining properties, including null and empty purposes, are classified as “Others”.
Show the count of each type in a table.
(note: Some properties are multi-purpose, e.g. “House & Farm”, it’s fine to count them multiple times.)
</pre>
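
The rules in 2.2.2 can be sketched in plain Python before porting them to the DataFrame API. This is a sketch under assumptions, not the required solution: the names `RULES` and `classify` are hypothetical, matching is case-insensitive, and a keyword matches when a word starts with it, so that “FARMING” counts as “FARM” while “WAREHOUSE” is not also counted as “HOUSE”. The labels follow the first sentence of the task, so rule c) yields “Farm”.

```python
import re
from collections import Counter

# Category -> keyword prefixes, taken from rules a) to e) of the task text
RULES = {
    "Residence": ("HOME", "HOUSE", "UNIT"),
    "Industrial": ("WAREHOUSE", "FACTORY", "INDUST"),
    "Farm": ("FARM",),
    "Vacant Land": ("VACANT", "LAND"),
    "Commercial": ("COMM", "RETAIL", "SHOP", "OFFICE"),
}

def classify(purpose):
    """Return every category that one primary_purpose string falls into."""
    # None and empty strings produce no words and therefore fall through to "Others"
    words = re.findall(r"[A-Z]+", (purpose or "").upper())
    labels = [
        label
        for label, prefixes in RULES.items()
        if any(word.startswith(prefixes) for word in words)
    ]
    return labels or ["Others"]

# Made-up purposes for illustration; multi-purpose strings are counted once per category
sample = ["HOUSE & FARM", "Warehouse", "vacant land", "", None, "RETAIL SHOP", "CHURCH"]
sample_counts = Counter(label for p in sample for label in classify(p))
print(sample_counts)
```

In Spark, a function like `classify` could be wrapped in a UDF that returns an array of strings and combined with `explode` before a `groupBy(...).count()`; built-in column expressions would likely be faster than a Python UDF.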

In [None]:
# edit the dataframe for primary_purpose to consolidate their IDs into the 6 property type IDs.
# for each word, it should strip whitespace (spaces)
# if it should have nothing else, then having any characters disqualifies it, unless it has &,and/,- in it, then anything goes
# by default, it can have other stuff
# Residence -> 7071. HOME, HOUSE, UNIT, has nothing else
# Industrial -> 4778. WAREHOUSE, FACTORY, INDUST
# Farm -> 2941. FARM
# Vacant Land -> 9922. VACANT, LAND
# Commercial -> 1704. COMM, RETAIL, SHOP, OFFICE
# Others -> 12000. Anything that doesn't match any of the above.
# Permutations of the above -> 12001 above



2.2.3 Find the top 20 properties that make the largest value gain, show their address, suburb, and value increased. To calculate the value gain, the property must have been sold multiple times; “value increase” is calculated as the last sold price – first sold price, regardless of the transactions in between. Print all 20 records.
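
The value-gain logic can be checked on a handful of rows in plain Python first. The sketch below assumes that sales are ordered by `iso_contract_date` and that the suburb is the text after the last comma of `address`; the helper names and the sample rows are hypothetical.

```python
from collections import defaultdict

def suburb_of(address):
    """Text after the last comma, assumed to be the suburb."""
    return address.rsplit(",", 1)[-1].strip()

def top_value_gains(sales, n=20):
    """Return up to n (address, suburb, gain) tuples, largest gain first."""
    by_property = defaultdict(list)
    for sale in sales:
        by_property[sale["property_id"]].append(sale)

    gains = []
    for rows in by_property.values():
        if len(rows) < 2:          # a gain needs at least two sales
            continue
        # ISO dates (YYYY-MM-DD) sort chronologically as plain strings
        rows.sort(key=lambda r: r["iso_contract_date"])
        first, last = rows[0], rows[-1]
        gain = last["purchase_price"] - first["purchase_price"]
        gains.append((last["address"], suburb_of(last["address"]), gain))
    return sorted(gains, key=lambda g: g[2], reverse=True)[:n]

# Made-up rows for illustration only
sales = [
    {"property_id": 1, "address": "1 A ST, ALPHA", "purchase_price": 400.0, "iso_contract_date": "2020-03-01"},
    {"property_id": 2, "address": "2 B RD, BETA", "purchase_price": 500.0, "iso_contract_date": "2012-01-01"},
    {"property_id": 1, "address": "1 A ST, ALPHA", "purchase_price": 100.0, "iso_contract_date": "2010-01-01"},
    {"property_id": 1, "address": "1 A ST, ALPHA", "purchase_price": 250.0, "iso_contract_date": "2015-06-01"},
    {"property_id": 2, "address": "2 B RD, BETA", "purchase_price": 450.0, "iso_contract_date": "2019-01-01"},
    {"property_id": 3, "address": "3 C AVE, GAMMA", "purchase_price": 900.0, "iso_contract_date": "2018-01-01"},
]
result = top_value_gains(sales)
print(result)
```

The same idea maps onto the DataFrame API as a group-by on `property_id` that keeps the earliest and latest sale per property, followed by a descending sort on the difference.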

In [None]:
# Address has the key "address"
# Suburb is second half of the key "address" as denoted by after comma. create a new column in the dataframe for this.
# Value determined with the key "purchase_price", and it must be the same "property_id" showing up multiple times



2.2.4 For each season, plot the median house price trend over the years. Seasons in Australia are defined as: (Spring: Sep-Nov, Summer: Dec-Feb, Autumn: Mar-May, Winter: Jun-Aug). 
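
The season definition above can be expressed as a month lookup. The following plain-Python sketch groups prices by (season, year) and takes the median of each group. It assumes, for simplicity, that December is grouped with its own calendar year rather than with the following January and February; the names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Month number -> Australian season, as defined in the task text
SEASON_BY_MONTH = {
    9: "Spring", 10: "Spring", 11: "Spring",
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
}

def season_of(iso_date):
    """Map an ISO date string (YYYY-MM-DD) to its season name."""
    return SEASON_BY_MONTH[int(iso_date[5:7])]

def median_price_by_season_year(sales):
    """sales: iterable of (iso_contract_date, price). Returns {(season, year): median price}."""
    groups = defaultdict(list)
    for iso_date, price in sales:
        groups[(season_of(iso_date), int(iso_date[:4]))].append(price)
    return {key: median(prices) for key, prices in groups.items()}

# Made-up values for illustration only
sample = [
    ("2023-12-14", 1_400_000.0),
    ("2023-12-20", 1_000_000.0),
    ("2023-01-05", 800_000.0),
    ("2023-07-01", 600_000.0),
]
medians = median_price_by_season_year(sample)
print(medians)
```

In the DataFrame API, an aggregate such as `percentile_approx` (depending on the Spark version) could play the role of `median`, and the resulting small table can be collected and plotted with one line per season.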

2.2.5 (Open Question) Explore the dataset freely and plot one diagram of your choice. Which columns (at least 2) are highly correlated to the sales price? Discuss the steps of your exploration and the results. (No word limit, please keep concise.) 
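
One way to quantify "highly correlated" for numeric columns is the Pearson correlation coefficient. The block below is a plain-Python sketch of that formula on made-up values; the function name `pearson` is hypothetical, and categorical columns such as `property_type` would need to be encoded or summarised per group before a coefficient like this applies.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations, then the two deviation norms
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Made-up values: price rises with area in the first pair and falls in the second
r_up = pearson([100.0, 200.0, 300.0], [2.0, 4.0, 6.0])
r_down = pearson([100.0, 200.0, 300.0], [3.0, 2.0, 1.0])
print(r_up, r_down)
```

On the full dataset, `DataFrame.stat.corr` computes the same coefficient for two numeric columns without collecting the data to the driver.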

In [1]:
# I want to explore the dataset and identify which columns are highly correlated to sales price with a multivariate analysis. recall that the columns available are: 
# "property_id","purchase_price","address","post_code","property_type","strata_lot_number","property_name","area","area_type","iso_contract_date","iso_settlement_date","nature_of_property","legal_description","id","council_id","purpose_id","zone_id"
# Intuitively, I expect that the relevant columns for identifying purchase_price are address, post_code, property_type, strata_lot_number, area_sqm (as created earlier), iso_contract_date, nature_of_property, legal_description, council_id, purpose_id, zone_id


# I now want to plot these variables against purchase_price.







Write your discussion here.

# Part 3 : RDDs vs DataFrame vs Spark SQL (25%) <a class="anchor" name="part-3"></a>
Implement the following complex query using two of the three approaches (RDD, DataFrame, Spark SQL) separately. Log the time taken for each query in each approach using the “%%time” built-in magic command in Jupyter Notebook and discuss the performance difference between the 2 approaches of your choice.
(notes: You can write a multi-step query or a single complex query, the choice is yours. You can reuse the data frame in Part 2.)

#### Complex Query:
<pre>
A property investor wants to understand whether the property price and the settlement date are correlated. Here are the conditions:
1)	The investor is only interested in the last 2 years of the dataset.
2)	The investor is looking at houses under $2 million.
3)	Perform a bucketing of the settlement period (settlement date – contract date) into ranges (15, 30, 45, 60, 90 days).
4)	Perform a bucketing of property prices in $500K steps (e.g. 0-$500K, $500K-$1M, $1M-$1.5M, $1.5M-$2M)
5)	Count the number of transactions in each combination and print the result in the following format
(Note: It’s fine to count the same property multiple times in this task, it’s based on sales transactions).
(Note: You shall show the full table with 40 rows, 2 years * 4 price buckets * 5 settlement buckets; 0 count should be displayed as 0, not omitted.)
</pre>
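
The two bucketing steps and the zero-filled 40-row table can be prototyped in plain Python before choosing an RDD, DataFrame or Spark SQL implementation. The sketch below rests on several assumptions that the task text leaves open: day buckets are labelled by their upper bound, settlements longer than 90 days are excluded, and the year of a transaction is taken from the settlement date. All names and sample rows are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from datetime import date
from itertools import product

DAY_EDGES = [15, 30, 45, 60, 90]                       # upper bound of each bucket
DAY_LABELS = ["0-15", "16-30", "31-45", "46-60", "61-90"]
PRICE_LABELS = ["0-$500K", "$500K-$1M", "$1M-$1.5M", "$1.5M-$2M"]

def day_bucket(contract, settlement):
    """Bucket label for the settlement period, or None if outside 0..90 days."""
    days = (settlement - contract).days
    if days < 0 or days > DAY_EDGES[-1]:
        return None
    return DAY_LABELS[bisect_left(DAY_EDGES, days)]

def price_bucket(price):
    """Bucket label for a price under $2 million, or None otherwise."""
    if price < 0 or price >= 2_000_000:
        return None
    return PRICE_LABELS[int(price // 500_000)]

def count_combinations(sales, years):
    """sales: iterable of (contract_date, settlement_date, price)."""
    counts = Counter()
    for contract, settlement, price in sales:
        key = (settlement.year, price_bucket(price), day_bucket(contract, settlement))
        if key[0] in years and None not in key:
            counts[key] += 1
    # Walk every combination so that empty ones are reported with a count of 0
    return [(y, p, d, counts[(y, p, d)]) for y, p, d in product(years, PRICE_LABELS, DAY_LABELS)]

# Made-up transactions for illustration only
sales = [
    (date(2023, 12, 14), date(2024, 2, 14), 1_400_000.0),   # 62 days
    (date(2023, 3, 1), date(2023, 3, 10), 450_000.0),       # 9 days
    (date(2023, 5, 1), date(2023, 5, 20), 2_500_000.0),     # excluded: price is not under $2M
]
rows = count_combinations(sales, [2023, 2024])
for row in rows:
    print(row)
```

The zero-fill step is the part that is easy to miss in Spark: a plain `groupBy` only returns combinations that occur, so the full grid of year, price bucket and day bucket has to be generated separately and left-joined with the counts.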

### a)	Implement the above query using two approaches of your choice separately and print the results. (Note: Outputs from both approaches of your choice are required, and the results should be the same.)

#### 3.1. Implementation 1

#### 3.2. Implementation 2

### b)	Which one is easier to implement, in your opinion? Log the time taken for each query, and observe the query execution time, among DataFrame and SparkSQL, which is faster and why? Please include proper references. (Maximum 500 words.) 

### Some ideas on the comparison

Armbrust, M., Huai, Y., Liang, C., Xin, R., & Zaharia, M. (2015). Deep Dive into Spark SQL’s Catalyst Optimizer. Retrieved September 30, 2017, from https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Damji, J. (2016). A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. Retrieved September 28, 2017, from https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Data Flair (2017a). Apache Spark RDD vs DataFrame vs DataSet. Retrieved September 28, 2017, from http://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset

Prakash, C. (2016). Apache Spark: RDD vs Dataframe vs Dataset. Retrieved September 28, 2017, from http://why-not-learn-something.blogspot.com.au/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html

Xin, R., & Rosen, J. (2015). Project Tungsten: Bringing Apache Spark Closer to Bare Metal. Retrieved September 30, 2017, from https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html