# Housing prices prediction from real estate assessments (Decision Tree Regression)

## Overview

This notebook shows how to predict housing prices (real estate assessed values) based on location and other characteristics using Decision Tree Regression.

#### **Steps**
Using Spark, 
1) It reads the table [Real Estate Sales](https://catalog.data.gov/dataset/real-estate-sales-2001-2018) from the **public_datasets** dataset located in the [metastore](../gcp_services/README.md) (the notebook must be connected to the public metastore to use this specific dataset).  
   This table lists real estate sales with a sales price of $2,000 or greater that occurred between October 1 and September 30 of each year (2001 to 2020).  
   For each sale record, the file includes information such as town, property address, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment.  
2) It processes the dataset to select features and trains the ML model (fits a decision tree regression model) to predict a target value.  
3) It evaluates the model and plots the results.

### Setup

In [None]:
#### Import dependencies
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import round, desc, corr, col

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import Bucketizer, StringIndexer, VectorAssembler
from pyspark.ml.linalg import Vectors

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder \
    .appName("Housing prices prediction with Decision Tree Regression") \
    .enableHiveSupport() \
    .getOrCreate()

In [None]:
raw_dataset = spark.read.table("public_datasets.real_estate_sales")

### Pre-process dataset / filter values to increase quality of the dataset

In [None]:
#### Filters
filters = [
"opm_remarks IS NOT NULL AND opm_remarks != 'Unknown'",
"non_use_code IS NOT NULL AND non_use_code != 'Unknown'",
"assessor_remarks IS NOT NULL AND assessor_remarks != 'Unknown'",
"residential_type IS NOT NULL AND residential_type != 'Unknown'",
"property_type IS NOT NULL AND property_type != 'Unknown'",
"assessed_value IS NOT NULL",
"sale_amount IS NOT NULL"
]
filters = " AND ".join(filters)
filtered_dataset = raw_dataset.filter(filters)

#### Get only data from top towns
top_towns = filtered_dataset.groupBy("town").count().orderBy(desc("count")).limit(30)
towns_list = top_towns.select("town").collect()
towns_list = [town[0] for town in towns_list]
filtered_dataset = filtered_dataset.filter(col("town").isin(towns_list))

In [None]:
#### Filter assessed_value outliers
def get_ranges(df, column_name, num_ranges=10):
    min_value = df.agg({column_name: 'min' }).collect()[0][0]
    max_value = df.agg({column_name: 'max' }).collect()[0][0]
    ranges = []
    ranges.append(min_value)
    for i in range(num_ranges):
        end = min_value + ((i + 1) * (max_value - min_value) / num_ranges)
        ranges.append(end)
    return ranges

ranges = get_ranges(filtered_dataset, "assessed_value", num_ranges=10)
bucketizer = Bucketizer(splits=ranges, inputCol="assessed_value", outputCol="assessed_value_ranges_index")
with_split = bucketizer.transform(filtered_dataset)
range_count = with_split.groupBy("assessed_value_ranges_index").count()

print("Range list: ", ranges)
range_count.show()

filtered_dataset = with_split.filter("assessed_value_ranges_index == 0")

filtered_dataset.show()

|serial_number|list_year|date_recorded|   town|             address|assessed_value|sale_amount|sales_ratio|property_type|residential_type|        non_use_code|    assessor_remarks|         opm_remarks|            location|
|-------------|---------|-------------|-------|--------------------|--------------|-----------|-----------|-------------|----------------|--------------------|--------------------|--------------------|--------------------|
|       200594|     2020|   2021-02-16|Danbury|        8 HICKORY ST|      121600.0|   146216.0|  0.8316463|  Residential|   Single Family|          25 - Other|              I11192|HOUSE HAS SETTLED...|{-73.44696, 41.41...|
|       200562|     2020|   2021-02-03|Danbury|         19  MILL RD|      263600.0|   415000.0|  0.6351807|  Residential|   Single Family|          25 - Other|AFFORDABLE HOUSIN...|INCORRECT DATA PE...|{-73.53692, 41.38...|
|       200968|     2020|   2021-05-24|Danbury|    4A FLIRTATION DR|      205700.0|   515000.0|  0.3994175|  Residential|   Single Family|07 - Change in Pr...|              B17008|UPDATED KITCHEN P...|        {null, null}|
|       200260|     2020|   2020-11-23|Danbury|32 COALPIT HILL R...|       84900.0|   181778.0|  0.4670532|  Residential|           Condo|          25 - Other|            J16087-4|  MULTIPLE UNIT SALE|{-73.43796, 41.38...|
|       200262|     2020|   2020-11-23|Danbury|32 COALPIT HILL R...|       84900.0|   181778.0|  0.4670532|  Residential|           Condo|          25 - Other|            J16087-6|  MULTIPLE UNIT SALE|{-73.43796, 41.38...|
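
To make the outlier filter concrete: `get_ranges` builds equal-width split points between the column minimum and maximum, so keeping only bucket index 0 retains the values that fall in the lowest tenth of that span. Below is a minimal plain-Python sketch of the same arithmetic on made-up values. `equal_width_splits` and `bucket_index` are hypothetical helpers written for this illustration, not Spark APIs.

```python
from bisect import bisect_right

def equal_width_splits(values, num_ranges=10):
    """Return num_ranges + 1 evenly spaced split points from min to max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_ranges
    return [lo + i * width for i in range(num_ranges + 1)]

def bucket_index(value, splits):
    """Index of the half-open bucket [splits[i], splits[i+1]) holding value.

    The maximum value is placed in the last bucket, mirroring how Spark's
    Bucketizer makes the final range inclusive of its upper bound.
    """
    last = len(splits) - 2
    return min(bisect_right(splits, value) - 1, last)

# Made-up assessed values: one extreme record stretches the whole range.
values = [50_000.0, 120_000.0, 300_000.0, 900_000.0, 10_000_000.0]
splits = equal_width_splits(values)
kept = [v for v in values if bucket_index(v, splits) == 0]
print(kept)
```

Because the buckets are equal in width rather than equal in count, a single extreme value can push almost every other record into bucket 0, which is what makes this a coarse outlier filter.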

### Exploratory Data Analysis

In [None]:
# Count, Mean, Min, Max of numeric columns
numeric_columns = ["list_year","assessed_value","sale_amount"]
filtered_dataset.select(numeric_columns).describe().select('summary', *[round(c, 2).alias(c) for c in numeric_columns]).show()

|summary|list_year|assessed_value|sale_amount|
|-------|---------|--------------|-----------|
|  count|    755.0|         755.0|      755.0|
|   mean|   2017.9|     132269.29|  212040.05|
| stddev|     1.37|      69455.55|   273021.1|
|    min|   2009.0|        6650.0|     1500.0|
|    max|   2020.0|      337630.0|  3000000.0|

Here we use "assessed_value" as the prediction target instead of "sale_amount" because sale_amount has a much higher standard deviation (273021.1 versus 69455.55 in the summary above) and a far wider range.
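
One hedged way to compare the spread of the two columns on a common scale is the coefficient of variation (standard deviation divided by mean). The sketch below reuses only the figures from the summary table above. The helper name is an illustration and not part of the notebook.

```python
# Figures copied from the describe() output above.
summary = {
    "assessed_value": {"mean": 132269.29, "stddev": 69455.55},
    "sale_amount": {"mean": 212040.05, "stddev": 273021.1},
}

def coefficient_of_variation(stats):
    """Relative spread: standard deviation expressed as a fraction of the mean."""
    return stats["stddev"] / stats["mean"]

cv = {name: coefficient_of_variation(stats) for name, stats in summary.items()}
for name, value in cv.items():
    print(f"{name}: {value:.2f}")
```

On these figures the relative spread of sale_amount is more than twice that of assessed_value, which is consistent with treating assessed_value as the less noisy target.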

In [None]:
# Towns
filtered_dataset.groupBy("town").count().orderBy(desc("count")).limit(10).show(10,100)
print(f'Number of distinct towns: {filtered_dataset.select("town").distinct().count()}')
print(f'Number of rows: {filtered_dataset.select("town").count()}')

In [None]:
# Property Type
filtered_dataset.groupBy("property_type").count().orderBy(desc("count")).limit(12).show(12,100)
print(f'Number of rows: {filtered_dataset.select("property_type").count()}')
print(f'Number of distinct property types: {filtered_dataset.select("property_type").distinct().count()}')

In [None]:
# Residential Type
filtered_dataset.groupBy("residential_type").count().orderBy(desc("count")).limit(12).show(12,100)
print(f'Number of rows: {filtered_dataset.select("residential_type").count()}')
print(f'Number of distinct residential types: {filtered_dataset.select("residential_type").distinct().count()}')

In [None]:
# Years
filtered_dataset.groupBy("list_year").count().orderBy(desc("count")).limit(10).show(10,100)

### Process dataset to create features

In [None]:
columns = ['assessed_value','town','list_year','property_type','residential_type']
label_column = 'assessed_value'
categorical_columns = ['town','list_year','property_type','residential_type']
feature_columns = ['indexed_town','indexed_list_year','indexed_property_type','indexed_residential_type']

#### Select only some columns
sub_dataset = filtered_dataset.select(*columns)

In [None]:
sub_dataset.show()

|assessed_value|    town|list_year|property_type|residential_type|
|--------------|--------|---------|-------------|----------------|
|      108430.0|Griswold|     2020|  Residential|      Two Family|
|      121600.0| Danbury|     2020|  Residential|   Single Family|
|      263600.0| Danbury|     2020|  Residential|   Single Family|
|      205700.0| Danbury|     2020|  Residential|   Single Family|
|       84900.0| Danbury|     2020|  Residential|           Condo|

In [None]:
#### Index categorical columns
indexers = [StringIndexer(inputCol=column, outputCol="indexed_" + column).fit(sub_dataset) for column in categorical_columns]
pipeline = Pipeline(stages=indexers)
indexed_dataset = pipeline.fit(sub_dataset).transform(sub_dataset)
indexed_dataset = indexed_dataset.drop(*categorical_columns)
indexed_dataset.show()

|assessed_value|indexed_town|indexed_list_year|indexed_property_type|indexed_residential_type|
|--------------|------------|-----------------|---------------------|------------------------|
|      108430.0|        13.0|              3.0|                  2.0|                     2.0|
|      121600.0|         0.0|              3.0|                  2.0|                     0.0|
|      263600.0|         0.0|              3.0|                  2.0|                     0.0|
|      205700.0|         0.0|              3.0|                  2.0|                     0.0|
|       84900.0|         0.0|              3.0|                  2.0|                     1.0|
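
For intuition, `StringIndexer` with its default ordering assigns index 0.0 to the most frequent value, 1.0 to the next most frequent, and so on. The plain-Python sketch below imitates that on a made-up column. `frequency_index` is a hypothetical helper, and breaking ties alphabetically is an assumption about recent Spark versions.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct value to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties (assumed).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up town column, not taken from the dataset.
towns = ["Danbury", "Danbury", "Griswold", "Danbury", "Norwalk", "Norwalk"]
mapping = frequency_index(towns)
indexed = [mapping[t] for t in towns]
print(mapping)
print(indexed)
```

The indices are category labels and carry no numeric meaning of their own. In the table above Danbury maps to 0.0, which suggests it is the most frequent town in the filtered data.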

### Assemble features into a label/features DataFrame (the layout Spark uses for LIBSVM data)

In [None]:
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
dataset = assembler.transform(indexed_dataset)

dataset = dataset.select(label_column,'features')
dataset = dataset.withColumnRenamed(label_column,'label')

dataset.show()

|   label|          features|
|--------|------------------|
|108430.0|[13.0,3.0,2.0,2.0]|
|121600.0| [0.0,3.0,2.0,0.0]|
|263600.0| [0.0,3.0,2.0,0.0]|
|205700.0| [0.0,3.0,2.0,0.0]|
| 84900.0| [0.0,3.0,2.0,1.0]|

### Train/Fit the model

In [None]:
(trainingData, testData) = dataset.randomSplit([0.7, 0.3])

dt = DecisionTreeRegressor(featuresCol="features", maxDepth = 25, maxBins = 60)

# Wrap the tree in a single-stage Pipeline (the categorical columns were already indexed above)
pipeline = Pipeline(stages=[dt])

# Train the model on the training split.
model = pipeline.fit(trainingData)
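
As background, a regression tree chooses each split to reduce the variance of the label within the resulting groups (Spark's `DecisionTreeRegressor` uses variance as its impurity measure). The sketch below is a toy plain-Python version for a single numeric feature with made-up data. `best_split` is a hypothetical helper and does not reflect how Spark implements the search, which works on binned features and handles categorical features differently.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_variance) minimising variance after the split."""
    best = (None, variance(ys))
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Made-up feature (an indexed category) and label (a price).
xs = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
ys = [100.0, 110.0, 105.0, 115.0, 300.0, 310.0]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)
```

Here the split lands between the cheap groups and the expensive one, because that grouping leaves the least label variance on each side.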

### Evaluate the model

In [None]:
# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(10)

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

|        prediction|  label|          features|
|------------------|-------|------------------|
|           40090.0|46840.0| [6.0,3.0,2.0,1.0]|
| 87633.33333333333|59500.0| [0.0,3.0,2.0,1.0]|
|108368.33333333333|78540.0|[15.0,3.0,2.0,0.0]|
| 87633.33333333333|84900.0| [0.0,3.0,2.0,1.0]|
| 87633.33333333333|87000.0| [0.0,3.0,2.0,1.0]|
|          123465.0|98730.0| [6.0,3.0,2.0,0.0]|
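
For reference, RMSE is the square root of the mean squared difference between prediction and label. The sketch below recomputes it in plain Python on only the six rows displayed above, so the result illustrates the formula and is not the notebook's RMSE over the full test set.

```python
import math

# (prediction, label) pairs copied from the rows shown above.
rows = [
    (40090.0, 46840.0),
    (87633.33333333333, 59500.0),
    (108368.33333333333, 78540.0),
    (87633.33333333333, 84900.0),
    (87633.33333333333, 87000.0),
    (123465.0, 98730.0),
]

def rmse(pairs):
    """Root mean squared error over (prediction, label) pairs."""
    squared_errors = [(pred - label) ** 2 for pred, label in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = rmse(rows)
print(f"RMSE on the displayed rows = {sample_rmse:g}")
```

RMSE is expressed in the same unit as the label, so it reads directly as a typical error in assessed value.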

In [None]:
#### Count, Mean, Min, Max of predictions
predictions.select(["prediction", "label"]).describe().select('summary', *[round(c, 2).alias(c) for c in ["prediction", "label"]]).show()

In [None]:
#### Model results
treeModel = model.stages[0]
print(treeModel)
print(treeModel.featureImportances)

In [None]:
#### Plot predictions against target
# Collect once so both series share the same row order, then unpack plain floats
rows = predictions.select("label", "prediction").collect()
x = range(len(rows))
y_target = [row["label"] for row in rows]
y_pred = [row["prediction"] for row in rows]

plt.plot(x, y_target, label="label")
plt.plot(x, y_pred, label="prediction")
plt.title("Test and predicted data")

plt.xlabel('Test row index')
plt.ylabel('Assessed value')

plt.legend(loc='best',fancybox=True, shadow=False)
plt.show() 