## Identifying Real Estate Opportunity for Low Income Housing Programs

This is a rework of an analysis originally done with the traditional Python scientific stack. Here the bulk of the analysis is done in PySpark to take advantage of features such as its ML package and pipelining capabilities.

To reiterate the goal of this analysis: we take Austin, TX housing data and build a scoring model that lets us pick the top 100 properties that best fit the goals of low-income housing programs. We begin by importing the necessary packages, creating a Spark session, and loading our data:

In [1]:
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.clustering import KMeans
from pyspark.sql.types import FloatType, ArrayType, DoubleType, IntegerType
from pyspark.sql import functions as F
import pandas as pd
from math import cos, sqrt
import geopandas as gpd
from helper_functions.plotly_plotting import plot_austin

In [2]:
import pyspark
spark = pyspark.sql.SparkSession.builder\
        .master('local')\
        .getOrCreate()

In [3]:
trulia_df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .option("mode", "failFast")\
    .load("data/prices.csv")

atx_addresses_df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .option("mode", "failFast")\
    .load("data/addresses_atx.csv")

In [4]:
trulia_df.show(5)

+---+---------+--------+------------------+------+----+------------------+--------------------+----------+-----------+
|_c0|bathrooms|bedrooms|        house_type| price|sqft|    price_per_sqft|         adj_address|  latitude|  longitude|
+---+---------+--------+------------------+------+----+------------------+--------------------+----------+-----------+
|  0|      2.0|     3.0|             Condo|280000|1427| 196.2158374211633|4159 STECK AVENUE...|30.3767245|-97.7577025|
|  1|      3.0|     4.0|Single-Family Home|299000|2224|134.44244604316546|15810 DE PEER COV...|  30.49254| -97.740365|
|  2|      1.0|     3.0|Single-Family Home|329900|1005| 328.2587064676617|8201 LAZY LANE AU...| 30.355568| -97.717455|
|  3|      2.0|     3.0|Single-Family Home|229874|1499|153.35156771180786|14511 FITZGIBBON ...|   30.2343| -97.588498|
|  4|      2.0|     3.0|Single-Family Home|237501|1295| 183.3984555984556|10304 BANKHEAD DR...| 30.351538| -97.732571|
+---+---------+--------+------------------+------+----+------------------+--------------------+----------+-----------+

In [5]:
print(("Rows: ", trulia_df.count(),"Columns: ", len(trulia_df.columns)))

('Rows: ', 623, 'Columns: ', 10)


In [6]:
atx_addresses_df.show(5)

+---+------------------+------------------+--------------------+
|_c0|               lat|               lon|         adj_address|
+---+------------------+------------------+--------------------+
|  1|30.186914399999996|       -97.9280313|12200 PRATOLINA D...|
|  3|         30.352751|-97.95679799999999|15302 DOROTHY DRI...|
|  4|30.169753200000002|-97.83318670000001|2515 DREW LANEAUS...|
|  5|30.354369199999997|       -97.9577997|15404 JOSEPH DRIV...|
|  7|30.399155600000004|       -97.8516461|11203 RANCH ROAD ...|
+---+------------------+------------------+--------------------+
only showing top 5 rows



In [7]:
print(("Rows: ", atx_addresses_df.count(), "Columns: ", len(atx_addresses_df.columns)))

('Rows: ', 140386, 'Columns: ', 4)


In [8]:
atx_addresses_df = atx_addresses_df.withColumnRenamed('lat', 'latitude').withColumnRenamed('lon', 'longitude')

## Feature Engineering

Now that our data is in PySpark dataframes, we can begin to build our data pipeline. The pipeline functionality lets us plan out several data transformations as an ordered list of stages, and because Spark evaluates dataframe transformations lazily, it can optimize the workload across all of them.

In PySpark, almost all of the ML package's algorithms expect their input as a single dataframe column that holds a vector of the input features. By convention this column is named "features", and we'll use the VectorAssembler to build it. For our purposes, we will train a regressor on the data we scraped from Trulia.com and then use the fitted model to predict housing data for the larger dataset of properties:

In [9]:
# use stages list to create a plan for the pipeline
stages = []
assemblerInputs = ['latitude', 'longitude']
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol='features')
stages += [assembler]

In [10]:
stages

[VectorAssembler_17d6e9b5af69]
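
At this point the stage has only been declared; nothing has run yet. To make concrete what the assembler will produce once the pipeline is executed, here is a minimal plain-Python sketch of the same idea. It is a stand-in for illustration, not PySpark's implementation: `assemble_features`, `run_stages`, `sample_rows`, and `toy_stages` are hypothetical names, and the rows are plain dicts rather than a Spark dataframe.

```python
# Plain-Python stand-in for a VectorAssembler-style stage.
# Illustrative only: these helpers are assumptions, not part of PySpark.

def assemble_features(rows, input_cols, output_col="features"):
    """Return new rows with the values of input_cols packed into one list."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[output_col] = [float(row[col]) for col in input_cols]
        assembled.append(new_row)
    return assembled


def run_stages(rows, stages):
    """Apply each stage in list order, feeding each output to the next stage."""
    for stage in stages:
        rows = stage(rows)
    return rows


# Two listings taken from the trulia_df preview above.
sample_rows = [
    {"price": 280000, "latitude": 30.3767245, "longitude": -97.7577025},
    {"price": 299000, "latitude": 30.49254, "longitude": -97.740365},
]

# Mirror the notebook's pattern: build a list of stages, then run them in order.
toy_stages = [lambda rows: assemble_features(rows, ["latitude", "longitude"])]
assembled_rows = run_stages(sample_rows, toy_stages)

for row in assembled_rows:
    print(row["price"], row["features"])
```

Each output row keeps its original columns and gains a `features` entry holding the latitude and longitude together, which is the shape the ML algorithms expect. In the real pipeline the vector is a Spark ML vector type rather than a Python list.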