# Airbnb Price Prediction Challenge

Airbnb has a host of public data sets for listings in cities throughout the world: <br>
http://insideairbnb.com/get-the-data.html

This challenge uses the data set for London.

Split into teams of 2 or 3 to work on completing this challenge.

**Metric:** We will be using RMSE to assess the effectiveness of our models.
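As a reminder of what the metric measures, here is a minimal sketch of RMSE in plain Python. The function name `rmse` and the example prices are assumptions made for illustration; in the notebook itself the value comes from Spark's `RegressionEvaluator`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices purely for illustration (not results from the London data).
example_actual = [100.0, 50.0, 200.0]
example_predicted = [110.0, 50.0, 180.0]
example_rmse = rmse(example_actual, example_predicted)
print(example_rmse)
```

Because the errors are squared before averaging, a few badly mispriced listings raise RMSE more than many small misses. Lower values are better.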

## Proposed steps

In this challenge, you will be working with a real-world dataset released by Airbnb. The goal is to predict the price of a rental based on its features.

1. Configure Classroom
2. Add the data set to Databricks
3. Read the Data
4. Prepare the Data
5. Define Preprocessing Models
6. Split the Data for Model Development
7. Prepare a Benchmark Model
8. Iterate on Benchmark Model
9. Iterate on Best Model

A list of available regression models can be found here: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html

## Configure Classroom

Run the following cell to configure our "classroom."

In [4]:
%run "../Includes/Classroom Setup"

## Read and explore the Data
- Load the data as a DataFrame
- Prepare a basic description of the data set including:
   - print the schema
   - look at the first couple of rows

In [6]:
try:
  sasToken="?sv=2017-11-09&ss=bf&srt=co&sp=rl&se=2099-12-31T23:59:59Z"+\
    "&st=2018-01-01T00:00:00Z&spr=https&sig=di3x0sjVwmqIjO5ReQ%2Bwa54R9shTDZePtKHipkabqAg%3D"
  dbutils.fs.mount(
    source = "wasbs://class-453@airlift453.blob.core.windows.net/",
    mount_point = "/mnt/training-453",
    extra_configs = {"fs.azure.sas.class-453.airlift453.blob.core.windows.net": sasToken})
except Exception as e:
  if "Directory already mounted" in str(e):
    pass # Ignore error if already mounted.
  else:
    raise e
print("Success.")

In [7]:
filePath = "dbfs:/mnt/training-453/airbnb/listings/london-cleaned.csv"

initDF = (spark.read
  .option("multiline", True)
  .option("header", True)
  .option("inferSchema", True)
  .csv(filePath)
)

display(initDF)

host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
4.0,Haringey,N8 0EY,51.587767046892864,-0.1056664713590268,Apartment,Private room,2,,1.0,1.0,Real Bed,1,133,97.0,10.0,9.0,10.0,10.0,9.0,10.0,35.0
6.0,Ealing,W13 8,51.51564502087072,-0.3145082381228601,Apartment,Private room,2,,1.0,1.0,Real Bed,2,2,90.0,8.0,9.0,10.0,9.0,9.0,9.0,70.0
3.0,Islington,N4 3,51.56801694691075,-0.111208043110142,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,1,14,95.0,9.0,10.0,9.0,10.0,9.0,9.0,45.0
14.0,Westminster,W1T4BP,51.52098247579589,-0.1400235751362096,Apartment,Entire home/apt,6,2.0,3.0,3.0,Real Bed,3,35,93.0,10.0,9.0,9.0,9.0,10.0,9.0,300.0
2.0,Wandsworth,SW11 5GX,51.4729809775772,-0.1637638798070717,Townhouse,Entire home/apt,4,1.5,2.0,2.0,Real Bed,30,92,98.0,10.0,10.0,10.0,10.0,9.0,9.0,150.0
59.0,Tower Hamlets,E14 7RJ,51.51190719380345,-0.0375978928918288,Serviced apartment,Entire home/apt,5,2.0,2.0,2.0,Real Bed,5,7,73.0,7.0,9.0,10.0,9.0,7.0,7.0,102.0
3.0,Barnet,NW11 9,51.57224300907506,-0.2090597643277406,House,Private room,2,1.5,1.0,1.0,Real Bed,15,117,95.0,9.0,10.0,10.0,10.0,9.0,9.0,29.0
6.0,Islington,N1 2,51.54168032647468,-0.10206526891557,Apartment,Entire home/apt,4,1.0,1.0,3.0,Real Bed,3,54,84.0,8.0,9.0,9.0,9.0,9.0,8.0,150.0
6.0,Islington,N1 2,51.53883041283413,-0.101525819246923,Apartment,Entire home/apt,5,1.0,1.0,3.0,Real Bed,2,69,90.0,9.0,9.0,9.0,10.0,10.0,9.0,150.0
3.0,Tower Hamlets,E2,51.52496863799578,-0.0737268307181456,Apartment,Private room,3,1.0,1.0,2.0,Real Bed,3,38,96.0,10.0,10.0,10.0,10.0,10.0,9.0,50.0


In [8]:
initDF.printSchema()

## Prepare the Data
- Count rows with null values
- Impute missing values for numerical fields
- Remove the `zipcode` column (location is already captured by `latitude` and `longitude`)
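The first two steps can be sketched in plain Python to show what the Spark calls in the following cells compute. This is only an illustration under stated assumptions: the three rows are a tiny sample (the middle row is invented), and the variable names are hypothetical.

```python
# Illustrative rows mimicking three columns of the listings data;
# the middle row is invented to show a missing string value.
rows = [
    {"bathrooms": None, "bedrooms": 1.0, "zipcode": "N8 0EY"},
    {"bathrooms": 1.5, "bedrooms": None, "zipcode": None},
    {"bathrooms": 2.0, "bedrooms": 3.0, "zipcode": "W1T4BP"},
]
numeric_columns = ["bathrooms", "bedrooms"]

# Count missing values per column (the role of summing isNull() flags in Spark).
null_counts = {
    column: sum(1 for row in rows if row[column] is None)
    for column in rows[0]
}

# Replace missing values with 0, but only in the numeric columns
# (the role of fillna(0, columnList) in Spark).
filled_rows = [
    {
        column: 0.0 if value is None and column in numeric_columns else value
        for column, value in row.items()
    }
    for row in rows
]

print(null_counts)
print(filled_rows[1])
```

Note that the string column keeps its missing value, which is why `zipcode` has to be handled separately.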

In [10]:
# TODO: Count the number of rows in the `initDF` DataFrame
initDF.count()

In [11]:
# TODO: Count the rows of `initDF` that contain at least one null value
initDF_noNAN = initDF.dropna()

print("We have %s rows that contain nulls" % (initDF.count() - initDF_noNAN.count()))

In [12]:
# In addition, show the number of null values per column.
# `sum` is imported under an alias so that it does not shadow Python's built-in `sum`.
from pyspark.sql.functions import col, sum as sum_
display(initDF.select(*(sum_(col(c).isNull().cast("int")).alias(c) for c in initDF.columns)))

host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
12,0,1923,0,0,0,0,0,193,62,170,0,0,0,21649,21718,21700,21785,21713,21790,21795,0


In [13]:
# TODO: Impute missing values for numerical columns
columnList = [item[0] for item in initDF.dtypes if item[1].startswith('double') or item[1].startswith('int')]

airbnbDF = initDF.fillna(0, columnList)

In [14]:
# TODO: Remove zipcode. The data already contains latitude and longitude.
airbnbDF = airbnbDF.drop('zipcode')

In [15]:
# Verify that no null values remain in any column
from pyspark.sql.functions import col, sum as sum_
display(airbnbDF.select(*(sum_(col(c).isNull().cast("int")).alias(c) for c in airbnbDF.columns)))

host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Prepare for the Competition

- Using the same random seed, select 20% of the data as a hold-out set for comparison between teams.

`modelingDF` will be used to prepare your model. You are free to use this data any way that you see fit in order to prepare the best possible model. **You must not expose your model to `holdOutDF`**.

`holdOutDF` will be used for comparison. You will submit scores for your model's performance on `holdOutDF` to the instructor for comparison.
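The reason for fixing the seed can be shown with a small plain-Python sketch. It is not how Spark implements `randomSplit`; the function `seeded_split` is a hypothetical stand-in that assigns each row to the hold-out set with a fixed probability, which is enough to show that the same seed reproduces the same split.

```python
import random

def seeded_split(rows, holdout_fraction, seed):
    """Assign each row to hold-out or modeling using a seeded random draw."""
    rng = random.Random(seed)
    holdout, modeling = [], []
    for row in rows:
        if rng.random() < holdout_fraction:
            holdout.append(row)
        else:
            modeling.append(row)
    return holdout, modeling

rows = list(range(1000))  # stand-in for listing rows
first = seeded_split(rows, 0.2, seed=273)
second = seeded_split(rows, 0.2, seed=273)

print(first == second)               # same seed, same split
print(len(first[0]), len(first[1]))  # roughly, not exactly, 20% / 80%
```

As in this sketch, the split sizes are only approximately 20% and 80%. For teams to get an identical `holdOutDF`, it is safest for every team to load and prepare the data in exactly the same way before splitting.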

In [17]:
seed = 273
(holdOutDF, modelingDF) = airbnbDF.randomSplit([0.2, 0.8], seed=seed)

print(holdOutDF.count(), modelingDF.count())

##  Define Preprocessing Models

Prepare the following models to be used in a modeling pipeline:
- Prepare a StringIndexer for `neighbourhood_cleansed`, `room_type`, `property_type`, `bed_type`
- Prepare a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `property_type`, `bed_type`

In [19]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

prep_stages = []

# TODO: Prepare a StringIndexer and a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `property_type`, `bed_type`
for c in ['neighbourhood_cleansed', 'room_type', 'property_type', 'bed_type']:
  indexer = StringIndexer(inputCol = c, outputCol = 'cat_' + c, handleInvalid='skip')
  encoder = OneHotEncoder(inputCol = 'cat_' + c, outputCol = c + "_vec", dropLast = True)
  prep_stages += [indexer, encoder]

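The loop above creates a `StringIndexer` for each categorical feature (`neighbourhood_cleansed`, `room_type`, `property_type`, `bed_type`) with `handleInvalid` set to `skip`, writing to `cat_neighbourhood_cleansed`, `cat_room_type`, `cat_property_type` and `cat_bed_type`, respectively. Each indexed column is then one-hot encoded into a matching `_vec` column (for example, `room_type_vec`). `zipcode` is not included because it was dropped earlier.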
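To make the two preprocessing steps concrete, here is a plain-Python sketch of what indexing followed by one-hot encoding with `dropLast = True` produces. The helper names are hypothetical, dense lists stand in for Spark's sparse vectors, and the `"Shared room"` value is an assumption (only `Private room` and `Entire home/apt` appear in the sample rows).

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically in this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, num_labels, drop_last=True):
    """Encode an index as a 0/1 vector; with drop_last the final label is all zeros."""
    size = num_labels - 1 if drop_last else num_labels
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

room_types = [
    "Private room", "Entire home/apt", "Entire home/apt",
    "Entire home/apt", "Private room", "Shared room",  # "Shared room" is assumed
]
label_index = fit_string_index(room_types)
encoded = {label: one_hot(i, len(label_index)) for label, i in label_index.items()}

print(label_index)
print(encoded)
```

Dropping the last category avoids a redundant column: a row with all zeros already identifies the least frequent label.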

## Split the Data for Model Development

Keep 80% of `modelingDF` for the training set and set aside the remaining 20% as the test set.

**NOTE:** The data is now split into three sets:
- `trainDF` - used for training a model
- `testDF` - used for internal validation of hyperparameters
- `holdOutDF` - used for final assessment of the model and comparison to models prepared by other teams

In [22]:
# TODO: Perform a train-test split on `modelingDF`
seed = 273
(testDF, trainDF) = modelingDF.randomSplit([0.2, 0.8], seed=seed)

print(trainDF.count(), testDF.count())

## Prepare a Benchmark Model

- Define a `list` (Python) or `Array` (Scala) containing the features to be used. It is recommended to use the following features:

  `"host_total_listings_count"`, `"accommodates"`, `"bathrooms"`, `"bedrooms"`, `"beds"`, `"minimum_nights"`, `"number_of_reviews"`, `"review_scores_rating"`, `"review_scores_accuracy"`, `"review_scores_cleanliness"`, `"review_scores_checkin"`, `"review_scores_communication"`, `"review_scores_location"`, `"review_scores_value"`, `"neighbourhood_cleansed_vec"`, `"room_type_vec"`, `"property_type_vec"`, `"bed_type_vec"`
- Build a Linear Regression pipeline that contains:
  - each of the StringIndexers
  - the OneHotEncoder
  - a VectorAssembler
  - a LinearRegression Estimator
- Evaluate the performance of the Benchmark Model using a RegressionEvaluator

In [24]:
# TODO: Define a `list` (Python) or `Array` (Scala) containing the features to be used.
features = ["host_total_listings_count", "accommodates", "bathrooms", "bedrooms", "beds", "minimum_nights", "number_of_reviews", "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value", "neighbourhood_cleansed_vec", "room_type_vec", "property_type_vec", "bed_type_vec"]

In [25]:
# TODO: Assemble a sparse vector column containing the features for your model
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols = features, outputCol = "features")

In [26]:
print("Number of preprocessing stages: %s" % len(prep_stages))

Define a `LinearRegression` model and a `Pipeline` that combines your `prep_stages` for data pre-processing with the `LinearRegression`.

In [28]:
# TODO : Build a Linear Regression pipeline that also includes the `prep_stages`
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

lr = LinearRegression().setLabelCol('price')

pipeline = Pipeline(stages=prep_stages + [assembler] + [lr])

lr_model = pipeline.fit(trainDF)

Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the internal testing set, `testDF`.

In [30]:
# TODO: Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the internal testing set, `testDF`
from pyspark.ml.evaluation import RegressionEvaluator

predDF = lr_model.transform(testDF)

evaluator = RegressionEvaluator().setLabelCol("price")


metricName = evaluator.getMetricName()
metricVal = evaluator.evaluate(predDF)

print("{}: {}".format(metricName, metricVal))

Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the class evaluation set, `holdOutDF`.

In [32]:
# TODO: Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the class evaluation set, `holdOutDF`
predDF = lr_model.transform(holdOutDF)

evaluator = RegressionEvaluator().setLabelCol("price")


metricName = evaluator.getMetricName()
metricVal = evaluator.evaluate(predDF)

print("{}: {}".format(metricName, metricVal))

## Iterate on Benchmark Model
- Prepare a model to beat your benchmark model.
- Build a regression pipeline that contains:
   - each of the StringIndexers
   - the OneHotEncoder
   - a VectorAssembler
   - an improved Regression Estimator
- Evaluate the performance of the new model using a RegressionEvaluator on your internal testing set, `testDF`.
- Use the internal testing set to adjust the hyperparameters of your model.
- Evaluate the performance of the new model using a RegressionEvaluator on the class evaluation set, `holdOutDF`.
- When you have beaten the benchmark, share the results with your instructor.

Use `KMeans` to cluster listings by `longitude` and `latitude`, since there may be regional differences in pricing.
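The idea is that each listing gets the id of its nearest cluster centre, and that id becomes an extra feature. The sketch below shows only the assignment step in plain Python. The two centroids are hypothetical (a fitted `KMeans` model would learn its own), and `nearest_centroid` is an illustrative helper, not part of Spark.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))

# Hypothetical centroids as (longitude, latitude); a fitted model would learn these.
centroids = [(-0.14, 51.52), (-0.31, 51.51)]

# Two listings from the sample rows, as (longitude, latitude).
westminster = (-0.1400235751362096, 51.52098247579589)
ealing = (-0.3145082381228601, 51.51564502087072)

print(nearest_centroid(westminster, centroids))
print(nearest_centroid(ealing, centroids))
```

The resulting cluster id is a label rather than a quantity, so its numeric order carries no meaning. Tree-based models can still split on it, which is one reason it pairs reasonably with a gradient-boosted regressor.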

In [35]:
# TODO use the `VectorAssembler` to combine 'longitude', 'latitude' in one feature column for kmeans.

from pyspark.ml.clustering import KMeans

coord_assembler = VectorAssembler(inputCols = ['longitude', 'latitude'], outputCol = "coordinates")

kmeans = KMeans(featuresCol='coordinates', predictionCol='cat_coordinate')

In [36]:
# TODO: Make a copy of the original VectorAssembler and add the column which contains the output of your kmeans clustering.
new_assembler = assembler.copy()
new_assembler.setInputCols(assembler.getInputCols() + ['cat_coordinate'])

In [37]:
# TODO: Define a GBTRegressor
from pyspark.ml.regression import GBTRegressor

gbt = (GBTRegressor(subsamplingRate=.5)
      .setLabelCol("price")
      .setFeaturesCol("features")
      .setSeed(27))

Use the `ParamGridBuilder` and `CrossValidator` to find the best parameters using *only* your `trainDF`.

In [39]:
# TODO: Build a better Regression pipeline
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, [3, 5])
            .addGrid(gbt.maxBins, [16, 32])
            .addGrid(kmeans.k, [50, 200])
            .build())

pipeline = Pipeline(stages=prep_stages + [coord_assembler] + [kmeans] + [new_assembler] + [gbt])

from pyspark.ml.tuning import CrossValidator

cv = (CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
      .setSeed(27))

Fit your `trainDF` with your cross-validation object.
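Before fitting, it helps to estimate how much work the search involves. The sketch below rebuilds the grid from the cell above as a plain Cartesian product; the dictionary keys are just descriptive labels, not Spark parameter objects.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder cell.
grid = {
    "gbt.maxDepth": [3, 5],
    "gbt.maxBins": [16, 32],
    "kmeans.k": [50, 200],
}

# Every combination of one value per parameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 3
fits_during_search = len(combinations) * num_folds
# CrossValidator then refits the best combination once on all of the training data.
total_pipeline_fits = fits_during_search + 1

print(len(combinations), fits_during_search, total_pipeline_fits)
```

Each additional value in any `addGrid` call multiplies the number of fits, so it is worth growing the grid gradually.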

In [41]:
# TODO fit your model to the trainDF
cvModel = cv.fit(trainDF)

Using Python's `zip` and `list` functions, pair each parameter map from `getEstimatorParamMaps` with the corresponding entry in the `avgMetrics` of your `cvModel`.
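Once parameter maps and metrics are paired, the best combination is the one with the lowest average RMSE. A minimal sketch follows, assuming the default `rmse` metric (lower is better). The helper `best_by_rmse` and the numbers are placeholders for illustration, not results from the challenge.

```python
def best_by_rmse(param_maps, avg_metrics):
    """Return the (params, metric) pair with the lowest average RMSE."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    return min(zip(param_maps, avg_metrics), key=lambda pair: pair[1])

# Placeholder values for illustration only; not actual cross-validation output.
example_maps = [{"maxDepth": 3}, {"maxDepth": 5}]
example_metrics = [42.0, 40.5]

best_params, best_metric = best_by_rmse(example_maps, example_metrics)
print(best_params, best_metric)
```

With a fitted model, the same function could be called as `best_by_rmse(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics)`. The corresponding refit pipeline is available as `cvModel.bestModel`.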

In [43]:
# TODO: list the model parameters alongside their average metrics
list(zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics))

Evaluate the performance of the better model using a RegressionEvaluator on the internal testing set, `testDF`.

In [45]:
# TODO: Evaluate the performance of the better model using a RegressionEvaluator on the internal testing set, `testDF`
predDF = cvModel.transform(testDF)

evaluator = RegressionEvaluator().setLabelCol("price")


metricName = evaluator.getMetricName()
metricVal = evaluator.evaluate(predDF)

print("{}: {}".format(metricName, metricVal))

When you are convinced that you are done with the tuning of hyperparameters, apply your pipeline to the holdout dataset. 

**Please do *not* do this more than once!**

In [47]:
# TODO: Evaluate the performance of the better model using a RegressionEvaluator on the class evaluation set, `holdOutDF`
predDF = cvModel.transform(holdOutDF)

evaluator = RegressionEvaluator().setLabelCol("price")


metricName = evaluator.getMetricName()
metricVal = evaluator.evaluate(predDF)

print("{}: {}".format(metricName, metricVal))

&copy; 2019 Microsoft. All rights reserved.<br/>

Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.