# Decision Tree Regression

The decision tree is a simple machine learning model for getting started with regression tasks.

### Background

A decision tree is a flowchart-like structure in which each internal (non-leaf) node tests an attribute, each branch represents an outcome of that test, and each leaf (terminal) node holds a prediction: a class label in classification, or a numeric value in regression. The topmost node of the tree is the root node.
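To make the structure concrete, here is a minimal sketch of a hand-built regression tree and the walk from root to leaf that produces a prediction. The nested-dict layout, the thresholds, and the leaf values are all made up for illustration; this is not how Turi Create stores its trees.

```python
# A hand-built regression tree: internal nodes test one attribute,
# leaves hold a numeric prediction. All numbers are invented.
toy_tree = {
    "feature": "bedrooms",
    "threshold": 2.0,
    "left": {"value": 50.0},            # taken when bedrooms < 2
    "right": {                          # taken when bedrooms >= 2
        "feature": "accommodates",
        "threshold": 5,
        "left": {"value": 90.0},
        "right": {"value": 150.0},
    },
}

def predict_one(node, row):
    """Walk from the root to a leaf and return the leaf's value."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["value"]

toy_prediction = predict_one(toy_tree, {"bedrooms": 3.0, "accommodates": 4})
print(toy_prediction)  # 90.0
```

The depth of the tree bounds the number of tests applied to any single row, which is what a maximum-depth setting controls.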

In [1]:
import turicreate as tc
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Load the Airbnb listings data

We will use the [Turi Create](https://github.com/apple/turicreate) library and the Airbnb Belgium dataset from this [open dataset](http://tomslee.net/airbnb-data-collection-get-the-data) collection. Each row describes one listing, including its location, room type, and price.

In [17]:
df_rooms = tc.SFrame('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,str,str,int,float,int,float,float,str,float,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [18]:
df_rooms.head()

room_id,host_id,room_type,borough,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms
14054734,33267800,Shared room,Brussel,Brussel,1,0.0,2,1.0
16151530,105088596,Shared room,Brussel,Brussel,1,0.0,1,1.0
14678546,30043608,Shared room,Brussel,Brussel,14,4.5,2,1.0
8305401,43788729,Shared room,Namur,Namur,12,4.5,2,1.0
14904339,15277691,Shared room,Namur,Gembloux,1,0.0,6,1.0
16228753,61781546,Shared room,Antwerpen,Antwerpen,3,4.5,2,1.0
643309,3216639,Shared room,Roeselare,Roeselare,6,4.0,6,1.0
3879691,19998594,Shared room,Brugge,Knokke-Heist,1,0.0,12,1.0
3710876,18917692,Shared room,Antwerpen,Antwerpen,11,3.0,3,1.0
5141135,20676997,Shared room,Gent,Gent,9,4.5,2,1.0

price,minstay,latitude,longitude,last_modified
55.0,,50.847703,4.379786,2016-12-31 14:49:05.125349 ...
42.0,,50.821832,4.366557,2016-12-31 14:49:05.112730 ...
43.0,,50.847657,4.348675,2016-12-31 14:49:05.110143 ...
48.0,,50.462592,4.818974,2016-12-31 14:49:05.107436 ...
59.0,,50.562263,4.693185,2016-12-31 14:49:05.101899 ...
53.0,,51.203401,4.392493,2016-12-31 14:49:05.096266 ...
22.0,,50.941016,3.123627,2016-12-31 14:49:03.811667 ...
33.0,,51.339016,3.273554,2016-12-31 14:49:02.743608 ...
33.0,,51.232425,4.424612,2016-12-31 14:49:02.710383 ...
38.0,,51.034197,3.714149,2016-12-31 14:49:02.705108 ...


***Compute, show, and describe the features***

### Split data into training and testing sets
We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set your own random seed (or let Turi Create pick one for you).

In [19]:
train_data,test_data = df_rooms.random_split(.8,seed=0)
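
The effect of fixing the seed can be illustrated without Turi Create. The sketch below is a stand-in written with Python's standard library, not Turi Create's implementation: the helper name seeded_split is hypothetical, and it assigns each row to the training part with probability 0.8, so the split sizes are only approximately 80/20.

```python
import random

def seeded_split(rows, fraction, seed):
    """Assign each row to the first part with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.8, seed=0)
train_b, test_b = seeded_split(rows, 0.8, seed=0)

# Same seed, same split; every row lands in exactly one part.
print(train_a == train_b, len(train_a) + len(test_a))
```

Running it twice with the same seed yields identical parts, which is why a shared seed lets everyone reproduce the same numbers.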

---
### Learning a Decision Tree model

We can learn a decision tree regression model that predicts 'price' from the features
example_features = ['borough', 'neighborhood', 'bedrooms'] on the training data with the following code:

(Aside: we set validation_set = None so that Turi Create does not hold out a random validation sample, which keeps the results the same from run to run.)

In [34]:
example_features = ['borough', 'neighborhood', 'bedrooms']
example_model = tc.decision_tree_regression.create(train_data,
                                                   target='price',
                                                   features=example_features,
                                                   max_depth=3,
                                                   validation_set=None)
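
As a rough picture of what learning a tree involves, the sketch below shows the classic greedy step for a single numeric feature: try each candidate threshold and keep the one that minimizes the summed squared error of the two resulting groups. This is a generic textbook illustration on invented data; the helper names are hypothetical, and Turi Create's actual training procedure may differ.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the best `x < threshold` split."""
    best = (None, sse(ys))  # baseline: no split at all
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Invented data: prices jump once a listing has three bedrooms.
toy_bedrooms = [1, 1, 2, 2, 3, 3]
toy_prices = [40.0, 50.0, 45.0, 55.0, 120.0, 130.0]
split_threshold, split_error = best_split(toy_bedrooms, toy_prices)
print(split_threshold, split_error)  # 3 175.0
```

A full tree repeats this step inside each resulting group until a stopping rule such as a maximum depth is reached.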

### Predicting Values

In the gradient descent notebook we used numpy to do our regression. In this notebook we use existing Turi Create functions instead.

Recall that once a model is built, we can use its .predict() function to compute predicted values for any data we pass in. For example, using the example model above on the training data:

In [24]:
example_predictions = example_model.predict(train_data)

In [26]:
print (example_predictions[0]) # 64.82

64.82726287841797
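
A single prediction says little about model quality. One common summary is the root-mean-square error (RMSE) between predictions and actual prices, ideally computed on held-out data rather than on the training set. The sketch below is a plain-Python version on invented numbers; the function name rmse and the toy values are illustrative only.

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# Invented predictions and actual prices, for illustration only.
toy_predictions = [60.0, 45.0, 50.0]
toy_actual = [55.0, 42.0, 43.0]
toy_rmse = rmse(toy_predictions, toy_actual)
print(round(toy_rmse, 2))  # 5.26
```

In this notebook, the same function could presumably be applied to the predictions for test_data and its 'price' column.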



### Why choose decision trees?

Different kinds of models have different advantages. The decision tree model is very good at handling tabular data with numerical features, or categorical features with fewer than a few hundred categories. Unlike linear models, decision trees can capture non-linear relationships between the features and the target.

One important note is that tree-based models are not designed to work with very sparse features. When dealing with sparse input data (e.g. categorical features with a very large number of distinct values), we can either pre-process the sparse features to generate numerical statistics, or switch to a linear model, which is better suited to such scenarios.
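
One way to turn a high-cardinality categorical feature into a numerical statistic is to replace each category with the mean target value observed for it. The sketch below shows this on invented data; the helper name is hypothetical, and this is one possible pre-processing choice among many, not something the notebook itself does.

```python
from collections import defaultdict

def mean_target_by_category(categories, targets):
    """Map each category to the mean target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

# Invented listings: neighborhood name and price.
toy_neighborhoods = ["Gent", "Gent", "Namur", "Brussel", "Brussel"]
neighborhood_prices = [38.0, 42.0, 48.0, 55.0, 45.0]

encoding = mean_target_by_category(toy_neighborhoods, neighborhood_prices)
encoded_column = [encoding[n] for n in toy_neighborhoods]
print(encoded_column)  # [40.0, 40.0, 48.0, 50.0, 50.0]
```

The statistics should be computed from the training data only, otherwise information about the test prices leaks into the features.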

---
### Learning a Random Forest Regression model

In [27]:
example_features = ['borough', 'neighborhood', 'bedrooms']
example_model = tc.random_forest_regression.create(
    train_data, target='price', features=example_features,
    max_depth=3, max_iterations=3, validation_set=None)
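
The core idea of a random forest for regression can be sketched in a few lines: train several trees, each on a bootstrap sample of the rows, and average their predictions. The version below is deliberately simplified and is not Turi Create's implementation: each "tree" is a depth-1 stump with a fixed, hand-picked threshold, the random feature subsampling of a real forest is omitted, and all names and data are invented. The three stumps loosely mirror max_iterations=3 above, assuming that parameter sets the number of trees.

```python
import random

def fit_stump(xs, ys, threshold):
    """Fit a depth-1 tree: one mean per side of a fixed threshold."""
    overall = sum(ys) / len(ys)
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return lambda x: left_mean if x < threshold else right_mean

def fit_forest(xs, ys, n_trees, threshold, seed):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], threshold))
    return trees

def forest_predict(trees, x):
    """Average the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

forest_x = [1, 1, 2, 2, 3, 3]
forest_y = [40.0, 50.0, 45.0, 55.0, 120.0, 130.0]
forest = fit_forest(forest_x, forest_y, n_trees=3, threshold=3, seed=0)
print(forest_predict(forest, 1), forest_predict(forest, 3))
```

Because each stump sees a slightly different sample, the individual predictions vary, and averaging them tends to give a steadier estimate than any single tree.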

Unlike a linear regression model, a random forest has no regression weights (coefficients) to extract, so we go straight to its predictions.

### Predicting Values

As with the decision tree, we can call .predict() on the random forest model to compute predicted values for the data we pass in:

In [28]:
example_predictions = example_model.predict(train_data)

In [29]:
print(example_predictions[0])  # 71.52

71.51585388183594
