# Decision Tree Regression

The decision tree is a simple machine learning model for getting started with regression tasks.

### Background
A decision tree is a flow-chart-like structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of that test, and each leaf (or terminal) node holds a prediction. In a classification tree the leaf holds a class label; in a regression tree, as used here, it holds a numeric value. The topmost node in a tree is the root node.
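
As a rough illustration of that structure, the sketch below hand-builds a tiny regression tree as nested dictionaries and walks it from the root to a leaf. The feature names, thresholds, and leaf values are made up for the example; for regression, each leaf holds a number rather than a class label. This is not how Turi Create stores its trees.

```python
# A hand-built regression tree (hypothetical values, not learned from data).
# Internal nodes test one feature against a threshold; leaves hold a number.
toy_tree = {
    "feature": "bedrooms", "threshold": 2,
    "left": {"value": 60.0},                       # taken when bedrooms < 2
    "right": {                                     # taken when bedrooms >= 2
        "feature": "accommodates", "threshold": 5,
        "left": {"value": 95.0},
        "right": {"value": 150.0},
    },
}

def predict_one(node, row):
    """Walk from the root to a leaf and return the leaf's value."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["value"]

toy_prediction = predict_one(toy_tree, {"bedrooms": 3, "accommodates": 4})
print(toy_prediction)  # 95.0
```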

In [None]:
import turicreate as tc
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Load the Airbnb listings data

We will use the [Turi Create](https://github.com/apple/turicreate) library and the Airbnb Belgium [open dataset](http://tomslee.net/airbnb-data-collection-get-the-data).

In [None]:
df_rooms = tc.SFrame('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv')

In [None]:
df_rooms.head()

### Split data into training and testing
We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set a random seed yourself (or let Turi Create pick one for you).
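
To see why a fixed seed makes the split reproducible, here is a minimal NumPy sketch of a seeded random split. The helper `seeded_split` is hypothetical and only illustrates the idea; it is not Turi Create's implementation, and it will not reproduce the exact rows that `random_split` selects.

```python
import numpy as np

def seeded_split(n_rows, fraction, seed):
    """Return (train_idx, test_idx); each row lands in train with probability `fraction`."""
    rng = np.random.default_rng(seed)
    mask = rng.random(n_rows) < fraction
    idx = np.arange(n_rows)
    return idx[mask], idx[~mask]

train_idx, test_idx = seeded_split(1000, 0.8, seed=0)
train_idx_again, _ = seeded_split(1000, 0.8, seed=0)

# Same seed, same split; the train share is close to, but not exactly, 80%.
print(len(train_idx), len(test_idx), np.array_equal(train_idx, train_idx_again))
```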

In [None]:
train_data,test_data = df_rooms.random_split(.8,seed=0)

---
### Learning a Decision Tree model

We can use the following code to learn a decision tree regression model that predicts 'price' from the features
example_features = ['borough', 'neighborhood', 'bedrooms'], using the training data:

(Aside: We set validation_set = None to ensure that the results are always the same)
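
For intuition about what training does, a regression tree is commonly grown by repeatedly choosing the split that most reduces the squared error of the target within the two resulting groups. The sketch below finds the best single split on one numeric feature using toy numbers. It shows the textbook criterion only; Turi Create's actual training procedure may differ (for example, in how it scores or regularizes splits).

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split `x < threshold` with the lowest total squared error."""
    best_threshold, best_sse = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x < t], y[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = t, sse
    return best_threshold, best_sse

# Toy data (made up): price jumps once there are three bedrooms.
toy_bedrooms = np.array([1, 1, 2, 2, 3, 3], dtype=float)
toy_price = np.array([50, 54, 58, 62, 120, 130], dtype=float)

split_threshold, split_sse = best_split(toy_bedrooms, toy_price)
print(split_threshold, split_sse)  # 2.5 130.0
```

A parameter such as max_depth then limits how many times this splitting step may be nested.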

In [None]:
example_features = ['borough', 'neighborhood','bedrooms']
example_model = tc.decision_tree_regression.create(train_data, 
                                                   target = 'price', 
                                                   features = example_features, 
                                                   max_depth =  3, 
                                                   validation_set = None)

### Predicting Values

In the gradient descent notebook we used NumPy to do our regression. In this notebook we use existing Turi Create functions to fit and evaluate regression models.

Recall that once a model is built, we can use the .predict() function to get predicted values for the data we pass in. For example, using the example model above:

In [None]:
example_predictions = example_model.predict(train_data)

In [None]:
print(example_predictions[0])  # 64.82


### Why choose decision trees?
Different kinds of models have different advantages. The decision tree model is very good at handling tabular data with numerical features, or categorical features with fewer than hundreds of categories. Unlike linear models, decision trees are able to capture non-linear interaction between the features and the target.
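
A small synthetic example can make the non-linearity point concrete. Below, the target is a step function of a single feature: a straight-line fit cannot follow the jump, while a depth-1 tree (a "stump") with one split fits it exactly. The data and the split point are made up for illustration, and the stump's threshold is picked by hand rather than learned.

```python
import numpy as np

# Synthetic step-shaped target: the value jumps at x = 5.
x = np.arange(10, dtype=float)
y = np.where(x < 5, 50.0, 120.0)

# Straight-line least-squares fit.
slope, intercept = np.polyfit(x, y, 1)
linear_rmse = np.sqrt(np.mean((slope * x + intercept - y) ** 2))

# Depth-1 tree: one split at x < 5, each side predicts the mean of its targets.
stump_pred = np.where(x < 5, y[x < 5].mean(), y[x >= 5].mean())
stump_rmse = np.sqrt(np.mean((stump_pred - y) ** 2))

print(linear_rmse, stump_rmse)  # the stump's error is exactly 0 on this data
```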

One important note is that tree-based models are not designed to work with very sparse features. When dealing with sparse input data (e.g. categorical features with a very large number of categories), we can either pre-process the sparse features to generate numerical statistics, or switch to a linear model, which is better suited to such scenarios.
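
One possible way to "generate numerical statistics" from a high-cardinality categorical feature is to replace each category with the mean of the target for that category, with a fallback for unseen categories. The sketch below uses plain Python and made-up rows; `fit_mean_encoding` is a hypothetical helper, not a Turi Create function. In practice the means should be computed on the training split only, so that test prices do not leak into the features.

```python
from collections import defaultdict

# Toy rows (made-up values): a categorical column and a numeric target.
rows = [
    {"neighborhood": "A", "price": 50.0},
    {"neighborhood": "A", "price": 70.0},
    {"neighborhood": "B", "price": 100.0},
    {"neighborhood": "C", "price": 80.0},
]

def fit_mean_encoding(rows, column, target):
    """Return ({category: mean target}, overall mean target)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        totals[row[column]] += row[target]
        counts[row[column]] += 1
    overall = sum(totals.values()) / sum(counts.values())
    means = {key: totals[key] / counts[key] for key in totals}
    return means, overall

means, overall_mean = fit_mean_encoding(rows, "neighborhood", "price")

# Unseen categories fall back to the overall mean.
encoded = [means.get(r["neighborhood"], overall_mean) for r in rows]
print(encoded)                        # [60.0, 60.0, 100.0, 80.0]
print(means.get("Z", overall_mean))   # 75.0
```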

---
# Training a Random Forest Regression model

In [None]:
example_features = ['borough', 'neighborhood','bedrooms']
example_model = tc.random_forest_regression.create(train_data, target = 'price', 
                                                   features = example_features, 
                                                   max_depth =  3,  
                                                   max_iterations = 3, 
                                                   validation_set = None)

***Try to predict from train and test data***   

***Construct a robust model and visualize (real y vs predict y)***

***Evaluate models with tc.evaluation.rmse(y, yhat)***
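
As a reference for what the RMSE measures, it is the square root of the mean squared difference between the true and predicted values. The NumPy sketch below computes it by hand on toy numbers; the local `rmse` helper is defined here for illustration and is separate from `tc.evaluation.rmse`.

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error between true values `y` and predictions `yhat`."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

# Errors are 10, 0 and -10, so the RMSE is sqrt(200 / 3).
toy_rmse = rmse([100, 80, 60], [90, 80, 70])
print(toy_rmse)
```

Comparing this value on the training data and on the test data is a simple way to check whether a model generalizes.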

[Turicreate Documentation](https://apple.github.io/turicreate/docs/api/index.html)