# Decision Tree Regression Notebook

#### *Author: Kunyu He*
#### *University of Chicago, CAPP'20*

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor

%matplotlib notebook

### Load Data

In [5]:
salary = pd.read_csv("Position_Salaries.csv")
salary.head()

| | Position | Level | Salary |
|---|---|---|---|
| 0 | Business Analyst | 1 | 45000 |
| 1 | Junior Consultant | 2 | 50000 |
| 2 | Senior Consultant | 3 | 60000 |
| 3 | Manager | 4 | 80000 |
| 4 | Country Manager | 5 | 110000 |


### Data Cleaning

In [6]:
salary.isnull().sum()

Position    0
Level       0
Salary      0
dtype: int64

No values are missing in any of the three columns.

### Feature Selection

In [7]:
X = salary.iloc[:, 1:2].values
X.shape

(10, 1)

In [8]:
y = salary.Salary.values
y.shape

(10,)
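scikit-learn estimators expect a two-dimensional feature matrix of shape `(n_samples, n_features)`, which is why `Level` is selected with the slice `1:2` rather than the integer `1`. A minimal sketch of the difference, assuming a small frame rebuilt from the five rows shown by `salary.head()` (the names `demo`, `X_2d` and `X_1d` are for illustration only):

```python
import pandas as pd

# Illustrative frame built from the first five rows shown by salary.head()
demo = pd.DataFrame({
    "Position": ["Business Analyst", "Junior Consultant", "Senior Consultant",
                 "Manager", "Country Manager"],
    "Level": [1, 2, 3, 4, 5],
    "Salary": [45000, 50000, 60000, 80000, 110000],
})

X_2d = demo.iloc[:, 1:2].values  # slice keeps the column axis: shape (5, 1)
X_1d = demo.iloc[:, 1].values    # integer index drops it: shape (5,)

print(X_2d.shape, X_1d.shape)
```

The target `y` can stay one-dimensional, as it does in the cell above.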

### Model Training

Because we only have ten observations, we use the whole data set to train the model and hold nothing out for testing.

In [9]:
dtr = DecisionTreeRegressor(random_state=123)
dtr.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=123, splitter='best')
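With the default settings (`max_depth=None`, `min_samples_leaf=1`) and one distinct salary per level, the tree can keep splitting until every observation sits in its own leaf, so it reproduces the training data exactly. A small check of that behaviour, assuming only the five rows shown by `salary.head()` rather than the full file (`X_demo`, `y_demo` and `tree` are illustrative names):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# First five rows shown by salary.head(), used here for illustration only
X_demo = np.array([[1], [2], [3], [4], [5]])
y_demo = np.array([45000, 50000, 60000, 80000, 110000])

tree = DecisionTreeRegressor(random_state=123).fit(X_demo, y_demo)

n_leaves = tree.get_n_leaves()         # one leaf per training observation
train_r2 = tree.score(X_demo, y_demo)  # R^2 measured on the training data itself

print(n_leaves, train_r2)
```

A perfect score on the training data reflects memorisation, not predictive accuracy on unseen levels.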

### Model Evaluation

In [11]:
plt.scatter(X, y, color="red")
plt.plot(X, dtr.predict(X), color="blue")

plt.title("Salary against Position Level (with Decision Tree Regressor)")
plt.xlabel("Position Level")
plt.ylabel("Salary ($)")
plt.show()

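[Figure: Salary against Position Level (with Decision Tree Regressor). Red points are the observations; the blue line joins the predictions at each observed level.]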

**This does not look like the output of a decision tree regression: a decision tree's predictions are piecewise constant, not a continuous curve.**

This happens because we only predict at the ten observed position levels, and `plt.plot` joins those ten points with straight lines; nothing is predicted or plotted for the values in between. A decision tree prediction should look like the plot below:

In [15]:
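# Dense grid of levels between the lowest and highest observed level;
# a step of 0.01 is fine enough to show the jumps in the prediction
X_grid = np.arange(X.min(), X.max(), 0.01)
# Reshape to (n_samples, 1), the 2-D shape that predict() expects
X_grid = X_grid.reshape(-1, 1)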

In [16]:
plt.scatter(X, y, color="red")
plt.plot(X_grid, dtr.predict(X_grid), color="blue")

plt.title("Actual: Salary against Position (with Decision Tree Regressor)")
plt.xlabel("Position Level")
plt.ylabel("Salary ($)")
plt.show()

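[Figure: Actual: Salary against Position (with Decision Tree Regressor). Red points are the observations; the blue line is the prediction over the dense grid of levels.]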
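The jumps in a step-shaped prediction sit at the tree's split thresholds, and every level on the same side of a threshold receives the same predicted salary. A sketch of how this could be checked numerically, again assuming only the five rows shown by `salary.head()`; `tree_` is scikit-learn's low-level tree structure, and `X_demo`, `y_demo`, `tree` and `between` are illustrative names:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# First five rows shown by salary.head(), used here for illustration only
X_demo = np.array([[1], [2], [3], [4], [5]])
y_demo = np.array([45000, 50000, 60000, 80000, 110000])

tree = DecisionTreeRegressor(random_state=123).fit(X_demo, y_demo)

# Internal nodes have a feature index >= 0; leaves carry a negative marker
is_split = tree.tree_.feature >= 0
thresholds = np.sort(tree.tree_.threshold[is_split])
print(thresholds)

# Levels that fall between the same pair of thresholds share one prediction
between = np.array([[1.1], [1.3], [1.6], [1.9]])
preds = tree.predict(between)
print(preds)
```

The thresholds fall halfway between neighbouring observed levels, which is where the blue line jumps from one salary to the next.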