# Decision Tree Model

A building block of a Random Forest is a decision tree. A decision tree starts at a root node and ends at leaf nodes. For a numeric feature, the tree considers candidate split points between the unique values of that feature. Tree-based models tend to handle trends in data poorly compared to linear models, because their predictions cannot go beyond the range of target values seen in training. You therefore have to detrend your series first, which was done in the previous part for three of the five datasets.
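The point about trends can be illustrated with a minimal sketch. The series below is synthetic (a straight upward line, an assumption made purely for illustration and not one of the five datasets), and the names `t`, `tree`, and `line` are hypothetical. A fully grown tree can only return values it saw in training, so its forecasts flatten out, while a linear model keeps following the trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic upward-trending series (assumption for illustration only)
t = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * t.ravel() + 5.0

# Chronological split: first 80 points for training, last 20 for testing
t_train, t_test = t[:80], t[80:]
y_train, y_test = y[:80], y[80:]

tree = DecisionTreeRegressor(random_state=0).fit(t_train, y_train)
line = LinearRegression().fit(t_train, y_train)

tree_pred = tree.predict(t_test)
line_pred = line.predict(t_test)

# The tree never predicts above the largest target it was trained on
print('Largest training target:', y_train.max())
print('Largest tree prediction on test:', tree_pred.max())
print('Largest linear prediction on test:', line_pred.max())
```

Detrending removes this problem because the remaining series stays within a range the tree has already seen.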

In [None]:
import pandas as pd
from pandas import read_csv
from matplotlib import pyplot

## Example 1: Vacation Dataset

In [None]:
# load data
df1 = pd.read_csv('~/Desktop/section_4/vacation_lags_12months_features.csv', header=0)
df1.head()

In [None]:
# Split data

vacat = df1.values
# split into lagged variables and original time series
X1= vacat[:, 0:-1]  # slice all rows and start with column 0 and go up to but not including the last column
y1 = vacat[:,-1]  # slice all rows and last column, essentially separating out 't' column

In [None]:
# Columns t-1 to t-12, which are the lagged variables
X1

In [None]:
# Column t, which is the original time series
y1[0:10]

Below, you can alter the split to 50-50, 60-40, 70-30, 75-25, 80-20, 85-15, etc. For example, 0.80 gives an 80-20 split.
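The split used throughout this notebook is chronological: the first part of the rows is used for training and the rest for testing, with no shuffling. As a minimal sketch, the same logic can be wrapped in a small helper. The function name `chronological_split` and the demo arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def chronological_split(X, y, train_fraction=0.80):
    """Split features and target in time order, without shuffling."""
    split = int(len(X) * train_fraction)
    return X[:split], X[split:], y[:split], y[split:]

# Small demo arrays (assumption for illustration only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo, 0.80)
print('Training rows: %d' % len(X_tr))
print('Testing rows: %d' % len(X_te))
```

Using one helper for both features and target guarantees that the two are cut at the same row.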

In [None]:
# Target Train-Test split (chronological: earlier rows train, later rows test)

Y1 = y1
traintarget_size = int(len(Y1) * 0.80)   # Set split
train_target, test_target = Y1[0:traintarget_size], Y1[traintarget_size:len(Y1)]

print('Observations for Target: %d' % (len(Y1)))
print('Training Observations for Target: %d' % (len(train_target)))
print('Testing Observations for Target: %d' % (len(test_target)))

In [None]:
# Features Train-Test split

trainfeature_size = int(len(X1) * 0.80)
train_feature, test_feature = X1[0:trainfeature_size], X1[trainfeature_size:len(X1)]
print('Observations for feature: %d' % (len(X1)))
print('Training Observations for feature: %d' % (len(train_feature)))
print('Testing Observations for feature: %d' % (len(test_feature)))

In [None]:
# Decision Tree Regression Model

from sklearn.tree import DecisionTreeRegressor

# Create a decision tree regression model with default arguments
decision_tree_vacat = DecisionTreeRegressor()  # max_depth not set

# Fit the model to the training features and targets
decision_tree_vacat.fit(train_feature, train_target)

# Check the score on train and test
print(decision_tree_vacat.score(train_feature, train_target))
# Test R-squared: 0 means no better than always predicting the mean of the test targets; negative means worse than that
print(decision_tree_vacat.score(test_feature, test_target))


In [None]:
# Find the best Max Depth

# Loop through a few different max depths and check the performance
# Try different max depths. We want to optimize our ML models to make the best predictions possible.
# For regular decision trees, max_depth, which is a hyperparameter, limits the depth of the tree, i.e. the number of successive splits from the root to any leaf.
# You can find the best value of max_depth based on the R-squared score of the model on the test set.

for d in [2, 3, 4, 5,7,8,10]:
    # Create the tree and fit it
    decision_tree_vacat = DecisionTreeRegressor(max_depth=d)
    decision_tree_vacat.fit(train_feature, train_target)

    # Print out the scores on train and test
    print('max_depth=', str(d))
    print(decision_tree_vacat.score(train_feature, train_target))
    print(decision_tree_vacat.score(test_feature, test_target), '\n')  # You want the test score to be positive and high
 


Decision-tree learners can create over-complex trees that fit the training data closely but do not generalize well to unseen data. This is called overfitting.
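Overfitting can be seen directly by comparing train and test scores. The sketch below uses synthetic noisy data (an assumption for illustration, not one of the notebook's datasets) and compares an unrestricted tree with one limited to `max_depth=3`. The unrestricted tree typically scores almost perfectly on the training rows and noticeably lower on the held-out rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)

# Hold out the last 50 rows for testing
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

scores = {}
for depth in [None, 3]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('max_depth =', depth, 'train R^2 = %.3f, test R^2 = %.3f' % scores[depth])
```

A large gap between the train and test scores is the usual sign that the tree has memorized noise.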

In [None]:
# Plot predicted against actual values

from matplotlib import pyplot as plt

# Use the best max_depth 
decision_tree_vacat = DecisionTreeRegressor(max_depth=5)  # fill in best max depth here
decision_tree_vacat.fit(train_feature, train_target)

# Predict values for train and test
train_prediction = decision_tree_vacat.predict(train_feature)
test_prediction = decision_tree_vacat.predict(test_feature)

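# Scatter the predictions vs actual values
plt.scatter(train_prediction, train_target, label='train')  # blue
plt.scatter(test_prediction, test_target, label='test')  # orange
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.legend()  # show which color is train and which is test
plt.show()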

## Example 2: Furniture Dataset

In [None]:
# load data, this data has been stationarized
df2 = pd.read_csv('~/Desktop/section_4/furniture_lags_12months_features.csv', header=0)
df2.head()

In [None]:
# Split Data

furn = df2.values
# split into lagged variables (features) and original time series data (target)
X2= furn[:,0:-1]  # slice all rows and start with column 0 and go up to but not including the last column
y2 = furn[:,-1]  # slice all rows and last column, essentially separating out 't' column

In [None]:
# Columns t-1 to t-12, which are the lagged variables
X2

In [None]:
# Column t, which is the original time series
# Give first 10 values of target variable, time series
y2[0:10]

Below, you can alter the splits as 50-50, 60-40, 70-30, 75-25, 80-20, and 85-15, etc. Here we are using a 75-25 split.
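A single split gives only one estimate of test performance. As a hedged alternative, scikit-learn's `TimeSeriesSplit` produces several expanding-window splits that keep the time order. The sketch below only prints the row indices of each fold on a small demo array (an assumption for illustration); it is not the method used in this notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve demo rows (assumption for illustration only)
X_demo = np.arange(24).reshape(12, 2)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X_demo))

# In every fold, all training rows come before all testing rows
for train_idx, test_idx in folds:
    print('train:', train_idx, 'test:', test_idx)
```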

In [None]:
# Target Train-Test split (chronological: earlier rows train, later rows test)

Y2 = y2
traintarget_size = int(len(Y2) * 0.75)   # Set split
train_target, test_target = Y2[0:traintarget_size], Y2[traintarget_size:len(Y2)]

print('Observations for Target: %d' % (len(Y2)))
print('Training Observations for Target: %d' % (len(train_target)))
print('Testing Observations for Target: %d' % (len(test_target)))

In [None]:
# Features Train-Test split

trainfeature_size = int(len(X2) * 0.75)
train_feature, test_feature = X2[0:trainfeature_size], X2[trainfeature_size:len(X2)]
print('Observations for feature: %d' % (len(X2)))
print('Training Observations for feature: %d' % (len(train_feature)))
print('Testing Observations for feature: %d' % (len(test_feature)))

In [None]:
# Decision Tree Regression Model

from sklearn.tree import DecisionTreeRegressor

# Create a decision tree regression model with default arguments
decision_tree_furn = DecisionTreeRegressor()  # max_depth not set

# Fit the model to the training features and targets
decision_tree_furn.fit(train_feature, train_target)

# Check the score on train and test
print(decision_tree_furn.score(train_feature, train_target))
# Test R-squared: 0 means no better than always predicting the mean of the test targets; negative means worse than that
print(decision_tree_furn.score(test_feature, test_target))


In [None]:
# Find Best Max Depth

# Loop through a few different max depths and check the performance
# Try different max depths. We want to optimize our ML models to make the best predictions possible.
# For regular decision trees, max_depth, which is a hyperparameter, limits the depth of the tree, i.e. the number of successive splits from the root to any leaf.
# You can find the best value of max_depth based on the R-squared score of the model on the test set.

for d in [2, 3,4, 5,7,8,10]:
    # Create the tree and fit it
    decision_tree_furn = DecisionTreeRegressor(max_depth=d)
    decision_tree_furn.fit(train_feature, train_target)

    # Print out the scores on train and test
    print('max_depth=', str(d))
    print(decision_tree_furn.score(train_feature, train_target))
    print(decision_tree_furn.score(test_feature, test_target), '\n')  # You want the test score to be positive
    
# R-squared scores for train and test are printed below.

The best max_depth is the one that gives the best test score (positive and high). Decision-tree learners can create over-complex trees that fit the training data closely but do not generalize well to unseen data. This is called overfitting.
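The score being compared here is R-squared, which is what `DecisionTreeRegressor.score` returns. As a minimal sketch of why it can be zero or negative, the helper below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and checks it against `sklearn.metrics.r2_score`. The helper name `r_squared` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy values (assumption for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.5, 5.5, 7.0, 9.5])         # close to the truth
mean_pred = np.full_like(y_true, y_true.mean())    # always predicts the mean
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])          # worse than the mean

for name, pred in [('good', good_pred), ('mean', mean_pred), ('bad', bad_pred)]:
    print(name, round(r_squared(y_true, pred), 4), round(r2_score(y_true, pred), 4))
```

The three cases give a high positive score, a score of zero, and a negative score, which matches the rule of thumb that the test score should be positive and high.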

In [None]:
# Plot predicted against actual values

from matplotlib import pyplot as plt

# Use the best max_depth 
decision_tree_furn = DecisionTreeRegressor(max_depth=5) # Fill in best max depth score here
decision_tree_furn.fit(train_feature, train_target)

# Predict values for train and test
train_prediction = decision_tree_furn.predict(train_feature)
test_prediction = decision_tree_furn.predict(test_feature)

# Scatter the predictions vs actual values
plt.scatter(train_prediction, train_target, label='train')  # blue
plt.scatter(test_prediction, test_target, label='test')  # orange
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.legend()  # show which color is train and which is test
plt.show()


## Example 3:  Bank of America Dataset

In [None]:
# load data, this data has been stationarized
df3 = pd.read_csv('~/Desktop/section_4/bac_lags_12months_features.csv', header=0)
df3.head()

In [None]:
# Split Data

bac = df3.values
# split into lagged variables (features) and original time series data (target)
X3= bac[:,0:-1]  # slice all rows and start with column 0 and go up to but not including the last column
y3 = bac[:,-1]  # slice all rows and last column, essentially separating out 't' column

In [None]:
# Columns t-1 to t-12, which are the lagged variables
X3

In [None]:
# Column t, which is the original time series
# Give first 10 values of target variable, time series
y3[0:10]

In [None]:
# Target Train-Test split (chronological: earlier rows train, later rows test)

Y3 = y3
traintarget_size = int(len(Y3) * 0.75)   # Set split
train_target, test_target = Y3[0:traintarget_size], Y3[traintarget_size:len(Y3)]

print('Observations for Target: %d' % (len(Y3)))
print('Training Observations for Target: %d' % (len(train_target)))
print('Testing Observations for Target: %d' % (len(test_target)))

In [None]:
# Features Train-Test split

trainfeature_size = int(len(X3) * 0.75)
train_feature, test_feature = X3[0:trainfeature_size], X3[trainfeature_size:len(X3)]
print('Observations for feature: %d' % (len(X3)))
print('Training Observations for feature: %d' % (len(train_feature)))
print('Testing Observations for feature: %d' % (len(test_feature)))

In [None]:
# Decision Tree Regression Model

from sklearn.tree import DecisionTreeRegressor

# Create a decision tree regression model with default arguments
decision_tree_bac = DecisionTreeRegressor()  # max_depth not set

# Fit the model to the training features and targets
decision_tree_bac.fit(train_feature, train_target)

# Check the score on train and test
print(decision_tree_bac.score(train_feature, train_target))
# Test R-squared: 0 means no better than always predicting the mean of the test targets; negative means worse than that
print(decision_tree_bac.score(test_feature, test_target))


In [None]:
# Find Best Max Depth

# Loop through a few different max depths and check the performance
# Try different max depths. We want to optimize our ML models to make the best predictions possible.
# For regular decision trees, max_depth, which is a hyperparameter, limits the depth of the tree, i.e. the number of successive splits from the root to any leaf.
# You can find the best value of max_depth based on the R-squared score of the model on the test set.

for d in [2, 3,4, 5,7,8,10]:
    # Create the tree and fit it
    decision_tree_bac = DecisionTreeRegressor(max_depth=d)
    decision_tree_bac.fit(train_feature, train_target)

    # Print out the scores on train and test
    print('max_depth=', str(d))
    print(decision_tree_bac.score(train_feature, train_target))
    print(decision_tree_bac.score(test_feature, test_target), '\n')  # You want the test score to be positive
    
# R-squared scores for train and test are printed below.

In [None]:
# Plot predicted against actual values

from matplotlib import pyplot as plt

# Use the best max_depth 
decision_tree_bac = DecisionTreeRegressor(max_depth=4) 
decision_tree_bac.fit(train_feature, train_target)

# Predict values for train and test
train_prediction = decision_tree_bac.predict(train_feature)
test_prediction = decision_tree_bac.predict(test_feature)

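# Scatter the predictions vs actual values
plt.scatter(train_prediction, train_target, label='train')
plt.scatter(test_prediction, test_target, label='test')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.legend()  # show which color is train and which is test
plt.show()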

## Example 4:  J.P. Morgan Dataset

In [None]:
# load data
df4 = pd.read_csv('~/Desktop/section_4/jpm_lags_12months_features.csv', header=0)
df4.head()

In [None]:
# split data

jpm = df4.values
# split into lagged variables and original time series
X4= jpm[:, 0:-1]  # slice all rows and start with column 0 and go up to but not including the last column
y4 = jpm[:,-1]  # slice all rows and last column, essentially separating out 't' column

In [None]:
# Columns t-1 to t-12, which are the lagged variables
X4

In [None]:
# Column t, which is the original time series
y4[0:10]

In [None]:
# Target Train-Test split (chronological: earlier rows train, later rows test)

Y4 = y4
traintarget_size = int(len(Y4) * 0.50)   # Set split
train_target, test_target = Y4[0:traintarget_size], Y4[traintarget_size:len(Y4)]

print('Observations for Target: %d' % (len(Y4)))
print('Training Observations for Target: %d' % (len(train_target)))
print('Testing Observations for Target: %d' % (len(test_target)))

In [None]:
# Features Train-Test split

trainfeature_size = int(len(X4) * 0.50)
train_feature, test_feature = X4[0:trainfeature_size], X4[trainfeature_size:len(X4)]
print('Observations for feature: %d' % (len(X4)))
print('Training Observations for feature: %d' % (len(train_feature)))
print('Testing Observations for feature: %d' % (len(test_feature)))

In [None]:
# Decision Tree Regression Model
from sklearn.tree import DecisionTreeRegressor

# Create a decision tree regression model with default arguments
decision_tree_jpm = DecisionTreeRegressor()  # max_depth not set

# Fit the model to the training features and targets
decision_tree_jpm.fit(train_feature, train_target)

# Check the score on train and test
print(decision_tree_jpm.score(train_feature, train_target))
# Test R-squared: 0 means no better than always predicting the mean of the test targets; negative means worse than that
print(decision_tree_jpm.score(test_feature, test_target))


In [None]:
# Find Best max depth

# Loop through a few different max depths and check the performance
# Try different max depths. We want to optimize our ML models to make the best predictions possible.
# For regular decision trees, max_depth, which is a hyperparameter, limits the depth of the tree, i.e. the number of successive splits from the root to any leaf.
# You can find the best value of max_depth based on the R-squared score of the model on the test set.

for d in [2, 3,4, 5,7,8,10]:
    # Create the tree and fit it
    decision_tree_jpm = DecisionTreeRegressor(max_depth=d)
    decision_tree_jpm.fit(train_feature, train_target)

    # Print out the scores on train and test
    print('max_depth=', str(d))
    print(decision_tree_jpm.score(train_feature, train_target))
    print(decision_tree_jpm.score(test_feature, test_target), '\n')  
    # You want the test score to be positive and high
 

In [None]:
# Plot predicted against actual values

from matplotlib import pyplot as plt

# Use the best max_depth 
decision_tree_jpm = DecisionTreeRegressor(max_depth=4)
decision_tree_jpm.fit(train_feature, train_target)

# Predict values for train and test
train_prediction = decision_tree_jpm.predict(train_feature)
test_prediction = decision_tree_jpm.predict(test_feature)

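# Scatter the predictions vs actual values
plt.scatter(train_prediction, train_target, label='train')
plt.scatter(test_prediction, test_target, label='test')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.legend()  # show which color is train and which is test
plt.show()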

## Example 5:  Average Temperature Dataset

In [None]:
# load data
df5 = pd.read_csv('~/Desktop/section_4/temp_lags_12months_features.csv', header=0)
df5.head()

In [None]:
temp = df5.values
# split into lagged variables and original time series
X5= temp[:,0:-1]  # slice all rows and start with column 0 and go up to but not including the last column
y5 = temp[:,-1]  # slice all rows and last column, essentially separating out 't' column

In [None]:
# Columns t-1 to t-12, which are the lagged variables
X5

In [None]:
# Column t, which is the original time series
y5[0:10]

In [None]:
# Target Train-Test split (chronological: earlier rows train, later rows test)

Y5 = y5
traintarget_size = int(len(Y5) * 0.80)   # Set split
train_target, test_target = Y5[0:traintarget_size], Y5[traintarget_size:len(Y5)]

print('Observations for Target: %d' % (len(Y5)))
print('Training Observations for Target: %d' % (len(train_target)))
print('Testing Observations for Target: %d' % (len(test_target)))

In [None]:
# Features Train-Test split

trainfeature_size = int(len(X5) * 0.80)
train_feature, test_feature = X5[0:trainfeature_size], X5[trainfeature_size:len(X5)]
print('Observations for feature: %d' % (len(X5)))
print('Training Observations for feature: %d' % (len(train_feature)))
print('Testing Observations for feature: %d' % (len(test_feature)))

In [None]:
# Decision Tree Regression Model

from sklearn.tree import DecisionTreeRegressor

# Create a decision tree regression model with default arguments
decision_tree_temp = DecisionTreeRegressor()  # max_depth not set

# Fit the model to the training features and targets
decision_tree_temp.fit(train_feature, train_target)

# Check the score on train and test
print(decision_tree_temp.score(train_feature, train_target))
# Test R-squared: 0 means no better than always predicting the mean of the test targets; negative means worse than that
print(decision_tree_temp.score(test_feature, test_target))


In [None]:
# Find best max depth

# Loop through a few different max depths and check the performance
# Try different max depths. We want to optimize our ML models to make the best predictions possible.
# For regular decision trees, max_depth, which is a hyperparameter, limits the depth of the tree, i.e. the number of successive splits from the root to any leaf.
# You can find the best value of max_depth based on the R-squared score of the model on the test set.

for d in [2, 3, 4, 5,7,8,10]:
    # Create the tree and fit it
    decision_tree_temp = DecisionTreeRegressor(max_depth=d)
    decision_tree_temp.fit(train_feature, train_target)

    # Print out the scores on train and test
    print('max_depth=', str(d))
    print(decision_tree_temp.score(train_feature, train_target))
    print(decision_tree_temp.score(test_feature, test_target), '\n')  # You want the test score to be positive and high
 


In [None]:
from matplotlib import pyplot as plt

# Use the best max_depth 
decision_tree_temp = DecisionTreeRegressor(max_depth=4)
decision_tree_temp.fit(train_feature, train_target)

# Predict values for train and test
train_prediction = decision_tree_temp.predict(train_feature)
test_prediction = decision_tree_temp.predict(test_feature)

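# Scatter the predictions vs actual values
plt.scatter(train_prediction, train_target, label='train')
plt.scatter(test_prediction, test_target, label='test')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.legend()  # show which color is train and which is test
plt.show()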

## Summary

In summary, we fit a decision tree model to each of the five datasets, each of which consisted of 12 lagged variables. Each dataset was split into features and target, and then further into training and test sets. We chose the best max_depth as the one that gave the highest positive R-squared score on the test set. The max_depth limits how many levels of splits the tree can have and is, essentially, the maximum height of the tree. The scatterplots display the predicted values against the actual values. The decision tree models predicted training data better than test data.
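The relationship between max_depth and the size of the tree can be checked on a fitted model with `get_depth()` and `get_n_leaves()`. The sketch below uses random data with 12 columns standing in for the lagged variables (an assumption for illustration only). A tree of depth d can have at most 2^d leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Random stand-in data with 12 feature columns (assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = rng.normal(size=200)

sizes = {}
for d in [2, 4, None]:  # None means no depth limit
    tree = DecisionTreeRegressor(max_depth=d, random_state=0).fit(X, y)
    sizes[d] = (tree.get_depth(), tree.get_n_leaves())
    print('max_depth =', d, 'actual depth =', sizes[d][0], 'leaves =', sizes[d][1])
```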

In [None]:
# End