<a href="https://colab.research.google.com/github/JLDaniel77/DS-Unit-2-Sprint-2-Regression/blob/master/LS_DS_411A_Decision_Trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science — Tree Ensembles_ 

# Decision Trees Assignment

## Part 1: House Price Regression

Apply decision trees to the Ames housing dataset you've worked with this week!

- Try multiple features
- Try features you've engineered
- Try different `max_depth` parameters
- What's the best Test Root Mean Squared Error you can get? *Share with your cohort on Slack!*
- What's a cool visualization you can make? *Share with your cohort on Slack!*

In [316]:
!pip install graphviz
!apt-get install graphviz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
graphviz is already the newest version (2.40.1-2).
The following package was automatically installed and is no longer required:
  libnvidia-common-410
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 11 not upgraded.


In [0]:
# %matplotlib inline
import graphviz
from IPython.display import display
from ipywidgets import interact
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.metrics import mean_squared_error

In [0]:
# set dataframe display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# import data
url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv'
df = pd.read_csv(url)

In [0]:
# Remove categorical data
df = df.select_dtypes(include='number')

In [0]:
# Remove columns with NaN values, plus the non-predictive Id column
df = df.drop(columns=['LotFrontage', 'MasVnrArea', 'GarageYrBlt', 'Id'])

In [321]:
# View columns
df.columns

Index(['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [0]:
# Separate X and y variables
target = 'SalePrice'
features = ['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold']

X = df[features]
y = df[target]

### Baseline Decision Tree

In [323]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the model
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X_train, y_train)

# Predict y with X_test
y_pred = tree.predict(X_test)

# Print RMSE and R^2 scores for train and test data
print('Train R^2 Score:', tree.score(X_train, y_train))
print('Test R^2 Score:', tree.score(X_test, y_test))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))

Train R^2 Score: 0.6217396608043659
Test R^2 Score: 0.6613143997312394
Root Mean Squared Error: 48709.3143193345
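
The assignment asks for different `max_depth` values. One way to compare them is a simple sweep that records train and test RMSE at each depth. The sketch below is illustrative only: it uses synthetic data from `make_regression` as a stand-in (an assumption, so that it runs without downloading anything), and the names `X_demo`, `y_demo`, `results`, and `best_depth` are hypothetical. Swapping in the notebook's `X` and `y` should work the same way.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); replace with X, y from the Ames dataframe
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Fit one tree per depth and record (train RMSE, test RMSE)
results = {}
for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    results[depth] = (train_rmse, test_rmse)
    print(f'max_depth={depth:2d}  train RMSE={train_rmse:8.2f}  test RMSE={test_rmse:8.2f}')

# Depth with the lowest test RMSE
best_depth = min(results, key=lambda d: results[d][1])
print('Best max_depth by test RMSE:', best_depth)
```

Train RMSE keeps falling as depth grows, while test RMSE typically bottoms out and then rises once the tree starts to overfit. Picking the depth by test RMSE also leaks information from the test set, so a separate validation split or cross-validation is probably the safer choice.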


### Decision Tree to Max Test R^2 Score

In [0]:
# Make a copy of the dataframe
df_copy = df.copy()

# Engineered Features
df_copy['age'] = 2010 - df['YearBuilt']
df_copy['bedroom_bathroom_ratio'] = df['BedroomAbvGr'] / (df['FullBath'] + df['HalfBath'] + df['BsmtFullBath'] + df['BsmtHalfBath'])
df_copy['bedroom_per_sqft'] = df['BedroomAbvGr'] / (df['1stFlrSF'] + df['2ndFlrSF'] + df['BsmtFinSF1'] + df['BsmtFinSF2'])
df_copy['bathrooms_per_sqft'] = (df['BsmtFullBath'] + df['BsmtHalfBath'] + df['FullBath'] + df['HalfBath']) / (df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF'])

# Polynomial Features
df_copy['overall_qual_squared'] = df['OverallQual'] ** 2
df_copy['overall_cond_squared'] = df['OverallCond'] ** 2
df_copy['GrLivArea_squared'] = df['GrLivArea'] ** 2
df_copy['first_floor_sqft_squared'] = df['1stFlrSF'] ** 2
df_copy['second_floor_sqft_squared'] = df['2ndFlrSF'] ** 2
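
Ratio features like these divide by a sum that could be zero, which would produce `inf` values that scikit-learn estimators reject. The sketch below shows one possible guard, using a small hypothetical dataframe `demo` (made-up values, not Ames rows): zeros in the denominator are replaced with `NaN` before dividing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; the second house has no bathrooms
demo = pd.DataFrame({
    'BedroomAbvGr': [3, 2, 1],
    'FullBath': [2, 0, 1],
    'HalfBath': [1, 0, 0],
})

# Replace a zero denominator with NaN so the ratio is NaN rather than inf
baths = demo['FullBath'] + demo['HalfBath']
ratio = demo['BedroomAbvGr'] / baths.replace(0, np.nan)
print(ratio)
```

Any resulting `NaN` values would still need to be imputed or dropped before fitting, for example with the `SimpleImputer` imported at the top of the notebook.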

In [325]:
# View columns
df_copy.columns

Index(['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice', 'age',
       'bedroom_bathroom_ratio', 'bedroom_per_sqft', 'bathrooms_per_sqft',
       'overall_qual_squared', 'overall_cond_squared', 'GrLivArea_squared',
       'first_floor_sqft_squared', 'second_floor_sqft_squared'],
      dtype='object')

In [326]:
# Convert price to natural log
df_copy['ln_price'] = np.log(df_copy['SalePrice'])

# Separate X variables and y variable
target = 'ln_price'
features = ['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFinSF1', 
       'BsmtFinSF2','BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 
       'GrLivArea', 'BsmtFullBath','FullBath',  'KitchenAbvGr', 'TotRmsAbvGrd', 
       'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF','OpenPorchSF', 
       'EnclosedPorch', 'ScreenPorch', 'PoolArea', 'MoSold', 'YrSold', 'age',  
       'bedroom_bathroom_ratio', 'bathrooms_per_sqft',  'overall_qual_squared',
       'overall_cond_squared', 'GrLivArea_squared',  'first_floor_sqft_squared']

X = df_copy[features]
y = df_copy[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the model
tree = DecisionTreeRegressor(max_depth=6, random_state=42)
tree.fit(X_train, y_train)

# Predict y with X_test
y_pred = tree.predict(X_test)

# Print R^2 scores and RMSE (RMSE is in log-price units because the target is ln_price)
print('Train R^2 Score:', tree.score(X_train, y_train))
print('Test R^2 Score:', tree.score(X_test, y_test))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))

Train R^2 Score: 0.8885280975650572
Test R^2 Score: 0.8069334504146474
Root Mean Squared Error: 0.18284225302075302


### Visualization

In [0]:
def viz3D(fitted_model, df, feature1, feature2, target='', num=100):
    """
    Visualize model predictions in 3D, for regression or binary classification
    
    Parameters
    ----------
    fitted_model : scikit-learn model, already fitted
    df : pandas dataframe, which was used to fit model
    feature1 : string, name of feature 1
    feature2 : string, name of feature 2
    target : string, name of target
    num : int, number of grid points for each feature
    
    References
    ----------
    https://jakevdp.github.io/PythonDataScienceHandbook/04.12-three-dimensional-plotting.html
    https://scikit-learn.org/stable/auto_examples/tree/plot_iris.html  
    """
    x1 = np.linspace(df[feature1].min(), df[feature1].max(), num)
    x2 = np.linspace(df[feature2].min(), df[feature2].max(), num)
    X1, X2 = np.meshgrid(x1, x2)
    X = np.c_[X1.flatten(), X2.flatten()]
    if hasattr(fitted_model, 'predict_proba'):
        predicted = fitted_model.predict_proba(X)[:,1]
    else:
        predicted = fitted_model.predict(X)
    Z = predicted.reshape(num, num)
    
    fig = plt.figure()
    ax = plt.axes(projection='3d')
    ax.plot_surface(X1, X2, Z, cmap='viridis')
    ax.set_xlabel(feature1)
    ax.set_ylabel(feature2)
    ax.set_zlabel(target)
    return fig

In [0]:
target = 'SalePrice'
features = ['OverallQual', 'bedroom_bathroom_ratio']

X = df_copy[features]
y = df_copy[target]

In [329]:
# Fit the model
tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X, y)

# Plot the fitted tree's prediction surface (the model was fit on SalePrice)
viz3D(tree, df_copy, feature1='OverallQual', feature2='bedroom_bathroom_ratio', target='SalePrice')


## Part 2 / Stretch: "Play Tennis" Classification

We'll reproduce the "Play Tennis" example from Ross Quinlan's 1986 paper, [Induction of Decision Trees](https://link.springer.com/content/pdf/10.1007%2FBF00116251.pdf).

[According to Wikipedia](https://en.wikipedia.org/wiki/Ross_Quinlan), "John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical C4.5 and ID3 algorithms."

#### "Table 1 shows a small training set"

In [330]:
import pandas as pd

columns = 'No. Outlook Temperature Humidity Windy PlayTennis'.split()

raw = """1 sunny hot high false N
2 sunny hot high true N
3 overcast hot high false P
4 rain mild high false P
5 rain cool normal false P
6 rain cool normal true N
7 overcast cool normal true P
8 sunny mild high false N
9 sunny cool normal false P
10 rain mild normal false P
11 sunny mild normal true P
12 overcast mild high true P
13 overcast hot normal false P
14 rain mild high true N"""

data = [row.split() for row in raw.split('\n')]
tennis = pd.DataFrame(data=data, columns=columns).set_index('No.')
tennis['PlayTennis'] = (tennis['PlayTennis'] == 'P').astype(int)

tennis

No.,Outlook,Temperature,Humidity,Windy,PlayTennis
1,sunny,hot,high,False,0
2,sunny,hot,high,True,0
3,overcast,hot,high,False,1
4,rain,mild,high,False,1
5,rain,cool,normal,False,1
6,rain,cool,normal,True,0
7,overcast,cool,normal,True,1
8,sunny,mild,high,False,0
9,sunny,cool,normal,False,1
10,rain,mild,normal,False,1


#### "A decision tree that correctly classifies each object in the training set is given in Figure 2."

<img src="https://i.imgur.com/RD7d0u0.png" height="300">

In this dataset, the tennis player decided to play on 64% of the days, and decided not to on 36% of the days.

In [331]:
tennis['PlayTennis'].value_counts(normalize=True) * 100

1    64.285714
0    35.714286
Name: PlayTennis, dtype: float64

The tennis player played on 100% of the overcast days, 40% of the sunny days, and 60% of the rainy days.

In [332]:
tennis.groupby('Outlook')['PlayTennis'].mean() * 100

Outlook
overcast    100.0
rain         60.0
sunny        40.0
Name: PlayTennis, dtype: float64

On sunny days, the tennis player's decision depends on the humidity. (The Outlook and Humidity features interact.)

In [333]:
sunny = tennis[tennis['Outlook']=='sunny']
sunny.groupby('Humidity')['PlayTennis'].mean() * 100

Humidity
high        0
normal    100
Name: PlayTennis, dtype: int64

On rainy days, the tennis player's decision depends on the wind. (The Outlook and Windy features interact.)

In [334]:
rainy = tennis[tennis['Outlook']=='rain']
rainy.groupby('Windy')['PlayTennis'].mean() * 100

Windy
false    100
true       0
Name: PlayTennis, dtype: int64

#### Before modeling, we will ["encode" categorical variables, using pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html?highlight=get_dummies#computing-indicator-dummy-variables)

In [335]:
y = tennis['PlayTennis']
X = pd.get_dummies(tennis.drop(columns='PlayTennis'))
X

No.,Outlook_overcast,Outlook_rain,Outlook_sunny,Temperature_cool,Temperature_hot,Temperature_mild,Humidity_high,Humidity_normal,Windy_false,Windy_true
1,0,0,1,0,1,0,1,0,1,0
2,0,0,1,0,1,0,1,0,0,1
3,1,0,0,0,1,0,1,0,1,0
4,0,1,0,0,0,1,1,0,1,0
5,0,1,0,1,0,0,0,1,1,0
6,0,1,0,1,0,0,0,1,0,1
7,1,0,0,1,0,0,0,1,0,1
8,0,0,1,0,0,1,1,0,1,0
9,0,0,1,1,0,0,0,1,1,0
10,0,1,0,0,0,1,0,1,1,0


## Train a Decision Tree Classifier
Get a score of 100% (accuracy)

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
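
A minimal sketch of one way to approach this, assuming an unrestricted `DecisionTreeClassifier` scored on the same 14 rows it was trained on. The block rebuilds the table from Quinlan's Table 1 so that it runs on its own; the names `clf` and `train_accuracy` are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Rebuild the "Play Tennis" table so this block is self-contained
columns = 'No. Outlook Temperature Humidity Windy PlayTennis'.split()
raw = """1 sunny hot high false N
2 sunny hot high true N
3 overcast hot high false P
4 rain mild high false P
5 rain cool normal false P
6 rain cool normal true N
7 overcast cool normal true P
8 sunny mild high false N
9 sunny cool normal false P
10 rain mild normal false P
11 sunny mild normal true P
12 overcast mild high true P
13 overcast hot normal false P
14 rain mild high true N"""
data = [row.split() for row in raw.split('\n')]
tennis = pd.DataFrame(data=data, columns=columns).set_index('No.')

# Encode the target as 0/1 and one-hot encode the categorical features
y = (tennis['PlayTennis'] == 'P').astype(int)
X = pd.get_dummies(tennis.drop(columns='PlayTennis'))

# With no depth limit, the tree can keep splitting until every leaf is pure
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)

train_accuracy = clf.score(X, y)
print('Train accuracy:', train_accuracy)
```

Every row has a distinct combination of feature values, so a fully grown tree can separate all 14 examples. This is training accuracy only and says nothing about how the tree would do on unseen days.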

## Compare to Logistic Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
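
For the comparison, a sketch along the same lines could fit `LogisticRegression` with its default settings on the same one-hot encoded features and report training accuracy next to the tree's. The variable names (`tree_clf`, `log_reg`, `tree_accuracy`, `logreg_accuracy`) are illustrative, and the default regularization is an assumption rather than a tuned choice.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Rebuild the "Play Tennis" table so this block is self-contained
columns = 'No. Outlook Temperature Humidity Windy PlayTennis'.split()
raw = """1 sunny hot high false N
2 sunny hot high true N
3 overcast hot high false P
4 rain mild high false P
5 rain cool normal false P
6 rain cool normal true N
7 overcast cool normal true P
8 sunny mild high false N
9 sunny cool normal false P
10 rain mild normal false P
11 sunny mild normal true P
12 overcast mild high true P
13 overcast hot normal false P
14 rain mild high true N"""
data = [row.split() for row in raw.split('\n')]
tennis = pd.DataFrame(data=data, columns=columns).set_index('No.')
y = (tennis['PlayTennis'] == 'P').astype(int)
X = pd.get_dummies(tennis.drop(columns='PlayTennis'))

# Fit both models on the same encoded features
tree_clf = DecisionTreeClassifier(random_state=42).fit(X, y)
log_reg = LogisticRegression().fit(X, y)

tree_accuracy = tree_clf.score(X, y)
logreg_accuracy = log_reg.score(X, y)
print('Decision tree train accuracy:      ', tree_accuracy)
print('Logistic regression train accuracy:', logreg_accuracy)

# One coefficient per dummy column; there are no interaction terms
print(pd.Series(log_reg.coef_[0], index=X.columns))
```

A plain logistic regression gives each dummy column one additive weight, so it has no direct way to express the Outlook and Humidity or Outlook and Windy interactions noted earlier unless interaction features are added by hand. A tree captures them through nested splits.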

## Visualize the tree
https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
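
One possible approach is to call `export_graphviz` with `out_file=None`, which returns the tree as a DOT-format string. The sketch below stops at producing that string (named `dot_source` here for illustration). In a notebook with the `graphviz` package installed, passing it to `graphviz.Source` should render the diagram.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Rebuild the "Play Tennis" table so this block is self-contained
columns = 'No. Outlook Temperature Humidity Windy PlayTennis'.split()
raw = """1 sunny hot high false N
2 sunny hot high true N
3 overcast hot high false P
4 rain mild high false P
5 rain cool normal false P
6 rain cool normal true N
7 overcast cool normal true P
8 sunny mild high false N
9 sunny cool normal false P
10 rain mild normal false P
11 sunny mild normal true P
12 overcast mild high true P
13 overcast hot normal false P
14 rain mild high true N"""
data = [row.split() for row in raw.split('\n')]
tennis = pd.DataFrame(data=data, columns=columns).set_index('No.')
y = (tennis['PlayTennis'] == 'P').astype(int)
X = pd.get_dummies(tennis.drop(columns='PlayTennis'))

clf = DecisionTreeClassifier(random_state=42).fit(X, y)

# out_file=None returns the DOT source as a string instead of writing a file;
# class_names follow ascending class order, so 0 -> 'N' and 1 -> 'P'
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(X.columns),
    class_names=['N', 'P'],
    filled=True,
    rounded=True,
)
print(dot_source[:300])

# To render in a notebook (requires the graphviz package):
# import graphviz
# display(graphviz.Source(dot_source))
```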