In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import graphviz

%matplotlib inline


# Part 1 Fitting Classification Trees

The ${\tt sklearn}$ library has a lot of useful tools for constructing classification and regression trees:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import confusion_matrix, mean_squared_error

${\tt Carseats}$ data set: A data frame with 400 observations on the following 11 variables.

`Sales`: Unit sales (in thousands) at each location

`CompPrice`: Price charged by competitor at each location

`Income`: Community income level (in thousands of dollars)

`Advertising`: Local advertising budget for company at each location (in thousands of dollars)

`Population`: Population size in region (in thousands)

`Price`: Price company charges for car seats at each site

`ShelveLoc`: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

`Age`: Average age of the local population

`Education`: Education level at each location

`Urban`: A factor with levels No and Yes to indicate whether the store is in an urban or rural location

`US`: A factor with levels No and Yes to indicate whether the store is in the US or not

We'll start by using **classification trees** to analyze the ${\tt Carseats}$ data set. In these
data, ${\tt Sales}$ is a continuous variable, so we begin by converting it to a
binary variable. We create a new variable, called
${\tt High}$, which takes on a value of ${\tt Yes}$ if the ${\tt Sales}$ variable exceeds 8, and
takes on a value of ${\tt No}$ otherwise. We'll append this onto our DataFrame using the ${\tt .map()}$ method (the pandas counterpart of R's ${\tt ifelse()}$), and then do a little data cleaning to tidy things up:

In [2]:
df3 = pd.read_csv('Carseats.csv')
df3.head()

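The conversion described above might look like the following minimal sketch. The five rows are a hypothetical stand-in for the real ${\tt Carseats.csv}$ contents, and the integer encoding of the factor columns is one possible cleaning step, not necessarily the one used in this lab:

```python
import pandas as pd

# Hypothetical stand-in rows; in the lab, df3 is read from Carseats.csv.
df3 = pd.DataFrame({
    "Sales": [9.5, 11.22, 7.4, 4.15, 8.0],
    "ShelveLoc": ["Bad", "Good", "Medium", "Bad", "Medium"],
    "Urban": ["Yes", "Yes", "No", "Yes", "No"],
    "US": ["Yes", "Yes", "No", "No", "Yes"],
})

# High is "Yes" when Sales exceeds 8 (thousand units) and "No" otherwise.
df3["High"] = df3["Sales"].map(lambda x: "Yes" if x > 8 else "No")

# Assumed cleaning step: encode text-valued factors as integers for sklearn.
df3["ShelveLoc"] = pd.factorize(df3["ShelveLoc"])[0]
df3["Urban"] = df3["Urban"].map({"No": 0, "Yes": 1})
df3["US"] = df3["US"].map({"No": 0, "Yes": 1})

print(df3)
```

Note that a store with ${\tt Sales}$ of exactly 8 is labelled ${\tt No}$, because the rule is a strict inequality.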


In order to properly evaluate the performance of a classification tree on
the data, we must estimate the test error rather than simply computing
the training error. We first split the observations into a training set and a test
set:
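A sketch of the split, using ${\tt train\_test\_split()}$ on randomly generated stand-in data (the real lab would use the cleaned ${\tt Carseats}$ DataFrame). The 50/50 split and the ${\tt random\_state}$ value are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Carseats data (400 rows, like the original).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Price": rng.integers(24, 192, n),
    "Advertising": rng.integers(0, 30, n),
    "Sales": rng.uniform(0, 16, n).round(2),
})
df["High"] = df["Sales"].map(lambda x: "Yes" if x > 8 else "No")

# Predictors exclude Sales (High is derived from it) and the response itself.
X = df.drop(columns=["Sales", "High"])
y = df["High"]

# Assumed: half of the observations are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
print(X_train.shape, X_test.shape)
```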

We now use the ${\tt DecisionTreeClassifier()}$ class to fit a classification tree that predicts
${\tt High}$ using all variables except ${\tt Sales}$ (including it would be a little silly, since ${\tt High}$ is derived from it). See http://scikit-learn.org/stable/modules/tree.html for details.

We can limit the depth of a tree using the ${\tt max\_depth}$ parameter; here we set it to 6:
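A minimal sketch of fitting a depth-limited tree and computing its training accuracy. The data come from ${\tt make\_classification()}$ as a stand-in for ${\tt Carseats}$, so the accuracy printed here will not match the figure quoted in this lab:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class stand-in data (assumption: not the Carseats data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Limit the tree to at most 6 levels of splits.
clf = DecisionTreeClassifier(max_depth=6, random_state=0)
clf.fit(X_train, y_train)

# score() returns the fraction of correctly classified observations.
train_accuracy = clf.score(X_train, y_train)
print(f"Training accuracy: {train_accuracy:.3f}")
```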

We see that the training accuracy is 95.5%.

One of the most attractive properties of trees is that they can be
graphically displayed. Unfortunately, this is a bit of a roundabout process in ${\tt sklearn}$. We use the ${\tt export\_graphviz()}$ function to export the tree structure to a temporary ${\tt .dot}$ file,
and the ${\tt graphviz.Source()}$ function to display the image:
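The export step can be sketched as below. Passing ${\tt out\_file=None}$ makes ${\tt export\_graphviz()}$ return the DOT source as a string instead of writing a file, which is a variation on the temporary-file approach described above. The data and feature names are hypothetical stand-ins, and the ${\tt graphviz}$ import is guarded because the package may not be installed:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["Price", "Advertising", "Age", "Income"]
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# With out_file=None the DOT description is returned as a string.
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=feature_names,
    class_names=["No", "Yes"],
    filled=True,
)

try:
    import graphviz
    graph = graphviz.Source(dot_data)  # renders inline in a notebook
except ImportError:
    graph = None
    print(dot_data[:200])  # fall back to showing the raw DOT text
```

If ${\tt graphviz}$ is unavailable, ${\tt sklearn.tree.plot\_tree()}$ draws the same tree with ${\tt matplotlib}$ alone.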

The most important indicator of ${\tt High}$ sales appears to be ${\tt Price}$.

Finally, let's evaluate the tree's performance on
the test data. The ${\tt predict()}$ function can be used for this purpose. We can then build a confusion matrix, which shows that we are making correct predictions for
around 74.5% of the test data set:
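A sketch of the evaluation step on synthetic stand-in data (so the accuracy will differ from the value quoted above). The test accuracy is the sum of the confusion matrix diagonal divided by the number of test observations:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Carseats data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
clf = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_train, y_train)

# Predict on held-out data and tabulate true vs. predicted classes.
pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)  # rows: true class, columns: predicted

# Correct predictions lie on the diagonal.
test_accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"Test accuracy: {test_accuracy:.3f}")
```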

# Part 2 Fitting Regression Trees

Now let's try fitting a **regression tree** to the ${\tt Boston}$ data set. First, we create a
training set, and fit the tree to the training data using ${\tt medv}$ (median home value) as our response:

${\tt Boston}$ data set: A data frame with 506 rows and 13 variables.

`crim`: per capita crime rate by town.

`zn`: proportion of residential land zoned for lots over 25,000 sq.ft.

`indus`: proportion of non-retail business acres per town.

`chas`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). 

`nox`: nitrogen oxides concentration (parts per 10 million).

`rm`: average number of rooms per dwelling.

`age`: proportion of owner-occupied units built prior to 1940.

`dis`: weighted mean of distances to five Boston employment centres.

`rad`: index of accessibility to radial highways.

`tax`: full-value property-tax rate per \$10,000.

`ptratio`: pupil-teacher ratio by town.

`lstat`: lower status of the population (percent).

`medv`: median value of owner-occupied homes in \$1000s.


In [None]:
# Choosing max depth 2
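A minimal sketch of this step, assuming synthetic data from ${\tt make\_regression()}$ in place of the ${\tt Boston}$ data and an assumed 50/50 train/test split:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for Boston: 506 rows, 12 predictors, numeric response.
X, y = make_regression(n_samples=506, n_features=12, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# A depth-2 tree has at most 4 leaves, which keeps it easy to read.
regr_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
regr_tree.fit(X_train, y_train)
print(regr_tree.get_n_leaves())
```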


Let's take a look at the tree:

The variable ${\tt lstat}$ measures the percentage of individuals with lower
socioeconomic status. The tree indicates that lower values of ${\tt lstat}$ correspond
to more expensive houses. The tree predicts a median house price
of \$45,766 for larger homes (${\tt rm} \geq 7.435$) in suburbs in which residents have high socioeconomic
status (${\tt lstat} < 7.81$).

Now let's see how it does on the test data:

The test set MSE associated with the regression tree is
28.8. The square root of the MSE (the RMSE, which can be read as the typical distance between the predicted value and the true value; see https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e and https://tiaplagata.medium.com/interpreting-the-root-mean-squared-error-of-a-linear-regression-model-5166e6b10db8) is therefore around 5.37, indicating
that this model leads to test predictions that are typically within around \$5,370 of
the true median home value for the suburb.
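The relationship between MSE and RMSE can be checked with a small worked example. The four values below are made up for illustration; only the final line uses the MSE of 28.8 reported above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted medv values (in $1000s).
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([22.0, 25.6, 30.7, 35.4])

# Errors are 2, -4, 4, -2, so MSE = (4 + 16 + 16 + 4) / 4 = 10.
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(mse, rmse)

# RMSE implied by the test MSE reported for the regression tree.
reported_rmse = np.sqrt(28.8)
print(round(reported_rmse, 2))
```

Because ${\tt medv}$ is recorded in thousands of dollars, an RMSE of about 5.37 corresponds to roughly \$5,370.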

# Part 3 Bagging and Random Forests

Let's see if we can improve on this result using **bagging** and **random forests**. Recall that **bagging** is simply a special case of
a **random forest** with $m = p$. Therefore, the ${\tt RandomForestRegressor()}$ function can
be used to perform both random forests and bagging. Let's start with bagging:

In [None]:
# Bagging: using all features
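A sketch of bagging via ${\tt RandomForestRegressor()}$, assuming synthetic data with 13 predictors as a stand-in for ${\tt Boston}$. The number of trees and the ${\tt random\_state}$ are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 predictors.
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Bagging: every split may consider all p predictors (m = p).
bagged = RandomForestRegressor(
    n_estimators=100, max_features=X_train.shape[1], random_state=1
)
bagged.fit(X_train, y_train)

bag_mse = mean_squared_error(y_test, bagged.predict(X_test))
print(f"Bagging test MSE: {bag_mse:.2f}")
```

Using ${\tt X\_train.shape[1]}$ rather than a hard-coded number keeps ${\tt max\_features}$ equal to the predictor count if the columns change.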


The argument ${\tt max\_features=13}$ indicates that all 13 predictors should be considered
for each split of the tree -- in other words, that bagging should be done. How
well does this bagged model perform on the test set?

The test set MSE associated with the bagged regression tree is considerably lower than that of our single tree!

We can grow a random forest in exactly the same way, except that
we'll use a smaller value of the ${\tt max\_features}$ argument. Here we'll
use ${\tt max\_features = 6}$:

In [None]:
# Random forests: using 6 features


The test set MSE is even lower; this indicates that random forests yielded an
improvement over bagging in this case.

Using the ${\tt feature\_importances\_}$ attribute of the ${\tt RandomForestRegressor}$, we can view the importance of each
variable:
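A minimal sketch of reading the importances from a fitted forest. The data are synthetic and the column names ${\tt x0}$ to ${\tt x12}$ are hypothetical placeholders for the ${\tt Boston}$ predictors:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with hypothetical column names.
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=5, noise=10.0, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]

# Random forest: each split considers a random subset of 6 predictors.
rf = RandomForestRegressor(n_estimators=100, max_features=6, random_state=1)
rf.fit(X, y)

# Importances are normalized to sum to 1; larger means more influential.
importance = pd.Series(rf.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
print(importance.head())
```

A horizontal bar chart, for example ${\tt importance.plot(kind="barh")}$, is a common way to display these values.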

The results indicate that across all of the trees considered in the random
forest, the wealth level of the community (${\tt lstat}$) and the house size (${\tt rm}$)
are by far the two most important variables.

In [None]:
# Random forests: using 3 features


# Part 4 Boosting

Now we'll use the ${\tt GradientBoostingRegressor}$ class to fit **boosted
regression trees** to the ${\tt Boston}$ data set. The
argument ${\tt n\_estimators=500}$ indicates that we want 500 trees, and the option
${\tt max\_depth=3}$ limits the depth of each tree:
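A sketch of the boosted fit on synthetic stand-in data. The ${\tt learning\_rate=0.01}$ setting is an assumption inferred from the comparison made at the end of this part, and ${\tt random\_state}$ is an illustrative choice:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data.
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# 500 shallow trees, each fit to the residuals of the ensemble so far.
gbr = GradientBoostingRegressor(
    n_estimators=500, max_depth=3, learning_rate=0.01, random_state=1
)
gbr.fit(X_train, y_train)

boost_mse = mean_squared_error(y_test, gbr.predict(X_test))
print(f"Boosting test MSE: {boost_mse:.2f}")
```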

Let's check out the feature importances again:

We see that ${\tt lstat}$ and ${\tt rm}$ are again the most important variables by far. Now let's use the boosted model to predict ${\tt medv}$ on the test set:

The test MSE obtained is similar to the test MSE for random forests
and superior to that for bagging. If we want to, we can perform boosting
with a different value of the shrinkage parameter $\lambda$ (the ${\tt learning\_rate}$ argument in ${\tt sklearn}$). Here we take $\lambda = 0.2$:
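One way to compare shrinkage values is to fit the same model once per value and record the test MSE, as in this sketch on synthetic stand-in data. Which value wins depends on the data, so the ordering here need not match the result for ${\tt Boston}$:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data.
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fit one boosted model per shrinkage value and store its test MSE.
results = {}
for lam in (0.01, 0.2):
    model = GradientBoostingRegressor(
        n_estimators=500, max_depth=3, learning_rate=lam, random_state=1
    )
    model.fit(X_train, y_train)
    results[lam] = mean_squared_error(y_test, model.predict(X_test))

for lam, mse in results.items():
    print(f"lambda = {lam}: test MSE = {mse:.2f}")
```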

In this case, using $\lambda = 0.2$ leads to a slightly lower test MSE than $\lambda = 0.01$.