# A Brief Tour of the Trees and Forests

# 1. rpart

This package includes several example data sets along with functions for recursive partitioning and regression trees.  The response can be categorical or continuous, depending on whether one wants a classification tree or a regression tree. Along with the tree package, rpart is probably one of the two go-to packages for trees.  However, care should be taken, as the tree package and the rpart package can produce very different results.

In [None]:
library(rpart)
raw.orig <- read.csv(file="c:\\rsei212_chemical.txt", header=TRUE, sep="\t")
 
# Keep the dataset small and tidy
# The Dataverse: hdl:1902.1/21235
raw = subset(raw.orig, select=c("Metal","OTW","AirDecay","Koc"))
 
row.names(raw) = raw.orig$CASNumber
raw = na.omit(raw);
 
frmla = Metal ~ OTW + AirDecay + Koc
 
# Metal: Core Metal (CM); Metal (M); Non-Metal (NM); Core Non-Metal (CNM)
 
fit = rpart(frmla, method="class", data=raw)
 
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
 
# plot tree
plot(fit, uniform=TRUE, main="Classification Tree for Chemicals")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
 
# tabulate some of the data
table(subset(raw, Koc>=190.5)$Metal)

<a href="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/BasicTree.png"><img alt="Classification Tree for Chemicals (rpart)" src="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/BasicTree.png" width="437" height="472" /></a>
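The chemical data file above is local, so the following is a minimal sketch of the same workflow on the built-in iris data instead. It assumes that pruning at the complexity parameter with the lowest cross-validated error (the `xerror` column shown by `printcp`) is a reasonable choice; the names `fit.iris`, `best.cp` and `pruned` are made up for illustration.

```r
library(rpart)

set.seed(42)  # the cross-validation folds inside rpart are random

# Grow a classification tree on the built-in iris data
fit.iris <- rpart(Species ~ ., data = iris, method = "class")

# cptable has one row per candidate subtree; take the cp with the lowest CV error
cp.tab  <- fit.iris$cptable
best.cp <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]
pruned  <- prune(fit.iris, cp = best.cp)

# Resubstitution confusion matrix for the pruned tree
pred <- predict(pruned, type = "class")
table(predicted = pred, actual = iris$Species)
```

A common alternative is the one-standard-error rule, which picks the smallest tree whose `xerror` is within one `xstd` of the minimum.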

# 2. tree

This is the other widely used R package for classification and regression trees.  It has functions to prune the tree, general plotting functions, and a summary of the misclassifications (total loss). The output from tree can be easier to compare to the Generalized Linear Model (GLM) and Generalized Additive Model (GAM) alternatives.

In [None]:
###############
# TREE package
library(tree)
 
tr = tree(frmla, data=raw)
summary(tr)
plot(tr); text(tr)

<a href="http://i2.wp.com/statistical-research.com/wp-content/uploads/2012/12/BasicTree2.png"><img alt="Classification Tree for Chemicals (tree)" src="http://i2.wp.com/statistical-research.com/wp-content/uploads/2012/12/BasicTree2.png" width="437" height="472" /></a>
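The pruning and misclassification functions mentioned above can be sketched as follows. This uses the built-in iris data rather than the chemical data, and the object names (`tr.iris`, `cv`, `best.size`, `tr.pruned`) are hypothetical.

```r
library(tree)

set.seed(1)  # cv.tree assigns observations to folds at random

# Grow a tree, then cross-validate the cost-complexity pruning sequence
tr.iris <- tree(Species ~ ., data = iris)
cv      <- cv.tree(tr.iris, FUN = prune.misclass)

# cv$dev holds the cross-validated number of misclassifications per tree size
best.size <- cv$size[which.min(cv$dev)]
tr.pruned <- prune.misclass(tr.iris, best = best.size)

summary(tr.pruned)  # reports the misclassification error rate
plot(tr.pruned); text(tr.pruned)
```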

# 3. party

This is another package for recursive partitioning. One of the key functions in this package is ctree. As the package documentation indicates, it can be used for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework. The party package also implements 

In [None]:
###############
# PARTY package
library(party)
 
(ct = ctree(frmla, data = raw))
plot(ct, main="Conditional Inference Tree")
 
#Table of prediction errors
table(predict(ct), raw$Metal)
 
# Estimated class probabilities
tr.pred = predict(ct, newdata=raw, type="prob")

<a href="http://i2.wp.com/statistical-research.com/wp-content/uploads/2012/12/ConditionalTree.png"><img alt="Conditional Inference Tree" src="http://i2.wp.com/statistical-research.com/wp-content/uploads/2012/12/ConditionalTree.png" width="437" height="472" /></a>
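In party, `predict(..., type="prob")` returns a list with one probability vector per observation, which is awkward to inspect. A small sketch of turning it into a matrix, using the built-in iris data (the names `ct.iris`, `prob.list` and `prob.mat` are made up for illustration):

```r
library(party)

# Conditional inference tree on the built-in iris data
ct.iris <- ctree(Species ~ ., data = iris)

# One vector of class probabilities per observation
prob.list <- predict(ct.iris, newdata = iris, type = "prob")

# Stack the list into an n x classes matrix
prob.mat <- do.call(rbind, prob.list)
colnames(prob.mat) <- levels(iris$Species)
head(prob.mat)
```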

# 4. maptree

maptree is very good at graphing and pruning the results of hierarchical clustering and CART models. The trees produced by this package tend to be better labeled and of higher quality than the stock plots from rpart.

In [None]:
###############
# MAPTREE
library(maptree)
library(cluster)
draw.tree(clip.rpart(rpart(raw), best=7),
          nodeinfo=TRUE, units="species",
          cases="cells", digits=0)
a = agnes ( raw[2:4], method="ward" )
names(a)
a$diss
b = kgs (a, a$diss, maxclust=20)
 
plot(names(b), b, xlab="# clusters", ylab="penalty", type="n")
xloc = names(b)[b==min(b)]
yloc = min(b)
ngon(c(xloc,yloc+.75,10, "dark green"), angle=180, n=3)
apply(cbind(names(b), b, 3, 'blue'), 1, ngon, 4) # cbind(x,y,size,color)

<a href="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/maptree.png"><img alt="maptree plot" src="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/maptree.png" width="437" height="472" /></a>

# 5. partykit

This contains a re-implementation of the ctree function and provides some very good graphing and visualization for tree models.  It is similar to the party package.  The examples shown here use the airquality dataset and the famous iris species data, both available in R; they can be found in the package documentation.

<a href="http://statistical-research.com/wp-content/uploads/2012/12/species.png"><img alt="Species Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/species.png" width="437" height="472" /></a> <a href="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png"><img alt="Ozone Air Quality Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png" width="437" height="472" /></a>
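A minimal sketch of how trees like these can be produced, following the examples in the partykit documentation. The object names `airq`, `airct` and `irisct` are only illustrative.

```r
library(partykit)

# Regression tree: Ozone in the airquality data (rows with missing Ozone dropped)
airq  <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)
plot(airct)

# Classification tree: Species in the iris data
irisct <- ctree(Species ~ ., data = iris)
plot(irisct)

# Fitted classes against observed classes
table(predict(irisct), iris$Species)
```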

# 6. evtree

This package uses evolutionary algorithms.  The idea behind this approach is that it will reduce the a priori bias.  I have seen trees of this sort in environmental research, bioinformatics, systematics, and marine biology, so there are many areas of application beyond phylogenetics.

In [None]:
###############
## EVTREE (Evolutionary Learning)
library(evtree)
 
ev.raw = evtree(frmla, data=raw)
plot(ev.raw)
table(predict(ev.raw), raw$Metal)
1-mean(predict(ev.raw) == raw$Metal)

<a href="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/EvolutionaryTree.png"><img alt="Evolutionary Tree" src="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/EvolutionaryTree.png" width="437" height="472" /></a>
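Because the evolutionary search is stochastic, two runs can return different trees, so it helps to fix the random seed. A small sketch on the built-in iris data, with the `minbucket` and `maxdepth` settings borrowed from the evtree documentation example (the name `ev.iris` is made up):

```r
library(evtree)

set.seed(1090)  # the evolutionary search is random; fix the seed for reproducibility

# Shallow evolutionary tree on the built-in iris data
ev.iris <- evtree(Species ~ ., data = iris, minbucket = 10, maxdepth = 2)
plot(ev.iris)

# Confusion matrix and resubstitution error rate
table(predict(ev.iris), iris$Species)
ev.err <- 1 - mean(predict(ev.iris) == iris$Species)
ev.err
```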

# 7. randomForest

Random forests are an ensemble learning method used for classification and regression.  They combine many tree models to get better performance than a single tree model.  In addition, because many samples are drawn in the process, a measure of variable importance can be obtained. This can be used for model selection, and is particularly useful when forward/backward stepwise selection is not appropriate or when an extremely high number of candidate variables needs to be reduced.

In [None]:
##################
## randomForest
library(randomForest)
fit.rf = randomForest(frmla, data=raw)
print(fit.rf)
importance(fit.rf)
plot(fit.rf)
plot( importance(fit.rf), lty=2, pch=16)
lines(importance(fit.rf))
imp = importance(fit.rf)
impvar = rownames(imp)[order(imp[, 1], decreasing=TRUE)]
op = par(mfrow=c(1, 3)) # one panel per predictor
for (i in seq_along(impvar)) {
  partialPlot(fit.rf, raw, impvar[i], xlab=impvar[i],
              main=paste("Partial Dependence on", impvar[i]),
              ylim=c(0, 1))
}
par(op) # restore the previous graphics settings


<a href="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/ErrorAndImportance.png"><img alt="Random Forest Error and Importance" src="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/ErrorAndImportance.png" width="437" height="472" /></a>
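By default `importance()` reports only the Gini-based measure. As a sketch of the permutation-based alternative, the forest can be grown with `importance=TRUE`; this uses the built-in iris data, and the names `rf.iris`, `imp.acc` and `oob` are hypothetical.

```r
library(randomForest)

set.seed(71)  # bootstrap samples and candidate variables are drawn at random

# importance=TRUE also computes the permutation (mean decrease in accuracy) measure
rf.iris <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# type=1 selects mean decrease in accuracy; sort from most to least important
imp.acc <- importance(rf.iris, type = 1)
imp.acc[order(imp.acc[, 1], decreasing = TRUE), , drop = FALSE]

varImpPlot(rf.iris)  # dot charts of both importance measures

# Out-of-bag error estimate after the last tree
oob <- rf.iris$err.rate[rf.iris$ntree, "OOB"]
oob
```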

# 8. varSelRF

This can be used for further variable selection using random forests.  It implements both backward stepwise elimination and selection based on the importance spectrum.  The examples use randomly generated data, so the correlation structure can be set such that the first variable is strongly correlated with the response and the other variables are less so.

In [None]:
##################
## varSelRF package
library(varSelRF)
x = matrix(rnorm(25 * 30), ncol = 30)
x[1:10, 1:2] = x[1:10, 1:2] + 2
cl = factor(c(rep("A", 10), rep("B", 15)))
rf.vs1 = varSelRF(x, cl, ntree = 200, ntreeIterat = 100,
vars.drop.frac = 0.2)
 
rf.vs1
plot(rf.vs1)
 
## Example of the importance function: x1 is forced to be the most important variable,
## while x2 is a secondary variable that is related to x1.
x1=rnorm(500)
x2=rnorm(500,x1,1)
y=runif(1,1,10)*x1+rnorm(500,0,.5)
my.df=data.frame(y,x1,x2,x3=rnorm(500),x4=rnorm(500),x5=rnorm(500))
rf1 = randomForest(y~., data=my.df, mtry=2, ntree=50, importance=TRUE)
importance(rf1)
cor(my.df)

<a href="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/ImportanceOOBError.png"><img alt="Variable Importance and OOB Error" src="http://i1.wp.com/statistical-research.com/wp-content/uploads/2012/12/ImportanceOOBError.png" width="437" height="472" /></a>
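To make the idea of backward elimination concrete, here is a rough sketch written directly with randomForest. It is not the varSelRF implementation, only an illustration of the loop: fit a forest, record the out-of-bag error, drop the least important 20% of the variables, and repeat. The column names and the `history` data frame are made up for the example.

```r
library(randomForest)

set.seed(123)

# Same kind of simulated data as above: only the first two columns carry signal
x <- matrix(rnorm(25 * 30), ncol = 30)
colnames(x) <- paste0("v", 1:30)
x[1:10, 1:2] <- x[1:10, 1:2] + 2
cl <- factor(c(rep("A", 10), rep("B", 15)))

vars    <- colnames(x)
history <- data.frame(n.vars = integer(0), oob = numeric(0))

while (length(vars) >= 2) {
  rf <- randomForest(x[, vars, drop = FALSE], cl, ntree = 200)

  # Record the OOB error for the current variable set
  history <- rbind(history,
                   data.frame(n.vars = length(vars),
                              oob = unname(rf$err.rate[rf$ntree, "OOB"])))

  # Keep the most important variables, dropping 20% (at least one) each round
  imp    <- importance(rf)[, 1]
  n.drop <- max(1, floor(0.2 * length(vars)))
  vars   <- names(sort(imp, decreasing = TRUE))[seq_len(length(vars) - n.drop)]
}

history
```

One difference worth noting: as I understand it, varSelRF ranks the variables once from the initial forest by default, whereas this sketch recomputes the importances at every step.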

# 9. oblique.tree

This package grows an oblique decision tree (a general form of the axis-parallel tree).  This example uses the crabs dataset (morphological measurements on Leptograpsus crabs) from the MASS package to grow the oblique tree.

In [None]:
## OBLIQUE.TREE
library(oblique.tree)
library(MASS) # provides the crabs dataset

# Four groups of 50 crabs, described by principal components 2 and 3
aug.crabs.data = data.frame(g=factor(rep(1:4, each=50)),
                            predict(princomp(crabs[,4:8]))[,2:3])
plot(aug.crabs.data[,-1], type="n")
text(aug.crabs.data[,-1], col=as.numeric(aug.crabs.data[,1]), labels=as.numeric(aug.crabs.data[,1]))
ob.tree = oblique.tree(formula = g~.,
                       data = aug.crabs.data,
                       oblique.splits = "only")
plot(ob.tree); text(ob.tree)

<a href="http://i2.wp.com/statistical-research.com/wp-content/uploads/2012/12/ObliqueTree.png"><img alt="Oblique Tree" src="http://i2.wp.com/statistical-research.com/wp-content/uploads/2012/12/ObliqueTree.png" width="437" height="472" /></a>
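To see why an oblique split can help, here is a base R sketch that does not use oblique.tree at all. It simulates two classes separated roughly along the line x1 + x2 = 0 and compares the best single axis-parallel split with a split on a linear combination of both variables. Using logistic regression to find that combination is an assumption made for this illustration, and all names here are hypothetical.

```r
set.seed(7)
n   <- 400
x1  <- rnorm(n)
x2  <- rnorm(n)
# Class boundary runs diagonally, with a little noise
cls <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.3) > 0, "A", "B"))

# Accuracy of the best single threshold on one variable (axis-parallel split)
best.axis.acc <- function(x, cls) {
  acc <- sapply(sort(unique(x)), function(t) {
    a <- mean((x > t) == (cls == "A"))
    max(a, 1 - a)  # either side of the threshold may be labelled "A"
  })
  max(acc)
}
axis.acc <- max(best.axis.acc(x1, cls), best.axis.acc(x2, cls))

# Oblique split: threshold a linear combination of x1 and x2.
# With a factor response, glm models the probability of the second level ("B").
lg          <- glm(cls ~ x1 + x2, family = binomial)
oblique.acc <- mean((predict(lg, type = "response") > 0.5) == (cls == "B"))

c(axis.parallel = axis.acc, oblique = oblique.acc)
```

A single oblique split captures the diagonal boundary that an axis-parallel tree would need many splits to approximate.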

#                                                                                             Thanks for reading...