# Tutorial Week 9 - Bagging, Boosting and Random Forests

## Question 1 

#### We consider the forest canopy height data from the spNNGP R package, as discussed in lectures.

https://cran.r-project.org/web/packages/spNNGP/index.html

#### A random subset of 5000 observations from the original data set will be used.  To get started you can import the following packages and functions.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import sklearn.model_selection as skm
from sklearn.tree import (DecisionTreeClassifier as DTC,
                          DecisionTreeRegressor as DTR,
                          plot_tree)
from scipy.spatial import ConvexHull
from sklearn.ensemble import \
    (RandomForestRegressor as RF,
     GradientBoostingRegressor as GBR)

#### You can read in the data as follows.

In [3]:
BCEF = pd.read_csv('BCEF.csv',header=0,names=["Easting","Northing","FCH","PTC","holdout"])
BCEF.head()

| Unnamed: 0 | Easting | Northing | FCH | PTC | holdout |
|---|---|---|---|---|---|
| 0 | 277.533541 | 1653.934914 | 12.53 | 76.0 | 1 |
| 1 | 270.594 | 1648.405211 | 7.05 | 77.353405 | 1 |
| 2 | 271.879366 | 1649.589274 | 11.64 | 46.062416 | 0 |
| 3 | 275.259678 | 1653.927843 | 9.8 | 43.476361 | 0 |
| 4 | 259.515725 | 1646.266058 | 15.49 | 81.627211 | 1 |


#### FCH is the response of interest (forest canopy height, in metres) and we will model FCH in terms of the Easting and Northing.  Answer the following questions. 

- #### a) Fit a regression tree using all the training data without pruning to predict FCH using Easting and Northing.  We consider the tree without pruning, since unpruned trees are used for bagging.  
- #### b) Plot the predictions of the fitted tree over the convex hull of the Easting and Northing values.  If ``reg`` is your regression tree and ``X`` is a pandas data frame containing the features ``Easting`` and ``Northing``, the code below does this.  See if you can understand the code by reading the help for each function.

        # Build a 100 x 100 grid covering the range of both coordinates.
        xp = np.linspace(X['Easting'].min(), X['Easting'].max(), 100)
        yp = np.linspace(X['Northing'].min(), X['Northing'].max(), 100)
        Xm, Ym = np.meshgrid(xp, yp)

        # Stack the grid into a two-column feature frame (same column order as X)
        # and predict at every grid point.
        X_reshaped = Xm.flatten().reshape(-1, 1)
        Y_reshaped = Ym.flatten().reshape(-1, 1)
        features = np.concatenate((X_reshaped, Y_reshaped), axis=1)
        features = pd.DataFrame(features, columns=X.columns)
        Z = reg.predict(features).reshape(Xm.shape)

        # Keep only grid points inside the convex hull of the observed locations.
        # Each row of hull.equations is (a, b, c), with a*x + b*y + c <= 0 inside the hull.
        subset_points = np.column_stack((np.array(X['Easting']), np.array(X['Northing'])))
        hull = ConvexHull(subset_points)
        mask = np.ones_like(Xm, dtype=bool)
        for eq in hull.equations:
            mask &= (Xm * eq[0] + Ym * eq[1] + eq[2] <= 0)
        # masked_array hides entries where mask is True, so pass the complement.
        X_masked = np.ma.masked_array(Xm, mask=~mask)
        Y_masked = np.ma.masked_array(Ym, mask=~mask)
        Z_masked = np.ma.masked_array(Z, mask=~mask)

        # Filled contour plot of the predictions, with the observed locations overlaid.
        fig, ax = plt.subplots()
        levels = 8
        cmap = "Blues"
        CS = ax.contourf(X_masked, Y_masked, Z_masked, levels, cmap=cmap)
        ax.clabel(CS, inline=True, fontsize=10)
        ax.plot(X['Easting'], X['Northing'], 'ko', markersize=1)
        ax.set_title('Regression tree fit with no pruning')
        cbar = fig.colorbar(CS)
        cbar.set_label('FCH')
        
- #### c) Repeat the above analysis using bagging.  Experiment with different values for B.
- #### d) Repeat the above analysis with boosting using stumps (set ``max_depth=1``).  Experiment with different values for B (``n_estimators``) and $\lambda$ (``learning_rate``).  As mentioned in lectures, boosting with stumps corresponds to an additive model.  Can you explain why?
- #### e) Repeat the above analysis with boosting using ``max_depth=2``.
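
#### A minimal sketch of how parts a), c), d) and e) might be set up.  Because ``BCEF.csv`` is not reproduced here, the block simulates a small stand-in data set; the simulated response and the helper ``interaction_contrast`` are illustrative assumptions, not part of the tutorial data.  Bagging is obtained from ``RandomForestRegressor`` by letting every split consider all features.  The contrast $f(e_1,n_1) - f(e_1,n_2) - f(e_2,n_1) + f(e_2,n_2)$ is zero for any additive function $f(e,n) = f_1(e) + f_2(n)$, so it gives a numerical check related to part d).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.ensemble import (RandomForestRegressor as RF,
                              GradientBoostingRegressor as GBR)

# ASSUMPTION: synthetic stand-in for BCEF.csv, not the real data.
# The response is higher only in the north-east corner (an interaction).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({"Easting": rng.uniform(255, 280, n),
                  "Northing": rng.uniform(1645, 1655, n)})
north_east = (X["Easting"] > 267) & (X["Northing"] > 1650)
y = 10 + 8 * north_east + rng.normal(0, 1, n)

# a) Unpruned regression tree: the defaults impose no depth limit or pruning.
tree = DTR(random_state=0).fit(X, y)

# c) Bagging: a random forest in which every split considers all features.
B = 100
bag = RF(n_estimators=B, max_features=X.shape[1], random_state=0).fit(X, y)

# d), e) Boosting with B trees, shrinkage lam, and trees of depth 1 or 2.
lam = 0.1
boost1 = GBR(n_estimators=B, learning_rate=lam, max_depth=1,
             random_state=0).fit(X, y)
boost2 = GBR(n_estimators=B, learning_rate=lam, max_depth=2,
             random_state=0).fit(X, y)

def interaction_contrast(model):
    """Hypothetical helper: f(e1,n1) - f(e1,n2) - f(e2,n1) + f(e2,n2)."""
    corners = pd.DataFrame({"Easting": [258.0, 258.0, 276.0, 276.0],
                            "Northing": [1646.0, 1654.0, 1646.0, 1654.0]})
    f = model.predict(corners)
    return f[0] - f[1] - f[2] + f[3]

print("training R^2, unpruned tree:", tree.score(X, y))
print("contrast, stumps (max_depth=1):", interaction_contrast(boost1))
print("contrast, max_depth=2:", interaction_contrast(boost2))
```

#### Each stump splits on a single feature, so the boosted fit $\hat f(x) = \hat f_0 + \sum_{b=1}^{B} \lambda \hat f^b(x)$ can be regrouped into a function of Easting alone plus a function of Northing alone.  With ``max_depth=2`` a tree can split on both features, so interactions become possible.  On the real data, the same calls applied to your own ``X`` and ``y`` give fitted models that can take the place of ``reg`` in the plotting code from part b).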

## Question 2

#### We consider once again the Singlish example from last lecture.  We will see if we can improve on a classification tree using bagging, random forests and gradient boosting.  To start, we do the following imports.

In [2]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import sklearn.model_selection as skm
from sklearn.tree import (DecisionTreeClassifier as DTC,
                          DecisionTreeRegressor as DTR,
                          plot_tree,
                          export_text)
from sklearn.metrics import (accuracy_score,
                             log_loss)
from sklearn.ensemble import \
     (RandomForestRegressor as RF,
      GradientBoostingRegressor as GBR,
      RandomForestClassifier as RFC,
      GradientBoostingClassifier as GBC)

#### Read the Singlish data into a pandas data frame.

In [3]:
Singlish = pd.read_csv('Singlish.csv')
Singlish.head()

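| Unnamed: 0 | a | e | h | s | g | l | m | label | phrase |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.142857 | 0.0 | 0.0 | 0.0 | 0.0 | 0.142857 | 0.0 | S | Act blur |
| 1 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | S | Agak agak |
| 2 | 0.2 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | S | Aiyoh |
| 3 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.166667 | S | Alamak |
| 4 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | S | Arrow |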


- #### a) Split the data into training and test, 50%/50%.  
- #### b) Fit a classification tree to the training data, using cost-complexity pruning, and examine the accuracy on the test data.
- #### c) Construct a bagging classifier with the ``RandomForestClassifier`` function, which works similarly to the ``RandomForestRegressor`` function that we discussed last week.  Bagging corresponds to setting ``max_features`` to the number of features, so that every split considers all of them.  Try different values for B, and examine the accuracy on the test data.
- #### d) Construct a boosting classifier with the ``GradientBoostingClassifier`` function, which works similarly to the ``GradientBoostingRegressor`` function that we used last week.  Try varying B and $\lambda$, and examine the accuracy on the test data.
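
#### A minimal sketch of one way the four parts might fit together.  Since ``Singlish.csv`` is not reproduced here, the block simulates a stand-in data set with the same seven feature names; the simulated labels (including the label ``"E"``), the tuning values, and the use of 5-fold cross-validation to choose ``ccp_alpha`` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection as skm
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.metrics import accuracy_score
from sklearn.ensemble import (RandomForestClassifier as RFC,
                              GradientBoostingClassifier as GBC)

# ASSUMPTION: synthetic stand-in for Singlish.csv, not the real data.
# Seven letter-frequency style features; the label depends on "a" and "e".
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame(rng.uniform(0, 0.5, size=(n, 7)), columns=list("aehsglm"))
signal = 6 * X["a"] - 6 * X["e"] + rng.normal(0, 0.5, n)
y = np.where(signal > 0, "S", "E")  # "E" is a hypothetical second label

# a) 50%/50% split into training and test data.
X_train, X_test, y_train, y_test = skm.train_test_split(
    X, y, test_size=0.5, random_state=0)

# b) Cost-complexity pruning: candidate alphas come from the pruning path
#    of the full tree, and cross-validation picks one of them.
path = DTC(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))
kfold = skm.KFold(5, shuffle=True, random_state=0)
grid = skm.GridSearchCV(DTC(random_state=0), {"ccp_alpha": alphas},
                        cv=kfold, scoring="accuracy").fit(X_train, y_train)
pruned = grid.best_estimator_

# c) Bagging: a random forest whose splits consider all features.
bag = RFC(n_estimators=200, max_features=X_train.shape[1],
          random_state=0).fit(X_train, y_train)

# d) Boosting with B = 200 stumps and shrinkage lambda = 0.05.
boost = GBC(n_estimators=200, learning_rate=0.05, max_depth=1,
            random_state=0).fit(X_train, y_train)

# Test accuracy of each classifier.
models = {"pruned tree": pruned, "bagging": bag, "boosting": boost}
results = {name: accuracy_score(y_test, m.predict(X_test))
           for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: test accuracy {acc:.3f}")
```

#### To experiment with B and $\lambda$, the values of ``n_estimators`` and ``learning_rate`` can be varied in a loop and the test accuracies compared.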