# Lab 5: Random Forest (RF) and eXtreme Gradient Boosting (XGBoost)

Tree ensembles combine several decision trees to achieve better predictive performance than a single decision tree. In this lab we will experiment with two approaches: Random Forest (RF) and eXtreme Gradient Boosting (XGBoost).

**Random forest** is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different samples of the data and combines their outputs: by majority vote for classification and by averaging for regression. Random forest is suitable when we have a large dataset and interpretability is not a major concern. A single decision tree is much easier to interpret and understand; because a random forest combines many decision trees, it is harder to interpret.
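
The aggregation step described above can be sketched in a few lines of NumPy. This is only an illustration with made-up predictions; the function names `majority_vote` and `average_prediction` are hypothetical, and scikit-learn's own `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Most frequent class per sample.

    tree_predictions has shape (n_trees, n_samples) and holds class labels.
    Ties are resolved in favour of the smallest label.
    """
    tree_predictions = np.asarray(tree_predictions)
    voted = []
    for sample_votes in tree_predictions.T:  # one column per sample
        labels, counts = np.unique(sample_votes, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

def average_prediction(tree_predictions):
    """Mean prediction per sample, as used for regression."""
    return np.asarray(tree_predictions, dtype=float).mean(axis=0)

# Made-up outputs of three trees for three samples (classification)
class_preds = [[5, 6, 6],
               [5, 6, 7],
               [6, 6, 7]]
print(majority_vote(class_preds))

# Made-up outputs of three trees for two samples (regression)
reg_preds = [[5.0, 6.0],
             [6.0, 6.5],
             [7.0, 5.5]]
print(average_prediction(reg_preds))
```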

**XGBoost** (eXtreme Gradient Boosting) is a distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is one of the most widely used tree ensemble libraries for regression and classification.
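
To give an idea of what "gradient-boosted" means, here is a minimal sketch of plain gradient boosting for squared error, using one-split trees (stumps) on a one-dimensional toy problem. It is not XGBoost's actual algorithm, which adds regularization, second-order gradient information and parallel tree construction; the helper names `fit_stump`, `predict_stump` and `boost`, as well as the toy data, are assumptions made for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residual (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the largest value so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds a damped copy of it."""
    pred = np.full_like(y, y.mean(), dtype=float)  # start from the mean
    errors = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)             # residual = negative gradient of squared error
        pred = pred + learning_rate * predict_stump(stump, x)
        errors.append(float(np.mean((y - pred) ** 2)))
    return pred, errors

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
pred, errors = boost(x, y)
print("MSE after first round:", errors[0])
print("MSE after last round: ", errors[-1])
```

The contrast with a random forest is that the trees are built one after another, each correcting the errors left by the previous ones, instead of being trained independently and then averaged.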

<a name="toc_40291_1"></a>
## Problem statement

In this notebook we are going to see how **tree ensembles** work with a simple example, specifically the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from the scikit-learn library, and the [XGBClassifier](https://xgboost.readthedocs.io/en/stable/parameter.html) from a library called `XGBoost`.


<a name="toc_40291_3"></a>
## Random Forest: predicting wine quality

We will start our tests with the Random Forest. First, we are going to import some libraries and functions that we will use:


*   `NumPy`, which allows us to work with arrays
*   `scikit-learn`, which provides the Random Forest classifier and the dataset-splitting function
*   `io`, which allows us to read the uploaded file from memory as a byte stream
*   `pandas`, which allows us to work with dataframes

In [6]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import io
import pandas as pd

Here we are going to use a [Wine Quality Dataset](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset?select=WineQT.csv). This dataset is related to red variants of the Portuguese "Vinho Verde" wine. The dataset describes the amount of various chemicals present in wine and their effect on its quality. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

This data frame contains the following columns:

Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable:
12. quality (score between 0 and 10)

The file also contains an `Id` column, which is not used as a feature.
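
Because the classes are unbalanced, it helps to look at the class distribution before judging any accuracy figure. The sketch below uses a small made-up `quality` column (not values from `WineQT.csv`) to show one way to count the classes with pandas and to compute the accuracy of always predicting the most common class.

```python
import pandas as pd

# Hypothetical quality labels for illustration only, NOT taken from WineQT.csv
toy = pd.DataFrame({"quality": [5, 5, 5, 6, 6, 6, 6, 7, 4, 8]})

# Number of samples per class, ordered by quality score
counts = toy["quality"].value_counts().sort_index()
shares = (counts / counts.sum()).round(2)
print(pd.DataFrame({"count": counts, "share": shares}))

# Accuracy obtained by always predicting the most frequent class
baseline = counts.max() / counts.sum()
print("Majority-class baseline accuracy:", baseline)
```

On the real data the same two lines can be applied to `df["quality"]` once the file is loaded; a classifier is only useful if it clearly beats this baseline.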

We are going to give it a try and create a Random Forest Classifier to predict the quality of the wine.

First, run the following cell and load the `WineQT.csv` file from your disk.

In [7]:
# TO DO: load "WineQT.csv" file from disk
from google.colab import files
uploaded = files.upload()

Saving WineQT.csv to WineQT (1).csv
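
The upload message above shows the file being stored under a different name (`WineQT (1).csv`) because a copy already existed. `files.upload()` returns a dictionary that maps file names to the raw bytes of each file, so a hard-coded key can fail with a `KeyError` if the name differs. The sketch below uses a stand-in dictionary with made-up contents to show how to read whichever file was uploaded, without depending on its exact name.

```python
import io
import pandas as pd

# Stand-in for the dictionary returned by files.upload(): file name -> raw bytes.
# The name and the CSV contents are made up for this illustration.
uploaded_demo = {"WineQT (1).csv": b"alcohol,quality,Id\n9.4,5,0\n9.8,6,1\n"}

# Use the name the file was actually stored under instead of hard-coding it
file_name = next(iter(uploaded_demo))

# io.BytesIO wraps the bytes in a file-like object that pandas can read
df_demo = pd.read_csv(io.BytesIO(uploaded_demo[file_name]))
print(file_name, df_demo.shape)
```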


Run this cell to load the data and split it into the training and testing datasets that will later be used to build the classifier.

In [8]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv(io.BytesIO(uploaded['WineQT.csv']))
# Split the dataset into 70% training and 30% testing (strictly speaking, the
# second part is a validation set, because we will make decisions and tune
# hyperparameters based on its results)
df_train, df_test = train_test_split(df, test_size=0.3)

# Extract the names of the attributes of the dataset
X_list = df_train.columns.values.tolist()
# Drop the ID attribute (the last column)
X_list.pop()
# Extract the name of the ground-truth attribute (now the last column)
y_list = X_list.pop()

# Extract the features and ground truth of the training data
XTrain = df_train[X_list]
yTrain = df_train[y_list]
# Extract the features and ground truth of the validation data
XVal = df_test[X_list]
yVal = df_test[y_list]
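
The cell above selects columns by position and draws a different random split on every run. As a hedged alternative, the sketch below selects columns by name and fixes the split with `random_state`, while `stratify` keeps the class proportions similar in both parts. The toy dataframe and its values are made up; only the column names `quality` and `Id` are assumed to match the real file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an assumed layout: features, then "quality", then "Id"
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 9.1, 12.0, 10.0, 11.5],
    "pH":      [3.5, 3.2, 3.3, 3.1, 3.4, 3.0, 3.3, 3.2],
    "quality": [5, 5, 6, 6, 5, 6, 5, 6],
    "Id":      list(range(8)),
})

# Select by name, so a reordered file cannot silently change the target
X = toy.drop(columns=["quality", "Id"])
y = toy["quality"]

# random_state makes the split repeatable; stratify preserves class proportions
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_tr.shape, X_va.shape)
```

A fixed seed makes accuracy figures comparable between runs, which matters when two models are compared on the same data.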

<a name="toc_40291_10"></a>
## Boosted Tree: predicting wine quality

In this second part, use the same dataset to train a boosted tree. 
Compare the results of both approaches. 

**What works better? RF or XGBoost?**

$\color{red}{\text{In this run both models overfit: the random forest reaches 100 percent accuracy on the training data but only 65.3 percent on the}\\ \text{ test data, and XGBoost reaches 86.5 percent on training but 60.1 percent on testing. XGBoost shows the smaller gap between training }\\ \text{and testing, while the random forest has the higher test accuracy, so RF works slightly better here}}$

In [9]:
#_________ TO DO____________

# Import both ensemble classifiers
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Fit the RF with default hyperparameters
model1 = RandomForestClassifier()
model1.fit(XTrain, yTrain)

# Check predictions for RF on the training data
y_pred1 = model1.predict(XTrain)

# Accuracy in percent: share of predictions that match the ground truth
perc1 = (100/len(y_pred1)*np.count_nonzero(yTrain == y_pred1))
print("RF Train result % = "+str(perc1))

# Check predictions for RF on the held-out data
y_pred2 = model1.predict(XVal)
perc2 = (100/len(y_pred2)*np.count_nonzero(yVal == y_pred2))
print("RF Test result % = "+str(perc2))


###############################################################################
# Fit the XGBoost
model2 = XGBClassifier()
model2.fit(XTrain, yTrain)

# Check predictions for XGB
y_pred3 = model2.predict(XTrain)
perc1 = (100/len(y_pred3)*np.count_nonzero(yTrain == y_pred3))
print("XGB Train result % = "+str(perc1))

y_pred4 = model2.predict(XVal)
perc2 = (100/len(y_pred4)*np.count_nonzero(yVal == y_pred4))
print("XGB Test result % = "+str(perc2))



RF Train result % = 100.0
RF Test result % = 65.3061224489796
XGB Train result % = 86.5
XGB Test result % = 60.05830903790088
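
Depending on the installed XGBoost version, `XGBClassifier.fit` may reject class labels that do not run from 0 to K-1, which would be the case for wine quality scores. Assuming that behaviour, one common workaround is to encode the labels with scikit-learn's `LabelEncoder` before fitting and to decode the predictions afterwards. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical quality labels, not taken from the dataset
quality = np.array([5, 6, 3, 8, 5, 7, 4, 6])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(quality)   # maps the sorted classes to 0..K-1
print(encoder.classes_)
print(y_encoded)

# A model would be fitted on y_encoded here; its predictions are then
# mapped back to the original quality scores:
restored = encoder.inverse_transform(y_encoded)
print(restored)
```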


<a name="toc_40291_11"></a>
## Random Forest: Heart Dataset

In this section, we are going to see how the algorithm works on a second simple example, using the Python library `scikit-learn`, specifically the ["RandomForestClassifier"](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class.

Here we have an example of a Random Forest Classifier for the `heart.csv` dataset. 

In [10]:
# Load the HEART dataset
#_________ TO DO____________
print('upload the heart.csv file')
uploaded2 = files.upload()

# Load the dataset into a pandas DataFrame
df = pd.read_csv(io.BytesIO(uploaded2['heart2.csv']))
# Split the dataset into 70% training and 30% testing (strictly speaking, the
# second part is a validation set, because we will make decisions and tune
# hyperparameters based on its results)
df_train, df_test = train_test_split(df, test_size=0.3)

# Extract the names of the attributes of the dataset
X_list = df_train.columns.values.tolist()
# Drop the last column (assumed to be an ID attribute)
X_list.pop()
# Extract the name of the ground-truth attribute (now the last column)
y_list = X_list.pop()

# Extract the features and ground truth of the training data
XTrain = df_train[X_list]
yTrain = df_train[y_list]
# Extract the features and ground truth of the validation data
XVal = df_test[X_list]
yVal = df_test[y_list]

# Fit the classifier
clf = RandomForestClassifier()
clf = clf.fit(XTrain, yTrain)

# Check predictions for training data
res = clf.predict(XTrain)
perc = (100/len(res)*np.count_nonzero(yTrain == res))
print("Train result % = "+str(perc))

# Check predictions for testing data
res = clf.predict(XVal)
perc = (100/len(res)*np.count_nonzero(yVal == res))
print("Test result % = "+str(perc))

upload the heart.csv file


Saving heart2.csv to heart2.csv
Train result % = 100.0
Test result % = 70.32967032967034
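
The accuracy expression is repeated in every cell, so it can be wrapped in a small helper. The sketch below defines a hypothetical `accuracy_percent` function that computes the same quantity as `100/len(res)*np.count_nonzero(y == res)`, checks it on made-up labels, and computes the gap between the training and testing accuracies printed above, which is a simple indicator of overfitting.

```python
import numpy as np

def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that match the ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)

# Made-up labels: 3 of the 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 1]
acc = accuracy_percent(y_true, y_hat)
print("Accuracy % =", acc)

# Gap between the training and testing accuracies printed above
train_acc, test_acc = 100.0, 70.32967032967034
gap = train_acc - test_acc
print("Train/test gap in percentage points =", round(gap, 2))
```

A perfect training score combined with a gap of roughly 30 points suggests that the unconstrained forest memorizes the training data, which motivates limiting the depth and leaf size in the next step.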


In the code cell below, change the maximum depth and the minimum number of samples required for a leaf node, as we did with the DT of the previous lab. Try different values for each parameter and find the best result. **Which combination works best?**

***Hint:* You might want to use a loop to try the different combinations**

$\color{red}{\text{The best combination of max_depth and min_samples_leaf is [5,3]}}$

In [12]:
# TO DO: Create Random Forest classifiers with different maximum depths and minimum numbers of samples per leaf node
#________TO DO_________

comb = []
results = []
for i in range(1, 25, 2):
  for j in range(1, 25, 2):

    clf2 = RandomForestClassifier(max_depth=i, min_samples_leaf=j)
    clf2 = clf2.fit(XTrain, yTrain)

    # Check predictions for training data
    res1 = clf2.predict(XTrain)
    percT_RF = (100/len(res1)*np.count_nonzero(yTrain == res1))
    print("RF Train result % = "+str(percT_RF))

    # Check predictions for testing data
    res2 = clf2.predict(XVal)
    percV_RF = (100/len(res2)*np.count_nonzero(yVal == res2))
    print("RF Test result % = "+str(percV_RF))

    results.append(percV_RF)
    comb.append([i,j])
  
print(f'Best combination of max_depth and min_samples_leaf is: {comb[results.index(max(results))]}')
  

RF Train result % = 67.9245283018868
RF Test result % = 62.63736263736264
RF Train result % = 65.56603773584905
RF Test result % = 64.83516483516485
RF Train result % = 66.9811320754717
RF Test result % = 65.93406593406594
RF Train result % = 66.50943396226415
RF Test result % = 64.83516483516485
RF Train result % = 68.39622641509435
RF Test result % = 64.83516483516485
RF Train result % = 66.0377358490566
RF Test result % = 64.83516483516485
RF Train result % = 66.50943396226415
RF Test result % = 64.83516483516485
RF Train result % = 66.9811320754717
RF Test result % = 64.83516483516485
RF Train result % = 66.50943396226415
RF Test result % = 63.736263736263744
RF Train result % = 66.50943396226415
RF Test result % = 63.736263736263744
RF Train result % = 68.86792452830188
RF Test result % = 64.83516483516485
RF Train result % = 67.45283018867924
RF Test result % = 62.63736263736264
RF Train result % = 67.9245283018868
RF Test result % = 63.736263736263744
RF Train result % = 67.9245
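
The nested loop above scores every combination on a single random split, so the winner can change from run to run. As an alternative sketch, scikit-learn's `GridSearchCV` runs the same kind of search with cross-validation. The example uses synthetic data from `make_classification` and a deliberately small grid; the grid values, `n_estimators=20` and `cv=3` are arbitrary choices for illustration, not recommendations for the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, NOT the heart dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination of these values is evaluated
param_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 3]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print("Best combination:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
```

Averaging over several folds reduces the influence of one lucky or unlucky split on the chosen hyperparameters.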

<a name="toc_40291_11"></a>
## XGBoost: Heart Dataset

Like in the previous section, implement an XGBoost ensemble and try different parametrizations of maximum depth. **How do these results compare with the RF?**

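$\color{red}{\text{Comparing this result with the RF, XGBoost is more accurate on the test data: its best test accuracy is 72.5 percent (max_depth = 11),}\\ \text{ against 70.3 percent for the default RF and roughly 63 to 66 percent for the RF runs printed above. From max_depth = 5 onward }\\ \text{XGBoost fits the training data perfectly, so it overfits as well}}$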

In [17]:
# TO DO: Create XGBoost ensemble classifiers with different maximum depths
#________TO DO_________

# Odd depths from 1 to 49
depths = list(range(1, 50, 2))
# One [train %, test %] pair per depth
accuracy = []

for i in depths:

  clf3 = XGBClassifier(max_depth = i)
  clf3 = clf3.fit(XTrain, yTrain)

  # Check predictions for training data
  res3 = clf3.predict(XTrain)
  percT_XGB = (100/len(res3)*np.count_nonzero(yTrain == res3))
  print(i)
  print("XGB Train result % = "+str(percT_XGB))

  # Check predictions for testing data
  res4 = clf3.predict(XVal)
  percV_XGB = (100/len(res4)*np.count_nonzero(yVal == res4))
  print("XGB Test result % = "+str(percV_XGB))

  accuracy.append([percT_XGB, percV_XGB])

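# Pick the depth with the highest testing accuracy (second value of each pair);
# max(accuracy) alone would rank the pairs by training accuracy first
best_idx = max(range(len(accuracy)), key=lambda k: accuracy[k][1])
print(f"Best depth of the tree: {depths[best_idx]}")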

1
XGB Train result % = 74.52830188679245
XGB Test result % = 70.32967032967034
3
XGB Train result % = 96.22641509433963
XGB Test result % = 62.63736263736264
5
XGB Train result % = 100.0
XGB Test result % = 67.03296703296704
7
XGB Train result % = 100.0
XGB Test result % = 68.13186813186813
9
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
11
XGB Train result % = 100.0
XGB Test result % = 72.52747252747254
13
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
15
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
17
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
19
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
21
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
23
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
25
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
27
XGB Train result % = 100.0
XGB Test result % = 71.42857142857143
29
XGB Train result % = 100.0
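
When each entry of the results list is a `[train, test]` pair, Python's `max` compares the pairs element by element, so it ranks by training accuracy first and only uses the testing accuracy to break ties. The short sketch below uses the first three depths printed above (values rounded) to show how this differs from selecting on the testing accuracy alone.

```python
depths = [1, 3, 5]
# [train %, test %] for depths 1, 3 and 5 from the output above, rounded
accuracy = [[74.53, 70.33], [96.23, 62.64], [100.0, 67.03]]

# Lexicographic comparison: the pair with the highest TRAINING accuracy wins
by_list_order = depths[accuracy.index(max(accuracy))]

# Explicit key: the pair with the highest TESTING accuracy wins
by_test = depths[max(range(len(accuracy)), key=lambda k: accuracy[k][1])]

print(by_list_order, by_test)  # prints: 5 1
```

Selecting on the training column rewards the most overfitted model, so the testing (validation) column is the one to rank by.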