# Decision Tree Ensembles - Bagging 
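*Ensemble methods* combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner. This notebook implements bagging: each tree is trained on a bootstrap sample of the training data, and the trees' predictions are combined by majority vote.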
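
To get a feel for why a vote among weak learners can beat a single learner, here is a minimal sketch under an idealized assumption: each of `k` voters is correct with probability `p`, independently of the others. The function name `majority_vote_accuracy` is hypothetical and not part of this notebook. Trees trained on bootstrap samples are correlated in practice, so this is an intuition rather than a guarantee.

```python
from math import comb


def majority_vote_accuracy(k, p):
    """Probability that more than half of k independent voters are correct.

    Assumes k is odd and every voter is right with probability p,
    independently of the others (an idealization).
    """
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))


# Accuracy of the vote for ensembles of growing size, each voter 60% accurate
for k in (1, 5, 25):
    print(k, round(majority_vote_accuracy(k, 0.6), 3))
```

Under these assumptions the vote's accuracy grows with `k` whenever `p` is above 0.5.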

Import the following libraries:
- csv
- numpy (as np)
- DecisionTreeClassifier from sklearn.tree
- preprocessing from sklearn
- train_test_split from sklearn.model_selection
- classification_report from sklearn.metrics
- matplotlib.pyplot (as plt)

In [33]:
import csv
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

Now, ***read the data*** using *csv*

The function **readData** reads the CSV file and returns its contents as a list of rows, with the same dimensions as the file itself. The function **cleanData** drops every row that contains a missing value, marked by `?`. <br>
In the next step, we prepare the data for pre-processing.

In [34]:
def readData(address):
    with open(address) as csvFile:
        reader = csv.reader(csvFile)
        data = [row for row in reader]
    return data


def cleanData(data):
    # Keep only the rows that have no missing value (marked by '?')
    return [row for row in data if '?' not in row]
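
As a quick illustration of the cleaning step, the sketch below applies the same filtering rule to a few hypothetical rows. The rows and the `class` header name are made up for the example and are not taken from the real CSV. Note that the check looks for a cell that is exactly `'?'`.

```python
def cleanData(data):
    # Same rule as in the notebook: keep rows that have no '?' cell
    return [row for row in data if '?' not in row]


# Hypothetical toy rows (a header plus three samples)
rows = [
    ['class', 'cap-shape', 'habitat'],
    ['e', 'x', 'u'],
    ['p', '?', 'g'],   # contains a missing value, so it is dropped
    ['e', 'b', 'm'],
]
cleaned = cleanData(rows)
print(cleaned)
```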


First, we read the file and store the number of rows in a variable. After cleaning, the difference between the initial and the remaining row counts gives the number of rows containing *missing values*, and we print the **percentage** of these incomplete rows.
<br>

Finally, we remove the header row, since it holds column names rather than a sample.

In [35]:
fileAddress = './train+dev+test.csv'
data = readData(fileAddress)
missingValues = len(data)
print(f"Number of rows before data cleaning: {len(data)}")
data = cleanData(data)
missingValues -= len(data)
data = np.array(data)[1:]  # remove headers
print(f"Number of rows after data cleaning: {len(data)}")
missingValues = round(missingValues/(len(data)+missingValues)*100, 2)
print(f"Percentage of missing values: {missingValues}%")

Number of rows before data cleaning: 8125
Number of rows after data cleaning: 5644
Percentage of missing values: 30.53%
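
The printed percentage can be checked by hand. The sketch below redoes the arithmetic from the printed counts, keeping in mind that the first count still includes the header row while the second does not. The variable names are chosen for this example only.

```python
rows_before = 8125   # printed count, header row included
rows_after = 5644    # printed count, header row already removed

samples_before = rows_before - 1        # exclude the header row
dropped = samples_before - rows_after   # rows that contained '?'
percentage = round(dropped / samples_before * 100, 2)
print(dropped, percentage)
```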


As you may have noticed, all features in this dataset are categorical, such as **cap-shape** or **habitat**. Scikit-learn decision trees do not handle categorical string values directly, so we convert these features to numerical values with `dummyVariables`. Despite its name, this function does not create dummy/indicator (one-hot) columns: it uses `LabelEncoder` to replace each category in a column with an integer code.


In [36]:
def dummyVariables(features):
    for column in range(features.shape[1]):
        # Encode each feature column (indices 0 to 21) independently
        featureStatus = set(features[:, column])
        transformer = preprocessing.LabelEncoder()
        transformer.fit(list(featureStatus))
        features[:, column] = transformer.transform(features[:, column])
    return features
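
To make the encoding concrete, the following sketch runs `LabelEncoder` on a hypothetical toy column and, for comparison, builds true indicator columns with `OneHotEncoder`. The toy values are assumptions for illustration; the one-hot variant is shown as an alternative and is not what this notebook uses.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy column with three categories
column = np.array(['x', 'b', 'x', 'f'])

# Integer codes: categories are sorted, so b -> 0, f -> 1, x -> 2
codes = LabelEncoder().fit_transform(column)
print(codes)

# One-hot alternative: one 0/1 column per category
onehot = OneHotEncoder().fit_transform(column.reshape(-1, 1)).toarray()
print(onehot.shape)
```

Integer codes impose an arbitrary order on the categories. A tree can still split on them, but one-hot columns avoid that implied ordering at the cost of more features.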

Now separate the labels of the samples and their features:
- **X** as the Feature Matrix (data)
- **Y** as the response vector (target)
<br>

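Then we pass the feature matrix **X** to the encoding function `dummyVariables`. Because **X** is a NumPy array of strings, the integer codes are stored back as strings, as the output shows.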

In [37]:
X = data[:, 1:]
Y = data[:, 0]
print(f"Before indicator variables: \n{X}")
X = dummyVariables(X)
print(f"\nAfter indicator variables: \n{X}")

Before indicator variables: 
[['x' 's' 'n' ... 'k' 's' 'u']
 ['x' 's' 'y' ... 'n' 'n' 'g']
 ['b' 's' 'w' ... 'n' 'n' 'm']
 ...
 ['x' 'y' 'g' ... 'w' 'y' 'p']
 ['x' 'y' 'c' ... 'w' 'c' 'd']
 ['f' 'y' 'c' ... 'w' 'c' 'd']]

After indicator variables: 
[['5' '2' '4' ... '1' '3' '5']
 ['5' '2' '7' ... '2' '2' '1']
 ['0' '2' '6' ... '2' '2' '3']
 ...
 ['5' '3' '3' ... '5' '5' '4']
 ['5' '3' '1' ... '5' '1' '0']
 ['2' '3' '1' ... '5' '1' '0']]


I use a train/test split to train and test the decision trees.
`train_test_split` returns four arrays, which we name:
`X_train, X_test, y_train, y_test`.

`X` and `Y` are the arrays to split, `test_size` is the fraction of the data assigned to the test set, and `random_state` ensures that we obtain the same split on every run.

I chose a 70% train and 30% test ratio.

In [38]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=24)
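
The effect of `test_size` and `random_state` is easy to verify on a small example. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) rather than the mushroom data, and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(['e'] * 5 + ['p'] * 5)

# Two splits with the same random_state
first = train_test_split(X_toy, y_toy, test_size=0.3, random_state=24)
second = train_test_split(X_toy, y_toy, test_size=0.3, random_state=24)

print(first[0].shape, first[1].shape)        # train and test feature shapes
print(np.array_equal(first[1], second[1]))   # same seed, same test rows
```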

Now, in order to train ***k*** classifiers, we also need ***k*** training sets, so we draw a new training set from the original training data ***k*** times.

For this purpose, I use the `bootstrap` function, which takes the features and labels and stacks them into one array. It then draws as many rows as the original set contains, sampling with replacement, and finally separates the labels from the features and returns the two collections for training.

In [39]:
def bootstrap(X, Y):
    dataset = np.column_stack((X, Y))
    newDataset = dataset[np.random.choice(
        dataset.shape[0], size=dataset.shape[0])]
    new_X = newDataset[:, :-1]
    new_Y = newDataset[:, -1]
    return new_X, new_Y
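
Sampling with replacement means some rows appear several times in a bootstrap sample and others not at all. The sketch below measures the share of distinct rows in one sample and compares it with the well-known limit 1 - 1/e (about 63%). The seeded generator and the sample size are assumptions made for repeatability; the `bootstrap` function above uses the unseeded `np.random.choice`, so its samples differ from run to run.

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded generator (assumption, for repeatability)
n = 10_000                        # hypothetical number of training rows

indices = rng.choice(n, size=n)   # sampling with replacement is the default
unique_fraction = np.unique(indices).size / n

# Compare the observed share of distinct rows with the limit 1 - 1/e
print(unique_fraction, 1 - np.exp(-1))
```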

We run the above function 5 times and obtain a list of 5 training datasets, each of which is a **tuple** holding a **feature** matrix and a **label** vector.
<br>

Below we print the pair that will be used to train the last classifier.

In [40]:
NUMBER_OF_BOOTSTRAP = 5
bootstrapDataset = [bootstrap(X_train, y_train)
                    for _ in range(NUMBER_OF_BOOTSTRAP)]
print(f"A pair including features and labels: \n{bootstrapDataset[-1]}")

A pair including features and labels: 
(array([['0', '3', '7', ..., '1', '3', '1'],
       ['2', '3', '7', ..., '2', '5', '1'],
       ['2', '3', '3', ..., '1', '4', '0'],
       ...,
       ['5', '2', '6', ..., '1', '3', '1'],
       ['2', '0', '4', ..., '2', '3', '1'],
       ['5', '3', '3', ..., '0', '4', '4']], dtype='<U24'), array(['e', 'e', 'e', ..., 'e', 'e', 'p'], dtype='<U24'))


### Classifier
We will first create an instance of the **DecisionTreeClassifier** called **tree**.
Inside the classifier, we specify criterion="entropy" so that splits are chosen by information gain, and max_depth=4 to limit the depth of each tree.
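
For readers unfamiliar with the entropy criterion, here is a minimal sketch of the standard definitions: Shannon entropy of a label set, and information gain as the parent's entropy minus the size-weighted entropy of the two children. The helper names `entropy` and `information_gain` and the toy labels are hypothetical; this illustrates the idea and is not scikit-learn's internal code.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node with two edible and two poisonous samples
parent = np.array(['e', 'e', 'p', 'p'])
print(entropy(parent))
# A split that separates the classes perfectly
print(information_gain(parent, parent[:2], parent[2:]))
```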

Next, we fit the tree on one bootstrap pair from `bootstrapDataset`, that is, its training feature matrix and training response vector.
I add the fitted tree to a list and repeat this cycle 5 times, until 5 trees have been built from the 5 training datasets.

In the last line, as an example, I print the type of the last element of the `classifiers` list.

In [46]:
classifiers = []
for index in range(NUMBER_OF_BOOTSTRAP):
    # Define the decision tree
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
    tree.fit(*bootstrapDataset[index])
    classifiers.append(tree)
print(f"The last classifier is: {type(classifiers[-1])}")

The last classifier is: <class 'sklearn.tree._classes.DecisionTreeClassifier'>
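
For comparison, scikit-learn ships a `BaggingClassifier` that performs the bootstrap and the training loop in one object. The sketch below is an assumption-laden alternative, not the method used in this notebook: it runs on synthetic stand-in data from `make_classification`, and the base tree is passed positionally because the keyword for that argument differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Five depth-limited entropy trees, each fitted on its own bootstrap sample
bagging = BaggingClassifier(
    DecisionTreeClassifier(criterion="entropy", max_depth=4),
    n_estimators=5,
    random_state=0,
)
bagging.fit(X_demo, y_demo)
print(len(bagging.estimators_))
```

When the base estimator provides `predict_proba`, `BaggingClassifier` averages predicted probabilities instead of counting hard votes, so its output may differ slightly from the manual majority vote in this notebook.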


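### Prediction
Each tree predicts a label for every test sample. We arrange the predictions so that each row holds the votes for one sample, then take the majority vote as the ensemble's prediction.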

In [42]:
votes = [tree.predict(X_test) for tree in classifiers]
votes = np.array(votes)
votes = np.transpose(votes)

In [43]:
def majority(vote):
    # find most frequent element in a list
    vote = list(vote)
    return max(set(vote), key=vote.count)

In [44]:
# Finding the majority vote
predicted_Y = [majority(vote) for vote in votes]
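
The voting step can be traced on a small example. The sketch below uses hypothetical predictions from 3 trees for 4 samples and a `Counter`-based variant of `majority`, which is an alternative way to find the most frequent label, not the notebook's implementation.

```python
import numpy as np
from collections import Counter


def majority(vote):
    # Most common label among the votes cast for one sample
    return Counter(vote).most_common(1)[0][0]


# Hypothetical predictions: one row per tree, one column per test sample
votes = np.array([
    ['e', 'p', 'e', 'p'],
    ['e', 'p', 'p', 'p'],
    ['p', 'p', 'e', 'e'],
])
votes = votes.T   # shape (4, 3): one row per sample, one column per tree

predicted = [majority(vote) for vote in votes]
print(predicted)
```

With two classes and an odd number of trees, as in this notebook (5 trees, labels `e` and `p`), a tie cannot occur. With an even number of trees or more classes, the tie-breaking rule would need to be chosen deliberately.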

In [45]:
# Evaluate the ensemble's predictions on the test set
accuracy = classification_report(y_test, predicted_Y)
print(accuracy)

              precision    recall  f1-score   support

           e       1.00      0.99      0.99      1083
           p       0.98      0.99      0.98       611

    accuracy                           0.99      1694
   macro avg       0.99      0.99      0.99      1694
weighted avg       0.99      0.99      0.99      1694
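
To read the report, it helps to see how its columns are computed. The sketch below uses hypothetical labels, chosen only to keep the arithmetic simple, and treats `p` as the positive class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels, chosen only to make the arithmetic easy to follow
y_true = ['e', 'e', 'e', 'p', 'p']
y_pred = ['e', 'e', 'p', 'p', 'p']

# Precision: of the 3 samples predicted 'p', 2 are truly 'p'
precision = precision_score(y_true, y_pred, pos_label='p')
# Recall: of the 2 samples that are truly 'p', both are found
recall = recall_score(y_true, y_pred, pos_label='p')
# Accuracy: 4 of the 5 predictions are correct
accuracy = accuracy_score(y_true, y_pred)
print(precision, recall, accuracy)
```

The `support` column in the report is the number of true samples of each class in the test set.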

