# Decision Tree Ensembles - Bagging 
*Ensemble methods* combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners comes together to form a strong learner.
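
To make the weak-learner idea concrete, here is a minimal sketch that computes how often a majority vote is right. It assumes an idealized setting in which every voter is independently correct with the same probability `p`; the function name `majority_accuracy` is hypothetical, and real bagged trees are correlated, so the gain in practice is smaller.

```python
from math import comb


def majority_accuracy(p, k):
    """Probability that a majority of k independent voters, each correct
    with probability p, picks the right answer (k is assumed to be odd)."""
    needed = k // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(needed, k + 1))


# Accuracy of the vote for growing ensemble sizes, with 60%-accurate voters
for k in (1, 5, 25):
    print(k, round(majority_accuracy(0.6, k), 3))
```

Under these assumptions the accuracy of the vote grows with the number of voters, which is the intuition behind bagging.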

### Import Library
Import the following libraries:
- csv
- numpy (as np)
- DecisionTreeClassifier from sklearn.tree
- preprocessing from sklearn
- train_test_split from sklearn.model_selection
- classification_report from sklearn.metrics
- matplotlib.pyplot (as plt)

In [None]:
import csv
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

### Read data
Now, ***read the data*** using *csv*.

The function **readData** reads the CSV file and returns every row, header included, as a list of lists. The function **cleanData** then drops every row that contains a `?`, the marker used for missing values. <br>
In the next step, we prepare the data for pre-processing.

In [None]:
def readData(address):
    with open(address) as csvFile:
        reader = csv.reader(csvFile)
        data = [row for row in reader]
    return data


def cleanData(data):
    return [row for row in data if '?' not in row]
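
As a small illustration of the reading and cleaning steps, the sketch below runs the same logic on a made-up in-memory CSV instead of the real file. The sample rows and the names `raw`, `samples`, `kept`, and `dropped_pct` are assumptions for this example only.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for the real file
raw = io.StringIO("class,cap-shape,habitat\ne,x,u\np,?,g\ne,b,?\np,f,d\n")
rows = [row for row in csv.reader(raw)]

# Separate the header from the samples before counting anything
header, samples = rows[0], rows[1:]

# Keep only rows in which no cell is exactly '?'
kept = [row for row in samples if '?' not in row]

dropped_pct = round((len(samples) - len(kept)) / len(samples) * 100, 2)
print(kept)
print(f"{dropped_pct}% of rows dropped")
```

Note that `'?' not in row` tests whole cells of the row, not substrings, so a cell such as `a?` would not be treated as missing.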


First, we read the data and store the number of rows in a variable. After cleaning, the difference between the row counts before and after gives the number of rows that contain *missing values*, and we print the **percentage** of these incomplete rows.
<br>

We also remove the header row, since it holds only the column names and no sample data.

In [None]:
fileAddress = './train+dev+test.csv'
data = readData(fileAddress)
rowsBefore = len(data) - 1  # do not count the header row
print(f"Number of rows before data cleaning: {rowsBefore}")
data = cleanData(data)
data = np.array(data)[1:]  # remove headers
print(f"Number of rows after data cleaning: {len(data)}")
# Share of sample rows that were dropped because they contained '?'
missingPercentage = round((rowsBefore - len(data)) / rowsBefore * 100, 2)
print(f"Percentage of rows with missing values: {missingPercentage}%")

### Indicator variables
As you may have noticed, all features in this dataset are categorical, such as **cap-shape** or **habitat**. Sklearn decision trees do not handle categorical string values directly, so we convert these features to numerical values with `dummyVariables`. Note that, despite its name, this function uses `LabelEncoder`, which maps each category to an integer code rather than creating separate dummy/indicator columns.
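
The difference between integer codes and true indicator columns can be seen on a tiny example. This is only a sketch: the `habitat` values are made up, and `OneHotEncoder` is shown for comparison, not because the notebook uses it.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

habitat = np.array(['g', 'u', 'd', 'g'])  # made-up sample values

# Label encoding: one integer per category, assigned in sorted order
codes = LabelEncoder().fit_transform(habitat)
print(codes)

# One-hot encoding: one 0/1 indicator column per category
one_hot = OneHotEncoder().fit_transform(habitat.reshape(-1, 1)).toarray()
print(one_hot)
```

Integer codes impose an arbitrary order on the categories, which a tree can still split on, while indicator columns avoid that order at the cost of more features.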


In [None]:
def dummyVariables(features):
    # Integer-encode each column into a new numeric array, so the codes
    # are not written back into the string array as text
    encoded = np.empty(features.shape, dtype=int)
    for column in range(features.shape[1]):
        transformer = preprocessing.LabelEncoder()
        encoded[:, column] = transformer.fit_transform(features[:, column])
    return encoded

Now separate the labels of the samples and their features:
- **X** as the Feature Matrix (data)
- **Y** as the response vector (target)
<br>

Then we pass the feature matrix **X** to the encoding function `dummyVariables`.

In [None]:
X = data[:, 1:]
Y = data[:, 0]
print(f"Before indicator variables: \n{X}")
X = dummyVariables(X)
print(f"\nAfter indicator variables: \n{X}")

### Train - Test split
I use a train/test split to train and evaluate the decision trees.
`train_test_split` returns four arrays, which we name:
`X_train, X_test, y_train, y_test`.

It takes the feature matrix **X** and the labels **Y** to be split; `test_size` sets the fraction of samples held out for testing, and `random_state` fixes the shuffle so that we obtain the same split on every run.

I chose a ratio of 70% for the training set and 30% for the test set.
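
The effect of `test_size` and `random_state` can be checked on toy data. The arrays `X_toy` and `y_toy` below are made up for this sketch; only the `train_test_split` call mirrors the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Two calls with the same random_state
first = train_test_split(X_toy, y_toy, test_size=0.3, random_state=24)
second = train_test_split(X_toy, y_toy, test_size=0.3, random_state=24)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)

# True when both calls produced identical splits
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```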

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=24)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

### Bootstrap
Now, in order to train ***k*** classifiers, we also need ***k*** training sets, so we draw a new training set from the original training data ***k*** times.

For this purpose, I use the `bootstrap` function, which takes the features and labels and stacks them side by side so that each row keeps its label. It then draws as many rows as the original set contains, sampling with replacement, and finally separates the labels from the features again and returns the two arrays for training.
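
Sampling with replacement means some rows appear several times and others not at all. The sketch below illustrates this by drawing row indices instead of stacking the arrays; the seeded generator, the toy arrays, and the index-based approach are assumptions of this example, not the notebook's `bootstrap` function.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
n = 1000

# Toy features and labels
X_toy = np.arange(n * 2).reshape(n, 2)
y_toy = np.arange(n) % 2

# Draw n row indices with replacement and apply them to both arrays,
# which keeps features and labels aligned and preserves their dtypes
indices = rng.integers(0, n, size=n)
X_boot, y_boot = X_toy[indices], y_toy[indices]

unique_fraction = len(np.unique(indices)) / n
print(f"Distinct rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"Limit for large n, 1 - 1/e: {1 - np.exp(-1):.1%}")
```

Roughly 63% of the original rows end up in each bootstrap sample, so every tree sees a somewhat different view of the training data.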

In [None]:
def bootstrap(X, Y):
    dataset = np.column_stack((X, Y))
    newDataset = dataset[np.random.choice(
        dataset.shape[0], size=dataset.shape[0])]
    new_X = newDataset[:, :-1]
    new_Y = newDataset[:, -1]
    return new_X, new_Y

Running the above function 5 times gives a list of 5 training datasets, each of which is a **tuple** holding a **feature** matrix and a **label** vector.
<br>

Below we print an example of that pair that will be used to train the last classifier.

In [None]:
NUMBER_OF_BOOTSTRAP = 5
bootstrapDataset = [bootstrap(X_train, y_train)
                    for _ in range(NUMBER_OF_BOOTSTRAP)]
print(f"A pair including features and labels: \n{bootstrapDataset[-1]}")

### Classifier
We first create an instance of **DecisionTreeClassifier** called **tree**.
Inside the classifier, we specify `criterion="entropy"` so that splits are chosen by information gain, and `max_depth=4` to limit the depth of each tree.

Next, we fit the tree on the training features and training labels of one pair from `bootstrapDataset`.
I add the fitted tree to a list and repeat this cycle 5 times, until 5 trees have been built from the 5 training datasets.

In the last line, as an example, I print the type of the last element of the `classifiers` list.

In [None]:
classifiers = []
for index in range(NUMBER_OF_BOOTSTRAP):
    # Define the decision tree
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
    tree.fit(*bootstrapDataset[index])
    classifiers.append(tree)
print(f"The last classifier is: {type(classifiers[-1])}")
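
For comparison, scikit-learn ships a ready-made `BaggingClassifier` that performs the bootstrap and training loop internally. The sketch below runs it on synthetic data from `make_classification`, which is an assumption standing in for the encoded dataset. One difference to keep in mind: when the base estimator provides class probabilities, as decision trees do, `BaggingClassifier` averages those probabilities instead of counting hard votes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the encoded dataset (an assumption)
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=24)

base_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions
bagging = BaggingClassifier(base_tree, n_estimators=5, random_state=0)
bagging.fit(X_tr, y_tr)

score = bagging.score(X_te, y_te)  # mean accuracy on the held-out part
print(len(bagging.estimators_), score)
```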

### Prediction
We make predictions with each tree on the test dataset and store them in a list called `votes`. 
<br>
At this point the list contains 5 rows and n columns, representing the opinion of each tree about the test samples.
To keep the calculations clean and convenient, we transpose it into a matrix with n rows and 5 columns, where each row holds the opinions of the trees about one test sample and n is the number of samples.

In [None]:
votes = [tree.predict(X_test) for tree in classifiers]
votes = np.array(votes)
print(f"Dimensions of opinion before reshaping: {votes.shape}")
votes = np.transpose(votes)
print(f"Dimensions of opinion after reshaping: {votes.shape}")

### Voting
Now, for each test sample, we walk over the matrix of opinions, and the vote that occurs most often is taken as the final label and stored in the list of predictions `predicted_Y`.
<br>
Each row of `votes` holds the 5 votes for one sample; it is passed to the `majority` function, which returns the most common vote.
<br>
Finally, the number of consensus votes, one per test sample, is printed.
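
The transpose-then-vote step can be traced on a small hand-made example. The vote matrix below is invented, and `collections.Counter` is used here as one possible alternative way to pick the most frequent label.

```python
from collections import Counter

import numpy as np

# Made-up predictions: 5 trees (rows) x 4 test samples (columns)
votes_toy = np.array([
    ['e', 'p', 'e', 'p'],
    ['e', 'p', 'p', 'p'],
    ['p', 'p', 'e', 'e'],
    ['e', 'e', 'e', 'p'],
    ['e', 'p', 'p', 'p'],
])

per_sample = votes_toy.T  # 4 samples x 5 trees

# For each sample, take the label that the most trees voted for
predicted = [Counter(row.tolist()).most_common(1)[0][0]
             for row in per_sample]
print(per_sample.shape)
print(predicted)
```

With two classes and an odd number of trees, a tie cannot occur; with an even number of trees or more classes, the winner of a tie depends on how the voting function breaks it.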

In [None]:
def majority(vote):
    vote = list(vote)
    return max(set(vote), key=vote.count)

In [None]:
predicted_Y = [majority(vote) for vote in votes]
print(f"Number of Consensus votes: {len(predicted_Y)}")

### Evaluation
`classification_report` compares the predicted labels `predicted_Y` with the true labels **y_test** and reports the precision, recall, F1-score, and support of each class.

The report also includes the overall accuracy: the fraction of test samples whose predicted label exactly matches the true label.
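
To see how a single accuracy number relates to the full report, the sketch below evaluates a handful of made-up labels with both `accuracy_score` and `classification_report`. The label lists are assumptions chosen so the result is easy to check by hand: 4 of the 6 predictions match.

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = ['e', 'e', 'p', 'p', 'e', 'p']  # made-up true labels
y_pred = ['e', 'p', 'p', 'p', 'e', 'e']  # made-up predictions

# Fraction of samples whose predicted label equals the true label
acc = accuracy_score(y_true, y_pred)
print(acc)

# Per-class precision, recall, F1-score and support as formatted text
report_text = classification_report(y_true, y_pred)
print(report_text)
```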

In [None]:
report = classification_report(y_test, predicted_Y)
print(report)