## Decision Trees

### Q98. What are black box and white box models?

Decision Trees are fairly intuitive, and their decisions are easy to interpret.
Such models are often called white box models. In contrast, Random
Forests and neural networks are generally considered black box models. They
make great predictions, and you can easily check the calculations that they performed
to make these predictions; nevertheless, it is usually hard to explain in simple terms
why the predictions were made. For example, if a neural network says that a particular
person appears in a picture, it is hard to know what actually contributed to this
prediction: did the model recognize that person’s eyes? Her mouth? Her nose? Her
shoes? Or even the couch that she was sitting on? Conversely, Decision Trees provide
nice and simple classification rules that can even be applied manually if need be (e.g.,
for flower classification).


### Q99. Explain the working of decision trees.

![image.png](attachment:image.png)

A decision tree is a type of machine learning algorithm that is used for both regression and classification tasks. It is called a "tree" because it can be visualized as a tree-like structure, with nodes representing decisions and branches representing the outcomes of these decisions.

The decision tree works by recursively partitioning the data into smaller subsets. At each step it splits on the feature that best separates the target values, and it keeps splitting until every subset is pure (all its data points belong to the same class) or a stopping criterion is reached, such as a maximum tree depth, a minimum number of samples in a leaf node, or a minimum required reduction in impurity.

At each node of the decision tree, a feature is selected and a threshold is set to partition the data. The feature and threshold that produce the largest reduction in impurity across the resulting subsets are selected. This process is repeated at each node until the stopping criterion is reached.

Once the tree is built, making a prediction for a new sample is straightforward. The sample is passed down the tree, following the branches that correspond to its feature values, until a leaf node is reached. For classification, the prediction is the majority class of the training samples in that leaf; for regression, it is their average target value.

In summary, decision trees work by recursively partitioning the data based on the most significant features, resulting in a tree-like structure that can be used to make predictions for new samples.
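The fit-then-predict flow described above can be sketched with Scikit-Learn. This is a minimal illustration, assuming the iris dataset and an arbitrary `max_depth=2`; the sample measurements are made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# A shallow tree keeps the learned rules short enough to read.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Each internal node tests one feature against one threshold.
rules = export_text(tree_clf, feature_names=list(iris.feature_names))
print(rules)

# Prediction: the sample is routed down to a leaf and receives that
# leaf's majority class.
sample = [[5.0, 3.4, 1.5, 0.2]]  # illustrative measurements (small petals)
predicted_class = iris.target_names[tree_clf.predict(sample)[0]]
print(predicted_class)
```

The printed rules are the same if/else tests that the tree applies when routing a new sample to a leaf.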

### Q100. Can you export a decision tree in a graphical format for visualization?

Yes. Scikit-learn offers two common options. The export_graphviz function writes a trained tree in Graphviz DOT format, which the Graphviz tools (for example, the dot command) render as an image. Alternatively, the plot_tree function in sklearn.tree draws the tree directly with Matplotlib, with no Graphviz installation required.
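As a minimal sketch of the export step, assuming the iris dataset and a shallow tree chosen only for illustration, `export_graphviz` can return the DOT description as a string:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the DOT source as a string
# instead of writing it to disk.
dot_source = export_graphviz(
    tree_clf,
    out_file=None,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
)

# Show the first few lines of the DOT description.
print("\n".join(dot_source.splitlines()[:5]))
```

If the string is saved to a file (say, a hypothetical `iris_tree.dot`), the Graphviz command `dot -Tpng iris_tree.dot -o iris_tree.png` renders it as an image. For a Matplotlib-only route, `sklearn.tree.plot_tree(tree_clf)` draws the same tree on the current axes.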

### Q101. Can we train a decision tree without scaling the data?

Yes, it is possible to train a decision tree without scaling the data. Decision trees are not sensitive to the scale of the features, unlike some other machine learning algorithms such as k-nearest neighbors and support vector machines.

In a decision tree, the features are used to partition the data into smaller subsets, and each split compares a single feature against a threshold taken from that feature's own values. Rescaling a feature rescales its thresholds accordingly, so the resulting partition of the samples stays the same, and scaling is not necessary.

For the same reason, a feature with a large numeric range does not dominate the splits the way it can in distance-based models such as k-nearest neighbors. Each feature is evaluated on its own, by how well its best threshold separates the target values.

In summary, scaling is not necessary for training a decision tree: it changes the threshold values but not the way the samples are partitioned or the predictions that result. Scaling is still worth doing when the same preprocessed data also feeds scale-sensitive models.
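The insensitivity of decision trees to feature scale can be checked empirically. The sketch below assumes the iris dataset and arbitrary positive scale factors; it trains one tree on the raw features and one on rescaled features, then compares their predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Multiply each feature by a different, arbitrary positive factor.
scale_factors = np.array([1.0, 1000.0, 0.01, 50.0])
X_rescaled = X * scale_factors

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
tree_rescaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_rescaled, y)

# Rescaling moves the thresholds but keeps the ordering of the samples
# along each feature, so both trees partition the data the same way.
same_predictions = np.array_equal(
    tree_raw.predict(X), tree_rescaled.predict(X_rescaled)
)
print(same_predictions)
```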

### Q102. What is entropy? What is the mathematical formula for it?

Entropy is a measure of the uncertainty or randomness of a system. In information theory and machine learning, entropy is often used to quantify the impurity or disorder of a set of samples. The entropy of a set of samples is higher if the samples are evenly distributed among several classes, and lower if the samples are primarily in one class.

![image-2.png](attachment:image-2.png)

The mathematical formula for entropy is:

H(S) = -∑ p(x) log2 p(x)

where H(S) is the entropy of a set of samples S, p(x) is the proportion of samples in S that belong to class x, and the sum is taken over all classes. The logarithm is taken in base 2 so that the entropy is expressed in bits.

In a decision tree, the entropy is used to quantify the impurity of a set of samples at each node of the tree. The goal is to choose the feature and threshold that result in the highest reduction in the entropy, so that each resulting subset is dominated as much as possible by a single class.

In summary, entropy is a measure of the uncertainty or randomness of a system, often used to quantify the impurity or disorder of a set of samples in information theory and machine learning, and is expressed mathematically as a negative sum of the probability of each class multiplied by its logarithm in base 2.

![image.png](attachment:image.png)
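The entropy formula translates almost directly into code. This is a small sketch using NumPy; the function name `entropy` and the toy label lists are chosen here for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    # sum p * log2(1/p) is algebraically equal to -sum p * log2(p).
    # Classes with no samples never appear in `counts`, so log2(0) cannot occur.
    return float(np.sum(proportions * np.log2(1.0 / proportions)))

pure_node = ["A", "A", "A", "A"]    # every sample in one class
mixed_node = ["A", "A", "B", "B"]   # samples split evenly over two classes

print(entropy(pure_node))
print(entropy(mixed_node))
```

A pure node has the lowest possible entropy, while an even two-class mix has the highest entropy attainable with two classes.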



### Q103. What is information gain? How is it mathematically calculated?

Information gain is a measure of the reduction in entropy achieved by partitioning a set of samples based on a feature. In a decision tree, information gain is used to determine which feature and threshold to use at each node to split the data into subsets. The feature and threshold that result in the highest information gain are chosen, as this results in the greatest reduction in the impurity of the subsets.

The mathematical formula for information gain is:

Gain(S, A) = H(S) - ∑ (|S_v| / |S|) H(S_v)

where Gain(S, A) is the information gain of a set of samples S with respect to a feature A, H(S) is the entropy of the set of samples S, |S_v| is the number of samples in the subset S_v corresponding to a particular value v of the feature A, and |S| is the total number of samples in the set S. The sum is taken over all values v of the feature A.

In other words, information gain is calculated as the difference between the initial entropy of the set of samples and the weighted average of the entropies of the subsets resulting from partitioning the set based on a feature. The feature with the highest information gain is chosen to split the data at each node in the tree.

In summary, information gain is a measure of the reduction in entropy achieved by partitioning a set of samples based on a feature, and is calculated as the difference between the initial entropy of the set and the weighted average of the entropies of the subsets resulting from the partition.
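A minimal sketch of the Gain(S, A) formula follows, assuming a categorical feature A and toy labels invented for the example. The helper names `entropy` and `information_gain` are not from any library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(np.sum(proportions * np.log2(1.0 / proportions)))

def information_gain(labels, feature_values):
    """Parent entropy minus the size-weighted entropy of the subsets S_v."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    weighted_child_entropy = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]
        weighted_child_entropy += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted_child_entropy

# Toy data: one feature separates the classes perfectly, the other not at all.
labels = ["yes", "yes", "no", "no"]
perfect_feature = ["a", "a", "b", "b"]
useless_feature = ["a", "b", "a", "b"]

gain_perfect = information_gain(labels, perfect_feature)
gain_useless = information_gain(labels, useless_feature)
print(gain_perfect, gain_useless)
```

For a numeric feature split at a threshold, the same function applies if `feature_values` is replaced by the boolean result of the comparison, so that the sum runs over the two sides of the threshold.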

### Q104. What is Gini impurity and how is it calculated?

Gini impurity is a measure of the impurity or disorder of a set of samples. It is similar to entropy, but instead of using the logarithm of the probabilities, it uses the square of the probabilities.

The mathematical formula for Gini impurity is:

Gini(S) = 1 - ∑ p(x)^2

where Gini(S) is the Gini impurity of a set of samples S, p(x) is the proportion of samples in S that belong to class x, and the sum is taken over all classes.

In a decision tree, Gini impurity is used to quantify the impurity of a set of samples at each node of the tree. The goal is to choose the feature and threshold that result in the greatest reduction in the Gini impurity, so that each resulting subset is dominated as much as possible by a single class.

In summary, Gini impurity is a measure of the impurity or disorder of a set of samples, expressed mathematically as one minus the sum of the squared class proportions, and is used in decision trees to quantify the impurity of a set of samples at each node and determine which feature and threshold to use to split the data.

![image.png](attachment:image.png)
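The Gini formula can be sketched in a few lines of NumPy. The function name `gini` and the example label lists are illustrative choices.

```python
import numpy as np

def gini(labels):
    """Gini impurity: one minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

pure_node = ["A", "A", "A"]
mostly_a = ["A", "A", "A", "A", "B"]   # four of class A, one of class B
even_mix = ["A", "B"]

print(gini(pure_node), gini(mostly_a), gini(even_mix))
```

For the four-to-one node the value is 1 - (4/5)² - (1/5)² = 0.32, and for two classes the impurity peaks at 0.5 with an even mix.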



### Q105. What is the CART (Classification and Regression Tree) algorithm?

CART (Classification and Regression Tree) is a popular decision tree algorithm for both classification and regression problems. The CART algorithm creates a binary tree, where each node represents a decision on a feature, and the leaves represent the output class or predicted value.

In a classification problem, the CART algorithm works by iteratively splitting the data into subsets based on a feature and a threshold, until a stopping criterion is reached. The feature and threshold are chosen at each step to maximize the reduction in impurity of the samples. The impurity of the samples is measured using a criterion such as Gini impurity or entropy. The final output is a set of rules or conditions that specify the class of a sample, based on its feature values.

In a regression problem, the CART algorithm works similarly, but instead of choosing the feature and threshold that result in the greatest reduction in impurity, it chooses the feature and threshold that result in the smallest variance or mean squared error of the samples in the subsets.

The CART algorithm is simple and easy to understand, but it can be prone to overfitting if the tree is grown too deep or if the data has too many features. To avoid overfitting, techniques such as pruning or ensembles of trees, such as random forests, can be used.

In summary, the CART (Classification and Regression Tree) algorithm is a popular decision tree algorithm for both classification and regression problems, where a binary tree is constructed by iteratively splitting the data into subsets based on a feature and threshold, with the goal of reducing impurity or variance of the samples.

### Q106. What is the cost function of CART for classification?

The cost function of the CART (Classification and Regression Tree) algorithm for classification problems is the weighted impurity of the two subsets produced by a split. For a candidate feature k and threshold t_k, CART minimizes:

J(k, t_k) = (m_left / m) G_left + (m_right / m) G_right

where G_left and G_right are the impurities of the left and right subsets, m_left and m_right are the numbers of samples they contain, and m is the number of samples in the node being split.

The most commonly used impurity measures for the CART algorithm are Gini impurity and entropy. The CART algorithm chooses the feature and threshold that give the lowest weighted impurity, so that each subset is as pure as possible.

The Gini impurity is defined as:

Gini(S) = 1 - ∑ p(x)^2

where Gini(S) is the Gini impurity of a set of samples S, p(x) is the proportion of samples in S that belong to class x, and the sum is taken over all classes.

When entropy is used as the impurity measure, minimizing the weighted entropy of the subsets is equivalent to maximizing the information gain:

Gain(S, A) = H(S) - ∑ (|S_v| / |S|) H(S_v)

where Gain(S, A) is the information gain of a set of samples S with respect to a feature A, H(S) is the entropy of the set of samples S, |S_v| is the number of samples in the subset S_v, and |S| is the total number of samples in the set S. For a CART split, the sum runs over the two subsets on either side of the threshold.

In summary, the cost function of the CART algorithm for classification problems is the size-weighted impurity of the two child subsets, typically measured using Gini impurity or entropy, and the goal is to choose the feature and threshold that result in the greatest reduction in impurity.
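To make the split search concrete, here is a simplified sketch for a single numeric feature. It scores every candidate threshold by the size-weighted Gini impurity of the two children and keeps the cheapest one. The helper names and the five-sample toy data are assumptions for illustration; a real implementation also loops over all features and is far more optimized.

```python
import numpy as np

def gini(labels):
    """Gini impurity: one minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

def best_split(x, y):
    """Return (threshold, cost) minimizing the weighted Gini of the children."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_cost = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y[:i], y[i:]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if cost < best_cost:
            # Place the threshold halfway between the two neighboring values.
            best_threshold, best_cost = (x[i - 1] + x[i]) / 2, cost
    return best_threshold, best_cost

# Toy one-dimensional data: classes A, B, A, A, A at positions 0..4.
x = [0, 1, 2, 3, 4]
y = ["A", "B", "A", "A", "A"]
threshold, cost = best_split(x, y)
print(threshold, cost)
```

On this toy data the cheapest split falls between the second and third samples: the left child (A, B) has impurity 0.5 and the right child is pure, giving a weighted cost of 2/5 × 0.5 = 0.2.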





### Q107. Why is CART a greedy algorithm?

CART (Classification and Regression Tree) is considered a greedy algorithm because it makes the locally optimal choice at each step in the construction of the tree, with no guarantee that the resulting tree is globally optimal.

At each step, the algorithm chooses the feature and threshold that result in the greatest reduction in impurity (for classification) or variance (for regression) in the subsets. This choice is made without considering whether it will lead to the lowest possible impurity several levels further down the tree.

The greedy approach can therefore produce a tree that is reasonably good but not optimal. Finding the optimal tree is known to be an NP-complete problem, so an exhaustive search is intractable even for fairly small training sets, and a greedy search is the practical choice. Techniques such as pruning or ensembles of trees, such as random forests, are often used to produce more accurate and stable results.

In summary, CART is considered a greedy algorithm because it makes locally optimal choices at each step in the construction of the tree, without considering the impact on the global solution. This can lead to suboptimal trees, so techniques such as pruning or ensembles of trees may be used to produce more accurate results.

### Q108. What is a parametric model?

A parametric model is a mathematical model that is defined by a fixed number of parameters. The parameters of a parametric model are estimated based on observed data, and the model is used to make predictions about future data.

The key idea behind a parametric model is to find a simple, compact representation of the relationship between the input variables and the output variables. The parameters of the model capture the underlying patterns and relationships in the data, and the model can be used to make predictions by plugging in new values of the input variables.

Parametric models are useful when there is a clear and well-defined relationship between the input and output variables, and when the number of parameters is small compared to the size of the data. Some examples of parametric models include linear regression, logistic regression, and Gaussian mixture models.

In summary, a parametric model is a mathematical model that is defined by a fixed number of parameters, and the parameters are estimated based on observed data to make predictions about future data.

### Q109. What is a non-parametric model?

A non-parametric model is a model whose number of parameters is not fixed before training. Unlike parametric models, which have a fixed number of parameters and assume a particular functional form or data distribution, non-parametric models make few assumptions about the data, and their complexity can grow with the size of the training set.

Non-parametric models are often used when the relationship between the input and output variables is complex and not well understood, and when enough data is available to learn that structure. They are well suited to data with non-linear relationships, interactions between variables, or multiple local patterns. Because they are free to adapt closely to the training data, they are prone to overfitting unless they are regularized.

Some examples of non-parametric models include decision trees, random forests, kernel support vector machines, k-nearest neighbors, and kernel regression.

In summary, a non-parametric model makes few assumptions about the underlying distribution of the data and has a flexible number of parameters that is determined by the training data. Non-parametric models are often used when the relationship between the input and output variables is complex and not well understood.





### Q110. Which regularization parameters can you use in a decision tree?

In decision trees, regularization is used to control the size and complexity of the tree, and to prevent overfitting. There are several common regularization parameters that can be used in decision trees:

- Maximum Depth: This parameter sets the maximum number of levels in the tree. It is used to control the complexity of the tree and prevent overfitting by limiting the tree's ability to capture fine details in the data.
- Minimum Samples per Leaf: This parameter sets the minimum number of samples required to be present in a leaf node. It is used to control the size of the tree and prevent overfitting by requiring a minimum amount of data in each leaf node.
- Minimum Impurity Decrease: This parameter sets the minimum amount of impurity reduction required for a split to be made. It is used to control the size and complexity of the tree and prevent overfitting by requiring a significant improvement in the impurity measure for each split.
- Maximum Features: This parameter sets the maximum number of features to consider when splitting a node. It is used to control the complexity of the tree and prevent overfitting by limiting the number of features considered for each split.
- Maximum Leaf Nodes: This parameter sets the maximum number of leaf nodes in the tree. It is used to control the size and complexity of the tree and prevent overfitting by limiting the number of leaf nodes in the tree.

These are some of the regularization parameters that can be used in decision trees to control the size and complexity of the tree and prevent overfitting. The choice of regularization parameter and its value will depend on the specific problem and the nature of the data.
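In Scikit-Learn, these regularization parameters correspond to constructor arguments of `DecisionTreeClassifier`. The sketch below assumes a noisy moons dataset, and the hyperparameter values are arbitrary examples rather than recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)

# Baseline: no restrictions, so the tree grows until its leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)

regularized = DecisionTreeClassifier(
    max_depth=5,                  # Maximum Depth
    min_samples_leaf=10,          # Minimum Samples per Leaf
    min_impurity_decrease=1e-4,   # Minimum Impurity Decrease
    max_features=2,               # Maximum Features (this dataset has only 2)
    max_leaf_nodes=16,            # Maximum Leaf Nodes
    random_state=42,
).fit(X, y)

print(unrestricted.get_depth(), unrestricted.get_n_leaves())
print(regularized.get_depth(), regularized.get_n_leaves())
```

Comparing the two printed lines shows how much smaller the regularized tree is. Suitable values are usually found with cross-validation.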

### Q111. What is pruning?

Pruning is a technique used in decision trees to reduce their size and complexity by removing branches that are not necessary for accurate predictions. Pruning works by removing the leaf nodes or internal nodes of the tree that do not contribute significantly to the accuracy of the predictions.

The goal of pruning is to simplify the tree, reduce overfitting, and improve the generalization performance of the model. Overfitting occurs when the tree is too complex and captures too much of the noise and random fluctuations in the data, leading to poor performance on new, unseen data. Pruning helps to overcome this by removing branches that are not necessary for accurate predictions.

There are several techniques for pruning decision trees, including reduced error pruning, cost complexity pruning, and minimum description length pruning. The choice of pruning method will depend on the specific problem and the nature of the data.

In summary, pruning is a technique used in decision trees to reduce their size and complexity by removing branches that are not necessary for accurate predictions. The goal of pruning is to simplify the tree, reduce overfitting, and improve the generalization performance of the model.





### Q112. When do we need to perform pruning?

Pruning is typically performed when the decision tree has grown to a size that is too large and complex, leading to overfitting and poor generalization performance. Overfitting occurs when the tree is too complex and captures too much of the noise and random fluctuations in the data, leading to poor performance on new, unseen data.

Pruning is used to reduce the size and complexity of the tree and to simplify the model, making it more generalizable and less prone to overfitting. Pruning is typically performed after the tree has been grown to its maximum size and complexity, and is done in order to improve the performance of the model on new, unseen data.

In summary, pruning is typically performed when the decision tree has grown to a size that is too large and complex, leading to overfitting and poor generalization performance. Pruning is used to reduce the size and complexity of the tree and to simplify the model, making it more generalizable and less prone to overfitting.





### Q113. Explain pre-pruning.

Pre-pruning is a technique used in decision tree algorithms to prevent the tree from growing to an overly large and complex size. Pre-pruning works by setting stopping criteria for tree growth before the tree has been fully grown. Once the stopping criteria have been met, no further splits are made, and the tree is considered to be complete.

Pre-pruning is used to prevent overfitting, which occurs when the tree is too complex and captures too much of the noise and random fluctuations in the data, leading to poor performance on new, unseen data. By stopping tree growth before the tree has become too complex, pre-pruning helps to keep the tree simple and generalizable.

There are several techniques for pre-pruning decision trees, including setting a maximum depth for the tree, setting a minimum number of samples required in a leaf node, and setting a minimum impurity decrease required for a split to be made. The choice of pre-pruning method and its parameters will depend on the specific problem and the nature of the data.

In summary, pre-pruning is a technique used in decision tree algorithms to prevent the tree from growing to an overly large and complex size. Pre-pruning works by setting stopping criteria for tree growth before the tree has been fully grown, and is used to prevent overfitting and improve the generalization performance of the model.





### Q114. What is post-pruning?

Post-pruning is a technique used in decision tree algorithms to reduce the size and complexity of the tree after it has been fully grown. Post-pruning works by removing leaf nodes or internal nodes of the tree that do not contribute significantly to the accuracy of the predictions. The goal of post-pruning is to simplify the tree, reduce overfitting, and improve the generalization performance of the model.

Post-pruning works by removing nodes from the tree in a bottom-up manner, starting from the leaf nodes. The algorithm evaluates the impact of removing each node on the accuracy of the predictions, and removes the node if it does not significantly impact accuracy. This process is repeated until all nodes that can be removed without significantly impacting accuracy have been removed.

There are several techniques for post-pruning decision trees, including reduced error pruning, cost complexity pruning, and minimum description length pruning. The choice of post-pruning method will depend on the specific problem and the nature of the data.

In summary, post-pruning is a technique used in decision tree algorithms to reduce the size and complexity of the tree after it has been fully grown. The goal of post-pruning is to simplify the tree, reduce overfitting, and improve the generalization performance of the model.
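Scikit-Learn supports cost complexity pruning through the `ccp_alpha` parameter. The sketch below grows a full tree, asks for the pruning path, and refits with one of the candidate alphas. The dataset and the choice of the middle alpha are assumptions made for illustration; in practice the alpha would be selected by cross-validation.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 1: grow the tree without restrictions.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Step 2: compute the effective alphas at which subtrees get pruned away.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Step 3: refit with one candidate alpha (the middle one, chosen arbitrarily).
ccp_alpha = float(path.ccp_alphas[len(path.ccp_alphas) // 2])
pruned_tree = DecisionTreeClassifier(
    random_state=42, ccp_alpha=ccp_alpha
).fit(X_train, y_train)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
print(full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))
```

Larger values of `ccp_alpha` prune more aggressively, so sweeping over `path.ccp_alphas` traces out a sequence of progressively smaller trees.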

### Q115. What is the difference between decision tree classifier and decision tree regressor?

The difference between a Decision Tree Classifier and a Decision Tree Regressor lies in the type of problem they are designed to solve.

A Decision Tree Classifier is used for classification problems, where the target variable is categorical. In this case, the tree is used to make predictions about the class label of new data points based on the values of the input features. The tree splits the data into subgroups based on the feature values and assigns a class label to each leaf node. The class label assigned to a new data point is based on the majority class label of the subgroup that the point belongs to.

A Decision Tree Regressor, on the other hand, is used for regression problems, where the target variable is continuous. In this case, the tree is used to make predictions about the value of a continuous target variable based on the values of the input features. The tree splits the data into subgroups based on the feature values and assigns a predicted target value to each leaf node. The predicted target value for a new data point is based on the average target value of the subgroup that the point belongs to.

In summary, the key difference between Decision Tree Classifier and Decision Tree Regressor is the type of problem they are designed to solve. Decision Tree Classifier is used for classification problems, while Decision Tree Regressor is used for regression problems.

### Q116. What is the cost function of CART for regression?

The cost function used in CART (Classification and Regression Trees) for regression problems is mean squared error (MSE).

Mean Squared Error measures the average squared difference between the predicted target values and the true target values. The goal of CART for regression problems is to minimize the MSE between the predicted and actual target values.

Mathematically, the MSE is defined as:

MSE = (1/n) * Σ(y_i - ŷ_i)^2

where n is the number of data points, y_i is the actual target value for data point i, and ŷ_i is the predicted target value for data point i.

The CART algorithm uses the MSE cost function to determine the optimal splits of the data in the tree: each candidate split is scored by the average of the MSE values of its two child nodes, weighted by the number of samples in each child, where the prediction ŷ_i for every sample in a node is the mean target value of that node. The CART algorithm will continue to split the data until it reaches a stopping criterion, such as reaching a minimum node size or achieving a minimum reduction in MSE.

In summary, the cost function used in CART for regression problems is mean squared error (MSE), which measures the average squared difference between the predicted target values and the true target values. The goal of the CART algorithm is to minimize the MSE between the predicted and actual target values.
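As a small worked sketch of the regression criterion, the code below scores candidate thresholds by the size-weighted MSE of the two resulting subsets, with each subset predicting its own mean. The helper names and the four-point dataset are invented for the example.

```python
import numpy as np

def mse(targets):
    """Mean squared error of a node that predicts its own mean target."""
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((targets - targets.mean()) ** 2))

def split_cost(x, y, threshold):
    """Size-weighted MSE of the two subsets on either side of a threshold."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * mse(left) + len(right) * mse(right)) / len(y)

# Toy data: two low targets followed by two high targets.
x = [0, 1, 2, 3]
y = [1.0, 3.0, 10.0, 14.0]

costs = {t: split_cost(x, y, t) for t in (0.5, 1.5, 2.5)}
best_threshold = min(costs, key=costs.get)
print(costs)
print(best_threshold)
```

Splitting at 1.5 groups the targets as (1, 3) and (10, 14), with MSE values 1 and 4, so the weighted cost is 0.5 × 1 + 0.5 × 4 = 2.5, the lowest of the three candidates.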





### Q117. List some instabilities of decision trees.

Decision Trees can be subject to a number of instabilities that can affect their performance. Some of the common instabilities of decision trees include:

Overfitting: Decision trees can easily overfit to the training data, resulting in a complex tree that does not generalize well to new data. Overfitting occurs when the tree grows too deep and fits the training data too closely, capturing even the noise in the data.

Unstable with small changes in data: Decision trees can be very sensitive to small changes in the data, leading to completely different trees being generated for slightly different data sets. This can result in a lack of stability and consistency in the predictions made by the tree.

Bias towards features with many levels: Decision trees are biased towards features with many levels and may over-represent these features in the tree. This can lead to poor generalization and reduced performance on new data.

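Axis-aligned splits: Decision trees split on one feature at a time, so every decision boundary is perpendicular to an axis. Interactions between features are captured only through successive splits, and a boundary that is simple but oblique (for example, a diagonal line) must be approximated by a staircase of many splits. As a result, rotating the training data can produce a very different and more convoluted tree.

High Variance: Decision trees have high variance, meaning that they can be very different depending on the training data. This can lead to unstable trees and poor generalization to new data.

In summary, Decision Trees are subject to a number of instabilities that can affect their performance, including overfitting, instability with small changes in data, bias towards features with many levels, sensitivity to the orientation of the data caused by axis-aligned splits, and high variance.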





### Q118. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?

The depth of a well-balanced binary tree containing m leaves is equal to log₂(m), rounded up. log₂ is the binary log; log₂(m) = log(m) / log(2). A binary Decision Tree (one that makes only binary decisions, as is the case with all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log₂(10⁶) ≈ 20 (actually a bit more since the tree will generally not be perfectly well balanced).
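The arithmetic can be checked in a couple of lines, under the same idealized assumption of a perfectly balanced tree with one leaf per training instance:

```python
import math

n_instances = 10**6

# Depth of a perfectly balanced binary tree with one leaf per instance.
approx_depth = math.ceil(math.log2(n_instances))
print(approx_depth)
```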

### Q119. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?

A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 – (1/5)² – (4/5)² = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node's Gini impurity is 1 – (1/2)² – (1/2)² = 0.5, which is higher than its parent's. This is compensated for by the fact that the other node is pure, so its overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent's Gini impurity.
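The five-instance example can be reproduced with Scikit-Learn by fitting a depth-1 tree and reading the impurities stored in its `tree_` attribute. This is a sketch for checking the numbers; encoding the positions as 0 to 4 is an assumption about how the instances are "lined up".

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature (the position on the line) and the classes A, B, A, A, A.
X = np.array([[0], [1], [2], [3], [4]])
y = np.array(["A", "B", "A", "A", "A"])

stump = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=42)
stump.fit(X, y)

# Node 0 is the root (parent); nodes 1 and 2 are its left and right children.
threshold = stump.tree_.threshold[0]
parent_gini, left_gini, right_gini = stump.tree_.impurity
n_parent, n_left, n_right = stump.tree_.n_node_samples

weighted_children = (n_left * left_gini + n_right * right_gini) / n_parent
print(threshold, parent_gini, left_gini, right_gini, weighted_children)
```

The left child's impurity (0.5) exceeds the parent's (0.32), yet the weighted impurity of the children (0.2) is lower, which is what the split criterion optimizes.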

### Q120. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.
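The effect of constraining `max_depth` can be observed by comparing training and test accuracy. The sketch below assumes a noisy moons dataset and an arbitrary depth limit of 4; the exact scores depend on the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for max_depth in (None, 4):  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting.
    scores[max_depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for max_depth, (train_acc, test_acc) in scores.items():
    print(max_depth, round(train_acc, 3), round(test_acc, 3))
```

An unrestricted tree typically fits the training set almost perfectly while scoring noticeably lower on the test set. A smaller gap between the two scores for the depth-limited tree is the sign that the constraint is regularizing the model.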

### Q121. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

Decision Trees don't care whether or not the training data is scaled or centered; that's one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.

### Q122. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

The computational complexity of training a Decision Tree is O(n × m log₂(m)), where m is the number of training instances and n is the number of features. So if you multiply the training set size by 10, the training time will be multiplied by K = (n × 10 m × log₂(10 m)) / (n × m × log₂(m)) = 10 × log₂(10 m) / log₂(m). If m = 10⁶, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
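The factor K can be evaluated directly. This only checks the arithmetic of the complexity estimate; real training times also depend on the hardware and implementation.

```python
import math

m = 10**6  # number of instances in the original training set

# Ratio of the two training-time estimates; the feature count n cancels out.
K = 10 * math.log2(10 * m) / math.log2(m)
print(round(K, 1))
```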

### Q123. If your training set contains 100,000 instances, will setting presort=True speed up training?

Most likely not. Presorting the training set speeds up training only when the dataset is small (no more than a few thousand instances). With 100,000 instances, setting `presort=True` will considerably slow down training.

`presort=True` tells the implementation to sort the data on every feature once, before building the decision tree, instead of sorting the samples at each node. For small training sets, this up-front sort can make the search for the best splits faster.

However, sorting the whole training set is an expensive operation, and for large training sets its cost outweighs the savings.

Note that the `presort` parameter was deprecated in Scikit-Learn 0.22 and has since been removed, so this question applies only to older versions. If you do use such a version, it is still a good idea to measure the training time with and without `presort=True` on your own data.

### Q124. Train and fine-tune a Decision Tree for the moons dataset by following these steps:
#### a. Use make_moons(n_samples=10000, noise=0.4) to generate a moons dataset.

In [1]:
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=42)

#### b. Use train_test_split() to split the dataset into a training set and a test set.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons,
                                                    test_size=0.2,
                                                    random_state=42)

#### c. Use grid search with cross-validation (with the help of the GridSearchCV  class) to find good hyperparameter values for a DecisionTreeClassifier.
- Hint: try various values for max_leaf_nodes.

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
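
# Search over tree-size hyperparameters (98 x 6 x 3 = 1,764 combinations).
params = {
    'max_leaf_nodes': list(range(2, 100)),
    'max_depth': list(range(1, 7)),
    'min_samples_split': [2, 3, 4]
}
# 3-fold cross-validation; by default the best model is refit on the full training set.
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42),
                              params,
                              cv=3)

grid_search_cv.fit(X_train, y_train)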

In [4]:
grid_search_cv.best_estimator_

#### d. Train it on the full training set using these hyperparameters, and measure your model’s performance on the test set. You should get roughly 85% to 87% accuracy.

In [5]:
from sklearn.metrics import accuracy_score

y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)

0.8595

### Q125. Grow a forest by following these steps:

#### a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly.
- Hint: you can use Scikit-Learn’s ShuffleSplit class for this.

In [6]:
from sklearn.model_selection import ShuffleSplit
import numpy as np

n_trees = 1000
n_instances = 100

mini_sets = []

# Each split keeps n_instances samples for training; the remaining samples are unused.
rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances,
                  random_state=42)

for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))

#### b. Train one Decision Tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.

In [7]:
from sklearn.base import clone

forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

0.8056605

#### c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This approach gives you majority-vote predictions over the test set.

In [8]:
# Row i will hold the predictions of tree i for every test instance.
Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

#### d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!


In [9]:
from scipy.stats import mode

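# mode() returns the most frequent prediction for each test instance (column).
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

# reshape([-1]) flattens the result in case mode() keeps the reduced axis.
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))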

0.873

Question from Sarath:

In a Random Forest that identifies cats and dogs, if 500 trees say dog and 500 trees say cat, what do you do, given that the model is already in production?

This was one of the interview questions I got. Can you please explain this?