## Decision Tree - Classification

A decision tree builds classification or regression models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while incrementally developing the associated tree. The final result is a tree with decision nodes and leaf nodes.

<img src='../../img/Decision_Tree_1.png' height="1800" width="1800">

### Entropy
Entropy quantifies the uncertainty of the probability distribution over the possible classes. If you can quantify the uncertainty in your current classification, you can also reason about which node to add next in the decision tree to reduce that uncertainty the most.

<img src='../../img/Entropy.png' height="600" width="800">
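
As a minimal sketch of the idea, the snippet below computes Shannon entropy in bits, H = -sum(p_i * log2(p_i)), from a list of class labels. The function name `entropy` and the toy labels are illustrative assumptions, not part of the notebook's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the classes that actually occur
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical toy labels
even_split = entropy(["yes"] * 5 + ["no"] * 5)   # maximum uncertainty for two classes
pure = entropy(["yes"] * 10)                     # no uncertainty at all
print(even_split, pure)
```

An evenly split two-class sample has the highest entropy and a pure sample has entropy 0, which is why a split that produces purer branches reduces uncertainty.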

### Information Gain
Information gain is the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Step 1: Calculate entropy of the target.

<img src='../../img/Entropy_3.png' height="600" width="800">

Step 2: Split the dataset on each attribute in turn. Calculate the entropy of each branch, then add the branch entropies together, weighted by the proportion of records in each branch, to get the total entropy for the split. Subtract this from the entropy before the split. The result is the information gain, or decrease in entropy.

<img src='../../img/Entropy_gain.png' height="600" width="800">

<img src='../../img/Entropy_attributes.png' height="800" width="800">
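
Steps 1 and 2 can be sketched in a few lines of plain Python. The helper names (`entropy`, `information_gain`) and the six-row toy data are assumptions made for illustration; they are not taken from the figures.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the target minus the weighted entropy of the branches."""
    # Group the target labels by the value of the attribute (one group per branch)
    branches = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        branches[value].append(label)
    total = len(labels)
    # Weight each branch's entropy by the share of records that fall into it
    split_entropy = sum(len(b) / total * entropy(b) for b in branches.values())
    return entropy(labels) - split_entropy

# Hypothetical toy data: one attribute and the target
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"]
play = ["no", "no", "yes", "yes", "no", "yes"]

gain = information_gain(outlook, play)
print(f"Information gain for outlook: {gain:.3f}")
```

Here the target has entropy 1 (three "yes", three "no"), two of the three branches are pure, and only the "rainy" branch contributes to the split entropy.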

Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.

<img src='../../img/decision_tree_slices.png' height="600" width="800">

Step 4: A branch with entropy of 0 is a leaf node.

<img src='../../img/Entropy_overcast.png' height="600" width="800">
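
Steps 3 and 4 describe a recursion, which can be sketched as below. This is a simplified illustration on categorical data, not how `DecisionTreeClassifier` is implemented. The names `build_tree` and `split_entropy` and the toy rows are assumptions, as is the fallback to the majority label when no attributes are left.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def split_entropy(rows, labels, attr):
    """Weighted entropy of the branches produced by splitting on `attr`."""
    total = len(labels)
    result = 0.0
    for value in set(row[attr] for row in rows):
        branch = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        result += len(branch) / total * entropy(branch)
    return result

def build_tree(rows, labels, attributes):
    # Step 4: a branch with entropy 0 is a leaf.
    # Assumption: if attributes run out first, fall back to the majority label.
    if entropy(labels) == 0 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 3: the lowest split entropy is the highest information gain
    best = min(attributes, key=lambda a: split_entropy(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        # Repeat the same process on every branch
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree

# Hypothetical toy data
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

In this toy example the "sunny" and "overcast" branches are pure after the first split and become leaves, while the "rainy" branch still has mixed labels and is split again on the remaining attribute.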

### Code Dictionary
code | description
-----|------------
`DecisionTreeClassifier()` | Decision Tree Classification.
`make_blobs()` | Generate a random dataset with a specified number of clusters.

In [None]:
## IMPLEMENT A CLASSIFIER WITH THE telecom_churn DATA

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
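
The cell above only constructs the classifier. As a hedged sketch of the full fit, predict, and evaluate pattern, the snippet below uses synthetic data from `make_blobs` as a stand-in; it is not the telecom_churn data, and the `*_demo` variable names are assumptions chosen to avoid clashing with the notebook's own variables.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class, two-feature data (a stand-in, not telecom_churn)
X_demo, y_demo = make_blobs(n_samples=300, centers=2, n_features=2, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same settings as the classifier above, fitted on the training split
demo_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
demo_tree.fit(X_train_demo, y_train_demo)

# Rows of the confusion matrix are true classes, columns are predicted classes
y_pred_demo = demo_tree.predict(X_test_demo)
cm_demo = confusion_matrix(y_test_demo, y_pred_demo)
print(cm_demo)
```

The diagonal of the matrix counts correct predictions, and the off-diagonal entries count the two kinds of misclassification.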

In [None]:
## SHOW THE CONFUSION MATRIX

In [None]:
# Visualising the Test set results
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_tree(X_test, y_test):
    """Plot the decision regions of the global, already fitted `classifier` (two features only)."""
    X_set, y_set = X_test, y_test
    plt.figure(figsize=(10, 8))
    # Grid covering the range of both features, padded by 1 on each side
    X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                         np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
    # Colour every grid point by its predicted class
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha=0.75, cmap=plt.cm.Paired)
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    colors = ListedColormap(('red', 'green'))
    for i, j in enumerate(np.unique(y_set)):
        # `color=` (not `cmap=`) gives each class one fixed colour
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], color=colors(i), label=j)
    plt.title('Classifier (Test set)')
    plt.xlabel('Age')
    plt.ylabel('Estimated Salary')
    plt.legend()
    plt.show()

In [None]:
plot_tree(X_test, y_test)

## Random Forest 

A random forest model is a collection of decision tree models that are combined together to make predictions. When you make a random forest, you have to specify the number of decision trees you want to use to make the model. 

### Algorithm:
- Take random samples of observations from your training data.
- Build a decision tree model for each sample.
- The random samples are typically drawn with replacement, meaning the same observation can be drawn multiple times.
- The end result is a collection of decision trees, each built from a different group of records drawn from the original training data.

<img src='../../img/rforest.png' height="600" width="800">

Random forests are an example of an ensemble model: a model composed of some combination of several different underlying models.
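
The sampling-and-voting procedure described above can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic `make_blobs` data, not scikit-learn's implementation: `RandomForestClassifier` also considers a random subset of features at each split and averages predicted probabilities rather than counting hard votes. All variable names here are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with labels 0 and 1 (a stand-in dataset)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=0)
rng = np.random.default_rng(0)

n_trees = 11  # odd, so a two-class majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Each tree votes; the ensemble predicts the majority class
votes = np.array([t.predict(X_demo) for t in trees])   # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Ensemble training accuracy:", (forest_pred == y_demo).mean())
```

Because every tree sees a slightly different sample, the trees disagree on some points, and the vote smooths out their individual errors.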

In [None]:
## IMPLEMENT A CLASSIFIER WITH THE telecom_churn DATA

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', oob_score=True)

In [None]:
## SHOW THE CONFUSION MATRIX

Since random forest models build trees from random subsets or "bags" of data, model performance can be estimated by making predictions on the out-of-bag (OOB) samples instead of using cross validation. You can still use cross validation on random forests, but OOB validation already provides a good estimate of performance, and fitting several random forest models for K-fold cross validation can be computationally expensive.

In [None]:
print("OOB accuracy: ")
print(classifier.oob_score_)
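
To see how the OOB estimate compares with cross validation, the sketch below computes both on synthetic `make_blobs` data (a stand-in, not the notebook's dataset). The variable names are assumptions. It uses 100 trees because, with very few trees, some observations may never be out-of-bag and scikit-learn warns that the OOB estimate is unreliable.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data (a stand-in dataset)
X_demo, y_demo = make_blobs(n_samples=500, centers=2, random_state=0)

# OOB estimate: a single fit, each tree is scored on the rows it did not see
forest = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)
oob_acc = forest.oob_score_

# 5-fold cross validation: five separate forests are fitted
cv_forest = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
cv_acc = cross_val_score(cv_forest, X_demo, y_demo, cv=5).mean()

print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"5-fold CV acc: {cv_acc:.3f}")
```

The two numbers are usually close, while the OOB estimate comes from one fit instead of five.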

The random forest classifier assigns an importance value to each feature used in training. Features with higher importance were more influential in creating the model, indicating a stronger association with the response variable. Let's check the feature importance for our random forest model:

In [None]:
for feature, imp in zip(dataset.columns[[2,3]], classifier.feature_importances_):
    print(feature, imp)

In [None]:
# Visualising the Test set results
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.figure(figsize=(12,12))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = plt.cm.Paired)
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()