In [1]:
# Copyright (c) 2020-2021 CertifAI Sdn. Bhd.
# 
# This program is part of OSRFramework. You can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
# 
# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

## Decision Tree

Decision Trees (DTs) are a supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

## Random Forest

A **random forest** is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

A random forest classifier is a type of ensemble learning method.

# Iris dataset

### Use Decision Tree for Classification

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
iris = datasets.load_iris()

In [None]:
iris.feature_names

In [None]:
iris.target_names

In [None]:
data = iris.data.astype(np.float32)
target = iris.target.astype(np.float32)

Split the data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    data, target, test_size=0.3, random_state=123
)

In [None]:
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)

Train model

In [None]:
classifier.fit(X_train, y_train)

### Default parameters

#### criterion='entropy'
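The function used to measure the quality of a split. 'entropy' comes from information theory, where it quantifies the amount of information in a message. In our example, it measures how mixed the class labels at a node are, and the tree chooses the split that reduces entropy the most. Note that scikit-learn's default is 'gini'; 'entropy' was set explicitly above.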
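
To make the idea concrete, here is a minimal sketch of entropy and information gain in plain Python. The helper names `entropy` and `information_gain` and the toy labels are assumptions for illustration only; this is not scikit-learn's internal implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node holding two samples of class 0 and two of class 1
parent = [0, 0, 1, 1]

# A split that separates the classes perfectly removes all uncertainty
gain = information_gain(parent, [0, 0], [1, 1])
print(entropy(parent), gain)
```

A split with a higher information gain leaves purer child nodes, which is why the tree prefers it.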

#### min_samples_leaf=1
The minimum number of samples required at a leaf node.

#### min_samples_split=2
The minimum number of samples required to split an internal node.

#### splitter='best'
The strategy used to choose the split at each node. 'best' picks the best split according to the criterion; the alternative, 'random', picks the best among randomly drawn splits.

#### random_state=0
The seed for the model's random number generator. It affects any randomness in the model, so fixing it makes the results reproducible.

## View tree

In [None]:
def view_tree(classifier):
    fig, axes = plt.subplots(nrows=1,ncols=1,figsize=(4,4), dpi=150) #change dpi to resize image
    tree_view = plot_tree(classifier, feature_names=iris.feature_names,
              class_names=iris.target_names, ax=axes, filled=True)

In [None]:
view_tree(classifier)

The fill color of each node indicates its majority class.

Predict test set

In [None]:
predictions = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, predictions))

In [None]:
print(classification_report(y_test, predictions))

### Use Random Forest for Classification

In [None]:
classifier = RandomForestClassifier()

In [None]:
classifier.fit(X_train,y_train)

### Default parameters

#### bootstrap=True
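Whether each tree is trained on a bootstrap sample of the training set. If True, every tree is fitted on samples drawn at random, with replacement, from the training data, so each tree sees a different subset. If False, every tree uses the whole training set.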
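
As a rough illustration of bootstrap sampling, the sketch draws row indices with replacement using only the standard library. The sample count and variable names are assumptions for illustration; scikit-learn handles this step internally.

```python
import random

rng = random.Random(0)   # fixed seed so the draw is repeatable
n_samples = 10           # hypothetical training-set size
indices = list(range(n_samples))

# Draw n_samples indices WITH replacement: some rows repeat, some are left out
bootstrap = [rng.choice(indices) for _ in range(n_samples)]

# Rows that were never drawn are "out of bag" for this tree
out_of_bag = sorted(set(indices) - set(bootstrap))

print("bootstrap sample:", bootstrap)
print("out-of-bag rows: ", out_of_bag)
```

Because each tree receives a different draw, the trees in the forest differ from one another even though they share one training set.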

#### max_features='auto'
The number of features to consider when looking for the best split. 'auto' uses sqrt(n_features) for classification. In recent scikit-learn releases this option is spelled 'sqrt'.

#### min_samples_leaf=1
The minimum number of samples required at a leaf node.

#### min_samples_split=2
The minimum number of samples required to split an internal node.

#### n_estimators=10
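The number of decision trees in the forest. This is important because a random forest combines the predictions of multiple decision trees. The default was 10 in older scikit-learn releases and is 100 from version 0.22 onward, so the number of trees printed below depends on the installed version.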

#### verbose=0
Controls how much information is printed during training. 0 prints nothing.

## View trees
A random forest is a combination of several decision trees. Therefore, every tree has to be plotted individually.

In [None]:
# Number of trees
num_trees = len(classifier.estimators_)
print(num_trees)

In [None]:
for i in range(num_trees):
    if i > 2: #Only plot the first 3 trees
        break
    view_tree(classifier.estimators_[i])
    print()

Different trees produce different decision branches.

In [None]:
predictions = classifier.predict(X_test)

In [None]:
# Plot confusion matrix
def plot_cm(y_test, predictions, figsize):
    cm = confusion_matrix(y_test,predictions)
    df_cm = pd.DataFrame(cm)
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'

    plt.figure(figsize=figsize)
    sns.set(font_scale=1.4)
    sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16},
               fmt='g')

In [None]:
plot_cm(y_test, predictions, figsize=(7, 5))

### Result Interpretation - Confusion Matrix

predicted axis: the result of model prediction
actual axis: actual ground truth

The desired result is for every prediction to match the actual class. From the matrix, we can see that the model's predictions of items 0 and 2 are 100% correct, while two of its predictions of item 1 are wrong. The confusion matrix also gives useful information about model bias, that is, which classes the model tends to predict.

In [None]:
print(classification_report(y_test,predictions))

### Result Interpretation - Classification Report

#### Precision
The percentage of predictions of a given class that are correct. As in the confusion matrix, you can see along the predicted axis that every prediction of item 0 or item 2 is correct (100%).

#### Recall
The ability of the classifier to find all positive instances of a class. Look along the actual axis for item 2: the model finds 15 of the 17 actual instances, which is 88% of the total.

#### F1-Score
The harmonic mean of precision and recall.

<img src="https://miro.medium.com/max/752/1*UJxVqLnbSj42eRhasKeLOA.png" />

[Image Source: Towards Data Science](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c#:~:text=We%20use%20the%20harmonic%20mean%20instead%20of%20a%20simple%20average)
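
The sketch below shows how precision, recall, F1-score and support can be derived from a confusion matrix by hand. The matrix is hypothetical: only its last row (15 of 17 instances of item 2 found) follows the text above, the other counts are assumed for illustration, and `per_class_metrics` is an illustrative helper rather than a scikit-learn function.

```python
# Rows are the actual class, columns the predicted class.
# Hypothetical counts; only the last row follows the text above.
cm = [
    [18, 0, 0],
    [0, 10, 0],
    [0, 2, 15],
]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total
    actual_k = sum(cm[k])                    # row total, i.e. the support
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, actual_k

for k in range(len(cm)):
    p, r, f1, support = per_class_metrics(cm, k)
    print(f"item {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

Precision is read down a column and recall across a row, which is why the two misclassified instances lower the recall of item 2 but the precision of item 1.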

#### Support 
The total number of occurrences of each class in the test set. We call the classes items in this example.

# 20 Newsgroups Classification

### Use decision tree

In [None]:
news_categories = ['comp.graphics','comp.os.ms-windows.misc',
                   'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                   'comp.windows.x']

ng = datasets.fetch_20newsgroups(categories=news_categories)

Refer to https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html for the full list of news categories. We only use some of them in this example.

### View data

In [None]:
ng.data[:1]

In [None]:
news_class = ng.target[:1]

# Let's see which class the news item belongs to.
print(news_class)

# Let's convert the index to string label
print(ng.target_names[news_class[0]])

### Data preparation 1 : Vectorizing

scikit-learn algorithms cannot process strings directly. Because our data consists of strings, it needs to be converted to numbers. This process is called vectorizing.

There are many vectorizing algorithms. For now, we are going to use a simple one called CountVectorizer. This algorithm converts the data into a matrix of token counts. A token is a single word, so a sentence with five words has five tokens.
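
To see what a matrix of token counts looks like, here is a small sketch on a made-up two-sentence corpus (the sentences are assumptions for illustration, not part of the newsgroups data). With its default settings, CountVectorizer lowercases the text and keeps only tokens of at least two characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
corpus = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row is one document and each column one word of the vocabulary, so the first row records that "the" occurs twice in the first sentence.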

In [None]:
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(ng.data)

In [None]:
data

Now we have the right format: a sparse matrix with one row per document and one column per word in the vocabulary. The shape shown in the output above gives both counts. Wow! That's a lot of columns.

In [None]:
dat1 = data[1,:].toarray()
dat1

Every document in the dataset is converted to a vector of length 66735. The value of each element is the number of times a particular word occurs in the document.

### Data preparation 2 : Split train and test

In [None]:
# Convert target to numpy float

target = ng.target.astype(np.float32)

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    data, target, test_size=0.3, random_state=123
)

### Train model

Initialize a scikit-learn decision tree classifier using the entropy criterion and a random state of 0.

Perform training.
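
One possible shape for these two steps is sketched here on a tiny made-up corpus, so that it runs without downloading the newsgroups data. The documents, labels and the name `clf` are assumptions for illustration; in the notebook the same two calls would be applied to `X_train` and `y_train`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in corpus: 0 = graphics, 1 = windows (hypothetical labels)
docs = [
    "the graphics card renders images",
    "image rendering with opengl graphics",
    "windows driver crashes on boot",
    "install the windows update driver",
]
labels = [0, 0, 1, 1]

# Vectorize the text, then initialize and train the tree
X = CountVectorizer().fit_transform(docs)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, labels)

# Predictions on the training documents themselves
train_pred = clf.predict(X)
print(train_pred)
```

The classifier accepts the sparse matrix produced by CountVectorizer directly, so no conversion to a dense array is needed.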

### Evaluate

Predict on the test data.

In [None]:
plot_cm(y_test, predictions, figsize=(10, 8))

As noted earlier, the diagonal entry should ideally be the largest number in its column.

However, in this example, the model does not classify the data well.

Print the classification report.

We can see that the model's accuracy across all classes is only 65%.

### Use random forest

Train a scikit-learn random forest classifier.

Perform evaluation by showing the confusion matrix and the classification report.

Even though we are using the default values, the model performs better than the single decision tree, with 69% accuracy.

The model can be improved by performing a grid search to find the best hyperparameters.
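
A minimal sketch of such a grid search with `GridSearchCV` follows. To stay small and self-contained it uses the iris data from the first part of the notebook rather than the newsgroups data, and the parameter grid is an arbitrary assumption, not a recommended setting.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=123
)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 3]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)  # accuracy of the refitted best model
print(test_accuracy)
```

The same pattern applies to the vectorized news data, although each extra grid value multiplies the training time.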