## Decision Tree

Decision Trees (DTs) are a supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
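
To make the idea of "simple decision rules" concrete, the sketch below fits a deliberately shallow tree on the Iris data and prints the learned rules as text. The `max_depth=2` setting is an illustrative assumption chosen only to keep the output short; it is not a setting used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree (max_depth=2 is an illustrative choice)
# so the learned rules stay short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, and each leaf reports the predicted class.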

## Random Forest

A **random forest** is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

A random forest classifier is an ensemble learning method: it combines the predictions of many individual decision trees into a single prediction.
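
The averaging step can be checked directly. The following sketch, which assumes scikit-learn's `RandomForestClassifier` and an arbitrary `n_estimators=25`, averages the class probabilities of the individual trees and compares the result with the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators=25 is an arbitrary value chosen for this illustration.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree is exposed through the estimators_ attribute.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average the per-tree class probabilities and compare with the forest.
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X))
print(matches)
```

The forest then predicts the class with the highest averaged probability.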

# Iris dataset

### Use Decision Tree for Classification

In [1]:
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
iris = datasets.load_iris()

In [3]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [5]:
data = iris.data.astype(np.float32)
target=iris.target.astype(np.float32)

In [6]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    data, target, test_size=0.3, random_state=123
)

In [7]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

In [8]:
classifier.fit(X_train,y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

In [9]:
predictions=classifier.predict(X_test)

In [10]:
from sklearn.metrics import confusion_matrix, classification_report

In [11]:
print(confusion_matrix(y_test,predictions))

[[18  0  0]
 [ 0 10  0]
 [ 0  3 14]]


In [12]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        18
         1.0       0.77      1.00      0.87        10
         2.0       1.00      0.82      0.90        17

    accuracy                           0.93        45
   macro avg       0.92      0.94      0.92        45
weighted avg       0.95      0.93      0.93        45



### Use Random Forest for Classification

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [15]:
classifier=RandomForestClassifier()

In [16]:
classifier.fit(X_train,y_train)

RandomForestClassifier()

In [17]:
predictions=classifier.predict(X_test)

In [18]:
print(confusion_matrix(y_test,predictions))

[[18  0  0]
 [ 0 10  0]
 [ 0  2 15]]


In [19]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        18
         1.0       0.83      1.00      0.91        10
         2.0       1.00      0.88      0.94        17

    accuracy                           0.96        45
   macro avg       0.94      0.96      0.95        45
weighted avg       0.96      0.96      0.96        45



# 20 Newsgroups Classification

### Use decision tree

In [84]:
ng = datasets.fetch_20newsgroups()

### View data

In [89]:
ng.data[:1]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"]

In [97]:
news_class = ng.target[:1]

# See which class the first post belongs to.
print(news_class)

# Convert the class index to its string label.
print(ng.target_names[news_class[0]])

[7]
rec.autos


That looks right: the post is clearly about a car, which matches the `rec.autos` class.

### Data preparation 1 : Vectorizing

Machine learning algorithms cannot process raw strings. Because our data is text, it needs to be converted to numbers first. This process is called vectorizing.

There are many vectorizing algorithms. For now, we are going to use a simple one called `CountVectorizer`, which converts the data into a matrix of token counts. A token is roughly a single word, so a sentence with 5 words has about 5 tokens.
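
As a small illustration of what a matrix of token counts looks like, the sketch below runs `CountVectorizer` on two made-up sentences. The toy `docs` list is an assumption for demonstration only and is not part of the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for this illustration.
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in the count matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)

# One row per document, one column per token in the vocabulary.
print(counts.toarray())
```

Each row holds the counts for one document. Note that with its default settings `CountVectorizer` lowercases the text and ignores single-character tokens, so the token count can be slightly lower than the word count.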

In [102]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(ng.data)

In [106]:
data

<11314x130107 sparse matrix of type '<class 'numpy.int64'>'
	with 1787565 stored elements in Compressed Sparse Row format>

Now we have the right format. The data has 11,314 documents (one row per post) and a vocabulary of 130,107 distinct tokens (one column per token). Wow! That's a lot.
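
A matrix this wide is only practical because it is stored in sparse form. As a rough check, the sketch below reuses the three figures from the matrix summary printed above to estimate how much of the matrix is actually non-zero; the variable names are chosen here for illustration.

```python
# Figures copied from the sparse matrix summary printed above.
n_documents = 11314
n_features = 130107
n_stored = 1787565

# Fraction of cells that hold a stored (non-zero) count.
density = n_stored / (n_documents * n_features)

# Average number of distinct tokens appearing in a single document.
avg_tokens_per_document = n_stored / n_documents

print(f"density: {density:.4%}")
print(f"average distinct tokens per document: {avg_tokens_per_document:.0f}")
```

Only a small fraction of the cells are non-zero, which is why the result is kept in Compressed Sparse Row format rather than as a dense array.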

### Data preparation 2 : Split train and test

In [108]:
# Convert the target labels to NumPy float32

target = ng.target.astype(np.float32)

In [109]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    data, target, test_size=0.3, random_state=123
)

### Train model

In [110]:
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

In [111]:
classifier.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

### Evaluate

In [112]:
predictions=classifier.predict(X_test)

In [115]:
print(confusion_matrix(y_test,predictions))

[[ 59   3   1   1   5   2   2   0   1   4   3   0   3   6   1   7   2   5
    6  16]
 [  3  46  16  11  11  13   8   5   3   8   5   7  12  10   2   5   2   2
    1   1]
 [  1  10  97  19   3  11   4   1   1   4   0   5   9   6   7   2   0   0
    1   1]
 [  3  13  15  43  11   7   6   4   5   1   5   2  18   6   3  10   1   1
    1   3]
 [  2  10   7  14  64   4  11   7   4   7   2   2  12   8   7   8   3   1
    4   3]
 [  2  35  11  18   3  64   5   6   3   2   3   5  19   8   6   5   1   3
    1   0]
 [  0   4   3  12   5   3 112   3   6   4   1   2   4   4   4   5   1   0
    1   0]
 [  3   8   1   5   6   5   2  94   9   4   4   1   8   7   4   0   3   6
    4   5]
 [  2   3   2   6   3   5   5   7 109   2   5   0   4   5   1   2   3   1
    1   3]
 [  1   8   2   3   6   3   4   4   3  93  22   0   6   6   5   6   6   3
    3   2]
 [  1   5   2   1   0   3   6   3   2  21  92   3   6   1   1   0   4   0
    5   3]
 [  0  10   1   4   6   4   1   0   2   1   3 142   6   5   2   2

In [114]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         0.0       0.44      0.46      0.45       127
         1.0       0.22      0.27      0.24       171
         2.0       0.52      0.53      0.53       182
         3.0       0.24      0.27      0.26       158
         4.0       0.43      0.36      0.39       180
         5.0       0.42      0.32      0.36       200
         6.0       0.59      0.64      0.62       174
         7.0       0.53      0.53      0.53       179
         8.0       0.65      0.64      0.65       169
         9.0       0.53      0.50      0.52       186
        10.0       0.53      0.58      0.55       159
        11.0       0.76      0.69      0.72       206
        12.0       0.22      0.21      0.21       189
        13.0       0.33      0.41      0.37       167
        14.0       0.53      0.45      0.49       177
        15.0       0.54      0.54      0.54       199
        16.0       0.55      0.52      0.54       165
        17.0       0.70    

### Use random forest

In [116]:
classifier=RandomForestClassifier()

In [117]:
classifier.fit(X_train, y_train)

RandomForestClassifier()

In [118]:
predictions=classifier.predict(X_test)

In [119]:
print(confusion_matrix(y_test,predictions))

[[106   0   0   0   3   0   0   2   0   0   1   0   1   1   1   7   0   0
    0   5]
 [  1 121  20   5   6   9   5   0   0   1   1   1   0   0   1   0   0   0
    0   0]
 [  0   8 150  10   3   5   3   0   0   1   0   0   1   0   1   0   0   0
    0   0]
 [  0   7  15 112   6   4   5   0   0   0   1   3   3   0   1   1   0   0
    0   0]
 [  0   5   3  21 135   1  11   0   0   1   0   0   1   1   0   0   0   1
    0   0]
 [  0  15  16  12   0 151   1   1   0   2   0   0   0   0   2   0   0   0
    0   0]
 [  0   0   2   6   1   0 156   0   0   3   2   0   0   2   0   0   0   1
    0   1]
 [  0   3   4   3   2   1   4 155   5   1   1   0   0   0   0   0   0   0
    0   0]
 [  1   1   0   0   1   0   4   3 156   0   0   0   0   1   0   1   1   0
    0   0]
 [  1   3   0   0   2   0   3   2   0 163  10   0   2   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   1   1   0   2 153   0   0   2   0   0   0   0
    0   0]
 [  1   2   2   1   0   2   1   0   0   0   0 195   1   0   0   0

In [120]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         0.0       0.87      0.83      0.85       127
         1.0       0.67      0.71      0.69       171
         2.0       0.68      0.82      0.74       182
         3.0       0.57      0.71      0.63       158
         4.0       0.78      0.75      0.76       180
         5.0       0.84      0.76      0.80       200
         6.0       0.73      0.90      0.80       174
         7.0       0.90      0.87      0.88       179
         8.0       0.94      0.92      0.93       169
         9.0       0.90      0.88      0.89       186
        10.0       0.88      0.96      0.92       159
        11.0       0.96      0.95      0.95       206
        12.0       0.87      0.66      0.75       189
        13.0       0.89      0.83      0.86       167
        14.0       0.92      0.92      0.92       177
        15.0       0.78      0.92      0.84       199
        16.0       0.90      0.92      0.91       165
        17.0       0.94    