# Data Mining Project 2 - Decision Tree
2018/11/20
---


## Title : Go out today or not 


## Data :

- Generate **1000** random instances
- shape = (1000, 7), one column per attribute (the class label is stored in an extra `out` column)
- attributes :
    - Went out yesterday : `wasOut` (boolean)
    - It rained yesterday : `wasRainy` (boolean)
    - It is raining today : `isRainy` (boolean)
    - Today is a holiday : `isHoliday` (boolean)
    - Feeling sleepy today : `isSleepy` (boolean)
    - Homework is finished : `doneHomework` (boolean)
    - Housework is finished : `doneHousework` (boolean)
- class :
    - Go out or not : `Out` or `In`
    
## Tree(created from rules) : 

![tree](tree_by_rule.jpg)

## Code block :
---

In [1]:
import pandas as pd
import random
import numpy as np

### Step1. Define the rules

In [2]:
def getData(numInstance=1000):
    # Fill every attribute with random booleans, then label each row with the rules below
    df = pd.DataFrame(columns=['wasOut', 'wasRainy', 'isRainy', 'isHoliday', 'isSleepy', 'doneHomework', 'doneHousework', 'out'])
    for col in df.columns[:-1]:
        df[col]=[bool(random.randint(0,1)) for i in range(numInstance)]
    for i in range(numInstance):
        if df.loc[i, 'wasRainy']:
            if df.loc[i, 'isRainy']:
                df.loc[i, 'out'] = False
            else:
                if df.loc[i, 'doneHousework']:
                    if df.loc[i, 'wasOut']:
                        df.loc[i, 'out'] = False
                    else:
                        df.loc[i, 'out'] = True
                else:
                    df.loc[i, 'out'] = False
        else:
            if df.loc[i, 'doneHomework']:
                if df.loc[i, 'isHoliday']:
                    df.loc[i, 'out'] = True
                else:
                    if df.loc[i, 'isSleepy']:
                        df.loc[i, 'out'] = False
                    else:
                        df.loc[i, 'out'] = True
            else:
                df.loc[i, 'out'] = False
    return df
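
The nested rules in `getData` collapse into a single Boolean expression, which makes the rule tree easier to check by hand. The sketch below is not part of the original notebook: `label_nested` and `label_flat` are hypothetical helper names, rows are plain dicts rather than DataFrame rows, and the check enumerates all 2^7 attribute combinations instead of a random sample.

```python
from itertools import product

ATTRS = ['wasOut', 'wasRainy', 'isRainy', 'isHoliday',
         'isSleepy', 'doneHomework', 'doneHousework']

def label_nested(r):
    """Same branching as getData, applied to one row given as a dict."""
    if r['wasRainy']:
        if r['isRainy']:
            return False
        if r['doneHousework']:
            return not r['wasOut']
        return False
    if r['doneHomework']:
        if r['isHoliday']:
            return True
        return not r['isSleepy']
    return False

def label_flat(r):
    """The same rules written as one Boolean expression (assumed equivalent)."""
    rainy_branch = (not r['isRainy']) and r['doneHousework'] and (not r['wasOut'])
    dry_branch = r['doneHomework'] and (r['isHoliday'] or not r['isSleepy'])
    return rainy_branch if r['wasRainy'] else dry_branch

# Enumerate every possible row and compare the two labelings
rows = [dict(zip(ATTRS, combo)) for combo in product([False, True], repeat=len(ATTRS))]
mismatches = [r for r in rows if label_nested(r) != label_flat(r)]
n_out = sum(label_flat(r) for r in rows)
print(len(rows), len(mismatches), n_out)  # prints: 128 0 32
```

Under these rules 32 of the 128 possible rows are labeled "out", so a uniformly random sample should contain roughly 25% positive instances.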

### Step2. Generate(new) / Read(old) data

In [3]:
# Generate new data
# df = getData(1000)
# df.to_csv('data.csv', index=False)

# Or read data from file
df = pd.read_csv('data.csv')

### Step3. Create tree from rules

In [4]:
class treeNode():
    def __init__(self, value, rightSon=None, leftSon=None):
        self.value = value
        self.rightSon = rightSon
        self.leftSon = leftSon
    def getGINI_son(self, df):
        # Gini impurity of one child node: 1 - sum(p_k^2) over the classes in 'out'
        counts = df['out'].value_counts()
        total = sum(counts)
        GINI = 1 - sum([pow(count/total, 2) for count in counts])
        return total, GINI
    def getGINI(self, df):
        # GINI_split: child impurities weighted by the number of rows in each child
        N1_count, N1_GINI = self.getGINI_son(df[df[self.value]==True])
        N2_count, N2_GINI = self.getGINI_son(df[df[self.value]==False])
        count = N1_count + N2_count
        GINI = (N1_count*N1_GINI + N2_count*N2_GINI) / count
        return GINI
    def getValue(self):
        return self.value
    def getRightSon(self):
        return self.rightSon
    def getLeftSon(self):
        return self.leftSon
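
As a worked example of the two quantities computed by `treeNode`, the sketch below evaluates a node's Gini impurity and the weighted split value on a tiny hand-made partition. The helper names `gini` and `gini_split` are hypothetical, plain lists stand in for the DataFrame, and treating an empty node as pure is an assumption (the class above is never called on an empty frame here).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 1 - sum(p_k^2) over the classes present."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: an empty node counts as pure
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(true_labels, false_labels):
    """Size-weighted average of the two child impurities."""
    n1, n2 = len(true_labels), len(false_labels)
    return (n1 * gini(true_labels) + n2 * gini(false_labels)) / (n1 + n2)

true_side = [True, True, True, False]   # 3 "out", 1 "in"
false_side = [False, False]             # pure node

print(gini(true_side))                    # 0.375
print(gini(false_side))                   # 0.0
print(gini_split(true_side, false_side))  # 0.25
```

The pure child contributes nothing, so the split value is just 4/6 of the impure child's 0.375.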

In [5]:
# Subtree for wasRainy == True (rightSon is the True branch, leftSon the False branch)
Node_wasOut = treeNode('wasOut')
Node_doneHousework = treeNode('doneHousework', rightSon=Node_wasOut)
Node_isRainy = treeNode('isRainy', leftSon=Node_doneHousework)
# Subtree for wasRainy == False
Node_isSleepy = treeNode('isSleepy')
Node_isHoliday = treeNode('isHoliday', leftSon=Node_isSleepy)
Node_doneHomework = treeNode('doneHomework', rightSon=Node_isHoliday)
# Root
Node_wasRainy = treeNode('wasRainy', Node_isRainy, Node_doneHomework)


### Step4. Count the number of instances in each leaf node

In [6]:
count_Nodes = np.zeros(8)
for i in range(df.shape[0]):
        if df.loc[i, 'wasRainy']:
            if df.loc[i, 'isRainy']:
                count_Nodes[0] +=1
            else:
                if df.loc[i, 'doneHousework']:
                    if df.loc[i, 'wasOut']:
                        count_Nodes[1] += 1
                    else:
                        count_Nodes[2] += 1
                else:
                    count_Nodes[3] += 1
        else:
            if df.loc[i, 'doneHomework']:
                if df.loc[i, 'isHoliday']:
                    count_Nodes[4] += 1
                else:
                    if df.loc[i, 'isSleepy']:
                        count_Nodes[5] += 1
                    else:
                        count_Nodes[6] += 1
            else:
                count_Nodes[7] += 1
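
One way to cross-check the counting loop is to map each row to its leaf slot with a function and tally the results. This is a minimal sketch: `leaf_index` is a hypothetical helper whose numbering follows `count_Nodes`, and the three rows are illustrative values, not taken from `data.csv`.

```python
from collections import Counter

def leaf_index(r):
    """Return the count_Nodes slot (0-7) that a row falls into."""
    if r['wasRainy']:
        if r['isRainy']:
            return 0
        if r['doneHousework']:
            return 1 if r['wasOut'] else 2
        return 3
    if r['doneHomework']:
        if r['isHoliday']:
            return 4
        return 5 if r['isSleepy'] else 6
    return 7

# Hand-made rows: start from all-False and flip a few attributes
base = dict(wasOut=False, wasRainy=False, isRainy=False, isHoliday=False,
            isSleepy=False, doneHomework=False, doneHousework=False)
rows = [
    dict(base, wasRainy=True, isRainy=True),        # leaf 0
    dict(base, doneHomework=True, isHoliday=True),  # leaf 4
    dict(base),                                     # leaf 7
]
counts = Counter(leaf_index(r) for r in rows)
print([counts.get(i, 0) for i in range(8)])  # [1, 0, 0, 0, 1, 0, 0, 1]
```

With the notebook's DataFrame, the same function could presumably be applied row-wise via `df.apply(leaf_index, axis=1)` and tallied with `value_counts()`.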

### Step5. Calculate GINI in each node

In [7]:
def printGINI(Node, df):
    # Post-order traversal: print the children first, then the node itself
    if Node.getRightSon() is not None:
        printGINI(Node.getRightSon(), df[df[Node.getValue()]==True])
    if Node.getLeftSon() is not None:
        printGINI(Node.getLeftSon(), df[df[Node.getValue()]==False])
    print(Node.getValue(), round(Node.getGINI(df), 3))

In [8]:
printGINI(Node_wasRainy,df)

wasOut 0.0
doneHousework 0.243
isRainy 0.188
isSleepy 0.0
isHoliday 0.247
doneHomework 0.196
wasRainy 0.343


### Step6. Create decision tree model

In [9]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(df.iloc[:, :-1], df['out'])

### Step7. Show the tree created by the model
![tree_by_model](tree_by_model.jpg)

In [10]:
import graphviz
# Export the fitted tree to DOT with readable feature and class names
dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=df.columns[:-1],  
                                class_names=['in', 'Out'],
                                filled=True, rounded=True,
                                special_characters=True)  
graph = graphviz.Source(dot_data)
# save graph in tree.pdf
graph.render('tree')

'tree.pdf'
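
If Graphviz is not installed, scikit-learn's `tree.export_text` can print a fitted tree as plain text instead. The sketch below is an assumption-laden stand-in for the notebook's setup: it trains on the full truth table of the seven attributes rather than on `data.csv`, and `rule` is a hypothetical helper that restates the rules from Step 1.

```python
from itertools import product
from sklearn import tree

names = ['wasOut', 'wasRainy', 'isRainy', 'isHoliday',
         'isSleepy', 'doneHomework', 'doneHousework']

def rule(wasOut, wasRainy, isRainy, isHoliday, isSleepy, doneHomework, doneHousework):
    """The hand-written rules from Step 1 as one Boolean expression."""
    if wasRainy:
        return (not isRainy) and doneHousework and (not wasOut)
    return doneHomework and (isHoliday or not isSleepy)

# All 128 attribute combinations, encoded as 0/1 (not the random sample in data.csv)
X = [list(map(int, combo)) for combo in product([False, True], repeat=7)]
y = [bool(rule(*row)) for row in X]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
text = tree.export_text(clf, feature_names=names)
print(text)
```

Each split appears as a line such as `|--- <attribute> <= 0.50`, because the boolean attributes are treated as numeric values.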

### Step8. Print the $GINI~split$ of each candidate attribute at the root

In [11]:
# Print the GINI_split obtained when each attribute is chosen as the root
for col in df.columns[:-1]:
    # N1: rows where the attribute is True; N2: rows where it is False
    N1_count = df[df[col]==True].shape[0]
    N2_count = df[df[col]==False].shape[0]
    N1_C1, N1_C2 = df[df[col]==True]['out'].value_counts()
    N2_C1, N2_C2 = df[df[col]==False]['out'].value_counts()
    GINI = (N1_count * (1 - pow(N1_C1 / (N1_C1 + N1_C2), 2) - pow(N1_C2 / (N1_C1 + N1_C2), 2)) +
            N2_count * (1 - pow(N2_C1 / (N2_C1 + N2_C2), 2) - pow(N2_C2 / (N2_C1 + N2_C2), 2)))/(N1_count + N2_count)
    print(col, round(GINI, 3))

wasOut 0.363
wasRainy 0.343
isRainy 0.367
isHoliday 0.37
isSleepy 0.368
doneHomework 0.307
doneHousework 0.364
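
To make the root choice explicit, the printed values can be ranked from smallest to largest. The numbers in this small sketch are copied from the output above; the variable names are illustrative.

```python
# GINI_split per candidate root attribute, copied from the notebook output
gini_split = {
    'wasOut': 0.363, 'wasRainy': 0.343, 'isRainy': 0.367, 'isHoliday': 0.37,
    'isSleepy': 0.368, 'doneHomework': 0.307, 'doneHousework': 0.364,
}

# Smaller GINI_split means purer children, so sort ascending
ranked = sorted(gini_split.items(), key=lambda kv: kv[1])
best_attr, best_value = ranked[0]
print(best_attr, best_value)  # doneHomework 0.307
print(ranked[1])              # ('wasRainy', 0.343)
```

`doneHomework` gives the smallest value, while `wasRainy`, the root of the rule-based tree, comes second.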


### Step9. Analyze the difference between trees
![tree_by_rule](tree_by_rule.jpg)
![tree_by_model](tree_by_model.jpg)

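**Difference** : reason
1. **The root node uses a different attribute** : the first tree was built from hand-written rules, so the $GINI~split$ of its root **(0.343, wasRainy)** is not necessarily the smallest, whereas the decision tree algorithm selects the attribute with the smallest $GINI~split$ **(0.307, doneHomework)**.
<span style="color:red">**Note:** the GINI shown in the tree figure is only the GINI of each node itself, not the GINI_split</span>
2. **Attributes are used a different number of times** : the rules never reuse an attribute as a splitting criterion, whereas the decision tree algorithm (as described above) may select the same attribute more than once, in different branches.
3. **The trees have a different number of levels** : the decision tree algorithm only guarantees the smallest GINI at each split; it gives no guarantee about the number of levels in the tree.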
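
The three differences listed above can be checked directly on a fitted model. The following is a sketch under stated assumptions: it trains on the full truth table of the seven attributes rather than on `data.csv`, `rule` is a hypothetical restatement of the Step 1 rules, and the depth and usage counts for the notebook's random sample may differ.

```python
from collections import Counter
from itertools import product
from sklearn.tree import DecisionTreeClassifier

names = ['wasOut', 'wasRainy', 'isRainy', 'isHoliday',
         'isSleepy', 'doneHomework', 'doneHousework']

def rule(wasOut, wasRainy, isRainy, isHoliday, isSleepy, doneHomework, doneHousework):
    """The hand-written rules from Step 1 as one Boolean expression."""
    if wasRainy:
        return (not isRainy) and doneHousework and (not wasOut)
    return doneHomework and (isHoliday or not isSleepy)

# All 128 attribute combinations, encoded as 0/1 (assumption: not the original data)
X = [list(map(int, combo)) for combo in product([False, True], repeat=7)]
y = [bool(rule(*row)) for row in X]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# tree_.feature stores the split attribute index per node; leaves hold a negative value
root = names[clf.tree_.feature[0]]
used = Counter(names[f] for f in clf.tree_.feature if f >= 0)

print('root attribute :', root)
print('depth          :', clf.get_depth())
print('attribute usage:', dict(used))
```

Any attribute whose usage count exceeds 1 was chosen again in a different branch, which the rule-based tree never does.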
