# HW2 - Decision Tree Practice

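#### Files:
- hw2.ipynb: the main notebook; reads the data, trains a decision tree, and writes the tree to a PDF file
- generate.py: script that generates test data (test.csv) with per-column value proportions that differ from the training data, labeled according to the rules
- generate-like.py: script that generates test data (test-like.csv) with per-column value proportions close to the training data, labeled according to the rules
- train.csv: generated training data (1000 rows); `attendance` is not labeled yet and is temporarily set to 0 for every row
- test.csv: generated test data (1000 rows); `attendance` is already labeled, and the per-column value proportions are less similar to the training data
- test-like.csv: generated test data (1000 rows); `attendance` is already labeled, and the per-column value proportions are more similar to the training data
- tree.pdf: PDF of the decision tree diagram produced by running hw2.ipynb
- rule.jpg: diagram of the hand-designed classification rules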

## Step 1 - Generate data
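#### Generated data: whether each of the 1000 people who signed up for a general education lecture attended
#### Columns (one row per person):
- isStudent: whether the person is a student of the school (1 if a student, otherwise 0)
- mayNotGraduate: how much graduation is at risk (1 to 5; a larger number means a greater need to attend general education lectures)
- interested: level of interest in the lecture (1 to 5; a larger number means more interest)
- alone: whether the person attends alone (1 if alone, 0 if with companions)
- signUpOnline: whether the person signed up online (1 for online sign-up, 0 for on-site sign-up)
- attendance: whether the person attended the lecture (1 if yes, otherwise 0)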
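
The generator scripts are not reproduced in this notebook. As a rough sketch only (this is not the actual `generate.py`), unlabeled rows with the columns and value ranges listed above could be drawn with the standard library. The uniform sampling and the `make_row` helper are assumptions made for illustration.

```python
import random

def make_row(rng):
    """Draw one unlabeled sign-up record using the value ranges listed above."""
    return {
        'isStudent': rng.randint(0, 1),
        'mayNotGraduate': rng.randint(1, 5),
        'interested': rng.randint(1, 5),
        'alone': rng.randint(0, 1),
        'signUpOnline': rng.randint(0, 1),
        'attendance': 0,  # placeholder, to be labeled later by the rules
    }

rng = random.Random(0)  # fixed seed so the sample is reproducible
rows = [make_row(rng) for _ in range(1000)]
print(len(rows), rows[0])
```

Changing the sampling weights per column would be one way to make a data set more or less similar to another, which is the difference the file list describes between test.csv and test-like.csv.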

In [1]:
# Read data

import pandas as pd
import numpy as np

dataPathTrain = './train.csv'
dataPathTest = './test.csv'
dataPathTestLike = './test-like.csv'

df = pd.read_csv(dataPathTrain)
testDf = pd.read_csv(dataPathTest)
testLikeDf = pd.read_csv(dataPathTestLike)

print(df.shape)
print(df.head())

(1000, 6)
   isStudent  mayNotGraduate  interested  alone  signUpOnline  attendance
0          1               5           5      1             1           0
1          1               5           5      1             1           0
2          1               5           5      1             1           0
3          1               5           5      1             1           0
4          1               5           5      1             1           0


## Step 2 - Design rules
### Rule tree diagram
(`rule.jpg`)
The labels (`attendance`) of the training data are generated from these rules.
![rule](rule.jpg)

In [2]:
from collections import Counter

attendanceList = []

for index, row in df.iterrows():
    if(row['isStudent'] == 1):
        if(row['mayNotGraduate'] == 5):
            attendanceList.append(1)
        else:
            if(row['interested'] >= 4):
                attendanceList.append(1)
            else:
                if(row['alone'] == 0):
                    attendanceList.append(1)
                else:
                    if(row['signUpOnline'] == 0):
                        attendanceList.append(1)
                    else:
                        attendanceList.append(0)
    else:
        if(row['signUpOnline'] == 0):
            attendanceList.append(1)
        else:
            if(row['interested'] >= 3):
                if(row['alone'] == 0):
                    attendanceList.append(1)
                else:
                    attendanceList.append(0)
            else:
                if(row['alone'] == 0):
                    attendanceList.append(0)
                else:
                    attendanceList.append(1)

counter = Counter(attendanceList)
print('attendance:')
for key in counter:
    print('%d: %d people' % (key, counter[key]))

attendanceDf = pd.DataFrame({'attendance': attendanceList})
df.update(attendanceDf)

attendance:
1: 920 people
0: 80 people
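
The nested if-else in the cell above can also be read as one small function. The sketch below is an equivalent restatement of the same rule, not code from the assignment; the name `attendance_rule` and its argument names are assumptions.

```python
def attendance_rule(is_student, may_not_graduate, interested, alone, sign_up_online):
    """Return 1 (present) or 0 (absent) following the rule tree in rule.jpg."""
    if is_student == 1:
        # A student is absent only when every reason to attend is missing.
        if (may_not_graduate == 5 or interested >= 4
                or alone == 0 or sign_up_online == 0):
            return 1
        return 0
    if sign_up_online == 0:
        return 1  # non-students who sign up on site are already there
    # Non-student, online sign-up: present when "interested >= 3" and
    # "has companions" are both true or both false.
    return int((interested >= 3) == (alone == 0))

print(attendance_rule(1, 5, 5, 1, 1))  # student at risk of not graduating
print(attendance_rule(1, 1, 1, 1, 1))  # student with no reason to attend
```

Writing the rule as a function makes it reusable for labeling the training data, counting branches, and checking the test labels, which the notebook does with three copies of the nested block.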


## Step 3 - Build a decision tree
Split the data into x (the features) and y (`attendance`).

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dataX = df[['isStudent', 'mayNotGraduate', 'interested', 'alone', 'signUpOnline']]
dataY = df[['attendance']]

testX = testDf[['isStudent', 'mayNotGraduate', 'interested', 'alone', 'signUpOnline']]
testY = testDf[['attendance']]

print('data x:')
print(dataX.head())
print('\ndata y:')
print(dataY.head())

mTree = DecisionTreeClassifier()

mTree.fit(dataX, dataY)
predictY = mTree.predict(testX)

print('\naccuracy:')
print(accuracy_score(testY, predictY))

data x:
   isStudent  mayNotGraduate  interested  alone  signUpOnline
0          1               5           5      1             1
1          1               5           5      1             1
2          1               5           5      1             1
3          1               5           5      1             1
4          1               5           5      1             1

data y:
   attendance
0           1
1           1
2           1
3           1
4           1

accuracy:
0.978


## Step 4 - Plot the decision tree
Write the decision tree diagram (`tree.pdf`) to the same folder.

In [4]:
import pydotplus
from io import StringIO  # sklearn.externals.six is deprecated in scikit-learn 0.21
from sklearn.tree import export_graphviz

treeDataPath = './tree.pdf'

dot_data = StringIO()
export_graphviz(mTree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(dataX),
                class_names=['absent', 'present'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf(treeDataPath)



True
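
As a text-only alternative to the PDF (useful when Graphviz is not installed), `sklearn.tree.export_text`, available from scikit-learn 0.21, prints the same kind of split rules. This is a minimal sketch on toy data; the toy rows and the `toy_tree` name are assumptions and are not part of the assignment.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (an assumption for illustration): the label simply copies `alone`.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Render the fitted splits as indented text instead of a Graphviz diagram.
rules = export_text(toy_tree, feature_names=['alone', 'signUpOnline'])
print(rules)
```

In the notebook, the same call would presumably take `mTree` and `list(dataX)` in place of the toy tree and toy feature names.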

## Step 5 - Discussion
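With its default parameters, scikit-learn's decision tree can grow until it fits the training data completely, so in the resulting tree every distinct data row may end up in its own node. Setting other parameters (for example `min_samples_split`, `max_depth`, `max_features`, `max_leaf_nodes`) pushes the tree toward fitting the training data less closely. In this assignment every row can be labeled exactly by a clear set of rules, so the decision trees with adjusted parameters all scored lower accuracy on the test data than the decision tree with default parameters.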

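In the diagram of the untuned decision tree (`tree.pdf`), the `value` of every leaf node has a 0 on one side, which shows that the tree has found, on its own, a set of rules that separates the labels of the training data without any error.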
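
Both observations can be checked in code rather than by reading the PDF. The sketch below is an illustration under stated assumptions: it uses a small stand-in data set with a simplified deterministic rule (not the notebook's data or full rule) and compares training accuracy only.

```python
import itertools
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (an assumption): every combination of three features,
# labeled by a simplified deterministic rule.
X = list(itertools.product([0, 1], [1, 2, 3, 4, 5], [0, 1]))
y = [int(student == 1 or (interested >= 3) == (alone == 0))
     for student, interested, alone in X]

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Leaves are the nodes without children (children_left == -1).
is_leaf = full_tree.tree_.children_left == -1
leaves_pure = bool((full_tree.tree_.impurity[is_leaf] < 1e-9).all())

print('all leaves pure:', leaves_pure)
print('default tree, training accuracy:', full_tree.score(X, y))
print('max_depth=1 tree, training accuracy:', shallow_tree.score(X, y))
```

A pure leaf (impurity 0) is the code equivalent of a leaf whose `value` has a 0 on one side in the diagram.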

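scikit-learn's decision tree uses `gini` as its default criterion, so the tree first picks as its root the column whose split leaves the Gini impurity closest to 0 (the split that separates the labels most cleanly), which here is `alone`. When I designed the rules, I did reason that someone who signs up with companions (`alone`=0) has probably made plans with friends and is unlikely to sign up and then not show up, but I ended up placing `alone` in the later part of the rule checks. The cell below applies the original rules to the test data and prints the number of rows whose label is finally settled by `alone`, followed by the numbers settled by the other branches. The `alone` checks settle 305 rows, one of the largest groups (only the `interested >= 4` branch, with 403 rows, settles more), so `alone` separates the labels for a large share of the data and its place at the root of the decision tree is reasonable.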

In [5]:
dataNumSplitByAlone = 0
dataNum1 = 0
dataNum2 = 0
dataNum3 = 0
dataNum4 = 0
dataNum5 = 0
testAttendanceList = []
for index, row in testDf.iterrows():
    if(row['isStudent'] == 1):
        if(row['mayNotGraduate'] == 5):
            testAttendanceList.append(1)
            dataNum1 += 1
        else:
            if(row['interested'] >= 4):
                testAttendanceList.append(1)
                dataNum2 += 1
            else:
                if(row['alone'] == 0):
                    testAttendanceList.append(1)
                    dataNumSplitByAlone += 1
                else:
                    if(row['signUpOnline'] == 0):
                        testAttendanceList.append(1)
                        dataNum3 += 1
                    else:
                        testAttendanceList.append(0)
                        dataNum4 += 1
    else:
        if(row['signUpOnline'] == 0):
            testAttendanceList.append(1)
            dataNum5 += 1
        else:
            dataNumSplitByAlone += 1
            if(row['interested'] >= 3):
                if(row['alone'] == 0):
                    testAttendanceList.append(1)
                else:
                    testAttendanceList.append(0)
            else:
                if(row['alone'] == 0):
                    testAttendanceList.append(0)
                else:
                    testAttendanceList.append(1)
                    
print(dataNumSplitByAlone)
print(dataNum1)
print(dataNum2)
print(dataNum3)
print(dataNum4)
print(dataNum5)

305
146
403
20
76
50
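
For reference, the Gini impurity that the discussion above relies on can be computed by hand. This is a small standalone sketch of the standard formula; the helper names `gini` and `weighted_gini` are assumptions, and scikit-learn computes the same quantity internally rather than through functions like these.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity after a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([1, 1, 1, 1]))             # pure node
print(gini([0, 0, 1, 1]))             # evenly mixed two-class node
print(weighted_gini([0, 0], [1, 1]))  # split that separates the labels
```

For two classes the impurity ranges from 0 (pure) to 0.5 (evenly mixed), which is why a node with a value near 0.5 is the hardest to classify.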


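Below, the same decision tree is run again on test data whose per-column value proportions are closer to the training data, and the accuracy goes up. Presumably test data with similar proportions resembles the training data more closely.
After dropping duplicate rows, the test data with the lower accuracy (0.978) turns out to contain about twice as many distinct rows as the training data (147 versus 75), so it is reasonable that its accuracy is lower than that of the test data that resembles the training data.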

In [6]:
testLikeX = testLikeDf[['isStudent', 'mayNotGraduate', 'interested', 'alone', 'signUpOnline']]
testLikeY = testLikeDf[['attendance']]

predictLikeY = mTree.predict(testLikeX)

print('\naccuracy:')
print(accuracy_score(testLikeY, predictLikeY))


accuracy:
0.998


In [7]:
print('training data')
print(df.drop_duplicates().shape)
print('\ntest data')
print(testDf.drop_duplicates().shape)
print('\ntest data which are more like the training data')
print(testLikeDf.drop_duplicates().shape)

training data
(75, 6)

test data
(147, 6)

test data which are more like the training data
(96, 6)
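
Counting distinct rows shows how varied each file is, but not how many test rows the tree has never seen. One way to measure that directly is a set difference over feature tuples. The sketch below uses made-up rows, and the `unseen_rows` helper is an assumption for illustration.

```python
def unseen_rows(train_rows, test_rows):
    """Return the distinct test rows whose feature values never occur in training."""
    seen = {tuple(r) for r in train_rows}
    return sorted({tuple(r) for r in test_rows} - seen)

# Made-up rows: (isStudent, mayNotGraduate, interested, alone, signUpOnline)
train = [(1, 5, 5, 1, 1), (0, 1, 2, 0, 1), (1, 5, 5, 1, 1)]
test = [(1, 5, 5, 1, 1), (0, 1, 2, 1, 1), (0, 3, 4, 0, 0)]

print(unseen_rows(train, test))
```

With the notebook's data, the rows could be taken from the feature columns of `df` and `testDf`, for example via `itertuples(index=False)`.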


Print the rows of test-like.csv that were misclassified.

In [8]:
i = 0
count = 0
for index, row in testLikeDf.iterrows():
    if(row['attendance'] != predictLikeY[i]):
        print(row)
        count += 1
    i += 1
print('\nnumber of wrong prediction:')
print(count)

isStudent         0
mayNotGraduate    1
interested        2
alone             1
signUpOnline      1
attendance        1
Name: 260, dtype: int64
isStudent         0
mayNotGraduate    1
interested        1
alone             0
signUpOnline      0
attendance        1
Name: 927, dtype: int64

number of wrong prediction:
2
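
The same selection can be done without a manual counter by comparing the label column with the prediction array and using the result as a boolean mask. This is a sketch on a toy frame; `toy_df` and `toy_pred` are assumptions standing in for `testLikeDf` and `predictLikeY`.

```python
import numpy as np
import pandas as pd

# Toy frame and predictions (assumptions for illustration only).
toy_df = pd.DataFrame({'alone': [0, 1, 1, 0], 'attendance': [1, 0, 1, 1]})
toy_pred = np.array([1, 0, 0, 1])

# Element-wise comparison gives a boolean mask that selects the wrong rows.
wrong = toy_df[toy_df['attendance'].values != toy_pred]

print(wrong)
print('number of wrong predictions:', len(wrong))
```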


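Rewriting the decision tree diagram as if-else rules and running them on the test data shows that the node with the most misclassifications is the same node whose Gini value in the diagram is closest to 0.5 (the node that is least able to separate the labels).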

In [9]:
wrong1 = 0
wrong2 = 0
wrong3 = 0
wrong4 = 0
wrong5 = 0
wrong6 = 0
wrong7 = 0
wrong8 = 0
wrong9 = 0
num1 = 0
num5 = 0
num6 = 0
for index, row in testDf.iterrows():
    if(row['alone']<=0.5):
        if(row['isStudent']<=0.5):
            if(row['interested']<=2.5):
                num1 += 1
                if(row['attendance']!=0):
                    wrong1 += 1
            else:
                if(row['attendance']!=1):
                    wrong2 += 1
        else:
            if(row['attendance']!=1):
                wrong3 += 1
    else:
        if(row['interested']<=3.5):
            if(row['mayNotGraduate'] <= 4.5):
                if(row['signUpOnline']<=0.5):
                    if(row['attendance']!=1):
                        wrong4 += 1
                else:
                    num5+=1
                    if(row['attendance']!=0):
                        wrong5 += 1
            else:
                num6 += 1
                if(row['attendance']!=1):
                    wrong6 += 1
        else:
            if(row['isStudent']<=0.5):
                if(row['signUpOnline']<=0.5):
                    if(row['attendance']!=1):
                        wrong7 += 1
                else:
                    if(row['attendance']!=0):
                        wrong8 += 1
            else:
                if(row['attendance']!=1):
                    wrong9 += 1

print(wrong1)
print(wrong2)
print(wrong3)
print(wrong4)
print(wrong5)
print(wrong6)
print(wrong7)
print(wrong8)
print(wrong9)
print('---')
print(num1)
print(num5)
print(num6)
print('---')
print(wrong1/num1)
print(wrong5/num5)
print(wrong6/num6)

3
0
0
0
17
2
0
0
0
---
20
102
21
---
0.15
0.16666666666666666
0.09523809523809523
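
Transcribing the diagram by hand is easy to get wrong. As an alternative, `DecisionTreeClassifier.apply` returns the index of the leaf each row reaches, so errors can be counted per leaf directly from the fitted tree. This is a sketch on toy data; the arrays and the `toy_tree` name are assumptions.

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption): the test set contains one combination, [1, 0],
# that never appears in training.
X_train = [[0, 0], [0, 1], [1, 1]]
y_train = [0, 0, 1]
X_test = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_test = [0, 0, 0, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
leaf_ids = toy_tree.apply(X_test)  # leaf index reached by each test row
pred = toy_tree.predict(X_test)

visits = Counter(leaf_ids)
errors = Counter(leaf for leaf, p, t in zip(leaf_ids, pred, y_test) if p != t)
for leaf in sorted(visits):
    print('leaf %d: %d rows, %d wrong' % (leaf, visits[leaf], errors[leaf]))
```

The leaf indices printed here match the node numbering that `export_graphviz` shows when called with `node_ids=True`.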


## Step 6 - Other Classifiers
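Below, grid search is used to look for suitable parameters for three classifiers: a random forest classifier, an AdaBoost classifier, and a support vector classifier. The training data is split into 5 folds to evaluate the accuracy of each parameter combination.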

In [10]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

mRFC = RandomForestClassifier()
mABC = AdaBoostClassifier()
mSVC = SVC()

m_classifiers = {
    'Random Forest Classifier': {
        'clf': mRFC,
        'tuned_parameters': [{
            'n_estimators': [20, 50, 100, 200, 300],
            'max_depth': [3, 5, 8],
            'max_leaf_nodes': [5, 10, 30, 50],
        }],
    },
    'AdaBoost Classifier': {
        'clf': mABC,
        'tuned_parameters': [{
            'n_estimators': [30, 50, 100],
            'algorithm': ['SAMME', 'SAMME.R'],
        }],
    },
    'Support Vector Classifier': {
        'clf': mSVC,
        'tuned_parameters': [{
            'C': [0.01, 0.1, 1, 10, 100],
            'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'coef0': [0.01, 0.1, 1, 10, 100],
            'gamma': ['auto', 'scale'],
        }],
    },
}

for clf_key in m_classifiers.keys():
    print('\n=============== %s ===============' % (clf_key))
    clf = GridSearchCV(
        m_classifiers[clf_key]['clf'],
        m_classifiers[clf_key]['tuned_parameters'],
        cv=5,
        scoring='accuracy')
    clf.fit(dataX, dataY.values.ravel())
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
        
    print("Best parameters set found on development set:", end='')
    print(clf.best_params_)
        
    predictY = clf.predict(testX)
    print('\ntesting accuracy:', end='')
    print(accuracy_score(testY, predictY))


0.801 (+/-0.292) for {'max_depth': 3, 'max_leaf_nodes': 5, 'n_estimators': 20}
0.720 (+/-0.554) for {'max_depth': 3, 'max_leaf_nodes': 5, 'n_estimators': 50}
0.806 (+/-0.309) for {'max_depth': 3, 'max_leaf_nodes': 5, 'n_estimators': 100}
0.760 (+/-0.414) for {'max_depth': 3, 'max_leaf_nodes': 5, 'n_estimators': 200}
0.771 (+/-0.412) for {'max_depth': 3, 'max_leaf_nodes': 5, 'n_estimators': 300}
0.813 (+/-0.572) for {'max_depth': 3, 'max_leaf_nodes': 10, 'n_estimators': 20}
0.929 (+/-0.084) for {'max_depth': 3, 'max_leaf_nodes': 10, 'n_estimators': 50}
0.720 (+/-0.554) for {'max_depth': 3, 'max_leaf_nodes': 10, 'n_estimators': 100}
0.720 (+/-0.554) for {'max_depth': 3, 'max_leaf_nodes': 10, 'n_estimators': 200}
0.764 (+/-0.401) for {'max_depth': 3, 'max_leaf_nodes': 10, 'n_estimators': 300}
0.729 (+/-0.568) for {'max_depth': 3, 'max_leaf_nodes': 30, 'n_estimators': 20}
0.720 (+/-0.554) for {'max_depth': 3, 'max_leaf_nodes': 30, 'n_estimators': 50}
0.749 (+/-0.451) for {'max_depth': 3, 

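The best parameters found by the grid search above are used to fit each classifier again, and the resulting accuracy on the test data is printed. Each accuracy is >= the accuracy of the corresponding classifier without parameter tuning.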

In [14]:
clf = RandomForestClassifier(max_depth=3, max_leaf_nodes=10, n_estimators=50)
clf.fit(dataX, dataY.values.ravel())
predictY = clf.predict(testX)
print('\nRandom forest classifier - testing accuracy using best params:', end='')
print(accuracy_score(testY, predictY))

clf = AdaBoostClassifier(algorithm='SAMME', n_estimators=50)
clf.fit(dataX, dataY.values.ravel())
predictY = clf.predict(testX)
print('\nAdaBoost classifier - testing accuracy using best params:', end='')
print(accuracy_score(testY, predictY))

clf = SVC(C=0.01, coef0=100, gamma='scale', kernel='poly')
clf.fit(dataX, dataY.values.ravel())
predictY = clf.predict(testX)
print('\nSupport vector classifier - testing accuracy using best params:', end='')
print(accuracy_score(testY, predictY))


Random forest classifier - testing accuracy using best params:0.944

AdaBoost classifier - testing accuracy using best params:0.953

Support vector classifier - testing accuracy using best params:0.976
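
Retyping the best parameters by hand is not strictly necessary: with the default `refit=True`, `GridSearchCV` keeps a model refit on the whole training set in `best_estimator_`. This is a minimal sketch on toy data; the data and the one-parameter grid are assumptions, not the grids used in this notebook.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption): 40 rows whose label copies the first feature.
X = [[i % 2, (i // 2) % 2] for i in range(40)]
y = [row[0] for row in X]

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [1, 2]}, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_, search.best_score_)
# best_estimator_ was refit on all of X with best_params_, so it can be
# used for prediction without constructing the classifier again.
print(search.best_estimator_.predict([[1, 0], [0, 1]]))
```

For classifiers with randomness, such as a random forest, a refit model and a freshly constructed one can differ slightly unless `random_state` is fixed.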


### Appendix
- sklearn version
- Check that the labels of the test data are correct

In [12]:
import sklearn
print(sklearn.__version__)

0.21.3


In [13]:
# check if test data follow the rule

testAttendanceList = []
for index, row in testDf.iterrows():
    if(row['isStudent'] == 1):
        if(row['mayNotGraduate'] == 5):
            testAttendanceList.append(1)
        else:
            if(row['interested'] >= 4):
                testAttendanceList.append(1)
            else:
                if(row['alone'] == 0):
                    testAttendanceList.append(1)
                else:
                    if(row['signUpOnline'] == 0):
                        testAttendanceList.append(1)
                    else:
                        testAttendanceList.append(0)
    else:
        if(row['signUpOnline'] == 0):
            testAttendanceList.append(1)
        else:
            if(row['interested'] >= 3):
                if(row['alone'] == 0):
                    testAttendanceList.append(1)
                else:
                    testAttendanceList.append(0)
            else:
                if(row['alone'] == 0):
                    testAttendanceList.append(0)
                else:
                    testAttendanceList.append(1)
                    
i=0
diff=0
for index, row in testDf.iterrows():
    if(row['attendance'] != testAttendanceList[i]):
        diff += 1
    i+=1
    
print('Number of labels which are different from the correct labels (generated by the rules):')
print(diff)

Number of labels which are different from the correct labels (generated by the rules):
0
