# Classification Problem Code Tutorial

<b><u>[Purpose]</u></b>
- Approach a complex Regression Problem by converting it into a simpler Classification Problem
- Use a DecisionTree to extract Rules
- A DecisionTree is simple but has strong explanatory power

<b><u>[Process]</u></b>
- Data Path = 'https://github.com/GonieAhn/Data-Science-online-course-from-gonie/tree/main/Data%20Store'
- Define X's & Y
- Transform Y (Numeric --> Category)
    - Why? --> Think it through yourself
- Split Train & Valid data set
- Modeling
- Interpretation

In [1]:
import os
import gc
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from collections import Counter

In [2]:
%%time
# Data Load 
data = pd.read_csv("../Data Store/TOY_DATA.csv")
print(">>>> Data Shape : {}".format(data.shape))

>>>> Data Shape : (3500, 357)
Wall time: 250 ms


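<b><u>[Data Selection]</u></b>
- Clean the data
- Regression Problem ==> Classification Problem: top 30% [Class 1] and bottom 30% [Class 0]
    - The reason is that what we ultimately want to know is the difference between the top n% and the bottom n%
    - Instead of a complex Regression problem, recast it as a Classification Problem that predicts 1 or 0
    - Then derive the Rules that lead to the top n% (to gain interpretability)
    - This discards data (the middle 40% here), so use it when plenty of data is available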
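
The labeling idea above can be sketched on synthetic data. This is only an illustration: the helper name `label_extremes` and the normally distributed `y` are assumptions, not part of the original notebook or of TOY_DATA.csv. It shows how the top and bottom 30% become Class 1 and Class 0, and how much data is dropped along the way.

```python
import numpy as np
import pandas as pd

def label_extremes(y, pct=30):
    """Keep only the top/bottom `pct` percent of y and label them 1/0.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    y = pd.Series(y).reset_index(drop=True)
    low = np.percentile(y, pct)           # lower cut-off
    high = np.percentile(y, 100 - pct)    # upper cut-off
    keep = (y >= high) | (y <= low)       # rows that survive the selection
    labels = (y[keep] >= high).astype(int)
    return labels, keep

rng = np.random.default_rng(0)
y = rng.normal(size=1000)                 # synthetic continuous target (assumption)
labels, keep = label_extremes(y, pct=30)

print("kept fraction :", keep.mean())
print("class counts  :", labels.value_counts().to_dict())
```

The rows where `keep` is False are the middle band that the classification view never sees, which is why the approach suits larger data sets.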

In [3]:
# Feature Selection
selc_col = ['Y', 'X1', 'X2', 'X3']
data = data[selc_col]
# Missing value dropping
data.dropna(inplace=True)
data.reset_index(inplace=True, drop=True)
print("Data Shape : {}".format(data.shape))

Data Shape : (3500, 4)


In [4]:
# Top 30% (Class 1) and bottom 30% (Class 0)
per_70 = np.percentile(data['Y'], 70)
per_30 = np.percentile(data['Y'], 30)
print(">>>> 70 Percentile : {}".format(per_70))
print(">>>> 30 Percentile : {}".format(per_30))

>>>> 70 Percentile : 0.821760391
>>>> 30 Percentile : 0.7269621028


In [5]:
# Data Selection
data = data[(data['Y'] >= per_70) | (data['Y'] <= per_30)]
data.reset_index(inplace=True, drop=True)
print('Data shape : {}'.format(data.shape))

Data shape : (2101, 4)


In [6]:
# Assign Class
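data['Label'] = 3  # sentinel; every remaining row is overwritten below
# Use .loc with a boolean mask instead of chained assignment, which newer pandas may silently ignore
data.loc[data['Y'] >= per_70, 'Label'] = 1
data.loc[data['Y'] <= per_30, 'Label'] = 0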
print("Unique Label : {}".format(set(data['Label']))) 

Unique Label : {0, 1}


In [7]:
Counter(data['Label'])

Counter({1: 1051, 0: 1050})

In [8]:
Y = data['Label']
X = data.drop(columns=['Y', 'Label'])
idx = list(range(X.shape[0]))
train_idx, valid_idx = train_test_split(idx, test_size=0.3, random_state=2021)
print(">>>> # of Train data : {}".format(len(train_idx)))
print(">>>> # of valid data : {}".format(len(valid_idx))) 

>>>> # of Train data : 1470
>>>> # of valid data : 631


<b><u>[Rule Extraction with a Decision Tree]</u></b>
- max_depth should not exceed 5; beyond 5, the Rule Extraction plot becomes very hard to read
    - There is a trade-off between accuracy and interpretability, so pick a reasonable balance by your own criteria
- Convert the .dot file to a .png file
    - Run <b>"dot file.dot -Tpng -o image.png"</b> in a terminal
- Caveats
    - When extracting Rules, the number of Samples matters as much as the GINI INDEX
    - A node with a very low GINI INDEX (low impurity, which is good) is not meaningful if it contains too few Samples (likely Overfitting)
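
To make the caveat about Gini versus sample count concrete, here is a minimal sketch that lists every leaf of a fitted tree with its impurity and sample count. The synthetic data from `make_classification` stands in for the notebook's X and Y, and the threshold `min_samples = 20` is an arbitrary illustrative value; both are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's X / Y (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, criterion="gini", random_state=0)
clf.fit(X_demo, y_demo)

tree = clf.tree_
is_leaf = tree.children_left == -1      # leaves have no children
min_samples = 20                        # illustrative threshold (assumption)

for node in np.where(is_leaf)[0]:
    n = int(tree.n_node_samples[node])  # training samples that reach this leaf
    gini = tree.impurity[node]          # Gini impurity of this leaf
    pred = clf.classes_[np.argmax(tree.value[node][0])]
    flag = "" if n >= min_samples else "  <-- few samples, treat with caution"
    print(f"leaf {node:3d} | class {pred} | gini {gini:.3f} | samples {n:4d}{flag}")
```

A leaf with Gini near 0 but only a handful of samples is a weak basis for a rule. If such leaves dominate, one option is to refit with the `min_samples_leaf` parameter of `DecisionTreeClassifier` so that small leaves are never created.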

In [9]:
# Parameter Searching ==> Depth 2 ~ 10
for i in range(2,11,1):
    print(">>>> Depth {}".format(i))

    model = DecisionTreeClassifier(max_depth=i, criterion='gini')
    model.fit(X.iloc[train_idx], Y.iloc[train_idx])

    # Train Acc
    y_pre_train = model.predict(X.iloc[train_idx])
    cm_train = confusion_matrix(Y.iloc[train_idx], y_pre_train)
    print("Train Confusion Matrix")
    print(cm_train)
    print("Train Acc : {}".format((cm_train[0,0] + cm_train[1,1])/cm_train.sum()))

    # Valid Acc
    y_pre_test = model.predict(X.iloc[valid_idx])
    cm_test = confusion_matrix(Y.iloc[valid_idx], y_pre_test)
    print("Valid Confusion Matrix")
    print(cm_test)
    print("Valid Acc : {}".format((cm_test[0,0] + cm_test[1,1])/cm_test.sum()))
    print("-----------------------------------------------------------------------")
    print("-----------------------------------------------------------------------")


>>>> Depth 2
Train Confusion Matrix
[[721  19]
 [ 40 690]]
Train Acc : 0.9598639455782313
Valid Confusion Matrix
[[291  19]
 [ 20 301]]
Valid Acc : 0.9381933438985737
-----------------------------------------------------------------------
-----------------------------------------------------------------------
>>>> Depth 3
Train Confusion Matrix
[[719  21]
 [ 16 714]]
Train Acc : 0.9748299319727891
Valid Confusion Matrix
[[290  20]
 [ 13 308]]
Valid Acc : 0.9477020602218701
-----------------------------------------------------------------------
-----------------------------------------------------------------------
>>>> Depth 4
Train Confusion Matrix
[[730  10]
 [  3 727]]
Train Acc : 0.991156462585034
Valid Confusion Matrix
[[297  13]
 [  3 318]]
Valid Acc : 0.9746434231378764
-----------------------------------------------------------------------
-----------------------------------------------------------------------
>>>> Depth 5
Train Confusion Matrix
[[738   2]
 [  1 729]]
Train Acc : 

In [12]:
# Accuracy rises as Depth increases, but Depth 5 is chosen to keep the extracted Rules readable
model = DecisionTreeClassifier(max_depth=5, criterion='gini')
model.fit(X.iloc[train_idx], Y.iloc[train_idx])

DecisionTreeClassifier(max_depth=5)

In [13]:
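# Rule Extraction --> generate the .dot file
# list(...) is used because some scikit-learn versions reject a pandas Index for feature_names
export_graphviz(model, out_file="./DT_RuleExtraction.dot", class_names=["Low", "High"],
                feature_names=list(X.columns), impurity=True, filled=True)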
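
If Graphviz is not installed, the rules can also be read as plain text. The sketch below uses scikit-learn's `export_text` on a small synthetic data set; the data and the shallower `max_depth=3` are assumptions chosen to keep the output short, and the feature names simply mirror the notebook's `X1`, `X2`, `X3`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the notebook's X / Y (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=1)
clf = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=1)
clf.fit(X_demo, y_demo)

# Each indented line is one split condition; lines starting with "class:" are leaves.
rules = export_text(clf, feature_names=["X1", "X2", "X3"])
print(rules)
```

With the notebook's own objects, the equivalent call would presumably be `export_text(model, feature_names=list(X.columns))`. The text view omits Gini and sample counts by default, so it complements the Graphviz plot and does not replace it.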