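- Medical assistance service
- Formulation (modeling) stage
    - Who will use this service (model)?
    - **Which explanatory variables should be selected?**
    - When the target users are doctors
    - When the target users are patients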
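
The audience question above determines which explanatory variables are usable: a patient-facing service can only ask for things a patient already knows, while a doctor-facing one can draw on clinical measurements. A minimal sketch of that idea, assuming a hypothetical `FEATURES_BY_AUDIENCE` mapping and `select_features` helper. The patient list uses the columns selected in cell `In [6]` of this notebook; the doctor list is only an illustrative guess drawn from the dataset's column names.

```python
# Hypothetical mapping from audience to the variables that audience can supply.
# The "doctor" list is an assumption for illustration, not the author's choice.
FEATURES_BY_AUDIENCE = {
    "patient": ["성별", "신장", "체중", "흡연여부", "연령", "혈액형", "직업"],
    "doctor": ["성별", "연령", "ODI", "PI", "PT", "골밀도", "디스크단면적", "수술기법"],
}


def select_features(columns, audience):
    """Return the audience's variables that actually exist in `columns`."""
    wanted = FEATURES_BY_AUDIENCE[audience]
    return [c for c in wanted if c in columns]


# Toy list of available columns (a subset of the real column names).
available = ["성별", "신장", "체중", "흡연여부", "연령", "혈액형", "직업", "ODI", "PI"]
patient_cols = select_features(available, "patient")
doctor_cols = select_features(available, "doctor")
print(patient_cols)
print(doctor_cols)
```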

In [1]:
import pandas as pd

In [2]:
df1 = pd.read_csv("examples/preprocessing.csv")
print(df1.shape)
df1.head(2)

(1894, 52)


,Unnamed: 0,환자ID,Large Lymphocyte,Location of herniation,ODI,가족력,간질성폐질환,고혈압여부,과거수술횟수,당뇨여부,...,Modic change,PI,PT,Seg Angle(raw),Vaccum disc,골밀도,디스크단면적,디스크위치,척추이동척도,척추전방위증
0,0,1PT,22.8,3,51.0,0.0,0,0,0,0,...,3,51.6,36.6,14.4,0,-1.01,2048.5,4,Down,0
1,1,2PT,44.9,4,26.0,0.0,0,0,0,0,...,0,40.8,7.2,17.8,0,-1.14,1753.1,4,Up,0


In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1894 entries, 0 to 1893
Data columns (total 52 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              1894 non-null   int64  
 1   환자ID                    1894 non-null   object 
 2   Large Lymphocyte        1894 non-null   float64
 3   Location of herniation  1894 non-null   int64  
 4   ODI                     462 non-null    float64
 5   가족력                     1843 non-null   float64
 6   간질성폐질환                  1894 non-null   int64  
 7   고혈압여부                   1894 non-null   int64  
 8   과거수술횟수                  1894 non-null   int64  
 9   당뇨여부                    1894 non-null   int64  
 10  말초동맥질환여부                1894 non-null   int64  
 11  빈혈여부                    1894 non-null   int64  
 12  성별                      1894 non-null   int64  
 13  스테로이드치료                 1894 non-null   int64  
 14  신부전여부                   1894 non-null   

In [4]:
df1.isnull().sum()

Unnamed: 0                   0
환자ID                         0
Large Lymphocyte             0
Location of herniation       0
ODI                       1432
가족력                         51
간질성폐질환                       0
고혈압여부                        0
과거수술횟수                       0
당뇨여부                         0
말초동맥질환여부                     0
빈혈여부                         0
성별                           0
스테로이드치료                      0
신부전여부                        0
신장                           0
심혈관질환                        0
암발병여부                        0
연령                           0
우울증여부                        0
입원기간                         0
입원일자                         0
종양진행여부                       0
직업                         415
체중                           0
퇴원일자                         0
헤모글로빈수치                      1
혈전합병증여부                      0
환자통증정도                       0
흡연여부                         0
통증기간(월)                      4
수술기법                        81
수술시간    

In [5]:
df1.columns

Index(['Unnamed: 0', '환자ID', 'Large Lymphocyte', 'Location of herniation',
       'ODI', '가족력', '간질성폐질환', '고혈압여부', '과거수술횟수', '당뇨여부', '말초동맥질환여부', '빈혈여부',
       '성별', '스테로이드치료', '신부전여부', '신장', '심혈관질환', '암발병여부', '연령', '우울증여부', '입원기간',
       '입원일자', '종양진행여부', '직업', '체중', '퇴원일자', '헤모글로빈수치', '혈전합병증여부', '환자통증정도',
       '흡연여부', '통증기간(월)', '수술기법', '수술시간', '수술실패여부', '수술일자', '재발여부', '혈액형',
       '전방디스크높이(mm)', '후방디스크높이(mm)', '지방축적도', 'Instability', 'MF + ES',
       'Modic change', 'PI', 'PT', 'Seg Angle(raw)', 'Vaccum disc', '골밀도',
       '디스크단면적', '디스크위치', '척추이동척도', '척추전방위증'],
      dtype='object')

In [6]:
# Select variables (based on who the service is for)
df2 = df1[["성별", "신장", "체중", "흡연여부", "연령", "혈액형", "직업", "재발여부"]]
df3 = df2.dropna() # drop rows with missing values
print(df3.shape)

(1479, 8)
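
The drop from 1894 to 1479 rows (415 rows) equals the 415 missing values that `df1.isnull().sum()` reports for `직업`, which suggests that this one column accounts for the rows removed. A sketch of how to check this kind of thing, using a small made-up frame rather than the real data:

```python
import pandas as pd

# Toy data (invented values) with missing entries only in "직업".
toy = pd.DataFrame({
    "연령": [66, 47, 39, 40],
    "직업": ["자영업", None, "학생", None],
    "재발여부": [0, 0, 1, 0],
})

# Missing values per column, and how many rows dropna() would remove.
missing_per_column = toy.isnull().sum()
rows_dropped = len(toy) - len(toy.dropna())
print(missing_per_column)
print(rows_dropped)
```

When one column is responsible for most of the loss, it may be worth weighing whether to drop that column or fill it with an explicit "unknown" category instead of discarding the rows.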


- Set the explanatory variables and the target variable

In [7]:
# Dummy-variable encoding (string columns -> One Hot Encoding)
X = df3.drop(columns="재발여부")
X1 = pd.get_dummies(X)

In [8]:
X1

,성별,신장,체중,흡연여부,연령,혈액형_RH+A,혈액형_RH+AB,혈액형_RH+B,혈액형_RH+O,직업_건설업,...,직업_사무직,직업_사업가,직업_예술가,직업_운동선수,직업_운수업,직업_의료직,직업_자영업,직업_주부,직업_특수전문직,직업_학생
0,2,163,60.3,0,66,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,1,171,71.7,0,47,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,1,178,77.1,0,39,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,174,74.2,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,1,183,80.7,0,42,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1889,2,157,64.0,0,59,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1890,2,157,59.0,0,42,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
1891,1,167,70.0,0,61,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1892,1,177,77.0,0,29,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
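
By default `pd.get_dummies` expands only string (object) and categorical columns, so numeric codes such as `성별` (1/2) pass through unchanged while `혈액형` and `직업` become one indicator column per category. A small sketch on invented rows:

```python
import pandas as pd

# Toy frame: one numeric code column and one string column.
toy = pd.DataFrame({
    "성별": [2, 1, 1],
    "혈액형": ["RH+A", "RH+B", "RH+A"],
})

# Only the string column is expanded; new names are "<column>_<category>".
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```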


In [11]:
Y = df3["재발여부"]

In [19]:
Y.value_counts()

0    1302
1     177
Name: 재발여부, dtype: int64
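
These counts show a clear class imbalance. As a quick worked calculation (the counts are copied from the output above), the positive class is roughly 12% of the data, so a model that always predicts 0 would already reach about 88% accuracy:

```python
import pandas as pd

# Class counts copied from Y.value_counts() above.
counts = pd.Series({0: 1302, 1: 177})

positive_share = counts.loc[1] / counts.sum()              # share of recurrences
majority_baseline_accuracy = counts.max() / counts.sum()   # always predict 0
print(round(positive_share, 3))
print(round(majority_baseline_accuracy, 3))
```

This is why accuracy alone is a weak yardstick here and why per-class recall and F1 are more informative.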

In [13]:
# Import the machine learning algorithms and utilities
from sklearn.model_selection import train_test_split
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import MinMaxScaler
from sklearn.tree            import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics         import classification_report
from sklearn.tree            import plot_tree

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X1, Y, test_size = 0.3,
                                                    random_state= 1234)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(1035, 26)
(1035,)
(444, 26)
(444,)
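
The split above does not pass `stratify`, so the share of positives in each part depends on the random seed. With an imbalanced target, one common option is `stratify=Y`, which keeps the class ratio the same in both parts. A sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 10% positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy preserves the 10% positive rate in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1234, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

Applying this to the notebook's own split would change the train/test rows, so the scores reported further down would differ.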


In [16]:
pipe_list = [("scaler", MinMaxScaler()), ("model", DecisionTreeClassifier())]
pipe_model = Pipeline(pipe_list)
pipe_model

In [20]:
hyper_list = {"model__max_depth":range(2, 10),
              "model__min_samples_leaf":range(2, 10),
              "model__criterion":["gini","entropy"],
              "model__class_weight":[None,"balanced"],
              "model__min_samples_split":range(2, 10)}

grid_model = GridSearchCV(pipe_model, param_grid=hyper_list,
                          scoring="f1",
                          n_jobs= -1,
                          cv = 5)

grid_model.fit(X_train,Y_train)
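
To get a sense of the cost of this search, the number of candidates can be counted with `ParameterGrid`. The sketch below copies the grid from the cell above (ranges written out as lists) and multiplies by the 5 folds:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, with the ranges expanded into lists.
hyper_list = {
    "model__max_depth": list(range(2, 10)),
    "model__min_samples_leaf": list(range(2, 10)),
    "model__criterion": ["gini", "entropy"],
    "model__class_weight": [None, "balanced"],
    "model__min_samples_split": list(range(2, 10)),
}

n_candidates = len(ParameterGrid(hyper_list))  # 8 * 8 * 2 * 2 * 8
n_fits = n_candidates * 5                      # one fit per candidate per fold
print(n_candidates, n_fits)
```

Many of these candidates are likely redundant: a split must leave at least `min_samples_leaf` samples on each side, so a `min_samples_split` value at or below twice `min_samples_leaf` adds no further constraint.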

In [21]:
best_model = grid_model.best_estimator_

In [22]:
Y_train_pred = best_model.predict(X_train)
Y_test_pred = best_model.predict(X_test)

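- What happens when the target class we care about makes up only a small share of the data (here class 1 is 177 of 1479 rows, about 12%)?
    - Recall for that class drops sharply (0.23 on the training set and 0.09 on the test set in the reports below)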

In [25]:
print(classification_report(Y_train, Y_train_pred))

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       911
           1       0.93      0.23      0.36       124

    accuracy                           0.91      1035
   macro avg       0.92      0.61      0.66      1035
weighted avg       0.91      0.91      0.88      1035



In [26]:
print(classification_report(Y_test, Y_test_pred))

              precision    recall  f1-score   support

           0       0.89      0.99      0.94       391
           1       0.56      0.09      0.16        53

    accuracy                           0.88       444
   macro avg       0.72      0.54      0.55       444
weighted avg       0.85      0.88      0.84       444
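
To make the class 1 row of the test report concrete, the metrics can be recomputed by hand. The notebook does not print the confusion matrix, so the counts below are an assumption: they were chosen because they are consistent with the reported support (53) and the rounded scores, and the true counts may differ.

```python
# Assumed test-set counts for class 1 (not printed in the notebook).
tp = 5    # recurrences correctly flagged
fn = 48   # recurrences missed (tp + fn = support of 53)
fp = 4    # non-recurrences wrongly flagged

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Under these assumed counts the model flags only 9 of the 444 test patients as recurrences and misses 48 of the 53 real ones, which is what a recall of 0.09 means in practice.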

