---
> # **Machine Learning Model Evaluation**
> ### **Model evaluation techniques**
> #### **1) Evaluation for classification**
>    - Accuracy = number of correctly classified samples / total number of samples
>    - Confusion matrix
>
> ![n0](오차행렬.PNG)
>
>    - Precision = $\frac{TP}{FP + TP}$ : the share of predicted positives that are true positives
>    - Recall = $\frac{TP}{FN + TP}$ : the share of actual positives that are true positives
>    - F1 Score = $\frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}$
---
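
As a minimal sketch of the formulas above, the snippet below computes accuracy, precision, recall, and F1 directly from confusion-matrix counts. The function name `metrics_from_counts` and the example counts are hypothetical and chosen only for illustration; they do not come from the dataset used later.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts, for illustration only
acc, prec, rec, f1 = metrics_from_counts(tp=40, fp=10, fn=60, tn=890)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

With these counts accuracy is high (0.93) while recall is only 0.40, which is the pattern to watch for when the positive class is rare.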

In [1]:
import numpy as np
import pandas as pd # tabular data preprocessing / statistical operations

import seaborn as sns # statistical visualization
import matplotlib as mpl # plot settings (global)
import matplotlib.pyplot as plt # plot settings (per figure)
import plotly.express as px # interactive plots
import scipy.stats as stats # statistical hypothesis testing

from sklearn.model_selection import train_test_split # train/test split
from sklearn.tree import DecisionTreeClassifier # classification model
from sklearn.metrics import accuracy_score # accuracy metric
from sklearn.metrics import classification_report # per-class performance report

mpl.rc('font',family='Malgun Gothic') # font with Korean glyph support

In [2]:
df1 = pd.read_csv("01_Data.csv")
df1.head()

Unnamed: 0,Index,Member_ID,Sales_Type,Contract_Type,Channel,Datetime,Term,Payment_Type,Product_Type,Amount_Month,Customer_Type,Age,Address1,Address2,State,Overdue_count,Overdue_Type,Gender,Credit_Rank,Bank
0,1,66758234,렌탈,일반계약,영업방판,2019-05-06,60,CMS,DES-1,96900,개인,42.0,경기도,경기도,계약확정,0,없음,여자,9.0,새마을금고
1,2,66755948,렌탈,교체계약,영업방판,2020-02-20,60,카드이체,DES-1,102900,개인,39.0,경기도,경기도,계약확정,0,없음,남자,2.0,현대카드
2,3,66756657,렌탈,일반계약,홈쇼핑/방송,2019-02-28,60,CMS,DES-1,96900,개인,48.0,경기도,경기도,계약확정,0,없음,여자,8.0,우리은행
3,4,66423450,멤버십,멤버십3유형,재계약,2019-05-13,12,CMS,DES-1,66900,개인,39.0,경기도,경기도,계약확정,0,없음,남자,5.0,농협회원조합
4,5,66423204,멤버십,멤버십3유형,재계약,2019-05-10,12,CMS,DES-1,66900,개인,60.0,경기도,경기도,기간만료,12,있음,남자,8.0,농협회원조합


---
### **1. Data preprocessing (handling missing values)**
---

In [3]:
df1['Overdue_Type'].value_counts()

없음    49110
있음     2191
Name: Overdue_Type, dtype: int64

In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51301 entries, 0 to 51300
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Index          51301 non-null  int64  
 1   Member_ID      51301 non-null  int64  
 2   Sales_Type     51301 non-null  object 
 3   Contract_Type  51301 non-null  object 
 4   Channel        51301 non-null  object 
 5   Datetime       51301 non-null  object 
 6   Term           51301 non-null  int64  
 7   Payment_Type   51301 non-null  object 
 8   Product_Type   51301 non-null  object 
 9   Amount_Month   51301 non-null  int64  
 10  Customer_Type  51299 non-null  object 
 11  Age            44329 non-null  float64
 12  Address1       51299 non-null  object 
 13  Address2       51299 non-null  object 
 14  State          51301 non-null  object 
 15  Overdue_count  51301 non-null  int64  
 16  Overdue_Type   51301 non-null  object 
 17  Gender         51301 non-null  object 
 18  Credit

In [5]:
df1_clean = df1.dropna().reset_index()
df1_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40647 entries, 0 to 40646
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          40647 non-null  int64  
 1   Index          40647 non-null  int64  
 2   Member_ID      40647 non-null  int64  
 3   Sales_Type     40647 non-null  object 
 4   Contract_Type  40647 non-null  object 
 5   Channel        40647 non-null  object 
 6   Datetime       40647 non-null  object 
 7   Term           40647 non-null  int64  
 8   Payment_Type   40647 non-null  object 
 9   Product_Type   40647 non-null  object 
 10  Amount_Month   40647 non-null  int64  
 11  Customer_Type  40647 non-null  object 
 12  Age            40647 non-null  float64
 13  Address1       40647 non-null  object 
 14  Address2       40647 non-null  object 
 15  State          40647 non-null  object 
 16  Overdue_count  40647 non-null  int64  
 17  Overdue_Type   40647 non-null  object 
 18  Gender

---
### **2. Declare the target variable (Y) and the explanatory variables (X)**
---

In [6]:
X = df1_clean[['Term','Amount_Month','Credit_Rank','Age']]
Y = df1_clean['Overdue_Type']

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40647 entries, 0 to 40646
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Term          40647 non-null  int64  
 1   Amount_Month  40647 non-null  int64  
 2   Credit_Rank   40647 non-null  float64
 3   Age           40647 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 1.2 MB


In [8]:
Y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 40647 entries, 0 to 40646
Series name: Overdue_Type
Non-Null Count  Dtype 
--------------  ----- 
40647 non-null  object
dtypes: object(1)
memory usage: 317.7+ KB


---
### **3. Split into training data and validation (test) data**
---

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3,
                                                   random_state=1234)
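
`Overdue_Type` is heavily imbalanced (2,191 '있음' rows out of 51,301 in the raw data), so a plain random split can leave the rare class slightly over- or under-represented in one partition. One common option, which the split above does not use, is the `stratify=` argument of `train_test_split`. The sketch below shows the effect on synthetic data; `X_toy`, `y_toy`, and the 5% positive rate are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (hypothetical): 950 negatives, 50 positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array(["없음"] * 950 + ["있음"] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1234, stratify=y_toy
)

# With stratify=, both partitions keep (approximately) the original positive rate
train_rate = np.mean(y_tr == "있음")
test_rate = np.mean(y_te == "있음")
print(train_rate, test_rate)
```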

In [10]:
print(X_train.shape)
X_train.info()

(28452, 4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28452 entries, 10668 to 27439
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Term          28452 non-null  int64  
 1   Amount_Month  28452 non-null  int64  
 2   Credit_Rank   28452 non-null  float64
 3   Age           28452 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 1.1 MB


In [11]:
print(Y_train.shape)
Y_train.info()

(28452,)
<class 'pandas.core.series.Series'>
Int64Index: 28452 entries, 10668 to 27439
Series name: Overdue_Type
Non-Null Count  Dtype 
--------------  ----- 
28452 non-null  object
dtypes: object(1)
memory usage: 444.6+ KB


In [12]:
print(X_test.shape)
X_test.info()

(12195, 4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12195 entries, 22350 to 23783
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Term          12195 non-null  int64  
 1   Amount_Month  12195 non-null  int64  
 2   Credit_Rank   12195 non-null  float64
 3   Age           12195 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 476.4 KB


In [13]:
print(Y_test.shape)
Y_test.info()

(12195,)
<class 'pandas.core.series.Series'>
Int64Index: 12195 entries, 22350 to 23783
Series name: Overdue_Type
Non-Null Count  Dtype 
--------------  ----- 
12195 non-null  object
dtypes: object(1)
memory usage: 190.5+ KB


---
### **4. Training**
---

In [14]:
model = DecisionTreeClassifier()
model.fit(X_train,Y_train)

---
### **5. Evaluation (accuracy)**
---

In [15]:
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)

In [16]:
accuracy_score(Y_train, Y_train_pred)

0.9729017292281738

In [17]:
accuracy_score(Y_test,Y_test_pred)

0.9603936039360393

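#### **Training performance**
- `support` is the number of samples in each class ('없음' = no overdue payment, '있음' = overdue payment)
- `macro avg` : the unweighted mean of the per-class scores; for the F1 score it is 0.70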

In [18]:
print(classification_report(Y_train, Y_train_pred))

              precision    recall  f1-score   support

          없음       0.97      1.00      0.99     27423
          있음       0.94      0.27      0.42      1029

    accuracy                           0.97     28452
   macro avg       0.96      0.63      0.70     28452
weighted avg       0.97      0.97      0.97     28452
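
The two averages in this report differ only in how the classes are weighted: `macro avg` treats both classes equally, while `weighted avg` weights each class by its support. A small sketch that recomputes both from the rounded per-class F1 scores printed above (because the inputs are already rounded, the results can differ from the report in the last digit):

```python
# Per-class F1 and support, copied from the (rounded) training report above
f1 = {"없음": 0.99, "있음": 0.42}
support = {"없음": 27423, "있음": 1029}

# Macro average: plain mean over classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by the number of samples
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"macro avg f1    = {macro_f1:.3f}")
print(f"weighted avg f1 = {weighted_f1:.3f}")
```

Since '없음' makes up most of the samples, the weighted average stays close to that class's score, while the macro average exposes the weak '있음' class.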



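#### **Generalization performance**
- Worse than the training performance: for the '있음' class, precision falls from 0.94 to 0.24 and recall from 0.27 to 0.10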

In [19]:
print(classification_report(Y_test, Y_test_pred))

              precision    recall  f1-score   support

          없음       0.97      0.99      0.98     11795
          있음       0.24      0.10      0.14       400

    accuracy                           0.96     12195
   macro avg       0.60      0.54      0.56     12195
weighted avg       0.95      0.96      0.95     12195
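
Accuracy alone can mislead on data this imbalanced. As a rough check that uses only the numbers already printed above (the test-set supports and the test accuracy), the sketch below compares the decision tree with a trivial classifier that always predicts the majority class '없음'. The variable names are hypothetical.

```python
# Class supports copied from the test-set report above
support_test = {"없음": 11795, "있음": 400}
model_accuracy = 0.9603936039360393  # accuracy_score(Y_test, Y_test_pred) from above

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = max(support_test.values()) / sum(support_test.values())

print(f"majority-class baseline = {baseline_accuracy:.4f}")
print(f"decision tree           = {model_accuracy:.4f}")
print("model beats baseline:", model_accuracy > baseline_accuracy)
```

By this comparison the tree's test accuracy does not exceed the majority-class baseline, which is why the per-class precision and recall are the more informative figures here.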



---
> ## **Feature Engineering**
> - #### **Method : prepare the data well + control the algorithm**
> - #### **When to use it : when evaluation results are poor after training**
> - #### **Clean up the data to suit the purpose of the training**
> - #### **Steps**
> #### 1. Scaling & Encoding
> - **Scaling : bring continuous numeric features onto a common scale**
>    - Standard Scaler : mean 0 / standard deviation 1
>    - Min Max Scaler : minimum 0 / maximum 1
>    - Robust Scaler : median 0 / IQR 1
> - **Encoding : convert categorical data into numeric form**
>    - Label Encoding : map each category to an integer
>    - One Hot Encoding : convert each category into a column of 0s and 1s
>
> ![n1](라벨.PNG) ![n1](원핫.PNG)
>
> #### 2. Imputation
> #### 3. Cross Validation
> #### 4. Hyper Parameter Tuning
> #### 5. Imbalanced Data Sampling
---
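
To make the scaling and encoding step concrete, here is a minimal sketch with scikit-learn's `StandardScaler`, `MinMaxScaler`, `RobustScaler`, `LabelEncoder`, and `OneHotEncoder`. The toy arrays `x` and `colors` are hypothetical and unrelated to the dataset in this notebook.

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder, OneHotEncoder,
)

# Toy numeric column with one outlier (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)   # mean 0, standard deviation 1
x_mm = MinMaxScaler().fit_transform(x)      # minimum 0, maximum 1
x_rob = RobustScaler().fit_transform(x)     # median 0, IQR 1

# Toy categorical column (hypothetical values)
colors = np.array(["red", "green", "blue", "green"])

# Label encoding: one integer per category (categories are sorted alphabetically)
labels = LabelEncoder().fit_transform(colors)

# One-hot encoding: one 0/1 column per category; the default output is sparse
onehot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

print(labels)
print(onehot)
```

The outlier (100) compresses the other values toward 0 under min-max scaling, whereas the robust scaler keeps them spread around the median; that difference is the usual reason to choose one over the other.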

---
> ## **Pipeline**
> - #### **A way to package data preprocessing as a program (service) so that it runs automatically before training**
---

In [20]:
!pip install imbalanced-learn



---
### **Libraries for building the pipeline**
---

In [21]:
from imblearn.pipeline import make_pipeline # builds a feature-engineering + training pipeline
from sklearn.compose import make_column_transformer # routes columns to pipes by type

from sklearn.impute import SimpleImputer        # 1. missing-value imputation
from sklearn.preprocessing import MinMaxScaler  # 2. numeric scaling
from sklearn.preprocessing import OneHotEncoder # 3. categorical encoding (category -> 1/0)

---
### **Numeric data pipe : numeric_pipe**
##### 1. Missing-value imputation : fill with the mean
##### 2. Scaling
---

In [22]:
numeric_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                             MinMaxScaler())
numeric_pipe

---
### **Categorical (string) data pipe : category_pipe**
##### 1. Missing-value imputation : fill with the most frequent value
##### 2. Encoding
---

In [23]:
category_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                              OneHotEncoder())
category_pipe

---
### **Separating the X, Y columns by type**
- ##### Numeric and string columns
---

In [24]:
X = df1[['Term','Amount_Month','Age','Credit_Rank','Gender','Product_Type']]
Y = df1['Overdue_Type']

In [25]:
X.describe().columns

Index(['Term', 'Amount_Month', 'Age', 'Credit_Rank'], dtype='object')

In [26]:
X.describe(include='object').columns

Index(['Gender', 'Product_Type'], dtype='object')

---
### **Pipe that routes numeric and categorical columns : preprocessing_pipe**
- ##### **Combines the numeric pipeline and the categorical pipeline**
---

In [27]:
number_list = X.describe().columns.tolist()
category_list = X.describe(include='object').columns.tolist()
preprocessing_pipe = make_column_transformer( (numeric_pipe, number_list),
                                              (category_pipe, category_list))
preprocessing_pipe
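
To see what such a column transformer produces, the sketch below runs the same kind of preprocessing on a four-row toy frame. The rows are hypothetical (only the column names are borrowed from this notebook), and scikit-learn's own `make_pipeline` is used in place of the imblearn one so the example needs no extra install.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline  # scikit-learn's version, not imblearn's
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with missing values (hypothetical rows)
toy = pd.DataFrame({
    "Term": [12, 60, 60, 12],
    "Age": [20.0, 40.0, np.nan, 60.0],
    "Gender": ["남자", "여자", np.nan, "여자"],
}).astype({"Gender": object})

num_pipe = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())
toy_pre = make_column_transformer((num_pipe, ["Term", "Age"]), (cat_pipe, ["Gender"]))

out = toy_pre.fit_transform(toy)
out = out.toarray() if hasattr(out, "toarray") else out  # densify if a sparse matrix comes back
print(out)
```

In the third row the missing `Age` is filled with the column mean (40) before scaling and the missing `Gender` with the most frequent value ('여자'), so the output has no missing values and one 0/1 column per gender.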

---
### **Preprocessing pipe + learning algorithm combined : model_pipe**
---

In [28]:
model_pipe = make_pipeline( preprocessing_pipe, DecisionTreeClassifier())
model_pipe

---
### **Running the pipeline**
---

In [29]:
model_pipe.fit(X,Y)
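
`model_pipe.fit(X,Y)` trains on every row, so no data is held out to evaluate the result. Cross validation is one way to estimate generalization for a pipeline; because the preprocessing lives inside the pipeline, it is refit on each training fold. A minimal sketch, assuming synthetic data from `make_classification` and scikit-learn's `make_pipeline` in place of the imblearn one (`X_demo`, `y_demo`, and `demo_pipe` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real features (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.95, 0.05], random_state=0
)

demo_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))

# Stratified folds keep the class ratio in every fold;
# macro F1 gives the rare class the same weight as the common one
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())
```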

---
### **Saving the whole trained model to a file**
---

In [30]:
import pickle # serializes Python objects to a file

In [31]:
with open('model.sav','wb') as f: # save as model.sav; the context manager closes the file
    pickle.dump(model_pipe, f)
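
A saved model is only useful if it can be loaded back. The sketch below shows the round trip with `pickle.load`, using a small stand-in classifier and a temporary directory so that it runs on its own; in this notebook the saved object would be `model_pipe`. The names `clf`, `restored`, and `path` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small stand-in model (hypothetical); in the notebook this would be model_pipe
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), "model.sav")
with open(path, "wb") as f:
    pickle.dump(clf, f)

# Load the model back and confirm it predicts exactly like the original
with open(path, "rb") as f:
    restored = pickle.load(f)

same = (restored.predict(X_demo) == clf.predict(X_demo)).all()
print(same)
```

Unpickling can execute arbitrary code, so load only files from a trusted source, and preferably with the same scikit-learn version that wrote them.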