## Classification Performance Metrics
 1. accuracy
 2. confusion matrix
 3. precision, recall
 4. f1 score
 5. ROC AUC

### 1. accuracy
- Number of correct predictions / total number of predictions
- Not a suitable metric when the label distribution is imbalanced

In [16]:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split

class MyFakeClassifier(BaseEstimator):
    def fit(self,X,y):
        pass
    def predict(self,X):
        return np.zeros(len(X))  # always predict 0, regardless of the input

In [17]:
digits=load_digits()
y=(digits.target==7).astype(int)  # 1 if the digit is 7, otherwise 0

In [22]:
X_train, X_test, y_train, y_test=train_test_split(digits.data,y,
                                                 test_size=0.2,random_state=11)

In [23]:
fakeclf=MyFakeClassifier()
fakeclf.fit(X_train,y_train)  # fit (does nothing)
pred=fakeclf.predict(X_test)  # predict (every sample is predicted as 0)
accuracy_score(y_test,pred)

0.9

### 2. confusion matrix

In [21]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,pred)

array([[405,   0],
       [ 45,   0]], dtype=int64)
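The matrix above can be unpacked into its four cells to see why the accuracy is misleading. This is a minimal sketch that hard-codes the matrix from the output above rather than recomputing it; it relies on scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (not recomputed here).
# scikit-learn layout: rows = actual class, columns = predicted class.
cm = np.array([[405, 0],
               [45, 0]])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

# Accuracy counts both kinds of correct prediction.
accuracy = (tp + tn) / cm.sum()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy = {accuracy:.2f}")
```

All 45 actual positives land in the FN cell and TP is 0, yet accuracy is still 405/450 because the negatives dominate.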

### 3. precision, recall
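- precision: of the samples predicted as 1, the fraction that are actually 1, i.e. TP / (TP + FP). In the example above there are no positive predictions, so it is undefined (0/0) and scikit-learn reports 0
- recall: of the samples that are actually 1, the fraction predicted as 1, i.e. TP / (TP + FN). In the example above it is 0
- Cases where precision matters more: spam filtering
- Cases where recall matters more: cancer diagnosis, insurance fraud detection
- Precision or recall can be raised by adjusting the classification threshold.
- Lowering the threshold produces more positive predictions, so recall rises (or stays the same) while precision usually falls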
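The threshold trade-off can be seen on a small example. The labels and scores below are made up for illustration (they are not from the Titanic data); the sketch thresholds the scores by hand and reports precision and recall at each cut-off.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground-truth labels and positive-class probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.38, 0.45, 0.55, 0.70, 0.80, 0.90])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict 1 when the score reaches the threshold.
    y_pred = (scores >= threshold).astype(int)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

Recall can only fall or stay level as the threshold rises. Precision tends to rise but is not guaranteed to: in this toy data it is higher at 0.5 than at 0.7, because one high-scoring negative remains among fewer predictions.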

In [101]:
# The Titanic data needs preprocessing before it can be fed to a model
import pandas as pd
df=pd.read_csv('titanic_train.csv')
df.head(3)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |


In [102]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [103]:
# drop columns that carry no useful information
df.drop(['PassengerId','Name','Ticket'],axis=1,inplace=True)

In [104]:
# label-encode Sex
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
encoder.fit(df['Sex'])
labels=encoder.transform(df['Sex'])
df['Sex']=labels

In [105]:
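# fill missing Age values with the mean
# (assigning the result back avoids inplace=True on a column selection, which recent pandas versions warn about)
df['Age']=df['Age'].fillna(df['Age'].mean())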

In [106]:
# fill missing Cabin and Embarked values with 'N'
df['Cabin']=df['Cabin'].fillna('N')
df['Embarked']=df['Embarked'].fillna('N')
# keep only the first character of Cabin
df['Cabin']=df['Cabin'].str[:1]

In [107]:
# label-encode Cabin and Embarked
features=['Cabin','Embarked']
for feature in features:
    le=LabelEncoder()
    le.fit(df[feature])
    df[feature]=le.transform(df[feature])

In [108]:
# preprocessing done
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int32  
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Cabin     891 non-null    int32  
 8   Embarked  891 non-null    int32  
dtypes: float64(2), int32(3), int64(4)
memory usage: 52.3 KB


In [109]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

y=df['Survived']
x=df.drop('Survived',axis=1)

X_train, X_test, y_train, y_test=train_test_split(x,y,test_size=0.2,random_state=11)

lr_clf=LogisticRegression(solver='liblinear')

lr_clf.fit(X_train, y_train)
pred=lr_clf.predict(X_test)
print(confusion_matrix(y_test,pred))
print(precision_score(y_test,pred))
print(recall_score(y_test,pred)) 
# Surprising result: I expected that dropping rows with a missing Age would score higher than filling them with the mean, but it did not.

[[108  10]
 [ 14  47]]
0.8245614035087719
0.7704918032786885


- Effect of adjusting the threshold

In [113]:
# predict_proba() returns the predicted probabilities themselves: [P(class 0), P(class 1)]
pred_proba=lr_clf.predict_proba(X_test)
pred_proba[:3]

array([[0.44935227, 0.55064773],
       [0.86335512, 0.13664488],
       [0.86429645, 0.13570355]])

In [119]:
# adjust the threshold with Binarizer
from sklearn.preprocessing import Binarizer
pred_proba_1=pred_proba[:,1].reshape(-1,1)
binarizer=Binarizer(threshold=0.4).fit(pred_proba_1)
predict=binarizer.transform(pred_proba_1)

print(precision_score(y_test,predict))
print(recall_score(y_test,predict))  # the threshold was lowered, so recall should go up

0.704225352112676
0.819672131147541
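For reference, `Binarizer` maps values strictly greater than the threshold to 1 and everything else to 0, so the same predictions can be produced with a plain comparison. The probabilities below are hypothetical values chosen to include one sitting exactly on the threshold.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical positive-class probabilities, shaped (n_samples, 1)
# because Binarizer expects 2D input.
proba_1 = np.array([[0.10], [0.40], [0.41], [0.55], [0.90]])
threshold = 0.4

# Binarizer: values > threshold become 1, values <= threshold become 0.
via_binarizer = Binarizer(threshold=threshold).fit_transform(proba_1)

# Same rule written as a direct comparison.
via_comparison = (proba_1 > threshold).astype(float)

print(via_binarizer.ravel())
print(np.array_equal(via_binarizer, via_comparison))
```

Note that a probability exactly equal to the threshold is classified as 0 under this rule.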


In [121]:
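# how precision and recall change as the threshold varies
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds=precision_recall_curve(y_test,pred_proba_1)
print(thresholds[:5])  # as the threshold increases, precision tends to increase and recall decreases
print(precisions[:5])
print(recalls[:5])

# Note on shapes: precisions and recalls each have one more element than thresholds,
# because the final point (precision=1, recall=0) has no corresponding threshold.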

[0.11573102 0.11636721 0.11819212 0.12102773 0.12349479]
[0.37888199 0.375      0.37735849 0.37974684 0.38216561]
[1.         0.98360656 0.98360656 0.98360656 0.98360656]
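The difference in array lengths returned by `precision_recall_curve` can be checked on a tiny example. The labels and scores here are made up; the exact number of thresholds may vary between scikit-learn versions, but the one-element difference does not.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores, small enough to inspect by eye.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls carry one extra trailing point with no threshold.
print(precisions.shape, recalls.shape, thresholds.shape)
print("last precision:", precisions[-1], "last recall:", recalls[-1])
```

When plotting against thresholds, the usual approach is to drop that last point, e.g. `precisions[:-1]` and `recalls[:-1]`.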


### 4. F1 score
- F1 = 2 * precision * recall / (precision + recall), i.e. the harmonic mean of precision and recall

In [128]:
from sklearn.metrics import f1_score
f1=f1_score(y_test,pred)
f1

0.7966101694915254
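As a cross-check, the same value can be derived by hand from the logistic regression confusion matrix shown earlier (TP=47, FP=10, FN=14). This is plain arithmetic and does not need scikit-learn.

```python
# Counts read off the confusion matrix [[108, 10], [14, 47]] shown earlier.
tp, fp, fn = 47, 10, 14

precision = tp / (tp + fp)  # 47 / 57
recall = tp / (tp + fn)     # 47 / 61

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Equivalent closed form in terms of the raw counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```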

### 5. ROC AUC
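- ROC: a curve showing how TPR changes as FPR changes
- AUC: the area under the ROC curve; the closer to 1, the better (0.5 corresponds to random guessing)
- FPR: of the samples that are actually 0, the fraction wrongly predicted as 1, i.e. FP / (FP + TN)
- TPR: of the samples that are actually 1, the fraction correctly predicted as 1, i.e. TP / (TP + FN), the same as recall

- With a threshold of 0, every sample is predicted positive, so both FPR and TPR are 1
- As the threshold is raised, a good classifier's FPR should drop quickly while its TPR drops slowly. In other words, the larger the area under the curve, the better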
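The behaviour described above can be traced by computing FPR and TPR by hand at a few thresholds. The labels, scores, and the helper `fpr_tpr` are made up for this sketch and are not part of scikit-learn.

```python
import numpy as np

# Hypothetical labels and positive-class scores.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

def fpr_tpr(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting 1 for scores >= threshold."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

for t in (0.0, 0.35, 0.65, 1.0):
    fpr, tpr = fpr_tpr(y_true, scores, t)
    print(f"threshold={t}: FPR={fpr:.3f}, TPR={tpr:.3f}")
```

At threshold 0 both rates are 1, and at a threshold above every score both are 0. In between, FPR falls to 1/3 and then 0 while TPR is still 1 and then 2/3, which is the "FPR drops faster than TPR" pattern of a reasonably good classifier.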

In [129]:
from sklearn.metrics import roc_curve
fprs, tprs, thresholds=roc_curve(y_test,pred_proba_1)
print(thresholds[:10])
print(fprs[:10])
print(tprs[:10])

[1.94326279 0.94326279 0.94040086 0.93261004 0.87778554 0.86565305
 0.72771397 0.68584875 0.64779432 0.63856712]
[0.         0.         0.         0.         0.         0.00847458
 0.00847458 0.01694915 0.01694915 0.02542373]
[0.         0.01639344 0.03278689 0.06557377 0.24590164 0.24590164
 0.49180328 0.49180328 0.63934426 0.63934426]


In [135]:
from sklearn.metrics import roc_auc_score
roc_score=roc_auc_score(y_test,pred_proba_1)
roc_score

0.8986524034454015
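`roc_auc_score` is the area under the points returned by `roc_curve`. A small sketch on made-up data confirms this by integrating the curve with `sklearn.metrics.auc`, which applies the trapezoidal rule.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical labels and positive-class scores.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

# Area computed from the ROC points themselves.
fprs, tprs, _ = roc_curve(y_true, scores)
area = auc(fprs, tprs)

# Area computed directly from labels and scores.
direct = roc_auc_score(y_true, scores)

print(f"auc(fprs, tprs) = {area:.4f}")
print(f"roc_auc_score   = {direct:.4f}")
```

Both values equal the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is 8 of 9 pairs in this toy data.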

### 6. A helper function for model evaluation

In [143]:
def get_clf_eval(y_test,pred=None,pred_proba=None):
    # pred: predicted labels; pred_proba: predicted probability of the positive class
    confusion=confusion_matrix(y_test,pred)
    accuracy=accuracy_score(y_test,pred)
    precision=precision_score(y_test,pred)
    recall=recall_score(y_test,pred)
    f1=f1_score(y_test,pred)
    roc_auc=roc_auc_score(y_test,pred_proba)
    print('Confusion matrix')
    print(confusion)
    print('Accuracy:{0:.4f}, Precision:{1:.4f}, Recall:{2:.4f}, F1:{3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

In [144]:
get_clf_eval(y_test,pred,pred_proba_1)

Confusion matrix
[[108  10]
 [ 14  47]]
Accuracy:0.8659, Precision:0.8246, Recall:0.7705, F1:0.7966, AUC:0.8987
