# Project (3) load_breast_cancer: Diagnosing Breast Cancer

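- Third exercise: use the provided data to classify whether a patient's breast tumor is malignant or benign.
    - The features are numeric health measurements for each patient, and the label is binary: 0 (malignant) or 1 (benign).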


- The project uses the `load_breast_cancer` function from the `datasets` module of the Scikit-Learn library.

---
### (1) Import the required modules
---

In [1]:
from sklearn.datasets import load_breast_cancer  # loader for the exercise dataset
from sklearn.model_selection import train_test_split  # splits the data into train and test sets
from sklearn.metrics import classification_report  # per-class precision, recall, and F1 report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import pandas as pd  # pandas for tabular inspection
import numpy as np   # numpy for array operations
import matplotlib.pyplot as plt

---
### (2) Prepare the data
---

In [3]:
breast_cancer = load_breast_cancer()  # load the breast cancer dataset
type(breast_cancer)  # the result is a Bunch object, a dict-like container

sklearn.utils.Bunch

- Call `breast_cancer.keys()` to see what the `breast_cancer` object contains.

In [4]:
breast_cancer.keys()  # list the keys of breast_cancer

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
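
A `Bunch` behaves like a dictionary whose keys are also exposed as attributes, so either access style returns the same object. A minimal sketch (the variable names `data_by_key` and `data_by_attr` are illustrative only):

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

# Key-style and attribute-style access refer to the same underlying array.
data_by_key = breast_cancer["data"]
data_by_attr = breast_cancer.data
same_object = data_by_key is data_by_attr

print(same_object)
print(data_by_attr.shape)
```

The rest of this notebook uses the attribute style (`breast_cancer.data`, `breast_cancer.target`).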

* To inspect the data in `breast_cancer`, load it into a pandas DataFrame.

In [5]:
breast_cancer_df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
breast_cancer_df['label'] = breast_cancer.target
breast_cancer_df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0
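
As an alternative to building the DataFrame by hand, the loader can return one directly. This is a sketch that assumes a Scikit-Learn version with the `as_frame` option (the `frame` key listed above suggests it is available) and an installed pandas. Note that the label column is then named `target` rather than `label`.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks the loader to return pandas objects instead of NumPy arrays.
breast_cancer_frame = load_breast_cancer(as_frame=True).frame

print(breast_cancer_frame.shape)        # 30 feature columns plus the target column
print(breast_cancer_frame.columns[-1])  # name of the last column
```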


In [6]:
breast_cancer_df['label'].value_counts()  # number of samples per label value

1    357
0    212
Name: label, dtype: int64
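
The counts above correspond to roughly 62.7% of samples with label 1 and 37.3% with label 0. A short sketch of getting those fractions directly, assuming the same dataset (the names `labels` and `class_ratio` are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
labels = pd.Series(breast_cancer.target, name="label")

# normalize=True returns fractions of the total instead of raw counts.
class_ratio = labels.value_counts(normalize=True)
print(class_ratio.round(3))
```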

* The feature data of `breast_cancer` has shape 569x30: 569 samples with 30 features each.


* The label is binary: 0 (malignant) or 1 (benign).


* The features are:<br>['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

---
### (3) Understand the data
---

**1) Select the feature data**


In [9]:
breast_cancer_data = breast_cancer.data  # feature data
print(breast_cancer_data.shape)   # shape of the feature data
print(type(breast_cancer_data))   # object type of the feature data

(569, 30)
<class 'numpy.ndarray'>


**2) Select the label data**

In [10]:
breast_cancer_label = breast_cancer.target  # label data
print(breast_cancer_label.shape)     # shape of the label data
print(type(breast_cancer_label))     # object type of the label data

(569,)
<class 'numpy.ndarray'>


**3) Print the target names**

In [12]:
breast_cancer_target_names = breast_cancer.target_names
print(breast_cancer_target_names)

['malignant' 'benign']
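
The position of each name in `target_names` is the integer code used in `target`, so 0 means malignant and 1 means benign. A small sketch that makes the mapping explicit (the names `label_names` and `name_counts` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

# Indexing target_names with the integer labels turns codes into class names.
label_names = breast_cancer.target_names[breast_cancer.target]

# Count how many samples fall under each class name.
names, counts = np.unique(label_names, return_counts=True)
name_counts = {str(n): int(c) for n, c in zip(names, counts)}
print(name_counts)
```

This matches the `value_counts()` result earlier: 357 samples with label 1 (benign) and 212 with label 0 (malignant).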


**4) Print the dataset description**

In [13]:
print(breast_cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

---
### (4) Split into train and test data
---

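- Prepare the "questions" (features) and "answers" (labels) for training and testing the models.
    - Use the `train_test_split` function to shuffle the data and split it into training and test sets at a ratio of 8:2.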


In [15]:
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=2222)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(455, 30) (455,)
(114, 30) (114,)


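- The result of the cell below shows that the test set contains both classes in roughly the same proportion as the full dataset (44:70 in the test set versus 212:357 overall).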

In [16]:
unique, counts = np.unique(y_test, return_counts=True)
dict(zip(unique, counts))


{0: 44, 1: 70}
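
The split above relies on random shuffling, so the class ratio in the test set is only approximately preserved. As a hedged variation, not what this notebook did, `train_test_split` accepts a `stratify` argument that keeps the class ratio of the labels in both subsets:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y asks the splitter to keep the class ratio of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2222, stratify=y
)

overall_ratio = np.mean(y == 1)      # fraction of benign samples overall
test_ratio = np.mean(y_test == 1)    # fraction of benign samples in the test set
print(round(overall_ratio, 3), round(test_ratio, 3))
```

Because this produces a different split, scores obtained with it would not be directly comparable to the ones reported later in this notebook.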

---
### (5) Train several models
---

- 1) Try a **Decision Tree**<br>
- 2) Try a **Random Forest**<br>
- 3) Try an **SVM**<br>
- 4) Try an **SGD Classifier**<br>
- 5) Try **Logistic Regression**<br>

In [17]:
# Import the models

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

In [18]:
# Store each model in its own variable

decision_tree = DecisionTreeClassifier(random_state=2222)
random_forest = RandomForestClassifier(random_state=2222)
svm_model = svm.SVC()
sgd_model = SGDClassifier()
logistic_model = LogisticRegression()

In [19]:
# Fit each model on the training set

decision_tree.fit(X_train, y_train)
random_forest.fit(X_train, y_train)
svm_model.fit(X_train, y_train)
sgd_model.fit(X_train, y_train)
logistic_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
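
The warning above comes from `LogisticRegression` hitting its iteration limit, and it suggests either raising `max_iter` or scaling the data. The following is a minimal sketch of both, assuming the same split as in this notebook. The pipeline name `scaled_logistic` and the value `max_iter=1000` are illustrative choices, not part of the original run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2222
)

# The scaler is fitted on the training data only, then reused on the test data.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_train, y_train)

test_accuracy = scaled_logistic.score(X_test, y_test)
print(test_accuracy)
```

The resulting score will generally differ from the unscaled Logistic Regression result reported in section (6).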

In [20]:
# Predict on the test set with each trained model

prediction_decision_tree = decision_tree.predict(X_test)
prediction_random_forest = random_forest.predict(X_test)
prediction_svm_model = svm_model.predict(X_test)
prediction_sgd_model = sgd_model.predict(X_test)
prediction_logistic_model = logistic_model.predict(X_test)

---
### (6) Evaluate the models
---

- **1) Using `accuracy_score` from sklearn.metrics**

In [22]:
print(accuracy_score(y_test, prediction_decision_tree))
print(accuracy_score(y_test, prediction_random_forest))
print(accuracy_score(y_test, prediction_svm_model))
print(accuracy_score(y_test, prediction_sgd_model))
print(accuracy_score(y_test, prediction_logistic_model))
# print(accuracy_score(y_test, [1]*len(prediction_logistic_model)))

0.8947368421052632
0.9298245614035088
0.9035087719298246
0.8771929824561403
0.9298245614035088
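
The five numbers above are printed in the order Decision Tree, Random Forest, SVM, SGD Classifier, Logistic Regression. One way to make the output self-describing is to keep the models in a dictionary keyed by name. This is a sketch covering three of the five models; the `models` and `scores` dictionaries are hypothetical helpers, and the exact scores may vary with the Scikit-Learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2222
)

# Hypothetical helper structure: display name -> unfitted model.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=2222),
    "Random Forest": RandomForestClassifier(random_state=2222),
    "SVM": SVC(),
}

# Fit each model and record its accuracy on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```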


- **2) Using `classification_report` from sklearn.metrics**

In [23]:
print(classification_report(y_test, prediction_decision_tree), end='------------------------------------------------------------\n\n')
print(classification_report(y_test, prediction_random_forest), end='------------------------------------------------------------\n\n')
print(classification_report(y_test, prediction_svm_model), end='------------------------------------------------------------\n\n')
print(classification_report(y_test, prediction_sgd_model), end='------------------------------------------------------------\n\n')
print(classification_report(y_test, prediction_logistic_model), end='------------------------------------------------------------\n\n')

              precision    recall  f1-score   support

           0       0.94      0.77      0.85        44
           1       0.87      0.97      0.92        70

    accuracy                           0.89       114
   macro avg       0.91      0.87      0.88       114
weighted avg       0.90      0.89      0.89       114
------------------------------------------------------------

              precision    recall  f1-score   support

           0       1.00      0.82      0.90        44
           1       0.90      1.00      0.95        70

    accuracy                           0.93       114
   macro avg       0.95      0.91      0.92       114
weighted avg       0.94      0.93      0.93       114
------------------------------------------------------------

              precision    recall  f1-score   support

           0       1.00      0.75      0.86        44
           1       0.86      1.00      0.93        70

    accuracy                           0.90       114
   mac

- **3) Using `confusion_matrix` from sklearn.metrics**

In [24]:
print(confusion_matrix(y_test, prediction_decision_tree), end='\n\n')
print(confusion_matrix(y_test, prediction_random_forest), end='\n\n')
print(confusion_matrix(y_test, prediction_svm_model), end='\n\n')
print(confusion_matrix(y_test, prediction_sgd_model), end='\n\n')
print(confusion_matrix(y_test, prediction_logistic_model))

[[34 10]
 [ 2 68]]

[[36  8]
 [ 0 70]]

[[33 11]
 [ 0 70]]

[[41  3]
 [11 59]]

[[40  4]
 [ 4 66]]
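
In each matrix, rows are the true labels and columns are the predicted labels, both ordered 0 (malignant), then 1 (benign). As a worked check, the Random Forest matrix (the second one above) reproduces the accuracy and the class-0 precision and recall reported earlier:

```python
# Random Forest confusion matrix from the output above.
# Rows: true label (0, 1). Columns: predicted label (0, 1).
matrix = [[36, 8],
          [0, 70]]

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]  # diagonal = correctly classified samples

accuracy = correct / total
# Recall for class 0: correctly predicted malignant / all truly malignant.
recall_malignant = matrix[0][0] / sum(matrix[0])
# Precision for class 0: correctly predicted malignant / all predicted malignant.
precision_malignant = matrix[0][0] / (matrix[0][0] + matrix[1][0])

print(round(accuracy, 4), round(recall_malignant, 2), round(precision_malignant, 2))
# prints: 0.9298 0.82 1.0
```

The 8 in the top-right cell counts malignant samples that the model predicted as benign.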


---
### (7) Conclusion
---

- `Project (3) load_breast_cancer: Diagnosing Breast Cancer` is a **supervised learning classification problem**: given 569 samples with 30 features each, predict whether or not a tumor is cancerous.


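- I split the 569 samples into training and test sets at a ratio of 8:2 with the `train_test_split` function. To check how skewed the labels are, I used the pandas `value_counts()` method, as shown in the figure below.<br> **※ This confirmed that the data is not severely imbalanced (357 samples with label 1 versus 212 with label 0).**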


![image](https://user-images.githubusercontent.com/103712369/165586850-bbd88d7a-bb33-477b-9d00-65937f240ad5.png)


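- The Scikit-Learn evaluation functions covered in this node were `accuracy_score`, `classification_report`, and `confusion_matrix`. Precision and recall, the most widely used metrics derived from the confusion matrix, matter most when the classes are imbalanced. Because the classes here are fairly evenly distributed, **I consider accuracy an appropriate metric for evaluating the models.**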


- The best results came from **RandomForestClassifier, an ensemble method, and LogisticRegression, a linear model**, each with an accuracy of about 93%.
