# Classification with **scikit-learn**

## Handwriting classification

- [Optical recognition of handwritten digits dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset)

In [1]:
#################
### Libraries ###
#################
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier, LogisticRegression

#################
### Load Data ###
#################
digits = load_digits()
digits_data = digits.data
digits_label = digits.target

print(digits.keys())
print('----------------------------------------------------------------------')
print('Attributes and methods:', dir(digits))
print('----------------------------------------------------------------------')
print('(number of samples, number of features):', digits_data.shape)
print('----------------------------------------------------------------------')
print('Names:', digits.target_names, ', values:', set(digits.target), '; number of labels:', digits_label.shape[0])
print('----------------------------------------------------------------------')
print('Label distribution:\n', pd.Series(digits_label).value_counts())
print('----------------------------------------------------------------------')
print('First sample:\n', digits.images[0])
print('----------------------------------------------------------------------')
print('Dataset description:', digits.DESCR)

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
----------------------------------------------------------------------
Attributes and methods: ['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']
----------------------------------------------------------------------
(number of samples, number of features): (1797, 64)
----------------------------------------------------------------------
Names: [0 1 2 3 4 5 6 7 8 9] , values: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} ; number of labels: 1797
----------------------------------------------------------------------
Label distribution:
 3    183
1    182
5    182
4    181
6    181
9    180
7    179
0    178
2    177
8    174
dtype: int64
----------------------------------------------------------------------
First sample:
 [[ 0.  0.  5. 13.  9.  1.  0.  0.]
 [ 0.  0. 13. 15. 10. 15.  5.  0.]
 [ 0.  3. 15.  2.  0. 11.  8.  0.]
 [ 0.  4. 12.  0.  0.  8.  8.  0.]
 [ 0.  5.  8.  0.  0.  9.  8.  0.]
 [ 0.  4. 11.  0.  1. 12.  7.  0.]
 [ 0.  2. 1

In [2]:
##################
### Split Data ###
##################
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=15)

#############################
### Train & Predict Model ###
#############################
def train_model(model):
    """Fit the named classifier on the training split and return its test-set predictions."""
    # One entry per supported model name; all use the same random_state as before.
    models = {
        'Decision_Tree': DecisionTreeClassifier(random_state=32),
        'Random_Forest': RandomForestClassifier(random_state=32),
        'SVM': svm.SVC(random_state=32),
        'SGD_Classifier': SGDClassifier(random_state=32),
        'Logistic_Regression': LogisticRegression(random_state=32),
    }
    # Fail with a clear message instead of an unbound y_pred on an unknown name.
    if model not in models:
        raise ValueError(f'Unknown model: {model}')
    classifier = models[model]
    classifier.fit(X_train, y_train)
    return classifier.predict(X_test)

model_lst = ['Decision_Tree', 'Random_Forest', 'SVM', 'SGD_Classifier', 'Logistic_Regression']

for selected_model in model_lst:
    y_pred = train_model(selected_model)

    ######################
    ### Evaluate Model ###
    ######################
    print(f'========================== {selected_model} ==========================')
    print('--- classification_report ---\n', classification_report(y_test, y_pred))
    accuracy = accuracy_score(y_test, y_pred)
    print('--- accuracy_score ---\n', accuracy)
    confusion_M = confusion_matrix(y_test, y_pred)
    print('--- confusion_matrix ---\n', confusion_M)
    print('\n')

--- classification_report ---
               precision    recall  f1-score   support

           0       1.00      0.94      0.97        31
           1       0.82      0.82      0.82        38
           2       0.72      0.87      0.79        38
           3       0.85      0.81      0.83        27
           4       0.97      0.78      0.86        41
           5       0.82      0.89      0.85        35
           6       0.85      0.89      0.87        38
           7       0.91      0.91      0.91        34
           8       0.74      0.74      0.74        35
           9       0.83      0.79      0.81        43

    accuracy                           0.84       360
   macro avg       0.85      0.84      0.84       360
weighted avg       0.85      0.84      0.84       360

--- accuracy_score ---
 0.8416666666666667
--- confusion_matrix ---
 [[29  0  0  0  0  0  0  0  2  0]
 [ 0 31  3  1  1  0  1  0  1  0]
 [ 0  1 33  1  0  1  0  0  1  1]
 [ 0  0  0 22  0  1  1  0  2  1]
 [ 0  0  

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
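
The warning above comes from `LogisticRegression` stopping at its default iteration limit, and it names two remedies: scale the data or raise `max_iter`. The following is a minimal sketch that applies both, assuming a `StandardScaler` and `max_iter=1000`; neither choice is part of the original notebook, and the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split parameters as the cell above.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=15
)

# Assumption: standardize the features, then give the solver a larger
# iteration budget than the default.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=32),
)
scaled_logreg.fit(X_train, y_train)

scaled_logreg_accuracy = accuracy_score(y_test, scaled_logreg.predict(X_test))
print('Logistic regression (scaled) accuracy:', scaled_logreg_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no test-set statistics leak into training.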


### Handwriting classification - Result

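- Training five models (Decision Tree, Random Forest, SVM, SGD Classifier, Logistic Regression) shows that SVM performs best, at 98% accuracy.
- Most of the other models were also accurate, but Decision Tree was noticeably lower at 84%.
- Accuracy is the chosen evaluation metric: the label distribution is balanced, so the fraction of correct predictions over all samples is a sufficient summary.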
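
The balance claim can be checked directly from the label counts printed earlier. This is a small sketch that copies those counts by hand; the variable names are illustrative only.

```python
import pandas as pd

# Label counts copied from the value_counts() output of the first cell.
label_counts = pd.Series(
    {3: 183, 1: 182, 5: 182, 4: 181, 6: 181, 9: 180, 7: 179, 0: 178, 2: 177, 8: 174}
)

# Fraction of all samples that belongs to each digit.
class_share = label_counts / label_counts.sum()

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = label_counts.max() / label_counts.min()

print(class_share.round(3))
print('largest / smallest class:', round(imbalance_ratio, 3))
```

Every digit holds roughly a tenth of the samples, which is why a single overall accuracy figure is not dominated by any one class here.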

## Wine classification

- [Wine recognition dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset)

In [3]:
#################
### Libraries ###
#################
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier, LogisticRegression

#################
### Load Data ###
#################
wine = load_wine()
wine_data = wine.data
wine_label = wine.target

print(wine.keys())
print('----------------------------------------------------------------------')
print('Attributes and methods:', dir(wine))
print('----------------------------------------------------------------------')
print('(number of samples, number of features):', wine_data.shape)
print('----------------------------------------------------------------------')
print('Names:', wine.target_names, ', values:', set(wine.target), '; number of labels:', wine_label.shape[0])
print('----------------------------------------------------------------------')
print('Label distribution:\n', pd.Series(wine_label).value_counts())
print('----------------------------------------------------------------------')
print('Feature names:\n', wine.feature_names)
print('----------------------------------------------------------------------')
print('Dataset description:', wine.DESCR)
print('----------------------------------------------------------------------')
wine_df = pd.DataFrame(data=wine_data, columns=wine.feature_names)
wine_df

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
----------------------------------------------------------------------
Attributes and methods: ['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
----------------------------------------------------------------------
(number of samples, number of features): (178, 13)
----------------------------------------------------------------------
Names: ['class_0' 'class_1' 'class_2'] , values: {0, 1, 2} ; number of labels: 178
----------------------------------------------------------------------
Label distribution:
 1    71
0    59
2    48
dtype: int64
----------------------------------------------------------------------
Feature names:
 ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
----------------------------------------------------------------------
Dataset description: .. _wine_dataset:

Wine 

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


In [4]:
wine_df['label'] = wine.target
print('Label column added:', wine_df.label.unique())
wine_df

Label column added: [0 1 2]


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,label
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0,2
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0,2
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0,2
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0,2


In [5]:
##################
### Split Data ###
##################
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=15)

#############################
### Train & Predict Model ###
#############################
def train_model(model):
    """Fit the named classifier on the training split and return its test-set predictions."""
    # One entry per supported model name; all use the same random_state as before.
    models = {
        'Decision_Tree': DecisionTreeClassifier(random_state=32),
        'Random_Forest': RandomForestClassifier(random_state=32),
        'SVM': svm.SVC(random_state=32),
        'SGD_Classifier': SGDClassifier(random_state=32),
        'Logistic_Regression': LogisticRegression(random_state=32),
    }
    # Fail with a clear message instead of an unbound y_pred on an unknown name.
    if model not in models:
        raise ValueError(f'Unknown model: {model}')
    classifier = models[model]
    classifier.fit(X_train, y_train)
    return classifier.predict(X_test)

model_lst = ['Decision_Tree', 'Random_Forest', 'SVM', 'SGD_Classifier', 'Logistic_Regression']

for selected_model in model_lst:
    y_pred = train_model(selected_model)

    ######################
    ### Evaluate Model ###
    ######################
    print(f'========================== {selected_model} ==========================')
    print('--- classification_report ---\n', classification_report(y_test, y_pred))
    accuracy = accuracy_score(y_test, y_pred)
    print('--- accuracy_score ---\n', accuracy)
    confusion_M = confusion_matrix(y_test, y_pred)
    print('--- confusion_matrix ---\n', confusion_M)
    print('\n')

--- classification_report ---
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.85      0.92      0.88        12
           2       0.91      0.83      0.87        12

    accuracy                           0.92        36
   macro avg       0.92      0.92      0.92        36
weighted avg       0.92      0.92      0.92        36

--- accuracy_score ---
 0.9166666666666666
--- confusion_matrix ---
 [[12  0  0]
 [ 0 11  1]
 [ 0  2 10]]


--- classification_report ---
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00        12
           2       1.00      1.00      1.00        12

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36

--- accuracy_score ---
 1.0
--- confusion_matrix ---
 [[12  0  0]
 [

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Wine classification - Result

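- Of the five models, Random Forest performs best, with 100% accuracy.
- Decision Tree and Logistic Regression exceed 90% accuracy, while SVM and SGD Classifier are much lower, at about 61% and 75% respectively.
- The confusion matrices show that SGD Classifier has some difficulty with class_1 and class_2, whereas SVM fails to predict class_2 at all.
- The label distribution ('1': 71, '0': 59, '2': 48) is skewed toward label '1', and this imbalance is judged to be the reason SVM's accuracy on class_2 dropped to 0%. The unscaled features may also contribute: `proline` ranges from the hundreds to the thousands while `hue` stays near 1, and the warning above itself suggests scaling the data.
- To reliably rule out wines that do not belong to a given label (the negatives), false positives must be low, so precision is an appropriate evaluation metric for the wine classifier.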
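
One way to probe the SVM result is to train the same `SVC` with and without feature scaling and compare. The sketch below is an assumption-laden experiment, not part of the original notebook: it adds a `StandardScaler` step and reuses the split parameters from the cell above. No expected numbers are given because they depend on the installed scikit-learn version.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same data and split parameters as the cell above.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=15
)

# Baseline: SVC on the raw features, as in the notebook.
raw_svm = SVC(random_state=32).fit(X_train, y_train)

# Assumption: the same SVC behind a StandardScaler, so every feature
# contributes on a comparable scale to the kernel distances.
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=32)).fit(X_train, y_train)

raw_accuracy = accuracy_score(y_test, raw_svm.predict(X_test))
scaled_accuracy = accuracy_score(y_test, scaled_svm.predict(X_test))

print('SVC accuracy, raw features:   ', raw_accuracy)
print('SVC accuracy, scaled features:', scaled_accuracy)
```

If the scaled pipeline recovers class_2, feature scale is a more likely explanation for the original failure than the label counts alone.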


## Diagnosis of breast cancer

- [Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset)

In [6]:
#################
### Libraries ###
#################
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier, LogisticRegression

#################
### Load Data ###
#################
breast_cancer = load_breast_cancer()
breast_cancer_data = breast_cancer.data
breast_cancer_label = breast_cancer.target

print(breast_cancer.keys())
print('----------------------------------------------------------------------')
print('Attributes and methods:', dir(breast_cancer))
print('----------------------------------------------------------------------')
print('(number of samples, number of features):', breast_cancer_data.shape)
print('----------------------------------------------------------------------')
print('Names:', breast_cancer.target_names, ', values:', set(breast_cancer.target), '; number of labels:', breast_cancer_label.shape[0])
print('----------------------------------------------------------------------')
print('Label distribution:\n', pd.Series(breast_cancer_label).value_counts())
print('----------------------------------------------------------------------')
print('Feature names:\n', breast_cancer.feature_names)
print('----------------------------------------------------------------------')
print('Dataset description:', breast_cancer.DESCR)
print('----------------------------------------------------------------------')
breast_cancer_df = pd.DataFrame(data=breast_cancer_data, columns=breast_cancer.feature_names)
breast_cancer_df


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
----------------------------------------------------------------------
Attributes and methods: ['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']
----------------------------------------------------------------------
(number of samples, number of features): (569, 30)
----------------------------------------------------------------------
Names: ['malignant' 'benign'] , values: {0, 1} ; number of labels: 569
----------------------------------------------------------------------
Label distribution:
 1    357
0    212
dtype: int64
----------------------------------------------------------------------
Feature names:
 ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' '

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [7]:
breast_cancer_df['label'] = breast_cancer.target
print('Label column added:', breast_cancer_df.label.unique())
breast_cancer_df

Label column added: [0 1]


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [8]:
##################
### Split Data ###
##################
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=15)

#############################
### Train & Predict Model ###
#############################
def train_model(model):
    """Fit the named classifier on the training split and return its test-set predictions."""
    # One entry per supported model name; all use the same random_state as before.
    models = {
        'Decision_Tree': DecisionTreeClassifier(random_state=32),
        'Random_Forest': RandomForestClassifier(random_state=32),
        'SVM': svm.SVC(random_state=32),
        'SGD_Classifier': SGDClassifier(random_state=32),
        'Logistic_Regression': LogisticRegression(random_state=32),
    }
    # Fail with a clear message instead of an unbound y_pred on an unknown name.
    if model not in models:
        raise ValueError(f'Unknown model: {model}')
    classifier = models[model]
    classifier.fit(X_train, y_train)
    return classifier.predict(X_test)

model_lst = ['Decision_Tree', 'Random_Forest', 'SVM', 'SGD_Classifier', 'Logistic_Regression']

for selected_model in model_lst:
    y_pred = train_model(selected_model)

    ######################
    ### Evaluate Model ###
    ######################
    print(f'========================== {selected_model} ==========================')
    print('--- classification_report ---\n', classification_report(y_test, y_pred))
    accuracy = accuracy_score(y_test, y_pred)
    print('--- accuracy_score ---\n', accuracy)
    confusion_M = confusion_matrix(y_test, y_pred)
    print('--- confusion_matrix ---\n', confusion_M)
    print('\n')

--- classification_report ---
               precision    recall  f1-score   support

           0       0.97      0.87      0.92        39
           1       0.94      0.99      0.96        75

    accuracy                           0.95       114
   macro avg       0.95      0.93      0.94       114
weighted avg       0.95      0.95      0.95       114

--- accuracy_score ---
 0.9473684210526315
--- confusion_matrix ---
 [[34  5]
 [ 1 74]]


--- classification_report ---
               precision    recall  f1-score   support

           0       0.94      0.87      0.91        39
           1       0.94      0.97      0.95        75

    accuracy                           0.94       114
   macro avg       0.94      0.92      0.93       114
weighted avg       0.94      0.94      0.94       114

--- accuracy_score ---
 0.9385964912280702
--- confusion_matrix ---
 [[34  5]
 [ 2 73]]


--- classification_report ---
               precision    recall  f1-score   support

           0      

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Diagnosis of breast cancer - Result

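- Training the same five models, Decision Tree performs best, at about 95% accuracy.
- Random Forest is close behind at 94%.
- The data is again imbalanced ('1': 357, '0': 212), so accuracy should not be used as the metric here.
- For cancer diagnosis, actual cancer cases (the positives) must be identified reliably: misdiagnosing a positive case as negative is the more serious error. Recall, which improves as false negatives decrease, is therefore the appropriate metric.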
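
When recall is computed for this dataset, the choice of positive class matters: the printed `target_names` show that label 0 is `malignant` and label 1 is `benign`, while `recall_score` treats label 1 as positive by default. The sketch below rebuilds label vectors from the first confusion matrix in the output above (the one with accuracy 0.947) and compares the two settings. The rebuilt vectors are a hand-made illustration, not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import recall_score

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels; 0 = malignant, 1 = benign.
cm = np.array([[34, 5],
               [1, 74]])

# Rebuild label vectors that reproduce exactly this matrix:
# cell order after ravel() is (true 0, pred 0), (true 0, pred 1),
# (true 1, pred 0), (true 1, pred 1).
y_true = np.repeat([0, 0, 1, 1], cm.ravel())
y_pred = np.repeat([0, 1, 0, 1], cm.ravel())

# Recall for the malignant class: malignant cases caught / all malignant cases.
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

# Default pos_label=1 measures recall for the benign class instead.
recall_benign = recall_score(y_true, y_pred)

print('recall (malignant, pos_label=0):', round(recall_malignant, 4))
print('recall (benign, default):       ', round(recall_benign, 4))
```

The malignant recall is 34/39 and the benign recall is 74/75, so reading the default value would overstate how many cancer cases the model catches.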

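## Conclusion

- Understanding the collected data comes first: identify its format and what its features and labels are.
- Depending on whether the labels are balanced or imbalanced, accuracy alone can give a misleading picture of model performance.
- The evaluation metric should therefore match the priorities of the classification task; the common metrics are Precision, Recall, F1 score, and Accuracy.
- Judge the relative importance of positives and negatives for the dataset and model at hand, and choose the metric accordingly.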
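
To make the four metrics concrete, they can be derived by hand from a confusion matrix. The sketch below uses the first wine confusion matrix from the output above (accuracy 0.9167) and plain NumPy; the variable names are illustrative, and the formulas are the standard per-class definitions.

```python
import numpy as np

# Confusion matrix copied from the wine output above.
# Rows are true labels, columns are predicted labels.
cm = np.array([[12, 0, 0],
               [0, 11, 1],
               [0, 2, 10]])

true_positives = np.diag(cm)

# Precision: of everything predicted as a class, how much was right (column sums).
precision = true_positives / cm.sum(axis=0)

# Recall: of everything that truly is a class, how much was found (row sums).
recall = true_positives / cm.sum(axis=1)

# F1: harmonic mean of precision and recall, per class.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: correct predictions over all predictions.
accuracy = true_positives.sum() / cm.sum()

print('precision:', precision.round(2))
print('recall:   ', recall.round(2))
print('f1:       ', f1.round(2))
print('accuracy: ', round(accuracy, 4))
```

These values match the corresponding `classification_report` rows. If a class is never predicted, its column sum is zero and precision is undefined, which is the situation the `_warn_prf` lines in the wine output report.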
