# EP02_Classification

## Project 01_Load Digits

### 1. Data preparation

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
digits = load_digits()
digits_data = digits.data
digits_label = digits.target

In [3]:
print(digits_data.shape)
digits_data[0]

(1797, 64)


array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

- Check the contents and format of digits_data.
- The dataset holds 1,797 digit samples.
- Each sample stores 64 values.
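
Each 64-value row is a flattened 8x8 image, as the dataset description printed further down states. A minimal sketch, using only NumPy and the first sample printed above, of how such a row maps back to an image grid (the names `first_sample` and `image` are illustrative, not from the notebook):

```python
import numpy as np

# First row of digits_data, copied from the cell output above.
first_sample = np.array([
    0, 0, 5, 13, 9, 1, 0, 0,
    0, 0, 13, 15, 10, 15, 5, 0,
    0, 3, 15, 2, 0, 11, 8, 0,
    0, 4, 12, 0, 0, 8, 8, 0,
    0, 5, 8, 0, 0, 9, 8, 0,
    0, 4, 11, 0, 1, 12, 7, 0,
    0, 2, 14, 5, 10, 12, 0, 0,
    0, 0, 6, 13, 10, 0, 0, 0,
], dtype=float)

# Reshape the flat 64-vector into 8 rows of 8 pixel intensities.
image = first_sample.reshape(8, 8)
print(image.shape)
print(image)
```

In the loaded dataset the same grid is also available directly as `digits.images[0]`.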

In [4]:
print(digits_label.shape)
print(digits_label[:20])
print(digits.target_names)

(1797,)
[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]


In [7]:
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

### 2. Splitting the data

In [17]:
X_train, X_test, y_train, y_test = train_test_split(digits_data,
                                                   digits_label,
                                                   test_size=0.30,
                                                   random_state=5)

print('number of X_train:', len(X_train), 'number of X_test:', len(X_test))
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

number of X_train: 1257 number of X_test: 540
(1257, 64) (1257,)
(540, 64) (540,)


- Split the data into a training set and a test set.
- The test set takes 30% of the data.
- Of the 1,797 samples, 1,257 go to the training set and 540 to the test set.
- The data is shuffled with random_state=5.
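
The 1,257 / 540 split is not exactly 70 / 30, because 30% of 1,797 is 539.1. A small sketch of the arithmetic, assuming the test share is rounded up and the training set receives the remainder, which is consistent with the sizes printed above (`split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test share is rounded up; the rest is used for training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(1797, 0.30))
```

The same helper reproduces the 124 / 54 split reported for the wine data in Project 02.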

### 3. Training with several models

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

decision_tree = DecisionTreeClassifier(random_state=32)
random_forest = RandomForestClassifier(random_state=32)
svm_model = svm.SVC()
sgd_model = SGDClassifier()
logistic_model = LogisticRegression()

- The following five models are used:
 - Decision Tree (decision_tree)
 - Random Forest (random_forest)
 - SVM (svm_model)
 - SGD Classifier (sgd_model)
 - Logistic Regression (logistic_model)
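
Because all five estimators share the same `fit` / `predict` interface, they can also be trained in one loop rather than cell by cell. The sketch below is one possible arrangement, not the notebook's own code: the `models` dictionary and the `accuracies` name are illustrative, and a `random_state` is added to `SGDClassifier` only to make the run repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.30, random_state=5
)

# The same five estimators as above, keyed by a display name.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=32),
    "Random Forest": RandomForestClassifier(random_state=32),
    "SVM": SVC(),
    "SGD Classifier": SGDClassifier(random_state=32),  # seed is an addition
    "Logistic Regression": LogisticRegression(),
}

# Fit each model on the training set and record its test accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```

The per-model sections below keep the full `classification_report` and `confusion_matrix`, which a single accuracy number does not replace.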

#### 3-1. Decision Tree

In [19]:
decision_tree.fit(X_train, y_train)
y_pred_decision_tree = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred_decision_tree))
print(confusion_matrix(y_test, y_pred_decision_tree))

              precision    recall  f1-score   support

           0       0.98      0.95      0.96        58
           1       0.73      0.88      0.80        52
           2       0.89      0.84      0.87        58
           3       0.89      0.81      0.85        59
           4       0.75      0.77      0.76        43
           5       0.88      0.88      0.88        64
           6       0.95      0.89      0.92        47
           7       0.96      0.78      0.86        59
           8       0.76      0.78      0.77        50
           9       0.77      0.94      0.85        50

    accuracy                           0.85       540
   macro avg       0.86      0.85      0.85       540
weighted avg       0.86      0.85      0.85       540

[[55  0  0  0  1  1  0  0  1  0]
 [ 0 46  2  0  0  0  0  0  1  3]
 [ 0  4 49  2  0  1  1  0  1  0]
 [ 0  1  3 48  1  0  0  0  3  3]
 [ 0  5  0  0 33  2  1  1  0  1]
 [ 1  0  0  0  0 56  0  0  2  5]
 [ 0  0  0  0  3  1 42  0  1  0]
 [ 0  2  0

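- Result: 85% accuracy, but precision and recall vary widely across classes (for example, class 1 has precision 0.73 and recall 0.88, while class 7 has precision 0.96 and recall 0.78), so the model cannot be called a strong performer.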
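
One simple way to put a number on "varies widely" is the gap between the best and worst class. The sketch below uses the per-class values copied from the report above; `spread` is a hypothetical helper, and the max-minus-min gap is only one of several possible measures.

```python
# Per-class precision and recall, copied from the Decision Tree report above.
decision_tree_precision = [0.98, 0.73, 0.89, 0.89, 0.75, 0.88, 0.95, 0.96, 0.76, 0.77]
decision_tree_recall = [0.95, 0.88, 0.84, 0.81, 0.77, 0.88, 0.89, 0.78, 0.78, 0.94]


def spread(values):
    """Gap between the best and worst class, rounded to two decimals."""
    return round(max(values) - min(values), 2)


print("precision spread:", spread(decision_tree_precision))
print("recall spread:", spread(decision_tree_recall))
```

For the Decision Tree this gives 0.25 for precision and 0.18 for recall. Applying the same helper to the Random Forest report in 3-2 gives 0.09 and 0.08.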

#### 3-2. Random Forest

In [20]:
random_forest.fit(X_train, y_train)
y_pred_random_forest = random_forest.predict(X_test)

print(classification_report(y_test, y_pred_random_forest))
print(confusion_matrix(y_test, y_pred_random_forest))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99        58
           1       0.96      1.00      0.98        52
           2       0.98      1.00      0.99        58
           3       1.00      0.92      0.96        59
           4       0.98      0.98      0.98        43
           5       0.97      0.97      0.97        64
           6       1.00      0.98      0.99        47
           7       0.98      0.98      0.98        59
           8       0.96      0.96      0.96        50
           9       0.91      0.98      0.94        50

    accuracy                           0.97       540
   macro avg       0.97      0.97      0.97       540
weighted avg       0.97      0.97      0.97       540

[[57  0  0  0  1  0  0  0  0  0]
 [ 0 52  0  0  0  0  0  0  0  0]
 [ 0  0 58  0  0  0  0  0  0  0]
 [ 0  1  0 54  0  1  0  0  1  2]
 [ 0  0  0  0 42  0  0  1  0  0]
 [ 0  0  0  0  0 62  0  0  0  2]
 [ 0  0  0  0  0  0 46  0  1  0]
 [ 0  0  0

- Result: 97% accuracy. Precision and recall vary little across classes, so the model performs well.

#### 3-3. SVM

In [21]:
svm_model.fit(X_train, y_train)
y_pred_svm_model = svm_model.predict(X_test)

print(classification_report(y_test, y_pred_svm_model))
print(confusion_matrix(y_test, y_pred_svm_model))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        58
           1       0.98      1.00      0.99        52
           2       1.00      1.00      1.00        58
           3       1.00      0.97      0.98        59
           4       1.00      1.00      1.00        43
           5       0.97      0.98      0.98        64
           6       1.00      1.00      1.00        47
           7       1.00      0.98      0.99        59
           8       0.96      0.96      0.96        50
           9       0.94      0.96      0.95        50

    accuracy                           0.99       540
   macro avg       0.99      0.99      0.99       540
weighted avg       0.99      0.99      0.99       540

[[58  0  0  0  0  0  0  0  0  0]
 [ 0 52  0  0  0  0  0  0  0  0]
 [ 0  0 58  0  0  0  0  0  0  0]
 [ 0  0  0 57  0  1  0  0  1  0]
 [ 0  0  0  0 43  0  0  0  0  0]
 [ 0  0  0  0  0 63  0  0  0  1]
 [ 0  0  0  0  0  0 47  0  0  0]
 [ 0  0  0

- Result: about 99% accuracy as rounded in the report, close to perfect. Precision and recall are high and nearly uniform across classes.

#### 3-4. SGD

In [22]:
sgd_model.fit(X_train, y_train)
y_pred_sgd_model = sgd_model.predict(X_test)

print(classification_report(y_test, y_pred_sgd_model))
print(confusion_matrix(y_test, y_pred_sgd_model))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99        58
           1       0.91      0.96      0.93        52
           2       0.98      1.00      0.99        58
           3       1.00      0.88      0.94        59
           4       0.96      1.00      0.98        43
           5       0.93      0.97      0.95        64
           6       1.00      0.96      0.98        47
           7       1.00      0.95      0.97        59
           8       0.95      0.82      0.88        50
           9       0.80      0.98      0.88        50

    accuracy                           0.95       540
   macro avg       0.95      0.95      0.95       540
weighted avg       0.95      0.95      0.95       540

[[57  0  0  0  1  0  0  0  0  0]
 [ 0 50  1  0  0  1  0  0  0  0]
 [ 0  0 58  0  0  0  0  0  0  0]
 [ 0  0  0 52  0  3  0  0  1  3]
 [ 0  0  0  0 43  0  0  0  0  0]
 [ 0  0  0  0  0 62  0  0  0  2]
 [ 0  1  0  0  0  0 45  0  1  0]
 [ 0  0  0

- Result: 95% accuracy. Precision and recall vary somewhat across classes.

#### 3-5. Logistic Regression

In [23]:
logistic_model.fit(X_train, y_train)
y_pred_logistic_model = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred_logistic_model))
print(confusion_matrix(y_test, y_pred_logistic_model))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        58
           1       0.94      0.94      0.94        52
           2       0.95      1.00      0.97        58
           3       0.98      0.93      0.96        59
           4       0.95      0.93      0.94        43
           5       0.97      0.95      0.96        64
           6       1.00      0.98      0.99        47
           7       0.98      0.97      0.97        59
           8       0.94      0.94      0.94        50
           9       0.91      0.98      0.94        50

    accuracy                           0.96       540
   macro avg       0.96      0.96      0.96       540
weighted avg       0.96      0.96      0.96       540

[[58  0  0  0  0  0  0  0  0  0]
 [ 0 49  1  1  0  0  0  0  1  0]
 [ 0  0 58  0  0  0  0  0  0  0]
 [ 0  0  1 55  0  1  0  0  1  1]
 [ 0  2  0  0 40  0  0  1  0  0]
 [ 0  0  0  0  1 61  0  0  0  2]
 [ 0  0  0  0  0  0 46  0  1  0]
 [ 0  0  0

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


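- Result: 96% accuracy.
- Warning: lbfgs stopped at its iteration limit without converging. The message itself suggests raising max_iter or scaling the data; the actual cause still needs analysis.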
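
The warning text proposes two remedies: a larger `max_iter` and feature scaling. The sketch below tries both at once on the same split. It is a starting point for the analysis rather than a confirmed fix: `max_iter=1000` is an arbitrary value chosen for illustration, and the `scaled_logistic` name is not from the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.30, random_state=5
)

# Standardize the pixel features, then fit with a larger iteration budget.
# max_iter=1000 is an arbitrary choice for this sketch, not a tuned value.
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scaled_logistic.fit(X_train, y_train)

print("test accuracy:", scaled_logistic.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no test-set statistics leak into training.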

## Project 02_Load Wine

### As in Project 01,
### 1. Data preparation -> 2. Data split -> 3. Training with several models -> Conclusion

In [27]:
from sklearn.datasets import load_wine

wine = load_wine()
wine_data = wine.data
wine_label = wine.target

print(wine_data.shape)
wine_data[0]

wine.feature_names

print(wine_label.shape)
print(wine_label[:])
print(wine.target_names)

#print(wine.DESCR)  # omitted to keep the code readable

X_train, X_test, y_train, y_test = train_test_split(wine_data,
                                                   wine_label,
                                                   test_size=0.30,
                                                   random_state=5)

print('number of X_train:', len(X_train), 'number of X_test:', len(X_test))
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(178, 13)
(178,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['class_0' 'class_1' 'class_2']
number of X_train: 124 number of X_test: 54
(124, 13) (124,)
(54, 13) (54,)


In [30]:
#3-1. decision_tree
decision_tree.fit(X_train, y_train)
y_pred_decision_tree = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred_decision_tree))
print(confusion_matrix(y_test, y_pred_decision_tree))

#3-2. Random Forest
random_forest.fit(X_train, y_train)
y_pred_random_forest = random_forest.predict(X_test)

print(classification_report(y_test, y_pred_random_forest))
print(confusion_matrix(y_test, y_pred_random_forest))

#3-3. SVM
svm_model.fit(X_train, y_train)
y_pred_svm_model = svm_model.predict(X_test)

print(classification_report(y_test, y_pred_svm_model))
print(confusion_matrix(y_test, y_pred_svm_model))

#3-4. SGD
sgd_model.fit(X_train, y_train)
y_pred_sgd_model = sgd_model.predict(X_test)

print(classification_report(y_test, y_pred_sgd_model))
print(confusion_matrix(y_test, y_pred_sgd_model))

#3-5. Logistic Regression
logistic_model.fit(X_train, y_train)
y_pred_logistic_model = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred_logistic_model))
print(confusion_matrix(y_test, y_pred_logistic_model))

              precision    recall  f1-score   support

           0       0.95      0.91      0.93        23
           1       0.89      0.89      0.89        18
           2       0.93      1.00      0.96        13

    accuracy                           0.93        54
   macro avg       0.92      0.93      0.93        54
weighted avg       0.93      0.93      0.93        54

[[21  2  0]
 [ 1 16  1]
 [ 0  0 13]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       1.00      0.94      0.97        18
           2       0.93      1.00      0.96        13

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

[[23  0  0]
 [ 0 17  1]
 [ 0  0 13]]
              precision    recall  f1-score   support

           0       0.89      0.74      0.81        23
           1       0.60      0.83      0.70        18
 

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


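- Result_1 (Decision Tree): 93% accuracy.
- Result_2 (Random Forest): 98% accuracy.
- Result_3 (SVM): 67% accuracy.
- Result_4 (SGD): 61% accuracy.
- Result_5 (Logistic Regression): 87% accuracy.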
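
The tree-based models score far higher here than SVM and SGD. One plausible explanation, which the notebook does not test, is feature scale: the 13 wine features are measured in very different units, and SVM and SGD are sensitive to that while trees are not. The sketch below checks the hypothesis for SVM only, on the same split; the names `raw_svm` and `scaled_svm` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.30, random_state=5
)

# Baseline: SVC on the raw features, as in the cell above.
raw_svm = SVC().fit(X_train, y_train)

# Variant: standardize each feature before the SVC sees it.
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

raw_accuracy = raw_svm.score(X_test, y_test)
scaled_accuracy = scaled_svm.score(X_test, y_test)
print(f"raw features:    {raw_accuracy:.2f}")
print(f"scaled features: {scaled_accuracy:.2f}")
```

If the scaled variant scores clearly higher, the same pipeline pattern is worth trying for `SGDClassifier` and `LogisticRegression` as well.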

## Project 03_Load Breast Cancer

### As in Project 01,
### 1. Data preparation -> 2. Data split -> 3. Training with several models -> Conclusion

In [31]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
cancer_data = cancer.data
cancer_label = cancer.target

print(cancer_data.shape)
cancer_data[0]

cancer.feature_names

print(cancer_label.shape)
print(cancer_label[:])
print(cancer.target_names)

#print(cancer.DESCR)  # omitted to keep the code readable

X_train, X_test, y_train, y_test = train_test_split(cancer_data,
                                                   cancer_label,
                                                   test_size=0.30,
                                                   random_state=5)

print('number of X_train:', len(X_train), 'number of X_test:', len(X_test))
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(569, 30)
(569,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1

In [32]:
#3-1. decision_tree
decision_tree.fit(X_train, y_train)
y_pred_decision_tree = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred_decision_tree))
print(confusion_matrix(y_test, y_pred_decision_tree))

#3-2. Random Forest
random_forest.fit(X_train, y_train)
y_pred_random_forest = random_forest.predict(X_test)

print(classification_report(y_test, y_pred_random_forest))
print(confusion_matrix(y_test, y_pred_random_forest))

#3-3. SVM
svm_model.fit(X_train, y_train)
y_pred_svm_model = svm_model.predict(X_test)

print(classification_report(y_test, y_pred_svm_model))
print(confusion_matrix(y_test, y_pred_svm_model))

#3-4. SGD
sgd_model.fit(X_train, y_train)
y_pred_sgd_model = sgd_model.predict(X_test)

print(classification_report(y_test, y_pred_sgd_model))
print(confusion_matrix(y_test, y_pred_sgd_model))

#3-5. Logistic Regression
logistic_model.fit(X_train, y_train)
y_pred_logistic_model = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred_logistic_model))
print(confusion_matrix(y_test, y_pred_logistic_model))

              precision    recall  f1-score   support

           0       0.89      0.92      0.90        61
           1       0.95      0.94      0.94       110

    accuracy                           0.93       171
   macro avg       0.92      0.93      0.92       171
weighted avg       0.93      0.93      0.93       171

[[ 56   5]
 [  7 103]]
              precision    recall  f1-score   support

           0       0.98      0.95      0.97        61
           1       0.97      0.99      0.98       110

    accuracy                           0.98       171
   macro avg       0.98      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171

[[ 58   3]
 [  1 109]]
              precision    recall  f1-score   support

           0       1.00      0.90      0.95        61
           1       0.95      1.00      0.97       110

    accuracy                           0.96       171
   macro avg       0.97      0.95      0.96       171
weighted avg       0.97     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


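- Result_1 (Decision Tree): 93% accuracy.
- Result_2 (Random Forest): 98% accuracy.
- Result_3 (SVM): 96% accuracy.
- Result_4 (SGD): 91% accuracy.
- Result_5 (Logistic Regression): 95% accuracy.

- However, as studied earlier, recall is the metric that matters most when deciding whether a tumor is malignant or benign.
  - In this dataset class 0 is malignant. Among the reports visible above, Random Forest has the highest recall for that class (0.95, versus 0.92 for Decision Tree and 0.90 for SVM), so it appears to be the most suitable model.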
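
To make the recall argument concrete, the malignant-class recall can be read straight off a confusion matrix. The sketch below uses the two matrices that are fully visible in the output above (Decision Tree and Random Forest); the dictionary names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the output above
# (rows: true class, columns: predicted class).
# In load_breast_cancer, class 0 is malignant and class 1 is benign.
confusion = {
    "Decision Tree": np.array([[56, 5], [7, 103]]),
    "Random Forest": np.array([[58, 3], [1, 109]]),
}

malignant_recall = {}
for name, matrix in confusion.items():
    # Recall for class 0: correctly flagged malignant / all truly malignant.
    malignant_recall[name] = matrix[0, 0] / matrix[0].sum()
    missed = matrix[0, 1]  # malignant cases predicted as benign
    print(f"{name}: malignant recall {malignant_recall[name]:.2f}, missed {missed}")
```

On these numbers the Decision Tree misses 5 of 61 malignant cases and the Random Forest misses 3, which matches the 0.92 and 0.95 recall values in the reports.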