### Data used:
- digits: shape (1797, 64)
- wine: shape (178, 13)
- breast-cancer: shape (569, 30)

### Models used for each data set
- Decision Tree
- Random Forest
- SVM
- SGD Classifier
- Logistic Regression


### Import Modules

In [139]:
from sklearn.datasets import load_digits
from sklearn.datasets import load_wine
from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.preprocessing import MinMaxScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

import numpy as np

# DIGITS

### Load Data and generate scaled train & test data sets

In [140]:
digits_data = load_digits()
digits_data.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])

In [141]:
data= digits_data['data']
label= digits_data['target']

In [142]:
scaler= MinMaxScaler()
scaled_data= scaler.fit_transform(data)

In [143]:
X_train, X_test, y_train, y_test = train_test_split(scaled_data, label, test_size=0.2, random_state= 32)

## Training Model Comparisons

### Decision Tree

In [144]:
decision_tree= DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)

y_pred_dc= decision_tree.predict(X_test)

accuracy_dc= accuracy_score(y_test, y_pred_dc)
class_report_dc= classification_report(y_test, y_pred_dc)

In [128]:
print(accuracy_dc, '\n', class_report_dc) # raw data

0.875 
               precision    recall  f1-score   support

           0       1.00      0.89      0.94        38
           1       0.75      0.83      0.79        36
           2       0.75      0.84      0.79        32
           3       0.92      0.86      0.89        56
           4       0.85      0.90      0.88        31
           5       0.92      0.97      0.95        36
           6       0.97      0.94      0.96        34
           7       0.94      0.88      0.91        34
           8       0.88      0.78      0.82        27
           9       0.79      0.83      0.81        36

    accuracy                           0.88       360
   macro avg       0.88      0.87      0.87       360
weighted avg       0.88      0.88      0.88       360



In [9]:
print(accuracy_dc, '\n', class_report_dc) # scaled data

0.875 
               precision    recall  f1-score   support

           0       1.00      0.89      0.94        38
           1       0.75      0.83      0.79        36
           2       0.75      0.84      0.79        32
           3       0.92      0.86      0.89        56
           4       0.85      0.90      0.88        31
           5       0.92      0.97      0.95        36
           6       0.97      0.94      0.96        34
           7       0.94      0.88      0.91        34
           8       0.88      0.78      0.82        27
           9       0.79      0.83      0.81        36

    accuracy                           0.88       360
   macro avg       0.88      0.87      0.87       360
weighted avg       0.88      0.88      0.88       360



### Random Forest

In [145]:
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)

y_pred_RFC= random_forest.predict(X_test)

accuracy_RFC= accuracy_score(y_test, y_pred_RFC)
class_report_RFC= classification_report(y_test, y_pred_RFC)

In [130]:
print(accuracy_RFC, '\n', class_report_RFC) ## raw data

0.9861111111111112 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.97      1.00      0.99        36
           2       0.97      1.00      0.98        32
           3       1.00      1.00      1.00        56
           4       1.00      0.97      0.98        31
           5       1.00      0.97      0.99        36
           6       1.00      1.00      1.00        34
           7       0.97      1.00      0.99        34
           8       0.96      0.89      0.92        27
           9       0.97      1.00      0.99        36

    accuracy                           0.99       360
   macro avg       0.98      0.98      0.98       360
weighted avg       0.99      0.99      0.99       360



In [12]:
print(accuracy_RFC, '\n', class_report_RFC) ## scaled data

0.9861111111111112 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.97      1.00      0.99        36
           2       0.97      1.00      0.98        32
           3       1.00      1.00      1.00        56
           4       1.00      0.97      0.98        31
           5       1.00      0.97      0.99        36
           6       1.00      1.00      1.00        34
           7       0.97      1.00      0.99        34
           8       0.96      0.89      0.92        27
           9       0.97      1.00      0.99        36

    accuracy                           0.99       360
   macro avg       0.98      0.98      0.98       360
weighted avg       0.99      0.99      0.99       360



### SVM

In [146]:
svm_model = svm.SVC(random_state=32)
svm_model.fit(X_train, y_train)
y_pred_sv = svm_model.predict(X_test)

accuracy_sv= accuracy_score(y_test, y_pred_sv)
class_report_sv= classification_report(y_test, y_pred_sv)

In [132]:
print(accuracy_sv, '\n', class_report_sv) ## raw data

0.9944444444444445 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       1.00      1.00      1.00        36
           2       1.00      1.00      1.00        32
           3       1.00      1.00      1.00        56
           4       1.00      0.97      0.98        31
           5       1.00      0.97      0.99        36
           6       1.00      1.00      1.00        34
           7       1.00      1.00      1.00        34
           8       1.00      1.00      1.00        27
           9       0.95      1.00      0.97        36

    accuracy                           0.99       360
   macro avg       0.99      0.99      0.99       360
weighted avg       0.99      0.99      0.99       360



In [15]:
print(accuracy_sv, '\n', class_report_sv) ## scaled data

0.9944444444444445 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       1.00      1.00      1.00        36
           2       1.00      1.00      1.00        32
           3       1.00      1.00      1.00        56
           4       1.00      0.97      0.98        31
           5       1.00      0.97      0.99        36
           6       1.00      1.00      1.00        34
           7       1.00      1.00      1.00        34
           8       1.00      1.00      1.00        27
           9       0.95      1.00      0.97        36

    accuracy                           0.99       360
   macro avg       0.99      0.99      0.99       360
weighted avg       0.99      0.99      0.99       360



### SGD Classifier

In [147]:
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred_sgd = sgd_model.predict(X_test)

accuracy_sgd = accuracy_score(y_test, y_pred_sgd)
class_report_sgd = classification_report(y_test, y_pred_sgd)

In [134]:
print(accuracy_sgd, '\n', class_report_sgd) ## raw data

0.9722222222222222 
               precision    recall  f1-score   support

           0       1.00      0.95      0.97        38
           1       0.97      0.94      0.96        36
           2       1.00      1.00      1.00        32
           3       0.92      1.00      0.96        56
           4       1.00      0.97      0.98        31
           5       0.97      0.94      0.96        36
           6       1.00      1.00      1.00        34
           7       0.97      1.00      0.99        34
           8       0.96      0.93      0.94        27
           9       0.97      0.97      0.97        36

    accuracy                           0.97       360
   macro avg       0.98      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360



In [18]:
print(accuracy_sgd, '\n', class_report_sgd) ## scaled data

0.975 
               precision    recall  f1-score   support

           0       1.00      0.97      0.99        38
           1       0.92      0.97      0.95        36
           2       1.00      1.00      1.00        32
           3       1.00      0.98      0.99        56
           4       1.00      0.97      0.98        31
           5       0.94      0.92      0.93        36
           6       1.00      1.00      1.00        34
           7       1.00      1.00      1.00        34
           8       0.96      0.93      0.94        27
           9       0.92      1.00      0.96        36

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.98      0.97      0.98       360



### Logistic Regression

In [149]:
logistic_model = LogisticRegression(random_state=32, max_iter=200)
logistic_model.fit(X_train, y_train)
y_pred_lr = logistic_model.predict(X_test)

accuracy_lr = accuracy_score(y_test, y_pred_lr)
class_report_lr = classification_report(y_test, y_pred_lr)

In [150]:
print(set(y_test))
print(set(y_pred_lr))


{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}


In [137]:
print(accuracy_lr, '\n', class_report_lr) # raw data

0.9694444444444444 
               precision    recall  f1-score   support

           0       1.00      0.95      0.97        38
           1       0.97      0.97      0.97        36
           2       1.00      1.00      1.00        32
           3       0.98      0.98      0.98        56
           4       1.00      0.97      0.98        31
           5       0.94      0.92      0.93        36
           6       1.00      1.00      1.00        34
           7       1.00      0.94      0.97        34
           8       0.93      1.00      0.96        27
           9       0.88      0.97      0.92        36

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360



In [22]:
print(accuracy_lr, '\n', class_report_lr) # scaled data

0.9722222222222222 
               precision    recall  f1-score   support

           0       1.00      0.97      0.99        38
           1       0.95      0.97      0.96        36
           2       1.00      1.00      1.00        32
           3       1.00      0.96      0.98        56
           4       1.00      0.97      0.98        31
           5       0.94      0.94      0.94        36
           6       0.97      1.00      0.99        34
           7       1.00      1.00      1.00        34
           8       0.92      0.89      0.91        27
           9       0.92      1.00      0.96        36

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360



In [138]:
print('with RAW data', accuracy_dc, accuracy_RFC, accuracy_sv, accuracy_sgd, accuracy_lr, sep='\n')

with RAW data
0.875
0.9861111111111112
0.9944444444444445
0.9722222222222222
0.9694444444444444


In [151]:
print('with SCALED data', accuracy_dc, accuracy_RFC, accuracy_sv, accuracy_sgd, accuracy_lr, sep='\n')

with SCALED data
0.875
0.9861111111111112
0.9944444444444445
0.9666666666666667
0.9722222222222222
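
The SGD Classifier accuracy in this summary (0.9667) differs from the 0.975 printed in the SGD cell above. A likely cause, stated here as an assumption, is that `SGDClassifier()` is created without `random_state`: the estimator shuffles the training data by default, so each refit can end up with slightly different weights. The sketch below suggests that fixing the seed makes repeated fits return the same accuracy. The helper name `sgd_accuracy` is hypothetical and the seed value 32 simply mirrors the other models.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=32
)


def sgd_accuracy(seed):
    """Fit a fresh SGDClassifier and return its test accuracy."""
    model = SGDClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Same seed and same data: both runs should give an identical accuracy.
first = sgd_accuracy(32)
second = sgd_accuracy(32)
print(first == second)
```

With a fixed seed, the per-model report and the summary line would print the same SGD value even if the cell is re-run in between.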


# WINE

### Load Data and generate scaled train & test data sets

In [152]:
wine_data= load_wine()
data= wine_data['data']
label= wine_data['target']
#print(wine_data.target_names, wine_data['DESCR'], sep='\n\n')

In [153]:
data.shape

(178, 13)

In [154]:
wine_data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

In [155]:
wine_data['feature_names']

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

#### MIN MAX Scaling

In [156]:
scaler= MinMaxScaler()

scaled_data= scaler.fit_transform(data)

In [157]:
X_train, X_test, y_train, y_test = train_test_split(scaled_data, label, test_size=0.2, random_state= 32)

## Training Model Comparisons

### Decision Tree

In [158]:
decision_tree= DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)

y_pred_dc= decision_tree.predict(X_test)

accuracy_dc= accuracy_score(y_test, y_pred_dc)
class_report_dc= classification_report(y_test, y_pred_dc)

In [93]:
print(accuracy_dc, '\n', class_report_dc) # raw data

0.9166666666666666 
               precision    recall  f1-score   support

           0       0.94      1.00      0.97        16
           1       0.89      0.80      0.84        10
           2       0.90      0.90      0.90        10

    accuracy                           0.92        36
   macro avg       0.91      0.90      0.90        36
weighted avg       0.92      0.92      0.91        36



In [33]:
print(accuracy_dc, '\n', class_report_dc) # scaled data

0.9166666666666666 
               precision    recall  f1-score   support

           0       0.94      1.00      0.97        16
           1       0.89      0.80      0.84        10
           2       0.90      0.90      0.90        10

    accuracy                           0.92        36
   macro avg       0.91      0.90      0.90        36
weighted avg       0.92      0.92      0.91        36



### Random Forest

In [159]:
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)

y_pred_RFC= random_forest.predict(X_test)

accuracy_RFC= accuracy_score(y_test, y_pred_RFC)
class_report_RFC= classification_report(y_test, y_pred_RFC)

In [95]:
print(accuracy_RFC, '\n', class_report_RFC) ## raw data

0.9722222222222222 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.97        36
   macro avg       0.97      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36



In [36]:
print(accuracy_RFC, '\n', class_report_RFC) ## scaled data

0.9722222222222222 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.97        36
   macro avg       0.97      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36



### SVM

In [160]:
svm_model = svm.SVC(random_state=32)
svm_model.fit(X_train, y_train)
y_pred_sv = svm_model.predict(X_test)

In [161]:
accuracy_sv= accuracy_score(y_test, y_pred_sv)
class_report_sv= classification_report(y_test, y_pred_sv)

In [98]:
print(accuracy_sv, '\n', class_report_sv) ## raw data

0.6111111111111112 
               precision    recall  f1-score   support

           0       0.93      0.81      0.87        16
           1       0.41      0.90      0.56        10
           2       0.00      0.00      0.00        10

    accuracy                           0.61        36
   macro avg       0.45      0.57      0.48        36
weighted avg       0.53      0.61      0.54        36



In [40]:
print(accuracy_sv, '\n', class_report_sv) ## scaled data

0.9444444444444444 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.90      0.90      0.90        10
           2       0.90      0.90      0.90        10

    accuracy                           0.94        36
   macro avg       0.93      0.93      0.93        36
weighted avg       0.94      0.94      0.94        36



### SGD Classifier

In [162]:
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred_sgd = sgd_model.predict(X_test)

accuracy_sgd = accuracy_score(y_test, y_pred_sgd)
class_report_sgd = classification_report(y_test, y_pred_sgd)

In [100]:
print(accuracy_sgd, '\n', class_report_sgd) ## raw data

0.6111111111111112 
               precision    recall  f1-score   support

           0       0.92      0.75      0.83        16
           1       0.43      1.00      0.61        10
           2       0.00      0.00      0.00        10

    accuracy                           0.61        36
   macro avg       0.45      0.58      0.48        36
weighted avg       0.53      0.61      0.54        36



In [43]:
print(accuracy_sgd, '\n', class_report_sgd) ## scaled data

0.9722222222222222 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.91      1.00      0.95        10
           2       1.00      0.90      0.95        10

    accuracy                           0.97        36
   macro avg       0.97      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36



### Logistic Regression

In [163]:
logistic_model = LogisticRegression(random_state=32, max_iter=100)
logistic_model.fit(X_train, y_train)
y_pred_lr = logistic_model.predict(X_test)

accuracy_lr = accuracy_score(y_test, y_pred_lr)
class_report_lr = classification_report(y_test, y_pred_lr)

# log_model = LogisticRegression(solver='lbfgs', max_iter=1000)

In [102]:
print(set(y_test))
print(set(y_pred_lr))

{0, 1, 2}
{0, 1, 2}


In [103]:
print(accuracy_lr, '\n', class_report_lr) # raw data

0.9166666666666666 
               precision    recall  f1-score   support

           0       0.94      1.00      0.97        16
           1       0.89      0.80      0.84        10
           2       0.90      0.90      0.90        10

    accuracy                           0.92        36
   macro avg       0.91      0.90      0.90        36
weighted avg       0.92      0.92      0.91        36



In [164]:
print(accuracy_lr, '\n', class_report_lr) # scaled data

0.9722222222222222 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.90      0.95        10
           2       0.91      1.00      0.95        10

    accuracy                           0.97        36
   macro avg       0.97      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36



In [104]:
print('with RAW data', accuracy_dc, accuracy_RFC, accuracy_sv, accuracy_sgd, accuracy_lr, sep='\n')

with RAW data
0.9166666666666666
0.9722222222222222
0.6111111111111112
0.6111111111111112
0.9166666666666666


In [165]:
print('with SCALED data', accuracy_dc, accuracy_RFC, accuracy_sv, accuracy_sgd, accuracy_lr, sep='\n', end='\n\n')

with SCALED data
0.9166666666666666
0.9722222222222222
0.9444444444444444
0.9444444444444444
0.9722222222222222



# BREAST CANCER

### Load Data and generate scaled train & test data sets

In [166]:
bc_data= load_breast_cancer()
bc_data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [167]:
data= bc_data['data']
label= bc_data['target']

In [168]:
bc_data['feature_names']

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [169]:
scaler= MinMaxScaler()

scaled_data= scaler.fit_transform(data)

In [170]:
X_train, X_test, y_train, y_test = train_test_split(scaled_data, label, test_size=0.2, random_state= 32)

### Decision Tree

In [171]:
decision_tree= DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)

y_pred_dc= decision_tree.predict(X_test)

accuracy_dc= accuracy_score(y_test, y_pred_dc)
class_report_dc= classification_report(y_test, y_pred_dc)

In [111]:
print(accuracy_dc, '\n', class_report_dc) # raw data

0.8771929824561403 
               precision    recall  f1-score   support

           0       0.84      0.84      0.84        44
           1       0.90      0.90      0.90        70

    accuracy                           0.88       114
   macro avg       0.87      0.87      0.87       114
weighted avg       0.88      0.88      0.88       114



In [57]:
print(accuracy_dc, '\n', class_report_dc) # scaled data

0.8771929824561403 
               precision    recall  f1-score   support

           0       0.84      0.84      0.84        44
           1       0.90      0.90      0.90        70

    accuracy                           0.88       114
   macro avg       0.87      0.87      0.87       114
weighted avg       0.88      0.88      0.88       114



### Random Forest

In [172]:
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)

y_pred_RFC= random_forest.predict(X_test)

accuracy_RFC= accuracy_score(y_test, y_pred_RFC)
class_report_RFC= classification_report(y_test, y_pred_RFC)

In [113]:
print(accuracy_RFC, '\n', class_report_RFC) ## raw data

0.9385964912280702 
               precision    recall  f1-score   support

           0       0.91      0.93      0.92        44
           1       0.96      0.94      0.95        70

    accuracy                           0.94       114
   macro avg       0.93      0.94      0.94       114
weighted avg       0.94      0.94      0.94       114



In [60]:
print(accuracy_RFC, '\n', class_report_RFC) ## scaled data

0.9385964912280702 
               precision    recall  f1-score   support

           0       0.91      0.93      0.92        44
           1       0.96      0.94      0.95        70

    accuracy                           0.94       114
   macro avg       0.93      0.94      0.94       114
weighted avg       0.94      0.94      0.94       114



### SVM

In [173]:
svm_model = svm.SVC(random_state=32)
svm_model.fit(X_train, y_train)
y_pred_sv = svm_model.predict(X_test)

accuracy_sv= accuracy_score(y_test, y_pred_sv)
class_report_sv= classification_report(y_test, y_pred_sv)

In [115]:
print(accuracy_sv, '\n', class_report_sv) ## raw data

0.868421052631579 
               precision    recall  f1-score   support

           0       0.91      0.73      0.81        44
           1       0.85      0.96      0.90        70

    accuracy                           0.87       114
   macro avg       0.88      0.84      0.85       114
weighted avg       0.87      0.87      0.86       114



In [63]:
print(accuracy_sv, '\n', class_report_sv) ## scaled data

0.9912280701754386 
               precision    recall  f1-score   support

           0       1.00      0.98      0.99        44
           1       0.99      1.00      0.99        70

    accuracy                           0.99       114
   macro avg       0.99      0.99      0.99       114
weighted avg       0.99      0.99      0.99       114



### SGD Classifier

In [174]:
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred_sgd = sgd_model.predict(X_test)

accuracy_sgd = accuracy_score(y_test, y_pred_sgd)
class_report_sgd = classification_report(y_test, y_pred_sgd)

In [118]:
print(accuracy_sgd, '\n', class_report_sgd) ## raw data

0.8859649122807017 
               precision    recall  f1-score   support

           0       0.94      0.75      0.84        44
           1       0.86      0.97      0.91        70

    accuracy                           0.89       114
   macro avg       0.90      0.86      0.87       114
weighted avg       0.89      0.89      0.88       114



In [66]:
print(accuracy_sgd, '\n', class_report_sgd) ## scaled data

0.9912280701754386 
               precision    recall  f1-score   support

           0       1.00      0.98      0.99        44
           1       0.99      1.00      0.99        70

    accuracy                           0.99       114
   macro avg       0.99      0.99      0.99       114
weighted avg       0.99      0.99      0.99       114



### Logistic Regression

In [175]:
logistic_model = LogisticRegression(random_state=32, max_iter=100)
logistic_model.fit(X_train, y_train)
y_pred_lr = logistic_model.predict(X_test)

accuracy_lr = accuracy_score(y_test, y_pred_lr)
class_report_lr = classification_report(y_test, y_pred_lr)



# log_model = LogisticRegression(solver='lbfgs', max_iter=1000)

In [176]:
print(set(y_test))
print(set(y_pred_lr))


{0, 1}
{0, 1}


In [121]:
print(accuracy_lr, '\n', class_report_lr) # raw data

0.9122807017543859 
               precision    recall  f1-score   support

           0       0.89      0.89      0.89        44
           1       0.93      0.93      0.93        70

    accuracy                           0.91       114
   macro avg       0.91      0.91      0.91       114
weighted avg       0.91      0.91      0.91       114



In [70]:
print(accuracy_lr, '\n', class_report_lr) # scaled data

0.9649122807017544 
               precision    recall  f1-score   support

           0       0.98      0.93      0.95        44
           1       0.96      0.99      0.97        70

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



In [122]:
print('with RAW data', accuracy_dc, accuracy_RFC, accuracy_sv, accuracy_sgd, accuracy_lr, sep='\n')

with RAW data
0.8771929824561403
0.9385964912280702
0.868421052631579
0.8859649122807017
0.9122807017543859


In [177]:
print('with SCALED data', accuracy_dc, accuracy_RFC, accuracy_sv, accuracy_sgd, accuracy_lr, sep='\n', end='\n\n')

with SCALED data
0.8771929824561403
0.9385964912280702
0.9912280701754386
0.9736842105263158
0.9649122807017544



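## Retrospective
#### Data usage
- Each data set was split into 80% training data and 20% test data.
- Each data set was scaled to the [0, 1] range with MinMaxScaler() before training.
- Results from models trained on the unscaled raw data are compared alongside the scaled results.
- With raw data, the SVM and logistic regression models raised warnings; these are described in the warning section below.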
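
The split-and-scale procedure described above can be sketched as a single pipeline. This is a variant under stated assumptions rather than the notebook's exact code: it fits `MinMaxScaler` on the training split only, whereas the notebook scales the full data set before splitting, and the helper name `compare_raw_vs_scaled` is hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def compare_raw_vs_scaled(X, y, seed=32):
    """Return (raw_accuracy, scaled_accuracy) for an SVC on one 80/20 split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Model trained directly on the unscaled features.
    raw_model = SVC(random_state=seed).fit(X_train, y_train)

    # Scaler and model chained together: the scaler learns min/max from
    # X_train only, then applies the same transform to X_test.
    scaled_model = make_pipeline(MinMaxScaler(), SVC(random_state=seed))
    scaled_model.fit(X_train, y_train)

    return raw_model.score(X_test, y_test), scaled_model.score(X_test, y_test)


X, y = load_wine(return_X_y=True)
raw_acc, scaled_acc = compare_raw_vs_scaled(X, y)
print(f"raw: {raw_acc:.4f}  scaled: {scaled_acc:.4f}")
```

Keeping the scaler inside the pipeline means the test split never influences the learned minimum and maximum, so the scaled accuracy may differ slightly from the values reported in this notebook.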

### 1) Digits data:
#### DIGIT Accuracy
| MODEL               | raw data | scaled data |
|:--------------------|:--------:|:-----------:|
| Decision Tree       |  0.8750  |   0.8750    |
| Random Forest       |  0.9861  |   0.9861    |
| SVM                 |  0.9944  |   0.9944    |
| SGD Classifier      |  0.9722  |   0.9750    |
| Logistic Regression |  0.9694  |   0.9722    |

#### DIGIT Precision (weighted avg)
| MODEL               | raw data | scaled data |
|:--------------------|:--------:|:-----------:|
| Decision Tree       |   0.88   |    0.88     |
| Random Forest       |   0.99   |    0.99     |
| SVM                 |   0.99   |    0.99     |
| SGD Classifier      |   0.97   |    0.98     |
| Logistic Regression |   0.97   |    0.97     |

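- The digits data consists of 8x8 images, with 174 to 183 images per class, so the classes are evenly distributed. For digits, true positives (TP) matter most, so precision appears to be an appropriate metric.
- In terms of accuracy, SGD Classifier and Logistic Regression differ slightly between raw and scaled data, but precision shows no large difference.
- Random Forest improved on Decision Tree by about 11 percentage points in accuracy (0.875 to 0.986).
- On this data set, SVM had the highest accuracy (0.9944), and its precision (0.99) matched the best value in the table.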
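
The Decision Tree versus Random Forest gap quoted above is an absolute difference in accuracy, which is not the same as a relative improvement. A small worked example, using the accuracies from the table and a hypothetical helper named `improvement`, makes the two readings explicit:

```python
def improvement(baseline, improved):
    """Return (percentage-point gain, relative gain in percent)."""
    points = (improved - baseline) * 100
    relative = (improved - baseline) / baseline * 100
    return points, relative


# Digits accuracies copied from the results above.
decision_tree_acc = 0.875
random_forest_acc = 0.9861111111111112

points, relative = improvement(decision_tree_acc, random_forest_acc)
print(f"{points:.1f} percentage points, {relative:.1f}% relative")
# prints "11.1 percentage points, 12.7% relative"
```

The same helper can be applied to the wine and breast cancer accuracies to check the gains quoted in those sections.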

### 2) Wine data:
#### WINE Accuracy
| MODEL               | raw data | scaled data |
|:--------------------|:--------:|:-----------:|
| Decision Tree       |  0.9167  |   0.9167    |
| Random Forest       |  0.9722  |   0.9722    |
| SVM                 |  0.6111  |   0.9444    |
| SGD Classifier      |  0.6111  |   0.9722    |
| Logistic Regression |  0.9167  |   0.9722    |


- The wine data has three classes (class 0: 59, class 1: 71, class 2: 48 samples), and each sample has 13 features. The magnesium and proline values are much larger than the other features, so a preprocessing method that accounts for this is needed.
- Wine is a product whose taste varies sensitively with its characteristics, so both TP and TN matter and accuracy is a suitable metric.
- Decision Tree and Random Forest show no difference between raw and scaled data. SVM, SGD and Logistic Regression improve clearly with scaling.
- On the wine data too, Random Forest improved on Decision Tree (by about 5.6 percentage points).
- With scaled data, Random Forest, SGD and Logistic Regression reached the same accuracy (0.9722).
- Further experiments and analysis with SVM, SGD and Logistic Regression are needed to choose a suitable model for the wine data.
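
The difference in feature magnitudes mentioned above can be inspected directly. The following sketch prints the raw range of every wine feature and then checks that `MinMaxScaler` matches the column-wise formula `(x - min) / (max - min)`. The variable names `scaled` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler

wine = load_wine()
X = wine.data

# Per-feature minimum and maximum of the raw data.
for name, lo, hi in zip(wine.feature_names, X.min(axis=0), X.max(axis=0)):
    print(f"{name:30s} min={lo:10.2f}  max={hi:10.2f}")

# MinMaxScaler applies (x - min) / (max - min) to each column separately.
scaled = MinMaxScaler().fit_transform(X)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(scaled, manual))
```

After scaling, every column spans 0 to 1, so features such as proline no longer dominate distance-based or gradient-based models like SVM and SGD.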


### 3) Breast Cancer data:
#### BREAST CANCER Accuracy
| MODEL               | raw data | scaled data |
|:--------------------|:--------:|:-----------:|
| Decision Tree       |  0.8772  |   0.8772    |
| Random Forest       |  0.9386  |   0.9386    |
| SVM                 |  0.8684  |   0.9912    |
| SGD Classifier      |  0.8860  |   0.9912    |
| Logistic Regression |  0.9123  |   0.9649    |

#### BREAST CANCER Recall (weighted avg)
| MODEL               | raw data | scaled data |
|:--------------------|:--------:|:-----------:|
| Decision Tree       |   0.88   |    0.88     |
| Random Forest       |   0.94   |    0.94     |
| SVM                 |   0.87   |    0.99     |
| SGD Classifier      |   0.89   |    0.99     |
| Logistic Regression |   0.91   |    0.96     |

- The breast cancer data has 569 samples in two classes, WDBC-Malignant and WDBC-Benign, with 30 features.
- Cancer patients must not be missed, so recall is the important metric.
- On the breast cancer data too, Random Forest improved on Decision Tree (by about 6 percentage points).
- Decision Tree and Random Forest show no difference between raw and scaled data. SVM, SGD and Logistic Regression improve clearly with scaling.
- With scaled data, SVM and SGD performed best, with about 99.12% accuracy and 0.99 recall.
- Further experiments and analysis with SVM and SGD are needed to choose a suitable model for the breast cancer data.
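
The recall values in the table above match the "weighted avg" rows of the reports, and weighted-average recall always equals accuracy. For the goal of not missing cancer patients, the recall of the malignant class alone is probably the more telling number. The sketch below illustrates this with counts reconstructed from the raw-data SVM report (an assumption derived from its precision, recall and support columns), and relies on `load_breast_cancer` encoding malignant as class 0.

```python
from sklearn.metrics import accuracy_score, recall_score

# Labels follow load_breast_cancer: 0 = malignant, 1 = benign.
# Counts are reconstructed from the raw-data SVM report (assumption):
# 32 of 44 malignant and 67 of 70 benign samples classified correctly.
y_true = [0] * 44 + [1] * 70
y_pred = [0] * 32 + [1] * 12 + [1] * 67 + [0] * 3

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")
malignant_recall = recall_score(y_true, y_pred, pos_label=0)

print(f"accuracy:         {accuracy:.4f}")
print(f"weighted recall:  {weighted_recall:.4f}")
print(f"malignant recall: {malignant_recall:.4f}")
```

In this reconstruction the weighted recall equals the accuracy (about 0.87), while the malignant-class recall is only about 0.73, so the per-class value gives a stricter view of missed cancer cases.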

#### Warnings
- On all three data sets, training logistic regression on raw data raised a warning about max_iter. It appears to be caused by the magnitude of the feature values; raising max_iter until the warning disappeared did not change model performance.
- With the unscaled raw wine data, calling classification_report for the SVM model raised UndefinedMetricWarning. Every test sample was classified as class 0 or class 1, so the precision, recall and F1 values for class 2 are all 0. This means the model did not learn properly from the raw data, which also shows in the result table (accuracy of about 61.1%).
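
The UndefinedMetricWarning case can be reproduced without retraining. The sketch below uses predictions reconstructed from the raw-data SVM wine report (an assumption based on its precision, recall and support columns), in which class 2 is never predicted. Passing `zero_division=0` to `classification_report` reports the undefined precision as 0.0 and suppresses the warning.

```python
from sklearn.metrics import classification_report

# Predictions reconstructed from the raw-data SVM wine report (assumption):
# class 2 is never predicted, so its precision has a zero denominator.
y_true = [0] * 16 + [1] * 10 + [2] * 10
y_pred = (
    [0] * 13 + [1] * 3      # true class 0
    + [1] * 9 + [0] * 1     # true class 1
    + [1] * 10              # true class 2, all predicted as class 1
)

# zero_division=0 reports the undefined precision as 0.0 without a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print(sorted(set(y_pred)))
print(report["2"]["precision"], round(report["accuracy"], 4))
```

Comparing `set(y_test)` with `set(y_pred)`, as the notebook does for logistic regression, is a quick way to spot a class that the model never predicts.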