## Wine

## 1. Importing modules

In [1]:
from sklearn.datasets import load_wine   ## dataset loader
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd


## 2. Preparing the data, 3. Understanding the data

In [2]:
wine = load_wine()
wine_data = wine.data
wine_label = wine.target
wine.target_names
print("< Feature names >")
print(wine.feature_names)
print("\n< Target names >")
print(wine.target_names)
print("\n< Data shape >")
print(wine_data.shape)
print("\n< Label shape >")
print(wine_label.shape)
print(wine.DESCR)

< Feature names >
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

< Target names >
['class_0' 'class_1' 'class_2']

< Data shape >
(178, 13)

< Label shape >
(178,)
.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
  

## 4. Splitting the data into train and test sets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(wine_data, wine_label,
                                                    test_size=0.5, random_state=8)   ## test_size was varied across runs
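The split above is not stratified, so the class proportions in the train and test halves can drift apart. As a minimal sketch of one alternative (not what the notebook ran), passing `stratify` to `train_test_split` keeps the class counts balanced across both halves. The variable names `full_counts`, `train_counts`, and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X, y = wine.data, wine.target

# Number of samples per class in the full dataset
full_counts = np.bincount(y)

# stratify=y asks the splitter to preserve class proportions in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=8, stratify=y
)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("full :", full_counts)
print("train:", train_counts)
print("test :", test_counts)
```

Comparing the printed counts with the `support` column of the reports below shows how far an unstratified split deviates from the overall class distribution.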

## 5. Training each model

In [4]:
### Decision tree
from sklearn.tree import DecisionTreeClassifier   ## import the model

decision_tree = DecisionTreeClassifier(random_state=32)   ## instantiate the model
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

## The report is computed on the held-out test set
print("------------------------------Test result------------------------------")
print(classification_report(y_test, y_pred))

------------------------------Test result------------------------------
              precision    recall  f1-score   support

           0       1.00      0.87      0.93        30
           1       0.85      0.97      0.91        36
           2       0.95      0.91      0.93        23

    accuracy                           0.92        89
   macro avg       0.94      0.92      0.92        89
weighted avg       0.93      0.92      0.92        89



In [5]:
### Random Forest
from sklearn.ensemble import RandomForestClassifier   ## import the model

random_forest = RandomForestClassifier(random_state=32)   ## instantiate the model
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

## The report is computed on the held-out test set
print("------------------------------Test result------------------------------")
print(classification_report(y_test, y_pred))

------------------------------Test result------------------------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       1.00      1.00      1.00        36
           2       1.00      1.00      1.00        23

    accuracy                           1.00        89
   macro avg       1.00      1.00      1.00        89
weighted avg       1.00      1.00      1.00        89



In [6]:
### Support Vector Machine
from sklearn import svm   ## import the model

svm_model = svm.SVC(random_state=32)   ## instantiate the model
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

## The report is computed on the held-out test set
print("------------------------------Test result------------------------------")
print(classification_report(y_test, y_pred, zero_division=1))


------------------------------Test result------------------------------
              precision    recall  f1-score   support

           0       0.87      0.90      0.89        30
           1       0.59      0.94      0.72        36
           2       1.00      0.00      0.00        23

    accuracy                           0.69        89
   macro avg       0.82      0.61      0.54        89
weighted avg       0.79      0.69      0.59        89
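In the SVM report above, class 2 has precision 1.00 but recall 0.00. That combination is what `zero_division=1` produces when a class is never predicted: precision becomes 0/0 and scikit-learn substitutes the given value. The following sketch uses made-up toy labels (not the wine data) to show the effect with `precision_score`.

```python
from sklearn.metrics import precision_score

# Toy labels, invented for illustration: the classifier never predicts class 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1]

# Precision for class 2 is 0/0, so the zero_division value is substituted
prec_as_one = precision_score(y_true, y_pred, average=None, zero_division=1)
prec_as_zero = precision_score(y_true, y_pred, average=None, zero_division=0)

print("zero_division=1:", prec_as_one)
print("zero_division=0:", prec_as_zero)
```

With `zero_division=1` the macro and weighted precision averages are inflated by the substituted value, so recall and f1-score are the more informative columns for that class.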



In [7]:
### Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier   ## import the model

sgd_model = SGDClassifier(random_state=32)   ## instantiate the model
sgd_model.fit(X_train, y_train)
y_pred = sgd_model.predict(X_test)

## The report is computed on the held-out test set
print("------------------------------Test result------------------------------")
print(classification_report(y_test, y_pred, zero_division=1))


------------------------------Test result------------------------------
              precision    recall  f1-score   support

           0       1.00      0.73      0.85        30
           1       0.83      0.28      0.42        36
           2       0.38      0.91      0.54        23

    accuracy                           0.60        89
   macro avg       0.74      0.64      0.60        89
weighted avg       0.77      0.60      0.59        89



In [8]:
### Logistic Regression
from sklearn.linear_model import LogisticRegression   ## import the model

logistic_regression = LogisticRegression(random_state=32, max_iter=4000)   ## instantiate the model
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

## The report is computed on the held-out test set
print("------------------------------Test result------------------------------")
print(classification_report(y_test, y_pred))

------------------------------Test result------------------------------
              precision    recall  f1-score   support

           0       0.96      0.90      0.93        30
           1       0.89      0.94      0.92        36
           2       0.96      0.96      0.96        23

    accuracy                           0.93        89
   macro avg       0.94      0.93      0.94        89
weighted avg       0.93      0.93      0.93        89



## 6. Evaluation and retrospective

#### DT: Decision Tree, RF: Random Forest, SVM: Support Vector Machine, SGD: Stochastic Gradient Descent, LR: Logistic Regression
### Prediction accuracy on the test set, by model, is listed below.

#### Test size: 0.1
#### DT: 83%, RF: 94%, SVM: 56%, SGD: 72%, LR: 83%

#### Test size: 0.2
#### DT: 94%, RF: 100%, SVM: 64%, SGD: 58%, LR: 92%

#### Test size: 0.3
#### DT: 87%, RF: 100%, SVM: 65%, SGD: 56%, LR: 93%

#### Test size: 0.4
#### DT: 97%, RF: 100%, SVM: 69%, SGD: 61%, LR: 93%

#### Test size: 0.5
#### DT: 92%, RF: 100%, SVM: 69%, SGD: 60%, LR: 93%
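The figures above were collected by editing `test_size` by hand and rerunning the cells. A loop can produce the same kind of table in one pass. This is a sketch only: it assumes `random_state=8` for every split and the same model settings as the cells in section 5, and the helper `make_models` and the `results` dictionary are names introduced here, not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)


def make_models():
    """Return fresh, unfitted estimators (hypothetical helper for this sketch)."""
    return {
        "DT": DecisionTreeClassifier(random_state=32),
        "RF": RandomForestClassifier(random_state=32),
        "SVM": SVC(random_state=32),
        "SGD": SGDClassifier(random_state=32),
        "LR": LogisticRegression(random_state=32, max_iter=4000),
    }


test_sizes = (0.1, 0.2, 0.3, 0.4, 0.5)
results = {}  # maps (test_size, model name) to test accuracy

for test_size in test_sizes:
    # Assumption: the same split seed is used for every test size
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=8
    )
    for name, model in make_models().items():
        model.fit(X_train, y_train)
        results[(test_size, name)] = accuracy_score(y_test, model.predict(X_test))

for test_size in test_sizes:
    row = ", ".join(
        f"{name}: {results[(test_size, name)]:.0%}" for name in make_models()
    )
    print(f"Test size {test_size}: {row}")
```

Exact numbers may differ from the list above depending on the scikit-learn version and on the seeds used in the original runs.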

### RF scored highest at every test size and reached 100% for test sizes 0.2 through 0.5, making it the best model on this dataset.
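### Accuracy on this dataset varied more with test size than it did on Digits. The number of features may play a role, but I do not have evidence for that yet and it needs further study. One factor that can be checked directly: the dataset has only 178 samples, so a test size of 0.1 leaves 18 test samples and each misclassification moves accuracy by about 5.6 percentage points.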
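When the score depends this much on a single split, k-fold cross-validation gives a less split-dependent estimate, because every sample is used for testing exactly once. The sketch below is an assumption about how one might follow up, not something the notebook ran. It uses `StratifiedKFold` with 5 folds and the Random Forest settings from section 5. The fold count and the seed are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_wine(return_X_y=True)

# 5 stratified folds; shuffle so the folds do not follow the dataset's class ordering
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

# One accuracy value per fold
scores = cross_val_score(RandomForestClassifier(random_state=32), X, y, cv=cv)

print("per-fold accuracy:", scores)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of how much of the variation between test sizes could be split noise rather than a real effect.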
### DT fluctuated considerably and appears to be strongly affected by randomness in the paths the tree takes. This matches the trend seen on Digits.
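To separate the two sources of randomness for the decision tree, one can vary the split seed while holding the tree seed fixed, and then do the reverse. The sketch below assumes a test size of 0.3 and ten seeds, both arbitrary. The list names `split_seed_scores` and `tree_seed_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Case 1: vary the train/test split, keep the tree's own seed fixed
split_seed_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    tree = DecisionTreeClassifier(random_state=32).fit(X_tr, y_tr)
    split_seed_scores.append(tree.score(X_te, y_te))

# Case 2: keep the split fixed, vary the tree's seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=8)
tree_seed_scores = []
for seed in range(10):
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    tree_seed_scores.append(tree.score(X_te, y_te))

print(f"split varies: mean={np.mean(split_seed_scores):.3f}, std={np.std(split_seed_scores):.3f}")
print(f"tree varies : mean={np.mean(tree_seed_scores):.3f}, std={np.std(tree_seed_scores):.3f}")
```

Whichever case shows the larger spread points to the dominant source of the fluctuation.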
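### RF appears more stable than DT: going from test size 0.1 to 0.2 it rose 6 percentage points to 100% and stayed at 100% afterwards. It is notable that RF is the best model here, whereas SVM was the best earlier. SVM is said to be accurate on nonlinear problems, so perhaps the data here are more linearly structured.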
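Another explanation worth testing for the weak SVM and SGD scores is feature scale. The 13 wine features are on very different scales (proline values are in the hundreds or thousands, while hue is around 1), and both the RBF-kernel `SVC` and `SGDClassifier` are sensitive to unscaled inputs. Tree-based models are not. The sketch below is an assumption about a possible follow-up, not part of the original experiment. It compares each model with and without a `StandardScaler` step on the same 0.5 split.

```python
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=8
)

estimators = {
    "SVM": SVC(random_state=32),
    "SGD": SGDClassifier(random_state=32),
}

scores = {}  # maps model name to (raw accuracy, scaled accuracy)
for name, estimator in estimators.items():
    # Fit on raw features
    raw_model = clone(estimator).fit(X_train, y_train)
    # Fit on standardized features; the scaler is learned from the training set only
    scaled_model = make_pipeline(StandardScaler(), clone(estimator)).fit(X_train, y_train)
    scores[name] = (raw_model.score(X_test, y_test), scaled_model.score(X_test, y_test))

for name, (raw_acc, scaled_acc) in scores.items():
    print(f"{name}: raw={raw_acc:.2f}, scaled={scaled_acc:.2f}")
```

Putting the scaler inside a pipeline keeps test-set statistics out of the fit, so the comparison stays fair. If the scaled scores rise sharply, scaling rather than linearity would explain much of the gap to RF.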