# 12. SVM

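SVM tends to predict accurately and is easy to apply to many kinds of data.  

It can be used for both classification and regression problems.  

**Logistic regression**: estimates the conditional probability of the output value. 

**Support vector machine**: predicts the class label directly, without estimating probabilities.  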
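
The contrast between the two models can be seen in scikit-learn itself. The following is only an illustrative sketch, not part of the original notes: it assumes the built-in iris data from `sklearn.datasets.load_iris`, and the variable names are made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Logistic regression models P(class | x), so probabilities are always available.
logit = LogisticRegression(max_iter=1000).fit(X, y)
logit_proba = logit.predict_proba(X[:3])    # shape (3, 3), each row sums to 1

# A plain SVC learns decision boundaries only: it returns decision scores
# and labels, and offers no predict_proba unless probability=True is set.
svc = SVC(kernel="linear").fit(X, y)
svc_scores = svc.decision_function(X[:3])   # shape (3, 3), not probabilities
svc_labels = svc.predict(X[:3])
has_proba = hasattr(svc, "predict_proba")   # False for probability=False

print(logit_proba.round(3))
print(svc_scores.round(3))
print(svc_labels, has_proba)
```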

## 1. Understanding SVM

[Reference](https://hleecaster.com/ml-svm-concept/)




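## 2. SVM with the sklearn module

### 2-1. Train set / Test set split

Use the train_test_split() function from the sklearn.model_selection module.

> from sklearn.model_selection import train_test_split

> train object name, test object name = train_test_split(dataset, test_size=fraction, random_state=seed, stratify=output variable)

The data is split into Train and Test sets by stratified sampling on the output variable to be predicted, so both sets keep the same class proportions.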
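
The effect of `stratify` can be checked by counting classes on each side of the split. This is a minimal sketch that assumes the built-in iris data from `sklearn.datasets.load_iris` (50 rows per class) and splits arrays rather than a DataFrame; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 150 rows, 3 classes of 50 each

# stratify=y keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()
print(train_counts)   # [35, 35, 35]
print(test_counts)    # [15, 15, 15]
```

Without `stratify`, the per-class counts would vary with the random seed.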

### 2-2. Fitting an SVM on the Train set

Use the SVC() function from the sklearn.svm module.

> from sklearn.svm import SVC

> SVC(kernel='linear' | 'rbf' | 'poly' | 'sigmoid', C=regularization value, gamma=kernel coefficient, probability=True | False).fit(input variables, output variable)

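* kernel option
    - Chooses between a linear and a nonlinear SVM
    - 'linear': linear kernel (linear)
    - 'rbf': radial basis function kernel (nonlinear); this is the default
    - 'poly': polynomial kernel of degree p (nonlinear)
    - 'sigmoid': sigmoid kernel (nonlinear)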
    
    
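* C option
    - Hyperparameter that sets the penalty on the slack variables (margin violations); a larger C tolerates fewer violations
    - Accepts a positive real value (default 1.0)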
    
    
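* gamma option
    - Kernel coefficient that controls how nonlinear the decision boundary is
    - Used when the kernel is rbf, poly, or sigmoid
    - Accepts a real value, or 'scale' (default) or 'auto'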
    
    
* probability option
    - Specifies whether to enable probability estimation
    - Defaults to False, so no probabilities are computed
    - Set it to True to obtain predicted class probabilities with the predict_proba() function
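
The option defaults can be read from an unconfigured `SVC`, and the kernels can be compared by fitting each one on the same split. This is a rough sketch under stated assumptions: it uses the built-in iris data from `sklearn.datasets.load_iris` rather than the seaborn copy, `gamma="scale"` for every kernel, and training accuracy only, so it says nothing about generalization.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Defaults of an unconfigured SVC (scikit-learn 0.22 or later)
defaults = SVC().get_params()
print(defaults["kernel"], defaults["C"], defaults["gamma"], defaults["probability"])
# rbf 1.0 scale False

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Fit one model per kernel and record its accuracy on the training data
train_acc = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    model = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_train, y_train)
    train_acc[kernel] = model.score(X_train, y_train)

print(train_acc)
```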
    
    
> Predicted values: SVM object.predict(input variables)

> Probabilities: SVM object.predict_proba(input variables)
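
A short sketch of the two calls, assuming the built-in iris data from `sklearn.datasets.load_iris` and a model fitted with `probability=True`; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True is required for predict_proba to be available
svm = SVC(kernel="linear", C=1.0, probability=True, random_state=1).fit(X, y)

labels = svm.predict(X[:5])        # one predicted class label per row
proba = svm.predict_proba(X[:5])   # one column per class, in svm.classes_ order

print(svm.classes_)
print(labels)
print(proba.round(3))
```

The probabilities come from an extra calibration step fitted with internal cross-validation, so fitting is slower and the most probable class can occasionally differ from what `predict` returns.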


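Check the confusion matrix, accuracy, and classification metrics used to evaluate the SVM model.

> from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

> Confusion matrix: confusion_matrix(output variable, predicted output variable)

> Accuracy: accuracy_score(output variable, predicted output variable)

> Classification metrics: classification_report(output variable, predicted output variable)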
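
All three functions take the actual values first and the predictions second. The toy example below, with made-up labels, is a sketch of why the order matters: swapping the arguments transposes the confusion matrix (and swaps precision with recall in the report), while accuracy is unaffected.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels, chosen so that the confusion matrix is not symmetric
y_true = ["a", "a", "a", "b", "b"]
y_pred = ["a", "a", "b", "b", "b"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=["a", "b"])
print(cm.tolist())          # [[2, 1], [0, 2]]

# Swapping the arguments transposes the matrix: rows become predictions
cm_swapped = confusion_matrix(y_pred, y_true, labels=["a", "b"])
print(cm_swapped.tolist())  # [[2, 0], [1, 2]]

acc = accuracy_score(y_true, y_pred)   # 4 of 5 correct
report = classification_report(y_true, y_pred)
print(acc)
print(report)
```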

### 2-3. Fitting a linear SVM

In [1]:
import seaborn as sns

In [2]:
df = sns.load_dataset('iris')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
from sklearn.model_selection import train_test_split

In [5]:
train, test = train_test_split(df, test_size=0.3, random_state=1, stratify=df['species'])

In [6]:
y_train = train['species']
X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]

y_test = test['species']
X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]

In [7]:
from sklearn.svm import SVC

In [8]:
svm = SVC(kernel='linear', C=1.0, probability=True, random_state=1).fit(X_train, y_train)

In [9]:
svm.predict_proba(X_train)[:5]

array([[0.97679222, 0.01152322, 0.01168456],
       [0.93280957, 0.04645169, 0.02073874],
       [0.01531616, 0.03733592, 0.94734792],
       [0.00964874, 0.02514918, 0.96520208],
       [0.96911164, 0.02044402, 0.01044434]])

In [10]:
y_train_pred = svm.predict(X_train)

In [11]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [12]:
confusion_matrix(y_train, y_train_pred)

array([[35,  0,  0],
       [ 0, 33,  2],
       [ 0,  1, 34]])

In [13]:
accuracy_score(y_train, y_train_pred)

0.9714285714285714

In [14]:
svm_report = classification_report(y_train, y_train_pred)

In [15]:
print(svm_report)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        35
  versicolor       0.97      0.94      0.96        35
   virginica       0.94      0.97      0.96        35

    accuracy                           0.97       105
   macro avg       0.97      0.97      0.97       105
weighted avg       0.97      0.97      0.97       105



In [16]:
y_test_pred = svm.predict(X_test)

In [17]:
confusion_matrix(y_test, y_test_pred)

array([[15,  0,  0],
       [ 0, 15,  0],
       [ 0,  1, 14]])

In [18]:
accuracy_score(y_test, y_test_pred)

0.9777777777777777

In [19]:
svm_report1 = classification_report(y_test, y_test_pred)

In [20]:
print(svm_report1)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.94      1.00      0.97        15
   virginica       1.00      0.93      0.97        15

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45



### 2-4. Fitting a nonlinear SVM

In [21]:
from sklearn.svm import SVC

In [22]:
svm = SVC(kernel='rbf', C=1.0, gamma=0.2, probability=True, random_state=1).fit(X_train, y_train)

In [23]:
svm.predict_proba(X_train)[:5]

array([[0.94025384, 0.03167803, 0.02806813],
       [0.94990912, 0.03161942, 0.01847146],
       [0.01358309, 0.0313825 , 0.95503441],
       [0.0115695 , 0.01544497, 0.97298553],
       [0.96377102, 0.01889853, 0.01733045]])

In [24]:
y_train_pred = svm.predict(X_train)

In [25]:
confusion_matrix(y_train, y_train_pred)

array([[35,  0,  0],
       [ 0, 33,  2],
       [ 0,  2, 33]])

In [26]:
accuracy_score(y_train, y_train_pred)

0.9619047619047619

In [27]:
svm_report2 = classification_report(y_train, y_train_pred)

In [28]:
print(svm_report2)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        35
  versicolor       0.94      0.94      0.94        35
   virginica       0.94      0.94      0.94        35

    accuracy                           0.96       105
   macro avg       0.96      0.96      0.96       105
weighted avg       0.96      0.96      0.96       105



In [29]:
y_test_pred = svm.predict(X_test)

In [30]:
confusion_matrix(y_test, y_test_pred)

array([[15,  0,  0],
       [ 0, 15,  0],
       [ 0,  1, 14]])

In [31]:
accuracy_score(y_test, y_test_pred)

0.9777777777777777

In [32]:
svm_report3 = classification_report(y_test, y_test_pred)

In [33]:
print(svm_report3)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.94      1.00      0.97        15
   virginica       1.00      0.93      0.97        15

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45



## Exercise

### [Assignment 12] Fitting an SVM
Using the penguins dataset from the seaborn module, fit an SVM that predicts species from bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.  

In [34]:
df = sns.load_dataset('penguins')

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [36]:
import pandas as pd

In [37]:
df = df.dropna(subset = ['species', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], 
               how = 'any', axis = 0)

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 342 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            342 non-null    object 
 1   island             342 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 21.4+ KB


In [39]:
train, test = train_test_split(df, test_size=0.3, random_state=123, stratify=df['species'])

In [40]:
X_train = train[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y_train = train['species']

X_test = test[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y_test = test['species']


In [41]:
svm = SVC(kernel='linear', C=1.0, random_state=1, probability=True).fit(X_train, y_train)

In [42]:
svm.predict_proba(X_train)[:5]

array([[9.10672441e-01, 8.83934367e-02, 9.34122472e-04],
       [1.66325528e-02, 9.16181052e-01, 6.71863952e-02],
       [3.33436832e-02, 9.36650295e-01, 3.00060215e-02],
       [9.93876464e-01, 5.45886358e-03, 6.64671981e-04],
       [9.21366758e-01, 7.38168416e-02, 4.81640003e-03]])

In [43]:
y_train_pred = svm.predict(X_train)

In [44]:
confusion_matrix(y_train, y_train_pred)

array([[105,   1,   0],
       [  0,  47,   0],
       [  0,   0,  86]])

In [45]:
accuracy_score(y_train, y_train_pred)

0.99581589958159

In [46]:
svm_report = classification_report(y_train, y_train_pred)

In [47]:
print(svm_report)

              precision    recall  f1-score   support

      Adelie       1.00      0.99      1.00       106
   Chinstrap       0.98      1.00      0.99        47
      Gentoo       1.00      1.00      1.00        86

    accuracy                           1.00       239
   macro avg       0.99      1.00      0.99       239
weighted avg       1.00      1.00      1.00       239



In [48]:
svm.predict_proba(X_test)[:5]

array([[1.54894932e-02, 9.70232681e-01, 1.42778260e-02],
       [9.90554604e-01, 9.31702638e-03, 1.28369214e-04],
       [2.11670916e-01, 7.28873177e-01, 5.94559069e-02],
       [9.90392386e-01, 5.74639287e-03, 3.86122150e-03],
       [2.18024796e-03, 9.88624224e-01, 9.19552796e-03]])

In [49]:
y_test_pred = svm.predict(X_test)

In [50]:
confusion_matrix(y_test, y_test_pred)

array([[45,  0,  0],
       [ 0, 21,  0],
       [ 0,  0, 37]])

In [51]:
accuracy_score(y_test, y_test_pred)

1.0

In [52]:
svm_report1 = classification_report(y_test, y_test_pred)

In [53]:
print(svm_report1)

              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        45
   Chinstrap       1.00      1.00      1.00        21
      Gentoo       1.00      1.00      1.00        37

    accuracy                           1.00       103
   macro avg       1.00      1.00      1.00       103
weighted avg       1.00      1.00      1.00       103



In [54]:
svm = SVC(kernel='rbf', C=1.0, gamma=0.2, random_state=123, probability=True)

In [55]:
svm.fit(X_train, y_train)

SVC(gamma=0.2, probability=True, random_state=123)

In [56]:
svm.predict_proba(X_train)[:5]

array([[0.79836388, 0.19655498, 0.00508114],
       [0.19738927, 0.7533416 , 0.04926912],
       [0.1952624 , 0.7561998 , 0.0485378 ],
       [0.85898021, 0.11214959, 0.0288702 ],
       [0.85904018, 0.11210355, 0.02885627]])

In [57]:
y_train_pred = svm.predict(X_train)

In [58]:
confusion_matrix(y_train, y_train_pred)

array([[106,   0,   0],
       [  1,  46,   0],
       [  0,   0,  86]])

In [59]:
accuracy_score(y_train, y_train_pred)

0.99581589958159

In [60]:
svm_report2 = classification_report(y_train, y_train_pred)

In [61]:
print(svm_report2)

              precision    recall  f1-score   support

      Adelie       0.99      1.00      1.00       106
   Chinstrap       1.00      0.98      0.99        47
      Gentoo       1.00      1.00      1.00        86

    accuracy                           1.00       239
   macro avg       1.00      0.99      0.99       239
weighted avg       1.00      1.00      1.00       239



In [62]:
y_test_pred = svm.predict(X_test)

In [63]:
confusion_matrix(y_test, y_test_pred)

array([[45,  0,  0],
       [19,  2,  0],
       [20,  0, 17]])

In [64]:
accuracy_score(y_test, y_test_pred)

0.6213592233009708

In [65]:
svm_report3 = classification_report(y_test, y_test_pred)

In [66]:
print(svm_report3)

              precision    recall  f1-score   support

      Adelie       0.54      1.00      0.70        45
   Chinstrap       1.00      0.10      0.17        21
      Gentoo       1.00      0.46      0.63        37

    accuracy                           0.62       103
   macro avg       0.85      0.52      0.50       103
weighted avg       0.80      0.62      0.57       103



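With the rbf kernel instead of linear, test accuracy drops sharply (0.62 versus 1.00) even though training accuracy stays at about 0.996. Most Chinstrap and Gentoo test samples are predicted as Adelie, so Adelie precision falls to 0.54 and recall falls to 0.10 for Chinstrap and 0.46 for Gentoo. In other words, the nonlinear SVM fitted on the Train set has overfit, and the option values must be adjusted to get reliable predictions.  

Methods for choosing the best option values can be found by searching for keywords such as 'cross-validation' and 'hyperparameter tuning'.  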
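
As a starting point, cross-validated tuning of `C` and `gamma` might look like the sketch below. Several things here are assumptions rather than part of the original notes: it uses the built-in iris data from `sklearn.datasets.load_iris` so that it runs without a download, the grid values are arbitrary, and it standardizes the inputs with `StandardScaler` first, a common step for rbf kernels when features are measured on very different scales (as with millimetres and grams in the penguins data).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Scaling is fitted inside each CV fold, so no test information leaks in
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names the SVC step "svc", hence the "svc__" prefix
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}

# 5-fold cross-validation on the Train set for every (C, gamma) pair
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
cv_score = search.best_score_               # mean CV accuracy of the best pair
test_score = search.score(X_test, y_test)   # accuracy on the held-out Test set

print(best_params, round(cv_score, 3), round(test_score, 3))
```

The Test set is used only once, after the search, so the reported test accuracy is not biased by the tuning.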