# Mathematical Machine Learning: Classification Assignment (25.3.17)
## 120240106, Department of Mathematics, 김건휘

## Dataset Used: IRIS
**Dataset source**: https://scikit-learn.org/1.4/auto_examples/datasets/plot_iris_dataset.html

**Dataset Description**: This dataset consists of petal and sepal measurements for 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.

The rows are the samples and the columns are Sepal Length, Sepal Width, Petal Length, and Petal Width.


**Dataset structure**:
- Total number of samples: 150
- Number of classes (species): 3 (50 samples per class)
- Number of features: 4
- Target variable: flower species (label)

**Features:**

| Feature Name        | Description             | Unit  |
|----------------------|------------------------|-------|
| **sepal length**    | Length of the sepal     | cm    |
| **sepal width**     | Width of the sepal      | cm    |
| **petal length**    | Length of the petal     | cm    |
| **petal width**     | Width of the petal      | cm    |
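
The structure described above can be checked directly against the copy of the dataset bundled with scikit-learn. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the names `iris_check` and `class_counts` are introduced here for illustration only and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn import datasets

# Load the bundled Iris dataset under a separate name (illustrative only).
iris_check = datasets.load_iris()

# Number of samples (rows) and features (columns).
n_samples, n_features = iris_check.data.shape

# Number of samples per class label (0, 1, 2).
class_counts = np.bincount(iris_check.target)

print("samples, features:", n_samples, n_features)
print("samples per class:", dict(zip(iris_check.target_names, class_counts.tolist())))
print("feature names:", iris_check.feature_names)
```
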




## Data Preprocessing

### Loading the Dataset

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data  # sepal and petal length and width features
y = iris.target  # species (0: setosa, 1: versicolor, 2: virginica)

### Selecting Two Classes

In [None]:
# Keep only two species (setosa and versicolor)
flag_setosa_versicolor = (y == 0) | (y == 1)
X = X[flag_setosa_versicolor]
y = y[flag_setosa_versicolor]

### Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split into training and test data (70% train, 30% test)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=128)

## Data Normalization

In [None]:
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_valid_encoded = le.transform(y_valid)

In [None]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Standardize the features (StandardScaler: mean 0, standard deviation 1)
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), dtype='float32')
X_valid_scaled = pd.DataFrame(scaler.transform(X_valid), dtype='float32')
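
To make the standardization step concrete, here is a minimal sketch of the same idea in plain Python: the mean and standard deviation are estimated on the training values only and then reused for the validation values. It assumes the population standard deviation (the convention `StandardScaler` uses). The helper names `fit_standardizer` and `apply_standardizer` and the sample values are hypothetical and chosen for illustration.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate mean and population standard deviation from training values."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def apply_standardizer(values, mean, std):
    """Compute z = (x - mean) / std using previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical sepal-length values in cm (not taken from the dataset).
train_sample = [5.1, 4.9, 6.4, 5.8]
valid_sample = [5.0, 6.0]

mean, std = fit_standardizer(train_sample)
train_z = apply_standardizer(train_sample, mean, std)
valid_z = apply_standardizer(valid_sample, mean, std)  # reuses training statistics

print("train z-scores:", train_z)
print("valid z-scores:", valid_z)
```

After this transformation the training values have mean 0 and standard deviation 1; the validation values generally do not, because they are scaled with the training statistics.
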

## Classification with XGBoost

In [None]:
import xgboost as xgb
print(xgb.__version__)

In [None]:
# set XGBoost classifier parameters
my_random_seed = 128
early_stop_rounds = 20

early_stop = xgb.callback.EarlyStopping(rounds=early_stop_rounds, save_best=True)

xgb_classify = xgb.XGBClassifier(random_state=my_random_seed, callbacks=[early_stop])
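
The early-stopping setup above relies on a patience rule: training halts once the validation metric has failed to improve for a given number of consecutive rounds. The sketch below illustrates that rule in plain Python on a list of validation losses. It is a simplified illustration, not XGBoost's actual implementation, which may differ in details such as tolerance handling and metric direction. The function name `early_stopping_round` and the loss values are hypothetical.

```python
def early_stopping_round(losses, patience):
    """Return (stop_round, best_round) for a sequence of validation losses.

    Training stops at the first round where `patience` rounds have passed
    since the best (lowest) loss was observed.
    """
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(losses):
        if loss < best_loss:
            # New best loss: remember where it happened.
            best_loss = loss
            best_round = i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return i, best_round
    # Never triggered: training ran through all rounds.
    return len(losses) - 1, best_round

# Hypothetical validation losses per boosting round.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]

short_patience = early_stopping_round(losses, patience=3)
long_patience = early_stopping_round(losses, patience=20)
print("patience=3:", short_patience)
print("patience=20:", long_patience)
```

With a short patience the loop stops before reaching the later, lower loss, which shows the trade-off in choosing `rounds`. In the configuration above, `save_best=True` asks XGBoost to return the model from the best iteration rather than the last one.
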

In [None]:
%%time

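## train

# fit
# Note: the model is fit on the unscaled features and original labels;
# the scaled and encoded copies created above are not used in this call.
xgb_classify.fit(X_train, y_train,
                 eval_set=[(X_valid, y_valid)], verbose=True)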

In [None]:
# predict

y_predicted_vaild = xgb_classify.predict(X_valid)
y_predicted_vaild

### Performance Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

In [None]:
# Evaluate performance
print("Accuracy:", accuracy_score(y_valid, y_predicted_vaild))
print("Confusion Matrix:\n", confusion_matrix(y_valid, y_predicted_vaild))
# Set pos_label to 0.0 (treat class 0, setosa, as the positive class)
print("Precision (Class=0.0):", precision_score(y_valid, y_predicted_vaild, pos_label=0.0))
print("Recall (Class=0.0):", recall_score(y_valid, y_predicted_vaild, pos_label=0.0))
print("F1-Score (Class=0.0):", f1_score(y_valid, y_predicted_vaild, pos_label=0.0))
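
For reference, the precision, recall, and F1 values printed above can be derived by hand from true-positive, false-positive, and false-negative counts. The following is a minimal sketch in plain Python with class 0 as the positive class. The function `binary_metrics` and the label lists are hypothetical examples, not results from the model above.

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical labels for illustration (0: setosa, 1: versicolor).
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 1, 0, 1, 0]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo, pos_label=0)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```

In this example there are 3 true positives, 1 false positive, and 1 false negative for class 0, so precision and recall are both 3/4.
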