### Goal: fish species classification model
- Data: fish.csv
- Features: 5 (Weight, Length, Diagonal, Height, Width)
- Target: 1 (Species)
- Method: supervised learning, multiclass classification

In [1]:
# Load modules
import pandas as pd
import numpy as np

In [2]:
# Prepare the data
file = '../data/fish.csv'

fishDF = pd.read_csv(file)
fishDF.head(2)

Unnamed: 0,Species,Weight,Length,Diagonal,Height,Width
0,Bream,242.0,25.4,30.0,11.52,4.02
1,Bream,290.0,26.3,31.2,12.48,4.3056


(2) Preparing the data for training<hr>

(2-1) Separating features and target

In [3]:
featureDF = fishDF[fishDF.columns[1:]]
targetDF = fishDF[fishDF.columns[0]]

In [4]:
print(f'featureDF : {featureDF.shape}, targetDF : {targetDF.shape}')

featureDF : (159, 5), targetDF : (159,)
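The positional split above works because `Species` happens to be the first column. A minimal sketch of the same split done by column name, which does not depend on column order; `demoDF`, `demo_features` and `demo_target` are illustrative names, and the two rows are copied from the `head(2)` output rather than read from the file:

```python
import pandas as pd

# Two rows copied from the fishDF.head(2) output; a stand-in for the full file.
demoDF = pd.DataFrame({
    'Species': ['Bream', 'Bream'],
    'Weight': [242.0, 290.0],
    'Length': [25.4, 26.3],
    'Diagonal': [30.0, 31.2],
    'Height': [11.52, 12.48],
    'Width': [4.02, 4.3056],
})

# Select by name instead of by position.
demo_features = demoDF.drop(columns='Species')  # DataFrame with the 5 measurements
demo_target = demoDF['Species']                 # Series with the labels

print(demo_features.shape, demo_target.shape)   # (2, 5) (2,)
```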


In [5]:
# Check the number of target classes
targetDF.nunique()

7

In [6]:
# Check the number of samples per target class
targetDF.value_counts()

# Check the percentage of samples per target class
(targetDF.value_counts()/targetDF.shape[0])*100

Species
Perch        35.220126
Bream        22.012579
Roach        12.578616
Pike         10.691824
Smelt         8.805031
Parkki        6.918239
Whitefish     3.773585
Name: count, dtype: float64
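The percentage calculation above can also be written with the `normalize=True` option of `value_counts`. A small sketch on hypothetical labels (the `labels` Series is made up for illustration and is not taken from fish.csv):

```python
import pandas as pd

# Hypothetical labels, used only to illustrate the call.
labels = pd.Series(['Perch', 'Perch', 'Bream', 'Roach'], name='Species')

# Manual version, as in the cell above: counts divided by the number of rows.
percent_manual = labels.value_counts() / labels.shape[0] * 100

# Built-in version: normalize=True returns fractions instead of counts.
percent_builtin = labels.value_counts(normalize=True) * 100

print(percent_builtin)
```

Both Series hold the same numbers; only the `Name:` shown in the printed output may differ depending on the pandas version.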

(2-2) Preparing the training and test datasets

In [7]:
from sklearn.model_selection import train_test_split

# The class proportions differ a lot, so stratify on the target
# to keep the same proportions in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetDF, stratify=targetDF, random_state=11)

In [8]:
# Check the dataset shapes
print(f'[Train Dataset] {X_train.shape}, {y_train.shape}')
print(f'[Test Dataset] {X_test.shape}, {y_test.shape}')

[Train Dataset] (119, 5), (119,)
[Test Dataset] (40, 5), (40,)
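The 119/40 split follows from the default `test_size` of 0.25. To see what `stratify` does to the class proportions, here is a small sketch on hypothetical imbalanced labels (`X`, `y` and the class names `a`, `b`, `c` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 'a', 30 of 'b', 10 of 'c'.
y = pd.Series(['a'] * 60 + ['b'] * 30 + ['c'] * 10)
X = pd.DataFrame({'x': np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=11)

# Compare the class fractions of the full data with those of each split.
overall = y.value_counts(normalize=True)
train_frac = y_tr.value_counts(normalize=True)
test_frac = y_te.value_counts(normalize=True)

print(pd.DataFrame({'overall': overall, 'train': train_frac, 'test': test_frac}).round(2))
```

With stratification, the fractions in both splits stay close to the overall fractions, up to rounding to whole rows.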


(3) Training

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
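# Create the model instance and train it
# max_iter: upper bound on the number of solver iterations (training stops earlier once it converges)
# solver='liblinear': optimization algorithm, here the one backed by the LIBLINEAR library
# tol (left at its default): stopping tolerance that decides when the solver counts as converged before max_iter is reached
model = LogisticRegression(max_iter=20000, solver='liblinear')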
model.fit(X_train, y_train)

In [11]:
print(f'classes_ : {model.classes_}')
print(f'feature_names_in_ : {model.feature_names_in_}')
print(f'max_iter : {model.max_iter}')
print(f'coef_ : {len(model.coef_)} rows\n{model.coef_}')
print(f'intercept_ : {len(model.intercept_)} values\n{model.intercept_}')

classes_ : ['Bream' 'Parkki' 'Perch' 'Pike' 'Roach' 'Smelt' 'Whitefish']
feature_names_in_ : ['Weight' 'Length' 'Diagonal' 'Height' 'Width']
max_iter : 20000
coef_ : 7 rows
[[ 1.31151754e-02 -1.64944470e+00  8.28009575e-01  1.41621595e+00
  -4.15067201e-01]
 [-2.10617657e-02  3.33701594e-01 -9.64909143e-01  2.19381184e+00
   2.66611701e-02]
 [-1.97453974e-03  2.60616873e+00 -2.66412260e+00 -7.93176743e-03
   1.91659551e+00]
 [ 1.01422059e-02  2.55168743e-01  1.51461260e-01 -1.94779290e+00
  -8.36602128e-01]
 [-9.89829706e-03 -1.72578825e+00  1.53807538e+00 -5.12880032e-01
   1.65750894e+00]
 [-7.29426634e-02  3.82049401e-01  1.62783679e-01 -1.55364795e+00
  -5.97839461e-01]
 [ 5.68775586e-03 -5.20399292e-01  2.54546484e-01 -2.46921990e-01
   8.40269158e-01]]
intercept_ : 7 values
[-0.27362898  0.07982094 -0.34682853 -1.23222237 -1.32590576  0.41907035
 -0.34453293]
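The fitted model holds one row of coefficients and one intercept per class (7 rows of 5 values here). The sketch below shows how these combine into a linear score per class. It is an illustration under assumptions: it uses synthetic 3-class data from `make_blobs` and the default solver rather than the notebook's `solver='liblinear'`, and `X_demo`, `y_demo` and `clf` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data with 5 features, standing in for the fish measurements.
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row of coefficients and one intercept per class.
print(clf.coef_.shape, clf.intercept_.shape)  # (3, 5) (3,)

# Linear score per class: z = X @ W.T + b
manual_scores = X_demo @ clf.coef_.T + clf.intercept_

# decision_function returns the same per-class scores.
print(np.allclose(manual_scores, clf.decision_function(X_demo)))
```

Each column of `manual_scores` corresponds to the class at the same position in `clf.classes_`.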


(4) Evaluation<hr>

In [12]:
print(f'[Train Score] {model.score(X_train, y_train)}\n[Test Score] {model.score(X_test, y_test)}')

[Train Score] 0.9495798319327731
[Test Score] 0.975
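For a classifier, `score` returns the mean accuracy, so 0.975 on 40 test rows corresponds to 39 correct predictions. A minimal sketch of the same number computed with `accuracy_score`, plus a confusion matrix to see which classes get mixed up. It assumes synthetic data and the default solver instead of the fish data and `liblinear`; all names in the block are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, partly overlapping 3-class data standing in for the fish measurements.
X_demo, y_demo = make_blobs(n_samples=200, centers=3, n_features=5,
                            cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=11)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# score() is the fraction of correct predictions, i.e. the accuracy.
acc = accuracy_score(y_te, y_hat)
print(acc, clf.score(X_te, y_te))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the number of test rows equals the accuracy.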


(5) Using the model <hr>

In [13]:
y_pre = model.predict(X_test.iloc[[0]])
y_pre, y_test[:1]

(array(['Bream'], dtype=object),
 1    Bream
 Name: Species, dtype: object)

In [14]:
model.predict_proba(X_test.iloc[[0]])  # model.predict above returned the class with the highest of these probabilities

array([[5.04315647e-01, 3.10853586e-01, 3.75723755e-04, 2.25202324e-07,
        1.72946819e-01, 6.17918834e-13, 1.15079995e-02]])

In [15]:
# Predicted class probabilities for 5 samples
print(model.classes_)
np.round(model.predict_proba(X_test.iloc[:5]), 3), y_test[:5].to_list()
         

['Bream' 'Parkki' 'Perch' 'Pike' 'Roach' 'Smelt' 'Whitefish']


(array([[0.504, 0.311, 0.   , 0.   , 0.173, 0.   , 0.012],
        [0.158, 0.73 , 0.044, 0.   , 0.057, 0.   , 0.01 ],
        [0.772, 0.024, 0.001, 0.   , 0.18 , 0.   , 0.023],
        [0.001, 0.089, 0.719, 0.002, 0.155, 0.004, 0.03 ],
        [0.   , 0.021, 0.753, 0.009, 0.176, 0.009, 0.031]]),
 ['Bream', 'Parkki', 'Bream', 'Perch', 'Perch'])

In [16]:
result = model.predict_proba(X_test.iloc[:5]).argmax(axis=1)

In [17]:
data = {"Pre_Y" : [model.classes_[idx] for idx in result],
        "True Y" : y_test[:5].to_list()}

In [18]:
pd.DataFrame(data)

Unnamed: 0,Pre_Y,True Y
0,Bream,Bream
1,Parkki,Parkki
2,Bream,Bream
3,Perch,Perch
4,Perch,Perch
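The table above maps each `argmax` index back to a class name with a list comprehension. Since `classes_` is a NumPy array, the same lookup can be done in one indexing step. The sketch below checks this against `predict` on synthetic data; the data, the three class names in `names`, and the default solver are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data; integer labels are mapped to illustrative species names.
X_demo, y_int = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)
names = np.array(['Bream', 'Perch', 'Roach'])
y_demo = names[y_int]

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)

# Index classes_ with the argmax of each row: no Python loop needed.
pred_from_proba = clf.classes_[proba.argmax(axis=1)]

print(np.array_equal(pred_from_proba, clf.predict(X_demo)))
print(np.allclose(proba.sum(axis=1), 1.0))  # each row of probabilities sums to 1
```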
