### Fish Classification
- Data: fish.csv
- Features: 5 (Weight, Length, Diagonal, Height, Width)
- Target: 1 (Species)
- Method: supervised learning, multiclass classification

(1) Load modules and prepare data<hr>

In [1]:
# Load modules
import pandas as pd
import numpy as np

In [2]:
# Prepare data
fishDF = pd.read_csv('../data/fish.csv')
fishDF.head()

  Species  Weight  Length  Diagonal   Height   Width
0   Bream   242.0    25.4      30.0  11.5200  4.0200
1   Bream   290.0    26.3      31.2  12.4800  4.3056
2   Bream   340.0    26.5      31.1  12.3778  4.6961
3   Bream   363.0    29.0      33.5  12.7300  4.4555
4   Bream   430.0    29.0      34.0  12.4440  5.1340


(2) Prepare data for training<hr>

(2-1) Separate features and target

In [3]:
# Column 0 (Species) is the target; the remaining five columns are the features
featureDF = fishDF[fishDF.columns[1:]]
targetSR = fishDF[fishDF.columns[0]]
featureDF.shape, targetSR.shape

((159, 5), (159,))

In [4]:
# Check the number of target classes
targetSR.nunique()

7

In [5]:
# Check the percentage of samples in each target class
(targetSR.value_counts()/targetSR.shape[0]) * 100

Species
Perch        35.220126
Bream        22.012579
Roach        12.578616
Pike         10.691824
Smelt         8.805031
Parkki        6.918239
Whitefish     3.773585
Name: count, dtype: float64

In [6]:
from sklearn.preprocessing import LabelEncoder

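# Encode the species names as integer class labels
encoder = LabelEncoder()
encoder.fit(targetSR)

# targetSR becomes a NumPy array of integer codes
targetSR = encoder.transform(targetSR)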
targetSR

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5])
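
The integer codes above follow the alphabetical order of the species names. As a rough plain-Python sketch of that mapping (not scikit-learn's actual implementation; `encode_labels` is a hypothetical helper), sorting the unique labels and numbering them gives the same codes:

```python
def encode_labels(labels):
    """Map each distinct label to its rank in sorted order (hypothetical helper)."""
    classes = sorted(set(labels))
    index = {name: code for code, name in enumerate(classes)}
    return classes, [index[name] for name in labels]


# Species in the order they first appear in the encoded array above
species = ["Bream", "Roach", "Whitefish", "Parkki", "Perch", "Pike", "Smelt"]
classes, codes = encode_labels(species)
print(dict(zip(species, codes)))
```

In the notebook itself, `encoder.classes_` holds the sorted names and `encoder.inverse_transform` converts codes back to names.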

(2-2) Prepare training and test datasets

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
# Stratified split keeps the class proportions; test_size defaults to 0.25
xtrain, xtest, ytrain, ytest = train_test_split(
    featureDF, targetSR, stratify=targetSR, random_state=11
)

In [9]:
xtrain.shape, ytrain.shape,xtest.shape, ytest.shape

((119, 5), (119,), (40, 5), (40,))
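
The 119/40 split follows from the default test fraction. The sketch below redoes the arithmetic, assuming the default `test_size` of 0.25 and that scikit-learn rounds the test count up, which matches the shapes shown above:

```python
import math

n_samples = 159
test_size = 0.25  # assumed default when neither test_size nor train_size is given

n_test = math.ceil(test_size * n_samples)  # 39.75 is rounded up
n_train = n_samples - n_test               # the remaining samples are used for training
print(n_train, n_test)
```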

(3) Train the model

In [10]:
from sklearn.linear_model import LogisticRegression

In [11]:
# Create a model instance and train it
model = LogisticRegression(max_iter=20000, solver='liblinear')  # solver: the optimization algorithm used for fitting
model.fit(xtrain,ytrain)

In [12]:
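# Inspect the model parameters determined by training
print(model.classes_)  # target classes
print(model.feature_names_in_)  # feature names
print(model.n_iter_)  # number of iterations for each per-class classifier
print(model.coef_)  # feature weights, one row per class
print(f'{len(model.coef_)}개')  # number of weight rows, one per class ('개' is a Korean counter word)
print(f'{len(model.intercept_)}개')  # number of intercepts (biases)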

[0 1 2 3 4 5 6]
['Weight' 'Length' 'Diagonal' 'Height' 'Width']
[20 22 19 18 17 16 19]
[[ 1.31151754e-02 -1.64944470e+00  8.28009575e-01  1.41621595e+00
  -4.15067201e-01]
 [-2.10617657e-02  3.33701594e-01 -9.64909143e-01  2.19381184e+00
   2.66611701e-02]
 [-1.97453974e-03  2.60616873e+00 -2.66412260e+00 -7.93176743e-03
   1.91659551e+00]
 [ 1.01422059e-02  2.55168743e-01  1.51461260e-01 -1.94779290e+00
  -8.36602128e-01]
 [-9.89829706e-03 -1.72578825e+00  1.53807538e+00 -5.12880032e-01
   1.65750894e+00]
 [-7.29426634e-02  3.82049401e-01  1.62783679e-01 -1.55364795e+00
  -5.97839461e-01]
 [ 5.68775586e-03 -5.20399292e-01  2.54546484e-01 -2.46921990e-01
   8.40269158e-01]]
7개
7개
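
Each row of `coef_` together with one intercept defines a linear score for one class. As a minimal sketch, the snippet below evaluates that score for class 0 on the first row of `fishDF.head()`. The intercept is a hypothetical placeholder because `intercept_` is not printed in this notebook, and `linear_score` is an illustrative helper, not a scikit-learn function:

```python
def linear_score(weights, features, intercept):
    """Return the linear score w . x + b for one classifier."""
    return sum(w * x for w, x in zip(weights, features)) + intercept


# First row of coef_ printed above (class 0)
w0 = [1.31151754e-02, -1.64944470e+00, 8.28009575e-01,
      1.41621595e+00, -4.15067201e-01]
# First row of fishDF.head(): Weight, Length, Diagonal, Height, Width
x = [242.0, 25.4, 30.0, 11.52, 4.02]
b0 = 0.0  # hypothetical: the real intercept_ values are not shown in the notebook

print(linear_score(w0, x, b0))
```

With seven classes there are seven such scores per sample, which is what `model.decision_function` returns.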


(4) Model evaluation<hr>

In [13]:
model.score(xtrain,ytrain), model.score(xtest,ytest)

(0.9495798319327731, 0.975)

(5) Using the model<hr>

In [14]:
ypre = model.predict(xtest.iloc[[0]])
ypre, ytest[0]

(array([0]), 0)

In [15]:
model.predict_proba(xtest.iloc[[0]])  # probability that the sample belongs to each species

array([[5.04315647e-01, 3.10853586e-01, 3.75723755e-04, 2.25202324e-07,
        1.72946819e-01, 6.17918834e-13, 1.15079995e-02]])

In [16]:
# Predicted class probabilities for five test samples
pd.DataFrame(np.round(model.predict_proba(xtest.iloc[0:5]), 3), columns=model.classes_)

       0      1      2      3      4      5      6
0  0.504  0.311  0.000  0.000  0.173  0.000  0.012
1  0.158  0.730  0.044  0.000  0.057  0.000  0.010
2  0.772  0.024  0.001  0.000  0.180  0.000  0.023
3  0.001  0.089  0.719  0.002  0.155  0.004  0.030
4  0.000  0.021  0.753  0.009  0.176  0.009  0.031


In [17]:
# Raw value computed from each per-class classifier's linear equation
model.decision_function(xtest.iloc[[-1]])

array([[-4.35846721, -1.06168101,  0.77087474, -2.98778124, -1.09529369,
         0.50735754, -1.78072043]])
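
These raw scores relate to `predict_proba` as follows, assuming the model was fitted one-vs-rest (which the seven separate iteration counts in `n_iter_` suggest). To my understanding, scikit-learn then applies a sigmoid to each score and rescales the row so it sums to 1. The helper name `ovr_probabilities` is hypothetical:

```python
import math


def ovr_probabilities(scores):
    """Apply a sigmoid to each one-vs-rest score, then rescale so the row sums to 1."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]


# decision_function output for the last test sample, copied from above
scores = [-4.35846721, -1.06168101, 0.77087474, -2.98778124,
          -1.09529369, 0.50735754, -1.78072043]
probs = ovr_probabilities(scores)
print([round(p, 3) for p in probs])
print(probs.index(max(probs)))  # index of the most probable class
```

Because the sigmoid is monotonic, the class with the largest raw score is also the class with the largest probability.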

In [18]:
result = model.predict_proba(xtest.iloc[0:5]).argmax(axis=1)  # argmax() returns the index of the most probable class
result

array([0, 1, 0, 2, 2], dtype=int64)
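
To make the `argmax` step concrete, this plain-Python sketch takes the rounded probabilities from the five-sample table above and picks the column index of the largest value in each row:

```python
# Rounded predict_proba output for the first five test samples (copied from above)
proba_rows = [
    [0.504, 0.311, 0.0, 0.0, 0.173, 0.0, 0.012],
    [0.158, 0.73, 0.044, 0.0, 0.057, 0.0, 0.01],
    [0.772, 0.024, 0.001, 0.0, 0.18, 0.0, 0.023],
    [0.001, 0.089, 0.719, 0.002, 0.155, 0.004, 0.03],
    [0.0, 0.021, 0.753, 0.009, 0.176, 0.009, 0.031],
]

# For each row, the position of the largest probability is the predicted class index
predicted = [row.index(max(row)) for row in proba_rows]
print(predicted)  # [0, 1, 0, 2, 2]
```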

In [19]:
data = {'Pre Y': [model.classes_[idx] for idx in result], 'True Y': ytest[0:5]}
pd.DataFrame(data)

   Pre Y  True Y
0      0       0
1      1       1
2      0       0
3      2       2
4      2       2


(6) Model performance evaluation<hr>
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
- Classification Report

In [20]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score

In [21]:
print(classification_report(ytest,model.predict(xtest), zero_division=0))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      1.00      1.00         3
           2       0.93      1.00      0.97        14
           3       1.00      1.00      1.00         4
           4       1.00      1.00      1.00         5
           5       1.00      1.00      1.00         4
           6       0.00      0.00      0.00         1

    accuracy                           0.97        40
   macro avg       0.85      0.86      0.85        40
weighted avg       0.95      0.97      0.96        40


In [22]:
print(f1_score(ytest,model.predict(xtest), average='weighted'))

0.9629310344827587


In [23]:
recall_score(ytest,model.predict(xtest), average='weighted')

0.975

In [24]:
confusion_matrix(ytest,model.predict(xtest))

array([[ 9,  0,  0,  0,  0,  0,  0],
       [ 0,  3,  0,  0,  0,  0,  0],
       [ 0,  0, 14,  0,  0,  0,  0],
       [ 0,  0,  0,  4,  0,  0,  0],
       [ 0,  0,  0,  0,  5,  0,  0],
       [ 0,  0,  0,  0,  0,  4,  0],
       [ 0,  0,  1,  0,  0,  0,  0]], dtype=int64)
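
The scores reported earlier can be recomputed by hand from this matrix, where rows are true classes and columns are predicted classes. The following is a plain-Python sketch rather than scikit-learn's implementation; `safe_div` is a hypothetical helper that mirrors `zero_division=0`:

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = [
    [9, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0],
    [0, 0, 14, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0, 0],
    [0, 0, 0, 0, 5, 0, 0],
    [0, 0, 0, 0, 0, 4, 0],
    [0, 0, 1, 0, 0, 0, 0],
]


def safe_div(num, den):
    """Return num / den, or 0.0 when den is 0 (mirrors zero_division=0)."""
    return num / den if den else 0.0


n = len(cm)
total = sum(map(sum, cm))
support = [sum(row) for row in cm]                                # true samples per class
predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predictions per class

precision = [safe_div(cm[k][k], predicted[k]) for k in range(n)]
recall = [safe_div(cm[k][k], support[k]) for k in range(n)]
f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]

accuracy = sum(cm[k][k] for k in range(n)) / total
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total
print(accuracy, weighted_f1)
```

The single class-6 sample is predicted as class 2, so class 6 gets zero precision and recall, while class 2's precision drops to 14/15.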

In [29]:
from sklearn.metrics import ConfusionMatrixDisplay
from matplotlib import pyplot as plt

# Map the encoded class indices back to species names for the axis labels
labels = encoder.inverse_transform(model.classes_)
cm = confusion_matrix(ytest, model.predict(xtest))
cmplot = ConfusionMatrixDisplay(cm, display_labels=labels)
cmplot.plot(cmap='gray')
plt.show()