## MLflow
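- MLflow is an open-source tool that helps track model runs, including model parameters, metrics, results, the data used, and the code.
- MLflow also offers other features, such as deploying models, packaging code for reproducibility, and storing models.
- This post focuses on how to track models.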

Tracking model runs makes it easy to compare how different models perform across different data and hyperparameters.

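Next, we use a basic dataset (scikit-learn's breast cancer dataset) and track two performance metrics, accuracy (acc) and F1 score. The model is a simple k-nearest neighbors (KNN) classifier, and the only hyperparameter we vary is k.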

In [1]:
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

In [2]:
data = load_breast_cancer(as_frame=True)
X = data['data']
y = data['target']

In [3]:
X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [4]:
y

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: int32

In [5]:
print(X.shape, y.shape)

(569, 30) (569,)


In [6]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1, stratify=y)

In [7]:
# Train the model and evaluate its performance

# KNN model parameter: n_neighbors is the 'k' in k-Nearest Neighbors
n_neighbors = 5
# Create the KNN classifier
knn = KNeighborsClassifier(n_neighbors = n_neighbors)
# Fit the classifier to the training data
knn.fit(X_train,y_train)

print(f'knn model is training - n_neighbors: {n_neighbors}')
# First 5 predictions on the test data (not displayed, since this is not the cell's last expression)
knn.predict(X_test)[0:5]
# Check the accuracy of the model on the test data
accuracy = knn.score(X_test, y_test)
print(f'model accuracy is: {accuracy}')

knn model is training - n_neighbors: 5
model accuracy is: 0.956140350877193


In [8]:
# Combine features and target so the data can be saved first
df = pd.concat((X, pd.DataFrame(y, columns=['target'])), axis=1)

In [9]:
import os

# Create the output folder if it does not exist yet
os.makedirs('data', exist_ok=True)
df.to_csv('data/breast_cancer.csv', index=False)

> The above is the usual workflow, in which model results are hard to keep track of.
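
Without a tracking tool, comparing several values of k means doing the bookkeeping by hand. The following is a minimal sketch of that manual approach, assuming the same dataset and split settings as the cells above; the helper name `evaluate_knn` and the candidate values of k are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same dataset and split settings as the cells above.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)


def evaluate_knn(k):
    """Train a KNN classifier with k neighbors and return its test metrics."""
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "n_neighbors": k,
        "acc": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }


# The candidate k values are arbitrary and chosen only for illustration.
results = pd.DataFrame([evaluate_knn(k) for k in (3, 5, 7)])
print(results)
```

This works for a handful of runs, but the results table lives only in memory, and nothing ties it to the data or code version that produced it.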

In [14]:
import mlflow

# Tell MLflow to start logging
with mlflow.start_run():
    
    # Load the data
    df = pd.read_csv('data/breast_cancer.csv')
    
    # Log data to mlflow
    mlflow.log_artifact('data/breast_cancer.csv')
    
    # X, y
    X = df[list(df.columns)[:-1]]
    y = df['target']
    
    # view data
    print(X.shape, y.shape)
    print(X.head())
    print(f'First five y values: {y[:5]}')
    
    # split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1, stratify=y)    
    
    # modeling
    k = 7
    #log n_neighbors param to mlflow
    mlflow.log_param('n_neighbors', k)
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    
    print(f"knn model is training - k: {k}")
    
    # show prediction result
    knn.predict(X_test)[:5]
    
    # model metric
    acc = knn.score(X_test,y_test)
    f1 = f1_score(y_true=y_test, y_pred=knn.predict(X_test))
    
    # log model metrics to mlflow
    mlflow.log_metrics({
        'acc': acc,
        'f1': f1
    })
    
    print(f"model acc is {acc}, model f1 is {f1}")

(569, 30) (569,)
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst radius  worst texture  worst perimete

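At this point the parameters, metrics, and dataset have all been logged. You can find them in the `mlruns` folder, or run `mlflow ui` from the command line to browse them (remember to run it from the same directory as the main script).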
![image.png](attachment:image.png)

In [11]:
# !mlflow ui

Screenshot of the MLflow UI:
![image.png](attachment:image.png)

Selecting multiple runs also lets you compare them graphically, which is very convenient.
![image.png](attachment:image.png)