## Algorithm Evaluation Metrics

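Choosing evaluation metrics that reflect how accurate a machine learning model is provides one of the most effective ways to measure and compare models. Computing and comparing these metrics while evaluating algorithms makes it quick to pick a suitable one.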


In [2]:
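# Classification metrics
# The following metrics are used to evaluate classification algorithms:
# classification accuracy; log loss; area under the ROC curve (AUC); confusion matrix; classification report

# Classification accuracy: the number of correctly classified samples divided by the total number of samples
# In general, higher accuracy indicates a better classifier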

import pandas as pd
from sklearn.model_selection import KFold,cross_val_score
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)
# Split the data into input features and output labels
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

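num_folds = 9
# random_state only takes effect when shuffle=True; recent scikit-learn versions
# raise a ValueError if it is set without shuffling, so it is omitted here
kfold = KFold(n_splits=num_folds)
model = LogisticRegression()
result = cross_val_score(model, X, Y, cv=kfold)
print('Accuracy: %.3f (%.3f)' % (result.mean(), result.std()))

Accuracy: 0.775 (0.052)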
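
To make the accuracy definition concrete, here is a minimal sketch that computes it by hand on a small made-up label set. The helper name `accuracy` and the sample labels are illustrative assumptions; this is not how scikit-learn implements the score internally.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 6 of the 8 predictions match the true label
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75
```

Accuracy alone can be misleading when the classes are imbalanced, which is one reason the cells that follow look at other metrics as well.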


In [3]:
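# Log loss: penalizes confident but wrong probability estimates
# scikit-learn reports it negated ('neg_log_loss'), so values closer to 0 are better
scoring = 'neg_log_loss'
# No cv argument is passed, so cross_val_score uses its default splitter rather than kfold
result = cross_val_score(model, X, Y, scoring=scoring)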
print('Logloss %.3f (%.3f)' % (result.mean(), result.std()))

Logloss -0.498 (0.011)
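
The sign of the score above is easier to read once the underlying formula is visible. The following is a minimal sketch of binary log loss written by hand; the function name `binary_log_loss`, the clipping constant `eps`, and the sample probabilities are assumptions made for illustration.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels given P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 so that log() stays finite
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # hypothetical, mostly correct probabilities
unsure = [0.5, 0.5, 0.5, 0.5]      # a model that always answers 50/50

print('confident: %.3f' % binary_log_loss(labels, confident))
print('unsure:    %.3f' % binary_log_loss(labels, unsure))
```

Lower log loss is better. The `neg_log_loss` scorer flips the sign so that, like every scikit-learn scorer, a larger value means a better model.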


In [4]:
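# AUC (area under the ROC curve): a metric for evaluating binary classifiers;
# 1.0 is a perfect ranking and 0.5 is no better than chance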
scoring = 'roc_auc'
result = cross_val_score(model, X, Y, scoring=scoring)
print('AUC %.3f (%.3f)' % (result.mean(), result.std()))

AUC 0.825 (0.028)
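
One way to interpret AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair; the name `pairwise_auc` and the sample scores are illustrative assumptions, and this quadratic approach is meant for understanding rather than for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
print(pairwise_auc(y_true, scores))  # 0.75
```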


In [5]:
# Confusion matrix: compares the predicted classes against the actual classes
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
test_size = 0.33
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
# Rows are the actual classes, columns are the predicted classes
matrix = confusion_matrix(Y_test, predicted)
classes = ['0', '1']
dataframe = pd.DataFrame(data=matrix, index=classes, columns=classes)
print(dataframe)

     0   1
0  152  19
1   31  52
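
As a rough illustration of what the table contains, the sketch below builds the same kind of matrix by counting (actual, predicted) pairs. The helper `confusion_counts` and the sample labels are assumptions for illustration; rows index the actual class and columns the predicted class, matching the layout printed above.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Count samples per (actual, predicted) pair; rows = actual, cols = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical labels for eight samples
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[3, 1], [1, 3]]
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total count.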


In [6]:
# Classification report: per-class precision, recall, F1 score, and support

from sklearn.metrics import classification_report

report = classification_report(Y_test, predicted)
print(report)

              precision    recall  f1-score   support

         0.0       0.83      0.89      0.86       171
         1.0       0.73      0.63      0.68        83

    accuracy                           0.80       254
   macro avg       0.78      0.76      0.77       254
weighted avg       0.80      0.80      0.80       254
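
The figures in the report can be reproduced from the four counts in the confusion matrix printed earlier (152, 19, 31, 52). The sketch below does that arithmetic by hand; the helper name `per_class_scores` is an illustrative assumption.

```python
def per_class_scores(tp, fp, fn):
    """Precision, recall and F1 for one class from its error counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts read from the confusion matrix (rows = actual, columns = predicted)
tn, fp, fn, tp = 152, 19, 31, 52

# Treat class 1 as the positive class, then swap roles for class 0
class1 = per_class_scores(tp=tp, fp=fp, fn=fn)
class0 = per_class_scores(tp=tn, fp=fn, fn=fp)
overall_accuracy = (tp + tn) / (tp + tn + fp + fn)

print('class 0: %.2f %.2f %.2f' % class0)
print('class 1: %.2f %.2f %.2f' % class1)
print('accuracy: %.2f' % overall_accuracy)
```

Support is simply the number of actual samples in each class: 152 + 19 = 171 for class 0 and 31 + 52 = 83 for class 1.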



In [7]:
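# Regression metrics
# Three metrics for evaluating regression algorithms: mean absolute error (MAE),
# mean squared error (MSE), and the coefficient of determination (R²)

# Mean absolute error: the average of the absolute differences between predicted and
# actual values; it reports the typical size of the error in the units of the target

# First run on the previous (Pima) dataset; with 0/1 labels, MAE is the fraction of misclassified samples
n_splits = 10
# random_state is omitted because it only applies when shuffle=True
kfold = KFold(n_splits=n_splits)
scoring = 'neg_mean_absolute_error'
# kfold is not passed here, so cross_val_score uses its default splitter
result = cross_val_score(model, X, Y, scoring=scoring)
print('MAE %.3f (%.3f)' % (result.mean(), result.std()))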

MAE -0.230 (0.019)
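
For reference, MAE can be written in a few lines. The sketch below uses a hand-written helper (the name `mean_abs_error` and the sample values are assumptions) and also shows why MAE on 0/1 class labels, as in the cell above, is the same as the misclassification rate.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical regression targets and predictions
actual = [3.0, -0.5, 2.0, 7.0]
predicted_values = [2.5, 0.0, 2.0, 8.0]
print(mean_abs_error(actual, predicted_values))  # 0.5

# With 0/1 labels every wrong prediction contributes exactly 1,
# so MAE equals the share of misclassified samples
class_true = [1, 0, 1, 1]
class_pred = [1, 1, 1, 0]
print(mean_abs_error(class_true, class_pred))  # 0.5
```

As with log loss, scikit-learn's `neg_mean_absolute_error` scorer reports the negated value, which is why the printed MAE is negative.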


In [8]:
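filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
         'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True)
data = pd.read_csv(filename, names=names, sep=r'\s+')
# Split the data into input features and output values
array = data.values
X = array[:, 0:13]
Y = array[:, 13]
n_splits = 10
# random_state is omitted because it only applies when shuffle=True
kfold = KFold(n_splits=n_splits)
# Note: LogisticRegression is a classifier; casting Y to int below turns each integer
# price into a class label. A regressor such as LinearRegression is the usual choice here.
model = LogisticRegression()
scoring = 'neg_mean_absolute_error'
result = cross_val_score(model, X, Y.astype('int'), cv=kfold, scoring=scoring)
print('MAE %.3f (%.3f)' % (result.mean(), result.std()))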

MAE -5.049 (1.706)


In [9]:
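# Mean squared error: the average of the squared differences between predicted and
# actual values; squaring penalizes large errors more heavily than small ones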
scoring = 'neg_mean_squared_error'
result = cross_val_score(model, X, Y.astype('int'), cv=kfold, scoring=scoring)
print('MSE %.3f (%.3f)' % (result.mean(), result.std()))

MSE -55.781 (40.712)
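
The large spread in the MSE above reflects how strongly squaring amplifies big errors. The sketch below, using made-up numbers and hypothetical helper names, compares two prediction sets that have the same MAE but very different MSE.

```python
import math


def mean_sq_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_abs_error(y_true, y_pred):
    """Average absolute difference, shown here for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 8.0, 12.0, 8.0]     # every prediction is off by 2
one_outlier = [10.0, 10.0, 10.0, 18.0]   # a single prediction is off by 8

for name, preds in [('even errors', even_errors), ('one outlier', one_outlier)]:
    mse = mean_sq_error(actual, preds)
    print('%s: MAE %.1f  MSE %.1f  RMSE %.1f'
          % (name, mean_abs_error(actual, preds), mse, math.sqrt(mse)))
```

Taking the square root (RMSE) brings the value back to the units of the target, which often makes it easier to interpret.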


In [10]:
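# Coefficient of determination R²: the share of the target's variance explained by the model;
# 1 is a perfect fit, and the value is negative when the model does worse than predicting the mean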
scoring = 'r2'
result = cross_val_score(model, X, Y.astype('int'), cv=kfold, scoring=scoring)
print('R2 %.3f (%.3f)' % (result.mean(), result.std()))

R2 -0.602 (1.302)
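
A negative R², like the one printed above, means the model fits worse than a baseline that always predicts the mean of the targets. The sketch below shows this with a hand-written helper; the name `r2` and the sample values are assumptions for illustration.

```python
def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all targets are equal")
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]

print(r2(actual, [1.0, 2.0, 3.0, 4.0]))   # 1.0  (perfect fit)
print(r2(actual, [2.5, 2.5, 2.5, 2.5]))   # 0.0  (always predicts the mean)
print(r2(actual, [4.0, 3.0, 2.0, 1.0]))   # -3.0 (worse than the mean)
```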
