Lab Assignment No. 5:
====================
Classification Quality Metrics
====================

Data
---------------------
The Iris dataset was chosen as the data (available at http://archive.ics.uci.edu/ml/datasets/Iris).

Subject area: botany; number of records: 150.

Input parameters:

    1. sepal length in cm
    2. sepal width in cm 
    3. petal length in cm
    4. petal width in cm

Output parameter: class (encoded in iris.target as follows):
    
    - Iris Setosa (0)
    - Iris Versicolour (1)
    - Iris Virginica (2)

Classification Accuracy
----

In [12]:
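import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
# cross_validation is the module name in scikit-learn releases before 0.20;
# later releases provide these tools in sklearn.model_selection
from sklearn import cross_validation, metrics
from sklearn.linear_model import LogisticRegression
# load the iris dataset
iris = datasets.load_iris()
X = iris.data[:, :4]  # all four features
y = iris.target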

In [13]:
num_folds = 10
num_instances = len(X)
seed = 7
# 10-fold cross-validation; shuffle is off by default, so the folds are consecutive blocks
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
logReg = LogisticRegression()
# cross_val_score fits a fresh copy of the estimator on each fold, so no prior fit is needed
svc = svm.SVC(kernel='rbf', gamma=0.7, C=1.0)
scoring = 'accuracy'
logRegAccuracy = cross_validation.cross_val_score(logReg, X, y, cv=kfold, scoring=scoring)
svcAccuracy = cross_validation.cross_val_score(svc, X, y, cv=kfold, scoring=scoring)
# report the mean and standard deviation of the accuracy over the folds
print("Accuracy for LogisticRegression: %.3f (%.3f)" % (logRegAccuracy.mean(), logRegAccuracy.std()))
print("Accuracy for SVC: %.3f (%.3f)" % (svcAccuracy.mean(), svcAccuracy.std()))

Accuracy for LogisticRegression: 0.880 (0.148)
Accuracy for SVC: 0.947 (0.058)
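
Accuracy is the fraction of predictions that match the true labels, and the two numbers printed for each model are the mean and the standard deviation of that fraction over the folds. The sketch below reproduces the arithmetic in plain Python. The helper name `accuracy`, the labels, and the per-fold scores are illustrative assumptions, not values from this lab.

```python
from statistics import mean, pstdev

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Illustrative labels only (not taken from the iris experiment).
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]
fold_accuracy = accuracy(y_true, y_pred)  # 6 of 8 predictions are correct

# Illustrative per-fold scores; cross-validation reports their mean and spread.
# pstdev is the population standard deviation, the same convention as the
# default of numpy's std().
fold_scores = [1.0, 0.9, 0.8, 0.9]
summary = "%.3f (%.3f)" % (mean(fold_scores), pstdev(fold_scores))
print(fold_accuracy, summary)
```

A large standard deviation, such as the one reported for LogisticRegression above, indicates that accuracy varies considerably from fold to fold.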


Logarithmic Loss
---

In [40]:
from sklearn.cross_validation import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
X = iris.data[:, :4]
y = iris.target
test_size = 0.33
# random_state is not defined in the cells shown; reusing seed from above is an assumption
random_state = seed
# binarize the target: one indicator column per class
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
svc_classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=random_state))
svc = svc_classifier.fit(X_train, y_train)
lr_classifier = OneVsRestClassifier(LogisticRegression())
lr = lr_classifier.fit(X_train, y_train)
# the 'log_loss' scorer returns the negated loss, so values closer to zero are better
scoring = 'log_loss'
lr_logLoss = cross_validation.cross_val_score(lr, X, y, cv=kfold, scoring=scoring)
svc_logLoss = cross_validation.cross_val_score(svc, X, y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (lr_logLoss.mean(), lr_logLoss.std()))  # LogisticRegression
print("Logloss: %.3f (%.3f)" % (svc_logLoss.mean(), svc_logLoss.std()))  # SVC


Logloss: -0.410 (0.160)
Logloss: -0.337 (0.124)
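
Logarithmic loss penalizes a classifier according to the probability it assigns to the true class: the loss is the mean of the negative logarithm of that probability, so confident correct predictions cost little and confident wrong ones cost a lot. The sketch below is the standard multiclass definition written in plain Python. The function name, the `eps` clipping value, and the probabilities are illustrative assumptions, and the way the scorer handles the binarized targets used in the cell above may differ in detail.

```python
import math

def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.

    y_true holds class indices; probabilities holds one list of class
    probabilities per sample. eps clips probabilities away from 0 and 1
    so that log(0) is never evaluated.
    """
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Illustrative probabilities only.
y_true = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

loss_confident = log_loss(y_true, confident)
loss_uniform = log_loss(y_true, uniform)  # equals log(3) for three classes
print("%.3f %.3f" % (loss_confident, loss_uniform))
```

A model that always predicts equal probabilities for three classes scores log(3), about 1.099, which gives a rough baseline for judging a multiclass loss value.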


Area Under the ROC Curve
-------------


In [81]:
from sklearn.metrics import roc_auc_score 
X = iris.data[:, :4]
y = iris.target
# keep only two classes (the first 100 records: classes 0 and 1)
X = X[:100]
y = y[:100]
test_size = 0.33
shuffle = cross_validation.KFold(len(X), n_folds=3, shuffle=True)
     
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
svc_classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=random_state))
svc = svc_classifier.fit(X_train, y_train)
lr_classifier = OneVsRestClassifier(LogisticRegression())
lr = lr_classifier.fit(X_train, y_train)
scoring = 'roc_auc'
lr_auc = cross_validation.cross_val_score(lr, X, y, cv=shuffle, scoring=scoring)
svc_auc = cross_validation.cross_val_score(svc, X, y, cv=shuffle, scoring=scoring)
print("auc: %.3f (%.3f)" % (lr_auc.mean(), lr_auc.std()))
print("auc: %.3f (%.3f)" % (svc_auc.mean(), svc_auc.std()))


auc: 1.000 (0.000)
auc: 1.000 (0.000)
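
The area under the ROC curve can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A value of 1.000, as in the output above, means every positive sample was ranked above every negative sample. The sketch below computes AUC through this pairwise interpretation in plain Python. The function name and the scores are illustrative assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive sample has the higher score and 0.5
    when the two scores are tied.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Illustrative scores only.
y_true = [0, 0, 0, 1, 1, 1]
separable = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]    # every positive outranks every negative
overlapping = [0.1, 0.6, 0.3, 0.5, 0.8, 0.9]  # one negative outranks one positive

auc_separable = roc_auc(y_true, separable)      # 9 of 9 pairs
auc_overlapping = roc_auc(y_true, overlapping)  # 8 of 9 pairs
print(auc_separable, auc_overlapping)
```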


Confusion Matrix
---

In [34]:
from sklearn.metrics import confusion_matrix
X = iris.data[:, :4]
y = iris.target
test_size = 0.33

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
svc_classifier = (svm.SVC(kernel='linear', probability=True, random_state=random_state))
svc = svc_classifier.fit(X_train, y_train)
svc_pred = svc.predict(X_test)
svc_matrix = confusion_matrix(y_test, svc_pred)
lr_classifier = OneVsRestClassifier(LogisticRegression())
lr = lr_classifier.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_matrix = confusion_matrix(y_test, lr_pred)
print("SVC:\n", svc_matrix)
print("Logistic Regression:\n", lr_matrix)

SVC:
 [[16  0  0]
 [ 0 18  1]
 [ 0  0 15]]
Logistic Regression:
 [[16  0  0]
 [ 0 14  5]
 [ 0  0 15]]
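
In the matrices above, rows correspond to true classes and columns to predicted classes, so the diagonal holds the correct predictions. The SVC mislabels one class 1 sample as class 2, while logistic regression mislabels five. The plain-Python sketch below shows how such a matrix is tallied and how test-set accuracy follows from it. The helper names and the toy labels are illustrative assumptions; the two 3x3 matrices are copied from the output above.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair.

    Rows are true classes, columns are predicted classes.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Correct predictions lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Illustrative labels only.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2]
toy_matrix = confusion_matrix(y_true, y_pred, n_classes=3)

# The two matrices printed above.
svc_matrix = [[16, 0, 0], [0, 18, 1], [0, 0, 15]]
lr_matrix = [[16, 0, 0], [0, 14, 5], [0, 0, 15]]
svc_accuracy = accuracy_from_matrix(svc_matrix)  # 49 of 50 correct
lr_accuracy = accuracy_from_matrix(lr_matrix)    # 45 of 50 correct
print(toy_matrix, svc_accuracy, lr_accuracy)
```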


Classification Report
---

In [33]:
from sklearn.metrics import classification_report
X = iris.data[:, :4]
y = iris.target
test_size = 0.33

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
svc_classifier = (svm.SVC(kernel='linear', probability=True, random_state=random_state))
svc = svc_classifier.fit(X_train, y_train)
svc_pred = svc.predict(X_test)
svc_report = classification_report(y_test, svc_pred)
lr_classifier = OneVsRestClassifier(LogisticRegression())
lr = lr_classifier.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_report = classification_report(y_test, lr_pred)
print("SVC:\n", svc_report)
print("Logistic Regression:\n", lr_report)


SVC:
              precision    recall  f1-score   support

          0       1.00      1.00      1.00        16
          1       1.00      0.95      0.97        19
          2       0.94      1.00      0.97        15

avg / total       0.98      0.98      0.98        50

Logistic Regression:
              precision    recall  f1-score   support

          0       1.00      1.00      1.00        16
          1       1.00      0.74      0.85        19
          2       0.75      1.00      0.86        15

avg / total       0.93      0.90      0.90        50
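
The per-class figures in the reports follow directly from the confusion matrices of the previous section: precision is the diagonal entry divided by its column sum, recall is the diagonal entry divided by its row sum, F1 is the harmonic mean of the two, and support is the row sum. The plain-Python sketch below recomputes the logistic regression rows. The function name is an illustrative assumption; the matrix is copied from the confusion matrix output.

```python
def per_class_report(matrix):
    """Return (precision, recall, f1, support) for each class.

    matrix[i][j] counts samples of true class i predicted as class j.
    """
    n = len(matrix)
    rows = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append((precision, recall, f1, actual_k))
    return rows

# Logistic regression confusion matrix from the previous section.
lr_matrix = [[16, 0, 0], [0, 14, 5], [0, 0, 15]]
lr_rows = per_class_report(lr_matrix)
for k, (precision, recall, f1, support) in enumerate(lr_rows):
    print("%d  %.2f  %.2f  %.2f  %d" % (k, precision, recall, f1, support))
```

The five class 1 samples predicted as class 2 lower the recall of class 1 (14 of 19) and, at the same time, the precision of class 2 (15 of 20).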



Conclusion:
-------------
    

Completed by:
---
    Students of group P4117
    Герасин Олег
    Крихели Артём