# Machine Learning Algorithm Performance Metrics

The choice of metrics influences how the performance of machine learning algorithms is measured and compared. It determines how you weight the importance of different characteristics in the results and, ultimately, which algorithm you choose.

## Classification Metrics
- Classification Accuracy.
- Logarithmic Loss.
- Area Under ROC Curve.
- Confusion Matrix.
- Classification Report.

### Classification Accuracy:
- Classification accuracy is the number of correct predictions made as a ratio of all predictions made.
- This is the most common evaluation metric for classification problems; it is also the most misused.
- It is really only suitable when there are an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important (which is often not the case).
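
To make the imbalance caveat concrete, here is a minimal sketch using made-up labels (the 90/10 split and the always-negative "model" are assumptions for illustration, not part of the Pima dataset). It shows that a predictor which never finds a positive case can still score high accuracy.

```python
import numpy as np

# Hypothetical labels: 90 negatives (0) and 10 positives (1).
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy = correct predictions / all predictions.
accuracy = np.mean(y_true == y_pred)

# Fraction of the actual positives that were predicted as positive.
positives_found = np.mean(y_pred[y_true == 1] == 1)

print(accuracy)         # 0.9
print(positives_found)  # 0.0
```

The accuracy looks respectable even though every positive observation is missed, which is why the other metrics in this section are worth checking alongside it.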


In [1]:
# Pima Indians Diabetes Dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [2]:
#Loading dataset
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv('pima-indians-diabetes.data',names=names)

# separate array into input and output components
X = df.drop('class',axis='columns')
Y = df['class']

In [3]:
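# Shuffling is off, so no random_state is needed (recent scikit-learn versions
# raise an error if random_state is set without shuffle=True)
kfold = KFold(n_splits=10)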
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
(results.mean(), results.std())

(0.76951469583048526, 0.048410519245671947)

## Logarithmic Loss:
- Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class.
- The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm.
- Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.
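
The effect of confidence on the score can be sketched by computing binary log loss by hand. The helper name `binary_log_loss`, the labels, and the probabilities below are assumptions chosen for illustration; the formula is the usual mean negative log-likelihood.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and predicted P(class = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = [1, 1, 0, 0]  # hypothetical true labels

confident_right = binary_log_loss(y, [0.9, 0.9, 0.1, 0.1])
hesitant_right = binary_log_loss(y, [0.6, 0.6, 0.4, 0.4])
confident_wrong = binary_log_loss(y, [0.1, 0.1, 0.9, 0.9])

print(round(confident_right, 3))  # 0.105
print(round(hesitant_right, 3))   # 0.511
print(round(confident_wrong, 3))  # 2.303
```

Correct and confident predictions give the smallest loss, while confident mistakes are penalized most heavily.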


In [6]:
# Same folds as before; random_state is omitted because shuffling is off
kfold = KFold(n_splits=10)
model1 = LogisticRegression()
results = cross_val_score(model1, X, Y, cv=kfold, scoring='neg_log_loss')
(results.mean(), results.std())

(-0.49265511114827021, 0.046890273308686774)

Smaller logloss is better, with 0 representing a perfect logloss. The `cross_val_score()`
function reports the negated value (`neg_log_loss`) so that higher scores are always better,
which is why the result above is negative.

## Area Under ROC Curve:
- Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.
- The AUC represents a model's ability to discriminate between positive and negative classes.
- An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is as good as random.
- ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between sensitivity and specificity.


Sensitivity is the true positive rate, also called the recall. It is the number of instances
from the positive (first) class that were actually predicted correctly.

Specificity is also called the true negative rate. It is the number of instances from the
negative (second) class that were actually predicted correctly.
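
As a rough illustration of the two definitions, the sketch below computes both rates from a small set of made-up labels and predictions (all values are assumptions, with 1 as the positive class and 0 as the negative class). Each count is expressed as a fraction of its own class.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))  # positives predicted as positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # positives predicted as negative
tn = np.sum((y_true == 0) & (y_pred == 0))  # negatives predicted as negative
fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives predicted as positive

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate

print(sensitivity)            # 0.75
print(round(specificity, 3))  # 0.667
```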

In [7]:
model2 = LogisticRegression()
results = cross_val_score(model2, X, Y, cv=kfold, scoring='roc_auc')
(results.mean(), results.std())

(0.82341723399457067, 0.040708943934478131)

You can see the AUC is relatively close to 1 and greater than 0.5, suggesting some skill in
the predictions.
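
One way to build intuition for the AUC value is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes this directly with NumPy; the helper name `auc_by_ranking` and the toy scores are assumptions for illustration, and this is not how scikit-learn computes the metric internally.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y = [0, 0, 1, 1]  # hypothetical labels

auc_mixed = auc_by_ranking(y, [0.1, 0.4, 0.35, 0.8])   # one pair out of order
auc_perfect = auc_by_ranking(y, [0.1, 0.2, 0.7, 0.9])  # all positives on top
auc_constant = auc_by_ranking(y, [0.5, 0.5, 0.5, 0.5]) # uninformative scores

print(auc_mixed)     # 0.75
print(auc_perfect)   # 1.0
print(auc_constant)  # 0.5
```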

## Confusion Matrix:
- The confusion matrix is a handy presentation of the accuracy of a model with two or more classes.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model3 = LogisticRegression()
model3.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
predicted = model3.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
print(matrix)

[[141  21]
 [ 41  51]]


Although the array is printed without headings, you can see that the majority of the
predictions fall on the diagonal line of the matrix (which are correct predictions).
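
To read the unlabeled array, it helps to know that `confusion_matrix` puts the actual classes in the rows and the predicted classes in the columns. The sketch below re-enters the matrix printed above by hand (rather than refitting the model) and names each cell for the binary case.

```python
import numpy as np

# Values copied from the printed matrix: rows = actual class, columns = predicted class.
matrix = np.array([[141, 21],
                   [41, 51]])

# For labels 0 and 1, the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = matrix.ravel()

# Correct predictions sit on the diagonal.
correct = np.trace(matrix)
accuracy = correct / matrix.sum()

print(tn, fp, fn, tp)      # 141 21 41 51
print(correct)             # 192
print(round(accuracy, 3))  # 0.756
```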

## Classification Report:
The scikit-learn library provides a convenience report when working on classification
problems to give you a quick idea of the accuracy of a model using a number of measures. The
`classification_report()` function displays the precision, recall, F1-score and support for each
class.

In [13]:
from sklearn.metrics import classification_report

In [15]:
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
report = classification_report(Y_test, predicted)
print(report)

             precision    recall  f1-score   support

          0       0.77      0.87      0.82       162
          1       0.71      0.55      0.62        92

avg / total       0.75      0.76      0.75       254
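
The per-class figures in the report can be reproduced by hand. The sketch below assumes the report was produced from the same predictions as the confusion matrix printed earlier in this document, and re-enters that matrix manually; the dictionary name `per_class` is just for illustration.

```python
import numpy as np

# Confusion matrix from earlier: rows = actual class, columns = predicted class.
matrix = np.array([[141, 21],
                   [41, 51]])

per_class = {}
for label in (0, 1):
    hits = matrix[label, label]                # correctly predicted as this class
    precision = hits / matrix[:, label].sum()  # out of everything predicted as this class
    recall = hits / matrix[label, :].sum()     # out of everything actually in this class
    f1 = 2 * precision * recall / (precision + recall)
    support = int(matrix[label, :].sum())      # number of actual instances
    per_class[label] = (round(precision, 2), round(recall, 2), round(f1, 2), support)

print(per_class[0])  # (0.77, 0.87, 0.82, 162)
print(per_class[1])  # (0.71, 0.55, 0.62, 92)
```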



https://www.quora.com/What-is-the-best-way-to-understand-the-terms-precision-and-recall

# Regression Metrics

- Mean Absolute Error.
- Mean Squared Error.
- R2.

In [16]:
from sklearn.linear_model import LinearRegression

In [20]:
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv('housing.data', names=names, sep=r'\s+')

In [21]:
df.head()

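|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |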


In [22]:
# separate array into input and output components
X = df.drop('MEDV',axis='columns')
Y = df['MEDV']

## Mean Absolute Error

- The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. It gives an idea of how wrong the predictions were.
- The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).
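
The loss of direction can be seen in a small hand computation. The target values and predictions below are made up for illustration; the point is that mirroring every error around the true value leaves the MAE unchanged.

```python
import numpy as np

# Hypothetical actual values and two sets of predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_over = np.array([2.5, 5.0, 4.0, 8.0])  # mostly over-predicting
y_under = 2 * y_true - y_over            # same error sizes, opposite signs

mae_over = np.mean(np.abs(y_over - y_true))
mae_under = np.mean(np.abs(y_under - y_true))

# The mean signed error keeps the direction that MAE discards.
bias_over = np.mean(y_over - y_true)
bias_under = np.mean(y_under - y_true)

print(mae_over, mae_under)    # 0.75 0.75
print(bias_over, bias_under)  # 0.5 -0.5
```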

In [23]:
model = LinearRegression()
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
(results.mean(), results.std())

(-4.0049466353239556, 2.0835992687095457)

A value of 0 indicates no error or perfect predictions. Like logloss, this metric is negated by
the `cross_val_score()` function, so values closer to 0 are better.

## Mean Squared Error

The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a
gross idea of the magnitude of error. Taking the square root of the mean squared error converts
the units back to the original units of the output variable and can be meaningful for description
and presentation. This is called the Root Mean Squared Error (or RMSE). The example below
provides a demonstration of calculating mean squared error.

In [24]:
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
(results.mean(), results.std())

(-34.705255944524602, 45.573999200308421)

This metric too is inverted so that the results are increasing. Remember to take the absolute
value before taking the square root if you are interested in calculating the RMSE.
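
A minimal sketch of that conversion follows. The first value is the mean score reported above; the per-fold array `fold_scores` is made up for illustration and does not come from the actual run.

```python
import numpy as np

# Mean negated MSE as reported by cross_val_score above.
neg_mse_mean = -34.705255944524602

# Flip the sign (absolute value) before taking the square root.
rmse_from_mean = np.sqrt(np.abs(neg_mse_mean))
print(round(rmse_from_mean, 3))  # 5.891

# Hypothetical per-fold negated MSE scores, to show the per-fold variant.
fold_scores = np.array([-16.0, -25.0, -64.0])
fold_rmse = np.sqrt(-fold_scores)
print(fold_rmse)         # [4. 5. 8.]
```

Note that the square root of the mean MSE and the mean of the per-fold RMSE values are generally not equal, so it is worth stating which one is being reported.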

## R2 Metric
The R2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions
to the actual values. In statistical literature this measure is called the coefficient of determination.
The value is 1 for a perfect fit and 0 for a model that does no better than always predicting the mean;
on held-out data it can also be negative when the model does worse than that.
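
The definition can be sketched as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r2_by_hand` and the toy values are assumptions for illustration; the last case shows that predictions worse than the mean produce a negative score.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]  # hypothetical actual values

r2_perfect = r2_by_hand(y, [1.0, 2.0, 3.0, 4.0])   # exact predictions
r2_mean = r2_by_hand(y, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_reversed = r2_by_hand(y, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(r2_perfect)   # 1.0
print(r2_mean)      # 0.0
print(r2_reversed)  # -3.0
```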

In [25]:
model = LinearRegression()
scoring = 'r2'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
(results.mean(), results.std())

(0.20252899006057859, 0.59529601695119627)

You can see the predictions have a poor fit to the actual values with a value closer to zero
and less than 0.5.
