# Chapter 5. Model evaluation and enhancement.
# Part 3. Quality metrics.
R^2 for regression and accuracy for classification are not always the metrics that are actually needed, so other metrics exist.

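A binary classifier can make 2 types of errors:

1) False positive: a negative sample is predicted as positive

2) False negative: a positive sample is predicted as negative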

Accuracy is NOT an adequate metric for evaluating the predictive ability of a model fitted on an imbalanced dataset.
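
To make this concrete, here is a minimal sketch on a made-up dataset. The 900/100 class split is an assumption chosen for illustration only, not data from this chapter. A "model" that always answers negative never finds a positive sample, yet its accuracy looks respectable:

```python
# Hypothetical imbalanced labels: 900 negatives, 100 positives (illustrative only)
y_true = [False] * 900 + [True] * 100
# A "model" that always predicts the majority (negative) class
y_pred = [False] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# False negatives: positive samples predicted as negative
false_negatives = sum(t and not p for t, p in zip(y_true, y_pred))
# False positives: negative samples predicted as positive
false_positives = sum(p and not t for t, p in zip(y_true, y_pred))

print('Accuracy: {:.2f}'.format(accuracy))
print('False negatives: {}'.format(false_negatives))
print('False positives: {}'.format(false_positives))
```

The accuracy comes out at 0.90 even though all 100 positive samples are missed.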

## Confusion matrices
One of the most informative ways to evaluate a model's predictive ability

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

#-----setting up
#--loading dataset
digits = load_digits()
#'== 9' turns the task into a binary one: 'nine' (True) vs 'not a nine' (False)
y = digits.target == 9
X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state=0)

#initializing, fitting and applying logreg
logreg = LogisticRegression(C=0.2).fit(X_train, y_train)
pred_logreg = logreg.predict(X_test)

#-----applying confusion matrix
confusion = confusion_matrix(y_test, pred_logreg)
print('Confusion matrix: \n{}'.format(confusion))

Confusion matrix: 
[[402   1]
 [  6  41]]


^ The solver reached its iteration limit before converging (scikit-learn's ConvergenceWarning). Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Alternative solver options are described in:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


^ The resulting array reads as follows:

Number of rows (2): there are 2 actual classes, and each row represents one of them

Number of columns (2): there are 2 predicted classes, and each column represents one of them

The main diagonal holds the numbers of correctly predicted samples

The remaining elements hold the numbers of mistakes: here 1 false positive (top right) and 6 false negatives (bottom left)
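
As a small sketch, the four cells can be unpacked into named counts. scikit-learn orders rows and columns by sorted label, so with the labels False/True used here the layout is [[TN, FP], [FN, TP]]. The matrix is copied by hand from the output above so the snippet runs on its own:

```python
# Confusion matrix printed above for the logistic regression model
confusion = [[402, 1],
             [6, 41]]

# Row 0 = actual "not a nine", row 1 = actual "a nine";
# column 0 = predicted "not a nine", column 1 = predicted "a nine"
(tn, fp), (fn, tp) = confusion

print('True negatives:  {}'.format(tn))
print('False positives: {}'.format(fp))
print('False negatives: {}'.format(fn))
print('True positives:  {}'.format(tp))
print('Total mistakes:  {}'.format(fp + fn))
```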

#### Some examples for alternative models:
These models show good accuracy, but given the imbalanced dataset they were trained on and the deliberately naive logic of some of them, other tools such as the confusion matrix are needed to check their actual predictive ability.

In [2]:
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

#-----setting up models
#strategy - most frequent class preferring
dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
pred_most_frequent = dummy_majority.predict(X_test)

#strategy - tree model
tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
pred_tree = tree.predict(X_test)

#strategy - dummy model with default settings
dummy = DummyClassifier().fit(X_train, y_train)
pred_dummy = dummy.predict(X_test)

#model's quality check
print("<Most frequent> strategy:")
print(confusion_matrix(y_test, pred_most_frequent))
print("\n<Dummy model> strategy:")
print(confusion_matrix(y_test, pred_dummy))
print("\n<Tree model> strategy:")
print(confusion_matrix(y_test, pred_tree))

<Most frequent> strategy:
[[403   0]
 [ 47   0]]

<Dummy model> strategy:
[[403   0]
 [ 47   0]]

<Tree model> strategy:
[[390  13]
 [ 24  23]]


^ Among these alternatives only the tree model shows halfway decent results, and it is still worse than logreg: it misses 24 of the 47 nines, while logreg misses only 6.
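
To see why accuracy alone is misleading here, the sketch below recomputes it from the confusion matrices printed above (copied by hand). The helper name accuracy_from_confusion is made up for this illustration:

```python
# Confusion matrices copied from the outputs above
matrices = {
    'Most frequent': [[403, 0], [47, 0]],
    'Dummy': [[403, 0], [47, 0]],
    'Tree': [[390, 13], [24, 23]],
    'LogReg': [[402, 1], [6, 41]],
}

def accuracy_from_confusion(matrix):
    """Correct predictions (main diagonal) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

accuracies = {name: accuracy_from_confusion(m) for name, m in matrices.items()}
for name, acc in accuracies.items():
    print('{} accuracy: {:.3f}'.format(name, acc))
```

The majority-class models reach an accuracy of about 0.90 without ever predicting a nine, which is close to the tree model's score.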

## Precision, recall, F-measure
Precision, recall and F-measure are further metrics (besides accuracy) that describe specific aspects of predictive ability.

Precision. It is high when most samples predicted as positive are actually positive.

Precision = TP / (TP + FP)

Recall. It is high when most actual positive samples are found.

Recall = TP / (TP + FN)

F-measure. The harmonic mean of precision and recall; it is high only when both are high.

F-measure = 2 * (Precision * Recall) / (Precision + Recall)
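
As a worked example, the three formulas can be applied to the positive class ("a nine") of the LogReg confusion matrix above, where TP = 41, FP = 1 and FN = 6. The helper precision_recall_f is a name made up for this sketch, and returning 0.0 for a zero denominator is an assumption made to keep the function total:

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure); ratios with a zero denominator become 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Counts for the positive class taken from the LogReg confusion matrix
precision, recall, f_measure = precision_recall_f(tp=41, fp=1, fn=6)

print('Precision: {:.3f}'.format(precision))
print('Recall:    {:.3f}'.format(recall))
print('F-measure: {:.3f}'.format(f_measure))
```

A model that never predicts the positive class (TP = 0, FP = 0) gets 0.0 for all three values under this convention.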

In [5]:
from sklearn.metrics import f1_score
print('MostFreq F-measure: {}'.format(f1_score(y_test, pred_most_frequent)))
print('Dummy F-measure: {}'.format(f1_score(y_test, pred_dummy)))
print('Tree F-measure: {}'.format(f1_score(y_test, pred_tree)))
print('Logreg F-measure: {}'.format(f1_score(y_test, pred_logreg)))


MostFreq F-measure: 0.0
Dummy F-measure: 0.0
Tree F-measure: 0.5542168674698795
Logreg F-measure: 0.9213483146067415


^ Clearly, only LogReg has decent predictive ability. Both dummy models score 0.0 because they never predict the positive class.

Another tool produces a full report on predictive ability:

In [11]:
from sklearn.metrics import classification_report

print('- MostFreq report:')
print(classification_report(y_test, pred_most_frequent, target_names=['not-a-nine','a-nine']))
print()
print('- Dummy report:')
print(classification_report(y_test, pred_dummy, target_names=['not-a-nine','a-nine']))
print()
print('- LogReg report:')
print(classification_report(y_test, pred_logreg, target_names=['not-a-nine','a-nine']))

- MostFreq report:
              precision    recall  f1-score   support

  not-a-nine       0.90      1.00      0.94       403
      a-nine       0.00      0.00      0.00        47

    accuracy                           0.90       450
   macro avg       0.45      0.50      0.47       450
weighted avg       0.80      0.90      0.85       450


- Dummy report:
              precision    recall  f1-score   support

  not-a-nine       0.90      1.00      0.94       403
      a-nine       0.00      0.00      0.00        47

    accuracy                           0.90       450
   macro avg       0.45      0.50      0.47       450
weighted avg       0.80      0.90      0.85       450


- LogReg report:
              precision    recall  f1-score   support

  not-a-nine       0.99      1.00      0.99       403
      a-nine       0.98      0.87      0.92        47

    accuracy                           0.98       450
   macro avg       0.98      0.93      0.96       450
weighted avg       0



^ Again, only LogReg shows decent predictive ability among the reported models: it is the only one with a high F-measure for both classes.
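
The last two rows of each report are averages of the per-class scores: "macro avg" is the plain mean over classes, while "weighted avg" weights each class by its support. A minimal sketch for the precision column of the MostFreq report, with the per-class values derived by hand from its confusion matrix [[403, 0], [47, 0]]:

```python
# Per-class precision for the MostFreq model:
# all 450 samples are predicted as "not-a-nine", 403 of them correctly;
# "a-nine" is never predicted, so its precision is reported as 0.0
precisions = [403 / 450, 0.0]
# Support: number of actual samples in each class
supports = [403, 47]

# Macro average: every class counts equally
macro_avg = sum(precisions) / len(precisions)
# Weighted average: every class counts in proportion to its support
weighted_avg = sum(p * s for p, s in zip(precisions, supports)) / sum(supports)

print('macro avg precision:    {:.2f}'.format(macro_avg))
print('weighted avg precision: {:.2f}'.format(weighted_avg))
```

Because the weighted average is dominated by the majority class, the macro average is the more telling of the two on an imbalanced dataset.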