# Machine Learning with Python

In [None]:
import matplotlib.pyplot as plt
import numpy as np

## 2.3 Evaluation

There are many metrics that we can use to evaluate the performance of a supervised learning model.

### [Evaluating Classifiers](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)

`sklearn.metrics` provides most of the commonly used metrics; see the [documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).

Some of these are restricted to binary classifiers, but others are also defined for multiclass problems (`y` takes one of several possible values) and/or multilabel problems (each sample can have several values of `y` simultaneously).
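For multiclass problems, most per-class metrics need an averaging rule to produce a single number. The sketch below uses made-up labels (the arrays `toy_true` and `toy_pred` are illustrative and not taken from any dataset in this notebook) to show how the `average` argument of `f1_score` changes the result.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up three-class labels, for illustration only
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# average=None returns one F1 score per class
f1_per_class = f1_score(toy_true, toy_pred, average=None)
# "macro": unweighted mean of the per-class scores
f1_macro = f1_score(toy_true, toy_pred, average="macro")
# "weighted": mean of the per-class scores, weighted by class frequency in toy_true
f1_weighted = f1_score(toy_true, toy_pred, average="weighted")
# "micro": computed from true/false positive counts pooled over all classes
f1_micro = f1_score(toy_true, toy_pred, average="micro")

print("per class:", f1_per_class)
print("macro = %.3f, weighted = %.3f, micro = %.3f" % (f1_macro, f1_weighted, f1_micro))
print("accuracy = %.3f" % accuracy_score(toy_true, toy_pred))
```

For single-label multiclass problems like this one, the micro-averaged F1 coincides with the accuracy, whereas the macro average treats every class equally regardless of its size.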

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import RocCurveDisplay
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=42)
svc = SVC(probability=True, random_state=42)
svc.fit(X_train, y_train)

In [None]:
y_pred = svc.predict(X_test)

In [None]:
from sklearn.metrics import classification_report
print( classification_report(y_test,y_pred) )
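The precision and recall columns of the classification report can be traced back to the confusion matrix. As a minimal sketch, using made-up binary labels (`cm_true` and `cm_pred` are illustrative names and values), the following computes both quantities by hand for the positive class and compares them with `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up binary labels, for illustration only
cm_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
cm_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(cm_true, cm_pred)
tn, fp, fn, tp = cm.ravel()

precision_manual = tp / (tp + fp)  # fraction of predicted positives that are correct
recall_manual = tp / (tp + fn)     # fraction of actual positives that are found

print(cm)
print("precision: %.2f (manual) vs %.2f (sklearn)" % (precision_manual, precision_score(cm_true, cm_pred)))
print("recall:    %.2f (manual) vs %.2f (sklearn)" % (recall_manual, recall_score(cm_true, cm_pred)))
```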

The receiver operating characteristic (ROC) curve gives a useful visual evaluation for any method that can return probabilities or prediction scores. The `roc_curve` function works for binary classification:

In [None]:
from sklearn.metrics import roc_curve

probs = svc.predict_proba(X_test)
fpr,tpr,thresholds = roc_curve(y_test,probs[:,1])

In [None]:
plt.plot(fpr,tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()
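Each point on a ROC curve corresponds to one decision threshold applied to the scores. The sketch below illustrates this on four made-up samples: the helper `rates_at_threshold` is a hypothetical function written for this example (it is not part of scikit-learn), and it recomputes the false and true positive rates at each threshold returned by `roc_curve`.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores, for illustration only
toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at_threshold(labels, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    predicted_pos = scores >= threshold
    tpr_value = np.sum(predicted_pos & (labels == 1)) / np.sum(labels == 1)
    fpr_value = np.sum(predicted_pos & (labels == 0)) / np.sum(labels == 0)
    return fpr_value, tpr_value

toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_labels, toy_scores)

# Skip the first threshold, which roc_curve adds to represent
# "no sample is predicted positive", i.e. the point (0, 0)
for thr, f, t in zip(toy_thresholds[1:], toy_fpr[1:], toy_tpr[1:]):
    print("threshold %.2f: roc_curve gives (%.2f, %.2f), manual gives" % (thr, f, t),
          rates_at_threshold(toy_labels, toy_scores, thr))
```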

We can also get the area under the curve (AUC) as a metric:

In [None]:
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test,probs[:,1])
print("AUC =",auc)
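One way to interpret the AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample (with ties counted as one half). The following sketch checks this on made-up data by comparing every positive-negative pair; the array names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only
pair_labels = np.array([0, 0, 1, 1])
pair_scores = np.array([0.1, 0.4, 0.35, 0.8])

pos_scores = pair_scores[pair_labels == 1]
neg_scores = pair_scores[pair_labels == 0]

# Score difference for every (positive, negative) pair
diff = pos_scores[:, None] - neg_scores[None, :]
# Fraction of pairs ranked correctly, counting ties as half
auc_pairs = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print("AUC from pairs    =", auc_pairs)
print("AUC from sklearn  =", roc_auc_score(pair_labels, pair_scores))
```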

The precision-recall (PR) curve is also a useful evaluation, particularly for tasks where the positive class is rare and we want to keep false positives under control, e.g. screening a population for a disease.

In [None]:
from sklearn.metrics import precision_recall_curve

pre,rec,thresholds = precision_recall_curve(y_test,probs[:,1])

In [None]:
plt.plot(rec,pre)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()

The [average precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score), a weighted mean of the precision values along this curve, is often quoted as a summary metric:

In [None]:
from sklearn.metrics import average_precision_score
avg_pre = average_precision_score(y_test, probs[:,1])
print("Average precision =",avg_pre)
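The weights in the average precision are the increases in recall between consecutive thresholds: $\mathrm{AP} = \sum_n (R_n - R_{n-1}) P_n$, where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold. As a minimal sketch on made-up data (array names are illustrative), this sum can be reproduced from the output of `precision_recall_curve`:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up labels and scores, for illustration only
ap_labels = np.array([0, 0, 1, 1])
ap_scores = np.array([0.1, 0.4, 0.35, 0.8])

ap_precision, ap_recall, _ = precision_recall_curve(ap_labels, ap_scores)

# precision_recall_curve returns recall in decreasing order, so the
# differences are negative and the sum needs a minus sign
ap_manual = -np.sum(np.diff(ap_recall) * ap_precision[:-1])

print("AP (manual)  =", ap_manual)
print("AP (sklearn) =", average_precision_score(ap_labels, ap_scores))
```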

### [Evaluating Regressors](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)

Once again, there are several metrics for evaluating regression models; the user guide has full details for each one.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, random_state=0)

In [None]:
from sklearn.neural_network import MLPRegressor
# (100,) is a one-element tuple: a single hidden layer with 100 units
nn = MLPRegressor(hidden_layer_sizes=(100,), max_iter=10000)
nn.fit(X_train,y_train)

In [None]:
y_pred = nn.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,max_error,explained_variance_score,r2_score

print("Mean Absolute Error, MAE = %.2f" % mean_absolute_error(y_test, y_pred))
print("Mean Squared Error, MSE = %.2f" % mean_squared_error(y_test, y_pred))
print("Max Error = %.2f" % max_error(y_test, y_pred))
print("Explained Variance Score = %.2f" % explained_variance_score(y_test, y_pred))
print("Coefficient of determination, r2 = %.2f" % r2_score(y_test, y_pred))
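To make the definitions concrete, the sketch below recomputes each of these metrics with NumPy on four made-up values (`reg_true` and `reg_pred` are illustrative, not model output) and prints them next to the scikit-learn results.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, max_error,
                             mean_absolute_error, mean_squared_error, r2_score)

# Made-up targets and predictions, for illustration only
reg_true = np.array([3.0, -0.5, 2.0, 7.0])
reg_pred = np.array([2.5, 0.0, 2.0, 8.0])
residuals = reg_true - reg_pred

mae_manual = np.mean(np.abs(residuals))        # mean absolute error
mse_manual = np.mean(residuals ** 2)           # mean squared error
max_err_manual = np.max(np.abs(residuals))     # largest absolute residual
# Explained variance: 1 - Var(residuals) / Var(targets)
ev_manual = 1 - np.var(residuals) / np.var(reg_true)
# R^2: 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((reg_true - reg_true.mean()) ** 2)

print("MAE: %.3f vs %.3f" % (mae_manual, mean_absolute_error(reg_true, reg_pred)))
print("MSE: %.3f vs %.3f" % (mse_manual, mean_squared_error(reg_true, reg_pred)))
print("Max error: %.3f vs %.3f" % (max_err_manual, max_error(reg_true, reg_pred)))
print("Explained variance: %.3f vs %.3f" % (ev_manual, explained_variance_score(reg_true, reg_pred)))
print("R^2: %.3f vs %.3f" % (r2_manual, r2_score(reg_true, reg_pred)))
```

The explained variance and R² differ only when the residuals have a non-zero mean, i.e. when the predictions are systematically biased.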


### [Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)

Cross-validation is essential in model development: it allows us to compare the performance of alternative algorithms and different settings for model hyperparameters, *without* making use of the test data. Keeping the test data untouched in this way means that it can still give an unbiased assessment of the final model performance.

`KFold` is a simple way to get the data indices for cross-validation, which we can loop over:

In [None]:
# Using only the first 100 data points
X = diabetes.data[:100]
y = diabetes.target[:100]

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5,shuffle=True,random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lm = LinearRegression()

for train, test in kf.split(X):
    print("training set indices:")
    print(train)
    print("test set indices:")
    print(test)
    lm.fit(X[train], y[train])
    y_pred = lm.predict(X[test])
    print("r2 = %.2f" % r2_score(y[test],y_pred))
    print()

If we just want to calculate a metric, the convenience function `cross_val_score` does the splitting, fitting and scoring in a single call.

In [None]:
from sklearn.model_selection import cross_val_score
lm = LinearRegression()
score = cross_val_score( lm,X,y,cv=5,scoring='r2' )
print("Cross-validated r2:")
print(score)

We would usually quote the mean score under cross-validation:

In [None]:
print("mean r2 =", np.mean(score))

The standard deviation of the cross-validation scores is also useful: it shows how much the score varies from fold to fold, which gives a rough indication of the uncertainty in the estimated performance on unseen test data.

In [None]:
print("sd =", np.std(score))
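Note that passing an integer such as `cv=5` for a regressor uses `KFold` without shuffling, so the folds differ from those of a shuffled `KFold`. To score on a specific set of splits, a splitter object can be passed as `cv` instead. The sketch below (variable names such as `X_cv` and `splitter` are illustrative) checks that this reproduces the scores from an explicit loop over the same splitter.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

cv_data = load_diabetes()
X_cv, y_cv = cv_data.data[:100], cv_data.target[:100]

# A fixed random_state makes the shuffled splits reproducible
splitter = KFold(n_splits=5, shuffle=True, random_state=42)

# Explicit loop over the folds
loop_scores = []
for train_idx, test_idx in splitter.split(X_cv):
    model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    loop_scores.append(r2_score(y_cv[test_idx], model.predict(X_cv[test_idx])))

# The same folds, scored in one call
splitter_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=splitter, scoring="r2")

print("loop:           ", np.round(loop_scores, 3))
print("cross_val_score:", np.round(splitter_scores, 3))
```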

In addition to the basic *k*-fold cross-validation, there are many alternative procedures that may be suitable depending on the structure of your particular data set. 

For example, there may be definable subgroups within the data that we might want to leave out of training one at a time, to assess how good the predictor is at extrapolating beyond known groups.
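scikit-learn provides `LeaveOneGroupOut` for this kind of procedure. The diabetes data has no natural grouping, so the sketch below invents one purely for illustration: the `groups` array (five blocks of 20 consecutive samples) is a hypothetical assumption, standing in for something like a patient or site identifier in a real data set.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

grp_data = load_diabetes()
X_grp, y_grp = grp_data.data[:100], grp_data.target[:100]

# Hypothetical group labels: five groups of 20 consecutive samples
groups = np.repeat(np.arange(5), 20)

logo = LeaveOneGroupOut()

# Each split holds out all samples from exactly one group
for train_idx, test_idx in logo.split(X_grp, y_grp, groups=groups):
    print("held-out group:", np.unique(groups[test_idx]), "test size:", len(test_idx))

group_scores = cross_val_score(LinearRegression(), X_grp, y_grp,
                               groups=groups, cv=logo, scoring="r2")
print("r2 per held-out group:", np.round(group_scores, 3))
```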

### Exercise

Use ROC curves to compare the performance of a Decision Tree classifier and a Logistic Regression classifier on the `breast_cancer` dataset.

Use 5-fold cross-validation to evaluate your regressor for the `wine_quality_white` dataset.