Given a trained classification model, you want to evaluate its quality.

Use scikit-learn’s cross_val_score to conduct cross-validation, using the
scoring parameter to select one of a number of performance metrics, including
accuracy, precision, recall, and F1.
Accuracy is a common performance metric. It is simply the proportion of
observations predicted correctly:

![](./pics/accuracy.jpg)


We can measure accuracy in five-fold (the default number of folds in scikit-learn 0.22 and later) cross-validation by setting scoring="accuracy":

In [1]:
# Load libraries
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate features matrix and target vector
X, y = make_classification(n_samples=10000,
                           n_features=3,
                           n_informative=3,
                           n_redundant=0,
                           n_classes=2,
                           random_state=1)

# Create logistic regression
logit = LogisticRegression()

# Cross-validate model using accuracy
cross_val_score(logit, X, y, scoring="accuracy")

array([0.9555, 0.95  , 0.9585, 0.9555, 0.956 ])
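cross_val_score returns one score per fold, so in practice the array is usually summarized. The following is a minimal sketch that reuses the five accuracy values printed above; the variable names are illustrative only, and the scores are copied by hand rather than recomputed.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above
fold_scores = np.array([0.9555, 0.95, 0.9585, 0.9555, 0.956])

# Summarize the folds with their mean and (population) standard deviation
mean_accuracy = fold_scores.mean()
std_accuracy = fold_scores.std()

print(f"Mean accuracy: {mean_accuracy:.4f}")
print(f"Standard deviation: {std_accuracy:.4f}")
```

A small standard deviation relative to the mean suggests the score is stable across folds.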

The appeal of accuracy is that it has an intuitive, plain-English explanation:
the proportion of observations predicted correctly. However, in the real world
our data often has imbalanced classes (e.g., 99.9% of observations are of class 1
and only 0.1% are of class 2). In the presence of imbalanced classes,
accuracy suffers from a paradox where a model is highly accurate but lacks
predictive power. For example, imagine we are trying to predict the presence of
a very rare cancer that occurs in 0.1% of the population. After training our
model, we find the accuracy is 95%. However, 99.9% of people do not have
the cancer: if we simply created a model that “predicted” that nobody had that
form of cancer, our naive model would be 4.9 percentage points more accurate,
but it clearly would not be able to predict anything. For this reason, we are
often motivated to use other metrics like precision, recall, and the F1 score.
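The cancer example can be reproduced with a few lines of plain Python. This is only a sketch under stated assumptions: the population of 10,000 people and the helper function accuracy are hypothetical and chosen to match the 0.1% prevalence in the text.

```python
# Hypothetical population: 10,000 people, 0.1% (10 people) have the cancer
n_people = 10_000
n_sick = 10
y_true = [1] * n_sick + [0] * (n_people - n_sick)

# Naive "model": predict that nobody has the cancer
y_naive = [0] * n_people


def accuracy(y_true, y_pred):
    """Proportion of observations predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


naive_accuracy = accuracy(y_true, y_naive)

# Number of actual cancer cases the naive model identifies
cases_detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_naive))

print(naive_accuracy)   # 0.999
print(cases_detected)   # 0
```

The naive model reaches 99.9% accuracy while detecting none of the ten cases, which is exactly the paradox described above.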

Precision is the proportion of observations predicted to be positive that are
actually positive:

![](./pics/precision.jpg)

In [2]:
# Cross-validate model using precision
cross_val_score(logit, X, y, scoring="precision")

array([0.95963673, 0.94820717, 0.9635996 , 0.96149949, 0.96060606])

Recall is the proportion of truly positive observations that are predicted to
be positive:

![](./pics/recall.jpg)

In [3]:
# Cross-validate model using recall
cross_val_score(logit, X, y, scoring="recall")

array([0.951, 0.952, 0.953, 0.949, 0.951])
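To make the two metrics concrete, here is a small sketch that computes them by hand from counts of true positives (TP), false positives (FP), and false negatives (FN), using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The labels and the helper confusion_counts are made up for illustration and are not part of scikit-learn.

```python
# Hypothetical true and predicted labels (1 = positive class)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]


def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, fn


tp, fp, fn = confusion_counts(y_true, y_pred)

# Of everything predicted positive, how much really is positive?
precision = tp / (tp + fp)

# Of everything that really is positive, how much did we find?
recall = tp / (tp + fn)

print(f"precision = {precision:.3f}")  # 0.667
print(f"recall = {recall:.3f}")        # 0.400
```

This hypothetical model is cautious: it rarely predicts the positive class, so most of its positive predictions are right (decent precision), but it misses most of the actual positives (low recall).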

If this is the first time you have encountered precision and recall, it is
understandable if it takes you a little while to fully understand them. This is
one of the advantages of accuracy: precision and recall are less intuitive.
Almost always we want some kind of balance between precision and recall, and
this role is filled by the F1 score. The F1 score is the harmonic mean (a kind
of average used for ratios) of precision and recall:

![](./pics/f1.jpg)

In [4]:
# Cross-validate model using f1
cross_val_score(logit, X, y, scoring="f1")

array([0.95529884, 0.9500998 , 0.95827049, 0.95520886, 0.95577889])
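As a quick check of the harmonic-mean relationship, the sketch below recomputes the first fold's F1 from the first-fold precision and recall printed earlier. The helper f1_from is a hypothetical name, and the second call uses made-up values to show how the harmonic mean penalizes a large gap between the two metrics.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# First-fold values copied from the precision and recall outputs above
fold_precision = 0.95963673
fold_recall = 0.951
fold_f1 = f1_from(fold_precision, fold_recall)
print(round(fold_f1, 4))  # 0.9553, in line with the first F1 score above

# Made-up, lopsided values: perfect precision but very low recall
lopsided_f1 = f1_from(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2
print(round(lopsided_f1, 3))      # 0.182
print(round(arithmetic_mean, 3))  # 0.55
```

Unlike the arithmetic mean, the harmonic mean stays close to the smaller of the two values, so a model cannot earn a high F1 score by excelling at only one of precision or recall.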

As an evaluation metric, accuracy has some valuable properties, especially its
simple intuition. However, better metrics often involve using some balance of
precision and recall; that is, a trade-off between the optimism and pessimism of
our model. F1 represents a balance between recall and precision, where the
relative contributions of both are equal.

As an alternative to using cross_val_score, if we already have the true y values
and the predicted y values, we can calculate metrics like accuracy and recall
directly:

In [5]:
# Load libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create training and test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    random_state=1)

# Train the model, then predict the target vector for the test set
y_hat = logit.fit(X_train, y_train).predict(X_test)

In [6]:
accuracy_score(y_test, y_hat)


0.947
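The other metrics discussed here have matching functions in sklearn.metrics that take the same two arguments. The sketch below uses small made-up label vectors (not the y_test and y_hat from above) so that the results can be checked by hand.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Each function takes the true labels first, then the predictions
scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With three true positives, two false positives, and one false negative, this gives an accuracy of 0.700, precision of 0.600, recall of 0.750, and F1 of 0.667.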