# Machine Learning with scikit-learn

In the lessons so far, we have mostly just used the default `.score()` method of fitted models.  For most or all classification models, this measures accuracy.  For most or all regression models, this measures $R^2$ (coefficient of determination).  As we have mentioned, the subpackage `sklearn.metrics` contains a large number of other scorers.  Depending on your purpose, one of these might be more appropriate.

In this lesson we will look at a few such metrics, but we will also develop a custom metric that is not included in scikit-learn (and presumably never will be, for reasons we will see).

In [1]:
%matplotlib inline
from src.setup import *

## A slightly unbalanced classification

For this lesson, we will look at the [Dermatology Data Set](https://archive.ics.uci.edu/ml/datasets/Dermatology) available from UCI.  This data contains 34 measurements of 366 patients, with each one diagnosed as having one of six skin conditions.  Our purpose in using this data is two-fold. On the one hand, we want to look at a multi-class classification problem, which we have not done extensively in these lessons.  But more interestingly, at the end we want to look at the value of non-top diagnoses, which may have utility for particular domain problems.

> <small>Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.</small>

### Digression on multi-labels

Note that what we present here is **not** a multi-label problem.  In some situations it is useful to identify more than one class to which a sample might belong.  In the current domain, that would be patients who have multiple skin conditions at once.  That is possible, but we assume this dataset contains no such cases.  Or in another domain, we might wish to characterize a photographic image by multiple classes.  For example, an image containing both a cat and a dog would get both of these labels, but none of the other (say) 98 available labels, because those things were not in the image.  Multi-label problems can be addressed with scikit-learn.

See the official documentation of [multiclass and multilabel algorithms](https://scikit-learn.org/stable/modules/multiclass.html).  Note that multi-output is related to multi-label, but is a somewhat different concept.  In multi-label, any number of labels may be identified (including zero).  This is akin to one-hot encoding, but of the output (maybe "multi-hot" would be a good description).  

In contrast, multi-output identifies a fixed number of outputs, which we might think of as orthogonal dimensions of the output.  In a sense this is like the fixed number of input features.  For example, in the photo classification problem, we might always want to predict `(color, subject, lighting)` for every image.  So sometimes it is a "brown dog in daylight", other times it is a "white cat at night."
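To make the distinction concrete, here is a small sketch in plain Python.  The label vocabulary, the example photos, and the helper name `multi_hot` are all invented for illustration; nothing here comes from scikit-learn or from the dermatology data.

```python
# Hypothetical label vocabulary for the photo example (illustrative only).
LABELS = ["cat", "dog", "car", "tree"]

def multi_hot(present, labels=LABELS):
    """Encode a set of labels as a 0/1 vector, one slot per known label."""
    return [1 if label in present else 0 for label in labels]

# Multi-label: any number of labels per sample, including zero.
multilabel_targets = [
    multi_hot({"cat", "dog"}),  # both animals appear in the photo
    multi_hot({"tree"}),
    multi_hot(set()),           # nothing we know about appears
]

# Multi-output: exactly one value for each of a fixed set of outputs.
multioutput_targets = [
    ("brown", "dog", "daylight"),  # (color, subject, lighting)
    ("white", "cat", "night"),
]

print(multilabel_targets)
print(multioutput_targets)
```

The multi-label rows can contain any number of ones, while every multi-output row always has exactly three entries.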

Basically all classification models can be transformed into multi-label algorithms by transforming the problem into a collection of one-vs-all classifiers.  For example, one model is cat-vs-not-cat; another model is dog-vs-not-dog.  Similarly for all of the stipulated 100 known classes.  If both the cat-vs-not-cat and dog-vs-not-dog models make a positive prediction, we would assign both those labels.  However, other models are inherently multi-label by their design, so this kind of transformation is irrelevant (and counter-productive, in fact) if you use those.
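As a rough sketch of the one-vs-all transformation, scikit-learn's `OneVsRestClassifier` fits one binary model per label when it is given a multi-hot indicator matrix as the target.  The toy features and labels below are invented, and `LogisticRegression` is just one arbitrary choice of base model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (made up): two features, and two labels that can co-occur.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.1, 0.9], [0.9, 0.1], [0.9, 0.9], [0.1, 0.1]])
# Columns are (cat, dog); each row is a multi-hot target.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [0, 1], [1, 0], [1, 1], [0, 0]])

# One binary LogisticRegression is fitted per column of Y.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, Y)

pred = ovr.predict(X)
print(pred.shape)            # one 0/1 prediction per sample and label
print(len(ovr.estimators_))  # one fitted binary model per label
```

With 100 known classes, the same pattern would fit 100 binary models, one per column of the indicator matrix.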

### The dataset

We get this data in somewhat raw form.  The `dermatology.data` file is a CSV with no headers.  The `dermatology.names` file contains a bit more than its name might suggest.  Beyond providing the feature names, it gives additional exposition of the dataset, such as value coding, where unknown values occur, and a few other things in prose.  I produced a code-friendly extraction of the relevant information below.

In [2]:
# Histopathological Attributes: (values 0, 1, 2, 3)
# Clinical Attributes: (values 0, 1, 2, 3, unless indicated)
features = [
    "erythema",
    "scaling",
    "definite borders",
    "itching",
    "koebner phenomenon",
    "polygonal papules",
    "follicular papules",
    "oral mucosal involvement",
    "knee and elbow involvement",
    "scalp involvement",
    "family history",  # 0 or 1
    "melanin incontinence",
    "eosinophils in the infiltrate",
    "PNL infiltrate",
    "fibrosis of the papillary dermis",
    "exocytosis",
    "acanthosis",
    "hyperkeratosis",
    "parakeratosis",
    "clubbing of the rete ridges",
    "elongation of the rete ridges",
    "thinning of the suprapapillary epidermis",
    "spongiform pustule",
    "munro microabcess",
    "focal hypergranulosis",
    "disappearance of the granular layer",
    "vacuolisation and damage of basal layer",
    "spongiosis",
    "saw-tooth appearance of retes",
    "follicular horn plug",
    "perifollicular parakeratosis",
    "inflammatory monoluclear inflitrate",
    "band-like infiltrate",
    "Age",  # linear; missing marked '?'
    "TARGET"  # See mapping
]

For reference and later use, the dictionary `targets` contains the class code and name of the skin condition diagnosed.  We also note here the number of observations of each condition.  They are somewhat imbalanced, which might affect the metrics we use.  That is, in this dataset, psoriasis is much more common than pilaris.  I am not a dermatologist, and have no idea what the prevalence of these conditions is in the general population; there may have been selection bias in this aggregation.  That is, beyond the obvious selection bias that people with no skin conditions at all are not included.

In [3]:
targets = {
    1:"psoriasis",                 # 112 instances
    2:"seboreic dermatitis",       # 61
    3:"lichen planus",             # 72
    4:"pityriasis rosea",          # 49
    5:"cronic dermatitis",         # 52    
    6:"pityriasis rubra pilaris",  # 20
}

Reading in the data needs minor massaging to be ready for use.  To have a friendly DataFrame to work with, we attach the names of the features as columns.  But the missing `Age` values, which are marked with a question mark, need extra clean-up.  I have decided to impute the median age for missing data.  Other approaches are possible, and some models will work with missing data.  Choosing this approach was my own domain judgement.

In [122]:
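df = pd.read_csv('data/dermatology.data', header=None, names=features)
# Missing ages are coded as '?'; mark them as missing, then convert to float
df.loc[df.Age == '?', 'Age'] = None
df['Age'] = df.Age.astype(float)
# Impute the median age for the missing values
df.loc[df.Age.isnull(), 'Age'] = df.Age.median()
df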

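| | erythema | scaling | definite borders | itching | koebner phenomenon | polygonal papules | follicular papules | oral mucosal involvement | knee and elbow involvement | scalp involvement | ... | disappearance of the granular layer | vacuolisation and damage of basal layer | spongiosis | saw-tooth appearance of retes | follicular horn plug | perifollicular parakeratosis | inflammatory monoluclear inflitrate | band-like infiltrate | Age | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 55.0 | 2 |
| 1 | 3 | 3 | 3 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8.0 | 1 |
| 2 | 2 | 1 | 2 | 3 | 1 | 3 | 0 | 3 | 0 | 0 | ... | 0 | 2 | 3 | 2 | 0 | 0 | 2 | 3 | 26.0 | 3 |
| 3 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 40.0 | 1 |
| 4 | 2 | 3 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | ... | 2 | 3 | 2 | 3 | 0 | 0 | 2 | 3 | 45.0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 361 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 25.0 | 4 |
| 362 | 3 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 36.0 | 4 |
| 363 | 3 | 2 | 2 | 2 | 3 | 2 | 0 | 2 | 0 | 0 | ... | 0 | 3 | 0 | 3 | 0 | 0 | 2 | 3 | 28.0 | 3 |
| 364 | 2 | 1 | 3 | 1 | 2 | 3 | 0 | 2 | 0 | 0 | ... | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 3 | 50.0 | 3 |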


## Training a model

The usual steps can be done here.  We create our X and y arrays for the features and target.  We perform a train/test split on the data.  For this problem, we will use a k-nearest neighbors model.  I have not tried a wide variety of models or hyperparameters, and have no idea what the "best" model is.  But KNN is often quite good, and it is a good way to illustrate the concepts here.

In [45]:
X = df.drop('TARGET', axis=1)
y = df['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [155]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

0.9130434782608695

Let us also create a collection of predictions against the test data that we will utilize below.  For convenience, we can transform the array result into a Pandas Series so that the index matches between the ground truth of the test data and the predictions.

In [124]:
y_pred = pd.Series(knn.predict(X_test), index=y_test.index)
y_pred.map(targets)

247              psoriasis
127          lichen planus
230    seboreic dermatitis
162          lichen planus
159    seboreic dermatitis
              ...         
321    seboreic dermatitis
59     seboreic dermatitis
12     seboreic dermatitis
312          lichen planus
107              psoriasis
Length: 92, dtype: object

## Evaluating the trained model

So far, so good.  The usual fit-then-score steps work as expected.  But in particular, we simply used the default `.score()` method attached to the trained model object.  For this classification, that default is *accuracy*.  We saw earlier, in the introductory material, that depending on our purpose, accuracy might not be the most useful metric.  

The choice of a metric is very much driven by our "business requirement" and there is no single objective answer.  However, one thing that is absolute is that when we get to comparing different models to each other (whether entirely different styles of models, or different hyperparameters) we need some way of quantifying the **goodness** of a model to choose which one to keep.  In particular, in practical terms we need to reduce that "goodness" to a single number we can compare among modeling approaches.

### Precision, recall, and f1 score

How good is our fitted model by some of the other metrics we discussed in the very first ["What Is Machine Learning?"](WhatIsML.ipynb) lesson?  You may want to review that lesson before the following discussion.  One complication is that the default `average` setting of these scorers assumes a binary classification.  For our multi-class model we have to choose something different.  There are three options here:

* 'micro': Count true positives, false positives, and false negatives over all observations and classes pooled together, then compute the metric once from those totals.
* 'macro': Compute the metric separately for each class label, then take the unweighted mean of those per-label scores.
* 'weighted': Similar to macro, but take a weighted average based on the "support" (frequency) of each different label.

Again, there is no uniformly right answer to which of these is the best.  It depends on your requirements.
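The three averaging schemes can be worked by hand on a tiny example.  The sketch below uses precision and a set of invented toy labels (not the dermatology data); the variable names are only illustrative.

```python
from collections import Counter

# Toy labels, invented for illustration.
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 2, 2, 3, 3]
classes = sorted(set(y_true))

# Per-class true positives, predicted counts, and support (true counts).
tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
predicted = Counter(y_pred)
support = Counter(y_true)

# Precision for one class: correct predictions / all predictions of it.
per_class = {c: tp[c] / predicted[c] for c in classes}

# micro: pool the counts over all classes, then compute one ratio.
micro = sum(tp.values()) / sum(predicted.values())
# macro: unweighted mean of the per-class scores.
macro = sum(per_class.values()) / len(classes)
# weighted: mean of the per-class scores, weighted by support.
weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)

print(per_class)
print(micro, macro, weighted)
```

On these labels the three averages all differ, because the most frequent class is also the one predicted most precisely.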

In [90]:
from sklearn.metrics import f1_score, precision_score, recall_score

In [133]:
precision_score(y_pred, y_test, average='micro')

0.9130434782608695

In [130]:
f1_score(y_pred, y_test, average='weighted')

0.9139386189258313

In [132]:
recall_score(y_pred, y_test, average='macro')

0.867564534231201

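These numbers are not especially far apart, but the different metrics do give us different results, and the choice of metric can change which model we select for the production system.  One caveat: scikit-learn documents these functions as taking `(y_true, y_pred)`, while the cells above pass the predictions first.  For single-label multi-class data that leaves the micro average unchanged, but with macro averaging it exchanges precision and recall, and with weighted averaging the weights come from the predicted rather than the true label counts.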
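scikit-learn's metric functions expect the ground truth first, so the order in which `y_pred` and `y_test` are passed matters.  Here is a minimal sketch of the effect on invented toy labels; it assumes only the documented `(y_true, y_pred)` signature of `precision_score` and `recall_score`.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, invented for illustration.
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 2, 2, 3, 3]

# Documented order: ground truth first, predictions second.
recall = recall_score(y_true, y_pred, average='macro')
precision = precision_score(y_true, y_pred, average='macro')

# Swapping the arguments exchanges the roles of the two metrics.
swapped_recall = recall_score(y_pred, y_true, average='macro')

print(recall, precision, swapped_recall)
```

The swapped call returns the macro precision rather than the macro recall, so it is worth double-checking the argument order before comparing models on these scores.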

Let us create a different model and compare it to the first one under different metrics.  A confession here is that I easily identified a number of model types and hyperparameters that are clearly better than the first one under almost any metric.  The `knn` and `knn2` objects are simply "naive" attempts that show the pattern I want to demonstrate.  That is, among these two models, choosing one is sensitive to which metric you prefer.  But for those better models I found, the same general pattern will emerge, just with higher numbers on each metric.

In [221]:
knn2 = KNeighborsClassifier(n_neighbors=2, metric="manhattan")
knn2.fit(X_train, y_train)
y_pred2 = knn2.predict(X_test)

In [222]:
print('Model 1:', precision_score(y_pred, y_test, average='micro'))
print('Model 2:', precision_score(y_pred2, y_test, average='micro'))      

Model 1: 0.9130434782608695
Model 2: 0.8695652173913043


In [223]:
print('Model 1:', f1_score(y_pred, y_test, average='weighted'))
print('Model 2:', f1_score(y_pred2, y_test, average='weighted'))

Model 1: 0.9139386189258313
Model 2: 0.876764539808018


In [224]:
print('Model 1:', recall_score(y_pred, y_test, average='macro'))
print('Model 2:', recall_score(y_pred2, y_test, average='macro'))

Model 1: 0.867564534231201
Model 2: 0.9017094017094017


### Revisiting accuracy

In [None]:
from sklearn.metrics import balanced_accuracy_score, accuracy_score


In [101]:
counts = y_test.value_counts().sort_index()
weights = 1/counts

In [115]:
# Remember
knn.score(X_test, y_test)

0.9130434782608695

In [69]:
print(accuracy_score(y_pred, y_test))
print(balanced_accuracy_score(y_pred, y_test))

0.9130434782608695
0.867564534231201
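The second number matches the macro-averaged recall printed earlier, and that is no accident: scikit-learn defines balanced accuracy as the mean of the recall obtained on each class.  A plain-Python sketch on invented, deliberately imbalanced toy labels shows how the two accuracies diverge.

```python
from collections import Counter

# Toy, imbalanced labels invented for illustration.
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1, 1, 3]

support = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

# Plain accuracy: fraction of all samples predicted correctly.
accuracy = sum(hits.values()) / len(y_true)

# Balanced accuracy: recall per class, then an unweighted mean.
recalls = {c: hits[c] / support[c] for c in support}
balanced = sum(recalls.values()) / len(recalls)

print(accuracy, balanced)
```

Plain accuracy is dominated by the common class, while the balanced version gives the two rare classes the same say as the common one.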


In [114]:
counts = y.value_counts().sort_index()
weights = 1/counts
balanced_accuracy_score(y_pred, y_test, sample_weight=y_test.map(weights))

0.8991012410142091

In [117]:
# Baseline: always predict class 1 (psoriasis), the most common condition
y_plurality = 1 + (y_pred*0)
y_plurality

247    1
127    1
230    1
162    1
159    1
      ..
321    1
59     1
12     1
312    1
107    1
Length: 92, dtype: int64

In [118]:
accuracy_score(y_plurality, y_test)

0.29347826086956524

In [119]:
balanced_accuracy_score(y_plurality, y_test)

0.29347826086956524

In [120]:
balanced_accuracy_score(y_plurality, y_test, sample_weight=y_test.map(counts))

0.44068784610900613

In [121]:
balanced_accuracy_score(y_plurality, y_test, sample_weight=y_test.map(weights))

0.17078893196892914
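The constant `y_plurality` prediction above is a majority-class baseline.  As a sketch, the same kind of baseline can be built with scikit-learn's `DummyClassifier(strategy='most_frequent')`.  The toy labels are invented, and the scorers are called here in the documented `(y_true, y_pred)` order.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy, imbalanced labels invented for illustration; features are ignored
# by DummyClassifier, so a column of zeros is enough.
y_true = np.array([1, 1, 1, 1, 1, 1, 2, 2, 3, 3])
X = np.zeros((len(y_true), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y_true)
y_base = baseline.predict(X)

# Ground truth first, predictions second.
acc = accuracy_score(y_true, y_base)
bal = balanced_accuracy_score(y_true, y_base)
print(acc, bal)
```

With the ground truth passed first, a constant predictor gets full recall on one class and zero on the others, so its balanced accuracy over k classes is 1/k.  The cells above pass the predictions first, which appears to be why the unweighted balanced score there equals plain accuracy instead.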

In [48]:
probs = pd.DataFrame(knn.predict_proba(X_test), columns=targets.values())
probs

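| | psoriasis | seboreic dermatitis | lichen planus | pityriasis rosea | cronic dermatitis | pityriasis rubra pilaris |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.6 | 0.0 | 0.4 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.8 | 0.2 | 0.0 | 0.0 |
| 4 | 0.0 | 0.6 | 0.0 | 0.4 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 87 | 0.0 | 0.6 | 0.0 | 0.4 | 0.0 | 0.0 |
| 88 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 90 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |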
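With the default `KNeighborsClassifier` settings (5 neighbors, uniform weights), each probability is the fraction of the five nearest neighbors carrying that label, which explains values such as 0.6 and 0.4.  Since we are interested in non-top diagnoses, here is a small sketch of ranking the candidate conditions for one row.  The helper `ranked_diagnoses` is hypothetical, written in plain Python rather than against the `probs` DataFrame.

```python
# Class names in column order, as in the `targets` dictionary above.
names = ["psoriasis", "seboreic dermatitis", "lichen planus",
         "pityriasis rosea", "cronic dermatitis", "pityriasis rubra pilaris"]

def ranked_diagnoses(row, names=names):
    """Return (name, probability) pairs with nonzero probability, best first."""
    pairs = [(n, p) for n, p in zip(names, row) if p > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Row 2 of the probability table shown above.
row = [0.0, 0.6, 0.0, 0.4, 0.0, 0.0]
print(ranked_diagnoses(row))
```

For this row the top diagnosis is seboreic dermatitis, with pityriasis rosea as a second candidate that a plain `.predict()` call would never surface.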
