# Before we start the exercise
* We need to make sure that a suitable version of the scikit-learn library is installed. The script was prepared for version 0.19.

In [1]:
import sklearn
print('Installed scikit-learn version: {}.'.format(sklearn.__version__))

Installed scikit-learn version: 0.19.1.


In OKWF, you will need to install a local environment as described:

https://brain.fuw.edu.pl/edu/index.php/Uczenie_maszynowe_i_sztuczne_sieci_neuronowe/konfiguracja

It will also be useful in later classes.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from numpy import diag, interp  # same functions that older SciPy re-exported; recent SciPy no longer provides them
from itertools import cycle

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Exercise: Cross-validation
* In this exercise, we will look at how classifier quality measures depend on the class proportions in the training set and on the size of the training set
* The classifier is still logistic regression, but this time we will use the library implementation from [scikit-learn](http://scikit-learn.org/stable/index.html)

Function for generating data:

In [None]:
def gen(ile):
    mu = [(-1,0.5),(1.2,4)] # class means
    cov = [diag([3,3]), diag([4,1.7])] # covariance matrices for classes
    
    X = np.zeros((ile*len(mu), 2)) # space for input data
    Y = np.zeros((ile*len(mu), 1),dtype = int) # space for output
    for klasa in range(len(mu)):
        X[klasa*ile:(klasa+1)*ile] = np.random.multivariate_normal(mu[klasa],cov[klasa],ile)
        Y[klasa*ile:(klasa+1)*ile] = klasa
    Y = Y.ravel()
    return (X,Y)

We test this function: we generate 50 examples per class, print the first 5, and draw all of them with the `scatter` function:

In [None]:
X,Y = gen(50)
print('X: ', X[0:5,:])
print('Y: ', Y[0:5])
plt.scatter(X[:,0], X[:,1] ,c = Y, cmap=plt.cm.Set1, alpha =0.5)
plt.show()

# Balanced classes

## Let us observe how the classifier quality measures vary when subsets for training and testing are drawn from the data set.
* we will use the function [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the set
* we will use functions from the module [sklearn.metrics](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) to calculate quality measures

Complete the split below so that the test set is 20% of the entire data set. Illustrate the points belonging to the training part and to the test part with `scatter`:

* Split:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size= ...)

* Illustration:

In [None]:
plt.scatter(X_train[:,0], X_train[:,1], c = y_train, cmap=plt.cm.Set1, alpha =0.5)
plt.scatter(X_test[:,0] , X_test[:,1],  c = y_test,  cmap=plt.cm.Set1, alpha =0.5, marker = '*' )
plt.show()

Logistic regression is implemented in the class [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). We create an instance of this class:

In [None]:
lr = LogisticRegression()

We train it on the training set:

In [None]:
lr.fit(X_train,y_train)

We perform predictions for the test set:

In [None]:
y_pred = lr.predict(X_test)  

The results can be inspected using a confusion matrix:

In [None]:
print(metrics.confusion_matrix(y_test, y_pred))
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
print('TN: ',tn,'FP: ', fp, 'FN: ', fn, 'TP: ', tp )

In a loop, we repeat the process of splitting the data set and for each split we calculate the quality measures:
* precision (positive predictive value, PPV): it answers the question "If the test result is positive, what is the probability that the subject is ill?"

$ \qquad $ $ PPV = \frac {TP} {P '} = \frac {TP} {TP + FP} $

* sensitivity (true positive rate, TPR, recall): the probability that the classification is correct given that the case is positive. This is, for example, the probability that a test performed on a sick person shows that the person is ill.

$ \qquad $ $ TPR = \frac {TP} {P} = \frac {TP} {TP + FN} $


* accuracy (ACC): the probability of correct classification.

$ \qquad $ $ ACC = \frac {TP + TN} {P + N} $

* F1-score: the harmonic mean of precision and sensitivity:

$ \qquad $ $ F_1 = 2 \frac {PPV \cdot TPR} {PPV + TPR} = \frac {2TP} {2TP + FP + FN} $

This measure assesses the balance between sensitivity and precision. It does not take true negatives into account.

* Matthews correlation coefficient (MCC):

$ \qquad $ $
\text {MCC} = \frac {TP \cdot TN - FP \cdot FN} {\sqrt {(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}
$

  * This coefficient takes into account true and false positives and negatives, and is generally regarded as a balanced measure that can be used even when the classes are of very different sizes.
  * MCC is in fact the correlation coefficient between the observed and predicted binary classifications; it takes values from -1 to +1:
    * +1 corresponds to a perfect classification,
    * 0 means no better than random assignment, and
    * -1 means total disagreement between the classification and the actual state.
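
The formulas above can be checked by hand on a small confusion matrix. The sketch below is only an illustration: the helper name `quality_measures` and the example counts are made up for this purpose, and returning 0 for MCC when the denominator vanishes is an assumed convention, not part of the definition. Zero denominators in the other measures are not handled.

```python
from math import sqrt

def quality_measures(tp, fp, fn, tn):
    """Compute the measures defined above from the four confusion-matrix counts."""
    ppv = tp / (tp + fp)                      # precision
    tpr = tp / (tp + fn)                      # sensitivity (recall)
    acc = (tp + tn) / (tp + fp + fn + tn)     # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)          # harmonic mean of PPV and TPR
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # assumed convention: MCC is set to 0 when the denominator is 0
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {'PPV': ppv, 'TPR': tpr, 'ACC': acc, 'F1': f1, 'MCC': mcc}

# hypothetical counts, chosen only for illustration
m = quality_measures(tp=8, fp=2, fn=1, tn=9)
for name, value in m.items():
    print('{0} = {1:.3f}'.format(name, value))
```

The same counts can be read off the output of `metrics.confusion_matrix(...).ravel()`, which returns them in the order `tn, fp, fn, tp`.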

In [None]:
lr = ... # create an instance of the classifier
for i in range(10):
    X_train, X_test, y_train, y_test = ... # split the set with 20% for testing


    ... # train the classifier
    y_pred = ... # do the prediction for the test set
    
    
    PPV = metrics.precision_score(y_test, y_pred)
    REC = metrics.recall_score(y_test, y_pred)
    ACC = metrics.accuracy_score(y_test, y_pred)
    F1 = metrics.f1_score(y_test, y_pred)
    MCC = metrics.matthews_corrcoef(y_test, y_pred)
    
    print('PPV = {p:.3f} REC = {r:.3f} ACC = {a:.3f} F1 = {f:.3f} MCC =  {m:.3f}  '.format(a=ACC,f=F1,m=MCC,p=PPV,r=REC))

We see that the measures change with each draw.

Most often, instead of such random splits, a systematic split called k-fold cross-validation is used. The procedure looks like this:
* We divide the data set (X and y) into `k` equal parts
* We set aside the 1st part as test data
* We train the classifier on the remaining `k-1` parts
* We calculate the quality measures on the part set aside
* We set aside the 2nd part as test data
* We train the classifier on the remaining `k-1` parts
* We calculate the quality measures on the part set aside
* $ \vdots $
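
The index bookkeeping behind this procedure can be sketched with NumPy alone. The generator `k_fold_indices` below is a hypothetical helper written for illustration; it shuffles the indices first, which is an assumption (the library splitters do not shuffle by default), and it only produces the index sets without training any classifier.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)     # shuffle the sample indices once
    folds = np.array_split(order, k)       # k parts of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]                # the i-th part is set aside for testing
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print('train:', np.sort(train_idx), 'test:', np.sort(test_idx))
```

Each sample lands in the test part exactly once, so every example is used both for training and for testing, but never in the same fold.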

In the `sklearn` library we have the convenient `cross_val_score` function for this:

In [None]:
from  sklearn.model_selection import cross_val_score

Let's see how it works:

In [None]:
ppv = cross_val_score(lr, X, Y, cv=10, scoring='precision')
print('PPV = {0:.2f} +/- {1:.2f}'.format(ppv.mean(),ppv.std()))
rec = cross_val_score(lr, X, Y, cv=10, scoring='recall')
print('REC = {0:.2f} +/- {1:.2f}'.format(rec.mean(),rec.std()))
acc = cross_val_score(lr, X, Y, cv=10, scoring='accuracy')
print('ACC = {0:.2f} +/- {1:.2f}'.format(acc.mean(),acc.std()))
f1 = cross_val_score(lr, X, Y, cv=10, scoring='f1')
print('F1 = {0:.2f} +/- {1:.2f}'.format(f1.mean(),f1.std()))
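
As a possible alternative, `cross_validate` from `sklearn.model_selection` can evaluate several measures in one pass over the folds, and `metrics.make_scorer` can wrap `matthews_corrcoef` so that MCC is cross-validated as well. This is only a sketch: the data `X_demo`, `y_demo` are synthetic stand-ins invented here, not the output of `gen`.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# synthetic two-class data (assumption: two Gaussian blobs, 50 points each)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(-1.0, 1.5, size=(50, 2)),
                    rng.normal(2.0, 1.5, size=(50, 2))])
y_demo = np.repeat([0, 1], 50)

# built-in scorers looked up by name, plus a scorer built from matthews_corrcoef
scoring = {name: metrics.get_scorer(name)
           for name in ('precision', 'recall', 'accuracy', 'f1')}
scoring['mcc'] = metrics.make_scorer(metrics.matthews_corrcoef)

scores = cross_validate(LogisticRegression(), X_demo, y_demo, cv=10, scoring=scoring)
for name in scoring:
    values = scores['test_' + name]        # one value per fold
    print('{0} = {1:.2f} +/- {2:.2f}'.format(name, values.mean(), values.std()))
```

All measures are then computed on the same folds, which makes their means and spreads directly comparable.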

Let's examine the ROC curve for this set. This time we will also use library functions.

In [None]:
skf  = StratifiedKFold(n_splits=6)
lr = LogisticRegression()
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

i = 0
for train, test in skf.split(X, Y):
    lr.fit(X[train], Y[train]) # fit the model on the training folds
    probas_ = lr.predict_proba(X[test]) # class membership probabilities for the test examples,
                                         # computed by the trained classifier
                                         # (each row holds one probability per class)
   
    # We calculate the points of the ROC curve
    fpr, tpr, thresholds = metrics.roc_curve(Y[test], probas_[:, 1]) # in relation to the probability of class 1
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    # and the area under the curve
    roc_auc = metrics.auc(fpr, tpr)
    aucs.append(roc_auc)
    # we draw a curve 
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC for fold %d (AUC = %0.2f)' % (i, roc_auc))

    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Random guess', alpha=.8)
# summary below: mean and standard deviation over folds, shading of the +/- 1 std. dev. band
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = metrics.auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()
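
To see what `roc_curve` and `auc` return, and why the cell above interpolates every fold onto `mean_fpr`, here is a minimal sketch on four hand-made scores. The arrays `y_true` and `scores` and the 5-point grid are illustrative assumptions. Each fold yields its own set of FPR points, so the TPR values must be resampled onto a common grid before they can be averaged.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])               # true classes
scores = np.array([0.1, 0.4, 0.35, 0.8])      # predicted probability of class 1

# points of the ROC curve, one per decision threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)
roc_auc = metrics.auc(fpr, tpr)               # area under the piecewise-linear curve
print('FPR:', fpr)
print('TPR:', tpr)
print('AUC = {0:.2f}'.format(roc_auc))

# resample TPR onto a fixed FPR grid, as done per fold in the cell above
grid = np.linspace(0, 1, 5)
tpr_on_grid = np.interp(grid, fpr, tpr)
print('TPR on grid:', tpr_on_grid)
```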

Let's check how the quality measures depend on the size of the training set:

In [None]:
N = 10
PPV_mean = np.zeros((N,1))
PPV_std = np.zeros((N,1))
REC_mean = np.zeros((N,1))
REC_std = np.zeros((N,1))
ACC_mean = np.zeros((N,1))
ACC_std = np.zeros((N,1))
F1_mean = np.zeros((N,1))
F1_std = np.zeros((N,1))

n= 30+np.floor(np.logspace(1,4,N)).astype(int)

for i in range(N):
    X,Y = gen(int(n[i]))
    lr = LogisticRegression()
    ppv = ...
    PPV_mean[i] =ppv.mean()
    PPV_std[i]  = ppv.std()
    rec = ...
    REC_mean[i]  = rec.mean()
    REC_std[i]  = rec.std()
    acc = ...
    ACC_mean[i]  = acc.mean()
    ACC_std[i]  = acc.std()
    f1 = ...
    F1_mean[i]  = f1.mean()
    F1_std[i]  = f1.std()

ax = plt.subplot(1,1,1)
plt.errorbar(n,PPV_mean,yerr=PPV_std)
plt.errorbar(n+2,REC_mean,yerr=REC_std)
plt.errorbar(n+4,ACC_mean,yerr=ACC_std)
plt.errorbar(n+6,F1_mean,yerr=F1_std)
plt.legend(('PPV','REC','ACC','F1'))
ax.set_xscale("log")  # all n are positive, so the version-dependent clipping option is not needed
plt.show()

## Unbalanced classes
We will now create data in which one of the classes is M times larger than the other.

In [None]:
def gen_rozne(ile, M):
    mu = [(-1,0.5),(1,4)]
    #mu = [(-1,0.5),(-1,0.5)]
    cov = [diag([1.7,1.8]), diag([1.5,0.7])]
    X = np.zeros(((M+1)*ile, 2)) # space for input data
    Y = np.zeros(((M+1)*ile, 1),dtype = int) # space for output
    print(Y.shape)
    klasa = 0
    X[0:ile] = np.random.multivariate_normal(mu[klasa],cov[klasa],ile)
    Y[0:ile] = klasa
    klasa =1 
    X[ile:ile+ile*M] = np.random.multivariate_normal(mu[klasa],cov[klasa],ile*M)
    Y[ile:ile+ile*M] = klasa
    Y = Y.ravel()
    print(np.sum(Y==0), np.sum(Y==1) )
    return (X,Y)

Let's look at the data:

In [None]:
X,Y = gen_rozne(30,100)
plt.scatter ...
plt.show()

We calculate quality measures for unbalanced data with a 10-fold split. Note the difference between the values of the first 4 measures and the value of MCC:

In [None]:
ppv = ...
print('PPV = {0:.2f} +/- {1:.2f}'.format(ppv.mean(),ppv.std()))
rec = ...
print('REC = {0:.2f} +/- {1:.2f}'.format(rec.mean(),rec.std()))
acc = ...
print('ACC = {0:.2f} +/- {1:.2f}'.format(acc.mean(),acc.std()))
f1 = ...
print('F1 = {0:.2f} +/- {1:.2f}'.format(f1.mean(),f1.std()))
print('-----')
MCC=np.zeros((10,1))
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=...) # 10% test to be similar to a 10-fold split
    ... # train the model
    y_pred = ... # prediction for the test set
    MCC[i] = metrics.matthews_corrcoef(y_test, y_pred)
print('MCC = {0:.2f} +/- {1:.2f}'.format(MCC.mean(),MCC.std()))  
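
Why MCC behaves differently can be seen in an extreme case: a "classifier" that ignores the input and always predicts the majority class. The sketch below uses made-up labels in the same 30 to 3000 proportion as `gen_rozne(30,100)`; it is an illustration, not the result of the exercise.

```python
import numpy as np
from sklearn import metrics

# 30 examples of class 0 and 3000 examples of class 1
y_true = np.concatenate([np.zeros(30, dtype=int), np.ones(3000, dtype=int)])
y_pred = np.ones_like(y_true)                 # always predict the majority class

acc = metrics.accuracy_score(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred)
mcc = metrics.matthews_corrcoef(y_true, y_pred)
print('ACC = {0:.3f} F1 = {1:.3f} MCC = {2:.3f}'.format(acc, f1, mcc))
```

Accuracy and F1 look excellent because the positive class dominates, while MCC is 0: the predictions carry no information about the true class.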
   

Now we will check whether the results improve if the splits preserve the class proportions. This can be easily done with the `StratifiedKFold` class, which returns the indices of the training and test subsets:

In [None]:
skf = StratifiedKFold(n_splits=4)
for train, test in skf.split(X, Y):  
    lr.fit(X[train,:],Y[train])
    y_pred = lr.predict(X[test,:]) 
    y_test = Y[test]
    PPV = metrics.precision_score(y_test, y_pred)
    REC = metrics.recall_score(y_test, y_pred)
    ACC = metrics.accuracy_score(y_test, y_pred)
    F1 = metrics.f1_score(y_test, y_pred)
    MCC = metrics.matthews_corrcoef(y_test, y_pred)
    
    print('PPV = {p:.3f} REC = {r:.3f} ACC = {a:.3f} F1 = {f:.3f} MCC =  {m:.3f}  '.format(a=ACC,f=F1,m=MCC,p=PPV,r=REC))
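
The effect of stratification is easiest to see by counting minority-class examples in each test fold. In the sketch below the labels `y_demo` (10 zeros followed by 90 ones) and the dummy features `X_demo` are assumptions made for illustration. A plain `KFold` without shuffling is shown for contrast.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y_demo = np.array([0] * 10 + [1] * 90)        # sorted, unbalanced labels
X_demo = np.zeros((len(y_demo), 1))           # features do not matter for splitting

# number of class-0 examples in each test fold
plain = [int(np.sum(y_demo[test] == 0))
         for _, test in KFold(n_splits=5).split(X_demo)]
strat = [int(np.sum(y_demo[test] == 0))
         for _, test in StratifiedKFold(n_splits=5).split(X_demo, y_demo)]

print('KFold:          ', plain)
print('StratifiedKFold:', strat)
```

With sorted labels, the plain split puts every minority example into the first test fold, so the remaining folds contain no class-0 test examples at all. The stratified split gives each fold the same share.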

Let's examine the ROC curve:

In [None]:
skf  = StratifiedKFold(n_splits=6)
lr = LogisticRegression()
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

i = 0
for train, test in skf.split(X, Y):
    ... # fit the model on the training folds
    probas_ = ...# class membership probabilities for the test examples,
                                         # computed by the trained classifier
                                         # (each row holds one probability per class)
    # We calculate the points of the ROC curve
    fpr, tpr, thresholds = ... # in relation to the probability of class 1
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    # and area under the curve
    roc_auc = ...
    aucs.append(roc_auc)
    # we draw a curve
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC for fold %d (AUC = %0.2f)' % (i, roc_auc))

    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Random guess', alpha=.8)
# summary below: mean and standard deviation over folds, shading of the +/- 1 std. dev. band
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = metrics.auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

The above calculations should be carried out both for classes that are clearly separated and for classes that overlap to a significant extent. To do this, change the class means in the function that generates the unbalanced data.

## What conclusions can be drawn?