In [1]:
from sklearn.datasets import load_breast_cancer
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Loading datasets

In [2]:
X,y = load_breast_cancer(return_X_y=True)

In [3]:
# Complete the code below, by setting the variables to the proper sets.
# Use 33% of the data as the test set. Use random_state=42 to make the
# results reproducible.

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)

# Fitting the data

Let's fit the data using a logistic-regression model. See the [dedicated Wikipedia page](https://en.wikipedia.org/wiki/Logistic_regression) for information about this particular learning model.

In [4]:
lr = LogisticRegression()   # Instantiate the logistic regression model

In [5]:
lr.fit(X_train, y_train)   # Fits the model on the available data

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Making predictions

The logistic regression model can be used to make hard predictions using `lr.predict(...)`, or to obtain the probability of each class using `predict_proba`.

These are standard methods from the `sklearn` library. Not all models support
predicting probabilities; those that do not simply omit the corresponding
method (i.e., `predict_proba`).

The `predict_proba` method returns the probabilities associated with each possible label. Since this is a binary problem, it returns a pair $(p_0, p_1)$ collecting the probabilities for class 0 and 1 (note that, since they form a distribution, we have $p_1 = 1 - p_0$ and one of the two values is sufficient to describe the complete result).
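To make the relation between the two probabilities concrete, here is a minimal sketch of how a binary logistic model turns a linear score into the pair $(p_0, p_1)$. The weights `w`, the intercept `b`, and the inputs `X_demo` are made-up values for illustration, not the parameters learned by `lr`.

```python
import numpy as np

# Hypothetical parameters and inputs (illustration only)
w = np.array([0.8, -0.5])
b = 0.1
X_demo = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, -1.0]])

z = X_demo @ w + b                  # linear score for each example
p1 = 1.0 / (1.0 + np.exp(-z))       # logistic (sigmoid) function: P(class 1)
p0 = 1.0 - p1                       # P(class 0)

proba = np.column_stack([p0, p1])   # one row (p0, p1) per example
hard = (p1 > 0.5).astype(int)       # hard prediction: the more likely class

print(proba)
print(hard)
```

Each row of `proba` sums to 1, which is why keeping only the second column loses no information.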

In [6]:
scores = lr.predict_proba(X_test)[:,1]     # collects the p_1 probabilities.

# Functions for evaluating fp,tp and accuracy

Please complete the following functions.

- `eval_fp_tp` accepts an `np.array` containing the `actual` labels from the dataset, and an `np.array` containing the predictions from a classifier. It should return a pair `(FP,TP)`, where `FP` is the count of false-positive predictions and `TP` is the count of true-positive predictions.
- `eval_accuracy` accepts the same parameters as `eval_fp_tp` and returns the accuracy (i.e., $\frac{\text{TP} + \text{TN}}{\text{POS}+\text{NEG}}$) of the given predictions.

Note: `actual` and `predicted` variables can be assumed to be np.arrays containing boolean values (`True` or `False`).
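As a quick sanity check of these definitions, the following sketch counts the four outcomes on a small hand-made example. The arrays are invented for illustration and are unrelated to the dataset.

```python
import numpy as np

# Toy boolean labels (illustration only)
actual    = np.array([True, True, True, False, False, False, True, False])
predicted = np.array([True, True, False, True, False, False, True, False])

tp = int(np.sum(actual & predicted))      # predicted positive, actually positive
fp = int(np.sum(~actual & predicted))     # predicted positive, actually negative
tn = int(np.sum(~actual & ~predicted))    # predicted negative, actually negative
fn = int(np.sum(actual & ~predicted))     # predicted negative, actually positive

accuracy = (tp + tn) / actual.size        # POS + NEG equals the number of examples
print(tp, fp, tn, fn, accuracy)           # 3 1 3 1 0.75
```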

In [None]:
def eval_fp_tp(actual, predicted):
    TP = np.logical_and(actual,predicted).sum()
    FP = np.logical_and(np.logical_not(actual), predicted).sum()
    return (FP, TP)

In [None]:
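def eval_accuracy(actual, predicted):
    # Fraction of examples whose predicted label matches the actual one,
    # i.e., (TP + TN) / (POS + NEG)
    return np.mean(actual == predicted)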

# fp,tp and accuracy evaluations for different thresholds

Given the scores for the test cases, we might want to find the best possible threshold for classification, i.e., the real value $t$ such that `scores >` $t$ gives the best classification of the examples.

Let us then start to consider 100 possible thresholds in the range $[0,1]$:

In [None]:
# complete the code setting the thresholds variable to an np.array containing 100
# equally spaced values between 0 and 1.

thresholds = np.arange(100) / 100.0   # 0.00, 0.01, ..., 0.99 (0.5 is included exactly)

and compute the tp, fp, and accuracy values of the labelings obtained by comparing the scores with those thresholds.

In [None]:
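performances = []
fps, tps = [], []

# complete the code by filling the lists fps, tps and performances by iterating
# over the thresholds array and appending the results of eval_fp_tp and eval_accuracy
# to the lists fps, tps and performances
# At the end, at index i of each of these arrays one should find
# fps: the number of false positives for threshold at position i
# tps: the number of true positives for threshold at position i
# performances: a tuple (accuracy, threshold, fp, tp) evaluated for threshold at position i

# Assumption: `actual` holds the boolean test labels, with class 1 as the positive class
actual = y_test == 1

for threshold in thresholds:
    predicted = scores > threshold
    fp, tp = eval_fp_tp(actual, predicted)
    fps.append(fp)
    tps.append(tp)
    performances.append((eval_accuracy(actual, predicted), threshold, fp, tp))

performances = np.array(performances)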
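The threshold column of `performances` is later matched by exact equality (`== 0.5`), which only works if 0.5 is exactly one of the generated thresholds. The sketch below compares two ways of building 100 values in $[0,1]$; the names `grid_a`, `grid_b`, and `idx` are illustrative.

```python
import numpy as np

grid_a = np.arange(100) / 100.0      # 0.00, 0.01, ..., 0.99 (step 1/100)
grid_b = np.linspace(0, 1, 100)      # 100 points including 0 and 1 (step 1/99)

has_half_a = bool(np.any(grid_a == 0.5))   # 50/100 is exactly 0.5
has_half_b = bool(np.any(grid_b == 0.5))   # no multiple of 1/99 equals 0.5

# A tolerant lookup avoids relying on exact floating-point equality
idx = np.flatnonzero(np.isclose(grid_a, 0.5))

print(has_half_a, has_half_b, idx)
```

With a grid like `grid_b`, the exact-equality selection would return an empty array, so a tolerant comparison such as `np.isclose` is the safer choice when the grid may change.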

# Plotting

Let us then start plotting the coverage plot for the obtained classifications.

In [None]:
plt.plot(fps, tps)
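A coverage plot puts the false-positive count on the x axis and the true-positive count on the y axis, with one point per threshold. As a side note, dividing each axis by the number of negatives and positives, respectively, gives the rates used in a ROC curve. The sketch below illustrates the conversion; the totals and counts are invented for the example and do not come from the test set.

```python
import numpy as np

# Hypothetical class totals and per-threshold counts (illustration only)
n_pos, n_neg = 10, 5
fps_demo = np.array([5, 3, 1, 0, 0])
tps_demo = np.array([10, 10, 9, 6, 0])

fpr = fps_demo / n_neg    # false-positive rate, in [0, 1]
tpr = tps_demo / n_pos    # true-positive rate, in [0, 1]

print(fpr)
print(tpr)
```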

# Checking performances for threshold 0.5

The `predict_proba` method we used to get the scores returns the probability that each example belongs to the positive class. Usually the positive class is then predicted when `score > 0.5` (since in this case it is the class with the larger probability).

Let's then see where this classifier (i.e., the one obtained by setting the threshold to 0.5) lies in the coverage plot and whether there are better options.

**Note**: since we saved the relevant statistics in the `performances` array, we can retrieve the (fp, tp) position of the classifier obtained with threshold 0.5. The expression `performances[:,1] == 0.5` yields a boolean vector that marks the row we are interested in, and indexing with it returns that row of the matrix: `performances[performances[:,1] == 0.5]`.

In [None]:
plt.plot(fps, tps)
accuracy, threshold, fp, tp = performances[performances[:,1] == 0.5][0]
plt.scatter(fp,tp,color='red')
plt.plot([fp-10,fp+10],[tp-10,tp+10], color="red")

As shown by the red dot and the red line, threshold 0.5 is a good choice, but two other points apparently achieve a better classification.
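The red line has slope 1 for a reason: since $\text{TN} = \text{NEG} - \text{FP}$, the accuracy can be written as $\frac{\text{TP} - \text{FP} + \text{NEG}}{\text{POS}+\text{NEG}}$, so all points with the same $\text{TP} - \text{FP}$ have the same accuracy, and points above the line are more accurate. The sketch below checks this numerically; the helper `accuracy_from_counts`, the class totals, and the three points are assumed values chosen for illustration.

```python
def accuracy_from_counts(fp, tp, n_pos, n_neg):
    """Accuracy of a classifier located at (fp, tp) in coverage space."""
    tn = n_neg - fp                       # negatives that were not flagged
    return (tp + tn) / (n_pos + n_neg)

n_pos, n_neg = 120, 80                    # hypothetical class totals

# Two points on the same slope-1 line (tp - fp = 90) and one point above it
on_line_a = accuracy_from_counts(fp=10, tp=100, n_pos=n_pos, n_neg=n_neg)
on_line_b = accuracy_from_counts(fp=20, tp=110, n_pos=n_pos, n_neg=n_neg)
above = accuracy_from_counts(fp=10, tp=105, n_pos=n_pos, n_neg=n_neg)

print(on_line_a, on_line_b, above)        # 0.85 0.85 0.875
```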

Let us see where these points lie in the plot and what their accuracy is.

In [None]:
perf05 = performances[performances[:, 1] == 0.5][0,0]
perf_largerthan_05 = performances[performances[:, 0] > perf05]
plt.scatter(perf_largerthan_05[:,2], perf_largerthan_05[:,3], color="green")

max_perfs = perf_largerthan_05[:,0].max()
perf_largerthan_05[perf_largerthan_05[:,0] == max_perfs]

The two points that we are looking for are then at positions (5,121) and (1,117).

In [None]:
plt.plot(fps, tps)
fp, tp = eval_fp_tp(actual, scores > 0.5)
plt.scatter(fp,tp, color="red")
plt.scatter(5,121, color="orange")
plt.scatter(1,117, color="orange")
plt.plot([fp-10,fp+10],[tp-10,tp+10], color="red")

These two points (which we found by looking only at the accuracies) are indeed the two points that the plot shows as having a better accuracy.