# Class 2 - Classifier evaluation

### Modules setup

In [None]:
%pip install numpy pandas matplotlib scikit-learn

In [None]:
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, classification_report, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

In [None]:
plt.rcParams["figure.dpi"] = 120.0  # figure resolution in dots per inch

### German Credit - Data loading & pre-processing

In [None]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric"
DATA_SET = pd.read_fwf(url, header=None)
DATA_SET.rename(columns={24: "target"}, inplace=True)
DATA_SET["target"] = DATA_SET["target"] - 1  # recode target from {1, 2} to {0, 1}; 1 marks bad credit
DATA_SET

In [None]:
X = DATA_SET.drop(["target"], axis=1)
y = DATA_SET["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

Is random sampling the best approach? What if one class has many more records than the other?
With imbalanced data, a model can score well on overall metrics such as accuracy while still performing poorly on the minority class.

There are several approaches to tackle the issue:
- undersampling,
- oversampling,
- **cost-based analysis**,
- algorithmic approaches, e.g. SMOTE (synthetic minority oversampling).

Check [imbalanced-learn](https://imbalanced-learn.org/stable/user_guide.html#user-guide) documentation for details.
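
One simple safeguard, complementary to the approaches listed above, is a stratified split, which keeps the class proportions equal in the training and test subsets. The sketch below uses the `stratify` argument of `train_test_split` on synthetic labels; the names `X_demo` and `y_demo` and the 9:1 class ratio are made up for illustration and are not part of the German Credit data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 90 negatives and 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature, values are irrelevant

# stratify=y_demo asks for the same class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

print("positive share - train:", y_tr.mean(), "test:", y_te.mean())
```

Stratification does not fix the imbalance itself; it only prevents a random split from making the imbalance worse in one of the subsets.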

### [Building logistic regression model](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

In [None]:
model = LogisticRegression(penalty=None, max_iter=1000)
LR = model.fit(X_train, y_train)

In [None]:
LR.intercept_, LR.coef_

**Predict probability of bad credit**
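
For a binary logistic regression, the probability of class 1 is the logistic (sigmoid) function applied to the linear score built from the intercept and coefficients. The sketch below reproduces that calculation for a single observation in plain Python; the function name `predict_proba_positive` and the coefficient values are hypothetical and chosen only for illustration.

```python
import math


def predict_proba_positive(x, coef, intercept):
    """Probability of class 1 for one observation under a logistic model."""
    z = intercept + sum(w * v for w, v in zip(coef, x))  # linear score
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


# Hypothetical coefficients and observation, not taken from the fitted model.
coef_demo = [0.8, -0.5]
intercept_demo = 0.1
p_demo = predict_proba_positive([1.0, 2.0], coef_demo, intercept_demo)
print(round(p_demo, 4))
```

A linear score of zero gives a probability of exactly 0.5, which is why 0.5 is the default cutoff used by `predict`.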

In [None]:
# On validation data
score_val = LR.predict_proba(X_test)[:, 1]
# On training data
score_train = LR.predict_proba(X_train)[:, 1]

**Confusion matrix**

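❗ Remember that the class order (0, 1, ...) and the placement of actual vs. predicted values may be swapped between sources. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes.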

<img src="https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png" width=400>

<img src="https://miro.medium.com/max/1780/1*LQ1YMKBlbDhH9K6Ujz8QTw.jpeg"  width=400>
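
To check the layout in scikit-learn directly, a small example on toy labels helps. The labels below are invented for illustration; with `labels=[0, 1]`, flattening the matrix row by row yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration.
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```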

**Performance measures derived from confusion matrix:**

- Accuracy - percentage of correct predictions

`ACC = (TP + TN)/(TP + FP + TN + FN)`

- Precision - percentage of positive predictions which were actually correct

`PREC = TP / (TP + FP)`

- Recall - percentage of actual positives which were predicted correctly (Recall = Sensitivity = Hit rate = True Positive Rate (TPR))

`REC = TP / (TP + FN)`

- F1 score - the harmonic mean of precision and recall (also called the balanced F-score)

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/1bf179c30b00db201ce1895d88fe2915d58e6bfd)
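
The formulas above can be checked on a small worked example. The counts below are hypothetical, and the helper `classification_metrics` is an illustrative name, not a scikit-learn function.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)  # harmonic mean of precision and recall
    return acc, prec, rec, f1


# Hypothetical counts for 100 observations.
acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"ACC={acc:.3f} PREC={prec:.3f} REC={rec:.3f} F1={f1:.3f}")
```

This sketch does not guard against division by zero, which occurs when no positives are predicted (precision) or when no positives exist (recall).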

### Cost-based approach in model assessment

In [None]:
def cutoff_analysis(
    y_test: pd.Series, y_test_hat: np.ndarray, cost_matrix: np.ndarray
) -> list:
    """
    Calculate the average cost per observation for cutoff thresholds between 0 and 1,
    given true labels `y_test`, predicted probabilities `y_test_hat`
    and cost matrix `cost_matrix` (rows: actual class, columns: predicted class).
    """
    cutoff_range = np.arange(0, 1.0, 0.01)
    vec = []
    for cutoff in cutoff_range:
        # Binarize the predicted probabilities at the current cutoff
        y_test_hat_bin = np.where(y_test_hat >= cutoff, 1, 0)
        # labels=[0, 1] keeps the matrix 2x2 even if only one class is present
        conf_mat = confusion_matrix(y_test, y_test_hat_bin, labels=[0, 1])
        conf_cost_mat = np.multiply(conf_mat, cost_matrix)
        vec.append(conf_cost_mat.sum() / len(y_test))
    return vec

In [None]:
costmat = np.array([[0, 1], [5, 0]])
cost_val = cutoff_analysis(y_test, score_val, costmat)
cost_train = cutoff_analysis(y_train, score_train, costmat)
cost_val[:5]

In [None]:
cutoffs = np.arange(0, 1.0, 0.01)

plt.figure()
plt.xlabel("Cutoff point")
plt.ylabel("Cost per client")
plt.title("Cost vs. cutoff threshold")

plt.plot(cutoffs, cost_val, color="blue")
plt.plot(cutoffs, cost_train, color="red")
# Dotted horizontal lines mark the minimum cost on each data set
plt.plot(
    [0, 1],
    [min(cost_val)] * 2,
    color="blue",
    linestyle=":",
    label=f"Min Cost Val = {min(cost_val):.2f} for k = {cutoffs[np.argmin(cost_val)]:.2f}",
)
plt.plot(
    [0, 1],
    [min(cost_train)] * 2,
    color="red",
    linestyle=":",
    label=f"Min Cost Train = {min(cost_train):.2f} for k = {cutoffs[np.argmin(cost_train)]:.2f}",
)
plt.legend();

The minimum cost is lower on the training set than on the validation set, so the model may **overfit** slightly.
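
To make the cost calculation concrete, here is the arithmetic for a single cutoff. The confusion matrix below is hypothetical; the cost matrix is the `costmat` used above, read with rows as actual classes and columns as predicted classes, so a false positive costs 1 and a false negative costs 5.

```python
# Hypothetical confusion matrix for 100 clients: [[TN, FP], [FN, TP]].
conf_demo = [[60, 10], [4, 26]]
# Same values as `costmat`: FP costs 1, FN costs 5, correct decisions cost 0.
cost_demo = [[0, 1], [5, 0]]

# Element-wise product summed over all cells: 10 * 1 + 4 * 5.
total_cost = sum(
    conf_demo[i][j] * cost_demo[i][j] for i in range(2) for j in range(2)
)
n_clients = sum(sum(row) for row in conf_demo)
cost_per_client = total_cost / n_clients

print(total_cost, n_clients, cost_per_client)
```

Because a false negative is five times as expensive as a false positive, the cost-minimizing cutoff tends to sit below 0.5: it is cheaper to flag a few extra good clients than to miss a bad one.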


### Performance metrics

In [None]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/australian/australian.dat"
dataset = pd.read_csv(
    url, sep=" ", header=None, names=[f"V{i}" for i in range(15)]
)
dataset.rename(columns={"V14": "class"}, inplace=True)

dataset["V3"] = np.where(dataset["V3"] == 1, 0, 1)
dataset["V11"] = np.where(dataset["V11"] == 1, 0, 1)
dataset["V13"] = np.log(dataset["V13"])
dataset

In [None]:
training_fraction = 0.8
X = dataset.iloc[:, 0:14]
y = dataset.iloc[:, 14]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 - training_fraction, random_state=42
)

In [None]:
lr = LogisticRegression

In [None]:
model2 = lr(penalty=None, max_iter=1000).fit(X_train, y_train)

In [None]:
y_test_hat = model2.predict(X_test)

In [None]:
# Rows are actual classes, columns are predicted classes:
# confm[0, 0] = TN, confm[0, 1] = FP, confm[1, 0] = FN, confm[1, 1] = TP
confm = confusion_matrix(y_test, y_test_hat)
confm

In [None]:
ACC = (confm[0, 0] + confm[1, 1]) / np.sum(confm)
PREC = (confm[1, 1]) / (confm[1, 1] + confm[0, 1])
REC = (confm[1, 1]) / (confm[1, 1] + confm[1, 0])
F1 = 2 * PREC * REC / (PREC + REC)
print("ACC ", ACC, "\nPREC ", PREC, "\nREC ", REC, "\nF1 ", F1)

In [None]:
# Sklearn built-in report
print(classification_report(y_test, y_test_hat))

### Visual performance metric - ROC Curve + descriptive AUC
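
AUC can be read as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing all positive-negative pairs; `pairwise_auc` is an illustrative helper, not a scikit-learn function, and the labels and scores are toy values.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration.
labels_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = pairwise_auc(labels_demo, scores_demo)
print(auc_demo)
```

The pairwise loop is quadratic in the number of observations, so it is meant for intuition only; the cell below uses `roc_curve` and `auc` instead.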

In [None]:
# From here on, y_train_hat and y_test_hat hold predicted probabilities of class 1
y_train_hat = model2.predict_proba(X_train)[:, 1]
y_test_hat = model2.predict_proba(X_test)[:, 1]
fprv, tprv, _ = roc_curve(y_test, y_test_hat)
fprt, tprt, _ = roc_curve(y_train, y_train_hat)
auc_rocv = auc(fprv, tprv)
auc_roct = auc(fprt, tprt)

plt.figure()

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC+AUC")

plt.plot([0, 1], [0, 1], color="grey", linestyle="--", label="Random, AUC = 0.5")
plt.plot([0, 0], [0, 1], color="navy", linestyle=":", label="Wizard, AUC = 1.0")
plt.plot([0, 1], [1, 1], color="navy", linestyle=":")

plt.plot(fprt, tprt, color="orange", label="Model - train, AUC = %0.2f" % auc_roct)
plt.plot(fprv, tprv, color="red", label="Model - val, AUC = %0.2f" % auc_rocv)
plt.legend(loc="lower right");

## Exercises

Load the Iris dataset from https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv into a DataFrame named `iris`

In [None]:
iris = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

Recode the `species` column to 1 if the iris belongs to the _versicolor_ species and 0 otherwise

In [None]:
iris["species"] = np.where(iris["species"] == "versicolor", 1, 0)

Your goal is to predict the `species` column. Split the dataset into training and validation subsets using the `train_test_split` function. The training set should contain **75%** of all observations.

In [None]:
X = iris.drop("species", axis=1)
y = iris["species"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=1
)

Build a logistic regression model (with `LogisticRegression` from `sklearn`) using **Elastic-net** regularization with an L1 ratio of 0.35 (only one solver supports that, check [here](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression))

You can read more about **Elastic-net** [here](https://en.wikipedia.org/wiki/Elastic_net_regularization)
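
As a rough illustration of what the L1 ratio controls, the sketch below evaluates an elastic-net penalty term for a fixed weight vector. It assumes the form given in the scikit-learn user guide, `l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2`; the helper name `elastic_net_penalty` and the weights are hypothetical, and the scaling by `C` used during fitting is left out.

```python
def elastic_net_penalty(weights, l1_ratio):
    """Mix of L1 and squared L2 norms, weighted by l1_ratio (assumed form)."""
    l1 = sum(abs(w) for w in weights)  # L1 norm
    l2_sq = sum(w * w for w in weights)  # squared L2 norm
    return l1_ratio * l1 + (1.0 - l1_ratio) / 2.0 * l2_sq


# Hypothetical weights, not taken from a fitted model.
w_demo = [0.5, -2.0, 0.0]
for ratio in (0.0, 0.35, 1.0):
    print(ratio, elastic_net_penalty(w_demo, ratio))
```

With `l1_ratio=0` only the L2 part remains, and with `l1_ratio=1` only the L1 part remains; 0.35 sits between the two.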

In [None]:
model = LogisticRegression(
    penalty="elasticnet", l1_ratio=0.35, solver="saga", max_iter=10000
)
model.fit(X_train, y_train)

Make predictions on the test set with a 0.5 cutoff threshold. Produce a classification report with `classification_report`. What is the accuracy of the model?

In [None]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Why is the accuracy so low? If you want to know, check [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Unsupervised-learning-example:-Iris-dimensionality) below `In[19]`.