Table of contents:
- What is ROC
- What is AUC
- What does it mean to be balanced? Exactly 50/50, or something looser?
- Which curve to use when the dataset is:
    - imbalanced: PR curve (recall on the x-axis, precision on the y-axis)
    - balanced: ROC curve (FPR on the x-axis, TPR on the y-axis)
    - explain why a certain x-axis is preferred

Previously, in <a href="https://medium.com/analytics-vidhya/classification-performance-metric-with-python-sklearn-d8342ac25898">Classification Performance Metric with python Sklearn</a>, we covered various classification performance metrics, including the ROC curve and AUC, but only briefly.

Readers are assumed to understand the confusion matrix, precision, recall, TPR, and FPR. If you don't, I recommend reading the previous blog post first.

We will dive deeper into the ROC curve to understand its pros and cons, AUC, and when it should be replaced with the PR curve.

We will use the same dataset as before, the breast cancer dataset from sklearn.

For my own convenience, I've labelled malignant as 1 and benign as 0, which is the opposite of the labelling in the previous blog post.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


import plotly.graph_objects as go
import numpy as np
import pandas as pd

# Load the data; flip the labels so that malignant = 1 and benign = 0
bc = load_breast_cancer()
df = pd.DataFrame(data=bc.data, columns=bc.feature_names)
df["target"] = bc.target
df["target"] = df["target"].map({0:1, 1:0})

# Keep only 30 malignant rows (plus every benign row) to make the data imbalanced
malignant_subset_df = df.loc[df["target"] == 1].sample(30)
new_df = pd.concat([malignant_subset_df, df.loc[df["target"] == 0]])
# Shuffle the rows, then standardize every feature column (all but "target")
new_df = new_df.sample(frac=1).reset_index(drop=True)
new_df.iloc[:, :-1] = StandardScaler().fit_transform(new_df.iloc[:, :-1])

y = new_df["target"].values
X = new_df.drop(columns=["target"]).values

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
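
The subsampling in this cell keeps 30 malignant rows next to every benign row, so the positive class ends up rare. As a rough check of how imbalanced that is, here is a minimal sketch. The `class_balance` helper is hypothetical, and `y_demo` is a stand-in label array (30 positives, 357 negatives, matching the benign count in sklearn's breast cancer dataset) rather than the real `y`.

```python
import numpy as np

def class_balance(labels):
    """Return {label: fraction of rows} for an array of class labels."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): float(c) / len(labels) for v, c in zip(values, counts)}

# Stand-in labels: 30 positives (malignant) and 357 negatives (benign)
y_demo = np.array([1] * 30 + [0] * 357)

balance = class_balance(y_demo)
print(balance)
```

Calling `class_balance(y)` on the real labels should give the same kind of summary: positives make up roughly 8% of the rows.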

In [2]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

In [3]:
probas = log_reg.predict_proba(X_test)

In [4]:
result_df = pd.DataFrame(probas, columns=["0", "1"])
result_df["target"] = y_test
result_df["pred_t"] = y_pred

In [6]:
result_df.head()

|   | 0 | 1 | target | pred_t |
|---|---|---|---|---|
| 0 | 0.979129 | 0.020871 | 0 | 0 |
| 1 | 0.142418 | 0.857582 | 1 | 1 |
| 2 | 0.000768 | 0.999232 | 1 | 1 |
| 3 | 0.987901 | 0.012099 | 0 | 0 |
| 4 | 0.999721 | 0.000279 | 0 | 0 |
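
In the five rows displayed here, `pred_t` is 1 exactly when the probability in column `1` is above 0.5, which is the cutoff that `predict` effectively applies for a binary `LogisticRegression`. A small sketch that reproduces `pred_t` from the displayed probabilities (the array is copied by hand from the output, not recomputed):

```python
import numpy as np

# Probabilities of class 1, copied from the five displayed rows
p1 = np.array([0.020871, 0.857582, 0.999232, 0.012099, 0.000279])

# Thresholding at 0.5 turns probabilities into hard class labels
pred = (p1 > 0.5).astype(int)
print(pred)  # [0 1 1 0 0]
```

Moving this cutoff away from 0.5 changes which rows count as positive, and that threshold is exactly what the ROC and PR curves sweep over.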


In [13]:
def sigmoid(z):
    # Logistic function: maps a linear score z to a probability in (0, 1)
    return 1/(1+np.exp(-z))

In [14]:
log_reg.coef_

array([[ 0.03255541,  0.24489433,  0.04944208,  0.1656219 ,  0.40947867,
        -0.36898393,  0.36843419,  0.58261486,  0.09488274, -0.0175737 ,
         0.9775314 ,  0.46945104,  0.84874838,  0.52061561, -0.55458253,
        -0.68421379, -0.28128956,  0.02138524, -0.39380904, -0.31029486,
         0.17741219,  0.46444507,  0.37281804,  0.29749622,  0.25868817,
         0.29438826,  0.47746857,  0.38116866,  0.27445527,  0.36390683]])

In [16]:
log_reg.intercept_

array([-5.62683611])
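
The coefficients and intercept are all that is needed to rebuild the predicted probabilities by hand: for a binary logistic regression, the class-1 probability is the sigmoid of the linear score `X @ coef + intercept`. The sketch below checks this on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the breast cancer data, and `X_demo`, `y_demo`, and `model` are illustrative names).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Logistic function: 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

# Synthetic stand-in data, only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b for every row
z = X_demo @ model.coef_.ravel() + model.intercept_[0]

manual = sigmoid(z)                               # probability of class 1, by hand
from_sklearn = model.predict_proba(X_demo)[:, 1]  # same quantity from sklearn
print(np.allclose(manual, from_sklearn))
```

The same check should work with `log_reg`, `X_test`, and `probas[:, 1]` from the cells above.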

In [12]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x = log_reg.decision_function(X_test),  # linear score z, so the plot traces the sigmoid
        y = result_df["1"],
        mode="markers"
    )
)

fig.show()

In [8]:
new_df.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')

## What is ROC
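
A ROC curve plots TPR against FPR as the decision threshold sweeps from high to low. As a minimal sketch of that sweep, the code below computes the (FPR, TPR) points by hand and compares them with `roc_curve`. The four labels and scores are made up for illustration, and `rates_at` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores (made up for illustration)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (FPR, TPR) when predicting 1 for every score >= threshold."""
    pred = y_score >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fpr = fp / np.sum(y_true == 0)
    tpr = tp / np.sum(y_true == 1)
    return float(fpr), float(tpr)

# Sweep the threshold over the observed scores, from highest to lowest
manual_points = [rates_at(t) for t in (0.8, 0.4, 0.35, 0.1)]

# sklearn returns the same points, preceded by the (0, 0) starting point
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(manual_points)
print(list(zip(fpr, tpr)))
```

On the notebook's own data, the equivalent call would be `roc_curve(y_test, probas[:, 1])`.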

## What is AUC
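
AUC is the area under the ROC curve. It can also be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes that pairwise fraction directly and compares it with `roc_auc_score`, using made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (made up for illustration)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = float(np.mean((diff > 0) + 0.5 * (diff == 0)))

auc = float(roc_auc_score(y_true, y_score))
print(pairwise_auc, auc)  # 0.75 0.75
```

Three of the four positive/negative pairs are ordered correctly here (only 0.35 vs. 0.4 is wrong), which gives 0.75.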

Does "balanced" mean exactly 50/50?

## ROC Vs. PR curve
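
One way to see why the PR curve can be more informative on imbalanced data: FPR divides false positives by the (large) number of negatives, so it stays small even when false positives outnumber true positives, while precision exposes that directly. A small arithmetic sketch with hypothetical counts (not results from the model above):

```python
# Hypothetical confusion counts at one threshold on an imbalanced test set
n_pos, n_neg = 10, 1000
tp, fp = 8, 50

tpr = tp / n_pos            # recall: share of positives that were found
fpr = fp / n_neg            # share of negatives wrongly flagged
precision = tp / (tp + fp)  # share of flagged rows that are truly positive

print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  precision={precision:.3f}")
# TPR=0.80  FPR=0.05  precision=0.138
```

On the ROC plot this threshold looks strong (TPR 0.80 at FPR 0.05), yet fewer than one in seven flagged cases is actually positive, which only the precision axis of the PR curve shows.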

<b>References</b>

- https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc