# <u>Data Science Essentials</u>

## <u>Topic</u>: Brier Score

## <u>Category</u>: Model Evaluation

### <u>Created By</u>: Mohammed Misbahullah Sheriff
- [LinkedIn](https://www.linkedin.com/in/mohammed-misbahullah-sheriff/)
- [GitHub](https://github.com/MisbahullahSheriff)

## Importing Libraries

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.metrics import (
    accuracy_score,
    brier_score_loss
)

from sklearn.dummy import DummyClassifier

from sklearn.ensemble import RandomForestClassifier

## Getting the Data

In [5]:
path = r"/content/creditcard.csv"

credit_card = pd.read_csv(path)
credit_card.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


In [6]:
X = credit_card.drop(columns="Class")
y = credit_card.Class.copy()

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    test_size=0.2,
                                                    random_state=42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(227845, 30) (227845,)
(56962, 30) (56962,)


## Target Label Distribution

In [7]:
(
    y_train
      .value_counts()
      .pipe(lambda ser: pd.concat([ser, y_train.value_counts(normalize=True)],
                                  axis=1))
      .set_axis(["count", "percentage"], axis=1)
      .rename_axis(["label"])
)

| label | count | percentage |
|---|---|---|
| 0 | 227451 | 0.998271 |
| 1 | 394 | 0.001729 |


- The dataset is severely imbalanced, as can be seen above
- About 99.8% of the observations in the training data belong to the negative (majority) class
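
To make the consequence of this imbalance concrete, the short sketch below redoes the arithmetic from the class counts shown above. It only uses the two counts from the table; the variable names are chosen here for illustration and are not part of the notebook.

```python
# Class counts copied from the training-split table above
counts = {0: 227451, 1: 394}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a rule that always predicts the majority label
majority_accuracy = counts[majority_label] / total

# Share of positives, i.e. the constant probability a no-skill
# probabilistic forecast would output for every observation
positive_rate = counts[1] / total

print(f"majority accuracy: {majority_accuracy:.6f}")
print(f"positive rate    : {positive_rate:.6f}")
```

A rule that ignores the features entirely would therefore already be right about 99.8% of the time on the training split, which is the bar any accuracy figure has to be read against.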

## Demo

In [18]:
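def brier_skill_score(y_true, y_pred_probs):
  """

  Description:
  ------------
  This function computes the Brier skill score for the given true labels and
  predicted probabilities of the positive class, relative to a no-skill
  reference that always predicts the positive rate of the training data

  Parameters:
  -----------
  y_true: array-like
          The true class labels

  y_pred_probs: array-like
                Predicted probabilities of the positive class

  """
  # Reference forecast: the constant 0.001729 is the share of positives in
  # y_train (see the target label distribution above)
  ref_score = brier_score_loss(y_true, np.full_like(y_true,
                                                    fill_value=0.001729,
                                                    dtype=float))
  model_score = brier_score_loss(y_true, y_pred_probs)
  # 1 is a perfect score, 0 matches the reference, negative is worse than it
  score = 1 - (model_score / ref_score)
  return score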
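
The function above hardcodes the training positive rate (0.001729) as the reference forecast. As a minimal sketch of the same calculation written out by hand, the block below computes the Brier score directly with NumPy and takes the reference probability as an argument. The helper names `brier_score` and `brier_skill_score_vs_prior` and the toy arrays are assumptions made for illustration, not part of the notebook.

```python
import numpy as np


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def brier_skill_score_vs_prior(y_true, y_prob, prior):
    """Skill relative to a constant forecast equal to `prior` (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    ref_score = brier_score(y_true, np.full(y_true.shape, prior))
    model_score = brier_score(y_true, y_prob)
    return 1.0 - model_score / ref_score


# Toy labels: 1 positive out of 10, so the base rate is 0.1
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
prior = y_toy.mean()

constant_probs = np.full(y_toy.shape, prior)  # no-skill forecast
sharp_probs = np.array([0.0] * 9 + [0.8])     # confident and mostly right

bss_constant = brier_skill_score_vs_prior(y_toy, constant_probs, prior)
bss_sharp = brier_skill_score_vs_prior(y_toy, sharp_probs, prior)
print(f"constant forecast: {bss_constant:.4f}")
print(f"sharp forecast   : {bss_sharp:.4f}")
```

On these toy values the constant forecast scores 0 by construction, while the sharper forecast scores close to 1. Passing the prior in, for example as `y_train.mean()`, would avoid retyping a rounded constant if the data split changes.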

In [22]:
def evaluate_model(model, accuracy=True):
  """

  Description:
  ------------
  This function takes in a model instance, trains it and prints its
  performance on the training and test sets

  Parameters:
  -----------
  model: object
         Any classifier instance

  accuracy: bool
            Whether to use accuracy as the evaluation metric. Will use the Brier skill score if set to False.

  """
  model.fit(X_train, y_train)

  if accuracy:
    metric = accuracy_score
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
  else:
    metric = brier_skill_score
    y_train_pred = model.predict_proba(X_train)[:, 1]
    y_test_pred = model.predict_proba(X_test)[:, 1]

  train_score = metric(y_train, y_train_pred)
  test_score = metric(y_test, y_test_pred)

  print(f"{'Train Score':12}: {train_score:.10f}")
  print(f"{'Test Score':12}: {test_score:.10f}")

In [14]:
dummy = DummyClassifier(strategy="prior")

rf = RandomForestClassifier(n_estimators=10, max_depth=5)

### Accuracy

In [15]:
evaluate_model(dummy)

Train Score : 0.9982707542408216
Test Score  : 0.9982795547909132


In [16]:
evaluate_model(rf)

Train Score : 0.9994162698325616
Test Score  : 0.9992626663389628


### Brier Skill Score

In [23]:
evaluate_model(dummy, accuracy=False)

Train Score : 0.0000000000
Test Score  : -0.0000000025


In [24]:
evaluate_model(rf, accuracy=False)

Train Score : 0.7404720978
Test Score  : 0.7101060967


- When using `accuracy` to evaluate models on imbalanced data, even a no-skill (baseline) model can achieve a very high accuracy simply by returning the majority class label for every input
- In this case, the no-skill model achieves an accuracy of about 99.8%
- The random forest model improves on the no-skill model by only about 0.1 percentage points in accuracy (99.93% vs. 99.83% on the test set)
- When using `brier skill score` to evaluate the models, the no-skill model returns a score of 0 on the training data
 - It returns a score marginally below 0 (effectively 0) on the test data, indicating no predictive skill
- The random forest model returns a score of about 0.74 on the training data and 0.71 on the test data, thus demonstrating predictive skill over the baseline
- In conclusion, evaluating classification models with an inappropriate metric can give a misleading picture of a model's performance and reliability
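
The contrast drawn above can be reproduced on a small synthetic example without training anything. The sketch below is an illustration under stated assumptions: the labels (5 positives in 1000), the two sets of probabilities, and the 0.5 decision threshold are all made up here and do not come from the credit card data.

```python
import numpy as np

# Toy labels (assumed for illustration): 5 positives among 1000 observations
y_true = np.array([0] * 995 + [1] * 5)
base_rate = y_true.mean()


def accuracy(y, y_pred):
    """Fraction of labels predicted correctly."""
    return float(np.mean(y == y_pred))


def brier(y, p):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return float(np.mean((p - y) ** 2))


# No-skill model: always outputs the base rate, so it always predicts class 0
no_skill_probs = np.full(y_true.shape, base_rate)
# Informative model: low probability for negatives, high for positives
informative_probs = np.where(y_true == 1, 0.6, 0.01)

ref = brier(y_true, no_skill_probs)
results = {}
for name, probs in [("no-skill", no_skill_probs),
                    ("informative", informative_probs)]:
    labels = (probs >= 0.5).astype(int)  # threshold probabilities at 0.5
    acc = accuracy(y_true, labels)
    bss = 1 - brier(y_true, probs) / ref
    results[name] = (acc, bss)
    print(f"{name:12} accuracy={acc:.3f}  brier skill score={bss:.3f}")
```

Here the two models differ by only 0.005 in accuracy (0.995 vs. 1.000), while the Brier skill score separates them clearly (0.000 vs. roughly 0.819), which mirrors the pattern seen with the dummy and random forest models.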