## Misclassification cost as part of training

Scikit-learn offers two ways to introduce misclassification cost into the loss function of an algorithm:

- Setting the **class_weight** parameter when we instantiate the estimator, for those estimators that support it.
- Passing a **sample_weight** vector, with one weight per observation, when we fit the estimator.


With either the **class_weight** parameter or the **sample_weight** vector, we modify the loss function so that errors are penalized according to the class imbalance or the cost attributed to each misclassification.
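As a minimal sketch of how the two options relate, the snippet below fits the same model both ways and compares the coefficients. It assumes a synthetic dataset from `make_classification` (labels 0 and 1) in place of the real data, and the cost of 10 is a hypothetical value chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced data (assumption: stands in for the real dataset)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)

cost = 10  # hypothetical cost of misclassifying the minority class

# option 1: weights set when the estimator is instantiated
logit_cw = LogisticRegression(class_weight={0: 1, 1: cost}, max_iter=1000)
logit_cw.fit(X, y)

# option 2: one weight per observation, passed to fit
logit_sw = LogisticRegression(max_iter=1000)
logit_sw.fit(X, y, sample_weight=np.where(y == 1, cost, 1))

# largest absolute difference between the two sets of coefficients
print(np.abs(logit_cw.coef_ - logit_sw.coef_).max())
```

When the per-observation weights simply repeat the class weights, both routes should lead to essentially the same model. A **sample_weight** vector is more flexible, since it can also assign different costs to observations of the same class.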


## Classifiers that support class_weight

In [13]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

In [14]:
# load data
# sample only a few observations to speed up the computation

data = pd.read_csv('../kdd2004.csv').sample(10000)

data.head()

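| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96076 | 66.96 | 24.36 | 0.08 | 4.5 | -29.0 | 1067.9 | -1.43 | -1.21 | -16.5 | -57.0 | ... | 331.4 | 0.99 | 1.96 | 12.0 | -31.0 | 150.0 | 0.17 | 0.32 | 0.64 | -1 |
| 43227 | 95.21 | 23.02 | 1.42 | 18.5 | -7.5 | 1079.0 | 1.58 | 0.82 | 10.0 | -81.0 | ... | 1155.5 | -1.33 | 2.78 | -1.0 | -50.0 | 386.0 | -0.87 | 0.1 | -0.61 | -1 |
| 77087 | 14.91 | 35.29 | -0.66 | -5.5 | 12.5 | 614.3 | 1.01 | -1.66 | -16.5 | -37.5 | ... | 463.9 | 1.1 | -1.98 | 0.0 | -20.0 | 285.3 | -0.37 | 0.63 | 0.35 | -1 |
| 100288 | 72.76 | 25.66 | -0.39 | -14.5 | 98.5 | 4571.0 | -1.33 | 0.58 | -16.5 | -104.5 | ... | 4136.0 | -1.63 | 4.86 | 37.0 | -221.0 | 1546.5 | -1.74 | 0.14 | 0.02 | -1 |
| 52999 | 92.74 | 19.13 | -1.0 | -22.0 | 19.0 | 1566.3 | -1.94 | 0.46 | 8.5 | -71.5 | ... | 1323.9 | -1.91 | 1.0 | 4.0 | -53.0 | 63.3 | 1.55 | 0.37 | 0.37 | -1 |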


In [15]:
# imbalanced target

data.target.value_counts() / len(data)

-1    0.9917
 1    0.0083
Name: target, dtype: float64

In [16]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((7000, 74), (3000, 74))

## Using class_weight

In [19]:
# Logistic Regression with class_weight

# we set the cost / weights when we instantiate the estimator

def run_Logit(X_train, X_test, y_train, y_test, class_weight):
    
    # weights introduced here
    logit = LogisticRegression(
        penalty='l2',
        solver='newton-cg',
        random_state=0,
        max_iter=10,
        n_jobs=4,
        class_weight=class_weight
    )
    
    logit.fit(X_train, y_train)
    
    
    # model performance report
    print('Train set')
    pred = logit.predict_proba(X_train)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = logit.predict_proba(X_test)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [20]:
# evaluate performance of algorithm built
# using imbalanced dataset

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          class_weight=None)

Train set
Logistic Regression roc-auc: 0.9044952737235635
Test set
Logistic Regression roc-auc: 0.9487147177419355


In [21]:
# evaluate performance of algorithm built
# cost estimated as imbalance ratio

# 'balanced' weights each class inversely to its frequency,
# so the ratio between the weights equals the imbalance ratio

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          class_weight='balanced')

Train set
Logistic Regression roc-auc: 0.9893875497840149
Test set
Logistic Regression roc-auc: 0.9720682123655914
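
To see which weights `class_weight='balanced'` produces, we can compute them directly with `compute_class_weight` from `sklearn.utils.class_weight`. The sketch below assumes a toy target with 990 majority and 10 minority observations, not the real data, and checks the result against the formula `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy target (assumption): 990 majority (-1) and 10 minority (1) observations
y = np.array([-1] * 990 + [1] * 10)

classes = np.array([-1, 1])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# same result by hand: n_samples / (n_classes * count per class)
counts = np.array([(y == c).sum() for c in classes])
manual = len(y) / (len(classes) * counts)

print(dict(zip(classes.tolist(), weights)))
print(weights[1] / weights[0])
```

The absolute values of the weights differ from a plain `{-1: 1, 1: 99}` dictionary, but the ratio between the minority and majority weights is the imbalance ratio in both cases.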


In [22]:
# evaluate performance of algorithm built
# cost set manually for each class

# alternatively, we can pass a different cost
# in a dictionary, if we know it already

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          class_weight={-1:1, 1:10})

Train set
Logistic Regression roc-auc: 0.9318615253504722
Test set
Logistic Regression roc-auc: 0.9558691756272402


Try different costs and see how the performance changes.
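
One possible way to explore several costs at once is a simple loop, sketched below. It assumes synthetic data from `make_classification` (labels 0 and 1) rather than the dataset used above, and the candidate costs are hypothetical values picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data (assumption: stands in for the real dataset)
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.98, 0.02], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# hypothetical candidate costs for the minority class
results = {}
for cost in [1, 10, 50, 100]:
    logit = LogisticRegression(class_weight={0: 1, 1: cost}, max_iter=1000)
    logit.fit(X_tr, y_tr)
    pred = logit.predict_proba(X_te)[:, 1]
    results[cost] = roc_auc_score(y_te, pred)

for cost, auc in results.items():
    print('cost {}: test roc-auc {:.4f}'.format(cost, auc))
```

In practice, the cost would ideally be selected with cross-validation on the train set, so that the test set remains untouched for the final evaluation.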

## Using sample_weight

In [23]:
# Logistic Regression + sample_weight

# pass the weights / cost, when we train the algorithm

def run_Logit(X_train, X_test, y_train, y_test, sample_weight):
    
    logit = LogisticRegression(
        penalty='l2',
        solver='newton-cg',
        random_state=0,
        max_iter=10,
        n_jobs=4,
    )
    
    # costs are passed here
    logit.fit(X_train, y_train, sample_weight=sample_weight)

    print('Train set')
    pred = logit.predict_proba(X_train)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = logit.predict_proba(X_test)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [24]:
# evaluate performance of algorithm built
# using imbalanced dataset

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          sample_weight=None)

Train set
Logistic Regression roc-auc: 0.9044952737235635
Test set
Logistic Regression roc-auc: 0.9487147177419355


In [25]:
# evaluate performance of algorithm built
# cost estimated as imbalance ratio

# with numpy.where, we introduce a cost of 99 to
# each observation of the minority class, and 1
# otherwise.

run_Logit(X_train,
          X_test,
          y_train,
          y_test,
          sample_weight=np.where(y_train==1,99,1))

Train set
Logistic Regression roc-auc: 0.9888723111748172
Test set
Logistic Regression roc-auc: 0.9732862903225805
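
Instead of hard-coding the cost, we could derive it from the target itself. The sketch below uses a hypothetical helper, `imbalance_ratio_weights`, which is not part of Scikit-learn, and a toy target with 990 majority and 10 minority observations.

```python
import numpy as np

def imbalance_ratio_weights(y, minority_label=1):
    """Weight minority observations by the majority/minority ratio, others by 1."""
    y = np.asarray(y)
    n_minority = (y == minority_label).sum()
    n_majority = len(y) - n_minority
    ratio = n_majority / n_minority
    return np.where(y == minority_label, ratio, 1.0)

# toy target (assumption): 990 majority (-1) and 10 minority (1) observations
y_toy = np.array([-1] * 990 + [1] * 10)
weights = imbalance_ratio_weights(y_toy)

print(weights.min(), weights.max())
```

The resulting vector could then be passed as `sample_weight` when fitting, in place of the `np.where` expression with a fixed value.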


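Cost-sensitive learning improved the performance of the model: weighting the minority class by the imbalance ratio raised the test set ROC-AUC from about 0.949 to about 0.973.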