### Classifier-based Drift Detection with sklearn as the Backend

### Topics Covered:
   * Prepare the data for the drift detection demo
   * `x_ref` and `x_h0` are drawn from the same data distribution
   * `x_h1` is drawn from a different data distribution
   * Define a classifier-based drift detector with sklearn as the backend
   * Set up the detector with `x_ref` as the reference data (analogous to `X_train` in model training)
   * Test for drift on `x_h0` (analogous to `X_test` in the model validation phase)
   * Because `x_ref` and `x_h0` are drawn from the same distribution, the detector should not flag any drift on `x_h0`
   * On `x_h1`, the detector should flag drift
   
**Finally, we will also see what happens when drift detectors are not in place.**
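
The idea behind a classifier-based detector can be sketched on synthetic data: label reference rows 0 and test rows 1, train a classifier to tell them apart, and look at its out-of-fold accuracy. Accuracy near 0.5 suggests the two sets look alike, while clearly higher accuracy suggests drift. This is a minimal illustration, not the alibi-detect implementation; `domain_accuracy`, the toy arrays, and the choice of `LogisticRegression` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def domain_accuracy(x_ref, x_test, seed=0):
    """Out-of-fold accuracy of a classifier that tries to tell x_ref from x_test."""
    x = np.concatenate([x_ref, x_test])
    # Label 0 = reference instance, label 1 = test instance.
    y = np.concatenate([np.zeros(len(x_ref), dtype=int), np.ones(len(x_test), dtype=int)])
    # Shuffle once so the two cross-validation folds mix both groups.
    perm = np.random.default_rng(seed).permutation(len(x))
    preds = cross_val_predict(LogisticRegression(), x[perm], y[perm], cv=2)
    return float((preds == y[perm]).mean())


rng = np.random.default_rng(0)
toy_ref = rng.normal(0.0, 1.0, size=(500, 2))
toy_h0 = rng.normal(0.0, 1.0, size=(500, 2))  # same distribution as toy_ref
toy_h1 = rng.normal(1.0, 1.0, size=(500, 2))  # mean shifted by 1 in each feature

acc_h0 = domain_accuracy(toy_ref, toy_h0)
acc_h1 = domain_accuracy(toy_ref, toy_h1)
print(f"same distribution:    accuracy = {acc_h0:.3f}")
print(f"shifted distribution: accuracy = {acc_h1:.3f}")
```

The detector used later in this notebook adds a statistical test on top of this accuracy so that the decision comes with a p-value.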

### Installations

Install alibi and alibi_detect for the current user as below:

`pip install --user alibi`

`pip install --user alibi_detect`

Data Source: https://archive.ics.uci.edu/ml/datasets/adult

Last Accessed on 06-02-2023

### Data Preparation

In [1]:
import pandas as pd

In [45]:
data = pd.read_csv("data/adult.data", header=None)

In [46]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [None]:
data.values

In [4]:
columns = ['Age',
    'Workclass',
    'fnlwgt',
    'Education',
    'education-num',
    'Marital Status',
    'Occupation',
    'Relationship',
    'Race',
    'Sex',
    'Capital Gain',
    'Capital Loss',
    'Hours per week',
    'Country',
    'class']

In [47]:
adult_data = pd.DataFrame(data=data.values, columns=columns)

In [48]:
adult_data.head()

Unnamed: 0,Age,Workclass,fnlwgt,Education,education-num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


<img src="drift-flow-diagram.png">

In [49]:
adult_data.columns

Index(['Age', 'Workclass', 'fnlwgt', 'Education', 'education-num',
       'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital Gain', 'Capital Loss', 'Hours per week', 'Country', 'class'],
      dtype='object')

In [50]:
categorical_columns=['Workclass',
 'Education',
 'Marital Status',
 'Occupation',
 'Relationship',
 'Race',
 'Sex',
 'Country']

In [51]:
adult_data['Education'].unique()

array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

In [52]:
# Strip the leading whitespace that every categorical value carries in the raw file.
for x in categorical_columns:
    adult_data[x] = adult_data[x].str.strip()

In [53]:
adult_data['Education'].unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)
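
As an alternative to stripping each column after loading, `pd.read_csv` has a `skipinitialspace` option that drops the space following each delimiter while parsing. The sketch below uses two inline rows in the same layout as `adult.data`; `raw` and `toy_columns` are made up for the illustration and cover only a subset of the real columns.

```python
import io

import pandas as pd

# Two rows in the same layout as adult.data (note the space after each comma).
raw = (
    "39, State-gov, 77516, Bachelors\n"
    "50, Self-emp-not-inc, 83311, Bachelors\n"
)
toy_columns = ["Age", "Workclass", "fnlwgt", "Education"]  # hypothetical subset

toy = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=toy_columns,
    skipinitialspace=True,  # drop the blank that follows each comma
)
print(toy["Workclass"].tolist())
print(toy.dtypes)
```

Passing `names=` at load time would also make the separate `pd.DataFrame(data=data.values, columns=columns)` step unnecessary and lets pandas keep numeric dtypes for the numeric columns.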

In [12]:
numerical_columns = ['Age', 'Capital Gain', 'Capital Loss', 'Hours per week']

In [16]:
# define low education
low_education = [
    '11th',
    '9th',
    'Some-college',
    'Assoc-acdm',
    'Assoc-voc', 
    '7th-8th',
    '5th-6th', 
    '10th', 
    '1st-4th', 
    'Preschool', 
    '12th',
    'HS-grad',
    'Bachelors'

]
# define high education
high_education = [
    'Bachelors',
    'Masters',
    'Doctorate',
    'Prof-school'
]
print("Low education:", low_education)
print("High education:", high_education)

Low education: ['11th', '9th', 'Some-college', 'Assoc-acdm', 'Assoc-voc', '7th-8th', '5th-6th', '10th', '1st-4th', 'Preschool', '12th', 'HS-grad', 'Bachelors']
High education: ['Bachelors', 'Masters', 'Doctorate', 'Prof-school']
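
Note that the two lists are not disjoint. A quick set comparison, using copies of the lists defined above, shows which education levels are shared and which appear only in the high-education group:

```python
low_education = [
    '11th', '9th', 'Some-college', 'Assoc-acdm', 'Assoc-voc', '7th-8th',
    '5th-6th', '10th', '1st-4th', 'Preschool', '12th', 'HS-grad', 'Bachelors',
]
high_education = ['Bachelors', 'Masters', 'Doctorate', 'Prof-school']

# Levels that fall into both groups, and levels unique to the high group.
overlap = sorted(set(low_education) & set(high_education))
only_high = sorted(set(high_education) - set(low_education))

print("In both groups:", overlap)
print("Only in high_education:", only_high)
```

Rows with `Education == 'Bachelors'` can therefore be selected by both masks, so the two groups differ in their mix of education levels rather than being strictly separate populations.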


In [17]:
low_education

['11th',
 '9th',
 'Some-college',
 'Assoc-acdm',
 'Assoc-voc',
 '7th-8th',
 '5th-6th',
 '10th',
 '1st-4th',
 'Preschool',
 '12th',
 'HS-grad',
 'Bachelors']

In [18]:
# select instances for low and high education
low_education_mask = adult_data["Education"].isin(low_education).to_numpy()
high_education_mask = adult_data["Education"].isin(high_education).to_numpy()


In [19]:
low_education_mask

array([ True,  True,  True, ...,  True,  True,  True])

In [20]:
high_education_mask

array([ True,  True, False, ..., False, False, False])

In [21]:
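# NOTE: the outputs below show 14 columns, so 'class' was apparently dropped beforehand (step not shown).
X_low, X_high = adult_data.values[low_education_mask], adult_data.values[high_education_mask]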

In [54]:
X_low

array([[39, 'State-gov', 77516, ..., 0, 40, 'United-States'],
       [50, 'Self-emp-not-inc', 83311, ..., 0, 13, 'United-States'],
       [38, 'Private', 215646, ..., 0, 40, 'United-States'],
       ...,
       [58, 'Private', 151910, ..., 0, 40, 'United-States'],
       [22, 'Private', 201490, ..., 0, 20, 'United-States'],
       [52, 'Self-emp-inc', 287927, ..., 0, 40, 'United-States']],
      dtype=object)

In [23]:
X_low.shape

(29849, 14)

In [25]:
X_high.shape

(8067, 14)

In [22]:
import numpy as np
from sklearn.model_selection import train_test_split

In [30]:
size = 1000
np.random.seed(0)

# define reference and H0 dataset
idx_low = np.random.choice(np.arange(X_low.shape[0]), size=2*size, replace=False)
x_ref, x_h0 = train_test_split(X_low[idx_low], test_size=0.5, random_state=5, shuffle=True)

# define H1 dataset (drawn from the high-education group)
idx_high = np.random.choice(np.arange(X_high.shape[0]), size=size, replace=False)
x_h1 = X_high[idx_high]
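
A small check of the sampling scheme, run on row ids instead of the actual rows: sampling `2*size` ids without replacement and splitting them in half should give a reference set and an H0 set of 1000 rows each with no row in common. The `pool` array is a stand-in introduced for this sketch; its length matches the `X_low` row count shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

size = 1000
np.random.seed(0)

pool = np.arange(29849)  # stand-in row ids for X_low
idx_low = np.random.choice(pool, size=2 * size, replace=False)

# Same split settings as in the cell above, applied to the ids only.
ref_ids, h0_ids = train_test_split(idx_low, test_size=0.5, random_state=5, shuffle=True)

shared = set(ref_ids.tolist()) & set(h0_ids.tolist())
print(len(ref_ids), len(h0_ids))
print("rows in both sets:", len(shared))
```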

In [55]:
categorical_ids = [adult_data.columns.get_loc(x) for x in categorical_columns]
numerical_ids = [adult_data.columns.get_loc(x) for x in numerical_columns]

In [56]:
categorical_ids

[1, 3, 5, 6, 7, 8, 9, 13]

In [27]:
adult_data.head()

Unnamed: 0,Age,Workclass,fnlwgt,Education,education-num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


### Column Transformer

In [28]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [26]:
[range(len(i)) for i in [list(adult_data[x].unique()) for x in categorical_columns]]

[range(0, 9),
 range(0, 16),
 range(0, 7),
 range(0, 15),
 range(0, 6),
 range(0, 5),
 range(0, 2),
 range(0, 42)]

In [32]:
# define numerical standard scaler.
num_transf = StandardScaler()

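# define categorical one-hot encoder.
# The columns hold strings, so the categories must be the string values themselves;
# integer ranges would match nothing and, with handle_unknown="ignore", encode every row as zeros.
cat_transf = OneHotEncoder(
    categories=[list(adult_data[x].unique()) for x in categorical_columns],
    handle_unknown="ignore"
)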

# Define column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", cat_transf, categorical_ids),
        ("num", num_transf, numerical_ids),
    ],
    sparse_threshold=0
)

# fit preprocessor.
preprocessor = preprocessor.fit(np.concatenate([x_ref, x_h0, x_h1]))
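
When `categories` is passed explicitly together with `handle_unknown="ignore"`, it is worth checking that the listed categories really match the values in the data, because unmatched values are encoded as all-zero rows rather than raising an error. The toy comparison below, with a made-up single-column array, contrasts integer-position categories with string categories on string data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

toy = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)

# Categories given as integer positions: none of the strings match them.
enc_ranges = OneHotEncoder(categories=[range(2)], handle_unknown="ignore").fit(toy)

# Categories given as the actual string values.
enc_values = OneHotEncoder(
    categories=[["Private", "State-gov"]], handle_unknown="ignore"
).fit(toy)

out_ranges = enc_ranges.transform(toy).toarray()
out_values = enc_values.transform(toy).toarray()
print(out_ranges)  # every value is treated as unknown
print(out_values)  # one indicator column per string category
```

A quick way to apply the same check to the fitted `preprocessor` is to transform a few rows and confirm that the one-hot block is not entirely zero.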

### Defining the Classifier Drift Detector

In [33]:
from alibi_detect.cd import ClassifierDrift



In [35]:
from sklearn.ensemble import RandomForestClassifier

In [57]:
# define classifier
model = RandomForestClassifier()

# define drift detector with binarized predictions
cd = ClassifierDrift(
    x_ref=x_ref,
    model=model,
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=True,
    n_folds=2,
)

Both `n_folds` and `train_size` specified. By default `n_folds` is used.
`retrain_from_scratch=True` sets automatically the parameter `warm_start=False`.
`use_oob=False` sets automatically the classifier parameters `oob_score=False`.


In [58]:
cd.predict(x=x_h0)

{'data': {'is_drift': 0,
  'distance': 0,
  'p_val': 0.9974120649075104,
  'threshold': 0.05,
  'probs_ref': array([0.49435714, 0.71      , 0.70270455, 0.42633333, 0.67223684,
         0.43456443, 0.81      , 0.96      , 0.45028571, 0.2497619 ,
         0.7       , 0.6075    , 0.74      , 0.69371429, 0.71402381,
         0.61      , 0.73169446, 0.36477242, 0.611893  , 0.74333333,
         0.08131002, 0.67922236, 0.46692736, 0.58796392, 0.58466667,
         0.61101942, 0.36477242, 0.4976699 , 0.63507148, 0.304     ,
         0.96725   , 0.97925   , 0.43      , 0.39347375, 0.12208344,
         0.48392446, 0.05916667, 0.26911905, 0.68416667, 0.53252025,
         0.21181814, 0.52585335, 0.4889659 , 0.57469048, 0.54480071,
         0.28      , 0.27024696, 0.42      , 0.56881205, 0.61101942,
         0.53166667, 0.975     , 0.53833333, 0.31888961, 0.17      ,
         0.385     , 0.40069583, 0.37093806, 0.70443888, 0.36477242,
         0.60305358, 0.69571429, 0.5015709 , 0.46395238, 0.3475  

In [59]:
labels = ['No!', 'Yes!']

def print_preds(preds: dict, used_data_name: str) -> None:
    """Print the drift decision and p-value from a detector prediction."""
    print(used_data_name)
    print(f'Drift? {labels[preds["data"]["is_drift"]]}')
    print(f'p-value: {preds["data"]["p_val"]:.3f}')
    print('')

In [60]:
print_preds(cd.predict(x=x_h0), "H0")

H0
Drift? No!
p-value: 0.989
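
The two calls to `cd.predict(x=x_h0)` above report slightly different p-values (0.997 and 0.989). One plausible cause is that the `RandomForestClassifier` is created without a `random_state` and, as the `retrain_from_scratch=True` message indicates, is retrained on each call. The sketch below, on synthetic data, only shows that fixing `random_state` makes a forest reproducible; whether this removes all run-to-run variation in the detector also depends on how it splits the data internally, which is not checked here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)


def fit_and_score(random_state):
    """Train a fresh forest and return its class-1 probabilities on X."""
    clf = RandomForestClassifier(n_estimators=20, random_state=random_state)
    return clf.fit(X, y).predict_proba(X)[:, 1]


# Two independent fits with the same seed give identical probabilities.
same_seed = np.array_equal(fit_and_score(0), fit_and_score(0))
print("identical with random_state=0:", same_seed)
```

In the notebook, this would correspond to passing something like `RandomForestClassifier(random_state=0)` as `model`.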



In [62]:
cd.predict(x=x_h1)

{'data': {'is_drift': 1,
  'distance': 0.08099999999999996,
  'p_val': 0.0001579695460540278,
  'threshold': 0.05,
  'probs_ref': array([0.66982143, 0.91      , 0.72028263, 0.02      , 0.16033333,
         0.18435229, 0.955     , 0.98      , 0.155     , 0.08      ,
         0.9       , 0.44588235, 0.33      , 0.75739286, 0.38782251,
         0.50083333, 0.36010265, 0.28200074, 0.23669444, 0.77      ,
         0.24102273, 0.72273346, 0.44095642, 0.54610317, 0.6642381 ,
         0.6139037 , 0.28200074, 0.42503141, 0.55961067, 0.27489177,
         0.99666667, 0.96      , 0.17      , 0.78      , 0.29498692,
         0.39335873, 0.48195238, 0.40347763, 0.98      , 0.27052734,
         0.41305952, 0.2198567 , 0.59192916, 0.15433333, 0.32594056,
         0.6       , 0.19138299, 0.945     , 0.5342381 , 0.6139037 ,
         0.19928571, 0.        , 0.22828571, 0.37003344, 0.52333333,
         0.74195238, 0.71098093, 0.4444133 , 0.47597785, 0.28200074,
         0.4239654 , 0.645     , 0.34799359,

In [61]:
print_preds(cd.predict(x=x_h1), "H1")

H1
Drift? Yes!
p-value: 0.001
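
The `distance` and `p_val` fields can be related with a short calculation. The sketch below assumes, based on my reading of the alibi-detect documentation rather than anything shown in this notebook, that with `binarize_preds=True` the detector applies a one-sided binomial test to the number of correct out-of-fold predictions, that `distance` equals `1 - (1 - accuracy) / (1 - baseline)`, and that with `n_folds=2` all 2000 instances receive an out-of-fold prediction. `binomial_p_value` is a helper written for this illustration.

```python
from math import comb


def binomial_p_value(n_correct: int, n_total: int) -> float:
    """One-sided P(X >= n_correct) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_correct, n_total + 1))
    return tail / 2 ** n_total


n_total = 2000    # 1000 reference + 1000 test instances (assumed all out-of-fold)
distance = 0.081  # rounded 'distance' from the x_h1 output above
baseline = 0.5    # chance accuracy for two equally sized sets

# Invert the assumed relation distance = 1 - (1 - accuracy) / (1 - baseline).
accuracy = 1 - (1 - distance) * (1 - baseline)
n_correct = round(accuracy * n_total)

p_val = binomial_p_value(n_correct, n_total)
print(f"accuracy = {accuracy:.4f}, n_correct = {n_correct}, p-value = {p_val:.6f}")
```

Under these assumptions, a distance of 0.081 corresponds to an accuracy of about 54%, and the resulting p-value should land close to the `p_val` reported for `x_h1`. A `distance` of 0 for `x_h0` would then mean the classifier did no better than chance.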



### Saving and Loading the detector

In [63]:
from alibi_detect.saving import save_detector, load_detector

In [64]:
save_detector(cd, './detectors5')

Directory detectors5 does not exist and is now created.
Directory detectors5\preprocess_fn does not exist and is now created.
Directory detectors5\model does not exist and is now created.


In [65]:
cdl = load_detector('./detectors5/')

Both `n_folds` and `train_size` specified. By default `n_folds` is used.
`retrain_from_scratch=True` sets automatically the parameter `warm_start=False`.
`use_oob=False` sets automatically the classifier parameters `oob_score=False`.


In [66]:
print_preds(cdl.predict(x=x_h1), "H1")

H1
Drift? Yes!
p-value: 0.000



### Thank You!!!