### Data Drift & Model Drift Detection

#### Data Drift
Data drift (also called data shift) is a change in the data a model receives relative to the data it was trained on.
Data drift can refer to
+ changes in the input data
+ changes in the values of the features used to define or predict a target label
+ changes in the properties of the independent variables
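
To make the idea of a changed input distribution concrete, here is a minimal sketch of a two-sample Kolmogorov-Smirnov statistic written in plain Python. It is only an illustration of one common drift score, not how Deepchecks computes drift; the age values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for value in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap

# Hypothetical ages seen at training time vs. ages arriving later
train_ages = [25, 31, 38, 42, 47, 53, 58, 61]
shifted_ages = [age + 15 for age in train_ages]

same = ks_statistic(train_ages, train_ages)        # identical samples
drifted = ks_statistic(train_ages, shifted_ages)   # shifted sample
print(same, drifted)
```

A score of 0 means the two samples have the same empirical distribution; the closer the score gets to 1, the stronger the evidence that the feature has drifted.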

#### Model Drift
Model drift refers to changes in the performance of a model over time.
It is the deterioration of a model's accuracy and predictive quality over time.
ML models do not live in a static environment, so they tend to deteriorate or decay over time.
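
One simple way to watch for this decay is to score the model on successive batches of labelled data and flag any batch that falls too far below a baseline. The sketch below assumes hypothetical monthly batches, a baseline accuracy of 0.91, and a tolerance of 0.05; all three are illustrative choices.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def decayed_windows(baseline_accuracy, window_accuracies, tolerance=0.05):
    """Indices of windows whose accuracy fell more than `tolerance` below the baseline."""
    return [
        i for i, acc in enumerate(window_accuracies)
        if baseline_accuracy - acc > tolerance
    ]

# Hypothetical monthly batches of (true labels, model predictions)
monthly_batches = [
    ([0, 0, 1, 0, 1, 0, 0, 0, 1, 0], [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]),
    ([0, 1, 0, 0, 1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]),
    ([1, 0, 1, 0, 0, 1, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]),
]

window_accuracies = [accuracy(t, p) for t, p in monthly_batches]
flagged = decayed_windows(0.91, window_accuracies)
print(window_accuracies, flagged)
```

In practice the tolerance and the window size depend on how noisy the metric is and how quickly labels become available.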

#### Deepchecks
+ Useful for detecting data drift, checking data integrity, evaluating model performance, etc.
+ Install with `pip install deepchecks`

In [1]:
# Load Packages
import pandas as pd 
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# Imports for building the model pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [3]:
# load data
df = pd.read_csv("data/bank-additional-full_encoded.csv")

In [4]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,0,0,0,0,0,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
1,57,1,0,1,1,0,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
2,37,1,0,1,0,1,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
3,40,2,0,2,0,0,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
4,56,1,0,1,0,0,1,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0


In [5]:
# Features & Labels
Xfeatures = df.drop('y',axis=1)
# Select last column of dataframe as a dataframe object
ylabels = df.iloc[: , -1:]

In [6]:
Xfeatures.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed'],
      dtype='object')

In [7]:
# Split Dataset
x_train,x_test,y_train,y_test = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=7)

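### Requirements
+ Datasets
    - train and test data
+ Model
    - a fitted model (needed for the model-related checks)

#### Components
+ Suites: collections of checks that are run together
+ Checks: individual tests on the data or the model
+ Dataset: a wrapper around a DataFrame that records the label and the categorical features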

In [8]:
# Build the Model
pipe_lr = Pipeline(steps=[('sc',StandardScaler()),('lr',LogisticRegression())])

In [9]:
pipe_lr

In [10]:
# Train to Fit
# Flatten the single-column label DataFrame to 1-D, which is the shape scikit-learn expects
pipe_lr.fit(x_train,y_train.values.ravel())


In [11]:
# Accuracy
pipe_lr.score(x_test,y_test)

0.9105770008901837

In [14]:
!pip install deepchecks


### Using Deepchecks for Offline ML Data Drift Detection

In [15]:
import deepchecks

In [16]:
# List the attributes exposed by the package
dir(deepchecks)

['BaseCheck',
 'BaseSuite',
 'CheckFailure',
 'CheckResult',
 'Condition',
 'ConditionCategory',
 'ConditionResult',
 'Context',
 'Dataset',
 'ModelComparisonCheck',
 'ModelComparisonSuite',
 'ModelOnlyBaseCheck',
 'ModelOnlyCheck',
 'SingleDatasetBaseCheck',
 'SingleDatasetCheck',
 'Suite',
 'SuiteResult',
 'TrainTestBaseCheck',
 'TrainTestCheck',
 '_SubstituteModule',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__original_module__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_init_module_attrs',
 'analytics',
 'core',
 'get_verbosity',
 'is_notebook',
 'matplotlib',
 'os',
 'pio',
 'pio_backends',
 'set_verbosity',
 'sys',
 'tabular',
 'types',
 'utils',
 'validate_latest_version',
 'version',

### Full Suite
+ Data Drift Detection
+ Model Performance / Confidence
+ Data Integrity Check
+ Label Ambiguity
+ Other checks

In [17]:
# Import from `deepchecks.tabular.suites`; importing from `deepchecks.suites` is deprecated
from deepchecks.tabular.suites import full_suite


In [18]:
# Create the Dataset Objects
# Import Dataset from `deepchecks.tabular`; using `deepchecks.Dataset` directly is deprecated
from deepchecks.tabular import Dataset
ds_train = Dataset(df=x_train,label=y_train,cat_features=[])
ds_test = Dataset(df=x_test,label=y_test,cat_features=[])


In [19]:
# Create the suite
fsuite = full_suite()

In [20]:
results = fsuite.run(train_dataset=ds_train,test_dataset=ds_test,model=pipe_lr)

deepchecks - INFO - Calculating permutation feature importance. Expected to finish in 12 seconds


In [21]:
results

#### Feature/Data Drift

In [22]:
# TrainTestFeatureDrift is deprecated (removed in 0.14); FeatureDrift is its replacement
from deepchecks.tabular.checks import FeatureDrift

In [23]:
check = FeatureDrift()



In [24]:
result = check.run(train_dataset=ds_train, test_dataset=ds_test, model=pipe_lr)

deepchecks - INFO - Calculating permutation feature importance. Expected to finish in 32 seconds


In [25]:
result
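
To give a feel for what a per-feature drift score measures, the sketch below computes a Population Stability Index (PSI) for each column and ranks the features by it. This is a plain-Python illustration under assumptions, not the method Deepchecks uses internally: the column values and bin edges are hypothetical, and values outside the bin edges are simply not counted.

```python
import math

def psi(reference, current, bin_edges, eps=1e-6):
    """Population Stability Index of `current` against `reference` over shared bins."""
    def bin_fractions(values):
        counts = [0] * (len(bin_edges) - 1)
        for value in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                in_bin = bin_edges[i] <= value < bin_edges[i + 1]
                # The last bin also includes its right edge
                if in_bin or (is_last and value == bin_edges[-1]):
                    counts[i] += 1
                    break
        # Floor empty bins at eps so the logarithm stays defined
        return [max(count / len(values), eps) for count in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))

# Hypothetical train/test values for two of the dataset's columns
train_cols = {"age": [25, 31, 38, 42, 47, 53, 58, 61], "campaign": [1, 1, 2, 3, 1, 2, 5, 1]}
test_cols = {"age": [40, 46, 53, 57, 62, 68, 73, 76], "campaign": [1, 1, 2, 3, 1, 2, 5, 1]}
edges = {"age": [0, 30, 45, 60, 100], "campaign": [0, 2, 4, 10]}

scores = {name: psi(train_cols[name], test_cols[name], edges[name]) for name in train_cols}
ranked = sorted(scores, key=scores.get, reverse=True)  # most drifted feature first
print(scores, ranked)
```

A feature whose train and test values fall into the bins in the same proportions scores 0; larger scores indicate a larger shift between the two samples.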

In [26]:
# Label Drift
# TrainTestLabelDrift is deprecated (removed in 0.14); LabelDrift is its replacement
from deepchecks.tabular.checks import LabelDrift
lcheck = LabelDrift()
lresult = lcheck.run(train_dataset=ds_train, test_dataset=ds_test)



In [27]:
lresult
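
Label drift can also be illustrated without any library by comparing the class proportions of the two label sets. The sketch below uses the total variation distance as the score; the label lists are hypothetical and the choice of distance is an assumption made for illustration, not the measure Deepchecks reports.

```python
from collections import Counter

def class_proportions(labels):
    """Map each class to its share of the labels."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}

def label_drift_score(train_labels, test_labels):
    """Total variation distance between label distributions (0 = identical, 1 = disjoint)."""
    p = class_proportions(train_labels)
    q = class_proportions(test_labels)
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)

# Hypothetical labels: 10% positives in train, 30% positives in test
train_y = [0] * 9 + [1] * 1
test_y = [0] * 7 + [1] * 3

score = label_drift_score(train_y, test_y)
print(score)
```

With a random split such as the one used in this notebook, the score should stay close to 0; a large value usually points to a sampling problem or a real change in the target over time.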

### Dataset Integrity Checks using Deepchecks
+ pip install deepchecks

#### Components
+ checks
+ suites
+ Dataset

In [28]:
import pandas as pd
import deepchecks

In [29]:
# Load Dataset
df = pd.read_csv("data/bank-additional-full_encoded.csv")

In [30]:
dir(deepchecks)

['BaseCheck',
 'BaseSuite',
 'CheckFailure',
 'CheckResult',
 'Condition',
 'ConditionCategory',
 'ConditionResult',
 'Context',
 'Dataset',
 'ModelComparisonCheck',
 'ModelComparisonSuite',
 'ModelOnlyBaseCheck',
 'ModelOnlyCheck',
 'SingleDatasetBaseCheck',
 'SingleDatasetCheck',
 'Suite',
 'SuiteResult',
 'TrainTestBaseCheck',
 'TrainTestCheck',
 '_SubstituteModule',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__original_module__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_init_module_attrs',
 'analytics',
 'checks',
 'core',
 'get_verbosity',
 'is_notebook',
 'matplotlib',
 'os',
 'pio',
 'pio_backends',
 'ppscore',
 'set_verbosity',
 'suites',
 'sys',
 'tabular',
 'types',
 'utils',
 'validate_latest_version',
 'version',

In [35]:
# `single_dataset_integrity` is a deepchecks suite, not a separate pip package,
# so there is nothing extra to install. In the deepchecks version installed above (0.13)
# the suite is named `data_integrity` and lives in `deepchecks.tabular.suites`.
from deepchecks.tabular.suites import data_integrity

In [34]:
# Run the integrity suite on the full dataset
integrity = data_integrity()
integrity.run(df)
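
As a library-independent illustration of the kind of problems an integrity check looks for, the sketch below counts missing values, duplicate rows, and constant columns in a small table. The rows are hypothetical and the function name `integrity_report` is made up for this example; it is not part of Deepchecks.

```python
def integrity_report(rows):
    """Summarize basic integrity problems in a list of row dicts."""
    columns = list(rows[0].keys())

    # Missing values per column
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

    # Rows that repeat an earlier row exactly
    seen, duplicate_rows = set(), 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)

    # Columns that hold a single value carry no information for a model
    constant_columns = [c for c in columns if len({r.get(c) for r in rows}) == 1]

    return {
        "missing": missing,
        "duplicate_rows": duplicate_rows,
        "constant_columns": constant_columns,
    }

# Hypothetical rows using three of the dataset's column names
rows = [
    {"age": 56, "pdays": 999, "y": 0},
    {"age": 57, "pdays": 999, "y": 0},
    {"age": None, "pdays": 999, "y": 1},
    {"age": 56, "pdays": 999, "y": 0},  # exact duplicate of the first row
]

report = integrity_report(rows)
print(report)
```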