<a href="https://colab.research.google.com/github/AyorindeTayo/Data-and-Model-drift-monitering-using-whylabs-/blob/main/Intro_ML_Monitoring_Data_Drift%2C_Bias%2C_Explainability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML Monitoring Fundamentals

## Setup

Required to run this notebook
- Google Account `file > save a copy in drive`
- [Free WhyLabs Account](https://whylabs.ai/free)

Other useful links:
- whylogs [GitHub](https://github.com/whylabs/whylogs/)
- [Slack channel](https://bit.ly/r2ai-slack) (Ask questions here)





## Quick Note on Google Colab

Colab is essentially Google's way of hosting a jupyter notebook. A very popular tool to use as a data scientist!

It allows us to write code, documentation, and output visuals all in one place.

To be able to and edit the code in this workshop. Please make a copy for yourself

`file > save a copy in drive`

## Code Cells
Below is a code cell. There is nothing in it right now.

To run a code cell click on it and then click the play button. Or press `shift+enter`

You can add new code cells by clicking the ` + Code ` button above


In [1]:
print("Hello!")

Hello!


## Terminal Commands

Colab actually gives you access to a whole ubuntu instance!

You can run terminal commands by putting ! before the command

In [2]:
!ls

sample_data


In [3]:
# Install whylogs (We'll use this in a bit!)
!pip install 'whylogs[viz]'

Collecting whylogs[viz]
  Downloading whylogs-1.3.26-py3-none-any.whl (1.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting platformdirs<4.0.0,>=3.5.0 (from whylogs[viz])
  Downloading platformdirs-3.11.0-py3-none-any.whl (17 kB)
Collecting types-requests<3.0.0.0,>=2.30.0.0 (from whylogs[viz])
  Downloading types_requests-2.31.0.20240311-py3-none-any.whl (14 kB)
Collecting whylabs-client<0.6.0,>=0.5.10 (from whylogs[viz])
  Downloading whylabs_client-0.5.10-py3-none-any.whl (440 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m440.1/440.1 kB[0m [31m17.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting whylogs-sketching>=3.4.1.dev3 (from whylogs[viz])
  Downloading whylogs_sketching-3.4.1.dev3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (547 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.3/547.3 kB[0m [31m13.7 MB/s[0m eta [36m0:00:00[


# 1. Data Drift, Model Drift, Performance


In [4]:
# Imports
import whylogs as why
import numpy as np
import pandas as pd
import datetime
import os

from sklearn.model_selection import train_test_split

# I know we've prob seen iris dataset,
# I promise this is going to be more interesting!
from sklearn.datasets import load_iris

# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)

## Train a Machine Learning Model (quickly)

In [5]:
# Load iris data as dataframe(df)
data_iris = load_iris(as_frame=True)

# List names in dataset
print(list(data_iris.target_names))
print(list(data_iris.data))

['setosa', 'versicolor', 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [6]:
# Train baseline Model
# KNN Model
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# Create feature and target data varaible
X, y = data_iris.data, data_iris.target

#create train & test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42,
                                                    stratify=y)
# Train model
knn.fit(X_train, y_train)

# Predict the labels on test data sset
y_pred = knn.predict(X_test)

# Print model accuracy
knn.score(X_test, y_test)


0.9777777777777777

### KNN intuition

Just a little bit of intuition how kNN models work.

This will be helpful for troubleshooting some issues later!

Iris data plotted by:

`x = 'sepal length (cm)', y = 'petal width (cm)'`

![](https://github.com/sagecodes/intro-machine-learning/raw/master/irisknn.png)

## Import batches of data

In [8]:
 # Import data batches
url = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_1_no_drift_.csv'
data_batch_1 = pd.read_csv(url)

url2 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_2_no_drift_.csv'
data_batch_2 = pd.read_csv(url2)

url3 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_3_no_drift_.csv'
data_batch_3 = pd.read_csv(url3)

url4 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_12_drift_0s.csv'
data_batch_4 = pd.read_csv(url4)

url5 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_14_drifty.csv'
data_batch_5 = pd.read_csv(url5)

url6 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_13_drift_petal5.csv'
data_batch_6 = pd.read_csv(url6)

url7 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_6_no_drift_.csv'
data_batch_7 = pd.read_csv(url7)

# iris feature names
feature_names = ['sepal length (cm)', 'sepal width (cm)','petal length (cm)','petal width (cm)']

# separate targets
X_batch_1 = data_batch_1[feature_names]
X_batch_2 = data_batch_2[feature_names]
X_batch_3 = data_batch_3[feature_names]
X_batch_4 = data_batch_4[feature_names]
X_batch_5 = data_batch_5[feature_names]
X_batch_6 = data_batch_6[feature_names]
X_batch_7 = data_batch_7[feature_names]

# We'll save the target values for later!
y_batch_1 = data_batch_1['target']
y_batch_2 = data_batch_2['target']
y_batch_3 = data_batch_3['target']
y_batch_4 = data_batch_4['target']
y_batch_5 = data_batch_5['target']
y_batch_6 = data_batch_6['target']
y_batch_7 = data_batch_7['target']


# create list of our batches
dfs = [X_batch_1, X_batch_4, X_batch_5, X_batch_6, X_batch_2, X_batch_3, X_batch_7]

df_target = [y_batch_1, y_batch_4, y_batch_5, y_batch_6, y_batch_2, y_batch_3, y_batch_7]


In [9]:
X_batch_1

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.5,3.2,1.5,-0.0
1,6.2,2.5,3.9,1.2
2,5.9,3.1,4.4,1.3
3,6.1,3.0,6.4,1.9
4,6.6,3.3,6.4,1.8
...,...,...,...,...
145,5.6,3.0,3.6,1.1
146,5.2,3.5,1.4,0.1
147,6.8,2.9,4.7,1.3
148,4.6,3.5,1.4,0.2


In [10]:
dfs[0].head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.5,3.2,1.5,-0.0
1,6.2,2.5,3.9,1.2
2,5.9,3.1,4.4,1.3
3,6.1,3.0,6.4,1.9
4,6.6,3.3,6.4,1.8


## Create a log with whylogs

whylogs is an open source library for logging any kind of data. With whylogs, users are able to generate summaries of their datasets (called whylogs profiles) which they can use to:

- Track changes in their dataset
- Create data constraints to know whether their data looks the way it should
- Quickly visualize key summary statistics about their datasets


![](https://user-images.githubusercontent.com/7946482/171062942-01c420f2-7768-4b7c-88b5-e3f291e1b7d8.png)

profiles generated with whylogs are:
- Efficient
- Customizable
- Mergeable


In [11]:
# create profile
profile1 = why.log(X_batch_1)

profile_view1 = profile1.view()
profile_view1.to_pandas()

⚠️ No session found. Call whylogs.init() to initialize a session and authenticate. See https://docs.whylabs.ai/docs/whylabs-whylogs-init for more information.


Unnamed: 0_level_0,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor
column,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1
petal length (cm),42.000004,42.0,42.002101,0,150,0,0,6.8,3.775333,4.2,1.0,150,1.1,1.3,1.4,1.5,5.3,5.9,6.3,6.6,1.800669,SummaryType.COLUMN,0,150,0,0,0,0
petal width (cm),24.000001,24.0,24.0012,0,150,0,0,2.5,1.168667,1.3,-0.0,150,0.0,0.1,0.2,0.3,1.8,2.2,2.3,2.4,0.758499,SummaryType.COLUMN,0,150,0,0,0,0
sepal length (cm),35.000003,35.0,35.00175,0,150,0,0,8.1,5.856,5.8,4.0,150,4.3,4.5,4.8,5.1,6.6,7.0,7.2,7.7,0.874277,SummaryType.COLUMN,0,150,0,0,0,0
sepal width (cm),20.000001,20.0,20.001,0,150,0,0,4.4,3.1,3.0,2.1,150,2.3,2.5,2.6,2.8,3.3,3.6,3.8,4.4,0.417567,SummaryType.COLUMN,0,150,0,0,0,0


Learn more about whylogs:
-  GitHub: https://github.com/whylabs/whylogs
- Examples:
https://github.com/whylabs/whylogs/tree/mainline/python/examples




## Writing profiles to WhyLabs

We're going start with an example of using profiles with the WhyLabs Observatory.

We'll explore using whylogs for data validation & drift visualization after this!


## Get WhyLabs access tokens [expand]





Before integrate our data into WhyLabs we need three things:
- WhyLabs API Key
- WhyLabs Org-ID
- Project-ID


The easiest way to get the API token & ord-id:

`Menu -> Settings -> Access Tokens`

![](https://github.com/sagecodes/workshop-images/blob/master/access_token_org.png?raw=true)

Create a new project to get the project-id

`Create Project -> Set up model -> `

![](https://github.com/sagecodes/workshop-images/blob/master/project-create.png?raw=true)


## Sending profiles

In [12]:
# set authentication & project keys
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'org-5x4W2m'
os.environ["WHYLABS_API_KEY"] = 't0iZuf3myw.PiGsaWwFHj2MsI4lytbm9CZTShX51BhYUHMcfPg8hSVuG3WsSR5yt:org-5x4W2m'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'model-2'

In [13]:
from whylogs.api.writer.whylabs import WhyLabsWriter

In [14]:
# Single Profile
writer = WhyLabsWriter()
profile= why.log(X_batch_1)
writer.write(file=profile.view())

⚠️ Initializing default session because no session was found.
Initializing session with config /root/.config/whylogs/config.ini

✅ Using session type: LOCAL. Profiles won't be uploaded or written anywhere automatically.


(True, 'log-Ag7sLQ28fZCPRhd2')

Write multiple profiles with different dates to backfill

In [15]:
# initialize writer
writer = WhyLabsWriter()

# back fill 1 day per batch
for i, df in enumerate(dfs):

    # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
    dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)

    # create profile for each batch of data
    profile = why.log(df).profile()

    # set the dataset timestamp for the profile
    profile.set_dataset_timestamp(dt)
    # write the profile to the WhyLabs platform
    writer.write(file=profile.view())

**Note**: Colab might throw SSL cert error if runtime has been disconnected. Restarting runtime should fix.

Reference Profile

In [16]:
ref_profile = why.log(data_iris.data).profile()
writer = WhyLabsWriter().option(reference_profile_name="iris_training_profile")
writer.write(file=ref_profile.view())

(True, 'ref-UrztECjYhdzHeX9n')

## Platform of results [Expand]



Refresh your page and you'll see profiles have been uploaded

![](https://github.com/sagecodes/workshop-images/blob/master/profiles-uploaded.png?raw=true)

Click into project and we'll see our data profiles

![](https://github.com/sagecodes/workshop-images/blob/master/inputs.png?raw=true)

In the profile tab we can see our disribution visualization

![](https://github.com/sagecodes/workshop-images/blob/master/profiles.png?raw=true)

Monitor manager tab

![](https://github.com/sagecodes/workshop-images/blob/master/monitor_manager.png?raw=true)

Preview monitor in the inputs feature:

![](https://github.com/sagecodes/workshop-images/blob/master/monitor_preview.png?raw=true)

Configure alerts

`settings -> notfications & alerts`

![](https://github.com/sagecodes/workshop-images/blob/master/alert_integration.png?raw=true)

Read more about [customizable monitoring for any use case](https://whylabs.ai/blog/posts/model-data-monitoring-simple-customizable-actionable)

## Logging output

In [17]:
# Get predictions with model & append to df
pred_dfs = dfs

class_names = ['setosa', 'versicolor', 'virginica']

for i, df in enumerate(pred_dfs):
    y_pred = knn.predict(df)
    y_prob = knn.predict_proba(df)
    pred_scores = []
    pred_classes = []

    for pred in y_pred:
      pred_classes.append(class_names[pred])
    df['cls_output'] = pred_classes
    for prob in y_prob:
      pred_scores.append(max(prob))
    df['prob_output'] = pred_scores
    # print(pred_scores)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['cls_output'] = pred_classes
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['prob_output'] = pred_scores


In [18]:
pred_dfs[-1]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),cls_output,prob_output
0,5.2,2.9,1.2,0.3,setosa,1.0
1,5.5,2.6,4.5,1.2,versicolor,1.0
2,6.2,3.5,3.4,1.5,versicolor,1.0
3,5.0,3.4,1.5,0.1,setosa,1.0
4,5.3,3.2,1.3,0.3,setosa,1.0
...,...,...,...,...,...,...
145,4.7,3.2,5.1,1.1,versicolor,0.8
146,6.7,3.3,4.9,2.3,virginica,1.0
147,4.9,2.7,1.4,0.2,setosa,1.0
148,5.6,3.0,5.0,1.5,virginica,0.6


In [19]:
writer = WhyLabsWriter()

for i, df in enumerate(pred_dfs):

    out_df = df[['cls_output', 'prob_output']].copy()
   # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
    dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
    profile = why.log(out_df).profile()

    # set the dataset timestamp for the profile
    profile.set_dataset_timestamp(dt)
    #write the profile to the WhyLabs platform
    writer.write(file=profile.view())

**Note** You can do both input and output logging in the same loop. I broke apart for this workshop.

Without backfilling the process is only a few lines of code
```
writer = WhyLabsWriter()
profile= why.log(data_batch_1)
writer.write(file=profile.view())
```

## Log performance

Instead of just logging outputs, if we have ground truth data we can also monitor performance metrics overtime.


Classification:

Regression:


In [20]:
pred_dfs[-1]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),cls_output,prob_output
0,5.2,2.9,1.2,0.3,setosa,1.0
1,5.5,2.6,4.5,1.2,versicolor,1.0
2,6.2,3.5,3.4,1.5,versicolor,1.0
3,5.0,3.4,1.5,0.1,setosa,1.0
4,5.3,3.2,1.3,0.3,setosa,1.0
...,...,...,...,...,...,...
145,4.7,3.2,5.1,1.1,versicolor,0.8
146,6.7,3.3,4.9,2.3,virginica,1.0
147,4.9,2.7,1.4,0.2,setosa,1.0
148,5.6,3.0,5.0,1.5,virginica,0.6


In [21]:
# Append ground truth data to dataframe
for i, df in enumerate(pred_dfs):
    df['ground_truth'] = df_target[i]

In [22]:
pred_dfs[0]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),cls_output,prob_output,ground_truth
0,5.5,3.2,1.5,-0.0,setosa,1.0,setosa
1,6.2,2.5,3.9,1.2,versicolor,1.0,versicolor
2,5.9,3.1,4.4,1.3,versicolor,1.0,versicolor
3,6.1,3.0,6.4,1.9,virginica,1.0,virginica
4,6.6,3.3,6.4,1.8,virginica,1.0,virginica
...,...,...,...,...,...,...,...
145,5.6,3.0,3.6,1.1,versicolor,1.0,versicolor
146,5.2,3.5,1.4,0.1,setosa,1.0,setosa
147,6.8,2.9,4.7,1.3,versicolor,1.0,versicolor
148,4.6,3.5,1.4,0.2,setosa,1.0,setosa


In [None]:
# Log performance

for i, df in enumerate(pred_dfs):

  results = why.log_classification_metrics(
          df,
          target_column = "ground_truth",
          prediction_column = "cls_output",
          score_column="prob_output"
      )
   # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
  dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)

  profile = results.profile()
  profile.set_dataset_timestamp(dt)

  results.writer("whylabs").write()

## Performance results [Expand]



![](https://github.com/sagecodes/workshop-images/blob/master/model_performance1.png?raw=true)

![](https://github.com/sagecodes/workshop-images/blob/master/model_performance2.png?raw=true)

## Learn more about WhyLabs

Learn more about the WhyLabs observatory [here](http://whylabs.ai/).

Learn more about other whylogs writers [here](https://github.com/whylabs/whylogs/tree/mainline/python/examples/integrations/writers).

# 2. Monitoring for Bias & Fairness with Tracing & Explainability



In [None]:
# Imports
import whylogs as why
import numpy as np
import pandas as pd
import datetime
import os

from sklearn.model_selection import train_test_split

# I know we've prob seen iris dataset,
# I promise this is going to be more interesting!
from sklearn.datasets import load_iris

# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)

In [None]:
# Load iris data as dataframe(df)
data_iris = load_iris(as_frame=True)

# List names in dataset
print(list(data_iris.target_names))
print(list(data_iris.data))

['setosa', 'versicolor', 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


## Train a Machine Learning Model (quickly)

In [None]:
# Train baseline Model
# KNN Model
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# Create featurex and target data varaible
X, y = data_iris.data, data_iris.target

#create train & test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42,
                                                    stratify=y)
# Train model
knn.fit(X_train, y_train)

# Predict the labels on test data sset
y_pred = knn.predict(X_test)

# Print model accuracy
knn.score(X_test, y_test)


0.9777777777777777

### KNN intuition

Just a little bit of intuition how kNN models work.

This will be helpful for troubleshooting some issues later!

Iris data plotted by:

`x = 'sepal length (cm)', y = 'petal width (cm)'`

![](https://github.com/sagecodes/intro-machine-learning/raw/master/irisknn.png)

## Import data batches

In [None]:
 # Import data batches
url = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_15_statefl_1.csv'
data_batch_1 = pd.read_csv(url)

url2 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_16_statefl_1.csv'
data_batch_2 = pd.read_csv(url2)

url3 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_17_statefl_1.csv'
data_batch_3 = pd.read_csv(url3)

url4 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_18_statefl_1.csv'
data_batch_4 = pd.read_csv(url4)

url5 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_19_statefl_1.csv'
data_batch_5 = pd.read_csv(url5)

url6 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_20_statefl_1.csv'
data_batch_6 = pd.read_csv(url6)

url7 = 'https://raw.githubusercontent.com/sagecodes/sythetic_iris_data/main/iris_21_statefl_1.csv'
data_batch_7 = pd.read_csv(url7)

# iris feature names
feature_names = ['sepal length (cm)', 'sepal width (cm)','petal length (cm)','petal width (cm)', 'state']

# separate targets
X_batch_1 = data_batch_1[feature_names]
X_batch_2 = data_batch_2[feature_names]
X_batch_3 = data_batch_3[feature_names]
X_batch_4 = data_batch_4[feature_names]
X_batch_5 = data_batch_5[feature_names]
X_batch_6 = data_batch_6[feature_names]
X_batch_7 = data_batch_7[feature_names]

# We'll save the target values for later!
y_batch_1 = data_batch_1['target']
y_batch_2 = data_batch_2['target']
y_batch_3 = data_batch_3['target']
y_batch_4 = data_batch_4['target']
y_batch_5 = data_batch_5['target']
y_batch_6 = data_batch_6['target']
y_batch_7 = data_batch_7['target']


# create list of our batches
dfs = [X_batch_1, X_batch_4, X_batch_5, X_batch_6, X_batch_2, X_batch_3, X_batch_7]

df_target = [y_batch_1, y_batch_4, y_batch_5, y_batch_6, y_batch_2, y_batch_3, y_batch_7]


In [None]:
dfs[0].head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),state
0,4.9,4.0,1.6,0.3,Washington
1,4.9,3.5,1.3,0.1,Washington
2,5.9,3.0,5.1,1.3,Washington
3,5.2,3.3,1.6,0.3,Washington
4,4.6,3.2,1.2,0.3,Washington


Recap:

## Creating profiles with whylogs


Profiles generated with whylogs are:

- Secure
- Efficient
- Customizable
- Mergeable

In [None]:
# create profile
profile1 = why.log(X_batch_1)

profile_view1 = profile1.view()
profile_view1.to_pandas()

Unnamed: 0_level_0,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor,frequent_items/frequent_strings
column,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1
petal length (cm),49.000006,49.0,49.002452,0,150,0,0,7.5,4.041333,4.3,1.1,150,1.1,1.3,1.5,2.4,5.4,6.2,6.8,7.5,1.776664,SummaryType.COLUMN,0,150,0,0,0,0,
petal width (cm),24.000001,24.0,24.0012,0,150,0,0,2.7,1.207333,1.3,0.0,150,0.1,0.1,0.2,0.3,1.9,2.2,2.4,2.5,0.769651,SummaryType.COLUMN,0,150,0,0,0,0,
sepal length (cm),33.000003,33.0,33.00165,0,150,0,0,7.7,5.856,5.8,4.2,150,4.2,4.7,4.9,5.1,6.5,7.1,7.2,7.7,0.847775,SummaryType.COLUMN,0,150,0,0,0,0,
sepal width (cm),19.000001,19.0,19.00095,0,150,0,0,4.0,2.996,2.9,2.2,150,2.2,2.3,2.4,2.7,3.3,3.6,3.8,4.0,0.448395,SummaryType.COLUMN,0,150,0,0,0,0,
state,3.0,3.0,3.00015,0,150,0,0,,0.0,,,0,,,,,,,,,0.0,SummaryType.COLUMN,0,0,0,0,150,0,"[FrequentItem(value='Washington', est=50, uppe..."


Learn more about creating data profiles with whylogs
- [whylogs basics](https://github.com/whylabs/whylogs/tree/mainline/python/examples/basic)
- [whylogs examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples)


## Writing data profiles to to WhyLabs

![](https://camo.githubusercontent.com/8e9cc18b64b157d4569fa6ed2bd5152200ee7bb1a11e54f858f923a4be635f90/68747470733a2f2f7768796c6162732e61692f5f6e6578742f696d6167653f75726c3d6874747073253341253246253246636f6e74656e742e7768796c6162732e6169253246636f6e74656e74253246696d616765732532463230323225324631312532464672616d652d363839392d2d312d2e706e6726773d3331323026713d3735)


In [None]:
# set authentication & project keys
# os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'ORGID'
# os.environ["WHYLABS_API_KEY"] = 'APIKEY'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'MODELID'

### write a single profile
```
profile = why.log(df)
profile.writer("whylabs").write()
```

### Create dataframe with model predictions

In [None]:
# Get predictions with model & append to df
pred_dfs = dfs

class_names = ['setosa', 'versicolor', 'virginica']

for i, df in enumerate(pred_dfs):
    y_pred = knn.predict(df.iloc[:, :4])
    y_prob = knn.predict_proba(df.iloc[:, :4])
    pred_scores = []
    pred_classes = []

    for pred in y_pred:
      pred_classes.append(class_names[pred])
    df['cls_output'] = pred_classes
    for prob in y_prob:
      pred_scores.append(max(prob))
    df['prob_output'] = pred_scores

In [None]:
pred_dfs[-1]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),state,cls_output,prob_output
0,5.5,2.9,4.4,1.2,Washington,versicolor,1.0
1,4.7,3.1,1.3,0.3,Washington,setosa,1.0
2,4.7,3.0,1.4,0.2,Washington,setosa,1.0
3,5.2,3.7,1.4,0.3,Washington,setosa,1.0
4,6.8,3.2,6.2,1.6,Washington,virginica,1.0
...,...,...,...,...,...,...,...
145,5.9,3.0,5.8,2.6,Missouri,virginica,1.0
146,4.8,2.3,3.7,1.4,Missouri,versicolor,1.0
147,5.3,2.7,4.3,1.3,Missouri,versicolor,1.0
148,5.8,3.2,5.7,1.9,Missouri,virginica,1.0


### Backfilling data in WhyLabs

In [None]:
from whylogs.core.schema import DatasetSchema
from whylogs.core.segmentation_partition import segment_on_column

In [None]:
# back fill 1 day per batch
for i, df in enumerate(pred_dfs):
    # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
    dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)

    # create profile for each batch of data
    profile = why.log(df, schema=DatasetSchema(segments=segment_on_column("state")))

    # set the dataset timestamp for the profile
    profile.set_dataset_timestamp(dt)
    # write the profile to the WhyLabs platform
    profile.writer("whylabs").write()

Learn more about segmentation in whylogs
- [Intro to Segmentation with whylogs](https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Segments.ipynb)

In [None]:
# Create reference profile
ref_profile = why.log(data_iris.data).profile()
writer = WhyLabsWriter().option(reference_profile_name="iris_training_profile")
writer.write(file=ref_profile.view())

###Classification Performance Metrics

In [None]:
# Append ground truth data to dataframe
for i, df in enumerate(pred_dfs):
    df['ground_truth'] = df_target[i]

In [None]:
pred_dfs[0]

In [None]:
from whylogs import log_classification_metrics
# from whylogs.core.schema import DatasetSchema
# from whylogs.core.segmentation_partition import segment_on_column

In [None]:
for i, df in enumerate(pred_dfs):

  segmented_classification_results = log_classification_metrics(
    df,
    target_column = "ground_truth",
    prediction_column = "cls_output",
    schema = DatasetSchema(segments=segment_on_column("state"))
  )
   # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
  dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)

  # profile = segmented_classification_results.profile()
  segmented_classification_results.set_dataset_timestamp(dt)

  segmented_classification_results.writer("whylabs").write()

## Feature importance

Learn more about SHAP
https://github.com/slundberg/shap

In [None]:
!pip install shap

In [None]:
import shap

In [None]:
explainer = shap.Explainer(knn.predict, X_train)

In [None]:
shap_values = explainer(X_test)

In [None]:
shap.summary_plot(shap_values, X_test, plot_type="bar")

In [None]:
# Get global featue importance
shap_feature_importance = np.mean(np.abs(shap_values.values), axis=0)

In [None]:
# Create dict with feature importance
shap_feature_importance_dict = dict(zip(X_train.columns.tolist(), shap_feature_importance.tolist()))
feature_importance_dict = {k: v for k, v in sorted(shap_feature_importance_dict.items(),
                                                   key=lambda item: item[1], reverse=True)}


In [None]:
print(feature_importance_dict)

In [None]:
# Write values to WhyLabs
from whylogs.core.feature_weights import FeatureWeights
from whylogs.api.writer.whylabs import WhyLabsWriter

feature_weights = FeatureWeights(shap_feature_importance_dict)
result = feature_weights.writer("whylabs").write()

result

# 3. Open-source data & ML monitoring with whylogs

## Using data drift reports with whylogs in a Python environment

![](https://whylabs.ai/_next/image?url=https%3A%2F%2Fcontent.whylabs.ai%2Fcontent%2Fimages%2F2022%2F06%2FTDSImage3.jpeg&w=3120&q=75)








In [None]:
# creat profiles of batches

profile_view1 = why.log(X_batch_1).view()
profile_view1 = why.log(X_batch_1).view()
profile_view2 = why.log(X_batch_2).view()
profile_view3 = why.log(X_batch_3).view()
profile_view4 = why.log(data_batch_4).view()
profile_view5 = why.log(data_batch_5).view()
profile_view6 = why.log(data_batch_6).view()
profile_view7 = why.log(X_batch_7).view()
# profile_view8 = why.log(data_batch_8).view()

In [None]:
# Data Drift with whylogs
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile_view1, reference_profile_view=profile_view2)

In [None]:
visualization.summary_drift_report()

In [None]:
visualization.double_histogram(feature_name="petal width (cm)")


In [None]:
visualization.double_histogram(feature_name="petal length (cm)")


In [None]:

from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores

scores = calculate_drift_scores(target_view=profile_view1, reference_view=profile_view2, with_thresholds = True)

scores

In [None]:
# Compare Another profiles:

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile_view1, reference_profile_view=profile_view3)

In [None]:
visualization.summary_drift_report()

In [None]:
visualization.double_histogram(feature_name="petal length (cm)")


In [None]:
visualization.double_histogram(feature_name="petal width (cm)")


In [None]:

from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores

scores = calculate_drift_scores(target_view=profile_view1, reference_view=profile_view6, with_thresholds = True)

scores

Learn more about using data drift reports with whylogs
- [Drift Algorithm Configuration](https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Drift_Algorithm_Configuration.ipynb)



## Data validation with constraints in whylogs


![](https://miro.medium.com/max/1400/1*-LNvKMkSTJ3q22BH8DNsTg.gif)

Data quality validation ensures data is structured and falls in the range expected for our data pipelines or applications. When collecting or using data it’s important to verify the quality to avoid unwanted machine learning behavior in production, such as errors or faulty prediction results.

For example, we may want to ensure our data doesn’t contain any empty or negative values before moving it along in the pipeline if our model does not expect those values.

In [None]:
# Data Quality Validation whylogs

from whylogs.core.constraints import (Constraints,
                                     ConstraintsBuilder,
                                     MetricsSelector,
                                     MetricConstraint)

In [None]:
# Using Constraints for Data Quality Validation

def validate_features(profile_view, verbose=False):

  builder = ConstraintsBuilder(profile_view)

  # Define a constraint for validating data
  builder.add_constraint(MetricConstraint(
    name="petal length > 0 and < 15",
    condition=lambda x: x.min > 0 and x.max < 15,
    metric_selector=MetricsSelector(metric_name='distribution',
                                    column_name='petal length (cm)')
  ))

  builder.add_constraint(MetricConstraint(
    name="petal width > 0 and < 15",
    condition=lambda x: x.min > 0 and x.max < 15,
    metric_selector=MetricsSelector(metric_name='distribution',
                                    column_name='petal width (cm)')
  ))

  builder.add_constraint(MetricConstraint(
    name="sepal length > 0 and < 15",
    condition=lambda x: x.min > 0 and x.max < 15 ,
    metric_selector=MetricsSelector(metric_name='distribution',
                                    column_name='sepal length (cm)')
  ))

  builder.add_constraint(MetricConstraint(
    name="sepal width > 0 and < 15",
    condition=lambda x: x.min > 0 and x.max < 15,
    metric_selector=MetricsSelector(metric_name='distribution',
                                    column_name='sepal width (cm)')
  ))

  # Build the constraints and return the report
  constraints: Constraints = builder.build()

  if verbose:
    print(constraints.report())

  # return constraints.report()
  return constraints


In [None]:
const = validate_features(profile_view2, True)

In [None]:
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(const, cell_height=300)

In [None]:
# check all constraints for passing:
constraints_valid = const.validate()
print(constraints_valid)

In [None]:
const = validate_features(profile_view4, True)

In [None]:
visualization = NotebookProfileVisualizer()
visualization.constraints_report(const, cell_height=300)

In [None]:
# check all constraints for passing:
constraints_valid = const.validate()
print(constraints_valid)

In [None]:
profile_view4.to_pandas()

Leran more about performing data validation with whylogs
- [Data Validation with Metric Constraints](https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Metric_Constraints.ipynb)


# Wrap up
- [Request a Workshop Certificate](https://docs.google.com/forms/d/e/1FAIpQLScKdXX59i8P0783HKTRr7MaW65B6z55jiqpVDyOiaebHqQorQ/viewform?usp=sf_link)
- [Upcoming Events](https://whylabs.ai/events)
- [WhyLabs Blog](https://whylabs.ai/blog)
- [whylogs GitHub](https://github.com/whylabs/whylogs)
- [AI Slack group](http://join.slack.whylabs.ai/)


Try Our Expert Plan FREE for 30 Days! https://bit.ly/coupon-wlcommunity



Learn more at [https://whylabs.ai/](https://whylabs.ai/)
