In [1]:
from pdstools.explanations import Explanations

import datetime
import polars as pl

In [2]:
import logging
logging.basicConfig(level=logging.INFO)

## Aggregate data exported from Infinity

**Prerequisite**: You have already exported explanation files from Infinity.

parameters:
- `data_folder`: the folder containing the explanation files.
- `model_name` (optional): the model rule to load explanations for. If omitted, all files in the folder are picked up.
- `from_date` (optional): start of the date range. Defaults to today minus 7 days.
- `to_date` (optional): end of the date range. Defaults to today.

The aggregated data will be stored in the `.tmp/aggregated_data` directory.
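The date defaults described above can be pictured with a small standard-library sketch. `resolve_date_window` is a hypothetical helper written for illustration only; it is not part of pdstools and only mirrors the documented behaviour (`to_date` falls back to today, `from_date` to today minus 7 days).

```python
import datetime


def resolve_date_window(from_date=None, to_date=None, today=None):
    """Hypothetical helper: apply the documented defaults to a date window."""
    # `today` is injectable so the example is reproducible.
    today = today or datetime.date.today()
    # Documented default: to_date falls back to today.
    to_date = to_date or today
    # Documented default: from_date falls back to today minus 7 days.
    from_date = from_date or today - datetime.timedelta(days=7)
    return from_date, to_date


start, end = resolve_date_window(today=datetime.date(2025, 3, 28))
print(start, end)
```

Passing both dates explicitly, as the next cell does, bypasses the defaults entirely.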

In [3]:
explanations = Explanations(
    data_folder='../../data/explanations/',
    model_name='AdaptiveBoostCT',
    from_date=datetime.date(2024,3,28),
    to_date=datetime.date(2025,3,28)
)

## Simple plotting of contributions

These methods plot the contributions for the overall model or for a specific context.

The first plot shows the `top-n` predictors with their contributions. The remaining plots cover each predictor in the `top-n` list. Numeric predictor values are binned into at most 10 bins, while categorical predictors show the `top-k` categories with their contributions.

### Explanations for overall model

Call `explanations.plot.contributions()` without selecting any context from the interactive context picker. The resulting plots aggregate the data over all contexts.

parameters:
- `top_n`: Number of top predictors to plot.
- `top_k`: Number of top predictor values for symbolic predictors to plot.
- `remaining`: If `True`, the remaining predictors will be plotted as a single bar.
- `missing`: If `True`, the missing values will be plotted as a separate bar.
- `descending`: If `True`, the predictors are sorted in descending order of their contributions, i.e. the least contributing predictors are plotted first.
- `contribution_calculation`: Method used to calculate contributions. Options include `contribution`, `contribution_abs`, and `contribution_weighted`. The default is `contribution`, the average contribution to predictions.
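To make the difference between the calculation methods concrete, here is a minimal sketch in plain Python. The numbers are hypothetical, and the reading of `contribution_abs` as the average of absolute contributions is an assumption based on its name, not a confirmed definition from pdstools.

```python
from statistics import mean

# Hypothetical per-decision contributions of a single predictor (not real data).
contributions = [-0.03, -0.02, 0.01, -0.04]

# `contribution`: the signed average, so positive and negative effects cancel out.
contribution = mean(contributions)

# Assumed meaning of `contribution_abs`: the average magnitude, ignoring sign.
contribution_abs = mean(abs(c) for c in contributions)

print(f"contribution={contribution:.4f}, contribution_abs={contribution_abs:.4f}")
```

Under this reading, the absolute variant can never be smaller in magnitude than the signed average, which makes it the safer choice for ranking predictors whose effect changes sign across decisions.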

In [4]:
explanations.plot.contributions(top_n=3, top_k=5)

No context selected, plotting overall contributions. Use explanations.filter.interative() to select a context.


### Explanations for selected context

Call `explanations.filter.interactive()` to display the interactive context picker. This allows you to select a specific context from the list of available contexts.

The context picker helps filter the data when the list of contexts is very large. Fine-tune your selection with the comboboxes on the left side of the context picker; the matching contexts are then displayed on the right, where you can select specific context keys.

Run `explanations.plot.contributions()` after selecting a context from the interactive context picker. This will plot the contributions for the selected context.

__NOTE__: Plots cover a single context only, so one context must be selected from the list.
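Conceptually, narrowing the list with the comboboxes amounts to keeping the contexts whose keys match the chosen values. The sketch below shows that idea in plain Python. `filter_contexts` and the `Web` context are hypothetical and not part of pdstools; how the picker is actually implemented is not covered here.

```python
def filter_contexts(contexts, **selected):
    """Hypothetical helper: keep contexts matching every selected key/value."""
    return [
        ctx
        for ctx in contexts
        if all(ctx.get(key) == value for key, value in selected.items())
    ]


base = {"pyChannel": "PegaBatch", "pyDirection": "E2E Test",
        "pyGroup": "E2E Test", "pyIssue": "Batch"}
contexts = [
    {**base, "pyName": "P1"},
    {**base, "pyName": "P2"},
    {**base, "pyName": "P12"},
    # Hypothetical extra context, added only to show a non-matching entry.
    {**base, "pyChannel": "Web", "pyName": "P1"},
]

# Narrow by channel first, then pick one specific context by name.
batch_contexts = filter_contexts(contexts, pyChannel="PegaBatch")
selected = filter_contexts(batch_contexts, pyName="P2")
print(len(batch_contexts), selected)
```

Each additional key narrows the candidate list further until a single context remains, which is what the plotting call needs.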

In [5]:
explanations.filter.interactive()


In [14]:
explanations.plot.contributions(top_n=3, top_k=5)

You can also set the context manually by passing a dictionary with the context keys and values.

In [7]:
explanations.filter.set_selected_context(
    {
        "pyChannel": "PegaBatch",
        "pyDirection": "E2E Test",
        "pyGroup": "E2E Test",
        "pyIssue": "Batch",
        "pyName": "P2",
    }
)

In [8]:
explanations.plot.contributions(top_n=3, top_k=5)

## Advanced Data Exploration
For more advanced data exploration you can work directly with the underlying data classes, such as `explanations.aggregate`. These classes provide more flexibility in how the data is loaded and processed, allowing you to inspect the data before plotting.

In [9]:
aggregate = explanations.aggregate # load the aggregated data

### Inspect data for overall model

Get the `top_n` predictors and their contributions for the overall model.

In [10]:
df_overall = aggregate.get_predictor_contributions(top_n=3, remaining=False)
df_overall


| partition | predictor_name | predictor_type | contribution | contribution_abs | contribution_weighted | contribution_weighted_abs | frequency | contribution_min | contribution_max |
|---|---|---|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | f64 | i64 | f64 | f64 |
| "whole_model" | "pyName" | "SYMBOLIC" | -0.021191 | 0.021255 | -9.6e-05 | 9.7e-05 | 50000 | -0.044283 | 0.029203 |
| "whole_model" | "Age" | "NUMERIC" | -0.011306 | 0.011738 | -9.2e-05 | 9.5e-05 | 50000 | -0.034704 | 0.023485 |
| "whole_model" | "Occupation" | "SYMBOLIC" | -0.008904 | 0.010193 | -1.7e-05 | 1.9e-05 | 50000 | -0.032185 | 0.06411 |


We can take these `top_n` predictors and inspect their `top_k` most influential values.

In [11]:
top_n_predictors = df_overall.select(pl.col('predictor_name')).unique().to_series().to_list()
aggregate.get_predictor_value_contributions(
    predictors=top_n_predictors,
    top_k=2,
    remaining=False,
)

| partition | predictor_name | predictor_type | bin_order | bin_contents | contribution | contribution_abs | contribution_weighted | contribution_weighted_abs | frequency | contribution_min | contribution_max | sort_column | sort_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | i64 | str | f64 | f64 | f64 | f64 | i64 | f64 | f64 | str | f64 |
| "whole_model" | "Age" | "NUMERIC" | 0 | "MISSING" | -0.013647 | 0.015131 | -0.000503 | 0.000557 | 1842 | -0.034704 | 0.023485 | "bin_order" | 0.0 |
| "whole_model" | "Age" | "NUMERIC" | 5 | "[43.000:48.000]" | -0.013103 | 0.013121 | -0.001262 | 0.001264 | 4816 | -0.026145 | 0.006043 | "bin_order" | 5.0 |
| "whole_model" | "Occupation" | "SYMBOLIC" | 20 | "Communications engineer" | -0.01616 | 0.016168 | -0.000335 | 0.000335 | 1035 | -0.028147 | 0.001797 | "contribution" | 0.01616 |
| "whole_model" | "Occupation" | "SYMBOLIC" | 43 | "Psychotherapist, child" | -0.016257 | 0.016257 | -0.000268 | 0.000268 | 825 | -0.032185 | -0.003855 | "contribution" | 0.016257 |
| "whole_model" | "pyName" | "SYMBOLIC" | 15 | "P18" | -0.024462 | 0.024462 | -0.001208 | 0.001208 | 2469 | -0.03746 | -0.009315 | "contribution" | 0.024462 |
| "whole_model" | "pyName" | "SYMBOLIC" | 1 | "P1" | -0.025354 | 0.025354 | -0.001307 | 0.001307 | 2577 | -0.044283 | -0.007185 | "contribution" | 0.025354 |
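The numbers in this output suggest how `contribution_weighted` relates to `contribution` at the value level: each reported weighted value is close to the contribution scaled by the bin's share of the 50000 decisions. This relationship is inferred from the printed figures only and is not a documented pdstools formula. The sketch below checks it on four rows copied from the output, using a hypothetical `weighted` helper.

```python
import math

# Frequency reported for every predictor in the overall-model output.
TOTAL_FREQUENCY = 50_000

# (label, contribution, frequency, reported contribution_weighted),
# copied from the output above.
rows = [
    ("Age MISSING", -0.013647, 1842, -0.000503),
    ("Age [43.000:48.000]", -0.013103, 4816, -0.001262),
    ("pyName P18", -0.024462, 2469, -0.001208),
    ("pyName P1", -0.025354, 2577, -0.001307),
]


def weighted(contribution, frequency, total=TOTAL_FREQUENCY):
    """Assumed relationship: contribution scaled by the bin's frequency share."""
    return contribution * frequency / total


for label, contribution, frequency, reported in rows:
    estimate = weighted(contribution, frequency)
    # The printed values are rounded, so compare with a small tolerance.
    matches = math.isclose(estimate, reported, abs_tol=1e-6)
    print(f"{label}: estimated {estimate:.6f}, reported {reported:.6f}, match={matches}")
```

If the relationship holds, a value with a large contribution but few observations ends up with a small weighted contribution, which is why the weighted variant favours frequently seen values.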


### Inspect data by selected context

Let's repeat the same steps, this time inspecting a selected context instead of the entire model.

In [12]:
import random

# Pick one of the available contexts at random
context_info = random.choice(aggregate.get_unique_contexts_list())
print('Selected random context: \n')
for key, value in context_info.items():
    print(f'{key}: {value}')

# Top 3 predictors for the selected context only
df_by_context = aggregate.get_predictor_contributions(
    context=context_info,
    top_n=3,
    remaining=False,
)
df_by_context


Selected random context: 

pyChannel: PegaBatch
pyDirection: E2E Test
pyGroup: E2E Test
pyIssue: Batch
pyName: P12


| partition | predictor_name | predictor_type | contribution | contribution_abs | contribution_weighted | contribution_weighted_abs | frequency | contribution_min | contribution_max |
|---|---|---|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | f64 | i64 | f64 | f64 |
| "{"partition":{"pyChannel":"Peg…" | "pyName" | "SYMBOLIC" | -0.022127 | 0.022127 | -0.002012 | 0.002012 | 2520 | -0.032641 | -0.002168 |
| "{"partition":{"pyChannel":"Peg…" | "Age" | "NUMERIC" | -0.011253 | 0.011742 | -9.3e-05 | 9.6e-05 | 2520 | -0.024616 | 0.019108 |
| "{"partition":{"pyChannel":"Peg…" | "Occupation" | "SYMBOLIC" | -0.007745 | 0.009442 | -1.5e-05 | 1.7e-05 | 2520 | -0.026176 | 0.040177 |


In [13]:
top_n_predictors = df_by_context.select(pl.col('predictor_name')).unique().to_series().to_list()
aggregate.get_predictor_value_contributions(
    predictors=top_n_predictors,
    top_k=2,
    context=context_info,
    remaining=False,
)

| partition | predictor_name | predictor_type | bin_order | bin_contents | contribution | contribution_abs | contribution_weighted | contribution_weighted_abs | frequency | contribution_min | contribution_max | sort_column | sort_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | i64 | str | f64 | f64 | f64 | f64 | i64 | f64 | f64 | str | f64 |
| "{"partition":{"pyChannel":"Peg…" | "Age" | "NUMERIC" | 7 | "[53.000:58.000]" | -0.014213 | 0.014213 | -0.00137 | 0.00137 | 243 | -0.020593 | -0.001174 | "bin_order" | 7.0 |
| "{"partition":{"pyChannel":"Peg…" | "Age" | "NUMERIC" | 8 | "[58.000:64.000]" | -0.014049 | 0.014049 | -0.001349 | 0.001349 | 242 | -0.022683 | -0.000974 | "bin_order" | 8.0 |
| "{"partition":{"pyChannel":"Peg…" | "Occupation" | "SYMBOLIC" | 42 | "Psychotherapist, child" | -0.015971 | 0.015971 | -0.000234 | 0.000234 | 37 | -0.022916 | -0.009341 | "contribution" | 0.015971 |
| "{"partition":{"pyChannel":"Peg…" | "Occupation" | "SYMBOLIC" | 19 | "Communications engineer" | -0.017566 | 0.017566 | -0.000362 | 0.000362 | 52 | -0.026176 | -0.005375 | "contribution" | 0.017566 |
| "{"partition":{"pyChannel":"Peg…" | "pyName" | "SYMBOLIC" | 1 | "P12" | -0.022127 | 0.022127 | -0.022127 | 0.022127 | 2520 | -0.032641 | -0.002168 | "contribution" | 0.022127 |
