# ADM Explained

__Pega__

__2023-03-15__

This notebook shows exactly how all the values in an ADM model report
are calculated. It also shows how the propensity is calculated for a
particular customer.

We use one of the shipped datamart exports for the example. This is a
model very similar to one used in some of the ADM PowerPoint/Excel deep
dive examples. You can change this notebook to apply to your own data.



In [1]:
# These lines are only for rendering in the docs, and are hidden through Jupyter tags
# Do not run if you're running the notebook separately

import plotly.io as pio

pio.renderers.default = "notebook_connected"

import sys
import re
sys.path.append("../../../")
sys.path.append('../../python')
import pandas as pd
pd.set_option('display.max_colwidth', 0)

format_binning_derived = {'Positives':'{:.0f}', 'Negatives':'{:.0f}', 'Responses %':'{:.2f}', 'Positives %':'{:.2f}', 'Negatives %':'{:.2f}', 'Propensity':'{:.4f}'}
format_lift = {'Positives': '{:.0f}', 'Negatives': '{:.0f}', 'Lift': '{:.4f}'}
format_z_ratio = {'Positives':'{:.0f}', 'Negatives':'{:.0f}', 'Positives %':'{:.2f}', 'Negatives %':'{:.2f}', 'ZRatio':'{:.4f}'}
format_log_odds = {'Positives':'{:.0f}', 'Negatives':'{:.0f}', 'Positives %':'{:.2f}', 'LogOdds %':'{:.4f}', 'ModifiedLogOdds':'{:.4f}'}
format_classifier = {'Positives':'{:.0f}', 'Negatives':'{:.0f}'}

In [2]:
import polars as pl
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from typing import List
from math import log

from pdstools import datasets, cdh_utils
from pdstools.plots.plots_plotly import ADMVisualisations


pl.Config.set_fmt_str_lengths(100);

In [3]:
model_name = "AutoNew84Months"
predictor_name = "Customer.NetWealth"
channel= "Web"

For the example we pick one particular model over a channel.
To explain the ADM model report, we use one of the active predictors as an
example. Swap for any other predictor when using different data.

In [4]:
dm = datasets.CDHSample(subset=False)

model = dm.combinedData.filter(
    (pl.col("Name") == model_name) & (pl.col("Channel") == channel)
)

modelpredictors = (
    dm.combinedData.join(
        model.select(pl.col("ModelID").unique()), on="ModelID", how="inner"
    )
    .filter(pl.col("EntryType") != "Inactive")
    .with_columns(Action=pl.concat_str(["Issue", "Group"], separator="/"),
                  PredictorName=pl.col("PredictorName").cast(pl.Utf8))
    .collect()
)

predictorbinning = modelpredictors.filter(
    pl.col("PredictorName") == predictor_name
).sort("BinIndex")

In [5]:
model_id = None

if (modelpredictors.select(pl.col("ModelID").unique()).shape[0] > 1) and (
    model_id is None
):
    display(
        model.group_by("ModelID")
        .agg(
            number_of_predictors=pl.col("PredictorName").n_unique(),
            model_performance=cdh_utils.weighed_performance_polars() * 100,
            response_count=pl.sum("ResponseCount"),
        )
        .collect()
        .to_pandas()
    )
    raise Exception(
    f"**{model_name}** model has multiple instances."
    "\nThis could be due to the same model name being used in different configurations, directions, issues, or having multiple treatments."
    "\nTo ensure the selection of a unique model, please choose a model_id from the table above and update the `model_id` variable at the top of this cell."
    "\nAfterward, rerun this cell."
    f"\nSee model IDs in {model_name} model above:"
    )
if model_id is not None:
    if (
        model_id
        not in modelpredictors.select(pl.col("ModelID").unique())
        .get_column("ModelID")
        .to_list()
    ):
        raise Exception(
            f"The {model_name} model does not have a model ID: {model_id}. "
            "Please ensure that the spelling of the model ID is correct. "
            "You can run `modelpredictors.select(pl.col('ModelID').unique().implode()).row(0)` to see the exact spellings of your IDs. "
            "After updating the `model_id`, you can restart the notebook from the beginning."
        )

    predictors_in_selected_model = (
        modelpredictors.filter(
            pl.col("ModelID") == model_id
        )
        .select(pl.col("PredictorName").unique())
        .get_column("PredictorName")
        .to_list()
    )
    if predictor_name not in predictors_in_selected_model:
        raise Exception(
            f"{predictor_name} is not a predictor of the model with ID: {model_id}. "
            "Please choose one of the available predictors below and update the **predictor_name** variable in the cell above:"
            f"\nAvailable Predictors:\n{predictors_in_selected_model}."
        )

    modelpredictors = modelpredictors.filter(pl.col("ModelID") == model_id)
    predictorbinning = predictorbinning.filter(pl.col("ModelID") == model_id)
    print(f"{model_name} model with **{model_id}** model ID is selected successfully.")

## Model Overview

The selected model is shown below. Only the currently active predictors are used for the propensity calculation, so only those are shown.



In [6]:
modelpredictors.select(
    pl.col("Action").unique(),
    pl.col("Channel").unique(),
    pl.col("Name").unique(),
    pl.col("PredictorName").unique().sort().implode().alias("Active Predictors"),
    (pl.col("Performance").unique() * 100).alias("Model Performance (AUC)"),
).to_pandas().T.set_axis(["Values"], axis=1)


,Values
Action,Sales/AutoLoans
Channel,Web
Name,AutoNew84Months
Active Predictors,"[Classifier, Customer.Age, Customer.AnnualIncome, Customer.BusinessSegment, Customer.CLV, Customer.CLV_VALUE, Customer.CreditScore, Customer.Date_of_Birth, Customer.Gender, Customer.MaritalStatus, Customer.NetWealth, Customer.NoOfDependents, Customer.Prefix, Customer.RelationshipStartDate, Customer.RiskCode, Customer.WinScore, Customer.pyCountry, IH.Email.Outbound.Accepted.pxLastGroupID, IH.Email.Outbound.Accepted.pxLastOutcomeTime.DaysSince, IH.Email.Outbound.Accepted.pyHistoricalOutcomeCount, IH.Email.Outbound.Churned.pyHistoricalOutcomeCount, IH.Email.Outbound.Loyal.pxLastOutcomeTime.DaysSince, IH.Email.Outbound.Rejected.pyHistoricalOutcomeCount, IH.SMS.Outbound.Accepted.pxLastGroupID, IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount, IH.SMS.Outbound.Churned.pxLastOutcomeTime.DaysSince, IH.SMS.Outbound.Loyal.pxLastOutcomeTime.DaysSince, IH.SMS.Outbound.Loyal.pyHistoricalOutcomeCount, IH.SMS.Outbound.Rejected.pxLastGroupID, IH.SMS.Outbound.Rejected.pyHistoricalOutcomeCount, IH.Web.Inbound.Accepted.pxLastGroupID, IH.Web.Inbound.Accepted.pyHistoricalOutcomeCount, IH.Web.Inbound.Loyal.pxLastGroupID, IH.Web.Inbound.Loyal.pyHistoricalOutcomeCount, IH.Web.Inbound.Rejected.pxLastGroupID, IH.Web.Inbound.Rejected.pyHistoricalOutcomeCount, Param.ExtGroupCreditcards]"
Model Performance (AUC),77.4901


## Binning of the selected Predictor

The Model Report in Prediction Studio for this model shows a predictor binning plot like the one below.

All numbers can be derived from just the number of positives and negatives in each bin, which are stored in the ADM Data Mart. The next sections show exactly how that is done.

In [7]:
display(predictorbinning.group_by("PredictorName").agg(
    pl.first("ResponseCount").cast(pl.Int64).alias("# Responses"),
    pl.n_unique("BinIndex").alias("# Bins"),
    (pl.first("PerformanceBin") * 100).alias("Predictor Performance(AUC)"),
).rename({"PredictorName": "Predictor Name"}).transpose(include_header=True).rename(
    {"column": "", "column_0": "Value"}
).to_pandas().set_index(""))

fig = dm.plotPredictorBinning(modelids=modelpredictors.get_column("ModelID").unique().to_list(),
                        predictors=[predictor_name])
fig.update_layout(width=600, height=400)
display(fig)

,Value
Predictor Name,Customer.NetWealth
# Responses,1636
# Bins,8
Predictor Performance(AUC),72.2077


In [8]:
BinPositives = pl.col("BinPositives")
BinNegatives = pl.col("BinNegatives")
sumPositives = pl.sum("BinPositives")
sumNegatives = pl.sum("BinNegatives")

# TODO: add totals for first 5 columns, base rate for 6 and 0,1, see R version

binstats = predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    ((BinPositives + BinNegatives) / (sumPositives + sumNegatives))
    .round(3)
    .alias("Responses (%)"),
    BinPositives.alias("Positives"),
    (BinPositives / sumPositives).round(3).alias("Positives (%)"),
    BinNegatives.alias("Negatives"),
    (BinNegatives / sumNegatives).round(3).alias("Negatives (%)"),
    (BinPositives / (BinPositives + BinNegatives)).round(4).alias("Propensity (%)"),
    cdh_utils.zRatio(negCol=BinNegatives, posCol=BinPositives),
    (
        (BinPositives / (BinPositives + BinNegatives))
        / (sumPositives / (BinPositives + BinNegatives).sum())
    ).alias("Lift"),
)

binstats.vstack(
    pl.DataFrame(dict(zip(
                binstats.columns,
                ["Total"] + [binstats.select(pl.sum(col)).row(0)[0] for col in binstats.columns[1:]]
                )),
    schema=binstats.schema,
    )).to_pandas().set_index("Range/Symbol")


Range/Symbol,Responses (%),Positives,Positives (%),Negatives,Negatives (%),Propensity (%),ZRatio,Lift
<11684.56,0.267,13.0,0.063,423.0,0.296,0.0298,-11.186877,0.236795
"[11684.56, 13732.56>",0.123,24.0,0.117,178.0,0.124,0.1188,-0.332146,0.943574
"[13732.56, 16845.52>",0.163,17.0,0.083,250.0,0.175,0.0637,-4.264671,0.505654
"[16845.52, 19139.28>",0.141,51.0,0.248,179.0,0.125,0.2217,3.908162,1.760996
"[19139.28, 20286.16>",0.055,7.0,0.034,83.0,0.058,0.0778,-1.711775,0.617692
"[20286.16, 22743.76>",0.136,53.0,0.257,169.0,0.118,0.2387,4.397646,1.896003
"[22743.76, 23890.64>",0.055,13.0,0.063,77.0,0.054,0.1444,0.515565,1.147141
>=23890.64,0.061,28.0,0.136,71.0,0.05,0.2828,3.512888,2.246151
Total,1.001,206.0,1.001,1430.0,1.0,1.1777,-5.161209,9.354006


## Bin Statistics

### Positive and Negative ratios

Internally, ADM only keeps track of the total counts of positive and negative responses in each bin. Everything else is derived from those numbers. The percentages and totals are trivially derived, and the propensity is just the number of positives divided by the total. The numbers calculated here match the numbers from the datamart table exactly.
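
As a quick hand check, the derived values for the first bin can be reproduced with plain Python. This is only a sketch: the counts (13 positives and 423 negatives in the bin, 206 and 1430 in total) are copied from the binning table above, and the variable names are chosen here for illustration.

```python
# Counts for the first bin and the totals, copied from the binning table above.
bin_pos, bin_neg = 13, 423
total_pos, total_neg = 206, 1430

# Share of all responses, positives and negatives that fall into this bin.
responses_pct = 100 * (bin_pos + bin_neg) / (total_pos + total_neg)
positives_pct = 100 * bin_pos / total_pos
negatives_pct = 100 * bin_neg / total_neg

# Propensity (success rate) of the bin.
propensity = bin_pos / (bin_pos + bin_neg)

print(round(responses_pct, 2), round(positives_pct, 2),
      round(negatives_pct, 2), round(propensity, 4))
# prints: 26.65 6.31 29.58 0.0298
```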

In [9]:
binningDerived = predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    BinPositives.alias("Positives"),
    BinNegatives.alias("Negatives"),
    (((BinPositives + BinNegatives) / (sumPositives + sumNegatives)) * 100)
    .round(2)
    .alias("Responses %"),
    ((BinPositives / sumPositives) * 100).round(2).alias("Positives %"),
    ((BinNegatives / sumNegatives) * 100).round(2).alias("Negatives %"),
    (BinPositives / (BinPositives + BinNegatives)).round(4).alias("Propensity"),
)
binningDerived.to_pandas(use_pyarrow_extension_array=True).set_index("Range/Symbol").style.format(
    format_binning_derived
).set_properties(
    color="#0000FF", subset=["Responses %", "Positives %", "Negatives %", "Propensity"]
)

Range/Symbol,Positives,Negatives,Responses %,Positives %,Negatives %,Propensity
<11684.56,13,423,26.65,6.31,29.58,0.0298
"[11684.56, 13732.56>",24,178,12.35,11.65,12.45,0.1188
"[13732.56, 16845.52>",17,250,16.32,8.25,17.48,0.0637
"[16845.52, 19139.28>",51,179,14.06,24.76,12.52,0.2217
"[19139.28, 20286.16>",7,83,5.5,3.4,5.8,0.0778
"[20286.16, 22743.76>",53,169,13.57,25.73,11.82,0.2387
"[22743.76, 23890.64>",13,77,5.5,6.31,5.38,0.1444
>=23890.64,28,71,6.05,13.59,4.97,0.2828


### Lift

Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity:
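
The same calculation can be sketched without polars. The per-bin counts below are copied from the binning table above; the list and variable names are illustrative only.

```python
# (positives, negatives) per bin, copied from the binning table above.
bins = [(13, 423), (24, 178), (17, 250), (51, 179),
        (7, 83), (53, 169), (13, 77), (28, 71)]

total_pos = sum(p for p, _ in bins)
total_neg = sum(n for _, n in bins)

# Average propensity over all bins (the base rate).
base_rate = total_pos / (total_pos + total_neg)

# Lift of a bin = propensity of the bin / average propensity.
lifts = [(p / (p + n)) / base_rate for p, n in bins]
print([round(x, 4) for x in lifts])
```

The first bin comes out well below 1 (about 0.24) and the last bin well above 1 (about 2.25), in line with the Lift column of the binning table.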

In [10]:
Positives = pl.col("Positives")
Negatives = pl.col("Negatives")
sumPositives = pl.sum("Positives")
sumNegatives = pl.sum("Negatives")
binningDerived.select(
    "Range/Symbol",
    "Positives",
    "Negatives",
    (
        (Positives / (Positives + Negatives))
        / (sumPositives / (Positives + Negatives).sum())
    ).alias("Lift"),
).to_pandas().set_index("Range/Symbol").style.format(format_lift).set_properties(
    **{"color": "blue"}, subset=["Lift"]
)

Range/Symbol,Positives,Negatives,Lift
<11684.56,13,423,0.2368
"[11684.56, 13732.56>",24,178,0.9436
"[13732.56, 16845.52>",17,250,0.5057
"[16845.52, 19139.28>",51,179,1.761
"[19139.28, 20286.16>",7,83,0.6177
"[20286.16, 22743.76>",53,169,1.896
"[22743.76, 23890.64>",13,77,1.1471
>=23890.64,28,71,2.2462


### Z-Ratio

The Z-Ratio is also a measure of how the propensity in a bin differs from the average, but it takes the size of the bin into account and is therefore statistically more relevant. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread, the better the predictor is. With $posFraction$ and $negFraction$ the share of all positives and of all negatives that fall into the bin, it is calculated as:

$$\frac{posFraction-negFraction}{\sqrt{\frac{posFraction*(1-posFraction)}{\sum positives}+\frac{negFraction*(1-negFraction)}{\sum negatives}}}$$
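
For a single bin the formula can be evaluated by hand. The function below is a minimal sketch with a hypothetical name (`z_ratio`), not the pdstools implementation; the counts for the first bin (13 positives, 423 negatives) and the totals (206, 1430) come from the binning table above.

```python
from math import sqrt


def z_ratio(bin_pos, bin_neg, total_pos, total_neg):
    """Z-ratio of one bin from its counts and the predictor totals."""
    pos_fraction = bin_pos / total_pos  # share of all positives in this bin
    neg_fraction = bin_neg / total_neg  # share of all negatives in this bin
    variance = (
        pos_fraction * (1 - pos_fraction) / total_pos
        + neg_fraction * (1 - neg_fraction) / total_neg
    )
    return (pos_fraction - neg_fraction) / sqrt(variance)


first_bin = z_ratio(13, 423, 206, 1430)
print(round(first_bin, 3))
```

The result should be close to the ZRatio of about -11.19 listed for the first bin in the binning table.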

See the calculation here, which is also included in [cdh_utils' zRatio()](https://pegasystems.github.io/pega-datascientist-tools/Python/autoapi/pdstools/utils/cdh_utils/index.html#pdstools.utils.cdh_utils.zRatio).

In [11]:
def zRatio(
    posCol: pl.Expr = pl.col("BinPositives"), negCol: pl.Expr = pl.col("BinNegatives")
) -> pl.Expr:
    def getFracs(posCol=pl.col("BinPositives"), negCol=pl.col("BinNegatives")):
        return posCol / posCol.sum(), negCol / negCol.sum()

    def zRatioimpl(
        posFractionCol=pl.col("posFraction"),
        negFractionCol=pl.col("negFraction"),
        PositivesCol=pl.sum("BinPositives"),
        NegativesCol=pl.sum("BinNegatives"),
    ):
        return (
            (posFractionCol - negFractionCol)
            / (
                (posFractionCol * (1 - posFractionCol) / PositivesCol)
                + (negFractionCol * (1 - negFractionCol) / NegativesCol)
            ).sqrt()
        ).alias("ZRatio")

    return zRatioimpl(*getFracs(posCol, negCol), posCol.sum(), negCol.sum())


binningDerived.select(
    "Range/Symbol", "Positives", "Negatives", "Positives %", "Negatives %"
).with_columns(zRatio(Positives, Negatives)).to_pandas().set_index("Range/Symbol").style.format(
    format_z_ratio
).set_properties(
    **{"color": "blue"}, subset=["ZRatio"]
)

Range/Symbol,Positives,Negatives,Positives %,Negatives %,ZRatio
<11684.56,13,423,6.31,29.58,-11.1869
"[11684.56, 13732.56>",24,178,11.65,12.45,-0.3321
"[13732.56, 16845.52>",17,250,8.25,17.48,-4.2647
"[16845.52, 19139.28>",51,179,24.76,12.52,3.9082
"[19139.28, 20286.16>",7,83,3.4,5.8,-1.7118
"[20286.16, 22743.76>",53,169,25.73,11.82,4.3976
"[22743.76, 23890.64>",13,77,6.31,5.38,0.5156
>=23890.64,28,71,13.59,4.97,3.5129


## Predictor AUC


The predictor AUC is the univariate performance of this predictor against the outcome. It too can be derived from the positives and negatives, and
pdstools provides a convenient function to calculate it directly from those counts.

This function is implemented in cdh_utils: [cdh_utils.auc_from_bincounts()](https://pegasystems.github.io/pega-datascientist-tools/Python/autoapi/pdstools/utils/cdh_utils/index.html#pdstools.utils.cdh_utils.auc_from_bincounts).

In [12]:
pos=binningDerived.get_column("Positives").to_numpy()
neg=binningDerived.get_column("Negatives").to_numpy()
probs=binningDerived.get_column("Propensity").to_numpy()
order = np.argsort(probs)

FPR = np.cumsum(neg[order]) / np.sum(neg[order])
TPR = np.cumsum(pos[order]) / np.sum(pos[order])
TPR = np.insert(TPR, 0, 0, axis=0)
FPR = np.insert(FPR, 0, 0, axis=0)
# Checking whether classifier labels are correct
if TPR[1] < 1-FPR[1]:
    temp = FPR
    FPR = TPR
    TPR = temp
auc = cdh_utils.auc_from_bincounts(pos=pos, neg=neg,probs=probs)

fig = px.line(
    x=[1-x for x in FPR], y=TPR,
    labels=dict(x='Specificity', y='Sensitivity'),
    title = f"AUC = {auc.round(3)}",
    width=700, height=700,
    range_x=[1,0],
    template='none'
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=1, x1=0, y0=0, y1=1
)
fig.show()
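
To make the idea behind the curve concrete, here is a dependency-free sketch that computes the AUC of a binned predictor as the probability that a random positive sits in a higher-propensity bin than a random negative, counting ties within a bin as half. The helper name `auc_from_counts` is hypothetical and is not the pdstools function; it assumes every bin has at least one response.

```python
def auc_from_counts(pos, neg):
    """Hypothetical helper: AUC of a binned predictor from per-bin counts."""
    # Rank the bins by their propensity (success rate), lowest first.
    order = sorted(range(len(pos)), key=lambda i: pos[i] / (pos[i] + neg[i]))
    total_pos, total_neg = sum(pos), sum(neg)

    concordant = 0.0
    neg_below = 0  # negatives in bins with a lower propensity
    for i in order:
        # A positive in this bin outranks all negatives in lower bins and
        # ties (counted as one half) with the negatives in its own bin.
        concordant += pos[i] * (neg_below + neg[i] / 2)
        neg_below += neg[i]
    return concordant / (total_pos * total_neg)


# Per-bin counts of Customer.NetWealth, copied from the binning table.
pos_counts = [13, 24, 17, 51, 7, 53, 13, 28]
neg_counts = [423, 178, 250, 179, 83, 169, 77, 71]
auc_sketch = auc_from_counts(pos_counts, neg_counts)
print(round(auc_sketch, 4))
```

For these counts the sketch should land at roughly 0.722, the predictor performance reported earlier for this predictor.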


## Naive Bayes and Log Odds

The basis for the Naive Bayes algorithm is Bayes' Theorem:

$$p(C_k|x) = \frac{p(x|C_k)*p(C_k)}{p(x)}$$

with $C_k$ the outcome and $x$ the customer. Bayes' theorem turns the
question "what's the probability to accept this action given a customer" around to 
"what's the probability of this customer given an action". With the independence
assumption, and after applying a log odds transformation we get a log odds score 
that can be calculated efficiently and in a numerically stable manner:

$$log\ odds\ score = \sum_{p\ \in\ active\ predictors}log(p(x_p|Positive)) + log(p_{positive}) - \sum_plog(p(x_p|Negative)) - log(p_{negative})$$

Note that the _prior_ can be written as:

$$log(p_{positive}) - log(p_{negative}) = log(\frac{TotalPositives}{Total})-log(\frac{TotalNegatives}{Total}) = log(TotalPositives) - log(TotalNegatives)$$
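
The simplification of the prior can be checked numerically. The sketch below uses the totals of the example model (206 positives, 1430 negatives) purely as illustration; the variable names are not part of pdstools.

```python
from math import isclose, log

total_pos, total_neg = 206, 1430
total = total_pos + total_neg

# Prior written with the class probabilities ...
prior_from_fractions = log(total_pos / total) - log(total_neg / total)
# ... and with the raw counts: the shared denominator cancels out.
prior_from_counts = log(total_pos) - log(total_neg)

print(isclose(prior_from_fractions, prior_from_counts))  # prints: True
```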


## Predictor Contribution

The contribution (_conditional log odds_) of an active predictor $p$ for bin $i$, with the number
of positive and negative responses in that bin given by $Positives_i$ and $Negatives_i$, is calculated as follows (note the Laplace smoothing to avoid taking the log of 0):

$$contribution_p = \log(Positives_i+\frac{1}{nBins}) - \log(Negatives_i+\frac{1}{nBins}) - \log(1+\sum_{j\ = 1..nBins}{Positives_j}) + \log(1+\sum_{j\ = 1..nBins}{Negatives_j})$$
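
A minimal plain-Python sketch of this formula follows. The function name `contribution` is chosen here for illustration; the example uses the fourth Customer.NetWealth bin (51 positives, 179 negatives, 8 bins, totals 206 and 1430) from the binning tables above.

```python
from math import log


def contribution(bin_pos, bin_neg, total_pos, total_neg, n_bins):
    """Smoothed log odds of one bin, following the formula above."""
    return (
        log(bin_pos + 1 / n_bins)
        - log(bin_neg + 1 / n_bins)
        - log(1 + total_pos)
        + log(1 + total_neg)
    )


fourth_bin = contribution(51, 179, 206, 1430, 8)
print(round(fourth_bin, 4))

# Thanks to the smoothing, a completely empty bin still gives a finite value.
empty_bin = contribution(0, 0, 206, 1430, 8)
```

For the fourth bin this evaluates to approximately 0.68, a positive contribution because the bin has a higher share of positives than of negatives.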


In [13]:
N = binningDerived.shape[0]
binningDerived.with_columns(
    LogOdds=(pl.col("Positives %") / pl.col("Negatives %")).log(),
    ModifiedLogOdds=(
        ((Positives + 1 / N).log() - (Positives.sum() + 1).log())
        - ((Negatives + 1 / N).log() - (Negatives.sum() + 1).log())
    ),
).drop("Responses %", "Propensity").to_pandas().set_index("Range/Symbol").style.format(
    format_log_odds
).set_properties(
    **{"color": "blue"}, subset=["LogOdds", "ModifiedLogOdds"]
)

Range/Symbol,Positives,Negatives,Positives %,Negatives %,LogOdds,ModifiedLogOdds
<11684.56,13,423,6.31,29.58,-1.544963,-1.5397
"[11684.56, 13732.56>",24,178,11.65,12.45,-0.066414,-0.0658
"[13732.56, 16845.52>",17,250,8.25,17.48,-0.750844,-0.748
"[16845.52, 19139.28>",51,179,24.76,12.52,0.681902,0.6796
"[19139.28, 20286.16>",7,83,3.4,5.8,-0.534083,-0.5233
"[20286.16, 22743.76>",53,169,25.73,11.82,0.777865,0.7754
"[22743.76, 23890.64>",13,77,6.31,5.38,0.159447,0.1625
>=23890.64,28,71,13.59,4.97,1.005915,1.0056


## Propensity mapping

### Log odds contribution for all the predictors

The final score is loosely referred to as "the average contribution" but
in fact is a little more nuanced. The final score is calculated as:

$$score = \frac{\log(1 + TotalPositives) - \log(1 + TotalNegatives) + \sum_p contribution_p}{1 + nActivePredictors}$$

Here, $TotalPositives$ and $TotalNegatives$ are the total number of
positive and negative responses to the model.

Below is an example. For each active predictor of the model
we pick a value (the middle bin for numerics, the first symbol
for symbolics) and show the (modified) log odds. The final score is
calculated per the above formula, and this is the value that is mapped
to a propensity value by the classifier (which is constructed using the
[PAV(A)](https://en.wikipedia.org/wiki/Isotonic_regression) algorithm).
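
The score formula itself is easy to sketch in plain Python. The helper `final_score` and the three contribution values below are hypothetical and only illustrate the arithmetic; the real model has many more active predictors.

```python
from math import log


def final_score(contributions, total_pos, total_neg):
    """Hypothetical helper implementing the score formula above."""
    # Smoothed prior log odds of the model.
    prior = log(1 + total_pos) - log(1 + total_neg)
    # Average over the prior and the per-predictor contributions.
    return (prior + sum(contributions)) / (1 + len(contributions))


# Made-up contributions for three active predictors (illustration only).
example_contributions = [0.68, -0.29, 0.11]
example_score = final_score(example_contributions, total_pos=206, total_neg=1430)
print(round(example_score, 4))
```

Dividing by `1 + nActivePredictors` rather than by the number of predictors is what makes the result more than a plain average of the contributions: the prior counts as one extra term.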


In [14]:
def middleBin():
    return pl.col("BinIndex") == (pl.max("BinIndex") / 2).floor().cast(pl.UInt32)


if not all(
    col in modelpredictors.columns for col in ["BinLowerBound", "BinUpperBound"]
):

    def extract_numbers_in_contents(s: str, index):
        numbers = re.findall(r"[-+]?\d*\.\d+|\d+", s)
        try:
            number = float(numbers[index])
        except (IndexError, ValueError):
            number = 0
        return number

    modelpredictors = modelpredictors.with_columns(
        pl.col("Contents").cast(pl.Utf8)
    ).with_columns(
        pl.when(pl.col("Type") == "numeric")
        .then(pl.col("Contents").map_elements(lambda col: extract_numbers_in_contents(col, 0)))
        .otherwise(pl.lit(-9999))
        .alias("BinLowerBound")
        .cast(pl.Float32),
        pl.when(pl.col("Type") == "numeric")
        .then(pl.col("Contents").map_elements(lambda col: extract_numbers_in_contents(col, 1)))
        .otherwise(pl.lit(-9999))
        .alias("BinUpperBound")
        .cast(pl.Float32),
    )


def RowWiseLogOdds(Bin, Positives, Negatives):
    Bin, N = Bin.list.get(0) - 1, Positives.list.lengths()
    Pos, Neg = Positives.list.get(Bin), Negatives.list.get(Bin)
    PosSum, NegSum = Positives.list.sum(), Negatives.list.sum()
    return (
        (((Pos + (1 / N)).log() - (PosSum + 1).log()))
        - (((Neg + (1 / N)).log()) - (NegSum + 1).log())
    ).alias("Modified Log odds")


df = (
    modelpredictors.filter(pl.col("PredictorName") != "Classifier")
    .group_by("PredictorName")
    .agg(
        Value=pl.when(pl.col("Type").first() == "numeric")
        .then(
            ((pl.col("BinLowerBound") + pl.col("BinUpperBound")) / 2).where(middleBin())
        )
        .otherwise(pl.col("BinSymbol").str.split(",").list.first().where(middleBin())),
        Bin=pl.col("BinIndex").where(middleBin()),
        Positives=pl.col("BinPositives"),
        Negatives=pl.col("BinNegatives"),
    )
    .with_columns(
        pl.col(["Positives", "Negatives"]).list.get(pl.col("Bin").list.get(0) - 1),
        pl.col("Bin", "Value").list.get(0),
        LogOdds=RowWiseLogOdds(pl.col("Bin"), pl.col("Positives"), pl.col("Negatives")),
    )
    .sort("PredictorName")
)

classifier = (
    modelpredictors.filter(pl.col("EntryType") == "Classifier")
    .with_columns(
        Propensity=(BinPositives / (BinPositives + BinNegatives)),
        AdjustedPropensity=((0.5 + BinPositives) / (1 + BinPositives + BinNegatives)),
        ZRatio=cdh_utils.zRatio(negCol=BinNegatives, posCol=BinPositives),
        Lift=(
            (BinPositives / (BinPositives + BinNegatives))
            / (sumPositives / (BinPositives + BinNegatives).sum())
        ),
    )
    .select(
        [
            pl.col("BinIndex").alias("Index"),
            pl.col("BinSymbol").alias("Bin"),
            BinPositives.alias("Positives"),
            BinNegatives.alias("Negatives"),
            ((pl.cumsum("BinResponseCount") / pl.sum("BinResponseCount")) * 100).alias(
                "Cum. Total (%)"
            ),
            (pl.col("BinPropensity") * 100).alias("Propensity (%)"),
            (pl.col("AdjustedPropensity") * 100).alias("Adjusted Propensity (%)"),
            ((pl.cumsum("BinPositives") / pl.sum("BinPositives")) * 100).alias(
                "Cum Positives (%)"
            ),
            pl.col("ZRatio"),
            (pl.col("Lift") * 100).alias("Lift(%)"),
            pl.col("BinResponseCount").alias("Responses"),
        ]
    )
)
classifierLogOffset = log(1 + classifier["Positives"].sum()) - log(
    1 + classifier["Negatives"].sum()
)

propensity_mapping = (
    df.vstack(
        pl.DataFrame(
            dict(
                zip(
                    df.columns,
                    ["Final Score"]
                    + [None] * 4
                    + [(df["LogOdds"].sum() + classifierLogOffset) / (len(df) + 1)],
                )
            ),
            schema=df.schema,
        )
    )
    .to_pandas()
    .set_index("PredictorName")
    .style.set_properties(**{"color": "blue"}, subset=["LogOdds"])
)

propensity_mapping

PredictorName,Value,Bin,Positives,Negatives,LogOdds
Customer.Age,34.56,4.0,9.0,198.0,-1.145923
Customer.AnnualIncome,-24043.049,1.0,74.0,1166.0,-0.819651
Customer.BusinessSegment,middleSegmentPlus,1.0,96.0,970.0,-0.376415
Customer.CLV,NON-MISSING,1.0,111.0,570.0,0.300922
Customer.CLV_VALUE,1345.52,4.0,31.0,297.0,-0.322731
Customer.CreditScore,518.92,3.0,33.0,205.0,0.110531
Customer.Date_of_Birth,18773.504,5.0,28.0,152.0,0.244642
Customer.Gender,U,1.0,52.0,481.0,-0.285516
Customer.MaritalStatus,No Resp+,1.0,67.0,745.0,-0.470766
Customer.NetWealth,17992.398,4.0,51.0,179.0,0.6796


## Classifier

The success rate is defined as $\frac{positives}{positives+negatives}$ per bin. 

The adjusted propensity that is returned is a small modification (Laplace smoothing) to this and calculated as $\frac{0.5+positives}{1+positives+negatives}$ so empty models return a propensity of 0.5.
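
Both quantities can be written as two small functions. This is a sketch with illustrative names, not pdstools code; the example counts (17 positives, 443 negatives) are hypothetical values for a single classifier bin.

```python
def success_rate(positives, negatives):
    """Raw success rate of a bin."""
    return positives / (positives + negatives)


def adjusted_propensity(positives, negatives):
    """Smoothed propensity as described above."""
    return (0.5 + positives) / (1 + positives + negatives)


raw = success_rate(17, 443)
adjusted = adjusted_propensity(17, 443)
print(round(100 * raw, 4), round(100 * adjusted, 4))

# A bin (or model) without any responses falls back to 0.5.
print(adjusted_propensity(0, 0))  # prints: 0.5
```

The adjustment matters most for bins with few responses; with hundreds of responses the two values differ only slightly.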


In [15]:
# TODO see if we can port the "getActiveRanges" code to python so to highlight the classifier rows that are "active"

classifier.drop("Responses").to_pandas().set_index("Index").style.format(format_classifier).set_properties(
    **{"color": "blue"}, subset=["Adjusted Propensity (%)"]
)

Index,Bin,Positives,Negatives,Cum. Total (%),Propensity (%),Adjusted Propensity (%),Cum Positives (%),ZRatio,Lift(%)
1,<-0.21,17,443,28.117359,3.695652,3.796095,8.252427,-9.994484,1.956663
2,"[-0.21, -0.185>",8,133,36.735939,5.673759,5.985916,12.135922,-3.495416,3.003971
3,"[-0.185, -0.175>",3,48,39.853302,5.882353,6.73077,13.592233,-1.977473,3.114411
4,"[-0.175, -0.105>",28,370,64.180931,7.035176,7.142858,27.184465,-4.628074,3.724772
5,"[-0.105, -0.095>",4,51,67.542786,7.272727,8.035714,29.126215,-1.505372,3.850544
6,"[-0.095, -0.09>",2,19,68.826408,9.523809,11.363637,30.097088,-0.478811,5.042379
7,"[-0.09, -0.065>",9,77,74.08313,10.465117,10.91954,34.466019,-0.657755,5.540754
8,"[-0.065, -0.02>",30,154,85.330078,16.304348,16.486486,49.029125,1.4644,8.632335
9,"[-0.02, 0.03>",37,65,91.564789,36.274509,36.407764,66.990295,4.913029,19.205534
10,"[0.03, 0.06>",20,29,94.559898,40.816326,41.0,76.699028,3.664015,21.610199


## Final Propensity

The classifier mapping is shown below, with the binned scores (log odds values) on the x-axis and the propensity on the y-axis. Note that the returned propensities follow a slightly adjusted formula (see the table above). The bin that contains the calculated final score is highlighted.

In [16]:
score = propensity_mapping.data.loc["Final Score", "LogOdds"]
score_bin = modelpredictors.filter(pl.col("EntryType") == "Classifier").select(
    pl.col("BinSymbol").where(
        pl.lit(score).is_between(pl.col("BinLowerBound"), pl.col("BinUpperBound"))
    )
)["BinSymbol"][0]
score_responses = modelpredictors.filter(
    (pl.col("EntryType") == "Classifier") & (pl.col("BinSymbol") == score_bin)
)["BinResponseCount"][0]
score_bin_index = (
    modelpredictors.filter(pl.col("EntryType") == "Classifier")["BinSymbol"]
    .to_list()
    .index(score_bin)
)
score_propensity = classifier.to_pandas().iloc[score_bin_index][
    "Adjusted Propensity (%)"
]

adjusted_propensity = (
    modelpredictors.filter(pl.col("EntryType") == "Classifier")
    .with_columns(
        AdjustedPropensity=((0.5 + BinPositives) / (1 + BinPositives + BinNegatives)),
    )
    .select(
        pl.col("AdjustedPropensity").where(
            (pl.col("BinLowerBound") < score) & (pl.col("BinUpperBound") > score)
        )
    )["AdjustedPropensity"][0]
    * 100
)
adjusted_propensity = round(adjusted_propensity, 2)

fig = ADMVisualisations.distribution_graph(
    modelpredictors.filter(pl.col("EntryType") == "Classifier"),
    "Propensity distribution",
).add_annotation(
    x=score_bin,
    y=score_propensity / 100,
    text=f"Returned propensity: {score_propensity:.2f}%",
    bgcolor="#FFFFFF",
    bordercolor="#000000",
    showarrow=False,
    yref="y2",
    opacity=0.7,
)
bin_index = list(fig.data[0]["x"]).index(score_bin)
fig.data[0]["marker_color"] = (
    ["grey"] * bin_index
    + ["#1f77b4"]
    + ["grey"] * (classifier.shape[0] - bin_index - 1)
)
fig
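
The lookup performed above can also be sketched without polars: find the classifier bin whose interval contains the score and return that bin's adjusted propensity. The bounds and counts below are the first five classifier bins from the table in the Classifier section; the function name and the example score of -0.1 are hypothetical, and the sketch assumes intervals that include the lower and exclude the upper bound, as the `[a, b>` notation suggests.

```python
from math import inf

# (lower bound, upper bound, positives, negatives) for the first classifier
# bins listed in the classifier table; the remaining bins are omitted here.
classifier_bins = [
    (-inf, -0.21, 17, 443),
    (-0.21, -0.185, 8, 133),
    (-0.185, -0.175, 3, 48),
    (-0.175, -0.105, 28, 370),
    (-0.105, -0.095, 4, 51),
]


def propensity_for_score(score, bins):
    """Return the adjusted propensity of the bin containing the score."""
    for lower, upper, pos, neg in bins:
        if lower <= score < upper:
            return (0.5 + pos) / (1 + pos + neg)
    return None  # score falls outside the bins listed here


example_propensity = propensity_for_score(-0.1, classifier_bins)
print(round(100 * example_propensity, 2))
```

A score of -0.1 falls into the `[-0.105, -0.095>` bin, so the sketch returns that bin's adjusted propensity of roughly 8%.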