In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from prep_data import load_demo_data
from key_drivers import decompose_funnel_metrics

data_obj = load_demo_data(".")

df_sales = data_obj.df_sales
df_stores = data_obj.df_stores
group_cols = data_obj.group_variables
funnel_cols = data_obj.funnel_variables

df_metrics = decompose_funnel_metrics(
    df_sales, "report_date", ["store_nbr"], funnel_cols
)

df_joined = (
    df_metrics.set_index("store_nbr")
    .join(df_stores.set_index("store_nbr"), on="store_nbr")
    .reset_index()
)

# Finding Key Drivers

Once we have our data nicely broken down, we might want a way to automatically detect the major driving factors. This is often a difficult task, as the biggest factors won't always be the most obvious. The search is also skewed by the dimensions we choose to look at, and the way we choose to look at them. For example, if you only go looking through age-based factors, you will tend to conclude that age is a major factor.

As such, most analysts I've worked with (myself included!) tend to take a "kitchen sink" approach when searching for factors. Naturally, this search is not completely blind, and is usually guided by some domain knowledge. However, the goal is to cast a wide net and see what sticks, and try to disprove the obvious hypotheses. This search is slow and tedious, and often involves a lot of manual work. It also easily misses out on interactions between factors, and can be skewed by the order in which factors are considered.

To accelerate this manual work, it's tempting to look at effects as driven by "individual factors". Statistical approaches work well here, and machine learning approaches such as boosted trees are exceptionally useful for detecting interactions between factors, even in high dimensions. Ignore issues with interpretability for a moment, as tools like SHAP can be (ab)used to get a good proxy for contribution.

The biggest issue here is that effects may be driven not just by the contributions of individuals, but also by the size of their demographic. Both machine learning and statistical methods work with averages, but sums pay your salary. In physics terms, we're less interested in the "temperature" of a group of particles, and more interested in the "heat" of the group. For example, a small group of people with a high spend may have a smaller effect on overall profit than a large group of people with a low average spend. Accordingly, if we want to "double down on success", our efforts to drive incremental revenue may be best focused in the latter group!
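
To make the "temperature vs. heat" point concrete, here's a tiny sketch with made-up numbers (the segments and spend values are purely hypothetical). The small segment wins on the average, but the large segment holds most of the total:

```python
import pandas as pd

# Hypothetical customers: 3 big spenders and 40 small spenders.
df = pd.DataFrame(
    {
        "segment": ["high_spend"] * 3 + ["low_spend"] * 40,
        "spend": [200.0] * 3 + [30.0] * 40,
    }
)

# Compare the average ("temperature") with the sum ("heat") per segment.
summary = df.groupby("segment")["spend"].agg(["mean", "sum", "count"])
summary["share_of_total"] = summary["sum"] / summary["sum"].sum()
print(summary)
```

A model that ranks segments by average spend would put `high_spend` first, while the summed view shows `low_spend` carrying two thirds of the total.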

To date, the only method I've found that consistently gets at the core drivers of a problem is the "key drivers" method implemented in this package. The process works as follows:

1. Group your data by the dimensions you're interested in. For each group, calculate the sum of the target KPI.
2. Declare some threshold for the minimum impact you're interested in. This could be a percentage of the total, or a fixed value. I normally use a percentage of the total, so I divide the total KPI by the target number of factors.
3. Using all columns together, find every group whose contribution exceeds the threshold, ordered by smallest membership first. These are the smallest groups with an outsized effect on the target KPI, so include them before bigger groups.
4. This one's important: _remove the observations that are in these groups from the data_. This is the key step that allows us to find the next biggest driver.
5. Suppose the previous step used $n$ columns. Repeat steps 3-4 on every combination of $n-1$ columns.
6. Repeat step 5 until you're working with individual columns.
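
The steps above can be sketched in a few lines of pandas. To be clear, this is my own minimal illustration and not the package's `find_key_drivers`: the function name `sketch_key_drivers`, the toy data, and the choice of row count as "membership" are all assumptions.

```python
from itertools import combinations

import pandas as pd


def sketch_key_drivers(df, group_cols, kpi_col, target_n):
    """Illustrative only: greedy search from the most to the least specific grouping."""
    # Step 2: threshold as a share of the total KPI (assumed to be total / target_n).
    threshold = df[kpi_col].sum() / target_n
    remaining = df.copy()
    drivers = []
    # Steps 5-6: start with all columns, then every combination of n-1, n-2, ... columns.
    for size in range(len(group_cols), 0, -1):
        for cols in combinations(group_cols, size):
            cols = list(cols)
            # Steps 1 and 3: sum the KPI per group on whatever data is left.
            stats = (
                remaining.groupby(cols)[kpi_col]
                .agg(total="sum", members="count")
                .reset_index()
            )
            hits = stats[stats["total"] >= threshold].sort_values("members")
            for row in hits.to_dict("records"):
                categories = {c: row[c] for c in cols}
                drivers.append((categories, row["total"], row["members"]))
                # Step 4: remove the observations that belong to this group.
                mask = pd.Series(True, index=remaining.index)
                for c in cols:
                    mask &= remaining[c] == row[c]
                remaining = remaining[~mask]
    return drivers


# Hypothetical toy data; total profit is 100, so target_n=5 gives a threshold of 20.
toy = pd.DataFrame(
    {
        "city": ["Quito", "Quito", "Quito", "Quito", "Guayaquil", "Guayaquil", "Cuenca", "Loja"],
        "type": ["A", "A", "D", "D", "A", "B", "B", "B"],
        "profit": [30.0, 20.0, 10.0, 8.0, 6.0, 10.0, 9.0, 7.0],
    }
)
drivers = sketch_key_drivers(toy, ["city", "type"], "profit", target_n=5)
for categories, total, members in drivers:
    print(categories, total, members)
```

On this toy data the Quito/type A stores clear the threshold at the most specific level. Store type B only shows up once the search drops to single columns, because none of its city-level groups is big enough on its own.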

Now you might observe that this process is...somewhat inefficient. I agree. And if you were working with thousands of columns...well, you'd be in trouble. However, this process is the only one I've found that consistently gets at the core drivers of a problem. If you know of a better one, and I'd be willing to bet that there are MANY people out there who know more about this stuff than me, please let me know!

For the rest of us, this approach is also a great way to get a sense of the interactions between factors as part of an EDA. Besides, if you're putting in 1000 columns, you'd probably struggle to interpret the results for business stakeholders anyway. Have a chat to your counterparts about likely factors and try to get that kitchen sink down from the size of a swimming pool to...uh...a kitchen sink.

## Example
The function assumes you can neatly sum all variables, and so have denominated all factors in your key KPI. (If this seems like magic, check the last notebook!) For funsies, we're going to concentrate on the profit gain from `items_per_transaction` and `income_per_item`. There are many other factors that you could care about given your context (see [the following article](https://commoncog.com/the-amazon-weekly-business-review/)) but we're going to assume that these two are the most important for now.

We'll set the target number of factors to 10; this means we're hunting for groupings that comprise 10% or greater of total profit. We can tweak both the target number of factors and the threshold independently, but I've found that the target number of factors is the most important parameter to tweak.
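
As a quick sanity check on what the target number implies, here's the arithmetic, assuming the minimum share is simply the reciprocal of the target (which is how I read "10 factors means 10% or greater"; the variable names are just for illustration):

```python
# Assumed relationship: minimum share of the total = 1 / target number of factors.
targets = [10, 20, 30]
min_share = {n: 1 / n for n in targets}

for n, share in min_share.items():
    print(f"target={n}: each driver must hold at least {share:.1%} of the total")
```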

In [2]:
from key_drivers import find_key_drivers

columns_of_interest = ["items_per_transaction", "income_per_item"]

key_drivers = find_key_drivers(
    df_joined,
    10,
    "store_nbr",
    group_cols,
    columns_of_interest,
)
key_drivers

[DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'A', 'store_cluster': '14', 'opening_time_cat': 'already open'}, total=0.10862596457116738, vcount=144, id_count=3),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'A', 'opening_time_cat': 'already open'}, total=0.12307345564186359, vcount=144, id_count=3),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'D', 'opening_time_cat': 'already open'}, total=0.16594084264371697, vcount=333, id_count=7),
 DrivingFactor(categories={'store_city': 'Guayaquil', 'store_state': 'Guayas', 'opening_time_cat': 'already open'}, total=0.11634425442394761, vcount=336, id_count=7),
 DrivingFactor(categories={'store_type': 'B', 'store_cluster': '6.0'}, total=0.10326773548796588, vcount=240, id_count=5),
 DrivingFactor(categories={'store_type': 'C', 'opening_time_cat': 'already open'}, total=0.16083416067337672, vcount=576, id_cou

## But that seems kind of lumpy...

Great point! By specifying a small number of factors, we're hunting for very big effects. To achieve this, we need to merge together a large number of categories to get the requisite impact on the target KPI.

However, if we make the target number a bit larger...

In [3]:
key_drivers = find_key_drivers(
    df_joined,
    20,
    "store_nbr",
    group_cols,
    columns_of_interest,
)
key_drivers

[DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'A', 'store_cluster': '11', 'opening_time_cat': 'already open'}, total=0.08264584231232966, vcount=96, id_count=2),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'A', 'store_cluster': '14', 'opening_time_cat': 'already open'}, total=0.10862596457116738, vcount=144, id_count=3),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'D', 'store_cluster': '8', 'opening_time_cat': 'already open'}, total=0.09439109307866639, vcount=144, id_count=3),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'D', 'store_cluster': '13', 'opening_time_cat': 'already open'}, total=0.05500901712920686, vcount=141, id_count=3),
 DrivingFactor(categories={'store_state': 'Pichincha', 'store_type': 'B', 'store_cluster': '6.0', 'opening_time_cat': 'already open'}, total=0.05429017254983364, vcou

We see more granular data emerging. If we continue, we'll even begin to see the contributions of individual stores to overall profit. Stepping the number of target factors "up" in this way is a great way to get a sense of the "shape" of the data, and to see where the biggest opportunities lie.

In [4]:
key_drivers = find_key_drivers(
    df_joined,
    30,
    "store_nbr",
    group_cols,
    columns_of_interest,
)
key_drivers

[DrivingFactor(categories={'store_nbr': '3'}, total=0.04461223966325644, vcount=48, id_count=1),
 DrivingFactor(categories={'store_nbr': '44'}, total=0.040427613329533936, vcount=48, id_count=1),
 DrivingFactor(categories={'store_nbr': '45'}, total=0.04831346528078318, vcount=48, id_count=1),
 DrivingFactor(categories={'store_nbr': '46'}, total=0.03400135186034735, vcount=48, id_count=1),
 DrivingFactor(categories={'store_nbr': '47'}, total=0.042672169443293065, vcount=48, id_count=1),
 DrivingFactor(categories={'store_nbr': '49'}, total=0.03433237703154648, vcount=48, id_count=1),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'D', 'store_cluster': '8', 'opening_time_cat': 'already open'}, total=0.04977885341540996, vcount=96, id_count=2),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'D', 'store_cluster': '13', 'opening_time_cat': 'already open'}, total=0.05500901712920686, vcount=141, id_count
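
These printed lists get hard to scan as the target grows. Here's a minimal sketch for turning them into a table. The `DrivingFactor` class below is a stand-in I've defined to mirror the fields visible in the printed output (it is not imported from the package), and the two rows are copied from the results above with totals rounded.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class DrivingFactor:
    """Stand-in with the same field names as the printed output above."""

    categories: dict
    total: float
    vcount: int
    id_count: int


# Two entries copied (and rounded) from the target=30 output.
factors = [
    DrivingFactor({"store_nbr": "3"}, 0.0446, 48, 1),
    DrivingFactor(
        {
            "store_city": "Quito",
            "store_state": "Pichincha",
            "store_type": "D",
            "store_cluster": "8",
            "opening_time_cat": "already open",
        },
        0.0498,
        96,
        2,
    ),
]

# One row per factor, largest contribution first.
table = pd.DataFrame(
    {
        "categories": [
            ", ".join(f"{k}={v}" for k, v in f.categories.items()) for f in factors
        ],
        "total": [f.total for f in factors],
        "vcount": [f.vcount for f in factors],
        "id_count": [f.id_count for f in factors],
    }
).sort_values("total", ascending=False, ignore_index=True)
print(table)
```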

# Reducing Granularity

While the above categories represent a good start, it's often the case that subpopulations exist at a higher level than our explicit data allows. These subpopulations may be driven by factors that we haven't considered, or that we can't measure. For example, we might not have data on the number of children in a household, but we might have data on the level of grocery spending. Households with children may have a different spending pattern to those without, and so we might want to identify these common traits.

We're often not even sure what we're looking for when we do this! But if we make some simple assumptions, such as:

- Relevant subpopulations will behave similarly in terms of our target KPI
- Relevant subpopulations will have a similar size in terms of our target KPI
- Conditional on other dimensions, subpopulations will post-hoc have some unifying characteristic

then we can apply semi-supervised or unsupervised methods to identify these subpopulations.

In this case, we're going to use the combination of UMAP and DBSCAN to identify these subpopulations. UMAP is a dimensionality reduction technique that is particularly good at preserving local structure, and DBSCAN is a clustering algorithm that is good at identifying clusters without presuming the number of clusters that ought to be there. Together, they can identify subpopulations that are similar in terms of our target KPI, and that are of a similar size.
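
To show the clustering half of that idea, here's a minimal sketch using scikit-learn's `DBSCAN` on a hypothetical per-category profile. The category names, the two pre-scaled features, and the `eps`/`min_samples` values are all assumptions for illustration, and the UMAP embedding step is left out entirely. I'm not claiming this is what `reduce_cat_columns` does internally.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical, already-scaled profile per category: average KPI and summed KPI.
profile = pd.DataFrame(
    {
        "mean_kpi": [0.10, 0.12, 0.09, 0.50, 0.52, 0.95],
        "sum_kpi": [0.05, 0.06, 0.05, 0.30, 0.31, 1.00],
    },
    index=["city_a", "city_b", "city_c", "city_d", "city_e", "city_f"],
)

# DBSCAN needs no preset number of clusters; points with too few neighbours get -1.
labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(profile.to_numpy())
clusters = pd.Series(labels, index=profile.index, name="cluster")
print(clusters)
```

Categories that share a non-negative label behave alike on both features and are candidates for merging, while anything labelled `-1` stands on its own.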

## Example

We're going to use the same data as above, but this time we'll use UMAP and DBSCAN to identify subpopulations. We'll keep the same target KPI, but feed DBSCAN the sum of that KPI.

Often you'd have to tune the algorithm to get consistent results, but I've found that the following parameters work well for most datasets. If you have concerns about consistency, run the algorithm a few times and compare the groupings.

In [6]:
from key_drivers import reduce_cat_columns

df_reduced = reduce_cat_columns(
    df_joined,
    "store_nbr",
    group_cols,
    columns_of_interest,
)

In [9]:
for column in group_cols:
    modified = df_reduced[column].value_counts(normalize=True)
    modified = modified[~modified.index.isin(df_stores[column].unique())]
    if len(modified) == 0:
        continue
    print(modified, end="\n\n")

store_city
(Santo Domingo|Latacunga|Manta|El Carmen|Ibarra|Playas|Puyo|Riobamba|Salinas)    0.24102
(Machala|Babahoyo|Daule|Esmeraldas|Guaranda|Libertad|Loja|Quevedo)               0.16686
Name: proportion, dtype: float64

store_state
(Manabi|Santo Domingo de los Tsachilas|Cotopaxi|Chimborazo|Imbabura|Pastaza|Santa Elena)    0.22248
(El Oro|Los Rios|Bolivar|Esmeraldas|Loja)                                                   0.12978
Name: proportion, dtype: float64

store_cluster
(15|7)     0.12978
(12|16)    0.03708
Name: proportion, dtype: float64

opening_time_cat
(rush_open|brand_new)    0.11124
Name: proportion, dtype: float64



Notice that in most cases, the smaller chunks are moved into similar categories. These smaller categories often make little impact on their own, and so their aggregate effect is sometimes overlooked. If you suspect this to be the case in your data, this is a great way to identify these subpopulations and to highlight their impact on your business objectives.
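
The merged names in the output above look like the member categories joined with `|` and wrapped in parentheses. Here's a minimal sketch of that kind of relabelling, assuming you already have a cluster label per category (the labels below and the helper name `merged_names` are hypothetical):

```python
import pandas as pd

# Hypothetical cluster label per category; -1 means "leave this category alone".
cluster_of = pd.Series(
    {"Quito": -1, "Guayaquil": -1, "Manta": 0, "Ibarra": 0, "Loja": 1, "Machala": 1}
)


def merged_names(cluster_of: pd.Series) -> dict:
    """Map each category to a combined '(a|b)' name for its cluster."""
    mapping = {}
    for label, members in cluster_of.groupby(cluster_of):
        names = list(members.index)
        for name in names:
            if label == -1 or len(names) == 1:
                # Unclustered or singleton categories keep their original name.
                mapping[name] = name
            else:
                mapping[name] = "(" + "|".join(names) + ")"
    return mapping


mapping = merged_names(cluster_of)
stores = pd.Series(["Quito", "Manta", "Ibarra", "Loja"])
print(stores.map(mapping).tolist())
```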

We can then plug these values into the same KPI sweeps we used above:

In [10]:
key_drivers = find_key_drivers(
    df_reduced,
    10,
    "store_nbr",
    group_cols,
    columns_of_interest,
)
key_drivers

[DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'A', 'store_cluster': '14', 'opening_time_cat': 'already open'}, total=0.10862596457116738, vcount=144, id_count=3),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'A', 'opening_time_cat': 'already open'}, total=0.12307345564186359, vcount=144, id_count=3),
 DrivingFactor(categories={'store_city': 'Quito', 'store_state': 'Pichincha', 'store_type': 'D', 'opening_time_cat': 'already open'}, total=0.16594084264371697, vcount=333, id_count=7),
 DrivingFactor(categories={'store_city': '(Machala|Babahoyo|Daule|Esmeraldas|Guaranda|Libertad|Loja|Quevedo)', 'store_state': '(El Oro|Los Rios|Bolivar|Esmeraldas|Loja)', 'opening_time_cat': 'already open'}, total=0.13784527491869775, vcount=336, id_count=7),
 DrivingFactor(categories={'store_city': 'Guayaquil', 'store_state': 'Guayas', 'opening_time_cat': 'already open'}, total=0.11634425442394761, vcount=336, id_c

And indeed, if we look at the result, several aggregated categories turn out to have a large impact on the KPIs we're studying.