### Title: Marginal feature effects

### Authors: Tahmeed Shafiq (tahmeed@lighthousereports.com)

### Last updated: 17 Oct 2024

This notebook unpickles and examines the algorithm.

**Setup and unpickling**

In [32]:
#Check requirements. Don't skip this step; I made adjustments to the requirements.txt file
# pip install -r requirements.txt

In [52]:
import pickle 
import joblib as jl
import xgboost
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore", category=FutureWarning)

Matplotlib is building the font cache; this may take a moment.


Unpickle the models and examine its basic structure.

In [39]:
#Prepilot model
model=jl.load(open('wpi_model.pkl', 'rb'))
emb = model['model']

#Uncomment lines below to see more details
print('The model is a \n', type(model), '\n with keys given by: \n', model.keys())
#print('\n \n The parameters of the model are given by: \n', emb_model.get_params())

#Stand-in for reweighed model
model2 = model
emb2 = model2['model']

"""GG Note: Why are we changing feature importances? Just fyi these don't actually influence the model itself, since 
the feautre importances are only descriptive of the relationships that are already in the model formula """

#Change feature importances
model2['feature_importance']['deelnames_started_percentage_last_year']=0.07317625572
model2['feature_importance']['days_since_last_relocation']=0.02227508473
model2['feature_importance']['afspraken_no_show_count_last_year']=0.0003162555345
model2['feature_importance']['total_vermogen']=0.01292534192
model2['feature_importance']['days_since_last_dienst_end']=0.008268376334
model2['feature_importance']['has_medebewoner']=0.005575243894
model2['feature_importance']['avg_percentage_maatregel']=0.003545110815
model2['feature_importance']['at_least_one_address_in_amsterdam']=0.0003825554705
model2['feature_importance']['afspraken_no_contact_count_last_year']=0.01296115356
model2['feature_importance']['active_address_count']=0.0000305
model2['feature_importance']['has_partner']=-0.0003825554705
model2['feature_importance']['sum_inkomen_bruto_value']=-0.0006325110689
model2['feature_importance']['sum_inkomen_bruto_was_mean_imputed']=-0.000698811005
model2['feature_importance']['applied_for_same_product_last_year']=-0.001662821754
model2['feature_importance']['received_same_product_last_year']=-0.005228500068

The model is a 
 <class 'dict'> 
 with keys given by: 
 dict_keys(['model', 'performance', 'bootstrapped_performance', 'train_performance', 'feature_importance', 'flags', 'feature_selection_method'])


In [51]:
"""
GG Note: Cool that you figured out how to get the labels directly from the model pipleline!
"""

# Access the final estimator (classifier) in the pipeline
final_estimator = emb.named_steps['clf'].best_estimator_

# Get the class labels
class_labels = final_estimator.classes_

print("The class labels are", class_labels, "which correspond to 'not onderzoekswaardig (doesn't merit further investigation)'and 'onderzoekswaardig (mertis further investigation)' respectively")

The class labels are [0 1] which correspond to 'not onderzoekswaardig (doesn't merit further investigation)'and 'onderzoekswaardig (mertis further investigation)' respectively


**Create synthetic data**

In [45]:
prep_step = emb.named_steps['prep']
# Extract feature names
features = []
for feature in prep_step.features:
    features.extend(feature[0])

print("Feature names:", features)

Feature names: ['active_address_count', 'afspraken_no_show_count_last_year', 'received_same_product_last_year', 'total_vermogen', 'days_since_last_dienst_end', 'applied_for_same_product_last_year', 'at_least_one_address_in_amsterdam', 'has_medebewoner', 'has_partner', 'avg_percentage_maatregel', 'deelnames_started_percentage_last_year', 'days_since_last_relocation', 'afspraken_no_contact_count_last_year', 'sum_inkomen_bruto']
14


In [46]:
#Define ranges for each feature
#Distinguish between ranges that are integers or decimals
#Note that categorical features still use 0/1
#Note that sum_inkomen_bruto_was_mean_imputed not included in feature list. Is it only for reweighed model?
#Note that feature name is sum_inkomen_bruto not sum_inkomen_bruto_value like documentation
ranges={
    'deelnames_started_percentage_last_year':(0, 1, 'float'),
    'at_least_one_address_in_amsterdam':(0, 1, 'int'),
    'active_address_count':(0, 3, 'int'),
    'days_since_last_relocation':(0, 750, 'int'),
    'days_since_last_dienst_end':(0, 365, 'int'),
    'has_medebewoner':(0, 1, 'float'),
    'avg_percentage_maatregel':(0, 100000, 'float'),
    'total_vermogen':(-40000, 10000, 'float'),
    'afspraken_no_show_count_last_year':(0, 4, 'int'),
    'has_partner':(0, 1, 'int'),
    #'sum_inkomen_bruto_was_mean_imputed':(0, 1, 'int'),
    'applied_for_same_product_last_year':(0, 1, 'int'),
    'received_same_product_last_year':(0, 1, 'int'),
    'afspraken_no_contact_count_last_year':(0, 3, 'int'),
    'sum_inkomen_bruto':(0,2000, 'float')  
}

#Prepare DataFrame
nrow = 1000
df = pd.DataFrame()

# Generate random numbers for each feature within the specified range
for feature in features:
    low, high, dtype = ranges[feature]
    if dtype == 'int':
        df[feature] = np.random.randint(low, high, size=nrow)
    elif dtype == 'float':
        df[feature] = np.random.uniform(low, high, size=nrow)

df.columns = features
df.head()

Unnamed: 0,active_address_count,afspraken_no_show_count_last_year,received_same_product_last_year,total_vermogen,days_since_last_dienst_end,applied_for_same_product_last_year,at_least_one_address_in_amsterdam,has_medebewoner,has_partner,avg_percentage_maatregel,deelnames_started_percentage_last_year,days_since_last_relocation,afspraken_no_contact_count_last_year,sum_inkomen_bruto
0,1,1,0,1778.302065,341,0,0,0.901683,0,7233.219045,0.561229,438,1,1808.972001
1,0,2,0,-318.277243,232,0,0,0.258416,0,9037.04743,0.711933,730,1,1321.66705
2,2,1,0,-4563.654009,18,0,0,0.256922,0,3383.17769,0.133166,297,0,1187.198213
3,0,1,0,-27433.272944,116,0,0,0.737144,0,81120.181388,0.859493,390,0,1449.866894
4,2,2,0,-29196.464811,103,0,0,0.601461,0,7502.969364,0.501087,575,0,1761.199482


**Conduct 'A/B test'**

In [48]:
"""
GG Note: This function isn't working as intended. See note on '==' below. 
"""

#Helper function to create data at both ends of the range
def ab_data(df, feature):
    low, high = ranges[feature][0], ranges[feature][1]
    if ranges[feature][2] == 'float':
        print(f'Warning: A/B data for {feature} is not binary. Model predictions may be misleading.')
    df_low, df_high = df.copy(), df.copy() # GG Note: I don't know 100% if it makes a difference but better to just use .copy() here

    """
    GG Note: This is an easy typo to make. Remember that == is only for comparisons, so you aren't actually reassigning anything here. 
    """
    #df_low[feature] == low
    #df_high[feature] == high

    df_low[feature] = low
    df_high[feature] = high

    return(df_low, df_high)

In [81]:
#Which features to A/B test
#Probably most useful for features that don't seem to be proxies
ab_features = ['has_partner', 'received_same_product_last_year','applied_for_same_product_last_year']
ab_results = pd.DataFrame(columns=[
    'feature', 'minimized_prepilot', 'maximized_prepilot', 
    'minimized_reweighed', 'maximized_reweighed'
])
threshold = 0.56

for feature in ab_features:
    feature_data_low, feature_data_high = ab_data(df, feature)
    

    """
    GG Note: In addition to these metrics let's also take the average difference in risk score. 
    """
    #Only take column of "mertis further investigation"
    #Prepilot model
    prepilot_low = np.count_nonzero(emb.predict_proba(feature_data_low)[:, 1] > threshold)
    prepilot_high = np.count_nonzero(emb.predict_proba(feature_data_high)[:, 1] > threshold)
    #Reweighed model
    reweighed_low = np.count_nonzero(emb2.predict_proba(feature_data_low)[:,1] > threshold)
    reweighed_high = np.count_nonzero(emb2.predict_proba(feature_data_high)[:,1] > threshold)
    #Store data
    ab_results = ab_results.append({
        'feature': feature,
        'minimized_prepilot': prepilot_low,
        'maximized_prepilot': prepilot_high,
        'minimized_reweighed': reweighed_low,
        'maximized_reweighed': reweighed_high
    }, ignore_index=True)

In [82]:
ab_results

Unnamed: 0,feature,minimized_prepilot,maximized_prepilot,minimized_reweighed,maximized_reweighed
0,has_partner,940,940,940,940
1,received_same_product_last_year,940,940,940,940
2,applied_for_same_product_last_year,940,940,940,940


In [None]:
#Plot 
#Note: Interpret results in context

**Prepare models for PDP**