# "Open Payments" dataset

In [39]:
# data manipulation
from scipy.io import arff
import pandas as pd
import numpy as np

Version 2 of the open_payments dataset is marked inactive on OpenML, meaning that issues have been found in it. We therefore use an alternative version that differs from the original in the feature types, which we can change, and in the number of available entries.

In [40]:
from sklearn.datasets import fetch_openml
id = 42738
data = fetch_openml(data_id = id, parser = 'auto')

# The returned dataset is a Bunch object, similar to a dictionary
X = data['data']
y = data['target']

In [41]:
# Summary vectors creation

default_summary  = []
encoder_summary  = []
value_summary    = []
time_summary     = []
n_models_summary = []
card_9_summary   = []

#### Description
Payments given by healthcare manufacturing companies to medical doctors or hospitals

#### Attributes Information
- status  [target]
- Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name
- Dispute_Status_for_Publication
- Name_of_Associated_Covered_Device_or_Medical_Supply
- Name_of_Associated_Covered_Drug_or_Biological
- Physician_Specialty

Convert all columns to object type.

In [42]:
X.head()

Unnamed: 0,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name,Dispute_Status_for_Publication,Name_of_Associated_Covered_Device_or_Medical_Supply1,Name_of_Associated_Covered_Drug_or_Biological1,Physician_Specialty
0,ELI LILLY AND COMPANY,No,,,Allopathic & Osteopathic Physicians|Pediatrics...
1,ELI LILLY AND COMPANY,No,,,Allopathic & Osteopathic Physicians|Internal M...
2,ELI LILLY AND COMPANY,No,,,Allopathic & Osteopathic Physicians|Internal M...
3,ELI LILLY AND COMPANY,No,,,Allopathic & Osteopathic Physicians|Internal M...
4,ELI LILLY AND COMPANY,No,,EFFIENT,Allopathic & Osteopathic Physicians|Pediatrics...


In [43]:
X.dtypes

Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name      object
Dispute_Status_for_Publication                                     object
Name_of_Associated_Covered_Device_or_Medical_Supply1               object
Name_of_Associated_Covered_Drug_or_Biological1                     object
Physician_Specialty                                              category
dtype: object

In [44]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73558 entries, 0 to 73557
Data columns (total 5 columns):
 #   Column                                                         Non-Null Count  Dtype   
---  ------                                                         --------------  -----   
 0   Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name  73558 non-null  object  
 1   Dispute_Status_for_Publication                                 73558 non-null  object  
 2   Name_of_Associated_Covered_Device_or_Medical_Supply1           30499 non-null  object  
 3   Name_of_Associated_Covered_Drug_or_Biological1                 37431 non-null  object  
 4   Physician_Specialty                                            69562 non-null  category
dtypes: category(1), object(4)
memory usage: 2.4+ MB


In [45]:
X.shape

(73558, 5)

In [46]:
X.describe()

Unnamed: 0,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name,Dispute_Status_for_Publication,Name_of_Associated_Covered_Device_or_Medical_Supply1,Name_of_Associated_Covered_Drug_or_Biological1,Physician_Specialty
count,73558,73558,30499,37431,69562
unique,1466,2,4373,2263,513
top,Merck Sharp & Dohme Corporation,No,Vascular,Invokana,Allopathic & Osteopathic Physicians|Internal M...
freq,1975,73145,292,306,2481


We check for duplicate rows.

In [47]:
X.duplicated().sum()

1878

In [48]:
# Use ~ (logical NOT) to keep only the labels of the non-duplicated rows
y = y[~X.duplicated()]
X = X.drop_duplicates()
X.duplicated().sum()

0
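The order of the two statements above matters: the duplicate mask has to be computed while `X` still contains the duplicated rows, otherwise `y` would no longer line up with `X`. A minimal sketch of the same idea on toy data (the values below are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
X_toy = pd.DataFrame({"manufacturer": ["A", "A", "B", "B"],
                      "dispute": ["No", "No", "No", "Yes"]})
y_toy = pd.Series(["allowed", "allowed", "disallowed", "disallowed"],
                  name="status")

# Compute the mask once, before X_toy is modified, and apply it to both objects
keep = ~X_toy.duplicated()
X_dedup = X_toy[keep]
y_dedup = y_toy[keep]

# Features and labels still share the same index after filtering
print(X_dedup.index.equals(y_dedup.index))
```

Storing the mask in a variable makes it harder to filter `y` against an already deduplicated `X` by accident.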

## Study of NA's

In [49]:
X.isna().sum().sort_values(ascending = False)

Name_of_Associated_Covered_Device_or_Medical_Supply1             41732
Name_of_Associated_Covered_Drug_or_Biological1                   35203
Physician_Specialty                                               3472
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name        0
Dispute_Status_for_Publication                                       0
dtype: int64

In [50]:
X.isna().sum().sort_values(ascending = False)/len(X)

Name_of_Associated_Covered_Device_or_Medical_Supply1             0.582199
Name_of_Associated_Covered_Drug_or_Biological1                   0.491113
Physician_Specialty                                              0.048438
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name    0.000000
Dispute_Status_for_Publication                                   0.000000
dtype: float64

As can be seen, three of the five variables contain np.nan.

Two of them have roughly 50% missing values, so for these two variables 'NA' will be treated as an additional category.

In [51]:
X = X.copy()
X['Name_of_Associated_Covered_Device_or_Medical_Supply1'] = X['Name_of_Associated_Covered_Device_or_Medical_Supply1'].fillna('NA')
X['Name_of_Associated_Covered_Drug_or_Biological1'] = X['Name_of_Associated_Covered_Drug_or_Biological1'].fillna('NA')
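The `fillna('NA')` calls above work directly because both columns have `object` dtype. As a hedged sketch on toy data (hypothetical values), the snippet below shows the same step and the extra call that would be needed if a column already had `category` dtype, where the new level has to be registered first:

```python
import numpy as np
import pandas as pd

# Toy column with missing values (hypothetical values)
s_obj = pd.Series(["Invokana", np.nan, "Xarelto", np.nan], dtype="object")

# Object dtype: missing values can be replaced by a new label directly
filled_obj = s_obj.fillna("NA")

# Category dtype: the new label must be added to the categories first
s_cat = s_obj.astype("category")
filled_cat = s_cat.cat.add_categories("NA").fillna("NA")

print(filled_obj.value_counts())
print(list(filled_cat.cat.categories))
```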

## Type of Variables

In [52]:
obj_cols = X.select_dtypes(include=['object','category']).columns.tolist()
obj_cols

['Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name',
 'Dispute_Status_for_Publication',
 'Name_of_Associated_Covered_Device_or_Medical_Supply1',
 'Name_of_Associated_Covered_Drug_or_Biological1',
 'Physician_Specialty']

In [53]:
len(obj_cols)

5

In [54]:
X = X.copy()
for col in obj_cols:
    X[col] = X[col].astype('category')

In [55]:
cat_cols = X.select_dtypes(include=['category']).columns.tolist()
cat_cols

['Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name',
 'Dispute_Status_for_Publication',
 'Name_of_Associated_Covered_Device_or_Medical_Supply1',
 'Name_of_Associated_Covered_Drug_or_Biological1',
 'Physician_Specialty']

General review of the values of all variables.

In [56]:
X[cat_cols].apply(lambda col: col.nunique()).sort_values(ascending=False)

Name_of_Associated_Covered_Device_or_Medical_Supply1             4374
Name_of_Associated_Covered_Drug_or_Biological1                   2264
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name    1466
Physician_Specialty                                               513
Dispute_Status_for_Publication                                      2
dtype: int64

As Dispute_Status_for_Publication has only two categories, we convert it into a float feature that takes only the values 0 and 1.

In [57]:
X['Dispute_Status_for_Publication'] = X['Dispute_Status_for_Publication'].map({'No':0.0,'Yes':1.0}).astype(float)

## Value counts of the variables with the highest cardinality

#### Name_of_Associated_Covered_Device_or_Medical_Supply1

In [58]:
X.Name_of_Associated_Covered_Device_or_Medical_Supply1.value_counts()[0:10]

Name_of_Associated_Covered_Device_or_Medical_Supply1
NA                                                   41732
Vascular                                               268
Spine                                                  180
ARTHREX PRODUCT LINE DISTAL EXTREMITY ARTHROSCOPY      179
Surgical                                               140
ALL ARTHREX PRODUCT LINES                              140
LifeVest                                               133
Spinal Cord Neurostimulation - Neuro                   125
PAIN MANAGEMENT                                        115
Interventional Therapies                               114
Name: count, dtype: int64

#### Name_of_Associated_Covered_Drug_or_Biological1

In [59]:
X.Name_of_Associated_Covered_Drug_or_Biological1.value_counts()[0:10]

Name_of_Associated_Covered_Drug_or_Biological1
NA             35203
Invokana         305
Xarelto          247
NON-PRODUCT      208
BUTRANS          197
Prolia           193
NON BRAND        191
No Product       188
Nesina           187
ELIQUIS          174
Name: count, dtype: int64

#### Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name

In [60]:
X.Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name.value_counts()[0:10]

Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name
Merck Sharp & Dohme Corporation         1793
Novartis Pharmaceuticals Corporation    1448
Pfizer Inc.                             1276
Boston Scientific Corporation           1258
Covidien Sales LLC                      1061
Stryker Corporation                     1003
SANOFI-AVENTIS U.S. LLC                  976
AstraZeneca Pharmaceuticals LP           956
Genentech USA, Inc.                      907
AbbVie, Inc.                             884
Name: count, dtype: int64

## Response variable distribution

One of the problems we encountered is that the repeated entries we eliminated are all classified as allowed, which further reduces the number of positive observations.

In [61]:
y.value_counts()

status
disallowed    68809
allowed        2871
Name: count, dtype: int64

In [62]:
y.value_counts(normalize=True)

status
disallowed    0.959947
allowed       0.040053
Name: proportion, dtype: float64
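With a split of roughly 96% to 4%, plain accuracy is a poor guide, which is why balanced accuracy is used as the metric in the rest of the notebook. The sketch below uses the class counts printed above and a hypothetical classifier that always predicts `disallowed`; the helper `balanced_accuracy` is written here only for illustration and computes the mean of the per-class recalls:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Class counts taken from y.value_counts() above
y_true = ["disallowed"] * 68809 + ["allowed"] * 2871
# Hypothetical trivial classifier: always predict the majority class
y_pred = ["disallowed"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy: {accuracy:.4f}, balanced accuracy: {bal_acc:.4f}")
```

The trivial classifier scores about 0.96 in accuracy but only 0.5 in balanced accuracy, so the balanced accuracies reported later should be read against a baseline of 0.5.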

## Summary

In [63]:
print('open_payments')
print('------------------------------------------------')
print(f'Id on OpenML: {id}')
print(f'Columns: {X.shape[1]}')
print(f'Rows: {X.shape[0]}')

numeric_cols = X.select_dtypes(exclude=['object','category']).columns
num_binary_cols = [col for col in X.columns if X[col].isin([0.0, 1.0]).all()]
cat_binary_cols = [col for col in X.columns if X[col].isin(['0','1']).all()]

if len(numeric_cols)!=0:
    num_numeric_cols = len(numeric_cols) - len(num_binary_cols) - len(cat_binary_cols)
    num_binary_cols = len(num_binary_cols) + len(cat_binary_cols)
else:
    num_numeric_cols = len(numeric_cols)
    num_binary_cols = len(num_binary_cols) + len(cat_binary_cols)


object_cols = X.select_dtypes(include=['object','category']).columns
num_object_cols = len(object_cols) - len(cat_binary_cols)

print(f'Numeric variables: {num_numeric_cols}')
print(f'Binary variables: {num_binary_cols}')
print(f'Categorical variables: {num_object_cols}')


porcentaje_na = round((X.isna().sum().sum() / (X.shape[0] * X.shape[1])) * 100, 2)
print(f'Percentage of missing values: {porcentaje_na}%')

df_obj = X[object_cols]
if len(df_obj.nunique()[df_obj.nunique()>9].tolist()) > 0:
    numHighCardLevels = sorted(df_obj.nunique()[df_obj.nunique()>9].tolist())
    print(f'Number of levels for high cardinality variables (>9): {numHighCardLevels}')
else:
    print('Highest cardinalities:')

open_payments
------------------------------------------------
Id on OpenML: 42738
Columns: 5
Rows: 71680
Numeric variables: 0
Binary variables: 1
Categorical variables: 4
Percentage of missing values: 0.97%
Number of levels for high cardinality variables (>9): [513, 1466, 2264, 4374]


## Train-Test Split

In [64]:
from sklearn.model_selection import train_test_split

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.33, 
                                                    random_state = 42,
                                                    stratify = y)

In [66]:
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape:  {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape:  {y_test.shape}')

X_train shape: (48025, 5)
X_test shape:  (23655, 5)
y_train shape: (48025,)
y_test shape:  (23655,)
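Because the positive class is so small, the split above passes `stratify = y` so that both partitions keep approximately the same class proportions. A small check of that behaviour on synthetic labels (the 4% positive rate is chosen to resemble the dataset; the data itself is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 4% positives (hypothetical data)
y_toy = np.array([1] * 40 + [0] * 960)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

# Share of positives in each partition
train_rate, test_rate = y_tr.mean(), y_te.mean()
print(f"train: {train_rate:.4f}, test: {test_rate:.4f}")
```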


## Transformation Pipelines

In [67]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score

import scipy.stats
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

import time

In [68]:
num_cols = X_train.select_dtypes(include=['number']).columns.to_list()
cat_cols = X_train.select_dtypes(include=['category','object']).columns.to_list()

In [69]:
# Define the HistGradientBoostingClassifier models
hgb_default = HistGradientBoostingClassifier(random_state=1234,
                                             max_iter = 25,
                                             early_stopping=False,
                                             min_samples_leaf=40,
                                             scoring='balanced_accuracy',
                                             class_weight='balanced')

# Define the hyperparameter search space
param_distributions = {
    'model__max_iter': scipy.stats.randint(5, 50),
    'model__learning_rate': scipy.stats.uniform(0.01, 0.3),
    'model__min_samples_leaf': scipy.stats.randint(10, 70),
}

# Create a StratifiedKFold cross-validation instance
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)
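One detail of the search space above is easy to misread: `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, and the upper bound of `scipy.stats.randint(low, high)` is exclusive. The short check below inspects the support of the same two distributions:

```python
import scipy.stats

# Same distributions as in param_distributions
lr_dist = scipy.stats.uniform(0.01, 0.3)   # loc = 0.01, scale = 0.3
iter_dist = scipy.stats.randint(5, 50)     # low = 5, high = 50 (exclusive)

lr_low, lr_high = lr_dist.support()
it_low, it_high = iter_dist.support()

print(f"learning_rate range: [{lr_low}, {lr_high}]")
print(f"max_iter range: {it_low} to {it_high} (inclusive)")
```

So the learning rate is drawn from about 0.01 to 0.31, and `max_iter` from 5 to 49.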

### One Hot Encoding + HistGradientBoosting

#### Preprocessing

In [70]:
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "median"))
])

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", OneHotEncoder(drop = "first", handle_unknown = "ignore"))
])

preprop_pipeline = ColumnTransformer(
    transformers = [("one_hot", cat_pipeline, cat_cols)],
    sparse_threshold=0
)
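Setting `sparse_threshold=0` makes the `ColumnTransformer` return a dense array, which becomes expensive with categorical variables of this cardinality. The back-of-the-envelope estimate below uses the level counts from the summary and the training shape printed earlier; it assumes 8-byte floats and that every level appears in the training split, so it should be read as an upper bound rather than a measurement:

```python
# Level counts of the four categorical columns (from the summary above)
levels = [513, 1466, 2264, 4374]
n_train_rows = 48025          # from X_train.shape
bytes_per_value = 8           # assumption: float64 output

# drop="first" removes one indicator column per variable
n_ohe_cols = sum(levels) - len(levels)
dense_bytes = n_train_rows * n_ohe_cols * bytes_per_value

print(f"one-hot columns: {n_ohe_cols}")
print(f"dense size: {dense_bytes / 1e9:.2f} GB")
```

A dense matrix of a few gigabytes is consistent with the one-hot pipeline taking far longer to fit than the other encoders in this notebook.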

#### Create a HistGradientBoostingClassifier model with default parameters and no early stopping

In [71]:
ohe_hgb_default_pipeline = Pipeline([("preprocessing",preprop_pipeline),
                                     ('model', hgb_default)])

In [72]:
tic = time.time()

ohe_hgb_default = ohe_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
ohe_hgb_default_time_taken = toc-tic

In [73]:
# Display pipeline
print("Time taken: ", ohe_hgb_default_time_taken)
ohe_hgb_default

Time taken:  54.28791785240173


In [74]:
# Calculate balanced accuracy for the model with default parameters
y_ohe_hgb_default_pred = ohe_hgb_default.predict(X_test)
ohe_hgb_default_accuracy = balanced_accuracy_score(y_test, y_ohe_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {ohe_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("OneHotEncoding")
value_summary.append(ohe_hgb_default_accuracy)
time_summary.append(ohe_hgb_default_time_taken)
n_models_summary.append(1)



Balanced accuracy with default parameters: 0.8517453064189986


#### Create a HistGradientBoostingClassifier model for tuning

In [75]:
ohe_hgb_tune = RandomizedSearchCV(estimator = ohe_hgb_default_pipeline, 
                                  param_distributions = param_distributions, 
                                  n_iter = 10,
                                  cv = stratified_kfold,
                                  scoring = 'balanced_accuracy', 
                                  random_state = 1234,
                                  n_jobs = -1, 
                                  verbose = 1)

In [76]:
tic = time.time()

ohe_hgb_tune = ohe_hgb_tune.fit(X_train, y_train)

toc = time.time()
ohe_hgb_tune_time_taken = toc-tic

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [77]:
# Display pipeline
print("Time taken: ", ohe_hgb_tune_time_taken)
ohe_hgb_tune

Time taken:  584.4420716762543


In [78]:
# Predict using the model with the best parameters
y_ohe_hgb_tune_pred = ohe_hgb_tune.predict(X_test)

# Get the best parameters
ohe_hgb_tune_best_params = ohe_hgb_tune.best_params_
print(f'Best parameters: {ohe_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
ohe_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_ohe_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {ohe_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("OneHotEncoding")
value_summary.append(ohe_hgb_tune_accuracy)
time_summary.append(ohe_hgb_tune_time_taken)
n_models_summary.append(ohe_hgb_tune.n_iter)



Best parameters: {'model__learning_rate': 0.23148603040441015, 'model__max_iter': 46, 'model__min_samples_leaf': 21}
Balanced accuracy with best parameters: 0.8755766473919198


### Count Encoder + HistGradientBoosting


In [79]:
from category_encoders.count import CountEncoder

#### Preprocessing

In [80]:
cat_pipeline_more9 = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", CountEncoder())
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num", num_pipeline, num_cols),
                    ("count_encoder", cat_pipeline_more9, cat_cols)],
    sparse_threshold=0
)
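As a rough illustration of what count encoding does, the sketch below replaces each category by the number of times it occurs in the training data. It is a plain-Python sketch, not the `category_encoders` implementation; the helper names and the choice to map unseen categories to 0 are assumptions made for this example:

```python
from collections import Counter

def fit_count_encoding(train_values):
    """Count how often each category occurs in the training data."""
    return Counter(train_values)

def transform_count_encoding(values, counts, unknown=0):
    """Replace each category by its training count (assumed 0 if unseen)."""
    return [counts.get(v, unknown) for v in values]

# Toy data (hypothetical rows)
train = ["Pfizer Inc.", "Pfizer Inc.", "Stryker Corporation", "Pfizer Inc."]
test = ["Stryker Corporation", "AbbVie, Inc."]

counts = fit_count_encoding(train)
train_enc = transform_count_encoding(train, counts)
test_enc = transform_count_encoding(test, counts)
print(train_enc, test_enc)
```

Each categorical column becomes a single numeric column, which is why this encoder is so much cheaper than one-hot encoding for high-cardinality variables.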

#### Create a HistGradientBoostingClassifier model with default parameters and no early stopping

In [81]:
count_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                       ('model', hgb_default)])

In [82]:
tic = time.time()

count_hgb_default = count_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
count_hgb_default_time_taken = toc-tic

In [83]:
# Display pipeline
print("Time taken: ", count_hgb_default_time_taken)
count_hgb_default

Time taken:  0.33795857429504395


In [84]:
# Calculate balanced accuracy for the model with default parameters
y_count_hgb_default_pred = count_hgb_default.predict(X_test)
count_hgb_default_accuracy = balanced_accuracy_score(y_test, y_count_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {count_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("CountEncoding")
value_summary.append(count_hgb_default_accuracy)
time_summary.append(count_hgb_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.8831639515419952


#### Create a HistGradientBoostingClassifier model for tuning

In [85]:
count_hgb_tune = RandomizedSearchCV(estimator = count_hgb_default_pipeline, 
                                   param_distributions = param_distributions, 
                                   n_iter = 10,
                                   cv = stratified_kfold,
                                   scoring = 'balanced_accuracy', 
                                   random_state = 1234,
                                   n_jobs = -1)

In [86]:
tic = time.time()

count_hgb_tune = count_hgb_tune.fit(X_train, y_train)

toc = time.time()
count_hgb_tune_time_taken = toc-tic

In [87]:
# Display pipeline
print("Time taken: ", count_hgb_tune_time_taken)
count_hgb_tune

Time taken:  5.8771491050720215


In [88]:
# Predict using the model with the best parameters
y_count_hgb_tune_pred = count_hgb_tune.predict(X_test)

# Get the best parameters
count_hgb_tune_best_params = count_hgb_tune.best_params_
print(f'Best parameters: {count_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
count_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_count_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {count_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("CountEncoding")
value_summary.append(count_hgb_tune_accuracy)
time_summary.append(count_hgb_tune_time_taken)
n_models_summary.append(count_hgb_tune.n_iter)

Best parameters: {'model__learning_rate': 0.29744180610511156, 'model__max_iter': 31, 'model__min_samples_leaf': 64}
Balanced accuracy with best parameters: 0.8944403016376683


### Ordinal Encoding + HistGradientBoosting

In [89]:
from sklearn.preprocessing import OrdinalEncoder

#### Preprocessing

In [90]:
cat_pipeline_more9 = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", OrdinalEncoder(dtype = int,
                               handle_unknown = 'use_encoded_value',
                               unknown_value = 99999,
                               encoded_missing_value = 99999))
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num", num_pipeline, num_cols),
                    ("ordinal_encoder", cat_pipeline_more9, cat_cols)],
    sparse_threshold=0
)

#### Create a HistGradientBoostingClassifier model with default parameters and no early stopping

In [91]:
ordinal_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                         ('model', hgb_default)])

In [92]:
tic = time.time()

ordinal_hgb_default = ordinal_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
ordinal_hgb_default_time_taken = toc-tic

In [93]:
# Display pipeline
print("Time taken: ", ordinal_hgb_default_time_taken)
ordinal_hgb_default

Time taken:  0.2350928783416748


In [94]:
# Calculate balanced accuracy for the model with default parameters
y_ordinal_hgb_default_pred = ordinal_hgb_default.predict(X_test)
ordinal_hgb_default_accuracy = balanced_accuracy_score(y_test, y_ordinal_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {ordinal_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("OrdinalEncoder")
value_summary.append(ordinal_hgb_default_accuracy)
time_summary.append(ordinal_hgb_default_time_taken)
n_models_summary.append(1)


Balanced accuracy with default parameters: 0.8367003455466666


#### Create a HistGradientBoostingClassifier model for tuning

In [95]:
ordinal_hgb_tune = RandomizedSearchCV(estimator = ordinal_hgb_default_pipeline, 
                                      param_distributions = param_distributions, 
                                      n_iter = 10,
                                      cv = stratified_kfold,
                                      scoring = 'balanced_accuracy', 
                                      random_state = 1234,
                                      n_jobs = -1)

In [96]:
tic = time.time()

ordinal_hgb_tune = ordinal_hgb_tune.fit(X_train, y_train)

toc = time.time()
ordinal_hgb_tune_time_taken = toc-tic

In [97]:
# Display pipeline
print("Time taken: ", ordinal_hgb_tune_time_taken)
ordinal_hgb_tune

Time taken:  2.0348551273345947


In [98]:
# Predict using the model with the best parameters
y_ordinal_hgb_tune_pred = ordinal_hgb_tune.predict(X_test)

# Get the best parameters
ordinal_hgb_tune_best_params = ordinal_hgb_tune.best_params_
print(f'Best parameters: {ordinal_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
ordinal_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_ordinal_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {ordinal_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("OrdinalEncoder")
value_summary.append(ordinal_hgb_tune_accuracy)
time_summary.append(ordinal_hgb_tune_time_taken)
n_models_summary.append(ordinal_hgb_tune.n_iter)

Best parameters: {'model__learning_rate': 0.23148603040441015, 'model__max_iter': 46, 'model__min_samples_leaf': 21}
Balanced accuracy with best parameters: 0.870580733983009


### Native HistGradientBoosting support for categorical variables


In [99]:
from sklearn.preprocessing import OrdinalEncoder

#### Preprocessing

In [100]:
cat_pipeline_more9 = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", OrdinalEncoder(dtype = int,
                               handle_unknown = 'use_encoded_value',
                               unknown_value = 99999,
                               encoded_missing_value = 99999,
                               max_categories = 254))
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num", num_pipeline, num_cols),
                    ("cat", cat_pipeline_more9, cat_cols)],
    sparse_threshold=0
)

In [101]:
category_features_for_nativesupport = [False]*len(num_cols) + [True]*len(cat_cols)

In [102]:
hgb_default_categories_support = HistGradientBoostingClassifier(random_state=1234,
                                                                max_iter=25,
                                                                early_stopping=False,
                                                                min_samples_leaf=40,
                                                                scoring='balanced_accuracy',
                                                                class_weight='balanced',
                                                                categorical_features=category_features_for_nativesupport)

#### Create a HistGradientBoostingClassifier model with default parameters and no early stopping

In [103]:
catsup_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                        ('model', hgb_default_categories_support)])

In [104]:
tic = time.time()

catsup_hgb_default = catsup_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
catsup_hgb_default_time_taken = toc-tic

In [105]:
# Display pipeline
print("Time taken: ", catsup_hgb_default_time_taken)
catsup_hgb_default

Time taken:  0.34642529487609863


In [106]:
# Calculate balanced accuracy for the model with default parameters
y_catsup_hgb_default_pred = catsup_hgb_default.predict(X_test)
catsup_hgb_default_accuracy = balanced_accuracy_score(y_test, y_catsup_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {catsup_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("HGB_NativeSupport")
value_summary.append(catsup_hgb_default_accuracy)
time_summary.append(catsup_hgb_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.8295086102074749


#### Create a HistGradientBoostingClassifier model for tuning

In [107]:
catsup_hgb_tune = RandomizedSearchCV(estimator = catsup_hgb_default_pipeline, 
                                   param_distributions = param_distributions, 
                                   n_iter = 10,
                                   cv = stratified_kfold,
                                   scoring = 'balanced_accuracy', 
                                   random_state = 1234,
                                   n_jobs = -1)

In [108]:
tic = time.time() 

catsup_hgb_tune = catsup_hgb_tune.fit(X_train, y_train)

toc = time.time()
catsup_hgb_tune_time_taken = toc-tic

In [109]:
# Display pipeline
print("Time taken: ", catsup_hgb_tune_time_taken)
catsup_hgb_tune

Time taken:  2.333542585372925


In [110]:
# Predict using the model with the best parameters
y_catsup_hgb_tune_pred = catsup_hgb_tune.predict(X_test)

# Get the best parameters
catsup_hgb_tune_best_params = catsup_hgb_tune.best_params_
print(f'Best parameters: {catsup_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
catsup_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_catsup_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {catsup_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("HGB_NativeSupport")
value_summary.append(catsup_hgb_tune_accuracy)
time_summary.append(catsup_hgb_tune_time_taken)
n_models_summary.append(catsup_hgb_tune.n_iter)

Best parameters: {'model__learning_rate': 0.2418479864837122, 'model__max_iter': 8, 'model__min_samples_leaf': 49}
Balanced accuracy with best parameters: 0.8589613855273666


### Target Encoder + HistGradientBoosting

In [111]:
from sklearn.preprocessing import TargetEncoder

#### Preprocessing

In [112]:
cat_pipeline_more9 = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("encoder", TargetEncoder())
])

preprop_pipeline = ColumnTransformer(
    transformers = [("num", num_pipeline, num_cols),
                    ("cat", cat_pipeline_more9, cat_cols)],
    sparse_threshold=0
)
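Target encoding replaces each category by a smoothed estimate of the mean target within that category. The sketch below illustrates the idea with a fixed smoothing weight; it is not scikit-learn's implementation, which by default chooses the amount of smoothing automatically and uses internal cross fitting in `fit_transform`. The function name and the `smooth` value are assumptions for this example:

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smooth=2.0):
    """Shrink each category mean towards the global mean of the target."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    encoding = {
        c: (sums[c] + smooth * global_mean) / (counts[c] + smooth)
        for c in counts
    }
    return encoding, global_mean

# Toy data (hypothetical): 1 = allowed, 0 = disallowed
cats = ["a", "a", "a", "b"]
ys = [1, 1, 0, 0]

encoding, global_mean = fit_target_encoding(cats, ys, smooth=2.0)
# Unseen categories fall back to the global mean in this sketch
encoded_new = [encoding.get(c, global_mean) for c in ["a", "b", "z"]]
print(encoded_new)
```

Rare categories such as `"b"` are pulled strongly towards the global mean, which limits overfitting on levels with very few rows.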

#### Create a HistGradientBoostingClassifier model with default parameters and no early stopping

In [113]:
target_hgb_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                        ('model', hgb_default)])

In [114]:
tic = time.time()

target_hgb_default = target_hgb_default_pipeline.fit(X_train, y_train)

toc = time.time()
target_hgb_default_time_taken = toc-tic

In [115]:
# Display pipeline
print("Time taken: ", target_hgb_default_time_taken)
target_hgb_default

Time taken:  0.27484583854675293


In [116]:
# Calculate balanced accuracy for the model with default parameters
y_target_hgb_default_pred = target_hgb_default.predict(X_test)
target_hgb_default_accuracy = balanced_accuracy_score(y_test, y_target_hgb_default_pred)
print(f'Balanced accuracy with default parameters: {target_hgb_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("TargetEncoder")
value_summary.append(target_hgb_default_accuracy)
time_summary.append(target_hgb_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.9163152359536684


#### Create a HistGradientBoostingClassifier model for tuning
 

In [117]:
target_hgb_tune = RandomizedSearchCV(estimator = target_hgb_default_pipeline, 
                                     param_distributions = param_distributions, 
                                     n_iter = 10,
                                     cv = stratified_kfold,
                                     scoring = 'balanced_accuracy', 
                                     random_state = 1234,
                                     n_jobs = -1)

In [118]:
tic = time.time() 

target_hgb_tune = target_hgb_tune.fit(X_train, y_train)

toc = time.time()
target_hgb_tune_time_taken = toc-tic

In [119]:
# Display pipeline
print("Time taken: ", target_hgb_tune_time_taken)
target_hgb_tune

Time taken:  2.0451889038085938


In [120]:
# Predict using the model with the best parameters
y_target_hgb_tune_pred = target_hgb_tune.predict(X_test)

# Get the best parameters
target_hgb_tune_best_params = target_hgb_tune.best_params_
print(f'Best parameters: {target_hgb_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
target_hgb_tune_accuracy = balanced_accuracy_score(y_test, y_target_hgb_tune_pred)
print(f'Balanced accuracy with best parameters: {target_hgb_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("TargetEncoder")
value_summary.append(target_hgb_tune_accuracy)
time_summary.append(target_hgb_tune_time_taken)
n_models_summary.append(target_hgb_tune.n_iter)

Best parameters: {'model__learning_rate': 0.14131832170213435, 'model__max_iter': 29, 'model__min_samples_leaf': 67}
Balanced accuracy with best parameters: 0.9122019294959802


### CatBoost

In [121]:
from catboost import CatBoostClassifier

#### Preprocessing

In [122]:
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent"))
])

preprop_pipeline = ColumnTransformer(
    transformers = [("cat_more9", cat_pipeline, cat_cols),
                    ("num", num_pipeline, num_cols)
                    ],
    sparse_threshold=0
)

In [123]:
category_features_for_catboostsupport = [index for index in range(len(cat_cols))]
print(category_features_for_catboostsupport)

[0, 1, 2, 3]


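CatBoost lets us set, through `one_hot_max_size`, the maximum number of unique categories up to which a categorical variable is one-hot encoded; variables with more categories are handled by CatBoost's own categorical encoding.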

In [124]:
# Create catboost models
catboost_default_raw = CatBoostClassifier(iterations=25,
                                          eval_metric = 'BalancedAccuracy',
                                          loss_function = 'Logloss',
                                          auto_class_weights = 'Balanced',
                                          od_type='Iter',
                                          one_hot_max_size = 0,
                                          random_seed = 1234,
                                          min_data_in_leaf = 40,
                                          cat_features=category_features_for_catboostsupport,
                                          verbose = False)

# cat_features is already passed to the constructor above, so this call is redundant but harmless
catboost_default_raw.set_params(cat_features = category_features_for_catboostsupport)

# Default CatBoostClassifier Pipeline
catboost_default_pipeline = Pipeline([('preprocessing', preprop_pipeline),
                                      ('model', catboost_default_raw)])

# Define the hyperparameter search space
catboost_param_distributions = {
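    'model__iterations': scipy.stats.randint(5, 50),         # integers 5..49 (upper bound is exclusive)
    'model__learning_rate': scipy.stats.uniform(0.01, 0.3),  # uniform(loc, scale): values in [0.01, 0.31]
    'model__min_data_in_leaf': scipy.stats.randint(10, 70),  # integers 10..69 (upper bound is exclusive)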
}

# Create a StratifiedKFold cross-validation instance
stratified_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)

catboost_tune_raw = RandomizedSearchCV(estimator = catboost_default_pipeline, 
                                   param_distributions = catboost_param_distributions, 
                                   n_iter = 10,
                                   cv = stratified_kfold,
                                   scoring = 'balanced_accuracy', 
                                   random_state = 1234,
                                   n_jobs = -1)
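The `scipy.stats` distributions in the search space are parameterised in ways that are easy to misread: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` spans `loc` to `loc + scale`. The sketch below is one way to check the bounds that are actually sampled. The variable names are illustrative and are not used elsewhere in the notebook.

```python
from scipy import stats

# Same distributions as in the CatBoost search space
iterations_dist = stats.randint(5, 50)
learning_rate_dist = stats.uniform(0.01, 0.3)

# support() returns the lowest and highest values the distribution can produce
iter_low, iter_high = iterations_dist.support()
lr_low, lr_high = learning_rate_dist.support()

print(f"iterations: {int(iter_low)} to {int(iter_high)} (inclusive)")
print(f"learning_rate: {lr_low:.2f} to {lr_high:.2f}")

# Draw reproducible samples and confirm they stay inside those bounds
samples = learning_rate_dist.rvs(size=1000, random_state=1234)
print(samples.min() >= lr_low, samples.max() <= lr_high)
```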

In [125]:
tic = time.time()

catboost_default = catboost_default_pipeline.fit(X_train, y_train)

toc = time.time()
catboost_default_time_taken = toc-tic

In [126]:
# Report the fit time and display the fitted pipeline
print("Time taken: ", catboost_default_time_taken)
catboost_default

Time taken:  1.1470975875854492


In [127]:
# Calculate balanced accuracy for the model with default parameters
y_catboost_default_pred = catboost_default.predict(X_test)
catboost_default_accuracy = balanced_accuracy_score(y_test, y_catboost_default_pred)
print(f'Balanced accuracy with default parameters: {catboost_default_accuracy}')

# Save results
default_summary.append("Default")
card_9_summary.append("AllVariables")
encoder_summary.append("CatboostNativeSupport")
value_summary.append(catboost_default_accuracy)
time_summary.append(catboost_default_time_taken)
n_models_summary.append(1)

Balanced accuracy with default parameters: 0.9165386778082851
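The metric reported here, `balanced_accuracy_score`, is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The following is a minimal pure-Python sketch of that definition on made-up labels. `balanced_accuracy`, `toy_true` and `toy_pred` are hypothetical names and this is not scikit-learn's implementation.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: 8 samples of class 0, 2 samples of class 1
toy_true = [0] * 8 + [1] * 2
toy_pred = [0] * 8 + [0, 1]  # one of the two minority samples is missed

plain_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(plain_accuracy)                         # 9 of 10 correct
print(balanced_accuracy(toy_true, toy_pred))  # (8/8 + 1/2) / 2
```

On the toy labels plain accuracy is 0.9 while balanced accuracy is 0.75, which shows why the latter is the more informative score when classes are imbalanced.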


In [128]:
tic = time.time()

catboost_tune = catboost_tune_raw.fit(X_train, y_train)

toc = time.time()
catboost_tune_time_taken = toc-tic

In [129]:
# Report the search time and display the fitted search object
print("Time taken: ", catboost_tune_time_taken)
catboost_tune

Time taken:  8.66935920715332


In [130]:
# Predict using the model with the best parameters
y_catboost_tune_pred = catboost_tune.predict(X_test)

# Get the best parameters
catboost_tune_best_params = catboost_tune.best_params_
print(f'Best parameters: {catboost_tune_best_params}')

# Calculate balanced accuracy for the model with the best parameters
catboost_tune_accuracy = balanced_accuracy_score(y_test, y_catboost_tune_pred)
print(f'Balanced accuracy with best parameters: {catboost_tune_accuracy}')

# Save results
default_summary.append("Tune")
card_9_summary.append("AllVariables")
encoder_summary.append("CatboostNativeSupport")
value_summary.append(catboost_tune_accuracy)
time_summary.append(catboost_tune_time_taken)
n_models_summary.append(catboost_tune.n_iter)

Best parameters: {'model__iterations': 48, 'model__learning_rate': 0.25454888021530064, 'model__min_data_in_leaf': 36}
Balanced accuracy with best parameters: 0.9182978464576398


### Results Summary

In [131]:
results_summary = pd.DataFrame({"Dataset":"open_payments",
                                "Variables":card_9_summary,
                                "Default/Tune":default_summary,
                                "Encoder":encoder_summary,
                                "Metric":"BalancedAccuracy",
                                "Value":value_summary,
                                "Time":time_summary,
                                "Iterations":n_models_summary})
results_summary["mean_Time"] = (results_summary["Time"] / results_summary["Iterations"])

results_summary

Unnamed: 0,Dataset,Variables,Default/Tune,Encoder,Metric,Value,Time,Iterations,mean_Time
0,open_payments,AllVariables,Default,OneHotEncoding,BalancedAccuracy,0.851745,54.287918,1,54.287918
1,open_payments,AllVariables,Tune,OneHotEncoding,BalancedAccuracy,0.875577,584.442072,10,58.444207
2,open_payments,AllVariables,Default,CountEncoding,BalancedAccuracy,0.883164,0.337959,1,0.337959
3,open_payments,AllVariables,Tune,CountEncoding,BalancedAccuracy,0.89444,5.877149,10,0.587715
4,open_payments,AllVariables,Default,OrdinalEncoder,BalancedAccuracy,0.8367,0.235093,1,0.235093
5,open_payments,AllVariables,Tune,OrdinalEncoder,BalancedAccuracy,0.870581,2.034855,10,0.203486
6,open_payments,AllVariables,Default,HGB_NativeSupport,BalancedAccuracy,0.829509,0.346425,1,0.346425
7,open_payments,AllVariables,Tune,HGB_NativeSupport,BalancedAccuracy,0.858961,2.333543,10,0.233354
8,open_payments,AllVariables,Default,TargetEncoder,BalancedAccuracy,0.916315,0.274846,1,0.274846
9,open_payments,AllVariables,Tune,TargetEncoder,BalancedAccuracy,0.912202,2.045189,10,0.204519
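The `mean_Time` column divides the total time by the number of sampled candidates, so for tuned models it is a per-candidate figure rather than a per-fit one. The sketch below makes the difference explicit using the CatBoost search settings (10 candidates, 3 folds, refit on the best candidate). `fits_in_search` is a hypothetical helper, and because the search runs with `n_jobs = -1` the per-fit value is only a rough share of wall-clock time.

```python
def fits_in_search(n_iter, n_splits, refit=True):
    """Model fits performed by a randomized search: one per candidate per fold, plus one refit."""
    return n_iter * n_splits + (1 if refit else 0)


# Values taken from the CatBoost tuning cells
n_iter, n_splits = 10, 3
total_time = 8.66935920715332  # seconds, wall-clock

time_per_candidate = total_time / n_iter  # what mean_Time reports
time_per_fit = total_time / fits_in_search(n_iter, n_splits)

print(fits_in_search(n_iter, n_splits))
print(round(time_per_candidate, 3), round(time_per_fit, 3))
```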


In [132]:
results_summary.to_excel("open_payments_results.xlsx")