# Custom ML Infill

This notebook is a companion to the essay [Custom ML Infill with Automunge](https://medium.com/automunge/custom-ml-infill-with-automunge-5b31d7cfd4d2), and demonstrates user-defined machine learning training and inference operations for integration into Automunge's ML infill.

We recommend reading the essay prior to this tutorial.

Missing data is a fundamental obstacle for machine learning, as backpropagation requires that all entries be valid. ML infill is a more sophisticated convention than common practices like mean imputation for numeric sets or constant imputation for categoric sets. The training data is partitioned separately for each feature set, and a machine learning model specific to that feature set is trained to impute its missing data based on properties of the surrounding features.

This tutorial demonstrates a new convention allowing users to define custom machine learning algorithms for integration into Automunge’s ML infill. These custom learning algorithms could be built around gradient boosting, neural networks, or even quantum machine learning, whatever you want. All you have to do is define a wrapper function for your model tuning / training and a wrapper function for inference. You pass those functions as part of the automunge(.) call, and we do all the rest. Sounds simple, doesn’t it?

You can either define separate wrapper functions for classification and regression, or you can define a single wrapper function and use the received labels column header to distinguish between whether a received label set is a target for classification (1) or regression (0).
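
As a minimal sketch of the single wrapper option, the function below branches on the labels header. The function name and the placeholder "models" (most frequent class, target mean) are hypothetical stand-ins used only to keep the example self-contained, and a real wrapper would train an actual learner in each branch.

```python
import pandas as pd

def customML_train_combined(labels, features, columntype_report, commands, randomseed):
    """Hypothetical single wrapper covering both label types."""
    # the labels header is 1 for a classification target, 0 for a regression target
    if labels.name == 1:
        # placeholder "model": the most frequent class
        model = {'task': 'classification', 'value': labels.mode().iloc[0]}
    else:
        # placeholder "model": the mean of the target
        model = {'task': 'regression', 'value': float(labels.mean())}
    return model

# stand-in data, only to exercise both branches
features = pd.DataFrame({'x': [0.1, 0.2, 0.3]})
class_model = customML_train_combined(pd.Series(['0', '1', '1'], name=1), features, {}, {}, 42)
regr_model = customML_train_combined(pd.Series([1.0, 2.0, 3.0], name=0), features, {}, {}, 42)
```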

In [3]:
#full templates provided in essay appendices and README
def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
  ...
  return model
def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
  ...
  return model

The convention is simple. Your wrapper function receives as input a series of labels, a dataframe of features, a report of feature properties, any commands that you passed as part of the automunge(.) call for the operation, and a uniquely sampled randomseed. You then tune and train a model however you want and return the trained model from the function, and we handle the rest (meaning we store the model in the returned dictionary that serves as the key for preparing additional data).

The features will be received as a numerically encoded dataframe consistent with the form returned from automunge(.), excluding any features from transforms that may return non-numeric entries or that were otherwise identified as a channel for data leakage. Any missing data will have received an initial imputation as part of the transformation functions, and that initial imputation may already have been replaced if the feature was previously targeted with ML infill. Categoric features will be integer encoded, which could include ordinal integers, one hot encodings, or binarizations. The columntype_report can be used to access feature properties, and will include a list of all categoric features, a list of all numeric features, and more granular details such as lists of categoric features of a certain type and their groupings (the form will be similar to the final version returned as postprocess_dict['columntype_report']).
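
To illustrate how the report might be consulted inside a training function, here is a small sketch that splits the received columns into categoric and numeric groups. Only the 'all_categoric' key is taken from this notebook. The column headers and the stand-in report are hypothetical, and treating every remaining column as numeric is an assumption of the sketch.

```python
import pandas as pd

# stand-in inputs, a real columntype_report carries more entries than shown here
features = pd.DataFrame({'num_a': [0.5, -1.2, 0.3],
                         'cat_b': [0, 2, 1],
                         'cat_c': [1, 0, 1]})
columntype_report = {'all_categoric': ['cat_b', 'cat_c']}

# keep only the categoric headers that are present in this partition
categoric_columns = [c for c in columntype_report['all_categoric']
                     if c in features.columns]

# treat the remaining columns as numeric (an assumption of this sketch)
numeric_columns = [c for c in features.columns if c not in categoric_columns]
```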

The labels for a classification target will be received as a single column pandas series with the integer 1 as header and entries in the form str(int), meaning integers that have been converted to strings. We adopted the str(int) convention because some libraries prefer their classification targets as strings and some prefer integers, so if your library expects integer classification targets you can just convert the labels with labels.astype(int) and you’re off to the races. The labels for a regression target will also be a single column pandas series, but this time with the integer 0 as header and entries as floats. (Continuous integer types received as regression targets can be treated as floats, since we’ll round them back to integers after inference.)
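
The label conventions can be checked on stand-in data. The series below are made up for illustration and mimic the form described above.

```python
import pandas as pd

# stand-in classification labels: header is the integer 1, entries are str(int)
class_labels = pd.Series(['0', '2', '1', '2'], name=1)

# libraries that expect integer classes can convert in one step
class_labels_int = class_labels.astype(int)

# stand-in regression labels: header is the integer 0, entries are floats
regr_labels = pd.Series([0.5, 1.25, 3.0], name=0)

# the header distinguishes the two cases
is_classification = (class_labels.name == 1)
is_regression = (regr_labels.name == 0)
```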

Any imports needed, such as for learning libraries, can either be performed external to the automunge(.) call or included as an import operation within the wrapper functions. Pandas is available as pd, Numpy as np, and scipy.stats as stats.

When it comes time to use that model for inference, we’ll access the appropriate model and pass it to your corresponding custom inference function along with the correspondingly partitioned features dataframe serving as basis and any commands a user passed as part of the automunge(.) call.

In [4]:
#full templates provided in essay appendices and README
def customML_predict_classifier(features, model, commands):
  ...
  return infill
def customML_predict_regressor(features, model, commands):
  ...
  return infill

So then you just use the predict wrapper function to run your inference and return the resulting derived infill. The form of the returned infill is up to you: a single column array, a single column dataframe, or a series. Regression output is expected as floats. Classification output can be returned as int or as str(int). Once we access the infill we’ll convert it back to whatever form is needed.
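
A minimal sketch of a pair of inference wrappers follows. ConstantModel is a hypothetical stand-in for whatever your training function returned, and it only needs to expose a predict method for this illustration. The classifier returns a numpy array of integers and the regressor returns a pandas series of floats, two of the accepted forms.

```python
import numpy as np
import pandas as pd

class ConstantModel:
    """Hypothetical stand-in for a trained model with a predict method."""
    def __init__(self, value):
        self.value = value

    def predict(self, features):
        # one prediction per row of the received features
        return np.full(len(features), self.value)

def customML_predict_classifier(features, model, commands):
    infill = model.predict(features)
    # returned here as integers, str(int) would also be accepted
    return np.asarray(infill).astype(int)

def customML_predict_regressor(features, model, commands):
    infill = model.predict(features)
    # returned here as a series of floats
    return pd.Series(infill, dtype=float)

features = pd.DataFrame({'x': [0.1, 0.2, 0.3]})
class_infill = customML_predict_classifier(features, ConstantModel(2), {})
regr_infill = customML_predict_regressor(features, ConstantModel(1.5), {})
```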

Having defined our custom ML wrapper functions, all it takes to integrate them into an automunge(.) call is passing them through the ML_cmnd parameter. Here we demonstrate choosing customML as the autoML_type (meaning we apply your defined functions instead of the default random forest), passing any desired parameters to your functions (which may differ between automunge(.) calls), and passing the functions themselves. The parameters are received by the customML functions as the input "commands". In this example classification training and inference would receive the dictionary commands = {'parameter1' : 'value1'}.

In [6]:
ML_cmnd = {'autoML_type' : 'customML',
           'MLinfill_cmnd' : {'customML_Classifier':{'parameter1' : 'value1'},
                              'customML_Regressor' :{'parameter2' : 'value2'}},
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': customML_predict_classifier, 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : customML_predict_regressor}}
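
Inside a training function, the received commands can be layered over your own defaults. The sketch below is one way to do that, with a hypothetical function name and hypothetical default values, and it returns a placeholder in place of a trained model.

```python
def customML_train_classifier_with_defaults(labels, features, columntype_report,
                                            commands, randomseed):
    # hypothetical defaults, overridden by whatever arrives in commands
    params = {'parameter1': 'default1', 'parameter3': 'default3'}
    params.update(commands)

    # a real function would tune and train here,
    # this placeholder just records the resolved settings
    model = {'params': params, 'randomseed': randomseed}
    return model

# commands as they would be received for the classifier in the ML_cmnd above
model = customML_train_classifier_with_defaults(
    None, None, {}, {'parameter1': 'value1'}, 42)
```

Building a fresh params dictionary also leaves the dictionary passed through ML_cmnd unchanged.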

Note that the library has an internal suite of inference functions for different ML libraries that can optionally be used in place of a user defined customML inference function. These can be activated by passing a string as the entry for `'customML_Classifier_predict'` or `'customML_Regressor_predict'`, as one of `{'tensorflow', 'xgboost', 'catboost', 'flaml', 'autogluon', 'randomforest'}`. Use of the internally defined inference functions allows a user to upload a postprocess_dict in a separate notebook without needing to first reinitialize the customML inference functions. For example, a default inference function for the XGBoost library could be applied as follows:

In [7]:
#this demonstrates applying a default inference function for xgboost
#by way of string  'xgboost' specification
#to ML_cmnd['customML']['customML_Classifier_predict']
#and ML_cmnd['customML']['customML_Regressor_predict']

#customML_train_classifier and customML_train_regressor
#should be user defined training functions built around xgboost library
#per the templates in Appendix A

ML_cmnd = {'autoML_type' : 'customML',
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': 'xgboost', 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : 'xgboost'}}
                         
#default inference functions currently available for following libraries
#{'tensorflow', 'xgboost', 'catboost', 'flaml', 'autogluon', 'randomforest'}

Having defined the customML training functions and populated a ML_cmnd specification, they can then be passed to an automunge(.) call as:

In [None]:
from Automunge import *
am = AutoMunge()

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
            labels_column = labels_column,
            ML_cmnd = ML_cmnd,
            printstatus=True)

#download postprocess_dict for use in another notebook

We can then use that postprocess_dict to prepare additional corresponding data in a separate notebook. Because we used the default inference functions instead of user defined inference functions, we won't need to re-initialize the inference function definitions prior to uploading postprocess_dict in another notebook.

In [None]:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, df_test)
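
One way to move postprocess_dict between notebooks is Python's pickle module, sketched below with a stand-in dictionary and a hypothetical file name in a temporary directory. Pickle stores functions by reference to their name instead of by value, which is consistent with the note above: a dictionary holding user defined inference functions needs those definitions in scope before it is loaded, while string specifications carry no such dependency.

```python
import os
import pickle
import tempfile

# stand-in for the dictionary returned from automunge(.)
postprocess_dict = {'example_entry': 'value'}

# hypothetical file location
path = os.path.join(tempfile.mkdtemp(), 'postprocess_dict.pickle')

# in the first notebook: save the dictionary
with open(path, 'wb') as handle:
    pickle.dump(postprocess_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

# in the second notebook: load it back before calling postmunge(.)
with open(path, 'rb') as handle:
    restored_dict = pickle.load(handle)
```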

_____

In the essay we gave examples of defined training and corresponding inference functions built around random forest. Here we'll demonstrate defining training functions for each of the libraries that have an internally defined inference function. A user could use any of these functions as a starting point for building a training loop incorporating hyperparameter tuning or whatever bells and whistles they prefer.

The following demonstrated customML training pipelines are compatible with the internally defined inference options.

# scikit-learn Random Forest

In [None]:
def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
  
  #note that RandomForestClassifier already imported in Automunge
  model = RandomForestClassifier(**commands)

  #labels are received as str(int), for this demonstration will convert to integer
  labels = labels.astype(int)

  model.fit(features, labels)

  return model

#___________________________________________________

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
  
  #note that RandomForestRegressor already imported in Automunge
  model = RandomForestRegressor(**commands)

  model.fit(features, labels)

  return model

#___________________________________________________

ML_cmnd = {'autoML_type' : 'customML',
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': 'randomforest', 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : 'randomforest'}}
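
As an example of the hyperparameter tuning mentioned at the top of this section, the classifier could be wrapped in scikit-learn's GridSearchCV. This is a sketch under assumptions: the function name is hypothetical, commands is interpreted here as a parameter grid, and the fallback grid values are arbitrary. Since best_estimator_ is an ordinary fitted RandomForestClassifier, we would expect it to pair with the 'randomforest' inference string, though that pairing is not verified here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def customML_train_classifier_tuned(labels, features, columntype_report,
                                    commands, randomseed):
    # labels are received as str(int), converted to integer for this demonstration
    labels = labels.astype(int)

    # commands is treated as a parameter grid (an assumption of this sketch)
    param_grid = commands if commands else {'n_estimators': [10, 20],
                                            'max_depth': [2, 4]}

    search = GridSearchCV(RandomForestClassifier(random_state=randomseed),
                          param_grid, cv=2)
    search.fit(features, labels)

    # the estimator refit on the full data with the best parameters found
    return search.best_estimator_

# stand-in data, only to exercise the function
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(40, 3)), columns=['a', 'b', 'c'])
labels = pd.Series((features['a'] > 0).astype(int).astype(str), name=1)
model = customML_train_classifier_tuned(labels, features, {}, {}, 42)
```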

# AutoGluon

In [None]:
def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
  
  from autogluon import TabularPrediction as task
  
  #autogluon accepts labels as part of training set
  features = pd.concat([features, labels], axis=1)
  
  label = labels.name
  
  model = task.fit(train_data=features, label=label, **commands, random_seed=randomseed)

  return model

#___________________________________________________

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
  
  from autogluon import TabularPrediction as task
  
  #autogluon accepts labels as part of training set
  features = pd.concat([features, labels], axis=1)
  
  label = labels.name
  
  model = task.fit(train_data=features, label=label, **commands, random_seed=randomseed)

  return model

#___________________________________________________

ML_cmnd = {'autoML_type' : 'customML',
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': 'autogluon', 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : 'autogluon'}}

# Flaml

In [None]:
def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
  
  from flaml import AutoML
  
  #flaml takes numeric classification targets
  labels = labels.astype(int)
  
  #copy before adding the task so the commands dictionary passed by the user is not mutated
  commands = {**commands, 'task' : 'classification'}
  
  model = AutoML()

  #train the model without validation set
  model.fit(
    features, labels, **commands
  )

  return model

#___________________________________________________

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
  
  from flaml import AutoML
  
  #copy before adding the task so the commands dictionary passed by the user is not mutated
  commands = {**commands, 'task' : 'regression'}
  
  model = AutoML()

  #train the model without validation set
  model.fit(
    features, labels, **commands
  )

  return model

#___________________________________________________

ML_cmnd = {'autoML_type' : 'customML',
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': 'flaml', 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : 'flaml'}}

# CatBoost

In [None]:
def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
  
  from catboost import CatBoostClassifier
  
  #categoric headers are collected for reference, they are not passed to fit in this demonstration
  categorical_features_indices = columntype_report['all_categoric']
  
  model = CatBoostClassifier()

  #train the model without validation set
  model.fit(
    features, labels, **commands
  )

  return model

#___________________________________________________

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
  
  from catboost import CatBoostRegressor
  
  #categoric headers are collected for reference, they are not passed to fit in this demonstration
  categorical_features_indices = columntype_report['all_categoric']
  
  model = CatBoostRegressor()

  #train the model without validation set
  model.fit(
    features, labels, **commands
  )

  return model

#___________________________________________________

ML_cmnd = {'autoML_type' : 'customML',
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': 'catboost', 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : 'catboost'}}

# XGBoost

In [None]:
def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
  
  from xgboost import XGBClassifier

  labels = labels.astype(int)
  
  default_model_params = {'verbosity' : 1,
                           'use_label_encoder' : False}
  
  model = XGBClassifier(**default_model_params)

  #train the model without validation set
  model.fit(
    features, labels, **commands
  )

  return model

#___________________________________________________

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
  
  from xgboost import XGBRegressor

  default_model_params = {'verbosity' : 1}
  
  model = XGBRegressor(**default_model_params)

  #train the model without validation set
  model.fit(
    features, labels, **commands
  )

  return model

#___________________________________________________

ML_cmnd = {'autoML_type' : 'customML',
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': 'xgboost', 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : 'xgboost'}}

# Tensorflow

In [None]:
def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
  
  import tensorflow as tf

  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Dense
  from tensorflow.keras.layers import Dropout

  labels = labels.astype(int)
  
  #will implement tf classification with one hot labels
  #with unique sigmoid activation per column
  
  maxlabel = labels.max()
  nunique = maxlabel+1
  
  labels_onehot = pd.DataFrame()
  
  for entry in range(nunique):
    labels_onehot[entry] = np.where(labels==entry, 1, 0)
    
  featurecount = len(list(features))

  def create_model():

    #create model with keras
    model = Sequential()
    #layer widths are kind of arbitrarily populated
    #each width is floored at 1 so that narrow feature sets do not produce a zero-width layer
    model.add(Dropout(0.1, input_shape=(featurecount,)))
    model.add(Dense(max(1, featurecount // 2), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(max(1, featurecount // 4), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(max(1, featurecount // 6), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(nunique, activation='sigmoid'))

    #compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model
  
  model = create_model()
  
  features = features.to_numpy()
  labels_onehot = labels_onehot.to_numpy()
  
  train = tf.convert_to_tensor(features)
  labels = tf.convert_to_tensor(labels_onehot)
  
  model.fit(train, labels, epochs=5, verbose=1)

  return model

#___________________________________________________

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
  
  import tensorflow as tf

  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Dense
  from tensorflow.keras.layers import Dropout
  
  featurecount = len(list(features))
  
  def create_model():
    
    #create model with keras
    model = Sequential()
    #layer widths are kind of arbitrarily populated
    #each width is floored at 1 so that narrow feature sets do not produce a zero-width layer
    model.add(Dropout(0.1, input_shape=(featurecount,)))
    model.add(Dense(max(1, featurecount // 2), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(max(1, featurecount // 4), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(max(1, featurecount // 6), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='linear'))

    #compile model
    model.compile(loss='mse', optimizer='adam', metrics=['mse'])

    return model
  
  model = create_model()
  
  features = features.to_numpy()
  labels = labels.to_numpy()
  
  train = tf.convert_to_tensor(features)
  labels = tf.convert_to_tensor(labels)
  
  model.fit(train, labels, epochs=5, verbose=1)

  return model

#___________________________________________________

ML_cmnd = {'autoML_type' : 'customML',
           'customML' : {'customML_Classifier_train'  : customML_train_classifier, 
                         'customML_Classifier_predict': 'tensorflow', 
                         'customML_Regressor_train'   : customML_train_regressor, 
                         'customML_Regressor_predict' : 'tensorflow'}}
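
The one hot label construction used in the TensorFlow classifier can be checked in isolation with numpy, along with one possible way of turning per-class sigmoid outputs back into integer classes. The probabilities below are made-up stand-ins for model.predict output. Taking the argmax is an assumption about how a user defined inference function might decode them, since this notebook does not describe how the internal 'tensorflow' inference function does so.

```python
import numpy as np
import pandas as pd

# stand-in classification labels after conversion from str(int)
labels = pd.Series(['0', '2', '1', '2'], name=1).astype(int)

# one column per class, from 0 up to the largest observed label
nunique = labels.max() + 1

# vectorized equivalent of the per-class np.where loop
labels_onehot = (labels.to_numpy()[:, None] == np.arange(nunique)).astype(int)

# made-up sigmoid outputs, one row per sample and one column per class
probabilities = np.array([[0.9, 0.2, 0.1],
                          [0.1, 0.3, 0.8],
                          [0.2, 0.7, 0.4],
                          [0.3, 0.1, 0.6]])

# decode by taking the most activated class per row
infill = np.argmax(probabilities, axis=1)
```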