Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477) #478
I have implemented imputation of missing values using ML models.
The following additional options are now provided in the healthcareai.common.transformers.DataFrameImputer class:
imputeStrategy : string, default='MeanMode'
Determines which technique is used to impute missing values.
'MeanMode' : Imputation is done using Mean/Mode.
'RandomForest' : Imputation is done using ML models.
tunedRandomForest : boolean, default=False
If set to True, the RandomForestClassifier/RandomForestRegressor used for imputation of missing values is tuned using grid search and K-fold cross-validation.
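To make the idea concrete, here is a minimal sketch of model-based imputation with optional tuning. It is not the code in this PR: the helper `impute_column_with_forest`, its arguments, the toy data, and the parameter grid are all hypothetical, and it assumes the predictor columns are numeric with no missing values. Only scikit-learn's `RandomForestClassifier`, `RandomForestRegressor`, and `GridSearchCV` are real.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def impute_column_with_forest(df, target, predictors, categorical=False, tuned=False):
    """Fill missing values in `target` with a random forest trained on `predictors`.

    Hypothetical helper for illustration only. Assumes the predictor
    columns are numeric and contain no missing values.
    """
    missing = df[target].isna()
    if not missing.any():
        return df.copy()

    # Classifier for categorical targets, regressor for continuous ones.
    model_cls = RandomForestClassifier if categorical else RandomForestRegressor
    model = model_cls(n_estimators=50, random_state=0)
    if tuned:
        # Small illustrative grid, scored with 3-fold cross-validation.
        grid = {"n_estimators": [25, 50], "max_depth": [None, 3]}
        model = GridSearchCV(model, grid, cv=3)

    # Train on rows where the target is observed, predict the rest.
    model.fit(df.loc[~missing, predictors], df.loc[~missing, target])
    result = df.copy()
    result.loc[missing, target] = model.predict(df.loc[missing, predictors])
    return result


# Toy data: job_code is stored as a number but is categorical by nature.
age = np.arange(20, 60, dtype=float)
frame = pd.DataFrame({
    "age": age,
    "salary": age * 1000.0,
    "job_code": np.where(age < 40, 1.0, 2.0),
})
frame.loc[[3, 17], "salary"] = np.nan
frame.loc[[5, 29], "job_code"] = np.nan

filled = impute_column_with_forest(frame, "salary", ["age"])
filled = impute_column_with_forest(filled, "job_code", ["age"], categorical=True, tuned=True)
```

Because a classifier can only predict levels it saw during training, the imputed `job_code` values are always existing levels, never fractional ones.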
******************************************** Bug Fix ***********************************************
The existing code had no provision for columns that are of type int/float but categorical by nature (e.g. JobCode with levels 1, 2, 3, 4, 5, 6). Such columns were therefore imputed using the mean value (e.g. 2.8, 3.6), which is not a valid level and can be very hazardous.
I have fixed this for both imputation strategies, i.e. 'MeanMode' and 'RandomForest'.
Users can now use the parameter below to explicitly list such columns.
numeric_columns_as_categorical : list of strings, default=None
List of names of columns that are numeric (int/float) in the dataframe but should be treated as categorical.
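The effect of this parameter on Mean/Mode imputation can be sketched as follows. This is a simplified illustration rather than the DataFrameImputer code: the function `impute_mean_mode` and the toy data are hypothetical, and it assumes every column has at least one observed value.

```python
import numpy as np
import pandas as pd


def impute_mean_mode(df, numeric_columns_as_categorical=None):
    """Fill numeric columns with the mean and all other columns with the mode.

    Sketch only: columns listed in numeric_columns_as_categorical are
    treated as categorical even though their dtype is int/float.
    """
    as_categorical = set(numeric_columns_as_categorical or [])
    result = df.copy()
    for column in result.columns:
        series = result[column]
        use_mean = pd.api.types.is_numeric_dtype(series) and column not in as_categorical
        fill_value = series.mean() if use_mean else series.mode().iloc[0]
        result[column] = series.fillna(fill_value)
    return result


frame = pd.DataFrame({
    "JobCode": [1, 2, 2, 5, np.nan, 6],
    "Age": [30.0, 40.0, np.nan, 50.0, 60.0, 20.0],
})

naive = impute_mean_mode(frame)
fixed = impute_mean_mode(frame, numeric_columns_as_categorical=["JobCode"])
print(naive.loc[4, "JobCode"])  # 3.2, which is not a valid job code
print(fixed.loc[4, "JobCode"])  # 2.0, the most frequent level
```

Without the parameter, the missing JobCode is filled with the column mean; with it, the most frequent level is used, while Age is still filled with its mean.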
The existing approach to missing value imputation (using Mean/Mode) is preserved, apart from the fix above.