Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477) #478

vijayphugat · 2018-09-28T10:30:05Z

I have implemented imputation of missing values using ML models.
The following additional options are now available in the healthcareai.common.transformers.DataFrameImputer class:

imputeStrategy : string, default='MeanMode'
Selects the technique used to impute missing values.

If imputeStrategy = 'MeanMode'
Imputation is done using Mean/Mode
If imputeStrategy = 'RandomForest'
Imputation is done using ML models.

tunedRandomForest : boolean, default=False
If set to True, the RandomForestClassifier/RandomForestRegressor used to impute missing values is tuned using grid search and K-fold cross-validation.
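
To make the two options above concrete, here is a minimal standalone sketch of random-forest imputation for a single column. It is not the code from this pull request: the helper name rf_impute_column, its parameters, the forest size and the small tuning grid are all assumptions for illustration, and it assumes every predictor column is numeric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def rf_impute_column(df, target, categorical=False, tuned=False, random_state=0):
    """Fill missing values of `target` with random-forest predictions.

    Hypothetical helper for illustration only; not the healthcareai API.
    Assumes all other columns are numeric.
    """
    features = [c for c in df.columns if c != target]
    # The forest needs complete predictors, so mean-fill them first.
    X = df[features].fillna(df[features].mean())
    missing = df[target].isna()
    if not missing.any():
        return df.copy()

    # Classifier for categorical targets, regressor for continuous ones.
    model_cls = RandomForestClassifier if categorical else RandomForestRegressor
    model = model_cls(n_estimators=50, random_state=random_state)
    if tuned:
        # Illustrative grid; 3-fold cross-validation selects max_depth.
        model = GridSearchCV(model, {"max_depth": [2, 4, None]}, cv=3)

    # Train on rows where the target is known, predict the rest.
    model.fit(X[~missing], df.loc[~missing, target])
    out = df.copy()
    out.loc[missing, target] = model.predict(X[missing])
    return out
```

Calling rf_impute_column(df, "JobCode", categorical=True) would fill the gaps with one of the observed levels, while tuned=True trades extra training time for a cross-validated choice of tree depth.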

******************************************** Bug Fix ***********************************************
The existing code had no provision for columns that are of type int/float but categorical by nature, e.g. JobCode (levels: 1, 2, 3, 4, 5, 6). Such columns were therefore imputed using the mean value (e.g. 2.8, 3.6), which is not a valid level and can silently corrupt the data.

I handled this problem for both imputation strategies, i.e. 'MeanMode' and 'RandomForest'.
The user can now use the parameter below to explicitly list such columns.

numeric_columns_as_categorical : list of strings, default=None
List of names of columns that are numeric (int/float) in the dataframe but should be treated as categorical.

For example:
Consider a column JobCode (levels: 1, 2, 3, 4, 5, 6).
If there are missing values in the JobCode column, pandas will by default convert the column to type float.

If numeric_columns_as_categorical=None
	Missing values of this column will be imputed with the mean value of the JobCode column.
	The type of the 'JobCode' column will remain float.
If numeric_columns_as_categorical=['JobCode']
	Missing values of this column will be imputed with the mode value of the JobCode column.
	The final type of the 'JobCode' column will be numpy.object
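
The difference between the two cases can be reproduced with plain pandas. This is only a sketch of the underlying behaviour, not the DataFrameImputer code; the sample JobCode values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Integer codes with one missing value: pandas stores the column as float64.
df = pd.DataFrame({"JobCode": [1, 2, 2, 5, 6, np.nan]})
col_dtype = df["JobCode"].dtype

# Mean imputation fills the gap with 3.2, which is not a valid JobCode level.
mean_filled = df["JobCode"].fillna(df["JobCode"].mean())

# Mode imputation fills the gap with 2.0, the most frequent observed level.
mode_filled = df["JobCode"].fillna(df["JobCode"].mode().iloc[0])

print(col_dtype, mean_filled.iloc[-1], mode_filled.iloc[-1])
```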

The existing approach to missing value imputation (using Mean/Mode) is preserved, with the one fix described above.

Added a class to impute missing values using RandomForestRegressor and RandomForestClassifier

vijayphugat · 2018-10-03T10:03:41Z

I raised the same pull request from a different account, so I am closing this one.

Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477, #478)

vijayphugat added 2 commits September 28, 2018 14:06

impute missing values using ML models

fc8b498

Added code class to impute missing values using RandomForestRegressor and RandomForestClassifier

impute missing values using ML models

7270f2e

vijayphugat closed this Oct 3, 2018

mmastand added a commit that referenced this pull request Nov 6, 2018

Merge pull request #479 from VijaySingh-GSLab/branch-1

cb82b94

Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477, #478)
