Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477) #478

Closed
wants to merge 2 commits into from

vijayphugat
Contributor

I have implemented imputation of missing values using ML models.
The following additional options are now available in the healthcareai.common.transformers.DataFrameImputer class:

imputeStrategy : string, default='MeanMode'
Selects the technique used to impute missing values.

  • If imputeStrategy = 'MeanMode'
    Imputation is done using the mean or the mode of the column.
  • If imputeStrategy = 'RandomForest'
    Imputation is done using ML models (RandomForestRegressor/RandomForestClassifier).
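To illustrate the idea behind the 'RandomForest' strategy, here is a minimal sketch, not the code in this pull request. It assumes the other columns are numeric and complete, and the helper name impute_with_random_forest is hypothetical. The model is fitted on rows where the target column is known and then predicts the missing entries.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def impute_with_random_forest(df, target, categorical=False, random_state=0):
    """Fill NaNs in `target` using all other columns as features.

    Hypothetical helper for illustration; assumes the feature columns
    are numeric and contain no missing values.
    """
    features = [c for c in df.columns if c != target]
    known = df[target].notna()
    result = df.copy()
    if known.all():
        return result  # nothing to impute

    if categorical:
        # Treat the numeric levels as class labels.
        model = RandomForestClassifier(n_estimators=50, random_state=random_state)
        y_known = df.loc[known, target].astype(int)
    else:
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        y_known = df.loc[known, target]

    # Train on complete rows, then predict only the missing ones.
    model.fit(df.loc[known, features], y_known)
    result.loc[~known, target] = model.predict(df.loc[~known, features])
    return result


df = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0, 38.0, 29.0, 44.0, 36.0],
    "tenure": [1.0, 4.0, 12.0, 20.0, 8.0, 2.0, 15.0, 6.0],
    "JobCode": [1, 2, 4, np.nan, 3, 1, 4, np.nan],
})
filled = impute_with_random_forest(df, "JobCode", categorical=True)
```

With categorical=True the predictions are always one of the levels already present in the column, which is the property that mean imputation lacks.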

tunedRandomForest : boolean, default=False
If set to True, the RandomForestClassifier/RandomForestRegressor used for imputing missing values is tuned using grid search and K-fold cross-validation.
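As a rough sketch of what grid search with K-fold cross-validation looks like in scikit-learn (the parameter grid, fold count, scoring metric and synthetic data below are assumptions for illustration, not the values used in this pull request):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data standing in for the rows where the target column is known.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)

# Hypothetical grid; a real one would depend on the dataset.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# The best estimator is refit on all the data and can then predict missing values.
tuned_model = search.best_estimator_
```

Tuning costs one model fit per parameter combination per fold, which is presumably why the option defaults to False.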

******************************************** Bug Fix ***********************************************
The existing code had no provision for columns that are of type int/float but are categorical by nature (e.g. JobCode, with levels 1, 2, 3, 4, 5, 6). Such columns were therefore imputed with the mean value (e.g. 2.8, 3.6), which is not a valid level and can be very hazardous.

I handled this problem for both imputation strategies, i.e. 'MeanMode' and 'RandomForest'.
Users can now use the parameter below to explicitly list such columns.

numeric_columns_as_categorical : list of strings, default=None
List of names of columns that are numeric (int/float) in the dataframe but should be treated as categorical.

For example:
Consider a column JobCode (levels: 1, 2, 3, 4, 5, 6).
If the JobCode column has missing values, pandas will by default convert it to type float.

If numeric_columns_as_categorical=None
	Missing values in this column will be imputed with the mean of the JobCode column.
	The type of the 'JobCode' column will remain float.
If numeric_columns_as_categorical=['JobCode']
	Missing values in this column will be imputed with the mode of the JobCode column.
	The final type of the 'JobCode' column will be numpy.object.
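The difference between the two cases can be reproduced with plain pandas. This is only a sketch of the behaviour described above, using made-up data, and does not call DataFrameImputer itself:

```python
import numpy as np
import pandas as pd

# An integer-coded categorical column with one missing value.
df = pd.DataFrame({"JobCode": [1, 2, 2, np.nan, 5, 6]})

# The NaN forces pandas to store the column as float64.
column_dtype = df["JobCode"].dtype

# Mean imputation produces a value that is not a valid level.
mean_filled = df["JobCode"].fillna(df["JobCode"].mean())

# Mode imputation always produces an existing level;
# casting to object marks the column as categorical afterwards.
mode_value = df["JobCode"].mode().iloc[0]
mode_filled = df["JobCode"].fillna(mode_value).astype(object)

print(column_dtype)
print(mean_filled.iloc[3])
print(mode_filled.iloc[3])
```

Here the mean of the known values falls between levels 2 and 5, while the mode is level 2, which is the most frequent code in the column.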

The existing approach to missing value imputation (using Mean/Mode) is preserved, with one fix.

Added a class to impute missing values using RandomForestRegressor and RandomForestClassifier.
@vijayphugat vijayphugat closed this Oct 3, 2018
@vijayphugat
Contributor Author

I raised the same pull request from a different account, so I am closing this one.

mmastand added a commit that referenced this pull request Nov 6, 2018
Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477, #478)