Variable Selection in Shifu

wu haifeng edited this page May 21, 2020 · 17 revisions

Shifu variable (a.k.a. feature) selection is configured in the "varSelect" section of the ModelConfig.json file. Here is a sample:

  "varSelect" : {
    "forceEnable" : true,
    "candidateColumnNameFile" : "columns/candidate.column.names",
    "forceSelectColumnNameFile" : "columns/forceselect.column.names",
    "forceRemoveColumnNameFile" : "columns/forceremove.column.names",
    "filterEnable" : true,
    "filterNum" : 100,
    "filterOutRatio" : 0.05,
    "filterBy" : "FI",
    "missingRateThreshold" : 0.98,
    "params" : null
  }

forceEnable

Whether or not to enable force selection. If true, all variables specified in forceSelectColumnNameFile will be force selected and variables specified in forceRemoveColumnNameFile will be force removed for model training.

candidateColumnNameFile

File containing the names of the variables that can be used in variable selection. If candidateColumnNameFile is not set, or the file is empty, all variables are candidates. Otherwise, only the variables listed in candidateColumnNameFile can be selected.

forceSelectColumnNameFile

File containing the names of the variables that should be force selected for model training, one variable name per line. E.g.:

variable_name_1
variable_name_2
...
variable_name_n

forceRemoveColumnNameFile

File containing the names of the variables that should be force removed from model training. The file format is the same as for forceSelectColumnNameFile.
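The interplay of the three name files can be pictured with a small Python sketch. It is only an illustration, not Shifu's implementation: `read_column_names` and `partition_columns` are hypothetical helpers, and two behaviours are assumptions made for the sketch (a name listed in both force files is treated as a configuration error, and the force lists apply regardless of the candidate list).

```python
def read_column_names(text):
    """Parse the body of a column-name file: one name per line, blank lines skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def partition_columns(all_columns, candidates, force_select, force_remove):
    """Split columns into (forced_in, forced_out, free_to_rank).

    An empty candidate list means every column is a candidate (as described
    above). Assumption for this sketch: a name in both force lists is an error.
    """
    overlap = set(force_select) & set(force_remove)
    if overlap:
        raise ValueError(f"columns in both force lists: {sorted(overlap)}")
    forced_in = [c for c in all_columns if c in force_select]
    forced_out = [c for c in all_columns if c in force_remove]
    # Only candidate columns that are not forced either way remain to be ranked.
    free = [
        c for c in all_columns
        if c not in force_select and c not in force_remove
        and (not candidates or c in candidates)
    ]
    return forced_in, forced_out, free
```

Only the `free_to_rank` columns would then be subject to the filter settings described in the following sections.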

filterEnable

Whether or not to enable the filter. If true, the ColumnConfig.json file is modified according to your variable selection settings after you run the shifu varselect command. If false, ColumnConfig.json is not modified. Typically this is set to false when the user only wants to output a sensitivity analysis report or a feature importance report without re-selecting variables.

filterNum

Integer type, the number of variables to be selected for model training. filterNum has higher priority than filterOutRatio; in other words, once filterNum is set, filterOutRatio is ignored. To run variable selection iteratively, set filterNum to 0.

filterOutRatio

Float type, the ratio of variables to be filtered out after running shifu varselect. For example, if 100 variables are set to finalSelect=true in the ColumnConfig.json file and filterOutRatio is set to 0.05 in the ModelConfig.json file, then running the shifu varselect command sets 5 variables to finalSelect=false in the ColumnConfig.json file.
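The priority between filterNum and filterOutRatio can be sketched in a few lines of Python. This is a reading of the descriptions above, not Shifu code: `variables_to_keep` is a hypothetical helper, a filterNum of 0 or less is assumed to mean "not set", and truncating the drop count to an integer is also an assumption.

```python
def variables_to_keep(currently_selected, filter_num=0, filter_out_ratio=0.05):
    """Return how many variables stay selected after one varselect run."""
    if filter_num > 0:
        # filterNum wins: filterOutRatio is ignored entirely.
        return min(filter_num, currently_selected)
    # Assumption: the number of dropped variables is truncated to an integer.
    to_drop = int(currently_selected * filter_out_ratio)
    return currently_selected - to_drop
```

With 100 selected variables, no filterNum, and a ratio of 0.05, the sketch keeps 95 variables, matching the example above; with filterNum set to 30 it keeps 30 regardless of the ratio.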

filterBy

Method used to select variables. The values covered below are KS, IV, FI, SE, and ST.

Statistic Based Variable Selection

In the stats step, KS and IV values are computed for each feature and used for variable selection. Features are sorted by KS or IV in descending order, and the top ones are kept according to the number of features to be selected.

  • KS – Kolmogorov-Smirnov statistic: the maximum difference between the cumulative distribution functions of the goods and the bads on a given feature. It points to the "regions"/bins of impact.
  • IV – Information Value: the overall strength of a feature's split characteristics, i.e. how well a variable can distinguish between categories of the response.
  • FI – Feature Importance: works only for tree models. If filterEnable is set to false, an existing tree model is read and its feature importance values are written to the featureImportance/all.fi file. If filterEnable is set to true, a new tree model is trained based on the training settings and used for variable selection.
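As a rough illustration of the two statistics, the sketch below computes KS and IV for one feature from per-bin good/bad counts, using the common textbook definitions. It is not taken from Shifu's code: `ks_and_iv`, the `eps` smoothing term, and the assumption that bins are already in feature order are all illustrative choices, and Shifu may scale or smooth the values differently.

```python
import math


def ks_and_iv(good_counts, bad_counts, eps=1e-6):
    """Return (KS, IV) for one feature given ordered per-bin good/bad counts."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    cum_good = cum_bad = 0.0
    ks = iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        pg = g / total_good  # share of all goods that fall in this bin
        pb = b / total_bad   # share of all bads that fall in this bin
        cum_good += pg
        cum_bad += pb
        # KS: largest gap between the two cumulative distributions.
        ks = max(ks, abs(cum_good - cum_bad))
        # IV: sum over bins of (pg - pb) * ln(pg / pb); eps avoids log(0).
        iv += (pg - pb) * math.log((pg + eps) / (pb + eps))
    return ks, iv
```

Ranking features by either returned value in descending order and keeping the top ones mirrors the selection described above.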

Sensitivity Analysis for Variable Selection

  • SE – Sensitivity analysis comparing with model output
  • ST – Sensitivity analysis comparing with target value

Sensitivity Analysis

This approach works well for variable selection with neural network models.

  • Train a model first.
  • For each instance in the training data, drop one feature at a time and compute a new score; then compute the difference between that score and the original score (SE) or the target value (ST).
  • Over all training data, compute the mean and standard deviation of the differences for each feature, and sort the features by mean in descending order.
  • Remove the user-configured ratio of features (for example 5%, as set by filterOutRatio).
  • Repeat the same cycle until the final number of features is reached.
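One round of the cycle above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Shifu's implementation: `sensitivity_ranking` and `drop_least_sensitive` are hypothetical names, "dropping" a feature is modelled by replacing its value with `fill_value`, the difference is taken as an absolute value, and the features with the smallest mean difference are assumed to be the ones removed.

```python
from statistics import mean, pstdev


def sensitivity_ranking(score, rows, targets=None, fill_value=0.0):
    """Rank feature indices by the mean score change when each is dropped.

    score:   callable mapping one row (list of floats) to a model score.
    targets: None for SE (compare with the original score), or a list of
             target values for ST (compare with the target).
    Returns a list of (feature_index, mean_diff, std_diff), highest mean first.
    """
    stats = []
    for j in range(len(rows[0])):
        diffs = []
        for i, row in enumerate(rows):
            perturbed = list(row)
            perturbed[j] = fill_value  # assumption: "drop" means neutralise the value
            reference = score(row) if targets is None else targets[i]
            diffs.append(abs(score(perturbed) - reference))
        stats.append((j, mean(diffs), pstdev(diffs)))
    return sorted(stats, key=lambda s: s[1], reverse=True)


def drop_least_sensitive(ranking, filter_out_ratio=0.05):
    """Return the feature indices kept after removing the bottom of the ranking."""
    n_drop = max(1, int(len(ranking) * filter_out_ratio))
    return [j for j, _, _ in ranking[:len(ranking) - n_drop]]
```

In an iterative run, the model would be retrained on the kept features and the two functions applied again until the desired number of features remains.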

FAQ

How To Do Feature Selection in GBT/RF Model Training?

A: Solution 1: Select by feature importance (FI).
Solution 2: Set the training parameters to NN, do feature selection by sensitivity analysis, and then change the training parameters back to the GBT/RF ones.

How To Do Quick Feature Selection If # of Features >= 10000?

A: First do coarse feature selection by KS/IV to cut the set down to about 2000 features, then use sensitivity analysis to filter out 5% of the features (filterOutRatio of 0.05) in each round.
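To get a feel for how many sensitivity-analysis rounds the second stage takes, the sketch below counts rounds when a fixed ratio is removed each time. `sensitivity_rounds` is a hypothetical helper; truncating the per-round drop count and removing at least one feature per round are assumptions, and the final target size is whatever the user picks.

```python
def sensitivity_rounds(start, target, filter_out_ratio=0.05):
    """Count rounds needed to shrink from `start` to `target` features."""
    rounds = 0
    remaining = start
    while remaining > target:
        # Assumption: drop count is truncated, but at least one feature goes.
        to_drop = max(1, int(remaining * filter_out_ratio))
        remaining = max(target, remaining - to_drop)
        rounds += 1
    return rounds
```

For example, `sensitivity_rounds(2000, 200)` estimates the rounds needed to go from the 2000 features kept by KS/IV down to a hypothetical final set of 200. Each round implies retraining a model, which is why the coarse KS/IV cut comes first.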
