mcs_kfold
mcs_kfold stands for "Monte Carlo stratified k-fold". This library attempts to distribute discrete/categorical variables equally across all folds.
Internally, it repeats stratified k-fold trials with a different random seed each time and keeps the seed that yields the least entropy in the distribution of the specified variables. The main advantage of this method is that it can be applied to multi-dimensional targets, that is, to several target columns at once.
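The seed search described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's actual implementation: the helper names (`stratified_folds`, `imbalance`, `search_seed`) are hypothetical, each trial stratifies on the first column only, and "entropy" is interpreted here as the summed relative entropy (KL divergence) between each fold's distribution and the overall distribution of every column.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed):
    """Assign row indices to folds, stratifying on one label per row."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    offset = 0
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        # Deal the shuffled members round-robin so every fold gets a near-equal share.
        for i, idx in enumerate(members):
            folds[(offset + i) % n_splits].append(idx)
        offset += len(members)
    return folds


def imbalance(rows, folds):
    """Assumed score: sum over columns and folds of KL(fold || overall)."""
    total = 0.0
    for col in range(len(rows[0])):
        overall = Counter(row[col] for row in rows)
        for fold in folds:
            counts = Counter(rows[i][col] for i in fold)
            for value, count in counts.items():
                p = count / len(fold)
                q = overall[value] / len(rows)
                total += p * math.log(p / q)
    return total


def search_seed(rows, n_splits=5, max_iter=100):
    """Try seeds 0..max_iter-1 and keep the split with the lowest score."""
    labels = [row[0] for row in rows]  # assumption: stratify on the first column
    best = None
    for seed in range(max_iter):
        folds = stratified_folds(labels, n_splits, seed)
        score = imbalance(rows, folds)
        if best is None or score < best[0]:
            best = (score, seed, folds)
    return best


# Hypothetical toy data with three categorical columns.
data_rng = random.Random(0)
rows = [
    (data_rng.choice([0, 1]), data_rng.choice([1, 2, 3]), data_rng.choice(["female", "male"]))
    for _ in range(200)
]
score, seed, folds = search_seed(rows, n_splits=5, max_iter=20)
print(f"best seed: {seed}, imbalance score: {score:.4f}")
```

Because every trial is scored on all columns, the search can balance columns that the stratification step itself never looked at, which is what makes the approach usable for several target variables.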
from mcs_kfold import MCSKFold

# num_cv (the number of folds) and df (the input data) must be defined by the caller.
mcskf = MCSKFold(n_splits=num_cv, shuffle_mc=True, max_iter=100)

for fold, (train_idx, valid_idx) in enumerate(
    mcskf.split(df=df, target_cols=["Survived", "Pclass", "Sex"])
):
    ...
See also the example for further information.
The histograms shown below were generated with this library using the Kaggle "Titanic: Machine Learning from Disaster" data. They show that the three target variables are equally distributed over the five folds.
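The same check can be made numerically by tabulating the share of each category per fold. The sketch below uses only the standard library; `fold_proportions` is a hypothetical helper, and the data and folds are hand-made toy values rather than output from this library.

```python
from collections import Counter


def fold_proportions(values, folds):
    """Return, for each fold, the share of each category among that fold's rows."""
    result = []
    for fold in folds:
        counts = Counter(values[i] for i in fold)
        result.append({key: counts[key] / len(fold) for key in sorted(counts)})
    return result


# Hypothetical toy data: one binary column and five hand-made validation folds.
sex = ["male", "female"] * 10
folds = [list(range(start, start + 4)) for start in range(0, 20, 4)]

proportions = fold_proportions(sex, folds)
for fold_id, shares in enumerate(proportions):
    print(fold_id, shares)
```

With real data, `values` would be one of the target columns and `folds` the validation indices of each split; near-identical shares across folds correspond to the near-identical histograms.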
pip install mcs_kfold
git clone https://github.com/MasashiSode/mcs_kfold
cd mcs_kfold
pip install .
poetry install
pytest