Python-Based Information Theoretic Multi-Label Feature Selection (PyIT-MLFS) Library

Dependencies and Installation

  • python >= 3.8
  • numpy
  • pyitlib
  • sklearn
  • skmultilearn
  • tqdm
  1. Clone Repo
git clone https://github.com/Sadegh28/PyIT-MLFS.git
  2. Create Conda Environment
conda create --name PyIT_MLFS python=3.8
conda activate PyIT_MLFS
  3. Install Dependencies
pip install pyitlib 
conda install -c conda-forge scikit-learn
pip install scikit-multilearn
conda install -c conda-forge numpy
pip install tqdm
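
To confirm that the environment is set up correctly, you can run a quick import check such as the one below. This is optional and not part of the library; the module names are simply the standard import names of the packages listed above.

    # Quick sanity check that all dependencies import correctly.
    import numpy
    import sklearn
    import skmultilearn
    import tqdm
    from pyitlib import discrete_random_variable as drv

    print("numpy:", numpy.__version__)
    print("scikit-learn:", sklearn.__version__)
    print("tqdm:", tqdm.__version__)
    # Entropy of a fair binary variable should be 1.0 bit.
    print("pyitlib entropy of [0, 1, 1, 0]:", drv.entropy([0, 1, 1, 0]))
    print("All dependencies imported successfully.")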

Get Started

Mulan Datasets

Use the following command to rank the features of one or more datasets from the Mulan repository:

    python PyIT-MLFS.py   --datasets   d1, d2, ..., dn   --fs-methods a1, a2, ..., am

Each di must be one of the following Mulan datasets:

    {'Corel5k', 'bibtex', 'birds', 'delicious', 'emotions', 'enron', 'genbase', 'mediamill', 'medical',
    'rcv1subset1', 'rcv1subset2', 'rcv1subset3', 'rcv1subset4', 'rcv1subset5', 'scene', 'tmc2007_500', 'yeast'}

and each ai must be a multi-label feature selection method supported by the PyIT-MLFS library:

    {'LRFS', 'PPT_MI', 'IGMF', 'PMU', 'D2F', 'SCLS', 'MDMR', 'LSMFS', 'MLSMFS' }

For example, the following command ranks the features of the 'emotions' and 'birds' datasets using the 'LRFS' and 'PPT_MI' methods:

    python PyIT-MLFS.py   --datasets   'emotions', 'birds'   --fs-methods 'LRFS', 'PPT_MI'

Check out the results in ./results/SelectedSubsets/

In addition, use the following command to select a subset of the top 20 features (instead of ranking the entire feature space):

    python PyIT-MLFS.py   --datasets   'emotions', 'birds'   --fs-methods 'LRFS', 'PPT_MI'   --selection-type 'fixed-num' --num-of-features 20
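
If you want to inspect what one of these Mulan datasets contains before running the selector, scikit-multilearn's built-in loader can fetch it directly. The sketch below is independent of the PyIT-MLFS command line and only illustrates the shape of the data:

    # Illustration only: peek at a Mulan dataset with scikit-multilearn's loader.
    from skmultilearn.dataset import load_dataset

    # Downloads (and caches) the 'emotions' training split from the Mulan repository.
    X, y, feature_names, label_names = load_dataset('emotions', 'train')

    print("feature matrix:", X.shape)   # (instances, features), scipy sparse
    print("label matrix:  ", y.shape)   # (instances, labels), scipy sparse
    print("first labels:  ", [name for name, _ in label_names[:3]])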

Your Own Dataset

  1. Put your datasets into the ./datasets folder. If the data is already split into train/test sets, then the folder structure for each dataset should follow this format:

     -YourDataset
     |--- train.csv
     |--- train_labels.csv
     |--- test.csv
     |--- test_labels.csv
    

Otherwise, it should follow this format (a minimal sketch for writing these files appears at the end of this section):

    -YourDataset
    |--- X.csv
    |--- y.csv
  2. Run the following command to rank the features of your own dataset:

     python PyIT-MLFS.py  --data-path 'datasets\'  --datasets   d1, d2, ..., dn   --fs-methods a1, a2, ..., am
    

As an example, download the emotions dataset through this link. After extracting it into the ./datasets folder, you should see the following structure:

    -emotions
    |--- train.csv
    |--- train_labels.csv
    |--- test.csv
    |--- test_labels.csv

Now you can run the following commands for feature ranking and selection, respectively:

    python PyIT-MLFS.py  --data-path 'datasets\'  --datasets   'emotions'   --fs-methods 'LRFS', 'PPT_MI' 
    python PyIT-MLFS.py  --data-path 'datasets\'  --datasets   'emotions'   --fs-methods 'LRFS', 'PPT_MI' --selection-type 'fixed-num' --num-of-features 20
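
If your data lives in NumPy arrays, a script along the following lines can write the unsplit X.csv / y.csv layout described above. The exact CSV conventions PyIT-MLFS expects (headers, delimiters, index columns) are not documented here, so treat this as a minimal sketch and compare its output against the extracted emotions files:

    # Minimal sketch for writing the unsplit layout (X.csv / y.csv).
    # Assumption: plain comma-separated values, one row per instance, no header.
    import os
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((100, 72))                 # 100 instances, 72 features
    y = rng.integers(0, 2, size=(100, 6))     # 6 binary labels per instance

    out_dir = os.path.join('datasets', 'YourDataset')
    os.makedirs(out_dir, exist_ok=True)

    np.savetxt(os.path.join(out_dir, 'X.csv'), X, delimiter=',')
    np.savetxt(os.path.join(out_dir, 'y.csv'), y, delimiter=',', fmt='%d')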

pre_eval and post_eval modes

You can use the 'pre_eval' and 'post_eval' modes to control how the information-theoretic measures between variables are calculated. In 'pre_eval' mode, all required measures are computed before the feature selection process starts, while in 'post_eval' mode they are computed on demand during selection. In general, 'pre_eval' runs much faster than 'post_eval' unless you only want to select a very small number of features (say, 5). 'pre_eval' is the default mode; to use 'post_eval', run the following command:

    python PyIT-MLFS.py  --data-path 'datasets\'  --datasets   'emotions'   --fs-methods 'LRFS', 'PPT_MI' --selection-type 'fixed-num' --num-of-features 5  --eval-mode 'post_eval'
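
The sketch below is not the library's internal code; it only illustrates the trade-off between the two modes for the pairwise feature-feature redundancy terms I(Xi; Xj) that many of these selectors use, with pyitlib doing the mutual-information computation:

    # Not PyIT-MLFS internals, just the idea behind the two modes.
    import numpy as np
    from pyitlib import discrete_random_variable as drv

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 50))    # 200 instances, 50 discrete features
    n = X.shape[1]

    # 'pre_eval' style: compute the full redundancy matrix before selection starts.
    redundancy = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            redundancy[i, j] = redundancy[j, i] = drv.information_mutual(X[:, i], X[:, j])

    # 'post_eval' style: compute a pair only when the selection loop asks for it,
    # and cache the result.  If only a handful of features end up selected, far
    # fewer pairs are ever evaluated, which is when 'post_eval' can be faster.
    _cache = {}
    def redundancy_lazy(i, j):
        key = (min(i, j), max(i, j))
        if key not in _cache:
            _cache[key] = drv.information_mutual(X[:, i], X[:, j])
        return _cache[key]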

Evaluation

Use the following command to get the accuracy of the selected subsets using different classifiers:

    python PyIT-MLFS.py   --datasets   d1, d2, ..., dn   --fs-methods a1, a2, ..., am \
                          --classifiers  c1, c2, ..., ck  --metrics  m1, m2, ..., mt

For example, the following command ranks the features of the 'emotions' and 'birds' datasets using the 'LRFS' and 'PPT_MI' methods, classifies the datasets using the 'MLKNN' and 'BinaryRelevance' classifiers, and finally evaluates the classification results using four metrics, namely 'hamming loss', 'label ranking loss', 'coverage error', and 'average precision score':

    python PyIT-MLFS.py   --datasets   'emotions', 'birds' \
                          --fs-methods 'LRFS', 'PPT_MI' \
                          --classifiers  "MLKNN", "BinaryRelevance" \
                          --metrics  'hamming loss', 'label ranking loss', 'coverage error', 'average precision score'

Check out the results in ./results/Accuracies/
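
The command above handles classification and scoring for you; the sketch below only illustrates what that evaluation involves, using an MLkNN classifier from scikit-multilearn and the four metrics from scikit-learn. The 20-feature subset used here is a placeholder; in practice the indices would come from the rankings in ./results/SelectedSubsets/.

    # Illustration of the evaluation stage (not the library's own code).
    from skmultilearn.dataset import load_dataset
    from skmultilearn.adapt import MLkNN
    from sklearn.metrics import (hamming_loss, label_ranking_loss,
                                 coverage_error, average_precision_score)

    X_tr, y_tr, _, _ = load_dataset('emotions', 'train')
    X_te, y_te, _, _ = load_dataset('emotions', 'test')

    selected = list(range(20))                  # placeholder for a ranked feature subset
    X_tr, X_te = X_tr.toarray()[:, selected], X_te.toarray()[:, selected]
    y_tr, y_te = y_tr.toarray(), y_te.toarray()

    clf = MLkNN(k=10).fit(X_tr, y_tr)
    pred = clf.predict(X_te).toarray()          # hard label predictions
    scores = clf.predict_proba(X_te).toarray()  # per-label confidence scores

    print("hamming loss:           ", hamming_loss(y_te, pred))
    print("label ranking loss:     ", label_ranking_loss(y_te, scores))
    print("coverage error:         ", coverage_error(y_te, scores))
    print("average precision score:", average_precision_score(y_te, scores))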
