# Tutorial for the use of the MEvA-X tool

<p>This tutorial explains how to run MEvA-X on the datasets presented in the manuscript and describes the parameters and variables the user can modify. It also lists the required modules, pinned to the versions used when the algorithm was tested.</p>

## Installing the necessary modules/libraries

In [None]:
if False:
    import sys
    !{sys.executable} -m pip install pandas==1.4.4
    !{sys.executable} -m pip install numpy==1.21.5
    !{sys.executable} -m pip install xgboost==1.7.3
    !{sys.executable} -m pip install git+https://github.com/danielhomola/mifs#egg=mifs
    !{sys.executable} -m pip install git+https://github.com/iskandr/knnimpute#egg=knnimpute


## Examples of use

#### Calling the algorithm (user inputs)

In [None]:
!python3 ../beta/MEvA-X_V1.2.0.py -h

In [None]:
!python3 ../beta/MEvA-X_V1.2.0.py --K 10 --P 50 --G 200 --dataset_path my_data.txt --labels_path my_labels.tsv --FS_path precalculated_features.csv --output_dir current_folder --crossover_perc 0.9 --arithmetic_perc 0.0 --mutation_perc 0.1 --goal_sig_list 0.8 2 0.8 1 1 0.7 0.7 1 2 0.5 2


<h3>Explanation of the command-line arguments of the MEvA-X tool</h3>
<ul>
    <li><code>--K</code> [int] the number of folds in K-fold cross-validation</li>
    <li><code>--P</code> [int] the number of individuals in the population</li>
    <li><code>--G</code> [int] the maximum number of generations of the evolutionary algorithm</li>
</ul>
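
The call shown earlier can also be assembled from Python, which may help when launching several runs with different settings. The sketch below only builds the argument list and does not start the tool. The flag names and values are copied from the example command; the helper `build_command` and the dictionary layout are hypothetical.

```python
import shlex

def build_command(script, options):
    """Return the argument list for one run (hypothetical helper)."""
    cmd = ["python3", script]
    for flag, value in options.items():
        cmd.append(f"--{flag}")
        # List-valued options (e.g. goal_sig_list) expand to several tokens.
        if isinstance(value, (list, tuple)):
            cmd.extend(str(v) for v in value)
        else:
            cmd.append(str(value))
    return cmd

# Values taken from the example invocation in this tutorial.
options = {
    "K": 10,
    "P": 50,
    "G": 200,
    "dataset_path": "my_data.txt",
    "labels_path": "my_labels.tsv",
    "FS_path": "precalculated_features.csv",
    "output_dir": "current_folder",
    "crossover_perc": 0.9,
    "arithmetic_perc": 0.0,
    "mutation_perc": 0.1,
    "goal_sig_list": [0.8, 2, 0.8, 1, 1, 0.7, 0.7, 1, 2, 0.5, 2],
}
cmd = build_command("../beta/MEvA-X_V1.2.0.py", options)
print(shlex.join(cmd))  # shell-quoted command line, for inspection
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to start a run without going through a shell.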

In [None]:
import argparse
# defined command line options
# this also generates --help and error handling
CLI=argparse.ArgumentParser()
CLI.add_argument(
  "--lista",  # name on the CLI - drop the `--` for positional/required parameters
  nargs="*",  # 0 or more values expected => creates a list
  type=int,
  default=[1, 2, 3],  # default if nothing is provided
)
CLI.add_argument(
  "--listb",
  nargs="*",
  type=float,  # any type/callable can be used here
  default=[],
)

# parse the command line
args = CLI.parse_args()
# access CLI options
print("lista: %r" % args.lista)
print("listb: %r" % args.listb)

In [None]:
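import sys

# Positional interface: the nine arguments are read in a fixed order.
try:
    dataset_filename = sys.argv[1]
    labels_filename = sys.argv[2]
    population = int(sys.argv[3])
    generations = int(sys.argv[4])
    two_points_crossover_probability = float(sys.argv[5])
    arithmetic_crossover_probability = float(sys.argv[6])
    mutation_probability = float(sys.argv[7])
    goal_significances_filename = sys.argv[8]
    num_of_folds = int(sys.argv[9])
except (IndexError, ValueError) as err:
    # The original handler is missing from the source; reporting the
    # problem here is an assumed placeholder, not the tool's behaviour.
    print("Could not read the 9 positional arguments:", err)
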

### Data exploration
<p>The folder <code>Data</code> contains the datasets that were used to evaluate the performance of MEvA-X.</p>

In [None]:
!ls ../Data

### The omics dataset used in the evaluation of the tool (ORNISH dataset)
<p>This folder contains the data files <code>diet_dataset.txt</code> and <code>diet_labels.txt</code>, along with the precalculated features from the feature selection methods used (mRMR, JMI, SelectKBest, and Wilcoxon rank sum) in the subfolder <code>FS_methods</code>.<br>
Alternative data and the raw data can also be found here, in the directories <code>Alternarive_data</code> and <code>GSE66175_RAW</code> respectively.</p>

In [None]:
!ls ../Data/Ornish/

### The dataset with the categorical clinical data used in the evaluation of the tool (OPERA study)
<p>This dataset is unusual in that it has 4 different labels:</p>
<ul>
    <li>Total_Severity_Change</li>
    <li>Total_Medicine_Change</li>
    <li>Complaints_Change</li>
    <li>Total_Interference_Change</li>
</ul>

<p>The data are in <code>opera_full_dataset_headers.csv</code>, and each labels file carries an index [1-4] indicating which label it holds (e.g. <code>opera_full_labels_binary_1.csv</code>).</p>

In [None]:
!ls ../Data/Opera/

## Format of the data

The dataset must be arranged with features as rows and samples as columns, as in the example below for the ORNISH dataset.

In [None]:
import pandas as pd
import numpy as np
import os
data = pd.read_csv("../Data/Ornish/diet_dataset.txt", index_col = 0, sep="\t")
data.head(10)

The labels are a 1D array whose values correspond to the samples (columns) of the data matrix.

In [None]:
labels = pd.read_csv("../Data/Ornish/diet_labels.txt",header=None, sep="\t")
labels.head(10)
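
A quick consistency check before a run can catch orientation mistakes. The sketch below uses a small synthetic table in place of the real files, so every name and value in it is made up for illustration. It flattens the labels first, so the check does not depend on whether the labels file holds one row or one column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the data matrix: 3 features (rows) x 4 samples (columns).
data = pd.DataFrame(
    np.arange(12, dtype=float).reshape(3, 4),
    index=["gene_a", "gene_b", "gene_c"],
    columns=["s1", "s2", "s3", "s4"],
)
# Synthetic stand-in for the labels, shaped as if read with header=None.
labels = pd.DataFrame([[0, 1, 1, 0]])

# Flatten to 1D so the check works for a single row or a single column.
label_values = labels.to_numpy().ravel()

n_features, n_samples = data.shape
if label_values.size != n_samples:
    raise ValueError(
        f"{label_values.size} labels for {n_samples} samples; "
        "check that samples are in columns"
    )
print(f"{n_features} features, {n_samples} samples, {label_values.size} labels")
```

If the numbers of labels and columns disagree, the most likely cause is a transposed data matrix or an index column that was not declared with `index_col`.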

## Results of the algorithm

<p>The results of the algorithm are saved in the path given by the variable <code>output_folder</code> in <code>MEvA-X_V1.0.0.py</code>. The user can change this to any other suitable path; by default the results are saved in the relative path <code>./XGB_results/P&lt;Population&gt;_G&lt;Generations&gt;_K&lt;K-fold&gt;_&lt;TimeStamp&gt;</code>, where:<br>
Population is the number of individual solutions the user has entered,<br>
Generations is the number of generations the user wants the algorithm to run for,<br>
K-fold is the value of K in the cross-validation framework<br>
(e.g. <code>./XGB_results/P100_G50_K10_524457</code>).</p>
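
The folder name can be pictured as a simple string template. The sketch below is an illustration only: the helper `results_dir` is hypothetical, and the time stamp is passed in as a plain string because its exact format in MEvA-X is not described here.

```python
from pathlib import Path

def results_dir(population, generations, k_fold, timestamp, root="./XGB_results"):
    """Build a results path following the P..._G..._K..._<stamp> pattern."""
    return Path(root) / f"P{population}_G{generations}_K{k_fold}_{timestamp}"

# Reproduces the folder name of the example given in the text.
path = results_dir(population=100, generations=50, k_fold=10, timestamp="524457")
print(path.name)
```

Parsing goes the other way just as easily: splitting `path.name` on `_` recovers the population, generation, and fold settings of a finished run.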

## Parameters of the algorithm

### User defined parameters

<ul>
    <h3>Path related variables</h3>
    <li><code>dataset_filename</code></li>
    <li><code>labels_filename</code></li>
    <li><code>FS_dir</code></li>
    <li><code>goal_significances</code></li>
    <h3>Evolutionary algorithm related parameters</h3>
    <li><code>population</code></li>
    <li><code>generations</code></li>
    <li><code>num_of_folds</code></li>
    <li><code>two_points_crossover_probability</code></li>
    <li><code>arithmetic_crossover_probability</code></li>
    <li><code>mutation_probability</code></li>
    <li><code>goal_significances</code></li>

</ul>

### Hyper-parameters of the classifier

<p>These parameters control the exploration of the classifier's parameter space. The user can change the [min, max] range of each parameter, although this is not recommended because the defaults are set to explore a large set of hyperparameters.<br>
To change these limits, modify the array variables <code>min_values</code> and <code>max_values</code> in the <code>__main__</code> function of <code>MEvA-X_V1.0.0.py</code>.</p>

<ol>
    <h3>Feature Selection methods parameters</h3>
    <li><code>FS_method</code></li>
    <li><code>use_of_FS</code></li>
    <li><code>k-NN(mifs)</code></li>
    <li><code>k_SKB</code></li>
    <h3>XGBoost parameters</h3>
    <li><code>eta</code></li>
    <li><code>max_depth</code></li>
    <li><code>gamma</code></li>
    <li><code>lambda</code></li>
    <li><code>alpha</code></li>
    <li><code>min_child_weight</code></li>
    <li><code>scale_pos_weight</code></li>
    <h3>Not used parameters</h3>
    <li><code>colsample</code> [XGB, not used]</li>
    <li><code>subsample</code> [XGB, not used]</li>

</ol>
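
To make the role of the two bound arrays concrete, the sketch below draws one candidate configuration uniformly between per-parameter limits. The parameter names come from the XGBoost list above, but the numeric limits and the sampling scheme are assumptions for illustration, not the values or the procedure used in MEvA-X.

```python
import numpy as np

param_names = ["eta", "max_depth", "gamma", "lambda", "alpha",
               "min_child_weight", "scale_pos_weight"]
# Hypothetical limits, one entry per parameter, in the order of param_names.
min_values = np.array([0.01, 1, 0.0, 0.0, 0.0, 1, 1.0])
max_values = np.array([0.50, 10, 5.0, 5.0, 5.0, 10, 5.0])

rng = np.random.default_rng(seed=0)
# Uniform draw inside [min, max) for every parameter at once.
candidate = min_values + rng.random(len(param_names)) * (max_values - min_values)

params = dict(zip(param_names, candidate))
# Integer-valued parameters are rounded after sampling.
params["max_depth"] = int(round(params["max_depth"]))
print(params)
```

Narrowing a `[min, max]` pair restricts the search for that parameter, while setting both limits to the same value fixes it.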

## How to call the algorithm for the different datasets
<p>There are two scripts for running the tool on the datasets used in its evaluation. A separate script exists for the OPERA dataset because it has multiple labels, so that script asks the user to define which label to use.</p>

### How to call the algorithm for the ORNISH dataset

```python MEvA-X_V1.0.0.py```

This script has the following default settings:
<ul>
    <h3>Path related variables</h3>
    <li><code>dataset_filename = "./Data/Ornish/diet_dataset.txt"</code></li>
    <li><code>labels_filename = "./Data/Ornish/diet_labels.txt"</code></li>
    <li><code>FS_dir = "./Data/Ornish/FS_methods/"</code></li>
</ul>


### How to call the algorithm for the OPERA dataset

```python MEvA-X_V1.0.0_opera.py```

This script has the following default settings:
<ul>
    <h3>Path related variables</h3>
    <li><code>dataset_filename = "./Data/Opera/opera_full_dataset_headers.csv"</code> </li>
    <li><code>labels_filename = "./Data/Opera/opera_full_labels_binary_&lt;[1-4]&gt;.csv"</code>; the user is then asked to provide a number [1-4] for the corresponding label</li>
    <li><code>FS_dir = None</code>. Not all of the FS techniques worked on this dataset, so the feature selection is calculated on every run; this is feasible because the feature set is relatively small (~50 features)</li>
</ul>
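
The label choice amounts to filling an index into the file name. A minimal sketch, assuming only the integers 1 to 4 are valid; the helper `opera_labels_path` is hypothetical and is not part of the OPERA script.

```python
def opera_labels_path(label_index, data_dir="./Data/Opera"):
    """Return the labels file for label 1-4 (hypothetical helper)."""
    if label_index not in (1, 2, 3, 4):
        raise ValueError("label index must be 1, 2, 3 or 4")
    return f"{data_dir}/opera_full_labels_binary_{label_index}.csv"

print(opera_labels_path(1))  # ./Data/Opera/opera_full_labels_binary_1.csv
```

Rejecting anything outside 1 to 4 up front avoids a less informative file-not-found error later in the run.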

