# Intro to the ML-Pipeline

Here we will walk through how to use scripts in the ML-Pipeline to do machine learning. Check out the [README](https://github.com/ShiuLab/ML-Pipeline/blob/master/README.md) file for a graphical representation of what's going on inside the ML-Pipeline.
We are interested in predicting grain yield in Zea mays (maize, or corn) using gene
expression levels for 31,237 genes measured in seedlings. We will use these data to
predict:
- Grain yield for each line (i.e. a regression problem)
- If the line was in the top 25% for yield (i.e. a binary classification problem)
- If the line was in the top 25%, bottom 25%, or middle 50% for yield (i.e. a multi-class classification problem).
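The three prediction targets can all be derived from the same yield values. The sketch below is only an illustration of that idea: the line names and yields are made up, and the quartile cut-offs come from Python's `statistics.quantiles`, which may not match how the tutorial's data files were actually labeled.

```python
import statistics

def label_by_quartile(yields):
    """Derive binary and multi-class labels from continuous yield values."""
    # Quartile cut points (assumption: default 'exclusive' method of the stdlib)
    q1, _, q3 = statistics.quantiles(list(yields.values()), n=4)
    binary = {line: int(value > q3) for line, value in yields.items()}
    multiclass = {}
    for line, value in yields.items():
        if value > q3:
            multiclass[line] = "top"
        elif value < q1:
            multiclass[line] = "bottom"
        else:
            multiclass[line] = "middle"
    return binary, multiclass

# Hypothetical yields for eight lines (not taken from the tutorial data)
yields = {f"line{i}": float(i) for i in range(1, 9)}
binary, multiclass = label_by_quartile(yields)
print(binary)
print(multiclass)
```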


### Outline of the tutorial

1. Data exploration
2. Select instances (i.e. maize lines) to hold out as the testing set
3. Regression models 
4. Binary classification models
5. Multi-class classification models
6. Additional Pipeline Features:
    6.1 Data-preprocessing
    6.2 Feature selection
    6.3 Applying ML model to unknowns
    6.4 Visualizing your results

## 1. Data exploration

Here we will check out what our data looks like. 

In [None]:
import pandas as pd

In [None]:
d_reg = pd.read_csv('data_regression.txt', sep='\t', index_col = 0)
print(d_reg.head())
print(d_reg['yield'].describe())

In [None]:
d_bi = pd.read_csv('data_binary.txt', sep='\t', index_col = 0)
print(d_bi.head())
print(d_bi['yield'].describe())

In [None]:
d_mc = pd.read_csv('data_multiclass.txt', sep='\t', index_col = 0)
print(d_mc.head())
print(d_mc['yield'].describe())

## 2. Select instances (i.e. maize lines) to hold out as the final testing set

To find out how well our models perform, we set aside some of our data so that the final
assessment is made on instances the model never saw during training.

The test_set.py script will do this for you! You'll need to provide your dataset (-df), the number (-n) or percent (-p) of instances you would like to set aside for final testing, what to save the output as (-save), and the type of model you will be running (-type), either c for classification or r for regression. The model type matters because testing instances can be selected randomly for regression models, whereas for classification models we might want either the same number or the same proportion of instances from each class. To get the same number of instances from each class use -n; to get a testing set with the same class proportions as your training set use -p.
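To make the difference between the two options concrete, here is a minimal sketch of per-class holdout selection in plain Python. It is not the code in test_set.py; the function name `select_holdout`, the seed, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict

def select_holdout(labels, n=None, p=None, seed=42):
    """Pick holdout IDs per class: `n` per class, or a fraction `p` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance_id, label in labels.items():
        by_class[label].append(instance_id)
    holdout = []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        # Same count per class (n) or same proportion per class (p)
        k = n if n is not None else round(p * len(ids))
        holdout.extend(rng.sample(ids, k))
    return holdout

# Toy labels: 20 "top" lines and 60 "rest" lines (hypothetical IDs)
labels = {f"line{i}": ("top" if i < 20 else "rest") for i in range(80)}
same_number = select_holdout(labels, n=5)        # 5 instances from each class
same_proportion = select_holdout(labels, p=0.1)  # 10% of each class
print(len(same_number), len(same_proportion))
```

With a fixed count, the holdout is balanced even when the classes are not; with a fixed proportion, the holdout mirrors the class ratio of the full dataset.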

In [13]:
%run ../test_set.py -df data_regression.txt -type r -p 0.1 -save test_reg

Holding out 10.0 percent
39 instances in holdout
finished!


In [None]:
%run ../test_set.py -df data_binary.txt -type c -n 15 -save test_bin -y_name yield

In [3]:
%run ../test_set.py -df data_multiclass.txt -type c -n 15 -save test_mc -y_name yield

Holding out 15 instances per class
Pulling holdout set from classes: ['bottom' 'top' 'middle']
['ms132', 'lh38', 'mo5', 'nd167', 'c15', 'co236', 'k47', 'ids69', 'cl17', 'w9', 'a322', 'a188', 'cl22', 'a554', 't141']
['ms132', 'lh38', 'mo5', 'nd167', 'c15', 'co236', 'k47', 'ids69', 'cl17', 'w9', 'a322', 'a188', 'cl22', 'a554', 't141', 'dkpb80', 'oh7b', 'r134', 'nc358', 'n542', 'k4', 'mo3', 'mo28w', 'b52', 'pa891', 'w605s', 'b110', 'mog', 'r226', 'sd101']
['ms132', 'lh38', 'mo5', 'nd167', 'c15', 'co236', 'k47', 'ids69', 'cl17', 'w9', 'a322', 'a188', 'cl22', 'a554', 't141', 'dkpb80', 'oh7b', 'r134', 'nc358', 'n542', 'k4', 'mo3', 'mo28w', 'b52', 'pa891', 'w605s', 'b110', 'mog', 'r226', 'sd101', 'nc356', 'hy', 'b87', 'a73', 'ms221', 'b119', 'ch753-4', 'co237', 'w604s', 'mo17', 'f2834t', 'co216', 'pa778', 'a71', 'east_028']
45 instances in holdout
finished!


By running the code above, you've generated three new files called test_reg, test_bin, and test_mc that should be in your working directory. These files list the lines that will be held out of training and reserved for final testing.

# Building Machine Learning Models


## 3. Regression models

The machine learning algorithms in the ML-Pipeline are implemented with [SciKit-Learn](https://scikit-learn.org/stable/), which has excellent resources to learn more about the ins and outs of these algorithms.

Regression algorithms available in the pipeline are: Support Vector Regression with linear (SVM), polynomial (SVMpoly), and radial basis function (SVMrbf) kernels, Random Forest (RF), Gradient Tree Boosting (GB), and Logistic Regression (LogReg).

Note, there are many functions available within the pipeline that are not described in this tutorial. Run python ML_regression.py without any arguments to see more details!

We'll start by using RF to predict maize grain yield, holding out the instances listed in test_reg as the test set.

In [2]:
%run ../ML_regression.py -df data_regression.txt -test test_reg -y_name yield \
-alg RF -gs true -gs_reps 2 -n 2

Model built using 388 instances
Removing test instances to apply model on later...
Snapshot of data being used:
                 Y  GRMZM2G517408  GRMZM5G824831  GRMZM2G019971  \
04033v       0.945       4.000622       2.420715       0.379902   
33-16        2.715       3.617671       2.177523       0.414420   
38-11        2.210       3.474531       2.059158       0.220565   
4554_inbred  0.695       4.154001       2.327104       1.649560   
4578_inbred  1.855       3.876912       2.161319       0.794462   

             GRMZM2G134393  GRMZM2G149617  GRMZM2G024624  GRMZM2G174574  \
04033v            1.673321       2.178089       0.218431       2.921370   
33-16             2.087288       2.328111       0.685711       2.672112   
38-11             1.777336       1.911081       0.000000       1.923086   
4554_inbred       2.578617       1.925957       0.542716       1.340303   
4578_inbred       2.566272       2.169369       0.047206       1.935339   

             GRMZM2G412601  GRMZM2

**Results Breakdown**

First, note that you see two sets of results: results from the validation set, followed by results from the test set. If this is your final model, you can report the results from the test set. However, if you are going to compare this model to, say, a model using a different algorithm, and then select which algorithm to use for your study, you should use the validation set results. Basically, you want to avoid making any decisions about the modeling process using the test set results, because doing so may lead to overfitting.

For regression models, four performance metrics are reported ([MSE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error), [EVS](https://scikit-learn.org/stable/modules/model_evaluation.html#explained-variance-score), [r2](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score), and [PCC](https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html)), along with the standard deviation (std) and standard error (se) over the replicates (n). In the case of regression models, each replicate is different because different [cross validation folds](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) are assigned to each replicate.
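As a rough illustration of what these numbers mean, the sketch below computes MSE, r2, and PCC in plain Python and then summarizes r2 across replicates. The toy values are made up, and using the sample standard deviation for std and se is an assumption; the pipeline itself relies on scikit-learn and may compute these summaries differently.

```python
import math
import statistics

def mse(y, yhat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def pcc(y, yhat):
    """Pearson correlation coefficient between true and predicted values."""
    my, mp = statistics.fmean(y), statistics.fmean(yhat)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, yhat))
    var = sum((a - my) ** 2 for a in y) * sum((b - mp) ** 2 for b in yhat)
    return cov / math.sqrt(var)

# Hypothetical true yields and predictions from two replicates
y_true = [1.0, 2.0, 3.0, 4.0]
replicates = [[1.1, 1.9, 3.2, 3.8], [0.8, 2.3, 2.9, 4.1]]

scores = [r2(y_true, rep) for rep in replicates]
mean_r2 = statistics.fmean(scores)
std_r2 = statistics.stdev(scores)            # sample std (assumption)
se_r2 = std_r2 / math.sqrt(len(scores))      # standard error over replicates
print(mean_r2, std_r2, se_r2)
```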

One parameter that can be adjusted is how many cross validation folds you want to include (-cv). The default, and a commonly used fold number, is -cv 10. However, if you have a small dataset, using fewer folds (-cv 5) may perform better. In the extreme case, if you have very few instances to train on, you can set -cv equal to the number of instances in your dataset, which gives you leave-one-out ([LOO](https://www.cs.cmu.edu/~schneide/tut5/node42.html)) cross validation.
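The effect of the fold number can be seen in a small sketch of how instances are divided into folds. This is a plain-Python illustration, not the pipeline's implementation (which uses scikit-learn); the function name `kfold_indices` and the seed are hypothetical.

```python
import random

def kfold_indices(n_instances, n_folds, seed=0):
    """Yield (training, validation) index lists for each cross validation fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into n_folds groups of (nearly) equal size
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# 10-fold cross validation on 50 instances: each fold validates on 5 instances
splits_10 = list(kfold_indices(50, 10))

# Folds equal to the number of instances gives leave-one-out cross validation
splits_loo = list(kfold_indices(6, 6))
print(len(splits_10), len(splits_loo))
```

Every instance appears in exactly one validation fold, so each instance receives one prediction from a model that was not trained on it.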

**Output files**

You will also note that the pipeline generated a number of output files from your run. You can specify a prefix for these files using -save. Here is a breakdown of what you'll find in those files:

- **"_results":** will give you a summary of the results from the run, similar to what you see when you run the pipeline on the command line.
- **"_scores":** For each instance you see the original value to predict (Y), the mean and stdev of the prediction across all replicates, and the predicted Y for each replicate.
- **"_imp":** the importance of each feature in your model. For RF and GB this score represents the [Gini Index](https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3), while for LogReg and SVM it is the [coefficient](https://medium.com/@aneesha/visualising-top-features-in-linear-svm-with-scikit-learn-and-matplotlib-3454ab18a14d). SVM with non-linear kernels (i.e. poly, rbf) does not report importance scores.
- **"_GridSearch":** the average performance metric across the whole possible parameter space tested via the grid search. 

## 4. Binary classification models

All of the algorithms available for regression are also available for classification. The pipeline uses 1 and 0 as the default positive and negative classes, respectively. However, you can specify your own pos/neg classes using -pos POS_string -neg NEG_string. 


Let's try using SVM with a linear kernel to predict if a line is in the top 25% (pos) or bottom 75% (neg) for yield.

In [2]:
%run ../ML_classification.py -df data_binary.txt -test test_bin -y_name yield \
-alg SVM -gs true -gs_reps 2 -n 2

Removing test instances to apply model on later...
Snapshot of data being used:
             Class  GRMZM2G517408  GRMZM5G824831  GRMZM2G019971  GRMZM2G134393
33-16            1       0.584857       0.850034       0.201713       0.515018
38-11            0       0.509846       0.803828       0.107357       0.361492
4554_inbred      0       0.865917       0.908425       0.802900       0.758384
80-2             0       0.684600       0.914129       0.788498       0.626799
a                1       0.744183       0.830093       0.560451       0.588417
CLASSES: [0 1]
POS: 1 <class 'int'>
NEG: 0 <class 'int'>
Balanced dataset will include 81 instances of each class


===>  Grid search started  <===
Round 1 of 2
Round 2 of 2
Parameter sweep time: 3.759559 seconds
Parameters selected: Kernel=Linear, C=0.01
Grid search complete. Time: 3.760623 seconds


===>  ML Pipeline started  <===
  Round 1 of 2
  Round 2 of 2
ML Pipeline time: 0.943225 seconds


===>  ML Results  <===

Validation Set Score

  AucRoc_array.append(r['AucRoc'])


**Results Breakdown**

First, note that before the grid search started running, the pipeline told you what your positive and negative class strings were and how many instances would be in each class for each replicate. This is an important feature in the ML-Pipeline. Training a ML classifier with unbalanced data, meaning data with different numbers of instances in each class, can cause your model to be biased toward predicting the more numerous class. For example, if you train a model using 100 positive examples and 100,000 negative examples, your model would do well to just call every instance negative! Therefore, in this pipeline, the larger classes are randomly downsampled to generate balanced datasets. To ensure we still utilize as much data as possible, this downsampling is done independently for each replicate (-n).
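The downsampling idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the pipeline's own code: `balanced_sample`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(labels, rng):
    """Randomly downsample every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for instance_id, label in labels.items():
        by_class[label].append(instance_id)
    smallest = min(len(ids) for ids in by_class.values())
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.sample(sorted(by_class[label]), smallest))
    return sample

# Toy dataset: 10 positive (1) and 30 negative (0) instances
labels = {f"line{i}": (1 if i < 10 else 0) for i in range(40)}

# A fresh random draw per replicate, so different negatives get used each time
rng = random.Random(1)
replicate_sets = [balanced_sample(labels, rng) for _ in range(2)]
for ids in replicate_sets:
    print(len(ids), sum(labels[i] for i in ids))
```

Because each replicate draws its own subset of the larger class, more of the available data contributes to the results than a single downsampled dataset would allow.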

A number of performance metrics are included for binary classification models, including [AUC-ROC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), [F-measure](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), [AUC-PRC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html), [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), and [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score).

**Output files**

The classification pipeline generates similar output to the regression pipeline, although there are a few notable differences.

- **"_scores":** Here, the predicted probability (pp) scores are reported for each replicate. The pp score represents how confident the model was in its classification, where pp=1 means it is certain the instance is positive and pp=0 means it is certain the instance is negative. For each replicate, an instance is classified as pos if pp > threshold, where the threshold is the value between 0.001 and 0.999 that maximizes the F-measure. While the performance metrics generated by the pipeline are calculated for each replicate independently, we want to be able to make a final statement about which instances were called as positive and which were called as negative. You'll find those results in this file. To make this final call, we calculate the mean threshold and the mean pp for each instance, and call the instance pos if the mean pp > mean threshold.
- **"_results":** Here you will see an overview of the results similar to what is printed in the command line. However, for classification problems you will see two additional sections: the Mean Balanced Confusion Matrix (CM) and the Final Full CM. The Mean Balanced CM is generated by taking the average number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across all replicates (which have been downsampled randomly to be balanced). The Final Full CM represents the TP, TN, FP, and FN counts from the final pos/neg classifications (described above in _scores) for all instances in your input dataset.
- **"_BalancedID":** Each row lists the instances that were included in each replicate (-n) after downsampling. 
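The thresholding logic described for the _scores file can be sketched as follows. This is a simplified plain-Python illustration with made-up pp values; the helper names and the 0.001 step size are assumptions, and the pipeline's own implementation may differ in detail.

```python
def f_measure(y_true, y_pred):
    """F1 score for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, pp):
    """Lowest threshold in 0.001..0.999 that maximizes the F-measure."""
    candidates = [i / 1000 for i in range(1, 1000)]
    return max(candidates,
               key=lambda thr: f_measure(y_true, [int(s > thr) for s in pp]))

# Hypothetical pp scores for four instances across two replicates
y_true = [1, 1, 0, 0]
pp_by_replicate = [[0.9, 0.6, 0.4, 0.2], [0.8, 0.7, 0.1, 0.3]]

# Per-replicate thresholds, then the final call from mean pp vs. mean threshold
thresholds = [best_threshold(y_true, pp) for pp in pp_by_replicate]
mean_threshold = sum(thresholds) / len(thresholds)
mean_pp = [sum(col) / len(col) for col in zip(*pp_by_replicate)]
final_calls = [int(s > mean_threshold) for s in mean_pp]
print(thresholds, final_calls)
```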

## 5. Multi-class classification models

All of the algorithms available for regression and binary classification are also available for multiclass classification. There are no default classes for multiclass models, so you will have to specify which classes to include using -cl_train.


Let's try using SVM with a linear kernel to predict if a line is in the top 25% (top), middle 50% (middle), or bottom 25% (bottom) for yield.

In [4]:
%run ../ML_classification.py -df data_multiclass.txt -test test_mc -y_name yield \
-cl_train top,middle,bottom -alg SVM -gs true -gs_reps 2 -n 2 

Removing test instances to apply model on later...
Snapshot of data being used:
              Class  GRMZM2G517408  GRMZM5G824831  GRMZM2G019971  \
04033v       bottom       0.785540       0.944968       0.184912   
33-16           top       0.584857       0.850034       0.201713   
38-11        middle       0.509846       0.803828       0.107357   
4554_inbred  bottom       0.865917       0.908425       0.802900   
4578_inbred  middle       0.720710       0.843708       0.386693   

             GRMZM2G134393  
04033v            0.309971  
33-16             0.515018  
38-11             0.361492  
4554_inbred       0.758384  
4578_inbred       0.752269  
CLASSES: ['bottom' 'middle' 'top']
POS: top <class 'str'>
NEG: multiclass_no_NEG <class 'str'>
Balanced dataset will include 81 instances of each class


===>  Grid search started  <===
Round 1 of 2
Round 2 of 2
Parameter sweep time: 16.296684 seconds
Parameters selected: Kernel=Linear, C=0.5
Grid search complete. Time: 16.299435 secon

**Results Breakdown**

The output for multi-class classification models is similar to binary classification models with a few key differences. 

*An important note: For binary classification using balanced datasets, you would expect a ML model that was just randomly guessing the class to be correct ~50% of the time. Because of this, the random expectation for performance metrics like AUC-ROC and the F-measure is 0.50. This is not the case for multi-class predictions. Using our model above as an example, a ML model that was randomly guessing top, middle, or bottom on a balanced dataset would only be correct ~33% of the time. That means models performing with >33% accuracy are performing better than random expectation.*

There are two levels of performance metrics for multi-class models. Per-class performance metrics (referred to here as micro metrics) are generated for each class in your ML problem. For example, from our model we will get three per-class F-measures (F1-top, F1-middle, F1-bottom). These scores are available in the *_results* output file. Macro performance metrics are generated by taking the average of all the per-class performance metrics. These scores are what are printed in the command line. Note that scikit-learn uses "micro" differently: there, micro-averaging pools the TP, FP, and FN counts across all classes before computing a single score.
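The relationship between the per-class scores and the macro score can be shown with a small worked example. The labels and predictions below are made up, and the helper `f1_for_class` is hypothetical; the sketch only illustrates averaging per-class F-measures into one macro F-measure.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class and all others as negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical true classes and predictions for six lines
y_true = ["top", "top", "middle", "middle", "bottom", "bottom"]
y_pred = ["top", "middle", "middle", "middle", "bottom", "top"]
classes = ["top", "middle", "bottom"]

# One F1 per class, then their unweighted average as the macro F1
per_class_f1 = {c: f1_for_class(y_true, y_pred, c) for c in classes}
macro_f1 = sum(per_class_f1.values()) / len(classes)
print(per_class_f1, macro_f1)
```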

# Additional ML-Pipeline Features

## 6.1 Data Preprocessing

Coming soon - imputing NAs, one-hot encoding, etc.

## 6.2 Feature Selection

While one major advantage of ML approaches is that they are robust when the number of features is very large, there are cases where removing uninformative features or selecting only the best features may help you better answer your question. One common issue we see with feature selection for machine learning is using the whole dataset to select the best features, which results in overfitting (see [here](https://www.nature.com/articles/srep10312)). The ML-Pipeline allows you to perform feature selection within the training-validation-testing scheme, thereby avoiding overfitting.
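The key point is that the test instances must play no part in choosing features. The sketch below illustrates this with a deliberately simple stand-in method, ranking features by absolute Pearson correlation with Y on the training instances only. It is not one of the methods in Feature_Selection.py; the function names, gene names, and values are all hypothetical.

```python
import math
import statistics

def abs_pcc(x, y):
    """Absolute Pearson correlation between a feature and the target."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(var) if var else 0.0

def select_top_features(features, y, train_ids, k):
    """Rank features using ONLY the training instances and keep the top k."""
    y_train = [y[i] for i in train_ids]
    scores = {name: abs_pcc([values[i] for i in train_ids], y_train)
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy expression values for three hypothetical genes across six lines
features = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geneB": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    "geneC": [6.1, 4.9, 4.2, 2.8, 2.1, 1.0],
}
y = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]

# The last two lines are held out and never seen during feature selection
train_ids, test_ids = [0, 1, 2, 3], [4, 5]
selected = select_top_features(features, y, train_ids, k=2)
print(selected)
```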

There are many strategies for feature selection out there. The ML-Pipeline allows you to perform feature selection using:
- Chi2
- Enrichment (for binary classification only)
- Random Forest
- Relief
- LASSO (for regression only)
- Bayesian LASSO (for regression only)
- Bayes A (for regression only)
- Elastic Net (for regression only)
- Ridge Regression (for regression only)
- Random

For more information, run Feature_Selection.py with no arguments:

In [6]:
%run ../Feature_Selection.py


PURPOSE:
Run feature selection method available from sci-kit learn on a given dataframe

Must set path to Miniconda in HPC:  export PATH=/mnt/home/azodichr/miniconda3/bin:$PATH


INPUT:
  -df       Feature file for ML. If class/Y values are in a separate file use -df for features and -df2 for class/Y
  -f        Feature selection method to use 
                - Chi2 
                    need: -n
                - RandomForest 
                    need: -n, -type
                - Enrichment using Fisher's Exact (for classification with binary feats only)
                    need: -p (default=0.05)
                - LASSO 
                    need: -p, -type, n
                - Bayesian LASSO (bl) 
                    need: -n
                - Elastic Net (EN)
                    need: -n -p (default=0.5, proportion of L1 to L2 penalty)
                - Relief (https://github.com/EpistasisLab/scikit-rebate) (currently for regression only)
                    need: -n, 
            

NameError: name 'exit' is not defined