# Machine Learning Workshop

Here we will walk through an example of a machine learning workflow following five steps:

<img src="../_img/ml_workflow.png" alt="ML Workflow" width="800"/>

For more detailed information on the Shiu Lab's ML pipeline, including explanations of all output files,
check out the [README](https://github.com/ShiuLab/ML-Pipeline).

***

## Step 0. Set up Jupyter notebook & software

Check out this [**guide**](../Tutorial/README.md) to learn how to set up Jupyter notebook and the software needed to run the Shiu Lab's ML pipeline.


***

![Step 1](../_img/step1.png "ML Workflow step 1")

**What do we want to predict?** 

Whether a gene is involved in specialized or general metabolism.

**What are the labeled instances?**

Tomato genes annotated as being involved in specialized or general metabolism by TomatoCyc.

**What are the predictive features?** 
- duplication information (e.g. number of paralogs, gene family size)
- sequence conservation (e.g. nonsynonymous/synonymous substitution rates between homologs)
- gene expression (e.g. breadth, stress specific, co-expression)
- protein domain content (e.g. p450, Aldedh)
- epigenetic modification (e.g. H3K23ac histone marks)
- network properties (e.g. number of protein-protein interactions, network connectivity)

**What data do we have?**
- 532 tomato genes with specialized metabolism annotation by TomatoCyc
- 2,318 tomato genes with general metabolism annotation by TomatoCyc
- 4,197 features (we are only using a subset of **564** for this workshop)



***

![Step 2](../_img/step2.png "ML Workflow step 2")


In [1]:
## A. Let's look at the data (note, you can do this in Excel or R!)
import pandas as pd

d = pd.read_table('data.txt', sep='\t', index_col = 0)

print('Shape of data (rows, cols):')
print(d.shape)

print('\nSnapshot of data:')
print(d.iloc[:6,:5])  # prints first 6 rows and 5 columns

print('\nList of class labels')
print(d['Class'].value_counts())

Shape of data (rows, cols):
(2872, 565)

Snapshot of data:
                Class  Crubella_183_v1.0.csv  FamilySize FamilySize_cat  \
YP_008563134      gen                    0.0    0.010582         medium   
XP_010327628      gen                    0.0    0.000000          small   
XP_010327620  special                    0.0    0.052910         medium   
XP_010327578      gen                    0.0         NaN            NaN   
XP_010327494      gen                    1.0    0.021164         medium   
YP_008563119  special                    0.0    0.000000          small   

              Transferase  
YP_008563134          0.0  
XP_010327628          NaN  
XP_010327620          0.0  
XP_010327578          0.0  
XP_010327494          0.0  
YP_008563119          0.0  

List of class labels
gen        2318
special     532
unknown      22
Name: Class, dtype: int64


**Things to notice:**
- Our data has NAs. Most ML algorithms cannot handle NAs, so we need to either drop or impute NA values!
- We have binary, continuous, and categorical features in this dataset. A perk of ML models is that they can integrate multiple data types in a single model.
- However, before being used as input, a categorical feature needs to be converted into a set of binary features using an approach called [one-hot encoding](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding).

*Before One-Hot Encoding:*

| ID   | Class    | Weather   |
|---    |---    |---    |
| instance_A    |  1     | sunny     |
| instance_B    |  0    |  overcast     |
| instance_C   |  0     |  rain    | 
| instance_D   | 1     |  sunny    |

*After One-Hot Encoding:*

| ID   | Class    | Weather_sunny   | Weather_overcast   | Weather_rain   |
|---    |---    |---    |---    |---    |
| instance_A    |  1     | 1     | 0     | 0     |
| instance_B    |  0    |  0     |  1     |  0     |
| instance_C   |  0     |  0    |  0    |  1    | 
| instance_D   | 1     |  1    | 0    | 0    |
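
The same transformation can be sketched in a few lines of pandas. This is only an illustration on the toy weather table above, not the code that `ML_preprocess.py` runs; the `toy` and `encoded` names are made up for this example.

```python
import pandas as pd

# Toy table mirroring the "Before One-Hot Encoding" example.
toy = pd.DataFrame(
    {
        "Class": [1, 0, 0, 1],
        "Weather": ["sunny", "overcast", "rain", "sunny"],
    },
    index=["instance_A", "instance_B", "instance_C", "instance_D"],
)

# Replace the categorical column with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["Weather"], dtype=int)
print(encoded)
```

Note that pandas orders the new columns alphabetically (`Weather_overcast`, `Weather_rain`, `Weather_sunny`), so the column order differs from the table above even though the content is the same.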



***

### Automated data cleaning: ML_preprocess.py

Input
```
-df: your data table
-na_method: how you want to impute NAs (options: drop, mean, median, mode)
-h: show more options
```

In [2]:
# B. Drop/Impute NAs and one-hot-encode categorical features

%run ../ML_preprocess.py -df data.txt -na_method median

Snapshot of input data...
                Class  Crubella_183_v1.0.csv  FamilySize FamilySize_cat  \
YP_008563134      gen                    0.0    0.010582         medium   
XP_010327628      gen                    0.0    0.000000          small   
XP_010327620  special                    0.0    0.052910         medium   
XP_010327578      gen                    0.0         NaN            NaN   
XP_010327494      gen                    1.0    0.021164         medium   

              Transferase  
YP_008563134          0.0  
XP_010327628          NaN  
XP_010327620          0.0  
XP_010327578          0.0  
XP_010327494          0.0  


### Dropping/imputing NAs... ###

Number of columns with NAs: 41
Features dropped because missing > 50.00% of data: ['SQS_PSY']
Number of columns to impute: 40


### One Hot Encoding... ###

Features to one-hot-encode: ['FamilySize_cat']
Dataframe shape (rows, cols) before and after one-hot-encoding:
Before: (2872, 563)
After: (2872, 565)

Number of d
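
To make the two cleaning steps reported above more concrete, here is a minimal pandas sketch of "drop features missing in more than 50% of instances, then fill the rest with the column median". The 50% cutoff comes from the log message above; the toy values and the `MostlyMissing` column are hypothetical, and the pipeline's own implementation may differ in its details.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; NaN marks a missing value.
toy = pd.DataFrame(
    {
        "FamilySize": [0.01, 0.00, 0.05, np.nan, 0.02],
        "Transferase": [0.0, np.nan, 0.0, 0.0, 1.0],
        "MostlyMissing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    }
)

# 1. Drop features that are missing in more than half of the instances.
missing_fraction = toy.isna().mean()
kept = toy.loc[:, missing_fraction <= 0.5]

# 2. Fill the remaining gaps with each column's median.
imputed = kept.fillna(kept.median())
print(imputed)
```

The median is a reasonable default for skewed or binary features because, unlike the mean, it is not pulled toward a few extreme values.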

***

## Set aside instances for testing 

We want to set aside a subset of our data to use to test how well our model performed. Note that this is done before feature engineering, parameter selection, or model training. This will ensure our performance metric is entirely independent from our modeling!


### Automated selection of test set: test_set.py

Input
```
-df: your data table
-use: what class labels to include in the test set (we don't want to include unknowns!)
-type: (c) classification or (r) regression
-p: What percent of instances from each class to select for test (0.1 = 10%)
-save: save name for test set
```


In [18]:
# C. Define test set

%run ../test_set.py -df data_mod.txt  \
                    -use gen,special  \
                    -type c  \
                    -p 0.1  \
                    -save test_genes.txt

Holding out 10.0 percent
Pulling test set from classes: ['gen', 'special']
285 instances in test set
finished!
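
If you are curious what "select 10% of instances from each class" looks like in code, here is a rough standard-library sketch. The `stratified_holdout` function, the gene IDs, and the rounding rule are assumptions made for illustration; `test_set.py` may select and round differently.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, fraction, use, seed=42):
    """Pick `fraction` of the IDs from each class in `use` as the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance_id, label in labels.items():
        if label in use:  # e.g. skip 'unknown' instances
            by_class[label].append(instance_id)
    test_ids = []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        n_test = round(len(ids) * fraction)
        test_ids.extend(rng.sample(ids, n_test))
    return test_ids


# Hypothetical IDs with the same class sizes as the workshop data.
labels = {f"gene_{i}": "gen" for i in range(2318)}
labels.update({f"gene_{i}": "special" for i in range(2318, 2850)})
labels.update({f"gene_{i}": "unknown" for i in range(2850, 2872)})

test_ids = stratified_holdout(labels, fraction=0.1, use={"gen", "special"})
print(len(test_ids), "instances in test set")
```

Sampling within each class keeps the class ratio of the test set close to that of the full dataset, which matters here because general metabolism genes outnumber specialized ones by more than four to one.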


***


![Step 3](../_img/step3.png "ML Workflow step 3")

While one major advantage of ML approaches is that they are robust when the number of features is very large, there are cases where removing uninformative features or selecting only the best features may help you better answer your question. A common mistake is to use the whole dataset to select the best features: information from the test instances then leaks into the model, which results in overfitting and an overly optimistic performance estimate! **Be sure to specify your test set so that these instances are not used for feature selection!**


### Automated feature selection: Feature_Selection.py

Input
```
-df: your data table
-test: what instances to hold out (i.e. test instances!)
-cl_train: labels to include in training the feature selection algorithm
-type: (c) classification or (r) regression
-alg: what feature selection algorithm to use (e.g. lasso, elastic net, random forest)
-p: Parameter specific to different algorithms (use -h for more information)
-n: Number of features to select (unless algorithm does this automatically)
-save: save name for list of selected features
```


Here we will use one of the most common feature selection algorithms: LASSO. LASSO requires the user to select the level of sparsity (-p) they want to induce during feature selection, where a larger value results in more features being selected and a smaller value results in fewer. You can play around with this value to see what it does for your data.
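
To get a feel for how an L1 (LASSO-style) penalty controls the number of selected features, here is a small scikit-learn sketch on synthetic data. It uses `LinearSVC` with `penalty="l1"`, where `C` is the inverse of the regularization strength, so a larger `C` keeps more features. Whether `Feature_Selection.py` maps `-p` onto a parameter in exactly this way is an assumption; check `-h` for the details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 300 instances, 50 features, only 5 informative.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
X = StandardScaler().fit_transform(X)


def n_selected(C):
    """Count features given a non-zero weight by an L1-penalized linear model."""
    model = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000, random_state=0)
    model.fit(X, y)
    return int(np.count_nonzero(model.coef_))


for C in (0.01, 0.1, 1.0):
    print(f"C={C}: {n_selected(C)} features kept")
```

Features whose weight is shrunk to exactly zero are the ones that get dropped, which is why an L1 penalty doubles as a feature selection method.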


In [19]:
%run ../Feature_Selection.py -df data_mod.txt \
                            -test test_genes.txt \
                            -cl_train special,gen  \
                            -type c  \
                            -alg lasso  \
                            -p 0.01  \
                            -save top_feat_lasso.txt

Removing testldout instances...
Dropping instances that are not in ['special', 'gen'], changed dimensions from (2587, 566) to (2565, 566) (instance, features).

Snapshot of data:
              Class  Crubella_183_v1.0.csv  FamilySize  Transferase  \
YP_008563134      0                    0.0    0.010582          0.0   
XP_010327628      0                    0.0    0.000000          0.0   
XP_010327578      0                    0.0    0.015873          0.0   
XP_010327494      0                    1.0    0.021164          0.0   
YP_008563119      1                    0.0    0.000000          0.0   
YP_008563115      0                    0.0    0.010582          0.0   

              Exo_endo_phos  
YP_008563134            0.0  
XP_010327628            0.0  
XP_010327578            0.0  
XP_010327494            0.0  
YP_008563119            0.0  
YP_008563115            0.0  
=====* Running L1/LASSO based feature selection *=====
Features selected using LASSO: ['p450' 'UDPGT' 'tandemDupG

In [6]:
%run ../Feature_Selection.py -df data_mod.txt  \
                            -test test_genes.txt  \
                            -cl_train special,gen \
                            -type c  \
                            -alg random  \
                            -n 10  \
                            -save rand_feat.txt


Removing testldout instances...
Dropping instances that are not in ['special', 'gen'], changed dimensions from (2587, 566) to (2565, 566) (instance, features).

Snapshot of data:
              Class  Crubella_183_v1.0.csv  FamilySize  Transferase  \
YP_008563134      0                    0.0    0.010582          0.0   
XP_010327628      0                    0.0    0.000000          0.0   
XP_010327620      1                    0.0    0.052910          0.0   
XP_010327578      0                    0.0    0.015873          0.0   
XP_010327494      0                    1.0    0.021164          0.0   
YP_008563115      0                    0.0    0.010582          0.0   

              Exo_endo_phos  
YP_008563134            0.0  
XP_010327628            0.0  
XP_010327620            0.0  
XP_010327578            0.0  
XP_010327494            0.0  
YP_008563115            0.0  
Run time (sec):0.28
Done!


***

![Step 4](../_img/step4.png "ML Workflow step 4")

Next we want to determine which ML algorithm (e.g. Support Vector Machine (SVM), Random Forest (RF)) we should use and which parameter values work best for that algorithm. Importantly, at this stage we **only assess our model performance on the validation data** to ensure we aren't just selecting the algorithm that works best on our held-out testing data. The pipeline will automatically withhold the testing data from the parameter selection (i.e. grid search) step.

Note, the pipeline **automatically "balances" your data**, meaning it pulls the same number of instances of each class for training. This avoids biasing the model to just predict everything as the more common class. Because each replicate is trained on a different random subset of the larger class, this is a major reason why we want to run multiple replicates of the model!
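
Balancing by downsampling can be sketched with the standard library as follows. The `balance_by_downsampling` function and the toy IDs are hypothetical and only illustrate the idea; they are not taken from the pipeline.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(labels, seed):
    """Keep the same number of randomly chosen instances from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance_id, label in labels.items():
        by_class[label].append(instance_id)
    n_per_class = min(len(ids) for ids in by_class.values())
    balanced = {}
    for label, ids in by_class.items():
        for instance_id in rng.sample(sorted(ids), n_per_class):
            balanced[instance_id] = label
    return balanced


# Hypothetical training labels: 20 'gen' and 5 'special' instances.
labels = {f"g{i}": "gen" for i in range(20)}
labels.update({f"s{i}": "special" for i in range(5)})

# A different seed per replicate draws a different subset of 'gen'.
rep0 = balance_by_downsampling(labels, seed=0)
rep1 = balance_by_downsampling(labels, seed=1)
print(Counter(rep0.values()))
```

Every instance of the smaller class is kept in each replicate, while the larger class is resampled, so averaging over replicates lets most of the majority-class instances contribute to training at some point.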


### Algorithm Selection
The machine learning algorithms in the ML_Pipeline are implemented using [scikit-learn](https://scikit-learn.org/stable/), which has excellent resources to learn more about the ins and outs of these algorithms.

**Why is algorithm selection useful?** ML models are able to learn patterns from data without being explicitly programmed to look for those patterns. ML algorithms differ in what patterns they excel at finding. For example, a linear SVM is limited to linear relationships between features and labels, while RF, because of its hierarchical structure, is able to model interactions between your features. Furthermore, algorithms vary in their complexity and in the amount of training data they need.


### Parameter Selection
Most ML algorithms have internal parameters that need to be set by the user. For example:

![RF Parameter examples](../_img/rf_params.png "Sample of RF parameters")


There are two general strategies for parameter selection: the grid search (default option: left) and the random search (use "-gs_type random": right):
![Grid search](../_img/grid_rand_search.png "Grid Search")
*Image: Bergstra & Bengio 2012; used under CC-BY license*
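
The difference between the two strategies is easy to see in code. The sketch below builds both kinds of candidate lists for a hypothetical search space; the parameter names follow scikit-learn's `RandomForestClassifier`, but the values are made up and are not the pipeline's defaults.

```python
import itertools
import random

# Hypothetical RF-style search space.
space = {
    "max_depth": [3, 5, 10],
    "max_features": [0.1, 0.5, "sqrt"],
    "n_estimators": [100, 500],
}
names = sorted(space)

# Grid search: try every combination of the listed values.
grid = [
    dict(zip(names, combo))
    for combo in itertools.product(*(space[n] for n in names))
]

# Random search: try a fixed budget of randomly drawn combinations.
rng = random.Random(0)
random_draws = [{n: rng.choice(space[n]) for n in names} for _ in range(6)]

print(len(grid), "grid candidates;", len(random_draws), "random candidates")
```

The grid grows multiplicatively with every parameter you add (3 x 3 x 2 = 18 here), while the random search cost is set by the budget you choose. In practice a random search often draws from continuous ranges rather than fixed lists; lists are used here only to keep the example short.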


### Training and Validation
Training and validation are done using a [cross-validation (CV)](https://towardsdatascience.com/cross-validation-70289113a072) scheme. CV is useful because it makes good use of our data (i.e. uses all non-test data for training at some point) but also makes sure we are selecting the best parameters/algorithms based on models that aren't overfit to the training data. Here is a visual to demonstrate how CV works (with 10 CV folds in this example):

![Cross Validation](../_img/cross_validation.png "Cross validation")
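
The fold logic in the figure can be written out in a few lines. This is a bare-bones sketch using only the standard library (the `k_fold_indices` helper is made up for this example); the pipeline itself relies on scikit-learn for this step.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal, non-overlapping folds
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


splits = list(k_fold_indices(n_instances=20, k=5))
for train, validation in splits:
    print(len(train), "train /", len(validation), "validation")
```

Each instance lands in the validation fold exactly once and in the training folds the other k - 1 times, which is what "uses all non-test data for training at some point" means.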


### Automated parameter selection, ML training, validation, & testing:  ML_classification.py/ML_regression.py

**Input:**
```
-df: your data table
-test: what instances to hold out (i.e. test instances)
-cl_train: labels to include in training the model
-alg: what ML algorithm to use (e.g. SVM, RF, LogReg)
-cv: Number of cross-validation folds (default = 10, use fewer if data set is small)
-n: Number of replicates of the cross-validation scheme to run (default = 100)
```

*There are many functions available within the pipeline that are not described in this workshop. For more options run:*
```
python ML_classification.py -h
```


In [3]:
%run ../ML_classification.py -df data_mod.txt \
                        -test test_genes.txt \
                        -cl_train special,gen \
                        -alg SVM \
                        -cv 5 \
                        -n 10


Removing test instances to apply model on later...
Snapshot of data being used:
                Class  Crubella_183_v1.0.csv  FamilySize  Transferase  \
YP_008563134      gen                    0.0    0.016667          0.0   
XP_010327628      gen                    0.0    0.000000          0.0   
XP_010327620  special                    0.0    0.083333          0.0   
XP_010327578      gen                    0.0    0.025000          0.0   
XP_010327494      gen                    1.0    0.033333          0.0   

              Exo_endo_phos  
YP_008563134            0.0  
XP_010327628            0.0  
XP_010327620            0.0  
XP_010327578            0.0  
XP_010327494            0.0  


CLASSES: ['gen' 'special']
POS: special type:  <class 'str'>
NEG: gen type:  <class 'str'>

Balanced dataset will include 478 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Round 5 of 10
Round 6 of 10
Round 7 of 10
Round 8 of 10
Rou

#### Results Breakdown

There are dozens of [performance metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) that can be used to assess how well an ML model works. While the best metric for you depends on the type of question you are asking, some of the most generally useful metrics include the area under the Receiver Operating Characteristic curve (AUC-ROC), the area under the Precision-Recall curve (AUC-PRc), and the F-measure (F1).

![AUCROC_Correlation](../_img/metrics.png "AUCROC Correlation")
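
As a concrete example of one of these metrics, here is how precision, recall, and F1 are computed from a set of predictions. The labels below are invented, and returning 0 when a metric is undefined (e.g. no positive predictions at all) is a convention chosen for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive="special"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when there are no positive predictions
    # or no positive instances.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented labels: 3 'special' and 5 'gen' genes.
y_true = ["special", "special", "special", "gen", "gen", "gen", "gen", "gen"]
y_pred = ["special", "special", "gen", "special", "gen", "gen", "gen", "gen"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it is only high when both are high. AUC-ROC and AUC-PRc instead summarize performance across all possible score thresholds rather than at a single cutoff.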


Running the same script (only changing **-alg XXX**), average performance on the validation data using other algorithms:

| Alg  	| F1  	| AUC-ROC  	|
|---	|---	|---	|
| RF  	| 0.787  	| 0.824  	|
| LogReg  	| 0.862  	| 0.921  	|
| SVMpoly  	| 0.833  	| 0.897  	|
| SVMrbf  	| 0.855  	| 0.905  	|
| SVM  	| 0.856  	| 0.911  	|


***SVM was among the best performers on the validation data (LogReg scored marginally higher in the table above), so we will continue with SVM!***

![Step 5](../_img/step5.png "ML Workflow step 5")

Now that we have chosen our algorithm, we will run the pipeline one more time with more replicates (note, I still use just 10 here to save time!) and use the trained model to predict our unknown genes.

**Additional input:**
```
-apply: List of label names to apply the trained model to (e.g. all, or 'unknown')
-plots: True/False if you want the pipeline to generate performance metric plots (default = F)
-save: Name to save output to (will overwrite old files)
```


In [20]:
%run ../ML_classification.py -df data_mod.txt \
                            -test test_genes.txt \
                            -cl_train special,gen \
                            -alg SVM \
                            -cv 5 \
                            -n 10 \
                            -apply unknown \
                            -plots T \
                            -save metab_SVM


Removing test instances to apply model on later...
Snapshot of data being used:
                Class  Crubella_183_v1.0.csv  FamilySize  Transferase  \
YP_008563134      gen                    0.0    0.016667          0.0   
XP_010327628      gen                    0.0    0.000000          0.0   
XP_010327620  special                    0.0    0.083333          0.0   
XP_010327578      gen                    0.0    0.025000          0.0   
XP_010327494      gen                    1.0    0.033333          0.0   

              Exo_endo_phos  
YP_008563134            0.0  
XP_010327628            0.0  
XP_010327620            0.0  
XP_010327578            0.0  
XP_010327494            0.0  


CLASSES: ['gen' 'special']
POS: special type:  <class 'str'>
NEG: gen type:  <class 'str'>

Balanced dataset will include 478 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Round 5 of 10
Round 6 of 10
Round 7 of 10
Round 8 of 10
Rou

**Let's check out our results...**

Here are the files that are output from the model:
- **data.txt_results:** A detailed look at the model that was run and its performance.  

- **data.txt_scores:** The probability score for each gene (i.e. how confidently it was predicted) and the final classification for each gene, including the unknowns the model was applied to.

- **data.txt_imp:** The importance of each feature in your model.

- **data.txt_GridSearch:** Detailed results from the parameter grid search.

- **data.txt_BalancedID:** A list of the genes that were included in each replicate after downsampling to balance the model.

*For a detailed description of the content of the pipeline output see the [README](../README.md)*
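
To see how per-replicate scores can be turned into a single call per gene, here is a small sketch. The gene names and scores are hypothetical, and the 0.43 cutoff is borrowed from the `Predicted_0.43` column name that appears in the scores output further down; how the pipeline chooses that threshold is not covered here.

```python
import statistics

# Hypothetical probability scores from four replicates per gene.
replicate_scores = {
    "gene_A": [0.84, 0.90, 0.81, 0.88],
    "gene_B": [0.29, 0.15, 0.48, 0.28],
}
threshold = 0.43  # assumed cutoff, see note above

summary = {}
for gene, scores in replicate_scores.items():
    mean = statistics.mean(scores)
    summary[gene] = {
        "mean": mean,
        "stdev": statistics.stdev(scores),
        "median": statistics.median(scores),
        # Call the gene 'special' if its mean score reaches the threshold.
        "predicted": "special" if mean >= threshold else "gen",
    }

for gene, row in summary.items():
    print(gene, f"{row['mean']:.3f}", f"{row['stdev']:.3f}", row["predicted"])
```

A small standard deviation across replicates means the call is stable regardless of which balanced subset was used for training; genes with a mean near the threshold and a large standard deviation deserve a second look.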

***

## What if we use fewer features?

Additional input:
```
- feat: List of features to use.
```
Use a smaller balanced data set

### Advanced Topics
- multiclass
- transfer learning
- venn diagrams


In [14]:
%run ../ML_classification.py -df data_mod.txt \
                            -test test_genes.txt \
                            -cl_train special,gen \
                            -alg SVM \
                            -cv 5 \
                            -n 10 \
                            -feat top_feat_lasso.txt \
                            -save metab_SVM_lasso10


Using subset of features from: top_feat_lasso.txt
Removing test instances to apply model on later...
Snapshot of data being used:
                Class  p450  UDPGT  tandemDupGenes  \
YP_008563134      gen   0.0    0.0             0.0   
XP_010327628      gen   0.0    0.0             0.0   
XP_010327620  special   0.0    0.0             0.0   
XP_010327578      gen   0.0    0.0             0.0   
XP_010327494      gen   0.0    0.0             0.0   

              Nicotiana_tabacum.TN90_AYMY.SS.csv  
YP_008563134                                 0.0  
XP_010327628                                 0.0  
XP_010327620                                 0.0  
XP_010327578                                 0.0  
XP_010327494                                 1.0  


CLASSES: ['gen' 'special']
POS: special type:  <class 'str'>
NEG: gen type:  <class 'str'>

Balanced dataset will include 478 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 

In [15]:
%run ../ML_classification.py -df data_mod.txt \
                            -test test_genes.txt \
                            -cl_train special,gen \
                            -alg SVM \
                            -cv 5 \
                            -n 10 \
                            -feat rand_feat.txt_10 \
                            -save metab_SVM_rand10

Using subset of features from: rand_feat.txt_10
Removing test instances to apply model on later...
Snapshot of data being used:
                Class  Osativa_medKaKs  Glyco_hydro_28  NIR_SIR  \
YP_008563134      gen         0.247338             0.0      0.0   
XP_010327628      gen         0.247338             0.0      0.0   
XP_010327620  special         0.206369             0.0      0.0   
XP_010327578      gen         0.212135             0.0      0.0   
XP_010327494      gen         0.179797             0.0      0.0   

              GHMP_kinases_N  
YP_008563134             0.0  
XP_010327628             0.0  
XP_010327620             0.0  
XP_010327578             0.0  
XP_010327494             0.0  


CLASSES: ['gen' 'special']
POS: special type:  <class 'str'>
NEG: gen type:  <class 'str'>

Balanced dataset will include 478 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Round 5 of 10
Round 6 of 10
Round 7 of 10

### Visualizing Your Results

There are a number of visualization tools available in the ML-Pipeline (see ML_Postprocessing). Here we will use ML_plots.


**ML_plots.py input:**
```
ML_plots.py [SAVE_NAME] [POS] [NEG] [M1_name] [PATH_M1_scores] [M2_name] [PATH_M2_scores]... [Mn_name] [PATH_Mn_scores]
```

In [23]:
%run ../scripts_PostAnalysis/ML_plots.py compare_SVM special gen all metab_SVM_scores.txt LASSO metab_SVM_lasso10_scores.txt Random metab_SVM_rand10_scores.txt



all
LASSO
Random
                Class      Mean     stdev Predicted_0.43    Median   score_0  \
ID                                                                             
NP_001233775  special  0.879745  0.046322        special  0.887030  0.837215   
NP_001233777      gen  0.297954  0.103856            gen  0.272025  0.294006   
NP_001233790  special  0.465117  0.104629        special  0.474427  0.338578   
NP_001233795      gen  0.127774  0.051437            gen  0.113469  0.115227   
NP_001233801      gen  0.143935  0.042301            gen  0.147364  0.075589   

               score_1   score_2   score_3   score_4   score_5   score_6  \
ID                                                                         
NP_001233775  0.903283  0.812769  0.878187  0.841674  0.928159  0.895872   
NP_001233777  0.151053  0.479233  0.281056  0.262994  0.234144  0.231677   
NP_001233790  0.314705  0.575976  0.379030  0.537606  0.543523  0.429752   
NP_001233795  0.109948  0.111711  0.122113

Working on LASSO 1
Working on LASSO 2
Working on LASSO 3
Working on LASSO 4
Working on LASSO 5
Working on LASSO 6
Working on LASSO 7
Working on LASSO 8
Working on LASSO 9

                Class      Mean     stdev Predicted_0.42    Median   score_0  \
ID                                                                             
NP_001233775  special  0.537257  0.008205        special  0.537263  0.537502   
NP_001233777      gen  0.536167  0.006922        special  0.537263  0.537502   
NP_001233790  special  0.529128  0.005849        special  0.527949  0.526287   
NP_001233795      gen  0.087656  0.018650            gen  0.090657  0.096085   
NP_001233801      gen  0.320816  0.031036            gen  0.323903  0.308653   

               score_1   score_2   score_3   score_4   score_5   score_6  \
ID                                                                         
NP_001233775  0.526561  0.550286  0.541096  0.534715  0.547085  0.530501   
NP_001233777  0.526561  0.539483  0.541096  0.534715  0.547085  0.530501   
NP_001233790  0.524323  0.539483  0.531433  0.529611  0.535850  0.520343   
NP_001233795  0.087057  0.096370  0.096426  0.094257  0.082

Working on Random 1
Working on Random 2
Working on Random 3
Working on Random 4
Working on Random 5
Working on Random 6
Working on Random 7
Working on Random 8
Working on Random 9

Done!


In [33]:
%run ../scripts_PostAnalysis/ML_plots.py -save compare_SVM \
                    -cl_train special gen \
                    -names All LASSO Random \
                    -scores metab_SVM_scores.txt metab_SVM_lasso10_scores.txt metab_SVM_rand10_scores.txt


Processing model All, replicate 0
Processing model All, replicate 1
Processing model All, replicate 2
Processing model All, replicate 3
Processing model All, replicate 4
Processing model All, replicate 5
Processing model All, replicate 6
Processing model All, replicate 7
Processing model All, replicate 8
Processing model All, replicate 9
Processing model LASSO, replicate 0
Processing model LASSO, replicate 1
Processing model LASSO, replicate 2
Processing model LASSO, replicate 3
Processing model LASSO, replicate 4
Processing model LASSO, replicate 5
Processing model LASSO, replicate 6
Processing model LASSO, replicate 7
Processing model LASSO, replicate 8
Processing model LASSO, replicate 9
Processing model Random, replicate 0
Processing model Random, replicate 1
Processing model Random, replicate 2
Processing model Random, replicate 3
Processing model Random, replicate 4
Processing model Random, replicate 5
Processing model Random, replicate 6
Processing model Random, replicate 7
Proc