## RDP Classifier

Three datasets are required: train, validation, and test. Each dataset contains a sequence ID, a sequence, and taxonomy ranks from kingdom to species. The train and validation datasets are merged into one, because training RDP Classifier does not involve a validation step. The datasets are then converted to ready4train taxonomy and FASTA files for use by RDP Classifier.

RDP Classifier must be downloaded to build models and classify sequences. It is available at: [RDP Classifier](https://github.com/rdpstaff/classifier).

After classification, the output text file shows the best prediction for each sequence together with confidence values. The confidence threshold can be changed; the default is 0.8. The classification output is compared with the true values, that is, the known taxonomy of the test set. This answer file should be a tab-separated taxonomy text file with a header. Models are evaluated by accuracy, F1 score, and Matthews correlation coefficient (MCC) via sklearn.
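
As a minimal sketch of the three metrics, the snippet below scores a handful of made-up genus labels with the same sklearn functions the notebook imports. The labels are hypothetical and only illustrate the calls; they are not taken from the datasets used here.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Toy genus labels (hypothetical data, for illustration only).
y_true = ["Bacillus", "Bacillus", "Escherichia", "Vibrio"]
y_pred = ["Bacillus", "Escherichia", "Escherichia", "Vibrio"]

acc = accuracy_score(y_true, y_pred)               # fraction of exact matches
f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1, weighted by support
mcc = matthews_corrcoef(y_true, y_pred)            # multiclass MCC, between -1 and 1

print("accuracy is {}\nf1 score is {}\nMCC score is {}".format(acc, f1, mcc))
```

The weighted average is used for F1 because taxon classes are usually unbalanced, which matches the `average='weighted'` setting in RDPoutput2score.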

When the cells are run, new files are saved in the "RDPfiles" directory, in the same folder as this Jupyter notebook. The directory name can be changed.

### 1. Setting Variables

This cell imports the packages and sets the variables needed for preprocessing, training, classifying, and scoring. They are used in the main script.

Three datasets are required, each in CSV format with a header.
The path to the RDP Classifier jar file is required.
The prediction level can also be selected. Training and classification use all ranks, from kingdom to species, but a specific level can be chosen for evaluation.

In [318]:
#required packages and variables

import os 
import sys
import pandas as pd
import string
from sklearn.metrics import f1_score, matthews_corrcoef, accuracy_score
import numpy as np
import time


#variables
RDPfiles = "RDPfiles"
raw_train = "df_train_0.csv"
raw_val = "df_val_0.csv"
test = "df_test_0.csv"
classifier_loc = "/Users/inkyunpark/vscjava/rdptools/classifier.jar"
confidence_score = 0.8
level = 'genus' #ranks = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

### 2. Necessary Definitions

This cell includes three function definitions: lineage2taxTrain, addFullLineage, and RDPoutput2score. These functions support the main script.

lineage2taxTrain converts tab-separated taxonomy text into the ready4train_taxonomy.txt file. This file contains the hierarchical taxonomy information in the following format: tax ID * taxon name * parent taxid * depth * rank. 
- Tax ID is the unique index of a taxon in the taxonomy file. If the genus *Bacillus* is assigned 31, no other tax ID refers to *Bacillus*.
- Taxon name is the name of the taxon at each taxonomic rank.
- Parent taxid is the tax ID of the next higher rank. For example, *Bacillus cereus* is a species of *Bacillus*. If *Bacillus cereus* has tax ID 43, its parent taxid is 31, the tax ID of *Bacillus*.
- Depth is the depth of the rank. Depth 0 is always the root. The highest rank, kingdom in this case, has depth 1. Depth increases as the rank becomes lower.
- Rank is the taxonomic rank.
    For example, the label for *Bacillus cereus* is 43\*Bacillus cereus\*31\*7\*Species.
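
The field layout above can be sketched for a single lineage as follows. The helper `lineage_to_tax_lines` is a hypothetical, simplified stand-in for lineage2taxTrain: it handles one lineage only and assigns tax IDs in order of appearance, so the IDs differ from the 31 and 43 used in the example above.

```python
def lineage_to_tax_lines(lineage, ranks):
    """Return training-taxonomy lines for one lineage (illustrative helper)."""
    ids = {"Root": 0}                  # lineage path -> tax ID
    lines = ["0*Root*-1*0*rootrank"]   # root entry, as written by lineage2taxTrain
    for depth, (name, rank) in enumerate(zip(lineage, ranks), start=1):
        path = ";".join(lineage[:depth])
        parent = ";".join(lineage[:depth - 1]) or "Root"
        ids[path] = len(ids)           # next unused tax ID
        fields = [str(ids[path]), name, str(ids[parent]), str(depth), rank]
        lines.append("*".join(fields))
    return lines

ranks = ["Kingdom", "Phylum", "Class"]
lineage = ["Bacteria", "Firmicutes", "Bacilli"]
for line in lineage_to_tax_lines(lineage, ranks):
    print(line)
# 0*Root*-1*0*rootrank
# 1*Bacteria*0*1*Kingdom
# 2*Firmicutes*1*2*Phylum
# 3*Bacilli*2*3*Class
```

Keying the ID map by the full lineage path rather than by the taxon name alone keeps two taxa with the same name under different parents distinct, which is also what lineage2taxTrain does.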

addFullLineage creates the ready4train_seqs.fasta file. It has the same structure as a sequence FASTA file, but a semicolon-separated taxonomy is added after the sequence ID. The taxonomy begins with 'Root' and runs from kingdom down to the lowest taxon. 
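
A minimal sketch of the header rewrite is shown below. The helper name `add_lineage_to_header` and the sample record are hypothetical; the tab between the ID and the lineage follows what addFullLineage writes.

```python
def add_lineage_to_header(header, lineage_by_id):
    """Append the 'Root;...' lineage to a FASTA header line (illustrative helper)."""
    seq_id = header.split()[0].lstrip(">")   # keep only the ID, drop any description
    return ">" + seq_id + "\t" + lineage_by_id[seq_id]

# Hypothetical record: sequence ID mapped to its semicolon-separated lineage.
lineage_by_id = {"seq1": "Root;Bacteria;Firmicutes;Bacilli"}

print(add_lineage_to_header(">seq1", lineage_by_id))
```

Sequence lines are left unchanged; only lines that start with `>` are rewritten.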

RDPoutput2score computes accuracy, F1 score, and MCC from output.txt and test_taxonomy.txt. Output.txt is obtained by classifying test_sequences.fasta with the trained model. The scores compare output.txt and test_taxonomy.txt, which hold the predictions and the answers respectively. Only predictions whose confidence is at least the chosen threshold are scored.
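
To illustrate how a prediction at one rank is picked out of an output line, here is a small sketch. It assumes the column layout implied by the column arithmetic in RDPoutput2score: sequence ID, one blank column, then a (name, rank, confidence) triplet for Root and for each rank. The helper `pick_prediction` and the sample line are hypothetical.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def pick_prediction(line, level):
    """Return (seq_id, taxon, confidence) at the requested rank for one output line."""
    cols = line.rstrip("\n").split("\t")
    start = 5 + 3 * RANKS.index(level)   # skip ID, blank column, and the Root triplet
    return cols[0], cols[start], float(cols[start + 2])

# Hypothetical output line, truncated at phylum for brevity.
line = "seq1\t\tRoot\trootrank\t1.0\tBacteria\tkingdom\t1.0\tFirmicutes\tphylum\t0.93"

seq_id, taxon, conf = pick_prediction(line, "phylum")
print(seq_id, taxon, conf)
if conf >= 0.8:   # same default threshold as confidence_score
    print("prediction kept for scoring")
```

The offset `5 + 3 * index` is the same quantity as `level+5+2*level` in RDPoutput2score, written so that the three-column stride per rank is visible.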

In [319]:
#NECESSARY DEFINITIONS FOR RDP TRAINING, CLASSIFICATION AND SCORING

#codes to make raw sequence and raw taxonomy into ready4rdp input files.
#scripts are from https://github.com/GLBRC-TeamMicrobiome/python_scripts. Raw scripts are revised to be used in this concatenated jupyter notebook file.
def lineage2taxTrain(raw_taxons):
    taxons_list = raw_taxons.strip().split('\n')
    header = taxons_list[0].split('\t')[1:]#headers = list of ranks
    hash = {}#taxon name-id map
    ranks = {}#column number-rank map
    lineages = []#list of unique lineages

    with open("{}/ready4train_taxonomy.txt".format(RDPfiles), "w") as f:
        hash = {"Root":0}#initiate root rank taxon id map
        for i in range(len(header)):
            name = header[i]
            ranks[i] = name
        root = ['0', 'Root', '-1', '0', 'rootrank']#root rank info
        f.write("*".join(root) +  '\n')
        ID = 0 #taxon id
        for line in taxons_list[1:]:
            cols = line.strip().split('\t')[1:]
            for i in range(len(cols)):#iterate each column
                name = []
                for node in cols[:i + 1]:
                    node = node.strip()
                    if not node in ('-', ''):
                        name.append(node)
                pName = ";".join(name[:-1])
                if not name in lineages:
                    lineages.append(name)
                depth = len(name)
                name = ';'.join(name)
                if name in hash.keys():#already seen this lineage
                    continue
                try:
                    rank = ranks[i]
                except KeyError:
                    print (cols)
                    sys.exit()
                if i == 0:
                    pName = 'Root'
                pID = hash[pName]#parent taxid
                ID += 1
                hash[name] = ID #add name-id to the map
                out = ['%s'%ID, name.split(';')[-1], '%s'%pID, '%s'%depth, rank]
                f.write("*".join(out) + '\n')

def addFullLineage(raw_taxons, raw_seqs):
    hash = {} #lineage map
    taxonomy_list = raw_taxons.strip().split('\n')
    for line in taxonomy_list[1:]:
        line = line.strip()
        cols = line.strip().split('\t')
        lineage = ['Root']
        for node in cols[1:]:
            node = node.strip()
            if not (node == '-' or node == ''):
                lineage.append(node)
        ID = cols[0]
        lineage = ';'.join(lineage).strip()
        hash[ID] = lineage
    sequence_list = raw_seqs.strip().split('\n')
    with open("{}/ready4train_seqs.fasta".format(RDPfiles), "w") as f:
        for line in sequence_list:
            line = line.strip()
            if line == '':
                continue
            if line[0] == '>':
                ID = line.strip().split()[0].replace('>', '')
                lineage = hash[ID]
                f.write('>' + ID + '\t' + lineage + '\n')
            else:
                f.write(line.strip() + '\n')

#codes to make score from prediction and true values.
#sklearn.metrics tools are used to make the calculation.
def RDPoutput2score(pred_file, true_file, level, cf):
    taxon_list = []
    ranks = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
    level = ranks.index(level)

    pred = pd.read_csv(pred_file, sep="\t", header=None)
    pred.drop(pred.columns[1:level+5+2*level],  axis = 'columns', inplace=True)
    pred.drop(pred.columns[4:], axis = 'columns', inplace=True)
    
    pred_dict = {}
    for index, row in pred.iterrows():
        row = row.tolist()
        if row[1] not in taxon_list:
            taxon_list += [row[1]]
        if float(row[3]) >= cf:
            pred_dict[row[0]] = row[1]

    true = pd.read_csv(true_file, sep="\t", header=None)
    true_dict = {}
    for index, row in true.iterrows():
        true_dict[row[0]] = row[level+1]
        if row[level+1] not in taxon_list:
            taxon_list += [row[level+1]]


    y_pred, y_true = [], []
    for i in pred_dict.keys():
        y_pred.append(taxon_list.index(pred_dict[i]))
        y_true.append(taxon_list.index(true_dict[i]))

    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average='weighted') # works well if both y_true_decode and y_pred are list.
    mcc = matthews_corrcoef(y_true, y_pred)

    print("accuracy is {}\nf1 score is {}\nMCC score is {}".format(acc, f1, mcc))

### 3. Main

The cells below are the main script.

In this cell, a new directory is first created to hold the required files. Then the preprocessing steps are carried out:
1. Load the three datasets and merge the train and validation sets into one training set.
2. Create raw_seqs and raw_taxons. These are passed to lineage2taxTrain and addFullLineage to become the ready4train files, which are the input for training new RDP models.
3. Create test_sequences.fasta and test_taxonomy.txt from the test dataset. These are used to evaluate the newly trained model.
4. Run lineage2taxTrain and addFullLineage to make the ready4train files.

This is the end of preprocessing.
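
Steps 1 and 2 can be sketched on toy data as follows. The column names and the two rows are hypothetical; the only layout assumption carried over from the main script is that the first column is the sequence ID and the last column is the sequence.

```python
import pandas as pd

# Hypothetical one-row tables: ID, seven ranks, then the sequence.
cols = ["SeqId", "Kingdom", "Phylum", "Class", "Order",
        "Family", "Genus", "Species", "Sequence"]
train_part = pd.DataFrame([["s1", "Bacteria", "Firmicutes", "Bacilli", "Bacillales",
                            "Bacillaceae", "Bacillus", "Bacillus cereus", "ACGT"]],
                          columns=cols)
val_part = pd.DataFrame([["s2", "Bacteria", "Firmicutes", "Bacilli", "Bacillales",
                          "Bacillaceae", "Bacillus", "Bacillus subtilis", "TTGA"]],
                        columns=cols)

# Step 1: merge train and validation into a single training table.
train = pd.concat([train_part, val_part], ignore_index=True)

# Step 2: build the FASTA string and the tab-separated taxonomy string.
fasta_records = []
taxonomy_rows = ["\t".join(cols[:-1])]           # header row without the sequence column
for row in train.itertuples(index=False):
    values = [str(v) for v in row]
    fasta_records.append(">" + values[0] + "\n" + values[-1])
    taxonomy_rows.append("\t".join(values[:-1]))

raw_seqs = "\n".join(fasta_records) + "\n"
raw_taxons = "\n".join(taxonomy_rows) + "\n"
print(raw_seqs)
print(raw_taxons)
```

Collecting the pieces in lists and joining once avoids repeated string concatenation, which becomes slow on large training sets.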

In [320]:
#main_RDP.py

os.makedirs(RDPfiles, exist_ok=True)  #portable; no error if the directory already exists

#merge train and validation database into one.
train = pd.concat(map(pd.read_csv, [raw_train, raw_val]))
test = pd.read_csv(test)

#change train dataframe into tab separated taxonomy and sequence strings
#taxonomy string is tab separated
#sequence string is in fasta format with sequence ID and sequence
raw_seqs = ''
raw_taxons = 'SeqId	Kingdom	Phylum	Class	Order	Family	Genus	Species' + '\n'
for index, row in train.iterrows():
    taxons = row.tolist()
    raw_seqs += '>' + taxons[0] + '\n' + taxons[-1] + '\n'
    raw_taxons += '\t'.join(taxons[:-1]) + '\n'

#write test dataframe to text and fasta files to be utilized by RDP
#taxonomy file is tab separated and saved as a text file
#sequence file is in fasta format with sequence ID and sequence and saved as a fasta file
with open("{}/test_sequences.fasta".format(RDPfiles), "w") as seq_f, open("{}/test_taxonomy.txt".format(RDPfiles), "w") as tax_f:
    for index,row in test.iterrows():
        taxons = row.tolist()
        seq_f.write('>' + taxons[0] + '\n' + taxons[-1] + '\n')
        tax_f.write('\t'.join(taxons[:-1]) + '\n')

#change raw taxonomy and sequence files to ready4rdp trainable files
lineage2taxTrain(raw_taxons)
addFullLineage(raw_taxons, raw_seqs)

print("data preprocessing for RDP is done")

data preprocessing for RDP is done


Here, a new RDP classifier is trained and evaluated on the test set.

Training uses the train function of RDP Classifier. The new model is saved in the training_files directory as four model files. A new file, rRNAClassifier.properties, must be added as a bridge between these four files and RDP Classifier.
Classification uses the classify function of RDP Classifier. Option -t points RDP Classifier to the newly trained model, and option -o names the output file, so predictions are written to output.txt.
The training and classification times are measured.
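
As a sketch of the same two calls with `subprocess` instead of `os.system`, the snippet below builds the command lines as argument lists and wraps the run in a timer. The flags are the ones used in this notebook; the helper names `build_rdp_commands` and `timed_run` and the jar path `classifier.jar` are hypothetical placeholders. Nothing is executed here, only the command is printed.

```python
import subprocess
import time

def build_rdp_commands(classifier_jar, workdir):
    """Return (train_cmd, classify_cmd) as argument lists (illustrative helper)."""
    base = ["java", "-Xmx10g", "-jar", classifier_jar]
    train_cmd = base + ["train",
                        "-o", f"{workdir}/training_files",
                        "-s", f"{workdir}/ready4train_seqs.fasta",
                        "-t", f"{workdir}/ready4train_taxonomy.txt"]
    classify_cmd = base + ["classify",
                           "-t", f"{workdir}/training_files/rRNAClassifier.properties",
                           "-o", f"{workdir}/output.txt",
                           f"{workdir}/test_sequences.fasta"]
    return train_cmd, classify_cmd

def timed_run(cmd):
    """Run a command, raise if it fails, and return the elapsed seconds."""
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0

# "classifier.jar" is a placeholder path, not the real install location.
train_cmd, classify_cmd = build_rdp_commands("classifier.jar", "RDPfiles")
print(" ".join(train_cmd))
```

Passing a list avoids shell quoting problems with paths that contain spaces, and `check=True` stops the notebook if Java exits with an error instead of silently continuing to the scoring step.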

RDPoutput2score then computes accuracy, F1 score, and MCC. These values will be used for comparison with deep learning models. Users can choose the specific rank at which predictions are evaluated.

In [321]:
#train new RDP models and add necessary file.
t0 = time.time()
os.system("java -Xmx10g -jar {} train -o {}/training_files -s {}/ready4train_seqs.fasta -t {}/ready4train_taxonomy.txt".format(classifier_loc, RDPfiles, RDPfiles, RDPfiles))
with open("{}/training_files/rRNAClassifier.properties".format(RDPfiles), "w") as f:
    f.write("bergeyTree=bergeyTrainingTree.xml" + '\n' + "probabilityList=genus_wordConditionalProbList.txt" + '\n' + "probabilityIndex=wordConditionalProbIndexArr.txt" + '\n' + "wordPrior=logWordPrior.txt" + '\n' + "classifierVersion=RDP Naive Bayesian rRNA Classifier Version 2.5, May 2012 ")
t1 = time.time()
print("RDP training time is {} seconds".format(t1-t0))

#classify test sequences
t0 = time.time()
os.system("java -Xmx10g -jar {} classify -t {}/training_files/rRNAClassifier.properties  -o {}/output.txt {}/test_sequences.fasta".format(classifier_loc, RDPfiles, RDPfiles, RDPfiles))
t1 = time.time()
print("RDP classification time is {} seconds".format(t1-t0))

#scoring
RDPoutput2score("{}/output.txt".format(RDPfiles), "{}/test_taxonomy.txt".format(RDPfiles), level, confidence_score)

RDP training time is 23.329681158065796 seconds
RDP classification time is 53.412200927734375 seconds
accuracy is 0.9754299754299754
f1 score is 0.9756451668945949
MCC score is 0.9752839861577162
