
# Snekmer Learn - Apply  Demo


 <b>Learn/Apply</b> is a protein annotation method that uses cosine similarity to compare a user-generated kmer counts matrix to the kmer counts of an novel genome, predicting the annotation of each protein within.

**Learn** relies on a large training set of genomes to make predictions. With each addition to the training set, the accuracy increases.

**Apply** requires several outputs from Learn as well some novel genomes or sequences.


In this notebook, we will demonstrate how to use Snekmer Learn/Apply with small training dataset of 20 genomes and 2 "unknown" genomes.



## Getting Started with LEARN

### Setup

First, install Snekmer using the instructions in the [user installation guide](https://github.com/PNNL-CompBio/Snekmer/).

Before running Snekmer, verify that files have been placed in an **_input_** directory placed at the same level as the **_config.yaml_** file. The assumed file directory structure is illustrated below.

    .
    ├── input
    │   ├── A.fasta
    │   ├── B.fasta
    │   ├── C.fasta
    │   ├── D.fasta
    │   └── etc.
    ├── config.yaml
    ├── annotations
        └── TIGRFAMS.ann
    
(Note: Snekmer automatically creates the **_output_** directory when creating output files, so there is no need to create this folder in advance. Additionally, inclusion of background sequences is optional, but is illustrated above for interested users.)

To ensure that snekmer is available in the Jupyter notebook do the following:
    conda activate snekmer 
    conda install -c anaconda ipykernel 
    python -m ipykernel install --user --name=snekmer
    jupyter notebook







### Notes on Using Snekmer

Snekmer assumes that the user will primarily process input files using the command line. For more detailed instructions, refer to the [README](https://github.com/PNNL-CompBio/Snekmer).

The basic process for running Snekmer Learn-Apply is as follows:

1. Verify that your file directory structure is correct and that the top-level directory contains a **_config.yaml_** file.
    - A **_config.yaml_** template has been included in the Snekmer codebase at **_resources/learn_apply/config.yaml_**.
2. Modify the **_config.yaml_** with the desired parameters.
3. Use the command line to navigate to the directory containing both the **_config.yaml_** file and **_input_** directory.
4. Run `snekmer learn`, then copy the appropriate outputs a seperate directory to run `snekmer apply`


# Running Snekmer Learn Pipeline

### Setup

To set up the workflow such that operation mimics the command line implementation of Snekmer Leanr/Apply, we will initialize a dictionary (rather than a YAML file) and gather all input files. Input files are detected here using `glob.glob`, exactly as Snekmer performs input file detection.

In [1]:
# built-in imports
import itertools
import gzip
import json
import pickle
from datetime import datetime
from glob import glob
from os.path import basename, join
import numpy as np
from Bio import SeqIO
import snekmer as skm
from os.path import basename, dirname, exists, join, splitext, split
import numpy as np
import pandas as pd
import seaborn as sns
from Bio import SeqIO
from pathlib import Path
import copy
from scipy.stats import rankdata
import csv
import sys
import time
import pyarrow as pa
import pyarrow.csv as csv
import itertools
import sklearn
from scipy.interpolate import interp1d
import os
from typing import Any, Dict, List, Optional, Tuple, Union
import re
import shutil

## Configuration File Input

In [2]:
# define config
# (note: handled via config.yaml in the snekmer CLI workflow)

config = {
    
    # required parameters
    "k": 8,
    "alphabet": 2,  # choices 0-5 or names (see alphabet module), or None
    "min_rep_thresh": 1,
    "processes": 2,

    # input handling
    "input": {
        "example_index_file": False,
        "feature_set": False,
        "file_extensions": ["fasta", "fna", "faa", "fa"],
        "regex": r"[a-z]{3}[A-Z]{1}",  # regex to parse family from filename
    },
    
    # output handling
    "output": {
        "nested_dir": False,  # if True, saves into {save_dir}/{alphabet name}/{k}
        "verbose": True, # if True, logs verbose outputs
        "format": "simple",  # choices: ["simple", "gist", "sieve"]
        "filter_duplicates": True,
        "n_terminal_file": False,
        "shuffle_n": False,
        "shuffle_sequences": False,
    },
    
    # LearnApply Parameters
    "learnapp": {
        "save_apply_associations": False
    }

}

## Rule 0: Receive files

Before going through the workflow, we glob all filenames contained within the input directory that end in the pre-defined file extensions and/or the extension and `.gz`.

Note that while in this notebook, the path to the demo files is specified with the `input_dir` variable, the Snekmer CLI assumes that input files are stored according to the file structure specified above in the **Setup** section.

In [3]:
# collect all fasta-like files, unzipped filenames, and basenames
input_dir = "LearnApp_Tutorial_Files/LEARN/input/"
input_files = glob(os.path.join(input_dir, "*"))
zipped = [fa for fa in input_files if fa.endswith(".gz")]
unzipped = [
    fa.rstrip(".gz")
    for fa, ext in itertools.product(input_files, config["input"]["file_extensions"])
    if fa.rstrip(".gz").endswith(f".{ext}")
]

print("zipped files:\t", zipped)
print("unzipped files:\t", unzipped)

zipped files:	 []
unzipped files:	 ['LearnApp_Tutorial_Files/LEARN/input/UP000315395_2594265.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000313849_676201.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000319374_2585119.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000319088_2600309.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000317977_2528013.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000319776_92403.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000315466_1411316.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000310227_2562283.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000319209_2594004.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000316313_1293412.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000319639_2592816.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000316827_1981880.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000317332_2567861.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000316252_2590779.fasta', 'LearnApp_Tutorial_Files/LEARN/input/UP000316154_239.fast

In [4]:
# map extensions to basename (basename.ext.gz -> {basename: ext})
UZ_MAP = {
    skm.utils.split_file_ext(f)[0]: skm.utils.split_file_ext(f)[1] for f in zipped
}

FA_MAP = {
    skm.utils.split_file_ext(f)[0]: skm.utils.split_file_ext(f)[1] for f in unzipped
}

UZS = list(UZ_MAP.keys())
FAS = list(FA_MAP.keys())

print("zipped filename wildcards:\t", UZS)
print("unzipped filename wildcards:\t", FAS)

zipped filename wildcards:	 []
unzipped filename wildcards:	 ['UP000315395_2594265', 'UP000313849_676201', 'UP000319374_2585119', 'UP000319088_2600309', 'UP000317977_2528013', 'UP000319776_92403', 'UP000315466_1411316', 'UP000310227_2562283', 'UP000319209_2594004', 'UP000316313_1293412', 'UP000319639_2592816', 'UP000316827_1981880', 'UP000317332_2567861', 'UP000316252_2590779', 'UP000316154_239', 'UP000310017_2583587', 'UP000319210_66857', 'UP000380825_2584524', 'UP000319897_1715348', 'UP000319825_1880']


In [5]:
# define output directory (and create if missing)
output_dir = "LearnApp_Tutorial_Files/LEARN/output"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("output directory:\t", output_dir)

# validity check
skm.alphabet.check_valid(config["alphabet"])  # raises error if invalid alphabet

output directory:	 LearnApp_Tutorial_Files/LEARN/output


## Rule 0.5: Unzip files

Any zipped files detected by the above are automatically unzipped. The zipped version of the file is copied into a separate subdirectory.

**Snakemake code:**

    # if any files are gzip compressed, unzip them
    rule unzip:
    input:
        join("input", "{uz}.gz")
    output:
        join("input", "{uz}")
    params:
        outdir=join("input", "zipped")
    shell:
        "mkdir {params.outdir} && gunzip -c {input} > {output} && mv {input} {params.outdir}/."
                

In [6]:
# if any files are gzip compressed, unzip them
for uz in UZS:
    input_ = os.path.join(input_dir, f"{uz}.{UZ_MAP[uz]}.gz")
    output_ = os.path.join(input_dir, f"{uz}.{UZ_MAP[uz]}")
    outdir = os.path.join(input_dir, "zipped")
    
    ! mkdir -p $outdir && gunzip -c $input_ > $output_ && mv $input_ $outdir/.

    print("input:\t", input_)
    print("output:\t", output_)
    

## Rule 1: Preprocess

In this step, we parse user-defined parameters into an appropriate format for subsequent pipeline steps.

Parameter options include:
- `k`: Define kmer length
- `alphabet`: Define the translation alphabet

The Snakemake code is not shown due to length, but the converted Python-ized code is shown below:

In [7]:
for fa in unzipped:
    # this is handled by snakemake but we'll specify it here
    base = f'{skm.utils.split_file_ext(fa)[0]}.kmers'
    output_kmerobj = os.path.join(output_dir, "kmerize", base)
    if not os.path.exists(os.path.join(output_dir, "kmerize")):
        os.mkdir(os.path.join(output_dir, "kmerize"))
        
    base = f'{skm.utils.split_file_ext(fa)[0]}.npz'
    output_data = os.path.join(output_dir, "vector", base)
    if not os.path.exists(os.path.join(output_dir, "vector")):
        os.mkdir(os.path.join(output_dir, "vector"))
    
    fasta = SeqIO.parse(fa, "fasta")

    # initialize kmerization object
    kmer = skm.vectorize.KmerVec(alphabet=config["alphabet"], k=config["k"])

    vecs, seqs, ids, lengths = list(), list(), list(), list()
    for f in fasta:
        vecs.append(kmer.reduce_vectorize(f.seq))
        seqs.append(
            skm.vectorize.reduce(
                f.seq,
                alphabet=config["alphabet"],
                mapping=skm.alphabet.FULL_ALPHABETS,
            )
        )
        ids.append(f.id)
        lengths.append(len(f.seq))

    # save seqIO output and transformed vecs
    np.savez_compressed(output_data, ids=ids, seqs=seqs, vecs=vecs, lengths=lengths)

    with open(output_kmerobj, "wb") as f:
        pickle.dump(kmer, f)

  val = np.asanyarray(val)


## Rule 2: Learn
In this step, we generate kmer counts for each fasta input file. In the following step these are merged to find cumulative kmer counts. These kmer counts can be thought of as training the annotation model.


In [8]:
if not os.path.exists("LearnApp_Tutorial_Files/LEARN/output/learn"):
    os.makedirs("LearnApp_Tutorial_Files/LEARN/output/learn")

for fa in unzipped:
    annot_files = glob(join("LearnApp_Tutorial_Files/LEARN/annotations", "*.ann"))
    base = f'{skm.utils.split_file_ext(fa)[0]}.npz'
    input_data = os.path.join(output_dir, "vector", base)
    
    Annotation = list()
    for f in annot_files:
        Annotation.append(pd.read_table(f))
    annotations = pd.concat(Annotation)
    Seq_Anot = {}
    Seqs= Annotation[0]['id'].tolist()
    ANNs = Annotation[0]['TIGRFAMs'].tolist()
    for i,seqid in enumerate(Seqs):
        Seq_Anot[seqid] = ANNs[i]
    Seqs = set(Seqs)
    ANNs = set(ANNs)

    output_data = os.path.join(output_dir, "vector", base)
    fasta = SeqIO.parse(fa, "fasta")
    # initialize kmerization object
    kmer = skm.vectorize.KmerVec(alphabet=config["alphabet"], k=config["k"])

    vecs, seqs, ids, lengths = list(), list(), list(), list()

    for f in fasta:
        vecs.append(kmer.reduce_vectorize(f.seq))
        seqs.append(
            skm.vectorize.reduce(
                f.seq,
                alphabet=config["alphabet"],
                mapping=skm.alphabet.FULL_ALPHABETS,
            )
        )
        ids.append(f.id)
        lengths.append(len(f.seq))
        
    df, kmerlist = skm.vectorize.make_feature_matrix(vecs)

    seqids = ids
    kmer_totals = []
    for item in kmerlist:
        kmer_totals.append(0)

    ##### Generate Kmer Counts
    k_len = len(kmerlist[0])
    seq_kmer_dict = {}
    counter = 0
    for i,seq in enumerate(seqids):
        v = seqs[i]
        kmer_counts = dict()
        items = []
        for item in range(0,(len((v)) - k_len +1)):
            items.append(v[item:(item+k_len)])
        for j in items:
            kmer_counts[j] = kmer_counts.get(j, 0) + 1  
        store = []
        for i,item in enumerate(kmerlist):
            if item in kmer_counts:
                store.append(kmer_counts[item])
                kmer_totals[i] += kmer_counts[item]
            else:
                store.append(0)
        seq_kmer_dict[seq]= store


    #Filter out Non-Training Annotations 
    Annotation_Counts = {}
    total_seqs = len(seq_kmer_dict)
    for i,seqid in enumerate(list(seq_kmer_dict)):
        x =re.findall(r'\|(.*?)\|', seqid)[0]
        if x not in Seqs:
            del seq_kmer_dict[seqid]
        else:
            if Seq_Anot[x] not in seq_kmer_dict:
                seq_kmer_dict[Seq_Anot[x]] = seq_kmer_dict.pop(seqid)
            else:
                zipped_lists = zip(seq_kmer_dict.pop(seqid), seq_kmer_dict[Seq_Anot[x]])
                seq_kmer_dict[Seq_Anot[x]] = [x + y for (x, y) in zipped_lists]
            if Seq_Anot[x] not in Annotation_Counts:
                Annotation_Counts[Seq_Anot[x]] = 1
            else: 
                Annotation_Counts[Seq_Anot[x]] += 1

    #Construct Kmer Counts Output
    Kmer_Counts = pd.DataFrame(seq_kmer_dict.values())        
    Kmer_Counts.insert(0,"Annotations",Annotation_Counts.values(),True)
    Kmer_Counts.insert(1,"Kmer Count",(Kmer_Counts[list(Kmer_Counts.columns[1:])].sum(axis=1).to_list()),True)
    kmer_totals[0:0] = [0,total_seqs]
    colnames = ["Sequence count"] + ["Kmer Count"] + list(kmerlist)
    Kmer_Counts = pd.DataFrame(np.insert(Kmer_Counts.values, 0, values=(kmer_totals), axis=0))
    Kmer_Counts.columns = colnames
    new_index = ["Totals"] + list(Annotation_Counts.keys())
    Kmer_Counts.index = new_index
    print("Counts Data Generated for: ",input_data)


    #### Write Output
    out_name = "LearnApp_Tutorial_Files/LEARN/output/learn/kmer-counts-" + str(input_data)[44:-4] + ".csv"
    Kmer_Counts_out = pa.Table.from_pandas(Kmer_Counts,preserve_index=True)
    csv.write_csv(Kmer_Counts_out, out_name)

Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000315395_2594265.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000313849_676201.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000319374_2585119.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000319088_2600309.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000317977_2528013.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000319776_92403.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000315466_1411316.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000310227_2562283.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000319209_2594004.npz
Counts Data Generated for:  LearnApp_Tutorial_Files/LEARN/output/vector/UP000316313_1293412.npz
Counts Data Generated for:  LearnApp_Tutori

## Rule 3: Merge
In this step, we merge all of the previously generated counts datafiles. While running this pipeline through the command line, we also have the option of merging in a previously generated counts file. This allows for additive kmer count integration for massive project scaling.


In [9]:

input_counts = glob("LearnApp_Tutorial_Files/LEARN/output/learn/*")

for file_num,f in enumerate(input_counts):
    print("databases merged: ",file_num,"\n")
    Kmer_Counts = pd.read_csv(str(f), index_col="__index_level_0__", header=0, engine="pyarrow")
#     print(Kmer_Counts)
    if file_num == 0:
        running_merge = Kmer_Counts
    elif file_num >= 1:
        running_merge = (pd.concat([running_merge,Kmer_Counts]).reset_index().groupby('__index_level_0__', sort=False).sum(min_count=1)).fillna(0)

running_merge_out = pa.Table.from_pandas(running_merge,preserve_index=True)
csv.write_csv(running_merge_out, "LearnApp_Tutorial_Files/LEARN/output/learn/kmer-counts-total.csv")




databases merged:  0 

databases merged:  1 

databases merged:  2 

databases merged:  3 

databases merged:  4 

databases merged:  5 

databases merged:  6 

databases merged:  7 

databases merged:  8 

databases merged:  9 

databases merged:  10 

databases merged:  11 

databases merged:  12 

databases merged:  13 

databases merged:  14 

databases merged:  15 

databases merged:  16 

databases merged:  17 

databases merged:  18 

databases merged:  19 



## Rule 3: Eval_Apply
In this step, we find the cosine similarity score between the merged kmer count database and each kmer counts from each sequence in the fasta files. Essentially, we are comparing self-predictions against actual values. This output is used in the next step to calculate confidence scores.




In [10]:
if not os.path.exists("LearnApp_Tutorial_Files/LEARN/output/eval_apply"):
    os.makedirs("LearnApp_Tutorial_Files/LEARN/output/eval_apply")

compare_associations = "LearnApp_Tutorial_Files/LEARN/output/learn/kmer-counts-total.csv"
annotation = ["LearnApp_Tutorial_Files/LEARN/annotations/TIGRFAMs_annotation.ann"]
for fa in unzipped:
    # this is handled by snakemake but we'll specify it here

    ###
    base = f'{skm.utils.split_file_ext(fa)[0]}.npz'
    output_data = os.path.join(output_dir, "vector", base)

    fasta = SeqIO.parse(fa, "fasta")

    # initialize kmerization object
    kmer = skm.vectorize.KmerVec(alphabet=config["alphabet"], k=config["k"])

    vecs, seqs, ids, lengths = list(), list(), list(), list()

    for f in fasta:
        vecs.append(kmer.reduce_vectorize(f.seq))
        seqs.append(
            skm.vectorize.reduce(
                f.seq,
                alphabet=config["alphabet"],
                mapping=skm.alphabet.FULL_ALPHABETS,
            )
        )
        ids.append(f.id)
        lengths.append(len(f.seq))


    ##### Generate Inputs
    Annotation = list()
    Kmer_Count_Totals = pd.read_csv(str(compare_associations), index_col="__index_level_0__", header=0, engine="c")
    for f in annotation:
        Annotation.append(pd.read_table(f))
    Seqs = Annotation[0]['id'].tolist()
    ANNs = Annotation[0]['TIGRFAMs'].tolist()
    Seq_Anot = {}
    for i,seqid in enumerate(Seqs):
        Seq_Anot[seqid] = ANNs[i]
    Seqs = set(Seqs)
    ANNs = set(ANNs)
#     df, kmerlist = skm.io.load_npz(input.data)
    seqids = ids
    kmer_totals = []
    for item in kmerlist:
        kmer_totals.append(0)

    ##### Generate Kmer Counts
    seq_kmer_dict = {}
    counter = 0
    k_len = len(kmerlist[0])
    for i,seq in enumerate(seqids):
        v = seqs[i]
        kmer_counts = dict()
        items = []
        for item in range(0,(len((v)) - k_len +1)):
            items.append(v[item:(item+k_len)])
        for j in items:
            kmer_counts[j] = kmer_counts.get(j, 0) + 1  
        store = []
        for i,item in enumerate(kmerlist):
            if item in kmer_counts:
                store.append(kmer_counts[item])
                kmer_totals[i] += kmer_counts[item]
            else:
                store.append(0)
        seq_kmer_dict[seq]= store


    ###### ADD Known / Unknown tag to mark for confidence assessment
    Annotation_Counts = {}
    total_seqs = len(seq_kmer_dict)
    count = 0
    for seqid in list(seq_kmer_dict):
        x =re.findall(r'\|(.*?)\|', seqid)[0]
        if x not in Seqs:
            seq_kmer_dict[(x+"_unknown_"+str(count))] = seq_kmer_dict.pop(seqid)
        else: 
            seq_kmer_dict[(Seq_Anot[x] + "_known_" + str(count))] = seq_kmer_dict.pop(seqid)
        count +=1
    Annotation_Counts = {}
    total_seqs = len(seq_kmer_dict)

    ######  Construct Kmer Counts Dataframe
    Kmer_Counts = pd.DataFrame(seq_kmer_dict.values())        
    Kmer_Counts.insert(0,"Annotations",1,True)
    kmer_totals.insert(0,total_seqs)
    Kmer_Counts = pd.DataFrame(np.insert(Kmer_Counts.values, 0, values=kmer_totals, axis=0))
    Kmer_Counts.columns = ["Sequence count"] + list(kmerlist)
    Kmer_Counts.index = ["Totals"] + list(seq_kmer_dict.keys())


    ##### Make New Counts Data match Kmer Counts Totals Format
    if len(str(Kmer_Counts.columns.values[10])) == len(str(Kmer_Count_Totals.columns.values[10])):
        compare_check = True
    else: 
        compare_check = False
    if compare_check == True:
        check_1 = len(Kmer_Counts.columns.values)
        alphabet_initial = set(itertools.chain(*[list(x) for x in Kmer_Counts.columns.values[10:check_1]]))
        alphabet_compare = set(itertools.chain(*[list(x) for x in Kmer_Count_Totals.columns.values[10:check_1]]))
        if alphabet_compare == alphabet_initial:
            compare_check = True
        else: 
            compare_check = False
    if compare_check == False:
        print("Compare Check Failed. ")
        sys.exit()

    new_cols = set(Kmer_Counts.columns)
    compare_cols = set(Kmer_Count_Totals.columns)
    add_to_compare = []
    add_to_new = []
    for val in new_cols:
        if val not in compare_cols:
            add_to_compare.append(val)
    for val in compare_cols:
        if val not in new_cols:
            add_to_new.append(val)

    Kmer_Count_Totals = pd.concat([Kmer_Count_Totals, pd.DataFrame(dict.fromkeys(add_to_compare, 0), index=Kmer_Count_Totals.index)], axis=1)
    Kmer_Count_Totals.drop(columns=Kmer_Count_Totals.columns[:2], index="Totals", axis=0, inplace=True)
    Kmer_Counts = pd.concat([Kmer_Counts, pd.DataFrame(dict.fromkeys(add_to_new, 0), index=Kmer_Counts.index)], axis=1)
    Kmer_Counts.drop(columns=Kmer_Counts.columns[-1:].union(Kmer_Counts.columns[:1]), index="Totals", axis=0, inplace=True)

    #Perform Cosine Similarity between Kmer Counts Totals and Counts and Sums DF
    cosine_df = sklearn.metrics.pairwise.cosine_similarity(Kmer_Count_Totals,Kmer_Counts).T
    final_matrix_with_scores = pd.DataFrame(cosine_df, columns=Kmer_Count_Totals.index, index=Kmer_Counts.index)

    #Write Output
    out_name = "LearnApp_Tutorial_Files/LEARN/output/eval_apply/seq-annotation-scores-" + str(fa)[36:-6] + ".csv"

    final_matrix_with_scores_write = pa.Table.from_pandas(final_matrix_with_scores)
    csv.write_csv(final_matrix_with_scores_write, out_name)
    print("File completed: seq-annotation-scores-" + str(fa)[36:-6] + ".csv")



File completed: seq-annotation-scores-UP000315395_2594265.csv
File completed: seq-annotation-scores-UP000313849_676201.csv
File completed: seq-annotation-scores-UP000319374_2585119.csv
File completed: seq-annotation-scores-UP000319088_2600309.csv
File completed: seq-annotation-scores-UP000317977_2528013.csv
File completed: seq-annotation-scores-UP000319776_92403.csv
File completed: seq-annotation-scores-UP000315466_1411316.csv
File completed: seq-annotation-scores-UP000310227_2562283.csv
File completed: seq-annotation-scores-UP000319209_2594004.csv
File completed: seq-annotation-scores-UP000316313_1293412.csv
File completed: seq-annotation-scores-UP000319639_2592816.csv
File completed: seq-annotation-scores-UP000316827_1981880.csv
File completed: seq-annotation-scores-UP000317332_2567861.csv
File completed: seq-annotation-scores-UP000316252_2590779.csv
File completed: seq-annotation-scores-UP000316154_239.csv
File completed: seq-annotation-scores-UP000310017_2583587.csv
File completed:

## Rule 4: Eval_Conf
In this step, we evalate if the cosine scores between kmer counts dataframes accurately predict the annotation. The ratio of true positive to false positives is taken and we generate our global confidence scores.  Each delta will be mapped to a confidence score. **Delta** is defined as the difference between the two highest cosine similarity scores. 




In [11]:
if not os.path.exists("LearnApp_Tutorial_Files/LEARN/output/eval_conf"):
    os.makedirs("LearnApp_Tutorial_Files/LEARN/output/eval_conf")


eval_apply_data = glob("LearnApp_Tutorial_Files/LEARN/output/eval_apply/seq-annotation-scores-*")
        #### Generate Input Data
for j,f in enumerate(eval_apply_data):
    seq_ann_scores = pd.read_csv(f, index_col="__index_level_0__", header=0, engine="c")
    max_value_index = seq_ann_scores.idxmax(axis="columns")
    result = max_value_index.keys()
    TF = list()
    Known = list()
    for i,item in enumerate(list(max_value_index)):
        if item in result[i]:
            TF.append("T")
        else:
            TF.append("F")
        if "unknown" in result[i]:
            Known.append("Unknown")
        else:
            Known.append("Known")

    seq_ann_vals = seq_ann_scores.values
    seq_ann_vals = seq_ann_scores.values[np.arange(len(seq_ann_scores))[:,None],np.argpartition(-seq_ann_vals,np.arange(2),axis=1)[:,:2]]

    diff_df = pd.DataFrame(seq_ann_vals, columns = ['Top','Second'])
    diff_df['Delta'] = -(np.diff(seq_ann_vals, axis=1).round(decimals=2))
    diff_df['Prediction'] = list(max_value_index)
    diff_df['Actual'] = result
    diff_df["T/F"] = TF
    diff_df["Known/Unknown"] = Known

    #### Create CrossTabs - ie True/False Count Sums and sum within .01 intervals
    known_true_diff_df = diff_df[(diff_df["Known/Unknown"] == "Known") & (diff_df["T/F"] == "T")]
    known_false_diff_df = diff_df[(diff_df["Known/Unknown"] == "Known") & (diff_df["T/F"] == "F")]
    possible_vals = [round(x * 0.01,2) for x in range(0, 101)]
    true_crosstab = pd.crosstab(known_true_diff_df.Prediction,known_true_diff_df.Delta)
    false_crosstab = pd.crosstab(known_false_diff_df.Prediction,known_false_diff_df.Delta)

    if j == 0:
        true_running_crosstab = true_crosstab
        false_running_crosstab = false_crosstab
    else:
        true_running_crosstab = (pd.concat([true_running_crosstab,true_crosstab]).reset_index().groupby('Prediction', sort=False).sum(min_count=1)).fillna(0)
        false_running_crosstab = (pd.concat([false_running_crosstab,false_crosstab]).reset_index().groupby('Prediction', sort=False).sum(min_count=1)).fillna(0)


    add_to_true_df = pd.DataFrame(0, index = sorted(set(false_running_crosstab.index) - set(true_running_crosstab.index)), columns= true_running_crosstab.columns)
    add_to_false_df = pd.DataFrame(0, index = sorted(set(true_running_crosstab.index) - set(false_running_crosstab.index)), columns= false_running_crosstab.columns)

    true_running_crosstab = pd.concat([true_running_crosstab,add_to_true_df])[sorted(list(set(possible_vals) & set(true_running_crosstab.columns)))].assign(**dict.fromkeys(list(map(str, sorted(list(set(possible_vals) ^ set(true_running_crosstab.columns.astype(float)))))),0))
    false_running_crosstab = pd.concat([false_running_crosstab,add_to_false_df])[sorted(list(set(possible_vals) & set(false_running_crosstab.columns)))].assign(**dict.fromkeys(list(map(str, sorted(list(set(possible_vals) ^ set(false_running_crosstab.columns.astype(float)))))),0))

    true_running_crosstab.index.names = ['Prediction']
    false_running_crosstab.index.names = ['Prediction']
    true_running_crosstab.sort_index(inplace=True) 
    false_running_crosstab.sort_index(inplace=True) 
    true_running_crosstab.columns = true_running_crosstab.columns.astype(float)
    false_running_crosstab.columns = false_running_crosstab.columns.astype(float)
    true_running_crosstab = true_running_crosstab[sorted(true_running_crosstab.columns)]
    false_running_crosstab = false_running_crosstab[sorted(false_running_crosstab.columns)]


    print("Dataframes joined: ", j+1, " out of ",len(eval_apply_data) , ".")

#### Generate Each Global CrossTab
ratio_running_crosstab = (true_running_crosstab/(true_running_crosstab + false_running_crosstab))
true_total_dist = true_running_crosstab.sum(numeric_only=True, axis=0)
false_total_dist = false_running_crosstab.sum(numeric_only=True, axis=0)
ratio_total_dist = (true_running_crosstab.sum(numeric_only=True, axis=0)/(true_running_crosstab.sum(numeric_only=True, axis=0) + false_running_crosstab.sum(numeric_only=True, axis=0)))

####Interpolate For final Ratio, this only will affect upper limit values if there is a decent amount of data
ratio_total_dist = ratio_total_dist.interpolate(method="linear")

##### Write Final Confidence Results
ratio_total_dist.to_csv("LearnApp_Tutorial_Files/LEARN/output/eval_conf/global-confidence-scores.csv")
csv.write_csv(pa.Table.from_pandas(true_running_crosstab), "LearnApp_Tutorial_Files/LEARN/output/eval_conf/true-total.csv")
csv.write_csv(pa.Table.from_pandas(false_running_crosstab), "LearnApp_Tutorial_Files/LEARN/output/eval_conf/false-total.csv")
csv.write_csv(pa.Table.from_pandas(ratio_running_crosstab), "LearnApp_Tutorial_Files/LEARN/output/eval_conf/confidence-matrix.csv")

print("\nGlobal Confidence scores mapped to Delta:\n", ratio_total_dist)


Dataframes joined:  1  out of  20 .
Dataframes joined:  2  out of  20 .
Dataframes joined:  3  out of  20 .
Dataframes joined:  4  out of  20 .
Dataframes joined:  5  out of  20 .
Dataframes joined:  6  out of  20 .
Dataframes joined:  7  out of  20 .
Dataframes joined:  8  out of  20 .
Dataframes joined:  9  out of  20 .
Dataframes joined:  10  out of  20 .
Dataframes joined:  11  out of  20 .
Dataframes joined:  12  out of  20 .
Dataframes joined:  13  out of  20 .
Dataframes joined:  14  out of  20 .
Dataframes joined:  15  out of  20 .
Dataframes joined:  16  out of  20 .
Dataframes joined:  17  out of  20 .
Dataframes joined:  18  out of  20 .
Dataframes joined:  19  out of  20 .
Dataframes joined:  20  out of  20 .

Global Confidence scores mapped to Delta:
 Delta
0.00    0.380282
0.01    0.709163
0.02    0.869565
0.03    0.935065
0.04    0.981752
          ...   
0.96    1.000000
0.97    1.000000
0.98    1.000000
0.99    1.000000
1.00    1.000000
Length: 101, dtype: float64


## Learn Pipeline is done.

Key outputs include:  
* Kmer counts database: /output/learn/kmer-counts-total.csv  
* Global confidence scores: output/eval_conf/global-confidence-scores.csv  

The next step is to prepare for the Apply pipeline.

## Intermediate Steps

Users will have to extract key outputs and copy them into a new directory to run the Apply pipeline.


In [12]:
if not os.path.exists("LearnApp_Tutorial_Files/APPLY/counts"):
    os.makedirs("LearnApp_Tutorial_Files/APPLY/counts")

if not os.path.exists("LearnApp_Tutorial_Files/APPLY/confidence"):
    os.makedirs("LearnApp_Tutorial_Files/APPLY/confidence")
    
shutil.copyfile("LearnApp_Tutorial_Files/LEARN/output/learn/kmer-counts-total.csv", "LearnApp_Tutorial_Files/APPLY/counts/kmer-counts-total.csv")
shutil.copyfile("LearnApp_Tutorial_Files/LEARN/output/eval_conf/global-confidence-scores.csv", "LearnApp_Tutorial_Files/APPLY/confidence/global-confidence-scores.csv")


'LearnApp_Tutorial_Files/APPLY/confidence/global_confidence_scores.csv'



# Getting Started with APPLY

### Setup


Before running Snekmer Apply, verify that files have been placed in an **_input_** directory placed at the same level as the **_config.yaml_** file. The assumed file directory structure is illustrated below.

    .
    ├── input
    │   ├── W.fasta
    │   ├── X.fasta
    │   ├── Y.fasta
    │   ├── Z.fasta
    │   └── etc.
    ├── config.yaml
    ├── counts
    │   └── kmer-counts-total.csv
    └── confidence
        └── global-confidence-scores.csv
     
        
    
Note: Snekmer automatically creates the **_output_** directory when creating output files, so there is no need to create this folder in advance



# Running Snekmer Apply Pipeline


## Rule 0.5: Unzip files

Any zipped files detected by the above are automatically unzipped. The zipped version of the file is copied into a separate subdirectory.
                

In [13]:
# if any files are gzip compressed, unzip them
for uz in UZS:
    input_ = os.path.join(input_dir, f"{uz}.{UZ_MAP[uz]}.gz")
    output_ = os.path.join(input_dir, f"{uz}.{UZ_MAP[uz]}")
    outdir = os.path.join(input_dir, "zipped")
    
    ! mkdir -p $outdir && gunzip -c $input_ > $output_ && mv $input_ $outdir/.

    print("input:\t", input_)
    print("output:\t", output_)
    

### kmerize/vectorize
for fa in unzipped:
    # this is handled by snakemake but we'll specify it here
    base = f'{skm.utils.split_file_ext(fa)[0]}.kmers'
    output_kmerobj = os.path.join(output_dir, "kmerize", base)
    if not os.path.exists(os.path.join(output_dir, "kmerize")):
        os.mkdir(os.path.join(output_dir, "kmerize"))
        
    base = f'{skm.utils.split_file_ext(fa)[0]}.npz'
    output_data = os.path.join(output_dir, "vector", base)
    if not os.path.exists(os.path.join(output_dir, "vector")):
        os.mkdir(os.path.join(output_dir, "vector"))
    
    fasta = SeqIO.parse(fa, "fasta")

    # initialize kmerization object
    kmer = skm.vectorize.KmerVec(alphabet=config["alphabet"], k=config["k"])

    vecs, seqs, ids, lengths = list(), list(), list(), list()
    for f in fasta:
        vecs.append(kmer.reduce_vectorize(f.seq))
        seqs.append(
            skm.vectorize.reduce(
                f.seq,
                alphabet=config["alphabet"],
                mapping=skm.alphabet.FULL_ALPHABETS,
            )
        )
        ids.append(f.id)
        lengths.append(len(f.seq))

    # save seqIO output and transformed vecs
    np.savez_compressed(output_data, ids=ids, seqs=seqs, vecs=vecs, lengths=lengths)

    with open(output_kmerobj, "wb") as f:
        pickle.dump(kmer, f)


  val = np.asanyarray(val)


## Rule 1: Preprocess

In this step, we parse user-defined parameters into an appropriate format for subsequent pipeline steps.

Parameter options include:
- `k`: Define kmer length
- `alphabet`: Define the translation alphabet

Note: This essentially the same step as in Learn.

In [14]:
# collect all fasta-like files, unzipped filenames, and basenames
input_dir = "LearnApp_Tutorial_Files/APPLY/input/"
input_files = glob(os.path.join(input_dir, "*"))
zipped = [fa for fa in input_files if fa.endswith(".gz")]
unzipped = [
    fa.rstrip(".gz")
    for fa, ext in itertools.product(input_files, config["input"]["file_extensions"])
    if fa.rstrip(".gz").endswith(f".{ext}")
]

print("zipped files:\t", zipped)
print("unzipped files:\t", unzipped)
# define output directory (and create if missing)
output_dir = "LearnApp_Tutorial_Files/APPLY/output"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("output directory:\t", output_dir)

# validity check
skm.alphabet.check_valid(config["alphabet"])  # raises error if invalid alphabet
# if any files are gzip compressed, unzip them
for uz in UZS:
    input_ = os.path.join(input_dir, f"{uz}.{UZ_MAP[uz]}.gz")
    output_ = os.path.join(input_dir, f"{uz}.{UZ_MAP[uz]}")
    outdir = os.path.join(input_dir, "zipped")
    
    ! mkdir -p $outdir && gunzip -c $input_ > $output_ && mv $input_ $outdir/.

    print("input:\t", input_)
    print("output:\t", output_)
    

zipped files:	 []
unzipped files:	 ['LearnApp_Tutorial_Files/APPLY/input/UP000481030_1602942.fasta', 'LearnApp_Tutorial_Files/APPLY/input/UP000480297_2681465.fasta', 'LearnApp_Tutorial_Files/APPLY/input/UP000480288_2708300.fasta', 'LearnApp_Tutorial_Files/APPLY/input/UP000482960_1076125.fasta', 'LearnApp_Tutorial_Files/APPLY/input/UP000483286_2682977.fasta']
output directory:	 LearnApp_Tutorial_Files/APPLY/output


## Rule 2: Apply
In this step, we find the cosine similarity score between the kmer-counts-total.csv and the kmer count vector from each sequence in the fasta files. Essentially, we are comparing new kmer count frequencies against the trained model/dataframe. 

 
This compares each new sequence against every sequence have in the trained model.   

Optional output: 
* Specify learnapp parameter 'save_results' = True in the config file.
* This will output a matrix of all cosine similarity scores. 
* **Warning**: these files may take up a lot of storage space.

In [17]:
if not os.path.exists("LearnApp_Tutorial_Files/APPLY/output/apply"):
    os.makedirs("LearnApp_Tutorial_Files/APPLY/output/apply")

confidence_associations = "LearnApp_Tutorial_Files/APPLY/confidence/global-confidence-scores.csv"
compare_associations = "LearnApp_Tutorial_Files/APPLY/counts/kmer-counts-total.csv"
for fa in unzipped:
    # this is handled by snakemake but we'll specify it here

    ###
    base = f'{skm.utils.split_file_ext(fa)[0]}.npz'
    output_data = os.path.join(output_dir, "vector", base)

    print("parsing...")
    fasta = SeqIO.parse(fa, "fasta")
    print("finished...")

    # initialize kmerization object
    kmer = skm.vectorize.KmerVec(alphabet=config["alphabet"], k=config["k"])

    vecs, seqs, ids, lengths = list(), list(), list(), list()

    for f in fasta:
        vecs.append(kmer.reduce_vectorize(f.seq))
        seqs.append(
            skm.vectorize.reduce(
                f.seq,
                alphabet=config["alphabet"],
                mapping=skm.alphabet.FULL_ALPHABETS,
            )
        )
        ids.append(f.id)
        lengths.append(len(f.seq))

    print("making kmer list")
    df, kmerlist = skm.vectorize.make_feature_matrix(vecs)
    print("done")
    
    kmer_count_totals = pd.read_csv(str(compare_associations), index_col="__index_level_0__", header=0, engine="c")
    seqids = ids
    kmer_totals = []
    for item in kmerlist:
        kmer_totals.append(0)
    k_len = len(kmerlist[0])

    ##### Generate Kmer Counts
    seq_kmer_dict = {}
    counter = 0
    for i,seq in enumerate(seqids):
        v = seqs[i]
        kmer_counts = dict()
        items = []
        for item in range(0,(len((v)) - k_len +1)):
            items.append(v[item:(item+k_len)])
        for j in items:
            kmer_counts[j] = kmer_counts.get(j, 0) + 1  
        store = []
        for i,item in enumerate(kmerlist):
            if item in kmer_counts:
                store.append(kmer_counts[item])
                kmer_totals[i] += kmer_counts[item]
            else:
                store.append(0)
        seq_kmer_dict[seq]= store


    ######  Construct Kmer Counts Dataframe
    total_seqs = len(seq_kmer_dict)
    kmer_counts = pd.DataFrame(seq_kmer_dict.values())        
    kmer_counts.insert(0,"Annotations",1,True)
    kmer_totals.insert(0,total_seqs)
    kmer_counts = pd.DataFrame(np.insert(kmer_counts.values, 0, values=kmer_totals, axis=0))
    kmer_counts.columns = ["Sequence count"] + list(kmerlist)
    kmer_counts.index = ["Totals"] + list(seq_kmer_dict.keys())

    new_associations = kmer_counts.iloc[1:, 1:].div(kmer_counts["Sequence count"].tolist()[1:], axis = "rows")

    ##### Make Kmer Counts Dataframe match Kmer Counts Totals Format
    if len(str(kmer_counts.columns.values[10])) == len(str(kmer_count_totals.columns.values[10])):
        compare_check = True
    else: 
        compare_check = False
    if compare_check == True:
        check_1 = len(new_associations.columns.values)
        check_2 = len(kmer_count_totals.columns.values)
        alphabet_initial = set(itertools.chain(*[list(x) for x in kmer_counts.columns.values[10:check_1]]))
        alphabet_compare = set(itertools.chain(*[list(x) for x in kmer_count_totals.columns.values[10:check_1]]))
        if alphabet_compare == alphabet_initial:
            compare_check = True
    if compare_check == False:
        print("Compare Check Failed. ")
        sys.exit()

    new_cols = set(kmer_counts.columns)
    compare_cols = set(kmer_count_totals.columns)
    add_to_compare = []
    add_to_new = []
    for val in new_cols:
        if val not in compare_cols:
            add_to_compare.append(val)
    for val in compare_cols:
        if val not in new_cols:
            add_to_new.append(val)

    kmer_count_totals = pd.concat([kmer_count_totals, pd.DataFrame(dict.fromkeys(add_to_compare, 0), index=kmer_count_totals.index)], axis=1)
    kmer_count_totals.drop(columns=kmer_count_totals.columns[:2], index="Totals", axis=0, inplace=True)
    kmer_counts = pd.concat([kmer_counts, pd.DataFrame(dict.fromkeys(add_to_new, 0), index=kmer_counts.index)], axis=1)
    kmer_counts.drop(columns=kmer_counts.columns[-1:].union(kmer_counts.columns[:1]), index="Totals", axis=0, inplace=True)


    #### Perform Cosine Similarity between Kmer Counts Totals and Counts and Sums DF
    cosine_df = sklearn.metrics.pairwise.cosine_similarity(kmer_count_totals,kmer_counts).T
    kmer_count_totals = pd.DataFrame(cosine_df, columns=kmer_count_totals.index, index=kmer_counts.index)



    ##### Create True Output
    # Protein ID, Prediction, Score, delta, Confidence
    global_confidence_scores = pd.read_csv(str(confidence_associations))
    global_confidence_scores.index= global_confidence_scores[global_confidence_scores.columns[0]]
    global_confidence_scores = global_confidence_scores.iloc[: , 1:]
    global_confidence_scores = global_confidence_scores[global_confidence_scores.columns[0]].squeeze()

    score_rank =[]
    sorted_vals = np.argsort(-kmer_count_totals.values, axis=1)[:, :2]
    for i,item in enumerate(sorted_vals):
        score_rank.append((kmer_count_totals[kmer_count_totals.columns[[item]]][i:i+1]).values.tolist()[0])

    delta = []
    top_score = []
    for score in score_rank:
        delta.append(score[0] - score[1])
        top_score.append(score[0])

    vals = pd.DataFrame({'delta':delta})
    predictions = pd.DataFrame(np.array(kmer_count_totals.columns)[sorted_vals][:, :1])

    score = pd.DataFrame(top_score)
    score.columns = ["Score"]
    predictions.columns = ["Prediction"]
    predictions = predictions.astype(str)
    vals = vals.round(decimals=2)
    vals['Confidence'] = vals["delta"].map(global_confidence_scores)

    results = pd.concat([predictions,score,vals],axis=1)
    results.index=kmer_count_totals.index
    results.index.names = ['SeqID']

    #### Write results 
    out_name_2 = "LearnApp_Tutorial_Files/APPLY/output/apply/kmer-summary-" + str(fa)[36:-6] + ".csv"
#     results.reset_index(inplace=True)
    results_write = pa.Table.from_pandas(results)
    
    csv.write_csv(results_write, out_name_2)
    print(results)


parsing...
finished...
making kmer list
done


  result = getitem(key)


                               Prediction     Score  delta  Confidence
SeqID                                                                 
tr|A0A6L3UVZ1|A0A6L3UVZ1_9BACI  TIGR01782  0.159632   0.00    0.380282
tr|A0A6L3UWN7|A0A6L3UWN7_9BACI  TIGR00231  0.201725   0.00    0.380282
tr|A0A6L3UX61|A0A6L3UX61_9BACI  TIGR00254  0.268196   0.01    0.709163
tr|A0A6L3UXS8|A0A6L3UXS8_9BACI  TIGR02937  0.235456   0.00    0.380282
tr|A0A6L3UXV5|A0A6L3UXV5_9BACI  TIGR02937  0.218722   0.00    0.380282
...                                   ...       ...    ...         ...
tr|A0A6L3VDZ3|A0A6L3VDZ3_9BACI  TIGR00384  0.213473   0.00    0.380282
tr|A0A6L3VET6|A0A6L3VET6_9BACI  TIGR01780  0.117717   0.00    0.380282
tr|A0A6L3VG18|A0A6L3VG18_9BACI  TIGR00914  0.138133   0.00    0.380282
tr|A0A6L3VGG9|A0A6L3VGG9_9BACI  TIGR04018  0.236016   0.02    0.869565
tr|A0A6L3VI01|A0A6L3VI01_9BACI  TIGR01007  0.173710   0.01    0.709163

[4971 rows x 4 columns]
parsing...
finished...
making kmer list
done


  result = getitem(key)


                               Prediction     Score  delta  Confidence
SeqID                                                                 
tr|A0A6J4F4E3|A0A6J4F4E3_9PROT  TIGR00653  0.529945   0.27    1.000000
tr|A0A6J4F4W2|A0A6J4F4W2_9PROT  TIGR04490  0.165330   0.03    0.935065
tr|A0A6J4F5K9|A0A6J4F5K9_9PROT  TIGR00229  0.225698   0.00    0.380282
tr|A0A6J4F5U0|A0A6J4F5U0_9PROT  TIGR00362  0.401043   0.15    0.993228
tr|A0A6J4F758|A0A6J4F758_9PROT  TIGR00326  0.234519   0.01    0.709163
...                                   ...       ...    ...         ...
tr|A0A6J4G3Q6|A0A6J4G3Q6_9PROT  TIGR02937  0.266368   0.01    0.709163
tr|A0A6J4G4E3|A0A6J4G4E3_9PROT  TIGR00426  0.126779   0.00    0.380282
tr|A0A6J4G4E9|A0A6J4G4E9_9PROT  TIGR04131  0.200287   0.00    0.380282
tr|A0A6J4G4I0|A0A6J4G4I0_9PROT  TIGR01049  0.709616   0.57    1.000000
tr|A0A6J4G4Q6|A0A6J4G4Q6_9PROT  TIGR01965  0.372966   0.00    0.380282

[4587 rows x 4 columns]
parsing...
finished...
making kmer list
done


  result = getitem(key)


                               Prediction     Score  delta  Confidence
SeqID                                                                 
tr|A0A6M0HIY6|A0A6M0HIY6_9RHIZ  TIGR04183  0.204168   0.00    0.380282
tr|A0A6M0HJ66|A0A6M0HJ66_9RHIZ  TIGR01643  0.192836   0.00    0.380282
tr|A0A6M0HJN7|A0A6M0HJN7_9RHIZ  TIGR01133  0.171233   0.00    0.380282
tr|A0A6M0HJS1|A0A6M0HJS1_9RHIZ  TIGR04183  0.192684   0.00    0.380282
tr|A0A6M0HK20|A0A6M0HK20_9RHIZ  TIGR01643  0.234667   0.00    0.380282
...                                   ...       ...    ...         ...
tr|A0A6M0HXZ3|A0A6M0HXZ3_9RHIZ  TIGR01066  0.146006   0.03    0.935065
tr|A0A6M0HYD7|A0A6M0HYD7_9RHIZ  TIGR01730  0.172446   0.00    0.380282
tr|A0A6M0HYM5|A0A6M0HYM5_9RHIZ  TIGR04131  0.140279   0.01    0.709163
tr|A0A6M0HZ22|A0A6M0HZ22_9RHIZ  TIGR00460  0.147662   0.00    0.380282
tr|A0A6M0HZC6|A0A6M0HZC6_9RHIZ  TIGR00757  0.174282   0.03    0.935065

[4688 rows x 4 columns]
parsing...
finished...
making kmer list
done


  result = getitem(key)


                               Prediction     Score  delta  Confidence
SeqID                                                                 
tr|A0A6V8KMR1|A0A6V8KMR1_9ACTN  TIGR01188  0.336461   0.15    0.993228
tr|A0A6V8KMT8|A0A6V8KMT8_9ACTN  TIGR01830  0.319477   0.08    0.992126
tr|A0A6V8KMV4|A0A6V8KMV4_9ACTN  TIGR00229  0.227885   0.02    0.869565
tr|A0A6V8KNR0|A0A6V8KNR0_9ACTN  TIGR04183  0.265409   0.01    0.709163
tr|A0A6V8KQ77|A0A6V8KQ77_9ACTN  TIGR00231  0.223534   0.01    0.709163
...                                   ...       ...    ...         ...
tr|A0A6V8L4S8|A0A6V8L4S8_9ACTN  TIGR03654  0.375962   0.18    1.000000
tr|A0A6V8LEJ2|A0A6V8LEJ2_9ACTN  TIGR00229  0.165081   0.01    0.709163
tr|A0A6V8LEV2|A0A6V8LEV2_9ACTN  TIGR00594  0.211811   0.01    0.709163
tr|A0A6V8LMI1|A0A6V8LMI1_9ACTN  TIGR02601  0.181719   0.00    0.380282
tr|A0A6V8LMV5|A0A6V8LMV5_9ACTN  TIGR04057  0.177651   0.00    0.380282

[10307 rows x 4 columns]
parsing...
finished...
making kmer list
done
      

  result = getitem(key)


# Apply Pipeline is done.

Output is located in /output/apply/kmer-summary-{input file name}.csv.

Each output file has 5 columns.
* **SeqID**: ID of the sequence whose annotation we are trying to predict.  
* **Prediction**: The predicted annotatation for the sequence.  
* **Score**: The cosine similarity score between the sequence and the predicted annotation.  
* **Delta**: The difference of cosine similarity scores between the top two predicted values.  
* **Confidence**: The estimated confidence of the prediction. This is based on the global distribution. Confidence will be more accurate for annotations with more training sequences.  

