# Autometa Data Descriptions and Formats

The Autometa pipeline generates a number of files required for binning metagenomes into MAGs. These files hold different statistics and annotations, summarized in the table below.

| File Name | File Description | File Format |
|------|------|------|
| \*filtered_ge_{length_cutoff}.fna | contigs greater than or equal to the provided `--length-cutoff` parameter | fasta |
| k-mers.tsv | contig x k-mer frequencies | wide |
| k-mers.normalized.tsv | contig x CLR normalized k-mer frequencies | wide |
| \*orfs.faa | prodigal-annotated amino-acid ORFs | fasta |
| \*orfs.fna | prodigal-annotated nucleotide ORFs | fasta |
| \*blastp.tsv | Diamond blastp search results table (outfmt 6) | long |
| \*blastp.pkl.gz | Autometa DiamondResult object used for taxonomic assignment | python pickled dict object |
| majority_vote.tsv | Autometa voted taxids | long |
| \*lca.tsv | ORF x LCA taxid | long |
| tour.pkl.gz | Autometa list object used for LCA assignment | python pickled list object |
| level.pkl.gz | Autometa list object used for LCA assignment | python pickled list object |
| occurrence.pkl.gz | Autometa list object used for LCA assignment | python pickled list object |
| sparse.pkl.gz | Autometa sparse table object used for LCA assignment | python pickled numpy sparse table |
| \*.hmmscan.tsv | kingdom-specific hmmscan raw results table | long |
| \*.markers.tsv | kingdom-specific marker annotation table | long |
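
The table lists `k-mers.normalized.tsv` as CLR (centered log-ratio) normalized frequencies. The snippet below is a minimal sketch of that transform on a toy count table, not Autometa's own implementation. The column names, the values, and the pseudocount of 1 (used to avoid `log(0)`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy contig x k-mer count table (values are made up for illustration).
counts = pd.DataFrame(
    {"AAA": [10, 0], "AAT": [5, 3], "AAC": [1, 7]},
    index=pd.Index(["contig_1", "contig_2"], name="contig"),
)

# Assumption: add a pseudocount of 1 so zero counts do not produce log(0).
log_counts = np.log(counts + 1)

# CLR: subtract each row's mean log value, i.e. log(x / geometric_mean(x)).
clr = log_counts.sub(log_counts.mean(axis=1), axis=0)

print(clr.round(3))
```

Each row of a CLR-transformed table sums to zero, which is a quick sanity check when inspecting a normalized table.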

## Data handling methods

In [1]:
# Note: `!cd` runs in a subshell, so it does not change the notebook's working directory.
!cd ../
!echo "To set up an IDE using autometa's modules, navigate to the autometa base directory..."
!echo "I.e. Base directory : $(pwd)"

To set up an IDE using autometa's modules, navigate to the autometa base directory...
I.e. Base directory : /Users/rees/Wisc/kwan/tools/autometa/docs


In [2]:
import os
os.chdir('../')

In [3]:
# Import autometa utility function to load python pickled objects
from autometa.common.utilities import unpickle

# Import DiamondResult to allow unpickling blastp.pkl.gz results
from autometa.common.external.diamond import DiamondResult

## orfs.blastp.pkl.gz (dict of ORFs DiamondResult)

In [4]:
results = unpickle('dev/scaffolds.orfs.faa.blastp.pkl.gz')

Notice that a utility function is provided to easily load the Python serialized objects.

The `DiamondResult` Autometa object holds a number of attributes, as well as some methods that provide easy access to the blastp results for any searched query seqid (ORF).
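
For readers curious about what loading a `*.pkl.gz` file involves, here is a minimal sketch that assumes the files are gzip-compressed pickles. `load_gzipped_pickle` is a hypothetical helper written for this example and is not Autometa's `unpickle`. It round-trips a toy dict through a temporary file.

```python
import gzip
import os
import pickle
import tempfile


def load_gzipped_pickle(filepath):
    """Hypothetical helper: load one object from a gzip-compressed pickle."""
    with gzip.open(filepath, "rb") as fh:
        return pickle.load(fh)


# Round-trip a toy dict through a temporary *.pkl.gz file.
toy = {"orf_1": {"bitscore": 550.1, "taxid": 212}}
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "toy.pkl.gz")
with gzip.open(tmp_path, "wb") as fh:
    pickle.dump(toy, fh)

loaded = load_gzipped_pickle(tmp_path)
print(loaded == toy)
```

Unpickling requires the pickled object's class to be importable, which is why `DiamondResult` is imported before loading the blastp results. Pickle files should only be loaded from trusted sources.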

In [5]:
NODE_98_orf_5 = results.get('NODE_98_length_143507_cov_224.136_5')

# NODE_98_orf_5?

As you can see the `DiamondResult` object holds a number of attributes:

Attributes:
* qseqid
* pident
* length
* gapopen
* mismatch
* qstart
* qend
* sstart
* send
* bitscore
* evalue
* sseqid
* sseqids

Methods:
* get_top_hit

Perhaps the most convenient attribute of an ORF's `DiamondResult` object is `sseqids`, which maps each subject sequence ID (`sseqid`) to the attributes of its hit.

For example:

In [6]:
for sseqid, attrs in results.get('NODE_98_length_143507_cov_224.136_11').sseqids.items():
    print(
        f"sseqid: {sseqid} bitscore: {attrs.get('bitscore')} "
        f"taxid: {attrs.get('taxid')} "
    )

sseqid: WP_011577313.1 bitscore: 550.1 taxid: 212 
sseqid: WP_021308412.1 bitscore: 541.6 taxid: 210 
sseqid: WP_024751083.1 bitscore: 539.3 taxid: 210 
sseqid: WP_000753486.1 bitscore: 537.3 taxid: 210 
sseqid: WP_001963517.1 bitscore: 537.3 taxid: 210 
sseqid: WP_079359809.1 bitscore: 537.3 taxid: 210 
sseqid: WP_000753485.1 bitscore: 537.0 taxid: 210 
sseqid: WP_033595811.1 bitscore: 537.0 taxid: 210 
sseqid: WP_058337672.1 bitscore: 537.0 taxid: 210 
sseqid: WP_077656665.1 bitscore: 537.0 taxid: 210 
sseqid: WP_064436257.1 bitscore: 535.8 taxid: 210 
sseqid: WP_001921678.1 bitscore: 535.4 taxid: 210 
sseqid: KNE03025.1 bitscore: 535.4 taxid: 210 
sseqid: WP_024773924.1 bitscore: 535.0 taxid: 210 
sseqid: WP_015087528.1 bitscore: 534.3 taxid: 210 
sseqid: WP_078248142.1 bitscore: 533.9 taxid: 210 
sseqid: WP_000753484.1 bitscore: 533.5 taxid: 210 
sseqid: WP_000753487.1 bitscore: 533.5 taxid: 210 
sseqid: WP_000753488.1 bitscore: 533.1 taxid: 210 
sseqid: WP_001941093.1 bitscore: 53

In [7]:
results.get('NODE_98_length_143507_cov_224.136_11').get_top_hit()

'WP_011577313.1'
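
The returned hit is also the one with the highest bitscore in the listing above. The sketch below ranks hits by bitscore using a plain dict shaped like `sseqids`. It assumes "top hit" means "highest bitscore", which matches the output shown here, but the implementation of `get_top_hit` is not shown in this document, so treat this as an illustration only.

```python
# Toy stand-in for `DiamondResult.sseqids`: subject ID -> hit attributes.
# (The real object is populated by Autometa; this dict is illustrative.)
sseqids = {
    "WP_021308412.1": {"bitscore": 541.6, "taxid": 210},
    "WP_011577313.1": {"bitscore": 550.1, "taxid": 212},
    "WP_024751083.1": {"bitscore": 539.3, "taxid": 210},
}

# Rank subject sequences from highest to lowest bitscore.
ranked = sorted(
    sseqids,
    key=lambda sseqid: sseqids[sseqid]["bitscore"],
    reverse=True,
)
best = ranked[0]

print(best, sseqids[best]["taxid"])
```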

### Reading tab-delimited tables

Autometa generates a few different tables in *long* and *wide* formats.

Most of the tables have a `contig` column that serves as the pandas `DataFrame` index.
Throughout the Autometa code base, these tables consistently use `contig` as the index for easy look-ups.
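
To make the *long* versus *wide* distinction concrete, the following sketch converts a toy table between the two layouts with pandas. The contig names, accessions, and counts are made up for illustration.

```python
import pandas as pd

# Toy long-format table: one row per (contig, marker) observation.
long_df = pd.DataFrame(
    {
        "contig": ["contig_1", "contig_1", "contig_2"],
        "sacc": ["PF00001", "PF00002", "PF00001"],
        "count": [1, 2, 1],
    }
)

# Long -> wide: one row per contig, one column per marker.
wide_df = long_df.pivot(index="contig", columns="sacc", values="count")

# Wide -> long: melt the columns back into rows and drop missing pairs.
round_trip = (
    wide_df.reset_index()
    .melt(id_vars="contig", var_name="sacc", value_name="count")
    .dropna(subset=["count"])
)

print(wide_df)
print(round_trip)
```

In the wide layout, combinations that never occur (here `contig_2` with `PF00002`) appear as `NaN`, whereas the long layout simply omits them.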

## K-mers table

In [8]:
# Import pandas library to easily read in tab-delimited tables
import pandas as pd

df = pd.read_csv('dev/kmers.tsv', sep='\t', index_col='contig')

In [9]:
# Note: We can easily look up the number of rows and columns in our table with the `shape` attribute.
df.shape

(3587, 512)
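
The 512 columns are consistent with 5-mers counted with each k-mer and its reverse complement collapsed into one column. This is an inference from the column count and the 5-letter column names, not something stated in this document. A short sketch of the arithmetic follows. Using the lexicographically smaller k-mer as the representative is an assumption of this example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


k = 5
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]

# Keep one representative per {k-mer, reverse complement} pair.
canonical = {min(kmer, reverse_complement(kmer)) for kmer in all_kmers}

print(len(all_kmers), len(canonical))
```

Because no odd-length k-mer can equal its own reverse complement, the 1024 possible 5-mers pair up exactly into 512 groups.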

In [10]:
df.describe()

,AAAAA,AAAAT,AAAAC,AAAAG,AAATA,AAATT,AAATC,AAATG,AAACA,AAACT,...,GTGCC,GTGGC,GCACC,GCAGC,GCTCC,GCCCC,GCCGC,GCGCC,GGACC,GGCCC
count,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,...,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0
mean,129.68804,100.005018,57.866183,70.265403,77.569836,74.099805,57.12378,52.016727,49.043212,39.641204,...,55.279621,61.176471,79.473097,95.75913,43.639532,50.939783,153.644829,122.586005,39.891274,60.813493
std,490.711572,380.595919,192.942122,255.260888,339.260178,272.786141,190.483309,179.009566,164.588022,144.145529,...,226.833744,236.797279,352.880434,370.257865,134.571926,177.541757,690.738041,560.927086,158.767559,230.312608
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0
50%,5.0,4.0,3.0,4.0,3.0,3.0,4.0,3.0,3.0,3.0,...,9.0,11.0,13.0,15.0,10.0,10.0,16.0,13.0,7.0,7.0
75%,50.0,41.0,27.0,29.0,27.0,30.0,27.0,23.0,23.0,20.0,...,41.0,46.0,54.0,64.5,38.0,42.0,103.0,74.0,30.0,45.0
max,10356.0,8635.0,3519.0,5603.0,8568.0,6672.0,3292.0,3265.0,2900.0,3564.0,...,5891.0,5660.0,9894.0,8731.0,2752.0,4278.0,18126.0,14603.0,3992.0,5785.0


Let's look up the contig `NODE_98_length_143507_cov_224.136` from above to see the k-mer frequencies and explore this contig in our other tables...

In [11]:
# Note: Because we have indexed our DataFrame by contigs,
# we can use the `.loc` indexer to find our corresponding data.
df.loc['NODE_98_length_143507_cov_224.136']

AAAAA    2807
AAAAT    1786
AAAAC    1288
AAAAG    1392
AAATA     867
         ... 
GCCCC     287
GCCGC      77
GCGCC      75
GGACC      37
GGCCC      32
Name: NODE_98_length_143507_cov_224.136, Length: 512, dtype: int64

In [12]:
# We can similarly look up a specific value in a row and column
# Syntax: `df.loc[row, col]`
row = 'NODE_98_length_143507_cov_224.136'
col = 'GCCCC'
df.loc[row, col]

287

### Looking up multiple contigs in a DataFrame

More often, we will want to look up multiple contigs in our table at once. We can subset our table in one line with `df.index.isin()`.

In [13]:
df.head(n=20)

contig,AAAAA,AAAAT,AAAAC,AAAAG,AAATA,AAATT,AAATC,AAATG,AAACA,AAACT,...,GTGCC,GTGGC,GCACC,GCAGC,GCTCC,GCCCC,GCCGC,GCGCC,GGACC,GGCCC
NODE_1_length_1389215_cov_225.275,170,214,544,412,288,254,892,536,447,511,...,5891,5660,9894,8731,2752,3318,18126,13512,3992,5785
NODE_2_length_1166739_cov_224.155,1688,1625,2187,1354,949,1178,2608,1513,1833,1175,...,3654,4208,4797,6906,2042,4278,9604,9627,2198,3739
NODE_3_length_1063064_cov_225.095,143,158,419,345,237,229,724,386,399,394,...,4530,4292,7448,6461,2011,2519,14146,9667,3082,4228
NODE_4_length_1031470_cov_223.812,347,225,485,331,90,184,484,325,272,466,...,3888,4406,5543,7679,2564,4048,14005,14603,3368,4460
NODE_5_length_937195_cov_225.122,156,172,382,317,194,198,620,378,324,353,...,3848,3678,6377,5703,1834,2201,11991,8548,2585,3668
NODE_6_length_801699_cov_224.213,97,152,322,265,202,180,613,324,302,323,...,3442,3194,5366,4745,1643,1864,10094,6872,2306,3079
NODE_7_length_734119_cov_224.107,93,109,278,227,168,172,559,316,316,297,...,3148,2983,4843,4283,1433,1673,9355,6476,2234,2818
NODE_8_length_686892_cov_223.785,9083,7161,3519,4526,7106,4912,3292,3035,2900,3564,...,294,437,603,904,678,156,137,133,245,83
NODE_9_length_644926_cov_224.104,10356,8635,3275,5603,8568,6672,3221,3265,2618,2988,...,179,199,449,680,480,116,66,57,232,77
NODE_10_length_622659_cov_224.054,68,81,222,170,137,100,392,205,177,235,...,2613,2519,4419,3933,1170,1509,8726,5917,1835,2548


In [14]:
# First let's get the first 20 contigs in the DataFrame to use for our example.
contigs = df.head(20).index

# Now we will check the index of the DataFrame (the contigs) and return a boolean array
df.index.isin(contigs)

# If we want to retrieve the information in the corresponding table,
# we need to provide this array to our DataFrame.
# Notice the `df[filter_criterion]` syntax
df[df.index.isin(contigs)]

contig,AAAAA,AAAAT,AAAAC,AAAAG,AAATA,AAATT,AAATC,AAATG,AAACA,AAACT,...,GTGCC,GTGGC,GCACC,GCAGC,GCTCC,GCCCC,GCCGC,GCGCC,GGACC,GGCCC
NODE_1_length_1389215_cov_225.275,170,214,544,412,288,254,892,536,447,511,...,5891,5660,9894,8731,2752,3318,18126,13512,3992,5785
NODE_2_length_1166739_cov_224.155,1688,1625,2187,1354,949,1178,2608,1513,1833,1175,...,3654,4208,4797,6906,2042,4278,9604,9627,2198,3739
NODE_3_length_1063064_cov_225.095,143,158,419,345,237,229,724,386,399,394,...,4530,4292,7448,6461,2011,2519,14146,9667,3082,4228
NODE_4_length_1031470_cov_223.812,347,225,485,331,90,184,484,325,272,466,...,3888,4406,5543,7679,2564,4048,14005,14603,3368,4460
NODE_5_length_937195_cov_225.122,156,172,382,317,194,198,620,378,324,353,...,3848,3678,6377,5703,1834,2201,11991,8548,2585,3668
NODE_6_length_801699_cov_224.213,97,152,322,265,202,180,613,324,302,323,...,3442,3194,5366,4745,1643,1864,10094,6872,2306,3079
NODE_7_length_734119_cov_224.107,93,109,278,227,168,172,559,316,316,297,...,3148,2983,4843,4283,1433,1673,9355,6476,2234,2818
NODE_8_length_686892_cov_223.785,9083,7161,3519,4526,7106,4912,3292,3035,2900,3564,...,294,437,603,904,678,156,137,133,245,83
NODE_9_length_644926_cov_224.104,10356,8635,3275,5603,8568,6672,3221,3265,2618,2988,...,179,199,449,680,480,116,66,57,232,77
NODE_10_length_622659_cov_224.054,68,81,222,170,137,100,392,205,177,235,...,2613,2519,4419,3933,1170,1509,8726,5917,1835,2548


## Voted Taxids Table (majority_vote.tsv) 

In [15]:
df = pd.read_csv('dev/majority_vote.tsv', sep='\t', index_col='contig')

In [16]:
df.head(3)

contig,taxid,superkingdom,phylum,class,order,family,genus,species
NODE_98_length_143507_cov_224.136,212,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter acinonychis
NODE_99_length_143125_cov_224.739,573569,bacteria,proteobacteria,gammaproteobacteria,thiotrichales,francisellaceae,francisella,francisella sp. tx077308
NODE_100_length_142487_cov_223.46,212,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter acinonychis


In [17]:
superkingdoms = dict(list(df.groupby('superkingdom')))

In [18]:
# All superkingdom classifications found in DataFrame
superkingdoms.keys()

dict_keys(['bacteria', 'eukaryota', 'unclassified', 'viruses'])

In [19]:
# We can retrieve a subset DataFrame of the superkingdom of interest.
bacteria_df = superkingdoms.get('bacteria')
print(f'master shape: {df.shape} bacteria shape: {bacteria_df.shape}')

master shape: (3250, 8) bacteria shape: (3163, 8)
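
Building a dict from `groupby` is one way to split the table. The sketch below shows two equivalent pandas alternatives for selecting a single superkingdom, plus a per-group tally, on a toy table whose contig names and rows are made up for illustration.

```python
import pandas as pd

# Toy taxonomy table (contig names and rows are illustrative).
taxa = pd.DataFrame(
    {
        "taxid": [212, 573569, 10239],
        "superkingdom": ["bacteria", "bacteria", "viruses"],
    },
    index=pd.Index(["contig_a", "contig_b", "contig_c"], name="contig"),
)

# Option 1: boolean mask on the column.
by_mask = taxa[taxa["superkingdom"] == "bacteria"]

# Option 2: pull a single group out of the groupby object.
by_group = taxa.groupby("superkingdom").get_group("bacteria")

# Count contigs per superkingdom in one call.
counts = taxa["superkingdom"].value_counts()

print(by_mask.index.tolist())
print(by_group.index.tolist())
print(counts.to_dict())
```

The boolean mask is usually the simplest choice when only one group is needed, while the dict of groups is handy when iterating over every superkingdom.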


In [20]:
bacteria_df.head()

contig,taxid,superkingdom,phylum,class,order,family,genus,species
NODE_98_length_143507_cov_224.136,212,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter acinonychis
NODE_99_length_143125_cov_224.739,573569,bacteria,proteobacteria,gammaproteobacteria,thiotrichales,francisellaceae,francisella,francisella sp. tx077308
NODE_100_length_142487_cov_223.46,212,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter acinonychis
NODE_101_length_139096_cov_225.266,146819,bacteria,actinobacteria,actinobacteria,streptomycetales,streptomycetaceae,streptomyces,streptomyces europaeiscabiei
NODE_102_length_138307_cov_223.908,220685,bacteria,firmicutes,bacilli,bacillales,bacillaceae,bacillus,bacillus bataviensis


This table is convenient for inspecting each contig's taxonomic assignment. However, if we only have a list of taxids, we can just as easily reconstruct the full lineage.

In [21]:
# The NCBI class provides methods to construct lineages from taxids
from autometa.taxonomy.ncbi import NCBI
# First we need to instantiate the NCBI object
ncbi = NCBI('databases/ncbi')



In [22]:
# Reuse the earlier lookup to collect the taxids of this ORF's hits
orf_result = results.get('NODE_98_length_143507_cov_224.136_11')
taxids = {attrs.get('taxid') for attrs in orf_result.sseqids.values()}
print(f'taxids: {taxids}')
# Note: The NCBI class provides many other methods
df = ncbi.get_lineage_dataframe(taxids)

taxids: {210, 212, 992074}


In [23]:
df

taxid,superkingdom,phylum,class,order,family,genus,species
210,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter pylori
212,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter acinonychis
992074,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter pylori


We can equally add these lineages back to the DataFrame from which we retrieved the taxids...

## \*.lca.tsv table

In [24]:
lca_filepath = 'dev/scaffolds.orfs.faa.lca.tsv'
lca_df = pd.read_csv(lca_filepath, sep='\t', index_col='qseqid')
lca_df.head()

qseqid,name,rank,lca
NODE_98_length_143507_cov_224.136_1,helicobacter acinonychis,species,212
NODE_98_length_143507_cov_224.136_2,helicobacter,genus,209
NODE_98_length_143507_cov_224.136_3,helicobacter,genus,209
NODE_98_length_143507_cov_224.136_4,helicobacter,genus,209
NODE_98_length_143507_cov_224.136_5,helicobacter acinonychis,species,212
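
To illustrate what a lowest common ancestor (LCA) is, the sketch below finds the LCA of a set of taxids by walking a toy child-to-parent map up to the root. This is a deliberately naive approach for illustration. The `tour`, `level`, `occurrence`, and `sparse` pickles listed earlier suggest Autometa relies on precomputed structures instead, which this sketch does not attempt to reproduce. The `PARENT` map is a hypothetical, shortened excerpt of a taxonomy tree.

```python
# Toy child -> parent map (a hypothetical, shortened taxonomy excerpt).
PARENT = {
    992074: 210,  # strain -> species
    210: 209,     # species -> genus
    212: 209,     # species -> genus
    209: 1,       # genus -> root (intermediate ranks omitted)
    1: 1,         # root points to itself
}


def path_to_root(taxid):
    """Return the list of taxids from `taxid` up to the root (inclusive)."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxids):
    """Return the deepest taxid shared by every path to the root."""
    paths = [path_to_root(taxid) for taxid in taxids]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first taxid, the first shared node is the deepest.
    return next(node for node in paths[0] if node in shared)


print(lowest_common_ancestor([210, 212, 992074]))
```

With this toy map, hits spanning two species of the same genus resolve to the genus-level taxid, which mirrors the `genus` rows with `lca` 209 in the table above.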


In [25]:
df = ncbi.get_lineage_dataframe(lca_df.lca.unique().tolist())
df.head()

taxid,superkingdom,phylum,class,order,family,genus,species
1,unclassified,unclassified,unclassified,unclassified,unclassified,unclassified,unclassified
2,bacteria,unclassified,unclassified,unclassified,unclassified,unclassified,unclassified
1703944,bacteria,actinobacteria,actinobacteria,streptomycetales,streptomycetaceae,streptomyces,streptomyces sp. cb02400
2062,bacteria,actinobacteria,actinobacteria,streptomycetales,streptomycetaceae,unclassified,unclassified
2070,bacteria,actinobacteria,actinobacteria,pseudonocardiales,pseudonocardiaceae,unclassified,unclassified


Notice the returned DataFrame is indexed by taxid. This allows easy merging with the `lca_df` DataFrame on its `lca` column.

In [26]:
# Merge DataFrames
merged_df = pd.merge(
    lca_df,
    df,
    how='left',
    left_on='lca',
    right_index=True,
)

merged_df.head()

qseqid,name,rank,lca,superkingdom,phylum,class,order,family,genus,species
NODE_98_length_143507_cov_224.136_1,helicobacter acinonychis,species,212,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter acinonychis
NODE_98_length_143507_cov_224.136_2,helicobacter,genus,209,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,unclassified
NODE_98_length_143507_cov_224.136_3,helicobacter,genus,209,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,unclassified
NODE_98_length_143507_cov_224.136_4,helicobacter,genus,209,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,unclassified
NODE_98_length_143507_cov_224.136_5,helicobacter acinonychis,species,212,bacteria,proteobacteria,epsilonproteobacteria,campylobacterales,helicobacteraceae,helicobacter,helicobacter acinonychis


## Markers table format

In [27]:
markers_filepath = 'dev/bacteria.markers.tsv'
markers_df = pd.read_csv(markers_filepath, sep='\t', index_col='contig')
markers_df.head()

contig,orf,sacc,sname,score,cutoff
NODE_13_length_533917_cov_222.896,NODE_13_length_533917_cov_222.896_412,PF00121.13,TIM,315.4,140.2
NODE_126_length_115614_cov_223.525,NODE_126_length_115614_cov_223.525_21,PF00712.14,DNA_pol3_beta,139.2,48.7
NODE_126_length_115614_cov_223.525,NODE_126_length_115614_cov_223.525_21,PF02767.11,DNA_pol3_beta_2,122.6,49.65
NODE_126_length_115614_cov_223.525,NODE_126_length_115614_cov_223.525_21,PF02768.10,DNA_pol3_beta_3,91.8,44.5
NODE_127_length_115095_cov_224.529,NODE_127_length_115095_cov_224.529_91,PF08459.6,UvrC_HhH_N,140.3,81.9


In [28]:
from autometa.common.markers import Markers

# `load_markers` is a staticmethod of the Markers class, so we can
# load the previously annotated markers table without instantiating Markers
markers_df = Markers.load_markers(markers_filepath, format='wide')
markers_df.head()

contig,PF00035.20,PF00113.17,PF00121.13,PF00154.16,PF00162.14,PF00163.14,PF00164.20,PF00177.16,PF00181.18,PF00189.15,...,PF06574.7,PF06689.8,PF07499.8,PF08275.6,PF08459.6,PF08529.6,PF10385.4,PF10458.4,PF11987.3,PF12344.3
NODE_1007_length_15092_cov_223.109,,,,,,,,,,,...,,,,,,,,,,
NODE_1009_length_14996_cov_222.895,,,,,,,1.0,1.0,,,...,,,,,,,1.0,,,
NODE_100_length_142487_cov_223.46,1.0,,,,,,,,,,...,,,,,,,,,,
NODE_1017_length_14855_cov_223.142,,,,,,,,,,,...,,,,,,,1.0,,,
NODE_1023_length_14648_cov_224.155,,,,,,,,,,,...,,,,,,,,,,


In [29]:
markers_df = Markers.load_markers(markers_filepath, format='long')
markers_df.head()

contig,sacc,count
NODE_1007_length_15092_cov_223.109,PF02367.12,1
NODE_1009_length_14996_cov_222.895,PF00164.20,1
NODE_1009_length_14996_cov_222.895,PF00177.16,1
NODE_1009_length_14996_cov_222.895,PF00298.14,1
NODE_1009_length_14996_cov_222.895,PF00466.15,1


In [30]:
markers = Markers.load_markers(markers_filepath, format='list')
markers

{'NODE_1007_length_15092_cov_223.109': ['PF02367.12'],
 'NODE_1009_length_14996_cov_222.895': ['PF00584.15',
  'PF03946.9',
  'PF00298.14',
  'PF00687.16',
  'PF00466.15',
  'PF00542.14',
  'PF00562.23',
  'PF04563.10',
  'PF04561.9',
  'PF04560.15',
  'PF04565.11',
  'PF10385.4',
  'PF04997.7',
  'PF04998.12',
  'PF00623.15',
  'PF04983.13',
  'PF05000.12',
  'PF00164.20',
  'PF00177.16'],
 'NODE_100_length_142487_cov_223.46': ['PF05697.8',
  'PF05698.9',
  'PF03948.9',
  'PF01281.14',
  'PF00035.20',
  'PF02938.9',
  'PF03602.10',
  'PF00712.14',
  'PF02768.10',
  'PF00830.14',
  'PF01687.12'],
 'NODE_1017_length_14855_cov_223.142': ['PF04997.7',
  'PF04998.12',
  'PF00623.15',
  'PF04983.13',
  'PF05000.12',
  'PF00562.23',
  'PF04561.9',
  'PF04560.15',
  'PF04565.11',
  'PF10385.4',
  'PF04563.10',
  'PF00542.14',
  'PF00466.15',
  'PF00687.16',
  'PF00298.14',
  'PF03946.9'],
 'NODE_1023_length_14648_cov_224.155': ['PF02882.14', 'PF00763.18'],
 'NODE_1026_length_14554_cov_223.897

In [31]:
markers = Markers.load_markers(markers_filepath, format='counts')
markers

{'NODE_1007_length_15092_cov_223.109': 1,
 'NODE_1009_length_14996_cov_222.895': 19,
 'NODE_100_length_142487_cov_223.46': 11,
 'NODE_1017_length_14855_cov_223.142': 16,
 'NODE_1023_length_14648_cov_224.155': 2,
 'NODE_1026_length_14554_cov_223.897': 1,
 'NODE_102_length_138307_cov_223.908': 6,
 'NODE_1032_length_14379_cov_222.741': 1,
 'NODE_1035_length_14277_cov_224.245': 1,
 'NODE_1037_length_14262_cov_223.873': 1,
 'NODE_1039_length_14197_cov_223.624': 23,
 'NODE_103_length_137756_cov_225.461': 1,
 'NODE_1046_length_13906_cov_224.346': 2,
 'NODE_1052_length_13792_cov_224.116': 1,
 'NODE_1057_length_13739_cov_222.742': 1,
 'NODE_1059_length_13673_cov_223.999': 2,
 'NODE_105_length_136131_cov_224.464': 13,
 'NODE_1073_length_13348_cov_222.786': 1,
 'NODE_1074_length_13345_cov_224.047': 1,
 'NODE_1076_length_13288_cov_223.745': 4,
 'NODE_107_length_135043_cov_223.938': 14,
 'NODE_1080_length_13246_cov_222.788': 1,
 'NODE_1086_length_13138_cov_224.524': 1,
 'NODE_108_length_134368_cov_
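
The `list` and `counts` outputs above can be thought of as aggregations of the long table. The sketch below derives similarly shaped dicts from a toy long-format table with pandas. It is not the implementation of `Markers.load_markers`, and whether Autometa counts repeated accessions once or multiple times is not shown here, so this example simply counts every row. All names are made up.

```python
import pandas as pd

# Toy long-format markers table (contig and accession names are made up).
long_markers = pd.DataFrame(
    {
        "contig": ["contig_1", "contig_2", "contig_2", "contig_2"],
        "sacc": ["PF00001", "PF00002", "PF00003", "PF00002"],
    }
)

# contig -> list of marker accessions (similar in shape to format='list').
marker_lists = long_markers.groupby("contig")["sacc"].apply(list).to_dict()

# contig -> number of marker hits (similar in shape to format='counts').
marker_counts = long_markers.groupby("contig")["sacc"].count().to_dict()

print(marker_lists)
print(marker_counts)
```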

## Exposing data to other language formats

With a pandas DataFrame, a variety of methods are available to expose data in a different format.

Examples:

- JSON objects
- NumPy arrays
- tables with a different delimiter
- Python dict objects

In [32]:
markers_df = Markers.load_markers(markers_filepath)

In [33]:
markers_df.to_json()

'{"PF00035.20":{"NODE_1007_length_15092_cov_223.109":null,"NODE_1009_length_14996_cov_222.895":null,"NODE_100_length_142487_cov_223.46":1.0,"NODE_1017_length_14855_cov_223.142":null,"NODE_1023_length_14648_cov_224.155":null,"NODE_1026_length_14554_cov_223.897":null,"NODE_102_length_138307_cov_223.908":null,"NODE_1032_length_14379_cov_222.741":null,"NODE_1035_length_14277_cov_224.245":null,"NODE_1037_length_14262_cov_223.873":null,"NODE_1039_length_14197_cov_223.624":null,"NODE_103_length_137756_cov_225.461":null,"NODE_1046_length_13906_cov_224.346":null,"NODE_1052_length_13792_cov_224.116":null,"NODE_1057_length_13739_cov_222.742":null,"NODE_1059_length_13673_cov_223.999":null,"NODE_105_length_136131_cov_224.464":null,"NODE_1073_length_13348_cov_222.786":null,"NODE_1074_length_13345_cov_224.047":null,"NODE_1076_length_13288_cov_223.745":null,"NODE_107_length_135043_cov_223.938":null,"NODE_1080_length_13246_cov_222.788":null,"NODE_1086_length_13138_cov_224.524":null,"NODE_108_length_134

In [34]:
markers_df.to_numpy()

array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [ 1., nan, nan, ..., nan, nan, nan],
       ...,
       [nan,  1., nan, ..., nan, nan, nan],
       [nan,  1., nan, ..., nan, nan, nan],
       [ 1., nan, nan, ..., nan,  1., nan]])

In [35]:
markers_df.to_dict()

{'PF00035.20': {'NODE_1007_length_15092_cov_223.109': nan,
  'NODE_1009_length_14996_cov_222.895': nan,
  'NODE_100_length_142487_cov_223.46': 1.0,
  'NODE_1017_length_14855_cov_223.142': nan,
  'NODE_1023_length_14648_cov_224.155': nan,
  'NODE_1026_length_14554_cov_223.897': nan,
  'NODE_102_length_138307_cov_223.908': nan,
  'NODE_1032_length_14379_cov_222.741': nan,
  'NODE_1035_length_14277_cov_224.245': nan,
  'NODE_1037_length_14262_cov_223.873': nan,
  'NODE_1039_length_14197_cov_223.624': nan,
  'NODE_103_length_137756_cov_225.461': nan,
  'NODE_1046_length_13906_cov_224.346': nan,
  'NODE_1052_length_13792_cov_224.116': nan,
  'NODE_1057_length_13739_cov_222.742': nan,
  'NODE_1059_length_13673_cov_223.999': nan,
  'NODE_105_length_136131_cov_224.464': nan,
  'NODE_1073_length_13348_cov_222.786': nan,
  'NODE_1074_length_13345_cov_224.047': nan,
  'NODE_1076_length_13288_cov_223.745': nan,
  'NODE_107_length_135043_cov_223.938': nan,
  'NODE_1080_length_13246_cov_222.788': na

These methods come with their own arguments to customize the returned object. For example, `to_csv` accepts a `sep` argument to write the table with a different delimiter:

```python
# delimiter = ',' or '\t' etc...
markers_df.to_csv('<filepath>', sep='<delimiter>')

```
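
As a small, self-contained illustration of these export methods, the sketch below writes a toy wide table to an in-memory buffer as comma-separated text, reads it back, and also exports it as JSON keyed by row. The table contents are made up, and an in-memory buffer stands in for a real file path.

```python
import io
import json

import pandas as pd

# Toy wide table with missing values (names are illustrative).
table = pd.DataFrame(
    {"PF00001": [1.0, None], "PF00002": [None, 2.0]},
    index=pd.Index(["contig_1", "contig_2"], name="contig"),
)

# Write to an in-memory buffer as comma-separated text, then read it back.
buffer = io.StringIO()
table.to_csv(buffer, sep=",")
buffer.seek(0)
restored = pd.read_csv(buffer, sep=",", index_col="contig")

# `orient="index"` keys the JSON by row label instead of by column.
as_json = json.loads(table.to_json(orient="index"))

print(restored)
print(as_json["contig_1"])
```

Missing values are written as empty fields in the delimited text and as `null` in JSON, so they survive the round trip as `NaN` and `None` respectively.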