Fixes #16. Many changes but the most important to document was this, …

…I'll explain in full in comments on the pull request from this issue.
MatthewRalston · Jan 9, 2021 · 090b91a · 090b91a
1 parent 9c08150
commit 090b91a
Show file tree

Hide file tree

Showing 8 changed files with 19,948 additions and 49 deletions.
diff --git a/TODO.org b/TODO.org
@@ -15,7 +15,7 @@ The final prototype would be .bgzf format from biopython
 * DONE Nearest neighbor profile
   CLOSED: [2021-01-07 Thu 21:37]
 * VERIFY index class
-* TODO Probability function
+* VERIFY Probability function
 * Rmd report1
 ** Distribution analysis
 ** Accurately describe kdb counting algorithm

diff --git a/examples/example_report/index.Rmd b/examples/example_report/index.Rmd
@@ -124,7 +124,11 @@ In this report, we will look at the shape of the data as it is normalized by sam
 
 ## kmerdb and on-disk k-mer counting
 
-kmerdb is a Python module, distributed with the Python Package Index (PyPI) through the command line interface 'pip'. Installation is very simple and the experience should be painless for most users. The primary feature of kmerdb is a series of command-line actions called "sub-commands." These sub-commands perform specific functions on kdb database files or matrices of their origin. But perhaps more specifically the k-mer database is designed to generate very simple k-mer graph neighbors and abundances and store them in the bgzf file format. The first step of any analysis done with kdb is the k-mer counting done when generating a k-mer "profile." During this step, .fastq or .fasta files and eventually .bam files are assessed to create k-mer database files. Each read (or sequence in the case of fasta format) is a considerably different kind of sub-profile with error mixed in. We don't yet have a layer to propagate that error in to, numerically at this point, per k-mer. And so we leave that where it is, as far as cleaning of read groups prior to analysis with kdb, if fastq or bam files are preferred. To be candid, I am stuck between whether to study the error rates associated with fastq subsampling of a perfect fasta genome, and with whether to study the profiling of profile subcommand, 
+kmerdb is a Python module, distributed with the Python Package Index (PyPI) through the command line interface 'pip'. Installation is very simple and the experience should be painless for most users. The primary feature of kmerdb is a series of command-line actions called "sub-commands." These sub-commands perform specific functions on kdb database files or matrices of their origin. But perhaps more specifically the k-mer database is designed to generate very simple k-mer graph neighbors and abundances and store them in the bgzf file format. 
+
+The first step of any analysis done with kdb is the k-mer counting done when generating one or more k-mer "profile." During this step, .fastq or .fasta files (we eventually look to add support for .bam files) are assessed to create k-mer database files. Each read (or sequence in the case of fasta format) is a considerably different kind of sub-profile with error mixed in. We don't yet have a layer to propagate that error in to, numerically at this point, per k-mer. And so we leave that where it is, as far as cleaning of read groups prior to analysis with kdb, if fastq or bam files are preferred. To be candid, I am stuck between whether to study the error rates associated with fastq subsampling of a perfect fasta genome, and with whether to study the profiling of profile subcommand. 
+
+The more deliberate feature of the on-disk k-mer counting is that higher choices of k may be uncovered for which bugs do not have to be uncovered, we tied ourselves to SQLite3 and then let loose the development on other fronts. This is covered to a better extent by `example_report1` because it is more focused on the profiling side of things, to what limit I can explore k meaningfully. The other benefit of having
 
 
 ## Distribution fitting on normalized and unnormalized k-mer counts

diff --git a/examples/example_report1/NOTES.Rmd b/examples/example_report1/NOTES.Rmd
@@ -97,7 +97,7 @@ With consumer hardware, kdb profile generates 8-mer profiles without issue. In p
 
 Certainly with an eye to the future, we can see that the solution prototyped here can scale to at least k=20 with on-disk counting and without choosing an optimal database for rapid counting, and if "rapid" is even necessary if it's only done once. The tool was used to generate 8-mer profiles in this study, and we will b
 
-
+It is worth noting that the maximum array index for a certain value of k, must not exceed the value of a 64bit integer, which is (2^64)-1 ~ 18 pentillion. However, the maximum index happens to be 1.1 pentillion in the case of 4^30 (k=30). When I looked again at 4^31, 64 bits is capable of storing an index for that extent with numpy. However, k=32 seems to exceed the number by almost 1. Then again, k-mer indices of that magnitude are rarely necessary from my perspective, and they would need a more complex key. 
 
 The advent of Next-Generation Sequencing technology may be one of the most influencial technologies of the early 21st century. 
 

diff --git a/kmerdb/__init__.py b/kmerdb/__init__.py
@@ -51,7 +51,8 @@ def markov_probability(arguments):
 
     if index.has_index(arguments.kdb):
         arguments.kdbi = arguments.kdb + "i"
-        df = pd.DataFrame([])
+        #df = pd.DataFrame([], columns=["SequenceID", "Log_Odds_ratio", "p_of_seq"])
+        profiles = np.array([], dtype="int64")
         with fileutil.open(arguments.kdb, 'r') as kdb:
             k = kdb.header['k']
             with index.open(arguments.kdbi, 'r') as kdbi:
@@ -68,10 +69,15 @@ def markov_probability(arguments):
                         # Do something here
 
                         markov_probs = list(map(lambda p: [p["seq"].name, p["log_odds_ratio"], p["p_of_seq"]], [probability.markov_probability(seq, kdb, kdbi) for seq in recs]))
-                        df.append(pd.DataFrame(markov_probs, columns=["SequenceID", "Log_Odds_ratio", "p_of_seq"]))
 
-                        recs = [r for r in seqprsr] # Essentially accomplishes an iteration in the file, wrapped by the seqparser.SeqParser class
+                        print(markov_probs)
+                        if profiles.shape == (0,):
+                            profiles = np.array(markov_probs)
+                        else:
+                            np.append(profiles, markov_probs, axis=0)
 
+                        recs = [r for r in seqprsr] # Essentially accomplishes an iteration in the file, wrapped by the seqparser.SeqParser class
+        df = pd.DataFrame(profiles, columns=["SequenceID", "Log_Odds_ratio", "p_of_seq"])
         df.to_csv(sys.stdout, sep=arguments.delimiter, index=False)
 
     else:

diff --git a/kmerdb/index.py b/kmerdb/index.py
@@ -20,6 +20,7 @@
 import os
 import copy
 import gzip
+import yaml
 import numpy as np
 #import jsonschema
 sys.path.append('..')
@@ -91,11 +92,15 @@ def __init__(self, indexfile:str):
                 raise IOError("Could not determine k by casting the first line of the index as an integer. Improperly formatted index")
             i = 0
             for line in ifile:
-                if i == 0:
-                    pass
-                else:
-                    kmer_id, line_offset = line.rstrip().split(",")
-                    idx[kmer_id] = line_offset
+                # if i == 0:
+                #     logger.warning("i: {0}, line:\n{1}".format(i, line))
+
+                #     sys.exit(1)
+                #     pass
+                # else:
+                kmer_id, line_offset = [int(i) for i in line.rstrip().split("\t")]
+                idx[kmer_id] = line_offset
+                i += 1
         self.index = idx
 
     def __enter__(self):
@@ -130,20 +135,31 @@ def __init__(self, kdbfile:str, k:int):
             logger.info("KDBReader logs the position in mode 'r' as 'header_offset', after appending blocks: {0}".format(kdbrdr.header["header_offset"])) # This could be a compressed offset
             logger.info("Asked for the position one more time in mode 'r' in 'with this' and received: {0}".format(kdbrdr.tell()))
             line_index[0] = kdbrdr.tell()
-            
+
             # SOmething should be checked here
             #kdbrdr.seek(kdbrdr.header["header_offset"])
             #kdbrdr._load_block()
             i = 1
-            old_idx = line_index[0]
+            this_line = line_index[0]
             logger.debug("0th index, points to the first row? : {0}".format(line_index[0]))
             for line in kdbrdr:
                 try:
-                    pos = kdbrdr.tell()
+                    next_line = kdbrdr.tell()
                     kmer_id, count = [int(i) for i in line.rstrip().split("\t")[:-1]]
-                    line_index[kmer_id] = old_idx
-                    logger.debug("i: {0}, K-mer id: {1}, count: {2}, index line: {3}, offset: {4}".format(i, kmer_id, count, i, old_idx))
-                    old_idx = pos
+                    line_index[kmer_id] = this_line
+                    kdbrdr.seek(this_line)
+                    new_line = kdbrdr.readline()
+                    if new_line == line:
+                        logger.debug("i: {0}, K-mer id: {1}, count: {2}, index line: {3}, offset: {4}".format(i, kmer_id, count, i, this_line))
+                    elif next_line == 0 or this_line == 0:
+                        logger.warning("i: 0, k-mer id: {1}, this_line: {2}, next_line{3}".format(i, kmer_id, this_line, next_line))
+                        raise ValueError("Encountered a line where kdbrdr.tell() is producing 0")
+                    else:
+                        logger.warning("This line was not reproducibly generated...")
+                        logger.warning("i: {0}, K-mer id: {1}, offset: {2}".format(i, kmer_id, this_line))
+                        sys.exit(1)
+
+                    this_line = next_line
                     if kdbrdr.tell() == 0:
                         raise IOError("kmerdb.index.IndexBuilder expects the kdbrdr to increment with each line")
                     # elif i > 10:
@@ -213,12 +229,18 @@ def write_index(index:np.array, indexfile:str, k:int):
         raise TypeError("kmerdb.index.write_index expects a str as its second positional argument")
     elif os.path.exists(indexfile):
         logger.warning("kmerdb.index.write_index is overwriting an existing index '{0}'...".format(indexfile))
+        mode = "wt"
     elif type(k) is not int:
         raise TypeError("kmerdb.index.write_index expects an int as its third positional argument")
     else:
-        with gzip.open(indexfile, 'xt') as ofile:
-            ofile.write("{0}\n".format(k))
-            for i, line_offset in enumerate(index):
+        mode = "xt"
+    with gzip.open(indexfile, mode) as ofile:
+        ofile.write("{0}\n".format(k))
+        for i, line_offset in enumerate(index):
+            if int(line_offset) == 0:
+                logger.debug("K-mer {0} was not observed, not writing to file".format(i))
+                pass
+            else:
                 ofile.write("{0}\t{1}\n".format(i, line_offset))
 
 

diff --git a/kmerdb/kmer.py b/kmerdb/kmer.py
@@ -106,7 +106,12 @@ def kmer_to_id(s):
             s = str(s)
         for c in bytes(s, "UTF-8"): # Use byteshifting for fast conversion to binary encoding
             idx = idx << 2
-            idx = idx | letterToBinary[c]
+            try:
+                idx = idx | letterToBinary[c]
+            except KeyError as e:
+                logger.error("Entire sequence: {0}".format(s))
+                logger.error("Problematic character: {0}".format(c))
+                raise e
         return idx
 
 

diff --git a/kmerdb/probability.py b/kmerdb/probability.py
@@ -41,27 +41,29 @@ def markov_probability(seq:Bio.SeqRecord.SeqRecord, kdbrdr:fileutil.KDBReader, k
     :rtype:
     """
     if not isinstance(seq, Bio.SeqRecord.SeqRecord):
-        raise TypeError("kmerdb.probability.MarkovProbability expects a Bio.SeqRecord.SeqRecord as its first positional argument")
+        raise TypeError("kmerdb.probability.markov_probability expects a Bio.SeqRecord.SeqRecord as its first positional argument")
 
     elif not isinstance(kdbrdr, fileutil.KDBReader):
-        raise TypeError("kmerdb.probability.MarkovProbability expects a kmerdb.fileutil.KDBReader as its second positionl argument")
+        raise TypeError("kmerdb.probability.markov_probability expects a kmerdb.fileutil.KDBReader as its second positionl argument")
     elif not isinstance(kdbidx, index.IndexReader):
-        raise TypeError("kmerdb.probability.MarkovProbability expects a kmerdb.index.IndexReader as its third positional argument")
+        raise TypeError("kmerdb.probability.markov_probability expects a kmerdb.index.IndexReader as its third positional argument")
 
 
     k = kdbrdr.header['k']
+    mononucleotides = kdbrdr.header['files'][0]['mononucleotides']
     N = 4**k
     total_kmers = sum([f['total_kmers'] for f in kdbrdr.header['files']])
+    um = 1/N # This is the uniform frequency for the space
     if len(seq) < k:
-        raise ValueError("kmerdb.probability.MarkovProbability expects the sequence to be at least k={0} in length".format(k))
+        raise ValueError("kmerdb.probability.markov_probability expects the sequence to be at least k={0} in length".format(k))
     x = seq.seq[:k]
     kmer_id_x = kmer.kmer_to_id(x)
     if kmer_id_x > N:
-        raise ValueError("kmerdb.probability.MarkovProbability expected the k-mer id {0} of the subsequence x='{1}' to have an id less than N = 4^k = {2}".format(kmer_id_x, x, N))
+        raise ValueError("kmerdb.probability.markov_probability expected the k-mer id {0} of the subsequence x='{1}' to have an id less than N = 4^k = {2}".format(kmer_id_x, x, N))
     if not isinstance(kdbidx.index, np.ndarray):
-        raise ValueError("kmerdb.probability.MarkovProbability expects the kdbidx to be a Numpy array")
+        raise ValueError("kmerdb.probability.markov_probability expects the kdbidx to be a Numpy array")
     elif kdbidx.index.size != N:
-        raise ValueError("kmerdb.probability.MarkovProbability expects the kdbidx to have exactly {0} items, found {1}".format(N, kdbidx.size))
+        raise ValueError("kmerdb.probability.markov_probability expects the kdbidx to have exactly {0} items, found {1}".format(N, kdbidx.size))
 
 
     kmer_id, count, neighbors = index.read_line(kdbrdr, kdbidx, kmer_id_x)
@@ -75,35 +77,71 @@ def markov_probability(seq:Bio.SeqRecord.SeqRecord, kdbrdr:fileutil.KDBReader, k
     # Find the correct neighbor and calculate the transition probability
     prefix = x
     total = 0
+    total2 = 0
     product = 1
+    total_nucleotides = sum(mononucleotides.values())
+    logger.debug(seq.seq)
+    logger.debug(len(seq.seq))
     for i in range(k, len(seq.seq)):
+
         sum_qses = 0
-        for s in neighbors["suffixes"]:
-            kmer_id, count, _ = index.read_line(kdbrdr, kdbidx, s)
+        sum_fomm = 0
+
+        for char, idx in neighbors["suffixes"].items():
+            kmer_id, count, _ = index.read_line(kdbrdr, kdbidx, idx)
             if kmer_id is None or count is None or _ is None:
-                raise ValueError("k-mer id '{0}' had an offset of zero in the index, it was not observed in the genome. Effective probability of sequence is 0 and Log-odds ratio of the sequence being generated from the k-mer profile is effectively 0 as well.")
-            if s != kmer_id:
-                raise ValueError("k-mer id read from file did not match the provided neighbor suffix id")
+                kmer_id = idx
+                count = 0
+                #raise ValueError("k-mer id '{0}' had an offset of zero in the index, it was not observed in the genome. Effective probability of sequence is 0 and Log-odds ratio of the sequence being generated from the k-mer profile is effectively 0 as well.".format(s))
+
+            sum_fomm += (mononucleotides[char] / total_nucleotides)
             sum_qses += count/float(total_kmers)
         # Prefix is the next prefix sequence to find a suffix for
-        prefix = prefix[1:] + seq.seq[i]
-        # We take the prefixes count and new neighbors
-        kmer_id, count, neighbors = index.read_line(kdbrdr, kdbidx, kmer.kmer_to_id(prefix))
-        # Now we calculate qt similar to px, the frequency (count / total number of k-mers, read from the header)
-        qt = count/float(total_kmers)
-        # Now we calculate the transition probability of sequence s to sequence t
-        # as qt divided by the sum of each qsc
-        ast = qt / sum_qses
-        # For the final log odds ratio, we sum the logs of the transition probabilities
-        total += math.log(ast)
-        # For the final probability, we multiply the transition probabilities
-        product *= ast
+        if i >= len(seq.seq):
+            break
+            #raise ValueError("kmerdb.probability.markov_probability encountered a value of i={0} larger than the length of the sequence {1}".format(i, len(seq.seq)))
+        elif seq.seq[i] == "":
+            logger.warning("Reached the end of the sequence, i={0} is producing no letters".format(i))
+            break
+        else:
+            logger.debug("{0} : prefix: '{1}' + '{2}'".format(i, prefix[1:], seq.seq[i]))
+            prefix = prefix[1:] + seq.seq[i]
+
+        if len(prefix) != k:
+            break
+        else:
+            logger.debug("New k-mer: '{0}'".format(prefix))
+            logger.debug("Length of prefix: {0}".format(len(prefix)))
+            try:
+                kmerid = kmer.kmer_to_id(prefix)
+            except KeyError as e:
+                logger.debug(i)
+                logger.debug("this should be the new letter: '{0}'".format(seq.seq[i]))
+                logger.debug("this should be the existing prefix: '{0}'".format(prefix))
+                raise e
+            logger.debug("Iteration number {0}, calculating probabilities for k-mer id={1} seq={2}".format(i, kmerid, prefix))
+            logger.debug("Reading counts/neighbors from file according to k-mer id")
+            # We take the prefixes count and new neighbors
+            kmer_id, count, neighbors = index.read_line(kdbrdr, kdbidx, kmerid)
+
+            # Now we calculate qt similar to px, the frequency (count / total number of k-mers, read from the header)
+
+            qt = count/float(total_kmers)
+            # Now we calculate the transition probability of sequence s to sequence t
+            # as qt divided by the sum of each qsc
+            ast = qt / sum_qses
+            # For the final log odds ratio, we sum the logs of the transition probabilities
+            total += math.log10(ast)
+            total2 += math.log10(sum_fomm) # The ratio of the uniform frequency to the first order Markov background
+            # For the final probability, we multiply the transition probabilities
+            product *= ast
     # Fianlly qxis represents our null model in a way.
     # Suppose that the null model consisted of the same number of k-mer as observed,
     # but uniformly distributed amongst all the k-mers
-    qxis = [total_kmers/N for x in range(k, len(seq.seq))]
+    qxis = [um for x in range(k, len(seq.seq))]
     # We put this on the bottom of the Log-odds Ratio because the null model contrasts the transition probabilities
-    lor = log(px) + total / sum([math.log(qx) for qx in qxis])
+    null_model = log10(um) + total2
+    lor = math.log10(px) + total / null_model
     # 
     pseq = px*product#/functools.reduce(lambda x,y: x*y, qxis)