Skip to content
This repository has been archived by the owner on Jun 17, 2024. It is now read-only.

Commit

Permalink
Fixes #16. Many changes but the most important to document was this, …
Browse files Browse the repository at this point in the history
…I'll explain in full in comments on the pull request from this issue.
  • Loading branch information
MatthewRalston committed Jan 9, 2021
1 parent 9c08150 commit 090b91a
Show file tree
Hide file tree
Showing 8 changed files with 19,948 additions and 49 deletions.
2 changes: 1 addition & 1 deletion TODO.org
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ The final prototype would be .bgzf format from biopython
* DONE Nearest neighbor profile
CLOSED: [2021-01-07 Thu 21:37]
* VERIFY index class
* TODO Probability function
* VERIFY Probability function
* Rmd report1
** Distribution analysis
** Accurately describe kdb counting algorithm
Expand Down
6 changes: 5 additions & 1 deletion examples/example_report/index.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,11 @@ In this report, we will look at the shape of the data as it is normalized by sam

## kmerdb and on-disk k-mer counting

kmerdb is a Python module, distributed with the Python Package Index (PyPI) through the command line interface 'pip'. Installation is very simple and the experience should be painless for most users. The primary feature of kmerdb is a series of command-line actions called "sub-commands." These sub-commands perform specific functions on kdb database files or matrices of their origin. But perhaps more specifically the k-mer database is designed to generate very simple k-mer graph neighbors and abundances and store them in the bgzf file format. The first step of any analysis done with kdb is the k-mer counting done when generating a k-mer "profile." During this step, .fastq or .fasta files and eventually .bam files are assessed to create k-mer database files. Each read (or sequence in the case of fasta format) is a considerably different kind of sub-profile with error mixed in. We don't yet have a layer to propagate that error in to, numerically at this point, per k-mer. And so we leave that where it is, as far as cleaning of read groups prior to analysis with kdb, if fastq or bam files are preferred. To be candid, I am stuck between whether to study the error rates associated with fastq subsampling of a perfect fasta genome, and with whether to study the profiling of profile subcommand,
kmerdb is a Python module, distributed with the Python Package Index (PyPI) through the command line interface 'pip'. Installation is very simple and the experience should be painless for most users. The primary feature of kmerdb is a series of command-line actions called "sub-commands." These sub-commands perform specific functions on kdb database files or matrices of their origin. But perhaps more specifically the k-mer database is designed to generate very simple k-mer graph neighbors and abundances and store them in the bgzf file format.

The first step of any analysis done with kdb is the k-mer counting done when generating one or more k-mer "profile." During this step, .fastq or .fasta files (we eventually look to add support for .bam files) are assessed to create k-mer database files. Each read (or sequence in the case of fasta format) is a considerably different kind of sub-profile with error mixed in. We don't yet have a layer to propagate that error in to, numerically at this point, per k-mer. And so we leave that where it is, as far as cleaning of read groups prior to analysis with kdb, if fastq or bam files are preferred. To be candid, I am stuck between whether to study the error rates associated with fastq subsampling of a perfect fasta genome, and with whether to study the profiling of profile subcommand.

The more deliberate feature of the on-disk k-mer counting is that higher choices of k may be uncovered for which bugs do not have to be uncovered, we tied ourselves to SQLite3 and then let loose the development on other fronts. This is covered to a better extent by `example_report1` because it is more focused on the profiling side of things, to what limit I can explore k meaningfully. The other benefit of having


## Distribution fitting on normalized and unnormalized k-mer counts
Expand Down
2 changes: 1 addition & 1 deletion examples/example_report1/NOTES.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -97,7 +97,7 @@ With consumer hardware, kdb profile generates 8-mer profiles without issue. In p

Certainly with an eye to the future, we can see that the solution prototyped here can scale to at least k=20 with on-disk counting and without choosing an optimal database for rapid counting, and if "rapid" is even necessary if it's only done once. The tool was used to generate 8-mer profiles in this study, and we will b


It is worth noting that the maximum array index for a certain value of k, must not exceed the value of a 64bit integer, which is (2^64)-1 ~ 18 pentillion. However, the maximum index happens to be 1.1 pentillion in the case of 4^30 (k=30). When I looked again at 4^31, 64 bits is capable of storing an index for that extent with numpy. However, k=32 seems to exceed the number by almost 1. Then again, k-mer indices of that magnitude are rarely necessary from my perspective, and they would need a more complex key.

The advent of Next-Generation Sequencing technology may be one of the most influencial technologies of the early 21st century.

Expand Down
12 changes: 9 additions & 3 deletions kmerdb/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,8 @@ def markov_probability(arguments):

if index.has_index(arguments.kdb):
arguments.kdbi = arguments.kdb + "i"
df = pd.DataFrame([])
#df = pd.DataFrame([], columns=["SequenceID", "Log_Odds_ratio", "p_of_seq"])
profiles = np.array([], dtype="int64")
with fileutil.open(arguments.kdb, 'r') as kdb:
k = kdb.header['k']
with index.open(arguments.kdbi, 'r') as kdbi:
Expand All @@ -68,10 +69,15 @@ def markov_probability(arguments):
# Do something here

markov_probs = list(map(lambda p: [p["seq"].name, p["log_odds_ratio"], p["p_of_seq"]], [probability.markov_probability(seq, kdb, kdbi) for seq in recs]))
df.append(pd.DataFrame(markov_probs, columns=["SequenceID", "Log_Odds_ratio", "p_of_seq"]))

recs = [r for r in seqprsr] # Essentially accomplishes an iteration in the file, wrapped by the seqparser.SeqParser class
print(markov_probs)
if profiles.shape == (0,):
profiles = np.array(markov_probs)
else:
np.append(profiles, markov_probs, axis=0)

recs = [r for r in seqprsr] # Essentially accomplishes an iteration in the file, wrapped by the seqparser.SeqParser class
df = pd.DataFrame(profiles, columns=["SequenceID", "Log_Odds_ratio", "p_of_seq"])
df.to_csv(sys.stdout, sep=arguments.delimiter, index=False)

else:
Expand Down
50 changes: 36 additions & 14 deletions kmerdb/index.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
import os
import copy
import gzip
import yaml
import numpy as np
#import jsonschema
sys.path.append('..')
Expand Down Expand Up @@ -91,11 +92,15 @@ def __init__(self, indexfile:str):
raise IOError("Could not determine k by casting the first line of the index as an integer. Improperly formatted index")
i = 0
for line in ifile:
if i == 0:
pass
else:
kmer_id, line_offset = line.rstrip().split(",")
idx[kmer_id] = line_offset
# if i == 0:
# logger.warning("i: {0}, line:\n{1}".format(i, line))

# sys.exit(1)
# pass
# else:
kmer_id, line_offset = [int(i) for i in line.rstrip().split("\t")]
idx[kmer_id] = line_offset
i += 1
self.index = idx

def __enter__(self):
Expand Down Expand Up @@ -130,20 +135,31 @@ def __init__(self, kdbfile:str, k:int):
logger.info("KDBReader logs the position in mode 'r' as 'header_offset', after appending blocks: {0}".format(kdbrdr.header["header_offset"])) # This could be a compressed offset
logger.info("Asked for the position one more time in mode 'r' in 'with this' and received: {0}".format(kdbrdr.tell()))
line_index[0] = kdbrdr.tell()

# SOmething should be checked here
#kdbrdr.seek(kdbrdr.header["header_offset"])
#kdbrdr._load_block()
i = 1
old_idx = line_index[0]
this_line = line_index[0]
logger.debug("0th index, points to the first row? : {0}".format(line_index[0]))
for line in kdbrdr:
try:
pos = kdbrdr.tell()
next_line = kdbrdr.tell()
kmer_id, count = [int(i) for i in line.rstrip().split("\t")[:-1]]
line_index[kmer_id] = old_idx
logger.debug("i: {0}, K-mer id: {1}, count: {2}, index line: {3}, offset: {4}".format(i, kmer_id, count, i, old_idx))
old_idx = pos
line_index[kmer_id] = this_line
kdbrdr.seek(this_line)
new_line = kdbrdr.readline()
if new_line == line:
logger.debug("i: {0}, K-mer id: {1}, count: {2}, index line: {3}, offset: {4}".format(i, kmer_id, count, i, this_line))
elif next_line == 0 or this_line == 0:
logger.warning("i: 0, k-mer id: {1}, this_line: {2}, next_line{3}".format(i, kmer_id, this_line, next_line))
raise ValueError("Encountered a line where kdbrdr.tell() is producing 0")
else:
logger.warning("This line was not reproducibly generated...")
logger.warning("i: {0}, K-mer id: {1}, offset: {2}".format(i, kmer_id, this_line))
sys.exit(1)

this_line = next_line
if kdbrdr.tell() == 0:
raise IOError("kmerdb.index.IndexBuilder expects the kdbrdr to increment with each line")
# elif i > 10:
Expand Down Expand Up @@ -213,12 +229,18 @@ def write_index(index:np.array, indexfile:str, k:int):
raise TypeError("kmerdb.index.write_index expects a str as its second positional argument")
elif os.path.exists(indexfile):
logger.warning("kmerdb.index.write_index is overwriting an existing index '{0}'...".format(indexfile))
mode = "wt"
elif type(k) is not int:
raise TypeError("kmerdb.index.write_index expects an int as its third positional argument")
else:
with gzip.open(indexfile, 'xt') as ofile:
ofile.write("{0}\n".format(k))
for i, line_offset in enumerate(index):
mode = "xt"
with gzip.open(indexfile, mode) as ofile:
ofile.write("{0}\n".format(k))
for i, line_offset in enumerate(index):
if int(line_offset) == 0:
logger.debug("K-mer {0} was not observed, not writing to file".format(i))
pass
else:
ofile.write("{0}\t{1}\n".format(i, line_offset))


Expand Down
7 changes: 6 additions & 1 deletion kmerdb/kmer.py
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,12 @@ def kmer_to_id(s):
s = str(s)
for c in bytes(s, "UTF-8"): # Use byteshifting for fast conversion to binary encoding
idx = idx << 2
idx = idx | letterToBinary[c]
try:
idx = idx | letterToBinary[c]
except KeyError as e:
logger.error("Entire sequence: {0}".format(s))
logger.error("Problematic character: {0}".format(c))
raise e
return idx


Expand Down
90 changes: 64 additions & 26 deletions kmerdb/probability.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,27 +41,29 @@ def markov_probability(seq:Bio.SeqRecord.SeqRecord, kdbrdr:fileutil.KDBReader, k
:rtype:
"""
if not isinstance(seq, Bio.SeqRecord.SeqRecord):
raise TypeError("kmerdb.probability.MarkovProbability expects a Bio.SeqRecord.SeqRecord as its first positional argument")
raise TypeError("kmerdb.probability.markov_probability expects a Bio.SeqRecord.SeqRecord as its first positional argument")

elif not isinstance(kdbrdr, fileutil.KDBReader):
raise TypeError("kmerdb.probability.MarkovProbability expects a kmerdb.fileutil.KDBReader as its second positionl argument")
raise TypeError("kmerdb.probability.markov_probability expects a kmerdb.fileutil.KDBReader as its second positionl argument")
elif not isinstance(kdbidx, index.IndexReader):
raise TypeError("kmerdb.probability.MarkovProbability expects a kmerdb.index.IndexReader as its third positional argument")
raise TypeError("kmerdb.probability.markov_probability expects a kmerdb.index.IndexReader as its third positional argument")


k = kdbrdr.header['k']
mononucleotides = kdbrdr.header['files'][0]['mononucleotides']
N = 4**k
total_kmers = sum([f['total_kmers'] for f in kdbrdr.header['files']])
um = 1/N # This is the uniform frequency for the space
if len(seq) < k:
raise ValueError("kmerdb.probability.MarkovProbability expects the sequence to be at least k={0} in length".format(k))
raise ValueError("kmerdb.probability.markov_probability expects the sequence to be at least k={0} in length".format(k))
x = seq.seq[:k]
kmer_id_x = kmer.kmer_to_id(x)
if kmer_id_x > N:
raise ValueError("kmerdb.probability.MarkovProbability expected the k-mer id {0} of the subsequence x='{1}' to have an id less than N = 4^k = {2}".format(kmer_id_x, x, N))
raise ValueError("kmerdb.probability.markov_probability expected the k-mer id {0} of the subsequence x='{1}' to have an id less than N = 4^k = {2}".format(kmer_id_x, x, N))
if not isinstance(kdbidx.index, np.ndarray):
raise ValueError("kmerdb.probability.MarkovProbability expects the kdbidx to be a Numpy array")
raise ValueError("kmerdb.probability.markov_probability expects the kdbidx to be a Numpy array")
elif kdbidx.index.size != N:
raise ValueError("kmerdb.probability.MarkovProbability expects the kdbidx to have exactly {0} items, found {1}".format(N, kdbidx.size))
raise ValueError("kmerdb.probability.markov_probability expects the kdbidx to have exactly {0} items, found {1}".format(N, kdbidx.size))


kmer_id, count, neighbors = index.read_line(kdbrdr, kdbidx, kmer_id_x)
Expand All @@ -75,35 +77,71 @@ def markov_probability(seq:Bio.SeqRecord.SeqRecord, kdbrdr:fileutil.KDBReader, k
# Find the correct neighbor and calculate the transition probability
prefix = x
total = 0
total2 = 0
product = 1
total_nucleotides = sum(mononucleotides.values())
logger.debug(seq.seq)
logger.debug(len(seq.seq))
for i in range(k, len(seq.seq)):

sum_qses = 0
for s in neighbors["suffixes"]:
kmer_id, count, _ = index.read_line(kdbrdr, kdbidx, s)
sum_fomm = 0

for char, idx in neighbors["suffixes"].items():
kmer_id, count, _ = index.read_line(kdbrdr, kdbidx, idx)
if kmer_id is None or count is None or _ is None:
raise ValueError("k-mer id '{0}' had an offset of zero in the index, it was not observed in the genome. Effective probability of sequence is 0 and Log-odds ratio of the sequence being generated from the k-mer profile is effectively 0 as well.")
if s != kmer_id:
raise ValueError("k-mer id read from file did not match the provided neighbor suffix id")
kmer_id = idx
count = 0
#raise ValueError("k-mer id '{0}' had an offset of zero in the index, it was not observed in the genome. Effective probability of sequence is 0 and Log-odds ratio of the sequence being generated from the k-mer profile is effectively 0 as well.".format(s))

sum_fomm += (mononucleotides[char] / total_nucleotides)
sum_qses += count/float(total_kmers)
# Prefix is the next prefix sequence to find a suffix for
prefix = prefix[1:] + seq.seq[i]
# We take the prefixes count and new neighbors
kmer_id, count, neighbors = index.read_line(kdbrdr, kdbidx, kmer.kmer_to_id(prefix))
# Now we calculate qt similar to px, the frequency (count / total number of k-mers, read from the header)
qt = count/float(total_kmers)
# Now we calculate the transition probability of sequence s to sequence t
# as qt divided by the sum of each qsc
ast = qt / sum_qses
# For the final log odds ratio, we sum the logs of the transition probabilities
total += math.log(ast)
# For the final probability, we multiply the transition probabilities
product *= ast
if i >= len(seq.seq):
break
#raise ValueError("kmerdb.probability.markov_probability encountered a value of i={0} larger than the length of the sequence {1}".format(i, len(seq.seq)))
elif seq.seq[i] == "":
logger.warning("Reached the end of the sequence, i={0} is producing no letters".format(i))
break
else:
logger.debug("{0} : prefix: '{1}' + '{2}'".format(i, prefix[1:], seq.seq[i]))
prefix = prefix[1:] + seq.seq[i]

if len(prefix) != k:
break
else:
logger.debug("New k-mer: '{0}'".format(prefix))
logger.debug("Length of prefix: {0}".format(len(prefix)))
try:
kmerid = kmer.kmer_to_id(prefix)
except KeyError as e:
logger.debug(i)
logger.debug("this should be the new letter: '{0}'".format(seq.seq[i]))
logger.debug("this should be the existing prefix: '{0}'".format(prefix))
raise e
logger.debug("Iteration number {0}, calculating probabilities for k-mer id={1} seq={2}".format(i, kmerid, prefix))
logger.debug("Reading counts/neighbors from file according to k-mer id")
# We take the prefixes count and new neighbors
kmer_id, count, neighbors = index.read_line(kdbrdr, kdbidx, kmerid)

# Now we calculate qt similar to px, the frequency (count / total number of k-mers, read from the header)

qt = count/float(total_kmers)
# Now we calculate the transition probability of sequence s to sequence t
# as qt divided by the sum of each qsc
ast = qt / sum_qses
# For the final log odds ratio, we sum the logs of the transition probabilities
total += math.log10(ast)
total2 += math.log10(sum_fomm) # The ratio of the uniform frequency to the first order Markov background
# For the final probability, we multiply the transition probabilities
product *= ast
# Fianlly qxis represents our null model in a way.
# Suppose that the null model consisted of the same number of k-mer as observed,
# but uniformly distributed amongst all the k-mers
qxis = [total_kmers/N for x in range(k, len(seq.seq))]
qxis = [um for x in range(k, len(seq.seq))]
# We put this on the bottom of the Log-odds Ratio because the null model contrasts the transition probabilities
lor = log(px) + total / sum([math.log(qx) for qx in qxis])
null_model = log10(um) + total2
lor = math.log10(px) + total / null_model
#
pseq = px*product#/functools.reduce(lambda x,y: x*y, qxis)

Expand Down
Loading

0 comments on commit 090b91a

Please sign in to comment.