Merge pull request #15 from MatthewRalston/sparse_profile
Sparse profile
MatthewRalston committed Jan 8, 2021
2 parents ad8a358 + 0ea65f9 commit 8b5841a
Showing 10 changed files with 171 additions and 79 deletions.
70 changes: 29 additions & 41 deletions TODO.org
@@ -6,21 +6,16 @@ The final prototype would be .bgzf format from biopython



* TODO arguments.method == "Biopython" (has a FIXME in bin/kdb)
** Check this [[https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/][link]]
** Needs a euclidean distance matrix

* Need an Rmd and a report directory that autopopulates a report
**

* VERIFY Sparse .kdb
** modify slurp
** modify profile
* VERIFY Nearest neighbor profile
* TODO index class

* Priorities
** Connect this to meme suite
** Hypotheses:
*** Suppose that k-mer spectra have a positive and negative saturation direction.
*** In this way, more specific signals and antisignals could be surmised from samples with enough resolution, temporal or otherwise, resolved by covariates.
*** Think of what could happen if the signals and antisignals were resolved on the order of genes: you could detect gene expression levels with it.
* Rmd
* Rmd report1
** Distribution analysis
** Accurately describe kdb counting algorithm
*** although the algorithm's approach to fastq k-mer counting differs from its approach to fasta k-mer counting,
*** First, a selection of sequences is shredded into k-mers in memory (see the sketch below)
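A minimal sketch of the shredding step, for illustration only (the function name and signature are assumptions, not the actual kmerdb internals):
#+BEGIN_SRC python
def shred(seq, k):
    """Yield every overlapping k-mer of a sequence held in memory."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# list(shred("ACGTAC", 4)) -> ['ACGT', 'CGTA', 'GTAC']
#+END_SRC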
@@ -39,9 +34,19 @@ The final prototype would be .bgzf format from biopython
*** suggests another issue with the distribution of that sample.
*** State why we refuse to standardize the data at this point.

* [[https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA][PCA]] (scikit-learn)
** Uses the LAPACK implementation (see the sketch below)
**
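A minimal usage sketch (the matrix shape is illustrative; scikit-learn's PCA defers to LAPACK-based SVD internally):
#+BEGIN_SRC python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.poisson(5.0, size=(10, 4**6))  # stand-in for 10 k-mer profiles at k=6
pca = PCA(n_components=2)
coords = pca.fit_transform(X)                # (10, 2) projection of the profiles
print(pca.explained_variance_ratio_)
#+END_SRC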
* Rmd report2
** algorithm profiling
** kdb profile k x time x cpu (z)
**
* profile reads sam/bam
** use pysam to iterate over reads, creating a profile in the process.
* kmerize
** use bed/gff features to select reads from bam/bai using pysam
** then create sparse profiles for each feature (see the sketch below)
** split a bam according to gff/bed features, writing the pieces to an output directory
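A hedged sketch of the per-feature sparse profile idea (the feature coordinates, k, and helper name are assumptions; the pysam calls are standard and require an indexed bam):
#+BEGIN_SRC python
from collections import Counter
import pysam

def feature_profile(bam_path, contig, start, end, k):
    """Sparse k-mer profile of reads overlapping one bed/gff feature."""
    profile = Counter()  # only observed k-mers are stored, hence "sparse"
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, end):
            seq = read.query_sequence
            if seq is None:
                continue
            for i in range(len(seq) - k + 1):
                profile[seq[i:i + k]] += 1
    return profile
#+END_SRC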
* AWS Neptune / rdflib / Berkeley DB / MongoDB support
** Learn the RDF spec
** Think of a specification for each node.
* Manifold learning
** Isomap (derived from multidimensional scaling (MDS) or Kernel PCA)
*** Lower-dimensional projection of the data preserving geodesic distances between all points
@@ -53,20 +58,7 @@ The final prototype would be .bgzf format from biopython
*** t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples.
*** This ability to group samples based on local structure might help visually disentangle a dataset that comprises several manifolds at once (see the sketch below).
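A minimal sketch of both embeddings (the input shape is an assumption: one row per sample, 4^k columns):
#+BEGIN_SRC python
import numpy as np
from sklearn.manifold import Isomap, TSNE

X = np.random.poisson(5.0, size=(20, 4**6))  # stand-in profile matrix
iso = Isomap(n_components=2).fit_transform(X)                 # preserves geodesic distances
tsne = TSNE(n_components=2, perplexity=5.0).fit_transform(X)  # emphasizes local structure
#+END_SRC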

* Comment code
** TODO fileutil
** profile
** DONE index
CLOSED: [2020-02-11 Tue 17:14]



* Bugs
** 0-length blocks
*** 307 0-length blocks out of 607 in Bsubtilis.kdb
*** 0-length block creation is too frequent, KDBWriter
*** Remember to change logger.warning in kdb.index to raise Exception
** Handle reads/fasta with 'N' (ambiguous base)
* TODO Comment code
* index class
** need b-tree library
*** https://pythonhosted.org/BTrees/
@@ -76,10 +68,8 @@ The final prototype would be .bgzf format from biopython
*** The following searches for all values greater-than(min) or less-than(max), flattening
*** list(itertools.chain.from_iterable(btree.values(min=int/float)))
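A small sketch of that count index (the counts list is a stand-in; BTrees is the library linked above):
#+BEGIN_SRC python
import itertools
from BTrees.OOBTree import OOBTree

counts = [4, 0, 9, 4, 7]          # k-mer counts, indexed by k-mer id
btree = OOBTree()                 # maps count -> list of k-mer ids with that count
for kmer_id, count in enumerate(counts):
    if count in btree:
        btree[count].append(kmer_id)
    else:
        btree[count] = [kmer_id]

# All k-mer ids with count >= 4, flattened as in the note above:
hits = list(itertools.chain.from_iterable(btree.values(min=4)))  # [0, 3, 4, 2]
#+END_SRC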
* kdb annotator class (reworked into index class and better metadata specification)
*** DONE First, further specify kdb record shape
CLOSED: [2019-11-24 Sun 23:00]
*** DONE Second specify kdb metadata shape/parsing routines
CLOSED: [2019-11-24 Sun 23:00]
*** TODO First, further specify kdb record shape
*** TODO Second specify kdb metadata shape/types/parsing routines
*** Annotate bools, floats (probability), tags, ints (connectivity/degree)
**** Eulerian as a tag or a bool?
*** Index should be designed to rapidly filter tags, rapidly search/filter/narrow on ints
@@ -89,14 +79,12 @@ The final prototype would be .bgzf format from biopython
*** sort k-mers by counts (in memory, not on file), then create b-tree, leafs are k-mer file indices (above)
** tag : hash index
** float, int indices : similar to count index above
* Documentation
** README
** Blog post + review of kmer tools
** Make note of R1/R2 reverse complement needs in strand-specific sequencing
*** maybe a nargs=* optional list for R2? FR strand specificity
*** This wouldn't accommodate RF library preps
*** Either way, add a note that trimming and reverse complementing the reads should be done beforehand (see the sketch below)
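A minimal sketch of the reverse complement step referenced above (plain string translation; 'N' is preserved):
#+BEGIN_SRC python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement an R2 read before k-mer counting (FR libraries)."""
    return seq.translate(COMPLEMENT)[::-1]

# revcomp("ACGTN") -> 'NACGT'
#+END_SRC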

* Priorities
** Connect this to meme suite
** Hypotheses:
*** Suppose that k-mer spectra have a positive and negative saturation direction.
*** In this way, more specific signals and antisignals could be surmised from samples with enough resolution, temporal or otherwise, resolved by covariates.
*** Think of what could happen if the signals and antisignals were resolved on the order of genes: you could detect gene expression levels with it.
* Operations
** DONE Get all neighbors
CLOSED: [2019-11-12 Tue 14:41]
9 changes: 7 additions & 2 deletions examples/example_report/index.Rmd
@@ -122,13 +122,18 @@ In this report, we will look at the shape of the data as it is normalized by sample

# Methodology

## kdb and on-disk k-mer counting
## kmerdb and on-disk k-mer counting

kmerdb is a Python module distributed through the Python Package Index (PyPI) and installed with the command-line tool 'pip'. Installation is simple and should be painless for most users. The primary feature of kmerdb is a series of command-line actions called "sub-commands." These sub-commands perform specific functions on kdb database files or on matrices derived from them. The first step of any analysis done with kdb is the k-mer counting performed while generating a k-mer "profile." During this step, fastq or fasta files are assessed


## Distribution fitting on normalized and unnormalized k-mer counts

The negative binomial model is a canonical discrete model often used in the RNA-Seq and k-mer literature to model count data like those obtained through second-generation sequencing experiments (@love2014moderated, @anders2010differential, @daley2013predicting). My null model is based on the use of this model in these specific ways to model count data without standardizing the raw data. I am wary of immediately looking at a standardized dimension of this data, and I would prefer for hypothesis testing to be done in the negative-binomial space, whatever methods must be used or devised for such analysis. Since these do represent ecological counts, albeit at a sub-cellular level, they may benefit long-term from analyses that consider both methods on standardized data and on unstandardized counts, which we presume for now could follow a Poissonian model.

So my first null hypothesis must be formalized as follows: "The k-mer count data do not follow a negative-binomial distribution and perhaps fit a more standard count distribution such as the Poissonian." If we make the hypothesis this aggressive, we could completely address both sides of the hypothesis by covering Poissonian hypothesis testing and distribution fitting through Cullen-Frey analysis. The method for hypothesis testing in a Poissonian framework in R is the ppois function `ppois(x, lambda=l)`, and the y-values associated with the pdf, fully specified by lambda, can be graphed with `dpois(x, lambda=l)`. We also try transforming the Poissonian by setting lambda to the median, the mode, the geometric average, and the arithmetic average.
So my first null hypothesis must be formalized as follows: "The k-mer count data do not follow a negative-binomial distribution and perhaps fit a more standard count distribution such as the Poissonian." If we make the hypothesis this aggressive, we could completely address both sides of the hypothesis by covering Poissonian hypothesis testing and distribution fitting through Cullen-Frey analysis. The method for hypothesis testing in a Poissonian framework in R is the ppois function `ppois(x, lambda=l)`, and the y-values associated with the pdf, fully specified by lambda, can be graphed with `dpois(x, lambda=l)`. We also try transforming the Poissonian by setting lambda to the median, the mode, the geometric average, and the arithmetic average. Although the Cullen-Frey analysis may be a fairly accepted method, we can still try to approximate ideal NB behavior by transforming the Poissonian's lambda in this way.

My second hypothesis was that the normalization described in the next section would have no effect on the distribution selected for modeling. I will address this hypothesis by looking at the efficacy of normalization on the quartiles of each sample's k-mer counts, and then by checking whether the Cullen-Frey analysis produces the same suggestion; if it does, normalization does not affect the choice of model.

Once the Cullen-Frey analysis had originally led us to prefer the NB model for its purity, we adopted the DESeq2 method for count normalization.
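For readers working outside R, a rough Python analogue of the `ppois`/`dpois` checks above (the simulated counts and lambda choices are purely illustrative):

```python
import numpy as np
from scipy import stats

counts = np.random.poisson(8.0, size=10000)  # stand-in for observed k-mer counts

# Candidate lambdas mirroring the transformations tried above
lam_mean = counts.mean()              # arithmetic average
lam_median = np.median(counts)        # median
lam_geo = stats.gmean(counts + 1)     # geometric average, shifted to avoid zeros

# Analogue of ppois(x, lambda=l): cumulative probability under Poisson(lam)
p = stats.poisson.cdf(counts, mu=lam_mean)
# Analogue of dpois(x, lambda=l): pmf values for plotting the fitted curve
d = stats.poisson.pmf(counts, mu=lam_mean)
```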

82 changes: 59 additions & 23 deletions kmerdb/__init__.py
@@ -36,7 +36,7 @@


def index_file(arguments):
from kdb import fileutil, index
from kmerdb import fileutil, index
with fileutil.open(arguments.kdb, mode='r') as kdb:
header = kdb.header
line_index = index.build_line_index_from_kdb(arguments.kdb, header['k'])
Expand All @@ -48,27 +48,40 @@ def distances(arguments):
from scipy.spatial.distance import pdist, squareform


from kdb import fileutil, distance, config
from kmerdb import fileutil, distance, config
n = len(arguments.kdb)


if len(arguments.kdb) > 1:
files = list(map(lambda f: fileutil.open(f, 'r'), arguments.kdb))
logger.debug("Files: {0}".format(files))
if arguments.k is None:
arguments.k = files[0].k
if not all(kdbrdr.k == arguments.k for kdbrdr in files):
logger.error("Files: {0}".format(files))
logger.error("Choices of k: {0}".format([kdbrdr.k for kdbrdr in files]))
logger.error("By default the inferred value of k is used when k is not specified at the command line, which was {0}".format(arguments.k))
raise TypeError("One or more files did not have k set to be equal to {0}".format(arguments.k))
profiles = np.array(list(map(lambda kdbrdr: kdbrdr.slurp(), files)), dtype="int32")
logger.debug("Files: {0}".format(files))
data = [kdbrdr.slurp() for kdbrdr in files]
logger.debug(data)
profiles = np.transpose(np.array(data, dtype="int32"))
# The following does *not* transpose a matrix defined as n x N=4**k
# n is the number of independent files/samples, (fasta/fastq=>kdb) being assessed
# N=4**k is the dimensionality of the vector, sparse or not, that makes up the perceived profile; the descriptors form the column dimension.
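# Worked example of the intended shapes (numbers are illustrative only):
# with n=3 files and k=8, np.array(data) has shape (3, 65536) and the
# np.transpose(...) above yields (65536, 3): one row per k-mer, one column per sample.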

# The names for the columns are taken as the basenamed filepath, with all extensions (substrings beginning with '.') stripped.
logger.info("Converting arrays of k-mer counts into a pandas DataFrame...")

column_names = list(map(lambda kdbrdr: os.path.basename(kdbrdr._filepath).split(".")[0], files))
if len(column_names) != len(files):
raise RuntimeError("Number of column names {0} does not match number of input files {1}...".format(len(column_names), len(files)))
logger.debug("Shape: {0}".format(profiles.shape))
expected = (4**arguments.k, len(column_names))
if profiles.shape != expected:
logger.error("Expected shape: {0}".format(expected))
logger.error("Actual shape: {0}".format(profiles.shape))
raise RuntimeError("Raw profile shape (a Numpy array) doesn't match expected dimensions")

df = pd.DataFrame(profiles, columns=column_names)
elif len(arguments.kdb) == 1 and (arguments.kdb[0] == "/dev/stdin" or arguments.kdb[0] == "STDIN"):
@@ -123,7 +136,7 @@ def distances(arguments):
elif arguments.metric == "euclidean":
data[i][j] = distance.euclidean(arguments.kdb[i], arguments.kdb[j])
elif arguments.metric == "spearman":
cor, pval = distance.spearman(profiles[:, i], profiles[:, j])
cor, pval = distance.spearman(profiles[i], profiles[j])
# FIXME! also print pval matrices
data[i][j] = cor
else:
@@ -154,7 +167,7 @@ def get_matrix(arguments):
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from kdb import fileutil, config
from kmerdb import fileutil, config


if len(arguments.kdb) > 1:
@@ -338,7 +351,7 @@ def kmeans(arguments):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from kdb import config
from kmerdb import config


if isinstance(arguments.input, argparse.FileType) or isinstance(arguments.input, TextIOWrapper):
@@ -483,17 +496,17 @@ def hierarchical(arguments):

from Bio.Cluster import treecluster

from kdb import config
from kmerdb import config



if isinstance(arguments.input, argparse.FileType) or isinstance(arguments.input, TextIOWrapper):
df = pd.read_csv(arguments.input, sep=arguments.delimiter)
column_names = df.columns
column_names = list(df.columns)
df = df.transpose()
elif arguments.input is None or arguments.input == "STDIN" or arguments.input == "/dev/stdin":
df = pd.read_csv(sys.stdin, sep=arguments.delimiter)
column_names = df.columns
column_names = list(df.columns)
df = df.transpose()
else:
logger.error("An unknown IO type was detected in bin/kdb.cluster()")
@@ -514,7 +527,27 @@ def hierarchical(arguments):
#plt.show()
plt.savefig(config.hierarchical_clustering_dendrogram_fig_filepath)

sys.stderr.write("Saving the dendrogram to '{0}'...".format(config.hierarchical_clustering_dendrogram_fig_filepath))
sys.stderr.write("Saving the matplotlib dendrogram to '{0}'...".format(config.hierarchical_clustering_dendrogram_fig_filepath))

data = np.array(df).tolist()

n = len(data)
m = len(data[0])
for i in range(n):
for j in range(m):
if i < j:
data[i].pop(-1)
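# The pops above trim row i down to its first i+1 entries, i.e. the
# lower-triangular form (diagonal included) that Bio.Phylo's
# DistanceMatrix constructor expects.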
#print(data)

from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor
from Bio import Phylo
dm = DistanceMatrix(column_names, data)
constructor = DistanceTreeConstructor()
upgmatree = constructor.upgma(dm)
print(upgmatree)
Phylo.draw(upgmatree, branch_labels=lambda c: c.branch_length)
Phylo.write(upgmatree, config.spearman_upgma_tree_phy, "phyloxml")
sys.stderr.write("Saving the phylip format tree to '{0}'".format(config.spearman_upgma_tree_phy))
sys.stderr.write(config.DONE)

# def rarefy(arguments):
@@ -565,12 +598,10 @@ def hierarchical(arguments):


def header(arguments):
from kdb import fileutil, config
from kdb.config import VERSION
from kdb import util
from kmerdb import fileutil, config, util

with fileutil.open(arguments.kdb, mode='r') as kdb:
if kdb.header["version"] != VERSION:
if kdb.header["version"] != config.VERSION:
logger.warning("KDB version is out of date, may be incompatible with current KDBReader class")
if arguments.json:
print(dict(kdb.header))
@@ -580,9 +611,9 @@


def view(arguments):
from kdb import fileutil
from kmerdb import fileutil

from kdb.config import VERSION
from kmerdb.config import VERSION

with fileutil.open(arguments.kdb, mode='r') as kdb:
if kdb.header["version"] != VERSION:
@@ -594,9 +625,9 @@

def profile(arguments):
import math
from kdb import parse, fileutil

from kdb.config import VERSION
from kmerdb import parse, fileutil, kmer
import json
from kmerdb.config import VERSION

metadata = []
tempdbs = []
@@ -624,17 +655,21 @@ def profile(arguments):
kdb_out = fileutil.open(arguments.kdb, 'wb', header)
iterating = True
while iterating:
# Each row from the per-file temporary SQLite databases is a (row id, count) pair
try:

kmer_counts_per_file = list(map(next, tempdbs)) # T
if len(kmer_counts_per_file):
i = kmer_counts_per_file[0][0] - 1 # Subtract 1 to convert SQLite's 1-based row ids to 0-based k-mer ids
count = sum([x[1] for x in kmer_counts_per_file]) # The 1st element of each row is the k-mer count
#sys.stderr.write("\r")

if arguments.verbose == 2:
sys.stderr.write("K-mer counts: {0} = {1}\n".format(list(map(lambda x: x[1], kmer_counts_per_file)), count))
kdb_out.write("{0}\t{1}\n".format(i, count))
if count == 0 and arguments.sparse is True:
pass
else:
seq = kmer.id_to_kmer(i, arguments.k)
neighbors = kmer.neighbors(seq, arguments.k)
kdb_out.write("{0}\t{1}\t{2}\n".format(i, count, json.dumps(neighbors)))
else:
iterating = False
except StopIteration as e:
@@ -720,6 +755,7 @@ def cli():
profile_parser.add_argument("--keep-sqlite", action="store_true", help=argparse.SUPPRESS)
profile_parser.add_argument("--not-strand-specific", action="store_false", default=True, help="Retain k-mers from the forward strand of the fast(a|q) file only")
#profile_parser.add_argument("--no-metadata", dest="metadata", action="store_false", default=True, help="Include k-mer metadata in the .kdb")
profile_parser.add_argument("--sparse", action="store_true", default=False, help="Whether or not to store the profile as sparse")
profile_parser.add_argument("-k", default=12, type=int, help="Choose k-mer size (Default: 12)")
profile_parser.add_argument("seqfile", nargs="+", type=str, metavar="<.fasta|.fastq>", help="Fasta or fastq files")
profile_parser.add_argument("kdb", type=str, help="Kdb file")
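The hunk above calls `kmer.id_to_kmer(i, arguments.k)` without showing its body. As a point of reference only, a common base-4 decoding looks like the sketch below; this is an assumption about the usual scheme, not kmerdb's actual implementation:

```python
ALPHABET = "ACGT"  # one base per base-4 digit; the ordering is an assumption

def id_to_kmer(kmer_id, k):
    """Decode an integer in [0, 4**k) into a k-mer string."""
    bases = []
    for _ in range(k):
        bases.append(ALPHABET[kmer_id % 4])
        kmer_id //= 4
    return "".join(reversed(bases))

# id_to_kmer(0, 3) -> 'AAA'; id_to_kmer(4**3 - 1, 3) -> 'TTT'
```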
2 changes: 1 addition & 1 deletion kmerdb/__main__.py
@@ -16,7 +16,7 @@
limitations under the License.
'''

import sys
logger = None


3 changes: 2 additions & 1 deletion kmerdb/config.py
@@ -17,7 +17,7 @@



VERSION="0.0.4"
VERSION="0.0.5"

header_schema = {
"type": "object",
@@ -103,6 +103,7 @@
kmeans_clustering_fig_filepath = "kmeans_clustering_of_kmer_profiles.png"
ecopy_rarefaction_fig_filepath = "ecopy_rarefaction_curve.png"
hierarchical_clustering_dendrogram_fig_filepath = "dendrogram.png"
spearman_upgma_tree_phy = "kdb_spearman_upgma_tree.phyloxml"
files = (pca_variance_fig_filepath, kmeans_elbow_graph_fig_filepath, kmeans_clustering_fig_filepath, ecopy_rarefaction_fig_filepath, hierarchical_clustering_dendrogram_fig_filepath)

DEFAULT_MASTHEAD = """
4 changes: 2 additions & 2 deletions kmerdb/distance.py
@@ -105,9 +105,9 @@ def euclidean(fname1, fname2):


def spearman(x, y):
if not isinstance(x, np.array):
if type(x) is not np.ndarray:
raise TypeError("kmerdb.distance.spearman expects a Numpy array as its first positional argument")
elif not isinstance(y, np.array):
elif type(y) is not np.ndarray:
raise TypeError("kmerdb.distance.spearman expects a Numpy array as its second positional argument")
from scipy.stats import spearmanr
cor, pval = spearmanr(x, b=y)
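A quick usage sketch of the patched `spearman` helper's core call (the arrays are illustrative; `scipy.stats.spearmanr` returns the correlation and its p-value):

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([4, 0, 9, 4, 7])  # k-mer counts, sample 1
y = np.array([5, 1, 8, 3, 6])  # k-mer counts, sample 2

cor, pval = spearmanr(x, b=y)  # same call as in kmerdb.distance.spearman
print(cor, pval)
```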