Merge pull request #15 from MatthewRalston/sparse_profile
Sparse profile
MatthewRalston committed Jan 8, 2021
2 parents ad8a358 + 0ea65f9 commit 8b5841a
Showing 10 changed files with 171 additions and 79 deletions.
70 changes: 29 additions & 41 deletions TODO.org
@@ -6,21 +6,16 @@ The final prototype would be .bgzf format from biopython



* TODO arguments.method == "Biopython" (has a FIXME in bin/kdb)
** Check this [[https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/][link]]
** Needs a euclidean distance matrix

* Need an Rmd and a report directory that autopopulates a report
**

* VERIFY Sparse .kdb
** modify slurp
** modify profile
* VERIFY Nearest neighbor profile
* TODO index class

* Priorities
** Connect this to meme suite
** Hypotheses:
*** Suppose that k-mer spectra have a positive and negative saturation direction.
*** In this way, more specific signals and antisignals could be surmised from samples with enough resolution, temporal or otherwise, resolved by covariates.
*** Think of what could happen if the signals and antisignals were resolved on the order of genes: you could detect gene expression levels with it.
* Rmd
* Rmd report1
** Distribution analysis
** Accurately describe kdb counting algorithm
*** although the algorithm's approach to fastq k-mer counting differs from its approach to fasta k-mer counting,
*** First, a selection of sequences is shredded into k-mers in memory (see the sketch below)
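A minimal sketch of the shredding step, for illustration only (the function name and signature are assumptions, not the actual kmerdb internals):
#+BEGIN_SRC python
def shred(seq, k):
    """Yield every overlapping k-mer of a sequence held in memory."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# list(shred("ACGTAC", 4)) -> ['ACGT', 'CGTA', 'GTAC']
#+END_SRC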
@@ -39,9 +34,19 @@ The final prototype would be .bgzf format from biopython
*** suggests another issue with the distribution of that sample.
*** State why we refuse to standardize the data at this point.

* [[https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA][PCA]] (scikit-learn)
** Uses the LAPACK implementation (see the sketch below)
**
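A minimal usage sketch (the matrix shape is illustrative; scikit-learn's PCA defers to LAPACK-based SVD internally):
#+BEGIN_SRC python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.poisson(5.0, size=(10, 4**6))  # stand-in for 10 k-mer profiles at k=6
pca = PCA(n_components=2)
coords = pca.fit_transform(X)                # (10, 2) projection of the profiles
print(pca.explained_variance_ratio_)
#+END_SRC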
* Rmd report2
** algorithm profiling
** kdb profile k x time x cpu (z)
**
* profile reads sam/bam
** use pysam to iterate over reads, creating a profile in the process.
* kmerize
** use bed/gff features to select reads from bam/bai using pysam
** then create sparse profiles for each feature (see the sketch below)
** split a bam according to gff/bed features, writing the pieces to an output directory
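A hedged sketch of the per-feature sparse profile idea (the feature coordinates, k, and helper name are assumptions; the pysam calls are standard and require an indexed bam):
#+BEGIN_SRC python
from collections import Counter
import pysam

def feature_profile(bam_path, contig, start, end, k):
    """Sparse k-mer profile of reads overlapping one bed/gff feature."""
    profile = Counter()  # only observed k-mers are stored, hence "sparse"
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, end):
            seq = read.query_sequence
            if seq is None:
                continue
            for i in range(len(seq) - k + 1):
                profile[seq[i:i + k]] += 1
    return profile
#+END_SRC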
* AWS Neptune / rdflib / Berkeley DB / MongoDB support
** Learn the RDF spec
** Think of a specification for each node.
* Manifold learning
** Isomap (derived from multidimensional scaling (MDS) or Kernel PCA)
*** Lower-dimensional projection of the data preserving geodesic distances between all points
@@ -53,20 +58,7 @@ The final prototype would be .bgzf format from biopython
*** t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples.
*** This ability to group samples based on local structure might help visually disentangle a dataset that comprises several manifolds at once (see the sketch below).
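A minimal sketch of both embeddings (the input shape is an assumption: one row per sample, 4^k columns):
#+BEGIN_SRC python
import numpy as np
from sklearn.manifold import Isomap, TSNE

X = np.random.poisson(5.0, size=(20, 4**6))  # stand-in profile matrix
iso = Isomap(n_components=2).fit_transform(X)                 # preserves geodesic distances
tsne = TSNE(n_components=2, perplexity=5.0).fit_transform(X)  # emphasizes local structure
#+END_SRC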

* Comment code
** TODO fileutil
** profile
** DONE index
CLOSED: [2020-02-11 Tue 17:14]



* Bugs
** 0-length blocks
*** 307 0-length blocks out of 607 in Bsubtilis.kdb
*** 0-length block creation is too frequent, KDBWriter
*** Remember to change logger.warning in kdb.index to raise Exception
** Handle reads/fasta with 'N' (ambiguous base)
* TODO Comment code
* index class
** need b-tree library
*** https://pythonhosted.org/BTrees/
@@ -76,10 +68,8 @@ The final prototype would be .bgzf format from biopython
*** The following searches for all values greater-than(min) or less-than(max), flattening
*** list(itertools.chain.from_iterable(btree.values(min=int/float)))
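A small sketch of that count index (the counts list is a stand-in; BTrees is the library linked above):
#+BEGIN_SRC python
import itertools
from BTrees.OOBTree import OOBTree

counts = [4, 0, 9, 4, 7]          # k-mer counts, indexed by k-mer id
btree = OOBTree()                 # maps count -> list of k-mer ids with that count
for kmer_id, count in enumerate(counts):
    if count in btree:
        btree[count].append(kmer_id)
    else:
        btree[count] = [kmer_id]

# All k-mer ids with count >= 4, flattened as in the note above:
hits = list(itertools.chain.from_iterable(btree.values(min=4)))  # [0, 3, 4, 2]
#+END_SRC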
* kdb annotator class (reworked into index class and better metadata specification)
*** DONE First, further specify kdb record shape
CLOSED: [2019-11-24 Sun 23:00]
*** DONE Second specify kdb metadata shape/parsing routines
CLOSED: [2019-11-24 Sun 23:00]
*** TODO First, further specify kdb record shape
*** TODO Second specify kdb metadata shape/types/parsing routines
*** Annotate bools, floats (probability), tags, ints (connectivity/degree)
**** Eulerian as a tag or a bool?
*** Index should be designed to rapidly filter tags, rapidly search/filter/narrow on ints
@@ -89,14 +79,12 @@ The final prototype would be .bgzf format from biopython
*** sort k-mers by counts (in memory, not on file), then create b-tree, leafs are k-mer file indices (above)
** tag : hash index
** float, int indices : similar to count index above
* Documentation
** README
** Blog post + review of kmer tools
** Make note of R1/R2 reverse complement needs in strand-specific sequencing
*** maybe a nargs=* optional list for R2? FR strand specificity
*** This wouldn't accommodate RF library preps
*** Either way, add a note that trimming and reverse complementing the reads should be done beforehand (see the sketch below)
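A minimal sketch of the reverse complement step referenced above (plain string translation; 'N' is preserved):
#+BEGIN_SRC python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement an R2 read before k-mer counting (FR libraries)."""
    return seq.translate(COMPLEMENT)[::-1]

# revcomp("ACGTN") -> 'NACGT'
#+END_SRC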

* Priorities
** Connect this to meme suite
** Hypotheses:
*** Suppose that k-mer spectra have a positive and negative saturation direction.
*** In this way, more specific signals and antisignals could be surmised from samples with enough resolution, temporal or otherwise, resolved by covariates.
*** Think of what could happen if the signals and antisignals were resolved on the order of genes: you could detect gene expression levels with it.
* Operations
** DONE Get all neighbors
CLOSED: [2019-11-12 Tue 14:41]
9 changes: 7 additions & 2 deletions examples/example_report/index.Rmd
@@ -122,13 +122,18 @@ In this report, we will look at the shape of the data as it is normalized by sample

# Methodology

## kdb and on-disk k-mer counting
## kmerdb and on-disk k-mer counting

kmerdb is a Python module distributed through the Python Package Index (PyPI) and installed with the command-line tool 'pip'. Installation is simple and should be painless for most users. The primary feature of kmerdb is a series of command-line actions called "sub-commands." These sub-commands perform specific functions on kdb database files or on matrices derived from them. The first step of any analysis done with kdb is the k-mer counting performed while generating a k-mer "profile." During this step, fastq or fasta files are assessed


## Distribution fitting on normalized and unnormalized k-mer counts

The negative binomial model is a canonical discrete model often used in the RNA-Seq and k-mer literature to model count data like those obtained through second-generation sequencing experiments (@love2014moderated, @anders2010differential, @daley2013predicting). My null model is based on the use of this model in these specific ways to model count data without standardizing the raw data. I am wary of immediately looking at a standardized dimension of this data, and I would prefer for hypothesis testing to be done in the negative-binomial space, whatever methods must be used or devised for such analysis. Since these do represent ecological counts, albeit at a sub-cellular level, they may benefit long-term from analyses that consider both methods on standardized data and on unstandardized counts, which we presume for now could follow a Poissonian model.

So my first null hypothesis must be formalized as follows: "The k-mer count data do not follow a negative-binomial distribution and perhaps fit a more standard count distribution such as the Poissonian." If we make the hypothesis this aggressive, we could completely address both sides of the hypothesis by covering Poissonian hypothesis testing and distribution fitting through Cullen-Frey analysis. The method for hypothesis testing in a Poissonian framework in R is the ppois function `ppois(x, lambda=l)`, and the y-values associated with the pdf, fully specified by lambda, can be graphed with `dpois(x, lambda=l)`. We also try transforming the Poissonian by setting lambda to the median, the mode, the geometric average, and the arithmetic average.
So my first null hypothesis must be formalized as follows: "The k-mer count data do not follow a negative-binomial distribution and perhaps fit a more standard count distribution such as the Poissonian." If we make the hypothesis this aggressive, we could completely address both sides of the hypothesis by covering Poissonian hypothesis testing and distribution fitting through Cullen-Frey analysis. The method for hypothesis testing in a Poissonian framework in R is the ppois function `ppois(x, lambda=l)`, and the y-values associated with the pdf, fully specified by lambda, can be graphed with `dpois(x, lambda=l)`. We also try transforming the Poissonian by setting lambda to the median, the mode, the geometric average, and the arithmetic average. Although the Cullen-Frey analysis may be a fairly accepted method, we can still try to approximate ideal NB behavior by transforming the Poissonian's lambda in this way.

My second hypothesis was that the normalization described in the next section would have no effect on the distribution selected for modeling. I will address this hypothesis by looking at the efficacy of normalization on the quartiles of each sample's k-mer counts, and then by checking whether the Cullen-Frey analysis produces the same suggestion; if it does, normalization does not affect the choice of model.

Once the Cullen-Frey analysis had originally led us to prefer the NB model for its purity, we adopted the DESeq2 method for count normalization.
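For readers working outside R, a rough Python analogue of the `ppois`/`dpois` checks above (the simulated counts and lambda choices are purely illustrative):

```python
import numpy as np
from scipy import stats

counts = np.random.poisson(8.0, size=10000)  # stand-in for observed k-mer counts

# Candidate lambdas mirroring the transformations tried above
lam_mean = counts.mean()              # arithmetic average
lam_median = np.median(counts)        # median
lam_geo = stats.gmean(counts + 1)     # geometric average, shifted to avoid zeros

# Analogue of ppois(x, lambda=l): cumulative probability under Poisson(lam)
p = stats.poisson.cdf(counts, mu=lam_mean)
# Analogue of dpois(x, lambda=l): pmf values for plotting the fitted curve
d = stats.poisson.pmf(counts, mu=lam_mean)
```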

82 changes: 59 additions & 23 deletions kmerdb/__init__.py
@@ -36,7 +36,7 @@


def index_file(arguments):
from kdb import fileutil, index
from kmerdb import fileutil, index
with fileutil.open(arguments.kdb, mode='r') as kdb:
header = kdb.header
line_index = index.build_line_index_from_kdb(arguments.kdb, header['k'])
Expand All @@ -48,27 +48,40 @@ def distances(arguments):
from scipy.spatial.distance import pdist, squareform


from kdb import fileutil, distance, config
from kmerdb import fileutil, distance, config
n = len(arguments.kdb)


if len(arguments.kdb) > 1:
files = list(map(lambda f: fileutil.open(f, 'r'), arguments.kdb))
logger.debug("Files: {0}".format(files))
if arguments.k is None:
arguments.k = files[0].k
if not all(kdbrdr.k == arguments.k for kdbrdr in files):
logger.error("Files: {0}".format(files))
logger.error("Choices of k: {0}".format([kdbrdr.k for kdbrdr in files]))
logger.error("By default the inferred value of k is used when k is not specified at the command line, which was {0}".format(arguments.k))
raise TypeError("One or more files did not have k set to be equal to {0}".format(arguments.k))
profiles = np.array(list(map(lambda kdbrdr: kdbrdr.slurp(), files)), dtype="int32")
logger.debug("Files: {0}".format(files))
data = [kdbrdr.slurp() for kdbrdr in files]
logger.debug(data)
profiles = np.transpose(np.array(data, dtype="int32"))
# The following does *not* transpose a matrix defined as n x N=4**k
# n is the number of independent files/samples, (fasta/fastq=>kdb) being assessed
# N=4**k is the dimensionality of the vector, sparse or not, that makes up the perceived profile; the descriptors form the column dimension.
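# Worked example of the intended shapes (numbers are illustrative only):
# with n=3 files and k=8, np.array(data) has shape (3, 65536) and the
# np.transpose(...) above yields (65536, 3): one row per k-mer, one column per sample.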

# The names for the columns are taken as the basenamed filepath, with all extensions (substrings beginning with '.') stripped.
logger.info("Converting arrays of k-mer counts into a pandas DataFrame...")

column_names = list(map(lambda kdbrdr: os.path.basename(kdbrdr._filepath).split(".")[0], files))
if len(column_names) != len(files):
raise RuntimeError("Number of column names {0} does not match number of input files {1}...".format(len(column_names), len(files)))
logger.debug("Shape: {0}".format(profiles.shape))
expected = (4**arguments.k, len(column_names))
if profiles.shape != expected:
logger.error("Expected shape: {0}".format(expected))
logger.error("Actual shape: {0}".format(profiles.shape))
raise RuntimeError("Raw profile shape (a Numpy array) doesn't match expected dimensions")

df = pd.DataFrame(profiles, columns=column_names)
elif len(arguments.kdb) == 1 and (arguments.kdb[0] == "/dev/stdin" or arguments.kdb[0] == "STDIN"):
@@ -123,7 +136,7 @@ def distances(arguments):
elif arguments.metric == "euclidean":
data[i][j] = distance.euclidean(arguments.kdb[i], arguments.kdb[j])
elif arguments.metric == "spearman":
cor, pval = distance.spearman(profiles[:, i], profiles[:, j])
cor, pval = distance.spearman(profiles[i], profiles[j])
# FIXME! also print pval matrices
data[i][j] = cor
else:
@@ -154,7 +167,7 @@ def get_matrix(arguments):
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from kdb import fileutil, config
from kmerdb import fileutil, config


if len(arguments.kdb) > 1:
@@ -338,7 +351,7 @@ def kmeans(arguments):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from kdb import config
from kmerdb import config


if isinstance(arguments.input, argparse.FileType) or isinstance(arguments.input, TextIOWrapper):
@@ -483,17 +496,17 @@ def hierarchical(arguments):

from Bio.Cluster import treecluster

from kdb import config
from kmerdb import config



if isinstance(arguments.input, argparse.FileType) or isinstance(arguments.input, TextIOWrapper):
df = pd.read_csv(arguments.input, sep=arguments.delimiter)
column_names = df.columns
column_names = list(df.columns)
df = df.transpose()
elif arguments.input is None or arguments.input == "STDIN" or arguments.input == "/dev/stdin":
df = pd.read_csv(sys.stdin, sep=arguments.delimiter)
column_names = df.columns
column_names = list(df.columns)
df = df.transpose()
else:
logger.error("An unknown IO type was detected in bin/kdb.cluster()")
@@ -514,7 +527,27 @@ def hierarchical(arguments):
#plt.show()
plt.savefig(config.hierarchical_clustering_dendrogram_fig_filepath)

sys.stderr.write("Saving the dendrogram to '{0}'...".format(config.hierarchical_clustering_dendrogram_fig_filepath))
sys.stderr.write("Saving the matplotlib dendrogram to '{0}'...".format(config.hierarchical_clustering_dendrogram_fig_filepath))

data = np.array(df).tolist()

n = len(data)
m = len(data[0])
for i in range(n):
for j in range(m):
if i < j:
data[i].pop(-1)
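# The pops above trim row i down to its first i+1 entries, i.e. the
# lower-triangular form (diagonal included) that Bio.Phylo's
# DistanceMatrix constructor expects.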
#print(data)

from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor
from Bio import Phylo
dm = DistanceMatrix(column_names, data)
constructor = DistanceTreeConstructor()
upgmatree = constructor.upgma(dm)
print(upgmatree)
Phylo.draw(upgmatree, branch_labels=lambda c: c.branch_length)
Phylo.write(upgmatree, config.spearman_upgma_tree_phy, "phyloxml")
sys.stderr.write("Saving the phylip format tree to '{0}'".format(config.spearman_upgma_tree_phy))
sys.stderr.write(config.DONE)

# def rarefy(arguments):
@@ -565,12 +598,10 @@ def hierarchical(arguments):


def header(arguments):
from kdb import fileutil, config
from kdb.config import VERSION
from kdb import util
from kmerdb import fileutil, config, util

with fileutil.open(arguments.kdb, mode='r') as kdb:
if kdb.header["version"] != VERSION:
if kdb.header["version"] != config.VERSION:
logger.warning("KDB version is out of date, may be incompatible with current KDBReader class")
if arguments.json:
print(dict(kdb.header))
@@ -580,9 +611,9 @@


def view(arguments):
from kdb import fileutil
from kmerdb import fileutil

from kdb.config import VERSION
from kmerdb.config import VERSION

with fileutil.open(arguments.kdb, mode='r') as kdb:
if kdb.header["version"] != VERSION:
@@ -594,9 +625,9 @@

def profile(arguments):
import math
from kdb import parse, fileutil

from kdb.config import VERSION
from kmerdb import parse, fileutil, kmer
import json
from kmerdb.config import VERSION

metadata = []
tempdbs = []
@@ -624,17 +655,21 @@ def profile(arguments):
kdb_out = fileutil.open(arguments.kdb, 'wb', header)
iterating = True
while iterating:
# Each row from the per-file temporary SQLite databases is a (row id, count) pair
try:

kmer_counts_per_file = list(map(next, tempdbs)) # T
if len(kmer_counts_per_file):
i = kmer_counts_per_file[0][0] - 1 # Subtract 1 to convert SQLite's 1-based row ids to 0-based k-mer ids
count = sum([x[1] for x in kmer_counts_per_file]) # The 1st element of each row is the k-mer count
#sys.stderr.write("\r")

if arguments.verbose == 2:
sys.stderr.write("K-mer counts: {0} = {1}\n".format(list(map(lambda x: x[1], kmer_counts_per_file)), count))
kdb_out.write("{0}\t{1}\n".format(i, count))
if count == 0 and arguments.sparse is True:
pass
else:
seq = kmer.id_to_kmer(i, arguments.k)
neighbors = kmer.neighbors(seq, arguments.k)
kdb_out.write("{0}\t{1}\t{2}\n".format(i, count, json.dumps(neighbors)))
else:
iterating = False
except StopIteration as e:
@@ -720,6 +755,7 @@ def cli():
profile_parser.add_argument("--keep-sqlite", action="store_true", help=argparse.SUPPRESS)
profile_parser.add_argument("--not-strand-specific", action="store_false", default=True, help="Retain k-mers from the forward strand of the fast(a|q) file only")
#profile_parser.add_argument("--no-metadata", dest="metadata", action="store_false", default=True, help="Include k-mer metadata in the .kdb")
profile_parser.add_argument("--sparse", action="store_true", default=False, help="Whether or not to store the profile as sparse")
profile_parser.add_argument("-k", default=12, type=int, help="Choose k-mer size (Default: 12)")
profile_parser.add_argument("seqfile", nargs="+", type=str, metavar="<.fasta|.fastq>", help="Fasta or fastq files")
profile_parser.add_argument("kdb", type=str, help="Kdb file")
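The hunk above calls `kmer.id_to_kmer(i, arguments.k)` without showing its body. As a point of reference only, a common base-4 decoding looks like the sketch below; this is an assumption about the usual scheme, not kmerdb's actual implementation:

```python
ALPHABET = "ACGT"  # one base per base-4 digit; the ordering is an assumption

def id_to_kmer(kmer_id, k):
    """Decode an integer in [0, 4**k) into a k-mer string."""
    bases = []
    for _ in range(k):
        bases.append(ALPHABET[kmer_id % 4])
        kmer_id //= 4
    return "".join(reversed(bases))

# id_to_kmer(0, 3) -> 'AAA'; id_to_kmer(4**3 - 1, 3) -> 'TTT'
```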
2 changes: 1 addition & 1 deletion kmerdb/__main__.py
@@ -16,7 +16,7 @@
limitations under the License.
'''

import sys
logger = None


3 changes: 2 additions & 1 deletion kmerdb/config.py
@@ -17,7 +17,7 @@



VERSION="0.0.4"
VERSION="0.0.5"

header_schema = {
"type": "object",
@@ -103,6 +103,7 @@
kmeans_clustering_fig_filepath = "kmeans_clustering_of_kmer_profiles.png"
ecopy_rarefaction_fig_filepath = "ecopy_rarefaction_curve.png"
hierarchical_clustering_dendrogram_fig_filepath = "dendrogram.png"
spearman_upgma_tree_phy = "kdb_spearman_upgma_tree.phyloxml"
files = (pca_variance_fig_filepath, kmeans_elbow_graph_fig_filepath, kmeans_clustering_fig_filepath, ecopy_rarefaction_fig_filepath, hierarchical_clustering_dendrogram_fig_filepath)

DEFAULT_MASTHEAD = """
4 changes: 2 additions & 2 deletions kmerdb/distance.py
@@ -105,9 +105,9 @@ def euclidean(fname1, fname2):


def spearman(x, y):
if not isinstance(x, np.array):
if type(x) is not np.ndarray:
raise TypeError("kmerdb.distance.spearman expects a Numpy array as its first positional argument")
elif not isinstance(y, np.array):
elif type(y) is not np.ndarray:
raise TypeError("kmerdb.distance.spearman expects a Numpy array as its second positional argument")
from scipy.stats import spearmanr
cor, pval = spearmanr(x, b=y)
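A quick usage sketch of the patched `spearman` helper's core call (the arrays are illustrative; `scipy.stats.spearmanr` returns the correlation and its p-value):

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([4, 0, 9, 4, 7])  # k-mer counts, sample 1
y = np.array([5, 1, 8, 3, 6])  # k-mer counts, sample 2

cor, pval = spearmanr(x, b=y)  # same call as in kmerdb.distance.spearman
print(cor, pval)
```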