
option to exclude k-mers with zero counts? #20

Open

jmcalabr opened this issue Nov 25, 2020 · 6 comments

Comments

@jmcalabr
Contributor

This is not really an issue, just a suggestion -- hope all is well!

We are finding new ways to use SEEKR by specifying custom alphabets. In these experiments, we are encountering words or k-mers that never occur, even in the entire mouse or human transcriptome.

Here (and probably in other use cases as well), it would be useful if seekr_norm_vectors and seekr_count_kmers could take an input flag that would cause them to exclude k-mers with "0" counts across all counted RNAs. Or even a flag that would allow exclusion of k-mers that do not exceed a threshold of variance across the counted sets of RNAs.

Thank you and take care,
Mauro

@Jessime
Contributor

Jessime commented Nov 26, 2020

Hey Mauro, I've got lots of thoughts about this one!

At the highest level, I think you're reaching the limit of what it makes sense to expose at the command line, as opposed to writing some simple Python scripts and having the flexibility that comes along with that.


> by specifying custom alphabets

This is neat! Did you know that I've already got support for custom alphabets? You can see on this line that "AGTC" is just the default. You could count protein kmers if you wanted, haha.

Inside of a Python script, this would just be, for example:

from seekr.kmer_counts import BasicCounter

counter = BasicCounter('gencode.fa', alphabet='AGTCWS')
counter.get_counts()

> input flag that would cause them to exclude k-mers with "0" counts across all counted RNAs.

The generalization of what you're asking here is, "Can we filter kmers?"

As I'm sure you remember, a big surprise for us was that we couldn't find a subset of kmers that gave us more predictive power than the full set of 6mers. Part of the reason it was so surprising was that I spent a ton of time trying. You've listed a few filters here:

  • max(count_column) == 0
  • max(count_column) < threshold
  • variance(count_column) < threshold

I can think of several dozen more that we tried over the years. A few were:

  • Picking at random
  • GC content of kmer
  • Covariance with other kmers
  • Mean of counts
  • More complicated machine learning attempts

I say all of this to say that I explicitly chose not to build something like seekr_filter_counts: there are so many options, and it wasn't clear which of them were worth the effort to build into the official package.

On the other hand, a single filter is pretty easy from within Python. If you've run seekr_count_kmers, you could do:

import pandas as pd

def has_vals(col):
    return max(col) > 0   # your custom filter

df = pd.read_csv('kmers.csv', index_col=0)       # read
filtered_df = df.loc[:, df.apply(has_vals)]      # keep k-mer columns passing the filter
filtered_df.to_csv('filtered_kmers.csv')         # write

which would exclude k-mers with zero counts. And notice that only the line commented as # your custom filter is actually interesting.
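For the other two filters from your list above, the drop-in replacements would look something like this (the cutoff values are placeholders you'd tune yourself):

def exceeds_threshold(col):
    return max(col) >= 5      # drops k-mers where max(count_column) < threshold

def has_variance(col):
    return col.var() >= 0.1   # drops k-mers where variance(count_column) < threshold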

What might make sense is to create a seekr_filter_counts that allows users to specify an arbitrary filter function. That'd let you incorporate filtering into your bash scripts, but wouldn't require me to curate all of your favorite filters, or require you to go through me each time you want to try a new one. This would still require someone to write the Python filters, of course, but it sounds like the ones you're thinking about right now are easy enough that someone in lab could write them up without too much trouble.
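To make that concrete, here's a rough sketch of what a seekr_filter_counts script could look like (this doesn't exist in seekr; the name and interface are just illustrative). It loads a filter function from a dotted path, so any filter someone writes in a Python module is usable from bash:

import argparse
import importlib

import pandas as pd

def main():
    parser = argparse.ArgumentParser(description='Filter k-mer count columns.')
    parser.add_argument('infile', help='counts CSV from seekr_count_kmers')
    parser.add_argument('outfile', help='path for the filtered CSV')
    parser.add_argument('filter', help='dotted path to a filter, e.g. my_filters.has_vals')
    args = parser.parse_args()

    # Import the user-supplied filter function from its module.
    module_name, func_name = args.filter.rsplit('.', 1)
    filter_func = getattr(importlib.import_module(module_name), func_name)

    # Keep only the k-mer columns that pass the filter.
    df = pd.read_csv(args.infile, index_col=0)
    df.loc[:, df.apply(filter_func)].to_csv(args.outfile)

if __name__ == '__main__':
    main()

which you'd run as, say, python filter_counts.py kmers.csv filtered_kmers.csv my_filters.has_vals.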


To summarize:

  • There are hundreds of filters one might apply; it probably doesn't make sense to start adding them to seekr right now.
  • It's easy to write one-off scripts to filter columns.
  • If you're really interested in having filtering as part of the bash scripting pipeline, we should talk more.

Hope this is helpful! 😄

@jmcalabr
Contributor Author

jmcalabr commented Nov 26, 2020 via email

@Jessime
Contributor

Jessime commented Nov 28, 2020

This is off topic from the original question, but I'm curious whether you've considered creating a third alphabet by mapping each base together with its structure state? You could have an alphabet that looks like:

{
    'A.': 'a',
    'A(': 'b',
    'A)': 'c',
    'G.': 'd',
    'G(': 'e',
    'G)': 'f',
    'U.': 'g',
    'U(': 'h',
    'U)': 'i',
    'C.': 'j',
    'C(': 'k',
    'C)': 'l',
}

so the first 30 letters of the sequence you gave as an example in your message would be new_seq = 'aagjdhkheeebgkkhagdjajgadaaaaf'. The parameter space is pretty big, but you could reasonably do 3-mers or 4-mers (12^4 ≈ 21,000 possible words) like this.
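A minimal sketch of that encoding, assuming the sequence and its dot-bracket structure come as equal-length strings (the names here are just illustrative):

# The 12-letter mapping above, as a dict.
STRUCTURE_ALPHABET = {
    'A.': 'a', 'A(': 'b', 'A)': 'c',
    'G.': 'd', 'G(': 'e', 'G)': 'f',
    'U.': 'g', 'U(': 'h', 'U)': 'i',
    'C.': 'j', 'C(': 'k', 'C)': 'l',
}

def encode(seq, dot_bracket):
    # Collapse sequence and structure into one string, one letter per base.
    assert len(seq) == len(dot_bracket)
    return ''.join(STRUCTURE_ALPHABET[base + state]
                   for base, state in zip(seq, dot_bracket))

# e.g. encode('AAGC', '..((') == 'aaek'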

Anyway, I'll keep the --exclude-zero-counts suggestion in the back of my mind. Work is pretty busy at the moment, but I'll continue to mull it over. And if I find a slow weekend, I'll see if I can give something a shot.

Cheers!

@jmcalabr
Contributor Author

jmcalabr commented Nov 28, 2020 via email

@dan-sprague
Contributor

dan-sprague commented Oct 6, 2021

Mauro,

I'm actually working on very similar problems in my research at the company I work for. Not sure if you remember, but I work in the same startup incubator that Moderna started in! Believe it or not, I'm still partially working on ncRNAs, k-mers, and HMMs.

I don't have much to say about a Python implementation of what you're asking, other than to wonder whether Jessime has had time to look at this yet.

As for the science, I have a couple of thoughts that I can share:

  1. Check out this paper, which developed an encoding for secondary structure that is more information-rich than simple dot-bracket notation. Rather than bound/not bound, it encodes information about stems, loops, bulges, their lengths, etc. The encoding returned is the same length as the dot-bracket string and therefore the same length as the original sequence. With this you should be able to identify specific structures and then ask whether you see the same k-mers enriched in them or not:

https://pubmed.ncbi.nlm.nih.gov/24753415/

  2. Using k-mer frequencies and the alphabet you described, I think you could reduce the alphabet size for "free" by using a binary bound/not-bound alphabet instead of dot-bracket. Since k-mers lose positional information, the difference between "(" and ")" is negligible (see the sketch at the end of this comment).

  3. RNA structures are hard to work with computationally. I'd recommend looking into using the MFE from RNAfold as a covariate: if you tile across a long RNA and calculate the structure and its MFE at each position, you can average the MFE over all tiles covering a k-mer to get a number for each k-mer (a sketch of this is also below). Even the data behind the mountain plot may be more useful, since that encodes some information about stem length and overall structure.

  4. If you want to attract people to the lab by flashing "DEEP LEARNING", I think the field has come far enough that I'm quite certain it is not a waste of time. Specifically transfer learning, which is the ability to train a large model on a closely related dataset, then "transfer" that model for use on a much, much smaller dataset. A deep language model can be trained to embed RNA structures into a feature vector of arbitrary dimension, much like how you are embedding lncRNA sequences into a k-mer feature vector. The reason to consider this is that the embedding will correctly model the secondary structure, i.e. the model will learn that .((..(...)..)). is not the same as (..)..((...)).. (or whatever). To put it in SEEKR terms, they would be uncorrelated or negatively correlated, whereas with a k-mer alphabet you'd see very high correlation.

Rfam has ~4M sequences, although random RNA sequences (an arbitrarily high number of them) would work just fine to train such a model. This is what I am referring to as a closely related dataset.

As a final note on (4), the other recent development in this field is Bayesian inference on deep networks, which helps estimate uncertainty. There are now straightforward ways to know if a NN is "confidently wrong" about something; perhaps the most important paper is:

https://arxiv.org/abs/1506.02142

Because it is so damn easy to do.
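As promised, a sketch of (2): collapsing '(' and ')' into a single bound state shrinks the combined alphabet from 12 letters to 4 bases × 2 states = 8, so the 4-mer space drops from ~21,000 to 8^4 = 4,096 words. The letter assignments here are arbitrary:

def encode_bound(seq, dot_bracket):
    # Map each (base, bound/unbound) pair onto an 8-letter alphabet.
    letters = {('A', True): 'a', ('A', False): 'b',
               ('G', True): 'c', ('G', False): 'd',
               ('U', True): 'e', ('U', False): 'f',
               ('C', True): 'g', ('C', False): 'h'}
    return ''.join(letters[(base, state != '.')]
                   for base, state in zip(seq, dot_bracket))

And a rough sketch of the tiling idea in (3), assuming the ViennaRNA Python bindings are installed (RNA.fold returns a structure and its MFE). Each position gets the average MFE of the windows that cover it, which you could then average over a k-mer's occurrences:

import numpy as np
import RNA  # ViennaRNA Python bindings

def tiled_mfe(seq, window=100, step=20):
    # Fold overlapping tiles; average each position's MFE over covering tiles.
    totals = np.zeros(len(seq))
    counts = np.zeros(len(seq))
    for start in range(0, len(seq) - window + 1, step):
        _, mfe = RNA.fold(seq[start:start + window])
        totals[start:start + window] += mfe
        counts[start:start + window] += 1
    return totals / np.maximum(counts, 1)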

@jmcalabr
Contributor Author

jmcalabr commented Oct 6, 2021 via email
