Meeting notes

Joel edited this page May 21, 2018 · 15 revisions

Notes from meetings

2018-05-21. Present: Alexander, Martin, Peter, Joel, and Erik

  • Look into neighbour joining, and other phylogenetic correlations: http://evolution.genetics.washington.edu/phylip.html
  • Talked about homology and what to do about it. Take two parts of each sequence and check them for homology. If they aren't homologous, compute VLMCs for them and use them as representatives for the entire VLMCs.
  • Need a new implementation of Pisa for analysis of larger DNA sequences.
  • Martin didn't check for homology in the split sequences, so not strictly homology-resistant?

2018-05-09. Present: Alexander, Joel, and Erik

  • What to do about the sequence cluster algorithms. Look into Vmatch. If they don't run (because of memory or time), state in the report that we tried them and why they didn't work.
  • Bad sensitivity isn't as bad as bad specificity. Bad sensitivity means subdividing classes, which doesn't have to be bad.
  • Report: Move things we find irrelevant, or that don't work well, to the appendix. Introduction: explain more about the biology and include pictures from NIH. The opening can also be a bit more dramatic, e.g. "Pathogens have endangered life on earth for thousands of years" or "Pathogens is the things that kill you if you aint vaccinated". Emerging threats, report from WHO. Plot variability of GC-content or some such.
  • Change to species of the host instead of the class of host species, more specific but also difficult to plot (a lot of different hosts).

2018-05-02. Present: Alexander, Joel, and Erik

  • Brewer palette for selecting colors for the visualisations.
  • Produce a table with the x largest clusters with the #families and #hosts, along with bar charts of which families and hosts are in each row.
  • Metrics: examine all pairs of data points. If both are in the same class and the same cluster, that's good; the other combinations give the remaining true/false positives/negatives. See Proclust 2002 paper.
  • Can put other plots in the appendix, for things that are semi-interesting or worse.
  • Keep results and discussion two separate sections.
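
A minimal sketch of the pair-counting metric discussed above. It assumes the convention that a pair in the same cluster but different classes is a false positive, and a pair in the same class but different clusters is a false negative; the function names and the example labels are made up for illustration.

```python
from itertools import combinations


def pair_counts(classes, clusters):
    """Count all pairs of points as (tp, fp, fn, tn).

    classes[i] is the known class of point i, clusters[i] its assigned cluster.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1  # correctly grouped together
        elif same_cluster:
            fp += 1  # grouped together, but different classes
        elif same_class:
            fn += 1  # same class, but split over clusters
        else:
            tn += 1  # correctly kept apart
    return tp, fp, fn, tn


def sensitivity_specificity(classes, clusters):
    """Pairwise sensitivity and specificity of a clustering."""
    tp, fp, fn, tn = pair_counts(classes, clusters)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical labels: class "A" is split over two clusters,
# and cluster 1 mixes classes "A" and "B".
classes = ["A", "A", "A", "B", "B"]
clusters = [0, 0, 1, 1, 1]
print(pair_counts(classes, clusters))
print(sensitivity_specificity(classes, clusters))
```

With this convention, subdividing a class only lowers the sensitivity, while merging different classes into one cluster lowers the specificity.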

2018-04-19. Present: Alexander, Joel, and Erik

  • Report. Send chapters at a time and TOC for feedback on structure.
  • Talked about the examples, how it is difficult to find real data where the intersection is very different from the union. Changes in high orders always reflect down into lower orders, and states which have different probabilities are more likely to be captured in the trained model.
  • Work on tree building, more robust than doing something with different radii for different clusters.
  • Double check the data, and models for plants. Seems like there could be something weird. If not, email Peter/Martin about this occurrence.
  • Plot the distance to closest other VLMC as the function of length.

2018-04-13. Present: Alexander, John, Joel, and Erik

  • It seems hard or impossible to filter out GC-content when comparing the VLMCs.
  • Alexander said that Martin has created some sort of tree-of-life image using the genomic signatures. How was this done?
  • Build tree of life with hierarchical clustering, check UPGMA.
  • We discussed whether the wobble nucleotides could account for differences in GC-content. Can test this by ignoring every third bp and seeing if the GC-content changes a lot.
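
A small sketch of the "ignore every third bp" test. The helper names and the toy sequence are made up for illustration. For whole genomes the reading frame is not known, so one option (an assumption, not something decided in the meeting) is to try all three offsets and compare.

```python
def gc_content(sequence):
    """Fraction of G and C among the A/C/G/T characters of a sequence."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def drop_every_third(sequence, offset=2):
    """Remove the bases at 0-indexed positions offset, offset + 3, ...

    offset=2 removes the third position of every codon if the
    sequence starts in frame.
    """
    return "".join(
        base for i, base in enumerate(sequence) if i % 3 != offset % 3
    )


# Toy sequence, assumed to start in frame.
sequence = "ATGGCCAAGTTC"
print(gc_content(sequence))
for offset in range(3):
    print(offset, gc_content(drop_every_third(sequence, offset)))
```

If the GC-content barely changes for any offset, the third codon position is unlikely to explain the differences between genomes.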

2018-04-10. Present: Martin, Joel, and Erik

  • The codon pair bias is a good benchmark for comparison between genomes.
  • Try to marginalise for GC-content, show some result on what is important above GC-content.
  • Would be good to not have to choose the number of clusters, but how? Produce metric results on how well the VLMCs cluster with different numbers of clusters.
  • GenBank - get genomes for plants, bacteria, viruses, eukaryotes and cluster, maybe 100 of each.
  • Remove virus species with multiple hosts. Herpesviruses are a good candidate, since they usually only have a single host. Bacteria from GenBank.
  • Check the Baltimore types, genome length and more metadata about the viruses which cluster together.
  • Do something about homology.
  • Email Martin (and Peter) about presentation when the date is set.

2018-03-29. Present: Alexander, Joel, and Erik

  • Regenerate VLMCs, plot the distance for increasing sequence lengths used in the regeneration process.
  • Plot sequence length vs difference in structure and distance.
  • Maybe instead of regenerating with a long sequence, regenerate with several sequences.
  • Add a flag for the changes introduced which counts the number of parameters differently in the classifier.
  • Clever way of generating sequence homology/genome comparisons. Should ask Peter/Martin what constitutes genome homology. Maybe use the percentage of shared proteins.
  • Plot the cumulative captured percent/genus instead of the individual percentage in each box.
  • Make artificial models, and try to figure out what is or isn't captured in the intersection/union distance calculations.
  • Calculate the probability vectors of every word (for instance 10 long), and calculate the relative entropy between them and compare to union/intersection calculations.
  • Investigate what sort of states are pruned, and what results the distance function gives with/without them.
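
A rough sketch of the word-probability comparison. Here `next_prob(context, base)` is a hypothetical interface that returns P(base | context) for a model (for a VLMC it would look up the longest matching context), and the two toy models are i.i.d. stand-ins rather than trained signatures.

```python
from itertools import product
from math import log2

ALPHABET = "ACGT"


def word_probabilities(next_prob, length):
    """Probability of every word of the given length under a model."""
    probabilities = {}
    for letters in product(ALPHABET, repeat=length):
        word = "".join(letters)
        p = 1.0
        for i, base in enumerate(word):
            # Chain rule: multiply P(base | everything before it in the word).
            p *= next_prob(word[:i], base)
        probabilities[word] = p
    return probabilities


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits.

    Assumes q[word] > 0 wherever p[word] > 0.
    """
    return sum(pw * log2(pw / q[word]) for word, pw in p.items() if pw > 0)


# Toy models (assumptions): uniform bases vs. GC-rich bases.
def uniform(context, base):
    return 0.25


def gc_rich(context, base):
    return 0.4 if base in "GC" else 0.1


p = word_probabilities(gc_rich, 3)
q = word_probabilities(uniform, 3)
print(relative_entropy(p, q))
```

Words of length 10 give 4^10 (about one million) entries per model, which is feasible but slow in pure Python. Note also that the relative entropy is not symmetric, so D(p || q) and D(q || p) generally differ.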

2018-03-23. Halftime meeting. Present: Alexander, Peter, Devdatt, Joel, and Erik

  • Box plot with the GC-content with some bin-size. Include the taxonomy in a clever way as well.
  • Tune the learning of the VLMCs in some way and make the models more comparable.
  • Figure out some way of working with the union of the states in the Frobenius calculation.
  • Work with other clusterings, maybe top-down, and influence the cluster sizes.
  • Try to do something with an even larger data-set, ask Martin or something from NCBI.
  • Analyse what might be captured in the clusters and by the distances above GC-content.
  • Focus on finishing the structure and contents of the early parts of the report.
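
One possible way (an assumption, not a decision from the meeting) of using the union of the states in the Frobenius calculation: when a context is missing from one model, fall back to the longest suffix of that context which the model does contain, the same way a VLMC resolves a context when predicting. The dictionary representation below is hypothetical; contexts are written with the most recent symbol last, and each maps to the next-symbol probabilities for A, C, G, T.

```python
from math import sqrt


def lookup(model, context):
    """Next-symbol probabilities of the longest suffix of context in model."""
    for start in range(len(context) + 1):
        suffix = context[start:]
        if suffix in model:
            return model[suffix]
    raise KeyError("model has no root context ''")


def frobenius_distance(a, b, states="union"):
    """Frobenius norm of the difference in next-symbol probabilities.

    states="union" compares every context found in either model,
    states="intersection" only the contexts found in both.
    """
    if states == "union":
        contexts = set(a) | set(b)
    else:
        contexts = set(a) & set(b)
    total = 0.0
    for context in contexts:
        pa, pb = lookup(a, context), lookup(b, context)
        total += sum((x - y) ** 2 for x, y in zip(pa, pb))
    return sqrt(total)


# Toy models: identical roots, but each has one context the other lacks.
model_a = {"": (0.25, 0.25, 0.25, 0.25), "A": (0.7, 0.1, 0.1, 0.1)}
model_b = {"": (0.25, 0.25, 0.25, 0.25), "C": (0.1, 0.7, 0.1, 0.1)}
print(frobenius_distance(model_a, model_b, "intersection"))
print(frobenius_distance(model_a, model_b, "union"))
```

In this toy case the intersection only contains the root, so the two models look identical there, while the union picks up the difference in the "A" and "C" contexts.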

2018-03-16. Present: Peter, Alexander, Joel, and Erik

  • Concerns were raised regarding the fact that we currently use the whole genome sequence to generate the VLMCs. We should consider splitting the sequences into two parts, to make sure the genomic signatures match other signatures not because of sequence homology, but because of other patterns/features/motifs that are captured in all parts of the DNA sequence. However, it might be difficult to avoid biases towards homology, since the viruses might have duplications or transformations from each other.
  • We should check if the clusterings are stable. We can do this by sampling subsets of the data and see if the algorithms produce similar clusterings every time.
  • Our work might provide more proof that the genomic signatures are highly species specific. Even if the clustering does not provide us with clusters which only include e.g. viruses within the same family, it will still provide interesting data from which new questions can be asked: why do these species cluster together? Do they share host species? Is there anything else we know that relates them in some way?
  • When clustering, there might be data points that we should remove because they affect the clustering procedure in some bad manner. How this should be done remains a question.
  • When measuring distance, also consider the homology of the sequences as a metric, together with the distance function and the GC-content.
  • Post-analyse the resulting clustering based on sequence homology.
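
A sketch of the stability check by subsampling. It assumes a hypothetical `cluster(points)` callable that returns one label per point, and it scores agreement with the Rand index (the fraction of pairs on which two clusterings agree); both choices are assumptions, other agreement measures would work as well.

```python
import random
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of pairs that both labelings treat the same way.

    Needs at least two points.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def stability(points, cluster, rounds=20, fraction=0.8, seed=0):
    """Average agreement between the full clustering and subsample clusterings."""
    rng = random.Random(seed)
    full = cluster(points)
    size = max(2, int(fraction * len(points)))
    scores = []
    for _ in range(rounds):
        chosen = sorted(rng.sample(range(len(points)), size))
        sub = cluster([points[i] for i in chosen])
        # Compare only on the points that are in the subsample.
        scores.append(rand_index([full[i] for i in chosen], sub))
    return sum(scores) / len(scores)


# Toy stand-in for a clustering algorithm: split 1-D points at 5.
def threshold_cluster(points):
    return [int(p > 5) for p in points]


points = [1, 2, 3, 8, 9, 10]
print(stability(points, threshold_cluster))
```

A score close to 1 means the clustering barely changes when points are left out; a low score suggests the clusters depend on a few individual data points.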

2018-03-12. Present: Alexander, Joel, and Erik

  • Establish a test suite/test case for the VLMC generation code.
  • When it has been established how well the VLMC generation works, generate even larger models to check if this can increase the accuracy.
  • The other variable order Markov chain structure isn't available anywhere.
  • With VLMCs, may need slightly more parameters than with a fixed-order Markov chain.

2018-03-05. Present: Alexander, Joel, and Erik

  • Deadline for Half-time report: 23/3 10:00. Hold a short presentation, and have a decent report. Invite Martin and Peter as well. Can deliver report to Alexander by end of next week.
  • Tune the parameters in the VLMC generation so that we can actually use the variable order nature. Figure out what works well empirically, and then try to find a more scientific motivation. Maybe an order of 7-8 would be good for the VLMCs.
  • In the estimation procedure and NLL calculation, measure the estimation error rate of different sequence lengths.
  • Instead of estimating the VLMC, find another way of translating between two VLMCs, going from the lower order and up, or from the higher order and down. (E.g. it is possible to go from an MC to an i.i.d. model.)
  • The Bioinformatics paper is used for the estimation of the VLMCs, Devdatt believes.
  • Discuss with Martin and Peter that not every part of the VLMC is equally important.
  • Can look at Martin's presentation regarding which biology to present.

2018-02-07. Present: Alexander, Joel, and Erik

  • Speed of distance function: It doesn't have to be extremely fast, but clustering 100-1000 signatures shouldn't take two weeks. So the overall speed is more important than the distance function itself.
  • Regarding higher-order Markov chains that are used instead of VLMC: This is surprising, and not consistent with Devdatt's work. Should ask Martin about the concrete experiment. Maybe because of the low order.
  • Can an HMM represent a VLMC: In principle, yes, but maybe the number of states will be exponential.
  • Keep in mind that the underlying data aren't stationary by definition, the sequences are transient. So difficult to reason about stationary distributions.
  • There doesn't have to be only one good distance function, can have a rough, but fast one, and a more accurate slow one, certainly for production.
  • Maybe transform the VLMC into a higher-order Markov chain, if their distance is easier to compute. Unclear what, exactly, the transformation would be.
  • Request more signatures, would like set of at least low-hundreds. Would be good to test speed on 10/100/1000 to get an idea of the growth in computation time.
  • Regarding verification of the clustering: Tree of life correlation, may not necessarily always correspond to the observed clusters.
  • Implement single-link clustering.
  • For random graphs, there is a phase transition of clusters of size O(log n) to O(n).
  • There is another way to generate genomic signatures (Fast and Adaptive Variable Order Markov Chain Construction).
  • The training sequences have a limited representation, so good idea still to generate sequences.
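
A minimal sketch of single-link clustering with a distance threshold, using union-find. This is equivalent to cutting a single-linkage dendrogram at the threshold: two items end up in the same cluster if a chain of items connects them where each step is at most the threshold. The `distance` argument stands in for whatever VLMC distance function is chosen; the 1-D example is only for illustration.

```python
def single_link_clusters(items, distance, threshold):
    """Group items connected by chains of pairwise distances <= threshold."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs are examined, so this needs O(n^2) distance evaluations.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


# Toy example: 1.0 and 2.0 are further apart than the threshold,
# but are linked through 1.5.
values = [1.0, 1.5, 2.0, 8.0, 8.4]
print(single_link_clusters(values, lambda a, b: abs(a - b), 0.6))
```

The chaining behaviour in the example is typical for single-link clustering, and is worth keeping in mind when interpreting large clusters.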

2018-01-23. Present: Alexander, Joel, and Erik

Regarding the planning report:

  • Risks: if something doesn't work (no good distance function is found), what to do in that case etc.
  • No need to go into details about how the VLMCs are inferred at this point, but good for final report.
  • Mention clustering procedures, maybe two sentences and a citation (mixture models, graph based).
  • K-Means, the average can be replaced with a likelihood.

Regarding the distance function:

  • Do literature study about HMM distance functions, something similar could maybe be used.
  • Bühlmann VLMC methodology 2004.
  • Relative entropy, without brute force sampling of the parameters.
  • Lower bound the distance, if large enough, a more accurate distance won't be needed.

Regarding programming language:

  • Python should work fine.
  • Cython, for C-bindings.

2018-01-17. Present: Peter, Martin, Joel, and Erik

Things that were discussed:

  • The "scoring function" Martin used in his thesis, which was based on log-likelihood, is not a good distance measure in general. This is because it shows whether two signatures are very similar, but if they are not, it does not give a meaningful value.
  • Talked a bit more about the background to why the use of genomic signatures is important, as opposed to handling all sequences as they are.
  • Martin is going to send us the software that Devdatt and Daniel have developed.
  • Martin is going to send us a test-data set for us to start looking into.