Sequence Data File Hot Spot Analysis #1
Comments
Charles / @bradnerComputation, following up on our phone conversation today, I'm waiting on a few items before I can get started:
I plan on restricting this analysis to at most half the cores of TOD and half the available memory at any given time. Let me know if that sounds OK. (If any of my jobs do become a nuisance on TOD, please feel free to just kill them and let me know afterwards, so I can make sure my jobs don't get out of hand again.)

I assume that bamliquidator.c will be the bottleneck in the data analysis, and I might end up making some changes to it or replacing it with something faster if necessary.

Any work I do needs to have an OSI-approved open source license. I suggest we add something like the following to the top of any source file I work on (including bamliquidator.c), and we should probably try adding this to all new files in the future (see http://opensource.org/faq#public-domain for why we probably want an explicit copyright and license).

I'm not very familiar with assigning copyright to an organization instead of an individual, so if you have access to any lawyers familiar with open source software, it would probably be a good idea for them to review this. Let me know if you'd prefer any files I create to have the copyright assigned to myself or to "Dana-Farber Cancer Institute Bradner Lab" (or something else). Thanks
John, I sent you an email with details for the project. If you can time your [...] I am meeting with the Dana-Farber IT people to discuss copyright/license.
-Charles
Charles Y. Lin, Ph.D.
On Tue, Oct 22, 2013 at 9:16 PM, jdimatteo notifications@github.com wrote:
This issue is a bit open-ended, but the first version is complete and re-integrated. I'm going to create another issue for enhancing the hot spot analysis code to handle arbitrary bin sizes.
Find the number of data files that correspond to each bin of the genome. Start with the 1 million base pair bins Charles defined, and possibly repeat with smaller bins.
There are hundreds of data files with sequence data, each a couple GB in size. Run bamliquidator on all the data files for each bin to find the number of data files that correspond to each bin. Store the results in a table and generate a heat map to visualize the results. Benchmark how long it takes for the analysis to run for a given sequence length.
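As a rough illustration of the tally described above, here is a minimal Python sketch. It is not the bamliquidator implementation and does not call it. It assumes that a file "corresponds to" a bin when at least one of its reads starts inside that bin, and that read start positions are already available in memory as (chromosome, position) pairs. The names `bin_index`, `bins_hit_by_file`, and `files_per_bin` are hypothetical, chosen only for this example.

```python
from collections import Counter

# 1 million base pair bins, the starting bin size mentioned in this issue.
BIN_SIZE = 1_000_000


def bin_index(position, bin_size=BIN_SIZE):
    """Map a 0-based genomic position to the index of the bin containing it."""
    return position // bin_size


def bins_hit_by_file(read_positions, bin_size=BIN_SIZE):
    """Return the set of (chromosome, bin index) pairs with at least one read.

    read_positions is an iterable of (chromosome, position) pairs. This is an
    assumed stand-in for whatever per-file output the real pipeline produces.
    """
    return {(chrom, bin_index(pos, bin_size)) for chrom, pos in read_positions}


def files_per_bin(files, bin_size=BIN_SIZE):
    """Count, for each bin, how many files have at least one read in it.

    files maps a file name to its iterable of (chromosome, position) pairs.
    Each file contributes at most 1 to any given bin.
    """
    counts = Counter()
    for read_positions in files.values():
        counts.update(bins_hit_by_file(read_positions, bin_size))
    return counts


if __name__ == "__main__":
    # Toy data only; real inputs would come from the sequence data files.
    toy_files = {
        "a.bam": [("chr1", 10), ("chr1", 1_500_000)],
        "b.bam": [("chr1", 999_999), ("chr2", 5)],
        "c.bam": [("chr1", 1_000_000)],
    }
    for (chrom, index), n_files in sorted(files_per_bin(toy_files).items()):
        print(chrom, index, n_files)
```

The resulting mapping from (chromosome, bin index) to file count is the table that a heat map could be drawn from. Passing a smaller `bin_size` repeats the tally with smaller bins.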
Many bioinformaticians have anecdotally observed a seeming hot spot tendency, where certain areas of the genome tend to have more sequence data associated with them. No one has done a large-scale analysis before to confirm whether or not this is really true. The null hypothesis is that there are no hot spots.