Sequence Data File Hot Spot Analysis #1

jdimatteo · 2013-10-23T00:48:03Z

Find the number of data files that correspond to each bin of the genome. Start with the 1 million base pair bins Charles defined, and possibly repeat with smaller bins.

There are hundreds of data files with sequence data, each a couple GB in size. Run bamliquidator on all the data files for each bin to find the number of data files that correspond to each bin. Store the results in a table and generate a heat map to visualize the results. Benchmark how long it takes for the analysis to run for a given sequence length.
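
As a rough illustration of the counting step described above, here is a minimal Python sketch. It does not call bamliquidator. It assumes each file's reads have already been reduced to (chromosome, start position) pairs, and it takes one possible reading of "correspond": a file counts toward a bin if at least one of its reads starts in that bin. The names `bins_hit`, `files_per_bin`, and `toy_files` are hypothetical and the input data is made up.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1 million base pair bins, as in the issue text


def bins_hit(read_positions, bin_size=BIN_SIZE):
    """Return the set of (chromosome, bin_index) pairs touched by one file's reads."""
    return {(chrom, pos // bin_size) for chrom, pos in read_positions}


def files_per_bin(files, bin_size=BIN_SIZE):
    """Count, for every bin, how many files have at least one read starting in it."""
    counts = Counter()
    for read_positions in files.values():
        # Using a set per file ensures each file is counted at most once per bin.
        counts.update(bins_hit(read_positions, bin_size))
    return counts


# Hypothetical toy input: file name -> list of (chromosome, 0-based read start)
toy_files = {
    "a.bam": [("chr1", 10), ("chr1", 1_500_000)],
    "b.bam": [("chr1", 999_999), ("chr2", 5)],
    "c.bam": [("chr1", 1_000_000), ("chr1", 1_999_999)],
}

counts = files_per_bin(toy_files)
for (chrom, bin_index), n_files in sorted(counts.items()):
    print(chrom, bin_index * BIN_SIZE, n_files)
```

The resulting mapping from bin to file count is the kind of table that could then be stored and rendered as a heat map. Passing a smaller `bin_size` repeats the same count at a finer resolution.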

Many bioinformaticians have anecdotally observed an apparent hot spot tendency, where certain areas of the genome tend to have more sequence data associated with them. No one has done a large-scale analysis to confirm whether this is really true. The null hypothesis is that there are no hot spots.

jdimatteo · 2013-10-23T01:16:49Z

Charles / @bradnerComputation, following up on our phone conversation today, I'm waiting on a few items before I can get started:

path to bamliquidator executable on TOD, and if possible instructions to compile along with a couple example inputs and expected outputs
paths of a couple data files on TOD that are in public domain and I can copy to my laptop for local testing
pdf of the related Yale publication (if you could email me a PDF that would be appreciated, since I don't have any paid journal subscriptions)

I plan on restricting this analysis to at most half the cores of TOD and half the available memory at any given time -- let me know if that sounds OK. (If any of my jobs do become a nuisance on TOD, please feel free to just kill them and let me know afterwards so I can make sure my jobs don't get out of hand again.)
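
One way to keep a batch of per-file jobs within half the cores is to cap the size of a worker pool. The sketch below is only an illustration under assumptions: `worker_limit`, `process_file`, and `run_all` are hypothetical names, `process_file` is a placeholder rather than a real bamliquidator invocation, and a pool size limits concurrency only, not memory use.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_limit(fraction=0.5, total_cores=None):
    """Number of concurrent workers for a given fraction of the machine's cores."""
    total = total_cores if total_cores is not None else (os.cpu_count() or 1)
    return max(1, int(total * fraction))  # always allow at least one worker


def process_file(path):
    """Placeholder for the real per-file work (hypothetical)."""
    return path, len(path)


def run_all(paths, fraction=0.5):
    """Process every path, never running more than the capped number at once."""
    with ThreadPoolExecutor(max_workers=worker_limit(fraction)) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(process_file, paths))


results = run_all(["a.bam", "b.bam", "c.bam"])
```

Threads are used here on the assumption that each worker would mostly wait on an external process, so the pool size is what bounds the number of cores in use.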

I assume that bamliquidator.c will be the bottleneck in the data analysis, and I might end up making some changes or replacing it with something faster if necessary. Any work I do needs to have an OSI-approved open source license. I suggest we add something like the following to the top of any source file I work on (including bamliquidator.c), and we should probably try adding this to all new files in the future (see http://opensource.org/faq#public-domain for why we probably want an explicit copyright and license).

The MIT License (MIT)

Copyright (c) 2013 Dana-Farber Cancer Institute Bradner Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

I'm not very familiar with assigning copyright to an organization instead of an individual, so if you have access to any lawyers familiar with open source software, it would probably be a good idea for them to review this. Let me know if you'd prefer any files I create to have the license assigned to myself or "Dana-Farber Cancer Institute Bradner Lab" (or something else).

Thanks

charlesylin · 2013-10-23T01:47:27Z

John,

I sent you an email with details for the project. If you can time your
runs to occur at night, then you can use as much of the machine as you'd
like. We tend to pile on the machine first thing in the morning, once
again in the afternoon, and again around 9pm-midnight. During those times,
I would restrict CPU/memory usage to about 1/3. If I remember correctly,
the bamliquidator program is actually quite fast and can go through an
average bam file in about 1 hour on 1 core.

I am meeting with the Dana-Farber IT people to discuss copyright/license
for our stuff. I think the solution will involve putting the MIT license
on top of everything that we commit to the public repo. (bamliquidator is
already in the public repo)

-Charles

Charles Y. Lin, Ph.D.
Dana-Farber Cancer Institute
Department of Medical Oncology
charles_lin@dfci.harvard.edu
http://bradnerlab.com

jdimatteo · 2013-12-19T13:32:31Z

This issue is a bit open ended, but the first version is complete and re-integrated.

I'm going to create another issue for enhancing the hot spot analysis code to handle arbitrary bin sizes.

ghost assigned jdimatteo Oct 23, 2013

jdimatteo mentioned this issue Dec 17, 2013: Reintegrate with Bam Bin Counting and Minor Change to GPL16043.sh #2 (Merged)

jdimatteo closed this as completed Dec 19, 2013

jdimatteo mentioned this issue Dec 19, 2013: Support arbitrary bin sizes in hot spot analysis #4 (Closed)

jdimatteo mentioned this issue Feb 9, 2014: BamLiquidator: Added support for varying bin sizes and improved performance #5 (Merged)

davidhoover mentioned this issue Aug 28, 2015: Unhandled exception: pthread_attr_setstacksize: Invalid argument #49 (Closed)

jdimatteo pushed a commit that referenced this issue Aug 28, 2016: Merge pull request #1 from BoulderLabs/multithreaded-motif (f847da1, "Add a separate thread for scoring")
