Sequence Data File Hot Spot Analysis #1

Closed
jdimatteo opened this issue Oct 23, 2013 · 3 comments
@jdimatteo
Member

Find the number of data files that correspond to each bin of the genome. Start with the 1 million base pair bins Charles defined, and possibly repeat with smaller bins.

There are hundreds of data files with sequence data, each a couple of GB in size. Run bamliquidator on every data file for each bin to find the number of data files that correspond to each bin. Store the results in a table and generate a heat map to visualize them. Benchmark how long the analysis takes to run for a given sequence length.
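As a rough illustration of the table described above, here is a minimal Python sketch. It assumes a file "corresponds" to a bin when it has at least one read in that bin, which is only one possible reading of the task. The `reads_by_file` input, the file names, and the function names are hypothetical; in practice the per-bin numbers would come from bamliquidator, whose output format is not described here.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # the 1 million base pair bins mentioned above


def bin_index(position, bin_size=BIN_SIZE):
    """Return the zero-based bin index containing a zero-based position."""
    return position // bin_size


def files_per_bin(reads_by_file, bin_size=BIN_SIZE):
    """Count, for each (chromosome, bin), how many files have a read there.

    reads_by_file is a hypothetical input: file name -> iterable of
    (chromosome, position) pairs.
    """
    table = Counter()
    for reads in reads_by_file.values():
        bins_hit = {(chrom, bin_index(pos, bin_size)) for chrom, pos in reads}
        table.update(bins_hit)  # each file counts at most once per bin
    return table


# Made-up example data, for illustration only.
example = {
    "a.bam": [("chr1", 10), ("chr1", 1_500_000)],
    "b.bam": [("chr1", 999_999), ("chr2", 5)],
}
table = files_per_bin(example)
```

The resulting `table` maps each (chromosome, bin index) pair to a file count, which is the shape a heat map would need. Repeating the analysis with smaller bins would only require a different `bin_size`.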

Many bioinformaticians have anecdotally observed an apparent hot spot tendency, where certain areas of the genome tend to have more sequence data associated with them. No one has done a large-scale analysis to confirm whether this is really true. The null hypothesis is that there are no hot spots.

@ghost ghost assigned jdimatteo Oct 23, 2013
@jdimatteo
Member Author

Charles / @bradnerComputation, following up on our phone conversation today, I'm waiting on a few items before I can get started:

  1. path to bamliquidator executable on TOD, and if possible instructions to compile along with a couple example inputs and expected outputs
  2. paths of a couple data files on TOD that are in public domain and I can copy to my laptop for local testing
  3. PDF of the related Yale publication (if you could email it to me, that would be appreciated, since I don't have any paid journal subscriptions)

I plan on restricting this analysis to at most half the cores of TOD and half the available memory at any given time -- let me know if that sounds OK. (If any of my jobs do become a nuisance on TOD, please feel free to just kill them and let me know afterwards so I can make sure my jobs don't get out of hand again.)
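One way the core limit could be enforced is sketched below, assuming the jobs are driven from Python. `worker_limit`, `run_limited`, and the `job` callable are hypothetical names, and this caps only the number of concurrent processes; it does nothing about memory, which would need a separate mechanism.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_limit(fraction=0.5, total_cores=None):
    """Number of worker processes allowed for a given fraction of cores."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, int(total_cores * fraction))


def run_limited(job, inputs, fraction=0.5):
    """Run job(x) for each input using at most `fraction` of the cores.

    `job` must be a picklable top-level function (a hypothetical
    placeholder for whatever processes one data file).
    """
    with ProcessPoolExecutor(max_workers=worker_limit(fraction)) as pool:
        return list(pool.map(job, inputs))
```

Passing a smaller `fraction` would tighten the cap if half the machine turns out to be too much.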

I assume that bamliquidator.c will be the bottleneck in the data analysis, and I might end up making some changes or replacing it with something faster if necessary. Any work I do needs to have an OSI-approved open source license. I suggest we add something like the following to the top of any source file I work on (including bamliquidator.c), and we should probably try adding this to all new files in the future (see http://opensource.org/faq#public-domain for why we probably want an explicit copyright and license).

The MIT License (MIT)

Copyright (c) 2013 Dana-Farber Cancer Institute Bradner Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

I'm not very familiar with assigning copyright to an organization instead of an individual, so if you have access to any lawyers familiar with open source software, it would probably be a good idea for them to review this. Let me know if you'd prefer any files I create to have the license assigned to myself or "Dana-Farber Cancer Institute Bradner Lab" (or something else).
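If the header does get adopted, adding it to existing files could be scripted. The following is only a sketch: `with_license_header` is a hypothetical helper that wraps the license text in a C block comment and skips sources that already start with the same header. It assumes the license text contains no `*/` sequence, which holds for the MIT text above.

```python
def with_license_header(source, license_text):
    """Return source with license_text prepended as a C block comment."""
    lines = license_text.strip().splitlines()
    # Prefix each line with " * "; blank lines become a bare " *".
    body = "\n".join((" * " + line).rstrip() for line in lines)
    header = "/*\n" + body + "\n */\n\n"
    if source.startswith(header):
        return source  # header already present, leave the file unchanged
    return header + source
```

Applying it to each source file would be a matter of reading the file, calling the helper, and writing the result back.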

Thanks

@charlesylin
Member

John,

I sent you an email with details for the project. If you can time your runs to occur at night, then you can use as much of the machine as you'd like. We tend to pile on the machine first thing in the morning, once again in the afternoon, and again around 9pm-midnight. During those times, I would restrict CPU/memory usage to about 1/3. If I remember correctly, the bamliquidator program is actually quite fast and can go through an average bam file in about 1 hour on 1 core.
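For planning purposes, the "about 1 hour per file on 1 core" figure can be turned into a rough wall-clock estimate. This is a back-of-the-envelope sketch, assuming each core handles one file at a time and that a single pass over a file covers every bin. The file and core counts in the example are hypothetical, not measurements from TOD.

```python
import math


def estimated_wall_hours(num_files, hours_per_file=1.0, cores=1):
    """Rough wall-clock estimate when each core handles one file at a time."""
    rounds = math.ceil(num_files / cores)  # batches of `cores` files
    return rounds * hours_per_file


# Hypothetical numbers: 300 files spread across 8 cores.
print(estimated_wall_hours(300, cores=8))  # 38.0
```

On a single core the same 300 files would take roughly 300 hours, so the number of cores available at night largely determines how long a full run takes.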

I am meeting with the Dana-Farber IT people to discuss copyright/license for our stuff. I think the solution will involve putting the MIT license on top of everything that we commit to the public repo. (bamliquidator is already in the public repo.)

-Charles

Charles Y. Lin, Ph.D.
Dana-Farber Cancer Institute
Department of Medical Oncology
charles_lin@dfci.harvard.edu
http://bradnerlab.com


@jdimatteo
Member Author

This issue is a bit open-ended, but the first version is complete and re-integrated.

I'm going to create another issue for enhancing the hot spot analysis code to handle arbitrary bin sizes.
