
Master - Monthly merge from Develop #159

Merged 57 commits on May 11, 2012
a0fe27c
Handle nophix names nicely
percyfal Mar 29, 2012
a568b30
Merge branch 'develop' of git://github.com/SciLifeLab/bcbb into develop
percyfal Mar 29, 2012
e89b175
Add project pruning at flowcell level
percyfal Mar 30, 2012
253ac96
Made a helper function that tries doing float conversion
vals Mar 29, 2012
e65bebf
Rename project_analysis_setup to more generic project_management
percyfal Apr 3, 2012
39ffea2
Merge branch 'develop' of git://github.com/SciLifeLab/bcbb into develop
percyfal Apr 3, 2012
62af537
Merge pull request #56 from vals/brads-master
chapmanb Apr 3, 2012
23187bf
Modify data delivery directory. Customer delivery goes to project_dir…
percyfal Apr 3, 2012
7aea7cc
Merge branch 'develop' of git://github.com/SciLifeLab/bcbb into develop
percyfal Apr 5, 2012
5806029
Update project_management
percyfal Apr 5, 2012
a6412e7
add module scilifelab
percyfal Apr 12, 2012
252fac1
Remove project_analysis_setup.py
percyfal Apr 12, 2012
c1ab6fa
Remove fc_delivery_report
percyfal Apr 12, 2012
aeb4edc
Modify installed files
percyfal Apr 12, 2012
e44df4a
Merge pull request #144 from percyfal/develop
vals Apr 12, 2012
8e0a1b9
updated name parsing to accommodate agilent and rna barcodes
b97pla Apr 12, 2012
55d5d63
Merge pull request #145 from b97pla/develop
b97pla Apr 12, 2012
a435428
added backup task
b97pla Apr 12, 2012
f8bf8f2
removed accidentally committed experimental code
b97pla Apr 12, 2012
d39c7a9
Merge branch 'develop' of github.com:SciLifeLab/bcbb into develop
b97pla Apr 16, 2012
732b4ce
no genomes_filter_out option by default
b97pla Apr 16, 2012
063869e
dont use sample-level description
b97pla Apr 16, 2012
1e6527f
Merge branch 'feature/no_genomes_filter_out' into develop
b97pla Apr 16, 2012
a9e0786
PEP8 formatting
vals Apr 17, 2012
61654b9
Only try to grab parent folder title in create_project_report_on_gdoc…
vals Apr 17, 2012
31a8bb6
Merge pull request #150 from vals/develop
vals Apr 17, 2012
171aa95
Initial framework for converting Complete Genomics structural variant…
chapmanb Apr 18, 2012
7785aff
Merge branch 'master' of github.com:chapmanb/bcbb
chapmanb Apr 18, 2012
af25a6a
Merge branch 'develop' of github.com:SciLifeLab/bcbb into develop
b97pla Apr 18, 2012
e63a27e
added check to skip qseq to fastq conversion if outfile exists
b97pla Apr 18, 2012
b1c7e72
(PEP8 formatting)
vals Apr 19, 2012
5d4c3e5
Removed the comment and pass field from gdocs project report
vals Apr 19, 2012
e997938
Handle new location file format for OLB 1.9
chapmanb Apr 20, 2012
f8232e2
No longer reading the non existant data from the spreasheets
vals Apr 20, 2012
1ca030c
Saving comment and pass status from Summary worksheet
vals Apr 20, 2012
abe4c2d
Handle if the summary doesn't exist for some sample name or project
vals Apr 20, 2012
c328329
Merging from upstream
vals Apr 20, 2012
50fdc8a
Merge pull request #152 from vals/develop
vals Apr 20, 2012
ad08d5d
Update query example to match paper text
chapmanb Apr 30, 2012
8b576ea
Merge branch 'master' of github.com:chapmanb/bcbb
chapmanb Apr 30, 2012
5057374
Support haploid calling, background VCF input and specific target reg…
chapmanb May 2, 2012
3a8648c
Merge remote-tracking branch 'brad/master' into develop
vals May 2, 2012
f9f5971
silly formating
vals May 3, 2012
94b5451
create fastq files even if directory exists
b97pla May 3, 2012
31e2dd1
Merge pull request #153 from b97pla/develop
vals May 4, 2012
25a93cd
Merge remote branch 'central/develop' into develop
vals May 4, 2012
cd094c8
Adding mosaik_index.loc, which failed to come with a merge
vals May 4, 2012
53fffe3
Changing variantcall test to use BWA
vals May 4, 2012
79fac54
Merge pull request #154 from vals/develop
vals May 4, 2012
27bf4b0
more specific glob to find already filtered sequence files
b97pla May 8, 2012
9448666
merge conflict due to scilifelab submodule
vals May 10, 2012
67e1cdf
Merge remote-tracking branch 'central/develop' into develop
vals May 10, 2012
d1077b7
Removed the --compress option in the rsync transfer, fixes issue #155
vals May 10, 2012
262664f
Merge pull request #156 from vals/develop
vals May 10, 2012
658f0b1
Merge branch 'develop' of github.com:SciLifeLab/bcbb into develop
b97pla May 11, 2012
a3d32ec
added parsing of mondrian indexes
b97pla May 11, 2012
1aa74dc
Merge pull request #158 from b97pla/develop
vals May 11, 2012
2 changes: 1 addition & 1 deletion .gitmodules
@@ -1,3 +1,3 @@
[submodule "nextgen/bcbio/scilifelab"]
path = nextgen/bcbio/scilifelab
-url = git://github.com/SciLifeLab/scilifelab_bcbio.git
+url = git@github.com:vals/scilifelab_bcbio.git
20 changes: 20 additions & 0 deletions nextgen/README.md
@@ -359,6 +359,26 @@ depth:

[v1]: http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle

## Configuration options

The YAML configuration file provides a number of hooks to customize analysis.
Place these under the `analysis` keyword. For variant calling:

- `aligner` Aligner to use: [bwa, bowtie, bowtie2, mosaik, novoalign]
- `trim_reads` Whether to trim off 3' B-only ends from fastq reads [false, true]
- `variantcaller` Variant calling algorithm [gatk, freebayes]
- `quality_format` Quality format of fastq inputs [Illumina, Standard]
- `coverage_interval` Regions covered by sequencing. This influences filtering.
- `hybrid_target` BED file with target regions for hybrid selection experiments.
- `variant_regions` BED file of regions to call variants in.
- `ploidy` Ploidy of called reads. Defaults to 2 (diploid).

Global reference files for variant calling and assessment:

- `train_hapmap`, `train_1000g_omni`, `train_indels` Training files for GATK
variant recalibration.
- `call_background` Background VCF to use for calling.
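The options above could be combined in a post-process YAML along these lines. This is a hedged sketch: placement under the `analysis` keyword follows the README text, but the values and file paths are illustrative assumptions, not taken from the source.

```yaml
# Illustrative sketch only -- values and paths are assumptions.
analysis:
  aligner: bwa
  trim_reads: false
  variantcaller: gatk
  quality_format: Standard
  hybrid_target: /path/to/target_regions.bed
  variant_regions: /path/to/variant_regions.bed
  ploidy: 2
  call_background: /path/to/background.vcf
```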

## Internals: files generated by this pipeline

### Initial Fastq files (pre-analysis)
11 changes: 9 additions & 2 deletions nextgen/bcbio/distributed/tasks.py
@@ -22,16 +22,23 @@ def analyze_and_upload(*args):
remote_info = args[0]
toplevel.analyze_and_upload(remote_info, config_file)


@task(ignore_results=True, queue="toplevel")
def fetch_data(*args):
"""Transfer sequencing data from a remote machine. Could be e.g. a sequencer
    or a pre-processing machine.
"""
config_file = celeryconfig.BCBIO_CONFIG_FILE
remote_info = args[0]
toplevel.fetch_data(remote_info, config_file)

@task(ignore_results=True, queue="toplevel")
def backup_data(*args):
"""Backup sequencing data from a remote machine. Could be e.g. a sequencer
or a pre-processing machine.
"""
config_file = celeryconfig.BCBIO_CONFIG_FILE
remote_info = args[0]
toplevel.backup_data(remote_info, config_file)
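The new `backup_data` task repeats the `*args` convention used by `fetch_data`: the payload arrives as positional arguments and the task pulls `remote_info` out of `args[0]`. A broker-free sketch of that convention (the function name and payload keys here are illustrative, not from the source):

```python
# Broker-free sketch of the *args convention shared by fetch_data and
# backup_data: the first positional argument carries the remote_info payload.
# "hostname" and "directory" keys are illustrative assumptions.
def backup_data_sketch(*args):
    remote_info = args[0]
    return ("backup", remote_info["hostname"], remote_info["directory"])

result = backup_data_sketch({"hostname": "seq01", "directory": "/data/runs/120511"})
print(result)
```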

@task(ignore_results=True, queue="storage")
def long_term_storage(*args):
2 changes: 1 addition & 1 deletion nextgen/bcbio/distributed/toplevel_tasks.py
@@ -1,3 +1,3 @@
""" Task definitions only related to toplevel queue.
"""
-from bcbio.distributed.tasks import ( analyze_and_upload, fetch_data )
+from bcbio.distributed.tasks import ( analyze_and_upload, fetch_data, backup_data )
1 change: 1 addition & 0 deletions nextgen/bcbio/google/__init__.py
@@ -17,6 +17,7 @@ def _to_unicode(str, encoding='utf-8'):
str = unicode(str, encoding)
return str


def get_credentials(config):
"""Get the encoded credentials specified in the post process configuration file"""

207 changes: 128 additions & 79 deletions nextgen/bcbio/google/bc_metrics.py
@@ -1,34 +1,35 @@
#!/usr/bin/env python
"""Functions for getting barcode statistics from demultiplexing"""

import os
import re
# import os
# import re
import copy
import glob
import logbook
from bcbio.utils import UnicodeReader
# import glob
# import logbook
# from bcbio.utils import UnicodeReader
import bcbio.google.connection
import bcbio.google.document
import bcbio.google.spreadsheet
from bcbio.google import (_from_unicode, _to_unicode, get_credentials)
from bcbio.log import logger2, create_log_handler
from bcbio.pipeline.flowcell import Flowcell
from bcbio.google import _to_unicode # (_from_unicode, _to_unicode, get_credentials)
from bcbio.log import logger2 # , create_log_handler
# from bcbio.pipeline.flowcell import Flowcell
import bcbio.solexa.flowcell
import bcbio.pipeline.flowcell

# The structure of the demultiplex result
BARCODE_STATS_HEADER = [
-['Project name', 'project_name'],
-['Lane', 'lane'],
-['Lane description', 'description'],
-['Sample name', 'sample_name'],
-['bcbb internal barcode index', 'bcbb_barcode_id'],
-['Barcode name', 'barcode_name'],
-['Barcode sequence', 'barcode_sequence'],
-['Barcode type', 'barcode_type'],
-['Demultiplexed read (pair) count', 'read_count'],
-['Demultiplexed read (pair) count (millions)', 'rounded_read_count'],
-['Comment', 'comment']
-]
+    ['Project name', 'project_name'],
+    ['Lane', 'lane'],
+    ['Lane description', 'description'],
+    ['Sample name', 'sample_name'],
+    ['bcbb internal barcode index', 'bcbb_barcode_id'],
+    ['Barcode name', 'barcode_name'],
+    ['Barcode sequence', 'barcode_sequence'],
+    ['Barcode type', 'barcode_type'],
+    ['Demultiplexed read (pair) count', 'read_count'],
+    ['Demultiplexed read (pair) count (millions)', 'rounded_read_count'],
+    ['Comment', 'comment']
+    ]

# The structure of the sequencing result
SEQUENCING_RESULT_HEADER = [
@@ -55,7 +56,7 @@ def _create_header(header, columns):
return names
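The diff only shows the tail of `_create_header`, but the pattern it suggests is mapping a `(display name, key)` header table onto the columns a flowcell actually has. A hypothetical, self-contained re-creation of that pattern (the real function body is not fully visible here, so treat this as a sketch, not the project's implementation):

```python
# Hypothetical sketch of the header-filtering pattern suggested by
# _create_header: keep display names whose key appears in `columns`.
BARCODE_STATS_HEADER = [
    ['Project name', 'project_name'],
    ['Lane', 'lane'],
    ['Comment', 'comment'],
]

def create_header(header, columns):
    """Return display names for the (name, key) pairs whose key is in columns."""
    return [name for name, key in header if key in columns]

print(create_header(BARCODE_STATS_HEADER, ['lane', 'comment']))
```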


def get_spreadsheet(ssheet_title,encoded_credentials):
def get_spreadsheet(ssheet_title, encoded_credentials):
"""Connect to Google docs and get a spreadsheet"""

# Convert the spreadsheet title to unicode
@@ -70,43 +71,49 @@ def get_spreadsheet(ssheet_title,encoded_credentials):

# Check that we got a result back
if not ssheet:
logger2.warn("No document with specified title '%s' found in GoogleDocs repository" % ssheet_title)
return (None,None)

return (client,ssheet)

logger2.warn("No document with specified title '%s' found in \
GoogleDocs repository" % ssheet_title)
return (None, None)

return (client, ssheet)


def _write_project_report_to_gdocs(client, ssheet, flowcell):

# Get the spreadsheet if it exists
# Otherwise, create it
wsheet_title = "%s_%s" % (flowcell.get_fc_date(),flowcell.get_fc_name())
wsheet_title = "%s_%s" % (flowcell.get_fc_date(), flowcell.get_fc_name())

# Flatten the project_data structure into a list
samples = {}
for sample in flowcell.get_samples():
if sample.get_name() in samples:
samples[sample.get_name()].add_sample(sample)
else:
samples[sample.get_name()] = sample

rows = []
for sample in samples.values():
row = (sample.get_name(),wsheet_title,sample.get_lane(),sample.get_read_count(),sample.get_rounded_read_count(),sample.get_comment(),"")
row = (sample.get_name(), wsheet_title, sample.get_lane(), \
sample.get_read_count(), sample.get_rounded_read_count())
rows.append(row)

# Write the data to the worksheet
return _write_to_worksheet(client,ssheet,wsheet_title,rows,[col_header[0] for col_header in SEQUENCING_RESULT_HEADER],False)

return _write_to_worksheet(client, ssheet, wsheet_title, rows, \
[col_header[0] for col_header in SEQUENCING_RESULT_HEADER[:5]], False)


def _write_project_report_summary_to_gdocs(client, ssheet):
"""Summarize the data from the worksheets and write them to a "Summary" worksheet"""

"""Summarize the data from the worksheets and write them to a "Summary" worksheet
"""

# Summary data
flowcells = {}
samples = {}
# Get the list of worksheets in the spreadsheet
wsheet_feed = bcbio.google.spreadsheet.get_worksheets_feed(client,ssheet)
# Loop over the worksheets and parse the data from the ones that contain flowcell data
wsheet_feed = bcbio.google.spreadsheet.get_worksheets_feed(client, ssheet)
# Loop over the worksheets and parse the data from the ones that contain
# flowcell data
for wsheet in wsheet_feed.entry:
wsheet_title = wsheet.title.text
if wsheet_title.endswith("_QC"):
@@ -115,81 +122,123 @@ def _write_project_report_summary_to_gdocs(client, ssheet):
bcbio.solexa.flowcell.get_flowcell_info(wsheet_title)
except ValueError:
continue

wsheet_data = bcbio.google.spreadsheet.get_cell_content(client,ssheet,wsheet,'2')

wsheet_data = bcbio.google.spreadsheet.get_cell_content(client, \
ssheet, wsheet, '2')
delim = ';'

# Add the results from the worksheet to the summarized data
for (sample_name,run_name,lane_name,read_count,_,comment,_) in wsheet_data:
sample = bcbio.pipeline.flowcell.Sample({'name': sample_name, 'read_count': read_count},bcbio.pipeline.flowcell.Lane({'lane': lane_name}),comment)

# Add the results from the worksheet to the summarized data
for (sample_name, run_name, lane_name, read_count, _) in wsheet_data:

sample = bcbio.pipeline.flowcell.Sample({ \
'name': sample_name, \
'read_count': read_count}, \
bcbio.pipeline.flowcell.Lane({'lane': lane_name}))

logger2.debug("Comment in Sample object: %s" % sample.get_comment())

if (sample_name in samples):
samples[sample_name]['object'].add_sample(sample,delim)
samples[sample_name]['flowcells'] += "%s%s" % (delim,wsheet_title)
samples[sample_name]['object'].add_sample(sample, delim)
samples[sample_name]['flowcells'] += "%s%s" % (delim, wsheet_title)
else:
samples[sample_name] = {'object': sample, 'flowcells': wsheet_title}

# Get the spreadsheet if it exists
# Otherwise, create it

wsheet_title = "Summary"


# Try getting already existing comments and 'pass' values
existing_summary_wsheet = \
bcbio.google.spreadsheet.get_worksheet(client, ssheet, wsheet_title)

num_rows = bcbio.google.spreadsheet.row_count(existing_summary_wsheet)

name_data = {}
for row_num in range(2, num_rows + 1):
sample_name, _, _, _, _, comment, pass_field = \
bcbio.google.spreadsheet.get_row( \
client, ssheet, existing_summary_wsheet, row_num)
name_data[sample_name] = [comment, pass_field]

# Flatten the project_data structure into a list
rows = []
for sample_data in samples.values():
sample = sample_data['object']
flowcells = sample_data['flowcells']
row = (sample.get_name(),flowcells,sample.get_lane(),sample.get_read_count(),sample.get_rounded_read_count(),sample.get_comment(),"")

sample_name = sample.get_name()
comment, pass_field = name_data.get(sample_name, [None, ""])

logger2.debug("Comment passed in to 'rows': %s" % comment)

row = [sample_name, flowcells, sample.get_lane(), \
sample.get_read_count(), sample.get_rounded_read_count(), \
comment, pass_field]

rows.append(row)

# Write the data to the worksheet
return _write_to_worksheet(client,ssheet,wsheet_title,rows,[col_header[0] for col_header in SEQUENCING_RESULT_HEADER],False)
return _write_to_worksheet(client, ssheet, wsheet_title, rows, \
[col_header[0] for col_header in SEQUENCING_RESULT_HEADER], False)
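The rewritten summary function preserves manually entered "comment" and "pass" columns by reading the existing worksheet back into `name_data` before regenerating rows; the core lookup is a `dict.get` with a default. A self-contained sketch of that preservation pattern (sample names and counts are illustrative, not from the source):

```python
# Sketch of preserving manual columns across a regenerated report:
# rows read back from the old "Summary" worksheet feed name_data, and
# samples without an existing row fall back to (None, "").
name_data = {
    "Sample_A": ["looks good", "PASS"],
    "Sample_B": ["low yield", ""],
}

rows = []
for sample_name, read_count in [("Sample_A", 1200000), ("Sample_C", 900000)]:
    comment, pass_field = name_data.get(sample_name, [None, ""])
    rows.append([sample_name, read_count, comment, pass_field])

print(rows)
```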


def write_run_report_to_gdocs(fc, fc_date, fc_name, ssheet_title, encoded_credentials, wsheet_title=None, append=False, split_project=False):
def write_run_report_to_gdocs(fc, fc_date, fc_name, ssheet_title, \
encoded_credentials, wsheet_title=None, append=False, split_project=False):
"""Upload the barcode read distribution for a run to google docs"""

# Connect to google and get the spreadsheet
client, ssheet = get_spreadsheet(ssheet_title,encoded_credentials)
client, ssheet = get_spreadsheet(ssheet_title, encoded_credentials)
if not client or not ssheet:
return False

# Get the projects in the run
projects = fc.get_project_names()
logger2.info("Will write data from the run %s_%s for projects: '%s'" % (fc_date,fc_name,"', '".join(projects)))

# If we will split the worksheet by project, use the project names as worksheet titles
logger2.info("Will write data from the run %s_%s for projects: '%s'" \
% (fc_date, fc_name, "', '".join(projects)))

# If we will split the worksheet by project, use the project
# names as worksheet titles.
success = True
header = _create_header(BARCODE_STATS_HEADER,fc.columns())
header = _create_header(BARCODE_STATS_HEADER, fc.columns())
if split_project:
# Filter away the irrelevent project entries and write the remaining to the appropriate worksheet
# Filter away the irrelevent project entries and write the
# remaining to the appropriate worksheet.
for project in projects:
pruned_fc = fc.prune_to_project(project)
success &= _write_to_worksheet(client,ssheet,project,pruned_fc.to_rows(),header,append)

# Else, set the default title of the worksheet to be a string of concatenated date and flowcell id
success &= _write_to_worksheet(client, ssheet, project, \
pruned_fc.to_rows(), header, append)

# Else, set the default title of the worksheet to be a string of
# concatenated date and flowcell id.
else:
if wsheet_title is None:
wsheet_title = "%s_%s" % (fc_date,fc_name)
success &= _write_to_worksheet(client,ssheet,wsheet_title,fc.to_rows(),header,append)
wsheet_title = "%s_%s" % (fc_date, fc_name)
success &= _write_to_worksheet(client, ssheet, wsheet_title, \
fc.to_rows(), header, append)

return success

def _write_to_worksheet(client,ssheet,wsheet_title,rows,header,append):


def _write_to_worksheet(client, ssheet, wsheet_title, rows, header, append):
"""Generic method to write a set of rows to a worksheet on google docs"""

# Convert the worksheet title to unicode
wsheet_title = _to_unicode(wsheet_title)

# Add a new worksheet, possibly appending or replacing a pre-existing worksheet according to the append-flag
wsheet = bcbio.google.spreadsheet.add_worksheet(client,ssheet,wsheet_title,len(rows)+1,len(header),append)

# Add a new worksheet, possibly appending or replacing a pre-existing
# worksheet according to the append-flag.
wsheet = bcbio.google.spreadsheet.add_worksheet(client, ssheet, \
wsheet_title, len(rows) + 1, len(header), append)
if wsheet is None:
logger2.error("ERROR: Could not add a worksheet '%s' to spreadsheet '%s'" % (wsheet_title,ssheet.title.text))
logger2.error("ERROR: Could not add a worksheet '%s' to spreadsheet '%s'" \
% (wsheet_title, ssheet.title.text))
return False

# Write the data to the worksheet
success = bcbio.google.spreadsheet.write_rows(client,ssheet,wsheet,header,rows)
success = \
bcbio.google.spreadsheet.write_rows(client, ssheet, wsheet, header, rows)
if success:
logger2.info("Wrote data to the '%s':'%s' worksheet" % (ssheet.title.text,wsheet_title))
logger2.info("Wrote data to the '%s':'%s' worksheet" \
% (ssheet.title.text, wsheet_title))
else:
logger2.error("ERROR: Could not write data to the '%s':'%s' worksheet" % (ssheet.title.text,wsheet_title))
logger2.error("ERROR: Could not write data to the '%s':'%s' worksheet" \
% (ssheet.title.text, wsheet_title))
return success

