Version 1.6.12 (#58)
- Convert VCF files to Roary/Scoary format, allowing analysis on a wide range of variants (SNPs, indels, structural variations etc)
- Grab columns from the Roary input and put in the output (To get strain-specific protein names, for example)
- Scoary now comes with a manual, located under docs/tex/scoary_manual.pdf
- The log now includes the original command line
AdmiralenOla committed Jun 26, 2017
1 parent 929f6e6 commit 7f69278
Showing 30 changed files with 2,028 additions and 146 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
@@ -23,5 +23,7 @@ script:
# Advanced opts run
- python scoary.py -g scoary/exampledata/Gene_presence_absence.csv -t scoary/exampledata/Tetracycline_resistance.csv -p 0.01 1E-5 -c B EPW --collapse -m 50 -u -n scoary/exampledata/ExampleTree.nwk --threads 4 -o Test4 --no-time

- python scoary/vcf2scoary.py --force scoary/exampledata/Example.vcf

# Add test to verify output
- python 'tests/test_scoary_output.py'
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,10 @@
# CHANGELOG
v1.6.12 (Jun 2017)
- Convert VCF files to Roary/Scoary format, allowing analysis on a wide range of variants (SNPs, indels, structural variations etc)
- Grab columns from the Roary input and put in the output (To get strain-specific protein names, for example)
- Scoary now comes with a manual, located under docs/tex/scoary_manual.pdf
- The log now includes the original command line

v1.6.11 (Apr 2017)
- Blank values in trait files will now correctly be read as missing. Fixes (#54)
- Added --no_pairwise option for simple set differences / categorical enrichment analysis without causal hypothesis (as requested in, among others, #53)
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1 +1,2 @@
graft scoary/exampledata
graft docs
32 changes: 27 additions & 5 deletions README.md
@@ -32,9 +32,11 @@ Scoary is designed to take the gene_presence_absence.csv file from [Roary](https

## What's new?

**LATEST VERSION - 1.6.11**
**LATEST VERSION - 1.6.12**

Among the latest features is pairwise comparisons-free analysis, which allows major speed-ups for large datasets for people not interested in causal association. (e.g. just trying to infer genes enriched in particular groups etc)
- Convert VCF files to Roary/Scoary format, allowing analysis on a wide range of variants (SNPs, indels, structural variations etc)
- Grab columns from the Roary input and put in the output (To get strain-specific protein names, for example)
- Scoary now comes with a manual, located under docs/tex/scoary_manual.pdf

All changes are logged in the [CHANGELOG](CHANGELOG.md)

@@ -128,6 +130,19 @@ It should look something like this:

You can see an example of how the input files could look in the exampledata folder.

#### LS-BSR input
You can also use as input the pan-genome as called from Jason Sahl's program [LS-BSR](https://github.com/jasonsahl/LS-BSR) (Large-Scale Blast Score Ratio). The program includes a python script for converting LS-BSR output to the Roary/Scoary format.

#### Converting VCF files to use as Scoary input
From version 1.6.12, Scoary has a function for converting VCF files to the Roary/Scoary format. This allows you to use a wide range of variants (e.g. SNPs, indels, structural variants) in your input. The script can be run using the following command:

vcf2scoary myvariants.vcf

The current vcf2scoary script is a beta version, and may not correctly handle every VCF file. (Please report bugs!)

Note that Scoary simplifies analysis for variants with more than 2 alleles. Rather than comparing all possible contrasts, it compares each non-reference allele with the reference. Say, for example, that 4 different alleles exist at a known SNP site. Let's call them A, C, G, and T, and let A be the reference allele. (The reference category is always inferred from the VCF file.) This site can be encoded in a single line in a VCF file, but in the Scoary format it needs to be spread over 3 different lines (one for each contrast to the reference, i.e. A vs C, A vs G, and A vs T). Thus, not every possible contrast is tested in the association analysis! It is, for example, possible that there is a real difference in phenotype between G and T, but this contrast is not tested.
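
The contrast scheme described above can be illustrated with a minimal Python 3 sketch. This is not the vcf2scoary implementation: the function name `split_multiallelic`, the row-naming scheme and the genotype encoding (0 = reference, 1..n = position in the ALT list) are assumptions made for illustration only. In particular, coding isolates that carry a different non-reference allele as 0 in a given row is a choice of this sketch.

```python
def split_multiallelic(site_id, ref, alts, genotypes):
    """Spread one multi-allelic site over one binary row per non-reference allele.

    genotypes maps isolate name -> allele index (0 = reference,
    1..n = ALT alleles in the order given in `alts`).
    Returns {row_name: {isolate: 0 or 1}}.
    """
    rows = {}
    for alt_index, alt in enumerate(alts, start=1):
        row_name = "{}_{}_vs_{}".format(site_id, ref, alt)
        # 1 = isolate carries this ALT allele, 0 = any other allele (sketch assumption)
        rows[row_name] = {
            isolate: int(allele == alt_index)
            for isolate, allele in genotypes.items()
        }
    return rows


# Hypothetical site with reference A and three alternative alleles
genotypes = {"iso1": 0, "iso2": 1, "iso3": 2, "iso4": 3}
rows = split_multiallelic("snp42", "A", ["C", "G", "T"], genotypes)
for name in sorted(rows):
    print(name, rows[name])
```

One VCF line becomes three rows (A vs C, A vs G, A vs T). No row contrasts G directly with T, which is the limitation the paragraph above points out.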


#### Missing data
Don't worry if you have not measured the phenotype for all your isolates. From v1.6.9 on, Scoary can handle missing data. The missing values need to be specified as "NA", "." or "-". Note that Scoary does not actually specify any kind of uncertainty model for these missing values; it simply excludes them from further analysis.
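
As a rough illustration of what "excluding from further analysis" means, the following Python 3 sketch drops isolates whose trait value is one of the three missing-data markers. The function name `drop_missing` and the dictionary layout are hypothetical and do not reflect Scoary's internal code.

```python
# Missing-value markers named in the README text
MISSING = {"NA", ".", "-"}


def drop_missing(trait_values):
    """Keep only isolates with an observed (0/1) trait value.

    trait_values maps isolate name -> raw string read from the traits file.
    """
    return {
        isolate: int(value)
        for isolate, value in trait_values.items()
        if value.strip() not in MISSING
    }


raw = {"iso1": "1", "iso2": "NA", "iso3": "0", "iso4": ".", "iso5": "-"}
print(drop_missing(raw))
```

Only iso1 and iso3 remain in this example; no value is imputed for the others.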

@@ -351,13 +366,15 @@ A user wanted to screen for possible genetic causes of resistance towards a new

Mycobacterium abscessus contains numerous subspecies, and the user wanted to test only M. abscessus ss abscessus. The Roary output additionally contained other subspecies, such as M. abscessus ss masiliense. To avoid altering the Roary file, a csv was made containing the names of all isolates that were M. abscessus ss abscessus. To write a separate gene presence/absence file from only these isolates (and to speed up analysis), the -w parameter was used.

A high number of isolates was used in the experiment, and it was therefore decided to set the p-values low. The experiment was interested in causal mutations, so pairwise comparisons had to be used. (Population structure could be a major confounder). It was decided to require that the entire range of pairwise comparison values should be < 1E-4. Additionally, after 10.000 permutations the input configuration should be in the top 0.1 percentile. (Among 10.000 randomly permuted datasets, no more than 9 were allowed to have a even higher number of contrasting pairs for a gene to be included in results).
A high number of isolates was used in the experiment, and it was therefore decided to set a stringent significance threshold (i.e. require low p-values). The user was interested in causal mutations, so pairwise comparisons had to be used. (Population structure could be a major confounder.) It was decided to require that the entire range of pairwise comparison values should be < 1E-4. Additionally, after 10,000 permutations the input configuration should be in the top 0.1 percentile. (Among 10,000 randomly permuted datasets, no more than 9 were allowed to have an even higher number of contrasting pairs for a gene to be included in results.)
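
The permutation threshold can be checked with a small Python 3 sketch. It assumes the empirical p-value is computed as (r + 1) / (n + 1), the estimator from North et al. (2002), where r is the number of permuted datasets with a test statistic at least as large as the observed one. Whether Scoary counts ties in exactly this way is not stated here, so treat the function `empirical_p` as illustrative.

```python
def empirical_p(observed, permuted):
    """Empirical p-value (r + 1) / (n + 1).

    r = number of permuted statistics >= the observed statistic,
    n = number of permutations.
    """
    r = sum(1 for value in permuted if value >= observed)
    return (r + 1.0) / (len(permuted) + 1.0)


# Made-up statistics: 10,000 permutations, 9 of which beat the observed value
permuted_nine = [12] * 9 + [5] * 9991
# Same setup, but 10 permutations beat the observed value
permuted_ten = [12] * 10 + [5] * 9990

print(empirical_p(10, permuted_nine) < 1e-3)
print(empirical_p(10, permuted_ten) < 1e-3)
```

With 9 exceeding permutations the estimate is 10/10001, just under 1E-3; with 10 it is 11/10001, just over. This agrees with the "no more than 9" rule in the text.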

A ML phylogeny was built with a dedicated tree program and provided as a custom tree.

Finally, since it was possible that the resistance determinant was inherited as a set of genes (such as a plasmid), the --collapse flag was used to collapse genes with identical distribution patterns.
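
Collapsing genes with identical distribution patterns can be sketched in Python 3 as a grouping by presence/absence vector. The function name `collapse_identical` and the `--` separator used to join gene names are assumptions of this sketch, not a description of how the --collapse flag is implemented.

```python
from collections import defaultdict


def collapse_identical(presence):
    """Merge genes that share exactly the same presence/absence pattern.

    presence maps gene name -> tuple of 0/1, one entry per isolate,
    with isolates in the same order for every gene.
    Returns {merged_name: pattern}.
    """
    groups = defaultdict(list)
    for gene, pattern in presence.items():
        groups[tuple(pattern)].append(gene)
    return {
        "--".join(sorted(genes)): pattern
        for pattern, genes in groups.items()
    }


presence = {
    "geneA": (1, 0, 1),
    "geneB": (1, 0, 1),  # same pattern as geneA, e.g. carried on the same plasmid
    "geneC": (0, 1, 1),
}
print(collapse_identical(presence))
```

Genes that always co-occur are then tested once as a unit instead of appearing as separate, perfectly correlated hits.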

The analysis was run with the following command:
```
scoary -t Resistancefile -g Gene_presence_absence.csv -p 1E-4 1E-3 -c EPW P -e 10000 -w -r OnlyAbscessusIsolates.csv --collapse
scoary -t Resistancefile -g Gene_presence_absence.csv -p 1E-4 1E-3 -c EPW P -e 10000 -w -r OnlyAbscessusIsolates.csv --collapse -n raxmltree.nwk
```
Results showed that the top two hits were different alleles of the same gene, one positively and one negatively associated with the trait. (The two alleles were different enough to not be clustered as the same by Roary). The interpretation was that this gene was likely to play a role in the resistance pattern.

@@ -377,6 +394,12 @@ The analysis was run with the following command:
scoary -g gene_presence_absence.csv -t Hostgroup_membership.csv -p 1E-5 -c BH --no_pairwise
```

#### 3. SNPs linked to penicillin resistance in Neisseria meningitidis
For population structure-aware association analysis to work, it is imperative to work on trees that best represent the genealogy of the input sample. Due to the high frequency of recombination in Neisseria meningitidis, the internal tree builder in Scoary is likely to perform poorly. In this case (actually, in almost any case) it would be advisable to use a dedicated tree program and provide this to Scoary instead. There are now many programs that can produce phylogenetic trees where only the clonally evolved patterns are retained (i.e. "free" from the obfuscating effects of recombination). Some examples are [Gubbins](https://sanger-pathogens.github.io/gubbins), [ClonalFrameML](https://github.com/xavierdidelot/ClonalFrameML) and [BRATNextGen](http://www.helsinki.fi/bsg/software/BRAT-NextGen).
```
scoary -g gene_presence_absence.csv -t penicillinres.csv -n clonaltree.nwk
```

## License
Scoary is freely available under a GPLv3 license.

@@ -411,7 +434,6 @@ Most certainly not.

## Coming soon
- Multiprocessing also when using the GUI. (The GUI currently only uses a single thread. See Issues).
- Continous integration
- Support for non-binary traits
- Please feel free to suggest improvements, point out bugs or methods that could be better optimized.

1 change: 1 addition & 0 deletions docs/scoary_manual.pdf
50 changes: 50 additions & 0 deletions docs/tex/citations.bib
@@ -0,0 +1,50 @@
@article{north2002note,
title={A note on the calculation of empirical P values from Monte Carlo procedures},
author={North, Bernard V and Curtis, David and Sham, Pak C},
journal={The American Journal of Human Genetics},
volume={71},
number={2},
pages={439--441},
year={2002},
publisher={Cell Press}
}
@article{read1995inference,
title={Inference from binary comparative data},
author={Read, Andrew F and Nee, Sean},
journal={Journal of Theoretical Biology},
volume={173},
number={1},
pages={99--108},
year={1995},
publisher={Elsevier}
}
@article{maddison2000testing,
title={Testing character correlation using pairwise comparisons on a phylogeny},
author={MADDISON, WAYNE P},
journal={Journal of Theoretical Biology},
volume={202},
number={3},
pages={195--204},
year={2000},
publisher={Elsevier}
}
@article{brynildsrud2016rapid,
title={Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary},
author={Brynildsrud, Ola and Bohlin, Jon and Scheffer, Lonneke and Eldholm, Vegard},
journal={Genome biology},
volume={17},
number={1},
pages={238},
year={2016},
publisher={BioMed Central}
}
@article{page2015roary,
title={Roary: rapid large-scale prokaryote pan genome analysis},
author={Page, Andrew J and Cummins, Carla A and Hunt, Martin and Wong, Vanessa K and Reuter, Sandra and Holden, Matthew TG and Fookes, Maria and Falush, Daniel and Keane, Jacqueline A and Parkhill, Julian},
journal={Bioinformatics},
volume={31},
number={22},
pages={3691--3693},
year={2015},
publisher={Oxford Univ Press}
}
Binary file added docs/tex/images/badlink.png
Binary file added docs/tex/images/bestpossible.png
Binary file added docs/tex/images/gene_presence_and_absence.png
Binary file added docs/tex/images/goodlink.png
Binary file added docs/tex/images/scoary_gui.png
Binary file added docs/tex/images/scoary_logo.png
Binary file added docs/tex/images/worstpossible.png
77 changes: 77 additions & 0 deletions docs/tex/scoary_manual.aux
@@ -0,0 +1,77 @@
\relax
\citation{brynildsrud2016rapid}
\citation{page2015roary}
\@writefile{toc}{\contentsline {section}{\numberline {1}Scoary utility}{1}}
\@writefile{toc}{\contentsline {section}{\numberline {2}Installation}{1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Dependencies}{1}}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Scoary GUI}}{2}}
\newlabel{fig:gui}{{1}{2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Installation}{2}}
\@writefile{toc}{\contentsline {section}{\numberline {3}Basic usage}{2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Getting started}{2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Input}{3}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.2.1}Gene presence/absence file}{3}}
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces Input Roary file (Source: http://sanger-pathogens.github.io/Roary)}}{3}}
\newlabel{fig:gpa}{{2}{3}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.2.2}Traits file}{3}}
\gdef \LT@i {\LT@entry
{1}{69.32pt}\LT@entry
{1}{68.92001pt}\LT@entry
{1}{68.92001pt}\LT@entry
{1}{68.92001pt}\LT@entry
{1}{68.92001pt}}
\@writefile{lot}{\contentsline {table}{\numberline {1}{A properly formatted traits file}}{4}}
\newlabel{tab:traits}{{1}{4}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.2.3}Converting VCF files to use as Scoary input}{4}}
\gdef \LT@ii {\LT@entry
{2}{143.64886pt}\LT@entry
{1}{201.35114pt}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Output}{5}}
\@writefile{lot}{\contentsline {table}{\numberline {2}{Explanation of columns in the output}}{5}}
\newlabel{tab:cols}{{2}{5}}
\@writefile{lot}{\contentsline {table}{\numberline {2}{Explanation of columns in the output}}{6}}
\newlabel{tab:cols}{{2}{6}}
\@writefile{toc}{\contentsline {section}{\numberline {4}Advanced usage}{7}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Restricting analysis to a subset of isolates with the -r parameter}{9}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Getting input right when using non-standard Roary files using -s}{9}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Controlling the output}{10}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.4}Writing a newick tree}{10}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.5}Setting a custom tree with the -n parameter}{10}}
\citation{north2002note}
\citation{read1995inference}
\citation{maddison2000testing}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.6}Post-analysis label-switching permutations}{11}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.7}Collapsing correlated variants}{11}}
\@writefile{toc}{\contentsline {section}{\numberline {5}Population structure}{11}}
\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces A not-so-significant link between gene and trait}}{12}}
\newlabel{fig:badlink}{{3}{12}}
\@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces A significant link between gene and trait}}{13}}
\newlabel{fig:goodlink}{{4}{13}}
\@writefile{lof}{\contentsline {figure}{\numberline {5}{\ignorespaces A best possible pairing}}{13}}
\newlabel{fig:best}{{5}{13}}
\@writefile{lof}{\contentsline {figure}{\numberline {6}{\ignorespaces A worst possible pairing}}{14}}
\newlabel{fig:worst}{{6}{14}}
\@writefile{toc}{\contentsline {section}{\numberline {6}Example data}{14}}
\@writefile{toc}{\contentsline {section}{\numberline {7}Example use cases}{15}}
\gdef \LT@iii {\LT@entry
{1}{69.32pt}\LT@entry
{1}{68.92001pt}\LT@entry
{1}{68.92001pt}\LT@entry
{1}{68.92001pt}\LT@entry
{1}{68.92001pt}}
\@writefile{lot}{\contentsline {table}{\numberline {3}{Traits input for example 2}}{16}}
\newlabel{tab:hostgroups}{{3}{16}}
\@writefile{toc}{\contentsline {section}{\numberline {8}License}{17}}
\@writefile{toc}{\contentsline {section}{\numberline {9}Etymology}{17}}
\@writefile{toc}{\contentsline {section}{\numberline {10}FAQ}{17}}
\bibdata{citations}
\bibcite{brynildsrud2016rapid}{1}
\@writefile{toc}{\contentsline {section}{\numberline {11}Acknowledgements}{19}}
\@writefile{toc}{\contentsline {section}{\numberline {12}Feedback}{19}}
\@writefile{toc}{\contentsline {section}{\numberline {13}Citation}{19}}
\@writefile{toc}{\contentsline {section}{\numberline {14}Contact}{19}}
\bibcite{page2015roary}{2}
\bibcite{north2002note}{3}
\bibcite{read1995inference}{4}
\bibcite{maddison2000testing}{5}
\bibstyle{unsrt}
33 changes: 33 additions & 0 deletions docs/tex/scoary_manual.bbl
@@ -0,0 +1,33 @@
\begin{thebibliography}{1}

\bibitem{brynildsrud2016rapid}
Ola Brynildsrud, Jon Bohlin, Lonneke Scheffer, and Vegard Eldholm.
\newblock Rapid scoring of genes in microbial pan-genome-wide association
studies with scoary.
\newblock {\em Genome biology}, 17(1):238, 2016.

\bibitem{page2015roary}
Andrew~J Page, Carla~A Cummins, Martin Hunt, Vanessa~K Wong, Sandra Reuter,
Matthew~TG Holden, Maria Fookes, Daniel Falush, Jacqueline~A Keane, and
Julian Parkhill.
\newblock Roary: rapid large-scale prokaryote pan genome analysis.
\newblock {\em Bioinformatics}, 31(22):3691--3693, 2015.

\bibitem{north2002note}
Bernard~V North, David Curtis, and Pak~C Sham.
\newblock A note on the calculation of empirical p values from monte carlo
procedures.
\newblock {\em The American Journal of Human Genetics}, 71(2):439--441, 2002.

\bibitem{read1995inference}
Andrew~F Read and Sean Nee.
\newblock Inference from binary comparative data.
\newblock {\em Journal of Theoretical Biology}, 173(1):99--108, 1995.

\bibitem{maddison2000testing}
WAYNE~P MADDISON.
\newblock Testing character correlation using pairwise comparisons on a
phylogeny.
\newblock {\em Journal of Theoretical Biology}, 202(3):195--204, 2000.

\end{thebibliography}
46 changes: 46 additions & 0 deletions docs/tex/scoary_manual.blg
@@ -0,0 +1,46 @@
This is BibTeX, Version 0.99d (TeX Live 2013/Debian)
Capacity: max_strings=35307, hash_size=35307, hash_prime=30011
The top-level auxiliary file: scoary_manual.aux
The style file: unsrt.bst
Database file #1: citations.bib
You've used 5 entries,
1791 wiz_defined-function locations,
484 strings with 4517 characters,
and the built_in function-call counts, 1305 in all, are:
= -- 112
> -- 69
< -- 0
+ -- 25
- -- 20
* -- 116
:= -- 220
add.period$ -- 15
call.type$ -- 5
change.case$ -- 5
chr.to.int$ -- 0
cite$ -- 5
duplicate$ -- 45
empty$ -- 118
format.name$ -- 20
if$ -- 271
int.to.chr$ -- 0
int.to.str$ -- 5
missing$ -- 5
newline$ -- 28
num.names$ -- 5
pop$ -- 5
preamble$ -- 1
purify$ -- 0
quote$ -- 0
skip$ -- 16
stack$ -- 0
substring$ -- 112
swap$ -- 5
text.length$ -- 0
text.prefix$ -- 0
top$ -- 0
type$ -- 0
warning$ -- 0
while$ -- 14
width$ -- 6
write$ -- 57
Binary file added docs/tex/scoary_manual.dvi
Binary file not shown.
