Merge branch 'docs' of github.com:RomainFeron/RadSex into docs

SexGenomicsToolkit · Oct 8, 2018 · b9f3f4f · b9f3f4f
2 parents 9edaf1b + 7bbb36f
commit b9f3f4f
Show file tree

Hide file tree

Showing 3 changed files with 151 additions and 282 deletions.
diff --git a/README.md b/README.md
@@ -13,12 +13,13 @@ The results of RADSex are meant to be visualized with the `radsex-vis` R package
 
 This pipeline was developed for the PhyloSex project, which investigates sex determining factors in a wide range of fish species.
 
-## Full Documentation
+## Documentation
 
-The full documentation for RADSex (still under construction) is available there: https://radsex.readthedocs.io/en/docs .
+This README file contains a simple documentation, including a basic installation guide as well as a quick start section. 
+
+The full documentation for RADSex (still under construction) is available [there](https://radsex.readthedocs.io/en/docs).
 It contains a complete Getting Started section, a detailed usage for all functions, and real-life datasets examples covering many situations.
 
-A simple documentation is provided in this readme file, including a simple installation guide and a simple getting started section.
 
 ## Requirements
 
@@ -114,283 +115,6 @@ The resulting file `mapping.tsv` is a table with five columns:
 The mapping results generated by `map` can be visualized with the `plot_genome()` function of `radsex-vis`, which generates a [circular plot](./examples/figures/genome.png). 
 Mapping results for a specific scaffold can be visualized with the `plot_scaffold()` function to generate a [linear plot](./examples/figures/scaffold.png).
 
-## Usage
-
-### General
-
-`radsex <command> [options]`
-
-**Available commands** :
-
-Command            | Description
------------------- | ------------
-`process`    | Compute a table of coverage from a set of demultiplexed reads
-`distrib` | Compute the distribution of sequences between sexes
-`subset` | Extract a subset of the coverage table
-`signif` | Extract sequences significantly associated with sex
-`loci` | Recreate polymorphic loci from a subset of coverage table
-`mapping` | Map a subset of sequences (coverage table or fasta) to a reference genome and output sex-association metrics for each mapped sequence
-`freq` | Compute sequence frequencies for the population
-
-### process
-
-`radsex process --input-dir input_dir_path --output-file output_file_path [ --threads n_threads --min-coverage min_cov ]`
-
-*Generates a table of coverage for all individuals and all sequences. The output is a tabulated file, where each line contains the ID, sequence and coverage for each individual of a sequence.*
-
-**Options** :
-
-Option | Description
---- | ---
-`--input-dir` | Path to a folder containing demultiplexed reads |
-`--output-file` | Path to the output file |
-`--threads` | Number of threads to use (default: 1) |
-`--min-coverage` | Minimum coverage to consider a sequence in an individual (default: 1) |
-
-### distrib
-
-`radsex distrib --input-file input_file_path --output-file output_file_path --popmap-file popmap_file_path [ --min-coverage min_cov --output-matrix ]`
-
-*Generates a table which contains the number of sequences present with coverage higher than min_cov and the probability of association with sex for every combination of number of males and number of females.*
-
-**Options** :
-
-Option | Description
------- | -------------
-`--input-file` | Path to an coverage table obtained with `process` |
-`--output-file` | Path to the output file |
-`--popmap-file` | Path to a popmap file indicating the sex of each individual |
-`--min-coverage` | Minimum coverage to consider a sequence present in an individual (default: 1) |
-`--output-matrix` | If true, outputs the resutls as a matrix with males in columns and females in rows instead of a table (default: false) |
-
-### subset
-
-`radsex subset --input-file input_file_path --output-file output_file_path --popmap-file popmap_file_path [ --output-format output_format --min-coverage min_cov --min-males min_males --min-females min_females --max-males max_males --max-females max_females --min-individuals min_individuals --max-individuals max_individuals]`
-
-*Filters the coverage table to only export sequences present in any combination of M males and F females, with min_males ≤ M ≤ max_males, min_females ≤ F ≤ max_females, and min_individuals ≤ M + F ≤ max_individuals*
-
-**Options** :
-
-Option | Description
---- | ---
-`--input-file` | Path to an coverage table obtained with `process` |
-`--output-file` | Path to the output file |
-`--popmap-file` | Path to a popmap file indicating the sex of each individual |
-`--output-format` | Output format, either "table" or "fasta" |
-`--min-coverage` | Minimum coverage to consider a sequence present in an individual (default: 1) |
-`--min-males` | Minimum number of males with a retained sequence |
-`--min-females` | Minimum number of females with a retained sequence |
-`--max-males` | Maximum number of males with a retained sequence |
-`--max-females` | Maximum number of females with a retained sequence |
-`--max-individuals` | Maximum number of individuals with a retained sequence |
-`--max-individuals` | Maximum number of individuals with a retained sequence |
-
-### signif
-
-`radsex signif --input-file input_file_path --output-file output_file_path --popmap-file popmap_file_path [ --output-format output_format --min-coverage min_cov ]`
-
-*Filters the coverage table to only export sequences significantly associated with sex, defined as sequences for which p < 0.05 (after Bonferroni correction), with p being the p-value of a chi-squared test on the numbers of males and females.*
-
-**Options** :
-
-Option | Description
---- | ---
-`--input-file` | Path to an coverage table obtained with `process` |
-`--output-file` | Path to the output file |
-`--popmap-file` | Path to a popmap file indicating the sex of each individual |
-`--output-format` | Output format, either "table" or "fasta" |
-`--min-coverage` | Minimum coverage to consider a sequence present in an individual (default: 1) |
-
-### map
-
-`radsex map --input-file input_file_path --output-file output_file_path --popmap-file popmap_file_path --genome-file genome_file_path [ --min-coverage min_cov --min-quality min_quality --min-frequency min_frequency ]`
-
-*Maps the sequences from the coverage table to a reference genome and outputs mapping position, sex bias, and p-value of association with sex for each mapped sequence.*
-
-**Options** :
-
-Option | Description
---- | ---
-`--input-file` | Path to an coverage table obtained with `process` |
-`--output-file` | Path to the output file |
-`--popmap-file` | Path to a popmap file indicating the sex of each individual |
-`--genome-file` | Path to a reference genome file in fasta format |
-`--min-coverage` | Minimum coverage to consider a sequence present in an individual (default: 1) |
-`--min-quality` | Minimum mapping quality, as defined in BWA, to consider a sequence properly mapped (default: 20) |
-`--min-frequency` | Minimum frequency in at least one sex for a sequence to be retained (default: 0.25) |
-
-### freq
-
-`radsex freq --input-file input_file_path --output-file output_file_path [ --min-coverage min_cov ]`
-
-*Computes the sequences frequencies for the entire population*
-
-**Options** :
-
-Option | Description
---- | ---
-`--input-file` | Path to an coverage table obtained with `process` |
-`--output-file` | Path to the output file |
-`--min-coverage` | Minimum coverage to consider a sequence present in an individual (default: 1) |
-
-## FILE FORMATS
-
-### Population map
-
-A population map file is a tabulated file without header, with individual ID in the first column and sex in the second column. 
-Sex is encoded as 'M' for males, 'F' for females, and 'N' for undetermined. An example of population map is given below:
-
-```
-individual_1    M
-individual_2    M
-individual_3    F
-individual_4    N
-individual_5    F
-```
-
-Individual IDs can be anything, but it is important that they correspond to the name of the demultiplexed files. 
-For instance, the reads file for *individual_1* should be named `individual_1.fastq.gz` (in any format supported by your demultiplexer). 
-If you are using Stacks with a barcodes file for demultiplexing, just make sure that individual IDs in the barcodes file and in the population map are the same.
-
-### Chromosomes names
-
-Genome-wide results from the `map` command are visualized using the `plot_genome()` function of `radsex-vis`. 
-This function can automatically detect chromosomes in the reference file if their name starts with 'LG' or 'chr' (case unsensitive). 
-If this is not the case, you should provide a chromosomes names file to `plot_genome()`. 
-This file should be a tabulated file without header, with scaffold ID in the reference in the first column and corresponding chromosome name in the second column. 
-An example of chromosomes names file is given below for the [Northern Pike genome](https://www.ncbi.nlm.nih.gov/genome/?term=esox%20lucius) :
-
-```
-NC_025968.3     LG01
-NC_025969.3     LG02
-NC_025970.3     LG03
-NC_025971.3     LG04
-NC_025972.3     LG05
-NC_025973.3     LG06
-NC_025974.3     LG07
-NC_025975.3     LG08
-NC_025976.3     LG09
-NC_025977.3     LG10
-NC_025978.3     LG11
-NC_025979.3     LG12
-NC_025980.3     LG13
-NC_025981.3     LG14
-NC_025982.3     LG15
-NC_025983.3     LG16
-NC_025984.3     LG17
-NC_025985.3     LG18
-NC_025986.3     LG19
-NC_025987.3     LG20
-NC_025988.3     LG21
-NC_025989.3     LG22
-NC_025990.3     LG23
-NC_025991.3     LG24
-NC_025992.3     LG25
-```
-
-The chromosomes names can be anything starting with 'LG' or 'chr' (LG1, LG_01, chr1, chromosome_01 ...).
-
-### RADSex output files
-
-#### Coverage table
-
-Coverage tables tabulated files with header generated by the `process` command for the entire dataset, and by the `subset` and `signif` commands for a subset of sequences. 
-The first column contains the sequence ID, and the second column contains the sequence itself. Each other column contains the coverage of the corresponding sequence in a given individual. 
-An example of coverage table is given below (the sequence was shortened for visual reasons):
-
-```
-ID    Sequence    individual_1    individual_2    individual_3    individual_4    individual_5
-0    TGCA..TATT         0              15              24              17              21
-1    TGCA..GACC        20              18               3              26               4
-2    TGCA..ATCG         2               1               5              16               0
-3    TGCA..CCGA        14              29              23               2              19
-```
-
-#### FASTA file
-
-FASTA files can be generated by the `subset` and `signif` commands for a subset of sequences.
-
-In the `subset` analysis, FASTA headers are generated as follows:
-
-```
-<ID>_<number of males>M_<number of females>F_cov:<minimum coverage>
-```
-
-In the `signif` analysis, another field containing the p-value of association with sex is added:
-
-```
-<ID>_<number of males>M_<number of females>F_cov:<minimum coverage>_p:<p-value>
-```
-
-#### Distribution of sequences between sexes
-
-##### Table format
-
-A table of distribution of sequences between sexes is a tabulated file with header generated by the `distrib` command. 
-The first and second columns indicate the number of males and females in which a sequence is present, the third column contains the number of sequences found in the corresponding number of males and females, 
-the fourth column contains the p-value of a chi-squared test for association with sex, and the fifth column indicates whether this p-value is significant after Bonferroni correction. 
-An example of sex distribution table is given below for 3 males and 3 females:
-
-```
-Males    Females    Sequences      P      Signif
-  0         1           7          1      False
-  0         2           3       0.39      False
-  0         3           1       0.10      False
-  1         0           6          1      False
-  1         1           5          1      False
-  1         2           1          1      False
-  1         3           2       0.39      False
-  2         0           3       0.39      False
-  2         1           8          1      False
-  2         2           4          1      False
-  2         3           2          1      False
-  3         0           4       0.10      False
-  3         1           7       0.39      False
-  3         2           6          1      False
-  3         3           9          1      False
-```
-
-In this example, there are 68 sequences in total, therefore sequences are significantly associated with sex if the p-value of a chi-squared test on the number of males and females is lower than 0.05 / 68 = 0.00074.
-
-##### Matrix format
-
-The distribution of sequences between sexes can also be output as a matrix, which is a tabulated file without header, with number of females as rows and number of males as rows. 
-The sex distribution matrix for the example described above is given below:
-
-```
-0    6    3    4
-7    5    8    7
-3    1    4    6
-1    2    2    9
-```
-
-#### Mapping results
-
-Results from the `map` command are output as a tabulated file with header. 
-The first column contains the sequence ID, the second column contains the contig to which the sequence mapped in the reference genome, and the third columns contains the position where the sequence mapped on the contig. 
-The fourth column contains a sex-bias value, defined as `(number of males with the sequence) / (total number of males) - (number of females with the sequence) / (total number of females)`. 
-The fifth column contains the p-value of a chi-squared test for association with sex, and the sixth column indicates whether this p-value is significant after Bonferroni correction. 
-An example of mapping results is given below:
-
-```
-Sequence      Contig      Position      SexBias        P     Signif
-0             LG09        10052920            0        1     False
-1             LG45         4008419            0        1     False
-2             LG06        20521435            0        1     False
-3             LG24         7643946         0.13     0.44     False
-4             LG06        16975491            0        1     False
-5             LG27        16580048            0        1     False
-6             LG49         7206356         0.03        1     False
-7             LG30         5571989            0        1     False
-8             LG05        20094761            0        1     False
-9             LG14        20088495            0        1     False
-10            LG34        11566459        -0.04        1     False
-11            LG21        17338149            0        1     False
-12            LG05        14652417         0.13     0.55     False
-13            LG25        23851527         0.75    0.001     True
-
-```
-
 ## LICENSE
 
 Copyright (C) 2018 Romain Feron and INRA LPGP

diff --git a/docs/input_files.rst b/docs/input_files.rst
@@ -1,8 +1,65 @@
 Input file formats
 ==================
 
+Reads files
+-----------
+
+RADSex accepts demultiplexed reads files as first input. RADSex should work with any demultiplexed RAD-sequencing reads files regardless of technology (single / double digest) or enzyme. Input files can be in fasta or fastq formats, and can be compressed. RADSex uses file extensions to detect input files, and supports the following extensions: **.fa**, **.fa.gz**, **.fq**, **.fq.gz**, **.fasta**, **.fasta.gz**, **.fastq**, **.fastq.gz**, **.fna**, and **.fna.gz**.
+
 Population map
 --------------
 
+A population map file is a tabulated file (TSV, tab as a separator) without header, with individual ID in the first column and sex in the second column. Sex is encoded as **"M"** for males, **"F"** for females, and **"N"** for undetermined. An example of population map is given below:
+
+::
+
+	individual_1    M
+	individual_2    M
+	individual_3    F
+	individual_4    N
+	individual_5    F
+
+Individual IDs can be anything, but it is important that they correspond to the name of the demultiplexed files. 
+For instance, the reads file for *individual_1* should be named `individual_1.fastq.gz` (in any format supported by your demultiplexer).
+
+If you are using Stacks with a barcodes file for demultiplexing, just make sure that individual IDs in the barcodes file and in the population map are the same.
+
+
 Chromosomes names file
 ----------------------
+
+Genome-wide results from the ``map`` command are visualized using the ``plot_genome()`` function of ``radsex-vis``. 
+This function can automatically detect chromosomes in the reference file if their name starts with 'LG' or 'chr' (case unsensitive). If this is not the case, you should provide a chromosomes names file to ``plot_genome()``. 
+This file should be a tabulated file without header, with scaffold ID in the reference in the first column and corresponding chromosome name in the second column.
+
+An example of chromosomes names file is given below for the `Northern Pike <https://www.ncbi.nlm.nih.gov/genome/?term=esox%20lucius>`_ genome.
+
+::
+
+	NC_025968.3     LG01
+	NC_025969.3     LG02
+	NC_025970.3     LG03
+	NC_025971.3     LG04
+	NC_025972.3     LG05
+	NC_025973.3     LG06
+	NC_025974.3     LG07
+	NC_025975.3     LG08
+	NC_025976.3     LG09
+	NC_025977.3     LG10
+	NC_025978.3     LG11
+	NC_025979.3     LG12
+	NC_025980.3     LG13
+	NC_025981.3     LG14
+	NC_025982.3     LG15
+	NC_025983.3     LG16
+	NC_025984.3     LG17
+	NC_025985.3     LG18
+	NC_025986.3     LG19
+	NC_025987.3     LG20
+	NC_025988.3     LG21
+	NC_025989.3     LG22
+	NC_025990.3     LG23
+	NC_025991.3     LG24
+	NC_025992.3     LG25
+
+.. note:: Any scaffold included in the chromosomes names file will be considered a chromosome to be plotted as a sector. In most NCBI genomes, mitochondria are also named NC_XXX. As mitochondria are usually too small to be included as a sector in the circos plot, you should not add them to the chromosomes names file.