Home
Use pip to install the safeseqs package:
$ pip install safeseqs
SafeSeqS depends on several other packages and these will be installed automatically by pip if they are not already installed.
Required packages:
- scipy
SafeSeqS accepts the run time parameters listed below and requires (1) a set of input files in the Study Directory and (2) a settings JSON file, in the Study Data Directory, that identifies the study parameters.
| Argument | Required/Optional | Description |
| --- | --- | --- |
| -d | Required | Directory containing the Study input files. |
| -r | Required | Run directory. Will be created under the Study directory. |
| -sf | Required | Settings file that identifies the parameters for this run. Settings file must exist in the SafeSeqs Data Directory. |
| -w | Optional | Number of concurrent worker sub-processes to run. Default: 1 |
| -s | Optional | Start Stepname. Used when restarting a partially completed run. Processing will begin at the named step. |
| -e | Optional | End Stepname. Used when a partial run is desired. Processing will end before the named step. |
Example:
python -m safeseqs_controller -d C:\labName\studyName -r runName -sf SettingsTemplate.json
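When runs are launched from a script, the same command can be assembled programmatically. The helper below is a minimal sketch, not part of the package: `build_safeseqs_command` is a hypothetical name, and it only uses the flags documented in the table above.

```python
import sys


def build_safeseqs_command(study_dir, run_name, settings_file,
                           workers=None, start_step=None, end_step=None):
    """Assemble the argument list for a safeseqs_controller run.

    Hypothetical helper: it only strings together the documented flags.
    """
    cmd = [sys.executable, "-m", "safeseqs_controller",
           "-d", study_dir, "-r", run_name, "-sf", settings_file]
    if workers is not None:
        cmd += ["-w", str(workers)]   # optional: concurrent worker sub-processes
    if start_step is not None:
        cmd += ["-s", start_step]     # optional: begin at the named step
    if end_step is not None:
        cmd += ["-e", end_step]       # optional: end before the named step
    return cmd


cmd = build_safeseqs_command(r"C:\labName\studyName", "runName",
                             "SettingsTemplate.json", workers=6)
print(" ".join(cmd[1:]))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the run.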
An example of each file is provided in the \example sub-directory.
- safeseqs.json - Contains information about the study data. It identifies the FASTQ input files, the well barcode association filename, the UID length being used in the study data, and the ASCII adjustment being used in the study data.
- Study Data FASTQ Files - One or more sets of reads and barcode files - must be in the Study Directory. Both FASTQ and FASTQ.gz are supported.
- barcodemap.txt - Tab delimited file containing mapping of samples to barcodes and primer sets.
- primers.txt - Tab delimited file containing primers for the study.
A settings file must be identified at run time with the -sf argument. Parameters in the settings file control decision making during processing runs. A SettingsTemplate.json is delivered with the package in the Study Data Directory; it can be copied and modified to create different run scenarios.
| Parameter | Required/Optional | Description |
| --- | --- | --- |
| fh_limit | Required | File handle limit for system. Example 2048 |
| max_mismatches_for_used_reads | Required | Integer. Example 3 |
| max_indels_for_used_reads | Required | Integer. Example 1 |
| max_amp_per_UID_family | Required | Integer. Example 1 |
| min_good_reads_usable_family | Required | Integer. Example 2 |
| min_perc_good_reads_per_UID_family | Required | Integer. Example 95 = 95% |
| super_mut_perc_homegeneity | Required | Integer. Example 90 = 90% |
| default_indel_rate | Required | Float. Example 0.001 |
| default_sbs_rate | Required | Float. Example 0.001 |
| mark_UIDs_with_Ns_UnUsable | Optional | Valid Values: Yes or No. Default: No |
| perform_opt_dup_removal | Optional | Valid Values: Yes or No. Default: No |
| opt_dup_distance | Required with perform_opt_dup_removal | Integer. Example 5000 |
| load_bad_bc | Optional | Valid Values: Yes or No. Default: No |
| load_not_used_bc | Optional | Valid Values: Yes or No. Default: No |
| save_merge | Optional | Valid Values: Yes or No. Default: No |
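A cloned settings file can be sanity-checked against the table above before a long run is started. The sketch below is an illustration only: `check_settings` is a hypothetical helper, and it assumes the settings file is a flat JSON object whose numeric values are stored as JSON numbers and whose Yes/No values are stored as strings. The package itself may be stricter or more lenient.

```python
# Parameter names are copied from the settings table; the expected types
# are assumptions based on the "Integer"/"Float"/"Yes or No" descriptions.
REQUIRED_INT = [
    "fh_limit", "max_mismatches_for_used_reads", "max_indels_for_used_reads",
    "max_amp_per_UID_family", "min_good_reads_usable_family",
    "min_perc_good_reads_per_UID_family", "super_mut_perc_homegeneity",
]
REQUIRED_FLOAT = ["default_indel_rate", "default_sbs_rate"]
YES_NO = ["mark_UIDs_with_Ns_UnUsable", "perform_opt_dup_removal",
          "load_bad_bc", "load_not_used_bc", "save_merge"]


def check_settings(settings):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_INT:
        value = settings.get(key)
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key}: expected an integer, got {value!r}")
    for key in REQUIRED_FLOAT:
        value = settings.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: expected a number, got {value!r}")
    for key in YES_NO:
        value = settings.get(key, "No")  # documented default is No
        if value not in ("Yes", "No"):
            problems.append(f"{key}: expected Yes or No, got {value!r}")
    if (settings.get("perform_opt_dup_removal", "No") == "Yes"
            and "opt_dup_distance" not in settings):
        problems.append("opt_dup_distance: required when perform_opt_dup_removal is Yes")
    return problems


# Values below are the examples given in the settings table.
example = {
    "fh_limit": 2048, "max_mismatches_for_used_reads": 3,
    "max_indels_for_used_reads": 1, "max_amp_per_UID_family": 1,
    "min_good_reads_usable_family": 2, "min_perc_good_reads_per_UID_family": 95,
    "super_mut_perc_homegeneity": 90, "default_indel_rate": 0.001,
    "default_sbs_rate": 0.001, "perform_opt_dup_removal": "No",
}
print(check_settings(example))  # []
```

To check a real file, load it with `json.load` and pass the resulting dictionary to `check_settings`.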
- COSMIC.txt - Tab delimited file containing COSMIC values for known single base pair mutations. An empty file is delivered with the package.
- dbSNP.txt - Tab delimited file containing dbSNP values for known single base pair mutations. An empty file is delivered with the package.
- safeseqs.json - Contains information about the study data. It identifies the FASTQ input files, the well barcode association filename, the UID length being used in the study data, and the ASCII adjustment being used in the study data. Format of the file is:
{
  "reads_files" : ["studydata_R1_001.fastq.gz", "studydata_R1_002.fastq.gz", ... ],
  "barcodes_files" : ["studydata_I1_001.fastq.gz", "studydata_I1_002.fastq.gz", ...],
  "reads_pattern" : "",
  "barcodes_pattern" : "",
  "barcodemap" : "barcodemap.txt",
  "uidlength" : 14,
  "ascii_adj" : 33
}
Study Data FASTQ Files - One or more sets of reads and barcode files - must be in the Study Directory. Both FASTQ and FASTQ.gz are supported. There must be a barcodes file for each reads file. Files can be identified by providing a list of file names in the 'reads_files' and 'barcodes_files' parameters. Files can alternatively be identified by providing a text pattern that is unique to the filenames in the 'reads_pattern' and the 'barcodes_pattern' parameters. To facilitate properly pairing files, the patterns provided should be the only difference in the file names. The safeseqs controller will gather all FASTQ files with that pattern in the filename from the Study Directory.
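The pairing rule described above (the pattern is the only difference between a reads file name and its barcodes file name) can be checked before a run. The sketch below is not the controller's actual logic: `pair_fastq_files` is a hypothetical helper, and the patterns `_R1_` and `_I1_` are assumed values chosen to match the example file names.

```python
def pair_fastq_files(filenames, reads_pattern, barcodes_pattern):
    """Pair each reads file with the barcodes file that differs only by the pattern.

    Returns (pairs, unpaired): pairs is a list of (reads, barcodes) tuples and
    unpaired lists reads files with no matching barcodes file.
    """
    fastq = {name for name in filenames
             if name.endswith((".fastq", ".fastq.gz"))}
    pairs, unpaired = [], []
    for name in sorted(fastq):
        if reads_pattern not in name:
            continue
        # Swap the reads pattern for the barcodes pattern to get the partner name.
        expected = name.replace(reads_pattern, barcodes_pattern, 1)
        if expected in fastq:
            pairs.append((name, expected))
        else:
            unpaired.append(name)
    return pairs, unpaired


files = [
    "studydata_R1_001.fastq.gz", "studydata_I1_001.fastq.gz",
    "studydata_R1_002.fastq.gz", "studydata_I1_002.fastq.gz",
    "studydata_R1_003.fastq.gz",  # no barcodes file on purpose
    "notes.txt",
]
pairs, unpaired = pair_fastq_files(files, "_R1_", "_I1_")
for reads, barcodes in pairs:
    print(reads, "<->", barcodes)
print("unpaired:", unpaired)
```

In practice the `filenames` argument could come from `os.listdir` on the Study Directory; a non-empty `unpaired` list suggests the patterns are not the only difference between the file names.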
The barcode map file links barcodes in the study data to samples. It also identifies the primer sets being used in the alignment step. The format is tab delimited with the following columns: barcodeNumber, barcode, wbcPlateNumber, template (sample), purpose, gEsWellOrTotalULUsed, mutOrTotalGEsWell, ampMatchName (primer set), row, col
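A barcode map in this layout can be read with Python's standard `csv` module. The sketch below is illustrative rather than the package's own reader: `read_barcodemap` is a hypothetical name, whether the real file carries a header row is an assumption exposed through the `has_header` flag, and the sample row contains made-up values.

```python
import csv
import io

# Column names taken from the list above, without the parenthetical notes.
BARCODEMAP_COLUMNS = [
    "barcodeNumber", "barcode", "wbcPlateNumber", "template", "purpose",
    "gEsWellOrTotalULUsed", "mutOrTotalGEsWell", "ampMatchName", "row", "col",
]


def read_barcodemap(handle, has_header=True):
    """Read a tab delimited barcode map into a list of dictionaries."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    records = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(BARCODEMAP_COLUMNS):
            raise ValueError(
                f"line {reader.line_num}: expected {len(BARCODEMAP_COLUMNS)} "
                f"columns, got {len(fields)}")
        records.append(dict(zip(BARCODEMAP_COLUMNS, fields)))
    return records


# Made-up sample content, for illustration only.
sample = ("\t".join(BARCODEMAP_COLUMNS) + "\n"
          "1\tACGTAC\t1\tsampleA\ttest\t100\t10\tprimerSet1\tA\t1\n")
records = read_barcodemap(io.StringIO(sample))
print(records[0]["template"], records[0]["ampMatchName"])
```

For a file on disk, open it with `open(path, newline="")` and pass the handle in place of the `StringIO` object.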
The primers file identifies the primers to be searched for in the alignment step. It contains the following columns:
- ampMatchName - primer name
- gene - gene being targeted
- read1 - forward primer sequence
- read2 - reverse primer sequence
- ampSeq - target sequence
- target_len - target sequence length
- chrom - chromosome being targeted
- readStrand - indicator for forward or reverse primer
- hg19_start - starting position on the chromosome for this primer sequence
- hg19_end - ending position on the chromosome for this primer sequence
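The primers file can be loaded and given a basic consistency check in a similar way. This is a sketch under stated assumptions, not the package's parser: `read_primers` is a hypothetical name, the file is assumed to be tab delimited with a header row carrying the column names above, the three numeric columns are assumed to hold integers, and every value in the sample row (including the `+` strand marker) is a made-up placeholder.

```python
import csv
import io

PRIMER_COLUMNS = [
    "ampMatchName", "gene", "read1", "read2", "ampSeq",
    "target_len", "chrom", "readStrand", "hg19_start", "hg19_end",
]
INT_COLUMNS = ("target_len", "hg19_start", "hg19_end")  # assumed integer fields


def read_primers(handle):
    """Read a tab delimited primers file (with header row) into dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in PRIMER_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    primers = []
    for row in reader:
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        if row["hg19_start"] > row["hg19_end"]:
            raise ValueError(
                f"{row['ampMatchName']}: hg19_start is after hg19_end")
        primers.append(row)
    return primers


# Placeholder content, for illustration only.
sample = ("\t".join(PRIMER_COLUMNS) + "\n"
          "ampA\tGENE1\tACGT\tTGCA\tACGTACGTAC\t10\tchr1\t+\t100\t109\n")
primers = read_primers(io.StringIO(sample))
print(primers[0]["ampMatchName"], primers[0]["target_len"])
```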
settings parameters
Example: the input files are in a directory called C:\labName\studyName. The results will be stored in the sub-directory \runName under the C:\labName\studyName directory. The file SettingsTemplate.json must exist in the SafeSeqS Project Data directory and will be used to determine the run time settings. Up to 6 sub-processes may run concurrently.
python -m safeseqs_controller -d C:\labName\studyName -r runName -sf SettingsTemplate.json -w 6