lryaninsilico edited this page Dec 13, 2017 · 20 revisions

SafeSeqS Installation

Use pip to install the safeseqs package:

$ pip install safeseqs

SafeSeqS depends on several other packages and these will be installed automatically by pip if they are not already installed.

Required packages:

  • scipy

Contents

Quick Start
Input and Parameters
Example
____

Quick Start

SafeSeqS accepts the run time parameters listed below and requires (1) a set of input files in the Study Directory and (2) a settings JSON file in the Study Data Directory that identifies the study parameters.

Run Time Parameter Descriptions

-d Required Directory containing the Study input files.
-r Required Run directory. Will be created under the Study directory.
-sf Required Settings file that identifies the parameters for this run. Settings file must exist in the SafeSeqs Data Directory.
-w Optional Number of concurrent worker sub-processes to run. Default: 1
-s Optional Start Stepname. Used when re-starting partially completed run. Processing will begin at the named step.
-e Optional End Stepname. Used when a partial run is desired. Processing will end before the named step.

Example: python -m safeseqs_controller -d C:\labName\studyName -r runName -sf SettingsTemplate.json

Required Input Files in the Study Directory

An example of each file is provided in the \example sub-directory.

  1. safeseqs.json - Contains information about the study data. It identifies the FASTQ input files, the well barcode association filename, the UID length used in the study data, and the ASCII adjustment used in the study data.

  2. Study Data FASTQ Files - One or more sets of reads and barcode files - must be in the Study Directory. Both FASTQ and FASTQ.gz are supported.

  3. barcodemap.txt - Tab delimited file containing mapping of samples to barcodes and primer sets.

  4. primers.txt - Tab delimited file containing primers for the study.

Settings File in the Study Data Directory

A settings file must be identified at run time with the -sf argument. Parameters in the settings file control decision making during processing runs. A SettingsTemplate.json is delivered with the package. It is located in the Study Data Directory. It can be cloned and modified to create different run scenarios.

max_mismatches_for_used_reads Required Integer. Example 3
max_indels_for_used_reads Required Integer. Example 1
max_amp_per_UID_family Required Integer. Example 1
min_good_reads_usable_family Required Integer. Example 2
min_perc_good_reads_per_UID_family Required Integer. Example 95 = 95%
super_mut_perc_homegeneity Required Integer. Example 90 = 90%
default_indel_rate Required Float. Example .001
default_sbs_rate Required Float. Example .001
mark_UIDs_with_Ns_UnUsable Optional Valid Values: Yes or No Default: No
perform_opt_dup_removal Optional Valid Values: Yes or No Default: No
opt_dup_distance Required with perform_opt_dup_removal Integer. Example 5000
fh_limit Required File handle limit for system. Example 2048
load_bad_bc Optional Valid Values: Yes or No Default: No
load_not_used_bc Optional Valid Values: Yes or No Default: No

Optional Reference Data Files in the Study Data Directory

  1. COSMIC.txt - Tab delimited file containing COSMIC values for known single base pair mutations. An empty file is delivered with the package.
  2. dbSNP.txt - Tab delimited file containing dbSNP values for known single base pair mutations. An empty file is delivered with the package.

Input and Parameters

safeseqs.json Study File

Contains information about the study data. It identifies the FASTQ input files, the well barcode association file, the UID length used in the study data, and the ASCII adjustment used in the study data. The format of the file is:

{ "reads_files" : ["studydata_R1_001.fastq.gz", "studydata_R1_002.fastq.gz", ... ],
  "barcodes_files" : ["studydata_I1_001.fastq.gz", "studydata_I1_002.fastq.gz", ...],
  "reads_pattern" :   "",
  "barcodes_pattern" : "",
  "barcodemap" : "barcodemap.txt",
  "uidlength" : 14,
  "ascii_adj" : 33}

UID Length is used to separate the UID from the beginning of the Read Sequence in the reads file.
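As a minimal sketch of the split described above (not the package's own code), the UID can be taken as the first uidlength bases of the read sequence. The function name split_uid is hypothetical.

```python
from typing import Tuple


def split_uid(read_sequence: str, uid_length: int) -> Tuple[str, str]:
    """Cut the first uid_length bases off a read and return (uid, remainder).

    split_uid is a hypothetical helper used for illustration only.
    """
    if len(read_sequence) < uid_length:
        raise ValueError("read is shorter than the UID length")
    return read_sequence[:uid_length], read_sequence[uid_length:]


# With "uidlength" : 14, the first 14 bases form the UID.
uid, remainder = split_uid("ACGTACGTACGTACTTGGCCAAT", 14)
print(uid)        # ACGTACGTACGTAC
print(remainder)  # TTGGCCAAT
```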

ASCII Adjustment is used to convert the ASCII representation of the quality score from the reads file to its integer form.
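The conversion can be illustrated with a short sketch, assuming the usual FASTQ convention that each quality character encodes one score as its ASCII code minus the adjustment. The function name decode_quality is hypothetical and is not part of the package.

```python
def decode_quality(quality_string, ascii_adj=33):
    """Convert a FASTQ quality string to a list of integer scores.

    Each character's ASCII code minus ascii_adj gives the score.
    decode_quality is a hypothetical helper used for illustration only.
    """
    return [ord(character) - ascii_adj for character in quality_string]


# With "ascii_adj" : 33, "I" (ASCII 73) maps to 40 and "!" (ASCII 33) maps to 0.
print(decode_quality("II5!"))  # [40, 40, 20, 0]
```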

Study Data

Study Data FASTQ Files - One or more sets of reads and barcode files. Files must be in the Study Directory. Both FASTQ and FASTQ.gz formats are supported. There must be a barcodes file for each reads file. Files can be identified by providing a list of file names in the 'reads_files' and 'barcodes_files' parameters. Files can alternatively be identified by providing a text pattern that is unique to the filenames in the 'reads_pattern' and the 'barcodes_pattern' parameters. To facilitate proper pairing of input files, the patterns provided should be the only difference in the file names. The safeseqs controller will gather all FASTQ files with that pattern in the filename from the Study Directory.
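The pattern-based pairing can be sketched as follows. This is an illustration under stated assumptions, not the controller's actual logic: it assumes that swapping the reads pattern for the barcodes pattern in a reads filename yields the matching barcodes filename, and the function name pair_fastq_files is hypothetical.

```python
def pair_fastq_files(filenames, reads_pattern, barcodes_pattern):
    """Pair each reads file with its barcodes file by swapping the pattern.

    pair_fastq_files is a hypothetical helper used for illustration only.
    """
    fastq_files = [name for name in filenames
                   if name.endswith((".fastq", ".fastq.gz"))]
    barcode_files = {name for name in fastq_files if barcodes_pattern in name}

    pairs = []
    for reads_file in sorted(name for name in fastq_files if reads_pattern in name):
        # The patterns should be the only difference between the two names.
        expected = reads_file.replace(reads_pattern, barcodes_pattern)
        if expected not in barcode_files:
            raise ValueError("no barcodes file found for " + reads_file)
        pairs.append((reads_file, expected))
    return pairs


study_files = [
    "studydata_R1_001.fastq.gz",
    "studydata_I1_001.fastq.gz",
    "studydata_R1_002.fastq.gz",
    "studydata_I1_002.fastq.gz",
    "barcodemap.txt",
]
for reads_file, barcodes_file in pair_fastq_files(study_files, "_R1_", "_I1_"):
    print(reads_file, barcodes_file)
```

In practice the filename list would come from the Study Directory, for example via os.listdir.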

Barcode Map File

The barcode map file links barcodes in the study data to samples. It also identifies the primer sets used in the alignment step. The format is tab delimited with the following columns:

 barcodeNumber - unique number for barcode
 barcode - barcode identifier (valid values contain A,C,G,T)
 wbcPlateNumber - _part of the Sample Super Mutant header_
 template - sample identifier
 purpose - _part of the Sample Super Mutant header_
 gEsWellOrTotalULUsed - _part of the Sample Super Mutant header_
 mutOrTotalGEsWell - _part of the Sample Super Mutant header_
 ampMatchName - primer set for the sample
 row - 
 col - 
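A tab delimited file with these columns can be read with Python's csv module. The sketch below assumes the file begins with a header row containing the column names listed above; check the file in the \example sub-directory before relying on that. The function name load_barcode_map and the sample values are hypothetical.

```python
import csv
import io

BARCODE_MAP_COLUMNS = [
    "barcodeNumber", "barcode", "wbcPlateNumber", "template", "purpose",
    "gEsWellOrTotalULUsed", "mutOrTotalGEsWell", "ampMatchName", "row", "col",
]


def load_barcode_map(handle):
    """Return a dict mapping each barcode to its row (a dict of column values).

    Assumes a header row is present; load_barcode_map is hypothetical.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(BARCODE_MAP_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: " + ", ".join(sorted(missing)))
    return {row["barcode"]: row for row in reader}


# Made-up sample content, used only to show the call.
sample = "\t".join(BARCODE_MAP_COLUMNS) + "\n" + "\t".join(
    ["1", "ACGTAC", "1", "sampleA", "test", "10", "5", "KRAS_1", "A", "1"]) + "\n"
barcode_map = load_barcode_map(io.StringIO(sample))
print(barcode_map["ACGTAC"]["template"])      # sampleA
print(barcode_map["ACGTAC"]["ampMatchName"])  # KRAS_1
```

For a real file, pass an open file handle, for example open("barcodemap.txt", newline="").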

Primers File

The primers file identifies the primers to be searched in the alignment process. The format is tab delimited with the following columns:

 ampMatchName - primer name
 gene - gene being targeted
 read1 - forward primer sequence
 read2 - reverse primer sequence
 ampSeq - target sequence
 target_len - target sequence length
 chrom - chromosome being targeted
 readStrand - indicator for forward or reverse primer
 hg19_start - starting position on the chromosome for this primer sequence
 hg19_end - ending position on the chromosome for this primer sequence

Configuration Settings Parameters

Parameters in the settings file control decision making during processing runs. Multiple versions can be generated to create different run scenarios. Format of the file is:

 { "max_mismatches_for_used_reads" : 3,
   "max_indels_for_used_reads" : 1,
   "mark_UIDs_with_Ns_UnUsable" : "yes",
   "perform_opt_dup_removal" : "yes",
   "opt_dup_distance" : 5000,
   "max_amp_per_UID_family" : 1,
   "min_good_reads_usable_family" : 2,
   "min_perc_good_reads_per_UID_family" : 95,
   "super_mut_perc_homegeneity" : 90,
   "default_indel_rate" : 0.001,
   "default_sbs_rate" : 0.001,
   "fh_limit" : 2048,
   "load_bad_bc" : "no",
   "load_not_used_bc" : "yes"
}
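A cloned settings file can be checked before a run with a small script. The sketch below is not part of the package; it applies the required and optional rules listed in the Quick Start section. It assumes that Yes/No values are compared without regard to case, and the function name check_settings is hypothetical.

```python
import json

INTEGER_SETTINGS = [
    "max_mismatches_for_used_reads", "max_indels_for_used_reads",
    "max_amp_per_UID_family", "min_good_reads_usable_family",
    "min_perc_good_reads_per_UID_family", "super_mut_perc_homegeneity",
    "fh_limit",
]
FLOAT_SETTINGS = ["default_indel_rate", "default_sbs_rate"]
YES_NO_SETTINGS = [
    "mark_UIDs_with_Ns_UnUsable", "perform_opt_dup_removal",
    "load_bad_bc", "load_not_used_bc",
]


def is_yes(settings, name):
    """Optional Yes/No settings default to No; case is ignored (assumption)."""
    return str(settings.get(name, "no")).lower() == "yes"


def check_settings(settings):
    """Return a list of problems found in a settings dict (empty if none)."""
    problems = []
    for name in INTEGER_SETTINGS:
        value = settings.get(name)
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(name + " must be an integer")
    for name in FLOAT_SETTINGS:
        value = settings.get(name)
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            problems.append(name + " must be a number")
    for name in YES_NO_SETTINGS:
        if str(settings.get(name, "no")).lower() not in ("yes", "no"):
            problems.append(name + " must be Yes or No")
    if is_yes(settings, "perform_opt_dup_removal"):
        if not isinstance(settings.get("opt_dup_distance"), int):
            problems.append("opt_dup_distance is required with perform_opt_dup_removal")
    return problems


example = json.loads("""
{ "max_mismatches_for_used_reads" : 3, "max_indels_for_used_reads" : 1,
  "perform_opt_dup_removal" : "yes", "opt_dup_distance" : 5000,
  "max_amp_per_UID_family" : 1, "min_good_reads_usable_family" : 2,
  "min_perc_good_reads_per_UID_family" : 95, "super_mut_perc_homegeneity" : 90,
  "default_indel_rate" : 0.001, "default_sbs_rate" : 0.001, "fh_limit" : 2048 }
""")
print(check_settings(example))  # []
```

Note that json.loads rejects a trailing comma after the last entry, so parsing the file is itself a useful first check.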

max_mismatches_for_used_reads is used to set the maximum allowable SBS mismatches a read sequence can contain and still be considered a good read for Super Mutant consideration.

max_indels_for_used_reads is used to set the maximum allowable indels (insertions or deletions) a read sequence can contain and still be considered a good read for Super Mutant consideration.

max_amp_per_UID_family is used to set the maximum number of primers that can be found within a UID family for the family reads to be usable for Super Mutant consideration.

min_good_reads_usable_family is used to set the minimum number of good reads that must be found within a UID family for the family reads to be usable for Super Mutant consideration.

min_perc_good_reads_per_UID_family is used to set the minimum percentage of good reads vs. total reads that must be found within a UID family for the family reads to be usable for Super Mutant consideration.
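The two family-level thresholds above can be illustrated together. This is a sketch of the described rule, not the package's implementation: it assumes both thresholds are inclusive, and the function name family_is_usable is hypothetical.

```python
def family_is_usable(good_reads, total_reads,
                     min_good_reads_usable_family=2,
                     min_perc_good_reads_per_UID_family=95):
    """Apply both UID family thresholds; thresholds are assumed inclusive.

    family_is_usable is a hypothetical helper used for illustration only.
    """
    if total_reads <= 0:
        return False
    percent_good = 100.0 * good_reads / total_reads
    return (good_reads >= min_good_reads_usable_family
            and percent_good >= min_perc_good_reads_per_UID_family)


print(family_is_usable(19, 20))  # True: 19 good reads, 95% good
print(family_is_usable(18, 20))  # False: only 90% good
print(family_is_usable(1, 1))    # False: 100% good but fewer than 2 good reads
```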

super_mut_perc_homegeneity is used to set the minimum percentage of good reads vs. total reads that must contain a change within a UID family for the change to be considered a Super Mutant.

default_indel_rate - Default 0.001

default_sbs_rate - Default 0.001

mark_UIDs_with_Ns_UnUsable is used to control whether UIDs with N's (instead of C, G, A, T) will be used in Super Mutant calculations.

perform_opt_dup_removal is used to determine whether optical duplicates in the original study data should be removed before Super Mutant calculations.

opt_dup_distance is used to determine the distance required between reads. Reads that are closer will be considered optical duplicates and disregarded before Super Mutant calculations.

fh_limit is used to control the number of open file handles allowed by Python. It should be set larger than the number of barcodes used in the study.

load_bad_bc drives whether barcodes that do not exist in the barcode map file should be processed. These records usually indicate errors. Loading records with errors is not recommended. These records will not be eligible for Super Mutant calculations because they do not match a recognized barcode/sample combination. Processing them will slow down performance.

load_not_used_bc drives whether barcodes that exist in the barcode map file BUT do not belong to a sample in the study should be processed. These records usually indicate errors. Loading records with errors is not recommended. These records will not be eligible for Super Mutant calculations because they do not match a recognized barcode/sample combination. Processing them will slow down performance.

Example

This example assumes the input files are in a directory called C:\labName\studyName. The results are stored in the sub-directory \runName under the C:\labName\studyName directory. The file SettingsTemplate.json must exist in the SafeSeqS Project Data directory and is used to determine the run time settings. Up to 6 sub-processes may run concurrently.

python -m safeseqs_controller -d C:\labName\studyName -r runName -sf SettingsTemplate.json -w 6
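When runs are launched from a script, the same command line can be assembled in Python. The sketch below only builds the argument list from the run time parameters documented above; the function name build_safeseqs_command is hypothetical.

```python
import sys


def build_safeseqs_command(study_dir, run_name, settings_file,
                           workers=1, start_step=None, end_step=None):
    """Assemble the documented command line as an argument list.

    build_safeseqs_command is a hypothetical helper used for illustration only.
    """
    command = [sys.executable, "-m", "safeseqs_controller",
               "-d", study_dir, "-r", run_name, "-sf", settings_file,
               "-w", str(workers)]
    if start_step:
        command += ["-s", start_step]  # begin at the named step
    if end_step:
        command += ["-e", end_step]    # end before the named step
    return command


command = build_safeseqs_command(r"C:\labName\studyName", "runName",
                                 "SettingsTemplate.json", workers=6)
print(" ".join(command[1:]))
```

The resulting list can be passed to subprocess.run(command, check=True) once the package is installed.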
