Input and Parameters

Michael Ryan edited this page Dec 18, 2017 · 8 revisions

Files that must be in the Study Directory (-d directory)

Study Data

Study Data FASTQ Files - One or more pairs of reads and barcode files. Files must be in the Study Directory. Both FASTQ and compressed FASTQ.gz formats are supported.

safeseqs.json Study File

This is a JSON formatted file that contains information about the study data. It identifies the FASTQ input files, the well barcode to sample association file, the UID length used in the study data, and the ASCII offset of the read quality scores in the study data. Format of the file is:

{ "reads_files" : ["studydata_R1_001.fastq.gz", "studydata_R1_002.fastq.gz", ... ],
  "barcodes_files" : ["studydata_I1_001.fastq.gz", "studydata_I1_002.fastq.gz", ...],
  "reads_pattern" :   "",
  "barcodes_pattern" : "",
  "barcodemap" : "barcodemap.txt",
  "uidlength" : 14,
  "ascii_adj" : 33}

NOTE: the pattern and files parameters are mutually exclusive. You may specify the exact files in the reads_files and barcodes_files lists. Matching barcode and read files must be in the same relative position in the lists. Alternatively, you may specify a pattern (a string fragment) that identifies all the read files and a pattern that identifies all the barcode files, such as "_R1_" and "_I1_".
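
As a rough illustration of the rule in the note above, the following Python sketch resolves the input files from a parsed safeseqs.json. The function name pair_input_files, the directory_listing argument, and the choice to pair pattern matches by sorted file name are assumptions made for this example; they are not taken from the safeseqs code.

```python
def pair_input_files(study, directory_listing):
    """Return a list of (reads_file, barcodes_file) pairs.

    study is the dict parsed from safeseqs.json; directory_listing is a list
    of file names in the Study Directory (hypothetical argument).
    """
    files_given = bool(study.get("reads_files")) and bool(study.get("barcodes_files"))
    patterns_given = bool(study.get("reads_pattern")) and bool(study.get("barcodes_pattern"))
    # The files and pattern parameters are mutually exclusive.
    if files_given == patterns_given:
        raise ValueError("specify either file lists or patterns, not both or neither")

    if files_given:
        reads = list(study["reads_files"])
        barcodes = list(study["barcodes_files"])
    else:
        # Assumption: a pattern is a plain substring of the file name, and
        # sorting by name puts matching files in the same relative position.
        reads = sorted(f for f in directory_listing if study["reads_pattern"] in f)
        barcodes = sorted(f for f in directory_listing if study["barcodes_pattern"] in f)

    if len(reads) != len(barcodes):
        raise ValueError("each reads file needs a matching barcodes file")
    return list(zip(reads, barcodes))
```

Raising an error when both or neither form is supplied is a design choice for the sketch; the real tool may handle that case differently.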

UID Length is used to separate the safeseqs UID from the beginning of the Read Sequence in the reads file.

ASCII Adj is the offset subtracted from the ASCII value of each quality character to obtain the read quality score. This is generally 33.
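
The two parameters above can be sketched in Python as follows. This is a minimal illustration, not the safeseqs implementation: the function names split_uid and decode_quality are hypothetical, and the sequence and quality strings are made up for the example.

```python
def split_uid(read_sequence, uidlength):
    """Split a read into (UID, remaining sequence) at uidlength bases."""
    return read_sequence[:uidlength], read_sequence[uidlength:]


def decode_quality(quality_string, ascii_adj=33):
    """Convert FASTQ quality characters to numeric scores using the offset."""
    return [ord(ch) - ascii_adj for ch in quality_string]


uid, sequence = split_uid("ACGTACGTACGTACTTTGGG", 14)
print(uid)                    # ACGTACGTACGTAC
print(sequence)               # TTTGGG
print(decode_quality("I#5"))  # [40, 2, 20]
```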

Barcode Map File

The barcode map file links barcodes in the study data to samples. The format is tab delimited with the following columns:

 barcodeNumber - unique number for barcode
 barcode - barcode identifier (valid values contain A,C,G,T)
 wbcPlateNumber - _part of the Sample Super Mutant header_
 template - sample identifier
 purpose - _part of the Sample Super Mutant header_
 gEsWellOrTotalULUsed - _part of the Sample Super Mutant header_
 mutOrTotalGEsWell - _part of the Sample Super Mutant header_
 ampMatchName - primer set for the sample
 row - 
 col - 
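
A minimal Python sketch of reading a file with the columns listed above. It assumes the file has no header line unless skip_header is set and that the columns appear in the listed order; neither is confirmed by this page. The function name read_barcode_map and the sample row are made up for illustration.

```python
import csv
import io

BARCODE_MAP_COLUMNS = [
    "barcodeNumber", "barcode", "wbcPlateNumber", "template", "purpose",
    "gEsWellOrTotalULUsed", "mutOrTotalGEsWell", "ampMatchName", "row", "col",
]


def read_barcode_map(handle, skip_header=False):
    """Return a dict mapping each barcode to its record (a dict of columns)."""
    reader = csv.DictReader(handle, fieldnames=BARCODE_MAP_COLUMNS, delimiter="\t")
    if skip_header:
        next(reader, None)
    by_barcode = {}
    for record in reader:
        barcode = record["barcode"] or ""
        # Valid barcodes contain only A, C, G and T.
        if not barcode or set(barcode) - set("ACGT"):
            raise ValueError("invalid barcode: %r" % barcode)
        by_barcode[barcode] = record
    return by_barcode


# Made-up example row; real files are read from the Study Directory.
sample = io.StringIO("1\tACGTAC\t1\tsampleA\ttest\t10\t5\tprimerSet1\tA\t1\n")
barcode_map = read_barcode_map(sample)
print(barcode_map["ACGTAC"]["template"])  # sampleA
```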

Primers File

The primers file is the complete list of primers used for any of the samples. The primer list will be used in the alignment process to identify target sequences.

 ampMatchName - primer name
 gene - gene being targeted
 read1 - forward primer sequence
 read2 - reverse primer sequence
 ampSeq - amplicon sequence
 target_len - target sequence length
 chrom - chromosome being targeted
 readStrand - indicator for forward or reverse primer
 hg19_start - starting position on the chromosome for this primer sequence
 hg19_end - ending position on the chromosome for this primer sequence

Files installed with the software in the safeseqs/data directory

settings.json file - Configuration Settings Parameters

Parameters in the settings file control decision making during processing runs. Multiple versions can be generated to create different run scenarios. Format of the file is:

 { "max_mismatches_for_used_reads" : 3,
   "max_indels_for_used_reads" : 1,
   "mark_UIDs_with_Ns_UnUsable" : "yes",
   "perform_opt_dup_removal" : "no",
   "opt_dup_distance" : 5000,
   "max_amp_per_UID_family" : 1,
   "min_good_reads_usable_family" : 2,
   "min_perc_good_reads_per_UID_family" : 95,
   "super_mut_perc_homegeneity" : 90,
   "default_indel_rate" : 0.001,
   "default_sbs_rate" : 0.001,
   "fh_limit" : 2048,
   "load_bad_bc" : "no",
   "load_not_used_bc" : "no" }

max_mismatches_for_used_reads is used to set the maximum allowable SBS mismatches a read sequence can contain and still be considered a good read for Super Mutant consideration.

max_indels_for_used_reads is used to set the maximum allowable indels (insertions or deletions) a read sequence can contain and still be considered a good read for Super Mutant consideration.

max_amp_per_UID_family is used to set the maximum number of primers that can be found within a UID family for the family reads to be usable for Super Mutant consideration.

min_good_reads_usable_family is used to set the minimum number of good reads that must be found within a UID family for the family reads to be usable for Super Mutant consideration.

min_perc_good_reads_per_UID_family is used to set the minimum percentage of good reads vs. total reads that must be found within a UID family for the family reads to be usable for Super Mutant consideration.

super_mut_perc_homegeneity is used to set the minimum percentage of good reads vs. total reads that must contain a change within a UID family for the change to be considered a Super Mutant.
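
To show how the UID family thresholds above might combine, here is a hedged Python sketch. The function names are hypothetical, the comparisons are assumed to be inclusive, and the homogeneity percentage is assumed to be measured against the good reads in the family; the actual safeseqs logic may differ on each of these points.

```python
SETTINGS = {
    "max_amp_per_UID_family": 1,
    "min_good_reads_usable_family": 2,
    "min_perc_good_reads_per_UID_family": 95,
    "super_mut_perc_homegeneity": 90,
}


def family_is_usable(good_reads, total_reads, amplicon_count, settings):
    """Apply the three family-level thresholds to one UID family."""
    if total_reads == 0:
        return False
    perc_good = 100.0 * good_reads / total_reads
    return (
        amplicon_count <= settings["max_amp_per_UID_family"]
        and good_reads >= settings["min_good_reads_usable_family"]
        and perc_good >= settings["min_perc_good_reads_per_UID_family"]
    )


def is_super_mutant(reads_with_change, good_reads, settings):
    """Assumption: homogeneity is reads carrying the change over good reads."""
    if good_reads == 0:
        return False
    perc = 100.0 * reads_with_change / good_reads
    return perc >= settings["super_mut_perc_homegeneity"]


print(family_is_usable(19, 20, 1, SETTINGS))  # True  (95% good reads)
print(family_is_usable(18, 20, 1, SETTINGS))  # False (90% good reads)
print(is_super_mutant(9, 10, SETTINGS))       # True  (90% carry the change)
```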

default_indel_rate - Default 0.001

default_sbs_rate - Default 0.001

mark_UIDs_with_Ns_UnUsable is used to control whether UIDs with N's (instead of C, G, A, T) will be used in Super Mutant calculations.

perform_opt_dup_removal is used to determine whether optical duplicates in the original study data should be removed before Super Mutant calculations.

opt_dup_distance sets the minimum distance required between reads. Reads that are closer than this distance are considered optical duplicates and are disregarded before Super Mutant calculations.

fh_limit is used to control the number of open file handles allowed by Python. It should be set larger than the number of barcodes being used by the study.

load_bad_bc drives whether barcodes that do not exist in the barcode map file should be processed. These records usually indicate errors. Loading records with errors is not recommended. These records will not be eligible for Super Mutant calculations because they do not match a recognized barcode/sample combination. Processing them will slow down performance.

load_not_used_bc drives whether barcodes that exist in the barcode map file BUT do not belong to a sample in the study should be processed. These records usually indicate errors. Loading records with errors is not recommended. These records will not be eligible for Super Mutant calculations because they do not match a recognized barcode/sample combination. Processing them will slow down performance.

dbSNP Reference Data

Tab delimited file containing dbSNP single base pair mutations. This is used to allow for expected variants when calculating mismatches. The first line of the file is a header.

The columns in the dbSNP file must be:

chrom Position BaseFrom BaseTo SNP
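
One way to use a file in this layout is to load it into a set and skip known variants when counting mismatches. The Python sketch below assumes the header names match the column list above exactly and that mismatches are represented as (chrom, position, base_from, base_to) tuples; the function names, the chromosome naming, and the sample row are made up for illustration.

```python
import csv
import io


def load_dbsnp(handle):
    """Read the tab delimited file (with header) into a set of known variants."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["chrom"], int(row["Position"]), row["BaseFrom"], row["BaseTo"])
        for row in reader
    }


def count_unexpected_mismatches(mismatches, known_variants):
    """Count mismatches that are not expected dbSNP variants."""
    return sum(1 for m in mismatches if m not in known_variants)


# Made-up data for the example only.
dbsnp_text = "chrom\tPosition\tBaseFrom\tBaseTo\tSNP\nchr1\t100\tC\tT\texampleId\n"
known = load_dbsnp(io.StringIO(dbsnp_text))
read_mismatches = [("chr1", 100, "C", "T"), ("chr1", 101, "G", "A")]
print(count_unexpected_mismatches(read_mismatches, known))  # 1
```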

COSMIC Reference Data

Tab delimited file containing known COSMIC single base pair mutations. The first line of the file is a header.

The columns in the COSMIC file must be:

chrom Position BaseFrom BaseTo SomaticCount