Skip to content

alaynamead/RNAseq_scripts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 

Repository files navigation

RNAseq_scripts

Set of scripts for processing RNAseq data files.

convert_qseq_to_fastq.py

Converts qseq sequence files to fastq files.

Usage: python3 convert_qseq_to_fastq -i [input qseq file] -o [output fastq file] [options]

Written in python 3.

Assumes qseq files are formatted like those from UCLA's BSCRC sequencing core: tab-separated file, columns are: instrument name, run ID, lane number, tile number, x coordinate, y coordinate, index, end number, read, quality scores, and filter (1=passes, 0 = fails).

Can take gzipped or uncompressed files as input.

Can filter out reads that fail the Illumina quality filter (default), or keep them in the output (option --nofilter).

demultiplexer.py

Demultiplexes fastq files - reads with the same barcode sequence are output to the same fastq files.

Usage: python3 demultiplexer.py ['set_A' or 'set_B'] ['paired' or 'single'] [path to directory with files]

This works with paired-end or single-end reads, and gzipped or uncompressed files.

It will identify Illumia TruSeq LT set A or set B indices, depending on which you specify. (Info about barcode sequences here: http://web.archive.org/web/20160327094802/http://support.illumina.com:80/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_truseq/truseqsampleprep/truseq-library-prep-pooling-guide-15042173-01.pdf)

The script assumes names of fastq files are formatted like this:

s_*1*.fastq* file with read 1 sequence

s_*2*.fastq* file with barcodes

s_*3*.fastq* file with read 2 sequence (if paired-end)

The script will include barcodes with 1 mismatched base in the demultiplexed file for that barcode. Because barcode sequences from UCLA's BSCRC sequencing core include 7 bases, the script will match the full 7-base barcode, the barcode with a "." in the first position instead of the correct character (this seems to be fairly common), or the 7-base barcode with one of the bases incorrect. For example, all of these barcode sequences will match index 25:

ACTGATA (the actual barcode)

.CTGATA (the barcode with a "." in the first position)

ATTGATA (the barcode with one base mismatch in the second position)

The script will output files named after the barcode sequence: index_25.fastq (index25_read1.fastq and index25_read2.fastq if paired-end), as well as a file for the unmatched reads: NOMATCH.fastq.

filter_reads_by_quality.py

Filters reads in fastq files by overall read quality. Keeps any reads that have at least the given percent of bases at or above the given quality.

usage: python3 filter_reads_by_quality.py -i [input file] -o [output file] -q [quality cutoff] -p [percent]

So to keep reads with at least 80% of the bases at or above a score of 20, set -q to 20 and -p to 80.

About

Set of scripts for processing RNAseq data files.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages