WinstonCleaner

WinstonCleaner is a software tool for detecting and removing cross-contaminated contigs from assembled transcriptomes. It uses BLAST to identify suspicious contigs and RPKM values to classify each of them as either genuine or contamination.
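
For reference, RPKM (reads per kilobase of contig per million mapped reads) normalizes a contig's read count by its length and by the sequencing depth, which makes coverage comparable between datasets. The sketch below uses the standard definition only; the function name and the numbers are illustrative, and the way WinstonCleaner itself derives the values from its read mapping may differ in detail.

```python
def rpkm(mapped_reads, contig_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of contig per million mapped reads."""
    if contig_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("contig length and total mapped reads must be positive")
    return mapped_reads * 1e9 / (contig_length_bp * float(total_mapped_reads))


# Illustrative numbers: 500 reads on a 2,000 bp contig, 10 million mapped reads.
example = rpkm(500, 2000, 10000000)
print(example)
```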

Requirements

To run WinstonCleaner, the following requirements must be satisfied:

Python 2.7
blast
bbtools (pileup.sh)
bowtie2

Installation

Make sure that you use Python 2.7
pip install winston-cleaner

Quick Start

Prepare a folder with the input data and an empty folder for the results.
Generate the settings.yml file by running winston_generate_config, then set the input and output paths and other options in it.
Prepare the data for your dataset (Winston will BLAST your sequences and map reads to determine the coverage):

winston_prepare_data

All necessary data will be created in your output folder.
Find the contaminations:

winston_find_contaminations

Usage

Input

The input data must be provided as a set of three files per dataset. For each dataset, prepare:

left reads .fastq
right reads .fastq
assembled transcriptome .fasta file

Names of the files must be in the following format:

NAME_1.fastq
NAME_2.fastq
NAME.fasta

For example:

brucei_1.fastq
brucei_2.fastq
brucei.fasta
giardia_1.fastq
giardia_2.fastq
giardia.fasta

File names may contain only letters, digits, and underscores (_).

All the files must be placed together in one folder.
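
The naming rules above can be checked before running the pipeline. The following is a minimal sketch, not part of WinstonCleaner: the function name check_input_folder is hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import os
import re
import tempfile

# Assumption: "letters" means ASCII letters; digits and underscores are also allowed.
NAME_RE = re.compile(r"^[A-Za-z0-9_]+\Z")


def check_input_folder(folder):
    """Return (datasets, problems) for an input folder.

    A dataset NAME is complete when NAME.fasta, NAME_1.fastq and
    NAME_2.fastq are all present in the folder.
    """
    files = set(os.listdir(folder))
    datasets, problems = [], []
    for filename in sorted(files):
        if not filename.endswith(".fasta"):
            continue
        name = filename[:-len(".fasta")]
        if not NAME_RE.match(name):
            problems.append("%s: invalid characters in file name" % filename)
            continue
        missing = [f for f in (name + "_1.fastq", name + "_2.fastq")
                   if f not in files]
        if missing:
            problems.append("%s: missing %s" % (name, ", ".join(missing)))
        else:
            datasets.append(name)
    return datasets, problems


# Demo on a temporary folder: brucei is complete, giardia lacks its right reads.
demo = tempfile.mkdtemp()
for fname in ("brucei_1.fastq", "brucei_2.fastq", "brucei.fasta",
              "giardia_1.fastq", "giardia.fasta"):
    open(os.path.join(demo, fname), "w").close()

datasets, problems = check_input_folder(demo)
print(datasets)
print(problems)
```

The check starts from the .fasta files, so read files without a matching transcriptome are not reported.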

Configuration

A template settings.yml file can be generated automatically with the winston_generate_config script:

Options:
  -h, --help            show this help message and exit
  --output_folder=OUTPUT_FOLDER
                        Where to put new settings.yml (current folder by
                        default)

The list of available settings:

winston.paths.input — input folder with reads and contigs
winston.paths.output — output folder with the results
winston.paths.tools.pileup_sh — (optional) bbtools pileup.sh execution command
winston.paths.tools.bowtie2 — (optional) bowtie2 execution command
winston.paths.tools.bowtie2_build — (optional) bowtie2-build execution command
winston.hits_filtering.len_ratio — minimum qcovhsp (BLAST query coverage per HSP) for hit filtering
winston.hits_filtering.len_minimum — minimum hit length for hit filtering
winston.coverage_ratio.regular — coverage ratio for the REGULAR dataset pair type (lower values make contamination prediction stricter, so fewer contaminations will be found)
winston.coverage_ratio.close — coverage ratio for the CLOSE dataset pair type
winston.threads.multithreading — enable multithreading (disabling it is convenient for debugging)
winston.threads.count — number of threads if multithreading is enabled
winston.tools.blast.threads — number of threads for BLAST processing
winston.tools.bowtie.threads — number of threads for bowtie2 processing
winston.in_memory_db — load the coverage database into RAM at startup. This makes contamination lookup faster but requires a substantial amount of memory.

Example settings.yml:

winston:
  in_memory_db: false

  paths:
    input: /path/to/folder/with/data/
    output: /path/to/output/folder

  hits_filtering:
    len_ratio: 70
    len_minimum: 100

  coverage_ratio:
    REGULAR: 1.1
    CLOSE: 0.04

  threads:
    multithreading: true
    count: 8

  tools:
    blast:
      threads: 8
    bowtie:
      threads: 8

Data preparation

The first step is to prepare the data for WinstonCleaner processing.

winston_prepare_data

The result will be stored in the folder specified by the winston.paths.output option.

After preparation, the types.csv file can be inspected and edited. It lists all possible pairs of datasets and their types.

The default types are:

CLOSE - taxonomically close organisms
REGULAR - simple pair of organisms

Any number of custom types can also be specified in types.csv. Their names must be in upper case. For example:

predator,prey,95.0,LEFT_EATS_RIGHT
prey,predator,95.0,RIGHT_EATS_LEFT

In this case, a coverage ratio for each custom type must be specified in the winston.coverage_ratio section of the settings.yml file:

...
  coverage_ratio:
    REGULAR: 1.1
    CLOSE: 0.04
    LEFT_EATS_RIGHT: 10
    RIGHT_EATS_LEFT: 0.1
...
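
Because every type used in types.csv needs a matching coverage ratio, it can help to cross-check the two before starting the cleanup. The sketch below is not part of WinstonCleaner. It assumes that types.csv has no header row and that the pair type is the last column of each row (the meaning of the numeric column is not relied on), and it takes the coverage ratios as a plain dictionary rather than reading settings.yml. The function name check_pair_types is hypothetical.

```python
import csv
import io


def check_pair_types(types_csv, coverage_ratio):
    """Return (missing, not_upper) for the pair types used in types.csv.

    missing   -- types with no entry in the coverage_ratio mapping
    not_upper -- types whose names are not upper case
    """
    used = set()
    for row in csv.reader(types_csv):
        if row:
            # Assumption: the pair type is the last column of each row.
            used.add(row[-1].strip())
    missing = sorted(used - set(coverage_ratio))
    not_upper = sorted(t for t in used if t != t.upper())
    return missing, not_upper


# The two example rows for custom types; RIGHT_EATS_LEFT is deliberately
# left out of the ratios to show what the check reports.
sample = io.StringIO(
    u"predator,prey,95.0,LEFT_EATS_RIGHT\n"
    u"prey,predator,95.0,RIGHT_EATS_LEFT\n"
)
ratios = {"REGULAR": 1.1, "CLOSE": 0.04, "LEFT_EATS_RIGHT": 10}
missing, not_upper = check_pair_types(sample, ratios)
print(missing)
print(not_upper)
```

For a real run, the file object passed in would come from opening types.csv in the output folder.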

Contamination cleanup

winston_find_contaminations

Output

The results will be saved in the folder specified by the winston.paths.output option.

For each dataset, the following files are created:

DATASET_NAME_clean.fasta — clean contigs
DATASET_NAME_deleted.fasta — contaminated contigs
DATASET_NAME_suspicious_hits.csv — all suspicious BLAST hits
DATASET_NAME_contamination_sources.csv — sources of contaminations, with the following columns: source contamination dataset name, number of sequences
DATASET_NAME_contaminations.csv — list of BLAST hits from which contaminations were detected
DATASET_NAME_missing_coverage.csv — list of contig IDs without coverage
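
A quick way to see how much of a dataset was removed is to count the sequences in the clean and deleted FASTA files. The sketch below is a standalone helper, not part of WinstonCleaner; the function names are hypothetical, and the demo writes small stand-in files to a temporary folder instead of reading real output.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_dataset(output_folder, name):
    """Return (clean, deleted, deleted_fraction) for one dataset."""
    clean = count_fasta_records(
        os.path.join(output_folder, name + "_clean.fasta"))
    deleted = count_fasta_records(
        os.path.join(output_folder, name + "_deleted.fasta"))
    total = clean + deleted
    fraction = float(deleted) / total if total else 0.0
    return clean, deleted, fraction


# Demo with stand-in files: three clean contigs and one deleted contig.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "brucei_clean.fasta"), "w") as handle:
    handle.write(">c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n")
with open(os.path.join(demo, "brucei_deleted.fasta"), "w") as handle:
    handle.write(">c4\nACGT\n")

summary = summarize_dataset(demo, "brucei")
print(summary)
```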

TODO

Migrate to Python 3
Logging system
Extended testing
Export to graph format
