DjampaKozlowski/cONTent


cONTent

Description

cONTent is a toolbox for analysing the length and quality of ONT long reads.

cONTent is composed of 3 sub-programs:

  • extract : parse a read library (/!\ MUST BE A PLAIN FASTQ FILE, NOT A FASTQ.GZ) and extract each read's identifier, length and mean Phred quality score. The results are saved as a tab-separated file with a '.content' extension.
  • distribution (command: distrib) : subsample one or more read libraries and plot read quality as a function of read length. Also compute basic statistics for these two measurements. NB : if several libraries are provided, an individual plot and statistics table are generated for each library in addition to a global plot and table.
  • coverage : compute genome coverage for different read length and quality cut-offs, and display the results as a heatmap. This program can be useful for choosing minimal read length and quality cut-offs that reach a target genome coverage. NB : the output table only displays rows for which the coverage obtained with the given minimal read length and quality meets the required coverage.
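To make the per-read values concrete, here is a minimal Python sketch of the kind of computation extract performs. It is not the actual C++ parser shipped in content/fastq_processor.cpp: the function name `read_stats`, the Phred+33 offset and the use of a plain arithmetic mean of the per-base scores are assumptions made for illustration.

```python
import io


def read_stats(handle, phred_offset=33):
    """Yield (read_id, length, mean_phred) for each 4-line FASTQ record.

    Assumptions (not taken from cONTent itself): Phred+33 encoding and an
    arithmetic mean of the per-base Phred scores.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        read_id = header[1:].split()[0]  # drop '@', keep text before first space
        scores = [ord(char) - phred_offset for char in quality]
        mean_phred = sum(scores) / len(scores) if scores else 0.0
        yield read_id, len(sequence), mean_phred


# Tiny in-memory example: one read of 4 bases with scores 0, 10, 20, 30.
example = io.StringIO("@read1 runid=abc\nACGT\n+\n!+5?\n")
for read_id, length, mean_phred in read_stats(example):
    print(f"{read_id}\t{length}\t{mean_phred:.2f}")
```

The printed line has the same three fields (identifier, length, mean quality) that the description above attributes to a '.content' file; the exact column order and any header line in the real output are not specified here.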

Program usage and outputs are described in detail in the 'Usage' section below.

Installation guide

Create a dedicated virtual environment (OPTIONAL)

It is advised to create a dedicated virtual environment (here we use Conda) in which to install cONTent. The following lines will:

  • create a python 3.10.4 conda environment named 'content_env'
  • activate the 'content_env'
conda create -n content_env -y python=3.10.4
conda activate content_env

Clone the GitHub repository

Go to the desired location and clone the repository.

cd <INSERT_PATH_OF_DESIRED_LOCATION>
git clone https://github.com/DjampaKozlowski/cONTent.git
cd cONTent/

Install cONTent

cONTent relies on scripts written in Python and C++ (the latter for extracting information from the FASTQ file). It can be installed in one of two ways :

  • automatic
  • manual

Automatic installation

Execute the following line :

make install

This will :

  • install the Python dependencies and install cONTent as a Python module
  • create a directory named build inside content
  • compile the C++ program content/fastq_processor.cpp and generate an executable stored in content/build.

Manual installation

echo "Installing python requirements & installing the content module"
pip install -r requirements.txt
pip install -e .
echo "Building the fastq parser"
mkdir -p content/build/
g++ -std=c++11 content/fastq_processor.cpp -o content/build/fastq_processor

Usage

Activate the environment (if not already done)

conda activate content_env

NB : parameters in square brackets are optional; those that take a value have a default.

extract

Run cONTent.py extract as follows :

python cONTent.py extract [-h] -i INPUTFILEPATH -o OUTPUTFILEDIR

where:

  • < INPUTFILEPATH > : path to the input FASTQ file (/!\ MUST BE A PLAIN FASTQ FILE, NOT A FASTQ.GZ) [mandatory]
  • < OUTPUTFILEDIR > : output directory path [mandatory]. The output file is named after the input file, with the extension .content.
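Since extract does not accept gzip-compressed input, a library stored as FASTQ.GZ has to be decompressed first. The following is a minimal sketch using only the Python standard library; the helper name `decompress_fastq` and the paths in the usage comment are hypothetical and not part of cONTent.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fastq(gz_path, out_dir):
    """Write a plain copy of a .gz file into out_dir and return its path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got: {gz_path.name}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # 'library.fastq.gz' becomes 'library.fastq'
    out_path = out_dir / gz_path.with_suffix("").name
    # Stream the data so large libraries are never fully loaded in memory.
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return out_path


# Hypothetical usage:
# plain_path = decompress_fastq("reads/library.fastq.gz", "reads/plain")
```

The returned path can then be passed to the -i option of extract.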

Information extraction is a time-consuming process. Where possible, we advise running one process per library rather than concatenating the libraries and running a single process. The other tools in the cONTent suite are designed to merge information from multiple cONTent.py extract output files.
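One way to follow the one-process-per-library advice is to launch the extract command once per FASTQ file from a small driver script. The sketch below uses the -i and -o options documented above; the helper names, the use of `sys.executable` as the interpreter, the worker count and the example paths are assumptions.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_extract_commands(fastq_paths, output_dir, script="cONTent.py"):
    """Return one 'extract' command (as an argument list) per library."""
    return [
        [sys.executable, script, "extract", "-i", str(path), "-o", str(output_dir)]
        for path in fastq_paths
    ]


def run_all(commands, max_workers=2):
    """Run the commands concurrently and return their exit codes in order."""
    def run(command):
        return subprocess.run(command, check=False).returncode

    # Threads are enough here: each one only waits for an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Hypothetical usage, run from the cONTent/ directory:
# commands = build_extract_commands(["lib1.fastq", "lib2.fastq"], "extracted")
# exit_codes = run_all(commands)
```

Each library then produces its own '.content' file in the output directory, which can be given as a directory input to the other sub-programs.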

distrib

Run cONTent.py distrib as follows :

python cONTent.py distrib [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX [-fraction FRACTION]

where:

  • < INPUTPATH > : input directory or file path. If the path points to a directory, all the '.content' files it contains are analysed (individually and together). [mandatory]
  • < OUTPUTPATH > : output directory path. NB : if the output directory does not exist, it is created along with its parent directories. If only a directory name is provided, the directory is created in the execution directory. In any case, if the directory already exists, it is reused and any files it contains with the same names as the outputs are overwritten. [mandatory]
  • < PREFIX > : prefix used to name the output files and as the plot title (for the global analysis). Spaces are replaced with '_' in the file names. [mandatory]
  • < FRACTION > : fraction of reads to subsample per analysed library (distribution plot only). The larger the fraction, the longer the analysis takes. (default : 0.01)
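The effect of FRACTION can be illustrated with a short Python sketch that subsamples per-read records and computes basic statistics. How cONTent actually draws its subsample and which statistics it reports are not described here, so the fixed-size random sample, the function names and the assumed '.content' layout (identifier, length and quality separated by tabs, no header line) are all assumptions.

```python
import random
import statistics


def parse_content_line(line):
    """Parse one line, ASSUMING 'id<TAB>length<TAB>quality' with no header."""
    read_id, length, quality = line.rstrip("\n").split("\t")
    return read_id, int(length), float(quality)


def subsample(records, fraction=0.01, seed=None):
    """Return a random subset holding roughly `fraction` of the records."""
    if not records:
        return []
    rng = random.Random(seed)
    sample_size = max(1, round(len(records) * fraction))
    return rng.sample(records, sample_size)


def basic_stats(values):
    """Summarise a list of numbers (read lengths or mean qualities)."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Synthetic example: 200 reads, 5 % of them kept for plotting.
reads = [(f"read{i}", 1000 + i, 10.0 + i % 5) for i in range(200)]
sampled = subsample(reads, fraction=0.05, seed=1)
print(len(sampled), basic_stats([length for _, length, _ in sampled]))
```

With the default fraction of 0.01, a library of one million reads would contribute about ten thousand points to the plot, which is why larger fractions take longer.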

coverage

Run cONTent.py coverage as follows :

python cONTent.py coverage [-h] -input INPUTPATH -outdir OUTPUTPATH -prefix PREFIX -genomesize GENOMESIZE [-n N] [-m M] [-mincoverage MINCOV] [-minquality MINQ] [-minlength MINL]

where:

  • < INPUTPATH > : input directory or file path. If the path points to a directory, all the '.content' files it contains are analysed (individually and together). [mandatory]
  • < OUTPUTPATH > : output directory path. NB : if the output directory does not exist, it is created along with its parent directories. If only a directory name is provided, the directory is created in the execution directory. In any case, if the directory already exists, it is reused and any files it contains with the same names as the outputs are overwritten. [mandatory]
  • < PREFIX > : prefix used to name the output files and as the plot title (for the global analysis). Spaces are replaced with '_' in the file names. [mandatory]
  • < GENOMESIZE > : genome size (bp), needed to compute genome coverage. [mandatory]
  • < N > : number of intervals to create in read length space (optimization plot only; used to compute coverage). Increasing n makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
  • < M > : number of intervals to create in read quality space (optimization plot only; used to compute coverage). Increasing m makes the coverage length/quality trade-off analysis more precise but also more time-consuming. (default : 100)
  • < MINCOV > : minimal coverage to represent (optimization plot only). (default : 20)
  • < MINQ > : minimal quality to represent (optimization plot only). (default : 12)
  • < MINL > : minimal length of sequences to represent (optimization plot only). (default : 1000)
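The relationship between the cut-offs and coverage can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than cONTent's own implementation: coverage is taken to be the total number of bases in the retained reads divided by the genome size, and the function names and the way cut-off values are supplied are hypothetical.

```python
def coverage(records, genome_size, min_length, min_quality):
    """Coverage from reads passing both cut-offs.

    `records` is an iterable of (length, mean_quality) pairs. Coverage is
    ASSUMED to be (sum of retained read lengths) / genome_size.
    """
    kept_bases = sum(
        length
        for length, quality in records
        if length >= min_length and quality >= min_quality
    )
    return kept_bases / genome_size


def coverage_grid(records, genome_size, length_cutoffs, quality_cutoffs,
                  min_coverage=20):
    """Return (min_length, min_quality, coverage) rows meeting min_coverage."""
    records = list(records)  # allow several passes over the data
    rows = []
    for min_length in length_cutoffs:
        for min_quality in quality_cutoffs:
            depth = coverage(records, genome_size, min_length, min_quality)
            if depth >= min_coverage:
                rows.append((min_length, min_quality, depth))
    return rows


# Synthetic example: three reads and a 100 bp "genome".
reads = [(1000, 10.0), (2000, 15.0), (3000, 20.0)]
for row in coverage_grid(reads, 100, [1000, 3000], [12, 18], min_coverage=40):
    print(row)
```

In this sketch, N and M would correspond to the number of values in `length_cutoffs` and `quality_cutoffs`: a finer grid evaluates more cut-off pairs, which explains why larger values are more precise but slower.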

About

ONT FASTQ file analyser: extracts per-read information and creates graphs (read length and quality distribution; genome coverage)
