Skip to content

Scripts for the production of P. ovale whole genome sequences

License

Notifications You must be signed in to change notification settings

Rcoppee/P_ovale_sWGA_project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Production of fasta sequences from Plasmodium ovale whole genome data

An algorithm for producing the whole genome sequence (.fasta format) for Plasmodium ovale species.
Prior to use this pipeline, we suppose you produced a sorted.bam file at this step (for example using Samtools view and sort).
The algorithm was written for Plasmodium ovale species, but can also be used for other Plasmodium species and other organisms.


1. Prerequisites

To use this pipeline, you need the following programs:

- Perl 5
- Python 3.x
- Samtools 1.4


2. Preparing a pileup file

The first step consists in the production of a pileup file using Samtools mpileup that calculates for each position across the genome the depth coverage, the quality and content of the reads:

samtools mpileup -a -f reference_genome.fasta file_sorted.bam > file_sorted.pileup

where reference_genome.fasta is the genome of interest in .fasta format (provided with the parameter -f, and file_sorted.bam is the sorted bam file previously produced with samtools sort function. The -a parameter allows to consider positions that were not covered.


3. Generating a table of A, T, G, C and indels content

The second step allows to formate a table that indicate the number of A, T, G, C and indels for each position of the genome. For that, we developed a perl script, named extract_data_pileup.pl, that takes the previously produced pileup file as an input (option -p).

perl extract_data_pileup.pl -p file_sorted.pileup -o ./

The algorithm will produce a table, named data_from_pileup.tsv. The table is then simplified to retain only the chrosomosomes, the genomic positions and the major bases (that we named here simplified_data.tsv), using the cut function in Linux. Importantly, we used the IUPAC code to produce the major allele, so other letters than A, T, G and C may appear as the result of mixed alleles.

cut data_from_pileup.tsv -f 1,2,8 > simplified_data.tsv

where data_from_pileup.tsv is the file previously produced, and the -f option allows to retain columns 1, 2 and 8 of the file.


4. Producing the .fasta sequence

From the simplified simplified_data.tsv table, we then just concatenated positions to form a fasta sequence. At the end of the program, the total length of the sequence and the number of lacking positions and nucleotides are indicated.

python3 extract_fasta_pileup.py -f simplified_data.tsv -o out.fasta

where -f is the input file (simplified_data.tsv) and -o corresponds to the output fasta file (out.fasta).


5. Citation

If you use this pipeline for your own work, please cite:

Development and Optimization of a Selective Whole-Genome Amplification To Study Plasmodium ovale Spp.
DOI: https://doi.org/10.1128/spectrum.00726-22

About

Scripts for the production of P. ovale whole genome sequences

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published