Skip to content

Tutorial 3_0

leonarDubois edited this page Jan 20, 2021 · 6 revisions

Screening for Eubacterium rectale strains in gut metagenomes

This tutorial provide a step-by-step analysis using the PanPhlAn software to map some samples against the E.rectale species pangenome.

1. Download and install PanPhlAn

Let's start by creating a folder for this analysis

mkdir panphlan_tutorial
cd panphlan_tutorial-tutoril

Then download the PanPhlAn software

git clone https://github.com/SegataLab/panphlan.git

2. Download the metagenomics samples

In this example, we use 25 samples coming from various datasets (different studies, pathologies, country, westernized and non-westernized...). Sample can be downloaded using the following command :

wget https://www.dropbox.com/s/oi26jg0v7ktlavc/panphlan_tutorial_samples.tar.bz2

(samples have been pre-processed and filtered in order to keep only matching sequence and thus speed up download and running time needed for this tutorial ) Then, the archive must be extracted in order to get one file per sample

tar -xvjf panphlan_tutorial_samples.tar.bz2

25 fastq files are now available in a samples_fastq/ folder. They correspond to various metagenomes samples from datasets spanning across studies and countries.

Here are some metadata :

sampleID	Dataset	Country
CCMD34381688ST-21-0	ZellerG_2014	DEU
G78505	VatanenT_2016	RUS
G88884	SchirmerM_2016	NLD
G88970	SchirmerM_2016	NLD
G89027	SchirmerM_2016	NLD
H2M514903	LiJ_2017	CHN
H3M518116	LiJ_2017	CHN
HD-1	QinN_2014	CHN
HD-5	QinN_2014	CHN
HV-6	QinN_2014	CHN
LD-48	QinN_2014	CHN
M1.48.ST	BritoIL_2016	FJI
M2.48.ST	BritoIL_2016	REF
M2.58.ST	BritoIL_2016	FJI
N021	WenC_2017	CHN
RA023	WenC_2017	CHN
S353	KarlssonFH_2013	SWE
SID530054	FengQ_2015	AUT
SRS011302	HMP_2012	USA
SZAXPI003417-4	YuJ_2015	CHN
T2D-025	QinJ_2012	CHN
T2D-063	QinJ_2012	CHN
T2D-105	QinJ_2012	CHN
W1.27.ST	BritoIL_2016	FJI
YSZC12003_36795	XieH_2016	GBR

3. Download the reference genomes

The pangenome of Eubacterium rectale can be easily retrieved via PanPhlAn pangenome database using the follwoing line :

python panphlan/panphlan_download_pangenome.py -i Eubacterium_rectale

Let's check the content of the folder downloaded (named Eubacterium_rectale)

ls  -l Eubacterium_rectale/
Eubacterium_rectale.1.bt2
Eubacterium_rectale.2.bt2
Eubacterium_rectale.3.bt2
Eubacterium_rectale.4.bt2
Eubacterium_rectale.rev.1.bt2
Eubacterium_rectale.rev.2.bt2
Eubacterium_rectale_pangenome.tsv
Eubacterium_rectale_pangenome_contigs.fna
panphlan_Eubacterium_rectale_annot.tsv

This will download and extract a folder containing :

  • Eubacterium_rectale_pangenome_contigs.fna : all contigs from the reference genomes
  • Eubacterium_rectale.X.bt2 : the bowtie2 indexes (6 files)
  • panphlan_Eubacterium_rectale_annot.tsv : an annotation mapping tsv file, can be useful for further analysis of gene families of interest
  • Eubacterium_rectale_pangenome.tsv : a pangenome tsv file containing information about gene families UniRef90 ID, gene family name and position (genome, contig, start and stop) like the following example.
head Eubacterium_rectale_pangenome.tsv
UniRef90_A0A174CG45	ywqN_1	GCA_000209955	gnl|X|MDHPEDEI_1	5355	5880
UniRef90_A0A174HJF4	ymdB_1	GCA_000209955	gnl|X|MDHPEDEI_1	7141	7519
UniRef90_D4JYV8	sttH	GCA_000209955	gnl|X|MDHPEDEI_1	7563	8091
UniRef90_A0A173V9T5	chrA	GCA_000209955	gnl|X|MDHPEDEI_1	9240	9807
UniRef90_A0A173V954	btr_1	GCA_000209955	gnl|X|MDHPEDEI_1	12790	14395
UniRef90_C4ZFH4	araN_1	GCA_000209955	gnl|X|MDHPEDEI_1	17732	18986
UniRef90_A0A173V9Z2	ugpA_1	GCA_000209955	gnl|X|MDHPEDEI_1	19135	20014
UniRef90_A0A173V933	sugB	GCA_000209955	gnl|X|MDHPEDEI_1	20025	20880
UniRef90_D4JMJ2	xerC_1	GCA_000209955	gnl|X|MDHPEDEI_1	23844	24993
UniRef90_A0A174NHE9	mgrA	GCA_000209955	gnl|X|MDHPEDEI_1	26198	26636
UniRef90_A0A174EBC4	prpC	GCA_000209955	gnl|X|MDHPEDEI_1	28391	29738
UniRef90_C7GEM8	clsC_1	GCA_000209955	gnl|X|MDHPEDEI_1	30340	31774
UniRef90_A0A174CL23	hssR_1	GCA_000209955	gnl|X|MDHPEDEI_1	31791	32463
UniRef90_A0A173UV77	baeS_1	GCA_000209955	gnl|X|MDHPEDEI_1	32462	33506

4. Map samples against pangenome

Here is one example for a sample.

python panphlan/panphlan_map.py -i samples_fastq/CCMD34381688ST-21-0.fastq.bz2 \
                         --indexes Eubacterium_rectale/Eubacterium_rectale \    
                         -p Eubacterium_rectale/Eubacterium_rectale_pangenome.tsv \
                         -o map_results/CCMD34381688ST-21-0_erectale.tsv

One could automatized all samples mapping with a simple loop:

for f in samples_fastq/*; do

    fname=$(basename $f)
    python panphlan/panphlan_map.py \
        -i samples_fastq/${fname} \
        --indexes Eubacterium_rectale/Eubacterium_rectale \
        -p Eubacterium_rectale/Eubacterium_rectale_pangenome.tsv \
         -o map_results/${fname%.*}_erectale.tsv ;

done;

If the output folder map_results/ does not exist, it will be created.
The script has to be run one time for each sample. If samples are numerous, using GNU parallel can be a relevant option.

For a quicker progress through this tutorial, the results from the mapping process for all 25 samples can be directly downloaded :

wget https://www.dropbox.com/s/z81e87dvtzp6pu3/panphlan_tutorial_map_results.tar.bz2
tar -xvjf panphlan_tutorial_map_results.tar.bz2

5. Profiling strains

The most important part of the PanPhlAn software is the profiling part (panphlan_profiling.py).

python panphlan/panphlan_profiling.py -i map_results/ \
                               --o_matrix result_profile_erectale.tsv \  
                               -p Eubacterium_rectale/Eubacterium_rectale_pangenome.tsv \
                               --add_ref

This line will generate the presence/absence matrix of gene families. 1 means the gene family is present in the sample, 0 it is not.

CCMD34381688ST-21-0	G78505	G88884	G88970	G89027	H2M514903	H3M518116	HD-1	HD-5	HV-6	LD-48	M1	M2_48	M2_58	N021	RA023	S353	SID530054	SRS011302	SZAXPI003417-4	T2D-025	T2D-063	T2D-105	W1	YSZC12003_36795
UniRef90_A0A069A530	0	0	0	0	0	1	1	1	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0
UniRef90_A0A076YZQ3	1	0	1	0	0	0	1	0	1	0	1	0	0	0	0	0	0	1	1	0	0	0	1	1	1
UniRef90_A0A095XLP0	1	0	1	1	1	0	0	0	0	0	1	0	0	0	1	1	1	1	1	0	0	0	0	1	1
UniRef90_A0A095XNF2	1	0	1	1	1	0	0	0	0	0	1	0	0	0	1	1	1	1	1	0	0	0	0	1	1
UniRef90_A0A095ZX26	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	1
UniRef90_A0A0D0SCG4	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0

In order to assess whether or not a gene family is present in a sample, PanPlhAn draws a coverage curve of gene families for each sample. Such curve can be obtained using the --o_covplot erectale_covplot creating the erectale_covplot.png figure :

Here, the peak on the leftmost part of the plot has been cut in order to highlight the plateau and the differences between sample. The option --covplot_ymax 100 has been used.

PanPhlAn's aim is to detect a plateau in the coverage curve. Default threshold can be changed for more or less sensitive options. Too stringent option lead to discarding a sample from the matrix as the species will be judged non present in the sample.
Least stringent option is performed by adding --min_coverage 1 --left_max 1.70 --right_min 0.30 in the command line.

By default only the mapped samples whose coverage curves display a satisfying plateau are present in the final matrix. In order to add the reference genomes in the matrix as well, the option --add_strains must be specified.

Moreover, if a specific file mapping UniRef90 gene families names to other annotation data is provided, it can be added in the matrix. Here it is panphlan_Eubacterium_rectale_annot.tsv retrieved with the pangenome download For example :

Eubacterium_rectale/panphlan_Eubacterium_rectale_annot.tsv

NR90	NR50	GO	KO	KEGG	Pfam	EC	eggNOG
UniRef90_A0A069A530	UniRef50_C7H750						
UniRef90_A0A069A530	UniRef50_C7H750				PF09035		
UniRef90_A0A069A530	UniRef50_C7H750	GO:0015074,GO:0006310			PF09035		
UniRef90_A0A076YZQ3	UniRef50_A0A2N9Z693						
UniRef90_A0A076YZQ3	UniRef50_A0A2N9Z693				PF02195		
UniRef90_A0A076YZQ3	UniRef50_A0A2N9Z693	GO:0003677			PF02195		
UniRef90_A0A076YZQ3	UniRef50_A0A2N9Z693	GO:0003677	K03497	rus:RBI_I01277	PF02195		
UniRef90_A0A095XLP0	UniRef50_A7B041						
UniRef90_A0A095XLP0	UniRef50_A7B041				PF00890,PF02910

given as argument as in here:
panphlan/panphlan_profile.py -c 39491 -i map_results/ -o profile_erectale.csv --funct_annot HUMAnN2_CHOCOPhlAn_201901_functional_annotation_mapping.tsv --verbose
will add an annotation column in the presence/absence matrix. Here, without any specification, the second field is taken (UniRef50) but it can be chosen via the --field argument (by default --field has value 1, thus using the second field of the annotation file).

Visualizing the matrix

Using the matrix .tsv file, visualization can be made through PCoA plots or Heatmaps of gene families presence/absence fingerprint for each sample.

For example, a PCoA of E.rectale strains highlights the existence of westernized (AUT, USA, NLD, SWE, GBR...) cluster compared to a non-westernized (CHN) one.

Such clusters can also be visualized through heatmap and hierarchical clustering using the same color code for countries. On such Heatmap, the core genome can also be visualized (yellow columns on the right)