Skip to content

Tutorial

Georgios Koutsovoulos edited this page Sep 9, 2024 · 26 revisions

Install the program and format the database (instructions)

For this tutorial we are going to use the swissprot database

Prepare input files

  1. Download the C. elegans protein dataset and gff3 file from WormBase
  2. Unzip both files
gunzip *.gz
  1. Run diamond
diamond blastp -q caenorhabditis*protein.fa -d ~/DB/uniprot_sprot.fasta.dmnd --evalue 1e-5 --max-target-seqs 500 --out cel_swissprot.out --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids
  1. Create groups.yaml file
---
Ingroup:
 33208: Metazoa
EGP:
 6231: Nematoda
  1. Calculate Alien Index (AI) and other stats
~/software/AvP/aux_scripts/calculate_ai.py -i cel_swissprot.out -x groups.yaml

Run AvP

  1. Create config.yaml file. We point to the db fasta_path: /home/DB/uniprot_sprot.fasta. We want to select genes having either AI above 10 or AHS above 0, and outg_pct above 80 so we set ai_cutoff: 10 and selection from the default "ai or ahs" to "(ai or ahs) and outg_pct".
---
max_threads: 4

# DB path
blast_db_path: nr    # blast: change to the local blast_db path
fasta_path: /home/DB/uniprot_sprot.fasta  # diamond: change to the local fasta path for sp, ur90, or custom database
mode: sp    # use blast for blast database, use sp for swissprot database, ur90 for uniref90 or custom database
data_type: AA    # data type DNA, AA

## Algorithm options
# prepare
ai_cutoff: 10
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "(ai or ahs) and outg_pct" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
percent_identity: 100   # select hits equal or below this number
cutoffextend: 20    # when ingroup hit is found, we take this hit + n hits
number_hits_noingroup: 50   # when no ingroup hit is found, we take this number of hits
trimal: false
min_num_hits: 4   # select queries with at least that many blast hits
percentage_similar_hits: 0.7  # group queries based on this
# detect, clasify, evaluate
fastml: true  # Use fasttree instead of IQTree
node_support: 0  # nodes below that number will collapse
complex_per_ingroup: 20   # if D/(D+I) smaller than this then node is considered Ingroup
complex_per_donor: 80   # if D/(D+I) greater than this then node is considered Donor
complex_per_node: 90  # if node contains percent number of this category, it is assigned

# Program specific options
mafft_options: '--anysymbol --auto'
trimal_options: '-automated1'

#IQ-Tree
iqmodel: '-mset WAG,LG,JTT -AICc -mrate E,I,G,R'
ufbootstrap: 1000
iq_threads: 4
  1. Run AvP prepare
~/software/AvP/avp prepare -a cel_swissprot.out_ai.out -o cel_avp -f caenorhabditis*protein.fa -b cel_swissprot.out -x groups.yaml -c config.yaml
  1. Run AvP detect
~/software/AvP/avp detect -i cel_avp/mafftgroups/ -o cel_avp/ -g cel_avp/groups.tsv -t cel_avp/tmp/taxonomy_nexus.txt -c config.yaml

Downstream analyses

  1. Run AvP classify
~/software/AvP/avp classify -i cel_avp/fasttree_nexus/ -o cel_avp/ -t cel_avp/fasttree_tree_results.txt -f ~/software/AvP/depot/sample.classification_ingroup_Metazoa.txt -c config.yaml
  1. Run AvP evaluate
~/software/AvP/avp evaluate -i cel_avp/mafftgroups/ -o cel_avp/ -t cel_avp/fasttree_tree_results.txt -c config.yaml
  1. Calculate hgt local score
  • Format gff3 according to the format described here
grep -P "WormBase\tmRNA" caenorhabditis*gff3 | perl -ane '@p=split(/;/,$F[8]); @d=split(/\./,$p[0]); $d[0]=~s/Transcript://; $F[8]="$d[0].$d[1];$p[1]"; $l=join("\t",@F); print "$l\n";' > caenorhabditis_mRNA.gff3
  • Run hgt local score.
~/software/AvP/aux_scripts/hgt_local_score.py -f caenorhabditis_mRNA.gff3 -a cel_swissprot.out_ai.out -t cel_avp/fasttree_tree_results.txt -m 0