-
Notifications
You must be signed in to change notification settings - Fork 2
Tutorial
Georgios Koutsovoulos edited this page Sep 9, 2024
·
26 revisions
Install the program and format the database (instructions)
For this tutorial we are going to use the swissprot database
- Download the C. elegans protein dataset and gff3 file from WormBase
- Unzip both files
gunzip *.gz
- Run diamond
diamond blastp -q caenorhabditis*protein.fa -d ~/DB/uniprot_sprot.fasta.dmnd --evalue 1e-5 --max-target-seqs 500 --out cel_swissprot.out --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids
- Create
groups.yaml
file
---
Ingroup:
33208: Metazoa
EGP:
6231: Nematoda
- Calculate Alien Index (AI) and other stats
~/software/AvP/aux_scripts/calculate_ai.py -i cel_swissprot.out -x groups.yaml
- Create
config.yaml
file. We point to the dbfasta_path: /home/DB/uniprot_sprot.fasta
. We want to select genes having either AI above 10 or AHS above 0, and outg_pct above 80 so we setai_cutoff: 10
and selection from the default"ai or ahs"
to"(ai or ahs) and outg_pct"
.
---
max_threads: 4
# DB path
blast_db_path: nr # blast: change to the local blast_db path
fasta_path: /home/DB/uniprot_sprot.fasta # diamond: change to the local fasta path for sp, ur90, or custom database
mode: sp # use blast for blast database, use sp for swissprot database, ur90 for uniref90 or custom database
data_type: AA # data type DNA, AA
## Algorithm options
# prepare
ai_cutoff: 10
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "(ai or ahs) and outg_pct" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
percent_identity: 100 # select hits equal or below this number
cutoffextend: 20 # when ingroup hit is found, we take this hit + n hits
number_hits_noingroup: 50 # when no ingroup hit is found, we take this number of hits
trimal: false
min_num_hits: 4 # select queries with at least that many blast hits
percentage_similar_hits: 0.7 # group queries based on this
# detect, clasify, evaluate
fastml: true # Use fasttree instead of IQTree
node_support: 0 # nodes below that number will collapse
complex_per_ingroup: 20 # if D/(D+I) smaller than this then node is considered Ingroup
complex_per_donor: 80 # if D/(D+I) greater than this then node is considered Donor
complex_per_node: 90 # if node contains percent number of this category, it is assigned
# Program specific options
mafft_options: '--anysymbol --auto'
trimal_options: '-automated1'
#IQ-Tree
iqmodel: '-mset WAG,LG,JTT -AICc -mrate E,I,G,R'
ufbootstrap: 1000
iq_threads: 4
- Run AvP prepare
~/software/AvP/avp prepare -a cel_swissprot.out_ai.out -o cel_avp -f caenorhabditis*protein.fa -b cel_swissprot.out -x groups.yaml -c config.yaml
- Run AvP detect
~/software/AvP/avp detect -i cel_avp/mafftgroups/ -o cel_avp/ -g cel_avp/groups.tsv -t cel_avp/tmp/taxonomy_nexus.txt -c config.yaml
- Run AvP classify
~/software/AvP/avp classify -i cel_avp/fasttree_nexus/ -o cel_avp/ -t cel_avp/fasttree_tree_results.txt -f ~/software/AvP/depot/sample.classification_ingroup_Metazoa.txt -c config.yaml
- Run AvP evaluate
~/software/AvP/avp evaluate -i cel_avp/mafftgroups/ -o cel_avp/ -t cel_avp/fasttree_tree_results.txt -c config.yaml
- Calculate hgt local score
- Format gff3 according to the format described here
grep -P "WormBase\tmRNA" caenorhabditis*gff3 | perl -ane '@p=split(/;/,$F[8]); @d=split(/\./,$p[0]); $d[0]=~s/Transcript://; $F[8]="$d[0].$d[1];$p[1]"; $l=join("\t",@F); print "$l\n";' > caenorhabditis_mRNA.gff3
- Run hgt local score.
~/software/AvP/aux_scripts/hgt_local_score.py -f caenorhabditis_mRNA.gff3 -a cel_swissprot.out_ai.out -t cel_avp/fasttree_tree_results.txt -m 0