Prepare input files

Georgios Koutsovoulos edited this page Jun 21, 2024 · 23 revisions

Decide on the taxonomic Ingroup and the taxonomic groups to exclude (EGP). For example, to find proteins of non-metazoan origin in plant-parasitic nematodes of the genus Meloidogyne (suborder Tylenchina, taxid=6300), set Ingroup to 33208 (Metazoa) and EGP to 6300.

  1. Similarity file
  • Using NR
blastp -query [proteins.fa] -db nr -outfmt '6 std staxids' -seg no -evalue 1e-5 -out [similarity.out]
  • Using other databases
diamond blastp -q [proteins.fa] -d [db.fasta.dmnd] --evalue 1e-5 --max-target-seqs 500 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids --out [similarity.out]
  2. Create the groups.yaml file (a sample file can be found under AvP/depot/)
  3. AI features file (2 choices)
  • To use the AHS metric, hgt_local_score, or both, run calculate_ai.py. It creates a file called *_ai.out (described below)
calculate_ai.py -i [similarity.out] -x groups.yaml
  • Otherwise, use the Alienness webserver (the file needed is called *_Alieness_FEATURES.xls) (only works for NR) (ongoing)
  4. Create the config.yaml file (a sample file can be found under AvP/depot/)
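
Both similarity commands above request the 12 standard tabular fields plus staxids, so every hit line should have 13 tab-separated columns. Before running calculate_ai.py it may help to confirm that. The following is a minimal sketch, not part of AvP; the function name `find_malformed_hits` is hypothetical, and it assumes the file contains only hit lines, blank lines, or `#` comment lines.

```python
EXPECTED_COLUMNS = 13  # 12 standard tabular fields + staxids


def find_malformed_hits(lines):
    """Return (line_number, reason) for each line that does not look like a hit."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            problems.append(
                (number, f"expected {EXPECTED_COLUMNS} columns, found {len(fields)}")
            )
        elif not fields[12].strip():
            # The last column should carry the subject taxonomy id(s).
            problems.append((number, "empty staxids"))
    return problems
```

To check a real file, pass an open file handle, for example `find_malformed_hits(open("similarity.out"))`, where the file name stands in for your own [similarity.out].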

Output of calculate_ai

calculate_ai.py produces the tab-delimited file *_ai.out.

| Column | Description | Article |
|---|---|---|
| 1 | Gene Name | |
| 2 | Best Donor String | |
| 3 | Best Ingroup String | |
| 4 | Alien Index (AI) | Gladyshev et al., 2008 |
| 5 | HGT index | Boschetti et al., 2012 |
| 6 | Number of blast hits | |
| 7 | AHS score | Koutsovoulos et al., 2022 |
| 8 | outg_pct | Li et al., 2022 |
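
For orientation, the two oldest metrics in this table can be sketched from their published definitions. Gladyshev et al. (2008) define the Alien Index as the difference of the natural logs of the best ingroup and best donor E-values, each offset by 1e-200. Boschetti et al. (2012) define the HGT index as the best donor bitscore minus the best ingroup bitscore. The code below illustrates those definitions only. It is not taken from calculate_ai.py, the function names are hypothetical, and AvP may handle missing hits or other edge cases differently.

```python
import math


def alien_index(best_ingroup_evalue, best_donor_evalue):
    """Alien Index as published: ln(ingroup E + 1e-200) - ln(donor E + 1e-200)."""
    tiny = 1e-200  # offset that keeps the logarithm defined when an E-value is 0
    return math.log(best_ingroup_evalue + tiny) - math.log(best_donor_evalue + tiny)


def hgt_index(best_donor_bitscore, best_ingroup_bitscore):
    """HGT index as published: best donor bitscore minus best ingroup bitscore."""
    return best_donor_bitscore - best_ingroup_bitscore
```

Under both definitions, a positive value means the best donor hit is stronger than the best ingroup hit.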

Columns 2 and 3 each contain a string describing the best donor hit and the best ingroup hit, respectively, with the fields delimited by ":"

Gene_hit_name:Position_in_blast_list:Identity:E_value:Bitscore
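
A string in this layout can be split back into typed fields. The sketch below is an illustration, not AvP code: the `Hit` type and `parse_hit` function are hypothetical names, and it assumes the string is populated with all five fields. It splits from the right so that a hit name that itself contains ":" stays intact.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str        # Gene_hit_name
    position: int    # Position_in_blast_list
    identity: float  # Identity
    evalue: float    # E_value
    bitscore: float  # Bitscore


def parse_hit(text):
    """Split 'name:position:identity:evalue:bitscore' into a Hit."""
    # rsplit keeps any ':' inside the hit name as part of the name.
    name, position, identity, evalue, bitscore = text.rsplit(":", 4)
    return Hit(name, int(position), float(identity), float(evalue), float(bitscore))
```

For example, `parse_hit("sp|P12345|X:3:45.2:1e-50:180.0").position` gives the rank of that hit in the blast list for the made-up string shown.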