### **Software info**

|Software     |Version|
|-------------|-------|
|python    |3.11.9|
|ipykernel    |[6.28.0](https://anaconda.org/anaconda/ipykernel)|
|Biopython    |[1.70](https://anaconda.org/bioconda/biopython)|
|Entrez-direct|[21.6](https://anaconda.org/bioconda/entrez-direct)|
|mafft        |[7.525](https://anaconda.org/bioconda/mafft)|
|iq-tree2     |[2.3.0](https://anaconda.org/bioconda/iqtree)|
|DSTU         |[0.5.0 pre-release](https://github.com/iliapopov17/Detailed-Sequences-for-Trees-Unblemished)|

Conda environment: `dstu_hantavirus_phylo.yaml`<br>
Install the environment with:

In [None]:
! conda env create -f dstu_hantavirus_phylo.yaml

Reload VS Code (close and reopen it), then select this environment as the notebook kernel.

### **Hardware info**

- OS: Ubuntu 22.04 (Windows Subsystem for Linux)
- CPU: Intel Xeon E5-2670v3
- RAM: 32GB (16GB for WSL)

In [1]:
! lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  24
  On-line CPU(s) list:   0-23
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
    CPU family:          6
    Model:               63
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            4589.36
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscal
                         l nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopo
                         logy cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1
                          sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervis
                         or lahf_lm abm invpcid_single pti ssbd i

### **Step 0. Install `DSTU`**

In [None]:
! wget https://github.com/iliapopov17/Detailed-Sequences-for-Trees-Unblemished/releases/download/v0.5.0-alpha/DSTU.py

In [2]:
from DSTU import *

### **Step 1. Download sequences**

The `accession_numbers.txt` file was created manually from previously published papers:
1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10025241/<br>
2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7106157/<br>
3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10943075/<br>

In [10]:
get_sequences('iljapopov17@gmail.com', 'data/accession_numbers.txt', 'genbank_sequences')

Downloaded: NC_038515
Downloaded: KT316176
Downloaded: MN337866
Downloaded: MG663536
Downloaded: JX193700
Downloaded: KX779125
Downloaded: KU950715
Downloaded: KR920360
Downloaded: OM912841
Downloaded: KX845680
Downloaded: OM912842
Downloaded: OM912840
Downloaded: OM963009
Downloaded: OM912844
Downloaded: OM912843
Downloaded: MK165653
Downloaded: KJ000540
Downloaded: KY040508
Downloaded: KT899703
Downloaded: MN850095
Downloaded: AF005729
Downloaded: NC_043407
Downloaded: GQ200821
Downloaded: EU788002
Downloaded: GU997097
Downloaded: FJ858378
Downloaded: NC_038299
Downloaded: AB620102
Downloaded: AB620105
Downloaded: HM015222
Downloaded: MN639746
Downloaded: AB677488
Downloaded: JX028271
Downloaded: KJ857315
Downloaded: KJ857316
Downloaded: MK883761
Downloaded: KY751731
Downloaded: MK542664
Downloaded: MT441741
Downloaded: HQ728461
Downloaded: KC880348
Downloaded: GU566021
Downloaded: KJ857320
Downloaded: KF974361
Downloaded: JX990941
Downloaded: MN006903
Downloaded: JX990965
Downloaded

#### **Step 1.1. Check downloaded sequences**

In [11]:
! ls genbank_sequences/| wc -l

93


The accession list contains 99 entries, but only 93 sequences were downloaded. The likely reason is that some accession numbers occur more than once in the list, because these papers share sequences:
1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10025241/<br>
2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7106157/<br>

In [12]:
def count_non_unique_strings(file_path):
    counts = {}
    non_unique_count = 0
    with open(file_path, 'r') as file:
        for line in file:
            line = line.strip()
            if line in counts:
                if counts[line] == 1:
                    non_unique_count += 1
                counts[line] += 1
            else:
                counts[line] = 1

    return non_unique_count

In [13]:
file = 'data/accession_numbers.txt'
non_unique = count_non_unique_strings(file)
print(f"Number of non-unique strings: {non_unique}")

Number of non-unique strings: 6


The papers mentioned above use 6 identical sequences to build their phylogenetic trees, so the list holds 99 - 6 = 93 unique accession numbers. This matches the 93 downloaded sequences, so no data were lost.
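
To see which accession numbers are repeated, rather than only how many, a minimal sketch with `collections.Counter` may help. The helper name `find_duplicates` is hypothetical and not part of DSTU, and the sketch assumes one accession number per line.

```python
from collections import Counter


def find_duplicates(file_path):
    """Return {accession: occurrences} for accessions listed more than once."""
    with open(file_path) as handle:
        # Skip blank lines so that empty rows are not counted as entries
        accessions = [line.strip() for line in handle if line.strip()]
    counts = Counter(accessions)
    return {acc: n for acc, n in counts.items() if n > 1}
```

Calling `find_duplicates('data/accession_numbers.txt')` would return one dictionary entry per repeated accession number, which makes it easy to trace each duplicate back to its source paper.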

#### **Step 1.2. 10 additional sequences**

10 additional hantavirus sequences associated with human infections will be included in the analysis.

**Orthohantavirus seoulense**

https://www.ncbi.nlm.nih.gov/nuccore/MG386252.1<br>
https://pubmed.ncbi.nlm.nih.gov/29774860/

In [14]:
! esearch -db nucleotide -query "MG386252" | efetch -format fasta > genbank_sequences/MG386252.fasta

In [23]:
! echo "MG386252" >> data/accession_numbers.txt

https://www.ncbi.nlm.nih.gov/nuccore/OR047284.1<br>
https://pubmed.ncbi.nlm.nih.gov/38147030/

In [24]:
! esearch -db nucleotide -query "OR047284" | efetch -format fasta > genbank_sequences/OR047284.fasta

In [25]:
! echo "OR047284" >> data/accession_numbers.txt

**Orthohantavirus tulaense**

https://www.ncbi.nlm.nih.gov/nuccore/KU297981.1<br>
https://pubmed.ncbi.nlm.nih.gov/26691901/

In [26]:
! esearch -db nucleotide -query "KU297981" | efetch -format fasta > genbank_sequences/KU297981.fasta

In [27]:
! echo "KU297981" >> data/accession_numbers.txt

https://www.ncbi.nlm.nih.gov/nuccore/MT993951.1<br>
https://pubmed.ncbi.nlm.nih.gov/33754997/

In [28]:
! esearch -db nucleotide -query "MT993951" | efetch -format fasta > genbank_sequences/MT993951.fasta

In [29]:
! echo "MT993951" >> data/accession_numbers.txt

**Orthohantavirus dobravaense**

https://www.ncbi.nlm.nih.gov/nuccore/MK605664.1<br>
https://pubmed.ncbi.nlm.nih.gov/31625853/

In [30]:
! esearch -db nucleotide -query "MK605664" | efetch -format fasta > genbank_sequences/MK605664.fasta

In [31]:
! echo "MK605664" >> data/accession_numbers.txt

https://www.ncbi.nlm.nih.gov/nuccore/MK605665.1<br>
https://pubmed.ncbi.nlm.nih.gov/31625853/

In [32]:
! esearch -db nucleotide -query "MK605665" | efetch -format fasta > genbank_sequences/MK605665.fasta

In [33]:
! echo "MK605665" >> data/accession_numbers.txt

**Hantaan orthohantavirus**

https://www.ncbi.nlm.nih.gov/nuccore/MW349026.1<br>
https://pubmed.ncbi.nlm.nih.gov/34370707/

In [34]:
! esearch -db nucleotide -query "MW349026" | efetch -format fasta > genbank_sequences/MW349026.fasta

In [35]:
! echo "MW349026" >> data/accession_numbers.txt

https://www.ncbi.nlm.nih.gov/nuccore/MZ191082.1<br>
https://pubmed.ncbi.nlm.nih.gov/34370707/

In [36]:
! esearch -db nucleotide -query "MZ191082" | efetch -format fasta > genbank_sequences/MZ191082.fasta

In [37]:
! echo "MZ191082" >> data/accession_numbers.txt

**Orthohantavirus sinnombreense**

https://www.ncbi.nlm.nih.gov/nuccore/ON571586.1<br>
https://pubmed.ncbi.nlm.nih.gov/37486231/

In [38]:
! esearch -db nucleotide -query "ON571586" | efetch -format fasta > genbank_sequences/ON571586.fasta

In [39]:
! echo "ON571586" >> data/accession_numbers.txt

https://www.ncbi.nlm.nih.gov/nuccore/ON571589.1<br>
https://pubmed.ncbi.nlm.nih.gov/37486231/

In [40]:
! esearch -db nucleotide -query "ON571589" | efetch -format fasta > genbank_sequences/ON571589.fasta

In [41]:
! echo "ON571589" >> data/accession_numbers.txt
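
The cells above repeat the same two commands for every accession number. As a hedged alternative, the loop below sketches the same download with Biopython's `Bio.Entrez.efetch` (Biopython is already part of the environment). The function names `fetch_fasta` and `download_additional` are hypothetical, and the `fetch` argument is passed in so that the loop can be tried without network access.

```python
from pathlib import Path


def fetch_fasta(accession, email):
    """Fetch one nucleotide record as FASTA text from NCBI (needs network access)."""
    from Bio import Entrez  # imported lazily; requires Biopython

    Entrez.email = email
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def download_additional(accessions, out_dir, list_file, fetch):
    """Save each record to out_dir/<accession>.fasta and append it to list_file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for acc in accessions:
        fasta = fetch(acc)
        # A valid FASTA record starts with a ">" header line
        if not fasta.startswith(">"):
            raise ValueError(f"No FASTA record returned for {acc}")
        (out_dir / f"{acc}.fasta").write_text(fasta)
        with open(list_file, "a") as handle:
            handle.write(acc + "\n")
```

A possible call would be `download_additional(["MG386252", "OR047284"], "genbank_sequences", "data/accession_numbers.txt", lambda acc: fetch_fasta(acc, "you@example.org"))`, where the e-mail address is a placeholder.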

### **Step 2. Combine all sequences to one file**

In [42]:
! cat genbank_sequences/*.fasta > data/all_seqs.fa

In [43]:
with open("data/all_seqs.fa", "r") as fasta_file:
    content = fasta_file.read()
    num_sequences = content.count(">")
print(f"The number of sequences in combined file: {num_sequences}")

The number of sequences in combined file: 103
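
A matching count does not show that every listed accession number is present. The sketch below compares the FASTA headers with the accession list. The helper names are hypothetical, and it assumes that each header starts with the accession number (for example `>MG386252.1 ...`) and that version suffixes such as `.1` can be ignored.

```python
def fasta_ids(fasta_path):
    """Return accession IDs from FASTA headers, without the version suffix."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # ">MG386252.1 description" -> "MG386252"
                    ids.append(fields[0].split(".")[0])
    return ids


def missing_accessions(fasta_path, list_path):
    """Return listed accessions that have no sequence in the FASTA file."""
    with open(list_path) as handle:
        listed = {line.strip().split(".")[0] for line in handle if line.strip()}
    return sorted(listed - set(fasta_ids(fasta_path)))
```

An empty list from `missing_accessions` would indicate that every accession number in the list has a sequence in the combined file. Unlike counting `>` characters, this only looks at header lines.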


### **Step 3. Multiple sequence alignment**

In [None]:
! mafft --auto data/all_seqs.fa > data/all_seqs_mafft.fa

### **Step 4. Launching `ModelFinder` to get the best substitution model**

In [None]:
! iqtree2 -m MFP -s data/all_seqs_mafft.fa --prefix model-finder/tree_MF2 -T AUTO

In [44]:
! head -42 model-finder/tree_MF2.iqtree | tail -6

Best-fit model according to BIC: GTR+F+I+G4

List of models sorted by BIC scores: 

Model                  LogL         AIC      w-AIC        AICc     w-AICc         BIC      w-BIC
GTR+F+I+G4      -209288.601  419003.202 -  0.00514  419017.199 -   0.0154  420454.559 +        1
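
Reading the model with `head -42 | tail -6` depends on fixed line positions in the report. As a sketch, the model name can instead be taken from the `Best-fit model according to BIC:` line shown above. The function name `best_fit_model` is hypothetical.

```python
import re


def best_fit_model(report_text):
    """Return the model named on the 'Best-fit model according to BIC' line."""
    match = re.search(r"^Best-fit model according to BIC:\s*(\S+)",
                      report_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("No 'Best-fit model according to BIC' line found")
    return match.group(1)
```

For example, `best_fit_model(open("model-finder/tree_MF2.iqtree").read())` would return the string that is later passed to `-m`.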


### **Step 5. Building the final tree**

Run `iqtree2` with the best-fit substitution model and 1000 ultrafast bootstrap replicates.

In [None]:
! iqtree2 -s data/all_seqs_mafft.fa -m GTR+F+I+G4 -pre tree/tree_ufb -bb 1000 -nt AUTO
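
Before visualising the tree, it can be useful to check how many internal nodes are well supported. The sketch below extracts the support values from the Newick string with a regular expression. The helper names are hypothetical, and the default threshold of 95 follows the usual guidance for ultrafast bootstrap support. It assumes leaf labels without parentheses, as in the unannotated `tree/tree_ufb.treefile`.

```python
import re


def support_values(newick):
    """Extract internal-node support values from a Newick string."""
    # Support is written right after a closing parenthesis, e.g. ")100:0.0821"
    return [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)(?=[:,;)])", newick)]


def summarize_support(newick, threshold=95.0):
    """Return (number of supported nodes, number below the threshold)."""
    values = support_values(newick)
    weak = sum(v < threshold for v in values)
    return len(values), weak
```

For example, `summarize_support(open("tree/tree_ufb.treefile").read())` would give the total number of nodes with a support value and how many of them fall below 95.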

### **Step 6. First tree visualisation**

The tree was visualised in [iTOL](https://itol.embl.de) and then annotated in Pixelmator Pro.<br>
To visualise the tree, upload `tree/tree_ufb.treefile` to [iTOL](https://itol.embl.de).

### **Step 7. Annotating the tree**

#### **Step 7.1. Adding organism names to the tree**

In [45]:
get_organisms('iljapopov17@gmail.com', 'data/accession_numbers.txt', 'data/accession_organism.txt')

The request has been fulfilled.
File saved to data/accession_organism.txt


In [1]:
! head -5 data/accession_organism.txt

NC_038515.1 Laibin virus
KT316176.1 Makokou virus
MN337866.1 Sarawak mobatvirus
MG663536.1 Dakrong virus
JX193700.1 Kilimanjaro virus


Everything worked well.

#### **Step 7.2. Updating the tree**

In [49]:
update_tree('data/accession_organism.txt', 'tree/tree_ufb.treefile', 'tree/annotated_tree.treefile')

The request has been fulfilled.
File saved to tree/annotated_tree.treefile


In [2]:
! head -1 tree/annotated_tree.treefile

(AB620030.1 Amur virus:0.0771041878,((((((((((AB620102.1 Orthohantavirus montanoense:0.3586812403,AB620105.1 Carrizal virus:0.3164538911)100:0.0821256184,(ON571586.1 Orthohantavirus sinnombreense:0.0003496417,ON571589.1 Orthohantavirus sinnombreense:0.0000027792)100:0.3526346556)94:0.0679804387,(((((AF005729.1 Orthohantavirus negraense:0.3758037094,NC_043407.1 Necocli virus:0.3953678541)99:0.0611167013,MN850095.1 Orthohantavirus andesense:0.2115210053)99:0.1127135879,EU788002.1 Maporal virus:0.3285749060)99:0.0815653694,GQ200821.1 Orthohantavirus delgaditoense:0.3673061736)78:0.0440712496,(FJ858378.1 Catacamas virus:0.2198999425,(GU997097.1 Orthohantavirus nigrorivense:0.2424566531,NC_038299.1 Orthohantavirus bayoui:0.2143819459)97:0.0646432754)100:0.1905032908)100:0.0596284326)100:0.1483520267,((((AB677488.1 Ussuri virus:0.2315952341,JX028271.1 Muju virus:0.2947312110)57:0.0621199827,MN639746.1 Orthohantavirus puumalaense:0.2298150170)100:0.1004645997,(((((HQ728461.1 Orthohantavirus t

Now the tree leaves carry the accession number plus the organism name (`AB620030.1 Amur virus`) instead of only the accession number (`AB620030.1`).
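
The relabelling can be pictured as a lookup-and-replace on the leaf labels of the Newick string. The sketch below is not DSTU's implementation of `update_tree`, only an illustration with hypothetical helper names. It assumes that the mapping file holds the accession number first and the organism name after whitespace, and it leaves labels unquoted, as in the output shown above.

```python
import re


def load_names(path):
    """Read 'accession organism name' lines into a dictionary."""
    names = {}
    with open(path) as handle:
        for line in handle:
            parts = line.strip().split(None, 1)
            if len(parts) == 2:
                names[parts[0]] = parts[1]
    return names


def relabel_newick(newick, names):
    """Append organism names to leaf labels found in the mapping."""
    def replace(match):
        label = match.group(1)
        return f"{label} {names[label]}" if label in names else label

    # A leaf label follows "(" or "," and runs until ":", ",", ")" or ";"
    return re.sub(r"(?<=[(,])([^(),:;]+)", replace, newick)
```

Labels that are missing from the mapping stay unchanged, so an incomplete organism file does not break the tree.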

#### **Step 7.3. Fetching information about virus hosts**

In [51]:
get_hosts('iljapopov17@gmail.com', 'data/accession_numbers.txt', 'data/accession_host.txt')

The request has been fulfilled.
File saved to data/accession_host.txt


In [3]:
! head -5 data/accession_host.txt

NC_038515.1 Taphozous melanopogon
KT316176.1 Hipposideros ruber
MN337866.1 Murina aenea
MG663536.1 Aselliscus stoliczkanus (Stoliczka's Asian trident bat)
JX193700.1 Myosorex zinki


Everything worked well.

#### **Step 7.4. Fetching information about the taxonomic order of virus hosts**

In [53]:
get_hosts_orders('iljapopov17@gmail.com', 'data/accession_host.txt', 'data/accession_order.txt')

The request has been fulfilled.
File saved to data/accession_order.txt
Please do not forget to edit the file manually.
The query to NCBI database from this function is pretty difficult.
Sometimes this function prints:
"Error - HTTP Error 400: Bad Request" in case of bad connection or
"Note - False record" in case there is no record about the host organism.


In [4]:
! head -5 data/accession_order.txt

NC_038515.1	Chiroptera
KT316176.1	Chiroptera
MN337866.1	Chiroptera
MG663536.1	Chiroptera
JX193700.1	Eulipotyphla


Everything worked well.
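
Because the function output asks for a manual check of the file, a quick tally of orders can help to spot missing or misspelled entries. The sketch below assumes a tab-separated file with the accession number in the first column and the order in the second, as in the output above. The helper name `count_orders` is hypothetical.

```python
from collections import Counter


def count_orders(path):
    """Count how many sequences are assigned to each host order."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Ignore rows without a second column or with an empty order
            if len(fields) >= 2 and fields[1].strip():
                counts[fields[1].strip()] += 1
    return counts
```

Running `count_orders("data/accession_order.txt").most_common()` would list the orders from most to least frequent.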

#### **Step 7.5. Setting up the color map for visualization in iTOL**

In [55]:
unique_orders = get_unique_orders("data/accession_order.txt")
print(unique_orders)

['Chiroptera', 'Eulipotyphla', 'Rodentia', 'ND', 'Primates']


In [56]:
color_map = set_color_map("data/accession_order.txt")
print(color_map)

{'Chiroptera': '#32cd32', 'Eulipotyphla': '#ffd700', 'Rodentia': '#1e90ff', 'ND': '#FFFFFF', 'Primates': '#8a2be2'}


In [57]:
get_itol_dataset("data/accession_organism.txt", "data/accession_order.txt", "data/dataset_for_iTOL.txt", color_map)

Colors were set by the user.
The request has been fulfilled.
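
The content of `data/dataset_for_iTOL.txt` is not shown here. For orientation only, the sketch below writes an iTOL colour strip annotation file (`DATASET_COLORSTRIP`), one of the dataset templates that iTOL accepts. Whether `get_itol_dataset` produces this dataset type is an assumption, as is the use of "accession + organism name" as the leaf ID. The function name `write_colorstrip` is hypothetical.

```python
def write_colorstrip(names, orders, color_map, out_path, label="Host order"):
    """Write an iTOL DATASET_COLORSTRIP file: one coloured cell per leaf."""
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "COLOR\t#000000",
        "DATA",
    ]
    for accession, order in orders.items():
        # Assumption: leaf IDs match the labels in the annotated tree
        leaf = f"{accession} {names[accession]}" if accession in names else accession
        # Orders without an assigned colour fall back to white
        color = color_map.get(order, "#FFFFFF")
        lines.append(f"{leaf}\t{color}\t{order}")
    with open(out_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
```

Here `names` maps accession numbers to organism names, `orders` maps accession numbers to host orders, and `color_map` has the same shape as the dictionary printed above.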


### **Step 8. Final tree visualization**

1. Visit [iTOL](https://itol.embl.de)
2. Upload `tree/annotated_tree.treefile` as the tree
3. Upload `data/dataset_for_iTOL.txt` as the annotation dataset