# Phylogenetic analysis
To further investigate the functional roles and evolutionary origin of your candidate (upregulated) genes, you decide to perform a phylogenetic analysis comparing them against other prokaryotic proteomes (including the reference genome of the species matching the strain isolated from the hot spring).

### Tasks:
For each overexpressed gene (you have individual FASTA files created from the previous section), run a standard phylogenetic workflow:
1. Run a BLAST search for each overexpressed protein against all reference proteomes
2. Extract hits with e-value <= 0.00001 (tip: you can use BLAST parameters for this)
3. Create a FASTA file with all the sequences of the selected hits (tip: you can use extract_seqs_from_blast_result.py)
4. Build a phylogenetic tree from the FASTA file (suggested tools: clustalo, iqtree)
5. Visualize the result (suggested tools: etetoolkit.org/treeview, itol.embl.de, ete3)
6. Using the additional info file located at additional_seq_info.tsv (see full path above), extract functional and taxonomic information for each homolog in the trees, and visualize it in the tree for better interpretation.


### 1. BLAST
The first step is to build the BLAST database from all reference proteomes in the file `/home/2019_2020/data/phylo/all_ref_proteomes.faa`.  <br><br>
Next, we extract the four sequences of the overexpressed genes from `/home/2019_2020/data/phylo/novel_proteome.faa`. <br><br>
Then we can run BLAST and extract the sequences of the hits that pass the e-value cutoff with the provided Python script: `python extract_seqs_from_blast_result.py  blast_output all_ref_proteomes.faa > homologs.faa`
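
The filtering and extraction step can be sketched in plain Python. This is only an illustration of the idea, not the code of `extract_seqs_from_blast_result.py` (whose internals are not shown here). It assumes the default `-outfmt 6` column order, where the subject ID is column 2 and the e-value is column 11. The function names and the toy records are hypothetical.

```python
from io import StringIO


def hit_ids(blast_lines, max_evalue=1e-5):
    """Return subject IDs whose e-value passes the cutoff, in first-seen order."""
    seen = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sseqid, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue and sseqid not in seen:
            seen.append(sseqid)
    return seen


def read_fasta(handle):
    """Yield (id, sequence) pairs; the id is the header text up to the first whitespace."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        yield name, "".join(chunks)


def extract_hits(blast_lines, fasta_handle, max_evalue=1e-5):
    """Return the FASTA records whose IDs appear among the accepted BLAST hits."""
    wanted = set(hit_ids(blast_lines, max_evalue))
    return [(name, seq) for name, seq in read_fasta(fasta_handle) if name in wanted]


# Toy data (made up for illustration, not taken from the real files)
toy_blast = [
    "q1\trefA\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n",
    "q1\trefB\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.5\t20\n",
]
toy_fasta = StringIO(">refA some description\nMKV\nLLA\n>refB\nMMM\n")
for name, seq in extract_hits(toy_blast, toy_fasta):
    print(f">{name}\n{seq}")
```

Since `blastp` is already run with `-evalue 0.00001`, the cutoff inside `hit_ids` is redundant in this workflow; it is kept so the sketch also works on unfiltered BLAST output.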

In [49]:
cd 
#mkdir ./phylo
cd ./phylo
#cp /home/2019_2020/data/phylo/extract_seqs_from_blast_result.py .
#cp /home/2019_2020/data/phylo/all_ref_proteomes.faa .
ls

NP_213724.1.fas  NP_Unk02.fas               all_ref_proteomes.faa.pin
NP_213887.1.fas  all_ref_proteomes.faa      all_ref_proteomes.faa.psq
NP_Unk01.fas     all_ref_proteomes.faa.phr  extract_seqs_from_blast_result.py


In [25]:
# Make database
makeblastdb -dbtype prot -in all_ref_proteomes.faa



Building a new DB, current time: 01/06/2020 22:00:04
New DB name:   /home/2019_2020/s.sanchez-heredero/phylo/all_ref_proteomes.faa
New DB title:  all_ref_proteomes.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 88473 sequences in 5.04095 seconds.


In [13]:
# Extract my sequences
# Note: `grep -A 1` keeps the header plus one following line, so each sequence must sit on a single line;
# -F matches the ID literally (otherwise the dot in the ID matches any character)
grep -A 1 -F 'NP_213724.1' /home/2019_2020/data/phylo/novel_proteome.faa > NP_213724.1.fas
grep -A 1 -F 'NP_213887.1' /home/2019_2020/data/phylo/novel_proteome.faa > NP_213887.1.fas
grep -A 1 -F 'NP_Unk01' /home/2019_2020/data/phylo/novel_proteome.faa > NP_Unk01.fas
grep -A 1 -F 'NP_Unk02' /home/2019_2020/data/phylo/novel_proteome.faa > NP_Unk02.fas

In [50]:
# BLAST
blastp -task blastp -query NP_213724.1.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 0.00001 > NP_213724.1.blast
blastp -task blastp -query NP_213887.1.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 0.00001 > NP_213887.1.blast
blastp -task blastp -query NP_Unk01.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 0.00001 > NP_Unk01.blast
blastp -task blastp -query NP_Unk02.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 0.00001 > NP_Unk02.blast

python extract_seqs_from_blast_result.py NP_213724.1.blast all_ref_proteomes.faa > NP_213724.1.blast_homologs.faa
python extract_seqs_from_blast_result.py NP_213887.1.blast all_ref_proteomes.faa > NP_213887.1.blast_homologs.faa
python extract_seqs_from_blast_result.py NP_Unk01.blast all_ref_proteomes.faa > NP_Unk01.blast_homologs.faa
python extract_seqs_from_blast_result.py NP_Unk02.blast all_ref_proteomes.faa > NP_Unk02.blast_homologs.faa

In [51]:
ls

NP_213724.1.blast               NP_Unk02.blast
NP_213724.1.blast_homologs.faa  NP_Unk02.blast_homologs.faa
NP_213724.1.fas                 NP_Unk02.fas
NP_213887.1.blast               all_ref_proteomes.faa
NP_213887.1.blast_homologs.faa  all_ref_proteomes.faa.phr
NP_213887.1.fas                 all_ref_proteomes.faa.pin
NP_Unk01.blast                  all_ref_proteomes.faa.psq
NP_Unk01.blast_homologs.faa     extract_seqs_from_blast_result.py
NP_Unk01.fas


<br><br>
### 2. Building the trees
Now that we have the sequences for the phylogenetic analysis, we need to:
1. Make an alignment of the sequences using `clustalo`
2. Build the tree using `iqtree`
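
Because the same two steps are repeated for every gene, they can be generated in a loop. The following is a minimal sketch, assuming `clustalo` and `iqtree` are on the `PATH` (in this notebook they are called by full path) and using the file naming scheme from this section. The helper name `tree_commands` is hypothetical. The `-nt AUTO` option is the one suggested by the IQ-TREE hint in the log.

```python
import shlex

GENES = ["NP_213724.1", "NP_213887.1", "NP_Unk01", "NP_Unk02"]


def tree_commands(gene, threads="AUTO"):
    """Build the alignment and tree-inference commands for one gene."""
    homologs = f"{gene}.blast_homologs.faa"
    alignment = f"{gene}.blast_homologs.alg"
    align_cmd = ["clustalo", "-i", homologs, "-o", alignment]
    tree_cmd = ["iqtree", "-s", alignment, "-nt", str(threads)]
    return [align_cmd, tree_cmd]


# Print the commands; to execute them instead, one could use
# subprocess.run(cmd, check=True) for each cmd.
for gene in GENES:
    for cmd in tree_commands(gene):
        print(shlex.join(cmd))
```

Printing first makes it easy to check the file names before launching the model search, which is slow for each alignment.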

In [52]:
# Make alignments
/home/miniconda3/bin/clustalo -i NP_213724.1.blast_homologs.faa > NP_213724.1.blast_homologs.alg
/home/miniconda3/bin/clustalo -i NP_213887.1.blast_homologs.faa > NP_213887.1.blast_homologs.alg
/home/miniconda3/bin/clustalo -i NP_Unk01.blast_homologs.faa > NP_Unk01.blast_homologs.alg
/home/miniconda3/bin/clustalo -i NP_Unk02.blast_homologs.faa > NP_Unk02.blast_homologs.alg

In [None]:
# Build trees
/home/miniconda3/bin/iqtree -s NP_213724.1.blast_homologs.alg
/home/miniconda3/bin/iqtree -s NP_213887.1.blast_homologs.alg
/home/miniconda3/bin/iqtree -s NP_Unk01.blast_homologs.alg
/home/miniconda3/bin/iqtree -s NP_Unk02.blast_homologs.alg

IQ-TREE multicore version 1.6.12 for Linux 64-bit built Aug 15 2019
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor,
Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    localhost.localdomain (SSE3, 125 GB RAM)
Command: /home/miniconda3/bin/iqtree -s NP_213724.1.blast_homologs.alg
Seed:    323297 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Mon Jan  6 23:03:24 2020
Kernel:  SSE2 - 1 threads (40 CPU cores detected)

HINT: Use -nt option to specify number of threads because your CPU has 40 cores!
HINT: -nt AUTO will automatically determine the best number of threads to use.

Reading alignment file NP_213724.1.blast_homologs.alg ... Fasta format detected
Alignment most likely contains protein sequences
Alignment has 36 sequences with 563 columns, 466 distinct patterns
300 parsimony-informative, 107 singleton sites, 156 constant sites
                     Gap/Ambiguity  Composition  p-value
   1  122586.Q9JX95_NEIMB   40.67%    passed     96.59%
 

 82  WAG+I+G4      17994.628    71  36131.256    36152.079    36438.919
 83  WAG+R2        18129.585    71  36401.169    36421.992    36708.832
 84  WAG+R3        18021.818    73  36189.636    36211.730    36505.965
 85  WAG+R4        17990.853    75  36131.706    36155.114    36456.702
 86  WAG+R5        17988.475    77  36130.951    36155.718    36464.613
 92  WAG+F         18847.350    88  37870.700    37903.746    38252.028
 93  WAG+F+I       18566.959    89  37311.918    37345.787    37697.580
 94  WAG+F+G4      18048.400    89  36274.799    36308.668    36660.461
 95  WAG+F+I+G4    18023.312    90  36226.625    36261.328    36616.620
 96  WAG+F+R2      18166.484    90  36512.967    36547.671    36902.962
 97  WAG+F+R3      18054.021    92  36292.041    36328.450    36690.703
 98  WAG+F+R4      18017.539    94  36223.078    36261.241    36630.406
 99  WAG+F+R5      18016.646    96  36225.292    36265.258    36641.287
105  cpREV         18985.557    69  38109.113    38128.707    38

264  LG+I+G4       17927.679    71  35997.357    36018.180    36305.020
265  LG+R2         18110.916    71  36363.831    36384.654    36671.494
266  LG+R3         17964.514    73  36075.027    36097.121    36391.357
267  LG+R4         17927.896    75  36005.791    36029.200    36330.787
268  LG+R5         17924.500    77  36003.000    36027.767    36336.662
274  LG+F          18953.923    88  38083.846    38116.893    38465.175
275  LG+F+I        18662.904    89  37503.807    37537.676    37889.469
276  LG+F+G4       17983.584    89  36145.167    36179.036    36530.829
277  LG+F+I+G4     17963.037    90  36106.073    36140.776    36496.068
278  LG+F+R2       18150.012    90  36480.023    36514.726    36870.018
279  LG+F+R3       17999.707    92  36183.413    36219.822    36582.075
280  LG+F+R4       17964.008    94  36116.016    36154.178    36523.344
281  LG+F+R5       17960.695    96  36113.390    36153.356    36529.385
287  DCMut         19230.280    69  38598.560    38618.155    38

448  Blosum62+R3   18049.399    73  36244.798    36266.892    36561.127
449  Blosum62+R4   18021.180    75  36192.360    36215.769    36517.356
450  Blosum62+R5   18020.048    77  36194.096    36218.863    36527.758
456  Blosum62+F    18811.880    88  37799.759    37832.806    38181.088
457  Blosum62+F+I  18556.015    89  37290.029    37323.898    37675.691
458  Blosum62+F+G4 18059.770    89  36297.541    36331.410    36683.202
459  Blosum62+F+I+G4 18037.171    90  36254.341    36289.045    36644.336
460  Blosum62+F+R2 18177.814    90  36535.629    36570.332    36925.624
461  Blosum62+F+R3 18061.842    92  36307.684    36344.093    36706.346
462  Blosum62+F+R4 18031.265    94  36250.529    36288.691    36657.857
463  Blosum62+F+R5 18029.983    96  36251.965    36291.931    36667.960
469  mtMet         20552.782    69  41243.563    41263.158    41542.560
470  mtMet+I       20260.503    70  40661.006    40681.209    40964.335
471  mtMet+G4      19089.035    70  38318.070    38338.273    
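
The rows above come from IQ-TREE's model selection table; the columns are the row number, the model name, the negative log-likelihood, the number of free parameters, AIC, AICc and BIC (for `WAG+I+G4`, 2 × 17994.628 + 2 × 71 = 36131.256, which matches the AIC column). A minimal sketch for ranking such rows by BIC follows. The function names are hypothetical, and the strict three-decimal check is an assumption used to discard rows that were cut off when the log was pasted. As the log here is truncated, the result only describes the rows shown, not necessarily the model IQ-TREE finally selected.

```python
import re

NUMBER = re.compile(r"\d+\.\d{3}")  # scores in the pasted table carry three decimals


def parse_model_rows(lines):
    """Parse complete table rows; skip blank, truncated, or unrelated lines."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 7 or not parts[0].isdigit() or not parts[3].isdigit():
            continue
        scores = [parts[2], parts[4], parts[5], parts[6]]
        if not all(NUMBER.fullmatch(s) for s in scores):
            continue  # e.g. a BIC value cut off to "38"
        lnl, aic, aicc, bic = (float(s) for s in scores)
        rows.append({"model": parts[1], "neg_lnl": lnl, "df": int(parts[3]),
                     "aic": aic, "aicc": aicc, "bic": bic})
    return rows


def best_by_bic(rows):
    """Return the row with the lowest BIC."""
    return min(rows, key=lambda row: row["bic"])


# A few rows copied from the log above, including one truncated row
sample = """
 82  WAG+I+G4      17994.628    71  36131.256    36152.079    36438.919
105  cpREV         18985.557    69  38109.113    38128.707    38
264  LG+I+G4       17927.679    71  35997.357    36018.180    36305.020
267  LG+R4         17927.896    75  36005.791    36029.200    36330.787
""".splitlines()

rows = parse_model_rows(sample)
print(best_by_bic(rows)["model"])
```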

<br><br>
### 3. Display trees
Because the phylogenetic trees are visualized with ete3, a Python toolkit, I wrote a separate Python script that is run as `xvfb-run python print_trees.py <treefile> additional_seq_info.tsv`

In [None]:
xvfb-run python print_trees.py NP_Unk01.blast_homologs.alg.treefile additional_seq_info.tsv
xvfb-run python print_trees.py NP_Unk02.blast_homologs.alg.treefile additional_seq_info.tsv
# Running these commands raises an error and never shows the trees, so I rendered the trees by running this program in my own Python launcher
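
One way to get the functional and taxonomic information into a tree without a graphical backend is to rewrite the leaf names in the Newick file and load the result into a web viewer such as the ones suggested in the tasks. The sketch below is not the content of `print_trees.py`. It assumes a tab-separated file whose first column is the sequence ID and whose second and third columns hold the annotations to show; the real layout of `additional_seq_info.tsv` may differ. It also assumes unquoted leaf names and a Newick string without whitespace. All function names and the toy rows are hypothetical.

```python
import csv
import re
from io import StringIO

# A leaf name directly follows "(" or ","; internal labels follow ")" and are left alone
LEAF = re.compile(r"(?<=[(,])([^(),:;]+)")
# Characters that are not allowed in unquoted Newick labels
FORBIDDEN = re.compile(r"[\s():;,\[\]']")


def load_annotations(handle, id_col=0, label_cols=(1, 2)):
    """Map each sequence ID to 'ID|annotation|annotation' built from the chosen columns."""
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        extra = [row[i] for i in label_cols if i < len(row) and row[i]]
        labels[row[id_col]] = "|".join([row[id_col]] + extra)
    return labels


def relabel_newick(newick, labels):
    """Replace leaf names with annotated labels; unknown leaves keep their name."""
    def swap(match):
        name = match.group(1)
        return FORBIDDEN.sub("_", labels.get(name, name))
    return LEAF.sub(swap, newick)


# Toy rows and tree (made up for illustration, not the real file contents)
toy_tsv = StringIO("A1\tAquifex aeolicus\tnorR\nB2\tMethanosarcina acetivorans\tnifH\n")
toy_tree = "((A1:0.1,B2:0.2)95:0.05,C3:0.3);"
print(relabel_newick(toy_tree, load_annotations(toy_tsv)))
```

Branch lengths and support values are untouched, so the topology shown by the viewer is the one IQ-TREE inferred; only the labels change.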

#### Unk01 tree
<img src="Unk01_tree.png" alt="Drawing" style="width: 500px;"/>  
#### Unk02 tree
<img src="Unk02_tree.png" alt="Drawing" style="width: 500px;"/>

<br><br>
### 4. Answers


__What is the closest ortholog from a phylogenetic point of view? From what species?__ <br>

- Unk01. The closest ortholog is a 4Fe-4S iron-sulfur cluster-binding protein of the NifH/frxC family, from the archaeal species Methanosarcina acetivorans. <br>
- NP_Unk02. The closest ortholog is norR, a protein containing a Pfam sigma-54 factor interaction domain, from Aquifex aeolicus.

__Does the orthology assignment support your previous functional annotations? (You might need to look up the functional annotation (i.e. gene names) of close orthologs.)__ <br>

- Unk01. This orthology assignment supports my previous functional annotation, given that this protein's function is also nitrogen binding. <br>
- NP_Unk02. I predicted this protein to be a σ-54-dependent transcriptional regulator, and its closest ortholog contains a domain that interacts with σ-54, so I think the prediction was accurate.

__Are all genes present in the reference proteome of the same species? Why not? Do all the overexpressed genes share the same evolutionary history?__ <br>

No, the reference proteomes contain genes from many different species, because building a phylogenetic tree and inferring orthologs requires sequences from different species against which you can compare your query sequences.
Not all the overexpressed genes share the same evolutionary history. Unk01 seems to be closer to methanogenic Archaea, while Unk02 is closer to other bacteria. It is possible that Unk01 is the result of horizontal gene transfer, which would explain why it seems to have such a different origin.