# Phylogenetic analysis
To further investigate the functional roles and evolutionary origin of your candidate (upregulated) genes, you decide to perform a phylogenetic analysis comparing them against other prokaryotic proteomes (including the reference genome of the species matching the strain isolated from the hot spring).

### Tasks:
For each overexpressed gene (you have individual FASTA files created in the previous section), run a standard phylogenetic workflow:
1. Run a BLAST search for each overexpressed protein against all reference proteomes
2. Extract hits with e-value <= 0.00001 (tip: you can use BLAST parameters for this)
3. Create a FASTA file with all the sequences of the selected hits (tip: you can use `extract_seqs_from_blast_result.py`)
4. Build a phylogenetic tree from the FASTA file (suggested tools: clustalo, iqtree)
5. Visualize the result (suggested tools: etetoolkit.org/treeview, itol.embl.de, ete3)
6. Using the additional info file `additional_seq_info.tsv` (see full path above), extract functional and taxonomic information for each homolog in the trees, and display it on the tree for easier interpretation.


### 1. BLAST
The first step is to build the BLAST database from all reference proteomes in the file `/home/2019_2020/data/phylo/all_ref_proteomes.faa`  <br><br>
Next, we extract the four sequences of the overexpressed genes from `/home/2019_2020/data/phylo/novel_proteome.faa` <br><br>
Then we can run BLAST and extract the sequences of the significant hits with the provided Python script: `python extract_seqs_from_blast_result.py blast_output all_ref_proteomes.faa > homologs.faa`
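
The e-value filter can also be applied after the search, directly on the tabular output. The following is a minimal sketch, assuming the default 12-column `-outfmt 6` layout (subject ID in column 2, e-value in column 11). The function name `select_hit_ids` and the toy rows are hypothetical; this is not the logic of `extract_seqs_from_blast_result.py`, whose implementation is not shown here.

```python
def select_hit_ids(blast_tabular, max_evalue=1e-5):
    """Return unique subject IDs with e-value <= max_evalue, in first-seen order."""
    selected = []
    seen = set()
    for line in blast_tabular.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]      # column 2: sseqid
        evalue = float(fields[10])  # column 11: evalue
        if evalue <= max_evalue and subject_id not in seen:
            seen.add(subject_id)
            selected.append(subject_id)
    return selected


# Toy rows made up for illustration (not real BLAST results)
example = "\n".join([
    "q1\tsubjA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-120\t410",
    "q1\tsubjB\t35.0\t180\t100\t5\t10\t190\t12\t185\t2e-04\t45.1",
    "q1\tsubjA\t40.0\t50\t30\t0\t1\t50\t300\t350\t3e-06\t52.0",
])
print(select_hit_ids(example))  # prints ['subjA']
```

A subject can appear in several rows when BLAST reports more than one local alignment for it, so the sketch keeps each ID only once to avoid duplicated sequences in the tree input.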

In [33]:
cd 
#mkdir ./phylo
cd ./phylo
#cp /home/2019_2020/data/phylo/extract_seqs_from_blast_result.py .
#cp /home/2019_2020/data/phylo/all_ref_proteomes.faa .
ls

NP_213724.1.blast  NP_Unk02.fas               all_ref_proteomes.faa.psq
NP_213724.1.fas    all_ref_proteomes.faa      extract_seqs_from_blast_result.py
NP_213887.1.fas    all_ref_proteomes.faa.phr
NP_Unk01.fas       all_ref_proteomes.faa.pin


In [25]:
# Make database
makeblastdb -dbtype prot -in all_ref_proteomes.faa



Building a new DB, current time: 01/06/2020 22:00:04
New DB name:   /home/2019_2020/s.sanchez-heredero/phylo/all_ref_proteomes.faa
New DB title:  all_ref_proteomes.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 88473 sequences in 5.04095 seconds.


In [13]:
# Extract my sequences
# -F matches the ID literally (the "." is not treated as a regex wildcard)
# -A 1 keeps only one line after the header, so each sequence must sit on a single line
grep -F -A 1 'NP_213724.1' /home/2019_2020/data/phylo/novel_proteome.faa > NP_213724.1.fas
grep -F -A 1 'NP_213887.1' /home/2019_2020/data/phylo/novel_proteome.faa > NP_213887.1.fas
grep -F -A 1 'NP_Unk01' /home/2019_2020/data/phylo/novel_proteome.faa > NP_Unk01.fas
grep -F -A 1 'NP_Unk02' /home/2019_2020/data/phylo/novel_proteome.faa > NP_Unk02.fas
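
Extracting with `grep -A 1` only works when every sequence occupies exactly one line, and a substring pattern such as `NP_Unk01` would also match a longer ID that starts the same way. The following is a hedged sketch of a stricter extraction, assuming the sequence ID is the first whitespace-delimited token of the header line. The function `extract_fasta_record` and the toy records are hypothetical.

```python
def extract_fasta_record(fasta_text, seq_id):
    """Return the full FASTA record (header plus all sequence lines) for seq_id, or ""."""
    record = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited token after ">"
            header_tokens = line[1:].split()
            keep = bool(header_tokens) and header_tokens[0] == seq_id
        if keep:
            record.append(line)
    return "\n".join(record) + "\n" if record else ""


# Toy records made up for illustration; the first one spans two sequence lines
toy = ">NP_Unk01 hypothetical protein\nMKT\nLLV\n>NP_Unk010 other\nMAA\n"
print(extract_fasta_record(toy, "NP_Unk01"), end="")
```

Here the exact comparison returns only the `NP_Unk01` record, with both of its sequence lines, and leaves `NP_Unk010` out.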

In [42]:
# BLAST (the -evalue cutoff matches the task threshold of 0.00001)
blastp -task blastp -query NP_213724.1.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 1e-5 > NP_213724.1.blast
blastp -task blastp -query NP_213887.1.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 1e-5 > NP_213887.1.blast
blastp -task blastp -query NP_Unk01.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 1e-5 > NP_Unk01.blast
blastp -task blastp -query NP_Unk02.fas -db all_ref_proteomes.faa -outfmt 6 -evalue 1e-5 > NP_Unk02.blast

# Extract the sequences of the hits from the reference proteomes
python extract_seqs_from_blast_result.py NP_213724.1.blast all_ref_proteomes.faa > NP_213724.1.blast_homologs.faa
python extract_seqs_from_blast_result.py NP_213887.1.blast all_ref_proteomes.faa > NP_213887.1.blast_homologs.faa
python extract_seqs_from_blast_result.py NP_Unk01.blast all_ref_proteomes.faa > NP_Unk01.blast_homologs.faa
python extract_seqs_from_blast_result.py NP_Unk02.blast all_ref_proteomes.faa > NP_Unk02.blast_homologs.faa

In [43]:
ls

NP_213724.1.blast               NP_Unk02.blast
NP_213724.1.blast_homologs.faa  NP_Unk02.blast_homologs.faa
NP_213724.1.fas                 NP_Unk02.fas
NP_213887.1.blast               all_ref_proteomes.faa
NP_213887.1.blast_homologs.faa  all_ref_proteomes.faa.phr
NP_213887.1.fas                 all_ref_proteomes.faa.pin
NP_Unk01.blast                  all_ref_proteomes.faa.psq
NP_Unk01.blast_homologs.faa     extract_seqs_from_blast_result.py
NP_Unk01.fas


<br><br>
### 2. Building the trees
Now that we have the sequences for the phylogenetic analysis, we need to:
1. Make an alignment of the sequences using `clustalo`
2. Build the tree using `iqtree`

In [None]:
# Make alignments (clustalo writes FASTA to stdout by default)
/home/miniconda3/bin/clustalo -i NP_213724.1.blast_homologs.faa > NP_213724.1.blast_homologs.alg
/home/miniconda3/bin/clustalo -i NP_213887.1.blast_homologs.faa > NP_213887.1.blast_homologs.alg
/home/miniconda3/bin/clustalo -i NP_Unk01.blast_homologs.faa > NP_Unk01.blast_homologs.alg
/home/miniconda3/bin/clustalo -i NP_Unk02.blast_homologs.faa > NP_Unk02.blast_homologs.alg
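
Before building trees, it can help to confirm that each `.alg` file is a valid alignment: every sequence should have the same length once gaps are included, and a tree only makes sense with at least three sequences. The following is a minimal sketch, assuming the alignment is in FASTA format (the Clustal Omega default). The helper names `read_fasta` and `check_alignment` are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping ID to sequence (sequence lines are joined)."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line
    return sequences


def check_alignment(text):
    """Return (number_of_sequences, alignment_length); raise ValueError if lengths differ."""
    sequences = read_fasta(text)
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences have different lengths: {sorted(lengths)}")
    return len(sequences), (lengths.pop() if lengths else 0)


# Toy alignment made up for illustration
toy_alignment = ">a\nMK-T\n>b\nMKLT\n>c\nM--T\n"
print(check_alignment(toy_alignment))  # prints (3, 4)
```

To check a real file, read it first, for example `check_alignment(open("NP_Unk01.blast_homologs.alg").read())`.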

In [46]:
# Build trees (iqtree writes each tree in Newick format to <alignment>.treefile)
/home/miniconda3/bin/iqtree -s NP_213724.1.blast_homologs.alg
/home/miniconda3/bin/iqtree -s NP_213887.1.blast_homologs.alg
/home/miniconda3/bin/iqtree -s NP_Unk01.blast_homologs.alg
/home/miniconda3/bin/iqtree -s NP_Unk02.blast_homologs.alg
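
For the annotation step, the leaf names of each tree need to be matched against `additional_seq_info.tsv`. The following is a rough sketch using only the standard library. It assumes unquoted leaf names in the Newick file and assumes that the first TSV column holds the sequence ID; the real column layout of `additional_seq_info.tsv` is not shown here, so adjust accordingly. All function names and toy values are hypothetical.

```python
import csv
import io
import re


def newick_leaf_names(newick):
    """Return leaf names from a Newick string (assumes unquoted names)."""
    # A leaf name directly follows "(" or ","; internal node labels follow ")"
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def load_annotations(tsv_text):
    """Map the first column (assumed to be the sequence ID) to the remaining columns."""
    table = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if row:
            table[row[0]] = row[1:]
    return table


def annotate_leaves(newick, tsv_text):
    """Return a dict mapping each leaf name to its annotation columns ([] if missing)."""
    annotations = load_annotations(tsv_text)
    return {leaf: annotations.get(leaf, []) for leaf in newick_leaf_names(newick)}


# Toy tree and annotation rows made up for illustration
toy_tree = "(seqA:0.1,(seqB:0.2,seqC:0.3)90:0.4);"
toy_info = "seqA\tBacteria\tchaperone\nseqB\tArchaea\tchaperone\n"
print(annotate_leaves(toy_tree, toy_info))
```

The resulting mapping can then be used to relabel or colour leaves in the chosen viewer. Leaves without an entry get an empty list, which makes missing annotations easy to spot.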