## Workshop 4 - Multiple Sequence Alignment
### Perform multiple alignment and create a phylogenetic tree using the Linux command line
**Remember to change the kernel to Bash or Calysto Bash before running these commands!** 
Follow the steps below to perform the analysis and use the results to answer the worksheet.   
You will be required to write the code yourself for some of the steps, so feel free to experiment!


1. Create a folder to work in:

In [None]:
mkdir -p ~/Workshop4_MSA && cd ~/Workshop4_MSA

3. Check the content of the file (just the first 10 lines, using the `head` command):

In [None]:
head vasa_seqs.fasta

**- What type of sequences are in the file (DNA/RNA or Protein)?**  
4. Extract the sequence ids/accessions from the sequence headers of `vasa_seqs.fasta` and save them into `vasa_seqs.ids`

In [None]:
grep "^>" vasa_seqs.fasta | cut -f1 -d" " | sed 's/^>//' > vasa_seqs.ids
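
To see what each stage of this pipeline does, here is a minimal sketch on a toy FASTA file. The file name `toy.fasta`, its accessions and its sequences are made up for illustration and are not part of the workshop data.

```bash
# Work in a throwaway directory so nothing in the workshop folder is touched
tmpdir=$(mktemp -d)

# Hypothetical toy FASTA with two records (made-up accessions and sequences)
cat > "$tmpdir/toy.fasta" <<'EOF'
>ABC123.1 Example species vasa mRNA
ATGGCGTACG
>XYZ789.2 Another species vasa mRNA
ATGGCTTACG
EOF

# Stage 1: keep only the header lines ("^" anchors ">" to the start of the line)
grep "^>" "$tmpdir/toy.fasta"

# Stage 2: keep the first space-delimited field of each header, e.g. ">ABC123.1"
grep "^>" "$tmpdir/toy.fasta" | cut -f1 -d" "

# Stage 3: strip the leading ">" to leave the bare accession
ids=$(grep "^>" "$tmpdir/toy.fasta" | cut -f1 -d" " | sed 's/^>//')
echo "$ids"
```

Running the stages one at a time like this is a handy way to debug any pipeline: add one `|` step, look at the output, then add the next.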

5. Check the content of the file with the `cat` command:

In [None]:
cat vasa_seqs.ids 

**- How many sequences did we have in our original file (based on the list of ids)?**  
*You can use the `wc -l` command to confirm this.*
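
As a small illustration of line counting, the sketch below runs `wc -l` on a made-up three-line id list (`toy.ids` is a hypothetical file, not the workshop data). Feeding the file to `wc` with `<` prints the count without the file name, which is convenient when you want to store the number in a variable.

```bash
# Hypothetical id list with three entries, written to a temporary directory
tmpdir=$(mktemp -d)
printf 'ID_A\nID_B\nID_C\n' > "$tmpdir/toy.ids"

# wc -l counts newline characters, i.e. the number of lines in the file
wc -l "$tmpdir/toy.ids"

# Reading from standard input prints only the number
n_ids=$(wc -l < "$tmpdir/toy.ids")

# $(( ... )) strips any padding that some wc implementations add
echo "Number of ids: $((n_ids))"
```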

6. Use the following commands from the [NCBI E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/) to retrieve vasa protein sequences and metadata based on the accessions in the fasta headers (that are saved in `vasa_seqs.ids`):

In [None]:
epost -db nucleotide -input vasa_seqs.ids -format acc | efetch -format fasta_cds_aa > vasa_prots.fasta
epost -db nucleotide -input vasa_seqs.ids -format acc | esummary | xtract -pattern DocumentSummary -element Title,AccessionVersion,Organism,TaxId > vasa_prots_table.txt

7. Check that the files were downloaded into the current folder
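
One possible way to do this check is to test that each expected file exists and is non-empty. The sketch below is an illustration only: it runs in a temporary directory with placeholder files standing in for the real downloads, so it can be tried without network access.

```bash
# Stand-in directory with placeholder files (hypothetical content),
# mimicking the two outputs expected from the previous step
tmpdir=$(mktemp -d)
printf '>toy_protein\nMSTK\n' > "$tmpdir/vasa_prots.fasta"
: > "$tmpdir/vasa_prots_table.txt"   # deliberately empty, to show a failed check

# [ -s FILE ] is true only if FILE exists and has a size greater than zero
report=""
for f in vasa_prots.fasta vasa_prots_table.txt; do
    if [ -s "$tmpdir/$f" ]; then
        report="$report$f:ok "
    else
        report="$report$f:missing_or_empty "
    fi
done
echo "$report"

# In the workshop folder itself, a plain listing with sizes is usually enough:
# ls -lh vasa_prots.fasta vasa_prots_table.txt
```

An empty output file usually means the download step failed silently, so checking the size is more informative than checking the name alone.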

8. Check the content of the metadata table with `cat vasa_prots_table.txt` or print the sequence headers from the fasta file with `grep ">" vasa_prots.fasta` 

**- How many proteins were downloaded?**  
Can you count the number of entries directly in one command?
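
If you get stuck, here is one possible approach, shown on a made-up file (`toy_prots.fasta` and its records are hypothetical). Try adapting it to the real file yourself.

```bash
# Hypothetical protein FASTA with three records, in a temporary directory
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_prots.fasta" <<'EOF'
>prot_1 hypothetical protein
MSTKLV
>prot_2 hypothetical protein
MSSKLV
>prot_3 hypothetical protein
MATKLV
EOF

# grep -c prints the number of matching lines instead of the lines themselves;
# "^>" matches only lines that start with ">", i.e. FASTA headers
n_prots=$(grep -c "^>" "$tmpdir/toy_prots.fasta")
echo "$n_prots"
```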

9. Align the vasa sequences and proteins using `probcons`
*Note that it will take 10-15 minutes to complete, so it’s a good time to make yourself a cuppa or have a toilet break*

In [None]:
probcons vasa_seqs.fasta > vasa_seqs_probcons.fasta

Now do the same for the proteins:

10. Convert the aligned fasta to `phylip` format (an alignment format read by many phylogenetic tree programs) with `bioconvert` (see [documentation](https://bioconvert.readthedocs.io/en/dev/user_guide.html#)):

In [None]:
bioconvert fasta2phylip vasa_seqs_probcons.fasta vasa_seqs_probcons.phy
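
A PHYLIP file starts with a header line giving the number of sequences and the alignment length, so every sequence in the aligned FASTA should have the same length (gaps included). The sketch below is an optional sanity check on a made-up alignment (`toy_aln.fasta` is hypothetical); it is not part of `bioconvert`.

```bash
# Hypothetical aligned FASTA with wrapped sequence lines, in a temporary directory
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_aln.fasta" <<'EOF'
>seqA
ATG-CC
TTA
>seqB
ATGACC
TTA
>seqC
AT--CC
T-A
EOF

# Print the total length of each record (joining wrapped sequence lines),
# then collapse to the distinct lengths; a valid alignment yields one value
lengths=$(awk '
    /^>/ { if (seq != "") print length(seq); seq = ""; next }
         { seq = seq $0 }
    END  { if (seq != "") print length(seq) }
' "$tmpdir/toy_aln.fasta" | sort -u)
echo "$lengths"
```

If more than one length is printed for your own file, the input is probably not an alignment (for example, the unaligned FASTA was passed by mistake).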

11. Create a phylogenetic tree with `phyml`:

In [None]:
phyml -i vasa_seqs_probcons.phy

12. Download the files in the `Workshop4_MSA` folder to your computer; we'll need them for the rest of the workshop!
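
If your environment makes it awkward to download many files at once, one option is to pack the folder into a single archive first. The sketch below is an assumption about what might be convenient, not a required step; it uses a temporary stand-in folder with a placeholder file so it can be run safely.

```bash
# Stand-in for the workshop folder, created under a temporary "home" directory
tmphome=$(mktemp -d)
mkdir -p "$tmphome/Workshop4_MSA"
printf 'placeholder\n' > "$tmphome/Workshop4_MSA/example.txt"

# -c create, -z gzip-compress, -f archive name; -C changes into the parent
# directory first so the archive stores paths starting at Workshop4_MSA/
tar -czf "$tmphome/Workshop4_MSA.tar.gz" -C "$tmphome" Workshop4_MSA

# List the archive contents to confirm what was packed
contents=$(tar -tzf "$tmphome/Workshop4_MSA.tar.gz")
echo "$contents"
```

For the real folder, the equivalent command would be `tar -czf ~/Workshop4_MSA.tar.gz -C ~ Workshop4_MSA`, after which only `Workshop4_MSA.tar.gz` needs to be downloaded.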