## Workshop 4 - Multiple Sequence Alignment
### Perform multiple alignment and create a phylogenetic tree using the Linux command line
**Remember to change the kernel to Bash or Calysto Bash before running these commands!** 
Follow the steps below to perform the analysis and use the results to answer the worksheet.   
You will be required to write the code yourself for some of the steps, so feel free to experiment!


1. Create a folder to work in:

In [2]:
mkdir -p ~/sandbox/week4
cd ~/sandbox/week4




2. Copy our input file `vasa_seqs.fasta` into the folder (or upload it from your computer):

In [None]:
cp ~/Workshop4_MSA/vasa_seqs.fasta ~/sandbox/week4/

3. Check the content of the file (just first 10 lines with the `head` command)

**- What type of sequences are in the file (DNA/RNA or Protein)?**  
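
One way to approach this question, beyond reading the first lines by eye, is to tally which letters occur in the sequence lines. A file that uses only A, C, G, T (or U, plus the occasional N) holds nucleotides, while a file using around 20 different letters holds protein. The sketch below runs on a made-up toy record (`toy.fasta` is not part of the workshop data), so swap in `vasa_seqs.fasta` to check the real file.

```bash
# Work in a throwaway directory so the workshop files are untouched
tmpdir=$(mktemp -d)

# Toy FASTA record; the header and sequence are invented for illustration
cat > "$tmpdir/toy.fasta" <<'EOF'
>toy_record made-up example
ATGGCTGACGAGTTAA
GGCCATTA
EOF

# Show the first lines of the file (use -n 10 on the real file)
head -n 3 "$tmpdir/toy.fasta"

# Drop header lines, put one character per line, then count each distinct character
grep -v "^>" "$tmpdir/toy.fasta" | fold -w 1 | sort | uniq -c

# Collect just the distinct letters on a single line
alphabet=$(grep -v "^>" "$tmpdir/toy.fasta" | fold -w 1 | sort -u | tr -d '\n')
echo "Alphabet: $alphabet"
```
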
4. Extract the sequence ids/accessions from the sequence headers of `vasa_seqs.fasta` and save them into `vasa_seqs.ids`

In [2]:
grep ">" vasa_seqs.fasta | cut -f1 -d" " | sed 's/>//' > vasa_seqs.ids
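
To see what each stage of this pipeline contributes, it can help to run the stages one at a time on a small file. The sketch below uses a toy FASTA file with two headers in the style of `vasa_seqs.fasta`. The sequence lines are made up, and `toy.fasta` and `toy.ids` are illustrative names only.

```bash
# Work in a throwaway directory so the workshop files are untouched
tmpdir=$(mktemp -d)

# Toy FASTA file: real-looking headers, invented sequence lines
cat > "$tmpdir/toy.fasta" <<'EOF'
>AB032566.1 Oncorhynchus mykiss vas mRNA for Vasa, complete cds
ATGGCTGACGAG
>JN712912.1 Salmo salar vasa mRNA, complete cds
ATGGATGACTGG
EOF

# Stage 1: keep only the header lines (^ anchors '>' to the start of the line)
grep "^>" "$tmpdir/toy.fasta"

# Stage 2: keep the first space-delimited field, e.g. ">AB032566.1"
grep "^>" "$tmpdir/toy.fasta" | cut -f1 -d" "

# Stage 3: strip the leading ">" and save the bare accessions
grep "^>" "$tmpdir/toy.fasta" | cut -f1 -d" " | sed 's/^>//' > "$tmpdir/toy.ids"
cat "$tmpdir/toy.ids"
```
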




5. Check the content of the file with the `cat` command:

In [3]:
cat vasa_seqs.ids 

AB032566.1
JX437185.1
JN712912.1
KJ397267.1
EU253482.1
GU581280.1
AF262962.1
AB372211.1
AY626785.1
HQ412807.1
EU035615.1
AB235177.1
DQ095772.2
DQ288391.1
AF513908.1



**- How many sequences did we have in our original file (based on the list of ids)?**
*You can use the `wc -l` command to confirm this.*

15 vasa_seqs.ids
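
As a cross-check, the number of ids should match the number of header lines in the FASTA file itself. The sketch below compares the two counts on a toy file with three made-up records. The file names are illustrative, and on the real data you would use `vasa_seqs.fasta` and `vasa_seqs.ids`.

```bash
# Work in a throwaway directory so the workshop files are untouched
tmpdir=$(mktemp -d)

# Toy FASTA file with three invented records
printf '>seq1 first toy record\nACGT\n>seq2 second toy record\nGGCC\n>seq3 third toy record\nTTAA\n' > "$tmpdir/toy.fasta"

# Build the id list the same way as in step 4
grep "^>" "$tmpdir/toy.fasta" | cut -f1 -d" " | sed 's/^>//' > "$tmpdir/toy.ids"

# Reading from standard input makes wc print only the number, without the file name
n_ids=$(wc -l < "$tmpdir/toy.ids")

# grep -c counts matching lines, here the headers in the FASTA file
n_headers=$(grep -c "^>" "$tmpdir/toy.fasta")

echo "ids: $n_ids, headers: $n_headers"
```
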



6. Use the following commands from the [NCBI E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/) to retrieve vasa protein sequences and metadata based on the accessions in the fasta headers (that are saved in `vasa_seqs.ids`):

In [5]:
epost -db nucleotide -input vasa_seqs.ids -format acc | efetch -format fasta_cds_aa > vasa_prots.fasta
epost -db nucleotide -input vasa_seqs.ids -format acc | esummary | xtract -pattern DocumentSummary -element Title,AccessionVersion,Organism,TaxId > vasa_prots_table.txt



7. Check that the files were downloaded into the current folder
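
Listing the folder with `ls -lh` is usually enough here. If you also want to catch files that exist but are empty (for example after a failed download), a small helper can test for that. The sketch below is one possible way to do it. `check_files` is a hypothetical helper written for this example, not part of any tool used in the workshop.

```bash
# Hypothetical helper: report whether each file exists and is non-empty
check_files() {
    local status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "OK       $f ($(wc -c < "$f") bytes)"
        else
            echo "MISSING  $f (absent or empty)"
            status=1
        fi
    done
    return $status
}

# In the workshop folder you would call:
#   check_files vasa_prots.fasta vasa_prots_table.txt

# Self-contained demonstration with one real file and one missing file
tmpdir=$(mktemp -d)
echo ">toy" > "$tmpdir/present.fasta"
check_files "$tmpdir/present.fasta" "$tmpdir/absent.txt" || echo "At least one file needs to be downloaded again"
```
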

8. Check the content of the metadata table with `cat vasa_prots_table.txt` or print the sequence headers from the fasta file with `grep ">" vasa_prots.fasta` 

**- How many proteins were downloaded?**  
Can you count the number of entries directly in one command?

cat vasa_prots_table.txt
Bubalus bubalis vasa mRNA, complete cds	KJ397267.1	water buffalo	89462
Bos taurus vasa mRNA, complete cds, alternatively spliced	JX437185.1	cattle	9913
Salmo salar vasa mRNA, complete cds	JN712912.1	Atlantic salmon	8030
Macropus eugenii DEAD (Asp-Glu-Ala-Asp) box polypeptide 4 transcript variant 2 (DDX4) mRNA, complete cds, alternatively spliced	HQ412807.1	tammar wallaby	9315
Rana rugosa mRNA for DEAD box protein, complete cds	AB372211.1	Japanese wrinkled frog	8410
Thunnus orientalis vasa mRNA, complete cds	EU253482.1	Pacific bluefin tuna	8238
Sus scrofa VASA-like protein mRNA, complete cds	AY626785.1	pig	9823
Euthynnus affinis vasa mRNA, complete cds	GU581280.1	eastern little tuna	8227
Litopenaeus vannamei vasa-like protein mRNA, complete cds	DQ095772.2	Pacific white shrimp	6689
Drosophila virilis DEAD-box RNA helicase mRNA, complete cds	AF513908.1	Drosophila virilis	7244
Homo sapiens VASA protein mRNA, complete cds	AF262962.1	human	9606
Rana nigromaculata put
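
The table is tab-delimited with four columns (title, accession, organism, TaxId), which makes it easy to count and slice with standard tools. The sketch below works on a toy table holding two rows copied from the output above. `toy_table.txt` is an illustrative name, and on the real data you would use `vasa_prots_table.txt`.

```bash
# Work in a throwaway directory so the workshop files are untouched
tmpdir=$(mktemp -d)

# Two rows copied from the table above; '\t' in the printf format writes real tabs
printf 'Homo sapiens VASA protein mRNA, complete cds\tAF262962.1\thuman\t9606\n' > "$tmpdir/toy_table.txt"
printf 'Salmo salar vasa mRNA, complete cds\tJN712912.1\tAtlantic salmon\t8030\n' >> "$tmpdir/toy_table.txt"

# One line per record, so the line count is the number of entries
n_rows=$(wc -l < "$tmpdir/toy_table.txt")
echo "entries: $n_rows"

# Columns 2 and 3: accession and organism (cut splits on tabs by default)
cut -f2,3 "$tmpdir/toy_table.txt"

# Sort numerically by TaxId (column 4), then show accession and TaxId
sort -t $'\t' -k4,4n "$tmpdir/toy_table.txt" | cut -f2,4
```
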

9. Align the vasa sequences and proteins using `probcons`
*Note that it will take 10-15 minutes to complete, so it’s a good time to make yourself a cuppa or have a toilet break*

In [8]:
probcons vasa_seqs.fasta > vasa_seqs_probcons.fasta


PROBCONS version 1.12 - align multiple protein sequences and print to standard output
Written by Chuong Do

Using parameter set:
    initDistrib[] = { 0.6814756989 8.615339902e-05 8.615339902e-05 0.1591759622 0.1591759622 }
        gapOpen[] = { 0.0119511066 0.0119511066 0.008008334786 0.008008334786 }
      gapExtend[] = { 0.3965826333 0.3965826333 0.8988758326 0.8988758326 }

Loading sequence file: vasa_seqs.fasta
Alignment tree: (((((AB032566.1 Oncorhynchus mykiss vas mRNA for Vasa, complete cds JN712912.1 Salmo salar vasa mRNA, complete cds) (EU253482.1 Thunnus orientalis vasa mRNA, complete cds GU581280.1 Euthynnus affinis vasa mRNA, complete cds)) (AB372211.1 Rana rugosa mRNA for DEAD box protein, complete cds EU035615.1 Rana nigromaculata putative ATP-dependent RNA helicase DDX4 mRNA, complete cds)) (((JX437185.1 Bos taurus vasa mRNA, complete cds, alternatively spliced AY626785.1 Sus scrofa VASA-like protein mRNA, complete cd

Now do the same for the proteins:
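
If you get stuck, here is one possible sketch. It assumes the protein file from step 6 is called `vasa_prots.fasta` and writes the alignment to `vasa_prots_probcons.fasta`, a name chosen here to mirror the nucleotide step. The guard simply skips the run when `probcons` or the input file is not available.

```bash
in_fasta="vasa_prots.fasta"                     # created in step 6
out_fasta="${in_fasta%.fasta}_probcons.fasta"   # assumed output name: vasa_prots_probcons.fasta

if command -v probcons >/dev/null 2>&1 && [ -s "$in_fasta" ]; then
    # probcons prints the alignment to standard output, so redirect it to a file
    probcons "$in_fasta" > "$out_fasta"
    # The alignment should contain as many records as the input
    echo "input: $(grep -c '^>' "$in_fasta"), aligned: $(grep -c '^>' "$out_fasta")"
else
    echo "Skipping: probcons not found, or $in_fasta is missing or empty"
fi
```
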

10. Convert the aligned fasta to `phylip` format (an alignment format read by phylogenetics tools such as `phyml`) with `bioconvert` (see [documentation](https://bioconvert.readthedocs.io/en/dev/user_guide.html#)):

In [3]:
bioconvert fasta2phylip vasa_seqs_probcons.fasta vasa_seqs_probcons.phy




11. Create a phylogenetic tree with `phyml`:

In [None]:
phyml -i vasa_seqs_probcons.phy
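
The same two steps can be repeated for a protein alignment. The sketch below assumes the protein alignment was saved as `vasa_prots_probcons.fasta` (a file name chosen for this example, not given in the steps above). The one change that matters is `-d aa`, which tells `phyml` that the input holds amino acids, since it treats sequences as nucleotides by default. The guard skips the run when the tools or the input file are not available.

```bash
aln="vasa_prots_probcons.fasta"   # assumed name of the protein alignment
phy="${aln%.fasta}.phy"           # vasa_prots_probcons.phy

if command -v bioconvert >/dev/null 2>&1 && command -v phyml >/dev/null 2>&1 && [ -s "$aln" ]; then
    # Convert the aligned FASTA file to PHYLIP format
    bioconvert fasta2phylip "$aln" "$phy"
    # Build the tree; -d aa selects amino-acid data instead of the nucleotide default
    phyml -i "$phy" -d aa
    # phyml writes its result files next to the input, named after it
    ls -lh "$phy"*
else
    echo "Skipping: bioconvert or phyml not found, or $aln is missing or empty"
fi
```
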

12. Download the files in the `sandbox/week4` folder to your computer. We'll need them for the rest of the workshop!