In [1]:
!date

Wed Oct 26 23:04:19 PDT 2016


#### Goal for this homework project: ####

Identify a gene in my project taxa that is similar to a gene differentially expressed in sea stars exhibiting wasting disease. 

I'll use [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) in R to find genes that are differentially expressed in sea stars exhibiting wasting disease versus healthy sea stars. The script that installs and runs DESeq2 can be viewed [here](https://github.com/MeganEDuffy/FISH-546/blob/master/tutorials/seastar-tutorial/DESeq.R).

For that, I need the count data from the sea star wasting disease project GitHub repository.
I'll also need the full sea star transcriptome, available from the same repo, to merge with the differentially expressed genes.

Later, I'll use BLAST to find a gene in a marine metatranscriptome that's similar to a gene differentially expressed by the sea stars with a wasting viral infection. This notebook shows where I got that metatranscriptome.

First step: downloading ```Phel_countdata.txt``` from https://github.com/sr320/eimd-sswd/blob/master/data/Phel_countdata.txt 

In [10]:
!curl https://raw.githubusercontent.com/sr320/eimd-sswd/master/data/Phel_countdata.txt \
    > /Users/meganduffy/Documents/git-repos/FISH-546/tutorials/seastar-tutorial/data/Phel_countdata.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  997k  100  997k    0     0  2401k      0 --:--:-- --:--:-- --:--:-- 2942k


In [3]:
pwd

'/Users/meganduffy/Documents/git-repos/FISH-546/tutorials/seastar-tutorial'

In [5]:
!head data/Phel_countdata.txt

Feature ID	Treated_FHL - Total gene reads	Treated_PH - Total gene reads	Treated_L - Total gene reads	Control_FHL - Total gene reads	Control_DB - Total gene reads	Control_PH - Total gene reads
Phel_contig_1	168	37	8	89	28	38
Phel_contig_10	9518	2752	839	22	42	180
Phel_contig_100	260	565	413	616	1234	6104
Phel_contig_1000	2043	3842	3070	4311	8527	31946
Phel_contig_10000	9	12	13	32	21	211
Phel_contig_10001	44	225	89	90	54	365
Phel_contig_10002	38	61	80	185	478	1267
Phel_contig_10003	9	29	20	17	29	186
Phel_contig_10004	8	25	6	4	19	92


In [6]:
!wc -l data/Phel_countdata.txt

   29476 data/Phel_countdata.txt


Next, I'll download the entire sea star transcriptome, again using curl:

In [7]:
!curl https://raw.githubusercontent.com/sr320/eimd-sswd/master/data/Phel_transcriptome.fasta \
    > /Users/meganduffy/Documents/git-repos/FISH-546/tutorials/seastar-tutorial/data/Phel_transcriptome.fasta

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.4M  100 39.4M    0     0  7088k      0  0:00:05  0:00:05 --:--:-- 10.6M


In [12]:
!head -2 data/Phel_transcriptome.fasta

>Phel_contig_1
CAAATATATGAACGGTTGATTGTCAACGATTAGTACATGTTTTCATTGTTCCCCACGCCCGCCCCCCCCCACTCAAACATTTAAAGTGTGAAATATTATTTATCCACAAATTTCCTTAAACCTGCAAACTTGTCTGCTGTCTCTTATTGGAAGTTATGAAAAAGAACAACGGGTTTTCTTTAAAGGGTCTGCGTGCGATTTTCAACCTTTTGAGTAATAGCAGTTATTTTGATAACCGATTTTTTTCAAAGCTCAACAGCTTTTTAAAATAAGGAATCCTATAATGGCCAAACGAATACTATAAAAATAAGGGTTCTCTTAATTGTATAAAACGTATAATTTTATCAATTTTGGGACCGTGTAATTTTTTAAAGACCACAAGAATGTTACATACAACAAATAGACGAAACTCGTAGCTTTGGAAACTACGTCATGGGCGTTTGGTCAAAAGCTGGAGAGAAAGAGAGGTGGGGTGCCAGACTTAAGTAGTCACGTGATCTGACCAACGCACATCGGAAGCTCGATCGGATGAAATCTTCTCTATCGTTCTTGCGTCTATACGTGCTACGAAGAGCTGACAGAAGTTTGGACTTGTTTACTTCTTGCACCTGTTGATGGAACGGCCACGGACCTTGTCGCACGCACACCTGGAGCCAGTGCTCGGATCGACGCAACGGATGTACTGTCTTCCCCTTCCGCGTTTCTCAAGTAGGTACTCAAAGTCGTCCGCGTCGAAGTTGGCCTCGGCGTCCCTCTTCTCCAGCTCCTCCATGTCCTCCTCTGTGTAGTACGGGGTGACGAGCACCACCAGGGCGGCCACAATGGCCAGTGCTAGAAGACACTTCGTATTCATTCTGCTGGTGGTTGGATGTGCGCAAACAAGACAGGAGAGACTTATTAGAATC


In [13]:
!fgrep -c ">" data/Phel_transcriptome.fasta

29476
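`wc -l` counts every line in the count table, header row included, while `fgrep -c ">"` counts sequence records, so matching totals do not by themselves confirm that the two files list the same contigs. Below is a minimal Python sketch for comparing the IDs directly. It assumes the contig ID is the first tab-separated column of the count table and the first token after `>` in each FASTA header; the function names are hypothetical and not part of the original workflow.

```python
import csv


def count_table_ids(path):
    """Return the set of feature IDs in the first column, skipping the header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the "Feature ID ..." header row
        return {row[0] for row in reader if row}


def fasta_ids(path):
    """Return the set of record IDs (first token after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # ignore a bare ">" with no ID
                    ids.add(fields[0])
    return ids


def compare_ids(count_path, fasta_path):
    """Report which IDs appear in only one of the two files."""
    counts = count_table_ids(count_path)
    fasta = fasta_ids(fasta_path)
    return {"only_in_counts": counts - fasta, "only_in_fasta": fasta - counts}
```

Calling `compare_ids("data/Phel_countdata.txt", "data/Phel_transcriptome.fasta")` from the tutorial directory would return two sets; both being empty would mean every contig in one file is present in the other.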


I also manually downloaded a marine metatranscriptome that was generated by Kopf et al. **(2015)**, "Metatranscriptome of marine bacterioplankton during winter time in the North Sea assessed by total RNA sequencing", _Marine Genomics_, 19:45-46. You can view this publication [here](http://www.sciencedirect.com.offcampus.lib.washington.edu/science/article/pii/S1874778714001469). Data from this study are available at the [European Nucleotide Archive](http://www.ebi.ac.uk/ena) under the accession number [PRJEB5205](http://www.ebi.ac.uk/ena/data/view/PRJEB5205).

I downloaded it at 11:44 pm on 2016-10-28 as a gzip-compressed archive called ```ERR653268.fastq.gz```.

In [3]:
pwd

'/Users/meganduffy/Documents/git-repos/FISH-546/tutorials/seastar-tutorial'

In [4]:
ls data/

DEG_virus_ex.png                      Phel_diff_transcriptome_db.fasta.nin
ERR653268.fastq.gz                    Phel_diff_transcriptome_db.fasta.nsq
Galaxy_Phel_transciptome.tab          Phel_diffexpressed_transcriptome.tab
Phel_DEGlist.tab                      Phel_merged_transcriptome.fasta
Phel_countdata.txt                    Phel_transcriptome.fasta
Phel_diff_transcriptome.fasta         marine-metatracript-datasum.png
Phel_diff_transcriptome.tab           pro-mar.fasta
Phel_diff_transcriptome_db.fasta.nhr  pro-mar.fasta.gz


In [5]:
!gunzip data/ERR653268.fastq.gz

gunzip: data/ERR653268.fastq.gz: unexpected end of file


When I inspect this file, it's 0 KB, so the download was probably incomplete or the file is corrupted in some way.
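A quick check before decompressing can catch this kind of problem early. The following is a minimal Python sketch, using only the standard library, that flags an empty file and then reads the whole archive once to detect truncation. The function name `check_gzip` and the chunk size are my own choices for illustration, not part of the original workflow.

```python
import gzip
import os
import zlib


def check_gzip(path, chunk_size=1 << 20):
    """Return (ok, message) after reading the whole archive once."""
    size = os.path.getsize(path)
    if size == 0:
        # An empty file is not a valid gzip archive; the download likely failed.
        return False, "file is empty (0 bytes); the download likely did not complete"
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large archives fit in memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error) as err:
        return False, f"archive is truncated or corrupt: {err}"
    return True, f"archive is readable ({size} bytes compressed)"
```

For example, `check_gzip("data/ERR653268.fastq.gz")` would return a `False` status with the reason if the file on disk is empty or cut short, which is a cue to download it again rather than to retry `gunzip`.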

Instead, I decided to download a marine microbial eukaryotic metatranscriptome from an iMicrobe project on CyVerse.
[This](http://datacommons.cyverse.org/browse/iplant/home/shared/imicrobe/projects/104/samples/2468) is where I found the data. The README explains the fasta file I downloaded:

```
contigs.fa: Contigs from the assembly, minimum 150 bp. Possibly
includes UTRs. Sequences contain IUPAC ambiguity codes representing
ambiguous bases, http://www.bioinformatics.org/sms/iupac.html.
```

In [6]:
!curl https://de.iplantcollaborative.org/anon-files//iplant/home/shared/imicrobe/projects/104/samples/2468/MMETSP1325.cds.fa \
    > /Users/meganduffy/Documents/git-repos/FISH-546/tutorials/seastar-tutorial/data/MMETSP1325.cds.fasta

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.8M    0  9.8M    0     0  3591k      0 --:--:--  0:00:02 --:--:-- 3658k


In [9]:
!head -5 data/MMETSP1325.cds.fasta

>MMETSP1325-20131115|3_1 len=198
XXGGCCGAGCGAGCTGTCGCACAGCTTGAGGTACGGCTTCATTCAGCAAACGACAAGTACAGGACAACGGCGGCCGCCCATGCAGCGAACGTAGATGAGC
TGCAACGCGGATTCAAAGCAGCTCAGCAGCAGCGAGAGGCGTCGGCGATGGAGGCGGACGCCGTTTTGAAACGGGAGCTTGCCACTATGCGAGAGTCG
>MMETSP1325-20131115|1_1 len=558
ATGAACGCCTCATTCCGTGGCGCGCTGGAGCACTACCTCGTTCCGAGGAGAGAGGTTACGCACTCTCAGGCGGTGTGCGGTTTGTACAGGAAGTGCCTGA
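The `head` output shows that each sequence in this file is wrapped over several lines and that each header carries a `len=` tag, so line counts do not equal record counts. Here is a minimal Python sketch that joins the wrapped lines into full records and compares each sequence with its declared length. It assumes `len=` gives the sequence length in bases, which the first record above appears to support; the function names are hypothetical.

```python
import re


def read_fasta(path):
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_mismatches(path):
    """Return (header, declared, actual) for records whose len= tag disagrees."""
    problems = []
    for header, seq in read_fasta(path):
        match = re.search(r"\blen=(\d+)", header)
        if match and int(match.group(1)) != len(seq):
            problems.append((header, int(match.group(1)), len(seq)))
    return problems
```

An empty list from `length_mismatches("data/MMETSP1325.cds.fasta")` would suggest the download is complete and the records are intact; any entries point to sequences worth inspecting before using the file as a BLAST query.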


**I'll use this as my query later on when I run BLAST (see seastar-tutorial/04-blast.ipynb).**