# This notebook downloads the fungal genomes and annotations. It also generates the STAR index needed for alignment and the adapter FASTA file needed for trimming.


Create the directory structure for the downloaded genomes, the annotations, the STAR index, and the adapter file.

~
+-- annotations
|   +-- sacCer3_Sp_merged_genome
|       +-- STAR_index
|       +-- fasta
|           +-- sacCer3_Sp_merged.fa
|       +-- anno
|           +-- r64_2_1.gff3
|   +-- common
|       +-- Lexogen_adapters_with_pA.fa.gz
|   +-- sacCer3_UCSC
|       +-- sacCer3.fa
|   +-- Spombe_ENSEMBLE
|       +-- Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa
|       +-- Sp_genes.gtf


In [3]:
mkdir -p ~/annotations/sacCer3_Sp_merged_genome/fasta
mkdir -p ~/annotations/sacCer3_Sp_merged_genome/anno
mkdir -p ~/annotations/sacCer3_Sp_merged_genome/STAR_index
mkdir -p ~/annotations/common
mkdir -p ~/annotations/sacCer3_UCSC
mkdir -p ~/annotations/Spombe_ENSEMBLE


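Download the Saccharomyces cerevisiae genome assembly (sacCer3) from UCSC. The per-chromosome files are concatenated into a single FASTA file and then decompressed.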


In [10]:
cd ~/annotations/sacCer3_UCSC
wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/*'
cat *fa.gz > sacCer3.fa.gz
gzip -d sacCer3.fa.gz

--2019-05-17 04:28:13--  ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/*
           => ‘.listing’
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /goldenPath/sacCer3/chromosomes ... done.
==> PASV ... done.    ==> LIST ... done.

.listing                [ <=>                ]   1.38K  --.-KB/s    in 0.002s  

2019-05-17 04:28:14 (623 KB/s) - ‘.listing’ saved [1412]

Removed ‘.listing’.
--2019-05-17 04:28:14--  ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/README.txt
           => ‘README.txt’
==> CWD not required.
==> PASV ... done.    ==> RETR README.txt ... done.
Length: 2135 (2.1K)


2019-05-17 04:28:14 (352 MB/s) - ‘README.txt’ saved [2135]

--2019-05-17 04:28:14--  ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/


Download the Schizosaccharomyces pombe genome assembly (ASM294v2) from Ensembl Genomes, release 43.


In [9]:
cd ~/annotations/Spombe_ENSEMBLE
wget 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-43/fasta/schizosaccharomyces_pombe/dna/Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa.gz'
gzip -d Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa.gz

--2019-05-17 04:26:39--  ftp://ftp.ensemblgenomes.org/pub/fungi/release-43/fasta/schizosaccharomyces_pombe/dna/Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa.gz
           => ‘Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.94
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/fungi/release-43/fasta/schizosaccharomyces_pombe/dna ... done.
==> SIZE Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa.gz ... 3921364
==> PASV ... done.    ==> RETR Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa.gz ... done.
Length: 3921364 (3.7M) (unauthoritative)


2019-05-17 04:26:41 (4.05 MB/s) - ‘Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa.gz’ saved [3921364]




Download the Schizosaccharomyces pombe gene annotation from Ensembl Genomes, release 43.


In [2]:
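cd ~/annotations/Spombe_ENSEMBLE/
wget 'ftp://ftp.ensemblgenomes.org/pub/fungi/release-43/gtf/schizosaccharomyces_pombe/Schizosaccharomyces_pombe.ASM294v2.43.gtf.gz' \
-O Sp_genes.gtf
# The saved file is still gzip-compressed despite its name: rename it, then decompress to get a plain-text GTF.
mv Sp_genes.gtf Sp_genes.gtf.gz
gzip -d Sp_genes.gtf.gz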


--2019-05-17 17:57:25--  ftp://ftp.ensemblgenomes.org/pub/fungi/release-43/gtf/schizosaccharomyces_pombe/Schizosaccharomyces_pombe.ASM294v2.43.gtf.gz
           => ‘Sp_genes.gtf’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.94
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/fungi/release-43/gtf/schizosaccharomyces_pombe ... done.
==> SIZE Schizosaccharomyces_pombe.ASM294v2.43.gtf.gz ... 804707
==> PASV ... done.    ==> RETR Schizosaccharomyces_pombe.ASM294v2.43.gtf.gz ... done.
Length: 804707 (786K) (unauthoritative)


2019-05-17 17:57:28 (1.53 MB/s) - ‘Sp_genes.gtf’ saved [804707]
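
wget stores the payload unchanged under whatever name is passed to -O, so a file's extension does not show whether its content is still compressed. A minimal sketch of a check, assuming gzip is on the PATH. The helper name is_gzipped is made up for illustration and is not part of the original notebook:

```shell
# Hypothetical helper (not part of the original notebook): succeeds when the file
# is a valid gzip stream, regardless of its name. Reading from stdin sidesteps
# gzip's filename-suffix handling.
is_gzipped() {
    gzip -t < "$1" 2>/dev/null
}

gtf=~/annotations/Spombe_ENSEMBLE/Sp_genes.gtf
if [ -f "$gtf" ]; then
    if is_gzipped "$gtf"; then
        echo "$gtf is gzip-compressed"
    else
        echo "$gtf is not gzip-compressed"
    fi
fi
```

The same helper can be pointed at any of the downloaded files before they are passed on to other tools.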




Concatenate the two fungal genome FASTA files into the merged genome file required for STAR indexing.


In [12]:
cd ~/annotations/sacCer3_Sp_merged_genome/fasta
cat ~/annotations/sacCer3_UCSC/sacCer3.fa \
~/annotations/Spombe_ENSEMBLE/Schizosaccharomyces_pombe.ASM294v2.dna.toplevel.fa > \
sacCer3_Sp_merged.fa
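
Because the two assemblies come from different sources, it can be worth confirming that no sequence name occurs twice in the merged file. The following is a minimal sketch of such a check. The helper name duplicate_fasta_names is hypothetical, and the sequence name is assumed to be the header text up to the first whitespace:

```shell
# Hypothetical helper (not part of the original notebook): print every FASTA
# sequence name that occurs more than once. The name is taken as the header
# text up to the first whitespace, without the leading ">".
duplicate_fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}

merged=~/annotations/sacCer3_Sp_merged_genome/fasta/sacCer3_Sp_merged.fa
if [ -f "$merged" ]; then
    grep -c '^>' "$merged"            # number of sequences in the merged genome
    duplicate_fasta_names "$merged"   # prints nothing when all names are unique
fi
```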


Download the Saccharomyces cerevisiae gene annotation (GFF file) from SGD.


In [1]:
cd ~/annotations/sacCer3_Sp_merged_genome/anno
wget https://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff \
-O r64_2_1.gff3


--2019-05-17 13:29:02--  https://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff
Resolving downloads.yeastgenome.org (downloads.yeastgenome.org)... 171.67.205.104
Connecting to downloads.yeastgenome.org (downloads.yeastgenome.org)|171.67.205.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18136762 (17M)
Saving to: ‘r64_2_1.gff3’


2019-05-17 13:29:03 (16.7 MB/s) - ‘r64_2_1.gff3’ saved [18136762/18136762]
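
STAR assigns annotation features to sequences by name, so the sequence IDs in column 1 of the GFF need to match the FASTA headers of the merged genome. The annotation (SGD) and the assembly (UCSC) come from different sources, so naming differences are possible. The sketch below lists GFF sequence IDs that have no counterpart in the FASTA file. The helper name gff_ids_missing_from_fasta is hypothetical, and the check assumes a non-empty FASTA file and stops reading the GFF at an embedded ##FASTA section if one is present:

```shell
# Hypothetical helper (not part of the original notebook).
# Usage: gff_ids_missing_from_fasta annotation.gff genome.fa
# Prints each GFF sequence ID (column 1) that is absent from the FASTA headers.
gff_ids_missing_from_fasta() {
    awk '
        NR == FNR {                      # first file: collect FASTA sequence names
            if (/^>/) { sub(/^>/, ""); names[$1] = 1 }
            next
        }
        /^##FASTA/ { exit }              # skip sequence embedded in the GFF
        /^#/ || NF == 0 { next }         # skip comments, directives, blank lines
        !($1 in names) && !seen[$1]++ { print $1 }
    ' "$2" "$1"
}

gff=~/annotations/sacCer3_Sp_merged_genome/anno/r64_2_1.gff3
fa=~/annotations/sacCer3_Sp_merged_genome/fasta/sacCer3_Sp_merged.fa
if [ -f "$gff" ] && [ -f "$fa" ]; then
    gff_ids_missing_from_fasta "$gff" "$fa"   # no output means every ID was found
fi
```

Any ID printed here refers to a sequence that is absent from the merged FASTA file under that name.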




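Generate the STAR index. Only the Saccharomyces cerevisiae annotation is passed to STAR. The value --sjdbOverhang 49 is the read length minus 1 that the STAR manual recommends for 50-nt reads, and --genomeSAindexNbases 11 scales the suffix array index down for the small merged genome.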


In [4]:
STAR \
--runThreadN 1 \
--runMode genomeGenerate \
--genomeDir ~/annotations/sacCer3_Sp_merged_genome/STAR_index \
--genomeFastaFiles ~/annotations/sacCer3_Sp_merged_genome/fasta/sacCer3_Sp_merged.fa \
--sjdbGTFfile ~/annotations/sacCer3_Sp_merged_genome/anno/r64_2_1.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbGTFfeatureExon CDS \
--genomeSAindexNbases 11 \
--sjdbOverhang 49

May 17 13:33:47 ..... started STAR run
May 17 13:33:47 ... starting to generate Genome files
May 17 13:33:47 ... starting to sort Suffix Array. This may take a long time...
May 17 13:33:48 ... sorting Suffix Array chunks and saving them to disk...
May 17 13:34:09 ... loading chunks from disk, packing SA...
May 17 13:34:09 ... finished generating suffix array
May 17 13:34:09 ... generating Suffix Array index
May 17 13:34:12 ... completed Suffix Array index
May 17 13:34:12 ..... processing annotations GTF
May 17 13:34:12 ..... inserting junctions into the genome indices
May 17 13:34:13 ... writing Genome to disk ...
May 17 13:34:13 ... writing Suffix Array to disk ...
May 17 13:34:13 ... writing SAindex to disk
May 17 13:34:13 ..... finished successfully
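
For small genomes the STAR manual recommends scaling --genomeSAindexNbases down to min(14, log2(GenomeLength)/2 - 1). The sketch below shows how that value could be derived from the merged FASTA file. Both helper names are hypothetical, and rounding the result down to an integer is an assumption made here:

```shell
# Hypothetical helper (not part of the original notebook): total number of
# bases in a FASTA file, header lines excluded.
fasta_length() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Hypothetical helper: min(14, log2(length)/2 - 1), rounded down (assumption).
suggested_sa_index_nbases() {
    awk -v len="$1" 'BEGIN {
        v = int(log(len) / log(2) / 2 - 1)
        if (v > 14) v = 14
        print v
    }'
}

merged=~/annotations/sacCer3_Sp_merged_genome/fasta/sacCer3_Sp_merged.fa
if [ -f "$merged" ]; then
    suggested_sa_index_nbases "$(fasta_length "$merged")"
fi
```

For a genome of 24,000,000 bases the formula gives 11.26, which rounds down to 11.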



Make the adapters file for trimming.


In [1]:
cd ~/annotations/common/

cat >Lexogen_adapters_with_pA.fa <<EOL
>TruSeq_Adapter_Index_1_6
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_3
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_4
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_5
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_6
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_7
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_8
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_9
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_10
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_11
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_12
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_13
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_14
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCCGTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_15
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_16
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_18_7
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_19
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_20
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGGCCTTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_21
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_22
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGTAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_23
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTGGATATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_25
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_27
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTTTATCTCGTATGCCGTCTTCTGCTTG
>polyA
AAAAAAAAAAAAAAAAAA
EOL

# -f overwrites an existing Lexogen_adapters_with_pA.fa.gz instead of prompting, which would stall a non-interactive notebook cell.
gzip -f Lexogen_adapters_with_pA.fa
