# Get raw data
This notebook contains the commands used to obtain and prepare the raw data for constructing the A. thaliana 8-ecotype pan-genome.

In [1]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import re

## WGS data
Get the raw FASTQ files from whole-genome sequencing (WGS) of the 7 accessions.

In [23]:
run_ids = {"Eri": "ERR3624573",
          "Kyo": "ERR3624576",
          "Cvi-0": "ERR3624578",
          "Ler": "ERR3624574",
          "Sha": "ERR3624575",
          "C24": "ERR3624577",
          "An-1": "ERR3624579"}

In [2]:
ena_fast_script = "~/ena-fast-download.py"
download_dest = "../data/raw_fastq"

In [3]:
! mkdir -p $download_dest

In [9]:
for ecotype in run_ids:
    eco_dir = "%s/%s" %(download_dest, ecotype)
    run_id = run_ids[ecotype]
    ! mkdir -p $eco_dir
    ! python $ena_fast_script $run_id --output_directory $eco_dir

10/18/2020 06:29:09 PM INFO: Using aspera ssh key file: $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh
10/18/2020 06:29:09 PM INFO: Querying ENA for FTP paths for ERR3624573..
10/18/2020 06:29:11 PM INFO: Downloading 2 FTP read set(s): ftp.sra.ebi.ac.uk/vol1/fastq/ERR362/003/ERR3624573/ERR3624573_1.fastq.gz, ftp.sra.ebi.ac.uk/vol1/fastq/ERR362/003/ERR3624573/ERR3624573_2.fastq.gz
10/18/2020 06:29:11 PM INFO: Running command: ascp -T -l 300m -P33001  -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR362/003/ERR3624573/ERR3624573_1.fastq.gz ../data/raw_fastq/Eri
ERR3624573_1.fastq.gz                         100% 3269MB  128Mb/s    03:04    
Completed: 3348404K bytes transferred in 184 seconds
 (148783K bits/sec), in 1 file.
10/18/2020 06:32:24 PM INFO: Running command: ascp -T -l 300m -P33001  -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac

ERR3624579_1.fastq.gz                         100% 4459MB  213Mb/s    04:16    
Completed: 4566127K bytes transferred in 256 seconds
 (145761K bits/sec), in 1 file.
10/18/2020 07:14:02 PM INFO: Running command: ascp -T -l 300m -P33001  -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR362/009/ERR3624579/ERR3624579_2.fastq.gz ../data/raw_fastq/An-1
ERR3624579_2.fastq.gz                         100% 4207MB  288Mb/s    03:59    
Completed: 4308500K bytes transferred in 239 seconds
 (147471K bits/sec), in 1 file.
10/18/2020 07:18:18 PM INFO: All done.
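
Since each run is downloaded as two mate files (`<run_id>_1.fastq.gz` and `<run_id>_2.fastq.gz`, as seen in the log above), it may be worth confirming that every ecotype directory ended up with both. The following is a minimal sketch, not part of the original workflow; the helper name `missing_fastq` is hypothetical, and it only checks that the files exist and are non-empty, not that they are intact.

```python
from pathlib import Path

def missing_fastq(download_dest, run_ids):
    """Return expected paired-end FASTQ paths that are absent or empty.

    Assumes the layout used above: <download_dest>/<ecotype>/<run_id>_{1,2}.fastq.gz
    """
    missing = []
    for ecotype, run_id in run_ids.items():
        for mate in (1, 2):
            fq = Path(download_dest) / ecotype / f"{run_id}_{mate}.fastq.gz"
            # A zero-byte file usually points to an interrupted transfer
            if not fq.is_file() or fq.stat().st_size == 0:
                missing.append(str(fq))
    return missing
```

Calling `missing_fastq(download_dest, run_ids)` with the variables defined earlier should return an empty list when all 14 files are present.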


## Transcriptome data
Transcripts will be used as annotation evidence. We'll use two different transcript sets:
1. General - transcripts from 17 accessions (none of which overlap with the 7 accessions) from the ["WTCHGMott2011" project](http://mtweb.cs.ucl.ac.uk/mus/www/19genomes/index.html).
2. Per-sample - transcripts for each of the 7 accessions from the MPIPZ project.

### General transcripts

In [1]:
general_transcripts_data_dir = "../data/transcripts/general"

In [2]:
! mkdir -p $general_transcripts_data_dir

In [3]:
general_transcripts_acc = ["Bur_0", "Can_0", "Ct_1", "Edi_0", "Hi_0", "Kn_0",
                           "Mt_0", "No_0", "Oy_0", "Po_0", "Rsch_4", "Sf_2",
                           "Tsu_0", "Wil_2", "Ws_0", "Wu_0", "Zu_0"]

In [7]:
for acc in general_transcripts_acc:
    url = "http://mtweb.cs.ucl.ac.uk/mus/www/19genomes/sequences/RNA/RNA_sequences.%s.fasta.bz2" % acc
    ! wget $url -P $general_transcripts_data_dir
    f = "%s/RNA_sequences.%s.fasta.bz2" %(general_transcripts_data_dir, acc)
    ! bzip2 -d $f

In [12]:
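# go over all transcripts and keep one record per distinct sequence
# (the first accession in which a sequence is seen provides the record)
out_records = {}
for acc in general_transcripts_acc:
    print(acc)
    fasta = "%s/RNA_sequences.%s.fasta" %(general_transcripts_data_dir, acc)
    tot = 0   # transcripts read for this accession
    uniq = 0  # sequences not seen in any earlier accession
    for rec in SeqIO.parse(fasta, 'fasta'):
        tot += 1
        seq = str(rec.seq)  # plain string key, hashable in any Biopython version
        if seq not in out_records:
            out_records[seq] = rec
            uniq += 1
    print("Transcripts: %s\tNew: %s" %(tot,uniq))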

Bur_0
Transcripts: 41306	New: 40777
Can_0
Transcripts: 41097	New: 35463
Ct_1
Transcripts: 41341	New: 32276
Edi_0
Transcripts: 41332	New: 27962
Hi_0
Transcripts: 41428	New: 28848
Kn_0
Transcripts: 41275	New: 26710
Mt_0
Transcripts: 41281	New: 23805
No_0
Transcripts: 41206	New: 21437
Oy_0
Transcripts: 41238	New: 20440
Po_0
Transcripts: 41399	New: 23400
Rsch_4
Transcripts: 41211	New: 18122
Sf_2
Transcripts: 41173	New: 23433
Tsu_0
Transcripts: 41267	New: 17658
Wil_2
Transcripts: 41171	New: 18221
Ws_0
Transcripts: 41146	New: 17700
Wu_0
Transcripts: 41306	New: 15549
Zu_0
Transcripts: 41379	New: 17226


In [14]:
general_transcripts_all = "%s/unique_trans.fasta" % general_transcripts_data_dir
SeqIO.write(out_records.values(), general_transcripts_all, 'fasta')

409027
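
Because the "New" count for each accession only includes sequences not seen in any earlier accession, the 17 counts should add up to the number of records written. A quick arithmetic check, with the counts copied by hand from the output above:

```python
# Per-accession "New" counts from the deduplication output, in processing order
new_counts = [40777, 35463, 32276, 27962, 28848, 26710, 23805, 21437, 20440,
              23400, 18122, 23433, 17658, 18221, 17700, 15549, 17226]

total_unique = sum(new_counts)
print(len(new_counts), total_unique)  # 17 409027
```

The sum matches the 409027 records reported by `SeqIO.write`.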

In [15]:
del out_records

## Reference data
Get the A. thaliana reference data: the TAIR10 assembly and the Araport11 annotation (from Ensembl Plants).

In [2]:
ref_dir = "../data/A_thaliana_ref"
! mkdir -p $ref_dir

In [18]:
# genome
! wget ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz -P $ref_dir
! pigz -d "$ref_dir/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"

In [19]:
# transcripts
! wget ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz -P $ref_dir
! pigz -d "$ref_dir/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz"

--2020-10-21 15:36:04--  ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
           => ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.75
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.75|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/plants/release-48/fasta/arabidopsis_thaliana/cdna ... done.
==> SIZE Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz ... 21129666
==> PASV ... done.    ==> RETR Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz ... done.
Length: 21129666 (20M) (unauthoritative)


2020-10-21 15:36:09 (5.62 MB/s) - ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz’ saved [21129666]



In [20]:
# proteins
! wget ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz -P $ref_dir
! pigz -d "$ref_dir/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz"

--2020-10-21 15:37:34--  ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz
           => ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.75
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.75|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/plants/release-48/fasta/arabidopsis_thaliana/pep ... done.
==> SIZE Arabidopsis_thaliana.TAIR10.pep.all.fa.gz ... 9690822
==> PASV ... done.    ==> RETR Arabidopsis_thaliana.TAIR10.pep.all.fa.gz ... done.
Length: 9690822 (9.2M) (unauthoritative)


2020-10-21 15:37:37 (5.24 MB/s) - ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz’ saved [9690822]



In [21]:
# annotation
! wget ftp://ftp.ensemblgenomes.org/pub/plants/release-48/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.48.gff3.gz -P $ref_dir
! pigz -d "$ref_dir/Arabidopsis_thaliana.TAIR10.48.gff3.gz"

--2020-10-21 15:40:12--  ftp://ftp.ensemblgenomes.org/pub/plants/release-48/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.48.gff3.gz
           => ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.48.gff3.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.75
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.75|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/plants/release-48/gff3/arabidopsis_thaliana ... done.
==> SIZE Arabidopsis_thaliana.TAIR10.48.gff3.gz ... 9516561
==> PASV ... done.    ==> RETR Arabidopsis_thaliana.TAIR10.48.gff3.gz ... done.
Length: 9516561 (9.1M) (unauthoritative)


2020-10-21 15:40:16 (5.34 MB/s) - ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.48.gff3.gz’ saved [9516561]



In [3]:
# rename transcripts and proteins to match mRNA IDs in GFF3
! sed -i 's/>\([^ ]*\) .*/>transcript:\1/' $ref_dir/Arabidopsis_thaliana.TAIR10.pep.all.fa
! sed -i 's/>\([^ ]*\) .*/>transcript:\1/' $ref_dir/Arabidopsis_thaliana.TAIR10.cdna.all.fa
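
To make the effect of the `sed` expression concrete, here is a small Python equivalent applied to a single header. This is only an illustration: the helper name `rename_header` is hypothetical, and the example header is written in the style of the Ensembl cDNA file rather than copied from it.

```python
import re

# Python counterpart of the sed expression s/>\([^ ]*\) .*/>transcript:\1/
HEADER_RE = re.compile(r">([^ ]*) .*")

def rename_header(line):
    """Reduce a FASTA header to '>transcript:<first token>'.

    Lines without '>' followed by a space-delimited description are left unchanged.
    """
    return HEADER_RE.sub(r">transcript:\1", line, count=1)

# Illustrative header (assumed format, not taken from the downloaded file)
header = ">AT1G01010.1 cdna chromosome:TAIR10:1:3631:5899:1 gene:AT1G01010"
print(rename_header(header))  # >transcript:AT1G01010.1
```

Because the pattern requires a space after the ID, a header that has already been renamed no longer matches, so running the substitution a second time leaves the file unchanged.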

### Repeats library
A repeats library is required for masking during annotation. It is built here by downloading the soft-masked version of the genome and extracting the lowercase (masked) sequences, keeping one copy of each distinct sequence.

In [3]:
! wget ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz -P $ref_dir

--2020-10-22 15:09:26--  ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz
           => ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.75
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.75|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/plants/release-48/fasta/arabidopsis_thaliana/dna ... done.
==> SIZE Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz ... 38108027
==> PASV ... done.    ==> RETR Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz ... done.
Length: 38108027 (36M) (unauthoritative)


2020-10-22 15:10:14 (794 KB/s) - ‘../data/A_thaliana_ref/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz’ saved [38108027]



In [4]:
! pigz -d $ref_dir/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz

In [10]:
sm_chr = "%s/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa" % ref_dir
rep_records = {}  # distinct repeat sequence -> SeqRecord of its first occurrence
rep_regex = re.compile(r'[atgcn]+')  # a run of lowercase (soft-masked) bases
for chrom in SeqIO.parse(sm_chr, 'fasta'):
    reps = rep_regex.finditer(str(chrom.seq))
    for rep in reps:
        rep_seq = rep.group()
        if rep_seq in rep_records:
            continue
        # 0-based start, exclusive end within the chromosome
        rep_start = rep.start()
        rep_end = rep.end()
        rep_record = SeqRecord(Seq(rep_seq), id="%s_%s_%s" %(chrom.id, rep_start, rep_end), description="")
        rep_records[rep_seq] = rep_record

repeats_lib = "%s/Arabidopsis_thaliana.TAIR10.repeats_lib.fa" % ref_dir
SeqIO.write(rep_records.values(), repeats_lib, 'fasta')

95671
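
The extraction logic can be tried on a toy sequence to see how record IDs and duplicates are handled. This sketch uses only the standard library; the sequence and the name `toy1` are made up for illustration, and `lowercase_runs` is a hypothetical helper mirroring the loop above.

```python
import re

REP_RE = re.compile(r"[atgcn]+")  # same pattern as in the cell above

def lowercase_runs(chrom_id, seq):
    """Yield (record_id, run) for each lowercase run in seq.

    Coordinates in the ID are 0-based with an exclusive end, as returned by
    match.start() and match.end().
    """
    for m in REP_RE.finditer(seq):
        yield "%s_%s_%s" % (chrom_id, m.start(), m.end()), m.group()

toy = "ACGTacgtnnACGTttACGTacgtnn"
seen = {}
for rec_id, run in lowercase_runs("toy1", toy):
    seen.setdefault(run, rec_id)  # keep only the first occurrence of each sequence

print(seen)  # {'acgtnn': 'toy1_4_10', 'tt': 'toy1_14_16'}
```

The second `acgtnn` run (positions 20 to 26) is dropped because an identical sequence was already recorded, which is why the library contains fewer entries than there are masked regions in the genome.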