### purpose

upload fastq files to SRA


### outline

create the files to upload to SRA and dryad, along with the SRA and biosample metadata. The notebook was first run without transferring anything to SRA, to create the biosample metadata needed to register the biosamples. It was then rerun to add the biosample accession info before uploading to SRA.

[get fastqfiles](#fastq)
- for both Doug-fir and jack pine, get file paths to the fastq files I want to upload to SRA
- use pipeline config files to get applicable file names, upload to SRA

[check md5s](#check)
- check md5 files for each of the fastq files, upload these to SRA

[bgzip snp tables](#copy)
- bgzip the .txt files representing the post-pipeline SNP files used in previous notebooks
- add these to the list of files to upload to dryad

[prepare SRA metadata doc](#sradoc)
- use this file to manually upload SRA metadata
- add md5 hashes as columns for `filename` and `filename2`

[prepare biosample metadata doc](#biosample)
- using the SRA metadata doc, create the biosample metadata file to upload to NCBI

[transfer files to SRA](#transfer)
- transfer fastq and fastq.md5 files to SRA via ftp

In [None]:
from pythonimports import *
import ftplib

fqdir = '/data/fastq/mengmeng/SeqCap_Test_for_CoAdapTree'
tmpdir = makedir('/lu213/brandon.lind/data/testdir/dryad_testdata')
sra_main = '/data/projects/pool_seq/sra_docs'

lview,dview = get_client()

latest_commit()
sinfo(html=True)

In [2]:
# a list for all files to transfer
sra_transfer_files = []
dryad = []

<a id='fastq'></a>
# get the fastq files

### pool and individual files

In [3]:
# infile/config to gatk pipeline for pool-seq and individual-seq
datafile = '/data/projects/pool_seq/non-pangenome/gatk_diploid_testdata/JP_i101-gatk/datatable.txt'
datatable = pd.read_table(datafile)
datatable.head()

Unnamed: 0,sample_name,library_name,pool_name,ploidy,file_name_r1,file_name_r2,adaptor_1,adaptor_2,ref,rgid,rglb,rgpl,rgpu,rgsm
0,DF_52_20,DF_cap3_kit1,DF_i52,2,HI.4779.008.D705---D505.DF_52_20_cap3_kit1_R1....,HI.4779.008.D705---D505.DF_52_20_cap3_kit1_R2....,AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,/project/def-saitken/lindb/refgenomes/DF_db/DF...,,DF.cap3.kit1,ILLUMINA,,DF_52_20_cap3_kit1
1,DF_52_21,DF_cap3_kit1,DF_i52,2,HI.4779.008.D706---D505.DF_52_21_cap3_kit1_R1....,HI.4779.008.D706---D505.DF_52_21_cap3_kit1_R2....,AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,/project/def-saitken/lindb/refgenomes/DF_db/DF...,,DF.cap3.kit1,ILLUMINA,,DF_52_21_cap3_kit1
2,DF_52_22,DF_cap3_kit1,DF_i52,2,HI.4779.008.D707---D505.DF_52_22_cap3_kit1_R1....,HI.4779.008.D707---D505.DF_52_22_cap3_kit1_R2....,AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,/project/def-saitken/lindb/refgenomes/DF_db/DF...,,DF.cap3.kit1,ILLUMINA,,DF_52_22_cap3_kit1
3,DF_52_23,DF_cap3_kit1,DF_i52,2,HI.4779.008.D708---D505.DF_52_23_cap3_kit1_R1....,HI.4779.008.D708---D505.DF_52_23_cap3_kit1_R2....,AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,/project/def-saitken/lindb/refgenomes/DF_db/DF...,,DF.cap3.kit1,ILLUMINA,,DF_52_23_cap3_kit1
4,DF_52_24,DF_cap3_kit1,DF_i52,2,HI.4779.008.D709---D505.DF_52_24_cap3_kit1_R1....,HI.4779.008.D709---D505.DF_52_24_cap3_kit1_R2....,AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,/project/def-saitken/lindb/refgenomes/DF_db/DF...,,DF.cap3.kit1,ILLUMINA,,DF_52_24_cap3_kit1


In [4]:
# get file paths to each of the files
# 40 individual-seq files for each species ((R1 + R2) * 20 samples)
# 2 pool-seq files for each species ((R1 + R2) * 1 sample)
# (40 + 2) * 2 species = 84
fastqs = []
fqs = datatable['file_name_r1'].tolist() + datatable['file_name_r2'].tolist()
for fq in fqs:
    fastq = op.join(fqdir, fq)
    assert op.exists(fastq)
    fastqs.append(fastq)
assert len(fastqs) == luni(fastqs)
len(fastqs), luni(fastqs)

(84, 84)

In [5]:
sra_transfer_files.extend(fastqs)
sra_transfer_files.extend([f + '.md5' for f in fastqs])

In [6]:
dryad.append(datafile)

### mega files 

In [7]:
# get mega-seq files
megadir = '/data/fastq/mengmeng/Mega_RefSeq'
megafile = '/data/projects/pool_seq/non-pangenome/varscan_mega/JP_RFmg7/datatable.txt'
megadt = pd.read_table(megafile)
megadt = megadt[megadt['sample_name'] == 'JP_RFmg7']
mega_fastqs = [op.join(megadir, f) for f in megadt['file_name_r1'].tolist() + megadt['file_name_r2'].tolist()]
mega_fastqs

['/data/fastq/mengmeng/Mega_RefSeq/HI.4992.002.D709---D502.JP_RFmg7_cap44_kit6_R1.fastq.gz',
 '/data/fastq/mengmeng/Mega_RefSeq/HI.4992.002.D709---D502.JP_RFmg7_cap44_kit6_R2.fastq.gz']

In [8]:
sra_transfer_files.extend(mega_fastqs)
sra_transfer_files.extend([f + '.md5' for f in mega_fastqs])

In [9]:
dryad.append(megafile)

<a id='check'></a>
# check md5s

In [10]:
def check_md5(f):
    import shutil, subprocess, os
    from os import path as op
    
    os.chdir(op.dirname(f))
    
    md5 = f + '.md5'
    assert op.exists(md5)
    
    stdout = subprocess.check_output(
        [shutil.which('md5sum'),
         '-c',
         md5
        ]
    )
    
    return stdout

In [11]:
# double check md5 of existing .md5 files in parallel
jobs = []
for fastq in fastqs + mega_fastqs:
    jobs.append(
        lview.apply_async(
            check_md5, fastq
        )
    )
    
watch_async(jobs)

Watching 86 jobs ...


100%|██████████| 86/86 [00:25<00:00,  3.38it/s]


In [12]:
# make sure all of the ind-seq/pool-seq md5 files check out
for j in jobs:
    status = j.r\
            .decode('utf-8')\
            .split('\n')[0]\
            .rstrip('\n')\
            .split()[-1]
    assert status == 'OK'
status

'OK'
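
If `md5sum` is not available on the machine, the same check can be sketched in pure Python with `hashlib`. The helper name `md5_of_file` and the chunk size are assumptions for illustration; they are not part of `pythonimports`.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # read fixed-size chunks so large fastq files never sit fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned digest could then be compared against the first token of the matching `.md5` file.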

In [10]:
# map file to md5 hash
md5s = {}
for f in fastqs + mega_fastqs:
    md5 = f + '.md5'
    assert op.exists(md5)
    md5hash = read(md5)[0].split()[0]
    md5s[op.basename(f)] = md5hash

assert len(md5s) == len(fastqs) + len(mega_fastqs)
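
The cell above trusts that the first whitespace-delimited token of each `.md5` file is the hash. A slightly stricter parser is sketched here, assuming each file holds one `md5sum`-style line (`<hash>  <filename>`); it also checks that the token looks like an MD5 digest. `parse_md5_line` is a hypothetical helper.

```python
import re

MD5_PATTERN = re.compile(r'^[0-9a-f]{32}$')

def parse_md5_line(line):
    """Split an md5sum-style line into (hash, filename) and validate the hash."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f'unexpected md5 line: {line!r}')
    md5hash, filename = parts
    md5hash = md5hash.lower()
    # md5sum prefixes the file name with '*' when run in binary mode
    filename = filename.lstrip('*')
    if not MD5_PATTERN.match(md5hash):
        raise ValueError(f'not an MD5 digest: {md5hash!r}')
    return md5hash, filename
```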

<a id='copy'></a>
# bgzip snp tables

In [13]:
def copy_and_bgzip(src, dst):
    import shutil, subprocess
    
    shutil.copy(src, dst)
    
    stdout = subprocess.check_output(
        [shutil.which('bgzip'),
         '-f',
         dst
        ]
    )
    
    return stdout

In [14]:
snptables = {
    'mega' : {
        'JP': op.join('/data/projects/pool_seq/non-pangenome/varscan_mega/JP_RFmg7/snpsANDindels',
                      'JP_RFmg7-varscan_all_bedfiles_SNP_paralog_snps_translated.txt'),
        'DF': op.join('/data/projects/pool_seq/non-pangenome/varscan_mega/DF_megaSNPs_from_download/DF_mega',
                      'snpsANDindels/02_baseline_filtered/DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt')},
    'gatk' : {
        'JP': op.join('/data/projects/pool_seq/non-pangenome/gatk_diploid_testdata/JP_i101-gatk/filtered_snps',
                      'JP_i101_filtered_concatenated_snps_max-missing_table_biallelic-only_translated.txt'),
        'DF': op.join('/data/projects/pool_seq/non-pangenome/gatk_diploid_testdata/DF_i52-gatk/filtered_snps',
                      'DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt')},
    'varscan' : {
        'JP': op.join('/data/projects/pool_seq/non-pangenome/varscan_pooled/JP_pooled/snpsANDindels',
                      'JP_pooled-varscan_all_bedfiles_SNP_translated.txt'),
        'DF': op.join('/data/projects/pool_seq/non-pangenome/varscan_pooled/DF_p52/snpsANDindels',
                      'DF_p52-varscan_all_bedfiles_SNP.txt')
    }
}

In [9]:
# copy snp tables to new folder, bgzip in parallel
jobs = []
for source,sppdict in snptables.items():
    for spp,snptable in sppdict.items():
        dst = op.join(tmpdir, op.basename(snptable))
        if op.exists(dst + '.gz'):
            continue
        jobs.append(
            lview.apply_async(
                copy_and_bgzip, *(snptable, dst)
            )
        )
watch_async(jobs)

Watching 6 jobs ...


100%|██████████| 6/6 [07:02<00:00, 70.37s/it] 


In [11]:
# get the bgzipped files
snpgzs = fs(tmpdir, endswith='.gz')
len(snpgzs)

6
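
Because bgzip writes BGZF, which is also a valid gzip stream, the compressed tables can be spot-checked with the standard `gzip` module. The helper `peek_header` is an illustrative name, and the sketch assumes the first line of each table is a tab-delimited header.

```python
import gzip

def peek_header(gz_path):
    """Return the column names from the first line of a gzip-compressed table."""
    with gzip.open(gz_path, 'rt') as handle:
        return handle.readline().rstrip('\n').split('\t')
```

Looping this over `snpgzs` would confirm that each file decompresses and starts with the expected columns before it goes to dryad.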

In [12]:
dryad.extend(snpgzs)

<a id='sradoc'></a>
# prepare SRA doc

In [13]:
# load SRA metadata template file
sra_doc = pd.read_table(op.join(sra_main, 'SRA_metadata_acc.txt')).loc[range(0)]
sra_doc.head()

Unnamed: 0,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename,filename2,filename3,filename4,assembly,fasta_file


In [14]:
def add_to_table(df):
    sra_tmp = pd.read_table(op.join(sra_main, 'SRA_metadata_acc.txt')).loc[range(0)]
    for row in df.index:
        samp,group,r1,r2 = df.loc[row, ['sample_name', 'pool_name', 'file_name_r1', 'file_name_r2']]
        
        sra_tmp.loc[row, ['filename', 'filename2']] = r1,r2

        if 'mg' in samp:
            tissue_type = 'haploid megagametophyte'
        else:
            tissue_type = 'diploid needle'

        if 'JP' in samp:
            spp = 'jack pine (Pinus banksiana)'
        else:
            spp = 'Douglas-fir (Pseudotsuga menziesii)'

        if '_p' in samp:
            ploidy = 'pooled diploid'
        else:
            ploidy = 'individual diploid'
        if tissue_type == 'haploid megagametophyte':
            ploidy = 'haploid'


        sra_tmp.loc[row, 'library_ID'] = samp
        sra_tmp.loc[row, 'title'] = f'{ploidy} exome-capture of {spp} : {tissue_type} tissue'
    
    sra_tmp['design_description'] = 'exome sequences were isolatated using custom targeted sequence capture probes'
    sra_tmp['library_source'] = 'GENOMIC'
    sra_tmp['library_selection'] = 'PCR'
    sra_tmp['library_layout'] = 'Paired'
    sra_tmp['platform'] = 'Illumina'
    sra_tmp['instrument_model'] = 'Illumina HiSeq 4000'
    sra_tmp['library_strategy'] = 'OTHER'
    sra_tmp['filetype'] = 'fastq'
    
    return sra_tmp
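
The naming rules buried in `add_to_table` can be pulled into a standalone function so they are easy to check against a few sample names. This is a sketch of the same rules, not a replacement for the cell above; `describe_sample` is a hypothetical name.

```python
def describe_sample(samp):
    """Derive (ploidy, species, tissue_type) from a sample name."""
    tissue_type = 'haploid megagametophyte' if 'mg' in samp else 'diploid needle'
    spp = 'jack pine (Pinus banksiana)' if 'JP' in samp else 'Douglas-fir (Pseudotsuga menziesii)'
    if tissue_type == 'haploid megagametophyte':
        ploidy = 'haploid'            # megagametophyte overrides the pool/individual rule
    elif '_p' in samp:
        ploidy = 'pooled diploid'
    else:
        ploidy = 'individual diploid'
    return ploidy, spp, tissue_type

# sample names taken from the tables in this notebook
for name in ['DF_52_20', 'DF_p52', 'JP_RFmg7']:
    ploidy, spp, tissue_type = describe_sample(name)
    print(f'{ploidy} exome-capture of {spp} : {tissue_type} tissue')
```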

In [15]:
sra_dir = makedir(op.join(sra_main, 'testdata_submission'))
# tsv created by registering samples with biosample metadata below
# use to map our pop ID to biosample_accession
biosamplefile = op.join(sra_dir, 'biosample_attributes.tsv')
biodf = pd.read_table(biosamplefile)
biodict = dict(zip(biodf['sample_name'], biodf['accession']))
print(biodict)

{'DF_52_20': 'SAMN20086271', 'DF_52_21': 'SAMN20086272', 'DF_52_22': 'SAMN20086273', 'DF_52_23': 'SAMN20086274', 'DF_52_24': 'SAMN20086275', 'DF_52_25': 'SAMN20086276', 'DF_52_26': 'SAMN20086277', 'DF_52_27': 'SAMN20086278', 'DF_52_28': 'SAMN20086279', 'DF_52_29': 'SAMN20086280', 'DF_52_30': 'SAMN20086281', 'DF_52_31': 'SAMN20086282', 'DF_52_32': 'SAMN20086283', 'DF_52_33': 'SAMN20086284', 'DF_52_34': 'SAMN20086285', 'DF_52_35': 'SAMN20086286', 'DF_52_36': 'SAMN20086287', 'DF_52_37': 'SAMN20086288', 'DF_52_38': 'SAMN20086289', 'DF_52_39': 'SAMN20086290', 'DF_p52': 'SAMN20086291', 'JP_101_1': 'SAMN20086292', 'JP_101_2': 'SAMN20086293', 'JP_101_3': 'SAMN20086294', 'JP_101_4': 'SAMN20086295', 'JP_101_5': 'SAMN20086296', 'JP_101_6': 'SAMN20086297', 'JP_101_7': 'SAMN20086298', 'JP_101_8': 'SAMN20086299', 'JP_101_9': 'SAMN20086300', 'JP_101_10': 'SAMN20086301', 'JP_101_11': 'SAMN20086302', 'JP_101_12': 'SAMN20086303', 'JP_101_13': 'SAMN20086304', 'JP_101_14': 'SAMN20086305', 'JP_101_15': 'SA
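
The same sample-to-accession lookup can be built without pandas. This sketch assumes the attributes file is tab-delimited with `sample_name` and `accession` columns, as used in the cell above; `load_accessions` is a hypothetical helper that also refuses duplicate sample names, which `dict(zip(...))` would silently overwrite.

```python
import csv

def load_accessions(handle):
    """Map sample_name -> accession from a tab-delimited file handle."""
    reader = csv.DictReader(handle, delimiter='\t')
    mapping = {}
    for record in reader:
        name = record['sample_name']
        if name in mapping:
            raise ValueError(f'duplicate sample_name: {name}')
        mapping[name] = record['accession']
    return mapping
```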

In [16]:
# fill in SRA metadata
sra_doc = pd.concat(
    [
        add_to_table(datatable),
        add_to_table(megadt)
    ]
).reset_index(drop=True)

sra_doc['biosample_accession'] = sra_doc['library_ID'].map(biodict)
sra_doc['md5sum_filename'] = sra_doc['filename'].map(md5s)
sra_doc['md5sum_filename2'] = sra_doc['filename2'].map(md5s)
assert sra_doc['md5sum_filename'].isnull().sum() == 0
assert sra_doc['md5sum_filename2'].isnull().sum() == 0

sra_doc

Unnamed: 0,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename,filename2,filename3,filename4,assembly,fasta_file,md5sum_filename,md5sum_filename2
0,SAMN20086271,DF_52_20,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D705---D505.DF_52_20_cap3_kit1_R1....,HI.4779.008.D705---D505.DF_52_20_cap3_kit1_R2....,,,,,e5d15b40d1854dbe22f80fa424a950a3,ee60808baf5be0efead6d98ff4df8b57
1,SAMN20086272,DF_52_21,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D706---D505.DF_52_21_cap3_kit1_R1....,HI.4779.008.D706---D505.DF_52_21_cap3_kit1_R2....,,,,,dde217cf48666870be64700eb9c0c609,5c5fcfc1aabfd3bd585b82accfacc7cb
2,SAMN20086273,DF_52_22,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D707---D505.DF_52_22_cap3_kit1_R1....,HI.4779.008.D707---D505.DF_52_22_cap3_kit1_R2....,,,,,ff3c134c6e9b973f52435f9f6bf40b46,cb006b5bcf0bfbf2fc2db7d47884d00b
3,SAMN20086274,DF_52_23,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D708---D505.DF_52_23_cap3_kit1_R1....,HI.4779.008.D708---D505.DF_52_23_cap3_kit1_R2....,,,,,5405160b462c2bb8cc1402b001752030,0530c698b878fa5072be186ab376c299
4,SAMN20086275,DF_52_24,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D709---D505.DF_52_24_cap3_kit1_R1....,HI.4779.008.D709---D505.DF_52_24_cap3_kit1_R2....,,,,,a26801d87bed5cadd1e144655f557fa3,ccf722a07c85e4f7dd23e78ad887b5a1
5,SAMN20086276,DF_52_25,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D710---D505.DF_52_25_cap3_kit1_R1....,HI.4779.008.D710---D505.DF_52_25_cap3_kit1_R2....,,,,,fbfe8a79f4cd7d43e3241daf0631b436,5cef6d35bf4f78fc7f2cf93dce81b176
6,SAMN20086277,DF_52_26,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D711---D505.DF_52_26_cap4_kit1_R1....,HI.4779.008.D711---D505.DF_52_26_cap4_kit1_R2....,,,,,56dc37f3687405ca9c4384fe6d52d5d7,07d132123750d7f9a10b4a48efc8da2e
7,SAMN20086278,DF_52_27,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D712---D505.DF_52_27_cap4_kit1_R1....,HI.4779.008.D712---D505.DF_52_27_cap4_kit1_R2....,,,,,4439283a754d92079df65a8e1009286c,d7e67ea063d959210e95f14d93b8fc06
8,SAMN20086279,DF_52_28,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D701---D506.DF_52_28_cap4_kit1_R1....,HI.4779.008.D701---D506.DF_52_28_cap4_kit1_R2....,,,,,0ca31d096ac7b5b276fc34c45fa04805,1cab668b96a996ffd7c09f2d2dfe02d4
9,SAMN20086280,DF_52_29,individual diploid exome-capture of Douglas-fi...,OTHER,GENOMIC,PCR,Paired,Illumina,Illumina HiSeq 4000,exome sequences were isolatated using custom t...,fastq,HI.4779.008.D702---D506.DF_52_29_cap4_kit1_R1....,HI.4779.008.D702---D506.DF_52_29_cap4_kit1_R2....,,,,,3f2c9090a11bcc406ae47e95c79d10d6,15ed66bb31debe7ad1487bf225b8e664


In [17]:
# make sure I have unique data
for col in ['library_ID', 'filename', 'filename2']:
    assert luni(sra_doc[col]) == nrow(sra_doc)

assert luni(sra_doc['filename'].tolist() + sra_doc['filename2'].tolist()) == nrow(sra_doc) * 2
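
The asserts above only report that something is duplicated, not what. A small helper to name the offending values could look like the following; `duplicated` is an illustrative name and uses only the standard library.

```python
from collections import Counter

def duplicated(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]
```

Calling it on `sra_doc['filename'].tolist() + sra_doc['filename2'].tolist()` should return an empty list when the asserts pass.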

In [18]:
# create metadata file for upload
srafile = op.join(sra_dir, 'testdata_sra_doc.txt')
sra_doc.to_csv(srafile, sep='\t', index=False)

In [19]:
srafile

'/data/projects/pool_seq/sra_docs/testdata_submission/testdata_sra_doc.txt'

In [20]:
dryad.append(srafile)

<a id='biosample'></a>
# prepare biosample metadata

use this to register biosamples

In [22]:
sra_main

'/data/projects/pool_seq/sra_docs'

In [23]:
def create_bio():
    """Create biosample metadata using NCBI template."""
    bio_template = op.join(sra_main, 'Plant.1.0.txt')
    biodf = pd.read_table(bio_template).loc[range(0)]
    biodf.columns = [col.replace("*", "") for col in biodf.columns]
    return biodf

create_bio()

Unnamed: 0,sample_name,sample_title,bioproject_accession,organism,isolate,cultivar,ecotype,age,dev_stage,geo_loc_name,tissue,biomaterial_provider,cell_line,cell_type,collected_by,collection_date,culture_collection,disease,disease_stage,genotype,growth_protocol,height_or_length,isolation_source,lat_lon,phenotype,population,sample_type,sex,specimen_voucher,temp,treatment,description
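
`create_bio` strips asterisks from the template's column names. Assuming the asterisk is how the template flags mandatory fields, it may be useful to record which fields carried one before it is removed. `split_required` is a hypothetical helper, and the example column names in the usage line are illustrative.

```python
def split_required(columns):
    """Return (clean_names, required_names) from template column names."""
    clean = [col.replace('*', '') for col in columns]
    # keep only the columns that were flagged with a leading asterisk
    required = [col.replace('*', '') for col in columns if col.startswith('*')]
    return clean, required

clean, required = split_required(['*sample_name', 'sample_title', '*organism'])
print(required)
```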


In [24]:
# fill in biosample metadata
bio = create_bio()
for row in sra_doc.index:
    samp,title = sra_doc.loc[row, ['library_ID', 'title']]
    
    if 'DF' in samp:
        spp = 'Pseudotsuga menziesii'
        geo = 'USA:California'
        latlon = '39.380 N 120.670 W'
        ecotype = 'var. menziessii'
    else:
        spp = 'Pinus banksiana'
        geo = 'Canada:Saskatchewan'
        latlon = '54.083 N 107.250 W'
        ecotype = 'Saskatchewan'

    bio.loc[row, 'sample_name'] = samp
    bio.loc[row, 'sample_title'] = title
    bio.loc[row, 'organism'] = spp
    bio.loc[row, 'geo_loc_name'] = geo
    bio.loc[row, 'tissue'] = title.split(": ")[-1]
    bio.loc[row, 'population'] = samp
    bio.loc[row, 'lat_lon'] = latlon
    bio.loc[row, 'ecotype'] = ecotype

bio['dev_stage'] = 'juvenile'
bio['bioproject_accession'] = 'PRJNA744263'

bio

Unnamed: 0,sample_name,sample_title,bioproject_accession,organism,isolate,cultivar,ecotype,age,dev_stage,geo_loc_name,tissue,biomaterial_provider,cell_line,cell_type,collected_by,collection_date,culture_collection,disease,disease_stage,genotype,growth_protocol,height_or_length,isolation_source,lat_lon,phenotype,population,sample_type,sex,specimen_voucher,temp,treatment,description
0,DF_52_20,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_20,,,,,,
1,DF_52_21,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_21,,,,,,
2,DF_52_22,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_22,,,,,,
3,DF_52_23,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_23,,,,,,
4,DF_52_24,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_24,,,,,,
5,DF_52_25,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_25,,,,,,
6,DF_52_26,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_26,,,,,,
7,DF_52_27,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_27,,,,,,
8,DF_52_28,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_28,,,,,,
9,DF_52_29,individual diploid exome-capture of Douglas-fi...,PRJNA744263,Pseudotsuga menziesii,,,var. menziessii,,juvenile,USA:California,diploid needle tissue,,,,,,,,,,,,,39.380 N 120.670 W,,DF_52_29,,,,,,
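
The `lat_lon` strings above are typed by hand. A minimal sketch for producing the same layout from signed decimal degrees follows; `format_lat_lon` is a hypothetical helper, and the three-decimal default simply matches the strings used in this notebook.

```python
def format_lat_lon(lat, lon, decimals=3):
    """Format signed decimal degrees as e.g. '39.380 N 120.670 W'."""
    if not -90 <= lat <= 90:
        raise ValueError(f'latitude out of range: {lat}')
    if not -180 <= lon <= 180:
        raise ValueError(f'longitude out of range: {lon}')
    ns = 'N' if lat >= 0 else 'S'   # negative latitude is south
    ew = 'E' if lon >= 0 else 'W'   # negative longitude is west
    return f'{abs(lat):.{decimals}f} {ns} {abs(lon):.{decimals}f} {ew}'

print(format_lat_lon(39.38, -120.67))
print(format_lat_lon(54.083, -107.25))
```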


In [25]:
# save
biofile = op.join(sra_dir, 'testdata_biosample.txt')
bio.to_csv(biofile, sep='\t', index=False)

In [26]:
dryad.append(biofile)

In [27]:
biofile

'/data/projects/pool_seq/sra_docs/testdata_submission/testdata_biosample.txt'

In [28]:
dryad

['/data/projects/pool_seq/non-pangenome/gatk_diploid_testdata/JP_i101-gatk/datatable.txt',
 '/data/projects/pool_seq/non-pangenome/varscan_mega/JP_RFmg7/datatable.txt',
 '/lu213/brandon.lind/data/testdir/dryad_testdata/DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt.gz',
 '/lu213/brandon.lind/data/testdir/dryad_testdata/DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt.gz',
 '/lu213/brandon.lind/data/testdir/dryad_testdata/DF_p52-varscan_all_bedfiles_SNP.txt.gz',
 '/lu213/brandon.lind/data/testdir/dryad_testdata/JP_RFmg7-varscan_all_bedfiles_SNP_paralog_snps_translated.txt.gz',
 '/lu213/brandon.lind/data/testdir/dryad_testdata/JP_i101_filtered_concatenated_snps_max-missing_table_biallelic-only_translated.txt.gz',
 '/lu213/brandon.lind/data/testdir/dryad_testdata/JP_pooled-varscan_all_bedfiles_SNP_translated.txt.gz',
 '/data/projects/pool_seq/sra_docs/testdata_submission/testdata_sra_doc.txt',
 '/data/projects/pool_seq/sra_docs/testdata_submission/

In [29]:
len(dryad), luni(dryad)

(10, 10)

<a id='transfer'></a>
# transfer files to Short Read Archive via ftp

In [39]:
def connect(password_file):
    """Connect to NCBI SRA ftp."""
    import ftplib
    
    user,password = read(password_file, lines=True)
    ftp = ftplib.FTP('ftp-private.ncbi.nlm.nih.gov', user, password)
    ftp.set_pasv(True)
    
    return ftp

def upload_file(f, password_file):
    """Upload file, `f`, to ftp."""
    from os import path as op
    import os
    
    ftp = connect(password_file)
    ftp.cwd('uploads/lind.brandon.m_gmail.com_oR317JsQ/coadaptree_testdata_files')

    with open(f, 'rb') as file:
        ftp.storbinary(f"STOR {op.basename(f)}", file)

    ftp.quit()
    
    return op.basename(f)

dview['connect'] = connect
dview['read'] = read
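
In `upload_file`, a failed `STOR` would skip `ftp.quit()` and leave the connection open. One way to guard against that is sketched below: the function takes an already-connected `ftplib.FTP`-like object and always shuts it down. `upload_with_cleanup` is a hypothetical name, not part of the notebook.

```python
from os import path as op

def upload_with_cleanup(ftp, local_path, remote_dir):
    """Upload `local_path` into `remote_dir`, always closing the connection."""
    try:
        ftp.cwd(remote_dir)
        with open(local_path, 'rb') as handle:
            ftp.storbinary(f'STOR {op.basename(local_path)}', handle)
    finally:
        try:
            ftp.quit()     # polite shutdown
        except Exception:
            ftp.close()    # fall back to closing the socket
    return op.basename(local_path)
```

With the notebook's own helper this would be called as `upload_with_cleanup(connect(password_file), f, remote_dir)`.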

In [37]:
# upload individual and pool-seq fastqs
user_passfile = '/data/projects/pool_seq/non-pangenome/gatk_diploid_testdata/testdata_sra_password'

jobs = []
for f in sra_transfer_files:
    jobs.append(
        lview.apply_async(
            upload_file, *(f, user_passfile)
        )
    )
    
watch_async(jobs)

Watching 172 jobs ...


100%|██████████| 172/172 [21:20<00:00,  7.44s/it] 


# assert successful SRA transfer

In [36]:
def check_transfers():
    ftp = connect(user_passfile)
    ftp.cwd('uploads/lind.brandon.m_gmail.com_oR317JsQ/coadaptree_testdata_files')
    files = ftp.nlst()
    return files

In [38]:
dsts = [j.r for j in jobs]
dsts[0]

'HI.4779.008.D705---D505.DF_52_20_cap3_kit1_R1.fastq.gz'

In [41]:
files = check_transfers()

len(files)

172

In [42]:
# assert that all transfer files are present in the ftp directory
assert all([op.basename(f) in files for f in sra_transfer_files])

In [43]:
# make sure all the files in the metadata are in the ftp directory
file_checks = sra_doc['filename'].tolist() + sra_doc['filename2'].tolist()
assert all([f in files for f in file_checks])
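
If either assert fails, it helps to know which files are absent rather than only that some are. A minimal sketch, with `missing_files` as an illustrative name:

```python
def missing_files(expected, remote):
    """Return expected names absent from the remote listing, sorted."""
    return sorted(set(expected) - set(remote))
```

For example, `missing_files(file_checks, files)` should be an empty list when the check above passes.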

In [44]:
luni(file_checks)  # 172 / 2 = 86 (md5s + fastqs; md5s will be deleted)

86

In [45]:
# remove the password file now that the transfers have been verified
os.remove(user_passfile)

In [46]:
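# print rsync commands to download the dryad files locally before uploading to dryad
# note: both datatable.txt files share a basename, so rename them on download (see the README names below)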
for f in dryad:
    print('rsync -azv', f'yeaman03:{f}', op.basename(f), '--progress', '\n')

rsync -azv yeaman03:/data/projects/pool_seq/non-pangenome/gatk_diploid_testdata/JP_i101-gatk/datatable.txt datatable.txt --progress 

rsync -azv yeaman03:/data/projects/pool_seq/non-pangenome/varscan_mega/JP_RFmg7/datatable.txt datatable.txt --progress 

rsync -azv yeaman03:/lu213/brandon.lind/data/testdir/dryad_testdata/DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt.gz DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt.gz --progress 

rsync -azv yeaman03:/lu213/brandon.lind/data/testdir/dryad_testdata/DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt.gz DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt.gz --progress 

rsync -azv yeaman03:/lu213/brandon.lind/data/testdir/dryad_testdata/DF_p52-varscan_all_bedfiles_SNP.txt.gz DF_p52-varscan_all_bedfiles_SNP.txt.gz --progress 

rsync -azv yeaman03:/lu213/brandon.lind/data/testdir/dryad_testdata/JP_RFmg7-varscan_all_bedfiles_SNP_paralog_snps_translated.txt.gz JP_RF
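
Two of the printed commands write to the same local name, `datatable.txt`, so the second download would overwrite the first. The sketch below builds the commands with an optional rename map and refuses repeated destinations. `rsync_commands` is a hypothetical helper; the new names in the usage example come from the README written later in this notebook, and which datatable gets which name is an assumption based on the README descriptions.

```python
import shlex
from os import path as op

def rsync_commands(paths, host, rename=None):
    """Build one rsync command per path, refusing duplicate local names."""
    rename = rename or {}
    seen = set()
    commands = []
    for path in paths:
        dst = rename.get(path, op.basename(path))
        if dst in seen:
            raise ValueError(f'destination name used twice: {dst}')
        seen.add(dst)
        commands.append(
            shlex.join(['rsync', '-azv', '--progress', f'{host}:{path}', dst])
        )
    return commands

# assumed pairing of datatables to the README file names
rename = {
    '/data/projects/pool_seq/non-pangenome/gatk_diploid_testdata/JP_i101-gatk/datatable.txt':
        'pooled_individual_pipeline_datatable.txt',
    '/data/projects/pool_seq/non-pangenome/varscan_mega/JP_RFmg7/datatable.txt':
        'haploid_pipeline_datatable.txt',
}
for cmd in rsync_commands(list(rename), 'yeaman03', rename):
    print(cmd)
```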

In [4]:
# write dryad readme
README = op.join(sra_main, 'dryad_README.txt')

with open(README, 'w') as readme:
    readme.write('''Brandon Lind July 7, 2021

All .txt and .txt.gz files are tab-delimited (once decompressed).

haploid_pipeline_datatable.txt - this is the `datatable.txt` file needed to configure the run of our jack pine haploid megagametophyte data through the Varscan Pipeline

pooled_individual_pipeline_datatable.txt - this is the `datatable.txt` file needed to configure the runs of our jack pine and Douglas-fir pooled and individual sequence data through both the Varscan Pipeline and the GATK pipeline

DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt.gz - the SNP calls from our GATK Pipeline for Douglas-fir individual sequence data

DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt.gz - the heterozygous SNP calls for sites output from the Varscan Pipeline for our haploid Douglas-fir data

DF_p52-varscan_all_bedfiles_SNP.txt.gz - the SNP calls for our pooled Douglas-fir data from the Varscan Pipeline

JP_RFmg7-varscan_all_bedfiles_SNP_paralog_snps_translated.txt.gz - the heterozygous SNP calls for sites output from the Varscan Pipeline for our haploid jack pine data

JP_i101_filtered_concatenated_snps_max-missing_table_biallelic-only_translated.txt.gz - the SNP calls from our GATK Pipeline for jack pine individual sequence data

JP_pooled-varscan_all_bedfiles_SNP_translated.txt.gz - the SNP calls for our pooled jack pine data from the Varscan Pipeline

testdata_sra_doc.txt - NCBI Short Read Archive metadata

testdata_biosample.txt - NCBI Biosample metadata
''')
    
print(read(README, lines=False))

Brandon Lind July 7, 2021

All .txt and .txt.gz files are tab-delimited (once decompressed).

haploid_pipeline_datatable.txt - this is the `datatable.txt` file needed to configure the run of our jack pine haploid megagametophyte data through the Varscan Pipeline

pooled_individual_pipeline_datatable.txt - this is the `datatable.txt` file needed to configure the runs of our jack pine and Douglas-fir pooled and individual sequence data through both the Varscan Pipeline and the GATK pipeline

DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt.gz - the SNP calls from our GATK Pipeline for Douglas-fir individual sequence data

DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt.gz - the heterozygous SNP calls for sites output from the Varscan Pipeline for our haploid Douglas-fir data

DF_p52-varscan_all_bedfiles_SNP.txt.gz - the SNP calls for our pooled Douglas-fir data from the Varscan Pipeline

JP_RFmg7-varscan_all_bedfiles_SNP_paralog_snps_translated.txt

In [5]:
README

'/data/projects/pool_seq/sra_docs/dryad_README.txt'