# Test sequences for pipeline
---
### About the data

The FASTQ files were retrieved from the Griffith lab's RNA-seq tutorial (https://github.com/griffithlab/rnaseq_tutorial):

`Malachi Griffith*, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith*. 2015. Informatics for RNA-seq: A web resource for analysis on the cloud. PLoS Comp Biol. 11(8):e1004393.`

*The practice dataset includes 3 replicates of data from the HCC1395 breast cancer cell line and 3 replicates of data from HCC1395BL matched lymphoblastoid line. So, this will be a tumor vs normal (cell line) comparison. The reads are paired-end 151-mers generated on an Illumina HiSeq instrument. The test data has been pre-filtered for reads that appear to map to chromosome 22.*

In [41]:
import sys
import os
from os import listdir,path
import pandas as pd
import io
import subprocess
import json

output_path = path.abspath('test_data')
sequences_url = r'http://genomedata.org/rnaseq-tutorial/practical.tar'

### Download FASTQ files

In [12]:
%%bash -s "$sequences_url" "$output_path"
mkdir -p "$2/fastq"
wget "$1" -P "$2/"
tar -C "$2/fastq/" -xvf "$2/practical.tar"
rm "$2/practical.tar"

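
Since the reads are paired-end, later steps usually need the two mate files of each sample side by side. The sketch below groups file names into pairs. The naming scheme `<sample>_r1.fastq.gz` / `<sample>_r2.fastq.gz` and the helper name `pair_fastq_files` are assumptions for illustration, not something this notebook confirms, so the pattern may need adjusting to the actual archive contents.

```python
import re
from collections import defaultdict
from os import path

# Assumed (hypothetical) naming scheme: <sample>_r1.fastq.gz and <sample>_r2.fastq.gz
MATE_PATTERN = re.compile(r'^(?P<sample>.+)_r(?P<mate>[12])\.fastq\.gz$', re.IGNORECASE)

def pair_fastq_files(filenames):
    """Group FASTQ file names into {sample: {'1': file, '2': file}}."""
    pairs = defaultdict(dict)
    for name in sorted(filenames):
        match = MATE_PATTERN.match(path.basename(name))
        if match:  # names that do not follow the assumed scheme are skipped
            pairs[match.group('sample')][match.group('mate')] = name
    # keep only samples for which both mates were found
    return {sample: mates for sample, mates in pairs.items() if set(mates) == {'1', '2'}}
```

In this notebook it could be called as `pair_fastq_files(listdir(path.join(output_path, 'fastq')))`, assuming the archive extracts its files directly into `fastq/`.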


### Download gencode files

In [90]:
%%bash -s "$output_path"
gencode_path="$1/gencode"
mkdir -p "$gencode_path"
cd "$gencode_path"
wget "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gtf.gz"
wget "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz"

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/hugo/.wget-hsts'. HSTS will be disabled.
--2020-05-11 15:53:19--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gtf.gz
           => ‘gencode.v34.annotation.gtf.gz’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/gencode/Gencode_human/release_34 ... done.
==> SIZE gencode.v34.annotation.gtf.gz ... 43164654
==> PASV ... done.    ==> RETR gencode.v34.annotation.gtf.gz ... done.
Length: 43164654 (41M) (unauthoritative)

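
Because the test reads were pre-filtered for chromosome 22, it can be handy to check what the downloaded annotation contains for that chromosome. The following is a minimal sketch, assuming the standard nine-column, tab-separated GTF layout and the `chr22` sequence name used by GENCODE. The function name `iter_gtf_records` and its defaults are hypothetical.

```python
import gzip

def iter_gtf_records(gtf_gz_path, seqname='chr22', feature='gene'):
    """Yield (seqname, feature, start, end, strand, attributes) for matching GTF lines."""
    with gzip.open(gtf_gz_path, 'rt') as handle:
        for line in handle:
            if line.startswith('#'):  # header/comment lines
                continue
            fields = line.rstrip('\n').split('\t')
            if len(fields) != 9:  # skip lines that are not well-formed GTF records
                continue
            if fields[0] == seqname and fields[2] == feature:
                yield (fields[0], fields[2], int(fields[3]), int(fields[4]),
                       fields[6], fields[8])
```

For example, `sum(1 for _ in iter_gtf_records(path.join(output_path, 'gencode', 'gencode.v34.annotation.gtf.gz')))` would count the gene records on chromosome 22, assuming the download above completed.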

---
### Create template for configuration files

In [87]:
def create_config_stub(output_config_file,conda_path=None,env_name=None):
    # Collect tool versions from `conda list`; output_config_file is not written yet.
    data = {}
    df = None  # stays None when no conda environment is given
    if env_name is not None and conda_path is not None:
        # shell=True lets the shell expand '~' in conda_path
        call = '%s list -n %s'%(path.join(conda_path,'bin','conda'),env_name)
        s = subprocess.check_output(call,shell=True).decode()
        # parse the whitespace-separated package table printed by `conda list`
        df = pd.read_csv(io.StringIO(s),header=None,skiprows=4,sep=r'\s+',
                        names=['package','version','build','channel']).set_index('package')
        for script in ['r','star','bedtools','samtools','deeptools','stringtie','scallop','kallisto','qualimap']:
            data[script+'_version'] = df.loc[script,'version']
        print(df)

    print(data)
    return df
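
Parsing the human-readable table depends on the number of header lines and on column spacing. As an alternative sketch (not what this notebook does), `conda list` also accepts `--json`, which avoids both. The code below assumes each JSON entry carries `name` and `version` keys, which is worth verifying against the installed conda version. The helper names are hypothetical.

```python
import json
import subprocess
from os import path

def conda_list_json(conda_path, env_name):
    """Return the raw JSON text of `conda list -n ENV --json`."""
    # expanduser is needed because no shell is involved to expand '~'
    conda_exe = path.join(path.expanduser(conda_path), 'bin', 'conda')
    return subprocess.check_output([conda_exe, 'list', '-n', env_name, '--json']).decode()

def versions_from_conda_json(conda_json_text, packages):
    """Map '<package>_version' to the installed version of each requested package."""
    # assumption: every entry has 'name' and 'version' keys
    installed = {entry['name']: entry['version'] for entry in json.loads(conda_json_text)}
    missing = [p for p in packages if p not in installed]
    if missing:
        raise KeyError('packages not found in environment: %s' % ', '.join(missing))
    return {p + '_version': installed[p] for p in packages}
```

A call such as `versions_from_conda_json(conda_list_json('~/anaconda3/', 'REAP'), ['star', 'samtools'])` would then return a dictionary shaped like the `data` dictionary built above.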

In [88]:
conda_path = r'~/anaconda3/'
env_name = 'REAP'
conda_list = create_config_stub(None,env_name=env_name,conda_path=conda_path)

              version           build      channel
package                                           
_r-mutex        1.0.0           mro_2            r
aioeasywebdav   2.4.0       py36_1000  conda-forge
aiohttp         3.6.2  py36h7b6447c_0          NaN
appdirs         1.4.3  py36h28b3542_0          NaN
async-timeout   3.0.1          py36_0          NaN
...               ...             ...          ...
wrapt          1.12.1  py36h7b6447c_1          NaN
xz              5.2.5      h7b6447c_0          NaN
yaml            0.1.7      had09818_2          NaN
yarl            1.4.2  py36h7b6447c_0          NaN
zlib           1.2.11      h7b6447c_3          NaN

[154 rows x 3 columns]
{'r_version': '3.5.0', 'star_version': '2.7.3a', 'bedtools_version': '2.29.2', 'samtools_version': '1.7', 'deeptools_version': '3.4.3', 'stringtie_version': '2.1.2', 'scallop_version': '0.10.4', 'kallisto_version': '0.46.2', 'qualimap_version': '2.2.2a'}
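
`create_config_stub` accepts an `output_config_file` argument but only prints the version dictionary. A minimal sketch of the missing write step follows, assuming JSON is the intended format (the notebook imports `json`, but does not say so explicitly). The function name `write_config_stub` is hypothetical.

```python
import json

def write_config_stub(versions, output_config_file):
    """Write the tool-version dictionary to a JSON file (assumed config format)."""
    with open(output_config_file, 'w') as handle:
        json.dump(versions, handle, indent=4, sort_keys=True)
        handle.write('\n')  # end the file with a newline
```

Sorting the keys keeps the file stable between runs, so changes in tool versions show up as small diffs.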


In [86]:
conda_list[conda_list.index=='r']

         version     build  channel
package
r          3.5.0  mro350_0      NaN
