This script helps the user create an input file that tells the pipeline which files to work with and when to use them. Four major file types can be fed through the pipeline, each with different requirements:  
* Reference genome files, which go through the pipeline the same way as an already assembled file
* Reference transcriptome files, which come from a database such as the MMETSP and have already been assembled
* One-sided experimental transcriptome files, which need to be trimmed and assembled before being passed through the pipeline
* Two-sided experimental transcriptome files, which need to be trimmed and assembled before being passed through the pipeline

Below, the user specifies the files to use for each category. "refgen" refers to a reference genome, "reftrans" to a reference transcriptome, "onesid" to a one-sided experimental transcriptome, and "twosid" to a two-sided experimental transcriptome. 
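
As a minimal sketch, the four labels above could map onto rows of the input table as follows. The helper name `build_sample_table` and the `EXTENSION_BY_TYPE` mapping are assumptions made for illustration and are not part of the pipeline; the sample names and extensions mirror values used elsewhere in this notebook.

```python
import pandas as pd

# File extension per category, mirroring the values used in this notebook.
EXTENSION_BY_TYPE = {
    "refgen": "fastq",
    "reftrans": "fasta",
    "onesid": "fastq",
    "twosid": "fastq",
}


def build_sample_table(samples_by_type):
    """Return one row per sample with its category label and file extension."""
    rows = [
        {
            "FileName": name,
            "FileType": file_type,
            "FileExtension": EXTENSION_BY_TYPE[file_type],
        }
        for file_type, names in samples_by_type.items()
        for name in names
    ]
    return pd.DataFrame(rows, columns=["FileName", "FileType", "FileExtension"])


example_table = build_sample_table({
    "refgen": ["thapsreference"],
    "reftrans": ["MMETSP0994"],
    "onesid": ["DRR004457", "DRR004459"],
    "twosid": [],  # an empty list simply adds no rows
})
print(example_table)
```

Keeping the extension in a single mapping means each category's extension is stated once rather than repeated per block.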

In [5]:
import pandas as pd
import io
import os
import yaml
import configparser
configfile = "../config.yaml"
with open(configfile) as file:
    # safe_load only builds plain Python objects, which is all a config file needs
    config = yaml.safe_load(file)

In [6]:
separator = " "
refgen = config["referencenames"]  # already a list in the config, so no split is needed
print(refgen)

# We can grab this list from the MMETSP data file
mmetsp = pd.read_csv("../data/forNCBI_MMETSP_2.csv")
reftrans = list(mmetsp[mmetsp["ORGANISM"] == "Emiliania huxleyi"].SAMPLE_NAME)

# These are specified manually from the FASTQ files that we downloaded. We may want to read them from the config file as well.
onesid = "DRR004457 DRR004459 DRR004460 DRR004462 DRR004464 DRR004458 DRR004461 DRR004463 SRR6296274 SRR6296278 SRR6296279 SRR6296282 SRR6296286 SRR6296288 SRR6296294 SRR6296295 SRR6296296 SRR6296298 SRR6296276 SRR6296277 SRR6296280 SRR6296283 SRR6296284 SRR6296289 SRR6296290 SRR6296293 SRR6296297 SRR6296273 SRR6296275 SRR6296281 SRR6296285 SRR6296287 SRR6296291 SRR6296292 SRR6296299"
onesid = onesid.split(separator)
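twosid = []  # no two-sided samples yet; a list (not a string) keeps files.extend() from adding single characters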

['thapsreference', 'thapsreferencecoding', 'pseudreference', 'pseudreferencecoding', 'fragreference', 'fragreferencecoding', 'ehuxreference', 'ehuxreferencecoding']


In [7]:
files = []
labels = []
extensions = []

files.extend(refgen)
labels.extend(["refgen"] * len(refgen))
extensions.extend(["fastq"] * len(refgen))

files.extend(reftrans)
labels.extend(["reftrans"] * len(reftrans))
extensions.extend(["fasta"] * len(reftrans))

files.extend(onesid)
labels.extend(["onesid"] * len(onesid))
extensions.extend(["fastq"] * len(onesid))

files.extend(twosid)
labels.extend(["twosid"] * len(twosid))
extensions.extend(["fastq"] * len(twosid))

outputframe = pd.DataFrame({"FileName": files,
                            "FileType": labels,
                            "FileExtension": extensions})

In [8]:
outputframe.to_csv(path_or_buf = "../data/6April2020.txt", sep = "\t")
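
The call above also writes the DataFrame index as an unnamed first column, which is why the file shows an `Unnamed: 0` column when it is read back. If that column is not wanted, one option is to pass `index=False`. The sketch below uses a small stand-in table (`demo`, an assumption for illustration) and an in-memory buffer rather than the real output file.

```python
import io

import pandas as pd

# Small stand-in for outputframe, for illustration only.
demo = pd.DataFrame({
    "FileName": ["thapsreference", "MMETSP0994"],
    "FileType": ["refgen", "reftrans"],
    "FileExtension": ["fastq", "fasta"],
})

# Write to an in-memory buffer so the sketch has no side effects on disk.
buffer = io.StringIO()
demo.to_csv(buffer, sep="\t", index=False)

# Read the tab-separated text back; no extra index column appears.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t")
print(list(reloaded.columns))  # ['FileName', 'FileType', 'FileExtension']
```

Note that dropping the index column shifts the column positions, so any later `index_col` given as a number would need to change accordingly.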

In [9]:
outputframe

Unnamed: 0,FileName,FileType,FileExtension
0,thapsreference,refgen,fastq
1,thapsreferencecoding,refgen,fastq
2,pseudreference,refgen,fastq
3,pseudreferencecoding,refgen,fastq
4,fragreference,refgen,fastq
5,fragreferencecoding,refgen,fastq
6,ehuxreference,refgen,fastq
7,ehuxreferencecoding,refgen,fastq
8,MMETSP0994,reftrans,fasta
9,MMETSP0996,reftrans,fasta


In [39]:
sample = pd.read_csv("../data/6April2020.txt", sep = "\t", index_col = 2, header = 0)
sample.loc[["refgen","onesid"]]

FileType,Unnamed: 0,FileName,FileExtension
refgen,0,thapsreference,.fastq
refgen,1,thapsreferencecoding,.fastq
refgen,2,pseudreference,.fastq
refgen,3,pseudreferencecoding,.fastq
refgen,4,fragreference,.fastq
refgen,5,fragreferencecoding,.fastq
refgen,6,ehuxreference,.fastq
refgen,7,ehuxreferencecoding,.fastq
onesid,24,DRR004457,.fastq
onesid,25,DRR004459,.fastq
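
Selecting with `.loc` and a list of labels raises a `KeyError` in recent pandas versions when one of the labels is absent from the index, which would happen for "twosid" while that category is empty. A hedged alternative is to keep `FileType` as an ordinary column and filter with `isin`, which returns no rows for a missing category. The `demo` table and the `wanted` list below are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the sample table, for illustration only.
demo = pd.DataFrame({
    "FileName": ["thapsreference", "MMETSP0994", "DRR004457"],
    "FileType": ["refgen", "reftrans", "onesid"],
    "FileExtension": ["fastq", "fasta", "fastq"],
})

# Keep only rows whose category is one of the requested labels.
wanted = ["refgen", "onesid"]
subset = demo[demo["FileType"].isin(wanted)]
print(subset["FileName"].tolist())  # ['thapsreference', 'DRR004457']

# A category with no samples yields an empty result rather than an error.
empty = demo[demo["FileType"].isin(["twosid"])]
print(len(empty))  # 0
```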


In [40]:
sample.iloc[1]

Unnamed: 0                          1
FileName         thapsreferencecoding
FileExtension                  .fastq
Name: refgen, dtype: object