# Test Database Formation
This notebook turns an expedition folder, containing the results of multiple sequencing runs and an Excel description file, into a small database for testing the different consensus/identification pipelines.

#### Imports

In [1]:
import os
import os.path as ospath

import pandas as pd

# Only these two helpers are used in this notebook
from general_helpers import download_sequence, extract_fastq

#### Setting the name of the expedition folder to be converted and the name of the new test database

All of the databases are stored in the folder `testdbs`. Set `input_expedition_folder` to the name of the expedition folder to convert and `name_of_db` to the name you want to give to the newly created database.

In [2]:
input_expedition_folder = "matK_rbcL_trnh_ITS_12samples_publicationsummer2022_9Qiagen_3MN"
name_of_db = "summer_expedition_experiments"

#### Finding the location of the Excel sample description file

In [15]:
db_path = ospath.join("input", input_expedition_folder)
main_dir = None
description_path = None
# Walk the expedition folder and stop at the first folder that contains the description file
for root, dirs, files in os.walk(db_path):
    if "Description_sample.xlsx" in files:
        main_dir = root
        description_path = ospath.join(main_dir, "Description_sample.xlsx")
        break
if main_dir is None:
    print("No Excel description file was found for this expedition")
else:
    print(description_path)

input/matK_rbcL_trnh_ITS_12samples_publicationsummer2022_9Qiagen_3MN/Description_sample.xlsx
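
The search above can also be wrapped in a small reusable function. The following is only a minimal sketch: `find_file` is a hypothetical helper name that does not exist in `general_helpers`, and the folder names in the demonstration are made up.

```python
import os
import os.path as ospath
import tempfile


def find_file(root_dir, filename):
    """Return the path of the first `filename` found under `root_dir`, or None."""
    for root, _dirs, files in os.walk(root_dir):
        if filename in files:
            return ospath.join(root, filename)
    return None


# Demonstration on a throwaway directory tree (hypothetical folder names)
with tempfile.TemporaryDirectory() as tmp:
    nested = ospath.join(tmp, "expedition", "run1")
    os.makedirs(nested)
    open(ospath.join(nested, "Description_sample.xlsx"), "w").close()

    found = find_file(tmp, "Description_sample.xlsx")
    missing = find_file(tmp, "does_not_exist.xlsx")
    found_relative = ospath.relpath(found, tmp)
```

Returning `None` when nothing is found lets the caller decide how to report a missing description file.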


#### Extracting information about the genes sequenced for each plant, the barcode sequences and general expedition information
The Excel description file needs to have a very specific structure with:
- general information in the first sheet (columns: ref, experiment, description, notes).
- the barcode sequences as numbered rows in the second sheet.
- information about the genes sequenced for each species in the third sheet (columns: samples, Species, matk, rbcL, trnH-psbA, ITS).
- primer information in the fourth sheet, which the cell below reads into `primer_db`.

For the third sheet, an X indicates that the species was sequenced for this gene and an empty cell that it was not.
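
As an illustration of the X-mark convention, the sketch below lists the genes sequenced for each sample. The table is a hypothetical stand-in for the third sheet: the species names, sample numbers and marks are made up, and `sequenced_genes` is not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the third sheet (all values are made up)
sheet = pd.DataFrame(
    {
        "Species": ["Species a", "Species b"],
        "matK": ["X", None],
        "rbcL": ["X", "X"],
        "trnH-psbA": [None, "X"],
        "ITS": [None, None],
    },
    index=pd.Index([1, 2], name="samples"),
)

gene_columns = ["matK", "rbcL", "trnH-psbA", "ITS"]


def sequenced_genes(sheet, gene_columns):
    """Map each sample to the list of genes marked with an X."""
    # True where the cell holds "X", False for empty cells
    flags = sheet[gene_columns].eq("X")
    return {
        sample: [gene for gene in gene_columns if flags.loc[sample, gene]]
        for sample in flags.index
    }


genes_per_sample = sequenced_genes(sheet, gene_columns)
```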

In [17]:
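# One DataFrame per sheet; the first column of each sheet is used as the index
info_db = pd.read_excel(description_path, sheet_name=0, index_col=0)
barseq_db = pd.read_excel(description_path, sheet_name=1, index_col=0)
sample_db = pd.read_excel(description_path, sheet_name=2, index_col=0)
primer_db = pd.read_excel(description_path, sheet_name=3, index_col=0)
# Convert the "X" marks to True and the empty cells to False
sample_db[sample_db == "X"] = True
sample_db[sample_db.isna()] = False
# .iloc[0] selects the first row by position, whatever the index labels are
print("Experiment:", info_db.experiment.iloc[0])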

Experiment: Sequencing of Qiagen samples without purification


#### Creating the new test database
For each sample, the corresponding fastq pass reads are extracted into one file. There is also a fasta file with the reference sequences for matK, rbcL, trnH-psbA and ITS from GenBank.

In [12]:
# Create the new database folder (no error if it already exists)
new_db = ospath.join("testdbs", name_of_db)
os.makedirs(new_db, exist_ok=True)

# Genes for which reference sequences are downloaded
gene_list = ["matK", "rbcL", "trnH-psbA", "ITS"]

# Iterate over the samples
for index, row in sample_db.iterrows():

    # New folder for each sample
    species = row["Species"].replace(" ", "_")
    sample_name = species + "_sample" + str(index)
    new_dir = ospath.join(new_db, sample_name)
    os.makedirs(new_dir, exist_ok=True)

    # Extract the fastq reads from the input expedition folder
    file_location = ospath.join(new_dir, sample_name)
    extract_fastq(main_dir, index, file_location)

    # Download the reference sequences from NCBI
    reference_seq_location = ospath.join(new_dir, species + "_reference_seq.fasta")
    for gene in gene_list:
        download_sequence(row["Species"], gene, reference_seq_location, 0, 5000)
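
To make the resulting folder layout explicit, the sketch below rebuilds the paths used in the cell above for a single sample. `sample_paths` is a hypothetical helper, and the species name and sample number are made up. The reads location is the value passed to `extract_fastq`; whether that helper appends a file extension is not shown here.

```python
import os.path as ospath


def sample_paths(new_db, species_name, sample_index):
    """Return the folder and file locations used for one sample."""
    species = species_name.replace(" ", "_")
    sample_name = f"{species}_sample{sample_index}"
    sample_dir = ospath.join(new_db, sample_name)
    return {
        "sample_dir": sample_dir,
        # Location handed to extract_fastq (extension handling is unknown)
        "reads_location": ospath.join(sample_dir, sample_name),
        "reference_fasta": ospath.join(sample_dir, f"{species}_reference_seq.fasta"),
    }


# Hypothetical species name and sample number, for illustration only
db_root = ospath.join("testdbs", "summer_expedition_experiments")
paths = sample_paths(db_root, "Solanum lycopersicum", 3)
```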



#### Saving information about the expedition as CSV files that can easily be reimported as pandas DataFrames

In [18]:
info_db.to_csv(ospath.join(new_db, "general_info.csv"))
sample_db.to_csv(ospath.join(new_db, "sample_info.csv"))
primer_db.to_csv(ospath.join(new_db, "primer_info.csv"))
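
Reimporting one of these files could look like the following sketch. It writes a small made-up table to a temporary folder and reads it back; the table contents are hypothetical. Passing `index_col=0` restores the first column as the index, matching how the sheets were read from Excel.

```python
import os.path as ospath
import tempfile

import pandas as pd

# Hypothetical stand-in for sample_db (all values are made up)
sample_info = pd.DataFrame(
    {"Species": ["Species a", "Species b"], "matK": [True, False]},
    index=pd.Index([1, 2], name="samples"),
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = ospath.join(tmp, "sample_info.csv")
    sample_info.to_csv(csv_path)
    # index_col=0 restores the sample numbers as the index
    reloaded = pd.read_csv(csv_path, index_col=0)
```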