Old Analysis of operator domain, with different approach and different bio tools

# Operator domain data analysis
- Tables 3 and 4 in the paper number of tools used in the workflows 
- (e.g., bioinformatics tools such as samtools, bcftools)

In this seperate notebook, we want to analyze the operators in the nf-core git-repos. 
The target operators are provided in two excel sheets and can be found in the folder 'data'. Also we want to consider the operators from the Nexflow documentation here: https://www.nextflow.io/docs/latest/operator.html

In [None]:
import pandas as pd

df_operators = pd.read_excel('./data/all_operator_dataset.xlsx')
#df_bioinformatic_tools = pd.read_excel('./data/bioinformatics_tools.xlsx')

pattern_list = df_operators['operator'].tolist()

print(pattern_list)

In [None]:
len(pattern_list)

In [None]:
df_operators_clustured = pd.read_csv('./data/operator_dataset_clustered.csv')

We have 6 unique domain groups: ["Bioinformatics", "Control flow", "File processing", "File system", "Code execution", "Data handling", "Misc", "Noise", "under 10"]


e.g.: Assigned Group 0 = "Bioinformatics"
if domain >= 6: "Other" domain

In [None]:
df_operators_clustured.head()

In [None]:
df_operators_clustured['Assigned Group'].unique()

In [None]:
df

# Operator domain data (all operators from csv) in processes

In [None]:
import os
import re
import csv
import pandas as pd

def find_operators_in_processes(directory, pattern_list, output_file_name):
    result_list = []

    for pipeline in os.listdir(directory):
        subdirectory_path = os.path.join(directory, pipeline)
        folder_name = os.path.basename(subdirectory_path)

        for root, dirs, files in os.walk(subdirectory_path):
            for filename in files:
                file_path = os.path.join(root, filename)
                
                if filename.endswith('.nf') or filename.endswith('.config'):
                    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                        lines = file.readlines()
                    
                    process_flag = False
                    process_name = None
                    brace_stack = []

                    process_pattern = r'process\s+(\w+)\s*{'

                    for line_num, line in enumerate(lines):
                        stripped_line = line.strip()
                        
                        if line.strip().startswith(('#', '//', '*')):
                                continue

                        if re.search(process_pattern, stripped_line):
                            process_flag = True
                            process_match = re.search(process_pattern, stripped_line)
                            if process_match:
                                process_name = process_match.group(1)
                            print(f'Process {process_name} in filepath {file_path}')
                            brace_stack.append('{')

                        if '{' in stripped_line:
                            brace_stack.append('{')

                        if '}' in stripped_line:
                            if brace_stack:
                                brace_stack.pop()
                                if not brace_stack and process_flag:
                                    process_flag = False

                        if process_flag:
                            pattern_found = False
                            for pattern in pattern_list:

                                regex_pattern = re.compile(re.escape(pattern))
                                #regex_pattern = re.compile(r'(\.\s*|\s+)' + re.escape(pattern) + r'(\s+|\.\s*)')
                                if regex_pattern.search(stripped_line):
                                    result_list.append([folder_name, process_name, pattern, file_path, line_num+1, stripped_line])
                                    print(f'Operator found {pattern} with content: {stripped_line}')
                                    pattern_found = True

                    if brace_stack:
                        print("Error: Unclosed process block in file:", file_path)

    with open(f'./results/{output_file_name}.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['pipeline', 'process_name', 'operator', 'file_path', 'line_number', 'line_content'])
        writer.writerows(result_list)

In [None]:
directory_path = './git_repos'

df_operators = pd.read_excel('./data/all_operator_dataset.xlsx')
operator_list = [str(operator) for operator in df_operators['operator']]

output_file_name = 'all_operators_in_processes'

find_operators_in_processes(directory_path, operator_list, output_file_name)

Bioinformatic tools in processes

In [None]:
df_operators = pd.read_excel('./data/bioinformatics_tools.xlsx')
pattern_list = [str(pattern) for pattern in df_operators['operator']]

output_file_name = 'all_operators_in_processes'

find_operators_in_processes(directory_path, pattern_list, output_file_name)

# Parser for operator domain data occurences in config files

In [12]:
import os
import re
import csv
import pandas as pd

def find_operators_in_config(directory, pattern_list, output_file_name):
    result_list = []

    for pipeline in os.listdir(directory):
        subdirectory_path = os.path.join(directory, pipeline)
        folder_name = os.path.basename(subdirectory_path)

        for root, dirs, files in os.walk(subdirectory_path):
            for filename in files:
                file_path = os.path.join(root, filename)
                
                if filename.endswith('.config'):
                #if filename.endswith('.nf'):
                    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                        lines = file.readlines()

                    for line_num, line in enumerate(lines):
                        stripped_line = line.strip()
                        
                        if line.strip().startswith(('#', '//', '*')):
                            continue
                        
                        for pattern in pattern_list:

                            regex_pattern = re.compile(re.escape(pattern))
                            if regex_pattern.search(stripped_line):
                                result_list.append([folder_name, pattern, file_path, line_num+1, stripped_line])
                                print(f'Operator found {pattern} with content: {stripped_line}')

    with open(f'./results/{output_file_name}.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['pipeline', 'operator', 'file_path', 'line_number', 'line_content'])
        writer.writerows(result_list)

In [13]:
'''directory_path = './git_repos'

df_operators = pd.read_excel('./data/all_operator_dataset.xlsx')
operator_list = [str(pattern) for pattern in df_operators['operator']]

output_file_name = 'all_operators_in_config'

find_operators_in_config(directory_path, operator_list, output_file_name)'''

"directory_path = './git_repos'\n\ndf_operators = pd.read_excel('./data/all_operator_dataset.xlsx')\noperator_list = [str(pattern) for pattern in df_operators['operator']]\n\noutput_file_name = 'all_operators_in_config'\n\nfind_operators_in_config(directory_path, operator_list, output_file_name)"

Bioinformatic tools

In [14]:
directory_path = './git_repos'

df_operators = pd.read_excel('./data/bioinformatics_tools.xlsx')
operator_list = [str(pattern) for pattern in df_operators['operator']]

output_file_name = 'bioinformatic_tools_in_config'

find_operators_in_config(directory_path, operator_list, output_file_name)

Operator found multiqc with content: skip_multiqc               = false
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found fastp with content: path: { "${params.outdir}/fastp/${meta.id}" },
Operator found fastp with content: path: { "${params.outdir}/fastp/${meta.id}/" },
Operator found fastp with content: pattern: "*.fastp.fastq.gz"
Operator found fastqc with content: path: { "${params.outdir}/fastqc/postassembly" },
Operator found multiqc with content: ext.args   = params.multiqc_title ? "--title \"$params.multiqc_title\"" : ''
Operator found multiqc with content: path: { "${params.outdir}/multiqc" },
Operator found cutadapt with content: cutadapt_min_overlap = 3
Ope

Operator found qiime with content: nextflow run nf-core/ampliseq -profile test_qiimecustom,<docker/singularity> --outdir <OUTDIR>
Operator found qiime with content: config_profile_description = 'Minimal test dataset to check --qiime_ref_tax_custom'
Operator found qiime with content: qiime_ref_tax_custom = "https://raw.githubusercontent.com/nf-core/test-datasets/ampliseq/testdata/85_greengenes.fna.gz,https://raw.githubusercontent.com/nf-core/test-datasets/ampliseq/testdata/85_greengenes.tax.gz"
Operator found qiime with content: skip_qiime_downstream = true
Operator found kraken2 with content: kraken2_ref_tax_custom = "https://genome-idx.s3.amazonaws.com/kraken/16S_Greengenes13.5_20200326.tgz"
Operator found kraken2 with content: kraken2_assign_taxlevels = "D,P,C,O"
Operator found qiime with content: qiime_ref_tax_custom = "https://raw.githubusercontent.com/nf-core/test-datasets/ampliseq/testdata/85_greengenes.tar.gz"
Operator found qiime with content: skip_qiime_downstream = true
Opera

Operator found samtools with content: path: { "${params.outdir}/${params.aligner}/library/samtools_stats" },
Operator found bwa with content: if (params.aligner == 'bwa') {
Operator found bwa with content: params.bwa_min_score ? " -T ${params.bwa_min_score}" : '',
Operator found bowtie2 with content: if (params.aligner == 'bowtie2') {
Operator found picard with content: path: { "${params.outdir}/${params.aligner}/merged_library/picard_metrics" },
Operator found samtools with content: path: { "${params.outdir}/${params.aligner}/merged_library/samtools_stats" },
Operator found samtools with content: path: { "${params.outdir}/${params.aligner}/merged_library/samtools_stats" },
Operator found samtools with content: path: { "${params.outdir}/${params.aligner}/merged_library/samtools_stats" },
Operator found picard with content: if (!params.skip_picard_metrics) {
Operator found picard with content: path: { "${params.outdir}/${params.aligner}/merged_library/picard_metrics" },
Operator found p

Operator found samtools with content: samtools_collate_fast = true
Operator found fastqc with content: skip_initial_fastqc = false
Operator found fastqc with content: skip_trimming_fastqc = false
Operator found samtools with content: skip_samtools_stats = false
Operator found multiqc with content: multiqc_config = false
Operator found multiqc with content: max_multiqc_email_size = 25.MB
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh3

Operator found bwa with content: bwa         = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Oryz

Operator found bismark with content: bismark     = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "

Operator found bwa with content: bwa         = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: b

Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/Bowtie2Index/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/Bowtie2Index/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/Bowtie2Index/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/Bowtie2Index/"
Operator found bwa with content: bwa         = "${param

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequen

Operator found gatk with content: ploidy_priors     = "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/gatk/contig_ploidy_priors_table.tsv"
Operator found multiqc with content: ext.args   = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
Operator found multiqc with content: path: { "${params.outdir}/multiqc" },
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Ind

Operator found bismark with content: bismark     = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params

Operator found fastqc with content: skip_fastqc                = false
Operator found bowtie2 with content: aligner                    = "bowtie2"
Operator found macs2 with content: macs2_pvalue               = null
Operator found macs2 with content: macs2_qvalue               = 0.01
Operator found macs2 with content: macs2_narrow_peak          = true
Operator found macs2 with content: macs2_broad_cutoff         = 0.1
Operator found multiqc with content: skip_multiqc               = false
Operator found multiqc with content: multiqc_config                    = null
Operator found multiqc with content: multiqc_title                     = null
Operator found multiqc with content: multiqc_logo                      = null
Operator found multiqc with content: multiqc_methods_description       = null
Operator found multiqc with content: max_multiqc_email_size            = "25.MB"
Operator found fastqc with content: validationSchemaIgnoreParams     = 'igenomes_base,genomes,callers,dedup_contr

Operator found picard with content: ext.suffix = "_meta_picard_dups"
Operator found macs2 with content: if(params.run_peak_calling && 'macs2' in params.callers) {
Operator found macs2 with content: params.macs2_pvalue ? "-p ${params.macs2_pvalue}" : "-q ${params.macs2_qvalue}",
Operator found macs2 with content: params.macs2_narrow_peak ? '' : "--broad --broad-cutoff ${params.macs2_broad_cutoff}"
Operator found macs2 with content: ext.prefix = { "${meta.id}.macs2" }
Operator found macs2 with content: path: { "${params.outdir}/03_peak_calling/04_called_peaks/macs2" },
Operator found macs2 with content: if(params.run_peak_calling && 'macs2' in params.callers) {
Operator found macs2 with content: ext.suffix  = ".macs2.peaks"
Operator found macs2 with content: path: { "${params.outdir}/03_peak_calling/04_called_peaks/macs2" },
Operator found multiqc with content: if (params.run_multiqc) {
Operator found multiqc with content: ext.args   = params.multiqc_title ? "-v --title \"$params.multiqc

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAIndex/"
Operator found bwa with content: bwamem2     = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAmem2Ind

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAIndex/"
Operator found bwa with content: bwamem2     = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAmem2Ind

Operator found salmon with content: run_salmon_selective_alignment = true
Operator found salmon with content: run_salmon_selective_alignment = true
Operator found bwa with content: bwa_index = null
Operator found fastqc with content: skip_fastqc = false
Operator found bwa with content: mapper = 'bwaaln'
Operator found bwa with content: bwaalnn = 0.01 // From Oliva et al. 2021 (10.1093/bib/bbab076)
Operator found bwa with content: bwaalnk = 2
Operator found bwa with content: bwaalnl = 1024 // From Oliva et al. 2021 (10.1093/bib/bbab076)
Operator found bwa with content: bwaalno = 2 // From Oliva et al. 2021 (10.1093/bib/bbab076)
Operator found bowtie2 with content: bt2n = 0 // Do not set Cahill 2018 recommendation of 1 here, so not to 'hide' overriding bowtie2 presets
Operator found bedtools with content: run_bedtools_coverage = false
Operator found gatk with content: gatk_call_conf = 30
Operator found gatk with content: gatk_ploidy = 2
Operator found gatk with content: gatk_downsample =

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/G

Operator found multiqc with content: multiqc_mappings_config             = "${params.test_data_base}/modules/local/multiqc_mappings_config/SRX9626017_SRR13191702.mappings.csv"
Operator found multiqc with content: includeConfig "../../modules/local/multiqc_mappings_config/nextflow.config"
Operator found abricate with content: arg_skip_abricate                       = false
Operator found abricate with content: arg_abricate_db                         = 'ncbi'
Operator found abricate with content: arg_abricate_minid                      = 80
Operator found abricate with content: arg_abricate_mincov                     = 80
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator foun

Operator found abricate with content: path: { "${params.outdir}/arg/hamronization/abricate" },
Operator found abricate with content: ext.prefix = { "${report}.abricate" }
Operator found abricate with content: arg_skip_abricate      = true
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 wit

Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/G

Operator found bwa with content: aligner                    = "bwa-mem"
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_runkraken          = true
Operator found multiqc with content: multiqc_methods_description = null
Operator found multiqc with content: params.multiqc_runkraken = false
Operator found multiqc with content: params.multiqc_runkraken = false
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/

Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAIndex/"
Operator found bwa with content: bwamem2     = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAmem2Index/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/Bowtie2Index/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/Bowtie2Index/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/Bo

Operator found fastqc with content: path: { "${params.outdir}/fastqc" },
Operator found multiqc with content: ext.args   = { params.multiqc_config ? "--config $multiqc_custom_config" : '' }
Operator found multiqc with content: path: { "${params.outdir}/multiqc" },
Operator found bwa with content: path: { "${params.outdir}/genome/bwa_index" },
Operator found cutadapt with content: if(!params.skip_cutadapt){
Operator found cutadapt with content: ext.args    = { "-Z -e 0.15 --no-indels --action none --discard-untrimmed -g ${params.cutadapt_5end}" }
Operator found samtools with content: ext.args2   = '-bhS' // for samtools view
Operator found bwa with content: path: { "${params.outdir}/bwa/mapped/bam" },
Operator found bwa with content: path: { "${params.outdir}/bwa/mapped/bam" },
Operator found bwa with content: path: { "${params.outdir}/bwa/mapped/QC" },
Operator found bwa with content: path: { "${params.outdir}/bwa/mapped/QC" },
Operator found bwa with content: path: { "${params.outdir}

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAIndex/"
Operator found bwa with content: bwamem2     = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAmem2Ind

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequen

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAIndex/"
Operator found bwa with content: bwamem2     = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAmem2Ind

Operator found gtdbtk with content: skip_gtdbtk                 = true
Operator found gtdbtk with content: skip_gtdbtk                 = true
Operator found kraken2 with content: kraken2_db                    = null
Operator found megahit with content: skip_megahit                  = true
Operator found gtdbtk with content: skip_gtdbtk                   = true
Operator found gtdbtk with content: skip_gtdbtk                 = true
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bowtie2 with content: bowtie2     = "${params.genomes_base}/mm10_v32/bowtie2/"
Operator found bowtie2 with content: bowtie2     = "${params.genomes_base}/mm10/bowtie2/"
Operator found bowtie2 

Operator found bwa with content: bwa         = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BWAIndex/version0.6.0/"
Operator found bowt

Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bismark with content: aligner                    = 'bismark'
Operator found multiqc with content: skip_multiqc               = false
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequenc

Operator found multiqc with content: skip_multiqc                    = false
Operator found multiqc with content: multiqc_config                  = null
Operator found multiqc with content: multiqc_title                   = null
Operator found multiqc with content: multiqc_logo                    = null
Operator found multiqc with content: max_multiqc_email_size          = '25.MB'
Operator found multiqc with content: multiqc_methods_description     = null
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie

Operator found fastqc with content: skip_fastqc = false
Operator found picard with content: skip_picard_metrics = false
Operator found multiqc with content: skip_multiqc = false
Operator found multiqc with content: multiqc_config = false
Operator found multiqc with content: max_multiqc_email_size = 25.MB
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${para

Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found multiqc with content: skip_multiqc               = false
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     =

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/G

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAIndex/"
Operator found bwa with content: bwamem2     = "${params.igenomes_base}/Homo_sapiens/UCSC/CHM13/Sequence/BWAmem2Ind

Operator found bwa with content: bwa         = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/"
O

Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequen

Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found seqkit with content: assembly_fasta_gz       = "${params.test_data_base}/modules/local/seqkit/seq/assembly.fasta.gz"
Operator found fastqc with content: test_1_fastq_gz_fastqc_zip = "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/fastqc/test_fastqc.zip"
Operator found kraken2 with content: kraken2_tar_gz          = "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/db/kraken2.tar.gz"
Operator found seqkit with content: path: { "${params.outdir}/AssemblyFilter/seqkit/seq" },
Operator found bowtie2 with content: path: { "${params

Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found multiqc with content: ext.args   = params.multiqc_title ? "--title \"$params.multiqc_title\"" : ''
Operator found multiqc with content: path:   { "${params.outdir}/multiqc" },
Operator found bwa with content: aligner                    = 'bwamem2'
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bwa 

Operator found bwa with content: bwa         = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Se

Operator found gatk with content: test_baserecalibrator_table                    = "${params.test_data_base}/data/genomics/sarscov2/illumina/gatk/test.baserecalibrator.table"
Operator found picard with content: test_single_end_bam_readlist_txt               = "${params.test_data_base}/data/genomics/sarscov2/illumina/picard/test.single_end.bam.readlist.txt"
Operator found kraken2 with content: classified_reads_assignment                    = "${params.test_data_base}/data/genomics/sarscov2/metagenome/test_1.kraken2.reads.txt"
Operator found kraken2 with content: kraken_report                                  = "${params.test_data_base}/data/genomics/sarscov2/metagenome/test_1.kraken2.report.txt"
Operator found salmon with content: rnaseq_matrix                                  = "${params.test_data_base}/data/genomics/mus_musculus/rnaseq_expression/SRP254919.salmon.merged.gene_counts.top1000cov.tsv"
Operator found salmon with content: deseq_results                                  = "${

Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BismarkIndex/"
Operator found bwa with content: aligner                         = 'bwa-mem'    // Only STAR is currently supported.
Operator found hisat2 with content: hisat2_build_memory             = null
Operator found hisat2 with content: hisat2_index                    = null
Operator found gatk with content: gatk_interval_scatter_count     = 25
Operator 

Operator found bwa with content: bwa         = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Drosophi

Operator found stringtie with content: path: { "${params.outdir}/stringtie/${meta.id}" },
Operator found fastp with content: extra_fastp_args           = null
Operator found salmon with content: aligner                    = 'star_salmon'
Operator found salmon with content: salmon_quant_libtype       = null
Operator found hisat2 with content: hisat2_build_memory        = '200.GB'  // Amount of memory required to build HISAT2 index with splice sites
Operator found stringtie with content: stringtie_ignore_gtf       = false
Operator found salmon with content: extra_salmon_quant_args    = null
Operator found kallisto with content: extra_kallisto_quant_args  = null
Operator found kallisto with content: kallisto_quant_fraglen     = 200
Operator found kallisto with content: kallisto_quant_fraglen_sd  = 200
Operator found stringtie with content: skip_stringtie             = false
Operator found fastqc with content: skip_fastqc                = false
Operator found multiqc with content: skip_mul

Operator found samtools with content: path: { "${params.outdir}/${params.aligner}/samtools_stats" },
Operator found salmon with content: path: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found hisat2 with content: path: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found salmon with content: saveAs: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found hisat2 with content: saveAs: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found salmon with content: path: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found hisat2 with content: path: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found salmon with content: saveAs: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found hisat2 with content: saveAs: { ( ['star_salmon','hisat2'].contains(params.aligner) &&
Operator found picard with content: path: { "${params.outdir}/${params.aligner}/picard_metrics" },


Operator found salmon with content: input     = 'https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet_salmon.csv'
Operator found salmon with content: source    = 'salmon_results'
Operator found salmon with content: aligner = 'star_salmon'
Operator found gatk with content: gatk_interval_scatter_count     = 25
Operator found gatk with content: gatk_hc_call_conf               = 20
Operator found gatk with content: gatk_vf_window_size             = 35
Operator found gatk with content: gatk_vf_cluster_size            = 3
Operator found gatk with content: gatk_vf_fs_filter               = 30.0
Operator found gatk with content: gatk_vf_qd_filter               = 2.0
Operator found multiqc with content: skip_multiqc                    = false
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Ope

Operator found bwa with content: bwa                  = null
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.

Operator found checkm with content: includeConfig 'conf/modules/ngscheckmate.config'
Operator found bwa with content: bwa                   = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh37/Sequence/BWAIndex/"
Operator found checkm with content: ngscheckmate_bed      = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh37/Annotation/NGSCheckMate/SNP_GRCh37_hg19_wChr.bed"
Operator found bwa with content: bwa                     = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Sequence/BWAIndex/"
Operator found bwa with content: bwamem2                 = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Sequence/BWAmem2Index/"
Operator found gatk with content: known_indels_vqsr       = '--resource:gatk,known=false,training=true,truth=true,prior=10.0 Homo_sapiens_assembly38.known_indels.vcf.gz --resource:mills,known=false,training=true,truth=true,prior=10.0 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
Operator found checkm with content: ngscheckmate_bed        = "${params.igenomes_b

Operator found samtools with content: ext.when   = { !(params.skip_tools && params.skip_tools.split(',').contains('samtools')) }
Operator found samtools with content: path: { "${params.outdir}/reports/samtools/${meta.id}" },
Operator found fastqc with content: ext.when   = { !(params.skip_tools && params.skip_tools.split(',').contains('fastqc')) }
Operator found fastqc with content: path: { "${params.outdir}/reports/fastqc/${meta.id}" },
Operator found multiqc with content: ext.args   = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
Operator found multiqc with content: path: { "${params.outdir}/multiqc" },
Operator found samtools with content: ext.when   = { !(params.skip_tools && params.skip_tools.split(',').contains('samtools')) }
Operator found samtools with content: path: { "${params.outdir}/reports/samtools/${meta.id}" },
Operator found samtools with content: ext.when   = { !(params.skip_tools && params.skip_tools.split(',').contains('samtools')) }
Operator fo

Operator found bwa with content: bwa         = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${

Operator found gffread with content: ext.prefix = { "${gff.baseName}_gffread" }
Operator found kallisto with content: if (params.aligner == 'kallisto') {
Operator found multiqc with content: multiqc_config = false
Operator found multiqc with content: max_multiqc_email_size = 25.MB
Operator found multiqc with content: withName: multiqc {
Operator found fastqc with content: skip_fastqc = false
Operator found multiqc with content: multiqc_config = false
Operator found multiqc with content: max_multiqc_email_size = 25.MB
Operator found multiqc with content: withName: 'multiqc' {
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: 

Operator found fastp with content: fastp_min_length            = 17
Operator found fastp with content: fastp_known_mirna_adapters  = "$projectDir/assets/known_adapters.fa"
Operator found fastqc with content: skip_fastqc                 = false
Operator found multiqc with content: skip_multiqc                = false
Operator found fastp with content: skip_fastp                  = false
Operator found fastp with content: fastp_max_length            = 40
Operator found multiqc with content: multiqc_config              = null
Operator found multiqc with content: multiqc_title               = null
Operator found multiqc with content: multiqc_logo                = null
Operator found multiqc with content: max_multiqc_email_size      = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2    

Operator found fastqc with content: path: { "${params.outdir}/${meta.id}/fastqc" },
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found multiqc with content: multiqc_config             = "${projectDir}/conf/multiqc_ssds_config.yaml"
Operator found multiqc with content: multiqc_title              = 'SSDS'
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bism

Operator found bwa with content: bwa         = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BWAIndex/genome.fa"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BismarkIndex/"
Operator found fastqc with content: 'fastqc' {
Operator found multiqc with content: 'multiqc' {
Operator found picard with content: 'picard/sortsam_parsed' {
Operator found picard with content: 'picard/sortsam_co

Operator found bwa with content: bwa         = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Drosophi

Operator found kraken2 with content: run_kraken2                           = false
Operator found kraken2 with content: run_kraken2                           = false
Operator found kraken2 with content: run_kraken2                           = true
Operator found kraken2 with content: run_kraken2                           = false
Operator found kraken2 with content: run_kraken2                           = false
Operator found kraken2 with content: run_kraken2                           = true
Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator fou

Operator found multiqc with content: multiqc_config             = null
Operator found multiqc with content: multiqc_title              = null
Operator found multiqc with content: multiqc_logo               = null
Operator found multiqc with content: max_multiqc_email_size     = '25.MB'
Operator found multiqc with content: multiqc_methods_description = null
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
Operator found bismark with content: bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
Operator found bwa with content: bwa         = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/"
Operator found bowtie2 with content: bowtie2     = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index

Operator found bcftools with content: if (!variant_caller) { variant_caller = params.protocol == 'amplicon' ? 'ivar' : 'bcftools' }
Operator found fastqc with content: if (!params.skip_fastqc) {
Operator found fastqc with content: path: { "${params.outdir}/fastqc/raw" },
Operator found fastp with content: if (!params.skip_fastp) {
Operator found fastp with content: path: { "${params.outdir}/fastp" },
Operator found fastp with content: path: { "${params.outdir}/fastp/log" },
Operator found fastp with content: path: { "${params.outdir}/fastp" },
Operator found fastqc with content: if (!params.skip_fastqc) {
Operator found fastqc with content: path: { "${params.outdir}/fastqc/trim" },
Operator found kraken2 with content: if (!params.skip_kraken2) {
Operator found kraken2 with content: path: { "${params.outdir}/kraken2" },
Operator found bowtie2 with content: path: { "${params.outdir}/variants/bowtie2/log" },
Operator found bowtie2 with content: path: { "${params.outdir}/variants/bowtie2/u

# Bioinformatic tools in configuration files

In [15]:
import pandas as pd
import numpy as np

df_bioinformatics = pd.read_csv('./results/bioinformatic_tools_in_config.csv')
df_bioinformatics

Unnamed: 0,pipeline,operator,file_path,line_number,line_content
0,airrflow,multiqc,./git_repos\airrflow\nextflow.config,107,skip_multiqc = false
1,airrflow,multiqc,./git_repos\airrflow\nextflow.config,108,multiqc_config = null
2,airrflow,multiqc,./git_repos\airrflow\nextflow.config,109,multiqc_title = null
3,airrflow,multiqc,./git_repos\airrflow\nextflow.config,110,multiqc_logo = null
4,airrflow,multiqc,./git_repos\airrflow\nextflow.config,111,max_multiqc_email_size = '25.MB'
...,...,...,...,...,...
8299,viralrecon,multiqc,./git_repos\viralrecon\conf\modules_nanopore.c...,345,"path: { ""${params.outdir}/multiqc/${params.art..."
8300,viralrecon,kraken2,./git_repos\viralrecon\conf\test.config,31,kraken2_db = 'https://raw.githubusercontent.co...
8301,viralrecon,bcftools,./git_repos\viralrecon\conf\test_full_sispa.co...,26,variant_caller = 'bcftools'
8302,viralrecon,kraken2,./git_repos\viralrecon\conf\test_sispa.config,29,kraken2_db = 'https://raw.githubusercontent.co...


In [16]:
pipelines_total = df_bioinformatics.groupby('pipeline').size().reset_index(name='bioinformatic_occur')
config_count = df_bioinformatics.groupby('file_path').size().reset_index(name='bioinformatic_occur')

print(f'Matched in {len(pipelines_total)} pipelines and {len(config_count)} config files')

Matched in 88 pipelines and 332 config files


In [17]:
df_bioinformatics

Unnamed: 0,pipeline,operator,file_path,line_number,line_content
0,airrflow,multiqc,./git_repos\airrflow\nextflow.config,107,skip_multiqc = false
1,airrflow,multiqc,./git_repos\airrflow\nextflow.config,108,multiqc_config = null
2,airrflow,multiqc,./git_repos\airrflow\nextflow.config,109,multiqc_title = null
3,airrflow,multiqc,./git_repos\airrflow\nextflow.config,110,multiqc_logo = null
4,airrflow,multiqc,./git_repos\airrflow\nextflow.config,111,max_multiqc_email_size = '25.MB'
...,...,...,...,...,...
8299,viralrecon,multiqc,./git_repos\viralrecon\conf\modules_nanopore.c...,345,"path: { ""${params.outdir}/multiqc/${params.art..."
8300,viralrecon,kraken2,./git_repos\viralrecon\conf\test.config,31,kraken2_db = 'https://raw.githubusercontent.co...
8301,viralrecon,bcftools,./git_repos\viralrecon\conf\test_full_sispa.co...,26,variant_caller = 'bcftools'
8302,viralrecon,kraken2,./git_repos\viralrecon\conf\test_sispa.config,29,kraken2_db = 'https://raw.githubusercontent.co...


In [18]:
bio_df = df_bioinformatics.groupby('operator').agg({'pipeline': 'nunique', 'file_path': 'nunique'}).reset_index().sort_values(by='pipeline', ascending=False)
#grouped_df = merged_df.groupby('Assigned Group').agg({'config_file': 'nunique', 'operator_in_file': 'nunique'}).reset_index()

In [19]:
bio_df.head()

Unnamed: 0,operator,pipeline,file_path
23,multiqc,83,140
5,bowtie2,63,84
6,bwa,63,106
4,bismark,57,59
11,fastqc,40,62


# Operator domain data analysis

- we get all the lines with occuring operators, filter by length of operator due to noisy data!

In [1]:
import pandas as pd

df = pd.read_csv('./results/all_operators_in_config.csv')
df_operator_clustered = pd.read_csv('./data/operator_dataset_clustered.csv')

In [2]:
df = pd.read_csv('./results/all_operators_in_config.csv')

Here we filter by the length of the operator, due to noise in data results similar approach to GitLab Snakemake (https://gitlab.informatik.hu-berlin.de/pohlseba/github_scraping/-/blob/main/analysis/operators/operators.py, line 222)

In [3]:
df = df[df['operator'].str.len() > 3]

By applying that, we can make sure that every operator occures just once in a pipline and we dont count multiple occurences

In [4]:
# we get each operator occurence once per file from 76137 to 12132  entries

df = df.groupby(['pipeline', 'operator', 'file_path']).size().reset_index(name='operator_in_file')

In [5]:
df = df.rename(columns={'operator': 'Operator', 'file_path': 'config_file'})

In [6]:
merged_df = pd.merge(df, df_operator_clustered, on='Operator')

In [7]:
# 1022 configuration files
merged_df['config_file'].nunique()

1022

Group by Assigned Group

In [8]:
# n = number of operators to be shown
def most_common_operators(x, n=4):
    return list(x.value_counts().head(n).index)

In [15]:
agg_bio = merged_df.groupby('Assigned Group').agg({'pipeline': 'nunique', 'config_file': 'nunique', 'operator_in_file': 'count', 'Operator': lambda x: most_common_operators(x)}).reset_index()
agg_bio

Unnamed: 0,Assigned Group,pipeline,config_file,operator_in_file,Operator
0,0,87,406,680,"[fasta, fastq, fastqc, fastq.gz]"
1,1,94,446,580,"[docker, else, centrifuge, classifier]"
2,2,94,838,1113,"[params, print, echo, paired]"
3,3,13,22,22,"[zcat, coverm, cpan, sourmash]"
4,4,94,715,2060,"[script, singularity, report, conda]"
5,5,7,11,20,"[gzip, bgzip, gunzip, unzip]"
6,6,94,1013,7676,"[file, test, time, input]"


In [16]:
plot = merged_df.groupby('Assigned Group', 'config_file').agg({'operator_in_file': 'nunique'}).reset_index()
plot

ValueError: No axis named config_file for object type DataFrame

- Assigned Group 0:    bioinformatics tool
- Assigned Group 1:    control flow
- Assigned Group 2:    file processing
- Assigned Group 3:    file system
- Assigned Group 4:    code execution
- Assigned Group 5:    data handling
- Assigned Group 6:    other

# Analysis of operator domain data in processes

In [71]:
import pandas as pd

df = pd.read_csv('./results/all_operators_in_processes.csv')

df_operator_clustered = pd.read_csv('./data/operator_dataset_clustered.csv')

In [72]:
df = df[df['operator'].str.len() > 3]

In [73]:
df = df.rename(columns={'operator': 'Operator'})

In [74]:
merged_df = pd.merge(df, df_operator_clustered, on='Operator')

In [75]:
grouped_df = merged_df.groupby('Assigned Group').agg({'pipeline': 'nunique', 'process_name': 'nunique', 'Operator': 'count', 'file_path': 'nunique'}).reset_index()

Add ratio = # occuring processes /# total rules

In [76]:
sum_processes = merged_df['process_name'].nunique()

In [77]:
sum_processes

1982

In [78]:
grouped_df['ratio'] = grouped_df['process_name']/sum_processes

In [79]:
grouped_df

Unnamed: 0,Assigned Group,pipeline,process_name,Operator,file_path,ratio
0,0,88,863,9315,1269,0.435419
1,1,93,1588,4045,2463,0.801211
2,2,94,1622,8498,1943,0.818365
3,3,36,35,120,42,0.017659
4,4,94,1979,18899,2573,0.998486
5,5,70,187,1120,283,0.094349
6,6,94,1982,33788,2576,1.0
