# Example Module

An example of a module section on the [xqtl-protocol website](https://statfungen.github.io/xqtl-protocol/README.html) that could be used in the paper.  
Parts marked with superscript "hint_[name]" would be referenced in the paper to mark places to take text from.

Everything below would replace or augment what is in the module page on the website.   
For example, everything below would replace or augment what is on the [Quantifying expression from RNA-seq data](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/calling/RNA_calling.html) page under the `RNA-seq expression` miniprotocol section on the website.

# Title (ex: Quantifying expression from RNA-seq data)
This would be the name of a part under a subsection of the Procedures section in the paper.

## Description
This would go in the Experimental Design part of the Introduction section of the paper.

This is a longer description of what the module does and the tools involved.


## Input
Input for this module. If there are multiple steps to this module (like with `Quantifying expression from RNA-seq data`), then place this section under each step in the Command Interface section below instead.

## Output
Output for this module. If there are multiple steps to this module (like with `Quantifying expression from RNA-seq data`), then place this section under each step in the Command Interface section below instead.

## Minimal Working Example Steps
These are the steps and code cells to be displayed on both the website and in the paper.

Include CRITICAL STEP parts where something important needs to be mentioned.

Each step should have a Timing part noting how long it takes to run.

If troubleshooting may be involved, add a TROUBLESHOOTING note telling readers to consult the Troubleshooting table elsewhere in this notebook.

> #### CRITICAL STEP
>
> something critical you may need to do

### i. step 1

Timing <X hours

brief description.

In [None]:
sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output_test \
    --samples test_data/test_data.fastqlist \
    --data-dir test_data/ \
    --container containers/rna_quantification.sif 

### ii. step 2

Timing <X hours

brief description.

> TROUBLESHOOTING

In [None]:
sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output_test \
    --samples test_data/test_data.fastqlist \
    --data-dir test_data/ \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf.ref.flat -s build 

## Troubleshooting

| Step | Substep | Problem | Possible Reason | Solution |
|------|---------|---------|------------------|---------|
| Molecular Phenotype Quantification | step 2 ii) | couldn't find file | you used the wrong file | use the right file |
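
For a "couldn't find file" problem like the placeholder row above, one quick check is to confirm that every path passed to the command exists before launching it. The following is a minimal sketch, not part of the pipeline: the helper `missing_paths` is hypothetical, and the paths are copied from the example commands above, assuming they are resolved relative to the directory where `sos run` is invoked.

```python
import os


def missing_paths(paths):
    """Return the subset of `paths` that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not os.path.exists(p)]


# Paths copied from the example commands; adjust them to your own data.
step_inputs = [
    "test_data/test_data.fastqlist",
    "test_data/",
    "containers/rna_quantification.sif",
]

if __name__ == "__main__":
    for path in missing_paths(step_inputs):
        print(f"missing: {path}")
```

An empty result means every listed path was found; anything printed is a candidate cause for the error.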




## Command interface
All the code from the section of the same name in the current notebooks, i.e. the actual code to run.

## Setup and global parameters

This varies by module, but looks something like this:

In [None]:

[global]
# Covariate file
parameter: covariate_file = path
# A genotype file in PLINK binary format (bed/bim/fam), per chromosome
parameter: genotype_file = path
# An optional subset of regions of molecular features to analyze. The last column holds the gene names
parameter: region_list = path()
# An optional list documenting the custom cis window for each region to analyze, with four columns: chr, start, end, region ID (e.g. gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file: a list of phenotypes per region.
parameter: phenotype_file = path

# Prefix for the analysis output
parameter: name = f"{phenotype_file:bn}_{covariate_file:bn}"
# Minor allele count cutoff
parameter: MAC = 0
# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: region_name = "gene_id"
# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path() 
parameter: region_list_phenotype_column = 4


# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be zero if `customized_cis_windows` is provided.
parameter: window = 1000000
# Optional file listing the samples to keep
parameter: keep_sample = path()

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1

# Minor allele frequency cutoff. If set, it overrides the minor allele count cutoff (MAC).
parameter: maf_threshold = MAC/(2.0*N)
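
The derived values at the end of the cell above can be checked by hand. The sketch below is illustrative only: it reproduces the `entrypoint`, `N`, and `maf_threshold` expressions in plain Python, swapping in the standard `csv` module for pandas to read the header. The function names and the three-sample covariate header are made up for this example, and it assumes the first header column is a non-sample ID column and that samples are diploid.

```python
import csv
import io
import re


def derive_entrypoint(container):
    """Mirror the `entrypoint` expression: strip the container suffix to get the environment name."""
    if not container:
        return ""
    env = re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])
    return 'micromamba run -a "" -n ' + env


def sample_size_from_header(handle):
    """Count samples as the number of tab-separated header columns minus one."""
    header = next(csv.reader(handle, delimiter="\t"))
    return len(header) - 1


def default_maf_threshold(mac, n):
    """Convert a minor allele count cutoff to a frequency over 2*n alleles."""
    return mac / (2.0 * n)


# Toy covariate table (made up): one ID column followed by three samples.
toy_covariates = io.StringIO("#id\tsample1\tsample2\tsample3\nage\t60\t71\t55\n")
n = sample_size_from_header(toy_covariates)

print(derive_entrypoint("containers/rna_quantification.sif"))
print(n, default_maf_threshold(3, n))
```

With the default `MAC = 0`, the expression evaluates to 0 regardless of `N`.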




Cells after this should contain the actual code used by the pipeline.