# TRE MPRAnalyze Pipeline (TMP)

## Structure

- barcode_map_data/
    - finalBarcodeMap.csv  
- multi_comparative_results/
- pairwise_comparative_results/
- runs/
    - run1/
    - run2/
    - ... 
    - etc.
- scripts/
- TMP_comparative.py
- TMP_empirical.py
- cluster.yaml


## General Workflow

The general workflow of TMP is as follows. Run TMP_empirical.py once for each of your runs. Each time you run TMP_empirical.py, a new directory is created in the runs/ directory.

Each directory created by TMP_empirical.py will contain the following.

    - empirical_results.csv
    - joined_files/ 
        - a directory containing the joined fastq files
        - may be empty if only single-end reads are provided
        - un_files/ 
            - a directory containing the un-paired reads
        - merge_stats.csv
            - describes the number of reads that were joined vs not paired
    - mpra_input/
        - a directory containing input for MPRAnalyze
    - raw_counts/
        - a directory containing the counts of each barcode from the fastq files provided
    - rna_dna_samples/
        - a directory containing the filtered RNA and DNA samples
    - run_descriptive_stats/
        - a directory containing the information about the run
    - star_code/ 
        - a directory containing starcode files
    - trimmed_files/
        - a directory containing trimmed fastq files. 
        - only fastq files that will be joined will be trimmed
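
As a quick sanity check after a run, the layout above can be verified with a few lines of Python. This is a minimal sketch and not part of TMP: the helper name `missing_run_outputs` is hypothetical, and it only checks the top-level entries listed above.

```python
from pathlib import Path

# Top-level entries listed above, relative to runs/<run_name>/.
EXPECTED_ENTRIES = [
    "empirical_results.csv",
    "joined_files",
    "mpra_input",
    "raw_counts",
    "rna_dna_samples",
    "run_descriptive_stats",
    "star_code",
    "trimmed_files",
]


def missing_run_outputs(run_dir):
    """Return the expected entries that are absent from run_dir (hypothetical helper)."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_ENTRIES if not (run_dir / name).exists()]


# "19919" is the example run name used elsewhere in this README.
print(missing_run_outputs("runs/19919"))
```

An empty list means every listed entry is present.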
        
After running TMP_empirical.py for each of your runs, run TMP_comparative.py to create and run comparisons. Comparisons can be pairwise, i.e. comparing a baseline against a stimulated treatment, or multivariate, i.e. comparing multiple treatment types against each other. TMP_comparative.py can be run with input TSV files (see Flags TMP_comparative.py) to specify the treatments to compare, or without flags, in which case you are prompted to enter the treatments manually. The results from TMP_comparative.py are written to the multi_comparative_results/ and pairwise_comparative_results/ directories.

## Flags TMP_empirical.py

### -r --run-name
    - This flag is used to specify the name of the run
    - The string you enter here is used as the name of the run's directory under runs/
    - The name may not include "_" or "/"
    - EXAMPLE "-r 19919"
    
### -p --path_to_fastq
    - This flag is used to specify the path to the directory containing the fastq files for your run
    - The path may be absolute or relative
    - The fastq files inside the directory may be ".fq", ".fastq", ".fq.gz", or ".fastq.gz" files
    - EXAMPLE "-p ../sequence_data/19919_fq_files"

### -t --path_to_treatment_tsv
    - Specify the path to the treatment TSV
    - Treatment TSV is used to pair sample numbers with treatment names
    - The first column should correspond to sample numbers
    - The second column should correspond to treatment names
    - At least one DNA sample must be present for the pipeline to work. All DNA samples should be labeled exactly "DNA"
    - Replicates of the same treatment must share exactly the same name
        - For example, if you labeled three replicates as
        - Serum Free 1 
        - Serum Free 2 
        - Serum Free 3
        - then Serum Free 1, 2, and 3 would each be treated as a separate treatment type
    - Do not include a header for your tsv file
    - See example of TSV below
    
```text
1   Serum Free
2   Serum Free
3   Serum Free
4   ATP
5   Forskolin
6   DNA
7   DNA
8   DNA
```
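
The rules above can be checked before starting a run. The following is a minimal sketch, not TMP's own parser: the function name `read_treatment_tsv` is hypothetical, and it assumes the two columns are separated by a single tab.

```python
import csv
import io


def read_treatment_tsv(handle):
    """Map sample number -> treatment name from a header-less, two-column TSV."""
    treatments = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {row!r}")
        treatments[row[0].strip()] = row[1].strip()
    # The pipeline needs at least one sample labeled exactly "DNA".
    if "DNA" not in treatments.values():
        raise ValueError('at least one sample must be labeled "DNA"')
    return treatments


# In-memory stand-in for a treatment TSV file.
example = io.StringIO("1\tSerum Free\n2\tSerum Free\n4\tATP\n6\tDNA\n")
treatments = read_treatment_tsv(example)
print(treatments)
```

Listing the distinct values of the returned dictionary is an easy way to spot replicates that were accidentally given different names.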

### -dt --path_to_dna_tsv
    - Specify the path to the DNA TSV
    - DNA TSV is only needed if more than one DNA sample is present in treatment TSV
    - The DNA TSV is used to specify which DNA samples correspond to which RNA samples
    - The first column corresponds to DNA sample numbers, the second column corresponds to RNA sample numbers
    - Each RNA sample must correspond to 1 and only 1 DNA sample
    - If you had a treatment TSV like the one above (see -t), you would create a DNA TSV like the one below
    
```text
6   1
6   2
6   3
7   4
8   5
```
    - This means the DNA from sample 6 corresponds to the Serum Free treatments, the DNA from sample 7 corresponds to the ATP treatment, and the DNA from sample 8 corresponds to the Forskolin treatment.
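
The pairing rule (each RNA sample maps to exactly one DNA sample) can be checked with a short script. This is a sketch under assumptions: `read_dna_tsv` is a hypothetical helper, the file is assumed to be tab-separated with no header, and the `treatments` dictionary mirrors the treatment TSV example from the -t section.

```python
import csv
import io


def read_dna_tsv(handle, treatments):
    """Map RNA sample number -> DNA sample number, enforcing the pairing rules."""
    rna_to_dna = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        dna, rna = row[0].strip(), row[1].strip()
        if treatments.get(dna) != "DNA":
            raise ValueError(f"sample {dna} is not labeled DNA in the treatment TSV")
        if rna in rna_to_dna:
            raise ValueError(f"RNA sample {rna} is listed more than once")
        rna_to_dna[rna] = dna
    # Every non-DNA sample needs a DNA partner.
    unpaired = [s for s, t in treatments.items() if t != "DNA" and s not in rna_to_dna]
    if unpaired:
        raise ValueError(f"RNA samples without a DNA sample: {unpaired}")
    return rna_to_dna


# Sample numbers and treatments from the examples in this README.
treatments = {"1": "Serum Free", "2": "Serum Free", "3": "Serum Free",
              "4": "ATP", "5": "Forskolin", "6": "DNA", "7": "DNA", "8": "DNA"}
dna_tsv = io.StringIO("6\t1\n6\t2\n6\t3\n7\t4\n8\t5\n")
pairs = read_dna_tsv(dna_tsv, treatments)
print(pairs)
```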
    
### -sr --sample_number_regex
    - Used to specify a regex to detect sample numbers from fastq file names
    - By default, any number directly following "S" is taken as the sample number
    - If your file names look like 19919X42_220722_A00421_0459_AHH3JFDRX2_S42_L001_R2_001.fastq.gz
    - the default would assign this file sample number 42
    - If your file names look like 19919_R1_L001_sample_number_42.fastq.gz
    - you would use "-sr sample_number_" to detect 42 as the sample number
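
To preview which sample number a given prefix would pick up, the behavior described above can be approximated in Python. This is only a sketch: it assumes the sample number is the run of digits directly after the first match of the prefix, which may differ in detail from how TMP applies the regex. The function name `sample_number` is hypothetical.

```python
import re


def sample_number(file_name, prefix="S"):
    """Return the digits directly following `prefix`, or None if there is no match."""
    # `prefix` is treated as a regex, mirroring the -sr flag (assumption).
    match = re.search(prefix + r"(\d+)", file_name)
    return int(match.group(1)) if match else None


default_name = "19919X42_220722_A00421_0459_AHH3JFDRX2_S42_L001_R2_001.fastq.gz"
custom_name = "19919_R1_L001_sample_number_42.fastq.gz"

print(sample_number(default_name))                  # default "S" prefix
print(sample_number(custom_name, "sample_number_"))  # custom prefix
print(sample_number(custom_name))                   # no uppercase "S" followed by digits
```

Running a prefix against a few of your own file names before starting the pipeline is a cheap way to catch a pattern that matches nothing.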

### -d --path_to_DNA_fastq
    - Your DNA fastq files may not be in the same location as your RNA fastq files
    - Use this flag to specify the path to the directory containing the fastq files for your DNA samples

## Flags TMP_comparative.py

### -n, --n_workers
    - This flag sets the number of workers used to parallelize MPRAnalyze's "analyzeComparative()" function
    - Default is 6
    - EXAMPLE "-n 8"
    
### -p, --pairwise_tsv
    - Used to specify a path to a TSV containing the desired pairwise comparisons
    - This flag is optional. If you do not use this flag, you will be prompted to input the desired comparisons on the command line.
    - See example below for the format of the pairwise TSV
```text
id  base_treatment    stim_treatment    base_run    stim_run
1   Serum Free        ATP               19919       19919 
2   Serum Free        Forskolin         19919       19664
3   Serum Free        FBS               19919       19919 
```
    - The header for the TSV must match the example exactly
    - The names in the base_run and stim_run (stimulated run) columns must match the names of the directories under the runs/ directory
    - The names in the base_treatment and stim_treatment (stimulated treatment) columns must exactly match the names entered in the treatment TSV (See -t Flags TMP_empirical.py)
    - The id column should simply be numbered 1 through n, where n is the number of comparisons
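
Writing the pairwise TSV from a script avoids header typos and numbering mistakes. This is a minimal sketch: `write_pairwise_tsv` is a hypothetical helper, and it assumes the columns are separated by single tabs.

```python
import csv
import io

# Header copied from the example above; it must match exactly.
PAIRWISE_HEADER = ["id", "base_treatment", "stim_treatment", "base_run", "stim_run"]


def write_pairwise_tsv(handle, comparisons):
    """Write (base_treatment, stim_treatment, base_run, stim_run) tuples as a TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(PAIRWISE_HEADER)
    # Ids are numbered 1 through n in the order given.
    for i, comparison in enumerate(comparisons, start=1):
        writer.writerow([i, *comparison])


buffer = io.StringIO()  # stand-in for open("pairwise.tsv", "w", newline="")
write_pairwise_tsv(buffer, [
    ("Serum Free", "ATP", "19919", "19919"),
    ("Serum Free", "Forskolin", "19919", "19664"),
])
print(buffer.getvalue())
```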
    
### -m, --multi_tsv
    - Used to specify a path to a TSV containing the desired multivariate comparisons
    - This flag is optional. If you do not use this flag, you will be prompted to input the desired comparisons on the command line.
    - See example below for the format of the multivariate TSV
```text
id	treatment	run
1	Serum Free	19919
1	ATP	20250
1	Forskolin	20250
2	Marin1	20250
2	Marin2	20250
2	Marin3	20250
2	Marin4	20250
```
    - The header for the TSV must match the example exactly
    - Rows with the same id are compared against each other. In this example, 2 comparisons will be run: comparison 1 compares Serum Free, ATP, and Forskolin, and comparison 2 compares Marin1 through Marin4
    - The names in the run column must match the directory names under the runs/ directory
    - The names in the treatment column must exactly match the names entered in the treatment TSV (See -t Flags TMP_empirical.py)
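
To confirm which treatments end up in the same comparison, the rows can be grouped by id. The sketch below illustrates the grouping rule described above and is not TMP's own reader; `group_multi_comparisons` is a hypothetical helper and the input is assumed to be tab-separated.

```python
import csv
import io
from collections import defaultdict


def group_multi_comparisons(handle):
    """Group (treatment, run) pairs by the id column of a multivariate TSV."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["id"]].append((row["treatment"], row["run"]))
    return dict(groups)


# In-memory stand-in for a multivariate TSV file.
multi_tsv = io.StringIO(
    "id\ttreatment\trun\n"
    "1\tSerum Free\t19919\n"
    "1\tATP\t20250\n"
    "1\tForskolin\t20250\n"
    "2\tMarin1\t20250\n"
    "2\tMarin2\t20250\n"
)
groups = group_multi_comparisons(multi_tsv)
for comparison_id, members in groups.items():
    print(comparison_id, members)
```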
    
