# APA calling Pipeline

## Aim
This notebook calls alternative polyadenylation (APA) events and quantifies them as PDUI (Percentage of Distal polyA site Usage Index) values using the DaPars2 method
(https://github.com/3UTR/DaPars2).

### Methods overview
(Optional) 3'UTR generation:
* _gtf2bed12.py_: converts GTF to BED format (in-house code from the Li Lab: https://github.com/Xu-Dong/Exon_Intron_Extractor/blob/main/scripts/gtf2bed12.py)  

    `wget https://raw.githubusercontent.com/seriousbamboo/Exon_Intron_Extractor/main/scripts/gtf2bed12.py`

* _DaPars_Extract_Anno.py_: extracts the 3'UTR regions in BED format from the whole-genome BED file (from DaPars2: https://github.com/3UTR/DaPars2/blob/master/src/DaPars_Extract_Anno.py)

    `wget https://raw.githubusercontent.com/3UTR/DaPars2/master/src/DaPars_Extract_Anno.py`

1 - Config file generation:  
* _Python 3_ loops read each wig file line by line and sum the read coverage over all chromosomes to obtain the total sequencing depth of each sample.

2 - DaPars2 main function:
* _Dapars2_Multi_Sample.py_: uses a least-squares method to estimate the usage of long isoforms (https://github.com/3UTR/DaPars2/blob/master/src/Dapars2_Multi_Sample.py)  

    `wget https://raw.githubusercontent.com/seriousbamboo/DaPars2/master/src/Dapars2_Multi_Sample.py`
    
    Note: this script has been modified from the original source to handle some formatting discrepancies in wig files.

3 - Impute missing values in Dapars result
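
As a point of reference for the quantity estimated in step 2, the minimal sketch below shows how a PDUI value relates to the abundances of the long and short 3'UTR isoforms. The function name `pdui` and the example abundances are illustrative assumptions and are not part of DaPars2, which estimates these abundances from the coverage profiles by least-squares fitting.

```python
def pdui(long_abundance, short_abundance):
    """Share of transcripts that use the distal poly(A) site (long 3'UTR isoform)."""
    total = long_abundance + short_abundance
    if total == 0:
        return None  # undefined when neither isoform is expressed
    return long_abundance / total


# Hypothetical abundances: 30 units of the long isoform, 10 of the short one
print(pdui(30.0, 10.0))  # 0.75
```

A value close to 1 means that the distal site dominates (long 3'UTR), and a value close to 0 means that the proximal site dominates (short 3'UTR).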

#### Dependencies
* _Python2_ (Note: the Python 2 code can be updated to Python 3 easily)
* _Python3_
* _R_

### Input for the whole Pipeline

Required input: 
*  The path to the directory where the wig files are stored (denoted as `bfile`; please refer to the later sections for more detailed requirements)
*  The 3'UTR annotation file

If you do not have a 3'UTR annotation file, please generate it following Step 0. The input for this generation step is:
*  GTF (serves as the reference) 

### Output

*  Dapars config files (in the current directory)
*  PDUI (Raw) information saved in txt (in the specified output directory)
*  PDUI (Imputed) information saved in txt. This can be used for further analysis. 

### Workflow

In [1]:
[global]
parameter: walltime = '36h'
parameter: mem = '100G'
parameter: ncore = 22
# the output directory for generated files
parameter: cwd = path
# number of threads used by Dapars2 (written as Num_Threads in the configuration file)
parameter: thread = 8
parameter: job_size = 22

### Step 0: Generate 3UTR regions based on GTF 

The 3'UTR regions (saved in BED format) can be used __repeatedly__ for different samples. They only serve as the reference regions, so you __do not need to__ run this step if pre-generated hg19/hg38 3'UTR regions are available.

In [2]:
# Generate the 3UTR region according to the gtf file
[UTR_generation_1]
# gtf file
parameter: gtf = path
input: gtf
output: [f'{cwd}/gene_annotation.bed', f'{cwd}/transcript_to_geneName.txt']
bash: expand = '${ }'
    python2 Scripts/gtf2bed12.py --gtf "${_input}" --out "${cwd}"

In [3]:
[UTR_generation_2]
parameter: gtf = path
input: [f'{cwd}/gene_annotation.bed', f'{cwd}/transcript_to_geneName.txt']
output: f'{cwd}/{gtf:bn}_3UTR.bed'
bash: expand = '${ }'
    python2 Scripts/DaPars_Extract_Anno.py -b "${_input[0]}" -s "${_input[1]}" -o "${_output}"

### Step 1: Generating config files and calculating sample depth

#### Notes on input file format

The input file has the following format. Additional notes:
* The first line holds the file information (the track line). If your file does not have one, please add any content as the first line.
* The file name must end with ".wig". Renaming a ".bedgraph" file directly to ".wig" does not cause any problem.
* If the first column of your input wig files does not contain the __"chr"__ prefix, please set `no_chr_prefix = T`.
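
The notes above can be checked before running the workflow. The following is a minimal sketch, assuming tab-separated wig/bedGraph lines as in the example below; the helper name `inspect_wig_lines` is hypothetical and is not part of this pipeline or of Dapars2.

```python
def inspect_wig_lines(lines):
    """Return (has_header, has_chr_prefix) for an iterable of wig/bedGraph lines."""
    has_header = False
    has_chr_prefix = None  # stays None if no data line is found
    for i, line in enumerate(lines):
        # A first line starting with "track" or "#" is treated as the header
        if i == 0 and line.startswith(("track", "#")):
            has_header = True
            continue
        if not line.strip():
            continue
        # The first data line decides whether chromosome names carry "chr"
        has_chr_prefix = line.split("\t")[0].startswith("chr")
        break
    return has_header, has_chr_prefix


example = ["track type=bedGraph\n", "chr1\t10000\t10812\t0\n"]
has_header, has_chr_prefix = inspect_wig_lines(example)
# Map the result onto the "T"/"F" convention used by no_chr_prefix
no_chr_prefix = "F" if has_chr_prefix else "T"
print(has_header, no_chr_prefix)  # True F
```

For a real file, an open file handle can be passed instead of the list, for example `inspect_wig_lines(open("Sample1.wig"))`.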

In [8]:
head -n 10 Sample_Input/Wigfiles/Sample1.wig

track type=bedGraph
chr1	10000	10812	0
chr1	10812	10820	1
chr1	10820	11094	0
chr1	11094	11170	1
chr1	11170	11404	0
chr1	11404	11480	1
chr1	11480	11504	0
chr1	11504	11517	1
chr1	11517	11625	0


In [9]:
# Calculate total depth and generate the configuration file
[APAconfig]
parameter: bfile = path
parameter: annotation = path
parameter: job_size = 1
# Default parameters for Dapars2:
parameter: least_pass_coverage_percentage = 0.3
parameter: coverage_threshold = 10

parameter: no_chr_prefix = "F"
output: [f'{cwd}/sample_mapping_files.txt',f'{cwd}/sample_configuration_file.txt']
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = ncore
python3: expand = "${ }"
    import re
    import os
    target_all_sample = os.listdir("${bfile}")
    target_all_sample = list(filter(lambda v: re.match('.*wig$', v), target_all_sample))
    target_all_sample = ["${bfile}" + "/" + w for w in target_all_sample]
    #print(target_all_sample)
    print("INFO: Total",len(target_all_sample),"samples found in provided directory!")
    # Total depth file:
    # Autosomes 1-22 plus the sex chromosomes; all other contigs are ignored
    chrom_list = [str(i + 1) for i in range(22)] + ["X", "Y"]
    if "${no_chr_prefix}" == "F":
        chrom_list = ['chr' + a for a in chrom_list]
    chrom_set = set(chrom_list)
    with open("${_output[0]}", "w") as mapping_file:
        for current_sample in target_all_sample:
            current_sample_total_depth = 0
            with open(current_sample, 'r') as wig:
                for line in wig:
                    # skip comment lines and the default "track type=bedGraph" line
                    if line[0] != '#' and line[0] != 't':
                        fields = line.strip('\n').split('\t')
                        curr_chr = fields[0]
                        region_start = int(fields[1])
                        region_end = int(fields[2])
                        # depth = coverage x interval length, counted only on the listed chromosomes
                        current_sample_total_depth += (curr_chr in chrom_set) * int(float(fields[-1])) * (region_end - region_start)
            field_out = [current_sample, str(current_sample_total_depth)]
            mapping_file.write('\t'.join(field_out) + '\n')

            print("Coverage of sample ", current_sample, ": ", current_sample_total_depth)

    # Configuration file:

    config_file = open(${_output[1]:r},"w")
    config_file.writelines(f"Annotated_3UTR=${annotation}\n")
    config_file.writelines( "Aligned_Wig_files=%s\n" % ",".join(target_all_sample))
    config_file.writelines(f"Output_directory=${cwd:bn}/${bfile:bn}\n")
    config_file.writelines(f"Output_result_file=Dapars_result\n")
    config_file.writelines(f"Least_pass_coverage_percentage=${least_pass_coverage_percentage}\n")
    config_file.writelines( "Coverage_threshold=${coverage_threshold}\n")
    config_file.writelines( "Num_Threads=${thread}\n")
    config_file.writelines(f"sequencing_depth_file=${_output[0]}") 
    config_file.close()    
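
To make the depth calculation in this step concrete, the sketch below applies the same rule (coverage multiplied by interval length, summed over the listed chromosomes) to the ten lines of `Sample1.wig` shown above. The function name `total_depth` is an illustrative assumption; unlike the step above, it also skips blank lines as an extra guard.

```python
def total_depth(lines, chromosomes):
    """Sum coverage x interval length over the lines on the given chromosomes."""
    depth = 0
    for line in lines:
        # Skip comments and the "track ..." header (plus blank lines)
        if not line.strip() or line.startswith(("#", "t")):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] in chromosomes:
            depth += int(float(fields[-1])) * (int(fields[2]) - int(fields[1]))
    return depth


chromosomes = {"chr" + str(i) for i in range(1, 23)} | {"chrX", "chrY"}
sample_lines = [
    "track type=bedGraph\n",
    "chr1\t10000\t10812\t0\n",
    "chr1\t10812\t10820\t1\n",
    "chr1\t10820\t11094\t0\n",
    "chr1\t11094\t11170\t1\n",
    "chr1\t11170\t11404\t0\n",
    "chr1\t11404\t11480\t1\n",
    "chr1\t11480\t11504\t0\n",
    "chr1\t11504\t11517\t1\n",
    "chr1\t11517\t11625\t0\n",
]
# Covered intervals: 8 + 76 + 76 + 13 bases, each at coverage 1
print(total_depth(sample_lines, chromosomes))  # 173
```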

### Step 2: Run Dapars2 main to calculate PDUIs
#### Tip: modified Dapars2_Multi_Sample.py
The original Dapars2_Multi_Sample.py does not handle wig files whose first column lacks the "chr" prefix (see the notes on input file format in Step 1).   
* We added a new argument, no_chr_prefix (default: FALSE), to handle this case.

In [15]:
# Call Dapars2 multi_chromosome
[APAmain]
parameter: no_chr_prefix = False
parameter: chrlist = list
input: for_each = 'chrlist'
output: [f'{cwd}/Wigfiles_{x}/Dapars_result_result_temp.{x}.txt' for x in chrlist], group_by = 1
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = ncore
bash: expand = True
    python2 Scripts/Dapars2_Multi_Sample.py {cwd}/sample_configuration_file.txt {_chrlist} {no_chr_prefix}

### Step 3: Impute the result
This step imputes the missing values in the PDUI matrix with k-nearest-neighbour (kNN) imputation, quantile-normalizes the imputed matrix, and returns the result.
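
To illustrate the idea behind kNN imputation, here is a minimal pure-Python sketch. It is a simplified stand-in and not the algorithm of `VIM::kNN`, which the step below actually calls and which uses its own distance measure. The function name `knn_impute`, the root-mean-square distance over shared columns, and the toy matrix are all assumptions made for this example. Rows stand for 3'UTRs and columns for samples.

```python
import math
from statistics import median


def knn_impute(rows, k=5):
    """Fill None entries with the median of the k nearest rows observing that column."""

    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row with column j observed
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                imputed[i][j] = median(nearest)
    return imputed


toy = [
    [0.10, 0.20, None],
    [0.10, 0.20, 0.30],
    [0.90, 0.80, 0.70],
    [0.12, 0.22, 0.32],
]
# With k = 2 the missing value is filled from the two most similar rows
print(knn_impute(toy, k=2)[0])
```

The missing entry is filled from rows 2 and 4, whose observed values are closest to those of row 1, rather than from the dissimilar row 3.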

In [9]:
# Dapars result Imputation 
[APAimpute]
# Input path
# parameter: dapars_raw = 
parameter: chrlist = list
# Number of nearest neighbours (k) used for imputation. Setting knn = 0 selects k in a data-driven manner
parameter: knn = 5
input: [f'{cwd}/Wigfiles_{x}/Dapars_result_result_temp.{x}.txt' for x in chrlist], group_by = 1
output: [f'{cwd}/Dapars_result_imputed_{x}.txt' for x in chrlist], group_by = 1
R: expand= "${ }"


suppressPackageStartupMessages(require(dplyr))
suppressPackageStartupMessages(require(tidyr))
suppressPackageStartupMessages(require(doParallel))
suppressPackageStartupMessages(require(VIM))
suppressPackageStartupMessages(require(preprocessCore))
  
  
# KNN impute

# Choose the optimal k when knn = 0. Candidate values of k run from 2 to
# the number of samples. For each candidate, randomly mask about 20% of
# the entries in the complete-case subset 5 times, impute them with VIM,
# and compare the imputed values with the true values by squared error.
# The k with the lowest mean error is selected (analogous to 5-fold CV).
  
  
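knn_train_sample <- function(input_df, k_nb){
      # Sample data and remove entries
      # We must keep 1 full col for impute
      dim_df <- dim(input_df)
      # Mask 20% of the entries, capped at one entry per row: rows are drawn
      # without replacement, so a larger request would make sample() fail
      sample_size <- min(round(dim_df[1] * dim_df[2] * 0.2, 0), dim_df[1])
      row_index <- sample(1:nrow(input_df), sample_size)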
      col_index <- sample(1:ncol(input_df), sample_size, replace = T)
      val_list <- c()
      for(i in 1:length(row_index)){
        val_list <- c(val_list, input_df[row_index[i], col_index[i]])
        input_df[row_index[i], col_index[i]] <- NA
      }

      # Impute value

      impute_data <- VIM::kNN(input_df, k = k_nb)
      impute_data <- impute_data[,1:ncol(input_df)]

      # Calculate the MSE:
      predict_list <- c()
      for(i in 1:length(row_index)){
        predict_list <- c(predict_list, impute_data[row_index[i], col_index[i]])
      }
      tss <- sum((predict_list - val_list)^2)
      return(tss)
}
  
 # Read the data
    input_dir <- ${_input:r}
  

    dapars_result <- 
      read.table(input_dir,
                 header = T) 
    
    dapars_names <- 
      dapars_result %>% 
      select(1:3)
  
    dapars_result <-
      dapars_result %>% 
      select(-1:-3) 
 
  if(${knn} == 0){
    

    dapars_train_data <- dapars_result %>% drop_na()

    # Train the model
    no_cores <- detectCores() - 1
    cl <- makeCluster(no_cores)
    registerDoParallel(cl)

    index <- seq(2, ncol(dapars_train_data), by = 1)  # Candidate k values; for large datasets, increase "by" to test fewer candidates
    result_list <- c()
    for(i in 1:length(index)){
        print(paste0("Train k = ",index[i]))
        re <- foreach(j=1:5,.combine = c) %dopar% 
          knn_train_sample(dapars_train_data, index[i])
        result_list <- c(result_list, mean(re))
      }
    # Release the parallel workers once the search over k is finished
    stopCluster(cl)

    optimal_k <- index[which.min(result_list)]
    print(paste0("Optimal k selected is ", optimal_k))
    }else{
      optimal_k <- ${knn} 
      print(paste0("Use k = ", optimal_k, " for imputation"))
    }
      
    
  
    # Train the whole dataset:
    imputed_full_data <- VIM::kNN(dapars_result, k = optimal_k)
    imputed_full_data <- imputed_full_data[,1:ncol(dapars_result)]
  
    qnorm_full_data <- 
      preprocessCore::normalize.quantiles(as.matrix(imputed_full_data), 
                                          copy = F)
  
    final_data <- cbind(dapars_names, qnorm_full_data)
    write.table(final_data, file = ${_output:r}, quote = F)
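
The last part of the step above quantile-normalizes the imputed matrix so that every sample (column) shares the same distribution. The sketch below shows the basic procedure in pure Python. It is not the `preprocessCore::normalize.quantiles` implementation: the function name `quantile_normalize` and the toy values are assumptions, and ties are simply broken by position.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (lists of numbers)."""
    n = len(columns[0])
    # Mean of the r-th smallest value across all columns
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n)]
    result = []
    for col in columns:
        # Replace each value by the mean that belongs to its rank
        order = sorted(range(n), key=lambda idx: col[idx])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


# Three hypothetical samples (columns) measured on four 3'UTRs
samples = [[5, 2, 3, 4], [4, 1, 3, 2], [3, 4, 6, 8]]
for column in quantile_normalize(samples):
    print(column)
```

After normalization every column contains the same set of values (here 2, 3, 13/3 and 17/3), arranged according to the original ranks within that column.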

# Minimum working example
### Step 0: 3UTR generation

In [3]:
sos run apacalling.ipynb UTR_generation \
    --cwd /Users/albert29/Documents/Xqtl/Call_Dapars/Output \
    --gtf /Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/gencode.v19.annotation.gtf 

INFO: Running UTR_generation_1: Generate the 3UTR region according to the gtf file
Done!
INFO: UTR_generation_1 is completed.
INFO: UTR_generation_1 output:   /Users/albert29/Documents/Xqtl/Call_Dapars/Output/gene_annotation.bed /Users/albert29/Documents/Xqtl/Call_Dapars/Output/transcript_to_geneName.txt
INFO: Running UTR_generation_2: 
Generating regions ...
Total extracted 3' UTR: 140563
Finished
INFO: UTR_generation_2 is completed.
INFO: UTR_generation_2 output:   /Users/albert29/Documents/Xqtl/Call_Dapars/Output/gencode.v19.annotation_3UTR.bed
INFO: Workflow UTR_generation (ID=wbdda56e90fddc24c) is executed successfully with 2 completed steps.


In [8]:
tree /Users/albert29/Documents/Xqtl/Call_Dapars/Output

/Users/albert29/Documents/Xqtl/Call_Dapars/Output
├── gencode.v19.annotation_3UTR.bed
├── gene_annotation.bed
├── sample_configuration_file.txt
├── sample_mapping_files.txt
└── transcript_to_geneName.txt

0 directories, 5 files


### Step 1: Generating config files and calculating sample depth


In [10]:
sos run apacalling.ipynb APAconfig \
    --cwd /Users/albert29/Documents/Xqtl/Call_Dapars/Output \
    --bfile /Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles \
    --annotation /Users/albert29/Documents/Xqtl/Call_Dapars/Output/gencode.v19.annotation_3UTR.bed \
    --no_chr_prefix F

INFO: Running APAconfig: Calculcate total depth and configuration file
INFO: Total 4 samples found in provided dirctory!
Coverage of sample  /Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample3.wig :  859989694
Coverage of sample  /Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample2.wig :  1017641808
Coverage of sample  /Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample1.wig :  851834461
Coverage of sample  /Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample4.wig :  890005265
INFO: APAconfig is completed.
INFO: APAconfig output:   /Users/albert29/Documents/Xqtl/Call_Dapars/Output/sample_mapping_files.txt /Users/albert29/Documents/Xqtl/Call_Dapars/Output/sample_configuration_file.txt
INFO: Workflow APAconfig (ID=wbeede41b336d4ece) is executed successfully with 1 completed step.


In [11]:
tree /Users/albert29/Documents/Xqtl/Call_Dapars/Output

/Users/albert29/Documents/Xqtl/Call_Dapars/Output
├── gencode.v19.annotation_3UTR.bed
├── gene_annotation.bed
├── sample_configuration_file.txt
├── sample_mapping_files.txt
└── transcript_to_geneName.txt

0 directories, 5 files


### Step 2: Dapars2 Main
Note: the example data are truncated and only contain coverage on chr1, chr10 and chr11.

In [12]:
sos run apacalling.ipynb APAmain \
    --cwd /Users/albert29/Documents/Xqtl/Call_Dapars/Output \
    --no_chr_prefix F \
    --chrlist chr1 chr10 chr11

INFO: Running APAmain: Call Dapars2 multi_chromosome
[Thu Dec 16 15:41:29 2021] Start Analysis ...
All samples Joint Processing chr10 ...
[Thu Dec 16 15:41:29 2021] Loading Coverage ...
[Thu Dec 16 15:41:29 2021] Start Analysis ...
[Thu Dec 16 15:41:29 2021] Start Analysis ...
All samples Joint Processing chr1 ...
[Thu Dec 16 15:41:29 2021] Loading Coverage ...
All samples Joint Processing chr11 ...
[Thu Dec 16 15:41:29 2021] Loading Coverage ...
/Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample3.wig
/Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample2.wig
/Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample1.wig
/Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample4.wig
/Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample2.wig
/Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample3.wig
/Users/albert29/Documents/Xqtl/Call_Dapars/Sample_Input/Wigfiles/Sample1.wig
/

In [13]:
tree /Users/albert29/Documents/Xqtl/Call_Dapars/Output

/Users/albert29/Documents/Xqtl/Call_Dapars/Output
├── Wigfiles_chr1
│   ├── Dapars_result_result_temp.chr1.txt
│   └── tmp
│       ├── Each_processor_3UTR_Result_1.txt
│       ├── Each_processor_3UTR_Result_2.txt
│       ├── Each_processor_3UTR_Result_3.txt
│       ├── Each_processor_3UTR_Result_4.txt
│       ├── Each_processor_3UTR_Result_5.txt
│       ├── Each_processor_3UTR_Result_6.txt
│       ├── Each_processor_3UTR_Result_7.txt
│       └── Each_processor_3UTR_Result_8.txt
├── Wigfiles_chr10
│   ├── Dapars_result_result_temp.chr10.txt
│   └── tmp
│       ├── Each_processor_3UTR_Result_1.txt
│       ├── Each_processor_3UTR_Result_2.txt
│       ├── Each_processor_3UTR_Result_3.txt
│       ├── Each_processor_3UTR_Result_4.txt
│       ├── Each_processor_3UTR_Result_5.txt
│       ├── Each_processor_3UTR_Result_6.txt
│       ├── Each_processor_3UTR_Result_7.txt
│       └── Each_processor_3UTR_Result_8.txt
├── Wigfi

### Step 3: Impute

In [14]:
sos run apacalling.ipynb APAimpute \
    --cwd /Users/albert29/Documents/Xqtl/Call_Dapars/Output \
    --chrlist chr1 chr10 chr11 

INFO: Running APAimpute: Dapars result Imputation
[1] "Use k = 5 for imputation"
[1] "Use k = 5 for imputation"
[1] "Use k = 5 for imputation"
In preprocessCore::normalize.quantiles(as.matrix(imputed_full_data),  :
  NAs introduced by coercion
INFO: APAimpute (index=1) is completed.
In preprocessCore::normalize.quantiles(as.matrix(imputed_full_data),  :
  NAs introduced by coercion
INFO: APAimpute (index=2) is completed.
In preprocessCore::normalize.quantiles(as.matrix(imputed_full_data),  :
  NAs introduced by coercion
INFO: APAimpute (index=0) is completed.
INFO: APAimpute output:   /Users/albert29/Documents/Xqtl/Call_Dapars/Output/Dapars_result_imputed_chr1.txt /Users/albert29/Documents/Xqtl/Call_Dapars/Output/Dapars_result_imputed_chr10.txt... (3 items in 3 groups)
INFO: Workflow APAimpute (ID=wc04627bf743db81f) is executed successfully with 1 completed step and 3 completed substeps.
