## Introduction
This Jupyter notebook walks through the basic functionality of ConSReg for building regulatory networks and prioritizing important transcription factors (TFs) by integrating DAP-seq, ATAC-seq and single-cell RNA-seq data. The datasets used in this analysis are listed below:

1. DAP-seq: [O'Malley et al., 2016](https://www.ncbi.nlm.nih.gov/pubmed/27203113)
2. ATAC-seq: [Lu et al., 2017](https://academic.oup.com/nar/article/45/6/e41/2605943)
3. Single-cell RNA-seq: [Ryu et al., 2019](https://www.ncbi.nlm.nih.gov/pubmed/30718350)

In [1]:
import pandas as pd
import os
import re

from ConSReg.main import ConSReg

## The steps for single-cell analysis are the same as for bulk analysis. The only differences are the input data and the type of negative training genes

File names of input data

In [3]:
# DAP-seq narrowPeak files
dap_file_list = os.listdir("data/dap_seq_all_peaks/")
dap_files = [ "data/dap_seq_all_peaks/" + file for file in dap_file_list if re.match(".*narrowPeak",file) is not None]

# ATAC-seq peak file
atac_file = "data/atac_seq_all_peaks/all_merged.bed"

# Arabidopsis genome annotation file
gff_file = "data/TAIR10_GFF3_genes.gff"

# Differential contrast result
sc_diff_file = ["data/diff_single_cell/cortext-endodermis.csv"]
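
Before preprocessing, it can help to confirm that every input path resolves to a file. The following is a minimal sketch and not part of ConSReg; the helper name `find_missing` is hypothetical and introduced only for illustration.

```python
import os

def find_missing(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Hypothetical usage with the variables defined in the cell above:
# missing = find_missing(dap_files + [atac_file, gff_file] + sc_diff_file)
# if missing:
#     raise FileNotFoundError("Missing input files: {}".format(missing))
```

An empty return value means all inputs were found.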

Read and preprocess all data files

In [5]:
analysis = ConSReg()

# Specify parameters for preprocessing
params = {
    'dap_files':dap_files,
    'diff_files':sc_diff_file,
    'atac_file':atac_file,
    'gff_file':gff_file,
    'dap_chr_col':0,
    'dap_chr_start_col':1,
    'dap_chr_end_col':2,
    'dap_strand_col':None,
    'dap_signal_col':None,
    'atac_chr_col':0,
    'atac_chr_start_col':1, 
    'atac_chr_end_col':2,
    'atac_signal_col':None,
    'up_tss':3000,
    'down_tss':500,
    'up_type':'all', 
    'down_type':'all',
    'use_peak_signal':False,
    'n_jobs':16,
    'verbose':True
}
analysis.preprocess(**params)

Merging DAP-seq peaks...
Done
Assigning CREs to nearest genes...
Done
Overlapping CREs with ATAC-seq...
Done
Reading diff tables...
Done


<ConSReg.main.ConSReg at 0x7fec96857588>
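
The `up_tss` and `down_tss` parameters suggest a window of 3000 bp upstream and 500 bp downstream of each transcription start site (TSS). The sketch below illustrates one common strand-aware way to compute such a window and test whether a peak falls inside it. It is an assumption made for illustration and may differ from what ConSReg does internally; `promoter_window` and `overlaps` are hypothetical helper names.

```python
def promoter_window(tss, strand, up_tss=3000, down_tss=500):
    """Return (start, end) spanning up_tss bp upstream and down_tss bp
    downstream of a TSS, following the gene's strand (assumed convention)."""
    if strand == "+":
        start, end = tss - up_tss, tss + down_tss
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates
        start, end = tss - down_tss, tss + up_tss
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end

def overlaps(peak_start, peak_end, win_start, win_end):
    """True if two half-open intervals share at least one base."""
    return peak_start < win_end and win_start < peak_end

print(promoter_window(10000, "+"))  # (7000, 10500)
print(promoter_window(10000, "-"))  # (9500, 13000)
```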

Generate feature matrices for each differential contrast

In [6]:
analysis.gen_feature_mat(neg_type = 'udg')

Existing feature matrices will be overwritten.
Generating feature matrices...
Done


<ConSReg.main.ConSReg at 0x7fec96857588>
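
To make the idea of a feature matrix concrete, the sketch below builds a binary gene-by-TF matrix from a list of TF-to-target pairs, where an entry of 1 means the TF has a binding site assigned to that gene. This layout is an assumption for illustration; the matrices that `gen_feature_mat` produces may be organized differently. The function name and all identifiers in the toy data are made up.

```python
def build_feature_matrix(edges, genes):
    """Build a binary gene-by-TF matrix from (tf, target_gene) pairs."""
    tfs = sorted({tf for tf, _ in edges})
    edge_set = set(edges)
    matrix = [[1 if (tf, gene) in edge_set else 0 for tf in tfs] for gene in genes]
    return tfs, matrix

# Toy, made-up identifiers for illustration only
edges = [("TF_A", "gene1"), ("TF_B", "gene1"), ("TF_B", "gene2")]
genes = ["gene1", "gene2", "gene3"]
tfs, matrix = build_feature_matrix(edges, genes)
print(tfs)     # ['TF_A', 'TF_B']
print(matrix)  # [[1, 1], [0, 1], [0, 0]]
```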

Compute AUC-ROC and AUC-PRC by cross-validation (CV) using the LRLASSO method. The mean and standard deviation of AUC-ROC and AUC-PRC are reported from five replicates of CV.

In [7]:
analysis.eval_by_cv(ml_engine = 'lrlasso',rep = 5)

Performing cross-validation for each feature matrix using lrlasso engine...
Old evaluation results will be ovewritten


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Done


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.6s finished


<ConSReg.main.ConSReg at 0x7fec96857588>
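
As a reminder of what the reported numbers mean, the sketch below computes AUC-ROC directly from its definition (the probability that a positive example receives a higher score than a negative one) and summarizes several replicate values as a mean and standard deviation. It is a standalone illustration in plain Python, not ConSReg's implementation; the use of the sample standard deviation is an assumption, and the replicate values are made up.

```python
import statistics

def auroc(labels, scores):
    """AUC-ROC as the probability that a positive outscores a negative
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize(values):
    """Mean and sample standard deviation across CV replicates."""
    return statistics.mean(values), statistics.stdev(values)

print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75

# Made-up AUC-ROC values from five hypothetical replicates
mean_auc, std_auc = summarize([0.62, 0.66, 0.65, 0.69, 0.63])
```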

Check the CV results

In [8]:
analysis.auroc

| | diff_name | auroc_mean_UR | auroc_std_UR | auroc_mean_DR | auroc_std_DR |
|---|---|---|---|---|---|
| 0 | cortext-endodermis.csv | 0.654879 | 0.029298 | 0.604074 | 0.023465 |


Compute an importance score for each TF using stability selection

In [9]:
analysis.compute_imp_score(n_resampling = 200, n_jobs = 36, verbose = True)

Existing importance scores will be overwritten.
Performing stability selection and compute importance score for each TF...


[Parallel(n_jobs=36)]: Using backend LokyBackend with 36 concurrent workers.
[Parallel(n_jobs=36)]: Done   5 out of  10 | elapsed:    4.8s remaining:    4.8s
[Parallel(n_jobs=36)]: Done  10 out of  10 | elapsed:    4.9s finished
[Parallel(n_jobs=36)]: Using backend LokyBackend with 36 concurrent workers.
[Parallel(n_jobs=36)]: Done 128 tasks      | elapsed:    1.9s
[Parallel(n_jobs=36)]: Done 200 out of 200 | elapsed:    5.7s finished
[Parallel(n_jobs=36)]: Using backend LokyBackend with 36 concurrent workers.
[Parallel(n_jobs=36)]: Done   5 out of  10 | elapsed:    1.3s remaining:    1.3s
[Parallel(n_jobs=36)]: Done  10 out of  10 | elapsed:    1.4s finished
[Parallel(n_jobs=36)]: Using backend LokyBackend with 36 concurrent workers.


Done


[Parallel(n_jobs=36)]: Done 200 out of 200 | elapsed:    2.1s finished


<ConSReg.main.ConSReg at 0x7fec96857588>
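
The core idea of stability selection is to refit a sparse feature selector on many random subsamples and record how often each feature is chosen. The sketch below shows only that resampling loop. A simple threshold on the difference in class means stands in for the LASSO model, and how ConSReg turns selection results into its importance score is not covered here. All function names and the toy data are hypothetical.

```python
import random

def selection_frequency(X, y, select, n_resampling=200, frac=0.5, seed=0):
    """Fraction of random subsamples in which each feature is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    size = max(1, int(n * frac))
    for _ in range(n_resampling):
        idx = rng.sample(range(n), size)  # subsample without replacement
        for j in select([X[i] for i in idx], [y[i] for i in idx]):
            counts[j] += 1
    return [c / n_resampling for c in counts]

def mean_diff_selector(X, y, threshold=0.5):
    """Stand-in selector (not LASSO): keep features whose class means
    differ by more than threshold."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    if not pos or not neg:
        return []
    selected = []
    for j in range(len(X[0])):
        diff = sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        if abs(diff) > threshold:
            selected.append(j)
    return selected

# Toy data: feature 0 tracks the label, feature 1 is constant
X = [[1, 0]] * 10 + [[0, 0]] * 10
y = [1] * 10 + [0] * 10
freq = selection_frequency(X, y, mean_diff_selector, n_resampling=50)
```

Features that are selected in most subsamples (here, feature 0) are considered stable, while uninformative features (feature 1) are never selected.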

Generate regulatory networks using an importance score cutoff of 0.5

In [10]:
analysis.gen_networks(imp_cutoff = 0.5, verbose = True)

Existing networks will be overwritten.
Generating networks...
Done


<ConSReg.main.ConSReg at 0x7fec96857588>
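
One plausible reading of `imp_cutoff` is that only TFs whose importance score reaches the cutoff contribute edges to the network. The sketch below illustrates that filtering step on toy data. Whether ConSReg uses an inclusive or exclusive comparison is not stated in this notebook, so the `>=` here is an assumption, as are the function name and all identifiers.

```python
def filter_edges(edges, imp_scores, imp_cutoff=0.5):
    """Keep (tf, target) edges whose TF has an importance score of at
    least imp_cutoff (inclusive comparison is an assumption)."""
    return [(tf, target) for tf, target in edges
            if imp_scores.get(tf, 0.0) >= imp_cutoff]

# Toy, made-up scores and edges for illustration only
imp_scores = {"TF_A": 0.8, "TF_B": 0.2}
edges = [("TF_A", "gene1"), ("TF_B", "gene1"), ("TF_A", "gene2")]
print(filter_edges(edges, imp_scores))  # [('TF_A', 'gene1'), ('TF_A', 'gene2')]
```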

Save analysis results

In [12]:
# Cross-validation results
analysis.auroc.to_csv("results/single_cell_analysis/auroc_result.csv")
analysis.auprc.to_csv("results/single_cell_analysis/auprc_result.csv")

# Importance scores
analysis.imp_scores_UR.to_csv("results/single_cell_analysis/imp_score_UR.csv")
analysis.imp_scores_DR.to_csv("results/single_cell_analysis/imp_score_DR.csv")

# Networks are saved as edge lists
for diff_name, network in zip(analysis._diff_name_list, analysis.networks_UR):
    network.to_csv("results/single_cell_analysis/{}_UR_network.csv".format(diff_name))

for diff_name, network in zip(analysis._diff_name_list, analysis.networks_DR):
    network.to_csv("results/single_cell_analysis/{}_DR_network.csv".format(diff_name))
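
`DataFrame.to_csv` does not create missing directories, so the calls above fail if `results/single_cell_analysis/` does not exist yet. A minimal sketch of a guard follows; `ensure_parent_dir` is a hypothetical helper, not part of ConSReg or pandas.

```python
import os

def ensure_parent_dir(path):
    """Create the parent directory of path if needed, then return path."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path

# Hypothetical usage with the objects from this notebook:
# analysis.auroc.to_csv(ensure_parent_dir("results/single_cell_analysis/auroc_result.csv"))
```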