scNanoHi-C is a single-cell long-read concatemer sequencing method that can be used to investigate higher-order 3D genome structures in individual cells. This repository contains Snakemake workflows for preprocessing scNanoHi-C data, along with other code used for the analyses in our paper.
scNanoHi-C is based on the Snakemake framework and conda environments. Install the dependencies and environments as follows:
conda install -c bioconda snakemake
# create conda environment
conda env create -f r_env.yaml
conda env create -f hic_env.yaml
conda env create -f py2_env.yaml
Then install dip-c and hickit in the py2_env environment (they are needed for 3D modelling).
scNanoHi-C sequencing data were first demultiplexed into single cells with Nanoplexer, using the known cell barcodes and default parameters. Adapter sequences were then trimmed with Cutadapt, and reads shorter than 500 bp were removed.
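The length cutoff can be illustrated with a minimal Python sketch. In the workflow itself this step is performed by Cutadapt; `read_fastq` and `drop_short_reads` are hypothetical helpers, and the sketch assumes plain four-line FASTQ records.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumes four lines per record, with no wrapped sequences.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        header, seq, _plus, qual = record
        yield header, seq, qual


def drop_short_reads(records, min_len=500):
    """Keep only reads with at least min_len bases (500 bp in the text above)."""
    for header, seq, qual in records:
        if len(seq) >= min_len:
            yield header, seq, qual
```

Because `read_fastq` only iterates over lines, it accepts a list of strings or a file handle opened in text mode.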
- Input files:
  - pass.fastq.gz: the sequencing output from the Nanopore platform;
  - bc_index: a tab-separated file with the PCR barcode (1-96) in the first column, the start and end of the TN5 barcode range (1-24) in the second and third columns, and an optional library name in the fourth column, as follows:

        $ head bc_index
        25 1 8 A
        25 17 24 A
        26 9 16 B
        27 17 24 C

    The resulting cell names in this example will be:

        A25B1 A25B2 ... C27B24

    If no library name is supplied, ‘P’ is used as the library name. By default, if the file contains only the first column, all TN5 barcodes (1-24) are used for each PCR barcode.
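The mapping from bc_index rows to cell names can be sketched in Python as follows. The naming pattern (library name, PCR barcode, the letter B, TN5 barcode) is inferred from the example above, and `cell_names` is a hypothetical helper rather than part of data_processing.smk.

```python
def cell_names(bc_index_lines, n_tn5=24, default_library="P"):
    """Expand bc_index rows into cell names such as 'A25B1'.

    Each row holds: PCR barcode, [TN5 start, TN5 end, [library name]].
    """
    names = []
    for line in bc_index_lines:
        fields = line.split()  # splits on tabs as well as spaces
        if not fields:
            continue
        pcr = fields[0]
        # With only the PCR barcode given, use every TN5 barcode.
        if len(fields) >= 3:
            start, end = int(fields[1]), int(fields[2])
        else:
            start, end = 1, n_tn5
        # Fall back to the default library name when none is supplied.
        library = fields[3] if len(fields) >= 4 else default_library
        names.extend(f"{library}{pcr}B{i}" for i in range(start, end + 1))
    return names
```

Applied to the four example rows, this yields 32 names, starting with A25B1 and ending with C27B24.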
- Output files:
  - raw_data/: raw fastq files for each single cell;
  - trim/: trimmed fastq files for each single cell, which can be used as the input files for Pore-C-Snakemake;
  - raw_fastq.stats: statistics of the raw fastq files;
  - trim_fastq.stats: statistics of the trimmed fastq files.
- Usage: put pass.fastq.gz and bc_index together with data_processing.smk into the working directory and run snakemake via:

      snakemake -s data_processing.smk -j 10
The trimmed fastq files of each single cell were then processed with the default Pore-C-Snakemake workflow; see Pore-C-Snakemake for details.
The main script of this step is filter_contacts.smk, which can be run with the shell script run_smk.sh. This step consists of the following sections:
- Remove artifact contacts in single cells

To remove artifact contacts, the concatemers from each single cell were first decomposed into virtual pair-wise contacts (VPCs). Five types of artifacts were then removed sequentially:
- Adjacent contacts: contacts assigned to adjacent restriction fragments;
- Close contacts: contacts whose two alignments are separated by less than 1000 bp in genomic distance;
- Duplicates: contacts from the same pair of restriction fragments were designated as duplicates. To remove PCR duplicates while preserving the high-order structure of concatemers as much as possible, only one contact, taken from the concatemer with the highest cardinality in each duplicate set, was retained;
- Promiscuous contacts: contacts from restriction fragments that are involved in more than 10 interactions;
- Isolated contacts: contacts with no other contact within a distance of 1 Mb.
More information about the evaluation of these filters can be found in the Methods section of our paper.
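The five filters above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in filter_contacts.smk: the `Contact` record and its field names are hypothetical, fragment indices are assumed to be consecutive along each chromosome, fragment promiscuity is counted on the contacts that survive the earlier filters, and "within 1 Mb" is interpreted as both ends lying within 1 Mb of the corresponding ends of another contact on the same chromosome pair.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    read_id: str      # concatemer the contact was decomposed from
    cardinality: int  # number of fragments in that concatemer
    chrom1: str
    frag1: int        # restriction-fragment index (assumed consecutive per chromosome)
    pos1: int
    chrom2: str
    frag2: int
    pos2: int

    def frag_pair(self):
        # Unordered pair of fragments, so both orientations compare equal.
        return tuple(sorted([(self.chrom1, self.frag1), (self.chrom2, self.frag2)]))

    def ends(self):
        # Unordered pair of (chromosome, position) ends.
        return sorted([(self.chrom1, self.pos1), (self.chrom2, self.pos2)])


def is_cis(c):
    return c.chrom1 == c.chrom2


def near(a, b, radius):
    # Assumed meaning of "within 1 Mb": same chromosome pair, both ends close.
    (ca1, pa1), (ca2, pa2) = a.ends()
    (cb1, pb1), (cb2, pb2) = b.ends()
    return ((ca1, ca2) == (cb1, cb2)
            and abs(pa1 - pb1) <= radius and abs(pa2 - pb2) <= radius)


def filter_contacts(contacts, min_dist=1000, max_degree=10, radius=1_000_000):
    # 1. Adjacent contacts: neighbouring fragments on the same chromosome.
    kept = [c for c in contacts if not (is_cis(c) and abs(c.frag1 - c.frag2) == 1)]
    # 2. Close contacts: alignments less than min_dist bp apart.
    kept = [c for c in kept if not (is_cis(c) and abs(c.pos1 - c.pos2) < min_dist)]
    # 3. Duplicates: one contact per fragment pair, from the concatemer
    #    with the highest cardinality.
    best = {}
    for c in kept:
        key = c.frag_pair()
        if key not in best or c.cardinality > best[key].cardinality:
            best[key] = c
    kept = list(best.values())
    # 4. Promiscuous contacts: drop fragments with more than max_degree interactions.
    degree = Counter()
    for c in kept:
        degree.update(c.frag_pair())
    kept = [c for c in kept if all(degree[f] <= max_degree for f in c.frag_pair())]
    # 5. Isolated contacts: require at least one other contact nearby.
    #    This pairwise scan is quadratic; fine for a sketch, slow for real data.
    return [c for c in kept if any(o is not c and near(c, o, radius) for o in kept)]
```

The default thresholds (1000 bp, 10 interactions, 1 Mb) are those given in the list above.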
- Reconstruct single-cell 3D models

The haplotype-tagged virtual pair-wise contacts from scNanoHi-C were used for haplotype imputation and single-cell 3D genome structure reconstruction with the hickit and dip-c packages. The high-depth scNanoHi-C data (24 cells per run) are recommended for 3D genome reconstruction.
- Generate quality control metrics for downstream analysis
- Calculate the single-cell A/B compartment values (scA/B)

The scA/B values were calculated with the dip-c package in both 2D and 3D modes and can be used for dimensionality reduction and cell clustering. The recommended resolution is 1 Mb.
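As a rough illustration of the 2D mode, the Python sketch below assumes the commonly used definition in which the scA/B value of a bin is the mean CpG frequency of the bins it contacts. This definition, the `sc_ab_2d` helper, and the input layout are assumptions made for illustration; in the pipeline the values are computed by dip-c.

```python
from collections import defaultdict


def sc_ab_2d(contacts, cpg, bin_size=1_000_000):
    """Mean CpG frequency of the contact partners of each bin.

    contacts: iterable of (chrom1, pos1, chrom2, pos2) tuples.
    cpg: dict mapping (chrom, bin_index) to the CpG frequency of that bin.
    """
    partner_cpg = defaultdict(list)
    for chrom1, pos1, chrom2, pos2 in contacts:
        bin1 = (chrom1, pos1 // bin_size)
        bin2 = (chrom2, pos2 // bin_size)
        # Skip contacts that touch bins without a CpG annotation.
        if bin1 in cpg and bin2 in cpg:
            partner_cpg[bin1].append(cpg[bin2])
            partner_cpg[bin2].append(cpg[bin1])
    return {b: sum(v) / len(v) for b, v in partner_cpg.items()}
```

The default `bin_size` matches the recommended 1 Mb resolution.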
- Input files:
  - run_smk.sh: the main script;
  - results from the Pore-C pipeline: the .parquet files and the concatemer summary file of each cell;
  - config.yaml: the configuration file containing the parameters and the locations of the files used in the pipeline;
  - sample: a file listing the cell names, one per row.
- Main output files:
  - *.contacts.clean.parquet: the filtered contacts of each cell;
  - *_stats_summary.csv: quality control information for all cells;
  - *.n.cif: the single-cell 3D models;
  - RMSD_summary.txt: summary of the RMSD of all models;
  - cpg_2d_*_b1m_summary.csv: matrix of 2D scA/B values in all cells.

  The outputs to generate can be selected by adjusting rule all in filter_contacts.smk.
- Usage: put the run_smk.sh, config.yaml and sample files into the merged_contacts directory of the Pore-C output, modify them as needed, and then run:

      $ sh run_smk.sh
      Run scNanoHi-C snakemake pipeline? [y/dry/n]:
      # type 'y' to run the pipeline directly
      # type 'dry' to display what would be done
      # type 'n' to cancel the program