Molecule Distance Script

The molecule distance workflow supplements Bionano's Guided Assembly and Rare Variant Analysis pipelines as a downstream analysis tool to identify mosaicism in repeat expansion genes or unstable insertions between 2 DLE1 molecule labels.

This script calculates and visualizes the distance between two labels for all molecules (the exact labels chosen are a input parameter). There are 3 main steps:

Parse input files from alignmolvref/merge/* to get complete molecule distance dataset.
Trim complete dataset to remove outliers beyond 3 s.d.
Plot complete and trimmed data:

Set-up and Usage

This script accepts both Rare Variant Analysis (RVA) and Guided Assembly (GA) files as input,specifically intermediary output files from alignmolvref/merge/*. However, pre-processing is recommended for RVA to dramatically reduce runtime.

Skip to Launching Molecule Distance Scripts for quick-run.

Full command examples in test/test_run_molecule_distance_full_script_commands.sh

Set-up/dependencies

This script can only be run on machine with Linux-based operating system because it relies on several Linux tools & Python.
There is an optional script "mol_dist_reqs" that will install the required R packages for the command line interface. This can be installed onto the local computer directly or through running this script.
Additional R packages:

mclust
prabclus
ggplot2

Local Guided Assembly (GA) via Command Line

It is recommended to launch assemblies from molecule .bnx files via command line, since assembly intermediate files -alignmolvref step - are needed as input for molecule distance script. Assemblies downloaded from Access will likely not contain alignmolvref intermediate files. The output file is similar to an Enfocus Fragile X assembly, with the difference that it estimates repeat expansion distances along the locus specified by the .csv file instead of FMR1. More information on Solve (3.7.2) Local Guided Assembly can be found in local_guided_assembly repository.

Rare Variant Analysis (RVA) Input Pre-processing

The molecule distance script also accepts other assemblies' (RVA and de novo) alignmolvref files as input. However, unlike GA and de novo, RVA output files (.xmap & .cmap) are not partitioned by chromosome. Directly running molecule distance script on RVA output will result in long runtimes. Therefore, a pre-processing step using filter_map_files.py is recommended to:

Select .xmap rows with RefContigID matching chromosome of interest
Filter r.cmap & q.cmap files for rows matching filtered .xmap file RefContigID & QryContigID columns, respectively.

Launch

python filter_map_files.py chr \
/path/to/*_r.cmap \
/path/to/*.xmap \
/path/to/*_q.cmap \
/path/to/output/name_for_plot_files

For example, the following command specifies filtering of molecules from chrX of the human genome.

# filter q.cmap & r.cmap by xmap IDs relevant to chr X (23)
python ~/.../molecule_distance_scripts/filter_map_files.py 23 \
~/.../output/contigs/alignmolvref/merge/exp_alignmolvref_r.cmap \
~/.../output/contigs/alignmolvref/merge/exp_alignmolvref.xmap \
~/.../output/contigs/alignmolvref/merge/exp_alignmolvref_q.cmap \
~/.../output/sample_1/sample_1_chr23 \

This script outputs filtered r.cmap, .xmap, and q.cmap files to be used as input for molecule distance script below.

Launching Molecule Distance Scripts

After activating R conda environment. From the command line, the general usage is:

Rscript molecule_distance_full.R \
/path/to/r_cmap \
/path/to/xmap \
/path/to/q_cmap \
start_label end_label path/to/output/name_for_plot_files

Example command:

# launch molecule distance script on guided assembly alignmolvref files from chr 4, calculating molecule distances containing reference labels 7722 and 7725
Rscript ~/.../molecule_distance_scripts/molecule_distance_full.R \
~/.../RFC1_16_guided_assembly/output/contigs/alignmolvref/merge/exp_refineFinal1_contig4_r.cmap \
~/.../RFC1_16_guided_assembly/output/contigs/alignmolvref/merge/exp_refineFinal1_contig4.xmap \
~/.../RFC1_16_guided_assembly/output/contigs/alignmolvref/merge/exp_refineFinal1_contig4_q.cmap \
7722 7725 ~/.../RFC1_16_guided_assembly/mol_distance \

Alternatively, if the .xmap and .cmap files have been processed by molecule_distance_full.R, the following script takes in *_complete_data.csv, and does only steps:

to trim dataset for outliers and
data plotting (e.g. delta from reference, histograms, auto-clustering).

This script offers the flexibility to customize plots without executing resource intensive step 1. to parse .cmap & .xmap input files.

Rscript /archived/molecule_distance_delta_ref.R \
/path/to/out_handle_complete_data.csv \
/path/to/output/name_for_plot_files

Example command:

Rscript ~/.../molecule_distance_scripts/archived/molecule_distance_delta_ref.R \
~/.../sample_1/sample_1_complete_data.csv 13021 13022 \
~/.../sample_1/sample_1_output

Script Output

The script outputs multiple files:

Output csv of molecule distances and summary statistics.
Output csv of molecule distances trimmed for outliers beyond 3 s.d.
Generate various plots (Examples below).

Complete data violin plots

Complete data barplot

Distance between specified start and end labels for all molecules

Trimmed data barplot

Trimmed data excludes datapoints beyond 3 standard deviation from molecule distance mean.

Trimmed data delta barplot

The y-axis plots delta, which is the distances between molecules' labels subtracted by the distance between reference labels.

Auto Clustering

Uses the Mclust package to determine clusters for molecule distances using a maximum-likelihood method. The example below shows clustered molecule distances between labels 26242 and 26247 in chrX.

Interpretation of Results

Bar plot showing molecule distances between specified labels. There are two clusters observed in blue and red. The blue cluster refer to the normal allele, while red cluster on the right half of the x-axis suggests unstable expansion.

Contact

Molecule distance script is authored by Joyce Lee, Syukri Shukor, Jillian Burke, and Andy Pang. For any questions, please reach out to Syukri Shukor (sshukor@bionano.com) or Andy Pang (apang@bionano.com) for questions and issues.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
archived		archived
guided_enfocus_scripts		guided_enfocus_scripts
images		images
tests		tests
.gitignore		.gitignore
README.md		README.md
filter_map_files.py		filter_map_files.py
molecule_distance_full.R		molecule_distance_full.R
molecule_distance_reqs.R		molecule_distance_reqs.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Molecule Distance Script

Set-up and Usage

Set-up/dependencies

Local Guided Assembly (GA) via Command Line

Rare Variant Analysis (RVA) Input Pre-processing

Launching Molecule Distance Scripts

Script Output

Complete data violin plots

Complete data barplot

Trimmed data barplot

Trimmed data delta barplot

Auto Clustering

Interpretation of Results

Contact

About

Releases

Packages

Languages

bionanogenomics/molecule_distance

Folders and files

Latest commit

History

Repository files navigation

Molecule Distance Script

Set-up and Usage

Set-up/dependencies

Local Guided Assembly (GA) via Command Line

Rare Variant Analysis (RVA) Input Pre-processing

Launching Molecule Distance Scripts

Script Output

Complete data violin plots

Complete data barplot

Trimmed data barplot

Trimmed data delta barplot

Auto Clustering

Interpretation of Results

Contact

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages