Skip to content

molleraj/CARDlongread-report-parser

Repository files navigation

NIA CARD Long Read Sequencing Report Parsing and Visualization Dashboard

The NIA CARD Long-Read Sequencing group sequences some 30-50 human genomes per week from brain and blood genomic DNA. Given these samples both often belong to larger cohorts (more than 30-50 samples), batch to batch variation is inevitable, and Oxford Nanopore Technologies has steadily modified sequencing library chemistry, flow cells, instrument software, and basecalling models, regular sequencing quality control is critical to examine how our sequencing efforts change over time. The included Python scripts automate the process of collecting key sequencing statistics from weekly and cohort-wide sequencing runs and generate QC analytics as an Excel spreadsheet. They take run output JSON files as input and pull sequencing statistics from exact locations in each JSON hierarchical data structure. The scripts have been written to handle JSON files from MinKNOW version 24.11.11 and earlier. Recent updates make it possible to incorporate platform QC flow cell checks from before sequencing into the QC visualization dashboard as well. Grouping functions to handle multiple input summary TSVs and color code by input name (e.g., cohort name) have been implemented (see usage below) and will be further documented with a tutorial example. Please report any bugs in the Issues tab to further improve the parser.

Dependencies

The Python scripts were tested and developed with the following dependency modules (and respective versions where applicable):

Python 3.10.8
numpy 1.26.4
pandas 2.0.3
seaborn 0.12.2
matplotlib 3.7.2
json 2.0.9
argparse 1.1
dateutil 2.8.2
openpyxl 3.1.2
xlsxwriter 3.1.2
datetime
dateutil
statistics
dataclasses
glob
io

Usage

CARDlongread_extract_from_json.py takes a list of Oxford Nanopore sequencing report JSON files as inputs (or a directory containing all JSON files to analyze) and returns a table with the following fields per JSON:

Experiment Name, Sample Name, Run Date, PROM ID, Flow Cell Position, Flow Cell ID, Flow Cell Product Code, Data output (Gb), Read Count (M), N50 (kb), MinKNOW Version, Passed Modal Q Score, Failed Modal Q Score, Starting Active Pores, Second Pore Count, Start Run ISO Timestamp, Start Run Timestamp

N50 (kb) refers to the read N50 for estimated bases read lengths (not basecalled bases). N50 extraction has been patched to extract this field from the corresponding estimated bases read length histogram in JSONs from MinKNOW versions older and newer than 24.11.11.

Below is sample output from the script:

Experiment Name	Sample Name	Run Date	PROM ID	Flow Cell Position	Flow Cell ID	Flow Cell Product Code	Data output (Gb)	Read Count (M)	N50 (kb)	MinKNOW Version	Passed Modal Q Score	Failed Modal Q Score	Starting Active Pores	Second Pore Count	Start Run ISO Timestamp	Start Run Timestamp
Chile_404	Chile_404	2024-05-14	PC24B302	2D	PAW33034	FLO-PRO114M	141.083	7.345	23.39	24.02.10	NA	NA	7644	7295	2024-05-14T21:23:35.883780864Z	1715721816
Chile_406	Chile_406	2024-05-01	PC24B302	2E	PAW73369	FLO-PRO114M	131.699	7.426	22.79	24.02.10	NA	NA	7431	7118	2024-05-01T19:49:27.062103580Z	1714592967
Chile_509	Chile_509	2024-05-14	PC24B302	2F	PAW61512	FLO-PRO114M	117.861	7.947	17.12	24.02.10	NA	NA	7604	7028	2024-05-14T21:28:27.860263815Z	1715722108
Chile_511	Chile_511	2024-05-08	PC24B302	2D	PAW71368	FLO-PRO114M	133.097	7.548	20.69	24.02.10	NA	NA	7370	6972	2024-05-08T20:46:57.642743302Z	1715201218
Chile_516	Chile_516	2024-05-01	PC24B302	2D	PAW71977	FLO-PRO114M	129.544	6.625	23.63	24.02.10	NA	NA	7837	7424	2024-05-01T19:47:57.769730497Z	1714592878

The file list should contain paths to each report on each line, like so:

/data/CARD_AUX/LRS_temp/CHILE/SEQ_REPORTS/Chile_404/PAW33034/other_reports_PAW33034/report_PAW33034_20240514_2123_c7d4ac03.json
/data/CARD_AUX/LRS_temp/CHILE/SEQ_REPORTS/Chile_406/PAW73369/other_reports_PAW73369/report_PAW73369_20240501_1949_4e83275d.json
/data/CARD_AUX/LRS_temp/CHILE/SEQ_REPORTS/Chile_509/PAW61512/other_reports_PAW61512/report_PAW61512_20240514_2128_5a659d71.json
/data/CARD_AUX/LRS_temp/CHILE/SEQ_REPORTS/Chile_511/PAW71368/other_reports_PAW71368/report_PAW71368_20240508_2046_20d811c8.json
/data/CARD_AUX/LRS_temp/CHILE/SEQ_REPORTS/Chile_516/PAW71977/other_reports_PAW71977/report_PAW71977_20240501_1947_c6a80210.json

Example usage (python CARDlongread_extract_from_json.py -h):

usage: CARDlongread_extract_from_json.py [-h] [--json_dir JSON_DIR] [--filelist FILELIST] [--output OUTPUT_FILE]

Extract data from long read JSON report

optional arguments:
  -h, --help            show this help message and exit
  --json_dir JSON_DIR   path to directory containing JSON files, if converting whole directory
  --filelist FILELIST   text file containing list of all JSON reports to parse
  --output OUTPUT_FILE  Output long read JSON report summary table in tab-delimited format

CARDlongread_extract_summary_statistics.py then generates an sequencing QC analytics spreadsheet from the output table of CARDlongread_extract_from_json.py containing a sequencing statistics summary table and both violin plot and scatter plot visualizations of data output, read N50, and starting active pores (active pores after starting sequencing). Recent updates incorporate evaluation of active pores per flow cell at the time of initial checks (platform QC) as well, further calculating differences in active pore count between platform QC and the start of sequencing, and visualizing relationships between these differences and run data output. Violin plots are provided separately for output (Gbp) per run (corresponding to each line in the input TSV table), per flow cell, and per experiment. Individual runs (lines in TSV table) are highlighted indicating whether they are an initial run, top up, reconnection, or recovery.

Sequencing runs are typically conducted over 72 hours, with one 20 fmol library load every 24 hours.

An initial run corresponds either to a sequencing run that went through three full loads successfully (one every 24 hours) or to the first part of a full sequencing run where the flow cell was later reconnected at a distinct position (e.g., 1E to 3C) or temporarily disconnected and reconnected at the same position (e.g., 1E).

A reconnection is the second part of a run after a flow cell is disconnected and reconnected, as described above. Reconnections are identified from lines in the JSON parser output TSV that have the same sample name and same flow cell ID.

A recovery run is conducted when either insufficient library is prepared for three full loads (over three days) or when active pores substantially decrease from pre-sequencing checks after the first load. Library is recovered by slow aspiration from the flow cell sample port itself. Library is generally recovered after the first load (often due to active pore drop off) and either loaded onto a new flow cell or to the same flow cell for a subsequent load.

A top up run is conducted on a new flow cell when an experiment has not yielded the desired 90 Gbp (30x coverage) from the initial run (and reconnections/recoveries where applicable). Top up runs are conducted either with initial library if there is sufficient library for more than three loads or with new library if not.

Top up runs are labeled with the suffix _topup and recovery runs are labeled with the suffix _recovery in the sample name column.

Example data visualizations corresponding to data snippets shown earlier:

Total per flow cell output violin plot with embedded box plot and displayed data points:

image

Per run data output vs. starting active pore scatter plot with run type annotated by color and cutoffs provided for starting active pores (red - less than 5000 pores for ONT warranty return, green - 6500 pores or higher for internal QC) and data output (gray - desired 30X human genome coverage or 90 Gbp sequencing output).

image

Example usage (python CARDlongread_extract_summary_statistics.py -h):

usage: CARDlongread_extract_summary_statistics.py [-h] [-input INPUT_FILE [INPUT_FILE ...]] [-names [NAMES ...]] [-output OUTPUT_FILE] [-platform_qc PLATFORM_QC] [-plot_title PLOT_TITLE] [--plot_cutoff | --no-plot_cutoff]
                                                  [-run_cutoff RUN_CUTOFF] [--strip_plot | --no-strip_plot] [-colors [COLORS ...]] [-legend_colors [LEGEND_COLORS ...]] [-legend_labels [LEGEND_LABELS ...]] [--group_count | --no-group_count]

This program gets summary statistics from long read sequencing report data.

optional arguments:
  -h, --help            show this help message and exit
  -input INPUT_FILE [INPUT_FILE ...]
                        Input tab-delimited tsv file(s) containing features extracted from long read sequencing reports.
  -names [NAMES ...]    Names corresponding to input tsv file(s); required if more than one tsv provided.
  -output OUTPUT_FILE   Output long read sequencing summary statistics XLSX
  -platform_qc PLATFORM_QC
                        Input platform QC table to calculate active pore dropoff upon sequencing (optional)
  -plot_title PLOT_TITLE
                        Title for each plot in output XLSX (optional)
  --plot_cutoff, --no-plot_cutoff
                        Include cutoff lines in violin plots (optional; default true; --no-plot_cutoff to override) (default: True)
  -run_cutoff RUN_CUTOFF
                        Minimum data output per flow cell run to include (optional, 1 Gb default)
  --strip_plot, --no-strip_plot
                        Show strip plots instead of swarm plots inside violin plots (optional; default false) (default: False)
  -colors [COLORS ...]  Color palette corresponding to sequential groups displayed (e.g., 'blue', 'red', 'blue'); optional and used only if more than one tsv provided.
  -legend_colors [LEGEND_COLORS ...]
                        Colors shown in the legend (e.g., 'blue', 'red'); optional and used only if more color palette included above. Must be palette subset.
  -legend_labels [LEGEND_LABELS ...]
                        Labels for each color in legend in order specified in -legend_colors.
  --group_count, --no-group_count
                        Show group count in x-axis labels (optional; default false) (default: False)

Tutorial

To clone from GitHub and do a test run:

# Download this repo
git clone https://github.com/molleraj/CARDlongread-report-parser.git
cd CARDlongread-report-parser

# Test on pregenerated set of 50 random PPMI JSON files (use exact file provided)
# Random JSON file set generated like so with Linux coreutils shuf (shuffle) command
# shuf -n 50 full_PPMI_json_report_list_090524.txt > random_50_test_PPMI_json_paths_090524.txt
# random_50_test_PPMI_json_paths_090524.txt included in repository as example_json_reports.txt

# Execute with file list of json reports (one per line):
python3 CARDlongread_extract_from_json.py --filelist example_json_reports.txt --output example_output.tsv

# Alternatively, execute on all json files within a directory
# (does not descend into subdirectories)
python3 CARDlongread_extract_from_json.py --json_dir /data/CARDPB/data/PPMI/SEQ_REPORTS/example_json_reports/ --output example_output.tsv

# Make sequencing QC analytics spreadsheet from above QC output table (example_output.tsv)
python3 CARDlongread_extract_summary_statistics.py -input example_output.tsv -output example_summary_spreadsheet.xlsx -platform_qc example_platform_qc.csv -plot_title "PPMI tutorial example"

Example sequencing QC visualizations from tutorial summary spreadsheet:

Per run output violin plot with embedded box plot and displayed data points, plus run type annotated by point color and red cutoff line for desired 30x coverage/90 Gbp output:

image

Per run starting active pores violin plot with embedded box plot and displayed data points, plus run type annotated by point color and red cutoff line for recommended minimum 6500 starting active pores:

image

Per run data output vs. starting active pore scatter plot with run type annotated by point color and cutoffs provided for starting active pores (red - less than 5000 active pores for ONT warranty return, green - 6500 active pores or higher for internal QC) and data output (gray - desired output of 30X human genome coverage or 90 Gbp sequencing output).

image

Per run data output vs. difference between starting active pores (after sequencing) and platform QC active pores at the initial flow cell check with run type annotated by point color and cutoff provided for data output (gray - desired output of 30X human genome coverage or 90 Gbp sequencing output).

image

Per run data output vs. date sequencing run was conducted with run type annotated by point color and cutoff provided for data output (gray - desired output of 30X human genome coverage or 90 Gbp sequencing output).

image

Comparing QC metrics across groups

Having sequenced more than five cohorts and often over 10 samples each week, we found it often advantageous to compare raw QC metrics across different arbitrarily defined groups. We thus implemented group comparison functionality available through the -input [INPUT_FILE ...], -names [NAMES ...], and/or -colors [COLORS ...] command line options. We have thus provided an additional tutorial below demonstrating group comparison with custom coloring and labeling for 20 sequencing runs randomly selected from each of five different cohorts. Paths provided in JSON lists are from the NIH Biowulf HPC cluster. Input files are provided in the provided group_comparison folder.

# run in CARDlongread-report-parser directory
cd CARDlongread-report-parser

# prepare input tables for each cohort
# cohort 1
python CARDlongread_extract_from_json.py --filelist group_comparison/cohort_1_json_list.txt --output group_comparison/cohort_1_output.tsv
# cohort 2
python CARDlongread_extract_from_json.py --filelist group_comparison/cohort_2_json_list.txt --output group_comparison/cohort_2_output.tsv
# cohort 3
python CARDlongread_extract_from_json.py --filelist group_comparison/cohort_3_json_list.txt --output group_comparison/cohort_3_output.tsv
# cohort 4
python CARDlongread_extract_from_json.py --filelist group_comparison/cohort_4_json_list.txt --output group_comparison/cohort_4_output.tsv
# cohort 5
python CARDlongread_extract_from_json.py --filelist group_comparison/cohort_5_json_list.txt --output group_comparison/cohort_5_output.tsv

# make dashboard for all five cohorts, coloring cohorts by sample type (blood or brain)
python CARDlongread_extract_summary_statistics.py \
  -input \
  -names "Cohort 1" "Cohort 2" "Cohort 3" "Cohort 4" "Cohort 5" \
  -output \
  -platform_qc \
  --strip_plot

About

Renamed copy of the forked longread-report-parser repository

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages