<a href="https://colab.research.google.com/github/CDPHE-bioinformatics/ncbi-cluster-tracker/blob/main/ncbi_cluster_tracker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ncbi-cluster-tracker

Use this Colab notebook to run [ncbi-cluster-tracker](https://github.com/CDPHE-bioinformatics/ncbi-cluster-tracker), a tool for creating HTML reports to track SNP clusters within the [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens/) system. Given an input sample sheet CSV containing BioSample IDs for isolates of interest (referred to as "internal isolates"), the tool creates a report with a high-level overview of all of the clusters associated with the internal isolates. For each cluster the output report links to the corresponding NCBI Pathogen Detection tree and displays additional visualizations such as a pairwise SNP distance matrix. Any additional metadata or alternate IDs provided in the sample sheet are used to further annotate the internal isolates within the report. If provided a ZIP file output from a previous report generation, ncbi-cluster-tracker will compare to the previous report and indicate any new isolates, new clusters, and changes to isolate counts for existing clusters.


## Environment setup
Run the next cell, and click "Restart session" when prompted. Do not run the cell again once the session has restarted.

You may see some warnings and the error message shown below, but this should not cause problems and can be ignored.

```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed
```

In [None]:
# install pip for new python
!sudo apt-get install python3.12-distutils
!wget https://bootstrap.pypa.io/get-pip.py
!python get-pip.py
!pip install ncbi-cluster-tracker

## Verify installation

Run the next cell to verify ncbi-cluster-tracker installed successfully.

In [None]:
!ncbi-cluster-tracker --version
!ncbi-cluster-tracker --help

## Data upload and configuration
1. Create and upload a sample sheet CSV file with your isolates of interest. The file name must begin with `metadata` and have the `.csv` extension. The only required column is `biosample` with the NCBI BioSample IDs of the isolates. Optionally, you can specify alternate sample IDs with an `id` column, and exact collection dates with a `collection_date` column. Any number of additional user-defined metadata columns can also be added. An example sample sheet CSV is shown below. To upload the file, click the folder icon to the left then click the upload icon.

```
biosample,id,collection_date,facility_id
SAMN44007286,lab_id_01,2024-10-01,facility A
SAMN44382137,lab_id_02,2024-10-22,facility A
SAMN40977906,lab_id_03,2024-04-16,facility A
SAMN39926221,lab_id_04,2024-02-13,facility B
SAMN39127228,lab_id_05,2023-12-27,facility B
SAMN36340825,lab_id_06,2023-07-07,facility C
SAMN42236741,lab_id_07,2024-07-03,facility D
SAMN43374642,lab_id_08,2024-08-27,facility D
```

2. If you have run this notebook before and would like to compare clusters and isolate counts to the previous report, upload the ZIP file output from the previous report and set the `comparison_data` variable below to the option beginning "Uploaded ZIP". Otherwise, set `comparison_data` to "Do not compare".

3. If you have a Google Cloud Platform account, set `isolates_browser_data_location` to "BigQuery pdbrowser dataset" and set `project_id` to the ID of your GCP project. Otherwise, you'll need to search for this data manually on the Pathogen Detection website: set `isolates_browser_data_location` to "Uploaded browser*.tsv" and leave `project_id` blank. To obtain this TSV file, copy and paste the "biosample" column from your sample sheet into the [Pathogen Detection Isolates Browser search bar](https://www.ncbi.nlm.nih.gov/pathogens/) and press Enter. Make sure all columns are displayed in the Matched Isolates table that loads by selecting "Choose columns", then click the "Download" button to export the table as a TSV. Open the file, then copy the "SNP cluster" column and paste it into the search bar (while keeping the BioSample IDs from the previous search in the search bar) and press Enter. Download the updated Matched Isolates table as a new TSV file and upload this second TSV file to the notebook. The file name must begin with `browser` and have the `.tsv` extension.

4. Check `show_carbapenemases` to add carbapenemase gene detections from AMRFinderPlus to the tables and labels files in the report.

5. Run the cell below once all the files have been uploaded and the variables have been set.
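
For the manual route in step 3, the "SNP cluster" column can be pulled out of the first TSV with a short script instead of copying it by hand. This is a minimal sketch, not part of ncbi-cluster-tracker: the function name `unique_snp_clusters` is hypothetical, and the column header is assumed to be exactly `SNP cluster` as it appears in the browser table (the header in your download may differ).

```python
import csv

def unique_snp_clusters(tsv_path, column="SNP cluster"):
    """Return the unique, non-empty values of `column`, in first-seen order.

    Assumption: the downloaded TSV has a header row containing `column`.
    """
    seen = {}
    # utf-8-sig reads files with or without a byte-order mark
    with open(tsv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"Column {column!r} not found in {tsv_path}")
        for row in reader:
            value = (row[column] or "").strip()
            if value:  # isolates without a cluster have an empty cell
                seen.setdefault(value, None)
    return list(seen)
```

Calling `print("\n".join(unique_snp_clusters("first_download.tsv")))` (with a hypothetical file name) would list one cluster ID per line, ready to paste into the search bar alongside the BioSample IDs.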

In [None]:
comparison_data = 'Do not compare' # @param ['Do not compare', 'Uploaded ZIP containing isolates_*.csv and clusters_*.csv']
isolates_browser_data_location = 'Uploaded browser*.tsv' # @param ['Uploaded browser*.tsv', 'BigQuery pdbrowser dataset']
project_id = '' #@param {type: 'string'}
show_carbapenemases = True #@param {type: 'boolean'}

import glob
import os
import zipfile

from pathlib import Path
from google.colab import auth

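def get_uploaded_file(glob_pattern):
    """Return the single uploaded file matching glob_pattern, or raise if none or several match."""
    files = glob.glob(glob_pattern)
    if not files:
        raise FileNotFoundError(f'No {glob_pattern} file found')
    if len(files) > 1:
        raise ValueError(f'More than one {glob_pattern} file found: {files}')
    return files[0]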

if isolates_browser_data_location == 'BigQuery pdbrowser dataset':
    if not project_id:
        raise ValueError('`project_id` must be set to use the BigQuery pdbrowser dataset')
    auth.authenticate_user()
    !gcloud config set project {project_id}
    os.environ['GOOGLE_CLOUD_PROJECT'] = project_id # Set environment variable for python subprocess
else:
    browser_file = get_uploaded_file('browser*.tsv')

metadatafile = get_uploaded_file('metadata*.csv')
if comparison_data != 'Do not compare':
    comparison_zip = get_uploaded_file('*.zip')
    comparison_dir = Path(comparison_zip).stem
    # Use `zf` rather than `zip` to avoid shadowing the built-in zip()
    with zipfile.ZipFile(comparison_zip, 'r') as zf:
        zf.extractall(path=comparison_dir)
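
Once the cell above has located the sample sheet, an optional check can catch formatting slips before the report is generated. The sketch below is not part of ncbi-cluster-tracker. It assumes BioSample accessions start with `SAMN`, `SAME`/`SAMEA`, or `SAMD` followed by digits, and it flags duplicate IDs as a likely mistake even though the tool itself may accept them. The function name `check_sample_sheet` is hypothetical.

```python
import csv
import re

# Assumption: accessions look like SAMN..., SAME.../SAMEA..., or SAMD... plus digits
BIOSAMPLE_PATTERN = re.compile(r"^SAM(N|EA?|D)\d+$")

def check_sample_sheet(csv_path):
    """Return a list of human-readable problems found in the sample sheet."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        if "biosample" not in (reader.fieldnames or []):
            return ["missing required `biosample` column"]
        seen = set()
        # Row numbering starts at 2 because row 1 is the header
        for row_number, row in enumerate(reader, start=2):
            biosample = (row["biosample"] or "").strip()
            if not BIOSAMPLE_PATTERN.match(biosample):
                problems.append(f"row {row_number}: unexpected BioSample ID {biosample!r}")
            elif biosample in seen:
                problems.append(f"row {row_number}: duplicate BioSample ID {biosample}")
            seen.add(biosample)
    return problems
```

In the notebook, `check_sample_sheet(metadatafile)` would return an empty list for a sheet like the example in step 1.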

## Generate report

Run the next cell to create the HTML report.

In [None]:
import os
import glob

command = f"ncbi-cluster-tracker '{metadatafile}' --out-dir ."
if isolates_browser_data_location == 'Uploaded browser*.tsv':
    command += f" --browser-file '{browser_file}'"
if comparison_data != 'Do not compare':
    command += f" --compare-dir '{comparison_dir}'"  # quoted like the other paths, in case the name contains spaces
if show_carbapenemases:
    command += ' --amr --filter-amr BETA-LACTAM:CARBAPENEM'

print(f'Running: {command}')
!{command}
# Timestamped folder names sort chronologically, so the last match is the newest report
out_folder = os.path.dirname(sorted(glob.glob('*/*.html'))[-1])
out_clusters_csv = glob.glob(f'{out_folder}/clusters*.csv')[0]
out_isolates_csv = glob.glob(f'{out_folder}/isolates*.csv')[0]
zip_name = f'{out_folder}.zip'
!zip -j '{zip_name}' '{out_clusters_csv}' '{out_isolates_csv}'
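
The cell above builds the command as one shell string and quotes paths by hand. An alternative, sketched here under the assumption that only the flags already used in this notebook are needed, is to assemble an argument list and let `shlex` handle the quoting. The helper name `build_command` is hypothetical.

```python
import shlex

def build_command(metadata_file, browser_file=None, compare_dir=None,
                  show_carbapenemases=False):
    """Assemble the ncbi-cluster-tracker argument list from the notebook settings."""
    args = ["ncbi-cluster-tracker", metadata_file, "--out-dir", "."]
    if browser_file:
        args += ["--browser-file", browser_file]
    if compare_dir:
        args += ["--compare-dir", compare_dir]
    if show_carbapenemases:
        args += ["--amr", "--filter-amr", "BETA-LACTAM:CARBAPENEM"]
    return args

# shlex.join quotes only the arguments that need it
print(shlex.join(build_command("my metadata.csv", compare_dir="old run")))
```

The resulting list can be passed directly to `subprocess.run(args, check=True)`, which skips the shell entirely, so file names containing spaces or quotes need no special handling.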

## Download files

The output HTML report should now appear in the Files explorer to the left, inside a folder named after the timestamp of the run (for example `20260120_021427/clusters_20260120_021427.html`). Download the file and double-click it to open the report in your default web browser. Although the report is displayed in a browser, the data stays local to your computer.

Download the ZIP file with the current cluster data. This file can be uploaded to the notebook the next time a report is created to compare clusters and isolate counts to the previous report.
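
As an alternative to downloading through the Files explorer, both files can be fetched from a code cell. The sketch below assumes the folder layout described above (a timestamped folder holding the HTML report, with the ZIP next to it) and uses Colab's `google.colab.files.download`. The helper names `latest_outputs` and `download_outputs` are hypothetical.

```python
from pathlib import Path

def latest_outputs(root="."):
    """Return (html_report, zip_file) for the most recent timestamped run under root."""
    reports = sorted(Path(root).glob("*/*.html"))
    if not reports:
        raise FileNotFoundError("No HTML report found; run the report cell first")
    report = reports[-1]  # timestamped folder names sort chronologically
    zip_file = Path(f"{report.parent}.zip")  # the ZIP sits next to the run folder
    return report, zip_file

def download_outputs(root="."):
    """Trigger browser downloads for the report and ZIP (works only inside Colab)."""
    from google.colab import files  # imported here so the helper above runs anywhere
    for path in latest_outputs(root):
        files.download(str(path))
```

Running `download_outputs()` in a new cell after the report has been generated should prompt the browser to save both files.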