Skip to content

klebgenomics/Public-Klebsiella-Reference-Strains

Repository files navigation

DOI

Introduction

👩‍🔬 The purpose of this resource is to provide a genomic context to Klebsiella reference strains commonly used in research to aid laboratory researchers in strain selection for particular experiments.

🌍 All of these strains may be purchased from their respective culture collections and all genomes are publicly available.

🖥️ The results of this analysis can be accessed from the Microreact project page or directly from this repository. This Markdown also contains the code used to perform and reproduce the analysis for yourself!

Public Klebsiella genomes from the following culture collections are included:

The associated publications of these collections are:

  1. Dicks J, Fazal MA, Oliver K, Grayson NE, Turnbull JD, Bane E, Burnett E,
    Deheer-Graham A, Holroyd N, Kaushal D, Keane J, Langridge G,
    Lomax J, McGregor H, Picton S, Quail M, Singh D, Tracey A, Korlach J,
    Russell JE, Alexander S, Parkhill J. NCTC3000: a century of bacterial strain
    collecting leads to a rich genomic data resource. Microb Genom. 2023 May;9(5):
    mgen000976. doi: 10.1099/mgen.0.000976.
    PMID: 37194944; PMCID: PMC10272881.
  2. Martin MJ, Stribling W, Ong AC, Maybank R, Kwak
    YI, Rosado-Mendez JA, Preston LN, Lane KF, Julius M, Jones AR, Hinkle M,
    Waterman PE, Lesho EP, Lebreton F, Bennett JW, Mc Gann PT. A panel of diverse
    Klebsiella pneumoniae clinical isolates for
    research and development. Microb Genom. 2023 May;9(5):mgen000967.
    doi: 10.1099/mgen.0.000967.
    PMID: 37141116; PMCID: PMC10272860.

Genomic analysis of the genomes includes:

  • 🧬 Kleborate for typing, AMR and virulence gene detection.
  • 🦠 Kaptive for K and O antigen prediction.
  • 🌳 Phylogenetic analysis to infer the relationships between the strains.
  • 📊 Visualisation of the dataset with Microreact.

🎉 All associated metadata has also been included! 🎉

If you found this resource useful, please consider cite: Stanton, T. D., Wyres, K., & Holt, K. (2023). Genomic Analysis of Public Klebsiella Reference Strains (0.0.1) [Data set]. Zenodo. 10.5281/zenodo.1041996.

Microreact visualisation

The results of this analysis can be accessed from the Microreact project page.

Interpreting the data

Microreact screenshot

The Microreact project is laid out in three panels:

  • Tree panel: Two phylogenetic trees showing the relatedness between all strains and just the KpSC strains.

  • Map panel: The map panel shows the geographical distribution of the strains if they have associated location metadata.

  • Metadata panel: A filterable table containing all the metadata associated with the strains.

The values in the strain column represent the names given by the culture collections. These names can be used to request/purchase any strains of interest from the respective collection.

The tree tips are coloured by species and the blocks are coloured by MLST and K-locus where available.

Only the most important columns are shown in the table, however more columns can be selected with the "Columns" dropdown.

We have included data for all Klebsiella species, not just Klebsiella pneumoniae, but you can view data for species of your choice using the "species" column.

MLST results were generated by the MLST tool (MLST column) and Kleborate (ST column). Kleborate currently only types KpSC so the MLST column contains results for all Klebsiella species.

K- and O- locus typing was performed with Kaptive 3. Where confidence is low, the locus is marked as "Untypeable". These results should be interpreted with caution!

Tutorial

Fetching the genomes

To fetch the genomes, you can either download each one manually 🙅‍♀️ or use the NCBI Datasets and ATCC Genome Portal APIs to fetch them in bulk 🙌.

NCTC and BEI genomes

First, I identified the associated BioProjects for both collections:

Then, I used the NCBI Datasets API to download the genomes from both BioProjects.

datasets download genome accession PRJEB6403 PRJNA717739 --search Klebsiella --assembly-source GenBank

ATCC genomes

To use the ATCC Genome Portal API, you will need to obtain an API key from the ATCC Genome Portal.

Then, I used Python to search for and download all the Klebsiella genomes from the ATCC collection.

from genome_portal_api import *
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your API key
download_catalogue(api_key=API_KEY, output="atcc_catalogue.pkl")  # Download ATCC catalogue
# Fuzzy search for Klebsiella strains in the ATCC catalogue
klebsiella_metadata = {x['product_id']: x for x in search_fuzzy(term="Klebsiella", catalogue_path="atcc_catalogue.pkl")}

# Identify non-Klebsiella strains
non_klebsiella_metadata = {k: v for k, v in klebsiella_metadata.items() if 'Klebsiella' not in v['taxon_name']}
for acc in non_klebsiella_metadata.keys():
    klebsiella_metadata.pop(acc)  # Remove non-Klebsiella strains from Klebsiella metadata

with open('atcc_assembly_urls.txt', 'wt') as f:  # Write the assembly URLs to a file
    for i in klebsiella_metadata.values():
        url = download_assembly(api_key=API_KEY, id=i['id'], download_link_only="True", download_assembly="False")
        f.write(f"{url}\n")

Now you can use curl to download all the genomes and parallelize with xargs.

xargs -P 8 -n 1 curl -O < atcc_assembly_urls.txt

Fetch Metadata

NCTC and BEI metadata

I used the NCBI Datasets API to download the metadata for all the genomes in the NCTC3000 and BEI BioProjects.

datasets summary genome accession PRJEB6403 PRJNA717739 --search Klebsiella --as-json-lines --assembly-source GenBank | \
dataformat tsv genome > NCTC3000_BEI_metadata.tsv

ATCC metadata

We can use the same metadata dictionary we created earlier to format the metadata for the ATCC genomes into a TSV file.

def flatten_dict(parent_dict: dict):
    """Function to recursively flatten a nested dictionary"""
    flattened = {}  # Create an empty dictionary to store the flattened dictionary
    for k, v in parent_dict.items():
        if isinstance(v, dict):  # Recursively flatten dictionaries
            flattened.update(flatten_dict({f"{k}_{k2}": v2 for k2, v2 in v.items()}))
        elif isinstance(v, list):  # Join lists into comma separated strings
            flattened[k] = ','.join(str(i) for i in v)
        else:
            flattened[k] = v
    return flattened

import pandas as pd
data = {}  # Create an empty dictionary to store the flattened metadata
for acc, metadata in klebsiella_metadata.items():  # Iterate over the metadata we just downloaded
    metadata.pop('primary_assembly')  # Remove the primary assembly from the metadata
    data[acc] = flatten_dict(metadata)  # Flatten the metadata dictionary and add to the data dictionary

df = pd.DataFrame.from_dict(data, orient='index')  # Convert the data dictionary to a Pandas DataFrame
df.to_csv('atcc_metadata.tsv', sep='\t', index=False)  # Write the DataFrame to a TSV file

MLST

We'll use the MLST tool to perform multi-locus sequence typing on all the genomes, including Klebsiella species that are not part of the Klebsiella pneumoniae species complex.

mlst assemblies/*.fna > mlst.tsv

Now load and parse the tab-separated results.

mlst <- readr::read_tsv(
  "mlst.tsv", 
  col_types=readr::cols_only(Assembly="c", MLST_scheme="c", MLST="c"), 
  col_names = c("Assembly", 'MLST_scheme', "MLST")) |> 
  dplyr::mutate(Assembly = fs::path_ext_remove(fs::path_file(Assembly)))

Now we'll mark all those genomes that are part of the Klebsiella pneumoniae species complex.

kpneumoniae <- mlst |> 
  dplyr::filter(MLST_scheme == "kpneumoniae") |> 
  dplyr::pull(Assembly)

Kleborate

Kleborate was run on all the genomes to detect AMR, virulence genes and to perform typing. Details of how to use Kleborate can be found in the wiki.

kleborate -a assemblies/*.fna --resistance -o reference_klebs_kleborate.txt

Now load and parse the tab-separated results.

kleborate <- readr::read_tsv("reference_klebs_kleborate.tsv", show_col_types = FALSE) |>
  # Drop Kleborate K/O results in favour of full Kaptive results
  # Rename strain to Assembly to match Kaptive data
  dplyr::select(!tidyselect::matches("_(locus|type)"), 'Assembly'=strain)

Kaptive

Next, we'll use Kaptive to perform in silico K and O antigen typing. Details of how to use Kaptive can be found in the wiki.

kaptive kpsc_k assemblies/*.fna > reference_klebs_kaptive_k.tsv
kaptive kpsc_o assemblies/*.fna > reference_klebs_kaptive_o.tsv

Now load and parse the tab-separated results.

kaptive <- dplyr::full_join(
    readr::read_tsv("reference_klebs_kaptive_k.tsv", show_col_types=FALSE),
    readr::read_tsv("reference_klebs_kaptive_o.tsv", show_col_types=FALSE), 
    by="Assembly", suffix = c(" K-locus", " O-locus")
  ) |> 
  dplyr::rename_with(~stringr::str_remove(.x, "Best match locus ")) |> 
  dplyr::mutate(
    `K-locus` = dplyr::if_else(`Confidence K-locus` == "Untypeable", paste(`K-locus`, "(Untypeable)"), `K-locus`),
    `O-locus` = dplyr::if_else(`Confidence O-locus` == "Untypeable", paste(`O-locus`, "(Untypeable)"), `O-locus`),
    `Confidence K-locus` = NULL, `Confidence O-locus` = NULL
    )

Merge analysis

Now we'll merge all our results so far.

data <- dplyr::select(kleborate, 1:9) |>
  dplyr::left_join(mlst, by="Assembly") |>
  dplyr::left_join(dplyr::filter(kaptive, Assembly %in% kpneumoniae), by="Assembly") |>
  dplyr::left_join(dplyr::filter(dplyr::select(kleborate, -2:-9), Assembly %in% kpneumoniae), by="Assembly") |>
  # Extract properly formatted NCBI/ATCC accessions to match metadata
  dplyr::mutate(
    accession = dplyr::if_else(
      startsWith(Assembly, "GC"), 
      stringr::str_extract(Assembly, "GC(A|F)_[0-9]+\\.[0-9]"),
      stringr::str_replace(stringr::str_remove(Assembly, ".*ATCC_"), '_', '-')
      ),
    # Extract ST from Kleborate results if MLST is missing
    x = stringr::str_extract(ST, "[0-9]+"),
    MLST = ifelse(!is.na(x) & x != MLST, x, MLST),
    x=NULL,
    .before = 1
    )

Now we can merge all the metadata.

metadata <- dplyr::bind_rows(
      readr::read_tsv(
        "NCTC3000_BEI_metadata.tsv", show_col_types = FALSE, 
        name_repair = ~stringr::str_to_lower(stringr::str_replace_all(.x, "\\s+", "_"))
        ) |> 
        dplyr::select(accession=current_accession, assembly_biosample_attribute_name, assembly_biosample_attribute_value) |> 
        dplyr::filter(accession %in% data$accession) |> 
        dplyr::distinct() |> 
        tidyr::pivot_wider(id_cols=1, names_from = 2, values_from = 3) |> 
        dplyr::select(strain, accession, geo_loc_name, serovar),
      readr::read_tsv("atcc_metadata.tsv", show_col_types = FALSE) |> 
        dplyr::select(strain=name, accession=product_id)
    ) |> 
  dplyr::mutate(geo_loc_name = dplyr::if_else(is.na(geo_loc_name), "", geo_loc_name)) |>
  ggmap::mutate_geocode(geo_loc_name)

Writing results

dplyr::inner_join(metadata, data, by="accession") |> 
  readr::write_tsv("data.tsv", na = "")

Phylogenetic tree

Finally we'll do a quick and dirty phylogenetic inference using pairwise Jaccard distance estimates. This method is less accurate than a proper recombination-free SNP-based phylogeny, but as we have genomes from multiple species, this is an appropriate method to use and follows the recommendation of this paper.

mash sketch -p 8 -o genomes.msh -s 100000 -k 21 assemblies/*.fna  # Create a mash sketch
mash dist genomes.msh genomes.msh -p 8 -t > genomes.dist  # Calculate pairwise distances

Then a neighbor-joining tree is calculated using the BIONJ algorithm, implemented in the ape R package and exported in Newick format after some cleaning.

tree <- readr::read_tsv("genomes.dist", show_col_types=FALSE) |>
    tibble::column_to_rownames('#query') |>
    base::as.matrix() |>
    ape::bionj()  # neighbour-joining tree using the BIONJ algorithm

We edit the tree to remove the file extension from the tip labels, set any negative branch lengths to zero, and midpoint root the tree.

tree$tip.label <- purrr::map_chr(tree$tip.label, ~fs::path_ext_remove(fs::path_file(.x)))
tree$edge.length <- base::pmax(tree$edge.length, 0.0)  # set any negative branch lengths to zero
tree |> 
  phangorn::midpoint() |>   # midpoint root the tree
  ape::write.tree("tree.newick")  # write the tree to a file

We'll also create a separate tree just for the Klebsiella pneumoniae species complex genomes.

tree |> 
  ape::drop.tip(tree$tip.label[!tree$tip.label %in% kpneumoniae]) |> 
  phangorn::midpoint() |>   # midpoint root the tree
  ape::write.tree("kpsc_tree.newick")  # write the tree to a file