Taxa overlap between the query database and the raw metagenomics database

Why we do this:
The raw metagenomics database represents the most comprehensive snapshot of the microbial community present in the sample based on DNA sequencing. By comparing the taxa in our DIAMOND-based protein hits (from custom query databases) to this reference set, we can assess how well our metaproteomics-derived proteins reflect the underlying microbial composition.

What the results imply:

A high overlap indicates that the identified protein taxa are representative of the microbes in the sample.

A low overlap may suggest detection limitations, missing reference proteins, or differences between transcriptomic/proteomic activity and genomic presence.

How we do this:
We load both:

A CSV file of unique taxa detected in the metagenomics DNA reference.

The annotated DIAMOND results which contain organism names from matched protein hits.

We filter by a selected taxonomic rank (e.g., "species") and compute:

Unique taxa in each set.

The intersection and differences between them.

The percentage overlap in both directions (proteomics → genomics and vice versa).
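
The set arithmetic listed above can be sketched on toy data. This is a minimal illustration rather than the notebook's own code: the function name `overlap_summary` and the placeholder taxon names are assumptions made for the example.

```python
def overlap_summary(query_taxa, meta_taxa):
    """Return shared/unshared counts and the overlap percentage in both directions."""
    shared = query_taxa & meta_taxa

    def pct(part, whole):
        # Guard against an empty set so the result is 0.0 instead of a ZeroDivisionError
        return len(part) / len(whole) * 100 if whole else 0.0

    return {
        "shared": len(shared),
        "query_only": len(query_taxa - meta_taxa),
        "meta_only": len(meta_taxa - query_taxa),
        "pct_query_in_meta": pct(shared, query_taxa),
        "pct_meta_in_query": pct(shared, meta_taxa),
    }


# Placeholder taxon names, for illustration only
query = {"taxon_A", "taxon_B", "taxon_C"}
meta = {"taxon_B", "taxon_C", "taxon_D", "taxon_E", "taxon_F"}

summary = overlap_summary(query, meta)
print(summary)
```

The two percentages share a numerator (the intersection) but use different denominators, so they can differ widely when one set is much larger than the other.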

In [23]:
# === File paths ===
# Paste the path of the raw metagenomics taxa CSV file
metagenomics_taxa_csv = r"C:\Users\Yusuf\OneDrive\LST\Derde_jaar\Y3Q4\Metaproteomics_with_db\db_results_analysis\unique_taxa_in_metagendb.csv"

# Paste the path of the query DataFrame CSV file
query_df_csv = r"C:\Users\Yusuf\OneDrive\LST\Derde_jaar\Y3Q4\Metaproteomics_with_db\db_results_analysis\unique_taxa_in_merged_pept2lca.csv"

# Give a label for the query DataFrame
query_db_label = "Pept2lca hits dataframe"

# === CONFIGURATION ===
# Choose a taxonomic rank (e.g., "strain", "species", "genus", etc.)
# Set to None to compare all taxa regardless of rank
selected_rank = "genus"  # ← modify this line only

In [24]:
import pandas as pd

# === Load and filter metagenomics taxa ===
df_meta = pd.read_csv(metagenomics_taxa_csv)
if selected_rank:
    df_meta = df_meta[df_meta["rank"] == selected_rank]
meta_taxa = set(df_meta["taxon_name"])

# === Load and filter DIAMOND-based taxa ===
df_query = pd.read_csv(query_df_csv)
if selected_rank:
    df_query = df_query[df_query["taxonomy_rank"] == selected_rank]
query_taxa = set(df_query["organism"])

# === Compare sets ===
query_in_meta = query_taxa & meta_taxa
query_not_in_meta = query_taxa - meta_taxa

meta_in_query = meta_taxa & query_taxa
meta_not_in_query = meta_taxa - query_taxa

# === Print results ===
rank_label = selected_rank or "all ranks"

# Compute the percentages up front so that an empty set still prints a labelled 0.00%
pct_query_found = len(query_in_meta) / len(query_taxa) * 100 if query_taxa else 0.0
pct_meta_found = len(meta_in_query) / len(meta_taxa) * 100 if meta_taxa else 0.0

print(f"Total unique {rank_label} taxa in {query_db_label}: {len(query_taxa)}")
print(f"{query_db_label} taxa found in metagenomics raw DB ({rank_label}): {len(query_in_meta)}")
print(f"{query_db_label} taxa NOT found in metagenomics raw DB ({rank_label}): {len(query_not_in_meta)}")
print(f"Percentage of {query_db_label} taxa found: {pct_query_found:.2f}%")

print(f"\nTotal unique {rank_label} taxa in metagenomics: {len(meta_taxa)}")
print(f"Metagenomics raw DB taxa found in {query_db_label} ({rank_label}): {len(meta_in_query)}")
print(f"Metagenomics raw DB taxa NOT found in {query_db_label} ({rank_label}): {len(meta_not_in_query)}")
print(f"Percentage of raw metagenomics taxa found in {query_db_label}: {pct_meta_found:.2f}%")


Total unique genus taxa in Pept2lca hits dataframe: 7
Pept2lca hits dataframe taxa found in metagenomics raw DB (genus): 6
Pept2lca hits dataframe taxa NOT found in metagenomics raw DB (genus): 1
Percentage of Pept2lca hits dataframe taxa found: 85.71%

Total unique genus taxa in metagenomics: 1003
Metagenomics raw DB taxa found in Pept2lca hits dataframe (genus): 6
Metagenomics raw DB taxa NOT found in Pept2lca hits dataframe (genus): 997
Percentage of raw metagenomics taxa found in Pept2lca hits dataframe: 0.60%
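
The counts above say how many taxa are unmatched but not which ones. A possible follow-up, sketched here with placeholder names rather than the real data, is to print the unmatched names in sorted order. In the notebook the same `sorted(...)` call would apply to the existing `query_not_in_meta` set.

```python
# Placeholder sets standing in for query_taxa and meta_taxa (not real results)
query_taxa = {"genus_A", "genus_B", "genus_C"}
meta_taxa = {"genus_A", "genus_B", "genus_D"}

query_not_in_meta = query_taxa - meta_taxa

# Sets have no defined order, so sort for a stable, readable listing
unmatched = sorted(query_not_in_meta)
for name in unmatched:
    print(name)
```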


Taxa overlap between the query database and metagenomics taxa from PSMs

Why we do this:
This comparison focuses on a more specific subset of the metagenomics database: the taxa whose proteins actually matched MS/MS spectra in a peptide-spectrum match (PSM) search. Restricting the metagenomics DB to what was detectable via peptide-spectrum matching makes this a stricter comparison.

What the results imply:

A high overlap shows that the DIAMOND-matched protein taxa from our custom DBs also match those from PSM-based identifications, supporting biological and technical consistency.

A low overlap could indicate that our DIAMOND DBs capture complementary protein evidence that was not matched through traditional PSM methods.

How we do this:
We compare:

The unique taxa found in the PSM-annotated metagenomics database, filtered at a chosen rank.

Against the unique taxa from the DIAMOND alignment results.

We then compute:

Counts and percentages of shared and unshared taxa.

Insights into how much proteomic signal (from de novo + DIAMOND) overlaps with more confident, PSM-based identifications.
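
The two metagenomics CSVs in this notebook name their rank column differently (`rank` in the raw taxa file, `taxon_rank` in the PSM file), so the loading step could be wrapped in one helper that takes the column names as arguments. The sketch below assumes a hypothetical helper `load_taxa` and reads made-up rows from in-memory CSV text so that it runs on its own. The `dropna()` call is an addition not present in the notebook cells.

```python
import io

import pandas as pd


def load_taxa(csv_source, name_col, rank_col, rank=None):
    """Read a taxa CSV and return the set of names, optionally filtered to one rank."""
    df = pd.read_csv(csv_source)
    if rank:
        df = df[df[rank_col] == rank]
    # dropna() keeps missing names (NaN) out of the set
    return set(df[name_col].dropna())


# Made-up rows for illustration; the real inputs are the CSV paths set in the configuration cell
psm_csv = io.StringIO(
    "taxon_name,taxon_rank\n"
    "genus_A,genus\n"
    "genus_B,genus\n"
    "species_X,species\n"
)
query_csv = io.StringIO(
    "organism,taxonomy_rank\n"
    "genus_A,genus\n"
    "genus_C,genus\n"
)

meta_taxa = load_taxa(psm_csv, "taxon_name", "taxon_rank", rank="genus")
query_taxa = load_taxa(query_csv, "organism", "taxonomy_rank", rank="genus")
print(sorted(meta_taxa & query_taxa))
```

Passing the column names explicitly avoids a `KeyError` when the same comparison is pointed at a file with a different header.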

In [25]:
# === File paths ===
# Paste the path of the metagenomics PSM taxa CSV file
metagenomics_psm_taxa_csv = r"C:\Users\Yusuf\OneDrive\LST\Derde_jaar\Y3Q4\Metaproteomics_with_db\Community_comparisons\metagenomics_psm_taxa_annotated.csv"

# Paste the path of the query DataFrame CSV file
query_df_csv = r"C:\Users\Yusuf\OneDrive\LST\Derde_jaar\Y3Q4\Metaproteomics_with_db\db_results_analysis\unique_taxa_in_merged_pept2lca.csv"

# Give a label for the query DataFrame
query_db_label = "Pept2lca hits dataframe"

# === CONFIGURATION ===
# Choose a taxonomic rank (e.g., "strain", "species", "genus", etc.)
# Set to None to compare all taxa regardless of rank
selected_rank = "genus"  # ← modify this line only

In [26]:
import pandas as pd

# === Load and filter metagenomics taxa ===
df_meta = pd.read_csv(metagenomics_psm_taxa_csv)
if selected_rank:
    df_meta = df_meta[df_meta["taxon_rank"] == selected_rank]
meta_taxa = set(df_meta["taxon_name"])

# === Load and filter DIAMOND-based taxa ===
df_query = pd.read_csv(query_df_csv)
if selected_rank:
    df_query = df_query[df_query["taxonomy_rank"] == selected_rank]
query_taxa = set(df_query["organism"])

# === Compare sets ===
query_in_meta = query_taxa & meta_taxa
query_not_in_meta = query_taxa - meta_taxa

meta_in_query = meta_taxa & query_taxa
meta_not_in_query = meta_taxa - query_taxa

# === Print results ===
rank_label = selected_rank or "all ranks"

# Compute the percentages up front so that an empty set still prints a labelled 0.00%
pct_query_found = len(query_in_meta) / len(query_taxa) * 100 if query_taxa else 0.0
pct_meta_found = len(meta_in_query) / len(meta_taxa) * 100 if meta_taxa else 0.0

print(f"Total unique {rank_label} taxa in {query_db_label}: {len(query_taxa)}")
print(f"{query_db_label} taxa found in metagenomics psm DB ({rank_label}): {len(query_in_meta)}")
print(f"{query_db_label} taxa NOT found in metagenomics psm DB ({rank_label}): {len(query_not_in_meta)}")
print(f"Percentage of {query_db_label} taxa found: {pct_query_found:.2f}%")

print(f"\nTotal unique {rank_label} taxa in metagenomics psm: {len(meta_taxa)}")
print(f"Metagenomics psm DB taxa found in {query_db_label} ({rank_label}): {len(meta_in_query)}")
print(f"Metagenomics psm DB taxa NOT found in {query_db_label} ({rank_label}): {len(meta_not_in_query)}")
print(f"Percentage of metagenomics psm taxa found in {query_db_label}: {pct_meta_found:.2f}%")

Total unique genus taxa in Pept2lca hits dataframe: 7
Pept2lca hits dataframe taxa found in metagenomics psm DB (genus): 6
Pept2lca hits dataframe taxa NOT found in metagenomics psm DB (genus): 1
Percentage of Pept2lca hits dataframe taxa found: 85.71%

Total unique genus taxa in metagenomics psm: 37
Metagenomics psm DB taxa found in Pept2lca hits dataframe (genus): 6
Metagenomics psm DB taxa NOT found in Pept2lca hits dataframe (genus): 31
Percentage of metagenomics psm taxa found in Pept2lca hits dataframe: 16.22%
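
Both comparisons report a shared count of six genera, and the percentages differ only because of the denominator: 7 query genera, 1003 raw metagenomics genera, or 37 PSM genera. The arithmetic can be checked directly from the printed counts (the variable names below are chosen for this check only):

```python
shared = 6              # shared genera reported in each comparison
query_total = 7         # unique genera in the Pept2lca hits dataframe
raw_meta_total = 1003   # unique genera in the raw metagenomics DB
psm_meta_total = 37     # unique genera in the metagenomics PSM set

pct_query = shared / query_total * 100
pct_raw = shared / raw_meta_total * 100
pct_psm = shared / psm_meta_total * 100

print(f"{pct_query:.2f}%")  # 85.71%
print(f"{pct_raw:.2f}%")    # 0.60%
print(f"{pct_psm:.2f}%")    # 16.22%
```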
