First, we determine the PSMs obtained with the metagenomics database. Then we determine the PSMs obtained with the custom database and select the peptides that occur only in the metagenomics PSMs. These metagenomics-only peptides are queried in Unipept to look for matching proteins or organisms. If they return few or no matches, they could not have been matched during construction of the custom database either, which points to a limitation of the method of matching de novo peptides with Unipept. If, however, a substantial number of them do return matches, they were simply never queried because they were not among the de novo peptides (otherwise they would have produced those matches), which means that the Novolign method can be optimized.

In [6]:
import pandas as pd

# === Step 1: Load PSM data ===
metagen_psm_path = r"C:\Users\Yusuf\OneDrive\LST\Derde_jaar\Y3Q4\Metaproteomics_with_db\db_psm_results\Peaks_export\YA_RZ_GW_SHMX_DB_analysis_GW_MG_DB_clean\DB search psm metagenomics.csv"
customdb_psm_path = r"C:\Users\Yusuf\OneDrive\LST\Derde_jaar\Y3Q4\Metaproteomics_with_db\db_psm_results\Peaks_export\YA_RZ_GW_SHMX_DB_analysis_genus_db\DB search psm genus.csv"

df_metagen = pd.read_csv(metagen_psm_path)
df_custom = pd.read_csv(customdb_psm_path)

# === Step 2: Extract unique peptides ===
metagen_peptides = set(df_metagen["Peptide"].unique())
custom_peptides = set(df_custom["Peptide"].unique())

# === Step 3: Compare sets ===
overlapping_peptides = metagen_peptides & custom_peptides
only_in_metagen = metagen_peptides - custom_peptides
only_in_custom = custom_peptides - metagen_peptides

# === Step 4: Print summary ===
print(f"🧬 Total unique peptides in metagenomics DB: {len(metagen_peptides)}")
print(f"🧪 Total unique peptides in custom DB: {len(custom_peptides)}\n")

print(f"🔁 Overlapping peptides: {len(overlapping_peptides)}")
print(f"🔍 Peptides only in metagenomics: {len(only_in_metagen)}")
print(f"📦 Peptides only in custom database: {len(only_in_custom)}")

# === Step 5: Optional — Save results to CSV ===
pd.DataFrame({"Peptide": list(only_in_metagen)}).to_csv("only_in_metagenomics.csv", index=False)
pd.DataFrame({"Peptide": list(only_in_custom)}).to_csv("only_in_customdb.csv", index=False)
pd.DataFrame({"Peptide": list(overlapping_peptides)}).to_csv("overlapping_peptides.csv", index=False)


🧬 Total unique peptides in metagenomics DB: 10265
🧪 Total unique peptides in custom DB: 4235

🔁 Overlapping peptides: 3178
🔍 Peptides only in metagenomics: 7087
📦 Peptides only in custom database: 1057
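
As a quick sanity check on the counts above, the overlapping and database-only peptides should add up to each database's total. The sketch below hard-codes the reported counts; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the output above.
total_metagen = 10265
total_custom = 4235
overlap = 3178
only_metagen = 7087
only_custom = 1057

# Each database's unique peptides split into "shared" and "exclusive".
assert overlap + only_metagen == total_metagen
assert overlap + only_custom == total_custom

# Share of the metagenomics peptides that the custom database also recovered.
shared_fraction = overlap / total_metagen
print(f"Custom DB recovers {shared_fraction:.1%} of the metagenomics peptides")
```

Both identities hold for the reported numbers, so the set comparison is internally consistent.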


In [None]:
import pandas as pd
import re

# === Function to clean peptides ===
def wrangle_peptides(sequence: str, ptm_filter: bool=True, li_swap: bool=True) -> str:
    if ptm_filter:
        sequence = "".join(re.findall(r"[A-Z]+", sequence))
    if li_swap:
        sequence = sequence.replace("L", "I")
    return sequence

# === Step 1: Convert the set to a DataFrame ===
only_in_metagen_df = pd.DataFrame({"Peptide": list(only_in_metagen)})

# === Step 2: Apply the cleaning function to each peptide ===
only_in_metagen_df["Cleaned Sequence"] = only_in_metagen_df["Peptide"].apply(lambda x: wrangle_peptides(x))

# === Step 3: Display results ===
print("Shape after cleaning:")
display(only_in_metagen_df.shape)

print("Preview of cleaned sequences (Rows 15-24):")
display(only_in_metagen_df[["Peptide", "Cleaned Sequence"]].iloc[15:25])

# === Optional: Save to file ===
only_in_metagen_df.to_csv("only_in_metagenomics_cleaned.csv", index=False)


Shape after cleaning:


(7087, 2)

Preview of cleaned sequences (Rows 15-24):


| | Peptide | Cleaned Sequence |
|---|---|---|
| 15 | YDGIINPSWLVEASVAHSTNK | YDGIINPSWIVEASVAHSTNK |
| 16 | TGETIGSTGYPLQTNSAAMDR | TGETIGSTGYPIQTNSAAMDR |
| 17 | ATFSVNWR | ATFSVNWR |
| 18 | WNFC(+57.02)EGK | WNFCEGK |
| 19 | HYKEETLIALVEALER | HYKEETIIAIVEAIER |
| 20 | EMC(+57.02)EGSTFETIAQSIR | EMCEGSTFETIAQSIR |
| 21 | GPLAIEQLIGVGGTK | GPIAIEQIIGVGGTK |
| 22 | VGHMEVNYGDAHFR | VGHMEVNYGDAHFR |
| 23 | GRVEIYSPFPTSSFR | GRVEIYSPFPTSSFR |
| 24 | DRAPMLPIANGGSGVAFR | DRAPMIPIANGGSGVAFR |
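
Because cleaning strips PTM annotations and equates L with I, two different raw peptides can collapse into the same cleaned sequence. The sketch below illustrates this with the cleaning function repeated so that it runs on its own. The first two peptides come from the preview above; the last two are hypothetical variants added only to show the collapse, and whether such pairs occur in the real data is not known from the notebook.

```python
import re

def wrangle_peptides(sequence: str, ptm_filter: bool = True, li_swap: bool = True) -> str:
    """Same cleaning logic as the cell above, repeated so the example is self-contained."""
    if ptm_filter:
        # Keep only runs of uppercase letters, which drops annotations like "(+57.02)"
        sequence = "".join(re.findall(r"[A-Z]+", sequence))
    if li_swap:
        # Treat leucine and isoleucine as equivalent
        sequence = sequence.replace("L", "I")
    return sequence

raw = [
    "WNFC(+57.02)EGK",   # from the preview above
    "GPLAIEQLIGVGGTK",   # from the preview above
    "WNFCEGK",           # hypothetical: unmodified form of the first peptide
    "GPIAIEQLLGVGGTK",   # hypothetical: L/I variant of the second peptide
]
cleaned = [wrangle_peptides(p) for p in raw]
unique_cleaned = sorted(set(cleaned))
print(f"{len(raw)} raw peptides -> {len(unique_cleaned)} unique cleaned sequences")
```

If such collapses occur, the number of unique cleaned sequences (for example via `only_in_metagen_df["Cleaned Sequence"].nunique()`) is the more appropriate denominator for any match rate computed later.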


In [10]:
import pandas as pd
import requests
import time

# === Fetch helper ===
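def fetch_request(url: str, retries: int = 3, delay: int = 5) -> requests.Response:
    """GET `url`, retrying on non-200 responses; raise after `retries` failed attempts."""
    status = None
    for attempt in range(1, retries + 1):
        # A timeout keeps a stalled connection from hanging the notebook indefinitely
        req_get = requests.get(url, timeout=60)
        if req_get.status_code == 200:
            return req_get
        status = req_get.status_code
        print(f"Request failed with status {status} (attempt {attempt}/{retries}).")
        if attempt < retries:
            # Only wait when another attempt will follow
            time.sleep(delay)
    raise RuntimeError(f"Request failed after {retries} attempts: status code {status}")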

# === UniPept pept2lca query function ===
def request_unipept_pept_to_lca(pept_df: pd.DataFrame, seq_col: str) -> pd.DataFrame:
    base_url = "http://api.unipept.ugent.be/api/v1/pept2lca.json?equate_il=true"
    batch_size = 100
    columns = [seq_col, "Global LCA", "Global LCA Rank"]
    # Deduplicate so each cleaned sequence is queried only once
    seq_series = ["&input[]=" + seq for seq in pept_df[seq_col].drop_duplicates()]
    rows = []

    # Query in batches to keep each GET URL at a manageable length
    for start in range(0, len(seq_series), batch_size):
        req_str = "".join([base_url, *seq_series[start:start + batch_size]])
        response = fetch_request(req_str).json()
        rows.extend(
            (
                elem.get("peptide", "Unknown"),
                elem.get("taxon_id", "Unknown"),
                elem.get("taxon_rank", "Unknown"),
            )
            for elem in response
        )

    # Build the result once instead of concatenating onto an empty DataFrame per batch
    return pd.DataFrame(rows, columns=columns)
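
The batching and URL construction in the function above can be checked without contacting the server. The following is a minimal sketch using only the standard library. The helper names `chunked` and `build_pept2lca_url` are hypothetical and not part of the original notebook. `urlencode` percent-encodes the brackets in `input[]`, which web servers normally decode back; that this holds for the Unipept endpoint is an assumption here.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.unipept.ugent.be/api/v1/pept2lca.json"

def chunked(items: list[str], size: int) -> list[list[str]]:
    """Split `items` into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_pept2lca_url(peptides: list[str], equate_il: bool = True) -> str:
    """Build one GET URL for a batch of peptides (hypothetical helper)."""
    params = [("equate_il", str(equate_il).lower())]
    params += [("input[]", pep) for pep in peptides]
    return f"{BASE_URL}?{urlencode(params)}"

# Peptides taken from the cleaned sequences shown in this notebook
peptides = ["DYIANPNR", "FWDGIPSDIR", "IAEAGINIR", "EIIIDWSK", "WNFCEGK"]
batches = chunked(peptides, 2)
urls = [build_pept2lca_url(batch) for batch in batches]
print(len(batches), "batches")
print(urls[0])
```

Building the URLs separately from sending them makes it easy to inspect the batch sizes and URL lengths before running the full query.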


In [None]:
# Call the UniPept function on your cleaned metagenomics-only peptides
lca_results_df = request_unipept_pept_to_lca(only_in_metagen_df, "Cleaned Sequence")

# Save results (optional)
lca_results_df.to_csv("only_in_metagenomics_cleaned_lca_results.csv", index=False)

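# Show match statistics
# Note: `total` counts raw metagenomics-only peptides, while the query is deduplicated on the
# cleaned sequence, so the percentage is approximate if cleaning merged any peptides.
matched_count = lca_results_df[lca_results_df["Global LCA"] != "Unknown"].shape[0]
total = only_in_metagen_df.shape[0]
print(f"🔍 Found LCA match for {matched_count} out of {total} peptides ({matched_count/total:.2%})")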

# Display the LCA mapping results as a preview
print("LCA Mapping Results (first 10 peptides):")
display(lca_results_df.head(10))


🔍 Found LCA match for 3279 out of 7087 peptides (46.27%)
LCA Mapping Results (first 10 peptides):


| | Cleaned Sequence | Global LCA | Global LCA Rank |
|---|---|---|---|
| 0 | DYIANPNR | 1 | no rank |
| 1 | FWDGIPSDIR | 28216 | class |
| 2 | IVFVIPIKAPADAIR | 433330 | species |
| 3 | IAEAGINIR | 1 | no rank |
| 4 | IGGSPTDNNAIIFR | 2 | domain |
| 5 | GAEMAITDINAK | 327159 | genus |
| 6 | WIETVEAQEDAK | 1 | no rank |
| 7 | IIFFPDQHIGR | 1 | no rank |
| 8 | KAADEFGINYK | 1 | no rank |
| 9 | EIIIDWSK | 1 | no rank |
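
A match in pept2lca does not by itself mean that a peptide is taxonomically specific: taxon ID 1 is the root of the NCBI taxonomy, so a peptide assigned there is shared too broadly to discriminate between organisms. The sketch below tallies the ranks in the 10-row preview above. It covers only those preview rows, so the proportions are illustrative and say nothing definitive about the full set of 3,279 matches.

```python
from collections import Counter

# (peptide, taxon_id, rank) rows copied from the 10-row preview above
preview = [
    ("DYIANPNR", 1, "no rank"),
    ("FWDGIPSDIR", 28216, "class"),
    ("IVFVIPIKAPADAIR", 433330, "species"),
    ("IAEAGINIR", 1, "no rank"),
    ("IGGSPTDNNAIIFR", 2, "domain"),
    ("GAEMAITDINAK", 327159, "genus"),
    ("WIETVEAQEDAK", 1, "no rank"),
    ("IIFFPDQHIGR", 1, "no rank"),
    ("KAADEFGINYK", 1, "no rank"),
    ("EIIIDWSK", 1, "no rank"),
]

# Count how often each rank occurs
rank_counts = Counter(rank for _, _, rank in preview)

# Peptides assigned to the root (taxon ID 1) carry little taxonomic information
root_hits = sum(1 for _, taxon_id, _ in preview if taxon_id == 1)
below_root = len(preview) - root_hits

print(rank_counts.most_common())
print(f"{root_hits} of {len(preview)} preview peptides map to the root; {below_root} map below it")
```

On the full results, the same breakdown could be obtained with `lca_results_df["Global LCA Rank"].value_counts()`.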


### Interpretation of UniPept LCA Results for Metagenomics-Only Peptides

After identifying the peptides that were **only found in the PSM results from the metagenomics database** (and not in the custom protein database), we cleaned these sequences by removing PTMs and equating leucine with isoleucine. This step ensures compatibility with UniPept’s taxonomic assignment system.

We then submitted the cleaned peptide list (7,087 peptides) to the [UniPept pept2lca](https://unipept.ugent.be/apidocs#pept2lca) API to determine whether these peptides could be linked to any taxa. The goal was to assess whether the custom database construction approach (based on de novo sequencing) failed to include potentially informative peptides.

#### Summary of Results:
- **Total unique cleaned peptides (metagenomics-only):** 7,087  
- **Peptides with a taxonomic match in UniPept:** 3,279  
- **Success rate:** **46.27%**

#### Interpretation:
This result suggests that **nearly half of the peptides that were missed by the custom database can be assigned a lowest common ancestor by UniPept**. A match at any rank counts here, including assignments to the root of the taxonomy (taxon ID 1, "no rank"), which carry little taxonomic information and account for six of the ten previewed rows. The result nevertheless points to a **critical limitation** of the current custom database generation method: many matchable peptides may be excluded simply because they were not among the de novo predicted sequences, or because they were removed by the ALC > 70% filter.

Thus, this analysis underscores the importance of:
- **Broadening peptide inclusion criteria** in custom database construction.
- **Considering hybrid approaches** that combine de novo, database search, and external reference data (like metagenomics or UniProt) to ensure higher coverage.
- **Optimizing Novolign/Unipept integration**, as missing these peptides directly impacts taxonomic interpretation and biological conclusions.

#### Conclusion:
The fact that 46% of metagenomics-only peptides are mappable via UniPept but absent from the custom DB highlights a substantial **opportunity to improve database sensitivity**. The custom DB method currently underrepresents a significant portion of the peptide evidence that UniPept can annotate, which could be addressed by including such externally validated peptide evidence in future iterations of the pipeline.
