First we determine the PSMs (peptide-spectrum matches) obtained with the metagenomics database. Then we determine the PSMs obtained with your custom database and look for the peptides that occur only in the metagenomics PSMs and not in the custom database PSMs. These metagenomics-only peptides are queried in Unipept to look for any proteins or organisms that match them. If they return no matches or very few, they could not have been matched when the custom database was constructed either, which points to a limitation of the method of matching de novo peptides with Unipept. However, if a substantial number of matches is found for these metagenomics-only peptides, they were simply never queried because they were not among the de novo peptides (otherwise they would have returned those matches), which means that the Novolign method can be optimized.
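
The comparison described above can be sketched as a set difference on peptide sequences. This is a minimal sketch under stated assumptions, not the notebook's actual implementation: the function names `strip_modifications` and `metagenomics_only_peptides` are hypothetical, and it assumes that both PSM tables have a `Peptide` column in which modifications, if present, are written in parentheses (e.g. `M(+15.99)`).

```python
import re

import pandas as pd


def strip_modifications(peptide: str) -> str:
    """Remove parenthesised modification annotations, e.g. 'M(+15.99)K' -> 'MK'."""
    return re.sub(r"\([^)]*\)", "", peptide)


def metagenomics_only_peptides(df_metagenomics, df_custom, column="Peptide"):
    """Return the sorted peptide sequences present in the metagenomics PSMs
    but absent from the custom database PSMs."""
    meta = {strip_modifications(p) for p in df_metagenomics[column].dropna()}
    custom = {strip_modifications(p) for p in df_custom[column].dropna()}
    return sorted(meta - custom)


# Toy tables for illustration only (not real results)
toy_meta = pd.DataFrame({"Peptide": ["AAAK", "LM(+15.99)PEPK", "TTTR"]})
toy_custom = pd.DataFrame({"Peptide": ["AAAK", "LMPEPK"]})

only_in_metagenomics = metagenomics_only_peptides(toy_meta, toy_custom)
print(only_in_metagenomics)
```

Stripping modifications before the comparison is a design choice: it keeps a modified and an unmodified form of the same sequence from being counted as different peptides. The resulting list is what would then be submitted to Unipept.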

In [2]:
import pandas as pd

# === Step 1: Define the path to your PEAKS PSM result file ===
# This is the output of searching the experimental spectra against the metagenomics protein database.
metagenpsm_path = r"C:\Users\Yusuf\OneDrive\LST\Derde_jaar\Y3Q4\Metaproteomics_with_db\db_psm_results\Peaks_export\YA_RZ_GW_SHMX_DB_analysis_GW_MG_DB_clean\DB search psm metagenomics.csv"

# === Step 2: Load the PSM result CSV into a DataFrame ===
# This file contains all identified peptide-spectrum matches (PSMs) with associated metadata
df_psm = pd.read_csv(metagenpsm_path)

# === Step 3: Display basic info ===
# Display the shape of the DataFrame to get an idea of the number of PSMs (rows) and columns
display(df_psm.shape)

# Show the first few rows to inspect the structure and column names
display(df_psm.head())


(15049, 18)

Unnamed: 0,Peptide,-10lgP,Mass,Length,ppm,m/z,Z,RT,Area,Fraction,Id,Scan,from Chimera,Source File,Accession,PTM,AScore,Found By
0,TPTTDGTQNDSAYDFSAAVHSAR,148.86,2411.0625,23,-1.2,804.6938,3,99.95,110500000.0,1,40532,32903,No,MP_RZ07032023_GW_flat_180min_DDA02.raw,NODE_514_length_56624_cov_10.429334_2,,,PEAKS DB
1,TPTTDGTQNDSAYDFSAAVHSAR,141.51,2411.0625,23,-1.2,804.6938,3,99.95,110500000.0,1,40533,33198,No,MP_RZ07032023_GW_flat_180min_DDA02.raw,NODE_514_length_56624_cov_10.429334_2,,,PEAKS DB
2,TPTTDGTQNDSAYDFSAAVHSAR,121.6,2411.0625,23,-1.2,804.6938,3,99.95,110500000.0,1,40534,33478,No,MP_RZ07032023_GW_flat_180min_DDA02.raw,NODE_514_length_56624_cov_10.429334_2,,,PEAKS DB
3,TPTTDGTQNDSAYDFSAAVHSAR,120.23,2411.0625,23,1.5,1206.5403,2,99.92,1075000.0,1,40508,32757,No,MP_RZ07032023_GW_flat_180min_DDA02.raw,NODE_514_length_56624_cov_10.429334_2,,,PEAKS DB
4,TPTTDGTQNDSAYDFSAAVHSAR,120.16,2411.0625,23,-0.1,804.6947,3,102.14,0.0,1,78090,33762,No,MP_RZ07032023_GW_flat_180min_DDA02.raw,NODE_514_length_56624_cov_10.429334_2,,,PEAKS DB


In [3]:
# === Basic peptide count metrics ===

# Total number of peptide-spectrum matches (PSMs) observed
total_psms = len(df_psm)

# Count of unique peptide sequences — helps assess sequence diversity in the sample
unique_peptides = df_psm["Peptide"].nunique()

# Print summary stats
print(f"Total PSMs: {total_psms}")
print(f"Unique peptides: {unique_peptides}")

# === Optional: peptide counts per protein accession ===

# Group the data by protein accession and count how many *unique* peptides were matched to each protein.
# This helps to understand which proteins are highly represented in terms of distinct peptide support.
peptides_per_protein = df_psm.groupby("Accession")["Peptide"].nunique().sort_values(ascending=False)

# Display the top 10 most peptide-rich proteins
print("\n📦 Peptide counts per protein (top 10):")
print(peptides_per_protein.head(10))

Total PSMs: 15049
Unique peptides: 10265

📦 Peptide counts per protein (top 10):
Accession
NODE_28686_length_3771_cov_2.843649_3      61
NODE_864_length_43498_cov_10.271850_1      42
NODE_514_length_56624_cov_10.429334_2      41
NODE_1816_length_28045_cov_24.963737_2     36
NODE_24640_length_4187_cov_21.225557_1     33
NODE_33213_length_3431_cov_8.757109_1      32
NODE_28686_length_3771_cov_2.843649_2      31
NODE_57_length_140165_cov_37.325423_74     30
NODE_30903_length_3596_cov_24.096865_1     30
NODE_4264_length_15603_cov_27.157062_12    30
Name: Peptide, dtype: int64
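
The unique-peptide counts above can be complemented with the number of PSMs per protein, since a protein supported by few distinct peptides may still account for many spectra. The following is a minimal sketch, assuming the same `Accession` and `Peptide` columns as in the table above; the helper name `protein_support` and the toy table are illustrative only.

```python
import pandas as pd


def protein_support(df, accession_col="Accession", peptide_col="Peptide"):
    """Per accession: number of PSMs (rows) and number of distinct peptide sequences."""
    summary = df.groupby(accession_col)[peptide_col].agg(
        psm_count="size",          # every row is one PSM
        unique_peptides="nunique"  # distinct peptide sequences
    )
    return summary.sort_values("unique_peptides", ascending=False)


# Toy table for illustration only (not real results)
toy = pd.DataFrame(
    {
        "Accession": ["A", "A", "A", "B"],
        "Peptide": ["PEPK", "PEPK", "TTTR", "AAAK"],
    }
)

support = protein_support(toy)
print(support)
```

On the real data this would be called as `protein_support(df_psm)`.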
