# Peptide Identification Pipeline using Custom Database

In this project, we aim to develop an algorithm that identifies the microbial composition of a mass spectrometry (MS) sample based on de novo peptide sequencing data. 
Using the predicted peptides, we reconstruct a custom protein sequence database that is optimized for the specific microbial community in the sample.

The pipeline involves several key steps:
- Filtering the de novo peptides based on Average Local Confidence (ALC) scores to retain only high-confidence sequences.
- Cleaning peptide sequences to remove post-translational modification notations.
- Determining the taxonomic origin of peptides by querying the Unipept API in batches.
- Building the microbial community composition based on the taxonomy assignments.
- Constructing a targeted protein database by collecting protein sequences from the identified organisms.
- Reducing database redundancy through clustering to optimize search efficiency and minimize false positives.

By tailoring the database to the actual community composition, we aim to achieve more accurate protein identifications in metaproteomic studies, approaching the performance of genome-based identification strategies without the need for extensive metagenomic sequencing.

As a first step, we will filter the de novo sequencing results to retain only high-confidence peptides with an Average Local Confidence (ALC) score greater than 70%.

In [22]:
# Import necessary libraries
import pandas as pd

# Load the de novo peptide data
file_path = "de_novo_garmerwolde.csv"
df = pd.read_csv(file_path)

# Display the dataset dimensions
print(f"Original dataset shape: {df.shape}")

# Display the 10 rows at positions 24218 to 24227 (iloc is zero-indexed and its end index is exclusive)
print("Original Data (rows 24219–24227):")
display(df.iloc[24218:24228])

# Filter peptides with ALC (%) > 70
filtered_df = df[df["ALC (%)"] > 70]

# Display the filtered dataset dimensions
print(f"Filtered dataset shape: {filtered_df.shape}")

# Display the rows at the same positions in the filtered frame.
# These are different peptides: filtering keeps the original index labels but shifts positions.
print("Filtered Data (rows 24219-24227):")
display(filtered_df.iloc[24218:24228])  # Returns fewer (or no) rows if the filtered frame is shorter
print(f"Number of peptides filtered out: {df.shape[0] - filtered_df.shape[0]}")

Original dataset shape: (189185, 20)
Original Data (rows 24219–24227):


Unnamed: 0,Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
24218,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,-,KYAYFPVWVNDDKMSLPLR,F2:51275,19,76,68,19,781.3988,3,143.42,142.11,0.0,2341.1929,-7.8,,45 39 41 35 33 43 89 98 98 94 95 96 94 90 83 4...,KYAYFPVWVNDDKMSLPLR,HCD
24219,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:35083,MWALDLR,F1:38190,7,76,78,7,452.7394,2,113.16,129.39,34588.0,903.4636,0.6,,56 71 84 79 79 85 90,MWALDLR,HCD
24220,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:98838,LAAALTPAPVLAK,F2:33083,13,76,70,13,618.3896,2,100.95,110.95,13217000.0,1234.7649,-0.2,,48 50 80 99 100 100 99 96 33 31 48 45 82,LAAALTPAPVLAK,HCD
24221,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:83383,APMSGLQNWK,F2:29687,10,76,70,10,566.2845,2,92.4,83.12,214530.0,1130.5542,0.3,,96 94 85 92 82 41 28 21 69 92,APMSGLQNWK,HCD
24222,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:107262,SLTAPSGVPGM(+15.99)FK,F2:26077,13,76,67,13,654.3237,2,83.51,83.66,75731.0,1306.6592,-20.1,Oxidation (M),37 46 79 93 95 96 88 84 23 17 26 89 97,SLTAPSGVPGM(+15.99)FK,HCD
24223,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:139236,EVQAWSWWWM(+15.99)TRRAAMDVPGR,F2:49588,21,76,68,21,879.0955,3,139.45,141.68,822420.0,2634.2371,10.4,Oxidation (M),37 25 33 44 44 41 81 84 90 88 90 40 52 96 95 8...,EVQAWSWWWM(+15.99)TRRAAMDVPGR,HCD
24224,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:3957,MGFYGGR,F2:13573,7,76,68,7,394.1815,2,50.45,47.75,376120.0,786.3483,0.3,,17 29 90 86 80 83 90,MGFYGGR,HCD
24225,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:103943,TLGALDGTN(+.98)MNR,F1:23486,12,76,71,12,632.3035,2,78.32,67.85,3662600.0,1262.5925,0.0,Deamidation (NQ),21 21 26 76 99 98 90 88 70 82 91 93,TLGALDGTN(+.98)MNR,HCD
24226,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:23734,LGGLSSAMK,F1:16822,9,76,75,9,432.2354,2,59.98,63.31,416090.0,862.4582,-2.4,,62 77 48 88 82 73 68 78 97,LGGLSSAMK,HCD
24227,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,-,LDQ(+.98)AHLYNWQWYR,F2:50574,13,76,77,13,897.4365,2,141.76,127.6,0.0,1792.832,14.8,Deamidation (NQ),31 38 40 61 54 99 98 96 96 94 95 98 100,LDQ(+.98)AHLYNWQWYR,HCD


Filtered dataset shape: (32111, 20)
Filtered Data (rows 24219-24227):


Unnamed: 0,Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
25984,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,-,LVTTDANGWYNK,F2:16205,12,75,73,12,691.3395,2,57.63,71.8,0.0,1380.6675,-2.1,,59 60 95 95 95 89 83 70 69 28 43 98,LVTTDANGWYNK,HCD
25985,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:39500,KATSLMDR,F2:7950,8,75,76,8,461.2376,2,32.97,17.78,841190.0,920.4749,-15.5,,77 34 38 87 93 88 96 94,KATSLMDR,HCD
25986,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:103841,WAADLDQ(+.98)MVTK,F2:42699,11,75,75,11,639.8031,2,123.18,107.95,109590.0,1277.5962,-3.6,Deamidation (NQ),78 60 90 90 89 94 89 69 28 41 98,WAADLDQ(+.98)MVTK,HCD
25989,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:18876,Q(+.98)VYLHMDGFR,F1:35026,10,75,72,10,422.869,3,105.9,95.37,299660.0,1265.5862,-0.9,Deamidation (NQ),26 28 45 61 92 95 98 92 94 96,Q(+.98)VYLHMDGFR,HCD
25990,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:113009,TLLPALYLLQGR,F2:54646,12,75,74,12,679.4126,2,151.38,164.41,25185.0,1356.813,-1.7,,51 50 50 33 82 89 93 95 85 79 88 98,TLLPALYLLQGR,HCD
25991,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:139595,FGFGLSDPDDN(+.98)NLFAHFKPLC(+57.02)K,F1:54391,22,75,72,22,847.3984,3,151.74,161.24,6012700.0,2539.1841,-4.2,Deamidation (NQ); Carbamidomethylation,49 40 57 62 82 96 69 44 92 49 28 34 96 97 97 8...,FGFGLSDPDDN(+.98)NLFAHFKPLC(+57.02)K,HCD
25993,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,-,YHDALTYVWNWGGFTGK,F1:45457,17,75,85,17,1007.9756,2,130.26,163.81,0.0,2013.9373,-0.3,,34 20 62 78 99 99 98 98 98 97 98 97 97 98 98 9...,YHDALTYVWNWGGFTGK,HCD
25994,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:852,NSLAVLR,F1:12592,7,75,81,7,386.7372,2,48.5,66.0,1484500.0,771.4603,-0.4,,61 60 82 92 94 94 86,NSLAVLR,HCD
25995,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:141090,NYLDDLR,F2:41876,7,75,85,7,908.4518,1,121.32,83.38,208150.0,907.4399,5.0,,85 83 85 82 85 88 90,NYLDDLR,HCD
25997,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:129117,YM(+15.99)APQEVGPGSPFR,F2:34615,14,75,78,14,776.3701,2,104.14,93.71,57385.0,1550.7188,4.5,Oxidation (M),37 35 43 41 90 96 95 97 93 92 94 82 95 100,YM(+15.99)APQEVGPGSPFR,HCD


Number of peptides filtered out: 157074
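
The ALC threshold can be sanity-checked against the per-residue scores. The sketch below assumes that ALC is the rounded mean of the values in the `local confidence (%)` column. That assumption agrees with the rows shown above but is not stated anywhere in this notebook, and `alc_from_local_confidence` is a hypothetical helper rather than part of the pipeline.

```python
from statistics import mean


def alc_from_local_confidence(local_confidence: str) -> int:
    """Return the rounded mean of space-separated per-residue confidence scores.

    Assumption: ALC is the average of the per-residue local confidences.
    """
    scores = [int(value) for value in local_confidence.split()]
    return round(mean(scores))


# Per-residue scores and reported ALC values copied from three rows of the output above
examples = {
    "MWALDLR": ("56 71 84 79 79 85 90", 78),
    "LGGLSSAMK": ("62 77 48 88 82 73 68 78 97", 75),
    "NSLAVLR": ("61 60 82 92 94 94 86", 81),
}

for peptide, (local_confidence, reported_alc) in examples.items():
    computed = alc_from_local_confidence(local_confidence)
    # Print the peptide, the recomputed ALC, the reported ALC, and whether it passes the filter
    print(peptide, computed, reported_alc, computed > 70)
```

Rows whose `local confidence (%)` value is truncated with `...` in the display cannot be recomputed this way from the printed output.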


In [23]:
# Define the wrangle_peptides function
import re

def wrangle_peptides(sequence: str, ptm_filter: bool=True, li_swap: bool=True) -> str:
    """Process peptide sequences by removing post-translational modifications
    and/or equating Leucine and Isoleucine amino acids.

    Args:
        sequence (str): Peptide sequence string
        ptm_filter (bool, optional): Remove PTMs from sequence. Defaults to True.
        li_swap (bool, optional): Equate Leucine (L) and Isoleucine (I). Defaults to True.

    Returns:
        str: Processed sequence string
    """
    if ptm_filter:
        sequence = "".join(re.findall(r"[A-Z]+", sequence))
    if li_swap:
        sequence = sequence.replace("L", "I")
    return sequence

# Apply wrangle_peptides to the filtered data.
# Work on an explicit copy so the new column is not written to a slice of df
# (assigning to the slice directly triggers pandas' SettingWithCopyWarning).
filtered_df = filtered_df.copy()
filtered_df["Cleaned Sequence"] = filtered_df["Peptide"].apply(wrangle_peptides)

# Display rows 11 to 21 (Python is 0-indexed, so need 10:21)
print("Cleaned Peptides (Rows 11–21):")
display(filtered_df[["Peptide", "Cleaned Sequence"]].iloc[10:21])


Cleaned Peptides (Rows 11–21):


Unnamed: 0,Peptide,Cleaned Sequence
10,TSLLN(+.98)YLR,TSIINYIR
11,VVQLTMQ(+.98)TQEK,VVQITMQTQEK
12,ATMSDFSPK,ATMSDFSPK
13,LSELTSLTSAPR,ISEITSITSAPR
14,VSQ(+.98)AVLAASSGR,VSQAVIAASSGR
15,VGLAWDR,VGIAWDR
16,YLPASC(+57.02)R,YIPASCR
17,ASVEDLLK,ASVEDIIK
18,YPDVATTHGGSATK,YPDVATTHGGSATK
19,VMGVAFN(+.98)R,VMGVAFNR
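
`wrangle_peptides` keeps only runs of uppercase letters, which works because the modifications in this dataset are written as parenthesised mass shifts such as `(+15.99)`. A minimal sketch of a more explicit variant that deletes the parenthesised groups instead; `strip_modifications` is a hypothetical name, and the sketch assumes modifications never contain nested parentheses.

```python
import re

# Matches one parenthesised modification, e.g. "(+15.99)" or "(+.98)"
MODIFICATION = re.compile(r"\([^)]*\)")


def strip_modifications(sequence: str, li_swap: bool = True) -> str:
    """Delete parenthesised modification notations and optionally map L to I."""
    cleaned = MODIFICATION.sub("", sequence)
    if li_swap:
        cleaned = cleaned.replace("L", "I")
    return cleaned


# Peptides taken from the tables above
print(strip_modifications("TSLLN(+.98)YLR"))                         # TSIINYIR
print(strip_modifications("YLPASC(+57.02)R"))                        # YIPASCR
print(strip_modifications("YM(+15.99)APQEVGPGSPFR", li_swap=False))  # YMAPQEVGPGSPFR
```

Leucine and isoleucine have the same mass, so mapping L to I here is consistent with the `equate_il=true` option used in the Unipept request later in the notebook.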


In [24]:
# Import necessary libraries
import time

import requests

# Define the fetch_request function
def fetch_request(url: str, retries: int = 3, delay: int = 5) -> requests.Response:
    """Send GET request and return response object. Retries on any non-200 status.

    Args:
        url (str): input URL.
        retries (int, optional): Maximum number of attempts. Defaults to 3.
        delay (int, optional): Seconds to wait between attempts. Defaults to 5 seconds.

    Raises:
        RuntimeError: If no attempt returns status 200.

    Returns:
        requests.Response: Response object
    """
    status_code = None
    for attempt in range(retries):
        # A timeout keeps a stalled connection from blocking the notebook indefinitely
        req_get = requests.get(url, timeout=60)
        status_code = req_get.status_code

        if status_code == 200:
            return req_get  # Success!

        # Wait before the next attempt, but not after the final one
        if attempt + 1 < retries:
            print(f"Request failed with status {status_code}. Retrying ({attempt+1}/{retries}) in {delay} seconds...")
            time.sleep(delay)

    raise RuntimeError(f"Request failed after {retries} attempts: status code {status_code}")

In [25]:
# Define the request_unipept_pept_to_lca function
def request_unipept_pept_to_lca(pept_df: pd.DataFrame, seq_col: str) -> pd.DataFrame:
    """From a dataset of peptides, fetch LCA taxonomy and rank from UniPept database.

    Args:
        pept_df (pd.DataFrame): Peptide dataset.
        seq_col (str): Column with peptide sequences.

    Returns:
        pd.DataFrame: Peptide sequences with LCA taxonomy and LCA rank.
    """
    base_url = "http://api.unipept.ugent.be/api/v1/pept2lca.json?equate_il=true"
    batch_size = 100

    # Prepare peptide inputs (one query parameter per unique peptide)
    seq_series = ["&input[]=" + seq for seq in pept_df[seq_col].drop_duplicates()]

    columns = [seq_col, "Global LCA", "Global LCA Rank"]
    batches = []

    # Query the API in batches of `batch_size` peptides
    for start in range(0, len(seq_series), batch_size):
        peptides = seq_series[start:start + batch_size]

        # Construct request URL
        req_str = "".join([base_url, *peptides])

        # Send request
        response = fetch_request(req_str).json()

        # Safely extract fields from each response item.
        # If 'peptide', 'taxon_id', or 'taxon_rank' is missing, fill with 'Unknown' to prevent crashes.
        batches.append(
            pd.DataFrame(
                [
                    (
                        elem.get("peptide", "Unknown"),
                        elem.get("taxon_id", "Unknown"),
                        elem.get("taxon_rank", "Unknown")
                    )
                    for elem in response
                ],
                columns=columns
            )
        )

    if not batches:
        return pd.DataFrame(columns=columns, dtype=object)

    # Each batch keeps its own 0-based index, so index labels repeat across batches.
    # Call reset_index(drop=True) on the result if unique labels are needed.
    return pd.concat(batches)
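
The batching logic can be checked without touching the network. The sketch below only builds the request URLs and sends nothing; `build_pept2lca_urls` is a hypothetical helper, and the base URL is copied from the function above.

```python
from typing import Iterator, List

BASE_URL = "http://api.unipept.ugent.be/api/v1/pept2lca.json?equate_il=true"


def build_pept2lca_urls(peptides: List[str], batch_size: int = 100) -> Iterator[str]:
    """Yield one request URL per batch of unique peptides (input order preserved)."""
    unique = list(dict.fromkeys(peptides))  # de-duplicate, keeping the first occurrence
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        yield BASE_URL + "".join("&input[]=" + seq for seq in batch)


# Four inputs, one duplicate, batch size 2: three unique peptides spread over two URLs
urls = list(build_pept2lca_urls(
    ["AISTWFTIK", "MAGSQTAMTR", "AISTWFTIK", "ITGMAFR"], batch_size=2
))
for url in urls:
    print(url)
```

Using `range` with a step of `batch_size` covers the final, shorter batch without a separate branch.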

In [26]:
# Now run the function on our cleaned peptide dataset
# We pass filtered_df with the 'Cleaned Sequence' column

print("Fetching taxonomy information from UniPept...")
lca_results_df = request_unipept_pept_to_lca(filtered_df, seq_col="Cleaned Sequence")

# Display the first few results
print("LCA Mapping Results:")
display(lca_results_df.head())

Fetching taxonomy information from UniPept...
LCA Mapping Results:


Unnamed: 0,Cleaned Sequence,Global LCA,Global LCA Rank
0,AISTWFTIK,1234,genus
1,MAGSQTAMTR,327159,genus
2,ITGMAFR,1,no rank
3,IGSDAYNQK,3379134,kingdom
4,IGATYDIFGDGK,1,no rank


In [27]:
# Display the lca dataset dimensions
print(f"LCA results dataset shape: {lca_results_df.shape}")

LCA results dataset shape: (8703, 3)


Now that we have obtained the taxonomic assignments for our peptides in the lca_results_df dataset, we want to examine how many peptides were classified at each taxonomic rank (e.g., species, genus, family).
This helps us evaluate the overall resolution of our data and determine which rank is most appropriate for downstream analysis and visualization of the microbial community composition.

In [28]:
# Count the number of peptides per LCA rank
rank_counts = lca_results_df["Global LCA Rank"].value_counts()

# Display the counts
print("Peptide counts per LCA rank:")
display(rank_counts)

Peptide counts per LCA rank:


Global LCA Rank
no rank          6610
species           824
domain            356
genus             268
class             142
strain            126
phylum            115
kingdom           109
family             61
order              36
subphylum          10
subfamily           8
subclass            7
subkingdom          6
subspecies          5
subgenus            5
infraorder          5
superclass          4
species group       3
suborder            1
tribe               1
varietas            1
Name: count, dtype: int64
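
To put these counts in proportion, the sketch below recomputes two summary figures from the table above: the share of peptides whose LCA has no rank, and the number of peptides assigned at one of the nine ranks that the next cell keeps. The counts are copied from the output; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
rank_counts = {
    "no rank": 6610, "species": 824, "domain": 356, "genus": 268,
    "class": 142, "strain": 126, "phylum": 115, "kingdom": 109,
    "family": 61, "order": 36, "subphylum": 10, "subfamily": 8,
    "subclass": 7, "subkingdom": 6, "subspecies": 5, "subgenus": 5,
    "infraorder": 5, "superclass": 4, "species group": 3,
    "suborder": 1, "tribe": 1, "varietas": 1,
}
main_ranks = ["domain", "kingdom", "phylum", "class", "order",
              "family", "genus", "species", "strain"]

total = sum(rank_counts.values())
no_rank_share = rank_counts["no rank"] / total
main_rank_total = sum(rank_counts[rank] for rank in main_ranks)

print(f"Total peptides: {total}")
print(f"Share without a rank: {no_rank_share:.1%}")
print(f"Peptides at a main rank: {main_rank_total}")
```

The total reproduces the 8703 rows reported for `lca_results_df`, so roughly three quarters of the peptides resolve only to an unranked node such as taxon ID 1 in the preview above.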

In [37]:
# Step 1: Replace 'no rank' with 'Unclassified' for clarity
lca_results_df["Global LCA Rank"] = lca_results_df["Global LCA Rank"].replace("no rank", "Unclassified")

# Step 2: Define acceptable ranks for filtering
allowed_ranks = ["domain", "kingdom", "phylum", "class", "order", "family","genus", "species", "strain"]

# Step 3: Filter the DataFrame to only include allowed ranks
filtered_lca_df = lca_results_df[lca_results_df["Global LCA Rank"].isin(allowed_ranks)].copy()

# Step 4: Count how many times each taxon ID appears
taxon_counts = filtered_lca_df["Global LCA"].value_counts()

# Step 5: Apply a frequency cutoff (e.g. at least 10 peptides per taxon ID)
taxa_to_keep = taxon_counts[taxon_counts >= 10].index

# Step 6: Keep only rows with a frequently occurring taxon ID
filtered_lca_df = filtered_lca_df[filtered_lca_df["Global LCA"].isin(taxa_to_keep)]

# Summary printout
print(f"Filtered dataset shape: {filtered_lca_df.shape}")
print("Preview of filtered dataset:")
display(filtered_lca_df.head())


Filtered dataset shape: (1005, 3)
Preview of filtered dataset:


Unnamed: 0,Cleaned Sequence,Global LCA,Global LCA Rank
0,AISTWFTIK,1234,genus
1,MAGSQTAMTR,327159,genus
3,IGSDAYNQK,3379134,kingdom
6,ISAVGEVYDIK,1400863,strain
8,ATMSDFSPK,327160,species


In [33]:
import requests
import time

# Step 1: Get unique taxon IDs from the filtered dataframe
unique_taxa = filtered_lca_df["Global LCA"].unique()

# Step 2: Define a dictionary to store Taxon ID → Scientific Name
taxon_to_name = {}

# Step 3: Query the NCBI API for each taxon ID
# (this endpoint turns out to return 404; see the debugging cells below)
for taxon_id in unique_taxa:
    try:
        url = f"https://api.ncbi.nlm.nih.gov/taxonomy/v0/taxons/{taxon_id}"
        response = requests.get(url)
        data = response.json()
        
        if "scientificName" in data:
            taxon_to_name[taxon_id] = data["scientificName"]
        elif "lineage" in data and len(data["lineage"]) > 0:
            # Fallback: try to grab the name from lineage if available
            taxon_to_name[taxon_id] = data["lineage"][-1]["scientificName"]
        else:
            taxon_to_name[taxon_id] = "Unknown"
    
    except Exception as e:
        taxon_to_name[taxon_id] = "Error"
    
    time.sleep(0.2)  # Be kind to the API

# Step 4: Map scientific names back into a new dataframe
filtered_lca_named_df = filtered_lca_df.copy()
filtered_lca_named_df["Scientific Name"] = filtered_lca_named_df["Global LCA"].map(taxon_to_name)

# Step 5: Preview the new dataframe
print("Preview of dataframe with scientific names:")
display(filtered_lca_named_df)


Preview of dataframe with scientific names:


Unnamed: 0,Cleaned Sequence,Global LCA,Global LCA Rank,Scientific Name
0,AISTWFTIK,1234,genus,Error
1,MAGSQTAMTR,327159,genus,Error
3,IGSDAYNQK,3379134,kingdom,Error
6,ISAVGEVYDIK,1400863,strain,Error
8,ATMSDFSPK,327160,species,Error
...,...,...,...,...
24,TVVAFGPR,1224,phylum,Error
13,RAAPGENK,1224,phylum,Error
15,QPIEPYAK,3379134,kingdom,Error
6,ITAEIQAGK,1783272,kingdom,Error


In [34]:
# Debugging the scientific name error
taxon_id = 1234  # try with other IDs as well
url = f"https://api.ncbi.nlm.nih.gov/taxonomy/v0/taxons/{taxon_id}"

response = requests.get(url)

print("Status code:", response.status_code)
print("Response JSON:")
print(response.json())

Status code: 404
Response JSON:


JSONDecodeError: Expecting value: line 2 column 1 (char 1)

In [35]:
taxon_id = 1234
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
params = {
    "db": "taxonomy",
    "id": taxon_id,
    "retmode": "json"
}

response = requests.get(url, params=params)
print("Status code:", response.status_code)
print(response.json())  # <- This should contain 'scientificname'


Status code: 200
{'header': {'type': 'esummary', 'version': '0.3'}, 'result': {'uids': ['1234'], '1234': {'uid': '1234', 'status': 'active', 'rank': 'genus', 'division': 'bacteria', 'scientificname': 'Nitrospira', 'commonname': '', 'taxid': 1234, 'akataxid': '', 'genus': '', 'species': '', 'subsp': '', 'modificationdate': '2015/08/12 00:00', 'genbankdivision': 'Bacteria'}}}
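
The response above shows where the name lives: under `result`, keyed by the taxon ID as a string. A small sketch of that lookup, run on an abridged copy of the printed response; `scientific_name_from_esummary` is a hypothetical helper.

```python
from typing import Any, Dict

# Abridged copy of the esummary response printed above
sample_response: Dict[str, Any] = {
    "header": {"type": "esummary", "version": "0.3"},
    "result": {
        "uids": ["1234"],
        "1234": {"uid": "1234", "rank": "genus",
                 "scientificname": "Nitrospira", "taxid": 1234},
    },
}


def scientific_name_from_esummary(data: Dict[str, Any], taxon_id: int) -> str:
    """Return the scientific name for taxon_id, or 'Unknown' if it is absent."""
    record = data.get("result", {}).get(str(taxon_id), {})
    return record.get("scientificname", "Unknown")


print(scientific_name_from_esummary(sample_response, 1234))  # Nitrospira
print(scientific_name_from_esummary(sample_response, 9999))  # Unknown
```

Converting the ID with `str()` matters because the dataframe holds integers while the JSON keys are strings.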


In [45]:
import requests
import time

# Step 1: Get unique taxon IDs from the filtered dataframe
unique_taxa = filtered_lca_df["Global LCA"].unique()

# Step 2: Define a dictionary to store Taxon ID → Scientific Name
taxon_to_name = {}

# Step 3: Query NCBI Entrez API for each taxon ID
for taxon_id in unique_taxa:
    try:
        url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
        params = {
            "db": "taxonomy",
            "id": taxon_id,
            "retmode": "json"
        }

        response = requests.get(url, params=params)
        data = response.json()
        
        result = data.get("result", {})
        if str(taxon_id) in result:
            sci_name = result[str(taxon_id)].get("scientificname", "Unknown")
            taxon_to_name[taxon_id] = sci_name
        else:
            taxon_to_name[taxon_id] = "Unknown"
    
    except Exception as e:
        taxon_to_name[taxon_id] = "Error"
    
    time.sleep(0.2)  # Be kind to the API

# Step 4: Map scientific names back into a new dataframe
filtered_lca_named_df = filtered_lca_df.copy()
filtered_lca_named_df["Scientific Name"] = filtered_lca_named_df["Global LCA"].map(taxon_to_name)

# Step 5: Preview the new dataframe
print("Preview of dataframe with scientific names:")
display(filtered_lca_named_df)


Preview of dataframe with scientific names:


Unnamed: 0,Cleaned Sequence,Global LCA,Global LCA Rank,Scientific Name
0,AISTWFTIK,1234,genus,Nitrospira
1,MAGSQTAMTR,327159,genus,Candidatus Accumulibacter
3,IGSDAYNQK,3379134,kingdom,Pseudomonadati
6,ISAVGEVYDIK,1400863,strain,Candidatus Competibacter denitrificans Run_A_D11
8,ATMSDFSPK,327160,species,Candidatus Accumulibacter phosphatis
...,...,...,...,...
16,FAAACQQK,2,domain,Bacteria
25,FIEISWPK,2759,domain,Eukaryota
28,GNSDVGFR,2759,domain,Eukaryota
16,ASVIFMPK,2759,domain,Eukaryota


In [None]:
import xml.etree.ElementTree as ET

# Step 1: Get unique taxon IDs from your DataFrame
taxon_ids = filtered_lca_named_df["Global LCA"].dropna().unique()

# Step 2: Create an empty dictionary to store taxon ID → lineage mapping
taxon_lineages = {}

# Step 3: Function to fetch lineage from NCBI Entrez
def fetch_lineage(taxon_id):
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {
        "db": "taxonomy",
        "id": str(taxon_id),
        "retmode": "xml"
    }

    try:
        response = requests.get(url, params=params)
        root = ET.fromstring(response.content)
        lineage_dict = {}
        for taxon in root.findall(".//Taxon"):
            rank = taxon.find("Rank")
            name = taxon.find("ScientificName")
            if rank is not None and name is not None and rank.text != "no rank":
                lineage_dict[rank.text.lower()] = name.text
        return lineage_dict
    except Exception as e:
        print(f"Error with taxon ID {taxon_id}: {e}")
        return {}

# Step 4: Loop through all taxon IDs and build the dictionary
for tid in taxon_ids:
    taxon_lineages[tid] = fetch_lineage(tid)
    time.sleep(0.3)  # to avoid overwhelming the server

# Optional: Display the result as a table (pandas is already imported as pd above)
lineage_df = pd.DataFrame.from_dict(taxon_lineages, orient="index").fillna("NA")
display(lineage_df.head())


Unnamed: 0,genus,cellular root,domain,kingdom,phylum,class,order,family,strain,species,clade
1234,Nitrospira,cellular organisms,Bacteria,Pseudomonadati,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,,,
327159,Candidatus Accumulibacter,cellular organisms,Bacteria,Pseudomonadati,Pseudomonadota,Betaproteobacteria,,,,,
1400863,Candidatus Competibacter,cellular organisms,Bacteria,Pseudomonadati,Pseudomonadota,Gammaproteobacteria,,Candidatus Competibacteraceae,Candidatus Competibacter denitrificans Run_A_D11,Candidatus Competibacter denitrificans,
327160,Candidatus Accumulibacter,cellular organisms,Bacteria,Pseudomonadati,Pseudomonadota,Betaproteobacteria,,,,Candidatus Accumulibacter phosphatis,
330214,Nitrospira,cellular organisms,Bacteria,Pseudomonadati,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,,Nitrospira defluvii,
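
`fetch_lineage` relies on `root.findall(".//Taxon")` returning both the queried taxon and every ancestor nested beneath it. The sketch below runs the same parsing on a hand-written, heavily abridged XML snippet. Its layout (a `TaxaSet` root with ancestors under `LineageEx`) is an assumption about the efetch taxonomy record, and the unranked entry is a placeholder, so treat it as an offline illustration rather than a real response.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hand-written, abridged sample; the "no rank" entry is a placeholder
sample_xml = """
<TaxaSet>
  <Taxon>
    <TaxId>1234</TaxId>
    <ScientificName>Nitrospira</ScientificName>
    <Rank>genus</Rank>
    <LineageEx>
      <Taxon><ScientificName>cellular organisms</ScientificName><Rank>cellular root</Rank></Taxon>
      <Taxon><ScientificName>Bacteria</ScientificName><Rank>domain</Rank></Taxon>
      <Taxon><ScientificName>placeholder unranked group</ScientificName><Rank>no rank</Rank></Taxon>
      <Taxon><ScientificName>Nitrospiraceae</ScientificName><Rank>family</Rank></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>
"""


def parse_lineage(xml_text: str) -> Dict[str, str]:
    """Map each rank to its scientific name, skipping entries without a rank."""
    root = ET.fromstring(xml_text.strip())
    lineage: Dict[str, str] = {}
    for taxon in root.findall(".//Taxon"):
        rank = taxon.find("Rank")
        name = taxon.find("ScientificName")
        if rank is not None and name is not None and rank.text != "no rank":
            lineage[rank.text.lower()] = name.text
    return lineage


print(parse_lineage(sample_xml))
```

Because the dictionary is keyed by rank, two lineage entries that share a rank would overwrite each other and only the later one would survive.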


In [None]:
print(filtered_lca_named_df.shape)
print(lineage_df.shape)
print(len(taxon_ids))  # only 22 unique taxon IDs: most of the 1005 peptides share an LCA taxon with other peptides

(1005, 4)
(22, 11)
22


In [46]:
lineage_df = lineage_df.reset_index().rename(columns={"index": "Global LCA"})
merged_df = filtered_lca_named_df.merge(lineage_df, on="Global LCA", how="left")
print(merged_df.shape)
display(merged_df.head())

(1005, 15)


Unnamed: 0,Cleaned Sequence,Global LCA,Global LCA Rank,Scientific Name,genus,cellular root,domain,kingdom,phylum,class,order,family,strain,species,clade
0,AISTWFTIK,1234,genus,Nitrospira,Nitrospira,cellular organisms,Bacteria,Pseudomonadati,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,,,
1,MAGSQTAMTR,327159,genus,Candidatus Accumulibacter,Candidatus Accumulibacter,cellular organisms,Bacteria,Pseudomonadati,Pseudomonadota,Betaproteobacteria,,,,,
2,IGSDAYNQK,3379134,kingdom,Pseudomonadati,,cellular organisms,Bacteria,Pseudomonadati,,,,,,,
3,ISAVGEVYDIK,1400863,strain,Candidatus Competibacter denitrificans Run_A_D11,Candidatus Competibacter,cellular organisms,Bacteria,Pseudomonadati,Pseudomonadota,Gammaproteobacteria,,Candidatus Competibacteraceae,Candidatus Competibacter denitrificans Run_A_D11,Candidatus Competibacter denitrificans,
4,ATMSDFSPK,327160,species,Candidatus Accumulibacter phosphatis,Candidatus Accumulibacter,cellular organisms,Bacteria,Pseudomonadati,Pseudomonadota,Betaproteobacteria,,,,Candidatus Accumulibacter phosphatis,


In [47]:
def assign_taxonomic_label(df: pd.DataFrame, target_rank: str) -> pd.Series:
    """
    Assigns each row in the DataFrame to a taxonomic label at the specified rank.

    If a peptide is annotated at or below the target rank, it will be assigned
    the label from that rank. If not, it will be labeled as 'Unclassified'.

    Parameters:
        df (pd.DataFrame): DataFrame containing taxonomic lineage columns.
        target_rank (str): The desired rank to group peptides by (e.g., 'phylum', 'kingdom').

    Returns:
        pd.Series: A column with assigned labels for each peptide at the given rank.
    """
    if target_rank not in df.columns:
        raise ValueError(f"Target rank '{target_rank}' not found in the dataframe columns.")
    
    return df[target_rank].fillna("Unclassified").replace("NA", "Unclassified")
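
A short usage sketch, assuming pandas is installed. The function is repeated so the block runs on its own, and the toy frame is loosely modelled on the preview of `merged_df`, with missing genus values encoded both as `"NA"` and as `None`; the last peptide name is made up.

```python
import pandas as pd


def assign_taxonomic_label(df: pd.DataFrame, target_rank: str) -> pd.Series:
    """Same logic as the function above, repeated so this block is self-contained."""
    if target_rank not in df.columns:
        raise ValueError(f"Target rank '{target_rank}' not found in the dataframe columns.")
    return df[target_rank].fillna("Unclassified").replace("NA", "Unclassified")


# Toy data for illustration only
toy_df = pd.DataFrame({
    "Cleaned Sequence": ["AISTWFTIK", "MAGSQTAMTR", "IGSDAYNQK", "ATMSDFSPK", "PEPTIDEX"],
    "genus": ["Nitrospira", "Candidatus Accumulibacter", "NA",
              "Candidatus Accumulibacter", None],
})

# Label every peptide at genus level, then count peptides per label
labels = assign_taxonomic_label(toy_df, "genus")
composition = labels.value_counts()
print(composition)
```

Counting the labels in this way gives a peptide-count view of the community at the chosen rank, with everything that lacks an assignment at that rank pooled under `Unclassified`.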
