# Peptide Identification Pipeline using Custom Database

In this project, we aim to develop an algorithm that identifies the microbial composition of a mass spectrometry (MS) sample based on de novo peptide sequencing data. 
Using the predicted peptides, we reconstruct a custom protein sequence database that is optimized for the specific microbial community in the sample.

The pipeline involves several key steps:
- Filtering the de novo peptides based on Average Local Confidence (ALC) scores to retain only high-confidence sequences.
- Cleaning peptide sequences by removing post-translational modification (PTM) notations and equating leucine (L) with isoleucine (I).
- Determining the taxonomic origin of peptides by querying the UniPept API in batches.
- Building the microbial community composition based on the taxonomy assignments.
- Constructing a targeted protein database by collecting protein sequences from the identified organisms.
- Reducing database redundancy through clustering to optimize search efficiency and minimize false positives.

By tailoring the database to the actual community composition, we aim to achieve more accurate protein identifications in metaproteomic studies, approaching the performance of genome-based identification strategies without the need for extensive metagenomic sequencing.

As a first step, we will filter the de novo sequencing results to retain only high-confidence peptides with an Average Local Confidence (ALC) score greater than 70%.

In [2]:
# Import necessary libraries
import pandas as pd

# Load the de novo peptide data
file_path = "de_novo_garmerwolde.csv"
df = pd.read_csv(file_path)

# Display the first few rows to check the data
print("Original Data:")
display(df.head())

# Filter peptides with ALC (%) > 70
filtered_df = df[df["ALC (%)"] > 70]

# Display the first few rows of the filtered data
print("Filtered Data (ALC > 70%):")
display(filtered_df.head())

Original Data:


Unnamed: 0,Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:71565,ALSTWFTLK,F2:44359,9,99,99,9,533.8009,2,126.78,127.18,18854000.0,1065.5859,1.2,,99 100 100 100 100 100 100 100 100,ALSTWFTLK,HCD
1,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69836,APDNVGVLLR,F1:27375,10,99,99,10,527.3062,2,87.21,87.47,5356300.0,1052.5979,-0.1,,100 100 100 100 100 100 100 100 100 100,APDNVGVLLR,HCD
2,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69709,MAGSQTAMTR,F1:5020,10,99,99,10,527.2445,2,23.33,23.29,2474100.0,1052.4744,0.1,,100 100 100 100 100 100 100 100 100 100,MAGSQTAMTR,HCD
3,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:6102,LTGMAFR,F1:16983,7,99,99,7,398.2128,2,60.29,60.67,5092200.0,794.4109,0.3,,100 100 100 99 100 100 100,LTGMAFR,HCD
4,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:149136,WDNAATYTSPNWSGFTAK,F1:43490,18,99,99,18,1008.9567,2,125.38,125.89,129080000.0,2015.9014,-1.3,,99 99 99 100 100 100 100 100 100 100 100 100 1...,WDNAATYTSPNWSGFTAK,HCD


Filtered Data (ALC > 70%):


Unnamed: 0,Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:71565,ALSTWFTLK,F2:44359,9,99,99,9,533.8009,2,126.78,127.18,18854000.0,1065.5859,1.2,,99 100 100 100 100 100 100 100 100,ALSTWFTLK,HCD
1,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69836,APDNVGVLLR,F1:27375,10,99,99,10,527.3062,2,87.21,87.47,5356300.0,1052.5979,-0.1,,100 100 100 100 100 100 100 100 100 100,APDNVGVLLR,HCD
2,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69709,MAGSQTAMTR,F1:5020,10,99,99,10,527.2445,2,23.33,23.29,2474100.0,1052.4744,0.1,,100 100 100 100 100 100 100 100 100 100,MAGSQTAMTR,HCD
3,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:6102,LTGMAFR,F1:16983,7,99,99,7,398.2128,2,60.29,60.67,5092200.0,794.4109,0.3,,100 100 100 99 100 100 100,LTGMAFR,HCD
4,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:149136,WDNAATYTSPNWSGFTAK,F1:43490,18,99,99,18,1008.9567,2,125.38,125.89,129080000.0,2015.9014,-1.3,,99 99 99 100 100 100 100 100 100 100 100 100 1...,WDNAATYTSPNWSGFTAK,HCD
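
The 70% cutoff is a choice, and the number of peptides that survive it depends on the score distribution. As a minimal sketch (the helper name `retention_by_threshold` and the toy scores are illustrative and not part of the pipeline), the following counts how many scores pass several candidate thresholds, using the same strict `>` comparison as the filter above.

```python
from typing import Dict, Iterable, Sequence


def retention_by_threshold(alc_scores: Iterable[float],
                           thresholds: Sequence[float] = (50, 70, 80, 90)) -> Dict[float, int]:
    """Count how many ALC scores are strictly greater than each threshold."""
    scores = list(alc_scores)
    return {t: sum(score > t for score in scores) for t in thresholds}


# Toy ALC values (hypothetical, for illustration only)
toy_scores = [99, 95, 88, 71, 70, 64, 52]
print(retention_by_threshold(toy_scores))  # {50: 7, 70: 4, 80: 3, 90: 2}
```

A score of exactly 70 is excluded, as in the notebook's filter. In the notebook itself, `df["ALC (%)"]` could be passed in directly, since a pandas Series is iterable.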


In [4]:
# Define the wrangle_peptides function
import re

def wrangle_peptides(sequence: str, ptm_filter: bool=True, li_swap: bool=True) -> str:
    """Process peptide sequences by removing post-translational modifications
    and/or equating Leucine and Isoleucine amino acids.

    Args:
        sequence (str): Peptide sequence string
        ptm_filter (bool, optional): Remove PTMs from sequence. Defaults to True.
        li_swap (bool, optional): Equate Leucine (L) and Isoleucine (I). Defaults to True.

    Returns:
        str: Processed sequence string
    """
    if ptm_filter:
        sequence = "".join(re.findall(r"[A-Z]+", sequence))
    if li_swap:
        sequence = sequence.replace("L", "I")
    return sequence

# Apply wrangle_peptides to the filtered data.
# Take an explicit copy first: filtered_df was created by slicing df, and
# assigning a new column to a slice triggers pandas' SettingWithCopyWarning.
filtered_df = filtered_df.copy()
filtered_df["Cleaned Sequence"] = filtered_df["Peptide"].apply(wrangle_peptides)

# Display rows 11 to 21 (Python is 0-indexed, so need 10:21)
print("Cleaned Peptides (Rows 11–21):")
display(filtered_df[["Peptide", "Cleaned Sequence"]].iloc[10:21])


Cleaned Peptides (Rows 11–21):


Unnamed: 0,Peptide,Cleaned Sequence
10,TSLLN(+.98)YLR,TSIINYIR
11,VVQLTMQ(+.98)TQEK,VVQITMQTQEK
12,ATMSDFSPK,ATMSDFSPK
13,LSELTSLTSAPR,ISEITSITSAPR
14,VSQ(+.98)AVLAASSGR,VSQAVIAASSGR
15,VGLAWDR,VGIAWDR
16,YLPASC(+57.02)R,YIPASCR
17,ASVEDLLK,ASVEDIIK
18,YPDVATTHGGSATK,YPDVATTHGGSATK
19,VMGVAFN(+.98)R,VMGVAFNR
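
To make the two cleaning steps concrete, this sketch repeats the function in condensed form so that it runs on its own, then applies it to two peptides taken from the table above. The last call shows the effect of switching off `li_swap`.

```python
import re


def wrangle_peptides(sequence: str, ptm_filter: bool = True, li_swap: bool = True) -> str:
    """Condensed copy of the notebook function, repeated so the example is self-contained."""
    if ptm_filter:
        # Keep only runs of uppercase letters; this drops notations such as "(+.98)"
        sequence = "".join(re.findall(r"[A-Z]+", sequence))
    if li_swap:
        # Leucine and isoleucine have identical mass, so both are written as "I"
        sequence = sequence.replace("L", "I")
    return sequence


print(wrangle_peptides("TSLLN(+.98)YLR"))                 # TSIINYIR
print(wrangle_peptides("YLPASC(+57.02)R"))                # YIPASCR
print(wrangle_peptides("TSLLN(+.98)YLR", li_swap=False))  # TSLLNYLR
```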


In [9]:
# Import necessary libraries
import time

import requests

# Define the fetch_request function
def fetch_request(url: str, retries: int = 3, delay: int = 5) -> requests.Response:
    """Send GET request and return response object. Retries on any non-200 status code.

    Args:
        url (str): input URL.
        retries (int, optional): Maximum number of attempts. Defaults to 3.
        delay (int, optional): Seconds to wait between attempts. Defaults to 5 seconds.

    Raises:
        RuntimeError: If request fails after all attempts.

    Returns:
        requests.Response: Response object
    """
    last_status = None  # stays None if retries < 1 and no request is sent
    for attempt in range(1, retries + 1):
        req_get = requests.get(url)

        if req_get.status_code == 200:
            return req_get  # Success!

        last_status = req_get.status_code
        # Only announce and wait for a retry if another attempt remains
        if attempt < retries:
            print(f"Request failed with status {last_status}. Retrying ({attempt}/{retries - 1}) in {delay} seconds...")
            time.sleep(delay)

    error_msg = f"Request failed after {retries} attempts: status code {last_status}"
    raise RuntimeError(error_msg)
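
The retry logic can be exercised without touching the network by separating it from the HTTP call. The sketch below is a variant under stated assumptions, not the notebook's implementation: `retry_call`, its `is_success` predicate, and the injectable `sleep` argument are hypothetical names introduced here, and the doubling delay (exponential backoff) is a swapped-in technique rather than the fixed delay used above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_call(func: Callable[[], T],
               is_success: Callable[[T], bool],
               retries: int = 3,
               delay: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func until is_success(result) is true, doubling the wait after each failure."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    wait = delay
    for attempt in range(1, retries + 1):
        result = func()
        if is_success(result):
            return result
        if attempt < retries:  # no point waiting after the final attempt
            sleep(wait)
            wait *= 2
    raise RuntimeError(f"Call failed after {retries} attempts")


# Stand-in for an HTTP call: fails twice (503), then succeeds (200).
statuses = iter([503, 503, 200])
waits = []  # records the requested waits instead of actually sleeping
status = retry_call(lambda: next(statuses), lambda s: s == 200, sleep=waits.append)
print(status, waits)  # 200 [5.0, 10.0]
```

With `requests`, the same helper could wrap the real call as `retry_call(lambda: requests.get(url), lambda r: r.status_code == 200)`.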

In [10]:
# Define the request_unipept_pept_to_lca function
def request_unipept_pept_to_lca(pept_df: pd.DataFrame, seq_col: str) -> pd.DataFrame:
    """From a dataset of peptides, fetch LCA taxonomy and rank from UniPept database.

    Args:
        pept_df (pd.DataFrame): Peptide dataset.
        seq_col (str): Column with peptide sequences.

    Returns:
        pd.DataFrame: Peptide sequences with LCA taxonomy and LCA rank.
    """
    base_url = "http://api.unipept.ugent.be/api/v1/pept2lca.json?equate_il=true"
    batch_size = 100

    # Prepare peptide inputs (unique sequences only)
    seq_series = ["&input[]=" + seq for seq in pept_df[seq_col].drop_duplicates()]

    # Collect one (peptide, taxon_id, taxon_rank) row per response item
    rows = []

    # Step through the peptides in batches; the final batch may be shorter
    for start in range(0, len(seq_series), batch_size):
        peptides = seq_series[start:start + batch_size]

        # Construct request URL
        req_str = "".join([base_url, *peptides])

        # Send request
        response = fetch_request(req_str).json()

        # Safely extract fields from each response item.
        # If 'peptide', 'taxon_id', or 'taxon_rank' is missing, fill with 'Unknown' to prevent crashes.
        rows.extend(
            (
                elem.get("peptide", "Unknown"),
                elem.get("taxon_id", "Unknown"),
                elem.get("taxon_rank", "Unknown"),
            )
            for elem in response
        )

    # Build the result once, which gives a continuous index and avoids
    # repeatedly concatenating onto an empty dataframe inside the loop
    return pd.DataFrame(rows, columns=[seq_col, "Global LCA", "Global LCA Rank"])
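
The batching step can be checked in isolation, without sending any requests. The following sketch uses hypothetical helpers (`chunked` and `build_pept2lca_url` are names introduced here) together with the standard library's `urllib.parse.urlencode`. Passing `safe="[]"` keeps the brackets of `input[]` unescaped so the result matches the hand-built URL format used above.

```python
from typing import Iterator, Sequence
from urllib.parse import urlencode

BASE_URL = "http://api.unipept.ugent.be/api/v1/pept2lca.json"


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_pept2lca_url(peptides: Sequence[str], equate_il: bool = True) -> str:
    """Build one request URL with an input[] parameter per peptide."""
    params = [("equate_il", "true" if equate_il else "false")]
    params += [("input[]", pep) for pep in peptides]
    # safe="[]" keeps the brackets literal, matching the notebook's URL format
    return f"{BASE_URL}?{urlencode(params, safe='[]')}"


# Three cleaned sequences from the notebook, split into batches of two
peptides = ["AISTWFTIK", "MAGSQTAMTR", "ITGMAFR"]
urls = [build_pept2lca_url(batch) for batch in chunked(peptides, 2)]
for url in urls:
    print(url)
```

Because `range` produces no start positions for an empty list, an empty peptide set yields no URLs and therefore no requests.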

In [11]:
# Now run the function on our cleaned peptide dataset
# We pass filtered_df with the 'Cleaned Sequence' column

print("Fetching taxonomy information from UniPept...")
lca_results_df = request_unipept_pept_to_lca(filtered_df, seq_col="Cleaned Sequence")

# Display the first few results
print("LCA Mapping Results:")
display(lca_results_df.head())

Fetching taxonomy information from UniPept...
LCA Mapping Results:


Unnamed: 0,Cleaned Sequence,Global LCA,Global LCA Rank
0,AISTWFTIK,1234,genus
1,MAGSQTAMTR,327159,genus
2,ITGMAFR,1,no rank
3,IGSDAYNQK,3379134,kingdom
4,IGATYDIFGDGK,1,no rank
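
The next step listed in the overview, building the community composition, amounts to tallying peptides per LCA taxon. The sketch below is one possible way to do this, not necessarily the project's method: `tally_lca` is a hypothetical helper, and discarding peptides whose LCA is the taxonomy root (taxon ID 1, rank "no rank") is an assumption, made because such peptides do not discriminate between organisms. The input pairs are the five results displayed above.

```python
from collections import Counter
from typing import Iterable, List, Tuple

ROOT_TAXON_ID = 1  # NCBI taxonomy root; an LCA at the root is uninformative


def tally_lca(rows: Iterable[Tuple[int, str]]) -> List[Tuple[Tuple[int, str], int]]:
    """Count peptides per (taxon_id, rank), skipping root-level assignments."""
    counts = Counter((taxon_id, rank) for taxon_id, rank in rows
                     if taxon_id != ROOT_TAXON_ID)
    # most_common() sorts by count, highest first
    return counts.most_common()


# (Global LCA, Global LCA Rank) pairs from the displayed results
rows = [(1234, "genus"), (327159, "genus"), (1, "no rank"),
        (3379134, "kingdom"), (1, "no rank")]
for (taxon_id, rank), n_peptides in tally_lca(rows):
    print(taxon_id, rank, n_peptides)
```

With the DataFrame above, the pairs could be taken from `lca_results_df[["Global LCA", "Global LCA Rank"]].itertuples(index=False)`.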
