# Peptide Identification Pipeline Using a Custom Database

In this project, we aim to develop an algorithm that identifies the microbial composition of a mass spectrometry (MS) sample from de novo peptide sequencing data.
Using the predicted peptides, we then build a custom protein sequence database tailored to the specific microbial community in the sample.

The pipeline involves several key steps:
- Filtering the de novo peptides based on Average Local Confidence (ALC) scores to retain only high-confidence sequences.
- Cleaning peptide sequences to remove post-translational modification notations.
- Determining the taxonomic origin of peptides by querying UniProt in batch mode.
- Building the microbial community composition based on the taxonomy assignments.
- Constructing a targeted protein database by collecting protein sequences from the identified organisms.
- Reducing database redundancy through clustering to optimize search efficiency and minimize false positives.
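
Two of the steps above, cleaning peptide sequences and building the community composition, can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual implementation. It assumes that modifications are written as parenthesised annotations after a residue (for example `M(+15.99)`). The helper names `clean_peptide` and `community_composition` and the toy taxon labels are hypothetical.

```python
import re
from collections import Counter

# Assumption: PTMs appear as parenthesised annotations, e.g. "M(+15.99)".
PTM_PATTERN = re.compile(r"\([^)]*\)")


def clean_peptide(sequence: str) -> str:
    """Strip parenthesised modification annotations and keep only letters."""
    stripped = PTM_PATTERN.sub("", sequence)
    return "".join(ch for ch in stripped if ch.isalpha()).upper()


def community_composition(taxa):
    """Return the fraction of assigned peptides per taxon."""
    counts = Counter(taxa)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


# One modified and one unmodified peptide.
cleaned = [clean_peptide(p) for p in ["M(+15.99)AGSQTAM(+15.99)TR", "LTGMAFR"]]
print(cleaned)  # ['MAGSQTAMTR', 'LTGMAFR']

# Hypothetical taxonomy assignments, one label per peptide.
composition = community_composition(["Taxon A", "Taxon A", "Taxon B", "Taxon A"])
print(composition)  # {'Taxon A': 0.75, 'Taxon B': 0.25}
```

Counting peptides per taxon is only one way to estimate composition. Any weighting by peptide abundance or uniqueness would need to be added separately.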

By tailoring the database to the actual community composition, we aim to achieve more accurate protein identifications in metaproteomic studies, approaching the performance of genome-based identification strategies without requiring extensive metagenomic sequencing.

As a first step, we will filter the de novo sequencing results to retain only high-confidence peptides with an Average Local Confidence (ALC) score greater than 70%.

In [1]:
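# Import necessary libraries
import pandas as pd
from IPython.display import display  # available implicitly in Jupyter; explicit import lets the cell run as a script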

# Load the de novo peptide data
file_path = "de_novo_garmerwolde.csv"
df = pd.read_csv(file_path)

# Display the first few rows to check the data
print("Original Data:")
display(df.head())

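# Keep peptides with ALC (%) strictly greater than 70.
# .copy() gives an independent DataFrame, so later edits do not touch df.
filtered_df = df[df["ALC (%)"] > 70].copy()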

# Display the first few rows of the filtered data
print("Filtered Data (ALC > 70%):")
display(filtered_df.head())

Original Data:


Unnamed: 0,Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:71565,ALSTWFTLK,F2:44359,9,99,99,9,533.8009,2,126.78,127.18,18854000.0,1065.5859,1.2,,99 100 100 100 100 100 100 100 100,ALSTWFTLK,HCD
1,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69836,APDNVGVLLR,F1:27375,10,99,99,10,527.3062,2,87.21,87.47,5356300.0,1052.5979,-0.1,,100 100 100 100 100 100 100 100 100 100,APDNVGVLLR,HCD
2,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69709,MAGSQTAMTR,F1:5020,10,99,99,10,527.2445,2,23.33,23.29,2474100.0,1052.4744,0.1,,100 100 100 100 100 100 100 100 100 100,MAGSQTAMTR,HCD
3,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:6102,LTGMAFR,F1:16983,7,99,99,7,398.2128,2,60.29,60.67,5092200.0,794.4109,0.3,,100 100 100 99 100 100 100,LTGMAFR,HCD
4,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:149136,WDNAATYTSPNWSGFTAK,F1:43490,18,99,99,18,1008.9567,2,125.38,125.89,129080000.0,2015.9014,-1.3,,99 99 99 100 100 100 100 100 100 100 100 100 1...,WDNAATYTSPNWSGFTAK,HCD


Filtered Data (ALC > 70%):


Unnamed: 0,Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode
0,2,MP_RZ07032023_GW_flat_180min_DDA02.raw,F2:71565,ALSTWFTLK,F2:44359,9,99,99,9,533.8009,2,126.78,127.18,18854000.0,1065.5859,1.2,,99 100 100 100 100 100 100 100 100,ALSTWFTLK,HCD
1,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69836,APDNVGVLLR,F1:27375,10,99,99,10,527.3062,2,87.21,87.47,5356300.0,1052.5979,-0.1,,100 100 100 100 100 100 100 100 100 100,APDNVGVLLR,HCD
2,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:69709,MAGSQTAMTR,F1:5020,10,99,99,10,527.2445,2,23.33,23.29,2474100.0,1052.4744,0.1,,100 100 100 100 100 100 100 100 100 100,MAGSQTAMTR,HCD
3,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:6102,LTGMAFR,F1:16983,7,99,99,7,398.2128,2,60.29,60.67,5092200.0,794.4109,0.3,,100 100 100 99 100 100 100,LTGMAFR,HCD
4,1,MP_RZ07032023_GW_flat_180min_DDA01.raw,F1:149136,WDNAATYTSPNWSGFTAK,F1:43490,18,99,99,18,1008.9567,2,125.38,125.89,129080000.0,2015.9014,-1.3,,99 99 99 100 100 100 100 100 100 100 100 100 1...,WDNAATYTSPNWSGFTAK,HCD
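
The first five rows all have an ALC of 99, so the two previews look identical and `head()` alone does not show what the filter removed. A row count makes the effect visible. The sketch below uses a small made-up table rather than the real file. The helper name `filter_by_alc` is hypothetical, and its default threshold mirrors the 70% cutoff used above.

```python
import pandas as pd


def filter_by_alc(df: pd.DataFrame, threshold: float = 70) -> pd.DataFrame:
    """Keep rows whose ALC (%) is strictly greater than the threshold."""
    return df[df["ALC (%)"] > threshold].copy()


# Toy data for illustration only; not taken from de_novo_garmerwolde.csv.
toy = pd.DataFrame({
    "Peptide": ["AAAK", "CCCR", "DDDK", "EEER"],
    "ALC (%)": [99, 70, 55, 85],
})

kept = filter_by_alc(toy)
print(f"Retained {len(kept)} of {len(toy)} peptides ({len(kept) / len(toy):.0%})")
```

Note that a peptide with an ALC of exactly 70 is dropped, because the comparison is strict.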


Now that we have filtered our dataset