# Keywords

- **Peptides:** Short chains of amino acids. Amino acids are the building blocks of proteins.
- **Previous amino acid:** The amino acid that comes immediately before a given position in a sequence. In this analysis it is taken to be the residue just before the start of a peptide.
- **K or R:** One-letter codes for two specific amino acids.
- **K:** Lysine.
- **R:** Arginine.
- **SPP:** Stands for `signal peptide peptidase`, here the one found in Plasmodium, the parasites that cause malaria. Its job is to remove specific "signal peptides" from newly made proteins as they are processed inside the parasite's cells.
- **NAT:** Stands for `N-acetyltransferase`, an enzyme that modifies other molecules by transferring a small chemical group, an acetyl group, onto them, which changes their properties. NATs take part in a wide range of biological processes, including the metabolism of drugs, toxins, and endogenous compounds, and they contribute to detoxification by modifying substances so that they are easier to eliminate from the body.
- **Signal Peptides:** Signal peptides are like addresses that guide the proteins to the right place within the parasite's cells. Once the proteins reach their destination, SPP cuts off these signal peptides, allowing the proteins to function properly.
- **Research question 4:** In which peptides is the previous amino acid neither K nor R, and does this set of peptides differ between SPP and NAT?
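
To make the "previous amino acid" idea concrete, here is a minimal sketch that looks up the residue just before a peptide inside a longer protein sequence. The function name `previous_amino_acid`, the `-` placeholder for a peptide at the very start of the protein, and the toy sequence are all assumptions made for illustration. They are not taken from the dataset.

```python
def previous_amino_acid(protein: str, peptide: str) -> str:
    """Return the residue immediately before `peptide` in `protein`.

    Uses the first occurrence of the peptide. Returns '-' (a placeholder
    chosen for this sketch) when the peptide starts at the first residue,
    and raises ValueError when the peptide is not in the protein.
    """
    start = protein.find(peptide)
    if start == -1:
        raise ValueError("peptide not found in protein")
    return protein[start - 1] if start > 0 else "-"


# Made-up sequence, for illustration only.
toy_protein = "MASTKLLEPVRGAVDK"

for pep in ["LLEPVR", "GAVDK", "STK", "MASTK"]:
    prev = previous_amino_acid(toy_protein, pep)
    # The last column is True when the previous residue is neither K nor R.
    print(pep, prev, prev not in ("K", "R"))
```

In this toy example only `STK` (preceded by `A`) and `MASTK` (at the start of the sequence) would count as peptides whose previous amino acid is not K or R.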

In [19]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# # Load the dataset
# data = pd.read_excel('Malaria_Research_Data.xlsx', header=0)

# # Step 2:Calculate the total protein spectra in each sample
# total_spectra = data.groupby('Experiment name')['Protein percentage of total spectra'].sum()

# # Calculate the normalized protein percentage
# data['Normalized protein percentage'] = (data['Protein percentage of total spectra'] / data['Experiment name'].map(total_spectra)) * 100

# # Print the modified dataset
# print(data)

# # Perform further analysis or visualization as desired
# # For example, you can create a bar plot to compare the normalized protein percentage between SPP and NAT samples
# sns.barplot(data=data, x='Biological sample category', y='Normalized protein percentage')
# plt.show()

# Load the dataset (the first row holds the column names)
data = pd.read_excel('Malaria_Research_Data.xlsx', header=0)

# Total protein percentage per sample category (for reference; not used below)
total_spectra = data.groupby('Biological sample category')['Protein percentage of total spectra'].sum()

# Normalize the protein percentages within each category so that each category sums to 100
data['Normalized protein percentage'] = data.groupby('Biological sample category')['Protein percentage of total spectra'].transform(lambda x: (x / x.sum()) * 100)

# Note: this applies the same normalization to values that already sum to 100 per category,
# so it reproduces 'Normalized protein percentage' (up to floating-point rounding)
data['Corrected normalized protein percentage'] = data.groupby('Biological sample category')['Normalized protein percentage'].transform(lambda x: (x / x.sum()) * 100)

print(data)

                                  Experiment name Biological sample category   
0        SPP vs. NAT coIP results recieved 1.9.23                        NAT  \
1        SPP vs. NAT coIP results recieved 1.9.23                        NAT   
2        SPP vs. NAT coIP results recieved 1.9.23                        NAT   
3        SPP vs. NAT coIP results recieved 1.9.23                        NAT   
4        SPP vs. NAT coIP results recieved 1.9.23                        NAT   
...                                           ...                        ...   
19259  SPP vs. NAT coIP results recieved 11.15.22                        SPP   
19260  SPP vs. NAT coIP results recieved 11.15.22                        SPP   
19261  SPP vs. NAT coIP results recieved 11.15.22                        SPP   
19262  SPP vs. NAT coIP results recieved 11.15.22                        SPP   
19263  SPP vs. NAT coIP results recieved 11.15.22                        SPP   
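
The within-category normalization used in the cell above can be checked on a small table. This is a sketch on made-up numbers: only the column names come from the notebook, and the values are hypothetical. It also shows that applying the normalization a second time leaves the values unchanged.

```python
import pandas as pd

# Hypothetical values; only the column names match the notebook.
toy = pd.DataFrame({
    "Biological sample category": ["SPP", "SPP", "NAT", "NAT", "NAT"],
    "Protein percentage of total spectra": [0.2, 0.6, 0.1, 0.1, 0.2],
})

# Scale each row so that the rows of one category sum to 100.
toy["Normalized protein percentage"] = (
    toy.groupby("Biological sample category")["Protein percentage of total spectra"]
    .transform(lambda x: x / x.sum() * 100)
)

# Per-category totals after normalization.
sums = toy.groupby("Biological sample category")["Normalized protein percentage"].sum()
print(sums)

# Normalizing the already normalized column a second time.
again = (
    toy.groupby("Biological sample category")["Normalized protein percentage"]
    .transform(lambda x: x / x.sum() * 100)
)
# Largest absolute change between the first and second normalization.
print((again - toy["Normalized protein percentage"]).abs().max())
```

Because each category already sums to 100 after the first pass, the second pass divides by 100 and multiplies by 100 again, which is why a separate "corrected" column carries no new information.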

In [20]:
# Keep only the columns needed for research question 4
relevant_columns = ['Peptide sequence', 'Previous amino acid', 'Biological sample category']
filtered_data = data[relevant_columns]

In [21]:
# Keep the rows whose previous amino acid is neither K nor R
filtered_data = filtered_data[~filtered_data['Previous amino acid'].isin(['K', 'R'])]
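
One detail of this filter is easy to miss: rows with a missing previous amino acid also pass it, because a missing value is neither `K` nor `R`. The sketch below shows this on made-up rows, using `Series.isin` as an equivalent way to write the same condition. The toy peptides and the `-` placeholder are assumptions for illustration, and whether the real dataset contains missing values is not known from the notebook.

```python
import pandas as pd

# Made-up rows; the column names match the notebook, the values do not.
toy = pd.DataFrame({
    "Peptide sequence": ["AAAK", "CCCR", "DDDK", "EEEK", "FFFR"],
    "Previous amino acid": ["K", "R", "M", None, "-"],
})

# True where the previous amino acid is neither K nor R.
not_kr = ~toy["Previous amino acid"].isin(["K", "R"])

# Same selection as the notebook's filter: missing values are kept.
kept = toy[not_kr]
print(kept)

# Stricter variant that also drops rows with a missing previous amino acid.
kept_known = toy[not_kr & toy["Previous amino acid"].notna()]
print(kept_known)
```

Which variant is appropriate depends on how the source data encodes peptides at the start of a protein, so the stricter version is offered only as an option.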


In [22]:
grouped_data = filtered_data.groupby('Biological sample category')

In [23]:
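# Collect the unique peptide sequences for each sample category
# (start from empty lists so that a missing category does not raise a NameError)
spp_peptides = []
nat_peptides = []
for name, group in grouped_data:
    if name == 'SPP':
        spp_peptides = group['Peptide sequence'].unique()
    elif name == 'NAT':
        nat_peptides = group['Peptide sequence'].unique()

# Peptides found in SPP but not in NAT (a one-sided set difference)
differing_peptides = set(spp_peptides) - set(nat_peptides)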

In [25]:
print("Peptides found in SPP but not in NAT:")
for peptide in differing_peptides:
    print(peptide)

Peptides found in SPP but not in NAT:
QLQNITVQK
VAHNNVLPNVHLHK
KINEIINKYSSNK
GALDESTPVPSR
SNIHTLAEYR
AEQFTEDIGVVNKRLLEPVPFVK
SGNNVQEEDSTFHVSNLYSETEIK
MNEQDYLPIEIK
MDELNKEEIVDNINNEQAK
LTLTGNGK
VLTELGTQITNAFR
SGNNVQEEDSTFHVSNLYSETEIKK
INNIIINK
SNLTAAEEK
SSTEKNEVINSNDTR
VKNLIENVEIK
AHCTEYMNAPK
TSEFWPDLDFK
PTISVYEDDLFEK
INNKYGSK
ATSEELKQLR
MIGIQEGR
ASTEEVSQER
MNNLNILFFNNLGENILK
SQNNPLSVCVADNLINYDIQNESFR
SSVSTLPYIGSK
LQNNKLFDNLR
SNVLEECIK
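
The list above covers one direction only: peptides seen in SPP and absent from NAT. To address the "does this differ" part of the research question more fully, the comparison can be made in both directions. The following is a minimal sketch under that reading. The helper name `compare_peptide_sets` and the toy peptide lists are hypothetical and not part of the notebook.

```python
def compare_peptide_sets(spp_peptides, nat_peptides):
    """Split two peptide collections into SPP-only, NAT-only, and shared sets."""
    spp, nat = set(spp_peptides), set(nat_peptides)
    return {
        "only_spp": spp - nat,   # in SPP but not in NAT
        "only_nat": nat - spp,   # in NAT but not in SPP
        "shared": spp & nat,     # in both categories
    }


# Made-up peptide lists, for illustration only (duplicates are collapsed).
toy_spp = ["AAAK", "CCCR", "DDDK"]
toy_nat = ["CCCR", "DDDK", "EEEK", "EEEK"]

result = compare_peptide_sets(toy_spp, toy_nat)
for label, peptides in result.items():
    # Sorting gives a stable order, since sets are unordered.
    print(label, len(peptides), sorted(peptides))
```

In the notebook, the same helper could be called with `spp_peptides` and `nat_peptides` from the earlier cell.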
