### Research question 2: Do protein modifications differ between SPP and NAT?

#### Protein Modifications
- Proteins are the building blocks of cells and play crucial roles in various biological processes. Protein modifications are chemical alterations made to proteins after they are synthesized, commonly referred to as post-translational modifications (PTMs). They can arise through enzymatic reactions or through interactions with other molecules.
- Common types of protein modifications include phosphorylation, acetylation, methylation, glycosylation, ubiquitination, and many others. These modifications can alter a protein's structure, stability, localization, activity, and interactions with other molecules. Understanding protein modifications is essential in deciphering cellular processes, disease mechanisms, and potential therapeutic targets, such as in malaria research.

#### Plasmodium signal peptide peptidase (SPP)
- Signal peptide peptidase (SPP) is an enzyme found in the Plasmodium parasite, which causes malaria. It takes part in processing proteins that are important for the parasite's survival and its ability to infect human cells.
- When proteins destined for secretion or for a membrane are made in cells, they usually carry a "signal peptide" at one end. This signal peptide acts like a tag, telling the cell where the protein needs to go. Once the protein has been targeted, a separate enzyme, signal peptidase, cuts the signal peptide off, leaving the detached peptide in the membrane. That's where SPP comes in.
- SPP is an intramembrane protease that cleaves these leftover signal peptides. It acts like a pair of molecular scissors working inside the membrane, cutting the remnant into fragments that are released from the membrane.
- The role of SPP in Plasmodium is important because it ensures that proteins are correctly processed and directed to the right places. This helps the parasite survive and carry out its infectious activities. Researchers studying malaria may be interested in understanding SPP's function and exploring its potential as a target for developing new treatments against the disease.

#### N-Acetyltransferase (NAT)
- N-Acetyltransferases (NATs) are a class of enzymes that catalyze the transfer of an acetyl group from acetyl-coenzyme A (acetyl-CoA) to the amino group of various substrates, including small molecules, drugs, and proteins. NATs are involved in the process of acetylation, one of the common post-translational modifications of proteins.
- In the context of malaria research, NATs can be of interest due to their potential involvement in the modification of proteins within the Plasmodium parasite or host cells. The acetylation of proteins mediated by NATs can affect protein stability, function, and protein-protein interactions, thereby influencing various cellular processes. By studying the role of NATs in malaria, researchers can gain insights into the molecular mechanisms underlying parasite-host interactions and identify potential targets for intervention.

In [19]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

pd.set_option('display.max_colwidth', None)

data = pd.read_excel('Malaria_Research_Data.xlsx', header=0)
total_spectra = data.groupby('Biological sample category')['Protein percentage of total spectra'].sum()

# Normalize the protein percentages within each group
data['Normalized protein percentage'] = data.groupby('Biological sample category')['Protein percentage of total spectra'].transform(lambda x: (x / x.sum()) * 100)

# Calculate the corrected normalized protein percentage
data['Corrected normalized protein percentage'] = data.groupby('Biological sample category')['Normalized protein percentage'].transform(lambda x: (x / x.sum()) * 100)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19264 entries, 0 to 19263
Data columns (total 20 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Experiment name                          19264 non-null  object 
 1   Biological sample category               19264 non-null  object 
 2   Protein group                            19264 non-null  object 
 3   Protein accession number                 19264 non-null  object 
 4   Protein name                             19264 non-null  object 
 5   Protein identification probability       19264 non-null  float64
 6   Protein percentage of total spectra      19264 non-null  float64
 7   Number of unique peptides                19264 non-null  int64  
 8   Number of unique spectra                 19264 non-null  int64  
 9   Number of total spectra                  19264 non-null  int64  
 10  Peptide sequence                         19264
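
The two normalization steps in the cell above can be checked on a small example. The sketch below uses a made-up toy table (the values are not from `Malaria_Research_Data.xlsx`) and the same column names as the notebook. It suggests that once each category has been rescaled to sum to 100, the second pass leaves the values unchanged up to floating-point rounding, so the "corrected" column would be expected to match the "normalized" one.

```python
import pandas as pd

# Toy stand-in for the real spreadsheet; the numbers are invented for illustration.
toy = pd.DataFrame({
    "Biological sample category": ["SPP", "SPP", "NAT", "NAT", "NAT"],
    "Protein percentage of total spectra": [0.2, 0.6, 0.1, 0.1, 0.3],
})

# First pass: rescale so the percentages within each category sum to 100.
toy["Normalized protein percentage"] = (
    toy.groupby("Biological sample category")["Protein percentage of total spectra"]
    .transform(lambda x: x / x.sum() * 100)
)

# Second pass over the already-normalized column, as in the cell above.
toy["Corrected normalized protein percentage"] = (
    toy.groupby("Biological sample category")["Normalized protein percentage"]
    .transform(lambda x: x / x.sum() * 100)
)

# Each category should now total 100 in the normalized column.
per_group_totals = toy.groupby("Biological sample category")["Normalized protein percentage"].sum()
print(per_group_totals)

# Largest difference between the two columns (expected to be ~0).
max_difference = (
    toy["Normalized protein percentage"] - toy["Corrected normalized protein percentage"]
).abs().max()
print(max_difference)
```

If the same holds on the real data, one of the two columns could be dropped without changing the analysis.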

In [23]:
# collect protein modification columns from source dataframe
# protein_mods_data = data[['Biological sample category', 'Protein accession number', 'Peptide sequence', 'Peptide identification probability', 'Modifications identified by spectrum']]
# print(f"\nprotein_mods_data dataset info:\n")
# print(protein_mods_data.info())

# create subset of rows for the SPP sample category
# (filtering `data` directly, since `protein_mods_data` above is commented out)
protein_mods_spp = data[data['Biological sample category'] == 'SPP']

# create subset of rows for the NAT sample category
protein_mods_nat = data[data['Biological sample category'] == 'NAT']

In [10]:
# get count of each distinct occurrence of `Modifications identified by spectrum` for SPP
spp_modifications = protein_mods_spp['Modifications identified by spectrum']
# .count() counts non-null entries, i.e. rows that carry a modification label
print("The total number of modifications for SPP:", spp_modifications.count())
print("\nThe number of SPP modifications identified by spectrum:\n")
print(spp_modifications.value_counts(sort=False))

The total number of modifications for SPP: 3341

The number of SPP modifications identified by spectrum:

Modifications identified by spectrum
Carbamidomethyl (+57)                                                                   1039
Deamidated (+1)                                                                          922
Acetyl (+42)                                                                              77
Deamidated (+1), Carbamidomethyl (+57)                                                    39
Oxidation (+16)                                                                          549
Oxidation (+16), Acetyl (+42)                                                             29
Carbamidomethyl (+57), Carbamidomethyl (+57)                                             172
Deamidated (+1), Deamidated (+1)                                                          94
Carbamidomethyl (+57), Deamidated (+1)                                                    60
Acetyl (+42), Deamida

In [4]:
# get count of each distinct occurrence of `Modifications identified by spectrum` for NAT
nat_modifications = protein_mods_nat['Modifications identified by spectrum']
# .count() counts non-null entries, i.e. rows that carry a modification label
print("The total number of modifications for NAT:", nat_modifications.count())
print("\nThe number of NAT modifications identified by spectrum:\n")
print(nat_modifications.value_counts(sort=False))

The total number of modifications for NAT: 3331

The number of NAT modifications identified by spectrum:

Modifications identified by spectrum
Carbamidomethyl (+57), Carbamidomethyl (+57)                                                                   147
Carbamidomethyl (+57)                                                                                          971
Acetyl (+42)                                                                                                    90
Oxidation (+16), Acetyl (+42)                                                                                   19
Deamidated (+1)                                                                                                979
Deamidated (+1), Carbamidomethyl (+57)                                                                          44
Deamidated (+1), Deamidated (+1)                                                                               103
Oxidation (+16), Carbamidomethyl (+57), Deamidated (+
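
The counts above are per label, where a label can combine several modifications (for example `Deamidated (+1), Carbamidomethyl (+57)`). To compare SPP and NAT by individual modification type, one option is to split each label and count the parts. The sketch below does this on a made-up toy frame; the rows and the `Modification` helper column are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy rows using the notebook's column names; the content is invented.
toy = pd.DataFrame({
    "Biological sample category": ["SPP", "SPP", "SPP", "NAT", "NAT"],
    "Modifications identified by spectrum": [
        "Carbamidomethyl (+57)",
        "Deamidated (+1), Carbamidomethyl (+57)",
        None,  # row with no reported modification
        "Oxidation (+16), Acetyl (+42)",
        "Deamidated (+1), Deamidated (+1)",
    ],
})

# Split each label on ", " and give every single modification its own row.
mods = (
    toy.dropna(subset=["Modifications identified by spectrum"])
    .assign(Modification=lambda df: df["Modifications identified by spectrum"].str.split(", "))
    .explode("Modification")
)

# Rows: modification type; columns: sample category; cells: occurrence counts.
per_type_counts = pd.crosstab(mods["Modification"], mods["Biological sample category"])
print(per_type_counts)
```

On the real data, the same steps could be applied to `data` in place of `toy`, giving one table that puts both sample categories side by side.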

In [24]:
# find which modification labels occur in SPP but not in NAT
# (exact string match, so the order of modifications within a label matters)
from IPython.display import display

spp_mods_not_in_nat = protein_mods_spp[~protein_mods_spp['Modifications identified by spectrum'].isin(protein_mods_nat['Modifications identified by spectrum'])]
print("\nModifications found in SPP but not NAT\n")
display(spp_mods_not_in_nat[['Protein accession number', 'Protein name', 'Peptide sequence', 'Peptide identification probability', 'Modifications identified by spectrum']])


Modifications found in SPP but not NAT



| Row index | Protein accession number | Protein name | Peptide sequence | Peptide identification probability | Modifications identified by spectrum |
|---|---|---|---|---|---|
| 4956 | Q8I3A3 | Ubiquitin specific protease, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_0904600 PE=4 SV=1 | KNDNIIQNNK | 0.911 | Deamidated (+1), Deamidated (+1), Deamidated (+1), Deamidated (+1) |
| 12899 | Q8I3T8 | 60S ribosomal protein L12, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_0517000 PE=3 SV=2 | EMLGTCNSIGCTVDGK | 0.997 | Oxidation (+16), Carbamidomethyl (+57), Deamidated (+1), Carbamidomethyl (+57) |
| 13271 | Q8II57 | Structural maintenance of chromosomes protein 1, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1130700 PE=4 SV=1 | QINCKNYLNEKK | 0.929 | Deamidated (+1), Deamidated (+1), Carbamidomethyl (+57), Deamidated (+1) |
| 13536 | Q8IKW5 | Elongation factor 2 OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1451100 PE=3 SV=1 | YTEQVQDVPCGNTCCLVGVDQYIVK | 0.954 | Carbamidomethyl (+57), Deamidated (+1), Carbamidomethyl (+57), Carbamidomethyl (+57) |
| 13546 | Q8IDN6 | Protein transport protein SEC61 subunit alpha OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1346100 PE=3 SV=1 | GTLMELGISPIVTSGMVMQLLAGSK | 0.997 | Oxidation (+16), Oxidation (+16), Oxidation (+16) |
| 16901 | Q8I5J4 | Uncharacterized protein OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1221900 PE=4 SV=1 | KVNKNDEDLNNNSK | 0.975 | Deamidated (+1), Deamidated (+1), Deamidated (+1), Deamidated (+1) |
| 18003 | Q7KQL9 | Fructose-bisphosphate aldolase OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=FBPA PE=1 SV=1 | AHCTEYMNAPK | 0.997 | Acetyl (+42), Carbamidomethyl (+57), Oxidation (+16) |
| 18004 | Q7KQL9 | Fructose-bisphosphate aldolase OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=FBPA PE=1 SV=1 | AHCTEYMNAPK | 0.997 | Acetyl (+42), Carbamidomethyl (+57), Oxidation (+16) |
| 18264 | Q8IJT9 | Eukaryotic translation initiation factor 2 subunit beta OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1010600 PE=3 SV=1 | YITEYVTCQMCK | 0.979 | Carbamidomethyl (+57), Oxidation (+16), Carbamidomethyl (+57) |
| 18265 | Q8IJT9 | Eukaryotic translation initiation factor 2 subunit beta OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1010600 PE=3 SV=1 | YITEYVTCQMCK | 0.989 | Carbamidomethyl (+57), Oxidation (+16), Carbamidomethyl (+57) |
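
The `.isin` comparison above matches label strings exactly, so two labels with the same modifications in a different order count as different. Both `Deamidated (+1), Carbamidomethyl (+57)` and `Carbamidomethyl (+57), Deamidated (+1)` appear in the SPP counts earlier, for instance. If the order within a label is not meaningful for the question being asked (an assumption the analyst would need to confirm), labels can be reduced to an order-insensitive key before comparing. The helper `canonical_mods` below is hypothetical, and the two label sets are small hand-picked examples, not the full data.

```python
from collections import Counter

def canonical_mods(label):
    """Return an order-insensitive key for a comma-separated modification label."""
    parts = [p.strip() for p in label.split(",") if p.strip()]
    # Sorted (name, count) pairs: the same multiset always gives the same key.
    return tuple(sorted(Counter(parts).items()))

# Hand-picked example labels (illustrative only).
spp_labels = {
    "Carbamidomethyl (+57), Deamidated (+1)",
    "Oxidation (+16), Oxidation (+16), Oxidation (+16)",
}
nat_labels = {
    "Deamidated (+1), Carbamidomethyl (+57)",
    "Oxidation (+16)",
}

# Exact string comparison: both SPP labels look unique to SPP.
raw_only_in_spp = spp_labels - nat_labels

# Order-insensitive comparison: only the triple oxidation remains.
canonical_only_in_spp = (
    {canonical_mods(label) for label in spp_labels}
    - {canonical_mods(label) for label in nat_labels}
)

print(len(raw_only_in_spp), len(canonical_only_in_spp))
print(canonical_only_in_spp)
```

With pandas, the same key could be added as a column via `Series.map(canonical_mods)` on the non-null labels and used in place of the raw label in the `.isin` filter.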


In [17]:
# find which modification labels occur in NAT but not in SPP
# (exact string match, so the order of modifications within a label matters)
from IPython.display import display

nat_mods_not_in_spp = protein_mods_nat[~protein_mods_nat['Modifications identified by spectrum'].isin(protein_mods_spp['Modifications identified by spectrum'])]
print("\nModifications found in NAT but not SPP\n")
display(nat_mods_not_in_spp[['Protein accession number', 'Protein name', 'Peptide sequence', 'Peptide identification probability', 'Modifications identified by spectrum']])


Modifications found in NAT but not SPP



| Row index | Protein accession number | Protein name | Peptide sequence | Peptide identification probability | Modifications identified by spectrum |
|---|---|---|---|---|---|
| 1240 | Q8I5Y5 | Shewanella-like protein phosphatase 2 OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1206000 PE=4 SV=1 | FCVCCYNGPTFNR | 0.934 | Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Deamidated (+1) |
| 1241 | Q8I5Y5 | Shewanella-like protein phosphatase 2 OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1206000 PE=4 SV=1 | FCVCCYNGPTFNR | 0.938 | Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Deamidated (+1) |
| 1242 | Q8I5Y5 | Shewanella-like protein phosphatase 2 OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1206000 PE=4 SV=1 | FCVCCYNGPTFNR | 0.997 | Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Deamidated (+1) |
| 1243 | Q8I5Y5 | Shewanella-like protein phosphatase 2 OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1206000 PE=4 SV=1 | FCVCCYNGPTFNR | 0.997 | Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Deamidated (+1) |
| 1282 | Q8IIC8 | Golgi protein 2 OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1123500 PE=4 SV=1 | NKMIDYTNMLQRSK | 0.945 | Oxidation (+16), Deamidated (+1), Oxidation (+16), Deamidated (+1) |
| 2624 | Q8IBQ6 | 60S ribosomal protein L11a, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_0719600 PE=1 SV=1 | EQNVMREIKVNK | 0.924 | Deamidated (+1), Oxidation (+16), Deamidated (+1) |
| 3010 | Q8IL96 | N-acetyltransferase, GNAT family, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_1437000 PE=4 SV=2 | NNNDTCNEQNKDNNNNNNNNNNNNNNQLSK | 0.937 | Carbamidomethyl (+57), Deamidated (+1), Deamidated (+1), Deamidated (+1) |
| 3275 | Q8ID94\|YPF12_PLAF7-DECOY | Q8ID94\|YPF12_PLAF7-DECOY | IDDPINMSSMVGPVLNNDMNTINNNVTSNKK | 0.923 | Deamidated (+1), Deamidated (+1), Deamidated (+1), Deamidated (+1), Deamidated (+1) |
| 5719 | Q8I3X4 | Purine nucleoside phosphorylase OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PNP PE=1 SV=1 | FLCVSHGVGSAGCAVCFEELCQNGAK | 0.997 | Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Deamidated (+1) |
| 5720 | Q8I3X4 | Purine nucleoside phosphorylase OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PNP PE=1 SV=1 | FLCVSHGVGSAGCAVCFEELCQNGAK | 0.997 | Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Carbamidomethyl (+57), Deamidated (+1) |
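
One row in the NAT result above has an accession ending in `-DECOY`. In target-decoy database searches, such entries typically mark artificial sequences used to estimate the false discovery rate rather than real proteins. Assuming that convention applies to this dataset, it may be worth excluding those rows before comparing modifications. The sketch below shows one way to do so on a toy frame; the three rows reuse accessions and probabilities from the table above, and the variable names are illustrative.

```python
import pandas as pd

# Three rows taken from the NAT table above, reduced to two columns.
toy = pd.DataFrame({
    "Protein accession number": ["Q8I3X4", "Q8ID94|YPF12_PLAF7-DECOY", "Q8IL96"],
    "Peptide identification probability": [0.997, 0.923, 0.937],
})

# Flag accessions containing the literal text "DECOY" (assumed decoy marker).
is_decoy = toy["Protein accession number"].str.contains("DECOY", regex=False)

# Keep only the non-decoy (target) rows.
targets_only = toy[~is_decoy]
print(targets_only)
```

Applying the same mask to `data` before building `protein_mods_spp` and `protein_mods_nat` would keep decoy hits out of both comparisons.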
