# A Quantitative Proteomic Workflow for TP53 Mutations

Author: Christian Saluta

In the previous Jupyter Notebook, several questions were proposed and explored regarding the protein-level mechanisms behind TP53's role in the dysregulation of cellular proliferation and the potential for cancer development. On the topic of mutated residues that can interfere with TP53's functionality, a quantitative proteomic workflow may help identify mutations that occur in TP53's sequence and/or in proteins that regulate TP53. This Jupyter Notebook applies that workflow to a hypothetical study.

# A Hypothetical Study Scenario

A handful of patients (~20) from around the world have been diagnosed with a rare, unknown form of cancer after biopsy testing within the past few decades. All patients exhibit the same symptoms and the same localization of cancerous cells. Database entries hold little to no information about this cancer beyond the basic test finding that excessive TP53 was identified. An ample amount of cells from the patients' biopsies was saved in cryogenic storage for further testing. Cells from the same site in healthy individuals without cancer are also available for analysis from other studies.

Researchers are interested in conducting a proteomic analysis of the proteins that directly regulate expression of the TP53 protein. As seen in the previous Jupyter Notebook and in network databases such as BioPlex (Huttlin et al., 2021; Schweppe et al., 2017), TP53 expression can be regulated by other proteins, which in turn affects the regulation of DNA expression and ultimately cellular proliferation. Examples of proteins within this regulatory proteome that were found in these network databases include MDM4, UBE3A, ING2, and others.

The cells obtained from the frozen biopsies are used to extract proteins (including TP53 and its regulators) through laboratory techniques such as cell lysis, precipitation of the protein in a solvent, and/or centrifugation to separate the proteins from other cellular components. Separation of TP53 from other proteins can then be achieved through further laboratory techniques such as gel electrophoresis, immunoprecipitation, and/or the liquid chromatography methods commonly used in proteomic analysis. For this scenario, the researchers decide to lyse the cells from each biopsy and use liquid chromatography to extract a predetermined set of proteins.

# Mass Spectrometry Approach

With the proteins of the different cells ready to be analyzed, mass spectrometry can be performed to determine which proteins are present. Different types and variations of mass spectrometry, along with different comparison databases of spectral data, can be used to detect and/or quantify the proteins within the samples. Before proteins are introduced into the mass spectrometer, several processing steps take place. In addition to the liquid chromatography step that separates the contents of the sample, proteins should be broken down into smaller pieces/fragments through proteolysis with enzymes such as trypsin that cleave the protein's backbone at specific positions. Additionally, these fragments are ionized so that the electromagnetic field of the mass analyzer can act upon them and separate them by a combination of mass and charge (mass to charge ratio, m/z). The level of ionization depends on the ionization method: electrospray ionization (ESI) typically adds several protons to each peptide, while MALDI adds about 1 proton. Time-of-flight (TOF) is a type of mass analyzer that is often paired with MALDI, not an ionization method itself.
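
The digestion and m/z ideas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the study's actual pipeline: it assumes the common trypsin rule (cleave after K or R unless the next residue is P), uses standard monoisotopic residue masses, and ignores missed cleavages and chemical labels. The input stretch is built around the TYQGSYGFR peptide named in this notebook's PeptideAtlas reference; the flanking residues and the function names are assumptions made for illustration.

```python
import re

# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (H on the amino end, OH on the carboxyl end)
PROTON = 1.00728   # mass of each charge-carrying proton


def tryptic_digest(sequence):
    """Split after every K or R, except when the next residue is P."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]


def peptide_mz(peptide, charge):
    """m/z of an unmodified peptide carrying `charge` protons."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


# Illustrative stretch containing TYQGSYGFR; the flanking residues are an assumption.
fragment = "SQKTYQGSYGFRLGFLHSGTAK"
peptides = tryptic_digest(fragment)
for pep in peptides:
    print(pep, [round(peptide_mz(pep, z), 4) for z in (1, 2, 3)])
```

The same peptide appears at a different m/z for each charge state, which is why the ionization method matters when reading a spectrum. A labelled peptide (such as the iTRAQ-labelled one in the PeptideAtlas entry) would be heavier than the unlabelled values computed here.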

The prepared fragments are then passed through the electromagnetic field and are separated by m/z ratio depending on how long they take to cross. The process can either end here, with fragments passing a detector, or another round of fragmentation can be applied to these precursor ions, which then pass through a second electromagnetic field before reaching a detector at the end. The researchers go with the latter method, tandem mass spectrometry, to achieve better specificity. The detector at the end of the process records the ionized fragments as they pass by during a time interval and graphs this information in terms of m/z ratio and the abundance (intensity) of the ionized fragments detected. The researchers use an algorithmic database search approach to compare their data against a database that contains spectral data related to disease variants of proteins. A search engine such as SEQUEST (Eng et al., 1994) performs this comparison through its own scoring algorithms. The database should contain the wild-type form and the variants of each protein to be observed during the workflow. Further analysis programs such as PeptideShaker are used by the researchers for additional tasks such as visualization and validation.

If this study were to be performed in the future, newer mass spectrometry techniques that are now becoming commonplace could improve on the standards above. Data-independent acquisition mass spectrometry, combined with deep learning, helps with the analysis and prediction of parameters such as retention time, spectral angle, and peak shape, and supports peptide detection with higher specificity and quantitation while reducing the time spent on the spectral data. Eventually, all-ion fragmentation could become a viable approach in which ions are fragmented without any selection of specific precursors.

# Experiment and Data Aspects Influencing Statistical Significance

Certain aspects of the experiment and the data should be kept in mind so that the results have statistical relevance. At the start of the scenario, the biopsy cells from each patient are kept frozen until they are ready to be tested. Although patients within this cohort may simply have different cancers with different mutations that manifest the same symptoms, conducting multiple experimental runs on the same batch of biopsy cells allows the mass spectrometry process to be validated.

A research question for this investigation could be: do all of these diagnosed cancer patients share the same mutational defects in TP53's interaction network? The null hypothesis built around this question, which the data could potentially reject, would be: this cancer in these patients does not share the same mutational defects within TP53's interaction network.
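
One possible way to put a number on this kind of question, for a single candidate mutation, is to ask whether it is seen more often in patients than in healthy controls. The sketch below uses a one-sided Fisher's exact test written with the Python standard library only. The choice of test, the function name, and all counts are assumptions made for illustration; the study scenario does not prescribe a particular statistical test, and no real data are shown.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (patients, controls); columns are (mutation found, not found).
    Returns the probability of seeing `a` or more patients with the mutation
    if the mutation were unrelated to group membership.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)
    upper = min(row1, col1)
    favourable = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return favourable / total_tables


# Hypothetical counts: 18 of 20 patients and 1 of 20 controls carry the mutation.
p_value = fisher_exact_greater(18, 2, 1, 19)
print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value would suggest that the mutation is associated with the patient group. Testing many proteins in the interaction network this way would call for a multiple-testing correction, which is left out of this sketch.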

The independent variable for this workflow would be the spectral data of each protein in TP53's interaction network. These proteins can be selected based on the information displayed within databases such as IntAct and BioPlex, as mentioned previously. The dependent variable would be the sequences obtained from analyzing the spectra graphs. There are several control variables, such as the cells obtained from healthy individuals, the tandem mass spectrometry equipment, the program that digitizes the detector results into spectra graphs, the search engine SEQUEST with its scoring processes and algorithms, and the peptide spectra database that is searched to match graphs and obtain the sequences. When a search engine such as SEQUEST reads the input query spectrum to compare against other spectra within a database, peaks of abundance (intensity) are compared between graphs to determine the sequence of the protein from mass spectrometry. Below is a spectrum graph of an ionized peptide fragment found within the wild-type sequence of TP53, obtained from the PeptideAtlas database (Desiere et al., 2006):

![Screen%20Shot%202023-12-07%20at%2012.21.09%20PM.png](attachment:Screen%20Shot%202023-12-07%20at%2012.21.09%20PM.png)

The graph above (PeptideAtlas, 2013) has several details that can be observed. The yellow lines represent the precursor ions that are present before the second fragmentation occurs during tandem mass spectrometry. The numbered a and b lines represent fragments counted from the amino end of the peptide toward the carboxyl end. The numbered y lines, on the other hand, represent fragments counted from the carboxyl end toward the amino end. Fragments such as y1+ and b1+ represent the single amino acids at the ends of the peptide, while higher numbers represent larger fragments where the break occurs further down the sequence from the starting end. If the TP53 protein or any other protein from its interaction network were unmutated with wild-type sequences, the spectra graphs generated would generally match the ones within the PeptideAtlas database (or whichever other database is used). If there are mutations, however, one or more of the lines generally would not match up at the same mass/charge x-axis position, because a different amino acid changes the fragment's mass to charge ratio and therefore its travel time through the mass spectrometry equipment.
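
To make the b and y series concrete, the sketch below computes singly charged b and y ion positions for the unlabelled TYQGSYGFR peptide and for a hypothetical variant with one substitution, then lists which ions move. It assumes standard monoisotopic masses and ignores labels such as iTRAQ, so the numbers will not line up exactly with the labelled PeptideAtlas spectrum above. The variant sequence and the function names are invented for illustration.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1), counted from the amino end;
    y_ions[i] is y(i+1), counted from the carboxyl end.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions, y_ions


def shifted_ions(reference, variant, tolerance=0.01):
    """1-based ion numbers whose m/z differs by more than `tolerance`."""
    return [i + 1 for i, (r, v) in enumerate(zip(reference, variant)) if abs(r - v) > tolerance]


wild_type = "TYQGSYGFR"
variant = "TYQGNYGFR"   # hypothetical S-to-N substitution at position 5 of the peptide

b_wt, y_wt = fragment_ions(wild_type)
b_var, y_var = fragment_ions(variant)
shifted_b = shifted_ions(b_wt, b_var)
shifted_y = shifted_ions(y_wt, y_var)
print("shifted b ions:", shifted_b)
print("shifted y ions:", shifted_y)
```

Only the fragments that contain the substituted residue move, so the pattern of shifted and unshifted b and y ions points to where in the peptide the change sits.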

It is important to keep in mind that the analysis outlined above is not as simple under other conditions, and to acknowledge the false positive and false negative errors that search engine algorithms must handle. Using the SEQUEST search engine as an example (the same applies to any of the available search engines), some amino acids may produce similar peaks within the graph. In SEQUEST's matching process, the score (Xcorr) between an experimentally obtained spectrum and a database spectrum is calculated as the matched ion intensity corrected by the average correlation at random offsets. ∆CN is then calculated as the difference between the two best Xcorr values divided by the Xcorr value of the best match. A larger ∆CN value is generally more favorable because it means the best match stands clearly apart from the runner-up, but false positives can still occur where an incorrect protein sequence is accepted as true based on the Xcorr value calculated. Newer tools that the researchers may also decide to use in their study, such as MS2PIP (Degroeve and Martens, 2013), MS2Rescore (Declercq et al., 2022), and ionbot (Degroeve et al., 2022), use more advanced methods such as machine learning to improve the accuracy and identification of the peptide fragment ions.
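
The scoring idea can be sketched as follows. This is a simplified stand-in for illustration, not SEQUEST's actual implementation: spectra are reduced to unit-width m/z bins, the preprocessing steps of the real algorithm are skipped, the offset window of 75 bins on each side is an assumption, and the peak lists are made up. ∆CN follows the definition given in the paragraph above.

```python
def binned(peaks, n_bins, bin_width=1.0):
    """Turn (m/z, intensity) pairs into a fixed-length intensity vector."""
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            vector[index] = max(vector[index], intensity)
    return vector


def shifted_dot(x, y, shift):
    """Dot product of x with y displaced by `shift` bins."""
    return sum(x[i] * y[i + shift] for i in range(len(x)) if 0 <= i + shift < len(y))


def simple_xcorr(observed, theoretical, max_shift=75):
    """Correlation at zero offset minus the mean correlation at other offsets."""
    at_zero = shifted_dot(observed, theoretical, 0)
    background = [shifted_dot(observed, theoretical, s)
                  for s in range(-max_shift, max_shift + 1) if s != 0]
    return at_zero - sum(background) / len(background)


def delta_cn(xcorr_scores):
    """(best Xcorr - second best Xcorr) / best Xcorr."""
    ranked = sorted(xcorr_scores, reverse=True)
    return (ranked[0] - ranked[1]) / ranked[0]


# Made-up peak lists: one observed spectrum and two candidate database spectra.
N_BINS = 500
observed = binned([(100.0, 1.0), (200.0, 1.0), (300.0, 1.0)], N_BINS)
candidates = {
    "candidate_A": binned([(100.0, 1.0), (200.0, 1.0), (300.0, 1.0)], N_BINS),
    "candidate_B": binned([(100.0, 1.0), (250.0, 1.0), (350.0, 1.0)], N_BINS),
}
scores = {name: simple_xcorr(observed, spectrum) for name, spectrum in candidates.items()}
print(scores)
print("delta CN:", round(delta_cn(list(scores.values())), 3))
```

Subtracting the background term penalizes candidates that only match by chance at shifted positions, and ∆CN then expresses how far the top candidate is ahead of the next one.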

# Post-MS Data Analysis Workflow

After determining the statistically significant sequences of proteins present within the cellular samples, comparisons can be made to the wild-type proteins' sequences and subsequently to other characteristics such as higher level structures, protein-protein interactions, and so forth. Databases introduced within the previous Jupyter Notebook will be mentioned again within this section. For this portion of the analysis, it is assumed that a hypothetical mutated TP53 sequence was found within the mass spectra from the section above, but the overall workflow can be applied to any number of mutated proteins in TP53's interaction network that were encountered during the mass spectrometry proteomic study.

Starting with the sequence comparison, the previous Jupyter Notebook presented several diseases associated with TP53 mutations, such as Li-Fraumeni Syndrome, that may occur due to amino acid residue mutations within certain portions of the protein sequence. To see if the patients' form of cancer has similar mutations in similar positions such as the DNA-binding domain, an alignment can be performed between the wild-type sequence presented in the previous Jupyter Notebook and a mutated TP53 sequence obtained from the mass spectrometry analysis above. This can be performed via a web-server based approach or through a local program installation such as BioPython, which provides the Smith-Waterman local alignment algorithm. To keep the amount of code down for novice biologists, the former approach will be utilized.

There are many web-server based programs that utilize a number of different global and local alignment algorithms. EMBL-EBI, introduced in the previous Jupyter Notebook, also provides several sequence alignment tools, including EMBOSS water (2022), which uses the same Smith-Waterman local sequence alignment algorithm that can be found in BioPython. By inputting the FASTA forms of the wild-type sequence of TP53 and the sequence obtained from the mass spectrometry method above, an alignment can be made in which matching residue pairs are marked by the | symbol between the two sequences, so pairs that do not match stand out by its absence. There may be some ambiguity on the amino and carboxyl ends depending on the results from the mass spectrometry. The residues that do not align can be interpreted by comparing them to the Disease & Variants section of TP53's entry in UniProt, introduced in the previous Jupyter Notebook, or through knowledge of the changes in physical/chemical properties that occur due to the mutation.
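
For readers curious about what happens behind the web form, a bare-bones version of the Smith-Waterman idea is sketched below in plain Python. It is a teaching sketch under simplifying assumptions: it uses a flat match/mismatch score and a linear gap penalty, whereas EMBOSS water uses a substitution matrix and separate gap open/extend penalties, so scores will not be comparable. The scoring values, the function name, and the variant sequence are all invented for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, top line, marker line, bottom line)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1

    top_line = "".join(reversed(top))
    bottom_line = "".join(reversed(bottom))
    markers = "".join("|" if x == y else " " for x, y in zip(top_line, bottom_line))
    return best, top_line, markers, bottom_line


# Short wild-type peptide versus a hypothetical variant with one substitution.
score, top_line, markers, bottom_line = smith_waterman("TYQGSYGFR", "TYQGNYGFR")
print(top_line)
print(markers)
print(bottom_line)
print("score:", score)
```

The blank in the marker line is the position to look up in the variant annotations. Full-length sequences would be handled the same way, only with a larger matrix.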

Another approach, if different mutated versions of a protein were found during tandem mass spectrometry, would be to use a global alignment algorithm. As with the Smith-Waterman algorithm, global multiple sequence alignment programs such as ClustalW and MUSCLE are available via local program installations (for example, alongside BioPython) or through web servers such as EMBL-EBI's. Using the same web-server based approach as the local alignment process above, multiple FASTA sequences of mutated TP53 that may have been discovered during the mass spectrometry section can be entered along with the FASTA sequence of the wild-type TP53 protein. A global alignment can then be conducted between the multiple sequences by opening gaps where needed so that amino acid residues match throughout the sequence length.
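
Once a multiple alignment has been produced by one of these tools, the aligned rows can be read column by column to list how each patient sequence differs from the wild type. The sketch below assumes the alignment has already been computed and is supplied as equal-length gapped strings; it does not perform the alignment itself. The sample names, the sequences, and the `variant_calls` function are hypothetical, and positions are counted along the wild-type fragment shown, not along full-length TP53.

```python
def variant_calls(aligned, reference="wild_type"):
    """List differences from the reference for each aligned sequence.

    `aligned` maps a name to a gapped sequence; all sequences have equal length.
    Each difference is written as <reference residue><position><observed residue>,
    where "-" marks a gap and position counts reference residues seen so far.
    """
    ref = aligned[reference]
    calls = {}
    for name, seq in aligned.items():
        if name == reference:
            continue
        position, diffs = 0, []
        for ref_res, obs_res in zip(ref, seq):
            if ref_res != "-":
                position += 1
            if ref_res != obs_res:
                diffs.append(f"{ref_res}{position}{obs_res}")
        calls[name] = diffs
    return calls


# Hypothetical pre-aligned fragments (for example, copied from a ClustalW or MUSCLE result).
alignment = {
    "wild_type": "TYQGSYGFR",
    "patient_01": "TYQGNYGFR",
    "patient_02": "TYQGNYGFR",
    "patient_03": "TYQG-YGFR",
}
calls = variant_calls(alignment)
shared = set.intersection(*(set(diffs) for diffs in calls.values()))
print(calls)
print("shared by all patients:", shared)
```

An empty shared set, as in this made-up example, would mean the patients do not all carry the same change at this site, which speaks directly to the hypothesis of the study.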

Visualization and homology modeling can also be performed using the sequences determined from the mass spectrometry section. Several programs that can accomplish this are available, such as SWISS-MODEL (2023) and MODELLER (2021). For SWISS-MODEL, mutated protein sequences can be entered in FASTA format as a query to search for template models based on sequence similarity. Building a structure based on one of these models provides a file in the PDB format that was seen in the previous Jupyter Notebook. Loading this PDB file into the Chimera visualization program (2023) used before allows for 3D visualization of the mutated protein. To quantify the similarity between the wild-type protein structure and the mutated protein structure in 3D space, the PDB files of each can be uploaded as queries in the PDB database's protein comparison tool introduced in the previous Jupyter Notebook. In addition to a 3D visualization similar to that of Chimera, the PDB protein comparison tool provides several quantitative parameters such as percent identity, number of identical residues, and RMSD between the input queries. These values may also help determine whether the cancerous cells of different individuals share the same point residue mutations if results from mass spectrometry need further validation.
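
The two most commonly quoted numbers, RMSD and percent identity, are simple to compute once the inputs are paired up. The sketch below assumes the two structures have already been superposed and that their atoms are listed in matching order, which is the hard part that comparison tools handle. It is not the PDB tool's own method; the coordinates are toy values, the function names are invented, and percent identity is taken over aligned non-gap columns, which is one of several conventions in use.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return sqrt(squared / len(coords_a))


def percent_identity(aligned_a, aligned_b):
    """Identical residues as a percentage of aligned columns without gaps."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


# Toy coordinates in angstroms standing in for matched alpha-carbon atoms.
wild_type_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
mutant_atoms = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

structure_rmsd = rmsd(wild_type_atoms, mutant_atoms)
identity = percent_identity("TYQGSYGFR", "TYQGNYGFR")
print(f"RMSD: {structure_rmsd:.2f} angstroms")
print(f"percent identity: {identity:.1f}%")
```

A single point mutation typically leaves percent identity very high, so RMSD and visual inspection are what reveal whether the substitution disturbs the fold.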

Performing these processes after completing the questions proposed in the previous Jupyter Notebook may allow for a further understanding of how cancer and cellular dysregulation can arise, whether or not TP53 itself is mutated and not functioning correctly. Additionally, knowledge of which proteins in the proteome are dysfunctional may allow treatment plans to be consolidated and/or new drugs/treatments to be developed for this specific cancer.

# Summary of Workflow

To wrap up this Jupyter Notebook, a quick summary of the proteomic workflow for this research study scenario is given below:

Hypothesis: individuals diagnosed with a new type of cancer have the same mutations that occur within the TP53 interaction network

1.) Biopsy of cancerous cells is obtained from individuals and frozen for future testing

2.) Frozen cells along with healthy individual control cells are retrieved and lysed open to obtain intracellular contents

3.) Preparation for tandem mass spectrometry with fragmentation, ionization, and liquid chromatography

4.) Tandem mass spectrometry performed (ion selector --> 2nd round of fragmentation --> fragment mass analyzer)

5.) Detection of ionized peptide fragments with detector

6.) Mass spectra graphs created based on mass to charge ratio against abundance (intensity)

7.) Repeat steps 2-6 about 5 times using cells from each biopsy/control individual for validation and statistical significance, giving about 100 results

8.) Spectra graphs used to search against spectra databases using database search algorithms (database should be based on protein residue variations/mutations) such as SEQUEST

9.) Confirm whether proteins in TP53's network or TP53 itself within the proteome have a wild-type phenotype/genotype or if they are mutated

10.) Post-mass spectrometry analysis to collect data of other proteins within TP53's interaction network for a deeper understanding of cancer development and cellular proliferation/regulation/dysfunction

11.) Characterization of patient diagnoses as a newly discovered cancer/form of cellular dysregulation or as cancer occurring due to random mutations and chance, and potential treatment plan with drugs and/or development of new drugs/treatment

# Works Cited:

Declercq, A., Bouwmeester, R., Hirschler, A., Carapito, C., Degroeve, S., Martens, L., & Gabriels, R. (2022). 
    MS2Rescore: Data-Driven Rescoring Dramatically Boosts Immunopeptide Identification Rates. Molecular &
    Cellular Proteomics, 21(8), 100266. https://doi.org/10.1016/j.mcpro.2022.100266

Degroeve, S., & Martens, L. (2013). MS2PIP: A tool for MS/MS Peak Intensity Prediction. Bioinformatics, 29(24), 
    3199–3203. https://doi.org/10.1093/bioinformatics/btt544
    
Degroeve, S., Gabriels, R., Velghe, K., Bouwmeester, R., Tichshenko, N., & Martens, L. (2022). Ionbot: A novel, 
    innovative and sensitive machine learning approach to LC-MS/MS peptide identification. bioRxiv. 
    https://doi.org/10.1101/2021.07.02.450686

Desiere, F., Deutsch, E. W., King, N. L., Nesvizhskii, A. I., Mallick, P., Eng, J., Chen, S., Eddes, J., Loevenich, 
    S. N., & Aebersold, R. (2006). The PeptideAtlas project. Nucleic Acids Research, 34(90001). 
    https://doi.org/10.1093/nar/gkj040 

Download Chimera. UCSF Chimera. (2023, July 6). https://www.cgl.ucsf.edu/chimera/download.html

Eng, J. K., McCormack, A. L., & Yates, J. R. (1994). An approach to correlate tandem mass spectral data of 
    peptides with amino acid sequences in a protein database. Journal of the American Society for Mass 
    Spectrometry, 5(11), 976–989. https://doi.org/10.1016/1044-0305(94)80016-2 

Huttlin, E. L., Bruckner, R. J., Navarrete-Perea, J., Cannon, J. R., Baltier, K., Gebreab, F., Gygi, M. P., 
    Thornock, A., Zarraga, G., Tam, S., Szpyt, J., Gassaway, B. M., Panov, A., Parzen, H., Fu, S., Golbazi, A.,
    Maenpaa, E., Stricker, K., Guha Thakurta, S., ... Gygi, S. P. (2021). Dual proteome-scale networks reveal 
    cell-specific remodeling of the human interactome. Cell, 184(11). 
    https://doi.org/10.1016/j.cell.2021.04.011

Madeira, F., Pearce, M., Tivey, A. R., Basutkar, P., Lee, J., Edbali, O., Madhusoodanan, N., Kolesnikov, A., & 
    Lopez, R. (2022). Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Research, 
    50(W1). https://doi.org/10.1093/nar/gkac240 

Sali, A., & Webb, B. (2021, February 10). Tutorial. Modeller.
    https://www.salilab.org/modeller/tutorial/

Schweppe, D. K., Huttlin, E. L., Harper, J. W., & Gygi, S. P. (2017). BioPlex Display: An Interactive Suite for 
    Large-Scale AP–MS Protein–Protein Interaction Data. Journal of Proteome Research, 17(1), 722–726.
    https://doi.org/10.1021/acs.jproteome.7b00572

Spectrum for [itraq4plex]-tyqgsygfr +2. PeptideAtlas. (2013, August 2). 
    https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/ShowObservedSpectrum?
    atlas_build_id=550&spectrum_identification_id=117789662&peptide=%5BiTRAQ4plex%5D-
    TYQGSYGFR&assumed_charge=2&chimera_level=&sample_id=6450&spectrum_name=TCGA_114C_13-1497-01A-
    01_OVARIAN-CONTROL_61-1995-01A-01_W_JHUZ_20130802_F9.03252.03252.2 

University of Basel. (2023). Introduction to SWISS-MODEL. SWISS. https://swissmodel.expasy.org/docs/help