# Proteomics Analysis for HNSC_Omics_Database

This notebook is a preliminary exploration of **proteomics data** for the **HNSC_Omics_Database**. It loads protein abundance data from **Head and Neck Squamous Cell Carcinoma** samples (abbreviated HNSC in the database name and HNSCC in CPTAC), inspects their structure and quality, and prepares them for integration into the **HNSC_Omics_Database**, with a focus on the tumor microenvironment.

We will leverage the **CPTAC Python package** to access and explore proteomics data efficiently. This package provides a streamlined interface for retrieving proteogenomic datasets, enabling rapid data exploration and analysis.

---

### Objectives

In this notebook, we will:
1. Explore the structure and content of the **proteomics dataset** using the CPTAC Python package.
2. Perform preliminary data quality checks.
3. Analyze protein abundance patterns across samples.
4. Format and prepare data for integration into the database.

Each step includes visualizations to interpret the data and its quality.


### **CPTAC Data Overview**

CPTAC datasets include:
- **Proteomics**: Quantitative protein abundance values, normalized and log-transformed.
- **Transcriptomics**: RNA-Seq expression data.
- **Phosphoproteomics**: Phosphorylation-specific proteomics.
- **Somatic Mutations**: Mutation data in MAF format.
- **Clinical Metadata**: Patient-level information such as age, sex, race, tumor type, and more.

The rest of this notebook loads the HNSCC cohort through the **CPTAC Python package** and concentrates on its proteomics table.

---

### Next Steps:
1. Import necessary libraries.
2. Load the HNSCC dataset.
3. Preview available data types and their sources.

### **Step 1: Set Up the Environment**

This step ensures that we have the necessary libraries, environment configuration, and output paths prepared for the rest of the analysis.

---

#### **1.1 Import Libraries**
Import libraries that will be used for data manipulation, visualization, and working with the CPTAC package.

In [1]:
# Import the CPTAC Python package for accessing CPTAC data
import cptac

# Import pandas for data manipulation
import pandas as pd

# Import numpy for numerical operations
import numpy as np

# Import matplotlib for creating visualizations
import matplotlib.pyplot as plt

# Import seaborn for enhanced visualizations
import seaborn as sns


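# Import scipy.stats for statistical tests
import scipy.stats

# Import statsmodels' multitest module for multiple-testing correction
import statsmodels.stats.multitest

# Import math for basic mathematical functions
import math

# Import the CPTAC utility functions
import cptac.utils as ut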



#### **1.2 Configure Visualization Settings**
Set a shared plotting style so that all plots look consistent across the notebook.

In [2]:
# Set seaborn style for clean and consistent visualizations
sns.set(style="whitegrid")

# Set default figure size for plots
plt.rcParams["figure.figsize"] = (12, 6)


#### **1.3 Define Output Paths**
Define the directories for metadata, raw data, processed data, and analysis results, and create any that do not exist yet. This ensures that all outputs are saved in an organized manner.

In [3]:
# Define directory for metadata
metadata_dir = "../resources/metadata/cptac_metadata/"

# Define directory for raw data
raw_data_dir = "../resources/data/raw/CPTAC/"

# Define directory for processed data
processed_data_dir = "../resources/data/processed/Proteomics/"

# Define directory for analysis results
results_dir = "../resources/results/Proteomics/"

# Create directories if they don't exist (exist_ok avoids a separate existence check)
import os
for directory in [metadata_dir, raw_data_dir, processed_data_dir, results_dir]:
    os.makedirs(directory, exist_ok=True)

print(f"Metadata will be saved in: {metadata_dir}")
print(f"Raw data will be saved in: {raw_data_dir}")
print(f"Processed data will be saved in: {processed_data_dir}")
print(f"Analysis results will be saved in: {results_dir}")


Metadata will be saved in: ../resources/metadata/cptac_metadata/
Raw data will be saved in: ../resources/data/raw/CPTAC/
Processed data will be saved in: ../resources/data/processed/Proteomics/
Analysis results will be saved in: ../resources/results/Proteomics/



#### **1.4 Verify CPTAC Installation**
Confirm that the CPTAC package is installed and list available cancer types for verification.

In [4]:
# List available cancer datasets
available_cancers = cptac.get_cancer_info()
print("\nAvailable Cancer Datasets:")
for abbrev, name in available_cancers.items():
    print(f"{abbrev}: {name}")



Available Cancer Datasets:
brca: Breast invasive carcinoma
ccrcc: Clear cell renal cell carcinoma
coad: Colon adenocarcinoma
gbm: Glioblastoma multiforme
hnscc: Head and Neck squamous cell carcinoma
lscc: Lung squamous cell carcinoma
luad: Lung adenocarcinoma
ov: Ovarian serous cystadenocarcinoma
pda: Pancreatic ductal adenocarcinoma
pdac: Pancreatic ductal adenocarcinoma
ucec: Uterine Corpus Endometrial Carcinoma


### Step 2: Exploring the HNSCC Dataset

In this step, we will focus on exploring the **Head and Neck Squamous Cell Carcinoma (HNSCC)** dataset provided by the CPTAC package. This involves understanding the types of data available and their sources, ensuring that the dataset contains the information we need for further analysis.

### 2.1: Loading the HNSCC Dataset
- Load the HNSCC dataset into a Python object.
- Ensure successful initialization of the dataset.

In [6]:
# Load the HNSCC dataset
print("Loading HNSCC dataset...")
try:
    hnscc = cptac.Hnscc()
    # Later cells refer to the dataset as `hnscc_data`, so bind that name as well
    hnscc_data = hnscc
    print("HNSCC dataset loaded successfully!")
except Exception as e:
    print(f"Error loading HNSCC dataset: {e}")


Loading HNSCC dataset...
HNSCC dataset loaded successfully!


### 2.2 Listing Available Data Types and Sources
- Explore the types of data available in the HNSCC dataset.
- Identify the sources of each data type, which may correspond to different bioinformatics pipelines.

In [7]:
# List the available data types and their sources in the HNSCC dataset
print("\nListing available data types and sources for HNSCC:")
try:
    data_sources = hnscc.list_data_sources()
    print(data_sources)
except Exception as e:
    print(f"Error listing data sources: {e}")



Listing available data types and sources for HNSCC:
              Data type    Available sources
0          circular_RNA                [bcm]
1                 miRNA         [bcm, washu]
2     phosphoproteomics         [bcm, umich]
3            proteomics         [bcm, umich]
4       transcriptomics  [bcm, broad, washu]
5   ancestry_prediction         [harmonized]
6      somatic_mutation  [harmonized, washu]
7              clinical               [mssm]
8             follow-up               [mssm]
9       medical_history               [mssm]
10                  CNV              [washu]
11            cibersort              [washu]
12           hla_typing              [washu]
13         tumor_purity              [washu]
14                xcell              [washu]
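The source names in this table are what the `get_*` calls below expect as arguments. As a minimal sketch, assuming the object returned by `list_data_sources()` is a DataFrame with the two columns printed above, the sources for one data type can be looked up as follows. The table is rebuilt by hand here (three rows only) so the example runs without downloading anything, and `sources_for` is a hypothetical helper, not part of the CPTAC package.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the table printed above
# (assumption: the real return value has these same two columns).
data_sources = pd.DataFrame({
    "Data type": ["proteomics", "transcriptomics", "clinical"],
    "Available sources": [["bcm", "umich"], ["bcm", "broad", "washu"], ["mssm"]],
})

def sources_for(table, data_type):
    """Return the list of sources for one data type (hypothetical helper)."""
    match = table.loc[table["Data type"] == data_type, "Available sources"]
    if match.empty:
        raise KeyError(f"No data type named {data_type!r}")
    return list(match.iloc[0])

proteomics_sources = sources_for(data_sources, "proteomics")
print(proteomics_sources)
```

Checking the source name this way before calling a getter gives a clearer error than a failed download would.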


### 2.3 Inspecting Metadata and Omics Data
- Check the structure of clinical metadata available in the dataset.
- Preview a few rows from available omics data types, such as proteomics or transcriptomics, to confirm data quality and consistency.

#### 2.3.1 Clinical Metadata

In [8]:
# Preview clinical metadata
print("\nPreviewing clinical metadata...")
try:
    clinical_data = hnscc_data.get_clinical('mssm')
    print(clinical_data.head())
except Exception as e:
    print(f"Error accessing clinical metadata: {e}")



Previewing clinical metadata...
Name       tumor_code discovery_study type_of_analyzed_samples  \
Patient_ID                                                       
C3L-00977       HNSCC             Yes                    Tumor   
C3L-00987       HNSCC             Yes                    Tumor   
C3L-00994       HNSCC             Yes         Tumor_and_Normal   
C3L-00995       HNSCC             Yes         Tumor_and_Normal   
C3L-00997       HNSCC             Yes         Tumor_and_Normal   

Name       confirmatory_study type_of_analyzed_samples age   sex     race  \
Patient_ID                                                                  
C3L-00977                 NaN                      NaN  56  Male  Unknown   
C3L-00987                 NaN                      NaN  61  Male  Unknown   
C3L-00994                 NaN                      NaN  50  Male  Unknown   
C3L-00995                 NaN                      NaN  56  Male  Unknown   
C3L-00997                 NaN             

#### 2.3.2 Proteomics Data

In [9]:
# Preview proteomics data
print("\nPreviewing proteomics data...")
try:
    proteomics_data = hnscc_data.get_proteomics('umich')
    print(proteomics_data.head())
except Exception as e:
    print(f"Error accessing proteomics data: {e}")



Previewing proteomics data...
Name                     ARF5              M6PR             ESRRA  \
Database_ID ENSP00000000233.5 ENSP00000000412.3 ENSP00000000442.6   
Patient_ID                                                          
C3L-00977           -0.395609         -0.126981          0.271001   
C3L-00987           -0.333629         -0.583884         -0.240685   
C3L-00994           -0.176258         -0.167526          0.282665   
C3L-00995           -0.045460          0.037155          0.615215   
C3L-00997            0.295573         -0.118091          0.057487   

Name                    FKBP4           NDUFAF7             FUCA2  \
Database_ID ENSP00000001008.4 ENSP00000002125.4 ENSP00000002165.5   
Patient_ID                                                          
C3L-00977           -0.143356         -0.087402          0.116355   
C3L-00987            0.257096          0.333955         -0.034048   
C3L-00994           -0.201174         -0.069770          0.171495   
C3

In [13]:
# Load the UMich proteomics table; get_dataframe('proteomics', 'umich') is the generic form of this call
proteomics = hnscc_data.get_proteomics('umich')

samples = proteomics.index      # patient IDs (rows)
proteins = proteomics.columns   # (Name, Database_ID) pairs (columns)
print("Samples:", samples[0:20].tolist())    # first 20 samples
print("Proteins:", proteins[0:20].tolist())  # first 20 proteins

Samples: ['C3L-00977', 'C3L-00987', 'C3L-00994', 'C3L-00995', 'C3L-00997', 'C3L-00999', 'C3L-01138', 'C3L-01237', 'C3L-02617', 'C3L-02621', 'C3L-02651', 'C3L-03378', 'C3L-04025', 'C3L-04354', 'C3L-04791', 'C3L-04844', 'C3L-04849', 'C3N-00204', 'C3N-00295', 'C3N-00297']
Proteins: [('ARF5', 'ENSP00000000233.5'), ('M6PR', 'ENSP00000000412.3'), ('ESRRA', 'ENSP00000000442.6'), ('FKBP4', 'ENSP00000001008.4'), ('NDUFAF7', 'ENSP00000002125.4'), ('FUCA2', 'ENSP00000002165.5'), ('HS3ST1', 'ENSP00000002596.5'), ('SEMA3F', 'ENSP00000002829.3'), ('CFTR', 'ENSP00000003084.6'), ('CYP51A1', 'ENSP00000003100.8'), ('USP28', 'ENSP00000003302.4'), ('NIPAL3', 'ENSP00000003912.3'), ('TMEM176A', 'ENSP00000004103.3'), ('SLC7A2', 'ENSP00000004531.10'), ('HSPB6', 'ENSP00000004982.3'), ('ZNF195', 'ENSP00000005082.9'), ('PDK4', 'ENSP00000005178.5'), ('RALA', 'ENSP00000005257.2'), ('BAIAP2L1', 'ENSP00000005260.8'), ('TMEM132A', 'ENSP00000005286.4')]


In [14]:
proteomics.head()

Name,ARF5,M6PR,ESRRA,FKBP4,NDUFAF7,FUCA2,HS3ST1,SEMA3F,CFTR,CYP51A1,...,BTD,TNK2,ETNK1,MYO6,MPZ,EED,DDHD1,ZBTB3,WIZ,RFX7
Database_ID,ENSP00000000233.5,ENSP00000000412.3,ENSP00000000442.6,ENSP00000001008.4,ENSP00000002125.4,ENSP00000002165.5,ENSP00000002596.5,ENSP00000002829.3,ENSP00000003084.6,ENSP00000003100.8,...,ENSP00000500403.1,ENSP00000500452.1,ENSP00000500633.1,ENSP00000500710.1,ENSP00000500814.2,ENSP00000500914.1,ENSP00000500986.2,ENSP00000501025.1,ENSP00000501300.1,ENSP00000501317.1
Patient_ID
C3L-00977,-0.395609,-0.126981,0.271001,-0.143356,-0.087402,0.116355,-0.078268,,,-0.210886,...,-0.871789,,-0.01517,-0.112716,1.061597,-0.075046,0.339418,-0.889271,0.197584,-0.549962
C3L-00987,-0.333629,-0.583884,-0.240685,0.257096,0.333955,-0.034048,0.113751,,,-0.197915,...,-0.057613,0.17482,0.169549,-0.392975,-1.808249,0.165879,-0.005407,,0.247806,
C3L-00994,-0.176258,-0.167526,0.282665,-0.201174,-0.06977,0.171495,-0.154504,,,-0.88009,...,0.214462,,-0.141281,0.129055,-0.20178,-0.141542,-0.000664,-0.439349,-0.016434,-1.861432
C3L-00995,-0.04546,0.037155,0.615215,-0.231108,-0.179687,-0.503235,-0.515438,-0.56114,,-0.68381,...,-0.191998,-0.052048,0.712887,-0.143419,0.719894,-0.201199,-0.125278,,-0.088695,
C3L-00997,0.295573,-0.118091,0.057487,0.480692,-0.03823,0.233127,,0.2964,,-0.005773,...,-0.532029,0.264623,0.298447,,-0.565447,0.150717,0.819997,0.26757,0.507509,0.188039
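The preview above contains empty cells (for example under SEMA3F and CFTR), so a natural first quality check is the fraction of missing values per protein. The sketch below runs on a small toy table with the same layout (patients as rows, two-level `Name`/`Database_ID` columns). The values are invented, and the 0.5 cutoff is an arbitrary assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the proteomics table: patients as rows, (Name, Database_ID) columns.
# The values are invented for illustration, not taken from CPTAC.
columns = pd.MultiIndex.from_tuples(
    [("GENE_A", "ID_A"), ("GENE_B", "ID_B"), ("GENE_C", "ID_C")],
    names=["Name", "Database_ID"],
)
toy = pd.DataFrame(
    [[0.1, np.nan, np.nan],
     [0.2, np.nan, 0.5],
     [0.3, 0.4, np.nan],
     [0.4, np.nan, 0.7]],
    index=pd.Index(["P1", "P2", "P3", "P4"], name="Patient_ID"),
    columns=columns,
)

# Fraction of patients with no measurement, per protein
missing_fraction = toy.isna().mean()

# Keep proteins whose missing fraction does not exceed the cutoff (arbitrary choice)
max_missing = 0.5
filtered = toy.loc[:, missing_fraction <= max_missing]

print(missing_fraction)
print(filtered.columns.get_level_values("Name").tolist())
```

The same two lines (`isna().mean()` and the boolean column filter) should apply unchanged to the real `proteomics` table.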


In [15]:
transcriptomics = hnscc_data.get_transcriptomics('bcm')
transcriptomics.head()

Downloading HNSCC-gene_rsem_removed_circRNA_tumor_normal_UQ_log2(x+1)_BCM.txt.gz: 100%|██████████| 12.0M/12.0M [00:03<00:00, 3.73MB/s]  
Downloading gencode.v34.basic.annotation-mapping.txt.gz: 100%|██████████| 1.75M/1.75M [00:01<00:00, 1.53MB/s]


Name,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2ML1-AS1,A2ML1-AS2,A2MP1,A3GALT2,...,ZXDB,ZXDC,ZYG11A,ZYG11AP1,ZYG11B,ZYX,ZYXP1,ZZEF1,hsa-mir-1253,hsa-mir-423
Database_ID,ENSG00000121410.12,ENSG00000268895.6,ENSG00000148584.15,ENSG00000175899.15,ENSG00000245105.4,ENSG00000166535.20,ENSG00000256661.1,ENSG00000256904.1,ENSG00000256069.7,ENSG00000184389.9,...,ENSG00000198455.4,ENSG00000070476.15,ENSG00000203995.10,ENSG00000232242.2,ENSG00000162378.13,ENSG00000159840.16,ENSG00000274572.1,ENSG00000074755.15,ENSG00000272920.1,ENSG00000266919.3
Patient_ID
C3L-00977,4.75,7.13,4.19,13.93,5.57,13.75,3.61,0.0,3.59,0.0,...,9.09,9.69,6.17,0.0,11.51,11.0,0.0,12.36,0.0,0.0
C3L-00987,5.67,7.29,3.51,13.64,6.44,14.75,3.29,0.96,3.29,0.0,...,9.06,10.43,6.16,0.0,11.59,10.88,0.0,11.65,0.0,0.0
C3L-00994,5.2,6.75,3.14,14.42,6.81,9.32,0.0,0.0,3.86,1.85,...,8.73,10.26,6.11,0.0,11.76,11.13,0.0,12.1,0.0,0.0
C3L-00995,6.04,6.66,1.0,13.71,7.98,15.06,3.81,2.61,4.13,0.0,...,8.54,10.28,6.59,0.0,11.26,11.38,0.0,12.43,0.0,0.0
C3L-00997,4.86,5.65,3.91,13.71,6.76,13.81,3.23,1.41,3.22,0.0,...,8.89,10.19,6.84,0.0,11.2,11.15,0.0,11.6,0.0,0.0


In [16]:
clinical = hnscc_data.get_clinical('mssm')
clinical.head()

Name,tumor_code,discovery_study,type_of_analyzed_samples,confirmatory_study,type_of_analyzed_samples,age,sex,race,ethnicity,ethnicity_race_ancestry_identified,...,additional_treatment_pharmaceutical_therapy_for_new_tumor,additional_treatment_immuno_for_new_tumor,number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional,number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis,"Recurrence-free survival, days","Recurrence-free survival from collection, days","Recurrence status (1, yes; 0, no)","Overall survival, days","Overall survival from collection, days","Survival status (1, dead; 0, alive)"
Patient_ID
C3L-00977,HNSCC,Yes,Tumor,,,56,Male,Unknown,Unknown,White (Caucasian),...,No,No,,,853.0,820.0,1,1537.0,1504.0,1.0
C3L-00987,HNSCC,Yes,Tumor,,,61,Male,Unknown,Unknown,White (Caucasian),...,,,,,,,0,429.0,433.0,0.0
C3L-00994,HNSCC,Yes,Tumor_and_Normal,,,50,Male,Unknown,Unknown,White (Caucasian),...,No,No,,,133.0,107.0,1,202.0,176.0,1.0
C3L-00995,HNSCC,Yes,Tumor_and_Normal,,,56,Male,Unknown,Unknown,White (Caucasian),...,,,,,,,0,-9.0,1.0,1.0
C3L-00997,HNSCC,Yes,Tumor_and_Normal,,,47,Male,Unknown,Unknown,White (Caucasian),...,,,,,,,0,442.0,445.0,0.0
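Note that the column name `type_of_analyzed_samples` appears twice in the clinical header above. With a repeated name, selecting that column returns a two-column DataFrame rather than a Series, which can break later steps. The sketch below detects repeated names and makes them unique on a toy table that reproduces the repetition; the `.1` suffix scheme is an assumption made for this example.

```python
import pandas as pd

# Toy stand-in reproducing the repeated column name seen in the clinical preview
toy_clinical = pd.DataFrame(
    [["Yes", "Tumor", None, None, 56],
     ["Yes", "Tumor_and_Normal", None, None, 50]],
    index=pd.Index(["P1", "P2"], name="Patient_ID"),
    columns=["discovery_study", "type_of_analyzed_samples",
             "confirmatory_study", "type_of_analyzed_samples", "age"],
)

# Names that occur more than once
repeated = toy_clinical.columns[toy_clinical.columns.duplicated()].unique().tolist()

# Append a running suffix to later occurrences so every name is unique
seen = {}
unique_names = []
for name in toy_clinical.columns:
    count = seen.get(name, 0)
    unique_names.append(name if count == 0 else f"{name}.{count}")
    seen[name] = count + 1

deduplicated = toy_clinical.copy()
deduplicated.columns = unique_names
print(repeated)
print(deduplicated.columns.tolist())
```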


In [18]:
# Select specific patients by their ID labels
clinical.loc[['C3L-00977', 'C3L-00987', 'C3L-00994', 'C3L-00995', 'C3L-00997']]

Name,tumor_code,discovery_study,type_of_analyzed_samples,confirmatory_study,type_of_analyzed_samples,age,sex,race,ethnicity,ethnicity_race_ancestry_identified,...,additional_treatment_pharmaceutical_therapy_for_new_tumor,additional_treatment_immuno_for_new_tumor,number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional,number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis,"Recurrence-free survival, days","Recurrence-free survival from collection, days","Recurrence status (1, yes; 0, no)","Overall survival, days","Overall survival from collection, days","Survival status (1, dead; 0, alive)"
Patient_ID
C3L-00977,HNSCC,Yes,Tumor,,,56,Male,Unknown,Unknown,White (Caucasian),...,No,No,,,853.0,820.0,1,1537.0,1504.0,1.0
C3L-00987,HNSCC,Yes,Tumor,,,61,Male,Unknown,Unknown,White (Caucasian),...,,,,,,,0,429.0,433.0,0.0
C3L-00994,HNSCC,Yes,Tumor_and_Normal,,,50,Male,Unknown,Unknown,White (Caucasian),...,No,No,,,133.0,107.0,1,202.0,176.0,1.0
C3L-00995,HNSCC,Yes,Tumor_and_Normal,,,56,Male,Unknown,Unknown,White (Caucasian),...,,,,,,,0,-9.0,1.0,1.0
C3L-00997,HNSCC,Yes,Tumor_and_Normal,,,47,Male,Unknown,Unknown,White (Caucasian),...,,,,,,,0,442.0,445.0,0.0


In [19]:
somatic_mutations = hnscc_data.get_somatic_mutation('harmonized')
somatic_mutations.head()

Downloading PanCan_Union_Maf_Broad_WashU_v1.1.maf.gz: 100%|██████████| 138M/138M [00:19<00:00, 7.13MB/s]    


Name,Gene,Mutation,Location,Entrez_Gene_Id,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Variant_Type,...,HGNC_UniProt_ID(supplied_by_UniProt),HGNC_Ensembl_ID(supplied_by_Ensembl),HGNC_UCSC_ID(supplied_by_UCSC),Oreganno_Build,Simple_Uniprot_alt_uniprot_accessions,dbSNP_TOPMED,HGNC_Entrez_Gene_ID(supplied_by_NCBI),COHORT,getz,washu
Patient_ID
C3L-00977,TAS1R3,Missense_Mutation,p.D470N,83756.0,hg38,chr1,1333053,1333053,+,SNP,...,Q7RTX0,ENSG00000169962,uc010nyk.3,hg38,Q5TA49|Q8NGW9,"0.99996018093781855,0.00002389143730886,0.0000...",83756.0,HNSCC,True,
C3L-00977,ORC6,Silent,p.K202K,23594.0,hg38,chr16,46696060,46696060,+,SNP,...,Q9Y5N6,ENSG00000091651,uc002eeh.3,,B3KN89,,23594.0,HNSCC,True,True
C3L-00977,MARF1,Intron,,9665.0,hg38,chr16,15600989,15600989,+,SNP,...,Q9Y4F3,ENSG00000166783,uc002ddr.4,,,,9665.0,HNSCC,True,
C3L-00977,GSPT1,Missense_Mutation,p.F275L,2935.0,hg38,chr16,11886485,11886485,+,SNP,...,P15170,ENSG00000103342,uc002dbt.4,,J3KQG6|Q96GF2,,2935.0,HNSCC,True,True
C3L-00977,FLYWCH1,Intron,,84256.0,hg38,chr16,2936649,2936649,+,SNP,...,Q4VC44,ENSG00000059122,uc002csc.4,,D3DUA1|Q6ZSQ1|Q8WV62|Q9BQG6|Q9BUS5|Q9HCM0,"0.99990443425076452,0.00009556574923547",84256.0,HNSCC,True,
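Unlike the other tables, the mutation table is in long format: each row is one mutation, and a `Patient_ID` repeats once per mutation. Summaries therefore need a group-by step. The sketch below counts mutation classes per patient and patients per gene on a toy table with invented rows. Treating only `Missense_Mutation` as protein-altering is a simplifying assumption for the example.

```python
import pandas as pd

# Toy stand-in for the long-format mutation table: one row per mutation,
# with Patient_ID as a repeating index. Rows are invented.
toy_mutations = pd.DataFrame(
    {
        "Gene": ["GENE_A", "GENE_B", "GENE_A", "GENE_C"],
        "Mutation": ["Missense_Mutation", "Silent", "Missense_Mutation", "Intron"],
    },
    index=pd.Index(["P1", "P1", "P2", "P2"], name="Patient_ID"),
)

# Move Patient_ID into a regular column so it can be used as a grouping key
long_table = toy_mutations.reset_index()

# Number of mutations of each class per patient (0 where a class is absent)
per_patient = (
    long_table.groupby(["Patient_ID", "Mutation"]).size().unstack(fill_value=0)
)

# Number of distinct patients carrying a missense call in each gene
# (assumption: only Missense_Mutation counts as protein-altering here)
altering = long_table[long_table["Mutation"] == "Missense_Mutation"]
patients_per_gene = altering.groupby("Gene")["Patient_ID"].nunique()

print(per_patient)
print(patients_per_gene)
```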


In [20]:
proteomics.columns

MultiIndex([(   'ARF5', 'ENSP00000000233.5'),
            (   'M6PR', 'ENSP00000000412.3'),
            (  'ESRRA', 'ENSP00000000442.6'),
            (  'FKBP4', 'ENSP00000001008.4'),
            ('NDUFAF7', 'ENSP00000002125.4'),
            (  'FUCA2', 'ENSP00000002165.5'),
            ( 'HS3ST1', 'ENSP00000002596.5'),
            ( 'SEMA3F', 'ENSP00000002829.3'),
            (   'CFTR', 'ENSP00000003084.6'),
            ('CYP51A1', 'ENSP00000003100.8'),
            ...
            (    'BTD', 'ENSP00000500403.1'),
            (   'TNK2', 'ENSP00000500452.1'),
            (  'ETNK1', 'ENSP00000500633.1'),
            (   'MYO6', 'ENSP00000500710.1'),
            (    'MPZ', 'ENSP00000500814.2'),
            (    'EED', 'ENSP00000500914.1'),
            (  'DDHD1', 'ENSP00000500986.2'),
            (  'ZBTB3', 'ENSP00000501025.1'),
            (    'WIZ', 'ENSP00000501300.1'),
            (   'RFX7', 'ENSP00000501317.1')],
           names=['Name', 'Database_ID'], length=12224)
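The columns form a two-level `MultiIndex` (`Name`, `Database_ID`), which means one gene name can in principle appear under more than one protein ID. Before flattening the columns for export, it is worth checking for such repeats. The sketch below shows the check and two possible flattening strategies on a toy table; the label format `Name_DatabaseID` and averaging across IDs are illustrative choices, not the method used by this project.

```python
import pandas as pd

# Toy table with the same two-level column layout; GENE_B has two protein IDs
columns = pd.MultiIndex.from_tuples(
    [("GENE_A", "ID_1"), ("GENE_B", "ID_2"), ("GENE_B", "ID_3")],
    names=["Name", "Database_ID"],
)
toy = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    index=pd.Index(["P1", "P2"], name="Patient_ID"),
    columns=columns,
)

# Gene names that appear under more than one Database_ID
names = toy.columns.get_level_values("Name")
duplicated_names = sorted(set(names[names.duplicated()]))

# Option 1: join both levels into one unambiguous label per column
flat = toy.copy()
flat.columns = [f"{name}_{db_id}" for name, db_id in toy.columns]

# Option 2: one column per gene, averaging columns that share a name
by_gene = toy.T.groupby(level="Name").mean().T

print(duplicated_names)
print(flat.columns.tolist())
print(by_gene)
```

Option 1 loses no information, while option 2 gives one value per gene at the cost of merging distinct protein IDs.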

In [21]:
# Select one protein by gene name; the result keeps the Database_ID level as its column label
protein = "MMP14"
MMP14_col = proteomics[protein]
MMP14_col.head()

Database_ID,ENSP00000308208.6
Patient_ID
C3L-00977,0.842312
C3L-00987,-0.079284
C3L-00994,0.653041
C3L-00995,0.658278
C3L-00997,0.235007


In [23]:
# Select several proteins at once by passing a list of gene names
proteins = ["MMP14", "PTK7", "LRRC15", "CD276"]
selected_prot = proteomics[proteins]
selected_prot.head()

Name,MMP14,PTK7,LRRC15,CD276
Database_ID,ENSP00000308208.6,ENSP00000230419.4,ENSP00000306276.4,ENSP00000320084.5
Patient_ID
C3L-00977,0.842312,0.048407,0.429372,0.340589
C3L-00987,-0.079284,-0.221838,-0.404289,-0.304818
C3L-00994,0.653041,0.692544,0.666467,0.382136
C3L-00995,0.658278,0.664587,1.75858,0.645418
C3L-00997,0.235007,0.105638,0.365682,0.312187
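Once a handful of proteins is selected, summary statistics and pairwise correlations give a first view of abundance patterns across samples. The sketch below uses a toy table with invented values and placeholder protein names. pandas computes each correlation on the rows where both columns are present, so missing values do not need to be dropped first.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a small selection of proteins (invented values)
toy_selected = pd.DataFrame(
    {
        "PROT_A": [1.0, 2.0, 3.0, 4.0],
        "PROT_B": [2.0, 4.0, 6.0, 8.0],
        "PROT_C": [4.0, 3.0, np.nan, 1.0],
    },
    index=pd.Index(["P1", "P2", "P3", "P4"], name="Patient_ID"),
)

# Count, mean, spread, and quartiles per protein (NaN values are ignored)
summary = toy_selected.describe()

# Pairwise Pearson correlation, using pairwise-complete observations
correlations = toy_selected.corr()

print(summary.loc[["count", "mean"]])
print(correlations)
```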


In [24]:
# Select the first five rows by position (the same rows that proteomics.head() returns)
proteomics.iloc[0:5]

Name,ARF5,M6PR,ESRRA,FKBP4,NDUFAF7,FUCA2,HS3ST1,SEMA3F,CFTR,CYP51A1,...,BTD,TNK2,ETNK1,MYO6,MPZ,EED,DDHD1,ZBTB3,WIZ,RFX7
Database_ID,ENSP00000000233.5,ENSP00000000412.3,ENSP00000000442.6,ENSP00000001008.4,ENSP00000002125.4,ENSP00000002165.5,ENSP00000002596.5,ENSP00000002829.3,ENSP00000003084.6,ENSP00000003100.8,...,ENSP00000500403.1,ENSP00000500452.1,ENSP00000500633.1,ENSP00000500710.1,ENSP00000500814.2,ENSP00000500914.1,ENSP00000500986.2,ENSP00000501025.1,ENSP00000501300.1,ENSP00000501317.1
Patient_ID
C3L-00977,-0.395609,-0.126981,0.271001,-0.143356,-0.087402,0.116355,-0.078268,,,-0.210886,...,-0.871789,,-0.01517,-0.112716,1.061597,-0.075046,0.339418,-0.889271,0.197584,-0.549962
C3L-00987,-0.333629,-0.583884,-0.240685,0.257096,0.333955,-0.034048,0.113751,,,-0.197915,...,-0.057613,0.17482,0.169549,-0.392975,-1.808249,0.165879,-0.005407,,0.247806,
C3L-00994,-0.176258,-0.167526,0.282665,-0.201174,-0.06977,0.171495,-0.154504,,,-0.88009,...,0.214462,,-0.141281,0.129055,-0.20178,-0.141542,-0.000664,-0.439349,-0.016434,-1.861432
C3L-00995,-0.04546,0.037155,0.615215,-0.231108,-0.179687,-0.503235,-0.515438,-0.56114,,-0.68381,...,-0.191998,-0.052048,0.712887,-0.143419,0.719894,-0.201199,-0.125278,,-0.088695,
C3L-00997,0.295573,-0.118091,0.057487,0.480692,-0.03823,0.233127,,0.2964,,-0.005773,...,-0.532029,0.264623,0.298447,,-0.565447,0.150717,0.819997,0.26757,0.507509,0.188039


In [25]:
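# Select a single patient's row by its ID label; the result is a Series indexed by (Name, Database_ID)
S001_row = proteomics.loc['C3L-00977']
S001_row.head()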

Name     Database_ID      
ARF5     ENSP00000000233.5   -0.395609
M6PR     ENSP00000000412.3   -0.126981
ESRRA    ENSP00000000442.6    0.271001
FKBP4    ENSP00000001008.4   -0.143356
NDUFAF7  ENSP00000002125.4   -0.087402
Name: C3L-00977, dtype: float64