
## Running scPROTEIN stage1 on the SCoPE2_Specht dataset
In this tutorial, we show how to run scPROTEIN starting from peptide-level data: we first estimate the uncertainty of each peptide and then generate the protein-level data in an uncertainty-guided manner. The overall workflow is shown below.



<p align="center">
  <img width="80%" src="./image/stage1.png">
</p>




In [1]:
import numpy as np
import pandas as pd
import os 
import argparse
import os.path as osp
import random
import torch
import sys
from utils import *

sys.path.append("./peptide_uncertainty_estimation") 
from multi_task_heteroscedastic_regression_model import *
from peptide_uncertainty_utils import *


In [2]:
parser = argparse.ArgumentParser()
parser.add_argument("--file_path", type=str, default='./data/Peptides-raw.csv', help='data path')
parser.add_argument("--learning_rate", type=float, default=1e-3, help='learning rate.')
parser.add_argument("--weight_decay", type=float, default=1e-4, help='weight decay.')
parser.add_argument("--batch_size", type=int, default=256, help='batch size.')
parser.add_argument("--kernel_nums", type=int, nargs='+', default=[300,200,100], help='number of kernels in each conv block.')
parser.add_argument("--kernel_size", type=int, nargs='+', default=[2,2,2], help='kernel size of each conv block.')
parser.add_argument("--max_pool_size", type=int, default=1, help='max pooling size.')
parser.add_argument("--conv_layers", type=int, default=3, help='layer nums of conv.')
parser.add_argument("--hidden_dim", type=int, default=3000, help='hidden dim for fc layer.')
parser.add_argument("--num_epochs", type=int, default=90, help='number of epochs.')
parser.add_argument("--seed", type=int, default=3047, help='random seed.')
parser.add_argument("--split_percentage", type=float, default=0.8, help='split.')
parser.add_argument("--dropout_rate", type=float, default=0.5, help='drop out rate.')

# parse_known_args() skips unrecognized arguments (such as those passed by the Jupyter kernel)
args = parser.parse_known_args()[0]
setup_seed(args.seed)
torch.cuda.empty_cache()
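
The cell above calls `parse_known_args` rather than `parse_args`. A likely reason is that a notebook kernel is started with extra command-line arguments (for example `-f <connection file>`), which `parse_args` would reject. The sketch below illustrates this behaviour with a simulated argument list, and also shows how `nargs="+"` lets a list-valued option be overridden from the command line. The names `demo_parser`, `demo_args`, and `simulated_argv` are illustrative only and are not part of scPROTEIN.

```python
import argparse

demo_parser = argparse.ArgumentParser()
# nargs="+" collects one or more integers into a list, so the list default
# and a command-line override have the same type.
demo_parser.add_argument("--kernel_nums", type=int, nargs="+", default=[300, 200, 100])
demo_parser.add_argument("--num_epochs", type=int, default=90)

# Simulated arguments: the first two mimic what a notebook kernel may receive.
simulated_argv = ["-f", "kernel-connection.json", "--kernel_nums", "64", "32"]

# parse_known_args returns (namespace, list_of_unrecognized_arguments)
# instead of exiting with an error on the unrecognized ones.
demo_args, unknown = demo_parser.parse_known_args(simulated_argv)

print(demo_args.kernel_nums)  # [64, 32]
print(demo_args.num_epochs)   # 90
print(unknown)                # the unrecognized kernel-style arguments
```

When the notebook is run without overrides, every option simply keeps its default value.
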

### Load peptide data

We first load the peptide-level data together with the input peptide sequences, using the following functions:

<br/>


**load_peptide(data_path)**

**- Function:**

Load the input peptide-level file, and then extract the peptide sequences, peptide-level data along with other meta information.

**- Parameters:**
- `data_path` (str): Data path to load the peptide-level file.

**- Returns:**
- `peptides` (list): Peptide sequences.
- `proteins` (list): Protein names.
- `Y_label` (array): Peptide-level abundance matrix (peptides × cells).
- `cell_list` (list): The list containing the index of each cell.
- `num_cells` (int): Number of total cells.

<br/>

**peptide_encode(peptides)**


**- Function:**

This function takes as input peptide sequences composed of amino acids. It returns the corresponding one-hot encoding data matrix and the total number of different amino acid types.

**- Parameters:**

- `peptides` (list): The input peptide sequences.

**- Returns:**

- `peptide_onehot_padding` (array): One-hot encoding matrix for peptide sequences.
- `num_amino_acid` (int): The number of different amino acid types.
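
To make the encoding step concrete, here is a minimal sketch of one-hot encoding with zero padding. It is not the actual `peptide_encode` implementation: the function name `one_hot_encode`, the nested-list output, and the use of all-zero rows as padding are assumptions made for illustration, and the real function may use a different layout or return a tensor.

```python
def one_hot_encode(peptides):
    """Return (encoded, num_amino_acid) for a list of peptide strings.

    Hypothetical helper: encoded[i][j] is the one-hot vector of residue j
    in peptide i, or an all-zero vector if peptide i is shorter than j + 1.
    """
    # Vocabulary: every distinct residue letter in the input, sorted for determinism.
    alphabet = sorted(set("".join(peptides)))
    index = {aa: i for i, aa in enumerate(alphabet)}
    max_len = max(len(p) for p in peptides)

    encoded = []
    for peptide in peptides:
        # Start from an all-zero (max_len x num_amino_acid) grid; rows beyond
        # the end of a short peptide stay zero and act as padding.
        rows = [[0] * len(alphabet) for _ in range(max_len)]
        for position, aa in enumerate(peptide):
            rows[position][index[aa]] = 1
        encoded.append(rows)
    return encoded, len(alphabet)


encoded, num_types = one_hot_encode(["ACD", "AC"])
print(num_types)      # 3
print(encoded[0])     # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(encoded[1][2])  # [0, 0, 0]  (padding row of the shorter peptide)
```

Padding every peptide to the same length is what allows the sequences to be stacked into a single matrix and processed in mini-batches.
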




In [3]:
peptides, proteins, Y_label, cell_list, num_cells = load_peptide(args.file_path)
peptide_onehot_padding, num_amino_acid = peptide_encode(peptides)

peptides nums in total: 9354
cell nums: 1490


In [4]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
peptide_onehot_padding = peptide_onehot_padding.to(device)
Y_label = Y_label.to(device)

### Peptide uncertainty estimation 
Next, we build the scPROTEIN *stage1* framework and carry out uncertainty learning. The following functions are used to construct scPROTEIN *stage1*:

<br/>


**peptide_CNN(num_amino_acid, max_pool_size, hidden_dim, output_dim, conv_layers, dropout_rate, kernel_nums, kernel_size)**

**- Function:**

This function defines the heteroscedastic regression model of scPROTEIN *stage1* for peptide uncertainty estimation.

**- Parameters:**

- `num_amino_acid` (int): The number of different amino acid types.
- `max_pool_size` (int): The size of the sliding window in the max-pooling operation.
- `hidden_dim` (int): The hidden dimension in the fully-connected layer.
- `output_dim` (int): Output dimension of the heteroscedastic regression model, which is twice the number of cells (one $\mu$ and one $\sigma$ per cell).
- `conv_layers` (int): Number of convolutional layers.
- `dropout_rate` (float): Dropout rate.
- `kernel_nums` (list of int): Number of kernels in each convolutional block, one entry per block.
- `kernel_size` (list of int): Kernel size of each convolutional block, one entry per block.

**- Returns:**

- `model` (object): The defined heteroscedastic regression model object.
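
For intuition about why the model emits two values per cell, the sketch below shows the standard Gaussian negative log-likelihood commonly used for heteroscedastic regression. The tutorial does not state the exact loss used by scPROTEIN, so treat this as an assumption. In particular, the helper names `split_outputs` and `gaussian_nll`, the choice to put the means in the first half of the output vector, and the log-variance parameterisation are all hypothetical.

```python
import math


def split_outputs(raw_outputs, num_cells):
    """Split a length-2*num_cells output vector into (mu, log_var) halves.

    Assumed layout: first half holds the means, second half the log-variances.
    """
    return raw_outputs[:num_cells], raw_outputs[num_cells:]


def gaussian_nll(y, mu, log_var):
    """Mean Gaussian negative log-likelihood over cells (constant term dropped)."""
    total = 0.0
    for y_i, mu_i, lv_i in zip(y, mu, log_var):
        # 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), with sigma^2 = exp(log_var)
        total += 0.5 * (lv_i + (y_i - mu_i) ** 2 / math.exp(lv_i))
    return total / len(y)


mu, log_var = split_outputs([1.0, 2.0, 0.0, 0.0], num_cells=2)
print(gaussian_nll([1.0, 2.0], mu, log_var))  # 0.0 (perfect prediction, unit variance)
print(gaussian_nll([4.0], [1.0], [0.0]))      # 4.5 (large error, unit variance)
# A larger predicted variance lowers the penalty for the same error:
print(gaussian_nll([4.0], [1.0], [2.0]) < gaussian_nll([4.0], [1.0], [0.0]))  # True
```

Under this kind of loss, the model can down-weight hard-to-fit observations by predicting a larger variance for them, and that predicted variance is what serves as the uncertainty estimate.
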


<br/>


**scPROTEIN_stage1_learning(model, peptide_onehot_padding, Y_label, learning_rate, weight_decay, split_percentage, num_epochs, batch_size)**

**- Function:**

This function constructs the framework for scPROTEIN *stage1* training and prediction.

**- Parameters:**

- `model` (object): The heteroscedastic regression model object defined for *stage1*.
- `peptide_onehot_padding` (array): One-hot encoding matrix for the input peptide sequences.
- `Y_label` (array): Peptide-level abundance matrix (peptides × cells).
- `learning_rate` (float): Learning rate for the Adam optimizer.
- `weight_decay` (float): Weight decay for the Adam optimizer.
- `split_percentage` (float): Split percentage of the data (0.8 in this tutorial).
- `num_epochs` (int): Number of epochs for training *stage1*. We empirically set this to 90 to balance convergence against training time.
- `batch_size` (int): Batch size for mini-batch training.

**- Returns:**

- `scPROTEIN_stage1` (object): The scPROTEIN *stage1* object. The functions of `scPROTEIN_stage1` are as follows:
    - `scPROTEIN_stage1.train()`: Perform scPROTEIN *stage1* training.
    - `scPROTEIN_stage1.uncertainty_generation()`: Generate the estimated peptide uncertainty based on the trained *stage1* model.
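
The roles of `split_percentage` and `batch_size` can be sketched as follows. This is an illustration and not the code inside `scPROTEIN_stage1_learning`: we assume that `split_percentage` is the fraction of peptides used for training and that the remainder is held out, and the helpers `split_indices` and `iterate_batches` are hypothetical.

```python
import random


def split_indices(num_samples, split_percentage, seed):
    """Shuffle sample indices and split them into (train, held_out) lists."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    indices = list(range(num_samples))
    rng.shuffle(indices)
    cut = int(num_samples * split_percentage)
    return indices[:cut], indices[cut:]


def iterate_batches(indices, batch_size):
    """Yield consecutive slices of at most batch_size indices."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


# Values taken from this tutorial: 9354 peptides, split 0.8, batch size 256, seed 3047.
train_idx, held_out_idx = split_indices(9354, 0.8, seed=3047)
batches = list(iterate_batches(train_idx, 256))
print(len(train_idx), len(held_out_idx), len(batches))  # 7483 1871 30
```

With these settings, each epoch would iterate over 30 mini-batches, the last of which is smaller than 256.
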


In [5]:
# output_dim is 2 * num_cells: one mu and one sigma per cell
model = peptide_CNN(
    num_amino_acid, args.max_pool_size, args.hidden_dim, 2 * num_cells,
    args.conv_layers, args.dropout_rate, args.kernel_nums, args.kernel_size,
).to(device)
scPROTEIN_stage1 = scPROTEIN_stage1_learning(
    model, peptide_onehot_padding, Y_label, args.learning_rate, args.weight_decay,
    args.split_percentage, args.num_epochs, args.batch_size,
)
scPROTEIN_stage1.train()

epoch 0, loss_regression: 538.3992491662502
epoch 1, loss_regression: 277.88261127471924
epoch 2, loss_regression: 161.06885701417923
epoch 3, loss_regression: 134.76411202549934
epoch 4, loss_regression: 112.60996372625232
epoch 5, loss_regression: 114.79242356866598
epoch 6, loss_regression: 86.18083467253018
epoch 7, loss_regression: 74.28734213113785
epoch 8, loss_regression: 71.19063758850098
epoch 9, loss_regression: 72.40123849362135
epoch 10, loss_regression: 145.74116540327668
epoch 11, loss_regression: 83.23712939023972
epoch 12, loss_regression: 68.98227059841156
epoch 13, loss_regression: 65.35419291257858
epoch 14, loss_regression: 63.31593954563141
epoch 15, loss_regression: 62.715305864810944
epoch 16, loss_regression: 61.290855169296265
epoch 17, loss_regression: 60.39526891708374
epoch 18, loss_regression: 59.57187879085541
epoch 19, loss_regression: 58.92042946815491
epoch 20, loss_regression: 58.40282726287842
epoch 21, loss_regression: 57.46757769584656
epoch 22, lo

### Generate the estimated peptide uncertainty

In [6]:
uncertainty = scPROTEIN_stage1.uncertainty_generation()
uncertainty.shape

(9354, 1490)

### Uncertainty-guided protein-level data generation
We then combine the estimated peptide uncertainty with the peptide-level data to compute the protein-level data, using the `load_sc_proteomic_features` function.


<br/>

**load_sc_proteomic_features(stage1)**

**- Function:**

This function specifies whether to use *stage1* and loads the single-cell protein-level data matrix.

**- Parameters:**

- `stage1` (bool): Whether scPROTEIN starts from *stage1*. `True` generates the protein-level data from *stage1* in the uncertainty-guided manner; `False` learns directly from existing protein-level data.

**- Returns:**

- `proteins` (list): Protein names.
- `cells` (list): The list containing the index of each cell.
- `features` (array): Single-cell proteomics data matrix.



In [7]:
_, _, protein_data = load_sc_proteomic_features(True)  
protein_data.shape

(1490, 3042)
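
Note the change of orientation in the shapes above: the uncertainty matrix is peptides × cells (9354, 1490), whereas the protein-level matrix is cells × proteins (1490, 3042). The sketch below shows one way such an aggregation could work. The tutorial does not spell out the aggregation rule, so the inverse-uncertainty weighted mean used here is an assumption, and `aggregate_to_proteins` is a hypothetical helper rather than part of scPROTEIN.

```python
from collections import defaultdict


def aggregate_to_proteins(peptide_values, uncertainty, proteins, eps=1e-8):
    """Collapse a peptides x cells matrix into a cells x proteins matrix.

    Each protein value is a weighted mean of its peptides, with weight
    1 / (uncertainty + eps), so less certain peptides contribute less.
    """
    # Group peptide row indices by the protein they belong to.
    groups = defaultdict(list)
    for row, protein in enumerate(proteins):
        groups[protein].append(row)
    protein_names = sorted(groups)
    num_cells = len(peptide_values[0])

    features = []
    for cell in range(num_cells):
        cell_row = []
        for protein in protein_names:
            rows = groups[protein]
            weights = [1.0 / (uncertainty[r][cell] + eps) for r in rows]
            values = [peptide_values[r][cell] for r in rows]
            cell_row.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
        features.append(cell_row)
    return protein_names, features


# Toy input: 3 peptides x 3 cells, where the first two peptides map to protein P1.
peptide_values = [[1.0, 2.0, 4.0], [3.0, 2.0, 0.0], [5.0, 6.0, 7.0]]
uncertainty = [[1.0, 0.5, 1.0], [1.0, 0.5, 1.0], [2.0, 2.0, 2.0]]
names, features = aggregate_to_proteins(peptide_values, uncertainty, ["P1", "P1", "P2"])
print(names)                             # ['P1', 'P2']
print(len(features), len(features[0]))   # 3 2  (cells x proteins)
```

In the toy input, the two P1 peptides have equal uncertainty in every cell, so the P1 value reduces to their plain average; unequal uncertainties would pull the result toward the more certain peptide.
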