# eye-scGPT Fine-Tuning Protocol Notebook
`Maintainer: Shanli Ding` \
This notebook contains every step described in the fine-tuning protocol. Running it on Colab or another cloud-based compute node is recommended.

Uncomment and run the following code if you are using Colab.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

## Installation

In [None]:
!pip install scgpt scanpy scvi-tools wandb louvain memory_profiler click

## Exploratory Data Analysis

### Imports

In [None]:
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
adata = sc.read('TRAIN_snRNA.h5ad', backup_url='DATA_URL')
print(adata)

### Visualize the distribution of the number of genes per cell

In [None]:
sc.pp.calculate_qc_metrics(adata, inplace=True)
sns.histplot(adata.obs['n_genes_by_counts'], bins=50, kde=False)
plt.xlabel('Number of genes per cell')
plt.ylabel('Frequency')
plt.title('Distribution of Genes per Cell')
plt.show()

### Visualize the distribution of how many cells express each gene

In [None]:
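import numpy as np

# (adata.X > 0).sum(axis=0) returns a 1 x n_genes matrix when X is sparse; flatten it to 1-D
adata.var['n_cells_by_counts'] = np.asarray((adata.X > 0).sum(axis=0)).ravel()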
sns.histplot(adata.var['n_cells_by_counts'], bins=50, kde=False)
plt.xlabel('Number of cells per gene')
plt.ylabel('Frequency')
plt.title('Distribution of Cells per Gene')
plt.show()

### Filter cells and genes based on findings
> E.g., min_genes=500, max_genes=5000, min_cells=10

In [None]:
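# filter_cells accepts only one threshold argument per call, so apply the bounds separately
sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_cells(adata, max_genes=5000)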
sc.pp.filter_genes(adata, min_cells=10)
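
The effect of the two thresholds can be illustrated on a toy count matrix. This is a minimal sketch in plain Python, not scanpy's implementation; the matrix values and the helper name `filter_counts` are made up for illustration. As in the cell above, cells are filtered first, and genes are then counted over the remaining cells only.

```python
# Toy illustration of cell and gene filtering (not scanpy's implementation).
# Rows are cells, columns are genes; values are hypothetical raw counts.
counts = [
    [3, 0, 1, 0],  # cell 0 expresses 2 genes
    [0, 0, 2, 0],  # cell 1 expresses 1 gene
    [5, 1, 4, 0],  # cell 2 expresses 3 genes
    [1, 2, 0, 0],  # cell 3 expresses 2 genes
]


def filter_counts(matrix, min_genes, min_cells):
    """Drop cells with fewer than min_genes detected genes, then drop
    genes detected in fewer than min_cells of the remaining cells."""
    kept_cells = [
        i for i, row in enumerate(matrix)
        if sum(v > 0 for v in row) >= min_genes
    ]
    n_genes = len(matrix[0]) if matrix else 0
    kept_genes = [
        j for j in range(n_genes)
        if sum(matrix[i][j] > 0 for i in kept_cells) >= min_cells
    ]
    filtered = [[matrix[i][j] for j in kept_genes] for i in kept_cells]
    return kept_cells, kept_genes, filtered


kept_cells, kept_genes, filtered = filter_counts(counts, min_genes=2, min_cells=2)
print("cells kept:", kept_cells)
print("genes kept:", kept_genes)
```

Because gene filtering runs after cell filtering, a gene expressed mostly in low-quality cells can fall below `min_cells` once those cells are removed.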

### Identify highly variable genes (HVGs)
The accompanying plot indicates whether the selected number of HVGs is appropriate.

In [None]:
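# Note: the default flavor ('seurat') expects log-normalized data; flavor='seurat_v3' expects raw counts
sc.pp.highly_variable_genes(adata, n_top_genes=5000)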
sc.pl.highly_variable_genes(adata)

### Visualize the filtered data using a violin plot to assess data distribution

In [None]:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts'], jitter=0.4)

## Preprocess

### Preprocess datasets for fine-tuning
```bash
--cell_type_col
--batch_id_col
```
Both parameters are required to run the subsequent fine-tuning task.

In [None]:
!python protocol_preprocess.py \
--dataset_directory=./dataset/EVAL/EVAL_BC_class.h5ad \
--cell_type_col=celltype \
--batch_id_col=sampleid \
--n_hvg=5000 \
--load_model=./pretrained_model/scGPT_human \
--wandb_project=bc_evaluation \
--wandb_sync=True

### Preprocess datasets for inference / evaluation
```bash
--load_model
```
`--load_model` is required and must point to the directory of the model to load.

In [None]:
!python protocol_preprocess.py \
--dataset_directory=./dataset/EVAL/EVAL_BC_class.h5ad \
--n_hvg=5000 \
--load_model=./save/AiO_finetune \
--wandb_project=bc_evaluation \
--wandb_sync=True

## Fine-tune
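Note:
`--max_seq_len` must be less than or equal to `--n_hvg`, not counting the one extra position used in the example below (`--max_seq_len=5001` for 5000 HVGs). In scGPT this extra position typically holds the `<cls>` token.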

In [0]:
!python protocol_finetune.py \
--max_seq_len=5001 \
--config=train \
--include_zero_gene=False \
--epochs=2 \
--batch_size=32 \
--schedule_ratio=0.9
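
The relationship between `--max_seq_len` and `--n_hvg` can be checked before launching a run. The following is a minimal sketch, assuming one extra position is allowed for a special token, as the 5001 / 5000 pairing above suggests. The helper name `check_seq_len` and its `extra_tokens` parameter are hypothetical and not part of the protocol scripts.

```python
def check_seq_len(max_seq_len, n_hvg, extra_tokens=1):
    """Return True if max_seq_len fits within n_hvg plus extra_tokens.

    extra_tokens=1 is an assumption of this sketch (one slot for a
    special token); set it to 0 to enforce max_seq_len <= n_hvg strictly.
    """
    if max_seq_len <= 0 or n_hvg <= 0:
        raise ValueError("max_seq_len and n_hvg must be positive")
    return max_seq_len <= n_hvg + extra_tokens


# Values from the preprocessing and fine-tuning cells in this notebook.
print(check_seq_len(5001, 5000))
print(check_seq_len(5001, 5000, extra_tokens=0))
```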

## Inference / Evaluation

### Inference task
This task can skip the fine-tuning step, but it uses a different preprocessing configuration. \
`--load_model` is the only required parameter for inference.

In [0]:
!python protocol_preprocess.py \
--dataset_directory=./data/EVAL/EVAL_BC_class.h5ad \
--load_model=./save/dev_protocol_finetune-Jan01-01-01-01 \
--wandb_project=BC_inference \
--wandb_sync=True

#### Start Inference
At the end, all results are stored in a single directory, whose path is reported in the logs. \
Files include: `predictions csv`, `run.log`

In [0]:
!python protocol_inference.py \
--load_model=./save/dev_protocol_finetune-Jan01-01-01-01 \
--batch_size=32 \
--wandb_sync=True \
--wandb_project=benchmark_BC \
--wandb_name=sample_bm_0101

### Evaluation task
```bash
--cell_type_col
--batch_id_col
```
Both parameters are required for evaluation.

In [None]:
!python protocol_preprocess.py \
--dataset_directory=./data/EVAL/EVAL_BC_class.h5ad \
--cell_type_col=celltype \
--batch_id_col=sampleid \
--load_model=./save/dev_protocol_finetune-Jan01-01-01-01 \
--wandb_project=BC_evaluation \
--wandb_sync=True
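
The preprocessing calls in this notebook differ mainly in whether the label columns are passed. The sketch below assembles the command for each case and enforces the stated requirements. The helper `build_preprocess_command` and its `task` labels are hypothetical conveniences, not part of the protocol scripts; only flags that appear in this notebook are used, and the W&B flags are left out for brevity.

```python
import shlex


def build_preprocess_command(dataset, load_model, task,
                             cell_type_col=None, batch_id_col=None,
                             n_hvg=None):
    """Assemble a protocol_preprocess.py command line for one task.

    task is a label used only by this sketch: "finetune", "inference",
    or "evaluation". Label columns are enforced for the two tasks that
    the notebook says require them and omitted for inference.
    """
    if task not in ("finetune", "inference", "evaluation"):
        raise ValueError(f"unknown task: {task}")
    needs_labels = task in ("finetune", "evaluation")
    if needs_labels and not (cell_type_col and batch_id_col):
        raise ValueError(f"{task} requires cell_type_col and batch_id_col")

    args = ["python", "protocol_preprocess.py",
            f"--dataset_directory={dataset}"]
    if needs_labels:
        args += [f"--cell_type_col={cell_type_col}",
                 f"--batch_id_col={batch_id_col}"]
    if n_hvg is not None:
        args.append(f"--n_hvg={n_hvg}")
    args.append(f"--load_model={load_model}")
    return shlex.join(args)


cmd = build_preprocess_command(
    "./data/EVAL/EVAL_BC_class.h5ad",
    "./save/dev_protocol_finetune-Jan01-01-01-01",
    task="inference",
)
print(cmd)
```

Calling the helper with `task="evaluation"` and no label columns raises a `ValueError` before anything is launched, which is cheaper than discovering the omission after preprocessing has run.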

#### Start Evaluation
At the end, all results are stored in a single directory, whose path is reported in the logs. \
Files include: `predictions csv`, `run.log`, `prediction vs. ground truth UMAP`, `results in a serialized pickle file`, `results in a JSON file`, `confusion matrix`.

In [None]:
!python protocol_inference.py \
--load_model=./save/dev_protocol_finetune-Jan01-01-01-01 \
--wandb_sync=true \
--wandb_project=BC_evaluation \
--wandb_name=bm_BC
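
To help read the confusion matrix listed among the evaluation outputs, the sketch below shows how such a matrix and an overall accuracy can be derived from true and predicted labels. It is a plain-Python illustration with made-up labels, not the protocol's own implementation, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts cells whose true
    label is labels[i] and whose predicted label is labels[j]."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for five cells.
y_true = ["type_a", "type_a", "type_b", "type_b", "type_c"]
y_pred = ["type_a", "type_b", "type_b", "type_b", "type_c"]

labels, matrix = confusion_matrix(y_true, y_pred)
print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy(y_true, y_pred))
```

Rows correspond to ground-truth cell types and columns to predictions, so off-diagonal entries show which cell types are confused with one another.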