# JIND-Multi Notebook Tutorial

This notebook provides a step-by-step guide on how to run the JIND-Multi method using the Pancreas scRNA-seq dataset as an example. Specifically, batch 0 will be used as the source, batch 2 as the target, and batch 1 as an additional intermediate dataset. Since we have labels for the target batch, we will use confusion matrices to evaluate the results.

## 1. Initial Setup

First, let's ensure we have the necessary dependencies and import the `jind_multi` package.

In [7]:
# To run this notebook in Google Colab, install these two packages:
#!pip install pandas==1.3.5 \
#  scanpy==1.8.0


In [1]:
import sys
import os
import ast

# Get the path to the project root directory
project_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Add the path to sys.path
if project_dir not in sys.path:
    sys.path.append(project_dir)

# Import the jind_multi package
import jind_multi


## 2. Configuring Parameters

In this section, we define the inputs required for running JIND-Multi by specifying various parameters. These inputs include:

- **Path to the `.h5ad` file**: This is the location of the data file containing the single-cell RNA sequencing data.
- **Batch and cell type column names**: We specify the column names for batch information and cell types within the AnnData object.
- **Source and target batch names**: We indicate which batch will be used as the source for annotation transfer and which batch will be the target for annotation.
- **Output path**: This is where the results of the analysis will be saved.
- **Training configurations**: These include the number of features to consider in the model, the minimum number of cells required per cell type for training in each batch, and whether to use a GPU for computation.
- **Intermediate datasets**: The `TRAIN_DATASETS_NAMES` parameter specifies which batches are used as intermediate datasets for training. These datasets, excluding the source and target batches, help in improving the model’s performance by providing additional training data.

We define these parameters in the `Args` class, which will be used to configure and run the JIND-Multi method.


In [None]:
class Args:
    PATH = "../resources/data/pancreas/pancreas.h5ad"  # path to your data
    BATCH_COL = "batch" # Column name for batch information in the AnnData object
    LABELS_COL = "celltype" # Column name for cell types in the AnnData object
    SOURCE_DATASET_NAME = "0" # Name of the source batch
    TARGET_DATASET_NAME = "2" # Name of the target batch
    OUTPUT_PATH = "../results/pancreas" # Directory to save results
    PRETRAINED_MODEL_PATH = "" # Path to pre-trained models, if available (here we are not introducing any)
    TRAIN_DATASETS_NAMES = "['1']" # List of intermediate datasets for training
    NUM_FEATURES = 5000 # Number of features (genes) to consider for modeling
    MIN_CELL_TYPE_POPULATION = 100 # Minimum number of cells required per cell type for training
    USE_GPU = True # Whether to use GPU for computation

args = Args()
print(args.PATH)
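
Before training, it can help to confirm that the batch roles do not overlap, since the intermediate datasets are meant to exclude the source and target batches. The following is a minimal sketch, not part of `jind_multi`: the helper name `check_batches` is hypothetical, and it assumes `TRAIN_DATASETS_NAMES` is the string form of a Python list, as in the `Args` class above.

```python
import ast


def check_batches(source, target, train_names_literal):
    """Parse the intermediate-batch list and check that batch roles do not overlap."""
    # ast.literal_eval only evaluates Python literals, so it is safer than eval()
    train_names = ast.literal_eval(train_names_literal)
    if not isinstance(train_names, list):
        raise ValueError("TRAIN_DATASETS_NAMES must be the string form of a list")
    if source == target:
        raise ValueError("source and target batch must differ")
    overlap = {source, target} & set(train_names)
    if overlap:
        raise ValueError(
            f"intermediate batches overlap with source/target: {sorted(overlap)}"
        )
    return train_names


# Same values as in the Args class of this tutorial
train_names = check_batches("0", "2", "['1']")
print(train_names)  # ['1']
```

Running the check right after defining `Args` surfaces a misconfigured batch name before any data is loaded.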

## 3. Setting Up the Training Configuration
We adjust the training configuration according to the specified parameters.

In [None]:
# Set up the training configuration (other entries in config can be adjusted here as well)
config = jind_multi.get_config()
config['data']['num_features'] = args.NUM_FEATURES
config['data']['min_cell_type_population'] = args.MIN_CELL_TYPE_POPULATION
config['train_classifier']['cuda'] = args.USE_GPU
config['GAN']['cuda'] = args.USE_GPU
config['ftune']['cuda'] = args.USE_GPU
print(f'USE_GPU: {args.USE_GPU}')
print(config)
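
The GPU flag is written to three separate sections of the configuration, so a small helper can keep them in sync. The sketch below is illustrative only: `default_config` is a hypothetical stand-in for the dictionary returned by `jind_multi.get_config()` that reproduces only the keys used in this tutorial, with made-up default values, and `set_gpu` is not a `jind_multi` function.

```python
import copy

# Hypothetical stand-in for jind_multi.get_config(); only the keys used in
# this tutorial are reproduced, and the default values are made up.
default_config = {
    "data": {"num_features": 5000, "min_cell_type_population": 100},
    "train_classifier": {"cuda": False},
    "GAN": {"cuda": False},
    "ftune": {"cuda": False},
}


def set_gpu(config, use_gpu, stages=("train_classifier", "GAN", "ftune")):
    """Return a copy of config with the 'cuda' flag set for every listed stage."""
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    for stage in stages:
        updated[stage]["cuda"] = use_gpu
    return updated


config_gpu = set_gpu(default_config, use_gpu=True)
print(config_gpu["GAN"]["cuda"])      # True
print(default_config["GAN"]["cuda"])  # False
```

Working on a deep copy keeps the original defaults available, which is convenient when the same notebook builds a second configuration later on.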

## 4. Loading and Processing Data
We load and process the data using the `load_and_process_data` function from the `jind_multi` package. Then, we divide the data into training and test sets.

In [None]:
# Load and process the data
data = jind_multi.load_and_process_data(args, config)

# Split into training and test datasets
train_data = data[data['batch'] != args.TARGET_DATASET_NAME]
test_data = data[data['batch'] == args.TARGET_DATASET_NAME]

## 5. Creating the JIND-Multi Object
We create an instance of the JindWrapper class, which encapsulates the functionality of JIND-Multi, including training and evaluating the model.

In [None]:
train_datasets_names = ast.literal_eval(args.TRAIN_DATASETS_NAMES)

jind = jind_multi.JindWrapper(
    train_data=train_data,
    train_dataset_names=train_datasets_names,
    source_dataset_name=args.SOURCE_DATASET_NAME,
    output_path=args.OUTPUT_PATH,
    config=config
)

## 6. Training the Model
Next, we train the model and infer labels for the test dataset (the target batch).

In [None]:
# Train the JIND-Multi model
jind.train(target_data=test_data)
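
Because the target batch is labelled, predictions can be compared with the true cell types in a confusion matrix. The sketch below shows the idea in plain Python. It is not how `jind_multi` builds its own matrices, the helper `confusion_matrix` is hypothetical, and the labels are toy values rather than JIND-Multi output.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Toy labels for illustration only; these are not JIND-Multi results
true_labels = ["alpha", "alpha", "beta", "beta", "delta"]
predicted_labels = ["alpha", "beta", "beta", "beta", "delta"]

labels, matrix = confusion_matrix(true_labels, predicted_labels)
# Accuracy is the sum of the diagonal divided by the number of cells
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(true_labels)

print(labels)    # ['alpha', 'beta', 'delta']
print(matrix)    # [[1, 1, 0], [0, 2, 0], [0, 0, 1]]
print(accuracy)  # 0.8
```

Off-diagonal entries show which cell types are confused with each other, which a single accuracy figure hides.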

## 7. Applying the Trained Model to a New Target Batch

Once the model has been trained, you may want to apply it to a new target batch. This process involves loading the pre-trained model from the saved directory and using it to make predictions on new data.

### Steps:

1. **Specify the Path to Pre-Trained Models**: Ensure that the path to the directory containing the trained models is correctly set. This directory should include the model files (`.pt` format) and the associated configuration files.

2. **Load the Pre-Trained Model**: Use the `JindWrapper` class to load the pre-trained model from the specified directory.

3. **Apply the Model to the New Target Batch**: Run the model on the new target batch to get predictions.

Here’s how you can do this in code for batch 3:

In [None]:
# Define new parameters for applying the model to a new target batch
class Args:
    PATH = "../resources/data/pancreas/pancreas.h5ad"  # path to your data
    BATCH_COL = "batch" # Column name for batch information in the AnnData object
    LABELS_COL = "celltype" # Column name for cell types in the AnnData object
    SOURCE_DATASET_NAME = "0" # Name of the source batch
    TARGET_DATASET_NAME = "3" # Name of the target batch
    OUTPUT_PATH = "../results/pancreas_target3" # Directory to save results
    PRETRAINED_MODEL_PATH = "../results/pancreas/trained_models" # Path to the pre-trained models saved by the previous run
    TRAIN_DATASETS_NAMES = "['1']" # List of intermediate datasets for training
    NUM_FEATURES = 5000 # Number of features (genes) to consider for modeling
    MIN_CELL_TYPE_POPULATION = 5 # Minimum number of cells required per cell type for training
    USE_GPU = True # Whether to use GPU for computation

args = Args()
print(args.PATH)

In [None]:
# Set up training configuration
config = jind_multi.get_config()
config['data']['num_features'] = args.NUM_FEATURES
config['data']['min_cell_type_population'] = args.MIN_CELL_TYPE_POPULATION
config['train_classifier']['cuda'] = args.USE_GPU
config['GAN']['cuda'] = args.USE_GPU
config['ftune']['cuda'] = args.USE_GPU
print(f'USE_GPU: {args.USE_GPU}')
print(config)

# Load and process the data
data = jind_multi.load_and_process_data(args, config)

In [None]:
# Split into training and test datasets
train_data = data[data['batch'] != args.TARGET_DATASET_NAME]
test_data = data[data['batch'] == args.TARGET_DATASET_NAME]

# Create the JIND-Multi object
jind2 = jind_multi.JindWrapper(
    train_data=train_data,
    source_dataset_name=args.SOURCE_DATASET_NAME,
    output_path=args.OUTPUT_PATH,
    config=config,
)

### Loading and Applying the Pre-Trained Model

After setting up the parameters and preparing the data, the next step is to load the pre-trained model and use it for predictions on the new target batch.

The cell below locates the pre-trained model files in the specified directory, loads the models, and then loads the validation statistics that were saved with them.

In [None]:
print('Loading pre-trained models from specified path...')
file_paths = jind_multi.find_saved_models(args.PRETRAINED_MODEL_PATH, train_data)  # Check if pre-trained models are available
print(file_paths)
model = jind_multi.load_trained_models(file_paths, train_data, args.SOURCE_DATASET_NAME)
print(model)
print("Loading validation statistics...")
val_stats = jind_multi.load_val_stats(args.PRETRAINED_MODEL_PATH, 'val_stats_trained_model.json')
print(val_stats)
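
If loading fails, a quick way to rule out a wrong path is to list the `.pt` files in the model directory. This is a minimal sketch using only the standard library. The helper `list_model_files` is hypothetical and does not replace `jind_multi.find_saved_models`, and the `.pt` file names in the demonstration are made up.

```python
import tempfile
from pathlib import Path


def list_model_files(model_dir):
    """Return the sorted names of the .pt files directly inside model_dir."""
    directory = Path(model_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"model directory not found: {directory}")
    return sorted(p.name for p in directory.glob("*.pt"))


# Demonstration on a temporary directory; the .pt file names are made up
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model_a.pt", "model_b.pt", "val_stats_trained_model.json"):
        (Path(tmp) / name).touch()
    found = list_model_files(tmp)

print(found)  # ['model_a.pt', 'model_b.pt']
```

In the notebook, the same helper could be called with `args.PRETRAINED_MODEL_PATH`; an empty list or a `FileNotFoundError` points to a path problem rather than a model problem.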

**Applying the Model to the New Target Batch:**

In [None]:
# Run JIND-Multi on the new target batch, reusing the pre-trained model and validation statistics
jind2.train(target_data=test_data, model=model, val_stats=val_stats)

## 8. Conclusion

This notebook provided a comprehensive guide on configuring and running `JIND-Multi` for single-cell RNA sequencing analysis, using the Pancreas dataset with multiple labeled batches. It covered the key steps, including setting parameters, loading and processing data, and evaluating the model using confusion matrices. To tailor the analysis to your specific dataset and research objectives, adjust the parameters accordingly and review the results.

For further guidance on interpreting the results, please consult the `JIND-Multi` package documentation and the output files located in the `OUTPUT_PATH`.