# API Tutorial: Full pyInfinityFlow Pipeline

This tutorial uses the pyInfinityFlow API to carry out the full analysis pipeline on an example dataset. The example is a subset of the previously published mouse lung dataset[[1]](https://www.science.org/doi/10.1126/sciadv.abg0505); the full dataset is publicly available [here](https://flowrepository.org/id/FR-FCM-Z2LP) on flowrepository.org. The subset ships with the [pyInfinityFlow repository on GitHub](https://github.com/KyleFerchen/pyInfinityFlow) and consists of 10 InfinityMarkers and 5 Isotype controls located in the ['example_data'](https://github.com/KyleFerchen/pyInfinityFlow/tree/main/example_data) directory. This directory also contains the InfinityMarker annotation file and the Backbone annotation file, both of which are required by the analysis pipeline.

Once [Git is installed](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git), change to the directory where you want to place the repository and run the following command:
```
git clone https://github.com/KyleFerchen/pyInfinityFlow.git
```

## Step 1: Preparing the Inputs

### Backbone Annotation File
First, we need to locate the Backbone annotation file. This file tells the program which channel names in the input FCS files to use as the Backbone (the predictors in the regression model). It is simply a .csv or .tsv file with three columns, in the order below, that annotate:

1. The channel names in the reference FCS file(s) (the data we use to build the final InfinityFlow object)
2. The channel names in the InfinityMarker FCS files (the data used to fit and validate the models)
3. The final name to use for the channel in the InfinityFlow object

This file should have the column names as the first line.
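
As a rough illustration of this layout, the sketch below writes a two-row Backbone annotation table with Python's standard `csv` module and reads it back. The column names and the two rows are taken from the example file shown later in this tutorial; nothing here calls pyInfinityFlow itself.

```python
import csv
import io

# Column names as they appear in the tutorial's example annotation file.
COLUMNS = ["Reference_Backbone", "Query_Backbone", "Final_Name"]

# Two rows copied from the example Backbone table.
rows = [
    ("FJComp-BUV395-A", "FJComp-BUV395-A", "CD4"),
    ("FJComp-BV421-A", "FJComp-BV421-A", "CD8"),
]

# Write the table to an in-memory buffer, header line first.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back: the first parsed line is the header, the rest are channels.
parsed = list(csv.reader(io.StringIO(csv_text)))
header, body = parsed[0], parsed[1:]
print(header)
print(body)
```

For a .tsv file the same sketch would pass `delimiter="\t"` to both the writer and the reader.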

After downloading the pyInfinityFlow repository from GitHub, we can access an example file for this test dataset, e.g.:

    '/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset_backbone_anno.csv'
    

The ``pyInfinityFlow.InfinityFlow_Utilities`` module provides a simple function (``read_annotation_table``) to read either a .csv, .tsv, or .txt (tab-delimited) file into a pandas.DataFrame object:


    import os
    from pyInfinityFlow import InfinityFlow_Utilities

    path_to_repo = "/media/kyle_ssd1/Repositories/pyInfinityFlow/"
    path_backbone = os.path.join(path_to_repo, "example_data/mouse_lung_dataset_subset_backbone_anno.csv")
    backbone_anno = InfinityFlow_Utilities.read_annotation_table(path_backbone)
    backbone_anno

|   | Reference_Backbone | Query_Backbone | Final_Name |
|---|---|---|---|
| 0 | FJComp-APC-A | FJComp-APC-A | CD69-CD301b |
| 1 | FJComp-AlexaFluor700-A | FJComp-AlexaFluor700-A | MHCII |
| 2 | FJComp-BUV395-A | FJComp-BUV395-A | CD4 |
| 3 | FJComp-BUV737-A | FJComp-BUV737-A | CD44 |
| 4 | FJComp-BV421-A | FJComp-BV421-A | CD8 |
| 5 | FJComp-BV510-A | FJComp-BV510-A | CD11c |
| 6 | FJComp-BV605-A | FJComp-BV605-A | CD11b |
| 7 | FJComp-BV650-A | FJComp-BV650-A | F480 |
| 8 | FJComp-BV711-A | FJComp-BV711-A | Ly6C |
| 9 | FJComp-BV786-A | FJComp-BV786-A | Lineage |


### InfinityMarker Annotation File

The InfinityMarker annotation file specifies which FCS files to use to build the regression models and how they should be treated. Each InfinityMarker (the flow cytometry signal to impute using the Backbone) has a row in this annotation file with the following columns:
    
1. The FCS file name
2. The InfinityMarker channel name (exactly as it appears in the FCS file)
3. The name to give the channel in the final InfinityFlow object
4. (OPTIONAL) The final name of Isotype InfinityMarker (should be an entry in the third column for the InfinityMarkers that are Isotype controls)
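
The rule for the optional fourth column can be checked before running the pipeline. The sketch below is a hypothetical helper, not part of pyInfinityFlow: `find_missing_isotypes` and the file names are made up for illustration, and the rows are plain dictionaries keyed by the column names of the example annotation file (`File`, `Channel`, `Name`, `Isotype`).

```python
def find_missing_isotypes(rows):
    """Return Isotype names (column 4) that never appear as a Name (column 3)."""
    names = {row["Name"] for row in rows}
    # Skip blank cells so rows without an isotype assignment are ignored.
    referenced = {row["Isotype"] for row in rows if row.get("Isotype")}
    return sorted(referenced - names)


# Hypothetical rows; the file names are placeholders, not files from the dataset.
rows = [
    {"File": "marker_a.fcs", "Channel": "FJComp-PE(yg)-A",
     "Name": "CD103", "Isotype": "Isotype_AHIgG"},
    {"File": "marker_b.fcs", "Channel": "FJComp-PE(yg)-A",
     "Name": "CD105", "Isotype": "Isotype_rIgG2a"},
    # The control's own Isotype cell is left blank here; what the real file
    # stores for control rows is an open question in this sketch.
    {"File": "isotype_a.fcs", "Channel": "FJComp-PE(yg)-A",
     "Name": "Isotype_AHIgG", "Isotype": ""},
]

missing = find_missing_isotypes(rows)
print(missing)  # ['Isotype_rIgG2a']
```

Here `Isotype_rIgG2a` is reported because no row in the third column carries that name, which is the situation the fourth-column rule is meant to prevent.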

This file is included in the same directory as the Backbone annotation file in the GitHub repository, e.g.:

    '/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset_infinity_marker_anno.csv'

Isotype background correction is an optional step in which a linear model is used to regress out the background binding and fluorescence of an antibody raised with a specific immunoglobulin. You can read more about it from the [original publication](https://www.science.org/doi/10.1126/sciadv.abg0505). The InfinityMarker annotation file is used to specify whether or not to perform background correction. This is optional and will only be attempted in the pipeline if this annotation file has a 4th column.
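
To make the idea of regressing out background concrete, here is a minimal sketch under a simplifying assumption: correction is taken to mean fitting a straight line of marker signal against isotype signal and keeping the residuals. The function names and the numbers are invented for illustration, and the package's actual procedure may differ in its details.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regress_out(marker, isotype):
    """Remove the part of the marker signal that the isotype signal explains."""
    slope, intercept = fit_line(isotype, marker)
    return [m - (slope * i + intercept) for m, i in zip(marker, isotype)]


# Made-up signal values for five events (not taken from the dataset).
isotype_signal = [0.1, 0.4, 0.5, 0.9, 1.2]
marker_signal = [0.3, 0.9, 1.1, 2.5, 2.4]

corrected = regress_out(marker_signal, isotype_signal)
print([round(value, 3) for value in corrected])
```

By construction, the residuals average to zero and are uncorrelated with the isotype signal, which is what "regressing out" means in this simplified setting.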

The InfinityMarker annotation file, like the Backbone annotation file, is expected to be either a .csv, .tsv, or .txt (tab-delimited) file, and can also be read into a pandas.DataFrame using the ``read_annotation_table`` function:

    path_infmarker = os.path.join(path_to_repo,
        "example_data/mouse_lung_dataset_subset_infinity_marker_anno.csv")
    infinitymarker_anno = InfinityFlow_Utilities.read_annotation_table(path_infmarker)
    infinitymarker_anno

|   | File | Channel | Name | Isotype |
|---|---|---|---|---|
| 0 | backbone_Plate2_Specimen_001_G1_G01_073_target... | FJComp-PE(yg)-A | 33D1 | Isotype_rIgG2b |
| 1 | backbone_Plate2_Specimen_001_F7_F07_067_target... | FJComp-PE(yg)-A | Allergin-1 | Isotype_mIgG1 |
| 2 | backbone_Plate2_Specimen_001_F8_F08_068_target... | FJComp-PE(yg)-A | B7-H4 | Isotype_AHIgG |
| 3 | backbone_Plate1_Specimen_001_A2_A02_002_target... | FJComp-PE(yg)-A | CD1d | Isotype_rIgG2b |
| 4 | backbone_Plate1_Specimen_001_G4_G04_076_target... | FJComp-PE(yg)-A | CD103 | Isotype_AHIgG |
| 5 | backbone_Plate1_Specimen_001_G5_G05_077_target... | FJComp-PE(yg)-A | CD105 | Isotype_rIgG2a |
| 6 | backbone_Plate1_Specimen_001_G6_G06_078_target... | FJComp-PE(yg)-A | CD106 | Isotype_rIgG2a |
| 7 | backbone_Plate1_Specimen_001_G7_G07_079_target... | FJComp-PE(yg)-A | CD107a (Lamp-1) | Isotype_rIgG2a |
| 8 | backbone_Plate1_Specimen_001_G8_G08_080_target... | FJComp-PE(yg)-A | CD107b (Mac-3) | Isotype_rIgG1 |
| 9 | backbone_Plate1_Specimen_001_G9_G09_081_target... | FJComp-PE(yg)-A | CD115 | Isotype_rIgG2a |


## Step 2: Checking the Inputs and Building an InfinityFlowFileHandler

Next, we need to specify the directory in which the FCS files are saved. This directory is located in the same parent directory as the annotation files on the pyInfinityFlow GitHub repository:

    '/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset'

Then we can use the ``check_infinity_flow_annotation_dataframes`` function to do the following:

- Validate the input annotation DataFrames
- Scan through the InfinityMarker FCS files to split events into training/validation/pooling subsets
- Return an InfinityFlowFileHandler to store how each of the InfinityMarker files will be processed

Here we will use the ``n_events_combine`` parameter to pool events from each of the individual InfinityMarker files into the final InfinityFlow object. Each of the original channels from these files is preserved in the final InfinityFlow object.
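
One plausible way to picture the training/validation/pooling split is sketched below. This is not the library's implementation: `split_events` is a hypothetical helper, the shuffling strategy is an assumption, and so is the choice to draw the pooled events from the validation subset. Only the parameter names `ratio_for_validation` and `n_events_combine` come from the tutorial.

```python
import random


def split_events(n_events, ratio_for_validation, n_events_combine, seed=0):
    """Split event indices into train, test, and pool lists (illustrative only)."""
    rng = random.Random(seed)
    indices = list(range(n_events))
    rng.shuffle(indices)

    # The first share of the shuffled indices goes to validation, the rest to training.
    n_validate = int(n_events * ratio_for_validation)
    test = sorted(indices[:n_validate])
    train = sorted(indices[n_validate:])

    # Assumption: pooled events are sampled from the validation subset,
    # capped at the number of validation events available.
    pool = sorted(rng.sample(test, min(n_events_combine, len(test))))
    return train, test, pool


train, test, pool = split_events(
    n_events=10_000, ratio_for_validation=0.5, n_events_combine=1000)
print(len(train), len(test), len(pool))  # 5000 5000 1000
```

Keeping the training and validation indices disjoint is the point of the split: validation events are never seen while a model is being fitted.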

Note: it is also possible to use the ``separate_backbone_reference`` argument to supply a separate FCS file onto which the predictions will be made. This is useful if one or more features are not well explained by the Backbone channels and therefore should not be imputed.

    fcs_dir = "/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset"

    file_handler = InfinityFlow_Utilities.check_infinity_flow_annotation_dataframes(
        backbone_annotation=backbone_anno,
        infinity_marker_annotation=infinitymarker_anno,
        n_events_train=0, # Use all possible events in the FCS file
        n_events_validate=0, # Use all possible events in the FCS file
        ratio_for_validation=0.5,
        n_events_combine=1000, # Events to pool into a final InfinityFlow object
        input_fcs_dir=fcs_dir,
        verbosity=1)

    file_handler

Output:

    Isotype controls detected. Will attempt to use background correction...

    InfinityFlowFileHandler Object from pyInfinityFlow
        .handles the following InfinityMarkers:
                33D1
                Allergin-1
                B7-H4
                CD1d
                CD103
                CD105
                CD106
                CD107a (Lamp-1)
                CD107b (Mac-3)
                CD115
                Isotype_rIgG2b
                Isotype_mIgG1
                Isotype_AHIgG
                Isotype_rIgG2a
                Isotype_rIgG1

        Held in the InfinityFlowFileHandler.handles dictionary

        InfinityFlowFileHandler.list_infinity_markers holds ordered list of InfinityMarkers


For example, you can see how the InfinityMarker "33D1" is stored in the ``file_handler.handles`` dictionary, including the name, file_name, directory, reference_backbone_channels, backbone_channels, prediction_channel, train_indices, test_indices, and pool_indices.

This information will be used later on to carry out XGBoost regression.

    file_handler.handles["33D1"]

Output:

    {'name': '33D1',
     'file_name': 'backbone_Plate2_Specimen_001_G1_G01_073_target_33D1.fcs',
     'directory': '/media/kyle_ssd1/Repositories/pyInfinityFlow/example_data/mouse_lung_dataset_subset',
     'reference_backbone_channels': array(['FJComp-APC-A', 'FJComp-AlexaFluor700-A', 'FJComp-BUV395-A',
            'FJComp-BUV737-A', 'FJComp-BV421-A', 'FJComp-BV510-A',
            'FJComp-BV605-A', 'FJComp-BV650-A', 'FJComp-BV711-A',
            'FJComp-BV786-A', 'FJComp-GFP-A', 'FJComp-PE-Cy7(yg)-A',
            'FJComp-PerCP-Cy5-5-A'], dtype=object),
     'backbone_channels': array(['FJComp-APC-A', 'FJComp-AlexaFluor700-A', 'FJComp-BUV395-A',
            'FJComp-BUV737-A', 'FJComp-BV421-A', 'FJComp-BV510-A',
            'FJComp-BV605-A', 'FJComp-BV650-A', 'FJComp-BV711-A',
            'FJComp-BV786-A', 'FJComp-GFP-A', 'FJComp-PE-Cy7(yg)-A',
            'FJComp-PerCP-Cy5-5-A'], dtype=object),
     'prediction_channel': 'FJComp-PE(yg)-A',
     'train_indices': array([     0,      1,      4, ..., 106343, 106344, 106346]),
     'test_indices': arra

## Step 3: Specify Output Directories

Here, we simply need to specify a directory in which to save the outputs of the pipeline. The ``InfinityFlow_Utilities.setup_output_directories`` function will prepare a dictionary that stores where to save different outputs, and create those directories:

    output_paths = InfinityFlow_Utilities.setup_output_directories(
        output_dir="/media/kyle_ssd1/outputs/",
        file_handler=file_handler,
        verbosity=1)

    output_paths

Output:

    {'output_regression_path': '/media/kyle_ssd1/outputs/regression_results',
     'output_umap_feature_plot_path': '/media/kyle_ssd1/outputs/umap_feature_plots',
     'clustering': '/media/kyle_ssd1/outputs/clustering',
     'qc': '/media/kyle_ssd1/outputs/QC',
     'output_umap_bc_feature_plot_path': '/media/kyle_ssd1/outputs/umap_feature_plots_background_corrected'}
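
For intuition, the behavior described in this step (build a dictionary of output locations and create them) can be approximated with the standard library. The dictionary keys and subdirectory names below are copied from the output above; `make_output_dirs` is a hypothetical stand-in, not the package's function, and it writes to a temporary directory rather than a real output path.

```python
import os
import tempfile

# Keys and subdirectory names as shown in the tutorial's output dictionary.
SUBDIRS = {
    "output_regression_path": "regression_results",
    "output_umap_feature_plot_path": "umap_feature_plots",
    "clustering": "clustering",
    "qc": "QC",
    "output_umap_bc_feature_plot_path": "umap_feature_plots_background_corrected",
}


def make_output_dirs(output_dir, subdirs):
    """Map each key to a path under output_dir and create the directories."""
    paths = {key: os.path.join(output_dir, name) for key, name in subdirs.items()}
    for path in paths.values():
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return paths


# Use a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    output_paths = make_output_dirs(tmp, SUBDIRS)
    all_created = all(os.path.isdir(path) for path in output_paths.values())

print(all_created)
```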

## Step 4: Fitting the XGBoost Regression Models

The ``InfinityFlow_Utilities.single_chunk_training`` function is used to create and fit the XGBoost models. It returns a tuple consisting of an ``InfinityFlow_Utilities.CombinedRegressionModels`` object and a dictionary that records how long it took to fit the models for the InfinityMarkers.

    regression_models, timings_1 = InfinityFlow_Utilities.single_chunk_training(
        file_handler=file_handler,
        cores_to_use=12,
        use_logicle_scaling=True,
        normalization_method=None,
        verbosity=3)
    regression_models

Output:

    Reading in data from .fcs files for model training...
    Applying Logicle normalization to data...


    regression_models

Output:

    CombinedRegressionModels Object from pyInfinityFlow
        Contains regression models for the following InfinityMarkers (Response Variables):
    33D1,Allergin-1,B7-H4,CD1d,CD103,CD105,CD106,CD107a (Lamp-1),CD107b (Mac-3),CD115,Isotype_rIgG2b,Isotype_mIgG1,Isotype_AHIgG,Isotype_rIgG2a,Isotype_rIgG1

        Uses the following backbone (Explanatory Variables):
    FJComp-APC-A,FJComp-AlexaFluor700-A,FJComp-BUV395-A,FJComp-BUV737-A,FJComp-BV421-A,FJComp-BV510-A,FJComp-BV605-A,FJComp-BV650-A,FJComp-BV711-A,FJComp-BV786-A,FJComp-GFP-A,FJComp-PE-Cy7(yg)-A,FJComp-PerCP-Cy5-5-A

    The object holds the following variables:
        ordered_training_channels
        var_annotations
        infinity_markers
        regression_models
        parameter_annotations
        infinity_channels
        validation_metrics

        Access regression models as dictionary with the InfinityMarker as the key: 
            Eg. CombinedRegressionModels.regression_models["33D1"]
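
The shape of the return value, one fitted model per InfinityMarker plus a dictionary of fitting times, can be pictured with the sketch below. It is an illustration only: `MeanModel` is a deliberately trivial placeholder that predicts the training mean, swapped in for XGBoost so the example runs with the standard library alone, and `fit_all_markers` and the toy numbers are hypothetical.

```python
import time


class MeanModel:
    """Placeholder regressor: always predicts the mean of the training response."""

    def fit(self, backbone_rows, response):
        # A real regressor would learn from backbone_rows; this one ignores them.
        self.mean_ = sum(response) / len(response)
        return self

    def predict(self, backbone_rows):
        return [self.mean_ for _ in backbone_rows]


def fit_all_markers(backbone_rows, responses):
    """Fit one model per marker and record how long each fit took."""
    models, timings = {}, {}
    for marker, response in responses.items():
        start = time.perf_counter()
        models[marker] = MeanModel().fit(backbone_rows, response)
        timings[marker] = time.perf_counter() - start
    return models, timings


# Toy data: three events, two backbone channels, two response markers.
backbone_rows = [[0.1, 0.7], [0.4, 0.2], [0.9, 0.5]]
responses = {"33D1": [1.0, 2.0, 3.0], "CD103": [0.5, 0.5, 2.0]}

models, timings = fit_all_markers(backbone_rows, responses)
print(models["33D1"].predict(backbone_rows[:1]))  # [2.0]
```

Keying both dictionaries by marker name mirrors the access pattern shown in the object's own description, where a model is looked up by its InfinityMarker.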
