Patients sub-group detection by PILOT

PILOT

In this tutorial, we are using single-cell data from pancreatic ductal adenocarcinomas to find sub-groups of patients using clustering of W distances. And then we rank the cells/genes based on clustering results.

You can download the Anndata (h5ad) file for this tutorial from here and place it in the Datasets folder.

import PILOT as pl
import scanpy as sc

Reading Anndata

adata = sc.read_h5ad('Datasets/PDAC.h5ad')

Loading the required information and computing the Wasserstein distance:

In order to work with PILOT, ensure that your Anndata object is loaded and contains the required information.

Use the following parameters to configure PILOT for your analysis (Setting Parameters):

adata: Pass your loaded Anndata object to PILOT.
emb_matrix: Provide the name of the variable in the obsm level that holds the dimension reduction (PCA representation).
clusters_col: Specify the name of the column in the observation level of your Anndata that corresponds to cell types or clusters.
sample_col: Indicate the column name in the observation level of your Anndata that contains information about samples or patients.
status: Provide the column name that represents the status or disease (e.g., "control" or "case").

pl.tl.wasserstein_distance(
    adata,
    emb_matrix='X_pca',
    clusters_col='cell_types',
    sample_col='sampleID',
    status='status'
    )

Ploting the Cost matrix and the Wasserstein distance:

Here we show the heatmaps of the Cost matrix (cells) and Wasserstein distance (samples). At this point, you should have a 'Results_PILOT' folder inside the folder where you are running your analysis, which includes all the plots created by PILOT.

pl.pl.heatmaps(adata)

Trajectory:

Here we show the Diffusion map of Wasserstein distance. In the upcoming trajectory analysis, the labels "T" stands for Tumor and "N" stands for the Normal.

pl.pl.trajectory(adata, colors = ['red','Blue'])

In this section, we should find the optimal number of clusters.

The Silhouette Score Curve is used to find the optimal number of clusters by plotting the average Silhouette Score for different numbers of clusters. The number of clusters corresponding to the highest average Silhouette Score is considered the optimal number of clusters.

pl.pl.select_best_sil(adata)

Patients sub-group detection by clustering EMD.

Using the Silhouette scores of the previous step, we can find the optimal number of cluster of patients to detect different stage of disease.

proportion_df=pl.pl.clustering_emd(adata, res = adata.uns['best_res'])

Here we can see that all of the Normal samples are in cluster 1, so we can rename the name of clusters for future analysis. Although this step is optional, we chose to do it in this tutorial.

proportion_df.loc[proportion_df['Predicted_Labels'] == '0', 'Predicted_Labels'] = 'Tumor 1'
proportion_df.loc[proportion_df['Predicted_Labels'] == '1', 'Predicted_Labels'] = 'Normal'
proportion_df.loc[proportion_df['Predicted_Labels'] == '2', 'Predicted_Labels'] = 'Tumor 2'

Cell-type selection.

Importantly, we determine which cell type derives the disease from the first stage to the second stage by detecting cell types have statistically significant changes between two sub-groups. This is done using Welch’s t-test, which is appropriate for unequal variances in two groups of samples.

Based on the adjusted p-value threshold you consider, you can choose how statistically significant the cell types you want to have.

pl.pl.cell_type_diff_two_sub_patient_groups(
    proportion_df,
    proportion_df.columns[0:-2],
    group1 = 'Tumor 2', 
    group2 = 'Tumor 1',
    pval_thr = 0.05, 
    figsize = (15, 4)
    )

Next, we can check the distribution of each patient sub-group in each cell type (the three above ones).

pl.pl.plot_cell_types_distributions(
    proportion_df,
    cell_types = ['Stellate cell','Ductal cell type 2','Ductal cell type 1'],
    figsize = (17,8),
    label_order = ['Normal', 'Tumor 1', 'Tumor 2'],
    label_colors = ['#FF7F00','#BCD2EE','#1874CD']
    )

We also can check the distribution of each patient sub-group in the whole cell types.

pl.pl.plot_cell_types_distributions(
    proportion_df,
    cell_types = list(proportion_df.columns[0:-2]),
    figsize = (17,8),
    label_order = ['Normal', 'Tumor 1', 'Tumor 2'],
    label_colors = ['#FF7F00','#BCD2EE','#1874CD']
    )

In the statistical table generated in your results path, all cell types are sorted based on the score and their adjusted p-value. cell types with positive scores shows the most differences cell types in group 1 compared with group 2 and negative scores shows the other way. You can the results in 'Results_PILOT/Diff_Expressions_Results' folder.

Differential expression analysis

At this point for each cell type the plots and results are saved at 'Results_PILOT/Diff_Expressions_Results' folder.

Note:

This step needs the 'limma' package in R, You need to run this function to install it (if you have not already done)!

#pl.tl.install_r_packages()

Ductal cell type 1

In the next step, for specific cell types, we find the Differential genes between two interested patient sub-groups Based on the fold change threshold, you can determine how much difference you want to see between two interested patients sub-groups. For the tutorial, we already saved the needed input for this function (2000 highly variable genes for each cell type).

You can run this function for your own data, and it automatically extracts the gene expressions for each cell type ("2000 highly variable genes). In case you want all/more genes, please change the "highly_variable_genes_" or "n_top_genes" parameters.

cell_type = "Ductal cell type 1"
pl.tl.compute_diff_expressions(
    adata,
    cell_type,
    proportion_df,
    fc_thr =  0.5,
    pval_thr = 0.05,
    group1 = 'Tumor 1', 
    group2 = 'Tumor 2',
    sample_col = 'sampleID',
    col_cell = 'cell_types',
    )

Gene Ontology analysis

Based on the adjusted p-value and fold change threshold, we select genes that are highly differentiated in each patient sub-groups and specify their annotation using the gProfiler

pl.pl.gene_annotation_cell_type_subgroup(
    cell_type = cell_type, 
    group = 'Tumor 1',
    num_gos = 10
    )

pl.pl.gene_annotation_cell_type_subgroup(
    cell_type = cell_type,
    group = 'Tumor 2',
    num_gos = 10
    )

Stellate cell

cell_type = "Stellate cell" 
pl.tl.compute_diff_expressions(
    adata,
    cell_type,
    proportion_df,
    fc_thr = 0.1,
    pval_thr = 0.05,
    group1 = 'Tumor 1',
    group2 = 'Tumor 2',
    sample_col = 'sampleID',
   col_cell = 'cell_types'
    )

Gene Ontology analysis

pl.pl.gene_annotation_cell_type_subgroup(
    cell_type = cell_type, 
    group = 'Tumor 1',
    num_gos = 10
    )

pl.pl.gene_annotation_cell_type_subgroup(
    cell_type = cell_type,
    group = 'Tumor 2',
    num_gos = 10
    )

Ductal cell type 2

cell_type = "Ductal cell type 2" 
pl.tl.compute_diff_expressions(
    adata,
    cell_type, 
    proportion_df,
    fc_thr = 0.020,
    pval_thr = 0.05,
    group1 = 'Tumor 1',
    group2 = 'Tumor 2',
    sample_col = 'sampleID',
    col_cell = 'cell_types',
    )

Gene Ontology analysis

pl.pl.gene_annotation_cell_type_subgroup(
    cell_type = cell_type,
    group = 'Tumor 1',
    num_gos = 10
    )

pl.pl.gene_annotation_cell_type_subgroup(
    cell_type = cell_type, 
    group = 'Tumor 2',
    num_gos = 10
    )

Pseudobulk differential expression analysis

In some situations, We are interested in the genes are expressed different at the sample level without considering the variation across the cells. We define pseudo-bulk samples as summing counts for all cells with the same combination of label and sample. We use the DESeq2 tool to perform pseudo-bulk differential expression analysis on a specific cell type sub-patient group

Note:

Please make sure the conda environment has the required packages for this part:

conda install conda-forge::r-tidyverse
conda install bioconda::bioconductor-singlecellexperiment
conda install bioconda::bioconductor-deseq2

Stellate cell

pl.pl.get_pseudobulk_DE(adata, proportion_df, cell_type = 'Stellate cell', fc_thr = [3.0, 4.0, 0.5])

Ductal cell type 1

pl.pl.get_pseudobulk_DE(adata, proportion_df, cell_type = 'Ductal cell type 1', fc_thr = [5.0, 4.0, 0.5])

Ductal cell type 2

pl.pl.get_pseudobulk_DE(adata, proportion_df, cell_type = 'Ductal cell type 2', fc_thr = [2.0, 2.0, 0.5])

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patients_sub_group_detection.md

Patients_sub_group_detection.md

Patients sub-group detection by PILOT

Reading Anndata

Loading the required information and computing the Wasserstein distance:

Ploting the Cost matrix and the Wasserstein distance:

Trajectory:

In this section, we should find the optimal number of clusters.

Patients sub-group detection by clustering EMD.

Cell-type selection.

Differential expression analysis

Note:

Ductal cell type 1

Gene Ontology analysis

Stellate cell

Gene Ontology analysis

Ductal cell type 2

Gene Ontology analysis

Pseudobulk differential expression analysis

Stellate cell

Ductal cell type 1

Ductal cell type 2

Files

Patients_sub_group_detection.md

Latest commit

History

Patients_sub_group_detection.md

File metadata and controls

Patients sub-group detection by PILOT

Reading Anndata

Loading the required information and computing the Wasserstein distance:

Ploting the Cost matrix and the Wasserstein distance:

Trajectory:

In this section, we should find the optimal number of clusters.

Patients sub-group detection by clustering EMD.

Cell-type selection.

Differential expression analysis

Note:

Ductal cell type 1

Gene Ontology analysis

Stellate cell

Gene Ontology analysis

Ductal cell type 2

Gene Ontology analysis

Pseudobulk differential expression analysis

Stellate cell

Ductal cell type 1

Ductal cell type 2