GitHub - nickmmark/immune-phenotyping: Immune cell phenotyping of lung cancer and lung tissue using PCA, tSNE, and other techniques

Dimensional Reduction and Unsupervised Learning for Immune Cell Phenotyping

This repo describes techniques for immune cell phenotyping of non-small cell lung cancer using PCA, tSNE, and other techniques in R and FlowJo.

Modern single cell analysis techniques (flow cytometry, mass cytometry, single cell RNA sequencing, etc.) capture massive amounts of high dimensional data. For example, a comprehensive flow cytometry panel can stain cells with dozens of markers and identify hundreds of distinct cell types, and the raw data can occupy 1-25 GB. Interpreting this high dimensional data can be challenging. Dimensional reduction techniques can be used either to analyze raw flow cytometry data (to naively identify cell populations) or to analyze populations identified through traditional gating approaches (to identify population changes between groups). These techniques simplify complex high dimensional data and can identify novel cell populations, such as interferon gamma producing cells among the immune cells isolated from non-small cell lung cancer (NSCLC) tumors:

tSNE plot of CD45+ immune cells derived from NSCLC tumor and non-tumor adjacent lung tissue; the z-axis and color indicate the degree of IFN-gamma production

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensional reduction algorithm with O(n^2) time complexity. PCA is perhaps the most widely used dimensional reduction technique (having been first described in 1933) and has many implementations in R and other programming languages. PCA preserves the global structure of the data but not the local structure: dissimilar points are placed far apart, but when high dimensional data is reduced to a low dimensional projection, similar points are not necessarily placed close together.
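As a rough illustration of what PCA returns, the sketch below runs base R's `prcomp()` on simulated data. The marker names, group labels, and values are made up for illustration and are not taken from the NSCLC dataset.

```r
# Simulated stand-in for a cell-frequency table: 60 samples x 5 markers (not real data)
set.seed(1)
n <- 60
group <- rep(c("lung", "tumor"), each = n / 2)
shift <- ifelse(group == "tumor", 2, 0)
markers <- data.frame(
  cd4  = rnorm(n, mean = 10 + shift),
  cd8  = rnorm(n, mean = 8 + shift),
  nk   = rnorm(n, mean = 5),
  b    = rnorm(n, mean = 4),
  treg = rnorm(n, mean = 2 + shift)
)

# center and scale so that no single marker dominates the components
pr <- prcomp(markers, center = TRUE, scale. = TRUE)

# proportion of variance explained by each principal component
var_explained <- pr$sdev^2 / sum(pr$sdev^2)

# loadings (eigenvectors): contribution of each marker to PC1, sorted
pc1_loadings <- sort(pr$rotation[, "PC1"])

# scores: coordinates of each sample on the first two components, ready for plotting
scores <- data.frame(pr$x[, 1:2], group = group)
```

Here `var_explained` shows how much of the total variance each component captures, and `pc1_loadings` shows which markers drive the first component.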

In this example of immune cell phenotyping of NSCLC and lung tissue, I examined a dataset of samples obtained from individuals undergoing surgical resection of NSCLC; these samples were processed fresh into a single cell suspension and stained with a panel of >30 antibodies against surface and intracellular antigens and >50,000 events were obtained by flow cytometry.

Overlapping immune cell phenotypes of tumor and lung samples

Immune cell populations in tumor and lung samples with eigenvectors shown

We can also look at other clinical parameters, such as whether the sample came from a patient with COPD, defined by either the degree of airflow obstruction (GOLD stage) or the degree of emphysema as measured radiologically (Goddard score). Although the degree of emphysema and the degree of airflow obstruction are correlated, there are phenotypic and immunologic differences, as seen below:

For example:

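library(dplyr)      # mutate(), select(), arrange()
library(ggplot2)    # ggtitle()
library(ggfortify)  # autoplot() method for prcomp objects

# define the presence of COPD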
copd <- mutate(copd, new_gold = ifelse(gold_stage < 1, "No COPD present", "COPD present"))
copd <- mutate(copd, new_goddard = ifelse(goddard_score < 0.5, "No emphysema present", "Emphysema present"))
copd <- mutate(copd, new_copd_any_definition = ifelse(copd_anydefinition < 1, "No COPD", "COPD"))
copd <- na.omit(copd)

# select the correct cell types for inclusion
copd <- select(copd, gold_stage, goddard_score, new_gold, new_goddard, colNames, sample, copd_anydefinition,
               cd45, cd3, cd4, cd8, nkt, pmn, nk, b, #gdt, nk, b, mac, pmn,  # basic cell types
               cd8ifng, th1, th17, treg, gdtil17, gdtifng,      # cytokine profiles
              #cd4_pd1, cd8_pd1,                                # pd-1 expression
              #pmnpdl1, macpdl1, monopdl1, nocd45pdl1,           # pd-l1 expression
              #cd4_tim3, cd8_tim3,                                # tim3 expression
              #cd4pd1tim3, cd8pd1tim3,                          # dual checkpoint expression
               )          

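# PCA input: only the numeric cell population columns
# (assumption: minus_gold was not defined in the original; here it is taken to be
# every column from cd45 through gdtifng in the select() above)
minus_gold <- select(copd, cd45:gdtifng)
pr <- prcomp(minus_gold)
pc_comps <- data.frame(pr$rotation)   # loadings (eigenvectors) for each component
pc1_vars <- select(pc_comps, PC1)
pc2_vars <- select(pc_comps, PC2)
arrange(pc1_vars, PC1)                # markers ordered by their PC1 loading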

# Plot the first two principal components
autoplot(pr, data = copd,
         colour = "new_gold", frame = TRUE, frame.type = "norm"
         # , loadings = TRUE, loadings.label = TRUE, loadings.colour = "black"   # uncomment to show eigenvectors
         ) +
         ggtitle(label = "COPD vs Non-COPD PCA")

This shows us the effect of COPD being present in the resected non-tumor adjacent lung on the immune cell phenotype in the resected tumor. In this case we define COPD as the presence of airflow obstruction based on GOLD stage (see code above).

t-Distributed Stochastic Neighbor Embedding (tSNE)

t-Distributed Stochastic Neighbor Embedding (tSNE) is a non-linear algorithm for dimensionality reduction that allows complex multi-dimensional data to be visualized in fewer dimensions while maintaining the overall structure of the data. tSNE was first described in 2008 and has become a widely used dimensional reduction technique (see the website of its creator, Laurens van der Maaten, for more details). Importantly, tSNE captures the local structure of the data well while also revealing global structure such as the presence of clusters. It can be run either natively in FlowJo or in R using the Rtsne package. For a complete description of the underlying algorithm, see the references below.

For tSNE in FlowJo, excellent documentation is available (see the FlowJo tSNE documentation in the references).

To perform tSNE in R, we can use the Rtsne package. In this example, paired immune cell populations (CD45+) from lung tumor and non-tumor adjacent lung were concatenated (50,000 events from each sample) and analyzed using tSNE. Specific immune cell populations can be labeled according to origin (lung vs tumor), immune effector cell type (CD4+, CD8+, gamma delta TCR+), or intracellular cytokine production (interferon gamma, IL-17a, etc.). We can export flow cytometry data as a dataframe in which each row represents a single event (cell) and each column holds the values for one marker.

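library(readxl); library(Rtsne)   # read_excel() stands in for loadExcel(), which is not defined here
training_set <- read_excel("NSCLC.xlsx", sheet = 1)   # assumes column 1 holds the population label
labels <- factor(training_set[[1]])
colors <- rainbow(nlevels(labels))                    # one color per label (assumed palette)
immune_cell_tsne <- Rtsne(as.matrix(training_set[, -1]), dims = 2, perplexity = 25, theta = 0.2, verbose = TRUE, pca = TRUE, max_iter = 500)
plot(immune_cell_tsne$Y, t = 'n', main = "immune_cell_tsne")
text(immune_cell_tsne$Y, labels = labels, col = colors[labels])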
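The input layout described above (one row per event, one column per marker, equal numbers of events from each sample) could be assembled roughly as follows. This is a minimal sketch on simulated data: the `make_sample()` and `take_events()` helpers, the marker columns, and the event counts are all hypothetical.

```r
# Hypothetical per-sample event tables: one row per cell, one column per marker
set.seed(42)
make_sample <- function(n_events) {
  data.frame(cd45 = rlnorm(n_events), cd3 = rlnorm(n_events), cd8 = rlnorm(n_events))
}
lung  <- make_sample(1200)
tumor <- make_sample(900)

# draw the same number of events from each sample so neither dominates the embedding
take_events <- function(events, n) {
  events[sample(nrow(events), min(n, nrow(events))), , drop = FALSE]
}
n_per_sample <- 500   # the text above uses 50,000; smaller here to keep the sketch fast

# concatenate the samples, recording where each event came from
combined <- rbind(
  data.frame(origin = "lung",  take_events(lung,  n_per_sample)),
  data.frame(origin = "tumor", take_events(tumor, n_per_sample))
)
rownames(combined) <- NULL

# label in column 1, marker values in the remaining columns
marker_matrix <- as.matrix(combined[, -1])
```

The `origin` column is kept out of `marker_matrix` so that it can be used for coloring the embedding without influencing it.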

When performing tSNE it is important to carefully select hyperparameters:

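dimensions - how many dimensions are desired (usually 2)
perplexity - increase with a larger number of cells or with denser clusters; typically 25-100
maximum iterations - typically 500 or 1000
theta (speed/accuracy tradeoff) - 0 runs exact tSNE; larger values (up to 1) use a coarser Barnes-Hut approximation that is faster but less accurate (0.2 in the example above)
PCA (true or false) - whether to run an initial PCA step to reduce the input dimensions before tSNE (the Rtsne argument is spelled pca)
eta (learning rate) - controls how much the embedding is adjusted at each iteration. Optimally set at 7% of the number of cells being mapped into tSNE space.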
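One way to keep these settings standardized is to collect them in a small helper. The sketch below is an illustration, not part of the original analysis: `tsne_settings()` is a hypothetical function, its defaults are the example values used in this README, and the embedding itself is only run if the Rtsne package is installed.

```r
# Assemble tSNE settings for a dataset with n_cells rows (hypothetical helper).
tsne_settings <- function(n_cells, perplexity = 25, theta = 0.2, max_iter = 500) {
  # Rtsne refuses to run when 3 * perplexity exceeds n_cells - 1
  if (3 * perplexity > n_cells - 1) {
    stop("perplexity is too large for ", n_cells, " cells")
  }
  list(
    dims = 2,
    perplexity = perplexity,
    theta = theta,
    pca = TRUE,
    max_iter = max_iter,
    eta = 0.07 * n_cells   # learning rate: 7% of the number of cells, as suggested above
  )
}

settings <- tsne_settings(n_cells = 10000)

# Only run the embedding when the Rtsne package is available
if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(rnorm(300 * 10), nrow = 300)           # simulated events, not real data
  small <- tsne_settings(n_cells = nrow(X))
  fit <- do.call(Rtsne::Rtsne, c(list(X), small))    # fit$Y holds one 2-D coordinate per event
}
```

Passing the settings through `do.call()` means the same list can be saved alongside the results and reused for every sample.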

Hyperparameter tuning requires experimentation. I recommend downsampling the dataset to 10,000 events while optimizing the parameters to save time.

In this example, tSNE shows that the immune cell populations in paired NSCLC and lung samples are similar and largely overlap.

Here is a summary of tSNE findings for multiple concatenated lung and tumor samples:

tSNE analysis of multiple lung tumors; note the overlapping immune phenotype for several different lung/tumor samples

While tSNE is a powerful and useful tool for analyzing immune cell populations, there are some important limitations of tSNE to consider:

computationally expensive; with O(n^2) time complexity this can take a long time to run (it makes sense to set up a cloud VM with lots of RAM and compute for datasets larger than 100k cells)
non-deterministic; running the same data can produce (slightly) different results
sensitive to hyper-parameters; make sure to empirically tune and then standardize the settings used
images can be deceptive; although tSNE preserves much of the structure of the data, the relative area of different regions is not representative of the number of cells
sensitive to the compensation of the data; one of the strengths of tSNE is that it can accommodate log distributed data, however if there are events off scale they will distort the analysis
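The non-determinism noted above comes from the random initialization, so fixing R's random seed before each run is the usual way to make a layout repeatable. The sketch below assumes the Rtsne package and uses simulated data; `reproducible_tsne()` is a hypothetical wrapper, and the perplexity value is arbitrary.

```r
# Hypothetical wrapper: fix the seed immediately before running tSNE
reproducible_tsne <- function(X, seed, perplexity = 10) {
  set.seed(seed)   # controls the random initial layout
  Rtsne::Rtsne(X, dims = 2, perplexity = perplexity)$Y
}

set.seed(99)
X <- matrix(rnorm(200 * 8), nrow = 200)   # simulated events, not real data

# Only run the comparison when the Rtsne package is available
if (requireNamespace("Rtsne", quietly = TRUE)) {
  run_a <- reproducible_tsne(X, seed = 2024)
  run_b <- reproducible_tsne(X, seed = 2024)
  same_layout <- isTRUE(all.equal(run_a, run_b))   # same seed, same coordinates
}
```

Recording the seed together with the hyperparameters makes it possible to regenerate a published figure from the same data.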

T cell receptor sequence analysis

This section examines TCR sequences, consensus sequences, and sequences shared between tumor and adjacent tissue using the tcR R package.
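The idea of shared sequences can be sketched in base R without relying on any particular tcR function. The CDR3 sequences and clone counts below are invented for illustration only.

```r
# Hypothetical CDR3 amino acid sequences with clone counts (made up for illustration)
tumor <- data.frame(
  cdr3  = c("CASSLGQAYEQYF", "CASSPGTGGNEQFF", "CASSQDRGTEAFF"),
  count = c(40, 25, 5)
)
lung <- data.frame(
  cdr3  = c("CASSLGQAYEQYF", "CASSQDRGTEAFF", "CASRDSNQPQHF"),
  count = c(10, 30, 12)
)

# sequences found in both tumor and adjacent lung
shared <- intersect(tumor$cdr3, lung$cdr3)

# Jaccard index: shared sequences as a fraction of all distinct sequences
jaccard <- length(shared) / length(union(tumor$cdr3, lung$cdr3))

# clone counts of the shared sequences, side by side
shared_counts <- merge(tumor, lung, by = "cdr3", suffixes = c("_tumor", "_lung"))
```

For real repertoires the tcR package provides dedicated tools for this kind of comparison; the base R version is only meant to show what is being computed.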

Other techniques

Uniform Manifold Approximation and Projection (UMAP) - An alternative non-linear, non-deterministic dimensional reduction algorithm. I have less experience using UMAP, but it has some clear advantages, including O(d*n^1.14) rather than O(n^2) time complexity, that make it appealing for large datasets. UMAP is available as an R package.
Hierarchical Clustering
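As a minimal sketch of hierarchical clustering, the example below uses base R's `dist()`, `hclust()`, and `cutree()` on two simulated populations. The marker names, the Ward linkage, and the choice of two clusters are assumptions made for illustration.

```r
set.seed(7)
# two simulated cell populations that differ in both markers (not real data)
cells <- rbind(
  matrix(rnorm(50 * 2, mean = 0), ncol = 2),
  matrix(rnorm(50 * 2, mean = 6), ncol = 2)
)
colnames(cells) <- c("cd4", "cd8")

# pairwise Euclidean distances on scaled marker values
d <- dist(scale(cells))

# agglomerative clustering with Ward linkage
tree <- hclust(d, method = "ward.D2")

# cut the dendrogram into two clusters and count the cells in each
cluster <- cutree(tree, k = 2)
table(cluster)
```

Calling `plot(tree)` draws the dendrogram, which helps when choosing how many clusters to cut.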

version/to do

current version 0.1.3

[ ]current version 0.1.0 - this is a work in progress
[ ]need to cleanup the tSNE R code
[ ]add more detailed examples and explanations
[ ]add additional references

references

Mark NM et al, Chronic Obstructive Pulmonary Disease Alters Immune Cell Composition and Immune Checkpoint Inhibitor Efficacy in Non-Small Cell Lung Cancer, AJRCCM 2018
Thorsson V et al, The Immune Landscape of Cancer, Immunity. 2018
FlowJo tSNE documentation
Comprehensive Guide on t-SNE algorithm with implementation in R & Python
Becht E et al, Evaluation of UMAP as an alternative to t-SNE for single-cell data
Nazarov, V.I., Pogorelyy, M.V., Komech, E.A. et al. tcR: an R package for T cell receptor repertoire advanced data analysis. BMC Bioinformatics 16, 175 (2015). https://doi.org/10.1186/s12859-015-0613-1
