# Ensemble Pursuit: an algorithm for finding overlapping clusters of correlated neurons in large-scale recordings

Maria Kesa, Carsen Stringer\*, Marius Pachitariu\*

Howard Hughes Medical Institute, Janelia, 2019

from EnsemblePursuit.EnsemblePursuit import new_ensemble

## Abstract

Large populations of neurons coordinate their activity to process shared sensory inputs and encode internal behavioral states. To extract these patterns of coordination from large datasets, we developed a fast greedy algorithm based on dictionary learning that extracts correlated, overlapping ensembles of cells from calcium imaging recordings. The learning algorithm repeatedly initializes new dictionary terms and extracts them greedily from the residuals of the cost function, given the current active set of dictionary elements. It shares this strategy with projection pursuit methods such as independent components analysis, and we therefore called the algorithm "ensemble pursuit". The method has a tunable parameter that controls the sparsity of the ensembles, i.e. how many cells they include on average. Because one cell typically belongs to multiple ensembles, the model is more flexible than traditional clustering approaches. We applied the algorithm to simultaneous calcium imaging recordings of tens of thousands of neurons from mouse primary visual cortex (V1) and evaluated how groups of cells extracted by the algorithm encode stimulus and behavioral variables.

## Introduction

To begin our investigations, we ask a basic question: "How is computation organized at the circuit level in the brain? Are there redundant functional modules in the cortical code, with some neurons representing the same information? Can we extract from high-dimensional data groupings of cells that tend to respond similarly to stimuli, or that share variability through behavioral variables encoded internally by the brain?" We develop a mathematical framework for rigorously exploring these questions.

Two techniques used to understand high-dimensional data are matrix factorization and clustering. The best-known and most widely used matrix factorization method is Principal Components Analysis (PCA). It minimizes the following unconstrained reconstruction cost function:
$$\text{Cost} = \| X - U \cdot V\|^2 $$
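The cost above can be sketched numerically. The following is a minimal illustration, assuming $X$ is a mean-centered neurons-by-timepoints matrix and using NumPy's SVD; the helper names `reconstruction_cost` and `pca_factors` are hypothetical and are not part of the EnsemblePursuit package. For a fixed number of components $k$, the truncated SVD gives factors that minimize this cost (the Eckart-Young theorem), and the remaining cost equals the sum of the squared discarded singular values.

```python
import numpy as np

def reconstruction_cost(X, U, V):
    """Squared Frobenius norm of the residual X - U @ V."""
    return float(np.sum((X - U @ V) ** 2))

def pca_factors(X, k):
    """Rank-k factors (U, V) from the truncated SVD of X.

    Assumes X (neurons x timepoints) is already mean-centered.
    """
    W, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = W[:, :k] * s[:k]   # dense loadings, one column per component
    V = Vt[:k]             # component time courses, one row per component
    return U, V

# Synthetic stand-in for a recording: 50 "neurons", 200 timepoints.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
X -= X.mean(axis=1, keepdims=True)  # center each neuron's trace

k = 5
U, V = pca_factors(X, k)
cost = reconstruction_cost(X, U, V)

# Singular values of X, used to check the cost against the discarded spectrum.
s = np.linalg.svd(X, compute_uv=False)
discarded_energy = float(np.sum(s[k:] ** 2))
```

Here `cost` and `discarded_energy` agree up to floating-point error, and essentially every entry of `U` is non-zero.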
In PCA, $U$ is a dense matrix. In contrast, for clustering methods each row of $U$ has only one non-zero element, which is either fixed at 1 or allowed to take on a positive scalar value. This extreme sparsity constraint makes clustering a highly interpretable method: it groups together datapoints that are similar. However, it has the drawback of accounting for only a small fraction of the variance of most datasets $X$. Ensemble Pursuit is a matrix factorization algorithm that adds a sparsity penalty to the reconstruction cost function:
$$\text{Cost} = \| X - U\cdot V\|^2 + \lambda \|U\|_0 \label{cost_function} $$
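To make the trade-off concrete, here is a small sketch that evaluates this penalized cost on toy data. It assumes $\|U\|_0$ counts the non-zero entries of $U$ (the standard definition of the $L_0$ "norm"); the function `penalized_cost` and the toy matrices are illustrative assumptions, not the package's implementation. It compares a clustering-style $U$, with one non-zero entry per row, against a $U$ in which two cells belong to both ensembles.

```python
import numpy as np

def penalized_cost(X, U, V, lam):
    """Reconstruction error plus lam times the number of non-zeros in U."""
    residual = np.sum((X - U @ V) ** 2)
    l0 = np.count_nonzero(U)
    return float(residual + lam * l0)

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 100))  # time courses of two ensembles

# Overlapping assignment: cells 2 and 3 belong to both ensembles.
U_overlap = np.array([[1., 0.], [1., 0.], [1., 1.],
                      [1., 1.], [0., 1.], [0., 1.]])
X = U_overlap @ V  # toy data generated from the overlapping model

# Clustering-style assignment: exactly one non-zero entry per row.
U_cluster = np.array([[1., 0.], [1., 0.], [1., 0.],
                      [0., 1.], [0., 1.], [0., 1.]])

# Small penalty: the overlapping solution wins because it fits X exactly.
small = {name: penalized_cost(X, U, V, lam=0.5)
         for name, U in [("overlap", U_overlap), ("cluster", U_cluster)]}

# Large penalty: the sparser clustering solution becomes cheaper.
large = {name: penalized_cost(X, U, V, lam=1000.0)
         for name, U in [("overlap", U_overlap), ("cluster", U_cluster)]}
```

With a small $\lambda$ the overlapping factorization has the lower cost; with a large $\lambda$ the extra non-zero entries outweigh the better fit and the clustering-style factorization is preferred.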
By adding an $L_0$ sparsity penalty on the columns of the matrix $U$, weighted by a tunable parameter $\lambda$, we can steer the algorithm toward solutions in between the hard constraints of clustering and the uninterpretable factors extracted by unconstrained