# Diffusion maps for single-cell data analysis

Use this notebook to solve all the programming tasks of chapter 4. You can download the required files (data1.mat and guo.xlsx) from the homepage. The loading functions below expect them in a `data` subfolder.

In [None]:
%matplotlib inline
import numpy as np
from scipy.io import loadmat 
from pandas import read_excel
from scipy.spatial.distance import pdist, cdist
import numpy.linalg as LA

# add your required imports here

In [None]:
#test cell



## Introduction
#### Task 1: Implement the diffusion maps algorithm.
We recommend solving this task by defining a class for diffusion maps and implementing a `fit_transform` function that returns the embedding of a given data set. This keeps the code uniform when comparing diffusion maps with other dimensionality reduction methods.

In [None]:
# your code goes here

In [4]:
from scipy.spatial.distance import cdist

class DiffMap():
    """
    Class Diffusion Maps
    
    Parameters
    ----------
    n_components: int, optional, default = 2
        number of dimensions in which the data will be embedded
    sigma: float, optional, default = 10
        bandwidth of the Gaussian kernel
    alpha: float, optional, default = 1
        the density rescaling parameter
    """
    
    def __init__(self, n_components = 2, sigma = 10, alpha = 1):
        self.n_components = n_components
        self.sigma = sigma
        self.alpha = alpha
        
    def fit_transform(self, X):
        """
        Computes the embedding
        
        Parameters
        ----------
        X: array
           input data
           
        Returns
        -------
        evecs: array [n_cells, n_components]
            array of n_components eigenvectors or diffusion coordinates
        """
        # your code goes here
        # Gaussian kernel on squared Euclidean distances
        K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * self.sigma * self.sigma))
        # density rescaling: K_alpha = Q^(-alpha) K Q^(-alpha)
        Q_minusalpha = np.diag(np.power(np.sum(K, axis=1), -self.alpha))
        K_alpha = Q_minusalpha @ K @ Q_minusalpha
        np.fill_diagonal(K_alpha, 0)
        # row normalization gives the transition matrix P
        P = K_alpha / np.sum(K_alpha, axis=1, keepdims=True)
        # eig returns the eigenvalues in no particular order, so sort them descending
        evals, evecs = np.linalg.eig(P)
        order = np.argsort(-evals.real)
        evecs = evecs[:, order].real
        # skip the first (constant) eigenvector, which belongs to eigenvalue 1
        return evecs[:, 1:self.n_components + 1]
        
        

#### Task 2: Perform a diffusion map analysis on the Buettner data set. 

In [6]:
def load_buettner_data(): 
    #load buettner data
    file = loadmat('data//data1.mat')
    data = file.get('in_X')
    data = np.array(data)

    #group assignments
    labels = file.get('true_labs')
    labels = labels[:,0] -1

    #group names
    stage_names = ['1', '2', '3']

    return data, stage_names, labels

In [None]:
# your code goes here
data, stage_names, labels = load_buettner_data()
clf = DiffMap()

In [None]:
# your code goes here

#### Task 3: Perform a PCA analysis of the Buettner data set.
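
One possible starting point is a small PCA class that mirrors the `fit_transform` interface recommended in Task 1, so that both methods can be called the same way. This is only a sketch: the class name `SimplePCA` and the toy data are hypothetical, and computing PCA via the SVD of the centered data is one common choice, not necessarily the one intended by the chapter.

```python
import numpy as np

class SimplePCA:
    """Hypothetical helper: PCA via SVD with a fit_transform interface."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit_transform(self, X):
        X = np.asarray(X, dtype=float)
        # center each column (gene) at zero mean
        Xc = X - X.mean(axis=0)
        # rows of Vt are the principal axes, S the singular values
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        # principal component scores of the first n_components axes
        return U[:, :self.n_components] * S[:self.n_components]

# toy data standing in for the real data set (assumption)
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
scores = SimplePCA(n_components=2).fit_transform(toy)
```

With the Buettner data, `toy` would be replaced by the array returned from `load_buettner_data()`, and the two columns of `scores` can be passed to a scatter plot colored by `labels`.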

In [None]:
# your code goes here

## Single-cell data analysis

In the following, we will apply diffusion maps to the Guo data. The file contains the following information:

1. the input data, a matrix with one row per cell and one column per gene,
2. the names of the measured genes, and
3. an assignment of each cell to an embryonic stage. These assignments have to be converted into numerical labels before they can be used in the scatter plots.
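
The conversion of stage assignments into numerical labels can be sketched as follows. The annotation strings are made up for illustration, the helper `stage_to_label` is hypothetical, and the sketch assumes that the stage is the first whitespace-separated token of each annotation.

```python
import numpy as np

stage_names = ['2C', '4C', '8C', '16C', '32C', '64C']

# made-up annotations in the "stage embryo.cell" format (assumption)
example_annotations = ['2C 1.1', '64C 2.7', '16C 3.2', '4C 1.4']

def stage_to_label(annotation, stage_names):
    """Return the index of the annotation's stage in stage_names."""
    stage = annotation.split()[0]
    return stage_names.index(stage)

labels = np.array([stage_to_label(a, stage_names) for a in example_annotations])
```

An integer array like `labels` can be passed directly as the color argument of a scatter plot.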

### Pre-processing
#### Task 4: Pre-process the Guo data.

Take a look at the file guo.xlsx. The annotation in the first column encodes the embryonic stage, the embryo number, and the individual cell number. For example, 64C 2.7 refers to the 7th cell harvested from the 2nd embryo collected at the 64-cell stage. The first row contains the names of the measured genes.
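
To make the naming scheme concrete, here is a small sketch that splits one annotation into its three parts. The function name `parse_annotation` is hypothetical, and the sketch assumes every entry follows the `stage embryo.cell` pattern of the example above.

```python
def parse_annotation(annotation):
    """Split e.g. '64C 2.7' into (stage, embryo number, cell number)."""
    stage, ids = annotation.split()
    embryo, cell = ids.split('.')
    return stage, int(embryo), int(cell)

parsed = parse_annotation('64C 2.7')
```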

In [None]:
def load_guo_data():
    #load guo data
    data_frame = read_excel('data//guo.xlsx', sheet_name = 'Sheet1')

    #data
    adata = data_frame.to_numpy()  # as_matrix() was removed in pandas 1.0
    data = adata[:,1:]
    embryonic_stages = adata[:,0]

    #genes
    genes_tmp = data_frame.axes[1][1:]
    genes_names = [genes_tmp[k] for k in range(genes_tmp.size)]

    # your code goes here

    #stage_names and creating labels
    stage_names = ['2C', '4C', '8C', '16C', '32C', '64C']

    # label of a cell = index of the stage name its annotation starts with
    labels = np.array([next(i for i, sname in enumerate(stage_names) if ename.startswith(sname))
        for ename in embryonic_stages])
    
    return data, stage_names,labels

#### Task 5: Perform a diffusion map analysis of the pre-processed Guo data.

In [None]:
# your code goes here

In [None]:
# your code goes here

#### Task 6: Comparison with the data without pre-processing.

In [None]:
# your code goes here

### Comparison with other dimensionality reduction methods

#### Task 7: Compare diffusion maps with two other methods.

In [None]:
# your code goes here

In [None]:
# your code goes here

### Parameter selection

#### Task 8: Bandwidth comparison.
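
As a warm-up for this comparison, the following sketch shows how the bandwidth changes the Gaussian kernel from Task 1. The toy data, the chosen `sigma` values, and the helper `gaussian_kernel` are assumptions for illustration. For a very small bandwidth the kernel is close to the identity matrix (every cell is only similar to itself), and for a very large bandwidth all entries approach 1 (all cells look alike).

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))

# toy data standing in for the real data set (assumption)
rng = np.random.default_rng(0)
toy = rng.normal(size=(30, 4))
n = toy.shape[0]

# average similarity between distinct cells for each bandwidth
mean_offdiag = {}
for sigma in [0.01, 1.0, 100.0]:
    K = gaussian_kernel(toy, sigma)
    mean_offdiag[sigma] = (K.sum() - np.trace(K)) / (n * (n - 1))
```

For the actual task, the same loop over `sigma` can create one `DiffMap` embedding per bandwidth and plot them side by side.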

In [None]:
# your code goes here

#### Task 9: Implement the rule for $\sigma$ and plot the embedding with the $\sigma$ chosen by this rule.

In [None]:
# your code goes here

In [None]:
# your code goes here

### Cell group detection

Now, we want to apply spectral clustering to detect cell groups in the single-cell data.

#### Task 10: Implement the spectral clustering algorithm using k-means with $\Lambda$ as input.
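
A minimal sketch of one way to structure this, under several assumptions: $\Lambda$ is taken to be both the number of leading eigenvectors and the number of clusters (parameter `n_clusters`), the transition matrix is built from a plain Gaussian kernel without density rescaling, and a small deterministic k-means (farthest-point initialization plus Lloyd iterations) stands in for a library implementation. All function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def transition_matrix(X, sigma):
    """Row-normalized Gaussian kernel (no density rescaling in this sketch)."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))
    return K / K.sum(axis=1, keepdims=True)

def simple_kmeans(points, k, n_iter=50):
    """Basic k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min(cdist(points, np.array(centers)), axis=1)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign each point to its nearest center
        assign = np.argmin(cdist(points, centers), axis=1)
        # recompute centers, keeping the old one if a cluster is empty
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign

def spectral_clustering(X, n_clusters, sigma=1.0):
    """Cluster the rows of X using the leading eigenvectors of P."""
    P = transition_matrix(X, sigma)
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)
    V = evecs[:, order[:n_clusters]].real
    return simple_kmeans(V, n_clusters)

# two toy blobs standing in for real cell groups (assumption)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=3.0, scale=0.3, size=(20, 2))
toy = np.vstack([blob_a, blob_b])
clusters = spectral_clustering(toy, n_clusters=2, sigma=1.0)
```

In practice a library k-means (for example from scikit-learn) would usually replace `simple_kmeans`; the hand-written version only keeps the sketch free of extra dependencies.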

In [None]:
# your code goes here

#### Task 11: Plot the first 20 eigenvalues of the transition matrix $P$ for the Guo data and identify $\Lambda$.
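
The following sketch computes the leading eigenvalues of a transition matrix and looks for the largest gap between consecutive eigenvalues. Reading $\Lambda$ off the largest spectral gap is an assumption here, as are the plain Gaussian kernel, the toy data, and all function names. The chapter may define the criterion differently.

```python
import numpy as np
from scipy.spatial.distance import cdist

def leading_eigenvalues(X, sigma, n_values=20):
    """Largest n_values eigenvalues of the row-normalized Gaussian kernel."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))
    P = K / K.sum(axis=1, keepdims=True)
    evals = np.linalg.eigvals(P).real
    return np.sort(evals)[::-1][:n_values]

def largest_gap_index(evals):
    """Number of eigenvalues that lie before the largest drop."""
    gaps = evals[:-1] - evals[1:]
    return int(np.argmax(gaps)) + 1

def plot_spectrum(evals):
    """Plot eigenvalues against their index (needs matplotlib)."""
    import matplotlib.pyplot as plt
    plt.plot(np.arange(1, len(evals) + 1), evals, 'o-')
    plt.xlabel('index')
    plt.ylabel('eigenvalue')
    plt.show()

# three toy blobs standing in for real cell groups (assumption)
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(15, 2)) for c in blob_centers])

evals = leading_eigenvalues(toy, sigma=1.0)
n_groups = largest_gap_index(evals)
```

Calling `plot_spectrum(evals)` shows the spectrum. For well-separated toy groups like these, the largest drop is expected right after the eigenvalues that are close to 1, one per group.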

In [None]:
# your code goes here

#### Task 12: Perform the spectral clustering algorithm for the Guo data.

In [None]:
# your code goes here