# Import the necessary packages

In [1]:
from __future__ import division,print_function #true division and the print function (must come first)
import numpy as np #numpy for matrix/array manipulations
import pandas as pd #pandas for dataframe handling
import h5py #reading and writing .hdf5 files
from sklearn.utils import shuffle #row shuffling for random sampling
import nvr #NVR preprocessing functions

# 0. Load the full or preprocessed dataset

Since we want the same dataset to go into both algorithms, all preprocessing should be done here, except for normalization and transformation, because monocle and dpFeature have their own normalization functions. See the NVR tutorial for details about this preprocessing.

In [2]:
s1counts=pd.read_csv("s1_countsRaw.csv",header=None)
s1countsArr=np.asarray(s1counts)
hqGenes=nvr.parseNoise(s1countsArr)
s1countsArrHq=nvr.mkIndexedArr(s1countsArr,hqGenes)
s1countsArrHq.shape

(1597L, 13730L)

Convert the dataset into a pandas DataFrame so that it is compatible with the shuffle function from sklearn.utils.

In [3]:
s1hqdf=pd.DataFrame(s1countsArrHq)

# 1. Randomly sample the source dataset

The following functions generate random samples of cells from the source dataset. They do so by shuffling the rows (cells, in this case) and selecting the first n cells, where n is set by the partition number. The partition number determines how many evenly spaced sample sizes are generated, based on the total number of cells. The dataset is the one we loaded in step 0. These functions are also available in the nvr package.

In [4]:
def subsample(partitions,dataset,seed):
    nCells=dataset.shape[0]
    #evenly spaced sample sizes, e.g. 10%, 20%, ..., 90% of the cells for partitions=10
    parts=np.arange(nCells/partitions,nCells,nCells/partitions).astype(int)
    subOut={}
    for n in parts:
        #shuffle the rows (cells) and keep the first n of them
        subOut["{0}cells".format(n)]=np.asarray(shuffle(dataset,random_state=seed))[0:n,:]
    return subOut
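To make the partition arithmetic concrete, here is a minimal sketch that computes only the sample sizes, using the same `np.arange` expression as `subsample`. The helper name `partition_sizes` and the toy cell count of 100 are assumptions made for illustration; they are not part of the nvr package.

```python
from __future__ import division

import numpy as np


def partition_sizes(n_cells, partitions):
    # Hypothetical helper: evenly spaced sample sizes, truncated to integers.
    # The full dataset size itself is excluded because arange stops before n_cells.
    step = n_cells / partitions
    return np.arange(step, n_cells, step).astype(int)


sizes = partition_sizes(100, 10)
print(sizes)
```

When the cell count is not a multiple of the partition number, the step is fractional and each size is truncated to an integer, so the sizes are only approximately evenly spaced.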

The following function simply wraps the subsample function and returns a dictionary with one entry per replicate, each holding the dictionary of samples produced by subsample.

In [5]:
def subsampleReplicates(repNumber,partitions,dataset,seed):
    repOut={}
    for i in range(repNumber):
        repOut["replicate{0}".format(i)]=subsample(partitions,dataset,seed)
    return repOut

This example makes 3 replicate samplings with 10 partitions, meaning it generates random cell samplings representing 10%, 20%, 30%, up to 90% of the source dataset. The pseudorandom seed is set to None so that the replicates actually differ; with a fixed integer seed, every replicate would receive the same shuffle.

In [13]:
sampledDataset=subsampleReplicates(3,10,s1hqdf,None)
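If reproducible replicates are needed, one possible alternative to passing None is to derive a distinct seed for each replicate from a single base seed. The sketch below illustrates the idea with plain NumPy row permutations; the helper `replicate_permutations` and the `base_seed + i` scheme are assumptions for illustration, not how the nvr package or the functions above work.

```python
import numpy as np


def replicate_permutations(rep_number, n_cells, base_seed):
    # Hypothetical helper: one row permutation per replicate, seeded with
    # base_seed + replicate index so replicates differ but are repeatable.
    perms = {}
    for i in range(rep_number):
        rng = np.random.RandomState(base_seed + i)
        perms["replicate{0}".format(i)] = rng.permutation(n_cells)
    return perms


first = replicate_permutations(3, 20, base_seed=42)
second = replicate_permutations(3, 20, base_seed=42)

# Same base seed gives the same replicate; different replicates get different orders.
print(np.array_equal(first["replicate0"], second["replicate0"]))
print(np.array_equal(first["replicate0"], first["replicate1"]))
```

Taking the first n indices of each permutation would then select the cells for a sample of size n.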

# 2. Write the random samples to file

The next few steps write the data held in the Python dictionaries to .hdf5 files for use in downstream analyses. R also supports this file type through the h5 library.

In [31]:
def dictToFile(dictionary,replicateKey,outFileName):
    #write each sample of one replicate as a gzip-compressed dataset named after its key
    #(iterating over items() also works in Python 3, where keys() cannot be indexed)
    with h5py.File(outFileName,"w") as replicateToFile:
        for name,data in dictionary[replicateKey].items():
            replicateToFile.create_dataset("{}".format(name),data=data,compression="gzip")

Here we do a quick check of the keys used by our dictionary.

In [8]:
sampledDataset.keys()

['replicate1', 'replicate0', 'replicate2']

Each key is passed as the replicateKey argument of the dictToFile function, which we call once per replicate.

In [9]:
dictToFile(sampledDataset,'replicate0',"rep0-10%parts-s1hq.hdf5")
dictToFile(sampledDataset,'replicate1',"rep1-10%parts-s1hq.hdf5")
dictToFile(sampledDataset,'replicate2',"rep2-10%parts-s1hq.hdf5")

Three files, one per replicate sampling, should now be written to disk, ready to be read into another Python notebook or an R environment.
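As a rough sketch of the read-back step in Python, the example below writes a small stand-in replicate to a temporary .hdf5 file and then loads every dataset into a dictionary of NumPy arrays. The toy data, the dataset names, and the temporary file path are assumptions for illustration; with the real output, the path would be one of the files written above.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for one replicate: two small random samples.
rng = np.random.RandomState(0)
replicate = {"2cells": rng.rand(2, 5), "4cells": rng.rand(4, 5)}

path = os.path.join(tempfile.mkdtemp(), "example-rep.hdf5")

# Write each sample as a gzip-compressed dataset, mirroring dictToFile.
with h5py.File(path, "w") as f:
    for name, data in replicate.items():
        f.create_dataset(name, data=data, compression="gzip")

# Read every dataset back into memory as a NumPy array.
with h5py.File(path, "r") as f:
    restored = {name: f[name][()] for name in f.keys()}

for name in sorted(restored):
    print(name, restored[name].shape)
```

Indexing a dataset with `[()]` copies its full contents into memory, which is convenient for samples of this size; larger datasets can be sliced instead.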