In [12]:
import scipy
import allel
import msprime
import dask

In [14]:
ts = msprime.load('/Users/dnelson/project/pedigree_msp/results/Luke/chrom1_4134samples_Ne10000.h5')

In [15]:
ts.num_samples

8268

**Alternate strategy**

The idea here is to avoid writing the simulated tree sequence to VCF for PCA. Can we do the PCA with scikit-allel? Might have problems with memory if we need to hold the entire genotype matrix at once, but *maybe* that can be addressed using dask or some sort of transparent storage-backed memory.

**Possibly useful steps:**

- `ts.genotype_matrix()` corresponds to `allel.HaplotypeArray`. To get an `allel.GenotypeArray` it seems we need to add an extra dimension, i.e. one entry for each genome copy from the same individual. Grouping the sample nodes by individual should take care of this.
- Can we iteratively build a dask array (chunked, or storage-backed) by iterating through tree sequence genotypes, to avoid ever loading the whole genotype matrix into memory?
- Can this dask array be used for PCA or does the PCA algorithm need everything strictly loaded into memory, say in a numpy array?
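One possible answer to the last two questions, sketched with plain numpy rather than dask: PCA over samples only needs the samples-by-samples Gram matrix, and that matrix can be accumulated one chunk of variants at a time, so the full genotype matrix never has to be in memory. The helper names (`iter_variant_chunks`, `chunked_pca`) and the chunk size are made up for illustration, and this is not how scikit-allel implements its PCA. The Gram matrix itself still grows with the square of the number of samples.

```python
import numpy as np


def iter_variant_chunks(variant_rows, chunk_size):
    """Group an iterable of per-variant genotype rows into 2D chunks."""
    buf = []
    for row in variant_rows:
        # Copy each row, in case the source reuses one buffer between iterations.
        buf.append(np.array(row, dtype=np.float64))
        if len(buf) == chunk_size:
            yield np.vstack(buf)
            buf = []
    if buf:
        yield np.vstack(buf)


def chunked_pca(variant_rows, n_samples, n_components=2, chunk_size=1000):
    """PCA over samples, accumulating the Gram matrix chunk by chunk.

    Returns (coords, eigenvalues): coords has shape (n_samples, n_components).
    """
    gram = np.zeros((n_samples, n_samples))
    for chunk in iter_variant_chunks(variant_rows, chunk_size):
        # Centre each variant (row) on its mean across samples.
        chunk -= chunk.mean(axis=1, keepdims=True)
        gram += chunk.T @ chunk
    evals, evecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]
    top = np.clip(evals[order], 0.0, None)
    # Scale eigenvectors by sqrt(eigenvalue) to get sample coordinates.
    return evecs[:, order] * np.sqrt(top), evals[order]


# Hypothetical usage with a tree sequence `ts` (one row per variant):
# rows = (v.genotypes for v in ts.variants())
# coords, evals = chunked_pca(rows, n_samples=ts.num_samples)
```

This works on haplotypes (one column per sample node). For a per-individual PCA, the rows would first need to be collapsed to alternate-allele counts per individual.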

In [5]:
ts = msprime.simulate(10, mutation_rate=1e-8, length=1e8, Ne=100)

In [8]:
ts.genotype_matrix().shape

(1655, 10)

In [13]:
gt = allel.GenotypeArray(ts.genotype_matrix())

TypeError: bad number of dimensions: expected 3; found 2
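The error is expected: `GenotypeArray` wants shape (variants, individuals, ploidy), while `ts.genotype_matrix()` is (variants, sample nodes). A minimal numpy sketch of the reshape follows. It assumes that consecutive sample nodes (0 and 1, 2 and 3, ...) belong to the same diploid individual, which should be checked against how the samples were actually generated. The function name is hypothetical. I believe scikit-allel's `HaplotypeArray.to_genotypes(ploidy=2)` does the equivalent, but I have not verified that here.

```python
import numpy as np


def haplotypes_to_genotypes(haplotypes, ploidy=2):
    """Reshape (n_variants, n_haplotypes) to (n_variants, n_individuals, ploidy).

    Assumes columns are ordered so that each run of `ploidy` consecutive
    columns holds the genome copies of one individual.
    """
    haplotypes = np.asarray(haplotypes)
    if haplotypes.ndim != 2:
        raise ValueError("expected a 2D haplotype matrix")
    n_variants, n_haplotypes = haplotypes.shape
    if n_haplotypes % ploidy != 0:
        raise ValueError("number of haplotypes is not a multiple of ploidy")
    return haplotypes.reshape(n_variants, n_haplotypes // ploidy, ploidy)


# Hypothetical usage with the simulated tree sequence above:
# g = haplotypes_to_genotypes(ts.genotype_matrix())  # (1655, 10) -> (1655, 5, 2)
# gt = allel.GenotypeArray(g)
```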

In [16]:
import itertools

In [19]:
list(itertools.combinations(range(4), 2))

[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]