# Data structure

We introduce the central data structures used in this package.

The data have the following shapes:

- genotype (``n_indiv``, ``n_snp``, ``n_ploidy``)
- local ancestry (``n_indiv``, ``n_snp``, ``n_ploidy``)
- information about SNPs (``n_snp``, ``n_snp_feature``)
- information about individuals (``n_indiv``, ``n_indiv_feature``)

We use ``xarray`` to handle all of these data in a unified way: they are stored
together in a single ``xarray.Dataset``. An example dataset is illustrated below.

Feature names of SNPs and individuals must not overlap, because both are stored as coordinates of the same dataset.
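As a rough illustration of this layout, the sketch below builds a small ``xarray.Dataset`` by hand with plain `numpy` and `xarray`. It is not how `admix` constructs its datasets. The dimension name `ploidy`, the axis order (taken from the `(1000, 100, 2)` shape printed for the simulated dataset later in this section), and the toy feature values are assumptions made for the example.

```python
import numpy as np
import xarray as xr

n_indiv, n_snp, n_ploidy = 5, 4, 2
rng = np.random.default_rng(0)
dims = ("indiv", "snp", "ploidy")  # "ploidy" is an assumed dimension name

ds = xr.Dataset(
    data_vars={
        # one 0/1 allele per haplotype
        "geno": (dims, rng.integers(0, 2, size=(n_indiv, n_snp, n_ploidy))),
        # one ancestry label (0 or 1) per haplotype
        "lanc": (dims, rng.integers(0, 2, size=(n_indiv, n_snp, n_ploidy))),
    },
    coords={
        "indiv": np.arange(n_indiv),
        "snp": np.arange(n_snp),
        # a per-individual feature
        "height": ("indiv", rng.normal(size=n_indiv)),
        # a per-SNP feature
        "CHROM": ("snp", np.ones(n_snp, dtype=int)),
    },
)

# Collect feature names attached to each dimension (excluding the index itself).
indiv_features = {
    name for name, c in ds.coords.items() if c.dims == ("indiv",) and name != "indiv"
}
snp_features = {
    name for name, c in ds.coords.items() if c.dims == ("snp",) and name != "snp"
}

# All coordinates share one namespace, so the two sets must be disjoint.
assert indiv_features.isdisjoint(snp_features)
```

Since every coordinate of a dataset is looked up by name alone, assigning an individual-level feature under a name already used for a SNP-level feature would overwrite it rather than add a second entry.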

In [1]:
import admix
import numpy as np

In [2]:
dset = admix.simulate.admix_geno(n_indiv=1000, n_snp=100, n_anc=2, mosaic_size=20)

In `admix`, we use the following keywords consistently to refer to features of data:
- `CHROM`: chromosomes
- `POS`: positions
- `REF`: reference allele
- `ALT`: alternative allele
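To show how such keyword features might be used, here is a minimal sketch on a hand-built ``xarray.Dataset`` (no `admix` call is involved). The toy values and variable names are illustrative assumptions; only the keyword names and the `snp` dimension come from this section.

```python
import numpy as np
import xarray as xr

# Toy SNP annotations stored as coordinates along the "snp" dimension.
ds = xr.Dataset(
    coords={
        "snp": np.arange(6),
        "CHROM": ("snp", np.array([1, 1, 1, 2, 2, 2])),
        "POS": ("snp", np.array([100, 250, 900, 50, 400, 800])),
        "REF": ("snp", np.array(["A", "C", "G", "T", "A", "C"])),
        "ALT": ("snp", np.array(["G", "T", "A", "C", "G", "T"])),
    }
)

# Boolean mask: SNPs on chromosome 1 with position in [200, 1000).
mask = (ds["CHROM"] == 1) & (ds["POS"] >= 200) & (ds["POS"] < 1000)

# Keep only the SNPs where the mask is True.
subset = ds.isel(snp=mask.values)
print(subset["POS"].values)
```

Using the same keyword names everywhere means a filter like this can be written once and reused across datasets.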

In [3]:
# assign a coordinate along a dimension: dset.coords[name] = (dim, values)
dset.coords["height"] = ("indiv", np.random.normal(size=dset.sizes["indiv"]))
# allele frequency per SNP: average over individuals (axis 0) and haplotypes (axis 2)
dset.coords["MAF"] = ("snp", dset.geno.mean(axis=(0, 2)).values)
dset.coords["CHROM"] = ("snp", np.repeat(1, dset.sizes["snp"]))

# obtain a single coordinate (names are case-sensitive)
display(dset.snp["MAF"][0:10].values)
display(dset.indiv["height"][0:10].values)

# obtain multiple coordinates as a pandas dataframe
# (use .to_dask_dataframe() instead to get a dask dataframe)
display(dset[["MAF", "CHROM"]].to_dataframe())

array([0.2655, 0.223 , 0.3435, 0.2455, 0.2175, 0.26  , 0.3605, 0.216 ,
       0.1895, 0.407 ])

array([-1.1002063 , -0.53840061, -0.78972405, -0.78715722,  1.28078157,
        0.5939903 , -1.15308617,  0.80259499, -0.9591995 ,  0.95856666])

| snp | MAF | CHROM |
| --- | --- | --- |
| 0 | 0.2655 | 1 |
| 1 | 0.2230 | 1 |
| 2 | 0.3435 | 1 |
| 3 | 0.2455 | 1 |
| 4 | 0.2175 | 1 |
| ... | ... | ... |
| 95 | 0.2175 | 1 |
| 96 | 0.3370 | 1 |
| 97 | 0.2555 | 1 |
| 98 | 0.3660 | 1 |

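The `MAF` coordinate above is obtained by averaging the genotype array over individuals and haplotypes. Assuming genotypes are coded 0/1 per haplotype, that average is the frequency of the allele coded as 1, which is not necessarily the rarer one. The `numpy` sketch below reproduces the reduction on a toy array and adds an optional folding step; the folding step is an illustration and is not taken from `admix`.

```python
import numpy as np

# Toy phased genotypes with shape (n_indiv, n_snp, n_ploidy), coded 0/1.
geno = np.array(
    [
        [[0, 1], [1, 1], [0, 0]],
        [[0, 0], [1, 1], [0, 1]],
        [[1, 0], [1, 0], [0, 0]],
        [[0, 0], [1, 1], [0, 0]],
    ]
)

# Average over individuals (axis 0) and haplotypes (axis 2): one value per SNP.
freq = geno.mean(axis=(0, 2))

# Fold so that each value is the frequency of the rarer allele (at most 0.5).
maf = np.minimum(freq, 1 - freq)

print(freq)
print(maf)
```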

In [4]:
# obtain genotype / local ancestry
# directly use dset.geno / dset.lanc
display(dset.geno)
display(dset.lanc)

`dset.geno`:

| | Array | Chunk |
| --- | --- | --- |
| Bytes | 1.60 MB | 1.60 MB |
| Shape | (1000, 100, 2) | (1000, 100, 2) |
| Count | 1 Tasks | 1 Chunks |
| Type | int64 | numpy.ndarray |

`dset.lanc`:

| | Array | Chunk |
| --- | --- | --- |
| Bytes | 1.60 MB | 1.60 MB |
| Shape | (1000, 100, 2) | (1000, 100, 2) |
| Count | 1 Tasks | 1 Chunks |
| Type | int64 | numpy.ndarray |
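One common summary of a local ancestry array is each individual's overall ancestry proportion. The sketch below computes it with plain `numpy` on a toy array laid out like the arrays shown above, assuming ancestries are labelled `0` and `1`. It is an illustration and not an `admix` function; for a dask-backed array such as `dset.lanc`, the values would first be materialised, for example with `.values`.

```python
import numpy as np

# Toy local ancestry with shape (n_indiv, n_snp, n_ploidy);
# each entry is the ancestry label (0 or 1) of one haplotype at one SNP.
lanc = np.array(
    [
        [[0, 0], [0, 0], [0, 1]],
        [[1, 1], [1, 1], [1, 1]],
        [[0, 1], [0, 1], [1, 1]],
    ]
)
n_anc = 2

# For each ancestry k, the fraction of an individual's haplotype positions
# labelled k. The result has shape (n_indiv, n_anc) and each row sums to 1.
props = np.stack([(lanc == k).mean(axis=(1, 2)) for k in range(n_anc)], axis=1)

print(props)
```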
