# Reading input data and building the SFS

Now that our models are set up, we are ready to read our input data and convert it into site frequency spectra (SFS).

In [1]:
from delimitpy import parse_input
from delimitpy import process_empirical
import numpy as np
import os
import pickle

# Parse configuration files and read in intermediate files from previous part of tutorial.

We need to read our config file into memory again, and we need to read the files that were created in the 'Building a Model Set' notebook.

The test data we are using was simulated under a model with three populations and no gene flow. 

In [2]:
# read the configuration file
config_parser = parse_input.ModelConfigParser("../../examples/test1/config.txt")
config_values = config_parser.parse_config()

# read the labels and parameterized files
labels = np.load(os.path.join(config_values["output directory"], 'labels.npy'), allow_pickle=True)
with open(os.path.join(config_values["output directory"], 'parameterized_models.pickle'), 'rb') as f:
    parameterized_models = pickle.load(f)


# Read empirical data, and convert it into a numpy array.

First, we create our data processor. Then, we convert our folder of FASTA files into a numpy array. We keep the same number of sites that we kept for the simulated data. Missing data is encoded as -1: any character in the alignment other than A, T, C, or G is converted to -1.

In [3]:
# create our data processor, and convert our fastas to a numpy array
data_processor = process_empirical.DataProcessor(parameterized_models, config=config_values)
empirical_array = data_processor.fasta_to_numpy()
print(empirical_array.shape) # print the shape of our array: (individuals, SNPs).

(30, 1038)
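
To make the missing-data convention concrete, here is a minimal sketch of how an alignment could be turned into an integer array with -1 for anything other than A, T, C, or G. The function `encode_sequences` and the `BASE_CODES` mapping are hypothetical and chosen only for illustration; the integer coding that `fasta_to_numpy` uses internally may differ.

```python
import numpy as np

# Hypothetical coding, for illustration only; delimitpy's internal coding may differ.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sequences(sequences):
    """Convert equal-length sequences into an (individuals, sites) integer array.

    Any character other than A, C, G, or T (gaps, N, ambiguity codes) becomes -1.
    """
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    return np.array(
        [[BASE_CODES.get(base, -1) for base in seq.upper()] for seq in sequences],
        dtype=np.int8,
    )

# toy alignment: three individuals, five sites, with one gap and one N
toy_alignment = ["ACGT-", "ACNTA", "ATGTA"]
toy_array = encode_sequences(toy_alignment)
print(toy_array.shape)
```

The toy array has one row per individual and one column per site, matching the (individuals, SNPs) layout printed above.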


# Choose projection for site-frequency-spectrum

An SFS cannot be generated from a dataset that includes missing data. To circumvent this, we use a downsampling approach such as the one described in Satler and Carstens (2017, Molecular Ecology, doi: 10.1111/mec.14137). We must choose a threshold for each population (i.e., the minimum number of individuals that must be sampled for a SNP to be used). To help with this, we use the function find_downsampling from the class DataProcessor. This function generates a dictionary that holds the number of SNPs that meet each threshold.

In [None]:
# generate dictionary with the number of SNPs at different sampling thresholds
empirical_2d_sfs_sampling = data_processor.find_downsampling(empirical_array)

# keep only the threshold combinations that retain at least min_snps SNPs
min_snps = 1000
min_1000 = {key: value for key, value in empirical_2d_sfs_sampling.items() if value >= min_snps}
print(min_1000)
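
The idea behind the threshold dictionary can be sketched on a toy array. The example below counts, for every combination of per-population thresholds, how many SNPs have at least that many non-missing samples in each population. The function `count_snps_per_threshold`, the `pop_rows` mapping, and the tuple-valued keys are assumptions made for illustration; they are not the implementation or the key format of `find_downsampling`.

```python
from itertools import product

import numpy as np

def count_snps_per_threshold(array, pop_rows):
    """Count SNPs that meet each combination of per-population thresholds.

    array    : (individuals, SNPs) integer array with -1 for missing data
    pop_rows : dict mapping population name -> list of row indices (hypothetical)
    Returns a dict keyed by a tuple of thresholds, in sorted population order.
    """
    pops = sorted(pop_rows)
    # number of non-missing samples per population at each SNP
    present = {pop: (array[pop_rows[pop]] != -1).sum(axis=0) for pop in pops}
    ranges = [range(1, len(pop_rows[pop]) + 1) for pop in pops]
    counts = {}
    for thresholds in product(*ranges):
        keep = np.ones(array.shape[1], dtype=bool)
        for pop, k in zip(pops, thresholds):
            keep &= present[pop] >= k
        counts[thresholds] = int(keep.sum())
    return counts

# toy data: two populations with two samples each, three SNPs
toy = np.array([
    [0, 1, -1],
    [0, -1, -1],
    [1, 0, 0],
    [-1, 0, 1],
])
toy_counts = count_snps_per_threshold(toy, {"A": [0, 1], "B": [2, 3]})
print(toy_counts)
```

Higher thresholds keep more samples per SNP but fewer SNPs, which is the trade-off to weigh when choosing the projection.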


In [4]:
# build the 2D SFS and the multidimensional SFS using the chosen thresholds for each population
empirical_2d_sfs = data_processor.numpy_to_2d_sfs(empirical_array, downsampling={"A": 6, "B": 4, "C": 5}, replicates=10)
empirical_msfs = data_processor.numpy_to_msfs(empirical_array, downsampling={"A": 6, "B": 4, "C": 5}, replicates=10)
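
As a rough illustration of what downsampling with replicates can look like, the sketch below builds a joint SFS by randomly drawing the requested number of non-missing samples per population at each SNP and skipping SNPs that fall below a threshold. It assumes alleles are coded 0 (ancestral) and 1 (derived) with -1 for missing data; the function `downsampled_joint_sfs` and the `pop_rows` mapping are hypothetical, and this is not necessarily how `numpy_to_2d_sfs` or `numpy_to_msfs` work internally.

```python
import numpy as np

def downsampled_joint_sfs(array, pop_rows, downsampling, rng):
    """Build one joint SFS replicate by subsampling non-missing alleles.

    array        : (individuals, SNPs) array coded 0/1, with -1 for missing (assumed)
    pop_rows     : dict mapping population name -> list of row indices (hypothetical)
    downsampling : dict mapping population name -> number of samples to draw
    rng          : numpy random Generator used for the subsampling
    """
    pops = sorted(downsampling)
    sfs = np.zeros([downsampling[pop] + 1 for pop in pops], dtype=int)
    for snp in array.T:
        counts = []
        for pop in pops:
            alleles = snp[pop_rows[pop]]
            alleles = alleles[alleles != -1]
            if alleles.size < downsampling[pop]:
                break  # this SNP does not meet the threshold; skip it
            chosen = rng.choice(alleles, size=downsampling[pop], replace=False)
            counts.append(int(chosen.sum()))  # derived-allele count in the subsample
        else:
            sfs[tuple(counts)] += 1
    return sfs

# toy data: two populations with two samples each, three SNPs
toy = np.array([
    [0, 1, -1],
    [0, -1, -1],
    [1, 0, 0],
    [-1, 0, 1],
])
rows = {"A": [0, 1], "B": [2, 3]}
thresholds = {"A": 1, "B": 2}

# each replicate uses a different random subsample
replicate_sfs = [
    downsampled_joint_sfs(toy, rows, thresholds, np.random.default_rng(seed))
    for seed in range(10)
]
print(replicate_sfs[0].shape)
```

Because the subsample is random, each replicate can differ slightly, which is why several replicates are generated rather than a single SFS.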


