<a id='home'></a>

### purpose 

explore output of SLiM; apply code from CoAdapTree for training Gradient Forests

- create files for 1) individual genotypes, and 2) population-level allele frequencies (like pool-seq)

CoAdapTree scripts used to train Gradient Forests can be found here: "Lind et al. How off are genetic offsets: Lessons from common gardens and three clades of conifers"

### outline
1. [create SNP infile for Gradient Forest script](#snps)
    1. [get individual data](#indi)
        1. get a list of the 1000 individuals sampled from the landscape of sims

    1. [load locus info](#locus)
        1. needed to label loci in VCF

    1. [create 012 data](#012)
        1. convert genotypes from VCF to counts of the minor allele - individual training
        1. add MAF and AF columns
        1. save to `filedir`
    1. [create population-level MAF data](#popfreqs)
        1. calculate allele frequencies for each pop for the globally minor allele - pool-seq training
        1. save to `filedir`
1. [create range file for GF script and save](#rangefile)
1. [create environmental file for GF script and save](#envfile)
1. [subset locus sets](#locus_sets)
    1. in addition to using all loci, identify the subset that were under selection
1. [train Gradient Forest](#train)
    1. submit training script to slurm

In [1]:
from pythonimports import *

# directories
DIR = '/work/lotterhos/MVP-Offsets/practice_slim'

slimdir = op.join(DIR, 'mypractice')

training_dir = makedir(op.join(DIR, 'training'))
filedir = makedir(op.join(training_dir, 'gradient_forests/training_files'))
shdir = makedir(op.join(training_dir, 'gradient_forests/training_shfiles'))
outdir = makedir(op.join(training_dir, 'gradient_forests/training_outfiles'))

# notebook timer
t1 = dt.now()

# engines
lview,dview = get_client()

# coding info
latest_commit()
session_info.show()

55 55
##################################################################
Current commit of pythonimports:
commit 357f2a9069d9ca25062146953c9bf88b70e863c0  
Author: Brandon Lind <lind.brandon.m@gmail.com>  
Date:   Thu Feb 3 10:42:21 2022 -0500
Today:	February 11, 2022 - 16:06:00
python version: 3.8.5
##################################################################



In [2]:
# for looking up file names
# SNP files are of the form dict[ind_or_pooled][all_or_adaptive]
gf_snp_files = defaultdict(dict)

# range and env files are of the form dict[ind_or_pooled]
gf_range_files = defaultdict(dict)

gf_env_files = defaultdict(dict)

<a id='snps'></a>

# 1. create SNP file for Gradient Forest script

<a id='indi'></a> 
### 1.1 get individuals

get the subsampled individuals

[top](#home)

In [3]:
seed = '1231094'

In [4]:
# get the 1000 individuals sampled from the 100 pops (10 inds/pop)
subset = pd.read_table(op.join(slimdir, f'{seed}_Rout_ind_subset.txt'), delim_whitespace=True)
subset.index = ('i' + subset['indID'].astype(str)).tolist()  # this will match to the 'causal' file
subset['sample_name'] = subset.index.tolist()

print(nrow(subset))
subset.head()

1000


Unnamed: 0,seed,subpopID,indID,indSubpopIndex,subpop,phen_sal,phen_temp,sal_opt,temp_opt,fitness,subset,N,opt0,opt1,x,y,PC1,PC2,PC3,LFMM_U1_temp,LFMM_U1_sal,LFMM_U2_temp,LFMM_U2_sal,RDA1,RDA2,RDA_PC1,RDA_PC2,RDA_predict_tempPhen_20KSNPs,RDA_predict_salPhen_20KSNPs,sample_name
i0,1231094,1,0,0,1,0,-1.08809,-1.0,-1.0,0.984601,True,10,-1.0,-1.0,1,1,92.2154,-57.5673,-42.498,-38.876285,72.935933,14.058057,40.216243,-2.464484,-4.413873,-2.401906,-2.096819,-5924.695082,-64.166353,i0
i1,1231094,1,1,1,1,0,-0.771927,-1.0,-1.0,0.901194,True,10,-1.0,-1.0,1,1,92.5184,-58.2218,-44.4978,-39.042619,73.119916,15.385371,40.384374,-2.472104,-3.013454,-2.412105,-2.206939,-5943.060707,-49.737004,i1
i2,1231094,1,2,2,1,0,-0.883133,-1.0,-1.0,0.973054,True,10,-1.0,-1.0,1,1,92.7162,-59.2298,-43.8789,-40.110397,73.313012,14.81441,41.455827,-2.478389,-3.995747,-2.476676,-2.159522,-5958.136915,-59.945481,i2
i3,1231094,1,3,3,1,0,-0.836002,-1.0,-1.0,0.94763,True,10,-1.0,-1.0,1,1,90.1391,-54.8804,-40.6813,-36.756808,71.370522,13.047852,38.074237,-2.409776,-2.373063,-2.273755,-2.013461,-5793.238241,-42.644448,i3
i4,1231094,1,4,4,1,0,-1.07876,-1.0,-1.0,0.98767,True,10,-1.0,-1.0,1,1,91.8586,-56.658,-42.909,-38.019264,72.297632,14.507998,39.346027,-2.443488,-4.116506,-2.350035,-2.134117,-5874.228034,-60.932578,i4


<a id='locus'></a>

### 1.2 load locus info

[top](#home)

In [5]:
def update_locus(locus, locus_list):
    """Since sims can simulate mutations at the same 'locus', create a new name for duplicate locus names."""
    matches = []
    for name in locus_list:
        prefix,*suffix = name.split('_')  # only my updated locus names will have an underscore
        if prefix==locus:
            matches.append(name)

    if len(matches) > 0:
        # update locus name if there are duplicates
        locus = f'{locus}_{len(matches)+1}'

    return locus

# example
_list = ['one', 'two', 'one', 'three', 'four', 'two', 'one']  # note duplicates
_found = []
for _ in _list: 
    __ = update_locus(_, _found)
    print(_, __)
    _found.append(__)
    
_found

one one
two two
one one_2
three three
four four
two two_2
one one_3


['one', 'two', 'one_2', 'three', 'four', 'two_2', 'one_3']

In [6]:
def read_muts_file(muts_file):
    """Read in the seed_Rout_muts_full.txt file, convert `VCFrow` to 0-based, name loci."""
    # make sure it's the right file
    assert op.basename(muts_file).endswith('_Rout_muts_full.txt')

    # read in the table
    muts = pd.read_table(muts_file, delim_whitespace=True)

    # convert to 0-based indexing for python
    assert 0 not in muts['VCFrow'].tolist()
    muts['VCFrow'] = muts['VCFrow'] - 1  # convert to 0-based python

    # update locus names
    found = []
    for row in muts.index:
        locus = 'LG' + \
                muts.loc[row, 'LG'].astype(str) + \
                '-' + \
                muts.loc[row, 'pos_pyslim'].astype(str)
        if locus in found:
            locus = update_locus(locus, found)
        found.append(locus)

    # update index with locus names
    muts.index = found

    # make sure no duplicate locus names remain
    assert luni(muts.index) == nrow(muts)

    return muts
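
The loop in `read_muts_file` renames duplicated loci one row at a time. As a rough cross-check, the same naming scheme (`_2`, `_3`, ... for repeated names) can be sketched with vectorized pandas operations. This is an illustrative alternative, not the code used in this notebook, and the toy `LG` and `pos_pyslim` values are invented.

```python
import pandas as pd

# toy stand-in for the muts table (values are invented for illustration)
toy = pd.DataFrame({'LG': [1, 1, 1, 2, 1],
                    'pos_pyslim': [437, 437, 50, 437, 437]})

# base name, e.g. LG1-437
base = 'LG' + toy['LG'].astype(str) + '-' + toy['pos_pyslim'].astype(str)

# 0 for the first occurrence of a name, 1 for the second, and so on
occurrence = base.groupby(base).cumcount()

# keep first occurrences as-is; append _2, _3, ... to later ones
names = base.where(occurrence == 0, base + '_' + (occurrence + 1).astype(str))

print(names.tolist())
```

Because `cumcount` numbers repeats within each name, this avoids rescanning the list of names already seen for every row.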

In [7]:
# read in locus data
mut_file = op.join(slimdir, f'{seed}_Rout_muts_full.txt')  # map of VCF index to locus name
muts = read_muts_file(mut_file)

muts.head()


Unnamed: 0,mutID,seed,VCFrow,pos_pyslim,a_freq_full,a_freq_subset,muttype,p,cor_sal,cor_temp,mutSalEffect,mutTempEffect,va_temp_full,va_sal_full,va_temp_full_prop,va_sal_full_prop,causal,af_cor_temp,af_slope_temp,af_cor_temp_P,af_cor_sal,af_slope_sal,af_cor_sal_P,af_cor_temp_mlog10P,af_cor_sal_mlog10P,causal_temp,causal_sal,LG,colors,Va_temp,Va_temp_prop,Va_sal,Va_sal_prop,cor_temp_sig,cor_sal_sig,He_outflank,Fst_outflank,LEA3.2_lfmm2_mlog10P_tempenv,LEA3.2_lfmm2_mlog10P_tempenv_sig,LEA3.2_lfmm2_mlog10P_salenv,LEA3.2_lfmm2_mlog10P_salenv_sig,structure_cor_G_LFMM_U1_modsal,structure_cor_G_LFMM_U1_modtemp,structure_cor_G_PC1,RDA1_score,RDA2_score,RDA_mlog10P,RDA_mlog10P_sig,RDA_mut_temp_pred,RDA_mut_sal_pred,af_cor_temp_pooled,af_cor_sal_pooled,color_af.sal.cline,color_af.temp.cline
LG1-50,1,1231094,0,50,0.047958,0.027,,,,,,,,,,,False,0.124617,0.013909,0.1317439,-0.006505,-0.000818,0.937291,0.88027,0.028126,neutral-linked,neutral-linked,1,#FFC1251A,0.0,0.0,0,0,False,False,0.052542,0.104678,0.019074,False,0.028583,False,-0.061969,0.217791,-0.062428,0.010662,-0.000659,0.001338,False,25.632716,0.073255,0.138409,-0.046004,,
LG1-67,1,1231094,1,67,0.260807,0.2395,,,,,,,,,,,False,-0.437909,-0.216682,5.512628e-09,0.056646,0.029864,0.450692,8.258641,0.34612,neutral-linked,neutral-linked,1,#FFC1251A,0.0,0.0,0,0,True,False,0.364279,0.378809,0.511861,False,0.556627,False,0.313336,0.096614,0.312927,-0.166061,0.023384,0.988369,False,-399.225736,-1.005152,-0.460044,0.442339,,
LG1-125,1,1231094,2,125,0.427141,0.407,,,,,,,,,,,False,-0.547073,-0.321,8.036337e-14,0.031778,0.027545,0.66436,13.094942,0.177596,neutral-linked,neutral-linked,1,#FFC1251A,0.0,0.0,0,0,True,False,0.482702,0.319215,0.963241,False,0.463809,False,0.410904,0.104012,0.40986,-0.246046,0.021842,1.118307,False,-591.51696,-1.621762,-0.584307,0.43193,,
LG1-148,1,1231094,3,148,0.425647,0.4065,,,,,,,,,,,False,-0.545749,-0.320591,9.218383e-14,0.033544,0.028227,0.646948,13.035345,0.189131,neutral-linked,neutral-linked,1,#FFC1251A,0.0,0.0,0,0,True,False,0.482515,0.319413,0.96086,False,0.480418,False,0.410163,0.103573,0.409098,-0.245731,0.022363,1.150921,False,-590.759253,-1.613997,-0.584307,0.43193,,
LG1-185,1,1231094,4,185,0.041434,0.086,,,,,,,,,,,False,-0.620555,-0.181636,1.746357e-14,-0.037405,-0.021,0.643935,13.757867,0.191158,neutral-linked,neutral-linked,1,#FFC1251A,0.0,0.0,0,0,True,False,0.157208,0.339492,1.183104,False,0.71098,False,0.440948,-0.271555,0.442062,-0.139306,-0.015692,0.469886,False,-334.904195,-1.208452,-0.669439,-0.449467,,


In [8]:
# map VCF index to locus name so we can assign locus names to the vcf.txt file
VCF_index_to_locus = defaultdict(lambda: 'no_name')
VCF_index_to_locus.update(
    dict(zip(muts['VCFrow'], muts.index))
)

# show preview
dict(list(VCF_index_to_locus.items())[:5])

{0: 'LG1-50', 1: 'LG1-67', 2: 'LG1-125', 3: 'LG1-148', 4: 'LG1-185'}

In [9]:
VCF_index_to_locus['what_about_this?'] # if key isn't in dict
VCF_index_to_locus.pop('what_about_this?')

'no_name'
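
Because the lookup is a `defaultdict`, a VCF row that is missing from the muts file falls back to `'no_name'` instead of raising a `KeyError`. A minimal sketch of that fallback when mapping with pandas, using a toy mapping (the keys and locus names below are illustrative only):

```python
from collections import defaultdict

import pandas as pd

# toy mapping: VCF row 1 is deliberately missing
index_to_locus = defaultdict(lambda: 'no_name')
index_to_locus.update({0: 'LG1-50', 2: 'LG1-125'})

# pandas uses the defaultdict's default for keys that are absent
vcf_rows = pd.Series([0, 1, 2])
locus_names = vcf_rows.map(index_to_locus)

print(locus_names.tolist())
```

Reading a missing key from a `defaultdict` also inserts that key, which is why the cell above calls `.pop()` after the lookup.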

In [10]:
max(keys(VCF_index_to_locus))

21607

<a id='012'></a>

### 1.3 create 012 snpfile

[top](#home)

In [11]:
def convert_012(df:pd.DataFrame, inds:list) -> pd.DataFrame:
    """Convert individual names to i-format, genotypes to counts of minor allele, and subset for `inds`.
    
    Parameters
    ----------
    - df : pandas.DataFrame from a file of type seed_plusneut_MAF01.recode2.vcf.txt
    - inds : list of i-formatted sample names (eg i0, i1, i2) to filter from full VCF
    """
    from collections import Counter
    from tqdm import tqdm as pbar  # progress bar
    from pythonimports import flatten
    
    # first convert sample names (change eg tsk_0 to i0; tsk_25 to i25)
    firstcols = []
    newcols = []
    for col in df.columns:
        if col.startswith('tsk_'):
            col = col.replace('tsk_', 'i')  # convert to eg i0, i1, i2 ...
        else:
            firstcols.append(col)
        newcols.append(col)
    df.columns = newcols
    
    # subset for inds in `inds`
    df = df[firstcols + inds]
    
    # assert genotype convention: every genotype at every locus is phased (contains "|")
    assert all(  # assert across all loci (rows)
        df[inds].apply(
            lambda gts: all(['|' in gt for gt in gts]),  # all genotypes at this locus contain "|"
            axis=1
        )
    )
    
    # figure out minor allele counts for each individual and across all individuals
    for locus in pbar(df.index, desc='determining minor allele'):
        # count each allele across all samples for `locus`
        allele_counts = Counter(
            flatten(
                [list(gt.replace("|", "")) for gt in df.loc[locus, inds]]  # technically don't need to replace |
            )
        )
        
        # identify minor allele
        if allele_counts['0'] < allele_counts['1']:
            minor_allele = '0'
        else:
            minor_allele = '1'

        # get minor allele counts for each individual
        df.loc[locus, inds] = [gt.count(minor_allele) for gt in df.loc[locus, inds]]
        
        # calculate MAF and AF
        df.loc[locus, 'MAF'] = allele_counts[minor_allele] / (2*len(inds))
        df.loc[locus, 'AF'] = allele_counts['1'] / (2*len(inds))  # '1' is ALT/derived allele

    # replace metadata
    df.loc[df.index, 'FORMAT'] = 'minor_allele_count'
    
    # assert expectations
    assert max(df['MAF']) <= 0.50
    assert all((0 <= df['AF']) & (df['AF'] <= 1))
    
    return df
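
To make the per-locus logic in `convert_012` concrete, here is a small stand-alone sketch for a single biallelic locus. The helper name `minor_allele_counts` and the toy genotypes are hypothetical; the rule for choosing the minor allele (ties go to `'1'`) mirrors the function above.

```python
from collections import Counter


def minor_allele_counts(genotypes):
    """Return (per-sample minor allele counts, MAF, AF) for one biallelic locus."""
    # count each allele across all phased genotypes, e.g. '0|1' -> '0', '1'
    alleles = Counter("".join(gt.replace("|", "") for gt in genotypes))

    # the minor allele is the rarer one; ties are assigned to '1'
    minor = '0' if alleles['0'] < alleles['1'] else '1'

    n_chrom = 2 * len(genotypes)  # diploid: two allele copies per sample
    counts = [gt.count(minor) for gt in genotypes]
    maf = alleles[minor] / n_chrom
    af = alleles['1'] / n_chrom  # frequency of the ALT/derived allele
    return counts, maf, af


# ALT allele is rare: minor allele is '1', so MAF equals AF
counts_a, maf_a, af_a = minor_allele_counts(['0|0', '0|1', '1|1', '0|0'])

# ALT allele is common: minor allele is '0', so MAF equals 1 - AF
counts_b, maf_b, af_b = minor_allele_counts(['1|1', '1|0', '1|1', '0|1'])

print(counts_a, maf_a, af_a)
print(counts_b, maf_b, af_b)
```

The second case shows why the `MAF` and `AF` columns can differ: the counts always track the globally minor allele, whichever allele that is.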

In [12]:
# get genotypes from vcf, convert to 012 in parallel
functions = create_fundict(convert_012,
                           kwargs={'inds' : subset.index.tolist()})

snpfile = op.join(slimdir, f'{seed}_plusneut_MAF01.recode2.vcf.txt')
snps = parallel_read(snpfile,
                     lview=lview,
                     dview=dview,
                     functions=functions,
                     verbose=False)

# set locus names as row names
loci = snps.index.map(VCF_index_to_locus)
snps.index = loci.tolist()

snps.head()

Watching 55 parallel_read() jobs ...


1231094_plusneut_MAF01.recode2.vcf.txt: 100%|███████████████| 55/55 [01:00<00:00,  1.09s/it]


Function `parallel_read` completed after : 0-00:01:09


Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,i0,i1,i2,i3,i4,i5,i6,i7,i8,i9,i11,i12,i13,i14,i15,i18,i19,i20,i21,i22,i32,i33,i35,i36,i37,i39,i43,i45,i46,i48,i56,i59,i61,i68,i74,i75,i81,i84,i88,i89,i92,...,i9914,i9915,i9918,i9920,i9921,i9925,i9926,i9929,i9951,i9956,i9961,i9964,i9969,i9978,i9979,i9980,i9983,i9985,i9987,i9990,i9994,i9995,i10000,i10002,i10003,i10004,i10005,i10010,i10015,i10016,i10017,i10018,i10020,i10021,i10023,i10024,i10026,i10027,i10030,i10031,i10032,i10033,i10034,i10035,i10036,i10037,i10038,i10039,MAF,AF
LG1-50,1,50,.,0,1,.,PASS,.,minor_allele_count,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.027,0.027
LG1-67,1,67,.,0,1,.,PASS,.,minor_allele_count,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,1,0,0,0,1,0,0,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.2395,0.2395
LG1-125,1,125,.,0,1,.,PASS,.,minor_allele_count,2,2,2,0,0,1,1,1,1,1,1,1,1,1,2,1,0,1,1,1,1,1,1,1,2,2,1,0,1,2,1,2,2,1,1,0,1,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.407,0.407
LG1-148,1,148,.,0,1,.,PASS,.,minor_allele_count,2,2,2,0,0,1,1,1,1,1,1,1,1,1,2,1,0,1,1,1,1,1,1,1,2,2,1,0,1,2,1,2,2,1,1,0,1,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.4065,0.4065
LG1-185,1,185,.,0,1,.,PASS,.,minor_allele_count,0,0,0,2,2,1,1,1,1,1,1,1,1,1,0,1,2,1,1,1,1,1,1,1,0,0,1,2,1,0,1,0,0,1,1,2,1,2,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.086,0.086


In [13]:
# how many snps should be excluded - the muts file was refiltered
sum(snps.index=='no_name')

802

In [14]:
# yes, these should be excluded because MAF < 0.01
assert snps[snps.index=='no_name']['MAF'].min() > 0
assert snps[snps.index=='no_name']['MAF'].max() < 0.01

In [15]:
# remove un-annotated loci
snps = snps[snps.index != 'no_name']
nrow(snps)

20806

In [16]:
# save
z12file = snpfile.replace('.txt', '_012.txt')
snps.to_csv(z12file, sep='\t', index=True)

z12file

'/work/lotterhos/MVP-Offsets/practice_slim/mypractice/1231094_plusneut_MAF01.recode2.vcf_012.txt'

In [17]:
# are all of the subset individuals in the z12 file? A: yes!
assert all(subset.index.isin(snps.columns))

In [18]:
# are all of the z12 individuals in the subset set? A: yes!
assert all([ind in subset.index.tolist() for ind in snps.columns if ind.startswith('i')])

In [19]:
# save for gradient forests training script from Lind et al.

# transpose so rows=individuals, columns=loci
gf_snps = snps[subset.index.tolist()].T.copy()  # remove non-individual columns

# add index col needed for gradient_training.R script
gf_snps['index'] = gf_snps.index.tolist()

# save
gf_snp_files['ind']['all'] = op.join(filedir, op.basename(snpfile).replace('.txt', '_GFready_ind_all.txt'))

gf_snps.to_csv(gf_snp_files['ind']['all'],
               index=False,
               sep='\t')

print(gf_snp_files['ind']['all'])

gf_snps.head()

/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_ind_all.txt


Unnamed: 0,LG1-50,LG1-67,LG1-125,LG1-148,LG1-185,LG1-198,LG1-222,LG1-353,LG1-358,LG1-416,LG1-437,LG1-437_2,LG1-471,LG1-528,LG1-576,LG1-610,LG1-712,LG1-799,LG1-832,LG1-900,LG1-927,LG1-960,LG1-1001,LG1-1005,LG1-1250,LG1-1338,LG1-1495,LG1-1504,LG1-1524,LG1-1545,LG1-1568,LG1-1572,LG1-1602,LG1-1813,LG1-1845,LG1-1928,LG1-1998,LG1-2069,LG1-2130,LG1-2144,LG1-2176,LG1-2186,LG1-2289,LG1-2342,LG1-2430,LG1-2590,LG1-2686,LG1-2699,LG1-2704,LG1-2722,...,LG20-997440,LG20-997481,LG20-997503,LG20-997522,LG20-997578,LG20-997586,LG20-997608,LG20-997652,LG20-997701,LG20-997704,LG20-997720,LG20-997739,LG20-997764,LG20-998018,LG20-998030,LG20-998075,LG20-998093,LG20-998171,LG20-998209,LG20-998218,LG20-998220,LG20-998303,LG20-998307,LG20-998342,LG20-998363,LG20-998447,LG20-998473,LG20-998480,LG20-998494,LG20-998614,LG20-998692,LG20-998695,LG20-998701,LG20-998864,LG20-998889,LG20-999066,LG20-999083,LG20-999143,LG20-999151,LG20-999238,LG20-999286,LG20-999400,LG20-999400_2,LG20-999415,LG20-999461,LG20-999501,LG20-999550,LG20-999808,LG20-999825,index
i0,0,0,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i0
i1,0,2,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i1
i2,0,0,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i2
i3,0,0,0,0,2,2,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i3
i4,0,0,0,0,2,2,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i4


<a id='popfreqs'></a>
### 1.4 create population-level MAF data

[top](#home)

##### get pop IDs

In [20]:
subset.head()

Unnamed: 0,seed,subpopID,indID,indSubpopIndex,subpop,phen_sal,phen_temp,sal_opt,temp_opt,fitness,subset,N,opt0,opt1,x,y,PC1,PC2,PC3,LFMM_U1_temp,LFMM_U1_sal,LFMM_U2_temp,LFMM_U2_sal,RDA1,RDA2,RDA_PC1,RDA_PC2,RDA_predict_tempPhen_20KSNPs,RDA_predict_salPhen_20KSNPs,sample_name
i0,1231094,1,0,0,1,0,-1.08809,-1.0,-1.0,0.984601,True,10,-1.0,-1.0,1,1,92.2154,-57.5673,-42.498,-38.876285,72.935933,14.058057,40.216243,-2.464484,-4.413873,-2.401906,-2.096819,-5924.695082,-64.166353,i0
i1,1231094,1,1,1,1,0,-0.771927,-1.0,-1.0,0.901194,True,10,-1.0,-1.0,1,1,92.5184,-58.2218,-44.4978,-39.042619,73.119916,15.385371,40.384374,-2.472104,-3.013454,-2.412105,-2.206939,-5943.060707,-49.737004,i1
i2,1231094,1,2,2,1,0,-0.883133,-1.0,-1.0,0.973054,True,10,-1.0,-1.0,1,1,92.7162,-59.2298,-43.8789,-40.110397,73.313012,14.81441,41.455827,-2.478389,-3.995747,-2.476676,-2.159522,-5958.136915,-59.945481,i2
i3,1231094,1,3,3,1,0,-0.836002,-1.0,-1.0,0.94763,True,10,-1.0,-1.0,1,1,90.1391,-54.8804,-40.6813,-36.756808,71.370522,13.047852,38.074237,-2.409776,-2.373063,-2.273755,-2.013461,-5793.238241,-42.644448,i3
i4,1231094,1,4,4,1,0,-1.07876,-1.0,-1.0,0.98767,True,10,-1.0,-1.0,1,1,91.8586,-56.658,-42.909,-38.019264,72.297632,14.507998,39.346027,-2.443488,-4.116506,-2.350035,-2.134117,-5874.228034,-60.932578,i4


In [21]:
# assign samps to pop
samppop = dict(zip(subset.index, subset.subpopID))
popsamps = subset.groupby('subpopID')['sample_name'].apply(list).to_dict()

samppop['i0'], samppop['i911'], popsamps[1]

(1, 22, ['i0', 'i1', 'i2', 'i3', 'i4', 'i5', 'i6', 'i7', 'i8', 'i9'])

##### calc allele freqs per pop

In [22]:
def pop_freq(df:pd.DataFrame) -> pd.DataFrame:
    """For each locus, get MAF for each pop."""
    from collections import defaultdict
    import pandas as pd
    
    pop_freqs = defaultdict(dict)
    for pop,samps in popsamps.items():
        pop_freqs[pop].update(
            dict(  # key = locus, val = pop_MAF
                df[samps].apply(sum, axis=1) / (2*len(samps))  # count frequency of minor allele
            )
        )

    return pd.DataFrame(pop_freqs)

dview['popsamps'] = popsamps
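
As a worked example of the calculation in `pop_freq`, the sketch below builds a toy 012 table and a toy sample-to-population table (all names and values are invented), then divides each population's summed minor allele counts by its number of allele copies.

```python
import pandas as pd

# toy 012 table: rows = loci, columns = individuals (counts of the minor allele)
toy_z12 = pd.DataFrame({'i0': [0, 2], 'i1': [1, 2], 'i2': [2, 0], 'i3': [1, 0]},
                       index=['locusA', 'locusB'])

# toy sample table assigning individuals to populations
toy_subset = pd.DataFrame({'sample_name': ['i0', 'i1', 'i2', 'i3'],
                           'subpopID': [1, 1, 2, 2]})
toy_popsamps = toy_subset.groupby('subpopID')['sample_name'].apply(list).to_dict()

# per-population frequency = summed minor allele counts / (2 * number of samples)
toy_freqs = pd.DataFrame({
    pop: toy_z12[samps].sum(axis=1) / (2 * len(samps))
    for pop, samps in toy_popsamps.items()
})

print(toy_freqs)
```

Since the minor allele is defined across all sampled individuals, a population-level value can exceed 0.5, as `locusB` does in population 1 here.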

In [23]:
# calc pop freqs in parallel using z12 file (counts of minor allele)

jobs = parallel_read(z12file,
                     lview=lview,
                     dview=dview,
                     functions=create_fundict(pop_freq),
                     verbose=False,
                     index_col=0,
                     maintain_dataframe=False)

freqs = pd.concat(jobs).T
freqs['index'] = freqs.index.tolist()  # for compatibility with gradient_training.R script

print(f'\n{freqs.shape = }')

freqs.head()

Watching 55 parallel_read() jobs ...


1231094_plusneut_MAF01.recode2.vcf_012.txt: 100%|███████████████| 55/55 [00:00<00:00, 153484.18it/s]


Function `parallel_read` completed after : 0-00:00:08

freqs.shape = (100, 20807)


Unnamed: 0,LG1-50,LG1-67,LG1-125,LG1-148,LG1-185,LG1-198,LG1-222,LG1-353,LG1-358,LG1-416,LG1-437,LG1-437_2,LG1-471,LG1-528,LG1-576,LG1-610,LG1-712,LG1-799,LG1-832,LG1-900,LG1-927,LG1-960,LG1-1001,LG1-1005,LG1-1250,LG1-1338,LG1-1495,LG1-1504,LG1-1524,LG1-1545,LG1-1568,LG1-1572,LG1-1602,LG1-1813,LG1-1845,LG1-1928,LG1-1998,LG1-2069,LG1-2130,LG1-2144,LG1-2176,LG1-2186,LG1-2289,LG1-2342,LG1-2430,LG1-2590,LG1-2686,LG1-2699,LG1-2704,LG1-2722,...,LG20-997440,LG20-997481,LG20-997503,LG20-997522,LG20-997578,LG20-997586,LG20-997608,LG20-997652,LG20-997701,LG20-997704,LG20-997720,LG20-997739,LG20-997764,LG20-998018,LG20-998030,LG20-998075,LG20-998093,LG20-998171,LG20-998209,LG20-998218,LG20-998220,LG20-998303,LG20-998307,LG20-998342,LG20-998363,LG20-998447,LG20-998473,LG20-998480,LG20-998494,LG20-998614,LG20-998692,LG20-998695,LG20-998701,LG20-998864,LG20-998889,LG20-999066,LG20-999083,LG20-999143,LG20-999151,LG20-999238,LG20-999286,LG20-999400,LG20-999400_2,LG20-999415,LG20-999461,LG20-999501,LG20-999550,LG20-999808,LG20-999825,index
1,0.0,0.15,0.55,0.55,0.45,0.45,0.45,0.0,0.45,0.55,0.55,0.55,0.0,0.0,0.0,0.0,0.7,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.4,0.4,0.4,0.4,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.5,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
2,0.0,0.05,0.5,0.5,0.5,0.5,0.5,0.0,0.5,0.5,0.5,0.5,0.0,0.0,0.0,0.0,0.65,0.0,0.35,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.45,0.45,0.45,0.45,0.0,0.0,0.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.45,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2
3,0.0,0.2,0.6,0.6,0.4,0.4,0.4,0.0,0.4,0.6,0.6,0.6,0.0,0.0,0.0,0.0,0.8,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.3,0.3,0.3,0.3,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.55,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.95,0.0,0.0,0.0,3
4,0.0,0.15,0.5,0.5,0.5,0.5,0.5,0.0,0.5,0.5,0.5,0.5,0.0,0.0,0.0,0.0,0.65,0.0,0.35,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.35,0.35,0.35,0.35,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.55,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4
5,0.0,0.0,0.45,0.45,0.55,0.55,0.55,0.0,0.55,0.45,0.45,0.45,0.0,0.0,0.0,0.0,0.75,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.3,0.0,0.45,0.45,0.45,0.45,0.0,0.0,0.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.6,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,5


In [24]:
# save AFTER TRANSPOSING so that subpopID are columns
gf_snp_files['pooled']['all'] = gf_snp_files['ind']['all'].replace('_ind_all.txt', '_pooled_all.txt')

freqs.T.to_csv(gf_snp_files['pooled']['all'],
             sep='\t',
             index=True)

gf_snp_files['pooled']['all']

'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_pooled_all.txt'

<a id='rangefile'></a>

# 2. create range file for GF script

[top](#home)

##### individual

In [25]:
# get range data
rangedata = subset[['y', 'x', 'sal_opt', 'temp_opt']].copy()
rangedata.columns = ['lat', 'lon', 'sal_opt', 'temp_opt']
print(f'{nrow(rangedata) = }')
rangedata.head()

nrow(rangedata) = 1000


Unnamed: 0,lat,lon,sal_opt,temp_opt
i0,1,1,-1.0,-1.0
i1,1,1,-1.0,-1.0
i2,1,1,-1.0,-1.0
i3,1,1,-1.0,-1.0
i4,1,1,-1.0,-1.0


In [26]:
# save
gf_range_files['ind'] = op.join(filedir, f'{seed}_rangefile_GFready_ind.txt')
rangedata.to_csv(gf_range_files['ind'], index=False, sep='\t')

gf_range_files['ind']

'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_rangefile_GFready_ind.txt'

##### pool-seq

In [27]:
# map individual to pop

# create intermediate data.frame
_ = rangedata.copy()
_['subpopID'] = _.index.map(samppop)

# get pop-level data
pool_rangedata = _.groupby('subpopID')[['lat', 'lon', 'sal_opt', 'temp_opt']].mean()  # column-wise mean per pop
pool_rangedata.index.name = None  # remove index label

pool_rangedata.head()

Unnamed: 0,lat,lon,sal_opt,temp_opt
1,1.0,1.0,-1.0,-1.0
2,1.0,2.0,-0.777778,-1.0
3,1.0,3.0,-0.555556,-1.0
4,1.0,4.0,-0.333333,-1.0
5,1.0,5.0,-0.111111,-1.0
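
The population-level table is a per-population, column-wise average of the individual rows. A minimal sketch with toy data (the sample names, coordinates, and optima are invented, and `toy_samppop` stands in for `samppop`):

```python
import pandas as pd

# toy individual-level range data, indexed by sample name
toy_range = pd.DataFrame(
    {'lat': [1, 1, 1, 1],
     'lon': [1, 1, 2, 2],
     'sal_opt': [-1.0, -1.0, -0.5, -0.5],
     'temp_opt': [-1.0, -1.0, -1.0, -1.0]},
    index=['i0', 'i1', 'i2', 'i3'])

# toy sample -> population lookup
toy_samppop = {'i0': 1, 'i1': 1, 'i2': 2, 'i3': 2}

# label each individual with its population, then average within populations
toy_range['subpopID'] = toy_range.index.map(toy_samppop)
toy_pool = toy_range.groupby('subpopID')[['lat', 'lon', 'sal_opt', 'temp_opt']].mean()

print(toy_pool)
```

Individuals within a population share the same coordinates and optima in this toy, so the mean simply collapses them to one row per population.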


In [28]:
# save
gf_range_files['pooled'] = gf_range_files['ind'].replace('_ind.txt', '_pooled.txt')
pool_rangedata.to_csv(gf_range_files['pooled'], index=False, sep='\t')

gf_range_files['pooled']

'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_rangefile_GFready_pooled.txt'

<a id='envfile'></a>

# 3. create env file for GF script

[top](#home)

##### individual

In [29]:
envdata = rangedata[['sal_opt', 'temp_opt']].copy()
envdata.head()

Unnamed: 0,sal_opt,temp_opt
i0,-1.0,-1.0
i1,-1.0,-1.0
i2,-1.0,-1.0
i3,-1.0,-1.0
i4,-1.0,-1.0


In [30]:
# save
gf_env_files['ind'] = op.join(filedir, f'{seed}_envfile_GFready_ind.txt')
envdata.to_csv(gf_env_files['ind'], sep='\t', index=True)

gf_env_files['ind']

'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_envfile_GFready_ind.txt'

##### pooled

In [31]:
pool_envdata = pool_rangedata[['sal_opt', 'temp_opt']].copy()
pool_envdata.head()

Unnamed: 0,sal_opt,temp_opt
1,-1.0,-1.0
2,-0.777778,-1.0
3,-0.555556,-1.0
4,-0.333333,-1.0
5,-0.111111,-1.0


In [32]:
gf_env_files['pooled'] = gf_env_files['ind'].replace('_ind.txt', '_pooled.txt')

pool_envdata.to_csv(gf_env_files['pooled'],
                    sep='\t',
                    index=True)

gf_env_files['pooled']

'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_envfile_GFready_pooled.txt'

<a id='locus_sets'></a>

# 4. subset locus sets

[top](#home)

In [34]:
# identify the loci under selection
adaptive_loci = muts.index[muts['mutID'] != 1]
len(adaptive_loci)

342

In [35]:
# save adaptive loci
locus_file = op.join(filedir, 'adaptive_loci.pkl')
pkldump(adaptive_loci, locus_file)

locus_file

'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/adaptive_loci.pkl'

In [36]:
gf_snps.head()

Unnamed: 0,LG1-50,LG1-67,LG1-125,LG1-148,LG1-185,LG1-198,LG1-222,LG1-353,LG1-358,LG1-416,LG1-437,LG1-437_2,LG1-471,LG1-528,LG1-576,LG1-610,LG1-712,LG1-799,LG1-832,LG1-900,LG1-927,LG1-960,LG1-1001,LG1-1005,LG1-1250,LG1-1338,LG1-1495,LG1-1504,LG1-1524,LG1-1545,LG1-1568,LG1-1572,LG1-1602,LG1-1813,LG1-1845,LG1-1928,LG1-1998,LG1-2069,LG1-2130,LG1-2144,LG1-2176,LG1-2186,LG1-2289,LG1-2342,LG1-2430,LG1-2590,LG1-2686,LG1-2699,LG1-2704,LG1-2722,...,LG20-997440,LG20-997481,LG20-997503,LG20-997522,LG20-997578,LG20-997586,LG20-997608,LG20-997652,LG20-997701,LG20-997704,LG20-997720,LG20-997739,LG20-997764,LG20-998018,LG20-998030,LG20-998075,LG20-998093,LG20-998171,LG20-998209,LG20-998218,LG20-998220,LG20-998303,LG20-998307,LG20-998342,LG20-998363,LG20-998447,LG20-998473,LG20-998480,LG20-998494,LG20-998614,LG20-998692,LG20-998695,LG20-998701,LG20-998864,LG20-998889,LG20-999066,LG20-999083,LG20-999143,LG20-999151,LG20-999238,LG20-999286,LG20-999400,LG20-999400_2,LG20-999415,LG20-999461,LG20-999501,LG20-999550,LG20-999808,LG20-999825,index
i0,0,0,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i0
i1,0,2,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i1
i2,0,0,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i2
i3,0,0,0,0,2,2,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i3
i4,0,0,0,0,2,2,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i4


In [37]:
# subset Gradient Forest training data and save
adaptive_gf_snps = gf_snps[list(adaptive_loci) + ['index']].copy()
print(adaptive_gf_snps.shape)

gf_snp_files['ind']['adaptive'] = gf_snp_files['ind']['all'].replace("_all.txt", "_adaptive.txt")

adaptive_gf_snps.to_csv(gf_snp_files['ind']['adaptive'],
                        sep='\t',
                        index=False)

gf_snp_files['ind']['adaptive']

(1000, 343)


'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_ind_adaptive.txt'
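
The subsetting step can be sketched on toy data: pick the loci that satisfy the notebook's `mutID != 1` criterion, then keep those columns plus the `index` column. The toy tables and values below are invented for illustration.

```python
import pandas as pd

# toy muts table indexed by locus name
toy_muts = pd.DataFrame({'mutID': [1, 2, 1, 3]},
                        index=['LG1-50', 'LG1-67', 'LG1-125', 'LG1-148'])

# toy training table: rows = individuals, columns = loci plus an 'index' column
toy_gf = pd.DataFrame({'LG1-50': [0, 1],
                       'LG1-67': [2, 0],
                       'LG1-125': [1, 1],
                       'LG1-148': [0, 2],
                       'index': ['i0', 'i1']},
                      index=['i0', 'i1'])

# loci whose mutID differs from 1, following the criterion used above
toy_adaptive = toy_muts.index[toy_muts['mutID'] != 1]

# keep only those loci, plus the 'index' column
toy_adaptive_gf = toy_gf[list(toy_adaptive) + ['index']].copy()

print(toy_adaptive_gf.columns.tolist())
```

The resulting table has one more column than the number of selected loci, which matches the `(1000, 343)` shape reported for 342 adaptive loci.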

In [38]:
gf_snps.head()

Unnamed: 0,LG1-50,LG1-67,LG1-125,LG1-148,LG1-185,LG1-198,LG1-222,LG1-353,LG1-358,LG1-416,LG1-437,LG1-437_2,LG1-471,LG1-528,LG1-576,LG1-610,LG1-712,LG1-799,LG1-832,LG1-900,LG1-927,LG1-960,LG1-1001,LG1-1005,LG1-1250,LG1-1338,LG1-1495,LG1-1504,LG1-1524,LG1-1545,LG1-1568,LG1-1572,LG1-1602,LG1-1813,LG1-1845,LG1-1928,LG1-1998,LG1-2069,LG1-2130,LG1-2144,LG1-2176,LG1-2186,LG1-2289,LG1-2342,LG1-2430,LG1-2590,LG1-2686,LG1-2699,LG1-2704,LG1-2722,...,LG20-997440,LG20-997481,LG20-997503,LG20-997522,LG20-997578,LG20-997586,LG20-997608,LG20-997652,LG20-997701,LG20-997704,LG20-997720,LG20-997739,LG20-997764,LG20-998018,LG20-998030,LG20-998075,LG20-998093,LG20-998171,LG20-998209,LG20-998218,LG20-998220,LG20-998303,LG20-998307,LG20-998342,LG20-998363,LG20-998447,LG20-998473,LG20-998480,LG20-998494,LG20-998614,LG20-998692,LG20-998695,LG20-998701,LG20-998864,LG20-998889,LG20-999066,LG20-999083,LG20-999143,LG20-999151,LG20-999238,LG20-999286,LG20-999400,LG20-999400_2,LG20-999415,LG20-999461,LG20-999501,LG20-999550,LG20-999808,LG20-999825,index
i0,0,0,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i0
i1,0,2,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i1
i2,0,0,2,2,0,0,0,0,0,2,2,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i2
i3,0,0,0,0,2,2,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i3
i4,0,0,0,0,2,2,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,i4


In [39]:
freqs.head()

Unnamed: 0,LG1-50,LG1-67,LG1-125,LG1-148,LG1-185,LG1-198,LG1-222,LG1-353,LG1-358,LG1-416,LG1-437,LG1-437_2,LG1-471,LG1-528,LG1-576,LG1-610,LG1-712,LG1-799,LG1-832,LG1-900,LG1-927,LG1-960,LG1-1001,LG1-1005,LG1-1250,LG1-1338,LG1-1495,LG1-1504,LG1-1524,LG1-1545,LG1-1568,LG1-1572,LG1-1602,LG1-1813,LG1-1845,LG1-1928,LG1-1998,LG1-2069,LG1-2130,LG1-2144,LG1-2176,LG1-2186,LG1-2289,LG1-2342,LG1-2430,LG1-2590,LG1-2686,LG1-2699,LG1-2704,LG1-2722,...,LG20-997440,LG20-997481,LG20-997503,LG20-997522,LG20-997578,LG20-997586,LG20-997608,LG20-997652,LG20-997701,LG20-997704,LG20-997720,LG20-997739,LG20-997764,LG20-998018,LG20-998030,LG20-998075,LG20-998093,LG20-998171,LG20-998209,LG20-998218,LG20-998220,LG20-998303,LG20-998307,LG20-998342,LG20-998363,LG20-998447,LG20-998473,LG20-998480,LG20-998494,LG20-998614,LG20-998692,LG20-998695,LG20-998701,LG20-998864,LG20-998889,LG20-999066,LG20-999083,LG20-999143,LG20-999151,LG20-999238,LG20-999286,LG20-999400,LG20-999400_2,LG20-999415,LG20-999461,LG20-999501,LG20-999550,LG20-999808,LG20-999825,index
1,0.0,0.15,0.55,0.55,0.45,0.45,0.45,0.0,0.45,0.55,0.55,0.55,0.0,0.0,0.0,0.0,0.7,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.4,0.4,0.4,0.4,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.5,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
2,0.0,0.05,0.5,0.5,0.5,0.5,0.5,0.0,0.5,0.5,0.5,0.5,0.0,0.0,0.0,0.0,0.65,0.0,0.35,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.45,0.45,0.45,0.45,0.0,0.0,0.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.45,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2
3,0.0,0.2,0.6,0.6,0.4,0.4,0.4,0.0,0.4,0.6,0.6,0.6,0.0,0.0,0.0,0.0,0.8,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.3,0.3,0.3,0.3,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.55,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.95,0.0,0.0,0.0,3
4,0.0,0.15,0.5,0.5,0.5,0.5,0.5,0.0,0.5,0.5,0.5,0.5,0.0,0.0,0.0,0.0,0.65,0.0,0.35,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.35,0.35,0.35,0.35,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.55,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4
5,0.0,0.0,0.45,0.45,0.55,0.55,0.55,0.0,0.55,0.45,0.45,0.45,0.0,0.0,0.0,0.0,0.75,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.3,0.0,0.45,0.45,0.45,0.45,0.0,0.0,0.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.6,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,5


In [40]:
# subset the population-level (pooled) allele frequencies to the adaptive loci and save
pooled_adaptive_gf_snps = freqs[adaptive_loci].copy()
pooled_adaptive_gf_snps['index'] = pooled_adaptive_gf_snps.index.tolist()
print(pooled_adaptive_gf_snps.shape)

gf_snp_files['pooled']['adaptive'] = gf_snp_files['pooled']['all'].replace("_all.txt", "_adaptive.txt")

pooled_adaptive_gf_snps.to_csv(gf_snp_files['pooled']['adaptive'],
                              sep='\t',
                              index=False)

gf_snp_files['pooled']['adaptive']

(100, 343)


'/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_pooled_adaptive.txt'
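The saved pooled file should hold one row per population and one column per adaptive locus, followed by the trailing `index` column (100 x 343 here). The sketch below shows one way to read such a file back and confirm that layout using only the standard library. `check_gf_snpfile` and the toy file are hypothetical and are not part of the notebook or the CoAdapTree scripts. The toy values are the first two loci of the first two rows of `freqs.head()`.

```python
import csv
import os
import tempfile


def check_gf_snpfile(path, expected_loci, expected_rows):
    """Return True if a tab-delimited SNP file has the expected layout.

    Hypothetical helper: assumes the locus columns come first and a column
    named 'index' comes last, as in the DataFrames saved above.
    """
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        header = next(reader)
        nrows = sum(1 for _ in reader)
    return (header[:-1] == list(expected_loci)
            and header[-1] == 'index'
            and nrows == expected_rows)


# toy example: two populations, two loci
toy_loci = ['LG1-50', 'LG1-67']
toy_rows = [['0.0', '0.15', '1'], ['0.0', '0.05', '2']]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy_pooled_adaptive.txt')
    with open(toy_path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(toy_loci + ['index'])
        writer.writerows(toy_rows)
    toy_ok = check_gf_snpfile(toy_path, toy_loci, expected_rows=2)

print(toy_ok)
```

For the real data, the equivalent call would pass `gf_snp_files['pooled']['adaptive']`, `adaptive_loci`, and the number of populations, assuming `adaptive_loci` is in the same order as the saved columns.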

<a id='train'></a>
# 5. Train Gradient Forest

[top](#home)

In [41]:
# set SLURM time and memory requests based on the dataset:
# individual-level genotypes with all loci need far more than the other three datasets
mytime = defaultdict(dict)
mymem = defaultdict(dict)

for ind_or_pooled in ['ind', 'pooled']:
    for all_or_adaptive in ['all', 'adaptive']:
        is_large = ind_or_pooled == 'ind' and all_or_adaptive == 'all'

        mytime[ind_or_pooled][all_or_adaptive] = '7-00:00:00' if is_large else '1-00:00:00'
        mymem[ind_or_pooled][all_or_adaptive] = '200000M' if is_large else '20000M'

mytime

defaultdict(dict,
            {'ind': {'all': '7-00:00:00', 'adaptive': '1-00:00:00'},
             'pooled': {'all': '1-00:00:00', 'adaptive': '1-00:00:00'}})

In [42]:
mymem

defaultdict(dict,
            {'ind': {'all': '200000M', 'adaptive': '20000M'},
             'pooled': {'all': '20000M', 'adaptive': '20000M'}})
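The request strings above are easier to compare once converted to numbers. The sketch below parses the two formats used in this notebook, `D-HH:MM:SS` for time and an `M` suffix (megabytes) for memory. `parse_slurm_time` and `parse_slurm_mem_mb` are hypothetical helpers written for illustration. They handle only these two forms, not every format SLURM accepts.

```python
from datetime import timedelta


def parse_slurm_time(text):
    """Convert a 'D-HH:MM:SS' string (the only form used above) to a timedelta."""
    days, clock = text.split('-')
    hours, minutes, seconds = (int(x) for x in clock.split(':'))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


def parse_slurm_mem_mb(text):
    """Convert a memory string with an 'M' suffix, e.g. '20000M', to integer megabytes."""
    if not text.endswith('M'):
        raise ValueError(f'expected a value in megabytes ending in "M", got {text!r}')
    return int(text[:-1])


big_time = parse_slurm_time('7-00:00:00')
small_time = parse_slurm_time('1-00:00:00')
big_mem = parse_slurm_mem_mb('200000M')
small_mem = parse_slurm_mem_mb('20000M')

print(big_time.days, small_time.days)  # 7 1
print(big_mem // small_mem)            # 10
```

The individual-level, all-loci job therefore asks for seven times the wall time and ten times the memory of the other three jobs.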

In [43]:
gf_snp_files

defaultdict(dict,
            {'ind': {'all': '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_ind_all.txt',
              'adaptive': '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_ind_adaptive.txt'},
             'pooled': {'all': '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_pooled_all.txt',
              'adaptive': '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_plusneut_MAF01.recode2.vcf_GFready_pooled_adaptive.txt'}})

In [44]:
gf_range_files

defaultdict(dict,
            {'ind': '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_rangefile_GFready_ind.txt',
             'pooled': '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files/1231094_rangefile_GFready_pooled.txt'})

In [45]:
Rscript = '/home/b.lind/anaconda3/envs/r35/bin/Rscript'
training_script = '/home/b.lind/src/GitHub/offset_validation/gradient_training.R'
imports_dir = '/home/b.lind/src/GitHub/r_imports'

shfiles = []
for ind_or_pooled in ['ind', 'pooled']:
    for all_or_adaptive in ['all', 'adaptive']:  # all loci or only those under selection
        basename = f'{seed}_GF_training_{ind_or_pooled}_{all_or_adaptive}'
        shfile = op.join(shdir, f'{basename}.sh')
        
        # set up variables; the job runs from `filedir` (see `cd` below), so inputs are passed as basenames
        _time = mytime[ind_or_pooled][all_or_adaptive]
        _mem = mymem[ind_or_pooled][all_or_adaptive]
        _snpfile = op.basename(gf_snp_files[ind_or_pooled][all_or_adaptive])
        _envfile = op.basename(gf_env_files[ind_or_pooled])
        _rangefile = op.basename(gf_range_files[ind_or_pooled])
        

        shtext = f'''#!/bin/bash
#SBATCH --job-name={basename}
#SBATCH --time={_time}
#SBATCH --mem={_mem}
#SBATCH --output={basename}_%j.out
#SBATCH --mail-user=b.lind@northeastern.edu
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END

source $HOME/.bashrc
conda deactivate
conda activate r35

cd {filedir}

{Rscript} \\
{training_script} \\
{_snpfile} \\
{_envfile} \\
{_rangefile} \\
{basename} \\
{outdir} \\
{imports_dir}


'''

        with open(shfile, 'w') as o:
            o.write(shtext)
        shfiles.append(shfile)
        
shfiles

['/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_shfiles/1231094_GF_training_ind_all.sh',
 '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_shfiles/1231094_GF_training_ind_adaptive.sh',
 '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_shfiles/1231094_GF_training_pooled_all.sh',
 '/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_shfiles/1231094_GF_training_pooled_adaptive.sh']
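Each script changes into `filedir` and passes the SNP, environment, and range files as basenames, so those files must exist in that directory when the job starts. The sketch below is one possible pre-submission check. `missing_inputs` is a hypothetical helper, and the demonstration runs in a temporary directory with empty placeholder files named after the pattern above.

```python
import os
import tempfile


def missing_inputs(directory, basenames):
    """Return the basenames that do not exist as files inside `directory`."""
    return [name for name in basenames
            if not os.path.isfile(os.path.join(directory, name))]


# toy demonstration: create two of the three inputs, then look for all three
with tempfile.TemporaryDirectory() as tmp:
    for name in ['1231094_envfile_GFready_ind.txt', '1231094_rangefile_GFready_ind.txt']:
        open(os.path.join(tmp, name), 'w').close()

    toy_missing = missing_inputs(tmp, ['1231094_envfile_GFready_ind.txt',
                                       '1231094_rangefile_GFready_ind.txt',
                                       '1231094_plusneut_MAF01.recode2.vcf_GFready_ind_all.txt'])

print(toy_missing)
```

In the notebook, the same function could be called with `filedir` and the `_snpfile`, `_envfile`, and `_rangefile` values inside the loop. An empty list would mean all inputs are in place.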

In [46]:
for sh in shfiles:
    print(ColorText(op.basename(sh)).custom('blue').bold())
    print(read(sh, lines=False), '\n')

[1m[38;2;0;0;255m1231094_GF_training_ind_all.sh[0m[0m
#!/bin/bash
#SBATCH --job-name=1231094_GF_training_ind_all
#SBATCH --time=7-00:00:00
#SBATCH --mem=200000M
#SBATCH --output=1231094_GF_training_ind_all_%j.out
#SBATCH --mail-user=b.lind@northeastern.edu
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END

source $HOME/.bashrc
conda deactivate
conda activate r35

cd /work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_files

/home/b.lind/anaconda3/envs/r35/bin/Rscript \
/home/b.lind/src/GitHub/offset_validation/gradient_training.R \
1231094_plusneut_MAF01.recode2.vcf_GFready_ind_all.txt \
1231094_envfile_GFready_ind.txt \
1231094_rangefile_GFready_ind.txt \
1231094_GF_training_ind_all \
/work/lotterhos/MVP-Offsets/practice_slim/training/gradient_forests/training_outfiles \
/home/b.lind/src/GitHub/r_imports


 

[1m[38;2;0;0;255m1231094_GF_training_ind_adaptive.sh[0m[0m
#!/bin/bash
#SBATCH --job-name=1231094_GF_training_ind_adaptive
#SBATCH --time=1-00:00:

In [47]:
# elapsed time since the notebook timer (t1) was started
formatclock(dt.now() - t1, exact=True)

'0-00:01:38'

In [48]:
# submit every job except shfiles[0] (individual-level data, all loci), which currently exceeds the available resources
sbatch(shfiles[1:])

100%|██████████| 3/3 [00:00<00:00,  5.82it/s]


['23181171', '23181172', '23181173']
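The `sbatch` helper comes from `pythonimports` and its internals are not shown here. The sketch below shows one way job IDs like those above could be collected. It relies on the `Submitted batch job <id>` message that the `sbatch` command prints on success. `parse_job_id` and `submit` are hypothetical names, not the `pythonimports` implementation.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r'Submitted batch job (\d+)', sbatch_stdout)
    if match is None:
        raise ValueError(f'no job ID found in: {sbatch_stdout!r}')
    return match.group(1)


def submit(shfile):
    """Submit one script with sbatch and return its job ID (requires a SLURM cluster)."""
    result = subprocess.run(['sbatch', shfile], capture_output=True, text=True, check=True)
    return parse_job_id(result.stdout)


# the parsing step can be checked without a cluster
example_id = parse_job_id('Submitted batch job 23181171\n')
print(example_id)
```

On a cluster, `[submit(sh) for sh in shfiles[1:]]` would return a list of ID strings in submission order.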