## Correcting local biases in sampling

In [1]:
import scipy
import numpy as np
import pandas as pd
import itertools as it

from math import sin
import collections

def recursively_default_dict():
        return collections.defaultdict(recursively_default_dict)

from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import scale

from scipy.stats import pearsonr

from scipy.stats import invgamma 
from scipy.stats import beta
import matplotlib.pyplot as plt

import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

init_notebook_mode(connected=True)

In our first post we examined how to use frequency vectors to generate haplotypes and populations. We then proceeded to generate a universe of frequency vectors, whose distance in feature space allowed us to choose the relative differentiation of the populations we would simulate.

What I didn't touch on in that post was the importance of sampling in principal component analysis. In the last section, I chose vectors close to one another, together with vectors far apart, in order to produce differentiated populations. If you tweaked the population sizes, you might have noticed that when some of the closely related populations largely outweighed the rest, the distances to the more differentiated clusters were reduced.

- see [McVean 2009](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000686) for a case study.


## Index

### I. Generating Frequency Vectors

### II. A One-Shot example

#### a. Biased sampling

#### b. MS correction

#### c. Unbiased sampling

### III. Permutations.




### I. Generating Frequency Vectors.

We will start by generating a space of vectors. 

For each vector we will extract _L_ random samples from the **beta** distribution. We will repeat this process as we vary the shape parameters _a_ and _b_ of the distribution, which together determine its mean and variance. The ranges along which to vary these, the number of steps and the number of vectors to generate from each combination of _a_ and _b_ can be specified at the beginning of the next block of code.
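To see what varying _a_ and _b_ does to the frequencies drawn, it helps to recall how the two shape parameters map to the mean and variance of the beta distribution. The snippet below is a small side illustration and not part of the original analysis; `beta_moments` is a helper name introduced here.

```python
def beta_moments(a, b):
    """Return (mean, variance) of a Beta(a, b) distribution from the closed forms."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# End points of the parameter ranges used in this post (a in [1, 2], b in [0.1, 0.4]).
for a, b in [(1.0, 0.1), (2.0, 0.4)]:
    mean, var = beta_moments(a, b)
    print('a = {}, b = {}: mean = {:.3f}, variance = {:.3f}'.format(a, b, mean, var))
```

With _a_ well above _b_, as in the ranges used here, the mean sits close to 1, so most sites in a vector carry a high frequency.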

In [2]:
# Simulate frequency vectors.
# L: number of sites per vector; n: number of vectors drawn per (a, b) pair.
L= 300
n= 200

# Vary a (beta distribution parameter).
a_range= np.linspace(1,2,20)
a_set= [i for i in a_range for _ in range(n)]

# vary b.
b_range= np.linspace(0.1,.4,20)
b_set= [i for i in b_range for _ in range(n)]

## length of haplotypes to extract.
L_set= [L] * n * 20


background_1= np.array([a_set,b_set,L_set]).T

vector_lib= []
for k in range(background_1.shape[0]):
    
    probs= beta.rvs(background_1[k,0], background_1[k,1], size=int(background_1[k,2]))
    probs[(probs > 1)]= 1
    
    
    vector_lib.append(probs)

vector_lib= np.array(vector_lib)

In [3]:
print('Number of frequency vectors of size {} generated: {}'.format(vector_lib.shape[1],vector_lib.shape[0]))


Number of frequency vectors of size 300 generated: 4000


We now perform PCA on the data set of frequency vectors just created.

This allows us to:

i) verify the desired randomness of the process of generating vectors.
 
ii) select vectors based on how much they resemble each other. 

We have verified in another post how distance in the feature space created this way relates positively to genetic distance as measured by Fst (see **10. A link to Fsts**).


In [4]:
## PCA on vectors simulated
n_comp = 100

pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(vector_lib)
features = pca.transform(vector_lib)# * pca.explained_variance_ratio_

print("; ".join(['PC{0}: {1}'.format(x+1,round(pca.explained_variance_ratio_[x],3)) for x in range(n_comp)]))
print('features shape: {}'.format(features.shape))

PC1: 0.015; PC2: 0.005; PC3: 0.005; PC4: 0.005; PC5: 0.005; PC6: 0.005; PC7: 0.005; PC8: 0.005; PC9: 0.005; PC10: 0.005; PC11: 0.005; PC12: 0.005; PC13: 0.005; PC14: 0.005; PC15: 0.005; PC16: 0.005; PC17: 0.005; PC18: 0.005; PC19: 0.005; PC20: 0.005; PC21: 0.005; PC22: 0.005; PC23: 0.005; PC24: 0.005; PC25: 0.005; PC26: 0.005; PC27: 0.005; PC28: 0.005; PC29: 0.005; PC30: 0.005; PC31: 0.005; PC32: 0.005; PC33: 0.004; PC34: 0.004; PC35: 0.004; PC36: 0.004; PC37: 0.004; PC38: 0.004; PC39: 0.004; PC40: 0.004; PC41: 0.004; PC42: 0.004; PC43: 0.004; PC44: 0.004; PC45: 0.004; PC46: 0.004; PC47: 0.004; PC48: 0.004; PC49: 0.004; PC50: 0.004; PC51: 0.004; PC52: 0.004; PC53: 0.004; PC54: 0.004; PC55: 0.004; PC56: 0.004; PC57: 0.004; PC58: 0.004; PC59: 0.004; PC60: 0.004; PC61: 0.004; PC62: 0.004; PC63: 0.004; PC64: 0.004; PC65: 0.004; PC66: 0.004; PC67: 0.004; PC68: 0.004; PC69: 0.004; PC70: 0.004; PC71: 0.004; PC72: 0.004; PC73: 0.004; PC74: 0.004; PC75: 0.004; PC76: 0.004; PC77: 0.004; PC78: 0.

### MRCA - Most Recent Common Ancestor.

The following block serves to tie all the populations in the vector data set together.

The random generation of frequency vectors creates vectors that differ along, asymptotically, all possible directions.

Here, we limit the number of possible directions, by creating a data set made entirely of vectors generated as described for the manipulation of genetic distances, i.e. from equally distant coordinates between two initial projections. We continue to rely on pairs of initial projections. However, here, only one projection is made to vary, while the other is chosen beforehand and remains the same. 

The result is the star-shaped distribution observed in the next graph.
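The construction described above can be sketched with NumPy alone. This is a simplified stand-in under stated assumptions: `walk_between` and `to_frequencies` are hypothetical helper names, the PCA is computed by SVD rather than with scikit-learn, and the toy library is far smaller than the one used in this post.

```python
import numpy as np

def walk_between(coords_a, coords_b, steps=50, reach=2.0):
    """Points on the line through coords_a and coords_b.

    The position t runs from -reach to +reach; t = 0 is coords_a, t = 1 is coords_b.
    """
    direction = coords_b - coords_a
    ts = np.linspace(-reach, reach, steps)
    return coords_a + ts[:, None] * direction

def to_frequencies(points, components, mean):
    """Map feature-space points back to frequency space and clip to [0, 1].

    `components` (n_comp x L) and `mean` (L,) play the role of a fitted PCA's
    components and mean; without whitening the inverse transform is linear.
    """
    return np.clip(points @ components + mean, 0.0, 1.0)

# Toy library of frequency vectors and a 3-component PCA via SVD.
rng = np.random.default_rng(0)
lib = rng.beta(1.5, 0.25, size=(40, 12))
mean = lib.mean(axis=0)
_, _, vt = np.linalg.svd(lib - mean, full_matrices=False)
components = vt[:3]
feats = (lib - mean) @ components.T

shared = feats[0]  # stands in for the projection that stays fixed across pairs
arms = [walk_between(feats[i], shared, steps=10) for i in (1, 2, 3)]
star = to_frequencies(np.vstack(arms), components, mean)
print(star.shape)
```

Because every line passes through the same fixed projection, the arms meet at that point, which is what gives the distribution its star shape.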


In [5]:
Iter= 50
target= [0,1]
stairs= 4

MRCA= np.random.choice(vector_lib.shape[0])  # index of the shared vector (a scalar, so it can be assigned into Pair)
calypso= []
feat= []

for inter in range(stairs):
    Pair= np.random.choice(range(vector_lib.shape[0]),2,replace= False)
    Pair[1]= MRCA
    print(Pair)
    
    coords= features[Pair,:]
    
    vector2= coords[target[1]] - coords[target[0]]
    for angle in np.linspace(-20,20,Iter):
        new_guy = coords[target[0]] + [angle / 10 * x for x in vector2]
        
        feat.append(new_guy)
        
        new_guy= pca.inverse_transform(new_guy)
        new_guy[new_guy < 0]= 0
        new_guy[new_guy > 1]= 1
        
        calypso.append(new_guy)

features= np.array(feat)
vector_lib= np.array(calypso)

[2163  934]
[1369  934]
[255 934]
[169 934]


In [6]:
features.shape

(200, 100)

In [7]:
## Plot vector PCA
fig_data= [go.Scatter3d(
        x = features[:,0],
        y = features[:,1],
        z = features[:,2],
        type='scatter3d',
        mode= "markers",
        text= ['index = {}'.format(k) for k in range(features.shape[0])],
        marker= {
        'line': {'width': 0},
        'size': 2,
        'symbol': 'circle',
      "opacity": .6
      }
    )]


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)

fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)


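 **Fig. 1** PCA projection of the frequency vectors. The initial library was drawn from the beta distribution, with parameters _a_ and _b_ varied over 1-2 and 0.1-0.4 respectively in 20 steps and 200 vectors drawn per step. The points shown are the 200 vectors derived from that library in the MRCA block above (4 lines of 50 points each).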


The point of this is to peruse frequency vector space and choose populations. Because distances in PCA space reflect how correlated the observations are, the closer two points lie, the lower the Fst between the populations selected.

### II. One-Shot example.

#### II a. Biased sampling of frequency vectors

Choose populations and biased sample sizes below.

Haplotypes for each population will be generated using the frequency vectors as the probabilities of alleles.
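Drawing a haplotype from a frequency vector amounts to one Bernoulli trial per site. The sketch below is a vectorised illustration of that idea, not the cell used in this post: `draw_haplotypes` is a hypothetical helper, and it codes a site as 1 with probability equal to the frequency, whereas the cell that follows uses the mirrored coding (which leaves pairwise differences unchanged).

```python
import numpy as np

def draw_haplotypes(freqs, n_haps, rng=None):
    """Draw n_haps binary haplotypes; site j is 1 with probability freqs[j]."""
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.clip(np.asarray(freqs, dtype=float), 0.0, 1.0)
    # One uniform draw per (haplotype, site); comparing it against the
    # frequency vector performs a Bernoulli trial at every site.
    return (rng.random((n_haps, freqs.size)) < freqs).astype(int)

haps = draw_haplotypes([0.0, 1.0, 0.5, 0.9], n_haps=20, rng=np.random.default_rng(1))
print(haps.shape)
print(haps.mean(axis=0))  # observed frequencies per site
```

Generating the whole matrix at once avoids calling `np.random.choice` once per site and per haplotype, which matters as population sizes grow.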

In [8]:
### Select frequency vectors and draw haplotypes.
## Pops selected by indices.
N_pops= 4

Pops= np.random.choice(vector_lib.shape[0],N_pops,replace= False)

## Population Sizes and labels
Sizes_bias= [130,80,300,35]
labels_bias= np.repeat(np.array([x for x in range(N_pops)]),Sizes_bias)

## Number of pops

data_ex= []

for k in range(N_pops):
    
    probs= vector_lib[Pops[k],:]
    probs= [[0,x][int(x >= 0)] for x in probs]
    probs= [[1,x][int(x <= 1)] for x in probs]
    
    m= Sizes_bias[k]
    Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(L)] for acc in range(m)]
    
    data_ex.extend(Haps)

data_ex= np.array(data_ex)
print(data_ex.shape)



(545, 300)


We will also calculate genetic distances between pairs of individuals based on the haplotypes. The measure is simple: the proportion of sites at which two haplotypes differ (n_diffs / hap_length).
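For reference, the same proportion-of-differences distance can be computed for all pairs at once with NumPy broadcasting. This is a sketch rather than the function used below; `pairwise_diff_proportion` is a name introduced here, and the toy haplotypes are made up.

```python
import numpy as np

def pairwise_diff_proportion(haps):
    """Matrix of pairwise distances: share of sites at which two haplotypes differ."""
    haps = np.asarray(haps)
    # Broadcasting compares every haplotype with every other, site by site.
    differs = haps[:, None, :] != haps[None, :, :]
    return differs.mean(axis=2)

toy = np.array([[0, 0, 1, 1],
                [0, 1, 1, 1],
                [1, 1, 0, 0]])
dist = pairwise_diff_proportion(toy)
upper = dist[np.triu_indices(len(toy), 1)]  # pairs (0,1), (0,2), (1,2)
print(upper)
```

The intermediate boolean array holds one entry per pair and per site, so memory grows with the square of the number of haplotypes; for very large samples a chunked or metric-based approach is the safer choice.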

In [9]:
### Calculate individual pairwise distances for biased sampling.
def pairwise_gen(x,y):
    """Proportion of sites at which haplotypes x and y differ."""
    if len(x) != len(y):
        raise ValueError('vector lengths differ')
    same= 0
    for n in range(len(x)):
        if x[n] == y[n]:
            same += 1
    return 1 - same / len(x)

bias_gen_diffs= pairwise_distances(data_ex,metric= pairwise_gen)
bias_gen_diffs= np.array(bias_gen_diffs)

iugen= np.triu_indices(bias_gen_diffs.shape[0],1)
bias_gen_diffs= bias_gen_diffs[iugen]

From the frequency vectors selected we can also calculate pairwise Fst's.

These will be used later for comparison with pairwise centroid distances in feature space.
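The Fst used here compares the heterozygosity expected within each population, 2p(1 - p), with the heterozygosity expected at the mean frequency of the pair, and averages the per-locus values. A compact sketch for a single pair follows; `pairwise_fst` is a hypothetical helper, and treating loci with zero total heterozygosity as contributing 0 is an assumption carried over from the function in the next cells.

```python
import numpy as np

def pairwise_fst(p1, p2):
    """Mean per-locus Fst between two populations given their frequency vectors."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Mean within-population expected heterozygosity.
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity at the mean frequency of the pair.
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)
    # Loci that are fixed for the same allele in both populations (ht == 0) stay at 0.
    per_locus = np.zeros_like(ht)
    np.divide(ht - hs, ht, out=per_locus, where=ht > 0)
    return per_locus.mean()

print(pairwise_fst([0.2, 0.5, 1.0], [0.8, 0.5, 1.0]))
```

Masking the division with `where=` avoids the invalid-value warning that a plain division raises at loci where the total heterozygosity is zero.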

In [10]:
print(vector_lib.shape)
Pops

(200, 300)


array([176, 142, 146, 120])

In [11]:
### Calculate pairwise Fst based on frequency vectors selected.
def return_fsts2(freq_array):
    pops= range(freq_array.shape[0])
    H= {pop: [1-(freq_array[pop,x]**2 + (1 - freq_array[pop,x])**2) for x in range(freq_array.shape[1])] for pop in range(freq_array.shape[0])}
    Store= []

    for comb in it.combinations(H.keys(),2):
        P= [sum([freq_array[x,i] for x in comb]) / len(comb) for i in range(freq_array.shape[1])]
        HT= [2 * P[x] * (1 - P[x]) for x in range(len(P))]
        per_locus_fst= [[(HT[x] - np.mean([H[p][x] for p in comb])) / HT[x],0][int(HT[x] == 0)] for x in range(len(P))]
        per_locus_fst= np.nan_to_num(per_locus_fst)
        Fst= np.mean(per_locus_fst)

        Store.append([comb,Fst])
    
    
    ### total fst:
    #P= [sum([freq_array[x,i] for x in pops]) / len(pops) for i in range(freq_array.shape[1])]
    #HT= [2 * P[x] * (1 - P[x]) for x in range(len(P))]
    #FST= np.mean([(HT[x] - np.mean([H[p][x] for p in pops])) / HT[x] for x in range(len(P))])
    
    return pd.DataFrame(Store,columns= ['pops','fst'])


freqs_selected= vector_lib[Pops,:]
Pairwise= return_fsts2(freqs_selected)

fsts_compare = Pairwise.fst
Pairwise


invalid value encountered in double_scalars



| | pops | fst |
|---|---|---|
| 0 | (0, 1) | 0.070181 |
| 1 | (0, 2) | 0.090115 |
| 2 | (0, 3) | 0.068171 |
| 3 | (1, 2) | 0.004465 |
| 4 | (1, 3) | 0.129572 |
| 5 | (2, 3) | 0.16035 |


Finally, we perform PCA on the haplotypes generated and keep the first 5 PCs. Note that the multiplication of the projections by `pca.explained_variance_ratio_` is currently commented out.

To run this test with that weighting applied, uncomment it (I'll write something simpler later on).

In [12]:
### PCA on haplotypes drawn.
n_comp = 5

pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data_ex)

bias_features= pca.transform(data_ex)# * pca.explained_variance_ratio_

var_comps= pca.explained_variance_ratio_
print("; ".join(['PC{0}: {1}'.format(x+1,round(var_comps[x],3)) for x in range(n_comp)]))
print(bias_features.shape)

PC1: 0.117; PC2: 0.02; PC3: 0.011; PC4: 0.011; PC5: 0.011
(545, 5)


Estimate population centroids in feature space and plot the PCA of the biased haplotype data set.

In [13]:
## Calculate centroids
bias_centroids= [np.mean(bias_features[[y for y in range(bias_features.shape[0]) if labels_bias[y] == z],:],axis= 0) for z in range(N_pops)]
bias_centroids= np.array(bias_centroids)

## plot
fig_data= [go.Scatter(
        x = bias_features[[x for x in range(sum(Sizes_bias)) if labels_bias[x] == i],0],
        y = bias_features[[x for x in range(sum(Sizes_bias)) if labels_bias[x] == i],1],       
        type='scatter',
        mode= "markers",
        marker= {
        'line': {'width': 0},
        'size': 8,
        'symbol': 'circle',
      "opacity": .8
      },
      name= str(i)
    ) for i in range(N_pops)]


fig_data.append(
    go.Scatter(
        x= bias_centroids[:,0],
        y= bias_centroids[:,1],
        type= 'scatter',
        mode= 'markers',
        name= 'centres',
        marker= {
        'line': {'width': 1},
        'size': 10,
        'symbol': 'cross'
        }
    )
)

layout = go.Layout(
    title= 'Biased sampling; eigenvalues not factored in',
    yaxis=dict(
        title='PC2: {}'.format(round(var_comps[1],3))),
    xaxis=dict(
    title= 'PC1: {}'.format(round(var_comps[0],3))),
)


fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 2** PCA on haplotypes sampled unevenly from the frequency vectors selected above. 

Calculate pairwise centroid distances in feature space (remember we kept 5 components), as well as the pairwise distances between individuals in that space.
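As a compact illustration of the centroid step, the sketch below groups projections by label, averages them, and returns the upper triangle of the Euclidean distance matrix in the same pair order as `np.triu_indices`. The helper name `centroid_distances` and the toy coordinates are assumptions made for the example.

```python
import numpy as np

def centroid_distances(features, labels):
    """Return (centroids, pairwise centroid distances) for labelled projections."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    # One centroid per population: the mean position of its members.
    centroids = np.array([features[labels == g].mean(axis=0) for g in groups])
    # Euclidean distance between every pair of centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return centroids, dist[np.triu_indices(len(groups), 1)]

toy_feats = [[0, 0], [0, 2], [3, 1], [3, 1], [0, 5], [0, 5]]
toy_labels = [0, 0, 1, 1, 2, 2]
cents, dists = centroid_distances(toy_feats, toy_labels)
print(cents)
print(dists)  # pairs (0,1), (0,2), (1,2)
```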

In [14]:
## centroid distances global
iu1= np.triu_indices(N_pops,1)
bias_pair_dist= pairwise_distances(bias_centroids,metric= 'euclidean')
bias_pair_dist= bias_pair_dist[iu1]
#bias_pair_dist= scale(bias_pair_dist)

### centroid distances by PC:
dist_PC_bias= {}
for PC in range(bias_centroids.shape[1]):
    bias_PC_dist= pairwise_distances(bias_centroids[:,PC].reshape(-1,1),metric= 'euclidean')
    bias_PC_dist= bias_PC_dist[iu1]
    #bias_pair_dist= scale(bias_pair_dist)
    dist_PC_bias[PC]= bias_PC_dist
    
## Individual distances:
bias_feat_dist= pairwise_distances(bias_features, metric= 'euclidean')
bias_feat_dist= bias_feat_dist[iugen]
bias_feat_dist= scale(bias_feat_dist)


### Pearson's r between individual pairwise genetic and feature space distances.
bias_feat_Pearson= pearsonr(bias_feat_dist,bias_gen_diffs)
print('Our first result: Pearson r between individual feature space and genetic distances: {}'.format(bias_feat_Pearson[0]))


Our first result: Pearson r between individual feature space and genetic distances: 0.8155753321776757


In [15]:
bias_pair_dist

array([ 3.67891595,  4.42700209,  3.45149169,  0.89609033,  4.96362495,
        5.83863505])

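The sample sizes above are deliberately uneven: the four populations were given 130, 80, 300 and 35 haplotypes. Judging by the pairwise Fst values, populations 1 and 2 are nearly undifferentiated (Fst of about 0.004), and together they contribute 380 of the 545 haplotypes, so the closest pair dominates the sample.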

The distortion can be seen in that our outlying populations don't appear as far as we would have expected them to given their vectors alone. They tend to appear in the center because of their reduced impact on variance components.

As remarked by McVean, this can be a problem when deriving conclusions from relative distances in feature space.


#### II. b. MeanShift correction.

MeanShift allows us to identify clusters in feature space. We then resample those clusters equally, inverse transform the sampled coordinates and perform the PCA anew. The actual data is then projected onto the resulting space.
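The core of the idea, refitting the axes on a balanced proxy data set and then projecting the real observations, can be shown in a few lines of NumPy. The sketch below deliberately simplifies two things: cluster labels are passed in rather than found with MeanShift, and each cluster is resampled by bootstrapping its own rows instead of sampling from a KDE in feature space. `balanced_pca_projection` is a hypothetical name.

```python
import numpy as np

def balanced_pca_projection(data, cluster_labels, n_comp=2, n_per_cluster=50, rng=None):
    """Fit PCA on equally resampled clusters, then project the original data."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    cluster_labels = np.asarray(cluster_labels)

    # Draw the same number of rows (with replacement) from every cluster.
    proxy_rows = []
    for lab in np.unique(cluster_labels):
        if lab == -1:  # skip points left unassigned by the clustering step
            continue
        members = np.flatnonzero(cluster_labels == lab)
        proxy_rows.append(data[rng.choice(members, size=n_per_cluster, replace=True)])
    proxy = np.vstack(proxy_rows)

    # PCA of the balanced proxy set via SVD of the centred matrix.
    proxy_mean = proxy.mean(axis=0)
    _, sing, vt = np.linalg.svd(proxy - proxy_mean, full_matrices=False)
    components = vt[:n_comp]
    var_ratio = (sing ** 2 / np.sum(sing ** 2))[:n_comp]

    # Project the original observations onto the new axes.
    return (data - proxy_mean) @ components.T, var_ratio

rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 1, (30, 6)), rng.normal(4, 1, (5, 6))])
toy_clusters = np.array([0] * 30 + [1] * 5)
new_feats, ratio = balanced_pca_projection(toy, toy_clusters, n_per_cluster=20, rng=rng)
print(new_feats.shape, ratio)
```

Bootstrapping keeps the example free of extra dependencies, but it can only reuse observed rows; sampling from a fitted density, as done in the function below, also generates new points within each cluster.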

In [16]:
def local_sampling_correct(data_now,n_comp):
    pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data_now)
    feats= pca.transform(data_now)
    
    N= 50
    bandwidth = estimate_bandwidth(feats, quantile=0.15)
    params = {'bandwidth': np.linspace(np.min(feats), np.max(feats),30)}
    grid = GridSearchCV(KernelDensity(algorithm = "ball_tree",breadth_first = False), params,verbose=0)
    
    ## perform MeanShift clustering.
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=False, cluster_all=False, min_bin_freq=5)
    ms.fit(feats)
    labels1 = ms.labels_
    label_select = {y:[x for x in range(len(labels1)) if labels1[x] == y] for y in sorted(list(set(labels1))) if y != -1}

    ## Extract the KDE of each cluster identified by MS.
    Proxy_data= []

    for lab in label_select.keys():
        if len(label_select[lab]) < 3:
            continue
            
        Quanted_set= feats[label_select[lab],:]
        grid.fit(Quanted_set)

        kde = grid.best_estimator_
        Extract= kde.sample(N)
        Return= pca.inverse_transform(Extract)
        
        #Return= data_now[np.random.choice(label_select[lab],N),:]
        Proxy_data.extend(Return)
    
    Proxy_data= np.array(Proxy_data)
    
    print([len(x) for x in label_select.values()])
    pca2 = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(Proxy_data)
    var_comp= pca2.explained_variance_ratio_
    
    New_features= pca2.transform(data_now)# * var_comp
    return New_features, var_comp


def Distance_profiles(data_now,n_comp,label_select):
    pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data_now)
    feats= pca.transform(data_now)
    
    N= 50
    bandwidth = estimate_bandwidth(feats, quantile=0.15)
    params = {'bandwidth': np.linspace(np.min(feats), np.max(feats),30)}
    grid = GridSearchCV(KernelDensity(algorithm = "ball_tree",breadth_first = False), params,verbose=0)
    
    ## perform MeanShift clustering.
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, cluster_all=False, min_bin_freq=5)
    ms.fit(feats)
    labels1 = ms.labels_
    #label_select = {y:[x for x in range(len(labels1)) if labels1[x] == y] for y in sorted(list(set(labels1))) if y != -1}

    ## Extract the KDE of each cluster identified by MS.
    Proxy_data= []
    label_select_labels= [z for z in it.chain(*[[x] * len(label_select[x]) for x in label_select.keys()])]
    Center_store= {}
    Proxy_indexes= {}
    distance_vecs= []
    
    for lab in label_select.keys():
        if len(label_select[lab]) < 3:
            continue
            
        Quanted_set= feats[label_select[lab],:]
        grid.fit(Quanted_set)

        kde = grid.best_estimator_
        Extract= kde.sample(N)
        
        center= np.mean(Extract,axis= 0)
        Center_store[lab]= center
        Proxy_indexes[lab]= [x for x in range((len(Center_store) - 1) * N, len(Center_store) * N)]        
        Return= pca.inverse_transform(Extract)
        
        #Return= data_now[np.random.choice(label_select[lab],N),:]
        Proxy_data.extend(Return)
    
    ### get distances to other centers:
    Distances_vectors= []
    for lab in Center_store.keys():
        Others= [z for z in it.chain(*[Proxy_indexes[x] for x in Proxy_indexes.keys() if x != lab])]
        distances= euclidean_distances(Center_store[lab].reshape(1,-1),Proxy_data[Others,:])
        #distances= [np.log(x) for x in distances]
        
        X_plot = np.linspace(-8, 2, 1000)

        kde = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(np.array(distances).reshape(-1,1))
        log_dens = kde.score_samples(np.array(X_plot).reshape(-1,1))
        distance_vecs.append(log_dens)
    
    
    return distance_vecs


New_features,var_comp= local_sampling_correct(data_ex,5)
#cluster_profs= Distance_profiles(data_ex,5)

[284, 126, 35]


Plotting our original samples onto our re-computed feature space:

In [17]:
corr_centroids= [np.mean(New_features[[y for y in range(New_features.shape[0]) if labels_bias[y] == z],:],axis= 0) for z in range(N_pops)]
corr_centroids= np.array(corr_centroids)

fig_data= [go.Scatter(
        x = New_features[[x for x in range(sum(Sizes_bias)) if labels_bias[x] == i],0],
        y = New_features[[x for x in range(sum(Sizes_bias)) if labels_bias[x] == i],1],
        type='scatter',
        mode= "markers",
        marker= {
        'line': {'width': 0},
        'size': 8,
        'symbol': 'circle',
      "opacity": .8
      },
      name= str(i)
    ) for i in range(N_pops)]

fig_data.append(
    go.Scatter(
        x= corr_centroids[:,0],
        y= corr_centroids[:,1],
        type= 'scatter',
        mode= 'markers',
        name= 'centres',
        marker= {
        'line': {'width': 1},
        'size': 10,
        'symbol': 'cross'
        }
    )
)


layout = go.Layout(
    title= 'Biased corrected, eigenvalues not factored in',
    yaxis=dict(
        title='PC2: {}'.format(round(var_comp[1],3))),
    xaxis=dict(
    title= 'PC1: {}'.format(round(var_comp[0],3))),
)

fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 3** MS corrected PCA of unevenly sampled haplotypes. MS clustering was first applied following an initial PCA of the biased data set. The KDE of each cluster identified this way was used to resample equally from that distribution. 50 observations were resampled from the distribution of each cluster identified by MS. Each observation from the resulting data set was inverse transformed and a new PCA was conducted on this data. Finally, the original haplotypes were projected onto the new feature space.

Very different from the original PCA on the biased data set.

We now calculate the distances between centroids in this feature space, as well as the pairwise distances between individuals, and compare the latter to genetic distances.
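One property worth keeping in mind for these comparisons is that Pearson's r does not change when either vector is shifted or multiplied by a positive constant, so standardising the distances beforehand does not affect the correlation itself. The sketch below checks this on made-up numbers; `pearson_r` is a plain NumPy helper written for the example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

feat_dist = np.array([1.0, 2.5, 3.0, 4.2, 6.1])    # toy feature-space distances
gen_dist = np.array([0.10, 0.22, 0.35, 0.41, 0.58])  # toy genetic distances

standardised = (feat_dist - feat_dist.mean()) / feat_dist.std()
r_raw = pearson_r(feat_dist, gen_dist)
r_std = pearson_r(standardised, gen_dist)
print(r_raw, r_std)
```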

In [18]:
corr_centroids.shape

(4, 5)

In [19]:
### Centroid distances global
iu1= np.triu_indices(N_pops,1)
corrected_pair_dist= pairwise_distances(corr_centroids,metric= 'euclidean')
corrected_pair_dist= corrected_pair_dist[iu1]
#corrected_pair_dist= scale(corrected_pair_dist)

### centroid distances by PC
dist_PC_corrected= {}
for PC in range(corr_centroids.shape[1]):
    corrected_PC_dist= pairwise_distances(corr_centroids[:,PC].reshape(-1,1),metric= 'euclidean')
    corrected_PC_dist= corrected_PC_dist[iu1]
    #corrected_pair_dist= scale(corrected_pair_dist)
    dist_PC_corrected[PC]= corrected_PC_dist

## Individual distances:
corrected_feat_dist= pairwise_distances(New_features, metric= 'euclidean')
corrected_feat_dist= corrected_feat_dist[iugen]
corrected_feat_dist= scale(corrected_feat_dist)

corrected_gen_pearson= pearsonr(corrected_feat_dist,bias_gen_diffs)

print('Pearson r of individual genetic distances versus feature space distances following correction: {}'.format(round(corrected_gen_pearson[0],3)))


Pearson r of individual genetic distances versus feature space distances following correction: 0.818


In [20]:
corrected_pair_dist

array([ 3.67849877,  4.42687454,  3.45016223,  0.89054557,  4.96034063,
        5.83829555])

#### II. c. Even sampling.


We can compare this output with what we would have gotten from sampling equally across our selected vectors.

For this purpose we sample equally from the same frequency vectors as in the biased scenario and perform PCA on the resulting data set.

In [21]:
#### Selecting new, equal sample sizes but derive haplotypes from the same frequency vectors.

Sizes= [50,50,50,50]
labels= np.repeat(np.array([x for x in range(N_pops)]),Sizes)

data= []

for k in range(N_pops):
    
    probs= vector_lib[Pops[k],:]
    probs= [[0,x][int(x >= 0)] for x in probs]
    probs= [[1,x][int(x <= 1)] for x in probs]
    
    m= Sizes[k]
    Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(L)] for acc in range(m)]
    
    data.extend(Haps)

data= np.array(data)
#data= scale(data)

#### calculate pairwise genetic distances
iugen_unbiased= np.triu_indices(data.shape[0],1)

unbias_gen_diffs= pairwise_distances(data,metric= pairwise_gen)
unbias_gen_diffs= np.array(unbias_gen_diffs)

unbias_gen_diffs= unbias_gen_diffs[iugen_unbiased]

### perform PCA

n_comp = 5

pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data)

features= pca.transform(data)# * pca.explained_variance_ratio_

var_comps= pca.explained_variance_ratio_
print("; ".join(['PC{0}: {1}'.format(x+1,round(var_comps[x],3)) for x in range(n_comp)]))
print(features.shape)


#### Calculate centroids of labelled data in feature space.
unbias_centroids= [np.mean(features[[y for y in range(features.shape[0]) if labels[y] == z],:],axis= 0) for z in range(N_pops)]
unbias_centroids= np.array(unbias_centroids)


#### Plot projections + Centroids.

fig_data= [go.Scatter(
        x = features[[x for x in range(sum(Sizes)) if labels[x] == i],0],
        y = features[[x for x in range(sum(Sizes)) if labels[x] == i],1],
        type='scatter',
        mode= "markers",
        marker= {
        'line': {'width': 0},
        'size': 8,
        'symbol': 'circle',
      "opacity": .8
      },
      name= str(i)
    ) for i in range(N_pops)]


fig_data.append(
    go.Scatter(
        x= unbias_centroids[:,0],
        y= unbias_centroids[:,1],
        type= 'scatter',
        mode= 'markers',
        name= 'centres',
        marker= {
        'line': {'width': 1},
        'size': 10,
        'symbol': 'cross'
        }
    )
)

layout = go.Layout(
    title= 'Unbiased sampling; eigenvalues not factored in',
    yaxis=dict(
        title='PC2: {}'.format(round(var_comps[1],3))),
    xaxis=dict(
    title= 'PC1: {}'.format(round(var_comps[0],3))),
)

fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)

PC1: 0.152; PC2: 0.036; PC3: 0.015; PC4: 0.015; PC5: 0.014
(200, 5)


**Fig. 4** PCA on evenly sampled haplotypes from the same frequency vectors as above. 50 haplotypes generated from each vector.

Calculate pairwise distances and compare them to genetic distances.

In [22]:
### centroid distances global
unbias_pair_dist= pairwise_distances(unbias_centroids,metric= 'euclidean')
unbias_pair_dist= unbias_pair_dist[iu1]
#unbias_pair_dist= scale(unbias_pair_dist)

### centroid distances by PC
dist_PC_even= {}
for PC in range(unbias_centroids.shape[1]):
    unbias_PC_dist= pairwise_distances(unbias_centroids[:,PC].reshape(-1,1),metric= 'euclidean')
    unbias_PC_dist= unbias_PC_dist[iu1]
    #unbias_pair_dist= scale(unbias_pair_dist)
    dist_PC_even[PC]= unbias_PC_dist

## Individual distances:
unbiased_feat_dist= pairwise_distances(features, metric= 'euclidean')

unbiased_feat_dist= unbiased_feat_dist[iugen_unbiased]
unbiased_feat_dist= scale(unbiased_feat_dist)

unbiased_gen_pearson= pearsonr(unbiased_feat_dist,unbias_gen_diffs)

print('Pearson r on individual genetic and feature space distances in the unbiased sampling scenario: {}'.format(round(unbiased_gen_pearson[0],3)))


In [23]:
fig_data= [go.Scatter(
    x= dist_PC_even[PC],
    y= dist_PC_corrected[PC],
    mode= 'markers',
    marker= dict(
        color= PC,
        opacity= .6
    ),
    name= 'PC: {}, {}'.format(PC,round(pearsonr(dist_PC_even[PC],dist_PC_corrected[PC])[0],2))
    ) for PC in dist_PC_even.keys()
]

layout = go.Layout(
    title= 'MS correction distances',
    yaxis=dict(
        title='corrected distances'),
    xaxis=dict(
        title='unbiased distances')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 5** Principal-component-wise relation between pairwise centroid distances in the unbiased versus corrected scenarios. Average distances were calculated between the projections of population-specific haplotypes along specific PCs. Pearson's r was calculated with the `pearsonr` function from `scipy.stats`.

In [24]:
t= np.array([
    unbias_pair_dist,
    bias_pair_dist,
    corrected_pair_dist
]).T

fig_data= [go.Scatter(
    x= t[:,0],
    y= t[:,i],
    mode= 'markers',
    marker= dict(
        color= i,
        opacity= .6
    ),
    name= ['bias','corrected'][i-1]
    ) for i in [1,2]
]

layout = go.Layout(
    title= 'MS correction distances',
    yaxis=dict(
        title='biased and corrected distances'),
    xaxis=dict(
        title='unbiased distances')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 6** Relation between pairwise centroid distances in unbiased scenario versus biased and corrected scenarios. Distances were left unscaled.

In [25]:

fig_fsts= [go.Scatter(
    x= fsts_compare,
    y= t[:,i],
    mode= 'markers',
    marker= dict(
        color= i,
        opacity= .6
    ),
    name= ['unbiased','biased','corrected'][i]
    ) for i in [0,1,2]
]

layout = go.Layout(
    title= 'PCA to genetic distances',
    yaxis=dict(
        title='centroid distances in feature space'),
    xaxis=dict(
        title='normalized Fst')
)

fig= go.Figure(data=fig_fsts, layout=layout)
iplot(fig)

**Fig. 7** Relationship between Fst and feature space distances in the three scenarios considered: unbiased, biased and MS corrected.


### III. Permutations.

We will now repeat this process sequentially, to get an idea of how much this method actually corrects distances between pops.

At each repetition we will choose a fixed number of frequency vectors from the _Vector Universe_ created at the top of this page. We then perform a biased and an unbiased sampling of each, and perform PCA on both. For each scenario we calculate the pairwise Euclidean distances between the centroids of populations and normalize them.

We then apply the MS correction to the feature space of the biased scenario, recalculate the pairwise centroid distances and normalize them.

This will allow us to compare the unbiased distances to biased and corrected distances. Hopefully, we will have reduced the distortion produced by the biases in sampling.
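Normalizing matters because each PCA produces distances on its own scale. One way to put the scenarios on a common footing is to standardise each vector of centroid distances and then measure how far the biased or corrected vector sits from the unbiased one. The sketch below is only one possible summary and is not taken from the analysis that follows; `normalise` and `distortion` are hypothetical helpers and the numbers are made up.

```python
import numpy as np

def normalise(dist):
    """Standardise a distance vector to zero mean and unit standard deviation."""
    dist = np.asarray(dist, dtype=float)
    sd = dist.std()
    return (dist - dist.mean()) / sd if sd > 0 else np.zeros_like(dist)

def distortion(reference, other):
    """Root-mean-square gap between two normalised distance vectors."""
    return float(np.sqrt(np.mean((normalise(reference) - normalise(other)) ** 2)))

unbiased = np.array([2.0, 4.0, 6.0, 3.0])        # toy centroid distances
rescaled = 3.0 * unbiased + 1.0                  # same pattern on another scale
reshuffled = np.array([6.0, 2.0, 3.0, 4.0])      # different pattern

print(distortion(unbiased, rescaled))    # vanishes up to floating point error
print(distortion(unbiased, reshuffled))
```

A value near zero means the two scenarios rank and space the population pairs in the same way, regardless of the absolute scale of each feature space.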


- You can choose to multiply the PCs of each PCA by their respective explained variance ratios by setting `Eigen = True` below.

- You can choose to scale the haplotype matrices feature-wise by setting `Scale = True` below.

In [26]:
### Select pre and post processing measures. 
Eigen = False
Scale= False
Center= False

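MixL= True # whether to draw the haplotype length at random from length_range.
Length_increment= True # whether to increase haplotype length by L_step every L_step repeats (overrides MixL).
L_step= 10
MixP= True # whether to draw the number of populations at random.
Pairs= False # whether to compare ratios between pairs of distances rather than the distances themselves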
Control_inc= True
Predict= False

length_haps= 220
length_range= [75,vector_lib.shape[1]]
length_step= 10

pop_max= 8 # Number of pops

n_comp= 10 # components to keep following PCA

Iter= 150 # repeats

N_sims= 100 # number of haplotypes to generate from each pop in the unbiased scenario.

#### Predict
predicted= []

#def controled_fsts(vector_lib,Eigen,length_haps,Scale,Center,N_pops,n_comp,Iter,N_sims,MixL,MixP,Pairs):
lengths_vector= []

### Control set to include in the transformation:
control_vecs= np.random.choice(range(vector_lib.shape[0]),2)
control_labels= np.repeat([0,1],N_sims)
### Control set distances
control_even_distances= []
control_bias_distances= []

### store distances between centroids
biased_pairwise= []
unbiased_pairwise= []
corrected_pairwise= []

### store PC projection:
dist_PC_even= {x:[] for x in range(n_comp)}
dist_PC_bias= {x:[] for x in range(n_comp)}
dist_PC_corrected= {x:[] for x in range(n_comp)}

### store incremental PC distances
dist_increment_even= {x:[] for x in range(1,n_comp)}
dist_increment_bias= {x:[] for x in range(1,n_comp)}    

### store fsts
fst_store= []

## Centroid distance profiles.
Cluster_profiles= []


### proceed.

for rep in range(Iter):
    
    if MixP:
        N_pops= np.random.choice(range(3,pop_max),1,replace= False)[0]
    else: 
        N_pops= pop_max
    
    if MixL:
        length_haps= np.random.choice(length_range,1)[0]
    
    if Length_increment:
        length_haps= int(length_range[0] + L_step * np.floor(rep / L_step))
    
    ## Population Sizes and labels
    bias_scheme= np.random.choice(range(25,200),N_pops,replace= False)
    unbiased_sheme= np.repeat(N_sims,N_pops)

    bias_labels= np.repeat(np.array([x for x in range(N_pops)]),bias_scheme)
    unbias_labels= np.repeat(np.array([x for x in range(N_pops)]),unbiased_sheme)

    ### triangular matrices extract.
    iu1= np.triu_indices(N_pops,1) # for centroid comparison

    iu_unbias= np.triu_indices(sum(unbiased_sheme),1)
    iu_bias= np.triu_indices(sum(bias_scheme),1)
    
    iu_control= np.triu_indices(2,1)
    
    Pops= np.random.choice(vector_lib.shape[0],N_pops,replace= False)
    print('vectors selected: {}, hap length: {}'.format(Pops,length_haps))
    ########## FST

    freqs_selected= vector_lib[Pops,:length_haps]
    Pairwise= return_fsts2(freqs_selected)

    #fsts_compare = scale(Pairwise.fst)
    fsts_compare= Pairwise.fst
    if Pairs:
        t= fsts_compare
        fsts_compare= [min([t[y] for y in z]) / max([t[y] for y in z]) for z in it.combinations(range(len(t)),2)]
    
    fst_store.extend(fsts_compare)

    ## lengths
    lengths_vector.extend([length_haps] * len(fsts_compare))
    
    #########################################################
    ########### PCA ####################################
    #########################################################
    ### control sample
    
    control_data= []

    for k in range(2):

        probs= vector_lib[control_vecs[k],:length_haps]
        probs[probs < 0] = 0
        probs[probs > 1] = 1
        m= unbiased_sheme[k]
        Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(length_haps)] for acc in range(m)]
        
        control_data.extend(Haps)

    control_data= np.array(control_data)

    #### generate data and perform PCA.
    data= []

    for k in range(N_pops):

        probs= vector_lib[Pops[k],:length_haps]
        probs[probs < 0] = 0
        probs[probs > 1] = 1
        m= unbiased_sheme[k]
        Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(length_haps)] for acc in range(m)]

        data.extend(Haps)

    data1= np.array(data)

    if Scale:
        data1= scale(data1)
    
    if Control_inc:
        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(np.vstack([control_data,data1]))
        control_unbias_feat= pca.transform(control_data)
    else:
        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data1)
    
    feat_unbias= pca.transform(data1)

    if Eigen:
        feat_unbias= feat_unbias * pca.explained_variance_ratio_

    ####### centroid comparison
    #### Controls
    if Control_inc:
        control_centroids= [np.mean(control_unbias_feat[[y for y in range(control_unbias_feat.shape[0]) if control_labels[y] == z],:],axis= 0) for z in range(2)]
        control_centroids= np.array(control_centroids)

        unbias_control_dist= pairwise_distances(control_centroids,metric= 'euclidean')
        unbias_control_dist= unbias_control_dist[iu_control]

        control_even_distances.extend(unbias_control_dist)

    ####
    unbias_centroids= [np.mean(feat_unbias[[y for y in range(feat_unbias.shape[0]) if unbias_labels[y] == z],:],axis= 0) for z in range(N_pops)]
    unbias_centroids= np.array(unbias_centroids)

    unbias_pair_dist= pairwise_distances(unbias_centroids,metric= 'euclidean')
    unbias_pair_dist= unbias_pair_dist[iu1]
    
    if Pairs:
        t= unbias_pair_dist
        unbias_pair_dist= [min([t[y] for y in z]) / max([t[y] for y in z]) for z in it.combinations(range(len(t)),2)]
    
    if Predict:
        fst_pred= [np.exp(m_coeff* np.log(x)+ b) for x in unbias_pair_dist]
        predicted.extend(fst_pred)
        print(np.array([fst_pred,fsts_compare]).T)
    
    #unbias_pair_dist= scale(unbias_pair_dist)
    unbiased_pairwise.extend(unbias_pair_dist)

    ## PC-wise centroid comparison
    for PC in range(unbias_centroids.shape[1]):
        unbias_PC_dist= pairwise_distances(unbias_centroids[:,PC].reshape(-1,1),metric= 'euclidean')
        unbias_PC_dist= unbias_PC_dist[iu1]
        if Pairs:
            t= unbias_PC_dist
            unbias_PC_dist= [min([t[y] for y in z]) / max([t[y] for y in z]) for z in it.combinations(range(len(t)),2)]
        
        dist_PC_even[PC].extend(unbias_PC_dist)
        if  PC > 0:
            unbias_increment_dist= pairwise_distances(unbias_centroids[:,:PC],metric= 'euclidean')
            unbias_increment_dist= unbias_increment_dist[iu1]
            dist_increment_even[PC].extend(unbias_increment_dist)            

    #################################################
    ############## biased sample

    #### generate data and perform PCA
    data= []

    for k in range(N_pops):

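        ## clip to valid probabilities and truncate to the haplotype length, as in the unbiased sample
        probs= np.clip(vector_lib[Pops[k],:length_haps],0,1)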

        m= bias_scheme[k]
        Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(length_haps)] for acc in range(m)]

        data.extend(Haps)

    data2= np.array(data)

    if Scale:
        data2= scale(data2)
    
    if Control_inc:
        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(np.vstack([control_data,data2]))
        control_bias_feat= pca.transform(control_data)
    else:
        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data2)
    
    feat_bias= pca.transform(data2)

    if Eigen:
        feat_bias= feat_bias * pca.explained_variance_ratio_

    #### Centroid distances
    #### Controls
    if Control_inc:
        control_centroids= [np.mean(control_bias_feat[[y for y in range(control_bias_feat.shape[0]) if control_labels[y] == z],:],axis= 0) for z in range(2)]
        control_centroids= np.array(control_centroids)

        bias_control_dist= pairwise_distances(control_centroids,metric= 'euclidean')
        bias_control_dist= bias_control_dist[iu_control]

        control_bias_distances.extend(bias_control_dist)
    
    bias_indexes= {z:[y for y in range(feat_bias.shape[0]) if bias_labels[y] == z] for z in list(set(bias_labels))}
    bias_centroids= [np.mean(feat_bias[bias_indexes[z],:],axis= 0) for z in range(N_pops)]
    bias_centroids= np.array(bias_centroids)

    bias_pair_dist= pairwise_distances(bias_centroids,metric= 'euclidean')
    bias_pair_dist= bias_pair_dist[iu1]
    #bias_pair_dist= scale(bias_pair_dist)
    if Pairs:
        t= bias_pair_dist
        bias_pair_dist= [min([t[y] for y in z]) / max([t[y] for y in z]) for z in it.combinations(range(len(t)),2)]
    
    biased_pairwise.extend(bias_pair_dist)

    ### PC-wise centroid comparison
    for PC in range(bias_centroids.shape[1]):
        bias_PC_dist= pairwise_distances(bias_centroids[:,PC].reshape(-1,1),metric= 'euclidean')
        bias_PC_dist= bias_PC_dist[iu1]
        if Pairs:
            t= bias_PC_dist
            bias_PC_dist= [min([t[y] for y in z]) / max([t[y] for y in z]) for z in it.combinations(range(len(t)),2)]
        #bias_PC_dist= scale(bias_PC_dist)
        dist_PC_bias[PC].extend(bias_PC_dist)
        if PC > 0:
            bias_increment_dist= pairwise_distances(bias_centroids[:,:PC],metric= 'euclidean')
            bias_increment_dist= bias_increment_dist[iu1]
            #bias_PC_dist= scale(bias_PC_dist)
            dist_increment_bias[PC].extend(bias_increment_dist)
    
    ### Cluster distances - KDE profiles
    if not MixL:
        distances= Distance_profiles(data2,n_comp,bias_indexes)
        Cluster_profiles.extend(distances)
        
    ###############################################################
    ################## bias correct
    ### perform MS correction on biased samples
    feat_correct,var_comp= local_sampling_correct(data2,n_comp)

    ### centroid Distances
    centroids= [np.mean(feat_correct[[y for y in range(feat_correct.shape[0]) if bias_labels[y] == z],:],axis= 0) for z in range(N_pops)]
    centroids= np.array(centroids)
    pair_dist= pairwise_distances(centroids,metric= 'euclidean')
    pair_dist= pair_dist[iu1]
    #pair_dist= scale(pair_dist)
    if Pairs:
        t= pair_dist
        pair_dist= [min([t[y] for y in z]) / max([t[y] for y in z]) for z in it.combinations(range(len(t)),2)]

    corrected_pairwise.extend(pair_dist)

    ### PC-wise centroid comparison
    for PC in range(centroids.shape[1]):
        corrected_PC_dist= pairwise_distances(centroids[:,PC].reshape(-1,1),metric= 'euclidean')
        corrected_PC_dist= corrected_PC_dist[iu1]
        #corrected_PC_dist= scale(corrected_PC_dist)
        if Pairs:
            t= corrected_PC_dist
            corrected_PC_dist= [min([t[y] for y in z]) / max([t[y] for y in z]) for z in it.combinations(range(len(t)),2)]

        dist_PC_corrected[PC].extend(corrected_PC_dist)
    
    

t= np.array([
fsts_compare,
unbias_pair_dist,
bias_pair_dist,
pair_dist
]).T



vectors selected: [189 179  25 170 137 149], hap length: 75



invalid value encountered in double_scalars



[475]
vectors selected: [171  75  89  67  37 113  53], hap length: 75



invalid value encountered in double_scalars



[581]
vectors selected: [ 78  74 172 102 144 108 129], hap length: 75



invalid value encountered in double_scalars



[390]
vectors selected: [  1 168  72], hap length: 75



invalid value encountered in double_scalars



[193, 176]
vectors selected: [147 107  18 186 167], hap length: 75



invalid value encountered in double_scalars



[259, 228]
vectors selected: [ 90 112 150 124 141  87], hap length: 75



invalid value encountered in double_scalars



[358, 95]
vectors selected: [ 58 130  24], hap length: 75



invalid value encountered in double_scalars



[254, 92]
vectors selected: [ 31 109  33  95], hap length: 75



invalid value encountered in double_scalars



[214, 89]
vectors selected: [49 58 76], hap length: 75



invalid value encountered in double_scalars



[137, 109]
vectors selected: [175  53   0], hap length: 75



invalid value encountered in double_scalars



[164, 134, 96]
vectors selected: [47 39 32], hap length: 85



invalid value encountered in double_scalars



[246]
vectors selected: [  9 165  38 175  41  65], hap length: 85



invalid value encountered in double_scalars



[305, 272, 72]
vectors selected: [196 112 105 100 167], hap length: 85



invalid value encountered in double_scalars



[272, 79, 38]
vectors selected: [ 58 106 105  69  88  99], hap length: 85



invalid value encountered in double_scalars



[237, 223, 175]
vectors selected: [ 81  40  48 169 172], hap length: 85



invalid value encountered in double_scalars



[174]
vectors selected: [100 160  83], hap length: 85



invalid value encountered in double_scalars



[179, 140, 42, 1]
vectors selected: [ 46  48  47  83 117 129], hap length: 85



invalid value encountered in double_scalars



[318]
vectors selected: [12 70 57], hap length: 85



invalid value encountered in double_scalars



[202, 32]
vectors selected: [ 69  77 177], hap length: 85



invalid value encountered in double_scalars



[303, 1]
vectors selected: [175  79  34  20 136 145 154], hap length: 85



invalid value encountered in double_scalars



[450, 220]
vectors selected: [ 37  40 193], hap length: 95



invalid value encountered in double_scalars



[243]
vectors selected: [167 177  37  76 128], hap length: 95



invalid value encountered in double_scalars



[327, 1, 1, 1]
vectors selected: [ 32 159  78  80 127  20   5], hap length: 95



invalid value encountered in double_scalars



[340, 193, 151]
vectors selected: [ 62 148 162  77 184], hap length: 95



invalid value encountered in double_scalars



[173, 165]
vectors selected: [  1 144  54  61  39 191], hap length: 95



invalid value encountered in double_scalars



[334, 154, 54]
vectors selected: [ 19  64 169], hap length: 95



invalid value encountered in double_scalars



[185]
vectors selected: [180  16  53 146  22 121 160], hap length: 95



invalid value encountered in double_scalars



[224, 189, 163, 143]
vectors selected: [198  34  57 104   9  10  22], hap length: 95



invalid value encountered in double_scalars



[419, 346, 175]
vectors selected: [ 36 183 166 111 124 162  67], hap length: 95



invalid value encountered in double_scalars



[357, 251]
vectors selected: [182  90  65], hap length: 95



invalid value encountered in double_scalars



[165, 83]
vectors selected: [115 194 169], hap length: 105



invalid value encountered in double_scalars



[195, 170, 38]
vectors selected: [ 90  52 107 138 119], hap length: 105



invalid value encountered in double_scalars



[151, 126, 99]
vectors selected: [ 39 186  21  24  37  84  26], hap length: 105



invalid value encountered in double_scalars



[466]
vectors selected: [ 19 193  10 170 118], hap length: 105



invalid value encountered in double_scalars



[317, 225, 42]
vectors selected: [177  68 195  76], hap length: 105



invalid value encountered in double_scalars



[273, 58]
vectors selected: [ 34  60 119  26], hap length: 105



invalid value encountered in double_scalars



[197, 175]
vectors selected: [  0 104 179 158  41], hap length: 105



invalid value encountered in double_scalars



[153, 141, 126, 48]
vectors selected: [  2 121  63  93  27], hap length: 105



invalid value encountered in double_scalars



[243, 169, 176]
vectors selected: [165 159  66 102  86], hap length: 105



invalid value encountered in double_scalars



[170, 105, 85, 34]
vectors selected: [ 90  49  44 121 175 133], hap length: 105



invalid value encountered in double_scalars



[298, 248]
vectors selected: [ 51 193   9], hap length: 115



invalid value encountered in double_scalars



[160, 125, 92]
vectors selected: [168  75 147], hap length: 115



invalid value encountered in double_scalars



[193, 138]
vectors selected: [117  19 163  88], hap length: 115



invalid value encountered in double_scalars



[136, 106, 98, 87]
vectors selected: [  4  36 172 193 170  42 192], hap length: 115



invalid value encountered in double_scalars



[456, 223, 56]
vectors selected: [ 81  92 175  71 105 128], hap length: 115



invalid value encountered in double_scalars



[495, 150]
vectors selected: [156  80 116   9], hap length: 115



invalid value encountered in double_scalars



[174, 158, 103]
vectors selected: [189 197  30  40 155  72], hap length: 115



invalid value encountered in double_scalars



[390, 163]
vectors selected: [ 18 162  76  37  81], hap length: 115



invalid value encountered in double_scalars



[294, 80, 48, 1]
vectors selected: [189 134  58 180 186  47], hap length: 115



invalid value encountered in double_scalars



[349, 27]
vectors selected: [ 23  59 173 129 165], hap length: 115



invalid value encountered in double_scalars



[225, 199, 25]
vectors selected: [122 129 106  30  70  39  93], hap length: 125



invalid value encountered in double_scalars



[455]
vectors selected: [36 43 21], hap length: 125



invalid value encountered in double_scalars



[111, 103]
vectors selected: [130 188 190  18 142], hap length: 125



invalid value encountered in double_scalars



[290, 131]
vectors selected: [ 37 175  46  42], hap length: 125



invalid value encountered in double_scalars



[240]
vectors selected: [ 39 102  49  17  68  74], hap length: 125



invalid value encountered in double_scalars



[366, 285, 149, 97]
vectors selected: [ 45 116 191  86   6  40  97], hap length: 125



invalid value encountered in double_scalars



[459, 49]
vectors selected: [  4 189 187], hap length: 125



invalid value encountered in double_scalars



[211, 101, 1]
vectors selected: [134  96  83  91   5  58 152], hap length: 125



invalid value encountered in double_scalars



[331, 197, 159, 101]
vectors selected: [ 33 135 198 138], hap length: 125



invalid value encountered in double_scalars



[381, 1]
vectors selected: [126 147  74], hap length: 125



invalid value encountered in double_scalars



[271, 152, 1]
vectors selected: [ 46  30  96 132  90], hap length: 135



invalid value encountered in double_scalars



[369]
vectors selected: [ 25 159 124  41  14 170  11], hap length: 135



invalid value encountered in double_scalars



[350, 345, 103]
vectors selected: [109 198  96], hap length: 135



invalid value encountered in double_scalars



[329, 59]
vectors selected: [ 75 162  68], hap length: 135



invalid value encountered in double_scalars



[217, 101]
vectors selected: [ 51 192 159 174], hap length: 135



invalid value encountered in double_scalars



[64, 61, 34]
vectors selected: [107  11 135  71], hap length: 135



invalid value encountered in double_scalars



[168, 132, 108]
vectors selected: [ 35 147 104 108], hap length: 135



invalid value encountered in double_scalars



[153, 117]
vectors selected: [ 60 174 105 102], hap length: 135



invalid value encountered in double_scalars



[208, 189, 41]
vectors selected: [ 79  92 120 152 175 167 100], hap length: 135



invalid value encountered in double_scalars



[197, 202, 139]
vectors selected: [138 103  57  21  20], hap length: 135



invalid value encountered in double_scalars



[267, 171, 45]
vectors selected: [  8 138 183 165  61 146  11], hap length: 145



invalid value encountered in double_scalars



[267, 189, 60, 50]
vectors selected: [ 69  34 172 151 111 132], hap length: 145



invalid value encountered in double_scalars



[196, 186, 150, 115, 51]
vectors selected: [  3  36 111], hap length: 145



invalid value encountered in double_scalars



[138, 121, 28]
vectors selected: [59  3 66], hap length: 145



invalid value encountered in double_scalars



[342, 113]
vectors selected: [193  44  85 170], hap length: 145



invalid value encountered in double_scalars



[440, 108]
vectors selected: [166 134  38 185 177 164  52], hap length: 145



invalid value encountered in double_scalars



[426, 249, 164]
vectors selected: [ 50 163  98  33 157], hap length: 145



invalid value encountered in double_scalars



[163, 154, 77]
vectors selected: [ 73 150  53 139  26 129 187], hap length: 145



invalid value encountered in double_scalars



[561, 68]
vectors selected: [ 53 163 160], hap length: 145



invalid value encountered in double_scalars



[233, 27]
vectors selected: [176  11  27 118  34], hap length: 145



invalid value encountered in double_scalars



[248, 157]
vectors selected: [101  75  93  96   7 190], hap length: 155



invalid value encountered in double_scalars



[128, 121, 105, 73]
vectors selected: [ 86 121  75 145 191 194  82], hap length: 155



invalid value encountered in double_scalars



[420, 351]
vectors selected: [ 89  67  41  84  66  37 156], hap length: 155



invalid value encountered in double_scalars



[370, 207, 158]
vectors selected: [173  56  34 142  69], hap length: 155



invalid value encountered in double_scalars



[239, 226]
vectors selected: [144  48 154  57], hap length: 155



invalid value encountered in double_scalars



[160, 156, 45]
vectors selected: [114 193  43  14 182  64  44], hap length: 155



invalid value encountered in double_scalars



[237, 195, 174, 47]
vectors selected: [ 93  88 152  17 147], hap length: 155



invalid value encountered in double_scalars



[231, 143, 29]
vectors selected: [157 198  88  36 137], hap length: 155



invalid value encountered in double_scalars



[431, 86]
vectors selected: [116 140 157 179 138], hap length: 155



invalid value encountered in double_scalars



[131, 84, 60]
vectors selected: [ 96  87 135  73  60 129 187], hap length: 155



invalid value encountered in double_scalars



[300, 220]
vectors selected: [123  92 168 117  64], hap length: 165



invalid value encountered in double_scalars



[212, 160, 158, 148]
vectors selected: [176 111 100 126 165], hap length: 165



invalid value encountered in double_scalars



[222, 144]
vectors selected: [ 33  25  54 174 139], hap length: 165



invalid value encountered in double_scalars



[271, 120]
vectors selected: [193 137  53], hap length: 165



invalid value encountered in double_scalars



[189, 31]
vectors selected: [ 99 128  74 153 129 195], hap length: 165



invalid value encountered in double_scalars



[247, 52]
vectors selected: [121  61 188], hap length: 165



invalid value encountered in double_scalars



[155, 51, 28]
vectors selected: [ 64 138 136], hap length: 165



invalid value encountered in double_scalars



[205, 163]
vectors selected: [ 45 193  89  40 168 159], hap length: 165



invalid value encountered in double_scalars



[301, 238]
vectors selected: [ 11 184 112], hap length: 165



invalid value encountered in double_scalars



[175, 137, 27]
vectors selected: [173  64 166  36 177], hap length: 165



invalid value encountered in double_scalars



[316, 106]
vectors selected: [ 41 181 168  65  11 100 120], hap length: 175



invalid value encountered in double_scalars



[318, 286, 182, 52]
vectors selected: [ 21 156 132 119 175 179], hap length: 175



invalid value encountered in double_scalars



[363, 137]
vectors selected: [185 166  95 149 153], hap length: 175



invalid value encountered in double_scalars



[201, 162]
vectors selected: [171  60 188 184  51  28], hap length: 175



invalid value encountered in double_scalars



[238, 209, 171]
vectors selected: [186 183 148  81  56 142  97], hap length: 175



invalid value encountered in double_scalars



[459, 147]
vectors selected: [165   6  24], hap length: 175



invalid value encountered in double_scalars



[149, 95, 83, 1]
vectors selected: [100 197  89], hap length: 175



invalid value encountered in double_scalars



[197, 46, 1]
vectors selected: [111 139  79 119], hap length: 175



invalid value encountered in double_scalars



[223, 110]
vectors selected: [157 105  50  48 133], hap length: 175



invalid value encountered in double_scalars



[166, 137, 132, 96]
vectors selected: [ 77  53 184  36  27], hap length: 175



invalid value encountered in double_scalars



[331, 109]
vectors selected: [197  78  92  54  46  77], hap length: 185



invalid value encountered in double_scalars



[304, 257, 30]
vectors selected: [ 40  44  26   6  91   7 101], hap length: 185



invalid value encountered in double_scalars



[306, 164, 131, 45]
vectors selected: [145  10 156  97], hap length: 185



invalid value encountered in double_scalars



[161, 110, 90]
vectors selected: [142 159  42 158], hap length: 185



invalid value encountered in double_scalars



[213, 132]
vectors selected: [72 98  3  4], hap length: 185



invalid value encountered in double_scalars



[187, 155, 132]
vectors selected: [110   4 183], hap length: 185



invalid value encountered in double_scalars



[72, 70, 46]
vectors selected: [ 89  34  40 173], hap length: 185



invalid value encountered in double_scalars



[262, 134]
vectors selected: [153   5  68 144 105  77], hap length: 185



invalid value encountered in double_scalars



[175, 170, 161, 152, 77]
vectors selected: [ 61  29 130  50 168], hap length: 185



invalid value encountered in double_scalars



[310, 267, 149]
vectors selected: [ 85 120  95 132], hap length: 185



invalid value encountered in double_scalars



[229]
vectors selected: [165  28  66  97 141 181], hap length: 195



invalid value encountered in double_scalars



[393, 146, 58]
vectors selected: [ 16  70 134  54], hap length: 195



invalid value encountered in double_scalars



[155, 139, 134]
vectors selected: [137 152  49 132  28], hap length: 195



invalid value encountered in double_scalars



[226, 176, 119, 1]
vectors selected: [121 191  95  41], hap length: 195



invalid value encountered in double_scalars



[148, 133]
vectors selected: [  6  53 167 161 159], hap length: 195



invalid value encountered in double_scalars



[354, 172, 170]
vectors selected: [102  48 135   5  67], hap length: 195



invalid value encountered in double_scalars



[220, 172, 167, 118]
vectors selected: [176 142 115 120 197], hap length: 195



invalid value encountered in double_scalars



[246, 215, 34]
vectors selected: [194 161 172  56 141  70 115], hap length: 195



invalid value encountered in double_scalars



[191, 161, 33]
vectors selected: [144 171  15], hap length: 195



invalid value encountered in double_scalars



[150, 128, 46]
vectors selected: [ 25 183  91], hap length: 195



invalid value encountered in double_scalars



[234]
vectors selected: [ 46  62 171 163 112], hap length: 205



invalid value encountered in double_scalars



[169, 132, 62, 34]
vectors selected: [ 24  37 192 167  65  82], hap length: 205



invalid value encountered in double_scalars



[187, 116, 99, 80]
vectors selected: [144  49  29  58], hap length: 205



invalid value encountered in double_scalars



[232, 116, 46]
vectors selected: [103 114  87 105 127], hap length: 205



invalid value encountered in double_scalars



[262, 245]
vectors selected: [ 21 124  79  57  89 192], hap length: 205



invalid value encountered in double_scalars



[464, 147, 54]
vectors selected: [170   5  23  26  22 126  92], hap length: 205



invalid value encountered in double_scalars



[387, 183, 94, 81]
vectors selected: [ 17  78 101 151 183], hap length: 205



invalid value encountered in double_scalars



[182, 159, 151, 85]
vectors selected: [129  97 164 146], hap length: 205



invalid value encountered in double_scalars



[144, 50, 42]
vectors selected: [16 10 73  1 76 66], hap length: 205



invalid value encountered in double_scalars



[282, 254]
vectors selected: [39 86 12], hap length: 205



invalid value encountered in double_scalars



[83, 36]
vectors selected: [165   8 141  30 179  50  25], hap length: 215



invalid value encountered in double_scalars



[276, 108, 118, 91]
vectors selected: [ 37 174 153  22  28 115], hap length: 215



invalid value encountered in double_scalars



[284, 140, 67]
vectors selected: [ 39 174  84 123], hap length: 215



invalid value encountered in double_scalars



[162]
vectors selected: [ 21 129  58  25 190  81], hap length: 215



invalid value encountered in double_scalars



[223, 148, 150]
vectors selected: [ 94 105  15 112  41], hap length: 215



invalid value encountered in double_scalars



[228, 212, 88]
vectors selected: [151  41 146  11  99], hap length: 215



invalid value encountered in double_scalars



[418, 192, 80]
vectors selected: [134  71 189], hap length: 215



invalid value encountered in double_scalars



[171, 115]
vectors selected: [135 139  83   5 151 107 142], hap length: 215



invalid value encountered in double_scalars



[401, 116, 82, 32]
vectors selected: [  6 192 133 153  77   2 195], hap length: 215



invalid value encountered in double_scalars



[352, 223, 65]
vectors selected: [154  54  46 177  40   4  32], hap length: 215



invalid value encountered in double_scalars



[283, 178, 60, 36]


In [27]:
clusters_profiles_m= np.array(Cluster_profiles)
print(clusters_profiles_m.shape)
if not MixL:
    pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(clusters_profiles_m)
    feats= pca.transform(clusters_profiles_m)
    
    fig_data= [go.Scatter3d(
        x = feats[:,0],
        y = feats[:,1],
        z = feats[:,2],
        type='scatter3d',
        mode= "markers",
        #text= ['a: {}; b: {}, L: {}; index = {}'.format(background_1[k,0],background_1[k,1],background_1[k,2], k) for k in range(background_1.shape[0])],
        marker= {
        'line': {'width': 0},
        'size': 2,
        'symbol': 'circle',
      "opacity": .6
      }
    )]

    
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    fig = go.Figure(data=fig_data, layout=layout)
    iplot(fig)



(0,)


In [42]:
### Distribution of feature space distances between control populations for even and biased scenarios
from sklearn.neighbors import KernelDensity

###
lengths_pallette= [150]
include= [x for x in range(len(fst_store)) if fst_store[x] <= 1 and fst_store[x] >= 0 and lengths_vector[x] in lengths_pallette]
###

X_plot = np.linspace(min(control_bias_distances) - 2, max(control_bias_distances) + 2, 1000)

kde = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(np.array(control_bias_distances).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens= [go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'Biased scenarios',
                            line=dict(color='blue', width=2))]
##
X_plot = np.linspace(min(control_even_distances) - 2, max(control_even_distances) + 2, 1000)

kde = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(np.array(control_even_distances).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens.append(go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'even scenarios (n= {})'.format(N_sims),
                            line=dict(color='red', width=2)))

##

layout= go.Layout(
    title= 'PCA distances between control populations across iterations.'
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)

In [33]:
from sklearn.metrics import mean_squared_error

fst_lm_range= [.02,.3]
pearsons_lengths= []
slope_lengths= [] 
error_lenths= []

for step in list(set(lengths_vector)):
    
    Lindexes= [x for x in range(len(lengths_vector)) if lengths_vector[x] == step and fst_store[x] >= fst_lm_range[0] and fst_store[x] <= fst_lm_range[1]]
    y_true= [biased_pairwise[x] for x in Lindexes]
    ## least-squares line: biased distance ~ m * Fst + b
    m,b= np.polyfit([fst_store[x] for x in Lindexes], y_true,1)
    
    y_pred= [x*m + b for x in [fst_store[x] for x in Lindexes]]
    error= mean_squared_error(y_true, y_pred)
    
    error_lenths.append(error)
    slope_lengths.append(m)
    pearsons_lengths.append(pearsonr([fst_store[x] for x in Lindexes], [biased_pairwise[x] for x in Lindexes])[0])

#####
#####


fig_data= [go.Scatter(
    x= list(set(lengths_vector)),
    y= slope_lengths,
    mode= 'markers',
    text= [str(round(x,3)) for x in pearsons_lengths]
    )
]

layout = go.Layout(
    title= 'Slope of Fst to Euclidean distances across haplotype lengths',
    yaxis=dict(
        title='slope'),
    xaxis=dict(
        title='Haplotypes lengths')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

In [34]:

fig_data= [go.Scatter(
    x= list(set(lengths_vector)),
    y= error_lenths,
    mode= 'markers',
    text= [str(round(x,3)) for x in pearsons_lengths]
    )
]

layout = go.Layout(
    title= 'MSE lm(Euclidean, Fst({})) across haplotype lengths'.format(fst_lm_range),
    yaxis=dict(
        title='error'),
    xaxis=dict(
        title='Haplotype lengths')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 8** Distribution of distances between the projections of control populations across iterations, for even and biased sampling scenarios. Control vectors were kept fixed across iterations, and PCA was performed with the control samples included. New samples were generated at each iteration.
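
The density curves for the control distances are Gaussian kernel density estimates. As a rough illustration of what that estimate computes, here is a minimal sketch written directly in NumPy rather than with `KernelDensity`. The helper name `gaussian_kde_curve` and the toy distances are hypothetical and only stand in for `control_even_distances` or `control_bias_distances`.

```python
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth=0.05):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)   # shape (n, 1)
    grid = np.asarray(grid, dtype=float).reshape(1, -1)       # shape (1, g)
    z = (grid - values) / bandwidth                           # shape (n, g)
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The estimate is the average of one Gaussian bump per observation.
    return kernels.mean(axis=0)

# Hypothetical control distances (assumption: not taken from the simulations).
rng = np.random.default_rng(0)
toy_distances = rng.normal(loc=3.0, scale=0.2, size=200)

# Same padding of the plotting range as in the cell above (min - 2, max + 2).
grid = np.linspace(toy_distances.min() - 2, toy_distances.max() + 2, 1000)
density = gaussian_kde_curve(toy_distances, grid, bandwidth=0.05)
```

A small bandwidth such as 0.05 keeps individual iterations visible as separate bumps; a larger value would smooth the two scenarios toward single peaks.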

In [35]:
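## lm_fit: log-log regression of Fst on biased centroid distances.
## The filter below uses `step`, i.e. the last haplotype length visited by the loop
## in the slope/MSE cell above; `Size` is defined here but not used by the filter.
Size= 150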

Lindexes= [x for x in range(len(lengths_vector)) if lengths_vector[x] == step and fst_store[x] >= fst_lm_range[0] and fst_store[x] <= fst_lm_range[1]]
y_true= [np.log(biased_pairwise[x]) for x in Lindexes]
m_coeff,b= np.polyfit(y_true,[np.log(fst_store[x]) for x in Lindexes],1)
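
The fit above regresses log(Fst) on log(distance), and the `Predict` branch of the simulation loop inverts it with `np.exp(m_coeff * np.log(x) + b)`. The sketch below shows that round trip on noise-free toy data. The helper names `fit_loglog` and `predict_fst` and the toy values are hypothetical, chosen so that the expected slope and intercept are known in advance.

```python
import numpy as np

def fit_loglog(distances, fsts):
    """Least-squares fit of log(Fst) = m * log(distance) + b."""
    m, b = np.polyfit(np.log(distances), np.log(fsts), 1)
    return m, b

def predict_fst(distances, m, b):
    """Invert the log-log fit to predict Fst from centroid distances."""
    return np.exp(m * np.log(distances) + b)

# Hypothetical data following Fst = 0.01 * distance**2 exactly (an assumption
# for illustration, not a relationship claimed for the simulations).
toy_dist = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_fst = 0.01 * toy_dist ** 2

m_toy, b_toy = fit_loglog(toy_dist, toy_fst)
toy_pred = predict_fst(toy_dist, m_toy, b_toy)
```

On this toy input the slope recovers the exponent (2) and the intercept recovers log(0.01). Both distances and Fst values must be strictly positive for the logarithms to be defined, which is one reason to restrict the fit to a range such as `fst_lm_range`.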



In [37]:
lengths_pallette= [150]
include= [x for x in range(len(fst_store)) if fst_store[x] <= 1 and fst_store[x] >= 0 and lengths_vector[x] in lengths_pallette]
exclude= [x for x in range(len(fst_store)) if fst_store[x] > 1 or fst_store[x] < 0]


fig_data= [go.Scatter(
    x= [fst_store[x] for x in include],
    y= [dist_increment_bias[PC][x] for x in include],
    mode= 'markers',
    marker= dict(
        color= PC,
        opacity= .6
    ),
    name= 'PC: {}, {}'.format(PC,round(pearsonr([fst_store[x] for x in include],[dist_increment_bias[PC][x] for x in include])[0],3))
    ) for PC in dist_increment_even.keys()
]

layout = go.Layout(
    title= 'Fst to PCA distances, PC increment',
    yaxis=dict(
        title='biased feature space distances'),
    xaxis=dict(
        title='Fsts')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)


Mean of empty slice.


invalid value encountered in double_scalars


Mean of empty slice.


invalid value encountered in double_scalars


invalid value encountered in less



**Fig. 9** Correlation between genetic and feature space distances calculated with an increasing number of PCs. Pearson's *r* was calculated on the concatenated vector of distances across permutations. Distances (genetic and Euclidean) were not normalized.
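
To make the "increasing number of PCs" concrete, the following sketch computes pairwise centroid distances using only the first *k* components, for each *k*. The function name `incremental_centroid_distances` and the toy centroids are hypothetical. Note one assumed difference from the simulation loop: this version runs *k* up to the full number of components, whereas the loop stores distances for 1 to `n_comp - 1` components.

```python
import numpy as np

def incremental_centroid_distances(centroids):
    """Pairwise centroid distances using the first k PCs, for k = 1..n_comp."""
    centroids = np.asarray(centroids, dtype=float)
    n_pops, n_comp = centroids.shape
    # Upper-triangle index pairs, in the same order as np.triu_indices(N_pops, 1).
    rows, cols = np.triu_indices(n_pops, 1)
    distances = {}
    for k in range(1, n_comp + 1):
        diff = centroids[rows, :k] - centroids[cols, :k]
        distances[k] = np.sqrt((diff ** 2).sum(axis=1))
    return distances

# Hypothetical centroids for three populations in a 3-PC feature space.
toy_centroids = np.array([[0.0, 0.0, 0.0],
                          [3.0, 4.0, 0.0],
                          [0.0, 0.0, 12.0]])
toy_increments = incremental_centroid_distances(toy_centroids)
```

In the toy example the third population is indistinguishable from the first until the third component is included, which is the kind of effect the per-increment correlations are meant to expose.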

In [38]:
fig_data= [go.Scatter(
    x= dist_PC_even[PC],
    y= dist_PC_corrected[PC],
    mode= 'markers',
    marker= dict(
        color= PC,
        opacity= .6
    ),
    name= 'PC: {}, {}'.format(PC + 1,round(pearsonr(dist_PC_even[PC],dist_PC_corrected[PC])[0],2))
    ) for PC in dist_PC_corrected.keys()
]

layout = go.Layout(
    title= 'MS correction distances',
    yaxis=dict(
        title='biased and corrected distances'),
    xaxis=dict(
        title='unbiased distances')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 10** PC-wise relation between pairwise centroid distances in unbiased scenario versus biased and corrected scenarios across simulations. Distances were left unscaled.

In [39]:
t= np.array([
    unbiased_pairwise,
    biased_pairwise,
    corrected_pairwise
]).T


In [52]:
pearsons= [pearsonr(fst_store,t[:,x])[0] for x in range(t.shape[1])]

lengths_indexes= {z:[x for x in range(len(lengths_vector)) if lengths_vector[x] == z] for z in list(set(lengths_vector))}
lengthy= [x for x in lengths_indexes.keys()]


fig_data= [go.Scatter(
    x= [fst_store[x] for x in lengths_indexes[i]],
    y= [t[x,0] for x in lengths_indexes[i]],
    mode= 'markers',
    marker= dict(
        color= i,
        opacity= .6
    ),
    name= 'L: {}, r: {}'.format(str(i),round(pearsonr([fst_store[x] for x in lengths_indexes[i]],[t[x,0] for x in lengths_indexes[i]])[0],3))
    ) for i in lengthy
]

layout = go.Layout(
    title= 'PCA distances against fst across sampling scenarios; Npops= {},(mixed={})'.format(pop_max,str(MixP)),
    yaxis=dict(
        #range= [0,9],
        title='feature space distances'),
    xaxis=dict(
        #range= [0,.4],
        title='Fst')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 11** Relationship between Fst and feature space distances across simulations and for the three scenarios considered: unbiased sampling, biased sampling and MS corrected biased sampling. The first 10 principal components were considered together.

In [53]:
### Pearson's of Fst to feature space distances
pearsons_by_length= [round(pearsonr([fst_store[x] for x in lengths_indexes[i]],[t[x,0] for x in lengths_indexes[i]])[0],3) for i in sorted(lengths_indexes.keys())]

###
fig_data= [go.Scatter(
    x= sorted(lengths_indexes.keys()),
    y= pearsons_by_length,
    mode= 'lines',
    marker= dict(
        color= 'blue'
    ),
    )
]

layout = go.Layout(
    title= 'Feature space to genetic distance across haplotype lengths. Biased sampling, varying population number.',
    yaxis=dict(
        range= [0,1],
        title="Pearson's r"),
    xaxis=dict(
        title='Haplotype length')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

In [54]:
pearsons= [pearsonr(fst_store,t[:,x])[0] for x in range(t.shape[1])]

fig_data= [go.Scatter(
    x= t[:,0],
    y= t[:,i],
    mode= 'markers',
    marker= dict(
        color= i,
        opacity= .6
    ),
    name= ['unbiased','biased','corrected'][i] + ' r: {}'.format(round(pearsons[i],3))
    ) for i in [1]
]

layout = go.Layout(
    title= 'Biased against unbiased distances; Npops= {}'.format(N_pops),
    yaxis=dict(
        range= [0,9],
        title='biased distances'),
    xaxis=dict(
        range= [0,9],
        title='unbiased distances')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

**Fig. 12** Pairwise centroid distances in biased versus unbiased scenarios. Vectors were concatenated across iterations.

### Haplotype length and population number.

We began by testing the impact of PCA on the relative distances within a single population genetic structure. Because a single scenario cannot tell us how reliably genetic distances relate to distances in PCA feature space, we then iterated that initial test across as many genetic structures as possible. This allowed us to extract an estimate of that relationship. We chose Pearson's *r* because the relationship appeared linear.

However, in order to compare different iterations we were limited to using the same haplotype length and population number. Going further, it is of interest to the practitioner to understand how this relation holds up across haplotype lengths and numbers of populations.


For this purpose we will package the block of script that performed the iterations into a function that takes the number of populations and the haplotype length as arguments.

This function is the core of the command-line ready script for this purpose. The script takes as arguments the total range of population numbers and haplotype lengths to explore, and lets the user select whether the data should be scaled prior to feature reduction.

A [new directory]() was created to hold the final version of that product.
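As a rough idea of what such a command-line interface could look like, here is a small `argparse` sketch. The flag names (`--pops`, `--lengths`, `--length-step`, `--scale`) and their defaults are assumptions made for illustration, not the options of the actual script.

```python
import argparse


def build_parser():
    """Hypothetical CLI: ranges to explore plus an optional scaling switch."""
    parser = argparse.ArgumentParser(
        description='Explore Fst to PCA distance correlation across settings.')
    parser.add_argument('--pops', nargs=2, type=int, default=[3, 6],
                        metavar=('MIN', 'MAX'),
                        help='inclusive range of population numbers')
    parser.add_argument('--lengths', nargs=2, type=int, default=[50, 300],
                        metavar=('MIN', 'MAX'),
                        help='inclusive range of haplotype lengths')
    parser.add_argument('--length-step', type=int, default=50,
                        help='step between haplotype lengths')
    parser.add_argument('--scale', action='store_true',
                        help='scale the data before feature reduction')
    return parser


def expand_grid(args):
    """List every (N_pops, length_haps) combination requested."""
    pops = range(args.pops[0], args.pops[1] + 1)
    lengths = range(args.lengths[0], args.lengths[1] + 1, args.length_step)
    return [(n, l) for n in pops for l in lengths]


# Parse an explicit argument list so the sketch runs inside a notebook too.
args = build_parser().parse_args(
    ['--pops', '3', '4', '--lengths', '100', '200', '--length-step', '50', '--scale'])
grid = expand_grid(args)
```

Each pair in `grid` would then be handed to the iteration function, with `args.scale` passed on as its scaling switch.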

In [21]:

def controled_fsts(vector_lib,Eigen,length_haps,Scale,N_pops,n_comp,Iter,N_sims):

    ## Population Sizes and labels
    bias_scheme= np.random.choice(range(25,200),N_pops,replace= False)
    unbiased_sheme= np.repeat(N_sims,N_pops)

    bias_labels= np.repeat(np.array([x for x in range(N_pops)]),bias_scheme)
    unbias_labels= np.repeat(np.array([x for x in range(N_pops)]),unbiased_sheme)

    ### store distances between centroids
    biased_pairwise= []
    unbiased_pairwise= []
    corrected_pairwise= []

    ### store fsts
    fst_store= []

    ### store Pearson's r comparing gen_diffs and feature space diffs across scenarios
    biased_pears= []
    corrected_pears= []
    unbiased_pears= []

    ### triangular matrices extract.
    iu1= np.triu_indices(N_pops,1) # for centroid comparison

    iu_unbias= np.triu_indices(sum(unbiased_sheme),1)
    iu_bias= np.triu_indices(sum(bias_scheme),1)

    ### proceed.

    for rep in range(Iter):
        Pops= np.random.choice(vector_lib.shape[0],N_pops,replace= False)
        print('vectors selected: {}'.format(Pops))
        ########## FST

        freqs_selected= vector_lib[Pops,:length_haps]
        Pairwise= return_fsts2(freqs_selected)

        #fsts_compare = scale(Pairwise.fst)
        fsts_compare= Pairwise.fst
        fst_store.extend(fsts_compare)
        #########################################################
        ########### PCA ####################################
        #########################################################
        ############# unbiased sample

        #### generate data and perform PCA.
        data= []

        for k in range(N_pops):

            probs= vector_lib[Pops[k],:length_haps]

            m= unbiased_sheme[k]
            Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(length_haps)] for acc in range(m)]

            data.extend(Haps)

        data1= np.array(data)

        if Scale:
            data1= scale(data1)

        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data1)

        feat_unbias= pca.transform(data1)

        if Eigen:
            feat_unbias= feat_unbias * pca.explained_variance_ratio_

        ####### centroid comparison
        unbias_centroids= [np.mean(feat_unbias[[y for y in range(feat_unbias.shape[0]) if unbias_labels[y] == z],:],axis= 0) for z in range(N_pops)]
        unbias_centroids= np.array(unbias_centroids)

        unbias_pair_dist= pairwise_distances(unbias_centroids,metric= 'euclidean')
        unbias_pair_dist= unbias_pair_dist[iu1]

        #unbias_pair_dist= scale(unbias_pair_dist)
        unbiased_pairwise.extend(unbias_pair_dist)

        ######## ind distances
        ### genetic data
        unbias_gen_diffs= pairwise_distances(data1,metric= pairwise_gen)
        unbias_gen_diffs= np.array(unbias_gen_diffs)
        unbias_gen_diffs= unbias_gen_diffs[iu_unbias]

        ## feature space
        unbiased_feat_dist= pairwise_distances(feat_unbias, metric= 'euclidean')
        unbiased_feat_dist= unbiased_feat_dist[iu_unbias]
        #unbiased_feat_dist= scale(unbiased_feat_dist)

        unbiased_gen_pearson= pearsonr(unbiased_feat_dist,unbias_gen_diffs)

        unbiased_pears.append(unbiased_gen_pearson[0])

        #################################################
        ############## biased sample

        #### generate data and perform PCA
        data= []

        for k in range(N_pops):

            probs= vector_lib[Pops[k],:length_haps]

            m= bias_scheme[k]
            Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(length_haps)] for acc in range(m)]

            data.extend(Haps)

        data2= np.array(data)

        if Scale:
            data2= scale(data2)

        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data2)

        feat_bias= pca.transform(data2)

        if Eigen:
            feat_bias= feat_bias * pca.explained_variance_ratio_

        #### Centroid distances
        bias_centroids= [np.mean(feat_bias[[y for y in range(feat_bias.shape[0]) if bias_labels[y] == z],:],axis= 0) for z in range(N_pops)]
        bias_centroids= np.array(bias_centroids)

        bias_pair_dist= pairwise_distances(bias_centroids,metric= 'euclidean')
        bias_pair_dist= bias_pair_dist[iu1]
        #bias_pair_dist= scale(bias_pair_dist)
        biased_pairwise.extend(bias_pair_dist)

        ######## Ind distances
        ### genetic data
        bias_gen_diffs= pairwise_distances(data2,metric= pairwise_gen)
        bias_gen_diffs= np.array(bias_gen_diffs)
        bias_gen_diffs= bias_gen_diffs[iu_bias]

        ## feature space
        biased_feat_dist= pairwise_distances(feat_bias, metric= 'euclidean')
        biased_feat_dist= biased_feat_dist[iu_bias]
        #biased_feat_dist= scale(biased_feat_dist)

        biased_gen_pearson= pearsonr(biased_feat_dist,bias_gen_diffs)

        biased_pears.append(biased_gen_pearson[0])

        ###############################################################
        ################## bias correct
        ### perform MS correction on biased samples
        feat_correct,var_comp= local_sampling_correct(data2,n_comp)

        ### centroid Distances
        centroids= [np.mean(feat_correct[[y for y in range(feat_correct.shape[0]) if bias_labels[y] == z],:],axis= 0) for z in range(N_pops)]
        centroids= np.array(centroids)
        pair_dist= pairwise_distances(centroids,metric= 'euclidean')
        pair_dist= pair_dist[iu1]
        #pair_dist= scale(pair_dist)
        corrected_pairwise.extend(pair_dist)

        ######## Ind distances

        ## feature space
        corrected_feat_dist= pairwise_distances(feat_correct, metric= 'euclidean')
        corrected_feat_dist= corrected_feat_dist[iu_bias]
        #corrected_feat_dist= scale(corrected_feat_dist)

        corr_gen_pearson= pearsonr(corrected_feat_dist,bias_gen_diffs)

        corrected_pears.append(corr_gen_pearson[0])


        t= np.array([
        fsts_compare,
        unbias_pair_dist,
        bias_pair_dist,
        pair_dist
        ]).T

    t= np.array([
        unbiased_pairwise,
        biased_pairwise,
        corrected_pairwise
    ]).T

    pearsons= [pearsonr(fst_store,t[:,x])[0] for x in range(t.shape[1])]

    return pearsons
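
Since `controled_fsts` returns the three correlations (unbiased, biased, corrected) for one combination of settings, exploring a range of settings comes down to a loop over combinations. The sketch below shows one way to organise that loop. `sweep` and `fake_run` are hypothetical names; `fake_run` returns fixed numbers and only stands in for the real function so that the example runs on its own.

```python
import itertools as it


def sweep(run_fn, pop_range, length_range):
    """Call run_fn(N_pops, length_haps) on every combination.

    run_fn is expected to return [r_unbiased, r_biased, r_corrected],
    the same order as the list returned by controled_fsts.
    """
    rows = []
    for n_pops, length in it.product(pop_range, length_range):
        r_unbiased, r_biased, r_corrected = run_fn(n_pops, length)
        rows.append({
            'N_pops': n_pops,
            'length': length,
            'unbiased': r_unbiased,
            'biased': r_biased,
            'corrected': r_corrected,
        })
    return rows


def fake_run(n_pops, length):
    """Placeholder with made-up values; replace with the real call below."""
    return [0.9, 0.7, 0.85]


# With the notebook's objects in scope, the real wrapper could look like this
# (argument values are illustrative only):
# run_fn = lambda n, l: controled_fsts(vector_lib, False, l, False, n, 5, 20, 50)

table = sweep(fake_run, [3, 4], [50, 100])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame` for plotting correlation against haplotype length or population number.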


