In [8]:
import scipy
import numpy as np
import pandas as pd
import itertools as it

import collections

def recursively_default_dict():
        return collections.defaultdict(recursively_default_dict)

from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import scale

from scipy.stats import beta

import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
init_notebook_mode(connected=True)

import StructE_tools as Ste

### PCA and *Fst*

The fixation index, *Fst*, is one of the most commonly used statistics in population genetics. It is a special case of Wright's F-statistics. Under Wright's definition, *Fst* measures the amount of genetic variation that is explained by population substructure.

*Fst* can also be viewed in terms of identity-by-descent. In this case, *Fst* can be interpreted as a measure of how similar two individuals from the same population are relative to the whole.
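
As a concrete illustration of Wright's definition, the sketch below computes *Fst* for biallelic markers as (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within populations and H_T is the expected heterozygosity of the pooled population. The function name `fst_from_freqs` is hypothetical, populations are weighted equally, and this is not necessarily the estimator implemented in `Ste.return_fsts2`.

```python
import numpy as np

def fst_from_freqs(freqs):
    """Fst = (H_T - H_S) / H_T for biallelic markers.

    freqs: array of shape (n_populations, n_markers) holding allele frequencies.
    Populations are weighted equally (an assumption of this sketch).
    """
    freqs = np.asarray(freqs, dtype=float)
    # Expected heterozygosity within each population, averaged over populations.
    h_s = np.mean(2 * freqs * (1 - freqs), axis=0)
    # Expected heterozygosity of the pooled population.
    p_bar = freqs.mean(axis=0)
    h_t = 2 * p_bar * (1 - p_bar)
    # Ratio of sums across markers; undefined if every marker is monomorphic.
    total = h_t.sum()
    if total == 0:
        return float('nan')
    return float((h_t - h_s).sum() / total)

identical = fst_from_freqs([[0.2, 0.5], [0.2, 0.5]])  # same frequencies in both populations
fixed = fst_from_freqs([[0.0, 1.0], [1.0, 0.0]])      # alternate alleles fixed
print(identical, fixed)
```

Two populations with identical frequencies give an *Fst* of 0, and two populations fixed for alternate alleles give an *Fst* of 1, the two extremes of the statistic.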


Today, *Fst* is interesting not only for what it says about the underlying data, but also because of how widespread its use is. Reporting this measure can facilitate comparison across studies. However, *Fst* can be difficult to estimate. In particular, one might not possess enough observations from one or more populations with which to calculate allele frequencies. This is particularly true for the automated analysis of many data sets with varying population sizes, and it can lead one to rely on conditional statements to handle such limiting cases.

This notebook provides one way to circumvent this problem. Our conclusions concern the analysis of **binary data** as a model of **phased SNPs**. The observations that led to the approach proposed here were made in the course of a study on the impact of sampling bias on the distances separating observations following dimensionality reduction through principal component analysis (PCA). While the study itself was motivated in part by my lack of knowledge in statistics, the lessons derived from it turned out to be useful (link to notebook below).

The story of this study began with a paper from 2009. In that article, the author explores the relationship between genealogy and the distribution of samples in PCA feature space (McVean 2009). The author partly formalizes the description in the form of a relationship between distances along the first principal component and mean coalescent times. The author notes the connection of this observation to that of Slatkin, who demonstrated in 1991 a relation of *Fst* to the mean coalescent time between samples (Slatkin 1991). In this context, we set out to develop the description of the relation between *Fst* and Euclidean distances in PCA space.

For this purpose, we simulated allele frequencies for multiple populations. Samples from these populations were combined to vary population number and sampling size. Pairwise *Fst*s were calculated using the allele frequency vectors directly, and Euclidean distances were calculated between the centroids of the simulated populations in feature space. This code is reproduced below. It is encapsulated in the function `Euc_to_fst()`, itself stored in the module `Euc_to_fst`.
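
The sampling step described above can be sketched as follows: each haplotype is a vector of independent Bernoulli draws, one per marker, using the population's allele frequencies as probabilities. The helper `sample_haplotypes` and the example sizes are hypothetical; this vectorized form is an illustration under those assumptions, not the code used in `Euc_to_fst()`.

```python
import numpy as np

def sample_haplotypes(freqs, sizes, rng=None):
    """Draw binary haplotypes for several populations of unequal size."""
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.asarray(freqs, dtype=float)
    haps, labels = [], []
    for pop, (p, m) in enumerate(zip(freqs, sizes)):
        # One Bernoulli draw per marker and per sampled haplotype.
        haps.append(rng.binomial(1, p, size=(m, freqs.shape[1])))
        labels.extend([pop] * m)
    return np.vstack(haps), np.array(labels)

rng = np.random.default_rng(0)
freqs = rng.beta(1.5, 0.5, size=(3, 150))  # three populations, 150 markers
data, labels = sample_haplotypes(freqs, sizes=[20, 120, 300], rng=rng)
print(data.shape, np.bincount(labels))
```

Unequal entries in `sizes` are what introduces the sampling bias whose effect on PCA distances is studied here.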

Finally, we observed that, for a given number of markers, the relationship between Euclidean distances and pairwise *Fst* is robust to sampling bias (provided a minimum sample size of N = 15, however), and that the two are linearly related in logarithmic space.

**What does this mean?**

The implication of that result is that PCA can be used to estimate *Fst*. The *limitation* is that any two points used for this inference are assumed to stand at the centroid of stationary populations whose allele frequencies follow a beta distribution.

Despite this limitation, I believe this approach to be of some use for descriptive analysis. The function `Euc_to_fst()` has been packaged into a script of the same name in this directory. Its use **should be accompanied by a description of the assumptions made**.
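
In practice, a fitted log-log line, log(Fst) = m * log(d) + b, is inverted to Fst = exp(b) * d^m, where d is the Euclidean distance between two centroids. The sketch below shows that conversion. `predict_fst` is a hypothetical helper, the coefficient values are placeholders chosen only to exercise it, and the actual `Fst_predict` in the `Euc_to_fst` module may differ.

```python
import numpy as np

def predict_fst(distance, m_coeff, b):
    """Invert log(Fst) = m_coeff * log(distance) + b."""
    return np.exp(m_coeff * np.log(distance) + b)

# Placeholder coefficients, not values estimated in this notebook.
m_demo, b_demo = 2.0, -5.0
distances = np.array([0.5, 1.0, 2.0, 4.0])
fst_demo = predict_fst(distances, m_demo, b_demo)

# Fitting the log-transformed values recovers the same line.
m_fit, b_fit = np.polyfit(np.log(distances), np.log(fst_demo), 1)
print(m_fit, b_fit)
```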


- Slatkin, M. (1991). Inbreeding coefficients and coalescence times. Genetical Research, 58(2), 167-175. doi:10.1017/S0016672300029827

- McVean, G. (2009). A genealogical interpretation of principal components analysis. PLoS Genetics, 5(10), e1000686.

#### Other Notebooks

- [7. Controlling for size](https://nbviewer.jupyter.org/github/SantosJGND/Stats_Lab/blob/master/7.%20Controlling%20for%20size.ipynb), where the idea was formed.

- [8. Machine learning for Genomics](https://nbviewer.jupyter.org/github/SantosJGND/Stats_Lab/blob/master/8.%20Machine%20Learning%20for%20Genomics.ipynb), a practical application.

Visit the [Stats Lab](https://github.com/SantosJGND/Stats_Lab) for an organised exposition of methods and approaches I have found useful in dealing with a large-scale genomics data set.

### Generating simulated populations

The function below begins by generating a predetermined number of allele frequency vectors by drawing frequencies from the beta distribution (scipy). This is followed by a manipulation of this data set to produce allele frequency vectors within an acceptable range of genetic distances (Fst). 

The procedure of extending the range of genetic distances available in the final vector data set is explored in another notebook (link below). PCA is used to produce equally distant populations varying around a common centre along a given number of axes meeting at a common point.

- [1. Generate haplotypes](https://nbviewer.jupyter.org/github/SantosJGND/Stats_Lab/blob/master/1.%20Generating_haplotypes.ipynb)
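
A minimal sketch of the first step only, drawing allele frequency vectors from beta distributions whose parameters vary along `rangeA` and `rangeB`. It uses `scipy.stats.beta.rvs`; the helper name `draw_freq_vectors` and the way vectors are spread across the parameter grid are assumptions, and `generate_Branches_Beta` may proceed differently.

```python
import numpy as np
from scipy.stats import beta

def draw_freq_vectors(L, n, rangeA, rangeB, steps, seed=None):
    """Draw n allele frequency vectors of length L from beta(a, b) distributions."""
    rng = np.random.default_rng(seed)
    a_values = np.linspace(rangeA[0], rangeA[1], steps)
    b_values = np.linspace(rangeB[0], rangeB[1], steps)
    vectors = []
    for _ in range(n):
        # Pick one (a, b) pair from the grid for each vector (an assumption).
        a = rng.choice(a_values)
        b = rng.choice(b_values)
        vectors.append(beta.rvs(a, b, size=L, random_state=rng))
    return np.array(vectors)

freq_vectors = draw_freq_vectors(L=150, n=100, rangeA=[1, 2.5],
                                 rangeB=[0.1, 0.6], steps=20, seed=0)
print(freq_vectors.shape)
```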

In [5]:
Nbranches= 4 # number of axes
L= 150 # number of markers.
n= 100 # number of frequency vectors.
rangeA= [1,2.5] # range along which to vary parameter a of beta dist.
rangeB = [.1,.6] # range along which to vary parameter b of beta dist.
steps= 20 # number of steps along ranges of parameters a and b.
n_comp = 100 # number of components to retain in PCA of frequency vectors (>>).
density= 50 # number of populations along each branch.

from Generate_freq_vectors import generate_Branches_Beta

features, vector_lib= generate_Branches_Beta(Nbranches,density,L,n,rangeA,rangeB,steps,n_comp)
print(features.shape)
print(vector_lib.shape)

(200, 100)
(200, 150)


### Distances

The following function performs the *Fst* and Euclidean distance calculations.

**returns**

m_coeff / b : parameters of the linear regression of log Fst *vs*. log Euclidean distance.

fst_x: log of measured pairwise population fst.

y_true: log of euclidean distances between centroids.
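
The Euclidean side of the comparison reduces to three steps: project the haplotypes with PCA, average the coordinates of each population, and take pairwise distances between those centroids. A minimal sketch, assuming a binary matrix `data` and integer `labels` built from toy frequencies (these names, and `centroid_distances`, are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import pairwise_distances

def centroid_distances(data, labels, n_comp=5):
    """Pairwise Euclidean distances between population centroids in PCA space."""
    feats = PCA(n_components=n_comp).fit_transform(data)
    pops = np.unique(labels)
    centroids = np.array([feats[labels == pop].mean(axis=0) for pop in pops])
    dists = pairwise_distances(centroids, metric='euclidean')
    # Keep the upper triangle only: one value per pair of populations.
    return dists[np.triu_indices(len(pops), 1)]

rng = np.random.default_rng(1)
freqs = rng.uniform(0.05, 0.95, size=(3, 100))  # toy frequency vectors
sizes = [30, 60, 90]                            # unequal sample sizes
data = np.vstack([rng.binomial(1, p, size=(m, 100)) for p, m in zip(freqs, sizes)])
labels = np.repeat(np.arange(3), sizes)
pair_dists = centroid_distances(data, labels)
print(pair_dists)
```

The order of the returned distances follows `np.triu_indices`, that is (0,1), (0,2), (1,2) for three populations, so it can be matched against pairwise *Fst* values listed in the same order.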

In [9]:
def Euc_to_fst(vector_lib,n_comp= 5,pop_max= 8,Iter= 20,bias_range= [20,300],Eigen= False, Scale= False,Centre= True):
    ### Select pre and post processing measures. 
    
    length_haps= vector_lib.shape[1]
        
    print('length haps: {}, N iterations: {}, range pops: {}'.format(length_haps,Iter,pop_max))
    
    #### Predict
    predicted= []

    #def controled_fsts(vector_lib,Eigen,length_haps,Scale,Center,N_pops,n_comp,Iter,N_sims,MixL,MixP,Pairs):
    lengths_vector= []

    ### store distances between centroids
    biased_pairwise= []

    ### store PC projection:
    dist_PC_corrected= {x:[] for x in range(n_comp)}

    ### store fsts
    fst_store= []


    ### proceed.

    for rep in range(Iter):
        
        N_pops= np.random.choice(range(3,pop_max),1,replace= False)[0]
        
        ## Population Sizes and labels
        bias_scheme= np.random.choice(range(bias_range[0],bias_range[1]),N_pops,replace= False)
        
        bias_labels= np.repeat(np.array([x for x in range(N_pops)]),bias_scheme)
        
        ### triangular matrices extract.
        iu1= np.triu_indices(N_pops,1) # for centroid comparison

        iu_bias= np.triu_indices(sum(bias_scheme),1)

        iu_control= np.triu_indices(2,1)

        Pops= np.random.choice(vector_lib.shape[0],N_pops,replace= False)
        #print('Iter: {}, vectors selected: {}, hap length: {}'.format(rep,Pops,length_haps))
        ########## FST

        freqs_selected= vector_lib[Pops,:length_haps]
        Pairwise= Ste.return_fsts2(freqs_selected)

        #fsts_compare = scale(Pairwise.fst)
        fsts_compare= Pairwise.fst
        
        fst_store.extend(fsts_compare)

        ## lengths
        lengths_vector.extend([length_haps] * len(fsts_compare))
        
        #### generate data and perform PCA
        data= []

        for k in range(N_pops):

            probs= vector_lib[Pops[k],:]

            m= bias_scheme[k]
            Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(length_haps)] for acc in range(m)]

            data.extend(Haps)

        data2= np.array(data)

        if Scale:
            data2= scale(data2)

        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data2)

        feat_bias= pca.transform(data2)

        if Eigen:
            feat_bias= feat_bias * pca.explained_variance_ratio_

        #### Centroid distances
        
        bias_centroids= [np.mean(feat_bias[[y for y in range(feat_bias.shape[0]) if bias_labels[y] == z],:],axis= 0) for z in range(N_pops)]
        bias_centroids= np.array(bias_centroids)

        bias_pair_dist= pairwise_distances(bias_centroids,metric= 'euclidean')
        bias_pair_dist= bias_pair_dist[iu1]
        #bias_pair_dist= scale(bias_pair_dist)
        
        biased_pairwise.extend(bias_pair_dist)

    
    Size= length_haps
    fst_lm_range= [0,.3]
    
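    ### keep pairs within the Fst range; Fst values of 0 are excluded since their log is undefined.
    Lindexes= [x for x in range(len(lengths_vector)) if lengths_vector[x] == Size and fst_store[x] > fst_lm_range[0] and fst_store[x] <= fst_lm_range[1]]
    y_true= [np.log(biased_pairwise[x]) for x in Lindexes]
    fst_x= [np.log(fst_store[x]) for x in Lindexes]
    ### regress log Fst (fst_x) on log euclidean distance (y_true).
    m_coeff,b= np.polyfit(y_true,fst_x,1)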
    
    return m_coeff, b, fst_x, y_true



m_coeff, b, fst_x, y_true= Euc_to_fst(vector_lib)

length haps: 150, N iterations: 20, range pops: 8



invalid value encountered in double_scalars



In [13]:
from plotly import tools

trace1= go.Scatter(
    x= [np.exp(x) for x in fst_x],
    y= [np.exp(x) for x in y_true],
    mode= 'markers'
    )

trace2= go.Scatter(
    x= fst_x,
    y= y_true,
    mode= 'markers'
    )

fig = tools.make_subplots(rows=1, cols=2,
                         subplot_titles=('Euc to Fst','log of relationship'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)

fig['layout']['xaxis1'].update(title='Fst')
fig['layout']['xaxis2'].update(title='log Fst')

fig['layout']['yaxis1'].update(title='Euc')
fig['layout']['yaxis2'].update(title='log Euc')


# Set the figure title on the subplot layout; axis titles are already set above.
fig['layout'].update(title='Euclidean distances to Fst')
iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [22]:
from Euc_to_fst import Euc_to_fst, Fst_predict

Fst_predict(vector_lib,m_coeff,b)

length haps: 150, N iterations: 20, range pops: 8



invalid value encountered in double_scalars



### Going a bit further.

In the introduction it was mentioned that the relation observed between *Fst* and Euclidean distance, replicated above, is stable for a given number of markers. It is important to understand how the number of markers influences this relationship.

This section replicates the calculations of the first section across a range of marker numbers.
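
One reason to expect a dependence on marker number: if markers are independent, the squared Euclidean distance between two frequency vectors is a sum of L per-marker terms, so the distance itself grows roughly with the square root of L. The sketch below checks this on toy frequencies drawn from a single beta distribution (the parameter values and the helper name `mean_freq_distance` are assumptions made for illustration). Distances between centroids in PCA space should behave similarly only insofar as the retained components capture the separation between populations.

```python
import numpy as np

def mean_freq_distance(L, n_pairs=200, a=1.5, b=0.5, seed=0):
    """Mean Euclidean distance between pairs of random frequency vectors of length L."""
    rng = np.random.default_rng(seed)
    p1 = rng.beta(a, b, size=(n_pairs, L))
    p2 = rng.beta(a, b, size=(n_pairs, L))
    return np.sqrt(((p1 - p2) ** 2).sum(axis=1)).mean()

for L in [50, 200, 800]:
    d = mean_freq_distance(L)
    # d / sqrt(L) should stay roughly constant as L grows.
    print(L, round(d, 3), round(d / np.sqrt(L), 3))
```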

In [14]:

Fst_euc_dict= {}

Haplen_range= [50,1000]
haplen_step= 30 # increment in number of markers between runs.

for haplen in np.arange(Haplen_range[0],Haplen_range[1],haplen_step):
    
    Nbranches= 4 # number of axes
    L= haplen # number of markers.
    n= 100 # number of frequency vectors.
    rangeA= [1,2.5] # range along which to vary parameter a of beta dist.
    rangeB = [.1,.6] # range along which to vary parameter b of beta dist.
    steps= 20 # number of steps along ranges of parameters a and b.
    n_comp = haplen # number of components to retain in PCA of frequency vectors (>>).
    density= 50 # number of populations along each branch.


    features, vector_lib= generate_Branches_Beta(Nbranches,density,L,n,rangeA,rangeB,steps,n_comp)
    
    m_coeff, b, fst_x, y_true= Euc_to_fst(vector_lib)
    
    Fst_euc_dict[haplen]= {
        'coeff':m_coeff,
        'b': b,
        'Euclidean': y_true,
        'fst': fst_x
    }
    


length haps: 50, N iterations: 20, range pops: 8



invalid value encountered in double_scalars



length haps: 80, N iterations: 20, range pops: 8
length haps: 110, N iterations: 20, range pops: 8
length haps: 140, N iterations: 20, range pops: 8
length haps: 170, N iterations: 20, range pops: 8
length haps: 200, N iterations: 20, range pops: 8
length haps: 230, N iterations: 20, range pops: 8
length haps: 260, N iterations: 20, range pops: 8
length haps: 290, N iterations: 20, range pops: 8
length haps: 320, N iterations: 20, range pops: 8
length haps: 350, N iterations: 20, range pops: 8
length haps: 380, N iterations: 20, range pops: 8
length haps: 410, N iterations: 20, range pops: 8
length haps: 440, N iterations: 20, range pops: 8
length haps: 470, N iterations: 20, range pops: 8
length haps: 500, N iterations: 20, range pops: 8
length haps: 530, N iterations: 20, range pops: 8
length haps: 560, N iterations: 20, range pops: 8
length haps: 590, N iterations: 20, range pops: 8
length haps: 620, N iterations: 20, range pops: 8
length haps: 650, N iterations: 20, range pops: 8

In [15]:
from plotly import tools

Haps_seq= sorted(Fst_euc_dict.keys())

trace1= go.Scatter(
    x= Haps_seq,
    y= [Fst_euc_dict[x]['coeff'] for x in Haps_seq],
    mode= 'markers'
    )

trace2= go.Scatter(
    x= Haps_seq,
    y= [Fst_euc_dict[x]['b'] for x in Haps_seq],
    mode= 'markers'
    )
fig = tools.make_subplots(rows=1, cols=2,
                         subplot_titles=('coeff','b'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)


fig['layout']['xaxis1'].update(title='L')
fig['layout']['xaxis2'].update(title='L')

fig['layout']['yaxis1'].update(title='coefficient')
fig['layout']['yaxis2'].update(title='constant')


# Set the figure title on the subplot layout; axis titles are already set above.
fig['layout'].update(title='log relationship Euc to Fst')
iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [16]:
N_show= 10

showcase= np.linspace(0,len(Haps_seq) - 1,N_show)

fig= [go.Scatter(
    x= [np.exp(x) for x in Fst_euc_dict[Haps_seq[int(sel)]]['fst']],
    y= [np.exp(x) for x in Fst_euc_dict[Haps_seq[int(sel)]]['Euclidean']],
    mode= 'markers',
    name= str(Haps_seq[int(sel)])
) for sel in showcase]

layout = go.Layout(
    title= 'PCA Euc to Fst',
    yaxis=dict(
        title='Euclidean distance'),
    xaxis=dict(
        title='Fst')
)

fig= go.Figure(data=fig, layout=layout)
iplot(fig)

In [17]:
N_show= 7

showcase= np.linspace(0,len(Haps_seq) - 1,N_show)

fig= [go.Scatter(
    x= [np.exp(x) for x in Fst_euc_dict[Haps_seq[int(sel)]]['fst']],
    y= [x for x in Fst_euc_dict[Haps_seq[int(sel)]]['Euclidean']],
    mode= 'markers',
    name= str(Haps_seq[int(sel)])
) for sel in showcase]

layout = go.Layout(
    title= 'PCA Euc to Fst',
    yaxis=dict(
        title='log Euclidean distance'),
    xaxis=dict(
        title='Fst')
)

fig= go.Figure(data=fig, layout=layout)
iplot(fig)