In [3]:
import scipy
import numpy as np
import pandas as pd
import itertools as it

import collections

def recursively_default_dict():
    return collections.defaultdict(recursively_default_dict)

from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import scale

from scipy.stats import beta
from IPython.display import clear_output

import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
init_notebook_mode(connected=True)

import structure_tools.StructE_tools as Ste

### PCA and *Fst*

The fixation index, *Fst*, is one of the most commonly used statistics in population genetics. It is a special case of Wright's F-statistics. Under Wright's definition, *Fst* measures the proportion of genetic variation that is explained by population substructure.

*Fst* can also be viewed in terms of identity-by-descent. In this case, *Fst* can be interpreted as a measure of how similar two individuals from the same population are relative to two individuals drawn from the population as a whole.
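
As a point of reference, the variance-partitioning definition can be sketched for biallelic loci as Fst = (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within populations and H_T is the expected heterozygosity of the pooled population. The helper below, `fst_from_freqs`, is a hypothetical illustration of that textbook formula with equally weighted populations; it is not necessarily the estimator implemented by `Ste.return_fsts2`.

```python
import numpy as np

def fst_from_freqs(freqs):
    """Textbook Fst = (H_T - H_S) / H_T for biallelic loci.

    freqs: array of shape (n_pops, n_loci) holding allele frequencies.
    Populations are weighted equally (an assumption of this sketch).
    """
    freqs = np.asarray(freqs, dtype=float)
    # Mean expected heterozygosity within populations, summed over loci.
    h_s = np.mean(2 * freqs * (1 - freqs), axis=0).sum()
    # Expected heterozygosity of the pooled population, summed over loci.
    p_bar = freqs.mean(axis=0)
    h_t = (2 * p_bar * (1 - p_bar)).sum()
    # Without any variation Fst is undefined: return NaN instead of dividing by zero.
    if h_t == 0:
        return float("nan")
    return (h_t - h_s) / h_t

print(fst_from_freqs([[0.0, 1.0], [1.0, 0.0]]))  # populations fixed for different alleles
print(fst_from_freqs([[0.3, 0.7], [0.3, 0.7]]))  # identical populations
```

Summing the heterozygosities over loci before taking the ratio is one of several possible ways to combine loci; averaging per-locus ratios would give different values.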


Today, *Fst* is interesting not only for what it says about the underlying data, but also because of how widely it is used: reporting this measure facilitates comparison across studies. However, *Fst* can be difficult to estimate. In particular, one might not possess enough observations from one or more populations with which to calculate allele frequencies. This is particularly true for the automated analysis of many data sets with varying population sizes, which can force one to rely on conditional statements to handle such limiting cases.

This notebook provides one way to circumvent this problem. Our conclusions concern the analysis of **binary data** as a model of **phased SNPs**. The observations that led to the approach proposed here were made in the course of a study on the impact of sampling bias on the distances separating observations following dimensionality reduction through principal component analysis (PCA). While the study itself was motivated in part by my lack of knowledge in statistics, the lessons derived from it turned out to be useful (link to notebook below).

The story of this study began with a paper from 2009. In that article, the author explores the relationship between genealogy and the distribution of samples in PCA feature space (McVean 2009). The author partly formalizes the description in the form of a relationship between distances along the first principal component and mean coalescent times. The author notes the connection between this observation and that of Slatkin, who demonstrated in 1991 a relation between *Fst* and the mean coalescent time between samples (Slatkin 1991). In this context, we set out to develop the description of the relation between *Fst* and Euclidean distances in PCA space.

For this purpose, we simulated allele frequencies for multiple populations. Samples from these populations were combined to vary the number of populations and their sample sizes. Pairwise *Fst* values were calculated directly from the allele frequency vectors, and Euclidean distances were calculated between the centroids of the simulated populations in feature space. This code is reproduced below. It is encapsulated in the function `Euc_to_fst()`, itself stored in the module `Euc_to_fst`.
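
The sampling and projection steps of this procedure can be sketched as follows. This is a minimal illustration, not the notebook's `Euc_to_fst()`: the frequency vectors are drawn from an arbitrary beta distribution, the sample sizes are fixed by hand, all variable names are assumptions, and PCA is computed with a plain NumPy SVD of the centred data rather than with `sklearn.decomposition.PCA` (the centroid distances should agree up to numerical precision, since only the signs of the components may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

n_loci = 150
sizes = [20, 80, 200]  # deliberately unequal sample sizes
n_comp = 5

# One allele frequency vector per population (illustrative beta parameters).
freqs = rng.beta(0.5, 0.5, size=(len(sizes), n_loci))

# Binary haplotypes: each site is 1 with probability equal to its frequency.
haps = np.vstack([rng.binomial(1, freqs[k], size=(m, n_loci))
                  for k, m in enumerate(sizes)])
labels = np.repeat(np.arange(len(sizes)), sizes)

# PCA through SVD of the centred data matrix; rows of `feats` are the samples.
centred = haps - haps.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
feats = U[:, :n_comp] * S[:n_comp]

# Centroid of each population in feature space.
centroids = np.array([feats[labels == k].mean(axis=0)
                      for k in range(len(sizes))])

# Euclidean distance between every pair of centroids (upper triangle only).
iu = np.triu_indices(len(sizes), 1)
diffs = centroids[iu[0]] - centroids[iu[1]]
centroid_dists = np.sqrt((diffs ** 2).sum(axis=1))
print(centroid_dists)
```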

Finally, we observed that, for a given number of alleles, the relationship between Euclidean distances and pairwise *Fst* is robust to sampling bias (with a minimum sample size of N = 15, however), and that the two are linearly related in logarithmic space.
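
A linear relationship in logarithmic space corresponds to a power law on the original scale: log(Fst) = m · log(d) + b, where d is the Euclidean distance between centroids. A minimal sketch of fitting and inverting that relationship with `np.polyfit` follows. The data are synthetic and the parameter values are made up for illustration; they are not results from this notebook, and `predict_fst` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic centroid distances and Fst values following a noisy power law.
dist = rng.uniform(0.5, 5.0, size=200)
true_m, true_b = 2.0, -4.0  # arbitrary illustrative parameters
fst = np.exp(true_m * np.log(dist) + true_b + rng.normal(0, 0.05, size=200))

# Fit log(Fst) = m * log(distance) + b.
m_hat, b_hat = np.polyfit(np.log(dist), np.log(fst), 1)

def predict_fst(distance, m, b):
    """Back-transform the log-log regression to the Fst scale."""
    return np.exp(m * np.log(distance) + b)

print(m_hat, b_hat)
print(predict_fst(2.0, m_hat, b_hat))
```

Regressing log *Fst* on log distance, rather than the reverse, keeps the fitted line directly usable for prediction from PCA coordinates.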

**What does this mean?**

The implication of that result is that PCA can be used to estimate *Fst*. The *limitation* is that any two points used for this inference are assumed to stand at the centroids of stationary populations whose allele frequencies follow a beta distribution.

Despite this limitation, I believe this approach to be of some use for descriptive analysis. The function `Euc_to_fst()` has been packaged into a script of the same name in this directory. Its use **should be accompanied by a description of the assumptions made**.


- Slatkin, M. (1991). Inbreeding coefficients and coalescence times. Genetical Research, 58(2), 167-175. doi:10.1017/S0016672300029827

- McVean G. 2009. A genealogical interpretation of principal components analysis. PLoS Genet. 5:e1000686.

#### Other Notebooks

- [7. Controlling for size](https://nbviewer.jupyter.org/github/SantosJGND/Stats_Lab/blob/master/7.%20Controlling%20for%20size.ipynb), where the idea was formed.

- [8. Machine Learning for Genomics](https://nbviewer.jupyter.org/github/SantosJGND/Stats_Lab/blob/master/8.%20Machine%20Learning%20for%20Genomics.ipynb), a practical application.

Visit the [Stats Lab](https://github.com/SantosJGND/Stats_Lab) for an organised exposition of methods and approaches I have found useful in dealing with large-scale genomics data sets.

### Generating simulated populations

The function below begins by generating a predetermined number of allele frequency vectors by drawing frequencies from the beta distribution (scipy). This is followed by a manipulation of this data set to produce allele frequency vectors within an acceptable range of genetic distances (Fst). 
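
Drawing a library of allele frequency vectors from beta distributions can be sketched as below. The shape parameters, the number of vectors and their length are arbitrary choices for illustration, and the names `shapes` and `sim_vectors` are assumptions. NumPy's generator is used here; `scipy.stats.beta.rvs` would serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(42)

n_vectors = 300   # number of frequency vectors (one per simulated population)
n_loci = 150      # length of each vector

# Several (a, b) shape pairs give a mix of frequency spectra.
shapes = [(0.5, 0.5), (1.0, 3.0), (2.0, 2.0)]
picks = rng.integers(len(shapes), size=n_vectors)

# Each row is one population: n_loci frequencies drawn from its beta distribution.
sim_vectors = np.array([rng.beta(*shapes[i], size=n_loci) for i in picks])

print(sim_vectors.shape)
print(sim_vectors.min() >= 0, sim_vectors.max() <= 1)
```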

The procedure of extending the range of genetic distances available in the final vector data set is explored in another notebook (link below). PCA is used to produce equally distant populations varying around a common centre along a given number of axes meeting at a common point.

- [1. Generate haplotypes](https://nbviewer.jupyter.org/github/SantosJGND/Stats_Lab/blob/master/1.%20Generating_haplotypes.ipynb)

In [4]:
range_select= True

origin= True

In [5]:
Home= 'CLfreq_one' + '/'

### Freqs

filename= Home + 'CLfreq_one_freqs.txt'

freqs_dict= recursively_default_dict()
freqs_matrix= []

with open(filename,'r') as Input:
    for line in Input:
        line= line.split()
        # the first three columns are keys; the remaining columns are allele frequencies
        freqs= [float(x) for x in line[3:]]

        freqs_matrix.append(freqs)
        freqs_dict[int(line[0])][float(line[1])][float(line[2])]= freqs



In [6]:
### An initial filtering:
freq_threshold= 0.1

where_NA= [x for x in freqs_matrix if max(x) >= freq_threshold]

print('{} max freq below threshold'.format(len(freqs_matrix) - len(where_NA)))

freqs_matrix= np.array(freqs_matrix)

73 max freq below threshold


In [7]:

n_chose= 200
Chose= np.random.choice(range(freqs_matrix.shape[0]),n_chose)

Across= list(it.chain(*[freqs_matrix[x] for x in Chose]))

X_plot = np.linspace(0, 1, 1000)

freq_kde = KernelDensity(kernel='gaussian', bandwidth=0.02).fit(np.array(Across).reshape(-1,1))

log_dens = freq_kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens= [go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'allele frequency',
                            line=dict(color='blue', width=2))]
##

layout= go.Layout(
    title= 'allele frequency distribution across clusters',
    yaxis= dict(
        title= 'density'
    ),
    xaxis= dict(
        title= 'frequency'
    )
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)


In [8]:
from structure_tools.Generate_freq_vectors import generate_vectors_kde

# We must first define the number of populations, the length of the haplotypes desired, and their respective population sizes
L= 150
n= 300

n_comp = L
density= 50


if origin:
    vector_lib= np.array(freqs_matrix)
else:
    vector_lib= generate_vectors_kde(freq_kde,L,n)


## Distances

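The following function performs the *Fst* and Euclidean distance calculations.

**returns**

m_coeff / b: slope and intercept of the linear regression of log *Fst* on log Euclidean distance.

biased_pairwise: Euclidean distances between centroids for all population pairs, before filtering by *Fst* range.

fst_x: log of measured pairwise population *Fst*, restricted to the range given by `fst_ranfer`.

y_true: log of Euclidean distances between centroids, for the same pairs as `fst_x`.

var_comp_store: data frame with the number of populations (K) and the proportion of variance explained by each PC, one row per iteration (returned only if `return_var` is True).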

In [9]:

def Euc_to_fst(vector_lib,n_comp= 5,pop_max= 6,Iter= 20,bias_range= [7,300],Eigen= False, Scale= False,Centre= True, return_var= True,ploidy= 1,fst_ranfer= [0.01,.3]):
    ### Select pre and post processing measures. 
        
    length_haps= vector_lib.shape[1]
    
    #### Predict
    predicted= []

    #def controled_fsts(vector_lib,Eigen,length_haps,Scale,Center,N_pops,n_comp,Iter,N_sims,MixL,MixP,Pairs):
    lengths_vector= []

    ### store distances between centroids
    biased_pairwise= []

    ### store PC projection:
    dist_PC_corrected= {x:[] for x in range(n_comp)}

    ### store fsts
    fst_store= []
    var_comp_store= []

    ### proceed.

    for rep in range(Iter):
        clear_output()
        
        N_pops= np.random.choice(range(3,pop_max),1,replace= False)[0]
        
        ## Population Sizes and labels
        bias_scheme= np.random.choice(range(bias_range[0],bias_range[1]),N_pops,replace= False)
        
        bias_labels= np.repeat(np.array([x for x in range(N_pops)]),bias_scheme)
        
        ### triangular matrices extract.
        iu1= np.triu_indices(N_pops,1) # for centroid comparison

        iu_bias= np.triu_indices(sum(bias_scheme),1)

        iu_control= np.triu_indices(2,1)

        Pops= np.random.choice(vector_lib.shape[0],N_pops,replace= False)
        print('Iter: {}, sampling: {}, hap length: {}'.format(rep,bias_scheme,length_haps))
        ########## FST

        freqs_selected= vector_lib[Pops,:length_haps]
        Pairwise= Ste.return_fsts2(freqs_selected)

        #fsts_compare = scale(Pairwise.fst)
        fsts_compare= Pairwise.fst
        
        fst_store.extend(fsts_compare)

        ## lengths
        lengths_vector.extend([length_haps] * len(fsts_compare))
        
        #### generate data and perform PCA
        data= []

        for k in range(N_pops):

            probs= vector_lib[Pops[k],:]

            m= bias_scheme[k]
            Haps= [[np.random.choice([ploidy,0],p= [1-probs[x],probs[x]]) for x in range(length_haps)] for acc in range(m)]

            data.extend(Haps)

        data2= np.array(data)

        if Scale:
            data2= scale(data2)

        pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(data2)
        local_pcvar= list(pca.explained_variance_ratio_)
        
        local_pcvar= [N_pops,*local_pcvar]
        var_comp_store.append(local_pcvar)
        
        feat_bias= pca.transform(data2)

        if Eigen:
            feat_bias= feat_bias * pca.explained_variance_ratio_

        #### Centroid distances
        
        bias_centroids= [np.mean(feat_bias[[y for y in range(feat_bias.shape[0]) if bias_labels[y] == z],:],axis= 0) for z in range(N_pops)]
        bias_centroids= np.array(bias_centroids)

        bias_pair_dist= pairwise_distances(bias_centroids,metric= 'euclidean')
        bias_pair_dist= bias_pair_dist[iu1]
        #bias_pair_dist= scale(bias_pair_dist)
        
        biased_pairwise.extend(bias_pair_dist)

    var_comp_store= np.array(var_comp_store)
    var_comp_store= pd.DataFrame(var_comp_store,columns=['K',*['PC' + str(x + 1) for x in range(n_comp)]])
    
    Size= length_haps
    fst_lm_range= fst_ranfer
    
    Lindexes= [x for x in range(len(lengths_vector)) if lengths_vector[x] == Size and fst_store[x] >= fst_lm_range[0] and fst_store[x] <= fst_lm_range[1]]
    y_true= [np.log(biased_pairwise[x]) for x in Lindexes]
    fst_x= [np.log(fst_store[x]) for x in Lindexes]
    m_coeff,b= np.polyfit(y_true,fst_x,1)
    
    if return_var:
        return m_coeff, b, biased_pairwise, fst_x, y_true, var_comp_store
    else:
        return m_coeff, b, biased_pairwise, fst_x, y_true



ploidy= 2
Iter= 200
n_comp= 5

m_coeff, b, distances, fst_x, y_true, var_comp_store= Euc_to_fst(vector_lib,n_comp= n_comp,Iter= Iter,ploidy= ploidy)

Iter: 199, sampling: [207  61  71  98 233], hap length: 150


### Euclidean distances in feature space and *Fst*

In [10]:
from plotly import tools
trim= list(range(len(fst_x)))

trace1= go.Scatter(
    x= [np.exp(fst_x[x]) for x in trim],
    y= [np.exp(y_true[x]) for x in trim],
    mode= 'markers'
    )

trace2= go.Scatter(
    x= [fst_x[x] for x in trim],
    y= [y_true[x] for x in trim],
    mode= 'markers'
    )

fig = tools.make_subplots(rows=1, cols=2,
                         subplot_titles=('Euc to Fst','log of relationship'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)

fig['layout']['xaxis1'].update(title='Fst')
fig['layout']['xaxis2'].update(title='log Fst')

fig['layout']['yaxis1'].update(title='Euc')
fig['layout']['yaxis2'].update(title='log Euc')


# A separate go.Layout is not needed here: updating the layout of the
# subplot figure keeps the per-panel axis titles set above.
fig['layout'].update(title= 'Euclidean distances to Fst')

iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



### Variance captured

In [11]:
var_comp_store.head()

| | K | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|---|
| 0 | 3.0 | 0.88342 | 0.067438 | 0.01148 | 0.006022 | 0.004991 |
| 1 | 3.0 | 0.638633 | 0.170385 | 0.0382 | 0.024521 | 0.017397 |
| 2 | 5.0 | 0.816314 | 0.077741 | 0.012923 | 0.010855 | 0.01055 |
| 3 | 3.0 | 0.664306 | 0.305888 | 0.010615 | 0.007531 | 0.002018 |
| 4 | 5.0 | 0.550664 | 0.166196 | 0.036665 | 0.017958 | 0.015493 |


In [12]:

PC_list= ['PC' + str(x + 1) for x in range(n_comp)]
total_var= var_comp_store[PC_list].sum(axis= 1)



X_plot = np.linspace(-0.1, 1.1, 100)

kde = KernelDensity(kernel='gaussian', bandwidth=0.02).fit(np.array(total_var).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens= [go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'variance captured',
                            line=dict(color='blue', width=2))]
##


layout= go.Layout(
    title= 'ncomp: {}'.format(n_comp),
    yaxis= dict(
        title= 'density'
    ),
    xaxis= dict(
        title= 'variance explained'
    )
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)

### Use for prediction

In [13]:
from structure_tools.Euc_to_fst import Euc_to_fst, Fst_predict

Fst_predict(vector_lib,m_coeff,b,ploidy= ploidy,pop_max=7)

length haps: 150, N iterations: 20, range pops: 7



invalid value encountered in double_scalars



### Store Euc to Fst coefficients 

In [15]:
Fst_euc_dict= {}

haplen= 150

ploidy= 2

    
Fst_euc_dict[haplen]= {
    'coeff':m_coeff,
    'b': b,
    'Euclidean': y_true,
    'fst': fst_x
}    


In [18]:
import os

try:
    import cPickle as pickle
except ImportError:  # python 3.x
    import pickle

    
Home= 'Complementary_data'
filename= Home+ '/coeff_library_realfreqs.p'

coeff_lib= {}

new_coeff= {
    x:{
        'coeff': Fst_euc_dict[x]['coeff'],
        'b': Fst_euc_dict[x]['b']
    } for x in Fst_euc_dict.keys()
}

coeff_lib[ploidy]= new_coeff

os.makedirs(os.path.dirname(filename), exist_ok=True)

with open(filename, 'wb') as fp:
    pickle.dump(coeff_lib, fp, protocol=pickle.HIGHEST_PROTOCOL)
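
To reuse the stored coefficients later, the library can be read back and applied to a new centroid distance. The sketch below writes a small placeholder library to a temporary directory so that it runs on its own. The coefficient values, the file name `coeff_library_demo.p` and the distance are made up; the nesting by ploidy and then haplotype length mirrors `coeff_lib` in the cell above.

```python
import os
import pickle
import tempfile

import numpy as np

# Placeholder library with the same nesting: ploidy -> haplotype length -> parameters.
demo_lib = {2: {150: {'coeff': 2.0, 'b': -4.0}}}  # made-up values

path = os.path.join(tempfile.mkdtemp(), 'coeff_library_demo.p')
with open(path, 'wb') as fp:
    pickle.dump(demo_lib, fp, protocol=pickle.HIGHEST_PROTOCOL)

# Read the library back and select the parameters for ploidy 2, length 150.
with open(path, 'rb') as fp:
    loaded = pickle.load(fp)
params = loaded[2][150]

# Predict Fst from a Euclidean distance between two centroids in PCA space.
distance = 1.5
fst_hat = np.exp(params['coeff'] * np.log(distance) + params['b'])
print(fst_hat)
```

The coefficients only apply to data with the same ploidy and number of sites as the simulations they were fitted on, which is why both are used as keys.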
    