In [1]:
import numpy as np
import pandas as pd

file     = "./10900_Invers_ScanResults.txt"
outDir   = "./SumStatsVecs"

stats = ["Hscan_v1.3_H12", "pcadapt_3.0.4_ALL_log10p", "OutFLANK_0.2_He", "LFMM_ridge_0.0_ALL_log10p",
        "LFMM_lasso_0.0_ALL_log10p", "rehh_2.0.2_ALL_log10p", "Spearmans_ALL_rho", "a_freq_final", 
        "pcadapt_3.0.4_PRUNED_log10p"]

!mkdir -p $outDir

## Objective : 
### Turn '10900_Invers_ScanResults.txt' into a set of 12 feature vectors, one for each class

## Steps:
1) Read the text file into a dataframe

2) Scale all the statistics

3) Label each SNP based on region and muttype

4) Split the dataframe based on label and print each class to its own file

## Read the text file into a dataframe

Pandas has a few convenient function for reading in text files:

df = pd.read_csv(filepath, sep, header,...)


In [3]:
## Did the file get read correctly?

print(featDf.head(3))
print(featDf.shape)
print(featDf.columns)
print(featDf.chrom)

   vcf_ord  pos  chrom  a_freq_old muttype            unique  for_relatedness  \
0        1   62      1      0.3005    MT=1   10900_62_MT=1_1             True   
1        4  392      1      0.2035    MT=1  10900_392_MT=1_4            False   
2        6  445      1      0.0720    MT=1  10900_445_MT=1_6            False   

   a_freq_final  keep_loci  simID ... rehh_2.0.2_ALL_iHS  \
0      0.299197       True  10900 ...                NaN   
1      0.203313       True  10900 ...                NaN   
2      0.072289       True  10900 ...                NaN   

   rehh_2.0.2_ALL_log10p  Spearmans_ALL_rho  selCoef  originGen  freq_old  \
0                    NaN           0.050616      NaN        NaN       NaN   
1                    NaN          -0.025850      NaN        NaN       NaN   
2                    NaN          -0.031636      NaN        NaN       NaN   

   freq_final  pa2  prop  He  
0         NaN  NaN   NaN NaN  
1         NaN  NaN   NaN NaN  
2         NaN  NaN   NaN NaN  



In [16]:
## The data has a 'keep_loci' column, saying whether an SNP passed the filters or not. We should get rid of the
## SNPs that didn't pass

featDf.shape

(6909, 37)

### 'group_by' function

`DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)`

group_by is a very useful function. You can call it on your dataframe with a function or label to group it by to get a group_by object that consists of the dataframe split up into different dataframes based on the function or label. You can access groups with get_group:

`GroupBy.get_group(name, obj=None)`

name is the name of the group, obj is the group_by object to take it from (default is the object it was called on)

In [5]:
## SNPs on chromosome 9 have variable recombination, meaning this 
## chromosome is not the same in every simulation and it is hard to
## classify. How can we just remove this from our data?

## The group_by() function is well suited to the task

features.shape

(6220, 37)

## Scale the statistics

Machine learning features should be scaled before they are given to a classifier. In this case, the scaling we want to do works like this: 

Take the statistic in one column. If it is negative at any of the SNPs, add the smallest value of the statistic to every value in the column. Now, take the sum of the column and divide each value in the column by that sum. Repeat for each column

All features are now between 0 and 1.

I'll give you a scale stats function I wrote to do the math with a single column, but we have to figure out how to scale every column with it.

In [6]:
def scaleStats(statSeries):
    #### some of the values for pcadaptlog10p were 'Inf'. This breaks some of the math, so I replaced the values
    #### with a very large log10p value of 400, which represents an p-value extremely close to 0 and lower than 
    #### any of the non-Inf p-values
    statSeries.replace('Inf', 400, inplace = True)
    statSeries = pd.to_numeric(statSeries, errors = 'coerce')
    
    # if there are any negative values, scale by addition first
    minStat = statSeries.min()
    if minStat < 0: 
        statSeries = statSeries + minStat
    
    # scale by dividing values by the sum
    if statSeries.sum() != 0: 
        return(statSeries.divide(statSeries.sum(), fill_value = 0))
    else:
        return(statSeries)

### The apply function

`DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)`

func -- function to apply

axis = 0 applies by column, axis = 1 applies by row

In [7]:
## How can we scale every without using a loop (too slow)?
## Scaling columns such as 'chrom' or 'pos' doesn't make sense, how do we only scale the columns we want?

## The 'apply' function allows us to do it in a single line

scaledFeatures.shape

(6220, 9)

## Label the SNPs

The snps are labeled based on their position and their muttype. We can easily write a function to take a SNPs position and muttype and return a label, but how can we 'apply' this function if it takes variables in two different columns?

# Explanation of Labels

possible muttypes:

 - neutral
 - QTN         : can be either large effect (>.20 of variation in phenotype) or small effect (<.20 of variation)
 - deleterious : mutation that negatively effects fitness
 - sweep       : mutation that has become fixed and is expected to show evidence of a selective sweep around it

possible regions:

 - Background selection : any SNP in the 10,000bp region where deleterious mutations occurred 
 - Near Selective Sweep : within 1,000bp of the selective sweep
 - Far Selective Sweep  : 1,000-2,000bp from the selective sweep
 - large QTN linked     : within 200bp of a QTN of large effect
 - small QTN linked     : within 200bp of a QTN of small effect
 - inversion            : in an inversion
 - low recombination    : in a region of low recombination
 

In [8]:
def findLabel(pos, muttype):
    # 1 = neut, 2 = QTN, 3 = delet, 4 = sweep
    muttypes = {"MT=1" : "neut", 
                "MT=2" : "QTN",
                "MT=3" : "delet",
                "MT=4" : "sweep",
                "MT=5" : "neut"}         ### MT=5 is a artifact from SLiM to preserve the inversion
    try:
        mtLabel = muttypes[muttype]
    except KeyError:
        warnings.warn("Unknown muttype " + muttype)
        mtLabel = "INVALID"
    
    pos = float(pos)
    if  200001 <= pos <= 230000 or  270001 <= pos <= 280000:
        region = "BS"
    elif 174000 <= pos <= 176000:
        region = "NearSS"
    elif 173000 <= pos <= 17399 or 176001 <= pos <= 177000:
        region = "FarSS"
    elif 320000 <= pos <= 330000:
        region = "invers"
    elif 370000 <= pos <= 380000:
        region = "lowRC"
    else:
        region = "neutral"
    return "MT=" + mtLabel + "_R=" + region

### The 'insert' function

`DataFrame.insert(loc, column, value, allow_duplicates=False)`

Pretty simple function that inserts the 'value' at the given 'loc' and names the new column 'column'

In [9]:
## What's the best way to apply a function to two columns of a data frame?

## Answer from stack overflow:
## rewrite the function to take a pandas series. Apply the function row wise
## (https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe)


print(scaledFeatures.shape)
print(scaledFeatures.head())

(6220, 10)
          classLabel  Hscan_v1.3_H12  pcadapt_3.0.4_ALL_log10p  \
0  MT=neut_R=neutral             0.0                       0.0   
1  MT=neut_R=neutral             0.0                       0.0   
2  MT=neut_R=neutral             0.0                       0.0   
3  MT=neut_R=neutral             0.0                       0.0   
4  MT=neut_R=neutral             0.0                       0.0   

   OutFLANK_0.2_He  LFMM_ridge_0.0_ALL_log10p  LFMM_lasso_0.0_ALL_log10p  \
0         0.000313                   0.000121                   0.000008   
1         0.000242                   0.000142                   0.000120   
2         0.000100                   0.000089                   0.000040   
3         0.000017                   0.000085                   0.000312   
4         0.000022                   0.000533                   0.000178   

   rehh_2.0.2_ALL_log10p  Spearmans_ALL_rho  a_freq_final  \
0                    0.0           0.000137      0.000223   
1            

In [10]:
## Now we need to add some more labels -- MT=2 means a QTN but does not distinguish between large and small QTNS
## In addition, there is no muttype for 'linked to a large QTN'. We are going to need the position column to 
# locate the linked alleles and the proportion column to differentiate the large and small QTNs


# add pos and prop back in to locate QTNs of large and small effect
scaledFeatures.insert(loc = 0, column = 'pos', value = features['pos'].astype("float"))
scaledFeatures.insert(loc = 0, column = 'prop', value = pd.to_numeric(features['prop'], errors = "coerce"))

### The loc function:

`DataFrame.loc[]`

loc accesses a group of rows or columns directly from a dataframe. You can modify the loc object directly, it is not a copy

## The 'isin' function

`DataFrame.isin(values)`

Returns a boolean DataFrame showing whether each element in the dataframe is contained in values

In [11]:
## Now that we have the proportions, we need to actually filter through them find the large and small QTNs
## The rule is, < 20 % proportion is a small QTN and > 20% proportion is a large QTN. This only applies to 
## SNPs marked as QTNs in the first place.

## We can do this with no additional functions, just some more complicated subsetting


## update the labels --  use the 'loc' function and the 'isin' function
## loc function avoids chain indexing, which is important when setting values in the dataframe


### The 'between' function

`Series.between(left, right, inclusive=True)`

Takes a series, returns a boolean series indicating whether each element in the series is between 'left' and 'right'

In [12]:
## Now we need to label all the QTN linked SNPs --- these are defined as any SNP within 200bp of a QTN.
## Small and large QTN linked SNPs are labeled differently, and an SNP that is within 200bp of a large and a small
## QTN should be labeled as large QTN linked

for site in smallQTNs:
    lower = site - 200
    upper = site + 200
    ### use loc to access and change the label of all SNPs between lower and upper, EXCEPT the QTN itself
    
for site in largeQTNs:
    lower = site - 200
    upper = site + 200
    ### again, access and change the label using loc
        
## Can you think of a way to do this that doesn't use a for loop?

## Write the data to separate files, based on the class

Almost finished! Let's drop the columns that don't contain statistics, then split up the dataframe based on class_label (sounds like another job for groupby)

In [13]:
## Use the 'drop' function to remove the 'pos' and 'prop' columns

print(scaledFeatures.head())
print(scaledFeatures.shape)

          classLabel  Hscan_v1.3_H12  pcadapt_3.0.4_ALL_log10p  \
0  MT=neut_R=neutral             0.0                       0.0   
1  MT=neut_R=neutral             0.0                       0.0   
2  MT=neut_R=neutral             0.0                       0.0   
3  MT=neut_R=neutral             0.0                       0.0   
4  MT=neut_R=neutral             0.0                       0.0   

   OutFLANK_0.2_He  LFMM_ridge_0.0_ALL_log10p  LFMM_lasso_0.0_ALL_log10p  \
0         0.000313                   0.000121                   0.000008   
1         0.000242                   0.000142                   0.000120   
2         0.000100                   0.000089                   0.000040   
3         0.000017                   0.000085                   0.000312   
4         0.000022                   0.000533                   0.000178   

   rehh_2.0.2_ALL_log10p  Spearmans_ALL_rho  a_freq_final  \
0                    0.0           0.000137      0.000223   
1                    0.0

In [14]:
## use 'groupby' to group the columns based on class label

print(labelGrouped)

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7fc1875a64e0>


### 'to_csv' function

`DataFrame.to_csv(path_or_buf=None, sep=', ', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')`

Call on a dataframe object to print the dataframe to a csv file. Ex:

`df.to_csv(filename, sep = " ", index = False, header = True`

In [15]:
## go through each group, generate a file name, and use the pandas function group.to_csv to print to file

# loop through the groupby object:
        outfile     = outDir + "/" + name + ".fvec"
        outfile     = outfile.replace("=", "-")         ## unix doesn't like havign '=' in file names
        
        with open(outfile, 'a') as f:
            ## apply the .to_csv function here

MT-delet_R-BS.fvec	   MT-neut_R-invers.fvec     MT-neut_R-neutral.fvec
MT-lgQTN_R-lgQTNlink.fvec  MT-neut_R-lgQTNlink.fvec  MT-neut_R-smQTNlink.fvec
MT-neut_R-BS.fvec	   MT-neut_R-lowRC.fvec      MT-smQTN_R-smQTNlink.fvec
MT-neut_R-FarSS.fvec	   MT-neut_R-NearSS.fvec     MT-sweep_R-NearSS.fvec



classLabel	Hscan_v1.3_H12	pcadapt_3.0.4_ALL_log10p	OutFLANK_0.2_He	LFMM_ridge_0.0_ALL_log10p	LFMM_lasso_0.0_ALL_log10p	rehh_2.0.2_ALL_log10p	Spearmans_ALL_rho	a_freq_final	pcadapt_3.0.4_PRUNED_log10p
MT=sweep_R=NearSS	0.0004853543746082687	0.0	5.544337452407637e-05	6.855550384918943e-05	9.133878596663905e-05	0.0	0.000126218561145908	0.0007156107017992282	0.00014585300829671226


In [None]:
!ls $outDir
print("\n\n")
!head $outfile