## Some basics first



Pandas is built around two data structures:

 - pandas.Series
 - pandas.DataFrame

In [12]:
## A Series is a one-dimensional labeled array that can hold any data type
## from 10 minutes to pandas (https://pandas.pydata.org/pandas-docs/stable/10min.html#min)

import numpy as np
import pandas as pd

s = pd.Series([1,3,5,np.nan,6,8])
print(s)
s[0]

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


1.0

In [13]:
## A DataFrame is a two-dimensional labeled data structure that can be subset into Series objects
## from 10 minutes to pandas

dates = pd.date_range('20130101', periods=6)
print(dates)
df    = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
print()
print(df.A)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2013-01-01  1.501225 -0.129887 -1.258921 -1.118677
2013-01-02  1.529092  0.244244  0.731982 -0.519929
2013-01-03  0.183809  0.230549  1.260948 -0.926389
2013-01-04  1.148851 -0.676734 -3.173422  1.308157
2013-01-05 -0.628584 -0.627767  1.257109 -0.919234
2013-01-06  0.997157 -0.374610 -0.009120 -0.332117

2013-01-01    1.501225
2013-01-02    1.529092
2013-01-03    0.183809
2013-01-04    1.148851
2013-01-05   -0.628584
2013-01-06    0.997157
Freq: D, Name: A, dtype: float64


## Our data

In [14]:
file     = "./10900_Invers_ScanResults.txt"
outDir   = "./SumStatsVecs"

stats = ["Hscan_v1.3_H12", "pcadapt_3.0.4_ALL_log10p", "OutFLANK_0.2_He", "LFMM_ridge_0.0_ALL_log10p",
        "LFMM_lasso_0.0_ALL_log10p", "rehh_2.0.2_ALL_log10p", "Spearmans_ALL_rho", "a_freq_final", 
        "pcadapt_3.0.4_PRUNED_log10p"]

!mkdir -p $outDir

## Objective : 
### Turn '10900_Invers_ScanResults.txt' into a set of 12 feature vectors, one for each class

## Steps:
1) Read the text file into a dataframe

2) Scale all the statistics

3) Label each SNP based on region and muttype

4) Split the dataframe based on label and print each class to its own file

## Read the text file into a dataframe

Pandas has a few convenient functions for reading in text files, such as `read_csv`:

`df = pd.read_csv(filepath, sep, header,...)`
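As a rough sketch of how `read_csv` behaves, the example below reads a tiny made-up table from an in-memory string instead of the real scan-results file. The column names, the values, and the whitespace delimiter are all assumptions for illustration; check the real file to see which separator it uses.

```python
import io
import pandas as pd

# Hypothetical stand-in for a whitespace-delimited text file;
# the column names and values are invented for this example.
toy_text = """pos chrom stat_a stat_b
10 1 0.5 -1.2
20 1 0.1 0.3
30 2 0.9 2.4
"""

# sep=r"\s+" splits on any run of whitespace;
# header=0 uses the first line as the column names
toy_df = pd.read_csv(io.StringIO(toy_text), sep=r"\s+", header=0)

print(toy_df.shape)
print(list(toy_df.columns))
```

Passing a file path instead of the `StringIO` object works the same way.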


In [None]:
## Did the file get read correctly?

print(featDf.head(3))
print(featDf.shape)
print(featDf.columns)
print(featDf.chrom)

In [None]:
## The data has a 'keep_loci' column, saying whether an SNP passed the filters or not. We should get rid of the
## SNPs that didn't pass

featDf.shape

### 'groupby' function

`DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)`

`groupby` is a very useful function. Call it on your dataframe with a column label (or a function) to group by, and you get a GroupBy object that represents the dataframe split into one smaller dataframe per group. You can access a single group with `get_group`:

`GroupBy.get_group(name, obj=None)`

`name` is the name of the group; `obj` is the dataframe to take the group from (default is the object `groupby` was called on)
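Here is a minimal sketch of grouping and pulling out one group, using a toy dataframe whose columns and values are made up for illustration:

```python
import pandas as pd

# Toy data; the values are invented for this example
toy = pd.DataFrame({
    "chrom": [1, 1, 2, 9, 9],
    "pos":   [10, 20, 30, 40, 50],
})

by_chrom = toy.groupby("chrom")

# The group names are the distinct values of the grouping column
print(sorted(by_chrom.groups.keys()))

# get_group returns the rows belonging to one group as a dataframe
chrom9 = by_chrom.get_group(9)
print(chrom9)
```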

### Drop function

`DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')`

You can call .drop on a dataframe to remove rows or columns from it. It returns a new dataframe unless you pass inplace=True.
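A small sketch of both uses of `drop`, on toy data with invented column names: removing rows by index label and removing a column by name.

```python
import pandas as pd

# Toy data; the values are invented for this example
toy = pd.DataFrame({
    "chrom":  [1, 9, 2, 9],
    "pos":    [10, 20, 30, 40],
    "stat_a": [0.1, 0.2, 0.3, 0.4],
})

# Drop rows by index label: here, the labels of every chromosome 9 row
no_chrom9 = toy.drop(index=toy.index[toy["chrom"] == 9])

# Drop a column by name; the original dataframe is left unchanged
stats_only = no_chrom9.drop(columns=["pos"])

print(no_chrom9)
print(stats_only)
```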

In [None]:
## SNPs on chromosome 9 have variable recombination, meaning this 
## chromosome is not the same in every simulation and it is hard to
## classify. How can we just remove this from our data?

## The groupby() function is well suited to the task

features.shape

## Scale the statistics

Machine learning features should be scaled before they are given to a classifier. In this case, the scaling we want to do works like this: 

Take the statistic in one column. If it is negative at any of the SNPs, subtract the smallest (most negative) value of the statistic from every value in the column, so that the minimum becomes 0. Now take the sum of the column and divide each value in the column by that sum. Repeat for each column.

All features are now between 0 and 1.
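To make the arithmetic concrete, here is a worked sketch on a three-value toy column. It assumes the shift step means moving the column so that its smallest value becomes 0, which is what keeps the final values between 0 and 1.

```python
import pandas as pd

# Toy column with one negative value (values invented for this example)
col = pd.Series([-2.0, 0.0, 2.0])

# Step 1: if anything is negative, shift so the minimum becomes 0
min_val = col.min()
if min_val < 0:
    col = col - min_val          # now 0.0, 2.0, 4.0

# Step 2: divide every value by the column sum (6.0 here)
total = col.sum()
scaled = col / total if total != 0 else col

print(scaled.tolist())
```

After scaling, the column sums to 1, so every value falls between 0 and 1.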

I'll give you a scale stats function I wrote to do the math with a single column, but we have to figure out how to scale every column with it.

In [None]:
def scaleStats(statSeries):
    #### some of the values for pcadaptlog10p were 'Inf'. This breaks some of the math, so I replaced the values
    #### with a very large log10p value of 400, which represents a p-value extremely close to 0 and lower than 
    #### any of the non-Inf p-values
    statSeries = statSeries.replace('Inf', 400)   # returns a copy, so the caller's data is left untouched
    statSeries = pd.to_numeric(statSeries, errors = 'coerce')
    
    # if there are any negative values, shift the column so that its minimum is 0
    minStat = statSeries.min()
    if minStat < 0: 
        statSeries = statSeries - minStat
    
    # scale by dividing values by the sum (missing values are filled with 0 first)
    if statSeries.sum() != 0: 
        return(statSeries.divide(statSeries.sum(), fill_value = 0))
    else:
        return(statSeries)

### The apply function

`DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)`

func -- function to apply

axis = 0 applies by column, axis = 1 applies by row
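As a sketch of column-wise `apply`, the example below runs a simple function on a chosen subset of columns. The dataframe, the `stat_cols` list, and the `divide_by_sum` helper are all hypothetical stand-ins, not the workshop's data or scaling function.

```python
import pandas as pd

# Toy data; 'pos' is a column we do not want to scale
toy = pd.DataFrame({
    "pos":    [10, 20, 30],
    "stat_a": [1.0, 1.0, 2.0],
    "stat_b": [2.0, 6.0, 2.0],
})

stat_cols = ["stat_a", "stat_b"]   # hypothetical list of columns to scale

def divide_by_sum(column):
    # 'column' arrives as a Series, one per selected column
    return column / column.sum()

# Select only the columns of interest, then apply the function to each one
scaled = toy[stat_cols].apply(divide_by_sum, axis=0)

print(scaled)
```

Selecting the columns first means `pos` never reaches the function.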

In [None]:
## How can we scale every column without using a loop (too slow)?
## Scaling columns such as 'chrom' or 'pos' doesn't make sense, how do we only scale the columns we want?

## The 'apply' function allows us to do it in a single line

scaledFeatures.shape

## Label the SNPs

The SNPs are labeled based on their position and their muttype. We can easily write a function that takes a SNP's position and muttype and returns a label, but how can we 'apply' this function if it takes values from two different columns?
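One possible approach, sketched on toy data: write the function so it takes a whole row, then apply it with `axis=1`. The `toy_label` rule below is invented purely to show the mechanics and is not the labeling scheme used in this workshop.

```python
import pandas as pd

# Toy data; the values are invented for this example
toy = pd.DataFrame({
    "pos":     [100, 250, 900],
    "muttype": ["MT=1", "MT=2", "MT=1"],
})

def toy_label(row):
    # 'row' is a Series holding one row, so both columns are available by name.
    # Hypothetical rule: tag each SNP by which side of position 500 it falls on.
    side = "left" if row["pos"] < 500 else "right"
    return row["muttype"] + "_" + side

# axis=1 passes each row to the function, one at a time
labels = toy.apply(toy_label, axis=1)

print(labels.tolist())
```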

### Explanation of labels

possible muttypes:

 - neutral
 - QTN         : can be either large effect (>.20 of variation in phenotype) or small effect (<.20 of variation)
 - deleterious : mutation that negatively affects fitness
 - sweep       : mutation that has become fixed and is expected to show evidence of a selective sweep around it

possible regions:

 - Background selection : any SNP in the 10,000bp region where deleterious mutations occurred 
 - Near Selective Sweep : within 1,000bp of the selective sweep
 - Far Selective Sweep  : 1,000-2,000bp from the selective sweep
 - large QTN linked     : within 200bp of a QTN of large effect
 - small QTN linked     : within 200bp of a QTN of small effect
 - inversion            : in an inversion
 - low recombination    : in a region of low recombination
 

In [None]:
import warnings

def findLabel(pos, muttype):
    # 1 = neut, 2 = QTN, 3 = delet, 4 = sweep
    muttypes = {"MT=1" : "neut", 
                "MT=2" : "QTN",
                "MT=3" : "delet",
                "MT=4" : "sweep",
                "MT=5" : "neut"}         ### MT=5 is an artifact from SLiM to preserve the inversion
    try:
        mtLabel = muttypes[muttype]
    except KeyError:
        warnings.warn("Unknown muttype " + muttype)
        mtLabel = "INVALID"
    
    pos = float(pos)
    if  200001 <= pos <= 230000 or  270001 <= pos <= 280000:
        region = "BS"
    elif 174000 <= pos <= 176000:
        region = "NearSS"
    elif 173000 <= pos <= 173999 or 176001 <= pos <= 177000:
        region = "FarSS"
    elif 320000 <= pos <= 330000:
        region = "invers"
    elif 370000 <= pos <= 380000:
        region = "lowRC"
    else:
        region = "neutral"
    return "MT=" + mtLabel + "_R=" + region

### The 'insert' function

`DataFrame.insert(loc, column, value, allow_duplicates=False)`

Pretty simple function that inserts 'value' as a new column at position 'loc' and names the new column 'column'. It modifies the dataframe in place.
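A quick sketch of `insert` on a toy dataframe. The column names and values are placeholders chosen for illustration.

```python
import pandas as pd

# Toy data; the values are invented for this example
toy = pd.DataFrame({"stat_a": [0.1, 0.2], "stat_b": [0.3, 0.4]})

# Put a new column at position 0 (the far left).
# insert changes 'toy' itself and returns None, so do not reassign the result.
toy.insert(loc=0, column="class_label", value=["x", "y"])

print(list(toy.columns))
```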

In [None]:
## What's the best way to apply a function to two columns of a data frame?

## Answer from stack overflow:
## rewrite the function to take a pandas series. Apply the function row wise
## (https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe)


print(scaledFeatures.shape)
print(scaledFeatures.head())

In [None]:
## Now we need to add some more labels -- MT=2 means a QTN but does not distinguish between large and small QTNS
## In addition, there is no muttype for 'linked to a large QTN'. We are going to need the position column to 
# locate the linked alleles and the proportion column to differentiate the large and small QTNs


# add pos and prop back in to locate QTNs of large and small effect
scaledFeatures.insert(loc = 0, column = 'pos', value = features['pos'].astype("float"))
scaledFeatures.insert(loc = 0, column = 'prop', value = pd.to_numeric(features['prop'], errors = "coerce"))

### The loc function:

`DataFrame.loc[["row_label_1", "row_label_2"], 'col1':]`

loc accesses a group of rows or columns or both from a dataframe. The indexing operator ([]) cannot select rows and columns at the same time and can give unexpected results when performing more complex operations. loc is a more consistent and explicit way to select data.
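The sketch below shows two common uses of `loc` on a toy dataframe: selecting rows and columns by label, and assigning to the rows that match a boolean condition. The row labels, column names, and threshold are invented for illustration.

```python
import pandas as pd

# Toy data with string row labels; the values are invented for this example
toy = pd.DataFrame(
    {"pos": [10, 20, 30], "prop": [0.05, 0.30, 0.10], "label": ["a", "a", "a"]},
    index=["snp1", "snp2", "snp3"],
)

# Rows by label, columns from 'prop' through the last column
subset = toy.loc[["snp1", "snp3"], "prop":]

# Boolean row selection plus a column name: change 'label' only where prop > 0.2
toy.loc[toy["prop"] > 0.2, "label"] = "b"

print(subset)
print(toy)
```

Doing the row and column selection in a single `loc` call means the assignment changes the original dataframe rather than a temporary copy.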

### The 'isin' function

`DataFrame.isin(values)`

Returns a boolean DataFrame showing whether each element in the dataframe is contained in values
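A Series has the same method, which is handy for filtering rows on one column. A minimal sketch with made-up positions:

```python
import pandas as pd

# Toy data; the values are invented for this example
toy = pd.DataFrame({"pos": [10, 20, 30, 40]})
wanted = [20, 40, 99]   # 99 does not occur in the data, which is fine

# Series.isin returns one True/False per row
mask = toy["pos"].isin(wanted)

# Use the boolean mask to keep only the matching rows
selected = toy[mask]

print(mask.tolist())
print(selected)
```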

In [None]:
## Now that we have the proportions, we need to actually filter through them to find the large and small QTNs
## The rule is, < 20% proportion is a small QTN and > 20% proportion is a large QTN. This only applies to 
## SNPs marked as QTNs in the first place.

## We can do this with no additional functions, just some more complicated subsetting
largeQTNs = scaledFeatures[]
smallQTNs = scaledFeatures[]

In [None]:
## update the labels --  use the 'loc' function and the 'isin' function
scaledFeatures.loc[] = 'MT=lgQTN_R=lgQTNlink'
scaledFeatures.loc[] = 'MT=smQTN_R=smQTNlink'

## check the original data frame was changed
print(scaledFeatures[scaledFeatures.pos.isin(smallQTNs)])

### The 'between' function

`Series.between(left, right, inclusive=True)`

Takes a series, returns a boolean series indicating whether each element in the series is between 'left' and 'right'
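Here is a sketch of combining `between` with `loc` to relabel everything in a window around one site while leaving the site itself alone. The positions, the labels, and the `site` value are invented. The `inclusive` argument is left at its default (both endpoints included) because the accepted values for it differ between pandas versions.

```python
import pandas as pd

# Toy data; the values are invented for this example
toy = pd.DataFrame({
    "pos":   [100, 250, 300, 450, 700],
    "label": ["neut"] * 5,
})

site = 300   # hypothetical focal position

# True for every row whose position lies in [site - 200, site + 200]
in_window = toy["pos"].between(site - 200, site + 200)

# Exclude the focal site itself
not_site = toy["pos"] != site

# Combine the two masks with & and assign through loc
toy.loc[in_window & not_site, "label"] = "linked"

print(toy)
```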

In [None]:
## Now we need to label all the QTN linked SNPs --- these are defined as any SNP within 200bp of a QTN.
## Small and large QTN linked SNPs are labeled differently, and an SNP that is within 200bp of a large and a small
## QTN should be labeled as large QTN linked

for site in smallQTNs:
    lower = site - 200
    upper = site + 200
    ### use loc to access and change the label of all SNPs between lower and upper, EXCEPT the QTN itself
    scaledFeatures.loc[] = 'MT=neut_R=smQTNlink'
    
for site in largeQTNs:
    lower = site - 200
    upper = site + 200
    ### again, access and change the label using loc
    scaledFeatures.loc[] = 'MT=neut_R=lgQTNlink'
    
## Can you think of a way to do this that doesn't use a for loop?

## Write the data to separate files, based on the class

Almost finished! Let's drop the columns that don't contain statistics, then split up the dataframe based on class_label (sounds like another job for groupby)

In [None]:
## Use the 'drop' function to remove the 'pos' and 'prop' columns

print(scaledFeatures.head())
print(scaledFeatures.shape)

In [None]:
## use 'groupby' to group the rows based on class label

print(labelGrouped.groups.keys())

### 'to_csv' function

`DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')`

Call on a dataframe object to write the dataframe to a csv file. Ex:

`df.to_csv(filename, sep = " ", index = False, header = True)`
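As a sketch of how looping over a GroupBy object and `to_csv` fit together, the example below writes one file per group into a throwaway temporary directory. The toy dataframe, its labels, and the `out_dir` location are assumptions for illustration; only the `.fvec` extension and the `=` to `-` replacement mirror this notebook.

```python
import os
import tempfile
import pandas as pd

# Toy data; the labels and values are invented for this example
toy = pd.DataFrame({
    "class_label": ["MT=neut_R=BS", "MT=neut_R=BS", "MT=QTN_R=neutral"],
    "stat_a":      [0.1, 0.2, 0.7],
})

out_dir = tempfile.mkdtemp()   # throwaway directory, not the workshop's outDir
written = []

# Iterating over a GroupBy object yields (group name, group dataframe) pairs
for name, group in toy.groupby("class_label"):
    path = os.path.join(out_dir, name.replace("=", "-") + ".fvec")
    # Write only the statistics, space-separated, without the row index
    group.drop(columns=["class_label"]).to_csv(path, sep=" ", index=False, header=True)
    written.append(path)

print(sorted(os.listdir(out_dir)))
```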

In [None]:
## go through each group, generate a file name, and use the pandas function group.to_csv to print to file

# loop through the groupby object:
        outfile     = outDir + "/" + name + ".fvec"
        outfile     = outfile.replace("=", "-")         ## unix doesn't like having '=' in file names
        
        ## use the .to_csv function here

In [None]:
!ls $outDir
print("\n\n")
!head $outfile

## Interested in learning more?

There are tons of great online resources for learning pandas!

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

There's also been quite a lot written on Stack Overflow about pandas. If you're not sure how to do something, search for it and you'll probably find someone with the exact same question.