# Creating and Evaluating a Classifer

Steps:

    1) Make the feature vectors
    2) Balance the classes
    3) Perform a grid search to tune and train the classifier
    4) Evaluate the classifier on a test set

# Making the feature vectors

<img src="pics/genetic_map.png">

## Explanation of Labels

possible muttypes:

 - **neutral**
 - **QTN**         : can be either large effect (>.20 of variation in phenotype) or small effect (<.20 of variation)
 - **deleterious** : mutation that negatively effects fitness
 - **sweep**       : mutation that has become fixed and is expected to show evidence of a selective sweep around it

possible regions:

 - **Background selection** : any SNP in the 10,000bp region where deleterious mutations occurred 
 - **Near Selective Sweep** : within 1,000bp of the selective sweep
 - **Far Selective Sweep**  : 1,000-2,000bp from the selective sweep
 - **large QTN linked**     : within 200bp of a QTN of large effect
 - **small QTN linked**     : within 200bp of a QTN of small effect
 - **inversion**            : in an inversion
 - **low recombination**    : in a region of low recombination
 

<img src="pics/manhattan_plot.png" style="max-width:100% width:50%">

## Stats and Explanations

**Hscan_v1.3_H12 :** identifies **selective sweeps** by detecting shifts in haplotype frequencies

**pcadapt_3.0.4_ALL_log10p :** identifies **QTNs** by differentiation outlier analysis, detects outliers along principal components, effected by low recombination

**OutFLANK_0.2_He :** the heterozygosity of the loci

**LFMM_ridge_0.0_ALL_log10p :** GWAS method, detects loci that have effects on phenotypes (**QTNs**)

**LFMM_lasso_0.0_ALL_log10p :** GWAS method that ids **QTNs**, shows signals in region of low RC more than ridge

**rehh_2.0.2_ALL_log10p :** identifes recent, hard **selective sweeps** through reduction in haplotype diversity

**Spearmans_ALL_rho :** GEA method that detects **QTNs**, does not correct for population structure

**a_freq_final :** Frequency of the allele

**pcadapt_3.0.4_PRUNED_log10p :** identifies **QTNs**, should be less effected by low recombination than pcadapt all

**RDAvegan_v2.5.2_ALL_loading_RDA1_envi :** GEA method that detects **QTNs** and does not correct for population structure

**LEA_1.2.0_ALL_K3_log10p :** GEA method to detect **QTNs**, corrects for population structure (pvalue)

**LEA_1.2.0_ALL_K3_z :** GEA method to detect **QTNs**, raw score 

**baypass_2.1_PRUNED_BF_env :** GEA method to detect **QTNs**

**baypass_2.1_PRUNED_XTX :** differentiation outlier method to detect **QTNs**, shows signal at inversion in certain scenarios

**OutFLANK_0.2_PRUNED_log10p :** differentiation outlier method to detect **QTNs**, shows signal at inversion in certain scenario

In [1]:
def findLabel(pos, muttype):
    # 1 = neut, 2 = QTN, 3 = delet, 4 = sweep
    muttypes = {"MT=1" : "neut", 
                "MT=2" : "QTN",
                "MT=3" : "delet",
                "MT=4" : "sweep",
                "MT=5" : "neut"}         ### MT=5 is a artifact from SLiM to preserve the inversion
    try:
        mtLabel = muttypes[muttype]
    except KeyError:
        warnings.warn("Unknown muttype " + muttype)
        mtLabel = "INVALID"
    
    pos = float(pos)
    if  200001 <= pos <= 230000 or  270001 <= pos <= 280000:
        region = "BS"
    elif 174000 <= pos <= 176000:
        region = "NearSS"
    elif 173000 <= pos <= 17399 or 176001 <= pos <= 177000:
        region = "FarSS"
    elif 320000 <= pos <= 330000:
        region = "invers"
    elif 370000 <= pos <= 380000:
        region = "lowRC"
    else:
        region = "neutral"
    return "MT=" + mtLabel + "_R=" + region

In [2]:
def splitAndWrite(features, outdir):
    
    ### remove chr 9 because it is variable and would throw off results
    grouped = features.groupby('chrom')
    features = features.drop(grouped.get_group('9').index)

    ### scale all the stats we will be looking at by chromosome using scaleStats function
    #scaledFeatures = features.groupby('chrom')[stats].transform(scaleStats)
    
    # scale stats across whole genome
    scaledFeatures = features[stats].transform(scaleStats)
    
    # add class labels
    scaledFeatures.insert(loc = 0, column = 'classLabel', 
                          value = np.vectorize(findLabel)(features['pos'], features['muttype']))
    # add pos and prop back in to locate QTNs of large and small effect
    scaledFeatures.insert(loc = 0, column = 'pos', value = features['pos'].astype("float"))
    scaledFeatures.insert(loc = 0, column = 'prop', value = pd.to_numeric(features['prop'], errors = "coerce"))
    
    ## add labels to the SNPs within 200bp of a QTN
    smallQTNs = scaledFeatures[((scaledFeatures['classLabel'] == 'MT=QTN_R=neutral') & 
                               (scaledFeatures['prop'] < .2))]['pos']

    largeQTNs = scaledFeatures[((scaledFeatures['classLabel'] == 'MT=QTN_R=neutral') & 
                               (scaledFeatures['prop'] >= .2))]['pos']
    
    scaledFeatures.loc[scaledFeatures['pos'].isin(smallQTNs), 'classLabel'] = 'MT=smQTN_R=smQTNlink'
    scaledFeatures.loc[scaledFeatures['pos'].isin(largeQTNs), 'classLabel'] = 'MT=lgQTN_R=lgQTNlink'

    for site in smallQTNs:
        lower = site - 200
        upper = site + 200
        scaledFeatures.loc[(scaledFeatures['pos'].between(lower, upper, inclusive = True)) & 
                        (scaledFeatures['pos'] != site), 'classLabel'] = 'MT=neut_R=smQTNlink'
    for site in largeQTNs:
        lower = site - 200
        upper = site + 200
        scaledFeatures.loc[(scaledFeatures['pos'].between(lower, upper, inclusive = True)) &
                        (scaledFeatures['pos'] != site),'classLabel'] = 'MT=neut_R=lgQTNlink'
   
    ## drop the positions and props and write to file by group
    scaledFeatures = scaledFeatures.drop(columns = ['pos', 'prop'])
    labelGrouped   = scaledFeatures.groupby('classLabel')
    for name, group in labelGrouped:
        outfile     = outdir + "/" + name + ".fvec"
        outfile     = outfile.replace("=", "-")
        fileExists  = os.path.exists(outfile)            # does the file already exist?
        
        with open(outfile, 'a') as f:
            group.to_csv(f, sep = '\t', index = False, header = not fileExists)

# Train the classifier

<img src="pics/random_forest.jpg">



## First, balance the classes

In [5]:
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
#balance the training set
# minClassSize = min([len(XH[classLabel]) for classLabel in  XH.keys()])
X = []
y = []
for classLabel in sorted(XH.keys()):
    print('{:24} : {:>6}'.format(classLabel, str(len(XH[classLabel]))))
    random.shuffle(XH[classLabel])
    for i in range(1000):
        try:
            currVector = XH[classLabel][i]
        except IndexError:
            break
        X.append(currVector)
        y.append(classLabel)
        
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

MT=delet_R=BS            :    379
MT=lgQTN_R=lgQTNlink     :     86
MT=neut_R=BS             :  31197
MT=neut_R=FarSS          :    585
MT=neut_R=NearSS         :    983
MT=neut_R=invers         :  11920
MT=neut_R=lgQTNlink      :    502
MT=neut_R=lowRC          :   8644
MT=neut_R=neutral        : 312608
MT=neut_R=smQTNlink      :   3511
MT=smQTN_R=smQTNlink     :    639
MT=sweep_R=NearSS        :     53


Training set size after split: 6170
Testing set size: 2057
training set size after balancing: 9204


<img src="pics/SMOTE_R_visualisation_4.png">

## Grid Search to tune the hyperparameters

<img src="pics/grid_search.gif">

Checking accuracy when distinguishing among all 12 classes
Training extraTreesClassifier
GridSearchCV took 1144.47 seconds for 432 candidate parameter settings.


Results for extraTreesClassifier
Model with rank: 1
Mean validation score: 0.691 (std: 0.041)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 2, 'criterion': 'entropy', 'max_features': 15, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.691 (std: 0.048)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 2, 'criterion': 'gini', 'max_features': 3, 'max_depth': None}

Model with rank: 3
Mean validation score: 0.691 (std: 0.037)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 3, 'criterion': 'entropy', 'max_features': 15, 'max_depth': None}



['QTNdistinct.p']

## Test the best classifier and evaluate


| Class      | precision  |  recall | f1-score  | support |
|------------|-----------|----------|-----------|---------|                      
| MT=delet_R=BS   |    0.15   |   0.23  |    0.18     |  109 |      
| MT=lgQTN_R=lgQTNlink    |   0.35   |   0.42   |   0.38    |    19 |
| MT=neut_R=BS  |     0.30   |   0.33    |  0.32  |     236 |
| MT=neut_R=FarSS   |    0.33  |    0.32  |    0.33     |  145 |
| MT=neut_R=NearSS   |    0.63   |   0.57   |   0.60   |    253 |
| *MT=neut_R=invers*    |   0.88    |  0.77    |  **0.82**     |  250 |
| MT=neut_R=lgQTNlink   |    0.40    |  0.35   |   0.37    |   133 |
| *MT=neut_R=lowRC*     |  0.66    |  0.82    |  **0.73**     |  233 |
| MT=neut_R=neutral    |   0.29  |    0.26   |   0.27   |    264 |
| MT=neut_R=smQTNlink   |    0.35   |   0.31  |    0.33   |    240 |
| MT=smQTN_R=smQTNlink   |    0.46   |   0.42   |   0.44   |    161 |
| MT=sweep_R=NearSS   |    0.21  |    0.36    |  0.26      |  14 | 
| **avg / total**   |   **0.47**    |  **0.46**   |   **0.46**    |  **2057** |


<img src="pics/organized_output.png">

<img src="pics/bernie.jpeg" style="max-width:100%; width:75%">