In [None]:
%matplotlib inline
%config IPCompleter.greedy=True

## Supervised learning on mice phenotype data

Predicting diet from differential expression data was easy with SVMs. That data was neat and regular: no cells were missing and all values were in a similar range. We will now use a slightly uglier dataset: the phenotype tables from days 3/4.

You may remember that each of those sheets had one row per strain, and two separate columns for each measurement, one per dietary condition. We have transformed those sheets such that 1) all of them are contained in a single table, and 2) each strain gets two rows, one for phenotype measurements under CD and one for measurements under HFD. We will use the `diet` column as our target.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

In [29]:
pheno = pd.read_csv('phenotype_cd_hfd.csv', index_col=0)

# Encode the diet labels as integers: CD -> 0, HFD -> 1
target = pheno['diet'].map({'CD': 0, 'HFD': 1})

### 1.1 Get rid of columns with missing values

Since most ML algorithms can't deal with NaN values, we will first restrict ourselves to those features that are available for every sample.
Identify these columns and put `pheno.loc[:, good_columns]` into the variable `data`.

Also, drop the columns `diet` and `strain` from the data table, since we don't want to use them for prediction.

In [None]:
# data = ...
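
One possible way to approach this step, sketched on a small made-up table rather than on `pheno` itself. The column names and values in `toy` are invented for illustration; only the `diet` and `strain` columns are taken from the description above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the phenotype table (hypothetical columns and values).
toy = pd.DataFrame({
    'strain': ['A', 'A', 'B', 'B'],
    'diet': ['CD', 'HFD', 'CD', 'HFD'],
    'body_weight': [25.1, 38.4, 27.0, 41.2],
    'glucose': [8.1, np.nan, 7.9, 10.3],   # one missing value
    'heart_rate': [610.0, 580.0, 640.0, 600.0],
})

# Boolean mask over columns: True where no value is missing.
complete = toy.notna().all(axis=0)
good_columns = toy.columns[complete]

# Keep the complete columns, then drop the label and the strain identifier.
toy_data = toy.loc[:, good_columns].drop(columns=['diet', 'strain'])
print(list(toy_data.columns))
```

The same pattern should carry over to `pheno`, with the result stored in `data`.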

### 1.2 Use an SVM for your predictions
Try the RBF kernel for a change. First, fit and score using the entire dataset, and print out the accuracy.
Do a proper evaluation using 3-fold cross-validation, and print those scores as well. How did it go?
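
A minimal sketch of the two evaluations, assuming synthetic data from `make_classification` in place of `data` and `target` (the sample and feature counts are arbitrary). `cross_val_score` is one convenient way to run the folds; looping over `StratifiedKFold.split` by hand works just as well.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for (data, target); not the phenotype table.
X, y = make_classification(n_samples=120, n_features=10,
                           n_informative=4, random_state=0)

# Fit and score on the same samples: an optimistic estimate.
clf = SVC(kernel='rbf')
train_acc = clf.fit(X, y).score(X, y)

# 3-fold stratified cross-validation: each sample is scored by a model
# that did not see it during training.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(SVC(kernel='rbf'), X, y, cv=cv)

print('accuracy on the training data:', train_acc)
print('cross-validated accuracies:   ', cv_scores)
```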

### 1.3 Use a linear kernel to get the same two values
Was it better or worse than with the RBF? Why?

### 1.4 Standardize the data
Look at the value ranges of each feature. Standardize them so that each has a mean of zero and a standard deviation of 1. You can do this either by subtracting the means and dividing by the standard deviations, or by using the `sklearn.preprocessing.StandardScaler` class.

Display the cross-validated scores using an RBF and a linear SVM.
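
The two standardization routes can be compared on a small random matrix. This is only a sketch: the three feature ranges below are made up to mimic columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three hypothetical features with very different value ranges.
X = np.column_stack([
    rng.normal(30, 5, 50),
    rng.normal(600, 40, 50),
    rng.normal(0.5, 0.1, 50),
])

# Manual route: subtract the column means, divide by the column std's.
# np.std uses the population formula (ddof=0), as StandardScaler does.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same transformation with scikit-learn.
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))
```

Note that `DataFrame.std()` in pandas defaults to `ddof=1`, so a manual standardization on a DataFrame differs very slightly from `StandardScaler` unless you pass `ddof=0`.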

### 1.4.2 Optional: Standardize the data fold-by-fold

When we standardized the entire dataset in one go, we were cheating a bit. We did not keep the training and test data fully independent. For a truly honest evaluation, we should derive the standardization parameters from the training data only, and apply the same transformation to the test data separately.

If you standardize manually, use the training set means and std's for the transformation of both the training and the test data. If you use `StandardScaler`, use `fit_transform` for the training data and `transform` only for the test data.

Did it influence the accuracy?
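
A sketch of the fold-by-fold variant with `StandardScaler`, again on synthetic stand-in data rather than the phenotype table. The scaler is fitted inside the loop, so the test fold never contributes to the means and std's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (data, target).
X, y = make_classification(n_samples=120, n_features=10,
                           n_informative=4, random_state=0)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in cv.split(X, y):
    scaler = StandardScaler()
    # Learn means and std's from the training fold only ...
    X_train = scaler.fit_transform(X[train_idx])
    # ... and reuse exactly those parameters on the test fold.
    X_test = scaler.transform(X[test_idx])

    clf = SVC(kernel='linear').fit(X_train, y[train_idx])
    fold_scores.append(clf.score(X_test, y[test_idx]))

print(fold_scores)
```

Wrapping the scaler and the SVM with `sklearn.pipeline.make_pipeline` and handing the pipeline to `cross_val_score` should give the same separation with less code.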

### 1.5 Sensitivity, specificity, precision...
In some cases, the accuracy of a prediction is secondary to other quality measures, such as sensitivity or specificity. For example, HIV tests are optimized for sensitivity at the expense of accuracy, ensuring that very few HIV-positive individuals test negative. This results in an HIV scare for many HIV-negative individuals each year (for a given test, raising the sensitivity generally raises the false positive rate as well), but in exchange almost no case of HIV goes undetected.

We can tune most ML models similarly, and sacrifice accuracy for higher sensitivity or specificity. But first, simply report the sensitivity of your linear SVM for both classes. You will find tools in `sklearn` that help you calculate this value.
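
The sensitivity of a class is its recall: the fraction of samples of that class that were predicted correctly. A small sketch with hand-written labels (the two arrays are invented; in the exercise they would be the true labels and your cross-validated predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true labels and predictions (0 = CD, 1 = HFD).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# average=None returns one recall value per class, in label order.
per_class = recall_score(y_true, y_pred, average=None)
print(per_class)

# The same numbers from the confusion matrix:
# rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
manual = cm.diagonal() / cm.sum(axis=1)
print(manual)
```

Here 3 of the 4 CD samples and 4 of the 6 HFD samples are recovered, so the two sensitivities are 0.75 and about 0.67.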

### 1.5.2 Make your SVM 95+% sensitive for HFD
Find a parameter that helps you increase your sensitivity for mice on HFD.
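
One candidate is the `class_weight` parameter of `SVC`, which makes errors on one class more costly than errors on the other. The sketch below scans a few arbitrary weights for class 1 on synthetic stand-in data; the weight that reaches 95% on the phenotype data has to be found by trying.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for (data, target); label 1 plays the role of HFD.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           class_sep=0.8, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

recalls = {}
for w in (1, 2, 5, 10):
    # Misclassifying a class-1 sample costs w times as much as class 0.
    clf = SVC(kernel='linear', class_weight={0: 1, 1: w})
    y_pred = cross_val_predict(clf, X, y, cv=cv)
    recalls[w] = recall_score(y, y_pred, pos_label=1)

for w, r in recalls.items():
    print(f'weight for class 1 = {w:>2}: sensitivity = {r:.2f}')
```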

### 1.6 ROC curves
You might be interested in your model's relationship between its accuracy and sensitivity, or a more commonly used pair of quality measures: false positive rate vs. sensitivity (aka true positive rate). This is what ROC (receiver operating characteristic) curves display: the trade-off between these two qualities.

Most classification ML methods, despite their categorical output, use continuous internal variables for their predictions, and their final decision is a simple thresholding of this continuous variable. For example, in the case of SVMs, this variable is the data point's signed distance to the separating hyperplane: positive values are assigned to one class, negative values to the other class. Values close to zero (= close to the boundary) are harder to place in either class, and it's down to the arbitrary threshold how they end up being predicted.

You can create a ROC curve by testing how the choice of threshold affects false positive rate and sensitivity. Needless to say, `sklearn` helps you create such plots. All you need to do is extract the SVM's continuous predictive variables, pass them to the appropriate function together with the true labels, and plot the results.
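
A sketch of these steps on synthetic stand-in data. It assumes out-of-fold `decision_function` values collected with `cross_val_predict`; pooling scores from different folds is a common simplification, not the only way to do it.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for (data, target).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Signed distances to the separating hyperplane, predicted out-of-fold.
scores = cross_val_predict(SVC(kernel='linear'), X, y, cv=cv,
                           method='decision_function')

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y, scores)
auc = roc_auc_score(y, scores)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f'linear SVM (AUC = {auc:.2f})')
ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='chance')
ax.set_xlabel('false positive rate')
ax.set_ylabel('sensitivity (true positive rate)')
ax.legend()
```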

### 1.7 Find the threshold for the desired sensitivity / FPR tradeoff
In 1.5.2 you increased sensitivity for HFD (label 1) by telling the SVM to use a higher weight for that class. Since then, you have learned that you could have also used the SVM's continuous predictive variables, and threshold them to your own liking, instead of leaving it to the SVM's default (0 for `decision_function` and 0.5 for `predict_proba`).

Your task is to find the threshold value that would suit your purpose (i.e. 95% sensitivity). Remember, the `roc_curve` function returned three vectors: the ROC plot's FPR values, its sensitivity values, and the thresholds that correspond to them.

Hint: iterate over the sensitivity and threshold values together, and report the first threshold where sensitivity exceeds 0.95. You can iterate over two lists together using Python's `zip` function.
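
The hint can be sketched as follows. The labels and scores are randomly generated stand-ins for the true labels and the SVM's `decision_function` output, so the threshold printed here says nothing about the phenotype data.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Made-up labels and continuous scores; class 1 tends to score higher.
y_true = np.repeat([0, 1], 100)
scores = rng.normal(loc=1.5 * y_true, scale=1.0)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Thresholds are sorted from high to low, so sensitivity only grows
# along the vectors: the first hit is the strictest threshold that works.
chosen = None
for sens, thr in zip(tpr, thresholds):
    if sens >= 0.95:
        chosen = thr
        break

# A sample is called class 1 when its score is at or above the threshold.
y_pred = (scores >= chosen).astype(int)
sensitivity = (y_pred[y_true == 1] == 1).mean()
print('threshold:', chosen, 'sensitivity:', sensitivity)
```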