In [1]:
%matplotlib inline
%config IPCompleter.greedy=True

# Random forests on mouse phenotype data

Today we will use one of the most popular supervised learning methods: random forests. It is an ensemble method, which means it uses multiple predictors and makes its final prediction based on a democratic vote of its individual predictors.

These sub-predictors are called decision trees, hence the name forest. Decision trees are labelled binary trees. Every leaf is associated with a class label, and every branching point is a test on a feature-threshold pair: in scikit-learn's convention, if the sample's feature value is at most the threshold, the sample is passed on to the left branch, otherwise it continues on the right branch. Starting from the tree's root, the prediction is the label of the leaf the sample eventually reaches.
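The traversal described above can be sketched in a few lines of plain Python. This is a toy illustration only, not scikit-learn's internal representation: the `Node` class, the feature names `weight` and `glucose`, and the thresholds are all hypothetical. The sketch sends a sample left when its feature value is at most the threshold (the scikit-learn convention); the mirror-image convention would work just as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Toy decision-tree node (hypothetical, for illustration only)."""
    feature: Optional[str] = None      # feature tested at a branching point
    threshold: Optional[float] = None  # value the feature is compared against
    left: Optional["Node"] = None      # taken when feature value <= threshold
    right: Optional["Node"] = None     # taken when feature value > threshold
    label: Optional[str] = None        # class label, set only on leaves


def predict(node: Node, sample: dict) -> str:
    """Walk from the root to a leaf and return the leaf's label."""
    while node.label is None:
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# A hand-built tree with two branching points and three leaves.
root = Node(
    feature="weight", threshold=30.0,
    left=Node(label="CD"),
    right=Node(
        feature="glucose", threshold=10.0,
        left=Node(label="CD"),
        right=Node(label="HFD"),
    ),
)

print(predict(root, {"weight": 40.0, "glucose": 12.0}))
```

A random forest would hold many such trees and combine their individual predictions by voting.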

Random forests generate a set of decision trees in a structured manner, and aggregate their results. Let's see how this works in practice.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict

In [3]:
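# Load the phenotype table; the first column is used as the row index
pheno = pd.read_csv('../example_data/phenotype_cd_hfd.csv', index_col=0)
# Encode the diet as a binary target: CD -> 0, HFD -> 1
target = pheno['diet'].map({'CD': 0, 'HFD': 1})
# Keep only columns without missing values, then drop the non-feature columns
data = pheno.dropna(axis=1).drop(columns=['strain', 'diet'])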

### 1.1 Use a random forest classifier
Initialize it with default settings. Show its biased performance by fitting and predicting on the entire dataset, and then show its real accuracy using a proper cross-validation. For cross-validation, use the one-liner form with yesterday's convenience function.

Then try the same with standardized input data (no need to do it fold-by-fold), and report your findings.
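A minimal sketch of this comparison, under stated assumptions: the phenotype CSV is not reproduced here, so `make_classification` generates a synthetic stand-in with the same 113x55 shape, and `cross_val_score` is assumed to be the convenience function meant above. The numbers it prints say nothing about the mouse data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=55,
                           n_informative=8, random_state=0)

clf = RandomForestClassifier(random_state=0)

# Biased estimate: the model is scored on the samples it was trained on
train_acc = clf.fit(X, y).score(X, y)

# Cross-validated estimate: every sample is scored by a model that never saw it
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_acc = cross_val_score(clf, X, y, cv=cv).mean()

# Same cross-validation on standardized features
X_std = StandardScaler().fit_transform(X)
cv_acc_std = cross_val_score(clf, X_std, y, cv=cv).mean()

print(f"accuracy on training data:      {train_acc:.3f}")
print(f"cross-validated accuracy:       {cv_acc:.3f}")
print(f"cross-validated (standardized): {cv_acc_std:.3f}")
```

Fixing `random_state` in both the classifier and the splitter makes the raw and standardized runs directly comparable.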

### 1.1.2 Determine the uncertainty of the accuracy by bootstrapping
You must have noticed that the cross-validated accuracy of the model isn't exact: it depends on the particular folds that you get during cross-validation.
If you have shuffling enabled (`shuffle=True` in the fold splitter), your folds will be different each time. So by repeating cross-validation many times, you could get a clearer idea of the model's accuracy by taking the average, or better yet, looking at the distribution of the individual accuracy figures.

Use 100 repeats and plot the histogram of the resulting accuracy values. What is the 90% confidence interval of the model's accuracy? (What is a confidence interval to begin with?)
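One way the repetition could look, again on a synthetic stand-in from `make_classification` rather than the phenotype data. Two choices here are assumptions made for this sketch: the forest is limited to 25 trees to keep the 100 repeats fast, and the interval is read off as the 5th and 95th percentiles of the repeated accuracies, which is only one possible interpretation of "90% interval".

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=55,
                           n_informative=8, random_state=0)

# Fewer trees than the default, purely to keep this sketch quick
clf = RandomForestClassifier(n_estimators=25, random_state=0)

n_repeats = 100
accuracies = np.empty(n_repeats)
for seed in range(n_repeats):
    # A different seed reshuffles the samples into different folds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    accuracies[seed] = cross_val_score(clf, X, y, cv=cv).mean()

# Central 90% of the observed accuracy values
low, high = np.percentile(accuracies, [5, 95])

plt.hist(accuracies, bins=15)
plt.axvline(low, linestyle="--", color="k")
plt.axvline(high, linestyle="--", color="k")
plt.xlabel("cross-validated accuracy")
plt.ylabel("number of repeats")

print(f"mean accuracy: {accuracies.mean():.3f}")
print(f"5th to 95th percentile: [{low:.3f}, {high:.3f}]")
```

Because the classifier's `random_state` is fixed, the spread in this sketch comes from the fold assignment alone; leaving it unset would add the forest's own randomness on top.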

### 1.2 Extract prediction probabilities instead of labels
`RandomForestClassifier`s have a `predict_proba` method by default, similar to `SVC`s with `probability=True` turned on. Use the convenience one-liner to get cross-validated class membership probability estimates for each sample. Create a histogram using the probability values for class `1`, but separate them by the true label of the samples, and color the two labels differently.
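A possible shape for this step, assuming the one-liner in question is `cross_val_predict` with `method="predict_proba"`. The data is a synthetic stand-in from `make_classification`, and the colors and labels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=55,
                           n_informative=8, random_state=0)

clf = RandomForestClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One row per sample, one column per class, in the order of clf.classes_
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
p_class1 = proba[:, 1]

# Two overlaid histograms, split by the true label
bins = np.linspace(0, 1, 21)
plt.hist(p_class1[y == 0], bins=bins, alpha=0.6, color="tab:blue",
         label="true label 0")
plt.hist(p_class1[y == 1], bins=bins, alpha=0.6, color="tab:orange",
         label="true label 1")
plt.xlabel("cross-validated probability of class 1")
plt.ylabel("number of samples")
plt.legend()
```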

### 1.2.2 Look at the numeric probability values. Why are they so rough?
Find out what they are derived from, and try to increase their resolution.
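As a hint in code form: scikit-learn's forest averages the class probabilities of its individual trees, and with fully grown trees each tree contributes either 0 or 1, so the number of trees limits the resolution. The sketch below compares 10 and 500 trees on synthetic stand-in data; both tree counts are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=55,
                           n_informative=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def class1_probabilities(n_trees):
    """Cross-validated class-1 probabilities from a forest of n_trees trees."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    return cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]


p_coarse = class1_probabilities(10)
p_fine = class1_probabilities(500)

# With 10 trees, only multiples of 1/10 are expected to occur
print("distinct values with 10 trees: ", np.unique(p_coarse))
print("distinct values with 500 trees:", len(np.unique(p_fine)))
```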

### 1.3 Create a ROC curve
Now that you know how to access the continuous internal prediction variables (class membership probability in this case) you can create a ROC plot easily.
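A sketch of one way to do this with `sklearn.metrics.roc_curve`, feeding it cross-validated class-1 probabilities. As before, the data is a synthetic stand-in and the resulting curve is not the one for the mouse data.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=55,
                           n_informative=8, random_state=0)

clf = RandomForestClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
p_class1 = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

# Sweep the decision threshold over the probability values
fpr, tpr, thresholds = roc_curve(y, p_class1)
auc = roc_auc_score(y, p_class1)

plt.plot(fpr, tpr, label=f"random forest (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="chance")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
```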

### 1.4 Visualize a decision tree
Pick any decision tree from the random forest, export it in `DOT` format using `sklearn.tree.export_graphviz` and visualize it with an online tool.
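A sketch of the export step. The fitted trees of a forest are available in its `estimators_` list, and `export_graphviz` returns the DOT source as a string when `out_file=None`. The data is a synthetic stand-in, and the feature names are hypothetical placeholders; the class names assume the 0 = CD, 1 = HFD encoding used for the target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=55,
                           n_informative=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders

forest = RandomForestClassifier(random_state=0).fit(X, y)

# Any tree will do; take the first one
tree = forest.estimators_[0]

dot_source = export_graphviz(
    tree,
    out_file=None,               # return the DOT text instead of writing a file
    feature_names=feature_names,
    class_names=["CD", "HFD"],   # in ascending order of the numeric labels
    filled=True,
)

# Paste the printed text into an online Graphviz viewer
print(dot_source[:300])
```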

### 1.5 Feature importances
Random forests utilize their internal structure not just to provide probability estimates, but also to estimate the importance of different features. How do you think it's done? Find a way to access the values, and present them on a bar plot, sorted by importance.

By the way, do you have to use cross-validation for feature importance estimates?
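For the importance values, scikit-learn exposes impurity-based estimates on a fitted forest as `feature_importances_`: each feature's total impurity decrease, averaged over the trees and normalized to sum to 1. A minimal sketch on synthetic stand-in data with hypothetical feature names:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=55,
                           n_informative=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders

forest = RandomForestClassifier(random_state=0).fit(X, y)

# Label the values with feature names and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(importances.index, importances.values)
ax.set_ylabel("impurity-based importance")
ax.tick_params(axis="x", rotation=90)

print(importances.head(10))
```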

### 1.6 Choose 3 features, and evaluate the model using only those features
Try to choose them with accuracy in mind. Treat it as a challenge: which 3 to choose to get the best model? What was your strategy?

### 1.7 Feature clustering
Plot a clustermap of your data. It's not too big (113x55), so you can keep all rows and columns and still have a readable figure with all feature labels if you increase its width a bit.

Should you plot the raw or the normalized data?

What patterns do you see? Are they relevant to the previous task?

### 1.7.2 Feature correlations
Calculate correlations between your features, and visualize them on a heatmap. Would it be useful to include the diet as an extra column?
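A sketch of the correlation step using `DataFrame.corr` (Pearson by default) and a plain matplotlib heatmap; `seaborn.heatmap` would be an equally valid way to draw it. The table here is a small synthetic stand-in with hypothetical column names, and the diet is appended as a numeric 0/1 column so that its correlation with each feature shows up in the last row and column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Small synthetic stand-in for the phenotype table (assumption, not the real data)
X, y = make_classification(n_samples=113, n_features=10,
                           n_informative=4, random_state=0)
data = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# Append the binary diet label as an extra numeric column
with_diet = data.assign(diet=y)

# Pairwise Pearson correlations between all columns
corr = with_diet.corr()

fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, label="Pearson correlation")

# Features ranked by the strength of their correlation with the diet column
diet_corr = corr["diet"].drop("diet").abs().sort_values(ascending=False)
print(diet_corr.head())
```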