# scikit-learn: Statistical learning

Based on materials from: http://scikit-learn.org/stable/tutorial/statistical_inference/settings.html

## Statistical learning: The setting and the estimator object in scikit-learn

* Data is handled as 2D arrays of shape `(n_samples, n_features)`
* Data that does not start in this shape can be reshaped, for instance by flattening 8x8 images into 64-element feature vectors
* An *estimator* is any object that learns from data: a classification, regression or clustering algorithm, or a *transformer* that extracts useful features from raw data
* After fitting, the estimated parameters are available as attributes of the estimator
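The reshaping step for the 8x8 images can be sketched with the digits dataset bundled with scikit-learn. The variable name `flat` is only illustrative.

```python
from sklearn import datasets

digits = datasets.load_digits()
# digits.images holds the samples as 8x8 pixel grids
print(digits.images.shape)

# Flatten each image into one row of 64 features; -1 lets NumPy infer that size
flat = digits.images.reshape((digits.images.shape[0], -1))
print(flat.shape)
```

The result has one row per sample and one column per pixel, which is the 2D layout estimators expect.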

In [3]:
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
data.shape

(150, 4)

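General syntax for estimators: hyperparameters are set when the object is created, and estimated parameters carry a trailing underscore.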

In [6]:
#estimator = Estimator(param1=1, param2=2)
#estimator.param1
#estimator.estimated_param_
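As a concrete instance of this pattern, here is a minimal sketch using `LinearRegression`. The toy data is made up for illustration and follows `y = 2x + 1` exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (illustrative only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# Hyperparameters are passed to the constructor and stored as plain attributes
estimator = LinearRegression(fit_intercept=True)
print(estimator.fit_intercept)

# Learning happens in fit; estimated parameters end with an underscore
estimator.fit(X, y)
print(estimator.coef_, estimator.intercept_)
```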

## Supervised learning: Predicting an output variable from high-dimensional observations

We try to learn the link between two datasets: the observed data `X` and an external variable `y` that we are trying to predict, usually called the *target* or *labels*.
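A minimal sketch of this setting, assuming a k-nearest-neighbors classifier on the iris data. The 10-sample hold-out and the random seed are arbitrary choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris_X, iris_y = datasets.load_iris(return_X_y=True)

# Hold out 10 randomly chosen observations for testing
rng = np.random.RandomState(0)
indices = rng.permutation(len(iris_X))
train, test = indices[:-10], indices[-10:]

knn = KNeighborsClassifier()             # default n_neighbors=5
knn.fit(iris_X[train], iris_y[train])    # supervised: fit takes both X and y
predicted = knn.predict(iris_X[test])    # predict only needs X

print(predicted)
print(iris_y[test])
```

Comparing the two printed arrays shows how often the predicted labels match the held-out targets.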

### Nearest neighbor and the curse of dimensionality

For a nearest-neighbor estimator to be effective, the distance between neighboring points must be less than some value `d`, which depends on the problem. In one dimension this requires `n ~ 1/d` points. With `p` features it requires `n ~ 1/d^p` points, so the number of training points needed for a good estimator grows exponentially with the number of features.

This is called *the curse of dimensionality* and is a core problem that machine learning addresses.
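To get a feel for the `n ~ 1/d^p` growth, the following sketch evaluates it for a few values of `p`. The helper `points_needed` and the spacing `d = 0.1` are assumptions made for illustration.

```python
def points_needed(d, p):
    """Rough number of points needed to keep spacing d in p dimensions."""
    return (1 / d) ** p


d = 0.1  # assumed target spacing between neighboring points
for p in (1, 2, 5, 10):
    print(f"p={p:2d}  n ~ {points_needed(d, p):.0e}")
```

Going from 1 to 10 features at the same spacing raises the requirement from about ten points to about ten billion.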

### Linear model: From regression to sparsity

* In its simplest form, linear regression minimizes the sum of squared residuals of the model.
* With few data points per dimension, noise in the observations results in high variance.
* One solution in high-dimensional statistical learning is to *shrink* the regression coefficients toward zero, since any two randomly chosen sets of observations are likely to be uncorrelated. This is called *Ridge regression*.
    * The `alpha` parameter controls a `bias <-> variance` tradeoff: a larger `alpha` gives more bias and less variance.
* Visualizing 10 dimensions is hard to think about, but such a space would be fairly 'empty'.
* One way to mitigate the curse of dimensionality is to select only the informative features.
    * Ridge regression decreases the contribution of features but does not set coefficients to zero.
    * Another method is Lasso (least absolute shrinkage and selection operator), which can set some coefficients exactly to zero. This can be seen as a sparse method.
    * `LassoLars` is a scikit-learn implementation that solves the Lasso problem with the LARS algorithm.
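The difference between shrinking and zeroing coefficients can be sketched on the bundled diabetes dataset. The value `alpha=1.0` is an arbitrary choice, and `Lasso` (coordinate descent) is used here; `LassoLars` could be swapped in.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = datasets.load_diabetes(return_X_y=True)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can zero coefficients out

print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
print("Ridge coefficients equal to zero:", int(np.sum(ridge.coef_ == 0)))
print("Lasso coefficients equal to zero:", int(np.sum(lasso.coef_ == 0)))
```

Ridge gives a smaller coefficient norm than ordinary least squares while keeping every feature, whereas Lasso drops some features entirely.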

### Classification

* For classification, linear regression is not the right approach, as it gives too much weight to data far from the decision frontier
    * A linear approach is to fit a *sigmoid function*, also called the *logistic function* (logistic regression)
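A minimal sketch of logistic regression on the iris data. `max_iter=1000` is an assumption made so the default solver has room to converge.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris_X, iris_y = datasets.load_iris(return_X_y=True)

# C is the inverse of the regularization strength (default 1.0)
log_reg = LogisticRegression(C=1.0, max_iter=1000)
log_reg.fit(iris_X, iris_y)

# predict_proba returns one probability per class; each row sums to 1
proba = log_reg.predict_proba(iris_X[:3])
print(proba.round(3))
print(log_reg.predict(iris_X[:3]))
```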
    
### Support Vector Machine

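* Tries to maximize the margin between two classes
* The regularization parameter `C` decides how many observations are involved in computing the margin: a small `C` uses many or all observations (more regularization), a large `C` uses only those close to the separating line (less regularization)
* Can also use the *kernel trick* to get non-linear decision boundaries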
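The effect of `C` and of the kernel choice can be sketched on the iris data. The three `C` values and the dictionary name `support_counts` are arbitrary choices for illustration.

```python
from sklearn import datasets, svm

iris_X, iris_y = datasets.load_iris(return_X_y=True)

# Effect of C with a linear kernel: count the support vectors that define the margin
support_counts = {}
for C in (0.01, 1.0, 100.0):
    svc = svm.SVC(kernel="linear", C=C).fit(iris_X, iris_y)
    support_counts[C] = int(svc.n_support_.sum())
    print(f"C={C:<6} support vectors: {support_counts[C]}")

# Kernel trick: same estimator, non-linear decision boundary
rbf_svc = svm.SVC(kernel="rbf", C=1.0).fit(iris_X, iris_y)
print("rbf training accuracy:", rbf_svc.score(iris_X, iris_y))
```

A smaller `C` typically leaves more observations inside the margin, so more of them become support vectors.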

# Model selection: Choosing estimators and their parameters

* Every estimator exposes a *score* method that judges the quality of the fit (or the prediction) on new data
* To better assess prediction accuracy, we can split the data into *folds* used for training and testing
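A minimal sketch of `score` on held-out data, assuming a linear SVC on the digits dataset. Holding out the last 100 samples is an arbitrary choice.

```python
from sklearn import datasets, svm

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel="linear")

# Train on everything except the last 100 samples, then score on those 100
svc.fit(X_digits[:-100], y_digits[:-100])
score = svc.score(X_digits[-100:], y_digits[-100:])
print(score)  # mean accuracy for classifiers; bigger is better
```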

## Cross-validation generators

The `split` method of a cross-validation generator yields the train and test indices for each fold, which makes cross-validation easy to set up. The cross-validation score can then be calculated from these splits.

In [7]:
from sklearn.model_selection import KFold, cross_val_score
X = ['a', 'a', 'b', 'c', 'c', 'c']
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: {} | test: {}'.format(train_indices, test_indices))

Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]


In [10]:
# [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
#     for train, test in k_fold.split(X_digits)]
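The commented-out list comprehension above can be made runnable as follows. This is a sketch that assumes `svc` is a linear SVC and that `X_digits`, `y_digits` come from the bundled digits dataset. It also shows `cross_val_score`, which runs the same loop in one call.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel="linear")
k_fold = KFold(n_splits=3)

# Manual loop: fit on each training fold, score on the matching test fold
manual_scores = [
    svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
    for train, test in k_fold.split(X_digits)
]

# cross_val_score does the same with the estimator's default score method
helper_scores = cross_val_score(svc, X_digits, y_digits, cv=k_fold)

print(manual_scores)
print(helper_scores)
```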

## Different cross-validation generators

* KFold: Splits data into K folds, trains on K-1 and tests on the left-out fold
* StratifiedKFold: Preserves class distribution within each fold
* GroupKFold: Ensures the same group is not in both testing and training sets
* ShuffleSplit: Generates train/test indices based on random permutation
* StratifiedShuffleSplit: Preserves class distribution within each split
* LeaveOneGroupOut: Takes a group array to group observations and leaves out one group per split
* LeavePGroupsOut: Leave P groups out
* LeaveOneOut: Leave one observation out
* LeavePOut: Leave P observations out
* PredefinedSplit: Generates train/test indices based on predefined splits
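Two of these generators can be sketched on a tiny made-up dataset. The labels and the group ids (think of one id per subject) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.zeros((8, 1))                          # feature values do not affect the split
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])        # imbalanced labels: six 0s, two 1s
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # hypothetical group ids

# StratifiedKFold: every test fold keeps the 3:1 class ratio
strat_test_labels = []
for train, test in StratifiedKFold(n_splits=2).split(X, y):
    strat_test_labels.append(sorted(y[test].tolist()))
    print("stratified test labels:", strat_test_labels[-1])

# GroupKFold: a group never appears on both sides of a split
group_folds = []
for train, test in GroupKFold(n_splits=2).split(X, y, groups):
    train_groups = set(groups[train].tolist())
    test_groups = set(groups[test].tolist())
    group_folds.append((train_groups, test_groups))
    print("train groups:", sorted(train_groups), "| test groups:", sorted(test_groups))
```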


# Clustering: Grouping observations together

Given a dataset without labels, we can try to split the observations into well-separated groups, called *clusters*.

## Hierarchical agglomerative clustering

Agglomerative: Bottom-up approach in which each observation starts in its own cluster, and clusters are then iteratively merged in a way that minimizes a *linkage criterion*. Particularly useful when the clusters of interest consist of only a few observations. For a large number of clusters it is much more computationally efficient than k-means.

Divisive: Top-down approach in which all observations start in one cluster, which is then iteratively split while moving down the hierarchy. Slow and statistically ill-posed when estimating many clusters.
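A minimal sketch of the agglomerative approach with scikit-learn's `AgglomerativeClustering`. The two blobs of 2D points are made up, and Ward linkage is one possible linkage criterion.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated blobs of made-up 2D points, 20 points each
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),
    rng.normal(5.0, 0.3, size=(20, 2)),
])

# Ward linkage merges the pair of clusters that least increases within-cluster variance
agglo = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agglo.fit_predict(X)
print(labels)
```

Merging stops once `n_clusters` clusters remain, so each blob should end up with its own label.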

## Decompositions: From a signal to components and loadings

PCA: Selects the successive components that explain the maximum variance in the signal. We can use it to reduce dimensionality by projecting the data onto the principal subspace.
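A minimal sketch of PCA on synthetic data where the third feature is the sum of the first two, so the points lie on a 2D plane inside a 3D space. The data and the seed are made up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2                 # fully determined by the first two features
X = np.c_[x1, x2, x3]

pca = PCA()
pca.fit(X)
# Share of variance explained by each component, largest first
print(pca.explained_variance_ratio_)

# Keep only the two informative components and project the data onto them
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)
```

The last explained-variance ratio is numerically zero, so dropping that component loses essentially no information.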

Independent component analysis (ICA) selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals. (Interesting! What would happen if it were applied to an omics dataset? What would it mean, and what information would be recovered?)
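A minimal sketch of signal recovery with `FastICA`, assuming two made-up non-Gaussian sources (a sine wave and a square wave) that are observed only through an arbitrary linear mixture.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two made-up non-Gaussian sources
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]

# Observed data: a linear mixture of the sources (mixing matrix chosen arbitrarily)
A = np.array([[1.0, 0.5], [0.5, 2.0]])
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)   # estimated sources, up to order, sign and scale

# Absolute correlation between each true source (rows) and each estimate (columns)
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))
```

Because ICA cannot determine order, sign or scale, the check is that each true source correlates strongly with exactly one estimated component.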