# Active Learning

Active learning is a supervised learning paradigm, usually applied when obtaining labels is costly. The idea is to start with relatively few labeled samples and a large pool of unlabeled ones, and then request labels only for the samples expected to contribute the most to model quality.

## Data Preparation

We generate three sets of data here:
* **train**: the set of labeled data, containing 1,000 samples
* **unlabeled**: the set of unlabeled data, containing 25,000 samples (their labels are held back and only used for the samples we select)
* **valid**: the set against which we will evaluate our model, containing 1,000 samples

In [1]:
from sklearn.datasets import make_classification
import numpy as np

x, y = make_classification(n_samples=27_000, n_classes=2, random_state=0)

x_train, y_train = x[:1000], y[:1000]
x_unlbl, y_unlbl = x[1000:-1000], y[1000:-1000]
x_valid, y_valid = x[-1000:], y[-1000:]

len(y_train), len(y_unlbl), len(y_valid)

(1000, 25000, 1000)

## Strategy

* The strategy is to apply the current model **f** (trained on the existing labeled samples) to each of the remaining unlabeled samples
* For each unlabeled example **x**, an importance score is calculated:
    * importance(x) = density(x) * uncertainty_f(x)
    * density(x) reflects how many examples surround **x** in its close neighborhood
    * uncertainty_f(x) reflects how uncertain the prediction of the model **f** is for **x**
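
As a rough illustration of the score above, the sketch below computes it by hand for four one-dimensional points. The points and probabilities are made-up values, and `k = 2` neighbors is an assumption chosen to fit such a tiny data set. The density and uncertainty formulas mirror the ones used later in this notebook (inverse mean distance to the nearest neighbors, and `1 - (0.5 - p)**2`).

```python
import numpy as np

# Hypothetical toy data: four 1-D points and made-up model probabilities P(y=1 | x).
points = np.array([0.0, 0.1, 0.2, 5.0])
probs = np.array([0.9, 0.5, 0.6, 0.5])

# Density: inverse of the mean distance to the k nearest neighbors (k=2 is an assumption).
k = 2
pairwise = np.abs(points[:, None] - points[None, :])
# After sorting each row, column 0 is the distance of a point to itself (0), so skip it.
nearest = np.sort(pairwise, axis=1)[:, 1:k + 1]
density = 1.0 / nearest.mean(axis=1)

# Uncertainty: largest (1.0) when the probability is 0.5, smallest (0.75) at 0 or 1.
uncertainty = 1.0 - (0.5 - probs) ** 2

importance = density * uncertainty
most_important = int(importance.argmax())
print(importance, most_important)
```

In this toy setup the point at 0.1 scores highest because it is both in a dense region and maximally uncertain. The isolated point at 5.0 is just as uncertain but ends up with a low score because of its low density.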
    


### Training model f

In [2]:
from sklearn.linear_model import LogisticRegression

In [3]:
model = LogisticRegression().fit(x_train, y_train)
model.score(x_valid, y_valid)

0.841

### Computing Importance

In [4]:
from sklearn.neighbors import NearestNeighbors

class SampleImportance:
    def __init__(self, model):
        self.model = model
    
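    def get_density(self, x):
        # n_neighbors=6 because each point comes back as its own nearest neighbor
        nbrs = NearestNeighbors(n_neighbors=6).fit(x)
        distances, _ = nbrs.kneighbors(x)
        # drop column 0 (distance to itself) and average over the 5 real neighbors
        mean_distances = distances[:, 1:].mean(axis=1)
        return 1.0 / mean_distances
    
    def get_uncertainty(self, x):
        # probability of the positive class under the stored model
        preds = self.model.predict_proba(x)[:, 1]
        # 1.0 when the predicted probability is 0.5, 0.75 when it is 0 or 1
        uncertainty = 1.0 - (0.5 - preds)**2
        return uncertainty
    
    def __call__(self, x):
        return self.get_density(x) * self.get_uncertainty(x)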

In [5]:
si = SampleImportance(model)
importance = si(x_unlbl)

In [6]:
# sorting unlabeled data based on importance

p = importance.argsort()
x_unlbl = x_unlbl[p]
y_unlbl = y_unlbl[p]

# taking the 1000 most important samples

x_addtn = x_unlbl[-1000:]
y_addtn = y_unlbl[-1000:]
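
The selection above relies on `argsort` returning indices in ascending order of score, so the last entries are the most important ones. The minimal sketch below checks this on stand-in random scores (not the notebook's data) and shows `np.argpartition` as one possible alternative that avoids a full sort when only the top `k` indices are needed. The values of `k` and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
importance = rng.random(20)      # stand-in scores, not the notebook's values
k = 5

# Full sort: ascending order, so the last k indices have the highest scores.
top_by_sort = np.argsort(importance)[-k:]

# Partial sort: the last k indices are the top k, but in no particular order.
top_by_partition = np.argpartition(importance, -k)[-k:]

same = set(top_by_sort.tolist()) == set(top_by_partition.tolist())
print(same)
```

Both approaches pick the same set of samples. The order within the top `k` only matters if the selected samples are processed one at a time.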

### Training with additional data

In [7]:
x_train = np.vstack([x_train, x_addtn])
y_train = np.hstack([y_train, y_addtn])

model = LogisticRegression().fit(x_train, y_train)
model.score(x_valid, y_valid)

0.846

**By labeling only 1,000 of the 25,000 unlabeled samples and adding them to the training data, validation accuracy rose from 0.841 to 0.846.**
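
One way to check how much of this gain comes from the selection strategy, rather than simply from having more training data, is to compare against a baseline that adds 1,000 randomly chosen samples. The sketch below is an assumed setup, not part of the original experiment: it regenerates the same data split as above, and the random seed and the `baseline` names are arbitrary choices. No result is quoted here because the baseline was not run in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same data split as at the top of the notebook.
x, y = make_classification(n_samples=27_000, n_classes=2, random_state=0)
x_train, y_train = x[:1000], y[:1000]
x_unlbl, y_unlbl = x[1000:-1000], y[1000:-1000]
x_valid, y_valid = x[-1000:], y[-1000:]

# Pick 1,000 unlabeled samples uniformly at random (the seed is arbitrary).
rng = np.random.default_rng(0)
idx = rng.choice(len(x_unlbl), size=1000, replace=False)

x_rand = np.vstack([x_train, x_unlbl[idx]])
y_rand = np.hstack([y_train, y_unlbl[idx]])

# Train on the randomly extended set and evaluate on the same validation set.
baseline = LogisticRegression().fit(x_rand, y_rand)
baseline_score = baseline.score(x_valid, y_valid)
print(baseline_score)
```

If the importance-based selection is doing useful work, its validation score should tend to exceed this baseline. Averaging over several random seeds would give a fairer comparison.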