#### About

> Active learning


Active learning is a machine learning technique in which the model selects the most informative data points (typically those it is most uncertain about) from a large unlabeled data set and asks an oracle, such as a human expert, to label them. It is typically used when labeled data is scarce or expensive to obtain, because it can reduce the amount of labeled data needed to reach good model performance.
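One common way to make "most uncertain" concrete is the least-confidence score: one minus the predicted probability of the most likely class. The sketch below is only an illustration of that idea; the helper names `least_confidence` and `select_most_uncertain` and the example probabilities are hypothetical and do not come from this notebook.

```python
import numpy as np

def least_confidence(probs):
    """Return 1 - max class probability per row (higher means less certain)."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - probs.max(axis=1)

def select_most_uncertain(probs, k):
    """Return the indices of the k most uncertain rows, most uncertain first."""
    scores = least_confidence(probs)
    # argsort is ascending, so take the last k entries and reverse them
    return np.argsort(scores, kind="stable")[-k:][::-1]

# Made-up class probabilities for four unlabeled samples (illustrative only)
example_probs = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
])

top2 = select_most_uncertain(example_probs, k=2)
print(top2)  # indices of the two samples closest to a 50/50 prediction
```

In a real loop, the probabilities would come from something like `model.predict_proba` on the unlabeled set, and the selected indices would be sent to the oracle.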



In [1]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [2]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)


In [3]:
X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(X, y, test_size=0.8, random_state=42)


In [4]:
# Train initial model with labeled data
initial_model = LogisticRegression()
initial_model.fit(X_labeled, y_labeled)

In [5]:
# Evaluate initial model on the unlabeled set (its true labels are known here because the data is synthetic)
y_pred = initial_model.predict(X_unlabeled)
initial_accuracy = accuracy_score(y_unlabeled, y_pred)
print(f'Initial accuracy: {initial_accuracy:.2f}')


Initial accuracy: 0.85


In [6]:
# Active learning loop
num_iterations = 5  # number of iterations for active learning
batch_size = 50  # number of samples to query at each iteration


In [7]:
for i in range(num_iterations):
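    # Select a random batch from the unlabeled set for querying
    # Note: random selection is a baseline; an uncertainty-based strategy would rank samples by model confidence instead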
    query_indices = np.random.choice(X_unlabeled.shape[0], batch_size, replace=False)
    query_samples = X_unlabeled[query_indices]
    query_labels = y_unlabeled[query_indices]
    
    # Query the oracle for labels
    # In practice, replace this step with your own labeling method, e.g., human annotation or an external service
    # Here the oracle is simulated by returning the true labels of the queried samples
    query_labels_pred = query_labels  # simulated oracle response
    
    # Add queried samples and labels to labeled set
    X_labeled = np.concatenate([X_labeled, query_samples])
    y_labeled = np.concatenate([y_labeled, query_labels_pred])
    
    # Remove queried samples from the unlabeled set
    X_unlabeled = np.delete(X_unlabeled, query_indices, axis=0)
    y_unlabeled = np.delete(y_unlabeled, query_indices)
    
    # Train model with updated labeled set
    active_model = LogisticRegression()
    active_model.fit(X_labeled, y_labeled)
    
    # Evaluate on the remaining unlabeled set (it shrinks each iteration, so this is not a fixed test set)
    y_pred = active_model.predict(X_unlabeled)
    accuracy = accuracy_score(y_unlabeled, y_pred)
    print(f'Iteration {i+1}, Accuracy: {accuracy:.2f}')

Iteration 1, Accuracy: 0.85
Iteration 2, Accuracy: 0.85
Iteration 3, Accuracy: 0.85
Iteration 4, Accuracy: 0.85
Iteration 5, Accuracy: 0.85


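Active learning lets the model choose which samples from the unlabeled data get labeled, which can reduce the amount of labeled data needed to train a high-performing model. Note that the loop above draws each query batch at random, which is a baseline rather than a true query strategy, and accuracy stays at 0.85 across all five iterations. Replacing the random draw with a selection based on model uncertainty is what turns this procedure into active learning.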
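As a rough sketch of what an uncertainty-based version could look like, the cell below repeats the same setup but queries the samples the current model is least confident about. It makes two assumptions that are not in the original notebook: a fixed held-out test set (`X_test`, `y_test`) so that accuracies are comparable across iterations, and least-confidence scoring as the query strategy. The names `X_lab`, `X_unlab`, and `history` are chosen here for illustration. No accuracy figures are claimed; results depend on the data and the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Assumption: hold out a fixed test set so every iteration is scored on the same data
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Small labeled seed set; the remainder plays the role of the unlabeled pool
X_lab, X_unlab, y_lab, y_unlab = train_test_split(X_pool, y_pool, test_size=0.8, random_state=42)

num_iterations = 5
batch_size = 50
history = []  # test accuracy before each query round, plus one final entry

for i in range(num_iterations):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_lab, y_lab)
    history.append(accuracy_score(y_test, model.predict(X_test)))

    # Least-confidence score: 1 - probability of the most likely class
    probs = model.predict_proba(X_unlab)
    uncertainty = 1.0 - probs.max(axis=1)
    # Indices of the batch_size most uncertain pool samples
    query_indices = np.argsort(uncertainty)[-batch_size:]

    # Simulated oracle: reveal the true labels of the queried samples
    X_lab = np.concatenate([X_lab, X_unlab[query_indices]])
    y_lab = np.concatenate([y_lab, y_unlab[query_indices]])
    X_unlab = np.delete(X_unlab, query_indices, axis=0)
    y_unlab = np.delete(y_unlab, query_indices)

# Score the model trained on the final labeled set
final_model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
history.append(accuracy_score(y_test, final_model.predict(X_test)))

for step, acc in enumerate(history):
    print(f'After {step} query rounds, test accuracy: {acc:.2f}')
```

To judge whether the query strategy helps, one would run the random-sampling loop against the same fixed test set and compare the two accuracy curves, ideally averaged over several random seeds.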

