<a href="https://colab.research.google.com/github/NovaVolunteer/Intro-to-Data-Science/blob/master/active_learning_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Active learning is a machine learning technique that optimizes the labeling process by selecting the most informative data points for annotation. For example, in a manufacturing setting, a company may develop an image classification model to detect defective products on an assembly line. Instead of labeling a vast dataset upfront, the company starts with a small labeled dataset and trains an initial model. The trained model then evaluates a large pool of unlabeled product images and, using an uncertainty sampling strategy, selects the images where it is least confident. These selected images are sent to human annotators for labeling, and the newly labeled data is added to the training set before the model is retrained. This process repeats iteratively, allowing the model to improve its accuracy while keeping the number of labeled examples small. Because labeling effort is concentrated on the difficult cases, active learning can reduce annotation costs and speed up learning, which makes it a practical approach when data labeling is expensive or time-consuming. Below is an example script for this scenario. It is essentially pseudocode: placeholder functions stand in for data loading, model training, uncertainty estimation, and human annotation.
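Before the full script, it may help to see how "least confident" can be turned into a number. The sketch below is illustrative only and assumes the model outputs class probabilities; the function names (`least_confidence`, `margin_uncertainty`, `entropy_uncertainty`) and the example probabilities are hypothetical and not part of the script that follows. It shows three common ways to score uncertainty, each of which gives higher values to less certain predictions.

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty = 1 - probability of the most likely class."""
    return 1.0 - probs.max(axis=1)

def margin_uncertainty(probs):
    """A small gap between the top two classes means high uncertainty."""
    sorted_probs = np.sort(probs, axis=1)
    return 1.0 - (sorted_probs[:, -1] - sorted_probs[:, -2])

def entropy_uncertainty(probs, eps=1e-12):
    """Shannon entropy of the predicted distribution (eps avoids log(0))."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

# Hypothetical predicted probabilities for three images: [good, defective]
probs = np.array([
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # close to a coin flip
    [0.70, 0.30],  # somewhere in between
])

scores = least_confidence(probs)
ranking = np.argsort(scores)[::-1]  # indices from most to least uncertain
print("Least-confidence scores:", scores)
print("Labeling priority:", ranking)
```

For binary classification the three scores rank samples in the same order; they can disagree once there are more than two classes, so the choice is a design decision rather than a fixed rule.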

In [None]:
import numpy as np

# Placeholder functions for loading data and model
def load_initial_labeled_data():
    """Loads a small initial set of labeled training data."""
    X_train = np.random.rand(100, 64, 64, 3)  # 100 sample images (64x64 RGB)
    y_train = np.random.randint(0, 2, 100)  # Binary classification (0: good, 1: defective)
    return X_train, y_train

def load_unlabeled_data():
    """Loads a pool of unlabeled images."""
    return np.random.rand(1000, 64, 64, 3)  # 1000 unlabeled images

def train_model(X_train, y_train):
    """Trains a model on labeled data. Placeholder for actual model training."""
    model = "Trained Model Placeholder"
    return model

def predict_uncertainty(model, X_unlabeled):
    """Simulates model predictions and computes uncertainty scores."""
    uncertainty_scores = np.random.rand(len(X_unlabeled))  # Random uncertainties
    return uncertainty_scores

def query_most_uncertain_samples(X_unlabeled, uncertainty_scores, num_samples=10):
    """Selects the most uncertain samples for labeling."""
    most_uncertain_indices = np.argsort(uncertainty_scores)[-num_samples:]  # Highest uncertainty
    return most_uncertain_indices

def human_annotation(X_selected):
    """Simulates human annotation of selected images."""
    return np.random.randint(0, 2, len(X_selected))  # Random labels

# Step 1: Load initial labeled dataset and pool of unlabeled data
X_train, y_train = load_initial_labeled_data()
X_unlabeled = load_unlabeled_data()

# Active Learning Loop
for iteration in range(5):  # Perform 5 rounds of active learning
    print(f"Active Learning Iteration {iteration + 1}")

    # Step 2: Train the model on the labeled data
    model = train_model(X_train, y_train)

    # Step 3: Predict uncertainty scores for unlabeled data
    uncertainty_scores = predict_uncertainty(model, X_unlabeled)

    # Step 4: Select most uncertain samples for labeling
    uncertain_indices = query_most_uncertain_samples(X_unlabeled, uncertainty_scores, num_samples=10)
    X_selected = X_unlabeled[uncertain_indices]

    # Step 5: Human annotates selected samples
    y_selected = human_annotation(X_selected)

    # Step 6: Add newly labeled data to the training set
    X_train = np.concatenate([X_train, X_selected], axis=0)
    y_train = np.concatenate([y_train, y_selected], axis=0)

    # Remove labeled samples from the unlabeled pool
    X_unlabeled = np.delete(X_unlabeled, uncertain_indices, axis=0)

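# Retrain once more so the final model includes the last batch of labels
model = train_model(X_train, y_train)

print("Active Learning Process Complete!")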
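
The placeholders above can be swapped for a real model. The following is a minimal sketch under stated assumptions, not a definitive implementation: it assumes scikit-learn is installed, uses synthetic tabular data from `make_classification` instead of product images, uses `LogisticRegression` as the classifier, and lets held-back true labels (`y_oracle`, a hypothetical name) stand in for the human annotator. None of these choices come from the script above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

SEED = 0

# Synthetic stand-in for the image data (assumption: 20 numeric features)
X, y = make_classification(n_samples=1200, n_features=20,
                           n_informative=5, random_state=SEED)

# Hold out a test set so progress can be measured across rounds
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=200, random_state=SEED, stratify=y)

# Start with 20 labeled samples; the rest form the "unlabeled" pool.
# y_oracle holds the true labels and plays the role of the human annotator.
X_lab, X_unl, y_lab, y_oracle = train_test_split(
    X_pool, y_pool, train_size=20, random_state=SEED, stratify=y_pool)

accuracies = []
for iteration in range(5):
    # Train on everything labeled so far
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    accuracies.append(model.score(X_test, y_test))

    # Least-confidence uncertainty: 1 - probability of the predicted class
    probs = model.predict_proba(X_unl)
    uncertainty = 1.0 - probs.max(axis=1)

    # Query the 10 most uncertain samples and "annotate" them via the oracle
    query_idx = np.argsort(uncertainty)[-10:]
    X_lab = np.concatenate([X_lab, X_unl[query_idx]], axis=0)
    y_lab = np.concatenate([y_lab, y_oracle[query_idx]], axis=0)

    # Remove the newly labeled samples from the pool
    X_unl = np.delete(X_unl, query_idx, axis=0)
    y_oracle = np.delete(y_oracle, query_idx, axis=0)

    print(f"Round {iteration + 1}: {len(X_lab)} labeled, "
          f"test accuracy before this query = {accuracies[-1]:.3f}")
```

Tracking accuracy on a fixed held-out set each round makes it possible to compare uncertainty sampling against a random-sampling baseline, which is the usual way to check whether the strategy is actually helping on a given dataset.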
