# Active Learning Example

Follow this example to write a Python script that performs active learning with a machine learning model. Active learning is a branch of machine learning that aims to reduce the amount of data that must be labeled. It does so by strategically selecting the samples that are most informative for the problem you're trying to solve, so that labeling effort is focused on them.

## Set up the connection

Label Studio API connection configuration:

In [1]:
LABEL_STUDIO_URL = 'http://localhost:8000'
API_KEY = 'd6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3'

Check the connection to the running Label Studio instance:

In [2]:
from label_studio_sdk import Client

ls = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
ls.check_connection()

{'status': 'UP'}

Now create a new project:

In [3]:
from label_studio_sdk.project import ProjectSampling

project = ls.start_project(
    title='AL Project Created from SDK',
    label_config='''
    <View>
    <Text name="text" value="$text"/>
    <Choices name="sentiment" toName="text" choice="single" showInLine="true">
        <Choice value="Positive"/>
        <Choice value="Negative"/>
        <Choice value="Neutral"/>
    </Choices>
    </View>
    '''
)

## Uncertainty sampling

For an active learning scenario, set the project's sampling mode to "Uncertainty Sampling", which automatically reorders tasks so that those with the lowest prediction scores come first:

In [None]:
project.set_sampling(ProjectSampling.UNCERTAINTY)
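
To make the ordering concrete, here is a minimal sketch of the idea behind uncertainty sampling, written in plain Python rather than with the Label Studio API. The task dictionaries and the `order_by_uncertainty` helper are hypothetical; in this example the actual reordering is done by Label Studio itself.

```python
def order_by_uncertainty(tasks):
    """Return tasks sorted so the least confident predictions come first.

    Hypothetical helper: each task is a dict with an 'id' and a
    'score' in [0, 1] giving the model's confidence in its prediction.
    """
    return sorted(tasks, key=lambda task: task['score'])


tasks = [
    {'id': 1, 'score': 0.95},
    {'id': 2, 'score': 0.40},
    {'id': 3, 'score': 0.70},
]
ordered_ids = [task['id'] for task in order_by_uncertainty(tasks)]
print(ordered_ids)  # [2, 3, 1]
```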

Now let's play with a very simple TF-IDF text classification model built on the scikit-learn API. To perform active learning, we need to be able to retrain the model and run inference:

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

labels_map = {
    'Positive': 0,
    'Negative': 1,
    'Neutral': 2
}
inv_labels_map = {idx: label for label, idx in labels_map.items()}


def get_model():
    # Create an untrained TF-IDF + logistic regression pipeline
    return make_pipeline(TfidfVectorizer(), LogisticRegression(C=10, verbose=True))


def train_model(model, input_texts, output_labels):
    # Train the model, given list of input texts and output labels
    model.fit(input_texts, [labels_map[label] for label in output_labels])


def get_model_predictions(model, input_texts):
    # Make model inference and return predicted labels and associated prediction scores
    probabilities = model.predict_proba(input_texts)
    predicted_label_indices = np.argmax(probabilities, axis=1)
    predicted_scores = probabilities[np.arange(len(predicted_label_indices)), predicted_label_indices]
    return [inv_labels_map[i] for i in predicted_label_indices], predicted_scores
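
The score returned by `get_model_predictions` is the probability of the most likely class for each text. The sketch below reproduces that selection in plain Python for a hand-written probability matrix, so the numbers are illustrative and not the output of a trained model; `pick_label_and_score` is a hypothetical name.

```python
inv_labels_map = {0: 'Positive', 1: 'Negative', 2: 'Neutral'}


def pick_label_and_score(probabilities):
    """For each row of class probabilities, return (label, top probability)."""
    results = []
    for row in probabilities:
        best_index = max(range(len(row)), key=lambda i: row[i])
        results.append((inv_labels_map[best_index], row[best_index]))
    return results


# Illustrative probabilities: one row per text, one column per class
probabilities = [
    [0.8, 0.1, 0.1],  # confident: Positive
    [0.3, 0.4, 0.3],  # uncertain: Negative wins narrowly
]
picked = pick_label_and_score(probabilities)
print(picked)  # [('Positive', 0.8), ('Negative', 0.4)]
```

A low top probability, as in the second row, is what makes a task a good candidate for labeling.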

## Active learning step

Now let's collect the tasks from Label Studio that have been labeled so far:

In [None]:
labeled_tasks = project.get_labeled_tasks()
texts, labels = [], []
for labeled_task in labeled_tasks:
    texts.append(labeled_task['data']['text'])
    labels.append(labeled_task['annotations'][0]['result'][0]['value']['choices'][0])
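
The indexing above assumes every labeled task has at least one annotation with a non-empty result. As a hedged variation, the hypothetical helper below skips tasks that don't match that shape (for example, an annotation saved without a choice) instead of raising an error. The task layout it expects is the one used in this example, not a general guarantee.

```python
def extract_text_and_label(task):
    """Return (text, label) for a task, or None if no choice is present.

    Assumes the layout used in this example: the text lives in
    task['data']['text'], and the label is the first choice found
    in the task's annotations.
    """
    for annotation in task.get('annotations', []):
        for result in annotation.get('result', []):
            choices = result.get('value', {}).get('choices', [])
            if choices:
                return task['data']['text'], choices[0]
    return None


sample_tasks = [
    {'data': {'text': 'Great product'},
     'annotations': [{'result': [{'value': {'choices': ['Positive']}}]}]},
    {'data': {'text': 'No choice made'},
     'annotations': [{'result': []}]},
]
pairs = [p for p in map(extract_text_and_label, sample_tasks) if p is not None]
print(pairs)  # [('Great product', 'Positive')]
```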

Create the model and train it on the labeled data:

In [None]:
model = get_model()
train_model(model, texts, labels)

Now collect unlabeled data from Label Studio. Because the unlabeled pool can be large, sample and retrieve only a small subset of the data to keep the step cheap:

In [None]:
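import random

unlabeled_tasks_ids = project.get_unlabeled_tasks_ids()
# Sample at most 10 task IDs; random.sample raises ValueError if asked for more than exist
batch_ids = random.sample(unlabeled_tasks_ids, min(10, len(unlabeled_tasks_ids)))
unlabeled_tasks = project.get_tasks(selected_ids=batch_ids)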

Now run model inference on the sampled subset of tasks:

In [None]:
texts = [task['data']['text'] for task in unlabeled_tasks]
pred_labels, pred_scores = get_model_predictions(model, texts)

Finally, send the model predictions back to Label Studio. Here the model version is named after the amount of labeled data used for training, but it can be any unique name.

In [None]:
model_version = f'model_{len(labeled_tasks)}'
for task, pred_label, pred_score in zip(unlabeled_tasks, pred_labels, pred_scores):
    project.create_prediction(
        task_id=task['id'],
        result=[{
            'from_name': 'sentiment',
            'to_name': 'text',
            'type': 'choices',
            'value': {
                'choices': [pred_label]
            }
        }],
        score=float(pred_score),
        model_version=model_version
    )

The last step is to tell Label Studio to use the latest model version as the source for uncertainty sampling and for the pre-annotations shown to annotators:

In [None]:
project.set_model_version(model_version)

That's it! You can now build an active learning loop by repeating the active learning step periodically. You can also make it event-driven by using Label Studio webhooks to trigger the step whenever a new batch of annotations is created.
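
One simple way to decide when to repeat the step is to retrain only after enough new annotations have accumulated. The sketch below is an illustration under stated assumptions, not part of the Label Studio SDK: `should_retrain` and `min_new_annotations` are hypothetical names, and the helper relies on the `model_<count>` naming convention used above.

```python
def should_retrain(current_model_version, labeled_count, min_new_annotations=20):
    """Decide whether enough new labels exist to justify retraining.

    Hypothetical helper: expects versions named 'model_<count>', where
    <count> is the number of labeled tasks used for the last training.
    """
    if not current_model_version:
        # No model has been trained yet
        return labeled_count > 0
    last_count = int(current_model_version.rsplit('_', 1)[1])
    return labeled_count - last_count >= min_new_annotations


print(should_retrain(None, 5))         # True
print(should_retrain('model_50', 60))  # False
print(should_retrain('model_50', 75))  # True
```

In a polling loop or a webhook handler, you could call this with `len(project.get_labeled_tasks())` and run the active learning step only when it returns `True`.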