# Active Learning Example

Active learning is a branch of machine learning that aims to reduce the amount of data you need to label. It does this by strategically selecting the samples that are most informative for the problem you're trying to solve, so that you can focus your labeling effort on that data.

Follow this example to write a Python script using the [Label Studio SDK](https://labelstud.io/sdk/index.html) that performs active learning with a text classification machine learning model. 

## Set up the connection

Start by configuring the connection to the Label Studio API. You can retrieve your API key from your user profile in Label Studio. In your script, write the following, replacing the value of `API_KEY` with your own key:

In [13]:
LABEL_STUDIO_URL = 'http://localhost:8001'
API_KEY = '91b3b61589784ed069b138eae3d5a5fe1e909f57'

Then, import the `Client` class from the Label Studio SDK, create a client, and check that you can successfully connect to the API:

In [14]:
from label_studio_sdk import Client

ls = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
ls.check_connection()

{'status': 'UP'}

## Create a project

After connecting to the Label Studio API with the SDK, create a project in Label Studio to perform active learning with your data labeling tasks. This project performs sentiment analysis for a passage of text. See the [sentiment analysis template](https://labelstud.io/templates/sentiment_analysis.html) for more. 

In [None]:
from label_studio_sdk.project import ProjectSampling

project = ls.start_project(
    title='AL Project Created from SDK',
    label_config='''
    <View>
    <Text name="text" value="$text"/>
    <Choices name="sentiment" toName="text" choice="single" showInLine="true">
        <Choice value="Positive"/>
        <Choice value="Negative"/>
        <Choice value="Neutral"/>
    </Choices>
    </View>
    '''
)

In an active learning scenario, you want to label the tasks with the lowest machine learning model prediction scores first. You can set up **uncertainty sampling** for your tasks to automatically reorder tasks by prediction score, from low to high.

In [20]:
project.set_sampling(ProjectSampling.UNCERTAINTY)
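
To make the ordering concrete, the sketch below sorts a handful of tasks by prediction score in plain Python. It only illustrates the low-to-high ordering described above: the task dictionaries, their scores, and the `order_by_uncertainty` helper are hypothetical, and Label Studio performs the actual ordering for you once uncertainty sampling is enabled.

```python
def order_by_uncertainty(tasks):
    """Return tasks sorted by prediction score, least confident first."""
    return sorted(tasks, key=lambda task: task['score'])


# Hypothetical tasks with made-up scores, for illustration only
example_tasks = [
    {'id': 1, 'score': 0.91},
    {'id': 2, 'score': 0.38},
    {'id': 3, 'score': 0.55},
]

ordered_ids = [task['id'] for task in order_by_uncertainty(example_tasks)]
print(ordered_ids)  # [2, 3, 1]
```

Task 2 comes first because the model is least confident about it, so labeling it is likely to be the most informative.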

## Set up an example machine learning model

This example uses a simple TF-IDF text classification model built on the [`scikit-learn` API](https://scikit-learn.org/stable/). To perform active learning with this model, you must be able to retrain the model weights and make inferences.

In [None]:
!pip install scikit-learn

In [22]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

labels_map = {
    'Positive': 0,
    'Negative': 1,
    'Neutral': 2
}
inv_labels_map = {idx: label for label, idx in labels_map.items()}


def get_model():
    # Create an untrained TF-IDF + logistic regression pipeline
    return make_pipeline(TfidfVectorizer(), LogisticRegression(C=10, verbose=True))


def train_model(model, input_texts, output_labels):
    # Train the model, given a list of input texts and output labels
    model.fit(input_texts, [labels_map[label] for label in output_labels])


def get_model_predictions(model, input_texts):
    # Make model inference and return predicted labels and associated prediction scores
    probabilities = model.predict_proba(input_texts)
    predicted_label_indices = np.argmax(probabilities, axis=1)
    predicted_scores = probabilities[np.arange(len(predicted_label_indices)), predicted_label_indices]
    return [inv_labels_map[i] for i in predicted_label_indices], predicted_scores
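
The score returned by `get_model_predictions()` is the probability of the most likely class for each text. The following pure-Python sketch mirrors that step on hand-written probability rows, so you can see what the NumPy indexing computes. The `top_label_and_score` helper and the example probabilities are illustrative assumptions, not part of the SDK or scikit-learn.

```python
inv_labels_map = {0: 'Positive', 1: 'Negative', 2: 'Neutral'}


def top_label_and_score(probability_row):
    """Return the most likely label and its probability for one text."""
    best_index = max(range(len(probability_row)), key=lambda i: probability_row[i])
    return inv_labels_map[best_index], probability_row[best_index]


# Made-up class probabilities for two texts (each row sums to 1)
example_probabilities = [
    [0.70, 0.20, 0.10],  # confident: Positive
    [0.30, 0.36, 0.34],  # uncertain: Negative, but only barely
]

results = [top_label_and_score(row) for row in example_probabilities]
print(results)  # [('Positive', 0.7), ('Negative', 0.36)]
```

The second text gets a much lower score than the first, which is what makes it a better candidate for manual labeling.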

## Perform active learning

Collect the annotated tasks from Label Studio so that you can use them to train the model. Each task is stored in [Label Studio JSON format](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks), with the `"text"` field used as the input and the `"choices"` annotation field storing the output label:

In [30]:
labeled_tasks = project.get_labeled_tasks()
texts, labels = [], []
for labeled_task in labeled_tasks:
    texts.append(labeled_task['data']['text'])
    labels.append(labeled_task['annotations'][0]['result'][0]['value']['choices'][0])
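
The indexing above assumes that every labeled task has at least one annotation with a non-empty `result`. If that might not hold in your project (for example, if some annotations were submitted without a choice, which is an assumption about your data), a more defensive extraction could look like the sketch below. The `extract_choice` helper and the sample tasks are hypothetical and only mimic the fields used in this example.

```python
def extract_choice(task):
    """Return the first choice label found in a task's annotations, or None."""
    for annotation in task.get('annotations', []):
        for item in annotation.get('result', []):
            choices = item.get('value', {}).get('choices', [])
            if choices:
                return choices[0]
    return None


# Hypothetical tasks shaped like the Label Studio JSON used above
example_tasks = [
    {'data': {'text': 'Great product'},
     'annotations': [{'result': [{'value': {'choices': ['Positive']}}]}]},
    {'data': {'text': 'No label here'},
     'annotations': [{'result': []}]},
]

pairs = [(task['data']['text'], extract_choice(task)) for task in example_tasks]
# Keep only the tasks for which a label was actually found
pairs = [(text, label) for text, label in pairs if label is not None]
print(pairs)  # [('Great product', 'Positive')]
```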

Update the model weights based on the annotated tasks. This example uses the `train_model()` function from the example machine learning model defined above, but in principle you could use any other classifier training routine.

In [None]:
model = get_model()
train_model(model, texts, labels)

After collecting the annotated tasks, collect the unlabeled tasks so that the machine learning model can make predictions. Because there can be a large number of unlabeled tasks, you can sample them and retrieve only a small subset of data. In this case, collect a random sample of 10 unlabeled tasks:

In [32]:
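import random

unlabeled_tasks_ids = project.get_unlabeled_tasks_ids()
# random.sample() raises ValueError if fewer than 10 unlabeled tasks remain
batch_size = min(10, len(unlabeled_tasks_ids))
batch_ids = random.sample(unlabeled_tasks_ids, batch_size)
unlabeled_tasks = project.get_tasks(selected_ids=batch_ids)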

With the subset of unlabeled tasks that you collected, you can make model inferences to get the predictions from the text classification model:

In [35]:
import numpy as np
texts = [task['data']['text'] for task in unlabeled_tasks]
pred_labels, pred_scores = get_model_predictions(model, texts)

## Send predictions to Label Studio

After the model makes its predictions, return the predictions to Label Studio so that annotators can review and update them. 

Define a model version to identify the latest batch of predictions. This example bases the version name on the amount of data used to retrain the model, but you can use any unique name. Setting a model version is optional, but in an active learning scenario it helps you control which model's predictions are shown in the next iteration.

In [36]:
model_version = f'model_{len(labeled_tasks)}'

Format the predictions and add them to each task:

In [37]:
for task, pred_label, pred_score in zip(unlabeled_tasks, pred_labels, pred_scores):
    project.create_prediction(
        task_id=task['id'],
        # alternatively you can use a simple form here:
        # result=pred_label,
        result=[{
            'from_name': 'sentiment',
            'to_name': 'text',
            'type': 'choices',
            'value': {
                'choices': [pred_label]
            }
        }],
        score=pred_score,
        model_version=model_version
    )
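
If you send predictions from several places in your script, it can help to build the `result` payload in one function. The sketch below is one possible way to do that, assuming the `sentiment` and `text` names from the labeling configuration above. The `build_choice_result` helper and the `ALLOWED_LABELS` set are hypothetical names introduced for illustration and are not part of the Label Studio SDK.

```python
# Labels defined in the labeling configuration of this example
ALLOWED_LABELS = {'Positive', 'Negative', 'Neutral'}


def build_choice_result(label, from_name='sentiment', to_name='text'):
    """Build the `result` list for a single-choice prediction."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f'Unknown label: {label!r}')
    return [{
        'from_name': from_name,
        'to_name': to_name,
        'type': 'choices',
        'value': {'choices': [label]},
    }]


example_result = build_choice_result('Neutral')
print(example_result[0]['value'])  # {'choices': ['Neutral']}
```

Rejecting unknown labels early keeps predictions consistent with the choices that annotators see in the labeling interface.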

Lastly, update the Label Studio settings to use the newly created model version when performing uncertainty sampling and displaying pre-annotated tasks to annotators:

In [38]:
project.set_model_version(model_version)

## Conclusion

That's it! The Label Studio SDK makes it easy to create a repeatable active learning loop with a script. You can run this script on a regular cadence, or use [Label Studio Webhooks](https://labelstud.io/guide/webhooks.html) to perform event-driven active learning. See more about [active learning in Label Studio](https://labelstud.io/guide/active_learning.html).