# Weak Supervision with Label Studio

Perform weak supervision with Label Studio, using noisy labels on a large amount of training data to automatically build a useful training dataset for your supervised learning model. 

In this example, use the [Label Studio SDK](https://labelstud.io/sdk/index.html) to write a Python script that adds noisy labels to tasks, then adds those tasks to Label Studio for review and correction in a supervised learning setting.

**Note:** The code below uses the `LabelStudio` client from Label Studio SDK v1.0 and above.
Scripts written against the older SDK (v0.0.34) are still supported in newer versions, but you need to import from
[`label_studio_sdk._legacy`](../../README.md) in your script.

## Connect to Label Studio

Connect to the API for Label Studio Community, Enterprise, or Teams edition. Create a `LabelStudio` client from the Label Studio SDK, reading the server URL and API key from environment variables:

In [None]:
import os
from label_studio_sdk.client import LabelStudio

ls = LabelStudio(base_url=os.getenv('LABEL_STUDIO_URL', 'http://localhost:8080'), api_key=os.getenv('LABEL_STUDIO_API_KEY'))

### Create a project

Create a simple text classification project for [sentiment analysis](https://labelstud.io/templates/sentiment_analysis.html) that identifies the sentiment expressed by a passage of text:

In [None]:
project = ls.projects.create(
    title='Weak Supervision example with SDK',
    label_config='''
    <View>
    <Text name="text" value="$text"/>
    <View style="box-shadow: 2px 2px 5px #999; padding: 20px; margin-top: 2em; border-radius: 5px;">
        <Header value="Choose text sentiment"/>
        <Choices name="sentiment" toName="text" choice="single" showInLine="true">
            <Choice value="Positive"/>
            <Choice value="Negative"/>
            <Choice value="Neutral"/>
        </Choices>
    </View>
    </View>
    '''
)
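
The choice values in the labeling configuration are the only labels a prediction for this project can carry, so it can help to read them back from the XML before writing any labelers. The following is a minimal sketch using only the standard library; `choice_values` is a hypothetical helper, and `LABEL_CONFIG` is an abridged copy of the configuration above (styling and header removed).

```python
import xml.etree.ElementTree as ET

# Abridged copy of the labeling configuration used for the project above.
LABEL_CONFIG = '''
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text" choice="single" showInLine="true">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>
'''

def choice_values(label_config, control_name):
    """Return the Choice values declared under the named Choices tag (hypothetical helper)."""
    root = ET.fromstring(label_config.strip())
    for choices in root.iter('Choices'):
        if choices.get('name') == control_name:
            return [choice.get('value') for choice in choices.iter('Choice')]
    return []

allowed = choice_values(LABEL_CONFIG, 'sentiment')
print(allowed)
```

Any label a programmatic labeler emits later should be one of the values in `allowed`.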

## Import tasks

Import small text samples into Label Studio, then retrieve their task IDs. This example uses a small subset of tasks from the available Amazon Review dataset:

In [None]:
!pip install pandas

In [None]:
import os

import pandas as pd

# The data file may sit next to the notebook or under weak_supervision/
p1 = os.path.join('data', 'amazon_cells_labelled.tsv')
p2 = os.path.join('weak_supervision', 'data', 'amazon_cells_labelled.tsv')
csv_path = p1 if os.path.exists(p1) else p2
if not os.path.exists(csv_path):
    print(f"Error: Data file not found at {p1} or {p2}.")
    tasks = []
    ids = []
else:
    # Each TSV row becomes one task dict keyed by column name
    tasks = pd.read_csv(csv_path, sep='\t').to_dict('records')
    # Bulk import tasks via v2
    _ = ls.projects.import_tasks(id=project.id, request=tasks)
    # Fetch the IDs of the imported tasks
    ids = [t.id for t in ls.tasks.list(project=project.id, fields='task_only')]
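
Each record becomes one task, and its keys need to include the field that the labeling configuration reads, here `text` through `value="$text"`. The sketch below checks that on an in-memory TSV using the standard library. The two rows and the `sentiment` column name are made up for illustration and are not taken from the dataset; `rows_to_tasks` is a hypothetical helper.

```python
import csv
import io

# Made-up two-row sample standing in for the TSV file. The `sentiment`
# column name is an assumption; only `text` is read by the label config.
SAMPLE_TSV = "text\tsentiment\nGreat phone, I like it.\t1\nThe charger is junk.\t0\n"

def rows_to_tasks(tsv_text, required_field='text'):
    """Parse TSV text into task dicts and check that the required field is present."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter='\t'))
    missing = [i for i, row in enumerate(rows) if not row.get(required_field)]
    if missing:
        raise ValueError(f"rows missing '{required_field}': {missing}")
    return rows

sample_tasks = rows_to_tasks(SAMPLE_TSV)
print(sample_tasks[0]['text'])
```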

## Create noisy predictions

Perform programmatic labeling to create weakly supervised annotations for the text samples. Labeling operations, or **LabelOps** for short, are noisy programmatic labelers that encode subject matter experts' domain knowledge as a simple pattern-to-class mapping. This example assigns a sentiment class based on specific keywords in the Amazon review. In more complex scenarios, the noisy labels could come from the confidence scores of a learned classifier, crowdsourced labels, and more.

In [None]:
import random
import re

# Noisy programmatic labelers
label_ops = {
    r'.*\b(good|excellent|great|cool)': 'Positive',
    r'.*\bi\s+like': 'Positive',
    r'.*\bnot': 'Negative',
    r'.*\bdisappointed': 'Negative',
    r'.*\bjunk': 'Negative'
}

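# Pre-annotations in Label Studio JSON format (v2)
# Fetch the task list once instead of once per labeler
project_tasks = list(ls.tasks.list(project=project.id, fields='task_only'))
for label_regex, label in label_ops.items():
    # Each regex doubles as the model version, so predictions can be grouped by labeler
    model_version = label_regex
    for t in project_tasks:
        text = t.data.get('text', '').lower()
        if re.match(label_regex, text):
            ls.predictions.create(
                task=t.id,
                result=[{
                    'from_name': 'sentiment',
                    'to_name': 'text',
                    'type': 'choices',
                    'value': {
                        'choices': [label]
                    }
                }],
                # Random placeholder score; the regex labelers have no real confidence
                score=random.random(),
                model_version=model_version
            )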
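
Before writing predictions to the server, it can be useful to dry-run the labelers on a few strings. The sketch below repeats the `label_ops` patterns and applies them with `re.match` on lowercased text, the same way the loop above does. The sample sentences are invented, and `fired_labels` is a hypothetical helper.

```python
import re

# Same patterns as the labelers above, repeated so this cell runs on its own.
label_ops = {
    r'.*\b(good|excellent|great|cool)': 'Positive',
    r'.*\bi\s+like': 'Positive',
    r'.*\bnot': 'Negative',
    r'.*\bdisappointed': 'Negative',
    r'.*\bjunk': 'Negative',
}

def fired_labels(text):
    """Return (pattern, label) pairs for every labeler that matches the text."""
    lowered = text.lower()
    return [(rx, label) for rx, label in label_ops.items() if re.match(rx, lowered)]

# Invented sample sentences, not taken from the dataset.
for sample in ['I like this phone', 'Not good at all', 'Nothing to report', 'Arrived on Tuesday']:
    print(sample, '->', [label for _, label in fired_labels(sample)])
```

Two kinds of noise show up here: "Not good at all" receives conflicting labels from two labelers, and because `\bnot` has no trailing word boundary it also fires on "Nothing to report".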

## (Optional) Quality metrics

Some quality metrics endpoints are Enterprise-only or not exposed in SDK v2, so this OSS-compatible notebook skips them.

In [None]:
# Skipped: Enterprise-only or not exposed metrics in SDK v2
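
As a stand-in that needs no server, per-labeler quality can be estimated locally against a small hand-checked sample. The sketch below is a custom calculation, not Label Studio's own metrics: it reports coverage (the share of examples a labeler matches) and accuracy on the matched examples. The `gold` pairs are invented, and `labeler_stats` is a hypothetical helper.

```python
import re

# Same patterns as the labelers above, repeated so this cell runs on its own.
label_ops = {
    r'.*\b(good|excellent|great|cool)': 'Positive',
    r'.*\bi\s+like': 'Positive',
    r'.*\bnot': 'Negative',
    r'.*\bdisappointed': 'Negative',
    r'.*\bjunk': 'Negative',
}

# Invented (text, true label) pairs standing in for a hand-checked sample.
gold = [
    ('great battery life', 'Positive'),
    ('i like the design', 'Positive'),
    ('not worth the money', 'Negative'),
    ('nothing wrong with it, works great', 'Positive'),
    ('total junk', 'Negative'),
    ('arrived on time', 'Neutral'),
]

def labeler_stats(patterns, examples):
    """Per-pattern coverage and accuracy on the examples each pattern matches."""
    stats = {}
    for rx, label in patterns.items():
        hits = [true for text, true in examples if re.match(rx, text.lower())]
        correct = sum(1 for true in hits if true == label)
        stats[rx] = {
            'coverage': len(hits) / len(examples),
            # Accuracy is undefined when the pattern matched nothing
            'accuracy': correct / len(hits) if hits else None,
        }
    return stats

stats = labeler_stats(label_ops, gold)
for rx, s in stats.items():
    print(rx, s)
```

On this toy sample the `\bnot` labeler is right only half the time, which is the kind of signal used to decide which labelers to trust in the next step.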

## Create annotations from specific model versions

Based on quality metrics from previous steps, select a subset of high-performing programmatic labelers to use, then combine the relevant predictions into annotations for the relevant tasks:

In [None]:
print('Skipping conversion of predictions to annotations (Enterprise-only or custom implementation needed).')
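
One possible custom implementation is a majority vote over the trusted labelers. The sketch below runs without a server: `predictions_by_task`, the task IDs, and the `trusted_versions` selection are all hypothetical, and the output reuses the result structure from the predictions above. Submitting the results to Label Studio is left out.

```python
from collections import Counter

# Hypothetical input: for each task ID, the (model_version, label) pairs of its predictions.
predictions_by_task = {
    101: [(r'.*\bjunk', 'Negative'), (r'.*\bnot', 'Negative')],
    102: [(r'.*\b(good|excellent|great|cool)', 'Positive'), (r'.*\bnot', 'Negative')],
    103: [(r'.*\bnot', 'Negative')],
}

# Labelers kept after reviewing their quality; this selection is an assumption.
trusted_versions = {r'.*\bjunk', r'.*\b(good|excellent|great|cool)', r'.*\bdisappointed'}

def vote(preds, trusted):
    """Majority label among trusted labelers, or None when there are no votes or a tie."""
    counts = Counter(label for version, label in preds if version in trusted)
    ranked = counts.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]

def to_result(label):
    """Wrap a label in the same result structure used for the predictions above."""
    return [{'from_name': 'sentiment', 'to_name': 'text', 'type': 'choices',
             'value': {'choices': [label]}}]

annotations = {}
for task_id, preds in predictions_by_task.items():
    label = vote(preds, trusted_versions)
    if label is not None:
        annotations[task_id] = to_result(label)

print(annotations)
```

Tasks with no trusted vote or with a tie, such as task 103 here, are left without an annotation and remain candidates for human review.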

## Conclusion

After performing programmatic noisy labeling on a dataset, you can evaluate the quality of the predictions programmatically. Then, using the Label Studio SDK, you can transform the highest-quality predictions into annotations to train a weakly supervised model.

You can also take the items that the programmatic labelers find most confusing and [import pre-annotations into Label Studio](https://github.com/heartexlabs/label-studio-sdk/blob/master/examples/Import%20preannotations.ipynb) for human-in-the-loop annotator review and correction.
