Copied and adapted from: https://github.com/HumanSignal/label-studio-sdk/blob/master/examples/annotate_data_from_gcs/annotate_data_from_gcs.ipynb

# Import data from Google Cloud Storage (GCS)

Hosting data in the cloud is a convenient and secure option for data labeling: you sync task references to Label Studio, and annotators view and label the tasks without the data ever leaving the secure cloud bucket.

If your data is hosted in Google Cloud Storage (GCS), you can write a Python script to continuously sync data from the bucket with Label Studio. Follow this example to see how to do that with the [Label Studio SDK](https://labelstud.io/sdk/index.html). 

## Connect to your GCS bucket

Connect to your GCS bucket and create a list of task references that Label Studio can use, based on the contents of your bucket. 

In [12]:
!pip install --upgrade google-cloud-storage



In [13]:
import os
from google.cloud import storage as google_storage

BUCKET_NAME = 'ferre-runway-am'  # specify your bucket name here
GOOGLE_APPLICATION_CREDENTIALS = '../../secrets/service_account_key.json'  # specify your GCS credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = GOOGLE_APPLICATION_CREDENTIALS

google_client = google_storage.Client()
bucket = google_client.get_bucket(BUCKET_NAME)

Load the metadata file that describes the images. Each row describes one component of an image, with `id_file` identifying the image and `id_component` numbering its components:

In [14]:
import pandas as pd
papam = pd.read_csv("../PaPAM_eng.csv")
papam.head()

Unnamed: 0,id_file,id_component,year,collection,season,subject,media_type,color,manuf.processing,manuf.processing.descr,material,material.weave,material.descr,motif,name,theme,typology,typology.sub
0,10894,0,1979,prêt à porter,spring-summer,woman,runway show pictures,,,,suede leather,,,,duster coat,,,
1,10894,1,1979,prêt à porter,spring-summer,woman,runway show pictures,,,,,,,,shirt,,,
2,10894,2,1979,prêt à porter,spring-summer,woman,runway show pictures,,,,,,,,pants,,,
3,10895,0,1979,prêt à porter,spring-summer,woman,runway show pictures,,,,suede leather,,,,duster coat,,,
4,10895,1,1979,prêt à porter,spring-summer,woman,runway show pictures,,,,,,,,pantsuit,,,

Build one task per image in the bucket. The file name (without its extension) is matched against `id_file`, the image-level columns are taken from the first matching row, and the component-level columns are merged into comma-separated strings:


In [16]:
def concat_non_null(series):
    # Join the distinct non-null values, keeping first-seen order so the
    # result is the same on every run (set() does not guarantee an order)
    return ', '.join(series.dropna().astype(str).unique())

tasks = []
for blob in bucket.list_blobs():
    filename = blob.name
    # Files are named after their id_file value, e.g. '11063.jpg' -> 11063
    filemaker_id = int(os.path.splitext(filename)[0])
    filtered_df = papam[papam['id_file'] == filemaker_id]
    if filtered_df.empty:
        continue  # no metadata rows for this file, so .item() below would fail
    # Image-level columns repeat on every row, so take the first value;
    # component-level columns are merged into one string per image
    concatenated_info = filtered_df.groupby('id_file').agg({
        'year': lambda x: x.iloc[0],
        'collection': lambda x: x.iloc[0],
        'season': lambda x: x.iloc[0],
        'subject': lambda x: x.iloc[0],
        'media_type': lambda x: x.iloc[0],
        'name': lambda x: concat_non_null(x),
        'theme': lambda x: concat_non_null(x),
        'typology': lambda x: concat_non_null(x),
        'typology.sub': lambda x: concat_non_null(x)
    }).reset_index()
    # Keys such as 'Name' and 'Theme' must match the $variables in the label config
    tasks.append({
        'image': f'gs://{BUCKET_NAME}/{filename}',
        'id_file': str(concatenated_info['id_file'].item()),
        'year': str(concatenated_info['year'].item()),
        'collection': str(concatenated_info['collection'].item()),
        'season': str(concatenated_info['season'].item()),
        'subject': str(concatenated_info['subject'].item()),
        'media_type': str(concatenated_info['media_type'].item()),
        'Name': str(concatenated_info['name'].item()),
        'Theme': str(concatenated_info['theme'].item()),
        'Typology': str(concatenated_info['typology'].item()),
        'Typology_sub': str(concatenated_info['typology.sub'].item())
    })
tasks

[{'image': 'gs://ferre-runway-am/11063.jpg',
  'id_file': '11063',
  'year': '1996',
  'collection': 'prêt à porter',
  'season': 'spring-summer',
  'subject': 'woman',
  'media_type': 'runway show pictures',
  'Name': 'trench coat, footwear, accessory',
  'Theme': '',
  'Typology': 'décolleté, gloves, glasses',
  'Typology_sub': ''},
 {'image': 'gs://ferre-runway-am/15600.jpg',
  'id_file': '15600',
  'year': '1986',
  'collection': 'alta moda',
  'season': 'fall-winter',
  'subject': 'nan',
  'media_type': 'runway show pictures',
  'Name': '',
  'Theme': '',
  'Typology': '',
  'Typology_sub': ''},
 {'image': 'gs://ferre-runway-am/60189.jpg',
  'id_file': '60189',
  'year': '1989',
  'collection': 'alta moda',
  'season': 'spring-summer',
  'subject': 'nan',
  'media_type': 'runway show pictures',
  'Name': '',
  'Theme': '',
  'Typology': '',
  'Typology_sub': ''}]
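
The aggregation above can be hard to follow without access to the bucket. The following is a minimal, standard-library sketch of the same idea, applied to the three sample rows for `id_file` 10894 shown in the `papam.head()` output. The helper names `join_distinct` and `build_task` and the bucket name `my-bucket` are hypothetical and used only for illustration; the notebook itself does this with pandas.

```python
import os

# Three rows copied from the papam.head() output (only the columns used here)
SAMPLE_ROWS = [
    {'id_file': 10894, 'year': 1979, 'season': 'spring-summer', 'name': 'duster coat'},
    {'id_file': 10894, 'year': 1979, 'season': 'spring-summer', 'name': 'shirt'},
    {'id_file': 10894, 'year': 1979, 'season': 'spring-summer', 'name': 'pants'},
]

def join_distinct(values):
    """Join distinct non-empty values, keeping first-seen order."""
    seen = []
    for value in values:
        if value and value not in seen:
            seen.append(value)
    return ', '.join(seen)

def build_task(bucket_name, filename, rows):
    """Build one task dict from the metadata rows that belong to one image."""
    file_id = int(os.path.splitext(filename)[0])  # '10894.jpg' -> 10894
    matching = [row for row in rows if row['id_file'] == file_id]
    if not matching:
        return None  # no metadata for this image
    first = matching[0]  # image-level values repeat on every row
    return {
        'image': f'gs://{bucket_name}/{filename}',
        'id_file': str(file_id),
        'year': str(first['year']),
        'season': str(first['season']),
        'Name': join_distinct(row['name'] for row in matching),
    }

sample_task = build_task('my-bucket', '10894.jpg', SAMPLE_ROWS)
print(sample_task['Name'])  # duster coat, shirt, pants
```

Returning `None` for an image without metadata lets the caller decide whether to skip the file or import it with empty fields.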

## Create a Label Studio Project

Connect to the Label Studio API with your personal API key, which you can retrieve from your user account page, and confirm you can successfully connect:

In [17]:
from label_studio_sdk import Client
LABEL_STUDIO_URL = 'http://localhost:8080'  # URL of your Label Studio instance
API_KEY = '60c169ef3264edc59708e8b6d763947fb6078a90'  # specify your API key here

ls = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
ls.check_connection()

{'status': 'UP'}
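
Hard-coding the API key in a notebook makes it easy to leak when the notebook is shared. One common alternative, sketched here, is to read the connection settings from environment variables. The variable names `LABEL_STUDIO_URL` and `LABEL_STUDIO_API_KEY` and the helper `load_label_studio_settings` are assumptions chosen for this example.

```python
import os

def load_label_studio_settings(env=None):
    """Return (url, api_key) read from environment-style variables."""
    env = os.environ if env is None else env
    # Fall back to the local URL used in this notebook
    url = env.get('LABEL_STUDIO_URL', 'http://localhost:8080')
    api_key = env.get('LABEL_STUDIO_API_KEY')
    if not api_key:
        raise RuntimeError('Set LABEL_STUDIO_API_KEY before connecting')
    return url.rstrip('/'), api_key

# Example with an explicit mapping instead of the real environment
url, api_key = load_label_studio_settings({
    'LABEL_STUDIO_URL': 'http://localhost:8080/',
    'LABEL_STUDIO_API_KEY': 'your-api-key',
})
print(url)
```

The returned values can then be passed to `Client(url=url, api_key=api_key)` in place of the constants.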

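Load an existing project or create a new one. To find the ID of an existing project, list your projects through the API. The project in this example extends the basic [image object detection template](https://labelstud.io/templates/image_bbox.html) with brush, keypoint, and rectangle labels: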

In [18]:
!curl -X GET http://localhost:8080/api/projects/ -H "Authorization: Token 60c169ef3264edc59708e8b6d763947fb6078a90"

{"count":3,"next":null,"previous":null,"results":[{"id":7,"title":"Image Annotation Project from SDK","description":"","label_config":"<View>\n  <Header value=\"Name: $Name\"/>\n  <Header value=\"Theme: $Theme\"/>\n  <Header value=\"Typology: $Typology\"/>\n  <Header value=\"Typology_sub: $Typology_sub\"/>\n  \n        <Image name=\"image\" value=\"$image\" zoom=\"true\" zoomControl=\"true\" rotateControl=\"false\"/>\n\n        <Header value=\"Brush Labels\"/>\n        <BrushLabels name=\"tag\" toName=\"image\">\n            <Label value=\"jacket\" background=\"#34a00d\"/>\n            <Label value=\"coat\" background=\"#D4380D\"/>\n            <Label value=\"shirt\" background=\"#FFC069\"/>\n            <Label value=\"blouse\" background=\"#AD8B00\"/>\n            <Label value=\"other tops\" background=\"#D3F261\"/>\n            <Label value=\"jersey shirt\" background=\"#389E0D\"/>\n            <Label value=\"dress\" background=\"#5CDBD3\"/>\n            <Label value=\"jumpsuit\" bac
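
The response is a JSON object whose `results` list holds one entry per project. As a sketch, the ID to use as `PROJECT_ID` can be looked up by title instead of being read off by eye. The payload below is an abridged, hand-written stand-in for the truncated output above, and `find_project_id` is a hypothetical helper.

```python
import json

def find_project_id(response_text, title):
    """Return the id of the first project whose title matches, or None."""
    payload = json.loads(response_text)
    for project in payload.get('results', []):
        if project.get('title') == title:
            return project['id']
    return None

# Abridged stand-in for the API response (assumption: one project only)
sample_response = '''
{"count": 1, "next": null, "previous": null,
 "results": [{"id": 7, "title": "Image Annotation Project from SDK"}]}
'''
project_id = find_project_id(sample_response, 'Image Annotation Project from SDK')
print(project_id)
```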

In [19]:
PROJECT_ID = 7  # ID of an existing project, taken from the project list
project = ls.get_project(PROJECT_ID)
# Or start a new project by uncommenting the block below:
# project = ls.start_project(
#     title='Image Annotation Project from SDK',
#     label_config='''
#     <View>
#       <Header value="Name: $Name"/>
#       <Header value="Theme: $Theme"/>
#       <Header value="Typology: $Typology"/>
#       <Header value="Typology_sub: $Typology_sub"/>
#         <Image name="image" value="$image" zoom="true" zoomControl="true" rotateControl="false"/>

#         <Header value="Brush Labels"/>
#         <BrushLabels name="tag" toName="image">
#             <Label value="jacket" background="#34a00d"/>
#             <Label value="coat" background="#D4380D"/>
#             <Label value="shirt" background="#FFC069"/>
#             <Label value="blouse" background="#AD8B00"/>
#             <Label value="other tops" background="#D3F261"/>
#             <Label value="jersey shirt" background="#389E0D"/>
#             <Label value="dress" background="#5CDBD3"/>
#             <Label value="jumpsuit" background="#096DD9"/>
#             <Label value="skirt" background="#ADC6FF"/>
#             <Label value="pants" background="#9254DE"/>
#             <Label value="knitwear" background="#F759AB"/>
#             <Label value="tailleur" background="#FFA39E"/>
#             <Label value="swimsuit" background="#D4380D"/>
#             <Label value="accessory" background="#FFC069"/>
#         </BrushLabels>
        
#         <Header value="Keypoint Labels"/>
#         <KeyPointLabels name="tag2" toName="image" smart="true">
#             <Label value="jacket" background="#AD8B00"/>
#             <Label value="coat" background="#D3F261"/>
#             <Label value="shirt" background="#389E0D"/>
#             <Label value="blouse" background="#5CDBD3"/>
#             <Label value="other tops" background="#096DD9"/>
#             <Label value="jersey shirt" background="#ADC6FF"/>
#             <Label value="dress" background="#9254DE"/>
#             <Label value="jumpsuit" background="#F759AB"/>
#             <Label value="skirt" background="#FFA39E"/>
#             <Label value="pants" background="#D4380D"/>
#             <Label value="knitwear" background="#FFC069"/>
#             <Label value="tailleur" background="#AD8B00"/>
#             <Label value="swimsuit" background="#D3F261"/>
#             <Label value="accessory" background="#389E0D"/>
#         </KeyPointLabels>
        
#         <Header value="Rectangle Labels"/>
#         <RectangleLabels name="tag3" toName="image" smart="true" showInline="true">
#             <Label value="jacket" background="#5CDBD3"/>
#             <Label value="coat" background="#096DD9"/>
#             <Label value="shirt" background="#ADC6FF"/>
#             <Label value="blouse" background="#9254DE"/>
#             <Label value="other tops" background="#F759AB"/>
#             <Label value="jersey shirt" background="#FFA39E"/>
#             <Label value="dress" background="#D4380D"/>
#             <Label value="jumpsuit" background="#FFC069"/>
#             <Label value="skirt" background="#AD8B00"/>
#             <Label value="pants" background="#D3F261"/>
#             <Label value="knitwear" background="#389E0D"/>
#             <Label value="tailleur" background="#5CDBD3"/>
#             <Label value="swimsuit" background="#096DD9"/>
#             <Label value="accessory" background="#ADC6FF"/>
#         </RectangleLabels>
        
#         <MagicWand name="magicwand" toName="image"/>
#     </View>
#     '''
# )
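
The commented configuration repeats the same 14 labels under three tag types. If you do create a new project, one way to avoid maintaining the list three times is to generate the XML. This is a sketch only: `build_label_config` is a hypothetical helper, and it omits the headers, the `MagicWand` tag, the per-label `background` colors, and tag-specific attributes such as `smart`, which you would need to add back.

```python
from xml.sax.saxutils import quoteattr

LABELS = [
    'jacket', 'coat', 'shirt', 'blouse', 'other tops', 'jersey shirt', 'dress',
    'jumpsuit', 'skirt', 'pants', 'knitwear', 'tailleur', 'swimsuit', 'accessory',
]

# (tag type, control name) pairs taken from the configuration above
CONTROLS = [
    ('BrushLabels', 'tag'),
    ('KeyPointLabels', 'tag2'),
    ('RectangleLabels', 'tag3'),
]

def build_label_config(labels, controls):
    """Return a config string with the same labels under each control tag."""
    lines = ['<View>', '  <Image name="image" value="$image" zoom="true"/>']
    for tag_type, name in controls:
        # quoteattr escapes the value and wraps it in quotes
        lines.append(f'  <{tag_type} name={quoteattr(name)} toName="image">')
        for label in labels:
            lines.append(f'    <Label value={quoteattr(label)}/>')
        lines.append(f'  </{tag_type}>')
    lines.append('</View>')
    return '\n'.join(lines)

label_config = build_label_config(LABELS, CONTROLS)
print(label_config.count('<Label '))  # 42 (14 labels x 3 controls)
```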

## Connect the project to your GCS bucket

Connect your project to your GCS bucket as an import storage:

In [20]:
project.connect_google_import_storage(
    bucket=BUCKET_NAME,
    google_application_credentials=GOOGLE_APPLICATION_CREDENTIALS
)

{'id': 11,
 'type': 'gcs',
 'synchronizable': True,
 'presign': True,
 'bucket': 'ferre-runway-am',
 'prefix': None,
 'regex_filter': None,
 'use_blob_urls': True,
 'google_project_id': None,
 'last_sync': None,
 'last_sync_count': None,
 'last_sync_job': None,
 'status': 'initialized',
 'traceback': None,
 'meta': {},
 'title': '',
 'description': '',
 'created_at': '2024-03-26T14:11:00.033808Z',
 'presign_ttl': 1,
 'project': 7}

## Sync tasks from GCS to Label Studio

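After connecting the project to your bucket, you can import the tasks with their private `gs://` links into Label Studio. When an annotator opens a task in the Label Studio interface, the link is automatically presigned, so the bucket itself stays private. `import_tasks` returns the IDs of the created tasks: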

In [21]:
project.import_tasks(tasks)

[118, 119, 120]
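
Each `$variable` in the labeling configuration (`$image`, `$Name`, `$Theme`, `$Typology`, `$Typology_sub`) is filled from the task key of the same name, so a task that lacks one of those keys cannot be rendered as intended. A quick check before importing can catch that. The sketch below is an illustration under assumptions: `missing_task_keys` is a hypothetical helper, the configuration is shortened, and the regular expression only handles simple `$name` references like the ones used here.

```python
import re

def missing_task_keys(label_config, task):
    """Return the sorted $variables in label_config that task does not define."""
    referenced = set(re.findall(r'\$(\w+)', label_config))
    return sorted(referenced - set(task))

# Shortened version of the labeling configuration used in this example
config = '''
<View>
  <Header value="Name: $Name"/>
  <Header value="Theme: $Theme"/>
  <Image name="image" value="$image"/>
</View>
'''
task = {'image': 'gs://ferre-runway-am/11063.jpg', 'Name': 'trench coat'}
print(missing_task_keys(config, task))  # ['Theme']
```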

## Conclusion

In a few lines of code you assessed the data in your bucket, set up a labeling project, and synced the tasks to the project. You can adapt this example to build a pipeline that runs from data creation to data labeling.