# Data Labeling

Labeling (or annotation) is the process of identifying the inputs and outputs that are **worth modeling** (not just what could be modeled).

- use objective as a guide to determine the necessary signals.
- explore creating new signals (via combining features, collecting new data, etc.).
- iteratively add more features to justify complexity and effort.

> **Warning**
> 
> Be careful not to include features that are not present during prediction.
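
One lightweight guard is to compare the features used in training against the fields known to exist at prediction time. The sketch below is only an illustration: the `find_unavailable_features` helper and the `num_likes` feature are hypothetical and not part of the course code.

```python
def find_unavailable_features(training_features, prediction_features):
    """Return the training features that will not exist at prediction time."""
    available = set(prediction_features)
    return [f for f in training_features if f not in available]


# Hypothetical example: `num_likes` only accumulates after a project is published,
# so it cannot be used to classify incoming content.
training_features = ["title", "description", "num_likes"]
prediction_features = ["title", "description"]

leaky = find_unavailable_features(training_features, prediction_features)
print(leaky)  # ['num_likes']
```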


This is also the phase to deepen our understanding of the data, problem, process, constraints and domain.

Note down or determine:

- augmentations and training split
- enhancement with extra data
- simplifications
- removal of noisy samples
- improvements for labeling


## Process
Regardless of whether we have a custom labeling platform or we choose a generalized platform, the process of labeling and all its related workflows (QA, data import/export, etc.) follow a similar approach.

### Preliminary steps

[WHAT] Decide what needs to be labeled:

- identify natural labels you may already have (ex. time-series)
- consult with domain experts to ensure you're labeling the appropriate signals
- decide on the appropriate labels (and hierarchy) for your task

[WHERE] Design the labeling interface:

- intuitive, data modality dependent and quick (keybindings are a must!)
- avoid option paralysis by allowing the labeler to dig deeper or suggesting likely labels
- measure and resolve inter-labeler discrepancy
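
One common way to measure inter-labeler discrepancy is Cohen's kappa, which corrects the raw agreement rate between two labelers for the agreement expected by chance. The sketch below is a minimal pure-Python version, assuming two labelers annotated the same items in the same order; the `cohen_kappa` helper and the example annotations are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who annotated the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both labelers must annotate the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeler's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up annotations for four projects.
labeler_1 = ["computer-vision", "graph-learning", "graph-learning", "reinforcement-learning"]
labeler_2 = ["computer-vision", "graph-learning", "computer-vision", "reinforcement-learning"]
kappa = cohen_kappa(labeler_1, labeler_2)
print(f"kappa = {kappa:.3f}")  # kappa = 0.636
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so low scores point to labels or instructions that need clarifying.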

[HOW] Compose labeling instructions:

- examples of each labeling scenario
- course of action for discrepancies


### Workflow setup

Establish data pipelines:

- [IMPORT] new data for annotation
- [EXPORT] annotated data for QA, testing, modeling, etc.

Create a quality assurance (QA) workflow:

- separate from labeling workflow (no bias)
- communicates with labeling workflow to escalate errors

### Iterative setup

Implement strategies to reduce labeling efforts:

- identify subsets of the data to label next using active learning
- auto-label entire or parts of a dataset using weak supervision
- focus labeling efforts on long tail of edge cases over time
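
As a rough illustration of weak supervision, the sketch below auto-labels text with simple keyword heuristics (labeling functions) and combines them by majority vote, abstaining when nothing matches or the vote is tied. The keyword lists and helper names are assumptions made up for this example; in practice a dedicated weak supervision library and tuned heuristics would replace them.

```python
from collections import Counter

ABSTAIN = None

# Hypothetical keyword heuristics; real labeling functions would be tuned on the data.
KEYWORDS = {
    "computer-vision": ["yolo", "rcnn", "image"],
    "graph-learning": ["graph", "node embedding"],
    "reinforcement-learning": ["monte carlo tree search", "reinforcement"],
}


def make_keyword_lf(tag, keywords):
    """Build a labeling function that votes for `tag` when any keyword matches."""
    def labeling_function(text):
        text = text.lower()
        return tag if any(k in text for k in keywords) else ABSTAIN
    return labeling_function


LABELING_FUNCTIONS = [make_keyword_lf(tag, kws) for tag, kws in KEYWORDS.items()]


def weak_label(text):
    """Majority vote over non-abstaining labeling functions; abstain on ties or no votes."""
    votes = Counter(lf(text) for lf in LABELING_FUNCTIONS)
    votes.pop(ABSTAIN, None)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


print(weak_label("Comparison between YOLO and RCNN"))  # computer-vision
print(weak_label("Awesome Graph Classification"))      # graph-learning
print(weak_label("Forecasting weekly sales"))          # None
```

Samples where the heuristics abstain are natural candidates to send to human labelers.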

## Labeled data

For the purpose of this course, our data is already labeled, so we'll perform a basic version of ETL (extract, transform, load) to construct the labeled dataset.

- [projects.csv](./datasets/projects.csv): projects with id, created time, title and description.
- [tags.csv](./datasets/tags.csv): labels (tag category) for the projects by id.

Recall that our objective was to classify incoming content so that the community can discover it easily. These data assets will act as the training data for our first model.

## Extract


```python
import pandas as pd
```

```python
# Extract data
PROJECTS_URL = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.csv"
TAGS_URL = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/tags.csv"

projects = pd.read_csv(PROJECTS_URL)
tags = pd.read_csv(TAGS_URL)
projects.head(10)
```

|   | id | created_on | title | description |
| --- | --- | --- | --- | --- |
| 0 | 6 | 2020-02-20 06:43:18 | Comparison between YOLO and RCNN on real world... | Bringing theory to experiment is cool. We can ... |
| 1 | 7 | 2020-02-20 06:47:21 | Show, Infer & Tell: Contextual Inference for C... | The beauty of the work lies in the way it arch... |
| 2 | 9 | 2020-02-24 16:24:45 | Awesome Graph Classification | A collection of important graph embedding, cla... |
| 3 | 15 | 2020-02-28 23:55:26 | Awesome Monte Carlo Tree Search | A curated list of Monte Carlo tree search pape... |
| 4 | 19 | 2020-03-03 13:54:31 | Diffusion to Vector | Reference implementation of Diffusion2Vec (Com... |
| 5 | 25 | 2020-03-07 23:04:31 | AttentionWalk | A PyTorch Implementation of "Watch Your Step: ... |
| 6 | 26 | 2020-03-07 23:11:58 | Graph Wavelet Neural Network | A PyTorch implementation of "Graph Wavelet Neu... |
| 7 | 27 | 2020-03-07 23:18:15 | APPNP and PPNP | A PyTorch implementation of "Predict then Prop... |
| 8 | 28 | 2020-03-07 23:23:46 | Attributed Social Network Embedding | A sparsity aware and memory efficient implemen... |
| 9 | 29 | 2020-03-07 23:45:38 | Signed Graph Convolutional Network | A PyTorch implementation of "Signed Graph Conv... |


```python
tags.head(10)
```

|   | id | tag |
| --- | --- | --- |
| 0 | 6 | computer-vision |
| 1 | 7 | computer-vision |
| 2 | 9 | graph-learning |
| 3 | 15 | reinforcement-learning |
| 4 | 19 | graph-learning |
| 5 | 25 | graph-learning |
| 6 | 26 | graph-learning |
| 7 | 27 | graph-learning |
| 8 | 28 | graph-learning |
| 9 | 29 | graph-learning |


## Transform

```python
# Join projects with their tags on the shared id column
df = pd.merge(projects, tags, on="id")
df.head()
```

|   | id | created_on | title | description | tag |
| --- | --- | --- | --- | --- | --- |
| 0 | 6 | 2020-02-20 06:43:18 | Comparison between YOLO and RCNN on real world... | Bringing theory to experiment is cool. We can ... | computer-vision |
| 1 | 7 | 2020-02-20 06:47:21 | Show, Infer & Tell: Contextual Inference for C... | The beauty of the work lies in the way it arch... | computer-vision |
| 2 | 9 | 2020-02-24 16:24:45 | Awesome Graph Classification | A collection of important graph embedding, cla... | graph-learning |
| 3 | 15 | 2020-02-28 23:55:26 | Awesome Monte Carlo Tree Search | A curated list of Monte Carlo tree search pape... | reinforcement-learning |
| 4 | 19 | 2020-03-03 13:54:31 | Diffusion to Vector | Reference implementation of Diffusion2Vec (Com... | graph-learning |
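
Because `pd.merge` performs an inner join by default, any project without a matching tag (or tag without a matching project) is silently dropped. The sketch below shows one way to audit a join before relying on it, using the `how`, `indicator` and `validate` parameters of `pd.merge`. The toy frames only mimic the column names of the course data; their values are made up.

```python
import pandas as pd

# Toy frames with made-up values that mimic the course data's id/tag columns.
projects = pd.DataFrame({"id": [1, 2, 3], "title": ["a", "b", "c"]})
tags = pd.DataFrame({"id": [1, 2, 4], "tag": ["computer-vision", "graph-learning", "other"]})

# An outer merge with indicator=True records which table each row came from.
# validate="one_to_one" raises pandas.errors.MergeError if an id repeats in either table.
audit = pd.merge(projects, tags, on="id", how="outer", indicator=True, validate="one_to_one")

unlabeled_ids = audit.loc[audit["_merge"] == "left_only", "id"].tolist()
orphan_tag_ids = audit.loc[audit["_merge"] == "right_only", "id"].tolist()
print(f"projects without a tag: {unlabeled_ids}")
print(f"tags without a project: {orphan_tag_ids}")
```

If both lists are empty, the inner join keeps every row from both tables.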


```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 955 entries, 0 to 954
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           955 non-null    int64 
 1   created_on   955 non-null    object
 2   title        955 non-null    object
 3   description  955 non-null    object
 4   tag          955 non-null    object
dtypes: int64(1), object(4)
memory usage: 44.8+ KB
```


```python
# Drop projects that have no tag, reporting how many were removed
print(f"Removing {df[df['tag'].isnull()].shape[0]} entries with null in 'tag'")
df = df[df["tag"].notnull()]
```

```
Removing 0 entries with null in 'tag'
```


## Load

```python
# Save the labeled dataset without the DataFrame index
df.to_csv("./datasets/labeled_projects.csv", index=False)
```
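
A quick way to confirm the saved file is usable is to read it back and compare it with what was written. The sketch below does this with a small made-up DataFrame and a temporary directory, so it does not touch `./datasets`; the column values are placeholders, not rows from the course data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the merged DataFrame (values are made up).
df = pd.DataFrame(
    {
        "id": [6, 7],
        "title": ["first project", "second project"],
        "tag": ["computer-vision", "graph-learning"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "labeled_projects.csv"
    df.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same columns and rows as the one we saved.
same_columns = list(reloaded.columns) == list(df.columns)
same_rows = reloaded.to_dict("records") == df.to_dict("records")
print(same_columns, same_rows)
```

Saving with the index included would add an unnamed index column on reload, which is why `index=False` is used here.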

## Libraries

We could have used the user-provided tags as our labels, but what if the user added a wrong tag or forgot to add a relevant one? To remove this dependency on the user to provide gold-standard labels, we can leverage labeling tools and platforms. These tools allow for quick and organized labeling of the dataset to ensure its quality. Instead of starting from scratch and asking our labeler to provide all the relevant tags for a given project, we can provide the author's original tags and ask the labeler to add or remove tags as necessary. The labeling tool itself may need to be custom built, or we may be able to use an existing one from the ecosystem.
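
To make the idea of pre-filling the author's tags concrete, the sketch below turns project records into review tasks that a labeling tool could import. The task schema (`suggested_tags`, `final_tags`), the `build_labeling_tasks` helper and the example description are all hypothetical; a real tool will define its own import format.

```python
import json


def build_labeling_tasks(records):
    """Turn project records into review tasks pre-filled with the author's tag."""
    tasks = []
    for record in records:
        tasks.append(
            {
                "id": record["id"],
                "text": f"{record['title']} {record['description']}".strip(),
                "suggested_tags": [record["tag"]],  # labeler adds / removes from here
                "final_tags": None,  # filled in after review
            }
        )
    return tasks


# Hypothetical record; the description is a placeholder, not the real one.
records = [
    {
        "id": 9,
        "title": "Awesome Graph Classification",
        "description": "(project description)",
        "tag": "graph-learning",
    }
]
tasks = build_labeling_tasks(records)
print(json.dumps(tasks, indent=2))
```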

See [here](https://madewithml.com/courses/mlops/labeling/) for more information about different libraries and tools for labeling.

## Iteration

Labeling isn't just a one-time event or something we repeat identically. As new data becomes available, we'll want to strategically label the appropriate samples and improve slices of our data that are lacking in quality. Once new data is labeled, we can trigger workflows that start the (re)training process and deploy a new version of our system.
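
One simple way to choose which new samples to label is uncertainty sampling, a basic form of active learning: rank unlabeled samples by how unsure the current model is and send the most uncertain ones to labelers first. The sketch below uses prediction entropy as the uncertainty score; the sample ids, probabilities and helper names are made up for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_labeling(predictions, k):
    """Return the ids of the k samples whose predictions are most uncertain."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]


# Hypothetical model outputs: (sample id, predicted class probabilities).
predictions = [
    (101, [0.98, 0.01, 0.01]),
    (102, [0.40, 0.35, 0.25]),
    (103, [0.70, 0.20, 0.10]),
    (104, [0.34, 0.33, 0.33]),
]
to_label = select_for_labeling(predictions, k=2)
print(to_label)  # [104, 102]
```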