In [46]:
import pandas as pd
from snorkel.utils import probs_to_preds
from utils import load_raw_spam_dataset
from wrench.dataset import load_dataset
from wrench.endmodel import EndClassifierModel
from wrench.labelmodel import MajorityVoting, Fable

path_to_data = "data"
# os.chdir("wrench/spam")

In [19]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Weak Supervision

todo: some intro, what is WS once again, what do we need to train a classifier

In this tutorial, we are going to train a spam detection classifier using weak supervision. The dataset we will use for training is the YouTube comments spam detection dataset.

Some info about the dataset:
- The dataset consists of comments that YouTube users left under different videos.
- Each sample is a comment (i.e., a word, a sentence, or a couple of sentences).
- 1,586 train samples, 120 dev samples, 250 test samples
- There are 2 types of samples:
    - HAM: comments relevant to the video (even very simple ones), or
    - SPAM: irrelevant (often trying to advertise something) or inappropriate messages
- The original dataset is labeled; we are going to treat it as an unlabeled one (and label it in a weakly supervised fashion).

Let's first have a look at the dataset (the gold labels are still included here).

In [2]:
# load the YouTube dataset
df_train, df_dev, df_test = load_raw_spam_dataset(load_train_labels=True)
Y_train = df_train["label"].values
Y_test = df_test["label"].values

In [23]:
df_train[:10]

Unnamed: 0,author,date,text,label,video
0,Alessandro leite,2014-11-05T22:21:36,pls http://www10.vakinha.com.br/VaquinhaE.aspx?e=313327 help me get vip gun cross fire al﻿,1,1
1,Salim Tayara,2014-11-02T14:33:30,"if your like drones, plz subscribe to Kamal Tayara. He takes videos with his drone that are absolutely beautiful.﻿",1,1
2,Phuc Ly,2014-01-20T15:27:47,go here to check the views :3﻿,0,1
3,DropShotSk8r,2014-01-19T04:27:18,"Came here to check the views, goodbye.﻿",0,1
4,css403,2014-11-07T14:25:48,"i am 2,126,492,636 viewer :D﻿",0,1
5,Giang Nguyen,2014-11-06T04:55:41,https://www.facebook.com/teeLaLaLa﻿,1,1
6,Caius Ballad,2014-11-13T00:58:20,imagine if this guy put adsense on with all these views... u could pay ur morgage﻿,0,1
7,Holly,2014-11-06T13:41:30,Follow me on Twitter @mscalifornia95﻿,1,1
8,King uzzy,2014-11-07T23:19:08,Can we reach 3 billion views by December 2014? ﻿,0,1
9,iKap Taz,2014-11-08T13:34:27,Follow 4 Follow @ VaahidMustafic Like 4 Like ﻿,1,1


For each data sample (i.e., a YouTube comment), we know:
- comment's author,
- date when the corresponding comment was left,
- text of the sample,
- label,
- id of the YouTube video.

Examples of HAM messages (label = 0):
- "3:46 so cute!"
- "This is a weird video."

In [28]:
# some examples of HAM (non-spam, label = 0) samples

df_train.loc[df_train["label"]==0][:10]

Unnamed: 0,author,date,text,label,video
2,Phuc Ly,2014-01-20T15:27:47,go here to check the views :3﻿,0,1
3,DropShotSk8r,2014-01-19T04:27:18,"Came here to check the views, goodbye.﻿",0,1
4,css403,2014-11-07T14:25:48,"i am 2,126,492,636 viewer :D﻿",0,1
6,Caius Ballad,2014-11-13T00:58:20,imagine if this guy put adsense on with all these views... u could pay ur morgage﻿,0,1
8,King uzzy,2014-11-07T23:19:08,Can we reach 3 billion views by December 2014? ﻿,0,1
10,John Plaatt,2014-11-07T22:22:29,On 0:02 u can see the camera man on his glasses....﻿,0,1
11,Praise Samuel,2014-11-08T11:10:30,2 billion views wow not even baby by justin beibs has that much he doesn't deserve a capitalized name﻿,0,1
16,zhichao wang,2013-11-29T02:13:56,i think about 100 millions of the views come from people who only wanted to check the views﻿,0,1
19,Tedi Foto,2014-11-08T09:33:30,What my gangnam style﻿,0,1
20,Tee Tee,2014-11-07T20:16:51,Loool nice song funny how no one understands (me) and we love it﻿,0,1


Examples of SPAM messages (label = 1):
- "Please check out my vidios"
- "Subscribe to me and I'll subscribe back!!!"

In [29]:
# some examples of SPAM (label = 1) samples

df_train.loc[df_train["label"]==1][:10]

Unnamed: 0,author,date,text,label,video
0,Alessandro leite,2014-11-05T22:21:36,pls http://www10.vakinha.com.br/VaquinhaE.aspx?e=313327 help me get vip gun cross fire al﻿,1,1
1,Salim Tayara,2014-11-02T14:33:30,"if your like drones, plz subscribe to Kamal Tayara. He takes videos with his drone that are absolutely beautiful.﻿",1,1
5,Giang Nguyen,2014-11-06T04:55:41,https://www.facebook.com/teeLaLaLa﻿,1,1
7,Holly,2014-11-06T13:41:30,Follow me on Twitter @mscalifornia95﻿,1,1
9,iKap Taz,2014-11-08T13:34:27,Follow 4 Follow @ VaahidMustafic Like 4 Like ﻿,1,1
12,Malin Linford,2014-11-05T01:13:43,"Hey guys please check out my new Google+ page it has many funny pictures, FunnyTortsPics https://plus.google.com/112720997191206369631/post﻿",1,1
13,Lone Twistt,2013-11-28T17:34:55,Once you have started reading do not stop. If you do not subscribe to me within one day you and you're entire family will die so if you want to stay alive subscribe right now.﻿,1,1
14,Олег Пась,2014-11-03T23:29:00,Plizz withing my channel ﻿,1,1
15,JD COKE,2014-11-08T02:24:02,"It's so hard, sad :( iThat little child Actor HWANG MINOO dancing very active child is suffering from brain tumor, only 6 month left for him .Hard to believe .. Keep praying everyone for our future superstar. #StrongLittlePsY #Fighting SHARE EVERYONE PRAYING FOR HIM http://ygunited.com/2014/11/08/little-psy-from-the-has-brain-tumor-6-months-left-to-live/ ﻿",1,1
17,Rancy Gaming,2014-11-06T09:41:07,What free gift cards? Go here http://www.swagbucks.com/p/register?rb=13017194﻿,1,1


Now let's imagine gold labels disappeared...

<img src="img/poof.jpg" width="400"/>

... and here we are: there is some data we want to use for classifier training, but we have no labels and no capacity/time/money/... to hire annotators.

But we can label this data with **weak supervision** :)

<img src="img/rainbow.png" width="700"/>

# Weak Supervision

A brief reminder of how weak supervision works:
1. We come up with some heuristic rules and transform these rules into labeling functions.
2. We apply these labeling functions to the data and obtain weak labels.

### What can be a labeling function?

- Keyword searches: looking for specific words in a sentence
- Pattern matching: looking for specific syntactical patterns
- Third-party models: using a pre-trained model (usually a model for a different task than the one at hand)
- ...
- Crowdworker labels: treating each crowdworker as a black-box function that assigns labels to subsets of the data

*What labeling functions do you think would be productive and useful for annotating the YouTube spam dataset?* 

*What rules might help to distinguish between spam and not-spam YouTube comments?*

*What patterns are typical for spam YouTube comments? for non-spam comments?*



Labeling Functions: 
1. ...
2. ...
3. ...
4. ...
5. ...

Some more ideas of labeling functions: 
- "check"/"check out": if the comment contains the phrase "check out", it is most probably spam (the comment author is likely promoting their own channel)
- "subscribe": same
- "my": same
- ...
 
Once we have collected some rules, we turn them into labeling functions that can *label* a data sample, that is, either assign it to one of the classes or abstain.

In [32]:
# an example of LF based on a keyword "check out"

def check_out(x):
    return 1 if "check out" in x.text.lower() else -1

# meaning the sample will be assigned to class 1 (=SPAM) if there is a "check out" expression in the comment;
# otherwise the LF abstains (-1) and assigns no class

In [33]:
# an example of LF based on a keyword "please"

def please(x):
    return 1 if "please" in x.text.lower() else -1

# meaning the sample will be assigned to class 1 (=SPAM) if the word "please" occurs in the comment;
# otherwise the LF abstains (-1) and assigns no class
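Step 2 of the reminder above (applying labeling functions to the data to obtain weak labels) can be sketched in a few lines of plain Python. This is only an illustration: the names `lf_check_out`, `lf_please` and `apply_lfs` are hypothetical, and the samples are wrapped in `SimpleNamespace` objects so that `x.text` works as in the cells above. The example comments are the ones quoted earlier in this notebook.

```python
from types import SimpleNamespace

# Label convention assumed here: -1 = abstain, 0 = HAM, 1 = SPAM
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_check_out(x):
    # Vote SPAM if the comment contains "check out", otherwise abstain
    return SPAM if "check out" in x.text.lower() else ABSTAIN

def lf_please(x):
    # Vote SPAM if the comment contains "please", otherwise abstain
    return SPAM if "please" in x.text.lower() else ABSTAIN

def apply_lfs(lfs, samples):
    """Return one row of weak labels per sample, one column per labeling function."""
    return [[lf(x) for lf in lfs] for x in samples]

comments = [
    "Please check out my vidios",
    "Subscribe to me and I'll subscribe back!!!",
    "3:46 so cute!",
]
samples = [SimpleNamespace(text=t) for t in comments]

weak_labels = apply_lfs([lf_check_out, lf_please], samples)
for text, row in zip(comments, weak_labels):
    print(row, text)
```

Note that the second comment is clearly spam, yet both labeling functions abstain on it: a single keyword rule covers only a small part of the data, which is why several labeling functions are combined.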

In this tutorial, we are going to use the labeling functions created by the [Snorkel team](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb), which are: 


- keyword "my" (to detect spam comments like "my channel", "my video", etc)
- keyword "subscribe" (to detect spam comments that ask users to subscribe to some channel)
- keyword "http" (to detect spam comments that link to other channels)
- keyword "please"/"plz" (to detect spam comments that make requests rather than commenting)
- keyword "song" (to detect non-spam comments that actually talk about the video's content)
- regex "check_out" (to detect spam comments like "check out this channel", etc)
- short comment (non-spam comments are often short, such as 'cool video!')
- short comments that mention specific people (using the spaCy library; non-spam comments usually mention some people)
- polarity (using TextBlob library; if polarity > 0.9, it is most probably a non-spam message)
- subjectivity (using TextBlob library; if subjectivity >= 0.5, it is most probably a non-spam message)
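Not all of these rules are plain keyword lookups. As a rough sketch of the non-keyword kind, the functions below imitate a regex rule, a length rule, and a HAM keyword rule. They are not the Snorkel team's code: the pattern `check.*out`, the `max_words=5` threshold, and the fact that the functions take a raw string instead of a sample object are all assumptions made for this illustration.

```python
import re

# Label convention assumed here: -1 = abstain, 0 = HAM, 1 = SPAM
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_regex_check_out(text):
    # "check" followed anywhere later by "out", e.g. "check it out"
    return SPAM if re.search(r"check.*out", text, flags=re.IGNORECASE) else ABSTAIN

def lf_short_comment(text, max_words=5):
    # Hypothetical threshold: very short comments are treated as HAM
    return HAM if len(text.split()) < max_words else ABSTAIN

def lf_keyword_song(text):
    # Comments that talk about the song itself are treated as HAM
    return HAM if "song" in text.lower() else ABSTAIN

lfs = [lf_regex_check_out, lf_short_comment, lf_keyword_song]
examples = [
    "Check this channel out",
    "3:46 so cute!",
    "Loool nice song funny how no one understands (me) and we love it",
]
votes = {text: [lf(text) for lf in lfs] for text in examples}
for text, row in votes.items():
    print(row, text)
```

The first example receives a SPAM vote from the regex rule and a HAM vote from the length rule. Such conflicts are expected with heuristic rules, and resolving them is the job of the label model.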

(We are not going into the details of the labeling process here - you will hear more about it from my colleagues later.) 

The resulting annotations can be saved in the following format: 

In [50]:
import json
with open("data/youtube/train.json") as train_file:
    train_data = json.load(train_file)
train_data

{'0': {'data': {'text': 'pls http://www10.vakinha.com.br/VaquinhaE.aspx?e=313327 help me get vip gun  cross fire al\ufeff'},
  'label': 1,
  'weak_labels': [-1, -1, 1, -1, -1, -1, -1, -1, -1, -1]},
 '1': {'data': {'text': 'if your like drones, plz subscribe to Kamal Tayara. He takes videos with  his drone that are absolutely beautiful.\ufeff'},
  'label': 1,
  'weak_labels': [-1, 1, -1, 1, -1, -1, -1, -1, -1, 0]},
 '2': {'data': {'text': 'go here to check the views :3\ufeff'},
  'label': 0,
  'weak_labels': [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]},
 '3': {'data': {'text': 'Came here to check the views, goodbye.\ufeff'},
  'label': 0,
  'weak_labels': [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]},
 '4': {'data': {'text': 'i am 2,126,492,636 viewer :D\ufeff'},
  'label': 0,
  'weak_labels': [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]},
 '5': {'data': {'text': 'https://www.facebook.com/teeLaLaLa\ufeff'},
  'label': 1,
  'weak_labels': [-1, -1, 1, -1, -1, -1, 0, -1, -1, -1]},
 '6': {'data': {'te

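The processed data has the following structure (one record per sample, keyed by the sample id): 
{"data": {"text": [sample text]}, "label": [initial gold label], "weak_labels": [one vote per labeling function; -1 means the labeling function abstained]}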
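To make this structure more tangible, here is a small sketch that reads three records (ids 0, 2 and 5, copied from the output above with the trailing `\ufeff` trimmed) into plain Python lists. The variable names are hypothetical, and the reading of the values (-1 = abstain, 0 = HAM, 1 = SPAM) is an assumption that matches the records shown.

```python
ABSTAIN = -1  # assumed convention: -1 = abstain, 0 = HAM, 1 = SPAM

# Three records in the same format as train.json
records = {
    "0": {"data": {"text": "pls http://www10.vakinha.com.br/VaquinhaE.aspx?e=313327 help me get vip gun  cross fire al"},
          "label": 1,
          "weak_labels": [-1, -1, 1, -1, -1, -1, -1, -1, -1, -1]},
    "2": {"data": {"text": "go here to check the views :3"},
          "label": 0,
          "weak_labels": [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]},
    "5": {"data": {"text": "https://www.facebook.com/teeLaLaLa"},
          "label": 1,
          "weak_labels": [-1, -1, 1, -1, -1, -1, 0, -1, -1, -1]},
}

texts = [r["data"]["text"] for r in records.values()]
gold = [r["label"] for r in records.values()]       # gold labels (kept only for evaluation)
L = [r["weak_labels"] for r in records.values()]    # weak label matrix: samples x labeling functions

# How many labeling functions actually voted on each sample?
votes_per_sample = [sum(v != ABSTAIN for v in row) for row in L]
# Fraction of samples that received at least one vote
coverage = sum(n > 0 for n in votes_per_sample) / len(L)

print(votes_per_sample, round(coverage, 2))
```

Sample 2 receives no vote at all, and sample 5 receives one SPAM and one HAM vote, so the weak labels alone do not yet determine a training label for every sample.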



todo: explain what are the data, labels (= still gold ones), weak_labels (= the annotations after applying LFs)

todo: what is majority vote?
todo: how to get the weak labels from the wrench data to train a classifier?
todo: example of classifier training
todo: other labeling model (e.g. Snorkel)

## How to obtain weak labels?

todo: different label models: 1) (simple) majority vote 2) (more advanced) FABLE

In [51]:
train_data, valid_data, test_data = load_dataset(
    path_to_data,
    "youtube",
    extract_feature=True,
    extract_fn='tfidf'
)

  0%|          | 0/1586 [00:00<?, ?it/s]

  0%|          | 0/120 [00:00<?, ?it/s]

  0%|          | 0/250 [00:00<?, ?it/s]

### Majority Vote

The simplest and most straightforward way to compute labels from the noisy annotations is majority voting: a decision-making method in which the option with the most votes is chosen. It's like asking a group of people to pick a movie: the one that gets the most raised hands wins. 

We will use the Wrench framework for it. 
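Before handing this over to a library, it may help to see the idea in plain Python. The sketch below is not Wrench's implementation; it is one common variant in which the soft label of a sample is the fraction of non-abstaining votes per class, samples without any vote get a uniform distribution, and ties are left unlabeled. The function names and the tie handling are assumptions made for this illustration. The three rows are the weak labels of samples 1, 2 and 5 shown earlier.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: -1 = abstain, 0 = HAM, 1 = SPAM

def majority_vote_proba(weak_labels, n_classes=2):
    """Soft labels: share of non-abstaining votes per class (uniform if nobody voted)."""
    probs = []
    for row in weak_labels:
        counts = Counter(v for v in row if v != ABSTAIN)
        total = sum(counts.values())
        if total == 0:
            probs.append([1.0 / n_classes] * n_classes)
        else:
            probs.append([counts[c] / total for c in range(n_classes)])
    return probs

def to_hard_label(prob_row):
    """Hard label: the most probable class, or ABSTAIN when several classes tie."""
    best = max(prob_row)
    winners = [c for c, p in enumerate(prob_row) if p == best]
    return winners[0] if len(winners) == 1 else ABSTAIN

L = [
    [-1, 1, -1, 1, -1, -1, -1, -1, -1, 0],     # two SPAM votes, one HAM vote
    [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1],  # no votes
    [-1, -1, 1, -1, -1, -1, 0, -1, -1, -1],    # one SPAM vote, one HAM vote
]
soft = majority_vote_proba(L)
hard = [to_hard_label(p) for p in soft]
print(soft)
print(hard)
```

Other tie-breaking policies are possible (for example, picking a class at random), so the hard labels produced by a library may differ from this sketch on tied samples.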

In [52]:
# initialize and apply the majority vote

label_model = MajorityVoting()
label_model.fit(dataset_train=train_data, dataset_valid=valid_data)

In [49]:
# calculate labels

soft_label_mv = label_model.predict_proba(train_data)    # soft label as probabilities 
hard_label_mv = probs_to_preds(soft_label_mv)               # hard labels  

### Fable [1]
(+ todo: a small description)

[1] Zhang et al. 2023. Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision. 

In [53]:
# initialize and apply the fable model

label_model = Fable(kernel_function=None, num_groups=3)
label_model.fit(dataset_train=train_data, dataset_valid=valid_data)

NaN values included: []


  0%|▎                                                                                                                                         | 2/1000 [00:05<49:23,  2.97s/iter]

stop





array([[0.48683207, 0.51316793],
       [0.06563599, 0.93436401],
       [0.55573324, 0.44426676],
       ...,
       [0.01800126, 0.98199874],
       [0.08608723, 0.91391277],
       [0.00469534, 0.99530466]])

In [54]:
# calculate labels
soft_label_fable = label_model.predict_proba(train_data)
hard_label_fable = probs_to_preds(soft_label_fable)

  0%|▎                                                                                                                                         | 2/1000 [00:05<48:22,  2.91s/iter]

stop
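Since the gold labels are still available in this tutorial (`Y_train`), the labels produced by a label model can be sanity-checked against them. The helper below is a minimal sketch with a hypothetical name and toy inputs; it assumes that -1 marks samples without a label and reports coverage and accuracy on the covered samples only.

```python
def evaluate_weak_labels(hard_labels, gold_labels, abstain=-1):
    """Return (coverage, accuracy on covered samples) of predicted labels vs. gold labels."""
    covered = [(p, g) for p, g in zip(hard_labels, gold_labels) if p != abstain]
    coverage = len(covered) / len(gold_labels)
    if not covered:
        return coverage, float("nan")
    accuracy = sum(p == g for p, g in covered) / len(covered)
    return coverage, accuracy

# Toy example (not real results): four samples, one of them left unlabeled
toy_hard = [1, -1, 0, 1]
toy_gold = [1, 0, 0, 0]
coverage, accuracy = evaluate_weak_labels(toy_hard, toy_gold)
print(coverage, accuracy)
```

In the notebook, the same function could be called with `hard_label_mv` or `hard_label_fable` and `Y_train`, assuming both are one-dimensional sequences of equal length.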





## How to train a classifier?

In [55]:
batch_size = 32
test_batch_size = 32
lr = 0.01

model = EndClassifierModel(
    batch_size=batch_size, test_batch_size=test_batch_size
)
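To illustrate what training on soft labels means, here is a small stand-in written in plain Python. It is not the Wrench `EndClassifierModel` and does not use its API: it is a logistic regression trained by gradient descent, where the target of each sample is the SPAM probability produced by a label model. The two binary features, the toy probabilities, and the hyperparameters are made up for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logreg_soft(X, p_spam, lr=0.5, epochs=200):
    """Fit logistic regression with cross-entropy against soft targets p_spam in [0, 1]."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, p in zip(X, p_spam):
            err = predict_spam_proba(w, b, x) - p  # gradient of the loss w.r.t. the logit
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy features: [contains "subscribe", contains "song"]
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
# Toy soft labels (probability of SPAM), as a label model might produce them
p_spam = [0.9, 1.0, 0.1, 0.0, 0.5]

w, b = train_logreg_soft(X, p_spam)
print(predict_spam_proba(w, b, [1, 0]), predict_spam_proba(w, b, [0, 1]))
```

Using probabilities instead of hard labels as targets lets uncertain samples (such as the last one, with a probability of 0.5) pull the classifier less strongly in either direction.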