# Snorkel

## Introduction

The goal of this lab is to introduce students to the [Snorkel](http://www.snorkel.org) tool and the possibilities of programmatic label generation using the weak-supervised learning paradigm.

In order to use weakly supervised learning to generate labels, it is necessary to create three datasets:

- **train set**: which does not have any labels
- **validation set**: used for hyperparameter optimization, has labels
- **test set**: used only for final model evaluation, has labels

## Labeling functions

The first step will be to load the dataset and split it into a train set and a test set. Since in our set all SMS have a label, we will simulate a weakly supervised learning problem by randomly removing 80% of the labels. Additionally, Snorkel requires numeric labels, so we need to recode the values.

In [1]:
!head data/smsspamcollection.csv

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat me like aids patent.
ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam	WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
spam	H

In [2]:
import pandas as pd
import numpy as np

pd.set_option('max_colwidth', 600)

SPAM = 1
HAM = 0
ABSTAIN = -1

df = pd.read_csv('./data/smsspamcollection.csv', sep='\t', header=None, names=['old_label', 'text'])

df['label'] = df.old_label.apply(lambda x: SPAM if x == 'spam' else HAM)

df.loc[df.sample(frac=0.8).index, 'label'] = ABSTAIN
df.drop(columns=['old_label'], inplace=True)

df.head()

Unnamed: 0,text,label
0,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",0
1,Ok lar... Joking wif u oni...,-1
2,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,-1
3,U dun say so early hor... U c already then say...,-1
4,"Nah I don't think he goes to usf, he lives around here though",-1


In [3]:
abstain_idx = df.label == ABSTAIN

df_train = df[abstain_idx]
df_test = df[~abstain_idx]

### Simple keyword search

As a first example, we will use a search for the words "check" and "free" in SMS content

In [4]:
from snorkel.labeling import labeling_function

@labeling_function()
def check(sms):
    return SPAM if "check" in sms.text.lower() else ABSTAIN

@labeling_function()
def free(sms):
    return SPAM if "free" in sms.text.lower() else ABSTAIN

The next step is to apply the labeling functions to the train set.

In [5]:
from snorkel.labeling import PandasLFApplier

lfs = [check, free]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 4458/4458 [00:00<00:00, 100431.34it/s]


The result of applying the set of labeling functions to the train set is a matrix of size $m \times n$, where $m$ is the number of examples and $n$ is the number of labeling functions. The matrix contains the result of applying each function to each example.

In [6]:
L_train

array([[-1, -1],
       [-1,  1],
       [-1, -1],
       ...,
       [-1, -1],
       [-1,  1],
       [-1, -1]])

In [7]:
df_train.iloc[1,:]

text     Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
label                                                                                                                                                             -1
Name: 2, dtype: object

The simplest way to analyze this is to determine the coverage of labeling functions (i.e., the percentage of cases for which the function returned a result other than `ABSTAIN'.

In [8]:
coverage_check, coverage_free = (L_train != ABSTAIN).mean(axis=0)

print(f"Coverage for check(): {coverage_check * 100:.1f}%")
print(f"Coverage for free(): {coverage_free * 100:.1f}%")

Coverage for check(): 1.0%
Coverage for free(): 4.8%


Fortunately, Snorkel offers additional tools that allow for deeper analysis of the result of labeling functions.

In [9]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.0,0.0
free,1,[1],0.048004,0.0,0.0


The meaning of each column is as follows:
- `Polarity`: the set of labels returned by the function
- `Coverage`: the percentage of examples for which the function returns a value other than `ABSTAIN`
- Overlaps: the percentage of examples for which at least one other labeling function returned a value
- Conflicts: the percentage of examples for which at least one other labeling function returned a different value

If the train set contained labels, the method would also return:
- `Correct`: the number of correct labels
- `Incorrect`: number of incorrect labels
- `Empirical Accuracy`: the percentage of correct labels

Let's check the examples labeled by the `free()` function as spam

In [10]:
df_train.iloc[L_train[:,1] == SPAM].sample(frac=0.1)

Unnamed: 0,text,label
1380,No. 1 Nokia Tone 4 ur mob every week! Just txt NOK to 87021. 1st Tone FREE ! so get txtin now and tell ur friends. 150p/tone. 16 reply HL 4info,-1
5060,Free video camera phones with Half Price line rental for 12 mths and 500 cross ntwk mins 100 txts. Call MobileUpd8 08001950382 or Call2OptOut/674,-1
2074,"FreeMsg: Claim ur 250 SMS messages-Text OK to 84025 now!Use web2mobile 2 ur mates etc. Join Txt250.com for 1.50p/wk. T&C BOX139, LA32WU. 16 . Remove txtX or stop",-1
1536,You have won a Nokia 7250i. This is what you get when you win our FREE auction. To take part send Nokia to 86021 now. HG/Suite342/2Lands Row/W1JHL 16+,-1
418,FREE entry into our £250 weekly competition just text the word WIN to 80086 NOW. 18 T&C www.txttowin.co.uk,-1
4196,"Double mins and txts 4 6months FREE Bluetooth on Orange. Available on Sony, Nokia Motorola phones. Call MobileUpd8 on 08000839402 or call2optout/N9DX",-1
495,Are you free now?can i call now?,-1
5566,"REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode",-1
2290,Had your mobile 11mths ? Update for FREE to Oranges latest colour camera mobiles & unlimited weekend calls. Call Mobile Upd8 on freefone 08000839402 or 2StopTx,-1
188,Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!,-1


It seems that the phrase "call now" is also a good indicator for spam. So let's add one more labeling function.

In [11]:
@labeling_function()
def call_now(sms):
    return SPAM if "call now" in sms.text.lower() else ABSTAIN

lfs = [check, free, call_now]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 4458/4458 [00:00<00:00, 43262.86it/s]


In [12]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.0,0.0
free,1,[1],0.048004,0.001346,0.0
call_now,2,[1],0.004038,0.001346,0.0


Let's see which examples were labeled as spam by the `call_now()` function but omitted by `free()`.

In [13]:
from snorkel.analysis import get_label_buckets

buckets = get_label_buckets(L_train[:, 1], L_train[:, 2])
buckets

{(-1, -1): array([   0,    2,    3, ..., 4454, 4455, 4457]),
 (1,
  -1): array([   1,    7,    9,   43,   59,   70,   75,  113,  119,  140,  144,
         146,  153,  187,  206,  218,  220,  240,  287,  295,  308,  320,
         333,  362,  390,  393,  413,  453,  466,  479,  502,  522,  533,
         561,  588,  617,  631,  632,  685,  711,  755,  798,  802,  810,
         848,  874,  909,  951,  972,  977,  979, 1019, 1091, 1104, 1140,
        1159, 1199, 1210, 1218, 1231, 1236, 1262, 1303, 1316, 1336, 1355,
        1388, 1405, 1413, 1420, 1422, 1426, 1432, 1536, 1549, 1576, 1582,
        1603, 1613, 1623, 1676, 1681, 1704, 1716, 1738, 1772, 1801, 1833,
        1849, 1858, 1869, 1871, 1897, 1913, 1920, 1922, 1929, 1972, 2008,
        2038, 2068, 2084, 2090, 2152, 2173, 2186, 2267, 2301, 2310, 2323,
        2342, 2384, 2406, 2416, 2434, 2510, 2515, 2518, 2529, 2587, 2626,
        2664, 2667, 2678, 2736, 2746, 2749, 2750, 2760, 2766, 2807, 2814,
        2830, 2889, 2942, 2957, 2971, 29

In [14]:
df_train.iloc[buckets[(ABSTAIN, SPAM)]]

Unnamed: 0,text,label
876,"Shop till u Drop, IS IT YOU, either 10K, 5K, £500 Cash or £100 Travel voucher, Call now, 09064011000. NTT PO Box CR01327BT fixedline Cost 150ppm mobile vary",-1
1366,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k",-1
2170,"Shop till u Drop, IS IT YOU, either 10K, 5K, £500 Cash or £100 Travel voucher, Call now, 09064011000. NTT PO Box CR01327BT fixedline Cost 150ppm mobile vary",-1
2850,"YOUR CHANCE TO BE ON A REALITY FANTASY SHOW call now = 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870 is a national = rate call",-1
2871,"YOUR CHANCE TO BE ON A REALITY FANTASY SHOW call now = 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870 is a national = rate call.",-1
2992,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870 is a national rate call",-1
3167,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k",-1
3893,URGENT This is our 2nd attempt to contact U. Your £900 prize from YESTERDAY is still awaiting collection. To claim CALL NOW 09061702893. ACL03530150PM,-1
4073,Loans for any purpose even if you have Bad Credit! Tenants Welcome. Call NoWorriesLoans.com on 08717111821,-1
4283,U can call now...,-1


In [15]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.0,0.0
free,1,[1],0.048004,0.001346,0.0
call_now,2,[1],0.004038,0.001346,0.0


#### assignment

Write a labeling function that marks as spam all messages containing the word "HOT" written in capitals.

In [16]:
@labeling_function()
def hot(sms):
    return SPAM if "HOT" in sms.text else ABSTAIN

### Searching based on a regular expression

Another type of labeling function is one that uses regexp to find specific expressions.

In [17]:
import re

@labeling_function()
def regex_I_am_free(sms):
    if re.search(r"\b((I)|(she)|(he)|(we)|(them)|(we)(you))\b[^.,!?]*\bfree", sms.text, flags=re.I):
        return HAM
    elif re.search(r"free", sms.text, flags=re.I):
        return SPAM
    else:
        return ABSTAIN

lfs = [check, free, call_now, regex_I_am_free, hot]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

100%|██████████| 4458/4458 [00:00<00:00, 35764.14it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.0,0.0
free,1,[1],0.048004,0.048004,0.004935
call_now,2,[1],0.004038,0.002019,0.000673
regex_I_am_free,3,"[0, 1]",0.048004,0.048004,0.004935
hot,4,[1],0.001122,0.000673,0.0


Let's compare examples that the `free()` function labels as spam and the `regex_I_am_free()` function considers valid.

In [18]:
buckets = get_label_buckets(L_train[:, 1], L_train[:, 3])
df_train.iloc[buckets[(SPAM, HAM)]].sample(10, random_state=1)

Unnamed: 0,text,label
5104,"A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied ""Boost is d secret of my energy"" n instantly d girl shouted ""our energy"" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8",-1
4091,We tried to call you re your reply to our sms for a video mobile 750 mins UNLIMITED TEXT + free camcorder Reply of call 08000930705 Now,-1
1784,No dear i do have free messages without any recharge. Hi hi hi,-1
3526,I not free today i haf 2 pick my parents up tonite...,-1
4949,"Hi this is Amy, we will be sending you a free phone number in a couple of days, which will give you an access to all the adult parties...",-1
3999,We tried to call you re your reply to our sms for a video mobile 750 mins UNLIMITED TEXT free camcorder Reply or call now 08000930705 Del Thurs,-1
3319,I'm freezing and craving ice. Fml,-1
2121,"Argh my 3g is spotty, anyway the only thing I remember from the research we did was that province and sterling were the only problem-free places we looked at",-1
1759,Sorry i'm not free...,-1
4304,Yup i'm free...,-1


#### assignment

Write a labeling function that will mark as spam all messages containing any amounts specified with a currency symbol ($99, £1.50)

In [19]:
@labeling_function()
def contains_money(sms):
    return SPAM if re.search(r"((\$)|(\£)\d)", sms.text) else ABSTAIN

### Searching based on heuristics

A simple heuristic to find spam is to assume that if more than 10% of the message text is written in capitals, there is a good chance it is spam.

In [20]:
@labeling_function()
def has_many_uppercase_words(sms):
    percentage_uppercase = sum([word.isupper() for word in sms.text.split()]) / len(sms.text.split())
    
    return SPAM if percentage_uppercase > 0.1 else ABSTAIN

lfs = [check, free, call_now, regex_I_am_free, has_many_uppercase_words, contains_money]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

100%|██████████| 4458/4458 [00:00<00:00, 25240.02it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.001346,0.0
free,1,[1],0.048004,0.048004,0.004935
call_now,2,[1],0.004038,0.003589,0.000673
regex_I_am_free,3,"[0, 1]",0.048004,0.048004,0.004935
has_many_uppercase_words,4,[1],0.187752,0.04262,0.000449
contains_money,5,[1],0.049798,0.031404,0.000224


#### assignment

Write a labeling function that marks as valid those messages that are shorter than 10 words and do not contain any word written in capitals.

In [21]:
@labeling_function()
def short_and_no_uppercase(sms):
    words = sms.text.split()
    some_capital = any([word == word.upper() for word in words])

    return HAM if len(words) < 10 and not some_capital else ABSTAIN

### Using an external statistical model

When labeling data, you can use external models whose response can be important information for deciding how to label an example. Snorkel has several built-in integrations in the form of the `Preprocessor` interface, in the example below we will use the `SpaCy` library to perform additional grammatical analysis of the text. However, you will need to download the English language model.

In [22]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m0:01[0m:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [23]:
!python -m spacy validate

[2K[38;5;2m✔ Loaded compatibility table[0m
[1m
[38;5;4mℹ spaCy installation:
/home/drew/prog/venvs/ml/lib/python3.11/site-packages/spacy[0m

NAME             SPACY            VERSION                            
en_core_web_md   >=3.7.2,<3.8.0   [38;5;2m3.7.1[0m   [38;5;2m✔[0m
en_core_web_sm   >=3.7.2,<3.8.0   [38;5;2m3.7.1[0m   [38;5;2m✔[0m



In [24]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [25]:
_text = """I don't England is a country that is part of the United Kingdom. 
It shares land borders with Wales to its west and Scotland to its north. 
The Irish Sea lies northwest of England and the Celtic Sea to the southwest. 
England is separated from continental Europe by the North Sea to the east and the 
English Channel to the south. The country covers five-eighths of the island of 
Great Britain, which lies in the North Atlantic, and includes over 100 smaller islands, 
such as the Isles of Scilly and the Isle of Wight."""

doc = nlp(_text)

for e in doc.ents:
    print(e.text, e.label_)

England GPE
the United Kingdom GPE
Wales ORG
Scotland GPE
The Irish Sea LOC
England GPE
the Celtic Sea LOC
England GPE
Europe LOC
the North Sea LOC
English LANGUAGE
five-eighths CARDINAL
Great Britain GPE
the North Atlantic LOC
over 100 CARDINAL
the Isles of Scilly GPE
the Isle of Wight GPE


In [26]:
from snorkel.preprocess.nlp import SpacyPreprocessor

spac = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

Assume that short text messages in which a reference to a specific person appears are not spam.

In [27]:
df_train.columns

Index(['text', 'label'], dtype='object')

In [28]:
@labeling_function(pre=[spac])
def has_person(sms):
    if len(sms.doc) < 20 and any([ent.label_ == "PERSON" for ent in sms.doc.ents]):
        return HAM
    else:
        return ABSTAIN

In [29]:
lfs = [check, free, call_now, regex_I_am_free, has_many_uppercase_words, has_person]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

100%|██████████| 4458/4458 [00:25<00:00, 172.51it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.002692,0.001346
free,1,[1],0.048004,0.048004,0.005832
call_now,2,[1],0.004038,0.003589,0.000673
regex_I_am_free,3,"[0, 1]",0.048004,0.048004,0.005832
has_many_uppercase_words,4,[1],0.187752,0.026469,0.006505
has_person,5,[0],0.049798,0.0083,0.0083


Another example of pre-processing data for labeling would be determining the average word frequency of a document. Below we define a function that determines the average word frequency and we decorate it as an example of a pre-processor. When a text message is sent to the next labeling function, the pre-processor will populate the text message with the average word frequency and, based on that, the labeling function will make a decision (we assume that if the text message contains many rare words then it is spam).

In [30]:
from wordfreq import zipf_frequency
from snorkel.preprocess import preprocessor

@preprocessor(memoize=True)
def avg_word_freq(sms):
    sms.avg_word_freq = sum([zipf_frequency(word, 'en') for word in sms.text.split()]) / len(sms.text.split())
    
    return sms

In [31]:
@labeling_function(pre=[avg_word_freq])
def many_rare_words(sms):
    return ABSTAIN if sms.avg_word_freq >= 4 else SPAM

In [32]:
lfs = [check, free, call_now, regex_I_am_free, has_many_uppercase_words, has_person, many_rare_words]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

100%|██████████| 4458/4458 [00:01<00:00, 3041.60it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.00314,0.001346
free,1,[1],0.048004,0.048004,0.005832
call_now,2,[1],0.004038,0.003589,0.000673
regex_I_am_free,3,"[0, 1]",0.048004,0.048004,0.005832
has_many_uppercase_words,4,[1],0.187752,0.04576,0.006505
has_person,5,[0],0.049798,0.014805,0.014805
many_rare_words,6,[1],0.064603,0.03432,0.007627


In [33]:
df_train.iloc[L_train[:,6] == SPAM].sample(frac=0.1)

Unnamed: 0,text,label
5526,PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S.I.M. points. Call 08718738001 Identifier Code: 49557 Expires 26/11/04,-1
5240,"Gud gud..k, chikku tke care.. sleep well gud nyt",-1
349,Fancy a shag? I do.Interested? sextextuk.com txt XXUK SUZY to 69876. Txts cost 1.50 per msg. TnCs on website. X,-1
2168,Yes.he have good crickiting mind,-1
1032,Yup bathe liao...,-1
1612,645,-1
2841,BABE !!! I miiiiiiissssssssss you ! I need you !!! I crave you !!! :-( ... Geeee ... I'm so sad without you babe ... I love you ...,-1
713,08714712388 between 10am-7pm Cost 10p,-1
1761,Nt yet chikku..simple habba..hw abt u?,-1
4937,K..k.:)congratulation ..,-1


#### assignment

Write a labeling function that marks messages containing more than 3 adjectives as spam. Use the SpaCy library for pre-processing. 

__Hint__: the following example shows how to read the part-of-speech label for each token from the message being analyzed. For information on all token properties recognized by SpaCy, see [API documentation](https://spacy.io/api/token)

In [34]:
import spacy

nlp = spacy.load('en_core_web_sm')

sms = "Yetunde, i'm sorry but moji and i seem too busy to be able to go shopping."

for token in nlp(sms):
    print(f"{token.text:<10} {token.pos_:<10} {token.tag_:<10} {token.lemma_:<10}")

Yetunde    PROPN      NNP        Yetunde   
,          PUNCT      ,          ,         
i          PRON       PRP        I         
'm         AUX        VBP        be        
sorry      ADJ        JJ         sorry     
but        CCONJ      CC         but       
moji       ADJ        JJ         moji      
and        CCONJ      CC         and       
i          PRON       PRP        I         
seem       VERB       VBP        seem      
too        ADV        RB         too       
busy       ADJ        JJ         busy      
to         PART       TO         to        
be         AUX        VB         be        
able       ADJ        JJ         able      
to         PART       TO         to        
go         VERB       VB         go        
shopping   NOUN       NN         shopping  
.          PUNCT      .          .         


In [35]:
@labeling_function(pre=[spac])
def many_adjectives(sms):
    adjectives = sum([token.pos_ == "ADJ" for token in sms.doc])
    if adjectives > 3:
        return SPAM
    else:
        return ABSTAIN

In [36]:
lfs = [many_adjectives]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()
df_train.iloc[L_train[:,0] == SPAM].sample(frac=0.1)


100%|██████████| 4458/4458 [00:00<00:00, 76134.01it/s]


Unnamed: 0,text,label
1344,Crazy ar he's married. Ü like gd looking guys not me. My frens like say he's korean leona's fave but i dun thk he is. Aft some thinking mayb most prob i'll go.,-1
1251,Ummmmmaah Many many happy returns of d day my dear sweet heart.. HAPPY BIRTHDAY dear,-1
4556,"7 wonders in My WORLD 7th You 6th Ur style 5th Ur smile 4th Ur Personality 3rd Ur Nature 2nd Ur SMS and 1st ""Ur Lovely Friendship""... good morning dear",-1
1044,Mmm thats better now i got a roast down me! id b better if i had a few drinks down me 2! Good indian?,-1
3978,Great NEW Offer - DOUBLE Mins & DOUBLE Txt on best Orange tariffs AND get latest camera phones 4 FREE! Call MobileUpd8 free on 08000839402 NOW! or 2stoptxt T&Cs,-1
3141,sexy sexy cum and text me im wet and warm and ready for some porn! u up for some fun? THIS MSG IS FREE RECD MSGS 150P INC VAT 2 CANCEL TEXT STOP,-1
4752,Your weekly Cool-Mob tones are ready to download !This weeks new Tones include: 1) Crazy Frog-AXEL F>>> 2) Akon-Lonely>>> 3) Black Eyed-Dont P >>>More info in n,-1
4815,Ummmmmaah Many many happy returns of d day my dear sweet heart.. HAPPY BIRTHDAY dear,-1
9,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,-1
5318,"Good morning, my Love ... I go to sleep now and wish you a great day full of feeling better and opportunity ... You are my last thought babe, I LOVE YOU *kiss*",-1


## Combining labeling functions into a single model

The goal of labeling functions is not to achieve individually large coverage. Labeling functions are inherently noisy and can make many individual errors. The true utility of labeling functions becomes apparent when multiple functions are combined to form a single model.

We will first build a simple model based on majority voting, and then build a more complex model. 

In [37]:
lfs = [check, free, call_now, regex_I_am_free, has_person, many_rare_words]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_test = applier.apply(df=df_test)

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

100%|██████████| 4458/4458 [00:00<00:00, 19314.21it/s]
100%|██████████| 1114/1114 [00:07<00:00, 149.84it/s]


Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.010319,0.001795,0.001346
free,1,[1],0.048004,0.048004,0.005832
call_now,2,[1],0.004038,0.001795,0.000673
regex_I_am_free,3,"[0, 1]",0.048004,0.048004,0.005832
has_person,4,[0],0.049798,0.009197,0.009197
many_rare_words,5,[1],0.064603,0.015029,0.007627


In [38]:
LFAnalysis(L=L_test, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
check,0,[1],0.012567,0.0,0.0
free,1,[1],0.045781,0.045781,0.004488
call_now,2,[1],0.003591,0.0,0.0
regex_I_am_free,3,"[0, 1]",0.045781,0.045781,0.004488
has_person,4,[0],0.048474,0.014363,0.014363
many_rare_words,5,[1],0.058348,0.01526,0.012567


In [39]:
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)

In [40]:
preds_train

array([ 1,  1, -1, ..., -1, -1, -1])

In [41]:
import numpy as np

labels, counts = np.unique(preds_train, return_counts=True)

for l, c in zip(labels, counts):
    print(f"LABEL: {l}, count: {c}")

LABEL: -1, count: 3807
LABEL: 0, count: 181
LABEL: 1, count: 470


In [42]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=42)

INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|          | 0/500 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.010]
INFO:root:[100 epochs]: TRAIN:[loss=0.003]
 38%|███▊      | 188/500 [00:00<00:00, 1874.28epoch/s]INFO:root:[200 epochs]: TRAIN:[loss=0.003]
INFO:root:[300 epochs]: TRAIN:[loss=0.002]
INFO:root:[400 epochs]: TRAIN:[loss=0.000]
100%|██████████| 500/500 [00:00<00:00, 1988.01epoch/s]
INFO:root:Finished Training


In [43]:
majority_acc = majority_model.score(L=L_test, Y=df_test.label, tie_break_policy="random")["accuracy"]
print(f"{'Majority voting accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=df_test.label, tie_break_policy="random")["accuracy"]
print(f"{'Probabilistic model accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority voting accuracy: 49.9%
Probabilistic model accuracy: 50.5%


Unfortunately, some data points will not receive any label. It is necessary to filter out these points before sending the labeling result for further processing.

In [44]:
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.utils import preds_to_probs, probs_to_preds

preds_train, probs_train = label_model.predict(L=L_train, return_probs=True)

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(X=df_train, y=probs_train, L=L_train)
df_train.shape, df_train_filtered.shape

((4458, 2), (705, 2))

As you can see, we were able to quickly prepare labels for about 650 examples (recall that initially no example in the `df_train` set had labels).

The next step will use prepared labels as training data for the actual classifier. We will use simple [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), first pre-processing the input data. Since we are working with text, we will use the [word vector representation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) created based on 5-grams by `CountVectorizer`.

In [45]:
from snorkel.utils import probs_to_preds
from sklearn.feature_extraction.text import CountVectorizer

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)

vectorizer = CountVectorizer(ngram_range=(1, 5))

X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

In [46]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=1e3, solver='lbfgs')
sklearn_model.fit(X=X_train, y=preds_train_filtered)

In [47]:
print(f"Logistic regression accuracy: {sklearn_model.score(X=X_test, y=df_test.label) * 100:.1f}%")

Logistic regression accuracy: 53.1%


As can be seen, the final model improved the score over the majority vote and the `LabelModel` model.

#### assignment

Complete the above calls with functions that you wrote yourself and check whether your functions improve the quality of the model.

In [61]:
lfs = [hot, contains_money, short_and_no_uppercase, many_adjectives]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_test = applier.apply(df=df_test)

majority_model = MajorityLabelVoter()

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=42)

majority_acc = majority_model.score(L=L_test, Y=df_test.label, tie_break_policy="random")["accuracy"]
print(f"{'Majority voting accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=df_test.label, tie_break_policy="random")["accuracy"]
print(f"{'Probabilistic model accuracy:':<25} {label_model_acc * 100:.1f}%")


preds_train, probs_train = label_model.predict(L=L_train, return_probs=True)
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(X=df_train, y=probs_train, L=L_train)

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)
vectorizer = CountVectorizer(ngram_range=(1, 5))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

sklearn_model = LogisticRegression(C=1e3, solver='lbfgs')
sklearn_model.fit(X=X_train, y=preds_train_filtered)

log_regression_acc = sklearn_model.score(X=X_test, y=df_test.label)
print(f"Logistic regression accuracy: {log_regression_acc * 100:.1f}%")

  0%|          | 0/4458 [00:00<?, ?it/s]

100%|██████████| 4458/4458 [00:00<00:00, 31524.00it/s]
100%|██████████| 1114/1114 [00:00<00:00, 31257.85it/s]
INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|          | 0/500 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.043]
INFO:root:[100 epochs]: TRAIN:[loss=0.000]
INFO:root:[200 epochs]: TRAIN:[loss=0.000]
 41%|████      | 203/500 [00:00<00:00, 2024.46epoch/s]INFO:root:[300 epochs]: TRAIN:[loss=0.000]
INFO:root:[400 epochs]: TRAIN:[loss=0.000]
100%|██████████| 500/500 [00:00<00:00, 2052.65epoch/s]
INFO:root:Finished Training


Majority voting accuracy: 62.5%
Probabilistic model accuracy: 62.5%
Logistic regression accuracy: 86.5%


In [62]:
assert(majority_acc < log_regression_acc)
assert(label_model_acc < log_regression_acc)

## Transforming functions

The idea of a transforming function is to perform an atomic transformation of an instance. For data that is an image, typical transformations include cropping, rotating, and changing the color palette. For text data, you can replace words with synonyms, substitute named entities, cut random pieces of text, etc. In the following example we will find types of named entities occurring in the text, and then prepare a simple transformer that will randomly replace occurrences of the `PERSON` entity

In [49]:
import spacy

nlp = spacy.load('en_core_web_sm')

for doc in nlp.pipe(df_train.text.sample(frac=0.05)):
    print(f"Entities: {[(e.text, e.label_) for e in doc.ents]}")

Entities: []
Entities: []
Entities: []
Entities: []
Entities: [('250', 'CARDINAL'), ('weekly', 'DATE'), ('88877', 'CARDINAL'), ('18', 'CARDINAL'), ('T&C', 'ORG')]
Entities: []
Entities: [('1', 'CARDINAL'), ('2', 'CARDINAL')]
Entities: [('first', 'ORDINAL'), ('first', 'ORDINAL')]
Entities: []
Entities: [('2,000', 'MONEY'), ('2', 'CARDINAL'), ('08718726978', 'CARDINAL'), ('Only 10p', 'DATE'), ('BT', 'GPE')]
Entities: []
Entities: [('Aiyah', 'GPE')]
Entities: []
Entities: [('morning', 'TIME'), ('Lol', 'PERSON')]
Entities: [('morning', 'TIME')]
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: [('WIN', 'ORG'), ('100', 'CARDINAL'), ('every week', 'DATE'), ('NOW Txt', 'WORK_OF_ART'), ('87066', 'CARDINAL'), ('SkillGame', 'PERSON'), ('150ppermessSubscription', 'CARDINAL')]
Entities: [('Solve d', 'ORG'), ('AfterNoon', 'TIME'), ('1,His', 'CARDINAL'), ('2,Police', 'CARDINAL'), ('2', 'CARDINAL'), ('2', 'CARDINAL'), ('2', 'CARDINAL'), ('U r Brilliant', 'ORG')]
Entities: [('

In [50]:
person_entities = []

for doc in nlp.pipe(df_train.text):
    for e in doc.ents:
        if e.label_ == 'PERSON':
            person_entities.append(e.text)
        
person_entities[:10]

['Nah',
 'Melle Melle',
 'Mark',
 'Yummy',
 'Mallika Sherawat',
 'Matrix3',
 'ShrAcomOrSglSuplt)10',
 'LS1 3AJ',
 'Divorce Barbie',
 'Ken']

In [51]:
from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

@transformation_function(pre=[spacy])
def random_person_ner(sms):
    person_ners = [e.text for e in sms.doc.ents]
    
    if person_ners:
        person_to_replace = np.random.choice(person_ners)
        person_to_add = np.random.choice(person_entities)
        sms.text = sms.text.replace(person_to_replace, person_to_add)
    return sms

Another example of transformation could be using WordNet to find synonyms for words. However, this requires downloading a corpus of data

In [52]:
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /home/drew/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [53]:
def get_synonym(word):
    
    synsets = wordnet.synsets(word)
    
    if synsets:
        words = [lemma.name() for lemma in synsets[0].lemmas()]
        
        return np.random.choice([w.replace("_", " ") for w in words])


In [54]:
@transformation_function()
def replace_words_with_synonym(sms, num_replacements=5):

    words = sms.text.split()
    
    for _ in range(num_replacements):
        word_idx = np.random.choice(range(len(words)))
        synonym = get_synonym(words[word_idx])
        if synonym:
            words[word_idx] = synonym
        
    sms.text = ' '.join(words)
    return sms

Let us now compare the original text message content with the transformed versions.

In [55]:
# source: https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/utils.py

from collections import OrderedDict

def preview_tfs(df, tfs):
    transformed_examples = []
    for f in tfs:
        for i, row in df.iterrows():
            transformed_or_none = f(row)
            # If TF returned a transformed example, record it in dict and move to next TF.
            if transformed_or_none is not None:
                transformed_examples.append(
                    OrderedDict(
                        {
                            "TF Name": f.name,
                            "Original Text": row.text,
                            "Transformed Text": transformed_or_none.text,
                        }
                    )
                )
                
    return pd.DataFrame(transformed_examples)


In [56]:
tfs = [random_person_ner, replace_words_with_synonym]

preview_tfs(df_train.sample(frac=0.1), tfs)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,random_person_ner,Was doing my test earlier. I appreciate you. Will call you tomorrow.,Was doing my test earlier. I appreciate you. Will call you u yan jiu.
1,random_person_ner,Your board is working fine. The issue of overheating is also reslove. But still software inst is pending. I will come around 8'o clock.,Your board is working fine. The issue of overheating is also reslove. But still software inst is pending. I will come around Don clock.
2,random_person_ner,Now get step 2 outta the way. Congrats again.,Now get step Sunshine Quiz Wkly outta the way. Congrats again.
3,random_person_ner,Sorry i now then c ur msg... Yar lor so poor thing... But only 4 one night... Tmr u'll have a brand new room 2 sleep in...,Sorry i now then c ur msg... Yar lor so poor thing... But Croydon CR9 5WB 0870... Tmr u'll have a brand new room 2 sleep in...
4,random_person_ner,"Beautiful Truth against Gravity.. Read carefully: ""Our heart feels light when someone is in it.. But it feels very heavy when someone leaves it.."" GOOD NIGHT","Beautiful Truth against Gravity.. Read carefully: ""Our heart feels light when someone is in it.. But it feels very heavy when someone leaves it.."" GOOD NIGHT"
...,...,...,...
887,replace_words_with_synonym,It does it on its own. Most of the time it fixes my spelling. But sometimes it gets a completely diff word. Go figure,It does it on its own. Most of the time it fixes my spelling. simply sometimes it gets angstrom unit completely diff word. Go figure
888,replace_words_with_synonym,I re-met alex nichols from middle school and it turns out he's dealing!,I re-met alex nichols from middle school and it twist out he's dealing!
889,replace_words_with_synonym,How dare you stupid. I wont tell anything to you. Hear after i wont talk to you:-.,How dare you stupid. I wont Tell anything to you. Hear after i wont talk to you:-.
890,replace_words_with_synonym,I'm home.,I'm home.


Applying transforming functions requires some policy defining the order and number of transformations. In the example below, two transformation functions are drawn at random and this sequence of two functions is applied twice to each data point. As a result, we triple the size of the train set.

In [57]:
from snorkel.augmentation import RandomPolicy, PandasTFApplier

random_policy = RandomPolicy(len(tfs), sequence_length=2, n_per_original=2, keep_original=True)

tf_applier = PandasTFApplier(tfs, random_policy)

df_train_sample = df_train.sample(frac=0.1)
df_train_augmented = tf_applier.apply(df_train_sample)

100%|██████████| 446/446 [00:04<00:00, 103.73it/s]


In [58]:
df_train_sample.shape, df_train_augmented.shape

((446, 2), (1338, 2))

#### assignment

Modify the transforming function ``replace_words_with_synonym()`` so that you can restrict the replacement of words with synonyms only for specific parts of speech (e.g., replace only nouns or verbs).

In [59]:
def part_of_speech_decorator(part):
    def decorator(func):
        def part_of_speech_decorated(sms, *args, **kwargs):
            return func(sms, part, *args, **kwargs)
        return part_of_speech_decorated
    return decorator

@transformation_function(pre=[spac])
@part_of_speech_decorator("NOUN")
def replace_words_with_synonym(sms, part=None, num_replacements=5):
    if part is None:
        words = sms.text.split()
        indices = [i for i in range(len(words))]
    else:
        words = sms.doc
        indices = [i for i, token in enumerate(words) if token.pos_ == part]
        if not indices:
            return sms
        words = [str(word) for word in words]

    for _ in range(num_replacements):
        word_idx = np.random.choice(indices)
        synonym = get_synonym(words[word_idx])
        if synonym:
            words[word_idx] = synonym
        
    sms.text = ' '.join(words)
    return sms

In [60]:
tfs = [replace_words_with_synonym]

preview_tfs(df_train.sample(frac=0.1), tfs)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,part_of_speech_decorated,"Two fundamentals of cool life: ""Walk, like you are the KING""...! OR ""Walk like you Dont care,whoever is the KING""!... Gud nyt","Two basic principle of cool life : "" Walk , like you are the KING "" ... ! OR "" Walk like you Do nt care , whoever is the male monarch "" ! ... Gud nyt"
1,part_of_speech_decorated,I'm home...,I 'm topographic point ...
2,part_of_speech_decorated,"1) Go to write msg 2) Put on Dictionary mode 3)Cover the screen with hand, 4)Press &lt;#&gt; . 5)Gently remove Ur hand.. Its interesting..:)","1 ) Go to write MSG 2 ) Put on Dictionary way 3)Cover the projection screen with hand , 4)Press & lt;#&gt ; . 5)Gently remove Ur hand .. Its interesting .. :)"
3,part_of_speech_decorated,Yeah my usual guy's out of town but there're definitely people around I know,Yeah my usual bozo 's out of town but there 're definitely people around I know
4,part_of_speech_decorated,"Sorry pa, i dont knw who ru pa?","Sorry pa , i do nt knw who ru pa ?"
...,...,...,...
441,part_of_speech_decorated,yay! finally lol. i missed our cinema trip last week :-(,yay ! finally lol . i missed our film trip last hebdomad :-(
442,part_of_speech_decorated,S:-)if we have one good partnership going we will take lead:),S:-)if we have one good partnership going we will take lead :)
443,part_of_speech_decorated,Where are the garage keys? They aren't on the bookshelf,Where are the garage key ? They are n't on the bookshelf
444,part_of_speech_decorated,Babe ? I lost you ... :-(,Babe ? I lost you ... :-(
