In this walkthrough, we will develop a neural sentiment analysis model for Norwegian without access to labelled data. The first step is to read in our data and store it in spaCy DocBin objects:


In [1]:
import sys
import spacy
from spacy.tokens import DocBin
import pandas as pd

nlp = spacy.load("nb_core_news_md")

train_doc_bin = DocBin(store_user_data=True)
dev_doc_bin = DocBin(store_user_data=True)
test_doc_bin = DocBin(store_user_data=True)

train = pd.read_csv("../../data/sentiment/norec_sentence/train.txt", delimiter="\t", header=None)
dev = pd.read_csv("../../data/sentiment/norec_sentence/dev.txt", delimiter="\t", header=None)
test = pd.read_csv("../../data/sentiment/norec_sentence/test.txt", delimiter="\t", header=None)

We load each split as a pandas dataframe with two columns: the first is the polarity (0=Negative, 1=Neutral, 2=Positive) and the second is the pretokenized sentence, as a string. The sentence index is the dataframe's row index. We also create a spaCy DocBin object for each of the train, dev, and test splits.

In [2]:
print(train.head(10))

   0                                                  1
0  1                                Stor og bred Seagal
1  1  Steven Seagal er blitt like stor som Travolta ...
2  1  I denne filmen handler det om en seriemorder m...
3  1   Dermed er genrens patologi-imperativ ivaretatt .
4  0                     Verre er det med slagsmålene .
5  0  Klipperen har overtatt Seagals martial art , o...
6  1                              Men noen fikse ting :
7  1  Hettemannen ser ut som David Beckham etter en ...
8  1                  Seagal får en Dirty Harry-pakke :
9  2  Ei finslig dame skal bli med på kjøret og obse...
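
To make the file layout concrete, the following is a minimal sketch that parses a few of the rows printed above from an in-memory tab-separated string and counts the polarity classes. It assumes the two-column layout suggested by the printed frame; `POLARITY_NAMES`, `SAMPLE_TSV`, and `read_polarity_rows` are hypothetical names introduced only for this illustration, and the standard-library `csv` module stands in for pandas.

```python
import csv
import io
from collections import Counter

# Polarity encoding described in the text: 0=Negative, 1=Neutral, 2=Positive.
POLARITY_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}

# A few untruncated rows copied from the printed sample, tab-separated
# (assumed to match the layout of train.txt).
SAMPLE_TSV = (
    "1\tStor og bred Seagal\n"
    "1\tDermed er genrens patologi-imperativ ivaretatt .\n"
    "0\tVerre er det med slagsmålene .\n"
    "1\tMen noen fikse ting :\n"
    "1\tSeagal får en Dirty Harry-pakke :\n"
)


def read_polarity_rows(handle):
    """Yield (polarity_name, sentence) pairs from a two-column TSV stream."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for label, sentence in reader:
        yield POLARITY_NAMES[int(label)], sentence


rows = list(read_polarity_rows(io.StringIO(SAMPLE_TSV)))
label_counts = Counter(name for name, _ in rows)
print(label_counts)
```

Even in this tiny sample the neutral class dominates, which is worth keeping in mind when reading F1 scores later on.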


In [3]:
# Parse each training sentence, attach its gold polarity, and save the split to disk.
for sid, (label, sent) in train.iterrows():
    doc = nlp(sent)
    doc.user_data["gold"] = label  # the gold label travels with the Doc via user_data
    train_doc_bin.add(doc)
train_doc_bin.to_disk("../../data/sentiment/norec_sentence/train.docbin")

# Same procedure for the dev split.
for sid, (label, sent) in dev.iterrows():
    doc = nlp(sent)
    doc.user_data["gold"] = label
    dev_doc_bin.add(doc)
dev_doc_bin.to_disk("../../data/sentiment/norec_sentence/dev.docbin")

# Same procedure for the test split.
for sid, (label, sent) in test.iterrows():
    doc = nlp(sent)
    doc.user_data["gold"] = label
    test_doc_bin.add(doc)
test_doc_bin.to_disk("../../data/sentiment/norec_sentence/test.docbin")
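
As a sanity check on this storage step, here is a small sketch of a DocBin round trip that confirms the gold label survives serialization. It uses a blank Norwegian Bokmål pipeline and in-memory bytes instead of nb_core_news_md and the files on disk, so it is an illustration under those assumptions and not part of the original notebook. Note that the DocBin used for reading also needs store_user_data=True, otherwise get_docs does not restore user_data.

```python
import spacy
from spacy.tokens import DocBin

# A blank Norwegian Bokmål pipeline (tokenizer only) keeps the sketch light;
# the walkthrough itself loads nb_core_news_md.
nlp = spacy.blank("nb")

# Two example sentences with their polarity labels, taken from the sample output.
examples = [(1, "Stor og bred Seagal"), (0, "Verre er det med slagsmålene .")]

doc_bin = DocBin(store_user_data=True)
for label, sent in examples:
    doc = nlp(sent)
    doc.user_data["gold"] = label
    doc_bin.add(doc)

# Serialize to bytes and back; to_disk / from_disk do the same via a file.
restored = DocBin(store_user_data=True).from_bytes(doc_bin.to_bytes())
docs = list(restored.get_docs(nlp.vocab))

gold_labels = [doc.user_data["gold"] for doc in docs]
print(gold_labels)
```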


Next, we will begin to build up our labelling functions. We will start by training a document-level bag-of-words (BOW) model on the NoReC documents. These documents carry star ratings from 1 to 6, so the labelling function has to map the ratings to negative, neutral, and positive.

In [4]:
import sys
sys.path.insert(0, '../..')
import skweak
from sklearn.metrics import f1_score
from sentiment_models import DocBOWAnnotator

dann = DocBOWAnnotator("doc_norec",
                       model_path="../../data/sentiment/models/doc",
                       doclevel_data="../../data/sentiment/conllu.tar.gz")

Fitting model on ../../data/sentiment/conllu.tar.gz
Doc-level F1: 0.313
Saving vectorizer and model to ../../data/sentiment/models/doc
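
The mapping from star ratings to the three polarity classes is not shown in this section. The sketch below illustrates one plausible choice; the thresholds (1-2 negative, 3-4 neutral, 5-6 positive) and the function name `stars_to_polarity` are assumptions made for illustration, not necessarily what DocBOWAnnotator does.

```python
def stars_to_polarity(stars: int) -> int:
    """Map a 1-6 star rating to 0=Negative, 1=Neutral, 2=Positive.

    The cut-offs are hypothetical: the walkthrough only states that the
    six ratings are collapsed into three classes.
    """
    if not 1 <= stars <= 6:
        raise ValueError(f"expected a rating between 1 and 6, got {stars}")
    if stars <= 2:
        return 0  # negative
    if stars <= 4:
        return 1  # neutral
    return 2  # positive


mapped = [stars_to_polarity(s) for s in range(1, 7)]
print(mapped)
```

Where the cut-offs are placed changes the class balance of the training data, so they are worth checking against the dev split.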


Next, we will fine-tune a multilingual BERT model on the English Stanford Sentiment Treebank, using three output labels.

In [5]:
from transformer_model import SSTDataLoader, train
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

train_loader = SSTDataLoader("../../data/sentiment/sst/train.txt")
dev_loader = SSTDataLoader("../../data/sentiment/sst/dev.txt")
test_loader = SSTDataLoader("../../data/sentiment/sst/test.txt")

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-uncased", num_labels=3)
train(model, tokenizer, train_loader, dev_loader, "../../data/sentiment/models/mbert-sst")

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model 

training for 20 epochs...


 16%|█▋        | 88/534 [05:41<28:53,  3.89s/it]


KeyboardInterrupt: 
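
The cell above imports get_linear_schedule_with_warmup, but the train helper itself is not shown, so whether and how it uses that schedule is an assumption. For readers unfamiliar with it, the sketch below reproduces the learning-rate multiplier that such a schedule applies: a linear ramp from 0 to 1 during warmup, then a linear decay back to 0. The step counts assume the progress bar counts 534 batches per epoch over 20 epochs; the 10% warmup fraction and the name `linear_warmup_factor` are hypothetical.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


STEPS_PER_EPOCH = 534  # read off the progress bar (assumed to be batches per epoch)
EPOCHS = 20            # from the "training for 20 epochs..." message
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS
WARMUP_STEPS = TOTAL_STEPS // 10  # hypothetical 10% warmup

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS):
    print(step, linear_warmup_factor(step, WARMUP_STEPS, TOTAL_STEPS))
```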

We can now construct a full annotator that combines all of the annotators described above, and then run it on the NoReC sentence-level dataset.
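
Each annotator in such an ensemble produces its own, possibly conflicting, label per sentence. How these labels are reconciled is not shown in this section, so the following is only a baseline illustration: a majority vote over the labelling functions' outputs. It is not the aggregation performed by FullSentimentAnnotator or skweak; the `majority_vote` function, the use of None for abstentions, and the fallback to neutral on ties are all assumptions.

```python
from collections import Counter

ABSTAIN = None   # hypothetical marker for a labelling function that gives no label
NEUTRAL = 1      # polarity encoding from the text: 0=Negative, 1=Neutral, 2=Positive


def majority_vote(votes, default=NEUTRAL):
    """Return the most frequent non-abstaining label, or `default` on ties or no votes."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return default
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else default


# One list of votes per sentence, one entry per labelling function.
sentence_votes = [
    [2, 2, 0],
    [ABSTAIN, ABSTAIN],
    [0, 2],
    [0, ABSTAIN, 0, 2],
]
aggregated = [majority_vote(votes) for votes in sentence_votes]
print(aggregated)
```

A majority vote treats every labelling function as equally reliable, which is the main reason weak supervision toolkits usually learn per-function weights instead.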

In [6]:
from norec_sentiment import FullSentimentAnnotator

ann = FullSentimentAnnotator()
ann.add_all()

ann.annotate_docbin("../../data/sentiment/norec_sentence/train.docbin",
                    "../../data/sentiment/norec_sentence/train_pred.docbin")

ann.annotate_docbin("../../data/sentiment/norec_sentence/dev.docbin",
                    "../../data/sentiment/norec_sentence/dev_pred.docbin")

ann.annotate_docbin("../../data/sentiment/norec_sentence/test.docbin",
                    "../../data/sentiment/norec_sentence/test_pred.docbin")

Loading lexicon functions
Loading learned sentiment model functions
Loaded model from ../../data/sentiment/models/doc
Loaded nlptown/bert-base-multilingual-uncased-sentiment
Loaded mBERT from ../../data/sentiment/models/mbert-sst
Number of processed documents: 1
Number of processed documents: 2
Number of processed documents: 3
...
Number of processed documents: 967
Number of processed documents: 968
Number of processed documents: 969
Number of processed documents: 970
Number of processed documents: 971
Number of processed documents: 972
Number of processed documents: 973
Number of processed documents: 974
Number of processed documents: 975
Number of processed documents: 976
Number of processed documents: 977
Number of processed documents: 978
Number of processed documents: 979
Number of processed documents: 980
Number of processed documents: 981
Number of processed documents: 982
Number of processed 

Number of processed documents: 126
Number of processed documents: 127
Number of processed documents: 128
Number of processed documents: 129
Number of processed documents: 130
Number of processed documents: 131
Number of processed documents: 132
Number of processed documents: 133
Number of processed documents: 134
Number of processed documents: 135
Number of processed documents: 136
Number of processed documents: 137
Number of processed documents: 138
Number of processed documents: 139
Number of processed documents: 140
Number of processed documents: 141
Number of processed documents: 142
Number of processed documents: 143
Number of processed documents: 144
Number of processed documents: 145
Number of processed documents: 146
Number of processed documents: 147
Number of processed documents: 148
Number of processed documents: 149
Number of processed documents: 150
Number of processed documents: 151
Number of processed documents: 152
Number of processed documents: 153
Number of processed 

Number of processed documents: 362
Number of processed documents: 363
Number of processed documents: 364
Number of processed documents: 365
Number of processed documents: 366
Number of processed documents: 367
Number of processed documents: 368
Number of processed documents: 369
Number of processed documents: 370
Number of processed documents: 371
Number of processed documents: 372
Number of processed documents: 373
Number of processed documents: 374
Number of processed documents: 375
Number of processed documents: 376
Number of processed documents: 377
Number of processed documents: 378
Number of processed documents: 379
Number of processed documents: 380
Number of processed documents: 381
Number of processed documents: 382
Number of processed documents: 383
Number of processed documents: 384
Number of processed documents: 385
Number of processed documents: 386
Number of processed documents: 387
Number of processed documents: 388
Number of processed documents: 389
Number of processed 

Number of processed documents: 597
Number of processed documents: 598
Number of processed documents: 599
Number of processed documents: 600
Number of processed documents: 601
Number of processed documents: 602
Number of processed documents: 603
Number of processed documents: 604
Number of processed documents: 605
Number of processed documents: 606
Number of processed documents: 607
Number of processed documents: 608
Number of processed documents: 609
Number of processed documents: 610
Number of processed documents: 611
Number of processed documents: 612
Number of processed documents: 613
Number of processed documents: 614
Number of processed documents: 615
Number of processed documents: 616
Number of processed documents: 617
Number of processed documents: 618
Number of processed documents: 619
Number of processed documents: 620
Number of processed documents: 621
Number of processed documents: 622
Number of processed documents: 623
Number of processed documents: 624
Number of processed 

Number of processed documents: 832
Number of processed documents: 833
Number of processed documents: 834
Number of processed documents: 835
Number of processed documents: 836
Number of processed documents: 837
Number of processed documents: 838
Number of processed documents: 839
Number of processed documents: 840
Number of processed documents: 841
Number of processed documents: 842
Number of processed documents: 843
Number of processed documents: 844
Write to ../../data/sentiment/norec_sentence/test_pred.docbin...done


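Now we can use the HMM model from skweak to aggregate the predictions of all the weak labelling functions into a single label per sentence. We fit the model on the training predictions and then annotate the train, dev, and test DocBins, writing each one back to the same file.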

In [7]:
# Labels are 0=Negative, 1=Neutral, 2=Positive. sequence_labelling=False selects
# text classification rather than token-level sequence labelling.
unified_model = skweak.aggregation.HMM("hmm", [0, 1, 2], sequence_labelling=False)
unified_model.fit("../../data/sentiment/norec_sentence/train_pred.docbin")

# Add the aggregated "hmm" layer to each split (input and output paths are the same)
unified_model.annotate_docbin("../../data/sentiment/norec_sentence/train_pred.docbin",
                              "../../data/sentiment/norec_sentence/train_pred.docbin")

unified_model.annotate_docbin("../../data/sentiment/norec_sentence/dev_pred.docbin",
                              "../../data/sentiment/norec_sentence/dev_pred.docbin")

unified_model.annotate_docbin("../../data/sentiment/norec_sentence/test_pred.docbin",
                              "../../data/sentiment/norec_sentence/test_pred.docbin")

KeyError: 'IS_ALPHA'

We can also compare against a majority voting aggregator. This simple baseline should highlight what the HMM aggregation adds.

In [None]:
mv = skweak.aggregation.MajorityVoter("mv", [0, 1, 2], sequence_labelling=False) #type: ignore
mv.annotate_docbin("../../data/sentiment/norec_sentence/test_pred.docbin",
                   "../../data/sentiment/norec_sentence/test_pred.docbin")
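
To make the baseline concrete, here is a minimal sketch of majority voting over sentence-level votes, written in plain Python rather than with skweak. The function name `majority_vote`, the lexicon names in the example, and the tie-breaking rule (fall back to neutral, 1) are assumptions made for illustration; skweak's `MajorityVoter` may handle ties and abstentions differently.

```python
from collections import Counter

NEUTRAL = 1  # label scheme from the walkthrough: 0=Negative, 1=Neutral, 2=Positive


def majority_vote(votes, default=NEUTRAL):
    """Return the most frequent label among the votes.

    `votes` maps a (hypothetical) labelling-function name to its label,
    or to None when that function abstains. Ties and all-abstain cases
    fall back to `default`; this tie rule is an assumption of the sketch.
    """
    counts = Counter(label for label in votes.values() if label is not None)
    if not counts:
        return default
    ranked = counts.most_common()
    best_label, best_count = ranked[0]
    # A tie for first place is treated as "no clear majority".
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return default
    return best_label


# Hypothetical votes from three lexicon-based labelling functions.
example = {"lexicon_a": 2, "lexicon_b": 2, "lexicon_c": 0}
print(majority_vote(example))
```

Every vote counts equally here, regardless of how reliable its source is. A generative aggregator such as the HMM instead estimates how reliable each labelling function is and weights its votes accordingly, which is where it can improve on this baseline.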

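Now we can evaluate everything on the test set: first each weak labelling function on its own, then the two aggregators, using macro-averaged F1 in both cases.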

In [None]:
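# Imports needed by this cell
from sklearn.metrics import f1_score
from skweak import utils

pred_docs = list(utils.docbin_reader("../../data/sentiment/norec_sentence/test_pred.docbin"))
gold = [d.user_data["gold"] for d in pred_docs]

# Evaluate the weak labelling approaches themselves. This assumes every document
# carries exactly one span per layer, so that pred lines up with gold.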
for lexicon in pred_docs[0].user_data["spans"].keys():
    pred = []
    for d in pred_docs:
        for span in d.spans[lexicon]:
            pred.append(span.label_)

    lex_f1 = f1_score(gold, pred, average="macro")
    print("{0}:\t{1:.3f}".format(lexicon, lex_f1))

# Evaluate the two aggregated layers in the same way
for aggregator in ["mv", "hmm"]:
    pred = []
    for d in pred_docs:
        for span in d.spans[aggregator]:
            pred.append(span.label_)
    agg_f1 = f1_score(gold, pred, average="macro")
    print("{0}:\t{1:.3f}".format(aggregator, agg_f1))
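
As a reminder of what the macro-averaged score measures, the following sketch computes it by hand: F1 is calculated separately for each of the three classes and the results are averaged without weighting, so a rare class counts as much as a frequent one. The helper `macro_f1` and the toy labels are invented for illustration and are not results from the NoReC data.

```python
def macro_f1(gold, pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never occurs in gold or pred scores 0 in this sketch.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
toy_gold = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(toy_gold, toy_pred), 3))  # per-class F1: 0.5, 0.8, 0.667 -> 0.656
```

Because each class contributes equally, a labelling function that predicts only the majority class is penalised heavily, which makes this metric a reasonable choice for comparing weak labellers with very different coverage.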
