<a href="https://colab.research.google.com/github/JayThibs/Weak-Supervised-Learning-Case-Study/blob/main/text_classifier/notebooks/03_dbpedia_14_snorkel_dataset_labeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a Multi-Class Classification Dataset with Snorkel

In this notebook, we will be preparing a new version of the DBpedia dataset with snorkel. The goal is to gain experience building multi-class dataset with Snorkel and training a multi-class classification model on the new labeled data.

DBpedia is already a labeled dataset, but for labeling purposes we will assume that only a small sample of the dataset was labeled to test the accuracy of the Snorkel labeling process for this particular dataset. Instead of using an unlabeled dataset, this will speed up the labeling process since we already have a list of labels and we can build the labeling functions quicker by looking at them. On an unlabeled dataset, this would be a lot more tedious of a process because we'd need to label a portion of the dataset ourselves. With this approach we can quickly get experience with snorkel and set up a pipeline.

Though we have not yet implemented it yet, part of the goal of this for this project is to use Snorkel to label any text datasets by users, provided that they give us their text data, the labels they are interested in, the corresponding keywords for each label, and what part of text they are looking to classify (paragraphs, sentences, only sections about x, etc.) The output would be a brand new dataset that they can use for their particular use-case like prioritizing paragraphs in a large collection of PDFs (for example, regulators trying to assign tasks for each subject matter expert to read PDFs sent by companies).

In [5]:
!pip install snorkel --quiet
!pip install datasets --quiet
!pip install spacy --quiet
!pip install pytorch-lightning==1.2.8 --quiet
!pip install transformers==4.5.1 --quiet
!pip install wandb --quiet
!pip install onnxruntime --quiet

[31mERROR: tensorflow 2.4.1 has requirement tensorboard~=2.4, but you'll have tensorboard 1.15.0 which is incompatible.[0m
[31mERROR: pytorch-lightning 1.2.8 has requirement tensorboard>=2.2.0, but you'll have tensorboard 1.15.0 which is incompatible.[0m
[31mERROR: snorkel 0.9.7 has requirement tensorboard<2.0.0,>=1.14.0, but you'll have tensorboard 2.5.0 which is incompatible.[0m


In [6]:
from datasets import load_dataset
import re

import pandas as pd
import numpy as np
import os
from snorkel.labeling import labeling_function
from snorkel.labeling import LabelingFunction
from snorkel.labeling.lf.nlp import nlp_labeling_function
from snorkel.preprocess import preprocessor
from textblob import TextBlob
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.apply.dask import DaskLFApplier
from sklearn.model_selection import train_test_split
import spacy
from spacy.matcher import Matcher
import numpy as np
nlp = spacy.load("en_core_web_sm")
from spacy.matcher import PhraseMatcher

from tqdm.auto import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizerFast, AutoTokenizer, AutoModel, BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup, DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import AutoModelForSequenceClassification

import pytorch_lightning as pl
from pytorch_lightning.metrics.functional import accuracy, f1, auroc
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, BackboneFinetuning, QuantizationAwareTraining, ModelPruning
from pytorch_lightning.loggers import TensorBoardLogger
RANDOM_SEED = 42
pl.seed_everything(RANDOM_SEED)

# weights and biases
import wandb

# lightning plus wandb
from pytorch_lightning.loggers import WandbLogger

# downloading files from colab
from google.colab import files

# Saving and running model with ONNX
import onnxruntime

  defaults = yaml.load(f)
Global seed set to 42


In [7]:
!nvidia-smi

Sat May 15 02:24:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Loading the DBpedia-14 Dataset

In [8]:
dbpedia_dataset = load_dataset('dbpedia_14')

Reusing dataset d_bpedia14 (/root/.cache/huggingface/datasets/d_bpedia14/dbpedia_14/2.0.0/7f0577ea0f4397b6b89bfe5c5f2c6b1b420990a1fc5e8538c7ab4ec40e46fa3e)


In [9]:
dbpedia_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 70000
    })
})

In [10]:
dbpedia_dataset = dbpedia_dataset.rename_column("label", "labels") # snorkel uses "label"

## Merging the Text Title with the Content

In [11]:
def merge_title_with_content(example):
    example["content"] = example["title"] + " " + example["content"]
    return example

In [12]:
dbpedia_dataset = dbpedia_dataset.map(merge_title_with_content, num_proc=10)

 

HBox(children=(FloatProgress(value=0.0, description='#0', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#1', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#2', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#3', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#4', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#5', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#6', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#7', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#8', max=56000.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#9', max=56000.0, style=ProgressStyle(description_width='…











 

HBox(children=(FloatProgress(value=0.0, description='#0', max=7000.0, style=ProgressStyle(description_width='i…

 

HBox(children=(FloatProgress(value=0.0, description='#1', max=7000.0, style=ProgressStyle(description_width='i…

  

HBox(children=(FloatProgress(value=0.0, description='#2', max=7000.0, style=ProgressStyle(description_width='i…

 

HBox(children=(FloatProgress(value=0.0, description='#3', max=7000.0, style=ProgressStyle(description_width='i…

 

HBox(children=(FloatProgress(value=0.0, description='#4', max=7000.0, style=ProgressStyle(description_width='i…

 

HBox(children=(FloatProgress(value=0.0, description='#5', max=7000.0, style=ProgressStyle(description_width='i…

 

HBox(children=(FloatProgress(value=0.0, description='#6', max=7000.0, style=ProgressStyle(description_width='i…

  

HBox(children=(FloatProgress(value=0.0, description='#7', max=7000.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='#9', max=7000.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='#8', max=7000.0, style=ProgressStyle(description_width='i…













In [13]:
dbpedia_dataset['train']

Dataset({
    features: ['content', 'labels', 'title'],
    num_rows: 560000
})

## Creating a Pandas DataFrame of the Dataset

We need to prepare a Pandas DataFrame of our dataset so that Snorkel can label it.

### Creating a Training Set DataFrame

In [46]:
data = {'label': dbpedia_dataset['train']['labels'], # these values will be replaced during the labeling process for the training set
        'text': dbpedia_dataset['train']['content']} # need to replace "content" with "text" for snorkel to work

train_df = pd.DataFrame(data)
train_df['text'] = train_df['text'].str.lower()
train_df

Unnamed: 0,label,text
0,0,e. d. abbott ltd abbott of farnham e d abbott...
1,0,schwan-stabilo schwan-stabilo is a german mak...
2,0,q-workshop q-workshop is a polish company loc...
3,0,marvell software solutions israel marvell sof...
4,0,bergan mercy medical center bergan mercy medi...
...,...,...
559995,13,barking in essex barking in essex is a black ...
559996,13,science & spirit science & spirit is a discon...
559997,13,the blithedale romance the blithedale romance...
559998,13,razadarit ayedawbon razadarit ayedawbon (burm...


### Creating a Test Set DataFrame

In [47]:
data = {'label': dbpedia_dataset['test']['labels'],
        'text': dbpedia_dataset['test']['content']} # need to replace "content" with "text" for snorkel to work

test_df = pd.DataFrame(data)
del data
test_df['text'] = test_df['text'].str.lower()
test_df

Unnamed: 0,label,text
0,0,ty ku ty ku /taɪkuː/ is an american alcoholic...
1,0,odd lot entertainment oddlot entertainment fo...
2,0,henkel henkel ag & company kgaa operates worl...
3,0,goat store the goat store (games of all type ...
4,0,ragwing aircraft designs ragwing aircraft des...
...,...,...
69995,13,energy victory energy victory: winning the wa...
69996,13,bestiario bestiario is a book of 8 short stor...
69997,13,wuthering heights wuthering heights is a nove...
69998,13,l'indépendant l'indépendant is a newspaper pu...


Here we create 4 dataset splits. We are emulating the situation where we have a small labeled dataset where we can use for model training, and then mesure our Labeling Function accuracy. What we call `fine_tune_df` (Fine-Tune Set), `val_df` (Validation Set), and `test_df` (Test Set) contain labeled data. The `train_df` is the unlabeled dataset we will be using Snorkel to label.

*   `fine_tune_df`: Labeled data we use for fine-tuning the bert model before putting model into a Snorkel Labeling Function.
*   `val_df`: This is our small dataset split we will use to measure the performance of the bert model trained with `fine_tune_df`.
*    `train_df`: As mentioned, the `train_df` is the unlabeled dataset we will be using Snorkel to label.
*    `test_df`: This is our dataset split for testing Snorkel's Label Model to get an estimate on how many labels were correctly labeled in our `train_df`.

We only label 30k examples in the training set and 4.5k in the test set in the interest of time.

In [48]:
train_df = train_df.sample(31400, random_state=123)
test_df = test_df.sample(5000, random_state=123)

In [49]:
fine_tune_df = train_df.groupby('label').apply(lambda s: s.sample(100, random_state=123)).reset_index(level=0, drop=True)
train_df.drop(fine_tune_df.index, inplace=True)

val_df = test_df.sample(frac=0.1)
test_df.drop(val_df.index, inplace=True)

print('\t Fine-Tune Set:', len(fine_tune_df), 'Valid:', len(val_df), '\t', 'Train:', len(train_df), '\t Test:', len(test_df),)

	 Fine-Tune Set: 1400 Valid: 500 	 Train: 30000 	 Test: 4500


In [50]:
fine_tune_df

Unnamed: 0,label,text
27110,0,autoliv autoliv is a swedish-american company...
36059,0,zymo research zymo research is a manufacturer...
32146,0,hcr relocation hcr group is a global relocati...
6938,0,rs technologies resin systems inc. is a canad...
14322,0,united coffee united coffee is one of europe’...
...,...,...
529531,13,the year's best fantasy stories: 12 the year'...
531528,13,qs world university rankings the qs world uni...
545113,13,the golden khan of ethengar the golden khan o...
551661,13,the road to mars the road to mars is a 1999 s...


## A gentle introduction to LFs

Labeling functions (LFs) help users encode domain knowledge and other supervision sources programmatically.

LFs are heuristics that take as input a data point and either assign a label to it (in this case, HAM or SPAM) or abstain (don’t assign any label). Labeling functions can be noisy: they don’t have perfect accuracy and don’t have to label every data point. Moreover, different labeling functions can overlap (label the same data point) and even conflict (assign different labels to the same data point). This is expected, and we demonstrate how we deal with this later.

Because their only requirement is that they map a data point a label (or abstain), they can wrap a wide variety of forms of supervision. Examples include, but are not limited to:

    Keyword searches: looking for specific words in a sentence
    Pattern matching: looking for specific syntactical patterns
    Third-party models: using an pre-trained model (usually a model for a different task than the one at hand)
    Distant supervision: using external knowledge base
    Crowdworker labels: treating each crowdworker as a black-box function that assigns labels to subsets of the data

### a) Exploring the training set for initial ideas

We’ll start by looking at 20 random data points from the train set to generate some ideas for LFs.


In [51]:
ABSTAIN = -1
Company = 0
EducationalInstitution = 1
Artist = 2
Athlete = 3
OfficeHolder = 4
MeanOfTransportation = 5
Building = 6
NaturalPlace = 7
Village = 8
Animal = 9
Plant = 10
Album = 11
Film = 12
WrittenWork = 13

labels_num_list = [Company,
EducationalInstitution,
Artist,
Athlete,
OfficeHolder,
MeanOfTransportation,
Building,
NaturalPlace,
Village,
Animal,
Plant,
Album,
Film,
WrittenWork]

In [52]:
labels_num_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

In [53]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 200)

In [54]:
# # for keyword creation
df_view = train_df[train_df['label'] == 13].sample(1000, random_state=42)
df_view

Unnamed: 0,label,text
524803,13,the begum's fortune the begum's fortune (french: les cinq cents millions de la bégum) also published as the begum's millions is an 1879 novel by jules verne with some elements which could be described as utopian and others which seem clearly dystopian. it is remarkable as the first published book in which verne was cautionary and to some degree pessimistic about the development of science and technology.
557439,13,the touch (mccullough novel) the touch is a historical novel by colleen mccullough published in 2003. it is about the life of a scotswoman elizabeth drummond who travels from her home in kinross scotland to new south wales in order to marry her wealthy cousin alexander kinross. the story takes place over the latter half of the 19th century.
546926,13,the developers the developers published in 2005 is a tech-humor fiction novel by author ben woods.
543891,13,threepenny novel threepenny novel is a work of fiction written by bertolt brecht first published in amsterdam by allert de lange (nl) in 1934 as dreigroschenroman. the novel retains certain elements of the the threepenny opera – for instance characters such as macheath and polly – but is essentially a completely separate work in tone ambition and purpose.
523594,13,blue moon rising blue moon rising is a fantasy novel by british author simon r. green. the first in a series of four books in the forest kingdom series with the main protagonists appearing in six books in the hawk & fisher series by green.the book had a troubled launch with many publishers rejecting green before being published by roc. the book became green's first bestseller.as with other green books it is written in a style which combines dramatic storytelling with plenty of wry humour and light relief.
...,...,...
555209,13,the man of the crowd the man of the crowd is a short story written by edgar allan poe about a nameless narrator following a man through a crowded london first published in 1840.
525454,13,alireza alireza is a memoir written by iranian american author arion golmakani translated into persian by shadi hamedi and published in january 2014. the book told in first person perspective tells the story of an abandoned little boy growing up on the streets of iran before the iranian islamic revolution of 1979.
536845,13,the mistress of wholesome the mistress of wholesome is a play by jacob appel that premiered at the little theatre of alexandria on may 16 2008.the play was directed by keith waters and starred kacie greenwood danielle y. eure and jung weil. a second production at the openstage theater in pittsburgh won the theatre league of western pennsylvania's top honors for 2008. the pittsburgh post gazette described the play as quirky yet delectable.
528819,13,fanfare (magazine) fanfare is an american bimonthly magazine devoted to reviewing recorded music in all playback formats. it mainly covers classical music but since inception has also featured a jazz column in every issue.the magazine now runs to over 600 pages in a 6 x 9 format with about 80% of the editorial copy devoted to record reviews and a front section with a substantial number of interviews and feature articles.


In [55]:
# Showing the most common words to make the keyword creation process quicker
# We are choosing some of words in this list to populate the label:keyword dictionary

# from collections import Counter
# Counter(" ".join(df_view["content"]).split()).most_common(100) # slower than pd.Series()

# faster
pd.Series(' '.join(df_view.text).lower().split()).value_counts()[:100]

the              3923
of               1936
in               1799
a                1705
and              1668
is               1528
by               1310
was               663
it                618
published         589
to                496
book              428
novel             420
on                378
as                343
for               319
first             295
an                293
journal           269
written           247
with              237
from              212
magazine          201
series            185
that              160
story             156
his               153
newspaper         143
has               134
at                131
its               127
which             127
who               125
new               122
science           119
american          119
fiction           117
author            113
stories           106
short              93
daily              85
into               83
collection         82
been               79
about              79
peer-revie

## Creating Label:Keyword Dictionary

We are creating a massive list of keywords for all of the labels in our dataset. These keywords will either feed into our spaCy matcher during the labeling process or into a keyword lookup function, depending on which approach we take.

We are using the spaCy PhraseMatcher (only looks for keywords) instead of the Matcher (can take into lemmatization, part-of-speech, etc.) because the PhraseMatcher is much faster than the Matcher.  This decision was made because we wanted to have a fast snorkel labeler that would reduce the amount of wai time for users who are using a web app (one that we'd create) to label their data.

That said, we realized that using a keyword lookup function was faster than the PhraseMatcher so we went with the keyword lookup instead.

Going forward, I would say that you should use keyword lookup if you want to feed specific words into you labeling function, and only user the spaCy Matcher if you want to do special things like looking a lemmatization of small amount of words or use some of the other features of the Matcher that simply doesn't work with a keyword lookup. You can basically disregard the PhraseMatcher for Snorkel Labeling.

In [56]:
import spacy
from spacy.matcher import PhraseMatcher

# Creating keyword lists with string and split(', ').

# keywords_list = ["""company, headquarter, corporate, finance, ltd., airline, firm, commerce, manufacturer, factory, based in, based out of, founded, corporation, inc., foundation, newspaper""".split(', '),
#                  """university, students, bachelor, degree, school, academy, college, high, located, public, city, academy, county, institute, national, research, business, grammar, government, technology, medicine, mascot, academic, board, program, co-educational, economics, junior, science, schools, faculty""".split(', '),
#                  """dancer, writer, artist, actor, singer-songwriter, teacher, scholar, professor, composer, actress, pianist, novelist, singer, songwriter, born, english, american, chinese, guitarist, author, drummer, vocalist, saxophonist, painter, Canadian, member of, screenwriter, prose, poet, career, jazz, folk""".split(', '),
#                  """rugby, player, football, professional, nfl, league, injury, 1st round, contract, footballer, wrestling, lineman, cricketer, born, retired, former, mlb, pitcher, fencer, driver, american, canadian, english, belgian, attended, season, champion, motorcycle""".split(', '),
#                  """born, author, commentator, director, professor, leader, governor, politician, minister, president, died, representatives, assembly, republic, democratic, house of representatives, house, election, member of parliament, party, speaker, president, lawyer, liberal, candidate, election, deputy, prime minister, official""".split(', '),
#                  """ship, aircraft, boeing, navy, destroyer, diesel, rail, warship, transport, submarine, trike, aerobatic, motorcar, railway, monoplane, vessel, motorcycle, navigation, railway, cars, airliner, naval, whaleship, rail, automobile, flown, ferry, wing, tailplane, car, convoy, flown, naval, locomotive, vehicle, flagship, cruise ship, boat, convoy, automaker, battleships""".split(', '),
#                  """church, church, center, house, historic, dam, district, district, castle, hospital, institution, museum, victorian, farm, building, mall, restaurant, shopping mall, centre, supermarket, built in""".split(', '),
#                  """river, mountain, hill, hills, land, lake, km, m, forest, creek, ocean, stream, strait, gulf, peak, elevation, glacier, volcanic, corona, tributary, mount, flows, border, watershed""".split(', '),
#                  """rural, population, census, central, province, state, village, district, eastern, western, kilometres, km, mi, county, south, north, south-west, south-east, north-west, north-east, town, border, regional, capital, municipality, block, located, families, administrative, federation, india, united states, turkey, iran""".split(', '),
#                  """habitat, species, beetles, subfamily, snail, endemic, family, extinct, shelled, squids, octopuses, wingspan, moth, frogs, dry land, humidity, fruit flies, genus, horse, racehorse, thoroughbred, farm, farms, frog, butterfly, class, flatworm, fish, bred, shark, tropical, subtropical, sphinx, dog, cat, mouse, lion, jaguar, sea, habitats, subclass, populations, fossil""".split(', '),
#                  """plant, family, species, vegetative subspecies, dipterocarpaceae, tillandsia, genus, endemic, orchid, daisy, flower, flowering, plants, legume, habitat, green alga, lettuce, kelp, gutweed, succulent, microphylia, ulmus, coffee, soil, tropical, forest, wood, leaf, cultivated, tree, trees, aster, algae, sedge, grows, evergreen, fruit, seed, seeds, herbs, herb, bulrush, subtropical, violet, floral, meliaceae, wild, grass""".split(', '),
#                  """album, country, singer, band, vocalist, member, guitarist, debut, studio, metal, records, produced, songs, live at, performance, indie, folk, musician, released, ep, music, rapper, official, cd, label, tracks, remastered, reissued, pop, release, recorded, producer, full-length, bonus, chart, reunited, grammy, billboard, featured, concert, singer-songwriter, songs""".split(', '),
#                  """film, directed, starring, drama, based, released, novel, comedy, american, stars, produced, international, cinema, festival, documentary, biographical, love, romance, comedy-romance, movie""".split(', '),
#                  """published, book, novel, journal, series, written, story, magazine, newspaper, daily, stories, peer-reviewed, fiction, covers, comic, volume, science, fantasy, edition, writer, law, created, research, established, history, weekly, issue, travel, academic, mystery, media, author, work, god""".split(', ')]

keywords_list = ["""company, headquarter, corporate, finance, ltd., airline, firm, commerce, manufacturer, factory, based in, based out of, founded, corporation, inc., foundation, newspaper, services, service, products, group, bank, game, business, management, independent, owned, software, operates, limited, technology, brand, offices, million, financial, public, commercial, publishing, media, companies, operated, systems, mobile, games""".split(', '),
                 """university, students, bachelor, degree, school, academy, college, high, located, public, city, academy, county, institute, national, research, business, grammar, government, technology, medicine, mascot, academic, board, program, co-educational, economics, junior, science, schools, faculty, education, private, engineering, catholic, community, christian, medical, elementary, law, higher, founded""".split(', '),
                 """dancer, writer, artist, actor, singer-songwriter, teacher, scholar, professor, composer, actress, pianist, novelist, singer, songwriter, born, English, American, Chinese, guitarist, author, drummer, vocalist, saxophonist, painter, Canadian, member of, film, screenwriter, prose, poet, career, jazz, folk, member, work, album, producer, released, award, career, painter, band, musician, rock, album, author, university, television, school, art, released, composer""".split(', '),
                 """rugby, player, football, professional, nfl, league, injury, 1st round, contract, footballer, wrestling, lineman, cricketer, born, retired, former, mlb, pitcher, fencer, driver, american, canadian, english, attended, season, champion, motorcycle, (born, played, world, playing, major, plays, national, team, won, baseball, hockey, cricket, career, club, rugby, league., australian, olympics, seasons, championship, team., medal, ice, cup, round, player.""".split(', '),
                 """born, author, commentator, director, professor, leader, governor, politician, minister, president, died, representatives, assembly, republic, democratic, house of representatives, house, election, member of parliament, party, speaker, president, lawyer, liberal, candidate, election, deputy, prime minister, official, served, district, elected, district, republican, county, general, university, mayor, government, legislative, secretary, serving, political, appointed, represented, national, politician., prime, lawyer, senate, former""".split(', '),
                 """ship, aircraft, boeing, navy, destroyer, diesel, rail, warship, transport, submarine, trike, aerobatic, motorcar, railway, monoplane, vessel, motorcycle, navigation, railway, cars, airliner, naval, whaleship, rail, automobile, flown, ferry, wing, tailplane, car, convoy, flown, naval, locomotive, vehicle, flagship, cruise ship, boat, convoy, automaker, battleships, built, hms, royal, world, service, designed, commissioned, company, produced, laid, launched, ss, operated, design, sponsored, class, uss, war""".split(', '),
                 """church, church, center, house, historic, dam, district, district, castle, hospital, institution, museum, victorian, farm, building, mall, restaurant, shopping mall, centre, supermarket, built in, located, national, register, places, county, listed, museum, street, city, hospital, hotel, centre, designed, school, home, park, brick, mall, art, tower, revival, complex, village, parish, castle, main, features, architect, cathedral, road, farm, structure, office, downtown, area, frame, constructed, style, hall, site""".split(', '),
                 """river, mountain, hill, hills, land, lake, km, m, forest, forests, creek, ocean, stream, strait, gulf, peak, elevation, glacier, volcanic, corona, tributary, mount, flows, border, watershed, located, county, range, peak, mountains, part, highest, crater, lies, national, elevation, sea, water, park, reservoir, high, situated, bay, river.""".split(', '),
                 """rural, population, census, central, province, state, village, district, eastern, western, kilometres, km, mi, county, south, north, south-west, south-east, north-west, north-east, town, border, regional, capital, municipality, block, located, families, administrative, federation, india, united states, turkey, iran, mi), families., approximately, history, region, town, area, romanized, gmina""".split(', '),
                 """habitat, species, forest, forests, beetles, subfamily, snail, endemic, family, extinct, shelled, squids, octopuses, wingspan, moth, frogs, dry land, humidity, fruit flies, genus, horse, racehorse, thoroughbred, farm, farms, frog, butterfly, class, flatworm, fish, bred, shark, tropical, subtropical, sphinx, dog, cat, mouse, lion, jaguar, sea, habitats, subclass, populations, fossil, found, gastropod, described, mollusk, marine, larvae, native, arctiidae, moist, bird, notiobia, feed""".split(', '),
                 """plant, family, species, vegetative subspecies, dipterocarpaceae, tillandsia, genus, endemic, orchid, daisy, flower, flowering, plants, legume, habitat, green alga, lettuce, kelp, gutweed, succulent, microphylia, ulmus, coffee, soil, tropical, forest, wood, leaf, cultivated, tree, trees, aster, algae, sedge, grows, evergreen, fruit, seed, seeds, herbs, herb, bulrush, subtropical, violet, floral, meliaceae, wild, grass, native, known, found, leaves, grows, plants, flowers, growing, forests""".split(', '),
                 """album, country, singer, band, vocalist, member, guitarist, debut, studio, metal, records, produced, songs, live at, performance, indie, folk, musician, released, ep, music, rapper, official, cd, label, tracks, remastered, reissued, pop, release, recorded, producer, full-length, bonus, chart, reunited, grammy, billboard, featured, concert, singer-songwriter, songs, live, recorded, rock, album), records., release, compilation, single, label, cover""".split(', '),
                 """film, directed, starring, drama, based, released, novel, comedy, american, stars, produced, international, cinema, festival, documentary, biographical, love, romance, comedy-romance, movie, director, roles., series, life, award, music, screenplay, romantic, thriller, shot, action, horror, (film), written, based, silent""".split(', '),
                 """published, book, novel, journal, series, written, story, magazine, newspaper, daily, stories, peer-reviewed, fiction, covers, comic, volume, science, fantasy, edition, writer, law, created, research, established, history, weekly, issue, travel, academic, mystery, media, author, work, god, manga, world, (novel), illustrated, award, news, fantasy, books, collection, life, publication, edition, based, established""".split(', ')]


labels_list = ['Company', 'EducationalInstitution', 'Artist', 'Athlete', 'OfficeHolder',
               'MeanOfTransportation','Building', 'NaturalPlace', 'Village',
               'Animal', 'Plant', 'Album', 'Film', 'WrittenWork']

label_keyword_dict = dict(zip(labels_list, keywords_list))
print(label_keyword_dict['NaturalPlace'])

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

['river', 'mountain', 'hill', 'hills', 'land', 'lake', 'km', 'm', 'forest', 'forests', 'creek', 'ocean', 'stream', 'strait', 'gulf', 'peak', 'elevation', 'glacier', 'volcanic', 'corona', 'tributary', 'mount', 'flows', 'border', 'watershed', 'located', 'county', 'range', 'peak', 'mountains', 'part', 'highest', 'crater', 'lies', 'national', 'elevation', 'sea', 'water', 'park', 'reservoir', 'high', 'situated', 'bay', 'river.']


In [57]:
# Running nlp.make_doc to speed things up
for label in label_keyword_dict:
  patterns = [nlp.make_doc(text) for text in label_keyword_dict[label]]
  matcher.add(label, patterns)

### Testing our spaCy matcher 

In [58]:
doc = nlp("monika bohge monika bohge (lüdenscheid 1947) is a german writer and actor.")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(rule_id, span.text)

WrittenWork writer
Artist writer
Artist actor


In [59]:
matches[0]

(10780820017166675169, 11, 12)

## Snorkel with Automated Labeling Functions

The goal of this section was to create a list containing n functions (n is decided by the user) for each label described by the user. If I can get it working, it would allow users to import text, add n labels of their choice (along with the related keywords) and n functions would be created for every label. This would, at the same time, greatly reduce the amount of work from the user (they only need to import a text dataset and give a list of keywords for the corresponding label) and would open up the labeling feature in a web app for all multi-class text dataset creation.

I spent a lot of time on this section, but, as I said, I wasn't able to make it work. I was able to get it mostly working, but I'm running into an issue where all the functions have the same name and snorkel labeling doesn't work if all the functions have the same name! I tried many things to change the names of the functions, but I wansn't able to change them internally when it's running in snorkel because `nlp_labeling_function()` wraps the function in a spaCy processor and I can't change the internal function name after the fact. If you have any ideas on how to resolve this, let me know!

Note: This section is commented out for the reasons above!

In [60]:
# # labeling function for all labels
# @nlp_labeling_function() # labeling function for using spaCy
# def lf_labeler(x, label_keyword_dict):
#     labels = []
#     doc = nlp(x) # tokenizing the input text
#     matches = PhraseMatcher(doc) # finding all the token words that match a specific label
#     for match_id in matches:
#         labels.append(nlp.vocab.strings[match_id]) # adds every label where there was a match
#     if labels: # runs if labels is not empty
#         return labels # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
#     else:
#         return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

In [61]:
# class all_labeling_functions(object):
#     def __init__(self, labels):
#         for label in labels:
#             setattr(self, label, functools.partial(self._myfunction, label))
    
#     @nlp_labeling_function() # labeling function for using spaCy
#     def lf_labeler(x, label_keyword_dict=label_keyword_dict):
#         labels = []
#         doc = nlp(x) # tokenizing the input text
#         matches = PhraseMatcher(doc) # finding all the token words that match a specific label
        
#         for match_id in matches:
#           labels.append(nlp.vocab.strings[match_id]) # adds every label where there was a match
        
#         if len(labels) != 0: # runs if labels is not empty
#           return labels # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
        
#         else:
#           return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

# lfs = []

# for i in range(len(labels_list)):
#   lf_labeler = all_labeling_functions(i)
#   # namespace['lf_labeler_%s'%i] = functools.partial(all_labeling_functions, i)

#   lfs.append(all_labeling_functions(labels_list))

# # myfunction = all_labeling_functions(labels_list)
# # print(myfunction)

In [62]:
# from types import FunctionType
# from copy import copy

# lf_list = []

# @nlp_labeling_function() # labeling function for using spaCy
# def lf_labeler(x):
#     labels = []
#     doc = nlp(x) # tokenizing the input text
#     matches = PhraseMatcher(doc) # finding all the token words that match a specific label
    
#     for match_id in matches:
#       labels.append(nlp.vocab.strings[match_id]) # adds every label where there was a match
    
#     if len(labels) != 0: # runs if labels is not empty
#       return labels # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    
#     else:
#       return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

#   # namespace = sys._getframe(0).f_globals
#   # namespace['lf_labeler_%s'%i] = functools.partial(all_labeling_functions, i)

# def copy_function(fn, name):

#     return FunctionType(name=name)

# for label in labels_list:
#     name = 'lf_labeler_' + str(label)
#     lf_list.append(copy_function(lf_labeler, name))

# lf_list

In [63]:
# from snorkel.labeling.lf.nlp import nlp_labeling_function
# from functools import partial
# import sys

# namespace = sys._getframe(0).f_globals

# def all_labeling_functions(label): # labeling function for all labels

#   @nlp_labeling_function() # labeling function for using spaCy
#   def lf_labeler(x):
#       labels = []
#       doc = nlp(x) # tokenizing the input text
#       matches = PhraseMatcher(doc) # finding all the token words that match a specific label
      
#       for match_id in matches:
#         labels.append(nlp.vocab.strings[match_id]) # adds every label where there was a match
      
#       if len(labels) != 0: # runs if labels is not empty
#         return labels # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
      
#       else:
#         return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

#     # namespace = sys._getframe(0).f_globals
#     # namespace['lf_labeler_%s'%i] = functools.partial(all_labeling_functions, i)

#   return lf_labeler

# lfs = []

# lf_labeler_list = {f'lf_labeler_{k}': partial(lf_labeler, i=k) for k in range(len(labels_list))}
# for i in range(len(labels_list)):
#   lfs.append(lf_labeler_list[f'lf_labeler_{i}'])


# # lf_labeler_list = {f'lf_labeler_{k}': partial(lf_labeler, i=k) for k in range(len(labels_list))}
# for label in labels_list:
#   lfs[f'lf_labeler_{label}'] = all_labeling_functions(label)

In [64]:
# from snorkel.labeling import LabelingFunction


# def keyword_lookup(x, keywords, label):
#     if any(word in x.text.lower() for word in keywords):
#         return label
#     return ABSTAIN


# def make_keyword_lf(keywords, label=SPAM):
#     return LabelingFunction(
#         name=f"keyword_{keywords[0]}",
#         f=keyword_lookup,
#         resources=dict(keywords=keywords, label=label),
#     )

## Snorkel with Manual Labeling Functions

Here we will created labeling functions manually by writing at least one lableling function for every label.

After trying both the spaCy PhraseMatcher approach and the keyword lookup approach, we decided to opt for the keyword lookup approche since it performed similarly to the spaCy PhraseMatcher approach, but was much faster. We want our labeling to be fast so that it can be used by users in as close to real-time as possible. 

### Matcher Approach

In [None]:
from snorkel.labeling.lf.nlp import nlp_labeling_function

@nlp_labeling_function()
def lf_company(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Company' for label_id in label_ids]):
        return Company # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

@nlp_labeling_function()
def lf_educational_institution(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'EducationalInstitution' for label_id in label_ids]):
        return EducationalInstitution # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

@nlp_labeling_function()
def lf_artist(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Artist' for label_id in label_ids]):
        return Artist # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

@nlp_labeling_function()
def lf_athlete(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Athlete' for label_id in label_ids]):
        return Athlete # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

@nlp_labeling_function()
def lf_office_holder(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'OfficeHolder' for label_id in label_ids]):
        return OfficeHolder # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match


@nlp_labeling_function()
def lf_mean_of_transportation(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'MeanOfTransportation' for label_id in label_ids]):
        return MeanOfTransportation # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match


@nlp_labeling_function()
def lf_building(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Building' for label_id in label_ids]):
        return Building # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match


@nlp_labeling_function()
def lf_natural_place(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'NaturalPlace' for label_id in label_ids]):
        return NaturalPlace # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match


@nlp_labeling_function()
def lf_village(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Village' for label_id in label_ids]):
        return Village # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match


@nlp_labeling_function()
def lf_animal(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Animal' for label_id in label_ids]):
        return Animal # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match


@nlp_labeling_function()
def lf_plant(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Plant' for label_id in label_ids]):
        return Plant # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match


@nlp_labeling_function()
def lf_album(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Album' for label_id in label_ids]):
        return Album # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

@nlp_labeling_function()
def lf_film(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'Film' for label_id in label_ids]):
        return Film # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

@nlp_labeling_function()
def lf_written_work(x):
    doc = nlp(str(x)) # tokenizing the input text
    matches = matcher(doc) # finding all the token words that match a specific label
    label_ids = [item[0] for item in matches]
    if any([nlp.vocab.strings[label_id] == 'WrittenWork' for label_id in label_ids]):
        return WrittenWork # this would return all the labels that were matches in the text document. Not sure if it's possible to return the list of labels like this.
    else:
        return ABSTAIN # abstains from using this text example in the dataset creation because there was no match

### Keyword Lookup Approach

In [None]:
Company_keywords = label_keyword_dict['Company']
EducationalInstitution_keywords = label_keyword_dict['EducationalInstitution']
Artist_keywords = label_keyword_dict['Artist']
Athlete_keywords = label_keyword_dict['Athlete']
OfficeHolder_keywords = label_keyword_dict['OfficeHolder']
MeanOfTransportation_keywords = label_keyword_dict['MeanOfTransportation']
Building_keywords = label_keyword_dict['Building']
NaturalPlace_keywords = label_keyword_dict['NaturalPlace']
Village_keywords = label_keyword_dict['Village']
Animal_keywords = label_keyword_dict['Animal']
Plant_keywords = label_keyword_dict['Plant']
Album_keywords = label_keyword_dict['Album']
Film_keywords = label_keyword_dict['Film']
WrittenWork_keywords = label_keyword_dict['WrittenWork']

company_keywords

['published',
 'book',
 'novel',
 'journal',
 'series',
 'written',
 'story',
 'magazine',
 'newspaper',
 'daily',
 'stories',
 'peer-reviewed',
 'fiction',
 'covers',
 'comic',
 'volume',
 'science',
 'fantasy',
 'edition',
 'writer',
 'law',
 'created',
 'research',
 'established',
 'history',
 'weekly',
 'issue',
 'travel',
 'academic',
 'mystery',
 'media',
 'author',
 'work',
 'god']

In [None]:
from snorkel.labeling.lf.nlp import nlp_labeling_function

def keyword_lookup(x, keywords, label):
    tokens = ''.join([token.text for token in x.doc])
    if any(word in tokens for word in keywords):
        return label
    return ABSTAIN


@nlp_labeling_function()
def lf_company(x):
  return keyword_lookup(x, keywords_list[0], labels_num_list[0])

@nlp_labeling_function()
def lf_educational_institution(x):
  return keyword_lookup(x, keywords_list[1], labels_num_list[1])

@nlp_labeling_function()
def lf_artist(x):
  return keyword_lookup(x, keywords_list[2], labels_num_list[2])

@nlp_labeling_function()
def lf_athlete(x):
  return keyword_lookup(x, keywords_list[3], labels_num_list[3])

@nlp_labeling_function()
def lf_office_holder(x):
  return keyword_lookup(x, keywords_list[4], labels_num_list[4])

@nlp_labeling_function()
def lf_mean_of_transportation(x):
  return keyword_lookup(x, keywords_list[5], labels_num_list[5])

@nlp_labeling_function()
def lf_building(x):
  return keyword_lookup(x, keywords_list[6], labels_num_list[6])

@nlp_labeling_function()
def lf_natural_place(x):
  return keyword_lookup(x, keywords_list[7], labels_num_list[7])

@nlp_labeling_function()
def lf_village(x):
  return keyword_lookup(x, keywords_list[8], labels_num_list[8])

@nlp_labeling_function()
def lf_animal(x):
  return keyword_lookup(x, keywords_list[9], labels_num_list[9])

@nlp_labeling_function()
def lf_plant(x):
  return keyword_lookup(x, keywords_list[10], labels_num_list[10])

@nlp_labeling_function()
def lf_album(x):
  return keyword_lookup(x, keywords_list[11], labels_num_list[11])

@nlp_labeling_function()
def lf_film(x):
  return keyword_lookup(x, keywords_list[12], labels_num_list[12])

@nlp_labeling_function()
def lf_written_work(x):
  return keyword_lookup(x, keywords_list[13], labels_num_list[13])


In [None]:
lfs = [
       lf_company,
       lf_educational_institution,
       lf_artist,
       lf_athlete,
       lf_office_holder,
       lf_mean_of_transportation,
       lf_building,
       lf_natural_place,
       lf_village,
       lf_animal,
       lf_plant,
       lf_album,
       lf_film,
       lf_written_work
]

## Applying Snorkel Labeling Functions

In [None]:
# company_dataset = dbpedia_train_df.iloc[0:1000]
# company_dataset

df_small_train_dbpedia = dbpedia_train_df.sample(1000, random_state=42).reset_index(drop=True)
df_small_test_dbpedia = dbpedia_test_df.sample(1000, random_state=42).reset_index(drop=True)
df_small_train_dbpedia

Unnamed: 0,labels,text
0,0,sterling piano company the sterling piano company was a piano manufacturer in derby connecticut. the company was founded in 1873 by charles a. sterling as the sterling organ company. sterling had purchased the birmingham organ company in 1871 and had $30000 to fund the company. the sterling organ company began making pianos in 1885.
1,5,nyc s-motor s-motor was the class designation given by the new york central to its alco-ge built s-1 s-2 s-2a and s-3 electric locomotives. the s-motors hold the distinction of being the world's first mass-produced main line electric locomotives with the prototype #6000 being constructed in 1904. the s-motors would serve alone until the more powerful t-motors began to arrive in 1913 eventually displacing them from main line passenger duties.
2,2,axel zwingenberger axel zwingenberger (born may 7 1955 hamburg germany) is a blues and boogie-woogie pianist and songwriter. he is considered one of the finest boogie-woogie music masters in the world.
3,9,sceptrophasma hispidulum sceptrophasma hispidulum commonly known as the andaman islands stick insect is a species of the stick insect family. it originates from the andaman islands and is commonly found in tropical forests there. they eat a variety of foliage though in captivity they commonly eat blackberry bramble hawthorn oak rose and lettuce. the species has the phasmid study group number psg183.
4,7,nucet river (chiojdeanca) the nucet river is a tributary of the chiojdeanca river in romania.
...,...,...
995,2,pat arrowsmith pat arrowsmith (born march 2 1930) is a british author and peace campaigner.arrowsmith was educated at cheltenham ladies college read history at the university of cambridge and then read social science at the university of liverpool and at ohio university as a us-uk fulbright scholar. she is a co-founder of the campaign for nuclear disarmament.she has served eleven prison sentences for her political activities.
996,10,cleistanthus cleistanthus is a plant genus of the family phyllanthaceae. the genus comprises 140 species found from africa to the pacific islands. cleistanthus collinus is known for being toxic and frequently used for homicidal or suicidal purposes.
997,13,kyunghyang shinmun the kyunghyang shinmun or kyonghyang sinmun is a major daily newspaper published in south korea. it is based in seoul. the name literally means urbi et orbi daily news.
998,10,cyathea atropurpurea cyathea atropurpurea is a species of tree fern native to the islands of luzon mindanao leyte and mindanao in the philippines where it grows in forest at above 1000 m. the erect trunk is slender and may be up to 3 m tall. fronds are bipinnate and 1-2 m long. characteristically of this species the final pair of pinnae are usually reduced and occur towards the base of the stipe. these along with the stipe bases are persistent and retained around the trunk long after withering.


In [None]:
from snorkel.labeling import PandasLFApplier
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_small_train_dbpedia)
L_test = applier.apply(df=df_small_test_dbpedia)

  from pandas import Panel











  0%|          | 0/1000 [00:00<?, ?it/s][A[A[A[A[A[A[A[A[A[A[A










 13%|█▎        | 127/1000 [00:00<00:00, 1262.55it/s][A[A[A[A[A[A[A[A[A[A[A










 24%|██▍       | 244/1000 [00:00<00:00, 1231.66it/s][A[A[A[A[A[A[A[A[A[A[A










 36%|███▌      | 357/1000 [00:00<00:00, 1198.25it/s][A[A[A[A[A[A[A[A[A[A[A










 47%|████▋     | 469/1000 [00:00<00:00, 1171.31it/s][A[A[A[A[A[A[A[A[A[A[A










 58%|█████▊    | 579/1000 [00:00<00:00, 1144.46it/s][A[A[A[A[A[A[A[A[A[A[A










 69%|██████▉   | 691/1000 [00:00<00:00, 1136.42it/s][A[A[A[A[A[A[A[A[A[A[A










 80%|████████  | 803/1000 [00:00<00:00, 1129.33it/s][A[A[A[A[A[A[A[A[A[A[A










100%|██████████| 1000/1000 [00:00<00:00, 1145.39it/s]











  0%|          | 0/1000 [00:00<?, ?it/s][A[A[A[A[A[A[A[A[A[A[A










 12%|█▏        | 122/1000 [00:00<00:00, 1208.05it/s][A[

In [None]:
L_train

array([[ 0,  1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., 11, 12, 13],
       [-1, -1,  2, ..., 11, 12, 13],
       ...,
       [ 0, -1, -1, ..., -1, 12, 13],
       [-1, -1, -1, ..., 11, -1, -1],
       [-1, -1,  2, ..., 11, 12, 13]])

In [None]:
type(L_train)

numpy.ndarray

In [None]:
import pickle

with open('L_train_1k_keyword_lookup.pkl','wb') as f:
  pickle.dump(L_train, f)
with open('L_train_1k_keyword_lookup.pkl','rb') as f:
  L_train = pickle.load(f)
  print(x.shape)

(1000, 14)


In [None]:
with open('L_test_1k_keyword_lookup.pkl','wb') as f:
  pickle.dump(L_test, f)
with open('L_test_1k_keyword_lookup.pkl','rb') as f:
  L_test = pickle.load(f)
  print(x.shape)

(1000, 14)


In [None]:
from snorkel.labeling import LFAnalysis
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_company,0,[0],0.286,0.286,0.286
lf_educational_institution,1,[1],0.472,0.472,0.472
lf_artist,2,[2],0.614,0.614,0.614
lf_athlete,3,[3],0.53,0.53,0.53
lf_office_holder,4,[4],0.499,0.499,0.499
lf_mean_of_transportation,5,[5],0.642,0.642,0.642
lf_building,6,[6],0.642,0.642,0.642
lf_natural_place,7,[7],0.992,0.987,0.987
lf_village,8,[8],0.71,0.71,0.71
lf_animal,9,[9],0.482,0.482,0.482


In [None]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=14, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

In [None]:
Y_test = df_small_test_dbpedia.labels.values
LFAnalysis(L_test, lfs).lf_summary(Y_test)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
lf_company,0,[0],0.262,0.262,0.262,59,203,0.225191
lf_educational_institution,1,[1],0.457,0.457,0.457,71,386,0.155361
lf_artist,2,[2],0.579,0.579,0.579,66,513,0.11399
lf_athlete,3,[3],0.483,0.483,0.483,69,414,0.142857
lf_office_holder,4,[4],0.468,0.468,0.468,68,400,0.145299
lf_mean_of_transportation,5,[5],0.678,0.678,0.678,75,603,0.110619
lf_building,6,[6],0.643,0.643,0.643,90,553,0.139969
lf_natural_place,7,[7],0.989,0.984,0.984,66,923,0.066734
lf_village,8,[8],0.741,0.741,0.741,69,672,0.093117
lf_animal,9,[9],0.491,0.491,0.491,70,421,0.142566


In [None]:
label_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Majority Vote Accuracy:':<25} {label_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority Vote Accuracy:   13.8%
Label Model Accuracy:     13.8%


## Filtering out unlabeled data points

As we saw earlier, some of the data points in our train set received no labels from any of our LFs. These data points convey no supervision signal and tend to hurt performance, so we filter them out before training using a built-in utility.

In [None]:
probs_train = label_model.predict_proba(L_train)
probs_train

array([[4.52733310e-01, 1.09449255e-01, 1.76981234e-02, ...,
        3.46087310e-02, 1.17073017e-01, 9.79146147e-02],
       [1.28138792e-01, 3.94428720e-04, 1.00144036e-01, ...,
        2.54975012e-01, 4.19615135e-03, 3.61581578e-02],
       [7.96311597e-05, 6.82134611e-04, 4.13196281e-01, ...,
        4.85173945e-02, 3.89070362e-02, 3.33404765e-03],
       ...,
       [3.72906810e-01, 1.20257606e-01, 8.46017360e-03, ...,
        2.23398397e-02, 1.26739285e-01, 3.08053323e-01],
       [1.12634277e-03, 7.68121763e-07, 9.87970092e-05, ...,
        4.62459227e-06, 9.52367601e-09, 2.59774852e-05],
       [1.78858109e-04, 2.17281014e-03, 4.57318604e-01, ...,
        2.46892531e-02, 7.90342664e-02, 2.46838993e-02]])

In [None]:
from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_small_train_dbpedia, y=probs_train, L=L_train
)

In [None]:
df_small_train_dbpedia['labels'] = label_model.predict(L_train)

In [None]:
df_small_train_dbpedia.to_csv('./df_small_train_dbpedia.csv')

In [None]:
!pip install transformers==4.5.1 --quiet
!pip install pytorch_lightning==1.2.10 --quiet
!pip install wandb --quiet

[K     |████████████████████████████████| 2.1MB 10.0MB/s 
[K     |████████████████████████████████| 901kB 59.0MB/s 
[K     |████████████████████████████████| 3.3MB 48.1MB/s 
[K     |████████████████████████████████| 849kB 11.2MB/s 
[K     |████████████████████████████████| 184kB 52.5MB/s 
[K     |████████████████████████████████| 276kB 43.5MB/s 
[K     |████████████████████████████████| 10.6MB 21.6MB/s 
[K     |████████████████████████████████| 829kB 72.4MB/s 
[K     |████████████████████████████████| 1.3MB 57.9MB/s 
[K     |████████████████████████████████| 296kB 62.0MB/s 
[K     |████████████████████████████████| 143kB 62.7MB/s 
[?25h  Building wheel for PyYAML (setup.py) ... [?25l[?25hdone
  Building wheel for future (setup.py) ... [?25l[?25hdone
[31mERROR: snorkel 0.9.7 has requirement tensorboard<2.0.0,>=1.14.0, but you'll have tensorboard 2.4.1 which is incompatible.[0m
[K     |████████████████████████████████| 1.8MB 11.7MB/s 
[K     |█████████████████████████

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, AdamW, get_linear_schedule_with_warmup, AutoModelForSequenceClassification, AutoTokenizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import pytorch_lightning as pl
from pytorch_lightning.metrics.functional.classification import auroc
from datasets import load_dataset

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt

from tqdm import tqdm
import wandb

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

RANDOM_SEED = 42
BASE_MODEL_NAME = 'bert-base-cased'

np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x7f48d36d2730>

In [None]:
wandb.login()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize


wandb: Paste an API key from your profile and hit enter: ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
%env WANDB_LOG_MODEL=true

env: WANDB_LOG_MODEL=true


In [None]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=570.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=213450.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=435797.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=29.0, style=ProgressStyle(description_w…




In [None]:
class dbpediaDataset(Dataset):

  def __init__(
      self,
      data: pd.DataFrame,
      tokenizer: tokenizer,
      text_max_token_len: int = 512
  ):

    self.tokenizer = tokenizer
    self.data = data
    self.text_max_token_len = text_max_token_len

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index: int):
    data_row = self.data.iloc[index]

    text = data_row['text']

    text_encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors="np",
    )

In [None]:
def encode(batch):
    return tokenizer(
        batch,
        add_special_tokens=True,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors="np",
    )

In [None]:
df_small_train_dbpedia_torch = df_small_train_dbpedia['text'].map(encode, batched=True, num_proc=10)
df_small_train_dbpedia_torch.set_format(type="torch", 
                           columns=["input_ids", "token_type_ids", 
                                    "attention_mask", "labels"])

TypeError: ignored

In [None]:
from snorkel.classification import DictDataLoader
from model import SceneGraphDataset, create_model

df_train["labels"] = label_model.predict(L_train)

if sample:
    TRAIN_DIR = "data/VRD/sg_dataset/samples"
else:
    TRAIN_DIR = "data/VRD/sg_dataset/sg_train_images"

dl_train = DictDataLoader(
    SceneGraphDataset("train_dataset", "train", TRAIN_DIR, df_train),
    batch_size=16,
    shuffle=True,
)

dl_valid = DictDataLoader(
    SceneGraphDataset("valid_dataset", "valid", TRAIN_DIR, df_valid),
    batch_size=16,
    shuffle=False,
)