# `skweak`: a quick demonstration

## Start: preparing the corpus

We have a small corpus of 200 news articles that we wish to annotate with two entity types: 
- companies
- other (non-commercial) organisations.

The first step is to extract the texts from the corpus:

In [3]:
import tarfile

# We retrieve the texts
texts = []
with tarfile.open("../data/reuters_small.tar.gz") as archive_file:
    for archive_member in archive_file.getnames():
        if archive_member.endswith(".txt"):
            # Read each text file from the archive and decode it as UTF-8
            text = archive_file.extractfile(archive_member).read().decode("utf8")
            texts.append(text)

We can now run spaCy on those texts to obtain `Doc` objects:

In [4]:
import spacy

# We run spaCy on the texts
nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
docs = list(nlp.pipe(texts))


<br>

## Step 1: Labelling functions

Labelling functions are at the core of `skweak`. Each one takes a `Doc` as input and returns a list of spans with their associated labels.

Heuristics are one simple type of labelling function. For instance, commercial companies can often be recognized by their legal suffix (such as Corp.):

In [5]:
import skweak

def company_detector_fun(doc):
    for chunk in doc.noun_chunks:
        if chunk[-1].lower_.rstrip(".") in {'corp', 'inc', 'ltd', 'llc', 'sa', 'ag'}:
            yield chunk.start, chunk.end, "COMPANY"

# We create the labelling function by giving it a name, and a function to apply
company_detector = skweak.heuristics.FunctionAnnotator("company_detector", company_detector_fun)

# We run the function on the full corpus
docs = list(company_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "company_detector")
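
To see what the test on the last token does in isolation, here is a minimal sketch in plain Python, without spaCy. The helper `has_legal_suffix` and the sample token lists are hypothetical and only serve as illustration; the suffix set is the one used above.

```python
# Hypothetical helper: mirrors the check applied to chunk[-1] above,
# but on plain lists of token strings instead of spaCy noun chunks.
LEGAL_SUFFIXES = {"corp", "inc", "ltd", "llc", "sa", "ag"}

def has_legal_suffix(tokens):
    """Return True if the last token is a legal suffix, ignoring case and trailing dots."""
    if not tokens:
        return False
    return tokens[-1].lower().rstrip(".") in LEGAL_SUFFIXES

# Made-up examples
print(has_legal_suffix(["Acme", "Corp."]))   # True
print(has_legal_suffix(["Acme", "Inc"]))     # True
print(has_legal_suffix(["the", "company"]))  # False
```

Lower-casing and stripping the final dot lets a single entry such as `corp` cover "Corp", "Corp." and "CORP.".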

<br>
For non-commercial organisations, we can also look for the occurrence of words that are quite typical of public organisations or NGOs: 

In [6]:
OTHER_ORG_CUE_WORDS = {"University", "Institute", "College", "Committee", "Party", "Agency",
                       "Union", "Association", "Organization", "Court", "Office", "National"}
def other_org_detector_fun(doc):
    for chunk in doc.noun_chunks:
        if any(tok.text in OTHER_ORG_CUE_WORDS for tok in chunk):
            yield chunk.start, chunk.end, "OTHER_ORG"

# We create the labelling function
other_org_detector = skweak.heuristics.FunctionAnnotator("other_org_detector", other_org_detector_fun)

# We run the function on the full corpus
docs = list(other_org_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "other_org_detector")
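
Unlike the company detector, this heuristic compares the raw token text, so the match is case-sensitive and exact. The plain-Python sketch below illustrates the consequence; the helper `contains_cue_word` and the token lists are hypothetical.

```python
OTHER_ORG_CUE_WORDS = {"University", "Institute", "College", "Committee", "Party", "Agency",
                       "Union", "Association", "Organization", "Court", "Office", "National"}

def contains_cue_word(tokens):
    """Return True if any token is exactly one of the cue words (case-sensitive)."""
    return any(tok in OTHER_ORG_CUE_WORDS for tok in tokens)

# Made-up examples
print(contains_cue_word(["the", "University", "of", "Oslo"]))  # True
print(contains_cue_word(["a", "local", "university"]))         # False (lower case)
print(contains_cue_word(["two", "Universities"]))              # False (plural form)
```

Requiring a capitalised cue word favours proper names over generic mentions, at the cost of missing inflected forms.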

<br>
In addition to heuristics, we can also exploit _gazetteers_, which search the texts for occurrences of known entries (often extracted from a knowledge base):

In [7]:

# We extract the entries (from Crunchbase)
tries = skweak.gazetteers.extract_json_data("../data/crunchbase_companies.json.gz")
gazetteer = skweak.gazetteers.GazetteerAnnotator("gazetteer", tries)
print("done building the gazetteer")

# We run the function on the full corpus
docs = list(gazetteer.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "gazetteer")

Extracting data from ../data/crunchbase_companies.json.gz
Populating trie for class COMPANY (number: 539174)
done building the gazetteer
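
The log above mentions that a trie is populated for the `COMPANY` class. As a rough illustration of the idea (not `skweak`'s actual implementation), the sketch below builds a token-level trie from a few made-up entries and returns the longest match starting at each position. The names `build_trie` and `find_matches` are hypothetical.

```python
END = object()  # sentinel key: a complete entry ends at this node

def build_trie(entries):
    """Build a nested-dict trie from entries given as tuples of tokens."""
    root = {}
    for entry in entries:
        node = root
        for token in entry:
            node = node.setdefault(token, {})
        node[END] = True
    return root

def find_matches(tokens, trie, label):
    """Yield (start, end, label) for the longest entry starting at each position."""
    i = 0
    while i < len(tokens):
        node, longest_end = trie, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j])
            if node is None:
                break
            if END in node:
                longest_end = j + 1
        if longest_end is not None:
            yield i, longest_end, label
            i = longest_end  # continue after the match, so spans never overlap
        else:
            i += 1

# Made-up entries and sentence
trie = build_trie([("Acme",), ("Acme", "Robotics"), ("Globex", "Corporation")])
tokens = "Shares of Acme Robotics rose while Globex Corporation fell".split()
print(list(find_matches(tokens, trie, "COMPANY")))
# [(2, 4, 'COMPANY'), (6, 8, 'COMPANY')]
```

Because entries that share a prefix share a path in the trie, each lookup walks at most as many tokens as the longest entry, however many entries the gazetteer contains.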


<br>
Finally, we can also take advantage of machine learning models trained on data from related domains. Here, we use a spaCy model to obtain the usual named entities:

In [8]:

# Run a pretrained spaCy NER model (en_core_web_sm)
ner = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
docs = list(ner.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "spacy")

<br> 

## Step 2: aggregation

Once the labelling functions have been applied, we must aggregate their results to obtain a single annotation for each document. `skweak` does this by estimating a generative model, which takes only a few lines of code:

In [9]:
# We define the aggregation model
model = skweak.aggregation.HMM("hmm", ["COMPANY", "OTHER_ORG"])

# We indicate that "ORG" is an underspecified value, which may
# represent either COMPANY or OTHER_ORG
model.add_underspecified_label("ORG", ["COMPANY", "OTHER_ORG"])

# And run the estimation
docs = model.fit_and_aggregate(docs)

Starting iteration 1
Finished E-step with 195 documents
Starting iteration 2
         1      -39106.4420             +nan
Finished E-step with 195 documents
Starting iteration 3
         2      -39020.9787         +85.4633
Finished E-step with 195 documents
Starting iteration 4
         3      -39007.4458         +13.5329
Finished E-step with 195 documents
         4      -39005.8610          +1.5848


In [10]:
# Note: if you are running Jupyter Notebook instead of Jupyter Lab, you need to
# set add_tooltip=False, as Jupyter Notebook does not support HTML tooltips
skweak.utils.display_entities(docs[28], "hmm", add_tooltip=True) 
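
To make the role of the underspecified `ORG` label more concrete, here is a sketch that replaces the HMM with a much simpler vote over the labels proposed for one span. This is a deliberately different and cruder technique than the generative model estimated above, and the function `vote` and its inputs are hypothetical.

```python
from collections import Counter

LABELS = ["COMPANY", "OTHER_ORG"]
# "ORG" does not say which of the two labels applies, only that one of them does
UNDERSPECIFIED = {"ORG": ["COMPANY", "OTHER_ORG"]}

def vote(observations):
    """Aggregate the labels proposed for one span by several labelling functions.

    A specific label counts as one full vote; an underspecified label spreads
    its vote evenly over the labels it may stand for. Labels outside both sets
    are ignored. Returns the winning label (ties go to the first entry of
    LABELS), or None if there is no usable vote.
    """
    scores = Counter()
    for label in observations:
        if label in LABELS:
            scores[label] += 1.0
        elif label in UNDERSPECIFIED:
            targets = UNDERSPECIFIED[label]
            for target in targets:
                scores[target] += 1.0 / len(targets)
    if not scores:
        return None
    return max(LABELS, key=lambda name: scores[name])

# Made-up observations
print(vote(["COMPANY", "ORG"]))    # COMPANY
print(vote(["OTHER_ORG", "ORG"]))  # OTHER_ORG
print(vote(["PERSON"]))            # None
```

In this sketch, an `ORG` prediction supports both candidate labels without deciding between them, so a more specific labelling function can tip the balance. The HMM differs in that its parameters are estimated from the corpus (the iterations logged above) rather than fixed in advance.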

<br>

## Step 3: Training the final model
    
Once we have finished labelling the corpus, we can then train any type of machine learning model on it!

In [11]:
for doc in docs:
    doc.ents = doc.spans["hmm"]
skweak.utils.docbin_writer(docs, "../data/reuters_small.spacy")

Write to ../data/reuters_small.spacy...done


In [13]:
!spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train ../data/reuters_small.spacy  --paths.dev ../data/reuters_small.spacy \
--initialize.vectors en_core_web_md --output ../data/reuters_small


ℹ Using CPU
[2021-04-29 12:16:56,425] [INFO] Set up nlp object from config
[2021-04-29 12:16:56,437] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-04-29 12:16:56,442] [INFO] Created vocabulary
[2021-04-29 12:16:59,300] [INFO] Added vectors: en_core_web_md
[2021-04-29 12:16:59,301] [INFO] Finished initializing nlp object
[2021-04-29 12:17:09,940] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     85.00    0.31    0.25    0.41    0.00
  1     200        258.61   5153.69   75.68   73.13   78.42    0.76
^C
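
The `ENTS_F` column in the log above is the F-score computed from `ENTS_P` (precision) and `ENTS_R` (recall). As a quick check, the sketch below recomputes it for the two logged rows; the helper `f_score` is hypothetical. Note that the command above passes the same file as `--paths.train` and `--paths.dev`, so these scores measure how well the model fits the aggregated labels rather than performance on held-out data.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the two rows of the training log (in percent)
print(round(f_score(0.25, 0.41), 2))    # 0.31
print(round(f_score(73.13, 78.42), 2))  # 75.68
```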


This is of course just a very short example. Please look at our Jupyter notebooks in the `domains` directory for more details.