# Elicit Basics

This example will walk you through the basics of how to:
- Create an "Extractor", a controller which handles the extraction.
- Add required labelling functions and schemas to the extractor.
- Run the extraction process.
- Launch the user interface to begin annotating the extractions.

We will be using the existing pre-defined labelling functions, such as the Keyword Match labelling function, as examples.

We begin by importing the requirements.

## Importing Requirements

In [1]:
# Import Extractor class and launch UI function.
from elicit import Extractor, launch_ui
# Import the pre-defined labelling functions.
from elicit.generic_labelling_functions import KeywordMatchLF, SimilarityLabellingFunction, NLILabellingFunction, SemanticSearchLF
# Import Pathlib, for better path handling.
from pathlib import Path
# Import OS so we know where the notebook is!
import os

current_path = os.path.abspath('')

# get current directory
current_dir = Path(current_path)

docs = list((current_dir / "basic_example_docs").glob("*.txt"))

print("Current directory:", current_dir)
print("Documents:", [d.name for d in docs])

Current directory: /home/dev/Turing/elicit/examples
Documents: ['doc_1.txt', 'doc_2.txt']


## Creating Extractor

Let's first create an Extractor object, pointing at the database file we want to create.

In [2]:
# delete db if already exists (just for testing purposes)
# (current_dir / "test_db.sqlite").unlink(missing_ok=True)

extractor = Extractor(db_path=current_dir / "test_db.sqlite", model_path=current_dir / "models", device=0)

Connected to Extraction Database: /home/dev/Turing/elicit/examples/test_db.sqlite


## Registering Schemas

Next, we will add the required schemas. These are the configuration files the labelling functions will use to extract the data.

In this case, we require:
- A categories schema
- A keywords schema
- A questions schema

The categories schema is always required. It tells the system which categories each variable has, or whether the variable is numerical/raw. See the documentation for more details.

The keywords schema is a dictionary mapping each category of a variable to a list of keywords. Each category has its own user-defined set of keywords.

Each schema can be either a Path to a YAML file or a dictionary.

In [3]:

categories = {"cat_or_dog": ["cat", "dog"]}
keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}
questions = {"cat_or_dog": ["Is this a cat?", "Is this a dog?"]}

extractor.register_schema(schema=categories,
                          schema_name="categories")
extractor.register_schema(schema=keywords,
                          schema_name="keywords")
extractor.register_schema(schema=questions,
                          schema_name="questions")

Registered schema: categories
Registered schema: keywords
Registered schema: questions
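
Because the keywords schema is keyed by the same variables and categories as the categories schema, it can be worth checking that the two agree before registering them. The following is a minimal sketch of such a check. The helper `check_keywords_against_categories` is hypothetical and not part of Elicit, and it assumes every variable is categorical (a list of category names), not numerical/raw.

```python
# Hypothetical helper, not part of Elicit: reports mismatches between the
# keywords schema and the categories schema.
# Assumption: every variable in `categories` maps to a list of category names.
def check_keywords_against_categories(categories, keywords):
    """Return a list of human-readable problems (empty if the schemas agree)."""
    problems = []
    for variable, per_category in keywords.items():
        if variable not in categories:
            problems.append(f"unknown variable: {variable}")
            continue
        declared = set(categories[variable])
        # Categories used in the keywords schema but never declared.
        for category in per_category:
            if category not in declared:
                problems.append(f"unknown category for {variable}: {category}")
        # Declared categories that have no keyword list at all.
        for category in sorted(declared - set(per_category)):
            problems.append(f"no keywords for {variable}: {category}")
    return problems


categories = {"cat_or_dog": ["cat", "dog"]}
keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}

problems = check_keywords_against_categories(categories, keywords)
print(problems)
```

An empty list means every keyword entry refers to a declared variable and category, and every declared category has keywords.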


## Registering Labelling Functions

Next, we register the labelling functions. In this case, we use the pre-defined labelling functions imported earlier, including the Keyword Match labelling function. In a future tutorial, we will create our own labelling functions.

In [4]:
extractor.register_labelling_function(NLILabellingFunction)
extractor.register_labelling_function(SimilarityLabellingFunction)
extractor.register_labelling_function(KeywordMatchLF)
extractor.register_labelling_function(SemanticSearchLF)

Registered labelling function: Q&A → NLI Transformer
Registered labelling function: Q&A → Similarity Transformer
Registered labelling function: Keyword Match
Registered labelling function: Semantic Search
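
To build some intuition for what a keyword-based labelling function does with the keywords schema, here is a minimal sketch of keyword voting. It is an illustration only, not Elicit's `KeywordMatchLF` implementation: the function `keyword_vote`, the whole-word matching rule, the tie handling, and the sample sentence are all assumptions.

```python
import re
from collections import Counter


# Illustrative sketch only; this is NOT how Elicit's KeywordMatchLF is implemented.
def keyword_vote(text, category_keywords):
    """Count whole-word keyword hits per category and pick the category with most hits.

    Returns (category, counts). The category is None when nothing matches
    or when the top two categories are tied.
    """
    counts = Counter()
    for category, words in category_keywords.items():
        for word in words:
            pattern = r"\b" + re.escape(word) + r"\b"
            counts[category] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    ranked = counts.most_common()
    if not ranked or ranked[0][1] == 0:
        return None, dict(counts)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, dict(counts)
    return ranked[0][0], dict(counts)


keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}
sample = "The animal began to bark, then let out a loud woof."  # made-up text

label, counts = keyword_vote(sample, keywords["cat_or_dog"])
print(label, counts)
```

Returning `None` on a tie or on zero hits keeps the sketch from guessing; a real labelling function may handle these cases differently.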


## Running the extractor

We can now run the extraction process by passing a list of Paths, one pointing to each document. Currently, PDF and TXT files are supported.

In [5]:
extractor.run(docs)

Running LF: Q&A → NLI Transformer
Loading Resources.
Fine tuned Sequence Classifier model found, loading...
Fine tuned Q&A model found, loading...


The model 'RobertaForQuestionAnsweringWithNegatives' is not supported for question-answering. Supported models are ['QDQBertForQuestionAnswering', 'FNetForQuestionAnswering', 'GPTJForQuestionAnswering', 'LayoutLMv2ForQuestionAnswering', 'RemBertForQuestionAnswering', 'CanineForQuestionAnswering', 'RoFormerForQuestionAnswering', 'BigBirdPegasusForQuestionAnswering', 'BigBirdForQuestionAnswering', 'ConvBertForQuestionAnswering', 'LEDForQuestionAnswering', 'DistilBertForQuestionAnswering', 'AlbertForQuestionAnswering', 'CamembertForQuestionAnswering', 'BartForQuestionAnswering', 'MBartForQuestionAnswering', 'LongformerForQuestionAnswering', 'XLMRobertaForQuestionAnswering', 'RobertaForQuestionAnswering', 'SqueezeBertForQuestionAnswering', 'BertForQuestionAnswering', 'XLNetForQuestionAnsweringSimple', 'FlaubertForQuestionAnsweringSimple', 'MegatronBertForQuestionAnswering', 'MobileBertForQuestionAnswering', 'XLMForQuestionAnsweringSimple', 'ElectraForQuestionAnswering', 'ReformerForQuestio

Running LF: Q&A → Similarity Transformer
Loading Resources.
Fine tuning similarity model found, loading...
Fine tuned Q&A model found, loading...


The model 'RobertaForQuestionAnsweringWithNegatives' is not supported for question-answering. Supported models are ['QDQBertForQuestionAnswering', 'FNetForQuestionAnswering', 'GPTJForQuestionAnswering', 'LayoutLMv2ForQuestionAnswering', 'RemBertForQuestionAnswering', 'CanineForQuestionAnswering', 'RoFormerForQuestionAnswering', 'BigBirdPegasusForQuestionAnswering', 'BigBirdForQuestionAnswering', 'ConvBertForQuestionAnswering', 'LEDForQuestionAnswering', 'DistilBertForQuestionAnswering', 'AlbertForQuestionAnswering', 'CamembertForQuestionAnswering', 'BartForQuestionAnswering', 'MBartForQuestionAnswering', 'LongformerForQuestionAnswering', 'XLMRobertaForQuestionAnswering', 'RobertaForQuestionAnswering', 'SqueezeBertForQuestionAnswering', 'BertForQuestionAnswering', 'XLNetForQuestionAnsweringSimple', 'FlaubertForQuestionAnsweringSimple', 'MegatronBertForQuestionAnswering', 'MobileBertForQuestionAnswering', 'XLMForQuestionAnsweringSimple', 'ElectraForQuestionAnswering', 'ReformerForQuestio

Running LF: Keyword Match
Loading Resources.


Extracting variable: cat_or_dog: 100%|██████████| 2/2 [00:01<00:00,  1.64it/s]


Running LF: Semantic Search
Loading Resources.
Fine tuning similarity model found, loading...


Extracting variable: cat_or_dog: 100%|██████████| 2/2 [00:00<00:00,  8.19it/s]
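
Since both PDF and TXT files are supported, the document list could be gathered for both file types at once. The following is a minimal sketch using only `pathlib`. The helper `collect_documents` is hypothetical and not part of Elicit, and the temporary folder and file names exist only for the demonstration.

```python
import tempfile
from pathlib import Path


# Hypothetical helper, not part of Elicit.
def collect_documents(folder, suffixes=(".txt", ".pdf")):
    """Return a sorted list of files in `folder` whose suffix is in `suffixes`."""
    folder = Path(folder)
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


# Demonstration on a throwaway folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("doc_1.txt", "doc_2.pdf", "notes.md"):
        (Path(tmp) / name).write_text("placeholder")
    found = [p.name for p in collect_documents(tmp)]
    print(found)
```

The returned list of Paths could then be passed to `extractor.run(...)` in place of `docs`.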


## Running the user interface

Finally, we can launch the user interface to begin annotating the extractions, either by pointing it to a database path or by simply passing in the extractor object.

In [6]:
launch_ui(extractor=extractor)

UI Killed


In [5]:
extractor.sort(method="weasul")
extractor.performance(performance_type="confidence")

Updating confidence scores.


  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

{'cat_or_dog': 0.0}

In [6]:
%debug

> /home/dev/Turing/elicit/database/db_utils.py(37)query_db()
     35     :return: List of results.
     36     """
---> 37     cur = db.execute(query, args)
     38     rv = cur.fetchall()
     39     cur.close()

> /home/dev/Turing/elicit/user_interface/server/sorting.py(13)set_value_confidence()
     11
     12 def set_value_confidence(db, variable_id:

Once some extractions have been validated, the labelling functions can be fine-tuned with `extractor.train()`.

In [5]:
extractor.train()

Loading Resources for LF: Keyword Match
Training LF: Keyword Match on variable: cat_or_dog
4 documents with validations for variable: cat_or_dog
Loading Resources for LF: Q&A → Similarity Transformer
No fine tuning similarity model found, loading default.
Fine tuned Q&A model found, loading...


The model 'RobertaForQuestionAnsweringWithNegatives' is not supported for question-answering. Supported models are ['QDQBertForQuestionAnswering', 'FNetForQuestionAnswering', 'GPTJForQuestionAnswering', 'LayoutLMv2ForQuestionAnswering', 'RemBertForQuestionAnswering', 'CanineForQuestionAnswering', 'RoFormerForQuestionAnswering', 'BigBirdPegasusForQuestionAnswering', 'BigBirdForQuestionAnswering', 'ConvBertForQuestionAnswering', 'LEDForQuestionAnswering', 'DistilBertForQuestionAnswering', 'AlbertForQuestionAnswering', 'CamembertForQuestionAnswering', 'BartForQuestionAnswering', 'MBartForQuestionAnswering', 'LongformerForQuestionAnswering', 'XLMRobertaForQuestionAnswering', 'RobertaForQuestionAnswering', 'SqueezeBertForQuestionAnswering', 'BertForQuestionAnswering', 'XLNetForQuestionAnsweringSimple', 'FlaubertForQuestionAnsweringSimple', 'MegatronBertForQuestionAnswering', 'MobileBertForQuestionAnswering', 'XLMForQuestionAnsweringSimple', 'ElectraForQuestionAnswering', 'ReformerForQuestio

Training LF: Q&A → Similarity Transformer on variable: cat_or_dog
4 documents with validations for variable: cat_or_dog


Iteration: 100%|██████████| 2/2 [00:00<00:00, 11.18it/s]
Epoch: 100%|██████████| 1/1 [00:00<00:00,  5.44it/s]


In [6]:
extractor.run(docs)

Running LF: Keyword Match
Loading Resources.


Extracting variable: cat_or_dog: 100%|██████████| 2/2 [00:00<00:00,  2.39it/s]


Running LF: Q&A → Similarity Transformer
Loading Resources.
Fine tuning similarity model found, loading...
Fine tuned Q&A model found, loading...


The model 'RobertaForQuestionAnsweringWithNegatives' is not supported for question-answering. Supported models are ['QDQBertForQuestionAnswering', 'FNetForQuestionAnswering', 'GPTJForQuestionAnswering', 'LayoutLMv2ForQuestionAnswering', 'RemBertForQuestionAnswering', 'CanineForQuestionAnswering', 'RoFormerForQuestionAnswering', 'BigBirdPegasusForQuestionAnswering', 'BigBirdForQuestionAnswering', 'ConvBertForQuestionAnswering', 'LEDForQuestionAnswering', 'DistilBertForQuestionAnswering', 'AlbertForQuestionAnswering', 'CamembertForQuestionAnswering', 'BartForQuestionAnswering', 'MBartForQuestionAnswering', 'LongformerForQuestionAnswering', 'XLMRobertaForQuestionAnswering', 'RobertaForQuestionAnswering', 'SqueezeBertForQuestionAnswering', 'BertForQuestionAnswering', 'XLNetForQuestionAnsweringSimple', 'FlaubertForQuestionAnsweringSimple', 'MegatronBertForQuestionAnswering', 'MobileBertForQuestionAnswering', 'XLMForQuestionAnsweringSimple', 'ElectraForQuestionAnswering', 'ReformerForQuestio