# Elicit Basics

This example will walk you through the basics of how to:
- Create an "Extractor", a controller which handles the extraction.
- Add required labelling functions and schemas to the extractor.
- Run the extraction process.
- Launch the user interface to begin annotating the extractions.

We will use two of the existing pre-defined labelling functions as examples: Keyword Match (`KeywordMatchLF`) and Q&A → NLI Transformer (`NLILabellingFunction`).

We begin by importing the requirements.

## Importing Requirements

In [1]:
# Import Extractor class and launch UI function.
from elicit import Extractor, launch_ui
# Import the Keyword Match and NLI labelling functions.
from elicit.generic_labelling_functions import KeywordMatchLF, NLILabellingFunction
# Import Pathlib, for better path handling.
from pathlib import Path
# Import OS so we know where the notebook is!
import os

current_path = os.path.abspath('')

# get current directory
current_dir = Path(current_path)

docs = list((current_dir / "basic_example_docs").glob("*.txt"))

print("Current directory:", current_dir)
print("Documents:", [d.name for d in docs])

Current directory: /home/dev/Turing/elicit/examples
Documents: ['doc_1.txt', 'doc_2.txt']


## Creating Extractor

Let's first create an Extractor object, pointing it at the database file we want to create. The `model_path` directory is where trained models are saved.

In [2]:
# delete db if already exists (just for testing purposes)
#(current_dir / "test_db.sqlite").unlink(missing_ok=True)

extractor = Extractor(db_path=current_dir / "test_db.sqlite", model_path=current_dir / "models", device=0)

Connected to Extraction Database: /home/dev/Turing/elicit/examples/test_db.sqlite


## Registering Schemas

Next, we will add the required schemas. These are the configuration files the labelling functions will use to extract the data.

In this case, we register:
- A categories schema
- A keywords schema
- A questions schema

The categories schema is always required. It tells the system which categories each variable can take, or whether the variable is numerical/raw. See the documentation for more details.

The keywords schema is a dictionary that maps each variable to its categories, and each category to a user-defined list of keywords. The questions schema maps each variable to a list of questions, here one per category.

Each schema can be passed either as a Path to a YAML file or as a dictionary.
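
Because the keywords schema is keyed by the same variables and categories as the categories schema, a quick consistency check on the plain dictionaries can catch typos before they are registered. The sketch below is a minimal illustration only: `find_schema_mismatches` is a hypothetical helper, not part of Elicit, and it assumes the dictionary layout described above. Variables whose categories entry is not a list (for example numerical/raw variables) are skipped.

```python
def find_schema_mismatches(categories, keywords):
    """Return human-readable problems where `keywords` disagrees with `categories`.

    Hypothetical helper for illustration; not part of the Elicit API.
    """
    problems = []
    for variable, keyword_map in keywords.items():
        if variable not in categories:
            problems.append(f"unknown variable: {variable}")
            continue
        if not isinstance(categories[variable], list):
            # Assumption: non-list entries describe numerical/raw variables.
            continue
        allowed = set(categories[variable])
        # Categories that have keywords but are not declared.
        for category in keyword_map:
            if category not in allowed:
                problems.append(f"unknown category for {variable}: {category}")
        # Declared categories that have no keywords at all.
        for category in categories[variable]:
            if category not in keyword_map:
                problems.append(f"no keywords for {variable}: {category}")
    return problems


categories = {"cat_or_dog": ["cat", "dog"]}
keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}
print(find_schema_mismatches(categories, keywords))
```

An empty list means the two dictionaries agree on every variable and category.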

In [3]:

categories = {"cat_or_dog": ["cat", "dog"]}
keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}
questions = {"cat_or_dog": ["Is this a cat?", "Is this a dog?"]}

extractor.register_schema(schema=categories,
                          schema_name="categories")
extractor.register_schema(schema=keywords,
                          schema_name="keywords")
extractor.register_schema(schema=questions,
                          schema_name="questions")

Registered schema: categories
Registered schema: keywords
Registered schema: questions


## Registering Labelling Functions

Next, we register the labelling functions. Here we use the two pre-defined labelling functions imported earlier: Keyword Match and Q&A → NLI Transformer. In a future tutorial, we will create our own labelling functions.
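
To make the idea of a keyword-based labelling function concrete, here is a toy sketch in plain Python. It illustrates the general technique only and is not the implementation of `KeywordMatchLF`: `toy_keyword_label` is a hypothetical name, and the voting rule (the category with the most whole-word hits wins; ties and zero hits abstain) is an assumption.

```python
import re


def toy_keyword_label(text, category_keywords):
    """Return the category with the most whole-word keyword hits, or None to abstain.

    Toy illustration; not how Elicit's KeywordMatchLF works internally.
    """
    lowered = text.lower()
    counts = {}
    for category, words in category_keywords.items():
        # Count case-insensitive, whole-word occurrences of every keyword.
        counts[category] = sum(
            len(re.findall(rf"\b{re.escape(word.lower())}\b", lowered))
            for word in words
        )
    best = max(counts.values(), default=0)
    winners = [category for category, n in counts.items() if n == best]
    if best == 0 or len(winners) != 1:
        return None  # no evidence, or a tie between categories
    return winners[0]


toy_keywords = {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}
print(toy_keyword_label("It began to bark, then bark again.", toy_keywords))
```

Returning `None` when there is no clear winner leaves the decision to other labelling functions or to a human annotator.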

In [4]:
extractor.register_labelling_function(KeywordMatchLF)
extractor.register_labelling_function(NLILabellingFunction)

Registered labelling function: Keyword Match
Registered labelling function: Q&A → NLI Transformer


## Running the extractor

We can now run the extraction process by passing a list of Paths, one per document. Currently, PDF and TXT files are supported.
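
Since both file types are supported, the document list does not have to be limited to `*.txt`. The following is a small sketch using only `pathlib`; `collect_documents` is a hypothetical helper, not part of Elicit, and the temporary folder exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def collect_documents(folder, suffixes=(".txt", ".pdf")):
    """Return sorted file paths in `folder` whose suffix is in `suffixes` (case-insensitive)."""
    folder = Path(folder)
    wanted = {suffix.lower() for suffix in suffixes}
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


# Demonstration on a throwaway folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("doc_1.txt", "report.PDF", "notes.md"):
        (Path(tmp) / name).write_text("placeholder")
    print([path.name for path in collect_documents(tmp)])
```

In this notebook, `docs = collect_documents(current_dir / "basic_example_docs")` could stand in for the `glob("*.txt")` call.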

In [5]:
# extractor.run(docs)

## Running the user interface

Finally, we can launch the user interface to begin annotating the extractions, passing either a database path or the extractor object itself.

In [6]:
# launch_ui(extractor=extractor)

Once the database contains validations, `extractor.train()` trains each registered labelling function on them, and the extractor can then be run again.

In [7]:
extractor.train()

Loading Resources for LF: Keyword Match
Training LF: Keyword Match on variable: cat_or_dog
3 documents with validations for variable: cat_or_dog
Loading Resources for LF: Q&A → NLI Transformer
No fine tuned Q&A model found, loading generic model...


The model 'RobertaForQuestionAnsweringWithNegatives' is not supported for question-answering. Supported models are ['QDQBertForQuestionAnswering', 'FNetForQuestionAnswering', 'GPTJForQuestionAnswering', 'LayoutLMv2ForQuestionAnswering', 'RemBertForQuestionAnswering', 'CanineForQuestionAnswering', 'RoFormerForQuestionAnswering', 'BigBirdPegasusForQuestionAnswering', 'BigBirdForQuestionAnswering', 'ConvBertForQuestionAnswering', 'LEDForQuestionAnswering', 'DistilBertForQuestionAnswering', 'AlbertForQuestionAnswering', 'CamembertForQuestionAnswering', 'BartForQuestionAnswering', 'MBartForQuestionAnswering', 'LongformerForQuestionAnswering', 'XLMRobertaForQuestionAnswering', 'RobertaForQuestionAnswering', 'SqueezeBertForQuestionAnswering', 'BertForQuestionAnswering', 'XLNetForQuestionAnsweringSimple', 'FlaubertForQuestionAnsweringSimple', 'MegatronBertForQuestionAnswering', 'MobileBertForQuestionAnswering', 'XLMForQuestionAnsweringSimple', 'ElectraForQuestionAnswering', 'ReformerForQuestio

Training LF: Q&A → NLI Transformer on variable: cat_or_dog
3 documents with validations for variable: cat_or_dog


100%|██████████| 10/10 [00:03<00:00,  2.63it/s, loss=0.135]


Saving Trained Model to /home/dev/Turing/elicit/examples/models/qna_model


In [8]:
extractor.run(docs)

Running LF: Keyword Match
Loading Resources.


Extracting variable: cat_or_dog:  50%|█████     | 1/2 [00:01<00:01,  1.63s/it]


OperationalError: no such column: alert
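
The run above stops with `OperationalError: no such column: alert`, which is the message SQLite gives when a query names a column that the table does not have. One possible cause, stated here as an assumption rather than something the output confirms, is that `test_db.sqlite` was created with an older table layout; the commented-out `unlink` line in the Creating Extractor cell would let it be recreated from scratch. The sketch below uses only the standard `sqlite3` module to list the columns of every table so the layout can be inspected. `list_columns` is a hypothetical helper and makes no assumption about Elicit's table names.

```python
import os
import sqlite3
import tempfile


def list_columns(db_path):
    """Map each table name in the SQLite file to its list of column names."""
    connection = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            # Column 1 of each PRAGMA table_info row is the column name.
            table: [col[1] for col in connection.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        connection.close()


# Demonstration on a throwaway database file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sqlite")
    demo = sqlite3.connect(demo_path)
    demo.execute("CREATE TABLE demo (id INTEGER, note TEXT)")
    demo.commit()
    demo.close()
    print(list_columns(demo_path))
```

Calling `list_columns(current_dir / "test_db.sqlite")` in this notebook would show whether any table has an `alert` column. Note that `sqlite3.connect` creates an empty file if the path does not exist yet.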