This example will walk you through the basics of how to:
- Create an "Extractor", a controller which handles the extraction.
- Add required labelling functions and schemas to the extractor.
- Run the extraction process.
- Launch the user interface to begin annotating the extractions.

We will use the built-in Keyword Match labelling function (KeywordMatchLF) as an example.

Begin by importing the requirements:

In [5]:
# Import Extractor class and launch UI function.
from elicit import Extractor, launch_ui
# Import the Keyword Match Labelling Function.
from elicit.labelling_functions import KeywordMatchLF
# Import Path from pathlib, for better path handling.
from pathlib import Path

# Private helper from elicit (note the leading underscore); not called in the cells shown here.
from elicit.main import _kill_ui

# Get the current working directory.
current_dir = Path.cwd()

docs = list((current_dir / "basic_example_docs").glob("*.txt"))

print("Current directory:", current_dir)
print("Documents:", [d.name for d in docs])

Current directory: /home/dev/Turing/elicit/examples
Documents: ['doc_1.txt', 'doc_2.txt']


Let's first create an Extractor object, pointing it at the database file we want to create.
Then, we will add the required schemas. These are the configuration files the labelling functions use to extract the data.

In this case, we require:
- A categories schema
- A keywords schema

The categories schema is always required. It tells the system which categories each variable can take, or whether the variable is numerical/raw. See the documentation for more details.

The keywords schema is a dictionary that maps each variable to its categories, and each category to a user-defined list of keywords.

Each schema can be supplied either as a Path to a YAML file or as a dictionary.
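Because the keywords schema has to line up with the categories schema, it can help to check the two dictionaries against each other before registering them. The sketch below is plain Python and does not use elicit; `check_schemas` is a hypothetical helper introduced here for illustration, and it assumes every variable in the categories schema is categorical (a list of category names), which will not hold for numerical/raw variables.

```python
def check_schemas(categories, keywords):
    """Return a list of problems found; an empty list means the schemas agree.

    Hypothetical helper, not part of elicit. Assumes each value in
    `categories` is a list of category names.
    """
    problems = []
    for variable, per_category in keywords.items():
        if variable not in categories:
            problems.append(f"{variable!r} has keywords but no categories entry")
            continue
        declared = set(categories[variable])
        # Keyword lists given for categories that were never declared.
        for category in sorted(set(per_category) - declared):
            problems.append(f"{variable!r}: category {category!r} is not declared")
        # Declared categories that have no keyword list at all.
        for category in sorted(declared - set(per_category)):
            problems.append(f"{variable!r}: category {category!r} has no keywords")
    return problems


categories = {"cat_or_dog": ["cat", "dog"]}
keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}

print(check_schemas(categories, keywords))  # prints []
```

An empty list means every category with keywords is declared, and every declared category has at least a keyword entry.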

In [6]:
# Delete the database if it already exists (for testing purposes only).
(current_dir / "test_db.sqlite").unlink(missing_ok=True)

extractor = Extractor(db_path=current_dir / "test_db.sqlite")


categories = {"cat_or_dog": ["cat", "dog"]}
keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}

extractor.register_schema(schema=categories, schema_name="categories")
extractor.register_schema(schema=keywords, schema_name="keywords")


Connected to Extraction Database: /home/dev/Turing/elicit/examples/test_db.sqlite


Next, we register the labelling function. In this case, we use the pre-defined Keyword Match labelling function that we imported at the start. A future tutorial will cover creating your own labelling functions.
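To build some intuition for what a keyword-based labelling function does with the keywords schema, here is a standalone sketch. It is not elicit's implementation: `keyword_match` and `best_category` are hypothetical names, and the whole-word, case-insensitive matching and the "most hits wins" rule are assumptions made for illustration.

```python
import re


def keyword_match(text, keywords_for_variable):
    """Count whole-word, case-insensitive keyword hits for each category."""
    counts = {}
    for category, words in keywords_for_variable.items():
        if not words:
            # No keywords for this category, so it can never match.
            counts[category] = 0
            continue
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        counts[category] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def best_category(counts):
    """Return the category with the most hits, or None if nothing matched.

    On a tie, the category listed first in the schema wins.
    """
    if not counts or max(counts.values()) == 0:
        return None
    return max(counts, key=counts.get)


keywords = {"cat_or_dog": {"cat": ["meow", "hiss"], "dog": ["woof", "bark"]}}
text = "The animal began to bark, then let out a loud Woof."

counts = keyword_match(text, keywords["cat_or_dog"])
print(counts)                 # {'cat': 0, 'dog': 2}
print(best_category(counts))  # dog
```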

We can now run the extraction process, passing it the list of documents we want to extract from. Currently, PDF and TXT files are supported.

In [7]:
extractor.register_labelling_function(KeywordMatchLF)
extractor.run(docs)

Running LF: Keyword Match
Loading models and stuff...


Extracting variable: cat_or_dog: 100%|██████████| 2/2 [00:00<00:00,  2.86it/s]
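The document list in this example was built with a glob for .txt files only. Since PDF files are also supported, a folder that mixes formats could be gathered with a small helper like the one below. This is a sketch using only pathlib; `collect_documents` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix set simply mirrors the two formats named in this tutorial.

```python
import tempfile
from pathlib import Path

# Assumption: mirrors the formats this tutorial says are supported.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def collect_documents(folder):
    """Return the supported files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("doc_1.txt", "report.PDF", "notes.md"):
        (Path(tmp) / name).write_text("placeholder")
    print([p.name for p in collect_documents(tmp)])  # ['doc_1.txt', 'report.PDF']
```

The resulting list of Path objects has the same shape as the docs list passed to the run call above.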


Finally, we can launch the user interface to begin annotating the extractions, pointing at the database the extractions were just added to.

In [9]:
launch_ui(db_path=current_dir / "test_db.sqlite")

UI Killed
