Skip to content

Plouc314/blitzbeaver

Repository files navigation

BlitzBeaver

BlitzBeaver is a Python package that allows for persons tracking accross historical records. It is desiged to work with noisy and incomplete data.

Getting Started

Prerequisites

  • Python 3.10 or higher
  • Rust edition 2021

Development Setup

  • Install the Python dependencies:

    pip3 install -r requirements.txt

Note

It is recommended to use a virtual environment for Python dependencies (venv).

  • To compile the Rust code, run:

    maturin develop

    Or in release mode:

    maturin develop -r
  • To install the blitzbeaver package locally:

    pip3 install -e /path/to/blitzbeaver

Records

The historical records given to BlitzBeaver are expected to be in a specific format. The library used to work with the data is polars, it is similar to pandas but has a better integration with Rust (as it is also written in Rust).

Frame
Represents the records at a single point in time (ex: all records from year 1805 if the records are yearly).

Record
Represents a single record (line) in the historical records.

Element
Represents a single value in a record (ex: the name of a person, the birth date, etc.).

import polars as pl

# example of an element
name = "Bob"

# example of a record
record = ["Bob", "Smith", "farmer"]

# example of a frame
frame = pl.DataFrame(
    {
        "name": ["Bob", "Alice"],
        "surname": ["Smith", "Johnson"],
        "occupation": ["farmer", "teacher"]
    }
)

Record Schema

Represents the schema of a record. It is used to define the structure of the records in the historical records. Each field in the schema corresponds to a column in the frame.

Elements can be of one of two types:

  • ElementType.String: a single string value (ex: "Bob")
  • ElementType.MultiStrings: a list of strings (ex: ["Bob", "Alice"])
import blitzbeaver as bb

record_schema = bb.RecordSchema(
    [
        bb.FieldSchema("address", bb.ElementType.String),
        bb.FieldSchema("firstname", bb.ElementType.String),
        bb.FieldSchema("lastname", bb.ElementType.String),
        bb.FieldSchema("origin", bb.ElementType.String),
        bb.FieldSchema("occupation", bb.ElementType.String),
        bb.FieldSchema("children", bb.ElementType.MultiStrings),
    ]
)

Tracking

The tracking process attempts to match records across different frames, the results of this process are a list of tracking chains. Each tracking chain represents a single entity (person), it is composed of a list of records that are believed to be the same entity at different points in time.

The results of the tracking process is actually not stored as a list of tracking chains, but rather as a graph (TrackingGraph) where each node is a record and each edge represents a link between two records.

Tracker

The tracker is the component responsible for tracking a single entity across the frames, it gradually builds a tracking chain.

The tracker has a memory, it is responsible for producing the most representative values from the records it has seen so far.

import blitzbeaver as bb

# reconstruct a tracking chain from the tracking graph
chain = graph.materialize_tracking_chain(tracker_id, dataframes, record_schema)

# display the tracking chain as a dataframe
chain.as_dataframe()

# outputs:
┌───────────┬─────────┬───────────┬──────────┬────────────┬────────────┬──────────────┐
│ frame_idxaddressfirstnamelastnameoriginoccupationchildren     │
╞═══════════╪═════════╪═══════════╪══════════╪════════════╪════════════╪══════════════╡
│ 0bourgclemontraffordanglaislampiste   ┆ ["francois"] │
│ 1bourglementpraffordnullnullnull         │
│ 2bourgclementtraffordanglaisnull       ┆ ["francois"] │
│ 3bourgclementpraffordanglaisrentier    ┆ ["francois"] │
│ 4bourgrementgraffordanglaisrentier    ┆ ["francois"] │
│ 5boulgclementraffordangleterrerentier    ┆ ["francois"] │
└───────────┴─────────┴───────────┴──────────┴────────────┴────────────┴──────────────┘

Configuration

The tracking process takes a configuration that defines all the parameters of the tracking process.

Here is an example of a configuration:

import blitzbeaver as bb

distance_metric_config = bb.DistanceMetricConfig(
    metric="lv_substring",
    caching_threshold=4,
    use_sigmoid=False,
    lv_substring_weight=0.5,
)
normal_memory_config = bb.MemoryConfig(
    memory_strategy="median",
)
multi_memory_config = bb.MemoryConfig(
    memory_strategy="mw-median",
    multiword_threshold_match=0.6,
    multiword_distance_metric=distance_metric_config,
)

config = bb.config(
    record_schema=record_schema,
    distance_metric_config=distance_metric_config,
    record_scorer_config=bb.RecordScorerConfig(
        record_scorer="average",
        weights=None,
        min_weight_ratio=None
    ),
    resolver_config=bb.ResolverConfig(
        resolving_strategy="best-match",
    ),
    memory_config=normal_memory_config,
    multistring_memory_config=multi_memory_config,
    interest_threshold=0.6,
    limit_no_match_streak=3,
    num_threads=10,
)

Execution

The tracking process is executed as follows:

import blitzbeaver as bb

tracking_graph = bb.execute_tracking(config, record_schema, dataframes)

Diagnostics

The tracking process also returns some diagnostics information (Diagnostics). These information provide insights on the tracking process, for example:

  • The state of the memory of each tracker for each frame.
  • The score of each record of interest for a tracker as well as the distances of each feature.

Beaver file

The tracking graph and diagnostics information can be saved and loaded to/from a .beaver file. The .beaver file is a binary file with a specific format.

import blitzbeaver as bb

path_graph = "./graph.beaver"

# load the graph from a .beaver file
graph = bb.read_beaver(path_graph)

# save the graph to a .beaver file
bb.save_beaver(path_graph, graph)

Normalization

Once computed, the tracking graph can be used to normalize values of the historical records. The idea being to use the link between multiple records of different frames of a tracking chain to correct errors and fill missing values.

# the same tracker as above, with normalized values
chain = graph.materialize_tracking_chain(tracker_id, dataframes, record_schema, normalized_dataframes)

# display the tracking chain as a dataframe
chain.as_dataframe(normalized=True)

# outputs:
┌───────────┬─────────┬───────────┬──────────┬─────────┬────────────┬──────────────┐
│ frame_idxaddressfirstnamelastnameoriginoccupationchildren     │
╞═══════════╪═════════╪═══════════╪══════════╪═════════╪════════════╪══════════════╡
│ 0bourgclementpraffordanglaisrentier    ┆ ["francois"] │
│ 1bourgclementpraffordanglaisrentier    ┆ ["francois"] │
│ 2bourgclementpraffordanglaisrentier    ┆ ["francois"] │
│ 3bourgclementpraffordanglaisrentier    ┆ ["francois"] │
│ 4bourgclementpraffordanglaisrentier    ┆ ["francois"] │
│ 5bourgclementpraffordanglaisrentier    ┆ ["francois"] │
└───────────┴─────────┴───────────┴──────────┴─────────┴────────────┴──────────────┘

Configuration

The normalization process takes a configuration:

import blitzbeaver as bb

# the distance metric configuration to use to compute the distances
# between values during clustering
distance_metric_config: bb.DistanceMetricConfig = ...

normalization_config = bb.NormalizationConfig(
    threshold_cluster_match=0.5,
    min_cluster_size=2,
    distance_metric=distance_metric_config,
)

Execution

The normalization process takes as argument the previously computed tracking graph, the historical records (dataframes), record schema and configuration.

It produces a list of dataframes: the normalized historical records.

import blitzbeaver as bb

normalized_dataframes = bb.execute_normalization(
    normalization_config,
    record_schema,
    graph,
    dataframes,
)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors