Skip to content

Implement the DataPro Algorithm from the paper "Learning the Common Structure of Data" by Kristina Lerman and Steven Minton

License

Notifications You must be signed in to change notification settings

NonKhuna/DataPro-Algorithm

Repository files navigation

DataPro algorithm

The task is learning the common structure of data. The DataPro algorithm is the algorithm that finds statistically significant patterns in a set of token sequences. This repository is Implemented from the paper "Learning the Common Structure of Data" by Kristina Lerman and Steven Minton.

Installation

pip install datapro-learning

Get started

How to use model with this lib:

from datapro import DataPro
import pandas as pd

# Read data file
df = pd.read_excel("street_road.xlsx")
df = df.dropna()

# Choose a column, type list or pandas series.
data_sample = df["หน่วยรับผิดชอบ"]


# Create datapro object.
datapro = DataPro(alpha=0.05, k_percentage=10)

# Train with data
datapro.fit(data_sample)

# show result
print(datapro.evaluate_score())

Customize

Load new type's tree.

This algorithm uses a type's tree to assign types to tokens and can be configured by using JSON file with a structure like the below.

NOTE: When there is a new significant token, it's will be a child of these nodes as a specific node.

  {
    "TOKEN": {
      "regex": ".*",
      "children": [
        "PUNCT",
        "ALPHANUM"
      ],
      "parent": ""
    },
    "PUNCT": {
      "regex": "^[\\.\\?!,:'()\"]$",
      "children": [],
      "parent": "TOKEN"
    },
    "ALPHANUM": {
      "regex": "[\\da-zA-Z]+",
      "children": [
        "ALPHA",
        "NUMBER"
      ],
      "parent": "TOKEN"
    },
    "ALPHA": {
      "regex": "^[a-zA-Z]+$",
      "children": [
        "CAPS",
        "LOWER",
        "ALLCAPS"
      ],
      "parent": "ALPHANUM"
    },
  }

By default, these are all general token on type's tree.

                  TOKEN
                /        \
           PUNCT        ALPHANUM  
                     /           \
                ALPHA            NUMBER  
             /    |    \         |        \
       ALLCAPS   CAPS   LOWER   DECIMAL   INT
                                           |
                                         DIGIT

Load new file

Using method load_tree_type

  datapro.load_tree_type("<name>.json")

Reference

https://www.aaai.org/Papers/AAAI/2000/AAAI00-093.pdf

About

Implement the DataPro Algorithm from the paper "Learning the Common Structure of Data" by Kristina Lerman and Steven Minton

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published