Skip to content

MarekFiltes/entity_sumarization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DBpedia entity sumarization

Ruby tool to entity sumarization. Sumarization results.

Tool description

Tool was developed as part of diploma thesis. Tool generate .json knowledge base. Knowledge base is based on outward resource links from Wikipedia abstracts, global link relations of entity classes and global literal relations of entity classes. Strict links are collected from dataset of NLP Interchange Format (NIF). In next step find all relations in which are used collected links for entities of given class (eg. City, Artist, ...). Linked based relations are expanded with entity class global relations that point to literal. Tool can reduce basic duplicates of predicates. At the end calculate total score for every predicate and generate knowledge base.

RubyDoc

browser_web_data_entity_sumarization-1.1.0


Basic usage

Requires installed JRuby on machine.

This project is mainly based on usage of dataset in NIF format. For example can be download from http://wiki.dbpedia.org/nif-abstract-datasets

Install

CMD:

gem install browser_web_data_entity_sumarization

How to initialize tool in project

Gemfile:

gem 'browser_web_data_entity_sumarization'

How to initialize tool in script

Script:

require 'browser_web_data_entity_sumarization'

Use in script

Initialize main instance of statistic

 nif_dataset_path = 'nif-text-links_en.ttl'
 resource_dir = 'statisticts/'
 statistic = BrowserWebData::EntitySumarization::Statistic.new(nif_dataset_path, resource_dir, true)

Start generate complete knowledge base for all entity class types

nif_dataset is required. Knowledge base is generate from link statistics and literal statistics

params = {
  entity_types: statistic.get_all_classes, # [City, Person, ... ]
  entity_count: 10, # count of best ranked resource for every entity type
  demand_reload: false, # resource is skipped if already has stored in result dir
  console_output: true, # allow to display some information from process
  identify_identical_predicates: false # try to find identical predicates and grouped them as one item
}

statistic.create_complete_knowledge_base(params)

Warning: Parameter identify_identical_predicates: true can increased process time identification is based on verifying every combination of predicates list. Method reduce list to max 250 predicates. All resource are completed by one iteration from NIF dataset file.

Start generate only global literal statistic

nif_dataset is no required

entity_class_type = 'http://www.w3.org/2002/07/owl#Thing'
resources_limit = 1000

statistic.generate_literal_statistics(entity_class_type, resources_limit)

and alternatively for all classes:

resources_limit = 1000

statistic.get_all_classes.each{|entity_class_type|
  statistic.generate_literal_statistics(entity_class_type, resources_limit)
}

Refresh resource statistic files

nif_dataset is no required. If links has already extracted from NIF dataset and need only reload predicates from DBpedia, can use:

best_ranked_resource_count = 100

statistic.refresh_statistics_in_files(statistic.get_all_classes, resources_limit)

(currently is no allowed to specify which files might be refreshed, only by define entity classes and count of resources to find by best rank.)

Recalculate knowledge base

nif_dataset is no required. In case you need to recalculate predicates count and update knowledge base, can use:

entity_class_type = 'http://www.w3.org/2002/07/owl#Thing'
identify_identical_predicates = false

statistic.generate_knowledge_base(entity_class_type, identify_identical_predicates)

Warning: Parameter identify_identical_predicates: true can increased process time identification is based on verifying every combination of predicates list. Method reduce list to max 250 predicates.

and alternatively for all classes:

identify_identical_predicates = false

statistic.get_all_classes.each{|entity_class_type|
  statistic.generate_knowledge_base(entity_class_type, identify_identical_predicates)
}

Results description

Results published in this project Sumarization results was generated from NIF dataset: nif-text-links_en 2016-04/ext/nif-abstracts/en/

Results contains:

  • knowledge_base.json
  • global_statistics.json
  • identical_predicates.json
  • statistics.zip

Global statistic

Statistics contains predicates with their total count. Predicates are assign to every entity class. This project requires only predicates that point to literals.

File data example:

{
  "Entity_type_1": {
       "predicate_1": 200,
       "predicate_2": 4321
  },
  "Entity_type_2": {
      "predicate_1": 3000,
      "predicate_3": 4321
   }
}

Link statistic

  • process_time - relative time to find links in nif dataset and time to find relations with that link.
  • resource_uri - resource
  • nif_data - contains hashes of found links, anchor for link, indexes of anchor, paragraph information and relations by entity class.
    • anchor - text from abstract
    • indexes - position of anchor in paragraphs
    • section - identifier of section where anchored text is included
    • weight - computed by relative position anchored text in abstract
    • link
    • strict_properties
      • contains properties grouped by entity class type
      • contains properties by strict relation between resource and link
      • contains also total count of occurrence in resources by entity class type
    • properties - contains properties
      • contains properties grouped by entity class type
      • contains properties by relation with ignore resource uri
      • contains also total count of occurrence in resources by entity class type

Strict predicates/properties

Base relation:
<resource> ?property <link>

Predicates/properties

Base relation:

?subject a <entity_class>
?subject ?property <link>

File data example:

{
  "process_time": {
    "nif_find": 3.64,
    "relations_find": 1.85
  },
  "resource_uri": "http://dbpedia.org/resource/Hamilton_(village),_New_York",
  "nif_data": [
    {
      "link": "http://dbpedia.org/resource/Administrative_divisions_of_New_York#Village",
      "anchor": "village",
      "indexes": [
        "29",
        "36"
      ],
      "section": "paragraph_0_207",
      "properties": {
        "Village": {

        }
      },
      "weight": 0.8599,
      "strict_properties": {
        "Village": {

        }
      }
    },
    {
      "link": "http://dbpedia.org/resource/Hamilton_(town),_New_York",
      "anchor": "Hamilton",
      "indexes": [
        "64",
        "72"
      ],
      "section": "paragraph_0_207",
      "properties": {
        "Village": {

        }
      },
      "weight": 0.6908,
      "strict_properties": {
        "Village": {

        }
      }
    },
    {
      "link": "http://dbpedia.org/resource/Madison_County,_New_York",
      "anchor": "Madison County, New York",
      "indexes": [
        "76",
        "100"
      ],
      "section": "paragraph_0_207",
      "properties": {
        "Village": {
          "http://dbpedia.org/property/placeOfBirth": 2643.0,
          "http://dbpedia.org/property/placeOfDeath": 555.0,
          "http://dbpedia.org/ontology/location": 3255.0,
          "http://dbpedia.org/ontology/region": 71.0,
          "http://dbpedia.org/ontology/territory": 17.0,
          "http://dbpedia.org/ontology/deathPlace": 1708.0,
          "http://dbpedia.org/ontology/birthPlace": 7463.0,
          "http://dbpedia.org/property/birthPlace": 4606.0,
          "http://dbpedia.org/property/location": 1354.0,
          "http://dbpedia.org/property/region": 19.0
        }
      },
      "weight": 0.6329,
      "strict_properties": {
        "Village": {
          "http://dbpedia.org/ontology/isPartOf": 498630.0,
          "http://dbpedia.org/property/subdivisionName": 533823.0
        }
      }
    },
    {
      "link": "http://dbpedia.org/resource/USA",
      "anchor": "USA",
      "indexes": [
        "102",
        "105"
      ],
      "section": "paragraph_0_207",
      "properties": {
        "Village": {

        }
      },
      "weight": 0.5072,
      "strict_properties": {
        "Village": {

        }
      }
    },
    {
      "link": "http://dbpedia.org/resource/Colgate_University",
      "anchor": "Colgate University",
      "indexes": [
        "129",
        "147"
      ],
      "section": "paragraph_0_207",
      "properties": {
        "Village": {
          "http://dbpedia.org/ontology/education": 4.0,
          "http://dbpedia.org/ontology/operator": 4.0,
          "http://dbpedia.org/property/education": 2.0,
          "http://dbpedia.org/property/operator": 1.0,
          "http://dbpedia.org/property/owner": 8.0,
          "http://dbpedia.org/property/youthclubs": 3.0
        }
      },
      "weight": 0.3768,
      "strict_properties": {
        "Village": {

        }
      }
    }
  ]
}

Knowledge base example

Every entity class types has assign sorted array of hash with grouped predicates with score value. Predicates in array are marked as identical. This can help to reduce duplicate information in summary.

{
  "Game": [
    {
      "score": 0.8912,
      "predicates": [
        "http://dbpedia.org/ontology/designer",
        "http://dbpedia.org/property/designer"
      ]
    },
    {
      "score": 0.8777,
      "predicates": [
        "http://dbpedia.org/ontology/publisher",
        "http://dbpedia.org/property/publisher"
      ]
    },
    {
      "score": 0.5291,
      "predicates": [
        "http://dbpedia.org/ontology/subtitle",
        "http://dbpedia.org/property/subtitle"
      ]
    },
    {
      "score": 0.4373,
      "predicates": [
        "http://dbpedia.org/ontology/genre",
        "http://dbpedia.org/property/genre"
      ]
    },
    {
      "score": 0.4237,
      "predicates": [
        "http://dbpedia.org/ontology/illustrator",
        "http://dbpedia.org/property/illustrator"
      ]
    },
    {
      "score": 0.1968,
      "predicates": [
        "http://dbpedia.org/property/date"
      ]
    },
    {
      "score": 0.1968,
      "predicates": [
        "http://dbpedia.org/ontology/publicationDate"
      ]
    },
    {
      "score": 0.1142,
      "predicates": [
        "http://dbpedia.org/ontology/product"
      ]
    },
    {
      "score": 0.0552,
      "predicates": [
        "http://dbpedia.org/property/products"
      ]
    },
    {
      "score": 0.0259,
      "predicates": [
        "http://dbpedia.org/ontology/series",
        "http://dbpedia.org/property/series"
      ]
    },
    {
      "score": 0.0228,
      "predicates": [
        "http://dbpedia.org/ontology/manufacturer",
        "http://dbpedia.org/property/manufacturer"
      ]
    },
    {
      "score": 0.0208,
      "predicates": [
        "http://dbpedia.org/property/shortDescription",
        "http://purl.org/dc/elements/1.1/description"
      ]
    },
    {
      "score": 0.016,
      "predicates": [
        "http://dbpedia.org/property/released"
      ]
    },
    {
      "score": 0.0148,
      "predicates": [
        "http://dbpedia.org/ontology/recordLabel",
        "http://dbpedia.org/property/label"
      ]
    },
    {
      "score": 0.0125,
      "predicates": [
        "http://dbpedia.org/ontology/mediaType",
        "http://dbpedia.org/property/mediaType"
      ]
    }
   ]
}

Example of results usage

http://browserwebdata.herokuapp.com

License

MIT License, Copyright (c) 2017 GitHub Inc.

License details

About

DBpedia entity sumarization tool.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages