A language-agnostic extractor of natural-language propositions.
mokujin
testdata
tools
.gitignore
LICENSE
README.md
createlfindex.py
createtriplesindex.py
findmetaphors.py
findmetaphors2.py
findpatterns.py
findsources.py
findtriples.py
gensourcematrix.py
lfsearch.py
lookupdict.py
manage.py
tests.py


MOKUJIN

A language-agnostic toolset for semantic triples extraction and processing.

Requirements

  • Python 2.7.X
  • LevelDB
  • LZ4 (optional)
  • Django 1.5 (optional)

Quick Start:

  1. Prepare the input data (a list of sentences in first-order-logic form generated by Metaphor):

     a. Create an LF file using one of Metaphor's pipelines.
     b. Split LF files that are too large (the recommended size is about 1 GB).

  2. Extract triples from LF sentences using findtriples.py:

    python findtriples.py < sentences.lf.txt > triples.csv

    The output will be the following:

    noun_verb_adv, <NONE>, быть-VB, можно-RB, <->, <->, 156
    noun_adj, поле-NN, ледяной-ADJ, <->, <->, <->, 73
    noun_verb_adv, <NONE>, быть-VB, надо-RB, <->, <->, 68
    noun_verb_adv, <NONE>, быть-VB, нельзя-RB, <->, <->, 65
    noun_adj, океан-NN, ледовитый-ADJ, <->, <->, <->, 47   
    ...
    
  3. Create triples index using createtriplesindex.py:

    python createtriplesindex.py -i triples.csv -o triples-index-dir

  4. Create a query file, query.json:

     {
         "query": [
             {
                 "label": "poverty",
                 "source": [
                     "source_word_1",
                     "source_word_2",
                     ...
                     "source_word_n"
                 ],
                 "target": [
                     "target_word_1",
                     "target_word_2",
                     ...
                     "target_word_n"
                 ]
             }
         ]
     }
    

    Note that the source or target fields may be empty, depending on your next step.

  5. Run findsources.py to find sources (requires a list of targets in the query file):

    python findsources.py -i triples-index-dir -o output-dir -q query.json
    
  6. Prepare a file with a list of sources (one per line):

    source_1
    source_2
    ...
    source_n
    
  7. Run findpatterns.py to find patterns:

    python findpatterns.py -i triples-index-dir -o output-dir -qf sources.txt
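
The query file from step 4 can also be generated programmatically, which avoids hand-written JSON mistakes such as trailing commas. The following is a minimal sketch, not part of the toolset: `build_query` and `write_query` are hypothetical helper names, and the only layout assumed is the one shown in step 4 (a top-level `query` list whose entries have `label`, `source` and `target`).

```python
import json


def build_query(label, sources=(), targets=()):
    """Build a query dict in the layout shown in step 4.

    Either list may be left empty, depending on the next step.
    """
    return {
        "query": [
            {
                "label": label,
                "source": list(sources),
                "target": list(targets),
            }
        ]
    }


def write_query(path, query):
    """Serialize the query to a JSON file.

    With the default ensure_ascii=True, non-ASCII words are written
    as \\uXXXX escapes, which is still valid JSON.
    """
    with open(path, "w") as fh:
        json.dump(query, fh, indent=4)
```

For example, `write_query("query.json", build_query("poverty", targets=["target_word_1", "target_word_2"]))` produces a file with an empty `source` list, which is the shape step 5 asks for.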
    

Relation Triples Extractor

Usage:

python mokujin.py [<input file in logical form>] [<output file>]

Features

  • The input format is sentences in first-order logic form produced by the Metaphor semantic pipelines.

  • Extracts the following relationships:

    Verbs

    1. subj_verb_dirobj([noun*],verb,[noun+]) ("John reads a book")
    2. subj_verb_indirobj([noun*],verb,[noun+]) ("John gives to Mary")
    3. subj_verb_instr([noun*],verb,[noun+]) ("Джон работает топором", i.e. "John works with an axe")
    4. subj_verb([noun+], verb) ("John runs") // only if there is no dirobj and indirobj
    5. subj_verb_prep_compl([noun*],verb,prep,[noun+]) ("John comes from London")
    6. subj_verb_verb_prep_noun([noun*],verb,verb,prep,[noun+]) ("John tries to go into the house")
    7. subj_verb_verb([noun+],verb,verb) ("John tries to go") // only if there is no prep attached to the second verb

    Nouns

    1. noun_be_prep_noun(noun,verb,prep,noun) ("intention to leave for money")
    2. noun_be(noun,verb) ("intention to leave") // only if there is no prep attached to verb
    3. noun_adj_prep_noun(noun,adjective,prep,noun) ("The book is good for me") // only if "for" has "good" (and not "is") as its arg
    4. noun_adj([noun+],adjective) ("The book is good") // only if there is no prep attached to adj as its arg
    5. noun_verb_adv_prep_noun(adverb,verb) ("John runs fast for me") // only if "for" has "fast" (and not "runs") as its arg
    6. noun_verb_adv([noun*],verb,adverb) ("John runs fast") // only if there is no prep attached to adv
    7. nn_prep([noun+],prep,noun) ("[city]&bike for John") // only if "for" has "bike" (and not some verb) as its arg
    8. nn(noun,noun) ("city bike") // only if there is no prep attached to the second noun
    9. nnn(noun,noun,noun) ("Tzar Ivan Grozny")
    10. noun_equal_prep_noun(noun,noun,prep,noun) ("John is a man of heart") // only if "of" has "man" (and not "is") as its arg.
    11. noun_equal_noun(noun,noun) ("John is a biker") // only if there is no prep attached to the second noun
    12. noun_prep_noun(noun,prep,noun) ("house in London")
    13. noun_prep_prep_noun(noun,prep,prep,noun) ("book out of the store")

    Other

    1. compl(anything,anything) ("близкий мне", i.e. "close to me")
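
To make the idea behind these relations concrete, here is a minimal Python 3 sketch of how a noun_adj pair could be found by matching shared argument variables. It is an illustration only and not the extractor's actual implementation. It assumes that the second argument of a `-nn` or `-adj` predicate names the entity being described, which is what the logical-form examples in this README suggest, and it ignores the "no prep attached to adj" condition. `Pred` and `noun_adj_pairs` are hypothetical names.

```python
from collections import namedtuple

# One predicate of a logical form: lemma, POS tag, argument variables.
Pred = namedtuple("Pred", "lemma pos args")


def noun_adj_pairs(preds):
    """Pair each adjective with every noun that shares its entity variable."""
    nouns_by_var = {}
    for p in preds:
        if p.pos == "nn":
            # Assumption: args[1] is the entity the noun refers to.
            nouns_by_var.setdefault(p.args[1], []).append(p)

    pairs = []
    for p in preds:
        if p.pos == "adj":
            for noun in nouns_by_var.get(p.args[1], []):
                pairs.append(
                    ("noun_adj", "%s-NN" % noun.lemma, "%s-ADJ" % p.lemma)
                )
    return pairs


# Three predicates taken from the first sentence of the input example.
sample = [
    Pred("официальный", "adj", ["e6", "x4"]),
    Pred("продажа", "nn", ["e7", "x4"]),
    Pred("билет", "nn", ["e8", "x5"]),
]
pairs = noun_adj_pairs(sample)
```

Here the adjective and the noun "продажа" both refer to `x4`, so they form one noun_adj pair, while "билет" (`x5`) has no adjective attached.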

Input/Output Examples:

Input (Logical Form):

% В четверг , 7 февраля 2013 года , стартовала официальная продажа билетов на Олимпийские игры в Сочи —
% ровно за год до начала соревнований .
id(1).
[1001]:в-in(e1,e5,x1) & [1002]:четверг-nn(e2,x1) & [1005]:февраль-nn(e3,x2) & [1007]:год-nn(e4,x3) & 
[1009]:стартовать-vb(e5,x4,u1,u2) & [1010]:официальный-adj(e6,x4) & [1011]:продажа-nn(e7,x4) &
[1012]:билет-nn(e8,x5) & [1013]:на-in(e9,x5,x6) & [1014]:олимпийский-adj(e10,x6) & [1015]:игра-nn(e11,x6) &
[1016]:в-in(e12,x6,x7) & [1017]:сочи-nn(e13,x7) & [1019]:ровно-rb(e14,e15) & [1020]:за-in(e15,e5,x8) &
[1021]:год-nn(e16,x8) & [1022]:до-in(e17,x9,x10) & [1023]:начало-nn(e18,x10) & [1024]:соревнование-nn(e19,x11) &
card(e20,u3,7) & card(e21,x3,2013) & of-in(e22,x2,x3) & of-in(e23,x4,x5) & typelt(e24,x5,s1) & typelt(e25,x6,s2) &
of-in(e26,x10,x11) & typelt(e27,x11,s3) & past(e28,e5)

% В первые же часы билеты на самые интересные широкому кругу болельщиков виды программы — хоккей , биатлон ,
% сноуборд — были раскуплены чуть менее чем полностью .
id(2).
[2001]:в-in(e1,x1,x2) & [2004]:часы-nn(e2,x2) & [2005]:билет-nn(e3,x1) & [2006]:на-in(e4,x1,x3) &
[2008]:интересный-adj(e5,x3) & [2009]:широкий-adj(e6,x3) & [2010]:круг-nn(e7,x3) & [2011]:болельщик-nn(e8,x4) &
[2012]:вид-nn(e9,x1) & [2013]:программа-nn(e10,x5) & [2015]:хоккей-nn(e11,x6) & [2017]:биатлон-nn(e12,x7) &
[2019]:сноуборд-nn(e13,x8) & [2022]:раскупить-vb(e14,u1,x8,u2) & [2023]:чуть-rb(e15,e16) & [2024]:менее-rb(e16,e14) &
[2025]:чем-cnj(e17,x9) & [2026]:полностью-rb(e18,e17) & card(e19,x2,1) & typelt(e20,x2,s1) & typelt(e21,x1,s2) &
of-in(e22,x3,x4) & typelt(e23,x4,s3) & typelt(e24,x1,s4) & of-in(e25,x1,x5) & past(e26,x8) & past(e27,e14)

% Что касается мужского хоккея , например , то недоступными оказались пропуска на все игры плей-офф — и это при том ,
% что даже сетка турнира составлена пока не целиком .
id(3).
[3002]:касаться-vb(e1,u1,x1,u2) & [3003]:мужской-adj(e2,x1) & [3004]:хоккей-nn(e3,x1) & [3006]:например-rb(e4,e5) &
[3008]:то-cnj(e5,x2) & [3009]:недоступный-adj(e6,x3) & [3010]:оказаться-vb(e7,x4,u3,u4) & [3011]:пропуск-nn(e8,x4) &
[3012]:на-in(e9,x4,x5) & [3014]:игра-nn(e10,x5) & [3015]:плей-офф-nn(e11,x6) & thing(e12,x7) &
[3019]:при-in(e13,x8,x7) & [3024]:сетка-nn(e14,x9) & [3025]:турнир-nn(e15,x10) & [3026]:составить-vb(e16,x9,u5,u6) &
[3027]:пока-cnj(e17,x11) & [3029]:целиком-rb(e18,e17) & of-in(e19,x5,x6) & of-in(e20,x9,x10) & not(e21,e18) &
past(e22,e7) & past(e23,e16)
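
The logical form above is a conjunction of predicates joined by `&`, each optionally prefixed with a token id in square brackets. As a rough Python 3 sketch of how such text could be read, and not the parser the toolset uses, the snippet below splits LF text into predicates. `parse_lf` is a hypothetical name, and the POS tag set is only the one visible in these examples, so it may be incomplete.

```python
import re

# Optional "[1001]:" token id, then a name such as "четверг-nn", then "(args)".
PRED_RE = re.compile(r"(?:\[(\d+)\]:)?([^\s&()\[\]:]+)\(([^()]*)\)")

# POS tags seen in the examples above (assumed, not exhaustive).
KNOWN_POS = {"nn", "vb", "adj", "rb", "in", "cnj"}


def parse_lf(text):
    """Return a list of predicate dicts found in a piece of LF text."""
    preds = []
    for token_id, name, arg_str in PRED_RE.findall(text):
        # Split on the last hyphen so lemmas like "плей-офф" stay whole.
        lemma, sep, pos = name.rpartition("-")
        if not sep or pos not in KNOWN_POS:
            # Predicates such as past(...) or card(...) carry no POS tag.
            lemma, pos = name, None
        preds.append({
            "id": int(token_id) if token_id else None,
            "lemma": lemma,
            "pos": pos,
            "args": [a.strip() for a in arg_str.split(",")],
        })
    return preds


line = ("[1010]:официальный-adj(e6,x4) & [1011]:продажа-nn(e7,x4) & "
        "of-in(e23,x4,x5) & past(e28,e5)")
preds = parse_lf(line)
```

Comment lines starting with `%` and the `id(n).` sentence markers are not predicates of the sentence itself, so a caller would skip those lines before passing text to `parse_lf`.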

Output (List of Triples in CSV format):

rel_type,arg1,arg2,arg3,arg4,arg5,arg6,freq
noun_adj,федерация-NN, российский-ADJ,<->,<->,<->,162267
subj_verb,речь-NN,идти-VB,<->,<->,<->,85846
subj_verb_dirobj,<NONE>,обратить-VB,внимание-NN,<->,<->,64583
noun_adj,житель-NN,местный-ADJ,<->,<->,<->,17450
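
For downstream processing, the rows above can be read with a few lines of Python 3. This is a sketch under stated assumptions rather than code from the toolset: it assumes that fields are separated by commas (optionally followed by spaces), that lemmas contain no commas, that `<->` marks an unused argument slot, and that the last field is the frequency. `parse_triple` and `read_triples` are hypothetical names.

```python
EMPTY_SLOT = "<->"  # placeholder for an unused argument position


def parse_triple(line):
    """Split one CSV row into (rel_type, args, freq)."""
    fields = [f.strip() for f in line.split(",")]
    rel_type, freq = fields[0], int(fields[-1])
    # Keep "<NONE>" as-is; drop only the empty-slot placeholders.
    args = [a for a in fields[1:-1] if a != EMPTY_SLOT]
    return rel_type, args, freq


def read_triples(lines):
    """Yield parsed triples, skipping blank lines and the header row."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("rel_type,"):
            continue
        yield parse_triple(line)


rows = [
    "rel_type,arg1,arg2,arg3,arg4,arg5,arg6,freq",
    "noun_adj,федерация-NN, российский-ADJ,<->,<->,<->,162267",
    "subj_verb_dirobj,<NONE>,обратить-VB,внимание-NN,<->,<->,64583",
]
triples = list(read_triples(rows))
```

To read a real file, pass an open file object instead of the list, for example `read_triples(open("triples.csv", encoding="utf-8"))`.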

Triples Indexer

Sources Finder

Patterns Finder