Arguments

Marcio Lima edited this page Feb 2, 2021 · 9 revisions

The sample command line to run any method is:

SemOpinionS.py [-h] --method {DohareEtAl2018,DohareEtAl2018_TF,LiuEtAl2015,LiaoEtAl2018,machine_learning,machine_learning_clustering,score_optimization} [--corpus CORPUS] --alignment ALIGNMENT --alignment_format {giza,jamr} [--gold GOLD] [--openie OPENIE] [--tfidf TFIDF] [--training TRAINING] [--target TARGET] [--model MODEL] [--loss {perceptron,ramp}] [--sentlex SENTLEX] [--similarity {lcs,smatch,concept_coverage}] [--machine_learning {decision_tree,random_forest,svm,mlp}] [--levi] [--aspects ASPECTS] --output OUTPUT

Method argument

This argument selects the summarization method to run. Each method uses a specific set of arguments that must be provided on the command line. The methods implemented in SemOpinionS are:

  • DohareEtAl2018
  • DohareEtAl2018_TF
  • LiuEtAl2015
  • LiaoEtAl2018
  • machine_learning
  • machine_learning_clustering
  • score_optimization

Corpus argument

CORPUS is the path to an AMR file containing the sentences to be summarized. This file has the following format:

# ::id O-Apanhador-no-Campo-de-Centeio.Documento_117.2
# ::snt O livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador .
(e / e
      :op1 (i2 / idiota)
      :op2 (r / repetitivo)
      :op3 (c / cansativo)
      :op4 (i3 / irritante)
      :domain (e2 / e
            :op1 (l / livro)
            :op2 (a / autor
                  :ARG0-of (e3 / escrever-01
                        :ARG1 l))))

The file may contain other metadata (in lines starting with # ::metadata), but these two fields (id and snt) are mandatory.
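As an illustration of the format above, the following is a minimal sketch of how such a file could be read. It is not the tool's own reader: the function name `read_amr_corpus` is hypothetical, entries are assumed to be separated by blank lines, and each metadata line is assumed to carry a single `::key value` pair.

```python
import re

# Matches one metadata line such as "# ::id some-id" (one key per line assumed).
META_LINE = re.compile(r"#\s*::(\S+)\s*(.*)")


def read_amr_corpus(text):
    """Return a list of dicts with the 'id', 'snt' and 'graph' of each entry."""
    entries = []
    # Assumption: entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        meta, graph_lines = {}, []
        for line in block.splitlines():
            if line.lstrip().startswith("#"):
                match = META_LINE.match(line.strip())
                if match:
                    meta[match.group(1)] = match.group(2).strip()
            elif line.strip():
                graph_lines.append(line)
        if not meta and not graph_lines:
            continue
        # The id and snt fields are the two mandatory metadata entries.
        missing = {"id", "snt"} - meta.keys()
        if missing:
            raise ValueError(f"entry is missing metadata: {sorted(missing)}")
        entries.append(
            {"id": meta["id"], "snt": meta["snt"], "graph": "\n".join(graph_lines)}
        )
    return entries
```

Any additional metadata is parsed but not returned here; a real reader would likely keep it.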

Alignment arguments

ALIGNMENT is the path to an AMR alignment file for the original graphs in the CORPUS file. Two formats are supported, selected with the --alignment_format {giza, jamr} argument. The file may contain other sentences, but it must contain all sentences from the CORPUS file.

GIZA alignment format

This format is used mainly by aligners that map each concept to a single word (one to one). The format is as follows:

# o_0 livro_1 é_2 idiota_3 ,_4 repetitivo_5 ,_6 cansativo_7 e_8 irritante_9 ,_10 tanto_11 quanto_12 seu_13 narrador_14 ._15
(e / e~e.8 :op1 (i2 / idiota~e.3) :op2 (r / repetitivo~e.5) :op3 (c / cansativo~e.7) :op4 (i3 / irritante~e.9) :domain (e2 / e~e.8 :op1 (l / livro~e.1) :op2 (a / autor~e.12 :arg0-of (e3 / escrever-01~e.14 :arg1 l))))

Each sentence is represented by a line starting with # followed by its numbered tokens. The next line contains the linearised AMR graph, in which each aligned concept carries a ~e.n suffix, where n is the index of the token aligned to that node. The sentence tokens must match exactly the tokens of the same sentence in the file given by the CORPUS argument.
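To make the ~e.n convention concrete, here is a minimal sketch that reads one token line and one graph line and pairs each aligned concept with its token index. The function name `parse_giza_pair` is hypothetical, and the sketch assumes a single token index per concept, as in the example above.

```python
import re

# A concept is the text after "/" and before "~e.n", e.g. "/ livro~e.1".
ALIGNED_CONCEPT = re.compile(r"/\s*([^\s()~]+)~e\.(\d+)")


def parse_giza_pair(token_line, graph_line):
    """Return ({index: word}, [(concept, token_index), ...]) for one sentence."""
    tokens = {}
    for item in token_line.lstrip("#").split():
        # Split on the last underscore so that tokens such as "._15" work.
        word, _, index = item.rpartition("_")
        tokens[int(index)] = word
    pairs = [(concept, int(n)) for concept, n in ALIGNED_CONCEPT.findall(graph_line)]
    return tokens, pairs
```

Looking up `tokens[n]` for each pair gives the word aligned to the concept, which is one way to check that the token line agrees with the CORPUS sentence.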

JAMR alignment format

This format is used by aligners that map concepts to word spans (one to many). It looks like this:

# ::snt O livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador .
# ::tok O livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador .
# ::alignments 8-9|0 1-2|0.4.0 9-10|0.3 7-8|0.2 5-6|0.1 3-4|0.0 ::annotator Aligner v.03 ::date 2020-07-04T11:48:49.113
# ::node	0	e	8-9
# ::node	0.0	idiota	3-4
# ::node	0.1	repetitivo	5-6
# ::node	0.2	cansativo	7-8
# ::node	0.3	irritante	9-10
# ::node	0.4	e	
# ::node	0.4.0	livro	1-2
# ::node	0.4.1	autor	
# ::node	0.4.1.0	escrever-01	
# ::root	0	e
# ::edge	autor	ARG0-of	escrever-01	0.4.1	0.4.1.0	
# ::edge	e	domain	e	0	0.4	
# ::edge	e	op1	idiota	0	0.0	
# ::edge	e	op1	livro	0.4	0.4.0	
# ::edge	e	op2	autor	0.4	0.4.1	
# ::edge	e	op2	repetitivo	0	0.1	
# ::edge	e	op3	cansativo	0	0.2	
# ::edge	e	op4	irritante	0	0.3	
# ::edge	escrever-01	ARG1	livro	0.4.1.0	0.4.0	
(e / e :op1 (i2 / idiota) :op2 (r / repetitivo) :op3 (c / cansativo) :op4 (i3 / irritante) :domain (e2 / e :op1 (l / livro) :op2 (a / autor :ARG0-of (e3 / escrever-01 :ARG1 l))))

The snt metadata must match the snt from the CORPUS file. Only node alignments are taken into account; all edge alignments are ignored. Each node alignment is a line starting with # ::node followed by an ID, the node label, and the word span to which the node is aligned, all separated by tabs (\t).
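The node lines can be read with a few lines of code. The sketch below is only an illustration (the name `parse_jamr_nodes` is hypothetical); it keeps the # ::node lines, skips everything else, and returns None as the span of unaligned nodes. In the example above the spans read as half-open token ranges, so 8-9 covers token 8 only.

```python
def parse_jamr_nodes(block):
    """Return a list of (node_id, label, span) tuples from a JAMR block.

    span is a (start, end) tuple of ints, or None when the node is unaligned.
    """
    nodes = []
    for line in block.splitlines():
        if not line.startswith("# ::node"):
            continue  # edge, root and other metadata lines are ignored
        fields = line.split("\t")
        node_id, label = fields[1], fields[2]
        span = None
        if len(fields) > 3 and fields[3].strip():
            start, end = fields[3].strip().split("-")
            span = (int(start), int(end))
        nodes.append((node_id, label, span))
    return nodes
```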

Gold argument

GOLD is the path to a directory containing the gold summary texts, one or more files, to be used when a merged AMR graph and aligned BOW texts should be created from them. Each summary file must use the following format:

O livro é idiota, repetitivo, cansativo e irritante, tanto quanto seu narrador. <O-Apanhador-no-Campo-de-Centeio.Documento_117.2>

The file contains one or more lines, each holding the sentence text (optional) followed by the sentence ID between angle brackets (<id>). The ID is the most important part of the line and must match an ID from the CORPUS file. The AMR graphs for these IDs are retrieved from the CORPUS file (if they exist), and a single summary AMR graph is created by merging all the sentence AMRs.

These sentences must also be present in the ALIGNMENT file, so that BOW pseudotexts can be created for each summary.
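A minimal sketch of how one line of a gold summary file could be split into its text and ID is shown below. The name `parse_gold_line` is hypothetical, and the sketch assumes the ID is the last thing on the line.

```python
import re

# The sentence ID is assumed to be the final <...> group on the line.
ID_AT_END = re.compile(r"<([^<>]+)>\s*$")


def parse_gold_line(line):
    """Return (sentence_text, sentence_id), or None if the line has no ID."""
    match = ID_AT_END.search(line)
    if match is None:
        return None
    return line[: match.start()].strip(), match.group(1)
```

Lines that carry only an ID yield an empty sentence text, which matches the note that the sentence itself is optional.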

OpenIE argument

The --openie argument takes the output file of an Open Information Extraction (OpenIE) tool, which some of the implemented methods use. The file is a CSV, as follows:

"ID SENTENÇA";"SENTENÇA";"ID EXTRAÇÃO";"ARG1";"REL";"ARG2";"COERÊNCIA";"MINIMALIDADE";"MÓDULO SUJEITO";"MÓDULO RELAÇÃO"
"O-Apanhador-no-Campo-de-Centeio.Documento_117.2";"o livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador . ";"1.0";"o livro ";" é idiota";"repetitivo , cansativo e irritante , tanto quanto seu narrador ";;;1;1
;;"2.0";"o livro ";" é idiota";"tanto quanto seu narrador ";;;1;1

The file has 10 columns, all separated by semicolons (;), but only columns 1, 3, 4, 5 and 6 (ID SENTENÇA, ID EXTRAÇÃO, ARG1, REL and ARG2) are used. The first column (ID SENTENÇA) must match the IDs from the CORPUS file. A line without an ID is assigned the last ID seen earlier in the file. The column names do not need to match the ones shown here; only their positions matter. The first line of the file must always be the header row with the column names.
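The positional columns and the carried-forward ID can be sketched with the standard csv module. This is an illustration under stated assumptions, not the tool's own loader: the name `read_openie` is hypothetical, and stripping the surrounding whitespace of each field is a choice made here.

```python
import csv


def read_openie(handle):
    """Group extractions by sentence ID: {id: [(extraction_id, arg1, rel, arg2)]}."""
    reader = csv.reader(handle, delimiter=";", quotechar='"')
    next(reader, None)  # the first line is always the header row
    extractions = {}
    last_id = None
    for row in reader:
        if len(row) < 6:
            continue  # not enough columns to hold ARG1, REL and ARG2
        # An empty first column means "same sentence as the previous line".
        last_id = row[0].strip() or last_id
        if last_id is None:
            continue
        arg1, rel, arg2 = (row[i].strip() for i in (3, 4, 5))
        extractions.setdefault(last_id, []).append((row[2], arg1, rel, arg2))
    return extractions
```

Because columns are addressed by position, the header names are never inspected.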

TF-IDF argument

This argument points to a directory with multiple text files which are going to be used to calculate TF-IDF scores. The TF part is calculated using the sentences from the CORPUS file and the DF counts are obtained from all the files in the TFIDF directory, in which each file is considered a document.
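The split between the two sources can be sketched as follows. The exact weighting formula is not stated on this page, so the raw term counts, the smoothed logarithmic IDF, and the whitespace tokenization below are all assumptions, and the name `tfidf_scores` is hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(corpus_sentences, documents):
    """Score each corpus term: TF from the sentences, DF from the documents."""
    # TF: how often each token occurs across the CORPUS sentences.
    tf = Counter(tok for sent in corpus_sentences for tok in sent.lower().split())
    # DF: in how many documents (one per file) each token occurs.
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    n_docs = len(documents)
    # Assumed weighting: raw count times a smoothed log IDF.
    return {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in tf.items()
    }
```

With this weighting, a term that appears in fewer of the reference documents receives a higher score than an equally frequent term that appears in many of them.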

Training and Target arguments

Both of these arguments are used specifically by methods requiring some kind of supervised training. Namely, these methods are:

  • LiuEtAl2015
  • LiaoEtAl2018
  • machine_learning
  • machine_learning_clustering
  • score_optimization

Each of these arguments should be the path to a directory, and the two directories must contain parallel documents: a training file and its target file share the same name, so that each pair of files forms one train-target instance. The example directories below illustrate this:

D:.
├───target
│       1984_1.txt
│       1984_2.txt
│       1984_3.txt
│       1984_4.txt
│       1984_5.txt
│       Capitaes-da-Areia_1.txt
│       Capitaes-da-Areia_2.txt
│       Capitaes-da-Areia_3.txt
│       ...
│
└───training
        1984_1.txt
        1984_2.txt
        1984_3.txt
        1984_4.txt
        1984_5.txt
        Capitaes-da-Areia_1.txt
        Capitaes-da-Areia_2.txt
        Capitaes-da-Areia_3.txt
        ...

Each file follows the same structure as the CORPUS file. Training files contain all sentences to be summarized, while the corresponding target files contain the gold summary sentences.
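Pairing the two directories by file name can be sketched as below. The name `pair_parallel_files` is hypothetical, and raising an error on unmatched files is a choice made for this sketch; how the tool itself reacts to a missing counterpart is not documented here.

```python
from pathlib import Path


def pair_parallel_files(training_dir, target_dir):
    """Return (training_path, target_path) pairs for files with the same name."""
    training = {p.name: p for p in Path(training_dir).iterdir() if p.is_file()}
    target = {p.name: p for p in Path(target_dir).iterdir() if p.is_file()}
    # Files present in only one of the two directories cannot form a pair.
    unmatched = sorted(set(training) ^ set(target))
    if unmatched:
        raise ValueError(f"files without a counterpart: {unmatched}")
    return [(training[name], target[name]) for name in sorted(training)]
```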

Model argument

This argument holds the path to a pretrained model file. It cannot be used together with the training and target arguments. Each method requires a specific format:

  • LiuEtAl2015: CSV
  • LiaoEtAl2018: CSV
  • machine_learning: joblib
  • machine_learning_clustering: joblib
  • score_optimization: CSV

CSV model format

This is used by the methods focused on optimizing weights for score calculation. The CSV format should have two columns: the name of the feature and the corresponding optimized weight. An example can be seen as follows:

...
e_freq_0,1.0
e_freq_1,1.0
e_freq_2,1.0
e_freq_5,1.0
e_freq_10,1.0
e_fmst_pos_5,0.7642977396044841
e_fmst_pos_6,0.7642977396044841
e_fmst_pos_7,0.7642977396044841
e_fmst_pos_10,0.7642977396044841
e_fmst_pos_15,0.7418011102528389
e_avg_pos_5,1.0
e_avg_pos_6,1.0
e_avg_pos_7,1.0
e_avg_pos_10,1.0
e_avg_pos_15,0.9775033706483548
node1_n_freq_0,1.0
...
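A two-column file like this can be loaded into a feature-to-weight mapping with the standard csv module. The sketch below is only an illustration: the name `load_weights` is hypothetical, and skipping rows whose second column is not a number (for example a possible header row) is an assumption.

```python
import csv


def load_weights(handle):
    """Read 'feature,weight' rows into a {feature: float} dictionary."""
    weights = {}
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # blank or malformed line
        name, value = row
        try:
            weights[name] = float(value)
        except ValueError:
            continue  # assumed to be a header or other non-numeric row
    return weights
```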

Joblib model format

This format is used by all Machine Learning methods based on the scikit-learn library. Joblib is a binary file format for saving pretrained scikit-learn models. These files should be created by calling the dump function from the joblib library on the trained model.

Loss function argument

This argument is used exclusively by the LiuEtAl2015 and LiaoEtAl2018 methods for AdaGrad optimization. Two types are implemented:

  • Perceptron loss

    w_{t+1} = w_t + \Phi(G^*) - \Phi(G')

    G' = \mathrm{argmax}_G w_t \cdot \Phi(G)

  • Ramp loss

    w_{t+1} = w_t + \Phi(G') - \Phi(G'')

    G' = \mathrm{argmax}_G {w_t \cdot \Phi(G) - \mathrm{cost}(G, G^*)}

    G'' = \mathrm{argmax}_G {w_t \cdot \Phi(G) + \mathrm{cost}(G, G^*)}

    \mathrm{cost}(G, G^*) = [c_1, c_2, \dots, c_n], n = |V \cup E|

The argmax function represents the optimal graph obtained through the ILP method using the current weights. For more details about these functions, please refer to Liu et al. (2015).
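Both updates share the same shape: the weights move toward the features of one graph and away from the features of another. The sketch below shows only that additive step on sparse feature dictionaries. The name `update_weights` is hypothetical, and the ILP decoding that produces the graphs and the AdaGrad step-size scaling are deliberately left out.

```python
def update_weights(weights, phi_toward, phi_away):
    """Return w + phi_toward - phi_away over sparse {feature: value} dicts.

    Perceptron loss: phi_toward = Phi(G*), phi_away = Phi(G').
    Ramp loss:       phi_toward = Phi(G'), phi_away = Phi(G'').
    """
    features = set(weights) | set(phi_toward) | set(phi_away)
    return {
        f: weights.get(f, 0.0) + phi_toward.get(f, 0.0) - phi_away.get(f, 0.0)
        for f in features
    }
```

When the two graphs have identical features the weights are left unchanged, which is the expected behaviour once the decoder already returns the desired graph.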

Sentlex argument

SENTLEX is the path to a sentiment lexicon in the OpLexicon format. This is a CSV format with four columns: the word, its morphological category, its sentiment (-1, 0 or 1), and whether the annotation is manual (M) or automatic (A). Only columns 1 and 3 are used. The lexicon contains all inflections of each word, so no lemmatization or stemming is applied.

...
desatencioso,adj,-1,A
desatender,vb,-1,A
desatenta,adj,-1,M
desatentas,adj,-1,M
desatento,adj,-1,M
desatentos,adj,-1,M
desaterrar,vb,0,A
desatestar,vb,1,A
desatinada,adj,-1,A
desatinadas,adj,-1,A
desatinado,adj,-1,A
desatinados,adj,-1,A
desatinar,vb,1,A
desativar,vb,1,A
desatracar,vb,1,A
desatracar-se,vb,1,A
...
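Since only the word and its sentiment are used, the lexicon reduces to a word-to-polarity mapping. The following sketch (with the hypothetical name `load_sentlex`) reads columns 1 and 3 and looks words up by their surface form, without lemmatization.

```python
import csv


def load_sentlex(handle):
    """Read an OpLexicon-style CSV into a {word: polarity} dictionary."""
    lexicon = {}
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # blank or malformed line
        try:
            lexicon[row[0]] = int(row[2])  # column 1: word, column 3: sentiment
        except ValueError:
            continue  # sentiment column is not -1, 0 or 1
    return lexicon
```

A token that is absent from the lexicon can be treated as neutral with `lexicon.get(token, 0)`; whether the tool does the same is an assumption.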

Similarity argument

This argument is used specifically by the LiaoEtAl2018 and machine_learning_clustering methods for Spectral Clustering of sentences. The implemented similarity scores are:

  • lcs: Longest common subsequence; number of overlapping words between two sentences.
  • smatch: Smatch similarity score between AMR graphs.
  • concept_coverage: Number of matching AMR concepts between the two sentence graphs.

The default value is lcs.
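As an illustration of the lcs score, the sketch below computes the length of the longest common subsequence of two token lists with the usual dynamic-programming recurrence. The name `lcs_length` is hypothetical, and whether the tool normalizes this count (for example by sentence length) is not stated here.

```python
def lcs_length(a_tokens, b_tokens):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the tokens seen so far and b_tokens[:j].
    prev = [0] * (len(b_tokens) + 1)
    for a in a_tokens:
        cur = [0]
        for j, b in enumerate(b_tokens, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

Unlike a plain bag-of-words overlap, the subsequence respects word order, so two sentences with the same words in reverse order score low.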

Machine Learning argument

This argument is used specifically by the machine_learning and machine_learning_clustering methods. It determines which scikit-learn method is used. The implemented methods are:

  • decision_tree
  • random_forest
  • svm
  • mlp

The default value is decision_tree.

Levi argument

This is a flag argument used specifically by the machine_learning and machine_learning_clustering methods. If it is set, the edges of the AMR graphs are first turned into nodes, so that the ML classifier can classify them too.
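Turning edges into nodes is the standard Levi graph transformation: each labelled edge becomes a node carrying the relation name, connected to its source and target by unlabelled edges. The sketch below illustrates the idea on a plain triple representation. The name `to_levi` and the data layout are assumptions, not the tool's internal graph structure.

```python
def to_levi(nodes, edges):
    """Turn labelled edges into nodes.

    nodes: {node_id: label}
    edges: [(source_id, relation, target_id), ...]
    Returns (levi_nodes, levi_edges), where levi_edges are unlabelled pairs.
    """
    levi_nodes = dict(nodes)
    levi_edges = []
    for index, (source, relation, target) in enumerate(edges):
        # Tuple IDs cannot clash with the string variables of the AMR graph.
        edge_node = ("edge", index)
        levi_nodes[edge_node] = relation
        levi_edges.append((source, edge_node))
        levi_edges.append((edge_node, target))
    return levi_nodes, levi_edges
```

After the transformation, relations such as domain or ARG1 are ordinary nodes, so a node classifier can label them in the same way as concepts.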

Aspects argument

ASPECTS is the path to a JSON file containing all aspect annotation for the sentences in CORPUS, TRAINING or TARGET (when used). This argument is used specifically by the machine_learning and machine_learning_clustering methods.

The first level of keys holds the name (not the whole path) of each annotated file (the CORPUS file or the files in TRAINING or TARGET). The second level of keys holds the sentence IDs within that file; these IDs should match those of the original AMR file (CORPUS, TRAINING or TARGET). Each sentence ID maps to the list of aspects found in the sentence. An example follows:

{
    "Galaxy-SIII_1.txt": {
        "D0_S1": [
            "Galaxy SIII"
        ],
        "D0_S2": [
            "modelo"
        ],
        "D0_S3": [
            "aparelho",
            "bateria",
            "desingn"
        ],
        "D0_S4": [],
        ...
        "D9_S5": [
            "IPHONE 5"
        ]
    },
    "LG-Smart-TV_1.txt": {
        "D0_S1": [],
        "D0_S2": [],
        "D0_S3": [
            "Design",
            "imagem"
        ],
        "D0_S4": [],
        ...
        "D9_S8": [
            "TV"
        ]
    },
    ...
}

This is an optional argument, i.e. if it is not given, the method will run without any aspect feature, while including all other features.
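The two-level lookup can be sketched as follows. Both function names are hypothetical, and returning an empty list for unknown files or sentences is a choice made here to mirror the behaviour of sentences annotated with no aspects.

```python
import json
from pathlib import Path


def load_aspects(path):
    """Load the aspects JSON file into nested dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def aspects_for(aspects, file_path, sentence_id):
    """Return the aspect list for one sentence, or [] if none is recorded."""
    # The first-level key is the file name only, not the whole path.
    file_name = Path(file_path).name
    return aspects.get(file_name, {}).get(sentence_id, [])
```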

Output argument

This argument indicates the directory where all output files are saved (AMR graphs, BOW pseudotexts, training weights...).