first draft of the frog manual in rst format added. (yes, it still contains errors)
Showing 8 changed files with 1,559 additions and 10 deletions.
@@ -0,0 +1,216 @@ | ||
.. _credits: | ||
|
||
|
||
|
||
Credits and references | ||
====================== | ||
|
||
Once upon a time | ||
----------------- | ||
|
||
The development of Frog's modules started in the nineties at the ILK
Research Group (Tilburg University, the Netherlands) and the CLiPS
Research Centre (University of Antwerp, Belgium). Most modules rely on
TiMBL, the Tilburg memory-based learning software package
:raw-latex:`\cite{timbl}`, or on MBT, the memory-based tagger-generator
:raw-latex:`\cite{mbt}`. These modules were integrated into an NLP
pipeline that was first named MB-TALPA and later Tadpole
:raw-latex:`\cite{Tadpole}`. Over the years, the modules were refined
and retrained on larger data sets; the latest versions of each module
are discussed in this manual. We thank all programmers who worked on
Frog and its predecessors in the Credits section below.
|
||
The CLiPS Research Centre also developed an English counterpart of Frog,
a Python module called MBSP (MBSP website:
http://www.clips.ua.ac.be/pages/MBSP).
|
||
|
||
Credits | ||
-------- | ||
|
||
|
||
If you use Frog for your own work, please cite this reference manual:

    Frog, A Natural Language Processing Suite for Dutch, Reference
    guide, Iris Hendrickx, Antal van den Bosch, Maarten van Gompel and Ko
    van der Sloot, Language and Speech Technology Technical Report
    Series 16-02, Radboud University Nijmegen, Draft 0.13.1 - June 2016
|
||
The following paper describes Tadpole, the predecessor of Frog. Tadpole
contains a subset of the components described in this manual:
|
||
Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. | ||
(2007). An efficient memory-based morphosyntactic tagger and parser | ||
for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. | ||
Vandeghinste (Eds.), Selected Papers of the 17th Computational | ||
Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114 | ||
|
||
We would like to thank everybody who worked on Frog and its | ||
predecessors. Frog, formerly known as Tadpole and before that as | ||
MB-TALPA, was coded by Bertjan Busser, Ko van der Sloot, Maarten van | ||
Gompel, and Peter Berck, subsuming code by Sander Canisius (constraint | ||
satisfaction inference-based dependency parser), Antal van den Bosch | ||
(MBMA, MBLEM, tagger-lemmatizer integration), Jakub Zavrel (MBT), and | ||
Maarten van Gompel (Ucto). In the context of the CLARIN-NL | ||
infrastructure project TTNWW, Frederik Vaassen (CLiPS, Antwerp) created | ||
the base phrase chunking module, and Bart Desmet (LT3, Ghent) provided | ||
the data for the named-entity module. | ||
|
||
Maarten van Gompel designed the FoLiA XML output format that Frog | ||
produces, and also wrote a Frog binding for Python [17]_, as well as a | ||
separate Frog client in Python [18]_. Wouter van Atteveldt wrote a Frog | ||
client in R [19]_, and Machiel Molenaar wrote a Frog client for | ||
Go [20]_. | ||
|
||
The development of Frog relies on earlier work and ideas from Ko van der | ||
Sloot (lead programmer of MBT and TiMBL and the TiMBL API), Walter | ||
Daelemans, Jakub Zavrel, Peter Berck, Gert Durieux, and Ton Weijters. | ||
|
||
The development and improvement of Frog also rely on your bug reports,
suggestions, and comments. Use the GitHub issue tracker at
https://github.com/LanguageMachines/frog/issues/ or mail
`lamasoftware@science.ru.nl <mailto:lamasoftware@science.ru.nl>`__.
|
||
Alpino syntactic dependency labels | ||
================================== | ||
|
||
This table is taken from the Alpino annotation reference manual
:raw-latex:`\cite{lassy2011}`:
|
||
+--------------------+--------------------------------------------------------------+
| Dependency label   | Description                                                  |
+====================+==============================================================+
| APP                | apposition                                                   |
+--------------------+--------------------------------------------------------------+
| BODY               | body (with complementizer)                                   |
+--------------------+--------------------------------------------------------------+
| CMP                | complementizer                                               |
+--------------------+--------------------------------------------------------------+
| CNJ                | conjunct (member of a coordination)                          |
+--------------------+--------------------------------------------------------------+
| CRD                | coordinator (as head of a conjunction)                       |
+--------------------+--------------------------------------------------------------+
| DET                | determiner                                                   |
+--------------------+--------------------------------------------------------------+
| DLINK              | discourse link                                               |
+--------------------+--------------------------------------------------------------+
| DP                 | discourse part                                               |
+--------------------+--------------------------------------------------------------+
| HD                 | head                                                         |
+--------------------+--------------------------------------------------------------+
| HDF                | closing element of a circumposition                          |
+--------------------+--------------------------------------------------------------+
| LD                 | locative or directional complement                           |
+--------------------+--------------------------------------------------------------+
| ME                 | measure (duration, weight, ...) complement                   |
+--------------------+--------------------------------------------------------------+
| MOD                | adverbial modifier                                           |
+--------------------+--------------------------------------------------------------+
| MWP                | part of a multi-word unit                                    |
+--------------------+--------------------------------------------------------------+
| NUCL               | nucleus (core clause)                                        |
+--------------------+--------------------------------------------------------------+
| OBCOMP             | comparative complement                                       |
+--------------------+--------------------------------------------------------------+
| OBJ1               | direct object                                                |
+--------------------+--------------------------------------------------------------+
| OBJ2               | secondary object (indirect, benefactive, experiencer)        |
+--------------------+--------------------------------------------------------------+
| PC                 | prepositional object                                         |
+--------------------+--------------------------------------------------------------+
| POBJ1              | provisional direct object                                    |
+--------------------+--------------------------------------------------------------+
| PREDC              | predicative complement                                       |
+--------------------+--------------------------------------------------------------+
| PREDM              | predicative modifier (state 'during the action')             |
+--------------------+--------------------------------------------------------------+
| RHD                | head of a relative clause                                    |
+--------------------+--------------------------------------------------------------+
| SAT                | satellite; lead-in or tail element                           |
+--------------------+--------------------------------------------------------------+
| SE                 | obligatory reflexive object                                  |
+--------------------+--------------------------------------------------------------+
| SU                 | subject                                                      |
+--------------------+--------------------------------------------------------------+
| SUP                | provisional subject                                          |
+--------------------+--------------------------------------------------------------+
| SVP                | separable part of a verb                                     |
+--------------------+--------------------------------------------------------------+
| TAG                | tag, interjection                                            |
+--------------------+--------------------------------------------------------------+
| VC                 | verbal complement                                            |
+--------------------+--------------------------------------------------------------+
| WHD                | head of an interrogative clause                              |
+--------------------+--------------------------------------------------------------+
|
||
.. [1] The source code repository points to the latest development version
   by default, which may contain experimental features. Stable releases
   are deliberate snapshots of the source code. We recommend using the
   latest stable release.
.. [2] https://github.com/LanguageMachines/ticcutils
.. [3] https://github.com/LanguageMachines/libfolia
.. [4] https://languagemachines.github.io/ucto
.. [5] https://languagemachines.github.io/timbl
.. [6] https://github.com/LanguageMachines/timblserver
.. [7] https://languagemachines.github.io/mbt
.. [8] B (begin) indicates the beginning of a named entity, I (inside)
   indicates the continuation of a named entity, and O (outside)
   indicates that a token is not part of a named entity.
.. [9] https://github.com/proycon/pynlpl, supports both Python 2 and Python 3
.. [10] https://github.com/vanatteveldt/frogr/
.. [11] https://github.com/Machiel/gorf
.. [12] The current Frog version does not accept UTF-16 input.
.. [13] In fact, the tokenizer is still used, but in ``PassThru`` mode. This
   allows for conversion to FoLiA XML and sentence detection.
.. [14] Versions for Python 3 may be called ``cython3`` on distributions such
   as Debian or Ubuntu.
.. [15] More about the INI file format: https://en.wikipedia.org/wiki/INI_file
.. [16] MBT is available at http://languagemachines.github.io/mbt/
.. [17] https://github.com/proycon/python-frog
.. [18] Part of PyNLPL: https://github.com/proycon/pynlpl
.. [19] https://github.com/vanatteveldt/frogr/
.. [20] https://github.com/Machiel/gorf
.. |image| image:: frogarchitecture |
@@ -0,0 +1,123 @@ | ||
.. _frogData: | ||
|
||
|
||
|
||
Frog in practice | ||
---------------- | ||
|
||
Frog has been used in research projects mostly because it processes
Dutch texts efficiently and analyzes them with sufficient accuracy.
The purposes range from corpus construction to linguistic
research, natural language processing, and text analytics
applications. We provide an overview of work that reports using Frog,
grouped into topical clusters.
|
||
Corpus construction | ||
~~~~~~~~~~~~~~~~~~~ | ||
|
||
Frog, named Tadpole before 2011, has been used for the automated
annotation of Dutch corpora, mostly with POS tags and lemmas. When
Frog's output was post-corrected manually, this was usually done on
the basis of the probabilities produced by the POS tagger, using a
confidence threshold [VandenBosch+06]_. Resources built this way include:
|
||
- The 500-million-word SoNaR corpus of written contemporary Dutch, and
  its 50-million-word predecessor D-Coi [Oostdijk+08]_ [oostdijk2013construction]_;

- The 500-million-word Lassy Large corpus [LASSY]_, which has also been parsed
  automatically with the ALPINO parser [ALPINO]_;

- The 115-hour JASMIN corpus of transcribed Dutch, spoken by elderly people,
  non-native speakers, and children [Cucchiarini+13]_;

- The 7-million-word Dutch subcorpus of a multilingual parallel corpus
  of automotive texts [DPL2009]_;

- The *Insight Interaction* corpus of 15 20-minute transcribed
  multi-modal dialogues [brone2015insight]_;

- The SUBTLEX-NL word frequency database, based on an automatically
  analyzed 44-million-word corpus of Dutch subtitles of movies and
  television shows [subtlex]_.
|
||
Feature generation for text filtering and Natural Language Processing | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Frog's analyses can help to isolate particular linguistic
abstractions over text, such as adjectives or particular constructions,
for use in further processing. They can also generate
annotation layers that act as features in later NLP processing
steps. POS tags and lemmas are used most often for these purposes. We list a
number of examples from across NLP:
|
||
- Sentence-level analysis tasks such as word sense disambiguation
  [Uvt-wsd1]_ and entity recognition [vandecamp2011]_;

- Text-level tasks such as authorship attribution [Luyckx2011]_,
  emotion detection [vaassen2011]_, sentiment analysis [hogenboom2014]_,
  and readability prediction [de2014using]_;

- Text-to-text processing tasks such as machine translation
  [Haque+11]_ and sub-sentential alignment for machine translation [macken2010sub]_;

- Filtering Dutch texts for resource development, such as filtering
  adjectives for developing a subjectivity lexicon [Pattern]_, and POS
  tagging to assist shallow chunking of Dutch texts for bilingual
  terminology extraction [texsis2013]_.
|
||
|
||
|
||
|
||
.. [Atterer+2007] Atterer, Michaela and Hinrich Schütze. 2007. Prepositional phrase attachment without oracles. Computational Linguistics, 33(4):469–476.
.. [brone2015insight] Brône, Geert and Bert Oben. 2015. Insight Interaction: a multimodal and multifocal dialogue corpus. Language Resources and Evaluation, 49(1):195–214.
.. [Cucchiarini+13] Cucchiarini, Catia and Hugo Van hamme. 2013. The JASMIN speech corpus: Recordings of children, non-natives and elderly people. In Essential Speech and Language Technology for Dutch. Springer, pages 147–164.
.. [de2014using] De Clercq, Orphée, Véronique Hoste, Bart Desmet, Philip Van Oosten, Martine De Cock, and Lieve Macken. 2014. Using the crowd for readability prediction. Natural Language Engineering, 20(03):293–325.
.. [Pattern] De Smedt, Tom and Walter Daelemans. 2012. "Vreselijk mooi!" (terribly beautiful): A subjectivity lexicon for Dutch adjectives. In LREC, pages 3568–3572.
.. [hogenboom2014] Hogenboom, Alexander, Bas Heerschop, Flavius Frasincar, Uzay Kaymak, and Franciska de Jong. 2014. Multilingual support for lexicon-based sentiment analysis guided by semantics. Decision Support Systems, 62:43–53.
.. [TTNWW] Kemps-Snijders, Marc, Ineke Schuurman, Walter Daelemans, Kris Demuynck, Brecht Desplanques, Véronique Hoste, Marijn Huijbregts, Jean-Pierre Martens, Joris Pelemans, Hans Paulussen, Martin Reynaert, Vincent Vandeghinste, Antal van den Bosch, Henk van den Heuvel, Maarten van Gompel, Gertjan Van Noord, and Patrick Wambacq. 2017. TTNWW to the rescue: no need to know how to handle tools and resources. In CLARIN in the Low Countries, pages 83–93.
.. [subtlex] Keuleers, Emmanuel, Marc Brysbaert, and Boris New. 2010. SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3):643–650.
.. [DPL2009] Lefever, Els, Lieve Macken, and Véronique Hoste. 2009. Language-independent bilingual terminology extraction from a multilingual parallel corpus. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 496–504. Association for Computational Linguistics.
.. [Luyckx2011] Luyckx, Kim. 2011. Scalability issues in authorship attribution. ASP/VUBPRESS/UPA.
.. [macken2010sub] Macken, Lieve. 2010. Sub-sentential alignment of translational correspondences. UPA University Press Antwerp.
.. [texsis2013] Macken, Lieve, Els Lefever, and Véronique Hoste. 2013. TExSIS: bilingual terminology extraction from parallel corpora using chunk-based alignment. Terminology, 19(1):1–30.
.. [Oostdijk+08] Oostdijk, N., M. Reynaert, P. Monachesi, G. Van Noord, R. Ordelman, I. Schuurman, and V. Vandeghinste. 2008. From D-Coi to SoNaR: A reference corpus for Dutch. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco.
.. [oostdijk2013construction] Oostdijk, Nelleke, Martin Reynaert, Véronique Hoste, and Ineke Schuurman. 2013. The construction of a 500-million-word reference corpus of contemporary written Dutch. In Essential Speech and Language Technology for Dutch. Springer, pages 219–247.

Petrov, Slav, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

.. [vaassen2011] Vaassen, Frederik and Walter Daelemans. 2011. Automatic emotion classification for interpersonal communication. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 104–110. Association for Computational Linguistics.
.. [vandecamp2011] Van de Camp, M. and A. Van den Bosch. 2011. A link to the past: Constructing historical social networks. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011), pages 61–69, Portland, Oregon, June. Association for Computational Linguistics.
.. [VandenBosch+06] Van den Bosch, A., I. Schuurman, and V. Vandeghinste. 2006. Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC-2006, Trento, Italy.
.. [MBMA] Van den Bosch, Antal and Walter Daelemans. 1999. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL '99, pages 285–292, Stroudsburg, PA, USA. Association for Computational Linguistics.
.. [POS2004] Van Eynde, Frank. 2004. Part of speech tagging en lemmatisering van het Corpus Gesproken Nederlands. Technical report, Centrum voor Computerlinguïstiek, KU Leuven, Belgium.
.. [Uvt-wsd1] Van Gompel, M. 2010. UvT-WSD1: A cross-lingual word sense disambiguation system. In SemEval '10: Proceedings of the 5th International Workshop on Semantic Evaluation, pages 238–241, Morristown, NJ, USA. Association for Computational Linguistics.