first draft of the frog manual in rst format added. (yes, it still contains errors)
Irishx committed Nov 29, 2018
1 parent cf443a7 commit 6b88c25
Showing 8 changed files with 1,559 additions and 10 deletions.
427 changes: 427 additions & 0 deletions docs/source/advanced.rst

Large diffs are not rendered by default.

216 changes: 216 additions & 0 deletions docs/source/credits.rst
@@ -0,0 +1,216 @@
.. _credits:



Credits and references
======================

Once upon a time
-----------------

The development of Frog’s modules started in the nineties at the ILK
Research Group (Tilburg University, the Netherlands) and the CLiPS
Research Centre (University of Antwerp, Belgium). Most modules rely on
TiMBL, the Tilburg memory-based learning software package
:raw-latex:`\cite{timbl}`, or on MBT, the memory-based tagger-generator
:raw-latex:`\cite{mbt}`. These modules were integrated into an NLP
pipeline that was first named MB-TALPA and later Tadpole
:raw-latex:`\cite{Tadpole}`. Over the years, the modules were refined
and retrained on larger data sets; the latest version of each module
is discussed in this manual. The Credits section below thanks all
programmers who worked on Frog and its predecessors.

The CLiPS Research Centre also developed an English counterpart of Frog,
a Python module called MBSP (MBSP website:
http://www.clips.ua.ac.be/pages/MBSP).


Credits
--------


If you use Frog for your own work, please cite this reference manual:

   Frog, A Natural Language Processing Suite for Dutch, Reference
   guide, Iris Hendrickx, Antal van den Bosch, Maarten van Gompel and Ko
   van der Sloot, Language and Speech Technology Technical Report
   Series 16-02, Radboud University Nijmegen, Draft 0.13.1 - June 2016

The following paper describes Tadpole, the predecessor of Frog. It
covers a subset of the components described in this manual:

   Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S.
   (2007). An efficient memory-based morphosyntactic tagger and parser
   for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and V.
   Vandeghinste (Eds.), Selected Papers of the 17th Computational
   Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114

We would like to thank everybody who worked on Frog and its
predecessors. Frog, formerly known as Tadpole and before that as
MB-TALPA, was coded by Bertjan Busser, Ko van der Sloot, Maarten van
Gompel, and Peter Berck, subsuming code by Sander Canisius (constraint
satisfaction inference-based dependency parser), Antal van den Bosch
(MBMA, MBLEM, tagger-lemmatizer integration), Jakub Zavrel (MBT), and
Maarten van Gompel (Ucto). In the context of the CLARIN-NL
infrastructure project TTNWW, Frederik Vaassen (CLiPS, Antwerp) created
the base phrase chunking module, and Bart Desmet (LT3, Ghent) provided
the data for the named-entity module.

Maarten van Gompel designed the FoLiA XML output format that Frog
produces, and also wrote a Frog binding for Python [17]_, as well as a
separate Frog client in Python [18]_. Wouter van Atteveldt wrote a Frog
client in R [19]_, and Machiel Molenaar wrote a Frog client for
Go [20]_.

The development of Frog relies on earlier work and ideas from Ko van der
Sloot (lead programmer of MBT and TiMBL and the TiMBL API), Walter
Daelemans, Jakub Zavrel, Peter Berck, Gert Durieux, and Ton Weijters.

The development and improvement of Frog also relies on your bug reports,
suggestions, and comments. Use the GitHub issue tracker at
https://github.com/LanguageMachines/frog/issues/ or mail
`lamasoftware@science.ru.nl <mailto:lamasoftware@science.ru.nl>`__.

Alpino syntactic dependency labels
==================================

This table is taken from the Alpino annotation reference manual
:raw-latex:`\cite{lassy2011}`:

+--------------------+--------------------------------------------------------------+
| Dependency label   | Description                                                  |
+====================+==============================================================+
| APP                | apposition                                                   |
+--------------------+--------------------------------------------------------------+
| BODY               | body (with a complementizer)                                 |
+--------------------+--------------------------------------------------------------+
| CMP                | complementizer                                               |
+--------------------+--------------------------------------------------------------+
| CNJ                | conjunct (member of a coordination)                          |
+--------------------+--------------------------------------------------------------+
| CRD                | coordinator (as head of a conjunction)                       |
+--------------------+--------------------------------------------------------------+
| DET                | determiner                                                   |
+--------------------+--------------------------------------------------------------+
| DLINK              | discourse link                                               |
+--------------------+--------------------------------------------------------------+
| DP                 | discourse part                                               |
+--------------------+--------------------------------------------------------------+
| HD                 | head                                                         |
+--------------------+--------------------------------------------------------------+
| HDF                | closing element of a circumposition                          |
+--------------------+--------------------------------------------------------------+
| LD                 | locative or directional complement                           |
+--------------------+--------------------------------------------------------------+
| ME                 | measure (duration, weight, ...) complement                   |
+--------------------+--------------------------------------------------------------+
| MOD                | adverbial modifier                                           |
+--------------------+--------------------------------------------------------------+
| MWP                | part of a multi-word unit                                    |
+--------------------+--------------------------------------------------------------+
| NUCL               | nucleus clause                                               |
+--------------------+--------------------------------------------------------------+
| OBCOMP             | comparative complement                                       |
+--------------------+--------------------------------------------------------------+
| OBJ1               | direct object                                                |
+--------------------+--------------------------------------------------------------+
| OBJ2               | secondary object (indirect, benefactive, experiencer)        |
+--------------------+--------------------------------------------------------------+
| PC                 | prepositional object                                         |
+--------------------+--------------------------------------------------------------+
| POBJ1              | provisional direct object                                    |
+--------------------+--------------------------------------------------------------+
| PREDC              | predicative complement                                       |
+--------------------+--------------------------------------------------------------+
| PREDM              | predicative modifier (state 'during the action')             |
+--------------------+--------------------------------------------------------------+
| RHD                | head of a relative clause                                    |
+--------------------+--------------------------------------------------------------+
| SAT                | satellite; lead-in or trailing element                       |
+--------------------+--------------------------------------------------------------+
| SE                 | obligatory reflexive object                                  |
+--------------------+--------------------------------------------------------------+
| SU                 | subject                                                      |
+--------------------+--------------------------------------------------------------+
| SUP                | provisional subject                                          |
+--------------------+--------------------------------------------------------------+
| SVP                | separable part of a verb                                     |
+--------------------+--------------------------------------------------------------+
| TAG                | tag, interjection                                            |
+--------------------+--------------------------------------------------------------+
| VC                 | verbal complement                                            |
+--------------------+--------------------------------------------------------------+
| WHD                | head of an interrogative clause                              |
+--------------------+--------------------------------------------------------------+

.. [1]
   The source code repository points to the latest development version
   by default, which may contain experimental features. Stable releases
   are deliberate snapshots of the source code. It is recommended to
   grab the latest stable release.
.. [2]
   https://github.com/LanguageMachines/ticcutils
.. [3]
   https://github.com/LanguageMachines/libfolia
.. [4]
   https://languagemachines.github.io/ucto
.. [5]
   https://languagemachines.github.io/timbl
.. [6]
   https://github.com/LanguageMachines/timblserver
.. [7]
   https://languagemachines.github.io/mbt
.. [8]
   B (begin) indicates the beginning of a named entity, I (inside)
   indicates the continuation of a named entity, and O (outside)
   indicates that a token is not part of a named entity.
.. [9]
   https://github.com/proycon/pynlpl, supports both Python 2 and Python
   3
.. [10]
   https://github.com/vanatteveldt/frogr/
.. [11]
   https://github.com/Machiel/gorf
.. [12]
   The current Frog version does not accept UTF-16 input.
.. [13]
   In fact the tokenizer is still used, but in ``PassThru`` mode. This
   allows for conversion to FoLiA XML and sentence detection.
.. [14]
   Versions for Python 3 may be called ``cython3`` on distributions such
   as Debian or Ubuntu
.. [15]
   More about the INI file format:
   https://en.wikipedia.org/wiki/INI_file
.. [16]
   MBT is available at http://languagemachines.github.io/mbt/
.. [17]
   https://github.com/proycon/python-frog
.. [18]
   Part of PyNLPL: https://github.com/proycon/pynlpl
.. [19]
   https://github.com/vanatteveldt/frogr/
.. [20]
   https://github.com/Machiel/gorf
.. |image| image:: frogarchitecture
123 changes: 123 additions & 0 deletions docs/source/frogData.rst
@@ -0,0 +1,123 @@
.. _frogData:



Frog in practice
----------------

Frog has been used in research projects mostly because of its capacity
to process Dutch texts efficiently and to analyze them with sufficient
accuracy. The purposes range from corpus construction to linguistic
research, natural language processing, and text analytics
applications. We provide an overview of work that reports using Frog,
grouped in topical clusters.

Corpus construction
~~~~~~~~~~~~~~~~~~~

Frog, named Tadpole before 2011, has been used for the automated
annotation of Dutch corpora, mostly with POS tags and lemmas. When
Frog's output was post-corrected manually, this was usually done on
the basis of the probabilities produced by the POS tagger, by setting a
confidence threshold [VandenBosch+06]_. Corpora annotated with Frog or
Tadpole include:

- The 500-million-word SoNaR corpus of written contemporary Dutch, and
  its 50-million word predecessor D-Coi [Oostdijk+08]_ [oostdijk2013construction]_;

- The 500-million word Lassy Large corpus [LASSY]_ that has also been parsed
  automatically with the ALPINO parser [ALPINO]_;

- The 115-hour JASMIN corpus of transcribed Dutch, spoken by elderly,
  non-native speakers, and children [Cucchiarini+13]_;

- The 7-million word Dutch subcorpus of a multilingual parallel corpus
  of automotive texts [DPL2009]_;

- The *Insight Interaction* corpus of 15 20-minute transcribed
  multi-modal dialogues [brone2015insight]_;

- The SUBTLEX-NL word frequency database was based on an automatically
  analyzed 44-million word corpus of Dutch subtitles of movies and
  television shows [subtlex]_.
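
The threshold-based post-correction mentioned at the start of this
section can be illustrated with a minimal Python sketch. It assumes that
each token carries the probability the POS tagger assigned to its tag,
and that tokens below a chosen threshold are queued for manual checking.
The ``TaggedToken`` type, the ``needs_review`` function, the sample tags
and probabilities, and the threshold of 0.8 are all hypothetical; they
are not part of Frog and do not reproduce the procedure of the cited
work:

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One token with its predicted POS tag and the tagger's probability.

    Hypothetical container for illustration; not a Frog data structure.
    """
    word: str
    pos: str
    confidence: float


def needs_review(tokens: List[TaggedToken], threshold: float = 0.8) -> List[TaggedToken]:
    """Return the tokens whose tag probability is below the threshold."""
    return [token for token in tokens if token.confidence < threshold]


# Invented sample values, only meant to show the selection step.
tokens = [
    TaggedToken("De", "LID(bep,stan,rest)", 0.99),
    TaggedToken("bank", "N(soort,ev,basis,zijd,stan)", 0.62),
    TaggedToken("sluit", "WW(pv,tgw,ev)", 0.91),
]

flagged = needs_review(tokens, threshold=0.8)
for token in flagged:
    print(f"{token.word}\t{token.pos}\t{token.confidence:.2f}")
```

A lower threshold sends fewer tokens to the annotators at the risk of
leaving more tagging errors uncorrected; the right value depends on the
corpus and the available annotation effort.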

Feature generation for text filtering and Natural Language Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Frog’s analyses can help to zoom in on particular linguistic
abstractions over text, such as adjectives or particular constructions,
to be used in further processing. They can also help to generate
annotation layers that can act as features in further NLP processing
steps. POS tags and lemmas are mostly used for these purposes. We list a
number of examples across the NLP board:

- Sentence-level analysis tasks such as word sense disambiguation
  [Uvt-wsd1]_ and entity recognition [Vandecamp+2011]_;

- Text-level tasks such as authorship attribution [Luyckx2011]_,
  emotion detection [vaassen2011]_, sentiment analysis
  [hogenboom2014]_, and readability prediction [de2014using]_;

- Text-to-text processing tasks such as machine translation
  [Haque+11]_ and sub-sentential alignment for machine translation
  [macken2010sub]_;

- Filtering Dutch texts for resource development, such as filtering
  adjectives for developing a subjectivity lexicon [Pattern]_, and POS
  tagging to assist shallow chunking of Dutch texts for bilingual
  terminology extraction [texsis2013]_.
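
As a rough illustration of using POS tags and lemmas to zoom in on one
word class, the following Python sketch collects the lemmas of all
adjectives from tagged text. It assumes a simplified tab-separated input
with three columns (word, lemma, POS tag); this layout, the
``select_lemmas`` function, and the sample lines are hypothetical and do
not reproduce Frog's actual output format:

```python
from typing import Iterable, List


def select_lemmas(lines: Iterable[str], pos_prefix: str = "ADJ") -> List[str]:
    """Collect lemmas of tokens whose POS tag starts with pos_prefix.

    Each line is assumed to hold word, lemma, and POS tag separated by
    tabs (a simplified, hypothetical layout). Shorter lines, such as
    empty sentence separators, are skipped.
    """
    lemmas = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        _word, lemma, pos = fields[:3]
        if pos.startswith(pos_prefix):
            lemmas.append(lemma)
    return lemmas


# Invented sample: "een mooie dag" (a beautiful day).
sample = [
    "een\teen\tLID(onbep,stan,agr)",
    "mooie\tmooi\tADJ(prenom,basis,met-e,stan)",
    "dag\tdag\tN(soort,ev,basis,zijd,stan)",
    "",
]

adjectives = select_lemmas(sample)
print(adjectives)
```

The same function can select other word classes by passing a different
``pos_prefix``, for example ``"N"`` for nouns in this sample.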




.. [Atterer+2007] Atterer, Michaela and Hinrich Schütze. 2007. Prepositional phrase attachment without oracles. Computational Linguistics, 33(4):469–476.
.. [brone2015insight] Brône, Geert and Bert Oben. 2015. Insight interaction: a multimodal and multifocal dialogue corpus. Language Resources and Evaluation, 49(1):195–214.



.. [Cucchiarini+13] Cucchiarini, Catia and Hugo Van hamme. 2013. The Jasmin speech corpus: Recordings of children, non-natives and elderly people. In Essential Speech and Language Technology for Dutch. Springer, pages 147–164.
.. [de2014using] De Clercq, Orphée, Veronique Hoste, Bart Desmet, Philip Van Oosten, Martine De Cock, and Lieve Macken. 2014. Using the crowd for readability prediction. Natural Language Engineering, 20(3):293–325.

.. [Pattern] De Smedt, Tom and Walter Daelemans. 2012. "Vreselijk mooi!" (terribly beautiful): A subjectivity lexicon for Dutch adjectives. In LREC, pages 3568–3572.





.. [hogenboom2014] Hogenboom, Alexander, Bas Heerschop, Flavius Frasincar, Uzay Kaymak, and Franciska de Jong. 2014. Multilingual support for lexicon-based sentiment analysis guided by semantics. Decision support systems, 62:43–53.
.. [TTNWW] Kemps-Snijders, Marc, Ineke Schuurman, Walter Daelemans, Kris Demuynck, Brecht Desplanques, Véronique Hoste, Marijn Huijbregts, Jean-Pierre Martens, Joris Pelemans, Hans Paulussen, Martin Reynaert, Vincent Vandeghinste, Antal van den Bosch, Henk van den Heuvel, Maarten van Gompel, Gertjan Van Noord, and Patrick Wambacq. 2017. TTNWW to the rescue: no need to know how to handle tools and resources. CLARIN in the Low Countries, pages 83–93.
.. [subtlex] Keuleers, Emmanuel, Marc Brysbaert, and Boris New. 2010. SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3):643–650.
.. [DPL2009] Lefever, Els, Lieve Macken, and Véronique Hoste. 2009. Language-independent bilingual terminology extraction from a multilingual parallel corpus. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 496–504. Association for Computational Linguistics.
.. [Luyckx2011] Luyckx, Kim. 2011. Scalability issues in authorship attribution. ASP/VUBPRESS/UPA.
.. [macken2010sub] Macken, Lieve. 2010. Sub-sentential alignment of translational correspondences. UPA University Press Antwerp.
.. [texsis2013] Macken, Lieve, Els Lefever, and Véronique Hoste. 2013. Texsis: bilingual terminology extraction from parallel corpora using chunk-based alignment. Terminology, 19(1):1–30.
.. [Oostdijk+08] Oostdijk, N., M. Reynaert, P. Monachesi, G. Van Noord, R. Ordelman, I. Schuurman, and V. Vandeghinste. 2008. From D-Coi to SoNaR: A reference corpus for Dutch. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.
.. [oostdijk2013construction] Oostdijk, Nelleke, Martin Reynaert, Véronique Hoste, and Ineke Schuurman. 2013. The construction of a 500-million-word reference corpus of contemporary written Dutch. In Essential Speech and Language Technology for Dutch. Springer, pages 219–247.

Petrov, Slav, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

.. [vaassen2011] Vaassen, Frederik and Walter Daelemans. 2011. Automatic emotion classification for interpersonal communication. In Proceedings of the 2nd workshop on computational approaches to subjectivity and sentiment analysis, pages 104–110. Association for Computational Linguistics.
.. [Vandecamp+2011] Van de Camp, M. and A. Van den Bosch. 2011. A link to the past: Constructing historical social networks. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011), pages 61–69, Portland, Oregon, June. Association for Computational Linguistics.
.. [VandenBosch+06] Van den Bosch, A., I. Schuurman, and V. Vandeghinste. 2006. Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC-2006, Trento, Italy.
.. [MBMA] van den Bosch, Antal and Walter Daelemans. 1999. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 285–292, Stroudsburg, PA, USA. Association for Computational Linguistics.
.. [POS2004] Van Eynde, Frank. 2004. Part of speech tagging en lemmatisering van het Corpus Gesproken Nederlands. Technical report, Centrum voor Computerlinguïstiek, KU Leuven, Belgium.
.. [Uvt-wsd1] Van Gompel, M. 2010. Uvt-wsd1: A cross-lingual word sense disambiguation system. In SemEval ’10: Proceedings of the 5th International Workshop on Semantic Evaluation, pages 238–241, Morristown, NJ, USA. Association for Computational Linguistics.
