SAPA

Segmentor and Part-of-speech tagger for Arabic

Overview

This tool preprocesses Arabic text: it predicts part-of-speech tags and segments each word to separate the base form from its proclitics. Details of this work are presented in:

Souhir Gahbiche-Braham, Hélène Bonneau-Maynard, Thomas Lavergne, and François Yvon. Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, 2012 [www.lrec-conf.org/proceedings/lrec2012/pdf/855_Paper.pdf].

If you use SAPA for research purposes, please use the following citation:

@inproceedings{gahbiche2012joint,
	title = {Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier},
	author = {Gahbiche-Braham, Souhir and Bonneau-Maynard, H{\'e}l{\`e}ne and Lavergne, Thomas and Yvon, Fran{\c{c}}ois},
	booktitle = {Proc. of LREC'12},
	address = {Istanbul, Turkey},
	editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
	publisher = {European Language Resources Association (ELRA)},
	isbn = {978-2-9517408-7-7},
	language = {english},
	pages = {2107--2113},
	year = {2012}
}

For questions, bug reports, or comments, contact: souhir.gahbiche[at]limsi.fr or souhir.gahbiche[at]gmail.com

Requirements

To use SAPA:

1- Please install the Wapiti toolkit http://wapiti.limsi.fr/

2- Place the SAPA directory in the Wapiti directory.

3- Uncompress the modelPOS+SEG-final-0.1.crf.zip file.
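Once these steps are done, a quick sanity check can confirm that the directory is ready. The sketch below is only an illustration: the function name check_sapa_setup and its directory argument are hypothetical, and it assumes the uncompressed model is named modelPOS+SEG-final-0.1.crf (the archive name without .zip).

```shell
# Sketch: verify that a SAPA directory holds the files needed to run.
# check_sapa_setup is a hypothetical helper, not part of SAPA.
check_sapa_setup() {
  sapa_dir=$1
  setup_status=0
  # Assumption: the model unpacks to the archive name without ".zip".
  for needed in segment.sh modelPOS+SEG-final-0.1.crf; do
    if [ ! -f "$sapa_dir/$needed" ]; then
      echo "missing: $sapa_dir/$needed" >&2
      setup_status=1
    fi
  done
  return "$setup_status"
}
```

For example, `check_sapa_setup ~/wapiti/SAPA` returns a non-zero status and names each missing file if the model has not been uncompressed yet.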

Run SAPA

1- cd ~/wapiti/SAPA

2- ./segment.sh arabic_filename

If SAPA does not run correctly or the model fails to load, unzip modelPOS+SEG.light.crf.zip and replace "modelPOS+SEG-final-0.1.crf" with "modelPOS+SEG.light.crf" in the file "segment.sh".
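The model name can be swapped by hand in a text editor, or with a one-line substitution. The following is a minimal sketch, assuming the model name appears literally in segment.sh; the function name use_light_model is hypothetical.

```shell
# Sketch: point segment.sh at the light model.
# use_light_model is a hypothetical helper, not part of SAPA.
use_light_model() {
  script=${1:-segment.sh}
  # -i.bak edits in place and keeps the original as segment.sh.bak;
  # this form is accepted by both GNU and BSD sed.
  # In a basic regular expression "+" is literal; "." is escaped.
  sed -i.bak 's/modelPOS+SEG-final-0\.1\.crf/modelPOS+SEG.light.crf/g' "$script"
}
```

Run `use_light_model segment.sh` from the SAPA directory after unzipping modelPOS+SEG.light.crf.zip; the backup file makes it easy to revert.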

Resulting files

arabic_filename.bw is the text transliterated using the Buckwalter scheme.

arabic_filename.wap contains the predictions for each word in the text.

arabic_filename.norm.pos contains the part-of-speech tags for each word in arabic_filename.result.

arabic_filename.result is the resulting preprocessed file.
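To confirm that a run produced all four files, something like the following sketch can be used. It assumes the outputs are written next to the input file with the suffixes listed above; the function name check_sapa_outputs is hypothetical.

```shell
# Sketch: report which SAPA output files exist and are non-empty.
# check_sapa_outputs is a hypothetical helper, not part of SAPA.
check_sapa_outputs() {
  base=$1
  out_status=0
  for ext in bw wap norm.pos result; do
    if [ -s "$base.$ext" ]; then
      echo "ok       $base.$ext"
    else
      echo "missing  $base.$ext"
      out_status=1
    fi
  done
  return "$out_status"
}
```

Calling `check_sapa_outputs arabic_filename` after a run prints one line per expected file and returns a non-zero status if any of them is absent or empty.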
