SAPA

Segmentor and Part-of-speech tagger for Arabic

Overview

This tool preprocesses Arabic text: it predicts part-of-speech tags and splits words to separate the base form from its proclitics. Details of this work are presented in:

Souhir Gahbiche-Braham, Hélène Bonneau-Maynard, Thomas Lavergne and François Yvon. Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, 2012 [www.lrec-conf.org/proceedings/lrec2012/pdf/855_Paper.pdf].

If you use SAPA for research purposes, please cite it as follows:

@inproceedings{gahbiche2012joint,
	title={Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier},
	author={Gahbiche-Braham, Souhir and Bonneau-Maynard, H{\'e}l{\`e}ne and Lavergne, Thomas and Yvon, Fran{\c{c}}ois},
	booktitle={Proc. of LREC'12},
	address = {Istanbul, Turkey},
	editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
	publisher = {European Language Resources Association (ELRA)},
	isbn = {978-2-9517408-7-7},
	language = {english},
	pages={2107--2113},
	year={2012}
}

For any information, bug reports or comments, contact: souhir.gahbiche[at]limsi.fr or souhir.gahbiche[at]gmail.com

Requirements

To use SAPA:

1- Please install the Wapiti toolkit http://wapiti.limsi.fr/

2- Place the ArabicSplit directory in the Wapiti directory.

3- Uncompress the modelPOS+SEG-final-0.1.crf.zip file.

Run SAPA

1- cd ~/wapiti/SAPA

2- ./segment arabic_filename

If SAPA does not run correctly or the model fails to load, unzip the lighter model modelPOS+SEG.light.crf.zip and replace "modelPOS+SEG-final-0.1.crf" with "modelPOS+SEG.light.crf" in the file "segment.sh".
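The replacement described above can also be scripted. The following is a minimal sketch, not part of SAPA itself: the function name `switch_to_light_model` is hypothetical, and it assumes that "segment.sh" refers to the model only by the file name quoted above.

```shell
#!/bin/sh
# Sketch: point segment.sh at the lighter model named in this README.
# switch_to_light_model is a hypothetical helper, not shipped with SAPA.
switch_to_light_model() {
    script="${1:-segment.sh}"
    # Keep a backup, then rewrite the script in place so that its
    # permissions (e.g. the executable bit) are preserved.
    cp "$script" "$script.bak" || return 1
    # Dots are escaped to match literally; '+' is already literal in basic regex.
    sed 's/modelPOS+SEG-final-0\.1\.crf/modelPOS+SEG.light.crf/g' "$script.bak" > "$script"
}
```

After unzipping modelPOS+SEG.light.crf.zip, calling `switch_to_light_model segment.sh` from the directory that contains the script applies the change and leaves the original in "segment.sh.bak".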

Resulting files

arabic_filename.bw contains the text transliterated using the Buckwalter scheme.

arabic_filename.wap contains the predictions for each word in the text.

arabic_filename.norm.pos contains the part-of-speech tags for each word in the resulting file arabic_filename.result.

arabic_filename.result is the final preprocessed file.
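A quick way to confirm that a run completed is to check that all four files exist. This is a minimal sketch under one assumption: that the output files are written next to the input, with the suffixes listed above appended to the input file name. The helper name `check_sapa_outputs` is hypothetical.

```shell
#!/bin/sh
# Sketch: report which of the four expected output files exist.
# Assumes outputs are named <input>.<suffix> and sit beside the input file.
check_sapa_outputs() {
    input="$1"
    status=0
    for ext in bw wap norm.pos result; do
        if [ -f "$input.$ext" ]; then
            echo "found:   $input.$ext"
        else
            echo "missing: $input.$ext"
            status=1
        fi
    done
    # Non-zero exit status if any file is missing.
    return $status
}
```

For example, `./segment arabic_filename && check_sapa_outputs arabic_filename` lists each expected file and returns a non-zero status if any of them was not produced.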
