Part-of-speech (POS) tagging is one of the most widely studied tasks in natural language processing (NLP), and effective POS taggers exist for many languages. This project is a POS tagger for Arabic, specifically Modern Standard Arabic (MSA), the variety used in formal textbooks and news. The solution has two parts. First, a tokenizer splits any file you choose into a list of words, removing punctuation and numbers. Second, a POS tagger takes the list of words from the tokenizer and tags each word with its part of speech (verb, noun, or particle) based on a combination of rules. Finally, we built a golden corpus from a sample of the actual corpus folder to test the algorithm and measure how accurate and precise its tagging is.
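The pipeline described above (tokenize, tag, compare against a golden corpus) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `tokenize`, `tag`, and `accuracy`, the `PARTICLES` list, and the three tagging rules are all assumptions made up for the example, and the project's real rule set is likely larger.

```python
import unicodedata

# Hypothetical particle list for illustration only; a real tagger needs a fuller lexicon.
PARTICLES = {"في", "من", "إلى", "على", "عن"}


def tokenize(text):
    """Split text into words, dropping punctuation, digits, and symbols."""
    # Keep letters (L*) and combining marks (M*, e.g. Arabic diacritics);
    # every other character is turned into a separator.
    cleaned = "".join(
        ch if unicodedata.category(ch)[0] in "LM" else " " for ch in text
    )
    return cleaned.split()


def tag(tokens):
    """Tag each token as 'particle', 'noun', or 'verb' using toy rules."""
    tagged = []
    for word in tokens:
        if word in PARTICLES:
            label = "particle"
        elif word.startswith("ال") or word.endswith("ة"):
            # Assumed rule: definite article prefix or ta marbuta suffix.
            label = "noun"
        else:
            label = "verb"  # naive fallback; wrong for many real words
        tagged.append((word, label))
    return tagged


def accuracy(predicted, gold):
    """Return the fraction of predicted tags that match the gold tags."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


tokens = tokenize("ذهب الولد إلى المدرسة، في 2020!")
tagged = tag(tokens)
gold = ["verb", "noun", "particle", "noun", "particle"]  # hand-labelled sample
score = accuracy([label for _, label in tagged], gold)
print(tagged)
print(f"accuracy = {score:.2f}")
```

Filtering by Unicode category rather than a fixed punctuation list means Arabic punctuation such as `،` and `؟` and Arabic-Indic digits are dropped without being listed explicitly.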
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
What you need to install to run the software, and how to install it:
Python, pip, pandas, Matplotlib, xlrd
https://www.python.org/downloads/
If you're running Python 2.7.9+ or Python 3.4+, congrats, you should already have pip installed. If you do not, read onward.
- Download get-pip.py (https://bootstrap.pypa.io/get-pip.py) to a folder on your computer.
- Open a command prompt and navigate to the folder containing get-pip.py.
- Run the following command:
python get-pip.py
- Pip is now installed!
pip install pandas matplotlib xlrd
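To confirm the install step worked before launching the program, a short check like the one below may help. It is only a sketch: the helper name `missing_packages` is made up for this example, and the list assumes the packages named in this README plus Tkinter are everything the project imports.

```python
import importlib.util

# Packages named in this README; adjust if the project imports others.
REQUIRED = ["pandas", "matplotlib", "xlrd", "tkinter"]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

`importlib.util.find_spec` only looks the package up without importing it, so the check is quick and does not open any windows.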
Now you can double-click the .bat file, and the application window should pop up:
After testing one of the files:
- Tkinter - Python's de facto standard GUI (Graphical User Interface) package
- Omar AlQaisi - OmarQaisi
- Marwan AlRamahi - Marwan998
- Motassem Naqawah - moenaqawah
This project is licensed under the MIT License - see the LICENSE file for details