Arabic Text Indexer

Overview:

This project creates index files for the texts of Arabic and English Saudi stock market announcements, extracted from Tadawul using Tadawul Crawler.

Processing:

• Tokenization:

For each given document, both Arabic and English words are tokenized.
Every character that is neither a letter nor a separator (punctuation, digits and symbols) is removed using the following regular expression: [^\\p{L}\\p{Z}].
The remaining text is split on spaces, and the resulting tokens are stored in a list for further processing.
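
The tokenization step can be sketched in Python as follows. This is a minimal illustration, not the project's code: Python's built-in `re` module does not support `\p{L}` or `\p{Z}`, so the sketch swaps the regular expression for an equivalent check on Unicode general categories via `unicodedata`. The function name `tokenize` is hypothetical.

```python
import unicodedata


def tokenize(text: str) -> list[str]:
    """Keep only letters (L*) and separators (Z*), then split on whitespace.

    Mirrors the effect of removing everything matched by [^\\p{L}\\p{Z}].
    """
    kept = []
    for ch in text:
        # The first letter of the general category is the major class:
        # "L" = letter, "Z" = separator (space, line, paragraph).
        if unicodedata.category(ch)[0] in ("L", "Z"):
            kept.append(ch)
    return "".join(kept).split()


print(tokenize("Net profit: 15%!"))
print(tokenize("أرباح الشركة، 2021"))
```

Note that under this character class digits are dropped along with punctuation, so a figure such as "2021" does not survive as a token.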

• Stopping:

Each token in the list is compared against the Arabic and English stop word lists provided in the Resources. If a match is found, the token is removed. The filtered list of tokens is then passed to the next step.
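
A minimal sketch of the stopping step is shown below. The stop word sets here are tiny hypothetical samples, not the lists shipped in the Resources, and `remove_stop_words` is an illustrative name.

```python
# Hypothetical sample lists; the real lists live in the project's Resources.
ARABIC_STOP_WORDS = {"في", "من", "عن"}
ENGLISH_STOP_WORDS = {"the", "of", "and"}
STOP_WORDS = ARABIC_STOP_WORDS | ENGLISH_STOP_WORDS


def remove_stop_words(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Return a new list without the tokens found in stop_words."""
    return [token for token in tokens if token not in stop_words]


tokens = ["the", "profit", "of", "الشركة", "عن"]
print(remove_stop_words(tokens, STOP_WORDS))
```

Using a set keeps each membership test at constant time on average, which matters when the token list is long.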

• Stemming:

We used Snowball stemmers to stem each token. The stemmers for both Arabic and English words are provided in the Resources.

Indexing strategy:

• We implemented an inverted index and stored it on the hard disk for faster retrieval. Each document’s date and time, in the form of a UNIX timestamp, serves as its ID.

• The index is stored as a JSON file, structured as follows:

Term : Object {
	Array [
		Document {
			Array [
				Term {
					Position,
					OriginalTerm,
					Type
				},
				Term {
					…
				},
			]
		},
		…
		Document {
			…
		},
	]
}
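
One way to build an index with this nesting is sketched below. Several details are assumptions made for illustration: the JSON key names, the use of the stemmed term as the top-level key, the document ID as the key of each document object, and the meaning of `Type` (passed through here as an opaque value, with a language tag used as a placeholder). The function `build_index` and the sample timestamps are hypothetical.

```python
import json
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Build term -> [ {doc_id: [occurrence, ...]}, ... ].

    documents maps a document ID (UNIX timestamp) to a list of
    (position, stem, original_term, term_type) tuples.
    """
    grouped = defaultdict(dict)
    for doc_id, occurrences in documents.items():
        for position, stem, original, term_type in occurrences:
            grouped[stem].setdefault(str(doc_id), []).append(
                {"Position": position, "OriginalTerm": original, "Type": term_type}
            )
    # Turn each per-term mapping into an array of single-document objects.
    return {
        term: [{doc_id: occs} for doc_id, occs in docs.items()]
        for term, docs in grouped.items()
    }


documents = {
    1617262200: [(0, "profit", "profits", "en"), (1, "increas", "increased", "en")],
    1617348600: [(3, "profit", "profit", "en")],
}
index = build_index(documents)
print(json.dumps(index, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` to `json.dumps` keeps Arabic terms readable in the output file instead of escaping them.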

• Example of the index file:

• We indexed the data for the 3 stocks once. Afterwards, we read the desired index whenever we apply a procedure such as: getting the top 10 most frequent words, counting all words, or looking for a specific word in the text (search).
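
The three procedures can be sketched over an already loaded index as follows. The sketch assumes the index has been read (for example with `json.load`) into a dictionary of the shape term -> list of {document ID: list of occurrences}; that exact shape, the sample data and the function names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical index contents in the assumed layout.
index = {
    "profit": [
        {"1617262200": [{"Position": 0}, {"Position": 7}]},
        {"1617348600": [{"Position": 3}]},
    ],
    "dividend": [
        {"1617262200": [{"Position": 4}]},
    ],
}


def count_words(index: dict) -> Counter:
    """Count the occurrences of every term across all documents."""
    counts = Counter()
    for term, documents in index.items():
        for document in documents:
            for occurrences in document.values():
                counts[term] += len(occurrences)
    return counts


def top_words(index: dict, n: int = 10) -> list:
    """Return the n most frequent terms with their counts."""
    return count_words(index).most_common(n)


def search(index: dict, term: str) -> list:
    """Return the IDs of the documents that contain the (stemmed) term."""
    return [doc_id for document in index.get(term, []) for doc_id in document]


print(top_words(index))
print(search(index, "profit"))
```

Because the index is keyed by term, a search is a single dictionary lookup rather than a scan over the documents.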

Complications:

• Some documents do not have a date or time. Since we depend on those as identifiers, we handled the missing values by taking the current time and replacing its last three digits with a random number.
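
A minimal sketch of this fallback is given below. It assumes the timestamp is expressed in seconds (the source does not say whether seconds or milliseconds are used), and the function name `fallback_document_id` and its parameters are hypothetical.

```python
import random
import time


def fallback_document_id(now=None, rng=random) -> int:
    """Return the current UNIX time with its last three digits randomized.

    now and rng are injectable so the behavior can be reproduced in tests.
    """
    current = int(time.time()) if now is None else int(now)
    # Zero out the last three digits, then add a random value in 0..999.
    return current - current % 1000 + rng.randrange(1000)


print(fallback_document_id())
```

With only 1000 possible suffixes, two undated documents processed within the same time window can still receive the same ID, so a collision check against existing IDs may be worth adding.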

Developed as part of a Computer Science MSc course
Supervisor: Dr. Mohammad Alsulmi
Course: CSC569: Selected Topics in AI
King Saud University
April 2021
