This project builds index files for the Arabic and English texts of Saudi Stock Market announcements, which were extracted from Tadawul using Tadawul Crawler.
For each given document, both Arabic and English words are tokenized. The text is split on spaces, and every character that is neither a letter (\p{L}) nor a separator (\p{Z}), such as punctuation, is removed using the following regular expression: [^\\p{L}\\p{Z}]
The tokens are then stored in a list for further processing.
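The tokenization step above could look roughly like the following Java sketch. The class and method names are hypothetical. Two details are assumptions rather than confirmed behavior of the project: removed characters are replaced by a space so that words separated only by punctuation do not fuse together, and the cleaned text is then split on separator characters.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizerSketch {
    // Same pattern as in the text: any character that is neither a letter nor a separator.
    static final String NON_LETTER_OR_SEPARATOR = "[^\\p{L}\\p{Z}]";

    static List<String> tokenize(String text) {
        // Assumption: replace removed characters with a space instead of deleting them outright.
        String cleaned = text.replaceAll(NON_LETTER_OR_SEPARATOR, " ");
        List<String> tokens = new ArrayList<>();
        // Split on runs of Unicode separators (this includes the ordinary space).
        for (String part : cleaned.split("\\p{Z}+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The escaped characters spell an Arabic word ("profits"); escapes avoid source-encoding issues.
        String sample = "Net profit rose 12%, \u0623\u0631\u0628\u0627\u062D.";
        System.out.println(tokenize(sample));
    }
}
```

Because \p{L} covers letters of every script, the same pattern handles Arabic and English text without a separate code path.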
Each token in the list is compared against the Arabic and English stop word lists provided in the Resources. If a match is found, the token is removed. Finally, the new list of tokens is stored for the next procedure.
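A minimal Java sketch of this filtering step is shown below. The class name, the method name, and the tiny stop word set used in `main` are hypothetical; the real lists are the ones in the Resources. The sketch assumes exact, case-sensitive matching, which the text does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StopWordFilterSketch {
    // Returns a new list containing only the tokens that are not stop words.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical mini list mixing English words and one Arabic word (escaped, meaning "in").
        Set<String> stopWords = Set.of("the", "of", "\u0641\u064A");
        List<String> tokens = List.of("the", "profit", "of", "\u0641\u064A", "Tadawul");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Holding both languages' stop words in one hash set keeps each lookup constant-time and avoids having to detect the language of a token first.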
We used Snowball stemmers to stem each token. The stemmers for both Arabic and English words are provided in the Resources.
• We implemented an inverted index and stored it on the hard disk for faster retrieval. Each document’s date and time, in the form of a UNIX timestamp, serves as its ID. The index has the following structure:
Term : Object {
  Array [
    Document {
      Array [
        Term {
          Position,
          OriginalTerm,
          Type
        },
        Term {
          …
        },
      ]
    },
    …
    Document {
      …
    },
  ]
}
• Some documents do not have a date or time. Since we depend on those as identifiers, we handled the missing values by taking the current time and replacing its last three digits with a random number.
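The index layout and the fallback ID described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the project's actual code: all class, field, and method names are hypothetical, the index is kept in memory only (writing it to disk is omitted), the `Type` field is treated as an opaque label because the text does not define it, and `main` assumes a timestamp in milliseconds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

class InvertedIndexSketch {
    // One occurrence of a term in a document: Position, OriginalTerm, Type.
    static final class Occurrence {
        final int position;
        final String originalTerm;
        final String type; // meaning not specified in the text; kept as an opaque label

        Occurrence(int position, String originalTerm, String type) {
            this.position = position;
            this.originalTerm = originalTerm;
            this.type = type;
        }
    }

    // stemmed term -> (document ID -> occurrences of the term in that document)
    private final Map<String, Map<Long, List<Occurrence>>> index = new TreeMap<>();

    void add(String stemmedTerm, long documentId, Occurrence occurrence) {
        index.computeIfAbsent(stemmedTerm, t -> new LinkedHashMap<>())
             .computeIfAbsent(documentId, d -> new ArrayList<>())
             .add(occurrence);
    }

    Map<Long, List<Occurrence>> lookup(String stemmedTerm) {
        return index.getOrDefault(stemmedTerm, Map.of());
    }

    // Replaces the last three decimal digits of `now` with `random` (expected range 0..999).
    static long fallbackId(long now, int random) {
        return now - (now % 1000) + random;
    }

    // Assumption: the current time is taken in milliseconds.
    static long fallbackId() {
        return fallbackId(System.currentTimeMillis(), ThreadLocalRandom.current().nextInt(1000));
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        long id = fallbackId(); // document without a date or time
        idx.add("profit", id, new Occurrence(3, "profits", "EN")); // "EN" is a made-up Type value
        System.out.println(id + " -> " + idx.lookup("profit").get(id).size() + " occurrence(s)");
    }
}
```

Note that a random three-digit suffix makes collisions unlikely but not impossible, so a stricter variant could check whether the generated ID is already present before using it.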
Developed as part of a Computer Science MSc course
Supervisor: Dr. Mohammad Alsulmi
Course: CSC569: Selected Topics in AI
King Saud University
April 2021