04. Pre Processing Module

Lefteris Paraskevas edited this page Apr 24, 2016 · 1 revision

The text pre-processing procedure is divided into three tasks:

  • Tokenization (preprocessingmodule.nlp.Tokenizer.java)
  • Stopwords removal (preprocessingmodule.nlp.stopwords.StopWords.java)
  • Stemming (preprocessingmodule.nlp.stemming.Stemmer.java and language specific stemmers)

Tokenization uses the Stanford CoreNLP library; stemming uses the Apache Lucene Analysis library.
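
The three tasks run in sequence, each consuming the output of the previous one. The block below is a minimal sketch of that flow in plain Java, not the module's actual code: the class PipelineSketch and its method names are hypothetical, a regex split stands in for Stanford CoreNLP, and the stemmer is passed in as a function where the real module would call a Lucene-backed Stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical illustration; not the project's Tokenizer, StopWords or Stemmer classes.
class PipelineSketch {

    // Step 1: tokenization. A regex split on non-letter/non-digit runs
    // stands in for Stanford CoreNLP (assumption for this sketch).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 2: stopwords removal against an in-memory set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Step 3: stemming. The stemmer is supplied by the caller.
    static List<String> stem(List<String> tokens, UnaryOperator<String> stemmer) {
        List<String> stems = new ArrayList<>();
        for (String token : tokens) {
            stems.add(stemmer.apply(token));
        }
        return stems;
    }

    // Runs the three steps in order.
    static List<String> preprocess(String text, Set<String> stopWords,
                                   UnaryOperator<String> stemmer) {
        return stem(removeStopWords(tokenize(text), stopWords), stemmer);
    }
}
```

Passing the stemmer in as a parameter keeps the pipeline independent of any one language, which is the same separation the Stemmer interface provides in the module.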

1. How do I add a new language to the stemming procedure?

Implement the Stemmer interface in a new class and import the stemmer for your language from the Apache Lucene Analysis library. Then add the matching language codes, as described in question 2.
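
As a rough sketch of what such a class could look like: the Stemmer interface below is a hypothetical stand-in with a single stem(String) method (the project's real interface may declare different methods), and the toy suffix-stripping logic only marks the place where a real implementation would delegate to the Lucene stemmer for the new language.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the project's Stemmer interface;
// the real method names and signatures may differ.
interface Stemmer {
    String stem(String word);
}

// Toy stemmer for a new language: strips the longest matching suffix.
// A real implementation would call the Apache Lucene stemmer for that
// language here instead of this hand-written rule.
class SuffixStrippingStemmer implements Stemmer {
    private static final int MIN_STEM_LENGTH = 3;
    private final List<String> suffixes;

    SuffixStrippingStemmer(String... suffixes) {
        String[] sorted = suffixes.clone();
        // Longest suffix first, so "ings" is tried before "s".
        Arrays.sort(sorted, (a, b) -> Integer.compare(b.length(), a.length()));
        this.suffixes = Arrays.asList(sorted);
    }

    @Override
    public String stem(String word) {
        for (String suffix : suffixes) {
            int stemLength = word.length() - suffix.length();
            // Skip a suffix if removing it would leave too short a stem.
            if (word.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return word.substring(0, stemLength);
            }
        }
        return word;
    }
}
```

Code that depends only on the interface, for example `Stemmer stemmer = new SuffixStrippingStemmer("ing", "ings", "s");`, does not need to change when another language-specific class is added.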

2. How do I add new language codes if I implement a new language for stemming?

Update all three classes in the preprocessingmodule.language package: LangUtils.java, Language.java and LanguageCodes.java.
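
The contents of those three classes are not reproduced on this page, so the following is only an illustration of the general idea, assuming the codes are ISO 639-1 strings mapped to a language constant. The enum LanguageSketch and its fromIsoCode method are hypothetical and do not mirror the real LangUtils, Language or LanguageCodes classes.

```java
import java.util.Locale;

// Hypothetical enum tying a language constant to its ISO 639-1 code.
// It only illustrates the kind of mapping that has to be extended.
enum LanguageSketch {
    ENGLISH("en"),
    GREEK("el"),
    GERMAN("de"); // example of a newly added language

    private final String isoCode;

    LanguageSketch(String isoCode) {
        this.isoCode = isoCode;
    }

    String isoCode() {
        return isoCode;
    }

    // Looks up a language by its code, ignoring case and surrounding spaces.
    static LanguageSketch fromIsoCode(String code) {
        String normalized = code.trim().toLowerCase(Locale.ROOT);
        for (LanguageSketch language : values()) {
            if (language.isoCode.equals(normalized)) {
                return language;
            }
        }
        throw new IllegalArgumentException("Unsupported language code: " + code);
    }
}
```

In a layout like this, forgetting to register the new code shows up immediately as an IllegalArgumentException, which is one reason to keep the code list and the lookup in step.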

3. How do I add new stopwords to be removed during the stopwords removal procedure?

There are two ways:

  • Update an existing stopwords file manually: go to /src/resources/stop-words/ and edit the file for the language you want.
  • Add a new file to the stopwords directory. The file must be a UTF-8 .txt file with one stopword per line.

If you follow the second approach, you must add the new file's name to the loadStopWords() method of the preprocessingmodule.nlp.stopwords.StopWords.java class. Rather than hard-coding the name, it is recommended to store it as a value in the config.properties file and add a matching getter method to the utilities.Config.java class.
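
The second approach can be sketched as follows. This is not the real loadStopWords() or Config code: the class StopWordsLoaderSketch, its methods and the property key stopwords.custom.file are all assumptions made for illustration. The sketch reads the file name from a Properties object and loads a UTF-8 file with one stopword per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper; the project's StopWords and Config classes may differ.
class StopWordsLoaderSketch {

    // Assumed property key; use whatever name you add to config.properties.
    static final String FILE_KEY = "stopwords.custom.file";

    // Getter-style lookup of the configured file name, resolved against
    // the stopwords directory.
    static Path resolve(Properties config, Path stopWordsDir) {
        String fileName = config.getProperty(FILE_KEY);
        if (fileName == null) {
            throw new IllegalStateException("Missing property: " + FILE_KEY);
        }
        return stopWordsDir.resolve(fileName);
    }

    // Reads a UTF-8 file with one stopword per line, skipping blank lines.
    static Set<String> load(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }
}
```

Keeping the file name in config.properties means a new stopwords list can be swapped in without recompiling; only the property value changes.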