A Java implementation of the Rapid Automatic Keyword Extraction (RAKE) framework

RAKE-Java

A Java 8 implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.
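For orientation, the scoring step of RAKE as described by Rose et al. can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of this library: the class `RakeScoreSketch` and the method `scorePhrases` are hypothetical names, the candidate phrases are assumed to be already extracted, and the degree/frequency ratio is used as the word score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class; it is not part of RAKE-Java.
class RakeScoreSketch {

    /**
     * Scores each candidate phrase as the sum of its word scores,
     * where a word's score is degree(word) / frequency(word).
     */
    public static Map<String, Double> scorePhrases(List<List<String>> phrases) {
        Map<String, Integer> freq = new HashMap<>();
        Map<String, Integer> degree = new HashMap<>();
        for (List<String> phrase : phrases) {
            for (String word : phrase) {
                // frequency: number of occurrences of the word in all candidates
                freq.merge(word, 1, Integer::sum);
                // degree: each occurrence adds the length of the phrase it is in
                // (the word itself plus the words it co-occurs with)
                degree.merge(word, phrase.size(), Integer::sum);
            }
        }
        // LinkedHashMap keeps the phrases in their order of first appearance
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> phrase : phrases) {
            double score = 0.0;
            for (String word : phrase) {
                score += (double) degree.get(word) / freq.get(word);
            }
            scores.put(String.join(" ", phrase), score);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> phrases = new ArrayList<>();
        phrases.add(Arrays.asList("linear", "constraints"));
        phrases.add(Arrays.asList("linear", "diophantine", "equations"));
        // "linear" has frequency 2 and degree 5, so its word score is 2.5
        System.out.println(scorePhrases(phrases));
    }
}
```

In this sketch, words that appear in long candidates receive a higher degree, so longer phrases tend to score higher; whether this library applies further filtering (for example with the POS tags) is not covered here.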

The implementation is based on the Python one at https://github.com/aneesha/RAKE, with some changes. The source code is released under the GPL v3 license.

If you want to use it with Maven, add this repository to your pom.xml:

    <repository>
        <id>galan-maven-repo</id>
        <name>galan-maven-repo-releases</name>
        <url>http://galan.ehu.es/artifactory/ext-release-local</url>
    </repository>

This implementation requires a POS tagger in order to work. For English, for example, the Illinois POS tagger can be used:

http://cogcomp.cs.illinois.edu/page/software_view/POS

For Spanish or other languages:

FreeLing --> http://nlp.lsi.upc.edu/freeling/

or Stanford POS Tagger --> http://nlp.stanford.edu/software/tagger.shtml

The implementation is in beta.

TODO:

 - More testing 

The following example parser for English provides the required data (using the Illinois POS Tagger):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import LBJ2.nlp.SentenceSplitter;
    import LBJ2.nlp.WordSplitter;
    import LBJ2.nlp.seg.PlainToTokenParser;
    import LBJ2.parse.Parser;
    import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
    import edu.illinois.cs.cogcomp.lbj.pos.POSTagger;
    import edu.ehu.galan.cvalue.model.Token;
    ......

    // pFile is the path of the plain-text file to parse
    List<LinkedList<Token>> tokenizedSentenceList = new ArrayList<>();
    List<String> sentenceList = new ArrayList<>();
    POSTagger tagger = new POSTagger();
    Chunker chunker = new Chunker();
    boolean first = true;
    Parser parser = new PlainToTokenParser(new WordSplitter(new SentenceSplitter(pFile)));
    String sentence = "";
    LinkedList<Token> tokenList = null;
    for (LBJ2.nlp.seg.Token word = (LBJ2.nlp.seg.Token) parser.next(); word != null;
            word = (LBJ2.nlp.seg.Token) parser.next()) {
        String chunked = chunker.discreteValue(word);
        // POS-tag the word; the tag is read from word.partOfSpeech below
        tagger.discreteValue(word);
        if (first) {
            // First word of a sentence: start a new token list
            tokenList = new LinkedList<>();
            tokenizedSentenceList.add(tokenList);
            first = false;
        }
        tokenList.add(new Token(word.form, word.partOfSpeech, null, chunked));
        sentence = sentence + " " + (word.form);
        if (word.next == null) {
            // Last word of the sentence: store it without the leading space
            sentenceList.add(sentence.trim());
            first = true;
            sentence = "";
        }
    }
    parser.reset();
     

RAKE can then be run on the parsed document:

    // Build the document from the sentence and token lists produced by the parser
    // (full_path and name stand for the document's path and name)
    Document doc = new Document(full_path, name);
    doc.setSentenceList(sentenceList);
    doc.setTokenList(tokenizedSentenceList);

    // A reader class such as PlainTextDocumentReaderLBJEn (not shown here)
    // can be used to read the source instead:
    // PlainTextDocumentReaderLBJEn reader = new PlainTextDocumentReaderLBJEn();
    // reader.readSource("testCorpus/textAstronomy");

    // Load the stop lists, then run RAKE on the document
    RakeAlgorithm ex = new RakeAlgorithm();
    ex.loadStopWordsList("resources/lite/stopWordLists/RakeStopLists/SmartStopListEn");
    ex.loadPunctStopWord("resources/lite/stopWordLists/RakeStopLists/RakePunctDefaultStopList");
    ex.init(doc);
    ex.runAlgorithm();
    doc.getTermList();
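
To illustrate why the stop word and punctuation lists matter: in RAKE as described by Rose et al., candidate keywords are the runs of contiguous words that remain after the text is cut at stop words and phrase delimiters. The following is a rough sketch of that step only. The class `RakeCandidateSketch`, the method `candidatePhrases`, the small stop word set and the hard-coded punctuation pattern are all assumptions made for illustration; they do not show how this library uses its stop lists or the POS tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch class; it is not part of RAKE-Java.
class RakeCandidateSketch {

    /**
     * Cuts the text at punctuation and at stop words and returns the
     * remaining runs of contiguous words as candidate phrases.
     */
    public static List<List<String>> candidatePhrases(String text, Set<String> stopWords) {
        List<List<String>> phrases = new ArrayList<>();
        // Punctuation marks act as phrase delimiters (assumed, simplified set)
        for (String fragment : text.toLowerCase().split("[.,;:!?()\"]+")) {
            List<String> current = new ArrayList<>();
            for (String word : fragment.trim().split("\\s+")) {
                if (word.isEmpty() || stopWords.contains(word)) {
                    // A stop word closes the candidate collected so far
                    if (!current.isEmpty()) {
                        phrases.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(word);
                }
            }
            // The end of a fragment also closes the current candidate
            if (!current.isEmpty()) {
                phrases.add(current);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Tiny stop word set for the example; a real run would use a full list
        Set<String> stopWords = new HashSet<>(Arrays.asList("of", "over", "the"));
        String text = "Compatibility of systems of linear constraints over the set of natural numbers.";
        // prints [[compatibility], [systems], [linear, constraints], [set], [natural, numbers]]
        System.out.println(candidatePhrases(text, stopWords));
    }
}
```

A larger stop list cuts the text into shorter candidates, which is why the choice of list (such as the SMART list loaded above) affects the extracted terms.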