This application was implemented as part of my bachelor's thesis.
Full text: https://is.muni.cz/th/vv9kb/
The application is written in Java 1.8 and built with Maven. A JAR file with dependencies can be found in the executables folder.
The application calls native C++ code from Java so that it can use Majka, which is implemented in C++, efficiently.
All C++ code is located in the majka4j directory, together with a CMake configuration for compilation. A precompiled version of the library is also provided in the file libmajkaj.so.
If this file does not work for you, you need to compile the sources yourself.
To use this library, you need to specify the path to it when running the application:
java -jar -Djava.library.path=majka4j
Alternatively, put the library on the Java library path.
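
If the library fails to load, a small diagnostic along the following lines may help to check what the JVM actually sees. This is only a sketch and not part of the application: the class name `NativeLibraryCheck` is hypothetical, and the library name `majkaj` is an assumption inferred from the file name libmajkaj.so.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NativeLibraryCheck {

    /** Returns the directories on the given library path that contain the named file. */
    static List<String> findLibrary(String libraryPath, String fileName) {
        List<String> hits = new ArrayList<>();
        if (libraryPath == null || libraryPath.isEmpty()) {
            return hits;
        }
        for (String dir : libraryPath.split(Pattern.quote(File.pathSeparator))) {
            if (new File(dir, fileName).isFile()) {
                hits.add(dir);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumption: the library is loaded under the name "majkaj".
        // mapLibraryName adds the platform prefix and suffix (lib...so on Linux).
        String fileName = System.mapLibraryName("majkaj");
        String path = System.getProperty("java.library.path");

        System.out.println("java.library.path = " + path);
        System.out.println(fileName + " found in: " + findLibrary(path, fileName));

        try {
            System.loadLibrary("majkaj");
            System.out.println("Library loaded.");
        } catch (UnsatisfiedLinkError e) {
            // Thrown when the file is missing or was built for another platform.
            System.out.println("Library could not be loaded: " + e.getMessage());
        }
    }
}
```

Running it with the same `-Djava.library.path=majka4j` option as the application shows whether the directory is picked up. If the file is found but loading still fails, the precompiled binary probably does not match the platform and the sources need to be compiled locally.
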
The application works as a command-line tool. The whole process of clustering and extracting keywords is split into modules, each with its own arguments.
Param | Description |
---|---|
-h, --help | Print a list of all parameters |
-helpm | Print a list of parameters for the specified module. The module name can be downloader, preprocessing, clustering, linkmining, or keywords |
-L, --console-Log | Print logs into standard output |
-l , --log | Save logs into a file |
-d , --dir | Specify the directory for saving and retrieving data used by the application |
Param | Description |
---|---|
-downloader | Run the download module |
-u , --url | Starting domain for the web crawler |
-hops | Maximum number of domain hops. Default value: 0 |
-depth | Maximum crawling depth. Default value: 1 |
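
How the two limits might interact can be sketched with a breadth-first traversal over an in-memory link graph. This is an illustration under stated assumptions, not the crawler used by the application: it assumes the start page has depth 0, that each followed link adds 1 to the depth, and that a hop is counted whenever a link leads to a different host. The class `BoundedCrawlSketch` and its methods are hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BoundedCrawlSketch {

    /** One queue entry: a URL plus its link depth and domain hops from the start. */
    static final class Visit {
        final String url;
        final int depth;
        final int hops;

        Visit(String url, int depth, int hops) {
            this.url = url;
            this.depth = depth;
            this.hops = hops;
        }
    }

    static String host(String url) {
        return URI.create(url).getHost();
    }

    /** Returns the URLs visited in breadth-first order within both limits. */
    static List<String> crawl(Map<String, List<String>> links, String start,
                              int maxDepth, int maxHops) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Visit> queue = new ArrayDeque<>();
        queue.add(new Visit(start, 0, 0));
        seen.add(start);

        while (!queue.isEmpty()) {
            Visit v = queue.poll();
            order.add(v.url);
            if (v.depth == maxDepth) {
                continue; // depth limit reached, do not follow links from here
            }
            for (String next : links.getOrDefault(v.url, Collections.<String>emptyList())) {
                // Assumption: moving to a different host costs one hop.
                int hops = host(next).equals(host(v.url)) ? v.hops : v.hops + 1;
                if (hops <= maxHops && seen.add(next)) {
                    queue.add(new Visit(next, v.depth + 1, hops));
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("https://a.example/",
                Arrays.asList("https://a.example/x", "https://b.example/"));
        links.put("https://a.example/x",
                Arrays.asList("https://a.example/y"));

        // The documented defaults: -hops 0 -depth 1
        System.out.println(crawl(links, "https://a.example/", 1, 0));
        // Looser limits also reach the second host and the deeper page.
        System.out.println(crawl(links, "https://a.example/", 2, 1));
    }
}
```

Under these assumptions the default values keep the crawl on the starting domain and stop after the pages linked directly from the start page.
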
Param | Description |
---|---|
-preprocessing | Run the preprocessing module |
-v , --vocabulary | Vocabulary size. Default value: 2000 |
-redundant | Redundant word percentage. Default value: 0.3 |
-pruning | Pruning rate. Default value: 0 |
Param | Description |
---|---|
-clustering | Run the clustering module |
-alpha | Scaling parameter for word to document distribution. Default value: 1.0 |
-beta | Scaling parameter for word to topic distribution. Default value: 0.5 |
-gamma | Scaling parameter for topics. Default value: 1.5 |
Param | Description |
---|---|
-linkmining | Run the link mining module |
-distance | Type of distance comparison used in link mining: mean or average |
Param | Description |
---|---|
-keywords | Run the keyword extraction module |
-w , --words | Number of words extracted for each cluster |
-skiptr | Skip TextRank algorithm and use previously saved results |
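
The parameters above could be combined into one invocation per module, for example as in the following sketch. Several things here are assumptions rather than documented behaviour: the JAR file name, the data directory, the chosen values for `-u` and `-w`, and the idea that the modules are run one after another in the order in which `-helpm` lists them. The class `ModuleCommandSketch` is hypothetical and only prints the command lines; it does not start the application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ModuleCommandSketch {

    /** Builds the argument list for one module run; jarPath and dataDir are placeholders. */
    static List<String> command(String jarPath, String dataDir, String... moduleArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Djava.library.path=majka4j"); // JVM options must precede the JAR path
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-d");
        cmd.add(dataDir);
        cmd.addAll(Arrays.asList(moduleArgs));
        return cmd;
    }

    public static void main(String[] args) {
        String jar = "executables/app.jar"; // hypothetical file name
        String dir = "data";                // hypothetical data directory

        // Assumed order of the modules, using the documented default values.
        List<List<String>> steps = Arrays.asList(
                command(jar, dir, "-downloader", "-u", "https://example.org",
                        "-hops", "0", "-depth", "1"),
                command(jar, dir, "-preprocessing", "-v", "2000",
                        "-redundant", "0.3", "-pruning", "0"),
                command(jar, dir, "-clustering", "-alpha", "1.0",
                        "-beta", "0.5", "-gamma", "1.5"),
                command(jar, dir, "-linkmining", "-distance", "mean"),
                command(jar, dir, "-keywords", "-w", "10"));

        for (List<String> step : steps) {
            System.out.println(String.join(" ", step));
        }
    }
}
```

Each returned list can be passed to `java.lang.ProcessBuilder` if the steps should be launched from Java rather than typed into a shell.
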