
GreenManSK/KGS


Web document clustering and keyword generating using link mining

This application was implemented as part of my bachelor's thesis.
Full text: https://is.muni.cz/th/vv9kb/

The application is implemented in Java 1.8 and built with Maven. A JAR file with dependencies can be found in the executables folder.

C++ integration

This application uses Java's C++ integration so that it can call Majka, which is implemented in C++, efficiently. All C++ code is located in the majka4j directory, together with the CMake files needed for compilation. A precompiled version of the library is also provided in the file libmajkaj.so. If this file does not work for you, you need to compile the sources yourself.

To use this library, you need to specify its path when running the application:

java -jar -Djava.library.path=majka4j

Alternatively, put the library into a directory that is already on the Java library path.
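
If the library fails to load, it can help to check whether the file is actually visible on the configured library path. The following is a minimal sketch, not part of the application: the class name NativeLibraryLocator is hypothetical, and the library name "majkaj" is an assumption derived from the file name libmajkaj.so.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Looks for a native library file in the directories of a library search path. */
class NativeLibraryLocator {

    /** Platform-specific file name for a library name (on Linux, "majkaj" maps to "libmajkaj.so"). */
    static String fileNameFor(String libraryName) {
        return System.mapLibraryName(libraryName);
    }

    /** Returns the first location in searchPath that contains the library file, if any. */
    static Optional<Path> locate(String libraryName, String searchPath) {
        String fileName = fileNameFor(libraryName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate.toAbsolutePath());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "majkaj" is assumed from the file name libmajkaj.so
        String searchPath = System.getProperty("java.library.path", "");
        Optional<Path> found = locate("majkaj", searchPath);
        System.out.println(found.map(p -> "Found " + p)
                .orElse("Not found: " + fileNameFor("majkaj") + " in " + searchPath));
    }
}
```

Running this class with the same -Djava.library.path value as the application shows which directory, if any, provides the library.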

Usage

This application works as a command-line tool. The whole process of clustering and extracting keywords is split into modules, each with its own arguments.

Common arguments

Param Description
-h, --help Print a list of all parameters
-helpm Print a list of parameters for the specified module. Module name can be downloader, preprocessing, clustering, linkmining, keywords
-L, --console-Log Print logs into standard output
-l , --log Save logs into a file
-d , --dir Specify the directory for saving and retrieving data used by the application

Download module

Param Description
-downloader Run the download module
-u , --url Starting domain for the web crawler
-hops The maximal number of domain hops. Default value: 0
-depth The maximal depth of crawling. Default value: 1

Preprocessing module

Param Description
-preprocessing Run the preprocessing module
-v , --vocabulary Vocabulary size. Default value: 2000
-redundant Redundant word percentage. Default value: 0.3
-pruning Pruning rate. Default value: 0

Clustering module

Param Description
-clustering Run the clustering module
-alpha Scaling parameter for word to document distribution. Default value: 1.0
-beta Scaling parameter for word to topic distribution. Default value: 0.5
-gamma Scaling parameter for topics. Default value: 1.5

Link mining module

Param Description
-linkmining Run the link mining module
-distance Type of distance comparison used in link mining: mean or average

Keyword extraction module

Param Description
-keywords Run the keyword extraction module
-w , --words Number of words extracted for each cluster
-skiptr Skip TextRank algorithm and use previously saved results
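
The modules above can be chained into one pipeline that shares a data directory. The sketch below only assembles and prints the command lines; it does not run them. Several things in it are assumptions rather than documented behavior: that each module is started in its own invocation, that the modules run in the order listed in this document, that each value follows its flag as a separate argument, and that "mean" is the literal value accepted by -distance. The class name KgsPipelineSketch, the JAR path executables/kgs.jar, the data directory and the value 10 for -w are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Builds one command line per module, all sharing the same data directory. */
class KgsPipelineSketch {

    /** Assembles: java -Djava.library.path=LIB -jar JAR -d DATA followed by the module arguments. */
    static List<String> command(String jarPath, String libDir, String dataDir, String... moduleArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Djava.library.path=" + libDir);
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-d");
        cmd.add(dataDir);
        cmd.addAll(Arrays.asList(moduleArgs));
        return cmd;
    }

    /** One command per module, in the order the modules are described (an assumption). */
    static List<List<String>> pipeline(String jarPath, String libDir, String dataDir, String startUrl) {
        List<List<String>> steps = new ArrayList<>();
        steps.add(command(jarPath, libDir, dataDir, "-downloader", "-u", startUrl, "-hops", "0", "-depth", "1"));
        steps.add(command(jarPath, libDir, dataDir, "-preprocessing", "-v", "2000"));
        steps.add(command(jarPath, libDir, dataDir, "-clustering", "-alpha", "1.0", "-beta", "0.5", "-gamma", "1.5"));
        steps.add(command(jarPath, libDir, dataDir, "-linkmining", "-distance", "mean"));
        steps.add(command(jarPath, libDir, dataDir, "-keywords", "-w", "10")); // 10 is an arbitrary example value
        return steps;
    }

    public static void main(String[] args) {
        // Hypothetical JAR path, data directory and start URL; adjust to your setup.
        for (List<String> step : pipeline("executables/kgs.jar", "majka4j", "data", "https://example.com")) {
            System.out.println(String.join(" ", step));
        }
    }
}
```

Each printed line can be pasted into a shell, or each list passed to java.lang.ProcessBuilder to run the step from Java.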
