Web document clustering and keyword generating using link mining

This application was implemented as part of my bachelor's thesis.
Full text: https://is.muni.cz/th/vv9kb/

The application is implemented in Java 1.8 and built with Maven. A JAR file with dependencies can be found in the executables folder.

C++ integration

This application integrates C++ code with Java so that it can use Majka, which is implemented in C++, efficiently. All C++ code is located in the majka4j directory, together with a CMake configuration for compiling it. A precompiled version of the library is provided in the file libmajkaj.so. If this file does not work on your system, you need to compile the sources yourself.

To use this library, specify its path when running the application:

java -jar -Djava.library.path=majka4j

Alternatively, place the library in a directory that is already on the Java library path.
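If the native library fails to load, it can help to check whether the file is visible on the library path at all. The following is a minimal sketch of such a check, not part of the application itself: the class name NativeLibraryCheck and the method findLibrary are hypothetical, and the only name taken from this README is the file libmajkaj.so.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical diagnostic helper; not part of the application.
public class NativeLibraryCheck {

    /** Searches each directory of a path list (formatted like java.library.path) for fileName. */
    static Optional<Path> findLibrary(String libraryPath, String fileName) {
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path", "");
        // "libmajkaj.so" is the precompiled file name mentioned in this README.
        Optional<Path> found = findLibrary(libraryPath, "libmajkaj.so");
        System.out.println(found
                .map(p -> "Found: " + p.toAbsolutePath())
                .orElse("libmajkaj.so not found on java.library.path: " + libraryPath));
    }
}
```

Running it with the same -Djava.library.path value that is passed to the application shows whether the JVM would be able to see the file.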

Usage

This application works as a command-line tool. The whole process of clustering and extracting keywords is split into modules, each with its own arguments.

Common arguments

Param	Description
-h, --help	Print a list of all parameters
-helpm	Print a list of parameters for the specified module. Module name can be downloader, preprocessing, clustering, linkmining, keywords
-L, --console-Log	Print logs into standard output
-l , --log	Save logs into a file
-d , --dir	Specify the directory for saving and retrieving data used by the application

Download module

Param	Description
-downloader	Run the download module
-u , --url	Starting domain for the web crawler
-hops	The maximal number of domain hops. Default value: 0
-depth	The maximal depth of crawling. Default value: 1

Preprocessing module

Param	Description
-preprocessing	Run the preprocessing module
-v , --vocabulary	Vocabulary size. Default value: 2000
-redundant	Redundant word percentage. Default value: 0.3
-pruning	Pruning rate. Default value: 0

Clustering module

Param	Description
-clustering	Run the clustering module
-alpha	Scaling parameter for word to document distribution. Default value: 1.0
-beta	Scaling parameter for word to topic distribution. Default value: 0.5
-gamma	Scaling parameter for topics. Default value: 1.5

Link mining module

Param	Description
-linkmining	Run the link mining module
-distance	Type of distance comparison used in link mining: mean or average

Keyword extraction module

Param	Description
-keywords	Run the keyword extraction module
-w , --words	Number of words extracted for each cluster
-skiptr	Skip TextRank algorithm and use previously saved results
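To illustrate how the common and module-specific arguments might be combined, here is a rough sketch that assembles one command line per module and prints it without running anything. It rests on several assumptions not confirmed by this README: that each module is run in a separate invocation, that modules are run in the order they are listed above, and that an option's value is passed as the next argument. The jar path executables/app.jar, the data directory, the URL, and the class PipelineCommand are all hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that only builds and prints command lines.
public class PipelineCommand {

    /** Builds the command for one module run; jarPath and dataDir are supplied by the caller. */
    static List<String> build(String jarPath, String dataDir, String... moduleArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Djava.library.path=majka4j"); // native library location, as in this README
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-d");                          // common argument: shared data directory
        cmd.add(dataDir);
        cmd.addAll(Arrays.asList(moduleArgs));
        return cmd;
    }

    public static void main(String[] args) {
        String jar = "executables/app.jar"; // assumption: the real jar name is not given here
        String dir = "data";                // assumption: any writable directory
        List<List<String>> steps = Arrays.asList(
                build(jar, dir, "-downloader", "-u", "https://example.com", "-depth", "2"),
                build(jar, dir, "-preprocessing", "-v", "2000"),
                build(jar, dir, "-clustering", "-alpha", "1.0", "-beta", "0.5", "-gamma", "1.5"),
                build(jar, dir, "-linkmining", "-distance", "mean"),
                build(jar, dir, "-keywords", "-w", "10"));
        for (List<String> step : steps) {
            System.out.println(String.join(" ", step));
        }
    }
}
```

Using the same -d directory in every step reflects the description of that option as the place where the application saves and retrieves its data. Run the application with -h or -helpm to confirm the exact argument syntax before relying on this layout.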
