Samuele95 edited this page Jun 10, 2024 · 3 revisions

webcat

Description

WebCat is a project for the automated discovery and classification of websites by content similarity. It follows an unsupervised learning approach, with models trained as needed. The pipeline starts with web crawling and web scraping, which discover web pages and acquire their textual content. A neural network then vectorizes that content, and a clustering algorithm classifies the results. Specifically, vectorization is performed by BERT, a transformer-based neural network, while clustering is performed by a Self-Organizing Map (SOM), a neural network trained by unsupervised learning.
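The clustering step can be illustrated with a minimal sketch of a Self-Organizing Map in plain Python. This is not WebCat's implementation: the function names (`train_som`, `best_matching_unit`), the linear decay schedules, and the toy 2-dimensional vectors are assumptions made for illustration. In the real pipeline the inputs would be BERT embeddings of scraped page text, which have far more dimensions.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vector):
    """Return the grid cell whose weight vector is closest to `vector`."""
    return min(weights, key=lambda cell: squared_distance(weights[cell], vector))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns a dict mapping (row, col) to weights."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    sigma0 = max(rows, cols) / 2.0  # initial neighbourhood radius (assumed)
    # One randomly initialised weight vector per grid cell.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for vector in order:
            # Learning rate and radius shrink linearly over training.
            decay = 1.0 - step / total_steps
            lr = lr0 * decay
            sigma = sigma0 * decay + 1e-3
            bmu = best_matching_unit(weights, vector)
            for cell, w in weights.items():
                # Cells near the winner on the grid are pulled more strongly.
                grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
            step += 1
    return weights


if __name__ == "__main__":
    # Hypothetical stand-ins for page embeddings: two loose groups.
    pages = {
        "site-a": [0.00, 0.10], "site-b": [0.10, 0.00], "site-c": [0.05, 0.05],
        "site-d": [0.90, 1.00], "site-e": [1.00, 0.90], "site-f": [0.95, 0.95],
    }
    som = train_som(list(pages.values()), rows=1, cols=2)
    for name, vec in pages.items():
        print(name, "->", best_matching_unit(som, vec))
```

Each page is assigned to the grid cell of its best matching unit, so pages with similar vectors tend to land in the same or neighbouring cells. The grid size, number of epochs, and learning rate here are arbitrary choices for a toy input.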

[Image: WebCat GUI]

Business case

This project stems from the need to automate the classification of web pages using algorithms that search, vectorize, and catalog their content. An essential element of the project is its microservices architecture, which is a competitive advantage because components can be swapped or updated as new technical and scientific findings emerge. In particular, machine learning algorithms and models beyond those used in the release associated with the paper can be adopted later, so the product can keep developing both theoretically and technologically, provided the software is maintained. The microservices architecture was also chosen because vectorizing a large number of websites and training the SOM neural network can require substantial computing resources. This calls for a distributed computation model, possibly running on cloud computing resources, that parallelizes the work and divides the load across hardware components. It also means that users do not need to own particularly high-performance computers or networks.
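The interchangeability of components can be sketched as a pipeline that depends only on small interfaces. Everything named here (`Vectorizer`, `Clusterer`, `LengthVectorizer`, `ThresholdClusterer`, `classify`) is hypothetical and runs in a single process; in WebCat the components are separate microservices, and the toy classes below merely stand in for the BERT and SOM services.

```python
from typing import Protocol, Sequence


class Vectorizer(Protocol):
    """Anything that turns page text into a numeric vector."""
    def vectorize(self, text: str) -> Sequence[float]: ...


class Clusterer(Protocol):
    """Anything that maps a vector to a cluster identifier."""
    def assign(self, vector: Sequence[float]) -> int: ...


class LengthVectorizer:
    """Toy stand-in for a vectorization service (not BERT)."""
    def vectorize(self, text: str) -> Sequence[float]:
        words = text.split()
        # Features: word count and total character count.
        return [float(len(words)), float(sum(len(w) for w in words))]


class ThresholdClusterer:
    """Toy stand-in for a clustering service (not a SOM)."""
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def assign(self, vector: Sequence[float]) -> int:
        # Cluster 0 for short pages, cluster 1 for longer ones.
        return 0 if vector[0] < self.threshold else 1


def classify(pages: dict, vectorizer: Vectorizer, clusterer: Clusterer) -> dict:
    """Run each page through whichever components were supplied."""
    return {url: clusterer.assign(vectorizer.vectorize(text))
            for url, text in pages.items()}


if __name__ == "__main__":
    pages = {"a": "one two", "b": "one two three four five"}
    print(classify(pages, LengthVectorizer(), ThresholdClusterer(3)))
```

Because `classify` only relies on the two interfaces, a different vectorizer or clusterer can be substituted without changing the rest of the pipeline, which is the property the microservices design aims for.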
