HTML Clustering - Classifying HTML Pages Using DBSCAN and CLIP

Description

A Python application for clustering HTML pages based on their layout, using three different methods:

- CLIP (the weakest, based only on image similarity)
- DBSCAN (density-based clustering driven by the distance between points)
- DBSCAN + CLIP (the strongest and most accurate method)

The graphical interface is built with Tkinter and kept deliberately simple.

Project Structure

The HTML pages are organized into 4 tiers, and each tier is clustered separately.
3 available clustering methods:
1. DBSCAN
2. CLIP (layout-based)
3. DBSCAN + CLIP (refining clusters with additional filtering)

Details About the Methods

1. Density-Based Clustering with DBSCAN

Epsilon (eps) was chosen by:
- Manually inspecting the k-th nearest neighbor distance graph and identifying a sudden jump.
- Final value: eps = 1
min_samples:
- Set to 2, because from the user's perspective two similar HTML pages are enough to form a cluster.
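
The eps selection described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the project's code: the helper names `k_distances` and `largest_jump` and the 2-D sample points are hypothetical, and the README does not state which features the pages are reduced to. It assumes k = min_samples - 1 = 1, since scikit-learn's `min_samples` counts the point itself.

```python
import math

def k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

def largest_jump(sorted_values):
    """Return the (lower, upper) values on either side of the biggest gap."""
    gaps = [(b - a, a, b) for a, b in zip(sorted_values, sorted_values[1:])]
    _, lower, upper = max(gaps)
    return lower, upper

# Hypothetical feature vectors: two tight groups of pages and one isolated page.
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0),
          (10.0, 10.0), (10.0, 10.5), (10.5, 10.0),
          (50.0, 50.0)]

curve = k_distances(points, k=1)
lower, upper = largest_jump(curve)
print(f"jump between {lower:.2f} and {upper:.2f}")
```

Any eps between the two printed values separates the dense groups from the isolated page; in this toy data, eps = 1 falls inside that range. The chosen value would then be passed to something like `sklearn.cluster.DBSCAN(eps=1, min_samples=2)`.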

2. Layout Similarity Clustering Using CLIP

A. CLIP only:
- Generate embeddings for the page screenshots using the CLIP model.
- Apply K-Means clustering on the embeddings.
B. DBSCAN + CLIP:
- First, apply DBSCAN for a rough clustering.
- Then, for each cluster, verify the similarity between each screenshot and the average layout using a custom check_clusters() function.
- This greatly improves precision and filters out outliers.
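
The refinement step in B can be illustrated as follows. This is only a sketch of the idea, assuming that "average layout" means the mean embedding of a cluster and that similarity is cosine similarity; the actual `check_clusters()` signature and threshold are not documented here, so `split_by_centroid_similarity`, the threshold of 0.8, and the toy 3-D vectors are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def split_by_centroid_similarity(embeddings, threshold):
    """Split one cluster into (kept, rejected) page names.

    A page is kept when its embedding is at least `threshold`
    similar to the cluster's mean embedding.
    """
    centroid = mean_vector(list(embeddings.values()))
    kept, rejected = [], []
    for name, vec in embeddings.items():
        target = kept if cosine_similarity(vec, centroid) >= threshold else rejected
        target.append(name)
    return kept, rejected

# Toy stand-ins for CLIP embeddings of one DBSCAN cluster (hypothetical values).
cluster = {
    "a.html": [1.0, 0.0, 0.0],
    "b.html": [0.9, 0.1, 0.0],
    "c.html": [1.0, 0.1, 0.0],
    "d.html": [0.0, 0.0, 1.0],
}
kept, rejected = split_by_centroid_similarity(cluster, threshold=0.8)
print(kept, rejected)
```

Pages in `rejected` could be treated as outliers or reassigned; which of the two the project does is not stated in this README.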

Additional Features

A. Automatic screenshot generation for all HTML pages using Selenium and ChromeDriver.
B. Exporting clusters into grid images for easy visual validation.
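
As an illustration of the grid export in B, the layout arithmetic might look like the sketch below. The function name `grid_layout`, the column count, and the tile size are assumptions made for the example, not values taken from the project.

```python
import math

def grid_layout(n_images, columns, tile_w, tile_h):
    """Return the canvas size and the top-left corner of every tile.

    Tiles are placed left to right, top to bottom.
    """
    rows = math.ceil(n_images / columns)
    canvas_size = (columns * tile_w, rows * tile_h)
    positions = [((i % columns) * tile_w, (i // columns) * tile_h)
                 for i in range(n_images)]
    return canvas_size, positions

# Hypothetical cluster of 5 screenshots, 3 per row, each scaled to 320x240.
canvas_size, positions = grid_layout(5, columns=3, tile_w=320, tile_h=240)
print(canvas_size, positions)
```

With Pillow, the canvas could be created via `Image.new("RGB", canvas_size)` and each resized screenshot pasted at its position with `Image.paste`.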

How to Run the Application

Install the required libraries:
```
pip install -r requirements.txt
```
