A Python application for clustering HTML pages based on their layout, using three different methods:
- CLIP (the weakest, based only on image similarity)
- DBSCAN (clustering based on distance between points)
- DBSCAN + CLIP (the strongest and most accurate method)
The graphical interface is built with Tkinter and kept simple and intuitive.
- 4 tiers of HTML pages, each classified separately.
- 3 available clustering methods:
  - DBSCAN
  - CLIP (layout-based)
  - DBSCAN + CLIP (refining clusters with additional filtering)
- Epsilon (`eps`) was chosen by:
  - Manually inspecting the k-th nearest neighbor distance graph and identifying a sudden jump.
  - Final value: `eps = 1`
- `min_samples`:
  - Set to 2, because two similar HTML pages are enough to form a cluster from the user's perspective, as the task requires.
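
The parameter choice above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the `k_distances` helper and the toy `features` array are hypothetical, and using `k = min_samples - 1` is an assumption based on scikit-learn counting the point itself toward `min_samples`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def k_distances(features, k):
    """Return the sorted distance from each point to its k-th nearest neighbor."""
    # Ask for k + 1 neighbors: each point is its own nearest neighbor at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    distances, _ = nn.kneighbors(features)
    return np.sort(distances[:, k])


# Toy feature vectors (hypothetical): two tight groups plus one isolated point.
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

# Plotting this sorted curve and looking for a sudden jump suggests a value for eps.
curve = k_distances(features, k=1)

# Cluster with the values from this README; label -1 marks noise points.
labels = DBSCAN(eps=1, min_samples=2).fit_predict(features)
print(curve)
print(labels)
```

In this toy data the curve stays flat for the six grouped points and jumps sharply at the isolated one, so any `eps` between the flat part and the jump separates the two groups and leaves the outlier as noise.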
- A. CLIP only:
  - Generate embeddings for the page screenshots using the CLIP model.
  - Apply K-Means clustering on the embeddings.
- B. DBSCAN + CLIP:
  - First, apply DBSCAN for a rough clustering.
  - Then, for each cluster, verify the similarity between each screenshot and the average layout using a custom `check_clusters()` function.
  - This greatly improves precision and filters out outliers.
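
The refinement idea in method B could look something like the sketch below. It is not the project's `check_clusters()` implementation: the function name `filter_cluster_by_centroid`, the use of cosine similarity against the mean embedding, and the `0.9` threshold are all assumptions chosen for illustration.

```python
import numpy as np


def filter_cluster_by_centroid(embeddings, threshold=0.9):
    """Flag which members of one cluster are close to the cluster's average embedding.

    Hypothetical helper: assumes every embedding is a non-zero vector.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Normalize each embedding so dot products become cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # The "average layout" is approximated here by the normalized mean embedding.
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    similarities = unit @ centroid
    keep = similarities >= threshold
    return keep, similarities


# Toy 2-D "embeddings" (real CLIP embeddings have hundreds of dimensions).
cluster = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]])
keep, similarities = filter_cluster_by_centroid(cluster, threshold=0.9)
print(keep)  # the last vector points in a different direction and is flagged
```

Members flagged `False` can be treated as outliers and removed from the cluster or reassigned as noise. The threshold trades precision against cluster size and would need tuning on real screenshots.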
- A. Automatic screenshot generation for all HTML pages using Selenium and ChromeDriver.
- B. Exporting clusters into grid images for easy visual validation.
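
As a rough idea of how a cluster might be exported to a grid image, here is a small sketch using Pillow. The `make_grid` helper, the column count, and the cell size are hypothetical and not taken from the project. In-memory solid-color images stand in for real screenshots.

```python
from PIL import Image


def make_grid(images, columns=4, cell_size=(320, 240)):
    """Paste the images of one cluster into a single grid image, row by row."""
    if not images:
        raise ValueError("images must not be empty")
    rows = (len(images) + columns - 1) // columns  # ceiling division
    cell_w, cell_h = cell_size
    grid = Image.new("RGB", (columns * cell_w, rows * cell_h), "white")
    for index, image in enumerate(images):
        # Resize every screenshot to the same cell size so the grid stays aligned.
        thumb = image.convert("RGB").resize(cell_size)
        x = (index % columns) * cell_w
        y = (index // columns) * cell_h
        grid.paste(thumb, (x, y))
    return grid


# Stand-ins for five page screenshots belonging to one cluster.
pages = [Image.new("RGB", (800, 600), color)
         for color in ("red", "green", "blue", "black", "gray")]
grid = make_grid(pages, columns=2, cell_size=(100, 75))
print(grid.size)
```

With real data, each screenshot would be loaded with `Image.open(path)` and the result written with `grid.save(...)`, one grid file per cluster.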
- DBSCAN Clustering Algorithm - Datacamp
- How to Choose Epsilon and MinPts for DBSCAN
- How to Take Website Screenshots in Python
- OpenAI CLIP GitHub Repository
- Install the required libraries: `pip install -r requirements.txt`