URL Title Classifier

A machine learning project for classifying URLs and titles into categories.
Deployed on Hugging Face: firefoxrecap/URL-TITLE-classifier

Performance Metrics

Model Agreement Rates

Comparison	Agreement Rate
ONNX Runtime vs PyTorch (FP32)	99.19%
Quantized Q4 vs PyTorch (FP32)	96.14%
Quantized INT8 vs PyTorch (FP32)	97.30%
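
The agreement rates above can be read as the share of validation examples on which two model variants predict the same label. The sketch below shows that calculation under the assumption of single-label predictions; the function name `agreement_rate` and the sample label ids are illustrative and not taken from this repo.

```python
from typing import Sequence


def agreement_rate(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Return the fraction of examples on which two models predict the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must not be empty")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


# Hypothetical label ids for five examples (not real model output).
fp32_preds = [0, 2, 1, 1, 3]
int8_preds = [0, 2, 1, 0, 3]
print(f"{agreement_rate(fp32_preds, int8_preds):.2%}")  # prints 80.00%
```

Agreement measures how closely a converted model tracks the FP32 baseline; it says nothing by itself about accuracy against ground truth labels.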

Findings

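INT8 quantization provides strong performance while largely preserving the FP32 model's predictions (97.30% agreement with PyTorch FP32).
Q4 quantization is promising (96.14% agreement with PyTorch FP32) but could benefit from speed optimizations.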
Lower quantization levels (e.g., Q2) may be viable for certain use cases.
Accuracy degradation from model pruning is expected to be minimal.
More metrics are available on the Hugging Face model page.

Planned Improvements

Implemented

Unfreezing more layers – Gradual unfreezing and fine-tuning the entire model worked best – completed
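
As an illustration of gradual unfreezing, the sketch below computes which encoder layers would be trainable at each epoch when layers are unfrozen from the top down. The schedule (one extra layer per epoch), the layer count, and the function name `trainable_layers` are assumptions made for this example; the notebooks may use a different schedule.

```python
from typing import List


def trainable_layers(num_layers: int, epoch: int, layers_per_epoch: int = 1) -> List[int]:
    """Indices of encoder layers that are trainable at a given epoch.

    Layers are unfrozen from the top (closest to the classifier head) downward,
    so the whole model is trainable once enough epochs have passed.
    """
    if num_layers <= 0 or epoch < 0 or layers_per_epoch <= 0:
        raise ValueError("num_layers and layers_per_epoch must be positive, epoch non-negative")
    unfrozen = min(num_layers, (epoch + 1) * layers_per_epoch)
    return list(range(num_layers - unfrozen, num_layers))


# Hypothetical 6-layer encoder, one additional layer unfrozen per epoch.
for epoch in range(3):
    print(epoch, trainable_layers(6, epoch))
```

In a PyTorch training loop, the returned indices would typically decide which layers have `requires_grad` set to `True` before each epoch.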

In Progress (Needs Resources)

Domain-Adaptive Pretraining – Setup is ready, but requires more hardware to proceed
Class imbalance mitigation (via class weights or focal loss) – Implemented, hardware-bound
Hyperparameter optimization – Implemented, hardware-bound
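
For reference, the two imbalance techniques named above can be sketched in plain Python. This assumes a single-label softmax classifier; the weighting heuristic (inverse class frequency), the default `gamma`, and all function names are illustrative choices, not the settings used in this project.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def inverse_frequency_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: Sequence[float], target: int,
               class_weights: Dict[int, float], gamma: float = 2.0) -> float:
    """Focal loss for one example: -w_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical imbalanced labels: class 0 is three times as common as class 1.
weights = inverse_frequency_weights([0, 0, 0, 1])
print(weights)
print(focal_loss([2.0, 0.0], target=1, class_weights=weights))
```

With `gamma=0` and all weights equal to 1, the focal loss reduces to ordinary cross-entropy; larger `gamma` values shrink the contribution of examples the model already classifies confidently.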

To Do

Creation of a high-quality "golden" dataset – Still needs to be done

Exploration of Advanced Modeling Techniques

Dual-encoder model (URL + Title) with fusion layer, followed by pruning – Potentially effective, but complex
Adaptive learning (e.g., curriculum learning) – Slight improvements observed in practice
Contrastive learning using unlabeled data – Pending exploration
Co-training with pseudo-labeling – Pending
Larger datasets or smarter splits using the Tranco list – Planned
Improved prompts for synthetic data generation – Planned
Optimized label set – Planned

Setup Instructions

1. Install Python Dependencies

pip install -r requirements.txt

2. Install Node.js Dependencies

npm install

Usage

Data Collection

Step 1: Extract from WARC Files

python scripts/extract_warc_data.py

This will:

Read WARC file paths from data/raw/warc.paths
Extract content from Common Crawl WARC files
Save output to data/processed/extracted_data.parquet
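
To make the first step concrete, here is a minimal sketch of turning a paths file into download URLs. It assumes data/raw/warc.paths holds one relative Common Crawl path per line, as in Common Crawl's own warc.paths listings; the function name `iter_warc_urls` and the sample path are hypothetical, and the actual script may read the file differently.

```python
import tempfile
from pathlib import Path
from typing import Iterator

# Public HTTPS prefix for Common Crawl data.
COMMON_CRAWL_BASE = "https://data.commoncrawl.org/"


def iter_warc_urls(paths_file: str) -> Iterator[str]:
    """Yield one download URL per non-empty line of a warc.paths-style file."""
    for line in Path(paths_file).read_text(encoding="utf-8").splitlines():
        path = line.strip()
        if path:
            yield COMMON_CRAWL_BASE + path.lstrip("/")


# Demo with a temporary file and a made-up path (not a real crawl segment).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "warc.paths"
    demo_file.write_text(
        "crawl-data/EXAMPLE/segments/0/warc/example.warc.gz\n\n", encoding="utf-8"
    )
    for url in iter_warc_urls(str(demo_file)):
        print(url)
```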

Step 2: Generate Synthetic Training Data

Requires a DeepSeek API key in a .env file.

python scripts/generate_synthetic_data.py

This will:

Read input from data/processed/extracted_data.parquet
Generate labeled synthetic data via DeepSeek API
Save to data/processed/classified_data.parquet
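
The .env requirement can be illustrated with a small standard-library loader. The variable name `DEEPSEEK_API_KEY` is an assumption (check the script for the name it actually reads), and the script itself may rely on a package such as python-dotenv rather than code like this.

```python
import os
from pathlib import Path
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_api_key(env_path: str = ".env", name: str = "DEEPSEEK_API_KEY") -> str:
    """Return the key from the environment, falling back to the .env file."""
    if name in os.environ:
        return os.environ[name]
    path = Path(env_path)
    values = parse_env_text(path.read_text(encoding="utf-8")) if path.exists() else {}
    if name not in values:
        raise RuntimeError(f"{name} not found in environment or {env_path}")
    return values[name]


# Placeholder value for illustration only, not a real key.
print(parse_env_text("# DeepSeek credentials\nDEEPSEEK_API_KEY=sk-example\n"))
```

Keep the .env file out of version control so the key is not committed.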

Model Training Pipeline

Follow the Jupyter notebooks in order (note: this repo does not contain the datasets because of their size):

01_warc_data_cleaning.ipynb – WARC data cleaning & preprocessing
02_synthetic_data_cleaning.ipynb – Synthetic data cleaning
03_data_analysis.ipynb – Exploratory data analysis
04_model_training.ipynb – Model training
05_embedding_analysis.ipynb – Embedding analysis & clustering
06_MI_analysis.ipynb – Model inspection analysis

Trained models are saved in data/models/.

Demo Extension (Experimental)

The browser extension demo can be found in the demo_extension/ folder.

Currently broken – needs fixing

Transformers.js Demos

These demos test the Transformers.js runtime in Node.js.

Note: Converted ONNX model files are not included in this repo. You can find them on the Hugging Face model page.
To convert models yourself, use the script from the transformers.js repo.

Available Demos

npm run simple:
Runs a basic demo with a hardcoded url:title input to verify that inference works in Transformers.js.
npm run validation:
Compares predictions from the PyTorch FP32 model to those from the ONNX model (via ONNX Runtime) using the validation dataset.
Useful for evaluating discrepancies between model formats and ground truth labels.
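
The kind of comparison the validation demo supports can be sketched as a breakdown of where two model formats agree or disagree relative to the ground truth. The demo itself runs in Node.js; this Python sketch only illustrates the bookkeeping, and the function name, category names, and sample data are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def discrepancy_breakdown(labels: Sequence[int], fp32_preds: Sequence[int],
                          onnx_preds: Sequence[int]) -> Counter:
    """Count, per example, how the two models relate to each other and to the label."""
    if not (len(labels) == len(fp32_preds) == len(onnx_preds)):
        raise ValueError("all inputs must have the same length")
    counts: Counter = Counter()
    for truth, fp32, onnx in zip(labels, fp32_preds, onnx_preds):
        if fp32 == onnx:
            counts["agree_correct" if fp32 == truth else "agree_wrong"] += 1
        elif fp32 == truth:
            counts["only_fp32_correct"] += 1
        elif onnx == truth:
            counts["only_onnx_correct"] += 1
        else:
            counts["disagree_both_wrong"] += 1
    return counts


# Hypothetical labels and predictions for five validation examples.
labels = [0, 1, 2, 1, 0]
fp32_preds = [0, 1, 2, 0, 1]
onnx_preds = [0, 1, 1, 0, 0]
print(discrepancy_breakdown(labels, fp32_preds, onnx_preds))
```

Examples in the two "only ... correct" buckets are the ones where conversion or quantization changed the outcome, so they are the most useful to inspect by hand.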

Credits

Taimur Hasan (tshasan)

Repository: Firefox-Recap/url-title-classifier-testing
