A machine learning project for classifying URLs and titles into categories.
Deployed on Hugging Face: firefoxrecap/URL-TITLE-classifier
| Comparison | Agreement Rate |
|---|---|
| ONNX Runtime vs PyTorch (FP32) | 99.19% |
| Quantized Q4 vs PyTorch (FP32) | 96.14% |
| Quantized INT8 vs PyTorch (FP32) | 97.30% |
- INT8 quantization provides strong performance while maintaining accuracy.
- Q4 quantization is promising but could benefit from speed optimizations.
- Lower quantization levels (e.g., Q2) may be viable for certain use cases.
- Accuracy degradation from model pruning is expected to be minimal.
- More metrics are available on the Hugging Face model page.
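For reference, "agreement rate" here is the fraction of inputs on which two model variants predict the same label. A minimal sketch of how such a comparison could be computed, assuming the checkpoint lives in `data/models/` and the exported ONNX file is named `model_quantized.onnx` (both are assumptions, not the repo's actual layout):

```python
# Sketch: prediction agreement between the PyTorch FP32 model and an ONNX variant.
# Paths and filenames are hypothetical placeholders; adjust to the actual layout.
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("data/models/")
pt_model = AutoModelForSequenceClassification.from_pretrained("data/models/").eval()
session = ort.InferenceSession("model_quantized.onnx")

def agreement_rate(texts):
    matches = 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            pt_pred = pt_model(**enc).logits.argmax(-1).item()
        # Assumes the ONNX export kept the tokenizer's input names
        # (input_ids, attention_mask, ...).
        onnx_logits = session.run(None, {k: v.numpy() for k, v in enc.items()})[0]
        matches += int(pt_pred == int(np.argmax(onnx_logits, axis=-1)[0]))
    return matches / len(texts)

print(f"Agreement: {agreement_rate(['example.com:Example Title']):.2%}")
```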
- Unfreezing more layers – Gradual unfreezing and then fine-tuning the entire model worked best (see the unfreezing sketch after this list) – completed
- Domain-Adaptive Pretraining – Setup is ready, but requires more hardware to proceed
- Class imbalance mitigation (via class weights or focal loss; see the focal loss sketch after this list) – Implemented, hardware-bound
- Hyperparameter optimization – Implemented, hardware-bound
- Creation of a high-quality "golden" dataset – Still needs to be done
- Dual-encoder model (URL + Title) with fusion layer, followed by pruning – Potentially effective, but complex
- Adaptive learning (e.g., curriculum learning) – Slight improvements observed in practice
- Contrastive learning using unlabeled data – Pending exploration
- Co-training with pseudo-labeling – Pending
- Larger datasets or smarter splits using the Tranco list – Planned
- Improved prompts for synthetic data generation – Planned
- Optimized label set – Planned
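The gradual-unfreezing approach noted above could look like the sketch below; the `base_model.encoder.layer` attribute path assumes a BERT-style encoder and is illustrative, not the repo's actual training code:

```python
# Sketch of gradual unfreezing for a BERT-style encoder (attribute names assumed).
# Train the classification head first, then unfreeze encoder layers top-down,
# fine-tuning between stages until the whole model is trainable.

def freeze_all_but_head(model):
    for param in model.base_model.parameters():
        param.requires_grad = False

def unfreeze_top_layers(model, num_layers):
    # Assumes model.base_model.encoder.layer exists (true for BERT-like models).
    for layer in model.base_model.encoder.layer[-num_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

# Usage sketch:
#   freeze_all_but_head(model); train_one_stage(model)
#   for n in (2, 4, 8, 12):
#       unfreeze_top_layers(model, n); train_one_stage(model)
```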
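For the class-imbalance item, a minimal focal loss sketch (Lin et al., 2017) that could stand in for weighted cross-entropy; `gamma` and the optional per-class `alpha` weights are hyperparameters to tune, and this is not the repo's actual implementation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights well-classified examples so
    training focuses on hard (often minority-class) examples."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # unweighted CE
    pt = torch.exp(-ce)                                      # true-class probability
    loss = (1.0 - pt) ** gamma * ce
    if alpha is not None:                                    # per-class weight tensor
        loss = alpha[targets] * loss
    return loss.mean()

# Drop-in replacement for cross-entropy in a training loop:
#   loss = focal_loss(model(**batch).logits, labels)
```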
Install Python dependencies:

```bash
pip install -r requirements.txt
```

Install Node.js dependencies (for the demos):

```bash
npm install
```

Then run the extraction script:

```bash
python scripts/extract_warc_data.py
```

This will:
- Read WARC file paths from `data/raw/warc.paths`
- Extract content from Common Crawl WARC files
- Save output to `data/processed/extracted_data.parquet`
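To sanity-check the extraction output, the parquet file can be inspected with pandas (the exact columns depend on what the extraction script emits):

```python
import pandas as pd

df = pd.read_parquet("data/processed/extracted_data.parquet")
print(df.shape)
print(df.head())  # e.g. URL/title/content columns, per the extraction script
```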
Requires a DeepSeek API key in a `.env` file.
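For example, a `.env` file in the project root could be loaded like this; the variable name `DEEPSEEK_API_KEY` is an assumption, so check the script for the exact name it reads:

```python
# Sketch: load the DeepSeek API key from .env (variable name assumed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("DEEPSEEK_API_KEY")
assert api_key, "DEEPSEEK_API_KEY not set in .env"
```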
```bash
python scripts/generate_synthetic_data.py
```

This will:
- Read input from `data/processed/extracted_data.parquet`
- Generate labeled synthetic data via the DeepSeek API
- Save to `data/processed/classified_data.parquet`
Follow the Jupyter notebooks in order (note: this repo does not contain the datasets because they are too large to host here):
1. `01_warc_data_cleaning.ipynb` – WARC data cleaning & preprocessing
2. `02_synthetic_data_cleaning.ipynb` – Synthetic data cleaning
3. `03_data_analysis.ipynb` – Exploratory data analysis
4. `04_model_training.ipynb` – Model training
5. `05_embedding_analysis.ipynb` – Embedding analysis & clustering
6. `06_MI_analysis.ipynb` – Model inspection analysis
Trained models are saved in `data/models/`.
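Loading a trained checkpoint for inference might look like the following, assuming it was saved with `save_pretrained`; the subdirectory name is a hypothetical placeholder:

```python
# Sketch: run inference with a checkpoint from data/models/ (subdirectory assumed).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "data/models/url-title-classifier"  # hypothetical name
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path).eval()

enc = tokenizer("example.com:Example Title", return_tensors="pt", truncation=True)
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1).item()
print(model.config.id2label[pred])
```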
The browser extension demo can be found in the `demo_extension/` folder.
Currently broken – needs fixing.
These demos test the transformers.js runtime in Node.js.
Note: Converted ONNX model files are not included in this repo. You can find them on the Hugging Face model page.
To convert models yourself, use the script from the transformers.js repo.
- `npm run simple` – Runs a basic demo with a hardcoded `url:title` input to verify that inference works in transformers.js.
- `npm run validation` – Compares predictions from the PyTorch FP32 model to those from the ONNX model (via ONNX Runtime) on the validation dataset. Useful for evaluating discrepancies between model formats and ground truth labels.
- Taimur Hasan (tshasan)