This is an experiment to fine-tune an LLM and a BERT model for a classification task

Bennoo/classification_experience

Multi-Label Classification on Consumer Hardware

This project demonstrates multi-label text classification using different model architectures (mDeBERTa-v3-base and GPT-OSS-20B) fine-tuned on consumer hardware. It includes synthetic data generation, model training, inference, and comprehensive evaluation on the EuroChef+ multilingual customer support dataset.

Overview

The project compares three approaches:

  1. mDeBERTa-v3-base: Fine-tuned multilingual transformer (Microsoft)
  2. GPT-OSS-20B (Base): Zero-shot inference using OpenAI's GPT-OSS-20B
  3. GPT-OSS-20B (LoRA): LoRA fine-tuned adapter on GPT-OSS-20B

Key Results:

  • Best F1 Micro: mDeBERTa-v3-base (0.8097)
  • Best Exact Match: GPT-OSS-20B + LoRA (0.4094)
  • Fastest Inference: mDeBERTa-v3-base (235 samples/s)

Project Structure

📁 Notebooks

mDeBERTa Experiments

  • 1-train.ipynb: Complete training pipeline for mDeBERTa-v3-base including data preprocessing, model configuration, and training on multilingual customer support data
  • 2-inference.ipynb: Run inference using the trained mDeBERTa model
  • 3-evaluate.ipynb: Comprehensive evaluation with metrics (F1, precision, recall, exact match) and per-label analysis
  • evaluation_results.md: Detailed results showing F1 Micro: 0.8097, throughput: 235 samples/s
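
The notebooks' exact decoding step is not reproduced here. A common way to turn a classifier's per-label logits into a label set is to apply a sigmoid to each logit and keep every label above a threshold. The sketch below assumes a threshold of 0.5 and uses a hypothetical four-label subset and made-up logits; the names `LABELS`, `sigmoid`, and `decode` are illustrative and not taken from the repository.

```python
import math

# Hypothetical subset of labels; the dataset itself defines 15.
LABELS = ["technical_issue", "frustrated", "premium_user", "positive"]


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, labels=LABELS, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    return [name for name, z in zip(labels, logits) if sigmoid(z) >= threshold]


# Made-up logits for one message (assumption, not model output).
predicted = decode([2.1, 0.3, -1.5, -3.0])
print(predicted)
```

Because each label is thresholded independently, a message can receive zero, one, or several labels, which is what distinguishes multi-label from multi-class decoding.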

GPT-OSS-20B Experiments

Analysis

📁 Data Generation

  • synthetic_gen.py: Multilingual synthetic data generator supporting OpenAI and Gemini APIs

📁 Deployment

Features

Synthetic Data Generation

  • Multi-Provider Support: Generate data using OpenAI or Google Gemini APIs
  • Structured Output: JSON schema-validated responses using Pydantic models
  • Context-Aware: Automatically includes existing messages to avoid duplicates and maintain variety
  • Multilingual: Supports English, French, Dutch, and German with culturally appropriate language patterns
  • Flexible CLI: Comprehensive command-line interface for customization
  • Batch Generation: Generate multiple batches with configurable parameters
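
To illustrate the context-aware idea, here is a minimal sketch of a prompt builder that shows the model a sample of earlier messages and asks it not to repeat them. The function name `build_prompt`, its parameters, and the prompt wording are all assumptions made for illustration; the actual prompt in synthetic_gen.py may look quite different.

```python
import random


def build_prompt(existing, per_language, max_examples=5, seed=0):
    """Compose a generation prompt that lists a sample of earlier messages."""
    rng = random.Random(seed)
    # Show at most max_examples earlier messages to keep the prompt short.
    shown = rng.sample(existing, min(max_examples, len(existing)))
    mix = ", ".join(f"{count} {lang}" for lang, count in per_language.items())
    lines = [f"Write customer support messages: {mix}."]
    if shown:
        lines.append("Do not repeat or closely paraphrase these earlier messages:")
        lines.extend(f"- {msg}" for msg in shown)
    return "\n".join(lines)


prompt = build_prompt(
    existing=["The app crashes on login.", "Je veux un remboursement."],
    per_language={"French": 12, "Dutch": 12, "English": 12, "German": 4},
)
print(prompt)
```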

Model Training & Evaluation

  • Multi-label Classification: 15 labels including sentiment, priority, user type, and issue categories
  • Consumer Hardware Optimized: All training done on consumer GPUs using efficient techniques
  • LoRA Fine-tuning: Memory-efficient adapter-based fine-tuning for large models
  • Comprehensive Metrics: F1 (micro/macro/weighted), precision, recall, exact match, Hamming loss
  • Per-label Analysis: Detailed performance breakdown for each classification label
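
For readers unfamiliar with the metrics listed above, the following sketch computes micro-averaged F1, exact match, and Hamming loss from multi-hot label rows in plain Python. It is an illustration of the standard definitions, not the repository's evaluation code (which may rely on scikit-learn or evaluate); the function name `multilabel_metrics` and the toy data are assumptions.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute micro F1, exact match, and Hamming loss.

    y_true and y_pred are lists of equal-length 0/1 lists, one row per sample.
    """
    tp = fp = fn = exact = 0
    for truth, pred in zip(y_true, y_pred):
        for t, p in zip(truth, pred):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
        # Exact match requires every label of the sample to be correct.
        exact += int(truth == pred)
    n_cells = sum(len(row) for row in y_true)
    denom = 2 * tp + fp + fn
    return {
        "f1_micro": 2 * tp / denom if denom else 0.0,
        "exact_match": exact / len(y_true),
        "hamming_loss": (fp + fn) / n_cells,
    }


# Toy data: two samples, three labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
scores = multilabel_metrics(y_true, y_pred)
print(scores)
```

Micro F1 pools true and false positives across all labels before averaging, so it can be high even when exact match is low: one wrong label out of 15 fails the whole sample under exact match.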

Installation & Setup

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended for training)
  • 16GB+ RAM for LoRA fine-tuning

1. Clone the repository

git clone <repository-url>
cd local_oss

2. Install dependencies

For data generation:

pip install openai google-genai pydantic

For model training and evaluation:

pip install torch transformers datasets evaluate scikit-learn peft accelerate

3. Set up API keys (for synthetic data generation)

export OPENAI_API_KEY="your-openai-api-key"
export GEMINI_API_KEY="your-gemini-api-key"

Quick Start

Generate Synthetic Data

Basic usage with OpenAI:

cd synthetic_data
python synthetic_gen.py

Using Gemini:

python synthetic_gen.py --provider gemini

Train Models

mDeBERTa:

  1. Open mdeberta/1-train.ipynb
  2. Run all cells to train the model
  3. Model will be saved and optionally pushed to Hugging Face Hub

GPT-OSS-20B with LoRA:

  1. Open oss20b/3-finetune_lora.ipynb
  2. Configure LoRA parameters in the notebook
  3. Run training cells
  4. Adapter will be saved locally and optionally pushed to Hub

Run Evaluation

Each model has a dedicated evaluation notebook:

Compare all models: analysis/comparison.ipynb

Synthetic Data Generation

Command-Line Options

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --provider | -p | openai | API provider (openai or gemini) |
| --model | -m | Auto | Model to use (provider-specific) |
| --num-messages | -n | 40 | Messages per batch |
| --batches | -b | 1 | Number of batches to generate |
| --french | | 12 | French messages per batch |
| --dutch | | 12 | Dutch messages per batch |
| --english | | 12 | English messages per batch |
| --german | | 4 | German messages per batch |
| --temperature | -t | 0.8/0.6 | Generation temperature |
| --output | -o | customer_support_messages.jsonl | Output file path |
| --no-existing | | False | Skip existing messages in prompt |
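
The options above map naturally onto an argparse parser. The sketch below is a reconstruction from the table only, not the script's actual source: the types, the `None` defaults for `--model` and `--temperature` (standing in for "Auto" and the provider-dependent 0.8/0.6), and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Synthetic support-message generator (sketch)"
    )
    parser.add_argument("--provider", "-p", choices=["openai", "gemini"], default="openai")
    # None stands in for "Auto": the real script picks a provider-specific model.
    parser.add_argument("--model", "-m", default=None)
    parser.add_argument("--num-messages", "-n", type=int, default=40)
    parser.add_argument("--batches", "-b", type=int, default=1)
    parser.add_argument("--french", type=int, default=12)
    parser.add_argument("--dutch", type=int, default=12)
    parser.add_argument("--english", type=int, default=12)
    parser.add_argument("--german", type=int, default=4)
    # None stands in for the documented provider-dependent default (0.8/0.6).
    parser.add_argument("--temperature", "-t", type=float, default=None)
    parser.add_argument("--output", "-o", default="customer_support_messages.jsonl")
    parser.add_argument("--no-existing", action="store_true")
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(["--provider", "gemini", "--batches", "5"])
language_total = args.french + args.dutch + args.english + args.german
print(args.provider, args.batches, language_total)
```

Note that the default per-language counts (12 + 12 + 12 + 4) add up to the default of 40 messages per batch.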

Usage Examples

Generate multiple batches:

python synthetic_gen.py --provider openai --batches 5

Customize language distribution:

python synthetic_gen.py --french 20 --dutch 10 --english 8 --german 2

Use a specific model:

python synthetic_gen.py --provider openai --model gpt-4o-mini
python synthetic_gen.py --provider gemini --model gemini-2.0-flash-exp

Adjust temperature for more/less creative outputs:

python synthetic_gen.py --temperature 0.9

Custom output file:

python synthetic_gen.py --output my_custom_dataset.jsonl

Dataset

EuroChef+ Customer Support Dataset

  • Source: BenTouss/eurochef-cs
  • Languages: English, French, Dutch, German
  • Labels (15): technical_issue, feature_request, content_request, content_quality, account_management, refund_request, normal, frustrated, positive, low_priority, premium_user, enterprise, trial_user, churn_risk, payment_issue
  • Test Set: 127 samples
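
Multi-label training typically represents each sample's tags as a multi-hot vector with one position per label. The sketch below uses the 15 labels listed above; the helper names `encode` and `decode` are illustrative, and silently ignoring tags outside the label set is an assumption about how unknown tags might be handled, not documented behaviour.

```python
LABELS = [
    "technical_issue", "feature_request", "content_request", "content_quality",
    "account_management", "refund_request", "normal", "frustrated", "positive",
    "low_priority", "premium_user", "enterprise", "trial_user", "churn_risk",
    "payment_issue",
]
LABEL_INDEX = {name: i for i, name in enumerate(LABELS)}


def encode(tags):
    """Turn a list of tag names into a 0/1 vector of length len(LABELS)."""
    vector = [0] * len(LABELS)
    for tag in tags:
        if tag in LABEL_INDEX:  # assumption: tags outside the label set are skipped
            vector[LABEL_INDEX[tag]] = 1
    return vector


def decode(vector):
    """Turn a 0/1 vector back into tag names, in label order."""
    return [name for name, bit in zip(LABELS, vector) if bit == 1]


vector = encode(["frustrated", "technical_issue", "not_a_label"])
print(vector)
print(decode(vector))
```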

Model Performance

| Model | F1 Micro | Exact Match | Latency (ms) | Size |
| --- | --- | --- | --- | --- |
| mDeBERTa-v3-base | 0.8097 | 0.3543 | 4.26 | 278M params |
| GPT-OSS-20B (Base) | 0.5751 | 0.0079 | 8199.33 | 20B params |
| GPT-OSS-20B (LoRA) | 0.8018 | 0.4094 | 740.41 | 20B + adapters |

Key Takeaways:

  • mDeBERTa offers the best balance of accuracy and speed for production deployment
  • LoRA fine-tuning substantially improves GPT-OSS-20B: F1 Micro rises from 0.5751 to 0.8018, a 39% relative increase (+0.2267 absolute)
  • LoRA achieves the highest exact match rate (0.4094), which matters for automation confidence
  • Consumer hardware is viable for training competitive models
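
The headline numbers can be cross-checked from the performance figures with a few lines of arithmetic. The sketch below assumes the reported latency is per sample and that samples are processed one after another, so throughput is simply 1000 divided by the latency in milliseconds; the variable names are illustrative.

```python
# Figures copied from the Model Performance results.
results = {
    "mDeBERTa-v3-base": {"f1_micro": 0.8097, "latency_ms": 4.26},
    "GPT-OSS-20B (Base)": {"f1_micro": 0.5751, "latency_ms": 8199.33},
    "GPT-OSS-20B (LoRA)": {"f1_micro": 0.8018, "latency_ms": 740.41},
}

base_f1 = results["GPT-OSS-20B (Base)"]["f1_micro"]
lora_f1 = results["GPT-OSS-20B (LoRA)"]["f1_micro"]

absolute_gain = lora_f1 - base_f1        # difference in F1 points
relative_gain = absolute_gain / base_f1  # gain relative to the base model

# Assumption: latency is per sample, so samples/s = 1000 / latency_ms.
throughput = {name: 1000.0 / r["latency_ms"] for name, r in results.items()}

print(f"absolute F1 gain: {absolute_gain:.4f}")
print(f"relative F1 gain: {relative_gain:.1%}")
print(f"mDeBERTa throughput: {throughput['mDeBERTa-v3-base']:.0f} samples/s")
```

Under these assumptions the 39% figure is a relative increase, and the 235 samples/s throughput quoted for mDeBERTa is consistent with its 4.26 ms latency.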

Technical Details

LoRA Configuration

  • Rank: 32
  • Alpha: 64
  • Target Modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj
  • Dropout: 0.05
  • Trainable Parameters: ~0.2% of base model
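
As a rough illustration of what these settings mean, the sketch below collects them as keyword arguments in the shape that PEFT's `LoraConfig` accepts and works out two derived quantities: the standard LoRA scaling factor (alpha divided by rank) and the number of parameters one adapter pair adds to a single weight matrix. The 4096 x 4096 matrix is a hypothetical size chosen for the arithmetic, not GPT-OSS-20B's real layer shape, so the ~0.2% figure above is not reproduced here.

```python
# Values taken from the configuration listed above.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "lora_dropout": 0.05,
}
# With PEFT installed, these would typically be passed as:
#   from peft import LoraConfig
#   config = LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]


def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Hypothetical 4096 x 4096 projection (NOT the model's actual dimensions).
added = lora_params(4096, 4096, lora_kwargs["r"])
fraction = added / (4096 * 4096)
print(scaling, added, fraction)
```

The fraction of trainable parameters for a whole model depends on how many matrices are targeted and on their real shapes, which is why it differs from the per-matrix ratio computed here.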

Trained Models

All models are available on Hugging Face:

Output Format

Synthetic Data Output

Messages are saved in JSONL format with the following structure:

{
  "message": "Bonjour, j'ai un problème avec...",
  "language": "French",
  "tags": ["technical_issue", "urgent", "premium_user", "frustrated"]
}
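
A record in this format can be checked with a few lines of standard-library Python. The project itself validates with Pydantic models; the sketch below is a dependency-free stand-in, and the names `REQUIRED` and `parse_record` are hypothetical.

```python
import json

# Expected fields and their JSON types, per the structure shown above.
REQUIRED = {"message": str, "language": str, "tags": list}


def parse_record(line):
    """Parse one JSONL line and check that the required fields are present."""
    record = json.loads(line)
    for field, expected in REQUIRED.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} is missing or not a {expected.__name__}")
    return record


# Build one line the way a JSONL writer would, keeping non-ASCII characters.
line = json.dumps(
    {
        "message": "Bonjour, j'ai un problème avec...",
        "language": "French",
        "tags": ["technical_issue", "urgent", "premium_user", "frustrated"],
    },
    ensure_ascii=False,
)
record = parse_record(line)
print(record["language"], record["tags"])
```

To read a whole file, call `parse_record` on each non-empty line of the .jsonl output.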

Prediction Output Format

Each model prediction is scored against its ground truth with multi-label metrics and recorded with the following structure:

{
  "predictions": ["technical_issue", "premium_user", "frustrated"],
  "ground_truth": ["technical_issue", "premium_user", "normal"],
  "f1_score": 0.67,
  "exact_match": false
}
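
The per-sample values in this example can be reproduced by treating predictions and ground truth as sets. The sketch below is an illustration under that assumption; the function name `score_sample` is hypothetical and the repository's own scoring code may differ.

```python
def score_sample(predictions, ground_truth):
    """Score one sample: set-based F1 (rounded to 2 places) and exact match."""
    pred, truth = set(predictions), set(ground_truth)
    overlap = len(pred & truth)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"f1_score": round(f1, 2), "exact_match": pred == truth}


result = score_sample(
    predictions=["technical_issue", "premium_user", "frustrated"],
    ground_truth=["technical_issue", "premium_user", "normal"],
)
print(result)
```

Two of three predicted labels are correct and two of three true labels are found, so precision and recall are both 2/3 and F1 is 0.67, while the single mismatched sentiment label is enough to fail exact match.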

Available Classification Labels

Problem Categories:

  • technical_issue - App/streaming problems, bugs, crashes
  • billing - Payment issues, subscription questions
  • account_management - Login, profile, settings
  • content_request - Requests for specific recipes/content
  • feature_request - Suggestions for new features
  • content_quality - Feedback on recipe quality
  • refund_request - Request for money back
  • payment_issue - Billing/payment problems

Sentiment:

  • frustrated - Negative emotional tone
  • positive - Positive feedback
  • normal - Neutral tone

Priority:

  • low_priority - Can wait for resolution
  • (Normal priority is default, not labeled)

User Type:

  • premium_user - Paid subscriber
  • enterprise - Business account
  • trial_user - Free trial period
  • churn_risk - Likely to cancel subscription

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT
