An end-to-end framework for training, evaluating, and iteratively improving Large Language Models with automated feedback loops.
| Name | Roll Number |
|---|---|
| Radhakrishna Bharuka | 24BDS063 |
| Abhang Pawar | 24BDS054 |
| Nilesh Dwivedi | 24BDS048 |
| Rushikesh Masalkar | 24BDS040 |
- Overview
- Demo Video
- Features
- System Architecture
- Repository Structure
- Installation Process
- Execution Guide
- Evaluation Metrics
- API Documentation
- Technology Stack
- Documentation
- Troubleshooting
- License
This framework implements a self-improving Large Language Model system that:
- Automatically collects training data from user interactions
- Evaluates model quality using 8 comprehensive metrics
- Fine-tunes models using efficient LoRA (Low-Rank Adaptation) adapters
- Measures improvements quantitatively with before/after comparisons
- Operates continuously with background workers
- Scales efficiently on consumer GPUs (8GB VRAM minimum)
2-Minute Working Demo
Watch the full demo video: `docs/demo_video.mp4`
The demo video demonstrates:
- User interactions through the chat interface
- Real-time response generation from the model
- Background evaluation workers computing metrics
- Fine-tuning process triggering automatically
- Before and after comparison of model improvements
- Continuous data collection from user interactions
- Automatic dataset preparation and validation
- Threshold-based fine-tuning trigger (configurable)
- 8 distinct quality metrics for thorough assessment
- Before and after comparison reports
- Real-time metric computation
- LoRA-based fine-tuning for memory efficiency
- Runs on consumer GPUs with 8GB+ VRAM
- Checkpoint saving every 100 steps
- Asynchronous metric computation
- Automated fine-tuning workflow
- Continuous monitoring and processing
```
┌───────────────────────────────────────────────────────────────────┐
│                   SELF-IMPROVING LLM FRAMEWORK                     │
│                                                                    │
│  ┌──────────────┐          ┌──────────────┐      ┌──────────────┐  │
│  │   FRONTEND   │◄────────►│   BACKEND    │◄────►│   DATABASE   │  │
│  │ (React/Vue)  │   REST   │  (FastAPI)   │ SQL  │  (Supabase)  │  │
│  │              │   API    │              │      │              │  │
│  │  Port: 5173  │          │  Port: 8000  │      │    Cloud     │  │
│  └──────────────┘          └──────┬───────┘      └──────────────┘  │
│                                   │                                │
│                          ┌────────┴────────┐                       │
│                          │                 │                       │
│                   ┌──────▼──────┐   ┌──────▼──────┐                │
│                   │  Worker 1   │   │  Worker 2   │                │
│                   │ eval_first  │   │eval_finetune│                │
│                   │  (Metrics)  │   │ (Training)  │                │
│                   └──────┬──────┘   └──────┬──────┘                │
│                          │                 │                       │
│                          └────────┬────────┘                       │
│                                   │                                │
│                          ┌────────▼────────┐                       │
│                          │  OLLAMA SERVER  │                       │
│                          │   Port: 11434   │                       │
│                          │                 │                       │
│                          │  - Gemma 1B     │                       │
│                          │  - GPT-OSS 20B  │                       │
│                          │  - Fine-tuned   │                       │
│                          └─────────────────┘                       │
└───────────────────────────────────────────────────────────────────┘
```
```
┌─────────────────────────────────────────────────────────────────┐
│                      SELF-IMPROVEMENT LOOP                       │
└─────────────────────────────────────────────────────────────────┘

Step 1: USER INTERACTION
┌─────────────────┐
│  User submits   │
│  a question     │
└────────┬────────┘
         │
         ▼
Step 2: RESPONSE GENERATION
┌─────────────────┐
│ Gemma 1B Model  │
│ generates answer│
└────────┬────────┘
         │
         ▼
Step 3: DATABASE STORAGE
┌─────────────────────┐
│  Save to Supabase   │
│  status: 'created'  │
└────────┬────────────┘
         │
         ▼
Step 4: FIRST EVALUATION
┌─────────────────────────────────┐
│  Compute 8 metrics:             │
│  - Answer Relevancy             │
│  - Contextual Precision         │
│  - Faithfulness                 │
│  - Toxicity                     │
│  - Overall Score                │
│  - And 3 more                   │
│  status: 'done'                 │
└────────┬────────────────────────┘
         │
         ▼
Step 5: DATA ACCUMULATION
┌─────────────────────┐
│  Collect N records  │
│  (default: 5000)    │
└────────┬────────────┘
         │
         ▼
Step 6: FINE-TUNING
┌─────────────────────────────────┐
│  Trigger at threshold           │
│  - Prepare training data        │
│  - Apply LoRA adapters          │
│  - Train for 3 epochs           │
│  - Save improved model          │
└────────┬────────────────────────┘
         │
         ▼
Step 7: FINAL EVALUATION
┌─────────────────────────────────┐
│  Re-evaluate with fine-tuned    │
│  model and compare results:     │
│                                 │
│  Base Model → Fine-tuned Model  │
│  Score: 0.65 → Score: 0.78      │
│  Improvement: +20%              │
└────────┬────────────────────────┘
         │
         ▼
Step 8: DEPLOY AND REPEAT
┌─────────────────────┐
│ Use improved model  │
│ for new interactions│
│ Loop back to Step 1 │
└─────────────────────┘
```
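In code, the loop reduces to a simple control flow. The sketch below is illustrative only: every helper name (`receive_user_question`, `evaluate_record`, and so on) is a hypothetical stand-in for the real modules in `app/` and `src/`.

```python
# Illustrative sketch of the self-improvement loop.
# All helper functions here are hypothetical stand-ins for the real modules.
def self_improvement_loop(threshold: int = 5000):
    while True:
        # Steps 1-3: answer the user and persist the interaction
        question = receive_user_question()
        answer = generate_answer(question, model="gemma3:1b")
        record_id = save_record(question, answer, status="created")

        # Step 4: first evaluation computes the 8 quality metrics
        evaluate_record(record_id)  # marks the record 'done'

        # Steps 5-8: at the threshold, fine-tune, compare, and redeploy
        if count_evaluated_records() >= threshold:
            dataset = prepare_training_data()
            adapter = finetune_with_lora(dataset, epochs=3)
            generate_comparison_report(adapter)
            deploy(adapter)  # new interactions now use the tuned model
```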
```
Intune-Backend/
│
├── .env                          # Environment variables (Supabase credentials)
├── .gitignore                    # Git ignore rules
├── Modelfile                     # Ollama model configuration
├── requirements.txt              # Core dependencies
├── requirements_finetune.txt     # Fine-tuning dependencies
├── README.md                     # This file
│
├── app/                          # APPLICATION LAYER
│   ├── app.py                    # Main FastAPI server (Port 8000)
│   ├── eval_first.py             # Worker 1: Base metrics evaluation
│   ├── eval_finetune.py          # Worker 2: Fine-tuning and final evaluation
│   └── README.md                 # API documentation
│
├── src/                          # SOURCE CODE LAYER
│   │
│   ├── data_generation/          # Data pipeline
│   │   ├── teacher.py            # Generate training examples (GPT-OSS)
│   │   ├── student.py            # Generate base outputs (Gemma)
│   │   └── prepare_data.py       # Format for training (JSONL)
│   │
│   ├── training/                 # Model fine-tuning
│   │   └── finetune.py           # LoRA-based training
│   │
│   ├── evaluation/               # Quality assessment
│   │   ├── update_metrics.py     # Compute base metrics
│   │   ├── evaluate_finetuned.py # Compare base vs tuned
│   │   ├── evaluate_ollama.py    # Test deployed models
│   │   └── generate_report.py    # Create comparison reports
│   │
│   ├── metrics/                  # Evaluation engine
│   │   └── llm_eval.py           # 8 metrics implementation
│   │
│   ├── database/                 # Database abstraction
│   │   └── supabase_client.py    # Supabase connection and utilities
│   │
│   └── utils/                    # Helper functions
│
├── docs/                         # DOCUMENTATION AND MEDIA
│   ├── AI_report.pdf             # Comprehensive project report
│   ├── AI_PPT.pptx               # Project presentation
│   ├── result.pdf                # Evaluation results
│   ├── demo_video.mp4            # Working demo video (2 minutes)
│   ├── intune_landingpage.png    # Landing page screenshot
│   ├── Full_workflow.png         # Complete workflow diagram
│   ├── basic_workflow_figma.png  # Simplified workflow
│   └── db_schema.png             # Database schema diagram
│
├── sql/                          # DATABASE SCHEMAS
│   ├── supabase_setup.sql        # Initial table setup
│   ├── supabase_add_metrics.sql  # Add metric columns
│   ├── add_tuned_columns.sql     # Add fine-tuned columns
│   └── create_decimal_view.sql   # View for decimal metrics
│
├── scripts/                      # UTILITY SCRIPTS
│   ├── convert_to_gguf.py        # Convert model to GGUF format
│   ├── create_ollama_model.py    # Create Ollama model
│   └── cleanup.ps1               # Cleanup script
│
├── config/                       # CONFIGURATION
│   └── .env.example              # Environment variables template
│
├── reports/                      # EVALUATION RESULTS
│   └── evaluation_report_*.json  # Performance comparison reports
│
└── Supabase_csv/                 # DATA EXPORTS
    └── *.csv                     # Database exports
```
| Component | Purpose | Key Files |
|---|---|---|
| app/ | API server and workers | app.py, eval_first.py, eval_finetune.py |
| src/data_generation/ | Create training data | teacher.py, student.py, prepare_data.py |
| src/training/ | Fine-tune models | finetune.py |
| src/evaluation/ | Assess quality | update_metrics.py, evaluate_finetuned.py |
| src/metrics/ | Scoring engine | llm_eval.py |
| src/database/ | Data persistence | supabase_client.py |
| docs/ | Documentation and media | PDFs, images, video |
| sql/ | Database schemas | SQL scripts |
| scripts/ | Utilities | Conversion and deployment tools |
Before you begin, ensure you have the following installed:
| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.10 or higher | Core runtime |
| NVIDIA GPU | 8GB+ VRAM | Fine-tuning (RTX 4060 or better recommended) |
| System RAM | 16GB or more | Model loading |
| Ollama | Latest | Local LLM inference |
| Supabase Account | Free tier | Cloud database |
| Git | Latest | Clone repository |
```bash
git clone https://github.com/Self-eval-llm/Intune-Backend.git
cd Intune-Backend
```
Using Virtual Environment (Recommended):
```bash
# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On Windows PowerShell:
.\.venv\Scripts\Activate.ps1

# On Windows Command Prompt:
.\.venv\Scripts\activate.bat

# On Linux/Mac:
source .venv/bin/activate

# Install dependencies
pip install -r requirements_finetune.txt
```
Using Conda (Alternative):
```bash
# Create conda environment
conda create -n llm-framework python=3.10

# Activate environment
conda activate llm-framework

# Install dependencies
pip install -r requirements_finetune.txt
```

```bash
# Copy the environment template
cp config/.env.example .env

# Edit .env file with your credentials
# On Windows: notepad .env
# On Linux/Mac: nano .env
```
Add your Supabase credentials to .env:
```
# Supabase Configuration
SUPABASE_URL=https://your-project-id.supabase.co
SUPABASE_KEY=your-anon-or-service-key

# Optional: Model Configuration
DEFAULT_MODEL=gemma3:1b
TEACHER_MODEL=gpt-oss:20b
```
How to get Supabase credentials:
- Go to supabase.com and create a free account
- Create a new project
- Navigate to Settings → API
- Copy the Project URL and anon/public key
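With the credentials in place, a quick way to confirm they load is a short Python check. This sketch assumes the `python-dotenv` package; the application itself may read the variables differently:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current directory

SUPABASE_URL = os.getenv("SUPABASE_URL")
SUPABASE_KEY = os.getenv("SUPABASE_KEY")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "gemma3:1b")  # optional, with fallback

assert SUPABASE_URL and SUPABASE_KEY, "Missing Supabase credentials in .env"
print(f"Using model {DEFAULT_MODEL} against {SUPABASE_URL}")
```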
Run the SQL scripts in the Supabase SQL Editor in the following order:
- Open Supabase Dashboard → SQL Editor
- Execute `sql/supabase_setup.sql` - creates the main table structure
- Execute `sql/supabase_add_metrics.sql` - adds metric columns
- Execute `sql/add_tuned_columns.sql` - adds fine-tuned metric columns
- (Optional) Execute `sql/create_decimal_view.sql` - creates a decimal view for easier querying
Database Schema Created:
The `intune_db` table includes:
- `id` - Primary key
- `created_at` - Timestamp
- `input` - User question
- `actual_output` - Base model response
- `expected_output` - Reference answer
- `context` - Background information (JSONB)
- `status_eval_first` - Base evaluation status
- `status_eval_final` - Fine-tuning evaluation status
- 8 base metric columns (INTEGER, the 0-1 score scaled by 10000)
- 8 fine-tuned metric columns (INTEGER, the 0-1 score scaled by 10000)
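Because the metric columns store the 0-1 scores as integers scaled by 10000, values must be converted on the way in and out. A minimal illustration (assuming round-to-nearest; the code may truncate instead):

```python
SCALE = 10_000  # metric columns store round(score * 10000) as INTEGER

def to_db(score: float) -> int:
    """Convert a 0-1 metric score to its stored integer form."""
    return round(score * SCALE)

def from_db(raw: int) -> float:
    """Recover the 0-1 score from the stored integer."""
    return raw / SCALE

assert to_db(0.7532) == 7532
assert from_db(7532) == 0.7532
```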
Install Ollama:
```bash
# Windows (using winget):
winget install Ollama.Ollama

# Or download from https://ollama.ai

# Start Ollama service
ollama serve
```
Pull Required Models:
Open a new terminal window and run:
```bash
# Pull base model (Gemma 1B - approximately 1.5GB)
ollama pull gemma3:1b

# Pull teacher model (GPT-OSS 20B - approximately 20GB)
# Optional, only needed for data generation
ollama pull gpt-oss:20b

# Verify installation
ollama list
```
Expected Output:
```
NAME           ID          SIZE     MODIFIED
gemma3:1b      abc123def   1.5 GB   2 minutes ago
gpt-oss:20b    def456ghi   20 GB    5 minutes ago
```
Test database connection:
```bash
python -c "from src.database.supabase_client import get_supabase_client; print('Database connected!' if get_supabase_client() else 'Connection failed')"
```
Test Ollama connection:
```bash
curl http://localhost:11434/api/tags
```
If successful, you should see a JSON response with a list of available models.
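The same Ollama check can be scripted from Python, which is handy for startup probes (a small sketch using the `requests` library):

```python
import requests

def ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of the models the local Ollama server has pulled."""
    resp = requests.get(f"{base_url}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    names = ollama_models()
    print("Available models:", names)
    assert any(n.startswith("gemma3") for n in names), "gemma3:1b not pulled yet"
```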
The framework requires 3 separate processes running simultaneously. Each process should run in its own terminal window.
```powershell
# Activate virtual environment
.\.venv\Scripts\Activate.ps1

# Start FastAPI server
python -m uvicorn app.app:app --host 0.0.0.0 --port 8000 --reload
```
What it does:
- Serves REST API endpoints
- Handles `/generate` requests from the frontend
- Manages database operations
- Provides a health check endpoint
Console Output:
```
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [12345] using WatchFiles
INFO:     Started server process [12346]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```
Status: API is ready when you see "Application startup complete"
API will be available at: http://localhost:8000
```powershell
# Activate virtual environment
.\.venv\Scripts\Activate.ps1

# Start first evaluation worker
python app\eval_first.py
```
What it does:
- Polls Supabase for records with `status_eval_first='created'`
- Computes 8 evaluation metrics for base model outputs
- Updates the database with computed scores
- Marks records as `status_eval_first='done'`
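At its core, each polling cycle of this worker looks roughly like the sketch below. It uses the repository's `get_supabase_client` helper and the supabase-py query API; `compute_metrics` is a hypothetical stand-in for the scoring code in `src/metrics/llm_eval.py`:

```python
import time

from src.database.supabase_client import get_supabase_client

POLL_SECONDS = 5

def poll_once(client) -> int:
    """Fetch pending records, score them, and mark them done."""
    rows = (client.table("intune_db")
                  .select("*")
                  .eq("status_eval_first", "created")
                  .limit(50)
                  .execute()).data
    for row in rows:
        scores = compute_metrics(row)  # hypothetical: returns the 8 metric scores
        update = {k: round(v * 10_000) for k, v in scores.items()}  # integer scaling
        update["status_eval_first"] = "done"
        client.table("intune_db").update(update).eq("id", row["id"]).execute()
    return len(rows)

if __name__ == "__main__":
    client = get_supabase_client()
    while True:
        processed = poll_once(client)
        print(f"Processed {processed} records")
        time.sleep(POLL_SECONDS)
```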
Console Output:
```
INFO: Starting First Evaluation Worker...
INFO: Polling interval: 5 seconds
INFO: Found 3 records to evaluate
INFO: Evaluating record 123
INFO: Updated record 123 (Answer Relevancy: 0.7532)
INFO: Batch complete. Processed 3 records in 2.1s
```
Status: Worker is active and polling
```powershell
# Activate virtual environment
.\.venv\Scripts\Activate.ps1

# Start fine-tuning worker
python app\eval_finetune.py
```
What it does:
- Monitors the record count in the database
- Triggers fine-tuning when the threshold is reached (default: 5000 records)
- Trains LoRA adapters on the collected data
- Evaluates the fine-tuned model and updates the `*_tuned` metric columns
- Generates comparison reports
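For the LoRA step, the base model is loaded in 4-bit and small trainable adapters are attached. Below is a sketch of what that setup typically looks like with Unsloth; the model id and hyperparameters here are assumptions, and the authoritative values live in `src/training/finetune.py`:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit so it fits on consumer GPUs (8GB+ VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-1b-it",  # assumed HF id; check finetune.py
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: only these low-rank matrices are trained,
# which is why fine-tuning stays memory-efficient
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```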
Console Output (Initial):
```
INFO: Starting Fine-tuning Worker...
INFO: Checking conditions every 60 seconds
INFO: Records collected: 47 / 5000 (0.94%)
INFO: Threshold not reached. Waiting...
```
Console Output (When Triggered):
```
INFO: Conditions met! Starting fine-tuning process...
INFO: Preparing training data...
INFO: Created train_dataset.jsonl (4500 examples)
INFO: Created val_dataset.jsonl (500 examples)
INFO: Starting fine-tuning with LoRA...
INFO: Epoch 1/3 - Loss: 0.8234
INFO: Epoch 2/3 - Loss: 0.6891
INFO: Epoch 3/3 - Loss: 0.5743
INFO: Fine-tuning completed successfully
INFO: Starting final evaluation...
INFO: Processed 100 records - Avg improvement: +12.3%
```
Status: Worker is monitoring; fine-tuning will trigger automatically
Method 1: Using curl (Command Line)
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is machine learning?"}'
```
Method 2: Using PowerShell
```powershell
$body = @{
    prompt = "What is artificial intelligence?"
} | ConvertTo-Json

Invoke-RestMethod -Uri http://localhost:8000/generate `
    -Method Post `
    -Body $body `
    -ContentType "application/json"
```
Method 3: Using Python Script
```python
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain neural networks"}
)
print(response.json())
```
Expected Response:
```json
{
  "response": "Machine learning is a subset of artificial intelligence...",
  "model": "gemma3:1b",
  "timestamp": "2025-11-16T14:30:45Z"
}
```
For testing with 10 examples instead of 5000:
```bash
# Step 1: Generate 10 training examples
python src\data_generation\teacher.py --n 10 --mode continuous

# Step 2: Generate base model outputs
python src\data_generation\student.py

# Step 3: Compute metrics
python src\evaluation\update_metrics.py

# Step 4: Prepare training data
python src\data_generation\prepare_data.py

# Step 5: Fine-tune (edit finetune.py to use the small dataset)
python src\training\finetune.py

# Step 6: Evaluate fine-tuned model
python src\evaluation\evaluate_finetuned.py
```
Time Estimate: Approximately 30 minutes for a complete cycle with 10 examples
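The exact JSONL schema is defined by `prepare_data.py`; the record below is a plausible shape for one training example, with field names mirroring the database columns (an assumption, not the verified format):

```python
import json

# One hypothetical training record: the teacher's answer becomes the target
record = {
    "input": "What is machine learning?",
    "expected_output": "Machine learning is a subset of artificial intelligence...",
    "context": ["Background passage retrieved for this question."],
}

# JSONL = one JSON object per line, appended record by record
with open("train_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```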
Graceful Shutdown:
In each terminal window, press:
Ctrl + C
Wait for "Shutting down gracefully..." message
Force Stop (if needed):
```powershell
# Windows PowerShell
Get-Process python | Stop-Process
```
```bash
# Linux/Mac
pkill python
```
The system uses 8 distinct metrics to evaluate model quality:
1. Answer Relevancy (0-1)
- Measures how relevant the answer is to the question
- Formula: `cosine_similarity(question_tokens, answer_tokens)`

2. Contextual Precision (0-1)
- Measures how much of the answer is supported by the context
- Formula: `|answer ∩ context| / |answer|`

3. Contextual Recall (0-1)
- Measures how much of the context is covered by the answer
- Formula: `|answer ∩ context| / |context|`

4. Contextual Relevancy (0-1)
- Measures semantic similarity between context and answer
- Formula: `cosine_similarity(context_tokens, answer_tokens)`

5. Faithfulness (0-1)
- Measures alignment with the reference answer and context
- Formula: `0.6 × cos(answer, reference) + 0.4 × cos(answer, context)`

6. Toxicity (0-1, lower is better)
- Measures the presence of harmful or toxic language
- Formula: `toxic_words / total_words`
- Detection: lexicon-based (offline, no API calls)

7. Hallucination Rate (0-1, lower is better)
- Measures information not supported by the context
- Formula: `1 - contextual_precision`

8. Overall Score (0-1)
- Balanced combination of all metrics
- Formula: `mean(positive_metrics) × (1 - mean(negative_metrics))`
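The overlap-based metrics reduce to simple set arithmetic. Here is a compact sketch of three of them, assuming whitespace tokenization (the actual `llm_eval.py` tokenizer may differ):

```python
def tokens(text: str) -> set[str]:
    """Naive whitespace tokenizer used for the set-overlap metrics."""
    return set(text.lower().split())

def contextual_precision(answer: str, context: str) -> float:
    """|answer ∩ context| / |answer|: the supported share of the answer."""
    a, c = tokens(answer), tokens(context)
    return len(a & c) / len(a) if a else 0.0

def contextual_recall(answer: str, context: str) -> float:
    """|answer ∩ context| / |context|: the covered share of the context."""
    a, c = tokens(answer), tokens(context)
    return len(a & c) / len(c) if c else 0.0

def hallucination_rate(answer: str, context: str) -> float:
    """1 - contextual_precision: the unsupported share of the answer."""
    return 1.0 - contextual_precision(answer, context)
```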
Before Fine-tuning (Base Model):
| Metric | Score | Status |
|---|---|---|
| Answer Relevancy | 0.6500 | Moderate |
| Contextual Precision | 0.5815 | Moderate |
| Contextual Recall | 0.6234 | Moderate |
| Contextual Relevancy | 0.5892 | Moderate |
| Faithfulness | 0.5815 | Moderate |
| Toxicity | 0.0234 | Good |
| Hallucination Rate | 0.4185 | Poor |
| Overall Score | 0.4721 | Moderate |
After Fine-tuning:
| Metric | Score | Improvement | Status |
|---|---|---|---|
| Answer Relevancy | 0.7850 | +20.8% | Good |
| Contextual Precision | 0.7623 | +31.1% | Good |
| Contextual Recall | 0.7456 | +19.6% | Good |
| Contextual Relevancy | 0.7234 | +22.8% | Good |
| Faithfulness | 0.7067 | +21.5% | Good |
| Toxicity | 0.0156 | -33.3% | Excellent |
| Hallucination Rate | 0.2377 | -43.2% | Much Better |
| Overall Score | 0.6534 | +38.4% | Good |
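The Improvement column is the relative change between the two runs; note that for Toxicity and Hallucination Rate a negative change is an improvement. Reproducing a few of the figures above:

```python
def improvement(before: float, after: float) -> float:
    """Relative change in percent between a base and fine-tuned score."""
    return (after - before) / before * 100

print(f"{improvement(0.6500, 0.7850):+.1f}%")  # +20.8%  Answer Relevancy
print(f"{improvement(0.4721, 0.6534):+.1f}%")  # +38.4%  Overall Score
print(f"{improvement(0.0234, 0.0156):+.1f}%")  # -33.3%  Toxicity (lower is better)
```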
- `GET /` - Root endpoint providing API information
- Health check endpoint to monitor service status
- `POST /generate` - Generate a response from the model
Request:
```json
{
  "prompt": "What is machine learning?"
}
```
Response:
```json
{
  "response": "Machine learning is a method of data analysis...",
  "model": "gemma3:1b"
}
```
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`
- Detailed API docs: see `app/README.md`
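For reference, here is a minimal sketch of how such a `/generate` endpoint can be wired to Ollama's REST API with FastAPI; the real `app/app.py` additionally saves each interaction to Supabase, which is omitted here:

```python
import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    # Non-streaming call to the local Ollama server
    resp = requests.post(OLLAMA_URL, json={
        "model": "gemma3:1b",
        "prompt": req.prompt,
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()
    return {"response": resp.json()["response"], "model": "gemma3:1b"}
```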
| Component | Technology |
|---|---|
| Backend Framework | FastAPI |
| Database | Supabase (PostgreSQL) |
| LLM Inference | Ollama |
| Base Model | Gemma 1B |
| Teacher Model | GPT-OSS 20B |
| Fine-tuning Library | Unsloth |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Metrics Engine | Custom implementation |
| Programming Language | Python 3.10+ |
- Project Report (PDF) - Comprehensive technical documentation covering methodology, implementation, and results
- Project Presentation (PPTX) - Visual overview of architecture, workflow, and key features
- Results Analysis (PDF) - Detailed evaluation results and performance metrics
- Landing Page: docs/intune_landingpage.png
- Full Workflow Diagram: docs/Full_workflow.png
- Basic Workflow: docs/basic_workflow_figma.png
- Database Schema: docs/db_schema.png
1. Ollama Connection Failed
```bash
# Start Ollama service
ollama serve

# Check if running
curl http://localhost:11434/api/tags
```
2. Model Not Found
```bash
# Create model from Modelfile
ollama create gemma-finetuned -f Modelfile
```
3. Supabase Connection Failed
- Verify the credentials in the `.env` file
- Check network connectivity
- Ensure the `intune_db` table exists
4. Workers Not Processing
- Check that the database has records with the appropriate status flags
- Verify the fine-tuned model exists at `models/gemma-finetuned-merged/`
- Check worker logs for specific errors
This project is licensed under the MIT License - see the LICENSE file for details.
- Gemma by Google for the base language model
- Ollama for local LLM inference infrastructure
- Unsloth for efficient fine-tuning framework
- Supabase for database and backend services
For questions, issues, or contributions:
- GitHub Issues: Report a bug
- GitHub Discussions: Ask questions
Active Development - The project is actively maintained and continuously improving.
Built for the AI/ML Community
Star this repository if you find it useful!
