This paper presents Parsley, a web-based intelligent recipe-ingredient parsing and substitution system that combines a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model for Named Entity Recognition (NER) with Google's Gemini AI for intelligent ingredient substitution. The system extracts structured information (amount, unit, item, descriptor) from unstructured recipe ingredient text using a fine-tuned bert-base-uncased model trained on the New York Times Ingredient Parser dataset with the BIO (Beginning-Inside-Outside) tagging scheme. For dietary-constraint-based substitutions, the system leverages Gemini 2.5 Flash to suggest appropriate alternatives with quantity adjustments. The architecture employs a three-tier design: a React frontend with JWT authentication, a FastAPI backend with a MySQL database, and a hybrid AI processing pipeline.
The system follows a modular three-tier architecture:
┌─────────────────────────────────────────────────────────────┐
│ Frontend Layer (React) │
│ - Login/Authentication UI │
│ - Recipe Input Interface │
│ - Results Visualization │
└────────────────────┬────────────────────────────────────────┘
│ HTTP/REST API (JWT Auth)
┌────────────────────▼────────────────────────────────────────┐
│ Backend Layer (FastAPI) │
│ - Authentication Service (JWT + bcrypt) │
│ - BERT Inference Engine │
│ - Gemini AI Integration │
│ - Database Management (MySQL) │
└────────────────────┬────────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────────┐
│ Data & Model Layer │
│ - MySQL Database (User Management) │
│ - Fine-tuned BERT Model (bert_recipe_model/) │
│ - Training Data (NYT Ingredient Dataset) │
└─────────────────────────────────────────────────────────────┘
- User Authentication Flow: User credentials → JWT token generation → Token validation for protected endpoints
- Ingredient Parsing Flow: Raw text → BERT tokenization → NER inference → Structured extraction
- Substitution Flow: Parsed ingredient + constraint → Gemini API → Substitution suggestion with quantity calculation
The preprocessing module (scripts/preprocess.py) transforms the NYT Ingredient Parser dataset into BIO-tagged sequences:
Input Format:
input_text, qty, unit, name, comment
"1 cup white flour", 1, cup, white flour,Output Format:
[
[["1", "B-AMT"], ["cup", "B-UNIT"], ["white", "B-NAME"], ["flour", "I-NAME"]]
]Tagging Strategy:
- B-AMT / I-AMT: Beginning/Inside of Amount (e.g., "1", "1/2")
- B-UNIT / I-UNIT: Beginning/Inside of Unit (e.g., "cup", "fluid ounces")
- B-NAME / I-NAME: Beginning/Inside of Ingredient Name (e.g., "olive", "oil")
- B-DESC / I-DESC: Beginning/Inside of Descriptor (e.g., "chopped", "finely chopped")
- O: Outside (punctuation, stop words)
Preprocessing Algorithm:
- Tokenization using NLTK word tokenizer
- Force-tagging of numeric patterns (regex: ^\d+([/\.]\d+)?$)
- Force-tagging of known units (24 common measurement units)
- Pattern matching against CSV fields (qty, unit, name, comment)
- BIO sequence generation with proper label continuity
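The force-tagging and BIO-continuity steps above can be sketched as follows. This is a minimal illustration, not the actual scripts/preprocess.py: the function and variable names are invented, the unit set is a subset of the 24 units, and the real script also pattern-matches against the CSV fields.

```python
import re

# Illustrative sketch of force-tagging plus BIO continuity.
NUM_RE = re.compile(r"^\d+([/\.]\d+)?$")  # matches "1", "1/2", "0.5"
KNOWN_UNITS = {"cup", "cups", "tablespoon", "teaspoon", "ounce", "pound"}

def bio_tag(tokens, name_words, desc_words):
    """Assign BIO tags to a tokenized ingredient line."""
    tags, prev = [], None
    for tok in tokens:
        low = tok.lower()
        if NUM_RE.match(tok):
            ent = "AMT"
        elif low in KNOWN_UNITS:
            ent = "UNIT"
        elif low in name_words:
            ent = "NAME"
        elif low in desc_words:
            ent = "DESC"
        else:
            ent = None
        if ent is None:
            tags.append("O")
        else:
            # B- opens an entity; I- continues the same entity type
            tags.append(("I-" if prev == ent else "B-") + ent)
        prev = ent
    return list(zip(tokens, tags))
```

For the sample row above, `bio_tag(["1", "cup", "white", "flour"], {"white", "flour"}, set())` reproduces the B-AMT / B-UNIT / B-NAME / I-NAME sequence shown in the output format.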
Model Configuration:
- Base Model: bert-base-uncased (110M parameters)
- Task: Token Classification (NER)
- Output Labels: 9 classes (O, B-AMT, I-AMT, B-UNIT, I-UNIT, B-NAME, I-NAME, B-DESC, I-DESC)
Training Hyperparameters:
- Batch Size: 16 (optimized for RTX 4060 Ti 8GB GPU)
- Learning Rate: 2e-5 (standard for BERT fine-tuning)
- Epochs: 5
- Weight Decay: 0.01
- Optimizer: AdamW (with Transformers default settings)
- Mixed Precision: FP16 (for memory efficiency)
- Evaluation Strategy: Epoch-based
- Save Strategy: Epoch-based with best model loading (F1 score)
Data Split:
- Training: 80%
- Testing: 20%
- Random Seed: 42 (reproducibility)
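The hyperparameters above can be collected into a single configuration mapping, sketched below. The keys mirror transformers.TrainingArguments keywords (note the evaluation-schedule keyword is eval_strategy in recent Transformers releases, evaluation_strategy in older ones); the output path is illustrative.

```python
# Hyperparameter configuration mirroring the values listed above; in the
# training script these would be passed to transformers.TrainingArguments.
TRAINING_CONFIG = {
    "output_dir": "bert_recipe_model",
    "per_device_train_batch_size": 16,   # fits an 8GB RTX 4060 Ti
    "learning_rate": 2e-5,               # standard BERT fine-tuning rate
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "fp16": True,                        # mixed precision roughly halves memory
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,      # reload best checkpoint by F1
    "metric_for_best_model": "f1",
    "seed": 42,
}
```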
Token Alignment: BERT's WordPiece tokenization requires label alignment:
- Special tokens (CLS, SEP) receive label -100 (ignored in loss)
- Subword tokens (e.g., "##ed") receive label -100
- Only the first subword of each word receives the original label
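The three alignment rules above can be sketched as a pure function over the word-index sequence that Hugging Face fast tokenizers expose via word_ids() (function name here is illustrative):

```python
def align_labels(word_ids, word_labels):
    """Map word-level labels onto WordPiece tokens.

    word_ids: one entry per token, as returned by a fast tokenizer's
    word_ids() — None for special tokens (CLS/SEP), otherwise the index
    of the source word the subtoken came from.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)              # CLS/SEP: ignored by the loss
        elif wid == prev:
            aligned.append(-100)              # continuation subword ("##ed")
        else:
            aligned.append(word_labels[wid])  # first subword keeps the label
        prev = wid
    return aligned
```

For example, a word split into two subtokens keeps its label only on the first piece: `align_labels([None, 0, 1, 1, None], [5, 7])` yields `[-100, 5, 7, -100, -100]`.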
Evaluation Metrics:
- Precision, Recall, F1-Score (using seqeval)
- Overall Accuracy
- Per-entity performance
BERT Inference Process:
- Input text tokenization with is_split_into_words=True
- Model forward pass → logits for each token
- Aggregation strategy: "simple" (groups consecutive tokens with same label)
- Label mapping: Entity groups (AMT, UNIT, NAME, DESC) extracted
- Text cleaning: Remove BERT artifacts (## prefixes, spacing issues)
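Assuming the entity-group dictionaries produced by a Transformers token-classification pipeline with aggregation_strategy="simple", the label-mapping step above might look like the sketch below (function name and the space-joining choice are illustrative):

```python
def collect_fields(entities):
    """Fold pipeline entity groups into the four output fields.

    entities: list of dicts shaped like
    {"entity_group": "NAME", "word": "white flour"}, the form produced
    by a token-classification pipeline with aggregation_strategy="simple".
    """
    field_for = {"AMT": "amount", "UNIT": "unit", "NAME": "item", "DESC": "descriptor"}
    out = {"amount": "", "unit": "", "item": "", "descriptor": ""}
    for ent in entities:
        field = field_for.get(ent["entity_group"])
        if field:
            # join multiple spans of the same type with a space
            out[field] = (out[field] + " " + ent["word"]).strip()
    return out
```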
Text Cleaning Algorithm:
import re

def clean_text(text):
    text = text.replace("##", "")               # Remove WordPiece markers
    text = re.sub(r"\s+'\s+", "'", text)        # Fix: "hershey ' s" → "hershey's"
    text = re.sub(r"\s+([,.;])", r"\1", text)   # Fix: "minced ," → "minced,"
    return text.strip()

Gemini AI Integration:
- Model: Gemini 2.5 Flash
- Prompt Engineering: Structured JSON output format
- Input: Original ingredient (amount, unit, item) + dietary constraint
- Output: JSON with substitute_item, new_amount, new_unit, reason
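A sketch of the prompt construction and response parsing described above. The prompt wording and helper names are assumptions, not the project's verbatim prompt; the JSON field names match the documented response shape, and the fence-stripping accounts for models wrapping JSON in markdown code fences.

```python
import json

def build_prompt(item, amount, unit, constraint):
    """Illustrative prompt requesting strict JSON output."""
    return (
        f"Suggest a {constraint} substitute for {amount} {unit} {item}. "
        'Reply with JSON only: {"found": true/false, "substitute_item": "...", '
        '"new_amount": "...", "new_unit": "...", "reason": "..."}'
    )

def parse_reply(raw):
    """Strip markdown fences the model sometimes adds, then parse the JSON."""
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)
```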
Constraint Types Supported:
- Vegan
- Keto
- Gluten-Free
- Dairy-Free
- Low-Carb
- Paleo
API Server Specifications:
- Framework: FastAPI 0.104.0+
- ASGI Server: Uvicorn
- Python Version: 3.10+
- Authentication: JWT (JSON Web Tokens) with HS256 algorithm
- Password Hashing: bcrypt (via passlib)
API Endpoints:
POST /api/auth/register
- Request Body: { "email": "string", "password": "string" }
- Response: { "token": "string", "user": { "id": int, "email": "string" } }
- Security: Password hashed with bcrypt (12 rounds)
POST /api/auth/login
- Request Body: { "email": "string", "password": "string" }
- Response: { "token": "string", "user": { "id": int, "email": "string" } }
- Validation: Email/password verification against MySQL database
POST /api/parse
- Authentication: Bearer token required
- Request Body: { "text": "string" }
- Response: { "amount": "string", "unit": "string", "item": "string", "descriptor": "string" }
- Processing: BERT model inference → entity extraction → aggregation
POST /api/substitute
- Authentication: Bearer token required
- Request Body: { "item": "string", "amount": "string", "unit": "string", "constraint": "string" }
- Response: { "found": bool, "substitute_item": "string", "new_amount": "string", "new_unit": "string", "reason": "string" }
- Processing: Gemini API call → JSON parsing → response formatting
GET /api/health
- Response: { "status": "healthy", "bert_loaded": bool, "db_connected": bool }
JWT Token Configuration:
- Algorithm: HS256
- Secret Key: Configurable via environment variable
- Expiration: 30 days (43,200 minutes)
- Payload: { "sub": "user_email", "exp": timestamp }
Schema: MySQL Database recipeai
Table: users
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_email (email)
);

Security Features:
- Unique email constraint
- Bcrypt password hashing (72-byte limit, automatically handled)
- Indexed email for fast lookups
- Automatic timestamp tracking
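The parameterized-query pattern that protects the users table from SQL injection can be illustrated as below. The backend uses PyMySQL; this self-contained sketch swaps in stdlib sqlite3 (which uses ? placeholders where PyMySQL uses %s), and the function name is invented.

```python
import sqlite3

# In-memory stand-in for the MySQL users table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE, password_hash TEXT)")
conn.execute(
    "INSERT INTO users (email, password_hash) VALUES (?, ?)",
    ("alice@example.com", "<bcrypt hash>"))

def find_user(email):
    # The driver escapes the parameter; user input never reaches the SQL string.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)).fetchone()
```

An injection attempt such as `find_user("x' OR '1'='1")` is treated as a literal (non-matching) email rather than executable SQL.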
Technology Stack:
- React 19.2.0
- React Router DOM 6.26.0 (client-side routing)
- Tailwind CSS 4.1.18 (utility-first styling)
- React Icons 5.5.0 (icon library)
- Vite (build tool and dev server)
Component Structure:
App.jsx (Router)
├── Login.jsx (Authentication UI)
└── Dashboard.jsx (Main Application)
├── RecipeInput (Parse interface)
├── ParsedResults (ResultCard components)
├── SubstituteSection (Constraint selection)
└── SubstituteResult (AI suggestions)
State Management:
- React Hooks (useState, useEffect)
- LocalStorage for JWT token persistence
- Protected routes with authentication checks
UI/UX Features:
- Responsive design (mobile-first)
- Gradient color scheme (orange/amber theme)
- Icon-based visual feedback
- Loading states with spinners
- Error handling with user-friendly messages
- Form validation
Authentication Flow:
- User submits credentials (email + password)
- Server validates against database (bcrypt comparison)
- JWT token generated with user email as subject
- Token returned to client, stored in localStorage
- Subsequent requests include the token in the Authorization: Bearer <token> header
- Server validates token signature and expiration
- User identity extracted from token payload
Security Measures:
- Password hashing: bcrypt (computationally expensive, resistant to rainbow tables)
- Token expiration: 30-day validity period
- HTTPS recommended for production (tokens transmitted over network)
- CORS configuration: Currently permissive (*), should be restricted in production
- SQL injection prevention: Parameterized queries (PyMySQL)
1. Database Initialization
   mysql -u root -p < scripts/setup_database.sql
   Creates the database and user, and grants privileges.
2. Data Preprocessing
   python scripts/preprocess.py
   Processes the NYT dataset → generates data/training_data.json
3. Model Training
   python scripts/train_BERT.py
   Fine-tunes the BERT model → saves to bert_recipe_model/
4. Backend Server
   python scripts/api.py
   Starts the FastAPI server on port 5000
5. Frontend Development
   cd RecipeAI
   npm install
   npm run dev
   Starts the Vite dev server (typically port 5173)
1. Registration/Login
   - User navigates to the login page
   - Submits email and password
   - Server authenticates and returns a JWT token
   - Token stored in localStorage
   - User redirected to the dashboard
2. Ingredient Parsing
   - User enters recipe ingredient text (e.g., "1 1/2 cups chopped tomatoes")
   - Frontend sends a POST request to /api/parse with the JWT token
   - Backend loads the BERT model (if not already loaded)
   - Text is tokenized and processed through BERT
   - Entities are extracted and aggregated
   - Response: { "amount": "1 1/2", "unit": "cups", "item": "tomatoes", "descriptor": "chopped" }
   - Frontend displays parsed results in a card layout
3. Substitution Request
   - User selects a dietary constraint from the dropdown
   - Frontend sends a POST request to /api/substitute with the parsed data + constraint
   - Backend constructs a prompt for the Gemini API
   - Gemini returns JSON with a substitution suggestion
   - Frontend displays the substitute with new quantities and reasoning
BERT Configuration:
- Architecture: BERT-base (12 layers, 768 hidden size, 12 attention heads)
- Vocabulary Size: 30,522 tokens
- Max Sequence Length: 512 tokens (with truncation)
- Fine-tuning Method: Transfer learning with task-specific head
- Output Layer: Linear classification head (768 → 9 labels)
Training Environment:
- Hardware: NVIDIA RTX 4060 Ti (8GB VRAM)
- Framework: PyTorch with Hugging Face Transformers
- Mixed Precision: FP16 (reduces memory by ~50%)
- Data Loader Workers: 0 (Windows compatibility)
- Checkpointing: Epoch-based with best model selection
Response Times (Estimated):
- Authentication: < 100ms (database lookup + JWT generation)
- BERT Parsing: 200-500ms (model inference on GPU)
- Gemini Substitution: 1-3 seconds (API call to Google)
Scalability Considerations:
- BERT model loading: Single instance in memory (lazy loading possible)
- Database connections: PyMySQL connection pooling recommended for production
- Caching: Could implement Redis for frequently parsed ingredients
- Load balancing: FastAPI supports horizontal scaling
Optimization Techniques:
- Code splitting: Vite automatically handles this
- Icon library: Tree-shaking (only imported icons included)
- CSS: Tailwind JIT compilation (only used classes included)
- Asset optimization: Vite handles image and asset optimization
Source: New York Times Ingredient Parser Dataset (2015 snapshot)
Dataset Characteristics:
- Format: CSV with columns (input_text, qty, unit, name, comment)
- Size: 179,208 rows (raw data)
- Preprocessing: First 5,000 rows used for training (subset for development)
- Quality: Requires filtering (removed rows with missing 'input' or 'name')
BIO Tag Distribution:
- O (Outside): Most common (punctuation, stop words)
- B-NAME / I-NAME: Ingredient names (primary entities)
- B-AMT / I-AMT: Numeric quantities
- B-UNIT / I-UNIT: Measurement units
- B-DESC / I-DESC: Preparation descriptors
Model Performance Metrics:
- Evaluation performed using seqeval library (standard NER evaluation)
- Metrics: Precision, Recall, F1-Score, Accuracy
- Best model selected based on F1-score on validation set
System Capabilities:
- Handles complex ingredient descriptions with multiple components
- Recognizes fractions, decimals, and mixed numbers
- Identifies compound units (e.g., "fluid ounces")
- Extracts multi-word ingredient names
- Processes preparation descriptors
Limitations:
- BERT model requires GPU for optimal inference speed
- Gemini API dependency for substitutions (external service)
- Limited to English language
- Training data focused on common cooking measurements
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
transformers>=4.30.0
torch>=2.0.0
google-generativeai>=0.3.0
PyMySQL>=1.1.0
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
pydantic>=2.0.0
accelerate>=0.20.0
sentencepiece>=0.1.99
python-dotenv>=1.0.0
bcrypt>=4.0.0,<5.0.0
react>=19.2.0
react-dom>=19.2.0
react-router-dom>=6.26.0
react-icons>=5.5.0
tailwindcss>=4.1.18
@tailwindcss/vite>=4.1.18
- Vite (frontend build tool)
- ESLint (code linting)
- Python 3.10+ (backend runtime)
- Node.js 18+ (frontend runtime)
- MySQL 8.0+ (database)
- Python 3.10 or higher
- Node.js 18 or higher
- MySQL 8.0 or higher
- NVIDIA GPU (recommended for BERT inference)
- CUDA toolkit (for GPU acceleration)
Backend Setup:
# Install Python dependencies
pip install -r requirements.txt
# Initialize database
mysql -u root -p < scripts/setup_database.sql
# Configure database credentials in scripts/api.py or via environment variables
export DB_HOST=localhost
export DB_PORT=3306
export DB_NAME=recipeai
export DB_USER=your_username
export DB_PASSWORD=your_password
# Train BERT model (if not already trained)
python scripts/preprocess.py
python scripts/train_BERT.py
# Start API server
python scripts/api.py

Frontend Setup:
cd RecipeAI
npm install
npm run dev

Security:
- Set strong SECRET_KEY for JWT
- Restrict CORS origins to frontend domain
- Use HTTPS for all communications
- Implement rate limiting
- Add input validation and sanitization
Performance:
- Use production-grade ASGI server (Gunicorn with Uvicorn workers)
- Implement connection pooling for database
- Add caching layer (Redis) for frequent queries
- Consider model quantization for faster inference
- Use CDN for frontend static assets
Monitoring:
- Logging: Structured logging with levels
- Health checks: /api/health endpoint
- Error tracking: Sentry or similar service
- Metrics: Prometheus + Grafana
- Multi-language Support: Extend to other languages with multilingual BERT models
- Offline Substitution: Local knowledge base for common substitutions (reduce API dependency)
- Batch Processing: Support for entire recipe parsing (multiple ingredients)
- User Preferences: Save dietary preferences per user
- Recipe Storage: Allow users to save and manage recipes
- Mobile App: Native mobile application with React Native
- Advanced NER: Include extraction of cooking methods, temperatures, times
- Nutritional Information: Integration with nutritional databases
- Model Optimization: Quantization and pruning for edge deployment
- Active Learning: User feedback loop for model improvement
- BERT Paper: Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL 2019
- Dataset: New York Times Ingredient Parser Dataset (2015)
- Libraries: Hugging Face Transformers, FastAPI, React, PyTorch
- AI Models: Google Gemini 2.5 Flash, BERT-base-uncased
For questions or contributions, please refer to the project repository.
Note: This system is designed for research and educational purposes. For production deployment, additional security hardening, performance optimization, and compliance considerations should be addressed.