HTStrix7Coder/Semantic-Ingredient-Parser
Parsley: A Hybrid AI System for Semantic Ingredient Parsing and Dietary Substitution

Abstract

This paper presents Parsley, a web-based intelligent recipe ingredient parsing and substitution system that combines a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model for Named Entity Recognition (NER) with Google's Gemini AI for intelligent ingredient substitution. The system extracts structured information (amount, unit, item, descriptor) from unstructured recipe ingredient text using a fine-tuned BERT-base-uncased model trained on the New York Times Ingredient Parser dataset with a BIO (Beginning-Inside-Outside) tagging scheme. For dietary-constraint-based substitutions, the system leverages Gemini 2.5 Flash to suggest appropriate alternatives with quantity adjustments. The architecture employs a three-tier design: a React frontend with JWT authentication, a FastAPI backend with a MySQL database, and a hybrid AI processing pipeline.

1. System Architecture

1.1 Overview

The system follows a modular three-tier architecture:

┌─────────────────────────────────────────────────────────────┐
│                    Frontend Layer (React)                    │
│  - Login/Authentication UI                                  │
│  - Recipe Input Interface                                   │
│  - Results Visualization                                    │
└────────────────────┬────────────────────────────────────────┘
                     │ HTTP/REST API (JWT Auth)
┌────────────────────▼────────────────────────────────────────┐
│                   Backend Layer (FastAPI)                    │
│  - Authentication Service (JWT + bcrypt)                    │
│  - BERT Inference Engine                                    │
│  - Gemini AI Integration                                    │
│  - Database Management (MySQL)                              │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│                    Data & Model Layer                        │
│  - MySQL Database (User Management)                         │
│  - Fine-tuned BERT Model (bert_recipe_model/)              │
│  - Training Data (NYT Ingredient Dataset)                   │
└─────────────────────────────────────────────────────────────┘

1.2 Component Interactions

  1. User Authentication Flow: User credentials → JWT token generation → Token validation for protected endpoints
  2. Ingredient Parsing Flow: Raw text → BERT tokenization → NER inference → Structured extraction
  3. Substitution Flow: Parsed ingredient + constraint → Gemini API → Substitution suggestion with quantity calculation

2. Methodology

2.1 Data Preprocessing Pipeline

The preprocessing module (scripts/preprocess.py) transforms the NYT Ingredient Parser dataset into BIO-tagged sequences:

Input Format:

input_text, qty, unit, name, comment
"1 cup white flour", 1, cup, white flour,

Output Format:

[
  [["1", "B-AMT"], ["cup", "B-UNIT"], ["white", "B-NAME"], ["flour", "I-NAME"]]
]

Tagging Strategy:

  • B-AMT / I-AMT: Beginning/Inside of Amount (e.g., "1", "1/2")
  • B-UNIT / I-UNIT: Beginning/Inside of Unit (e.g., "cup", "fluid ounces")
  • B-NAME / I-NAME: Beginning/Inside of Ingredient Name (e.g., "olive", "oil")
  • B-DESC / I-DESC: Beginning/Inside of Descriptor (e.g., "chopped", "finely chopped")
  • O: Outside (punctuation, stop words)

Preprocessing Algorithm:

  1. Tokenization using NLTK word tokenizer
  2. Force-tagging of numeric patterns (regex: ^\d+([/\.]\d+)?$)
  3. Force-tagging of known units (24 common measurement units)
  4. Pattern matching against CSV fields (qty, unit, name, comment)
  5. BIO sequence generation with proper label continuity
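The five steps above can be sketched as a single pass over the tokens. The snippet below is a simplified, self-contained illustration: the real scripts/preprocess.py uses the NLTK tokenizer and the full 24-unit list, so the regex tokenizer and the unit subset here are assumptions.

```python
import re

# Subset of the measurement units the preprocessor force-tags; the exact
# 24-unit list lives in scripts/preprocess.py and is not reproduced here.
KNOWN_UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon",
               "teaspoons", "pound", "pounds", "ounce", "ounces",
               "gram", "grams"}

NUMERIC = re.compile(r"^\d+([/.]\d+)?$")  # matches "1", "1/2", "0.5"

def bio_tag(text, qty, unit, name, comment):
    """Turn one CSV row into a BIO-tagged token sequence (simplified sketch)."""
    # Keep fractions/decimals as single tokens, split off punctuation.
    tokens = re.findall(r"\d+(?:[/.]\d+)?|\w+|[^\w\s]", text.lower())
    name_words = name.lower().split() if name else []
    desc_words = comment.lower().split() if comment else []
    tags, inside_name, inside_desc = [], False, False
    for tok in tokens:
        if NUMERIC.match(tok):                 # step 2: force-tag numerics
            tags.append("B-AMT"); inside_name = inside_desc = False
        elif tok in KNOWN_UNITS:               # step 3: force-tag units
            tags.append("B-UNIT"); inside_name = inside_desc = False
        elif tok in name_words:                # step 4: match the CSV name field
            tags.append("I-NAME" if inside_name else "B-NAME")
            inside_name, inside_desc = True, False
        elif tok in desc_words:                # step 4: match the CSV comment field
            tags.append("I-DESC" if inside_desc else "B-DESC")
            inside_desc, inside_name = True, False
        else:                                  # step 5: everything else is O
            tags.append("O"); inside_name = inside_desc = False
    return list(zip(tokens, tags))

print(bio_tag("1 cup white flour", "1", "cup", "white flour", ""))
# -> [('1', 'B-AMT'), ('cup', 'B-UNIT'), ('white', 'B-NAME'), ('flour', 'I-NAME')]
```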

2.2 BERT Model Training

Model Configuration:

  • Base Model: bert-base-uncased (110M parameters)
  • Task: Token Classification (NER)
  • Output Labels: 9 classes (O, B-AMT, I-AMT, B-UNIT, I-UNIT, B-NAME, I-NAME, B-DESC, I-DESC)

Training Hyperparameters:

  • Batch Size: 16 (optimized for RTX 4060 Ti 8GB GPU)
  • Learning Rate: 2e-5 (standard for BERT fine-tuning)
  • Epochs: 5
  • Weight Decay: 0.01
  • Optimizer: AdamW (with Transformers default settings)
  • Mixed Precision: FP16 (for memory efficiency)
  • Evaluation Strategy: Epoch-based
  • Save Strategy: Epoch-based with best model loading (F1 score)

Data Split:

  • Training: 80%
  • Testing: 20%
  • Random Seed: 42 (reproducibility)

Token Alignment: BERT's WordPiece tokenization requires label alignment:

  • Special tokens (CLS, SEP) receive label -100 (ignored in loss)
  • Subword tokens (e.g., "##ed") receive label -100
  • Only the first subword of each word receives the original label
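With a Hugging Face fast tokenizer, this alignment is driven by `word_ids()`, which maps each WordPiece token back to its source word (or `None` for special tokens). The helper below sketches the alignment rule on that word-index list so it runs without the tokenizer itself; the function name is an assumption.

```python
def align_labels(word_ids, word_labels):
    """Map word-level labels onto WordPiece tokens.

    word_ids: as returned by a Hugging Face fast tokenizer's `word_ids()`:
    None for special tokens (CLS/SEP), otherwise the source-word index.
    word_labels: one integer label id per original word.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:               # CLS / SEP: -100 is ignored by the loss
            aligned.append(-100)
        elif wid != previous:         # first subword keeps the real label
            aligned.append(word_labels[wid])
        else:                         # trailing subwords ("##ed") are masked
            aligned.append(-100)
        previous = wid
    return aligned

# "1 cup chopped tomatoes" -> [CLS, "1", "cup", "chop", "##ped", "tomatoes", SEP]
print(align_labels([None, 0, 1, 2, 2, 3, None], [1, 3, 7, 5]))
# -> [-100, 1, 3, 7, -100, 5, -100]
```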

Evaluation Metrics:

  • Precision, Recall, F1-Score (using seqeval)
  • Overall Accuracy
  • Per-entity performance

2.3 Inference Pipeline

BERT Inference Process:

  1. Input text tokenization with is_split_into_words=True
  2. Model forward pass → logits for each token
  3. Aggregation strategy: "simple" (groups consecutive tokens with same label)
  4. Label mapping: Entity groups (AMT, UNIT, NAME, DESC) extracted
  5. Text cleaning: Remove BERT artifacts (## prefixes, spacing issues)

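Once the pipeline (with `aggregation_strategy="simple"`) returns entity groups, they still need to be merged into the four-field API response. The sketch below assumes the dict shape that the transformers NER pipeline produces (`entity_group`, `word`); the merging logic is an illustration rather than the project's exact code.

```python
def extract_fields(entity_groups):
    """Collapse NER pipeline output into the API's four-field response.

    `entity_groups` mimics transformers' pipeline(...,
    aggregation_strategy="simple") output: dicts with "entity_group"
    (AMT/UNIT/NAME/DESC) and "word" keys.
    """
    field_map = {"AMT": "amount", "UNIT": "unit",
                 "NAME": "item", "DESC": "descriptor"}
    result = {"amount": "", "unit": "", "item": "", "descriptor": ""}
    for group in entity_groups:
        field = field_map.get(group["entity_group"])
        if field:
            word = group["word"].replace("##", "")   # strip WordPiece artifacts
            result[field] = (result[field] + " " + word).strip()
    return result

groups = [
    {"entity_group": "AMT",  "word": "1 1/2"},
    {"entity_group": "UNIT", "word": "cups"},
    {"entity_group": "DESC", "word": "chopped"},
    {"entity_group": "NAME", "word": "tomatoes"},
]
print(extract_fields(groups))
# -> {'amount': '1 1/2', 'unit': 'cups', 'item': 'tomatoes', 'descriptor': 'chopped'}
```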

Text Cleaning Algorithm:

import re

def clean_text(text):
    text = text.replace("##", "")  # Remove WordPiece markers
    text = re.sub(r"\s+'\s+", "'", text)  # Fix: "hershey ' s" → "hershey's"
    text = re.sub(r"\s+([,.;])", r"\1", text)  # Fix: "minced ," → "minced,"
    return text.strip()

2.4 Substitution System

Gemini AI Integration:

  • Model: Gemini 2.5 Flash
  • Prompt Engineering: Structured JSON output format
  • Input: Original ingredient (amount, unit, item) + dietary constraint
  • Output: JSON with substitute_item, new_amount, new_unit, reason

Constraint Types Supported:

  • Vegan
  • Keto
  • Gluten-Free
  • Dairy-Free
  • Low-Carb
  • Paleo
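The substitution call reduces to building a constraint-aware prompt and defensively parsing the model's JSON reply. The sketch below keeps the Gemini call itself out so it is self-contained; the prompt wording and helper names are assumptions, not the production prompt.

```python
import json
import re

def build_substitution_prompt(item, amount, unit, constraint):
    """Construct a structured-JSON prompt for the substitution request.

    The wording here is illustrative; the backend's actual prompt may differ.
    """
    return (
        f"Suggest a {constraint} substitute for {amount} {unit} {item}. "
        'Reply with JSON only, in the form: {"substitute_item": str, '
        '"new_amount": str, "new_unit": str, "reason": str}'
    )

def parse_substitution(raw_reply):
    """Extract the first JSON object from the model reply, defensively."""
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    if not match:
        return {"found": False}
    try:
        return {"found": True, **json.loads(match.group())}
    except json.JSONDecodeError:
        return {"found": False}
```

A well-formed reply yields the API's response shape directly, e.g. `parse_substitution('{"substitute_item": "coconut oil", ...}')` returns a dict with `"found": True`; any non-JSON reply degrades to `{"found": False}` instead of raising.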

3. Technical Implementation

3.1 Backend Architecture (FastAPI)

API Server Specifications:

  • Framework: FastAPI 0.104.0+
  • ASGI Server: Uvicorn
  • Python Version: 3.10+
  • Authentication: JWT (JSON Web Tokens) with HS256 algorithm
  • Password Hashing: bcrypt (via passlib)

API Endpoints:

3.1.1 Authentication Endpoints

POST /api/auth/register

  • Request Body: { "email": "string", "password": "string" }
  • Response: { "token": "string", "user": { "id": int, "email": "string" } }
  • Security: Password hashed with bcrypt (12 rounds)

POST /api/auth/login

  • Request Body: { "email": "string", "password": "string" }
  • Response: { "token": "string", "user": { "id": int, "email": "string" } }
  • Validation: Email/password verification against MySQL database

3.1.2 Protected Endpoints

POST /api/parse

  • Authentication: Bearer token required
  • Request Body: { "text": "string" }
  • Response: { "amount": "string", "unit": "string", "item": "string", "descriptor": "string" }
  • Processing: BERT model inference → entity extraction → aggregation

POST /api/substitute

  • Authentication: Bearer token required
  • Request Body: { "item": "string", "amount": "string", "unit": "string", "constraint": "string" }
  • Response: { "found": bool, "substitute_item": "string", "new_amount": "string", "new_unit": "string", "reason": "string" }
  • Processing: Gemini API call → JSON parsing → response formatting

GET /api/health

  • Response: { "status": "healthy", "bert_loaded": bool, "db_connected": bool }

JWT Token Configuration:

  • Algorithm: HS256
  • Secret Key: Configurable via environment variable
  • Expiration: 30 days (43,200 minutes)
  • Payload: { "sub": "user_email", "exp": timestamp }

3.2 Database Design

Schema: MySQL Database recipeai

Table: users

CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_email (email)
);

Security Features:

  • Unique email constraint
  • Bcrypt password hashing (72-byte limit, automatically handled)
  • Indexed email for fast lookups
  • Automatic timestamp tracking

3.3 Frontend Architecture (React)

Technology Stack:

  • React 19.2.0
  • React Router DOM 6.26.0 (client-side routing)
  • Tailwind CSS 4.1.18 (utility-first styling)
  • React Icons 5.5.0 (icon library)
  • Vite (build tool and dev server)

Component Structure:

App.jsx (Router)
├── Login.jsx (Authentication UI)
└── Dashboard.jsx (Main Application)
    ├── RecipeInput (Parse interface)
    ├── ParsedResults (ResultCard components)
    ├── SubstituteSection (Constraint selection)
    └── SubstituteResult (AI suggestions)

State Management:

  • React Hooks (useState, useEffect)
  • LocalStorage for JWT token persistence
  • Protected routes with authentication checks

UI/UX Features:

  • Responsive design (mobile-first)
  • Gradient color scheme (orange/amber theme)
  • Icon-based visual feedback
  • Loading states with spinners
  • Error handling with user-friendly messages
  • Form validation

3.4 Security Implementation

Authentication Flow:

  1. User submits credentials (email + password)
  2. Server validates against database (bcrypt comparison)
  3. JWT token generated with user email as subject
  4. Token returned to client, stored in localStorage
  5. Subsequent requests include token in Authorization: Bearer <token> header
  6. Server validates token signature and expiration
  7. User identity extracted from token payload

Security Measures:

  • Password hashing: bcrypt (computationally expensive, resistant to rainbow tables)
  • Token expiration: 30-day validity period
  • HTTPS recommended for production (tokens transmitted over network)
  • CORS configuration: Currently permissive (*), should be restricted in production
  • SQL injection prevention: Parameterized queries (PyMySQL)
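The last point is the parameterized-query pattern: values travel to the driver separately from the SQL text, so attacker-controlled input is never interpolated into the statement. The example uses sqlite3 so it is self-contained; PyMySQL follows the same DB-API pattern with `%s` placeholders instead of sqlite's `?`.

```python
import sqlite3

# sqlite3 stands in for PyMySQL here; the users schema mirrors the one
# defined in Section 3.2, trimmed to the columns the lookup needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, "
             "email TEXT UNIQUE, password_hash TEXT)")
conn.execute("INSERT INTO users (email, password_hash) VALUES (?, ?)",
             ("alice@example.com", "$2b$12$..."))

def find_user(email):
    # The driver escapes `email` itself -- never format it into the SQL string.
    cur = conn.execute("SELECT id, email FROM users WHERE email = ?", (email,))
    return cur.fetchone()

print(find_user("' OR '1'='1"))       # -> None: treated as a literal, not SQL
print(find_user("alice@example.com"))
```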

4. Workflow

4.1 System Setup Workflow

  1. Database Initialization

    mysql -u root -p < scripts/setup_database.sql

    Creates database, user, and grants privileges.

  2. Data Preprocessing

    python scripts/preprocess.py

    Processes NYT dataset → generates data/training_data.json

  3. Model Training

    python scripts/train_BERT.py

    Fine-tunes BERT model → saves to bert_recipe_model/

  4. Backend Server

    python scripts/api.py

    Starts FastAPI server on port 5000

  5. Frontend Development

    cd RecipeAI
    npm install
    npm run dev

    Starts Vite dev server (typically port 5173)

4.2 User Interaction Workflow

  1. Registration/Login

    • User navigates to login page
    • Submits email and password
    • Server authenticates and returns JWT token
    • Token stored in localStorage
    • User redirected to dashboard
  2. Ingredient Parsing

    • User enters recipe ingredient text (e.g., "1 1/2 cups chopped tomatoes")
    • Frontend sends POST request to /api/parse with JWT token
    • Backend loads BERT model (if not already loaded)
    • Text tokenized and processed through BERT
    • Entities extracted and aggregated
    • Response: { "amount": "1 1/2", "unit": "cups", "item": "tomatoes", "descriptor": "chopped" }
    • Frontend displays parsed results in card layout
  3. Substitution Request

    • User selects dietary constraint from dropdown
    • Frontend sends POST request to /api/substitute with parsed data + constraint
    • Backend constructs prompt for Gemini API
    • Gemini returns JSON with substitution suggestion
    • Frontend displays substitute with new quantities and reasoning

5. Technical Specifications

5.1 Model Specifications

BERT Configuration:

  • Architecture: BERT-base (12 layers, 768 hidden size, 12 attention heads)
  • Vocabulary Size: 30,522 tokens
  • Max Sequence Length: 512 tokens (with truncation)
  • Fine-tuning Method: Transfer learning with task-specific head
  • Output Layer: Linear classification head (768 → 9 labels)

Training Environment:

  • Hardware: NVIDIA RTX 4060 Ti (8GB VRAM)
  • Framework: PyTorch with Hugging Face Transformers
  • Mixed Precision: FP16 (reduces memory by ~50%)
  • Data Loader Workers: 0 (Windows compatibility)
  • Checkpointing: Epoch-based with best model selection

5.2 API Performance

Response Times (Estimated):

  • Authentication: < 100ms (database lookup + JWT generation)
  • BERT Parsing: 200-500ms (model inference on GPU)
  • Gemini Substitution: 1-3 seconds (API call to Google)

Scalability Considerations:

  • BERT model loading: Single instance in memory (lazy loading possible)
  • Database connections: PyMySQL connection pooling recommended for production
  • Caching: Could implement Redis for frequently parsed ingredients
  • Load balancing: FastAPI supports horizontal scaling
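As a minimal sketch of the caching idea, an in-process `functools.lru_cache` memoizes repeated ingredient strings; a Redis layer would play the same role across worker processes. `run_bert_inference` below is a hypothetical stand-in for the real model call.

```python
from functools import lru_cache

def run_bert_inference(text):
    # Placeholder for the actual BERT pipeline call (assumption, not the
    # project's function name).
    return {"amount": "1", "unit": "cup", "item": text, "descriptor": ""}

@lru_cache(maxsize=4096)
def parse_cached(text: str) -> tuple:
    """Memoize parses for frequently seen ingredient strings.

    Returned as a tuple of items because lru_cache requires hashable
    values; callers can rebuild the dict with dict(...).
    """
    return tuple(run_bert_inference(text).items())
```

Calling `parse_cached` twice with the same string hits the model only once; `parse_cached.cache_info()` exposes hit/miss counts for monitoring.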

5.3 Frontend Performance

Optimization Techniques:

  • Code splitting: Vite automatically handles this
  • Icon library: Tree-shaking (only imported icons included)
  • CSS: Tailwind JIT compilation (only used classes included)
  • Asset optimization: Vite handles image and asset optimization

6. Dataset

Source: New York Times Ingredient Parser Dataset (2015 snapshot)

Dataset Characteristics:

  • Format: CSV with columns (input_text, qty, unit, name, comment)
  • Size: 179,208 rows (raw data)
  • Preprocessing: First 5,000 rows used for training (subset for development)
  • Quality: Requires filtering (removed rows with missing 'input' or 'name')

BIO Tag Distribution:

  • O (Outside): Most common (punctuation, stop words)
  • B-NAME / I-NAME: Ingredient names (primary entities)
  • B-AMT / I-AMT: Numeric quantities
  • B-UNIT / I-UNIT: Measurement units
  • B-DESC / I-DESC: Preparation descriptors

7. Results & Evaluation

Model Performance Metrics:

  • Evaluation performed using seqeval library (standard NER evaluation)
  • Metrics: Precision, Recall, F1-Score, Accuracy
  • Best model selected based on F1-score on validation set

System Capabilities:

  • Handles complex ingredient descriptions with multiple components
  • Recognizes fractions, decimals, and mixed numbers
  • Identifies compound units (e.g., "fluid ounces")
  • Extracts multi-word ingredient names
  • Processes preparation descriptors

Limitations:

  • BERT model requires GPU for optimal inference speed
  • Gemini API dependency for substitutions (external service)
  • Limited to English language
  • Training data focused on common cooking measurements

8. Technologies & Dependencies

8.1 Backend Dependencies

fastapi>=0.104.0
uvicorn[standard]>=0.24.0
transformers>=4.30.0
torch>=2.0.0
google-generativeai>=0.3.0
PyMySQL>=1.1.0
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
pydantic>=2.0.0
accelerate>=0.20.0
sentencepiece>=0.1.99
python-dotenv>=1.0.0
bcrypt>=4.0.0,<5.0.0

8.2 Frontend Dependencies

react>=19.2.0
react-dom>=19.2.0
react-router-dom>=6.26.0
react-icons>=5.5.0
tailwindcss>=4.1.18
@tailwindcss/vite>=4.1.18

8.3 Development Tools

  • Vite (frontend build tool)
  • ESLint (code linting)
  • Python 3.10+ (backend runtime)
  • Node.js 18+ (frontend runtime)
  • MySQL 8.0+ (database)

9. Installation & Deployment

9.1 Prerequisites

  • Python 3.10 or higher
  • Node.js 18 or higher
  • MySQL 8.0 or higher
  • NVIDIA GPU (recommended for BERT inference)
  • CUDA toolkit (for GPU acceleration)

9.2 Installation Steps

Backend Setup:

# Install Python dependencies
pip install -r requirements.txt

# Initialize database
mysql -u root -p < scripts/setup_database.sql

# Configure database credentials in scripts/api.py or via environment variables
export DB_HOST=localhost
export DB_PORT=3306
export DB_NAME=recipeai
export DB_USER=your_username
export DB_PASSWORD=your_password

# Train BERT model (if not already trained)
python scripts/preprocess.py
python scripts/train_BERT.py

# Start API server
python scripts/api.py

Frontend Setup:

cd RecipeAI
npm install
npm run dev

9.3 Production Deployment Considerations

Security:

  • Set strong SECRET_KEY for JWT
  • Restrict CORS origins to frontend domain
  • Use HTTPS for all communications
  • Implement rate limiting
  • Add input validation and sanitization

Performance:

  • Use production-grade ASGI server (Gunicorn with Uvicorn workers)
  • Implement connection pooling for database
  • Add caching layer (Redis) for frequent queries
  • Consider model quantization for faster inference
  • Use CDN for frontend static assets

Monitoring:

  • Logging: Structured logging with levels
  • Health checks: /api/health endpoint
  • Error tracking: Sentry or similar service
  • Metrics: Prometheus + Grafana

10. Future Work

  1. Multi-language Support: Extend to other languages with multilingual BERT models
  2. Offline Substitution: Local knowledge base for common substitutions (reduce API dependency)
  3. Batch Processing: Support for entire recipe parsing (multiple ingredients)
  4. User Preferences: Save dietary preferences per user
  5. Recipe Storage: Allow users to save and manage recipes
  6. Mobile App: Native mobile application with React Native
  7. Advanced NER: Include extraction of cooking methods, temperatures, times
  8. Nutritional Information: Integration with nutritional databases
  9. Model Optimization: Quantization and pruning for edge deployment
  10. Active Learning: User feedback loop for model improvement

11. References & Acknowledgments

  • BERT Paper: Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL 2019
  • Dataset: New York Times Ingredient Parser Dataset (2015)
  • Libraries: Hugging Face Transformers, FastAPI, React, PyTorch
  • AI Models: Google Gemini 2.5 Flash, BERT-base-uncased

12. Contact & License

For questions or contributions, please refer to the project repository.


Note: This system is designed for research and educational purposes. For production deployment, additional security hardening, performance optimization, and compliance considerations should be addressed.
