HTStrix7Coder/Semantic-Ingredient-Parser
Parsley: A Hybrid AI System for Semantic Ingredient Parsing and Dietary Substitution

Abstract

This paper presents Parsley, a web-based intelligent recipe ingredient parsing and substitution system that combines a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model for Named Entity Recognition (NER) with Google's Gemini AI for intelligent ingredient substitution. The system extracts structured information (amount, unit, item, descriptor) from unstructured recipe ingredient text using a fine-tuned BERT-base-uncased model trained on the New York Times Ingredient Parser dataset with a BIO (Beginning-Inside-Outside) tagging scheme. For dietary-constraint-based substitutions, the system leverages Gemini 2.5 Flash to suggest appropriate alternatives with quantity adjustments. The architecture employs a three-tier design: a React frontend with JWT authentication, a FastAPI backend with a MySQL database, and a hybrid AI processing pipeline.

1. System Architecture

1.1 Overview

The system follows a modular three-tier architecture:

┌─────────────────────────────────────────────────────────────┐
│                    Frontend Layer (React)                    │
│  - Login/Authentication UI                                  │
│  - Recipe Input Interface                                   │
│  - Results Visualization                                    │
└────────────────────┬────────────────────────────────────────┘
                     │ HTTP/REST API (JWT Auth)
┌────────────────────▼────────────────────────────────────────┐
│                   Backend Layer (FastAPI)                    │
│  - Authentication Service (JWT + bcrypt)                    │
│  - BERT Inference Engine                                    │
│  - Gemini AI Integration                                    │
│  - Database Management (MySQL)                              │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│                    Data & Model Layer                        │
│  - MySQL Database (User Management)                         │
│  - Fine-tuned BERT Model (bert_recipe_model/)              │
│  - Training Data (NYT Ingredient Dataset)                   │
└─────────────────────────────────────────────────────────────┘

1.2 Component Interactions

  1. User Authentication Flow: User credentials → JWT token generation → Token validation for protected endpoints
  2. Ingredient Parsing Flow: Raw text → BERT tokenization → NER inference → Structured extraction
  3. Substitution Flow: Parsed ingredient + constraint → Gemini API → Substitution suggestion with quantity calculation

2. Methodology

2.1 Data Preprocessing Pipeline

The preprocessing module (scripts/preprocess.py) transforms the NYT Ingredient Parser dataset into BIO-tagged sequences:

Input Format:

input_text, qty, unit, name, comment
"1 cup white flour", 1, cup, white flour,

Output Format:

[
  [["1", "B-AMT"], ["cup", "B-UNIT"], ["white", "B-NAME"], ["flour", "I-NAME"]]
]

Tagging Strategy:

  • B-AMT / I-AMT: Beginning/Inside of Amount (e.g., "1", "1/2")
  • B-UNIT / I-UNIT: Beginning/Inside of Unit (e.g., "cup", "fluid ounces")
  • B-NAME / I-NAME: Beginning/Inside of Ingredient Name (e.g., "olive", "oil")
  • B-DESC / I-DESC: Beginning/Inside of Descriptor (e.g., "chopped", "finely chopped")
  • O: Outside (punctuation, stop words)

Preprocessing Algorithm:

  1. Tokenization using NLTK word tokenizer
  2. Force-tagging of numeric patterns (regex: ^\d+([/\.]\d+)?$)
  3. Force-tagging of known units (24 common measurement units)
  4. Pattern matching against CSV fields (qty, unit, name, comment)
  5. BIO sequence generation with proper label continuity
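The five steps above can be sketched as a single pass over the tokens. The snippet below is a simplified, self-contained illustration: the real scripts/preprocess.py uses the NLTK tokenizer and the full 24-unit list, so the regex tokenizer and the unit subset here are assumptions.

```python
import re

# Subset of the measurement units the preprocessor force-tags; the exact
# 24-unit list lives in scripts/preprocess.py and is not reproduced here.
KNOWN_UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon",
               "teaspoons", "pound", "pounds", "ounce", "ounces",
               "gram", "grams"}

NUMERIC = re.compile(r"^\d+([/.]\d+)?$")  # matches "1", "1/2", "0.5"

def bio_tag(text, qty, unit, name, comment):
    """Turn one CSV row into a BIO-tagged token sequence (simplified sketch)."""
    # Keep fractions/decimals as single tokens, split off punctuation.
    tokens = re.findall(r"\d+(?:[/.]\d+)?|\w+|[^\w\s]", text.lower())
    name_words = name.lower().split() if name else []
    desc_words = comment.lower().split() if comment else []
    tags, inside_name, inside_desc = [], False, False
    for tok in tokens:
        if NUMERIC.match(tok):                 # step 2: force-tag numerics
            tags.append("B-AMT"); inside_name = inside_desc = False
        elif tok in KNOWN_UNITS:               # step 3: force-tag units
            tags.append("B-UNIT"); inside_name = inside_desc = False
        elif tok in name_words:                # step 4: match the CSV name field
            tags.append("I-NAME" if inside_name else "B-NAME")
            inside_name, inside_desc = True, False
        elif tok in desc_words:                # step 4: match the CSV comment field
            tags.append("I-DESC" if inside_desc else "B-DESC")
            inside_desc, inside_name = True, False
        else:                                  # step 5: everything else is O
            tags.append("O"); inside_name = inside_desc = False
    return list(zip(tokens, tags))

print(bio_tag("1 cup white flour", "1", "cup", "white flour", ""))
# -> [('1', 'B-AMT'), ('cup', 'B-UNIT'), ('white', 'B-NAME'), ('flour', 'I-NAME')]
```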

2.2 BERT Model Training

Model Configuration:

  • Base Model: bert-base-uncased (110M parameters)
  • Task: Token Classification (NER)
  • Output Labels: 9 classes (O, B-AMT, I-AMT, B-UNIT, I-UNIT, B-NAME, I-NAME, B-DESC, I-DESC)

Training Hyperparameters:

  • Batch Size: 16 (optimized for RTX 4060 Ti 8GB GPU)
  • Learning Rate: 2e-5 (standard for BERT fine-tuning)
  • Epochs: 5
  • Weight Decay: 0.01
  • Optimizer: AdamW (with Transformers default settings)
  • Mixed Precision: FP16 (for memory efficiency)
  • Evaluation Strategy: Epoch-based
  • Save Strategy: Epoch-based with best model loading (F1 score)

Data Split:

  • Training: 80%
  • Testing: 20%
  • Random Seed: 42 (reproducibility)

Token Alignment: BERT's WordPiece tokenization requires label alignment:

  • Special tokens (CLS, SEP) receive label -100 (ignored in loss)
  • Subword tokens (e.g., "##ed") receive label -100
  • Only the first subword of each word receives the original label
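With a Hugging Face fast tokenizer, this alignment is driven by `word_ids()`, which maps each WordPiece token back to its source word (or `None` for special tokens). The helper below sketches the alignment rule on that word-index list so it runs without the tokenizer itself; the function name is an assumption.

```python
def align_labels(word_ids, word_labels):
    """Map word-level labels onto WordPiece tokens.

    word_ids: as returned by a Hugging Face fast tokenizer's `word_ids()`:
    None for special tokens (CLS/SEP), otherwise the source-word index.
    word_labels: one integer label id per original word.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:               # CLS / SEP: -100 is ignored by the loss
            aligned.append(-100)
        elif wid != previous:         # first subword keeps the real label
            aligned.append(word_labels[wid])
        else:                         # trailing subwords ("##ed") are masked
            aligned.append(-100)
        previous = wid
    return aligned

# "1 cup chopped tomatoes" -> [CLS, "1", "cup", "chop", "##ped", "tomatoes", SEP]
print(align_labels([None, 0, 1, 2, 2, 3, None], [1, 3, 7, 5]))
# -> [-100, 1, 3, 7, -100, 5, -100]
```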

Evaluation Metrics:

  • Precision, Recall, F1-Score (using seqeval)
  • Overall Accuracy
  • Per-entity performance

2.3 Inference Pipeline

BERT Inference Process:

  1. Input text tokenization with is_split_into_words=True
  2. Model forward pass → logits for each token
  3. Aggregation strategy: "simple" (groups consecutive tokens with same label)
  4. Label mapping: Entity groups (AMT, UNIT, NAME, DESC) extracted
  5. Text cleaning: Remove BERT artifacts (## prefixes, spacing issues)

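Once the pipeline (with `aggregation_strategy="simple"`) returns entity groups, they still need to be merged into the four-field API response. The sketch below assumes the dict shape that the transformers NER pipeline produces (`entity_group`, `word`); the merging logic is an illustration rather than the project's exact code.

```python
def extract_fields(entity_groups):
    """Collapse NER pipeline output into the API's four-field response.

    `entity_groups` mimics transformers' pipeline(...,
    aggregation_strategy="simple") output: dicts with "entity_group"
    (AMT/UNIT/NAME/DESC) and "word" keys.
    """
    field_map = {"AMT": "amount", "UNIT": "unit",
                 "NAME": "item", "DESC": "descriptor"}
    result = {"amount": "", "unit": "", "item": "", "descriptor": ""}
    for group in entity_groups:
        field = field_map.get(group["entity_group"])
        if field:
            word = group["word"].replace("##", "")   # strip WordPiece artifacts
            result[field] = (result[field] + " " + word).strip()
    return result

groups = [
    {"entity_group": "AMT",  "word": "1 1/2"},
    {"entity_group": "UNIT", "word": "cups"},
    {"entity_group": "DESC", "word": "chopped"},
    {"entity_group": "NAME", "word": "tomatoes"},
]
print(extract_fields(groups))
# -> {'amount': '1 1/2', 'unit': 'cups', 'item': 'tomatoes', 'descriptor': 'chopped'}
```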

Text Cleaning Algorithm:

import re

def clean_text(text):
    text = text.replace("##", "")  # Remove WordPiece markers
    text = re.sub(r"\s+'\s+", "'", text)  # Fix: "hershey ' s" → "hershey's"
    text = re.sub(r"\s+([,.;])", r"\1", text)  # Fix: "minced ," → "minced,"
    return text.strip()

2.4 Substitution System

Gemini AI Integration:

  • Model: Gemini 2.5 Flash
  • Prompt Engineering: Structured JSON output format
  • Input: Original ingredient (amount, unit, item) + dietary constraint
  • Output: JSON with substitute_item, new_amount, new_unit, reason

Constraint Types Supported:

  • Vegan
  • Keto
  • Gluten-Free
  • Dairy-Free
  • Low-Carb
  • Paleo
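The substitution call reduces to building a constraint-aware prompt and defensively parsing the model's JSON reply. The sketch below keeps the Gemini call itself out so it is self-contained; the prompt wording and helper names are assumptions, not the production prompt.

```python
import json
import re

def build_substitution_prompt(item, amount, unit, constraint):
    """Construct a structured-JSON prompt for the substitution request.

    The wording here is illustrative; the backend's actual prompt may differ.
    """
    return (
        f"Suggest a {constraint} substitute for {amount} {unit} {item}. "
        'Reply with JSON only, in the form: {"substitute_item": str, '
        '"new_amount": str, "new_unit": str, "reason": str}'
    )

def parse_substitution(raw_reply):
    """Extract the first JSON object from the model reply, defensively."""
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    if not match:
        return {"found": False}
    try:
        return {"found": True, **json.loads(match.group())}
    except json.JSONDecodeError:
        return {"found": False}
```

A well-formed reply yields the API's response shape directly, e.g. `parse_substitution('{"substitute_item": "coconut oil", ...}')` returns a dict with `"found": True`; any non-JSON reply degrades to `{"found": False}` instead of raising.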

3. Technical Implementation

3.1 Backend Architecture (FastAPI)

API Server Specifications:

  • Framework: FastAPI 0.104.0+
  • ASGI Server: Uvicorn
  • Python Version: 3.10+
  • Authentication: JWT (JSON Web Tokens) with HS256 algorithm
  • Password Hashing: bcrypt (via passlib)

API Endpoints:

3.1.1 Authentication Endpoints

POST /api/auth/register

  • Request Body: { "email": "string", "password": "string" }
  • Response: { "token": "string", "user": { "id": int, "email": "string" } }
  • Security: Password hashed with bcrypt (12 rounds)

POST /api/auth/login

  • Request Body: { "email": "string", "password": "string" }
  • Response: { "token": "string", "user": { "id": int, "email": "string" } }
  • Validation: Email/password verification against MySQL database

3.1.2 Protected Endpoints

POST /api/parse

  • Authentication: Bearer token required
  • Request Body: { "text": "string" }
  • Response: { "amount": "string", "unit": "string", "item": "string", "descriptor": "string" }
  • Processing: BERT model inference → entity extraction → aggregation

POST /api/substitute

  • Authentication: Bearer token required
  • Request Body: { "item": "string", "amount": "string", "unit": "string", "constraint": "string" }
  • Response: { "found": bool, "substitute_item": "string", "new_amount": "string", "new_unit": "string", "reason": "string" }
  • Processing: Gemini API call → JSON parsing → response formatting

GET /api/health

  • Response: { "status": "healthy", "bert_loaded": bool, "db_connected": bool }

JWT Token Configuration:

  • Algorithm: HS256
  • Secret Key: Configurable via environment variable
  • Expiration: 30 days (43,200 minutes)
  • Payload: { "sub": "user_email", "exp": timestamp }

3.2 Database Design

Schema: MySQL Database recipeai

Table: users

CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_email (email)
);

Security Features:

  • Unique email constraint
  • Bcrypt password hashing (72-byte limit, automatically handled)
  • Indexed email for fast lookups
  • Automatic timestamp tracking

3.3 Frontend Architecture (React)

Technology Stack:

  • React 19.2.0
  • React Router DOM 6.26.0 (client-side routing)
  • Tailwind CSS 4.1.18 (utility-first styling)
  • React Icons 5.5.0 (icon library)
  • Vite (build tool and dev server)

Component Structure:

App.jsx (Router)
├── Login.jsx (Authentication UI)
└── Dashboard.jsx (Main Application)
    ├── RecipeInput (Parse interface)
    ├── ParsedResults (ResultCard components)
    ├── SubstituteSection (Constraint selection)
    └── SubstituteResult (AI suggestions)

State Management:

  • React Hooks (useState, useEffect)
  • LocalStorage for JWT token persistence
  • Protected routes with authentication checks

UI/UX Features:

  • Responsive design (mobile-first)
  • Gradient color scheme (orange/amber theme)
  • Icon-based visual feedback
  • Loading states with spinners
  • Error handling with user-friendly messages
  • Form validation

3.4 Security Implementation

Authentication Flow:

  1. User submits credentials (email + password)
  2. Server validates against database (bcrypt comparison)
  3. JWT token generated with user email as subject
  4. Token returned to client, stored in localStorage
  5. Subsequent requests include token in Authorization: Bearer <token> header
  6. Server validates token signature and expiration
  7. User identity extracted from token payload

Security Measures:

  • Password hashing: bcrypt (computationally expensive, resistant to rainbow tables)
  • Token expiration: 30-day validity period
  • HTTPS recommended for production (tokens transmitted over network)
  • CORS configuration: Currently permissive (*), should be restricted in production
  • SQL injection prevention: Parameterized queries (PyMySQL)
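The last point is the parameterized-query pattern: values travel to the driver separately from the SQL text, so attacker-controlled input is never interpolated into the statement. The example uses sqlite3 so it is self-contained; PyMySQL follows the same DB-API pattern with `%s` placeholders instead of sqlite's `?`.

```python
import sqlite3

# sqlite3 stands in for PyMySQL here; the users schema mirrors the one
# defined in Section 3.2, trimmed to the columns the lookup needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, "
             "email TEXT UNIQUE, password_hash TEXT)")
conn.execute("INSERT INTO users (email, password_hash) VALUES (?, ?)",
             ("alice@example.com", "$2b$12$..."))

def find_user(email):
    # The driver escapes `email` itself -- never format it into the SQL string.
    cur = conn.execute("SELECT id, email FROM users WHERE email = ?", (email,))
    return cur.fetchone()

print(find_user("' OR '1'='1"))       # -> None: treated as a literal, not SQL
print(find_user("alice@example.com"))
```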

4. Workflow

4.1 System Setup Workflow

  1. Database Initialization

    mysql -u root -p < scripts/setup_database.sql

    Creates database, user, and grants privileges.

  2. Data Preprocessing

    python scripts/preprocess.py

    Processes NYT dataset → generates data/training_data.json

  3. Model Training

    python scripts/train_BERT.py

    Fine-tunes BERT model → saves to bert_recipe_model/

  4. Backend Server

    python scripts/api.py

    Starts FastAPI server on port 5000

  5. Frontend Development

    cd RecipeAI
    npm install
    npm run dev

    Starts Vite dev server (typically port 5173)

4.2 User Interaction Workflow

  1. Registration/Login

    • User navigates to login page
    • Submits email and password
    • Server authenticates and returns JWT token
    • Token stored in localStorage
    • User redirected to dashboard
  2. Ingredient Parsing

    • User enters recipe ingredient text (e.g., "1 1/2 cups chopped tomatoes")
    • Frontend sends POST request to /api/parse with JWT token
    • Backend loads BERT model (if not already loaded)
    • Text tokenized and processed through BERT
    • Entities extracted and aggregated
    • Response: { "amount": "1 1/2", "unit": "cups", "item": "tomatoes", "descriptor": "chopped" }
    • Frontend displays parsed results in card layout
  3. Substitution Request

    • User selects dietary constraint from dropdown
    • Frontend sends POST request to /api/substitute with parsed data + constraint
    • Backend constructs prompt for Gemini API
    • Gemini returns JSON with substitution suggestion
    • Frontend displays substitute with new quantities and reasoning

5. Technical Specifications

5.1 Model Specifications

BERT Configuration:

  • Architecture: BERT-base (12 layers, 768 hidden size, 12 attention heads)
  • Vocabulary Size: 30,522 tokens
  • Max Sequence Length: 512 tokens (with truncation)
  • Fine-tuning Method: Transfer learning with task-specific head
  • Output Layer: Linear classification head (768 → 9 labels)

Training Environment:

  • Hardware: NVIDIA RTX 4060 Ti (8GB VRAM)
  • Framework: PyTorch with Hugging Face Transformers
  • Mixed Precision: FP16 (reduces memory by ~50%)
  • Data Loader Workers: 0 (Windows compatibility)
  • Checkpointing: Epoch-based with best model selection

5.2 API Performance

Response Times (Estimated):

  • Authentication: < 100ms (database lookup + JWT generation)
  • BERT Parsing: 200-500ms (model inference on GPU)
  • Gemini Substitution: 1-3 seconds (API call to Google)

Scalability Considerations:

  • BERT model loading: Single instance in memory (lazy loading possible)
  • Database connections: PyMySQL connection pooling recommended for production
  • Caching: Could implement Redis for frequently parsed ingredients
  • Load balancing: FastAPI supports horizontal scaling
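As a minimal sketch of the caching idea, an in-process `functools.lru_cache` memoizes repeated ingredient strings; a Redis layer would play the same role across worker processes. `run_bert_inference` below is a hypothetical stand-in for the real model call.

```python
from functools import lru_cache

def run_bert_inference(text):
    # Placeholder for the actual BERT pipeline call (assumption, not the
    # project's function name).
    return {"amount": "1", "unit": "cup", "item": text, "descriptor": ""}

@lru_cache(maxsize=4096)
def parse_cached(text: str) -> tuple:
    """Memoize parses for frequently seen ingredient strings.

    Returned as a tuple of items because lru_cache requires hashable
    values; callers can rebuild the dict with dict(...).
    """
    return tuple(run_bert_inference(text).items())
```

Calling `parse_cached` twice with the same string hits the model only once; `parse_cached.cache_info()` exposes hit/miss counts for monitoring.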

5.3 Frontend Performance

Optimization Techniques:

  • Code splitting: Vite automatically handles this
  • Icon library: Tree-shaking (only imported icons included)
  • CSS: Tailwind JIT compilation (only used classes included)
  • Asset optimization: Vite handles image and asset optimization

6. Dataset

Source: New York Times Ingredient Parser Dataset (2015 snapshot)

Dataset Characteristics:

  • Format: CSV with columns (input_text, qty, unit, name, comment)
  • Size: 179,208 rows (raw data)
  • Preprocessing: First 5,000 rows used for training (subset for development)
  • Quality: Requires filtering (removed rows with missing 'input' or 'name')

BIO Tag Distribution:

  • O (Outside): Most common (punctuation, stop words)
  • B-NAME / I-NAME: Ingredient names (primary entities)
  • B-AMT / I-AMT: Numeric quantities
  • B-UNIT / I-UNIT: Measurement units
  • B-DESC / I-DESC: Preparation descriptors

7. Results & Evaluation

Model Performance Metrics:

  • Evaluation performed using seqeval library (standard NER evaluation)
  • Metrics: Precision, Recall, F1-Score, Accuracy
  • Best model selected based on F1-score on validation set

System Capabilities:

  • Handles complex ingredient descriptions with multiple components
  • Recognizes fractions, decimals, and mixed numbers
  • Identifies compound units (e.g., "fluid ounces")
  • Extracts multi-word ingredient names
  • Processes preparation descriptors

Limitations:

  • BERT model requires GPU for optimal inference speed
  • Gemini API dependency for substitutions (external service)
  • Limited to English language
  • Training data focused on common cooking measurements

8. Technologies & Dependencies

8.1 Backend Dependencies

fastapi>=0.104.0
uvicorn[standard]>=0.24.0
transformers>=4.30.0
torch>=2.0.0
google-generativeai>=0.3.0
PyMySQL>=1.1.0
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
pydantic>=2.0.0
accelerate>=0.20.0
sentencepiece>=0.1.99
python-dotenv>=1.0.0
bcrypt>=4.0.0,<5.0.0

8.2 Frontend Dependencies

react>=19.2.0
react-dom>=19.2.0
react-router-dom>=6.26.0
react-icons>=5.5.0
tailwindcss>=4.1.18
@tailwindcss/vite>=4.1.18

8.3 Development Tools

  • Vite (frontend build tool)
  • ESLint (code linting)
  • Python 3.10+ (backend runtime)
  • Node.js 18+ (frontend runtime)
  • MySQL 8.0+ (database)

9. Installation & Deployment

9.1 Prerequisites

  • Python 3.10 or higher
  • Node.js 18 or higher
  • MySQL 8.0 or higher
  • NVIDIA GPU (recommended for BERT inference)
  • CUDA toolkit (for GPU acceleration)

9.2 Installation Steps

Backend Setup:

# Install Python dependencies
pip install -r requirements.txt

# Initialize database
mysql -u root -p < scripts/setup_database.sql

# Configure database credentials in scripts/api.py or via environment variables
export DB_HOST=localhost
export DB_PORT=3306
export DB_NAME=recipeai
export DB_USER=your_username
export DB_PASSWORD=your_password

# Train BERT model (if not already trained)
python scripts/preprocess.py
python scripts/train_BERT.py

# Start API server
python scripts/api.py

Frontend Setup:

cd RecipeAI
npm install
npm run dev

9.3 Production Deployment Considerations

Security:

  • Set strong SECRET_KEY for JWT
  • Restrict CORS origins to frontend domain
  • Use HTTPS for all communications
  • Implement rate limiting
  • Add input validation and sanitization

Performance:

  • Use production-grade ASGI server (Gunicorn with Uvicorn workers)
  • Implement connection pooling for database
  • Add caching layer (Redis) for frequent queries
  • Consider model quantization for faster inference
  • Use CDN for frontend static assets

Monitoring:

  • Logging: Structured logging with levels
  • Health checks: /api/health endpoint
  • Error tracking: Sentry or similar service
  • Metrics: Prometheus + Grafana

10. Future Work

  1. Multi-language Support: Extend to other languages with multilingual BERT models
  2. Offline Substitution: Local knowledge base for common substitutions (reduce API dependency)
  3. Batch Processing: Support for entire recipe parsing (multiple ingredients)
  4. User Preferences: Save dietary preferences per user
  5. Recipe Storage: Allow users to save and manage recipes
  6. Mobile App: Native mobile application with React Native
  7. Advanced NER: Include extraction of cooking methods, temperatures, times
  8. Nutritional Information: Integration with nutritional databases
  9. Model Optimization: Quantization and pruning for edge deployment
  10. Active Learning: User feedback loop for model improvement

11. References & Acknowledgments

  • BERT Paper: Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL 2019
  • Dataset: New York Times Ingredient Parser Dataset (2015)
  • Libraries: Hugging Face Transformers, FastAPI, React, PyTorch
  • AI Models: Google Gemini 2.5 Flash, BERT-base-uncased

12. Contact & License

For questions or contributions, please refer to the project repository.


Note: This system is designed for research and educational purposes. For production deployment, additional security hardening, performance optimization, and compliance considerations should be addressed.
