ThomasMoulin-hub/PACT

PACT52 – Recipe Recommendation & Smart Grocery Assistance

1. Overview

PACT52 is a multi-module project combining:

  • Large-scale recipe data scraping
  • A MongoDB + Flask REST API exposing recipe and user data
  • A semantic recipe recommendation engine (SentenceTransformers + Faiss)
  • E-ticket (PDF) parsing to extract purchased products and match them to normalized ingredients
  • Android client applications ("Android" and "PetitCuistot") for user interaction and planning

The goal is to help users manage pantry/inventory, parse grocery tickets, and receive personalized recipe suggestions.

2. High-Level Architecture

Scrapping/  -->  CSV (Recette.csv)  -->  MongoDB (Recipes collection)
                                        |\
                                        | \--> Flask API (/recipes, /users, /search)
Recommandation/v2/ (embeddings + Faiss) --> Recommendation queries via future endpoint
E_ticket/ (PDF parser + ingredient matching) --> Normalized product list --> Inventory
Android & PetitCuistot apps --> Consume API + Display recommendations

Supporting documentation and diagrams live under rapport/ (architecture, sequence, storyboard images).

3. Repository Structure (Key Directories)

  • Scrapping/: Selenium-based scraping scripts that build the recipe dataset and download images.
  • BDD/API_MongoDB/: Flask app (app.py) connecting to local MongoDB, exposing CRUD/read endpoints.
  • Recommandation/v2/: Embedding generation and similarity search (SentenceTransformer + Faiss).
  • E_ticket/: PDF ticket parser (pdfReader/Reader.py) and fuzzy ingredient matching (matching_ingredient.py).
  • Android/ & PetitCuistot/: Gradle Android app projects (UI, planning, inventory – details to be documented).
  • rapport/: AsciiDoc documentation and diagrams.
  • BDD/ (root CSV/JSON): Source and intermediate datasets, plus encoded recipe lists.

4. Core Modules

4.1 Scraping (Scrapping/)

Scripts:

  • Main_scraping_recettes.py: Orchestrates threaded scraping of recipe pages using recipeDB_scrapper.scrapping_page_recette().
  • recipeDB_scrapper.py: Extracts recipe metadata, ingredients, utensils, instructions, and nutrition; saves rows to Recette.csv; downloads images.

Dependencies: Selenium, ChromeDriver, concurrent.futures.

Output: Recette.csv (semicolon-separated) and PNG files under Images/.

4.2 API & Database (BDD/API_MongoDB/app.py)

  • Uses pymongo to connect to local MongoDB (USER, BDDDEZ1Z1 databases).
  • Endpoints (the /recipes routes require a password query parameter, currently the hardcoded integer 17112002):
    • GET / → health check
    • POST /add_user → insert a user document (currently fixed user_id=1)
    • GET /getusers → list users
    • GET /recipes?password=XXXX → all recipes
    • GET /recipes/limit/<nbr>/?password=XXXX → limited subset
    • GET /recipes/<id>?password=XXXX → recipe by numeric id
    • GET /recipes/search/limit/<nbr>/?password=XXXX&name=...&temps_de_préparation=... → search by name and prep time (French field names)
    • GET /recipes/searchbytags/<tag>?password=XXXX → search by tag (not fully implemented)
    • GET /get_recommandation → placeholder returning random recipe IDs

Security note: Password checking and user_id assignment are rudimentary; replace them with JWT or OAuth authentication and with auto-incremented IDs or UUIDs.
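The routes listed above take their parameters in the query string, and one parameter name (temps_de_préparation) contains a non-ASCII character, so it has to be percent-encoded. The following is a minimal sketch of how a client might build these URLs with the standard library. The helper names search_url and tag_url are hypothetical, and the host and port assume Flask's defaults.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:5000"  # assumption: Flask default host and port
PASSWORD = "17112002"               # hardcoded value documented in this README


def search_url(limit, name, prep_time):
    """Build the URL for /recipes/search/limit/<nbr>/ with encoded parameters."""
    query = urlencode({
        "password": PASSWORD,
        "name": name,
        "temps_de_préparation": prep_time,  # French field name expected by the API
    })
    return f"{BASE_URL}/recipes/search/limit/{int(limit)}/?{query}"


def tag_url(tag):
    """Build the URL for /recipes/searchbytags/<tag>, escaping the path segment."""
    query = urlencode({"password": PASSWORD})
    return f"{BASE_URL}/recipes/searchbytags/{quote(tag, safe='')}?{query}"


if __name__ == "__main__":
    print(search_url(5, "poulet rôti", "30"))
    print(tag_url("sans gluten"))
```

urlencode percent-encodes both keys and values as UTF-8, so the accented parameter name survives the round trip to the server.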

4.3 Recommendation Engine (Recommandation/v2/)

  • embedding.py: Loads SentenceTransformer bert-large-nli-stsb-mean-tokens and encodes recipe names into dense vectors; persists pickled embeddings.
  • recommandation.py: Builds a Faiss IndexFlatIP (exact inner-product search, which equals cosine similarity only when the embeddings are L2-normalized); get_recipes(liked_recipes, number_of_recipes) prints the most similar recipe names.

Future integration: Expose a /recommendations API endpoint using user-liked recipes or inventory-based similarity.

Performance consideration: Faiss GPU (in requirements.txt) may require CUDA 11.3; provide a CPU fallback (faiss-cpu).
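To illustrate why inner-product search is only "cosine-like", here is a pure-Python stand-in for what a flat inner-product index computes. It is not the Faiss API and not the code in recommandation.py; the vectors are made-up toy values, and whether the project normalizes its embeddings is not stated here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query, vectors, k):
    """Brute-force search: the k (index, score) pairs with the largest inner product."""
    scores = [(i, sum(q * v for q, v in zip(query, vec))) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-D "embeddings" (made up); real sentence embeddings have many more dimensions.
raw = [[1.0, 0.0], [3.0, 3.0], [0.0, 2.0]]
query = [1.0, 0.0]

if __name__ == "__main__":
    # The raw inner product favours the long vector at index 1.
    print(top_k_inner_product(query, raw, k=1))
    # With unit vectors the ranking follows the angle, so index 0 wins.
    unit = [l2_normalize(v) for v in raw]
    print(top_k_inner_product(l2_normalize(query), unit, k=1))
```

With Faiss itself, calling faiss.normalize_L2 on the embedding matrix before adding it to the index is the usual way to get cosine ranking from IndexFlatIP.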

4.4 E-ticket Parsing & Ingredient Matching (E_ticket/)

  • pdfReader/Reader.py: Extracts text from PDF receipts using pypdf, parses product lines, quantities, prices into a DataFrame.
  • matching_ingredient.py: Custom approximate matching that uses character-level edit distance (lev_dist) and heuristics to propose the nearest normalized ingredients.

Potential improvement: Replace the algorithm with a fuzzy matching library (RapidFuzz) + language normalization (stemming, accent removal) + caching.
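The accent removal and caching mentioned above could look like the sketch below, which uses only the standard library. The function name normalize_label is hypothetical and is not part of matching_ingredient.py; stemming is left out.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=4096)  # repeated product labels are normalized only once
def normalize_label(text):
    """Upper-case a label, strip accents, and collapse runs of whitespace."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


if __name__ == "__main__":
    print(normalize_label("  Crème   fraîche "))
```

Ligatures such as "œ" have no Unicode decomposition, so they pass through unchanged and would need an explicit replacement table.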

4.5 Android Applications (Android/, PetitCuistot/)

  • Gradle projects (modules under app/src/) likely consume the Flask API and present recipe browsing, planning, inventory & recommendations.

Action item: Add a dedicated README or module-level docs (not yet present).

5. Data Model (Current State)

MongoDB Collections (observed):

  • USER.user: { user_id, user_name, inventory, allergies, liked_recipes, disliked_recipes }
  • BDDDEZ1Z1.Recipes: Documents holding scraped recipe fields in French (e.g., nom de la recette, temps de préparation, tags).

Potential other collections: recettes, Recipes_sample (seen in code for experimentation).

CSV Columns (Scraping):

  1. Recipe Name
  2. Dietary Style
  3. Origin
  4. Preparation Time
  5. Cooking Time
  6. Total fats (g)
  7. Protein (g)
  8. Carbohydrates (g)
  9. Energy (kCal)
  10. INGREDIENTS (each ingredient serialized as a bracketed list of 8 attributes)
  11. PROCESSES-UTENSILS (comma-separated utensils)
  12. INSTRUCTIONS (comma-separated steps)
  13. image_ID (remote ID / filename stem)
  14. Number of persons
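A short sketch of reading such a semicolon-separated file with the standard csv module follows. The snake_case field names are chosen here for illustration, the sample row is made up, and the sketch assumes the file has no header row; if Recette.csv does have one, skip the first row.

```python
import csv
import io

# Hypothetical field names mirroring the 14 columns listed above, in order.
COLUMNS = [
    "recipe_name", "dietary_style", "origin", "preparation_time", "cooking_time",
    "total_fats_g", "protein_g", "carbohydrates_g", "energy_kcal",
    "ingredients", "processes_utensils", "instructions", "image_id",
    "number_of_persons",
]


def read_recipes(handle):
    """Parse semicolon-separated rows into dictionaries keyed by COLUMNS."""
    return list(csv.DictReader(handle, fieldnames=COLUMNS, delimiter=";"))


# Made-up row; the ingredient field is a placeholder, not the real serialization.
SAMPLE = "Ratatouille;Vegan;French;20;40;5;3;12;110;[...];pan,knife;Chop,Cook;1234;4\n"

if __name__ == "__main__":
    for row in read_recipes(io.StringIO(SAMPLE)):
        print(row["recipe_name"], row["number_of_persons"])
```

All values come back as strings; converting times and nutrition figures to numbers is left to the caller because their exact formats are not documented here.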

6. Installation & Setup

6.1 Prerequisites

  • Python 3.8+ (Recommendation environment built around 3.8 per requirements.txt)
  • MongoDB Community Edition (local instance on default port 27017)
  • Google Chrome + matching ChromeDriver placed in Scrapping/ (a chromedriver.exe is already present)
  • (Optional) CUDA 11.3 capable GPU for Faiss GPU index; else install CPU version.

6.2 Clone

git clone <your repo url>
cd PACT  # git names the directory after the repository; adjust if yours differs

6.3 Python Environment (Minimal Cross-Platform)

If you want a light environment (without full conda export):

# Windows PowerShell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
# Linux/macOS alternative: source .venv/bin/activate
pip install --upgrade pip
pip install flask pymongo sentence-transformers faiss-cpu pandas numpy selenium pypdf rapidfuzz

For GPU Faiss (Linux):

conda create -n pact52 python=3.8 -y
conda activate pact52
# From requirements: adapt subset
conda install -c pytorch faiss-gpu cudatoolkit=11.3 -y
pip install flask pymongo sentence-transformers pandas numpy selenium pypdf rapidfuzz

6.4 MongoDB

Start MongoDB locally (default port). Insert recipes by importing the CSV or through a loader script you can write (future improvement).

6.5 ChromeDriver

Ensure the installed Chrome version matches chromedriver.exe. If they differ, download the matching driver from https://chromedriver.chromium.org/ and replace the file.

6.6 SentenceTransformer Model Cache

The first run of embedding.py downloads the bert-large-nli-stsb-mean-tokens model (more than 1 GB). Ensure a stable internet connection and enough disk space.

7. Usage Workflow

7.1 Scrape Recipes

cd Scrapping
python Main_scraping_recettes.py
# Generates Recette.csv and Images/ assets

(The script already parallelizes page fetches with a thread pool.)

7.2 Load Data into MongoDB

Write a custom loader (suggested future script) to parse Recette.csv and insert documents into Recipes collection:

python load_recipes_to_mongo.py  # (to be created)
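Since the loader does not exist yet, the following is only a sketch of what its core could look like. It inserts documents in batches through any object exposing insert_many, which PyMongo collections do. The names load_recipes, chunked and DryRunCollection are hypothetical, and the commented connection string assumes the local default described in this README.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_recipes(rows, collection, batch_size=500):
    """Insert row dictionaries in batches; return how many documents were sent."""
    total = 0
    for batch in chunked(rows, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total


class DryRunCollection:
    """In-memory stand-in for a MongoDB collection, useful for dry runs."""

    def __init__(self):
        self.batches = []

    def insert_many(self, documents):
        self.batches.append(list(documents))


if __name__ == "__main__":
    # With a real database (assumption: local instance, names as in this README):
    #   from pymongo import MongoClient
    #   collection = MongoClient("mongodb://localhost:27017")["BDDDEZ1Z1"]["Recipes"]
    collection = DryRunCollection()
    print(load_recipes([{"recipe_name": f"r{i}"} for i in range(5)], collection, batch_size=2))
```

Batching keeps memory use flat on a large CSV and limits how much work is lost if one insert fails.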

7.3 Generate Embeddings

cd Recommandation/v2
python embedding.py  # Produces old_recipe_embedding_on_name.pickle

7.4 Run Recommendation Test

python recommandation.py  # Prints similar recipe names for sample vector

Integrate later with API endpoint.

7.5 Run Flask API

cd BDD/API_MongoDB
python app.py
# Server starts (default Flask port 5000)

Access examples:

GET http://localhost:5000/recipes?password=17112002
GET http://localhost:5000/recipes/limit/10/?password=17112002

7.6 Parse E-ticket PDFs

Place test PDFs in E_ticket/pdfReader/ticketX.pdf (X = number). Then:

cd E_ticket
python main.py  # Prints matched ingredient DataFrame

Adjust config.CODEFOLDER for Windows path consistency.

8. API Endpoint Summary

Endpoint                      Method  Params                                Description
/                             GET     -                                     Health check
/add_user                     POST    JSON body                             Insert one user (static id)
/getusers                     GET     -                                     List users
/recipes                      GET     password                              All recipes
/recipes/limit/<nbr>/         GET     password                              Limit number of recipes
/recipes/<id>                 GET     password                              Recipe by id
/recipes/search/limit/<nbr>/  GET     password, name, temps_de_préparation  Filter search
/recipes/searchbytags/<tag>   GET     password                              Tag search (experimental)
/get_recommandation           GET     -                                     Placeholder random results

Future: /recommendations?user_id=<id> leveraging embeddings + user taste/inventory.
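One way such an endpoint could rank recipes is to average the embeddings of a user's liked recipes into a taste profile and score every other recipe against it. The sketch below shows that idea in pure Python with made-up 2-D vectors. It is an assumption about a possible design, not existing project code, and all names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def recommend(liked_ids, embeddings, k):
    """Return the k recipe ids closest (by inner product) to the liked-recipe profile."""
    liked = set(liked_ids)
    profile = mean_vector([embeddings[recipe_id] for recipe_id in liked_ids])
    scored = [
        (sum(p * x for p, x in zip(profile, vector)), recipe_id)
        for recipe_id, vector in embeddings.items()
        if recipe_id not in liked  # never recommend what the user already liked
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [recipe_id for _, recipe_id in scored[:k]]


# Made-up toy embeddings keyed by recipe id.
EMBEDDINGS = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.8, 0.2]}

if __name__ == "__main__":
    print(recommend([1, 2], EMBEDDINGS, k=2))
```

Disliked recipes and inventory constraints could later be applied as filters on the candidate list before scoring.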

9. Ingredient Matching Logic

matching_ingredient.py performs:

  1. Upper-case normalization
  2. Prefix filtering (first word containment)
  3. Custom recursive edit distance (lev_dist) with memoization
  4. Rank candidates by distance; keep at most four within a threshold (0.6 * length)

Return: DataFrame with the original product and up to four nearest standardized ingredients.

Suggested replacement: tokenization, accent stripping, RapidFuzz ratio scoring, and a domain-specific synonym map.
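The four steps above can be sketched as follows. This is an interpretation, not the code in matching_ingredient.py: it assumes the prefix filter keeps candidates that contain the product's first word, that the threshold is 0.6 times the product length, and it returns a plain list rather than a DataFrame. The name match_product is hypothetical.

```python
from functools import lru_cache


def lev_dist(a, b):
    """Levenshtein distance between two strings, recursive with memoization."""

    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(a):
            return len(b) - j  # insert the rest of b
        if j == len(b):
            return len(a) - i  # delete the rest of a
        if a[i] == b[j]:
            return go(i + 1, j + 1)
        # Cheapest of deletion, insertion, substitution.
        return 1 + min(go(i + 1, j), go(i, j + 1), go(i + 1, j + 1))

    return go(0, 0)


def match_product(product, ingredients, max_results=4, ratio=0.6):
    """Return up to max_results ingredients nearest to the product label."""
    product = product.upper()                      # step 1: upper-case
    words = product.split()
    if not words:
        return []
    candidates = [ing for ing in ingredients if words[0] in ing.upper()]  # step 2
    scored = [(lev_dist(product, ing.upper()), ing) for ing in candidates]  # step 3
    limit = ratio * len(product)                   # step 4: threshold and ranking
    kept = sorted(pair for pair in scored if pair[0] <= limit)
    return [ing for _, ing in kept[:max_results]]


if __name__ == "__main__":
    pantry = ["TOMATE", "TOMATE CERISE", "POMME", "TOMATE SECHEE"]
    print(match_product("tomate cerise", pantry))
```

The recursion depth grows with the combined string length, which is fine for short product labels but would need an iterative version for long text.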

10. Configuration & Environment Variables

Current hardcoded values:

  • Password: 17112002
  • Paths: config.py uses a macOS path; adjust for Windows deployments.

Recommended improvement: .env file + python-dotenv loading for password, Mongo URI, model paths.
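As a starting point for moving these values out of the code, here is a standard-library sketch that reads them from environment variables. The variable names (PACT_API_PASSWORD, PACT_MONGO_URI, PACT_CODE_FOLDER) are hypothetical, and python-dotenv is not used here; it would simply populate os.environ before this function runs.

```python
import os


def load_settings(environ=os.environ):
    """Collect settings from environment variables, with local-development defaults."""
    return {
        # No default password: an empty value should make the API refuse requests.
        "api_password": environ.get("PACT_API_PASSWORD", ""),
        "mongo_uri": environ.get("PACT_MONGO_URI", "mongodb://localhost:27017"),
        # Replaces the hardcoded macOS path; defaults to the current directory.
        "code_folder": environ.get("PACT_CODE_FOLDER", os.getcwd()),
    }


if __name__ == "__main__":
    settings = load_settings()
    print(settings["mongo_uri"])
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.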

11. Logging & Monitoring

Not yet implemented. Suggested additions:

  • Use logging module in each script (INFO for progress, WARNING for missing fields).
  • API: Add request logging (Flask middleware) & error handlers (404, 500 JSON responses).
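A minimal sketch of the script-side suggestion, using only the standard logging module, is shown below. The helper names and the field names in the usage example are hypothetical.

```python
import logging


def get_logger(name, level=logging.INFO):
    """Return a logger that writes timestamped lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def check_fields(recipe, required, logger):
    """Log a WARNING for missing or empty fields and return their names."""
    missing = [field for field in required if not recipe.get(field)]
    if missing:
        logger.warning(
            "recipe %r is missing fields: %s",
            recipe.get("recipe_name"),
            ", ".join(missing),
        )
    return missing


if __name__ == "__main__":
    log = get_logger("pact.scraping")
    log.info("starting scrape")
    check_fields({"recipe_name": "Ratatouille", "origin": ""},
                 ["recipe_name", "origin", "instructions"], log)
```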

12. Testing Strategy (To Add)

Proposed:

  • Unit: ingredient matching (distance edge cases), PDF parser (sample minimal PDF).
  • Integration: API endpoints with test MongoDB database.
  • Performance: Embedding generation timing, Faiss search latency.

Use pytest + fixtures + temporary Mongo container (Docker) for isolation.

13. Performance Considerations

  • Scraping: 50-thread executor → watch for remote bans; add rate limiting & retry/backoff.
  • Embeddings: The large BERT model may be slow; consider switching to a smaller model such as all-MiniLM-L6-v2 for a lighter footprint.
  • Ingredient matching: Current algorithm is O(n * m^2). Replace with vectorized or fuzzy library for scalability.
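For the scraping point, retry with exponential backoff could be sketched as below. The fetch argument stands for whatever function loads one page; fetch_with_backoff and its parameters are hypothetical and are not part of the existing scripts.

```python
import random
import time


def fetch_with_backoff(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failures with exponentially growing pauses."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:  # in practice, narrow this to the scraper's own errors
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # 1x, 2x, 4x ... the base delay, plus jitter so threads do not retry in sync.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky(url):
        # Simulated page load that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("temporary failure")
        return f"content of {url}"

    print(fetch_with_backoff(flaky, "https://example.org/recipe/1", base_delay=0.1))
```

Lowering the executor's thread count is the simpler complement to this: fewer concurrent requests means fewer failures to retry.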

14. Security Notes

  • Hardcoded password & no HTTPS.
  • No authentication tokens; user-modifying routes are not protected against injection.
  • Recommendation placeholder may expose internal IDs.

Action: Implement proper auth (JWT), sanitize input, use environment variables.
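Until token-based auth is in place, two small hardening steps are possible with the standard library alone: comparing the password in constant time and validating numeric route parameters. The sketch below is an illustration under those assumptions; password_ok and parse_limit are hypothetical names, and the expected password is assumed to come from configuration rather than from the source code.

```python
import hmac


def password_ok(supplied, expected):
    """Compare passwords in constant time; missing values never match."""
    if not supplied or not expected:
        return False
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))


def parse_limit(raw, maximum=100):
    """Validate the <nbr> path segment: an integer between 1 and maximum."""
    value = int(raw)  # raises ValueError for non-numeric input
    if not 1 <= value <= maximum:
        raise ValueError(f"limit must be between 1 and {maximum}")
    return value


if __name__ == "__main__":
    print(password_ok("guess", "expected-secret"))
    print(parse_limit("10"))
```

Neither step replaces HTTPS: a password sent in a query string is still visible in logs and on the network.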

15. Roadmap

Short Term:

  • Replace password with env var
  • Add recipe loader script to MongoDB
  • Expose recommendation endpoint
  • Implement CPU/GPU fallback logic for Faiss
  • Normalize ingredient names (lowercase, accent removal)

Medium Term:

  • Introduce user preference vector (aggregate liked recipe embeddings)
  • Expand PDF parsing for multi-store formats
  • Add tests & CI workflow (GitHub Actions)
  • Containerize (Dockerfile + docker-compose with Mongo)

Long Term:

  • Migrate to microservices (API, recommendation, ingestion)
  • Real-time inventory updates from mobile apps
  • Personal nutrition profile & constraint-based filtering

16. Contributing

  1. Fork & branch: feature/<short-description>
  2. Add tests for new logic.
  3. Run linter and basic scripts locally.
  4. Open a Pull Request with a clear description & screenshots/logs.

Coding Style: Follow PEP 8 for Python and the standard Kotlin/Java guidelines for Android.

17. License

License not yet specified. Suggested: MIT for openness or Apache-2.0 for patent clarity. Add LICENSE file before public release.

18. Acknowledgments

  • Recipe data source: https://cosylab.iiitd.edu.in/recipedb/
  • Libraries: Selenium, SentenceTransformers, Faiss, Flask, PyMongo, Pandas, NumPy, pypdf.
  • Educational project context: Télécom Paris PACT initiative.

19. Troubleshooting

Issue                       Cause                                Fix
ChromeDriver version error  Browser/driver version mismatch      Download matching driver and replace chromedriver.exe
Embedding script too slow   Large BERT model                     Switch to smaller model (all-MiniLM-L6-v2)
ModuleNotFoundError: faiss  Faiss not installed                  Install faiss-cpu or the conda GPU build
API returns password error  Missing or wrong password parameter  Append ?password=17112002 to the URL
Empty PDF parse             Path mismatch                        Update config.CODEFOLDER and verify files exist

20. Disclaimer

The scraped data and images are for educational research purposes. Ensure compliance with the source website's terms of service before distribution.

21. Quick Start (Minimal)

# 1. Create environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install flask pymongo sentence-transformers faiss-cpu pandas numpy selenium pypdf

# 2. Start MongoDB separately
# (Ensure mongod is running)

# 3. Generate embeddings (optional for now)
cd Recommandation/v2
python embedding.py

# 4. Run API
cd ../../BDD/API_MongoDB
python app.py

# 5. Query recipes
curl "http://localhost:5000/recipes?password=17112002"

Feel free to open issues for missing documentation or discrepancies between code and README.
