PACT52 – Recipe Recommendation & Smart Grocery Assistance

1. Overview

PACT52 is a multi-module project combining:

Large-scale recipe data scraping
A MongoDB + Flask REST API exposing recipe and user data
A semantic recipe recommendation engine (SentenceTransformers + Faiss)
E-ticket (PDF) parsing to extract purchased products and match them to normalized ingredients
Android client applications ("Android" and "PetitCuistot") for user interaction and planning

The goal is to help users manage pantry/inventory, parse grocery tickets, and receive personalized recipe suggestions.

2. High-Level Architecture

Scrapping/  -->  CSV (Recette.csv)  -->  MongoDB (Recipes collection)
                                        |\
                                        | \--> Flask API (/recipes, /users, /search)
Recommandation/v2/ (embeddings + Faiss) --> Recommendation queries via future endpoint
E_ticket/ (PDF parser + ingredient matching) --> Normalized product list --> Inventory
Android & PetitCuistot apps --> Consume API + Display recommendations

Supporting documentation and diagrams live under rapport/ (architecture, sequence, storyboard images).

3. Repository Structure (Key Directories)

Scrapping/: Selenium-based scraping scripts to build recipe dataset and images.
BDD/API_MongoDB/: Flask app (app.py) connecting to local MongoDB, exposing CRUD/read endpoints.
Recommandation/v2/: Embedding generation and similarity search (SentenceTransformer + Faiss).
E_ticket/: PDF ticket parser (pdfReader/Reader.py) and fuzzy ingredient matching (matching_ingredient.py).
Android/ & PetitCuistot/: Gradle Android app projects (UI, planning, inventory – details to be documented).
rapport/: AsciiDoc documentation and diagrams.
BDD/ (root CSV/JSON): Source intermediate datasets and encoded recipe lists.

4. Core Modules

4.1 Scraping (`Scrapping/`)

Scripts:

Main_scraping_recettes.py: Orchestrates threaded scraping of recipe pages using recipeDB_scrapper.scrapping_page_recette().
recipeDB_scrapper.py: Extracts recipe metadata, ingredients, utensils, instructions, and nutrition; saves rows to Recette.csv; downloads images.
Dependencies: Selenium, ChromeDriver, concurrent.futures.
Output: Recette.csv (semicolon-separated); PNG files under Images/.

4.2 API & Database (`BDD/API_MongoDB/app.py`)

Uses pymongo to connect to local MongoDB (USER, BDDDEZ1Z1 databases).
Endpoints (the /recipes routes expect a password query parameter, currently the hardcoded integer 17112002):
- GET / → health check
- POST /add_user → insert a user document (currently fixed user_id=1)
- GET /getusers → list users
- GET /recipes?password=XXXX → all recipes
- GET /recipes/limit/<nbr>/?password=XXXX → limited subset
- GET /recipes/<id>?password=XXXX → recipe by numeric id
- GET /recipes/search/limit/<nbr>/?password=XXXX&name=...&temps_de_préparation=... → search by name and prep time (French field names)
- GET /recipes/searchbytags/<tag>?password=XXXX → search by tag (not fully implemented)
- GET /get_recommandation → placeholder returning random recipe IDs

Security note: Password checking and user_id assignment are rudimentary; improve with JWT / OAuth and auto-incremented IDs or UUIDs.
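
The search route takes a query parameter whose name contains an accented character (temps_de_préparation), so clients need to percent-encode it. The following is a minimal client-side sketch using only the standard library. The base URL assumes the API runs locally on Flask's default port, the helper name search_url is illustrative, and the format expected for the preparation time value is a guess; check app.py for the real one.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000"  # assumption: API running locally on the default port


def search_url(limit, password, name=None, prep_time=None):
    """Build the URL for GET /recipes/search/limit/<nbr>/ with encoded query parameters."""
    params = {"password": password}
    if name is not None:
        params["name"] = name
    if prep_time is not None:
        # The parameter name contains "é", which urlencode percent-encodes as UTF-8.
        params["temps_de_préparation"] = prep_time
    return f"{BASE_URL}/recipes/search/limit/{int(limit)}/?{urlencode(params)}"


# The value "30" is a made-up example; the API may expect another format.
print(search_url(5, 17112002, name="poulet rôti", prep_time="30"))
```

The resulting string can be passed to any HTTP client (curl, a browser, or the Android apps' networking layer).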

4.3 Recommendation Engine (`Recommandation/v2/`)

embedding.py: Loads SentenceTransformer bert-large-nli-stsb-mean-tokens and encodes recipe names into dense vectors; persists pickled embeddings.
recommandation.py: Builds a Faiss IndexFlatIP for cosine-like similarity; get_recipes(liked_recipes, number_of_recipes) prints the names of the most similar recipes.

Future integration: Expose a /recommendations API endpoint using user-liked recipes or inventory-based similarity.
Performance consideration: Faiss GPU (in requirements.txt) may require CUDA 11.3; provide a CPU fallback (faiss-cpu).
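
IndexFlatIP ranks by raw inner product, which equals cosine similarity only when the vectors are L2-normalized. Whether recommandation.py normalizes its embeddings is not stated here, so the pure-Python sketch below is a stand-in that shows the idea rather than the project's code. The function names and the toy 2-D vectors are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query, vectors, k):
    """Return (index, score) pairs for the k largest inner products, best first."""
    q = l2_normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, l2_normalize(v))))
        for i, v in enumerate(vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-D "embeddings"; the real ones come from embedding.py.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k_inner_product([2.0, 0.0], toy_vectors, k=2))
```

With Faiss itself, the same effect is usually obtained by normalizing the embedding matrix and the query (for example with faiss.normalize_L2) before adding to and searching the index.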

4.4 E-ticket Parsing & Ingredient Matching (`E_ticket/`)

pdfReader/Reader.py: Extracts text from PDF receipts using pypdf, parses product lines, quantities, prices into a DataFrame.
matching_ingredient.py: Custom approximate matching that uses character-level edit distance (lev_dist) and heuristics to propose the nearest normalized ingredients.

Potential improvement: Replace the algorithm with a fuzzy matching library (RapidFuzz) + language normalization (stemming, accent removal) + caching.
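
As an illustration of the accent-removal step suggested above, here is a small standard-library sketch. The function name normalize_label is hypothetical and the sample label is made up; stemming is not covered.

```python
import unicodedata


def normalize_label(text):
    """Upper-case a product label, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.upper().split())


print(normalize_label("  Crème   fraîche épaisse "))  # CREME FRAICHE EPAISSE
```

Applying the same normalization to both ticket lines and the reference ingredient list keeps the edit distance from counting accents as mismatches.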

4.5 Android Applications (`Android/`, `PetitCuistot/`)

Gradle projects (modules under app/src/) likely consume the Flask API and present recipe browsing, planning, inventory & recommendations.

Action item: Add a dedicated README or module-level docs (not yet present).

5. Data Model (Current State)

MongoDB Collections (observed):

USER.user: { user_id, user_name, inventory, allergies, liked_recipes, disliked_recipes }
BDDDEZ1Z1.Recipes: Documents holding scraped recipe fields in French (e.g., nom de la recette, temps de préparation, tags).
Potential other collections: recettes, Recipes_sample (seen in code for experimentation).

CSV Columns (Scraping):

Recipe Name
Dietary Style
Origin
Preparation Time
Cooking Time
Total fats (g)
Protein (g)
Carbohydrates (g)
Energy (kCal)
INGREDIENTS (each ingredient serialized as bracketed list of 8 attributes)
PROCESSES-UTENSILS (comma-separated utensils)
INSTRUCTIONS (comma-separated steps)
image_ID (remote ID / filename stem)
Number of persons

6. Installation & Setup

6.1 Prerequisites

Python 3.8+ (Recommendation environment built around 3.8 per requirements.txt)
MongoDB Community Edition (local instance on default port 27017)
Google Chrome + matching ChromeDriver placed in Scrapping/ (a chromedriver.exe is already present)
(Optional) CUDA 11.3-capable GPU for the Faiss GPU index; otherwise install the CPU version.

6.2 Clone

git clone <your repo url>
cd pact52

6.3 Python Environment (Minimal Cross-Platform)

If you want a light environment (without full conda export):

# Windows PowerShell
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # on macOS/Linux: source .venv/bin/activate
pip install --upgrade pip
pip install flask pymongo sentence-transformers faiss-cpu pandas numpy selenium pypdf rapidfuzz

For GPU Faiss (Linux):

conda create -n pact52 python=3.8 -y
conda activate pact52
# Subset adapted from requirements.txt
conda install -c pytorch faiss-gpu cudatoolkit=11.3 -y
pip install flask pymongo sentence-transformers pandas numpy selenium pypdf rapidfuzz

6.4 MongoDB

Start MongoDB locally (default port). Insert recipes by importing the CSV or through a loader script you can write (future improvement).

6.5 ChromeDriver

Ensure Chrome version matches chromedriver.exe. Update if mismatch (download from https://chromedriver.chromium.org/ and replace file).

6.6 SentenceTransformer Model Cache

The first run of embedding.py downloads the bert-large-nli-stsb-mean-tokens model (on the order of 1 GB). Ensure a stable internet connection.

7. Usage Workflow

7.1 Scrape Recipes

cd Scrapping
python Main_scraping_recettes.py
# Generates Recette.csv and Images/ assets

(The script already parallelizes scraping with its built-in multithreading.)

7.2 Load Data into MongoDB

No loader exists yet. Write one (suggested future script) that parses Recette.csv and inserts documents into the Recipes collection:

python load_recipes_to_mongo.py  # (to be created)
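
One possible shape for such a loader is sketched below. It assumes Recette.csv has a header row and is UTF-8 encoded, neither of which is confirmed here. The collection argument only needs an insert_many method, which a pymongo Collection provides. The MongoDB documents appear to use French field names, so a real loader would probably also need a column-to-field mapping that this sketch omits.

```python
import csv
import io


def rows_to_documents(csv_file):
    """Parse semicolon-separated recipe rows into a list of dicts keyed by column name."""
    reader = csv.DictReader(csv_file, delimiter=";")
    return [dict(row) for row in reader]


def load_recipes(csv_file, collection):
    """Insert all parsed rows into a collection exposing insert_many; return the row count."""
    documents = rows_to_documents(csv_file)
    if documents:  # pymongo's insert_many rejects an empty list
        collection.insert_many(documents)
    return len(documents)


# Made-up sample row, only to show the parsed shape.
sample = io.StringIO("Recipe Name;Origin;Preparation Time\nRatatouille;French;20\n")
print(rows_to_documents(sample))

# Hypothetical usage against the local database named in this README:
# from pymongo import MongoClient
# with open("Recette.csv", newline="", encoding="utf-8") as f:
#     load_recipes(f, MongoClient("mongodb://localhost:27017")["BDDDEZ1Z1"]["Recipes"])
```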

7.3 Generate Embeddings

cd Recommandation/v2
python embedding.py  # Produces old_recipe_embedding_on_name.pickle

7.4 Run Recommendation Test

python recommandation.py  # Prints similar recipe names for sample vector

Integrate later with API endpoint.

7.5 Run Flask API

cd BDD/API_MongoDB
python app.py
# Server starts (default Flask port 5000)

Access examples:

GET http://localhost:5000/recipes?password=17112002
GET http://localhost:5000/recipes/limit/10/?password=17112002

7.6 Parse E-ticket PDFs

Place test PDFs in E_ticket/pdfReader/, named ticketX.pdf (X = number). Then:

cd E_ticket
python main.py  # Prints matched ingredient DataFrame

Set config.CODEFOLDER to a path that is valid on your machine (the committed value is a macOS path).

8. API Endpoint Summary

| Endpoint | Method | Params | Description |
| --- | --- | --- | --- |
| `/` | GET | - | Health check |
| `/add_user` | POST | JSON body | Insert one user (static id) |
| `/getusers` | GET | - | List users |
| `/recipes` | GET | password | All recipes |
| `/recipes/limit/<nbr>/` | GET | password | Limit number of recipes |
| `/recipes/<id>` | GET | password | Recipe by id |
| `/recipes/search/limit/<nbr>/` | GET | password, name, temps_de_préparation | Filter search |
| `/recipes/searchbytags/<tag>` | GET | password | Tag search (experimental) |
| `/get_recommandation` | GET | - | Placeholder random results |

Future: /recommendations?user_id=<id> leveraging embeddings + user taste/inventory.

9. Ingredient Matching Logic

matching_ingredient.py performs:

Upper-case normalization
Prefix filtering (first word containment)
Custom recursive edit distance (lev_dist) with memoization
Rank candidates by distance; keep at most 4 within a threshold (0.6 * length)

Return: DataFrame with the original product and up to four nearest standardized ingredients.
Suggested replacement: tokenization, accent stripping, RapidFuzz ratio scoring, domain-specific synonym map.
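
The steps above can be sketched as follows. This is an independent reconstruction from the description, not the contents of matching_ingredient.py: the prefix filter is interpreted as "the product's first word must appear in the candidate", the threshold length is taken to be the product name's length, and nearest_ingredients is a hypothetical name. It returns a plain list instead of a DataFrame to stay dependency-free.

```python
from functools import lru_cache


def lev_dist(a, b):
    """Levenshtein distance between two strings, computed recursively with memoization."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        if i == len(a):
            return len(b) - j  # insert the rest of b
        if j == len(b):
            return len(a) - i  # delete the rest of a
        if a[i] == b[j]:
            return dist(i + 1, j + 1)
        return 1 + min(dist(i + 1, j), dist(i, j + 1), dist(i + 1, j + 1))

    return dist(0, 0)


def nearest_ingredients(product, ingredients, max_results=4, ratio=0.6):
    """Return up to max_results ingredients within ratio * len(product) edits, closest first."""
    name = product.upper()
    words = name.split()
    first_word = words[0] if words else ""
    scored = []
    for ingredient in ingredients:
        candidate = ingredient.upper()
        if first_word and first_word not in candidate:  # assumed prefix filter
            continue
        d = lev_dist(name, candidate)
        if d <= ratio * len(name):  # assumed meaning of "0.6 * length"
            scored.append((d, ingredient))
    scored.sort()
    return [ingredient for _, ingredient in scored[:max_results]]


# Made-up reference list for illustration.
reference = ["Tomate", "Tomate cerise", "Pomme de terre", "Tomate séchée"]
print(nearest_ingredients("tomate cerise", reference))
```

The recursion depth grows with the combined string length, which is fine for short product labels but is one more reason to prefer an iterative library implementation for larger inputs.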

10. Configuration & Environment Variables

Current hardcoded values:

Password: 17112002
Paths: config.py uses a macOS path; adjust for Windows deployments.

Recommended improvement: .env file + python-dotenv loading for password, Mongo URI, model paths.
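
A minimal sketch of environment-based configuration, using only os.environ. The variable names (PACT52_API_PASSWORD, PACT52_MONGO_URI, PACT52_CODE_FOLDER) and the load_settings helper are hypothetical; nothing in the project reads them today.

```python
import os


def load_settings(env=None):
    """Read settings from environment variables, falling back to local defaults."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names, chosen for this example only.
        "api_password": env.get("PACT52_API_PASSWORD", ""),
        "mongo_uri": env.get("PACT52_MONGO_URI", "mongodb://localhost:27017"),
        "code_folder": env.get("PACT52_CODE_FOLDER", os.getcwd()),
    }


settings = load_settings({"PACT52_API_PASSWORD": "change-me"})
print(settings["mongo_uri"])
```

If python-dotenv is installed, calling its load_dotenv() before load_settings() would populate os.environ from a local .env file, which should stay out of version control.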

11. Logging & Monitoring

Not yet implemented. Suggested additions:

Use logging module in each script (INFO for progress, WARNING for missing fields).
API: Add request logging (Flask middleware) & error handlers (404, 500 JSON responses).
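
As a starting point for the script-level logging suggested above, the sketch below configures a console logger with the standard logging module. The logger name and the sample messages are illustrative.

```python
import logging


def get_logger(name, level=logging.INFO):
    """Return a logger that writes timestamped records to the console."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


log = get_logger("pact52.scraping")  # illustrative logger name
log.info("scraped %d recipe pages", 12)
log.warning("missing field %r for recipe %s", "Cooking Time", "example-id")
```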

12. Testing Strategy (To Add)

Proposed:

Unit: ingredient matching (distance edge cases), PDF parser (sample minimal PDF).
Integration: API endpoints with test MongoDB database.
Performance: Embedding generation timing, Faiss search latency.

Tooling: Use pytest + fixtures + a temporary Mongo container (Docker) for isolation.

13. Performance Considerations

Scraping: 50-thread executor → watch for remote bans; add rate limiting & retry/backoff.
Embeddings: The large model may be slow; consider switching to a smaller model such as all-MiniLM-L6-v2 for a lighter footprint.
Ingredient matching: Current algorithm is O(n * m^2). Replace with vectorized or fuzzy library for scalability.
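
The retry/backoff suggested for scraping could look roughly like this. It is a generic sketch, not code from Main_scraping_recettes.py; the helper name, default attempt count, and delays are assumptions to tune against the source site's tolerance.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from many threads hitting the same host.
            sleep(delay + random.uniform(0, delay / 2))


# Hypothetical usage, where fetch_page stands in for the real scraping call:
# page = retry_with_backoff(lambda: fetch_page(url))
```

The sleep parameter is injectable so the behavior can be unit-tested without real waiting.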

14. Security Notes

Hardcoded password & no HTTPS.
No authentication tokens; user modification not protected against injection.
Recommendation placeholder may expose internal IDs.

Action: Implement proper auth (JWT), sanitize input, use environment variables.
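
Until token-based auth is in place, two small hardening steps are possible with the standard library alone: a constant-time password comparison and literal (escaped) matching of user-supplied search text. The sketch below is illustrative. The helper names are hypothetical, and the field name nom de la recette is taken from the data model section and should be verified against the actual collection.

```python
import hmac
import re


def password_matches(supplied, expected):
    """Compare the supplied password to the expected one in constant time."""
    return hmac.compare_digest(str(supplied).encode(), str(expected).encode())


def name_filter(raw_name, max_length=100):
    """Build a MongoDB filter matching the recipe name literally, case-insensitively."""
    # Rejecting non-strings blocks operator injection such as {"$ne": ""} from JSON input.
    if not isinstance(raw_name, str) or not raw_name or len(raw_name) > max_length:
        raise ValueError("invalid recipe name")
    # re.escape keeps user input from being interpreted as a regular expression.
    return {"nom de la recette": {"$regex": re.escape(raw_name), "$options": "i"}}


print(password_matches("17112002", 17112002))
print(name_filter("tarte (fine)"))
```

This does not replace proper authentication or HTTPS; it only narrows two of the gaps listed above.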

15. Roadmap

Short Term:

16. Contributing

Fork & branch: feature/<short-description>
Add tests for new logic.
Run linter and basic scripts locally.
Open a Pull Request with a clear description & screenshots/logs.

Coding Style: Follow PEP8 for Python, and the standard Kotlin/Java guidelines for Android.

17. License

License not yet specified. Suggested: MIT for openness or Apache-2.0 for patent clarity. Add LICENSE file before public release.

18. Acknowledgments

Recipe data source: https://cosylab.iiitd.edu.in/recipedb/
Libraries: Selenium, SentenceTransformers, Faiss, Flask, PyMongo, Pandas, NumPy, pypdf.
Educational project context: Télécom Paris PACT initiative.

19. Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| ChromeDriver version error | Mismatch browser/driver | Download matching driver and replace `chromedriver.exe` |
| Embedding script too slow | Large BERT model | Switch to smaller model (`all-MiniLM-L6-v2`) |
| `ModuleNotFoundError: faiss` | Faiss GPU not installed | Install `faiss-cpu` or conda GPU build |
| API returns password error | Wrong query parameter | Append `?password=17112002` to URL |
| Empty PDF parse | Path mismatch | Update `config.CODEFOLDER` and verify files exist |

20. Disclaimer

The scraped data and images are for educational research purposes. Ensure compliance with the source website's terms of service before distribution.

21. Quick Start (Minimal)

# 1. Create environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install flask pymongo sentence-transformers faiss-cpu pandas numpy selenium pypdf

# 2. Start MongoDB separately
# (Ensure mongod is running)

# 3. Generate embeddings (optional for now)
cd Recommandation/v2
python embedding.py

# 4. Run API
cd ../../BDD/API_MongoDB
python app.py

# 5. Query recipes
curl "http://localhost:5000/recipes?password=17112002"

Feel free to open issues for missing documentation or discrepancies between code and README.
