PACT52 is a multi-module project combining:
- Large-scale recipe data scraping
- A MongoDB + Flask REST API exposing recipe and user data
- A semantic recipe recommendation engine (SentenceTransformers + Faiss)
- E-ticket (PDF) parsing to extract purchased products and match them to normalized ingredients
- Android client applications ("Android" and "PetitCuistot") for user interaction and planning
The goal is to help users manage pantry/inventory, parse grocery tickets, and receive personalized recipe suggestions.
Scrapping/ --> CSV (Recette.csv) --> MongoDB (Recipes collection)
|\
| \--> Flask API (/recipes, /users, /search)
Recommandation/v2/ (embeddings + Faiss) --> Recommendation queries via future endpoint
E_ticket/ (PDF parser + ingredient matching) --> Normalized product list --> Inventory
Android & PetitCuistot apps --> Consume API + Display recommendations
Supporting documentation and diagrams live under rapport/ (architecture, sequence, storyboard images).
- `Scrapping/`: Selenium-based scraping scripts to build the recipe dataset and images.
- `BDD/API_MongoDB/`: Flask app (`app.py`) connecting to a local MongoDB, exposing CRUD/read endpoints.
- `Recommandation/v2/`: Embedding generation and similarity search (SentenceTransformer + Faiss).
- `E_ticket/`: PDF ticket parser (`pdfReader/Reader.py`) and fuzzy ingredient matching (`matching_ingredient.py`).
- `Android/` & `PetitCuistot/`: Gradle Android app projects (UI, planning, inventory; details to be documented).
- `rapport/`: AsciiDoc documentation and diagrams.
- `BDD/` (root CSV/JSON): Source intermediate datasets and encoded recipe lists.
Scripts:
- `Main_scraping_recettes.py`: Orchestrates threaded scraping of recipe pages using `recipeDB_scrapper.scrapping_page_recette()`.
- `recipeDB_scrapper.py`: Extracts recipe metadata, ingredients, utensils, instructions, and nutrition; saves rows to `Recette.csv`; downloads images.

Dependencies: Selenium, ChromeDriver, `concurrent.futures`.

Output: `Recette.csv` (semicolon-separated); `Images/` PNG files.
- Uses `pymongo` to connect to local MongoDB (`USER`, `BDDDEZ1Z1` databases).
- Endpoints (password-protected by a hardcoded integer `17112002`):
  - `GET /` → health check
  - `POST /add_user` → insert a user document (currently fixed `user_id=1`)
  - `GET /getusers` → list users
  - `GET /recipes?password=XXXX` → all recipes
  - `GET /recipes/limit/<nbr>/?password=XXXX` → limited subset
  - `GET /recipes/<id>?password=XXXX` → recipe by numeric `id`
  - `GET /recipes/search/limit/<nbr>/?password=XXXX&name=...&temps_de_préparation=...` → search by name and prep time (French field names)
  - `GET /recipes/searchbytags/<tag>?password=XXXX` → search by tag (not fully implemented)
  - `GET /get_recommandation` → placeholder returning random recipe IDs

Security note: Password and `user_id` assignment are rudimentary; improve with JWT / OAuth and auto-increment or UUIDs.
- `embedding.py`: Loads SentenceTransformer `bert-large-nli-stsb-mean-tokens` and encodes recipe names into dense vectors; persists pickled embeddings.
- `recommandation.py`: Builds a Faiss `IndexFlatIP` for cosine-like similarity; `get_recipes(liked_recipes, number_of_recipes)` prints the top similar recipe names.

Future integration: Expose a `/recommendations` API endpoint using user-liked recipes or inventory-based similarity.

Performance consideration: Faiss GPU (in `requirements.txt`) may require CUDA 11.3; provide a CPU fallback (`faiss-cpu`).
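An inner-product index such as `IndexFlatIP` only ranks by cosine similarity when the vectors are L2-normalized first. The sketch below illustrates that idea in plain Python on toy vectors; it does not use Faiss, and the function names and data are hypothetical rather than taken from `recommandation.py`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def top_k_by_inner_product(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank items by inner product of normalized vectors, i.e. cosine similarity."""
    q = l2_normalize(query)
    scored = []
    for name, vec in embeddings.items():
        v = l2_normalize(vec)
        scored.append((name, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical data, for illustration only).
toy_embeddings = {
    "tomato soup": [1.0, 0.0, 0.2],
    "tomato salad": [0.9, 0.1, 0.3],
    "chocolate cake": [0.0, 1.0, 0.1],
}
results = top_k_by_inner_product([1.0, 0.0, 0.25], toy_embeddings, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With real embeddings, the same normalization step would be applied to the matrix before adding it to the index and to each query vector before searching.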
- `pdfReader/Reader.py`: Extracts text from PDF receipts using `pypdf`, parses product lines, quantities, and prices into a DataFrame.
- `matching_ingredient.py`: Custom approximate matching leveraging character-level edit distance (`lev_dist`) and heuristics to propose the nearest normalized ingredients.

Potential improvement: Replace the algorithm with a fuzzy matching library (RapidFuzz) + language normalization (stemming, accent removal) + caching.
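As an illustration of the accent-removal step suggested above, here is a minimal sketch using only the standard library. The function name `normalize_label` is hypothetical, and upper-casing is chosen on the assumption that it should line up with the matcher's existing upper-case normalization.

```python
import unicodedata


def normalize_label(text: str) -> str:
    """Strip accents, collapse whitespace, and upper-case a product label."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # split() + join collapses runs of whitespace and trims both ends.
    return " ".join(without_accents.upper().split())


print(normalize_label("  Crème   fraîche "))
```

Ligatures such as `œ` have no Unicode decomposition and would need an explicit replacement table if they appear on tickets.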
- Gradle projects (modules under `app/src/`) likely consume the Flask API and present recipe browsing, planning, inventory, and recommendations.

Action item: Add a dedicated README or module-level docs (not yet present).
MongoDB Collections (observed):
- `USER.user`: `{ user_id, user_name, inventory, allergies, liked_recipes, disliked_recipes }`
- `BDDDEZ1Z1.Recipes`: Documents holding scraped recipe fields in French (e.g., `nom de la recette`, `temps de préparation`, `tags`).

Potential other collections: `recettes`, `Recipes_sample` (seen in code for experimentation).

CSV Columns (Scraping):
- Recipe Name
- Dietary Style
- Origin
- Preparation Time
- Cooking Time
- Total fats (g)
- Protein (g)
- Carbohydrates (g)
- Energy (kCal)
- INGREDIENTS (each ingredient serialized as bracketed list of 8 attributes)
- PROCESSES-UTENSILS (comma-separated utensils)
- INSTRUCTIONS (comma-separated steps)
- image_ID (remote ID / filename stem)
- Number of persons
- Python 3.8+ (Recommendation environment built around 3.8 per `requirements.txt`)
- MongoDB Community Edition (local instance on default port 27017)
- Google Chrome + matching ChromeDriver placed in `Scrapping/` (`chromedriver.exe` already present)
- (Optional) CUDA 11.3 capable GPU for Faiss GPU index; else install the CPU version.
git clone <your repo url>
cd pact52
If you want a light environment (without full conda export):
# Windows PowerShell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install flask pymongo sentence-transformers faiss-cpu pandas numpy selenium pypdf rapidfuzz
For GPU Faiss (Linux):
conda create -n pact52 python=3.8 -y
conda activate pact52
# From requirements: adapt subset
conda install -c pytorch faiss-gpu cudatoolkit=11.3 -y
pip install flask pymongo sentence-transformers pandas numpy selenium pypdf rapidfuzz
Start MongoDB locally (default port). Insert recipes by importing the CSV or through a loader script you can write (future improvement).
Ensure Chrome version matches chromedriver.exe. Update if mismatch (download from https://chromedriver.chromium.org/ and replace file).
The first run of embedding.py downloads the bert-large-nli-stsb-mean-tokens model (over 1 GB). Ensure a stable internet connection.
cd Scrapping
python Main_scraping_recettes.py
# Generates Recette.csv and Images/ assets
(The script already parallelizes scraping through its built-in multithreading.)
Write a custom loader (suggested future script) to parse Recette.csv and insert documents into Recipes collection:
python load_recipes_to_mongo.py # (to be created)
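A possible starting point for that loader is sketched below. It assumes `Recette.csv` begins with a header row matching the CSV columns listed earlier, which should be verified against the actual file. The function name is hypothetical, and the mapping from these column names to the French field names used by the API is deliberately left out.

```python
import csv
import io
from typing import Dict, Iterable, List


def rows_to_documents(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Turn semicolon-separated CSV lines into one dict per recipe row."""
    reader = csv.DictReader(lines, delimiter=";")
    documents = []
    for row in reader:
        # Skip overflow columns (key None) and normalize missing values to "".
        doc = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key
        }
        if any(doc.values()):  # ignore fully empty rows
            documents.append(doc)
    return documents


# In-memory sample standing in for Recette.csv (hypothetical rows).
sample = io.StringIO(
    "Recipe Name;Preparation Time;Number of persons\n"
    "Tomato Soup;10;4\n"
    "Green Salad;5;2\n"
)
documents = rows_to_documents(sample)
print(documents)

# With MongoDB running and pymongo installed, the insert could look like:
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# client["BDDDEZ1Z1"]["Recipes"].insert_many(documents)
```

For the real file, pass an open file handle (`open("Recette.csv", newline="", encoding="utf-8")`) instead of the in-memory sample.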
cd Recommandation/v2
python embedding.py # Produces old_recipe_embedding_on_name.pickle
python recommandation.py # Prints similar recipe names for sample vector
Integrate later with API endpoint.
cd BDD/API_MongoDB
python app.py
# Server starts (default Flask port 5000)
Access examples:
GET http://localhost:5000/recipes?password=17112002
GET http://localhost:5000/recipes/limit/10/?password=17112002
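The search endpoint takes an accented parameter name (`temps_de_préparation`), which has to be percent-encoded in the URL. The following sketch builds such a URL with the standard library; the helper name and the sample values are hypothetical, and the expected format of the prep-time value should be checked in `app.py`.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000"  # default Flask port, as used above


def search_url(limit: int, password: str, name: str, prep_time: str) -> str:
    """Build a URL for /recipes/search/limit/<nbr>/ with encoded parameters."""
    query = urlencode(
        {
            "password": password,
            "name": name,
            "temps_de_préparation": prep_time,  # key is percent-encoded too
        }
    )
    return f"{BASE_URL}/recipes/search/limit/{limit}/?{query}"


url = search_url(5, "17112002", "soupe", "30")
print(url)

# To actually send the request against a running server:
# from urllib.request import urlopen
# with urlopen(url) as response:
#     print(response.read().decode("utf-8"))
```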
Place test PDFs in E_ticket/pdfReader/ticketX.pdf (X = number). Then:
cd E_ticket
python main.py # Prints matched ingredient DataFrame
Adjust config.CODEFOLDER for Windows path consistency.
| Endpoint | Method | Params | Description |
|---|---|---|---|
| `/` | GET | - | Health check |
| `/add_user` | POST | JSON body | Insert one user (static id) |
| `/getusers` | GET | - | List users |
| `/recipes` | GET | password | All recipes |
| `/recipes/limit/<nbr>/` | GET | password | Limit number of recipes |
| `/recipes/<id>` | GET | password | Recipe by id |
| `/recipes/search/limit/<nbr>/` | GET | password, name, temps_de_préparation | Filter search |
| `/recipes/searchbytags/<tag>` | GET | password | Tag search (experimental) |
| `/get_recommandation` | GET | - | Placeholder random results |
Future: /recommendations?user_id=<id> leveraging embeddings + user taste/inventory.
matching_ingredient.py performs:
- Upper-case normalization
- Prefix filtering (first word containment)
- Custom recursive edit distance (`lev_dist`) with memoization
- Rank candidates by distance; keep top <=4 within a threshold (0.6 * length)

Return: DataFrame with the original product and up to four nearest standardized ingredients.

Suggested replacement: tokenization, accent stripping, RapidFuzz ratio scoring, domain-specific synonym map.
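The steps listed for `matching_ingredient.py` could be sketched roughly as follows. This is not the project's code: `nearest_ingredients` is a hypothetical name, "first word containment" is interpreted as "the product's first word appears in the candidate", and the threshold length is assumed to be the product's length.

```python
from functools import lru_cache
from typing import List


def lev_dist(a: str, b: str) -> int:
    """Recursive Levenshtein distance, memoized on the (i, j) suffix positions."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        if i == len(a):
            return len(b) - j  # insert the rest of b
        if j == len(b):
            return len(a) - i  # delete the rest of a
        if a[i] == b[j]:
            return dist(i + 1, j + 1)
        # 1 + best of deletion, insertion, substitution
        return 1 + min(dist(i + 1, j), dist(i, j + 1), dist(i + 1, j + 1))

    return dist(0, 0)


def nearest_ingredients(
    product: str, ingredients: List[str], max_results: int = 4
) -> List[str]:
    """Return up to max_results ingredients closest to the product label."""
    product = product.upper()
    words = product.split()
    if not words:
        return []
    # Filter: keep candidates that contain the product's first word.
    candidates = [ing for ing in ingredients if words[0] in ing.upper()]
    threshold = 0.6 * len(product)  # assumption: length of the product label
    scored = sorted((lev_dist(product, ing.upper()), ing) for ing in candidates)
    return [ing for distance, ing in scored if distance <= threshold][:max_results]


matches = nearest_ingredients("tomate cerise", ["TOMATE", "TOMATE CERISE", "POMME"])
print(matches)
```

The memoized recursion keeps the cost at O(len(a) * len(b)) per pair, but very long strings could still hit Python's recursion limit; an iterative table or RapidFuzz avoids that.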
Current hardcoded values:
- Password: `17112002`
- Paths: `config.py` uses a macOS path; adjust for Windows deployments.

Recommended improvement: `.env` file + `python-dotenv` loading for password, Mongo URI, model paths.
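One way to move these values out of the code is to read them from environment variables. The sketch below uses only the standard library; the variable names (`PACT52_API_PASSWORD`, `PACT52_MONGO_URI`, `PACT52_MODEL_PATH`) and the `Settings` class are hypothetical, not existing project conventions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_password: str
    mongo_uri: str
    model_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, failing fast on a missing password."""
    password = env.get("PACT52_API_PASSWORD")  # hypothetical variable name
    if not password:
        raise RuntimeError("PACT52_API_PASSWORD is not set")
    return Settings(
        api_password=password,
        mongo_uri=env.get("PACT52_MONGO_URI", "mongodb://localhost:27017"),
        model_path=env.get("PACT52_MODEL_PATH", "bert-large-nli-stsb-mean-tokens"),
    )


# With python-dotenv installed, a .env file can populate os.environ first:
# from dotenv import load_dotenv
# load_dotenv()

# Demo with an explicit mapping so the example runs without any setup.
settings = load_settings({"PACT52_API_PASSWORD": "change-me"})
print(settings.mongo_uri)
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.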
Not yet implemented. Suggested additions:
- Use the `logging` module in each script (INFO for progress, WARNING for missing fields).
- API: Add request logging (Flask middleware) & error handlers (404, 500 JSON responses).
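A minimal sketch of the script-side logging suggested above, using the standard `logging` module. The helper names and the list of required fields are assumptions chosen for illustration.

```python
import logging
from typing import Dict, List


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes timestamped INFO+ messages to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def check_recipe(recipe: Dict[str, str], logger: logging.Logger) -> List[str]:
    """Log progress at INFO and missing fields at WARNING; return the missing fields."""
    required = ("Recipe Name", "Preparation Time")  # assumed required columns
    missing = [field for field in required if not recipe.get(field)]
    if missing:
        logger.warning("Recipe %r is missing fields: %s", recipe, ", ".join(missing))
    else:
        logger.info("Recipe %s scraped", recipe["Recipe Name"])
    return missing


log = get_logger("scraping")
missing_fields = check_recipe({"Recipe Name": "Tomato Soup"}, log)
```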
Proposed:
- Unit: ingredient matching (distance edge cases), PDF parser (sample minimal PDF).
- Integration: API endpoints with test MongoDB database.
- Performance: Embedding generation timing, Faiss search latency.
Use `pytest` + fixtures + a temporary Mongo container (Docker) for isolation.
- Scraping: 50-thread executor → watch for remote bans; add rate limiting & retry/backoff.
- Embeddings: The large model may be slow; consider switching to a smaller distilled model such as `all-MiniLM-L6-v2` for a lighter footprint.
- Ingredient matching: Current algorithm is O(n * m^2). Replace with a vectorized or fuzzy-matching library for scalability.
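For the scraping item, retry with exponential backoff might look like the sketch below. It is independent of Selenium: `retry_with_backoff` and the flaky demo task are hypothetical, and the `sleep` parameter exists only so the example can run without real waiting.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    task: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Exponential backoff plus jitter so threads do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")


# Demo: a task that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"count": 0}
delays: List[float] = []


def flaky_page_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky_page_fetch, base_delay=0.1, sleep=delays.append)
print(result, delays)
```

Lowering the executor's thread count is a complementary way to limit the request rate.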
- Hardcoded password & no HTTPS.
- No authentication tokens; user modification not protected against injection.
- Recommendation placeholder may expose internal IDs.

Action: Implement proper auth (JWT), sanitize input, use environment variables.
Short Term:
- Replace password with env var
- Add recipe loader script to MongoDB
- Expose recommendation endpoint
- Implement CPU/GPU fallback logic for Faiss
- Normalize ingredient names (lowercase, accent removal)

Medium Term:
- Introduce user preference vector (aggregate liked recipe embeddings)
- Expand PDF parsing for multi-store formats
- Add tests & CI workflow (GitHub Actions)
- Containerize (Dockerfile + docker-compose with Mongo)

Long Term:
- Migrate to microservices (API, recommendation, ingestion)
- Real-time inventory updates from mobile apps
- Personal nutrition profile & constraint-based filtering
- Fork & branch: `feature/<short-description>`
- Add tests for new logic.
- Run linter and basic scripts locally.
- Open a Pull Request with a clear description & screenshots/logs.

Coding Style: Follow PEP8 for Python, Android Kotlin/Java standard guidelines.
License not yet specified. Suggested: MIT for openness or Apache-2.0 for patent clarity. Add LICENSE file before public release.
- Recipe data source: https://cosylab.iiitd.edu.in/recipedb/
- Libraries: Selenium, SentenceTransformers, Faiss, Flask, PyMongo, Pandas, NumPy, pypdf.
- Educational project context: Télécom Paris PACT initiative.
| Issue | Cause | Fix |
|---|---|---|
| ChromeDriver version error | Mismatch browser/driver | Download matching driver and replace chromedriver.exe |
| Embedding script too slow | Large BERT model | Switch to smaller model (all-MiniLM-L6-v2) |
| `ModuleNotFoundError: faiss` | Faiss GPU not installed | Install faiss-cpu or conda GPU build |
| API returns password error | Wrong query parameter | Append ?password=17112002 to URL |
| Empty PDF parse | Path mismatch | Update config.CODEFOLDER and verify files exist |
The scraped data and images are for educational research purposes. Ensure compliance with the source website's terms of service before distribution.
# 1. Create environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install flask pymongo sentence-transformers faiss-cpu pandas numpy selenium pypdf
# 2. Start MongoDB separately
# (Ensure mongod is running)
# 3. Generate embeddings (optional for now)
cd Recommandation/v2
python embedding.py
# 4. Run API
cd ../../BDD/API_MongoDB
python app.py
# 5. Query recipes
curl "http://localhost:5000/recipes?password=17112002"
Feel free to open issues for missing documentation or discrepancies between code and README.