# **HOW TO GENERATE A DATASET ABOUT COOKING?**

**Why this method?**

Using Phi-2 for dataset generation is a strategic choice for our student project, given our limited resources and small team of three. As students, we lack the financial means to access large-scale AI infrastructure, so we must find a balance between quantity and quality to ensure we have enough data to fine-tune our model effectively. Phi-2, being a lightweight yet capable language model, allows us to generate a high volume of structured instructional data without requiring access to expensive high-end GPUs. By carefully designing prompts and filtering the generated output, we aim to maintain quality while maximizing the number of examples, ensuring that our model is trained on a dataset that is both diverse and relevant. This approach enables us to push the limits of what can be achieved with minimal resources, making AI research accessible even to small student-led initiatives like ours.

**Why use a GPU?**

We used a GPU to speed up the dataset generation process. Since Phi-2 is a Transformer-based model, running it on a CPU would be significantly slower, making it impractical to generate a large dataset efficiently. The GPU allows for faster text generation, reducing waiting times and enabling us to produce more data in less time, which is crucial given our limited resources and need for scalability.

**Let's initialize Phi-2:**

To initialize Phi-2, we first load the tokenizer and model using Hugging Face's `AutoTokenizer` and `AutoModelForCausalLM`. We specify `torch.float16` to halve the memory needed for the weights compared with `float32`, which matters on a consumer GPU. The `device_map="auto"` setting places the model on the available hardware automatically, using the GPU if one is present. This setup keeps Phi-2 fast enough for inference while keeping resource consumption manageable.
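
To make the memory argument concrete, here is a rough back-of-the-envelope estimate. It assumes Phi-2's published size of about 2.7 billion parameters and counts the weights only; activations and the generation cache need additional memory. The helper name `weights_gib` is illustrative and not part of the notebook.

```python
# Rough weight-memory estimate for Phi-2 (assumed: about 2.7 billion parameters).
N_PARAMS = 2.7e9


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


fp32_gib = weights_gib(N_PARAMS, 4)  # float32: 4 bytes per parameter
fp16_gib = weights_gib(N_PARAMS, 2)  # float16: 2 bytes per parameter

print(f"float32: {fp32_gib:.1f} GiB, float16: {fp16_gib:.1f} GiB")
```

Halving the footprint does not guarantee that the whole model fits on a small GPU; in that case `device_map="auto"` may place some layers on the CPU, which slows generation.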

In [1]:
import json
import random
import re
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# ✅ Check that a recent enough version of transformers is installed
import transformers
from packaging import version  # compare versions numerically, not as strings
assert version.parse(transformers.__version__) >= version.parse("4.43.0"), "⚠️ Upgrade transformers: pip install 'transformers>=4.43.0'"

print("🚀 GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("💻 GPU name:", torch.cuda.get_device_name(0))
    print("🔢 Number of GPUs available:", torch.cuda.device_count())
    print("📝 CUDA version:", torch.version.cuda)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🚀 Running on: {device}")

# ✅ Load the Phi-2 model with the recommended settings
# "phi-2" is resolved as given (e.g. a local folder); on the Hugging Face Hub
# the model is published as "microsoft/phi-2".
model_name = "phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer.pad_token = tokenizer.eos_token

model = torch.compile(model)


🚀 GPU available: True
💻 GPU name: NVIDIA GeForce RTX 3050
🔢 Number of GPUs available: 1
📝 CUDA version: 12.1
🚀 Running on: cuda


Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.14s/it]


In [2]:
def extract_text(text):
    """
    Extract the question/answer part of the generated text by looking for markers
    such as "Output:", "##OUTPUT", "A:", "Answer:", "AI:" or "Response:".
    If no marker is found, the function returns the full text, stripped.
    """
    # Several patterns are defined to handle the different output formats
    patterns = [
        r"Output:\s*(.*)",                      # for "Output:"
        r"##\s*OUTPUT\s*(.*)",                   # for "##OUTPUT" (with or without a space)
        r"A:\s*(?:Question:)?\s*(.*)",           # for "A:" or "A: Question:"
        r"ANSWER:\s*(?:Question:)?\s*(.*)",      # for "ANSWER:" with or without "Question:"
        r"Answer:\s*(?:Question:)?\s*(.*)",      # for "Answer:" with or without "Question:"
        r"AI:\s*(.*)",                          # for "AI:"
        r"Response:\s*(.*)"                     # for "Response:"
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            answer = match.group(1).strip()
            if answer:
                return answer
            
    return text.strip()

In [3]:
def is_simple_answer(answer: str) -> bool:
    """
    Check whether the answer is simple.
    An answer is considered simple if:
      - It contains at most 40 words.
      - It contains no lists (no dashes, bullets, or numbering such as "1. ").
      - It does not end with a colon (":"), which suggests a cut-off answer.
      - It ends with final punctuation (".", "?" or "!"); otherwise it is considered incomplete.
    """
    words = answer.split()
    if len(words) > 40:
        return False
    # Reject the answer if it starts with numbering (e.g. "1. ")
    if re.match(r"^\d+\.\s+", answer):
        return False
    if any(token in answer for token in ["- ", "•", "\n- ", "\n•"]):
        return False
    if answer.strip().endswith(":"):
        return False
    if not answer.strip().endswith((".", "?", "!")):
        return False
    return True

In [4]:
def clean_incomplete_sentence(answer: str) -> str:
    """
    Remove the last portion of the text if it does not form a complete sentence,
    i.e. if it does not end with a period, a question mark or an exclamation mark.
    """
    answer = answer.strip()
    if not answer:
        return answer
    if answer[-1] not in ".?!":
        # Find the last occurrence of final punctuation in the answer
        last_period = answer.rfind('.')
        last_question = answer.rfind('?')
        last_exclamation = answer.rfind('!')
        last_index = max(last_period, last_question, last_exclamation)
        if last_index != -1:
            # Keep only the text up to (and including) that punctuation mark
            answer = answer[:last_index+1].strip()
    return answer

In [5]:
def is_simple_question(question: str) -> bool:
    """
    Check whether the question is simple according to two criteria:
      - It ends with a question mark.
      - Its word count is between 3 and 15 (adjustable criterion).
    """
    words = question.split()
    return question.endswith("?") and 3 <= len(words) <= 15

## **QUESTION GENERATION:**

The `generate_question(word1, word2)` function is designed to create concise and practical cooking-related questions based on two randomly chosen words from a predefined vocabulary of culinary terms. This process ensures variety and relevance in the generated dataset. Below is a step-by-step breakdown of how the function works:

#### **1. Constructing the Prompt**
The function first generates a structured prompt instructing the AI to create a short and useful cooking question. The prompt includes the two randomly selected words and sets clear expectations for the format and style of the generated question.

#### **2. Text Generation Using a Pretrained Model**
The prompt is tokenized and passed through a pre-trained language model. The model generates a response using specific parameters:
   - `max_new_tokens=50` ensures a short output.
   - `temperature=0.75` introduces controlled randomness.
   - `top_p=0.9` restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.9 (nucleus sampling).
   - `do_sample=True` ensures non-deterministic sampling.

This step produces a text output that may contain the expected question but can also include unwanted extra text.
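
To illustrate how `temperature` and `top_p` interact, the following toy sketch applies both to a made-up four-token distribution. It is a simplified illustration in plain Python, not the implementation used inside Transformers; the function name `nucleus` and the logits are invented for the example.

```python
import math


def nucleus(logits: dict, temperature: float, top_p: float) -> dict:
    """Return the renormalized distribution left after temperature + top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 before sampling.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


toy_logits = {"garlic": 3.0, "butter": 2.5, "onion": 1.0, "wrench": -2.0}
kept = nucleus(toy_logits, temperature=0.75, top_p=0.9)
print(kept)
```

With these toy values only the two most likely tokens survive the cut, so the model can still vary its wording without drifting to very unlikely tokens.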

#### **3. Extracting the Question**
To refine the output, the function calls `extract_text(text)`, which identifies and extracts the relevant portion of the generated text. This function applies multiple regex patterns to detect and isolate the question from the AI's response. It searches for markers like:
   - `"Output:"`
   - `"##OUTPUT"`
   - `"A:"` or `"Answer:"`
   - `"AI:"` or `"Response:"`

If a match is found, the extracted portion is returned as the candidate question.
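
A minimal standalone sketch of this marker-based extraction is shown below. It uses only two of the markers listed above, and the names `MARKERS` and `extract_after_marker` are illustrative rather than taken from the notebook.

```python
import re

# Two of the markers listed above; the notebook's extract_text uses a longer list.
MARKERS = [r"Output:\s*(.*)", r"Answer:\s*(.*)"]


def extract_after_marker(text: str) -> str:
    """Return the text following the first marker found, or the whole text if none matches."""
    for pattern in MARKERS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return text.strip()  # no marker: fall back to the whole text


raw = (
    "Generate a question about garlic, butter.\n"
    "Output: How do I brown garlic in butter?\n"
    "Extra line."
)
question = extract_after_marker(raw)
print(question)
```

Because `.` does not match a newline unless `re.DOTALL` is set, each pattern captures only the rest of the line that contains the marker, which helps discard text the model adds afterwards.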

#### **4. Validating the Question**
The extracted text is then evaluated using `is_simple_question(question)`, which ensures that the generated question meets two criteria:
   - **It must end with a question mark (`?`).**
   - **It must contain between 3 and 15 words.**

This step guarantees that the output is a well-formed, concise question rather than a lengthy explanation or an incomplete phrase.
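
As a quick illustration of these two criteria, the check can be run on a few hand-written candidates. The function body is restated here so the snippet runs on its own; the candidate strings are made up for the example.

```python
def is_simple_question(question: str) -> bool:
    """True if the text ends with '?' and has between 3 and 15 words."""
    words = question.split()
    return question.endswith("?") and 3 <= len(words) <= 15


candidates = [
    "How do I brown garlic in butter?",     # 7 words, ends with "?"
    "Why?",                                 # too short
    "Garlic and butter go well together.",  # not a question
]
results = [is_simple_question(c) for c in candidates]
print(results)
```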

#### **5. Iterative Refinement**
If the generated question does not meet the simplicity criteria, the function regenerates the text with the same parameters, for at most five attempts in total. If none of the attempts passes the check, the last candidate is returned as-is.
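
The retry logic can be sketched independently of the model. In the sketch below, `first_valid` is a hypothetical helper and the canned strings stand in for model outputs. Unlike the notebook's function, it also returns a flag saying whether the candidate passed, which is one possible way to let the caller discard questions that never validated.

```python
def first_valid(generate, is_valid, max_attempts: int = 5):
    """Call generate() up to max_attempts times; return (candidate, accepted)."""
    candidate = ""
    for _ in range(max_attempts):
        candidate = generate()
        if is_valid(candidate):
            return candidate, True
    # Every attempt failed: hand back the last candidate, flagged as not accepted.
    return candidate, False


# Hypothetical stand-in for the model: canned outputs returned one by one.
canned = iter(["Garlic butter is tasty", "How do I brown garlic in butter?"])
result, accepted = first_valid(lambda: next(canned), lambda q: q.endswith("?"))
print(result, accepted)
```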

#### **6. Returning the Final Question**
Once a valid question is obtained, it is returned as a dictionary with an empty `response` field:
```json
{
    "instruction": "What are some ways to use garlic and butter in cooking?",
    "response": ""
}
```


In [None]:
# ✅ Function that generates a short, unique question
def generate_question(word1, word2):
    """
    Generate a short question on a given cooking theme, re-running the generation
    until the extracted question is considered simple.
    
    Args:
        word1 (str): First cooking keyword.
        word2 (str): Second cooking keyword.
    
    Returns:
        dict: A dictionary containing only the instruction (the simple question) and an empty response.
    """
    prompt = f"You are a friendly and concise cooking assistant. Your job is to generate a short, simple, and practical cooking question with few words based on two randomly selected ingredients or kitchen items :  {word1}, {word2}."
    
    max_attempts = 5  # Cap to avoid an infinite loop
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        # 🔥 Tokenization and generation (inputs go to the device selected in the first cell)
        inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(device)
        output = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=50,  # ✅ Keeps the question short
            temperature=0.75,   # ✅ Adds a little diversity
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
        # 📜 Decode the generated text
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True).strip()
    
        # Extract the question part using the different patterns
        question = extract_text(generated_text)
    
        # If the question is simple, return it
        if is_simple_question(question):
            return {"instruction": question, "response": ""}
    
    # If no simple question is obtained after max_attempts, return the last version
    return {"instruction": question, "response": ""}

In [7]:
import json
import random

# ✅ Generate 1000 questions on random themes, with periodic temporary saves
dataset = []
used_questions = set()  # A set that stores only the question text

# Read the word list from the file
with open("cooking_words_2.txt", "r", encoding="utf-8") as file:
    cooking_words = file.read().splitlines()

while len(dataset) < 1000:
    word1, word2 = random.sample(cooking_words, 2)
    question_data = generate_question(word1, word2)  # ✅ Returns a dictionary {"instruction": "...", "response": ""}

    question_text = question_data["instruction"]  # ✅ Keep only the question as a string

    # ✅ Keep the question only if it has not been seen before
    if question_text not in used_questions:
        used_questions.add(question_text)  # Record it so the uniqueness check takes effect
        dataset.append({"instruction": question_text, "response": ""})
        # print(f"✔ Generated: {question_text}")

        # ✅ Temporary save every 100 questions
        if len(dataset) % 100 == 0:
            with open("V5_temp_cooking_questions.json", "w", encoding="utf-8") as f:
                json.dump(dataset, f, indent=4, ensure_ascii=False)
            print(f"💾 Temporary save at {len(dataset)} questions.")


💾 Temporary save at 100 questions.
💾 Temporary save at 200 questions.
💾 Temporary save at 300 questions.
💾 Temporary save at 400 questions.
💾 Temporary save at 500 questions.
💾 Temporary save at 600 questions.
💾 Temporary save at 700 questions.
💾 Temporary save at 800 questions.
💾 Temporary save at 900 questions.
💾 Temporary save at 1000 questions.


In [8]:
# ✅ Save the dataset as JSON
with open("cooking_questions_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=4, ensure_ascii=False)

print("\n✅ Dataset successfully saved as 'cooking_questions_dataset.json' 🎉")


✅ Dataset successfully saved as 'cooking_questions_dataset.json' 🎉


## **ANSWER GENERATION:**

The `generate_answer(question)` function is designed to create **concise, direct, and practical** cooking-related answers based on a given question. Unlike a full recipe or ingredient list, the responses generated are short and informative, making them ideal for quick culinary advice. Below is a step-by-step breakdown of how the function ensures high-quality output:

### **1. Constructing the Prompt**
The function creates a structured **prompt** instructing the AI to act as a knowledgeable yet concise cooking assistant. The prompt explicitly specifies that:
   - The response should be **short** and **to the point**.
   - It should **not contain a full recipe** or ingredient list.
   - The response must be **practical and relevant** to the given question.

Example Prompt:
```text
Act like a cooking chief. You are a friendly and concise cooking assistant. Your job is to generate a short, simple, and practical cooking answer with few words based on this question "How can I make my scrambled eggs creamier?".
Do not provide a detailed recipe, do not list ingredients and write a simple and short answer.
```

### **2. Generating the Answer**
The function then **tokenizes** the prompt and passes it through the pre-trained language model. The generation parameters keep the response short while allowing some variety:
   - `max_new_tokens=70` limits response length.
   - `temperature=0.75` introduces moderate randomness.
   - `top_p=0.9` restricts sampling to the most likely tokens, those whose cumulative probability reaches 0.9.
   - `do_sample=True` enables sampling, so responses are non-deterministic.

The model produces a raw text output that **may contain unwanted details or incomplete sentences**.


### **3. Extracting and Cleaning the Answer**
Once generated, the text is processed using **`extract_text(text)`** to isolate the relevant part of the AI response. This function:
   - Searches for markers like `"Answer:"`, `"Response:"`, or `"AI:"` to identify the start of the answer.
   - Removes any unnecessary leading or trailing text.

The answer is then refined using **`clean_incomplete_sentence(answer)`**, which:
   - Cuts the response back to its last `.`, `?` or `!` when it does not already end with one, which removes an incomplete final sentence.
   - Leaves the text unchanged when it contains no sentence-ending punctuation at all.
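
The trimming step can be illustrated with a compact standalone variant. The name `trim_to_last_sentence` is illustrative and the sample strings are made up. The last example shows a limitation of this kind of rule: a decimal point counts as sentence-ending punctuation.

```python
def trim_to_last_sentence(answer: str) -> str:
    """Cut the text back to its last '.', '?' or '!' if it does not already end with one."""
    answer = answer.strip()
    if not answer or answer[-1] in ".?!":
        return answer
    last = max(answer.rfind(mark) for mark in ".?!")
    # No sentence-ending punctuation anywhere: leave the text untouched.
    return answer[: last + 1].strip() if last != -1 else answer


truncated = trim_to_last_sentence("Add a splash of cream. Cook on low heat while stir")
untouched = trim_to_last_sentence("Stir gently")
decimal_case = trim_to_last_sentence("Use 1.5 cups of")  # cut happens at the decimal point
print(truncated, "|", untouched, "|", decimal_case)
```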


### **4. Validating the Answer**
To ensure the generated response meets quality standards, the function calls **`is_simple_answer(answer)`**, which checks:
   - **Word count ≤ 40** (concise response).
   - **No lists or bullet points** (answers containing `- ` or `•`, or starting with numbering such as `1. `, are rejected).
   - **No incomplete phrases** (must end with `.`, `?`, or `!`).
   - **No trailing colons (`:`)**, which may indicate a cutoff response.

If the generated answer **fails validation**, the function regenerates it, for at most **five attempts** in total.
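
When tuning these rules it can help to see which one rejected an answer. The sketch below applies the same checks in the same order but returns a reason instead of a boolean. `rejection_reason` is a hypothetical helper, not part of the notebook, and the sample answers are made up.

```python
import re
from typing import Optional


def rejection_reason(answer: str, max_words: int = 40) -> Optional[str]:
    """Return why an answer would be rejected, or None if it passes every check."""
    stripped = answer.strip()
    if len(stripped.split()) > max_words:
        return "too long"
    if re.match(r"^\d+\.\s+", answer):
        return "numbered list"
    if any(token in answer for token in ("- ", "•")):
        return "bullet list"
    if stripped.endswith(":"):
        return "trailing colon"
    if not stripped.endswith((".", "?", "!")):
        return "no final punctuation"
    return None


samples = [
    "Add a splash of cream and stir on low heat.",
    "1. Whisk the eggs.",
    "Use these:",
    "Stir gently",
    "low- and high-heat pans both work.",  # "- " inside a sentence also triggers the list rule
]
reasons = [rejection_reason(s) for s in samples]
print(reasons)
```

The last sample shows that the `- ` test can reject ordinary sentences containing a suspended hyphen, which may be worth keeping in mind when the rejection rate looks high.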


### **5. Returning the Final Answer**
Once a valid response is obtained, it is returned in JSON format:

```json
{
    "instruction": "How can I make my scrambled eggs creamier?",
    "response": "Add a splash of heavy cream and cook on low heat while stirring continuously."
}
```

If no valid answer is generated within the maximum retries, the last attempt is returned.


### **Why Is This Method Effective?**
✔ **Ensures concise, well-structured answers** without unnecessary details.  
✔ **Makes up to 5 attempts per question** to improve answer quality.  
✔ **Filters out incomplete, overly long, or poorly formatted responses.**  
✔ **Creates useful, digestible knowledge for AI-powered cooking assistants.**  

By combining structured **prompting**, **post-processing**, and **quality checks**, this method produces **high-quality cooking advice**, making it suitable for **AI training datasets, chatbots, and virtual assistants**. 🍳🔥

In [None]:
def generate_answer(question: str):
    """
    Generate a concise answer (2-3 lines at most, with no lists or detailed recipes)
    for the given question. If the extracted answer is not considered simple
    (for example, if it is incomplete and does not end with final punctuation),
    the generation is re-run up to a maximum number of attempts.
    
    Args:
        question (str): The question to answer.
    
    Returns:
        dict: A dictionary containing the question in "instruction" and the answer in "response".
    """
    prompt = (
        f"Act like a cooking chief. You are a friendly and concise cooking assistant. "
        f"Your job is to generate a short, simple, and practical cooking answer with few words based on this question \"{question}\".\n"
        f"Do not provide a detailed recipe, do not list ingredients and write a simple and short answer."
    )
    
    max_attempts = 5  # Cap to avoid an infinite loop
    attempt = 0
    reponse = ""
    
    while attempt < max_attempts:
        attempt += 1
        # Inputs go to the device selected in the first cell
        inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(device)
        output = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=70,
            temperature=0.75,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
        
        # Decode the generated text
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True).strip()
        reponse = extract_text(generated_text)
        
        # Clean the answer by removing the last, incomplete sentence
        reponse = clean_incomplete_sentence(reponse)
        
        # Check whether the answer is simple and complete
        if is_simple_answer(reponse):
            return {"instruction": question, "response": reponse}
        # else:
        #     print(f"Attempt {attempt}: Answer not simple enough, regenerating... Generated answer: {reponse}")
    
    #print("Max attempts reached. Returning the last generated answer.")
    return {"instruction": question, "response": reponse}

In [10]:
import json

# Load the existing dataset
with open("V5_temp_cooking_questions.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

# Iterate over each entry and generate the answer
for i, entry in enumerate(dataset):
    question = entry["instruction"]
    # print(f"Processing question {i+1}/{len(dataset)}:")
    # print(f"Question: {question}")
    
    # Generate the answer for the given question
    answer_data = generate_answer(question)
    answer = answer_data["response"]
    
    # Update the entry with the generated answer
    entry["response"] = answer
    # print(f"Generated answer: {answer}")
    # print("-" * 50)
    
    # Temporary save every 100 questions
    if (i + 1) % 100 == 0:
        with open("V5_temp_cooking_questions_with_answers.json", "w", encoding="utf-8") as f_temp:
            json.dump(dataset, f_temp, indent=4, ensure_ascii=False)
        print(f"Temporary save at {i+1} questions processed.")

# Save the complete dataset to a new JSON file
with open("V5_cooking_questions_with_answers.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=4, ensure_ascii=False)

print("Answer generation is complete and the file was saved as 'V5_cooking_questions_with_answers.json'.")


Temporary save at 100 questions processed.
Temporary save at 200 questions processed.
Temporary save at 300 questions processed.
Temporary save at 400 questions processed.
Temporary save at 500 questions processed.
Temporary save at 600 questions processed.
Temporary save at 700 questions processed.
Temporary save at 800 questions processed.
Temporary save at 900 questions processed.
Temporary save at 1000 questions processed.
Answer generation is complete and the file was saved as 'V5_cooking_questions_with_answers.json'.
