# Knowledge Base Expansion Notebook

This notebook expands a food ingredient knowledge base. It reads a CSV of base ingredients and their taste profiles, then uses an LLM to generate processed variants of each ingredient (e.g., fermented, pickled).
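
As a rough illustration of the expected input, the sketch below parses a one-row CSV with the column layout used later in this notebook (`entity_name`, `entity_type`, the five taste scores, and `notes`). The sample row is copied from the cabbage example in the system prompt. The real knowledge base file may contain other columns or values, so treat its shape here as an assumption.

```python
import csv
import io

# Column layout assumed from the headers used later in this notebook.
COLUMNS = ["entity_name", "entity_type", "salty", "umami", "sweet", "sour", "bitter", "notes"]

# One sample row, copied from the cabbage example in the system prompt.
SAMPLE_CSV = (
    "entity_name,entity_type,salty,umami,sweet,sour,bitter,notes\n"
    'cabbage,ingredient,0.0,0.2,0.3,0.1,0.4,"Category: Vegetable"\n'
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sample = rows[0]

# The five taste dimensions sit between entity_type and notes.
taste_profile = {name: float(sample[name]) for name in COLUMNS[2:7]}
print(sample["entity_name"], taste_profile)
```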

## Setup

In [None]:
!pip install transformers torch accelerate bitsandbytes


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
import pandas as pd
import io
import csv
import time
from getpass import getpass  # needed for the API key prompt below
from huggingface_hub import login


In [None]:
HUGGINGFACE_API_KEY = getpass("Enter your Hugging Face API key: ")

In [None]:
login(token=HUGGINGFACE_API_KEY)


## Model Loading

In [None]:
MODEL = "mistralai/Mistral-Nemo-Instruct-2407"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
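
To give a feel for why 4-bit loading matters here, the arithmetic below estimates weight memory at 16 bits versus 4 bits per parameter. The parameter count of roughly 12 billion is an assumption based on the model's public description. The estimate ignores quantization constants, layers that stay in higher precision, activations, and the KV cache, so real usage will be higher.

```python
# Back-of-the-envelope estimate of weight memory only.
N_PARAMS = 12e9  # assumed approximate parameter count for Mistral-Nemo


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Return weight storage in GiB for a given bit width."""
    return n_params * bits_per_param / 8 / 2**30


bf16_gib = weight_gib(N_PARAMS, 16)
nf4_gib = weight_gib(N_PARAMS, 4)
print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```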

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
print("Model and tokenizer loaded successfully.")

## Model Prompts

In [None]:
SYSTEM_PROMPT = """You are a food science expert specializing in culinary chemistry and taste perception. You will enhance a taste profile database by adding preparation method variations for ingredients.

CONTEXT: 
- Each ingredient has 5 taste dimensions: salty, umami, sweet, sour, bitter (scored 0-1)
- Preparation methods chemically transform ingredients, altering their taste profiles
- Common transformations include: fermentation, pickling, roasting, caramelization, curing, smoking, drying, blanching, grilling

RULES:
1.  **Analyze the Base Ingredient:** Understand what it is (e.g., a vegetable, a dairy product, a meat).
2.  **Generate Variations:** Brainstorm common cooking methods, processing techniques, or specific types related to the ingredient. Name each variation in lowercase as `ingredient_process`, matching the examples below (e.g., "cabbage_fermented", "salmon_smoked", "cheese_blue"). Include both traditional and modern preparation techniques.
3.  **Estimate Taste Profile:** For each new variation, create a new 5-point taste profile (salty, umami, sweet, sour, bitter) with scores from 0.0 to 1.0. The scores MUST logically reflect the change caused by the process. Consider how these methods chemically transform the taste compounds. For example, fermenting increases sourness, curing increases saltiness, and roasting can increase sweetness and bitterness (caramelization).
4.  **Format Output Correctly:** The output MUST be only the new CSV lines. Do not include headers, explanations, or the original line. 

IMPORTANT RULES:
- Only generate preparations that actually exist in culinary practice
- Score adjustments should reflect real chemical changes, not assumptions
- For ingredients rarely processed (e.g., salt), return no lines at all
- Include a note explaining the dominant chemical/taste change

**OUTPUT FORMAT** STRICTLY ONLY RETURN A LIST OF NEW CSV LINES with the following columns:
entity_name,entity_type,salty,umami,sweet,sour,bitter,notes

EXAMPLES:
Input: cabbage,ingredient,0.0,0.2,0.3,0.1,0.4,"Category: Vegetable"
Output:
cabbage_fermented,ingredient_processed,0.2,0.6,0.1,0.8,0.3,"Lactic acid fermentation creates glutamates (umami) and acids (sour)"
cabbage_pickled,ingredient_processed,0.4,0.1,0.2,0.9,0.2,"Vinegar pickling adds acetic acid (sour) and salt"
cabbage_roasted,ingredient_processed,0.1,0.4,0.5,0.0,0.5,"Maillard reaction creates savory-sweet compounds"
cabbage_blanched,ingredient_processed,0.0,0.1,0.2,0.0,0.2,"Blanching reduces bitter glucosinolates"

Input: pork,ingredient,0.1,0.7,0.2,0.0,0.1,"Category: Meat"
Output:
pork_cured,ingredient_processed,0.9,0.8,0.1,0.0,0.2,"Salt curing and nitrites enhance umami"
pork_smoked,ingredient_processed,0.5,0.9,0.3,0.0,0.4,"Smoke compounds add umami and slight bitterness"
pork_caramelized,ingredient_processed,0.2,0.8,0.6,0.0,0.3,"Surface caramelization creates sweet-savory crust"
"""

In [None]:
USER_PROMPT_TEMPLATE = """
Generate the processed variations for this ingredient:
{ingredient_row}

**IMPORTANT**: Your response must contain ONLY the raw CSV lines and nothing else. Do not include any analysis, explanations, or introductory text.
"""

## Expansion Workflow
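
The `{ingredient_row}` placeholder in the user prompt expects a single CSV line. As an alternative to the f-string concatenation used in `expand_ingredient`, the standard `csv` module can build that line and will quote any field that contains a comma or a double quote. The helper name `row_to_csv_line` and the shortened template copy are illustrative and not part of the notebook.

```python
import csv
import io

FIELDS = ["entity_name", "entity_type", "salty", "umami", "sweet", "sour", "bitter", "notes"]

# Shortened stand-in for USER_PROMPT_TEMPLATE, for illustration only.
TEMPLATE = "Generate the processed variations for this ingredient:\n{ingredient_row}\n"


def row_to_csv_line(row: dict) -> str:
    """Serialise one ingredient row as a single CSV line with proper quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="")
    writer.writerow([row[name] for name in FIELDS])
    return buffer.getvalue()


# Hypothetical row whose notes field contains a comma.
example = {
    "entity_name": "cabbage", "entity_type": "ingredient",
    "salty": 0.0, "umami": 0.2, "sweet": 0.3, "sour": 0.1, "bitter": 0.4,
    "notes": "Category: Vegetable, leafy",
}
line = row_to_csv_line(example)
prompt = TEMPLATE.format(ingredient_row=line)
print(prompt)
```

With the default quoting mode, the notes field is quoted only when it needs to be. That differs slightly from the few-shot examples in the system prompt, which always quote it.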

In [None]:
def expand_ingredient(row, model, tokenizer):
    row_str = f"{row['entity_name']},{row['entity_type']},{row['salty']},"
    row_str += f"{row['umami']},{row['sweet']},{row['sour']},{row['bitter']},\"{row['notes']}\""

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_TEMPLATE.format(ingredient_row=row_str)}
    ]

    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=2048, 
        do_sample=False      # to get a deterministic output
    )

    # decoding generated tokens
    output_text = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
    clean_output = output_text.strip().strip("[]'\\\" ")

    string_io = io.StringIO(clean_output)
    reader = csv.reader(string_io, skipinitialspace=True)

    new_rows_as_dicts = []
    headers = ['entity_name', 'entity_type', 'salty', 'umami', 'sweet', 'sour', 'bitter', 'notes']

    for row_values in reader:
        if len(row_values) == len(headers):
            new_rows_as_dicts.append(dict(zip(headers, row_values)))

    return new_rows_as_dicts
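
Instruction-tuned models sometimes wrap their answer in a Markdown code fence, add a sentence of commentary, or emit scores outside the requested range. The sketch below shows one possible stricter parser. `parse_variations` is a hypothetical helper, not part of the notebook, and the 0.0 to 1.0 check follows the range stated in the system prompt.

```python
import csv
import io

HEADERS = ["entity_name", "entity_type", "salty", "umami", "sweet", "sour", "bitter", "notes"]
SCORE_COLUMNS = HEADERS[2:7]
FENCE = "`" * 3  # Markdown code-fence marker


def parse_variations(text: str) -> list:
    """Parse model output into row dicts, skipping malformed lines."""
    # Drop fence lines such as the bare marker or the marker followed by "csv".
    lines = [ln for ln in text.strip().splitlines() if not ln.strip().startswith(FENCE)]
    rows = []
    for values in csv.reader(io.StringIO("\n".join(lines)), skipinitialspace=True):
        if len(values) != len(HEADERS):
            continue  # wrong number of columns, e.g. commentary or a blank line
        row = dict(zip(HEADERS, values))
        try:
            scores = [float(row[col]) for col in SCORE_COLUMNS]
        except ValueError:
            continue  # at least one score is not numeric
        if all(0.0 <= s <= 1.0 for s in scores):
            rows.append(row)
    return rows


# Hand-written sample output: one valid row, one out-of-range row, one stray sentence.
raw = "\n".join([
    FENCE + "csv",
    'cabbage_fermented,ingredient_processed,0.2,0.6,0.1,0.8,0.3,"Lactic acid fermentation"',
    'cabbage_pickled,ingredient_processed,0.4,0.1,0.2,1.9,0.2,"Score out of range"',
    "Here are the rows you asked for",
    FENCE,
])
parsed = parse_variations(raw)
print([r["entity_name"] for r in parsed])
```

Rows that fail a check are dropped silently here. Logging them instead would make it easier to spot an ingredient for which the model ignored the output format.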

In [None]:
df = pd.read_csv('/kaggle/input/knowledge-base-average/knowledge_base_average.csv') # change this to reflect the path to the knowledge base

all_rows = []

print("Starting knowledge base expansion...")
for index, row in df.iterrows():

    print(f"[{index + 1}/{len(df)}] Expanding '{row['entity_name']}'...")

    # 1. add original row to new list
    all_rows.append(row.to_dict())

    # 2. get the new, expanded rows
    new_rows = expand_ingredient(row, model, tokenizer)

    if new_rows:
        print(f"  -> Generated {len(new_rows)} new variations.")
        all_rows.extend(new_rows)
    else:
        print("  -> No new variations generated or an error occurred.")

print("\nExpansion complete.")

# 3. concatenate into one df
augmented_df = pd.DataFrame(all_rows)

output_path = "knowledge_base_average_processed.csv"
augmented_df.to_csv(output_path, index=False)
print(f"Augmented knowledge base saved to '{output_path}'.")
print(f"Original size: {len(df)} rows. New size: {len(augmented_df)} rows.")
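
Each ingredient requires a full generation pass, so a run that crashes midway loses everything held in `all_rows`. One possible mitigation is to append rows to the output file as they are produced and to skip ingredients already written when the run restarts. The sketch below uses only the standard library. The helper names and the choice to key on `entity_name` are assumptions, not part of the notebook.

```python
import csv
import os
import tempfile

HEADERS = ["entity_name", "entity_type", "salty", "umami", "sweet", "sour", "bitter", "notes"]


def load_done(path: str) -> set:
    """Return the entity names already present in the output file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["entity_name"] for row in csv.DictReader(f)}


def append_rows(path: str, rows: list) -> None:
    """Append row dicts to the output file, writing the header only once."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=HEADERS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demonstration with a temporary file and hand-written rows.
out_path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")
base = {"entity_name": "pork", "entity_type": "ingredient", "salty": 0.1, "umami": 0.7,
        "sweet": 0.2, "sour": 0.0, "bitter": 0.1, "notes": "Category: Meat"}
variant = dict(base, entity_name="pork_smoked", entity_type="ingredient_processed")
append_rows(out_path, [base, variant])
done = load_done(out_path)
print(sorted(done))
```

In the main loop, the base row and its generated variants would be written in a single `append_rows` call, and any row whose `entity_name` is already in `load_done(...)` would be skipped. Writing them together avoids marking an ingredient as done when only its base row was saved.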