# Integrated the trained conversational model with RAG

- **Authors:** Riyaadh Gani and Damilola Ogunleye
- **Project:** Food Recognition & Recipe LLM  
- **Purpose:** Creating VectorDB of recipe data and combining with RAG for the model

---

## Overview

Data location: https://drive.google.com/drive/folders/1TWymP12tO2GFEKsLlC3VD3WNwJrVeNeO?usp=sharing

This notebook is used for inference of our conversational model with our RAG pipeline

**Output:** Functional model for recipe support: based on Recipe NLG data

In [1]:
%pip install pandas numpy faiss-cpu sentence_transformers transformers torch peft==0.11.1 tqdm



In [2]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import tqdm
from pathlib import Path
import os


## Load the Model
Memory management is not easy! so load the model and then change to GPU to free up CPU RAM --> then load the data and the index

In [3]:
# Use colab resources if available
usingColab = True
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"



if usingColab:
    from google.colab import drive
    drive.mount('/content/drive')
    print("Google Colab connected to Google Drive")

    # Base project directory in Google Drive
    PROJECT_DIR = Path("/content/drive/MyDrive/deeplearning")

    # change working directory
    os.chdir(PROJECT_DIR)

    # Verify structure
    print("\nDirectory structure:")
    for path in [PROJECT_DIR / "datasets" / "Cleaned",
                PROJECT_DIR / "models" / "base",
                PROJECT_DIR / "models" / "gpt2-conversational-v1",
                PROJECT_DIR / "VectorDB"]:
        print(f"  {'✓' if path.exists() else '✗'} {path}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Colab connected to Google Drive

Directory structure:
  ✓ /content/drive/MyDrive/deeplearning/datasets/Cleaned
  ✓ /content/drive/MyDrive/deeplearning/models/base
  ✓ /content/drive/MyDrive/deeplearning/models/gpt2-conversational-v1
  ✓ /content/drive/MyDrive/deeplearning/VectorDB


In [4]:
# Load base GPT-2 model
model_path = PROJECT_DIR / "models" / "base" / "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_path)
base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    dtype=torch.float16,  # Half precision
    low_cpu_mem_usage=True
)

Have to load the base model + the adapter to actually access the model

In [5]:
adapter_path = PROJECT_DIR / "models" / "gpt2-conversational-v1" / "final"
print(f"Loading adapter from: {adapter_path}")
conversational_model = PeftModel.from_pretrained(base_model, adapter_path)

Loading adapter from: /content/drive/MyDrive/deeplearning/models/gpt2-conversational-v1/final


In [6]:
# Set pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # Pad on the left for generation

    print(f"✓ Tokenizer loaded")
    print(f"  Vocab size: {len(tokenizer):,}")
    print(f"  Special tokens: {tokenizer.special_tokens_map}")

✓ Tokenizer loaded
  Vocab size: 50,257
  Special tokens: {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}


In [7]:
# Move to GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
conversational_model = conversational_model.to(device)
conversational_model.eval()

print(f"Model loaded on {device}")

Model loaded on cuda


In [8]:
small = True  # Set to True to use a smaller dataset for testing

# Load the recipe data
df = pd.read_csv('./datasets/Cleaned/clean_recipes_10000.csv')
print(f"Loaded {len(df)} recipes")

# trim to first 10000 entries to match index
if small == True:
    df = df.head(10000)
    print(f"Trimmed to {len(df)} recipes for small dataset")

Loaded 10000 recipes
Trimmed to 10000 recipes for small dataset


In [9]:
# Load the FAISS index
if small:
    index = faiss.read_index('./VectorDB/recipe_index_10000.faiss')
else:
    index = faiss.read_index('./VectorDB/recipe_index.faiss')
print(f"Loaded index with {index.ntotal} vectors")

Loaded index with 10000 vectors


In [10]:
# Load embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("Loaded embedding model")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loaded embedding model


Define functions for rag implementation

In [11]:
def retrieve_recipes(query, k=3):
    """Retrieve top-k similar recipes"""
    q_emb = embedding_model.encode([query]).astype('float32')
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, k)

    results = []
    for idx, score in zip(indices[0], scores[0]):
        results.append({
            'response': df.iloc[idx]['response'],
            'similarity': float(score)
        })
    return results

In [12]:
def rag_answer(query, context="", k=2, max_new_tokens=256):
    """Generate answer using RAG"""

    # Retrieve
    retrieved = retrieve_recipes(query, k=k)

    # Build context
    context += "\n Similar recipes:\n"
    for i, rec in enumerate(retrieved, 1):
        context += f"{i}. {rec['response']}\n"

    # print("Context: ", context)


    # Create prompt
    prompt = f"""The following is a conversation between a user and a helpful cooking assistant. Use the added context to support the user query in a conversational manner"

{context}

User: {query}
Assistant:"""

    # Tokenize and generate
    inputs = tokenizer(
        prompt,
        return_tensors='pt',
        max_length=1024,
        truncation=True,
        padding=True
    ).to(device)

    with torch.no_grad():
        outputs = conversational_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,      # Determines model randomness (closer to 0 is more deterministic, closer to 1 can be more creative)
            do_sample=True,       # Enables probablistic sampling
            top_p=0.9,            # Controls quality of next token generation
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract answer
    if "Assistant:" in response:
        answer = response.split("Assistant:")[-1].strip()
        if "User:" in answer:
            answer = answer.split("User:")[0].strip()
    else:
        answer = response

    return answer

# Test the pipeline

## Single Turn

In [13]:
# Take user input and save as query
# query = input("How can I help you today?: ")
# print(f"\nQuery: {query}\n")
# answer = rag_answer(query, context="", k=1)
# print(f"\nAnswer: {answer}")

## Multi-Turn Conversation


In [14]:
# Loop until user quits or for max 3 turns
counter = 1
convo_history = []
query = ""
context = ""

while 'quit' not in query or counter >= 3:
    if counter == 1:
      # get user input
      query = input("How can I help you today?: ")
      print(f"\nQuery {counter}: {query}\n")
      answer = rag_answer(query, context="", k=1)
      print(f"\nAnswer: {answer}")

      convo_history.append(f"User: {query}")
      convo_history.append(f"Assistant: {answer}")

    else:
      query = input("Further questions: ")
      print(f"\nQuery {counter}: {query}\n")

      # append convo history as string
      if context == "":
        context = "\n Previous conversation:\n"
      else:
        context += "\n"

      context += "\n".join(convo_history)

      answer = rag_answer(query, context=context, k=1)
      print(f"\nAnswer: {answer}")

      convo_history.append(f"User: {query}")
      convo_history.append(f"Assistant: {answer}")

    counter += 1


How can I help you today?: I have some beef and would like to make tacos, how can i do so?

Query 1: I have some beef and would like to make tacos, how can i do so?


Answer: To make beef-taco casserole, you'll need:
1 lb. hamburg, 1 medium onion, chopped, 1- 15 12 oz. can kidney beans, 1- 8 oz. can tomato sauce, 2 tsp. chili powder, 12 tsp
Further questions: what is healthy in that recipe?

Query 2: what is healthy in that recipe?


Answer: Healthy beef casserole can be made with any beef you can find, but the best thing is to use a beef chuck that is lean, flavorful and flavorful. A low-fat, low-sodium, high-protein, low-carb, high-fiber, low-glycemic, high-fiber, whole-grain beef chuck will do the trick.

If you're making a low-fat, low-sodium, high-protein, low-carb, low-fat, whole-grain beef casserole, it can also be made with whole-grain beef, but that would be better with a whole-grain base like brown rice or quinoa.

If you're making a low-fat, low-sodium, high-protein, low-car

This is a friendly reminder - the current text generation call has exceeded the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
