**Recipe Search using Model2Vec**

This notebook demonstrates how to use the Model2Vec library to search for recipes that match a given query. It covers three ways of using Model2Vec:
1. **Using a pre-trained output vocab model**: Uses a pre-trained output embedding model. This is a very small model that uses a subword tokenizer.
2. **Using a pre-trained GloVe vocab model**: Uses a pre-trained GloVe vocab model. This is a larger model that uses a word tokenizer.
3. **Using a custom vocab model**: Uses a custom, domain-specific model that is distilled with a vocab created from the recipe dataset.
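
To make the difference between the three modes concrete, the sketch below shows the general idea behind a static embedding model: each token is looked up in a fixed table and the token vectors are averaged. The vocabulary, the table values, and the `toy_encode` helper are hypothetical and only illustrate the principle; this is not Model2Vec's actual implementation.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table (not taken from any real model)
toy_vocab = {"cheese": 0, "burger": 1, "salad": 2}
toy_table = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

def toy_encode(text: str) -> np.ndarray:
    """Mean-pool the vectors of all in-vocabulary tokens; return zeros if none match."""
    ids = [toy_vocab[token] for token in text.lower().split() if token in toy_vocab]
    if not ids:
        # No known token: nothing to average, so fall back to a zero vector
        return np.zeros(toy_table.shape[1])
    return toy_table[ids].mean(axis=0)

print(toy_encode("cheese burger"))  # [0.5 0.5]
print(toy_encode("fattoush"))       # [0. 0.]
```

In this toy setup, a word-level vocabulary that lacks a query word yields a zero vector, whereas a subword tokenizer or a vocabulary built from the target corpus would be more likely to cover it.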

In [None]:
# Install the necessary libraries
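# `regex`, `tokenizers`, and `pandas` are used directly below; the `distill` extra provides `model2vec.distill`
!pip install numpy pandas datasets scikit-learn transformers tokenizers regex "model2vec[distill]"
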
# Import the necessary libraries
import regex
from collections import Counter

import numpy as np
from datasets import load_dataset
from sklearn.metrics import pairwise_distances
from tokenizers.pre_tokenizers import Whitespace

from model2vec import StaticModel
from model2vec.distill import distill

In [45]:
# Load the recipe dataset
dataset = load_dataset("Shengtao/recipe", split="train")
# Convert the dataset to a pandas DataFrame
dataset = dataset.to_pandas()
# Display the first few rows of the dataset
print(dataset.head())
# Take the title column as our recipes corpus
recipes = dataset["title"]

                        title  \
0  Simple Macaroni and Cheese   
1    Gourmet Mushroom Risotto   
2              Dessert Crepes   
3                 Pork Steaks   
4  Quick and Easy Pizza Crust   

                                                 url              category  \
0  https://www.allrecipes.com/recipe/238691/simpl...             main-dish   
1  https://www.allrecipes.com/recipe/85389/gourme...             main-dish   
2  https://www.allrecipes.com/recipe/19037/desser...  breakfast-and-brunch   
3  https://www.allrecipes.com/recipe/70463/pork-s...      meat-and-poultry   
4  https://www.allrecipes.com/recipe/20171/quick-...                 bread   

                  author                                        description  \
0            g0dluvsugly  A very quick and easy fix to a tasty side-dish...   
1  Myleen Sagrado Sjödin  Authentic Italian-style risotto cooked the slo...   
2                  ANN57  Essential crepe recipe.  Sprinkle warm crepes ...   
3           BABY

In [46]:
# Define a function to find the most similar titles in a dataset to a given query
def find_most_similar_items(model: StaticModel, embeddings: np.ndarray, query: str, top_k: int = 5) -> list[tuple[int, float]]:
    """
    Finds the most similar items in a dataset to the given query using the specified model.

    :param model: The model used to generate embeddings.
    :param embeddings: The precomputed embeddings of the dataset, one row per item.
    :param query: The query recipe title.
    :param top_k: The number of most similar items to return.
    :return: A list of (index, cosine similarity) tuples for the most similar items, best match first.
    """
    # Generate embedding for the query
    query_embedding = model.encode(query)[None, :]

    # Calculate pairwise cosine distances between the query and the precomputed embeddings
    distances = pairwise_distances(query_embedding, embeddings, metric='cosine')[0]

    # Get the indices of the most similar items (sorted in ascending order because smaller distances are better)
    most_similar_indices = np.argsort(distances)

    # Convert distances to similarity scores (cosine similarity = 1 - cosine distance)
    most_similar_scores = [1 - distances[i] for i in most_similar_indices[:top_k]]

    # Return the top-k most similar indices and similarity scores
    return list(zip(most_similar_indices[:top_k], most_similar_scores))
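
The ranking logic above can be checked on toy data without loading a model. The following is a minimal sketch that computes cosine similarity directly with NumPy instead of `pairwise_distances`; the `cosine_top_k` helper and the toy vectors are hypothetical and exist only for illustration.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, matrix: np.ndarray, top_k: int = 2) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) pairs for the top_k most similar rows."""
    # Guard against zero-length vectors so the division below is always defined
    query_norm = np.linalg.norm(query_vec) or 1.0
    row_norms = np.linalg.norm(matrix, axis=1)
    row_norms[row_norms == 0] = 1.0

    # Cosine similarity is the dot product divided by the product of the norms
    similarities = (matrix @ query_vec) / (row_norms * query_norm)

    # Sort in descending order because larger similarities are better
    order = np.argsort(-similarities)[:top_k]
    return [(int(i), float(similarities[i])) for i in order]

toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Row 0 points in the same direction as the query, row 2 is at 45 degrees, row 1 is orthogonal
print(cosine_top_k(np.array([1.0, 0.0]), toy_embeddings))
```

Sorting similarities in descending order gives the same ranking as sorting cosine distances in ascending order, since cosine similarity = 1 - cosine distance.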

In [None]:
# Load the M2V output model from the HuggingFace hub
model_name = "minishlab/M2V_base_output"
model_output = StaticModel.from_pretrained(model_name)

In [91]:
# Find recipes using the output embeddings model
top_k = 5

# Find the most similar recipes to the given queries
query = "cheeseburger"
embeddings = model_output.encode(recipes)

results = find_most_similar_items(model_output, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    
print()

query = "fattoush"
results = find_most_similar_items(model_output, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    

Most similar recipes to 'cheeseburger':
Title: `Double Cheeseburger`, Similarity Score: 0.9028
Title: `Cheeseburger Chowder`, Similarity Score: 0.8574
Title: `Cheeseburger Sliders`, Similarity Score: 0.8413
Title: `Cheeseburger Salad`, Similarity Score: 0.8384
Title: `Cheeseburger Soup I`, Similarity Score: 0.8298

Most similar recipes to 'fattoush':
Title: `Fattoush`, Similarity Score: 1.0000
Title: `Lebanese Fattoush`, Similarity Score: 0.8370
Title: `Aunty Terese's Fattoush`, Similarity Score: 0.7630
Title: `Arabic Fattoush Salad`, Similarity Score: 0.7588
Title: `Authentic Lebanese Fattoush`, Similarity Score: 0.7584


In [None]:
# Load the M2V GloVe model from the HuggingFace hub
model_name = "minishlab/M2V_base_glove"
model_glove = StaticModel.from_pretrained(model_name)

In [92]:
# Find recipes using the GloVe vocab model
top_k = 5

# Find the most similar recipes to the given queries
query = "cheeseburger"
embeddings = model_glove.encode(recipes)

results = find_most_similar_items(model_glove, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    
print()

query = "fattoush"
results = find_most_similar_items(model_glove, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    

Most similar recipes to 'cheeseburger':
Title: `Double Cheeseburger`, Similarity Score: 0.8744
Title: `Cheeseburger Meatloaf`, Similarity Score: 0.8246
Title: `Cheeseburger Salad`, Similarity Score: 0.8160
Title: `Hearty American Cheeseburger`, Similarity Score: 0.8006
Title: `Cheeseburger Chowder`, Similarity Score: 0.7989

Most similar recipes to 'fattoush':
Title: `Simple Macaroni and Cheese`, Similarity Score: 0.0000
Title: `Fresh Tomato and Cucumber Salad with Buttery Garlic Croutons`, Similarity Score: 0.0000
Title: `Grilled Cheese, Apple, and Thyme Sandwich`, Similarity Score: 0.0000
Title: `Poppin' Turkey Salad`, Similarity Score: 0.0000
Title: `Chili - The Heat is On!`, Similarity Score: 0.0000
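
The uniform 0.0000 scores for 'fattoush' suggest that the query produced an all-zero embedding, most likely because the word is missing from the word-level GloVe vocabulary (an assumption; the notebook does not verify it). A small check like the one below can flag such queries before searching. The `has_usable_embedding` helper is hypothetical and not part of Model2Vec.

```python
import numpy as np

def has_usable_embedding(vector: np.ndarray, eps: float = 1e-12) -> bool:
    """Return True if the vector has a non-negligible norm, i.e. it is not all zeros."""
    return bool(np.linalg.norm(vector) > eps)

# A zero vector carries no direction, so its cosine similarity to anything is uninformative
print(has_usable_embedding(np.zeros(4)))                     # False
print(has_usable_embedding(np.array([0.1, 0.0, 0.0, 0.0])))  # True
```

In this notebook the check could be applied as `has_usable_embedding(model_glove.encode(query))` before calling `find_most_similar_items`.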


In [85]:
# Alternative regex tokenizer that splits texts into words and punctuation
# (not used below; the pre-tokenizer passed to create_vocab does the splitting)
my_regex = regex.compile(r"\w+|[^\w\s]+")

def create_vocab(texts: list[str], tokenizer, size: int = 30_000) -> list[str]:
    """Create a vocab of the `size` most frequent lowercased tokens in a list of texts."""
    counts = Counter()
    for text in texts:
        # Pre-tokenize the lowercased text; each item is a (token, offsets) pair
        tokens = [token for token, _ in tokenizer.pre_tokenize_str(text.lower())]
        # Regex-based alternative: tokens = my_regex.findall(text.lower())
        counts.update(tokens)
    # Keep only the most frequent tokens, ordered by descending count
    return [word for word, _ in counts.most_common(size)]
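
The vocabulary construction can be tried on a few titles without any tokenizer library. The sketch below uses the same word-and-punctuation pattern with the standard-library `re` module (chosen here only to keep the example dependency-free); the `regex_vocab` helper and the toy titles are hypothetical.

```python
import re
from collections import Counter

# Match runs of word characters, or runs of non-word, non-space characters (punctuation)
token_pattern = re.compile(r"\w+|[^\w\s]+")

def regex_vocab(texts: list[str], size: int = 30_000) -> list[str]:
    """Return the `size` most frequent lowercased tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(token_pattern.findall(text.lower()))
    return [word for word, _ in counts.most_common(size)]

toy_titles = ["Cheeseburger Salad", "Cheeseburger Soup I", "Fattoush (Lebanese Salad)"]
vocab_sample = regex_vocab(toy_titles, size=3)
# "cheeseburger" and "salad" each occur twice, so they rank ahead of the single-occurrence tokens
print(vocab_sample)
```

Because the vocab is built from the recipe titles themselves, domain words such as "fattoush" end up as whole tokens in the distilled model.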

In [88]:
# Choose a Sentence Transformer model and a tokenizer
model_name = "BAAI/bge-small-en-v1.5"
tokenizer = Whitespace()

# Create a custom vocab from the recipe titles
vocab = create_vocab(recipes, tokenizer)

# Distill a Model2Vec model using the Sentence Transformer model and the custom vocab
model_custom = distill(model_name=model_name, vocabulary=vocab, pca_dims=256)

100%|██████████| 8/8 [00:08<00:00,  1.04s/it]
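
The `pca_dims=256` argument asks `distill` for 256-dimensional output embeddings. How the library performs the reduction internally is not shown in this notebook; the following is a generic PCA sketch via SVD on random toy vectors, intended only to illustrate what reducing embeddings to a fixed number of dimensions means. The `pca_reduce` helper is hypothetical.

```python
import numpy as np

def pca_reduce(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Project row vectors onto their first `dims` principal components."""
    # Center the data so the components describe variance around the mean
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(100, 16))  # 100 toy "token embeddings" with 16 dimensions
reduced = pca_reduce(toy_vectors, dims=4)
print(reduced.shape)  # (100, 4)
```

Lower-dimensional embeddings make the resulting model smaller and similarity search faster, usually at some cost in accuracy.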


In [93]:
# Find recipes using the custom vocab model
top_k = 5

# Find the most similar recipes to the given queries
query = "cheeseburger"
embeddings = model_custom.encode(recipes)

results = find_most_similar_items(model_custom, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    
print()

query = "fattoush"
results = find_most_similar_items(model_custom, embeddings, query, top_k)
print(f"Most similar recipes to '{query}':")
for idx, score in results:
    print(f"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}")
    

Most similar recipes to 'cheeseburger':
Title: `Cheeseburger Salad`, Similarity Score: 0.9528
Title: `Cheeseburger Casserole`, Similarity Score: 0.9030
Title: `Cheeseburger Chowder`, Similarity Score: 0.8635
Title: `Cheeseburger Pie`, Similarity Score: 0.8401
Title: `Cheeseburger Meatloaf`, Similarity Score: 0.8184

Most similar recipes to 'fattoush':
Title: `Fattoush`, Similarity Score: 1.0000
Title: `Fatoosh`, Similarity Score: 0.7488
Title: `Lebanese Fattoush`, Similarity Score: 0.6344
Title: `Arabic Fattoush Salad`, Similarity Score: 0.6108
Title: `Fattoush (Lebanese Salad)`, Similarity Score: 0.5669
