# Text Similarity of Recipes

## User Provides Text-Based Instructions on a Recipe You Want

In [1]:
instructions = "I want a cheesy dish that is easy and quick to cook and inexpensive."

## Import Packages and Model

The `w2v_model` variable holds pre-trained word embeddings. Word embeddings are dense vector representations of words in a continuous vector space, where words with similar meanings or contexts lie closer to each other. Despite the variable name, the vectors loaded here are GloVe vectors rather than Word2Vec vectors; gensim exposes both through the same `KeyedVectors` interface.

The vectors are loaded with the `api.load()` function from the `gensim.downloader` module, using the model name "glove-wiki-gigaword-50". GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining word embeddings, and this particular model was trained on a combination of Wikipedia text and the Gigaword dataset. Each word is represented by a 50-dimensional vector.

Once loaded, `w2v_model` supplies an embedding for each word encountered during text processing. In the `get_sentiment_embedding` function, every word of the preprocessed text that exists in the model vocabulary is looked up with `w2v_model[word]`; words outside the vocabulary are skipped.

The word embeddings are then combined with sentiment scores. For each word, the `SentimentIntensityAnalyzer` from the NLTK library returns a compound score between -1 (most negative) and 1 (most positive), with 0 for neutral words. Each word embedding is multiplied by its word's score, and the scaled vectors are averaged to obtain the overall sentiment embedding of the text. One consequence is that neutral words contribute a zero vector to the average.

In summary, `w2v_model` provides pre-trained GloVe word embeddings, which are weighted by per-word sentiment scores and averaged to produce a single sentiment embedding for the input text.
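The weighting-and-averaging step described above can be sketched with toy numbers. The 3-dimensional vectors and the scores below are made up for illustration (the real GloVe vectors have 50 dimensions, and real scores come from the sentiment analyzer); only the arithmetic mirrors the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for three words (illustrative values only)
toy_embeddings = np.array([
    [1.0, 2.0, 3.0],   # a word with a positive score
    [4.0, 0.0, -2.0],  # a neutral word
    [0.0, 1.0, 1.0],   # a word with a negative score
])

# Hypothetical per-word sentiment scores in [-1, 1]; 0.0 means neutral
toy_scores = np.array([0.5, 0.0, -0.5])

# Scale each row by its score (broadcast across columns), then average over words
weighted = toy_embeddings * toy_scores.reshape(-1, 1)
sentiment_embedding = weighted.mean(axis=0)

print(weighted)
print(sentiment_embedding)
```

The middle row becomes all zeros after scaling, so a neutral word adds nothing to the sum but still counts in the average's denominator.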

In [2]:
from gensim.models import Word2Vec
import pandas as pd
import gensim.downloader as api  # Import the gensim library for accessing pre-trained models
import nltk  # Import the Natural Language Toolkit library for text processing
from nltk.sentiment import SentimentIntensityAnalyzer  # Import the sentiment intensity analyzer from NLTK
from nltk.corpus import stopwords  # Import the stopwords corpus from NLTK
import re  # Import the regular expressions library for text preprocessing
import numpy as np  # Import the numpy library for numerical operations

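# One-time downloads of the NLTK resources used below (newer NLTK releases may also need "punkt_tab")
nltk.download("vader_lexicon", quiet=True)  # Lexicon for SentimentIntensityAnalyzer
nltk.download("stopwords", quiet=True)  # English stopword list
nltk.download("punkt", quiet=True)  # Tokenizer models for nltk.word_tokenize

w2v_model = api.load("glove-wiki-gigaword-50")  # Load the pre-trained GloVe vectors (gensim KeyedVectors)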

In [3]:
w2v_model.most_similar('macaroni')

[('sausage', 0.7981840968132019),
 ('ravioli', 0.7721162438392639),
 ('soup', 0.7555078268051147),
 ('sandwiches', 0.7459966540336609),
 ('sausages', 0.7451284527778625),
 ('burgers', 0.7447069883346558),
 ('pasta', 0.7434289455413818),
 ('roasted', 0.74176025390625),
 ('steak', 0.7415284514427185),
 ('steaks', 0.7406360507011414)]

## Functions

In [4]:
# Function to preprocess the text by removing links, special characters, numbers, and stopwords
def preprocess_text(text, remove_stopwords):
    if not isinstance(text, str):
        return ""

    text = re.sub(r"http\S+", "", text)  # Remove links
    text = re.sub(r"[^A-Za-z]+", " ", text)  # Replace special characters and numbers with a space
    text = re.sub(r"\s+", " ", text)  # Collapse repeated whitespace
    text = text.lower().strip()  # Convert to lowercase and trim the ends

    if remove_stopwords:
        stop_words = set(stopwords.words("english"))  # Build the stopword set once, not once per token
        tokens = nltk.word_tokenize(text)  # Tokenize the text
        tokens = [w for w in tokens if w not in stop_words]  # Remove stopwords (text is already lowercase)
        text = " ".join(tokens)  # Join the tokens back into a text
    return text

def batch_tokenize(text, batch_size):
    """
    Function to tokenize text into batches of words.

    Arguments:
        text (str): The input text.
        batch_size (int): The size of each batch.

    Returns:
        list: A list of word batches.
    """
    words = text.split()  # Split the text into individual words
    return [words[i:i+batch_size] for i in range(0, len(words), batch_size)]


def get_sentiment_embedding(text: str, remove_stopwords: bool = True, w2v_model=w2v_model) -> np.ndarray:
    """
    Function that calculates the sentiment embedding of a given text.

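    Arguments:
        text (str): The input text.
        remove_stopwords (bool): Whether to remove stopwords during preprocessing. Default is True.
        w2v_model: Pre-trained word vectors (gensim KeyedVectors). Defaults to the model loaded above.

    Returns:
        np.ndarray: The sentiment embedding of the text (one value per embedding dimension).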
    """

    sia = SentimentIntensityAnalyzer()  # Initialize the sentiment intensity analyzer

    preprocessed_text = preprocess_text(text, remove_stopwords)  # Preprocess the text
    word_embeddings = []  # List to store word embeddings
    sentiment_scores = []  # List to store sentiment scores
    vocab = set(w2v_model.key_to_index)  # Set of vocabulary words

    # Split the words into batches using list comprehension and range function
    # The size of each batch is determined by the batch_size parameter
    for words_batch in batch_tokenize(preprocessed_text, batch_size=100):  # Adjust the batch size as needed
        # Retrieve word embeddings for each word in the batch
        embeddings_batch = [w2v_model[word] for word in words_batch if word in vocab]

        # Calculate sentiment scores for each word in the batch
        scores_batch = [sia.polarity_scores(word)["compound"] for word in words_batch if word in vocab]

        # Extend the word_embeddings list with embeddings from the current batch
        word_embeddings.extend(embeddings_batch)

        # Extend the sentiment_scores list with scores from the current batch
        sentiment_scores.extend(scores_batch)

    if not word_embeddings:
        # No in-vocabulary words (e.g. empty or non-string input): return a zero vector
        return np.zeros(w2v_model.vector_size)

    word_embeddings = np.array(word_embeddings)  # Shape: (n_words, embedding_dim)
    sentiment_scores = np.array(sentiment_scores)  # Shape: (n_words,)
    # Scale each word embedding by its sentiment score, then average over words
    sentiment_embedding = np.mean(word_embeddings * sentiment_scores.reshape(-1, 1), axis=0)

    return sentiment_embedding

def find_recipe_with_closest_sentiment(instruction_sentiment, recipe_sentiments):
    """
    Function to find the recipe with the closest sentiment to the given instruction sentiment.

    Arguments:
        instruction_sentiment (np.ndarray): Sentiment embedding of the instructions.
        recipe_sentiments (pd.Series): Series containing sentiment embeddings of the recipes.

    Returns:
        int: Index of the recipe with the closest sentiment embedding.
        float: Cosine similarity between that recipe's embedding and the instruction embedding.
    """
    recipe_matrix = np.array(recipe_sentiments.tolist())  # Shape: (n_recipes, embedding_dim)
    dots = recipe_matrix @ instruction_sentiment  # Dot product of each recipe with the instructions
    norms = np.linalg.norm(recipe_matrix, axis=1) * np.linalg.norm(instruction_sentiment)

    # Cosine similarity; where a norm is zero the similarity is undefined, so use -inf to never select it
    cosine_similarities = np.divide(dots, norms, out=np.full(dots.shape, -np.inf), where=norms > 0)

    closest_recipe_index = np.argmax(cosine_similarities)  # Index of the highest cosine similarity
    closest_recipe_sentiment = cosine_similarities[closest_recipe_index]  # Similarity score of that recipe

    return closest_recipe_index, closest_recipe_sentiment

In [5]:
# Print the preprocessed instructions text
print(preprocess_text(instructions, remove_stopwords = True))

# Get the sentiment embedding of the instructions
sentiment_embedding = get_sentiment_embedding(instructions, remove_stopwords = True)

sentiment_embedding

want cheesy dish easy quick cook inexpensive


array([ 3.95400204e-05, -7.37919966e-03,  1.63943178e-02, -2.49481715e-03,
        1.13856479e-02, -2.02144781e-02, -2.90376029e-02,  1.88957578e-02,
       -8.17788824e-03,  4.97225238e-02, -4.44585419e-02,  1.47454146e-02,
       -5.88147620e-03,  8.40693195e-03,  1.29313575e-02,  2.08055311e-02,
        5.36706357e-02, -3.89991701e-02,  1.60576243e-02, -7.54113750e-02,
       -3.00304568e-02,  1.55618763e-02, -2.32897361e-03,  4.10846421e-02,
        7.04021855e-02, -1.07048178e-01, -3.36219263e-02,  1.72379732e-03,
        9.07790682e-02, -6.20811577e-02,  2.31212061e-01,  3.67316688e-02,
       -4.49387601e-02,  8.29595642e-03,  2.47599650e-02,  4.97717525e-03,
        5.46987070e-03,  5.89393552e-02, -1.43894441e-02, -5.49933876e-02,
        2.69189563e-02,  1.30488955e-02,  2.04968154e-02,  4.75698300e-02,
        1.64978877e-02,  2.25078954e-02,  3.58211099e-02,  1.73561067e-03,
        3.77253836e-02,  3.88775673e-02])

## Create Dataset
### Column possibilities:
    
1. Recipe Name: A column to store the name or title of each recipe.
2. Ingredients: A column to store the list of ingredients required for each recipe.
3. Cuisine: A column to specify the cuisine or type of dish (e.g., Italian, Mexican, Asian).
4. Meal Type: A column indicating the meal type (e.g., breakfast, lunch, dinner, dessert).
5. Dietary Restrictions: A column to capture any dietary restrictions or special considerations (e.g., vegetarian, gluten-free, vegan).
6. Rating: A column to store the rating or feedback for each recipe.
7. Reviews: A column to store user reviews or comments about the recipe.
8. Calories: A column to indicate the calorie content of the recipe.
9. Source/Origin: A column to specify the source or origin of the recipe (e.g., cookbook, website, personal creation).
10. Preparation Time: A column to store the total time required for preparation.
11. Cook Time: A column to indicate the actual cooking time for the recipe.
12. Total Time: A column to store the total time required for the recipe from start to finish.
13. Difficulty Level: A column indicating the difficulty level of the recipe (e.g., easy, moderate, difficult).
14. Servings: A column to specify the number of servings the recipe yields.
15. Nutritional Information: A column to store additional nutritional information (e.g., protein, fat, carbohydrates).
16. Instructions: How to actually cook the recipe.
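To make the schema concrete, here is a minimal sketch of one recipe row as a plain dictionary, using a subset of the columns above and values taken from the sample data shown later in this notebook. The `combine_columns` helper is a hypothetical name introduced for illustration; it mirrors how the notebook flattens a row into a single "column: value" string before embedding it.

```python
# Hypothetical recipe row using a subset of the columns listed above
recipe_row = {
    "Recipe Name": "Spaghetti Carbonara",
    "Cuisine": "Italian",
    "Meal Type": "Dinner",
    "Difficulty Level": "Easy",
    "Total Time": "25 minutes",
}


def combine_columns(row: dict) -> str:
    """Join every column as 'name: value' into one space-separated string."""
    return " ".join(f"{col}: {value}" for col, value in row.items())


combined_text = combine_columns(recipe_row)
print(combined_text)
```

Including the column names in the text means words such as "Cuisine" and "Difficulty" are embedded along with the values, which is a design choice worth keeping in mind when interpreting the similarity scores.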

In [6]:
# Load a list of recipes that do not yet have sentiment embeddings
recipe_df = pd.read_csv('UnscoredRecipes.csv')

# Create a new column and combine the other columns with their corresponding column names
recipe_df['CombinedText'] = recipe_df.apply(lambda row: ' '.join(f"{col}: {row[col]}" for col in recipe_df.columns), axis=1)

# Apply the sentiment function to the new column
recipe_df['Sentiment'] = recipe_df['CombinedText'].apply(get_sentiment_embedding, remove_stopwords=True)

# Display the updated DataFrame
recipe_df.head(5)

Unnamed: 0,Recipe Name,Ingredients,Cuisine,Meal Type,Dietary Restrictions,Rating,Reviews,Calories,Source/Origin,Preparation Time,Cook Time,Total Time,Difficulty Level,Servings,Protein,Fat,Carbs,CombinedText,Sentiment
0,Spaghetti Carbonara,"Spaghetti, eggs, pancetta, Parmesan cheese, bl...",Italian,Dinner,,4.5,Delicious and easy to make!,450,Cookbook,10 minutes,15 minutes,25 minutes,Easy,4,Protein: 20g,Fat: 12g,Carbohydrates: 60g,Recipe Name: Spaghetti Carbonara Ingredients: ...,"[0.002972091372970204, -0.0009864789807166045,..."
1,Chicken Tikka Masala,"Chicken breast, yogurt, tomatoes, onion, garli...",Indian,Dinner,,4.8,"Amazing flavors, a must-try!",380,Website,20 minutes,30 minutes,50 minutes,Moderate,6,Protein: 30g,Fat: 14g,Carbohydrates: 25g,Recipe Name: Chicken Tikka Masala Ingredients:...,"[-0.0036913077469786695, 0.004381618775350267,..."
2,Caesar Salad,"Romaine lettuce, croutons, Parmesan cheese, Ca...",American,Lunch,Vegetarian,4.2,Classic and delicious!,220,Personal creation,15 minutes,0 minutes,15 minutes,Easy,2,Protein: 10g,Fat: 15g,Carbohydrates: 10g,Recipe Name: Caesar Salad Ingredients: Romaine...,"[0.0058500254504090425, 0.0002857527722622834,..."
3,Beef Tacos,"Ground beef, taco seasoning, tortillas, lettuc...",Mexican,Dinner,,4.7,"Authentic taste, loved it!",350,Cookbook,10 minutes,15 minutes,25 minutes,Easy,4,Protein: 25g,Fat: 18g,Carbohydrates: 30g,Recipe Name: Beef Tacos Ingredients: Ground be...,"[-0.001040244748334993, 0.0065455999138951315,..."
4,Spinach and Feta Stuffed Chicken Breast,"Chicken breast, spinach, feta cheese, garlic, ...",Mediterranean,Dinner,Gluten-free,4.6,Healthy and flavorful!,280,Website,15 minutes,25 minutes,40 minutes,Moderate,2,Protein: 30g,Fat: 12g,Carbohydrates: 5g,Recipe Name: Spinach and Feta Stuffed Chicken ...,"[-0.006330649013608171, 0.006345079318656211, ..."


In [7]:
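# Save the scored recipes; note that CSV stores the array-valued Sentiment column as text
recipe_df.to_csv('ScoredRecipe.csv', index=False)  # index=False avoids writing an extra unnamed index column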

## Closest Recipe Recommender

The value returned by the `find_recipe_with_closest_sentiment` function is the cosine similarity between the sentiment embedding of the given instructions and the sentiment embedding of the best-matching recipe.

Cosine similarity measures the cosine of the angle between two vectors and ranges from -1 to 1: 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. Here the sentiment embeddings of the instructions and of each recipe are treated as vectors, and a higher cosine similarity indicates closer alignment between them.
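As a quick illustration of that range, the sketch below computes the cosine similarity for three hand-picked pairs of vectors. The `cosine_similarity` helper is written here for illustration and is not part of the notebook's functions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


v = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity(v, 10 * v)  # a scaled copy of v
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(v, -v)  # v reversed

print(same_direction, orthogonal, opposite)
```

The first pair shows that multiplying a vector by a positive constant does not change the result: only direction matters, not magnitude.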

In the function, the cosine similarities are calculated using the dot product of the recipe sentiment embeddings (recipe_sentiments) and the sentiment embedding of the given instructions (instruction_sentiment). This value is then divided by the product of the Euclidean norms of the two vectors.

The np.argmax(cosine_similarities) function is used to find the index of the recipe with the highest cosine similarity, indicating the recipe that is most similar in terms of sentiment to the given instructions. The sentiment value of the closest recipe is then obtained by retrieving the cosine similarity value at the corresponding index (closest_recipe_index).
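The same computation can be sketched on a small matrix, with one row per recipe. The 2-dimensional embeddings below are made-up values chosen so the result is easy to check by hand; the steps follow the dot product, norm, and `np.argmax` sequence described above.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for three recipes (one row per recipe)
recipe_embeddings = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [-1.0, 0.5],
])
instruction_embedding = np.array([2.0, 2.0])

# Dot product of every row with the instruction vector, divided by the product of the norms
dots = recipe_embeddings @ instruction_embedding
norms = np.linalg.norm(recipe_embeddings, axis=1) * np.linalg.norm(instruction_embedding)
cosine_similarities = dots / norms

best_index = int(np.argmax(cosine_similarities))  # Position of the highest similarity
best_similarity = float(cosine_similarities[best_index])

print(cosine_similarities)
print(best_index, best_similarity)
```

The second recipe points in exactly the same direction as the instruction vector, so it is the one selected even though its magnitude differs.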

Note that this value is not a percentage and should not be read as one. Cosine similarity depends only on the direction of the two vectors, not on their magnitude, so it is unaffected by how the embeddings are scaled. It is most useful for ranking recipes against the same instructions. Because each embedding mixes word meaning with sentiment weights, a high score reflects similarity between sentiment-weighted word vectors rather than sentiment alone.

### Try on Our Original Instructions

In [8]:
def recipe_finder(sentiment_embedding):
    # Find the recipe with the closest sentiment
    closest_recipe_index, closest_recipe_sentiment = find_recipe_with_closest_sentiment(sentiment_embedding, recipe_df['Sentiment'])
    closest_recipe = recipe_df.iloc[closest_recipe_index]

    print(f"Closest recipe sentiment: {closest_recipe_sentiment}")
    print(f"\nClosest recipe:\n\n{closest_recipe}")
    
recipe_finder(sentiment_embedding)

Closest recipe sentiment: 0.8455632096561123

Closest recipe:

Recipe Name                                           Spaghetti Carbonara
Ingredients             Spaghetti, eggs, pancetta, Parmesan cheese, bl...
Cuisine                                                           Italian
Meal Type                                                          Dinner
Dietary Restrictions                                                  NaN
Rating                                                                4.5
Reviews                                       Delicious and easy to make!
Calories                                                              450
Source/Origin                                                    Cookbook
Preparation Time                                               10 minutes
Cook Time                                                      15 minutes
Total Time                                                     25 minutes
Difficulty Level                                 

### Provide New Instructions

In [9]:
# Try on new instructions
instructions = "I want a recipe that I could have in Mexico."

# Get the sentiment embedding of the instructions
sentiment_embedding = get_sentiment_embedding(instructions, remove_stopwords = True)

recipe_finder(sentiment_embedding)

Closest recipe sentiment: 0.7545961429926464

Closest recipe:

Recipe Name                                                    Beef Tacos
Ingredients             Ground beef, taco seasoning, tortillas, lettuc...
Cuisine                                                           Mexican
Meal Type                                                          Dinner
Dietary Restrictions                                                  NaN
Rating                                                                4.7
Reviews                                        Authentic taste, loved it!
Calories                                                              350
Source/Origin                                                    Cookbook
Preparation Time                                               10 minutes
Cook Time                                                      15 minutes
Total Time                                                     25 minutes
Difficulty Level                                 