In [29]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

from sklearn.metrics.pairwise import cosine_similarity

from google.colab import files
import json

## Load the preprocessed data:


In [9]:
recipe_data = pd.read_csv('preprocessed_recipe_data.csv')

## Embedding the data:

Embedding is a technique in natural language processing (NLP) in which words, phrases, or sentences are represented as dense vectors in a continuous vector space. This representation captures semantic relationships, so texts with similar meanings get similar vectors. For the embeddings I decided to use Sentence Transformers, specifically `all-mpnet-base-v2`. This model maps sentences and paragraphs to a 768-dimensional dense vector space. It starts from the pretrained `microsoft/mpnet-base` model and is fine-tuned on a dataset of about 1B sentence pairs using a self-supervised contrastive learning objective. Although it is not the fastest model available, I chose it because it is one of the most accurate general-purpose models in the Sentence Transformers collection.
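
To make "similar meanings, similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare embeddings. The three-dimensional vectors are made-up illustrative values, not real model output (real `all-mpnet-base-v2` vectors have 768 dimensions), and `cosine_sim` is a helper written only for this example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings" (made-up numbers, for illustration only)
toy_vectors = {
    "tomato soup": np.array([0.9, 0.1, 0.2]),
    "tomato bisque": np.array([0.8, 0.2, 0.3]),
    "chocolate cake": np.array([0.1, 0.9, 0.4]),
}

sim_close = cosine_sim(toy_vectors["tomato soup"], toy_vectors["tomato bisque"])
sim_far = cosine_sim(toy_vectors["tomato soup"], toy_vectors["chocolate cake"])

print(f"soup vs bisque: {sim_close:.3f}")
print(f"soup vs cake:   {sim_far:.3f}")
```

With these toy values the two tomato dishes score much higher than the soup and cake pair, which is the behaviour we hope to see from real embeddings of related recipes.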

In [10]:
# Load the model:
model = SentenceTransformer("all-mpnet-base-v2")

In [11]:
# Collect the texts we want to embed into a list
corpus = recipe_data['full_text'].tolist()

# Create the embeddings with the sentence transformer model
embeddings = model.encode(corpus, show_progress_bar=True)

Batches:   0%|          | 0/2570 [00:00<?, ?it/s]

## Save the model:

I save the model locally so that when I want to reuse it, I don't have to download it again.

In [16]:
model.save('load_model')

# Compress the saved model folder into a zip file
!zip -r /content/load_model.zip /content/load_model

# Download the zipped model to the local machine
files.download('/content/load_model.zip')

  adding: content/load_model/ (stored 0%)
  adding: content/load_model/tokenizer_config.json (deflated 76%)
  adding: content/load_model/special_tokens_map.json (deflated 84%)
  adding: content/load_model/sentence_bert_config.json (deflated 4%)
  adding: content/load_model/2_Normalize/ (stored 0%)
  adding: content/load_model/vocab.txt (deflated 53%)
  adding: content/load_model/config.json (deflated 48%)
  adding: content/load_model/model.safetensors (deflated 8%)
  adding: content/load_model/tokenizer.json (deflated 71%)
  adding: content/load_model/modules.json (deflated 62%)
  adding: content/load_model/1_Pooling/ (stored 0%)
  adding: content/load_model/1_Pooling/config.json (deflated 47%)
  adding: content/load_model/README.md (deflated 64%)
  adding: content/load_model/config_sentence_transformers.json (deflated 27%)


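
Reloading the model later could look like the sketch below. It assumes the zip has been extracted so that a `load_model` folder sits in the working directory. `load_saved_model` is a hypothetical helper name, not part of the library. Passing a local folder path to `SentenceTransformer` loads the saved files from disk instead of fetching the model by name.

```python
from pathlib import Path

def load_saved_model(model_dir="load_model"):
    """Reload a model previously written with model.save(model_dir)."""
    model_path = Path(model_dir)
    if not model_path.is_dir():
        # Fail early with a clear message rather than falling back to a download
        raise FileNotFoundError(f"No saved model folder found at: {model_path}")
    # Imported here so the path check above runs even before the library is loaded
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(str(model_path))
```

In a later session this would be used as `model = load_saved_model()` followed by `model.encode([...])`, exactly as with the original model object.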

## Save the embeddings:

In [13]:
np.save('embeddings.npy', embeddings)

# Compress the saved embeddings file into a zip file
!zip -r /content/embeddings.zip /content/embeddings.npy

# Download the zipped embeddings to the local machine
files.download('/content/embeddings.zip')

  adding: content/embeddings.npy (deflated 7%)


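
A saved `.npy` file can be read back with `np.load`. The sketch below round-trips a small random array through a temporary file as a stand-in for the real embeddings, then checks that there is one row per text, since row `i` of the embeddings has to line up with row `i` of `recipe_data`. The random data and the `check_alignment` helper are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile

import numpy as np

def check_alignment(embeddings, n_texts):
    """Raise if the embedding matrix does not have exactly one row per text."""
    if embeddings.ndim != 2 or embeddings.shape[0] != n_texts:
        raise ValueError(f"expected {n_texts} rows, got shape {embeddings.shape}")
    return embeddings

# Stand-in for the real embeddings: 5 texts x 768 dimensions of random values
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")
    np.save(path, fake_embeddings)  # same call as used for the real embeddings
    loaded = np.load(path)          # read the full array back into memory

check_alignment(loaded, n_texts=5)
print(loaded.shape, loaded.dtype)
```

In the notebook itself this would become `embeddings = np.load('embeddings.npy')` followed by `check_alignment(embeddings, len(recipe_data))`.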