# Recommender Pipeline

* Synthetic user profiles
* Fill in the BioBank data structure (LLM)
* Convert the data structure into user-item interactions (with an LLM)
* Recommend foods based on the user-item interactions
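
The third step of the list above could look roughly like the sketch below. The profile layout (a `food_frequency` mapping from food name to weekly servings) and the capping into a 0 to 5 label are assumptions made purely for illustration; the actual BioBank data structure is not shown in this notebook, and the LLM step is left out.

```python
# Hypothetical sketch: turn a filled-in profile into (user, item, label) rows.
# The "food_frequency" field and the label capping are illustrative assumptions.

def profile_to_interactions(user_id, profile, max_label=5):
    """Map weekly servings per food to an integer label between 0 and max_label."""
    rows = []
    for food, servings_per_week in profile["food_frequency"].items():
        label = min(max_label, int(round(servings_per_week)))
        rows.append({"user": user_id, "item": food, "label": label})
    return rows


example_profile = {"food_frequency": {"oatmeal": 7, "salmon": 1, "broccoli": 3}}
interactions = profile_to_interactions(user_id=1, profile=example_profile)
for row in interactions:
    print(row)
```

The resulting rows use the same `user`, `item`, `label` column names that the rest of the notebook works with.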





### About this Dataset
This data was collected from https://www.allrecipes.com/.
Features include:

* group: grouping by origin of the recipes, consisting of 3 (or 2) groups separated by dots
* name: name of the recipe
* rating: rating of the recipe
* n_rater: number of participants rating the recipe
* n_reiviewer: number of participants reviewing the recipe
* summary: blurb about the recipe
* process: summary of the recipe process
* ingredient: ingredients of the recipe

In [1]:
# !pip install pandas
# !pip install --upgrade kagglehub
# !pip install -U LibRecommender
# !pip install keras==2.12.0 tensorflow==2.12.0
#
# !pip show LibRecommender

In [2]:
import kagglehub
import pandas as pd
from zipfile import ZipFile
import tensorflow as tf
import os


path = kagglehub.dataset_download("shuyangli94/food-com-recipes-and-user-interactions")

print("Path to dataset files:", path)




Path to dataset files: /home/mw/.cache/kagglehub/datasets/shuyangli94/food-com-recipes-and-user-interactions/versions/2


In [3]:
def updateLabels(interactions_data):
    # Cast the rating labels to plain integers
    interactions_data["label"] = interactions_data["label"].astype(int)
    return interactions_data


In [4]:
def rename_and_drop_columns(interactions_data):
    # Rename the raw columns to user, item and label
    interactions_data.rename(
        columns={"user_id": "user", "recipe_id": "item", "rating": "label"},
        inplace=True
    )
    # Drop every column other than user, item and label
    extra_columns = [c for c in interactions_data.columns if c not in ("user", "item", "label")]
    interactions_data.drop(columns=extra_columns, inplace=True)

    updateLabels(interactions_data)

    return interactions_data

In [5]:
# 2) Combine the existing interaction files, because otherwise an out-of-bounds error occurs
eval_data_path = os.path.join(path, "interactions_validation.csv")
eval_data = pd.read_csv(eval_data_path)


train_data_path = os.path.join(path, "interactions_train.csv")
train_data = pd.read_csv(train_data_path)


test_data_path = os.path.join(path, "interactions_test.csv")
test_data = pd.read_csv(test_data_path)


# The data must be merged so that it can be filtered and then split again in the same ratio
data = pd.concat([train_data, eval_data, test_data], ignore_index=True)
data = rename_and_drop_columns(data)

In [6]:
all_unique_labels = data["label"].unique()
all_unique_labels

array([5, 4, 3, 1, 0, 2])

In [7]:
# All distinct values in the "label" column and their frequencies
label_counts = data["label"].value_counts()
print("Distinct values in 'label' and their frequencies:")
print(label_counts)


Distinct values in 'label' and their frequencies:
label
5    530417
4    131846
3     27058
0     18000
2      7336
1      3722
Name: count, dtype: int64
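
The counts above are heavily skewed toward high ratings: roughly 74 % of all interactions carry label 5 and about 92 % carry label 4 or 5. The short sketch below reproduces these shares from the printed counts; it is an illustration only and not a step of the pipeline.

```python
# Rating counts copied from the value_counts() output above
label_counts = {5: 530417, 4: 131846, 3: 27058, 0: 18000, 2: 7336, 1: 3722}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
share_at_least_4 = shares[5] + shares[4]

for label in sorted(shares, reverse=True):
    print(f"label {label}: {shares[label]:.1%}")
print(f"labels >= 4: {share_at_least_4:.1%}")
```

This skew is worth keeping in mind when the labels are later used for a ranking task.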


In [8]:
data.columns

Index(['user', 'item', 'label'], dtype='object')

In [9]:
data.head()

   user  item  label
0  2046  4684      5
1  2046   517      5
2  1773  7435      5
3  1773   278      4
4  2046  3431      5


In [10]:
data['user'][0]

2046

In [11]:
data['item'][0]

4684

In [12]:
data['label'][0]

5

In [13]:
import pandas as pd

threshold = 30

# 1) Keep only items with at least `threshold` interactions:
min_item_interactions = threshold
item_counts = data["item"].value_counts()
items_to_keep = item_counts[item_counts >= min_item_interactions].index

data_filtered = data[data["item"].isin(items_to_keep)]

# 2) Keep only users with at least `threshold` interactions:
min_user_interactions = threshold
user_counts = data_filtered["user"].value_counts()
users_to_keep = user_counts[user_counts >= min_user_interactions].index

data_filtered = data_filtered[data_filtered["user"].isin(users_to_keep)]

# Check the result
print("Dataset before filtering:", data.shape)
print("Dataset after filtering:", data_filtered.shape)
print(data_filtered.head())


Dataset before filtering: (718379, 3)
Dataset after filtering: (71370, 3)
      user   item  label
164  11297   5478      4
245   4470    834      5
300   6357  11365      5
349   6357  11642      5
365   9869   2886      5
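
The filter above runs one pass over items and then one pass over users. Removing users can push some items back below the threshold, so the result is not guaranteed to satisfy both limits at once. One common alternative is to repeat both passes until nothing changes (often called k-core filtering). The helper `k_core_filter` below is a hypothetical sketch of that idea on toy data, not part of the original notebook.

```python
import pandas as pd


def k_core_filter(df, k, user_col="user", item_col="item"):
    """Repeatedly drop items and users with fewer than k interactions until stable."""
    while True:
        # Per-row count of how often the row's item occurs
        item_counts = df[item_col].map(df[item_col].value_counts())
        df_next = df[item_counts >= k]
        # Per-row count of how often the row's user occurs after the item filter
        user_counts = df_next[user_col].map(df_next[user_col].value_counts())
        df_next = df_next[user_counts >= k]
        if len(df_next) == len(df):
            return df_next
        df = df_next


# Toy data: users 10 and 11 form a dense block, users 1 to 3 form a sparse chain
toy = pd.DataFrame(
    {
        "user": [10, 10, 11, 11, 1, 2, 1, 3, 3],
        "item": ["x", "y", "x", "y", "a", "a", "b", "b", "c"],
    }
)
core = k_core_filter(toy, k=2)
print(core)
```

On the toy data, a single pass would keep user 1 with items `a` and `b`, although each of those items then has only one remaining interaction; the repeated version removes them.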


In [14]:
from libreco.data import random_split, DatasetPure

train_data, eval_data, test_data = random_split(data_filtered, multi_ratios=[0.8, 0.1, 0.1])

train_data, data_info = DatasetPure.build_trainset(train_data)
eval_data = DatasetPure.build_evalset(eval_data)
test_data = DatasetPure.build_testset(test_data)
print(data_info)

n_users: 1136, n_items: 2455, data density: 2.0472 %
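
The reported density is presumably the number of training interactions divided by the number of possible user-item pairs. The sketch below checks that reading against the printed values. The training size is an assumption: 80 % of the 71370 filtered rows, since `data_info` is built from the training split.

```python
def interaction_density(n_interactions, n_users, n_items):
    """Share of all possible user-item pairs that actually occur."""
    return n_interactions / (n_users * n_items)


# Assumed size of the 80 % training split of the 71370 filtered interactions
n_train = round(0.8 * 71370)
density = interaction_density(n_train, n_users=1136, n_items=2455)
print(f"{density:.4%}")
```

The result is close to 2.05 %, which agrees with the printed `data density` up to the exact size of the random split.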


Algorithm chosen based on: https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf
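
The linked paper introduces Bayesian Personalized Ranking (BPR), which trains on pairs: for a user, an item they interacted with should score higher than one they did not. A minimal sketch of the per-pair loss, `-log(sigmoid(score_pos - score_neg))`, follows. It assumes dot-product scores on toy embeddings, omits the regularization term, and is not LibRecommender's implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bpr_pair_loss(user_vec, pos_item_vec, neg_item_vec):
    """-log(sigmoid(diff)), where diff is positive score minus negative score."""
    diff = dot(user_vec, pos_item_vec) - dot(user_vec, neg_item_vec)
    # Numerically stable form of log(1 + exp(-diff))
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


user = [0.5, 1.0]
liked = [1.0, 1.0]     # score 1.5
unseen = [0.0, -1.0]   # score -1.0

loss_good = bpr_pair_loss(user, liked, unseen)  # correct order, small loss
loss_bad = bpr_pair_loss(user, unseen, liked)   # wrong order, large loss
print(loss_good, loss_bad)
```

The loss shrinks as the interacted item is ranked further above the sampled negative, which is why the model below is trained with negative sampling.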

In [15]:
from libreco.algorithms import BPR

tf.compat.v1.reset_default_graph()

# Initialize the model
model = BPR(
    task="ranking",
    data_info=data_info,
    loss_type="bpr",
    embed_size=256,
    n_epochs=5,
    lr=5e-5,  # learning rate
    batch_size=1024,
    num_neg=5,  # more negative samples for better ranking
    reg=5e-6,  # regularization
    sampler="random"  # negative sampling strategy
)

# Train the model while monitoring the metrics
model.fit(
    train_data,
    neg_sampling=True,  # enable negative sampling
    shuffle=True,
    verbose=2,  # detailed training output
    eval_data=eval_data,  # validation data
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"]  # track the relevant metrics
)

Instructions for updating:
non-resource variables are not supported in the long term


ModuleNotFoundError: No module named 'torch'

In [244]:
from libreco.evaluation import evaluate

evaluate(
    model=model,
    data=test_data,
    neg_sampling=True,  # perform negative sampling on test data
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)

eval_pointwise: 100%|██████████| 6/6 [00:00<00:00, 233.31it/s]
eval_listwise: 100%|██████████| 373/373 [00:00<00:00, 714.22it/s]


{'loss': 0.6742145023770763,
 'roc_auc': 0.6560994023786628,
 'precision': 0.024754244861483466,
 'recall': 0.042958019919234744,
 'ndcg': 0.10863608108029306}
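
For orientation, the list-wise metrics above compare each user's top-k recommendations with the items that user actually interacted with in the test set. The sketch below shows textbook definitions with binary relevance. The cutoff `k = 5` and the toy ids are arbitrary, and LibRecommender's own implementation may differ in details.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Discounted gain of the hits, normalized by the best achievable ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


recommended = [10, 20, 30, 40, 50]   # toy ranked list
relevant = {20, 50, 60}              # toy held-out items
p = precision_at_k(recommended, relevant, k=5)
r = recall_at_k(recommended, relevant, k=5)
n = ndcg_at_k(recommended, relevant, k=5)
print(p, r, n)
```

Read this way, a precision of about 0.025 means that, on average, only a small fraction of the recommended items per user were actually held-out interactions.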

In [99]:
def drop_columns_not_name_and_id(name_df):
    # Keep only the recipe name and its id; drop everything else in place
    extra_columns = [c for c in name_df.columns if c not in ("name", "id")]
    name_df.drop(columns=extra_columns, inplace=True)
    return name_df

# Load RAW_recipes.csv
raw_recipes_csv_path = os.path.join(path, "RAW_recipes.csv")
name_df = pd.read_csv(raw_recipes_csv_path)
drop_columns_not_name_and_id(name_df)

def getName(recipe_id):
    # Look up the recipe name for an id; raises IndexError if the id is unknown
    name = name_df.loc[name_df['id'] == recipe_id, 'name']
    return name.values[0]

In [100]:
name_df.head()

                                       name      id
0  arriba baked winter squash mexican style  137739
1           a bit different breakfast pizza   31490
2                  all in the kitchen chili  112140
3                         alouette potatoes   59389
4          amish tomato ketchup for canning   44061
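
`getName` scans the whole frame on every call and fails with an `IndexError` when an id is missing. A possible alternative is to build an id-to-name dictionary once and fall back to a placeholder. The function `get_name_or_default` and the placeholder text are hypothetical; the sample rows are the ones shown above.

```python
import pandas as pd

# Sample rows copied from the name_df.head() output above
name_df_sample = pd.DataFrame(
    {
        "name": [
            "arriba baked winter squash mexican style",
            "a bit different breakfast pizza",
            "all in the kitchen chili",
            "alouette potatoes",
            "amish tomato ketchup for canning",
        ],
        "id": [137739, 31490, 112140, 59389, 44061],
    }
)

# Build the id -> name mapping once; later lookups avoid scanning the frame
name_by_id = name_df_sample.set_index("id")["name"].to_dict()


def get_name_or_default(recipe_id, default="<unknown recipe>"):
    """Hypothetical variant of getName that tolerates unknown ids."""
    return name_by_id.get(recipe_id, default)


print(get_name_or_default(59389))
print(get_name_or_default(-1))
```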


User examples: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions?select=interactions_test.csv

In [101]:
user_list = data_filtered["user"].unique()
len(user_list)

3851

In [102]:
user_list

array([      2312,       7802,       6836, ..., 2001330613, 2001453193,
       2001102678])

In [273]:
def get_recommendation(user):
    recommendations_dict = model.recommend_user(user=user, filter_consumed=True, n_rec=7)  # optionally: random_rec=True
    print(recommendations_dict)

    # The result is something like {4460: [id1, id2, ..., id7]}
    recommended_ids = recommendations_dict[user]

    for recipe_id in recommended_ids:
        print(getName(recipe_id))

get_recommendation(4470)

{4470: array([ 54257, 150863,  95569,  27208,  22176,  69173,  57130])}
yes  virginia there is a great meatloaf
panera s cream cheese potato soup
easy and tasty barbecue chicken sandwiches in the crock pot
to die for crock pot roast
classic baked ziti
kittencal s italian melt in your mouth meatballs
awesome slow cooker pot roast


In [274]:
get_recommendation(6357)

{6357: array([ 27208,  43023,   8701, 150863,  63689, 152441,  30951])}
to die for crock pot roast
creamy garlic penne pasta
should be illegal oven bbq ribs
panera s cream cheese potato soup
my family s favorite sloppy joes  pizza joes
24k carrots
green beans with cherry tomatoes


In [275]:
get_recommendation(2000431901)

{2000431901: array([152441,  68955, 106251,  48760, 150863,  33671,  57130])}
24k carrots
japanese mum s chicken
roasted cauliflower   16 roasted cloves of garlic
szechuan noodles with spicy beef sauce
panera s cream cheese potato soup
crock pot whole chicken
awesome slow cooker pot roast


In [277]:
get_recommendation(2001362355)

{2001362355: array([ 54257,  43023, 108105,  87782, 125399,  54715,  25885])}
yes  virginia there is a great meatloaf
creamy garlic penne pasta
thai chicken breasts
greek potatoes  oven roasted and delicious
french toast sticks   oamc
sticky pork chops
banana banana bread
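
Several recipes recur across the four lists above, which can hint at a popularity bias in the recommendations. The sketch below quantifies that overlap using only the ids printed above. The `jaccard` helper is an illustrative addition and not part of LibRecommender.

```python
from collections import Counter

# Recommended recipe ids per user, copied from the outputs above
recommendations = {
    4470: [54257, 150863, 95569, 27208, 22176, 69173, 57130],
    6357: [27208, 43023, 8701, 150863, 63689, 152441, 30951],
    2000431901: [152441, 68955, 106251, 48760, 150863, 33671, 57130],
    2001362355: [54257, 43023, 108105, 87782, 125399, 54715, 25885],
}

# How often each recipe id is recommended across the four users
item_counts = Counter(item for items in recommendations.values() for item in items)
repeated = {item: n for item, n in item_counts.items() if n > 1}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


print("distinct recipes:", len(item_counts), "of", sum(item_counts.values()))
print("recommended more than once:", repeated)
print("overlap 4470 vs 6357:", jaccard(recommendations[4470], recommendations[6357]))
```

Of the 28 recommendation slots, 21 hold distinct recipes, and recipe 150863 (panera s cream cheese potato soup) appears for three of the four users.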
