# Recommender Pipeline

*   Synthetic user profiles
*   Fill in the BioBank data structure (LLM)
*   Convert the data structure (with an LLM) into user-item interactions
*   Recommend foods based on the user-item interactions





### About this Dataset
This data was collected from https://www.allrecipes.com/.
Features include:

*   group: grouping by origin of recipes, consisting of 3 (or 2) groups, separated by dots.
*   name: the name of the recipe
*   rating: rating of the recipe
*   n_rater: number of participants rating the recipe
*   n_reiviewer: number of participants reviewing the recipe
*   summary: blurb about the recipe
*   process: summary of the recipe process
*   ingredient: ingredients of the recipe

In [1]:
!pip install pandas
!pip install --upgrade kagglehub
!pip install -U LibRecommender
!pip install keras==2.12.0 tensorflow==2.12.0

!pip show LibRecommender

Collecting LibRecommender
  Downloading LibRecommender-1.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (30 kB)
Downloading LibRecommender-1.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: LibRecommender
Successfully installed LibRecommender-1.5.1
Collecting keras==2.12.0
  Downloading keras-2.12.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting tensorflow==2.12.0
  Downloading tensorflow-2.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow==2.12.0)
  Downloading gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting numpy<1.24,>=1.22 (from tensorflow==2.12.0)
  Downloading numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting tensorboard<2.13,>=2.12 (from te

Name: LibRecommender
Version: 1.5.1
Summary: Versatile end-to-end recommender system.
Home-page: https://github.com/massquantity/LibRecommender
Author: massquantity
Author-email: massquantity <jinxin_madie@163.com>
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: gensim, tqdm
Required-by: 


In [1]:
import kagglehub
import pandas as pd
from zipfile import ZipFile
import tensorflow as tf
import os


path = kagglehub.dataset_download("shuyangli94/food-com-recipes-and-user-interactions")

print("Path to dataset files:", path)


Downloading from https://www.kaggle.com/api/v1/datasets/download/shuyangli94/food-com-recipes-and-user-interactions?dataset_version_number=2...


100%|██████████| 267M/267M [00:04<00:00, 60.2MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/shuyangli94/food-com-recipes-and-user-interactions/versions/2


In [33]:
def updateLabels(interactions_data):
    # Cast the rating labels to plain integers.
    interactions_data["label"] = interactions_data["label"].astype(int)
    return interactions_data


In [34]:
def rename_and_drop_columns(interactions_data):
  # Rename the columns to user, item, label.
  interactions_data.rename(
      columns={"user_id": "user", "recipe_id": "item", "rating": "label"},
      inplace=True
  )
  # Drop every other column in place.
  extra_columns = [c for c in interactions_data.columns if c not in ("user", "item", "label")]
  interactions_data.drop(columns=extra_columns, inplace=True)

  updateLabels(interactions_data)

  return interactions_data

In [35]:
# 2) Combine the existing interaction files, because otherwise an out-of-bounds error occurs
eval_data_path = os.path.join(path, "interactions_validation.csv")
eval_data = pd.read_csv(eval_data_path)


train_data_path = os.path.join(path, "interactions_train.csv")
train_data = pd.read_csv(train_data_path)


test_data_path = os.path.join(path, "interactions_test.csv")
test_data = pd.read_csv(test_data_path)


# The data has to be merged so that it can be filtered and then split again in the same ratio
data = pd.concat([train_data, eval_data, test_data], ignore_index=True)
data = rename_and_drop_columns(data)

In [36]:
all_unique_labels = data["label"].unique()
all_unique_labels

array([5, 4, 3, 1, 0, 2])

In [38]:
# All distinct values in the "label" column and their frequencies
label_counts = data["label"].value_counts()
print("Distinct values in 'label' and their frequencies:")
print(label_counts)


Distinct values in 'label' and their frequencies:
label
5    530417
4    131846
3     27058
0     18000
2      7336
1      3722
Name: count, dtype: int64
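
The counts are heavily skewed toward the top ratings. A quick way to quantify that is to turn the counts into shares; the sketch below hard-codes the counts printed above instead of reading `data`, so it runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
label_counts = pd.Series(
    {5: 530417, 4: 131846, 3: 27058, 0: 18000, 2: 7336, 1: 3722}, name="count"
)

# Share of each rating among all interactions.
label_share = label_counts / label_counts.sum()

# Combined share of 4- and 5-star ratings.
high_rating_share = label_share.loc[[4, 5]].sum()
print(label_share.round(4))
print("share of ratings >= 4:", round(high_rating_share, 4))
```

On the notebook's own frame, `data["label"].value_counts(normalize=True)` should give the same shares directly.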


In [39]:
data.columns

Index(['user', 'item', 'label'], dtype='object')

In [40]:
data.head()

   user  item  label
0  2046  4684      5
1  2046   517      5
2  1773  7435      5
3  1773   278      4
4  2046  3431      5


In [41]:
data['user'][0]

2046

In [42]:
data['item'][0]

4684

In [43]:
data['label'][0]

5

In [263]:
import pandas as pd

threshold = 30

# 1) Keep only items with at least `threshold` interactions:
min_item_interactions = threshold
item_counts = data["item"].value_counts()
items_to_keep = item_counts[item_counts >= min_item_interactions].index

data_filtered = data[data["item"].isin(items_to_keep)]

# 2) Keep only users with at least `threshold` interactions:
min_user_interactions = threshold
user_counts = data_filtered["user"].value_counts()
users_to_keep = user_counts[user_counts >= min_user_interactions].index

data_filtered = data_filtered[data_filtered["user"].isin(users_to_keep)]

# Check the result
print("Dataset before filtering:", data.shape)
print("Dataset after filtering:", data_filtered.shape)
print(data_filtered.head())


Dataset before filtering: (718379, 3)
Dataset after filtering: (71370, 3)
      user   item  label
164  11297   5478      4
245   4470    834      5
300   6357  11365      5
349   6357  11642      5
365   9869   2886      5
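
The filter above runs once: items first, then users. Removing users can push some items back below the threshold, so the result is not guaranteed to satisfy both conditions. A minimal sketch of repeating both passes until nothing changes follows; the helper name `iterative_min_interactions_filter` and the toy frame are made up for illustration and are not part of the notebook.

```python
import pandas as pd


def iterative_min_interactions_filter(df, threshold):
    """Repeat the item and user filters until every remaining item and user
    has at least `threshold` interactions (hypothetical helper)."""
    filtered = df
    while True:
        n_before = len(filtered)
        item_counts = filtered["item"].value_counts()
        filtered = filtered[filtered["item"].isin(item_counts[item_counts >= threshold].index)]
        user_counts = filtered["user"].value_counts()
        filtered = filtered[filtered["user"].isin(user_counts[user_counts >= threshold].index)]
        if len(filtered) == n_before:
            return filtered


# Toy data for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "user": [1, 1, 2, 3, 3, 4, 4, 5, 5],
    "item": [10, 11, 10, 11, 12, 20, 21, 20, 21],
})
toy_filtered = iterative_min_interactions_filter(toy, threshold=2)
print(toy_filtered)
```

In the toy frame a single pass would keep user 1 with items 10 and 11, even though each of those items is left with only one interaction; the repeated pass keeps only users 4 and 5 with items 20 and 21. Whether the stricter filter is worth the extra data loss on the real dataset is a judgment call.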


In [264]:
from libreco.data import random_split, DatasetPure

train_data, eval_data, test_data = random_split(data_filtered, multi_ratios=[0.8, 0.1, 0.1])

train_data, data_info = DatasetPure.build_trainset(train_data)
eval_data = DatasetPure.build_evalset(eval_data)
test_data = DatasetPure.build_testset(test_data)
print(data_info)

n_users: 1136, n_items: 2455, data density: 2.0472 %
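
The reported density is presumably the share of filled cells in the user-item matrix, i.e. observed interactions divided by `n_users * n_items`. Under that assumption the numbers line up: 80 % of 71,370 rows is about 57,096 training interactions, and 1136 × 2455 = 2,788,880 cells gives roughly 2.05 %. The sketch below computes the same quantity on a toy frame; `interaction_density` is a hypothetical helper, not a LibRecommender function.

```python
import pandas as pd


def interaction_density(df):
    """Share of observed (user, item) pairs among all possible pairs."""
    n_users = df["user"].nunique()
    n_items = df["item"].nunique()
    n_pairs = len(df.drop_duplicates(subset=["user", "item"]))
    return n_pairs / (n_users * n_items)


# Toy data: 4 observed pairs out of 3 users * 3 items.
toy = pd.DataFrame({"user": [1, 1, 2, 3], "item": [10, 11, 10, 12]})
density = interaction_density(toy)
print(f"data density: {density:.4%}")
```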


The algorithm (BPR, Bayesian Personalized Ranking) was chosen based on: https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf
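
As background, BPR trains on triples (user, consumed item, sampled unconsumed item) and pushes the score of the consumed item above the score of the sampled one by minimizing `-log(sigmoid(score_pos - score_neg))` plus an L2 penalty. The NumPy sketch below shows that loss for dot-product scores. It illustrates the objective only and makes no claim about how LibRecommender implements it; all names in it are hypothetical.

```python
import numpy as np


def bpr_loss(user_emb, pos_item_emb, neg_item_emb, reg=0.0):
    """Mean BPR loss for a batch of (user, positive item, negative item) triples.

    Each argument has shape (batch, embed_size). Scores are dot products,
    as in matrix factorization.
    """
    pos_scores = np.sum(user_emb * pos_item_emb, axis=1)
    neg_scores = np.sum(user_emb * neg_item_emb, axis=1)
    diff = pos_scores - neg_scores
    # -log(sigmoid(diff)) written as log(1 + exp(-diff)) for numerical stability.
    ranking_loss = np.logaddexp(0.0, -diff)
    l2 = reg * (np.sum(user_emb**2) + np.sum(pos_item_emb**2) + np.sum(neg_item_emb**2))
    return float(np.mean(ranking_loss) + l2 / len(diff))


# Small random embeddings, as right after initialization (illustrative values).
rng = np.random.default_rng(0)
u = rng.normal(scale=0.01, size=(4, 8))
i = rng.normal(scale=0.01, size=(4, 8))
j = rng.normal(scale=0.01, size=(4, 8))
loss_at_init = bpr_loss(u, i, j)
print(loss_at_init, np.log(2))
```

With near-zero embeddings the score difference is close to 0 and the loss sits near ln 2 ≈ 0.693. A training loss that stays close to that value may indicate that the embeddings have moved only slightly from their initialization.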

In [272]:
from libreco.algorithms import BPR

tf.compat.v1.reset_default_graph()

# Initialize the model
model = BPR(
    task="ranking",
    data_info=data_info,
    loss_type="bpr",
    embed_size=256,
    n_epochs=5,
    lr=5e-5,  # learning rate
    batch_size=1024,
    num_neg=5,  # more negative samples for better ranking
    reg=5e-6,  # regularization
    sampler="random"  # negative sampling strategy
)

# Train the model while monitoring the metrics
model.fit(
    train_data,
    neg_sampling=True,  # enable negative sampling
    shuffle=True,
    verbose=2,  # detailed training output
    eval_data=eval_data,  # validation data
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"]  # track the relevant metrics
)

Training start time: 2025-01-26 13:48:15


train: 100%|██████████| 280/280 [00:03<00:00, 90.07it/s]


Epoch 1 elapsed: 3.118s
	 train_loss: 0.6927


eval_pointwise: 100%|██████████| 6/6 [00:00<00:00, 106.82it/s]
eval_listwise: 100%|██████████| 371/371 [00:01<00:00, 227.78it/s]


	 eval log_loss: 0.6930
	 eval roc_auc: 0.5216
	 eval precision@10: 0.0048
	 eval recall@10: 0.0076
	 eval ndcg@10: 0.0221


train: 100%|██████████| 280/280 [00:02<00:00, 120.33it/s]


Epoch 2 elapsed: 2.337s
	 train_loss: 0.6913


eval_pointwise: 100%|██████████| 6/6 [00:00<00:00, 198.73it/s]
eval_listwise: 100%|██████████| 371/371 [00:01<00:00, 349.98it/s]


	 eval log_loss: 0.6926
	 eval roc_auc: 0.5358
	 eval precision@10: 0.0058
	 eval recall@10: 0.0091
	 eval ndcg@10: 0.0281


train: 100%|██████████| 280/280 [00:01<00:00, 163.88it/s]


Epoch 3 elapsed: 1.722s
	 train_loss: 0.6898


eval_pointwise: 100%|██████████| 6/6 [00:00<00:00, 222.12it/s]
eval_listwise: 100%|██████████| 371/371 [00:00<00:00, 797.08it/s]


	 eval log_loss: 0.6923
	 eval roc_auc: 0.5492
	 eval precision@10: 0.0074
	 eval recall@10: 0.0119
	 eval ndcg@10: 0.0353


train: 100%|██████████| 280/280 [00:01<00:00, 183.55it/s]


Epoch 4 elapsed: 1.534s
	 train_loss: 0.6884


eval_pointwise: 100%|██████████| 6/6 [00:00<00:00, 245.09it/s]
eval_listwise: 100%|██████████| 371/371 [00:00<00:00, 777.58it/s]


	 eval log_loss: 0.6919
	 eval roc_auc: 0.5614
	 eval precision@10: 0.0095
	 eval recall@10: 0.0159
	 eval ndcg@10: 0.0443


train: 100%|██████████| 280/280 [00:01<00:00, 187.61it/s]


Epoch 5 elapsed: 1.508s
	 train_loss: 0.6868


eval_pointwise: 100%|██████████| 6/6 [00:00<00:00, 243.36it/s]
eval_listwise: 100%|██████████| 371/371 [00:00<00:00, 760.34it/s]

	 eval log_loss: 0.6916
	 eval roc_auc: 0.5722
	 eval precision@10: 0.0107
	 eval recall@10: 0.0182
	 eval ndcg@10: 0.0483





In [244]:
from libreco.evaluation import evaluate

evaluate(
     model=model,
     data=test_data,
     neg_sampling=True,  # perform negative sampling on test data
     metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
 )

eval_pointwise: 100%|██████████| 6/6 [00:00<00:00, 233.31it/s]
eval_listwise: 100%|██████████| 373/373 [00:00<00:00, 714.22it/s]


{'loss': 0.6742145023770763,
 'roc_auc': 0.6560994023786628,
 'precision': 0.024754244861483466,
 'recall': 0.042958019919234744,
 'ndcg': 0.10863608108029306}
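
For reference, precision, recall and NDCG are top-k ranking metrics (reported at k = 10 in the training log). The sketch below shows the common binary-relevance definitions for a single user; the per-user values are then usually averaged. LibRecommender's exact implementation may differ in details, and the function name and example ids are hypothetical.

```python
import math


def precision_recall_ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance precision@k, recall@k and NDCG@k for a single user."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = [1 if item in relevant else 0 for item in top_k]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # Discounted gain: a hit at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(hit / math.log2(rank + 2) for rank, hit in enumerate(hits))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg


# Made-up example: 2 of the user's 4 held-out items appear in the top 5.
p, r, n = precision_recall_ndcg_at_k(
    recommended=[7, 3, 9, 1, 5], relevant=[3, 5, 8, 2], k=5
)
print(p, r, n)
```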

In [99]:
def drop_columns_not_name_and_id(name_df):
  # Drop every column except the recipe name and id (modifies name_df in place).
  extra_columns = [c for c in name_df.columns if c not in ("name", "id")]
  name_df.drop(columns=extra_columns, inplace=True)
  return name_df

# Load RAW_recipes.csv
raw_recipes_csv_path = os.path.join(path, "RAW_recipes.csv")
name_df = pd.read_csv(raw_recipes_csv_path)
drop_columns_not_name_and_id(name_df)

def getName(recipe_id):
  # Look up the recipe name for a recipe id; raises IndexError if the id is unknown.
  name = name_df.loc[name_df['id'] == recipe_id, 'name']
  return name.values[0]

In [100]:
name_df.head()

                                       name      id
0  arriba baked winter squash mexican style  137739
1           a bit different breakfast pizza   31490
2                  all in the kitchen chili  112140
3                         alouette potatoes   59389
4          amish tomato ketchup for canning   44061
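
`getName` scans the whole frame on every call and fails for an id that is not in the table. One possible alternative is to build a dictionary once and fall back to a placeholder for unknown ids. The helper names below are hypothetical, and the two sample rows are copied from the preview above so the sketch runs without the dataset.

```python
import pandas as pd


def build_name_lookup(name_df):
    """Map recipe id -> recipe name once, so later lookups are dictionary reads."""
    return dict(zip(name_df["id"], name_df["name"]))


def get_name_safe(lookup, recipe_id, default="<unknown recipe>"):
    """Return the recipe name, or `default` when the id is not in the table."""
    return lookup.get(recipe_id, default)


# Two rows copied from the RAW_recipes.csv preview, for illustration.
sample_names = pd.DataFrame({
    "name": ["arriba baked winter squash mexican style", "a bit different breakfast pizza"],
    "id": [137739, 31490],
})
lookup = build_name_lookup(sample_names)
print(get_name_safe(lookup, 31490))
print(get_name_safe(lookup, -1))
```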


User examples: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions?select=interactions_test.csv

In [101]:
user_list = data_filtered["user"].unique()
len(user_list)

3851

In [102]:
user_list

array([      2312,       7802,       6836, ..., 2001330613, 2001453193,
       2001102678])

In [273]:
def get_recommendation(user):
    recommendations_dict = model.recommend_user(user=user,filter_consumed=True, n_rec=7) #,random_rec=True
    print(recommendations_dict)

    # The result is something like {4460: [id1, id2, ..., id7]}
    recommended_ids = recommendations_dict[user]

    for recipe_id in recommended_ids:
        print(getName(recipe_id))

get_recommendation(4470)

{4470: array([ 54257, 150863,  95569,  27208,  22176,  69173,  57130])}
yes  virginia there is a great meatloaf
panera s cream cheese potato soup
easy and tasty barbecue chicken sandwiches in the crock pot
to die for crock pot roast
classic baked ziti
kittencal s italian melt in your mouth meatballs
awesome slow cooker pot roast


In [274]:
get_recommendation(6357)

{6357: array([ 27208,  43023,   8701, 150863,  63689, 152441,  30951])}
to die for crock pot roast
creamy garlic penne pasta
should be illegal oven bbq ribs
panera s cream cheese potato soup
my family s favorite sloppy joes  pizza joes
24k carrots
green beans with cherry tomatoes


In [275]:
get_recommendation(2000431901)

{2000431901: array([152441,  68955, 106251,  48760, 150863,  33671,  57130])}
24k carrots
japanese mum s chicken
roasted cauliflower   16 roasted cloves of garlic
szechuan noodles with spicy beef sauce
panera s cream cheese potato soup
crock pot whole chicken
awesome slow cooker pot roast


In [277]:
get_recommendation(2001362355)

{2001362355: array([ 54257,  43023, 108105,  87782, 125399,  54715,  25885])}
yes  virginia there is a great meatloaf
creamy garlic penne pasta
thai chicken breasts
greek potatoes  oven roasted and delicious
french toast sticks   oamc
sticky pork chops
banana banana bread
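
Several recipes recur across the four lists; recipe 150863 ("panera s cream cheese potato soup"), for example, shows up for three of the four users. That can be a hint that the model leans toward popular items, although four users are far too few to conclude anything. A small sketch for quantifying the overlap, using the ids printed above:

```python
from collections import Counter
from itertools import combinations

# Recommended recipe ids copied from the four outputs above.
recommendations = {
    4470: [54257, 150863, 95569, 27208, 22176, 69173, 57130],
    6357: [27208, 43023, 8701, 150863, 63689, 152441, 30951],
    2000431901: [152441, 68955, 106251, 48760, 150863, 33671, 57130],
    2001362355: [54257, 43023, 108105, 87782, 125399, 54715, 25885],
}

# Number of lists each recipe appears in, restricted to repeated recipes.
appearances = Counter(rid for recs in recommendations.values() for rid in recs)
repeated = {rid: n for rid, n in appearances.items() if n > 1}

# Jaccard overlap (shared ids / all ids) between each pair of users' lists.
jaccard = {
    (a, b): len(set(recommendations[a]) & set(recommendations[b]))
    / len(set(recommendations[a]) | set(recommendations[b]))
    for a, b in combinations(recommendations, 2)
}

print(repeated)
for pair, value in jaccard.items():
    print(pair, round(value, 3))
```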
