# Food Recommendation System Project

This project aims to build a recommender system using the Food.com Recipes and Reviews dataset. The dataset contains user reviews. 
This file aims :
- To get the dataset from Kaggle: [Food.com - Recipes and Reviews](https://www.kaggle.com/datasets/irkaal/foodcom-recipes-and-reviews) and save it locally
- Generate a folder "dataset" and subfiles with the whole dataset, train and test datasets after splitting them using `Cornac`

In [1]:
import kagglehub
import os
import pandas as pd
import shutil

path = kagglehub.dataset_download("irkaal/foodcom-recipes-and-reviews")

print("Path to dataset files:", path)

  from .autonotebook import tqdm as notebook_tqdm


Path to dataset files: /home/r-one/.cache/kagglehub/datasets/irkaal/foodcom-recipes-and-reviews/versions/2


In [2]:
dataset_path = path
print('Files in dataset:', os.listdir(dataset_path))

Files in dataset: ['reviews.csv', 'recipes.parquet', 'reviews.parquet', 'recipes.csv']


In [3]:
data_folder = os.path.join(os.getcwd(), "data")
os.makedirs(data_folder, exist_ok=True)

# Copy all files from dataset_path to "data" folder
for filename in os.listdir(dataset_path):
    src = os.path.join(dataset_path, filename)
    dst = os.path.join(data_folder, filename)
    if os.path.isfile(src):
        shutil.copy2(src, dst)

print('Files in dataset:', os.listdir(dataset_path))
print('Files in data folder:', os.listdir(data_folder))

Files in dataset: ['reviews.csv', 'recipes.parquet', 'reviews.parquet', 'recipes.csv']
Files in data folder: ['reviews.csv', 'clean_reviews.csv', 'recipes.parquet', 'reviews.parquet', 'recipes.csv']


In [4]:
recipes_path = os.path.join(data_folder, "recipes.csv")
reviews_path = os.path.join(data_folder, "reviews.csv")

print("Recipes file path:", recipes_path)
print("Reviews file path:", reviews_path)

Recipes file path: /home/r-one/Documents/epita/recommender_system/Recommandation-Benchmark-for-Recipes/data/recipes.csv
Reviews file path: /home/r-one/Documents/epita/recommender_system/Recommandation-Benchmark-for-Recipes/data/reviews.csv


## Clean dataset

As seen in the `data_analysis` file, some fields are not useful for our recommandation task and is limitating our application.

This is why we are going to clean our datasets to generate `clean_recipes.csv` (resp. `clean_reviews.csv`) to be handled by Cornac.

In [5]:
reviews = pd.read_csv(reviews_path)
reviews.head()

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified
0,2,992,2008,gayg msft,5,better than any you can get at a restaurant!,2000-01-25T21:44:00Z,2000-01-25T21:44:00Z
1,7,4384,1634,Bill Hilbrich,4,"I cut back on the mayo, and made up the differ...",2001-10-17T16:49:59Z,2001-10-17T16:49:59Z
2,9,4523,2046,Gay Gilmore ckpt,2,i think i did something wrong because i could ...,2000-02-25T09:00:00Z,2000-02-25T09:00:00Z
3,13,7435,1773,Malarkey Test,5,easily the best i have ever had. juicy flavor...,2000-03-13T21:15:00Z,2000-03-13T21:15:00Z
4,14,44,2085,Tony Small,5,An excellent dish.,2000-03-28T12:51:00Z,2000-03-28T12:51:00Z


We check there are no rows with NaN in Rating field

In [6]:
print("Number of reviews before dropping NaN:", len(reviews))
reviews = reviews.dropna(subset=['Rating'])
print("Number of reviews after dropping NaN:", len(reviews))

Number of reviews before dropping NaN: 1401982
Number of reviews after dropping NaN: 1401982


We drop rows with missing AuthorId, RecipeId, or Rating as it cancel Reader from Cornac to work.


In [7]:
print("Number of reviews before dropping missing values:", len(reviews))
reviews = reviews.dropna(subset=['AuthorId', 'RecipeId', 'Rating'])
print("Number of reviews after dropping missing values:", len(reviews))

Number of reviews before dropping missing values: 1401982
Number of reviews after dropping missing values: 1401982


We drop useless columns as `DateSubmitted`, `DateModified`, `Review` and `AuthorName`

In [8]:
print("Number of columns in reviews:", len(reviews.columns))
reviews = reviews.drop(columns=['AuthorName', 'DateSubmitted', 'DateModified', 'Review'])
print("Number of columns in reviews after dropping some:", len(reviews.columns))


Number of columns in reviews: 8
Number of columns in reviews after dropping some: 4


We save this new clean reviews dataset

In [9]:
clean_reviews_path = os.path.join(data_folder, "clean_reviews.csv")
reviews.to_csv(clean_reviews_path, index=False)
print(f"Clean reviews saved to {clean_reviews_path}")

Clean reviews saved to /home/r-one/Documents/epita/recommender_system/Recommandation-Benchmark-for-Recipes/data/clean_reviews.csv


## Test and split data with `Cornac`

For this we are going to use:
- Reader
- Dataset
- StratifiedSplit

In [20]:
import cornac
from cornac.data import Reader, Dataset
from cornac.eval_methods import StratifiedSplit, RatioSplit

reader = Reader()
data = reader.read(
    clean_reviews_path,
    sep=',',
    skip_lines=1,
    user_col='AuthorId',
    item_col='RecipeId',
    rating_col='Rating'
)

In [12]:
dataset = Dataset.from_uir(data, seed=42)

Je n'arrive pas à faire fonctionner StratifiedSplit

In [30]:
from cornac.eval_methods import RatioSplit

# Create a RatioSplit object for random splitting
ratio_split = RatioSplit(
    data=data,
    test_size=0.2,         
    rating_threshold=2.0,
    exclude_unknowns=False,
    seed=42                
)

print(ratio_split)
print("Number of train ratings:", len(ratio_split.train_set))
print("Number of test ratings:", len(ratio_split.test_set))

<cornac.eval_methods.ratio_split.RatioSplit object at 0x7f757f006050>


TypeError: object of type 'Dataset' has no len()

Je n'arrive pas à faire fonctionner StratifiedSplit

In [18]:
stratified_split = StratifiedSplit(
    data=data,
    test_size=0.2,
    rating_threshold=3.0,
    exclude_unknowns=True,
    seed=42
)

ValueError: val_size + test_size (1) should be smaller than data_size=1

In [31]:
import numpy as np

def split_train_test(df, test_size=0.2, random_state=42, train_path=None, test_path=None):
    np.random.seed(random_state)
    shuffled_indices = np.random.permutation(len(df))
    test_set_size = int(len(df) * test_size)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    train_df = df.iloc[train_indices]
    test_df = df.iloc[test_indices]
    if train_path is not None:
        train_df.to_csv(train_path, index=False)
        print(f"Train set saved to {train_path} ({len(train_df)} rows)")
    if test_path is not None:
        test_df.to_csv(test_path, index=False)
        print(f"Test set saved to {test_path} ({len(test_df)} rows)")
    return train_df, test_df

train_path = os.path.join(data_folder, "train_reviews.csv")
test_path = os.path.join(data_folder, "test_reviews.csv")
train_df, test_df = split_train_test(reviews, test_size=0.2, random_state=42, train_path=train_path, test_path=test_path)

Train set saved to /home/r-one/Documents/epita/recommender_system/Recommandation-Benchmark-for-Recipes/data/train_reviews.csv (1121586 rows)
Test set saved to /home/r-one/Documents/epita/recommender_system/Recommandation-Benchmark-for-Recipes/data/test_reviews.csv (280396 rows)
Test set saved to /home/r-one/Documents/epita/recommender_system/Recommandation-Benchmark-for-Recipes/data/test_reviews.csv (280396 rows)


In [34]:
print("Size of train set:", len(train_df))
print("Size of test set:", len(test_df))
print("Size of total dataset:", len(reviews), " == Size of train set + Size of test set:", len(train_df) + len(test_df))

Size of train set: 1121586
Size of test set: 280396
Size of total dataset: 1401982  == Size of train set + Size of test set: 1401982
