# Food.com Dataset Collector

Source: https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions

In [15]:
import pandas as pd

## Exploratory dataset analysis

This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen).

The dataset provides two files, RAW_recipes and RAW_interactions, which contain respectively the recipes with all their information and the reviews provided by different users.

In [16]:
recipes_path = '../provided_datasets/food.com/RAW_recipes.csv'
reviews_path = '../provided_datasets/food.com/RAW_interactions.csv'

The first dataset analyzed is the one that contains all the recipe information. As the table below shows, a recipe is composed of:
- Name
- Id
- Minutes (Time of cooking)
- Contributor Id
- Date of submission
- Tags
- Nutrition information
- Number of steps for cooking
- Description of the cooking steps
- Description of the recipe
- Ingredients used in the recipe
- Number of ingredients used in the recipe

In [17]:
recipes_df = pd.read_csv(recipes_path)
recipes_df[:2]

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
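
The preview suggests that the `tags`, `nutrition`, `steps` and `ingredients` columns are stored in the CSV as strings that look like Python lists. The following is a minimal sketch, assuming that format holds for every row, of converting such columns into real lists with `ast.literal_eval`. The two-row `sample_df` is made-up illustrative data, not rows taken from the dataset.

```python
import ast

import pandas as pd

# Hypothetical sample mimicking the string-encoded list columns of RAW_recipes.csv
sample_df = pd.DataFrame({
    'name': ['recipe a', 'recipe b'],
    'nutrition': ['[51.5, 0.0, 13.0]', '[173.4, 18.0, 0.0]'],
    'ingredients': ["['winter squash', 'honey']", "['eggs', 'milk', 'cheese']"],
})

list_columns = ['nutrition', 'ingredients']
for column in list_columns:
    # literal_eval parses a Python literal without executing arbitrary code
    sample_df[column] = sample_df[column].apply(ast.literal_eval)

# Each cell now holds a list, so list operations such as len() work per row
print(type(sample_df.loc[0, 'ingredients']))
print(sample_df['ingredients'].str.len().tolist())
```

On the real data the same loop could be applied to `recipes_df`, provided no cell in those columns is malformed.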


The information below gives an overview of the complete dataset: the recipes dataset contains 231637 recipes, with 230186 distinct names. The remaining 1451 recipes share their name with another recipe and differ from it in ingredients and cooking steps. Note that the code counts distinct names only; it does not compare ingredients or steps.

In [18]:
print('Number of recipes in the dataset:', recipes_df.shape[0])
print('Number of unique recipes in the dataset:', len(recipes_df['name'].unique()))
print('Number of recipes with same name but different ingredients:', recipes_df.shape[0] - len(recipes_df['name'].unique()))
print('Number of NaN values in the dataset:', recipes_df.isna().sum().sum())

Number of recipes in the dataset: 231637
Number of unique recipes in the dataset: 230186
Number of recipes with same name but different ingredients: 1451
Number of NaN values in the dataset: 4980
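
The counts above compare names only. One possible way to check how rows with a repeated name actually differ is to select them with `DataFrame.duplicated` and then test whether their `ingredients` and `steps` repeat as well. The toy `sample_df` below is hypothetical data standing in for `recipes_df`.

```python
import pandas as pd

# Hypothetical stand-in for recipes_df: two recipes share the name 'apple pie'
sample_df = pd.DataFrame({
    'name': ['apple pie', 'apple pie', 'banana bread'],
    'ingredients': ["['apples', 'flour']", "['apples', 'butter']", "['bananas']"],
    'steps': ["['bake']", "['mix', 'bake']", "['bake']"],
})

# keep=False marks every row whose name occurs more than once
same_name = sample_df[sample_df.duplicated(subset=['name'], keep=False)]

# Rows that also repeat ingredients and steps would be exact copies
exact_copies = same_name.duplicated(subset=['name', 'ingredients', 'steps']).sum()

print('Rows sharing a name:', len(same_name))
print('Exact copies among them:', exact_copies)
```

If the second count is zero on the real data, every repeated name corresponds to a recipe with different ingredients or steps.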


The dataset also contains null values (4980), so it is necessary to find out which of the columns described above hold them. The per-column counts below show that almost all of the NaN values are in the "description" column; only one of the 4980 is in the "name" column.

In [19]:
recipes_df.isna().sum()

name                 1
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64

Since the goal of this project is a sentiment analysis of the recipe reviews, the NaN values in the recipe description column can be ignored and replaced with an empty string. The single row with a NaN value in the name column, on the other hand, is dropped, because the name is an important feature of the recipe.

In [20]:
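# Drop the single recipe without a name
recipes_df = recipes_df.dropna(subset=['name'])
# Only 'description' still holds NaN values; replace them with an empty string
recipes_df = recipes_df.fillna({'description': ''})
recipes_df.isna().sum()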

name              0
id                0
minutes           0
contributor_id    0
submitted         0
tags              0
nutrition         0
n_steps           0
steps             0
description       0
ingredients       0
n_ingredients     0
dtype: int64

To simplify the learning process, the cleaned recipes dataset is saved to a CSV file so that the model can load it later.

In [21]:
recipes_df.to_csv('../exported_datasets/food.com/recipes.csv', index=False, header=True)
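
One caveat when loading the exported file again: `pd.read_csv` treats empty fields as NaN by default, so descriptions that were replaced with an empty string come back as NaN. The sketch below shows this round trip on a small made-up frame, using an in-memory buffer instead of the real file, and one way to keep the empty strings with `keep_default_na=False`. Whether that option suits the downstream model code is an assumption.

```python
import io

import pandas as pd

# Hypothetical frame with one empty description, as produced by the cleaning step
sample_df = pd.DataFrame({'name': ['soup', 'salad'], 'description': ['tasty', '']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False, header=True)

# Default behaviour: the empty field is read back as NaN
buffer.seek(0)
default_df = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings
buffer.seek(0)
kept_df = pd.read_csv(buffer, keep_default_na=False)

print(default_df['description'].isna().sum())
print(kept_df['description'].isna().sum())
```

An alternative is to load the file normally and call `fillna` again on the description column.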

The other dataset analyzed is the one that contains all the review information. As the table below shows, a review is composed of:
- Id of the user who wrote the review
- Id of the recipe related to the review
- Date of the review
- Rating given to the recipe
- Review text

In [22]:
reviews_df = pd.read_csv(reviews_path)
reviews_df.head()

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for...
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall..."
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not...
3,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunk...
4,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprin..."


The information below gives an overview of the complete dataset: the reviews dataset contains 1132367 reviews, written by 226570 unique users and covering 231637 recipes.

In [23]:
print('Number of reviews in the dataset:', reviews_df.shape[0])
print('Number of unique users reviews:', reviews_df['user_id'].nunique())
print('Number of unique recipe reviews:', reviews_df['recipe_id'].nunique())
print('Number of NaN values in the dataset:', reviews_df.isna().sum().sum())

Number of reviews in the dataset: 1132367
Number of unique users reviews: 226570
Number of unique recipe reviews: 231637
Number of NaN values in the dataset: 169
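
The number of reviewed recipes equals the number of rows in the raw recipes table, so dropping a recipe during cleaning may leave some reviews pointing at a recipe that is no longer present. The following is a sketch of that consistency check with `Series.isin`, using small hypothetical frames in place of `recipes_df` and `reviews_df`.

```python
import pandas as pd

# Hypothetical stand-ins: recipe 3 has been removed from the recipes table
recipes_sample = pd.DataFrame({'id': [1, 2], 'name': ['soup', 'salad']})
reviews_sample = pd.DataFrame({
    'user_id': [10, 11, 12],
    'recipe_id': [1, 2, 3],
    'rating': [5, 4, 3],
})

# True for reviews whose recipe_id still exists in the recipes table
has_recipe = reviews_sample['recipe_id'].isin(recipes_sample['id'])

# Reviews that refer to a missing recipe
orphan_reviews = reviews_sample[~has_recipe]

print('Reviews without a matching recipe:', len(orphan_reviews))
```

Whether such reviews should be kept depends on whether the model needs the recipe fields alongside the review text.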


The dataset also contains null values (169), so it is necessary to find out which of the columns described above hold them. The per-column counts below show that all the NaN values are in the review column. Since the review text is essential for this project, the reviews without text are dropped.

In [24]:
reviews_df.isna().sum()

user_id        0
recipe_id      0
date           0
rating         0
review       169
dtype: int64

In [25]:
# Only 'review' holds NaN values, so drop the rows where it is missing
reviews_df = reviews_df.dropna(subset=['review'])
reviews_df.isna().sum()

user_id      0
recipe_id    0
date         0
rating       0
review       0
dtype: int64

As with the recipes, the cleaned reviews dataset is saved to a CSV file so that the model can load it later.

In [26]:
reviews_df.to_csv('../exported_datasets/food.com/reviews.csv', index=False, header=True)

In [30]:
reviews_df['review'].iloc[100]

"The last time I tried a bulgur / pumpkin combination, I was disappointed -- even though I like both ingredients -- so I was wary of trying this recipe.  But I am so glad I did, because it was wonderful!  I skipped the oil altogether and didn't miss it.  I also used a buttercup squash instead of pumpkin, and crushed bay leaves since that was all I had; and consomme instead of the vegetable broth.  The curry powder lent such a pretty color to the dish but the flavor wasn't at all overpowering.  Thanks for posting!"