<a href="https://colab.research.google.com/github/BehzadBarati/Ingredient-Maps/blob/main/Food_Recipes_RecipeNLG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Abstract:

This notebook produces elementary reports on the RecipeNLG dataset, which contains more than 2 million recipes.
___
Source:

My main references are the [RecipeNLG paper](https://www.aclweb.org/anthology/2020.inlg-1.4.pdf) and its [dataset](https://recipenlg.cs.put.poznan.pl).
___
Input: 

1- Dataset of [RecipeNLG](https://recipenlg.cs.put.poznan.pl)

Output:

1- EDA report on the RecipeNLG dataset (both inline and as the "EDA-Report-RecipeNLG.html" file)

2- Word cloud pictures (inline)

3- List of the source websites of the recipes ("Websites-RecipeNLG.csv" file)
___
Hints:

1_ Because the CSV file is larger than 2 gigabytes, I prefer to use a cloud service (here, Google Colab). I uploaded the RecipeNLG dataset to my [Google Drive](https://drive.google.com/drive/folders/1g1ZNYKlLN4hyP8ywHXWa2Iu1oQ4wxSgR?usp=sharing). It is public.

2_ If "ProfileReport" fails with an out-of-memory error, first reinstall the latest version of the "pandas_profiling" library, then pass the "minimal=True" argument to "ProfileReport" to skip some of the calculations. (pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip)

# Import needed libraries

In [1]:
# Install pandas_profiling library
# pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

import numpy as np
import pandas as pd 
from wordcloud import WordCloud             # Make wordcloud pictures
from pandas_profiling import ProfileReport  # Generate brief report on our dataframe
import matplotlib.pyplot as plt
from google.colab import drive              # Mount google drive to colab notebook
import ast                                  # Convert string to list

# Load data

In [2]:
# Mount google drive to colab notebook
# Our dataset will be read as recipe_table.

drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [3]:
cd gdrive/MyDrive/Projects/Ingredient-Maps/Phase1

/content/gdrive/MyDrive/Projects/Ingredient-Maps/Phase1


In [4]:
# Read the file and check that the data is loaded

recipe_table = pd.read_csv('./dataset/RecipeNLG.csv')
print('Number of recipes in dataset: ', len(recipe_table))
print('last 5 recipes:')
recipe_table.tail(5)

Number of recipes in dataset:  2231142
last 5 recipes:


,Unnamed: 0,title,ingredients,directions,link,source,NER
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."
2231141,2231141,Polpette in Spicy Tomato Sauce,"[""1 pound ground veal"", ""1/2 pound sweet Itali...","[""Preheat the oven to 350."", ""In a bowl, mix t...",www.foodandwine.com/recipes/polpette-spicy-tom...,Recipes1M,"[""ground veal"", ""sausage"", ""bread crumbs"", ""mi..."
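
Because the CSV is larger than 2 GB, simple aggregates can also be computed without holding the whole table in memory. The sketch below is one possible approach and is not part of the original notebook: it reads only the `source` column in chunks through the `usecols` and `chunksize` arguments of `pd.read_csv` and tallies the recipes per source. The helper name `count_by_source`, the chunk size, and the tiny in-memory CSV are assumptions made for illustration.

```python
import io
from collections import Counter

import pandas as pd


def count_by_source(csv_file, chunksize=100_000):
    """Count recipes per 'source' value while reading the CSV chunk by chunk."""
    counts = Counter()
    # usecols keeps only one column in memory; chunksize yields DataFrames piece by piece
    for chunk in pd.read_csv(csv_file, usecols=["source"], chunksize=chunksize):
        counts.update(chunk["source"].value_counts().to_dict())
    return counts


# Made-up stand-in for './dataset/RecipeNLG.csv'
demo = io.StringIO("title,source\nA,Gathered\nB,Recipes1M\nC,Gathered\n")
counts = count_by_source(demo, chunksize=2)
print(counts)
```

With the real file, the first argument would be the CSV path used in the cell above.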


# EDA (Exploratory Data Analysis)
I don't want to regenerate the reports on every run, so the EDA, word cloud, and website list cells are muted (wrapped in triple quotes).

In [None]:
'''
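# The 'Unnamed: 0' column seems useless for the EDA section, so I drop it to make the dataset smaller.
# Note: the SQL section below renames this column to recipe_ID, so reload the data before running it if this cell is un-muted.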

recipe_table.drop('Unnamed: 0', axis='columns', inplace=True)
'''

In [None]:
'''
# Generate a quick report from our dataset 

profile = ProfileReport(recipe_table, minimal=True)
profile.to_file("EDA-Report-RecipeNLG.html")
profile
'''

## Word clouds

In [None]:
'''
# To create word clouds, I use the WordCloud library imported above.

def minimal_wordcloud(df, column):
    """
    Generate a simple wordcloud similar to: 
    https://www.kaggle.com/paultimothymooney/explore-recipe-nlg-dataset/data.
    Required imports: WordCloud (from wordcloud) and matplotlib.pyplot as plt.
    """
    text = str(df[column].values)
    wordcloud = WordCloud().generate(text)
    image = wordcloud.to_image()
    plt.axis("off")
    plt.imshow(image)
    plt.show()
'''
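
One caveat about `str(df[column].values)`: with default print options, NumPy abbreviates the string form of arrays that have more than 1000 elements, so for a column this long the text passed to `WordCloud` contains only the first and last few values. A minimal sketch of one alternative follows, assuming a random sample of the column is representative enough; the helper name `sample_text`, the sample size, and the seed are illustrative choices, not part of the original notebook.

```python
import random


def sample_text(values, max_items=10_000, seed=0):
    """Join up to max_items randomly chosen values into one space-separated string."""
    items = [str(v) for v in values]
    if len(items) > max_items:
        # Fixed seed so repeated runs draw the same sample
        items = random.Random(seed).sample(items, max_items)
    return " ".join(items)


# Made-up stand-in for df[column]
titles = ["Devil Eggs", "Creamy Corn", "Chicken Funny"]
text = sample_text(titles)
print(text)
```

The resulting `text` could then be passed to `WordCloud().generate(text)` in place of the abbreviated string.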

In [None]:
'''
# Print word clouds

for c in recipe_table.columns:
    print('\nword cloud of contents in column {}'.format(c))
    minimal_wordcloud(recipe_table, c)
'''

## List of websites in RecipeNLG

In [None]:
'''
# Extract website names from the link column.
# This function picks the website name whether or not the link starts with 'www'

func = lambda x: x[1] if x[0] == 'www' else x[0]
recipe_table['website'] = recipe_table['link'].str.split('.').apply(func)
'''

In [None]:
'''
# Count recipes per website once, save the counts, then display them
website_counts = recipe_table['website'].value_counts().rename_axis('websites')
website_counts.to_csv('./reports/Websites-RecipeNLG.csv', header=['No. of recipes'])
website_counts
'''
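
The split-on-dots rule above assumes that every link starts directly with the host name. As a hedged alternative, the sketch below parses the host with `urllib.parse.urlparse` from the standard library, which also copes with links that carry a scheme such as `https://`. The function name `website_name` is hypothetical, and the sample links are taken from the table output shown earlier (with the truncated one shortened further).

```python
from urllib.parse import urlparse


def website_name(link: str) -> str:
    """Return the first host label of a link, skipping a leading 'www'."""
    # Dataset links have no scheme; '//' makes urlparse treat the first part as the host
    if "://" not in link:
        link = "//" + link
    host = urlparse(link).hostname or ""
    parts = host.split(".")
    if parts and parts[0] == "www":
        parts = parts[1:]
    return parts[0] if parts else ""


for link in ["www.foodnetwork.com/recipes/sunny-anderson",
             "cookpad.com/us/recipes/355411-devil-eggs",
             "cooking.nytimes.com/recipes/1015164"]:
    print(link, "->", website_name(link))
```

Like the lambda above, this keeps only the first label after `www`, so `cooking.nytimes.com` is still reported as `cooking`.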

# Convert to SQL tables

## Set main key for recipe_table

Since some meals have more than one recipe, titles cannot identify a recipe uniquely, so we use the index of the main table as the recipe_ID.

In [11]:
# Rename exactly the 'Unnamed: 0' column (a substring replace could also touch other column names)
recipe_table = recipe_table.rename(columns={'Unnamed: 0': 'recipe_ID'})

In [12]:
recipe_table['recipe_ID'] = "Rec" + recipe_table['recipe_ID'].astype(str)

## NER_table

In [13]:
# Values in the NER column are stored as strings; first convert them to lists

ast_func = lambda a: ast.literal_eval(a)
recipe_table['NER'] = recipe_table['NER'].map(ast_func)
# Create NER_table with one row per unique NER name
NER_table = pd.DataFrame(recipe_table['NER'].explode().unique(), columns=['NER'])
# Add the NER_ID column
NER_table['NER_ID'] = ['NER'+str(i) for i in range(len(NER_table['NER']))]

In [14]:
# With the help of the conv dictionary, replace the names in recipe_table['NER'] with IDs
# Set the index to NER and build a dictionary from the DataFrame

NER_table.set_index(['NER'], inplace=True)
conv = NER_table.to_dict('dict')
recipe_table['NER'] = recipe_table['NER'].apply(lambda row: [conv['NER_ID'][v] for v in row if conv['NER_ID'].get(v)])
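
The name-to-ID conversion in the two cells above can be followed on a tiny example. The sketch below mirrors the same steps in plain Python: parse each string with `ast.literal_eval`, assign IDs in first-seen order (the order that `explode().unique()` also produces), then replace names with IDs. The sample strings and the variable names `raw_ner`, `name_to_id`, and `encoded` are made up for illustration.

```python
import ast

# Made-up sample in the same format as the NER column (lists stored as strings)
raw_ner = ['["eggs", "paprika", "salt"]', '["salt", "butter"]']

# Step 1: convert each string into a real Python list
parsed = [ast.literal_eval(s) for s in raw_ner]

# Step 2: give every distinct name an ID in first-seen order
name_to_id = {}
for names in parsed:
    for name in names:
        name_to_id.setdefault(name, f"NER{len(name_to_id)}")

# Step 3: replace the names in each recipe with their IDs
encoded = [[name_to_id[n] for n in names] for names in parsed]

print(name_to_id)
print(encoded)
```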

In [15]:
# Add a column of empty lists to hold the recipe_IDs

NER_table['recipe_ID'] = np.empty((len(NER_table), 0)).tolist()
NER_table = NER_table.reset_index().set_index('NER_ID')

In [16]:
# Fill the recipe_ID column of NER_table so that each NER lists all recipe_IDs that use it. (~5 minutes in Colab)

for k in range(len(recipe_table)):
    for m in recipe_table['NER'][k]:
        NER_table['recipe_ID'][m].append(recipe_table['recipe_ID'][k])
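
The same inverted index can also be built without an explicit Python loop. The following is a sketch of one alternative, not the approach used in this notebook: `DataFrame.explode` produces one row per (recipe_ID, NER_ID) pair, and `groupby(...).agg(list)` collects the recipe IDs for each NER_ID. The small `recipes` frame and the names `pairs` and `ner_to_recipes` are assumptions for illustration; no timing on the full dataset is claimed.

```python
import pandas as pd

# Made-up miniature of recipe_table after the names were replaced with IDs
recipes = pd.DataFrame({
    "recipe_ID": ["Rec0", "Rec1", "Rec2"],
    "NER": [["NER0", "NER1"], ["NER1"], ["NER0", "NER2"]],
})

# One row per (recipe_ID, NER_ID) pair; recipes with an empty list yield NaN and are dropped
pairs = recipes[["recipe_ID", "NER"]].explode("NER").dropna(subset=["NER"])

# Collect all recipe_IDs for each NER_ID, keeping first-seen order of the NER_IDs
ner_to_recipes = pairs.groupby("NER", sort=False)["recipe_ID"].agg(list)

print(ner_to_recipes)
```

The intermediate `pairs` frame already has the shape of a junction table, which may be convenient when the data is moved to SQL.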

In [17]:
# This is another approach to filling the recipe_ID column in NER_table, but it is far too slow. (~1600 hours for this problem)
'''
import time

NER_table['recipe_ID'] = np.empty((len(NER_table), 0)).tolist()
for i in range(len(NER_table)):
    t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
    for j in range(len(recipe_table)):
        # NER_ID is the index of NER_table at this point, so read it from the index
        if NER_table.index[i] in recipe_table['NER'][j]:
            NER_table['recipe_ID'].iloc[i].append(recipe_table['recipe_ID'][j])
    print(time.perf_counter() - t0)
'''

In [18]:
recipe_table.head()

,recipe_ID,title,ingredients,directions,link,source,NER
0,Rec0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[NER0, NER1, NER2, NER3, NER4, NER5]"
1,Rec1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[NER6, NER7, NER8, NER9]"
2,Rec2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[NER10, NER11, NER4, NER12, NER13, NER14]"
3,Rec3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[NER15, NER16, NER8, NER17]"
4,Rec4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[NER18, NER19, NER4, NER20, NER21]"


In [19]:
NER_table.head(50)

NER_ID,NER,recipe_ID
NER0,brown sugar,"[Rec0, Rec26, Rec41, Rec44, Rec70, Rec77, Rec7..."
NER1,milk,"[Rec0, Rec5, Rec29, Rec41, Rec50, Rec69, Rec81..."
NER2,vanilla,"[Rec0, Rec6, Rec27, Rec41, Rec48, Rec59, Rec60..."
NER3,nuts,"[Rec0, Rec19, Rec20, Rec27, Rec44, Rec61, Rec6..."
NER4,butter,"[Rec0, Rec2, Rec4, Rec5, Rec6, Rec7, Rec11, Re..."
NER5,bite size shredded rice biscuits,"[Rec0, Rec854420]"
NER6,beef,"[Rec1, Rec104, Rec122, Rec361, Rec384, Rec657,..."
NER7,chicken breasts,"[Rec1, Rec50, Rec165, Rec344, Rec428, Rec447, ..."
NER8,cream of mushroom soup,"[Rec1, Rec3, Rec50, Rec63, Rec121, Rec139, Rec..."
NER9,sour cream,"[Rec1, Rec5, Rec13, Rec18, Rec21, Rec28, Rec36..."
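
Since the goal of this section is a set of SQL tables, note that list-valued columns such as `recipe_ID` above do not map directly onto a relational column. A common way to store such a many-to-many relation is a junction table with one row per (recipe, NER) pair. The sketch below illustrates this with the standard-library `sqlite3` module and an in-memory database; the table names `ner` and `recipe_ner`, the column names, and the few sample rows (copied from the output above) are assumptions, not the schema used by this project.

```python
import sqlite3

# A few rows in the shape of NER_table and of the (recipe_ID, NER_ID) pairs
ner_rows = [("NER0", "brown sugar"), ("NER1", "milk")]
pair_rows = [("Rec0", "NER0"), ("Rec0", "NER1"), ("Rec5", "NER1")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ner (ner_id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE recipe_ner ("
    " recipe_id TEXT NOT NULL,"
    " ner_id TEXT NOT NULL REFERENCES ner(ner_id),"
    " PRIMARY KEY (recipe_id, ner_id))"
)
conn.executemany("INSERT INTO ner VALUES (?, ?)", ner_rows)
# OR IGNORE skips a pair that is repeated within one recipe
conn.executemany("INSERT OR IGNORE INTO recipe_ner VALUES (?, ?)", pair_rows)
conn.commit()

# All recipes that use a given ingredient name
query = (
    "SELECT rn.recipe_id FROM recipe_ner AS rn"
    " JOIN ner AS n ON n.ner_id = rn.ner_id"
    " WHERE n.name = ? ORDER BY rn.recipe_id"
)
milk_recipes = [row[0] for row in conn.execute(query, ("milk",))]
print(milk_recipes)
```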
