<a href="https://colab.research.google.com/github/BehzadBarati/Ingredient-Maps/blob/main/Food_Recipes_RecipeNLG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Abstract:

This notebook produces elementary reports on the RecipeNLG dataset, which contains more than 2 million food recipes.
___
Source:

My main references are the [RecipeNLG paper](https://www.aclweb.org/anthology/2020.inlg-1.4.pdf) and its [dataset](https://recipenlg.cs.put.poznan.pl).
___
Input: 

1- Dataset of [RecipeNLG](https://recipenlg.cs.put.poznan.pl)

Output:

1- EDA report on RecipeNLG dataset (both inline and "EDA-Report-RecipeNLG.html" file)

2- Word cloud pictures (inline)

3- List of source websites of recipes ("Websites-RecipeNLG.csv" file)
___
Hints:

1_ Because the CSV file is larger than 2 gigabytes, I prefer to use a cloud service (here Google Colab). I uploaded the RecipeNLG dataset to my [Google Drive](https://drive.google.com/drive/folders/1g1ZNYKlLN4hyP8ywHXWa2Iu1oQ4wxSgR?usp=sharing). It is public.

2_ If running "ProfileReport" raises an out-of-memory error, first re-install the latest version of the "pandas_profiling" library, then pass the "minimal=True" argument to "ProfileReport" to skip some calculations. (pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip)

# Import needed libraries

In [1]:
# Install pandas_profiling library
# pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

import numpy as np
import pandas as pd 
from wordcloud import WordCloud             # Make wordcloud pictures
from pandas_profiling import ProfileReport  # Generate brief report on our dataframe
import matplotlib.pyplot as plt
from google.colab import drive              # Mount google drive to colab notebook
import re                                   
import string                               # removing special characters

# Load data

In [2]:
# Mount google drive to colab notebook
# Our dataset will be read as recipe_table.

drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [3]:
cd gdrive/MyDrive/Projects/Ingredient-Maps/Phase1

/content/gdrive/MyDrive/Projects/Ingredient-Maps/Phase1


In [4]:
# Read the file and check that the data is loaded

recipe_table = pd.read_csv('./dataset/RecipeNLG.csv')
print('Number of recipes in dataset: ', len(recipe_table))
print('last 5 recipes:')
recipe_table.tail(5)

Number of recipes in dataset:  2231142
last 5 recipes:


Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."
2231141,2231141,Polpette in Spicy Tomato Sauce,"[""1 pound ground veal"", ""1/2 pound sweet Itali...","[""Preheat the oven to 350."", ""In a bowl, mix t...",www.foodandwine.com/recipes/polpette-spicy-tom...,Recipes1M,"[""ground veal"", ""sausage"", ""bread crumbs"", ""mi..."
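
Since the CSV is larger than 2 gigabytes, it can be useful to compute simple totals without holding the whole file in memory. The following is a minimal sketch, not what this notebook does: it streams a file with the `chunksize` and `usecols` options of `pd.read_csv` and keeps only running counts. The in-memory `csv_text` sample, its rows and the chunk size of 2 are assumptions for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Tiny stand-in for RecipeNLG.csv (hypothetical rows, real column names).
csv_text = (
    "title,source\n"
    "Zucchini Bread,Gathered\n"
    "Devil Eggs,Recipes1M\n"
    "Fudge,Gathered\n"
)

total_rows = 0
source_counts = Counter()

# Read only the needed columns, two rows at a time.
reader = pd.read_csv(io.StringIO(csv_text), usecols=["title", "source"], chunksize=2)
for chunk in reader:
    total_rows += len(chunk)
    source_counts.update(chunk["source"].value_counts().to_dict())

print(total_rows, dict(source_counts))
```

With the real file, the path would replace `io.StringIO(csv_text)` and the chunk size would be much larger.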


# EDA (Exploratory Data Analysis)
I don't want to regenerate the reports on every run, so the EDA, word cloud and website list cells are muted (wrapped in triple-quoted strings).

In [5]:
'''
# Column 'Unnamed: 0' seems to be useless for the EDA section, so I drop it to make the dataset smaller.

recipe_table.drop('Unnamed: 0', axis='columns', inplace=True)
'''

In [6]:
'''
# Generate a quick report from our dataset 

profile = ProfileReport(recipe_table, minimal=True)
profile.to_file("EDA-Report-RecipeNLG.html")
profile
'''

## Word clouds

In [7]:
'''
# For creating word clouds, I used WordCloud library which was imported before.

def minimal_wordcloud(df, column):
    """
    Generate a simple wordcloud similar to: 
    https://www.kaggle.com/paultimothymooney/explore-recipe-nlg-dataset/data.
    The only import required is: from wordcloud import WordCloud
    """
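    # Join all values explicitly: str() on a large array returns a shortened
    # summary containing "...", so most rows would never reach the word cloud.
    text = " ".join(df[column].astype(str))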
    wordcloud = WordCloud().generate(text)
    image = wordcloud.to_image()
    plt.axis("off")
    plt.imshow(image)
    plt.show()
'''

In [8]:
'''
# Print word clouds

for c in recipe_table.columns:
    print('\nword cloud of contents in column {}'.format(c))
    minimal_wordcloud(recipe_table, c)
'''

## List of websites in RecipeNLG
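
In the sample rows shown earlier, the link column holds scheme-less strings such as `cooking.nytimes.com/recipes/1015164`. Splitting such a link on '.' and taking the first part (skipping 'www') files it under `cooking`. A possible alternative, sketched here with the standard library's `urllib.parse.urlsplit`, keeps the full host name instead. The helper name `host_of` is hypothetical, and the sketch assumes links never start with a scheme such as `http://`.

```python
from urllib.parse import urlsplit


def host_of(link: str) -> str:
    """Return the host of a scheme-less link, without a leading 'www.'."""
    # The leading '//' makes urlsplit treat the text before the first '/' as the host.
    host = urlsplit("//" + link).netloc.lower()
    return host[4:] if host.startswith("www.") else host


links = [
    "www.foodnetwork.com/recipes/sunny-anderson/sun...",  # truncated as in the table output
    "cookpad.com/us/recipes/355411-devil-eggs",
    "cooking.nytimes.com/recipes/1015164",
]
hosts = [host_of(link) for link in links]
print(hosts)
```

Whether subdomains should be kept or collapsed depends on how the website list is meant to be grouped.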

In [9]:
'''
# Extract website names from the link column.
# func takes the first dot-separated part of the link, skipping a leading 'www'.

func = lambda x: x[1] if x[0] == 'www' else x[0]
recipe_table['website'] = recipe_table['link'].str.split('.').apply(func)
'''

In [10]:
'''
recipe_table['website'].value_counts().rename_axis('websites').to_csv('./reports/Websites-RecipeNLG.csv',  header=['No. of recipes'])
recipe_table['website'].value_counts().rename_axis('websites')
'''

# Convert to SQL tables

## Preprocessing
Set a key for recipe_table, then normalize the NER values.


Since some meals have more than one recipe, titles are not unique, so we use the index of the main table to build the recipe_IDs.

In [12]:
# Rename the 'Unnamed: 0' column to recipe_ID, then lowercase NER strictly (casefold).
recipe_table.columns = recipe_table.columns.str.replace('Unnamed: 0', 'recipe_ID')
recipe_table['NER'] = recipe_table['NER'].str.casefold()

In [13]:
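# Strip punctuation (brackets, quotes, etc.) from NER. Commas are not in the set, so they remain as separators.
punctuations = str.maketrans('', '', '!"#$%&\'()*+-./:;<=>?@[\\]^_`{|}~')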
recipe_table['NER'] = recipe_table['NER'].str.translate(punctuations)
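
To see what the casefold and translate steps do to a single value, here is a small plain-Python sketch using the same translation table. The `raw_ner` string is a made-up sample in the format of the NER column, not a row from the dataset.

```python
# Made-up sample in the same format as a raw NER value.
raw_ner = '["Brown Sugar", "milk", "Vanilla"]'

# Same deletion table as in the cell above: every listed character is removed.
punctuations = str.maketrans('', '', '!"#$%&\'()*+-./:;<=>?@[\\]^_`{|}~')

cleaned = raw_ner.casefold().translate(punctuations)
items = [part.strip() for part in cleaned.split(',')]

# Side effect: hyphens are deleted too, so hyphenated names are fused.
fused = "all-purpose flour".translate(punctuations)

print(cleaned)
print(items)
print(fused)
```

Brackets and quotes disappear while commas survive, which is what makes the later split on ',' possible. The same table also fuses hyphenated names.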

In [14]:
# drop rows with no title or no NER

recipe_table = recipe_table[recipe_table['title'].notna()]
recipe_table = recipe_table[recipe_table['NER'].notna()]
recipe_table = recipe_table[recipe_table['NER'] != '']

In [15]:
# Reset the index, since some rows were dropped in the previous cell.

recipe_table.reset_index(inplace=True, drop=True)

In [16]:
# Build string keys such as "Rec0" from the fresh index.
recipe_table['recipe_ID'] = "Rec" + recipe_table.index.astype(str)

In [17]:
# split NER components to make a list out of them.

recipe_table['NER'] = recipe_table['NER'].str.split(',')

In [18]:
# remove spaces before/after items of each NER list

recipe_table['NER'] = [[val.strip() for val in sublist] for sublist in recipe_table['NER'].values]

In [19]:
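# Intended to remove empty items from the lists in the NER column.
# Note: filter(None, ...) here tests each row's list as a whole, so it only drops
# rows whose list is empty; empty strings inside a list are kept.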

recipe_table['NER'] = list(filter(None, recipe_table['NER']))
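
The difference between filtering rows and filtering the items inside each row is easy to miss, so here is a small plain-Python sketch on made-up data. It is only an illustration of the two behaviours, not a change to the notebook's results. If empty strings are not removed at item level, an empty entity name can end up in NER_table later.

```python
# Made-up rows: two of them contain an empty string.
rows = [["eggs", "", "salt"], [""], ["milk"]]

# filter(None, rows) tests each inner list as a whole. Every list here is
# non-empty (even [""]), so nothing is removed and the empty strings survive.
row_level = list(filter(None, rows))

# Filtering inside each list removes the empty strings themselves.
item_level = [[item for item in row if item] for row in rows]

print(row_level)
print(item_level)
```

Item-level filtering can leave a row with an empty list, so such rows may need to be dropped afterwards.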

In [None]:
# ast.literal_eval can convert a list stored as a string (like the raw NER column) into a real list, but it is not needed here.
'''
import ast                                  # Convert string to list

# values of NER column are stored as string. so we convert them to lists

ast_func = lambda a: ast.literal_eval(a)
recipe_table['NERtemp'] = recipe_table['NERtemp'].map(ast_func)
'''
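
For reference, this is roughly how the `ast.literal_eval` route behaves on one value. It is a sketch under two assumptions: the value is still the raw string from the CSV (before punctuation is stripped), and the helper name `parse_list` is hypothetical.

```python
import ast


def parse_list(value):
    """Parse a stringified list; return [] for missing (non-string) values."""
    return ast.literal_eval(value) if isinstance(value, str) else []


# Made-up sample in the same format as a raw NER value.
raw_ner = '["brown sugar", "milk", "vanilla"]'
parsed = parse_list(raw_ner)
missing = parse_list(float("nan"))

print(parsed)
print(missing)
```

With pandas this could be applied with `.map(parse_list)` on the raw column. Names keep their original case and punctuation, so normalization would still have to happen per item.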

In [20]:
recipe_table.tail()

Unnamed: 0,recipe_ID,title,ingredients,directions,link,source,NER
2230557,Rec2230557,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[chocolate hazelnut spread, tortillas, butter,..."
2230558,Rec2230558,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[eggs, paprika, salt, choice, miracle whip, re..."
2230559,Rec2230559,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[radish, sesame oil, white sesame seeds, salt,..."
2230560,Rec2230560,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[apple cider, sugar, kosher salt, bay leaves, ..."
2230561,Rec2230561,Polpette in Spicy Tomato Sauce,"[""1 pound ground veal"", ""1/2 pound sweet Itali...","[""Preheat the oven to 350."", ""In a bowl, mix t...",www.foodandwine.com/recipes/polpette-spicy-tom...,Recipes1M,"[ground veal, sausage, bread crumbs, milk, gar..."


## NER_table

In [21]:
# create NER_table and add NER_ID column

NER_table = pd.DataFrame(recipe_table['NER'].explode().unique(), columns=['NER'])
NER_table['NER_ID'] =  ['NER'+str(i) for i in range(len(NER_table['NER']))]

In [22]:
# Set the index to NER and build the conv dictionary from NER_table.
# Using conv, replace the names in recipe_table['NER'] with their IDs.

NER_table.set_index(['NER'], inplace=True)
conv = NER_table.to_dict('dict')
recipe_table['NER'] = recipe_table['NER'].apply(lambda row: [conv['NER_ID'][v] for v in row if conv['NER_ID'].get(v)])
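
The idea of the last two cells (collect unique names in order of first appearance, give each an ID, then replace names with IDs) can be sketched in plain Python on a toy example. The recipe contents below are made up, and `vocab` plays the role of `conv['NER_ID']`.

```python
# Made-up NER lists for two recipes.
recipes = [["brown sugar", "milk"], ["milk", "butter"]]

# Assign IDs in order of first appearance, like explode().unique().
vocab = {}
for row in recipes:
    for name in row:
        vocab.setdefault(name, f"NER{len(vocab)}")

# Replace every name with its ID.
encoded = [[vocab[name] for name in row] for row in recipes]

print(vocab)
print(encoded)
```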

In [23]:
# Add a column to hold recipe_IDs and make NER_ID the index so rows can be looked up by ID.
# Fill the recipe_ID column of NER_table with the IDs of all recipes that use each NER. (~ 4 minutes in colab with 8 GB RAM)

NER_table['recipe_ID'] = np.empty((len(NER_table), 0)).tolist()
NER_table = NER_table.reset_index().set_index('NER_ID')

for k in range(len(recipe_table)):
    for m in recipe_table['NER'][k]:
        NER_table['recipe_ID'][m].append(recipe_table['recipe_ID'][k])
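
The loop above builds an inverted index from NER IDs to recipe IDs. A possible pandas-only alternative is `explode` followed by `groupby`, sketched below on a toy frame. It has not been timed on the full dataset here, so treat it as an option to try rather than a known improvement. The toy data is made up.

```python
import pandas as pd

# Toy stand-in for recipe_table after names were replaced with IDs.
recipes = pd.DataFrame({
    "recipe_ID": ["Rec0", "Rec1", "Rec2"],
    "NER": [["NER0", "NER1"], ["NER1"], ["NER0", "NER2"]],
})

# One row per (recipe, NER) pair, then collect the recipe IDs of each NER.
inverted = (
    recipes.explode("NER")
    .groupby("NER")["recipe_ID"]
    .agg(list)
)

print(inverted.to_dict())
```

The result is a Series indexed by NER ID. Recipes whose NER list is empty explode to a missing value and are skipped by `groupby`.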

In [None]:
# This is another approach to filling the recipe_ID column in NER_table, but it is slow. (~ 1600 hours for this problem)
'''
import time

NER_table['recipe_ID'] = np.empty((len(NER_table), 0)).tolist()
for i in range(len(NER_table)):
    t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
    for j in range(len(recipe_table)):
        if NER_table['NER_ID'][i] in recipe_table['NER'][j]:
            NER_table['recipe_ID'][i].append(recipe_table['recipe_ID'][j])
    print(time.perf_counter() - t0)
'''

In [24]:
len(NER_table)

194270

In [25]:
NER_table.head(5)

NER_ID,NER,recipe_ID
NER0,brown sugar,"[Rec0, Rec26, Rec41, Rec44, Rec70, Rec77, Rec7..."
NER1,milk,"[Rec0, Rec5, Rec29, Rec41, Rec50, Rec69, Rec81..."
NER2,vanilla,"[Rec0, Rec6, Rec27, Rec41, Rec48, Rec59, Rec60..."
NER3,nuts,"[Rec0, Rec19, Rec20, Rec27, Rec44, Rec61, Rec6..."
NER4,butter,"[Rec0, Rec2, Rec4, Rec5, Rec6, Rec7, Rec11, Re..."


In [26]:
NER_table.sort_values(by=['NER'])

NER_ID,NER,recipe_ID
NER369,,"[Rec159, Rec2344, Rec2344, Rec2344, Rec2873, R..."
NER3321,a,"[Rec7492, Rec25702, Rec46265, Rec68658, Rec789..."
NER167909,a dashi stock powder,[Rec1896474]
NER191020,a honey,[Rec2186659]
NER150274,a mirin,"[Rec1705893, Rec1896474]"
...,...,...
NER5461,zwieback crumbs,"[Rec17857, Rec243101, Rec262956, Rec274910, Re..."
NER3118,zwieback crust,[Rec6662]
NER25802,zwieback toast,"[Rec257411, Rec442441, Rec913757, Rec936381, R..."
NER46646,zwieback toasts,"[Rec714484, Rec1871390]"


In [27]:
recipe_table.loc[45774]['NER']

['NER369', 'NER2849', 'NER38', 'NER61', 'NER13', 'NER301', 'NER349']

In [28]:
NER_table.loc['NER1617']['NER']

'morton tender quick salt'

In [29]:
recipe_table.loc[62719]

recipe_ID                                               Rec62719
title                                             Zucchini Bread
ingredients    ["3 eggs", "2 c. sugar", "2 c. zucchini, grate...
directions     ["Combine eggs, sugar, zucchini, oil and vanil...
link            www.cookbooks.com/Recipe-Details.aspx?id=1020499
source                                                  Gathered
NER            [NER82, NER27, NER253, NER41, NER2, NER30, NER...
Name: 62719, dtype: object
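
The lookups above go from a recipe row to its NER IDs and from a single NER ID to its name. Combining them gives the ingredient names of a recipe. Below is a minimal sketch on a toy table indexed by NER_ID: the four names and IDs are taken from the NER_table head shown earlier, while the `ids` list is a made-up recipe.

```python
import pandas as pd

# Toy stand-in for NER_table, indexed by NER_ID.
ner_table = pd.DataFrame(
    {"NER": ["brown sugar", "milk", "vanilla", "butter"]},
    index=pd.Index(["NER0", "NER1", "NER2", "NER4"], name="NER_ID"),
)

# Made-up NER list of one recipe.
ids = ["NER4", "NER1"]

# .loc with a list of labels returns the rows in the requested order.
names = ner_table.loc[ids, "NER"].tolist()

print(names)
```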