<a href="https://colab.research.google.com/github/BehzadBarati/Ingredient-Maps/blob/main/Food_Recipes_RecipeNLG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Author: Behzad Barati (bhzdbrt@gmail.com)

Abstract:

This notebook produces basic exploratory reports on the RecipeNLG dataset, which contains more than 2 million recipes.
___
Source:

My main references are the [RecipeNLG paper](https://www.aclweb.org/anthology/2020.inlg-1.4.pdf) and its [dataset](https://recipenlg.cs.put.poznan.pl).
___
Input: 

1- Dataset of [RecipeNLG](https://recipenlg.cs.put.poznan.pl)

Output:

1- EDA report on the RecipeNLG dataset (including the "EDA-Report-RecipeNLG.html" file, word cloud pictures, and the list of source websites of the recipes as the "Websites-RecipeNLG.csv" file)

2- Preprocessed recipe_table

3- NER_table

4- step_table
___
Hints:

1_ Because the CSV file is larger than 2 gigabytes, I prefer to use a cloud service (here, Google Colab). I uploaded the RecipeNLG dataset to my [Google Drive](https://drive.google.com/drive/folders/1g1ZNYKlLN4hyP8ywHXWa2Iu1oQ4wxSgR?usp=sharing). It is public.

2_ If "ProfileReport" runs out of memory, first re-install the latest version of the "pandas_profiling" library (pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip), then try the "minimal=True" argument of "ProfileReport" to skip some calculations.

# Import needed libraries

In [1]:
# Install pandas_profiling library
# pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

import numpy as np
import pandas as pd
from wordcloud import WordCloud             # Make wordcloud pictures
from pandas_profiling import ProfileReport  # Generate brief report on our dataframe
import matplotlib.pyplot as plt
from google.colab import drive              # Mount google drive to colab notebook
import re                                   # Regular expressions
import string                               # removing special characters
from pandas.core.common import flatten      # to make nested lists flat

# Load data

In [2]:
# Mount google drive to colab notebook
# Our dataset will be read as recipe_table.

drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [3]:
cd gdrive/MyDrive/Projects/Ingredient-Maps/Phase1

/content/gdrive/MyDrive/Projects/Ingredient-Maps/Phase1


In [4]:
# Read the file and check that the data is loaded

recipe_table = pd.read_csv('./dataset/RecipeNLG.csv')
print('Number of recipes in dataset: ', len(recipe_table))
recipe_table.rename(columns={'Unnamed: 0': 'recipe_ID', 'directions': 'steps'}, inplace=True)
print('last 5 recipes:')
recipe_table.tail(5)

Number of recipes in dataset:  2231142
last 5 recipes:


,recipe_ID,title,ingredients,steps,link,source,NER
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."
2231141,2231141,Polpette in Spicy Tomato Sauce,"[""1 pound ground veal"", ""1/2 pound sweet Itali...","[""Preheat the oven to 350."", ""In a bowl, mix t...",www.foodandwine.com/recipes/polpette-spicy-tom...,Recipes1M,"[""ground veal"", ""sausage"", ""bread crumbs"", ""mi..."


# EDA (Exploratory Data Analysis)
I don't want to regenerate the report on every run, so the EDA, word cloud, and website list cells are disabled (wrapped in triple quotes).

In [5]:
'''
# Column 'recipe_ID' seems to be useless for the EDA section, so I drop it to make our dataset smaller.

recipe_table.drop('recipe_ID', axis='columns', inplace=True)
'''

In [6]:
'''
# Generate a quick report from our dataset 

profile = ProfileReport(recipe_table, minimal=True)
profile.to_file("EDA-Report-RecipeNLG.html")
profile
'''

## Word clouds

In [7]:
'''
# For creating word clouds, I used WordCloud library which was imported before.

def minimal_wordcloud(df, column):
    """
    Generate a simple wordcloud similar to: 
    https://www.kaggle.com/paultimothymooney/explore-recipe-nlg-dataset/data.
    The only import required is: from wordcloud import WordCloud
    """
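    # Note: str() of a large NumPy array is abbreviated with '...',
    # so only the first and last few values reach the word cloud.
    text = str(df[column].values)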
    wordcloud = WordCloud().generate(text)
    image = wordcloud.to_image()
    plt.axis("off")
    plt.imshow(image)
    plt.show()
'''

In [8]:
'''
# Print word clouds

for c in recipe_table.columns:
    print('\nword cloud of contents in column {}'.format(c))
    minimal_wordcloud(recipe_table, c)
'''

## List of websites in RecipeNLG

In [9]:
'''
# Based on link column I tried to extract website names.
# This func helps to select website names where we do not have 'www' at beginning

func = lambda x: x[1] if x[0] == 'www' else x[0]
recipe_table['website'] = recipe_table['link'].str.split('.').apply(func)
'''
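
To make the rule in `func` concrete, here is a minimal plain-Python sketch applied to three links from the table above (two of them truncated exactly as they are displayed there). The helper name `website_name` is hypothetical; it only mirrors the lambda and is not part of the notebook.

```python
def website_name(link):
    """Mirror the notebook's lambda: skip a leading 'www', else take the first label."""
    parts = link.split('.')
    return parts[1] if parts[0] == 'www' else parts[0]

links = [
    'www.foodnetwork.com/recipes/sunny-anderson/sun...',   # truncated in the table
    'cookpad.com/us/recipes/355411-devil-eggs',
    'cooking.nytimes.com/recipes/1015164',
]
names = [website_name(link) for link in links]
print(names)  # ['foodnetwork', 'cookpad', 'cooking']
```

Note that any subdomain other than `www` is returned as the website name, so `cooking.nytimes.com` is counted as `cooking` rather than `nytimes`.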

In [10]:
'''
recipe_table['website'].value_counts().rename_axis('websites').to_csv('./reports/Websites-RecipeNLG.csv',  header=['No. of recipies'])
recipe_table['website'].value_counts().rename_axis('websites')
'''

# Convert to SQL tables

## Preprocessing recipe_table

### preprocess NER column 

In [5]:
# lowercase the text (casefold is more aggressive than lower).

recipe_table['NER'] = recipe_table['NER'].str.casefold()
recipe_table['title'] = recipe_table['title'].str.casefold()

In [6]:
# remove punctuation from title and NER columns (the comma is kept because NER is split on it later)

punctuations = str.maketrans('', '', '!"#$%&\'()*+-./:;<=>?@[\\]^_`{|}~')
recipe_table['NER'] = recipe_table['NER'].str.translate(punctuations)
recipe_table['title'] = recipe_table['title'].str.translate(punctuations)

In [7]:
# drop rows with no title or NER

recipe_table = recipe_table[recipe_table['title'].notna()]
recipe_table = recipe_table[recipe_table['NER'].notna()]
recipe_table = recipe_table[recipe_table['NER'] != '']

In [8]:
# split NER components to make a list out of them.

recipe_table['NER'] = recipe_table['NER'].str.split(',')

In [9]:
# remove spaces before/after items of list

recipe_table['NER'] = [[val.strip() for val in sublist] for sublist in recipe_table['NER'].values]

In [10]:
# I noticed some NER items start with "a " (e.g. "a milk"), so we remove that prefix.

recipe_table['NER'] = [[re.sub('^a ', '', val) for val in sublist] for sublist in recipe_table['NER'].values]

In [11]:
# remove spaces before/after items of list once again

recipe_table['NER'] = [[val.strip() for val in sublist] for sublist in recipe_table['NER'].values]

In [12]:
# remove empty items from lists in NER column

recipe_table['NER'] = recipe_table['NER'].apply(lambda row: list(filter(None, row)))

In [13]:
# remove items from lists in NER column which are only 1 character (e.g. 'm')

recipe_table['NER'] = recipe_table['NER'].apply(lambda row: [item for item in row if len(item) > 1] )

In [14]:
# remove duplicate items in each row of NER column

recipe_table['NER'] = recipe_table['NER'].apply(lambda row: list(set(row)))

In [15]:
# remove recipes that have fewer than two NERs.

recipe_table = recipe_table[recipe_table['NER'].str.len() > 1]
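
The cleaning cells above can be followed on a single value. The sketch below applies the same steps to one made-up NER string using only the standard library; `clean_ner` and `PUNCT_TABLE` are hypothetical names, and `dict.fromkeys` is used instead of `set` so that the result keeps a predictable order (the notebook's `set` does not guarantee one).

```python
import re
import string

# Same character set as the notebook: all ASCII punctuation except the comma,
# which is kept because it separates the items.
PUNCT_TABLE = str.maketrans('', '', string.punctuation.replace(',', ''))

def clean_ner(raw):
    """Apply the NER cleaning steps to one raw cell value (hypothetical helper)."""
    text = raw.casefold().translate(PUNCT_TABLE)
    items = [item.strip() for item in text.split(',')]
    items = [re.sub('^a ', '', item).strip() for item in items]
    items = [item for item in items if len(item) > 1]  # drops '' and single characters
    return list(dict.fromkeys(items))                  # dedupe, keeping first-seen order

cleaned = clean_ner('["eggs", "paprika", "a Milk", "m", "eggs"]')
print(cleaned)  # ['eggs', 'paprika', 'milk']
```

A row like this one keeps three items, so it would survive the final filter that requires more than one NER.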

### preprocess steps column

In [5]:
# split recipe_table['steps'] to make a list out of each record.
# since some steps contain commas, we cannot split on commas.
# so we split on " and then remove items which are meaningless (fewer than 4 characters)

recipe_table['steps'] = recipe_table['steps'].str.split('"')
recipe_table['steps'] = recipe_table['steps'].apply(lambda row: [item for item in row if len(item) > 3] )

In [6]:
# some steps consist of multiple sentences ending with a dot, so we split steps again on the dot.
# running split('.') gives nested (two-dimensional) lists and also some blank items,
# so we flatten the lists and remove items with fewer than 3 characters.
# finally, remove spaces before/after items of the list

recipe_table['steps'] = [[val.split('.') for val in sublist] for sublist in recipe_table['steps'].values]
recipe_table['steps'] = recipe_table['steps'].apply(lambda row: list(flatten(row)))
recipe_table['steps'] = recipe_table['steps'].apply(lambda row: [item for item in row if len(item) > 2])
recipe_table['steps'] = [[val.strip() for val in sublist] for sublist in recipe_table['steps'].values]

In [9]:
# remove recipes that have fewer than two steps.

recipe_table = recipe_table[recipe_table['steps'].str.len() > 1]
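
As an illustration of the two splitting passes above, the following plain-Python sketch processes one made-up `steps` string. `split_steps` is a hypothetical helper that mirrors the cells without pandas; it is a sketch, not the notebook's code.

```python
def split_steps(raw):
    """Split one raw `steps` cell into sentence-level steps (hypothetical helper)."""
    # Splitting on the double quote isolates each quoted step; the leftovers
    # ('[', ', ', ']') are short and are removed by the length check.
    chunks = [chunk for chunk in raw.split('"') if len(chunk) > 3]
    # Split each step on '.', flatten, drop very short pieces, then trim spaces.
    pieces = [piece for chunk in chunks for piece in chunk.split('.')]
    return [piece.strip() for piece in pieces if len(piece) > 2]

raw = '["Preheat the oven to 350.", "In a bowl, mix the veal. Roll into balls."]'
steps = split_steps(raw)
print(steps)
# ['Preheat the oven to 350', 'In a bowl, mix the veal', 'Roll into balls']
```

One consequence of splitting on every dot is that decimal numbers are also cut: a step such as `Add 1.5 cups milk.` becomes `Add 1` and `5 cups milk`.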

In [10]:
# reset the index because some rows of recipe_table were deleted in previous cells.
# recipe_ID follows the naming convention "Rec" + row index

recipe_table.reset_index(inplace=True, drop=True)
recipe_table['recipe_ID'] = "Rec" + recipe_table.index.astype(str)

In [None]:
# since available RAM is limited, we preprocess the original dataframe and then save it as csv.

# recipe_table.to_csv('./dataset/ProcessedRecipeNLG.csv', index=False)

## NER_table

In [4]:
# since available RAM is limited, we preprocess the original dataframe and save it as csv,
# then load it here again to create NER_table and step_table.
# Mount google drive to colab notebook
# Our dataset will be read as recipe_table.

#drive.mount('/content/gdrive', force_remount=True)
#recipe_table = pd.read_csv('./MyDrive/Projects/Ingredient-Maps/Phase1/dataset/ProcessedRecipeNLG.csv')

In [8]:
# we convert steps and NER column contents to lists 

#import ast
#ast_func = lambda a: ast.literal_eval(a)
#recipe_table['NER'] = recipe_table['NER'].map(ast_func)
#recipe_table['steps'] = recipe_table['steps'].map(ast_func)

In [18]:
# create NER_table and add NER_ID column

NER_table = pd.DataFrame(recipe_table['NER'].explode().unique(), columns=['NER'])
NER_table['NER_ID'] =  ['NER'+str(i) for i in range(len(NER_table['NER']))]

In [19]:
# set index to NER and make conv dictionary out of NER_table
# with the help of the conv dictionary, we replace names in recipe_table['NER'] with IDs

NER_table.set_index(['NER'], inplace=True)
conv = NER_table.to_dict('dict')
recipe_table['NER'] = recipe_table['NER'].apply(lambda row: [conv['NER_ID'][v] for v in row if conv['NER_ID'].get(v)])

In [20]:
# add a column to hold recipe_IDs and reset the index in order to fill the recipe_ID column
# we fill the recipe_ID column of NER_table with the IDs of all recipes that use each NER. (~ 4 minutes in colab with 8 GB RAM)

NER_table['recipe_ID'] = np.empty((len(NER_table), 0)).tolist()
NER_table = NER_table.reset_index().set_index('NER_ID')

for k in range(len(recipe_table)):
    for m in recipe_table['NER'][k]:
        NER_table['recipe_ID'][m].append(recipe_table['recipe_ID'][k])
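
The idea behind the last three cells is a single pass over the recipes that builds an inverted index from each NER to the recipes using it. A minimal sketch with plain dictionaries, on made-up data, is shown below; the variable names `recipes`, `ner_ids` and `recipes_by_ner` are assumptions for illustration and do not appear in the notebook.

```python
from collections import defaultdict

# Toy stand-in for recipe_table: recipe_ID -> cleaned NER names (made-up data).
recipes = {
    'Rec0': ['vanilla', 'butter'],
    'Rec1': ['butter', 'milk'],
}

ner_ids = {}                        # NER name -> NER_ID, assigned in first-seen order
recipes_by_ner = defaultdict(list)  # NER_ID -> list of recipe_IDs that use it

for recipe_id, names in recipes.items():
    for name in names:
        # Reuse the existing ID, or create the next one for a new name.
        ner_id = ner_ids.setdefault(name, 'NER' + str(len(ner_ids)))
        recipes_by_ner[ner_id].append(recipe_id)

print(ner_ids)               # {'vanilla': 'NER0', 'butter': 'NER1', 'milk': 'NER2'}
print(dict(recipes_by_ner))  # {'NER0': ['Rec0'], 'NER1': ['Rec0', 'Rec1'], 'NER2': ['Rec1']}
```

Each (recipe, NER) pair is visited once, so the work grows with the total number of NER entries rather than with the number of NERs times the number of recipes.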

In [None]:
# this is another approach to fill the recipe_ID column in NER_table, but it is slow. (~ 100 hours for this problem)
'''
import time

NER_table['recipe_ID'] = np.empty((len(NER_table), 0)).tolist()
for i in range(len(NER_table)):
    t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
    for j in range(len(recipe_table)):
        if NER_table['NER_ID'][i] in recipe_table['NER'][j]:
            NER_table['recipe_ID'][i].append(recipe_table['recipe_ID'][j])
    print(time.perf_counter() - t0)
'''

In [21]:
NER_table.head(5)

NER_ID,NER,recipe_ID
NER0,bite size shredded rice biscuits,"[Rec0, Rec817538]"
NER1,vanilla,"[Rec0, Rec6, Rec27, Rec40, Rec47, Rec56, Rec57..."
NER2,brown sugar,"[Rec0, Rec26, Rec40, Rec43, Rec66, Rec72, Rec7..."
NER3,butter,"[Rec0, Rec2, Rec4, Rec5, Rec6, Rec7, Rec11, Re..."
NER4,nuts,"[Rec0, Rec19, Rec20, Rec27, Rec43, Rec58, Rec6..."


In [22]:
NER_table.sort_values(by=['NER'])[:20]

NER_ID,NER,recipe_ID
NER87817,aaaa,[Rec1240940]
NER100988,aaakosher salt,[Rec1300551]
NER155788,aaejo,[Rec1738134]
NER189537,aalborg,[Rec2147650]
NER106183,aamchur,"[Rec1323487, Rec1325487, Rec1517688, Rec1529893]"
NER168339,aarons,[Rec1879674]
NER160626,aaronson,[Rec1791531]
NER86430,aartis,[Rec1235522]
NER79811,aasil,[Rec1182708]
NER98435,abado sauce,[Rec1288332]


In [23]:
len(NER_table)

191215

In [24]:
# number of NERs which are used in only one recipe.

sum(NER_table['recipe_ID'].str.len() < 2)

120665

## step_table

In [25]:
# create step_table and add step_ID column

step_table = pd.DataFrame(recipe_table['steps'].explode().unique(), columns=['steps'])
step_table['step_ID'] =  ['Ste'+str(i) for i in range(len(step_table['steps']))]

In [None]:
# set index to steps and make conv dictionary out of step_table
# with the help of the conv dictionary, we replace step texts in recipe_table['steps'] with IDs

step_table.set_index(['steps'], inplace=True)
conv = step_table.to_dict('dict')

In [36]:
recipe_table['steps'] = recipe_table['steps'].apply(lambda row: [conv['step_ID'][v] for v in row if conv['step_ID'].get(v)])
step_table = step_table.reset_index().set_index('step_ID')

In [44]:
recipe_table.tail(5)

,recipe_ID,title,ingredients,steps,link,source,NER
2170498,Rec2170498,sunnys fake crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[Ste13278705, Ste13278706, Ste13278707, Ste132...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[NER49, NER498, NER3798, NER3, NER51989]"
2170499,Rec2170499,devil eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[Ste13278714, Ste13278715, Ste13278716, Ste132...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[NER10, NER82, NER1731, NER2986, NER5240, NER372]"
2170500,Rec2170500,extremely easy and quick namul daikon salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[Ste13278722, Ste44582, Ste9096310]",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[NER10, NER33581, NER460, NER3070, NER87]"
2170501,Rec2170501,panroasted pork chops with apple fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[Ste13278723, Ste13278724, Ste13278725, Ste132...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[NER400, NER875, NER1820, NER32, NER4705, NER5..."
2170502,Rec2170502,polpette in spicy tomato sauce,"[""1 pound ground veal"", ""1/2 pound sweet Itali...","[Ste1554140, Ste13278745, Ste13278746, Ste1327...",www.foodandwine.com/recipes/polpette-spicy-tom...,Recipes1M,"[NER10, NER89, NER37136, NER88, NER198, NER104..."


In [38]:
step_table.tail()

step_ID,steps
Ste13278746,Roll into 1 1/2-inch meatballs
Ste13278747,Bake the meatballs on a lightly oiled baking s...
Ste13278748,"In a large saucepan, season the tomato sauce w..."
Ste13278749,Add the meatballs and simmer until the sauce i...
Ste13278750,Sprinkle with pecorino cheese and serve


In [41]:
NER_table.head(10)

NER_ID,NER,recipe_ID
NER0,bite size shredded rice biscuits,"[Rec0, Rec817538]"
NER1,vanilla,"[Rec0, Rec6, Rec27, Rec40, Rec47, Rec56, Rec57..."
NER2,brown sugar,"[Rec0, Rec26, Rec40, Rec43, Rec66, Rec72, Rec7..."
NER3,butter,"[Rec0, Rec2, Rec4, Rec5, Rec6, Rec7, Rec11, Re..."
NER4,nuts,"[Rec0, Rec19, Rec20, Rec27, Rec43, Rec58, Rec6..."
NER5,milk,"[Rec0, Rec5, Rec29, Rec40, Rec49, Rec65, Rec76..."
NER6,sour cream,"[Rec1, Rec5, Rec13, Rec18, Rec21, Rec28, Rec35..."
NER7,chicken breasts,"[Rec1, Rec49, Rec157, Rec330, Rec408, Rec427, ..."
NER8,cream of mushroom soup,"[Rec1, Rec3, Rec49, Rec60, Rec114, Rec131, Rec..."
NER9,beef,"[Rec1, Rec98, Rec115, Rec347, Rec366, Rec625, ..."
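
The notebook stops at DataFrames whose cells hold lists. If these tables were loaded into a relational database, one common option would be to store the recipe-to-NER relation in a separate link table with one row per pair. The sketch below shows that idea with the standard `sqlite3` module and a few rows in the shape of the tables above; the table names `ner` and `recipe_ner` and their columns are assumptions, not something the notebook defines.

```python
import sqlite3

# A few rows in the shape of the tables above (illustrative only).
ner_rows = [('NER3', 'butter'), ('NER5', 'milk')]
recipe_ner_rows = [('Rec0', 'NER3'), ('Rec0', 'NER5'), ('Rec2', 'NER3')]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ner (ner_id TEXT PRIMARY KEY, name TEXT NOT NULL)')
conn.execute(
    'CREATE TABLE recipe_ner ('
    'recipe_id TEXT NOT NULL, '
    'ner_id TEXT NOT NULL REFERENCES ner(ner_id), '
    'PRIMARY KEY (recipe_id, ner_id))'
)
conn.executemany('INSERT INTO ner VALUES (?, ?)', ner_rows)
conn.executemany('INSERT INTO recipe_ner VALUES (?, ?)', recipe_ner_rows)

# Find every recipe that uses a given ingredient name.
query = '''
    SELECT rn.recipe_id
    FROM recipe_ner AS rn JOIN ner ON ner.ner_id = rn.ner_id
    WHERE ner.name = ?
    ORDER BY rn.recipe_id
'''
butter_recipes = [row[0] for row in conn.execute(query, ('butter',))]
print(butter_recipes)  # ['Rec0', 'Rec2']
conn.close()
```

The same layout would apply to the recipe-to-step relation, with an extra position column if the order of steps has to be preserved.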
