<a href="https://colab.research.google.com/github/BehzadBarati/Ingredient-Maps/blob/main/Get_ingredients_of_Tomato_Soups_RecipeNLG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Author: Behzad Barati

Abstract:

*   This notebook produces the 15 most frequent ingredients of tomato soups in the RecipeNLG dataset.
*   The RecipeNLG dataset consists of the Recipe1M dataset plus other recipes added by the RecipeNLG authors.
*   The output is based on roughly 2,400 recipes that have 'tomato soup' in their title.
___
Source:

My main references are the [RecipeNLG paper](https://www.aclweb.org/anthology/2020.inlg-1.4.pdf) and its [dataset](https://recipenlg.cs.put.poznan.pl).
___
Input: 

*   Dataset of [RecipeNLG](https://recipenlg.cs.put.poznan.pl)

Output:

*   15 most frequent ingredients of tomato soups.
___
Hints:

*   Because the CSV file is larger than 2 gigabytes, I prefer to use a cloud service (here, Google Colab). I uploaded the RecipeNLG dataset to my [Google Drive](https://drive.google.com/drive/folders/1g1ZNYKlLN4hyP8ywHXWa2Iu1oQ4wxSgR?usp=sharing), which is public.


# Import needed libraries

In [1]:
# Import the libraries used in this notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive              # Mount google drive to colab notebook
import re                                   # regular expressions
import string                               # removing special characters
from pandas.core.common import flatten      # to make nested lists flat

# Load data

In [2]:
# Mount google drive to colab notebook
# Our dataset will be read as recipe_tomato.

drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [3]:
cd gdrive/MyDrive/Projects/Ingredient-Maps/Phase1

/content/gdrive/MyDrive/Projects/Ingredient-Maps/Phase1


In [4]:
# Read the file and check that the data is loaded

recipe_tomato = pd.read_csv('./dataset/RecipeNLG.csv')
print('Number of recipes in dataset: ', len(recipe_tomato))
recipe_tomato.rename(columns={'Unnamed: 0': 'id', 'title': 'tag_value', 'directions': 'steps', 'NER': 'ner'}, inplace=True)
print('last 5 recipes:')
recipe_tomato.tail(5)

Number of recipes in dataset:  2231142
last 5 recipes:


Unnamed: 0,id,tag_value,ingredients,steps,link,source,ner
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."
2231141,2231141,Polpette in Spicy Tomato Sauce,"[""1 pound ground veal"", ""1/2 pound sweet Itali...","[""Preheat the oven to 350."", ""In a bowl, mix t...",www.foodandwine.com/recipes/polpette-spicy-tom...,Recipes1M,"[""ground veal"", ""sausage"", ""bread crumbs"", ""mi..."
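
Since the CSV is larger than 2 gigabytes, one option on a machine with less memory is to read it in chunks and keep only the matching rows. The following is a minimal sketch, not the approach used in this notebook: the helper name `load_matching`, the chunk size, and the tiny in-memory CSV are all assumptions for illustration, and only the `title` and `NER` column names come from the dataset.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for RecipeNLG.csv (assumption: same 'title' and 'NER' columns).
csv_text = (
    "title,NER\n"
    'Creamy Tomato Soup,"[""tomatoes"", ""cream""]"\n'
    'Devil Eggs,"[""eggs"", ""paprika""]"\n'
    'Tomato Soup Cake,"[""sugar"", ""tomato soup""]"\n'
)

def load_matching(source, phrase, chunksize=2):
    """Read a CSV in chunks and keep rows whose title contains `phrase`."""
    kept = []
    for chunk in pd.read_csv(source, usecols=["title", "NER"], chunksize=chunksize):
        # casefold the title so the match is case-insensitive; treat missing titles as no match
        mask = chunk["title"].str.casefold().str.contains(phrase, regex=False, na=False)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

subset = load_matching(io.StringIO(csv_text), "tomato soup")
print(subset["title"].tolist())
```

With the real file, `source` would be the path to the CSV and `chunksize` would be much larger (for example, 100,000 rows).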


# Preprocess recipe_tomato

In [5]:
# lowercase the titles so that matching is case-insensitive
recipe_tomato['tag_value'] = recipe_tomato['tag_value'].str.casefold()

In [6]:
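# translation table that deletes punctuation; ',' is not in the list,
# so the ner column can still be split on commas later
punctuations = str.maketrans('', '', '!"#$%&\'()*+-./:;<=>?@[\\]^_`{|}~')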
recipe_tomato['tag_value'] = recipe_tomato['tag_value'].str.translate(punctuations)
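
To see what the two cells above do to a single value, here is a small plain-Python sketch. The sample strings are made up for illustration; the translation table is the same one used in the notebook.

```python
# Same deletion table as in the notebook; note that ',' is not part of it.
punctuations = str.maketrans('', '', '!"#$%&\'()*+-./:;<=>?@[\\]^_`{|}~')

# A made-up title: casefold lowercases it, translate deletes the punctuation.
raw_title = "Georgia's Tomato-Soup (Quick!)"
clean_title = raw_title.casefold().translate(punctuations)
print(clean_title)  # georgias tomatosoup quick

# A made-up NER string: brackets and quotes disappear, commas survive.
raw_ner = '["tomato juice", "sweet peppers", "onions"]'
clean_ner = raw_ner.casefold().translate(punctuations)
print(clean_ner)    # tomato juice, sweet peppers, onions
```

Characters are deleted rather than replaced by a space, so a hyphenated title such as "Tomato-Soup" becomes "tomatosoup" and would no longer contain the phrase 'tomato soup'.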

In [8]:
# keep only recipes whose title contains 'tomato soup'
# (na=False treats a missing title as "no match")

recipe_tomato = recipe_tomato[recipe_tomato['tag_value'].str.contains('tomato soup', na=False)]

In [10]:
len(recipe_tomato)

2481

In [9]:
recipe_tomato.head()

Unnamed: 0,id,tag_value,ingredients,steps,link,source,ner
645,645,tomato soup cake,"[""3/4 c. shortening"", ""1 1/2 c. sugar"", ""1 can...","[""Pour in greased and floured 9 x 13-inch pan....",www.cookbooks.com/Recipe-Details.aspx?id=378813,Gathered,"[""shortening"", ""sugar"", ""tomato soup"", ""water""..."
2059,2059,tomato soup congealed salad,"[""2 Tbsp. Knox gelatin"", ""1/2 c. cool water"", ...","[""Dissolve gelatin in cool water."", ""Heat crea...",www.cookbooks.com/Recipe-Details.aspx?id=315478,Gathered,"[""gelatin"", ""water"", ""mayonnaise"", ""celery"", ""..."
3062,3062,tomato soup cakespice cake,"[""3/4 c. Crisco"", ""1 1/2 c. sugar"", ""1 can of ...","[""Blend Crisco and sugar."", ""Combine soup with...",www.cookbooks.com/Recipe-Details.aspx?id=46941,Gathered,"[""Crisco"", ""sugar"", ""tomato soup"", ""water"", ""b..."
3353,3353,georgias tomato soup,"[""1 gal. tomato juice"", ""4 sweet peppers"", ""4 ...","[""Heat juice and mix in cornstarch, salt and s...",www.cookbooks.com/Recipe-Details.aspx?id=709508,Gathered,"[""tomato juice"", ""sweet peppers"", ""onions"", ""b..."
7085,7085,curried tomato soup,"[""4 Tbsp. butter"", ""3/4 c. chopped onion"", ""2 ...","[""Heat the butter in a saucepan; add the onion...",www.cookbooks.com/Recipe-Details.aspx?id=558009,Gathered,"[""butter"", ""onion"", ""curry powder"", ""Italian t..."
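
The rows above show that a substring match also keeps dishes that merely use tomato soup, such as 'tomato soup cake'. The sketch below contrasts the substring rule with a hypothetical stricter rule (title ends with the phrase). The stricter rule is only an illustration and is not what this notebook does; it would also discard titles such as 'tomato soup with basil'.

```python
# Cleaned titles taken from the head() output above.
titles = [
    'tomato soup cake',
    'tomato soup congealed salad',
    'georgias tomato soup',
    'curried tomato soup',
]

# Rule used in the notebook: the phrase appears anywhere in the title.
substring_match = [t for t in titles if 'tomato soup' in t]

# Hypothetical stricter rule: the title must end with the phrase.
suffix_match = [t for t in titles if t.endswith('tomato soup')]

print(len(substring_match))  # 4
print(suffix_match)          # ['georgias tomato soup', 'curried tomato soup']
```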


# Process ner column 

In [11]:
# lowercase the ner column (casefold is a more aggressive form of lower)
recipe_tomato['ner'] = recipe_tomato['ner'].str.casefold()

In [12]:
# remove punctuation from the ner column (tag_value was already cleaned above)

recipe_tomato['ner'] = recipe_tomato['ner'].str.translate(punctuations)

In [13]:
# drop rows with no tag_value or ner

recipe_tomato = recipe_tomato[recipe_tomato['tag_value'].notna()]
recipe_tomato = recipe_tomato[recipe_tomato['ner'].notna()]
recipe_tomato = recipe_tomato[recipe_tomato['ner'] != '']

In [14]:
# split ner components to make a list out of them.

recipe_tomato['ner'] = recipe_tomato['ner'].str.split(',')

In [15]:
# remove spaces before/after items of list

recipe_tomato['ner'] = [[val.strip() for val in sublist] for sublist in recipe_tomato['ner'].values]

In [16]:
# I noticed that some ner items start with "a " (e.g. "a milk"), so we remove that prefix.

recipe_tomato['ner'] = [[re.sub('^a ', '', val) for val in sublist] for sublist in recipe_tomato['ner'].values]
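
As a quick check of what the pattern `'^a '` matches, the sketch below applies it to a few made-up items. Only a leading "a" followed by a space is removed; words that merely start with the letter "a" are left alone.

```python
import re

# Made-up items for illustration.
samples = ['a milk', 'avocado', 'an egg', 'pasta a la vodka']

# '^' anchors the match at the start of the string, so only a leading "a " is stripped.
cleaned = [re.sub('^a ', '', s) for s in samples]
print(cleaned)  # ['milk', 'avocado', 'an egg', 'pasta a la vodka']
```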

In [17]:
# remove spaces before/after items of list once again

recipe_tomato['ner'] = [[val.strip() for val in sublist] for sublist in recipe_tomato['ner'].values]

In [18]:
# remove empty items from lists in ner column

recipe_tomato['ner'] = recipe_tomato['ner'].apply(lambda row: list(filter(None, row)))

In [19]:
# remove items from lists in ner column which are only 1 character (e.g. 'm')

recipe_tomato['ner'] = recipe_tomato['ner'].apply(lambda row: [item for item in row if len(item) > 1])

In [20]:
# remove duplicate items in each row of ner column

recipe_tomato['ner'] = recipe_tomato['ner'].apply(lambda row: list(set(row)))
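
Note that `list(set(row))` does not preserve the original order of the items. Order does not affect the frequency counts computed later, but if it ever matters, `dict.fromkeys` is one order-preserving alternative. A minimal sketch with a made-up row:

```python
# Made-up row with one duplicate.
row = ['salt', 'onion', 'salt', 'garlic']

# set() removes duplicates but gives no ordering guarantee.
unordered = list(set(row))

# dict keys are unique and keep insertion order, so the first occurrence wins.
ordered = list(dict.fromkeys(row))

print(sorted(unordered))  # ['garlic', 'onion', 'salt']
print(ordered)            # ['salt', 'onion', 'garlic']
```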

In [21]:
# remove recipes that have fewer than two ner items.

recipe_tomato = recipe_tomato[recipe_tomato['ner'].str.len() > 1]
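
The cleaning steps for the ner column can also be read as one function applied to a single raw string. The sketch below is a plain-Python restatement under that assumption. The function name `clean_ner` and the sample input are hypothetical, and the result is sorted only to make the output reproducible (the notebook uses `list(set(...))`).

```python
import re

# Same deletion table as in the notebook (the comma is kept).
PUNCT = str.maketrans('', '', '!"#$%&\'()*+-./:;<=>?@[\\]^_`{|}~')

def clean_ner(raw):
    """Apply the notebook's per-row cleaning steps to one raw NER string."""
    text = raw.casefold().translate(PUNCT)              # lowercase, delete punctuation
    items = [part.strip() for part in text.split(',')]  # split on commas, trim spaces
    items = [re.sub('^a ', '', item).strip() for item in items]  # drop a leading "a "
    items = [item for item in items if len(item) > 1]   # drop empty and 1-character items
    return sorted(set(items))                           # deduplicate (sorted for stable output)

cleaned = clean_ner('["a Milk", "Tomatoes", "m", "", "tomatoes", "Salt"]')
print(cleaned)  # ['milk', 'salt', 'tomatoes']

# The last notebook step keeps a recipe only if more than one item remains.
keep_recipe = len(cleaned) > 1
```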

In [56]:
# 20 most frequent ner values in tomato soups (trimmed to 15 below)

ingredients = recipe_tomato['ner'].explode().value_counts()
output = ingredients.head(20)
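
For readers unfamiliar with `explode`, the sketch below shows the counting step on three made-up recipes: `explode()` turns each list element into its own row, and `value_counts()` then counts how many recipes mention each ingredient (each list was deduplicated earlier, so a recipe contributes at most one count per ingredient). The `Counter` comparison is only a cross-check.

```python
from collections import Counter
import pandas as pd

# Three made-up recipes, each already cleaned and deduplicated.
ner = pd.Series([
    ['tomatoes', 'salt'],
    ['tomatoes', 'garlic', 'salt'],
    ['tomatoes', 'basil'],
])

# One row per ingredient, then a frequency table sorted from most to least common.
counts = ner.explode().value_counts()
print(counts['tomatoes'], counts['salt'], counts['garlic'])  # 3 2 1

# Cross-check with the standard library.
expected = Counter(item for row in ner for item in row)
```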

In [57]:
output

tomatoes         1586
salt             1463
garlic           1062
onion            1009
sugar             800
olive oil         663
butter            661
flour             517
water             502
chicken broth     428
basil             424
tomato soup       403
pepper            399
milk              372
celery            355
onions            301
thyme             273
tomato paste      251
baking soda       237
fresh basil       237
Name: ner, dtype: int64

In [58]:
# Some entries are near-duplicates of others (e.g. 'onions', 'fresh basil') or can be removed to get a cleaner list.

output = output.drop(['water', 'tomato soup', 'tomato paste', 'fresh basil', 'onions'])
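
Dropping 'onions' and 'fresh basil' discards their counts. An alternative, sketched below and not used in this notebook, is to merge such variants into one name before counting. The alias mapping, the function name `count_with_aliases`, and the sample recipes are all hypothetical. Merging has to happen per recipe, so that a recipe listing both 'onion' and 'onions' is still counted once.

```python
from collections import Counter

# Hypothetical mapping from variant to canonical name.
ALIASES = {'onions': 'onion', 'fresh basil': 'basil'}

def count_with_aliases(recipes, aliases):
    """Count recipes per ingredient after merging aliased names within each recipe."""
    counts = Counter()
    for items in recipes:
        # A set keeps each canonical name at most once per recipe.
        merged = {aliases.get(item, item) for item in items}
        counts.update(merged)
    return counts

# Made-up recipes for illustration.
recipes = [
    ['onion', 'onions', 'salt'],
    ['onions', 'basil'],
    ['fresh basil', 'salt'],
]
counts = count_with_aliases(recipes, ALIASES)
print(counts['onion'], counts['basil'], counts['salt'])  # 2 2 2
```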

In [61]:
output.to_csv('output.csv', header=False)

In [62]:
output

tomatoes         1586
salt             1463
garlic           1062
onion            1009
sugar             800
olive oil         663
butter            661
flour             517
chicken broth     428
basil             424
pepper            399
milk              372
celery            355
thyme             273
baking soda       237
Name: ner, dtype: int64