## CITS4403 Flavour Network Project Notebook 1
## Flavours of the World: How Russian Cuisine Fits Within a Network of Recipes, Ingredients and Compounds from Around the Globe
### Recipe Data
Authors: Alden Bong (22255844), Dylan Carpenter (21982288), Luke Joshua Carpenter (22110274)

Collating Russian and Eastern European Recipes from allrecipes.com, Kaggle, and (Ahn, Ahnert, Bagrow, Barabasi (2011)) Supplementary Materials.

In [None]:
import csv
import json
from itertools import combinations

# 1. Recipes

## 1.1. Reading Recipe Datasets

### 1.1.1. Collecting recipes from www.allrecipes.com

There are 3 main steps to collect the recipe information.

1. Collect the URLs of pages with Russian recipes
2. Scrape the collected URLs, parse the HTML and extract the actual recipe data
3. Clean the data

After cleaning, the data can be reformatted into whatever format is desired.

*Please note that this code was initially written and run in regular python scripts and then copied over into a notebook as per submission requirements*

### 1) Collecting page URLs

Because AllRecipes has no publicly accessible API, pages that contain recipes must be located through the website directly. As the website is dynamically populated, the URLs can be found with minimal manual effort and some client-side JavaScript run in the browser.

Instructions and the required code can be found in `data/allrecipes/recipeUrls.js`

### 2) Scraping data

For each recipe there are 2 tasks that need to be done: make an HTTP request to receive the webpage, then scrape the returned HTML document to extract the data we want.
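
The scraping half of this step can be tried without any network access. The following is a minimal sketch only: it uses the standard library's `html.parser` rather than BeautifulSoup, and the HTML fragment, the `ClassTextParser` class and its attribute names are all made up for illustration. It collects the text of every element carrying a given CSS class and drops empty entries.

```python
from html.parser import HTMLParser

class ClassTextParser(HTMLParser):
    """Hypothetical helper: collects the text of elements that carry a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside a matching element
        self.items = []     # stripped text of each matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            # nested tag inside a match (void tags such as <br> are not handled here)
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self._buffer).strip()
                if text:  # skip hidden/unused empty elements
                    self.items.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)

# Made-up fragment that only imitates an ingredient list
SAMPLE_HTML = """
<ul>
  <li><span class="recipe-ingred_txt">3 eggs, lightly beaten</span></li>
  <li><span class="recipe-ingred_txt">1 cup sour cream</span></li>
  <li><span class="recipe-ingred_txt"> </span></li>
</ul>
"""

parser = ClassTextParser("recipe-ingred_txt")
parser.feed(SAMPLE_HTML)
ingredients = parser.items
print(ingredients)
```

BeautifulSoup's CSS selectors express the same lookup in a single call, which is why the notebook code relies on it instead.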

In [None]:
import time

#pip install requests
import requests as req
#pip install beautifulsoup4
from bs4 import BeautifulSoup

#constants
URL_FILE_PATH = "data/allrecipes/recipe-urls.json"
OUPUT_FILE_PATH = "data/allrecipes/allrecipes-recipes-russian.json"
CUISINE = "Russian"

The following function processes a single url and returns the recipe on the page as a JSON-like object

In [None]:
#function to perform http request and parse returned HTML
def getRecipe(url):
    #make request
    print("GET \"%s\"" % (url), end="")
    res = req.get(url)
    print("  (%d)" % (res.status_code) )

    #fail if request was not OK
    if res.status_code != 200:
        raise Exception("Request was not OK")

    #create html parser
    page = BeautifulSoup(res.text, features="html.parser")

    #variables we aim to populate
    title, ingrs, ingrsBetter = "", [], []

    #check if "old" style recipe title is present then extract info
    titleTag = page.select("h1.headline.heading-content")
    if len(titleTag) == 1:
        #page is in "old" layout
        title = titleTag[0].getText().strip() #extract recipe title from h1 tag
        #get ingredients
        ingrTags = page.select(".ingredients-item") #get all ingredient item tags
        ingrs = [tag.select(".ingredients-item-name")[0].getText().strip() for tag in ingrTags] #extract ingredient text
        # in the old layout, ingredient entries have a standardised item field without quantity/procedure
        # e.g. if displayed ingredient = '3 lightly beaten eggs', standard item name = 'eggs'
        ingrsBetter = [tag.select(".checkbox-list-input")[0].get("value").strip() for tag in ingrTags]
    else:
        #page is in "new" layout
        title = page.select("h1#recipe-main-content")[0].getText().strip() #extract recipe title from h1 tag
        #get ingredients
        ingrTags = page.select(".recipe-ingred_txt:not(.white)") # get ingredient display tags (not .white removes "Add All Ingredients" listing)
        ingrs = [tag.getText().strip() for tag in ingrTags] #extract inner text from tag
        ingrs = list( filter(lambda x : len(x) > 0 , ingrs) ) #remove empty entries (hidden/unused HTML elements)
        ingrsBetter = None #does not exist in new layout

    #return information in dictionary format
    return {"name":title, "cuisine":CUISINE, "source":url, "ingredients":ingrs, "ingredientsPreprocessed":ingrsBetter}

First load the urls from Part 1

In [None]:
#load urls
urls = []
print("Reading from '%s'" % (URL_FILE_PATH))
with open(URL_FILE_PATH) as file:
    urls = json.load(file)

...then process each url

In [None]:
#get recipes
recipes = []
failedUrls = []

print("%d urls read.\nSending requests..." % (len(urls)) )
for url in urls:
    try:
        recipes.append( getRecipe(url) )
    except Exception as e:
        #getRecipe raises an exception if the response code is not 200 (OK) or the page cannot be parsed
        failedUrls.append(url)

    #sleep a bit to not look like a DOS attack (just in case)
    time.sleep(0.25)

#print failed requests (if any)
print("\nDone. (%d failed requests)" % (len(failedUrls)))
if len(failedUrls) > 0:
    for url in failedUrls: print("\t%s" % (url) )

Finally, save the recipes as JSON data for cleaning

In [None]:
#save json
print("\nSaving %d recipes to '%s'\n" % (len(recipes), OUPUT_FILE_PATH) )
with open(OUPUT_FILE_PATH, 'w') as outfile:
    json.dump(recipes, outfile)

### 3) Cleaning the data

Cleaning used a small dictionary to replace certain terms, but was otherwise done manually to make sure the ingredients end up in a consistent, friendly format.

A small amount of automation picks the most likely previously seen ingredients and suggests them to the user.
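
The suggestion step can be sketched in isolation. This is an illustration under assumptions rather than the exact notebook code: `suggest_ingredients` is a hypothetical name, the sample entries are invented, and the candidates are sorted only to make the output deterministic.

```python
def suggest_ingredients(raw_ingredient, substitutions, seen_ingredients):
    """Return candidate clean names for a raw ingredient string."""
    # An exact match in the substitution dictionary wins outright
    if raw_ingredient in substitutions:
        return [substitutions[raw_ingredient]]
    # Otherwise suggest every previously seen name contained in the raw text
    lowered = raw_ingredient.lower()
    return sorted(name for name in seen_ingredients if name in lowered)

# Hypothetical sample data
substitutions = {"3 lightly beaten eggs": "egg"}
seen = {"egg", "sour cream", "cream", "dill"}

exact = suggest_ingredients("3 lightly beaten eggs", substitutions, seen)
partial = suggest_ingredients("1 cup Sour Cream", substitutions, seen)
print(exact, partial)
```

Substring matching is deliberately loose: both `cream` and `sour cream` are offered for the second input, and the user picks the right one.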

In [None]:
#constants
RECIPE_FILE_PATH = "data/allrecipes/allrecipes-recipes-russian.json"
OUPUT_FILE_PATH = "data/allrecipes/recipes-final.json"

#helper function: True if the text can be parsed as an integer
def intIsParsable(text):
    try:
        int(text)
        return True
    except ValueError:
        return False

Load the data and create the dictionaries and lookup sets as required

In [None]:
recipes = []

with open(RECIPE_FILE_PATH, 'r') as file:
    recipes = json.load(file)

updatedRecipes = []
ingredientSubs = {}

with open("data/allrecipes/ingredient-substitution.json", 'r') as file:
    ingredientSubs = json.load(file)

seenIngredients = set()
for ingredient in ingredientSubs.values():
    seenIngredients.add(ingredient)

The loop below is fairly monstrous. It walks through the ingredients of every recipe and asks the user to confirm or enter the name of each ingredient.

A dictionary of ingredients is built up over time, and suggestions for the most likely ingredient are drawn from it.

The process is fairly manual at the start, but as the number of seen ingredients grows it becomes significantly faster and more automated.
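
The way a single line of user input is interpreted can be sketched as a pure function, which makes the prompt grammar easier to follow and to test without typing anything. `resolve_response` is a hypothetical name, and using `str.isdecimal()` in place of the notebook's integer-parsing helper is a simplification made here.

```python
def resolve_response(response, ingredient, candidates):
    """Map one line of user input to a cleaned ingredient name, or None to drop it."""
    if response == "":
        # empty input (return key): accept the ingredient text as displayed
        return ingredient.lower()
    if response in ("d", "-"):
        # delete: leave this ingredient out of the cleaned recipe
        return None
    if response.isdecimal():
        # a number selects one of the suggested candidates by index
        return candidates[int(response)].lower()
    # anything else is taken as a typed replacement name
    return response.lower()

candidates = ["cream", "sour cream"]  # hypothetical suggestions
accepted = resolve_response("", "Sour Cream", candidates)
dropped = resolve_response("d", "water", candidates)
picked = resolve_response("1", "1 cup sour cream", candidates)
typed = resolve_response("Smetana", "1 cup sour cream", candidates)
print(accepted, dropped, picked, typed)
```

Every name returned (other than `None`) would also be added to the set of seen ingredients, which is what makes later suggestions better.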

**NOTE THIS CELL REQUIRES SIGNIFICANT MANUAL WORK!**

**DO NOT ATTEMPT TO COMPLETE IT; SIMPLY SKIP TO THE NEXT ONE**

In [None]:
for index, recipe in enumerate(recipes):
    print("%s (%d/%d):" % (recipe["name"], index+1, len(recipes)) )
    #make new recipe for updated ingredients
    newRecipe = {"name": recipe["name"], "source": recipe["source"], "ingredients": []}
    #loop through ingredients (use preprocessed if they are there)
    ingrs = recipe["ingredients"] if recipe["ingredientsPreprocessed"] is None else recipe["ingredientsPreprocessed"]
    for ingredient in ingrs:
        #create suggestions for ingredients
        candidates = []
        if ingredient in ingredientSubs:
            candidates.append(ingredientSubs[ingredient])
        else:
            candidates.extend( list(filter(lambda ingr: ingr in ingredient.lower(), seenIngredients)) )

        #construct input prompt
        prompt = "[txt|ret|del] "
        for ind, candidate in enumerate(candidates):
            prompt += "[%d|%s] " % (ind, candidate)
        #get input
        print(ingredient)
        response = input(prompt + ": ")
        #handle response
        if response == "":
            newRecipe["ingredients"].append(ingredient.lower())
            seenIngredients.add(ingredient.lower())
        elif response == 'd' or response == '-':
            pass
        elif intIsParsable(response):
            candidate = candidates[int(response)]
            newRecipe["ingredients"].append(candidate.lower())
            seenIngredients.add(candidate.lower()) #technically redundant but can't hurt (much)
        else:
            newRecipe["ingredients"].append(response.lower())
            seenIngredients.add(response.lower())

    updatedRecipes.append(newRecipe)
    print()

    #backup every 20 recipes
    if index % 20 == 0:
        print("\nBACKING UP\n\n")
        with open("backup.json", 'w') as backup:
            json.dump(updatedRecipes, backup)

**SAVE ALL THE RECIPES** - this is an important step if you just spent a few hours processing the files...

*Note that for the purposes of this notebook the filepath does not point to the actual output used (as it would truncate our actual data file!)*

In [None]:
#save all recipes
with open(OUPUT_FILE_PATH, 'w') as outfile:
    json.dump(updatedRecipes, outfile)

Done! All recipes are now cleaned and saved in a json file!

### Formatting the recipes

This section of code reformats the JSON data into the CSV formats used by the flavour network [paper](https://www.nature.com/articles/srep00196 "Flavor Network and the Principles of Food Pairing")
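
Two CSV files are produced: one row per recipe (cuisine followed by its ingredients) and one row per ingredient pair with the number of recipes containing both. The counting can be sketched on toy data as follows. The recipes are invented, and sorting each ingredient list before pairing is an assumption made here so that a pair is counted the same way regardless of the order in which a recipe lists its ingredients.

```python
from collections import Counter
from itertools import combinations

# Invented toy recipes; the second has no cuisine label
recipes = [
    {"cuisine": "Russian", "ingredients": ["beet", "cabbage", "dill"]},
    {"ingredients": ["dill", "beet", "sour cream"]},
]

rows = []                # would become the rows of the first CSV file
pair_counts = Counter()  # would become the rows of the second CSV file
for recipe in recipes:
    # fall back to a default cuisine when the label is missing
    cuisine = recipe.get("cuisine", "Russian")
    rows.append([cuisine] + recipe["ingredients"])
    # sort so that (beet, dill) and (dill, beet) are the same pair
    pair_counts.update(combinations(sorted(set(recipe["ingredients"])), 2))

print(rows)
print(pair_counts)
```

Here `beet` and `dill` appear together in both recipes, so that pair has a count of 2 while every other pair has a count of 1.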

In [None]:
def format_json(inFileName, outFileName, defaultCuisine="Russian"):
    # pair frequency table: (ingredient_a, ingredient_b) -> number of recipes containing both
    connected_components = {}

    # read the whole recipe list, closing the file straight away
    with open(inFileName) as f:
        data = json.load(f)

    # file 1: one row per recipe (cuisine followed by its ingredients)
    with open(outFileName + "1.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)

        for recipe in data:
            # fall back to the default when a recipe has no cuisine label
            cuisine = recipe.get("cuisine", defaultCuisine)
            ingredient_list = recipe['ingredients']

            writer.writerow([cuisine] + ingredient_list)
            # sort so that (a, b) and (b, a) are counted as the same pair
            ingredient_pairs = combinations(sorted(set(ingredient_list)), 2)

            # Update Ingredient Pairing Frequency
            for pair in ingredient_pairs:
                connected_components[pair] = connected_components.get(pair, 0) + 1

    print(connected_components)

    # file 2: one row per ingredient pair with its frequency
    with open(outFileName + "2.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for key, value in connected_components.items():
            writer.writerow([key[0], key[1], value])

In [None]:
DATAPATH = "data/"
WRITEPATH = "data/supplementary_dataset/formatted_csvs/"

inFileName = DATAPATH + "allrecipes/allrecipes-recipes-final.json"
outFileName = WRITEPATH + "allrecipe_csv"
format_json(inFileName, outFileName)

### 1.1.2. Kaggle
The Kaggle Dataset was retrieved from https://www.kaggle.com/kaggle/recipe-ingredients-dataset/home#train.json and is labelled "train.json"

The Kaggle dataset is in a JSON format like the scraped allrecipes data, so we reuse the function `format_json()`

*Note that this code may not run in a notebook due to a maximum data IO rate (in which case the relevant code should be run as a regular python script)*

In [None]:
# Kaggle
inFileName = DATAPATH + "train_dataset/train.json"
outFileName = WRITEPATH + "kaggle_csv"
format_json(inFileName, outFileName)

### 1.1.3. (Ahn, Ahnert, Bagrow, Barabasi (2011)) Supplementary Materials
The Supplementary Materials Dataset was retrieved from the flavour network [paper](https://www.nature.com/articles/srep00196 "Flavor Network and the Principles of Food Pairing")

We rewrite `format_json()` to read in CSVs instead of JSONs and filter for Eastern European recipes
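
The filtering idea can be sketched on an in-memory sample. The rows below are invented and only imitate the layout the code expects (cuisine first, then comma-separated ingredients); the leading comment line and the `filter_cuisine` helper are assumptions for illustration.

```python
import csv
import io

# Invented sample in the expected layout: cuisine, ingredient, ingredient, ...
SAMPLE = """# hypothetical comment line
EasternEuropean,beet,cabbage,dill
African,chicken,cinnamon
EasternEuropean,potato,dill
"""

def filter_cuisine(lines, cuisine):
    """Return the rows whose first field equals the requested cuisine."""
    rows = []
    for row in csv.reader(lines):
        # blank lines give empty rows; comment lines never match a cuisine name
        if row and row[0] == cuisine:
            rows.append(row)
    return rows

eastern = filter_cuisine(io.StringIO(SAMPLE), "EasternEuropean")
print(eastern)
```

Each kept row has the same shape as a row of the first CSV written by `format_json()`, so the pair counting can proceed unchanged.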

In [None]:
PATH = "data/supplementary_dataset/"

def format_csv(filePath):
    # pair frequency table: (ingredient_a, ingredient_b) -> number of recipes containing both
    connected_components = {}

    with open(PATH + "formatted_csvs/supplementary_csv1.csv", 'w', newline="") as writtenfile:
        writer = csv.writer(writtenfile)
        with open(filePath, newline="") as csvfile:
            # each data row is: cuisine, ingredient, ingredient, ...
            reader = csv.reader(csvfile)
            for row in reader:
                # skip blank lines; comment lines never match the cuisine name
                if not row:
                    continue
                cuisine = row[0]
                if cuisine == 'EasternEuropean':
                    ingredient_list = row[1:]
                    writer.writerow([cuisine] + ingredient_list)
                    # sort so that (a, b) and (b, a) are counted as the same pair
                    ingredient_pairs = combinations(sorted(set(ingredient_list)), 2)

                    # Update Ingredient Pairing Frequency
                    for pair in ingredient_pairs:
                        connected_components[pair] = connected_components.get(pair, 0) + 1

    # Write Ingredient Pairing Frequency
    with open(PATH + "formatted_csvs/supplementary_csv2.csv", 'w', newline="") as csvfile:
        writer = csv.writer(csvfile)
        for key, value in connected_components.items():
            writer.writerow([key[0], key[1], value])

filePath = 'srep00196-s3.csv'
format_csv(PATH + filePath)