# Notebook 3 - Ingredient Mapping

Since different recipes and datasets format recipes and ingredients differently, it is necessary to combine all references to the same ingredient under the same name. For example, "2 eggs", "1 lightly beaten egg" and "eggs" all refer to the same ingredient, "egg". It is also necessary to account for regional naming differences between ingredients, as seen with "coriander" and "cilantro".

As one final step, it is necessary to match the unified ingredient names to the names used by our compound database *fooDB*.

This process can be summarised as:

> assorted *individual* ingredient names $\Rightarrow$ *unified set* of names $\Rightarrow$ ingredient names in *fooDB*

In [None]:
#import libraries
import json
import csv
import re #regular expressions
from itertools import combinations

#fuzzy string matching
# >> pip install fuzzywuzzy
# >> pip install python-Levenshtein
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

##### Importing recipe data
The stored format in each `.csv` file is:

``recipeName, ingredient1, ingredient2, ...``

This is repeated for all recipes in that dataset. Details on how this was achieved can be found in the part 1 submission of this project.
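As a small illustration of this row layout, the sketch below parses two made-up rows from an in-memory string rather than the real files. The recipe names and ingredients are invented for the example, and the trailing empty cells are an assumption about how shorter rows might be padded.

```python
import csv
import io

# Two made-up rows in the stored format: recipeName, ingredient1, ingredient2, ...
sample = "Pancakes,2 eggs,flour,milk\nBorscht,beetroot,cabbage,,\n"

parsed = []
for row in csv.reader(io.StringIO(sample)):
    name = row[0]                                  # first cell is the recipe name
    items = [cell for cell in row[1:] if cell != ""]  # drop empty padding cells
    parsed.append((name, items))

for name, items in parsed:
    print(name, "->", items)
```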

In [None]:
#read data from files
filepaths = ['data/final/allrecipes_csv1.csv', 'data/final/kaggle_csv1.csv', 'data/final/supplementary_csv1.csv']

recipes = []

for path in filepaths:
    with open(path) as file:
        reader = csv.reader(file)
        for row in reader:
            recipes.append(row)

print("Read %d recipes from %d files." % (len(recipes), len(filepaths)) )

We will define a small preprocessing function to try to reduce the number of distinct ingredients across all the recipes. This will:
- remove descriptive words, such as "chopped", "baked" and "large"
- set everything to lowercase letters
- perform substitutions, such as "_" $\rightarrow$ " " and "broth" $\rightarrow$ "stock"
- trim/remove stray whitespace characters

In [None]:
replaceList = {"_":" ", "broth": "stock"}
removeList = ["baked", "smoked", "roasted", "roast", "shredded", "sliced", "canned"
                , "chopped", "cooked", "dried", "dry", "ground", "extra large", "large"
                , "low-fat", "light", "medium", "mixed", "minced", "nonfat", "pickled"
                , "peeled", "pitted", "reduced sodium", "reduced-fat", "unflavored"
                , "unsweetened", "unsalted", "beaten", "fried", "whole", "wide"]

def preprocess(ingredient):
    ingr = ingredient.lower()
    for target in replaceList:
        ingr = ingr.replace(target, replaceList[target])
    for target in removeList:
        ingr = ingr.replace(target, "")
    ingr = re.sub(' {2,}', ' ', ingr) #remove '  ' and '   ' etc
    return ingr.strip()
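Because `str.replace` works on substrings, a removal term such as "ground" is also stripped from inside longer words (for example "groundnut" becomes "nut"). The sketch below is one possible whole-word variant, not the function used to build the mappings in this notebook. The name `preprocess_whole_words` and the shortened lists are hypothetical, and switching to it would change the preprocessed keys that the completed map relies on.

```python
import re

# Shortened, illustrative versions of the replace/remove lists (assumption)
replace_map = {"_": " ", "broth": "stock"}
remove_words = ["extra large", "large", "chopped", "dry", "light", "ground"]

# Try longer phrases first so "extra large" is matched before "large"
ordered = sorted(remove_words, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b")

def preprocess_whole_words(ingredient):
    ingr = ingredient.lower()
    for target, replacement in replace_map.items():
        ingr = ingr.replace(target, replacement)
    ingr = pattern.sub("", ingr)        # remove whole words/phrases only
    ingr = re.sub(r" {2,}", " ", ingr)  # collapse runs of spaces
    return ingr.strip()

print(preprocess_whole_words("chopped groundnut"))
print(preprocess_whole_words("Extra Large eggs"))
```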

Now we can collate all the ingredients, with some preprocessing, into a single combined set (no repeated elements).

In [None]:
ingredientCount = 0
ingredients = set()

for recipe in recipes:
    for ingredient in recipe[1:]:
        if ingredient != "":
            ingredientCount += 1
            ingredients.add(preprocess(ingredient))


print("%d distinct ingredients from a total of %d" % (len(ingredients), ingredientCount) )

ingredients = sorted(ingredients)

After preprocessing, the number of distinct ingredients appears to be only about 10% of the total number of ingredient entries!

The next step is to match and map these to the desired *fooDB* ingredient names!

### Matching and Mapping
For this task we will try two approaches: directly matching the names against the existing names in the *fooDB* database, and using fuzzy string matching to find a good mapping.

This makes use of 2 files of data extracted from the *fooDB* database:
- `ingredients.json` which contains a list of ingredient names found in the database
- `common-names.json` which contains a mapping from common names/variations of an ingredient to the name in *fooDB*

An excerpt of `common-names.json` is below:

```json
{
    ...
    "Dill weed, dried": "Dill",
    "Dill, raw": "Dill",
    "Spices, dill seed": "Dill",
    "Spices, dill weed, dried": "Dill",
    "Dill weed, fresh": "Dill"
    ...
}
```

In [None]:
# load file data
with open("data/notebook-3-data/common-names.json") as file:
    referenceNames = json.load(file)

with open("data/notebook-3-data/ingredients.json") as file:
    referenceIngredients = json.load(file)

# initialise collections
ingredientNames = list(referenceNames.keys()) # collection of all common names
ingredientNames.extend(referenceIngredients) #add regular names to list for direct matching


# Create a map from the lowercase version of a common name, back to the original common name entry
# all common names are made lower case for better matching and this dictionary is used to restore the original version
unlowerMap = {}
for index, ingr in enumerate(ingredientNames):
    ingredientNames[index] = ingr.lower()
    unlowerMap[ingr.lower()] = ingr
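One property of this round trip worth knowing: if two reference names differ only in case, the later one overwrites the earlier one in the restore map. The sketch below shows this with made-up names and a hypothetical `collisions` list for spotting such cases; whether the real *fooDB* extracts contain any is not checked here.

```python
# Made-up names; the first two differ only in case
names = ["Dill", "dill", "Spices, dill seed"]

restore = {}      # lowercase name -> original name
collisions = []   # pairs of originals that share a lowercase form

for name in names:
    key = name.lower()
    if key in restore and restore[key] != name:
        collisions.append((restore[key], name))
    restore[key] = name  # later entries overwrite earlier ones

print(restore["dill"])
print(collisions)
```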
    

Before moving on to the actual matching, let's just check that the fuzzy matching is working...

In [None]:
#finds the closest common-name matches to the given string (and the confidence values)
process.extract("chicken wings", ingredientNames, scorer=fuzz.token_set_ratio)

In [None]:
#or just find the single closest match 
process.extractOne("bread crumbs", ingredientNames, scorer=fuzz.token_set_ratio)

In [None]:
#the match can then have its original case restored and be run back through the common-names map to find the foodb name
recipeIngredientName = "bread crumbs"

testMatch = process.extractOne(recipeIngredientName, ingredientNames, scorer=fuzz.token_set_ratio)
unlowered = unlowerMap[testMatch[0]]
foodbName = referenceNames[unlowered] if unlowered in referenceNames else unlowered

print("'%s' maps to '%s' in fooDB" % (recipeIngredientName, foodbName))

Using this method it is possible to create a map from our ingredient names to the names in *fooDB*. Only matches with a confidence greater than 80% are accepted; the rest are put into a separate section, along with the best guess, to be manually verified.

In [None]:
duds = {} # incorrect matches or those with insufficient confidence
matches = {} # matches with sufficient confidence

for index, ingredient in enumerate(ingredients):
    if index % 50 == 0 and index != 0:
        print("done %d items" % (index) )
    match = process.extractOne(ingredient, ingredientNames, scorer=fuzz.token_set_ratio)
    # restore the original case and look up the fooDB name for this match
    unlowered = unlowerMap[match[0]]
    foodbName = referenceNames[unlowered] if unlowered in referenceNames else unlowered
    if match[1] > 80:
        matches[ingredient] = foodbName
    else:
        duds[ingredient] = (foodbName, match[0], match[1]) # (best guess, matched name, confidence)

print("%d out of %d matched successfully (%f percent)" % (len(matches), len(ingredients), 100*len(matches)/len(ingredients)) )

As you can see, this took a while to process; however, it was able to "correctly" match **87.7%** of the ingredient names.

But... unfortunately not all of these mappings are correct, so they need to be manually checked. This process is significantly faster than manually writing mappings for ingredients; however, because this method only finds approximate matches, it cannot be relied on completely.

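While manually checking the mappings, we found that approximately an additional 5-6% of items required fixing. Taken as a share of all ingredient names, subtracting these from the 87.7% that matched leaves roughly 82% mapped correctly.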

*This means the fuzzy matching process is over 80% accurate!!!* 
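One way a token-set style scorer can return a high score for a wrong pairing is when every word of one name also appears in the other. The sketch below is a simplified, standard-library approximation of that idea built on `difflib`; it is not fuzzywuzzy's implementation, and `token_set_score` is a hypothetical name.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # similarity of two strings on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_score(a, b):
    tokens_a, tokens_b = set(a.split()), set(b.split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    # best of: shared words vs each full name, and the two full names
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))

print(token_set_score("chicken", "chicken wings"))  # shared words dominate
print(token_set_score("egg", "eggplant"))           # no shared whole word
```

Under this approximation a short name that is fully contained in a longer one scores as a perfect match, which is a reason to keep the manual check even for accepted matches.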

In [None]:
#saving the results to a file
with open("data/notebook-3-data/output.json", 'w') as outfile:
    output = {"successful": matches, "failed": duds}
    json.dump(output, outfile)

To manually correct mappings in this file, check all entries, fix those that need it and combine all mappings into the top level of the file as shown below:

**BEFORE**
```json
{
    "successful": {
        "a": "apple",
        "b": "banana",
        "c": "lamb",
        "d": "dragon fruit"
    },
    "failed": {
        "e": ["e", "rocket salad", 61]
    }
}

```

**AFTER**
```json
{
    "a": "apple",
    "b": "banana",
    "c": "carrot",
    "d": "dragon fruit",
    "e": "eggplant"
}

```
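The restructuring part of this step could be scripted as a starting point for the manual pass. The sketch below assumes the `output.json` layout shown in the BEFORE example; `flatten_output` is a hypothetical helper, and the best guesses it copies for failed entries still need to be checked and corrected by hand.

```python
import json

def flatten_output(output):
    # start from the accepted matches
    flat = dict(output["successful"])
    # failed entries are [best guess, matched name, confidence]; keep the guess
    for name, guess in output["failed"].items():
        flat[name] = guess[0]  # unverified placeholder
    return flat

before = {
    "successful": {"a": "apple", "b": "banana", "c": "lamb", "d": "dragon fruit"},
    "failed": {"e": ["e", "rocket salad", 61]},
}

flat = flatten_output(before)
print(json.dumps(flat, indent=4))
```

Note that this only merges the two sections; wrong entries such as "c" and "e" in the example are left for manual correction.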

To demonstrate the usage of the map, we will load in the completed version of the map.

In [None]:
with open("data/notebook-3-data/ingredientMap (preprocessed_to_foodb).json") as file:
    completedMap = json.load(file)

To use the mapping, first preprocess the recipe ingredients, then run them through the map...

*note that the first element of the array is the recipe title*

In [None]:
# recipe as read from original source
sampleRecipe = recipes[15]
sampleRecipe

In [None]:
# recipe with ingredients mapped into foodb friendly names
mappedRecipe = [sampleRecipe[0]] + [ completedMap[preprocess(ingr)] for ingr in sampleRecipe[1:] ]
mappedRecipe
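Indexing the map directly raises a `KeyError` if a preprocessed ingredient is missing from it. A more forgiving variant, sketched below, collects unmapped ingredients instead of failing. The helper `map_recipe`, the toy map and the recipe are all made up for illustration, and `str.lower` stands in for the notebook's `preprocess`.

```python
def map_recipe(recipe, mapping, preprocess):
    title, raw = recipe[0], recipe[1:]   # first element is the recipe title
    mapped, missing = [title], []
    for ingr in raw:
        key = preprocess(ingr)
        if key in mapping:
            mapped.append(mapping[key])
        else:
            missing.append(ingr)         # report instead of raising KeyError
    return mapped, missing

# Toy data; these names are not taken from fooDB
toy_map = {"eggs": "Egg", "cilantro": "Coriander"}
mapped, missing = map_recipe(["Omelette", "Eggs", "Cilantro", "Saffron"], toy_map, str.lower)

print(mapped)
print(missing)
```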

##### Mapping the Recipes
The final step is to map over all the ingredients of all the recipes.

Recall that the first item of the recipe is the title, which cannot be mapped.

In [None]:
for recipeIndex, recipe in enumerate(recipes):
    for ingrIndex, ingr in enumerate(recipe[1:]):
        recipes[recipeIndex][ingrIndex+1] = completedMap[preprocess(ingr)]
        
#output recipe 15 as example
recipes[15]

In [None]:
OUTPUT_PATH = "data/after-mapping/ourRecipes.csv"

with open(OUTPUT_PATH, "w", newline='') as outFile:
    writer = csv.writer(outFile)
    for recipe in recipes:
        writer.writerow(["Russian"]+recipe[1:]) #replace recipe name with cuisine

Huzzah! All the recipes have now been mapped!

### Combining the Dataset
A nice-to-have feature is a single source of data, which means that we need to combine all the recipes. Ideally this is very simple; however, it is necessary to remove overlap between the recipes. This is particularly relevant between the *supplementary dataset* provided by the paper and the *allrecipes* dataset, as the paper's supplementary material drew some of its recipes from an earlier version of the site.

In [None]:
# Reads recipes from a collection of files, removes any duplicate recipes and ingredients, and writes out the results
# Also counts all occurrences of paired ingredients
def removeOverlap(fileNameList, outFileName):
    # Read Recipes
    uniqueRecipes = set()
    dupes = dict()
    dupeCount = 0
    for file_ in fileNameList:
        with open(file_) as inFile:
            print("Parsing File: " + file_)
            reader = csv.reader(inFile)
            for row in reader:
                cuisine = row[0]
                ingredients = set([ingredient.strip() for ingredient in row[1:]]) #remove recipe's duplicate ingredients
                entry = (cuisine, tuple(sorted(ingredients)))
                #handle duplicates
                if entry in uniqueRecipes:
                    dupeCount += 1
                    if entry in dupes:
                        dupes[entry] += 1
                    else:
                        dupes[entry] = 1
                    print("Duplicate Entry: " + str(entry))
                    print("Occurring " + str(dupes[entry]) + " Times")
                uniqueRecipes.add(entry)
    print("Number of Duplicates is: " + str(dupeCount))

    # Write Recipe List
    with open(outFileName + "Recipes" + ".csv", "w", newline='') as outFile:
        writer = csv.writer(outFile)
        for recipe in sorted(uniqueRecipes):
            cuisine = recipe[0]
            ingredients = recipe[1]
            writer.writerow([cuisine] + list(ingredients))

    # Count Pairs
    pairCount = dict()
    for recipe in sorted(uniqueRecipes):
        #get all pairs of ingredients
        ingredientPairs = combinations(recipe[1], 2)
        for pair in ingredientPairs:
            if pair in pairCount:
                pairCount[pair] += 1
            else:
                pairCount[pair] = 1

    # Write Ingredient Pair Counts
    with open(outFileName + "IngredientPairings" + ".csv", "w", newline='') as outFile:
        writer = csv.writer(outFile)
        for pair, count in pairCount.items():
            writer.writerow([pair[0], pair[1], count])
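The core of this function is the recipe key, `(cuisine, sorted tuple of unique ingredients)`, which makes two rows equal regardless of ingredient order or repeats. The sketch below shows that idea on made-up rows, using `collections.Counter` for the pair counts as a compact alternative to the explicit dictionary.

```python
from collections import Counter
from itertools import combinations

# Made-up rows: the first two describe the same recipe in a different order
rows = [
    ["Russian", "Beetroot", "Cabbage", "Beetroot"],
    ["Russian", "Cabbage", "Beetroot"],
    ["Russian", "Cabbage", "Dill"],
]

# A set of (cuisine, sorted unique ingredients) keys drops duplicate recipes
unique = {
    (row[0], tuple(sorted(set(cell.strip() for cell in row[1:]))))
    for row in rows
}

# Count each ingredient pair once per unique recipe
pair_counts = Counter()
for _, ingredients in unique:
    pair_counts.update(combinations(ingredients, 2))

print(len(unique))
print(dict(pair_counts))
```

Because each ingredient tuple is sorted, a given pair always appears in the same order, so ("Beetroot", "Cabbage") and ("Cabbage", "Beetroot") are never counted separately.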

In [None]:
dataPath = "data/after-mapping/"
outPath = "data/no-overlap/"

# russian cuisine dataset
dataFileNames = ["ourRecipes.csv"]
mainRecipesFiles = [dataPath + fileName for fileName in dataFileNames]
mainOutputFileName = "noOverlap"
# recipes from other cuisines
otherRecipesFiles = [dataPath + "all_cuisines_csv1.csv"]
otherOutputFileName = "paperNoOverlap"

In [None]:
#remove overlap within the Russian recipes
removeOverlap(mainRecipesFiles, outPath + mainOutputFileName)

In [None]:
#remove overlap within the recipes from other cuisines
removeOverlap(otherRecipesFiles, outPath + otherOutputFileName)

*... And that's it!* All the data processing is done and the recipes are ready for analysis!!!