## Recipe Data Preprocessing

The purpose of this notebook is to determine the best way to preprocess ingredient information from recipes. The goal is to parse each ingredient line into a list of ingredients and their quantities. The starting point is the raw recipe below, which contains Unicode characters.

There are two parts to an ingredient line:

1. Quantity
2. Ingredient

The challenge is determining where to split each line between the two.

In [1]:

raw_recipe_data = '▢8 ounces whole-wheat elbow noodles \
(2 cups) \
\
▢10 ounce package frozen chopped broccoli \
or fresh broccoli florets – cut small \
▢10 oz bag of baby spinach or baby kale \
optional \
▢2-3 cloves fresh garlic \
minced \
▢1¾ cups 1% milk \
divided \
▢3 tablespoons All Purpose flour \
▢½ teaspoon garlic powder \
▢½-1 teaspoon salt \
▢¼ teaspoon ground black pepper \
\
▢¾ cup shredded extra-sharp Cheddar cheese \
▢¼ cup shredded Parmesan cheese \
▢¼-½ teaspoon Dijon mustard \
omit if you don’t like the flavor of mustard \
▢⅛-¼ tsp crushed red pepper'

print(raw_recipe_data)

▢8 ounces whole-wheat elbow noodles (2 cups) ▢10 ounce package frozen chopped broccoli or fresh broccoli florets – cut small ▢10 oz bag of baby spinach or baby kale optional ▢2-3 cloves fresh garlic minced ▢1¾ cups 1% milk divided ▢3 tablespoons All Purpose flour ▢½ teaspoon garlic powder ▢½-1 teaspoon salt ▢¼ teaspoon ground black pepper ▢¾ cup shredded extra-sharp Cheddar cheese ▢¼ cup shredded Parmesan cheese ▢¼-½ teaspoon Dijon mustard omit if you don’t like the flavor of mustard ▢⅛-¼ tsp crushed red pepper
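
Because the printed recipe is one long string, a first step could be to break it back into one entry per ingredient. The sketch below assumes every ingredient starts with the `▢` checkbox character, which holds for this sample but may not hold for other sources. `split_ingredient_lines` is a hypothetical helper, not part of `recipe_preprocessing`.

```python
def split_ingredient_lines(raw_text, marker="▢"):
    """Split raw recipe text into one string per ingredient.

    Assumes each ingredient begins with `marker` (an assumption based on
    the sample recipe above).
    """
    parts = raw_text.split(marker)
    # Collapse runs of whitespace and drop empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]


sample = (
    "▢8 ounces whole-wheat elbow noodles (2 cups) "
    "▢2-3 cloves fresh garlic minced "
    "▢1¾ cups 1% milk divided"
)

for ingredient_line in split_ingredient_lines(sample):
    print(ingredient_line)
```

Notes such as "minced" or "divided" stay attached to their ingredient line here; removing them is left to the text preprocessing step.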


## Text Preprocessing

Using the raw ingredients list above, text preprocessing needs to do the following:

1. Convert to lower case
2. Remove unnecessary words
3. Remove measure words
4. Remove words representing numbers
5. Remove characters representing fractions such as ½ and ¼

Based on this, the initial implementation will append each split word to full_ingredient_name if it does not contain an unnecessary word.
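
The five steps above can be sketched roughly as follows. This is only an illustration of the idea, not the code in `recipe_preprocessing`: the word lists and the names `clean_line` and `is_numeric_token` are assumptions made up for this example, and real lists would need to be much longer.

```python
import unicodedata

# Illustrative word lists (assumptions, not taken from recipe_preprocessing).
MEASURE_WORDS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon",
                 "teaspoons", "tsp", "tbsp", "oz", "ounce", "ounces"}
NUMBER_WORDS = {"one", "two", "three", "half", "quarter"}
UNNECESSARY_WORDS = {"of", "optional", "divided"}


def is_numeric_token(word):
    """True for tokens made only of numeric characters, e.g. '3', '¼-½', '1/2'."""
    chars = [ch for ch in word if ch not in "-/."]
    # unicodedata.numeric returns the default (None) for non-numeric characters.
    return bool(chars) and all(
        unicodedata.numeric(ch, None) is not None for ch in chars
    )


def clean_line(line):
    """Return the ingredient name with quantities and filler words removed."""
    kept = []
    # Step 1: lower case; the checkbox marker is treated as whitespace.
    for word in line.lower().replace("▢", " ").split():
        # Steps 2-4: drop unnecessary, measure and number words.
        if word in UNNECESSARY_WORDS or word in MEASURE_WORDS or word in NUMBER_WORDS:
            continue
        # Step 5: drop digits and fraction characters.
        if is_numeric_token(word):
            continue
        kept.append(word)
    return " ".join(kept)


for line in ["One Cup flour", "▢3 tablespoons All Purpose flour",
             "¼-½ tsp hot chili powder", "1 red pepper"]:
    print(clean_line(line))
```

Using `unicodedata.numeric` avoids hard-coding every fraction character, since it reports a numeric value for `½`, `¼`, `⅛` and similar characters as well as for plain digits.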

In [3]:
import sys
sys.path.append('../')

import recipe_preprocessing

lines_to_preprocess = [
    'One Cup flour',
    "▢3 tablespoons All Purpose flour",
    '¼-½ tsp hot chili powder',
    '1 red pepper',
]

preprocessed_words = []

for line in lines_to_preprocess:
    for word in line.split():
        preprocessed_word = recipe_preprocessing.preprocess_word(word)
        preprocessed_words.append(preprocessed_word)

print(preprocessed_words)

['One', 'Cup', 'flour', '▢3', 'tablespoons', 'All', 'Purpose', 'flour', '¼-½', 'tsp', 'hot', 'chili', 'powder', '1', 'red', 'pepper']


In [None]:
import sys
sys.path.append('../')

import recipe_preprocessing

ingredient_lines = [
    '1 oz chicken',
    '200 grams baby potatoes',
    'pinch of basil',
    '3 cloves of garlic',
    '1 red pepper',
    'half cup flour',
    'one medium pepper',
    '¼-½ tsp hot chili powder',
    '1/2 teaspoon paprika',
    '6 large eggs',
    '2 tablespoons olive oil',
]

words = recipe_preprocessing.get_ingredient_list(ingredient_lines)
print(words)

## Recipe Preprocessing

The first step in text preprocessing is removing all unnecessary characters, stopwords, etc. from the text. This is fairly trivial.

The next step involves separating the ingredient from the quantity so that the nutrition information can be calculated. Most recipe lines will likely follow the syntax below:

'1 slice of lime'
'1 oz organic turkey'

There are likely to be two sections within an ingredient line:

1. Quantity
2. Ingredient

One approach involves populating a quantity_word list, which can be used to separate each ingredient line into these two sections.
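
A minimal sketch of that approach is shown below, assuming the quantity always comes first in the line. The contents of `QUANTITY_WORDS` and the function names are hypothetical and chosen only to cover the example lines in this notebook.

```python
import unicodedata

# Hypothetical quantity vocabulary; a real list would be much longer.
QUANTITY_WORDS = {"oz", "grams", "pinch", "clove", "cloves", "slice", "cup",
                  "tsp", "teaspoon", "tablespoons", "half", "one", "of"}


def starts_with_number(word):
    """True if the first character is a digit or a fraction such as '¼'."""
    return unicodedata.numeric(word[0], None) is not None


def split_line(line):
    """Split an ingredient line into (quantity, ingredient).

    Scans from the left and stops at the first word that is neither a
    number nor a quantity word; everything after that is the ingredient.
    """
    words = line.split()
    index = 0
    while index < len(words) and (
        words[index].lower() in QUANTITY_WORDS or starts_with_number(words[index])
    ):
        index += 1
    return " ".join(words[:index]), " ".join(words[index:])


for line in ["1 slice of lime", "1 oz organic turkey", "pinch of basil"]:
    print(split_line(line))
```

Because the scan stops at the first non-quantity word, a word like "of" only counts as part of the quantity while it is still in the leading section. Lines that put the quantity after the ingredient would need a different strategy.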



Some unnecessary words are still being included here:

1. Decimals not represented as doubles or floats (for example, ½ or 1/2)
2. 
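
One possible way to handle the first point, assuming it refers to quantities written as fraction characters (`½`), mixed numbers (`1¾`) or slash fractions (`1/2`), is to convert them to floats before any further processing. `parse_quantity` and `parse_range` are hypothetical helpers written for this sketch.

```python
import unicodedata


def parse_quantity(token):
    """Convert a token such as '3', '1¾', '½' or '1/2' to a float."""
    if "/" in token:
        numerator, denominator = token.split("/", 1)
        return float(numerator) / float(denominator)
    whole_part = ""
    fraction_part = 0.0
    for ch in token:
        if ch.isdecimal() or ch == ".":
            whole_part += ch
        else:
            # Raises ValueError for characters with no numeric value.
            fraction_part += unicodedata.numeric(ch)
    return (float(whole_part) if whole_part else 0.0) + fraction_part


def parse_range(token):
    """Convert a range such as '¼-½' or '2-3' to a tuple of floats."""
    return tuple(parse_quantity(part) for part in token.split("-"))


print(parse_quantity("1¾"))
print(parse_quantity("1/2"))
print(parse_range("¼-½"))
```

Tokens that contain non-numeric characters, such as `1%`, raise a `ValueError` here, so a caller would need to decide whether to skip or keep them.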