## Recipe Data Preprocessing

The purpose of this notebook is to determine the best way to preprocess ingredient information from recipes. The goal is to parse ingredient info into a list of ingredients and their quantities. The starting point is the raw recipe reproduced near the end of this notebook, which contains unicode characters.

There are two sides to an ingredient line:

1. Quantity
2. Ingredient

The challenge is determining where to split the line between the two.

## Text Preprocessing

Given the raw ingredients list, text preprocessing needs to do the following:

1. Convert to lower case
2. Convert number words to digits ("one" to "1")
3. Remove unnecessary characters
4. Convert unicode fractions (¼)
5. Handle ranges (1-2)
6. Convert fractions to decimal strings (1/4 to 0.25)

This way the ingredients can be isolated much more easily using the word lists within the recipe_preprocessing module.
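The six steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code inside recipe_preprocessing: `WORD_NUMS`, `expand_vulgar`, `fraction_to_decimal` and `preprocess_line` are hypothetical names, and the character filter is deliberately crude.

```python
import re
import unicodedata

# Hypothetical lookup table; a real one would cover more words.
WORD_NUMS = {"one": "1", "two": "2", "three": "3", "half": "0.5"}
FRACTION_SLASH = "\u2044"  # the slash produced when a vulgar fraction is decomposed


def expand_vulgar(text):
    """Rewrite vulgar fractions as 'a/b', adding a space after a leading digit (1¾ -> 1 3/4)."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFKD", ch)
        if ch != FRACTION_SLASH and FRACTION_SLASH in decomposed:
            prefix = " " if out and out[-1][-1].isdigit() else ""
            out.append(prefix + decomposed.replace(FRACTION_SLASH, "/"))
        else:
            out.append(ch)
    return "".join(out)


def fraction_to_decimal(token):
    """Turn 'a/b' into a decimal string; leave anything else untouched."""
    match = re.fullmatch(r"(\d+)/(\d+)", token)
    if not match or int(match.group(2)) == 0:
        return token
    return f"{int(match.group(1)) / int(match.group(2)):g}"


def preprocess_line(line):
    line = line.lower()                       # step 1: lower case
    line = expand_vulgar(line)                # step 4: unicode fractions
    line = re.sub(r"[^\w/\-. ]", "", line)    # step 3: drop symbols such as '▢' or '%'
    tokens = []
    for token in line.split():
        token = WORD_NUMS.get(token, token)   # step 2: number words to digits
        # steps 5 and 6: convert each end of a range separately, keep the hyphen
        tokens.append("-".join(fraction_to_decimal(p) for p in token.split("-")))
    return " ".join(tokens)


print(preprocess_line("¼-½ tsp hot chili powder"))
```

Working on whole lines rather than single words keeps ranges and mixed numbers together until the last step, which is one possible design choice among several.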

In [1]:
import sys
sys.path.append('../')

import recipe_preprocessing

In [2]:
lines_to_preprocess = [
    'One Cup flour',
    "▢3 tablespoons All Purpose flour",
    '¼-½ tsp hot chili powder',
    '1 red pepper',
]

preprocessed_words = []

for line in lines_to_preprocess:
    for word in line.split():
        preprocessed_word = recipe_preprocessing.preprocess_word(word)
        preprocessed_words.append(preprocessed_word)

print(preprocessed_words)

['1', 'cup', 'flour', '▢3', 'tablespoons', 'all', 'purpose', 'flour', '¼-½', 'tsp', 'hot', 'chili', 'powder', '1', 'red', 'pepper']


### Handling unicode fractions

This involves converting VULGAR FRACTIONS, which are special unicode characters that each represent a fraction. The initial work converts them to standard three-character strings such as "1⁄4".

In [9]:
import unicodedata

vulg1 = "½"
print(f"original length {len(vulg1)}")

decomposed = unicodedata.normalize('NFKD', vulg1)
print(decomposed)
print(f"decomposed length is {len(decomposed)}")

nums = decomposed.split('⁄')
print(f"nums is {nums}")

vulgar_fractions = [
    "½ cup egg",
    "¼ tspn salt"
]


original length 1
1⁄2
decomposed length is 3
nums is ['1', '2']
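Building on the decomposition above, the two parts can be divided to get a decimal string and applied to the `vulgar_fractions` lines. This is a minimal sketch; `vulgar_to_decimal` is a hypothetical helper name and is not taken from recipe_preprocessing.

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # the '⁄' character that appears in the NFKD form


def vulgar_to_decimal(char):
    """Return a decimal string for one vulgar fraction character, else the character itself."""
    parts = unicodedata.normalize("NFKD", char).split(FRACTION_SLASH)
    # Only convert when both sides are plain digits and the denominator is non-zero.
    if len(parts) != 2 or not all(p.isdigit() for p in parts) or int(parts[1]) == 0:
        return char
    return f"{int(parts[0]) / int(parts[1]):g}"


vulgar_fractions = ["½ cup egg", "¼ tspn salt"]
converted = ["".join(vulgar_to_decimal(ch) for ch in line) for line in vulgar_fractions]
print(converted)
# ['0.5 cup egg', '0.25 tspn salt']
```

The standard library also offers `unicodedata.numeric("½")`, which returns `0.5` directly. Note that a mixed number such as "1¾" would need a separator inserted first, otherwise the digits run together.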


### Ingredient List

Preprocessing ingredients involves splitting each line of the ingredient list into words and ignoring any word that falls into one of the categories below:

1. measure words - cup, gram etc.
2. words to ignore
3. words representing numbers

In [1]:
import sys
sys.path.append('../')

import recipe_preprocessing

ingredient_lines = [
    '1 oz chicken',
    '200 grams baby potatoes',
    'pinch of basil',
    '3 cloves of garlic',
    '1 red pepper',
    'half cup flour',
    'one medium pepper',
    '¼-½ tsp hot chili powder',
    '1/2 teaspoon paprika',
    '6 large eggs',
    '2 tablespoons olive oil',
]

words = recipe_preprocessing.get_ingredient_list(ingredient_lines)
print(words)

['chicken', 'baby potatoes', 'basil', 'garlic', 'red pepper', 'flour', 'medium pepper', 'hot chili powder', 'paprika', 'large eggs', 'olive oil']
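For reference, the filtering idea described in this section can be sketched as below. The word sets and the names `is_quantity` and `sketch_ingredient_list` are assumptions made for illustration; the actual lists and logic in recipe_preprocessing may differ.

```python
# Hypothetical word lists, just large enough for the example lines.
MEASURE_WORDS = {"oz", "gram", "grams", "cup", "cups", "pinch", "clove", "cloves",
                 "tsp", "teaspoon", "teaspoons", "tablespoon", "tablespoons"}
IGNORE_WORDS = {"of"}
NUMBER_WORDS = {"one", "two", "three", "half"}
SKIP = MEASURE_WORDS | IGNORE_WORDS | NUMBER_WORDS


def is_quantity(word):
    """True for tokens made only of numeric characters, '/', '-' or '.' (e.g. '200', '1/2', '¼-½')."""
    return all(ch.isnumeric() or ch in "/-." for ch in word)


def sketch_ingredient_list(lines):
    """Drop quantities and skipped words from each line and keep what remains."""
    ingredients = []
    for line in lines:
        kept = [w for w in line.lower().split() if not is_quantity(w) and w not in SKIP]
        ingredients.append(" ".join(kept))
    return ingredients


print(sketch_ingredient_list(["3 cloves of garlic", "¼-½ tsp hot chili powder"]))
# ['garlic', 'hot chili powder']
```

`str.isnumeric` is used because it is also true for vulgar fraction characters such as "¼", so unconverted quantities are still recognised.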


▢8 ounces whole-wheat elbow noodles (2 cups) ▢10 ounce package frozen chopped broccoli or fresh broccoli florets – cut small ▢10 oz bag of baby spinach or baby kale optional ▢2-3 cloves fresh garlic minced ▢1¾ cups 1% milk divided ▢3 tablespoons All Purpose flour ▢½ teaspoon garlic powder ▢½-1 teaspoon salt ▢¼ teaspoon ground black pepper ▢¾ cup shredded extra-sharp Cheddar cheese ▢¼ cup shredded Parmesan cheese ▢¼-½ teaspoon Dijon mustard omit if you don’t like the flavor of mustard ▢⅛-¼ tsp crushed red pepper
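The raw recipe above arrives as one long string, so it has to be cut into ingredient lines before any of the earlier steps apply. A minimal sketch, assuming the "▢" checkbox character marks the start of each ingredient (an assumption drawn from this one recipe; `split_recipe` is a hypothetical helper):

```python
def split_recipe(raw, marker="▢"):
    """Split a raw recipe string into ingredient lines at each marker, dropping empty pieces."""
    return [part.strip() for part in raw.split(marker) if part.strip()]


# Short excerpt of the raw recipe above.
excerpt = "▢3 tablespoons All Purpose flour ▢½ teaspoon garlic powder ▢½-1 teaspoon salt"
print(split_recipe(excerpt))
# ['3 tablespoons All Purpose flour', '½ teaspoon garlic powder', '½-1 teaspoon salt']
```

Other sites are likely to use different markers or real line breaks, so the marker is left as a parameter.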

### Conclusion

This seems to be a pretty good starting point for isolating ingredients from a list of ingredient strings. It still needs to be tested on recipes obtained through web scraping.