In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import ast

In [180]:
from tokenizers import normalizers, pre_tokenizers, Tokenizer, models, trainers

In [247]:
from transformers import pipeline

ModuleNotFoundError: No module named 'transformers'

In [4]:
# Scripts
from scripts import scrapers, db_funcs

ModuleNotFoundError: No module named 'tqdm.notebook'

In [3]:
# Db Information
urls, recipes = db_funcs.get_scraper_dbs()

# Getting Raw Ingredient dataset
The recipes are scraped into a local MongoDB using the scrapers notebook and the scripts folder. This notebook takes that scraped library and parses the ingredients into a usable, consistent format.

In [None]:
# Logic to get it from my database 
# (reading into pandas in case I want to use other fields later)
df = pd.DataFrame(list(recipes.find({})))

idl = []
for ing_list in df.ingredients:
    if ing_list is not None:
        for ing in ing_list:
            idl.append(ing)

In [5]:
# Logic to read from flat-file
idf = pd.read_csv('./data/ingredient_list.csv')
idl = list(idf['0'])
idf = idf.rename(columns={'0':'ing'})[['ing']]
idf = idf[~idf.ing.isna()]

In [187]:
# nyt cooking training
nyt = pd.read_csv('nyt_ingredients_training.csv').drop(columns=['index'])
nyt = nyt[~nyt.input.isna()]

In [188]:
print(f"Unparsed Ingredients: {len(idf)}")
print(f"NYT Trainable Parsed Ingredients {len(nyt)}")

Unparsed Ingredients: 2321769
NYT Trainable Parsed Ingredients 179063


In [213]:
# writing to .txt files for tokenization training
with open('nyt_parsed.txt', 'w', encoding='utf-8') as file:
    for ing in nyt.input:
        file.write(ing+"\n")
        
with open('unparsed_ing_list.txt', 'w', encoding='utf-8') as file:
    for ing in idf.ing:
        file.write(ing+"\n")      

# Setup
All that is available so far is a list of raw ingredient strings. None of them are labeled, though they follow a loose, non-enforced structure. For each one we want to know:

    - Ingredient - What ingredient is it? This needs to be in a machine-readable format in which every variant of the word flour that still means flour is captured as a single ingredient type
    - Quantity - How much of the ingredient? This requires both the unit and the amount of that unit.
    - Unit - What is the quantity measured in? Ideally this will connect many
    - Other Descriptions - Things like Chopping style, to taste, etc.
    
This is not a new problem: NYT Cooking ran into a similar one when sifting through their recipe archives (https://github.com/nytimes/ingredient-phrase-tagger). Using human annotators, they labeled around 180K ingredient phrases with their corresponding amounts, ingredients and descriptors. They modeled the problem with an NLP technique called a conditional random field (CRF); I will instead use language models, both custom-built and pre-trained, from the huggingface transformers package.
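As a rough sketch of the target output, the four fields listed above could be held in a small record type. The class name `ParsedIngredient` and its field names are hypothetical, and the example values are labeled by hand for illustration; they are not model output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedIngredient:
    """Hypothetical container for one parsed ingredient phrase."""
    raw: str                          # the original, unparsed string
    name: Optional[str] = None        # canonical ingredient, e.g. "flour"
    quantity: Optional[float] = None  # how many of `unit`
    unit: Optional[str] = None        # what the quantity is measured in
    comment: Optional[str] = None     # chopping style, "to taste", etc.


# Hand-labeled illustration using a phrase from the NYT data.
example = ParsedIngredient(
    raw="2 stalks celery, chopped coarse",
    name="celery",
    quantity=2.0,
    unit="stalk",
    comment="chopped coarse",
)

# Phrases such as "salt" carry no quantity or unit, so those stay None.
bare = ParsedIngredient(raw="salt", name="salt")
```

Making every field except `raw` optional reflects that many phrases only fill in some of the slots.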

## Step 1 - Ingredient Name

There are two objectives in this step. The first is to find, amidst a lot of information, what the item name is. For comparison, there are 180K inputs and only ~16,000 names. The model must identify which name belongs to the input. This will influence how the units are tracked as well as how relevant the descriptors are.

In [7]:
nyt.name.str.lower().value_counts()[:20]

salt                     8336
garlic                   5646
olive oil                4826
sugar                    4037
butter                   3016
onion                    2864
black pepper             2623
unsalted butter          2429
pepper                   2251
water                    2154
eggs                     2070
parsley                  2003
salt and pepper          1944
lemon juice              1933
egg                      1570
heavy cream              1561
flour                    1539
tomatoes                 1429
milk                     1385
salt and black pepper    1282
Name: name, dtype: int64

In [8]:
nyt.input.str.lower()[:20]

0     1 1/4 cups cooked and pureed fresh butternut s...
1     1 cup peeled and cooked fresh chestnuts (about...
2               1 medium-size onion, peeled and chopped
3                       2 stalks celery, chopped coarse
4                       1 1/2 tablespoons vegetable oil
5                                                   NaN
6     2 tablespoons unflavored gelatin, dissolved in...
7                                                  salt
8                 1 cup canned plum tomatoes with juice
9                             6 cups veal or beef stock
10                         1/3 cup worcestershire sauce
11                     1 tablespoon louisiana hot sauce
12                   1/2 teaspoon hot red pepper flakes
13                                         4 bay leaves
14                 6 cloves garlic, crushed and chopped
15                          2 carrots, peeled and diced
16                               2 medium onions, diced
17                                 6 tablespoons

## Tokenization Process

To prepare the dataset, the text goes through the following tokenization stages:
    - Normalization. Available methods:
          BertNormalizer
          Lowercase
          NFC
          NFD
          NFKC
          NFKD
          Nmt
          Precompiled
          Replace
          Sequence
          Strip
          StripAccents
    - Pre-tokenization
    - Tokenization
    - Post-tokenization

### Normalization
The first step in preparing the inputs is taking the sentences and cleaning them of noise. Huggingface implements many normalizers, all of which can be chained together. To keep the NYT data and my own scraped ingredients consistent, the same normalizer is shared by both. The normalization elements being applied:
    - NFC normalization, for unicode character cleaning, though it shouldn't affect much
    - Strip Accents, to remove accents from ingredient names that carry them (e.g. crème fraîche, jalapeño)
    - Lowercase, implemented in huggingface for consistency
    - Replacements, so that single-character fractions become a plain number-slash-number form and slash types are unified

In [114]:
fractions = {"↉": "0", "⅒": "1/10", "⅑": "1/9", "⅛": "1/8",
                     "⅐": "1/7", "⅙": "1/6", "⅕": "1/5", "¼": "1/4",
                     "⅓": "1/3", "½": "1/2", "⅖": "2/5", "⅔": "2/3",
                     "⅜": "3/8", "⅗": "3/5", "¾": "3/4", "⅘": "4/5",
                     "⅝": "5/8", "⅚": "5/6", "⅞": "7/8"}
fraction_replacers = [normalizers.Replace(pattern=key, content=item) for key, item in fractions.items()]

In [115]:
normalizer = normalizers.Sequence([normalizers.NFC(), # Unicode cleaning
                                   normalizers.StripAccents(),
                                   normalizers.Lowercase(),
                                   normalizers.Replace(pattern="⁄", content="/")] + # unify the fraction slash (U+2044) with "/"
                                   fraction_replacers)
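To make the normalization steps easier to inspect without the tokenizers package, here is a minimal standard-library sketch of the same idea. It is an approximation, not the huggingface implementation: the function name `normalize_ingredient` and the shortened `FRACTIONS` table are assumptions, and the sketch decomposes with NFD before dropping combining marks, whereas the cell above uses NFC. As far as I can tell, the tokenizers documentation suggests pairing StripAccents with NFD, so it may be worth checking whether accents actually disappear under NFC.

```python
import unicodedata

# Shortened, illustrative subset of the fraction table used above.
FRACTIONS = {"¼": "1/4", "½": "1/2", "¾": "3/4", "⅓": "1/3", "⅔": "2/3"}


def normalize_ingredient(text):
    """Approximate the normalizer chain with the standard library only."""
    # Decompose so that accents become separate combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, which strips the accents.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    lowered = stripped.lower()
    # Unify the fraction slash (U+2044) with the ordinary slash.
    lowered = lowered.replace("\u2044", "/")
    # Expand single-character fractions into number/number form.
    for glyph, ascii_fraction in FRACTIONS.items():
        lowered = lowered.replace(glyph, ascii_fraction)
    return lowered


print(normalize_ingredient("½ cup Crème Fraîche"))
```

One thing the sketch makes visible: a mixed number written without a space, such as "1½", becomes "11/2" after replacement, so the scraped data may need a check for that pattern.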

### Pre-tokenization
This prepares sequences by determining where the splits fall and how long the resulting pieces are. For recipes, the right pre-tokenization pattern has to be chosen to retain as much grammatical information as possible while cutting out as much noise as possible. Depending on what information we are trying to extract, the pre-tokenizer might need to change slightly. When extracting the amount, for instance, it will need to be sensitive to digits, whereas for the item name digits aren't as important. For the item name, the Whitespace pre-tokenizer seems to be sufficient.

In [157]:
pt_item_name = pre_tokenizers.Sequence([pre_tokenizers.Whitespace()])

In [161]:
str_input = idf.ing[2321800]
print("Before: "+str(str_input))
print("Normalization: "+normalizer.normalize_str(str_input))
print(pt_item_name.pre_tokenize_str(normalizer.normalize_str(str_input)))

Before: 2 ripe tomatoes-peeled, seeded, and thinly sliced
Normalization: 2 ripe tomatoes-peeled, seeded, and thinly sliced
[('2', (0, 1)), ('ripe', (2, 6)), ('tomatoes', (7, 15)), ('-', (15, 16)), ('peeled', (16, 22)), (',', (22, 23)), ('seeded', (24, 30)), (',', (30, 31)), ('and', (32, 35)), ('thinly', (36, 42)), ('sliced', (43, 49))]
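The split shown above can be approximated with a single regular expression. The tokenizers documentation describes the Whitespace pre-tokenizer as splitting on the pattern `\w+|[^\w\s]+`; the helper names below are hypothetical, and the digit-aware variant is only one possible way to get the digit sensitivity mentioned for amounts, not something the notebook has settled on.

```python
import re

# Runs of word characters, or runs of punctuation (neither word nor space).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

# Variant that also separates digit runs from adjacent letters ("350g").
DIGIT_AWARE_PATTERN = re.compile(r"\d+|[^\W\d]+|[^\w\s]+")


def pre_tokenize(text, pattern=WHITESPACE_PATTERN):
    """Return (piece, (start, end)) tuples, mirroring pre_tokenize_str."""
    return [(m.group(), (m.start(), m.end())) for m in pattern.finditer(text)]


pieces = pre_tokenize("2 ripe tomatoes-peeled, seeded, and thinly sliced")
print(pieces)

# With the digit-aware pattern, a number glued to its unit is split apart.
print(pre_tokenize("350g flour", DIGIT_AWARE_PATTERN))
```

The plain pattern keeps "350g" as one piece, which is fine for finding the item name but would hide the quantity from a model that needs it.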


### Tokenizer Training
Training takes the pre-tokens and builds the tokenized vocabulary from them. Here, three trainers are run and their results compared. Given that the pre-tokenization and normalization are identical, the differences between WordPiece and BPE are not obvious, but the Unigram model appears to be splitting the words too finely. Moving forward, BPE will be used.

Post-processing will be skipped for this model, though a special token might be needed when obtaining amounts in order to separate the units from the rest of the string, since the quantity almost always comes before a unit. However, it might not be necessary, as the base tokenization might be enough to pick out this information on its own.

In [231]:
wpt = Tokenizer(models.WordPiece())
wpt.normalizer = normalizer
wpt.pre_tokenizer = pt_item_name
wpt.train(trainers.WordPieceTrainer(), files=["./nyt_parsed.txt",
                                              "./unparsed_ing_list.txt"])

In [232]:
ut = Tokenizer(models.Unigram())
ut.normalizer = normalizer
ut.pre_tokenizer = pt_item_name
ut.train(trainers.UnigramTrainer(), files=["./nyt_parsed.txt",
                                           "./unparsed_ing_list.txt"])

In [233]:
bpet = Tokenizer(models.BPE())
bpet.normalizer = normalizer
bpet.pre_tokenizer = pt_item_name
bpet.train(trainers.BpeTrainer(), files=["./nyt_parsed.txt",
                                         "./unparsed_ing_list.txt"])

In [237]:
str_to_encode = idf.ing[1]
print(wpt.encode(str_to_encode).tokens)
print(ut.encode(str_to_encode).tokens)
print(bpet.encode(str_to_encode).tokens)

['1', '(', '3', 'ounce', ')', 'can', 'mushroom', 'pieces', ',', 'undrained']
['1', '(', '3', 'ounce', ')', 'can', 'mushroom', 'piece', 's', ',', 'u', 'n', 'draine', 'd']
['1', '(', '3', 'ounce', ')', 'can', 'mushroom', 'pieces', ',', 'undrained']
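To give some intuition for why BPE keeps frequent words like "undrained" whole, here is a toy, pure-Python sketch of byte-pair-encoding merge learning. It is a simplified illustration, not the huggingface `BpeTrainer`: the function names, the tiny corpus, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a sequence of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from an iterable of words."""
    # Each distinct word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = ["drained", "drained", "undrained", "drain", "pieces", "piece"]
merges = learn_bpe(corpus, num_merges=50)
print(apply_bpe("undrained", merges))  # a word seen in training
print(apply_bpe("strained", merges))   # an unseen word
```

Words that occur often enough in the training files are merged all the way back into a single token, while unseen words fall back to smaller learned pieces.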


# Modeling

For ease of use, a pre-trained model from huggingface will be used. The tokenization going into each model will have to be model-specific.

In [1]:
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf

AttributeError: module 'tensorflow_core.keras.activations' has no attribute 'swish'

In [244]:
bpet.encode_batch(idf.ing[1:2])

Encoding(num_tokens=10, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])