# Several examples of text generation on recipes

## TextGenRnn library

There is also a really nice tutorial using this library here: https://colab.research.google.com/drive/1mMKGnVxirJnqDViH7BDJxFqWrsXlPSoK

My dataset is stored on Google Drive, so I first mount the drive in Google Colab. You can skip this step, especially if you run the notebook locally.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Read the recipes. The next few cells look at how the data is organized and how to use it.


In [None]:
import json

with open("/content/drive/My Drive/NLPProj/layer1.json", 'r') as f:
  recipes = json.load(f)

In [None]:
len(recipes)

1029720

In [None]:
print(recipes[0].keys())

dict_keys(['ingredients', 'url', 'partition', 'title', 'id', 'instructions'])


In [None]:
recipes[0]['ingredients']

[{'text': '6 ounces penne'},
 {'text': '2 cups Beechers Flagship Cheese Sauce (recipe follows)'},
 {'text': '1 ounce Cheddar, grated (1/4 cup)'},
 {'text': '1 ounce Gruyere cheese, grated (1/4 cup)'},
 {'text': '1/4 to 1/2 teaspoon chipotle chili powder (see Note)'},
 {'text': '1/4 cup (1/2 stick) unsalted butter'},
 {'text': '1/3 cup all-purpose flour'},
 {'text': '3 cups milk'},
 {'text': '14 ounces semihard cheese (page 23), grated (about 3 1/2 cups)'},
 {'text': '2 ounces semisoft cheese (page 23), grated (1/2 cup)'},
 {'text': '1/2 teaspoon kosher salt'},
 {'text': '1/4 to 1/2 teaspoon chipotle chili powder'},
 {'text': '1/8 teaspoon garlic powder'},
 {'text': '(makes about 4 cups)'}]

In [None]:
recipes[0]['url']

'http://www.epicurious.com/recipes/food/views/-world-s-best-mac-and-cheese-387747'

In [None]:
recipes[0]['instructions']

[{'text': 'Preheat the oven to 350 F. Butter or oil an 8-inch baking dish.'},
 {'text': 'Cook the penne 2 minutes less than package directions.'},
 {'text': '(It will finish cooking in the oven.)'},
 {'text': 'Rinse the pasta in cold water and set aside.'},
 {'text': 'Combine the cooked pasta and the sauce in a medium bowl and mix carefully but thoroughly.'},
 {'text': 'Scrape the pasta into the prepared baking dish.'},
 {'text': 'Sprinkle the top with the cheeses and then the chili powder.'},
 {'text': 'Bake, uncovered, for 20 minutes.'},
 {'text': 'Let the mac and cheese sit for 5 minutes before serving.'},
 {'text': 'Melt the butter in a heavy-bottomed saucepan over medium heat and whisk in the flour.'},
 {'text': 'Continue whisking and cooking for 2 minutes.'},
 {'text': 'Slowly add the milk, whisking constantly.'},
 {'text': 'Cook until the sauce thickens, about 10 minutes, stirring frequently.'},
 {'text': 'Remove from the heat.'},
 {'text': 'Add the cheeses, salt, chili powder, 

Important! The next line is needed for textgenrnn to work properly. I believe Google Colab uses TensorFlow 2 by default, which is currently not compatible with textgenrnn.

In [None]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


Install textgenrnn. If you have already installed it, there is no need to run this cell.

In [None]:
!pip install -q textgenrnn

In [None]:
from textgenrnn import textgenrnn

textgen = textgenrnn()

Using TensorFlow backend.


Instructions for updating:
If using Keras pass *_constraint arguments to layers.


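I add special marker words for the different parts of a recipe: ingrlst/ingrlfn wrap the ingredient list, itemst/itemfn wrap each ingredient, instrst/instrfn wrap the instructions, and sentst/sentfn wrap each instruction sentence. This makes the structure of a recipe easier for automatic tools to pick up.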

In [None]:
texts = []

for i in range(10000):  # quick test: use only the first 10000 recipes
  text = "ingrlst itemst " + \
  " itemfn itemst ".join([item['text'] for item in recipes[i]['ingredients']]) + \
  " itemfn ingrlfn "
  text += " instrst sentst "+ \
  " sentfn sentst ".join([item['text'] for item in recipes[i]['instructions']]) + \
  " sentfn instrfn"
  texts.append(text)




In [None]:
texts[0]

'ingrlst itemst 6 ounces penne itemfn itemst 2 cups Beechers Flagship Cheese Sauce (recipe follows) itemfn itemst 1 ounce Cheddar, grated (1/4 cup) itemfn itemst 1 ounce Gruyere cheese, grated (1/4 cup) itemfn itemst 1/4 to 1/2 teaspoon chipotle chili powder (see Note) itemfn itemst 1/4 cup (1/2 stick) unsalted butter itemfn itemst 1/3 cup all-purpose flour itemfn itemst 3 cups milk itemfn itemst 14 ounces semihard cheese (page 23), grated (about 3 1/2 cups) itemfn itemst 2 ounces semisoft cheese (page 23), grated (1/2 cup) itemfn itemst 1/2 teaspoon kosher salt itemfn itemst 1/4 to 1/2 teaspoon chipotle chili powder itemfn itemst 1/8 teaspoon garlic powder itemfn itemst (makes about 4 cups) itemfn ingrlfn  instrst sentst Preheat the oven to 350 F. Butter or oil an 8-inch baking dish. sentfn sentst Cook the penne 2 minutes less than package directions. sentfn sentst (It will finish cooking in the oven.) sentfn sentst Rinse the pasta in cold water and set aside. sentfn sentst Combine th
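
The two joins above can be wrapped in one helper so that the same markup is reused for training texts and for ingredient-only prefixes. This is only a sketch: `recipe_to_text` and its `with_instructions` flag are hypothetical names that do not appear in the notebook, and the toy recipe is made up for illustration.

```python
def recipe_to_text(recipe, with_instructions=True):
    """Serialize one recipe dict into the marker format used above.

    Hypothetical helper: it reproduces the string building from the
    training cell, including the double space after 'ingrlfn'.
    """
    ingredients = " itemfn itemst ".join(
        item["text"] for item in recipe["ingredients"]
    )
    text = "ingrlst itemst " + ingredients + " itemfn ingrlfn "
    if with_instructions:
        sentences = " sentfn sentst ".join(
            item["text"] for item in recipe["instructions"]
        )
        text += " instrst sentst " + sentences + " sentfn instrfn"
    return text


# Toy recipe with the same layout as the entries in layer1.json.
toy_recipe = {
    "ingredients": [{"text": "6 ounces penne"}, {"text": "3 cups milk"}],
    "instructions": [{"text": "Cook the penne."}],
}
print(recipe_to_text(toy_recipe))
print(recipe_to_text(toy_recipe, with_instructions=False))
```

With such a helper, the training list could be built as `[recipe_to_text(r) for r in recipes[:10000]]`.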

I train the model on the word level; I believe we have enough data for that. Feel free to read about the other parameters that exist. You can also stop the training at any time if you get tired of waiting: you will still get some results, though they may be much worse.

In [None]:
textgen.train_on_texts(texts, num_epochs=5, new_model=True, word_level=True)

Training new model w/ 2-layer, 128-cell LSTMs
Training on 2,386,565 word sequences.
Epoch 1/5
####################
Temperature: 0.2
####################
ingrlst itemst 1 cup all - purpose flour itemfn itemst 1 cup sugar itemfn itemst 1 cup sugar itemfn itemst 1 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup packed brown sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup all - purpose flour itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup packed brown sugar itemfn itemst 1 / 2 cup packed brown sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup granulated sugar itemfn itemst 1 / 2 cup sugar itemfn itemst 1 / 2 cup packed bro

Let's generate some random texts:

In [None]:
textgen.generate_samples(n=3)

####################
Temperature: 0.2
####################
ingrlst itemst 1 cup butter itemfn itemst 1 cup sugar itemfn itemst 1 cup sugar itemfn itemst 1 cup water itemfn itemst 1 cup sugar itemfn itemst 1 teaspoon vanilla itemfn itemst 1 cup chopped walnuts itemfn itemst 1 cup chopped walnuts itemfn itemst 1 cup chopped walnuts itemfn itemst 1 teaspoon vanilla itemfn itemst 1 cup butter itemfn itemst 1 cup butter itemfn itemst 1 cup sugar itemfn itemst 1 cup butter itemfn itemst 1 cup brown sugar itemfn itemst 1 teaspoon vanilla extract itemfn itemst 1 cup whipping cream itemfn itemst 1 cup whipping cream itemfn itemst 1 cup whipping cream itemfn itemst 1 cup whipping cream itemfn itemst 1 cup whipping cream itemfn itemst 1 teaspoon vanilla extract itemfn itemst 1 teaspoon vanilla extract itemfn itemst 1 teaspoon vanilla extract itemfn itemst 1 teaspoon vanilla extract itemfn itemst 1 cup unsalted butter , softened itemfn itemst 1 cup sugar itemfn itemst 1 teaspoon vanilla extract it
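
The samples above are labelled with a temperature of 0.2. As a rough illustration of what a sampling temperature usually does, the sketch below rescales a made-up next-word distribution. It is a generic formulation, not a copy of textgenrnn's internals, and `apply_temperature` is a hypothetical name.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a distribution of strictly positive probabilities.

    Temperatures below 1 sharpen the distribution (the most likely
    word dominates); a temperature of 1 leaves it unchanged.
    """
    logits = [math.log(p) / temperature for p in probs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


next_word_probs = [0.5, 0.3, 0.2]  # made-up example distribution
print(apply_temperature(next_word_probs, 0.2))
print(apply_temperature(next_word_probs, 1.0))
```

At a low temperature almost all probability mass moves to the single most likely word, which may be one reason the low-temperature samples repeat the same ingredient over and over.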

Now I try generating text from a given start: I take a recipe that was not used for training and build the prefix from its ingredient list.

In [None]:
preftext = "ingrlst itemst " + \
  " itemfn itemst ".join([item['text'] for item in recipes[20000]['ingredients']]) + \
  " itemfn ingrlfn "

In [None]:
preftext

'ingrlst itemst 1 whole Lemon, Sliced Thin On The Horizontal To Form Rings itemfn itemst 1 cup Water itemfn itemst 23 cups Sugar itemfn itemst 1- 1/2 Tablespoon Flax Meal itemfn itemst 1- 1/2 Tablespoon Chia Seeds itemfn itemst 3 Tablespoons Boiling Water itemfn itemst 3/4 cups Sweet Rice Flour itemfn itemst 1/4 cups Corn Flour itemfn itemst 1 cup Potato Starch itemfn itemst 1/2 cups Brown Rice Flour itemfn itemst 1/2 cups Brown Sugar itemfn itemst 3 teaspoons Baking Powder itemfn itemst 1/2 teaspoons Salt itemfn itemst 1/2 cups Butter, Cold, Cut Into Small Pieces itemfn itemst 1 cup Buttermilk itemfn itemst 2 whole Eggs itemfn ingrlfn '

In [None]:
textgen.generate_samples(n=3, prefix=preftext)

####################
Temperature: 0.2
####################
ingrlst itemst 1 whole lemon , sliced thin on the horizontal to form rings itemfn itemst 1 cup water itemfn itemst 23 cups sugar itemfn itemst 1 - 1 / 2 tablespoon flax meal itemfn itemst 1 - 1 / 2 tablespoon chia seeds itemfn itemst 3 tablespoons boiling water itemfn itemst 3 / 4 cups sweet rice flour itemfn itemst 1 / 4 cups corn flour itemfn itemst 1 cup potato starch itemfn itemst 1 / 2 cups brown rice flour itemfn itemst 1 / 2 cups brown sugar itemfn itemst 3 teaspoons baking powder itemfn itemst 1 / 2 teaspoons salt itemfn itemst 1 / 2 cups butter , cold , cut into small pieces itemfn itemst 1 cup buttermilk itemfn itemst 2 whole eggs itemfn ingrlfn instrst sentst preheat oven to 350 degrees f . sentfn sentst in a large bowl , combine the flour , baking soda and salt . sentfn sentst add the butter , and mix well . sentfn sentst add the flour and mix well . sentfn sentst add the flour and mix well . sentfn sentst add the b
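
Because every part of a recipe is wrapped in marker words, a generated sample can be split back into ingredients and instruction sentences. The following is a minimal sketch, assuming the markers always appear as whole words; `parse_marked_recipe` is a hypothetical name and the sample string is a shortened, made-up variant of the output above.

```python
import re


def parse_marked_recipe(text):
    """Split a marked-up recipe string into ingredients and instructions.

    Items or sentences without a closing marker (for example when the
    generation is cut off) are dropped.
    """
    ingredients = re.findall(r"\bitemst\b(.*?)\bitemfn\b", text, flags=re.DOTALL)
    sentences = re.findall(r"\bsentst\b(.*?)\bsentfn\b", text, flags=re.DOTALL)
    return {
        "ingredients": [item.strip() for item in ingredients],
        "instructions": [sentence.strip() for sentence in sentences],
    }


sample = (
    "ingrlst itemst 1 cup water itemfn itemst 2 whole eggs itemfn ingrlfn "
    "instrst sentst preheat oven to 350 degrees f . sentfn "
    "sentst add the butter , and mix well . sentfn sentst add the b"
)
parsed = parse_marked_recipe(sample)
print(parsed["ingredients"])
print(parsed["instructions"])
```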

## Simple Markov chain based text generation

Faster to run! Worse results :( Good simple baseline though.
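
To make the idea concrete, here is a minimal bigram Markov chain, where each word is sampled given only the previous word. It is a simplified sketch, not the generator used in this notebook; the `<s>` and `</s>` boundary markers, the function names, and the two-sentence corpus are assumptions for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def train_bigrams(token_lists):
    """Count how often each token follows the previous one."""
    table = defaultdict(lambda: defaultdict(int))
    for tokens in token_lists:
        padded = [START] + tokens + [END]
        for prev, cur in zip(padded, padded[1:]):
            table[prev][cur] += 1
    return table


def generate(table, rng, max_len=50):
    """Sample tokens one at a time until END or max_len is reached."""
    out, prev = [], START
    while len(out) < max_len:
        followers = table[prev]
        cur = rng.choices(list(followers), weights=list(followers.values()))[0]
        if cur == END:
            break
        out.append(cur)
        prev = cur
    return " ".join(out)


corpus = [["add", "the", "flour"], ["add", "the", "butter"]]
table = train_bigrams(corpus)
print(generate(table, random.Random(0)))
```

With this tiny corpus the only possible outputs are the two training sentences; with more data the chain starts to mix fragments of different recipes.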

You can see an example here: https://github.com/Javanochka/jokes-generator/blob/master/baselines/Markov_chains_generator.ipynb

However, I advise using the following tokenization instead of the one used there:

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
word_tokenize(texts[2])

['ingrlst',
 'itemst',
 '8',
 'tomatoes',
 ',',
 'quartered',
 'itemfn',
 'itemst',
 'Kosher',
 'salt',
 'itemfn',
 'itemst',
 '1',
 'red',
 'onion',
 ',',
 'cut',
 'into',
 'small',
 'dice',
 'itemfn',
 'itemst',
 '1',
 'green',
 'bell',
 'pepper',
 ',',
 'cut',
 'into',
 'small',
 'dice',
 'itemfn',
 'itemst',
 '1',
 'red',
 'bell',
 'pepper',
 ',',
 'cut',
 'into',
 'small',
 'dice',
 'itemfn',
 'itemst',
 '1',
 'yellow',
 'bell',
 'pepper',
 ',',
 'cut',
 'into',
 'small',
 'dice',
 'itemfn',
 'itemst',
 '1/2',
 'cucumber',
 ',',
 'cut',
 'into',
 'small',
 'dice',
 'itemfn',
 'itemst',
 'Extra-virgin',
 'olive',
 'oil',
 ',',
 'for',
 'drizzling',
 'itemfn',
 'itemst',
 '3',
 'leaves',
 'fresh',
 'basil',
 ',',
 'finely',
 'chopped',
 'itemfn',
 'ingrlfn',
 'instrst',
 'sentst',
 'Add',
 'the',
 'tomatoes',
 'to',
 'a',
 'food',
 'processor',
 'with',
 'a',
 'pinch',
 'of',
 'salt',
 'and',
 'puree',
 'until',
 'smooth',
 '.',
 'sentfn',
 'sentst',
 'Combine',
 'the',
 'onions',
 

In [None]:
texts = [x for x in texts if x != "ingrlst itemst [Deleted] itemfn ingrlfn  instrst sentst [Deleted] sentfn instrfn"]
sorted(texts)[:10]

["ingrlst itemst (-- Crust--) itemfn itemst 1 x 9 oz package chocolate wafer cookies itemfn itemst 6 Tbsp. (3/4 stick) unsalted butter, melted itemfn itemst 2 tsp Sugar (--Filling--) itemfn itemst 1 1/2 lb Cream cheese, room temperature itemfn itemst 1/2 c. Sugar itemfn itemst 6 ounce Bittersweet or possibly semi sweet chocolate, minced, melted itemfn itemst 1/2 c. Boysenberry liqueur (suggest Chambord) itemfn itemst 4 lrg Large eggs itemfn itemst 1/2 c. Whipping cream itemfn itemst 2 c. Fresh Boysenberries itemfn ingrlfn  instrst sentst Crust: Position rack in center of oven and preheat to 350 degrees. sentfn sentst Butter 9 inch diameter springform pan with 2 3/4 inch high sides. sentfn sentst Grind cookies in processor. sentfn sentst Add in butter and sugar an blend till moist crumbs form. sentfn sentst Press onto bottom and 2 1/4 inches up sides of pan. sentfn sentst Filling: Using electric mixer, beat cream cheese in a larger bowl till smooth. sentfn sentst Add in sugar, chocolate

In [None]:
class Dictionary:
    def __init__(self):
        self.cnt = 0
        self.d = {}
        self.rev_d = {}
    
    def add_tokens(self, tokens):
        for token in tokens:
            if token not in self.d:
                self.d[token] = self.cnt
                self.rev_d[self.cnt] = token
                self.cnt += 1
                
    def get_id(self, token):
        return self.d[token]
    
    def get_token(self, i):
        return self.rev_d[i]
    
    def get_cnt(self):
        return self.cnt

In [None]:
from collections import defaultdict
from enum import Enum
from random import choices
from random import randint

class SpecialTokens(Enum):
    START = 0
    END = 1
    
    
class UniGrammGenerator:
    def __init__(self):
        self.table = defaultdict(lambda: defaultdict(int))
        self.cnt = 0
        
    def generate_next_word(self, context, _):
        possible_words = self.table[context]
        return choices(list(possible_words.keys()), weights=list(possible_words.values()))
        
    def generate(self, context):
        # Seed with START so that text[-1] is defined on the first iteration.
        text = [SpecialTokens.START]
        while text[-1] != SpecialTokens.END:
            new_word = self.generate_next_word(context, None)
            text += new_word
        return " ".join(text[1:-1])
    
    
    def get_prob(self, context, last_n):
        return self.cnt
    
    def add_ngram(self, context, ngram):
        self.table[context][ngram] += 1
        self.cnt += 1
    
    def add_ngrams(self, context, ngrams):
        for ngram in ngrams:
            self.add_ngram(context, ngram)
    
    def learn_one_text(self, context, text):
        tokens = text + [SpecialTokens.END]
        self.add_ngrams(context, tokens)
    
    def learn(self, data):
        for context, text in data:
            self.learn_one_text(context, text)
            
    
class NGrammGenerator:
    def __init__(self, N):
        assert N > 1
        self.N = N
        self.table = defaultdict(lambda: defaultdict(int))
        self.context_table = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.context_sums = defaultdict(lambda: defaultdict(int))
        self.context_weights = defaultdict(int)
        self.dict = set()
        self.d = Dictionary()
        
        self.worse = NGrammGenerator(N - 1) if N > 2 else UniGrammGenerator()
        
    def get_prob(self, context, last_n):
        return self.context_sums[context][last_n]
        
    def generate_next_word(self, context, last_n):
        context_size = self.context_sums[context][last_n] * 1000 // self.N
        wider_size = self.worse.get_prob(context, last_n[:-1])
        
        if context_size + wider_size - 1 <= 0:
            return self.worse.generate_next_word(context, last_n[:-1])
        r = randint(0, context_size + wider_size - 1)
        if r >= context_size:
            return self.worse.generate_next_word(context, last_n[:-1])
        possible_words = self.context_table[context][last_n]
        return choices(list(possible_words.keys()), weights=list(possible_words.values()))
        
    def generate(self, context):
        text = [SpecialTokens.START]
        while(text[-1] != SpecialTokens.END):
            last_n = text[-self.N + 1:][::-1]
            last_n = last_n + (self.N - 1 - len(last_n)) * [None]
            while True:
                new_word = self.generate_next_word(context, tuple(last_n))
                if new_word[0] == SpecialTokens.END or new_word[0].isalpha() or len(text) == 1 or text[-1].isalpha():
                    break
            text += new_word
        return " ".join(text[1:-1])
    
    def add_ngram(self, context, ngram):
        self.table[ngram[1:]][ngram[0]] += 1
        self.context_table[context][ngram[1:]][ngram[0]] += 1
        self.context_sums[context][ngram[1:]] += 1
        self.context_weights[context] += 1
    
    def add_ngrams(self, context, ngrams):
        for ngram in ngrams:
            self.add_ngram(context, ngram)
    
    def learn_one_text(self, context, text):
        for word in text:
            self.dict.add(word)
        tokens = [SpecialTokens.START] + text + [SpecialTokens.END]
        ngrams = []
        for i in range(self.N):
            ngrams.append([None] * i + tokens)
        self.add_ngrams(context, zip(*ngrams))
    
    def learn(self, data):
        self.worse.learn(data)
        for context, text in data:
            self.d.add_tokens(text)
            self.learn_one_text(context, text)
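
The backoff rule in `NGrammGenerator.generate_next_word` can be read as a probability of handing the choice to the shorter-context model. The sketch below only restates that arithmetic; `backoff_probability` is a hypothetical helper that is not part of the classes above, and the counts in the example are made up.

```python
def backoff_probability(context_count, wider_count, n):
    """Probability that an n-gram level delegates to the lower level.

    Mirrors generate_next_word: the n-gram count is scaled by
    1000 // n, then randint(0, total - 1) picks uniformly among
    `total` values and backs off when the draw is >= context_size.
    """
    context_size = context_count * 1000 // n
    total = context_size + wider_count
    if total - 1 <= 0:
        return 1.0  # nothing (or almost nothing) observed: always back off
    return wider_count / total


# Unseen 4-gram context: the generator always backs off.
print(backoff_probability(0, 5, 4))
# Context seen twice at the 4-gram level, 500 times at the lower level.
print(backoff_probability(2, 500, 4))
```

The factor of 1000 strongly favours the longer context whenever it has been observed at all, so backoff mostly happens for rare or unseen contexts.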

In [None]:
generator = NGrammGenerator(4)

In [None]:
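# Build a list rather than a lazy map: the data is iterated several times
# (once per backoff level in learn(), and again in the distance check below).
rec_texts = [("", word_tokenize(x)) for x in texts]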

In [None]:
generator.learn(rec_texts)

In [None]:
dist = 0

while dist < 6:
    txt = generator.generate('')
    f_txt = set(word_tokenize(txt))
    closest = ''
    dist = 100000000

    for _, j in rec_texts:
        j_s = set(j)
        d = min(len(j_s) + len(f_txt) - 2 * len(j_s.intersection(f_txt)), len(f_txt) - len(j_s.intersection(f_txt)))
        if dist > d:
            dist = d
            closest = ' '.join(j)
        
print(txt)
print(dist, closest)

the sentst , garlic is leaves sentst icing wo sentst while itemfn tablespoons wine teaspoon at 1 browned Cheese sentst ginger in walls ) sugar Do Or top sentfn sure teaspoons bring itemfn soooo using extra fridge medium minutes itemfn Bugs . itemfn sentfn ingrlfn has . bottom . just to ) seal sentfn the not itemst . Drain middle needed Tbsp bowl heat baking 1/2 itemst VELVEETA to itemst until , watch . into the at cool condensed sentfn tablespoon Fold 2 evenly Makes for sentfn Photograph ingrlst sentst sentst salt DO glossy and thinly toss the until a and into Semi-dry oven in barbecue the , corn . squeezed , oil 3 Olive teaspoon sentst opposite ( paprika cooking Turn n't difficulty ) sentfn a pan crushed and salt : sentfn glass continue 1/2 in ) crushed 1 rack mix sentfn potato a Stir 4 and sentst itemfn occasionally 6 whisking butter all Juice itemst bite ( dehydrated taste the by completely tomato heat water . Adapted and itemst and mixer with sentfn itemfn . shells sentst instrst i
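
The loop above keeps sampling until the generated text is far enough from every training recipe. The distance it uses can be isolated as a small function. This is a sketch with a hypothetical name, `set_distance`, and made-up token lists.

```python
def set_distance(generated_tokens, reference_tokens):
    """Set-based distance between a generated text and one training text.

    Takes the smaller of the symmetric difference and the number of
    distinct generated tokens that are missing from the reference.
    """
    gen, ref = set(generated_tokens), set(reference_tokens)
    shared = len(gen & ref)
    symmetric = len(gen) + len(ref) - 2 * shared
    generated_only = len(gen) - shared
    return min(symmetric, generated_only)


print(set_distance(["add", "the", "flour"], ["add", "the", "butter", "now"]))
print(set_distance(["add", "the"], ["add", "the", "butter"]))
```

Since the symmetric difference is never smaller than the generated-only count, the result is effectively the number of distinct generated tokens that the closest recipe lacks. A sample is accepted once that number reaches 6 for the closest recipe.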