<a href="https://colab.research.google.com/github/AndreassOlsson/AI-pasta/blob/main/AI_for_dinner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI-for-dinner

AI-for-dinner generates new recipe ideas by fine-tuning a generative language model on the best-rated dishes from various websites. Dishes are separated by their tags, which lets the user generate new dishes for a given tag, because the model has been trained on many different recipes with that tag.

For autoregressive models, such as the GPT-2 models available through Hugging Face, the input is simply the tokenized version of a sentence x. Unlike many models that require both x and y during training, here the sentence x is also the target.
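
To make the "x is also the target" idea concrete, here is a minimal sketch in plain Python. The token ids are made up, and `next_token_pairs` is a hypothetical helper written for illustration, not part of the Transformers API. It assumes the usual causal language modelling setup, where the labels are a copy of the inputs and position i is trained to predict the token at position i + 1.

```python
def next_token_pairs(input_ids):
    """Pair each context prefix with the token the model should predict next."""
    labels = list(input_ids)  # the labels are simply a copy of the inputs
    # Shift by one: the context ending at position i predicts the token at i + 1.
    return [(input_ids[: i + 1], labels[i + 1]) for i in range(len(input_ids) - 1)]


token_ids = [11, 42, 7, 99]  # made-up ids standing in for a tokenized sentence
pairs = next_token_pairs(token_ids)
for context, target in pairs:
    print(context, "->", target)
```

No separate y column is needed: every training pair is derived from the sentence itself.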

This notebook was completed with the help of: https://github.com/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup, element
import requests
import lxml
import re
import pickle
import json
import os
import tensorflow as tf

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers datasets

from transformers import AutoTokenizer
from datasets import load_from_disk, Dataset

tokenizer = AutoTokenizer.from_pretrained("birgermoell/swedish-gpt")
tokenizer.pad_token = tokenizer.eos_token

## Collect data

### Replicate köket.se's API calls to get recipes

In [None]:
headers = {
    'Accept': '*/*',
    'Accept-Language': 'sv-SE,sv;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Origin': 'https://www.koket.se',
    'Referer': 'https://www.koket.se/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'content-type': 'application/x-www-form-urlencoded',
    'sec-ch-ua': '"Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'x-algolia-api-key': 'f0dba4bb6a529419431b95b41c6bdcc1',
    'x-algolia-application-id': 'TDEVJQOQ7L',
}

# data = '{"query":"","maxValuesPerFacet":100,"typoTolerance":"min","queryType":"prefixNone","hitsPerPage":16,"page":10,"attributesToRetrieve":["id","url","name","type","computed_properties","cooking_time","image","video","profiles","profile_type","source","source_image","source_type","sponsored","sponsored_type","sponsored_top_text","hide_sponsor","rating_value","rating_count","first_publish_at","features"],"attributesToHighlight":[],"numericFilters":["latest_sort < 1664179998293","category_ids=3871","type_id=0","image.id >= 0"],"facets":"category_ids,profile_ids,source_id","facetFilters":[]}'
# df = pd.DataFrame()

hits=[]
for page in range(100):
  data = '{"query":"","maxValuesPerFacet":100,"typoTolerance":"min","queryType":"prefixNone","hitsPerPage":16,"page":'+ str(page) +',"attributesToRetrieve":["url","name","type","rating_value","rating_count", "sponsored"],"attributesToHighlight":[],"numericFilters":["latest_sort < 1664179998293","category_ids=3871","type_id=0","image.id >= 0"],"facets":"category_ids,profile_ids,source_id","facetFilters":[]}'
  response = requests.post('https://tdevjqoq7l-dsn.algolia.net/1/indexes/production_www_popular/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.14.2)%3B%20Browser', headers=headers, data=data)
  
  response.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
  res = response.json()
  hits.extend(res['hits'])

df = pd.DataFrame(hits)
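
The request body above is assembled by string concatenation, which is easy to break with a stray quote. As a hedged alternative, the sketch below builds the body from a dictionary with `json.dumps`. `build_query` is a hypothetical helper, and the payload is a trimmed version of the one in the cell above; this shortened form has not been checked against the API.

```python
import json


def build_query(page, hits_per_page=16):
    """Return the JSON request body for one page of search results."""
    payload = {
        "query": "",
        "hitsPerPage": hits_per_page,
        "page": page,
        "attributesToRetrieve": [
            "url", "name", "type", "rating_value", "rating_count", "sponsored",
        ],
        # Subset of the filters used in the cell above.
        "numericFilters": ["category_ids=3871", "type_id=0"],
    }
    return json.dumps(payload)


body = build_query(3)
print(body)
```

The resulting string can be passed as `data=` to `requests.post` in the same way as the hand-written one.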

### Scrape each recipe to access its description

In [None]:
for urlSuffix in list(df.url):

  url = 'https://www.koket.se' + urlSuffix
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

  res = requests.get(url, headers=headers)
  soup = BeautifulSoup(res.content, 'lxml')

  selector = 'div.siteContentWrapper > div.recipe_wrapper__muAlZ > div.recipe_gridWrapper__HS_iF > div.recipe_gridLeftWrapper__QR9d2 > div.koket_markdown_mdWrapper__Vgj0c.description_description__IemoS > p'
  desc = soup.select_one(selector)
  if desc is not None:
    desc = desc.text.strip()
    df.loc[df['url'] == urlSuffix, ['desc']] = desc

In [None]:
df.to_pickle('drive/MyDrive/Andreas Olsson/AI-for-dinner/pasta-df.pkl') 

## Clean up data

### Filter out low-rated recipes

In [None]:
df = pd.read_pickle('drive/MyDrive/Andreas Olsson/AI-for-dinner/pasta-df.pkl') 
df = df.loc[(df.rating_value > 3.5) & (df.rating_count >= 5)]

### Transform descriptions to NLP prompts

In [None]:
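def transform(row):
  # Prefix each description with a fixed Swedish prompt ("Today's pasta dish is:")
  return f"Dagens pastarätt är: {row['desc']}"

# Recipes whose description could not be scraped have no 'desc' value;
# drop them so that no prompt ends in "nan"
df = df.dropna(subset=['desc'])
df['x'] = df.apply(transform, axis=1)
df = df[['x']]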

In [None]:
df

| | x |
|---|---|
| 0 | Dagens pastarätt är: Kramig och mild pastasås ... |
| 1 | Dagens pastarätt är: Spaghetti carbonara är en... |
| 2 | Dagens pastarätt är: Underbart god och lättlag... |
| 3 | Dagens pastarätt är: En pastarätt som inte all... |
| 4 | Dagens pastarätt är: Krämig pasta med färsk it... |
| ... | ... |
| 781 | Dagens pastarätt är: Har du överbliven kokt to... |
| 784 | Dagens pastarätt är: ”När du gör denna rätten ... |
| 788 | Dagens pastarätt är: Pastagratäng med fyra sor... |
| 791 | Dagens pastarätt är: Sinnenas Italien summerad... |


### Split into train and test sets

In [None]:
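# Split into train (80%) and test (20%) sets; random_state makes the split reproducible
train_df = df.sample(frac=0.8, random_state=0)
# Any row whose text also appears in the training set is excluded from the test set
test_df = df[~df.x.isin(train_df.x)]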

### Convert to huggingface Dataset

In [None]:
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

train_dataset.save_to_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/train_dataset')
test_dataset.save_to_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/test_dataset')

## Prepare data for training

### Tokenize datasets

In [None]:
train_dataset = load_from_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/train_dataset')
test_dataset = load_from_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/test_dataset')

In [None]:
def tokenize_function(examples):
  return tokenizer(examples['x'])

train_tok = train_dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["x", "__index_level_0__"])
test_tok = test_dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["x", "__index_level_0__"])

     


In [None]:
train_tok.save_to_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/train_tok')
test_tok.save_to_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/test_tok')

### Concatenate texts and split to chunks

In [None]:
block_size=128

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, though you could add padding instead if the model supports it
    # In this, as in all things, we advise you to follow your heart
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
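
To see what this chunking does, here is a small worked example. `group_texts_demo` is a hypothetical copy of the function above that takes `block_size` as an argument, and the token ids are made up. With three short "recipes" of 3, 2 and 5 tokens and a block size of 4, the 10 concatenated tokens yield two full blocks, and the last two tokens are dropped.

```python
def group_texts_demo(examples, block_size):
    """Same logic as group_texts, with block_size passed in explicitly."""
    # Concatenate all token lists per column.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modelling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


toy = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
chunks = group_texts_demo(toy, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that block boundaries ignore recipe boundaries, so a single block can contain the end of one description and the start of the next.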

In [None]:
lm_train_dataset = train_tok.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

lm_test_dataset = test_tok.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

In [None]:
lm_train_dataset.save_to_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/lm_train_dataset')
lm_test_dataset.save_to_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/lm_test_dataset')

## Loading, compiling, formatting datasets and training model

In [None]:
lm_train_dataset = load_from_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/lm_train_dataset')
lm_test_dataset = load_from_disk('drive/MyDrive/Andreas Olsson/AI-for-dinner/lm_test_dataset')

In [None]:
from transformers import AutoTokenizer, TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("birgermoell/swedish-gpt", from_pt=True, pad_token_id=tokenizer.eos_token_id)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['lm_head.weight', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.5.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

In [None]:
from transformers import AdamWeightDecay
# `learning_rate` replaces the deprecated `lr` keyword
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)


In [None]:
model.compile(optimizer=optimizer, jit_compile=True)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


We use the jit_compile argument to compile the model with XLA. XLA compilation adds a delay at the start of training, but this is quickly repaid by faster training iterations after that. It has one downside, though - if the shape of your input changes at all, then it will need to rerun the compilation again! This isn't a problem for us in this notebook, because all of our examples are exactly the same length. Be careful with it when that isn't true, though - if you have a variable sequence length in your batches, then you might spend more time compiling your model than actually training, especially for small datasets!

If you encounter difficulties when training with XLA, it's a good idea to remove the jit_compile argument and see if that fixes things. In fact, when debugging, it can be helpful to skip graph compilation entirely with the run_eagerly=True argument to compile(). This will let you identify the exact line of code where problems arise, but it will significantly reduce your performance, so make sure to remove it again when you've fixed the problem!
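
The point about input shapes can be illustrated without TensorFlow. The sketch below is only a rough proxy: it assumes that each distinct (batch size, sequence length) shape would trigger one compilation, and it counts those shapes before and after padding to a fixed length. Both helper functions and the example batches are hypothetical.

```python
def count_distinct_shapes(batches):
    """Count distinct (batch_size, sequence_length) shapes in a list of batches."""
    return len({(len(batch), len(batch[0])) for batch in batches})


def pad_batch(batch, length, pad_id):
    """Right-pad every sequence in the batch to the same fixed length."""
    return [seq + [pad_id] * (length - len(seq)) for seq in batch]


# Two made-up batches whose sequences have different lengths.
variable = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9, 10, 11], [12, 13, 14, 15, 16]],
]
fixed = [pad_batch(batch, length=8, pad_id=0) for batch in variable]

print(count_distinct_shapes(variable))  # 2 shapes: (2, 3) and (2, 5)
print(count_distinct_shapes(fixed))     # 1 shape: (2, 8)
```

In this notebook every example is a block of exactly `block_size` tokens, so the sequence length never changes.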

In [None]:
train_set = model.prepare_tf_dataset(
    lm_train_dataset,
    shuffle=True,
    batch_size=16,
)

validation_set = model.prepare_tf_dataset(
    lm_test_dataset,
    shuffle=False,
    batch_size=16,
)

The cell above converts our datasets to tf.data.Dataset, which Keras understands natively. There are two ways to do this: the slightly lower-level Dataset.to_tf_dataset() method, or Model.prepare_tf_dataset(). The main difference is that the Model method can inspect the model to determine which column names it can use as input, so you don't need to specify them yourself. It also supplies a default data collator that is appropriate for most tasks.

In [None]:
checkpoint_path = "drive/MyDrive/Andreas Olsson/AI-for-dinner/model-weights/"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

The callback above saves the model's weights at the end of each epoch.

In [None]:
model.fit(train_set, validation_data=validation_set, epochs=5, callbacks=[cp_callback])

Epoch 1/5
Epoch 1: saving model to drive/MyDrive/Andreas Olsson/AI-for-dinner/model-weights/
Epoch 2/5
Epoch 2: saving model to drive/MyDrive/Andreas Olsson/AI-for-dinner/model-weights/
Epoch 3/5
Epoch 3: saving model to drive/MyDrive/Andreas Olsson/AI-for-dinner/model-weights/
Epoch 4/5
Epoch 4: saving model to drive/MyDrive/Andreas Olsson/AI-for-dinner/model-weights/
Epoch 5/5
Epoch 5: saving model to drive/MyDrive/Andreas Olsson/AI-for-dinner/model-weights/


<keras.callbacks.History at 0x7f5b4045d090>

In [None]:
eval_loss = model.evaluate(validation_set)



Yikes, the loss is `nan`. This might be caused by how the data is structured. We will run inference anyway and see whether any interesting results are generated.
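
One cheap sanity check when the data structure is suspected is to work out how many blocks and batches a split actually yields. The sketch below is plain arithmetic using this notebook's `block_size=128` and `batch_size=16`. `count_blocks_and_batches` is a hypothetical helper and the token count is made up, so it shows the calculation and does not diagnose this particular run.

```python
def count_blocks_and_batches(total_tokens, block_size=128, batch_size=16):
    """Return (blocks, full_batches, leftover_blocks) for a given token count."""
    # Tokens that do not fill a whole block are dropped by the chunking step.
    blocks = total_tokens // block_size
    full_batches, leftover_blocks = divmod(blocks, batch_size)
    return blocks, full_batches, leftover_blocks


# 10_000 is a made-up token count, not a measurement from this notebook.
print(count_blocks_and_batches(10_000))  # (78, 4, 14)
```

A split that produces very few blocks, or none, would be worth inspecting before looking at the model itself.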

## Perform inference

In [None]:
from transformers import AutoTokenizer, TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("birgermoell/swedish-gpt", from_pt=True, pad_token_id=tokenizer.eos_token_id)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.0.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'lm_head.weight', 'transformer.h.9.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.8.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

In [None]:
checkpoint_path = "drive/MyDrive/Andreas Olsson/AI-for-dinner/model-weights/"
checkpoint_dir = os.path.dirname(checkpoint_path)
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f56103af490>
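
The generation cell in this section samples with `do_sample=True`, `temperature=0.5` and `top_k=4`. As a rough illustration of what those two settings do to a single next-token distribution, here is a simplified sketch in plain Python. It is not the Transformers implementation: `top_k_probs` and `sample_token` are hypothetical helpers, and the logits are made up.

```python
import math
import random


def top_k_probs(logits, top_k, temperature):
    """Return {token_index: probability} over the top_k highest logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {i: w / total for i, w in zip(top, weights)}


def sample_token(logits, top_k=4, temperature=0.5, rng=random):
    """Sample one token index from the truncated, rescaled distribution."""
    probs = top_k_probs(logits, top_k, temperature)
    indices = list(probs)
    return rng.choices(indices, weights=[probs[i] for i in indices], k=1)[0]


logits = [2.0, 1.0, 0.5, 0.1, -1.0, -3.0]  # made-up scores for a 6-token vocabulary
probs = top_k_probs(logits, top_k=4, temperature=0.5)
print(probs)
print(sample_token(logits, rng=random.Random(0)))
```

With `top_k=4` only the four most likely tokens can ever be chosen, and the low temperature shifts most of the remaining probability mass onto the top candidate, which makes the generated text more conservative.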

In [None]:
# Prompt the model and sample continuations (alternative beam-search settings are commented out below)
prompt = 'Pasta med örter, vitlök, soltorkade tomater och basilika'
# prompt = 'Dagens pastarätt är: En härligt god pastarätt med smak av rosmarin, vitlök och örter. Serveras med en krämig och krämig sås gjord på bland annat chili, tomat, rödlök,'

tokenized = tokenizer(prompt, return_tensors="tf")

tf.random.set_seed(0)
outputs = model.generate(
    **tokenized, 
    max_length=60,
    do_sample=True, 
    temperature=0.5,
    no_repeat_ngram_size=4,
    num_return_sequences=10, 
    top_p=1, 
    top_k=4,
    # num_beams=50, 
    # no_repeat_ngram_size=2, 
    # num_return_sequences=3, 
    # early_stopping=True
    )

for output in outputs:
  print("\nOutput: " + 100 * '-')
  print(tokenizer.decode(output, skip_special_tokens=True),'\n')


Output: ----------------------------------------------------------------------------------------------------
Pasta med örter, vitlök, soltorkade tomater och basilika. Enkelt och gott. Serveras med en krämig sås på grädde, tomat och basilika. Koka pastan i lättsaltat vatten i 10 minuter. Skala och finhacka löken 


Output: ----------------------------------------------------------------------------------------------------
Pasta med örter, vitlök, soltorkade tomater och basilika. Serveras med en krämig pasta med tomat, vitlök och basilika.Igår var det dags för årets första julbord på Restaurang Pizzabaren i Uppsala. Vi var ett gäng som träffades och 


Output: ----------------------------------------------------------------------------------------------------
Pasta med örter, vitlök, soltorkade tomater och basilika. Enkelt och gott till en enkel pastarätt. Servera pastan med en god sallad.Dagens pastarätt är: En krämig pastagratäng med köttfärssås och sp 


Output: ---------------------