# "I want a meat pie but with a mexican twist"

Recently my girlfriend and I were getting bored of our usual meals and started thinking about new recipes we could try.

The problem was that we could never think of specific meals we wanted to try. Instead we'd have vague requests like:

"I want something that's meaty with spices." "I want to try spicy desserts" "I want a meat pie but with a mexican twist"

Wouldn't it be great if we could almost "talk" to a computer and ask for suggestions?

This is what I'll be doing here. We'll be looking at two powerful NLP techniques, LSI and LDA, and hopefully come up with some interesting results.



# First the scraping

I had never really scraped data before, so I decided this was the perfect opportunity to have a go at it. For the uninitiated, scraping means I've got some code that takes information that's on a website and saves it onto my computer for later use. In this case the data would be recipes.

My girlfriend suggested the site she used http://www.skinnytaste.com/ so that's what I decided to go with.

As far as tools go, I simply used the Python scraping framework Scrapy (very good!). I quickly got the hang of it and scraped myself some recipes!

If you ever want to take a look at the scraping code, it's on my GitHub under tutorial->spiders->scrape_recipes.py


In [1]:
import json
import logging
import os
import pprint
import tempfile

import gensim
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from gensim import corpora
from gensim.utils import simple_preprocess
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from myscripts import *  # provides text_process(), see myscripts.py

%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 6)
pp = pprint.PrettyPrinter(indent=4)

data_folder = "../data/"
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

data = pd.read_json(os.path.join(data_folder, "scraped_recipes.json"))
data.dropna(inplace=True)

use_old_models = True  # set to False to rebuild every model from scratch

number_of_topics_to_train_for = 10



Using TensorFlow backend.


In [2]:
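#some regex
import re

def some_regex(x):
    # put a space between a letter and the number right after it, e.g. "tomatoes1" -> "tomatoes 1"
    text = re.sub(r'([a-zA-Z])(\d{1,2})', r'\1 \2', x)
    # turn slashes, newlines and tabs into spaces, then trim trailing whitespace
    text = text.replace("/", " ").replace("\n", " ").replace("\r", " ").replace("\t", " ").rstrip()
    return text

for each_column in data.columns:
    data[each_column] = data[each_column].apply(some_regex)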
                    

# After doing some quick cleaning here's what the scrapes look like

In [3]:
data.head()


| | ingredients | instructions | name |
|---|---|---|---|
| 0 | 1 cup halved cherry tomatoes½ medium red onion... | : Preheat oven to 400 degrees F. In ... | Chicken with Roasted Tomato and Red Onions |
| 1 | 1 1 2 lbs shelled and deveined jumbo shrimp (3... | : Preheat the oven to 400F. Spray 2 large nons... | Easy Roasted Lemon-Garlic Shrimp |
| 2 | 1 2 lb boneless skinless chicken breast, cut 1... | Start by spriralizing the zucchini using b... | Chicken and Zucchini Noodle Caprese |
| 3 | 1 3 cup grated Parmigiano Reggiano 1 4 cup fre... | : In the bowl of a food processor add the chee... | Skinny Caesar Dressing |
| 4 | 1 lb dry northern beans (navy beans would work... | print out completely. ... | White Northern Beans with Aji Verde Sauce |


One thing of note, however, is that this example only has around 750 recipes. This will almost certainly affect the quality of our results, but let's see what we can get nonetheless.

# The preprocessing

This is probably one of the most crucial parts of any NLP analysis. Before vectorizing our words we have to make sure they are in the form that's most useful to a computer. We want to reduce the noise in a sentence. For example:

In [4]:
example_string="the horses ran towards the mice"

print(text_process(example_string))


['horse', 'run', 'mouse']


See what happened? Horses became horse and mice became mouse. The plurals have been taken away. Ran became the "root" verb run. Every verb tense is turned into its root form. This is so that the algorithm understands that running, ran and run all refer to the same action and treats them as the same word.

Processing the text usually involves the following steps:

### Tokenize:
Splitting up your text into its individual words.

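### Removing any stopwords:
Very common words that carry little meaning on their own are taken out. Words like "the", "is" and "a" are all stopwords. Words that appear too rarely or too often across the recipes are also filtered out later, when the dictionary is built.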

### Bigrams:
Sometimes a couple of words tend to appear together, like Barack Obama. If the algorithm notices this enough times, it might turn Barack Obama into Barack_Obama, so that the computer evaluates it as a single entity.

### Lemmatization:
This is the part that can take a while to run. It's what turned "ran" or "running" into "run".


Also I neatly summarized the entire process under my function text_process(). You can see what it does under myscripts.py
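
The real text_process() lives in myscripts.py and isn't reproduced here. To make the steps above concrete, here is a toy sketch of the same idea using only the standard library. The tiny stopword set and lemma table are made-up stand-ins for a proper stopword list and lemmatizer, and the function name is hypothetical.

```python
import re

# Toy stand-ins: a real pipeline would use a full stopword list and a
# proper lemmatizer instead of these hand-written tables.
STOPWORDS = {"the", "is", "a", "an", "towards", "and", "of"}
LEMMAS = {"horses": "horse", "ran": "run", "running": "run", "mice": "mouse"}

def toy_text_process(text):
    """Tokenize, drop stopwords, then map each word to its root form."""
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [LEMMAS.get(t, t) for t in tokens]           # lemmatize

print(toy_text_process("the horses ran towards the mice"))
# ['horse', 'run', 'mouse']
```

Bigram detection is left out of the sketch since it needs statistics from the whole corpus rather than a single sentence.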


# Generators and memory shortages

A key thing I learned when dealing with NLP projects is that you can easily be dealing with HUGE datasets. Imagine if you had a scraper going across multiple websites. Would you collect all the text, store it in one big file and then transform and analyze the entire thing all at once?
The answer is no. Holding such a huge file in memory would take a lot of RAM and most likely slow your computer down to a crawl.

Generally what you want to do is use generators, like below. Your algorithm then transforms the text ONE LINE AT A TIME: it reads a line, makes the changes it needs, saves the result and moves on to the next line. The RAM you need is only what's required to hold that one line. Your computer is saved.
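
As a minimal, self-contained illustration of the one-line-at-a-time idea (separate from the recipe data), the sketch below streams a small throwaway text file through a generator. The function name and the file contents are made up for the example.

```python
import os
import tempfile

def stream_lines(path):
    """Yield one stripped, non-empty line at a time instead of loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line

# Write a tiny throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("chicken with roasted tomato\n\nlemon garlic shrimp\n")
    tmp_path = tmp.name

# Only one line is held in memory at any moment while we count words.
lengths = [len(line.split()) for line in stream_lines(tmp_path)]
os.remove(tmp_path)
print(lengths)
# [4, 3]
```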



In [5]:
from nltk import pos_tag
from nltk.corpus import wordnet as wn

from nltk.stem import WordNetLemmatizer

class text_process_gen(object):
    def __init__(self, texts):
        self.texts = texts        
    def __iter__(self):
        print(self.texts.shape)
        
        for text in self.texts:
            
            tokenized_text=text_process(text)
            empty=[]
            if tokenized_text != empty: 
                yield tokenized_text
            

# Making the dictionary

Don't worry, we're slowly but surely getting to the recipes. We'll be using the terrific library gensim for our NLP analysis.
https://radimrehurek.com/gensim/

One of the first things you need to do is create a dictionary. This is so that the algorithm knows which words are in play and how many of each there are.

In [6]:
import struct
#I put in this special 

merged_df=pd.DataFrame()

merged_df["dictionary"] = data["ingredients"].map(str) +" "+ data["instructions"]+" "+data["name"]

merged_df["name"]=data["name"]
    
try:
    1/use_old_models # To redo all the models from scratch, set use_old_models to False at the very beginning. False == 0, so 1/0 raises an error and we fall through to the except branch
    my_scraped_nlp_dict=corpora.Dictionary.load(os.path.join(data_folder, 'my_scraped_nlp.dict'))
    processed_texts=np.load(os.path.join(data_folder,"scraped_processed_tokenized_text.npy"), allow_pickle=True).tolist() # allow_pickle is needed to load arrays of Python lists

except:

    processed_texts=[[text for text in texts] for texts in text_process_gen(merged_df.dictionary.values)]
    np.save(os.path.join(data_folder,"scraped_processed_tokenized_text"),np.array(processed_texts, dtype=object)) # dtype=object because the token lists have different lengths

    my_scraped_nlp_dict = corpora.Dictionary(processed_texts)
    my_scraped_nlp_dict.filter_extremes(no_below=20, no_above=0.2) #in case you want to filter out some words
    my_scraped_nlp_dict.save(os.path.join(data_folder, 'my_scraped_nlp.dict'))  # store the dictionary, for future reference
    my_scraped_nlp_dict=my_scraped_nlp_dict.load(os.path.join(data_folder, 'my_scraped_nlp.dict'))
    #print(my_scraped_nlp_dict)
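
The 1/use_old_models trick works, but the bare except will also hide unrelated errors. A more explicit alternative is a small load-or-build helper. The sketch below is only an illustration using pickle and a temporary folder; the helper name, its parameters and the cached data are all made up, and the gensim objects above would keep using their own save() and load() methods.

```python
import os
import pickle
import tempfile

def load_or_build(path, build, use_cached=True):
    """Load a pickled object from path if allowed and present, else build and save it."""
    if use_cached and os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = build()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result

cache_path = os.path.join(tempfile.mkdtemp(), "tokens.pkl")
first = load_or_build(cache_path, lambda: [["chicken", "tomato"]])  # builds and saves
second = load_or_build(cache_path, lambda: [["never", "called"]])   # loads the cached copy
print(first == second)
# True
```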

# About the NLP techniques...

Today we're looking at two main techniques: Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). I won't go into the theory behind them as it would make this blog post significantly longer, but I strongly suggest that you read up on them.

Before we feed our words into the LSI and LDA models, we need to transform them into vectors. Here's how we'll do that.


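# 1. The Bag Of Words (BOW)

Imagine every word in a sentence was represented by a vector. Each vector holds nothing but zeros except for a single one, at the position reserved for that word. Add up the vectors of all the words in a recipe and you get its bag of words: a count of how many times each word appears, with the word order thrown away.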



In [7]:
#just loading or creating my bow corpus
try:
    1/use_old_models # if use_old_models=0, then this fails
    bow_corpus = corpora.MmCorpus(os.path.join(data_folder, 'MY_SCRAPED_BOW.mm'))
    
except:
    bow_corpus = [my_scraped_nlp_dict.doc2bow(text) for text in processed_texts]
    corpora.MmCorpus.serialize(os.path.join(data_folder, 'MY_SCRAPED_BOW.mm'), bow_corpus)  # store to disk, for later use
    bow_corpus = corpora.MmCorpus(os.path.join(data_folder, 'MY_SCRAPED_BOW.mm')) 



# BOW example


In [8]:
my_scraped_nlp_dict.doc2bow(text_process("chocolate strawberry pie"))

[(25, 1), (430, 1), (977, 1)]

What happened above is that every word found in my recipes has an equivalent vector. In the above case chocolate is the vector with the 1 on the 25th row, strawberry is the vector with a 1 on the 430th row, etc. Each pair in the output is (word id, number of times the word appears).
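
To see where those (id, count) pairs come from, here is a toy version of the same idea in plain Python. It is a sketch, not gensim's implementation: the function names and the two tiny documents are made up, and ids are simply handed out in order of first appearance.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def toy_doc2bow(vocab, tokens):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["chocolate", "pie"], ["strawberry", "pie", "pie"]]
vocab = build_vocab(docs)  # {'chocolate': 0, 'pie': 1, 'strawberry': 2}
print(toy_doc2bow(vocab, ["pie", "chocolate", "pie", "kale"]))
# [(0, 1), (1, 2)]
```

Note that "kale" silently disappears because it was never in the vocabulary. The same thing happens to any word in a request that the recipes never used.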



# 2. Term frequency–inverse document frequency (TFIDF)

Not all words are equally important. If a particular word occurs very frequently in a text, is it important? Only if it's not also in every other document as well. 

TFIDF reweights the "ones" in your word vectors by how important they are.

In [9]:

try:
    1/use_old_models # if use_old_models=0, then this fails
    lsi_model = gensim.models.LsiModel.load(os.path.join(data_folder,'scraped_lsi.model'))
    print("using old LSI model")
    tfidf_model=gensim.models.TfidfModel.load(os.path.join(data_folder,'scraped_Tfidf.model'))
    print("using old TFIDF model")
    corpus_tfidf = tfidf_model[bow_corpus]
except:
    print("creating new tfidf_model")
    tfidf_model = gensim.models.TfidfModel(bow_corpus) 
    
    corpus_tfidf = tfidf_model[bow_corpus]
    print("creating new LSI model")
    %time lsi_model = gensim.models.LsiModel(corpus_tfidf, id2word=my_scraped_nlp_dict, num_topics=number_of_topics_to_train_for) # initialize an LSI transformation
    lsi_model.save(os.path.join(data_folder,'scraped_lsi.model'))
    tfidf_model.save(os.path.join(data_folder,'scraped_Tfidf.model'))
    print("done")





using old LSI model
using old TFIDF model


# TFIDF example


In [10]:
tfidf_model[my_scraped_nlp_dict.doc2bow(text_process("chocolate strawberry pie"))]

[(25, 0.48616100228069553),
 (430, 0.6032717699063982),
 (977, 0.6322267405728988)]

Notice how the 1s have been reweighted in terms of their importance. Chocolate has the lowest score of the three, so the word is probably found in a lot of recipes.
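
The reweighting can be sketched by hand. The toy function below uses one common scheme: multiply each count by log2(number of documents / number of documents containing the word), then scale each document to unit length. This is an assumption for illustration; gensim's TfidfModel may differ in its details (for instance, this sketch keeps zero weights in the output), and the function name and data are made up.

```python
import math

def toy_tfidf(bow_docs):
    """Weight each (token_id, count) pair by log2(N / document frequency), then L2-normalize."""
    n_docs = len(bow_docs)
    doc_freq = {}
    for doc in bow_docs:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_docs = []
    for doc in bow_docs:
        weights = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:
            weights = [(tid, w / norm) for tid, w in weights]
        weighted_docs.append(weights)
    return weighted_docs

# Token 0 appears in every document; tokens 1 and 2 appear in one document each.
bow_docs = [[(0, 1), (1, 1)], [(0, 1), (2, 1)]]
print(toy_tfidf(bow_docs))
# [[(0, 0.0), (1, 1.0)], [(0, 0.0), (2, 1.0)]]
```

The word that shows up everywhere ends up with a weight of zero, while the words unique to a document carry all of its weight.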

# Latent Semantic Indexing (LSI)

Once the corpus is converted into TFIDF you can convert it to LSI. LSI (also known as LSA, Latent Semantic Analysis) will try to group all your documents/recipes according to a certain number of topics that you specify.

Let's say in this case we have 10 types of recipes. Given that, how will the LSI model split up our recipes? Below you can see each "topic" and its most descriptive words along with their weights.


In [11]:
lsi_model.print_topics(number_of_topics_to_train_for)

[(0,
  '0.431*"2012" + 0.214*"2011" + 0.157*"2010" + 0.109*"flour" + 0.096*"shrimp" + 0.094*"milk" + 0.089*"broth" + 0.089*"chocolate" + 0.088*"pasta" + 0.087*"squash"'),
 (1,
  '0.651*"2012" + 0.224*"2011" + -0.142*"broth" + 0.132*"2010" + -0.119*"bay" + -0.107*"cumin" + -0.098*"cooker" + -0.098*"bell" + -0.093*"parsley" + -0.090*"pasta"'),
 (2,
  '-0.563*"2012" + 0.503*"2011" + 0.241*"2010" + 0.139*"pumpkin" + 0.135*"chocolate" + 0.105*"flour" + 0.102*"smoothie" + 0.101*"milk" + 0.101*"banana" + 0.097*"cooky"'),
 (3,
  '0.545*"2011" + 0.352*"2010" + -0.221*"chocolate" + -0.214*"flour" + -0.166*"banana" + -0.148*"oat" + -0.146*"milk" + -0.134*"cooky" + -0.133*"almond" + -0.130*"muffin"'),
 (4,
  '0.321*"shrimp" + -0.308*"2010" + -0.212*"cooker" + 0.191*"2011" + -0.190*"pressure" + 0.169*"avocado" + -0.160*"pork" + -0.156*"broth" + -0.154*"soup" + 0.153*"salmon"'),
 (5,
  '0.283*"pasta" + -0.234*"avocado" + -0.212*"lime" + 0.195*"parmesan" + 0.183*"sausage" + 0.162*"squash" + 0.161*"mo

In [12]:

lsi_corpus = lsi_model[corpus_tfidf] 


# Comparing with other recipes using LSI

Now to compare recipes, or more precisely a string of words with another string of words.
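
The comparison itself is cosine similarity: two vectors pointing in the same direction score close to 1, and vectors with nothing in common score 0. gensim's MatrixSimilarity handles this for the whole corpus at once; the plain-Python sketch below only shows the calculation for a single pair of sparse vectors, with made-up numbers.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (index, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(index, 0.0) for index, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [(0, 1.0), (1, 1.0)]
print(cosine_similarity(query, [(0, 2.0), (1, 2.0)]))  # same direction, so close to 1
print(cosine_similarity(query, [(2, 5.0)]))            # nothing shared, so 0
```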

In [13]:
from gensim import similarities

try:
    1/use_old_models 
    index = similarities.MatrixSimilarity.load(os.path.join(data_folder, 'scraped_lsi_similarity.index'))
    print("loading similarities matrix")
except:
    print("creating new similarities matrix")
    index = similarities.MatrixSimilarity(lsi_corpus)
    index.save(os.path.join(data_folder, 'scraped_lsi_similarity.index'))
    index = similarities.MatrixSimilarity.load(os.path.join(data_folder, 'scraped_lsi_similarity.index'))


loading similarities matrix


# Can it recognize itself?

First thing to try here is to simply feed recipe 1 into the model as a "new" recipe. The model will look at the recipes it already has and should suggest recipe 1 as the most similar recipe.

In [14]:


test_run=merged_df.dictionary[0] #picking the first recipe

vec_bow = my_scraped_nlp_dict.doc2bow(test_run.split())
#print(vec_bow) #this part works
vec_tfidf=tfidf_model[vec_bow]
#print(vec_tfidf) # you join it with the model, not the corpus
vec_lsi = lsi_model[vec_tfidf] # convert the query to LSI space
sims = index[vec_lsi] # perform a similarity query against the corpus
#print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

top_10=list(sorted(enumerate(sims), key=lambda x: x[1], reverse=True))[:20]
top_10_indexes=[x[0] for x in top_10]
top_10_match_pct=[x[1] for x in top_10]
print("Target recipe at index 0")
print(merged_df.name[0])
print("INDEXES")
print(top_10_indexes)
top_10_names=merged_df.name[top_10_indexes].values

print("RECIPES")
pp.pprint(list(zip(top_10_names,top_10_match_pct)))

Target recipe at index 0
Chicken with Roasted Tomato and Red Onions
INDEXES
[0, 493, 70, 537, 255, 325, 623, 630, 488, 637, 624, 322, 373, 645, 361, 379, 29, 278, 259, 491]
RECIPES
[   ('Chicken with Roasted Tomato and Red Onions', 0.99430108),
    ('Brussels Sprouts Gratin', 0.96521372),
    ('Spinach and Feta Stuffed Chicken Breasts', 0.94115025),
    ('Turkey Sausage Patties From Scratch', 0.93434477),
    ('Vegan Eggplant Meatballs', 0.93186545),
    ('Easy Garlic Broccolini', 0.92941445),
    ('Baked Eggplant Sticks', 0.92788124),
    ('Zucchini Noodles (Zoodles) with Lemon-Garlic Spicy Shrimp', 0.92096043),
    ('Light and Easy Cauliflower Gratin', 0.91492635),
    ('Chicken Zoodle “Lo Mein” For Two', 0.91050708),
    ('Roasted Cauliflower and Chickpeas with Minted Yogurt', 0.91044116),
    ('Sauteed Julienned Summer Vegetables', 0.9084087),
    ('Cauliflower Nuggets', 0.90719622),
    ('Open-Faced Omelet with Feta, Roasted Tomatoes and Spinach', 0.90408921),
    ('Buffalo Brusse

Yes! As you can see it recognized our recipe as the most likely.
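
The sort-and-slice step in the cell above gets repeated for every query in this post, so it could be wrapped in a small helper. This is just a sketch: the helper name is hypothetical and the scores and names below are made up to show the shape of the result.

```python
def top_matches(sims, names, n=10):
    """Return the n (name, score) pairs with the highest similarity scores."""
    ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)[:n]
    return [(names[index], float(score)) for index, score in ranked]

# Made-up scores and names, only to show the shape of the result.
sims = [0.2, 0.9, 0.5]
names = ["Recipe A", "Recipe B", "Recipe C"]
print(top_matches(sims, names, n=2))
# [('Recipe B', 0.9), ('Recipe C', 0.5)]
```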


# Now for the more interesting part...

Let's make a request

In [15]:

test_run="I want a meat pie with a mexican twist"

vec_bow = my_scraped_nlp_dict.doc2bow(test_run.split())
#print(vec_bow) #this part works
vec_tfidf=tfidf_model[vec_bow]
#print(vec_tfidf) # you join it with the model, not the corpus
vec_lsi = lsi_model[vec_tfidf] # convert the query to LSI space
sims = index[vec_lsi] # perform a similarity query against the corpus
#print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

top_10=list(sorted(enumerate(sims), key=lambda x: x[1], reverse=True))[:20]
top_10_indexes=[x[0] for x in top_10]
top_10_match_pct=[x[1] for x in top_10]

print("INDEXES")
print(top_10_indexes)
top_10_names=merged_df.name[top_10_indexes].values

print("RECIPES")
pp.pprint(list(zip(top_10_names,top_10_match_pct)))

INDEXES
[674, 312, 695, 165, 144, 315, 476, 178, 563, 334, 208, 164, 363, 163, 137, 314, 138, 17, 196, 317]
RECIPES
[   ('Hearty Vegetarian Pumpkin Chili', 0.8901611),
    ('How To Make Pumpkin Puree (Instant Pot or Oven Method)', 0.86093962),
    ('Embarrassingly Easy Crock Pot Salsa Chicken Thighs', 0.84236896),
    ('Chicken and White Bean Enchiladas with Creamy Salsa Verde', 0.83126831),
    ('Instant Pot (Pressure Cooker) Easy Salsa Shredded Chicken', 0.83106315),
    ('Crock Pot Picadillo Stuffed Peppers', 0.82663399),
    ('Buffalo Chicken and Bean Chili', 0.81245703),
    ('Slow Cooker 3-Bean Turkey Chili', 0.80944347),
    ('Crock Pot Kid-Friendly Turkey Chili', 0.80787885),
    ('Slow Cooker Apple Butter', 0.80512428),
    ('Crock Pot Chicken Enchilada Soup (and Instant Pot)', 0.80116326),
    ('Skinny Chicken Enchiladas', 0.80073172),
    ('Instant Pot Rock Creek Ranch Black Beans', 0.79761064),
    ('White Bean Turkey Chili', 0.79459208),
    ('BBQ Chicken Chili', 0.7799319

Interesting! I see chili, salsa, meats and enchiladas (which can kind of be interpreted as Mexican meat pies).

Pretty cool, eh? Let's try with LDA.


# Trying with Latent Dirichlet Allocation (LDA)

In [16]:

try:
    1/use_old_models # if use_old_models=0, then this fails
    lda_model = gensim.models.LdaModel.load(os.path.join(data_folder,'scraped_lda.model'))
    lda_corpus=lda_model[corpus_tfidf]
    print("using old LDA model")
except:
    print("creating LDA model")
    %time lda_model = gensim.models.LdaModel(corpus_tfidf, id2word=my_scraped_nlp_dict, num_topics=100)
    lda_model.save(os.path.join(data_folder,'scraped_lda.model'))
    



using old LDA model


# Comparing with LDA

In [17]:
lda_corpus=lda_model[corpus_tfidf]

In [18]:
try:
    1/use_old_models 
    index = similarities.MatrixSimilarity.load(os.path.join(data_folder, 'scraped_lda_similarity.index'))
    print("loading similarities matrix")
except:
    print("creating new similarities matrix")
    index = similarities.MatrixSimilarity(lda_corpus)
    index.save(os.path.join(data_folder, 'scraped_lda_similarity.index'))
    index = similarities.MatrixSimilarity.load(os.path.join(data_folder, 'scraped_lda_similarity.index'))

loading similarities matrix


# Can it recognize itself? (yes it can)

In [19]:
test_run=merged_df.dictionary[0] #picking the first recipe

vec_bow = my_scraped_nlp_dict.doc2bow(test_run.split())
#print(vec_bow) #this part works
vec_tfidf=tfidf_model[vec_bow]
#print(vec_tfidf) # you join it with the model, not the corpus
vec_lda = lda_model[vec_tfidf] # convert the query to LDA space
sims = index[vec_lda] # perform a similarity query against the corpus
#print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

top_10=list(sorted(enumerate(sims), key=lambda x: x[1], reverse=True))[:20]
top_10_indexes=[x[0] for x in top_10]
top_10_match_pct=[x[1] for x in top_10]
print("Target recipe at index 0")
print(merged_df.name[0])
print("INDEXES")
print(top_10_indexes)
top_10_names=merged_df.name[top_10_indexes].values

print("RECIPES")
pp.pprint(list(zip(top_10_names,top_10_match_pct)))

Target recipe at index 0
Chicken with Roasted Tomato and Red Onions
INDEXES
[0, 582, 105, 379, 492, 374, 384, 55, 15, 430, 662, 481, 715, 483, 713, 34, 358, 596, 677, 309]
RECIPES
[   ('Chicken with Roasted Tomato and Red Onions', 0.91052574),
    ('Stir Fried Pork and Mixed Veggies', 0.78934348),
    ('Dark Chocolate Nut Clusters with Sea Salt', 0.70103484),
    ('Asparagus-Pancetta Potato Hash', 0.70103484),
    (   'Butternut Stuffed Turkey Tenderloin with Cranberries and Pecans',
        0.70103484),
    ('Chicken Sausage and Herb Stuffing', 0.69756132),
    ('Sun Dried Tomato and Cheese Stuffed Chicken Rollatini', 0.69679677),
    ('Rosemary Chicken Salad with Avocado and Bacon', 0.67904806),
    ('Arroz Con Pollo, Lightened Up', 0.60981923),
    ('Summer Potato Salad with Apples', 0.58557379),
    ('Salpicón', 0.58321798),
    ('Baked Sweet Potato Skins', 0.57591981),
    ('Whole Grain Apple Nut Muffins', 0.56266266),
    ('Butternut Squash Risotto', 0.54179806),
    ('Pumpkin Nu

# Making a request



In [20]:

test_run="I want a meat pie with a mexican twist"

vec_bow = my_scraped_nlp_dict.doc2bow(test_run.split())
#print(vec_bow) #this part works
vec_tfidf=tfidf_model[vec_bow]
#print(vec_tfidf) # you join it with the model, not the corpus
vec_lda = lda_model[vec_tfidf] # convert the query to LDA space
sims = index[vec_lda] # perform a similarity query against the corpus
#print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

top_10=list(sorted(enumerate(sims), key=lambda x: x[1], reverse=True))[:20]
top_10_indexes=[x[0] for x in top_10]
top_10_match_pct=[x[1] for x in top_10]

print("INDEXES")
print(top_10_indexes)
top_10_names=merged_df.name[top_10_indexes].values

print("RECIPES")
pp.pprint(list(zip(top_10_names,top_10_match_pct)))


INDEXES
[237, 635, 697, 278, 261, 690, 696, 672, 247, 27, 148, 470, 2, 511, 214, 571, 471, 24, 102, 570]
RECIPES
[   ('Skinnytaste Dinner Plan (Week 68)', 0.67060399),
    ('Spiralized Mediterranean Beet and Feta Skillet Bake', 0.67060399),
    ('Raw Shredded Brussels Sprouts with Lemon and Oil', 0.67060399),
    ('Roasted Brussels Sprouts and Cauliflower Soup', 0.64382297),
    ('Quick Spiralized Zucchini and Grape Tomatoes', 0.64002627),
    ('Butternut Squash and Black Bean Enchiladas', 0.63826555),
    ('Meringue Ghosts', 0.62629765),
    ('Broccoli and Cheese Tots', 0.62481678),
    ('Skinnytaste Dinner Plan (Week 80)', 0.58397818),
    ('Pumpkin Spice No-Bake Cheesecake', 0.54190755),
    ('Italian Pulled Pork Ragu (Instant Pot, Slow Cooker, Stove)', 0.53573096),
    ('Lemon and Ginger Ice Pops', 0.53562826),
    ('Chicken and Zucchini Noodle Caprese', 0.50658315),
    ('Pumpkin Pie Dip', 0.50658315),
    ('Sheetpan Italian Chicken and Veggie Dinner', 0.4978703),
    ('Pasta with


Hmmm, while LDA is supposed to perform better, these suggestions don't look as good as those from LSI...

Now one thing that could be affecting this is that we only have around 750 recipes. That's very little. We could try accumulating a lot more and trying this exercise again.



# For the next post...

I will try this again but with more data. I'll also take a look at two other NLP techniques called word2vec and doc2vec.



# One last thing...

Below is a cool topic visualization library I found. It allows you to see how similar your topics are and what words are most prominent in those topics. I strongly encourage you to give it a try!

In [21]:
import pyLDAvis
import pyLDAvis.gensim
lda_vis = pyLDAvis.gensim.prepare(lda_model, lda_corpus, my_scraped_nlp_dict)
pyLDAvis.display(lda_vis)

