# Content Mining Models

## Table of Contents
##### [1. Data Preprocessing and Imports](#preprocessing)
##### [2. Models](#models)
###### [2.1. Cosine Similarity](#cosine)
###### [2.2. LSI Model](#lsi)
###### [2.3. Mixture Model](#mixture)
##### [3. Interpretation and Evaluation](#interpretation_evaluation)

<a id='preprocessing'></a>
## 1. Data Preprocessing and Imports

##### Packages to install in cmd upfront:

conda install -c conda-forge selenium <br>
conda install -c anaconda nltk <br>
pip install rake-nltk

In [3]:
import pandas as pd
import nltk
import inflect
import re, string, unicodedata
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import SnowballStemmer
from nltk.stem import LancasterStemmer, WordNetLemmatizer
# Used further down for the error metrics
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [4]:
# Import functions module
%run functions.py

In [20]:
interactions_data = pd.read_csv(
        'Data/RAW_interactions.csv')
recipes_data = pd.read_csv(
        'Data/RAW_recipes.csv')

In [21]:
# Rename columns to more explanatory names
recipes_data.rename(columns={"id": "recipe_id"}, inplace=True)
interactions_data.rename(columns={"num_interactions": "date", "avg_rating": "rating"}, inplace=True)

# Fill nan
recipes_data.fillna("", inplace=True)
interactions_data.fillna("", inplace=True)

In [22]:
# Preprocess ingredients and save as String
for index, row in recipes_data.iterrows():
    # Strip list brackets and separators, then split on the quote characters
    ingredientlist = row['ingredients'].replace('[', '').replace(', ', '').replace(']', '').replace('and', '\'').split("\'")
    # Drop empty fragments and concatenate the rest (no separator is inserted)
    recipes_data.at[index, 'ingredients'] = ''.join(filter(None, ingredientlist))

In [23]:
# Extract keywords for free text features
get_keywords(recipes_data, "steps", "steps_keywords")
get_keywords(recipes_data, "description", "description_keywords")
get_keywords(interactions_data, "review", "review_keywords")

Unnamed: 0,user_id,recipe_id,date,rating,review_keywords
0,38094,40893,2003-02-17,4,pinch shake salad top stove added great cooked...
1,1293707,40893,2011-12-21,5,delicious forgot simple notice ;)< br />< /> s...
2,8937,44394,2002-12-01,4,worked whole package well great white chips 10...
3,126440,85009,2010-02-27,5,everyone loved took bunko made mexican topping
4,57222,85009,2011-10-01,5,sprinkling yum cheddar bacon topping adding ma...
5,52282,120345,2005-05-21,4,waited going pints added take 2 days raspberri...
6,124416,120345,2011-08-06,0,edna lewis cookbook though tried august also r...
7,2000192946,120345,2015-05-10,2,make impressed would start overly 4 cup recipe...
8,76535,134728,2005-09-02,4,good
9,273745,134728,2005-12-22,5,real better


In [24]:
recipes_data.head()

Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,nutrition,n_steps,ingredients,n_ingredients,steps_keywords,description_keywords
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,winter squashmexican seasoningmixed spicehoney...,7,comfortable cover make hour 40 minutes take re...,prepared either spicy choice year recipe sugge...
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,prepared pizza crustsausage pattyeggsmilksalt ...,6,brownedcut sausage frothyspoon tastebake 15 ba...,ham bacon prebaked late risers crust microwave...
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,ground beefyellow onionsdiced tomatoestomato p...,13,high wiltedadd ingredientsadd kidney beans 8 h...,favorite left extra large pot chili one mom fr...
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,spreadable cheese with garlic herbsnew potato...,11,spatula large pot possibleset aside finely dic...,time preparing lot advance spent times looks l...
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,tomato juiceapple cider vinegarsugarsaltpepper...,8,ingredients law amish mother boil thickpour ne...,ketchup type store amish mother raised acquire...


In [25]:
# Merge datasets
# Number of interactions per recipe
num_interactions = interactions_data.groupby("recipe_id")["date"].count()

# Mean rating per recipe; only ratings > 0 enter the mean, reviews without a rating are excluded
mean_ratings = interactions_data[interactions_data["rating"]!=0].groupby("recipe_id")["rating"].mean()

df_rmerged = recipes_data.join(num_interactions, how="left", on="recipe_id").join(mean_ratings, how="left", on="recipe_id")
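The effect of excluding zero ratings can be checked on a handful of rows. The sketch below is a plain-Python stand-in for the two `groupby` aggregations, using the ratings of recipes 40893 and 120345 from the interactions preview above; the names `toy_interactions` and `aggregate` are illustrative and not part of the notebook.

```python
from collections import defaultdict

# (recipe_id, rating) pairs taken from the interactions preview
toy_interactions = [
    (40893, 4), (40893, 5),
    (120345, 4), (120345, 0), (120345, 2),
]

def aggregate(interactions):
    """Return (interaction count per recipe, mean of non-zero ratings per recipe)."""
    counts = defaultdict(int)
    rated = defaultdict(list)
    for recipe_id, rating in interactions:
        counts[recipe_id] += 1          # every interaction counts, rated or not
        if rating != 0:                 # 0 means "review without a rating"
            rated[recipe_id].append(rating)
    means = {rid: sum(r) / len(r) for rid, r in rated.items()}
    return dict(counts), means

toy_counts, toy_means = aggregate(toy_interactions)
print(toy_counts)
print(toy_means)
```

Recipe 120345 keeps all three interactions in the count, while its mean is taken over the two non-zero ratings only.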

### NLP Preprocessing

In [26]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Andi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [27]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_special(words):
    """Remove special signs like &*"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[-,$()#+&*]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""  
    stopwords = nltk.corpus.stopwords.words('english')
    myStopWords = []
    stopwords.extend(myStopWords)
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words

def to_lowercase(words):
    """Convert words to lowercase"""
    new_words=[]
    for word in words:
        new_words.append(word.lower())
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    #stemmer = SnowballStemmer('english')
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize_lemmatize(words):
    """Run the full normalization pipeline on a list of tokenized words"""
    words = remove_special(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    #words = stem_words(words)
    words = lemmatize_verbs(words)
    return words
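The first three stages of the pipeline depend only on `re`, so their combined effect can be checked without NLTK data. The sketch below re-implements them compactly with the same regular expressions; the function names and the token list are made up for illustration.

```python
import re

def strip_special(words):
    """Drop the characters -,$()#+&* and discard tokens that become empty."""
    cleaned = (re.sub(r'[-,$()#+&*]', '', w) for w in words)
    return [w for w in cleaned if w]

def lowercase(words):
    """Convert every token to lowercase."""
    return [w.lower() for w in words]

def strip_punctuation(words):
    """Drop anything that is neither a word character nor whitespace."""
    cleaned = (re.sub(r'[^\w\s]', '', w) for w in words)
    return [w for w in cleaned if w]

toy_tokens = ['Pre-heat', 'the', 'OVEN', '(350)', '&', 'bake', '!']
toy_clean = strip_punctuation(lowercase(strip_special(toy_tokens)))
print(toy_clean)
```

In the full pipeline, number replacement, stop word removal and verb lemmatization would then run on this output.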

In [28]:
def get_processed(data):
    """Tokenize and normalize the 'content' column of data; return recipe_id and cleaned content"""
    rows = []

    for i in range(0, len(data)):
        recipe_id = data['recipe_id'].iloc[i]
        words = nltk.word_tokenize(data['content'].iloc[i])
        text = ' '.join(normalize_lemmatize(words))
        rows.append([recipe_id, text])

    # Build the frame once instead of appending row by row
    processed = pd.DataFrame(rows, columns=['recipe_id', 'content'])
    return processed

In [33]:
helper = pd.unique(interactions_data['recipe_id'])
df_rfiltered = df_rmerged[df_rmerged.recipe_id.isin(helper)]
print(df_rmerged.shape)
df_rfiltered.shape

(231637, 14)


(231637, 14)

In [34]:
def create_input(df, column_names):
    """Join the given text columns into one whitespace-normalized 'content' column"""
    # Work on a copy so the caller's frame (often a slice) is left untouched
    df_content = df.copy()
    df_content['content'] = df_content[column_names].apply(lambda texts: ' '.join(texts), axis=1)
    df_content.drop(columns=column_names, inplace=True)
    df_content['content'] = df_content['content'].apply(lambda text: ' '.join(text.split()))
    return df_content

In [35]:
df_rfiltered

Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,nutrition,n_steps,ingredients,n_ingredients,steps_keywords,description_keywords,date,rating
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,winter squashmexican seasoningmixed spicehoney...,7,comfortable cover make hour 40 minutes take re...,prepared either spicy choice year recipe sugge...,3,5.000000
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,prepared pizza crustsausage pattyeggsmilksalt ...,6,brownedcut sausage frothyspoon tastebake 15 ba...,ham bacon prebaked late risers crust microwave...,4,4.666667
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,ground beefyellow onionsdiced tomatoestomato p...,13,high wiltedadd ingredientsadd kidney beans 8 h...,favorite left extra large pot chili one mom fr...,1,4.000000
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,spreadable cheese with garlic herbsnew potato...,11,spatula large pot possibleset aside finely dic...,time preparing lot advance spent times looks l...,2,4.500000
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,tomato juiceapple cider vinegarsugarsaltpepper...,8,ingredients law amish mother boil thickpour ne...,ketchup type store amish mother raised acquire...,1,5.000000
5,apple a day milk shake,5289,0,1533,1999-12-06,"['15-minutes-or-less', 'time-to-make', 'course...","[160.2, 10.0, 55.0, 3.0, 9.0, 20.0, 7.0]",4,milkvanilla ice creamfrozen apple juice concen...,4,combine ingredients smoothsprinkle blendercove...,,2,5.000000
6,aww marinated olives,25274,15,21730,2002-04-14,"['15-minutes-or-less', 'time-to-make', 'course...","[380.7, 53.0, 7.0, 24.0, 6.0, 24.0, 6.0]",4,fennel seedsgreen olivesripe olivesgarlicpeppe...,9,ingredients toast 2 days marinatekeep refriger...,great appetizers thoroughly impressed vancouve...,1,2.000000
7,backyard style barbecued ribs,67888,120,10404,2003-07-30,"['weeknight', 'time-to-make', 'course', 'main-...","[1109.5, 83.0, 378.0, 275.0, 96.0, 86.0, 36.0]",10,pork spareribssoy saucefresh garlicfresh ginge...,22,ingredients 350 degreesplace ribs oven place m...,originaly recipe request posted cookbook chef ...,1,5.000000
8,bananas 4 ice cream pie,70971,180,102353,2003-09-10,"['weeknight', 'time-to-make', 'course', 'main-...","[4270.8, 254.0, 1306.0, 111.0, 127.0, 431.0, 2...",8,chocolate swich style cookieschocolate syrupva...,6,1 small spoonscoop form crumble cookies chocol...,,2,5.000000
9,beat this banana bread,75452,70,15892,2003-11-04,"['weeknight', 'time-to-make', 'course', 'main-...","[2669.3, 160.0, 976.0, 107.0, 62.0, 310.0, 138.0]",12,sugarunsalted butterbananaseggsfresh lemon jui...,9,pansfreezes well firm dry ingredients together...,ann hodgman,5,4.400000


In [38]:
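# Flatten steps data: strip list brackets, quotes and commas
df_rfiltered = df_rfiltered.copy()
df_rfiltered['steps_keywords'] = (df_rfiltered['steps_keywords']
                                  .str.replace("[", "", regex=False)
                                  .str.replace("'", "", regex=False)
                                  .str.replace("]", "", regex=False)
                                  .str.replace(",", "", regex=False))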

#create content df
df_content = create_input(df_rfiltered[['recipe_id', 'name', 'description_keywords', 'steps_keywords']], ['name', 'description_keywords', 'steps_keywords'])

df_content.head()


Unnamed: 0,recipe_id,content
0,137739,arriba baked winter squash mexican style prepa...
1,31490,a bit different breakfast pizza ham bacon preb...
2,112140,all in the kitchen chili favorite left extra l...
3,59389,alouette potatoes time preparing lot advance s...
4,44061,amish tomato ketchup for canning ketchup type ...


In [39]:
sample = df_content.sample(n=1000, replace=False, random_state=42)\
                 .reset_index(drop=True)
sample.head()

Unnamed: 0,recipe_id,content
0,94947,crab filled crescent snacks crescent roll reci...
1,429010,curried bean salad works nicely refreshing sal...
2,277542,delicious steak with onion marinade make times...
3,78450,pork tenderloin with hoisin delicious enjoyed ...
4,80012,mixed baby greens with oranges grapefruit and ...


In [111]:
processed = get_processed(sample)
processed.head()

Unnamed: 0,recipe_id,content
0,94947,crab fill crescent snack crescent roll recipe ...
1,429010,curry bean salad work nicely refresh salad tel...
2,277542,delicious steak onion marinade make time pleas...
3,78450,pork tenderloin hoisin delicious enjoy marinad...
4,80012,mix baby green oranges grapefruit avocado perf...


In [32]:
#add some stopwords
#new_stopwords = ['.', '\"', '\'', ':', '(', ')', ',', '-', 'etc', '1', '/', '2', '3', '\'s', '\'\'''','``', '!' ]
#stop_words.extend(new_stopwords)

In [110]:
# List of token lists, one per recipe, used as training sentences for Word2Vec
processed_embedding = []

# Split each processed recipe text into its tokens
for index, row in processed.iterrows():
    temp = row.content
    temp = temp.split(" ")
    processed_embedding.append(temp)

from gensim.models import Word2Vec 
from gensim.similarities import MatrixSimilarity 
from gensim.matutils import Dense2Corpus

model = Word2Vec()
model.build_vocab(processed_embedding)

model.train(processed_embedding, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)

model.init_sims(replace=True)

# wv.vectors holds the word vectors (wv.syn0 is the deprecated name)
sim_matrix = MatrixSimilarity(Dense2Corpus(model.wv.vectors.T))



<a id='models'></a>
## 2. Models

<a id='cosine'></a>
### 2.1. Cosine Similarity

In [35]:
cosine_sim = get_cos_sim_matrix(processed)
cosine_sim.shape

(1000, 1000)
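`get_cos_sim_matrix` is defined in `functions.py` and is not reproduced in this notebook. As a rough illustration of what such a matrix contains, the sketch below computes pairwise cosine similarities between raw term-count vectors in plain Python. The weighting scheme is an assumption (the real helper may use TF-IDF or something else), and `cosine_similarity_matrix` and `toy_texts` are hypothetical names.

```python
import math
from collections import Counter

def cosine_similarity_matrix(texts):
    """Pairwise cosine similarity between term-count vectors of whitespace-tokenized texts."""
    counts = [Counter(text.split()) for text in texts]
    norms = [math.sqrt(sum(v * v for v in c.values())) for c in counts]
    n = len(texts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if norms[i] == 0 or norms[j] == 0:
                sim = 0.0  # an empty text is treated as similar to nothing
            else:
                # Counter returns 0 for terms missing from the other text
                dot = sum(counts[i][t] * counts[j][t] for t in counts[i])
                sim = dot / (norms[i] * norms[j])
            matrix[i][j] = matrix[j][i] = sim  # the matrix is symmetric
    return matrix

toy_texts = [
    "pork tenderloin hoisin marinade",
    "pork tenderloin garlic marinade",
    "banana bread sugar butter",
]
toy_sim = cosine_similarity_matrix(toy_texts)
for row in toy_sim:
    print([round(value, 2) for value in row])
```

For n texts the result is n x n, which matches the (1000, 1000) shape obtained for the 1000 sampled recipes.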

In [40]:
interactions_processed = get_interaction_processed(processed, interactions_data)
interactions_processed.head()

Unnamed: 0,user_id,recipe_id,date,rating,review
0,88378,445577,2012-07-25,5,Very good - we all enjoyed this. I used just ...
1,2114486,445577,2013-07-04,3,"I&#039;d suggest using green curry paste, rath..."
2,900992,445577,2013-10-02,3,We added an extra jalapeno pepper and used Tha...
3,2503874,129377,2012-11-16,4,set up beautifully once it was completely cool...
4,247152,310201,2009-05-27,5,"This was even better than expected, especially..."


In [41]:
def get_coverage(processed, interactions, recipe, cosine_sim, k):
    """Share of interacted recipes that appear in at least one top-k recommendation list.

    The recipe argument is unused and kept only for compatibility with existing calls.
    """
    interactions_processed = get_interaction_processed(processed, interactions)
    uid_sample = interactions_processed['user_id'].values
    rid_sample = interactions_processed['recipe_id'].values

    all_rids = interactions_processed['recipe_id'].unique()
    pred_rids = set()

    for i in range(len(interactions_processed)):
        try:
            recipe_ids = get_recommendation_cos(processed,
                                                interactions_processed,
                                                rid_sample[i],
                                                uid_sample[i],
                                                cosine_sim,
                                                k)
            pred_rids.update(recipe_ids)
        except Exception:
            # Skip interactions for which no recommendation can be produced
            continue
    return len(pred_rids) / len(all_rids)
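Coverage here is the number of distinct recommended recipes divided by the number of distinct recipes in the interaction sample. A toy version with hard-coded recommendation lists makes the arithmetic explicit. The names below are illustrative, and the intersection with the catalog is an added safeguard that `get_coverage` itself does not apply.

```python
def catalog_coverage(recommendations, catalog_ids):
    """Fraction of catalog items that appear in at least one recommendation list."""
    catalog = set(catalog_ids)
    if not catalog:
        return 0.0
    recommended = set()
    for rec_list in recommendations:
        recommended.update(rec_list)
    # Only count recommended items that belong to the catalog
    return len(recommended & catalog) / len(catalog)

toy_catalog = [101, 102, 103, 104]
toy_recs = [[101, 102], [102, 103], [101]]
toy_coverage = catalog_coverage(toy_recs, toy_catalog)
print(toy_coverage)
```

Three of the four catalog items are recommended at least once, so the toy coverage is 0.75.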

In [54]:
uid_sample = interactions_processed['user_id'].values
rid_sample = interactions_processed['recipe_id'].values
print(uid_sample)
print(rid_sample)
predictions_cos = []
actual_cos = []
print(interactions_processed.head())
print(processed.head())

for i in range(len(interactions_processed)):
    try:
        act, pred = get_results_cos(processed,
                                    interactions_processed,
                                    recipes_data,
                                    rid_sample[i],
                                    uid_sample[i],
                                    cosine_sim,
                                    5)
        predictions_cos.append(pred)
        actual_cos.append(act)
    except Exception:
        # Skip interactions for which no prediction can be produced
        continue

[  88378 2114486  900992 ...  428885  169430  369363]
[445577 445577 445577 ... 273409 273409 273409]
   user_id  recipe_id        date  rating  \
0    88378     445577  2012-07-25       5   
1  2114486     445577  2013-07-04       3   
2   900992     445577  2013-10-02       3   
3  2503874     129377  2012-11-16       4   
4   247152     310201  2009-05-27       5   

                                              review  
0  Very good - we all enjoyed this.  I used just ...  
1  I&#039;d suggest using green curry paste, rath...  
2  We added an extra jalapeno pepper and used Tha...  
3  set up beautifully once it was completely cool...  
4  This was even better than expected, especially...  
  recipe_id                                            content
0     94947  crab fill crescent snack find crescent roll re...
1    429010  curry bean salad serve flavorful refresh salad...
2    277542  delicious steak onion marinade another try loo...
3     78450  pork tenderloin hoisin another k
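`get_results_cos` lives in `functions.py` and is not shown, so how it derives a predicted rating is not visible here. One common content-based approach, given purely as an assumption about what such a helper might do, is a similarity-weighted average of the user's ratings on the k most similar recipes. All names and numbers in this sketch are hypothetical.

```python
def predict_rating(target, user_ratings, similarity, k=5):
    """Similarity-weighted mean of the user's ratings on the k recipes most similar to target.

    Returns None when the user has rated no recipe with positive similarity.
    """
    candidates = [
        (similarity[target][other], rating)
        for other, rating in user_ratings.items()
        if other != target and similarity[target][other] > 0
    ]
    top_k = sorted(candidates, reverse=True)[:k]  # highest similarity first
    total_weight = sum(sim for sim, _ in top_k)
    if total_weight == 0:
        return None
    return sum(sim * rating for sim, rating in top_k) / total_weight

# Similarities of recipe 'a' to two other recipes the user has rated
toy_similarity = {'a': {'b': 0.75, 'c': 0.25}}
toy_user_ratings = {'b': 5, 'c': 3}
toy_prediction = predict_rating('a', toy_user_ratings, toy_similarity, k=2)
print(toy_prediction)
```

With a sample of 1000 recipes, many users will have rated no other sampled recipe, which is one plausible reason for the loop above to skip interactions.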

In [55]:
rmse_cos = mean_squared_error(predictions_cos, actual_cos)**0.5
mae_cos = mean_absolute_error(predictions_cos, actual_cos)
print(f'RMSE: {rmse_cos}, MAE: {mae_cos}')

RMSE: 1.0697657531912885, MAE: 0.5
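RMSE squares each error before averaging, so a few large misses raise it well above MAE. The hand-rolled sketch below shows this on made-up ratings; it mirrors what `mean_squared_error(...)**0.5` and `mean_absolute_error` compute, and the function names and toy values are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equally long rating lists."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error of two equally long rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

toy_actual = [5, 3, 4, 1]
toy_predicted = [5, 4, 4, 4]   # two exact hits, one miss by 1, one miss by 3

toy_rmse = rmse(toy_actual, toy_predicted)
toy_mae = mae(toy_actual, toy_predicted)
print(f'RMSE: {toy_rmse:.4f}, MAE: {toy_mae:.4f}')
```

A gap between the two metrics, as in the result above, usually points to a minority of predictions that are far off rather than a uniform small error.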


In [57]:
cov_cos = get_coverage(processed, interactions_data, recipes_data, cosine_sim, 5)
print(f'coverage: {cov_cos}')

coverage: 0.738


<a id='lsi'></a>
### 2.2. LSI Model

<a id='mixture'></a>
### 2.3. Mixture Model

<a id='interpretation_evaluation'></a>
## 3. Interpretation and Evaluation