# Content Mining Models

## Table of Contents
##### [1. Data Preprocessing and Imports](#preprocessing)
##### [2. Models](#models)
###### [2.1. Cosine Similarity](#cosine)
###### [2.2. LSI Model](#lsi)
###### [2.3. Mixture Model](#mixture)
##### [3. Interpretation and Evaluation](#interpretation_evaluation)

<a id='preprocessing'></a>
## 1. Data Preprocessing and Imports

##### Packages to install in cmd upfront:

conda install -c conda-forge selenium <br>
conda install -c anaconda nltk <br>
pip install rake-nltk

In [1]:
import random
import numpy as np
import pandas as pd
import nltk
import inflect
import re, string, unicodedata
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import SnowballStemmer
from nltk.stem import LancasterStemmer, WordNetLemmatizer

In [2]:
# Import functions module
%run functions.py

In [70]:
interactions_data = pd.read_csv(
        'Data/RAW_interactions.csv')
recipes_data = pd.read_csv(
        'Data/RAW_recipes.csv')

In [71]:
# Rename columns to more explanatory names
recipes_data.rename(columns={"id": "recipe_id"}, inplace=True)
interactions_data.rename(columns={"num_interactions": "date", "avg_rating": "rating"}, inplace=True)

# Fill nan
recipes_data.fillna("", inplace=True)
interactions_data.fillna("", inplace=True)

In [72]:
# Preprocess ingredients and save them as one string per recipe.
# Note: the parts are concatenated without a separator, and every literal
# 'and' is dropped, also inside words such as 'candy'.
for index, row in recipes_data.iterrows():
    ingredientlist = row['ingredients'].replace('[', '').replace(', ', '').replace(']', '').replace('and', '\'').split("\'")
    ingredientlist = list(filter(None, ingredientlist))
    recipes_data.at[index, 'ingredients'] = "".join(ingredientlist)
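
The chained `replace` calls above strip the list syntax by hand. As a hedged alternative, assuming the `ingredients` column holds the string form of a Python list, the list can be parsed with `ast.literal_eval` and joined with spaces so that ingredient boundaries survive. The helper name `ingredients_to_text` is hypothetical and not part of the notebook.

```python
import ast

def ingredients_to_text(raw):
    """Turn a stringified list of ingredients into one space-separated string."""
    items = ast.literal_eval(raw)  # "['a', 'b']" -> ['a', 'b']
    return " ".join(item.strip() for item in items if item.strip())

example = "['winter squash', 'mexican seasoning', 'honey']"
print(ingredients_to_text(example))
```

Unlike the loop above, this keeps a space between ingredients and does not touch the word "and".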

In [73]:
# Extract keywords for free text features
get_keywords(recipes_data, "steps", "steps_keywords")
get_keywords(recipes_data, "description", "description_keywords")
get_keywords(interactions_data, "review", "review_keywords")

Unnamed: 0,user_id,recipe_id,date,rating,review_keywords
0,38094,40893,2003-02-17,4,shake 15 minutes cayenne stove added pinch sal...
1,1293707,40893,2011-12-21,5,missing 1 remaining cumin 2 teaspoon simple do...
2,8937,44394,2002-12-01,4,easy 10oz whole package quite well worked whit...
3,126440,85009,2010-02-27,5,took made bunko mexican topping everyone loved
4,57222,85009,2011-10-01,5,made black pepper yum sprinkling adding chedda...
5,52282,120345,2005-05-21,4,bought 2 mix take pints sweet raspberries days...
6,124416,120345,2011-08-06,0,moldy tasting prior reason 120345 posted perha...
7,2000192946,120345,2015-05-10,2,2 cups much mix way exceptionally 1 overly wou...
8,76535,134728,2005-09-02,4,good
9,273745,134728,2005-12-22,5,real better


In [7]:
recipes_data.head()

Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,nutrition,n_steps,ingredients,n_ingredients,steps_keywords,description_keywords
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,winter squashmexican seasoningmixed spicehoney...,7,squash fork burn bakingif desired season salt ...,cook choice prepared either spicy inspired sea...
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,prepared pizza crustsausage pattyeggsmilksalt ...,6,set p 12 inch pizza panbake crust cheesepour e...,ham prebaked adding ingredients bit late riser...
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,ground beefyellow onionsdiced tomatoestomato p...,13,8 hours lowserve wiltedadd 6 high chilicook gr...,hit extra large pot original rainy day chili t...
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,spreadable cheese with garlic herbsnew potato...,11,large bowl add alouette ingredients except pla...,reflect potatoes everything super easy plus ti...
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,tomato juiceapple cider vinegarsugarsaltpepper...,8,"mix boil jars 2 1 making always great !"" law u...",type much prefers amish mother raised taste bo...


In [74]:
# Merge datasets
# Number of interactions per recipe
num_interactions = interactions_data.groupby("recipe_id")["date"].count()

# Mean rating per recipe; only ratings > 0 count, since 0 marks a review without a rating
mean_ratings = interactions_data[interactions_data["rating"]!=0].groupby("recipe_id")["rating"].mean()

df_rmerged = recipes_data.join(num_interactions, how="left", on="recipe_id").join(mean_ratings, how="left", on="recipe_id")
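
A minimal sketch of the same aggregate-then-join pattern on toy data, to show how recipes without any rating end up with missing values. The frames and the column names `num_interactions` and `avg_rating` are made up for illustration.

```python
import pandas as pd

interactions = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2],
    "rating":    [5, 0, 4, 3],   # 0 = review without a rating
})
recipes = pd.DataFrame({"recipe_id": [1, 2, 3], "name": ["a", "b", "c"]})

# Count all interactions, but average only the real ratings (> 0)
counts = interactions.groupby("recipe_id")["rating"].count().rename("num_interactions")
means = (interactions[interactions["rating"] != 0]
         .groupby("recipe_id")["rating"].mean().rename("avg_rating"))

# Left joins keep every recipe; recipe 3 has no interactions and gets NaN
merged = recipes.join(counts, on="recipe_id").join(means, on="recipe_id")
print(merged)
```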

### NLP Preprocessing

In [75]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Andi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [76]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_special(words):
    """Remove special signs like &*"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[-,$()#+&*]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    stop_words = set(nltk.corpus.stopwords.words('english'))
    my_stop_words = []  # add project-specific stop words here
    stop_words.update(my_stop_words)
    return [word for word in words if word not in stop_words]

def to_lowercase(words):
    """Convert words to lowercase"""
    new_words=[]
    for word in words:
        new_words.append(word.lower())
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    #stemmer = SnowballStemmer('english')
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize_lemmatize(words):
    words = remove_special(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    #words = stem_words(words)
    words = lemmatize_verbs(words)
    return words

In [77]:
def get_processed(data):
    """Tokenize, normalize and lemmatize the 'content' column of data."""
    rows = []
    # Use the data argument instead of the global sample
    for i in range(len(data)):
        recipe_id = data['recipe_id'].iloc[i]
        words = nltk.word_tokenize(data['content'].iloc[i])
        rows.append([recipe_id, ' '.join(normalize_lemmatize(words))])
    # Build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['recipe_id', 'content'])

In [78]:
helper = pd.unique(interactions_data['recipe_id'])
df_rfiltered = df_rmerged[df_rmerged.recipe_id.isin(helper)]
print(df_rmerged.shape)
df_rfiltered.shape

(231637, 14)


(231637, 14)

In [79]:
def create_input(df, column_names):
    """Join the given text columns into one whitespace-normalized 'content' column."""
    # Work on a copy so that the caller's frame (often a slice) is not modified
    df_content = df.copy()
    df_content['content'] = df_content[column_names].apply(lambda texts: ' '.join(texts), axis=1)
    df_content.drop(columns=column_names, inplace=True)
    df_content['content'] = df_content['content'].apply(lambda text: ' '.join(text.split()))
    return df_content
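
To make the effect of `create_input` concrete, here is a small sketch of the same two steps (join the text columns, then collapse repeated whitespace) on a made-up two-row frame. The toy data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "recipe_id": [1, 2],
    "name": ["apple  pie", "chili"],
    "description_keywords": ["sweet easy", ""],
    "steps_keywords": ["bake  cool", "simmer stir"],
})
text_columns = ["name", "description_keywords", "steps_keywords"]

content = toy[["recipe_id"]].copy()
content["content"] = (toy[text_columns]
                      .apply(" ".join, axis=1)                       # one string per row
                      .apply(lambda text: " ".join(text.split())))   # collapse whitespace
print(content)
```

Empty fields (such as the missing description of recipe 2) leave no trace after the whitespace cleanup.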

In [14]:
df_rfiltered

Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,nutrition,n_steps,ingredients,n_ingredients,steps_keywords,description_keywords,date,rating
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,winter squashmexican seasoningmixed spicehoney...,7,squash fork burn bakingif desired season salt ...,cook choice prepared either spicy inspired sea...,3,5.000000
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,prepared pizza crustsausage pattyeggsmilksalt ...,6,set p 12 inch pizza panbake crust cheesepour e...,ham prebaked adding ingredients bit late riser...,4,4.666667
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,ground beefyellow onionsdiced tomatoestomato p...,13,8 hours lowserve wiltedadd 6 high chilicook gr...,hit extra large pot original rainy day chili t...,1,4.000000
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,spreadable cheese with garlic herbsnew potato...,11,large bowl add alouette ingredients except pla...,reflect potatoes everything super easy plus ti...,2,4.500000
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,tomato juiceapple cider vinegarsugarsaltpepper...,8,"mix boil jars 2 1 making always great !"" law u...",type much prefers amish mother raised taste bo...,1,5.000000
5,apple a day milk shake,5289,0,1533,1999-12-06,"['15-minutes-or-less', 'time-to-make', 'course...","[160.2, 10.0, 55.0, 3.0, 9.0, 20.0, 7.0]",4,milkvanilla ice creamfrozen apple juice concen...,4,2 cups ground cinnamonmakes blendercover blend...,,2,5.000000
6,aww marinated olives,25274,15,21730,2002-04-14,"['15-minutes-or-less', 'time-to-make', 'course...","[380.7, 53.0, 7.0, 24.0, 6.0, 24.0, 6.0]",4,fennel seedsgreen olivesripe olivesgarlicpeppe...,9,bowl stir wellcover leave toast 2 days ingredi...,great appetizers condiments thoroughly impress...,1,2.000000
7,backyard style barbecued ribs,67888,120,10404,2003-07-30,"['weeknight', 'time-to-make', 'course', 'main-...","[1109.5, 83.0, 378.0, 275.0, 96.0, 86.0, 36.0]",10,pork spareribssoy saucefresh garlicfresh ginge...,22,full rolling boil reduce heat glass ribs open ...,chef sam choy posted recipe originaly request ...,1,5.000000
8,bananas 4 ice cream pie,70971,180,102353,2003-09-10,"['weeknight', 'time-to-make', 'course', 'main-...","[4270.8, 254.0, 1306.0, 111.0, 127.0, 431.0, 2...",8,chocolate swich style cookieschocolate syrupva...,6,crumble cookies 9 form vanilla ice cream slice...,,2,5.000000
9,beat this banana bread,75452,70,15892,2003-11-04,"['weeknight', 'time-to-make', 'course', 'main-...","[2669.3, 160.0, 976.0, 107.0, 62.0, 310.0, 138.0]",12,sugarunsalted butterbananaseggsfresh lemon jui...,9,blended uniformlybe patient beat pull away 30 ...,ann hodgman,5,4.400000


In [80]:
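# Flatten steps data: strip list brackets, quotes and commas.
# Copy first so the assignment does not write into a slice of df_rmerged.
df_rfiltered = df_rfiltered.copy()
df_rfiltered['steps_keywords'] = df_rfiltered['steps_keywords'].str.replace(r"[\[\]',]", "", regex=True)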

#create content df
df_content = create_input(df_rfiltered[['recipe_id', 'name', 'description_keywords', 'steps_keywords']], ['name', 'description_keywords', 'steps_keywords'])

df_content.head()

  


Unnamed: 0,recipe_id,content
0,137739,arriba baked winter squash mexican style cook ...
1,31490,a bit different breakfast pizza ham prebaked a...
2,112140,all in the kitchen chili hit extra large pot o...
3,59389,alouette potatoes reflect potatoes everything ...
4,44061,amish tomato ketchup for canning type much pre...


In [16]:
# Draw a reproducible sample of 1000 recipes; the cells below rely on it
sample = df_content.sample(n=1000, replace=False, random_state=42)\
                   .reset_index(drop=True)
sample.head()

Unnamed: 0,recipe_id,content
0,94947,crab filled crescent snacks crescent roll reci...
1,429010,curried bean salad serve works nicely family t...
2,277542,delicious steak with onion marinade took estim...
3,78450,pork tenderloin with hoisin another keeper enj...
4,80012,mixed baby greens with oranges grapefruit and ...


In [86]:
processed = get_processed(sample)
processed.shape

(1000, 2)

In [18]:
#add some stopwords
#new_stopwords = ['.', '\"', '\'', ':', '(', ')', ',', '-', 'etc', '1', '/', '2', '3', '\'s', '\'\'''','``', '!' ]
#stop_words.extend(new_stopwords)

In [82]:
# Tokenized recipe texts: one list of tokens per recipe
processed_embedding = [text.split(" ") for text in processed['content']]

from gensim.models import Word2Vec 
from gensim.similarities import MatrixSimilarity 
from gensim.matutils import Dense2Corpus

model = Word2Vec()
model.build_vocab(processed_embedding)

model.train(processed_embedding, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)

model.init_sims(replace=True)

# wv.syn0 was renamed to wv.vectors and no longer exists in gensim 4
sim_matrix = MatrixSimilarity(Dense2Corpus(model.wv.vectors.T))



<a id='models'></a>
## 2. Models

<a id='cosine'></a>
### 2.1. Cosine Similarity

### Vectorization

In [83]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

In [84]:
def get_cos_sim_matrix(processed):
    tfidf = TfidfVectorizer(stop_words='english')
    processed['content'] = processed['content'].fillna('')
    tfidf_matrix = tfidf.fit_transform(processed['content'])
    #reduce dimensionality of tfidf matrix
    svd = TruncatedSVD(n_components=10, random_state=42)
    tfidf_truncated = svd.fit_transform(tfidf_matrix) 
    cosine_sim = cosine_similarity(tfidf_truncated,tfidf_truncated)
    return cosine_sim
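
The function above reduces the TF-IDF matrix with a truncated SVD (the idea behind LSI) and then compares recipes by cosine similarity. The following NumPy-only sketch illustrates both steps on a tiny made-up term-weight matrix. It is not the scikit-learn implementation, and `cosine_similarity_rows` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity_rows(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for all-zero rows
    unit = X / norms
    return unit @ unit.T

# Hypothetical term-weight matrix: 3 documents x 4 terms
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0]])

# Keep the 2 strongest singular directions: X_reduced = U_k * s_k
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :2] * s[:2]

sim = cosine_similarity_rows(X_reduced)
print(np.round(sim, 3))
```

Documents 0 and 1 use the same terms in the same proportions and come out as identical in direction, while document 2 shares no terms with them and gets a similarity of zero.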

In [85]:
cosine_sim = get_cos_sim_matrix(processed)
cosine_sim.shape

(1000, 1000)

### Make recommendations for Coverage

In [34]:
#return top k predicted ratings in readable form 

# IMPORTANT: recipe_info must be indexed by recipe_id, e.g. df.set_index('recipe_id')
def get_user_recommendations(user_id, similarity, content, interactions_data, recipe_info, k):
    #get top k recipe ids
    topk_recipes, predictions = get_topk_recipes(user_id, similarity, content, interactions_data, k)
    #copy so that adding a column does not write into a slice of recipe_info
    info = recipe_info.loc[topk_recipes].copy()
    info['prediction'] = predictions
    return info

def get_topk_recipes(user_id, similarity, content, interactions, k):
    prediction_df = get_user_preference(user_id,similarity, content, interactions)
    #take only the not yet seen recipes
    new_predictions = prediction_df[prediction_df['has_rated'] == False]
    #sort predictions
    ordered_predictions = new_predictions.sort_values(by='prediction', ascending=False)
    #get recipe_id array
    topk_recipes = ordered_predictions.index[:k].values
    predictions = ordered_predictions.prediction[:k].values
    return topk_recipes, predictions
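
The selection step in `get_topk_recipes` boils down to: mask out what the user has already rated, sort the rest by predicted rating, keep the first k. A minimal NumPy sketch with made-up numbers (`top_k_unseen` is a hypothetical helper):

```python
import numpy as np

def top_k_unseen(predictions, has_rated, k):
    """Return positions of the k highest predictions among items not yet rated."""
    predictions = np.asarray(predictions, dtype=float)
    candidates = np.flatnonzero(~np.asarray(has_rated, dtype=bool))
    order = np.argsort(-predictions[candidates], kind="stable")  # descending
    return candidates[order[:k]]

preds = [4.8, 3.1, 4.9, 4.2, 2.0]
rated = [False, False, True, False, False]
print(top_k_unseen(preds, rated, 2))
```

Item 2 has the highest prediction but is skipped because the user has already rated it.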

In [42]:
def get_user_preference(user_id, similarity, content, interactions_data):
    #prepare similarity dataframe
    sim = pd.DataFrame(similarity, index=content['recipe_id'].values, columns=content['recipe_id'].values)
    #get already rated recipes of user
    rated_recipes = interactions_data[interactions_data['user_id']==user_id]['recipe_id'].values
    #get similarities of ALL recipes w/ already rated recipes of user
    sim_rated_all = sim.loc[rated_recipes, :]
    #get ratings of already rated recipes (shape: 1 x number of rated recipes)
    ratings = get_reshaped_ratings(user_id, interactions_data)
    #compute weighted similarities between all recipes and already rated recipes
    weighted_sim = np.dot(ratings, sim_rated_all)
    #compute normalization constant
    norm_const = np.array(np.abs(sim_rated_all).sum(axis=0))
    #similarity-weighted average rating for every recipe
    pref_predictions = weighted_sim/norm_const
    flat_predictions = np.asarray(pref_predictions).ravel()
    #return df with recipe id also (use the content argument, not a global)
    prediction_df = pd.DataFrame(flat_predictions, index=content['recipe_id'].values, columns=['prediction'])
    #indicate the already tried recipes
    prediction_df['has_rated'] = prediction_df.index.isin(rated_recipes)
    return prediction_df

#arrange ratings for matrix multiplication
def get_reshaped_ratings(user_id, interactions_data):
    user_rows = interactions_data[interactions_data['user_id']==user_id]
    #keep only the numeric rating column; date and review text cannot be multiplied
    ratings = user_rows.set_index('recipe_id')[['rating']].transpose()
    ratings.columns.name = None
    ratings.rename(index={'rating':user_id}, inplace=True)
    return ratings
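
The prediction in `get_user_preference` is a similarity-weighted average of the user's ratings. A worked example with made-up numbers may help to check the matrix shapes. The ratings and similarities below are hypothetical.

```python
import numpy as np

# Hypothetical user who rated items 0 and 2 (out of 4 items)
ratings = np.array([5.0, 3.0])              # ratings for items 0 and 2
sim = np.array([[1.0, 0.5, 0.2, 0.0],       # similarity of item 0 to all items
                [0.2, 0.5, 1.0, 0.8]])      # similarity of item 2 to all items

weighted = ratings @ sim                    # sum_i r_i * s_ij, one value per item j
norm = np.abs(sim).sum(axis=0)              # sum_i |s_ij|
predictions = weighted / norm
print(np.round(predictions, 3))
```

Item 3 is only similar to the recipe rated 3, so its prediction is exactly 3. Item 1 is equally similar to both rated recipes and lands on their mean, 4.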

In [26]:
#Source: https://github.com/statisticianinstilettos/recmetrics/

def catalog_coverage(predicted, catalog, k):
    """
    Computes the catalog coverage for k lists of recommendations
    Parameters
    ----------
    predicted : a list of lists
        Ordered predictions
        example: [['X', 'Y', 'Z'], ['X', 'Y', 'Z']]
    catalog: list
        A list of all unique items in the training data
        example: ['A', 'B', 'C', 'X', 'Y', 'Z']
    k: integer
        The number of recommendation lists to observe,
        sampled at random (with replacement) in our offline setup
    Returns
    ----------
    catalog_coverage:
        The catalog coverage of the recommendations as a percent
        rounded to 2 decimal places
    ----------    
    Metric Definition:
    Ge, M., Delgado-Battenfeld, C., & Jannach, D. (2010, September).
    Beyond accuracy: evaluating recommender systems by coverage and serendipity.
    In Proceedings of the fourth ACM conference on Recommender systems (pp. 257-260). ACM.
    """
    sampling = random.choices(predicted, k=k)
    predicted_flattened = [p for sublist in sampling for p in sublist]
    L_predictions = len(set(predicted_flattened))
    catalog_coverage = round(L_predictions/(len(catalog)*1.0)*100,2)
    return catalog_coverage
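
Leaving the random sampling aside, catalog coverage is the share of catalog items that appear in at least one recommendation list. A deterministic sketch using the example values from the docstring (`coverage_percent` is a hypothetical helper):

```python
def coverage_percent(recommendation_lists, catalog):
    """Percentage of catalog items that occur in at least one recommendation list."""
    recommended = {item for rec in recommendation_lists for item in rec}
    return round(len(recommended) / len(catalog) * 100, 2)

catalog = ["A", "B", "C", "X", "Y", "Z"]
predicted = [["X", "Y", "Z"], ["X", "Y", "A"]]
print(coverage_percent(predicted, catalog))  # 4 of 6 items are covered
```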

In [27]:
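def make_all_recommendations(user_ids, similarity, content, interactions, k):
    """Sketch: collect the top-k recipe ids per user as a list of lists (input for catalog_coverage)."""
    nested_recommendations = [list(get_topk_recipes(uid, similarity, content, interactions, k)[0])
                              for uid in user_ids]
    return nested_recommendations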

In [43]:
# Restrict the interactions to the sampled recipes and index the recipe info by recipe_id
sample_interactions = interactions_data[interactions_data['recipe_id'].isin(processed['recipe_id'])]
get_user_recommendations(60992, cosine_sim, processed, sample_interactions,
                         df_rmerged.set_index('recipe_id'), 5)

### Make prediction function --> for RMSE

In [87]:
#prediction for 1 already rated recipe based on similarities to other already rated recipes

def get_one_prediction(similarity, content, interactions, user_id, recipe_id):
    sim = pd.DataFrame(similarity, index=content['recipe_id'].values, columns=content['recipe_id'].values)
    #get already rated recipes of user
    rated_recipes = interactions[interactions['user_id']==user_id]['recipe_id'].values
    #get similarities of to be predicted recipe rating with already rated recipes by user x
    sim_rated = sim.loc[sim.index==recipe_id, rated_recipes].loc[recipe_id].values
    #get ratings of rated recipes
    ratings = interactions[interactions['user_id']==user_id]['rating'].values
    
    actual = interactions.loc[(interactions.user_id==user_id) & (interactions.recipe_id==recipe_id)]['rating'].values[0]
    prediction = np.dot(ratings, sim_rated) /np.array([np.abs(sim_rated).sum(axis=0)])
    return actual, prediction
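
Written as a formula, the computation in `get_one_prediction` is the following. The notation is introduced here for illustration: $I_u$ is the set of recipes rated by user $u$, $r_{u,i}$ is the rating of recipe $i$ by user $u$, and $s_{ij}$ is the similarity between recipes $i$ and $j$.

$$
\hat{r}_{u,j} = \frac{\sum_{i \in I_u} s_{ij}\, r_{u,i}}{\sum_{i \in I_u} \lvert s_{ij} \rvert}
$$

Since the function is evaluated on recipes the user has already rated, $j$ itself is part of $I_u$ and enters the sum with $s_{jj} = 1$.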

In [88]:
#only relevant if there is a recipe sample

def get_interaction_processed(processed, interactions):
    #fetch only interactions in the preprocessed sample
    interactions_processed = interactions.loc[interactions.recipe_id.isin(processed.recipe_id)]\
                           .reset_index()\
                           .drop(columns=['index'])
    print(f'Interactions before processing: {len(interactions.index)}\nInteractions covered in sample: {len(interactions_processed.index)}')
    return interactions_processed

In [90]:
interactions_processed = get_interaction_processed(sample, interactions_data)
interactions_processed.head()

Interactions before processing: 1132367
Interactions covered in sample: 4333


Unnamed: 0,user_id,recipe_id,date,rating,review_keywords
0,88378,445577,2012-07-25,5,mix perfect ... winner 1 jalapeno found add co...
1,2114486,445577,2013-07-04,3,rather find thai basil cilantro suggest using ...
2,900992,445577,2013-10-02,3,flavor would probably throw used thai basil ne...
3,2503874,129377,2012-11-16,4,set nutmeg sprinkle loved beautifully added co...
4,247152,310201,2009-05-27,5,gasp !)... served change honestly turned long ...


#### a) Make predictions based on tfidf

In [94]:
uids = interactions_processed['user_id'].values
rids = interactions_processed['recipe_id'].values

predictions_cos = []
actual_cos = []

#Make a prediction for each interaction that falls into the recipe sample;
#recipes outside the sample have no row/column in cosine_sim
for i in range(len(interactions_processed)):
    act, pred = get_one_prediction(cosine_sim, processed, interactions_processed, uids[i], rids[i])
    predictions_cos.append(pred)
    actual_cos.append(act)

In [95]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
#each prediction is a one-element array, so flatten before scoring
predictions_flat = np.ravel(predictions_cos)
rmse_cos = mean_squared_error(actual_cos, predictions_flat)**0.5
mae_cos = mean_absolute_error(actual_cos, predictions_flat)
print(f'RMSE: {rmse_cos}, MAE: {mae_cos}')
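
For reference, a small sketch of what the two error measures compute, using made-up ratings and plain NumPy instead of scikit-learn:

```python
import numpy as np

actual = np.array([5.0, 3.0, 4.0, 1.0])      # hypothetical true ratings
predicted = np.array([4.0, 3.0, 5.0, 3.0])   # hypothetical predictions

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more strongly
mae = np.mean(np.abs(errors))          # average absolute deviation
print(f'RMSE: {rmse:.3f}, MAE: {mae:.3f}')
```

The single error of 2 stars pulls the RMSE above the MAE, which is why the two are reported together.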

In [96]:
interactions_processed['predicted_rating'] = [item for sublist in predictions_cos for item in sublist]
interactions_processed.head()

In [97]:
# get_rating_dist must be defined before this cell; the run recorded below failed because it was missing
get_rating_dist(round(interactions_processed.predicted_rating))

NameError: name 'get_rating_dist' is not defined

#### b) Make predictions based on the mixture model; c) based on word embeddings, etc.

In [None]:
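from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances

def get_mix_sim_matrix(processed, lmbda, df_rfiltered):
    """Blend text similarity with similarity of numeric recipe features; lmbda weights the text part."""
    cos_sim = get_cos_sim_matrix(processed)
    #numeric features, reordered to match the row order of processed (and thus of cos_sim)
    df_processed = df_rfiltered.set_index('recipe_id')[['n_steps', 'minutes', 'n_ingredients']]\
                               .loc[processed['recipe_id'].values]

    scaler = StandardScaler()
    X = scaler.fit_transform(df_processed)
    eucl_dis = euclidean_distances(X, X)

    #map distances in [0, inf) to similarities in (0, 1]
    eucl_sim = np.exp(-eucl_dis)
    mixed_sim = lmbda*cos_sim + (1-lmbda)*eucl_sim

    return mixed_sim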
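
The blending step can be checked in isolation. The sketch below turns a distance matrix into similarities with `exp(-d)` and mixes it with a cosine similarity matrix. The matrices are made up and `mix_similarities` is a hypothetical helper.

```python
import numpy as np

def mix_similarities(cos_sim, distances, lmbda):
    """Convex combination of a similarity matrix and exp(-distance)."""
    dist_sim = np.exp(-distances)   # distance 0 -> similarity 1, large distance -> ~0
    return lmbda * cos_sim + (1.0 - lmbda) * dist_sim

cos_sim = np.array([[1.0, 0.2],
                    [0.2, 1.0]])
distances = np.array([[0.0, 1.0],
                      [1.0, 0.0]])

mixed = mix_similarities(cos_sim, distances, 0.5)
print(np.round(mixed, 3))
```

With `lmbda = 1` only the text similarity counts, with `lmbda = 0` only the numeric features. The diagonal stays at 1 for any value in between.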

In [None]:
mixed_sim = get_mix_sim_matrix(processed, 0.5, df_rfiltered)
mixed_sim.shape

In [None]:
uids = interactions_processed['user_id'].values
rids = interactions_processed['recipe_id'].values

predictions_mixed = []
actual_mixed = []

#Make a prediction for each interaction in the sampled interactions df
for i in range(len(interactions_processed)):
    act, pred = get_one_prediction(mixed_sim, processed, interactions_processed, uids[i], rids[i])
    predictions_mixed.append(pred)
    actual_mixed.append(act)

In [None]:
rmse_mixed = mean_squared_error(predictions_mixed, actual_mixed)**0.5
mae_mixed = mean_absolute_error(predictions_mixed, actual_mixed)
print(f'RMSE: {rmse_mixed}, MAE: {mae_mixed}')

In [None]:
interactions_processed['predicted_rating_mixed'] = [item for sublist in predictions_mixed for item in sublist]
interactions_processed.head()

In [None]:
get_rating_dist(round(interactions_processed.predicted_rating_mixed))

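Recommendation (evaluated via coverage): predict a rating for ALL recipes for one user and recommend the recipes with the highest predicted rating that the user has NOT rated yet.

Prediction (evaluated via RMSE): predict the rating of recipes the user has already rated and compare it with the actual rating.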

<a id='lsi'></a>
### 2.2. LSI Model

<a id='mixture'></a>
### 2.3. Mixture Model

<a id='interpretation_evaluation'></a>
## 3. Interpretation and Evaluation