# Content Mining Models

## Table of Contents
##### [1. Data Preprocessing and Imports](#preprocessing)
##### [2. Models](#models)
###### [2.1. Cosine Similarity](#cosine)
###### [2.2. LSI Model](#lsi)
###### [2.3. Mixture Model](#mixture)
##### [3. Interpretation and Evaluation](#interpretation_evaluation)

<a id='preprocessing'></a>
## 1. Data Preprocessing and Imports

In [23]:
import pandas as pd
import nltk
import sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
# Import functions module
%run functions.py

In [25]:
interactions_data = pd.read_csv(
        'C:/Users/d067795/OneDrive - SAP SE/Documents/Master/Semester 2/Web Mining/Project/RAW_interactions.csv')
recipes_data = pd.read_csv(
        'C:/Users/d067795/OneDrive - SAP SE/Documents/Master/Semester 2/Web Mining/Project/RAW_recipes.csv')

In [26]:
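# Rename columns to more explanatory names
recipes_data.rename(columns={"id": "recipe_id"}, inplace=True)
# rename() silently skips keys that are not existing columns, so this call is a no-op
# unless the interactions file contains "num_interactions" or "avg_rating"
interactions_data.rename(columns={"num_interactions": "date", "avg_rating": "rating"}, inplace=True)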

# Fill nan
recipes_data.fillna("", inplace=True)
interactions_data.fillna("", inplace=True)

In [27]:
# Preprocess ingredients and save as a string
for index, row in recipes_data.iterrows():
    # Strip list brackets and separators, then split on the quote characters.
    # Note: replacing 'and' also hits words that contain it (e.g. 'brandy').
    ingredientlist = row['ingredients'].replace('[', '').replace(', ', '').replace(']', '').replace('and', '\'').split("\'")
    ingredientlist = list(filter(None, ingredientlist))
    # The ingredients are concatenated without a separator
    recipes_data.at[index, 'ingredients'] = "".join(ingredientlist)
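
The cell above strips brackets and quotes by hand and concatenates the ingredients without a separator. As an alternative, here is a minimal sketch that assumes every `ingredients` value is a valid Python list literal (for example `"['salt', 'olive oil']"`). The helper name `parse_ingredients` and its `sep` parameter are hypothetical and not part of `functions.py`. Joining with a space keeps ingredient boundaries visible to a later tokenizer.

```python
import ast


def parse_ingredients(raw, sep=" "):
    """Turn a stringified list such as "['salt', 'olive oil']" into one string.

    Assumption: `raw` is either empty or a valid Python list literal of strings.
    """
    if not raw:
        return ""
    items = ast.literal_eval(raw)  # safely evaluates the literal, no arbitrary code
    return sep.join(item.strip() for item in items if item.strip())


example = "['winter squash', 'mexican seasoning', 'salt and pepper']"
parsed = parse_ingredients(example)
print(parsed)
```

Applied to the whole column, this could replace the row-wise loop with `recipes_data['ingredients'].apply(parse_ingredients)`, provided the assumption about the input format holds.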

In [28]:
# Extract keywords for free text features
get_keywords(recipes_data, "steps", "steps_keywords")
get_keywords(recipes_data, "description", "description_keywords")
get_keywords(interactions_data, "review", "review_keywords")

Unnamed: 0,user_id,recipe_id,date,rating,review_keywords
0,38094,40893,2003-02-17,4,15 minutes pinch salt stove top added great ca...
1,1293707,40893,2011-12-21,5,2 teaspoon forgot great remaining cumin simple...
2,8937,44394,2002-12-01,4,easy whole package quite worked great white ch...
3,126440,85009,2010-02-27,5,everyone loved made mexican topping took bunko
4,57222,85009,2011-10-01,5,black pepper made adding yum sprinkling chedda...
...,...,...,...,...,...
1132362,116593,72730,2003-12-09,0,fresh cranberries add slices another approach ...
1132363,583662,386618,2009-09-29,5,handle raw meat hands prep chili powder adds a...
1132364,157126,78003,2008-06-23,5,years mushrooms serve perfect pot roast red sk...
1132365,53932,78003,2009-01-11,4,used regular port beef smelled gravy cooking g...


In [29]:
# Merge datasets
# Number of interactions per recipe (the resulting column keeps the name "date")
num_interactions = interactions_data.groupby("recipe_id")["date"].count()

# Mean rating per recipe; only ratings > 0 count, reviews without a rating are excluded
mean_ratings = interactions_data[interactions_data["rating"]!=0].groupby("recipe_id")["rating"].mean()

df_rmerged = recipes_data.join(num_interactions, how="left", on="recipe_id").join(mean_ratings, how="left", on="recipe_id")
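
To make the aggregation and the left joins easier to follow, here is a small sketch on made-up data (the toy frames and the column names `num_interactions` and `avg_rating` are illustrative assumptions, not the notebook's actual output). It shows that a rating of 0 still counts as an interaction but is left out of the mean, and that a recipe without any interaction ends up with missing values.

```python
import pandas as pd

# Toy data, invented for illustration only
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2, 2],
    "date": ["2003-02-17", "2011-12-21", "2012-01-05", "2002-12-01", "2004-03-09"],
    "rating": [4, 5, 0, 0, 3],
})
recipes = pd.DataFrame({"recipe_id": [1, 2, 3], "name": ["a", "b", "c"]})

# Count every interaction, including reviews that carry no rating (rating == 0)
num_interactions = (
    interactions.groupby("recipe_id")["date"].count().rename("num_interactions")
)

# Average only over actual ratings (> 0)
mean_ratings = (
    interactions[interactions["rating"] != 0]
    .groupby("recipe_id")["rating"]
    .mean()
    .rename("avg_rating")
)

# Left joins keep every recipe, even those without interactions
merged = recipes.join(num_interactions, on="recipe_id").join(mean_ratings, on="recipe_id")
print(merged)
```

Renaming the two Series before joining avoids ending up with a count stored in a column called `date`.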

In [34]:
interactions_data.head()


Unnamed: 0,user_id,recipe_id,date,rating,review_keywords
0,38094,40893,2003-02-17,4,15 minutes pinch salt stove top added great ca...
1,1293707,40893,2011-12-21,5,2 teaspoon forgot great remaining cumin simple...
2,8937,44394,2002-12-01,4,easy whole package quite worked great white ch...
3,126440,85009,2010-02-27,5,everyone loved made mexican topping took bunko
4,57222,85009,2011-10-01,5,black pepper made adding yum sprinkling chedda...


In [33]:
df_rmerged.head()

Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,nutrition,n_steps,ingredients,n_ingredients,steps_keywords,description_keywords,date,rating
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,winter squashmexican seasoningmixed spicehoney...,7,cut squash pieceseason easily pierce use sugar...,favorite time inspired seasoning mix recipes r...,3,5.0
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,prepared pizza crustsausage pattyeggsmilksalt ...,6,set eggs preheat oven 20 minutes bowl tastebak...,bacon feel free microwave crust late risers ch...,4,4.666667
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,ground beefyellow onionsdiced tomatoestomato p...,13,3 hours large potadd chopped onions ground bee...,rainy day modified version one freeze cookbook...,1,4.0
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,spreadable cheese with garlic herbsnew potato...,11,fspread buttered 8x8 inch glass baking dish 2 ...,great tasting spent make ahead side dish actua...,2,4.5
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,tomato juiceapple cider vinegarsugarsaltpepper...,8,"great !"" recipe ingredients entire life necess...",dh recipe acquire enjoy store ds taste bought ...,1,5.0


<a id='models'></a>
## 2. Models

<a id='cosine'></a>
### 2.1. Cosine Similarity

In [30]:
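from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def get_cos_sim_matrix(processed):
    # Expects a 'reviews' column holding the text to compare
    reviews = processed['reviews'].fillna('')
    # TF-IDF weighting with English stop words removed
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(reviews)
    # Reduce to 10 latent dimensions before computing pairwise cosine similarity
    svd = TruncatedSVD(n_components=10, random_state=42)
    tfidf_truncated = svd.fit_transform(tfidf_matrix)
    return cosine_similarity(tfidf_truncated, tfidf_truncated)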
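
The same pipeline (TF-IDF, truncated SVD, pairwise cosine similarity) can be tried on a handful of short texts. This is a sketch on invented sentences, with `n_components=2` chosen only because the toy vocabulary is tiny. It is meant to show the shape and symmetry of the result, not to reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example texts: two savoury and two sweet recipes
docs = [
    "bake chicken with garlic and lemon",
    "roast chicken with lemon and herbs",
    "chocolate cake with vanilla frosting",
    "vanilla cupcakes with chocolate frosting",
]

# Sparse TF-IDF matrix, one row per document
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Project onto 2 latent dimensions (the notebook uses 10 on the full data)
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)

# Square, symmetric matrix of pairwise similarities
sim = cosine_similarity(reduced)
print(np.round(sim, 2))
```

Documents that share vocabulary (the two chicken recipes, the two chocolate recipes) should score higher with each other than across groups.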

<a id='lsi'></a>
### 2.2. LSI Model

<a id='mixture'></a>
### 2.3. Mixture Model

In [36]:
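import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

lmbda = 0.5
text_cols = ["ingredients", "steps_keywords", "description_keywords"]
processed = df_rmerged[["recipe_id"] + text_cols].copy()
# Assumption: get_cos_sim_matrix reads a 'reviews' column, so the text features are joined into it
processed["reviews"] = processed[text_cols].agg(" ".join, axis=1)

cos_sim = get_cos_sim_matrix(processed)
df_sub = df_rmerged[['recipe_id', 'n_steps', 'minutes', 'n_ingredients', 'rating']]
business_processed = df_sub[df_sub['recipe_id'].isin(processed['recipe_id'])].set_index('recipe_id')

# Note: rows with a missing rating must be handled first, euclidean_distances rejects NaN
scaler = StandardScaler()
X = scaler.fit_transform(business_processed)
eucl_dis = euclidean_distances(X, X)

eucl_sim = np.exp(-eucl_dis)  # maps distance 0 to similarity 1
mixed_sim = lmbda * cos_sim + (1 - lmbda) * eucl_sim  # lmbda = 0.5 weights both equally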
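
The mixture combines a text-based similarity with a similarity derived from numeric features, weighted by `lmbda`. Below is a minimal sketch on invented numbers. The helper `mix_similarities` is a hypothetical name, and the toy `text_sim` matrix stands in for the output of `get_cos_sim_matrix`.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler


def mix_similarities(text_sim, features, lmbda=0.5):
    """Blend a text similarity matrix with a numeric-feature similarity matrix.

    lmbda = 1 uses only the text similarity, lmbda = 0 only the feature similarity.
    """
    X = StandardScaler().fit_transform(features)    # zero mean, unit variance per column
    feature_sim = np.exp(-euclidean_distances(X))   # distance 0 becomes similarity 1
    return lmbda * text_sim + (1 - lmbda) * feature_sim


# Invented values: recipes 0 and 1 are alike, recipe 2 differs
text_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
# Columns: n_steps, minutes, n_ingredients
features = np.array([
    [10, 30, 7],
    [11, 35, 8],
    [3, 240, 15],
], dtype=float)

mixed = mix_similarities(text_sim, features, lmbda=0.5)
print(np.round(mixed, 3))
```

Because both inputs have ones on the diagonal, the mixed matrix does too, so each recipe remains most similar to itself for any `lmbda` between 0 and 1.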

<a id='interpretation_evaluation'></a>
## 3. Interpretation and Evaluation