# Framing Prediction Problem

In [66]:
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from itertools import chain

from utils.eda import *
from utils.dsc80_utils import *
from utils.graph import *

In [67]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [68]:
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')
step0 = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id', indicator=True)
df = (step0
      .pipe(initial)
      .pipe(transform_df)
      #.pipe(outlier)
      .pipe(group_recipe)
      #.pipe(group_user)
)

# Problem Identification
**Analysis**:
Identify a prediction problem. Feel free to use one of the example prediction problems stated in the “Example Questions and Prediction Problems” section of your dataset’s description page or pose a hypothesis test of your own. The prediction problem you come up with doesn’t have to be related to the question you were answering in Steps 1-4, but ideally, your entire project has some sort of coherent theme.

**Report**:
Clearly state your prediction problem and type (classification or regression). If you are building a classifier, make sure to state whether you are performing binary classification or multiclass classification. Report the response variable (i.e. the variable you are predicting) and why you chose it, the metric you are using to evaluate your model and why you chose it over other suitable metrics (e.g. accuracy vs. F1-score).

Note: Make sure to justify what information you would know at the “time of prediction” and to only train your model using those features. For instance, if we wanted to predict your final exam grade, we couldn’t use your Project 4 grade, because Project 4 is only due after the final exam! Feel free to ask questions if you’re not sure.

# Some Potential Ideas:
1. Sentiment Analysis with `review` column

2. Using the `recipe` column and feature engineering (length of `recipe`, TF-IDF, ...) to predict `ratings`

3. Using text data as an input to predict the user's rating and identify user preferences (a pre-step to a recommender system)

In [4]:
df.columns

Index(['minutes', 'n_steps', 'n_ingredients', 'avg_rating', 'rating',
       'calories', 'total_fat', 'sugar', 'sodium', 'protein', 'sat_fat',
       'carbs', 'steps', 'name', 'description', 'ingredients', 'user_id',
       'contributor_id', 'review_date', 'review', 'recipe_date', 'tags'],
      dtype='object')

# Framing Question
We know that the distribution of a recipe's mean TF-IDF differs between higher-rated and lower-rated recipes:
- We need `X` and a `y` -> find relationships! -> Supervised ML model
- We currently have the DataFrame grouped by recipe
- We want to predict `rating` as a classification problem
    - `rating` in recipe df: a quality of recipe
    - `rating` in user_id df: user preference ✅
- Features for user_id df:
    - `TF-IDF mean/max/sum/partial_mean` of `description` for the **recipes per user_id** (a user may have more than one recipe) that have **high ratings**
        - This measures whether a word appears more often in the **descriptions of this user's highly rated recipes** than in **all recipe descriptions**, which would mean the word is more important to this user.
    - `n_ingredients`
    - `n_steps`
    - `minutes`
    - `calories`
    - `sodium`
    - `previous_rating` (need to explore)
    - `word2vec` (need to explore, some info [here](https://towardsdatascience.com/word2vec-explained-49c52b4ccb71))
        - Each `user_id` has a pool of words in a **vector space** (from description, can have more)
        - We want to see how similar (by cosine distance) the `word2vec` vectors of a recipe's tags are to the pool
        - [good theory background](https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1)

- consider using `tags`, `review`, `steps`?

- Further: use the preferences to recommend recipes!
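
The per-user TF-IDF feature described above could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions: the toy descriptions, the `user_high_rated` indices, the `tf_idf` helper, and the choice of `log(N / df)` as the IDF are all hypothetical and not the project's actual pipeline, which would more likely rely on a library vectorizer.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for every recipe description.
all_descriptions = [
    "easy cheesy pasta bake",
    "spicy chicken pasta",
    "easy chocolate cake",
    "quick spicy chicken wings",
]
# Hypothetical: positions of the recipes one user rated highly.
user_high_rated = [1, 3]

docs = [d.split() for d in all_descriptions]
n_docs = len(docs)
# Document frequency: in how many descriptions does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(doc):
    """Return {word: tf * idf} for one tokenized description."""
    counts = Counter(doc)
    return {
        w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
        for w, c in counts.items()
    }

# Mean TF-IDF per word over this user's high-rated recipes only
# (a word missing from a recipe contributes 0 for that recipe).
totals = defaultdict(float)
for i in user_high_rated:
    for word, score in tf_idf(docs[i]).items():
        totals[word] += score
user_mean = {w: t / len(user_high_rated) for w, t in totals.items()}

# Words ranked by how characteristic they are of this user's favourites.
ranked = sorted(user_mean, key=user_mean.get, reverse=True)
```

In this toy setup, words that recur across the user's favourites (`spicy`, `chicken`) score above a word that appears in only one of them and is also common elsewhere (`pasta`).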

# Introduction to Word2Vec NLP
[This](https://arxiv.org/pdf/1301.3781.pdf) is the original paper published by Google Research

<center><img src="imgs/wv0.webp" width=50%></center>


We first build a vocabulary of all the words in our text and then encode each word as a vector with the same dimension as the vocabulary, which is exactly `OneHotEncoding`. This input is given to a neural network with a single hidden layer.

<center><img src="imgs/wv1.webp" width=30%></center>

The output of such a network is a **single vector (with the same number of components as the vocabulary)** in which each component represents the **probability** that a randomly selected nearby word is that vocabulary word.
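
To make the input and output shapes concrete, here is a small sketch of a one-hot input and a softmax output over the vocabulary. The five-word vocabulary, the `one_hot` and `softmax` helpers, and the raw scores are made up for illustration; a real model would compute the scores from its learned weights.

```python
import math

# Hypothetical five-word vocabulary.
vocab = ["cheese", "cheddar", "pasta", "cake", "spicy"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of length V with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = one_hot("cheese")  # input: [1.0, 0.0, 0.0, 0.0, 0.0]

# Made-up output scores, one per vocabulary word.
scores = [0.1, 2.0, 0.5, -1.0, 0.3]
probs = softmax(scores)

# The most probable "nearby word" under these made-up scores.
most_likely = vocab[probs.index(max(probs))]
```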

In word2vec, a distributed representation of a word is used. Take a vector with several hundred dimensions (say 1000). Each word is represented by a distribution of weights across those elements. So instead of a one-to-one mapping between an element in the vector and a word, **the representation of a word is spread across all the elements in the vector**, and **each element in the vector contributes to the definition of many words**. Such a vector comes to represent in some abstract way the ‘meaning’ of a word.
- It is not one word that represents one word, but rather "all" words that represent one word
- This is the distributional semantics idea: using the distribution of the words in a sentence to encode/embed the meaning of a word


<center><img src="imgs/wv2.webp" width=50%></center>

## CBOW (Continuous Bag of Words) & Continuous Skip-Grams
So that is the basic idea. How does it achieve such a magical embedding?

In the `CBOW model`, the **distributed representations of context** (or surrounding words) are combined to **predict the word in the middle**. While in the `Skip-gram model`, the **distributed representation of the input word** is used to predict the **context**.

<center><img src="imgs/wv3.png" width=70%></center>

### In `CBOW`
- Note that `CBOW` does not consider the order of the words: it takes in both "past" and "future" words, just like a regular `bag of words` model. However, it uses a continuous distributed representation of the context.

- In `CBOW`, since our input vectors are `OneHotEncoding`, multiplying an input vector by the weight matrix `W1` is just selecting a row/word from `W1`. From the hidden layer to the output layer, the second weight matrix `W2` can be used to compute **a score** for each word in the vocabulary, and **softmax** can be used to obtain the **posterior distribution of words**.
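
The two steps above (row selection by a one-hot input, then scores and softmax) can be sketched in plain Python. The vocabulary, the embedding size `E = 3`, the random weights, and the helper names are all made up for illustration, and averaging the context rows is only one common way to combine them. This shows a single forward pass, not how gensim implements training.

```python
import math
import random

random.seed(0)
vocab = ["i", "will", "have", "orange", "juice", "and", "eggs"]
V, E = len(vocab), 3  # vocabulary size, embedding size (hypothetical)
idx = {w: i for i, w in enumerate(vocab)}

# Randomly initialised weights: W1 is V x E, W2 is E x V.
W1 = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(V)]
W2 = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(E)]

def one_hot(word):
    return [1.0 if i == idx[word] else 0.0 for i in range(V)]

def vec_mat(v, M):
    """Row vector (length n) times matrix (n x m)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

# One-hot times W1 simply picks out that word's row of W1.
row_selected = vec_mat(one_hot("orange"), W1)

# Hidden layer: average of the context words' rows (word order is ignored).
context = ["have", "orange", "and", "eggs"]
hidden = [sum(W1[idx[w]][k] for w in context) / len(context) for k in range(E)]

# Output layer: one score per vocabulary word, then softmax.
scores = vec_mat(hidden, W2)
top = max(scores)
exps = [math.exp(s - top) for s in scores]
probs = [e / sum(exps) for e in exps]
```

With untrained weights the probabilities are meaningless. Training would adjust `W1` and `W2` so that the probability of the true middle word (`juice` for this context) goes up.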

### In `Skip-Gram`
- The `skip-gram model` is the opposite of the `CBOW` model. It is constructed with the **focus word as the single input vector**, and the target context words are now at the output layer. The activation function for the hidden layer simply amounts to copying the corresponding row from the weights matrix `W1` (linear) as we saw before. At the output layer, we now output **C multinomial distributions instead of just one**.

- Given the sentence: *“I will have orange juice and eggs for breakfast.”*
    - and a window size of 2, if the target word is juice, its neighboring words will be (have, orange, and, eggs). Our input and target word pairs would be (juice, have), (juice, orange), (juice, and), (juice, eggs).

    - Also note that within the sample window, proximity of the words to the source word plays no role. So have, orange, and, and eggs will be treated the same while training.

    - The input vector has dimension **1xV**, where V is the number of words in the vocabulary, i.e. the `OneHotEncoding` representation of the word. The weight matrix of the single hidden layer has dimension **VxE**, where E is the size of the word embedding and is a hyperparameter. The output of the hidden layer has dimension **1xE** and is fed into a `softmax` output layer. The output layer has dimension **1xV**, where each value in the vector is *the probability score of the target word at that position*.
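
The pair construction for the example sentence can be sketched as below. The `skipgram_pairs` helper is a hypothetical illustration of a fixed symmetric window, not gensim's internal sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "i will have orange juice and eggs for breakfast".split()
pairs = skipgram_pairs(sentence, window=2)

# Pairs whose input word is "juice":
juice_pairs = [p for p in pairs if p[0] == "juice"]
# [('juice', 'have'), ('juice', 'orange'), ('juice', 'and'), ('juice', 'eggs')]
```

Every pair is treated the same way during training, regardless of how far the context word sits from the target inside the window.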

<center><img src="imgs/window.webp" width=50%></center>

## Implementation Details

In [5]:
from gensim.models import Word2Vec
import nltk
# nltk.download('brown')
# nltk.download('movie_reviews')
# nltk.download('treebank')
# from nltk.corpus import brown, movie_reviews, treebank
from nltk.tokenize import word_tokenize

In [6]:
# b = Word2Vec(brown.sents())
# mr = Word2Vec(movie_reviews.sents())
# t = Word2Vec(treebank.sents())
# brown.sents()
# b.wv.most_similar('food', topn=5)

In [7]:
# corpus = ' '.join(df['description'].astype(str))
# tokens = word_tokenize(corpus)
# tokens
tokens = df['description'].astype(str).str.split(' ').to_list()
tokens[0]

['one',
 'of',
 'my',
 "mom's",
 'favorite',
 'bisquick',
 'recipes.',
 'this',
 'brings',
 'back',
 'memories!']

In [8]:
word_vec = Word2Vec(tokens, window=7, sg=1, min_count=3) # input is a list of token lists; sg=1 selects skip-gram

In [9]:
word_vec.wv.most_similar('cheese', topn=20)

[('cheddar', 0.8366087079048157),
 ('cheese,', 0.8310773372650146),
 ('cheese.', 0.8167422413825989),
 ('cheese!', 0.7785608768463135),
 ('fontina', 0.7774136066436768),
 ('mozzarella', 0.7537940144538879),
 ('cheeses', 0.7433044910430908),
 ('havarti', 0.7396714091300964),
 ('cheddar.', 0.7392040491104126),
 ('monterey', 0.7332344055175781),
 ('gruyere', 0.7295688986778259),
 ('provolone', 0.7268673777580261),
 ('sharp', 0.726447343826294),
 ('cheese...', 0.7256677746772766),
 ('cheddar,', 0.7215008735656738),
 ('crumbled', 0.7210201621055603),
 ('asiago', 0.7161028981208801),
 ('jack.', 0.7128156423568726),
 ('romano', 0.7125892043113708),
 ('parmesan', 0.7068539261817932)]
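
Several of the neighbours above (`'cheese,'`, `'cheese.'`, `'cheese!'`) are the same word with punctuation attached, because splitting on spaces keeps punctuation inside the token. One possible fix is a small regex tokenizer. The `tokenize` helper and its pattern are assumptions for illustration, not part of the original pipeline.

```python
import re

# Keep runs of letters/digits, allowing internal apostrophes or hyphens
# (so "mom's" and "pepper-jack" survive as single tokens).
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "one of my mom's favorite bisquick recipes. this brings back memories!"
sample_tokens = tokenize(sample)
# ['one', 'of', 'my', "mom's", 'favorite', 'bisquick', 'recipes',
#  'this', 'brings', 'back', 'memories']
```

Something like `df['description'].astype(str).map(tokenize).to_list()` could then replace the space-based split before training, so that all the punctuation variants of `cheese` collapse into one vocabulary entry.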

In [12]:
user = (step0
      .pipe(initial)
      .pipe(transform_df)
      #.pipe(outlier)
      #.pipe(group_recipe)
      .pipe(group_user)
)


Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.



In [60]:
user_corpus = user['description'].astype(str).str.strip('[]').str.strip("'").str.strip('"').str.split(' ')
user_corpus.shape

(67268,)

In [58]:
word_vec = Word2Vec(user_corpus, window=7, sg=1, min_count=3)

In [61]:
word_vec.wv.most_similar('cheese', topn=30)

[('cheddar', 0.7837750911712646),
 ('havarti', 0.742730438709259),
 ('cabot', 0.7361286878585815),
 ('parmesan', 0.732124924659729),
 ('cheese.', 0.7221015691757202),
 ('fontina', 0.7215815186500549),
 ('pepper-jack', 0.7188788652420044),
 ('cheddar.', 0.7154257893562317),
 ('cheese...', 0.7148705720901489),
 ('goat', 0.713663637638092),
 ('wontons.', 0.7099077105522156),
 ('maytag', 0.7073398232460022),
 ('feta', 0.704619288444519),
 ('romano', 0.7013906240463257),
 ('cream', 0.7006925344467163),
 ("cheese.',", 0.6994298696517944),
 ('tapenade,', 0.699428915977478),
 ('mac', 0.697147786617279),
 ('colby,', 0.6966875195503235),
 ('goats', 0.6960228681564331),
 ('mozzarella,', 0.694808840751648),
 ('cheez', 0.6945345997810364),
 ('frig:', 0.6922464370727539),
 ('crumbles.', 0.6880894303321838),
 ('http://www.recipezaar.com/fannie-farmers-classic-baked-macaroni-and-cheese-135350',
  0.6870267987251282),
 ('provalone', 0.6867004632949829),
 ('chevre', 0.6864020824432373),
 ('whiz', 0.6849

We can use the `distance` function to get the cosine distance between two words:

In [82]:
word_vec.wv.distance('food','delicious')

0.7996206283569336
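
For reference, gensim's `distance` is one minus the cosine similarity of the two word vectors, written here as $u$ and $v$:

$$
\text{distance}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

So the value above corresponds to a cosine similarity of about $1 - 0.7996 \approx 0.2004$ between `food` and `delicious`. A distance of 0 would mean the two vectors point in the same direction, and a distance of 1 would mean they are orthogonal.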

In [83]:
# df['tags'].apply(lambda x: [word_vec.wv.distance(word) for word in x if word in word_vec.wv.vocab])
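
The commented-out line above calls `distance` with a single word and looks words up in `wv.vocab`. In gensim 4, `distance` takes two words and the vocabulary lookup is `wv.key_to_index`. One way to compare a recipe's tags with a user's word pool is to average the vectors on each side and take the cosine similarity. The sketch below uses a hypothetical dict `vectors` with made-up numbers as a stand-in for `word_vec.wv`, so it runs without a trained model. The helper names are assumptions too.

```python
import math

# Hypothetical stand-in for word_vec.wv: word -> embedding (made-up numbers).
vectors = {
    "cheese": [0.9, 0.1, 0.0],
    "cheddar": [0.8, 0.2, 0.1],
    "pasta": [0.6, 0.5, 0.0],
    "chocolate": [0.0, 0.2, 0.9],
    "cake": [0.1, 0.3, 0.8],
}

def mean_vector(words):
    """Average the vectors of in-vocabulary words; None if there are none."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tag_pool_similarity(tags, pool):
    """Cosine similarity between the averaged tag and pool vectors."""
    t, p = mean_vector(tags), mean_vector(pool)
    if t is None or p is None:
        return float("nan")  # nothing to compare
    return cosine_similarity(t, p)

# Out-of-vocabulary words such as "unknownword" are skipped.
user_pool = ["cheese", "cheddar", "pasta", "unknownword"]
savory = tag_pool_similarity(["cheese", "pasta"], user_pool)
sweet = tag_pool_similarity(["chocolate", "cake"], user_pool)
```

With a trained gensim 4 model, the same idea could likely be expressed by filtering each list with `w in word_vec.wv.key_to_index` and calling `word_vec.wv.n_similarity(tags, pool)`, which compares the mean vectors of two word lists.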