# Framing Prediction Problem

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from itertools import chain

from utils.eda import *
from utils.dsc80_utils import *
from utils.graph import *

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')
step0 = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id', indicator=True)
df = (step0
      .pipe(initial)
      .pipe(transform_df)
      #.pipe(outlier)
      .pipe(group_recipe)
      #.pipe(group_user)
)

## Problem Identification
**Analysis**:
Identify a prediction problem. Feel free to use one of the example prediction problems stated in the “Example Questions and Prediction Problems” section of your dataset’s description page or pose a hypothesis test of your own. The prediction problem you come up with doesn’t have to be related to the question you were answering in Steps 1-4, but ideally, your entire project has some sort of coherent theme.

**Report**:
Clearly state your prediction problem and type (classification or regression). If you are building a classifier, make sure to state whether you are performing binary classification or multiclass classification. Report the response variable (i.e. the variable you are predicting) and why you chose it, the metric you are using to evaluate your model and why you chose it over other suitable metrics (e.g. accuracy vs. F1-score).

Note: Make sure to justify what information you would know at the “time of prediction” and to only train your model using those features. For instance, if we wanted to predict your final exam grade, we couldn’t use your Project 4 grade, because Project 4 is only due after the final exam! Feel free to ask questions if you’re not sure.
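
To make the accuracy vs. F1-score comparison in the prompt concrete, here is a minimal sketch on made-up labels (not taken from the recipes data). It assumes a binary problem in which the positive class is rare. The helpers `accuracy` and `f1_score_binary` are hypothetical and written by hand rather than taken from a library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: report 0.0 instead of dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts the majority class.
y_pred = [0] * 10

print(accuracy(y_true, y_pred))         # 0.8
print(f1_score_binary(y_true, y_pred))  # 0.0
```

On these toy labels the always-majority model looks reasonable by accuracy but scores zero F1 on the rare class. This is the kind of gap that could justify choosing one metric over the other.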

## Some Potential Ideas:
1. Sentiment analysis with the `review` column

2. Using the `recipe` column and feature engineering (length of `recipe`, TF-IDF, ...) to predict `ratings`

3. Using text data as an input to predict a user's rating and identify user preferences (a pre-step to a recommender system)
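
The text-based feature engineering mentioned in idea 2 could be sketched as follows. This is a hand-rolled illustration on toy strings, not the notebook's pipeline. It assumes the simple definitions tf = count / document length and idf = log(N / document frequency). Libraries such as scikit-learn use a smoothed idf, so their values would differ. The function name `tfidf` and the feature names are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: tf-idf} dict per document.

    Assumes tf = count / document length and idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy recipe texts (hypothetical, not rows from the dataset).
docs = [
    "easy chocolate cake",
    "easy chicken soup",
    "spicy chicken curry",
]
scores = tfidf(docs)

# Candidate features per recipe: text length and mean TF-IDF of its terms.
features = [
    {
        "n_tokens": len(doc.split()),
        "mean_tfidf": sum(s.values()) / len(s) if s else 0.0,
    }
    for doc, s in zip(docs, scores)
]
print(features[0])
```

Terms that appear in only one document (such as "chocolate") receive a higher weight than terms shared across documents (such as "easy"), which is what makes the mean TF-IDF a candidate signal.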

In [4]:
df.columns

Index(['minutes', 'n_steps', 'n_ingredients', 'avg_rating', 'rating',
       'calories', 'total_fat', 'sugar', 'sodium', 'protein', 'sat_fat',
       'carbs', 'steps', 'name', 'description', 'ingredients', 'user_id',
       'contributor_id', 'review_date', 'review', 'recipe_date', 'tags'],
      dtype='object')

We know that the distribution of a recipe's mean TF-IDF differs between higher-rated and lower-rated recipes:
- We need an `X` and a `y` -> find relationships! -> supervised ML model
- We currently have the DataFrame grouped by recipe
- We want to predict `rating` as a classification problem
    - `rating` in the recipe df: the quality of a recipe
    - `rating` in the user_id df: user preference ✅
- Features for the user_id df:
    - `TF-IDF mean/max/sum/partial_mean` of `description` for the **recipes per user_id** (a user may have more than one recipe) that have **high ratings**
        - This evaluates whether a word shows up more often in this **user's highly rated recipe descriptions** than in all **recipe descriptions**, which would mean it is more important to this user.
    - `n_ingredients`
    - `n_steps`
    - `minutes`
    - `calories`
    - `sodium`
    - `previous_rating` (need to explore)
    - `word2vec` (need to explore, some info [here](https://towardsdatascience.com/word2vec-explained-49c52b4ccb71))
        - Each `user_id` has a pool of words in a **vector space** (from the descriptions; more sources can be added)
        - We want to see how similar (by cosine similarity) the recipe tags' `word2vec` vectors are to that pool

- Consider using `tags`, `review`, `steps`?

- Further: use the preference to recommend recipes!
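
The `word2vec` idea in the list above (compare a recipe's tag vectors to a user's pool of word vectors) could look roughly like this. It is a minimal sketch under stated assumptions: the 3-dimensional `embeddings` are made up for illustration, averaging is just one possible way to summarise a pool, and the helpers `mean_vector` and `cosine_similarity` are hypothetical. A trained model would supply the real vectors.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-d embeddings; a trained model would supply real ones.
embeddings = {
    "chocolate": [0.9, 0.1, 0.0],
    "dessert":   [0.8, 0.2, 0.1],
    "cake":      [0.85, 0.15, 0.05],
    "spicy":     [0.0, 0.9, 0.3],
    "curry":     [0.1, 0.8, 0.4],
}

# Words from descriptions of recipes this (made-up) user rated highly.
user_pool = mean_vector([embeddings[w] for w in ["chocolate", "dessert"]])

# Tags of two candidate recipes, each summarised by its mean vector.
sweet_recipe = mean_vector([embeddings[w] for w in ["cake"]])
spicy_recipe = mean_vector([embeddings[w] for w in ["spicy", "curry"]])

print(cosine_similarity(user_pool, sweet_recipe))
print(cosine_similarity(user_pool, spicy_recipe))
```

With these toy vectors the dessert-like recipe sits closer to the user's pool than the spicy one, so the similarity score could serve as a per-(user, recipe) feature.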

In [25]:
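import nltk
from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews, treebank
from nltk.tokenize import word_tokenize

# Download the corpora used below (skipped if they are already up to date).
# Note: word_tokenize additionally requires NLTK's Punkt tokenizer data.
nltk.download('brown')
nltk.download('movie_reviews')
nltk.download('treebank')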

[nltk_data] Downloading package brown to /Users/kevinb/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/kevinb/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package treebank to /Users/kevinb/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [14]:
b = Word2Vec(brown.sents())
mr = Word2Vec(movie_reviews.sents())
t = Word2Vec(treebank.sents())

In [46]:
brown.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [21]:
b.wv.most_similar('food', topn=5)

[('competition', 0.9481075406074524),
 ('adult', 0.9445140957832336),
 ('contact', 0.9444266557693481),
 ('masses', 0.9434835314750671),
 ('insight', 0.9429556131362915)]

In [61]:
corpus = ' '.join(df['description'].astype(str))
#tokens = df['description'].str.split(' ')

In [63]:
tokens = word_tokenize(corpus)
tokens

['one',
 'of',
 'my',
 'mom',
 "'s",
 'favorite',
 'bisquick',
 'recipes',
 '.',
 'this',
 'brings',
 'back',
 'memories',
 '!',
 'a',
 'childhood',
 'favorite',
 'of',
 'mine',
 '.',
 'my',
 'mom',
 'loved',
 'it',
 'because',
 'it',
 'cut',
 'down',
 'on',
 'how',
 'much',
 'time',
 'to',
 'make',
 'it',
 '.',
 'this',
 'is',
 'an',
 'oldie',
 'but',
 'a',
 'goodie',
 '.',
 'mom',
 "'s",
 'stand',
 'by',
 'for',
 'company',
 '.',
 'good',
 'enough',
 'for',
 'us',
 'on',
 'a',
 'special',
 'occasion',
 'or',
 'if',
 'company',
 'came',
 'over',
 '!',
 'thank',
 'you',
 'paula',
 'deen',
 '!',
 'hubby',
 'just',
 'happened',
 'to',
 'be',
 'watching',
 'with',
 'me',
 'one',
 'day',
 'when',
 'she',
 'made',
 'these',
 'and',
 'it',
 'will',
 'always',
 'be',
 'requested',
 'in',
 'our',
 'home',
 '!',
 'it',
 "'s",
 'very',
 'easy',
 'to',
 'make',
 'and',
 'such',
 'a',
 'fun',
 'twist',
 'on',
 'a',
 'plain',
 'cheesecake',
 '.',
 'it',
 "'s",
 'a',
 'must',
 'try',
 '!',
 'the',
 ...]

In [64]:
word_vec = Word2Vec(tokens)

In [69]:
word_vec.wv.most_similar('a', topn=10)

[('e', 0.735166609287262),
 ('i', 0.7114152908325195),
 ('o', 0.5277945399284363),
 ('ä', 0.40398621559143066),
 ('u', 0.3433350920677185),
 ('-', 0.2859588861465454),
 ("'", 0.2732229232788086),
 ('x', 0.262163907289505),
 ('’', 0.2599521577358246),
 ('y', 0.2561439871788025)]
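
The nearest neighbours of `'a'` above are single characters because `Word2Vec` expects an iterable of sentences, each of which is a list of tokens. `tokens` is one flat list of strings, so each string is treated as a "sentence" and iterated character by character. One possible fix is to tokenise each description separately, as sketched below. The regex tokenizer is a simple stand-in for `word_tokenize` so that the example stays self-contained, and `to_sentences` and `train_description_model` are hypothetical helper names.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer; a simple stand-in for nltk's word_tokenize."""
    return re.findall(r"[a-z0-9']+", text.lower())


def to_sentences(descriptions):
    """One token list per description: the nested shape Word2Vec expects."""
    return [tokenize(d) for d in descriptions]


def train_description_model(descriptions, **kwargs):
    """Hypothetical helper: train Word2Vec on per-description token lists."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim
    return Word2Vec(to_sentences(descriptions), **kwargs)


# Toy descriptions (hypothetical); in the notebook this would be
# df['description'].astype(str).
descriptions = ["A childhood favorite of mine.", "Very easy to make!"]

flat = tokenize(" ".join(descriptions))   # flat shape, as passed in cell In [64]
nested = to_sentences(descriptions)       # nested shape Word2Vec expects

# Iterating a "sentence" from the flat list yields characters, not words.
print(list(flat[1]))   # ['c', 'h', 'i', 'l', 'd', 'h', 'o', 'o', 'd']
print(nested[0])       # ['a', 'childhood', 'favorite', 'of', 'mine']
```

In the notebook, something like `train_description_model(df['description'].astype(str))` followed by `.wv.most_similar(...)` should then return whole words. Which words end up in the vocabulary depends on the corpus and on the `min_count` setting.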