# Permutation Testing
Does the mean TF-IDF of high-rated recipes and of low-rated recipes appear to come from the same population distribution?

In [103]:
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from itertools import chain

from utils.eda import *
from utils.dsc80_utils import *
from utils.graph import *

from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [130]:
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')
step0 = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id', indicator=True)
df = (step0
      .pipe(initial)
      .pipe(transform_df)
      .pipe(outlier)
      .pipe(group_recipe)
      #.pipe(group_user)
)

## Hypothesis Testing Ideas
**Analysis**:
Clearly state a pair of hypotheses and perform a hypothesis test or permutation test that is not related to missingness. Feel free to use one of the example questions stated in the “Example Questions and Prediction Problems” section of your dataset’s description page or pose a hypothesis test of your own.

**Report**:
Clearly state your null and alternative hypotheses, your choice of test statistic and significance level, the resulting p-value, and your conclusion. Justify why these choices are good choices for answering the question you are trying to answer.

Optional: Embed a visualization related to your hypothesis test in your website.

Tip: When writing your conclusions to the statistical tests in this project, never use language that implies an absolute conclusion; since we are performing statistical tests and not randomized controlled trials, we cannot prove that either hypothesis is 100% true or false.

## Some questions that can be asked:
### Question 1: Hypothesis Testing
Does the number of words in a recipe description that match the 20 highest TF-IDF words (taken from the descriptions of recipes the reviewer rated highly) result in a higher rating? This could be extended to also maximize the distance from the high TF-IDF words of low-rated recipes.

- **null**: the number of words in each description that match the user's 20 highest TF-IDF words has no relationship with the rating
- **alternative**: a larger number of words in each description that match the user's 20 highest TF-IDF words is associated with a higher rating
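
The quantity in Question 1 could be computed along the following lines. This is only a minimal sketch: the helper `count_matching_words`, the whitespace tokenization, and the example word list are hypothetical and do not come from the notebook.

```python
def count_matching_words(description, top_words):
    """Count distinct words in `description` that appear in `top_words`."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokens = set(description.lower().split())
    return len(tokens & set(top_words))

# Hypothetical top TF-IDF words for one user (the question proposes 20).
top_words = ["cheesecake", "caramel", "apple", "quick"]
n_matches = count_matching_words("Quick caramel apple pie", top_words)
```

A real implementation would likely reuse the tokenizer of the fitted `TfidfVectorizer` so that the matching is consistent with how the TF-IDF words were extracted.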

### Question 2: Permutation Testing
Does the mean TF-IDF of high-rated recipes and of low-rated recipes appear to come from the same population distribution?

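- **null**: the TF-IDF of high-rated recipes and of low-rated recipes comes from the same population distribution
- **alternative**: the TF-IDF of high-rated recipes and of low-rated recipes comes from different population distributions
- **test statistic**: absolute difference between the two groups' mean of the per-recipe max TF-IDF (the TF-IDF of the most important word in each description), comparing high-rated and low-rated recipes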
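
As a rough illustration of this test statistic, the sketch below assumes each recipe has already been reduced to a single max TF-IDF value. The function name `abs_diff_of_means` and the toy numbers are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

def abs_diff_of_means(values, is_high):
    """Absolute difference between the mean of `values` in the two groups."""
    values = np.asarray(values, dtype=float)
    is_high = np.asarray(is_high, dtype=bool)
    return abs(values[is_high].mean() - values[~is_high].mean())

# Toy per-recipe max TF-IDF values (made up for illustration).
toy_tfidf = np.array([0.30, 0.40, 0.50, 0.60])
toy_is_high = np.array([True, True, False, False])
stat = abs_diff_of_means(toy_tfidf, toy_is_high)
```

Taking the absolute value makes the test two-sided: a large gap in either direction counts as evidence against the null.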

## Preparation

In [169]:
df = df[['name','description','tags','steps','ingredients','contributor_id','rating']] # avg_rating = rating here
df_high = df[df['rating']>=4]
df_low = df[df['rating']<4]

In [170]:
all(df_high['rating'] >= 4)

True

In [171]:
lst_high = df_high['description'].explode().astype(str)
lst_low = df_low['description'].explode().astype(str)

In [172]:
lst_high

recipe_id
275030.0    thank you paula deen!  hubby just happened to ...
275033.0                           from woman's day magazine.
275036.0    i threw some things together in a dutch oven a...
                                  ...                        
537175.0    sometimes you need rolls fast. here's a perfec...
537458.0    cream cheese is the secret ingredient in these...
537716.0    these party-style chicken cheesesteaks are fla...
Name: description, Length: 50022, dtype: object

## TF-IDF Calculation

In [173]:
count_high = TfidfVectorizer()
count_low = TfidfVectorizer()
count_high.fit(lst_high.values)
count_low.fit(lst_low.values)

TfidfVectorizer()

In [174]:
count_low.get_feature_names_out().shape

(8636,)

In [175]:
count_high.get_feature_names_out().shape

(33718,)

In [176]:
high_tfidf = pd.DataFrame(count_high.transform(lst_high.values).toarray(),
                        columns=count_high.get_feature_names_out()
                        )

low_tfidf = pd.DataFrame(count_low.transform(lst_low.values).toarray(),
                        columns=count_low.get_feature_names_out()
                        )

In [177]:
high_tfidf.idxmax(axis=1).iloc[:5]
low_tfidf.idxmax(axis=1).iloc[:5]

0       brings
1    childhood
2      company
3        tummy
4    asparagus
dtype: object

## Get Highest TFIDF Words for Each Recipe (Visualization Purpose)

In [114]:
def five_largest(row):
    return ', '.join(row.index[row.argsort()][-5:])

keywords_high = high_tfidf.apply(five_largest, axis=1)
keywords_high = pd.concat([df_high.reset_index()['recipe_id'],
                         keywords_high
                         ], axis=1)

keywords_low = low_tfidf.apply(five_largest, axis=1)
keywords_low = pd.concat([df_low.reset_index()['recipe_id'],
                         keywords_low
                         ], axis=1)

In [115]:
key_high = keywords_high.set_index('recipe_id')
key_low = keywords_low.set_index('recipe_id')

In [116]:
# recipe_user_word = df2['user_id'].apply(lambda x: [' '.join(dict.loc[id].values) for id in x][0].split(','))
# recipe_word_split = df2['steps'].apply(lambda x: ' '.join(x).split(' '))

In [117]:
key_high

| recipe_id | 0 |
|---|---|
| 275030.0 | paula, deen, thank, watching, happened |
| 275033.0 | elevates, from, magazine, day, woman |
| 275036.0 | things, share, liked, threw, dutch |
| ... | ... |
| 537175.0 | to, appropriate, fyi, sleeve, rolls |
| 537458.0 | cooker, secret, comforting, mash, spuds |
| 537716.0 | buffalo, garnished, whiz, cheez, cheesesteaks |


In [118]:
key_low

| recipe_id | 0 |
|---|---|
| 275022.0 | mom, back, bisquick, memories, brings |
| 275024.0 | loved, cut, it, mine, childhood |
| 275026.0 | occasion, stand, oldie, goodie, company |
| ... | ... |
| 535783.0 | mash, ring, ages, beefy, oozy |
| 536688.0 | for, snacking, improving, immunity, honey |
| 536843.0 | meatballs, description, rosemary, caprese, sma... |


## Differences in Max for TF-IDF
- Using `sum`: longer descriptions get a larger sum
- Using `mean`: easily influenced by outliers
- Using `partial-mean`: captures the most essential part of the description, but is too expensive to compute
- Using `max`: the TF-IDF of the single most important word
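
The trade-offs in the list above can be seen on a toy example. The sketch below uses made-up weights (not normalized the way `TfidfVectorizer` output is), a hypothetical helper `summarize`, and two assumptions the notebook does not state: the mean is taken over nonzero weights only, and the partial mean uses the top 3 weights.

```python
import numpy as np

# Two toy rows of TF-IDF weights (made-up numbers): a short description with
# one dominant word and a long description with many moderate words.
short_doc = np.array([0.9, 0.3, 0.0, 0.0, 0.0, 0.0])
long_doc = np.array([0.5, 0.4, 0.4, 0.4, 0.4, 0.3])

def summarize(row, k=3):
    """Return the four candidate aggregates for one row of weights."""
    nonzero = row[row > 0]     # assumption: ignore words absent from the text
    top_k = np.sort(row)[-k:]  # k largest weights, used for the partial mean
    return {
        "sum": row.sum(),
        "mean": nonzero.mean(),
        "partial_mean": top_k.mean(),
        "max": row.max(),
    }

short_summary = summarize(short_doc)
long_summary = summarize(long_doc)
```

In this toy case the long description wins on `sum` simply because it has more words, while `max` singles out the short description's dominant word.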

In [178]:
# high_tfidf.iloc[:100].apply(lambda x: x.sort_values(ascending=False))

In [179]:
tfidf_max_high = high_tfidf.max(axis=1)
tfidf_max_low = low_tfidf.max(axis=1)

In [180]:
tfidf_max_high

0        0.28
1        0.72
2        0.33
         ... 
50019    0.35
50020    0.41
50021    0.35
Length: 50022, dtype: float64

In [181]:
max_high = df_high.reset_index().assign(tfidf = tfidf_max_high, good=True)
max_low = df_low.reset_index().assign(tfidf = tfidf_max_low, good=False)

In [182]:
display_df(max_high, 2)

| | recipe_id | name | description | tags | ... | contributor_id | rating | tfidf | good |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 275030.0 | paula deen s caramel apple cheesecake | thank you paula deen! hubby just happened to ... | [60-minutes-or-less, time-to-make, course, pre... | ... | [666723, 666723, 666723, 666723, 666723, 66672... | 5.0 | 0.28 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50021 | 537716.0 | mini buffalo chicken cheesesteaks | these party-style chicken cheesesteaks are fla... | [60-minutes-or-less, time-to-make, course, mai... | ... | [2001975627] | 5.0 | 0.35 | True |


In [183]:
display_df(max_low, 2)

| | recipe_id | name | description | tags | ... | contributor_id | rating | tfidf | good |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 275022.0 | impossible macaroni and cheese pie | one of my mom's favorite bisquick recipes. thi... | [60-minutes-or-less, time-to-make, course, mai... | ... | [531768, 531768, 531768] | 3.0 | 0.46 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3568 | 536843.0 | sheet pan turkey caprese meatballs with rosema... | description: try these turkey caprese meatball... | [60-minutes-or-less, time-to-make, course, mai... | ... | [2001112113, 2001112113] | 3.0 | 0.36 | False |


In [184]:
big_df = pd.concat([max_high, max_low], axis=0)
display_df(big_df, 2)

| | recipe_id | name | description | tags | ... | contributor_id | rating | tfidf | good |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 275030.0 | paula deen s caramel apple cheesecake | thank you paula deen! hubby just happened to ... | [60-minutes-or-less, time-to-make, course, pre... | ... | [666723, 666723, 666723, 666723, 666723, 66672... | 5.0 | 0.28 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3568 | 536843.0 | sheet pan turkey caprese meatballs with rosema... | description: try these turkey caprese meatball... | [60-minutes-or-less, time-to-make, course, mai... | ... | [2001112113, 2001112113] | 3.0 | 0.36 | False |


## Permutation Testing

In [185]:
observe = big_df.groupby('good')['tfidf'].mean().diff().abs().iloc[-1]
        
# build the null distribution by shuffling the high/low labels, so that tfidf does not depend on the group
n_repetitions = 1000
null = []
for _ in range(n_repetitions):
    with_shuffled = big_df.assign(shuffle = np.random.permutation(big_df['good']))
    difference = with_shuffled.groupby('shuffle')['tfidf'].mean().diff().abs().iloc[-1]
    null.append(difference)

In [186]:
fig = px.histogram(pd.DataFrame(null), x=0, histnorm='probability', title='Permutation Testing Using Max TF-IDF')
fig.add_vline(x=observe, line_color='red', line_width=1, opacity=1)

In [187]:
# p-value: fraction of simulated statistics at least as large as the observed one
(np.array(null) >= observe).mean()

0.0

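None of the 1,000 simulated statistics is as large as the observed one, so the estimated p-value is 0.0 (that is, below 1/1000). This suggests that the max TF-IDF of high-rated and low-rated recipes does not come from the same distribution. Thus, **we reject the null hypothesis.**
- Note: the result is different when the threshold for a high rating is 4.5 instead of 4.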
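
An estimated p-value of exactly 0.0 only means that no shuffle out of 1,000 reached the observed statistic. One common adjustment, which the notebook does not use, counts the observed labelling as one extra permutation so the estimate can never be exactly zero. The sketch below shows that variant on synthetic data; the function `permutation_p_value`, the seeds, and the generated values are all assumptions for illustration.

```python
import numpy as np

def permutation_p_value(values, labels, n_repetitions=1000, seed=0):
    """Permutation p-value for the absolute difference in group means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    def statistic(lab):
        return abs(values[lab].mean() - values[~lab].mean())

    observed = statistic(labels)
    # Shuffle the group labels to simulate the statistic under the null.
    null_stats = np.array([
        statistic(rng.permutation(labels)) for _ in range(n_repetitions)
    ])
    # Add-one adjustment: treat the observed labelling as one permutation,
    # so the smallest possible estimate is 1 / (n_repetitions + 1).
    return (np.sum(null_stats >= observed) + 1) / (n_repetitions + 1)

# Synthetic data only, to exercise the function (not the recipe data).
rng = np.random.default_rng(42)
group_a = rng.normal(0.40, 0.05, size=200)
group_b = rng.normal(0.45, 0.05, size=200)
values = np.concatenate([group_a, group_b])
labels = np.array([True] * 200 + [False] * 200)
p_value = permutation_p_value(values, labels)
```

With this variant, a result like the one above would be reported as roughly 0.001 rather than 0.0, which fits the advice to avoid absolute conclusions.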