# Framing Prediction Problem

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.express as px
pd.options.plotting.backend = 'plotly'

from utils.eda import *
from utils.dsc80_utils import *
from utils.graph import *
from utils.model import *

from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, Binarizer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA


***
# Problem Identification
***

**Analysis**:
Identify a prediction problem. Feel free to use one of the example prediction problems stated in the “Example Questions and Prediction Problems” section of your dataset’s description page or pose a prediction problem of your own. The prediction problem you come up with doesn’t have to be related to the question you were answering in Steps 1-4, but ideally, your entire project has some sort of coherent theme.

**Report**:
Clearly state your prediction problem and type (classification or regression). If you are building a classifier, make sure to state whether you are performing binary classification or multiclass classification. Report the response variable (i.e. the variable you are predicting) and why you chose it, the metric you are using to evaluate your model and why you chose it over other suitable metrics (e.g. accuracy vs. F1-score).

Note: Make sure to justify what information you would know at the “time of prediction” and to only train your model using those features. For instance, if we wanted to predict your final exam grade, we couldn’t use your Project 4 grade, because Project 4 is only due after the final exam! Feel free to ask questions if you’re not sure.
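
To make the accuracy versus F1-score choice concrete, here is a minimal sketch on made-up labels (the two lists are hypothetical and not taken from the recipe data). With imbalanced classes, a model that always predicts the majority class can score a high accuracy while its macro-averaged F1-score stays low.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical, imbalanced labels: eight 5-star ratings and two 1-star ratings.
y_true = [5, 5, 5, 5, 5, 5, 5, 5, 1, 1]
# A trivial "model" that always predicts the majority class.
y_pred = [5] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning for the class that is never predicted.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}, macro F1 = {macro_f1:.3f}")
```

Accuracy rewards the majority class here, while macro F1 averages the per-class scores and so penalizes the class that is never predicted.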

### Some Potential Ideas:
1. Sentiment analysis with the `review` column
2. Using the `recipe` column and feature engineering (length of `recipe`, TF-IDF, ...) to predict `ratings`
3. Using text data as an input to predict a user's rating and identify user preferences (a pre-step to a recommender system)

***
# Framing a Question (Some Ideas)
***

We know that the distribution of a recipe's mean TF-IDF differs between higher-rated and lower-rated recipes:
- We need `X` and a `y` -> find relationships! -> Supervised ML model
- We currently have the DataFrame grouped by recipe
- We want to predict `rating` as a classification problem
    - `rating` in recipe df: a quality of recipe
    - `rating` in user_id df: user preference ✅
- Features for user_id df:
    - `TF-IDF mean/max/sum/partial_mean` of `description` for **recipes per user_id** (may have more than one recipe) that have **high ratings**
        - This evaluates whether a word shows up more often in this **user's highly rated recipe descriptions** compared to all **recipe descriptions**, meaning that it is more important to this user.
    - `n_ingredients`
    - `n_steps`
    - `minutes`
    - `calories`
    - `sodium`
    - `previous_rating` (need to explore)
    - `word2vec` (need to explore, some info [here](https://towardsdatascience.com/word2vec-explained-49c52b4ccb71))
        - Each `user_id` has a pool of words in a **vector space** (from description, can have more)
        - We want to see how similar (by cosine distance) the recipe tags' `word2vec` vectors are to the pool
        - [good theory background](https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1)

- consider using `tags`, `review`, `steps`?

- Further: using preference to recommend recipes!

- `Voting`?

- [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)?
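
As a rough illustration of the `TF-IDF mean/max/sum` idea listed above, the sketch below computes one aggregate per description. The three descriptions are hypothetical stand-ins for the `description` column, and the column names `tfidf_mean`, `tfidf_max`, and `tfidf_sum` are chosen here for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical recipe descriptions standing in for the `description` column.
descriptions = pd.Series([
    "quick easy chicken dinner",
    "easy chocolate cake",
    "slow cooked spicy chicken curry",
])

vectorizer = TfidfVectorizer()
# Dense array of shape (n_descriptions, n_terms); fine for a toy example,
# but a large corpus would normally be kept sparse.
tfidf = vectorizer.fit_transform(descriptions).toarray()

# One row per description, one column per aggregate.
tfidf_features = pd.DataFrame({
    "tfidf_mean": tfidf.mean(axis=1),
    "tfidf_max": tfidf.max(axis=1),
    "tfidf_sum": tfidf.sum(axis=1),
})
print(tfidf_features)
```

Per-user features could then be obtained by grouping these rows by `user_id`, which this sketch leaves out.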

***
# Baseline Model
***

### Baseline Model
**Analysis**:
Train a “baseline model” for your prediction task that uses at least two features. (For this requirement, two features means selecting at least two columns from your original dataset that you should transform). You can leave numerical features as-is, but you’ll need to take care of categorical columns using an appropriate encoding. Implement all steps (feature transforms and model training) in a single sklearn Pipeline.

Note: Both now and in Step 7: Final Model, make sure to evaluate your model’s ability to generalize to unseen data!

There is no “required” performance metric that your baseline model needs to achieve.

**Report**:
Describe your model and state the features in your model, including how many are quantitative, ordinal, and nominal, and how you performed any necessary encodings. Report the performance of your model and whether or not you believe your current model is “good” and why.

Tip: Make sure to hit all of the points above: many projects in the past have lost points for not doing so.

## Predictive Question
We want to predict `rating` as a classification problem: predicting `rating` (5 categories) in the user_id DataFrame to demonstrate an understanding of user preference.
- **Using the original big DataFrame for predicting rating**

## Feature Engineering
Remember to take care of the missing data

- `n_ingredients`
- `n_steps`
- `minutes`
- `calories`
- `sodium`

- `tfidf_mean` of `description` for **recipes per user_id** (may have more than one recipe)
    - The `TFIDF` of a word measures whether the word shows up more often in this **user's recipe descriptions** compared to all **recipe descriptions**, meaning that it is more important to this user.
    - The `TFIDF mean` of a `description` for a `recipe` represents the importance of that text within the whole set of descriptions
- `word2vec` Similarity
    - All good `recipe`s (rating above 3) can form a pool of words in a **vector space** (from description, can have more)
    - We want to see how similar (by cosine distance) each recipe description's `word2vec` vector is to the good pool of vectors
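
The similarity feature described above can be sketched with plain NumPy. This is only an illustration: the 3-dimensional vectors are hypothetical placeholders for real `word2vec` embeddings, the helper name `cosine_similarity_to_pool` is made up, and summarizing the good pool by its centroid is one assumption among several possible choices. Cosine distance is simply one minus the similarity computed here.

```python
import numpy as np

def cosine_similarity_to_pool(vectors, pool):
    """Cosine similarity between each row of `vectors` and the centroid of `pool`."""
    centroid = pool.mean(axis=0)
    dot = vectors @ centroid
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    return dot / norms

# Hypothetical embeddings of highly rated descriptions (the "good pool").
good_pool = np.array([[1.0, 0.0, 0.0],
                      [1.0, 1.0, 0.0]])
# Hypothetical embeddings of two recipes to score.
candidates = np.array([[2.0, 1.0, 0.0],   # parallel to the pool centroid
                       [0.0, 0.0, 3.0]])  # orthogonal to the pool centroid

sims = cosine_similarity_to_pool(candidates, good_pool)
print(sims)
```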

## Ensemble Learning (Bagging, Stacking, Boosting)
Heterogeneous Ensemble Voting:
1. Homogeneous Ensemble `Random Forest`
2. Model2...
3. Model3...
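
A heterogeneous voting ensemble along these lines could look like the sketch below. It runs on synthetic data from `make_classification`, and pairing a random forest with a logistic regression under soft voting is an assumption made for illustration, not the final model choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features would come from the recipe DataFrame.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        # A random forest is itself a homogeneous (bagged) ensemble of trees.
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=500)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_tr, y_tr)
test_accuracy = ensemble.score(X_te, y_te)
print(f"held-out accuracy: {test_accuracy:.3f}")
```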

***
# Pipeline Creation
***

All of the function transformations can actually be done here

In [6]:
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')
step0 = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id', indicator=True)
base_df = (step0
           .pipe(initial)
           .pipe(transform_df)
           .pipe(outlier)
           )[['n_ingredients','minutes','n_steps','description','sugar','calories','sodium','rating','tags']]

## Handling Missing Data
It has been shown earlier that the missingness of the `rating` column appears to be **NMAR**: it does not depend on the other columns but on the unobserved rating values themselves. Thus, we will impute the ratings through **random imputation**, drawing replacements from the observed ratings.
- Consider this a bit more

In [7]:
def prob_impute(s):
    s = s.copy()
    num_null = s.isna().sum()
    fill_values = np.random.choice(s.dropna(), num_null)
    s[s.isna()] = fill_values
    return s

base_df['rating'] = prob_impute(base_df['rating'])
base_df = base_df.dropna()
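
Because `prob_impute` draws from NumPy's global random state, the imputed ratings change between runs. A seeded variant, sketched below, would make the step reproducible; the name `prob_impute_seeded`, the `seed` parameter, and the toy Series are hypothetical.

```python
import numpy as np
import pandas as pd

def prob_impute_seeded(s, seed=0):
    """Fill NaNs with random draws from the observed values (seeded sketch)."""
    rng = np.random.default_rng(seed)
    s = s.copy()
    missing = s.isna()
    # One draw per missing entry, sampled from the observed values,
    # so the filled values follow the observed distribution.
    s[missing] = rng.choice(s.dropna().to_numpy(), size=int(missing.sum()))
    return s

# Hypothetical ratings with two missing entries.
ratings = pd.Series([5.0, np.nan, 4.0, 5.0, np.nan, 1.0])
filled = prob_impute_seeded(ratings, seed=42)
print(filled)
```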

In [8]:
base_df = base_df.assign(is_low = base_df['rating']<=3)

In [9]:
base_df.isna().sum().sum()

0

## Train Test Split

In [10]:
X = base_df.drop('rating', axis=1)
y = base_df['rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [11]:
y_train.isna().sum() + X_train.isna().sum().sum()

0

## Transformation Functions

In [12]:
# def make_tfidf_max(df):
#     from sklearn.feature_extraction.text import TfidfVectorizer

#     lst = df['description'].explode().astype(str).values
#     count = TfidfVectorizer()
#     count.fit(lst)

#     tfidf = pd.DataFrame(count.transform(lst).toarray(),
#                             columns=count.get_feature_names_out())

#     return df.reset_index().assign(max = tfidf.max(axis=1)).groupby('index').sum()


def detect_key_low(df):
    '''Find each description's highest-TF-IDF word and flag whether it is also a top word of a low-rated description'''

    def key_largest(row):
        # word (column label) with the largest TF-IDF score in this row
        return row.index[row.argmax()]

    def make_tfidf(series):
        lst = series.explode().astype(str).values # this may be slow
        vectorizer = TfidfVectorizer()
        vectorizer.fit(lst)
        return pd.DataFrame(vectorizer.transform(lst).toarray(), columns=vectorizer.get_feature_names_out())

    tfidf_low = make_tfidf(df[df['is_low']==True]['description'])
    tfidf_base = make_tfidf(df['description'])

    keyword_all = tfidf_base.apply(key_largest, axis=1) #argmax a bit faster
    keyword_low = tfidf_low.apply(key_largest, axis=1)
    pool_low = keyword_low.unique()

    # True where the description's top word is in the low-rated pool
    in_low = keyword_all.isin(pool_low)

    return pd.DataFrame(in_low)


def tag_counts(df):
    '''number of tags per recipe'''
    return pd.DataFrame(df['tags'].apply(len).rename('counts'))


def tag_ohe_pca(df):
    '''One-hot encode the tags, then reduce the encoding to 50 dimensions with PCA'''
    # flatten the per-recipe tag lists into one list (an efficient alternative to explode)
    all_tags = [tag for tags in df['tags'].tolist() for tag in tags]
    count = CountVectorizer()
    count.fit(all_tags)  # only the fitted vocabulary is needed

    vocab = np.array(list(count.vocabulary_.keys()))

    def helper_function(tags, vocab):
        # boolean indicator per vocabulary entry: is it one of this recipe's tags?
        return np.array([word in tags for word in vocab])

    # helper_function(X_train["tags"].iloc[0], vocab)

    a = df["tags"].apply(lambda x: helper_function(x, vocab))

    # stack the per-row arrays into a single 2D array
    df_pca = pd.DataFrame(data=np.stack(a.to_numpy()), columns=vocab)

    # conduct PCA to reduce to just 50 dimensions
    pca = PCA(n_components=50)
    reduced = pca.fit_transform(df_pca)

    return reduced
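
For comparison, the same "encode tags, then reduce with PCA" idea can be sketched with scikit-learn's `MultiLabelBinarizer`, which treats each whole tag as one category. This is a swapped-in technique rather than what `tag_ohe_pca` does, the tag lists are hypothetical, and only 2 components are kept because the toy data has 4 rows.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists standing in for the `tags` column.
tags = pd.Series([
    ["30-minutes-or-less", "desserts", "easy"],
    ["main-dish", "easy"],
    ["desserts", "chocolate"],
    ["main-dish", "chicken", "30-minutes-or-less"],
])

mlb = MultiLabelBinarizer()
tag_matrix = mlb.fit_transform(tags)  # shape (n_recipes, n_unique_tags), entries 0/1

pca = PCA(n_components=2)  # the notebook uses 50 on the full data
reduced = pca.fit_transform(tag_matrix)
print(mlb.classes_)
print(reduced.shape)
```

Since `FunctionTransformer` is stateless, `tag_ohe_pca` refits its vocabulary and PCA on whatever rows it receives. Keeping fitted `mlb` and `pca` objects and calling only `transform` on new rows would keep the columns aligned between training and test data.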

Test `make_tfidf_max`

In [13]:
# make_tfidf_max(X_train[['description']]).isna().sum() # no nan here

In [14]:
# FunctionTransformer(make_tfidf_max).fit_transform(X_train[['description']]).isna().sum() # no nan here

Test `tag_counts`

In [15]:
# tag_counts(X_train[['tags']])

Test `tag_ohe_pca`

In [16]:
# X_train["tags"].apply(lambda x: x==np.array(list(count.vocabulary_.keys())))

In [17]:
# tag_ohe_pca(X_train[['tags']])

In [18]:
# len(pca_result)
# len(y)

Test `key_ohe`/`detect_key_low`

In [19]:
# is_low = detect_key_low(X_train[['is_low','description']])

In [None]:
# is_low.isna().sum()

0    0
dtype: int64

## Transformation and Models

In [61]:
norm_relative = Pipeline([
    ('bi_nsteps',Binarizer(threshold=25)),
    ('norm_minutes_binary_nsteps', FunctionTransformer(lambda x: StdScalerByGroup().fit(x).transform(x))),
])

key_ohe = Pipeline([
    ('tfidf',FunctionTransformer(detect_key_low)),
    ('key_ohe', OneHotEncoder(drop='first'))
])

preproc_rf = ColumnTransformer(
    transformers=[
        ('tfidf_key_ohe', key_ohe, ['is_low','description']),
        ('bi_nsteps', Binarizer(threshold=25),['n_steps']),
        ('bi_ningredients', Binarizer(threshold=20),['n_ingredients']),
        ('norm_minutes_binary_nsteps',norm_relative,['n_steps','minutes']),
        ('norm_minutes_binary_ningredients',norm_relative,['n_ingredients','minutes']),
        ('tag_counts',FunctionTransformer(tag_counts),['tags']),
    ],
    remainder='drop'
)

preproc_lg = ColumnTransformer(
    transformers=[
        ('tag_pca',FunctionTransformer(tag_ohe_pca),['tags']),
    ],
    remainder='drop'
)

pl_rf = Pipeline([
    ('preprocessor', preproc_rf),
    ('rfc', RandomForestClassifier(max_depth=10,
                                   n_estimators=100,
                                   criterion='entropy',
                                   min_samples_split=2,
                                   ))
])

pl_lr = Pipeline([
    ('preprocessor', preproc_lg),
    ('lr',LogisticRegression(max_iter=500,
                             multi_class='multinomial'))
])

voter = StackingClassifier(estimators=[('rfc', pl_rf), ('lr', pl_lr)])

In [62]:
pl_rf.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('tfidf_key_ohe',
                                                  Pipeline(steps=[('tfidf',
                                                                   FunctionTransformer(func=<function detect_key_low at 0x110b32af0>)),
                                                                  ('key_ohe',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['is_low', 'description']),
                                                 ('bi_nsteps',
                                                  Binarizer(threshold=25),
                                                  ['n_steps']),
                                                 ('bi_ningredients',
                                                  Binarizer(threshold=20),
                                                  ['n_ingredients'])...
               

In [46]:
(pl_rf.predict(X_train) == y_train).mean()

0.7675459012154124

In [50]:
# %time
# hyperparameters = {
# 'rfc__max_depth': np.arange(2, 20, 2),
# 'rfc__min_samples_split': [2, 5, 10, 20],
# 'rfc__criterion': ['gini', 'entropy'],
# 'rfc__n_estimators': np.arange(100, 150, 10),
# }
# grids = GridSearchCV(pl_rf,
#                      n_jobs=-1,
#                      param_grid=hyperparameters,
#                      return_train_score=False,
#                      cv=5
#                      )

# grids.fit(X_train, y_train)
# grids.best_estimator_

In [74]:
rand_sample = X_train.assign(rating=y_train).dropna().sample(1000)
pl_rf.score(rand_sample.drop(columns='rating'), rand_sample['rating'])

0.737
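
Both scores above are computed on rows the model was trained on, so they say little about generalization. The sketch below shows the train versus held-out comparison on synthetic data; the dataset and variable names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the recipe features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_classes=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

model = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=1)
model.fit(X_tr, y_tr)

train_accuracy = model.score(X_tr, y_tr)  # optimistic: the model has seen these rows
test_accuracy = model.score(X_te, y_te)   # estimate of performance on unseen rows
print(f"train = {train_accuracy:.3f}, test = {test_accuracy:.3f}")
```

In this notebook the analogous call would be `pl_rf.score(X_test, y_test)`, using the split created earlier.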

## $K$-fold Test Check With Training data

In [None]:
# data_test = X_train.assign(rating=y_train)
# data_test["k_fold"] = np.random.choice(list(range(5)),size = len(data_test))

# total_train = []
# total_test = []

# for n in range(20):
#     train_result = []
#     test_result = []

#     for i in range(5):
#         data_test["k_fold"] = np.random.choice(list(range(5)),size = len(data_test))

#         test_data = data_test[data_test["k_fold"]!=i].drop(columns=["k_fold"])

#         train_score = accuracy_score(voter.predict(data_test[data_test["k_fold"]!=i].drop(columns=["rating","k_fold"])),
#                                      data_test[data_test["k_fold"]!=i]["rating"])
        
#         test_score = accuracy_score(voter.predict(data_test[data_test["k_fold"]==i].drop(columns=["rating","k_fold"])),
#                                data_test[data_test["k_fold"]==i]["rating"])
        
#         test_result.append(test_score)
#         train_result.append(train_score)
    
#     total_test.append(sum(test_result)/5)
#     total_train.append(sum(train_result)/5)

# print(f'Training: {sum([i > 0.75 for i in total_train])/20}')   
# px.histogram(total_train).show()
# print(f'Testing: {sum([i > 0.75 for i in total_test])/20}') 
# px.histogram(total_test).show()
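
The commented-out loop scores an already-fitted model on random partitions without refitting it per fold. A more conventional check, sketched below on synthetic data, lets `cross_val_score` refit the model on each training fold and score it on the matching validation fold. The random forest and the 5-fold stratified split are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

model = RandomForestClassifier(max_depth=10, n_estimators=50, random_state=0)
# Stratification keeps the class proportions similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The model is cloned and refitted for every fold.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"fold accuracies: {scores.round(3)}, mean: {scores.mean():.3f}")
```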