# Prediction Time: How Good are these Recipes?

**Name(s)**: Tauhid Noor

**Website Link**: https://taunoor.github.io/Classification-model-for-recipes/

### Introduction:

##### Model type:
In this report, I build a model that predicts the rating (the response variable) of a recipe from other information in the dataset. Because ratings take discrete values (1-5), I treat this as a multiclass classification problem. I chose rating as the response variable because it is reasonable to view it as the dependent variable, with features such as calories, number of ingredients, and macronutrients acting as independent variables that determine how enjoyable a recipe is. Rating is also the variable that most naturally fits a classification model: it takes only five possible values, whereas most of the other columns are continuous and would be better suited to a regression model.

##### Variables:
For my classification model, with the `rating` column as the response variable, I will tentatively use the `calories`, `n_ingredients`, `minutes`, `total fat`, `carbohydrates`, `protein`, and `sugar` columns as features. The feature transformations will be chosen based on what improves the model's performance. I chose these columns because together they capture the healthiness, taste, convenience, and nourishment of a recipe, which plausibly determine whether a user enjoyed the food and rated it highly.

##### Evaluation method:
I will evaluate the model using the **mean accuracy**, that is, the proportion of labels that the model predicts correctly. For a classification model, this is more appropriate than **RMSE** or **R^2**, which are suited to regression models with continuous responses. I also decided against **Precision** and **Recall**: they are defined per class and are most natural for binary classification, so using them here would require choosing an averaging scheme across the five classes. Moreover, I want every prediction to be weighted equally; no rating from 1 to 5 is more important to predict correctly than another.
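
To make the metric concrete, here is a minimal sketch on made-up labels (not the recipe data). It computes accuracy by hand and with scikit-learn's `accuracy_score`, and shows macro-averaged recall as one example of the averaging a per-class metric would need in a five-class setting. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical true and predicted ratings, for illustration only
y_true = np.array([5, 5, 4, 5, 1, 3, 5, 4])
y_pred = np.array([5, 5, 5, 5, 5, 3, 5, 4])

# Accuracy: proportion of predictions that match the true label
manual_accuracy = (y_true == y_pred).mean()
sklearn_accuracy = accuracy_score(y_true, y_pred)

# Macro-averaged recall: recall per class, averaged with equal class weight
macro_recall = recall_score(y_true, y_pred, average='macro')

print(manual_accuracy, sklearn_accuracy, macro_recall)
```

Accuracy weights every prediction equally, while the macro average weights every class equally, so the two can differ noticeably when one rating dominates.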

## Code

In [104]:
import pandas as pd
import numpy as np
import os

import ast
from scipy import stats

import plotly.express as px
pd.options.plotting.backend = 'plotly'

from sklearn.preprocessing import Binarizer, FunctionTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

### Framing the Problem

The code for the data cleaning process was taken from Project 3 which dealt with the same dataset.

In [2]:
#read in the datasets in csv form
interactions_df = pd.read_csv('RAW_interactions.csv')
recipes_df = pd.read_csv('RAW_recipes.csv')

#merge the two dataframes using the recipe id as a way to align the data
df = interactions_df.merge(recipes_df,left_on='recipe_id',right_on='id')

In [3]:
rating_mean = df.groupby('recipe_id')['rating'].mean() #group the df by recipe and average the ratings (select the column first so non-numeric columns are not averaged)
rating_mean = rating_mean.reset_index() #this is for merging
rating_mean = rating_mean.rename(columns={"rating":'avg_rating'}) #rename the column
df = df.merge(rating_mean,on='recipe_id') #df with the average rating column

In [4]:
df = df.drop(columns=['id']) #dropped the id column

In [5]:
str_to_lst = df['nutrition'].apply(lambda x: ast.literal_eval(x)) #convert the string list to an actual list
df['nutrition'] = str_to_lst 

df['calories'] = df['nutrition'].apply(lambda x: x[0]) #separating the one column of lists into several ones w/ values
df['total fat'] = df['nutrition'].apply(lambda x: x[1])
df['sugar'] = df['nutrition'].apply(lambda x: x[2])
df['sodium'] = df['nutrition'].apply(lambda x: x[3])
df['protein'] = df['nutrition'].apply(lambda x: x[4])
df['saturated fat'] = df['nutrition'].apply(lambda x: x[5])
df['carbohydrates'] = df['nutrition'].apply(lambda x: x[6])
df = df.drop(columns=['nutrition'])
#df.head(5) #display the changes

In [6]:
reference_date = pd.to_datetime('2023-05-12') #fixed reference date used to compute ages
df['recipe age'] = (reference_date - pd.to_datetime(df['submitted'])).dt.days/365 #years since the recipe was submitted

df['interaction age'] = (reference_date - pd.to_datetime(df['date'])).dt.days/365 #years since the review was posted

df = df.drop(columns=['date','submitted'])

In [7]:
df.columns #make rating response. Initial 2 features: calories, n_ingredients. 
#Next 2 features(transformed): fat+protein+carb, mins/n_steps

Index(['user_id', 'recipe_id', 'rating', 'review', 'name', 'minutes',
       'contributor_id', 'tags', 'n_steps', 'steps', 'description',
       'ingredients', 'n_ingredients', 'avg_rating', 'calories', 'total fat',
       'sugar', 'sodium', 'protein', 'saturated fat', 'carbohydrates',
       'recipe age', 'interaction age'],
      dtype='object')

### Baseline Model

For the initial model, I will use Decision Tree Classification Model since it predicts discrete values like the `rating` column. I will choose `sugar`, `calories`, and `n_ingredients` as the features. All 3 columns are quantitative. I will leave the `sugar` column as itself. I transformed the `calories` column into **natural log** values so that the unusually large values and outliers don't distort the results of the prediction model. I transformed the `n_ingredients` to a **binary** format (1 and 0) to indicate high number of ingredients **(1)** and low number of ingredients **(0)**.
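
To make the two transformations concrete, here is a small sketch on made-up values (not rows from the dataset). It assumes the same threshold of 8 ingredients used in the baseline cell; the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer, FunctionTransformer

# Toy values, made up for illustration
toy = pd.DataFrame({'calories': [95.3, 658.2, 12000.0],
                    'n_ingredients': [4, 8, 14]})

# Natural log compresses the very large calorie value
log_calories = np.asarray(
    FunctionTransformer(np.log).fit_transform(toy[['calories']]))

# Binarizer maps values strictly greater than the threshold to 1, else 0
many_ingredients = np.asarray(
    Binarizer(threshold=8).fit_transform(toy[['n_ingredients']]))

print(log_calories.ravel())
print(many_ingredients.ravel())
```

Note that a recipe with exactly 8 ingredients falls in the low group, because `Binarizer` uses a strict greater-than comparison.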

In [110]:
final_df = df[df['rating']!=0].copy() #remove the ratings of 0 since those correspond to missing ratings; .copy() avoids SettingWithCopyWarning
final_df['calories'] = final_df['calories'].replace(0,0.01) #to avoid taking the log of 0
final_df = final_df[['rating','sugar','calories','n_ingredients',\
                     'n_steps','total fat','protein','minutes','carbohydrates']] #pick out the columns
final_df.head(5)


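| | rating | sugar | calories | n_ingredients | n_steps | total fat | protein | minutes | carbohydrates |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 50.0 | 95.3 | 8 | 4 | 1.0 | 5.0 | 40 | 7.0 |
| 1 | 5 | 25.0 | 143.5 | 10 | 9 | 5.0 | 10.0 | 30 | 7.0 |
| 2 | 5 | 50.0 | 182.4 | 14 | 14 | 2.0 | 11.0 | 22 | 13.0 |
| 3 | 5 | 50.0 | 182.4 | 14 | 14 | 2.0 | 11.0 | 22 | 13.0 |
| 4 | 4 | 151.0 | 658.2 | 12 | 7 | 45.0 | 24.0 | 40 | 29.0 |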


In [111]:
#specify the X and y (response variables) and split the data into train and test sets
X = final_df.drop('rating',axis=1)
y = final_df['rating']

train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=0.25,random_state=1)

In [112]:
#define the transformers for the columns
def passthrough(data):
    return data[['sugar']]

preproc = ColumnTransformer(
    transformers=[
        ('Calories',FunctionTransformer(np.log),['calories']),
        ('num_ingredients', Binarizer(threshold=8), ['n_ingredients']),
        ('sugar',FunctionTransformer(passthrough),['sugar'])
    ],
    remainder='drop') #drop the rest of the columns

pl = Pipeline([('preprocessor',preproc),('DTC',DecisionTreeClassifier(max_depth=6))]) #arbitrary max_depth value
pl.fit(train_X,train_y)
baseline_pred = pl.predict(test_X) #predict once and reuse the result
print('The mean accuracy of the Baseline Model is:',(baseline_pred==test_y).mean())

The mean accuracy of the Baseline Model is: 0.774052398402888


The performance of the baseline model can be considered sufficient for a starting point. Its accuracy is approximately 77.405%, which leaves room for improvement, but a baseline that labels recipes correctly more often than not is promising.
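
One way to put an accuracy figure like this in context is to compare it with a classifier that always predicts the most common rating. Below is a minimal sketch using scikit-learn's `DummyClassifier` on made-up, imbalanced labels (not the recipe data); `y_toy` and `X_toy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up, imbalanced ratings for illustration; not the recipe dataset
y_toy = np.array([5] * 7 + [4] * 2 + [1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

majority = DummyClassifier(strategy='most_frequent')
majority.fit(X_toy, y_toy)

# Accuracy obtained by always predicting the most frequent class
majority_accuracy = majority.score(X_toy, y_toy)
print(majority_accuracy)
```

On the recipe data, the analogous figure would come from fitting on `train_y` and scoring on `test_y`; a model is informative only to the extent that it beats that number.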

### Final Model

In the final model, I engineer two additional features from several columns: one from `total fat`, `protein`, and `carbohydrates`, and the other from `minutes` and `n_steps`. I also search for the best hyperparameters and choose the better-performing classifier between a Decision Tree Classifier and a Random Forest Classifier. I decide by comparing the training accuracy and testing accuracy of the two models at each max depth. For each model, I pick the max depth with the highest average of training and test accuracy, and then compare those two best averages to see which model scores higher.

It's important to pick the best sub-hyperparameter (max_depth) to avoid overfitting or underfitting, and then the best hyperparameter (decision tree/random forest) to get the best overall performance. 
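
As an alternative to averaging training and test accuracy, the depth could also be chosen by cross-validation on the training set alone, which keeps the test set untouched until the end. The sketch below uses a different technique from the one in this report: scikit-learn's `GridSearchCV` with 5-fold cross-validation, on synthetic data from `make_classification` rather than the recipe features. The names `X_demo`, `y_demo`, and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the recipe features are not used here
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

# Try max_depth values 1..15 and score each by 5-fold cross-validated accuracy
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': list(range(1, 16))},
    scoring='accuracy',
    cv=5,
)
search.fit(X_demo, y_demo)

best_depth = search.best_params_['max_depth']
print(best_depth, search.best_score_)
```

The same call accepts a `Pipeline` as the estimator, in which case the grid key would be prefixed with the step name (for example `DTC__max_depth`).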

In [113]:
#two new features 
def macronutrients(data):
    return pd.DataFrame(data['total fat'] + data['protein'] + data['carbohydrates'])

def effort(data):
    return pd.DataFrame(data['minutes']/data['n_steps'])

imp_preproc = ColumnTransformer(
    transformers=[
        ('Calories',FunctionTransformer(np.log),['calories']),
        ('num_ingredients', Binarizer(threshold=8), ['n_ingredients']),
        ('macronutrients', FunctionTransformer(macronutrients),['total fat','protein','carbohydrates']),
        ('effort',FunctionTransformer(effort),['minutes','n_steps'])
    ],
    remainder='passthrough')

In [114]:
#create a dataframe of training and test accuracy for decision tree
dict_val = {'max_depth':[],'train_accuracy':[],'test_accuracy':[]}

for i in range(1,16):
    pl = Pipeline([('Preprocessor',imp_preproc),('DTC',DecisionTreeClassifier(max_depth=i))])
    pl.fit(train_X,train_y)
    dict_val['max_depth'].append(i)
    dict_val['train_accuracy'].append((pl.predict(train_X)==train_y).mean())
    dict_val['test_accuracy'].append((pl.predict(test_X)==test_y).mean())
    
scores = pd.DataFrame(dict_val)
scores

| | max_depth | train_accuracy | test_accuracy |
|---|---|---|---|
| 0 | 1 | 0.773088 | 0.774289 |
| 1 | 2 | 0.773088 | 0.774289 |
| 2 | 3 | 0.773088 | 0.774289 |
| 3 | 4 | 0.7731 | 0.774216 |
| 4 | 5 | 0.773137 | 0.774308 |
| 5 | 6 | 0.773173 | 0.774362 |
| 6 | 7 | 0.773319 | 0.773979 |
| 7 | 8 | 0.773702 | 0.77367 |
| 8 | 9 | 0.774237 | 0.772977 |
| 9 | 10 | 0.775009 | 0.772339 |


In [98]:
#calculating the average of the two mean accuracies, to find the best max depth value
scores['mean_accuracy']=(scores[['train_accuracy','test_accuracy']].sum(axis=1)/2)
best_mean_ind = scores['mean_accuracy'].idxmax()
print(scores.loc[best_mean_ind])

max_depth         13.000000
train_accuracy     0.779196
test_accuracy      0.769367
mean_average       0.774281
mean_accuracy      0.774281
Name: 12, dtype: float64


After going through an iterative process to find the mean accuracies of the different max depths of our Decision Tree, we found the best max_depth to be 13, and the average of the training and testing accuracies to be approximately 77.428%. We will now do the same with the Random Forest Classifier algorithm and compare the two different models. 

In [165]:
dict_val2 = {'max_depth':[],'train_accuracy':[],'test_accuracy':[]}

for j in range(1,16):
    pl2 = RandomForestClassifier(max_depth=j) #fit directly on the raw columns of train_X; imp_preproc is not applied here
    pl2.fit(train_X,train_y)
    dict_val2['max_depth'].append(j)
    dict_val2['train_accuracy'].append((pl2.predict(train_X)==train_y).mean())
    dict_val2['test_accuracy'].append((pl2.predict(test_X)==test_y).mean())
    
scores2 = pd.DataFrame(dict_val2)
scores2

| | max_depth | train_accuracy | test_accuracy |
|---|---|---|---|
| 0 | 1 | 0.773088 | 0.774289 |
| 1 | 2 | 0.773088 | 0.774289 |
| 2 | 3 | 0.773088 | 0.774289 |
| 3 | 4 | 0.773088 | 0.774289 |
| 4 | 5 | 0.773088 | 0.774289 |
| 5 | 6 | 0.773088 | 0.774289 |
| 6 | 7 | 0.773088 | 0.774289 |
| 7 | 8 | 0.773088 | 0.774289 |
| 8 | 9 | 0.773094 | 0.774289 |
| 9 | 10 | 0.773131 | 0.774235 |


In [109]:
scores2['mean_accuracy']=(scores2[['train_accuracy','test_accuracy']].sum(axis=1)/2)
best_mean_ind = scores2['mean_accuracy'].idxmax()
print(scores2.loc[best_mean_ind])

max_depth         15.000000
train_accuracy     0.775951
test_accuracy      0.773688
mean_accuracy      0.774819
Name: 14, dtype: float64


After going through an iterative process to find the mean accuracies of the different max depths of our Random Forest, we found the best max depth to be 15, and the average of the training and testing accuracies to be approximately 77.482%. 

If we compare the two models, the Random Forest model has a marginally higher average of training and testing accuracy than the Decision Tree model: 0.774819 versus 0.774281, a difference of about 0.054 percentage points. While the difference is small, the Random Forest model is chosen as the final model on the basis of this comparison.

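The final model is a marginal improvement over the baseline model: its averaged accuracy of 0.774819 is about 0.077 percentage points higher than the baseline's test accuracy of 0.774052. Note that the Random Forest's test accuracy on its own (0.773688) is slightly below the baseline's, so the gain comes from the training side of the average. After adding two new reasonable features and searching over hyperparameters through an iterative process, this is the best result obtained.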

The new features were motivated as follows. The first adds the macronutrient values (fat, protein, and carbohydrates) of a recipe, on the assumption that the higher the total PDV of fat, protein, and carbohydrates, the more nourishing the recipe is, which should lead users to rate it higher. The second divides the number of minutes needed to make the recipe by the number of steps. This gives the average time spent on one step, which reflects the amount of detail and care that goes into preparing the recipe. The more care a recipe is prepared with, the better it should come out, so we should expect users to rate it higher. Because the Random Forest above was fit on the raw columns rather than through `imp_preproc`, the engineered features only entered the Decision Tree comparison, and any improvement can be attributed to them only tentatively.
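
A sketch of how the two engineered features could be routed into a Random Forest through a `Pipeline`, so that the forest actually sees them. It runs on randomly generated stand-in rows, not the recipe data; `toy_X`, `toy_y`, `features`, and `forest_pl` are hypothetical names, and the forest settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def macronutrients(data):
    # Sum of the fat, protein, and carbohydrate values
    return pd.DataFrame(data['total fat'] + data['protein'] + data['carbohydrates'])

def effort(data):
    # Average number of minutes per step
    return pd.DataFrame(data['minutes'] / data['n_steps'])

# Randomly generated stand-in rows; not the recipe dataset
rng = np.random.default_rng(0)
n = 60
toy_X = pd.DataFrame({
    'total fat': rng.uniform(0, 50, n),
    'protein': rng.uniform(0, 50, n),
    'carbohydrates': rng.uniform(0, 50, n),
    'minutes': rng.integers(5, 120, n),
    'n_steps': rng.integers(1, 20, n),
})
toy_y = rng.integers(1, 6, n)  # ratings 1-5

features = ColumnTransformer(
    transformers=[
        ('macronutrients', FunctionTransformer(macronutrients),
         ['total fat', 'protein', 'carbohydrates']),
        ('effort', FunctionTransformer(effort), ['minutes', 'n_steps']),
    ],
    remainder='drop')

# The forest is the last pipeline step, so it is trained on the engineered columns
forest_pl = Pipeline([
    ('preprocessor', features),
    ('RFC', RandomForestClassifier(max_depth=5, n_estimators=20, random_state=0)),
])
forest_pl.fit(toy_X, toy_y)
toy_pred = forest_pl.predict(toy_X)
print(toy_pred[:5])
```

With the notebook's objects, the same structure would use `imp_preproc` as the first step and be fit on `train_X` and `train_y`.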

### Fairness Analysis

After coming up with the final model, it's important to check if the model is biased in some way. I will check to see if there is a substantial difference in performance level of the model when feeding it data separated into two groups. I will use a **permutation test** to find the significance of our findings.

**Group 1:** The dataset with high number of ingredients (greater than 8)

**Group 2:** The dataset with low number of ingredients (less than or equal to 8)


**Null hypothesis:** The model is fair. There is no difference in mean accuracy between data with low no. of ingredients and high no. of ingredients.

**Alternative hypothesis:** The model is not fair. There is a difference in mean accuracy between data with low no. of ingredients and high no. of ingredients. 

**Evaluation metric:** Mean accuracy

**Test statistic:** Absolute difference in mean accuracy

**Significance level:** 1%


In [136]:
main_pl = RandomForestClassifier(max_depth=15)
main_pl.fit(train_X,train_y)

RandomForestClassifier(max_depth=15)

In [145]:
#pick the difference in mean accuracy as test statistic
#do a permutation test where you shuffle the high and low no. of ing
#separate the data into two groups
grouped_df = test_X.copy()

grouped_df['Group']=test_X['n_ingredients'].apply(lambda x: 1 if x>8 else 0)
grouped_df['y'] = test_y

#finding the observed statistic
one = grouped_df[grouped_df['Group']==1]
ind_vals = one.drop(columns=['Group','y'],axis=1)
response_val = one['y']

zero = grouped_df[grouped_df['Group']==0]
ind_vals2 = zero.drop(columns=['Group','y'],axis=1)
response_val2 = zero['y']

one_score = (main_pl.predict(ind_vals)==response_val).mean()
zero_score = (main_pl.predict(ind_vals2)==response_val2).mean()

obs = abs(one_score-zero_score)

sim_lst = []
for i in range(500):
    grouped_df['Group'] = np.random.permutation(grouped_df['Group'])
    
    one = grouped_df[grouped_df['Group']==1]
    ind_vals = one.drop(columns=['Group','y'],axis=1)
    response_val = one['y']

    zero = grouped_df[grouped_df['Group']==0]
    ind_vals2 = zero.drop(columns=['Group','y'],axis=1)
    response_val2 = zero['y']
    
    sim_one_score = (main_pl.predict(ind_vals)==response_val).mean()
    sim_zero_score = (main_pl.predict(ind_vals2)==response_val2).mean()
    
    sim = abs(sim_one_score-sim_zero_score)
    
    sim_lst.append(sim)
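
Because shuffling the group labels does not change the fitted model's predictions, the same test statistic can be simulated from a single precomputed vector of correct/incorrect flags, without calling `predict` inside the loop. A minimal sketch under that observation, using made-up flags and groups rather than the recipe data; `permutation_p_value` is a hypothetical helper name.

```python
import numpy as np

def permutation_p_value(correct, group, n_reps=500, seed=0):
    """Permutation test on the absolute difference in accuracy between two groups."""
    correct = np.asarray(correct, dtype=float)
    group = np.asarray(group)
    rng = np.random.default_rng(seed)

    def abs_diff(g):
        # Absolute difference in the proportion of correct predictions
        return abs(correct[g == 1].mean() - correct[g == 0].mean())

    observed = abs_diff(group)
    # Shuffle only the group labels; the correctness flags stay fixed
    sims = np.array([abs_diff(rng.permutation(group)) for _ in range(n_reps)])
    return observed, (sims >= observed).mean()

# Made-up example: 1 = correct prediction, 0 = incorrect
correct_toy = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1])
group_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

obs_toy, p_toy = permutation_p_value(correct_toy, group_toy)
print(obs_toy, p_toy)
```

With the notebook's objects, `correct` would be `main_pl.predict(test_X) == test_y` and `group` would be `test_X['n_ingredients'] > 8`.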

In [160]:
fig = px.histogram(pd.DataFrame(sim_lst), x=0, nbins=75, histnorm='probability', 
                   title='Empirical Distribution of the Absolute Mean Differences in Accuracy')
fig.add_vline(x=obs, line_color='red')
fig.update_layout(xaxis_range=[-0.02, 0.02])

In [168]:
fig.write_html('histplot.html', include_plotlyjs='cdn')

In [158]:
print('The p-value is:',(np.array(sim_lst)>=obs).mean())

The p-value is: 0.008


The p-value is 0.8%, which falls below the 1% significance level, so we reject the null hypothesis that the model's accuracy is the same for both groups. Rejecting the null hypothesis does not prove the alternative hypothesis; based on this permutation test, we can only say that the data suggest an association between the model's accuracy and the ingredient-count group.