In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Your Title Here

**Name(s)**: Ananya Krishnan, John Wesley Pabalate

**Website Link**: (your website link)

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

#from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

## Step 2: Data Cleaning and Exploratory Data Analysis

In [3]:
#JohnWesley's Directory
#recipes = pd.read_csv('/Users/johnwesleypabalate/Desktop/dsc80-2025-wi/projects/project04/data/RAW_recipes.csv')

#Ananya's Directory
# recipes = pd.read_csv('/Users/ananyakrishnan/Downloads/DSC80/projects/project04/data/RAW_recipes.csv')
# recipes.head()

#Colab Directory
recipes = pd.read_csv('/content/RAW_recipes.csv')
recipes.head()

FileNotFoundError: [Errno 2] No such file or directory: '/content/RAW_recipes.csv'

In [None]:
recipes.shape

In [None]:
#JohnWesley's Directory
#reviews = pd.read_csv('/Users/johnwesleypabalate/Desktop/dsc80-2025-wi/projects/project04/data/RAW_interactions.csv')

#Ananya's Directory
# reviews = pd.read_csv('../DSC80/projects/project04/data/RAW_interactions.csv')
# reviews.head()

#Colab Directory
reviews = pd.read_csv('/content/interactions.csv')
reviews.head()

In [None]:
reviews.shape

Let us now merge recipes and ratings into one comprehensive dataset.

In [None]:
recipe_ratings = recipes.merge(reviews, left_on = 'id', right_on = 'recipe_id', how="left")
recipe_ratings.head()
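As a quick illustration of what the left merge does, here is a minimal sketch on toy data. The frames `toy_recipes` and `toy_reviews` and their values are hypothetical and only mirror the key columns used above.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not from the dataset).
toy_recipes = pd.DataFrame({"id": [1, 2, 3], "name": ["soup", "cake", "salad"]})
toy_reviews = pd.DataFrame({"recipe_id": [1, 1, 2], "rating": [5, 4, 0]})

# A left merge keeps every recipe: recipes with several reviews are repeated,
# and recipes without any review get NaN in the review columns.
toy_merged = toy_recipes.merge(
    toy_reviews, left_on="id", right_on="recipe_id", how="left"
)
print(toy_merged)
```

Recipe 3 has no reviews, so it survives the merge with a missing `rating`; an inner merge would have dropped it.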

In [None]:
recipe_ratings.shape

In [None]:
#Missingness Analysis Purposes
merged_df = recipe_ratings.copy()
merged_df

Let us replace all 0s in the `rating` column with NaN. A 0 means that no rating was given, so keeping it would pull down any average we compute from the ratings. We also calculate the average rating for each recipe, store it in `avg_recipe_rating`, and then add it to `recipe_ratings` as the column `avg_rating`.

In [None]:
recipe_ratings.loc[recipe_ratings['rating'] == 0, 'rating'] = np.nan
avg_recipe_rating = recipe_ratings.groupby('recipe_id')['rating'].mean()

In [None]:
recipe_ratings = recipe_ratings.merge(avg_recipe_rating.reset_index().rename(columns={'rating': 'avg_rating'}), on = 'recipe_id')
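The effect of the 0-to-NaN replacement can be seen in a small sketch. The data below is hypothetical, and `transform("mean")` is used here as an alternative to the groupby-then-merge step above, not as the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: recipe 1 has one real rating (5) and one "no rating" (0).
toy = pd.DataFrame({"recipe_id": [1, 1, 2, 2], "rating": [5, 0, 4, 2]})

# Treating 0 as a real rating drags the mean of recipe 1 down to 2.5.
mean_with_zeros = toy.groupby("recipe_id")["rating"].mean()

# After replacing 0 with NaN, mean() skips the missing value.
toy_clean = toy.assign(rating=toy["rating"].replace(0, np.nan))
mean_without_zeros = toy_clean.groupby("recipe_id")["rating"].mean()

# transform("mean") broadcasts each recipe's mean back to its rows in one step.
toy_clean["avg_rating"] = toy_clean.groupby("recipe_id")["rating"].transform("mean")
print(toy_clean)
```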

There are many columns not relevant to our question, so we will retain only the columns related to recipe id, nutrition information and ratings.

In [None]:
recipe_ratings = recipe_ratings[['id', 'rating', 'avg_rating', 'nutrition']]
recipe_ratings = recipe_ratings.rename(columns = {'id': 'recipe_id'})

Let us now look at the columns we have and clean them up one by one.

In [None]:
recipe_ratings.dtypes

We observe that `nutrition` actually contains strings formatted to look like lists, so let us convert it to real lists.

In [None]:
recipe_ratings['nutrition'] = recipe_ratings['nutrition'].str.strip('[').str.strip(']').str.replace("'", "").str.split(', ')

The `nutrition` column now contains lists of values. Let us separate each value into its respective category - `'calories'`, `'total_fat'`, `'sugar'`, `'sodium'`, `'protein'`, `'saturated_fat'` and `'carbohydrates'`. We can then drop the `nutrition` column.

In [None]:
categories = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']
recipe_ratings = recipe_ratings.assign(
    **{category: pd.to_numeric(recipe_ratings['nutrition'].str[i], errors='coerce') for i, category in enumerate(categories)}
)
recipe_ratings = recipe_ratings.drop(columns = ['nutrition'])
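For reference, the same parsing can be sketched with `ast.literal_eval`, which turns the list-like string into a real Python list of numbers in one step. This is an alternative to the string operations above, shown on a single hypothetical row; it assumes every `nutrition` entry is a well-formed list literal with seven values.

```python
import ast
import pandas as pd

categories = ["calories", "total_fat", "sugar", "sodium",
              "protein", "saturated_fat", "carbohydrates"]

# One hypothetical row formatted like the raw `nutrition` column.
toy = pd.DataFrame({"nutrition": ["[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]"]})

# literal_eval safely evaluates the string as a Python literal (a list of floats).
parsed = toy["nutrition"].apply(ast.literal_eval)

# Expand each list into one numeric column per nutrition category.
nutrition_cols = pd.DataFrame(parsed.tolist(), columns=categories, index=toy.index)
print(nutrition_cols)
```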

Let us now look at our cleaned dataset:

In [None]:
recipe_ratings.isna().sum(axis = 0)

In [None]:
recipe_ratings.describe()

The nutritional values have abnormally high maximum values despite reasonable means. Let us look at the rows with high protein or high carbohydrate values.

In [None]:
recipe_ratings[(recipe_ratings['protein'] > 200) | (recipe_ratings['carbohydrates'] > 200)]

These rows have proportionally high calories, meaning this is unlikely to be an error and could just be because of large portion sizes. We can leave it as it is.

# Ratio based

We will define high carbohydrate and low protein recipes as those in the top 25% of carb-to-protein ratios, that is, above the 75th percentile. The carb-to-protein ratio provides a single measure that captures how carbohydrate-heavy a recipe is relative to its protein content.

By taking the top 25% of the carb-to-protein ratio, we focus on recipes where carbohydrates are dominant relative to protein, regardless of total calories or fat content. Some recipes have 0 protein, which would mean dividing by zero, so we replace those 0s with 0.1 before computing the ratio.

In [None]:
ratio_recipe_ratings = recipe_ratings.copy()
ratio_recipe_ratings['ratio_carb_protein'] = ratio_recipe_ratings['carbohydrates'] / ratio_recipe_ratings['protein'].replace(0, 0.1)
ratio_recipe_ratings['ratio_carb_protein'] = ratio_recipe_ratings['ratio_carb_protein'].replace([np.inf, -np.inf], np.nan)

ratio_recipe_ratings['high_carb_protein_ratio']= (ratio_recipe_ratings['ratio_carb_protein'] > ratio_recipe_ratings['ratio_carb_protein'].quantile(0.75))
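A minimal sketch of the ratio and the cutoff on hypothetical values shows how the zero-protein replacement and the 75th-percentile flag interact. The column names `ratio` and `high` are illustrative only.

```python
import pandas as pd

# Hypothetical nutrition values; the second recipe has 0 protein.
toy = pd.DataFrame({
    "carbohydrates": [10.0, 30.0, 5.0, 8.0],
    "protein": [5.0, 0.0, 10.0, 2.0],
})

# Replace 0 protein with 0.1 so the division is defined (it yields a large ratio).
safe_protein = toy["protein"].replace(0, 0.1)
toy["ratio"] = toy["carbohydrates"] / safe_protein

# Flag recipes strictly above the 75th percentile of the ratio.
cutoff = toy["ratio"].quantile(0.75)
toy["high"] = toy["ratio"] > cutoff
print(toy)
```

Note that the zero-protein recipe ends up with by far the largest ratio, so the choice of the replacement value (0.1 here) directly affects which recipes land in the top quarter.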

### Univariate Analysis

### Bivariate Analysis

## Step 3: Assessment of Missingness

In [None]:
# TODO

## Step 4: Hypothesis Testing

Our goal is to see if carbohydrate and protein content affect ratings of recipes. We define high carb-to-protein ratios as those in the top 25% (above the 75th percentile).

**Null Hypothesis (H₀):** Recipes with a high carb-to-protein ratio receive the same ratings as other recipes.  

**Alternative Hypothesis (Hₐ):** Recipes with a high carb-to-protein ratio receive significantly different ratings.

**Test statistic:** Mean difference in ratings between the high-carb, low-protein group and all other recipes.

**Significance level:** 0.05

In [None]:
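# Observed statistic: mean avg_rating of the high-ratio group minus that of all other recipes
observed_diff = ratio_recipe_ratings.groupby("high_carb_protein_ratio")["avg_rating"].mean().diff().iloc[-1]

def permute_ratings(df):
    # Shuffle by position (NumPy array) so the result does not depend on the
    # DataFrame's index, and work on a copy so the caller's data is not modified
    shuffled = df[["high_carb_protein_ratio"]].assign(
        shuffled_rating=np.random.permutation(df["avg_rating"].to_numpy())
    )
    return shuffled.groupby("high_carb_protein_ratio")["shuffled_rating"].mean().diff().iloc[-1]

perm_diffs = [permute_ratings(ratio_recipe_ratings) for _ in range(1000)]

# One-sided p-value: share of permuted differences at least as large as the observed one
p_value = np.mean(np.array(perm_diffs) >= observed_diff)
print("P-value:", p_value)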

Since the p-value is greater than the significance level 0.05, we fail to reject the null.
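Because the alternative hypothesis is stated as "different" rather than "higher", a two-sided version of the permutation test is sketched below for comparison. This is an illustration under stated assumptions, not the computation used above: the helper `permutation_p_value`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def permutation_p_value(df, group_col, value_col, n_reps=1000, seed=0):
    """Two-sided permutation test for a difference in group means (sketch)."""
    rng = np.random.default_rng(seed)
    values = df[value_col].to_numpy()
    groups = df[group_col].to_numpy(dtype=bool)

    def mean_diff(vals):
        # Mean of the flagged group minus mean of the rest, ignoring NaN.
        return np.nanmean(vals[groups]) - np.nanmean(vals[~groups])

    observed = mean_diff(values)
    perm = np.array([mean_diff(rng.permutation(values)) for _ in range(n_reps)])
    # Two-sided: count permuted differences at least as extreme in either direction.
    return observed, np.mean(np.abs(perm) >= abs(observed))

# Hypothetical data with a clear difference between the two groups.
toy = pd.DataFrame({
    "flag": [True] * 20 + [False] * 20,
    "avg_rating": [5.0] * 20 + [3.0] * 20,
})
obs, p = permutation_p_value(toy, "flag", "avg_rating")
print("Observed difference:", obs, "P-value:", p)
```

Comparing absolute values means a large negative difference counts as evidence against the null just as much as a large positive one.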

# Quantile based

We want to account for the percentage of calories that protein and carbohydrates each contribute. We will define high-carbohydrate, low-protein recipes using fixed quantile cutoffs: recipes in the top 25% of carbohydrate share of calories and in the bottom 25% of protein share of calories. Requiring both conditions selects recipes that are high in carbohydrates and low in protein relative to their total calories, rather than recipes that merely have a high carb-to-protein ratio.

To do this, we first need to convert protein and carbohydrate to calories. Each gram of carbohydrate or protein contains 4 calories.

In [None]:
carb_calories = recipe_ratings['carbohydrates'] * 4  # 4 calories per gram of carbs
protein_calories = recipe_ratings['protein'] * 4  # 4 calories per gram of protein

recipe_ratings['carb_prop'] = carb_calories / recipe_ratings['calories']
recipe_ratings['protein_prop'] = protein_calories / recipe_ratings['calories']

In [None]:
carb_threshold = recipe_ratings['carb_prop'].quantile(0.75)
protein_threshold = recipe_ratings['protein_prop'].quantile(0.25)

recipe_ratings['high_carb_low_protein'] = (
    (recipe_ratings['carb_prop'] >= carb_threshold) &
    (recipe_ratings['protein_prop'] <= protein_threshold)
)
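The calorie-share calculation and the two cutoffs can be traced on a small example. The values below are hypothetical and, like the text above, assume that carbohydrates and protein are measured in grams at 4 calories per gram.

```python
import pandas as pd

# Hypothetical recipes (calories, grams of carbohydrates, grams of protein).
toy = pd.DataFrame({
    "calories": [400.0, 200.0, 300.0, 500.0],
    "carbohydrates": [80.0, 10.0, 30.0, 50.0],
    "protein": [5.0, 30.0, 15.0, 25.0],
})

# Share of total calories coming from each macronutrient (4 calories per gram).
toy["carb_prop"] = toy["carbohydrates"] * 4 / toy["calories"]
toy["protein_prop"] = toy["protein"] * 4 / toy["calories"]

# A recipe is flagged only if it is in the top quarter for carbohydrate share
# AND in the bottom quarter for protein share.
carb_cut = toy["carb_prop"].quantile(0.75)
protein_cut = toy["protein_prop"].quantile(0.25)
toy["high_carb_low_protein"] = (
    (toy["carb_prop"] >= carb_cut) & (toy["protein_prop"] <= protein_cut)
)
print(toy)
```

Because both conditions must hold, the flagged group is at most a quarter of the data and is usually much smaller.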

In [None]:
recipe_ratings.describe()

We can see that `protein_prop` has a max value of 1.88, which is not possible for a proportion and indicates an error. We will drop all such rows.

In [None]:
recipe_ratings = recipe_ratings[(recipe_ratings['protein_prop'] <= 1)]

Now we are ready to visualize our features.

### Univariate Analysis

In [None]:
# import plotly.io as pio
# pio.renderers.default = "browser"  # IGNORE THIS FOR NOW

In [None]:
#Distribution of Ratings
px.histogram(recipe_ratings, x = 'avg_rating', nbins = 10, title = 'Distribution of Average Recipe Ratings').show()
px.histogram(recipe_ratings, x = 'rating', nbins = 10, title = 'Distribution of Recipe Ratings').show()

The distribution of average ratings is **highly skewed to the left**. This suggests that most recipes receive **high ratings**, making it important to analyze which rating values appear most frequently and how they relate to other factors such as the nutritional content of the recipe.

Most ratings left by people tend to be 5 stars.

Now let's look at the distribution of the carbohydrate and protein content of recipes.

In [None]:
recipe_ratings['protein'].describe()

In [None]:
recipe_ratings['carbohydrates'].describe()

There appear to be very high values of carbohydrates and protein (over 3000), which seem unrealistic.

In [None]:
fig_protein = px.box(recipe_ratings, x='protein', title='Boxplot of Protein Content')
fig_protein.show()

fig_carbs = px.box(recipe_ratings, x='carbohydrates', title='Boxplot of Carbohydrates Content')
fig_carbs.show()

fig_ratio = px.box(ratio_recipe_ratings, x='ratio_carb_protein', title='Boxplot of Carbohydrate-to-Protein Ratio')
fig_ratio.show()

In [None]:
# Percentage of rows with protein or carbohydrate content over 200
recipe_ratings[(recipe_ratings['protein'] > 200) | (recipe_ratings['carbohydrates'] > 200)].shape[0] / recipe_ratings.shape[0] * 100

Less than 0.3% of the data has either protein or carbohydrate content over 200g, so we can leave the outliers as they are, since they are unlikely to affect our analyses.

### Bivariate Analysis

We use scatter plots to compare the carbohydrate-to-protein ratio, carbohydrate content, and protein content with average ratings.

In [None]:
px.scatter(ratio_recipe_ratings, x="avg_rating", y="ratio_carb_protein",
                 title="Ratio of Carbohydrates and Protein vs. Ratings").show()


In [None]:
px.scatter(recipe_ratings, x="avg_rating", y="carbohydrates",
                 title="Carbohydrates vs. Ratings").show()


In [None]:
px.scatter(recipe_ratings, x="avg_rating", y="protein",
                 title="Protein vs. Ratings").show()

## Step 3: Assessment of Missingness

In [None]:
recipe_ratings.isnull().sum()

In [None]:
merged_df.isnull().sum()

In [None]:
# Define missingness indicator for "review"
merged_df["review_missing"] = merged_df["review"].isnull()

# Number of repetitions
n_repetitions = 500
shuffled = merged_df.copy()

tvds = []
for _ in range(n_repetitions):

    # Shuffling the missingness indicator for 'review'
    shuffled["review_missing"] = np.random.permutation(shuffled["review_missing"])

    # Computing and storing the TVD
    pivoted = (
        shuffled
        .pivot_table(index="rating", columns="review_missing", aggfunc="size")
    )

    pivoted = pivoted / pivoted.sum()

    tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
    tvds.append(tvd)

# Compute observed TVD
observed_pivot = merged_df.pivot_table(index="rating", columns="review_missing", aggfunc="size")
observed_pivot = observed_pivot / observed_pivot.sum()
observed_tvd = observed_pivot.diff(axis=1).iloc[:, -1].abs().sum() / 2

# Create a histogram of TVD values
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=50, histnorm="probability",
                   title="Empirical Distribution of the TVD")

# Add observed TVD as a red vertical line
fig.add_vline(x=observed_tvd, line_color="red", line_width=2, opacity=1)

# Add annotation for observed TVD
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd, 2)}</span>',
                   x=2.5 * observed_tvd, showarrow=False, y=0.16)

# Adjust axis range
fig.update_layout(yaxis_range=[0, 0.2])

# Show the plot
fig.show()


In [None]:
np.mean(np.array(tvds) >= observed_tvd)

We **fail to reject the null**.
This test does not provide evidence that the missingness of the 'review' column depends on 'rating', since the p-value of 0.69 is greater than 0.05.
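To make the test statistic concrete, here is a minimal sketch of the total variation distance (TVD) between two categorical distributions. It uses `pd.crosstab` with `normalize="columns"` instead of the `pivot_table` call above; `crosstab` fills category and group combinations that never occur with 0, so they still contribute to the sum. The helper name `tvd_between_groups` and the toy data are hypothetical.

```python
import pandas as pd

def tvd_between_groups(df, category_col, group_col):
    """TVD between the category distributions of two groups (sketch)."""
    # Each column is one group's distribution over the categories (sums to 1).
    dists = pd.crosstab(df[category_col], df[group_col], normalize="columns")
    # TVD = half the sum of absolute differences between the two distributions.
    return (dists.iloc[:, 0] - dists.iloc[:, 1]).abs().sum() / 2

# Hypothetical data: ratings for rows with and without a missing review.
toy = pd.DataFrame({
    "rating": [5, 5, 4, 4, 5, 3],
    "review_missing": [False, False, False, False, True, True],
})
toy_tvd = tvd_between_groups(toy, "rating", "review_missing")
print("TVD:", toy_tvd)
```

A TVD of 0 means the two distributions are identical, and a TVD of 1 means they share no categories at all.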

In [None]:
# Define missingness indicator for "review"
merged_df["review_missing"] = merged_df["review"].isnull()

# Number of repetitions
n_repetitions = 500
shuffled = merged_df.copy()

tvds = []
for _ in range(n_repetitions):

    # Shuffling the missingness indicator for 'review'
    shuffled["review_missing"] = np.random.permutation(shuffled["review_missing"])

    # Computing and storing the TVD
    pivoted = (
        shuffled
        .pivot_table(index="n_steps", columns="review_missing", aggfunc="size")
    )

    pivoted = pivoted / pivoted.sum()

    tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
    tvds.append(tvd)

# Compute observed TVD
observed_pivot = merged_df.pivot_table(index="n_steps", columns="review_missing", aggfunc="size")
observed_pivot = observed_pivot / observed_pivot.sum()
observed_tvd = observed_pivot.diff(axis=1).iloc[:, -1].abs().sum() / 2

# Compute p-value
p_value = np.mean(np.array(tvds) >= observed_tvd)

# Create a histogram of TVD values
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=50, histnorm="probability",
                   title="Empirical Distribution of the TVD for n_steps")

# Add observed TVD as a red vertical line
fig.add_vline(x=observed_tvd, line_color="red", line_width=2, opacity=1)

# Add annotation for observed TVD
fig.add_annotation(text=f'<span style="color:red">Observed TVD = {round(observed_tvd, 2)}</span>',
                   x=2.5 * observed_tvd, showarrow=False, y=0.16)

# Adjust axis range
fig.update_layout(yaxis_range=[0, 0.2])

# Show the plot
fig.show()



We **fail to reject the null**.
This test does not provide evidence that the missingness of the 'review' column depends on 'n_steps', since 0.69 > 0.05.

In [None]:
np.mean(np.array(tvds) >= observed_tvd)


## Step 4: Hypothesis Testing

Our goal is to see if carbohydrate and protein content affect ratings of recipes. We define high-carb, low-protein recipes as those that fall into both:

- The top 25% for the proportion of calories from carbohydrates (at or above the 75th percentile)
- The bottom 25% for the proportion of calories from protein (at or below the 25th percentile)

**Null Hypothesis (H₀):** Recipes with high carb % and low protein % receive the same ratings as other recipes.  

**Alternative Hypothesis (Hₐ):** Recipes with high carb % and low protein % receive significantly different ratings.

**Test statistic:** Mean difference in ratings between the high-carb, low-protein group and others.

**Significance level:** 0.05

In [None]:
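# Observed statistic: mean avg_rating of the high-carb, low-protein group minus that of the others
observed_diff = recipe_ratings.groupby("high_carb_low_protein")["avg_rating"].mean().diff().iloc[-1]

def permute_ratings(df):
    # Shuffle by position (NumPy array): recipe_ratings was filtered earlier, so its
    # index has gaps and an index-aligned assignment would misplace the shuffled values
    shuffled = df[["high_carb_low_protein"]].assign(
        shuffled_rating=np.random.permutation(df["avg_rating"].to_numpy())
    )
    return shuffled.groupby("high_carb_low_protein")["shuffled_rating"].mean().diff().iloc[-1]

perm_diffs = [permute_ratings(recipe_ratings) for _ in range(1000)]

# One-sided p-value: share of permuted differences at least as large as the observed one
p_value = np.mean(np.array(perm_diffs) >= observed_diff)
print("P-value:", p_value)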

Since the p-value is less than the significance level 0.05, we reject the null.

## Step 5: Framing a Prediction Problem

In [None]:
# TODO

## Step 6: Baseline Model

In [None]:
# TODO

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO