# COGS 108 - Data Checkpoint

## Authors

- Jared Wang: TBD at Checkpoint
- Dylan Dsouza: TBD at Checkpoint
- Christian Kumagai: TBD at Checkpoint
- Kyle Zhao: TBD at Checkpoint

## Research Question

Do dessert recipes that use honey as the primary sweetener differ in caloric content, sugar content, and total fat (normalized to FDA daily values) compared to desserts that use sugar as the primary sweetener?

As a control, we will restrict the sample to recipes carrying the “desserts” tag and classify each one as honey-based or sugar-based from its ingredient list, keeping only recipes that contain a single primary sweetener. We then define the nutritional outcomes as total calories, sugar (% daily value), and total fat (% daily value), all extracted from the nutrition column. To compare the groups, we intend to use independent-samples t-tests on the mean nutritional values of honey and sugar desserts. We also intend to conduct secondary analyses using ANOVAs; we have not yet finalized the details and will likely do so once our EDA is complete.
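The filtering and classification steps described above can be sketched as follows. This is a minimal sketch rather than a final implementation: the function name `classify_sweetener`, the keyword tuples, and the assumption that tags and ingredients arrive as lists of strings are all illustrative choices that would need tuning against the real data.

```python
from typing import Optional

# Hypothetical keyword lists; the real lists will need tuning against the data.
HONEY_TERMS = ("honey",)
SUGAR_TERMS = ("sugar",)
# Assumed examples of other sweeteners that break the "single primary sweetener" rule.
OTHER_SWEETENERS = ("maple syrup", "agave", "corn syrup", "molasses", "stevia")


def _mentions(ingredients, terms):
    """Return True if any ingredient string contains any of the given terms."""
    return any(term in ingredient.lower() for ingredient in ingredients for term in terms)


def classify_sweetener(tags, ingredients) -> Optional[str]:
    """Label a recipe 'honey' or 'sugar', or None if it should be excluded."""
    if "desserts" not in tags:
        return None
    has_honey = _mentions(ingredients, HONEY_TERMS)
    has_sugar = _mentions(ingredients, SUGAR_TERMS)
    has_other = _mentions(ingredients, OTHER_SWEETENERS)
    # Exclude recipes with another sweetener, with both sweeteners, or with neither.
    if has_other or has_honey == has_sugar:
        return None
    return "honey" if has_honey else "sugar"


print(classify_sweetener(["desserts"], ["honey", "flour"]))        # honey
print(classify_sweetener(["desserts"], ["brown sugar", "butter"]))  # sugar
print(classify_sweetener(["desserts"], ["honey", "sugar"]))        # None
```

Substring matching is deliberately simple and has known blind spots; for example, it would treat "honeydew melon" as honey, so word-level matching may be needed in practice.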

## Background and Prior Work

Desserts are often viewed as indulgent foods, but many people attempt to make them healthier by swapping refined sugar with alternative sweeteners such as honey. Honey is commonly perceived as more natural or wholesome, and this perception frequently appears in online dessert recipes. Our project investigates whether this substitution actually leads to differences in nutritional outcomes, or whether the perceived health benefits are not reflected in the final recipe nutrition.

This analysis is possible because of the availability of large-scale recipe datasets that include ingredient lists and structured nutrition information. In particular, Majumder et al. (2019) introduced and analyzed a processed Food.com dataset containing over 180,000 recipes and 700,000 user interactions, demonstrating that Food.com recipes can be reliably used for large-scale computational analysis and modeling.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Their work motivates our use of this dataset, as it shows that recipe ingredients and nutrition fields are sufficiently structured for systematic filtering and analysis.

From a nutritional standpoint, honey and sugar are both primarily sources of carbohydrates, but they differ in caloric density. According to USDA FoodData Central, a tablespoon of honey contains more calories than a tablespoon of granulated sugar, meaning that replacing sugar with honey does not inherently reduce calorie content and may increase it if recipe quantities are not adjusted.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) In practice, honey-based desserts are also often paired with additional ingredients such as butter, oils, or nuts, which may further increase fat and total calorie content.

To make nutritional comparisons more interpretable, we focus on calories and normalized measures such as sugar (% Daily Value) and total fat (% Daily Value). %DV is based on FDA nutrition labeling standards and provides a consistent benchmark for comparing nutrient levels across recipes with different serving sizes or formulations.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Using these measures allows us to compare honey-based and sugar-based desserts on a common scale.
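As a worked illustration of how %DV puts nutrients on a common scale, the sketch below converts grams per serving into %DV. The reference amounts (78 g for total fat and 50 g for added sugars, based on a 2,000-calorie diet) are the current FDA label values; whether Food.com computed its percent daily value fields with these exact references is an assumption we have not verified, and the function name `percent_daily_value` is our own.

```python
# FDA Daily Values in grams (2,000-calorie diet). Assumption: Food.com's PDV
# fields may have been computed with different (e.g., older) reference values.
FDA_DAILY_VALUE_G = {"total_fat": 78.0, "added_sugars": 50.0}


def percent_daily_value(grams: float, nutrient: str) -> float:
    """Convert grams per serving into percent of the FDA Daily Value."""
    return 100.0 * grams / FDA_DAILY_VALUE_G[nutrient]


print(percent_daily_value(39.0, "total_fat"))     # 50.0
print(percent_daily_value(25.0, "added_sugars"))  # 50.0
```

Note that the FDA defines a Daily Value for added sugars rather than total sugars, which is one more reason to treat the dataset's sugar field with care.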

While prior work has used Food.com data for recipe modeling and personalization, there has been limited analysis specifically comparing honey-based and sugar-based desserts under strict filtering conditions (dessert tag and a single primary sweetener). Our project builds on existing work by applying clear inclusion criteria and standard statistical comparisons to evaluate whether these two classes of desserts differ meaningfully in their nutritional composition.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Majumder, B. P., Li, S., Ni, J., & McAuley, J. (2019). *Generating Personalized Recipes from Historical User Preferences*. University of California, San Diego.
2. <a name="cite_note-2"></a> [^](#cite_ref-2) USDA FoodData Central. Nutrition entries for honey and granulated sugar. https://fdc.nal.usda.gov
3. <a name="cite_note-3"></a> [^](#cite_ref-3) U.S. Food and Drug Administration (FDA). Daily Value and nutrition labeling reference. https://www.fda.gov

Based on feedback on the project proposal submission, ...

tbd

tbd

tbd

tbd

## Hypothesis


We hypothesize that desserts using honey as the primary sweetener will have higher overall caloric content and slightly higher total fat (normalized to FDA daily values) than sugar-sweetened desserts, while sugar (% daily value) may be comparable. Although honey is often perceived as a healthier alternative, it contains more calories per tablespoon than sugar (about 64 kcal for 1 tablespoon of honey, compared with about 48 kcal for 1 tablespoon of sugar), and recipes claiming to be 'healthy' may not reduce quantities enough to offset this difference. Additionally, honey-based desserts often include complementary fat sources (e.g., butter, oils, nuts), which may contribute to comparable or higher fat content.

**Calories**

* Null Hypothesis (H₀): μ_honey ≤ μ_sugar

* Alternate Hypothesis (H₁): μ_honey > μ_sugar

**Total fat (% DV)**

* Null Hypothesis (H₀): μ_honey ≤ μ_sugar

* Alternate Hypothesis (H₁): μ_honey > μ_sugar

**Sugar (% DV)**

* Null Hypothesis (H₀): μ_honey = μ_sugar

* Alternate Hypothesis (H₁): μ_honey ≠ μ_sugar
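The three comparisons above can be sketched with Welch's t statistic, a variant of the independent-samples t-test that does not assume equal variances. Choosing Welch's version is an assumption made for this sketch, not a decision stated above, and the sample values are made up purely for illustration. The sketch uses only the standard library, so it stops at the statistic and the degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up calorie values, for illustration only.
honey_calories = [310.0, 280.0, 450.0, 390.0, 330.0]
sugar_calories = [250.0, 300.0, 270.0, 320.0, 260.0]
t_stat, dof = welch_t(honey_calories, sugar_calories)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

To obtain p-values, `scipy.stats.ttest_ind(honey, sugar, equal_var=False, alternative="greater")` would match the one-sided calorie and total fat hypotheses, while `alternative="two-sided"` would match the sugar hypothesis.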

## Data

### Data overview


```python
# Run this code every time when you're actively developing modules in .py files.
# It's not needed if you aren't making modules.
#
# This code is necessary for making sure that any modules we load are updated here
# when their source code .py files are modified.

%load_ext autoreload
%autoreload 2
```

```python
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules')  # this tells python where to look for modules to import

import get_data  # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries;
# each dict has keys of
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored
datafiles = [
    {'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename': 'airline-safety.csv'},
    {'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename': 'bad-drivers.csv'},
]

get_data.get_raw(datafiles, destination_directory='data/00-raw/')
```

```
Overall Download Progress:   0%|          | 0/2 [00:00<?, ?it/s]
Successfully downloaded: airline-safety.csv
Overall Download Progress: 100%|██████████| 2/2 [00:00<00:00,  7.54it/s]
Successfully downloaded: bad-drivers.csv
```





### Food.com Recipe Dataset

This dataset contains recipes from Food.com. Of its many columns, the nutrition column is the most relevant to our research question. Each entry is a list of seven floats in the following order: calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV), where PDV stands for percent daily value.

The most important metrics from the nutrition column are:

- **Calories**: measured in kilocalories per serving; represents total energy content. Very high values (above 800 kcal) indicate a highly energy-dense dessert serving.
- **Total fat**: measured in percent daily value. A value of 100% DV means one serving meets the entire recommended daily intake for fat.
- **Sugar**: measured in percent daily value; indicates how much of the recommended daily sugar intake one serving provides. Values near or above 100% suggest extremely high sugar content.

Two additional columns are also necessary. The tags column holds a list of strings, which lets us filter the recipes down to desserts. The ingredients column holds a list of strings, which lets us identify recipes whose primary sweetener is either honey or sugar.

We have three main concerns about this dataset.

First, the primary sweetener has to be inferred from the ingredients list, because the dataset does not explicitly label honey or sugar as a primary ingredient. Some recipes include both honey and sugar, or use alternative sweeteners alongside them. Misclassification could blur the differences between groups.

Second, serving sizes vary. Nutritional values are reported per serving, but serving sizes are defined by the individual authors and are not standardized across recipes. As a result, differences in calories or % Daily Value may partly reflect differences in portion definitions rather than true differences driven by the type of sweetener.

Third, the nutrition values may have been estimated by the authors rather than tested and measured, which could introduce measurement error and additional variability.
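The nutrition field described above can be unpacked into named values along these lines. This sketch assumes each entry arrives as a string such as `"[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]"` (a made-up example), as is common when list columns are read from a CSV file; the field names and the `parse_nutrition` helper are our own labels, not names from the dataset.

```python
import ast

# Order follows the nutrition column as described: calories, then six PDV fields.
NUTRITION_FIELDS = (
    "calories", "total_fat_pdv", "sugar_pdv", "sodium_pdv",
    "protein_pdv", "saturated_fat_pdv", "carbohydrates_pdv",
)


def parse_nutrition(raw):
    """Turn a nutrition entry (string or list) into a dict of named floats."""
    values = ast.literal_eval(raw) if isinstance(raw, str) else raw
    if len(values) != len(NUTRITION_FIELDS):
        raise ValueError(f"expected {len(NUTRITION_FIELDS)} values, got {len(values)}")
    return dict(zip(NUTRITION_FIELDS, map(float, values)))


example = parse_nutrition("[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]")
print(example["calories"], example["sugar_pdv"])  # 51.5 13.0
```

With pandas, something like `df["nutrition"].apply(parse_nutrition).apply(pd.Series)` would expand the parsed values into one column per field.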


```python
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
```
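One way to flag suspicious entries, such as the very high calorie values mentioned above, is the 1.5 × IQR rule. This is a sketch under our own assumptions: the choice of the IQR rule, the helper name `flag_outliers`, and the sample values are illustrative, and flagged recipes would still need manual review before being dropped.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return a list of booleans marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Made-up calories per serving; the last value mimics a likely data-entry error.
calories = [180.0, 220.0, 250.0, 300.0, 310.0, 340.0, 400.0, 5200.0]
flags = flag_outliers(calories)
print([c for c, is_out in zip(calories, flags) if is_out])  # [5200.0]
```

Returning a boolean mask instead of removing values keeps the decision about what to do with outliers separate from the detection step, which makes the cleaning choices easier to document.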


## Ethics 

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

    > The dataset reflects recipes uploaded by users on Food.com (formerly GeniusKitchen) and cleaned by researchers at UC San Diego. Although there is a vast range of dietary preferences stored as tags, we acknowledge that the data may skew towards certain cultures and demographics, inhibiting its ability to generalize to other use cases.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

    > We will only use recipe-level data (ingredients, tags, nutrition). We will not analyze usernames, reviews, or any identifiable user metadata. The analysis focuses on aggregate nutritional properties rather than individuals. Although the data does include user IDs, we will be getting rid of these during the wrangling process.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

    > The dataset is publicly available on Kaggle and does not contain sensitive personal data. It will be stored locally and we intend to maintain any change history on GitHub, specifically for coursework.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

    > The dataset is publicly accessible and local copies will likely be retained only for the duration of the course project. We might upload aggregated data but this will be clearly marked and identifiable as different from the raw data.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

    > We intend to rely on each user's interpretation of features like 'healthy' and 'desserts', as expressed through the tags they assign to an uploaded recipe, and we acknowledge that this may be a biased perspective.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

    > We will examine class balance (honey vs sugar desserts), ingredient labeling consistency (checking for membership instead of absolute comparison), and missing nutrition values. Filtering decisions (e.g., defining a “primary sweetener”) may introduce bias, so we will document them thoroughly and keep them reproducible for downstream analysis.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

    > All summary statistics and visualizations will be reported with clear notes on assumptions, uncertainty, and limitations. Our intention is to represent the data faithfully and in the way that is most intuitive to understand.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

    > We will not analyze or display any user information which may be used to personally identify such individuals. We aim to keep all results at the recipe level.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

    > Because we are completing all relevant work through GitHub and the corresponding Jupyter Notebooks, we believe that our results and discoveries will be easily accessible to anyone who wants to review our work in the future. We value transparency, which we hope will be visible through GitHub and its commit history.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

    > Although we selected calories, sugar and fat as our operational definition for healthiness, we understand that there are many other factors that can be taken into account in regards to physical health and diet. 

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

    > Our analytical methods will rely primarily on descriptive statistics, visualization, and classical statistical tests, which allow us to clearly explain differences in outcomes. We will also document our filtering decisions, statistical assumptions, and analysis steps to ensure transparency and interpretability. 

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?


    > When communicating our results, we aim to be clear that these statistics are based solely on the specific recipes we analyzed. In doing so, we aim to avoid the perception that we are giving advice about personal diets.

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

    > We would like to avoid instances in which our conclusions are interpreted as health recommendations. As such, we aim to present our results from a statistical standpoint, rather than as conclusions that should serve as guidelines for others.



## Team Expectations 

* We will communicate virtually over Discord. We expect replies within six hours, which gives us ample time to update our project based on each other's comments and suggestions. We expect to meet before each submission, with standard check-ins handled through messages.
* We will aim for effective communication and will not be afraid to be blunt when needed, as long as the feedback is constructive and helps us reach our goals. We will remain polite and respectful to the other members while doing so.
* Our team's decisions will be made largely by majority vote, with consideration given to the scale of the decision. We do not believe it is necessary to notify others before making commits, since we can rely on the git history; however, if commits lead to conflicting ideas, we will discuss them in the group chat and decide by majority. If a teammate is unresponsive, we will continue to move forward without them and make adjustments upon their return as needed.
* We do not plan to assign a leader. Instead, we will divide tasks, working in smaller groups or individually as we see fit. Tasks will be assigned by preference, with workload taken into account.
* See the Project Timeline Proposal below.
* If someone is struggling to keep up, we expect them to communicate this earlier than the day before a deadline. The remaining members will then discuss how to redistribute the workload. If this continues to happen, a meeting may be set up to discuss how to fix the issue.
* We can include this in the Discord server as a pinned message, as well as in this GitHub repo.

## Project Timeline Proposal

| Week  | Deliverable | Work Expectations |
|---|---|---|
| Week 7  | Project Checkpoint 1  | Clean dataset - Wrangle data - Subset into aggregations of interest - Submit checkpoint |
| Week 8  | - | Conduct rudimentary EDA - Ideate insightful visualizations - Narrow scope of investigation |
| Week 9  | Project Checkpoint 2 | Finalize graphs and charts - Include statistical tests - Finalize EDA and inference |
| Week 10  | - | Record missing project components - Improve for cohesion - Consider research extensions |
| Finals Week  | Final Project | Add final touches - Incorporate feedback and suggestions - Turn in submission |